[mira_talk] Re: change stringency in Mira mapping assembly

Hi Bastien,

Thanks for your reply! I have a few more things that are bothering me:

On 12/12/2011 10:24 PM, Bastien Chevreux wrote:
On Dec 12, 2011, at 16:09 , Christoph Hahn wrote:
After your step 2 (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS 
-CO:fnicpst=yes" derjavinoidesmtlane8_out.maf iteration1) I get a whole bunch of 
files:

iteration1_AllStrains.padded.fasta
[...]
iteration1_default.padded.fasta
[...]
iteration1_derjavinoidesmtlane8.padded.fasta
[...]
iteration1_derjavinoides_mt.padded.fasta
[...]
I was continuing with the iteration1_derjavinoidesmtlane8.padded.fasta, 
although I am not sure if it would maybe be safer to continue with the unpadded 
file.
Take the unpadded version. From the manual:

fasta contains the contig consensus sequences (and .fasta.qual the consensus 
qualities). Please note that they come in two flavours: padded and unpadded. 
The padded versions may contains stars (*) denoting gap base positions where 
there was some minor evidence for additional bases, but not strong enough to be 
considered as a real base. Unpadded versions have these gaps removed.
Ok. will do!
What confuses me are all the other files. What are they? THey are obviously all 
some variants of the reference..
If you map reads from different strain to a reference, what should MIRA give you as consensus? See? 
Not that easy. So MIRA takes the broad approach: one consensus per strain, plus one consensus for 
reads without strain info ("default") and one consensus for all strains together 
("AllStrains")

I ran my initial assembly like so:
mira --project=derjavinoidesmtlane8 --job=mapping,genome,accurate,solexa -MI:somrnl=0 -AS:nop=1 -SB:bsn=derjavinoides_mt:bft=gbf:bbq=30 SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=120:tismax=200 -SB:dsn=derjavinoidesmtlane8

I still don t quite get the issue with the strains for my particular case - I am not trying to map reads from different strains. I am not giving any prior information about strains via a straindata file or anything like that. So, if I am correct, with "-SB:dsn=derjavinoidesmtlane8 " I specify the default strain name for all reads.

The iteration1_default* files just contain @ - that seems logical to me as there is no reads without strain information used. Thats that. The iteration_derjavinoidesmtlane8* files pretty much contain what I was expecting. Stretches of rather conserved sequence parts where I get a sequence and then bits of @ in between where the reference is too divergent. What I am not clear about is why I get iteration1_derjavinoides_mt* files. derjavinoides_mt was just the strain name I specified for the backbone, right?. The sequence in the derjavinoides_mt* files is identical to the reference with the exception that there are some 50 @ added before the actual start of the reference sequence and some 10 @ added after the end of the reference sequence. What is that? The iteration1_Allstrains* file seems to contain a consensus sequence, as expected.
-) the file with the trimmed reads that I obtained from the first mapping 
attempt with Mira (mynewl8data.fastq) as well as the file I get from mirabait 
(mymtreads_iteration1.fastq) both start with the reference and are then 
followed by several sequences (header e.g. @rr_####50####) before the actual 
reads. Apparently these @rr_#### sequences are all part of the reference.. what 
exactly is it?
Uhhhh ... you seeing these ###-reads tells me something went wrong. Somewhere. 
Where exactly did you get the original file from? Additional question: did the 
assembly run over several passes? If yes, why?
The original file is a standard fastq file. I actually trimmed the reads in the file already quite thouroughly before the initial mapping attempt with mira. Then I was wanting to obtained the even further clipped reads from Mira after the first mapping assembly as you suggested with (convert_project -f maf -t FASTQ -C readpool.maf mynewl8data). The obtained fastq file already contains this rails then. Also all following fastq files do. The assembly only ran over one pass. See the stdout of the convert_project below. I don`t know what`s the problem.. Do you have any suggestions what I could try to get rid of this rails? Do you think it is affecting the results from the following steps? I don`t get any errors concerning this and I looked at the results from Mirabait and the following mapping approaches and they look ok to me.

Loading from maf, saving to: fastq
First counting reads:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%] tcmalloc: large alloc 9261940736 bytes == 0x1c170000 @

Now loading and processing data:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%]

Data conversion process finished, no obvious errors encountered.
SC Read:: read name issorted (1) capacity 4294967295(4) size 1
SC Read:: scf path name issorted (1) capacity 255(1) size 1
SC Read:: exp path name issorted (1) capacity 255(1) size 1
SC Read:: machine type issorted (1) capacity 255(1) size 1
SC Read:: primer issorted (1) capacity 255(1) size 1
SC Read:: strain issorted (1) capacity 255(1) size 3
SC Read:: base caller issorted (1) capacity 255(1) size 1
SC Read:: dye issorted (1) capacity 255(1) size 1
SC Read:: process status issorted (1) capacity 255(1) size 1
SC Read:: clone vector name issorted (1) capacity 65535(2) size 1
SC Read:: sequencing vector name issorted (1) capacity 65535(2) size 1
SC asped issorted (1) capacity 4294967295(4) size 1

To answer your question: the rr_### reads are "rails", helper reads used by 
MIRA during the assembly. They are not present in the final results, only in intermediate 
files.

-) Also I tried to use mirabait to identify reads that map onto the the 
sequence of the host organism, but unfortunately it seems as if the reference 
sequences are too long. Is there a way of dealing with this, apart from cutting 
the reference in smaller bits? This is the error message I get:
"Read gi|354459049|gb|AGKD01000001.1| is 194200 bp long and thus longer than 
MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."
Oooooooops! This is something which should not happen. Definitively a bug. I'll 
have a look at it as I am currently working on this part pf MIRA. In the mean 
time: sorry, you need to fragment :-/
I am fragmenting the reference now.

Much obliged!
cheers,
Christoph



--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: