[mira_talk] Re: change stringency in Mira mapping assembly

On Dec 12, 2011, at 16:09 , Christoph Hahn wrote:
> After your step 2 (convert_project -f maf -t fasta -A "SOLEXA_SETTINGS 
> -CO:fnicpst=yes" derjavinoidesmtlane8_out.maf iteration1) I get a whole bunch 
> of files:
> 
> iteration1_AllStrains.padded.fasta
> [...]
> iteration1_default.padded.fasta
> [...]
> iteration1_derjavinoidesmtlane8.padded.fasta
> [...]
> iteration1_derjavinoides_mt.padded.fasta
> [...]
> I was continuing with the iteration1_derjavinoidesmtlane8.padded.fasta, 
> although I am not sure if it would maybe be safer to continue with the 
> unpadded file.

Take the unpadded version. From the manual:

fasta contains the contig consensus sequences (and .fasta.qual the consensus 
qualities). Please note that they come in two flavours: padded and unpadded. 
The padded versions may contain stars (*) denoting gap base positions where 
there was some minor evidence for additional bases, but not strong enough to be 
considered as a real base. Unpadded versions have these gaps removed.
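
To make the difference between the two flavours concrete, here is a minimal Python sketch that strips the pad characters from a padded consensus. The helper names (unpad, unpad_fasta) are made up for illustration and are not part of MIRA. In practice, prefer the unpadded files MIRA already writes: a matching .qual file would need the same positions removed, which this sketch does not do.

```python
def unpad(sequence: str) -> str:
    """Remove gap pads ('*') from a padded consensus sequence."""
    return sequence.replace("*", "")


def unpad_fasta(lines):
    """Yield FASTA lines with pads stripped from sequence lines only."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header lines pass through unchanged
        else:
            yield unpad(line)


# Tiny made-up example record, not real MIRA output.
padded = [">contig1", "ACGT*ACG", "T*GA"]
print("\n".join(unpad_fasta(padded)))
```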

> What confuses me are all the other files. What are they? They are obviously 
> all some variants of the reference..

If you map reads from different strains to a reference, what should MIRA give 
you as consensus? See? Not that easy. So MIRA takes the broad approach: one 
consensus per strain, plus one consensus for reads without strain info 
("default") and one consensus for all strains together ("AllStrains").

> -) the file with the trimmed reads that I obtained from the first mapping 
> attempt with Mira (mynewl8data.fastq) as well as the file I get from mirabait 
> (mymtreads_iteration1.fastq) both start with the reference and are then 
> followed by several sequences (header e.g. @rr_####50####) before the actual 
> reads. Apparently these @rr_#### sequences are all part of the reference.. 
> what exactly is it?

Uhhhh ... you seeing these ###-reads tells me something went wrong. Somewhere. 
Where exactly did you get the original file from? Additional question: did the 
assembly run over several passes? If yes, why?

To answer your question: the rr_### reads are "rails", helper reads used by 
MIRA during the assembly. They are not present in the final results, only in 
intermediate files.
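
If you want to inspect such a file without the rails in the meantime, a small filter like the following could be used. This is only a sketch under assumptions: it expects plain 4-line FASTQ records, it assumes rail names start with "rr_" (as in the header quoted above), and drop_rails is a made-up helper, not a MIRA tool. It does not remove the reference sequence at the top of the file, and it does not fix whatever went wrong upstream.

```python
def drop_rails(lines, prefix="rr_"):
    """Yield 4-line FASTQ records whose read name does not start with prefix.

    Assumes strict 4-line records (header, sequence, '+', qualities).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:]  # strip the leading '@'
            if not name.startswith(prefix):
                yield record
            record = []


# Made-up example: one rail record followed by one ordinary read.
example = [
    "@rr_####50####", "ACGT", "+", "IIII",
    "@read1", "TTGA", "+", "IIII",
]
for rec in drop_rails(example):
    print("\n".join(rec))
```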

> -) Also I tried to use mirabait to identify reads that map onto the 
> sequence of the host organism, but unfortunately it seems as if the reference 
> sequences are too long. Is there a way of dealing with this, apart from 
> cutting the reference in smaller bits? This is the error message I get:
> "Read gi|354459049|gb|AGKD01000001.1| is 194200 bp long and thus longer than 
> MAXREADSIZEALLOWED (29900) bases. Skim cannot handle than, sorry."

Oooooooops! This is something which should not happen. Definitely a bug. I'll 
have a look at it as I am currently working on this part of MIRA. In the 
meantime: sorry, you need to fragment :-/
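
One way to do the fragmenting is sketched below in Python. The only number taken from this thread is the 29900 limit in the error message; the fragment size of 20000, the overlap of 1000, the "_fragN" naming and the function names are all assumptions chosen for illustration. The overlap is there so that any short stretch spanning a cut point is still present in full in one of the fragments.

```python
MAX_LEN = 29900  # limit quoted in the error message above


def read_fasta(lines):
    """Yield (name, sequence) tuples from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fragment(name, seq, size=20000, overlap=1000):
    """Yield (fragment_name, piece) with pieces of at most `size` bases.

    Consecutive pieces share `overlap` bases. Size and overlap are
    illustrative defaults, not values recommended by MIRA.
    """
    if not 0 <= overlap < size <= MAX_LEN:
        raise ValueError("need 0 <= overlap < size <= MAX_LEN")
    if not seq:
        return
    tokens = name.split()
    base = tokens[0] if tokens else "seq"  # keep only the sequence ID
    start, index = 0, 1
    while True:
        yield f"{base}_frag{index}", seq[start:start + size]
        if start + size >= len(seq):
            break
        start += size - overlap
        index += 1


# Made-up record with the length from the error message.
record = [">long_reference some description", "A" * 194200]
for ref_name, ref_seq in read_fasta(record):
    for frag_name, piece in fragment(ref_name, ref_seq):
        print(f">{frag_name} length={len(piece)}")
```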

Bastien


--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
