[mira_talk] Re: Strange output reads from Solexa assembly

  • From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 9 Dec 2009 08:51:22 +0100

Hi Bastien, 
thanks a lot! I can certainly see the point of collapsing reads, just wanted to 
confirm things to be sure what I was looking at. 

> >  And if I wanted to find all reads mapped to a certain site, is that info
> >  preserved somewhere?
> 
> No, it isn't. Any special reason why you would need that?

No special reason, just checking what is possible in general. 

B



On Tue, 8 Dec 2009 20:03:13 +0100
Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Dienstag 08 Dezember 2009 Björn Nystedt wrote:
> > Hi all,
> > I'm playing with our first Solexa data in MIRA, doing a preliminary
> >  reference assembly (2.7 million Illumina reads, non-paired, 38bp) on a
> >  2Mbp genome with the call to MIRA (V3rc4):
> 
> Hi Björn,
> 
> please switch to 3rc4d, it fixes a few bugs and annoyances of 3rc4 which 
> might 
> hit you (nothing too drastic, but still):
>   http://www.chevreux.org/tmp/mira-3rc4d.tar.bz2
> or compiled
>   http://www.chevreux.org/tmp/mira_3rc4d_dev_linux-gnu_x86_64_static.tar.bz2
>  
> > mira --project=BAnh1IrMREF02 --job=mapping,genome,normal,solexa
> >  -OUT:ora=yes -GE:not=1 -AS:urd=yes -SB:lb=yes:bft=fasta:bbq=20
> >  SOLEXA_SETTINGS -LR:ft=fastq
> 
> I see that you are not using strain information: I'd recommend doing this. 
> Even if your backbone is, technically speaking, the same strain as the Solexa 
> sequences, giving the sequences another strain identifier open up a panoply 
> of 
> analyses for MIRA that might be useful to you in the final project in form of 
> tags. The most important being then: SROc, WRMc, MSVc, IUPc, UNSc.
> 
> Furthermore, I'm not sure whether setting the backbone quality to 20 is a 
> good 
> idea. I'd have to check, but I think a number of tags are set only if the 
> quality is 30 or more.
> 
> > In my output .caf and .ace files, I found only very few reads from my input
> >  files (with names like HWI-EAS210R_0001:6:1:3:224#GATCAG/1). Instead, I
> >  found ~400000 reads with read names like _cer_sxa_0_
> > _cer_sxa_1_
> > ..
> > These reads are generally much(!) longer than my input reads.
> > 
> > Does anyone know what these reads are?
> 
> Yes, I do :-) Not very well documented I fear.
> 
> > I guess they could be fake reads to reduce read numbers while preserving
> > coverage, but I am not sure?
> 
> That's exactly what they are there for. CER stands for _C_overage 
> _E_quivalent 
> _Read_ and is a way to get finishing programs (gap4, consed) be able to work 
> almost swiftly with the data of mapping assemblies. 
> 
> When I first worked with Solexas in late 2007, getting such projects edited 
> was a pain in the rear as the editors were simply not made for handling 
> contigs with 3+ million reads. I thought of ways to alleviate the problem and 
> implemented CERs in December 2007 / January 2008 (V2.9.19) it became pretty 
> stable somewhere in the 2.9.2x version (27? I don't remember).
> 
> In usual projects, you get a reduction of 80 to 90 % in the number of reads. 
> This in turn allows the finishing programs like gap4 to open bacterial 
> projects only with 2 to 4 GB RAM usage instead of 10 or more. Scrolling and 
> editing is also perfectly possible, which is not really the case when all 6 
> or 
> 7 million reads of a Solexa lane are mapped as own entity.
> 
> What MIRA does is this: every read which maps 100% to the reference backbone 
> (possibly after clipping) is used to create the CER sequences. If a read does 
> not map 100% (be it due to SNPs or a sequencing error), it is kept as single 
> sequence. This is also a great visual help afterwards during finishing to 
> discern true SNPs: they are almost always composed only of these original 
> reads, while the rest of the genome is covered by the CERs.
> 
> >  And if so, does the coverage truly represent
> >  all mismatches (i.e. are "allel frequencies" truly preserved)? 
> 
> From coverage perspective, what you see is 100% identical to what you would 
> get if you mapped the reads as own entities. In case of different allels, 
> MIRA 
> simply transforms those reads matching 100% the backbone to CER and the reads 
> with a "SNP" it keeps as single reads. Frequencies are fully preserved.
> 
> >  And if I wanted to find all reads mapped to a certain site, is that info
> >  preserved somewhere?
> 
> No, it isn't. Any special reason why you would need that?
> 
> >  Is there a way to turn this feature off?
> 
> Now, if you had to guess? :-)
> 
>   SOLEXA_SETTINGS -CO:msr=no
> 
> is what you're searching for.
> 
> Regards,
>   Bastien
> 
> PS: you don't have a GenBank file with annotations as backbone. What a pity, I
>     could introduce you to the MIRA SNP analysis pipeline :-)
> 
> -- 
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html


-- 
====================================
Björn Nystedt, PhD
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================

--
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: