Hi Bastien, thanks a lot! I can certainly see the point of collapsing reads, just wanted to confirm things to be sure what I was looking at. > > And if I wanted to find all reads mapped to a certain site, is that info > > preserved somewhere? > > No, it isn't. Any special reason why you would need that? No special reason, just checking what is possible in general. B On Tue, 8 Dec 2009 20:03:13 +0100 Bastien Chevreux <bach@xxxxxxxxxxxx> wrote: > On Dienstag 08 Dezember 2009 Björn Nystedt wrote: > > Hi all, > > I'm playing with our first Solexa data in MIRA, doing a preliminary > > reference assembly (2.7 million Illumina reads, non-paired, 38bp) on a > > 2Mbp genome with the call to MIRA (V3rc4): > > Hi Björn, > > please switch to 3rc4d, it fixes a few bugs and annoyances of 3rc4 which > might > hit you (nothing too drastic, but still): > http://www.chevreux.org/tmp/mira-3rc4d.tar.bz2 > or compiled > http://www.chevreux.org/tmp/mira_3rc4d_dev_linux-gnu_x86_64_static.tar.bz2 > > > mira --project=BAnh1IrMREF02 --job=mapping,genome,normal,solexa > > -OUT:ora=yes -GE:not=1 -AS:urd=yes -SB:lb=yes:bft=fasta:bbq=20 > > SOLEXA_SETTINGS -LR:ft=fastq > > I see that you are not using strain information: I'd recommend doing this. > Even if your backbone is, technically speaking, the same strain as the Solexa > sequences, giving the sequences another strain identifier open up a panoply > of > analyses for MIRA that might be useful to you in the final project in form of > tags. The most important being then: SROc, WRMc, MSVc, IUPc, UNSc. > > Furthermore, I'm not sure whether setting the backbone quality to 20 is a > good > idea. I'd have to check, but I think a number of tags are set only if the > quality is 30 or more. > > > In my output .caf and .ace files, I found only very few reads from my input > > files (with names like HWI-EAS210R_0001:6:1:3:224#GATCAG/1). Instead, I > > found ~400000 reads with read names like _cer_sxa_0_ > > _cer_sxa_1_ > > .. > > These reads are generally much(!) longer than my input reads. > > > > Does anyone know what these reads are? > > Yes, I do :-) Not very well documented I fear. > > > I guess they could be fake reads to reduce read numbers while preserving > > coverage, but I am not sure? > > That's exactly what they are there for. CER stands for _C_overage > _E_quivalent > _Read_ and is a way to get finishing programs (gap4, consed) be able to work > almost swiftly with the data of mapping assemblies. > > When I first worked with Solexas in late 2007, getting such projects edited > was a pain in the rear as the editors were simply not made for handling > contigs with 3+ million reads. I thought of ways to alleviate the problem and > implemented CERs in December 2007 / January 2008 (V2.9.19) it became pretty > stable somewhere in the 2.9.2x version (27? I don't remember). > > In usual projects, you get a reduction of 80 to 90 % in the number of reads. > This in turn allows the finishing programs like gap4 to open bacterial > projects only with 2 to 4 GB RAM usage instead of 10 or more. Scrolling and > editing is also perfectly possible, which is not really the case when all 6 > or > 7 million reads of a Solexa lane are mapped as own entity. > > What MIRA does is this: every read which maps 100% to the reference backbone > (possibly after clipping) is used to create the CER sequences. If a read does > not map 100% (be it due to SNPs or a sequencing error), it is kept as single > sequence. This is also a great visual help afterwards during finishing to > discern true SNPs: they are almost always composed only of these original > reads, while the rest of the genome is covered by the CERs. > > > And if so, does the coverage truly represent > > all mismatches (i.e. are "allel frequencies" truly preserved)? > > From coverage perspective, what you see is 100% identical to what you would > get if you mapped the reads as own entities. In case of different allels, > MIRA > simply transforms those reads matching 100% the backbone to CER and the reads > with a "SNP" it keeps as single reads. Frequencies are fully preserved. > > > And if I wanted to find all reads mapped to a certain site, is that info > > preserved somewhere? > > No, it isn't. Any special reason why you would need that? > > > Is there a way to turn this feature off? > > Now, if you had to guess? :-) > > SOLEXA_SETTINGS -CO:msr=no > > is what you're searching for. > > Regards, > Bastien > > PS: you don't have a GenBank file with annotations as backbone. What a pity, I > could introduce you to the MIRA SNP analysis pipeline :-) > > -- > You have received this mail because you are subscribed to the mira_talk > mailing list. For information on how to subscribe or unsubscribe, please > visit http://www.chevreux.org/mira_mailinglists.html -- ==================================== Björn Nystedt, PhD Molecular Evolution EBC, Uppsala University Norbyv. 18C, 752 36 Uppsala Sweden phone: +46 (0)18-471 45 88 email: Bjorn.Nystedt@xxxxxxxxx ==================================== -- You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html