I have tried yet another approach for this assembly. I assumed that the Solexa data was contaminated, so I ran MIRA with the 454 contigs (used as long Sanger reads) and the Solexa mate pairs, but only those mate pairs that could be mapped to the contigs (using Bowtie, allowing up to 3 mismatches). This reduced the number of mate pairs from 3M to 1M, and reads that span the gaps between the contigs should still be included.

So I fed the following to MIRA:

  mira -project=M1 -job=denovo,genome,normal,sanger,solexa -GENERAL:number_of_threads=4 SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=2000:tismax=3000

  M1_in.sanger.fasta
  M1_in.solexa.fasta

I waited for a day and got:

  Localtime: Tue Sep 15 10:51:42 2009

  Assembly information:
  =====================

  Num. reads assembled: 1186418
  Num. singlets: 20

  Large contigs:
  --------------
  With Contig size >= 500
   AND (Total avg. Cov >= 8
        OR Cov(san) >= 0
        OR Cov(454) >= 0
        OR Cov(sxa) >= 8
        OR Cov(sid) >= 0
       )

    Length assessment:
    ------------------
    Number of contigs:  915
    Total consensus:    696521
    Largest contig:     6010
    N50 contig size:    739
    N90 contig size:    538
    N95 contig size:    523

    Coverage assessment:
    --------------------
    Max coverage (total): 1625
    Max coverage
      Sanger: 0
      454:    0
      Solexa: 1625
      Solid:  0
    Avg. total coverage (size >= 5000): 23.75
    Avg. coverage (contig size >= 5000)
      Sanger: 0.00
      454:    0.00
      Solexa: 23.75
      Solid:  0.00

    Quality assessment:
    -------------------
    Average consensus quality:                 22
    Consensus bases with IUPAC (IUPc):         645 (you might want to check these)
    Strong unresolved repeat positions (SRMc): 0 (excellent)
    Weak unresolved repeat positions (WRMc):   0 (excellent)
    Sequencing Type Mismatch Unsolved (STMU):  0 (excellent)
    Contigs having only reads wo qual:         0 (excellent)
    Contigs with reads wo qual values:         0 (excellent)

  All contigs:
  ------------
    Length assessment:
    ------------------
    Number of contigs:  28134
    Total consensus:    3312250
    Largest contig:     6010
    N50 contig size:    178
    N90 contig size:    48
    N95 contig size:    41

    Coverage assessment:
    --------------------
    Max coverage (total): 1625
    Max coverage
      Sanger: 0
      454:    0
      Solexa: 1626
      Solid:  0
    Avg. total coverage (size >= 5000): 23.75
    Avg. coverage (contig size >= 5000)
      Sanger: 0.00
      454:    0.00
      Solexa: 23.75
      Solid:  0.00

    Quality assessment:
    -------------------
    Average consensus quality:                 19
    Consensus bases with IUPAC (IUPc):         1542 (you might want to check these)
    Strong unresolved repeat positions (SRMc): 0 (excellent)
    Weak unresolved repeat positions (WRMc):   0 (excellent)
    Sequencing Type Mismatch Unsolved (STMU):  0 (excellent)
    Contigs having only reads wo qual:         0 (excellent)
    Contigs with reads wo qual values:         0 (excellent)

This strikes me as completely wrong. The long contigs are gone. According to the log, both Sanger and Solexa reads were loaded (I omitted the quals on purpose, expecting a simple run).

Martin

On Thu, Sep 10, 2009 at 5:09 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

> On Thursday 10 September 2009 Martin A. Hansen wrote:
> > So, why are qualities so important? If you have enough sequence it should
> > level out?
>
> The "non-perfect-repeat" detection routines heavily rely on qualities to tag
> bases that aid to discern the different repeats.
> The base calling algorithms also take qualities into consideration when
> confronted to unsure situations.
>
> Plus a few other places where qualities do matter quite a lot :-)
>
> B.
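
For reference, a minimal sketch of the mate-pair filtering step described at the top of this message. It is not the script actually used. It assumes that read names carry a /1 or /2 mate suffix and that the names of the reads Bowtie could map have already been collected into a set; the function name is hypothetical. It keeps both mates of a pair as soon as one mate maps, which is one way to retain pairs that reach into the gaps between contigs.

```python
def pairs_with_mapped_mate(read_names, mapped_names):
    """Return all read names whose pair has at least one mapped mate."""

    def pair_id(name):
        # Strip the assumed /1 or /2 mate suffix to get the pair identifier.
        return name[:-2] if name.endswith(("/1", "/2")) else name

    # Pair identifiers for which at least one mate was mapped.
    mapped_pairs = {pair_id(name) for name in mapped_names}

    # Keep both mates of every such pair, preserving the input order.
    return [name for name in read_names if pair_id(name) in mapped_pairs]


# Toy example: pair p3 has no mapped mate and is dropped.
reads = ["p1/1", "p1/2", "p2/1", "p2/2", "p3/1", "p3/2"]
mapped = {"p1/1", "p1/2", "p2/2"}
kept = pairs_with_mapped_mate(reads, mapped)
print(kept)
```

Requiring both mates to map instead would discard exactly those pairs in which one mate falls outside the contigs, so the choice between the two rules matters for gap spanning.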