Hi Bastien,

________________________________
From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Sent: Mon, March 28, 2011 12:07:03 PM
Subject: [mira_talk] Re: regd Mapping 454 titanium with Solexa paired reads

On Monday 28 March 2011 19:44:45 Arun Rawat wrote:
> I assembled bacterial genome from 454 Titanium that resulted in around 500
> contigs overall.

That is a fairly high number of contigs. Way too high for my liking. What's
the coverage of these?

1. There are 125 longer contigs (>= 5000) out of 480 contigs in total.
   Avg. total coverage (size >= 5000): 27.57.

> Now I am trying to use these sets of contigs to map paired reads to
> generate larger contigs. To test the results, I ran with default
> parameters: mira --project=mapSX_454 --job=mapping,genome,accurate,solexa
> -GE:not=16 -AS:nop=1 -SB:bft=caf >&log_assembly.txt
>
> The results came pretty good with higher N50, lesser number of contigs
> (~300) etc as mentioned in info_assembly.txt

Ahmmm ... something's not right here. If 500 contigs came in, 500 must come
out of a mapping assembly in the "All contigs" category: MIRA does not join
contigs in mapping.

Can you please do a count on your input CAF for the number of contigs with

  grep -c Is_contig mapSX_454_backbone_in.caf

and tell what that gives you?

2. I think the earlier file somehow got corrupted. I reran the assembly and
   found the result consistent with the mapping file, as you mentioned. I
   think I can try to extract the contigs generated from the Solexa reads
   mapped against the 454 contigs from the generated ACE file. Do you think
   that is right?

3. I also ran the Solexa paired reads and the 454 reads de novo (instead of
   mapping), and the statistics are:

Assembly information:
=====================

Num. reads assembled: 6397880
Num. singlets: 0

Coverage assessment (calculated from contigs >= 5000):
=========================================================
  Avg. total coverage: 102.61
  Avg. coverage per sequencing technology
    Sanger: 0.00
    454:    27.85
    PacBio: 0.00
    Solexa: 74.42
    Solid:  0.00

Large contigs (makes less sense for EST assemblies):
====================================================
With Contig size        >= 500
     AND (Total avg. Cov >= 34
          OR Cov(san)    >= 0
          OR Cov(454)    >= 9
          OR Cov(pbs)    >= 0
          OR Cov(sxa)    >= 24
          OR Cov(sid)    >= 0
         )

  Length assessment:
  ------------------
  Number of contigs: 157
  Total consensus:   5348906
  Largest contig:    542454
  N50 contig size:   191948
  N90 contig size:   32153
  N95 contig size:   12196

  Coverage assessment:
  --------------------
  Max coverage (total): 2134
  Max coverage per sequencing technology
    Sanger: 0
    454:    1125
    PacBio: 0
    Solexa: 1606
    Solid:  0

  Quality assessment:
  -------------------
  Average consensus quality: 85
  Consensus bases with IUPAC: 135 (you might want to check these)
  Strong unresolved repeat positions (SRMc): 0 (excellent)
  Weak unresolved repeat positions (WRMc): 71 (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
  Contigs having only reads wo qual: 0 (excellent)
  Contigs with reads wo qual values: 0 (excellent)

All contigs:
============
  Length assessment:
  ------------------
  Number of contigs: 4926
  Total consensus:   8216840
  Largest contig:    542454
  N50 contig size:   81331
  N90 contig size:   561
  N95 contig size:   346

  Coverage assessment:
  --------------------
  Max coverage (total): 2134
  Max coverage per sequencing technology
    Sanger: 0
    454:    1125
    PacBio: 0
    Solexa: 1606
    Solid:  0

  Quality assessment:
  -------------------
  Average consensus quality: 78
  Consensus bases with IUPAC: 2759 (you might want to check these)
  Strong unresolved repeat positions (SRMc): 0 (excellent)
  Weak unresolved repeat positions (WRMc): 73 (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
  Contigs having only reads wo qual: 0 (excellent)
  Contigs with reads wo qual values: 0 (excellent)

Any suggestions on these results?
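
For anyone comparing the N50/N90/N95 figures above between runs, here is a minimal sketch of how such values are commonly computed from a list of contig lengths. It uses the usual textbook definition (the length of the contig at which the running total, counted from the largest contig down, first reaches the given percentage of the total consensus). I am assuming, not asserting, that MIRA uses the same definition; the function name `nxx` and the toy lengths are made up for illustration.

```python
def nxx(lengths, percent=50):
    """Return the Nxx contig size for a list of contig lengths.

    Assumption: standard definition, i.e. the length of the contig at which
    the cumulative length (largest contigs first) reaches `percent` percent
    of the total length. MIRA's exact definition may differ.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Toy contig lengths (hypothetical, not taken from the assembly above).
toy_lengths = [100, 80, 50, 30, 20, 20]
n50 = nxx(toy_lengths, 50)  # 100 + 80 = 180 reaches half of 300, so 80
n90 = nxx(toy_lengths, 90)  # running total reaches 270 of 300 at a 20
print("N50:", n50, "N90:", n90)
```

With many short contigs in the "All contigs" category, N90 and N95 drop sharply even when N50 stays high, which is the pattern visible in the two tables above.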

As MIRA does not allow mixing two directions from the same technology, in the
next step I want to create longer contigs by combining the contigs generated
above (454 + Solexa paired-end) with the Solexa mate-pair data. Do you think
the generated contigs (454 + Solexa paired-end) can be treated as Sanger
reads, with the mate-pair data added to create longer contigs? I do not want
to scaffold now, as we plan to use these long contigs as our reference
sequences for further studies.

Thanks!!
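
P.S. For the contig count check on the backbone CAF discussed earlier in the thread, a rough Python equivalent of `grep -c Is_contig` might look like the sketch below. The helper name `count_contigs` and the sample lines are hypothetical; the only thing taken from the thread is that contigs are counted by lines containing `Is_contig`.

```python
def count_contigs(caf_lines):
    """Count lines containing 'Is_contig', like `grep -c Is_contig`."""
    return sum(1 for line in caf_lines if "Is_contig" in line)


# Toy stand-in for CAF content (hypothetical names, heavily abbreviated).
sample = [
    "Sequence : contig_1\n",
    "Is_contig\n",
    "Sequence : read_1\n",
    "Is_read\n",
    "Sequence : contig_2\n",
    "Is_contig\n",
]
sample_count = count_contigs(sample)
print("contigs in sample:", sample_count)

# On a real file, pass the open handle so the file is streamed line by line:
#   with open("mapSX_454_backbone_in.caf") as handle:
#       print(count_contigs(handle))
```

Comparing this number for the input backbone with the "All contigs" count in info_assembly.txt should show whether a mapping run kept every contig.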