[mira_talk] Re: regd Mapping 454 titanium with Solexa paired reads

  • From: Arun Rawat <rawat_arun@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 28 Mar 2011 18:38:36 -0700 (PDT)

Hi Bastien,




________________________________
From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Sent: Mon, March 28, 2011 12:07:03 PM
Subject: [mira_talk] Re: regd Mapping 454 titanium with Solexa paired reads

 
On Monday 28 March 2011 19:44:45 Arun Rawat wrote:
> I assembled bacterial genome from 454 Titanium that resulted in around 500
> contigs overall.
That is a fairly high number of contigs. Way too high for my liking. What's the 
coverage of these?
1. Of the 480 contigs in total, 125 are longer contigs (size >= 5000):

Avg. total coverage (size >= 5000): 27.57.


> Now I am trying to use these sets of contigs to map paired reads to
> generate larger contigs. To test the results, I ran with default
> parameters: mira --project=mapSX_454 --job=mapping,genome,accurate,solexa
> -GE:not=16 -AS:nop=1 -SB:bft=caf >&log_assembly.txt
> 
> The results came pretty good with higher N50, lesser number of contigs
> (~300) etc as mentioned in info_assembly.txt
Ahmmm ... something's not right here. If 500 contigs came in, 500 must come out 
of a mapping assembly in the "All contigs" category: MIRA does not join contigs 
in mapping. Can you please do a count on your input CAF for the number of 
contigs with
grep -c Is_contig mapSX_454_backbone_in.caf
and tell what that gives you?
2. I think the earlier file somehow got corrupted. I reran it and found the 
result consistent with the mapping file, as you mentioned. I think I can try to 
extract the contigs that result from mapping the Solexa reads against the 454 
contigs from the generated ACE file. Do you think that is right?
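
A minimal sketch of pulling the consensus sequences out of an ACE file, in case it helps to make the idea concrete. It assumes the standard ACE layout (a `CO <name> <bases> <reads> <segments> <U|C>` header followed by consensus lines up to the next blank line, with `*` as the pad character). The function names `ace_consensus` and `to_fasta` are made up for illustration and are not part of MIRA.

```python
def ace_consensus(lines, strip_pads=True):
    """Yield (contig_name, consensus) for every CO record in an ACE file.

    Assumption: the consensus follows the CO header line and ends at the
    first blank line, as in the standard ACE layout.
    """
    name = None
    chunks = []

    def finish():
        seq = "".join(chunks)
        # '*' marks alignment pads in the consensus; drop them if requested.
        return name, seq.replace("*", "") if strip_pads else seq

    for raw in lines:
        line = raw.strip()
        if line.startswith("CO "):
            name = line.split()[1]
            chunks = []
        elif name is not None:
            if line:
                chunks.append(line)
            else:
                yield finish()
                name = None
    if name is not None:  # file ended without a trailing blank line
        yield finish()


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapped at `width`."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

To use it, open the ACE file written by the mapping run, pass the file object to `ace_consensus`, and write the result of `to_fasta` to a new file. Note that this only recovers the consensus; read placements and quality values are ignored.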
3. I also ran a de novo assembly of the Solexa paired-end reads together with 
the 454 reads (instead of mapping). The statistics are:
Assembly information:
=====================
Num. reads assembled: 6397880
Num. singlets: 0
Coverage assessment (calculated from contigs >= 5000):
=========================================================
  Avg. total coverage: 102.61
  Avg. coverage per sequencing technology
    Sanger:    0.00
    454:    27.85
    PacBio:    0.00
    Solexa:    74.42
    Solid:    0.00
Large contigs (makes less sense for EST assemblies):
====================================================
With    Contig size        >= 500
    AND (Total avg. Cov    >= 34
         OR Cov(san)    >= 0
         OR Cov(454)    >= 9
         OR Cov(pbs)    >= 0
         OR Cov(sxa)    >= 24
         OR Cov(sid)    >= 0
        )
  Length assessment:
  ------------------
  Number of contigs:    157
  Total consensus:    5348906
  Largest contig:    542454
  N50 contig size:    191948
  N90 contig size:    32153
  N95 contig size:    12196
  Coverage assessment:
  --------------------
  Max coverage (total):    2134
  Max coverage per sequencing technology
    Sanger:    0
    454:    1125
    PacBio:    0
    Solexa:    1606
    Solid:    0
  Quality assessment:
  -------------------
  Average consensus quality:            85
  Consensus bases with IUPAC:            135    (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0    (excellent)
  Weak unresolved repeat positions (WRMc):    71    (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):    0    (excellent)
  Contigs having only reads wo qual:        0    (excellent)
  Contigs with reads wo qual values:        0    (excellent)
All contigs:
============
  Length assessment:
  ------------------
  Number of contigs:    4926
  Total consensus:    8216840
  Largest contig:    542454
  N50 contig size:    81331
  N90 contig size:    561
  N95 contig size:    346
  Coverage assessment:
  --------------------
  Max coverage (total):    2134
  Max coverage per sequencing technology
    Sanger:    0
    454:    1125
    PacBio:    0
    Solexa:    1606
    Solid:    0
  Quality assessment:
  -------------------
  Average consensus quality:            78
  Consensus bases with IUPAC:            2759    (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0    (excellent)
  Weak unresolved repeat positions (WRMc):    73    (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):    0    (excellent)
  Contigs having only reads wo qual:        0    (excellent)
  Contigs with reads wo qual values:        0    (excellent)
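
For reference, the N50/N90/N95 figures above can be recomputed from a list of contig lengths. The sketch below uses the common definition (the length of the contig at which the running total, taken from the largest contig downwards, first reaches the given fraction of all consensus bases). MIRA's own calculation may treat ties or rounding differently, and the helper name `nxx` is made up for illustration.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx contig size for a list of contig lengths.

    Contigs are visited from largest to smallest; the result is the length
    of the contig at which the cumulative sum first reaches
    `fraction` * (total number of bases).
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy example: nine contigs, 54 bases in total.
toy = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50:", nxx(toy, 0.5))   # 10 + 9 + 8 = 27 >= 27, so N50 is 8
print("N90:", nxx(toy, 0.9))   # running sum reaches 49 >= 48.6 at length 4
```

Running the same function on only the large contigs and then on all contigs shows how the many short contigs in the "All contigs" set pull N50 and especially N90/N95 down.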

Any suggestions on these results?

As MIRA does not allow mixing two directions from the same technology, in the 
next step I want to create longer contigs by combining the contigs generated 
above (454 + Solexa paired-end) with the Solexa mate-pair data. Do you think 
these contigs can be treated as Sanger reads, with the mate-pair data added to 
create longer contigs? I do not want to scaffold now, as we plan to use these 
long contigs as our reference sequences for further studies.


Thanks!!


      
