[mira_talk] Re: Denovo hybrid assembly of a 3.8M genome using 454 and Solexa

  • From: Hui Sun <hsun@xxxxxxx>
  • To: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
  • Date: Thu, 13 Oct 2011 16:13:47 -0700

Hi,

Thank you all for the help, and sorry for the delay in my response.

It seems like there were two major problems with the data and assembly.

1. The genome size estimated from the allpath assembly was way off; it
should be 50Mb, not 3.8Mb.
2. The MIDs used in 454 sequencing didn't get screened out properly.
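
A back-of-the-envelope check of what point 1 means for coverage, as a
minimal sketch. It assumes only that the ~42x figure in the original post
was computed against 3.8Mb, so the amount of sequence per platform is
fixed; the function name is made up for illustration.

```python
def rescaled_coverage(coverage, assumed_size_bp, actual_size_bp):
    """For a fixed amount of sequence, coverage scales inversely with genome size."""
    total_bases = coverage * assumed_size_bp  # bases implied by the old estimate
    return total_bases / actual_size_bp

# ~42x was computed for a 3.8Mb genome; re-evaluate it for 50Mb.
cov = rescaled_coverage(42, 3.8e6, 50e6)
print(f"{cov:.1f}x")  # prints 3.2x
```

Under that assumption each subset gives roughly 3x rather than 42x, which
would fit the fragmented result. The MIRA total consensus of 46425879
bases is also much closer to 50Mb than to 3.8Mb.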

I'm re-processing the data and will keep you updated on the progress.
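
One simple way to check for leftover MIDs during re-processing is to count
how many reads still begin with a barcode. The sketch below is only an
illustration: the barcode and the reads are placeholders, not the ones
from this run, and both helper names are hypothetical. It assumes the
read sequences have already been loaded as plain strings.

```python
def count_mid_prefixes(reads, mids):
    """Count reads whose sequence starts with each MID barcode."""
    counts = {mid: 0 for mid in mids}
    for seq in reads:
        for mid in mids:
            if seq.upper().startswith(mid):
                counts[mid] += 1
                break  # a read is attributed to at most one MID
    return counts

def strip_mid(seq, mids):
    """Return the read with a leading MID removed, or unchanged if none matches."""
    for mid in mids:
        if seq.upper().startswith(mid):
            return seq[len(mid):]
    return seq

# Placeholder barcode and toy reads, for illustration only.
mids = ["ACGAGTGCGT"]
reads = ["ACGAGTGCGTTTGACC", "GGGTTTAAACCC", "acgagtgcgtAAAA"]
print(count_mid_prefixes(reads, mids))
print([strip_mid(r, mids) for r in reads])
```

A high count after screening would suggest the MIDs are still attached to
the 5' end of the reads.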

Thanks again.


On Sat, Oct 8, 2011 at 4:21 AM, Martin Mokrejs
<mmokrejs@xxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>  can you show the 454 adaptor sequence you searched for, along with
> some 20 input entries in FASTA format before and after that "masking"?
> Finally, can you send the "sffinfo -s -n" output for those first 20 entries?
> Maybe you failed to mask the adaptors? If you cannot upload it to some
> public place, send it to me directly so that we do not spoil the email list.
>
> Did you try just 454 data assembly alone?
> Martin
>
> Hui Sun wrote:
>> Hello,
>>
>> I am trying to assemble a genome with an estimated size of 3.8M.  I
>> have used allpath and generated an assembly.  As a comparison, I'm
>> trying to use the MIRA assembler.
>>
>> I have 4 million 454 PE reads and 120 million Solexa reads. I have
>> screened out adaptors by using SSAHA2.
>>
>> I then subset the Solexa reads to 2.5 million and the 454 reads to 400K,
>> which is ~42x coverage for each platform. I then ran a MIRA hybrid
>> denovo assembly:
>>   mira --project=test --job=denovo,genome,normal,454,solexa
>>
>> The resulting assembly seems to be really fragmented; see the stats
>> below.  What can I do to improve scaffolding?   My allpath assembly
>> generated 15106 contigs, 11965 scaffolds, N50 contig size 1.9kb, which
>> seems to be much better.  Thanks for the help.
>>
>> All contigs:
>> ============
>>   Length assessment:
>>   ------------------
>>   Number of contigs:  124259
>>   Total consensus:    46425879
>>   Largest contig:     4936
>>   N50 contig size:    402
>>   N90 contig size:    240
>>   N95 contig size:    206
>>
>>   Coverage assessment:
>>   --------------------
>>   Max coverage (total):       298
>>   Max coverage per sequencing technology
>>       Sanger: 0
>>       454:    304
>>       IonTor: 0
>>       PacBio: 0
>>       Solexa: 767
>>       Solid:  0
>>
>

