[mira_talk] Assembling 454 and Solexa mate-pair data - rethinking ...

  • From: "Martin A. Hansen" <mail@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 31 Aug 2009 11:16:23 +0200

MIRA has been running for over a week now digesting this rather simple
de-novo assembly:

   - 454 Reads (n=478840)
   - Solexa reads (n=3308974) which are mate-pair (d=~2500)

Now, assembling just the 454 reads takes a couple of hours and results in
around 70 contigs. That is good, but I would expect that with the additional
Solexa mate pair data, I should be able to close these gaps and hopefully
end up with a single contig. However, MIRA does not close gaps when doing
mapping assemblies - so the deal is to set up a de-novo assembly using
both types of data. I have tried this, but I find it too slow.

Now, I was thinking about a bit of hackery to limit the load on MIRA. My
idea is to index the ends (2500bp + ~500bp) of each contig (from a 454-only
assembly) using a simple hash with a sequence word of size 35 as key and the
word positions as values. Then I would go over all the Solexa reads and
filter out all reads that do not match the beginning of one contig and the
end of another contig in a way that satisfies the minimum mate-pair distance.
The result would be a limited stack of Solexa reads that have perfect hits
in multiple contigs within the allowed mate pair distances - and then I
could feed the 454 reads and this limited stack to MIRA.
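
A minimal Python sketch of the end index described above, assuming the
contigs fit in memory as plain strings. The function name
index_contig_ends, the end_len default of 3000 (2500bp + ~500bp) and the
"5"/"3" end labels are illustrative choices, not anything MIRA provides.
It indexes the forward strand only; reads from the reverse strand would
need a reverse-complemented lookup.

```python
from collections import defaultdict


def index_contig_ends(contigs, word_size=35, end_len=3000):
    """Index the words found in the first and last end_len bases of each contig.

    contigs: dict mapping contig name -> sequence string
    Returns a dict mapping word -> list of (contig_name, end_label, position),
    where position is the 0-based word start within the contig.
    """
    index = defaultdict(list)
    for name, seq in contigs.items():
        n = len(seq)
        # Words lying entirely inside the 5' window.
        for pos in range(0, min(end_len, n) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, "5", pos))
        # Words starting inside the 3' window. For contigs shorter than
        # 2 * end_len the windows overlap and a word can get both labels.
        for pos in range(max(0, n - end_len), n - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, "3", pos))
    return index
```

Lookup is then a plain dictionary access, e.g. index.get(read[:35], []).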

Pseudo code mockup (with limited detail):

foreach contig
   add_to_contig_list( contig_name, index_5prime_end(contig),
                       index_3prime_end(contig) )

foreach solexa_pair
   for (i=0;i<contigs;i++)
      for (j=i+1;j<contigs;j++)
         if (exists contig->index_5prime_end(read1) && exists
             contig->index_3prime_end(read2))
             if (minimum_dist_ok(contig->index_5prime_end(read1),
                                 contig->index_3prime_end(read2)))
                print read1 and read2

                # The great speed-up, assuming that 500-1000 reads are
                # enough to close the gap:
                if ( enough_reads_found_for_a_pair_of_contigs() )
                   remove contigs from contig list
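
The filtering loop could look roughly like this in Python. This is only a
sketch under assumptions: filter_pairs, quota and dist_ok are made-up
names, the index is assumed to be a dict mapping a word to a list of
(contig_name, end_label, position) hits, and only the first word of each
read is looked up. Instead of removing contigs from the list, it stops
keeping pairs for a given pair of contigs once the quota is reached,
which is a slight variation on the mockup. The distance test is left to
the caller as an optional dist_ok(hit1, hit2) function.

```python
from collections import Counter


def filter_pairs(pairs, index, word_size=35, quota=500, dist_ok=None):
    """Keep read pairs whose mates hit the ends of two different contigs.

    pairs:   iterable of (read1, read2) sequence strings
    index:   dict mapping word -> list of (contig_name, end_label, position)
    quota:   maximum number of pairs kept per pair of contigs
    dist_ok: optional predicate taking two hits; pairs failing it are dropped
    """
    kept = []
    per_contig_pair = Counter()
    for read1, read2 in pairs:
        hits1 = index.get(read1[:word_size], ())
        hits2 = index.get(read2[:word_size], ())
        if _accept(hits1, hits2, per_contig_pair, quota, dist_ok):
            kept.append((read1, read2))
    return kept


def _accept(hits1, hits2, counts, quota, dist_ok):
    """Return True for the first hit combination that links two contigs."""
    for hit1 in hits1:
        for hit2 in hits2:
            if hit1[0] == hit2[0]:
                continue  # both mates in the same contig: no gap spanned
            if dist_ok is not None and not dist_ok(hit1, hit2):
                continue  # caller-supplied mate-pair distance test failed
            key = frozenset((hit1[0], hit2[0]))
            if counts[key] >= quota:
                continue  # enough reads already kept for this contig pair
            counts[key] += 1
            return True
    return False
```

The kept pairs, together with the 454 reads, would then be the reduced
input for the de-novo run.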

Memory consumption would be OK. Speed would be OK.

How about it?

