[mira_talk] Assembling 454 and Solexa mate-pair data - rethinking ...

MIRA has been running for over a week now digesting this rather simple
de-novo assembly:

   - 454 reads (n=478840)
   - Solexa reads (n=3308974), which are mate-pair (d=~2500 bp)

Now, assembling just the 454 reads takes a couple of hours and results in
around 70 contigs. That is good, but I would expect that with the additional
Solexa mate-pair data I should be able to close these gaps and hopefully
end up with a single contig. However, MIRA does not close gaps when doing
mapping assemblies, so the deal is to set up a de-novo assembly using
both types of data. I have tried this, but I find it too slow.

Now, I was thinking about a bit of hackery to limit the load on MIRA. My
idea is to index the ends (2500 bp + ~500 bp) of each contig (from a 454-only
assembly) using a simple hash with a sequence word of size 35 as key and the
word positions as values. Then I would go over all the Solexa read pairs and
filter out every pair that does not match the beginning of one contig and the
end of another contig in a way that satisfies the minimum mate-pair distance.
The result would be a limited stack of Solexa reads that have perfect hits
in multiple contigs within the allowed mate-pair distances, and then I
could feed the 454 reads and this limited stack to MIRA.

Pseudo code mockup (with limited detail):

foreach contig
   add_to_contig_list(contig_name,
                      index_5prime_end(contig),
                      index_3prime_end(contig))

foreach solexa_pair (read1, read2)
   for (i=0; i<contigs; i++)
      for (j=0; j<contigs; j++)
         next if (i == j)   # try both (i,j) and (j,i)
         hit5 = contig[i]->index_5prime_end(read1)
         hit3 = contig[j]->index_3prime_end(read2)
         if (exists hit5 && exists hit3)
            if (minimum_dist_ok(hit5, hit3))
               print read1 and read2

               # The great speed-up, assuming that 500-1000 reads
               # are enough to close the gap
               if (enough_reads_found_for_a_pair_of_contigs(i, j))
                  remove contigs i and j from contig list
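
For anyone who wants to try this, the mockup could be sketched in Python
roughly as below. It is only a minimal sketch under several assumptions, not
anything MIRA provides: all function names are made up for illustration, only
the first 35 bases of each read are looked up (on both strands, exact matches
only), and the distance check is one possible reading of minimum_dist_ok: the
two hits together may not lie further from their contig edges than MAX_SPAN.
A single hash over all contig ends replaces the i/j loops, and the speed-up is
implemented by capping the number of pairs kept per contig link rather than
by removing contigs from the list.

```python
from collections import defaultdict

K = 35            # word size from the post
END_LEN = 3000    # 2500 bp insert size + ~500 bp slack, per contig end
MAX_SPAN = 3000   # assumption: max summed distance of both hits to their contig edges
MAX_PAIRS = 1000  # assumption: pairs kept per contig link (post suggests 500-1000)

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_contig_ends(contigs, k=K, end_len=END_LEN):
    """Hash every k-mer in the first/last end_len bases of each contig.

    contigs maps name -> sequence. Each hash value is a list of
    (contig name, 'begin' or 'end', distance to that contig edge).
    """
    index = defaultdict(list)
    for name, seq in contigs.items():
        n = len(seq)
        regions = (("begin", 0, min(end_len, n)),
                   ("end", max(0, n - end_len), n))
        for end, start, stop in regions:
            for pos in range(start, stop - k + 1):
                # Distance from the word to the contig edge it is near.
                edge_dist = pos if end == "begin" else n - (pos + k)
                index[seq[pos:pos + k]].append((name, end, edge_dist))
    return index


def hits(read, index, k=K):
    """Exact hits of the read's first k bases, on either strand."""
    words = {read[:k], revcomp(read[:k])}
    return [h for w in words for h in index.get(w, [])]


def find_link(read1, read2, index, k=K, max_span=MAX_SPAN):
    """Return a key for the two contig ends this pair connects, or None."""
    for a_name, a_end, a_dist in hits(read1, index, k):
        for b_name, b_end, b_dist in hits(read2, index, k):
            if a_name == b_name or a_end == b_end:
                continue  # need the begin of one contig and the end of another
            if a_dist + b_dist <= max_span:
                return tuple(sorted([(a_name, a_end), (b_name, b_end)]))
    return None


def filter_pairs(pairs, index, k=K, max_span=MAX_SPAN, max_pairs=MAX_PAIRS):
    """Yield read pairs that link two contigs, at most max_pairs per link."""
    found = defaultdict(int)
    for read1, read2 in pairs:
        link = find_link(read1, read2, index, k, max_span)
        if link is not None and found[link] < max_pairs:
            found[link] += 1
            yield read1, read2
```

Reading the contigs and the Solexa pairs from FASTA/FASTQ and writing the
surviving pairs back out is left out on purpose; filter_pairs only expects an
iterable of (read1, read2) sequence strings.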



Memory consumption would be OK. Speed would be OK.
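
A rough count backs this up. The figures below come from the numbers above
(about 70 contigs, 2500 + ~500 bp per contig end, word size 35, 3308974 Solexa
reads); the two lookups per read are an assumption (one forward, one
reverse-complement), and the actual bytes per hash entry depend on the
implementation.

```python
contigs = 70            # "around 70 contigs" from the 454-only assembly
ends_per_contig = 2
end_len = 2500 + 500    # indexed bases per contig end
k = 35                  # word size
solexa_reads = 3308974

words_per_end = end_len - k + 1
index_entries = contigs * ends_per_contig * words_per_end
lookups = solexa_reads * 2   # assumption: forward + reverse-complement lookup per read

print(index_entries)  # 415240
print(lookups)        # 6617948
```

So the hash holds well under half a million entries, and the whole filter
amounts to a few million constant-time lookups.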


How about it?


Martin
