[mira_talk] Re: Megahub info

  • From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Fri, 26 Jun 2009 09:49:49 +0200

> On Thursday 25 June 2009 Björn Nystedt wrote:
> > [...]
> > I think that a proper handling of really long sequences would be great in
> > the gap-closure phase of a genome project. (This is also the only area
> > where phrap is still beating MIRA...)
> 
> Care to elaborate a bit more on that one? From the feedback I got from
> several people, I was under the assumption that MIRA was giving phrap tough
> times on all aspects of assembly quality :-)
> 
> So if there's room for improvement, I'll be happy to have a look at the 
> problem.

Basically this is about integrating data: if I have long pieces of the genome 
that I know are correct, how do I best combine that information with the full 
set of shotgun and paired-end reads to make the most accurate and complete 
assembly? 

Phrap can assemble "reads" of any length in a sensible way. We often have long 
segments of a genome that for various reasons have been proven correct; that 
might be PCR products assembled separately, or simply contigs where the 
assembly has been checked in different ways and where we feel certain that the 
contig is ok. In phrap, we can feed these long "reads" into the assembly 
together with the complete set of real reads. These long reads will then guide 
the assembly of the short reads, while still allowing for joining everything 
into (in the best case..) a single continuous contig, representing the complete 
chromosome. This way we ensure that checked parts of the genome are not messed 
up, while still using all the raw data. So, yes, in certain cases, phrap still 
beats MIRA, even though one has to look pretty deep to find it ;)

Now, the really bad thing about phrap is of course that it does not properly 
use paired-end info, and it is pretty bad at handling repeats, so if this 
process could be performed in MIRA instead, that would be far superior! 
Maybe it would be sufficient to keep the combined mapping+denovo assembly as 
it is, with an optional final merging step where contigs can be joined based 
on overlapping ends and paired-end info?
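
To make the "joined based on overlapping ends" part concrete, here is a rough 
Python sketch of what I have in mind. It is only an illustration: the function 
names and the min_len parameter are made up for this mail, it only detects 
exact suffix/prefix matches (a real merging step would need error-tolerant 
alignment), and it does not use paired-end info at all.

```python
def end_overlap(left, right, min_len=20):
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no such overlap of at least `min_len` exists.

    Hypothetical helper: exact matching only, simple quadratic scan.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def join_contigs(left, right, min_len=20):
    """Join two contigs if the end of `left` overlaps the start of `right`.

    Returns the merged sequence, or None if the ends do not overlap.
    """
    k = end_overlap(left, right, min_len)
    if k == 0:
        return None
    # Keep `left` whole and append only the non-overlapping part of `right`.
    return left + right[k:]
```

In practice one would only accept such a join when the paired-end reads 
spanning the junction agree with it, which this sketch leaves out.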

As discussed, fake reads of up to 20kb can already be fed into MIRA, but the 
megahub issue makes me a bit unsure whether the assembly algorithm is really 
designed for this, although it appears to work pretty well (I have not had 
time to investigate it much yet). However, longer fake reads (for example 
complete manually checked contigs, or manually combined PCR products) have to 
be cut into overlapping 20kb pieces, which kind of goes against the whole idea 
of producing long correct segments.
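
For reference, the cutting step amounts to something like the Python sketch 
below. The 20kb piece length is the limit mentioned above; the 2kb overlap and 
the function name are just assumptions for illustration, not values we have 
tuned or anything MIRA prescribes.

```python
def split_with_overlap(seq, piece_len=20000, overlap=2000):
    """Cut `seq` into pieces of at most `piece_len` bases, where consecutive
    pieces share `overlap` bases.

    Returns a list of (start_offset, piece) tuples. The overlap value is an
    illustrative assumption.
    """
    if not 0 <= overlap < piece_len:
        raise ValueError("overlap must be >= 0 and smaller than piece_len")
    if not seq:
        return []
    step = piece_len - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + piece_len, len(seq))
        pieces.append((start, seq[start:end]))
        if end == len(seq):
            break  # the last piece reaches the end of the sequence
        start += step
    return pieces
```

Each piece would then be written out as a separate fake read, which is exactly 
the fragmentation of a verified contig that I would like to avoid.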

If anything can be done in this direction it would be great! 
(I think we can provide one or two datasets if needed for testing)
Björn Nystedt

PS
On the technical side, we normally assign fairly low quality values to our PCR 
"reads", since the PCR itself has a higher error rate than the sequencing. 
Fake reads built from shotgun-sequenced PCR products therefore typically 
contain some base errors that carry high quality values.
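
As a rough illustration of how such a fake read could be written, the Python 
sketch below produces a FASTA record plus a matching quality record with the 
same value at every base. The quality of 10, the line width and the function 
name are assumptions for the example; "fairly low" is not a fixed number on 
our side.

```python
def fake_read_records(name, seq, qual=10, width=60):
    """Return (fasta_text, qual_text) for one fake read that carries the
    same quality value at every base.

    `qual=10` is an illustrative assumption for a "fairly low" quality.
    """
    fasta_lines = [">" + name]
    for i in range(0, len(seq), width):
        fasta_lines.append(seq[i:i + width])

    # One quality value per base, space separated, wrapped like the sequence.
    values = [str(qual)] * len(seq)
    qual_lines = [">" + name]
    for i in range(0, len(values), width):
        qual_lines.append(" ".join(values[i:i + width]))

    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"
```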
 
> Regards,
>   Bastien
> 
> 
> -- 
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html


-- 
====================================
Björn Nystedt (Sällström)
PhD Student
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36  Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================
