> On Thursday 25 June 2009 Björn Nystedt wrote:
> > [...]
> > I think that a proper handling of really long sequences would be great in
> > the gap-closure phase of a genome project. (This is also the only area
> > where phrap is still beating MIRA...)
>
> Care to elaborate a bit more on that one? From the feedback I got from several
> people, I was under the assumption that MIRA was giving phrap tough times on
> all aspects of assembly quality :-)
>
> So if there's room for improvement, I'll be happy to have a look at the
> problem.

Basically this is about integrating data: if I have long pieces of the genome that I know are correct, how do I best combine that information with the full set of shotgun and paired-end reads to make the most accurate and complete assembly?

Phrap can assemble "reads" of any length in a sensible way. We often have long segments of a genome that for various reasons have been proven correct; these might be PCR products assembled separately, or simply contigs where the assembly has been checked in different ways and where we feel certain that the contig is OK. In phrap, we can feed these long "reads" into the assembly together with the complete set of real reads. These long reads will then guide the assembly of the short reads, while still allowing everything to be joined into (in the best case) a single continuous contig representing the complete chromosome. This way we ensure that checked parts of the genome are not messed up, while still using all the raw data.

So, yes, in certain cases phrap still beats MIRA, even though one has to look pretty deep to find it ;)

Now, the really bad thing about phrap is of course that it does not properly use paired-end info, and it is pretty bad at handling repeats, so if this process could be performed in MIRA instead, that would be far superior!
Maybe it would be sufficient to have a combined mapping + de novo assembly as is, with an option of a final merging step where contigs can be joined based on overlapping ends and paired-end info?

As discussed, fake reads of up to 20 kb can already be fed into MIRA now, but there was the issue with the megahubs, which makes me a bit unsure whether the assembly algorithm is really designed for this, although it appears to work pretty well (I have not had time to investigate it much yet). However, longer fake reads (for example complete manually checked contigs, or manually combined PCR products) have to be cut into overlapping 20 kb pieces, which kind of goes against the whole idea of producing long correct segments.

If anything can be done in this direction it would be great! (I think we can provide one or two datasets for testing if needed.)

Björn Nystedt

PS On the technical side, we normally feed our PCR "reads" with fairly low quality, since the PCR itself has a higher error rate than the sequencing, and thus fake reads from shotgun PCR products typically contain some erroneous bases with high quality values.

> Regards,
>   Bastien
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html

--
====================================
Björn Nystedt (Sällström)
PhD Student
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36 Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================
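
For illustration, the cutting of a long checked sequence into overlapping pieces of at most 20 kb, as described in the mail above, could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the function names `split_into_overlapping_pieces` and `flat_qualities`, the 2 kb overlap and the flat quality value of 10 are hypothetical choices made for this example, not MIRA or phrap parameters, and writing the pieces out in an assembler-specific input format is left out.

```python
def split_into_overlapping_pieces(sequence, max_len=20000, overlap=2000):
    """Cut a long sequence into pieces of at most max_len bases.

    Consecutive pieces share `overlap` bases so that an assembler has a
    chance to join them again. Returns a list of (start_offset, piece)
    tuples. The default overlap is an assumption made for this example.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")

    # Short sequences need no cutting at all.
    if len(sequence) <= max_len:
        return [(0, sequence)]

    step = max_len - overlap  # distance between consecutive piece starts
    pieces = []
    start = 0
    while True:
        end = start + max_len
        pieces.append((start, sequence[start:end]))
        if end >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return pieces


def flat_qualities(piece, quality=10):
    """Give every base of a fake read the same (low) quality value.

    The value 10 is a hypothetical placeholder for "fairly low quality".
    """
    return [quality] * len(piece)


if __name__ == "__main__":
    contig = "ACGT" * 12500  # hypothetical 50 kb manually checked contig
    for start, piece in split_into_overlapping_pieces(contig):
        print(start, len(piece), flat_qualities(piece)[0])
```

The pieces keep their start offsets so that they can be named consistently and traced back to the original segment after assembly.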