[mira_talk] Re: Megahub info
- From: Björn Nystedt <bjorn.nystedt@xxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Fri, 26 Jun 2009 09:49:49 +0200
> On Thursday 25 June 2009 Björn Nystedt wrote:
> > [...]
> > I think that a proper handling of really long sequences would be great in
> > the gap-closure phase of a genome project. (This is also the only area
> > where phrap is still beating MIRA...)
>
> Care to elaborate a bit more on that one? From the feedback I got from
> several people, I was under the assumption that MIRA was giving phrap tough
> times on all aspects of assembly quality :-)
>
> So if there's room for improvement, I'll be happy to have a look at the
> problem.
Basically this is about integrating data: if I have long pieces of the genome
that I know are correct, how do I best combine that information with the full
set of shotgun and paired-end reads to make the most accurate and complete
assembly?
Phrap can assemble "reads" of any length in a sensible way. We often have long
segments of a genome that for various reasons have been proven correct; that
might be PCR products assembled separately, or simply contigs where the
assembly has been checked in different ways and where we feel certain that the
contig is ok. In phrap, we can feed these long "reads" into the assembly
together with the complete set of real reads. These long reads will then guide
the assembly of the short reads, while still allowing for joining everything
into (in the best case) a single continuous contig representing the complete
chromosome. This way we ensure that checked parts of the genome are not messed
up, while still using all the raw data. So, yes, in certain cases phrap still
beats MIRA, even though one has to look pretty deep to find it ;)
Now, the really bad thing about phrap is of course that it does not properly
use paired-end info, and it is pretty bad at handling repeats, so if this
process could be performed in MIRA instead, that would be far superior!
Maybe it would be sufficient to have a combined mapping+denovo assembly as is,
with an option of a final merging step where contigs can be joined based on
overlapping ends and paired-end info?
As discussed, fake reads of up to 20kb can already be fed into MIRA, but
there was the issue with the megahubs, which makes me a bit unsure whether the
assembly algorithm is really designed for this, although it appears to work
pretty well (I have not had time to investigate it much yet). However, longer
fake reads (for example complete manually checked contigs, or manually
combined PCR products) have to be cut into overlapping 20kb pieces, which
rather defeats the whole idea of producing long correct segments.
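To make the workaround concrete, here is a rough Python sketch of the cutting
step. It is only an illustration, not a MIRA or phrap tool: the function name
and the 2kb overlap are my own assumptions, and only the 20kb piece length
comes from the discussion above.

```python
def split_overlapping(seq, piece_len=20000, overlap=2000):
    """Cut seq into pieces of at most piece_len bases.

    Consecutive pieces share at least `overlap` bases. Returns a list of
    (start_offset, piece) tuples. The overlap default is an arbitrary
    assumption, not a value recommended by MIRA.
    """
    if piece_len <= 0 or not 0 <= overlap < piece_len:
        raise ValueError("need piece_len > 0 and 0 <= overlap < piece_len")
    if len(seq) <= piece_len:
        return [(0, seq)]  # short enough, nothing to cut

    step = piece_len - overlap
    pieces = []
    start = 0
    while True:
        end = start + piece_len
        if end >= len(seq):
            # Anchor the last piece at the end of the sequence so that it
            # keeps the full length (it then overlaps more than `overlap`).
            start = len(seq) - piece_len
            pieces.append((start, seq[start:]))
            break
        pieces.append((start, seq[start:end]))
        start += step
    return pieces


if __name__ == "__main__":
    contig = "ACGT" * 12500  # a made-up 50kb "checked contig"
    for offset, piece in split_overlapping(contig):
        print(offset, len(piece))
```

The offsets are kept so that the pieces can be traced back to their position
in the original segment after assembly.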
If anything can be done in this direction it would be great!
(I think we can provide one or two datasets if needed for testing)
Björn Nystedt
PS
On the technical side, we normally assign fairly low quality values to our PCR
"reads", since the PCR itself has a higher error rate than the sequencing, and
thus fake reads from shotgun-sequenced PCR products typically contain some base
errors that carry high quality values.
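As an illustration of what I mean, here is a small Python sketch that writes a
fake read as a FASTA record plus a matching quality record with one uniform,
low value per base. The function name and the value 10 are assumptions made
for the example; the value we actually use varies, and this is not a MIRA or
phrap utility.

```python
def fake_read_records(name, seq, qual=10, width=60):
    """Return (fasta_text, qual_text) for one fake read.

    Every base gets the same quality value `qual`. The default of 10 is an
    arbitrary low value chosen for this example.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    fasta_lines = [f">{name}"]
    qual_lines = [f">{name}"]
    for i in range(0, len(seq), width):
        chunk = seq[i:i + width]
        fasta_lines.append(chunk)
        # One quality value per base, space separated, same line layout
        # as the sequence so the two files are easy to compare by eye.
        qual_lines.append(" ".join([str(qual)] * len(chunk)))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


if __name__ == "__main__":
    fasta_text, qual_text = fake_read_records("pcr_product_1", "ACGTACGTAC", width=5)
    print(fasta_text, end="")
    print(qual_text, end="")
```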
> Regards,
> Bastien
>
>
> --
> You have received this mail because you are subscribed to the mira_talk
> mailing list. For information on how to subscribe or unsubscribe, please
> visit http://www.chevreux.org/mira_mailinglists.html
--
====================================
Björn Nystedt (Sällström)
PhD Student
Molecular Evolution
EBC, Uppsala University
Norbyv. 18C, 752 36 Uppsala
Sweden
phone: +46 (0)18-471 45 88
email: Bjorn.Nystedt@xxxxxxxxx
====================================