[mira_talk] Re: Reference vs. De novo assembly.

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 3 Dec 2009 20:48:00 +0100

On Wednesday 02 December 2009 Andrzej N wrote:
> I need some help... I did *de novo* assembly of several plant mitochondrial
> genome sequences (454, Titanium, single-end reads). About 200000 reads were
> used for assembly, which should give me about 100x coverage. Yes, I know,
> overkill, but... MIRA created about 160 contigs with a quality score around
> 78 (what is it exactly?). The total number of contigs is about 5,000, but
> that includes smaller ones that don't help much, i.e., "junk". These contigs
> don't go together to create one big consensus contig.

Hello Andrzej,

100x is not only overkill, it also is a bit dangerous for many assemblers 
(including MIRA), as there are some unwanted side-effects of ultra-high 
coverage. One of them: as sequencing errors are not totally random, they tend 
to accumulate at certain points. If you now have very high coverage, these 
sequencing errors will be recognised as valid variants and hence split off 
into other contigs.
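
As a rough sanity check on the coverage figure, here is a minimal sketch in
Python. Only the read count (200000) comes from this thread; the genome size
of about 400 kb is taken from the two ~200000bp contigs of the reference
assembly, and the mean usable read length of 200 bp is purely an assumed
placeholder, so plug in the numbers from your own data.

```python
def estimated_coverage(num_reads, mean_read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * mean_read_len / genome_size


# 200000 reads: from the thread.
# 400000 bp genome: two reference-assembly contigs of ~200000 bp each.
# 200 bp mean read length: assumed placeholder, not a measured value.
coverage = estimated_coverage(200_000, 200, 400_000)
print(f"approx. {coverage:.0f}x average coverage")
```

Note that this is only the average; with uneven coverage, individual regions
can sit far above or below it.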

Plus you've got plant mitochondrial genomes, and these I've come to fear a 
bit. 454 data from those I've seen so far suggest pretty uneven coverage, 
which might lead MIRA to have problems if the uniform read distribution is 
used, mistakenly recognising some parts as repeats when they're not.

> I also did a reference assembly against an already finished and assembled
> sequence. MIRA covers all of this reference sequence with just one small
> break (so I get two huge contigs of about 200000bp each).
> 
> Now is the interesting part. When I take these contigs from the *de novo*
> assembly and blast them against the ones generated by the reference
> assembly, they cover the entire sequence very nicely... So, my question is
> why MIRA is not creating larger contigs during *de novo* assembly. These
> contigs are next to each other and show a certain amount of sequence
> overlap (I set up BLAST on my computer to blast them against each other) but
> MIRA is not seeing this and combining them.

Oh, MIRA is probably seeing them, but refuses to join because the ends contain 
too many sequencing errors (mistakenly recognised as valid variants) or because 
the ends lie in regions with exceptionally high coverage (mistakenly 
recognised as repeats).

> What parameters in MIRA need to be changed to help build larger contigs?
> My adjustments to date have not helped much beyond your default
> settings for "fast assembly".

Umm ... the 'draft' options are really just that: for drafts. And if you've 
got 60kb chunks it's not too bad already. But use at least 'normal' or 
'accurate' mode.

Now, other things you probably want to do:
1) decrease the sensitivity of repeat marker base recognition. I'd suggest adding
     454_SETTINGS -CO:mrpg=12
   and seeing what happens then
2) possibly assemble without uniform read distribution
     -AS:urd=no
   and loosen the repeat detection thresholds
     454_SETTINGS -AS:ardct=3:mrl=800
   or switch off repeat detection altogether
     -AS:ard=no
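
To show how these switches might be strung together, here is a small Python
sketch that only builds and prints a command line (it does not run MIRA). The
parameter strings are the ones listed above; the project name "plantmito" and
the -project / -job switches are my assumptions about a typical MIRA call, so
check them against the manual of the version you have installed.

```python
import shlex


def build_mira_args(project, quality="accurate", relax_repeats=False):
    """Assemble a MIRA argument list from the suggestions above.

    The -project and -job switches are assumptions; the remaining
    parameters are copied from the numbered suggestions.
    """
    args = ["mira", f"-project={project}", f"-job=denovo,genome,{quality},454"]
    if relax_repeats:
        # Suggestion 2: no uniform read distribution.
        args.append("-AS:urd=no")
    # Suggestion 1: less sensitive repeat marker base recognition.
    args += ["454_SETTINGS", "-CO:mrpg=12"]
    if relax_repeats:
        # Suggestion 2: looser repeat detection thresholds.
        args.append("-AS:ardct=3:mrl=800")
    return args


# "plantmito" is a hypothetical project name.
print(shlex.join(build_mira_args("plantmito")))
print(shlex.join(build_mira_args("plantmito", relax_repeats=True)))
```

Trying suggestion 1 on its own first and only then adding suggestion 2 makes
it easier to see which change actually affected the contig count.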

If everything else fails: join the large contigs by hand in 'gap4', just takes 
a couple of minutes for a plant mitochondrion :-)

Hope that helps,
  Bastien
