[mira_talk] Re: MIRA edits reads during assembly?

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Sun, 25 Oct 2009 11:15:30 +0100

On Thursday 22 October 2009 Lionel Guy wrote:
> Well, to produce the phdball I guess I have to start from the caf
> file...

Hi Guy,

let me start with an old citation: "Ceterum censeo, ACE esse delendam."

ACE is the worst of all possibilities to represent an assembly as some vital 
information is missing: alignment of reads to the original sequence. Splitting 
away the qualities into other files (phdballs) doesn't make things any easier.

Indeed, the only format that MIRA currently supports and which contains 
everything needed is CAF. In a short while, MAF may also be used but I'm not 
sure whether I want to keep the format private to MIRA and wait for larger 
sequencing centers to come up with something workable.

> I did a bit a digging into the files, and indeed,
> some reads are edited during the assembly (I checked both ace and caf
> files). Same thing for the quality values, they change between the input
> files and the caf files...

I hope that by "change" you mean that some are deleted. Other changes shouldn't 
happen.

> I also checked the tags, and it seems that the R454 tags correspond to
> such deletions (marked between underscores in the sequences below.

They are only hints. The only viable way to detect correct alignment of an 
assembled read to the original sequence is to use the Align_to_SCF info lines 
from the CAF.
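
As a rough illustration (a sketch, not MIRA code), the Align_to_SCF lines of one 
read can be scanned for jumps in the trace coordinates. It assumes each line has 
the form "Align_to_SCF read_start read_end trace_start trace_end" and that 
segments run in increasing order; the function names are made up for this 
example.

```python
def parse_align_to_scf(caf_lines):
    """Collect (read_start, read_end, trace_start, trace_end) tuples
    from the Align_to_SCF lines of a single CAF read entry."""
    segments = []
    for line in caf_lines:
        fields = line.split()
        if len(fields) == 5 and fields[0] == "Align_to_SCF":
            segments.append(tuple(int(x) for x in fields[1:]))
    # Sort by position in the assembled read (assumed forward orientation).
    return sorted(segments)


def skipped_trace_positions(segments):
    """Return trace positions lying between consecutive segments, i.e.
    original bases that no longer have a counterpart in the assembled read."""
    skipped = []
    for (_, _, _, prev_end), (_, _, next_start, _) in zip(segments, segments[1:]):
        skipped.extend(range(prev_end + 1, next_start))
    return skipped


example = [
    "Align_to_SCF 1 10 1 10",
    "Align_to_SCF 11 20 12 21",  # trace position 11 has no read position
]
print(skipped_trace_positions(parse_align_to_scf(example)))
```

A jump in the trace coordinates without a matching jump in the read coordinates 
points to a base that was edited away; the reverse pattern would point to an 
inserted base.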

> What are the criteria that MIRA uses to decide to delete a nucleotide in
> a read?

Depending on the read type: for Sanger reads, Thomas wrote a pretty nifty 
automatic editor (EdIt) back in 1999, with bells and whistles like trace 
analysis using neural networks; insert/delete/basechange edits etc. That's 
the integrated "EdIt" editor (SANGER_SETTINGS -ED:ace=yes).

For 454 and Solexa reads, I wrote a much simpler editor which looks for common 
overcall problems which it can safely delete. There's a whole set of rules 
behind it, but basically this editor wants a certain base to gap ratio, 
forward/reverse reads and a few things more before it decides to edit away a 
base in a read.
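
To make the idea concrete, here is a purely hypothetical sketch of such a column 
check. It is not MIRA's actual rule set: the names, the 0.2 threshold and the 
exact conditions are assumptions chosen only to show how a base-to-gap ratio 
and a forward/reverse requirement might be combined.

```python
from dataclasses import dataclass


@dataclass
class ColumnEntry:
    base: str      # 'A', 'C', 'G', 'T', or '*' for a gap
    forward: bool  # True if the read is in forward orientation


def is_safe_overcall_column(column, max_base_fraction=0.2):
    """Hypothetical rule: the few bases in this alignment column may be
    edited away if gaps dominate and gaps occur in both read directions."""
    gaps = [e for e in column if e.base == "*"]
    bases = [e for e in column if e.base != "*"]
    if not gaps or not bases:
        return False  # nothing to decide in this column
    if len(bases) / len(column) > max_base_fraction:
        return False  # too many reads carry a real base here
    has_forward_gap = any(e.forward for e in gaps)
    has_reverse_gap = any(not e.forward for e in gaps)
    return has_forward_gap and has_reverse_gap


column = (
    [ColumnEntry("*", True)] * 4
    + [ColumnEntry("*", False)] * 4
    + [ColumnEntry("A", True)]
)
print(is_safe_overcall_column(column))
```

The real editor applies more conditions than this, as noted above; the sketch 
only covers the two criteria named explicitly.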

> Wouldn't it be more appropriate to make gaps in the other sequences?

This is actually done. But as MIRA works in multiple passes, deleting bases 
improves the overlap graph in subsequent passes, which leads to better overlap 
alignments and - as a side effect - a slight speed increase.

Regards,
  Bastien


-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
