[mira_talk] Re: Reference assembly issues...

  • From: Bastien Chevreux <bach@xxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 19 Mar 2012 20:37:50 +0100

On Mar 19, 2012, at 14:12 , Shankar Manoharan wrote:
>      I made a reference assembly of my 454-bacterial data with a closely 
> related strain as the backbone.


There is a slight semantic difference between a reference-guided assembly and 
a mapping assembly. MIRA does de novo and mapping assemblies, but not 
reference-guided assemblies.

> When visualizing the reference assembly with Tablet, I see that there are 
> regions where there aren't really any reads spanning the region except the 
> template. How is this acceptable ?

Totally so, if the backbone (reference) genome contains sequence that is not 
present in the genome you sequenced. Or if parts of the reference genome are 
vastly different from the corresponding parts in your genome. Depending on the 
parameters you used, "vastly" can be anything down to 1 SNP, though standard 
mapping parameters are far more lenient than that.
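To list such read-free stretches instead of hunting for them by eye in Tablet, 
something like the following may help. It is only a minimal sketch: it assumes 
you have already extracted the read placements as half-open (start, end) 
intervals on the reference. The function name and the toy coordinates are made 
up for illustration; none of this is part of MIRA.

```python
def uncovered_regions(ref_length, read_intervals):
    """Return half-open (start, end) stretches of the reference with no reads.

    read_intervals: iterable of (start, end) placements, 0-based, end exclusive.
    This input format is an assumption made for the sketch.
    """
    gaps = []
    covered_until = 0  # everything left of this position is already accounted for
    for start, end in sorted(read_intervals):
        if start > covered_until:
            # No read reaches from covered_until to start: that is a gap.
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    if covered_until < ref_length:
        # Tail of the reference after the last read.
        gaps.append((covered_until, ref_length))
    return gaps


# Toy placements on a hypothetical 1000 bp reference.
reads = [(0, 120), (100, 250), (400, 520), (500, 640)]
print(uncovered_regions(1000, reads))  # [(250, 400), (640, 1000)]
```

Each reported gap is a place where the backbone has sequence that no read 
supports, i.e. a candidate for sequence that is absent from, or very different 
in, the genome you sequenced.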

> It appears as though MIRA replaces the assembly with the template sequence 
> which may or may not be present in the sequenced genome.

Yes and no. When MIRA writes out the result, the CAF, MAF and ACE files contain 
the complete alignment and give you the full picture of what is present (and 
what is not). The FASTA file does indeed contain a kind of mixture of both 
sequences. Some people need that, others do not. If you gave MIRA information 
about the strain of the reference and the strain of the mapped reads (you did 
do that, right?), you can also see what is there and what is not in a bit more 
detail in FASTA format by running:

  convert_project -f MAF -t FASTA miraresult_out.maf somename

which will create several FASTA files where each strain gets its separate file.
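For a quick first look at such FASTA files, a small script can report, for each 
sequence, its length and how many positions are something other than A, C, G or 
T. This is a rough sketch only: the helper names are hypothetical, the input is 
a toy record rather than real MIRA output, and it makes no assumption about 
which symbols MIRA writes at positions that have no read coverage.

```python
import io


def read_fasta(lines):
    """Parse FASTA text from an iterable of lines into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def summarize(records):
    """Return {name: (length, count of characters other than A/C/G/T)}."""
    summary = {}
    for name, seq in records.items():
        other = sum(1 for base in seq.upper() if base not in "ACGT")
        summary[name] = (len(seq), other)
    return summary


# Toy input; with real data, pass an open file handle instead of StringIO.
example = io.StringIO(">contig1 toy record\nACGTNNAC\nGT\n>contig2\nacgt\n")
print(summarize(read_fasta(example)))  # {'contig1': (10, 2), 'contig2': (4, 0)}
```

Running this on the file for the reference strain and on the file for the 
mapped strain, and comparing the counts, gives a coarse idea of how much of the 
backbone is actually backed by your reads.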

> So how far can this assembly be trusted ?

As far as you keep in mind that this is not comparable to a de-novo assembly. 
It really is a mapping assembly. That is: you basically tell the assembler to 
treat all reads as if they came from the same organism as the reference. 
Whether or not this is the truth, that's how the reads are treated.

> Secondly, wasn't the reference assembly feature of MIRA developed to identify 
> SNPs and other genomic changes in pre-sequenced genomes?

It was developed to find differences between a reference sequence and reads 
mapped to it.
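As a toy illustration of that idea (not MIRA's actual algorithm), the sketch 
below piles up ungapped reads at given offsets on a reference, takes the most 
frequent base per column, and reports the columns where that base disagrees 
with the reference. The function name, the min_depth threshold and the example 
sequences are all assumptions made for the illustration; a real mapper also 
handles gaps, base qualities and ambiguous columns.

```python
from collections import Counter


def call_differences(reference, placed_reads, min_depth=2):
    """Report (position, ref_base, read_base, depth) where reads disagree.

    placed_reads: iterable of (offset, sequence), ungapped, 0-based offsets.
    min_depth: hypothetical threshold; columns with fewer reads are skipped.
    """
    columns = [Counter() for _ in reference]
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    differences = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence to say anything about this column
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            differences.append((pos, reference[pos], base, depth))
    return differences


reference = "ACGTACGTAC"
placed_reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (3, "TTCGTAC")]
print(call_differences(reference, placed_reads))  # [(4, 'A', 'T', 3)]
```

Note that the reads are compared against the reference exactly where they were 
placed, which is the point made above: the result is only as meaningful as the 
assumption that the reads belong to that reference.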

> So, is it technically right to assemble based on closely related organisms ?

As long as the organisms are closely related, yes. As you have 454 reads, even 
smaller insertions or deletions can be correctly resolved, though one might 
need to do a bit of manual correction here and there (but MIRA usually tells 
you where to look).

As soon as your organism starts to differ quite a bit, e.g. through genome 
reorganisations or stretches with larger differences on the nucleotide level, 
mapping assemblies will only give you an idea of where to turn your attention. 
Which you should then do, really.

> Third, if I were to accept the reference assembly that MIRA has output, what 
> kind of validation tests are essential before annotation?

The way you described it, there are some bigger differences between the 
reference and what you sequenced. You should try to resolve these. Always keep 
in mind what you want to do with that sequence: depending on what questions you 
want to answer, you may need more or less work.

B.
