[mira_talk] mira assembly visualization

Warning: long mail  ;)

Hi:
I'm trying to choose the best mira parameters  for my sequences. I'm trying to 
assemble ESTs and I want to get SNPs from the assemblies.
Last week I talked with A. Papanicolau and he gave me an idea and I've 
developed it a little more. I would like to have an easy way to compare two 
mira runs done with the same datasets and different parameters.
Here I'll explain how I think that this comparison could be done. I would like 
to get feedback from you, ways to improve it, criticisms, etc.

The process has three steps:
1- run mira
2- compare the unigenes against themselves with an all-against-all blastn 
(the unigene set is both the query and the database)
3- do a graphical representation of the blast result.
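
As a rough illustration of how the output of step 2 could be fed into step 3, 
here is a minimal sketch that reads blast hits and throws away the self-hits. 
It assumes the blast was saved in the 12-column tabular format (-outfmt 6 in 
BLAST+); the mail does not say which format was actually used, and the names 
Hit and parse_tabular_hits are made up for this example.

```python
from collections import namedtuple

# Fields kept from each hit. The column positions used below follow the
# default BLAST+ tabular layout (an assumption, see the text above).
Hit = namedtuple("Hit", "query subject identity length qstart qend sstart send")


def parse_tabular_hits(lines):
    """Yield Hit tuples from tabular blast lines, skipping self-hits."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            # Every unigene matches itself at 100%; that tells us nothing.
            continue
        yield Hit(query, subject, float(fields[2]), int(fields[3]),
                  int(fields[6]), int(fields[7]),
                  int(fields[8]), int(fields[9]))


# Two made-up lines: a self-hit and a hit between two different unigenes.
example = [
    "u1\tu1\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t900",
    "u1\tu2\t98.50\t200\t3\t0\t301\t500\t1\t200\t1e-90\t350",
]
hits = list(parse_tabular_hits(example))
```

With a real result the list of lines would simply be the open blast output 
file.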

The meat of the process is the third step. I'm doing two different 
representations.
The first one is the distribution of the number of blast hits across 
different percentages of similarity. I want to know how many unigenes are 
80%, 85%, 90%, 95% and 100% similar to another unigene. Plotting that gives 
the first graph attached.
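
The counting behind this first plot can be sketched as follows. This is only 
an illustration, not the code in the repository: the function name, the 5% 
bin width and the 80% lower limit are assumptions taken from the percentages 
listed above, and the identity values are invented.

```python
from collections import Counter


def similarity_histogram(identities, bin_width=5, lowest=80):
    """Count hits per percent-identity bin, labelled by the bin's lower edge."""
    counts = Counter()
    for identity in identities:
        if identity < lowest:
            continue
        # e.g. 97.5 goes to the 95 bin; exactly 100 gets its own 100 bin.
        bin_start = int(identity // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Invented percent identities, one per blast hit.
identities = [100.0, 99.2, 97.5, 91.0, 86.4, 100.0, 79.9]
histogram = similarity_histogram(identities)
```

The keys and values of the resulting dictionary could then be handed to a 
bar plot, for instance matplotlib's bar function.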
If you take a look at the graph it's easy to see that there are a lot of 
unigenes with a similarity close to 100%. Would it be better to merge 
those unigenes? That depends on the unigenes and on your objectives. 
Fortunately mira can be adapted to different situations; unfortunately I 
don't know how to tell mira that I want those unigenes merged. I've tried 
the asir parameter with no luck (but maybe this should go in another 
mail).

Ok. Now we know that there are a lot of unigenes with 100% similarity. But 
two unigenes that are 100% similar in one part of their sequences may have 
another part that is incompatible and prevents them from being merged, e.g.:

unigene1      --------------------------------->
                      ||||||||||||||||
unigene2      ------------------------------>
                      <- similar-><no similar>
                      compatible    incompatible

So these unigenes are 100% identical in their first half but different in 
their second half. We can check this by calculating how many compatible and 
incompatible base pairs each hit in the blast has. With that we can draw a 
distribution similar to the previous one, but with three axes: %similarity, 
%incompatibility and number of hits.
I haven't drawn a 3-D plot because matplotlib can't do it, so I've represented 
the number of hits as different colors. This is the second plot.
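
One possible way to count the incompatible base pairs of a hit is sketched 
below. The exact calculation is not spelled out in this mail, so the 
definition used here is an assumption: on each side of the aligned region, 
the bases where both unigenes still have sequence (so they should overlap) 
but blast did not align them are counted as incompatible. The function names 
are made up, and coordinates are 1-based as in blast output.

```python
def incompatible_bases(qlen, slen, qstart, qend, sstart, send):
    """Unaligned bases on both sides of a hit where the two sequences overlap."""
    if sstart <= send:
        # Plus-strand hit: the subject runs in the same direction as the query.
        s_left, s_right = sstart - 1, slen - send
    else:
        # Minus-strand hit: the subject is reverse-complemented, so its
        # flanks swap sides relative to the query.
        s_left, s_right = slen - sstart, send - 1
    q_left, q_right = qstart - 1, qlen - qend
    # Only the part of each flank that both sequences cover can conflict.
    return min(q_left, s_left) + min(q_right, s_right)


def percent_incompatible(qlen, slen, qstart, qend, sstart, send):
    """Incompatible bases as a percentage of the whole overlapping region."""
    compatible = qend - qstart + 1
    incompatible = incompatible_bases(qlen, slen, qstart, qend, sstart, send)
    return 100.0 * incompatible / (compatible + incompatible)


# Like the drawing above: both unigenes continue past the similar region.
conflict = percent_incompatible(1000, 600, 501, 800, 1, 300)
# A clean end-to-start overlap: nothing is left unaligned where both overlap.
clean = percent_incompatible(1000, 800, 501, 1000, 1, 500)
```

Under this definition a hit with 0% incompatibility is a candidate for 
merging, while the first example would stay as two separate unigenes.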
You can see in this plot that most of the unigenes that have 100% identical 
tracts have no incompatible regions. I would like to merge those unigenes but 
I don't know how to do it. Of course I don't want to merge the unigenes with 
100% identical tracts but with incompatible regions. Those could be 
alternative splicing variants, chimeric clones, etc.

What do you think about those plots? If you find them interesting I could 
prepare a script to create them. Right now the code is here:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/
in scrips/length_score_distribution.py

Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

Attachment: 1_hits_vs_similarity.png
Description: PNG image

Attachment: 2_hits_vs_similarity_vs_incompatiblity.png
Description: PNG image
