[mira_talk] mira assembly visualization
- From: Jose Blanca <jblanca@xxxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Mon, 29 Jun 2009 16:54:28 +0200
Warning: long mail ;)
Hi:
I'm trying to choose the best mira parameters for my sequences. I want to
assemble ESTs and get SNPs from the assemblies.
Last week I talked with A. Papanicolau and he gave me an idea and I've
developed it a little more. I would like to have an easy way to compare two
mira runs done with the same datasets and different parameters.
Here I'll explain how I think that this comparison could be done. I would like
to get feedback from you, ways to improve it, criticisms, etc.
The process has three steps:
1- run mira
2- compare the unigenes against themselves with an all-against-all blastn
(unigenes as both query and database)
3- do a graphical representation of the blast result.
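A minimal sketch of how the blast result from step 2 could be read, assuming
it was saved in the 12-column tabular format (for example `-m 8` with legacy
blastall or `-outfmt 6` with BLAST+). The names `Hit` and `read_hits` are
made up for this illustration and are not taken from the script linked at the
end of this mail.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """One line of tabular BLAST output (only the columns used here)."""
    query: str
    subject: str
    identity: float  # percent identity reported by BLAST
    length: int      # alignment length
    qstart: int
    qend: int
    sstart: int
    send: int


def read_hits(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield the non-self hits found in 12-column tabular BLAST output."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        if query == subject:
            continue  # every unigene trivially matches itself
        yield Hit(query, subject, float(row[2]), int(row[3]),
                  int(row[6]), int(row[7]), int(row[8]), int(row[9]))
```

Any open file can be passed as `lines`, e.g. `read_hits(open("blast.tab"))`,
where the file name is only a placeholder.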
The meat of the process is the third step. I'm drawing two different
representations.
The first one is the distribution of the number of blast hits at different
percentages of similarity. I want to know how many unigene pairs are
80%, 85%, 90%, 95% and 100% similar. Plotting that gives the first
graph attached.
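The binning behind such a graph could look roughly like this. It is only a
sketch: the 5% bin width, the 80% cutoff and the function names are
assumptions chosen to match the percentages listed above, and the plotting
helper needs matplotlib installed.

```python
from collections import Counter


def similarity_histogram(identities, bin_width=5.0, minimum=80.0):
    """Count hits per percent-identity bin; keys are the lower bin edges."""
    counts = Counter()
    for identity in identities:
        if identity < minimum:
            continue  # ignore hits below the cutoff
        # Hits at exactly 100% end up in their own bin with key 100.0.
        edge = minimum + bin_width * int((identity - minimum) // bin_width)
        counts[edge] += 1
    return dict(sorted(counts.items()))


def plot_histogram(counts, path):
    """Save a bar chart of the binned counts to a PNG file."""
    import matplotlib
    matplotlib.use("Agg")  # draw without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()), width=4.0,
           align="edge")
    ax.set_xlabel("% similarity")
    ax.set_ylabel("number of hits")
    fig.savefig(path)
```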
Looking at the graph, it's easy to see that there are a lot of unigenes
with a similarity close to 100%. Would it be better to merge those unigenes?
That depends on the unigenes and on your objectives.
Fortunately mira is adaptable to different situations; unfortunately I
don't know how to tell mira that I want those unigenes merged. I've tried
with the asir parameter with no luck (but maybe this should go in another
mail).
OK. Now we know that there are a lot of unigenes with 100% similarity. But
two unigenes with 100% similarity in one part of their sequences may have
another, incompatible part that prevents their merging, e.g.:
unigene1 --------------------------------->
         ||||||||||||||||
unigene2 ------------------------------>
         <-  similar  -><- not similar ->
           compatible     incompatible
So these unigenes are 100% identical in their first half but different in
their second half. We can look into that by calculating how many compatible
and incompatible base pairs each hit in the blast result has. With that we
can draw a distribution similar to the previous one, but with three
axes: %similarity, %incompatibility and number of hits.
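One possible way to count those base pairs, shown only as a sketch. The
definition is an assumption, not necessarily the one used in the script
linked at the end of this mail: the aligned region counts as compatible, and
on each side of it the bases where both unigenes still extend without
aligning count as incompatible. The sequence lengths are not in the tabular
blast output and would have to come from the unigene fasta file.

```python
def overlap_lengths(qlen, slen, qstart, qend, sstart, send):
    """Return (compatible, incompatible) base counts for one blast hit.

    Coordinates are 1-based and inclusive, as in tabular BLAST output.
    """
    compatible = qend - qstart + 1
    # Unaligned bases of the query on each side of the alignment.
    q_left, q_right = qstart - 1, qlen - qend
    if sstart <= send:
        # Subject on the plus strand.
        s_left, s_right = sstart - 1, slen - send
    else:
        # Subject on the minus strand: flip it into query orientation.
        s_left, s_right = slen - sstart, send - 1
    # Only the stretch where BOTH sequences overhang is a conflict; a
    # one-sided overhang is just a normal overlap between two contigs.
    incompatible = min(q_left, s_left) + min(q_right, s_right)
    return compatible, incompatible


def percent_incompatible(compatible, incompatible):
    """Incompatible bases as a percentage of the whole overlapping span."""
    return 100.0 * incompatible / (compatible + incompatible)
```

For the two unigenes drawn above, with assumed lengths of 600 and 550 bases
and the first 300 bases aligned, this gives 300 compatible and 250
incompatible bases.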
I haven't drawn a 3-D plot because matplotlib can't do it, so I've represented
the number of hits as different colors. This is the second plot.
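A sketch of how the hits could be counted per cell and drawn with color as
the third axis. The bin sizes, the function names and the choice of a
colored scatter plot are assumptions for illustration; the attached plot may
have been produced differently. The plotting helper needs matplotlib.

```python
from collections import Counter


def hit_density(points, sim_bin=5.0, incomp_bin=10.0):
    """Count hits per (%similarity, %incompatibility) cell.

    Each point is a (similarity, incompatibility) pair in percent; the keys
    of the result are the lower edges of the cell the point falls into.
    """
    counts = Counter()
    for similarity, incompatibility in points:
        cell = (sim_bin * int(similarity // sim_bin),
                incomp_bin * int(incompatibility // incomp_bin))
        counts[cell] += 1
    return counts


def plot_density(counts, path):
    """Save a plot in which the color of each cell encodes its hit count."""
    import matplotlib
    matplotlib.use("Agg")  # draw without needing a display
    import matplotlib.pyplot as plt

    xs = [cell[0] for cell in counts]
    ys = [cell[1] for cell in counts]
    values = [counts[cell] for cell in counts]
    fig, ax = plt.subplots()
    marks = ax.scatter(xs, ys, c=values, marker="s", s=200)
    fig.colorbar(marks, label="number of hits")
    ax.set_xlabel("% similarity")
    ax.set_ylabel("% incompatibility")
    fig.savefig(path)
```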
You can see in this plot that most of the unigenes that have 100% identical
stretches have no incompatible regions. I would like to merge those unigenes
but I don't know how to do it. Of course I don't want to merge the unigenes
that have 100% identical stretches but also incompatible regions. Those could
be alternative splicings, chimeric clones, etc.
What do you think about those plots? If you find them interesting I could
prepare a script to create them. Right now the code is here:
http://bioinf.comav.upv.es/svn/biolib/biolib/src/
in scrips/length_score_distribution.py
Best regards,
--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
Attachment:
1_hits_vs_similarity.png
Description: PNG image
Attachment:
2_hits_vs_similarity_vs_incompatiblity.png
Description: PNG image