[mira_talk] Re: mira assembly visualization

Hi:

On Monday 29 June 2009 18:02:42 Alexie Papanicolaou wrote:
> Hello Jose,
>
> Glad you got this discussion started. My suggestions/questions below
> yours.
> p.s. i'm having problems posting to the mira list with my exeter address
> so if anyone else is interested i'll re-post it from home.
>
> > 2- compare the unigenes against themselves doing a blastn of the unigenes
> > against the unigenes
>
> a) That will give you an indication of redundancy within the assembly.
> But MIRA is not a clusterer so I don't think the parameter optimization
> should focus on this...
You're right. I forgot that mira is not a clusterer, although sometimes I wish 
it could cluster just a little :)

> c) I would go for a global alignment... needle?
I don't understand. Do you mean doing an alignment to get a better estimate of 
the similarity and the incompatible regions? That would be nice. I would do 
it with a local alignment, but it would be much slower than just parsing the 
blast result.

> > 3- do a graphical representation of the blast result.
>
> I like the graphs, I was rather lost in the axis of the second one...
You're not to blame; my results are sometimes convoluted, but I think it is an 
informative graph.

> >  If you take a look at the graphic it's very easy to realize that
> > there is a lot of unigenes with a similarity close to 100%.
In fact I've fixed the graph. I was taking into account the query-subject 
pairs in which both sequences are the same. Quite a dumb thing to do. Now 
it's fixed and the results are much better. Here are the two new graphs 
obtained from the same blast result.
Now the peak around 100% similarity is smaller, and in the second graph you can 
see that it is mainly due to sequences with incompatible regions that shouldn't 
be aligned. So mira is not doing such a bad job after all. There are still 
sequences with a similarity of 99% and with no incompatible regions that are 
not merged by mira. I think that the most probable reason for the 1% 
difference is SNPs, and I would like to have them, but right now I can't. It 
would be nice to be able to tell mira to join contigs with a similarity above 
99% and with no incompatible regions.
Now it seems that the 100% similarity peak is mainly due to short 
repeats in the reads, and mira does a good job splitting them into different 
contigs.
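
For anyone who wants to reproduce the fix, here is a minimal sketch of the kind of filtering involved. It is not the script used for the attached graphs. It assumes the blast result is in the standard 12-column tabular format (query id, subject id, percent identity, ...), and the function name and the choice to keep only the first HSP seen per pair are assumptions made for illustration.

```python
from collections import Counter


def similarity_histogram(blast_lines, bin_width=1.0):
    """Count unigene pairs per percent-identity bin, skipping self hits.

    blast_lines: iterable of lines in BLAST 12-column tabular format.
    Returns a dict mapping the lower edge of each bin to a pair count.
    """
    first_hit = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query == subject:
            continue  # a unigene always matches itself at 100%
        # Treat A-B and B-A as the same pair so it is counted once.
        pair = tuple(sorted((query, subject)))
        # Assumption: keep only the first HSP reported for each pair.
        first_hit.setdefault(pair, pident)

    histogram = Counter()
    for pident in first_hit.values():
        bin_start = int(pident // bin_width) * bin_width
        histogram[bin_start] += 1
    return dict(histogram)
```

Without the `query == subject` check every unigene contributes a hit at 100%, which inflates the peak around 100% similarity.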

> The problem here is how to define a cut-off: different genes/gene
> families will have different cut-offs. Some researchers make cDNA
> libraries from multiple outbred populations (yea... i know what you're
> thinking...)
That's my case.

> But generally, you are absolutely right. MIRA cannot and shouldn't
> really make a decision in the case you give. Also this case:
>
> unigene1      --------------------------------->
>
> unigene2      --------------------------------->
>
> | =similar
>
> I'm working on the problem of defining the cut-off... It involves a
> supervised algorithm approach and calculates a variety of features (not
> just % similarity). We can talk about it on the phone if you want.

> I'm always worried about misassemblies. There is a way to see if a
> chimaeric clone is merging two contigs (coverage in chimaeric region
> should == 1) but I don't know if you can deal with any misassemblies
> created by the assembler itself (due to the quality of 454 data).
That seems quite complex and tricky to do, not a simple task.
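
The coverage part of the check can at least be sketched. The example below assumes you already have the per-base coverage of a contig as a list of integers (how to extract it from the assembly is not covered here), and it only reports internal stretches with coverage exactly 1 that are flanked by deeper coverage. The function name and the `min_flank` threshold are hypothetical, and this says nothing about misassemblies created by the assembler itself.

```python
def single_coverage_regions(coverage, min_flank=2):
    """Find internal stretches where coverage is exactly 1.

    coverage: list of per-base read depths for one contig.
    Returns a list of (start, end) 0-based, half-open intervals whose
    immediate neighbours on both sides have coverage >= min_flank.
    """
    regions = []
    n = len(coverage)
    i = 0
    while i < n:
        if coverage[i] != 1:
            i += 1
            continue
        # Extend the run of coverage == 1 as far as it goes.
        j = i
        while j < n and coverage[j] == 1:
            j += 1
        # Contig ends are usually thin anyway, so only report runs
        # that have a covered base on both sides.
        is_internal = i > 0 and j < n
        if (is_internal and coverage[i - 1] >= min_flank
                and coverage[j] >= min_flank):
            regions.append((i, j))
        i = j
    return regions
```

A contig with a reported region is only a candidate: a single read bridging two well covered blocks could also be a genuine low-coverage join.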

> What I like about your approach is this: can you get the script to give
> a list of the IDs of the different types of contig-pairs?
I've modified the script to do just that. You can now get the list of pairs, 
with the similarity and the incompatibility of each pair.
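
To make the idea concrete, here is a rough sketch of such a listing, not the actual script. The definition of incompatibility used here is an assumption: if two unigenes really overlap, the alignment should run until one of the two sequences ends, so any bases left unaligned on both sequences at the same side of the HSP are counted as incompatible. The function names and the `lengths` dictionary (sequence id to length) are hypothetical, and only the first HSP per pair is used.

```python
def incompatible_length(qlen, slen, qstart, qend, sstart, send):
    """Bases that should have aligned if the two sequences overlapped.

    Coordinates are 1-based as in BLAST tabular output. A minus-strand
    hit is recognised by sstart > send.
    """
    if sstart <= send:  # plus/plus hit
        left = min(qstart - 1, sstart - 1)
        right = min(qlen - qend, slen - send)
    else:  # plus/minus hit: the subject is read in reverse
        left = min(qstart - 1, slen - sstart)
        right = min(qlen - qend, send - 1)
    return left + right


def list_pairs(blast_lines, lengths):
    """Return (query, subject, percent identity, incompatible bases)."""
    seen = set()
    rows = []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a 12-column tabular hit line
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # ignore self hits
        pair = tuple(sorted((query, subject)))
        if pair in seen:
            continue  # assumption: keep the first HSP per pair only
        seen.add(pair)
        pident = float(fields[2])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        incompat = incompatible_length(
            lengths[query], lengths[subject], qstart, qend, sstart, send
        )
        rows.append((query, subject, pident, incompat))
    return rows
```

With a listing like this, pairs with high similarity and zero incompatible bases are the candidates for joining, and pairs with high similarity but many incompatible bases are the ones that should stay apart.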

> That way we can try to see why they are not joined and implement the
> supervised algorithm approach above to join them. If they should be
> joined then a simple XML file should bind them into supercontigs or
> potential misassemblies.
>
> a
I wouldn't use the blast result as the basis for that. The blast alignment 
is not very good. I would do a local alignment with water to estimate the 
parameters better.
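
As an illustration, a water run for one contig pair could be wrapped like this. It is a sketch that assumes EMBOSS is installed and `water` is on the PATH; the file names are hypothetical and the gap penalties shown (10.0 and 0.5) are just example values, not tuned ones.

```python
import subprocess


def water_command(seq_a, seq_b, outfile, gapopen=10.0, gapextend=0.5):
    """Build the EMBOSS water command line for one pair of sequences."""
    return [
        "water",
        "-asequence", seq_a,
        "-bsequence", seq_b,
        "-gapopen", str(gapopen),
        "-gapextend", str(gapextend),
        "-outfile", outfile,
    ]


def run_water(seq_a, seq_b, outfile):
    """Run water on one pair; raises CalledProcessError if it fails."""
    subprocess.run(water_command(seq_a, seq_b, outfile), check=True)
```

For example, `run_water("contig1.fasta", "contig2.fasta", "contig1_contig2.water")` would write the alignment to the named output file. Running this only for the pairs selected from the blast result keeps the extra cost of the local alignment manageable.
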
Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

Attachment: 1_hits_vs_similarity.png
Description: PNG image

Attachment: 2_hits_vs_similarity_vs_incompatiblity.png
Description: PNG image
