[mira_talk] Re: mira assembly visualization
- From: Jose Blanca <jblanca@xxxxxxxxxx>
- To: A.Papanicolaou@xxxxxxxxxxxx
- Date: Tue, 30 Jun 2009 16:30:43 +0200
Hi:

On Monday 29 June 2009 18:02:42 Alexie Papanicolaou wrote:
> Hello Jose,
>
> Glad you got this discussion started. My suggestions/questions below
> yours.
> p.s. i'm having problems posting to the mira list with my exeter address
> so if anyone else is interested i'll re-post it from home.
>
> > 2- compare the unigenes against themselves doing a blastn of the unigenes
> > against the unigenes
>
> a) That will give you an indication of redundancy within the assembly.
> But MIRA is not a clusterer so I don't think the parameter optimization
> should focus on this...

You're right. I forgot that mira is not a clusterer, although sometimes I
wish it could cluster just a little :)

> c) I would go for a global alignment... needle?

I don't understand. Do you mean to do an alignment to get a better estimate
of the similarity and the incompatible regions? That would be nice. I would
do it with a local alignment, but it would be much slower than just parsing
the blast result.

> > 3- do a graphical representation of the blast result.
>
> I like the graphs, I was rather lost in the axis of the second one...

You're not to blame; sometimes I get convoluted results, but I think that it
is an informative graph.

> > If you take a look at the graphic it's very easy to realize that
> > there is a lot of unigenes with a similarity close to 100%.

In fact I've fixed the graph. I was taking into account the query-subject
pairs in which both sequences are the same. Quite a dumb thing to do. Now
it's fixed and the results are much better. Here are the two new graphs
obtained from the same blast result. Now the peak around 100% similarity is
smaller, and in the second graph you can see that it is mainly due to
sequences with incompatible regions that shouldn't be aligned. So mira is
not doing such a bad job after all. There are still sequences with a
similarity of 99% and with no incompatible regions that are not merged by mira.
I think that the most probable reason for the 1% difference is SNPs, and I
would like to have them, but right now I can't. It would be nice to be able
to tell mira to join contigs with a similarity above 99% and with no
incompatible regions. Now it seems that the main reason for the 100%
similarity peak is short repeats in the reads, and mira does a good job
splitting them into different contigs.

> The problem here is how to define a cut-off: different genes/gene
> families will have different cut-offs. Some researchers make cDNA
> libraries from multiple outbred populations (yea... i know what you're
> thinking...)

That's my case.

> But generally, you are absolutely right. MIRA cannot and shouldn't
> really make a decision in the case you give. Also this case:
> > unigene1 --------------------------------->
> > unigene2 --------------------------------->
> > | =similar
>
> I'm working on the problem of defining the cut-off... It involves a
> supervised algorithm approach and calculates a variety of features (not
> just % similarity). We can talk about it on the phone if you want.
> I'm always worried about misassemblies. There is a way to see if a
> chimaeric clone is merging two contigs (coverage in chimaeric region
> should == 1) but I don't know if you can deal with any misassemblies
> created by the assembler itself (due to the quality of 454 data).

That seems quite complex and tricky to do, not a simple task.

> What I like about your approach is this: can you get the script to give
> a list of the IDs of the different types of contig-pairs?

I've modified the script to do just that. You can get the list with the
similarity and incompatibility between them.

> That way we can try to see why they are not joined and implement the
> supervised algorithm approach above to join them. If they should be
> joined then a simple XML file should bind them into supercontigs or
> potential misassemblies.
> > a

I wouldn't take the blast result as the base to do that.
The blast alignment is not very good. I would do a local alignment with
water to estimate the parameters better.

Best regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y Mejora
de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)
Attachment:
1_hits_vs_similarity.png
Description: PNG image
Attachment:
2_hits_vs_similarity_vs_incompatiblity.png
Description: PNG image
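
For anyone following the thread, here is a minimal sketch of the two steps discussed in the message above: dropping the query-subject pairs in which both sequences are the same, and listing the IDs of the remaining contig pairs with their similarity. It is not the script mentioned in the message. It assumes the self-blast was saved in the 12-column tab-separated format (for example blastall -m 8), it uses the percent identity column as a stand-in for "similarity", and it does not try to compute incompatible regions. The function name similar_pairs and the 99% default are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def similar_pairs(blast_lines: Iterable[str],
                  min_identity: float = 99.0) -> List[Tuple[str, str, float]]:
    """List unigene pairs whose best hit reaches min_identity percent.

    Assumes tab-separated BLAST output with the query ID, subject ID and
    percent identity in the first three columns.
    """
    best: Dict[Tuple[str, str], float] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if query == subject:
            continue  # a sequence always hits itself; ignore those pairs
        # Treat (A, B) and (B, A) as the same pair.
        first, second = sorted((query, subject))
        pair = (first, second)
        # Keep only the highest identity seen for each pair.
        if identity > best.get(pair, 0.0):
            best[pair] = identity
    return sorted((q, s, ident) for (q, s), ident in best.items()
                  if ident >= min_identity)


# Made-up example hits, only to show the expected input shape.
example = [
    "c1\tc1\t100.00\t500\t0\t0\t1\t500\t1\t500\t0.0\t990",
    "c1\tc2\t99.20\t250\t2\t0\t1\t250\t10\t259\t1e-120\t450",
    "c2\tc1\t99.20\t250\t2\t0\t10\t259\t1\t250\t1e-120\t450",
    "c3\tc2\t91.00\t100\t9\t0\t1\t100\t1\t100\t1e-30\t120",
]

for query, subject, identity in similar_pairs(example):
    print(query, subject, identity)  # prints: c1 c2 99.2
```

The self-hit c1/c1 is dropped, the reciprocal c1/c2 and c2/c1 hits collapse into one pair, and c2/c3 falls below the cut-off. A list like this could then be passed to a proper pairwise aligner, since the blast alignment alone may not be a good basis for deciding which contigs to join.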