On Tuesday, 27 July 2010, Davide Scaglione wrote:
> I sequenced three different samples (i.e. varieties) of my plant, which
> is tremendously heterozygous: about 0.6 M 454 Titanium reads each,
> 1.7 M on the whole (+36k Sanger).
> As a first thought I planned to assemble each variety separately and
> then cluster the contigs together again (maybe using a less stringent
> alignment due to the diversity between varieties).

This is exactly what miraSearchESTSNPs does.

> Afterwards, I wondered whether there would really be such a big
> difference compared to assembling everything just once, so that is how
> I proceeded. Am I correct?

As long as you tell MIRA which read comes from which sample, it should be
doable. With that approach you have a much higher chance to also catch rare
transcripts.

> What do you think about the repeat histogram? The high 0-level seems
> weird to me; I expected more repeats at the avg. cov. (level 1), having
> so many reads, all coming from normalized libraries.

The repeat histogram is not as good an indicator for EST assemblies as for
genome assemblies. There is no such thing as an "average" coverage there,
hence the level 0 and 1 values may be way off target. It is still useful,
though.

Remember: normalising EST libraries in the lab can do only so much ... you
will still have a couple of sequences that are highly overrepresented.

> Do you think I need to switch on -SK:mnr and -SK:nrr even if I'm looking
> at deeply covered genes (to be able to mine heterozygous SNPs within the
> same sample)? Can these switches be useful in EST clustering too?

-SK:mnr and -SK:nrr are self-defense switches of MIRA to mask away things
which would otherwise lead to incredibly high running times.

What I would do: assemble in a first pass with a moderate -SK:nrr. Then take
all reads from the debris, and assemble these with -SK:mnr=no, but
-SK:mhpr=10 (or even 5).

> Moreover, if one single sample estimated the coverage to 9x with hstat,
> all the three together just increased to 15x (less than doubled). Does
> it make sense to you? Or are my parameters not permissive enough to
> cluster quite diverged genes in the three samples?

Again: in EST assemblies those numbers do not have the same value as for
genome assembly. Don't pay too much attention to the exact "average".

> Trying with two separate samples I got 35k-40k contigs each, while
> assembled together they rose to 70k contigs, thus I'm afraid that reads
> coming from different samples may split apart.

As before: you also have a much higher chance of catching more rare
transcripts and having them assembled into smaller contigs instead of having
them in the debris afterwards.

The downside: for highly expressed transcripts, there's a much higher chance
that MIRA splits them due to non-random sequencing errors.

> Could you please give a quick look at my parameters and give me an
> opinion, with hints on how to play around this? Thanks so much!
> [...]
> -CO:mr=yes:asir=yes

I wouldn't do that for assembly.

> -CL:ascdc=yes

Dangerous for lowly expressed transcripts (like -CL:pec). May also be the
reason for the 40k contigs vs 70k for the whole data set: -CL:ascdc may
have caught fewer "chimeras" for the whole set (which are not chimeras then).

> -DP:ure=yes

Has no effect for 454 reads.

> Measured avg. frequency coverage: 15
> Deduced thresholds:
> -------------------
> Min normal cov: 6.0
> Max normal cov: 24.0
> Repeat cov: 28.5
> Heavy cov: 120.0
> Crazy cov: 300.0
> Mask cov: 1500
> Repeat ratio histogram:
> -----------------------
> 0 50239170
> 1 16063758
> ...
> 10 210342
> 11 171042
> 12 136860
> ...
> ...
> ...
> 885 2
> 927 2
> 972 2

Try a -SK:nrr=100 (or 50) for a first run, then assemble the debris without
-SK:mnr.

B
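
For readers wondering where the "Deduced thresholds" in the quoted log come from, here is a minimal Python sketch. It assumes each threshold is simply the measured average frequency coverage times a fixed factor. The factors below are inferred from the numbers in the log (6.0/15, 24.0/15, and so on), and tying the mask threshold to the -SK:nrr value is likewise an inference from 1500 = 15 x 100. The function name and dictionary keys are made up for illustration and are not part of MIRA.

```python
# Factors inferred from the quoted log (assumption, not taken from MIRA's code):
#   6.0 / 15 = 0.4, 24.0 / 15 = 1.6, 28.5 / 15 = 1.9, 120 / 15 = 8, 300 / 15 = 20
ASSUMED_FACTORS = {
    "min_normal": 0.4,
    "max_normal": 1.6,
    "repeat": 1.9,
    "heavy": 8.0,
    "crazy": 20.0,
}


def deduce_thresholds(avg_cov, nrr=100):
    """Return coverage thresholds for a given average frequency coverage.

    The "mask" entry is assumed to be avg_cov * nrr, which reproduces the
    1500 shown in the log for avg_cov=15 and nrr=100.
    """
    thresholds = {
        name: round(avg_cov * factor, 1)
        for name, factor in ASSUMED_FACTORS.items()
    }
    thresholds["mask"] = round(avg_cov * nrr, 1)
    return thresholds


if __name__ == "__main__":
    # Compare the two -SK:nrr values suggested in the mail.
    for nrr in (100, 50):
        print(nrr, deduce_thresholds(15, nrr=nrr))
```

Under these assumptions, lowering -SK:nrr from 100 to 50 at an average of 15 would halve the mask threshold, so more of the highly overrepresented sequence gets masked in the first pass and is left for the debris pass.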