[mira_talk] heterozygosity, coverage and repeat histogram

  • From: "Davide Scaglione" <gianza@xxxxxxxxxx>
  • To: <mira_talk@xxxxxxxxxxxxx>
  • Date: Tue, 27 Jul 2010 01:03:11 +0200

Hi Bastien,
thank you so much for all the free support you are giving to the community.

I'm trying to make a "definitive" assembly of my ESTs for SNP mining, but the 
coverage values and the repeat histogram raise a few concerns.

I sequenced three different samples (i.e. varieties) of my plant, which is 
tremendously heterozygous.
Each sample has about 0.6 M 454 Titanium reads, 1.7 M in total (plus 36k 
Sanger reads).
My first plan was to assemble each variety separately and then cluster the 
contigs together (maybe using a less stringent alignment, given the diversity 
between varieties). Then I figured that this should not differ much from 
assembling everything in one go, so that is how I proceeded. Is this correct?

What do you think about the repeat histogram? The high level-0 count seems odd 
to me: with so many reads, all coming from normalized libraries, I expected 
more counts at the average coverage (level 1).
Do you think I need to switch on -SK:mnr and -SK:nrr even though I am looking 
at deeply covered genes (to be able to mine heterozygous SNPs within the same 
sample)? Can these switches be useful in EST clustering too?

Moreover, hstat estimated the coverage of a single sample at 9x, while all 
three samples together only increased it to 15x (less than double). Does that 
make sense to you? Or are my parameters not permissive enough to cluster fairly 
diverged genes from the three samples?
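
To put the two coverage figures side by side, here is a rough sketch in Python. It rests on a naive assumption that is not from MIRA itself: that the three samples have equal depth and share all their sequence, so the estimate would scale linearly with the number of samples. The variable names are only illustrative.

```python
# Coverage estimates quoted in the post (hstat output).
single_sample_cov = 9   # one sample on its own
combined_cov = 15       # all three samples together
n_samples = 3

# Naive assumption (not a MIRA rule): equal depth per sample and fully
# shared sequence, so coverage would simply add up across samples.
naive_expected = single_sample_cov * n_samples

# How much the estimate actually grew, and how far it falls short
# of the naive linear expectation.
observed_fold = combined_cov / single_sample_cov
fraction_of_naive = combined_cov / naive_expected

print(f"naive expectation: {naive_expected}x")
print(f"observed increase: {observed_fold:.2f}-fold")
print(f"observed / naive:  {fraction_of_naive:.2f}")
```

The gap between the observed and the naive figure does not say by itself whether the cause is real sequence divergence between varieties or assembly parameters that are too strict.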

With two samples assembled separately I got 35k-40k contigs each, while 
assembled together they rose to 70k contigs, so I'm afraid that reads coming 
from different samples may be splitting apart.
Could you please take a quick look at my parameters and give me your opinion, 
with hints on how to work around this? Thanks so much!

COMMON_SETTINGS 
-GE:not=8 
-AS:sep=yes:ugpf=no
-SK:not=8:pr=85:mnr=no 
-CO:mr=yes:asir=yes
-OUT:ora=yes:org=no
-SB:lsd=yes
-CL:ascdc=yes

SANGER_SETTINGS 
-LR:wqf=no 
-AS:epoq=no:bdq=20
-CL:cpat=no
-OUT:sssip=yes
-AL:mo=50:ms=50:mrs=93:egp=no
-CO:fnicpst=yes
-ED:ace=no

454_SETTINGS 
-OUT:sssip=yes
-AL:mo=50:ms=50:mrs=93:egp=no 
-CL:cpat=no:qc=yes:qcmq=15:qcwl=20
-CO:fnicpst=yes:rodirs=10
-DP:ure=yes
-ED:ace=no


Measured avg. frequency coverage: 15
Deduced thresholds:
-------------------
Min normal cov: 6.0
Max normal cov: 24.0
Repeat cov: 28.5
Heavy cov: 120.0
Crazy cov: 300.0
Mask cov: 1500
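
As a side note on the thresholds just listed, the short Python sketch below expresses each one as a multiple of the measured average of 15. The values are copied from the output above. Whether the program really derives them as fixed multiples of the average is an assumption on my side; the sketch only does the arithmetic.

```python
# Values copied from the hstat output above.
avg_cov = 15
thresholds = {
    "Min normal cov": 6.0,
    "Max normal cov": 24.0,
    "Repeat cov": 28.5,
    "Heavy cov": 120.0,
    "Crazy cov": 300.0,
    "Mask cov": 1500.0,
}

# Express every threshold relative to the measured average coverage.
multiples = {name: value / avg_cov for name, value in thresholds.items()}

for name, factor in multiples.items():
    print(f"{name:15s} = {factor:g} x avg")
```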
Repeat ratio histogram:
-----------------------
0 50239170
1 16063758
2 4565828
3 2193594
4 1285108
5 856230
6 606206
7 446770
8 336404
9 262926
10 210342
11 171042
12 136860
...
...
...
885 2
927 2
972 2
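
To quantify how dominant level 0 is, here is a minimal Python sketch that parses the histogram lines pasted above and computes each level's share. The levels hidden behind "..." are missing, so the shares are relative to the listed levels only and overstate the true fractions somewhat. The function names are my own, not part of MIRA.

```python
# Histogram as pasted above; the "..." lines stand for elided levels.
HISTOGRAM_TEXT = """\
0 50239170
1 16063758
2 4565828
3 2193594
4 1285108
5 856230
6 606206
7 446770
8 336404
9 262926
10 210342
11 171042
12 136860
...
...
...
885 2
927 2
972 2
"""


def parse_histogram(text):
    """Return {level: count} from 'level count' lines, skipping other lines."""
    hist = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            hist[int(parts[0])] = int(parts[1])
    return hist


def level_shares(hist):
    """Return each level's count as a fraction of the listed total."""
    total = sum(hist.values())
    return {level: count / total for level, count in hist.items()}


hist = parse_histogram(HISTOGRAM_TEXT)
shares = level_shares(hist)

# Print the first few levels; the tail is tiny by comparison.
for level in sorted(shares)[:5]:
    print(f"level {level}: {shares[level]:.1%} of listed counts")
```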


