[mira_talk] Re: heterozygosity, coverage and repeat histogram

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Wed, 28 Jul 2010 20:46:00 +0200

On Tuesday, 27 July 2010, Davide Scaglione wrote:
> I sequenced three different samples (i.e. varieties) of my plant, which is
>  tremendously heterozygous. about 0,6 M 454-titanium each, 1,7 M on the
>  whole (+36k sanger)
> As first thought I planned to assemble each varieties, separately and then
>  cluster the contigs together again (maybe using a less stringent alignment
>  due to the diversity between varieties).

This is exactly what miraSearchESTSNPs does.

>  After, I wondered that there
>  shouldn't be such a big difference rather than assembling everything just
>  once. In this way I proceeded. Am I correct?

As long as you tell MIRA which read comes from which sample, it should be 
doable. With that approach you have a much higher chance to also catch rare 
transcripts.
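
One way to prepare the read-to-sample assignment is sketched below in Python. 
This is only an illustration: it assumes a two-column text table ("read name, 
strain name") is what gets handed to MIRA, and the sample names and toy reads 
are made up. Check the MIRA manual for the exact file name and the switch that 
loads it.

```python
def read_names(fasta_lines):
    """Yield the read name (first word after '>') of every FASTA record."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def strain_table(samples):
    """Build 'read<TAB>strain' rows from {strain: iterable of FASTA lines}."""
    rows = []
    for strain, fasta_lines in samples.items():
        for name in read_names(fasta_lines):
            rows.append(f"{name}\t{strain}")
    return rows


# Hypothetical toy input; in real use, pass open FASTA file handles instead.
samples = {
    "varietyA": [">readA1 extra info", "ACGT", ">readA2", "GGCC"],
    "varietyB": [">readB1", "TTAA"],
}
table = strain_table(samples)
print("\n".join(table))
```

Each row ties one read to the variety it was sequenced from, so a single 
combined assembly can still report which sample supports which base.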

> What do you think about repeat histogram? The high 0-level seems weird to
>  me, I expected more repeat on the avg cov. (level-1) having so many reads,
>  all coming from normalized libraries.

The repeat histogram is not as good an indicator for EST assemblies as it is 
for genome assemblies. There is no such thing as an "average" coverage in an 
EST data set, hence the level 0 and 1 values may be way off target. It is 
still useful, though.

Remember: normalising EST libraries in the lab can do only so much ... you 
will still have a couple of sequences that are highly overrepresented.

>  Do you think I need to switch on
>  -SK:mnr and -SK:nrr even if I'm looking at deep covered genes, (to  be
>  able to mine heterozygous SNPs within the same sample)? Can be these
>  switch useful in EST clustering too?

-SK:mnr and -SK:nrr are self-defense switches of MIRA to mask away things 
which would otherwise lead to incredibly high running times. What I would do: 
assemble in a first pass with a moderate -SK:nrr. Then take all reads from the 
debris and assemble these with -SK:mnr=no, but -SK:mhpr=10 (or even 5).
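
Pulling the debris reads out for the second pass could look like the Python 
sketch below. It assumes the first pass leaves a plain-text list with one 
read name per line and that the reads are available as FASTA; the function 
names and the toy data are hypothetical.

```python
def load_names(lines):
    """Collect read names, one per line, ignoring blanks and '#' comments."""
    names = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split()[0])
    return names


def filter_fasta(fasta_lines, wanted):
    """Return only the FASTA records whose read name is in `wanted`."""
    kept, keep = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            kept.append(line.rstrip("\n"))
    return kept


# Toy stand-ins for the debris list and the original read file.
debris = load_names(["read2", "", "read3 some_reason"])
reads = [">read1", "ACGT", ">read2 len=4", "GGCC", "TTAA", ">read3", "CCCC"]
second_pass = filter_fasta(reads, debris)
print("\n".join(second_pass))
```

The filtered records are the input for the second, unmasked assembly; quality 
values, if present, would need the same filtering.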

> Moreover, if one single sample estimated the coverage to 9x with hstat, all
>  the three together just increased to 15x (less than doubled). Does it make
>  sense to you? or my parameters are not permissive to cluster quite
>  diverged genes in the three samples?

Again: in EST assemblies those numbers do not have the same meaning as for a 
genome assembly. Don't pay too much attention to the exact "average".

> Trying with two separated sample I got 35k-40k contigs each, while
>  assembled together they rose to 70k contigs, thus I'm afraid that reads
>  coming from different samples may split apart.

As before: you also have a much higher chance of catching more rare 
transcripts and have them assembled into smaller contigs instead of having 
them in the debris afterwards.

The downside: for highly expressed transcripts, there's a much higher chance 
that MIRA splits them due to non-random sequencing errors.

>  Could you please give a
>  quick look at my parameters and give me an opinion, with hints on how to
>  play around this? thanks so much!
> [...]
> -CO:mr=yes:asir=yes

I wouldn't do that for assembly.

> -CL:ascdc=yes

Dangerous for lowly expressed transcripts (like -CL:pec). May also be the 
reason for the 40k contigs vs 70k for the whole data set: -CL:ascdc may have 
caught fewer "chimeras" for the whole set (which are not chimeras then).

> -DP:ure=yes

Has no effect for 454 reads.

> Measured avg. frequency coverage: 15
> Deduced thresholds:
> -------------------
> Min normal cov: 6.0
> Max normal cov: 24.0
> Repeat cov: 28.5
> Heavy cov: 120.0
> Crazy cov: 300.0
> Mask cov: 1500
> Repeat ratio histogram:
> -----------------------
> 0 50239170
> 1 16063758
> ...
> 10 210342
> 11 171042
> 12 136860
> ...
> ...
> ...
> 885 2
> 927 2
> 972 2

Try -SK:nrr=100 (or 50) for a first run, then assemble the debris with 
-SK:mnr=no.
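
As a rough aid for reading the quoted thresholds, the Python sketch below 
divides each one by the measured average of 15. This is plain arithmetic on 
the numbers in the quoted output. Whether MIRA actually derives the thresholds 
as fixed multiples of the average is an assumption, not something stated here.

```python
# Values copied from the quoted output above.
avg = 15.0
thresholds = {
    "min normal": 6.0,
    "max normal": 24.0,
    "repeat": 28.5,
    "heavy": 120.0,
    "crazy": 300.0,
    "mask": 1500.0,
}

# Express every threshold as a multiple of the measured average.
ratios = {name: round(value / avg, 2) for name, value in thresholds.items()}

for name, ratio in ratios.items():
    print(f"{name:>10}: {ratio:g} x average")
# The mask threshold comes out at 100 x the average.
```

Since the "average" itself is shaky in an EST data set, the same caution 
applies to every threshold scaled from it.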

B
