[mira_talk] Re: Digital normalization discarded too many 454-based reads

  • From: Martin MOKREJŠ <mmokrejs@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 24 Jun 2014 18:03:12 +0200

Bastien Chevreux wrote:
> On 15 Jun 2014, at 17:34 , Martin MOKREJŠ <mmokrejs@xxxxxxxxx> wrote:
>> Could you assure me that the normalization discarded from my 454 data only 
>> reads shorter than Illumina?
> 
> No, I cannot as this is not how LDN works. For repetitive reads, it simply 
> checks whether all the kmers it is composed of have already been taken enough 
> times. If yes, it discards the read.
> 
>>  Can I ensure mira does not apply diginorm to 454 data, except in cases 
>> where a 454 read is a substring of a LONGER Illumina read? How can I 
>> disable it for 454 technology? Can -HS:ldn be specified under 454_SETTINGS?
> 
> No. You can’t. No.
> 
> In this order. Though I see your point, and making this partly configurable 
> seems easy enough. I’ll give it a look. But read on.
> 
>> […]
>> Still, I wonder what else could I consider before doing so.
> 
> Looking through the code, one thing which should work is specifying the 
> readgroup for the 454 before the Illumina reads. LDN works readgroup by 
> readgroup (in the order as specified by the manifest). In your current 
> configuration, it first looks at all the short Illuminas and takes these. 
> Which then can leave quite a number of 454 out on the street as - from a kmer 
> perspective - they don’t add to the assembly and are thus discarded.
> 
> Turning this around in the manifest should alleviate the problem with the 454 
> reads. However, the way your assembly is set up, you will run into similar 
> trouble with all the Illumina readgroups: you’ll get a disproportionate 
> number of reads from the readgroups early in the process.
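The order dependence described above can be illustrated with a toy model. This
is only a minimal sketch under simplifying assumptions, not MIRA's actual LDN
code: the function name normalize, the k-mer size k, the threshold max_count
and the example reads are all hypothetical, and unlike MIRA the sketch applies
the rule to every read rather than only to repetitive ones.

```python
from collections import Counter

def normalize(readgroups, k=4, max_count=1):
    """Toy digital normalization (hypothetical, NOT MIRA's implementation).

    Readgroups are processed in the given order. A read is discarded when
    every k-mer it contains has already been taken at least max_count times.
    """
    seen = Counter()
    kept = {}
    for name, reads in readgroups:
        kept[name] = []
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if kmers and all(seen[km] >= max_count for km in kmers):
                continue  # adds nothing new from a k-mer perspective
            seen.update(kmers)
            kept[name].append(read)
    return kept

# Made-up data: one long "454" read and short "Illumina" reads tiling it.
long_read = "ACGTTGCAAGGCTTAC"
short_reads = [long_read[i:i + 8] for i in range(0, len(long_read) - 7, 2)]

illumina_first = normalize([("illumina", short_reads), ("454", [long_read])])
roche_first = normalize([("454", [long_read]), ("illumina", short_reads)])

print(len(illumina_first["454"]), len(illumina_first["illumina"]))
print(len(roche_first["454"]), len(roche_first["illumina"]))
```

In this toy case, whichever readgroup comes first keeps its reads and the later
one loses them, which matches the behaviour described for the manifest order.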

Still, I lost about 1/2 of the 454 reads, but that is better than being left 
with only 1/4. However, in the per-individual assemblies where the 454 data 
were defined as the very last group, I lost just 1/5 of the reads.

I thought I could get around this by just disabling the second round of 
diginorm, but I had too much data after merging 8 normalized individuals, and 
too little with 4 individuals. ;)

> 
>> Or should I have better used miraSearchForSNPs instead?
> 
> You can’t: that has been discontinued. Sorry.

Please update the manual at 
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect1_est_est_difference_assembly_clustering
It still refers to miraSearchForSNPs.

It is also unclear to me what value I should set -CO:rodirs= to.


I have 12 diploid animals, each normalized by mira, mostly with default 
settings and notably with -CO:asir=yes (I think I should have assigned a strain 
name to each to enable a setup closer to -CO:asir=yes; too late now).
Then I extracted all assembled or potentially useful reads from the debris and 
did two assemblies with the same settings, again with normalization enabled.
Now I just want to merge the redundant contigs from all previous assembly 
attempts (with each allele) together.




project = Mybug_reassembly_of_contigs
job = est,denovo,accurate

readgroup = mira
data = ../all_15_assemblies_out.unpadded.fasta
technology = Sanger

parameters = COMMON_SETTINGS -GENERAL:number_of_threads=6 -HS:ldn=no



It seems I should add -AL:egp=no -CO:asir=yes, but what should I do about 
-CO:rodirs= ?

Can I disable clipping of contigs? How about the minimum overlap and the 
minimum read count per contig? Basically, I believe everything "redundant" 
appears 3-15x.

Or is it better to use the genomic assembly mode to assemble the partial 
transcript contigs?

Thanks,
Martin
