[mira_talk] Re: large hybrid assembly w/ minimal ram

  • From: "Wachholtz, Michael" <mwachholtz@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 19 May 2011 20:21:51 -0500

We have 1.7 million 454 reads and about 350 million 55 bp Illumina
reads. When removing duplicates, we keep the duplicate read with the
best quality scores, and only on a per-lane basis, so the same read may
reappear in other lanes. But given this vast amount of data, even a
Velvet hybrid assembly will be good... correct? There seems to be a
trade-off here: the huge amount of data we have keeps us from using
detailed assemblers such as MIRA, but won't all this data eliminate
any coverage, low-abundance transcript, or sequencing-error issues in
fast assemblers such as Velvet, even at a high k-mer value?
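
For reference, the per-lane duplicate removal can be sketched roughly
as below. This is only an illustration: it assumes "best quality"
means the highest mean base quality, and the names (Read, dedup_lane,
dedup_per_lane) are made up for this example, not taken from any tool.

```python
from typing import Dict, Iterable, List, Tuple

# (name, sequence, per-base quality scores); layout is an assumption
Read = Tuple[str, str, List[int]]


def mean_quality(quals: List[int]) -> float:
    """Mean base quality; empty quality lists count as 0."""
    return sum(quals) / len(quals) if quals else 0.0


def dedup_lane(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per distinct sequence: the one with the best mean quality."""
    best: Dict[str, Read] = {}
    for read in reads:
        _, seq, quals = read
        kept = best.get(seq)
        if kept is None or mean_quality(quals) > mean_quality(kept[2]):
            best[seq] = read
    return list(best.values())


def dedup_per_lane(lanes: Dict[str, List[Read]]) -> Dict[str, List[Read]]:
    """Deduplicate each lane on its own; duplicates across lanes survive."""
    return {lane: dedup_lane(reads) for lane, reads in lanes.items()}
```

Because each lane is handled separately, a sequence present in two
lanes is kept once per lane, which is the behaviour described above.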

On Thu, May 19, 2011 at 4:31 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
> On May 19, 2011, at 19:56 , Wachholtz, Michael wrote:
>>  I would like your opinion on this idea:
>
> Ah, I should have perhaps answered your post before Shane's ... I won't 
> repeat things here but just add a couple of points I haven't touched there.
>
>> I have noticed that removing duplicate reads from the solexa data will
>> reduce the data size by 30-50%. I assume these reads are from highly
>> expressed genes.
>
> I am a bit split regarding filtering away exact duplicate reads. First, even
> if Solexa reads are astoundingly good regarding quality (in my last projects
> with 100mers, between 70 and 80% of the reads had no error), there are still
> 20 to 30% of the reads *with* at least one error (probably more if
> you are in high GC organisms, but that is another story).
>
> Anyway, filtering on exact duplicates will fail for those, actually 
> introducing a bias for reads having sequencing errors. Not good.
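
The bias can be seen on a toy example. The sketch below uses made-up
10-mers (not real data): eight error-free copies of one read plus two
copies that each carry a different single-base error. Exact-duplicate
filtering collapses the eight good copies into one but keeps both
erroneous reads.

```python
def error_fraction(reads, true_seq):
    """Fraction of reads that differ from the true sequence."""
    return sum(1 for r in reads if r != true_seq) / len(reads)


def exact_dedup(reads):
    """Collapse exact duplicates, keeping the first occurrence of each."""
    return list(dict.fromkeys(reads))


true_seq = "ACGTACGTAC"
# 8 error-free copies plus 2 reads with one (different) substitution each
reads = [true_seq] * 8 + ["ACGTACGTAA", "TCGTACGTAC"]

before = error_fraction(reads, true_seq)               # 2 of 10 reads
after = error_fraction(exact_dedup(reads), true_seq)   # 2 of 3 reads
print(f"reads with errors before: {before:.0%}, after: {after:.0%}")
```

In this toy case the share of reads with errors goes from 2 in 10 to
2 in 3 after filtering.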
>
>> I am thinking of running a velvet assembly, with very
>> high k-mer value. This assembly should create reliable sequences for
>> the highly expressed genes. After that, any reads that were not
>> assembled into these complete transcripts can be reused in a hybrid
>> assembly (w/ 454 reads) via MIRA. I assume that removing duplicate
>> reads and reads from highly expressed transcripts will 1) reduce my
>> solexa data size significantly and 2) will allow more detailed
>> assembly of low abundance transcripts. Can anyone think of problems
>> with this method?
>
> Fortunately, my last projects did not involve the analysis of genes that are
> utterly highly expressed, so I did not have to solve the problem of
> assembling these really, really well. However, data reduction was also one
> path I explored, and here's what I came up with (you'll excuse that I am again
> using only MIRA for what I am explaining; there may be better tools to do it,
> but at least with MIRA I know what I can expect):
>
> With SKIM, MIRA has k-mer analysis algorithms implemented which have been
> expanded quite a bit over the past few years, including a hash frequency
> analysis (more trivially: k-mer counting). MIRA uses that to filter away
> really nasty data (sequencing vectors, adaptors, etc.). The threshold for
> what is 'nasty' can be set freely (-SK:nrr) and is relative to the average
> hash frequency (k-mer occurrence). When MIRA encounters k-mers surpassing
> that threshold, it masks them *and* logs different repeat levels to an info
> file early on in the assembly.
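
The general idea of a threshold relative to the average k-mer
frequency can be sketched as follows. This is not MIRA's
implementation: it counts forward-strand k-mers only, masks with 'X',
and the parameter name nrr is merely borrowed from the -SK:nrr option
mentioned above.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence over all reads (forward strand only)."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def mask_frequent(seq, counts, k, nrr):
    """Mask bases covered by a k-mer seen more than nrr * average times."""
    if not counts:
        return seq
    average = sum(counts.values()) / len(counts)
    threshold = nrr * average
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        # look up k-mers in the original sequence, not the masked copy
        if counts[seq[i:i + k]] > threshold:
            masked[i:i + k] = "X" * k
    return "".join(masked)
```

A read whose masked form contains long runs of 'X' is then a candidate
for the kind of filtering discussed in this thread.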
>
> That info file contains all read stretches covered by different repeat levels
> (HAF5, HAF6, HAF7 and MNRr) and with that one can pretty quickly analyse a
> data set for unexpected or unwanted things (unknown adaptors, high copy
> plasmids, etc.). Now, you could use that same file to filter out all reads
> which have, say, MNRr stretches of at least 50 bases. That would remove from
> your data set almost all reads from genes which are utterly highly expressed,
> where you set the level for what is "utterly high" yourself. That way, only
> very few reads from those genes will remain in the data set and you then have
> two data sets: one with utterly highly expressed genes and one with all the
> rest (lowly and normally expressed genes) on which you can run different
> analyses.
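
The split into two data sets could look roughly like this. The file
layout is an assumption made for this sketch (tab-separated: read
name, repeat level tag, stretch sequence), so check it against the
info file your MIRA version actually writes. The function names are
hypothetical.

```python
def reads_to_drop(info_lines, tag="MNRr", min_len=50):
    """Names of reads with at least one `tag` stretch of >= min_len bases."""
    drop = set()
    for line in info_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed layout
        name, level, stretch = fields[0], fields[1], fields[2]
        if level == tag and len(stretch) >= min_len:
            drop.add(name)
    return drop


def split_reads(read_names, drop):
    """Split read names into (highly expressed, everything else)."""
    high = [n for n in read_names if n in drop]
    rest = [n for n in read_names if n not in drop]
    return high, rest
```

Raising or lowering min_len (or choosing a HAF tag instead of MNRr) is
how one would set the level for what counts as "utterly high".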
>
> Just my 2 cents, I'm interested to hear what others think.
>
> B.
>
>
> --
> You have received this mail because you are subscribed to the mira_talk 
> mailing list. For information on how to subscribe or unsubscribe, please 
> visit http://www.chevreux.org/mira_mailinglists.html
>

