[mira_talk] Re: Very long transcripts

  • From: Martin Mokrejs <mmokrejs@xxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 31 Oct 2012 12:00:36 +0100

Hi Jackie,
  you haven't said what lab protocol was used to prepare the sample for
sequencing. The numbers sound like a complete disaster, so I guess you used
the MINT/SMART protocols (especially as you mentioned polyA/T trimming and
an intended 3'-end-only cDNA library). I have analyzed a few dozen such
datasets and FIXED them. Since I started working on these I have hit so
many issues that, honestly, my only advice is not to work with such
datasets unless you clean them up. The question is what you should do
... ;-) I found a way through ... If you want to re-invent the wheel,
reserve 6-8 months of your time as a molecular biologist and hire a
programmer.
  On the other hand, if you used the Roche protocol with its random
hexamers, you shouldn't have that many issues, but I have still seen
datasets with polyA/T tails in the reads. These, in theory, should NOT
appear in cDNA prepared by the Roche protocol, but somehow they do! ;-)
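  If you want a quick first look at your own data, something along these
lines is enough to count how many reads carry such a tail. This is only a
rough sketch in Python, assuming plain FASTA on stdin; the 20-base window
and 90% purity cutoff are illustrative guesses, not validated parameters,
and certainly not my actual method:

# polyat_scan.py - rough diagnostic: how many reads carry a polyA/T tail?
# Assumes plain FASTA on stdin; window and purity are illustrative guesses.
import sys

def read_fasta(handle):
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()

def homopolymer_fraction(chunk, base):
    return chunk.count(base) / len(chunk) if chunk else 0.0

WINDOW, PURITY = 20, 0.9
total = tailed = 0
for name, seq in read_fasta(sys.stdin):
    total += 1
    # polyA shows up at the 3' end, polyT (from the primer) at the 5' end
    if (homopolymer_fraction(seq[-WINDOW:], "A") >= PURITY
            or homopolymer_fraction(seq[:WINDOW], "T") >= PURITY):
        tailed += 1
print(f"{tailed}/{total} reads carry a polyA/T-like tail")

Run it as: python polyat_scan.py < reads.fasta. If a Roche dataset shows
more than a handful of hits, something in the library prep went sideways.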
  I think I mentioned here on this list a while ago that I can offer a
commercial service for cleanup and correction of SFF files. So many
alignments have to be done that fixing 300k reads (a 1/4 XLR plate) takes
several weeks on the fastest machine I could get. Therefore, a paid
service. I really do know why it takes so long ;-) Not a single
adapter-trimming tool available around does anything even remotely
similar, and until the thing is published I won't say more about my
approach.

  In general, transcripts are 2-5 kb long; some aberrant transcripts are
longer, with extended 3'-UTRs. That is the current view of transcriptomes.
Unless you work on a truly obscure organism, the assembly is just wrong.
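  To check your contigs against that picture, a trivial length screen is
enough. Again just a sketch (plain Python over FASTA contigs; the 5 kb
cutoff simply mirrors the rule of thumb above, adjust it to taste):

# contig_lengths.py - flag contigs longer than a plausible transcript.
# The 5000 bp cutoff only mirrors the 2-5 kb rule of thumb; adjust it.
import sys

CUTOFF = 5000
lengths = []
name, size = None, 0
for line in sys.stdin:
    line = line.strip()
    if line.startswith(">"):
        if name is not None:
            lengths.append((size, name))
        name, size = line[1:], 0
    else:
        size += len(line)
if name is not None:
    lengths.append((size, name))

suspects = [(s, n) for s, n in lengths if s > CUTOFF]
print(f"{len(suspects)}/{len(lengths)} contigs exceed {CUTOFF} bp")
for size, n in sorted(suspects, reverse=True)[:20]:
    print(size, n)

Those top contigs are the first ones to eyeball for joined transcripts.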

  As a quick insight I can look into the assembled contigs for adapters
and so on, but for a final solution I would need the SFF files. I can work
with fast(a/q) files, but that is suboptimal. So once again: how many raw
reads are to be analyzed? Which protocol? Otherwise you have to wait;
meanwhile, good luck with your efforts. ;-)
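  By that quick adapter look I mean something along these lines. A naive
sketch only: the SMART oligo below is the commonly published sequence, so
verify it against your own kit documentation, and an exact substring
search obviously misses mismatched or partial copies, so this is a
first-pass diagnostic, not a cleanup tool:

# adapter_scan.py - naive scan of contigs for leftover adapter sequence.
# The SMART oligo is the commonly published one; verify against your kit.
# Exact matching misses mismatched/partial copies; diagnostic use only.
import sys

ADAPTERS = {"SMART": "AAGCAGTGGTATCAACGCAGAGT"}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def read_fasta(handle):
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks).upper()

for name, seq in read_fasta(sys.stdin):
    for label, adapter in ADAPTERS.items():
        for strand, probe in (("+", adapter), ("-", revcomp(adapter))):
            pos = seq.find(probe)
            if pos != -1:
                print(f"{name}\t{label}\t{strand}\t{pos}")

An adapter hit in the middle of a long contig is a strong hint that two
transcripts were "joined" there, as Bastien mentions below.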
Best,
Martin


Bastien Chevreux wrote:
> On Oct 30, 2012, at 14:35 , Jackie Lighten <jackie.lighten@xxxxxx> wrote:
>> I have performed an accurate de novo assembly with poly-A/T trimming.
>> I get all reads assembled, and no singlets, into around 66k contigs. Around
>> 27k of these are large contigs, with the largest being ~25 kb long. This
>> does not make much sense to me as I constructed a 3'-targeted cDNA library
>> (454 FLX). I can envisage that multiple open reading frames may create
>> longer transcripts, but 25 kb seems dodgy to me.
>> Any thoughts?
> 
> Yes. Have a look at those contigs :-) No joke, this always brings the best 
> insights.
> 
> Possible reasons:
> - PKS genes. These can be 45-50 kb long, maybe even longer
> - contamination of the cDNA with gDNA
> - introns. Especially for highly expressed genes, one has a higher chance
> of having sequenced unspliced mRNA
> - unclipped adaptors which "join" contigs
> - assembly "errors": short overlaps of just a couple of bases
> 
> B.
> 

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
