Hi Jackie,

you haven't said which lab protocol was used to prepare the sample for sequencing. The numbers sound like a complete disaster, so my guess is that you used the MINT/SMART protocols (especially as you mention poly-A/T trimming and an intended 3'-end-only cDNA library). I have analyzed a few dozen such datasets and FIXED them. Since I started working on these I have hit so many issues that, honestly, my only advice is: do not work with such datasets unless you clean them up first. The question is what you should do ... ;-) I found a way through ... but if you want to re-invent the wheel, reserve 6-8 months of your time as a molecular biologist and hire a programmer.

On the other hand, if you used the Roche protocol, then with its random hexamers you shouldn't have that many issues -- though I have still seen datasets which also had poly-A/T tails in the reads. In theory these should NOT appear in cDNA prepared by the Roche protocol, but somehow they do! ;-) I think I mentioned that on this list a while ago ...

I can offer a commercial service for cleanup and correction of the SFF files. So many alignments have to be done that fixing 300k reads (a 1/4 XLR plate) takes several weeks on the fastest machine I could get -- hence a paid service. I really do know why it takes so long ;-) Not a single adapter-trimming tool available out there does anything even remotely similar, and until the method is published I won't say more about my approach.

In general, transcripts are 2-5 kb long; some aberrant transcripts are longer, with extended 3'-UTRs. That is the current view of transcriptomes. So unless you work with a truly obscure organism, your assembly is just wrong. As a quick insight I can look into the assembled contigs for adapters and ..., but for a final solution I would need the SFF files. I can work with FASTA/FASTQ files, but that is suboptimal. (A rough do-it-yourself version of such a contig scan is sketched in the P.S. at the end of this mail.)

So, once again: how many raw reads are to be analyzed? Which protocol was used? Anything else you can tell us? Otherwise you will have to wait -- and meanwhile, good luck with your efforts. ;-)

Best,
  Martin

Bastien Chevreux wrote:
> On Oct 30, 2012, at 14:35, Jackie Lighten <jackie.lighten@xxxxxx
> <mailto:jackie.lighten@xxxxxx>> wrote:
>> I have performed an accurate de novo assembly with poly-A/T trimming.
>> I get all reads assembled, with no singlets, into around 66k contigs. Around
>> 27k of these are large contigs, the largest being ~25k bases long. This
>> does not make much sense to me, as I constructed a 3'-targeted cDNA library
>> (454 FLX). I can envisage that multiple open reading frames may create longer
>> transcripts, but 25 kb seems dodgy to me.
>> Any thoughts?
>
> Yes. Have a look at those contigs :-) No joke, this always brings the best
> insights.
>
> Possible reasons:
> - PKS genes. These can be up to 45-50 kb long, maybe even longer
> - contamination of the cDNA with gDNA
> - introns. Especially for highly expressed genes, one has a higher chance to
>   sequence unspliced mRNA
> - unclipped adaptors which "join" contigs
> - assembly "errors": short overlaps of just a couple of bases
>
> B.

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html
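
P.S.: If you want a quick first look yourself, here is a minimal sketch in Python of the kind of contig scan I mean -- emphatically NOT my correction pipeline, just a first-pass sanity check. The adapter sequence (a SMART-type primer) and the script/file names are placeholders; substitute the oligos from your own library prep.

#!/usr/bin/env python
# First-pass scan of assembled contigs for leftover adapter fragments
# and internal poly-A/T runs. The adapter below is a SMART-type primer
# given purely as an example -- replace it with the oligos actually
# used in your own library prep.
import re
import sys

MIN_RUN = 15  # flag homopolymer runs of >= 15 A's or T's
ADAPTERS = {"example_SMART_primer": "AAGCAGTGGTATCAACGCAGAGT"}

def revcomp(s):
    # reverse-complement, so adapters on either strand are caught
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def read_fasta(path):
    # yield (name, sequence) tuples from a FASTA file
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def main(path):
    polyat = re.compile("A{%d,}|T{%d,}" % (MIN_RUN, MIN_RUN))
    for name, seq in read_fasta(path):
        # report long internal poly-A/T runs
        for m in polyat.finditer(seq):
            print("%s\t%d-%d\tpoly-%s run, %d bp"
                  % (name, m.start() + 1, m.end(),
                     seq[m.start()], m.end() - m.start()))
        # report exact adapter hits on both strands
        for label, fwd in ADAPTERS.items():
            for strand, oligo in (("+", fwd), ("-", revcomp(fwd))):
                pos = seq.find(oligo)
                while pos != -1:
                    print("%s\t%d-%d\t%s (%s strand)"
                          % (name, pos + 1, pos + len(oligo), label, strand))
                    pos = seq.find(oligo, pos + 1)

if __name__ == "__main__":
    main(sys.argv[1])  # e.g.: python scan_contigs.py contigs.fasta

Run it as "python scan_contigs.py contigs.fasta". Note it only finds exact matches; a real cleanup needs alignment to catch partial and error-containing adapter copies, which is exactly why it takes so long. Still, any contig with a long internal poly-A/T run or an adapter hit far from its ends is a prime candidate for Bastien's "have a look at those contigs" advice.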