Hi Victoria,

Offord, Victoria wrote:
> Hi Martin,
>
> Wow, thanks for the super speedy response!!!
>
> Have run a couple of the repeats through BLAST and it seems some match to
> rRNA. Am also seeing a lot of polyA-tails in there too.

Use the mirabait utility, mentioned so many times today on this list, to
extract the rRNA-containing reads and put them aside for a separate assembly.
You will find it explained in the mira HTML manual. As somebody already posted
on this list, take the first 1000 or 10,000 reads from your input files and do
a mini-assembly; most likely some of the contigs will be rRNA that you can use
for "baiting".

I never found a proper polyA-removing tool, so I ended up coming up with my
own approach and my own software (which originally removed just the
adapters/artefacts). What I have seen so far are approaches that tolerate a
few non-A or non-T nucleotides in the sequence, but they do not uncover the
complete polyA-tail or its polyT representation. They always leave something
unremoved in the sequence. Knowing the adaptors used in the lab to bind the
polyA-tails, you can do a better job of detecting them, but that is still not
enough. Even more trickery had to be invented and, notably, many more
alignments have to be created, analyzed and cross-compared to draw a clean cut
at the end. I then overwrite the non-perfect polyA or polyT stretches in the
FASTA/Q files to make life easier for the assembler. Or drop them ... but that
has consequences, because the assembler could extend the 3'-UTR erroneously
(imagine one population of transcripts had a longer 3'-UTR while another had a
shorter one -- you would always get only the longest 3'-UTR in the output,
which is wrong). The polyA-signal sequence can also be looked up to increase
sensitivity and specificity.

At about mira-4.0-rc5, Bastien added some new features I ordered from him ;-)
(maybe somebody else did as well?) for coping with polyA-tails, but I have to
check how they work. I promised to do that but did not (yet).
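
To illustrate the idea of overwriting a non-perfect polyA stretch, here is a
minimal Python sketch. It is not the software mentioned above and is far
simpler than what is described there (no adapter knowledge, no alignments, no
polyA-signal lookup). The function names, the scoring values (+1 for A, -2 for
anything else), the allowed score drop and the minimum tail length are all
assumptions chosen for illustration.

```python
def find_polya_start(seq, match=1, mismatch=-2, max_drop=4, min_len=10):
    """Return the index where a 3' polyA tail begins, or None if none is found.

    Scans from the 3' end towards the 5' end, rewarding A and penalising any
    other base, and stops once the running score has fallen more than
    `max_drop` below the best score seen so far. All parameter values are
    illustrative assumptions, not tuned defaults.
    """
    score = 0
    best = 0
    best_start = len(seq)
    for i in range(len(seq) - 1, -1, -1):
        score += match if seq[i] in "Aa" else mismatch
        if score > best:
            # Only an A can raise the score, so the tail always starts on an A.
            best, best_start = score, i
        elif best - score > max_drop:
            break
    if len(seq) - best_start >= min_len:
        return best_start
    return None


def overwrite_polya(seq, **kwargs):
    """Replace an imperfect 3' polyA tail with a perfect run of A's.

    The read length is unchanged, so a FASTQ quality string still lines up.
    """
    start = find_polya_start(seq, **kwargs)
    if start is None:
        return seq
    return seq[:start] + "A" * (len(seq) - start)


if __name__ == "__main__":
    read = "GATTACAGGC" + "AAAAAGAAAAAAAAA"  # tail interrupted by a single G
    print(overwrite_polya(read))
```

A 5' polyT stretch could be treated the same way by reverse-complementing the
read first. A simple scan like this will still leave some tails unremoved,
which is exactly the weakness described above.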
I just stumbled across this myself today, and it seems I got some transcripts
merged via their polyA or polyT sequence ... I think we agreed with Bastien
that they should be masked during assembly and only unleashed during the final
"basecalling", but it seems that is not what happens. Or maybe they were not
masked because there were a few non-A or non-T characters in their sequence,
so they were not recognized by mira. So we are back to what I said in the
beginning. ;-)

> I ran all of the reads through seqclean and assumed that this would fix the
> above. In hindsight, I probably should have been a bit more ruthless.

Call me too picky, but trimpoly, seqclean etc. just don't work well,
definitely not on NextGen data.

> Any advice on how to clean up the dataset a bit more or should I just drop
> the reads?

Although you haven't said what the ratio is among those megahub-causing reads,
just extract all reads matching rRNA and then drop all remaining reads that
appear in the megahubs file. This will not help if the megahubs are caused by
polyA/T-tails (often they are, but your 0.1052271917% tells me that is not
your case).

>
> Be kind, I'm a transcriptome assembling newbie! ;)

We are all. ;-) Transcriptomes are always hard. Use -AL:mo=80 (you cannot go
much higher for the Illumina reads) and maybe, after some testing of the data,
increase -AL:mrs= to 90 or 94.

Martin

>
> Thanks in advance!
>
> Victoria
>
> -----Original Message-----
> From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx]
> On Behalf Of Martin MOKREJŠ
> Sent: 25 June 2014 17:02
> To: mira_talk@xxxxxxxxxxxxx
> Subject: [mira_talk] Re: MIRA run failing due to megahubs
>
> Hi Victoria,
> you have to look into the reads listed in *megahubs* file placed in
> MergedAssembly/MergedAssembly_d_tmp/
> subdirectory. Most likely you have too many polyA-tails in your data or rRNA
> contamination or unremoved adapters/MIDs.
> Or simple crap like [CA]n repeats
> causes that (sometimes they are authentic, sometimes not, in either case hard
> to be useful). Provided it is so few reads, just drop them from your input
> dataset.
> Transcriptomic datasets need proper adapter removal and trimming ... see
> below my signature. ;) Or maybe you get back to me later once you get 40kb
> long contigs in your assemblies ... :))
> Martin
>
> Offord, Victoria wrote:
>> Hi,
>>
>> My MIRA run on a parasite transcriptome keeps ending with:
>>
>> You have 0.1052271917% of your reads as megahubs.
>> You have set a maximum allowed ratio of: 0.0000000000
>
> --
> Martin Mokrejs, Ph.D.
> 454 / IonTorrent / Evrogen MINT / Clontech SMART adapter/artifact removal
> (... too many protocols to name here)
> http://www.bioinformatics.cz/software/supported-protocols/

--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html
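
The advice in this thread to drop the reads listed in the *megahubs* file
could be sketched in Python roughly as follows. The layout of the megahubs
file is an assumption here (read name as the first whitespace-separated token
on each line), so check your own file before relying on it. The function names
are hypothetical, and the FASTQ input is assumed to use exactly four lines per
record.

```python
def load_read_names(lines):
    """Collect read names, assuming the name is the first token on each line."""
    names = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            names.add(fields[0])
    return names


def drop_reads(fastq_lines, unwanted):
    """Yield the lines of every 4-line FASTQ record whose name is not unwanted."""
    record = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:
            # The header looks like "@name optional description".
            header = record[0][1:].split()
            name = header[0] if header else ""
            if name not in unwanted:
                yield from record
            record = []


# Example use with hypothetical file names:
#
#   with open("megahubs.lst") as fh:
#       unwanted = load_read_names(fh)
#   with open("reads.fastq") as fin, open("reads.filtered.fastq", "w") as fout:
#       fout.writelines(drop_reads(fin, unwanted))
```

For paired-end data the mate of each dropped read would have to be removed
from the other file as well, so that the two files stay in sync.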