[mira_talk] Re: MIRA run failing due to megahubs

  • From: Martin MOKREJŠ <mmokrejs@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 25 Jun 2014 18:57:56 +0200

Hi Victoria,

Offord, Victoria wrote:
> Hi Martin,
> 
> Wow, thanks for the super speedy response!!!
> 
> Have run a couple of the repeats through BLAST and it seems some match to 
> rRNA.  Am also seeing a lot of polyA-tails in there too.

Use the mirabait utility, mentioned many times on this list today, to extract 
rRNA-containing reads and put them aside for a separate assembly. It is 
explained in the MIRA HTML manual. As somebody already posted on this list, 
take the first 1000 or 10,000 reads from your input files and do a 
mini-assembly; most likely some of the contigs will be rRNA that you can use 
for "baiting".
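
The subsampling step for such a mini-assembly could look roughly like the 
sketch below. It assumes an uncompressed FASTQ file with exactly four lines 
per record (no wrapped sequence lines); the function name head_fastq and the 
file names are made up for illustration and are not part of MIRA.

```python
import itertools


def head_fastq(in_path, out_path, n_reads=1000):
    """Copy the first n_reads records of a FASTQ file to out_path.

    Assumption: every record is exactly four lines
    (header, sequence, '+', qualities), i.e. sequences are not wrapped.
    Returns the number of complete records actually written.
    """
    lines_written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        # islice stops early if the file holds fewer than n_reads records.
        for line in itertools.islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python head_fastq.py reads.fastq mini.fastq 10000
    if len(sys.argv) == 4:
        count = head_fastq(sys.argv[1], sys.argv[2], int(sys.argv[3]))
        print(f"wrote {count} reads")
```

The resulting small file can then be assembled on its own, and any rRNA 
contigs from it used as bait sequences.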

I never found a proper polyA-removing tool, so I ended up with my own 
approach and my own software (which originally removed just the 
adapters/artefacts). The approaches I have seen so far try to tolerate a few 
non-A or non-T nucleotides in the sequence, but they do not uncover the 
complete polyA-tail or its polyT representation; they always leave something 
unremoved. Knowing the adaptors used in the lab to bind the polyA-tails, you 
can do a better job of detecting them, but that is still not enough. Even more 
trickery had to be invented and, notably, many more alignments have to be 
created, analyzed and cross-compared to draw a clean cut at the end. At the 
end I overwrite the imperfect polyA or polyT stretches in the FASTA/Q files to 
make life easier for the assembler. Or you can drop them ... but that has 
consequences, because the assembler could extend the 3'-UTR erroneously 
(imagine one population of transcripts has a longer 3'-UTR and another a 
shorter one: you would always get only the longest 3'-UTR in the output, which 
is wrong). The polyA-signal sequence can also be looked up to increase 
sensitivity and specificity.
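
To make the "tolerate a few non-A nucleotides" idea concrete, here is a 
minimal sketch of a score-based scan from the 3' end that then overwrites the 
detected stretch with pure A. It is only a simple heuristic of the kind 
criticised above, not the alignment-based method described in this mail; the 
function names and all thresholds (min_len, mismatch_penalty, max_drop) are 
arbitrary assumptions for illustration.

```python
def find_polya_start(seq, min_len=10, mismatch_penalty=3, max_drop=6):
    """Return the index where an imperfect 3' polyA tail starts.

    Scans from the 3' end: +1 for every A, -mismatch_penalty for anything
    else. The tail start is the position with the best running score; the
    scan stops once the score has fallen more than max_drop below that best.
    Returns len(seq) if no tail of at least min_len bases is found.
    All thresholds are illustrative assumptions, not tuned values.
    """
    score = best = 0
    best_start = len(seq)
    for i in range(len(seq) - 1, -1, -1):
        score += 1 if seq[i] in "Aa" else -mismatch_penalty
        if score > best:
            best, best_start = score, i
        elif best - score > max_drop:
            break
    if len(seq) - best_start < min_len:
        return len(seq)
    return best_start


def overwrite_polya(seq, **kwargs):
    """Replace an imperfect 3' polyA tail with a perfect run of A."""
    start = find_polya_start(seq, **kwargs)
    return seq[:start] + "A" * (len(seq) - start)


if __name__ == "__main__":
    read = "GATTACAGGC" + "AAAAAAAAGAAAAAAAA"  # one G inside the tail
    print(overwrite_polya(read))
```

A polyT stretch at the 5' end could be handled by reverse-complementing the 
read first. Quality strings in FASTQ files would need matching treatment, 
which this sketch does not attempt.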

At about mira-4.0-rc5 Bastien added some new features for coping with 
polyA-tails that I had ordered from him ;-) (maybe somebody else did as 
well?), but I still have to check how they work. I promised to do that but 
have not (yet). I just stumbled across this today myself, and it seems I got 
some transcripts merged via their polyA or polyT sequence ... I think Bastien 
and I agreed they should be masked during assembly and only unleashed during 
final "basecalling", but it seems it is not like that. Or maybe they were not 
masked because there were a few non-A or non-T characters in their sequence, 
so they were not recognized by mira. So we are back to what I said in the 
beginning. ;-)

> I ran all of the reads through seqclean and assumed that this would fix the 
> above.  In hindsight, I probably should have been a bit more ruthless.

Call me too picky but trimpoly, seqclean etc. just don't work well, definitely 
not on NextGen data.

> Any advice on how to clean up the dataset a bit more or should I just drop 
> the reads?

Although you haven't said what the ratio of causes is among those 
megahub-causing reads, just extract all reads matching rRNA and then drop all 
remaining reads that appear in the megahubs file. This will not help if the 
megahubs are caused by polyA/T-tails (often they are, but your 0.1052271917% 
tells me that is not your case).
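
Dropping the listed reads from the input could be sketched as below. The 
layout of the megahubs file is an assumption here: the code takes the first 
whitespace-separated token of each non-empty, non-comment line as a read 
name, so check the actual file before relying on it. It also assumes 
four-line FASTQ records; the function names are hypothetical.

```python
def load_read_names(names_path):
    """Collect read names from a text file.

    Assumption: the read name is the first whitespace-separated token on
    each line; empty lines and lines starting with '#' are skipped.
    """
    names = set()
    with open(names_path) as fh:
        for line in fh:
            fields = line.split()
            if fields and not line.startswith("#"):
                names.add(fields[0])
    return names


def drop_reads(in_fastq, out_fastq, unwanted):
    """Write all FASTQ records whose name is not in `unwanted`.

    Assumption: four lines per record, header of the form '@name ...'.
    Returns a (kept, dropped) tuple of record counts.
    """
    kept = dropped = 0
    with open(in_fastq) as src, open(out_fastq, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            name = record[0][1:].split()[0]
            if name in unwanted:
                dropped += 1
            else:
                dst.writelines(record)
                kept += 1
    return kept, dropped
```

With paired-end data the mate of each dropped read should be removed as well, 
which this sketch does not do.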

> 
> Be kind, I'm a transcriptome assembling newbie!  ;)

We all are. ;-) Transcriptomes are always hard. Use -AL:mo=80 (you cannot go 
much higher for Illumina reads) and maybe, after some testing on your data, 
increase -AL:mrs= to 90 or 94.
Martin

> 
> Thanks in advance!
> 
> Victoria
> 
> -----Original Message-----
> From: mira_talk-bounce@xxxxxxxxxxxxx [mailto:mira_talk-bounce@xxxxxxxxxxxxx] 
> On Behalf Of Martin MOKREJŠ
> Sent: 25 June 2014 17:02
> To: mira_talk@xxxxxxxxxxxxx
> Subject: [mira_talk] Re: MIRA run failing due to megahubs
> 
> Hi Victoria,
>   you have to look into the reads listed in *megahubs* file placed in 
> MergedAssembly/MergedAssembly_d_tmp/
> subdirectory. Most likely you have too many polyA-tails in your data or rRNA 
> contamination or unremoved adapters/MIDs. Or simple crap like [CA]n repeats 
> causes that (sometimes they are authentic, sometimes not, in either case hard 
> to be useful). Provided it is so few reads, just drop them from your input 
> dataset.
>   Transcriptomic datasets need proper adapter removal and trimming ... see 
> below my signature. ;) Or maybe you get back to me later once you get 40kb 
> long contigs in your assemblies ... :)) Martin
> 
> Offord, Victoria wrote:
>> Hi,
>>
>> My MIRA run on a parasite transcriptome keeps ending with:
>>
>> You have 0.1052271917% of your reads as megahubs.
>> You have set a maximum allowed ratio of: 0.0000000000
> 
> --
> Martin Mokrejs, Ph.D.
> 454 / IonTorrent / Evrogen MINT / Clontech SMART adapter/artifact removal 
> (... too many protocols to name here) 
> http://www.bioinformatics.cz/software/supported-protocols/

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html
