[mira_talk] Re: MIRA error

  • From: David Sannino <drs357@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Wed, 27 May 2015 15:27:37 -0400

This is my manifest file if that helps:

# Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should assemble a genome de novo in accurate mode
project = typeB_assembly1
job = genome,denovo,accurate
parameters = COMMON_SETTINGS -GE:not=0 -AS:nop=4,sd=on,bts=3600 -CL:asjdc=on -HS:nrr=100,mnr=yes -SB:sbuip=2 -NW:cmrnl=warn,acv=105 \
SOLEXA_SETTINGS -AS:mrpc=10 -CO:msr=on,msrme=0 -CL:bsqc=on -AL:bip=100,egp=yes -CO:mgqrt=25 \
SANGER_SETTINGS -AS:mrpc=2 -DP:ure=on -CL:pvlc=on,qc=on,bsqc=on -CL:bsqc=on -AL:bip=100,egp=yes -CO:mgqrt=25 \
PCBIOHQ_SETTINGS -AS:mrpc=5 -CL:pec=on -CL:bsqc=on -AL:bip=100,egp=yes -CO:mgqrt=25
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
readgroup = paired-end illumina
data = /workdir/sannino/trimmednew2typeb_R1.fastq /workdir/sannino/trimmednew2typeb_R2.fastq
technology = solexa
template_size = 100 300
segment_placement = ---> <---
segment_naming = solexa
rename_prefix = @HWI-ST397:193:D093UACXX:3:1101: ill_
# Sanger data
readgroup = Sanger sequences
data = /workdir/sannino/typeb_s.fastq
technology = sanger
segment_naming = sanger
# Pacbio data
readgroup = Pacbio
data = /workdir/sannino/typeb_pbccs.fastq
technology = pcbiohq
rename_prefix = m140320_025629_42146_c100610822550000001823111606241443 pbcss_

On Wed, May 27, 2015 at 3:00 PM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:

On 27 May 2015, at 19:06 , David Sannino <drs357@xxxxxxxxxxx> wrote:

Hi, I am performing a hybrid assembly of a bacterial genome that may have
some contamination in it. The following error came up during the assembly:
Total megahubs: 2


Hello David,

when trying to get help, sending along the manifest you used helps to
assess the complexity of the problem. For the remainder of this answer, I
will assume you used a pretty standard setup (“genome,denovo,accurate”)
without additional parameters.

The number of megahubs found (2) is extremely low; one could quite safely
tell MIRA to ignore it via -SK:mmhr=5 (or similar).
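
As a minimal sketch of that suggestion, the switch could be added to the
parameters line of the manifest. The value 5 is the one mentioned above.
Placing the switch in the COMMON_SETTINGS section is an assumption here and
should be checked against the parameter reference of the installed MIRA
version; any switches already present on that line would stay as they are.

```
# Sketch only: raise the allowed megahub ratio so 2 megahubs no longer stop the run
# (assumes -SK:mmhr is accepted under COMMON_SETTINGS)
parameters = COMMON_SETTINGS -SK:mmhr=5
```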

But please read on.

[...]
This is a bacterial genome from environmental DNA, so there is the
potential for contamination in it. I've made sure there are no Illumina
adapters left over in my Illumina data, and I am pretty sure there are no
vectors in my Sanger data (I analyzed it with FastQC and there weren't any
over-represented sequences). I am not sure which is the best way to proceed
with the assembly. It is around a 3.6 Mb genome and is being constructed
from Illumina HiSeq, Sanger, and PacBio data.


If this were only Illumina and Sanger, having megahubs would ring all my
alarm bells and I would send you on a hunt for possible contaminants. But
you wrote you have PacBio mixed in, and I know that the megahub filter
sometimes triggers on long PacBio reads containing a couple of repetitive
elements. If you do not have megahub warnings when using Illumina and
Sanger alone, just tell MIRA to ignore the warning via -SK:mmhr.
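
The Illumina-plus-Sanger check described above could be set up with a
reduced manifest along these lines. This is a sketch built from the
readgroups in the manifest posted in this thread, with the PacBio readgroup
left out. The project name is hypothetical, and the parameters line is
omitted so that MIRA would run with its defaults for this job; the original
switches (minus the PCBIOHQ_SETTINGS part) could be carried over instead.

```
# Hypothetical check run: same Illumina and Sanger data, no PacBio readgroup
project = typeB_illumina_sanger_check
job = genome,denovo,accurate

readgroup = paired-end illumina
data = /workdir/sannino/trimmednew2typeb_R1.fastq /workdir/sannino/trimmednew2typeb_R2.fastq
technology = solexa
template_size = 100 300
segment_placement = ---> <---
segment_naming = solexa

readgroup = Sanger sequences
data = /workdir/sannino/typeb_s.fastq
technology = sanger
segment_naming = sanger
```

If this run finishes without a megahub warning, the PacBio reads are the
likely trigger and ignoring the warning in the full assembly is reasonable.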

In case you want to know a bit more about repetitive or contaminant
elements in your data, please have a look at chapter 11 (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard),
and there especially at the section dealing with the repeat info file (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_the_readrepeats_info_file)
and the following section (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data)
which is conveniently named “Pipeline to find worst contaminants or repeats
in sequencing data” and has a step-by-step walkthrough.

B.

PS: I’ll repeat myself here: do not pre-treat (trim/clip) your Illumina
data with 3rd party software when assembling with MIRA: the algorithms in
MIRA really are better than anything people normally try out. Inquisitive
natures on this list tested this with their data and if I remember right,
they all agreed.
