[mira_talk] Re: MIRA error

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Wed, 27 May 2015 14:51:59 -0500

Out of curiosity, what coverage of PacBio reads do you have? I have usually
had the best results assembling PacBio reads alone and then mapping
Illumina reads over the contigs for polishing. Do you only have CCS reads,
or do you also have self-corrected long reads?

On Wed, May 27, 2015 at 2:27 PM, David Sannino <drs357@xxxxxxxxxxx> wrote:

This is my manifest file if that helps:

# Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
project = typeB_assembly1
job = genome,denovo,accurate
parameters = COMMON_SETTINGS -GE:not=0 -AS:nop=4,sd=on,bts=3600
-CL:asjdc=on -HS:nrr=100,mnr=yes -SB:sbuip=2 -NW:cmrnl=warn,acv=105 \
SOLEXA_SETTINGS -AS:mrpc=10 -CO:msr=on,msrme=0 -CL:bsqc=on
-AL:bip=100,egp=yes -CO:mgqrt=25 \
SANGER_SETTINGS -AS:mrpc=2 -DP:ure=on -CL:pvlc=on,qc=on,bsqc=on
-CL:bsqc=on -AL:bip=100,egp=yes -CO:mgqrt=25 \
PCBIOHQ_SETTINGS -AS:mrpc=5 -CL:pec=on -CL:bsqc=on -AL:bip=100,egp=yes
-CO:mgqrt=25
# The second part defines the sequencing data MIRA should load and assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
readgroup = paired-end illumina
data = /workdir/sannino/trimmednew2typeb_R1.fastq
/workdir/sannino/trimmednew2typeb_R2.fastq
technology = solexa
template_size = 100 300
segment_placement = ---> <---
segment_naming = solexa
rename_prefix = @HWI-ST397:193:D093UACXX:3:1101: ill_
# Sanger data
readgroup = Sanger sequences
data = /workdir/sannino/typeb_s.fastq
technology = sanger
segment_naming = sanger
# Pacbio data
readgroup = Pacbio
data = /workdir/sannino/typeb_pbccs.fastq
technology = pcbiohq
rename_prefix = m140320_025629_42146_c100610822550000001823111606241443
pbcss_

On Wed, May 27, 2015 at 3:00 PM, Bastien Chevreux <bach@xxxxxxxxxxxx>
wrote:

On 27 May 2015, at 19:06 , David Sannino <drs357@xxxxxxxxxxx> wrote:

Hi I am performing a hybrid assembly of a bacterial genome that may have
some contamination in it. The following error came up during the assembly:
Total megahubs: 2


Hello David,

when trying to get help, sending along the manifest you used helps to
assess the complexity of the problem. For the remainder of this answer, I
will assume you used a pretty standard setup (“genome,denovo,accurate”)
without additional parameters.

The number of megahubs found (2) is extremely low, one could quite safely
tell MIRA to ignore it via -SK:mmhr=5 (or similar).

But please read on.

[...]
This is a bacterial genome from environmental DNA so there is the
potential for contamination in it. I've made sure there are no illumina
adapters left over in my illumina data, and I am pretty sure there are no
vectors in my sanger data (I analyzed it with fastqc and there wasn't any
over-represented sequences). I am not sure which is the best way to proceed
with the assembly. It is around a 3.6MB genome and is being constructed
from Illumina HiSeq, Sanger, and PacBio data.


If this were only Illumina and Sanger, having megahubs would ring all my
alarm bells and I would send you on a hunt for possible contaminants. But
you wrote you have PacBio mixed in, and I know that the megahub filter
sometimes triggers on long PacBio reads containing a couple of repetitive
elements. If you do not have megahub warnings when using Illumina and
Sanger alone, just tell MIRA to ignore the warning via -SK:mmhr.

In case you want to know a bit more about repetitive or contaminant
elements in your data, please have a look at chapter 11 (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard),
and there especially at the section dealing with the repeat info file (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_the_readrepeats_info_file)
and the following section (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data)
which is conveniently named “Pipeline to find worst contaminants or repeats
in sequencing data” and has a step-by-step walkthrough,

B.

PS: I’ll repeat myself here: do not pre-treat (trim/clip) your Illumina
data with 3rd party software when assembling with MIRA: the algorithms in
MIRA really are better than anything people normally try out. Inquisitive
natures on this list tested this with their data and if I remember right,
they all agreed.





--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128

Other related posts: