[mira_talk] MIRA error

  • From: Chris Hoefler <hoeflerb@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Wed, 27 May 2015 21:33:30 -0500

There are a few options, but the easiest is probably the bash5tools.py
script for filtering raw reads,
https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst

The recommended parameters for self-correction are,
--readType subreads
--minLength 500
--minReadScore 0.80
--outType fastq

Use the .bax.h5 files as input. This will give you a fastq file that you
can then feed into PBcR.

PBcR will default to self-correction. This usually performs better than
Illumina correction, but my recent experiments with version 8.2, which uses
MHAP gave pretty good results too. The main reason to use Illumina
correction would be if you couldn't get sufficient coverage at a desirable
read length (ex: your longest 25X is >8 kb, but you don't have sufficient
short reads to self-correct and get a low yield). PBcR is pretty fast, in
general, so you can try both and compare them.

The pipeline will take your corrected reads and feed them to Celera
Assembler by default, but you can stop it after error-correction and use
Mira instead. Once you get contigs from Mira, PBcR bundles the pbalign and
quiver utilities for polishing. Mapping of Illumina reads can also clean up
remaining errors. I usually do both.

Apropos to Bastien's comment, I always use the default settings with Mira
when assembling PacBio data. If you get more than one contig, it will
usually be a small enough number that you can investigate the contig breaks
manually to understand what happened.



On Wed, May 27, 2015 at 4:22 PM, David Sannino <drs357@xxxxxxxxxxx> wrote:

I think I will try your suggestions. Can you use the raw reads for
correction, or do you have to use filtered subreads? Do you think
correcting them against themselves would be better than using the Illumina
data? What program do you use to filter the reads or is that part of the
PBcR pipeline?

Thanks,
Dave

On Wed, May 27, 2015 at 4:31 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

Do you have the raw reads? If you have the filtered subreads, you can
correct them using,
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR

You can also use your Illumina data to correct with the MHAP overlapper.
Either way, I guarantee you will get better results using corrected reads
over CCS reads. What size library did you sequence?

You can use the filtered subreads from the default pipeline output, but
you may get more out of your data if you go back to the raw reads and
filter them yourself. The correction can be affected by the filtering
parameters.

On Wed, May 27, 2015 at 3:07 PM, David Sannino <drs357@xxxxxxxxxxx>
wrote:

I only have the CCS reads, I don't have self corrected long reads, just
the filtered subreads. The filtered subreads give about 100x coverage.

Dave

On Wed, May 27, 2015 at 3:51 PM, Chris Hoefler <hoeflerb@xxxxxxxxx>
wrote:

Out of curiosity, what coverage of PacBio reads do you have? I have
usually had the best results assembling PacBio reads alone and then mapping
Illumina reads over the contigs for polishing. Do you only have CCS reads,
or do you also have self-corrected long reads?

On Wed, May 27, 2015 at 2:27 PM, David Sannino <drs357@xxxxxxxxxxx>
wrote:

This is my manifest file if that helps:

# Example for a manifest describing a denovo assembly with
# several kinds of sequencing libraries
# First part: defining some basic things
# In this example, we just give a name to the assembly
# and tell MIRA it should map a genome in accurate mode
project = typeB_assembly1
job = genome,denovo,accurate
parameters = COMMON_SETTINGS -GE:not=0 -AS:nop=4,sd=on,bts=3600
-CL:asjdc=on -HS:nrr=100,mnr=yes -SB:sbuip=2 -NW:cmrnl=warn,acv=105 \
SOLEXA_SETTINGS -AS:mrpc=10 -CO:msr=on,msrme=0 -CL:bsqc=on
-AL:bip=100,egp=yes -CO:mgqrt=25 \
SANGER_SETTINGS -AS:mrpc=2 -DP:ure=on -CL:pvlc=on,qc=on,bsqc=on
-CL:bsqc=on -AL:bip=100,egp=yes -CO:mgqrt=25 \
PCBIOHQ_SETTINGS -AS:mrpc=5 -CL:pec=on -CL:bsqc=on -AL:bip=100,egp=yes
-CO:mgqrt=25
# The second part defines the sequencing data MIRA should load and
assemble
# The data is logically divided into "readgroups"
# now the Illumina paired-end data
readgroup = paired-end illumina
data = /workdir/sannino/trimmednew2typeb_R1.fastq
/workdir/sannino/trimmednew2typeb_R2.fastq
technology = solexa
template_size = 100 300
segment_placement = ---> <---
segment_naming = solexa
rename_prefix = @HWI-ST397:193:D093UACXX:3:1101: ill_
# Sanger data
readgroup = Sanger sequences
data = /workdir/sannino/typeb_s.fastq
technology = sanger
segment_naming = sanger
# Pacbio data
readgroup = Pacbio
data = /workdir/sannino/typeb_pbccs.fastq
technology = pcbiohq
rename_prefix =
m140320_025629_42146_c100610822550000001823111606241443 pbcss_

On Wed, May 27, 2015 at 3:00 PM, Bastien Chevreux <bach@xxxxxxxxxxxx>
wrote:

On 27 May 2015, at 19:06 , David Sannino <drs357@xxxxxxxxxxx> wrote:

Hi I am performing a hybrid assembly of a bacterial genome that may
have some contamination in it. The following error came up during the
assembly:
Total megahubs: 2


Hello David,

when trying to get help, sending along the manifest you used helps to
assess the complexity of the problem. For the remainder of this answer, I
will assume you used a pretty standard setup (“genome,denovo,accurate”)
without additional parameters.

The number of megahubs found (2) is extremely low, one could quite
safely tell MIRA to ignore it via -SK:mmhr=5 (or similar).

But please read on.

[...]
This is a bacterial genome from environmental DNA so there is the
potential for contamination in it. I've made sure there are no illumina
adapters left over in my illumina data, and I am pretty sure there are no
vectors in my sanger data (I analyzed it with fastqc and there wasn't any
over-represented sequences). I am not sure which is the best way to
proceed
with the assembly. It is around a 3.6MB genome and is being constructed
from Illumina HiSeq, Sanger, and PacBio data.


If this were only Illumina and Sanger, having megahubs would ring all
my alarm bells and I would send you on a hunt for possible contaminants.
But you wrote you have PacBio mixed in, and I know that the megahub
filter
sometimes triggers on long PacBio reads containing a couple of repetitive
elements. If you do not have megahub warnings when using Illumina and
Sanger alone, just tell MIRA to ignore the warning via -SK:mmhr.

In case you want to know a bit more about repetitive or contaminant
elements in your data, please have a look at chapter 11 (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_hard),
and there especially at the section dealing with the repeat info file (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_the_readrepeats_info_file)
and the following section (
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_hard_pipeline_to_find_worst_contaminants_or_repeats_in_sequencing_data)
which is conveniently named “Pipeline to find worst contaminants or
repeats
in sequencing data” and has a step-by-step walkthrough,

B.

PS: I’ll repeat myself here: do not pre-treat (trim/clip) your
Illumina data with 3rd party software when assembling with MIRA: the
algorithms in MIRA really are better than anything people normally try
out.
Inquisitive natures on this list tested this with their data and if I
remember right, they all agreed.





--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128





--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128





--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128


--
Chris Hoefler, PhD
Postdoctoral Research Associate
Straight Lab
Texas A&M University
2128 TAMU
College Station, TX 77843-2128

Other related posts: