[mira_talk] Re: Mapping assembly problems: autorefine and low coverage

  • From: Andrej Benjak <abenjak@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 08 Jun 2015 15:37:16 +0200

Hi Elizabeth,

Which version of MIRA are you using?

a


On 06/08/2015 03:03 PM, Elizabeth Latham wrote:

Hi Andrej
Thanks for the response!
It really hates autopairing. This happens everytime:

Mon Oct 14 17:32:36 CEST 2013
On: Linux vk10464 2.6.32-41-generic #94-Ubuntu SMP Fri Jul 6 18:00:34
UTC 2012 x86_64 GNU/Linux
Compiled in boundtracking mode.
Compiled in bugtracking mode.
Compiled with ENABLE64 activated.
Runtime settings (sorry, for debug):
Size of size_t : 8
Size of uint32 : 4
Size of uint32_t: 4
Size of uint64 : 8
Size of uint64_t: 8
Current system: Linux compute-1-9.local 2.6.32-220.el6.x86_64 #1 SMP
Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

Fatal error (may be due to problems of the input data or parameters):
******************************************************************************
* In file /data/ealatham/manifest.conf: line *
*
* autopairing has keyword 'autopairing' which is not recognised. *

*******************************************************************************

->Thrown: void Manifest::slurpInManifest(stringstream & mfin, const
string & origsource, bool resume))
->Caught: main

Aborting process, probably due to error in the input data or parametrisation.
Please check the output log for more information.
For help, please write a mail to the mira talk mailing list.
Subscribing / unsubscribing to mira talk, see:
//www.freelists.org/list/mira_talk

CWD: /data/ealatham
Thank you for noticing that this is *NOT* a crash, but a
controlled program stop.
Failure, wrapped MIRA process aborted.

Manifest:

ealatham@wsgi-hpc ealatham]$ cat manifest.conf
# First part: defining some basic things
project = Paeni_unpaired2
job = genome,mapping,accurate
parameters = -GE:not=12 -DI:trt=/state/partition1/tmp/ealatham -NW:cmrnl=no
#Reference sequence
readgroup
is_reference
data = ./paeni_r.fa
technology = text
strain = Paeni_r_genome
# The second part defines the sequencing data MIRA should load and assemble
readgroup = Newimport_unpaired
autopairing
data = ./R1.fastq ./R2.fastq
technology = solexa

Is there something I'm missing?! Thanks for the advice on compressed
files (I'll definitely do that in the future)


On Mon, Jun 8, 2015 at 2:49 AM, Andrej Benjak <abenjak@xxxxxxxxx> wrote:
Hi Elizabeth,

I think you should use 'infoonly' in mapping, especially if the reference is
another species.
Even better, just use 'autopairing' and forget about the template size or
placement, like this:

# The second part defines the sequencing data MIRA should load and assemble
readgroup = Newimport_unpaired
autopairing
data = ./R1.fastq ./R2.fastq
technology = solexa


mira will default to 'infoonly' because it's a mapping project, and will
infer the insert size and placement automatically, for information purpose
only.

Does the assembly improve with that?

Cheers,
Andrej

BTW, your fastq files don't need to be uncompressed, they take too much
space. Mira can handle compressed files.
For example:

data = ./R1.fastq.gz ./R2.fastq.gz






On 06/07/2015 06:41 PM, Elizabeth Latham wrote:
Greetings,

This is my first time posting so apologies for any faux pas. I'm
trying to map >35mil illumina reads to the genome of its closest
relative based on 16s sanger sequencing. However, I'm having problems
with the autorefine option. Inputting the following manifest results
in the attached error. I'm using cluster computing fyi

# First part: defining some basic things
project = Paeni_unpaired2
job = genome,mapping,accurate

parameters = -GE:not=8 -DI:trt=/state/partition1/tmp/ealatham -NW:cmrnl=no
#Reference sequence
readgroup
is_reference
data = ./paeni_r.fa
technology = text
strain = Paeni_r_genome

# The second part defines the sequencing data MIRA should load and
assemble
readgroup = Newimport_unpaired
data = ./R1.fastq ./R2.fastq
technology = solexa
template_size = 50 1000 autorefine
segment_placement = ---> <---
segment_naming = solexa

Second part of my question, if I eliminate the template size, segment
placement, and naming I can map the reads against the reference genome
but I get really low coverage (>10% of the reads map to the reference)
despite being in the same genera. I'm not sure if this is related to
the lack of template size info or if there is something else that I am
missing in the manifest that would cause this problem. If I assemble
the reads on CLC genomic workbench I get around five >300,000bp
contigs so there is nothing intrinsically wrong with the data I think.

Thanks for helping!
Elizabeth


--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html


--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html

Other related posts: