[mira_talk] Re: Mapping assembly problems: autorefine and low coverage

  • From: Andrej Benjak <abenjak@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 08 Jun 2015 09:49:19 +0200

Hi Elizabeth,

I think you should use 'infoonly' in mapping, especially if the reference is another species.
Even better, just use 'autopairing' and forget about the template size or placement, like this:

# The second part defines the sequencing data MIRA should load and assemble
readgroup = Newimport_unpaired
autopairing
data = ./R1.fastq ./R2.fastq
technology = solexa


mira will default to 'infoonly' because it's a mapping project, and will infer the insert size and placement automatically, for information purpose only.

Does the assembly improve with that?

Cheers,
Andrej

BTW, your fastq files don't need to be uncompressed, they take too much space. Mira can handle compressed files.
For example:

data = ./R1.fastq.gz ./R2.fastq.gz





On 06/07/2015 06:41 PM, Elizabeth Latham wrote:

Greetings,

This is my first time posting so apologies for any faux pas. I'm
trying to map >35mil illumina reads to the genome of its closest
relative based on 16s sanger sequencing. However, I'm having problems
with the autorefine option. Inputting the following manifest results
in the attached error. I'm using cluster computing fyi

# First part: defining some basic things
project = Paeni_unpaired2
job = genome,mapping,accurate

parameters = -GE:not=8 -DI:trt=/state/partition1/tmp/ealatham -NW:cmrnl=no
#Reference sequence
readgroup
is_reference
data = ./paeni_r.fa
technology = text
strain = Paeni_r_genome

# The second part defines the sequencing data MIRA should load and assemble
readgroup = Newimport_unpaired
data = ./R1.fastq ./R2.fastq
technology = solexa
template_size = 50 1000 autorefine
segment_placement = ---> <---
segment_naming = solexa

Second part of my question, if I eliminate the template size, segment
placement, and naming I can map the reads against the reference genome
but I get really low coverage (>10% of the reads map to the reference)
despite being in the same genera. I'm not sure if this is related to
the lack of template size info or if there is something else that I am
missing in the manifest that would cause this problem. If I assemble
the reads on CLC genomic workbench I get around five >300,000bp
contigs so there is nothing intrinsically wrong with the data I think.

Thanks for helping!
Elizabeth


--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html

Other related posts: