[mira_talk] Re: Genome assembly with Illumina MPs only

  • From: "Rohit Kolora" <rohit@xxxxxxxxxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 10 Nov 2015 19:48:34 +0100


As Adrian mentioned, k-mer analysis is the pivotal step after checking read quality.

The genome size appears to be around 380 Mb (from your 0.4 pg * ~950 Mb/pg).
You definitely have >100X coverage in MP reads, so you should do a
de-duplication step and also remove the chimeras (represented by low k-mer
counts; I am not sure which MIRA options set a k-mer coverage threshold),
then try to assemble as single-end reads.
Since your data is sub-sampled to 10%, you now have only about 11X
coverage. MIRA recommends 60X coverage, and anything below 40X makes
assembly more difficult with Illumina data.
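
The arithmetic behind these figures can be sketched as follows. This is only a rough check, not a measurement: the conversion factor of ~950 Mb per pg is the one used above (978 Mb per pg is the more commonly cited value), and the function names are made up for illustration. The read count and read length are the ones Bernhard reported.

```python
# Rough coverage estimate from a C-value and read counts.
# Assumption: ~950 Mb per pg, as used in this post; 978 Mb per pg is the
# more commonly cited conversion factor.
MB_PER_PG = 950


def genome_size_bp(c_value_pg, mb_per_pg=MB_PER_PG):
    """Convert a C-value in picograms to an approximate genome size in bp."""
    return c_value_pg * mb_per_pg * 1_000_000


def coverage(n_reads, read_len_bp, genome_bp):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_bp


genome = genome_size_bp(0.4)                          # about 380 Mb
full = coverage(418_000_000, 100, genome)             # all MP reads
subset = coverage(418_000_000 * 0.10, 100, genome)    # 10% subsample

print(f"genome ~{genome / 1e6:.0f} Mb, full ~{full:.0f}X, subset ~{subset:.0f}X")
```

With these inputs the full data set comes out at roughly 110X and the 10% subset at roughly 11X, which is well below the 40X to 60X range mentioned above.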

I am not sure why mate-pair was done with such high coverage and
paired-end was not considered; usually it is done the other way round.

--
Regards,
Rohit


Why not try k-mer analysis to get an idea of genome size and coverage? Try
KmerGenie or Jellyfish with k=23.
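
A minimal sketch of how a k-mer histogram can be turned into a genome size estimate, assuming a two-column "multiplicity count" text histogram such as the one Jellyfish's histo command writes. The estimator is the simple "total k-mers divided by peak depth" rule, not what KmerGenie does internally; the function names and the min_depth cut-off for error k-mers are assumptions chosen for illustration.

```python
def parse_histogram(lines):
    """Parse 'multiplicity count' lines into a dict {multiplicity: count}."""
    histogram = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            histogram[int(parts[0])] = int(parts[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (k-mer depth at the main peak).

    K-mers seen fewer than min_depth times are treated as sequencing errors
    or chimeras and ignored. min_depth is a hypothetical cut-off and has to
    be chosen by looking at the actual histogram.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: many low-count (error) k-mers and a main peak at depth 10.
toy = parse_histogram(["1 1000", "2 200", "9 10", "10 50", "11 10"])
size, peak = estimate_genome_size(toy)
print(size, peak)
```

Note that the peak is a k-mer depth, which is lower than the base coverage by a factor of (L - k + 1) / L for reads of length L.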

On Tuesday, 10 November 2015, Bernhard Egger <shadowcrust@xxxxxxxxx>
wrote:

Hi Rohit,

I don't know the exact genome size of my species, but congeners average
at about 0.4 pg.

I've got about 418 million reads of 100 bp each, so if the above
assumption is correct, it would be about 100x coverage with the MP data
only.

The 10% subset of the data as single reads is still chugging along with
the current assembly attempt, having proceeded further than the full-data
MP attempt before (it's now at the _int_posmatchf_pass.1.bin.reduced
stage).

Cheers and thanks for your help,

Bernhard

On 11/09/2015 01:48 PM, Rohit Kolora wrote:

Hi Bernhard,

I don't think sub-sampling would be needed, unless you have >80X
coverage. Also you need to know how much of the genome would still be
represented after your sub-sampling.

Just try to remove redundancy in the data. Since chimeras are inevitable
in mate-pair data, they need to be handled well too :)
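
A minimal sketch of what removing redundancy can mean in its simplest form: dropping reads whose sequence has already been seen. This assumes plain 4-line FASTQ records and treats reads as single-end; dedicated de-duplication tools usually look at both mates of a pair, and this sketch does nothing about chimeras. The function names are made up for illustration, and holding every sequence in a set will need a lot of memory for hundreds of millions of reads.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def dedup_by_sequence(records):
    """Keep only the first record for each exact read sequence."""
    seen = set()
    for header, seq, qual in records:
        if seq in seen:
            continue
        seen.add(seq)
        yield header, seq, qual


# Tiny in-memory example: r2 is an exact duplicate of r1 and is dropped.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
unique = list(dedup_by_sequence(read_fastq(demo)))
print([header for header, _, _ in unique])
```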

Cheers

Dear Rohit,

thank you for your help. I have stopped the previous MP assembly and
restarted with a subsample (10%) of the MP data as a single read
assembly, let's see what it can do.
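
For reference, a 10% random subsample of this kind can be sketched as below. This is not necessarily how the subsample in this thread was made; the function name and the fixed seed are assumptions for illustration. Each record is kept independently with the given probability, so the result is approximately, not exactly, 10% of the input.

```python
import random


def subsample(records, fraction, seed=0):
    """Yield each record with probability `fraction` (reproducible via seed)."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


# Works on any iterable of records, e.g. parsed FASTQ entries.
kept = list(subsample(range(10000), 0.10, seed=42))
print(len(kept))
```

Because the reads are assembled as single reads here, R1 and R2 do not have to be kept in sync; for a paired assembly both files would need the same records selected.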

Many thanks,

Bernhard

On 11/06/2015 12:07 PM, Rohit Kolora wrote:

Hi,

If you have just mate-pair data then combine the two files in any order
and feed them as single-end data.
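
Combining the two files can be as simple as concatenating them. A minimal sketch, assuming uncompressed FASTQ files; the function name and the output file name in the usage comment are made up, while the input names are the ones from the manifest in the original post. Whether identical read names in R1 and R2 cause problems downstream is not checked here.

```python
import os
import shutil


def concatenate_fastq(inputs, output):
    """Write all input files, in the given order, into one output file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # Make sure the next file starts on a fresh line.
                src.seek(0, os.SEEK_END)
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")


# Example call (output name is hypothetical):
# concatenate_fastq(
#     ["Metazoan-MP-R1-all.fastq", "Metazoan-MP-R2-all.fastq"],
#     "Metazoan-MP-single.fastq",
# )
```

The combined file would then go into a readgroup without the autopairing line, so that the reads are treated as unpaired.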

Mate pairs should only be used for scaffolding; for contig building
these reads can be used as single reads, without pairing information.
But I am not sure how big your genome is and how much would be missing,
as paired-end merging helps a lot to generate longer contigs.


Rohit

Hello,

I'm attempting a genome assembly with only Illumina mate pair reads,
with insert sizes of about 5 kb. It's a fairly large amount of data
(2x 50 GB fastq files). Unfortunately, something went wrong with the
corresponding paired end reads; they do not pass quality filters and
cannot be assembled at all.

This is the manifest file for the assembly:

project = Metazoan_MP
job = genome,denovo,accurate
parameters = -GE:not=128
parameters = -GE:mps=1000

readgroup = DataIlluminaPairedLibMP
autopairing
data = Metazoan-MP-R1-all.fastq Metazoan-MP-R2-all.fastq
technology = solexa

It is running on an HPC where I've reserved 128 cores and 1024 GB of
RAM.

The MP-only genome assembly has now been running for more than two
weeks, and only the first checkpoint has been passed (12 days ago).
Since then, two files are constantly updated, sometimes growing,
sometimes shrinking:

Metazoan_MP_int_posmatchc_pass.1.bin
Metazoan_MP_int_posmatchf_pass.1.bin

After extending the walltime two times already, the HPC administrator
asked me if there was hope that this assembly could be finished
successfully at all.

Mira in principle works like a charm, I've done several Illumina-only
and Illumina-454 hybrid transcriptome assemblies.

Can you help me in determining if this MP-only assembly may be completed
within another two weeks, or if there is little hope for an assembly
with these raw data?

Many thanks for your help,

Bernhard

--
You have received this mail because you are subscribed to the mira_talk
mailing list. For information on how to subscribe or unsubscribe, please
visit http://www.chevreux.org/mira_mailinglists.html


--
Dr. Bernhard Egger FLS
Group leader
Institute of Zoology, University of Innsbruck
Technikerstr. 25
6020 Innsbruck
Austria

http://www.uibk.ac.at/zoology/staff/egger/

http://www.uibk.ac.at/zoology/research/regeneration/

