[mira_talk] Re: Genome assembly with Illumina MPs only - can be done only for 50%GC bacterial genomes.

  • From: Bernhard Egger <shadowcrust@xxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Tue, 10 Nov 2015 17:12:28 +0100

Dear Markiyan,

thank you for your help - sorry for the late answer.

The Illumina libraries were prepared with the Mate Pair Library Prep Kit, not with the Nextera protocol (sequencing was done 3 years ago).

Your point 0 (deduplication): only 48.65% of the sequences remain after deduplication, so deduplication would be helpful, agreed.

I've already done PE sequencing. I'll check once more whether those sequences are usable, as I no longer recall exactly what was wrong with them (the k-mer content looks slightly off, and I recall not being able to assemble them).

Cheers,

Bernhard



On 11/06/2015 03:24 PM, Markiyan Samborskyy wrote:

Dear Bernhard,

I'm attempting a genome assembly, with only Illumina mate pair reads,
with insert sizes of about 5 kb. It's a fairly large amount of data (2x
50 GB fastq files). Unfortunately, something went wrong with the
corresponding paired end reads, they do not pass quality filters and
cannot be assembled at all.
So there are 2 major problems here.

1. Experimental design:
For good de novo assembly the read length and evenness of
representation across GC content are the most critical
factors.
Assuming that this library was made using a Nextera mate-pair
protocol, there are 10-15 cycles of PCR involved in the library
preparation, which is likely to almost eliminate reads from very
AT-rich or GC-rich regions.
RK> Mate pairs should only be used for scaffolding, as for contig building
RK> these reads can be used without pairing information as single ones. But
RK> not sure how big your genome is and how much would be missing as
RK> paired-end merging helps a lot to generate longer contigs.
This is caused by the high number of PCR cycles, which produces
"near exact duplicates"; compounded by the chimeras (~2-20% of the
mates), these confuse the assembly algorithm, and it gets stuck in
nearly endless loops.

2. Data (pre)processing. The mate-pair assembly results can vary quite
a lot depending on the exact processing tools and settings used for
adapter trimming, read overlapping and linker removal.
Also it is a good idea to remove the artifacts introduced by the PCR
step (duplicated reads).
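
A minimal sketch of the duplicate-removal step, in Python, assuming two uncompressed FASTQ files with exactly 4 lines per record and mates in the same order in both files. The function names (read_fastq, write_fastq, dedup_pairs) are made up for illustration and are not part of MIRA or any other tool. It only catches exact duplicates of a whole pair; the "near exact duplicates" mentioned above would need a dedicated tool.

```python
import hashlib
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a FASTQ handle with 4 lines per record."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(lines) < 4:
            return
        yield lines[0], lines[1], lines[3]


def write_fastq(handle, record):
    """Write one (header, sequence, quality) record back out as FASTQ."""
    header, seq, qual = record
    handle.write(f"{header}\n{seq}\n+\n{qual}\n")


def dedup_pairs(r1_in, r2_in, r1_out, r2_out):
    """Keep the first pair seen for each exact (mate 1, mate 2) sequence combination.

    Returns (total_pairs, kept_pairs).
    """
    seen = set()
    total = kept = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        total += 1
        # Store a 16-byte digest rather than both sequences to save memory.
        pair = (rec1[1] + "\t" + rec2[1]).encode()
        key = hashlib.blake2b(pair, digest_size=16).digest()
        if key in seen:
            continue
        seen.add(key)
        write_fastq(r1_out, rec1)
        write_fastq(r2_out, rec2)
        kept += 1
    return total, kept
```

With the returned counts, 100 * kept / total gives the percentage of pairs remaining after this exact-match deduplication. It is not necessarily the same number a QC tool reports, since such tools may estimate duplication differently.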

If dealing with an unknown beast/new dataset, ALWAYS assemble a subset
of the reads first (like 1/100th - 1/10th).
Also try the good shotgun data first to assess any repeat/contaminant
composition.
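
One way to draw such a subset could look like the Python sketch below. It assumes uncompressed FASTQ with 4 lines per record and mates in the same order in both files; the names read_records and subsample_pairs are hypothetical. Each pair is kept with probability `fraction`, and the same decision is applied to both mates so the two output files stay in sync.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield each FASTQ record as a list of its 4 raw lines."""
    while True:
        lines = list(islice(handle, 4))
        if len(lines) < 4:
            return
        yield lines


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Write a random subset of read pairs; returns (total_pairs, kept_pairs).

    fraction is the probability of keeping a pair, e.g. 0.01 to 0.1.
    A fixed seed makes the subset reproducible.
    """
    rng = random.Random(seed)
    total = kept = 0
    for rec1, rec2 in zip(read_records(r1_in), read_records(r2_in)):
        total += 1
        # One random draw per pair, so both mates are kept or dropped together.
        if rng.random() < fraction:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return total, kept
```

Because the selection is random, the kept fraction only approximates the requested one. For a first test assembly that is usually close enough.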

In order to recover from the situation:
0. Run FastQC on the input dataset and look at the percentage remaining
after deduplication; if it is less than 85% - DEDUPLICATE!

1. Make a good quality PCR-free shotgun library and sequence it
on the MiSeq in 2x300 or 2x250 mode. It would help with the wide GC
range of the input sample. I would dedicate a whole run to it.
(Also you can try PacBio if available at a reasonable price).
2. Assemble a subset/whole thing, look at the coverage distribution.
3. Screen out reads from very high coverage (repeats) contigs (like
vector backbone).
4. Now try adding some matepair data...
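
For steps 2 and 3, a rough Python sketch of how very high coverage contigs might be flagged. It assumes you already have a mapping of contig name to average coverage (for instance taken from the assembler's per-contig statistics) and a mapping of read name to contig. Both inputs, the function names and the cutoff of 5x the median coverage are illustrative assumptions, not values from this thread or from MIRA.

```python
from statistics import median


def high_coverage_contigs(coverage, factor=5.0):
    """Return the names of contigs whose coverage exceeds factor x the median.

    coverage: dict mapping contig name -> average coverage.
    factor:   hypothetical cutoff multiplier; tune it to your data.
    """
    if not coverage:
        return []
    cutoff = factor * median(coverage.values())
    return sorted(name for name, cov in coverage.items() if cov > cutoff)


def reads_to_screen_out(read_to_contig, flagged):
    """Return the names of reads that were placed in one of the flagged contigs."""
    flagged = set(flagged)
    return sorted(read for read, contig in read_to_contig.items() if contig in flagged)


# Toy numbers: one contig (e.g. a vector backbone) far above the rest.
coverage = {"c1": 40, "c2": 38, "c3": 45, "c4": 900, "c5": 42}
flagged = high_coverage_contigs(coverage)
reads = reads_to_screen_out({"r1": "c1", "r2": "c4", "r3": "c4"}, flagged)
```

It is worth looking at the flagged contigs by hand before discarding their reads, since genuine multi-copy regions of the genome would also end up on this list.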



Can you help me in determining if this MP-only assembly may be completed
within another two weeks, or if there is little hope for an assembly
with these raw data?
I would not hold out much hope for timely completion in this case.

PS: Reserve/get yourself a dedicated server(s)/nodes.

PPS: With NGS it is quite easy to try to assemble too much data, which
can lead to geological timescales and/or bad results...




--
You have received this mail because you are subscribed to the mira_talk mailing
list. For information on how to subscribe or unsubscribe, please visit
http://www.chevreux.org/mira_mailinglists.html
