Dear Markiyan,
Thank you for your help, and sorry for the late answer.
The Illumina libraries were prepared with the Mate Pair Library Prep
Kit, not with the Nextera protocol (sequencing was done 3 years ago).
Your point 0 (deduplication): I've got 48.65% of sequences remaining
after deduplication, so deduplication would indeed be helpful, agreed.
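For reference, exact-duplicate removal boils down to something like the
following (a minimal Python sketch; the toy reads are made up, and for
real data a dedicated deduplication tool is the way to go):

```python
def deduplicate(reads):
    """Drop exact sequence duplicates, keeping the first occurrence."""
    seen = set()
    kept = []
    for name, seq in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept

# Toy example: r2 is an exact duplicate of r1 and gets dropped.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTAA")]
kept = deduplicate(reads)
print(f"{100 * len(kept) / len(reads):.1f}% remaining")
```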
I've already done PE sequencing; I'll check the sequences once more to
see whether they are usable, as I don't recall anymore what exactly was
wrong with them (the k-mer content looks slightly off, and I recall not
being able to assemble them).
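For the record, the kind of k-mer content check I mean is conceptually
just a spectrum count like this (a toy Python sketch; real tools such
as FastQC or jellyfish do this properly and at scale):

```python
from collections import Counter

def kmer_spectrum(reads, k=5):
    """Count k-mer occurrences across a list of read sequences."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy reads: a heavily skewed spectrum (one k-mer dominating) is the
# sort of "slightly off" signal worth investigating.
spec = kmer_spectrum(["ACGTACGTAC", "TTTTTTTTTT"], k=4)
print(spec.most_common(3))
```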
Cheers,
Bernhard
On 11/06/2015 03:24 PM, Markiyan Samborskyy wrote:
Dear Bernhard,
> I'm attempting a genome assembly, with only Illumina mate pair reads,
> with insert sizes of about 5 kb. It's a fairly large amount of data
> (2x 50 GB fastq files). Unfortunately, something went wrong with the
> corresponding paired end reads; they do not pass quality filters and
> cannot be assembled at all.

So there are 2 major problems here.
1. Experimental design:
For a good de novo assembly, the read length and the evenness of
representation across GC content are the most critical factors.
Assuming that this library was made using a Nextera mate pair
protocol, there are 10-15 cycles of PCR involved in the library
preparation, which is likely to almost eliminate reads with very
high AT or GC content.
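The GC-evenness point can be checked directly on the reads: compute a
per-read GC fraction and bucket it into a histogram; missing mass at
the AT- or GC-rich ends hints at PCR bias. A toy Python sketch (QC
tools plot the same distribution):

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_histogram(reads, bins=10):
    """Bucket reads by GC fraction to eyeball evenness of representation."""
    hist = [0] * bins
    for seq in reads:
        idx = min(int(gc_fraction(seq) * bins), bins - 1)
        hist[idx] += 1
    return hist
```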
RK> Mate pairs should only be used for scaffolding, as for contig building
RK> these reads can be used without pairing information as single ones. But
RK> not sure how big your genome is and how much would be missing as
RK> paired-end merging helps a lot to generate longer contigs.
This is caused by the high number of PCR cycles producing "near exact
duplicates", which, compounded by the chimeras (~2-20% of the mates),
confuse the assembly algorithm so that it gets stuck in nearly
indefinite loops.
2. Data (pre)processing. The mate pair assembly results can vary quite
a lot depending on the exact processing tools and settings used for
adapter trimming, read overlapping and linker removal.
Also, it is a good idea to remove the artifacts introduced by the PCR
step (duplicated reads).
If dealing with unknown beast(s)/a new dataset, ALWAYS assemble a
subset of the reads first (like 1/100th - 1/10th).
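When subsetting, sample whole read pairs rather than single reads so
the mate information stays intact - roughly like this (a toy Python
sketch; the pair tuples and the fraction are illustrative):

```python
import random

def subsample_pairs(pairs, fraction, seed=42):
    """Keep roughly `fraction` of read pairs, deciding per pair so that
    both mates are kept or dropped together."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    return [pair for pair in pairs if rng.random() < fraction]

# Toy example: keep ~1/10th of 1000 made-up pairs.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
subset = subsample_pairs(pairs, 0.1)
```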
Also, try the good shotgun data first to assess the repeat/contaminant
composition.
In order to recover from the situation:
0. Run fastqc on the input dataset and look at the % remaining after
deduplication; if less than 85% - DEDUPLICATE!
1. Make a good-quality PCR-free shotgun library and sequence it on
the MiSeq in 2x300 or 2x250 mode. It would help with the wide range
of GC content in the input sample. I would dedicate a whole run to it.
(Also, you can try PacBio if available at a reasonable price.)
2. Assemble a subset/the whole thing, and look at the coverage distribution.
3. Screen out reads from very high coverage (repeat) contigs (like
vector backbone).
4. Now try adding some matepair data...
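Step 3 above amounts to partitioning contigs by mean coverage - a toy
Python sketch (the contig names and threshold are illustrative; pick
the cutoff from the coverage distribution in step 2):

```python
def screen_contigs(contig_coverage, max_cov):
    """Split contig names into normal-coverage and suspiciously high
    coverage (likely repeats/vector backbone) lists."""
    keep = [name for name, cov in contig_coverage.items() if cov <= max_cov]
    repeats = [name for name, cov in contig_coverage.items() if cov > max_cov]
    return keep, repeats

# Toy example: contig "c2" at 900x is flagged as a likely repeat.
keep, repeats = screen_contigs({"c1": 30.0, "c2": 900.0}, max_cov=200)
```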
> Can you help me in determining if this MP-only assembly may be
> completed within another two weeks, or if there is little hope for an
> assembly with these raw data?

I would not hope much for timely completion in this case.
PS: Reserve/get yourself a dedicated server(s)/nodes.
PPS: With NGS it is quite easy to try to assemble too much data, which
can lead to geological timescales.../ bad results...