[mira_talk] Re: Mate-pairs with MiSeq for MIRA

  • From: Jason Steen <j.steen@xxxxxxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Sun, 1 Sep 2013 04:28:10 +0000

The internal adaptor is listed in the Nextera paperwork, available here:
http://res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf
The ligation/sequencing adaptors are standard TruSeq adaptors.

I split out A and D mates from the raw reads as known mate fragments using 
Perl, and normally ~70-80% of the data falls into this category.  For my 
purposes, I discard most of the rest of the data, as I normally already have 
piles and piles of paired reads from which the first assembly was generated.


Cheers



From: Adrian Pelin <apelin20@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Sunday, 1 September 2013 7:57 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Mate-pairs with MiSeq for MIRA

Does anyone have any idea how to work out the adaptor of the mate pairs? I looked 
through my documents and couldn't find it.



On 8/31/2013 1:07 AM, Jason Steen wrote:
15% mates doesn’t sound right.  We expect 75-80% mates, and 20% 
PE/unassignable.  At less than 70%, I'd ask my library maker to repeat the library.

Are you pre-processing the raw data off the machine?  I don’t map anything 
until I have extracted known mates from the raw data (based on the presence of 
the internal adaptor in at least one read of the pair).
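
As a rough illustration of the extraction step described above, here is a minimal Python sketch (not the Perl script mentioned in this thread). It assumes strict 4-line FASTQ records, and it treats a pair as a known mate when the internal adaptor, or its reverse complement, occurs as an exact substring in at least one read. The function names are hypothetical, the adaptor sequence must be supplied by the caller (take it from the technote), and exact matching will miss adaptors with sequencing errors or adaptors truncated at a read end. No adaptor trimming is done here.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality)
Pair = Tuple[Read, Read]


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def read_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from strict 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def split_by_adaptor(pairs: Iterable[Pair],
                     adaptor: str) -> Tuple[List[Pair], List[Pair]]:
    """Split read pairs into (known_mates, rest).

    A pair is a known mate when the adaptor or its reverse complement
    is found (exact match) in at least one of the two reads.
    """
    motifs = (adaptor.upper(), revcomp(adaptor.upper()))
    mates: List[Pair] = []
    rest: List[Pair] = []
    for r1, r2 in pairs:
        hit = any(m in r[1].upper() for r in (r1, r2) for m in motifs)
        (mates if hit else rest).append((r1, r2))
    return mates, rest
```

With two synchronised FASTQ files open as f1 and f2, the pairs can be produced with zip(read_fastq(f1), read_fastq(f2)); the fraction len(mates) / (len(mates) + len(rest)) is then the number to compare against the 75-80% expectation.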



---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4040
www.ecogenomics.org



From: Robert Willows <robert.willows@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Saturday, 31 August 2013 2:33 PM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Mate-pairs with MiSeq for MIRA

I had similar problems with a poor 5 kbp mate-pair Illumina run. Only 15% were 
mate pairs, with the remainder being 300bp paired end. I wanted to go back and use 
this for scaffolding a 454 assembly obtained from a DNA prep done 18 months 
later. I'm aware this isn't ideal, but it's all I had.

Using bowtie with the MIRA 454 contigs as the reference, I separated the reads 
into 4 groups:
1. Single reads (no pair due to just Ns in the paired read).
2. Confirmed paired end 300bp mapped to 454 contigs from a MIRA assembly.
3. Confirmed mate pair 1000-6500bp mapped to the 454 contigs from a MIRA assembly.
4. Remaining paired reads.

I then ran all 4 groups together with the 454 data in a hybrid assembly in MIRA.
I used ?? for direction and a size range of 100-7000 in the manifest for group 4.
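
The four-way split above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the procedure actually used in this thread: the Hit record and classify_pair function are hypothetical names, the mapped span is taken as the outer distance between the two reads on the same contig, the mate-pair range 1000-6500 comes from the list above, and the paired-end window (200-400 here) is an assumed placeholder around the stated 300bp. Orientation is ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    """One read's alignment on a contig (hypothetical minimal record)."""
    contig: str
    start: int  # leftmost mapped position, 0-based
    end: int    # rightmost mapped position, exclusive


def classify_pair(seq1: str, seq2: str,
                  hit1: Optional[Hit], hit2: Optional[Hit],
                  pe_range: Tuple[int, int] = (200, 400),    # assumed window
                  mp_range: Tuple[int, int] = (1000, 6500)) -> str:
    """Assign a read pair to one of the four groups."""
    # Group 1: one read of the pair consists only of Ns.
    if set(seq1.upper()) <= {"N"} or set(seq2.upper()) <= {"N"}:
        return "single"
    # Groups 2 and 3 need both reads mapped to the same contig.
    if hit1 and hit2 and hit1.contig == hit2.contig:
        span = max(hit1.end, hit2.end) - min(hit1.start, hit2.start)
        if pe_range[0] <= span <= pe_range[1]:
            return "paired_end"
        if mp_range[0] <= span <= mp_range[1]:
            return "mate_pair"
    # Group 4: everything else (unmapped, split across contigs, odd sizes).
    return "remaining"
```

Pairs that land in "remaining" include those whose reads map to different contigs, which are exactly the ones that can carry scaffolding information later on.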

I then used gap5 to join the contigs in the new assembly based on the pairs which 
MIRA had tagged as being in different contigs.

This seems to work well and we are currently doing PCRs and some extra Sanger 
sequencing to verify the joins and problem areas.

It is a 5.5 Mb bacterial genome with a lot of transposons and repeats. There 
seemed to be a bit of transposon activity and a few rearrangements between DNA 
preps, which was interesting.

Robert

On 31/08/2013, at 9:37 AM, Jason Steen <j.steen@xxxxxxxxxxxxx> wrote:

I have some experience with Illumina mate pairs (the most recent Nextera version).

You really need to process the data for adaptor contamination and to split out 
good pairs. (I have a custom Perl script for this, which I believe works OK.)
There is very little you can do about the PE contamination (Life Technologies 
owns all the good patents for mate-pair library generation).
We use the data to super-scaffold existing assemblies using SSPACE, and it 
works pretty well (and also to confirm binning in metagenome assemblies).
The distribution of sizes is all to do with how awesome your library maker is, 
and whether you did gel-free or size selection.

I'd love to chat more to people who have other experiences with this sort of 
data.

Jason




From: Adrian Pelin <apelin20@xxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Saturday, 31 August 2013 3:29 AM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Re: Mate-pairs with MiSeq for MIRA

I was wondering about separation of reads.

Suppose I have Library X, and I extracted Subset A from it. Now, is 
there any way to obtain Subset B, that is, Library X - Subset A?
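
One way to get Subset B is a set difference on read names. The following is a minimal Python sketch, assuming strict 4-line FASTQ records and that a read's name is the header text up to the first whitespace; the function names are made up for illustration. If the two files use different pair suffixes (for example /1 and /2), the names would need normalising first.

```python
from typing import Iterable, Iterator, Set


def read_names(fastq_lines: Iterable[str]) -> Set[str]:
    """Collect read names from the header line of each 4-line record."""
    names = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0 and line.strip():
            # Drop the leading '@' and anything after the first whitespace.
            names.add(line[1:].split()[0])
    return names


def subtract_fastq(library_lines: Iterable[str],
                   subset_names: Set[str]) -> Iterator[str]:
    """Yield the lines of every library record whose name is NOT in the subset."""
    record = []
    for line in library_lines:
        record.append(line)
        if len(record) == 4:
            name = record[0][1:].split()[0]
            if name not in subset_names:
                yield from record
            record = []
```

Typical use would be to build the name set from the Subset A file, then stream the Library X file through subtract_fastq and write the yielded lines to a new file. Only the names of Subset A are held in memory.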



Sincerely,
Adrian

On 2013-08-30, at 12:02 PM, Chris Hoefler <hoeflerb@xxxxxxxxx> wrote:

>
> I got 13 M reads, 100bp in length. Quality wasn't too bad, ran it through 
> FastQC, and trimmed a bit on both sides due to GC content bias and ended up 
> with 65bp. Anyone think I shouldn't have?

> Me, me, me! *jump*

I second. We have fed a lot of MiSeq 250bp PE data to MIRA, and it always performs 
best when it does its own adapter clipping. It is actually very illustrative to 
look at the resulting assembly in gap5. When you look at the FastQC report, it is a 
bit shocking what it considers low-quality sequence, especially for the R2 
data. Based on what FastQC says, one would think only about 150bp of the 250bp 
were usable, but that is definitely not the case. MIRA was able to use the full 
250bp about 98% of the time, I would say, and in a few cases it had to trim 
back a few bp to 245bp or so. MIRA is really good at this.

> Just make sure you use the [-CL:pec] (proposed_end_clip) option of MIRA.

-CL:pec is on by default for Illumina reads, I think. The other option that 
will probably be useful is -CL:fpx174 to filter out the PhiX spike-in that 
usually accompanies Illumina runs.

> There is a big peak at about 300, 400bp, those are the short-pairs and a 
> smaller peak starting from about 1000 to 4000.

That sounds like a problem with your library and could be a reason why the 
assembly doesn't improve. I would speak with your sequencing provider and try 
to resolve that issue. Since MIRA needs to know whether the reads are "innie" 
or "outtie", it likely finds a mixture of the two confusing. I suppose you 
could try separating them, but I don't know how easy that would be.
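
If the pairs have already been mapped to existing contigs, one rough way to separate them is by relative orientation. The sketch below is only an illustration: pair_orientation is a hypothetical helper, it assumes both reads map to the same contig, and it takes each read's leftmost mapped coordinate plus a flag saying whether it aligned to the reverse strand. It does not handle overlapping reads specially.

```python
def pair_orientation(pos1: int, is_reverse1: bool,
                     pos2: int, is_reverse2: bool) -> str:
    """Classify a mapped pair as 'innie', 'outie' or 'same_direction'.

    pos1/pos2 are leftmost mapped coordinates on the same contig;
    is_reverse1/is_reverse2 are True for reads on the reverse strand.
    """
    if is_reverse1 == is_reverse2:
        return "same_direction"
    # Identify which read is on the forward strand.
    fwd_pos, rev_pos = (pos2, pos1) if is_reverse1 else (pos1, pos2)
    # Forward read to the left of the reverse read: reads point at each other.
    return "innie" if fwd_pos <= rev_pos else "outie"
```

Short-insert paired-end contamination would typically show up as innies and true mate pairs as outies, so writing the two classes to separate files gives two read groups that can each be given a single direction.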


On Fri, Aug 30, 2013 at 9:50 AM, Bastien Chevreux <bach@xxxxxxxxxxxx> wrote:
On Aug 30, 2013, at 16:13, Adrian Pelin <apelin20@xxxxxxxxx> wrote:
> Not sure how many people here have hands-on experience with mate-pair data. I 
> just got my first batch and think it's great. Unfortunately, I haven't run 
> into any guide that tells me what to look out for.
>
> I got 13 M reads, 100bp in length. Quality wasn't too bad, ran it through 
> FastQC, and trimmed a bit on both sides due to GC content bias and ended up 
> with 65bp. Anyone think I shouldn't have?

Me, me, me! *jump*

To quote the corresponding short section from the new manual:
  
http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_pd_illumina

----
Outside MIRA: for heavens' sake: do NOT try to clip or trim by quality 
yourself. Do NOT try to remove standard sequencing adaptors yourself. Just 
leave Illumina data alone! (really, I mean it).
MIRA is much, much better at that job than you will probably ever get ... and I 
dare to say that MIRA is better at that job than 99% of all clipping/trimming 
software existing out there. Just make sure you use the [-CL:pec] 
(proposed_end_clip) option of MIRA.
----

> Since they are meant for scaffolding, and not assembly, I expect 65bp to be 
> enough.

Longer reads are always better.


> Now the problem that I see when mapping these mate-pairs to existing contigs 
> using something like bowtie is that, when I visualize the alignment, I see 
> lots of paired-end contamination. The variation in the size of the insert is 
> also gigantic. There is a big peak at about 300-400bp (those are the 
> short pairs) and a smaller peak spanning about 1000 to 4000bp.

You are not the first from whom I hear that the distribution of Illumina 
mate-pairs isn't what one would expect.

B.



