[mira_talk] Re: Request for help: demystifying Illumina paired-end format identifiers ... /4 ???

  • From: Jason Steen <j.steen2@xxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Tue, 8 Oct 2013 02:50:12 +0000

Bastien

With dual indexing, four reads are generated from a Nextera library.  
However, the index reads are not made available to the end user; they are 
only used internally by the pipeline to determine which index group a read 
falls into.  The index is then written into the FASTQ header line after the 
final colon.

@HWI-ST1243:121:D1AF9ACXX:1:1101:1371:1992 1:N:0:CGTACTAGAGAGTCGA
@HWI-ST1243:121:D1AF9ACXX:1:1101:1371:1992 2:N:0:CGTACTAGAGAGTCGA

I haven't looked at every dataset, but this looks like a 1 and 2 to me.  Are 
both your FASTQ files definitely full length?  Is every read a 4?
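
One quick way to answer the "is every read a 4?" question is to tally the 
read numbers found in the header comments of a file. The following is only a 
minimal sketch, not part of MIRA or any Illumina tool. It assumes plain 
4-line FASTQ records (no wrapped sequence lines) and a comment section of the 
form <read number>:<filtered>:<control number>:<index>; the function and key 
names are made up for illustration.

```python
from collections import Counter


def parse_header(line):
    """Split a CASAVA 1.8 style FASTQ header line into name and comment fields."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError(f"not a FASTQ header: {line!r}")
    # The read name and the comment section are separated by the first space.
    name, _, comment = line[1:].partition(" ")
    fields = comment.split(":")
    if len(fields) != 4:
        raise ValueError(f"unexpected comment section: {comment!r}")
    read_number, is_filtered, control_number, index = fields
    return {
        "name": name,
        "read_number": read_number,
        "is_filtered": is_filtered,
        "control_number": control_number,
        "index": index,  # empty string if nothing follows the final colon
    }


def tally_read_numbers(lines):
    """Count read numbers, assuming every 4th line (starting at 0) is a header."""
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            counts[parse_header(line)["read_number"]] += 1
    return counts
```

Passing an open file handle to tally_read_numbers() should work, since it 
only iterates over lines. A file with a single key in the result (for 
example only "4") contains just one kind of read.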


Jason

---

Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph : +61 7 3365 4957
www.ecogenomics.org



From: Bastien Chevreux <bach@xxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Saturday, 5 October 2013 6:38 PM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Request for help: demystifying Illumina paired-end format 
identifiers ... /4 ???

*sigh* Illumina, I hate you.

I'm currently investigating why MIRA seemed to completely fail to
decently assemble a MiSeq Nextera lib in a recent publication
("Efficient and accurate whole genome assembly and methylome profiling
of E. coli", http://www.biomedcentral.com/1471-2164/14/675/abstract).

Thankfully, the authors gave me two of their data sets and I am finding
the following kind of read naming/comment in their files:

- in the first  file: @HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 1:N:0:
- in the second file: @HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 4:N:0:

The names and comments in the first file look OK, but what the hell is
the "4" in the comment section of the second file??? So far I'd seen
only "1" and "2" used to distinguish the two reads of a pair. The Wikipedia
entry on FASTQ also only knows 1 and 2. Googling around, I found the
following document:

  
http://supportres.illumina.com/documents/myillumina/354c68ce-32f3-4ea4-9fe5-8cb2d968616c/casava1_8_changes.pdf

which helpfully states:

   <read number> will typically be 1 or 2, but the field can support
other values. (For example, certain indexing formats lead to 3 reads.)

Fine, so Illumina says there can be up to 3 reads (but does not say
how they are numbered). So why am I seeing a value of 4?

Could anyone enlighten me?

B.

PS: Of course MIRA looked only for 1 and 2, and not seeing any 2, it
treated the data as unpaired. I've built many sanity checks into MIRA
so far, but a check for data sets that contain no pairs where pairs
could be expected is not among them. Guess what I'm going to program
this weekend ... :-(
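
A sanity check of this kind could be sketched as follows. This is not MIRA's 
actual code, only a hypothetical illustration: it groups header lines by read 
name and counts how many names occur with at least two different read 
numbers, instead of looking for the literal values 1 and 2. The function 
names and the expect_pairs flag are assumptions made for the example, and 
keeping every read name in memory would be too costly for large data sets.

```python
from collections import defaultdict


def pairing_report(header_lines):
    """Group FASTQ header lines by read name and summarise their read numbers."""
    seen = defaultdict(set)
    for line in header_lines:
        # Drop the leading "@", then split name from comment at the first space.
        name, _, comment = line.strip()[1:].partition(" ")
        # The read number is the first colon-separated field of the comment.
        seen[name].add(comment.split(":", 1)[0])
    paired = sum(1 for numbers in seen.values() if len(numbers) >= 2)
    return {
        "templates": len(seen),
        "paired": paired,
        "read_numbers": sorted(set().union(*seen.values())),
    }


def unexpectedly_unpaired(report, expect_pairs=True):
    """True if pairs were expected but no name occurs with two read numbers."""
    return expect_pairs and report["templates"] > 0 and report["paired"] == 0
```

Fed the two header lines quoted above, the report lists one template with 
read numbers "1" and "4" and counts it as paired, so the unusual "4" alone 
would not cause the data to be treated as unpaired.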

