Bastien,

With dual indexing, there are 4 reads generated from a Nextera library. However, the index reads are not made available to the end user; the pipeline only uses them internally to determine which index group each read falls into. The index is then written into the FASTQ header line after the final colon:

@HWI-ST1243:121:D1AF9ACXX:1:1101:1371:1992 1:N:0:CGTACTAGAGAGTCGA
@HWI-ST1243:121:D1AF9ACXX:1:1101:1371:1992 2:N:0:CGTACTAGAGAGTCGA

I haven't looked at every dataset, but these look like a 1 and a 2 to me.

Are both your FASTQ files definitely full length? Is every read a 4?

Jason

---
Dr Jason Steen
Research Officer
Australian Centre for Ecogenomics
Ph: +61 7 3365 4957
www.ecogenomics.org

From: Bastien Chevreux <bach@xxxxxxxxxxxx>
Reply-To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Date: Saturday, 5 October 2013 6:38 PM
To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
Subject: [mira_talk] Request for help: demystifying Illumina paired-end format identifiers ... /4 ???

*sigh* Illumina, I hate you.

I'm currently investigating why MIRA seemed to completely fail to decently assemble a MiSeq Nextera lib in a recent publication ("Efficient and accurate whole genome assembly and methylome profiling of E. coli", http://www.biomedcentral.com/1471-2164/14/675/abstract). Thankfully, the authors gave me two of their data sets and I am finding the following kind of read naming/comment in their files:

- in the first file: @HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 1:N:0:
- in the second file: @HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 4:N:0:

The names and comments in the first file look OK, but what the hell is the "4" in the comment section of the second file??? So far I'd seen only "1" and "2" to determine the two reads of a pair.
The Wikipedia entry on FASTQ also only knows 1 and 2. Googling around, I found the following document:

http://supportres.illumina.com/documents/myillumina/354c68ce-32f3-4ea4-9fe5-8cb2d968616c/casava1_8_changes.pdf

which helpfully states:

  <read number> will typically be 1 or 2, but the field can support other values. (For example, certain indexing formats lead to 3 reads.)

Fine, so Illumina says there can be up to 3 reads (but does not say how they are numbered). So why am I seeing a value of 4? Could anyone enlighten me?

B.

PS: Of course MIRA looked only for 1 and 2, and not seeing any 2 it treated the data as unpaired. I've built many sanity checks into MIRA so far, but a check for data sets that contain no pairs although pairs could be expected is not among them. Guess what I'm going to program this week-end ... :-(

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html
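
For readers who want to check their own files, here is a minimal Python sketch of the kind of sanity check mentioned in the PS. It is not MIRA's code. It assumes CASAVA 1.8-style headers as shown in this thread (a read name, a space, then a comment of the form <read number>:<is filtered>:<control number>:<index>), and the names parse_header, read_number_counts and unexpected_read_numbers are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class ReadComment(NamedTuple):
    name: str         # read name without the leading '@'
    read_number: int  # usually 1 or 2; the files in this thread also show 4
    is_filtered: str  # 'Y' or 'N'
    control: str      # control number field, kept as text
    index: str        # index sequence; empty if none was written


def parse_header(line: str) -> ReadComment:
    """Split one CASAVA 1.8-style FASTQ header line into its parts."""
    if not line.startswith("@"):
        raise ValueError(f"not a FASTQ header: {line!r}")
    # Name and comment are separated by the first space.
    name, _, comment = line[1:].rstrip("\r\n").partition(" ")
    fields = comment.split(":")
    if len(fields) != 4:
        raise ValueError(f"unexpected comment section: {comment!r}")
    read_number, is_filtered, control, index = fields
    return ReadComment(name, int(read_number), is_filtered, control, index)


def read_number_counts(headers: Iterable[str]) -> Counter:
    """Count how often each read number occurs in a set of header lines."""
    return Counter(parse_header(h).read_number for h in headers)


def unexpected_read_numbers(counts: Counter) -> List[int]:
    """Return the read numbers that are neither 1 nor 2, sorted."""
    return sorted(n for n in counts if n not in (1, 2))


if __name__ == "__main__":
    # The two header lines quoted in the original message.
    first = "@HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 1:N:0:"
    second = "@HWI-M01378:3:000000000-A2CB9:1:1101:17827:2093 4:N:0:"
    counts = read_number_counts([first, second])
    odd = unexpected_read_numbers(counts)
    if odd:
        print("read numbers other than 1 and 2 found:", odd)
    if 2 not in counts:
        print("no read number 2 seen; data may be wrongly treated as unpaired")
```

The sketch only reports what it sees. Whether a "4" should be treated as the second read of a pair is a decision for the assembler or the user.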