[mira_talk] Re: HiSeq data problem with Mira

From: "Bastien Chevreux" <bach@xxxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Thu, 3 Nov 2011 13:07:14 +0100 (MET)

From: Mcnally, Alan<alan.mcnally@xxxxxxxxx>
> I have sent Bastien the first 10,000 lines of my FASTQ input
> file ... hopefully he can solve it for me ... it's getting
> very frustrating

The file you sent me is anything but a valid FASTQ file. Here's an excerpt, the 
first few lines (I've Xed out the bases):

------------------------ snip --------------------------
@HWI-ST300:133:B0908ABXX:3:1101:1242:2117 1:Y:0:ATCACG
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
@HWI-ST300:133:B0908ABXX:3:1101:1242:2117 2:Y:0:ATCACG
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
####################################################################################################
+
####################################################################################################
------------------------ snip --------------------------

That file is so plainly wrong that I do not know where to start. A FASTQ record 
is four lines (@name, bases, '+', qualities), but here:
1) the first read (HWI-ST300:133:B0908ABXX:3:1101:1242:2117 1:Y:0:ATCACG) has 
no quality line
2) the second read (HWI-ST300:133:B0908ABXX:3:1101:1242:2117 2:Y:0:ATCACG) has 
two quality lines
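
A check like the one I did by eye can be sketched in a few lines of Python. 
This is only an illustration, not part of MIRA: the function name 
check_fastq_records is made up, and it assumes the simple layout of exactly 
four lines per read (no wrapped sequences).

```python
def check_fastq_records(lines):
    """Return (line_number, message) pairs for 4-line FASTQ structure problems.

    Assumes one line each for header, bases, '+' separator and qualities.
    Line numbers are 1-based.
    """
    lines = [line.rstrip("\n") for line in lines]
    problems = []
    if len(lines) % 4 != 0:
        problems.append((len(lines), "line count is not a multiple of 4"))
    # Walk over complete groups of four lines.
    for start in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = lines[start:start + 4]
        if not header.startswith("@"):
            problems.append((start + 1, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((start + 3, "expected '+' separator line"))
        if len(bases) != len(quals):
            problems.append((start + 4, "bases and qualities differ in length"))
    return problems
```

Run on the eight lines of the excerpt above, it flags line 3 (a second '@' 
header where the '+' should be) and line 5 (a '+' where a header should be).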

May I *strongly* suggest you leave out any script (I have this ominous 
shuffleseq.pl in mind you wrote about) and simply cat the first n million lines 
of your data sets together? That's how I do it and it works like a charm :-)
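
If you would rather not do that on the shell, the same idea can be sketched in 
Python. The function name and its parameters are made up for illustration; the 
only real constraint is that the number of lines taken from each file must be 
a multiple of 4, so that no read is cut in half.

```python
from itertools import islice


def concat_first_lines(input_paths, output_path, n_lines):
    """Write the first n_lines lines of each input file to output_path, in order."""
    if n_lines % 4 != 0:
        raise ValueError("n_lines must be a multiple of 4 to keep FASTQ records whole")
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                # islice stops after n_lines lines without reading the rest of the file.
                out.writelines(islice(handle, n_lines))
```

For example, concat_first_lines(["read1.fastq", "read2.fastq"], "subset.fastq", 
4000000) would take the first million reads of each file (the file names are 
placeholders).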

Furthermore, those are CASAVA 1.8 data sets, in which Illumina changed the read 
naming scheme: you will notice that the first two reads actually have the 
very same name. For this atrocity the developers at Illumina should be nailed, 
crucified, tarred, feathered, shot into space and subsequently dissected by 
slimy green purple aliens with 18 tentacles and a single eyeball (in this 
order, and preferably alive during the whole procedure).

Anyway, while the development version of MIRA now knows these things, you need 
to rename your reads for all public versions out there. More specifically, you 
need to append the first character of the comment (the 1 or 2 after the blank) 
to the read name, separated by a slash. The comment itself stays as it is.

I.e., a line reading 

@HWI-ST300:133:B0908ABXX:3:1101:1242:2117 1:Y:0:ATCACG

must be changed to

@HWI-ST300:133:B0908ABXX:3:1101:1242:2117/1 1:Y:0:ATCACG

and a line with

@HWI-ST300:133:B0908ABXX:3:1101:1242:2117 2:Y:0:ATCACG

to

@HWI-ST300:133:B0908ABXX:3:1101:1242:2117/2 2:Y:0:ATCACG
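
A minimal Python sketch of that renaming, in case it helps. The function names 
are made up, and it assumes strict four-line records so that only every fourth 
line (the header) is touched; quality lines may legitimately start with '@' 
and must not be rewritten.

```python
def rename_casava18_header(header):
    """Turn '@name 1:Y:0:ATCACG' into '@name/1 1:Y:0:ATCACG'.

    Headers without a comment, or whose name already ends in /1 or /2,
    are returned unchanged.
    """
    text = header.rstrip("\n")
    name, sep, comment = text.partition(" ")
    if not sep or not comment or name.endswith(("/1", "/2")):
        return text
    # First character of the comment is the read number in CASAVA 1.8 names.
    return f"{name}/{comment[0]} {comment}"


def rename_fastq_lines(lines):
    """Yield lines with every 4th line (the header) renamed; assumes 4-line records."""
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        yield rename_casava18_header(text) if index % 4 == 0 else text
```

The lines it yields have no trailing newline, so add one again when writing 
them to the output file.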

Best,
  Bastien

