[mira_talk] Re: MIRA Error Message

  • From: Jeremiah Davie <jdavie@xxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Mon, 7 Dec 2009 10:02:51 -0500

Hi Bastien,
I reviewed the .fasta file created from the two .sff files and the read names were indeed identical. Even more interesting, the sequence for each duplicated read name is identical as well: there is no difference in clipping and no evidence of even a single-nucleotide change. I'll have to inquire as to why two seemingly identical .sff files were generated; I'm sure there had to be some reason for it.

Thanks for the help! Take care,
- Jeremiah
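
When every duplicated name carries the same sequence, as reported here, one copy of each read can simply be dropped. The following is a minimal Python sketch of that step, not part of MIRA or sff_extract; the function name `drop_identical_duplicates` is hypothetical, and it assumes the reads have already been parsed into (name, sequence) pairs. The FASTA quality and XML files would need the same treatment.

```python
def drop_identical_duplicates(records):
    """Yield the first copy of each read name from (name, sequence) pairs.

    Raises ValueError if a later copy of a name has a different sequence,
    because such a case needs a manual decision rather than silent removal.
    """
    seen = {}
    for name, seq in records:
        if name not in seen:
            seen[name] = seq
            yield name, seq
        elif seen[name] != seq:
            raise ValueError(f"read {name!r} appears with different sequences")


if __name__ == "__main__":
    # Hypothetical read names, for illustration only.
    reads = [("readA", "ACGT"), ("readB", "GGCC"), ("readA", "ACGT")]
    for name, seq in drop_identical_duplicates(reads):
        print(name, seq)
```
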
On Dec 4, 2009, at 3:44 PM, Bastien Chevreux wrote:

On Friday, 4 December 2009, Jeremiah Davie wrote:
Hi Everyone,
   Quick question: In my assembly of 454 data (older GS20 data from
standard chemistry), my output log lists that it found 290768 reads
extracted from two *.sff files, yet when those reads are loaded I
receive 145339 instances of the following error message: "Error: read
name EERNOEK******* present multiple times in readpool!", where the
asterisks denote a different read name.

Hello Jeremiah,

Well, if it's not a bug in MIRA (and I don't think it is), then you really have sequences named identically in the two SFF files. Whether the sequences themselves are identical remains to be checked (and you should do that).

Since this is almost half of
the total loaded reads, I'm curious what might have caused this.
Presumably sff_extract used the names found within the *.sff files for these reads, but it seems strange to me that the two *.sff files would
contain all the same names.

I think I remember reading in some Roche documentation that they changed the
read naming at some point to ensure more diversity or something similar.

On the other hand, it's absolutely possible to recombine SFF files into different SFF files with the sff-tools from the Roche pipeline, so that may be an explanation if you have identical reads in several SFF files. Or maybe one
SFF has 'clipped' versions of reads from the other SFF.

Did I do something wrong in the
sff_extract process that might have caused this without generating an
error message?

No, you didn't do anything wrong. sff_extract (when used on unpaired data) takes the read names verbatim from the SFF file, so they're in there already.

(On paired-end data, there's some name mangling, but that only consists of
appending suffixes, so nothing could go wrong there either.)

Any thoughts? Thanks in advance everyone! - Jeremiah

Quickest thing to do: take a few examples where MIRA complained about duplicate read names and check the sequences by eye in the FASTA file. You should be able to decide pretty quickly whether the sequences are the same or perhaps a clipped subset of each other. Then you'll need to decide which version you
want to take.
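
With over a hundred thousand duplicated names, the same check can also be scripted. The sketch below is a hedged illustration in Python, not a MIRA or sff_extract feature: the function names `parse_fasta` and `classify_duplicates` are hypothetical, the read name is assumed to be the first whitespace-delimited token of each header, and "clipped" is approximated as one sequence being a substring of the first copy seen.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the read name is the first token of the header.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify_duplicates(records):
    """Map each duplicated read name to 'identical', 'clipped' or 'different'."""
    by_name = defaultdict(list)
    for name, seq in records:
        # Compare case-insensitively, since case may only encode clipping.
        by_name[name].append(seq.upper())

    report = {}
    for name, seqs in by_name.items():
        if len(seqs) < 2:
            continue  # name occurs once, nothing to compare
        first, rest = seqs[0], seqs[1:]
        if all(s == first for s in rest):
            report[name] = "identical"
        elif all(s in first or first in s for s in rest):
            # Every other copy contains, or is contained in, the first copy.
            report[name] = "clipped"
        else:
            report[name] = "different"
    return report


if __name__ == "__main__":
    # Hypothetical reads, for illustration only.
    demo = [">readA", "ACGT", ">readB", "ACGTACGT", ">readA", "ACGT", ">readB", "GTAC"]
    for name, verdict in classify_duplicates(parse_fasta(demo)).items():
        print(name, verdict)
```

In practice the lines would come from the combined FASTA file opened with `open(...)`; names reported as "different" are the ones that would need renaming.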

In the most improbable case that the reads have identical names but the sequences have nothing in common ... well, you would need to massage the read names a bit. For that, I would extract both SFF files separately, add a prefix to each read name in each FASTA, FASTA quality and XML file, using a different prefix per SFF file (that's one "sed" command :-) and then put everything back together into one FASTA, one FASTA
quality and one XML file.
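
The renaming step described above could look roughly like this in Python instead of sed. This is only a sketch under stated assumptions: the function names `prefix_fasta_names` and `prefix_xml_names` are hypothetical, FASTA and FASTA quality files are assumed to keep the read name directly after the ">" of each header line, and the XML file is assumed to store names in `<trace_name>` elements (check the actual files before relying on this).

```python
def prefix_fasta_names(lines, prefix):
    """Prepend prefix to the read name on every FASTA (or FASTA quality) header."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line  # sequence or quality values, left untouched


def prefix_xml_names(lines, prefix):
    """Prepend prefix to read names in the XML file.

    Assumption: names appear as <trace_name>NAME</trace_name>.
    """
    tag = "<trace_name>"
    for line in lines:
        yield line.replace(tag, tag + prefix)


if __name__ == "__main__":
    # Hypothetical content and prefix, for illustration only.
    fasta = [">readA\n", "ACGT\n"]
    xml = ["<trace_name>readA</trace_name>\n"]
    print("".join(prefix_fasta_names(fasta, "sff1_")), end="")
    print("".join(prefix_xml_names(xml, "sff1_")), end="")
```

Running each SFF file's three outputs through these functions with a different prefix per SFF file keeps the names consistent across FASTA, quality and XML before the files are concatenated.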

Hope that helps, and I'd be curious to know what it turned out to be.

Regards,
 Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html




