[mira_talk] Re: MIRA Error Message

From: Jeremiah Davie <jdavie@xxxxxxxxxxx>
To: mira_talk@xxxxxxxxxxxxx
Date: Mon, 7 Dec 2009 10:02:51 -0500

Hi Bastien,

I reviewed the .fasta file created from the two .sff files and the read names were indeed identical. Even more interesting is that the sequence for each duplicated read name is identical. No difference in clipping or any evidence of single-nucleotide changes. I'll have to inquire as to why two seemingly identical .sff files were generated; I'm sure there had to be some reason for it. Thanks for the help! Take care, - Jeremiah

On Dec 4, 2009, at 3:44 PM, Bastien Chevreux wrote:

On Friday, 04 December 2009, Jeremiah Davie wrote:
Hi Everyone,
   Quick question: In my assembly of 454 data (older GS20 data from
standard chemistry), my output log lists that it found 290768 reads
extracted from two *.sff files, yet when those reads are loaded I
receive 145339 instances of the following error message: "Error: read
name EERNOEK******* present multiple times in readpool!", where the
asterisks denote a different read name.
Hello Jeremiah,
well, if it's not a bug in MIRA (and I don't think it is), then you really have sequences named identically in the two SFF files. Whether the sequences themselves are identical remains to be checked (and you should do that).
Since this is almost half of
the total loaded reads, I'm curious what might have caused this?
Presumably sff_extract used the names found within the *.sff files for
these reads, but it seems strange to me that the two *.sff files would
contain all the same names.
I think I remember reading in some Roche documentation that they changed the
read naming at some point to ensure more diversity or something similar.
On the other hand, it's absolutely possible to recombine SFF files into
different SFF files with the sff-tools from the Roche pipeline, so that may be
an explanation if you have identical reads in several SFF files. Or maybe one
SFF has 'clipped' versions of reads from the other SFF.
Did I do some thing wrong in the
sff_extract process that might have caused this without generating an
error message?
No, you didn't do anything wrong. sff_extract (when used on unpaired data)
takes the read names verbatim from the SFF file, so they're in there already.
(On paired-end data, there's some name mangling, but that consists of
appending postfixes, so there's no way something could go wrong there either.)
Any thoughts? Thanks in advance everyone! - Jeremiah
Quickest thing to do: take a few examples where MIRA complained about double
read names and check the sequences by eye in the FASTA file. You should be
able to decide pretty quickly whether the sequences are the same or perhaps a
clipped subset of each other. Then you'll need to decide which version you
want to take.
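For more than a handful of reads, the same check can be scripted. The following is only a minimal sketch, assuming the combined FASTA file is small enough to hold in memory and that the read name is the first whitespace-delimited token of each header line. The function names and the three labels are made up for illustration; they are not part of MIRA or sff_extract.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the read name is the first token of the header.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def classify_duplicates(records):
    """Label every read name that occurs more than once.

    "identical"      : all copies have the same sequence
    "clipped subset" : the shortest copy is contained in every other copy
    "different"      : anything else
    """
    by_name = defaultdict(list)
    for name, seq in records:
        # Compare case-insensitively, in case clipped bases are lower-cased.
        by_name[name].append(seq.upper())

    report = {}
    for name, seqs in by_name.items():
        if len(seqs) < 2:
            continue
        shortest = min(seqs, key=len)
        if all(s == seqs[0] for s in seqs):
            report[name] = "identical"
        elif all(shortest in s for s in seqs):
            report[name] = "clipped subset"
        else:
            report[name] = "different"
    return report
```

It could be run as `classify_duplicates(read_fasta(open("reads.fasta")))`, where `reads.fasta` is a placeholder for the combined FASTA file.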
In the most improbable case that the reads have identical names but the
sequences have nothing in common ... well, you would need to massage the read
names a bit. For that, I would extract both SFF files separately, add a prefix
to each read name in each FASTA, FASTA quality and XML file (that's one "sed"
command :-) and then put everything back together into one FASTA, one FASTA
quality and one XML file.
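The renaming step could be sketched as follows, in Python rather than sed. This is an illustration under assumptions, not a tested recipe: it assumes read names directly follow the ">" in FASTA and FASTA quality headers, and that the XML file stores them in `<trace_name>` elements. The function names, the prefix "runA_" and the read name "EERNOEK01" are hypothetical.

```python
import re


def prefix_fasta_names(lines, prefix):
    """Return FASTA (or FASTA quality) lines with `prefix` added to each read name."""
    out = []
    for line in lines:
        if line.startswith(">"):
            # Header line: insert the prefix right after the ">".
            out.append(">" + prefix + line[1:])
        else:
            # Sequence or quality line: leave untouched.
            out.append(line)
    return out


def prefix_xml_names(xml_text, prefix):
    """Add `prefix` to the text of every <trace_name> element.

    Assumption: read names appear as <trace_name>NAME</trace_name>.
    """
    return re.sub(r"<trace_name>\s*", lambda m: m.group(0) + prefix, xml_text)
```

For example, `prefix_fasta_names([">EERNOEK01\n", "ACGT\n"], "runA_")` gives `[">runA_EERNOEK01\n", "ACGT\n"]`. Using a different prefix for each SFF file keeps the names unique once the files are concatenated, and applying the same prefix to the FASTA, quality and XML files keeps the three in sync.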

Hope that helps, and I'd be curious to know what it turned out to be.

Regards,
 Bastien

--
You have received this mail because you are subscribed to the mira_talk mailing list. For information on how to subscribe or unsubscribe, please visit http://www.chevreux.org/mira_mailinglists.html




