[mira_talk] Re: Fwd: ace file output from MIRA
- From: Lionel Guy <guy.lionel@xxxxxxxxx>
- To: mira_talk@xxxxxxxxxxxxx
- Date: Tue, 12 May 2009 21:04:43 +0200
Hi,
On Mon, 2009-05-11 at 23:05 +0200, Bastien Chevreux wrote:
> On Sonntag 10 Mai 2009 Björn Nystedt wrote:
> > [...]
> > The basic problem is that MIRA does not create the .phd files.
>
> Let me start my answer like Cato would have: "Ceterum censeo ACE esse
> delendam."
Well, when the Romans erased Carthage and sowed salt on its ruins, they
didn't need Carthage, did they? I'm not sure we can say the same about
consed and ace files... When will gap5 be up and working? ;)
Speaking of not-so-good formats, that reminds me: for some reason I
don't understand, I still can't see MIRA tags in consed. Following Jan's
advice, I created a mira_tags.conf file and linked to it with a
consed.fileOfTagTypes line. Now I can add the new tag types, but I can't
see the ones created by MIRA... any idea?
> On the other hand: reading the consed docs for version 19.0, I see this
> paragraph:
> 9.81) Delete the phd.ball link in edit_dir--it is also intended for
> obsolete versions of consed and may cause problems with the current
> version.
>
> Anyone who can comment on that?
As Sven mentioned, the phdBall files are now referenced from within the
ACE file, with something like:
WA{
phdBall newbler 080416:144002
../phdball_dir/phd.ball.1
}
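
A minimal sketch of how such a block could be generated, in Python. The
function name phdball_wa_block is made up for illustration, and reading
the third field as a YYMMDD:HHMMSS timestamp is my assumption from the
example above, not something taken from the consed docs:

```python
from datetime import datetime


def phdball_wa_block(phdball_path: str, program: str, when: datetime) -> str:
    """Format a WA{...} block pointing at a phdball file.

    Assumption: the last field of the first line is YYMMDD:HHMMSS.
    """
    stamp = when.strftime("%y%m%d:%H%M%S")
    return f"WA{{\nphdBall {program} {stamp}\n{phdball_path}\n}}\n"


if __name__ == "__main__":
    block = phdball_wa_block(
        "../phdball_dir/phd.ball.1", "newbler", datetime(2008, 4, 16, 14, 40, 2)
    )
    print(block, end="")
```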
> > [...]
> > so maybe a wish would be for MIRA to output a full
> > consed folder similar to what Newbler does?
>
> If I only had a description of the format needed, that probably would be a
> minor problem. I searched through the web to find one, but no luck. Or I am
> blind, which is also a possibility.
I think Björn was referring to the folder structure, which is, hmm...
not very well described. In general, what the authors propose is to
have an edit_dir subdirectory, plus other subdirectories that depend on
the project:
- sanger (sometimes these are also present in the solexa and 454
projects): chromat_dir (chromatograms), phd_dir (phd files)
- solexa: phdball_dir (phdball files, linked in ace file), solexa_dir
(containing the fastq files, for example)
- 454: phdball_dir, sff_dir (sff files)
I guess one of the phd_dir/solexa_dir/sff_dir is necessary, but I don't
know exactly how this is checked.
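
To make the layout above concrete, here is a rough Python sketch that
creates the subdirectories for one project type. The LAYOUTS table and
the function name make_consed_dirs are hypothetical, and the lists only
encode what I described above, not an official consed specification:

```python
from pathlib import Path

# Subdirectories per project type, as listed above (not an official spec).
LAYOUTS = {
    "sanger": ["edit_dir", "chromat_dir", "phd_dir"],
    "solexa": ["edit_dir", "phdball_dir", "solexa_dir"],
    "454": ["edit_dir", "phdball_dir", "sff_dir"],
}


def make_consed_dirs(project_root, kind):
    """Create the subdirectories for one project type and return their paths."""
    root = Path(project_root)
    created = []
    for name in LAYOUTS[kind]:
        subdir = root / name
        subdir.mkdir(parents=True, exist_ok=True)
        created.append(subdir)
    return created
```

The ACE file itself would then go into edit_dir.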
phdball files are just multiple phd files concatenated.
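
If that is right, a phdball can be split back into single phd records by
looking for the BEGIN_SEQUENCE / END_SEQUENCE lines described further
down. A small Python sketch (split_phdball is a made-up name, and I
assume plain concatenation with nothing meaningful between records):

```python
def split_phdball(text):
    """Split phdball text into a list of phd records (one string each)."""
    records = []
    current = None  # lines of the record being collected, or None
    for line in text.splitlines():
        if line.startswith("BEGIN_SEQUENCE"):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.strip() == "END_SEQUENCE":
                records.append("\n".join(current) + "\n")
                current = None
    return records
```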
As far as phd files are concerned, here is what I found in PHRED.DOC
(let me know if you want the complete file):
[...]
10. Phd files
Phred writes 'phd' files to store base calling information,
including the sequence, quality values, and peak locations,
when it is run with either the '-pd' or the '-p' options.
Phred creates phd files with the name '<chromat_name>.phd.1'
where the '1' at the end of the name is the version number
of the phd file for that chromatogram. It always writes
version '1' phd files, whereas 'consed' writes phd files
with higher version numbers (it increments the version
number each time it saves an edited read).
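
(Not part of PHRED.DOC.) The naming scheme can be sketched in Python as
follows. The helper next_phd_name is hypothetical; it only illustrates
the '<chromat_name>.phd.<version>' convention described above:

```python
import re


def next_phd_name(chromat_name, existing_names):
    """Return the phd file name with the next free version number."""
    pattern = re.compile(re.escape(chromat_name) + r"\.phd\.(\d+)")
    versions = [
        int(m.group(1))
        for name in existing_names
        if (m := pattern.fullmatch(name))
    ]
    # With no existing file this yields version 1, as phred would write.
    return f"{chromat_name}.phd.{max(versions, default=0) + 1}"
```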
The phd files phred creates begin with the line
BEGIN_SEQUENCE <sequence_name>
and end with the line
END_SEQUENCE
Enclosed between these lines phred writes a header data block,
which is enclosed between lines with the labels 'BEGIN_COMMENT'
and 'END_COMMENT', and a read data block, which is enclosed
between lines with the labels 'BEGIN_DNA' and 'END_DNA'. Thus
the overall file structure is (the lines are indented here)
    BEGIN_SEQUENCE <sequence_name>
      BEGIN_COMMENT
        [comment block]
      END_COMMENT
      BEGIN_DNA
        [read data block]
      END_DNA
    END_SEQUENCE
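
(Not part of PHRED.DOC.) As a rough Python sketch, a writer for this
skeleton could look like the following. format_phd is a made-up name; the
'label: value' header lines and the 'base quality peak' data lines follow
the descriptions in this excerpt, and leaving out blank lines and
indentation is my simplification:

```python
def format_phd(sequence_name, header, bases):
    """Build the text of one phd record.

    header: dict of label -> value, written as 'LABEL: value' lines.
    bases:  iterable of (base, quality, peak_location) tuples.
    """
    lines = [f"BEGIN_SEQUENCE {sequence_name}", "BEGIN_COMMENT"]
    lines += [f"{label}: {value}" for label, value in header.items()]
    lines += ["END_COMMENT", "BEGIN_DNA"]
    lines += [f"{base} {qual} {peak}" for base, qual, peak in bases]
    lines += ["END_DNA", "END_SEQUENCE"]
    return "\n".join(lines) + "\n"
```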
The header data consists of a number of lines where each line begins
with a label followed by a colon and one or more values. Currently,
the phd header has the following information
header entry                   description
------------                   -----------
CHROMAT_FILE: <string>         chromatogram file name
ABI_THUMBPRINT: <n>            an integer assigned by the ABI software
PHRED_VERSION: <string>        phred version used to create the file
CALL_METHOD: <string>          <string>="phred" unless run with '-nocall'
QUALITY_LEVELS: <n>            maximum quality value permitted
TIME: <string>                 the time and date the file was created
TRACE_ARRAY_MIN_INDEX: <n>     the index for the first trace point (always 0)
TRACE_ARRAY_MAX_INDEX: <n>     the index for the last trace point (npoints-1)
TRIM: <n1> <n2> <r>            read trim points. See (a) below.
TRACE_PEAK_AREA_RATIO: <r>     trace noise level. See (b) below.
CHEM: <string>                 chromatogram sequencing chemistry type
DYE: <string>                  chromatogram sequencing dye type
(a) the 'TRIM' values consist of the first and last bases in the high
    quality read segment (where the first base of the read is zero)
    and the error probability used to calculate the trim points. The
    modified Mott algorithm is used to calculate the trim points.
(b) the 'TRACE_PEAK_AREA_RATIO' is the ratio of the total uncalled-base
    peak area to the total called-base peak area within the high
    quality segment of the read. Thus this value indicates the level
    of the 'background' signal as a fraction of the called-base peak
    area. This value will tend to be relatively high for traces with
      o little or no signal
      o a mixture of inserts
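
(Not part of PHRED.DOC.) A minimal Python sketch of reading the header
block, assuming each header line is 'LABEL: value' as described above.
The function names parse_phd_header and parse_trim are hypothetical:

```python
def parse_phd_header(text):
    """Return the header entries of one phd record as a dict of strings."""
    header = {}
    in_comment = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN_COMMENT":
            in_comment = True
        elif stripped == "END_COMMENT":
            break
        elif in_comment and ":" in stripped:
            # Split on the first colon only; TIME values contain more colons.
            label, _, value = stripped.partition(":")
            header[label.strip()] = value.strip()
    return header


def parse_trim(value):
    """Split a TRIM value into (first_base, last_base, error_probability)."""
    first, last, prob = value.split()
    return int(first), int(last), float(prob)
```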
The read data block consists of one line for each read base. Each
line has the three values
  o the called base (a, c, g, t, or n)
  o the quality value assigned to the base
  o the location of the called-base peak in the trace
The values are separated from each other by a single space.
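
(Not part of PHRED.DOC.) The read data block can be parsed along these
lines in Python. parse_phd_dna is a made-up name, and the sketch assumes
every line carries exactly the three values listed above; records written
by other tools may differ:

```python
def parse_phd_dna(text):
    """Return a list of (base, quality, peak_location) from one phd record."""
    calls = []
    in_dna = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN_DNA":
            in_dna = True
        elif stripped == "END_DNA":
            break
        elif in_dna and stripped:
            base, quality, peak = stripped.split()
            calls.append((base, int(quality), int(peak)))
    return calls
```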
[...]
Hope that helps,
Lionel