[mira_talk] Re: Fwd: ace file output from MIRA

Hi,

On Mon, 2009-05-11 at 23:05 +0200, Bastien Chevreux wrote:
> On Sunday 10 May 2009 Björn Nystedt wrote:
> > [...]
> > The basic problem is that MIRA does not create the .phd files.
> 
> Let me start my answer like Cato would have: "Ceterum censeo ACE esse 
> delendam."

Well, when the Romans erased Carthage and sowed salt on its ruins, they
didn't need Carthage, did they? I'm not sure we can say the same about
consed and ace files... When will gap5 be up and working? ;)

Speaking of not-so-good formats, that reminds me: for a reason I don't
understand, I still can't see MIRA tags in consed. Following Jan's
advice, I created a mira_tags.conf file and pointed to it with a
consed.fileOfTagTypes line. Now I can add tags of the new types, but I
still can't see the ones created by MIRA... any idea?

> On the other hand: reading the consed docs for version 19.0, I see this 
> paragraph:
>   9.81)  Delete the phd.ball link in edit_dir--it is also intended for
>     obsolete versions of consed and may cause problems with the current
>     version.
> 
> Anyone who can comment on that?

As Sven mentioned, now the phdBall files are linked into the ACE file,
with something like:

WA{
phdBall newbler 080416:144002
../phdball_dir/phd.ball.1
}
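
If MIRA (or a wrapper script) wanted to write such a block itself, a
minimal sketch in Python could look like the one below. Note the
assumptions: I read the third field as a YYMMDD:HHMMSS timestamp, which
is only inferred from the example above, and the function name and the
"mira" program name are made up for illustration.

```python
from datetime import datetime


def phdball_wa_tag(phdball_path, program="mira", when=None):
    """Return a WA{...} block that points to a phdball file.

    Assumption: the timestamp field is YYMMDD:HHMMSS, as the
    Newbler example suggests. 'program' is a free-form name.
    """
    when = when or datetime.now()
    stamp = when.strftime("%y%m%d:%H%M%S")
    return "WA{\nphdBall %s %s\n%s\n}\n" % (program, stamp, phdball_path)


# Reproduces the Newbler example shown above.
example = phdball_wa_tag(
    "../phdball_dir/phd.ball.1", "newbler", datetime(2008, 4, 16, 14, 40, 2)
)
print(example)
```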

> > [...]
> > so maybe a wish would be for MIRA to output a full
> > consed folder similar to what Newbler does?
> 
> If I only had a description of the format needed, that probably would be a 
> minor problem. I searched through the web to find one, but no luck. Or I am 
> blind, which is also a possibility.

I think Björn was referring to the folder structure, which is, hum...
not very well described. In general, what the authors propose is to
have a subdir edit_dir, plus other subdirs that depend on the project:
- sanger: chromat_dir (chromatograms), phd_dir (phd files); these are
sometimes also present in solexa and 454 projects
- solexa: phdball_dir (phdball files, linked in ace file), solexa_dir
(containing the fastq files, for example)
- 454: phdball_dir, sff_dir (sff files)

I guess one of phd_dir, solexa_dir or sff_dir is necessary, but I don't
know exactly how this is checked.
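
To make the layout above concrete, here is a small Python sketch that
creates the subdirs for a given project type. The mapping only encodes
the list above; the function name is hypothetical, and since I don't
know which directories consed actually checks for, take it as an
illustration rather than a specification.

```python
import os

# Subdirectories per project type, as listed above.
# edit_dir is created in every case.
LAYOUT = {
    "sanger": ["chromat_dir", "phd_dir"],
    "solexa": ["phdball_dir", "solexa_dir"],
    "454": ["phdball_dir", "sff_dir"],
}


def make_consed_dirs(root, project_type):
    """Create edit_dir plus the project-specific subdirs under root."""
    subdirs = ["edit_dir"] + LAYOUT[project_type]
    for name in subdirs:
        os.makedirs(os.path.join(root, name), exist_ok=True)
    return subdirs
```

For example, make_consed_dirs("myproject", "454") would create
myproject/edit_dir, myproject/phdball_dir and myproject/sff_dir.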

phdball files are just multiple phd files concatenated. 
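
So, assuming that plain concatenation really is all there is to it,
building a phdball from a directory of phd files could be sketched like
this in Python (the function name and the '*.phd.*' file pattern are my
own choices, not something consed prescribes):

```python
import glob
import os


def build_phdball(phd_dir, phdball_path):
    """Concatenate all phd files found in phd_dir into one phdball file.

    Returns the number of phd files written.
    """
    phd_files = sorted(glob.glob(os.path.join(phd_dir, "*.phd.*")))
    with open(phdball_path, "w") as out:
        for path in phd_files:
            with open(path) as handle:
                text = handle.read()
            # Make sure each record ends with a newline before the next one.
            out.write(text if text.endswith("\n") else text + "\n")
    return len(phd_files)
```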

As far as phd files are concerned, here is what I found in PHRED.DOC
(let me know if you want the complete file):
[...]
10. Phd files

    Phred writes 'phd' files to store base calling information,
    including the sequence, quality values, and peak locations,
    when it is run with either the '-pd' or the '-p' options.
    Phred creates phd files with the name '<chromat_name>.phd.1'
    where the '1' at the end of the name is the version number
    of the phd file for that chromatogram. It always writes
    version '1' phd files, whereas 'consed' writes phd files
    with higher version numbers (it increments the version 
    number each time it saves an edited read).

    The phd files phred creates begin with the line

      BEGIN_SEQUENCE <sequence_name>

    and end with the line

      END_SEQUENCE

    Enclosed between these lines phred writes a header data block,
    which is enclosed between lines with the labels 'BEGIN_COMMENT'
    and 'END_COMMENT', and a read data block, which is enclosed
    between lines with the labels 'BEGIN_DNA' and 'END_DNA'. Thus
    the overall file structure is (the lines are indented here)
    
      BEGIN_SEQUENCE <sequence_name>

      BEGIN_COMMENT

        [comment block]

      END_COMMENT

      BEGIN_DNA

        [read data block]

      END_DNA

      END_SEQUENCE

    The header data consists of a number of lines where each line begins
    with a label followed by a colon and one or more values.  Currently,
    the phd header has the following information

      header entry                 description
      ------------                 -----------
      CHROMAT_FILE: <string>       chromatogram file name
      ABI_THUMBPRINT: <n>          an integer assigned by the ABI software
      PHRED_VERSION: <string>      phred version used to create the file
      CALL_METHOD: <string>        <string>="phred" unless run with '-nocall'
      QUALITY_LEVELS: <n>          maximum quality value permitted
      TIME: <string>               the time and date the file was created
      TRACE_ARRAY_MIN_INDEX: <n>   the index for the first trace point (always 0)
      TRACE_ARRAY_MAX_INDEX: <n>   the index for the last trace point (npoints-1)
      TRIM: <n1> <n2> <r>          read trim points. See (a) below.
      TRACE_PEAK_AREA_RATIO: <r>   trace noise level. See (b) below.
      CHEM: <string>               chromatogram sequencing chemistry type
      DYE: <string>                chromatogram sequencing dye type

    (a) the 'TRIM' values consist of the first and last bases in the
        high quality read segment (where the first base of the read is
        zero) and the error probability used to calculate the trim
        points. The modified Mott algorithm is used to calculate the
        trim points.

    (b) the 'TRACE_PEAK_AREA_RATIO' is the ratio of the total
        uncalled-base peak area to the total called-base peak area
        within the high quality segment of the read. Thus this value
        indicates the level of the 'background' signal as a fraction of
        the called-base peak area. This value will tend to be relatively
        high for traces with
  
          o  little or no signal

          o  a mixture of inserts

    The read data block consists of one line for each read base. Each
    line has the three values

      o  the called base (a, c, g, t, or n)

      o  the quality value assigned to the base

      o  the location of the called-base peak
         in the trace

    The values are separated from each other by a single
    space.
[...]
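
Going by the description above, writing and reading such a file is not
hard. Here is a rough Python sketch, with the following assumptions
clearly mine: the function names are made up, the caller decides which
header entries to write (I don't know which ones consed insists on),
and for reads without a chromatogram the peak locations would have to
be placeholder values.

```python
def format_phd(name, bases, quals, positions, header):
    """Return the text of a phd file for one read.

    header is a dict of header entries (label -> value); which entries
    consed actually requires is not covered by the text above.
    """
    if not (len(bases) == len(quals) == len(positions)):
        raise ValueError("bases, quals and positions must have equal length")
    lines = ["BEGIN_SEQUENCE " + name, "", "BEGIN_COMMENT", ""]
    lines += ["%s: %s" % (label, value) for label, value in header.items()]
    lines += ["", "END_COMMENT", "", "BEGIN_DNA"]
    # One line per base: called base, quality value, peak location.
    lines += ["%s %d %d" % (b.lower(), q, p)
              for b, q, p in zip(bases, quals, positions)]
    lines += ["END_DNA", "", "END_SEQUENCE", ""]
    return "\n".join(lines)


def parse_phd(text):
    """Parse the text of a single phd file into (name, header, read data)."""
    name, header, reads, section = None, {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            name = line.split(None, 1)[1]
        elif line in ("BEGIN_COMMENT", "BEGIN_DNA"):
            section = line
        elif line in ("END_COMMENT", "END_DNA", "END_SEQUENCE"):
            section = None
        elif section == "BEGIN_COMMENT":
            # Split on the first colon only; values such as TIME contain colons.
            label, _, value = line.partition(":")
            header[label.strip()] = value.strip()
        elif section == "BEGIN_DNA":
            base, qual, pos = line.split()
            reads.append((base, int(qual), int(pos)))
    return name, header, reads


# Made-up read: the header values and peak locations are illustrative only.
text = format_phd("read1", "acgt", [20, 30, 40, 10], [5, 17, 29, 41],
                  {"CHROMAT_FILE": "read1", "TIME": "Mon May 11 23:05:00 2009"})
print(text)
```

Since a phdball is just a concatenation, the same parser could be
applied record by record after splitting on the END_SEQUENCE lines.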


Hope that helps,

Lionel
