[mira_talk] Re: mid-tags

  • From: Sven Klages <sir.svencelot@xxxxxxxxxxxxxx>
  • To: mira_talk@xxxxxxxxxxxxx
  • Date: Thu, 19 Feb 2009 10:24:42 +0100

Hi Bastien,

this is how Roche's sfffile (part of the sfftools) can deal with MIDs:

Usage:  sfffile [options...] [MIDList@][sfffile | datadir]...

You can either use

$ sfffile -s mySFFfile.sff

which would look for all MIDs found in 'MIDConfig.parse' [1]
and split the sff file accordingly, generating something like:

Reading the input SFF file(s)...
Generating the split SFF file(s)...
  MID1:   22149 reads written into the SFF file.
  MID2:   34128 reads written into the SFF file.
  MID3:   14190 reads written into the SFF file.
  MID4:   89150 reads written into the SFF file.
  MID5:       0 reads found.
  MID6:       0 reads found.
[...]

or you can directly access the MID of interest,

$ sfffile mid1@xxxxxxxxxxxxx

Reading the input SFF file(s)...
Generating the split SFF file(s)...
  MID1:   22149 reads written into the SFF file.


If no output filename is given, the new file is called
"454Reads.MID1.sff". The same naming applies to the first
example, which produces:

"454Reads.MID1.sff"
"454Reads.MID2.sff"
"454Reads.MID3.sff"
[...]

Having a look at the docs reveals

"With either mode, the 5' trim points for the output reads is reset
to just past the MID sequence (i.e., the MID sequence is trimmed
from the output read)."

So any software that extracts sequences from an SFF file
should take this into account.
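
To illustrate what "taking this into account" could look like, here is a
minimal Python sketch that applies SFF-style clip points to a raw read. It
assumes the usual SFF convention (1-based, inclusive clip positions, with 0
meaning "not set"); the function name trimmed_read and the example read are
made up for illustration and are not part of the Roche tools.

```python
def trimmed_read(bases, clip_qual_left=0, clip_qual_right=0,
                 clip_adapter_left=0, clip_adapter_right=0):
    """Return the part of a raw read that lies inside the clip points.

    Assumption: clip values are 1-based and inclusive, and 0 means
    "no clip set", as in the SFF read header.
    """
    n = len(bases)
    # The effective 5' start is the largest left clip (at least base 1).
    first = max(1, clip_qual_left, clip_adapter_left)
    # The effective 3' end is the smallest right clip that is actually set.
    last = min(clip_qual_right or n, clip_adapter_right or n, n)
    return bases[first - 1:last]


# Hypothetical raw read: key 'TCAG' + MID1 + some insert sequence.
raw = "TCAG" + "ACGAGTGCGT" + "GATTACA"

# Before splitting: 5' trim point at base 5 (just past the key).
print(trimmed_read(raw, clip_qual_left=5))
# After splitting by MID: 5' trim point shifted by 10 bases to base 15.
print(trimmed_read(raw, clip_qual_left=15))
```

The raw bases are identical in both calls; only the 5' trim point differs.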

As the sequences (key or MID) are not physically removed
from the SFF file (only the start offsets are shifted), it is
easy to track down the MID of a read via its read ID, even if
these "biologically non-relevant" sequences have been physically
removed from the resulting fasta/qual files.
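
As a rough sketch of how the MID could be recovered from the untrimmed bases,
the Python below compares the bases right after the key with each known MID
and accepts the best hit within the allowed number of mismatches. This is an
assumption-laden illustration: it counts substitutions only, and how sfffile
itself scores a match is not stated in this thread. The names find_mid and
MIDS are hypothetical.

```python
KEY = "TCAG"  # key sequence mentioned in this thread

# Subset of the GSMIDs table: name -> (sequence, allowed mismatches)
MIDS = {
    "MID1": ("ACGAGTGCGT", 2),
    "MID2": ("ACGCTCGACA", 2),
    "MID3": ("AGACGCACTC", 2),
}


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_mid(raw_read, mids=MIDS, key=KEY):
    """Return (mid_name, clip_left) for an untrimmed read.

    clip_left is the 1-based position of the first base after key and MID.
    If no MID matches, return (None, position just past the key).
    Assumption: substitutions only, no insertions or deletions.
    """
    raw = raw_read.upper()
    start = len(key)
    best = None  # (name, distance, mid_length)
    for name, (seq, allowed) in mids.items():
        tag = raw[start:start + len(seq)]
        if len(tag) < len(seq):
            continue  # read too short to contain this MID
        dist = mismatches(tag, seq.upper())
        if dist <= allowed and (best is None or dist < best[1]):
            best = (name, dist, len(seq))
    if best is None:
        return None, start + 1
    return best[0], start + best[2] + 1


print(find_mid("TCAG" + "ACGAGTGCGT" + "GATTACA"))
```

For a read that starts with the key followed by MID1, this reports MID1 with
a 5' trim point of 15, which matches the shift by 10 bases described earlier
in the thread.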

[1] MIDConfig.parse, example
- case insensitive
- mid name, mid sequence, allowed mismatches
- users can make their own config file with custom MIDs,
  to be used via 'sfffile -mcf FILENAME'

GSMIDs
{
        mid = "MID1", "ACGAGTGCGT", 2;
        mid = "MID2", "ACGCTCGACA", 2;
        mid = "MID3", "AGACGCACTC", 2;
        mid = "MID4", "AGCACTGTAG", 2;
        mid = "MID5", "ATCAGACACG", 2;
        mid = "MID6", "ATATCGCGAG", 2;
        mid = "MID7", "CGTGTCTCTA", 2;
        mid = "MID8", "CTCGCGTGTC", 2;
        mid = "MID9", "TAGTATCAGC", 2;
        mid = "MID10", "TCTCTATGCG", 2;
        mid = "MID11", "TGATACGTCT", 2;
        mid = "MID12", "TACTGAGCTA", 2;
        mid = "MID13", "CATAGTAGTG", 2;
        mid = "MID14", "CGAGAGATAC", 2;
}
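
For anyone who wants to reuse this table in their own scripts, here is a
small Python sketch that reads the 'mid = ...' lines shown above. It only
handles the three fields listed here (name, sequence, allowed mismatches) and
ignores the enclosing group name; a real MIDConfig.parse may contain more
than that, so treat parse_mid_config as a hypothetical helper, not a full
parser.

```python
import re

# Matches lines such as:  mid = "MID1", "ACGAGTGCGT", 2;
# Case insensitive, as noted for the config file.
MID_LINE = re.compile(
    r'mid\s*=\s*"([^"]+)"\s*,\s*"([ACGTN]+)"\s*,\s*(\d+)\s*;',
    re.IGNORECASE,
)


def parse_mid_config(text):
    """Return {MID name: (MID sequence, allowed mismatches)}."""
    mids = {}
    for name, seq, allowed in MID_LINE.findall(text):
        mids[name.upper()] = (seq.upper(), int(allowed))
    return mids


example = '''
GSMIDs
{
        mid = "MID1", "ACGAGTGCGT", 2;
        mid = "MID2", "ACGCTCGACA", 2;
}
'''

for name, (seq, allowed) in parse_mid_config(example).items():
    print(name, seq, allowed)
```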


Cheers,
Sven

*** Bastien Chevreux (19.02.2009 01:11):
> On Wednesday 18 February 2009 Sven Klages wrote:
>> The Roche software itself is writing 5' and 3' trim points into the sff
>> file; the 5' trim point is usually set to base position 5 (after the key
>> sequence 'TCAG'). The software can (sfftools, sfffile) split sff
>> archives by their MID sequences, which means for every MID found
>> there is a new sff file created with the 5' trim point shifted by
>> 10 bases (or whatever length of MID is configured) to 3'.
>> [...]
>> just my 2p,
> 
> Most valuable 2p you have there Sven.
> 
> I don't have a MID data set myself, would you care to give a short "demo" on 
> how to split the data sets with the Roche software? I'd immediately integrate 
> this into the walk-through for working with 454 data.
> 
> Regards,
>   Bastien
> 
