[mira_talk] Strand FLAG error in SAM output from MIRA 4.0 miraconvert?

  • From: Peter Cock <p.j.a.cock@xxxxxxxxxxxxxx>
  • To: "mira_talk@xxxxxxxxxxxxx" <mira_talk@xxxxxxxxxxxxx>
  • Date: Mon, 17 Feb 2014 15:05:16 +0000

Hi Bastien,

I think there is a problem in the MIRA 4.0 SAM output with FLAG bit
fields 0x10 and 0x20 (partner read's strand) sometimes being wrong,
giving bad FLAG value pairs like 83 and 131, or 99 and 179.

$ /home/pc40583/downloads/mira_4.0_linux-gnu_x86_64_static/bin/miraconvert
-f maf -t sam LCN4_LargeContigs_out.maf
LCN4_LargeContigs_out.padded.sam
Loading from maf, saving to: sam
Collecting basic SAM info from MAF file
Sorting read info ... done
...
Data conversion process finished, no obvious errors encountered.

Here's a snippet from a recent MAF file, a MiSeq and 454 hybrid
de novo assembly:

$ grep "M01157:2:000000000-A27AN:1:1102:8685:23197" LCN4_out.maf -A 8

or,

$ grep "M01157:2:000000000-A27AN:1:1102:8685:23197"
LCN4_LargeContigs_out.maf -A 8
RD    M01157:2:000000000-A27AN:1:1102:8685:23197/2
RG    1
RS    
TCTCAGATGTTGGTTTTTATCAAATATCAACAACTATAGGTAATAGTTCAGATATGGGATATGTATCAAATCGTTTCCCTATAAATGAAAGCCTTCGATGGTTTAAGCCAAACAGTGGTGCAAAAATGATTTTCGGCGATACAGACCTGAT
RQ    
?A?A?BBBDDDDDADDFFFFFFHHIHIIIIIIIIHHHIHICFFGGGHIHIHHIIHGHHFHHIHHH=EGGGHHHHHHIIIHHHHIHIIIIHHIFFGGFDFHHHIIIFHHIHHHFFHHGF?DFDFHHHFFHDFFFFAEFEEEEECEBE<=BB=
TN    M01157:2:000000000-A27AN:1:1102:8685:23197
TS    255
RT    HAF5    1    41    =    MIRA    .
RT    HAF4    42    46    =    MIRA    .
RT    HAF3    47    52    =    MIRA    .
RT    HAF4    53    70    =    MIRA    .
RT    HAF5    71    151    =    MIRA    .
ER
AT    190    340    1    151
--
RD    M01157:2:000000000-A27AN:1:1102:8685:23197/1
RG    1
RS    
AAATACATCTCTATAACCACTTTCAACAGGCATATCACTACGTGTTCTCGCCATTCTCCCGGCATTCACAGTCATCAGGTCTGTATCGCCGAAAATCATTTTTGCACCACTGTTTGGCTTAAACCATCGAAGGCTTTCATTTATAGGGAAA
RQ    
????ABBBDDDDDDDDGGGGGGIHFHHHHHIIIHHHIGHHIHHHHIIIIHIHHHHHHIIIHHHHHIHHGDFGHIIIIHIEGHHGHIIHHHHHHHHHHHHHHHHHHHHHHHFGFGGGGGGGGGGGGGGGGGGHGGGGGGGGGGGGHGGGGGG
TN    M01157:2:000000000-A27AN:1:1102:8685:23197
TS    1
RT    HAF3    1    8    =    MIRA    .
RT    HAF4    9    30    =    MIRA    .
RT    HAF5    31    151    =    MIRA    .
ER
AT    413    263    1    151
...

From the AT line, read/2 is forwards while read/1 is reversed.

Here's the resulting SAM output from miraconvert, note FLAG 131 vs 83:

$ grep "M01157:2:000000000-A27AN:1:1102:8685:23197"
LCN4_LargeContigs_out.padded.sam
M01157:2:000000000-A27AN:1:1102:8685:23197    131    LCN4_c1    190
255    151M    =    263    223
TCTCAGATGTTGGTTTTTATCAAATATCAACAACTATAGGTAATAGTTCAGATATGGGATATGTATCAAATCGTTTCCCTATAAATGAAAGCCTTCGATGGTTTAAGCCAAACAGTGGTGCAAAAATGATTTTCGGCGATACAGACCTGAT
   
?A?A?BBBDDDDDADDFFFFFFHHIHIIIIIIIIHHHIHICFFGGGHIHIHHIIHGHHFHHIHHH=EGGGHHHHHHIIIHHHHIHIIIIHHIFFGGFDFHHHIIIFHHIHHHFFHHGF?DFDFHHHFFHDFFFFAEFEEEEECEBE<=BB=
   RG:Z:1    
PT:Z:1;41;.;HAF5;|42;46;.;HAF4;|47;52;.;HAF3;|53;70;.;HAF4;|71;151;.;HAF5;
M01157:2:000000000-A27AN:1:1102:8685:23197    83    LCN4_c1    263
255    151M    =    190    -223
TTTCCCTATAAATGAAAGCCTTCGATGGTTTAAGCCAAACAGTGGTGCAAAAATGATTTTCGGCGATACAGACCTGATGACTGTGAATGCCGGGAGAATGGCGAGAACACGTAGTGATATGCCTGTTGAAAGTGGTTATAGAGATGTATTT
   
GGGGGGHGGGGGGGGGGGGHGGGGGGGGGGGGGGGGGGFGFHHHHHHHHHHHHHHHHHHHHHHHIIHGHHGEIHIIIIHGFDGHHIHHHHHIIIHHHHHHIHIIIIHHHHIHHGIHHHIIIHHHHHFHIGGGGGGDDDDDDDDBBBA????
   RG:Z:1    PT:Z:144;151;.;HAF3;|122;143;.;HAF4;|1;121;.;HAF5

The second line is for read one,

FLAG 83 = 0x53 = 0x1 + 0x2 + 0x10 + 0x40
0x1 = read paired
0x2 =  read mapped in proper pair
0x10 = read reverse strand
0x40 = first in pair

The first line is for read two,

FLAG 131 = 0x83 = 0x1 + 0x2 + 0x80
0x1 = read paired
0x2 = read mapped in proper pair
0x80 =  second in pair

This is missing bit 0x20 indicating the partner (read one) is on the
reverse strand,
i.e. I think it should be:

FLAG 161 = 0xA3 = 0x1 + 0x2 + 0x20 + 0x80
0x1 = read paired
0x2 = read mapped in proper pair
0x20 = mate reverse strand
0x80 =  second in pair

(It seems using samtools fixmate drops the 0x2 bit for some reason...)

--------------------------------------

Here's another example from the same assembly:

$ grep "M01157:2:000000000-A27AN:1:1113:12717:15930" LCN4_out.maf -A A

or

$ grep "M01157:2:000000000-A27AN:1:1113:12717:15930"
LCN4_LargeContigs_out.maf -A 4
RD    M01157:2:000000000-A27AN:1:1113:12717:15930/1
RG    1
RS    
CCTTCATTTCAGATTGCGCATCGCGCGTTGGCAAATACGGTGGGTCGCGGCAGAGGGGGATTCTTGCAACCATTGCTGACACAAATCCATTACCC
RQ    
?????<?BBBBBB?<BFECAEC>ECCEHHHHHHHFHHHHHH;@EDFEHEHHHC@FHHH73=DFDFFFFFFFDDEDEDEBEEEEDEECEEEE;,3?
TN    M01157:2:000000000-A27AN:1:1113:12717:15930
TS    1
RT    HAF3    1    95    =    MIRA    .
ER
AT    178472    178566    1    95
RD    M01157:2:000000000-A27AN:1:1113:12717:15930/2
RG    1
RS    
GGGTAATGGATTTGTGTCAGCAATGGTTGCAAGAATCCCCCTCTGCCGCGACCCACCGTATTTGCCAACGCGCGATGCGCAATCTGAAATGAAGG
RQ    
?????BBBD?DDDBBBFFFFFFFFFHFHHHIIIHIHHHIHHHHIIHGHH@CCHHFHHH@FHIHDFHHHHEHHHDBEHH:DAEEFBDEEFFFEFFF
TN    M01157:2:000000000-A27AN:1:1113:12717:15930
TS    255
RT    HAF3    1    95    =    MIRA    .
ER
AT    178566    178472    1    95

The AT lines in the MAF file tell me read/2 is forward, while read/1
is reversed.
Note that by chance the two reads are the reverse complement of each other,
and therefore map to the exact same place!

$  grep "M01157:2:000000000-A27AN:1:1113:12717:15930"
LCN4_LargeContigs_out.padded.sam
M01157:2:000000000-A27AN:1:1113:12717:15930    179    LCN4_c1
178472    255    95M    =    178472    -94
CCTTCATTTCAGATTGCGCATCGCGCGTTGGCAAATACGGTGGGTCGCGGCAGAGGGGGATTCTTGCAACCATTGCTGACACAAATCCATTACCC
   
FFFEFFFEEDBFEEAD:HHEBDHHHEHHHHFDHIHF@HHHFHHCC@HHGHIIHHHHIHHHIHIIIHHHFHFFFFFFFFFBBBDDD?DBBB?????
   RG:Z:1    PT:Z:1;95;.;HAF3;
M01157:2:000000000-A27AN:1:1113:12717:15930    99    LCN4_c1    178472
   255    95M    =    178472    94
CCTTCATTTCAGATTGCGCATCGCGCGTTGGCAAATACGGTGGGTCGCGGCAGAGGGGGATTCTTGCAACCATTGCTGACACAAATCCATTACCC
   
?????<?BBBBBB?<BFECAEC>ECCEHHHHHHHFHHHHHH;@EDFEHEHHHC@FHHH73=DFDFFFFFFFDDEDEDEBEEEEDEECEEEE;,3?
   RG:Z:1    PT:Z:1;95;.;HAF3;

The second line is for read one, on the forward strand:

FLAG 99 = 0x63 = 0x1 + 0x2 + 0x20 + 0x40
0x1 = read paired
0x2 = read mapped in proper pair
0x20 = mate reverse strand
0x40 = first in pair

The first line is for read two, on the reverse strand:

FLAG 179  = 0xB3 =  0x1 + 0x2 + 0x10 + 0x20 + 0x80
0x1 = read paired
0x2 = read mapped in proper pair
0x10 = read reverse strand
0x20 = mate reverse strand
0x80 =  second in pair

This wrongly includes 0x20 in the FLAG, it should I think be 147 =
0x1 + 0x2 + 0x10 + 0x80 (as used via samtools fixmate).

I've read over mira/sam_collect.C and can see where the MAF AT line is
parsed, and the read direction set to +1 (forward) or -1 (reverse), and I
can see where this information is used latter to build the FLAG value -
but the nature of the error escapes me.

Peter

-- 
You have received this mail because you are subscribed to the mira_talk mailing 
list. For information on how to subscribe or unsubscribe, please visit 
http://www.chevreux.org/mira_mailinglists.html

Other related posts: