Decoding SAM flags

While the SAM file format has a lot in common with every other tab-delimited format for storing biological data, the bitwise flag column uses a nifty approach that I have rarely if ever seen elsewhere.

The idea behind the bitwise flag is compression. There are 11 boolean flags associated with each entry in a SAM file, but instead of creating a dedicated column for each of these flags, they are all stored in a single column. This not only reduces the amount of space the data occupy on the disk, it also makes the format less cluttered (I mean seriously, more than 8-10 columns for a data point and my brain can’t handle it any more).

Bits in the bitwise flag

The following table (directly from the SAM specification) provides a description of each of the 11 bits in the bitwise flag.

Bit    Description
0x1    template having multiple segments in sequencing
0x2    each segment properly aligned according to the aligner
0x4    segment unmapped
0x8    next segment in the template unmapped
0x10   SEQ being reverse complemented
0x20   SEQ of the next segment in the template being reversed
0x40   the first segment in the template
0x80   the last segment in the template
0x100  secondary alignment
0x200  not passing quality controls
0x400  PCR or optical duplicate

So how are these 11 values stored in a single column? The SAM format uses the concept of a binary string 11 characters long. Each character (bit) in the string corresponds to one of the 11 flags, and a value of “1” indicates that the flag is set. But rather than storing a binary string of length 11, the SAM format evaluates the string as a binary number and stores the corresponding decimal representation of that number. For example, the number ‘00001001101’ in binary encoding has the same value as the number ‘77’ in decimal encoding (64 + 8 + 4 + 1, meaning the 0x40, 0x8, 0x4, and 0x1 bits are set), which is the value that would be stored in the second column of a SAM entry. Note also that (for some reason) each bit is described in the SAM specification using its hexadecimal representation (e.g., ‘0x10’ = 16 and ‘0x40’ = 64).
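To make the encoding concrete, here is a minimal Perl sketch that builds the value 77 from its individual bits and prints it both ways. The variable names are just for illustration.

```perl
use strict;
use warnings;

# Combine four flag bits into a single value with bitwise OR.
my $flag = 0x1 | 0x4 | 0x8 | 0x40;

# The same value written as an 11-character, zero-padded binary string.
my $binary = sprintf("%011b", $flag);

print "decimal: $flag\n";
print "binary:  $binary\n";

# Going the other way: oct() parses a string with a '0b' prefix as binary.
my $from_binary = oct("0b00001001101");
print "parsed:  $from_binary\n";
```

Both the combined bits and the parsed binary string come out as 77, the number that would appear in the flag column.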

Decoding

If you are examining a SAM/BAM file manually, then this little decimal-to-flag converter on the Picard tools website is extremely useful. However, given the typically large size of these files, chances are that most SAM/BAM processing will be done with some kind of program or script.

For my first experiences processing SAM files, I was not accustomed to this type of encoding. It didn’t take long for me to wrap my head around the concept, but my initial approach to decoding bitwise flags definitely felt a bit kludgy. I was essentially converting the decimal numbers into their corresponding binary string representations, and then testing the value of the character (“0” or “1”) at each position of interest in that string.
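For illustration, the string-based approach might look something like the sketch below. This is a reconstruction of the idea, not the code I actually wrote at the time, and the helper name flag_is_set_by_string is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: test a flag by inspecting its binary string.
# $bit_index counts from the right, starting at 0 (so 0x4 is index 2).
sub flag_is_set_by_string
{
  my($flag, $bit_index) = @_;
  my $binary = sprintf("%011b", $flag);
  # The leftmost character is the highest bit, so count from the end.
  my $char = substr($binary, length($binary) - 1 - $bit_index, 1);
  return $char eq "1";
}

my $flag = 77;
print "segment unmapped\n"   if flag_is_set_by_string($flag, 2); # 0x4
print "reverse complement\n" if flag_is_set_by_string($flag, 4); # 0x10
```

It works, but every test costs a string conversion plus some index arithmetic that is easy to get wrong.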

However, there is a much better way to decode the flags in the bitwise column using bitwise operators (huh, imagine that). Specifically, the ‘bitwise AND’ operator can be used to test whether each flag is set. For example, if you want to test whether a given SAM entry passes quality controls, you would test the ‘0x200’ bit like this.

# This is Perl code, but a single '&' symbol represents 'bitwise AND' in many
# languages. Note that the '0x200' bit means "not passing quality controls",
# so the entry fails when the bit is set.

if($flag & hex("0x200"))
{
  # fail! (the bit is set)
}
else
{
  # pass!
}

# If you're comfortable converting between hexadecimal and decimal in your head,
# you could just use the decimal representation of the same flag. '0x200' = 512

if($flag & 512)
{
  # fail! (the bit is set)
}
else
{
  # pass!
}
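
The same test extends naturally to all of the bits at once. Here is a minimal sketch of a table-driven decoder, assuming the 11 bits and descriptions from the table earlier in the post. The names @FLAG_BITS and decode_flag are hypothetical, not part of any SAM library.

```perl
use strict;
use warnings;

# Bit values and descriptions, copied from the table earlier in the post.
my @FLAG_BITS = (
  [0x1,   "template having multiple segments in sequencing"],
  [0x2,   "each segment properly aligned according to the aligner"],
  [0x4,   "segment unmapped"],
  [0x8,   "next segment in the template unmapped"],
  [0x10,  "SEQ being reverse complemented"],
  [0x20,  "SEQ of the next segment in the template being reversed"],
  [0x40,  "the first segment in the template"],
  [0x80,  "the last segment in the template"],
  [0x100, "secondary alignment"],
  [0x200, "not passing quality controls"],
  [0x400, "PCR or optical duplicate"],
);

# Hypothetical helper: return the descriptions of every bit set in $flag.
sub decode_flag
{
  my($flag) = @_;
  return map { $_->[1] } grep { $flag & $_->[0] } @FLAG_BITS;
}

foreach my $description (decode_flag(77))
{
  print "$description\n";
}
```

For the value 77 used earlier, this prints the descriptions for the 0x1, 0x4, 0x8, and 0x40 bits.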

Update

This post by Damian Kao covers the same concept, and provides a bit more background with regard to converting between different number systems. I could have saved myself a lot of trouble if I had seen this earlier…
