## Storing DNA data efficiently in C/C++

Posted by Diego Assencio on 2014.06.25 under Programming (C/C++)

Suppose you have a DNA sequence (e.g. ATCGATTGAC...) which you wish to store efficiently. Assuming the sequence is uniformly random, in the sense that the $i$-th base can be any of A, C, G, T with equal probabilities, what is the most efficient way to store this sequence?

To start solving this problem, notice first that storing the DNA sequence in a char array is not a good idea because each character can have only four possible values: 'A', 'C', 'G' or 'T'. Since an 8 bit char can store 256 different possible values (0-255), using chars to store DNA bases is too wasteful. We can do better.

Consider this fact: since a base can take only four different possible values, we only need two bits to store it. In more technical words, a base in the DNA sequence has only two bits of entropy. This means we can assign a unique pair of bits to each base as shown in the table below:

BaseBinary code
A00
C01
G10
T11

In light of this fact, the most efficient way to store a DNA sequence is to store not one but four bases per 8 bit char (or, equivalently, per one byte unsigned integer: uint8_t). Figure 1 shows how this can be accomplished.

 Fig. 1: Instead of storing one DNA base per char, a much more efficient method is to store four bases per 8 bit block of memory.

Below is an example of a C++ class which stores a DNA sequence by converting it from a string to a bit array. It was written using as few STL members as possible to make it clear how the solution can be written in C (C++ developers could use std::strings instead of pointers to chars and store the DNA data on an std::bitset). Since most of the details are not hard to figure out, I will leave the task of understanding the code below to the reader. A fully compilable example can be found here.

#define BASE_MASK 0x3 /* binary: 11 */

/* useful constants */
enum
{
BASE_A = 0x0, /* binary: 00 */
BASE_C = 0x1, /* binary: 01 */
BASE_G = 0x2, /* binary: 10 */
BASE_T = 0x3, /* binary: 11 */
};

class DnaBitset
{
public:
/**
* @brief Constructs a compressed representation of a DNA sequence.
* @param dna_str A string holding a DNA sequence (e.g. "ATGCACG").
* @param dna_len The length of the DNA sequence.
*/
DnaBitset(const char* dna_str, const size_t dna_len)
{
m_len = dna_len;

/* number of bytes necessary to store dna_str as a bitset */
size_t dna_bytes = (dna_len / 4) + (dna_len % 4 != 0);

m_data = new uint8_t[dna_bytes];

std::memset(m_data, 0, dna_bytes);

/* for each base of the DNA sequence */
for (size_t i = 0; i < dna_len; ++i)
{
uint8_t shift = 6 - 2 * (i % 4);

switch (dna_str[i])
{
case 'A':
m_data[i / 4] |= BASE_A << shift;
break;
case 'C':
m_data[i / 4] |= BASE_C << shift;
break;
case 'G':
m_data[i / 4] |= BASE_G << shift;
break;
case 'T':
m_data[i / 4] |= BASE_T << shift;
break;
default:
throw std::invalid_argument("invalid DNA base");
}

shift = (shift == 0) ? 6 : shift - 2;
}
}

/**
* @brief Destructor.
*/
~DnaBitset()
{
delete[] m_data;
}

/**
* @brief Returns the stored DNA sequence as an ASCII string.
*/
char* to_string() const
{
char* dna_str = new char[m_len + 1];

/* for each base of the DNA sequence */
for (size_t i = 0; i < m_len; ++i)
{
uint8_t shift = 6 - 2 * (i % 4);

/* get the i-th DNA base */
uint8_t base = (m_data[i / 4] & mask) >> shift;

switch (base)
{
case BASE_A:
dna_str[i] = 'A';
break;
case BASE_C:
dna_str[i] = 'C';
break;
case BASE_G:
dna_str[i] = 'G';
break;
case BASE_T:
dna_str[i] = 'T';
break;
default:
throw std::runtime_error("invalid DNA base");
}
}

dna_str[m_len] = '\0';
return dna_str;
}

private:
uint8_t* m_data;
size_t m_len;
};