Storing DNA data efficiently in C/C++


Posted by Diego Assencio on 2014.06.25 under Programming (C/C++)

Suppose you have a DNA sequence (e.g. ATCGATTGAC...) which you wish to store efficiently. Assuming the sequence is uniformly random, in the sense that the $i$-th base can be any of A, C, G, T with equal probabilities, what is the most efficient way to store this sequence?

To start solving this problem, notice first that storing the DNA sequence in a char array is not a good idea because each character can have only four possible values: 'A', 'C', 'G' or 'T'. Since an 8 bit char can store 256 different possible values (0-255), using chars to store DNA bases is too wasteful. We can do better.

Consider this fact: since a base can take only four different possible values, we only need two bits to store it. In more technical words, a base in the DNA sequence has only two bits of entropy. This means we can assign a unique pair of bits to each base as shown in the table below:

BaseBinary code
A00
C01
G10
T11

In light of this fact, the most efficient way to store a DNA sequence is to store not one but four bases per 8 bit char (or, equivalently, per one byte unsigned integer: uint8_t). Figure 1 shows how this can be accomplished.

Fig. 1: Instead of storing one DNA base per char, a much more efficient method is to store four bases per 8 bit block of memory.

Below is an example of a C++ class which stores a DNA sequence by converting it from a string to a bit array. It was written using as few STL members as possible to make it clear how the solution can be written in C (C++ developers could use std::strings instead of pointers to chars and store the DNA data on an std::bitset). Since most of the details are not hard to figure out, I will leave the task of understanding the code below to the reader. A fully compilable example can be found here.

#define BASE_MASK 0x3 /* binary: 11 */

/* useful constants */
enum
{
    BASE_A = 0x0, /* binary: 00 */
    BASE_C = 0x1, /* binary: 01 */
    BASE_G = 0x2, /* binary: 10 */
    BASE_T = 0x3, /* binary: 11 */
};

class DnaBitset
{
public:
    /**
     * @brief Constructs a compressed representation of a DNA sequence.
     * @param dna_str A string holding a DNA sequence (e.g. "ATGCACG").
     * @param dna_len The length of the DNA sequence.
     */
    DnaBitset(const char* dna_str, const size_t dna_len)
    {
        m_len = dna_len;

        /* number of bytes necessary to store dna_str as a bitset */
        size_t dna_bytes = (dna_len / 4) + (dna_len % 4 != 0);

        m_data = new uint8_t[dna_bytes];

        std::memset(m_data, 0, dna_bytes);

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < dna_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);

            switch (dna_str[i])
            {
                case 'A':
                    m_data[i / 4] |= BASE_A << shift;
                    break;
                case 'C':
                    m_data[i / 4] |= BASE_C << shift;
                    break;
                case 'G':
                    m_data[i / 4] |= BASE_G << shift;
                    break;
                case 'T':
                    m_data[i / 4] |= BASE_T << shift;
                    break;
                default:
                    throw std::invalid_argument("invalid DNA base");
            }

            shift = (shift == 0) ? 6 : shift - 2;
        }
    }

    /**
     * @brief Destructor.
     */
    ~DnaBitset()
    {
        delete[] m_data;
    }

    /**
     * @brief Returns the stored DNA sequence as an ASCII string.
     */
    char* to_string() const
    {
        char* dna_str = new char[m_len + 1];

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < m_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);
            uint8_t mask = BASE_MASK << shift;

            /* get the i-th DNA base */
            uint8_t base = (m_data[i / 4] & mask) >> shift;

            switch (base)
            {
                case BASE_A:
                    dna_str[i] = 'A';
                    break;
                case BASE_C:
                    dna_str[i] = 'C';
                    break;
                case BASE_G:
                    dna_str[i] = 'G';
                    break;
                case BASE_T:
                    dna_str[i] = 'T';
                    break;
                default:
                    throw std::runtime_error("invalid DNA base");
            }
        }

        dna_str[m_len] = '\0';
        return dna_str;
    }

private:
    uint8_t* m_data;
    size_t m_len;
};

Comments

Adem HALIMOGLU on Aug 10, 2017:
Firstly, thank you for your working.

While I was using your code, I realized that sometimes std::memset(m_data, 0, dna_bytes); line cannot set m_data to wholly zero. Therefore, I have changed and used that line like this:
memset(dna_comp, 0, sizeof(*dna_comp));

Maybe you take this into consideration. Thank you very much again.
Diego Assencio on Aug 10, 2017:
@Adem: Thank you very much for your comment!

Your change should not have made any difference because the code is creating an array of length dna_bytes and then asking std::memset to fill dna_bytes bytes of memory with zeros starting at the beginning of that array. No bytes should be left "unzeroed" there.

Are you sure there were no modifications on the original code? Also, which compiler are you using?

Leave a reply

NOTE: A name and a comment (max. 1024 characters) must be provided; all other fields are optional. Equations will be processed if surrounded with dollar signs (as in LaTeX). You can post up to 5 comments per day.