Storing DNA data efficiently in C/C++


Posted by Diego Assencio on 2014.06.25 under Programming (C/C++)

Suppose you have a DNA sequence (e.g. ATCGATTGAC...) which you wish to store efficiently. Assuming the sequence is uniformly random, in the sense that the $i$-th base can be any of A, C, G, T with equal probabilities, what is the most efficient way to store this sequence?

To start solving this problem, notice first that storing the DNA sequence in a char array is not a good idea because each character can have only four possible values: 'A', 'C', 'G' or 'T'. Since an 8 bit char can store 256 different possible values (0-255), using chars to store DNA bases is too wasteful. We can do better.

Consider this fact: since a base can take only four different possible values, we only need two bits to store it. In more technical words, a base in the DNA sequence has only two bits of entropy. This means we can assign a unique pair of bits to each base as shown in the table below:

BaseBinary code
A00
C01
G10
T11

In light of this fact, the most efficient way to store a DNA sequence is to store not one but four bases per 8 bit char (or, equivalently, per one byte unsigned integer: uint8_t). Figure 1 shows how this can be accomplished.

Fig. 1: Instead of storing one DNA base per char, a much more efficient method is to store four bases per 8 bit block of memory.

Below is an example of a C++ class which stores a DNA sequence by converting it from a string to a bit array. It was written using as few STL members as possible to make it clear how the solution can be written in C (C++ developers could use std::strings instead of pointers to chars and store the DNA data on an std::bitset). Since most of the details are not hard to figure out, I will leave the task of understanding the code below to the reader. A fully compilable example can be found here.

#define BASE_MASK 0x3 /* binary: 11 */

/* useful constants */
enum
{
    BASE_A = 0x0, /* binary: 00 */
    BASE_C = 0x1, /* binary: 01 */
    BASE_G = 0x2, /* binary: 10 */
    BASE_T = 0x3, /* binary: 11 */
};

class DnaBitset
{
public:
    /**
     * @brief Constructs a compressed representation of a DNA sequence.
     * @param dna_str A string holding a DNA sequence (e.g. "ATGCACG").
     * @param dna_len The length of the DNA sequence.
     */
    DnaBitset(const char* dna_str, const size_t dna_len)
    {
        m_len = dna_len;

        /* number of bytes necessary to store dna_str as a bitset */
        size_t dna_bytes = (dna_len / 4) + (dna_len % 4 != 0);

        m_data = new uint8_t[dna_bytes];

        std::memset(m_data, 0, dna_bytes);

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < dna_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);

            switch (dna_str[i])
            {
                case 'A':
                    m_data[i / 4] |= BASE_A << shift;
                    break;
                case 'C':
                    m_data[i / 4] |= BASE_C << shift;
                    break;
                case 'G':
                    m_data[i / 4] |= BASE_G << shift;
                    break;
                case 'T':
                    m_data[i / 4] |= BASE_T << shift;
                    break;
                default:
                    throw std::invalid_argument("invalid DNA base");
            }

            shift = (shift == 0) ? 6 : shift - 2;
        }
    }

    /**
     * @brief Destructor.
     */
    ~DnaBitset()
    {
        delete[] m_data;
    }

    /**
     * @brief Returns the stored DNA sequence as an ASCII string.
     */
    char* to_string() const
    {
        char* dna_str = new char[m_len + 1];

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < m_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);
            uint8_t mask = BASE_MASK << shift;

            /* get the i-th DNA base */
            uint8_t base = (m_data[i / 4] & mask) >> shift;

            switch (base)
            {
                case BASE_A:
                    dna_str[i] = 'A';
                    break;
                case BASE_C:
                    dna_str[i] = 'C';
                    break;
                case BASE_G:
                    dna_str[i] = 'G';
                    break;
                case BASE_T:
                    dna_str[i] = 'T';
                    break;
                default:
                    throw std::runtime_error("invalid DNA base");
            }
        }

        dna_str[m_len] = '\0';
        return dna_str;
    }

private:
    uint8_t* m_data;
    size_t m_len;
};

Comments

Adem HALIMOGLU on Aug 10, 2017:
Firstly, thank you for your working.

While I was using your code, I realized that sometimes std::memset(m_data, 0, dna_bytes); line cannot set m_data to wholly zero. Therefore, I have changed and used that line like this:
memset(dna_comp, 0, sizeof(*dna_comp));

Maybe you take this into consideration. Thank you very much again.
Diego Assencio on Aug 10, 2017:
@Adem: Thank you very much for your comment!

Your change should not have made any difference because the code is creating an array of length dna_bytes and then asking std::memset to fill dna_bytes bytes of memory with zeros starting at the beginning of that array. No bytes should be left "unzeroed" there.

Are you sure there were no modifications on the original code? Also, which compiler are you using?
Mannu on Feb 27, 2019:
There is error in line 107
Diego Assencio on Feb 27, 2019:
@Mannu: The line is correct, but you need the full code to compile it:

https://github.com/dassencio/dna-compression/blob/master/dna.h
Dre on Jul 20, 2019:
Hi Diego,

Thank you for this intuitive guide for how to convert bases into a two-bit representation. I've even managed to build python wrappers for each of the functions, to_string and encoding the string. I would like to extend the code to include N in the alphabet. I know the representation will now be 3-bit, however I am stuck as to how I should go about changing the rest of the code. e.g. Do I need change uint8 for something large, how will the shift and base_mask be effected ect.

Best,
Dre
Diego Assencio on Jul 20, 2019:
@Dre: If you add N to the alphabet, it will contain five symbols in total, but three bits can represent eight distinct symbols. This is wasteful.

A better approach is to use Huffman's algorithm since it does not require a fixed number of bits to be used to represent every symbol:

https://diego.assencio.com/?index=36c4c02124a10282d8a0f92277a43ec4

Here is an implementation of Huffman's algorithm:

https://github.com/dassencio/huffman

I hope that helps!
Dre on Jul 20, 2019:
Ah you are absolutely right! Very nice implementation of Huffman's algorithm. I like that it uses less memory than python's zlib compression when encoding < 500 nucleotides. I'm guessing that is partially due to the header. As for speed, do you think it would make sense to use a Huffman's implementation in C/C++? Or would the gain be minimal?
Diego Assencio on Jul 20, 2019:
@Dre: I would expect a C/C++ implementation of Huffman's algorithm to be significantly faster than my implementation in Python, but unless your DNA sequence is really long, the performance gain will hardly be noticeable.

As an experiment, I generated a random DNA sequence with 100 million nucleotides and compressed it with my script. It took only eight seconds on an Intel i5-7Y54.

Leave a reply

NOTE: A name and a comment (max. 1024 characters) must be provided; all other fields are optional. Equations will be processed if surrounded with dollar signs (as in LaTeX). You can post up to 5 comments per day.

Preview: