Suppose you have a DNA sequence (e.g. ATCGATTGAC...) which you wish to store efficiently. Assuming the sequence is uniformly random, in the sense that the $i$-th base can be any of A, C, G, T with equal probabilities, what is the most efficient way to store this sequence?
To start solving this problem, notice first that storing the DNA sequence in a char array is not a good idea because each base can take only four possible values: 'A', 'C', 'G' or 'T'. Since an 8-bit char can hold 256 distinct values (0-255), using an entire char per base is wasteful. We can do better.
Consider this fact: since a base can take only four different possible values, we only need two bits to store it. In more technical terms, a base in the DNA sequence carries only two bits of entropy ($\log_2 4 = 2$). This means we can assign a unique pair of bits to each base as shown in the table below:
| Base | Binary code |
|------|-------------|
| A    | 00          |
| C    | 01          |
| G    | 10          |
| T    | 11          |
In light of this fact, the most efficient way to store a DNA sequence is to store not one but four bases per 8-bit char (or, equivalently, per one-byte unsigned integer: uint8_t). Figure 1 shows how this can be accomplished.
Fig. 1: Instead of storing one DNA base per char, a much more efficient method is to store four bases per 8-bit block of memory.
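To see this concretely, here is a small sketch (my own illustration, not part of the class presented below) which packs the four bases of "ACGT" into a single byte using the codes from the table above; the result is 00011011 in binary (0x1B):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    /* two-bit codes from the table above: A = 00, C = 01, G = 10, T = 11 */
    const uint8_t codes[4] = {0x0, 0x1, 0x2, 0x3};

    /* pack the bases of "ACGT" into one byte, most significant pair first */
    uint8_t packed = 0;
    for (int i = 0; i < 4; ++i)
    {
        uint8_t shift = 6 - 2 * i; /* shift values: 6, 4, 2, 0 */
        packed |= codes[i] << shift;
    }

    std::printf("0x%02X\n", packed); /* prints 0x1B, i.e., 00011011 in binary */
}
```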
Below is an example of a C++ class which stores a DNA sequence by converting it from a string to a bit array. It was written using as few STL members as possible to make it clear how the solution could also be written in C (C++ developers could use std::string instead of raw char pointers and store the DNA data in an std::bitset). Since most of the details are not hard to figure out, I will leave the task of understanding the code below to the reader. A fully compilable example can be found here.
```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>

#define BASE_MASK 0x3 /* binary: 11 */

/* useful constants */
enum
{
    BASE_A = 0x0, /* binary: 00 */
    BASE_C = 0x1, /* binary: 01 */
    BASE_G = 0x2, /* binary: 10 */
    BASE_T = 0x3, /* binary: 11 */
};

class DnaBitset
{
public:
    /**
     * @brief Constructs a compressed representation of a DNA sequence.
     * @param dna_str A string holding a DNA sequence (e.g. "ATGCACG").
     * @param dna_len The length of the DNA sequence.
     */
    DnaBitset(const char* dna_str, const size_t dna_len)
    {
        m_len = dna_len;

        /* number of bytes necessary to store dna_str as a bitset */
        size_t dna_bytes = (dna_len / 4) + (dna_len % 4 != 0);

        m_data = new uint8_t[dna_bytes];
        std::memset(m_data, 0, dna_bytes);

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < dna_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);

            switch (dna_str[i])
            {
            case 'A':
                m_data[i / 4] |= BASE_A << shift;
                break;
            case 'C':
                m_data[i / 4] |= BASE_C << shift;
                break;
            case 'G':
                m_data[i / 4] |= BASE_G << shift;
                break;
            case 'T':
                m_data[i / 4] |= BASE_T << shift;
                break;
            default:
                throw std::invalid_argument("invalid DNA base");
            }
        }
    }

    /**
     * @brief Destructor.
     */
    ~DnaBitset()
    {
        delete[] m_data;
    }

    /**
     * @brief Returns the stored DNA sequence as an ASCII string.
     */
    char* to_string() const
    {
        char* dna_str = new char[m_len + 1];

        /* for each base of the DNA sequence */
        for (size_t i = 0; i < m_len; ++i)
        {
            uint8_t shift = 6 - 2 * (i % 4);
            uint8_t mask = BASE_MASK << shift;

            /* get the i-th DNA base */
            uint8_t base = (m_data[i / 4] & mask) >> shift;

            switch (base)
            {
            case BASE_A:
                dna_str[i] = 'A';
                break;
            case BASE_C:
                dna_str[i] = 'C';
                break;
            case BASE_G:
                dna_str[i] = 'G';
                break;
            case BASE_T:
                dna_str[i] = 'T';
                break;
            default:
                throw std::runtime_error("invalid DNA base");
            }
        }

        dna_str[m_len] = '\0';
        return dna_str;
    }

private:
    uint8_t* m_data;
    size_t m_len;
};
```
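As a quick sanity check, the hypothetical snippet below (my own addition, not part of the class) shows how it might be used: it compresses a short sequence and decodes it back. Note that to_string allocates the returned string with new[], so the caller must delete[] it.

```cpp
#include <cstdio>
#include <cstring>

int main()
{
    const char* seq = "ATGCACGT";

    /* compress the sequence: 8 bases fit in 2 bytes instead of 8 */
    DnaBitset dna(seq, std::strlen(seq));

    /* decode it back to an ASCII string */
    char* decoded = dna.to_string();
    std::printf("%s\n", decoded); /* prints ATGCACGT */

    delete[] decoded; /* to_string allocates with new[] */
}
```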
Comments
While I was using your code, I realized that sometimes the line std::memset(m_data, 0, dna_bytes); does not set m_data entirely to zero. Therefore, I changed that line and used it like this:
`memset(dna_comp, 0, sizeof(*dna_comp));`
Maybe you can take this into consideration. Thank you very much again.
Your change should not have made any difference because the code creates an array of length dna_bytes and then asks std::memset to fill dna_bytes bytes of memory with zeros, starting at the beginning of that array. No bytes should be left "unzeroed" there.
Are you sure there were no modifications to the original code? Also, which compiler are you using?
https://github.com/dassencio/dna-compression/blob/master/dna.h
Thank you for this intuitive guide on how to convert bases into a two-bit representation. I've even managed to build Python wrappers for each of the functions: to_string and encoding the string. I would like to extend the code to include N in the alphabet. I know the representation will now be 3-bit; however, I am stuck as to how I should go about changing the rest of the code. E.g., do I need to change uint8_t to something larger? How will the shift and BASE_MASK be affected, etc.?
Best,
Dre
A better approach is to use Huffman's algorithm, since it does not require a fixed number of bits to represent every symbol:
https://diego.assencio.com/?index=36c4c02124a10282d8a0f92277a43ec4
Here is an implementation of Huffman's algorithm:
https://github.com/dassencio/huffman
I hope that helps!
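To make the idea concrete, here is a small sketch of mine (not the implementation linked above) using a fixed prefix code of the kind Huffman's algorithm could produce for an alphabet in which N is rare: A=00, C=01, G=10, T=110, N=111. The codeword lengths are an assumption, since they depend on the actual symbol frequencies, and for clarity the output bits are kept as a string of '0'/'1' characters instead of being packed into bytes:

```cpp
#include <cstdio>
#include <string>

/*
 * A fixed prefix code for {A, C, G, T, N}: no codeword is a prefix of
 * another, so the encoded bit stream can be decoded unambiguously even
 * though the codewords have different lengths.
 */
std::string encode(const std::string& dna)
{
    std::string bits;
    for (char base : dna)
    {
        switch (base)
        {
        case 'A': bits += "00";  break;
        case 'C': bits += "01";  break;
        case 'G': bits += "10";  break;
        case 'T': bits += "110"; break;
        case 'N': bits += "111"; break; /* the rare symbol gets a longer codeword */
        }
    }
    return bits;
}

int main()
{
    /* A, C and G cost two bits each; only T and N cost three */
    std::printf("%s\n", encode("ACGTN").c_str()); /* prints 000110110111 */
}
```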
As an experiment, I generated a random DNA sequence with 100 million nucleotides and compressed it with my script. It took only eight seconds on an Intel i5-7Y54.