Posts on Computer science

Lossless data compression with Huffman's algorithm


Posted by Diego Assencio on 2015.04.05 under Computer science (Algorithms)

Suppose you wish to losslessly compress data which can be expressed as a sequence of symbols (a regular text file is a classic example, where the symbols are letters and punctuation marks). There are many good compression algorithms out there that solve this problem. This post will present a famous one called Huffman's algorithm, which compresses data by creating an encoding scheme that is optimal in the sense that no other symbol-by-symbol encoding will compress the data better. This form of compression is surely not the only one available, but it is very easy to understand and also very important since it is used by formats such as JPEG and MP3.

Huffman's algorithm works as follows:

1. count the number of occurrences $n_i$ of each distinct symbol $s_i$ in the data
2. for each distinct symbol $s_i$, generate a tree containing a single node with key $s_i$ and weight $n_i$
3. select the two trees with the smallest root node weights and merge them under a new root node; the key of this new root node is the concatenation of the keys of the root nodes of the merged trees and its weight is the sum of their weights (so if the two chosen trees have root nodes with keys $\alpha_j$ and $\alpha_k$ and weights $w_j$ and $w_k$ respectively, we put them under a new root node with key $\alpha_j\alpha_k$ and weight $w_j + w_k$)
4. if only a single tree is left, stop; otherwise go back to step 3
5. the bit representation of a given symbol is then obtained by starting from the root node of the final tree and moving along the tree until the node containing that symbol as key is reached: each time we go to the left child, we append a 0 to the bit representation, and each time we go to the right child, we append a 1.

Although the description above is somewhat abstract, the algorithm is really simple. Figure 1 describes visually how it works.

Fig. 1: A visual description of Huffman's algorithm for some text which contains only the letters in the set {A,B,C,D}. In the first step, we merge the trees with root nodes C and D, then we merge the trees with root nodes B and CD, and finally we merge the trees with root nodes A and BCD. The produced representation of a given symbol is a bit string which is obtained by merely traversing the final tree (starting from its root node) until the node whose key is this symbol is reached. Above, we have the following symbol-to-bit-string mappings: A $\rightarrow$ 0, B $\rightarrow$ 10, C $\rightarrow$ 110, D $\rightarrow$ 111.
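For concreteness, here is a minimal sketch of the algorithm in Python (a sketch for illustration only, not the implementation linked at the end of this post). It uses heapq as a priority queue for repeatedly extracting the two trees with the smallest weights; a running counter breaks ties between trees with equal weights:

import heapq
from collections import Counter

def huffman_codes(data):

	# count the number of occurrences of each distinct symbol (step 1)
	counts = Counter(data)

	# one single-node tree per distinct symbol (step 2); a tree is
	# represented as a tuple (key, left_child, right_child)
	heap = [(n, i, (s, None, None)) for i, (s, n) in enumerate(counts.items())]
	heapq.heapify(heap)

	# merge the two trees with the smallest root node weights until a
	# single tree is left (steps 3 and 4)
	i = len(heap)
	while len(heap) > 1:
		w1, _, t1 = heapq.heappop(heap)
		w2, _, t2 = heapq.heappop(heap)
		heapq.heappush(heap, (w1 + w2, i, (t1[0] + t2[0], t1, t2)))
		i += 1

	# traverse the final tree to build the bit representations (step 5);
	# note that data with a single distinct symbol would need special
	# handling (the traversal below would assign it an empty bit string)
	codes = {}
	def traverse(node, bits):
		key, left, right = node
		if left is None:
			codes[key] = bits
		else:
			traverse(left, bits + "0")
			traverse(right, bits + "1")
	traverse(heap[0][2], "")
	return codes

Applied to a text with the symbol counts from figure 1, huffman_codes produces bit strings with the same lengths as in the figure (1 bit for A, 2 for B, 3 each for C and D), although the actual strings may differ since ties and left/right choices can be made in more than one way.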

As mentioned above, Huffman's algorithm is optimal in the sense that no other symbol-by-symbol encoding will generate a better compression of the data. Consider the following sequence with $20$ characters, where $n_\textrm{A} = 12$, $n_\textrm{B} = 5$, $n_\textrm{C} = 2$ and $n_\textrm{D} = 1$ (as in figure 1):

AABACAABBAABAAACABAD

Below is the representation of the sequence above using ASCII codes (which is a very common encoding scheme used for storing text files on computers):

0100000101000001010000100100000101000011010000010100000101000010010000100100000101000001010000100100000101000001010000010100001101000001010000100100000101000100

And here is the representation of the same sequence using the encoding scheme from figure 1:

0010011000101000100001100100111

Huffman's representation of the original character sequence is clearly better than the ASCII one, but how much better is it? If $L^{\textrm{A}}_i$ is the number of bits used to represent the $i$-th symbol of the alphabet {A,B,C,D} in the ASCII encoding, then the total number of bits necessary to represent the character sequence using ASCII is: $$ N^{\textrm{A}} = \sum_i L^{\textrm{A}}_i n_i = 8n_{\textrm{A}} + 8n_{\textrm{B}} + 8n_{\textrm{C}} + 8n_{\textrm{D}} = 160 $$ The equivalent quantity for the Huffman encoding is: $$ N^{\textrm{H}} = \sum_i L^{\textrm{H}}_i n_i = 1n_{\textrm{A}} + 2n_{\textrm{B}} + 3n_{\textrm{C}} + 3n_{\textrm{D}} = 31 $$ which is less than $20\%$ of the amount of space needed with the ASCII encoding scheme. The Huffman encoding uses a clever trick: it assigns short bit strings to the characters which appear more often in the data and longer bit strings to the ones which occur less frequently.
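These numbers are easy to check directly. The short snippet below recomputes both bit counts from the mapping shown in figure 1:

codes = { "A": "0", "B": "10", "C": "110", "D": "111" }
text = "AABACAABBAABAAACABAD"

n_huffman = sum(len(codes[s]) for s in text)  # 31 bits
n_ascii = 8 * len(text)                       # 160 bits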

One interesting aspect of Huffman's algorithm is the fact that it always generates encodings which are not ambiguous, in the sense that the original data can be directly recovered from the compressed version expressed as a concatenation of bit strings. This is due to the fact that Huffman's algorithm always produces prefix codes, i.e., encodings in which the bit string representing one symbol is never a prefix of the bit string representing another. To clarify this point, consider this mapping: A $\rightarrow$ 0, B $\rightarrow$ 10 and C $\rightarrow$ 101. If you have the following bit sequence: 01010, what does it represent: ABB or ACA? The ambiguity here comes from the fact that the representation of B is a prefix of the representation of C. Huffman's algorithm will never yield an encoding scheme like this.
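Decoding a prefix code is straightforward: read the compressed data bit by bit and emit a symbol as soon as the accumulated bits match one of the code words. Below is a minimal sketch of this procedure (the function name and interface are arbitrary):

def huffman_decode(bits, codes):

	# invert the symbol -> bit string mapping
	inverse = { b: s for s, b in codes.items() }

	result = []
	current = ""
	for bit in bits:
		current += bit
		# since no code word is a prefix of another, the first
		# match is always the correct one
		if current in inverse:
			result.append(inverse[current])
			current = ""
	return "".join(result)

For instance, huffman_decode("0010011000101000100001100100111", codes), with codes as in the previous snippet, recovers the original sequence AABACAABBAABAAACABAD.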

An implementation of Huffman's algorithm written in Python can be found here.


Detecting cycles in undirected graphs


Posted by Diego Assencio on 2014.11.13 under Computer science (Algorithms)

A common problem which one needs to solve when dealing with graphs is to determine whether a given graph $G$ contains a cycle or not. As an example, one way to implement Kruskal's algorithm to compute the minimum spanning tree (MST) of a graph requires us to prevent the existence of cycles in the computed MST by manually detecting them and rooting them out as the MST is constructed (this version of Kruskal's algorithm does not use the union-find data structure). This post describes how one can detect the existence of cycles in undirected graphs (directed graphs are not considered here).

In what follows, a graph is allowed to have parallel edges and self-loops. In both cases, the graph has a trivial cycle. Our cycle detection algorithm will be written in Python, but implementing it in other languages should not be a difficult task if you understand the description given below. You can download the code shown in this post by clicking here.

Let's start by introducing a simple implementation of a graph class which we will use later. A very simple class which contains all the functionality we need is defined below (it is based on the one presented here). This class does not explicitly store a list of vertices. Instead, it stores a dictionary (self.neighbors) which contains, for each vertex $v$, a list of its neighbors. A list of all vertices of the graph can be easily obtained from the keys of this dictionary, as shown in the member function vertices() below.

class graph:

	def __init__(self):
		self.neighbors = {}

	def add_vertex(self, v):
		if v not in self.neighbors:
			self.neighbors[v] = []

	def add_edge(self, u, v):
		# both u and v must have been added with add_vertex beforehand
		self.neighbors[u].append(v)
		# if u == v, do not connect u to itself twice
		if u != v:
			self.neighbors[v].append(u)

	def vertices(self):
		return list(self.neighbors.keys())

	def vertex_neighbors(self, v):
		return self.neighbors[v]
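
As a quick usage example, the snippet below builds a path graph with three vertices (note that add_edge assumes both end vertices have already been added with add_vertex):

G = graph()

for v in ["a", "b", "c"]:
	G.add_vertex(v)

G.add_edge("a", "b")
G.add_edge("b", "c")

print(G.vertices())             # prints ['a', 'b', 'c']
print(G.vertex_neighbors("b"))  # prints ['a', 'c']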

Now we can proceed to detecting cycles. For undirected graphs, the algorithm is quite simple: start by selecting some unexplored vertex $v$ of $G$ and use breadth-first search (BFS) to explore every vertex which is reachable from $v$. We will refer to the set of vertices which are at a distance $n$ from $v$ simply as "the $n$-th layer". If a cycle exists, then one of the following will eventually happen as BFS progresses:

1. there will be an integer $n$ such that after we explore all vertices from the $n$-th layer and proceed to explore the vertices on the $(n+1)$-th layer, some vertex $z$ on the $(n+1)$-th layer will be reachable from two vertices $u$ and $w$ on the $n$-th layer (see figure 1a)
2. there will be an integer $n$ such that after we explore all vertices from the $n$-th layer and proceed to explore the vertices on the $(n+1)$-th layer, we will end up finding that two vertices $u$ and $w$ on the $n$-th layer are actually connected (see figure 1b).

Consider case 1 first. If $u$ discovers $z$ first, it will set the layer value of $z$ to $\textrm{layer}(u)+1 = n+1$. When $w$ discovers $z$, it will see that $z$ has a layer value larger than $\textrm{layer}(w) = n$. Consider now case 2. In this case, some vertex $u$ from layer $n$ will find another vertex $w$ also on layer $n$, meaning $\textrm{layer}(u) = \textrm{layer}(w) = n$. Both cases lead us to immediately conclude that the graph has a cycle.

Fig. 1: The two possible cases which must be dealt with when detecting cycles using BFS (only the relevant edges of each graph are shown). In both cases, $v$ is the starting vertex and both $u$ and $w$ are vertices on the $n$-th layer; (a) shows the case in which $u$ and $w$ discover the same vertex $z$ on the $(n+1)$-th layer, while (b) shows the case in which $u$ and $w$ discover each other.

The analysis above suggests we should do the following while going over the edges incident on vertices on the $n$-th layer to discover vertices on the $(n+1)$-th layer: if an edge connects the $n$-th layer vertex to an unexplored vertex, set the layer value of the unexplored vertex to $(n+1)$. If the edge leads to an already explored vertex, we must consider three possible cases:

1. the edge connects a vertex on layer $n$ to a vertex on layer $(n-1)$ $\Rightarrow$ ignore it; this edge has already been taken into account
2. the edge connects a vertex on layer $n$ to another vertex on layer $n$ $\Rightarrow$ cycle detected
3. the edge connects a vertex on layer $n$ to a vertex which was already tagged as being on layer $(n+1)$ $\Rightarrow$ cycle detected

In order to detect cycles in disconnected graphs as well, we must go over every unexplored vertex $v$ and proceed as above. Every vertex which is reachable from a chosen starting vertex $v$ will not be used later as a starting point because it will be marked as explored by then. If $G$ is disconnected, some vertices will not be explored when we do BFS starting from the first unexplored vertex $v$, but as we go over the remaining unexplored vertices in the main loop, we will eventually visit every connected component of $G$.

Enough with the abstract talk. The function below implements the algorithm we just discussed and returns True if the input graph $G$ has a cycle and False otherwise:

import collections

def is_cyclic_graph(G):

	# a deque supports O(1) pops from the front (list.pop(0) takes O(n))
	Q = collections.deque()
	V = G.vertices()

	# initially all vertices are unexplored
	layer = { v: -1 for v in V }

	for v in V:

		# v has already been explored; move on
		if layer[v] != -1:
			continue

		# take v as a starting vertex
		layer[v] = 0
		Q.append(v)

		# as long as Q is not empty
		while len(Q) > 0:

			# get the next vertex u of Q that must be looked at
			u = Q.popleft()

			C = G.vertex_neighbors(u)

			for z in C:
				# if z is being found for the first time
				if layer[z] == -1:
					layer[z] = layer[u] + 1
					Q.append(z)
				# z was already explored: this is either case 2 or
				# case 3 above, so a cycle exists
				elif layer[z] >= layer[u]:
					return True

	return False

As for the run-time complexity of the algorithm, notice that each edge is considered at most twice (once for each of its end vertices), and since we go over every vertex of the graph, the overall complexity is $O(m + n)$, with $m$ and $n$ being the number of edges and vertices respectively.
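
As a quick sanity check, the snippet below runs the function on the small graph built earlier: a path with three vertices has no cycle, but adding one more edge closes a triangle:

G = graph()

for v in ["a", "b", "c"]:
	G.add_vertex(v)

G.add_edge("a", "b")
G.add_edge("b", "c")

print(is_cyclic_graph(G))  # prints False (a-b-c is a path)

G.add_edge("c", "a")

print(is_cyclic_graph(G))  # prints True (a-b-c-a is a triangle)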


Binary search: an easy but tricky algorithm


Posted by Diego Assencio on 2014.06.29 under Computer science (Algorithms)

The binary search algorithm is a very famous and conceptually easy method for finding whether a given value is stored in a sorted array; if the value is found, its position is returned, otherwise a "not found" notification is returned. The running time of the algorithm is $O(\log_2 n)$, with $n$ being the length of the array.

There are many ways to implement the binary search algorithm (the focus here will be on the implementation details; the reader is assumed to already know the algorithm). The "tricky" in the post title refers to the fact that it is very easy to write an implementation of the algorithm which looks perfectly right but is buggy. As an example, try to figure out what the problem is in the implementation below (written in Python):

# a wrong and conceptually poor implementation of binary search
def binary_search(u, x):

	if len(u) == 0:
		return None

	left  = 0
	right = len(u) - 1

	while left != right:

		middle = left + (right - left) // 2

		# left subarray:  u[i] for i in [left, middle]
		# right subarray: u[i] for i in [middle, right]

		if u[middle] <= x:
			left = middle
		else:
			right = middle

	# at this point, left == right; check if x was found
	if u[left] == x:
		return left
	else:
		return None

Consider what would happen if u = [0,1] and x = 0. Initially, we have left = 0 and right = 1. Inside the while loop, middle is set to 0 since (right - left) // 2 = 0 (the // operator computes integer division and is necessary since regular division yields a floating point value in newer Python versions). After that point, the condition u[middle] <= x is satisfied and left = middle sets left to 0. Since neither left nor right has changed, the loop will restart and be executed forever (or at least until you get tired of waiting and stop the process). To make a long story short: integer arithmetic has pulled the rug out from under our feet.

Before working on a fix for the implementation above, notice that it has a suboptimal design decision which, as will be shown below, is closely related to the infinite loop just described. Indeed, the implementation above "divides" the currently considered portion of the array u into two overlapping subarrays: a "left subarray" containing the elements with indices in the closed interval [left, middle] and a "right subarray" containing the elements with indices in [middle, right]. In other words, u[middle] is part of both intervals. Inside the while loop, the program decides on whether to continue iterating over the left or over the right subarray.

There is no need to consider two overlapping subarrays. Let's first consider what would happen if u[middle] were part of the right subarray but not part of the left one. One could try to improve the implementation above by using this assumption explicitly:

# a conceptually better but still wrong implementation of binary search
def binary_search(u, x):

	if len(u) == 0:
		return None

	left  = 0
	right = len(u) - 1

	while left != right:

		middle = left + (right - left) // 2

		# left subarray:  u[i] for i in [left, middle)
		# right subarray: u[i] for i in [middle, right]

		if u[middle] <= x:
			left = middle
		else:
			right = middle - 1

	# at this point, left == right; check if x was found
	if u[left] == x:
		return left
	else:
		return None

The implementation above is conceptually better in the sense that the two subarrays no longer overlap. Unfortunately, the applied changes do not fix the original problem: when right = left + 1, we have middle = left, so the right subarray [middle, right] spans the entire portion of the array currently being considered, and no progress is made if we branch into it. Given u = [0,1] and x = 0, the program will therefore still run the while loop forever. However, letting u[middle] be part of the left but not part of the right subarray solves our problem:

# a correct and conceptually clean implementation of binary search
def binary_search(u, x):

	if len(u) == 0:
		return None

	left  = 0
	right = len(u) - 1

	while left != right:

		middle = left + (right - left) // 2

		# left subarray:  u[i] for i in [left, middle]
		# right subarray: u[i] for i in (middle, right]

		if u[middle] < x:
			left = middle + 1
		else:
			right = middle

	# at this point, left == right; check if x was found
	if u[left] == x:
		return left
	else:
		return None

The implementation above is correct and will always find the desired value x provided it is in the array u. Indeed, the value of either left or right must change in each iteration: the first branch (left = middle + 1) increases left by at least 1 since left <= middle, while the second branch (right = middle) decreases right by at least 1 since middle < right whenever left < right. Since, by construction, left never becomes larger than right, at some point both variables must contain the same value and the while loop will therefore terminate.
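
A few quick checks of this final implementation, including the input which sent the two previous versions into an infinite loop:

u = [2, 3, 5, 7, 11]

print(binary_search(u, 5))       # prints 2
print(binary_search(u, 4))       # prints None (4 is not in u)
print(binary_search([0, 1], 0))  # prints 0 (no longer loops forever)
print(binary_search([], 7))      # prints None (empty array)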

Computing the middle index

The middle index can be computed as shown below:

middle = (left + right) // 2

However, doing integer arithmetic in this manner is prone to bugs which can be very hard to find. This is not an issue in Python, whose integers have arbitrary precision by default, but in other languages such as C and C++, the sum of two integers can overflow and yield a result which is completely different from what you would normally expect. This overflow can be safely avoided by computing middle as we did above:

middle = left + (right - left) // 2

A final note on arrays with possibly duplicated values

A curious reader might be wondering what would happen if we attempted to find a value x which occurs more than once in u. In this case, the value of left, if ever changed in the while loop, will always be such that u[left - 1] < x, since left is only set to middle + 1 when u[middle] < x. This property is preserved until the condition left == right is satisfied, meaning left must be the position of either the first occurrence of x in u or the position of some value different from x, in which case x is not in u. If left is never changed, it remains equal to 0 during the entire process, meaning the searched value x is either at the first position of u or not in u at all (in the latter case, x must be smaller than u[0]).
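
The snippet below illustrates this behavior on a small array with repeated values:

u = [0, 1, 1, 1, 2]

print(binary_search(u, 1))   # prints 1 (the first occurrence of 1)
print(binary_search(u, 3))   # prints None (3 is not in u)
print(binary_search(u, -1))  # prints None (-1 is smaller than u[0])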
