Sorting in less than $O(n\log n)$ time


Posted by Diego Assencio on 2016.01.01 under Programming (General)

Despite the title, this post is not really about sorting. Instead, the aim here is to inspire the following attitude: when faced with a problem, instead of implementing the first solution that comes to mind, ask as many questions about the problem as you can; very often this will lead to much better solutions. As this post illustrates, this is the case even if the given problem seems totally trivial.

Here is a typical programming task: sort an array of $n$ integer values. Any well-educated computer scientist knows multiple solutions to this problem: merge sort, quicksort, heap sort and many other well-known general-purpose sorting algorithms.
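
For reference, the "first solution that comes to mind" might look like the minimal sketch below, which simply delegates the work to the C standard library's qsort; the helper names sort_values and compare_ints are purely illustrative:

#include <stdlib.h>

/* comparator for qsort: orders two int values in ascending order */
static int compare_ints(const void* a, const void* b)
{
	int x = *(const int*) a;
	int y = *(const int*) b;
	return (x > y) - (x < y);
}

/* a generic O(n log n) sort using the standard library */
void sort_values(int* values, const size_t length)
{
	qsort(values, length, sizeof(int), compare_ints);
}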

However, instead of immediately coming up with a solution to the problem, a better attitude is to ask questions about it first. For instance, one may ask: what do the array values represent? Are they already ordered in some special pattern? You will likely come up with other interesting questions as well if you try it.

To clarify the importance of asking such questions, consider the case in which the array values are the ages of users of a website, and suppose your goal is to compute the median user age. One way to do this is to sort the array first: the median is then the middle element if the array length is odd, or the average of the two central elements if it is even. The sorting step can obviously be done using well-known algorithms with run-time complexity $O(n\log n)$, but since no human being has ever lived for more than 125 years, we can assume that no age value exceeds 124, which means the sorting can be done in $O(n)$ time as shown below. The following algorithm is implemented in C, but only basic features of the language are used for maximum clarity:

#include <stddef.h> /* for size_t */

void sort_ages(int* ages, const size_t length)
{
	int count[125];

	/* initialize all elements of count to zero */
	for (int age = 0; age < 125; ++age)
	{
		count[age] = 0;
	}

	/* count the number of occurrences of each age value */
	for (int i = 0; i < length; ++i)
	{
		++count[ages[i]];
	}

	/* sort (rewrite) the ages array in place using count */
	int i = 0;
	for (int age = 0; age < 125; ++age)
	{
		while (count[age] > 0)
		{
			ages[i] = age;
			--count[age];
			++i;
		}
	}
}

double median_age(int* ages, const size_t length)
{
	sort_ages(ages, length);

	int age1 = ages[length / 2];
	int age2 = ages[(length-1) / 2];

	/*
	 * age1 == age2 if ages has an odd number of elements
	 * because (length-1)/2 == length/2 if length is odd;
	 * in this case, we sum the middle value of ages twice,
	 * so this formula always yields the correct median
	 */
	return (double) (age1 + age2) / 2.0;
}

The algorithm above runs in $O(n)$ time and is based on a simple fact: since the age values are constrained to a small range, we can count the number of occurrences of each age value and then write the values back, in order, into the original ages array, which sorts it in place. Computing the median is then a trivial task. This is a significant improvement over algorithms which run in $O(n\log n)$ time.

Besides having a better bound on the asymptotic run-time for large values of $n$, the algorithm just shown has a few other important advantages:

  • each value in the input array ages is read only once and in sequence, so the number of cache misses is small compared to algorithms such as randomized quicksort
  • no dynamic memory allocation is done, and only a small and fixed amount of memory is allocated on the stack (the memory usage does not grow with $n$)
  • the sorting step is executed in a single function (without additional function calls)
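
As a quick sanity check, a hypothetical caller could exercise median_age on a small hand-written array; the sample values below are made up purely for illustration:

#include <stdio.h>

int main(void)
{
	/* illustrative sample data; sorted order is 24, 27, 33, 39, 61 */
	int ages[] = {33, 61, 24, 39, 27};

	/* odd length: the median is the middle element, so this prints 33.0 */
	printf("median age: %.1f\n", median_age(ages, 5));

	return 0;
}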

While it may sound as if we are done with our discussion, the magic is actually not over yet. If you look at the definition of sort_ages, it may occur to you that to compute the median value of the ages array, we don't have to sort it at all! We can instead determine the median directly from count:

double median_age2(const int* ages, const size_t length)
{
	int count[125];

	/* initialize all elements of count to zero */
	for (int age = 0; age < 125; ++age)
	{
		count[age] = 0;
	}

	/* count the number of occurrences of each age value */
	for (int i = 0; i < length; ++i)
	{
		++count[ages[i]];
	}

	/* determine the median age from count */
	int i = 0;
	int age_sum = 0;
	for (int age = 0; age < 125; ++age)
	{
		while (count[age] > 0)
		{
			if (i == (length-1)/2)
			{
				age_sum += age;
			}
			if (i == length/2)
			{
				age_sum += age;
				return (double) age_sum / 2.0;
			}
			--count[age];
			++i;
		}
	}

	/* this point is only reached when length == 0 */
	return 0.0;
}

The median age computation still runs in $O(n)$ time and retains all the other advantages of the previous method, but we no longer rewrite the entire ages array to sort it, which reduces the number of cache misses even further. As a matter of fact, we do not modify ages at all! Since count is an exact and compact representation of the sorted contents of ages, we can compute the median age directly from it. As before, when ages has an odd number of elements, we end up summing the middle element of its sorted version twice, but since we divide the sum by two, we still get the correct median value.

Another small advantage worth mentioning is the fact that median_age2 can naturally handle zero-length arrays, while median_age assumes ages is non-empty.
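
The hypothetical usage sketch below exercises median_age2 on an even-length array, where the two central values are averaged, and on an empty array, for which the function simply returns 0.0; the sample values are made up for illustration:

#include <stdio.h>

int main(void)
{
	/* illustrative sample data; sorted order is 19, 25, 36, 58 */
	int ages[] = {36, 19, 58, 25};

	/* even length: the median is (25 + 36) / 2, so this prints 30.5 */
	printf("median age: %.1f\n", median_age2(ages, 4));

	/* zero-length input: median_age2 returns 0.0 */
	printf("median age (empty): %.1f\n", median_age2(ages, 0));

	return 0;
}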

To summarize, asking a few simple questions about our problem took us from a median-computation algorithm based on sorting to a far better algorithm which performs no sorting at all.
