Sorting in less than time
Consider a typical programming task: sorting an array of
However, before diving into a solution, it's prudent to inquire more about the problem at hand. For instance, one might ask: what do the array values represent? Are they sorted in some specific pattern already? Is sorting actually necessary? Taking a moment to ponder can often yield more intriguing and relevant questions.
To underscore the significance of posing these questions, let's consider a scenario where the array values represent the ages of a website's users. Suppose your objective is to determine the median user age. One approach is to sort the array first. In this case, the median would be the middle element if the array length is odd, or the average of the two central elements if the length is even. While the sorting can certainly be executed using well-established algorithms with a runtime complexity of
void sort_ages(int* ages, const size_t length) {
int count[125];
// Initialize all elements of count to zero.
for (int age = 0; age < 125; ++age) {
count[age] = 0;
}
// Count the number of occurrences of each age value.
for (int i = 0; i < length; ++i) {
++count[ages[i]];
}
// Sort (rewrite) the ages array in place using count.
int i = 0;
for (int age = 0; age < 125; ++age) {
while (count[age] > 0) {
ages[i] = age;
--count[age];
++i;
}
}
}
double median_age(int* ages, const size_t length) {
sort_ages(ages, length);
int age_1 = ages[length / 2];
int age_2 = ages[(length - 1) / 2];
/*
* age_1 == age_2 if ages has an odd number of elements
* because (length - 1) / 2 == length / 2 if length is odd.
* In this case, we sum the middle value of ages twice,
* so this formula always yields the correct median.
*/
return (double) (age_1 + age_2) / 2.0;
}
The algorithm presented above operates in ages
array, effectively sorting it in place. Once sorted, computing the median becomes straightforward. This offers a substantial advantage over algorithms that run in
In addition to its improved asymptotic runtime for large
ages
is read sequentially and only once. This minimizes the number of cache misses, especially when contrasted with algorithms like randomized quick-sort. Just when it might seem our discussion has reached its conclusion, there is still more to uncover. Upon examining the sort_ages
function, one might realize that there is no need to sort the ages
array to compute its median value. Instead, the median can be determined directly from the count
array.
double median_age_2(const int* ages, const size_t length) {
int count[125];
// Initialize all elements of count to zero.
for (int age = 0; age < 125; ++age) {
count[age] = 0;
}
// Count the number of occurrences of each age value.
for (int i = 0; i < length; ++i) {
++count[ages[i]];
}
// Determine the median age from count.
int i = 0;
int age_sum = 0;
for (int age = 0; age < 125; ++age) {
while (count[age] > 0) {
if (i == (length - 1) / 2) {
age_sum += age;
}
if (i == length / 2) {
age_sum += age;
return (double) age_sum / 2.0;
}
--count[age];
++i;
}
}
return 0.0;
}
The revised computation for the median age remains efficient with a runtime of ages
array, which further minimizes cache misses. Notably, we leave the ages
array untouched! The count
array serves as a precise and concise representation of the sorted values from ages
, enabling us to derive the median age directly. As in the previous method, when ages
has an odd number of elements, we sum the middle value twice. But since this sum is divided by two, the result accurately represents the median.
It's also worth highlighting a subtle benefit: the median_age_2
function can gracefully manage arrays of length zero, whereas the median_age
function expects a non-empty ages
array.
In essence, by merely posing some straightforward questions about the problem we're trying to solve, we were able to devise an algorithm that is both more efficient and more elegant than the standard sorting-based median-computing algorithm. And curiously, in the process of trying to improve the sorting of age values, we discovered a better way to compute the median without requiring sorting at all.