Avoid using floating-point numbers as hash table keys

Suppose you have an array of floating-point numbers and you want to count how many times each unique number occurs in the array. As an expert programmer, you might do as shown below (Python is used throughout this post, but everything mentioned here applies to associative arrays implemented as hash tables in virtually every programming language):

numbers = [1.4142, 2.7182, 2.7182, 3.1415, 2.7182, 1.4142]

# counters is a dictionary that maps each number in the numbers
# array to its number of occurrences.
counters = {}
for x in numbers:
  counters[x] = counters.get(x, 0) + 1

for x in counters:
  print("%f: %d" % (x, counters[x]))

The output of this program will not surprise you. The order of the lines below may be different on your system, but the overall results will be the same:

1.414200: 2
2.718200: 3
3.141500: 1

Everything seems fine, right? Unfortunately, no. A Python dictionary is implemented as a hash table. Hash tables work as follows: to insert a pair (k, v) into the table, where k is a key and v is its associated value, we first hash k to find a bucket into which (k, v) should be inserted. Then, we insert (k, v) into this bucket; if a pair with key k already exists in the bucket, it is replaced by the new pair. But why is this problematic? The following example will illustrate it for you:

a = 0.123456
b = 0.567890

# Is this the same as [a, b, a, b]?
numbers = [a, b, (a / b) * b, (b / a) * a]

counters = {}
for x in numbers:
  counters[x] = counters.get(x, 0) + 1

for x in counters:
  print("%f: %d" % (x, counters[x]))

The output now is not what we expect:

0.123456: 1
0.567890: 2
0.123456: 1

What went wrong here? To find out, let's modify the code above to gain additional information on what is happening:

a = 0.123456
b = 0.567890

# Is this the same as [a, b, a, b]?
numbers = [a, b, (a / b) * b, (b / a) * a]
print("numbers = %s\n" % numbers)

counters = {}
for x in numbers:
  counters[x] = counters.get(x, 0) + 1

for x in counters:
  print("%f: %d" % (x, counters[x]))

The output of the program now indicates what the problem is:

numbers = [0.123456, 0.56789, 0.12345599999999998, 0.56789]

0.123456: 1
0.567890: 2
0.123456: 1

Aha! It is floating-point arithmetic that has tripped us up. Naively checking whether floating-point numbers are equal, i.e., comparing them as a == b, is a dangerous idea because such numbers are stored with finite precision. Additionally, arithmetic operations involving floating-point numbers are also carried out with finite precision. Therefore, given two nonzero floating-point numbers a and b, there is no guarantee that a == (a / b) * b. In general, when you hash two different floating-point numbers, you will, with very high probability, obtain two different hash values even if these numbers are very close to each other. If that happens, the numbers will, again with high probability, be placed into different buckets of a hash table.

In the example above, we were fortunate to have (b / a) * a equal to b, but not so fortunate with (a / b) * b and a: these two are not equal to each other and are therefore treated as distinct numbers by the dictionary (i.e., by its hash table implementation).

At this point, it should be clear that hashing floating-point numbers is as dangerous as directly checking for their equality using the == operator. Even tiny differences in their values can lead them to be placed into different buckets of a hash table.

If you really need to use floating-point numbers as keys for a hash table, and depending on the numbers you will be hashing, you might be able to alleviate the problems described above with a special trick. For example, if all numbers fall within the interval , and if two significant figures are sufficient to represent these numbers accurately, we can truncate each number to its first two decimal digits before passing it to the hash function: this will effectively divide the interval into intervals of length , with all numbers in the same interval being considered equal. Here is the code that implements this idea:

# Keep only the first two decimal digits of x.
def truncate(x):
  return int(x / 0.01) * 0.01

a = 0.123456
b = 0.567890

numbers = [a, b, (a / b) * b, (b / a) * a]

counters = {}
for x in numbers:
  x = truncate(x)
  counters[x] = counters.get(x, 0) + 1

for x in counters:
  print("%f: %d" % (x, counters[x]))

The output of this program shows it works as desired:

0.120000: 2
0.560000: 2

Note that now both b and (b / a) * a are no longer considered to be two distinct numbers by the hash function since they are truncated before being passed to it, and the same holds true for a and (a / b) * a.

However, as I mentioned earlier, the problem is only alleviated by the trick just presented. Indeed, the original problem (numbers extremely close to each other being considered distinct by the hash function) has not been entirely addressed: we divided the interval into intervals of width over which all numbers are considered equal, but on the shared boundaries of each interval, i.e., for numbers in the form for , the problem persists since minor deviations around these numbers will fall into different intervals. For example, for a very small , will fall within while will fall within ; hence, these two numbers will be treated differently by the hash function even though they are very close to each other.

To summarize: creating hash tables using floating-point numbers as keys is a tricky task. Hashing them is akin to comparing them for equality using the == operator, and the result may cause your application to behave in unexpected ways.