Avoid using floating-point numbers as hash table keys


Posted by Diego Assencio on 2017.02.11 under Programming (General)

Suppose you have an array of floating-point numbers and wish to count how many times each unique number occurs in the array. As an expert programmer, you may do as shown in the code below (important note: I will use Python to exemplify the concepts in this post, but everything mentioned here applies to associative arrays implemented as hash tables in any programming language, e.g. std::unordered_map in C++):

numbers = [ 1.4142, 2.7182, 2.7182, 3.1415, 2.7182, 1.4142 ]

# counters is a dictionary which maps each number in the numbers
# array to its number of occurrences
counters = { (x, numbers.count(x)) for x in numbers }

for (x, count) in counters:
    print("%f: %d" % (x, count))

The output of this program will not surprise you. The order of the lines below may be different on your system, but the overall resuls should be the same:

1.414200: 2
3.141500: 1
2.718200: 3

Everything is fine, right? Unfortunately, no. A Python dictionary is implemented as a hash table. Hash tables are commonly implemented as follows: to insert a pair (k,v) into the table, where k is a key and v is its associated value, we first hash k to find a bucket into which (k,v) must be inserted, then we insert (k,v) into this bucket; if a pair with key k already exists in the bucket, it is replaced by this new pair. But why is this a problem? The following example will illustrate it for you:

a = 0.123456
b = 0.567890

# don't be fooled: numbers != [ a, b, a, b ]
numbers = [ a, b, (a/b)*b, (b/a)*a ]

counters = { (x, numbers.count(x)) for x in numbers }

for (x, count) in counters:
    print("%f: %d" % (x, count))

The output now is not what we expect:

0.123456: 1
0.123456: 1
0.567890: 2

What went wrong here? To find out, let us modify the code above to get additional information on what is happening:

a = 0.123456
b = 0.567890

# don't be fooled: numbers != [ a, b, a, b ]
numbers = [ a, b, (a/b)*b, (b/a)*a ]

print("numbers = %s\n" % numbers)

counters = { (x, numbers.count(x)) for x in numbers }

for (x, count) in counters:
    print("%f: %d" % (x, count))

I hope you will now be able to see what the problem is:

numbers = [0.123456, 0.56789, 0.12345599999999998, 0.56789]

0.123456: 1
0.123456: 1
0.567890: 2

Aha! It is floating-point arithmetic that has bitten our feet. As you likely know, naively checking whether floating-point numbers are equal, i.e., comparing them as a == b, is a very bad idea because such numbers are stored with finite precision. Also, arithmetic operations involving floating-point numbers are carried out with finite precision as well. Indeed, given two nonzero floating-point numbers a and b, there is no guarantee that a == (a/b)*b. In general, when you hash two different floating-point numbers, you will, with very high probability, obtain two different hash values even if these numbers are very close to each other. If that happens, the numbers will, again with high probability, be placed in different buckets of a hash table.

On the example above, we were lucky to have (b/a)*a being equal to b, but not so lucky with (a/b)*b and a: these two are unfortunately not equal to each other and are therefore treated as distinct numbers by the dictionary (i.e., by its underlying hash table).

At this point, it should be clear that hashing floating-point numbers is as dangerous as directly checking for their equality using the == operator. Even tiny differences in their values may throw them into different buckets of an associated hash table.

If you really need to use floating-point numbers as keys for a hash table, and depending on the numbers you will be hashing, you may be able to alleviate the problems described above with some special trick. As an example, if all numbers fall within the interval $[0,1)$, and if two significant figures are enough to represent these numbers accurately, we can truncate each number to its first two decimal digits before passing it to the hash function: this will effectively divide the interval $[0,1)$ into $100$ intervals of length $0.01$, with all numbers in the same interval being considered equal. Here is the code which implements this idea:

# keep only the first two decimal digits of x for x in [0,1)
def truncate(x):
	return int(x / 0.01) * 0.01

a = 0.123456
b = 0.567890

# truncate each number before adding it to the numbers array
numbers = [ truncate(x) for x in [ a, b, (a/b)*b, (b/a)*a ] ]

print("numbers = %s\n" % numbers)

counters = { (x, numbers.count(x)) for x in numbers }

for (x, count) in counters:
    print("%f: %d" % (x, count))

The output of this program shows it works as we wish it to:

numbers = [0.12, 0.56, 0.12, 0.56]

0.560000: 2
0.120000: 2

Notice that now both b and (b/a)*a are no longer considered to be two distinct numbers by the hash function since they are truncated before being passed to it, and the same remains true for a and (a/b)*a.

However, as I said above, the problem is only alleviated by the trick just presented. Indeed, the original problem (numbers extremely close to each other being considered distinct by the hash function) has not been entirely addressed: we divided the interval $[0,1)$ into intervals of width $0.01$ over which all numbers are considered equal, but on the shared boundaries of each interval, i.e., for numbers in the form $0.01N$ for $N = 1, 2, \ldots, 99$, the problem still persists since minor deviations around these numbers will fall into different intervals. As a concrete example, for a very small $\delta \gt 0$, $0.50 - \delta$ will fall within $[0.49, 0.50)$ while $0.50 + \delta$ will fall within $[0.50, 0.51)$; therefore, these two numbers will be treated differently by the hash function even though they are very close to each other.

To summarize: creating hash tables using floating-point numbers as keys is a tricky task as hashing them is similar to comparing them for equality using the == operator, and the result may cause your application to behave in unexpected ways. My recommendation: avoid doing this, if you can.

Comments

No comments posted yet.

Leave a reply

NOTE: A name and a comment (max. 1024 characters) must be provided; all other fields are optional. Equations will be processed if surrounded with dollar signs (as in LaTeX). You can post up to 5 comments per day.