## Is German harder than English?

Posted by Diego Assencio on 2014.05.12 under Computer science (Natural language processing)

As Germans often say: "Deutsche Sprache, schwere Sprache" ("German language, hard language"). German is indeed (for most people) a hard language to learn. This post will illustrate some aspects which make German so complicated when compared to the English language.

Personally, besides the complex declension forms, verbal conjugations, plural forms and gender assignment, I find one of the most complicating factors of the German language to be the existence of so many "similar words" which have very different meanings. For instance, consider the following words:

 leben neben haben sehen oben deren liegen seien gehen sieben neuen eben wegen gegen jeden bleiben

Although the spelling of these words does not vary much, their meanings are substantially different from one another. In other words, the words are all "close" to each other but have completely unrelated meanings.

One way to measure the "distance" between two words is through the Levenshtein distance, which is a type of edit distance. The Levenshtein distance between two words is equal to the smallest number of character insertions, deletions or substitutions that need to be applied to one of these words to transform it into the other (each modification is called an "edit"). For instance, consider the words "kitten" and "sitting". One way to transform "kitten" into "sitting" is described below:

 1 substitute "k" by "s": kitten → sitten 2 substitute "e" by "i": sitten → sittin 3 insert "g" at the end: sittin → sitting

The word "kitten" can therefore be transformed into "sitting" in three edits. In this case, the smallest possible number of edits which transform "kitten" into "sitting" is indeed three. In general, if $r$ and $s$ are strings with lengths (number of characters) $m$ and $n$ respectively, the Levenshtein distance between them, $\textrm{lev}_{r,s}(m,n)$, can be computed using the following recursive formula: $$\textrm{lev}_{r,s}(i,j) = \begin{cases} \max(i,j) \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\,\,\textrm{if } \min(i,j) = 0 \\[5pt] \min \begin{cases} \textrm{lev}_{r,s}(i-1,j) + 1 \\[5pt] \textrm{lev}_{r,s}(i,j-1) + 1 \quad\quad\quad\quad \textrm{otherwise} \\[5pt] \textrm{lev}_{r,s}(i-1,j-1) + 1_{(r_i \neq s_j)} \end{cases} \end{cases} \label{post_9636c4a74afcc3924fdd2f03f83492c6_levenshtein}$$ The formula above can be understood by right-aligning the strings $r$ and $s$ and considering the necessary transformations to convert $r$ into $s$ when starting from the rightmost character of both strings, then moving to the left one character at a time. At each step, character insertion, deletion or substitution is considered. With that in mind:

 $\textrm{lev}_{r,s}(i-1,j) + 1$ corresponds to deleting a character from $r$ $\textrm{lev}_{r,s}(i,j-1) + 1$ corresponds to inserting a character in $r$ $\textrm{lev}_{r,s}(i-1,j-1) + 1_{(r_i \neq s_j)}$ corresponds to substituting a character in $r$ if $r_i \neq s_j$ (i.e., if the $i$-th character of $r$ is different from the $j$-th character of $s$)

The Levenshtein distance satisfies all properties of a metric function, so the space of strings with the Levenshtein distance forms a metric space!

As mentioned before, equation \eqref{post_9636c4a74afcc3924fdd2f03f83492c6_levenshtein} defines the Levenshtein distance in a recursive manner. Although easy to implement in a program, there are equivalent versions which compute $\textrm{lev}_{r,s}(m,n)$ in a much more efficient manner, e.g. using the Wagner–Fischer algorithm (a thorough discussion of such algorithms is however outside the scope of this post).

At this point, a natural question to ask is: if one randomly chooses a word from a dictionary for a particular language, what is the average number of words from that dictionary which will be within distance $d$ (from now on, "distance" will refer to the "Levenshtein distance") from the chosen word? To better understand the question, see figure 1.

 Fig. 1: A set of words which are similar to the German word "Land". The inner circle contains all words which are within distance $d=0$ from "Land" (i.e., only "Land" itself). The middle and outer layers contain words with distances $d=1$ and $d=2$ from "Land" respectively (the word lists are not exhaustive).

To answer the posed question, I decided to write a Python script which takes a dictionary file, computes the Levenshtein distances between all words which have at least $L$ characters and then computes the average number of words within distance $d$ from a randomly chosen word from that dictionary with at least $L$ characters (i.e., the average number of "neighbor words"). Calling this average the "word proximity level" (WPL) of the language, we can see that the larger the WPL for a language is (technically, for the input dictionary), the "closer" the words of the language are in the way they are spelled.

The reason why only words with at least $L$ characters (e.g. $L = 5$) are considered is due to the fact that small words will be within distance $d$ of each other even for small values of $d$, but this hardly means they are "close to each other" (for instance, consider the words "ape" and "bee": they are at a distance $d=2$ from each other but their spellings are arguably not very similar).

To compare the English and German languages, I have computed their WPLs for word lists containing the $1000$ most commonly used words from each language. These lists were obtained from the University of Leipzig, Germany. The results I got are very interesting. Since the University of Leipzig also offers similar word lists for Dutch and French, I computed the WPLs for those languages as well. All results are summarized on table 1:

Language$\textrm{WPL}(L=5, d=2)$$\textrm{WPL}(L=6, d=3) English1.59$$2.66$
German$3.40$$8.15 French2.11$$3.53$
Dutch$2.94$$7.40$
 Table 1: WPL values for several languages and different values of $d$ and $L$. Notice how the values for Dutch are similar to the ones for German, with both being significantly larger than the values for English.

Table 1 shows that if we consider only words with at least five characters ($L = 5$) and pairs of words with a maximum distance equal to two ($d = 2$), a German word has approximately twice as many "close words" to it as a word in English (considering only the $1000$ most commonly used words of these languages!). When we consider words with at least six characters ($L = 6$) and pairs of words with maximum distance equal to three ($d = 3$), the difference becomes even bigger: a German word now has about three times as many "close words" as an English word.

Finally, to further illustrate the difference between English and German, histograms showing the percentages of the analyzed words containing different numbers of "neighbor words" (i.e., within distance $d$ from the given word) are shown on figures 2 and 3 for the pairs of values $(L,d) = (5,2)$ and $(L,d) = (6,3)$ respectively.

 Fig. 2: Percentage of analyzed words versus number of "neighbor words" for $L = 5$ and $d = 2$. As the graph shows, English words have less neighbor words than German words, resulting in a smaller WPL value.
 Fig. 3: Percentage of analyzed words versus number of "neighbor words" for $L = 6$ and $d = 3$. Compare it with figure 2.