## Is German harder than English?

Posted by Diego Assencio on 2014.05.12 under Computer science (Natural language processing)

As Germans often say: "Deutsche Sprache, schwere Sprache" ("German language, hard language"). German is indeed (for most people) a hard language to learn. This post will illustrate some aspects which makes German so complicated by comparing it with the English language.

Personally, besides the complex declension forms, verbal conjugations, plural forms and gender assignment, I find one of the most complicating factors of the German language to be the existence of so many "similar words" which have very different meanings. For instance, consider the following words:

 leben neben haben sehen oben deren liegen seien gehen sieben neuen eben wegen gegen jeden bleiben

Although the spelling of these words does not vary much, their meanings are substantially different from one another. In other words, the words are all "close" to each other but have completely unrelated meanings.

One way to measure the "distance" between two words is through the Levenshtein distance, which is a type of edit distance. The Levenshtein distance between two words is equal to the smallest number of character insertions, deletions or substitutions that need to be applied on one word to transform it into the other (each modification is called an "edit"). For instance, consider the words "kitten" and "sitting". One way to transform "kitten" into "sitting" is described below:

 1 substitute "k" by "s": kitten → sitten 2 substitute "e" by "i": sitten → sittin 3 insert "g" at the end: sittin → sitting

The word "kitten" can therefore be transformed into "sitting" in three edits. In this case, the smallest possible number of edits which transform "kitten" into "sitting" in indeed three. In general, if $r$ and $s$ are strings with lenghts (number of characters) $m$ and $n$ respectively, the Levenshtein distance between them, $\textrm{lev}_{r,s}(m,n)$, can be computed using the following recursive formula: $$\textrm{lev}_{r,s}(i,j) = \begin{cases} \max(i,j) \quad\quad\quad\quad\quad\quad\quad\quad\quad\quad\,\,\textrm{if } \min(i,j) = 0 \\[5pt] \min \begin{cases} \textrm{lev}_{r,s}(i-1,j) + 1 \\[5pt] \textrm{lev}_{r,s}(i,j-1) + 1 \quad\quad\quad\quad \textrm{otherwise} \\[5pt] \textrm{lev}_{r,s}(i-1,j-1) + 1_{(r_i \neq s_j)} \end{cases} \end{cases} \label{post_9636c4a74afcc3924fdd2f03f83492c6_levenshtein}$$ The Levenshtein distance can be understood if one right-aligns the strings $r$ and $s$ and considers the necessary transformations to convert $r$ into $s$ when starting from the rightmost character of both strings and moving to the left one character at a time. At each step, character insertion, deletion or substitution is considered. With that in mind:

 $\textrm{lev}_{r,s}(i-1,j) + 1$ corresponds to a deletion of a character from $r$ $\textrm{lev}_{r,s}(i,j-1) + 1$ corresponds to a character insertion on $r$ $\textrm{lev}_{r,s}(i-1,j-1) + 1_{(r_i \neq s_j)}$ corresponds to a character substitution which is done only if $r_i \neq s_j$ (i.e., if the $i$-th character of $r$ is different from the $j$-th character of $s$)

The Levenshtein distance satisfies all properties of a metric function, so the space of strings with the the Leveshtein distance forms a metric space!

As mentioned before, equation \eqref{post_9636c4a74afcc3924fdd2f03f83492c6_levenshtein} defines the Levenshtein distance in a recursive manner. Although easy to implement on a program, there are equivalent versions which compute $\textrm{lev}_{r,s}(m,n)$ in a much more efficient manner, e.g. using the Wagner–Fischer algorithm. A thorough discussion of such algorithms is outside the scope of this post and will not be discussed here.

At this point, a natural question to ask is: if one randomly chooses a word from a dictionary for a particular language, what is the average number of words from that dictionary which will be within distance $d$ (from now on, "distance" will refer to the "Levenshtein distance") from the chosen word? To better understand the question, see figure 1.

 Fig. 1: A set of words which are similar to the German word "Land". The inner circle contains all words which are within distance $d=0$ from "Land" (i.e., only "Land" itself). The middle and outer layers contain words with distances $d=1$ and $d=2$ from "Land" respectively (the word lists are not exhaustive).

To answer the posed question, I decided to write a Python script which takes a dictionary file, computes the Levenshtein distances between all words which have at least $L$ characters and then computes the average number of words within distance $d$ from a randomly chosen word from that dictionary with at least $L$ characters (i.e., the average number of "neighbor words"). Calling this average the "word proximity level" (WPL) of the language, we can see that the larger the WPL for a language (technically for the input dictionary), the "closer" the words of the language are (in terms of spelling).

The reason why only words with at least $L$ characters (e.g. $L = 5$) are considered is due to the fact that small words will be within distance $d$ of each other even for small values of $d$, but this hardly means they are "close to each other" (for instance, consider the words "ape" and "bee": they are at a distance $d=2$ from each other but their spellings are arguably not very similar).

To compare the English and German languages, I have computed their WPLs for word lists containing the $1000$ most commonly used words from each language. These lists were obtained from the University of Leipzig, Germany. The results I got are very interesting. Since the University of Leipzig also offers similar word lists for Dutch and French, I computed the WPLs for those languages as well. All results are summarized on table 1:

Language$\textrm{WPL}(L=5, d=2)$$\textrm{WPL}(L=6, d=3) English1.59$$2.66$
German$3.19$$7.18 French2.11$$3.53$
Dutch$2.93$$7.40$
 Table 1: WPL values for several languages and different values of $d$ and $L$. Notice how the values for Dutch are similar to the ones for German, with both being significantly larger than the values for English.

Table 1 shows that if we consider only words with at least five characters ($L = 5$) and pairs of words with a maximum distance equal to two ($d = 2$), a German word has twice as many "close words" to it as a word in English (considering only the $1000$ most commonly words of these languages!). When we consider words with at least six characters ($L = 6$) and pairs of words with maximum distance equal to three ($d = 3$), the difference becomes even larger: a German word now has approximately $2.7$ times more "close words" than an English word.

Finally, to further illustrate the difference between English and German, histograms showing the percentages of the analyzed words containing different numbers of "neighbor words" (i.e., within distance $d$ from the given word) are shown on figures 2 and 3 for the pairs of values $(L,d) = (5,2)$ and $(L,d) = (6,3)$ respectively.

 Fig. 2: Percentage of the analyzed words versus number of "neighbor words" for $L = 5$ and $d = 2$. As the graph shows, English words have less neighbor words than German words, resulting in a smaller WPL value.
 Fig. 3: Percentage of the analyzed words versus number of "neighbor words" for $L = 6$ and $d = 3$. Compare it with figure 2.