Loss in Translation

A study on the invertibility of translation between languages

Devang Thakkar
8th August, 2018

UPDATE: Loss in Translation was longlisted for the Information Is Beautiful Awards 2018!

"Do you know Languages? What's the French for fiddle-de-dee?" asks the Red Queen, aiming to perplex Alice in Lewis Carroll’s Through the Looking-Glass. "If you'll tell me what language 'fiddle-de-dee' is, I'll tell you the French for it," replies Alice, continuing the nonsensical repartee that makes up a large fraction of the book.

A present-day adaptation of this highly underrated sequel would probably gloss over this conversation, since not knowing the language of the source text would no longer be an impediment for an Alice with a smartphone. The quality of machine translation has improved dramatically in the last few decades, and with the advent of neural machine translation we are now closer to human translation than ever before. Despite these advances, there will always be certain limitations to machine translation, not because of the machine aspect of it but because of the inherent limitations of the process of translation itself.

Languages do not always have a one-to-one correspondence, meaning that (i) not every word in a given language has a fitting equivalent in every other language, and (ii) some words in a language may be translated into multiple equally appropriate words in other languages. I examine this correspondence by analysing the translations of the top 1000 nouns as enumerated in the Corpus of Contemporary American English. (More about the implementation in the Methods section below.) To better understand what I’m trying to put forth, check out the demonstration of imperfect circular translations below.

Hindi (hi): stock -> भण्डार -> stores -> स्टोर -> store -> दुकान -> shop

Figure 1. The chain follows the trajectory of a word as it is translated back and forth. The top arcs represent the translation from English to the other language, while the bottom arcs represent the translation back to English. The final word in the chain gets translated to a word that already exists in the chain.

One can see that translating from English to another language and back to English leads to a different word or phrase than the original. These chains may go on for multiple levels before a stationary translation is found (more about this later). The next step is to analyse the relation between the number of these imperfect circular translations and the characteristics of the language. Germanic languages seem to perform the best, with the fewest imperfect circular translations, whereas languages from the Niger-Congo family dominate the bottom. (To ensure that I was not cherry-picking the set size, I computed the rankings for the top 500 and 1500 nouns as well. The individual standings of the languages shift slightly, but the overall trend of the families remains the same.)

Figure 2. Each of the 103 bars represents a language examined in our study. These are all the languages currently available on Google Translate (besides English). The height of each bar denotes the number of imperfect circular translations for the top 1000 nouns for that language. Choose a family from the dropdown to see where its members stand in this ranking.

As can be observed from the plot above, some families dominate the high-scoring end of the ranking (the fewest imperfect circular translations): 22 of the top 30 languages belong to the Germanic, Slavic, or Romance families. On the other hand, certain families tend to populate the low-scoring end: 14 of the bottom 20 languages belong to the Niger-Congo, Austronesian, or Iranian families. English is primarily a Germanic language according to most sources, but the merger of Norman French with Old English introduced a good deal of Latin vocabulary to what became Middle English. As a consequence, languages from the Germanic, Romance, and Slavic families - the three major families of the European branch of the global language tree - have a good number of perfect circular translations. Another result worth noting from this plot is how languages from the same family tend to cluster together. Even apart from the families at the two extremes, there are multiple families (viz. Indic, Afroasiatic, et cetera) that have some of their languages positioned next to each other.

An interesting feature of imperfect circular translations is that they themselves often go on to have further imperfect translations. These word chains go on for multiple levels before reaching a stationary translation (well, most do[1]). Most word chains end after one or two levels but chains of up to 8 levels have been identified as well[2]. We plot the chain lengths of the 1000 nouns in 103 languages to obtain a bird’s eye view of these chains. The dots on the spiral below represent the top 1000 nouns along the path, moving outwards from the centre. The presence of a dot at a particular position implies an imperfect circular translation, while its absence implies a perfect circular translation. The colors of the individual dots add another dimension to the plot by denoting the length of the chain for the particular noun in that language.
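The chain-building procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the script used for this study: `build_chain`, `toy_translate`, and the `max_levels` cap are hypothetical names introduced here, and the lookup table stands in for a real translation service. The Hindi entries follow the chain in Figure 1, except that the mappings for shop and water are assumed for the sake of the example.

```python
from typing import Callable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translate = Callable[[str, str, str], str]


def build_chain(word: str, lang: str, translate: Translate,
                max_levels: int = 20) -> List[str]:
    """Follow English -> lang -> English round trips until a word repeats."""
    chain = [word]
    current = word
    for _ in range(max_levels):
        foreign = translate(current, "en", lang)
        back = translate(foreign, lang, "en")
        if back in chain:
            # Stationary translation, or a loop back to an earlier word.
            break
        chain.append(back)
        current = back
    return chain


# Toy lookup table standing in for a real translation service (assumption).
TOY = {
    ("en", "hi"): {"stock": "भण्डार", "stores": "स्टोर", "store": "दुकान",
                   "shop": "दुकान", "water": "पानी"},
    ("hi", "en"): {"भण्डार": "stores", "स्टोर": "store", "दुकान": "shop",
                   "पानी": "water"},
}


def toy_translate(text: str, src: str, dst: str) -> str:
    # Unknown words are returned unchanged.
    return TOY[(src, dst)].get(text, text)


print(build_chain("stock", "hi", toy_translate))
print(build_chain("water", "hi", toy_translate))
```

With the toy table, stock yields a chain of four English words ending in shop, while water comes straight back and yields a chain containing only itself, i.e. a perfect circular translation. The `max_levels` cap is there because of runaway chains like the ones in footnote [1], which keep producing new strings and would otherwise never terminate.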

Figure 3. The plot shows the top 1000 nouns arranged in order starting from the centre of the spiral. The colour of the dot represents the length of the chain for the word in the mentioned language - Hawaiian in this case. View an interactive version of the plot by visiting the desktop website.

This article intends to bring out the intrinsic non-invertibility of translation between languages. In some cases, the problem seems to stem from the shortcomings of machine translation. However, in most cases, the problem is due to the lack of equivalent words between languages. For example, the Hawaiian translation for multiple words such as color, information, appearance, and contact, along with 27 other words, is 'ike, which translates back to knowledge. I assume this is a linguistic problem rather than a technical one because the dictionary entry for 'ike seems to confirm that the word does indeed mean a lot of things. A future extension of this work could use other languages as the primary source and compare the results.

Methods

I use the list of the top 1000 nouns in the English language for this study. This list was procured from the Corpus of Contemporary American English, freely available here[3]. I have chosen to restrict the study to nouns because they exist in almost[4] every language, unlike other parts of speech[5]. Given a pair of languages, one can study the relation between them by carrying out a circular translation[6] of each word between the two languages. Google Translate limits free translations to 5000 characters at a time, so the script had to be modified to work within that limit. You can find the code here. Some of the resulting data was wrangled manually to obtain the final data.
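To make the procedure concrete, here is a rough Python sketch of one way to count imperfect circular translations while staying under a per-request character limit. It is an illustration under stated assumptions, not the script linked above: `batch_words`, `translate_all`, `count_imperfect`, and `toy_translate_batch` are hypothetical names, the newline separator is a guess at how words might be packed into one request, and comparing words case-insensitively is my own choice.

```python
from typing import Callable, List

# Hypothetical batch translator: (texts, source_lang, target_lang) -> texts.
BatchTranslate = Callable[[List[str], str, str], List[str]]


def batch_words(words: List[str], limit: int = 5000,
                sep: str = "\n") -> List[List[str]]:
    """Group words so each batch, joined by sep, fits within limit characters.

    A single word longer than limit still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    size = 0
    for word in words:
        extra = len(word) + (len(sep) if current else 0)
        if current and size + extra > limit:
            batches.append(current)
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        batches.append(current)
    return batches


def translate_all(texts: List[str], src: str, dst: str,
                  translate_batch: BatchTranslate,
                  limit: int = 5000) -> List[str]:
    """Translate every text, one size-limited batch at a time."""
    out: List[str] = []
    for batch in batch_words(texts, limit):
        out.extend(translate_batch(batch, src, dst))
    return out


def count_imperfect(words: List[str], lang: str,
                    translate_batch: BatchTranslate,
                    limit: int = 5000) -> int:
    """Count words whose English -> lang -> English round trip changes them."""
    foreign = translate_all(words, "en", lang, translate_batch, limit)
    back = translate_all(foreign, lang, "en", translate_batch, limit)
    # Assumption: case and surrounding whitespace are ignored when comparing.
    return sum(1 for w, b in zip(words, back)
               if w.strip().lower() != b.strip().lower())


# Toy lookup table standing in for a real translation service (assumption).
TOY = {
    ("en", "hi"): {"stock": "भण्डार", "stores": "स्टोर", "store": "दुकान",
                   "shop": "दुकान", "water": "पानी"},
    ("hi", "en"): {"भण्डार": "stores", "स्टोर": "store", "दुकान": "shop",
                   "पानी": "water"},
}


def toy_translate_batch(texts: List[str], src: str, dst: str) -> List[str]:
    return [TOY[(src, dst)].get(t, t) for t in texts]


nouns = ["stock", "stores", "store", "shop", "water"]
print(count_imperfect(nouns, "hi", toy_translate_batch))
```

Running `count_imperfect` once per language would give one number per language, which is the quantity plotted as bar heights in Figure 2. Batching the return leg separately matters because the translated words can be longer than the English originals.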

Footnotes

[1] Sometimes, Google Translate happens to go bonkers and decides to send the user on an infinite translation trip. Consider what happens when you attempt to build the chain for the word baseball in Armenian.

baseball -> Baseball: Baseball -> Baseball. Baseball: Baseball -> ... ad infinitum

It gets weirder if you try to build the chain for the word contact in Armenian.

contact -> Contacts -> Contacts: -> Contact: Douglas Doug -> Contact: Douglas Doug Douglas Doug -> ... ad infinitum

This anomaly is not limited to Armenian - try building chains for master in Croatian or forest in Bengali.

[2] One of the longest sensible chains in this study was built by translating the word tax in Bengali.

tax -> do -> be done -> will do -> will -> will power -> will have power -> have the power

[3] The data available on this website contains a list of the top 5000 words, along with a specifier that denotes what part of speech they are used as. I had to wrangle the data to exclude all parts of speech other than nouns.

[4] I choose to hedge my statement not only because of my lack of expertise in this area but also to pay homage to the nounless language of Borges’ Tlön.

[5] For example, a lot of Indic/Dravidian languages do not have the definite article 'the', while many languages, such as Japanese, do not have exclusive personal pronouns.

[6] We start off from English words, translate them to a given language, and then translate the translation back to English - thus completing the circle.

Acknowledgements

I would like to dedicate this project to Tathagata Biswas, Pauline Sémon, Shreni Dand, and Anupama Rao - four designers who've greatly inspired me. The primary inspiration for this undertaking was Nadieh Bremer's Beautiful in English. The spiral plot was inspired by Nick Rougeux's Spacewalking. The spiral was created using a script from here.

You may reach me via Twitter @devangvang or drop me a mail at [firstname] (dot) [lastname] (at) duke (dot) edu. This site uses Google Analytics to collect anonymized data about visitors.