Learn word-to-word translations – IBM1

In this post I will go into the details of the IBM1 model algorithm, one of the simplest algorithms that helps us do statistical machine translation (SMT). The book in which you can read a lot about Machine Translation (MT), and which I often refer to, is Statistical Machine Translation by Philipp Koehn, published in 2010. It also discusses the IBM1 model (pages 88-89), along with evaluation and language model learning.

Back to learning how to build a translation model: we will pay attention to the simplest case, lexical translation. Lexical translation, also referred to as translation of words in isolation, means choosing the most appropriate translation of a word from the foreign language into English (if you want to read about how MT works and why we speak about translation from a foreign language into English, read the introductory post here).

Therefore we have identified the problem – learn lexical translation. If we solve the problem of collecting frequency statistics for the translations of each foreign word, we solve our problem, which is to learn how to translate words in isolation. Consequently, we assume that:

LexicalTranslation

We can read this assumption as follows: we want to learn single-word translation probabilities based on frequency statistics from data. Data in this case is bilingual parallel text aligned at the sentence level. In other words, we have sentence pairs, for each of which an English sentence is known to be the exact translation of a foreign-language sentence. What we DO NOT know is the word-to-word alignment for each of these sentence pairs. We know that the sentences are direct translations of each other, but we do not know which word in the foreign-language sentence translates into which word of the English sentence. Therefore:

DataIsIncomplete

It is good that there exists a Machine Learning algorithm that can help us learn translation probabilities even if our data is incomplete. This algorithm is the Expectation Maximization (EM) algorithm. It learns a model (model == translation probabilities in this specific case) by iterating through the data several times. It begins with a randomly initialized model (randomly initialized word-to-word translation probabilities) – this is step 1. Given the model (at this point randomly initialized), it applies the model to the data – step 2, also called the expectation step. During step 2, and given the model, the algorithm collects statistics from the data. When it has finished collecting statistics on all of the data, the algorithm updates the model according to the collected statistics (or counts) – this is step 3, also called the maximization step. The algorithm iterates over the expectation and maximization steps until it converges (in other words, until the model no longer changes after the maximization step):

EMAlgo
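To make the three steps concrete before we dive into the details, here is a minimal, self-contained sketch of the whole loop in Python. It is my own condensed version of the procedure described above (no special NULL word, a uniform rather than random start, and a fixed number of passes instead of a convergence test); the rest of the post unpacks each step on a tiny example.

```python
from collections import defaultdict

# sentence_pairs: list of (foreign_sentence, english_sentence), each a list of words.
def train_ibm1(sentence_pairs, iterations=10):
    # Step 1: initialize the model. Here: uniformly over co-occurring word pairs
    # (a random start would also converge; the post uses a uniform start below).
    english_vocab = {e for _, english in sentence_pairs for e in english}
    t = {(e, f): 1.0 / len(english_vocab)
         for foreign, english in sentence_pairs for f in foreign for e in english}

    for _ in range(iterations):  # in practice: repeat until the model stops changing
        # Step 2 (Expectation): collect fractional counts given the current model.
        count = defaultdict(float)
        total = defaultdict(float)
        for foreign, english in sentence_pairs:
            for e in english:
                s_total = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    count[(e, f)] += t[(e, f)] / s_total
                    total[f] += t[(e, f)] / s_total
        # Step 3 (Maximization): re-estimate t(e|f) from the collected counts.
        t = {(e, f): count[(e, f)] / total[f] for (e, f) in count}
    return t
```

We will rebuild exactly this, piece by piece, over the rest of the post.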

Why Expectation Maximization?

This algorithm learns automatically, in an unsupervised manner, from data. It is applied when data is incomplete. In our case we know the exact English translation of a foreign-language sentence. Therefore we know the source, which is the foreign-language sentence, and we know the target, which is the English sentence. What we lack is knowing exactly which foreign word corresponds to which English word. This is what we need to complete our data. And since the data is incomplete, we want to learn this correspondence by guessing the initial model and then adjusting it according to what we know.

Let’s start going into details for each of the three main steps of the EM algorithm. The first one takes care of the initialization of the model. Great, how do we do that?

EMStep1

Let me remind you again that the model is the probability of a foreign word being translated into an English word, for all word pairs. In this example, we choose to uniformly assign the initial foreign-word-to-English-word probabilities. It is said that the initial assignment of these probabilities is not important and that the algorithm will converge for any model initialization choice.

Let’s take an example. Imagine we have three training examples, so our data consists of three sentence pairs. We want to decode German text into English. If you are not familiar with what decoding is in this case, please read the introduction to Machine Translation post – it should not take long, but it will make you feel more comfortable when reading about MT. We have a total of 4 different German words and a total of 4 different English words. We want to apply the EM algorithm to learn the translation probabilities of German-to-English word translation (lexical translation):

EMdata

In this example (taken from Philipp Koehn. 2010. Statistical Machine Translation. Chapter 4. Cambridge University Press, pages 88-89) we can see that our first training example is the sentence pair: das Haus in German translates into the house in English (das Haus->the house). The second training example is das Buch->the book. The third, and last, training example is ein Buch->a book.

What word translates into book? We can see the English word book in two out of the three examples. Unfortunately, we cannot tell from the data which word translates into book (the data is incomplete!!). What we can easily see is that there are three different possibilities. From the second training example we know that either das translates into book or Buch translates into book. Thus we can draw an imaginary line between das and book, and between Buch and book, to remember these possible alignments between the foreign and English words. What we want to model is which is the most appropriate, or most probable, translation of book. From the third training example, we see that the German word ein can probably be translated into book. The same applies for Buch and book:

EMdataSomeAlignments

In total, we want to learn 10 different word-to-word translation probabilities: das->the, das->book, das->house, Buch->the, Buch->a, Buch->book, ein->book, ein->a, Haus->the, Haus->house. Therefore, after we draw all these imaginary lines that show us the possible alignments of the words, we arrive at:

EMdata

Philipp Koehn. 2010. Statistical Machine Translation. Chapter 4. Cambridge University Press, page 92.
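If the pictures are hard to read, here is the same toy data as a small Python snippet (my own representation, not code from the book): the three sentence pairs, and the enumeration of the 10 candidate word pairs the model has to assign probabilities to.

```python
# The three training sentence pairs from the example, tokenized.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

# A German word can only translate into an English word it co-occurs with,
# so the candidate translation pairs are all (f, e) combinations per sentence pair.
candidates = {(f, e) for german, english in corpus for f in german for e in english}

print(len(candidates))      # 10
print(sorted(candidates))   # the 10 word pairs listed above
```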

 

Again, we want to model the word-to-word probabilities, of which there are 10 in this particular example. After the algorithm converges, we will have a model built from our data, and we can apply it to new examples to obtain translations from German to English:

EMProbs

Philipp Koehn. 2010. Statistical Machine Translation. Chapter 4. Cambridge University Press, page 92.

Now back to the algorithm – step 1: initialize the model. We have already discussed that we assign a uniform (the same for all entries) lexical translation probability. As each training example has 4 possible alignments (for the first: das:the, das:house, Haus:the and Haus:house), we assign to each of these alignments (the imaginary lines we have drawn) the probability 1/4 (= 0.25):

EMStep1Initialize1stExample

The same applies for the two other training examples. At the end of step 1 we have the following initial model:

EMStep1Initialize
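As a sketch, step 1 looks like this in Python, hard-coding the 1/4 from the example (more generally one would use 1 divided by the size of the English vocabulary):

```python
# Step 1: assign the uniform probability 1/4 (= 0.25) to every candidate pair.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

t = {}  # t[(e, f)]: probability that foreign word f translates into English word e
for german, english in corpus:
    for f in german:
        for e in english:
            t[(e, f)] = 0.25

for (e, f), prob in sorted(t.items()):
    print(f"t({e}|{f}) = {prob}")   # all 10 entries start at 0.25
```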

Now we need to iterate between the Expectation Step (step 2) and the Maximization Step (step 3) until the algorithm converges (until the model does not change any more and has reached a stable state). The model is the set of all probabilities of a foreign word f being translated into an English word e. When the algorithm converges, we have the optimal probabilities t(e|f):

EMStep23Probs
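One simple way to make "the model does not change any more" concrete is a stopping rule like the following (my own choice; the book just says "until convergence"):

```python
# Stop when no probability t(e|f) changes by more than a small epsilon.
def has_converged(old_t, new_t, epsilon=1e-3):
    return max(abs(new_t[pair] - old_t[pair]) for pair in new_t) < epsilon

# Tiny usage example with made-up values:
old_t = {("the", "das"): 0.25, ("house", "das"): 0.25}
new_t = {("the", "das"): 0.50, ("house", "das"): 0.25}
print(has_converged(old_t, new_t))  # False: t(the|das) moved by 0.25
```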

The goal of the algorithm is to optimize t(e|f) – the probability that a foreign word f translates into an English word e. In the Expectation Step we calculate three different counts (collect three different types of statistics). As mentioned before, during the Expectation Step we do not change the model; that happens during the Maximization Step. Therefore we will use the three different types of counts we collect during the Expectation Step to update our model (all probabilities t(e|f)) during the Maximization Step. These three types of counts we will call: s-total(e), count(e|f) and total(f):

ExpectMaxStepCounts

By combining the collected counts (we will discuss how later on in this post) we obtain the updated probabilities t(e|f) for each iteration. Therefore, after the Expectation Step finishes in each iteration, the Maximization Step uses the collected counts to update the model and feeds this model into the next iteration’s Expectation Step.

The s-total(e) (shown with 1 on the screenshot above) collects counts only according to the English part of the training data (in our case, only according to our three training examples). s-total(e) starts from 0 for each training example in each iteration. Following our example, after the initialization of the model (uniformly distributed lexical translation probabilities t(e|f)), s-total(e) is set to 0 for the first training example (das Haus -> the house). After counts have been collected for the first training example and the algorithm starts processing the second training example (das Buch -> the book), s-total(e) is set to 0 again. The same applies for the third training example.

The second statistic we want to collect is count(e|f). This variable collects frequency counts per English word depending on a foreign word, and is therefore dependent on both the English and the foreign words in the training examples. count(e|f) is initialized with 0 and is re-initialized once per iteration. This means that after initializing the lexical translation model we set count(e|f) for all 10 possible translation pairs to 0, and then keep accumulating counts across all training examples. Therefore, after finishing collecting statistics for the first training example (das Haus -> the house) we do not set count(e|f) to 0; we do so only after the Maximization Step has finished its execution.

The last statistic variable, in which we store temporary counts during the Expectation Step, is total(f). It collects temporary counts depending only on the foreign words, as we will see in a moment. This variable total(f) is reset to 0 once per iteration (the same way as count(e|f)).

To summarise, for each pass of the algorithm we collect three types of counts in three different variables during the Expectation Step: s-total(e), count(e|f) and total(f):

ExpectMaxStepCountsTable

N.B.: Feel free to pause reading this post at any time, click on the images to enlarge them, and make sure you understand what has been going on until now. It is not difficult, and explained in this manner it should also let non-technical people take a look at the Machine Learning algorithm behind word-to-word automated translation.
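In code, the three accumulators can simply be dictionaries that default to 0; the comments record when each one is reset (a sketch of my own bookkeeping, with names mirroring the post's notation):

```python
from collections import defaultdict

# The three accumulators used during the Expectation Step.
count = defaultdict(float)    # count[(e, f)]: reset to 0 once per iteration
total = defaultdict(float)    # total[f]:      reset to 0 once per iteration
s_total = defaultdict(float)  # s_total[e]:    reset to 0 for every training example
```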

Now I will try to explain in pictures how we collect the three different types of counts I mentioned in the previous paragraphs. The first one is s-total(e), which collects, per training example, counts that depend on the English words. It cumulatively sums up the probability values t(e|f) for each English word e (e=the, and then e=house) present in the first training example:

CalcStotalDiagram

Then we want to calculate the count(e|f) values. They depend on the probability values t(e|f) – for the first iteration these are the uniform values we assigned to the model – divided by s-total(e): for a given English word e and foreign word f, we add t(e|f) / s-total(e) to count(e|f):

CalcCountDiagram

Lastly, we propagate the same quantity to total(f): total(f) is updated by adding the newly computed t(e|f) / s-total(e) contribution (the same amount we just added to count(e|f)) to the current total(f) value:

CalcTotalDiagram
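Putting the last three diagrams together, here is a sketch of one complete Expectation Step over the toy corpus, starting from the uniform model t(e|f) = 0.25 (my own code, mirroring the update rules just described):

```python
from collections import defaultdict

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
# The uniform initial model from step 1.
t = {(e, f): 0.25 for german, english in corpus for f in german for e in english}

count = defaultdict(float)  # count(e|f): reset once per iteration
total = defaultdict(float)  # total(f):   reset once per iteration

for german, english in corpus:
    # s-total(e): reset for every training example, then summed over the
    # German words of this sentence pair.
    s_total = {e: sum(t[(e, f)] for f in german) for e in english}
    # count(e|f) and total(f): add t(e|f) normalized by s-total(e).
    for e in english:
        for f in german:
            count[(e, f)] += t[(e, f)] / s_total[e]
            total[f] += t[(e, f)] / s_total[e]

print(dict(count))  # e.g. count[("the", "das")] should come out as 1.0
print(dict(total))  # e.g. total["das"] should come out as 2.0
```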

With this we finish the Expectation Step counting for the current iteration (algorithm pass). What we need to do next is to update the model, that is, to perform the Maximization Step (step 3 of the algorithm). The t(e|f) value for each English word e given a foreign word f is count(e|f) divided by total(f):

CalcNewProbs
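As a sketch, the Maximization Step is then a one-liner over the counts collected above; the small usage example uses two of the counts the expectation sketch produces (assuming my arithmetic there is right):

```python
# Maximization Step: the new model is t(e|f) = count(e|f) / total(f).
def maximization(count, total):
    return {(e, f): count[(e, f)] / total[f] for (e, f) in count}

# Tiny usage with two counts from the expectation sketch above:
count = {("the", "das"): 1.0, ("house", "das"): 0.5}
total = {"das": 2.0}
print(maximization(count, total))  # {('the', 'das'): 0.5, ('house', 'das'): 0.25}
```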

The Working Example!

As it took me some time to write down the calculations myself (I do not trust computers to do them for me in cases where I cannot picture the process myself), I want to work through the statistics of the first iteration with you. In case you feel very confident that you have understood how the algorithm learns the lexical translation probabilities of a set of sentence pairs, feel free to just quickly scan through the rest of the post. If you doubt what I have explained to you so far (doubt is good, do not take my words for granted, make sure you agree – or disagree), please follow carefully the calculations we will do from here on until the end of this post.

As you can see on the screenshot shown above, during the first iteration’s Expectation Step we must fill in all of the missing variables.
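Since the calculation tables themselves live on the screenshots, here is a small script that runs exactly one Expectation plus Maximization pass on the toy corpus and prints the resulting t(e|f), so you can check your hand calculations against it. The values in the final comment are what this code prints; they should correspond to the first iteration of Koehn's worked example, but do verify them yourself, in the spirit of the paragraph above.

```python
from collections import defaultdict

# One full EM iteration (Expectation + Maximization) on the toy corpus,
# starting from the uniform model t(e|f) = 0.25.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = {(e, f): 0.25 for german, english in corpus for f in german for e in english}

# Expectation Step: collect fractional counts given the current model.
count = defaultdict(float)
total = defaultdict(float)
for german, english in corpus:
    s_total = {e: sum(t[(e, f)] for f in german) for e in english}  # reset per example
    for e in english:
        for f in german:
            count[(e, f)] += t[(e, f)] / s_total[e]
            total[f] += t[(e, f)] / s_total[e]

# Maximization Step: re-estimate the model from the counts.
t = {(e, f): count[(e, f)] / total[f] for (e, f) in count}

for (e, f), prob in sorted(t.items(), key=lambda item: (item[0][1], item[0][0])):
    print(f"t({e}|{f}) = {prob:.2f}")

# What this prints after the first iteration (rounded):
#   t(a|Buch)     = 0.25   t(book|Buch) = 0.50   t(the|Buch) = 0.25
#   t(house|Haus) = 0.50   t(the|Haus)  = 0.50
#   t(book|das)   = 0.25   t(house|das) = 0.25   t(the|das)  = 0.50
#   t(a|ein)      = 0.50   t(book|ein)  = 0.50
# Frequently co-occurring pairs (das/the, Buch/book) are already pulling
# probability mass towards the correct translations.
```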
