2  [DRAFT] Language modeling: the basics

This chapter covers

  • What language models (LMs) are and the two basic ways to use them: scoring and predicting
  • Building and using the simplest possible language model
  • Statistical language models (SLMs) and \(N\)-gram models, with worked examples
  • Measuring the performance of a language model

We saw in Chapter 1 some of the amazing things language models (especially large LMs) can do; we learned about the different types of LMs, including transformers, and we saw that they can be used to greatly enhance virtually every NLP task.

Let’s now dive deeper into language modeling as a discipline and cover the basics, such as the origins of the field, what the most common models look like, and how one can measure how good an LM is. Neural LMs are also a big part of language modeling, but they will be covered in Chapter 3—many important details justify dedicating a whole chapter to them.

Here’s how we will structure this chapter: In Section 2.1 we will give an overview of language modeling. In Section 2.2 we will explain how the need to model language first appeared. In Section 2.3 we will show with code and examples what the simplest possible language model could look like. Then we will introduce statistical language models (SLMs) in Section 2.4 and \(N\)-gram models in Section 2.5, including worked examples for both types of models. Finally, we explain how to measure a given LM’s performance in Section 2.6, followed by a summary and references in Sections 2.7 and 2.8, respectively.

2.1 Overview

As we explained in the previous chapter, a language model (LM) is a computational device that understands how words are used in a given language.

More specifically, it can be thought of as a function (in the computational sense) that takes in a sequence of words as input and outputs a score representing the likelihood that the sequence is a properly formed sentence. In other words, it should be good at measuring how valid a given sequence of words is. This is shown in Figure 2.1.

The definition of valid may vary, but we can see some examples in the figure below. Intuitively, the sentence “Cats and dogs were playing on the grass” would be considered valid in the English language: it gets a high score from the LM.

Examples (2), (3), and (4) below show several ways in which a word sequence may be considered invalid. Examples (2) and (3) don’t make semantic sense (even though all individual words are valid). Example (4) is made up of words that don’t exist, so it gets a low score from the LM.

Figure 2.1: A language model (LM) can be seen as a function (in the computational sense) that takes a sequence of words as input and outputs a score (usually from 0.0 to 1.0) indicating how likely the sequence is—that is, what is the probability that this sequence is an actual sentence in the target language we are trying to model.

In addition to being able to score any sequence of words and output its likelihood score, LMs can also be used to predict the next word in a sentence. This has several real-world applications and will play a key role when we talk about Transfer Learning and Zero-shot Learning in Part III. An example of what it means to predict the next word in a sentence can be seen in Figure 2.2 in the next subsection.

Definition

A Language Model (LM) is a mathematical or computational device that has been trained to understand which words are used in a language—and how. The two basic uses of an LM are (1) to score a sequence of words and output a score indicating how likely that sequence of words is and (2) to predict the next word in a sentence.

The next two subsections explain why the two basic uses (scoring and predicting) are two sides of the same coin, and why the distinction between train time and inference time is so important for LMs.

The duality between scoring and predicting

As we said before, the basic uses for a language model are:

  1. Scoring the likelihood of a word sequence
  2. Predicting the most likely next word in a sentence

They may seem unrelated to the casual observer, but (1) implies (2). Let’s see how.

Suppose you have a trained LM that can score arbitrary sequences of words and you want to repurpose it to predict the next word in a sentence. How would you do that?

A naïve way to predict the next word in an input sequence would be to generate every possible sentence that starts with the input sequence—one candidate per word in the vocabulary—and score every single one of them. You then pick the word whose variation received the highest likelihood score.

This strategy is shown in Figure 2.2: We start with the sentence “A dog was running on a ______”, generate multiple variations by placing a different filler word in the missing space, and score each variation with the LM—picking the one with the highest score.

Figure 2.2: The ocean, fields, or the moon? Where do you think dogs are more often found running? This is one way (admittedly naïve) to predict the most likely next word in a sentence: simply generate all possible variations (all words in the language) and pick the highest-scoring one. In this example, it’s “field”, with a score of 0.6.

In short, if you have a function that outputs the likelihood score for a sequence of words, you can use it to predict the next word in a sentence—just generate all possible variations of the next word, score each one, and pick the highest-scoring variation! This is why scoring and predicting are two sides of the same coin—if you can score a sentence, you can predict the most likely next word in it.
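
Here’s a minimal sketch of this brute-force strategy in Python, assuming we already have a trained LM exposed as a score function and a list vocabulary of all known words (both names are illustrative, not part of any specific library):

def predict_next_word(prefix, vocabulary, score):
    """Brute force: score every possible continuation of `prefix`
    and return the word that yields the highest-scoring sentence."""
    best_word, best_score = None, float("-inf")
    for word in vocabulary:
        candidate_score = score(prefix + [word])  # e.g., ["a", "dog", ...] + ["field"]
        if candidate_score > best_score:
            best_word, best_score = word, candidate_score
    return best_word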

This brute-force approach is not often used in this manner in practice, but it helps us understand the relationship between the two concepts. We will revisit this duality in Section 2.4.1, when we present yet another way to look at this problem, through the prism of conditional probabilities.

Train time vs Inference time

The language models we will focus on in this book learn from data. This is in contrast with systems built on expert knowledge—sets of hard rules manually created by human experts.

You are probably aware that there exists an actual branch of science that deals with such data-driven systems; it’s called Machine Learning (ML). All language models we will study in this book are machine learning models. This means that most concepts, terms, and conclusions from the ML world also apply to language models.

The concepts of train time and inference time are core to machine learning. Let’s explain them through the prism of language models.

At train time, language models are trained on the training data. How this training is performed will depend on the specific model type used (statistical LMs, neural LMs, transformers, etc.), but regardless of the specifics of the algorithms used, they all share this concept of learning from data.

After the model is trained, we can proceed to inference time. This refers to the stage at which we use (or perform inference with) the model. The model can be used in one of the two ways we have already mentioned: scoring a sequence’s likelihood or predicting the next word in a sentence.

Figure 2.3 provides a visual representation of these two concepts. On the left, an LM is trained with a set of documents (the corpus), and on the right, we see one example of the inference-time use of the model.

Figure 2.3: At train time we go over the training corpus and learn from the text. The result of the training stage is a trained LM. At inference time, we use the trained LM to perform inference (scoring the likelihood of a sequence or predicting the next word in a sentence).

Remember, everything the language model knows was learned at train time. When the model is used at inference time, no new learning takes place—the model simply applies whatever knowledge it learned previously.

The distinction between train and inference times is also critical to the way we evaluate models. Regardless of the evaluation technique we use, we must never evaluate a model on the same dataset it was trained on. Measuring performance on a separate set (the so-called test set) is a basic tenet of machine learning. We will cover simple ways of evaluating LMs in Section 2.6.

2.2 The need to model language

The need to model language was identified as far back as the 1940s. In the seminal, widely cited work “A Mathematical Theory of Communication” (Shannon (1948)) we see early formulations of language models using probabilities and statistics, in the context of Information Theory.1

1 Information theory studies how to transmit and store information efficiently.

As we saw in the previous section, a simple way to think of a language model is as a function that takes in a sequence of words and outputs a score that tells us how likely that sequence is.

In addition to the advanced scenarios we will see later in Part III, several real-world use cases benefit from such a simple function. Let’s see some examples below. These are also summarized in Figure 2.4:

  • Speech Recognition: Speech-to-Text or Voice-typing systems turn spoken language into text. Language models are used to help understand what exactly is being said, which is often hard in the presence of background noise.

    For example, if the sound quality is bad and it’s unclear whether someone said “The dog was running on the field” or “The dog was running on the shield”, the former is probably the correct option, as it’s a more likely sentence than the latter. An LM is exactly the tool to tell you that.

  • Text correction: Many people have trouble spelling and using words correctly. Language models are also helpful here, as they can detect misspelled words—by simply validating every word against a list of known words. They can also be used to inform users when words and sentences are not being used correctly.

    For example, the sentence “Last week I will go to the movies” is grammatically well formed, but a good LM would be able to detect that it doesn’t make sense, as it combines a future-tense verb with an adverbial referring to the past.

  • Typing Assistant (on-screen keyboards): Mobile phones with on-screen keyboards can benefit from language models to enhance user experience. Keys are usually small areas on the screen and users tend to type as fast as they can, so typing is usually a messy affair and systems need to correctly infer what keys the users are typing, on the fly.

    LMs are of great help here; they can predict the most likely text being typed, even as users touch the screen imprecisely and make typing mistakes.

  • Extracting text from images (e.g. OCR): Extracting text from images and other types of binary files such as PDF is helpful in many settings. People need to scan documents and save them as text; cameras and software need to extract text from videos and/or static images.

    Data in other media types (audio, image, video, etc.) is inherently lossy and noisy—think of a picture of a car’s license plate taken at night in low light—so it’s useful to have some means of calculating how likely a candidate text is, so that you can trust your readings. LMs can help here too.

Figure 2.4: Four different uses of simple language models.

We have so far seen how LMs work and we also saw that they can be used in one of two ways: scoring and predicting. We understood the difference between training and using these models, and some real-life scenarios where they could help. Let’s now see what it would take to build a very simple—the simplest possible—LM, in the next section.

2.3 The simplest possible Language Model

Let’s think about what the simplest possible language model would look like. Any ideas? What would you say is the simplest way to describe a language such as English?

We can say that a language is simply the collection of every word in it. In other words, if you can tell me all words used in English, I will accept it as a suitable description of the English language. It sounds very simplistic (and it is), but let’s use this as an example to help illustrate how we would build and use such a language model.

For our simplest possible LM, training consists of going through every document in the training corpus and storing every unique word seen. Listing 2.1 shows a very simple Python snippet that could be used for that.

Listing 2.1: The simplest possible language model: Train time
language_model = set()  # empty set

# loop over every word in every document in the corpus
# (each document is assumed to be a list of word tokens)
for document in corpus:
    for word in document:
        language_model.add(word)

For this simplest possible LM—let’s call it SPLM—the trained LM is simply the set of all unique words in the training set.

Ok, so what exactly do we do with this information? How does one go about using this simple language model? Like any other LM: by assigning likelihood scores to arbitrary word sequences.

We need to define how this model will score a sequence. Let’s say that a word sequence is deemed valid depending on how many valid words it contains. The score will therefore be the proportion of valid words in the sequence.

For example, if a sequence contains 4 words and 3 of them are valid words, its score is ¾ or 0.75. Let’s formalize this in Equation 2.1 below:

\[ Output_{Simplest\ Possible\ LM} = \frac{\#\ valid\ words\ in\ input\ sequence}{\#\ total\ words\ in\ input\ sequence} \tag{2.1}\]

Listing 2.2 shows what the Python code would look like for the scoring at inference time. Right now we assume that there are no duplicate words in the input sequence and that the sequence is non-empty.

Listing 2.2: The simplest possible language model: Inference time
num_valid_words = 0
num_total_words = len(input_sequence)

# count how many words in the input sequence were seen at train time
for word in input_sequence:
    if word in language_model:
        num_valid_words = num_valid_words + 1

# Equation 2.1: the proportion of valid words in the sequence
final_score = num_valid_words / num_total_words

How does our model know what a valid word is? It’s any word it’s seen in the training set.

In Figure 2.5 we can see 4 different word sequences and how they would be scored by our model. According to Equation 2.1 and following Listing 2.2, we have to count how many valid words there are in the sequence and divide by the total number of words.

Sequence (1) “A man walks across the street” is straightforward. All 6 words are known (i.e. they are part of the corpus the model was trained on) so it gets a perfect 1.0 score. With sequence (2) we see one shortcoming of our model: even though the input sequence doesn’t make sense it still gets a 1.0 score because the model only looks at words individually and ignores their context. Sequence (3) is again simple to understand: the words “xyz” and “yyy” are invalid because they are not part of the training corpus, so the sequence gets 4 out of 6 or 0.67 as a score. Finally, in sequence (4) all words are invalid, so it gets a zero score.

Figure 2.5: Simplest Possible Language Model (SPLM) being used at inference time to score the likelihood of some word sequences. We see 4 examples of sentences; the terms marked with a cross are invalid and those marked with a tick are valid.

It’s important to note that everything shown in Figure 2.5 happens at inference time, using information learned at train time. This highlights one very important point that is true for all models we see in this book: Everything depends on what data the model has seen at train time. In other words, a language model is only as good as the data it’s trained on.

There is much more to language modeling than is seen in our simplest possible LM (or we wouldn’t need a book on the topic), but the basics are here: it can learn from text data and it’s able to score arbitrary word sequences. These two steps have been described in code, in Listing 2.1 and Listing 2.2, respectively.

Every other language model we will see in the book will be some variation of this basic strategy. Any method by which you can accomplish these two tasks is a language model.

For the remainder of this chapter, we will analyze two such variations. Section 2.4 will cover statistical language models (SLMs), which are perhaps the first nontrivial type of language model. Then, in Section 2.5, we will learn about \(N\)-gram models, which can be seen as a set of optimizations on top of SLMs.

2.4 Statistical Language Models

Statistical language models (SLMs) are an application of probability and statistical theory to language modeling. They model a language (such as English) as a learned probability distribution over words. Sentences are also seen through a probabilistic prism: the likelihood of a sentence is the likelihood that it would have been sampled from the probability distribution that represents the language.

We’ll see how SLMs provide a useful theoretical framework to work with, and why it’s infeasible to work with them in practice.

In this book we focus on word-level language models, where the smallest unit considered by the model is a word. Most LMs we read about are word-level models, but that doesn’t mean other types don’t exist—there are also character-level models.

A character-level LM works similarly to a word-level LM, but likelihood scores are calculated for sequences of characters instead of sequences of words. Also, instead of predicting the next word in a sequence, they predict the next character in the sequence.

Character-level LMs do have some advantages over word-level models. Firstly, they don’t suffer from collapsing probabilities2 due to out-of-vocabulary words—the vocabulary for character-level LMs is just a limited set of characters.

In addition to that, they may be better able to capture meaning in languages with heavy declension and in-word semantics such as Finnish, German, etc.

The reason why they aren’t as common as word-level LMs is that they are more computationally expensive to train and that they underperform word-level LMs for English, which is the language most applications are made for.

2 Here, collapsing probabilities refer to situations where at least one of the composing terms of a conditional probability product is zero, thus causing the whole product to collapse to zero. A more thorough explanation can be seen in Section 2.4.3.

Let’s now see a more practical explanation of SLMs, how they work, and go through a worked example to see what they look like in practice. At the end of this section, we’ll cover the limitations of pure SLMs—some of which will be addressed when we discuss \(N\)-gram language models in Section 2.5.

2.4.1 Overview

The goal of statistical language models (SLMs) is to create a probabilistic model of a language, such as English, as represented by the training corpus.

What does this mean in practice? We want to have a function that takes a word or a sequence thereof—that is, a sentence—as input and outputs a score that represents how likely that word or sentence is.

Even though the terms “probability” and “likelihood” are not exact synonyms in Statistics literature, we shall use them as such to follow the convention in the NLP field. When the meaning is not clear from the context we will explicitly differentiate them.

But wait! Isn’t this similar to what we had previously, with our simplest possible language model? Yes, it is. The difference is how this function is built. In the case of SLMs, this function is a probabilistic model, using counts and frequencies of words and sentences in the language.

Statistical language models (including the \(N\)-gram models we’ll see next) are also sometimes called count-based models. This is because such models can be constructed from counts and/or frequencies of words (and combinations thereof) in the training corpus. See Sections 2.4.2 and 2.5.3 for worked examples.

An SLM defines the likelihood of a sentence as the joint probability of its composing words. The joint probability of a sequence of words is the probability that all individual words appear in the text, in that order.

In the next subsections, we show how to calculate the likelihood of a sentence. Drawing from our earlier example in Figure 2.5, we should expect an SLM to assign a higher likelihood score for “A man walks across the street” than for “A street walks across a man”!

We must first learn how to calculate the likelihood of a single word; then we will see how to extend this knowledge to full sentences. Also, in Section 2.4.2 we will go through a fully worked example to see how an SLM works with real data, both at train and at inference time.

Likelihood of a single word

We shall borrow some notation from statistics and probability theory, and define that \(P(''word'')\) represents the likelihood of a word. As with any probability, it must be a value between 0 and 1.

One way to calculate the likelihood of a single word is to count how many times that word appears in the training corpus and divide it by the number of times every word appears. In other words, the likelihood of a word is its relative frequency in the training corpus.

You may skip forward to the example in Section 2.4.2. For now, let’s write down an informal formula for the likelihood of a single word in Equation 2.2:

\[ P(''word'') = \frac{count(''word'')}{\sum\limits_{all\_words\ w} count(w)} \tag{2.2}\]

The concept of a word’s context is very important and it will be key in several parts of this book. A word’s context refers to the other words that precede it.

In later parts of the book we will see cases where a word’s context will include both the words to the right and to the left of the center word. We will then explicitly state what type of context we are talking about. Unless otherwise stated we will refer to a word’s left-to-right context—other words that precede a word.

To calculate the likelihood of a word given some preceding context, we use a variation of the previous equation. The numerator is the count of the word appearing after that context, and the denominator sums, over all words \(w\), the count of \(w\) appearing after that same context. This can be seen in Equation 2.3 below.

\[ P(''word''\ |\ context) = \frac{count(''word''\ |\ context)}{\sum\limits_{all\_words\ w} count(w\ |\ context)} \tag{2.3}\]
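
To make Equation 2.2 and Equation 2.3 concrete, here is a minimal sketch of both estimates in Python. It assumes the training corpus has been tokenized into tokens, a flat list of lowercase words; the function names are ours, not a standard API:

from collections import Counter

def unigram_probability(word, tokens):
    """Equation 2.2: a word's relative frequency in the training corpus."""
    return Counter(tokens)[word] / len(tokens)

def conditional_probability(word, context, tokens):
    """Equation 2.3: how often `context` is followed by `word`,
    out of all the times `context` is followed by any word."""
    n = len(context)
    context = list(context)
    followed_by_word = 0  # times `context` is immediately followed by `word`
    followed_by_any = 0   # times `context` is followed by any word at all
    for i in range(len(tokens) - n):
        if tokens[i : i + n] == context:
            followed_by_any += 1
            if tokens[i + n] == word:
                followed_by_word += 1
    return followed_by_word / followed_by_any if followed_by_any else 0.0

For the training text we will use in Section 2.4.2, conditional_probability("man", ["the"], tokens) should return 3/10 = 0.3: the word “the” appears ten times, three of which are followed by “man”.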

Likelihood of a sequence of words

Similarly to a single word, a sequence of words—a sentence—can also be represented as a probability. We say that the probability of a sentence is the joint probability of its composing words.

In other words, the probability of the sequence is the probability that every word in the sequence appears together, in that order. The chain rule of probability allows us to turn a joint probability into a product of conditional probabilities (each of which we can calculate with Equation 2.2 and Equation 2.3 above). Let’s show this in Equation 2.4 using the dummy sentence “w1 w2 w3” as an example:

\[ \begin{aligned} P(''w1\ w2\ w3'') &= P(''w1'', ''w2'', ''w3'') \\ &= P(''w1'') \cdot P(''w2''\ |\ ''w1'') \cdot P(''w3''\ |\ ''w1 w2'') \end{aligned} \tag{2.4}\]

The symbol “\(|\)” is read as “given” or “conditioned on” in statistics literature. When we are talking about text, however, it’s better to take that symbol to mean “preceded by”. For example, \(P(''field''\ |\ ''a\ dog\ played\ on\ the'')\) should read: the likelihood that we see the word “field” when preceded by the words “a dog played on the”.

In Section 2.1 we touched upon the duality between scoring (calculating the likelihood of a sequence of words) and predicting (predicting the most likely next word given its context). We mentioned that one could discover the most likely next word by a brute force approach—generating all possible options for the word, scoring them all, and picking the one with the highest score.

When we are talking about SLMs there is another very clear link between scoring and predicting.

The chain rule of probability (used in Equation 2.4) is a mathematical device that can be used to decompose a joint probability into a product of conditional probabilities.

If you have the individual conditional probabilities of every word in a sentence given a context, you simply need to multiply them to arrive at the full joint probability of the sentence.
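
Here’s what that multiplication looks like as a sketch, reusing the unigram_probability and conditional_probability helpers defined earlier in this section:

def sentence_likelihood(sentence, tokens):
    """Equation 2.4: decompose the joint probability of a sentence into a
    product of conditional probabilities over the full left-to-right context."""
    score = unigram_probability(sentence[0], tokens)
    for i in range(1, len(sentence)):
        # P(word_i | word_1 ... word_{i-1})
        score *= conditional_probability(sentence[i], sentence[:i], tokens)
    return score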

In Section 2.1 we used a language model to calculate the score for each variation of a sentence and picked the highest-scoring variation to predict the most likely next word. Here it’s the other way around: we use the probabilities of each possible next word to calculate the likelihood of the whole sentence.

Let’s look at a worked example in the next section.

2.4.2 Worked example

Let’s see how all these concepts tie together, with a complete worked example. We’ll use a tiny training set so that calculations are simple and you can understand exactly what is going on. The examples at inference time will be carefully chosen so that we see basic cases (the so-called happy path) but we’ll also include some pathological examples to showcase the limitations of the model.

Training time

The first thing we need is a training set—a corpus of text to train the model on. Let’s use the following small text, containing 96 words:

“The man lives in a bustling city where he sees men walking on the streets every day. He often feels lost in the sea of people, but he enjoys the energy and excitement of city life. Despite the chaos, the man has found a sense of community among his neighbors and coworkers. On his daily commute, he observes the different men walking on the streets, every man with his own story and purpose. After some years of living there, the man has found himself a home in the city and he can’t imagine living anywhere else.”

Now that we have a training set, we can train the model. Training an SLM, as we saw in Section 2.4.1, means going over the text and counting the times each word appears in each context. We can represent the trained model with a table such as Table 2.1 below.3

3 For simplicity’s sake, we will consider words in a case-insensitive manner. Also, we will only show the counts of a couple of examples; the full table would be too large.

Table 2.1: Selected counts of words and contexts, for a model trained on the text above.
Word        Preceding context    Raw count    Likelihood
“the”       (none)               10           10/96 ≈ 0.104
“man”       (none)               4            4/96 ≈ 0.042
“man”       “the”                3            3/10 = 0.3

Inference time

Let’s see how scoring the likelihood of a sentence would work in practice, with the model trained as per the section above.

Calculating the likelihood of the sentence “the man”:
Figure 2.6: Calculating the probability of the sentence “the man” with our sample SLM
Calculating the likelihood of the sentence: “men walking on the roads”:
Figure 2.7: Calculating the probability of the sentence “men walking on the roads” with our sample SLM
Calculating the likelihood of the sentence: “the man lives in a home in the city”:
Figure 2.8: Calculating the probability of the sentence “the man lives in a home in the city” with our sample SLM

Figure 2.6 above shows a basic case where a simple SLM can give an adequate score to an (admittedly short) phrase. Figure 2.7 and Figure 2.8 were meant to display two limitations of SLMs: the inability of these models to generalize to out-of-vocabulary words, even when those words are semantically similar to words found in the training text, and the collapse of probabilities to zero due to word combinations that don’t appear in the training set in that order. Let’s look at the limitations of SLMs in the next section.

2.4.3 Limitations

The examples we saw in the previous section provided some hints as to what limitations would prevent a naïve statistical language model (SLM) from being used in practice. Granted, we used an artificially small dataset to train the model, but the point stands at a larger scale.

Even if we train an SLM on a very, very large corpus, it will never contain every possible sentence—and even if it could, no hard disk would be large enough to store all the necessary information.

Collapsed probabilities

A clear limitation of SLMs as depicted in this Section is that inference is based on multiplications of conditional probabilities. This is seen in Equation 2.4.

If you try to calculate the likelihood of a sequence of words and any of the conditional probabilities is 0, the whole product collapses to 0. This can be seen in Figure 2.7 and Figure 2.8. In Figure 2.7, the product collapsed due to an out-of-vocabulary word, whereas in Figure 2.8 all individual words were in the vocabulary, but they never appeared in that order.

Generalization

In Figure 2.7 we saw that we got a collapsed probability for the sequence “men walking on the roads”, which is explained by the fact that the word “roads” is not in the training set. Any conditional probability term that includes this word will collapse to 0.

A better model should be able to understand that the word “roads” is roughly interchangeable with “streets”, which is a word that does appear in the training set. Being able to generalize—transfer probability density—across similar words would make for a more useful model, as it would reduce the need for every possible sequence to be present in the training set.

Computational cost

Training an SLM like the one we used in the worked example means going over the training corpus and recording how often each word appears—in what context.

This may seem easy enough with a small train corpus (as used in the example) but what if we were to use a large corpus with a vocabulary containing 1 million4 terms? If we had to store the conditional probabilities of 1 million terms appearing in multiple contexts, the disk space needed would be prohibitively expensive, precluding any practical use.

4 1 million is the rough order of magnitude of the vocabulary sizes used in popular open-source NLP packages.

That’s it for the basics of statistical language models (SLMs). In the following section, we’ll see what \(N\)-gram models are and how they help address some of the limitations seen above. Although \(N\)-gram models are an approximation to or even a subtype of SLMs, they are different enough to merit their own section.

2.5 \(N\)-gram Models

In the previous section, we saw some of the limitations of statistical language models (SLMs). These are mostly related to (a) the computational cost caused by the need to store a lot of information and (b) the inability of the model to generalize its knowledge to unseen word combinations, causing collapsing probabilities.

\(N\)-gram language models are more efficient SLMs and, at the same time, address some of their limitations. Let’s see how \(N\)-grams help us optimize SLMs and what such a model looks like in practice with a worked example. Let’s start with an explanation of what \(N\)-grams are.

2.5.1 \(N\)-grams

\(N\)-grams are a generalization over words. They can be used to represent single words but also pairs, triplets, and larger sequences thereof.

The choice of \(N\) in an \(N\)-gram defines the length of the smallest unit we use. For example, a 1-gram (also called a unigram) is an \(N\)-gram with a single element, so it’s just another name for a word. A 2-gram (also called a bigram) is simply a pair of words, whereas a 3-gram (also called a trigram) is an \(N\)-gram where \(N\)=3, and it’s simply a sequence of 3 words—a triplet.

The term “\(N\)-gram” has been around for decades; it’s used in works as early as Shannon (1948). In that specific work, however, character \(N\)-grams were used instead of word \(N\)-grams.

Let’s see how the same sentence would be represented using \(N\)-grams, with different values for \(N\). Figure 2.9 shows what the sentence “Cats and dogs make unlikely but adorable friends” looks like when split into unigrams (\(N\)=1), bigrams (\(N\)=2), and trigrams (\(N\)=3).

Figure 2.9: Representing a sentence with \(N\)-grams: In this example, the sentence “Cats and dogs make unlikely but adorable friends” can be represented with \(N\)-grams in several ways: using \(N\)=1 (unigrams), \(N\)=2 (bigrams), and \(N\)=3 (trigrams).
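
Here’s a quick sketch of how one might split a sentence into \(N\)-grams in Python (the ngrams helper is ours; libraries such as NLTK ship an equivalent):

def ngrams(words, n):
    """Return the overlapping N-grams of order n in a list of words."""
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]

sentence = "cats and dogs make unlikely but adorable friends".split()
print(ngrams(sentence, 1))  # unigrams: ('cats',), ('and',), ...
print(ngrams(sentence, 2))  # bigrams: ('cats', 'and'), ('and', 'dogs'), ...
print(ngrams(sentence, 3))  # trigrams: ('cats', 'and', 'dogs'), ...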

We can leverage \(N\)-grams to create approximations of fully-fledged SLMs. Let’s see how and why \(N\)-gram models tend to work better in practice.

2.5.2 \(N\)-gram models explained

As we mentioned in the introduction to Section 2.5, \(N\)-gram models are an approximation of a full statistical language model (SLM). They operate under two assumptions:

  1. We don’t need to know all the previous words in the context—just an N-gram thereof
  2. The closer a context word is to the target word, the more important it is.5

5 The last word before the target word is more important than the second-last word, which in turn is more important than the third-last word, and so on.

In practice, \(N\)-gram models work just like the SLMs we saw in Section 2.4, with a subtle but key difference: The conditional probabilities obtained after decomposition are pruned after \(N\)-1 steps. Following assumptions (1) and (2), we don’t need to consider all the context words to arrive at adequate estimations of probabilities—just the last \(N\)-1, just like an \(N\)-gram of order \(N\).

In Figure 2.10 we see an example of how to calculate the likelihood of a sentence using an \(N\)-gram model (with \(N\)=2, right side) and compare it to the way it’s done for a regular SLM (left):

Figure 2.10: Calculating the likelihood of sentence “w1 w2 w3 w4” using full conditional probabilities (left) and the analogous calculation with pruned conditional probabilities (right), as is the case with \(N\)-gram models. The last two terms are much simpler when calculated with \(N\)-gram models.
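
In code, the only change from the full SLM is that we truncate each context before looking it up. A sketch, reusing the helper functions from Section 2.4.1:

def ngram_sentence_likelihood(sentence, tokens, n=2):
    """Like Equation 2.4, but each conditional probability is pruned to at
    most the last n-1 context words (n=2 gives a bigram model)."""
    score = unigram_probability(sentence[0], tokens)
    for i in range(1, len(sentence)):
        context = sentence[max(0, i - (n - 1)) : i]  # keep only the last n-1 words
        score *= conditional_probability(sentence[i], context, tokens)
    return score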

\(N\)-gram models can be seen as the result of applying an engineering mindset to purely statistical LMs. While pure SLMs are a theoretical framework that provides a solid foundation to study language from a mathematical standpoint, they wouldn’t be very practical for real-world use. \(N\)-gram models could therefore be described as an “engineer’s approach to SLMs” or an answer to “What’s the best proxy we can have so that SLMs can be used in practice?”

\(N\)-gram models are not only more efficient than SLMs but also provide better results in the presence of incomplete data. Let’s see some ways in which \(N\)-gram models are better suited to language modeling than full SLMs.

Advantages over SLMs

Up until now, we have seen that \(N\)-gram models are an efficient approximation to full statistical LMs. It turns out there’s another benefit when we look at the collapsed probability issue. Let’s look at two specific advantages of \(N\)-gram models.

  1. Efficiency: SLMs and \(N\)-gram Models need to store conditional probability counts somewhere—either in memory or on disk—as they are trained. \(N\)-gram models require less space because the number of conditional probability combinations we need to keep track of is smaller (only contexts up to \(N\)-1 need to be kept). In addition to that, inference is also faster, as the number of multiplications is smaller.

  2. Reduced chance of collapsed probabilities: Since a smaller number of context words are taken into account when calculating probabilities, there’s a reduced chance of hitting upon a combination of words that were never seen in the training corpus. This means that there will be fewer zero terms in the conditional probability product—which in turn means that the result is not going to collapse to zero (as in any product, if any individual term is zero, the whole thing is zero as well).

2.5.3 Worked example

Similarly to what we did for regular statistical LMs in Section 2.4.2, let’s now see how we go from a body of text (the training corpus) to a trained \(N\)-gram language model, and see what results we get when we perform inference with it.

Train time

Let’s use the same text we used in Section 2.4.2:

“The man lives in a bustling city where he sees men walking on the streets every day. He often feels lost in the sea of people, but he enjoys the energy and excitement of city life. Despite the chaos, the man has found a sense of community among his neighbors and coworkers. On his daily commute, he observes the different men walking on the streets, every man with his own story and purpose. After some years of living there, the man has found himself a home in the city and he can’t imagine living anywhere else.”

There’s a difference between SLMs and \(N\)-gram models at training time: \(N\)-gram models only need to store the counts for a limited combination of contexts. For example, if we are using an \(N\)-gram model with \(N\)=2—a bigram model—we only need to store the counts for words with up to 1 context word. This greatly reduces the amount of space needed.

Let’s see the information we would need to store, in Table 2.2 below. Note that in this table there are no entries for combinations where the context size is larger than 1.

Table 2.2: Selected counts of words and contexts, for an \(N\)-gram model (\(N\)=2) trained on the text above.
Word        Preceding context    Raw count    Likelihood
“the”       (none)               10           10/96 ≈ 0.104
“man”       “the”                3            3/10 = 0.3
“walking”   “men”                2            2/2 = 1.0
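
In code, training such a bigram model amounts to a single pass over the corpus accumulating these counts. A minimal sketch, again assuming tokens is the training text as a flat list of lowercase words:

from collections import Counter, defaultdict

unigram_counts = Counter(tokens)
bigram_counts = defaultdict(Counter)  # bigram_counts[prev][word] = count

# a single pass: count each word together with the word right before it
for previous, current in zip(tokens, tokens[1:]):
    bigram_counts[previous][current] += 1

# e.g., P("man" | "the") = bigram_counts["the"]["man"] / sum(bigram_counts["the"].values())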

Inference time

Let’s see some examples of our \(N\)-gram model being used at inference time:

Calculating the likelihood of the sentence “the man”:
Figure 2.11: Calculating the probability of the sentence “the man” with an \(N\)-gram model (\(N\)=2). Note that the result is the same as we got in Section 2.4.2 for the SLM.
Calculating the likelihood of the sentence: “men walking on the roads”:
Figure 2.12: Calculating the probability of the sentence “men walking on the roads” with an \(N\)-gram model (\(N\)=2). Note that several terms are simpler than the equivalent example calculated by the SLM in Section 2.4.2. However, the probability of the whole sentence still collapsed to zero because “roads” is an out-of-vocabulary word.
Calculating the likelihood of the sentence: “the man lives in a home in the city”:
Figure 2.13: Calculating the probability of the sentence “the man lives in a home in the city” with an \(N\)-gram model (\(N\)=2). The probability for the sentence is non-zero because none of the terms collapsed to zero. This is one example where using an \(N\)-gram model saved us from collapsing probabilities. See Figure 2.8 in Section 2.4.2 for the same example calculated with SLMs.

After seeing these examples, let’s understand which limitations still affect \(N\)-gram models—even though they have clear benefits over full SLMs.

2.5.4 Limitations

In the previous section, we saw that \(N\)-gram models fail to provide correct scores for relatively simple input sequences. Let’s analyze some of the main shortcomings of these models.

Limited context

Using a reduced version of the context increases efficiency and reduces the risk of collapsed probabilities. But in some cases—especially if one chooses a small value for \(N\)—there will be some loss in performance because relevant words will be left out of the context.

Let’s see an example with the following sentence: “People in France speak ______”.

As explained in Section 2.1, we can predict the most likely next word by generating all variations, calculating the likelihood score for them, and then picking the highest-scoring variant. Let’s analyze what these would look like when calculated using \(N\)=2 and using \(N\)=3.

Since we want to keep the example simple, we will only consider the scores for two variants: “People in France speak Spanish” and “People in France speak French”. A human would naturally know that the latter is the correct answer, but what would our model say?

We can see the difference in Figure 2.14 below. When we use \(N\)=2, the model assigns almost the same score to both variants, so it is not able to realize that “French” is a better candidate than “Spanish” for the missing word. Adding a single word to the context (increasing \(N\) to 3) makes all the difference: the model is now able to use conditional probabilities that include the word “France”, which naturally occurs much more often together with “French” than it does with “Spanish”!

Figure 2.14: A clear case where using an N-gram model with too small a context hinders performance. When we use \(N\)=3, the model can pick up a context that includes the word “France”. Since the word “France” is used much more often with “French” than it is with “Spanish”, the model is more likely to infer that the missing word is “French”. EOS is a marker for the end of the sequence.

The choice of \(N\) introduces a tradeoff between efficiency and expressivity; smaller values of \(N\) make the model faster and require less storage space but larger values of \(N\) allow the model to consider more data to calculate probabilities.

Collapsed probabilities

As we saw in Figure 2.12, \(N\)-gram models are still vulnerable to collapsed probabilities, especially due to out-of-vocabulary words. Not as vulnerable as pure SLMs, but it’s still an issue.

Generalization

When it comes to generalizing to words that are semantically similar to those seen at train time, \(N\)-gram models don’t help us much in comparison with regular SLMs.

We saw an example in Figure 2.12. Even though the calculations were simpler and faster, we again hit upon a collapsed probability—and the model wasn’t able to transfer probability from the word “streets” to the word “roads”. The generalization problem will be solved with Neural LMs and Embedding Layers, which will be discussed in Chapter 3.

Some of these limitations were worked around—or at least mitigated—with a little engineering ingenuity, as we’ll see next.

2.5.5 Later optimizations

\(N\)-gram models are an optimization on top of regular SLMs, but they fall short of a perfect model, as we saw in the previous section. Additional enhancements were proposed to make \(N\)-gram models more capable and address some of their limitations. These include backoff, smoothing, and interpolation. Table 2.3 shows the main objectives of each strategy:

Table 2.3: \(N\)-gram model enhancements and their objectives
Strategy Objective
Backoff Reduce the chance of collapsed conditional probability products
Smoothing Reduce the chance of collapsed conditional probability products
Interpolation Increase the quality of model output

Let’s explain each in more detail.

Backoff

Backoff is a strategy whereby one replaces a zero-term with a lower-order \(N\)-gram term in the conditional probability decomposition—to prevent the whole product from collapsing to zero.

As an example, let’s calculate the likelihood score for the string “w1 w2 w3 w4” using backoff. If the word “w4” is never preceded by “w2 w3” in the training set, the conditional probability \(P(''w4''\ |\ ''w2\ w3'')\) will collapse to zero, so we back off to \(P(''w4''\ |\ ''w3'')\) instead. See Figure 2.15 for a visual representation.

Figure 2.15: If the term \(P(''w4''\ |\ ''w2\ w3'')\) has zero probability in the train set, a simple trigram model would cause the score for “w1 w2 w3 w4” to collapse to zero. Using a backoff model allows us to replace the zero-term with a lower-order \(N\)-gram instead, which can prevent the collapse at the cost of slightly lower accuracy.
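
Here’s a sketch of a simplified backoff scheme, reusing the helper functions from Section 2.4.1. Note that real backoff models (such as Katz backoff) also discount the higher-order estimates so that the result remains a proper probability:

def backoff_probability(word, context, tokens):
    """If the full context was never seen before `word`, retry with a
    shorter context; fall back to the unigram probability at the end."""
    while context:
        p = conditional_probability(word, context, tokens)
        if p > 0:
            return p
        context = context[1:]  # drop the oldest context word and back off
    return unigram_probability(word, tokens)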

Smoothing

Smoothing refers to modifying zero terms in the conditional probability decomposition, to avoid collapsed probabilities.

The simplest way to do this is called Laplace Smoothing (aka Add-one Smoothing). It works by adding 1 to every \(N\)-gram count before performing the calculations. This will get rid of zeros and prevent probability collapse. See an example in Figure 2.16 below:

Figure 2.16: By adding 1 to every count before multiplying the terms, we get rid of zeros and we have a valid—non-collapsed—score.
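
Here’s a sketch of Laplace smoothing applied to the conditional estimate from Equation 2.3. In the standard formulation, adding 1 to every count in the numerator also requires adding the vocabulary size to the denominator, so that the smoothed probabilities still sum to 1:

def laplace_probability(word, context, tokens, vocabulary_size):
    """Add-one smoothing: no conditional probability is ever exactly zero."""
    n = len(context)
    context = list(context)
    followed_by_word = 0
    followed_by_any = 0
    for i in range(len(tokens) - n):
        if tokens[i : i + n] == context:
            followed_by_any += 1
            if tokens[i + n] == word:
                followed_by_word += 1
    # +1 in the numerator, +V in the denominator
    return (followed_by_word + 1) / (followed_by_any + vocabulary_size)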

Interpolation

Unlike backoff and smoothing, interpolation is used to enhance the quality of scores output by an \(N\)-gram model—not to help with collapsed probabilities.

It works by using all available \(N\)-grams up to the order we are considering. For example, if we apply interpolation to a trigram model, bigrams and unigrams will also be used—in addition to trigrams. Figure 2.17 shows how the calculations change when we introduce interpolation:

Figure 2.17: A simplified interpolation strategy for a trigram model. Added terms include values for lower-order N-grams (\(N\)=2 and \(N\)=1) to the original terms. Some details were left out to aid in understanding, without loss of generality.
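
As a sketch, the interpolated estimate for a trigram model combines all three orders with weights \(\lambda_i\) that sum to 1 (how the weights are chosen is a design decision, typically tuned on held-out data):

\[ P_{interp}(w_3\ |\ w_1\ w_2) = \lambda_3 P(w_3\ |\ w_1\ w_2) + \lambda_2 P(w_3\ |\ w_2) + \lambda_1 P(w_3), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]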

2.6 Measuring the performance of Language Models

Measuring the quality of outputs from a language model is crucial to understanding which techniques enhance the model capabilities—and whether the added complexity is worth the gain in performance.

There are two different strategies to evaluate the performance of an LM: intrinsic evaluation refers to measuring how well an LM performs on its own, i.e. in the language modeling task itself. Extrinsic evaluation, on the other hand, measures an LM’s performance based on how well that LM can be used to help other NLP tasks.

These two types of evaluation are not mutually exclusive—we may need to use both, depending on the task at hand.

In this Section, we will focus exclusively on intrinsic evaluation. Extrinsic evaluation will be discussed in Part III, where we will see how pre-trained LMs can be used to leverage other—so-called downstream—NLP tasks.

A visual summary of the ways to evaluate an LM’s performance can be seen in Figure 2.18 below.

Figure 2.18: Language models can have their performance evaluated using different strategies.

2.6.1 Objective vs Subjective evaluation

In intrinsic evaluation, we have at least two ways to evaluate an LM: using objective metrics calculated on a test set (as any other ML model) and asking humans to subjectively rate the output of an LM and assess its quality based on their own opinions.

Both objective and subjective evaluations are useful depending on what we want to measure.

When comparing two different models you probably want an objective metric, as it will better show small differences in performance.

On the other hand, suppose you want to compare two models to decide which generates text that’s closer to Shakespeare’s style—this is not so easily captured in an objective metric so it may be better to use subjective evaluation.

We will see one example of each of these two strategies. We’ll start with objective evaluation, by looking at perplexity—the most commonly used metric for intrinsic LM evaluation. Then we’ll go over a subjective evaluation method used by OpenAI to measure the quality of the output generated by GPT-3.

2.6.2 Objective evaluation

Evaluating a model objectively means using precisely defined metrics. This is in contrast to subjective evaluation, which usually involves humans and their subjective opinions.

Language models are a subtype of machine learning (ML) models. And there is a very simple yet effective way to evaluate ML models to understand how well (if at all) they are able to learn from data.

We call it the train-test split: we randomly split our data into two groups (the train set and the test set) and make sure we use the train set for training and the test set for evaluation, without mixing the two. It would be trivially easy for any model to achieve perfect performance on the same set it was trained on—it could simply memorize all of the data. A random split also helps ensure that both sets follow the same distribution. A visual explanation can be seen in Figure 2.19 below:

Figure 2.19: Machine learning models must be evaluated on a different dataset from the one they’ve been trained on.

The train-test split strategy is not tied to any one specific modeling approach; it can be used with any ML model and, consequently, with any LM we mention in this book—including neural LMs, which we’ll cover in the next chapter.
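
Here’s a minimal sketch of a random train-test split, assuming documents holds the full training corpus as a Python list (the 80/20 ratio is a common convention, not a requirement):

import random

random.seed(42)            # fix the seed so the split is reproducible
random.shuffle(documents)  # randomize so both sets follow the same distribution

split_point = int(0.8 * len(documents))
train_set = documents[:split_point]  # used only at train time
test_set = documents[split_point:]   # used only for evaluation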

You may have heard of metrics such as BLEU and ROUGE being used to evaluate language models. These are extrinsic metrics—they are used in downstream NLP tasks such as machine translation and summarization.

The most widely used metric for evaluating language models intrinsically, in an objective manner, is perplexity. Let’s see how it works.

Perplexity

Perplexity (commonly abbreviated as PPL) is a measure of how surprised (or perplexed) a language model is when it looks at some new text. The higher a model’s perplexity, the worse its performance.6

6 One way to remember this relationship is to consider that a well-trained LM should never be perplexed when it encounters some new text! A well-trained LM should have a good estimate of what its target language looks like—any new text should be unsurprising, statistically speaking.

Perplexity can only be calculated in relation to some data. This is why it doesn’t make sense to say “I trained a language model and its perplexity is 100”. Perplexity is always calculated on some dataset—the test set. We must always say which test set the perplexity was calculated on!

Mathematically, the perplexity of a language model \(\theta\) with respect to a corpus \(C\) is the \(N\)-th root of the inverse of the likelihood score of \(C\), as we can see in Equation 2.5 below. \(N\) is the number of words in the corpus; it’s used to normalize the score.

\[ \textit{Perplexity}(\theta)_C = \sqrt[N]{\frac{1}{P_\theta(C)}} \tag{2.5}\]

If our LM gives a high likelihood score to the text in the test corpus it means that it was able to correctly understand that it is valid text. The denominator in Equation 2.5 will be large, so the perplexity will be small.

Conversely, if an LM assigns a low likelihood score to the text in the test corpus, it should be penalized, as it failed to give high scores to valid text. This intuition matches the formula in Equation 2.5: if the likelihood score is low, the denominator will be small and the value of the fraction will be larger.

What about \(N\)? Why do we need to take the \(N\)-th root?

\(N\) is the number of words in the test corpus \(C\). We take the root of the inverse likelihood because a corpus with fewer words would always have lower perplexity for a given model \(\theta\). Taking the root eliminates this bias and allows us to compare models across multiple corpora.7

7 Corpora is the plural form of corpus
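
In practice, Equation 2.5 is usually computed in log space, because the product of many small probabilities quickly underflows in floating point. A sketch:

import math

def perplexity(log_likelihood, num_words):
    """Equation 2.5 in log space: the N-th root of the inverse corpus
    likelihood equals exp(-log P(C) / N)."""
    return math.exp(-log_likelihood / num_words)

# e.g., a 1,000-word test corpus with total log-likelihood of -6,000:
print(perplexity(-6000.0, 1000))  # ≈ 403.4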

2.6.3 Subjective evaluation

Language models can be used to generate text—by just predicting the next word of a sentence over and over again. It’s hard to precisely define the quality of such generated text, especially in an intrinsic manner. One way is by measuring the perplexity on a test set.

But we can also delegate this task to humans—ideally a large, diverse group thereof—and ask them to evaluate a language model according to what they believe is good performance.

Strategies to subjectively evaluate the output of language models will make a comeback when we talk about Alignment in Part IV, especially as they relate to RLHF (Reinforcement Learning from Human Feedback).

An interesting example of such a subjective evaluation strategy can be seen in the GPT-3 article (Brown et al. (2020)). It centers around asking humans to differentiate automatically generated texts from those written by humans. Let’s see the specifics:


Detecting automatically generated text

One way to measure the quality of generated text is by checking if a person can tell it apart from a legitimate text, written by a human. This can be seen as a simplified version of a Turing test, where a model must trick a human interviewer into thinking they are talking to a real person.

This method was used in the GPT-3 article (Brown et al. (2020)). It proceeds by showing human annotators text examples that are either (1) extracted from a real news website or (2) generated on the spot with a language model.

Human participants were asked to rate how likely they thought each text was to have been automatically generated, assigning it Likert-type classifications ranging from “very likely written by a human”, through “I don’t know”, to “very likely written by a machine”.

Detecting whether a given piece of text was automatically generated is key to addressing the many risks introduced by the widespread use of large LMs, as we’ll explain in Part V.

At the end of the experiment, the authors concluded that the results matched expectations, and they highlighted two key findings:

  1. The longer the text, the easier it is for humans to tell apart synthetic from human-generated text;
  2. For the most powerful GPT-3 variant, human annotators could tell synthetic texts apart from natural ones only about 52% of the time. This is hardly better than flipping a coin—text generated by GPT-3 is nearly indistinguishable from human-written text!

Note that this strategy is also an intrinsic evaluation method—since it doesn’t evaluate the LM in how it enables other NLP tasks, but in how it performs in a core language modeling task: raw text generation by predicting the next word in a sentence.

2.7 Summary

  • Language models are Machine Learning models and, as such, have separate training and inference stages and should be validated on out-of-sample data.

  • Language models were first created to aid in problems such as speech recognition, text correction, typing assistance, and text extraction from images.

  • Statistical language models (SLMs) view language as a statistical distribution over words. They represent a sentence as a joint probability of words and use the chain rule of probability to decompose it into conditional probabilities that can be directly calculated from word counts.

  • Simple SLMs suffer from problems such as the sparsity of long sequences of words and heavy computational costs.

  • \(N\)-gram models are an approximation to SLMs and solve some of the problems that afflict them. \(N\)-gram models themselves have some limitations, and later optimizations such as backoff, smoothing, and interpolation were developed to address those.

  • Strategies to evaluate the performance of LMs can be split into intrinsic and extrinsic, depending on whether the LM is being measured on the language modeling task or on downstream NLP tasks.

2.8 References

Brown, T., B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:1877–1901. Curran Associates, Inc. https://bit.ly/brown-2020-gpt3.
Shannon, C. E. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27: 379–423. Accessed April 22, 2003. http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.