1  Language Models: What and Why

This chapter covers

  • What language models (LMs) are and how they are used
  • The main types of language models: statistical and neural
  • How Attention and the Transformer architecture changed language modeling
  • Why language models are the base layer of modern NLP
  • What instruction-tuning and alignment are, and how they lead to models such as ChatGPT
Anyone who hasn’t been living under a rock in the last few years (late 2010s to early 2020s) has been bombarded with examples of seemingly magical content produced by new Artificial Intelligence (AI) models: full novels written by AI; poems generated to mimic a specific author’s style; chatbots that act indistinguishably from a human being. The list goes on.

The names of such models may sound familiar to those with a technical bent: GPT, GPT-2, BERT, GPT-3, ChatGPT, and so on.

These are called Large Language Models (LLMs) and they have only grown more powerful and more expressive over time, being trained on ever larger amounts of data and applying techniques discovered by researchers in academia and in companies such as Google, Meta, and Microsoft.

In this chapter, we will give a brief introduction to language models (large and otherwise) and related technologies, providing a foundation for the rest of the book. We call it “What and Why” because we show what LMs are but also why they are now the base layer for modern Natural Language Processing (NLP).

In Section 1.1 we explain at a basic level what language models (LMs) are and how one can use them. Section 1.2 introduces the main types of LMs, namely statistical and neural language models. In Section 1.3 we show how Attention mechanisms and the Transformer architecture help LMs better keep state and use a word’s context. In Section 1.4 we explain why LMs play a pivotal role in modern NLP and, finally, in Section 1.5 we show what alignment means in the context of LMs and how it is used to create models such as ChatGPT.

1.1 What are Language Models?

As its name suggests, a Language Model (LM) models how a given language works.

Here, modeling a language means assigning scores to arbitrary sequences of words, such that the higher the score for a sequence of words, the more likely it is to be a meaningful sentence in a language such as English. This is shown in Figure 1.1:

The title of this book specifically mentions Large Language Models (LLMs). The term is not very precisely defined but here’s our working definition: Large Language Models are LMs that (a) employ large, deep neural nets (billions of parameters) and (b) have been trained on massive amounts of text data (hundreds of billions of tokens).

Figure 1.1: At its most basic, a Language Model is a function that takes in a sequence of words as input and outputs a score. High scores mean the input is valid. Low scores mean it’s probably gibberish. An in-depth overview of language modeling is given in Chapter 2.

The LMs we will focus on in this book learn from data (as opposed to rules manually crafted by linguistics experts), mostly using Machine Learning (ML) tooling.

A perfect picture of a language would require us to have access to every document that was ever written in it. This is clearly impossible, so we settle for using as large a dataset as we can. An unlabelled natural language dataset is called a corpus.

The corpus is thus the set of documents that represent the language we are trying to model, such as English. The corpus is the data LMs are trained on. After they are trained they can then be used, just like any other ML model. Here, using a trained LM means feeding it word sequences as input and obtaining a likelihood score as output, as is shown in Figure 1.1.

A corpus is a set of unlabelled documents that represent the language we are trying to model with a Language Model. Corpora is the plural form of corpus.

The distinction between train time and inference time is crucial to understanding how LMs work.

Train time refers to the process where the model processes the corpus and learns the characteristics of the target language. This step is usually very time-consuming. On the other hand, inference time is the stage in which the language model is used, for example in calculating a sentence’s likelihood score. Inference is usually very fast and takes place after training. This will be explained in more detail in Chapter 2, where we dive deep into language modeling.

In the sections below we will take a closer look at how LMs can be used in a real-world setting and at the main types used in practice.

1.1.1 LM use cases

As we saw in the previous section, LMs need to be trained on a corpus of documents. After they have been trained, they hold some idea of what the language in question looks like, and only then can we use them for practical tasks.

The most basic use for a language model is to output likelihood scores for word sequences, as we stated previously. But there is an additional use for them: predicting the next word in a sentence.

The two ways to use—or perform inference with—a trained language model are therefore: (a) assign a score that measures the likelihood of a sequence of words and (b) predict the most likely next word in a sentence, based on the previous words. These are two simple and seemingly uninteresting applications, but they will unlock surprising outcomes as we’ll see in the next sections. Figure 1.2 shows examples for both uses, side-by-side:

Figure 1.2: The two basic uses for Language Models: (a) calculate the likelihood score for a given word sequence and (b) predict the next word in the sequence. Note that the left side of this figure is just a rehash of Figure 1.1

It may not be immediately obvious, but these use cases are two sides of the same coin: if you have an LM trained to calculate likelihood scores, it is easy to use it to predict or guess what the next word in a sentence will be. This is because you can brute-force your way to the solution by scoring every possible word in the vocabulary and picking the option with the highest score! (see Figure 1.2, right side).
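To make this concrete, here is a minimal sketch of that brute-force approach. It assumes we already have a trained model exposed through a hypothetical `score()` function that returns the likelihood score of a word sequence:

```python
# A minimal sketch: next-word prediction via brute-force scoring.
# `score(sequence)` is a hypothetical function returning the likelihood
# score a trained LM assigns to a sequence of words.

def predict_next_word(prefix, vocabulary, score):
    """Return the word in `vocabulary` that best continues `prefix`."""
    best_word, best_score = None, float("-inf")
    for word in vocabulary:
        candidate_score = score(prefix + [word])  # score the extended sequence
        if candidate_score > best_score:
            best_word, best_score = word, candidate_score
    return best_word

# Example usage with a toy vocabulary:
# predict_next_word(["a", "man", "walks", "across", "the"],
#                   ["street", "banana", "sleeps"], score)
```

In practice, vocabularies contain tens of thousands of words, so real implementations compute scores for all candidates in a single pass rather than looping, but the idea is the same.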

Let’s now have a brief look at the main types of language models, what they have in common, and how they differ.

1.2 Types of Language Models

In practice, the two most common ways to implement language models are either via statistics (counting word frequencies) or with the aid of neural networks. For both the general structure stays the same, just like we explained in Section 1.1: the language model is trained on some corpus and, once trained, it can be used as a function to measure how likely a given word sequence is or to predict the next word in a sequence.

Figure 1.3 shows the main types of language models, namely statistical and neural language models, with subclassifications. These are discussed in Sections 1.2 and 1.3. Also, Chapters 2 and 3 will provide a thorough analysis of these models.

Figure 1.3: Types of Language models

Let’s now explore the characteristics of statistical and neural language models.

1.2.1 Statistical Language Models

One simple way to model a language is to use probability distributions. The claim is that a language can be characterized by the probability distribution of the words and word sequences that occur in it.

\(P(word)\) is the frequency with which the word appears in the corpus. \(P(word\ |\ context)\) measures how often \(word\) appears when preceded by \(context\) in the corpus.

More specifically: the probability of a word sequence is the joint probability of the words in the sequence, as can be seen in Equation 1.1:

\[ \begin{aligned} P(''a\ man\ walks\ across\ the\ street'') = P(''a'',\ ''man'',\ ''walks'',\ ''across'',\ ''the'',\ ''street'') \end{aligned} \tag{1.1}\]

Intuitively, the meaning of a word is very much dependent on the words around it. Therefore, we can decompose the joint probability into a product of conditional probabilities, using the chain rule of probability (See Section 2.1 for a detailed explanation and examples).
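Applied to the example in Equation 1.1, the chain-rule decomposition looks like this:

\[ \begin{aligned} P(''a'',\ ''man'',\ ''walks'',\ ''across'',\ ''the'',\ ''street'') =\ & P(''a'') \times P(''man''\ |\ ''a'') \times P(''walks''\ |\ ''a\ man'') \\ & \times P(''across''\ |\ ''a\ man\ walks'') \\ & \times P(''the''\ |\ ''a\ man\ walks\ across'') \\ & \times P(''street''\ |\ ''a\ man\ walks\ across\ the'') \end{aligned} \]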

Solving Equation 1.1 is simple enough—after decomposing into conditional probabilities, you count how many times each word appears in every context in the corpus.

This approach, however, quickly becomes impractical with real-world-size data, as it is computationally very expensive to calculate the terms for large text corpora—the number of possible combinations grows exponentially with the size of the context and the vocabulary.

In addition to being inefficient, such models aren’t able to generalize calculations if we need to score a word sequence that is not present in the corpus—they would assign a score of zero to every sequence not seen in the train set.

Purely statistical models aren’t often used in practice—but they provide a foundation for \(N\)-gram models, as we’ll see next.

N-gram language models

\(N\)-gram language models are an optimization on top of fully statistical language models (SLMs). But what are \(N\)-grams?

\(N\)-grams are an abstraction of words. \(N\)-grams where \(N=1\) are called unigrams and are just another name for a word. If \(N=2\), they are called bigrams and they represent ordered pairs of words. If \(N=3\), they’re called trigrams and—you guessed it—they represent ordered triples of words. Figure 1.4 shows an example of what a sentence looks like when it’s split into unigrams, bigrams, and trigrams:

Figure 1.4: \(N\)-grams: Representing a sentence with \(N\)-grams: In this example, the sentence “A man walked by the grocery store” can be represented with unigrams, bigrams, trigrams, etc. Note that a unigram is just another name for a word.
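As a minimal sketch, here is how a sentence can be split into \(N\)-grams in code (mirroring Figure 1.4):

```python
# A minimal sketch: splitting a sentence into N-grams.

def ngrams(words, n):
    """Return the list of N-grams (as tuples) for a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "a man walked by the grocery store".split()
print(ngrams(sentence, 1))  # unigrams: ('a',), ('man',), ...
print(ngrams(sentence, 2))  # bigrams: ('a', 'man'), ('man', 'walked'), ...
print(ngrams(sentence, 3))  # trigrams: ('a', 'man', 'walked'), ...
```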

But how do \(N\)-grams make SLMs better?

\(N\)-grams enable us to prune the number of context words used when calculating conditional probabilities. Instead of conditioning on all previous words in the context, we approximate the conditional probability using only the last \(N-1\) words. This reduces the space and the number of computations needed compared with fully statistical LMs and helps address the curse of dimensionality related to rare combinations of words.
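In equation form, the \(N\)-gram approximation (shown here for a trigram model, i.e. \(N=3\)) is:

\[ P(w_i\ |\ w_1,\ w_2,\ \ldots,\ w_{i-1}) \approx P(w_i\ |\ w_{i-2},\ w_{i-1}) \]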

\(N\)-gram models are no panacea, however: they still cannot generalize to unseen sequences, and deliberately ignoring context beyond \(N-1\) words limits the capacity of the model to capture longer dependencies. \(N\)-gram models will be discussed in more detail in Chapter 2.

1.2.2 Neural Language Models

As interest in neural nets picked up again in the early 2000s, researchers (starting with Bengio et al. (2003)) began to experiment with applying neural nets to the task of building a language model, using the well-known and trusted backpropagation algorithm. They found out that not only was it possible, but it worked better than any other language model seen so far—and it solved a key problem faced by statistical language models: not being able to generalize to unseen word sequences.

The training strategy relies on self-supervised learning: train a neural network where the features are the words in the context and the target is the next word. You simply build a training set in that form and train the network in a supervised way, as you would any other neural net. We will explain this in depth in Chapter 3.

Self-supervised learning refers to programmatically constructing a labeled dataset from an unlabeled dataset, then proceeding with supervised learning. In this book, the terms self-supervised and unsupervised learning will be used interchangeably, to signal cases where no human labeling is needed.
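As a minimal sketch (with a hypothetical, fixed context size of 3 words), building such a self-supervised training set from raw text could look like this:

```python
# A minimal sketch: building (context, next word) training pairs
# from an unlabeled corpus, using a fixed context size.

CONTEXT_SIZE = 3  # illustrative; real models use larger contexts

def build_training_pairs(corpus, context_size=CONTEXT_SIZE):
    """Yield (context_words, next_word) pairs from a list of sentences."""
    for sentence in corpus:
        words = sentence.split()
        for i in range(context_size, len(words)):
            yield words[i - context_size:i], words[i]

corpus = ["a man walks across the street", "the dog runs on the field"]
for context, target in build_training_pairs(corpus):
    print(context, "->", target)
# ['a', 'man', 'walks'] -> across
# ['man', 'walks', 'across'] -> the
# ...
```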

In addition to producing good language models (in the sense that they are good at predicting how valid a piece of text is), this approach leaves behind an interesting by-product once training is done: learned representations for words—word embeddings.

Word embeddings will play a key role later in Section 1.4. Also, Chapter 5 will focus specifically on text representations, and embeddings will be discussed there in depth.

Even though the first model introduced by Bengio et al. (2003) was a relatively simple feedforward, shallow neural net, it proved that this strategy worked, and it set the path forward for many other developments.

With time, neural LMs evolved by using ever more complex neural nets, trained on increasingly larger datasets. Deep neural nets, convolutional neural nets, recurrent connections, the encoder-decoder architecture, attention, and, finally, transformers, are just some examples of the technologies used in these models. Figure 1.5 shows a selected timeline with some of the key technological breakthroughs and milestones related to neural LMs:

Figure 1.5: Selected timeline with key milestones related to neural language models, from both academia and the industry.

Most modern language models are neural LMs. This is unsurprising because (1) neural nets can handle a lot of complexity and (2) neural LMs can be trained on massive amounts of data, with no need for labeling. The only constraints are the available computing power and the budget.

Research on neural nets (from both academia and industry) has advanced greatly in recent decades, so it was a match made in heaven: as the amount of text on the Web grew, new and more efficient ways to train neural nets appeared—better algorithms on the software side and purpose-built chips on the hardware side.

Let’s now explain the role a word’s context plays in neural LMs—and how taking context into account helps us train better models.

The need for memory in Neural LMs

The basic building block of text is a word, but a word on its own doesn’t tell us much. We need to know its context—the other words around it—to fully understand what a word means. This is seen in polysemous1 words: the word “cap” in English can mean a head cover, a hard limit for something (a spending cap), or even a verb. Without context, it’s impossible to know what the word means.

1 Polysemous words are those that have multiple meanings.

Just as a human is better able to understand a word when its context is available, so is an LM. In the case of language modeling, this means having some kind of memory or state in the model—so that it can consider past words as it predicts the next ones.

In LM parlance, the context of a word W refers to the accompanying words around W. For example, if we focus on the word “running” in the sentence “A dog is running on the field”, the context is made up of “A dog is …” on the left side and “… on the field” on the right side. A word’s context is key for language modeling.

While the neural language models we have seen so far do take some context into account during training, there are key limitations: they use very small contexts (5-10 words only), and the context size must be fixed a priori for the whole model2. Recurrent neural nets can work around this limitation, as we’ll see next.

2 Feedforward neural nets cannot natively deal with variable-length input.

Recurrent Neural Networks

The standard way to incorporate state in neural nets (to address the memory problem explained above) has for some time been Recurrent Neural Networks (RNNs). RNNs use the output from the previous time step as additional features to produce an output for the current time step. This enables RNNs to take past data into account. The basic differences between regular (i.e. feedforward) and recurrent neural nets are shown in Figure 1.6:

Figure 1.6: Feedforward neural nets only use features from the current time step to calculate the output, whereas recurrent neural nets use features from the current time step but also use the output from the previous time step. The dotted lines represent the flow of information and the circles represent nonlinear operations on vectors, such as the sigmoid function.
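A minimal numpy sketch of a single recurrent step is shown below (the weights and sizes are illustrative):

```python
import numpy as np

# A minimal sketch of one recurrent step: the hidden state h_t depends
# on the current input x_t AND on the previous hidden state h_prev.

input_size, hidden_size = 8, 16          # illustrative sizes
W_x = np.random.randn(hidden_size, input_size) * 0.1
W_h = np.random.randn(hidden_size, hidden_size) * 0.1
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """Compute the next hidden state from the current input and previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_size)                    # initial state
for x_t in np.random.randn(5, input_size):   # a toy sequence of 5 time steps
    h = rnn_step(x_t, h)                     # state carries information forward
```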

The simplest way to train RNNs is to use an algorithm called Backpropagation Through Time (BPTT). It’s similar to the normal backpropagation algorithm but for each iteration, the network is first unrolled so that the recurrent connections can be treated as if they were normal connections.

There are three issues with BPTT for RNNs, however. Firstly, it’s very computationally expensive to execute, especially as one increases the number of time steps (in the case of NLP, this means the size of the context) one wants to look at. Secondly, it’s not easy to parallelize training for RNNs, as many operations must be executed sequentially. Finally, running backpropagation over such large distances causes gradients to explode or vanish, which precludes the training of networks using larger contexts.

Better memory: LSTMs

It turns out one can be a little more clever when propagating past information in RNNs. The ultimate objective is to be able to store long-range dependencies (i.e. being able to consider very large contexts) while avoiding the problems of vanishing/exploding gradients.

One can better control how past information is passed along with so-called memory cells. One commonly used type of memory cell is the LSTM (Long Short-term Memory).

LSTMs were introduced by Hochreiter and Schmidhuber (1997). They work by propagating an internal state and applying nonlinear operations to the inputs (i.e. the input from the current time step and the previous output), with gates controlling what should be input, output, or forgotten. In vanilla RNN cells, no such operations are applied, and no state is propagated explicitly. These 3 gates are the 3 solid blocks labeled “F”, “I” and “O”, shown on the right side of Figure 1.7.

Figure 1.7: RNN cells (left) take the output from the previous time step and also the current input and apply a nonlinear operation (circles) to produce the current output. LSTM cells (right) also propagate an internal state, to which several nonlinear operations are applied—forget gates, input gates, and output gates, represented by the letters F, I, and O, respectively. The dotted lines represent the flow of information and the circles represent nonlinear operation on vectors, such as the sigmoid function.
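Below is a minimal numpy sketch of one LSTM step, with the forget, input, and output gates from Figure 1.7 (sizes and weights are illustrative, and biases are omitted for brevity):

```python
import numpy as np

# A minimal sketch of one LSTM step, showing the forget (F), input (I)
# and output (O) gates. Sizes and weights are illustrative; biases omitted.

input_size, hidden_size = 8, 16
concat = input_size + hidden_size
W_f, W_i, W_o, W_c = (np.random.randn(hidden_size, concat) * 0.1 for _ in range(4))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    """Return the new output (hidden state) and the new internal (cell) state."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)           # forget gate: what to drop from the cell state
    i = sigmoid(W_i @ z)           # input gate: what new information to store
    o = sigmoid(W_o @ z)           # output gate: what to expose as output
    c_tilde = np.tanh(W_c @ z)     # candidate cell state
    c = f * c_prev + i * c_tilde   # propagated internal state
    h = o * np.tanh(c)             # current output
    return h, c
```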

Crucially, LSTMs (or any other type of memory cell) don’t address the computational costs of training recurrent neural nets, because the recurrent connections are still there. They do help avoid the problem of vanishing/exploding gradients and they also help in storing longer-range dependencies than would be possible in a vanilla RNN, but the scaling problems with respect to training complexity remain. We will cover RNNs and LSTM cells in more detail in Chapter 4.

In Section 1.2 we saw the main types of language models and we showed how neural nets enable better training of LMs. We also saw how important it is for LMs to be able to keep state and how RNNs can be used for that, but training these is costly and they don’t work as well as we’d expect. Attention mechanisms and the Transformer architecture address precisely these points, as we’ll see next.

1.3 Attention and the Transformer Revolution

If you are interested in modern NLP, you will have heard the terms Attention and Transformers being thrown around recently. You might not understand exactly what they are, but you picked up a few hints and you have a feeling Transformers are a significant part of modern LLMs—and that they have something to do with Attention.

You’re right on both counts—we’ll now explain what Attention is, how it enables Transformers, and why they matter. These two topics will be covered in more detail in Part II.

1.3.1 Enter Attention

The problem of how to propagate past information to the present efficiently and accurately also occupied the minds of researchers and practitioners working on a different language task: Machine Translation.

The traditional way to handle machine translation and other sequence-to-sequence (Seq2Seq) learning tasks is to use a recurrent neural network architecture called the encoder-decoder. This architecture consists of encoding input sequences into a single, fixed-length vector and then decoding it back again to generate the output. See the upper part of Figure 1.8 for a visual representation.

We’ll cover Sequence-to-Sequence (Seq2Seq) learning in more detail in Part II, Chapter 6.

Soon after the introduction of these encoder-decoder networks, other researchers (Bahdanau et al. (2014)) proposed a subtle but impactful enhancement: instead of encoding the input sequences into a single fixed-length vector as an intermediary representation, they are encoded into multiple so-called annotation vectors instead. Then, at decoding time, an attention mechanism learns which annotations it should use—or attend to. This can be seen before the decoder in Figure 1.8, below.

More specifically, the attention mechanism inside the decoder contains another small feedforward neural network with learnable parameters. This is the so-called alignment model and its task is precisely to learn, over time, which of the vectors generated by the encoder best fit the output it is trying to generate. This is represented in the bottom part of Figure 1.8:

Figure 1.8: Differences between regular (top) and attention-enabled (bottom) encoder-decoder networks. The structure is similar but the bottom network uses multiple vectors for the intermediary representation and there is an extra component before the decoder—the attention mechanism.

A common way to think about Attention is by framing it as an information retrieval problem with query, key, and value vectors.

In a translation task, for example, each output word (in the target language) can be seen as a query and each input word (in the source language) is a combination of keys and values, which will be searched over to find the best input word. This will be explained in more detail in Chapter 7.
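To make the query/key/value view concrete, here is a minimal numpy sketch of one common form of this mechanism, the scaled dot-product attention later used by Transformers (Vaswani et al. (2017)):

```python
import numpy as np

# A minimal sketch of (scaled dot-product) attention with queries, keys and values.
# Shapes are illustrative: n_q queries attend over n_k key/value pairs.

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each output row is a weighted average of the value vectors,
    weighted by how well the corresponding query matches each key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity between queries and keys
    weights = softmax(scores)         # attention weights ("what to attend to")
    return weights @ V                # blend of the values

Q = np.random.randn(2, 4)   # 2 queries of dimension 4
K = np.random.randn(5, 4)   # 5 keys of dimension 4
V = np.random.randn(5, 8)   # 5 values of dimension 8
print(attention(Q, K, V).shape)  # (2, 8)
```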

It is worth noting that, while adding Attention cells to encoder-decoder networks does allow for more precise models, they still use recurrent connections, which make them computationally expensive to train and hard to parallelize. This is where Transformers come in, as we’ll see next.

1.3.2 Transformers

Vaswani et al. (2017) introduced an alternative version of the encoder-decoder architecture, along with several engineering tricks to make training such networks much faster. It was called the Transformer architecture, and it has been the architecture of choice for most large NLP models since then.

The seminal Transformer article was called “Attention Is All You Need”, for good reason—the proposed architecture ditched RNN layers altogether, replacing them with Attention layers (while keeping the encoder-decoder structure). This is shown in detail in Figure 1.9: The encoder and decoder components are there but recurrent connections are nowhere to be seen—only attention layers.

Figure 1.9: The original Transformer model for Seq2Seq learning. It is still an encoder-decoder architecture (akin to Figure 1.8), but there are no RNNs or any other recurrent connections—only attention layers. Adapted from Vaswani et al. (2017)

This mattered enormously. The main problem with earlier encoder-decoder networks was precisely the recurrent connections inherited from RNNs. As we saw earlier, these made it hard to parallelize training on GPUs, TPUs, and other purpose-built hardware and, therefore, severely limited the amount of data such models could be trained on.

Without RNNs or any recurrent connections, the original Transformer model was able to match or even surpass the then-current state of the art (SOTA) in machine translation, at a fraction of the cost (100 to 1,000 times more efficiently).

The key advancements introduced were (1) using self-attention instead of recurrent connections both in the encoder and the decoder, (2) encoding words with positional embeddings to keep track of word position, and (3) introducing multi-head attention as a way to add more expressivity while enabling more parallelization in the architecture.
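As an illustration of point (2), here is a minimal sketch of the sinusoidal positional encoding proposed in the original paper (one of several possible ways to encode position):

```python
import numpy as np

# A minimal sketch of sinusoidal positional encodings (Vaswani et al. 2017):
# each position gets a fixed vector that is added to the word embeddings,
# so the model can tell word order apart even though attention itself
# is order-agnostic.

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]       # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16): one vector per position
```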

Let’s see how and why Transformers are used for language modeling.

1.3.3 Transformer-based Language models

Now we know what Transformers are, but we saw that they were created for Seq2Seq learning, not for language modeling.

We can, however, repurpose encoder-decoder Transformers to build language models—we can just use the encoder or the decoder part of the network, in a self-supervised training setting, just like the original LMs we saw in previous sections.

Such language models are now called encoder Transformers or decoder Transformers, depending on which part of the original Transformer they use. The first truly large LM based on the Transformer architecture was the OpenAI GPT-1 model by Radford et al. (2018), a decoder Transformer. Figure 1.10 shows a timeline of released transformer-based LLMs, starting with GPT-1, soon after the seminal paper was published:

Figure 1.10: Timeline with the main research milestones and transformer-based LMs released as of this writing

Appendix A describes the implementation details of the most important Transformer-based LMs.

We are still missing one part of the puzzle: why exactly are LMs (especially large LMs) so important for NLP?

1.4 Why Language models? LMs as the building blocks of modern NLP

We saw in the previous sections that one can use Neural Nets to train Language Models—and that this works surprisingly well. We also saw how using Transformers enables us to train massively larger and more powerful LMs.

You are probably wondering why we talk so much about Language Models if their uses are relatively limited and seemingly uninteresting (predicting the next word in a sentence doesn’t seem all that sexy, right?).

The short answer is threefold: (1) LMs can be used to build representations for downstream NLP tasks, (2) LMs can be trained on huge amounts of data because no labeling is needed, and (3) we can frame any NLP task as a language modeling task, using in-context learning techniques such as zero-shot learning.

We’ll explain each of these 3 points in detail but first, let’s quickly see what we mean by NLP.


1.4.1 NLP is all around us

NLP stands for Natural Language Processing, an admittedly vague term. In this book, we will take it to mean any sort of Machine Learning (ML) task that involves natural language—text as written by humans. This includes all physical text ever written and, most importantly, all text on the Web.

Chapter 14 is focused on exploring different types of NLP tasks and how they benefit from LMs.

Table 1.1 shows a selected list of NLP tasks that have been addressed both by researchers in academia and practitioners in the industry:

Table 1.1: Selected examples of NLP tasks
| Task | Description/Example |
|---|---|
| Language Modeling | Capture the distribution of words in a language. Also, score a given word sequence to measure its likelihood or predict the next word in a sentence. |
| Machine Translation | Translate a piece of text between languages, keeping the semantics the same. Machine Translation is a type of Seq2Seq learning. |
| Natural Language Inference (NLI) | Establish the relationship between two pieces of text (e.g., do the texts imply one another? Do they contradict one another?). Also known as Textual Entailment. |
| Question Answering (Q&A) | Given a question and a document that contains the answer, retrieve the correct answer to the question (or conclude that it doesn’t exist). |
| Sentiment Analysis | Infer the sentiment expressed by text. Examples of sentiments include “positive”, “negative” and “neutral”. |
| Summarization | Given a large piece of text, extract the most relevant parts thereof (extractive summarization) or generate a shorter text with the most important message (abstractive summarization). Summarization is a type of Seq2Seq learning. |

All of these problems can be framed as normal machine learning tasks, be they supervised or unsupervised, classification or regression, pointwise predictions or sequence learning, binary or multiclass, discrete or real-valued. They can be modeled using any of the default ML algorithms at our disposal (neural nets, tree-based models, linear models, etc).

The one difference between text-based ML—that is, NLP—and other forms of ML tasks is that text data must be encoded before it can be fed to traditional ML algorithms. This is because ML algorithms cannot deal natively with text, only numerical data. This is crucial in NLP, as we’ll see next.

1.4.2 It’s all about representation

As explained above, text data must be represented or encoded as numbers before we can apply ML to it. Therefore all NLP tasks must begin by building representations for the text we want to operate on. The way we represent data in ML is usually via numeric vectors.

The traditional form of representing text is the so-called bag-of-words (BOW) scheme. As the name implies, this means representing text as an unordered collection (i.e. a bag) of words. The simplest way to represent one word is to use a one-hot encoded (OHE) vector. An OHE vector only has one of its elements “turned on” with a 1. All other elements are 0. See Figure 1.11 (top part) for an example.

You may have heard of TF-IDF as a common way to represent text data. We don’t include it in this section because we are only listing word representations. TF-IDF vectors are used to represent a document, not a single word. Refer to Chapter 4, where we’ll explain these concepts in detail.

Although simple, BOW encodings work reasonably well in practice for many NLP tasks—they are usually combined with some form of weighting such as TF-IDF (see the callout above).

Now for the problems. Firstly, OHE vectors are sparse (only one element is “on” and all others are “off”) and large (their length must be the size of the vocabulary). This means they are very memory/compute intensive to work with, and not many ML algorithms deal with such high-dimensional data very well. Secondly, OHE vectors encode no semantic information at all. The OHE vector for the word “cow” is just as distant (geometrically speaking) from the word “bull” as it is from the word “spacecraft”.
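A minimal sketch illustrating that second point: every pair of distinct one-hot vectors is exactly the same distance apart, regardless of meaning.

```python
import numpy as np

# A minimal sketch: one-hot encoded (OHE) vectors carry no semantic information.
# Any two distinct words are exactly the same distance apart.

vocabulary = ["cow", "bull", "spacecraft"]

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0   # only one element is "on"
    return vec

cow, bull, spacecraft = (one_hot(w, vocabulary) for w in vocabulary)
print(np.linalg.norm(cow - bull))        # 1.414...
print(np.linalg.norm(cow - spacecraft))  # 1.414... (identical distance)
```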

We mentioned learned representations in Section 1.2.2 when we said that one of the by-products of training a neural LM was the creation of fixed-length representation vectors for each word. These are called word embeddings.

Embeddings look very different from OHE vectors, as can be seen in Figure 1.11. They are smaller in length; they are denser (i.e. non-sparse) and they encode semantic information about the word. This opens up a whole new avenue for making NLP more accurate.

Figure 1.11: The word “man”, represented in two ways: as a one-hot encoded vector and as a word embedding. Word embeddings are shorter and denser than OHE vectors.

Another advantage of embeddings is that they get continually better (in the sense of encoding increasingly rich semantic information) as the LMs they were trained by get larger and more powerful. See Table 1.2 for a summarized comparison between OHE vectors and Word embeddings:

Table 1.2: Differences between One-hot encoded vectors and word embeddings
|  | OHE Vectors | Word Embeddings |
|---|---|---|
| Density | Sparse | Dense |
| Discrete/Continuous | Discrete | Continuous |
| Length | Long (as long as the vocabulary size) | Short (fixed-length) |
| Encoded Semantics | No semantic information encoded | Encode semantic information (similar words are closer together) |

Word2vec (Mikolov et al. (2013)) was one of the first LMs trained exclusively to produce embeddings. It showed that a relatively simple architecture (a shallow, linear neural net) that can be trained on much more data beats more complex models by far.

The embeddings produced by Word2vec were so good that one could even perform arithmetic on them and arrive at reasonable results. Figure 1.12 shows an example of this: the country-capital relationship can be represented as a vector addition. If you add the vector that represents the country-capital relationship to the vector that represents a country, you will arrive close enough to the vector that represents its capital city!

Figure 1.12: Word2vec embedding vectors for countries and capitals plotted on a 2D chart. They are so accurate that one can visually and geometrically identify country-capital relationships over several pairs. Source: Mikolov et al. (2013)
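As a sketch of what this looks like in code, the gensim library exposes exactly this kind of vector arithmetic (the model file name below is an assumption; any pre-trained word2vec-format embedding file would work):

```python
# A sketch of Word2vec vector arithmetic using gensim.
# The model file name is illustrative; any pre-trained word2vec-format
# embedding file can be used.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec-pretrained.bin", binary=True)

# "country + (capital - country) ~ capital": add the capital-of direction
# (Paris - France) to Germany and look for the nearest word.
result = vectors.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=1)
print(result)  # expected to land close to "Berlin"
```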

Word embeddings (learned through language models) were a massive boon to NLP tasks—they could be used as a drop-in replacement for OHE vectors. But that’s still not the end of the story. Let’s see how LMs can be used for NLP at large.

1.4.3 The de facto base layer for NLP tasks

As we saw earlier, the main reason why language models can be trained on such large datasets (in the order of trillions of tokens) is that they can be trained in an unsupervised fashion. There is no need for manual data annotation! Labeling data consistently and accurately is expensive and time-consuming—if we needed labeled data to train LMs, we would be nowhere near where we are now.

Language models harness massive amounts of data to learn increasingly good representations of words. This boosts the performance of any downstream NLP task that uses those representations.

But how exactly does one use a pre-trained LM to improve other NLP tasks? There are at least 3 ways to do that: (1) feature-based adaptation, (2) fine-tuning, and (3) in-context learning. We will explain each of them briefly but you can see a summary in Figure 1.13:

Part III (Chapters 9, 10, and 11) will explain in more detail how to use LMs in other NLP tasks, with worked examples and detailed illustrations.

Figure 1.13: Three ways to use pre-trained LMs for downstream NLP tasks: (1) feature-based adaptation (just using embeddings as features), (2) fine-tuning a pre-trained LM with task-specific layers, and (3) framing NLP tasks in natural language via in-context learning

Each of these three strategies has advantages and disadvantages—let’s examine them in more detail:

Feature-based adaptation

Feature-based adaptation is the simplest way to adapt existing NLP task pipelines to benefit from pre-trained language models.

It means taking embeddings from any pre-trained LM and “plugging” them in as features in whatever NLP task you are working on—a drop-in replacement for OHE vectors. This works with any type of classifier, including those that are not neural nets.
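A minimal sketch of feature-based adaptation is shown below. It assumes we already have pre-trained word embeddings to look up; the random vectors, the labels, and the sentiment task are all illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A minimal sketch of feature-based adaptation: pre-trained word embeddings
# are used as input features for a downstream classifier (here, sentiment).

embedding_dim = 50
words = "what a great movie terrible boring film truly".split()
rng = np.random.default_rng(0)
# Stand-in for a real pre-trained embedding lookup (word -> vector).
embeddings = {w: rng.standard_normal(embedding_dim) for w in words}

def featurize(sentence):
    """Represent a sentence as the average of its word embeddings."""
    vectors = [embeddings.get(w, np.zeros(embedding_dim)) for w in sentence.split()]
    return np.mean(vectors, axis=0)

train_texts = ["what a great movie", "terrible boring film"]   # toy labeled data
train_labels = [1, 0]                                          # 1 = positive, 0 = negative

X = np.stack([featurize(t) for t in train_texts])
clf = LogisticRegression().fit(X, train_labels)                # any classifier works here
print(clf.predict([featurize("a truly great film")]))
```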

Fine-tuning

The term fine-tuning is reminiscent of the transfer learning literature, especially as related to computer vision.3

3 The term transfer learning is sometimes used interchangeably with fine-tuning in NLP.

To fine-tune a pre-trained language model for a specific NLP task, you replace the last layers in the LM neural net with task-specific layers. That way you end up with a neural net that solves your task but is augmented by all the preceding LM layers.

An advantage of fine-tuning is that you need only a few labeled examples to achieve good performance in several NLP tasks. This helps reduce costs, as labeled data is expensive to obtain.

When fine-tuning an LM, you can either fully freeze all LM layers and only perform backpropagation in the last task-specific layers or you can let all parameters in the network be freely updated by backprop. This can only be done if the task-specific algorithm is also a neural net, however.
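A minimal sketch of the frozen-backbone variant, using a pre-trained encoder from the Hugging Face transformers library (the model name, the pooling choice, and the number of classes are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A minimal sketch of fine-tuning with a frozen pre-trained LM:
# the LM layers are kept fixed and only a small task-specific head is trained.

model_name = "bert-base-uncased"   # illustrative choice of pre-trained LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModel.from_pretrained(model_name)

for param in backbone.parameters():   # freeze all pre-trained LM layers
    param.requires_grad = False

num_classes = 2
head = torch.nn.Linear(backbone.config.hidden_size, num_classes)  # task-specific layer

inputs = tokenizer("A truly great film", return_tensors="pt")
hidden = backbone(**inputs).last_hidden_state[:, 0]   # first token's representation
logits = head(hidden)                                  # only `head` gets gradient updates

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```

Letting all parameters be updated instead is just a matter of skipping the freezing loop and passing the whole model’s parameters to the optimizer.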

In-context Learning

The last way we can leverage pre-trained LMs for downstream NLP tasks is by using the so-called in-context learning strategy. It’s the most versatile use of LMs we have seen so far.

In-context learning is also commonly referred to as prompting; zero-shot and few-shot learning (explained below) are specific forms of it.

Remember from Section 1.1.1 that one of the two key uses of language models is to predict the next word in a sentence. This can be repeated over and over—nothing stops you from having an LM sequentially generate 1 million words, one after the other. The generated text will by definition be valid (that is what LMs are trained to do).

Now, what happens if you can fully describe an NLP task in free-form text and then feed that “context” to an LM and ask it to start generating word tokens, with no extra supervised fine-tuning? This is called in-context learning.4

4 Not to be confused with a word’s context — i.e. the words surrounding a given word.

A prompt is another name for the context passed as input to an LLM.

Few-shot, One-shot, Zero-shot

In-context learning may be subdivided into few-shot, one-shot, and zero-shot: few-shot and one-shot refer to cases where you provide a few examples or one example, respectively, of the task you want an LM to complete. In zero-shot in-context learning, no examples are provided in the context. A more detailed explanation will be given in Chapter 11.

The key characteristic of in-context learning is that it requires no extra training whatsoever: not only is the pre-training unsupervised, but the inference step requires no training either—no parameter updates are performed at inference time.

A simple way to see zero-shot in-context learning at work is to take any text, append the string “TL;DR”5 to it, and feed that into an LLM as the prompt. This is what is shown in Figure 1.14: since the model is trained on a large dataset, there were many cases where it saw the string “TL;DR”, followed by a summary of the previous block of text. When given some text followed by “TL;DR”, it will provide a summary of whatever text was given!

5 “TL;DR” is internet-speak for “Too long; Didn’t read.”

Figure 1.14: An example of zero-shot in-context learning is inputting any text followed by the string “TL;DR”, and then asking an LLM to predict the next words in the sequence. Surprisingly, LLMs can understand the request and generate an adequate summary of the text, with no fine-tuning whatsoever.
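Here is a sketch of this trick using the transformers library. The model name below (gpt2) is only used because it is small and freely available; a much larger LLM would be needed for a genuinely good summary:

```python
from transformers import pipeline

# A sketch of zero-shot in-context learning via the "TL;DR" trick.
# gpt2 is small and freely available; larger models give much better summaries.

generator = pipeline("text-generation", model="gpt2")

article = (
    "Language models assign likelihood scores to word sequences and can "
    "predict the next word given the previous ones. Trained on large corpora, "
    "they now serve as the base layer for most modern NLP systems."
)

prompt = article + "\n\nTL;DR:"
output = generator(prompt, max_new_tokens=40, do_sample=False)
print(output[0]["generated_text"][len(prompt):])  # whatever the model appends after "TL;DR:"
```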

Being able to have LMs solve NLP tasks from free-form descriptions is surprising, and it was clear we were entering uncharted waters. However, this approach is still far from perfect: it’s not trivial to make an LM understand what text you want it to produce without more specific optimization. This is where instruction-tuning and alignment come in.

1.5 Instruction-tuning and Model Alignment: ChatGPT and Beyond

In Section 1.4 we learned the why of language models: they are very good for building representations, they can be trained in an unsupervised fashion on large amounts of data and they can significantly improve any downstream NLP task.

Now we’ll explain how we go from vanilla6 LMs to models that can answer questions and follow instructions given in natural language, such as ChatGPT. We’ll also see what it means for a model to be aligned to human preferences.

6 Pre-trained, large language models that were not instruction-tuned are called vanilla LMs or base models.

Figure 1.15 below shows a selected timeline of academia and industry milestones related to instruction-tuning and alignment. Note the significant contribution from the Reinforcement Learning (RL) community.

Figure 1.15: Selected timeline of academia and industry milestones related to instruction-tuning and alignment

In the next sub-sections, we will explain what instruction-tuning and alignment are and how they differ, cover the main approaches for tuning, and then briefly explain how this connects with ChatGPT: the first major LLM-based product (and perhaps the reason you are reading this book).

1.5.1 Teaching models to follow instructions

The previous section shows that LMs can be used not only to generate free-form text but also to solve some NLP tasks—provided they’re framed correctly. The “TL;DR” example (Figure 1.14) is striking as it shows how a purely autoregressive7 pre-trained LLM can implicitly follow instructions—such as to summarize text—with no specific fine-tuning.

7 Autoregressive models use only their previous data points as features to make a prediction. In this case, “previous data” means that only the previous words are used to predict the next word.

But what if we did fine-tune an LLM to solve any NLP task? We could train it with descriptions of the tasks (as input) and their respective solutions (as output).

This was done in the T5 model (Raffel et al. (2019))—a single model fine-tuned on multiple types of NLP tasks, all described in natural language.8 In Figure 1.16 we can see how NLP tasks are framed as input-output pairs using natural language in T5 and other similar models.

8 The Natural Language Decathlon (McCann et al. (2018)) was another precursor to a unified approach for NLP. It, however, framed NLP tasks as question-answer pairs instead.


Figure 1.16: Framing NLP tasks themselves as natural language instructions took LLMs to yet another level, with models such as T5, T0, and FLAN. Adapted from Raffel et al. (2019)

After T5 and similar models, the next obvious step was to have LLMs follow generic instructions—not just those that referred to NLP tasks. This is called instruction-tuning.

The key difference between an instruction-tuned and a pre-trained, vanilla LM is that an instruction-tuned model understands that it should interpret the input text as an instruction—not just predict the next word in an autoregressive fashion. To clarify, both pre-trained and instruction-tuned LMs produce valid text, but tuned LMs produce valid text that is closer to what a human expects.

Figure 1.17 below shows an example. Given the input “Who was the U.S. president in 1985?”, all of the outputs shown are valid text, in the sense that they are grammatically correct and seem to follow logically from the input (as we would expect from a model trained on an enormous amount of text). A vanilla LM might continue with something like “…is a question people ask when they are studying US history” or “…and also tell me who the vice president was.” But only the third option correctly interprets the input as an instruction—and provides a useful output: the answer to the question.

Figure 1.17: Most LMs generate syntactically valid text when given that input, but an instruction-tuned model provides text that is not only syntactically and semantically valid but also treats the input as an instruction to be followed.

Note that the example above essentially uses an LLM as a search engine. This is a legitimate use of such models, but it is limited to the facts present in the training set the model was trained on, so it is useless for things that took place after training.

One prominent example of an instruction-tuned model is ChatGPT, which we look at next.

1.5.2 ChatGPT

ChatGPT is an example of an instruction-tuned LLM that has been further fine-tuned for its use case: that of a virtual assistant, or chatbot.

OpenAI has not released official details about how ChatGPT was trained, but it has described InstructGPT (an instruction-tuned version of GPT-3) as a sibling model. In short, ChatGPT was built by fine-tuning a pre-trained GPT-3.5 model on conversational data, using supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF); both techniques are covered in Section 1.5.4 and in Section 1.5.6 below.

Don’t confuse ChatGPT, the product, with the backend models that power it (GPT-3.5, GPT-4, and so on).

1.5.3 Instruction-tuning vs Alignment

Although these two terms are sometimes used interchangeably in the literature, alignment is a more general idea than instruction-tuning. We shall use the term alignment not only to refer to fine-tuning models to follow natural language instructions but also to finer aspects of such models, namely those related to sometimes implicit human preferences, intentions, and perhaps values.


Alignment is a fast-moving area of research, so expect this section to be updated as a clearer view of the field emerges.

This is what we mean by aligning models to human preferences — having models generate text that is not only syntactically and semantically valid, but that also responds in accordance with a set of implicit and explicit assumptions and intentions.

Implicit here refers, for example, to the expectation that the generated text will be truthful and not misleading. These expectations are often summarized as the three H’s of alignment (helpful, honest, and harmless), which are discussed further in Section 1.5.5.

In Figure 1.17 in Section 1.5.1, we already saw examples of what it means for a model to be instruction-tuned. Let’s now see an example of how an instruction-tuned model may give different outputs depending on how it was aligned.

Consider the prompt “Please tell me how to build a bomb”. In a model fully optimized for helpfulness, you would probably get a full description of how to do it, perhaps with materials found in a home, but it might be wrong in some places. In a model fully optimized for honesty, you might get an entirely correct recipe, but one that builds fireworks instead of what one would think of as a bomb. In a model optimized for harmlessness, you would get a response saying that building a bomb is against the law in most places and shouldn’t be done.

As you may have expected, model alignment touches on many philosophical or political questions, which we won’t go into in this book. Let’s instead see the main ways in which one can fine-tune an LLM to follow instructions, optionally according to some set of values.

1.5.4 Approaches to Instruction-tuning

There are three broad families of approaches to instruction-tuning, summarized in the table below and discussed in the following subsections.

| Family | Main variants |
|---|---|
| Supervised fine-tuning (SFT) | Fine-tuning on instruction datasets built manually, sampled from the model itself (self-instruct), or sampled from another teacher model |
| RL-based fine-tuning | RLHF (reinforcement learning from human feedback) and RLAIF (from AI feedback) |
| Hybrid approaches | DPO (Direct Preference Optimization), Constitutional AI |

Supervised fine-tuning

As explained before, supervised fine-tuning (SFT) is just another type of fine-tuning, using a dataset of instruction/response pairs like the one shown in Figure 1.18.

Figure 1.18: A sample dataset that could be used for instruction-tuning a pre-trained LLM
Such datasets can be built in several ways:

  • with manually written examples
  • with examples sampled from the model itself (the self-instruct approach)
  • with examples sampled from another, stronger teacher model (closely related to knowledge distillation)

RL-based fine-tuning

RL-based fine-tuning uses reinforcement learning to avoid the need for the enormous annotated input/output datasets that pure SFT would require. A few points are worth noting:

  • It is usually applied as a second step after SFT (as in RLHF). The SFT model serves as a starting point that makes the RL optimization easier; without SFT, RL alone doesn’t work too well (this is claimed, for example, in the Zephyr paper).

  • The model is optimized with a composite cost function such that it aligns with the reward but does not deviate too much from the original distribution. This is to prevent the model from overfitting to text that satisfies the reward but no longer makes linguistic sense.

  • RL is notoriously hard to get right—and expensive.

Figure 1.21 in Section 1.5.6 shows the full three-step pipeline (à la InstructGPT) in a single diagram.

RLHF and RLAIF: with the emergence of very capable LLMs, the human feedback used in RLHF can itself be replaced (or complemented) by feedback generated by another model. This variant is known as Reinforcement Learning from AI Feedback (RLAIF).

Hybrid approaches

DPO (Direct Preference Optimization) is a mix between supervised and reinforcement learning: it optimizes the model directly on preference data with a simple classification-style loss, removing the need for a separate reward model and an explicit RL loop. Constitutional AI is another related approach, combining supervised fine-tuning on model-generated critiques with RL from AI feedback (RLAIF).


1.5.5 Aligning preferences with Reinforcement Learning: ChatGPT and Beyond

We now make the final link between language models and reinforcement learning and arrive at models that are aligned to follow a user’s instructions given as input.

We saw that LMs can generate syntactically valid text very well. But a text being valid (according to syntax/grammar rules) does not mean it’s useful or helpful. There is yet another layer of complexity for us to traverse: aligning generated text with the user’s intent.

Researchers usually refer to the 3 Hs of model alignment: an aligned model should be helpful (helps the user with whatever task they have), honest (should not generate false or misleading information), and harmless (should on no occasion cause physical harm to anyone). The ins and outs of alignment will be discussed in depth in Chapter 13.

A language model is said to be aligned if the generated output matches the user’s intent. On the other hand, it’s said to be misaligned when the output produced (although valid from a syntactical point of view) is not what the user intended.

This will become clearer with the example in Figure 1.19: It’s clear from the input that the user wants to know who the U.S. president was in 1985. A properly aligned model should not generate just any text from that. It should generate the text the user wants: the answer to the question.

Figure 1.19: While both language models generate syntactically valid text when given the input, a well-aligned model (such as ChatGPT) will provide text that is not only valid but also matches the user’s intention.

Measuring how well a model’s output is aligned to users’ intent is deeply subjective. It’s hard to teach LMs to generate aligned output without annotated examples. That means that we’ll need labels, which means supervised learning.

Now that we understand what it means for an LM to be aligned, we need to see how to do it. We did say it will involve supervised learning, but Reinforcement Learning (RL) will also play a key role. Let’s see how.

1.5.6 Reinforcement learning applied to LMs

Reinforcement Learning (RL) is the Machine Learning paradigm used in cases where you want to learn a decision-making strategy, instead of a simple mapping from inputs to outputs. It is usually seen in fields such as robotics and autonomous vehicles—i.e. fields where models need to learn continuously from the environment. However, it’s now also being used to help align LLMs, as we’ll see next.

RL 101

Instead of learning from a training set of features and target variables, RL models (also called agents) learn iteratively by interacting with an environment.

The ultimate objective of RL training is to learn a sequential algorithm (called a policy in RL jargon) that will be able to act on the environment and observe its state before and after the action. Some states are more desirable than others, as defined by a value called the reward. The training consists of discovering a strategy to maximize the total future reward. Figure 1.20 gives a visual overview of the reinforcement learning loop, with the elements we explained above, namely: agents, environment, state, policy, and rewards:

Figure 1.20: Reinforcement learning loop: In each training iteration, the agent applies an action, modifying the environment. Then the environment is observed and returns its state and the reward associated with that state, as defined by a reward function.
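A minimal sketch of this loop in code (the environment, the policy, and the update function below are hypothetical placeholders for a real environment and a real, learnable policy):

```python
# A minimal sketch of the reinforcement learning loop from Figure 1.20.
# `env`, `policy` and `update` are hypothetical placeholders.

def run_episode(env, policy, update, max_steps=100):
    state = env.reset()                         # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent picks an action
        state, reward, done = env.step(action)  # the environment reacts
        total_reward += reward                  # rewards define which states are desirable
        update(state, action, reward)           # adjust the policy to maximize future reward
        if done:
            break
    return total_reward
```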

These are the basic building blocks of RL. We’ll provide a more complete introduction to reinforcement learning, with examples, in Chapter 12. Let’s now tie this together with language modeling to see how RL can be used to optimize LMs.

How to apply RL to align LMs?

A simple way to teach LMs to produce aligned output from given input would be to just build a massively large supervised train set with pairs of the form (input, aligned output) and fit a large model on it, using good old supervised learning.

The approach above is, however, not feasible in practice. Manually creating a dataset of the size we need (hundreds of billions of rows) would require so many human annotators as to be prohibitively expensive.

So what can we do? We can use reinforcement learning to approximate that, at a much lower cost.

The process is detailed in Figure 1.21 and it consists of 3 steps:

  • Step 1: SFT We annotate a small supervised dataset (with input/output text pairs) and fine-tune a pre-trained LM on those.

  • Step 2: Reward modeling We take the fine-tuned LM and sample several input/output pairs. Each sampled pair is given an integer rank by a human labeler, saying how aligned the output is to the intention expressed in the input. The ranking data is used to train a reward model. The reward model takes an input/output pair and produces a single scalar: a reward value.

  • Step 3: RL Fine-tuning We apply reinforcement learning to align the fine-tuned LM to generate appropriate responses, as defined by the reward model.
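As a rough sketch of Step 2 above: reward models are commonly trained with a pairwise ranking objective over the human rankings (this is the formulation used by Ouyang et al. (2022)). In PyTorch it might look like the snippet below, where `reward_model` is a hypothetical network producing one scalar per input/output pair:

```python
import torch.nn.functional as F

# A rough sketch of the reward-model training objective (Step 2):
# for each pair of outputs ranked by a human, push the reward of the
# preferred ("chosen") output above the reward of the other ("rejected") one.
# `reward_model` is a hypothetical network returning one scalar per example.

def reward_ranking_loss(reward_model, chosen_batch, rejected_batch):
    r_chosen = reward_model(chosen_batch)      # rewards for preferred outputs
    r_rejected = reward_model(rejected_batch)  # rewards for dispreferred outputs
    # The loss is low when r_chosen > r_rejected by a comfortable margin.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```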

RL is used because we can’t afford to align an LLM with simple supervised learning, as it would just be too expensive to generate a manually labeled dataset. So we do the next best thing—approximate that using RL.

This specific variant of RL shown in Figure 1.21 is called Reinforcement Learning from Human Feedback (RLHF). Here we are using a learned model (what we called the reward model) to provide the reward scores for the RL optimization loop; in other types of RL systems, the reward is observed directly from the environment via a reward function. Once again we are trading off raw performance for scalability. Refer to Chapter 13 for a worked example of how to align a model using RLHF.

Figure 1.21: RLHF is one way to teach Language Models how to produce output that’s aligned with human intent. Inspired by Ouyang et al. (2022)

This is also the pipeline used to train aligned LLMs such as ChatGPT, as we’ll see next.

1.5.7 InstructGPT / ChatGPT

ChatGPT was the first aligned LLM in widespread use, and it gave the general public a glimpse of what these models are capable of. It reached 100 million active users in less than 2 months, making it one of the fastest-growing consumer products in history.

Of course, ChatGPT is a product in addition to a language model, and as such there are many other interesting implementation details we won’t cover in this book.

OpenAI has not (as of this writing) made ChatGPT details public, but it has said that the training pipeline closely resembles that of InstructGPT by Ouyang et al. (2022), whose details we do know. InstructGPT and ChatGPT have been described by OpenAI as “sibling” models.

From a technical standpoint, ChatGPT is a GPT-3.5 LLM, optimized for chat-like behavior using RLHF, as explained in Figure 1.21.

The main difference between InstructGPT and ChatGPT (see the callout above) is that the latter supports stateful, multi-turn conversations, while the former generates one output for one input at a time. In other words, ChatGPT keeps track of previous interactions (like a chat between two people) and it will take those into account in addition to the current prompt.

The choice to align ChatGPT for chat-like behavior seems to indicate that the key players in this field may be aiming at building an AI-based virtual assistant. This is an important observation, as it gives some hints as to what may be coming next.

1.5.8 What’s next?

Where do we go from here? Are aligned LLMs such as ChatGPT the final frontier on the path to Artificial General Intelligence (AGI)? No one really knows.

Even though results are striking and very useful (as proven by the commercial success of ChatGPT) there are still many open questions in the field: how to optimize costs, how to make alignment better and safer, how to address potentially existential risks, how to fuse multiple modalities (video, audio in addition to text), and so on.

In the rest of the book, we will go into detail over the technical concepts we touched on in this chapter—and many we didn’t. We’ll talk about future research directions and also several other ways to apply LMs to real-world and business problems, all in a beginner-friendly way.

We promise to refrain from using math and complex equations unless absolutely needed. 😀

1.6 Summary

  • Language Models (LMs) model the distribution of words in a language (such as English) either by counting co-occurrence statistics or by using neural nets, in a self-supervised training regimen

  • Neural Nets can be used to train LMs with very good performance, especially if they can keep state about previous words seen in the context.

  • Large LMs are now the de facto base layer for many downstream NLP tasks. They can provide embeddings that replace one-hot-encoded vectors, serve as base models to be fine-tuned for specific tasks, and be used directly via so-called in-context learning, where NLP tasks are framed in natural language, sometimes with examples.

  • Transformers are a neural net architecture that allows for keeping state, without the need for recurrent connections—only attention layers. This allows for faster training and using much larger training sets, which in turn enables higher-capacity models.

  • Reinforcement learning (particularly RLHF) can be used to align LMs so that they output text that matches the user’s intent. This was the case with ChatGPT, which was aligned to perform well in a chat-like context.

1.7 References

Bahdanau, D., K. Cho, and Y. Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” https://arxiv.org/abs/1409.0473.
Bengio, Y., J. Ducharme, P. Vincent, and C. Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. JMLR.org. http://dl.acm.org/citation.cfm?id=944919.944966.
Hochreiter, S., and J. Schmidhuber. 1997. “Long Short-Term Memory.” Neural Comput. 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
McCann, B., N. S. Keskar, C. Xiong, and R. Socher. 2018. “The Natural Language Decathlon: Multitask Learning as Question Answering.” CoRR abs/1806.08730. http://arxiv.org/abs/1806.08730.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, edited by C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Vol. 26. Curran Associates, Inc. http://bit.ly/mikolov-2013-nips.
Ouyang, L., J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” https://arxiv.org/abs/2203.02155.
Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever. 2018. “Improving Language Understanding by Generative Pre-Training.”
Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2019. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” CoRR abs/1910.10683. http://arxiv.org/abs/1910.10683.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://bit.ly/vaswani-2017-attention.