1  Language Models: What and Why

This chapter covers

  • What language models (LMs) are and how they can be used
  • The main types of language models: statistical and neural
  • How Attention and the Transformer architecture changed language modeling
  • Why language models are the base layer of modern NLP
  • What instruction-tuning and alignment mean and how they lead to models such as ChatGPT

Anyone who hasn’t been living under a rock for the last few years (the late 2010s to early 2020s) has been bombarded with examples of seemingly magical content produced by new Artificial Intelligence (AI) models: full novels written by AI, poems generated to mimic a specific author’s style, chatbots that act indistinguishably from a human being. The list goes on.

The names of such models may sound familiar to those with a technical bent: GPT, GPT-2, BERT, GPT-3, ChatGPT, and so on.

These are called Large Language Models (LLMs) and they have only grown more powerful and more expressive over time, being trained on ever larger amounts of data and applying techniques discovered by researchers in academia and in companies such as Google, Meta, and Microsoft.

In this chapter, we will give a brief introduction to language models (large and otherwise) and related technologies, providing a foundation for the rest of the book. It’s called “What and Why” because we show what LMs are but also why they are now the base layer for modern Natural Language Processing (NLP).

In Section 1.1 we explain at a basic level what language models (LMs) are and how one can use them. Section 1.2 introduces the main types of LMs, namely statistical and neural language models. In Section 1.3 we show how Attention mechanisms and the Transformer architecture help LMs better keep state and use a word’s context. In Section 1.4 we explain why LMs play a pivotal role in modern NLP and, finally, in Section 1.5 we show what alignment means in the context of LMs and how it is used to create models such as ChatGPT.

1.1 What are Language Models?

As the name suggests, a Language Model (LM) models how a given language works.

Modeling a language means assigning scores to arbitrary sequences of words, such that the higher the score for a sequence of words, the more likely it is a meaningful sentence in a language such as English. This is shown in Figure 1.1:

Large language models

The title of this book specifically mentions Large Language Models (LLMs). The term is not precisely defined but our working definition is: Large Language Models are LMs that (a) employ large, deep neural nets (billions of parameters) and (b) have been trained on massive amounts of text data (hundreds of billions of tokens).

Figure 1.1: At its most basic, a Language Model is a function that takes in a sequence of words as input and outputs a score. High scores mean the input is likely in a language such as English. Low scores mean it’s probably gibberish. An in-depth overview of language modeling is given in Chapter 2.

The LMs we will focus on in this book learn from data (as opposed to rules manually crafted by linguistics experts), mostly using Machine Learning (ML) tooling.

A perfect picture of a language would require us to have access to every document that was ever written in it. This is clearly impossible, so we settle for using as large a dataset as we can. An unlabelled natural language dataset is called a corpus.

The corpus is thus the set of documents that represent the language we are trying to model, such as English. The corpus is the data LMs are trained on. After they are trained they can then be used, just like any other ML model. Here, “using” a trained LM means feeding it word sequences as input and obtaining a likelihood score as output, as is shown in Figure 1.1.

Corpus and Corpora

A corpus is a set of unlabelled documents used to train a Language Model. Corpora is the plural form of corpus.

The distinction between training-time and inference-time is crucial to understanding how LMs work.1

1 In this book, training-time refers to the phase when parameters are learned; inference-time refers to using the trained model to make predictions. When we mean duration, we say training duration or inference latency.

Training-time refers to the stage where the model processes the corpus and learns the characteristics of the target language. This step is usually time-consuming (hours or days). On the other hand, inference-time is the stage in which the language model is used, for example to calculate a sentence’s likelihood score; inference is usually fast (milliseconds or seconds) and takes place after training. This will be explained in more detail in Chapter 2, where we dive deep into language modeling.

In the sections below we will take a closer look at how LMs can be used in a real-world setting and at the main types used in practice.

1.1.1 LM use cases

As we saw in the previous section, LMs need to be trained on a corpus of documents. After they have been trained, they hold some idea of what the language in question looks like, and only then can we use them for practical tasks.

The most basic use for a language model is to output likelihood scores for word sequences, as we stated previously. But there is an additional use for them: predicting the next word in a sentence.

The two ways to use—or perform inference with—a trained language model are therefore: (a) calculate the likelihood score for an arbitrary input sequence and (b) predict the most likely next word in a sentence, based on the previous words. These are two simple and seemingly uninteresting applications, but they will unlock surprising outcomes as we’ll see in the next sections. Figure 1.2 shows examples for both uses, side-by-side:

Figure 1.2: The two basic uses for Language Models: (a) calculate the likelihood score for a given word sequence and (b) predict the next word in the sequence. Note that the left side of this figure is just a rehash of Figure 1.1

It may not be immediately obvious, but these use cases are two sides of the same coin: if you have an LM trained to calculate likelihood scores, it is easy to use it to predict (or guess) the next word in a sentence. You can brute-force your way to the solution by scoring every possible word in the vocabulary and picking the option with the highest score (see Figure 1.2, right side).
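To make this concrete, here is a minimal Python sketch. The `score` function and the tiny vocabulary are hypothetical stand-ins for a trained LM and its real vocabulary; a real model would compute the score instead of looking it up.

```python
# A hypothetical scoring function standing in for a trained LM:
# it returns a likelihood score for a sequence of words.
def score(words):
    # In a real LM this score would come from the trained model.
    known_good = {("the", "students", "opened", "their", "books"): 0.83}
    return known_good.get(tuple(words), 0.01)

def predict_next_word(context, vocabulary):
    # Brute force: score every candidate continuation, keep the best one.
    return max(vocabulary, key=lambda word: score(context + [word]))

vocab = ["books", "banana", "the", "ran"]
print(predict_next_word(["the", "students", "opened", "their"], vocab))  # -> "books"
```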

Let’s now have a brief look at the main types of language models, what they have in common, and how they differ.

1.2 Types of Language Models

In practice, the two most common ways to implement language models are either via statistics (counting word frequencies) or with the aid of neural networks. For both, the general structure stays the same, just as we explained in Section 1.1: the language model is trained on some corpus and, once trained, it can be used as a function to measure how likely a given word sequence is or to predict the next word in a sequence.

Figure 1.3 shows the main types of language models, namely statistical and neural language models, with subclassifications. These are discussed in sections 1.2.1 and 1.2.2. Also, Chapter 2 and ?sec-ch-neural-language-models-and-self-supervision will provide a thorough analysis of these models.

Figure 1.3: Types of Language models

Let’s now explore the characteristics of statistical and neural language models.

1.2.1 Statistical Language Models

One simple way to model a language is to use probability distributions. We can posit that a language is characterized by a probability distribution over the word sequences in that language.

In statistical language models, the probability of a word sequence is defined as the joint probability of the words in the sequence.

It’s possible to decompose the joint probability into a product of conditional probabilities, using the chain rule of probability. This enables one to make the necessary calculations and arrive at a likelihood score for a given sequence.
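Concretely, for a sequence of \(n\) words \(w_1, w_2, \ldots, w_n\), the chain rule gives

\[
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
\]

Each conditional probability can then, in principle, be estimated from counts over the corpus.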

Section 2.3 includes detailed explanation and examples of how to calculate a likelihood score with statistical LMs.

This approach, however, quickly becomes impractical with real-world-size data, as it is computationally expensive to calculate the terms for large text corpora—the number of possible combinations grows exponentially with the size of the context and the vocabulary.

In addition to being inefficient, statistical LMs aren’t able to generalize calculations if we need to score a word sequence that is not present in the corpus—they would assign a score of zero to every sequence not seen in the train set.

Therefore, these models aren’t often used in practice—but they provide a foundation for \(N\)-gram models, as we’ll see next.

N-gram language models

\(N\)-gram language models are an optimization on top of statistical language models. But what are \(N\)-grams?

\(N\)-grams are an abstraction of words. \(N\)-grams where \(N=1\) are called unigrams and are just another name for a word. If \(N=2\), they are called bigrams and they represent ordered pairs of words. If \(N=3\), they’re called trigrams and—you guessed it—they represent ordered triples of words. Figure 1.4 shows an example of what a sentence looks like when it’s split into unigrams, bigrams, and trigrams:

Figure 1.4: \(N\)-grams: Representing a sentence with \(N\)-grams: In this example, the sentence “A man walked by the grocery store” can be represented with unigrams, bigrams, trigrams, etc. Note that a unigram is just another name for a word.

But how do \(N\)-grams make statistical LMs better?

\(N\)-grams enable us to prune the number of context words when calculating conditional probabilities. Instead of considering all previous words in the context, we approximate the probability by using only the last \(N-1\) words. This reduces the space and the number of computations needed as compared with fully statistical LMs and helps address the curse of dimensionality related to rare combinations of words.
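As a minimal illustration, here is a sketch of how a bigram (\(N=2\)) model estimates conditional probabilities from counts; the toy corpus below is made up for the example.

```python
from collections import Counter

# Toy corpus, already tokenized into words (made up for this example).
corpus = "a man walked by the grocery store . a man bought the milk .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, previous_word):
    # P(word | previous_word) ~= count(previous_word, word) / count(previous_word)
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

print(bigram_prob("man", "a"))        # 1.0 -- "a" is always followed by "man" here
print(bigram_prob("grocery", "the"))  # 0.5 -- "the" is followed by "grocery" half the time
```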

\(N\)-gram models are no panacea, however; they still suffer from the inability to generalize calculations to unseen sequences; and deliberately ignoring contexts beyond \(N-1\) words hinders the capacity of the model to consider longer dependencies. \(N\)-gram models will be discussed in more detail in Chapter 2.

1.2.2 Neural Language Models

As interest in neural nets picked up again in the early 2000s, researchers (starting with Bengio et al. (2003)) began to experiment with applying neural nets to the task of building a language model, using the well-known and trusted backpropagation algorithm. They found that not only was it possible, but it worked better than any other language model seen so far—and it solved a key problem faced by statistical language models: the inability to generalize to unseen word sequences.

The training strategy relies on self-supervised learning: training a neural network where the features are the words in the context and the target is the next word. One can simply build a training set like that and then train it in a supervised way like any other neural net.2

2 See ?sec-self-supervision-and-language-modeling for more details on self-supervision applied to language modeling.
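As a minimal sketch, this is what building such a self-supervised training set might look like; the context size and the toy sentence are arbitrary choices for the example.

```python
def build_training_pairs(tokens, context_size=3):
    # Slide a window over the corpus: the features are the previous
    # `context_size` words, the target is the word that follows them.
    pairs = []
    for i in range(context_size, len(tokens)):
        pairs.append((tokens[i - context_size:i], tokens[i]))
    return pairs

tokens = "a dog is running on the field".split()
for context, target in build_training_pairs(tokens):
    print(context, "->", target)
# ['a', 'dog', 'is'] -> running
# ['dog', 'is', 'running'] -> on
# ...
```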

Besides producing good language models (in the sense that they are good at predicting how likely a piece of text is), training a neural LM leaves behind an interesting by-product: learned representations for words—word embeddings.

?sec-ch-representing-words-and-documents-as-vectors will focus specifically on text representations, and embeddings will be discussed there in depth.

Even though the first model introduced by Bengio et al. (2003) was a relatively simple feedforward, shallow neural net, it proved that this strategy worked, and it set the path forward for many other developments.

With time, neural LMs evolved by using ever more complex neural nets, trained on increasingly larger datasets. Deep neural nets, convolutional neural nets, recurrent connections, the encoder-decoder architecture, attention, and, finally, transformers are just some examples of the technologies used in these models. Figure 1.5 shows a selected timeline with some of the key technological breakthroughs and milestones related to neural LMs:

Figure 1.5: Selected timeline with key milestones related to neural language models, from both academia and industry.

Most modern language models are neural LMs. This is unsurprising because (1) neural nets can handle a lot of complexity and (2) neural LMs can be trained on massive amounts of data, with no need for labeling. The only constraints are the available computing power and one’s budget.

Research on neural nets (from both academia and industry) has advanced a great deal in the last few decades, so it was a match made in heaven: as the amount of text on the Web grew, new and more efficient ways to train neural nets appeared, with better algorithms on the software side and purpose-built accelerators on the hardware side.

Let’s now look at the role a word’s context plays in neural LMs and at how keeping track of that context helps us train better models.

The need for memory in Neural LMs

The basic building block of text is a word, but a word on its own doesn’t tell us much. We need to know its context—the other words around it—to fully understand what a word means. This is seen in polysemous3 words: the word “cap” in English can mean a head cover, a hard limit for something (a spending cap), or even a verb. Without context, it’s impossible to know what the word means.

3 Polysemous words are those that have multiple meanings.

Just as a human is better able to understand a word when its context is available, so is an LM. In the case of language modeling, this means having some kind of memory or state in the model, so that it can consider past words as it predicts the next ones.

A word’s context

In LM parlance, the context of a word \(W\) refers to the accompanying words around \(W\). For example, if we focus on the word “running” in the sentence “A dog is running on the field”, the context is made up of “A dog is …” on the left side and “… on the field” on the right side.

While the neural language models we have seen so far do take some context into account during training, there are key limitations: they use small contexts (5-10 words only), and the context size needs to be fixed a priori for the whole model4. Recurrent neural nets can work around this limitation, as we’ll see next.

4 Feedforward neural nets cannot natively deal with variable-length input.

Recurrent Neural Networks

The standard way to incorporate state in neural nets (to address the memory problem explained above) has for some time been Recurrent Neural Networks (RNNs). RNNs use the output from the previous time step as additional features to produce an output for the current time step. This enables RNNs to take past data into account. The basic differences between regular (i.e. feedforward) and recurrent neural nets are shown in Figure 1.6:

Figure 1.6: Feedforward neural nets only use features from the current time step to calculate the output, whereas recurrent neural nets use features from the current time step but also the output from the previous time step. The dotted lines represent the flow of information and the circles represent nonlinear operations on vectors, such as the sigmoid function.
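A single recurrent step can be sketched in a few lines of NumPy; the weight shapes and the tanh nonlinearity below are illustrative choices, not tied to any particular library implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new hidden state depends on the current input x_t *and*
    # the previous hidden state h_prev -- this is the recurrence.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

hidden, features = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden, features))
W_h = rng.normal(size=(hidden, hidden))
b = np.zeros(hidden)

h = np.zeros(hidden)                         # initial state
for x_t in rng.normal(size=(5, features)):   # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)        # the state carries information forward
```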

The simplest way to train RNNs is to use an algorithm called Backpropagation Through Time (BPTT). It’s similar to the normal backpropagation algorithm but for each iteration, the network is first unrolled so that the recurrent connections can be treated as if they were normal connections.

There are three issues with BPTT for RNNs, however. First, it is computationally expensive, especially as one increases the number of time steps to look back over (in the case of NLP, this means the size of the context). Second, it is not easy to parallelize training for RNNs, as many operations must be executed sequentially. Finally, running backpropagation over such long distances causes gradients to explode or vanish, which precludes the training of networks using larger contexts.

Better memory: LSTMs

It turns out one can be a little more clever when propagating past information in RNNs, to enable storing longer contexts while avoiding the problems of vanishing/exploding gradients.

One can better control how past information is passed along with so-called memory cells. One commonly used type of memory cell is the LSTM (Long Short-term Memory).

LSTMs were introduced by Hochreiter and Schmidhuber (1997). They work by propagating an internal cell state and applying gates (nonlinear operations on the current input and the previous time step’s output) that control what should be input, output, or forgotten5. In vanilla RNN cells, no such gates are applied and no state is propagated explicitly. These three gates are the three solid blocks labeled “F”, “I”, and “O” shown on the right side of Figure 1.7.

5 The usual implementation of an LSTM includes a “forget-gate” as introduced by Gers et al. (1999).

Figure 1.7: RNN cells (left) take the output from the previous time step and also the current input and apply a nonlinear operation (circles) to produce the current output. LSTM cells (right) also propagate an internal state, to which several nonlinear operations are applied—forget gates, input gates, and output gates, represented by the letters F, I, and O, respectively. The dotted lines represent the flow of information and the circles represent nonlinear operation on vectors, such as the sigmoid function.

Crucially, LSTMs don’t address the computational costs of training recurrent neural nets, because the recurrent connections are still present. They do help avoid the problem of vanishing/exploding gradients and they also help in storing longer-range dependencies than would be possible in a vanilla RNN, but the scaling issues when training remain. We will cover RNNs and LSTM cells in more detail in Chapter 4.

In Section 1.2 we saw the main types of language models and showed how neural nets enable better training of LMs. We also saw how important it is for LMs to keep state and how RNNs can be used for that, but training them is costly and they don’t scale as well as we’d like. Attention mechanisms and the Transformer architecture address these points, as we’ll see next.

1.3 Attention and the Transformer revolution

If you are interested in modern NLP, you will have heard the terms Attention and Transformers being thrown around recently. You might not understand exactly what they are, but you picked up a few hints and you have a feeling Transformers are a significant part of modern LLMs—and that they have something to do with Attention.

You’d be right on both counts. Let’s see what Attention is, how it enables Transformers, and why they matter. These two topics will be covered in more detail in Part II.

1.3.1 Enter Attention

The problem of how to propagate past information to the present efficiently and accurately also occupied the minds of researchers and practitioners working on a different language task: Machine Translation.

The traditional way to handle machine translation and other sequence-to-sequence (Seq2Seq) problems is using a recurrent neural network architecture called the encoder-decoder. This architecture consists of encoding input sequences into a single, fixed-length vector and then decoding it back again to generate the output. See the upper part of Figure 1.8 for a visual representation.

We’ll cover Sequence-to-sequence (Seq2seq) learning in more detail in ?sec-ch-sequence-learning-and-the-encoder-decoder-architecture.

Soon after the introduction of these encoder-decoder networks, other researchers (Bahdanau et al. (2014)) proposed a subtle but impactful enhancement: instead of encoding the input sequences into a single fixed-length vector as an intermediary representation, they are encoded into multiple so-called annotation vectors instead. Then, at decoding time, an attention mechanism learns which annotations it should use—or attend to. This can be seen before the decoder in Figure 1.8, below.

More specifically, the attention mechanism inside the decoder contains another small feedforward neural network with learnable parameters. This is the so-called alignment model and its task is precisely to learn, over time, which of the vectors generated by the encoder best fit the output it is trying to generate. This is represented in Figure 1.8 on the bottom part:

Figure 1.8: Differences between regular (above) and attention-enabled (below) encoder-decoder networks. The structure is similar but the bottom network uses multiple vectors for the intermediary representation and there is an extra component before the decoder—the attention mechanism.

Attention as Information Retrieval

A common way to think about Attention is by framing it as an information retrieval problem with query, key, and value vectors.

In a translation task, for example, each output word (in the target language) can be seen as a query and each input word (in the source language) is a combination of keys and values, which will be searched over to find the best input word. This will be explained in more detail in ?sec-ch-attention-and-transformers.
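As a rough numerical sketch of this query/key/value framing (a simplified version of the scaled dot-product attention used in Transformers; the vectors here are random placeholders standing in for learned representations):

```python
import numpy as np

def attention(queries, keys, values):
    # Similarity between each query and each key, scaled by sqrt(dimension).
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax turns the scores into attention weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the values.
    return weights @ values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries of dimension 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
print(attention(Q, K, V).shape)  # (2, 8): one mixed value vector per query
```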

While adding Attention cells to encoder-decoder networks does allow for more precise models, they still use recurrent connections, which make them computationally expensive to train and hard to parallelize. This is where Transformers come in, as we’ll see next.

1.3.2 Transformers

Vaswani et al. (2017) introduced an alternative version of the encoder-decoder architecture, along with several engineering tricks to make training such networks much faster. It was called the Transformer, and it has been the architecture of choice for most large NLP models since then.

The seminal Transformer article was called “Attention is all you need”, for good reason: The proposed architecture ditched RNN layers altogether, replacing them with Attention layers (while keeping the encoder-decoder structure). This is shown in detail in Figure 1.9: The encoder and decoder components are there but recurrent connections are nowhere to be seen—only attention layers.

Figure 1.9: The original Transformer model for Seq2Seq learning. It is still an encoder-decoder architecture (akin to Figure 1.8), but there are no RNNs or any other recurrent connections—only attention layers. Adapted from Vaswani et al. (2017)

The removal of recurrent connections was crucial. The main problem with RNNs and LSTMs was precisely the presence of these connections. As we saw earlier, these precluded parallel training using GPUs, TPUs, and other purpose-built hardware and, therefore, severely limited the amount of data models could be trained on.

Without RNNs or any recurrent connections, the original Transformer model was able to match or even surpass the then-current state of the art in machine translation, at a fraction of the cost (100 to 1000 times more efficiently).

The key advancements introduced by Transformers were (1) using self-attention instead of recurrent connections in both the encoder and the decoder, (2) encoding word positions with positional embeddings, and (3) introducing multi-head attention as a way to add more expressivity while enabling more parallelization in the architecture.

Let’s see how and why Transformers are used for language modeling.

1.3.3 Transformer-based Language models

Now we know what Transformers are, but we saw that they were created for Seq2Seq learning, not for language modeling.

We can, however, repurpose the Transformer to build language models: we can use just one part of the network (typically the decoder), in a self-supervised training setting, just like the original neural LMs we saw in previous sections.

Such language models are now called encoder Transformers or decoder Transformers, depending on which part of the original Transformer they use. The first truly large LM based on the Transformer architecture was the OpenAI GPT-1 model by Radford et al. (2018), a decoder Transformer. Figure 1.10 shows a timeline of released transformer-based LLMs, starting with GPT-1, soon after the seminal paper was published:

Figure 1.10: Selected timeline with the main transformer-based LLMs released as of this writing

?sec-ch-selected-models-explained contains a description of the most important Transformer-based LMs.

We are still missing one part of the puzzle: why exactly are language models (especially large LMs) so important for NLP?

1.4 Why Language models? LMs as the building blocks of modern NLP

We saw in the previous sections that one can use Neural Nets to train Language Models—and that this works surprisingly well. We also saw how using Transformers enables us to train massively larger and more powerful LMs.

You are probably wondering why we talk so much about Language Models if their uses are relatively limited and seemingly uninteresting (predicting the next word in a sentence doesn’t seem all that sexy, right?).

The short answer is threefold: (1) LMs can be used to build representations for downstream NLP tasks, (2) LMs can be trained on huge amounts of data because no labels are needed, and (3) we can plug language models into virtually any NLP task.

We’ll explain each of these 3 points in detail in sections 1.4.2, 1.4.3 and 1.4.4 but first, let’s quickly see what we mean by NLP.

1.4.1 NLP is all around us

NLP stands for Natural Language Processing, an admittedly vague term. In this book, we will take it to mean any sort of Machine Learning (ML) task that involves natural language — text as written by humans. This includes all physical text ever written and, most importantly, all text on the Web.

Table 1.1 shows a selected list of NLP tasks that have been addressed both by researchers in academia and practitioners in the industry:

Table 1.1: Selected examples of NLP tasks

  • Language Modeling: Capture the distribution of words in a language. Also, score a given word sequence to measure its likelihood or predict the next word in a sentence.
  • Machine Translation: Translate a piece of text between languages, keeping the semantics the same.
  • Natural Language Inference (NLI): Establish the relationship between two pieces of text (e.g., do the texts imply one another? Do they contradict one another?). Also known as Textual Entailment.
  • Question Answering (Q&A): Given a question and a document, retrieve the correct answer to the question (or conclude that it doesn’t exist).
  • Sentiment Analysis: Infer the sentiment expressed by a text. Examples of sentiments include “positive”, “negative” and “neutral”.
  • Summarization: Given a large piece of text, extract the most relevant parts thereof (extractive summarization) or generate a shorter text with the most important message (abstractive summarization).

All of these problems can be framed as normal machine learning tasks, be they supervised or unsupervised, classification or regression, pointwise predictions or sequence learning, binary or multiclass, discrete or real-valued. They can be modeled using any of the default ML algorithms at our disposal (neural nets, tree-based models, linear models, etc).

The one difference between text-based ML—that is, NLP—and other forms of ML is that text data must be encoded in some way before it can be fed to traditional ML algorithms. This is because ML algorithms cannot deal natively with text, only with numerical data. How to represent text data is therefore crucial in NLP, as we’ll see next.

1.4.2 It’s all about representation

As explained above, text data must be encoded as numbers before we can apply ML to it. Therefore all NLP tasks must begin by building representations for the text we want to operate on. The way we represent data in ML is usually via numeric vectors.

The traditional form of representing text is the so-called bag-of-words (BOW) schema. As the name implies, this means representing text as an unordered set (i.e. a bag) of words. The simplest way to represent one word is to use a one-hot encoded (OHE) vector. An OHE vector only has one of its elements “turned on” with a 1. All other elements are 0. See Figure 1.11 (top part) for an example.
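For example, a one-hot encoder over a toy vocabulary might look like the sketch below; the vocabulary is made up for the example, and real vocabularies have tens or hundreds of thousands of entries.

```python
import numpy as np

vocabulary = ["man", "walked", "by", "the", "grocery", "store"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector as long as the vocabulary, with a single 1 at the word's index.
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[index[word]] = 1
    return vector

print(one_hot("man"))  # [1 0 0 0 0 0]
```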

TF-IDF

You may have heard of TF-IDF as a common way to represent text data. We don’t include it in this section because TF-IDF vectors are used to represent a document, not a single word. Again, refer back to ?sec-ch-representing-words-and-documents-as-vectors where we’ll explain these concepts in detail.

Although simple, BOW encodings work reasonably well in practice for many NLP tasks; they are usually combined with some form of weighting such as TF-IDF (see the callout above).

Now for the problems. Firstly, OHE vectors are sparse (only one element is “on” and all others are “off”) and large (their length must be the size of the vocabulary). This means they are memory- and compute-intensive to work with. Secondly, OHE vectors encode no semantic information at all. The OHE vector for the word “cow” is just as geometrically “distant” from the word “bull” as it is from the word “spacecraft”.

We mentioned learned representations in Section 1.2.2 when we said that one of the by-products of training a neural LM was the creation of word embeddings: fixed-length representation vectors for each word.

Embeddings look very different from OHE vectors, as can be seen in Figure 1.11. They are shorter, they are denser, and they encode semantic information about the word. This opens up a whole new avenue for making NLP more accurate.

Figure 1.11: The word “man”, represented in two ways: as a one-hot encoded vector (top) and as a word embedding (bottom). Word embeddings are shorter and denser than OHE vectors.

Another advantage of embeddings is that they get continually more accurate as the LMs they are trained on get larger and more powerful. See Table 1.2 for a summarized comparison between OHE vectors and word embeddings:

Table 1.2: Differences between One-hot encoded vectors and word embeddings

  • Density: OHE vectors are sparse; word embeddings are dense.
  • Discrete vs continuous: OHE vectors are discrete; word embeddings are continuous.
  • Length: OHE vectors are long (as long as the vocabulary size); word embeddings are short (fixed-length).
  • Encoded semantics: OHE vectors encode no semantic information; word embeddings encode semantic information (similar words are closer together).

Word2vec (Mikolov et al. (2013)) was one of the first LMs trained exclusively to produce embeddings. It showed that a relatively simple architecture (a shallow, linear neural net) trained on more data beats more complex models by far.

The embeddings produced by Word2vec were so good that one could even perform arithmetic on them and arrive at reasonable results. Figure 1.12 shows an example of this: the country-capital relationship can be represented as a vector addition. If you add the vector that represents the country-capital relationship to the vector that represents a country, you will arrive close enough to the vector that represents its capital city!

Figure 1.12: Word2vec embedding vectors for countries and capitals plotted on a 2D chart. They are so accurate that one can visually and geometrically identify country-capital relationships over several pairs. Source: Mikolov et al. (2013)
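A rough sketch of this arithmetic, using tiny made-up 3-dimensional vectors rather than real Word2vec embeddings (real embeddings have hundreds of dimensions and are learned from data; these values were chosen so the relationship works out exactly):

```python
import numpy as np

# Made-up 3-dimensional "embeddings" -- placeholders for learned vectors.
emb = {
    "portugal": np.array([0.9, 0.1, 0.2]),
    "lisbon":   np.array([0.8, 0.7, 0.2]),
    "france":   np.array([0.1, 0.1, 0.9]),
    "paris":    np.array([0.0, 0.7, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The country-capital "direction" taken from one pair...
capital_direction = emb["lisbon"] - emb["portugal"]
# ...added to another country should land near that country's capital.
guess = emb["france"] + capital_direction
print(cosine(guess, emb["paris"]))  # 1.0 in this toy setup
```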

Word embeddings were an immediate boon to NLP tasks: they could be used as a drop-in replacement for OHE vectors.

1.4.3 Unsupervised training for the win

Training language models does not require labeled data; you just need natural language datasets to either calculate word co-occurrence statistics or train a neural network in a self-supervised manner, as we explained in Section 1.2.

Unlabeled data is much more widely available and cheaper to obtain, as labeling is usually done by humans. This greatly increases the amount of data LMs can be trained on, and thus the amount of knowledge they can hold.

This makes LMs a great base layer that could, in theory, encode all knowledge that exists in text form, including all text in books but, most importantly, all text that’s available in digital form.

With the explosion of the amount of text data on the web, one could create corpora in the order of trillions of tokens. This, together with advancements in model architectures and purpose-built hardware, has enabled the creation of very large and capable models. Access to computing resources becomes the only real bottleneck to ever larger models.

As model capacity keeps growing, the question becomes: “How do we put these models to use?”. This is what we will see now, as we draw the final connection between LMs and NLP at large and show why LMs boost virtually every NLP task.

1.4.4 The de facto base layer for NLP tasks

The main reason why language models can be trained on such large datasets (in the order of trillions of tokens) is that they can be trained in an unsupervised fashion. There is no need for manual data annotation! Labeling data consistently and accurately is expensive and time-consuming—if we needed labeled data to train LMs on we’d be nowhere near the place we are at now.

Language models harness massive amounts of data to learn increasingly good representations of words. This boosts the performance of any downstream NLP task that uses those representations.

?sec-ch-post-training-and-applications-of-llms will explain in more detail how to use LMs in other NLP tasks, with worked examples and detailed illustrations.

But how exactly does one use a pre-trained LM in other NLP tasks?

There are at least 3 ways to do that: (1) feature-based adaptation, (2) fine-tuning, and (3) in-context learning. Each of these has advantages and disadvantages—let’s examine them in more detail:

Feature-based adaptation

Feature-based adaptation is the simplest way to adapt traditional NLP systems to benefit from pre-trained language models.

It means taking embeddings from any pre-trained LM and using them as features in any NLP task, instead of OHE vectors, as a drop-in replacement. This strategy supports any type of classifier, including those that are not neural nets.
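A minimal sketch of feature-based adaptation for sentiment classification, assuming you already have a dictionary of pre-trained embeddings (replaced here by random vectors for illustration) and using scikit-learn’s logistic regression as the downstream classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for embeddings taken from a pre-trained LM (random here).
embeddings = {w: rng.normal(size=50)
              for w in "this movie was great terrible boring loved it".split()}

def featurize(sentence):
    # Represent a sentence as the average of its word embeddings.
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vectors, axis=0)

texts = ["loved this movie", "this movie was great",
         "terrible boring movie", "this was terrible"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = np.stack([featurize(t) for t in texts])
classifier = LogisticRegression().fit(X, labels)
print(classifier.predict([featurize("great movie")]))
```

With real pre-trained embeddings instead of random vectors, the same plumbing gives the classifier useful semantic features with no change to the training code.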

Fine-tuning

The term fine-tuning is reminiscent of transfer learning literature, especially as related to computer vision.6

6 The term transfer learning is sometimes used interchangeably with fine-tuning in NLP.

One way to fine-tune a pre-trained language model to a specific NLP task is to replace the last layers in the LM neural net with task-specific layers. That way you will have a neural net optimized for a specific task but it’ll still benefit from the full power of the pretrained LM, which knows the full distribution of the target language.

An advantage of fine-tuning is that you need only a few labeled examples to achieve good performance in several NLP tasks. This helps reduce costs, as labeled data is expensive to obtain.

When fine-tuning an LM, you can either freeze all LM layers and only perform backpropagation on the last, task-specific layers, or you can let all parameters in the network be freely updated. The latter requires that the task-specific layers also form a neural net, so that gradients can flow end to end.
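A hedged PyTorch-style sketch of the frozen-base variant; `pretrained_lm` is a hypothetical stand-in for whatever pre-trained network you start from, assumed to return feature vectors of size `hidden_size`:

```python
import torch.nn as nn

class ClassifierOnTopOfLM(nn.Module):
    def __init__(self, pretrained_lm, hidden_size, num_classes):
        super().__init__()
        self.lm = pretrained_lm
        # Freeze the pre-trained layers: no gradients flow into them.
        for parameter in self.lm.parameters():
            parameter.requires_grad = False
        # Task-specific head -- the only part that will be trained.
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs):
        features = self.lm(inputs)   # representations from the frozen LM
        return self.head(features)   # task-specific predictions
```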

In-context Learning

The last way we can leverage pre-trained LMs for downstream NLP tasks is through in-context learning. It’s the most versatile use of LMs we have seen so far.

Remember from Section 1.1.1 that one of the two key uses of language models is to predict the next word in a sentence. This can be repeated over and over: nothing stops you from having an LM sequentially generate 1 million words, one after the other. The generated text will by definition be valid (that is what LMs are trained to do).

Now, what happens if you fully describe an NLP task in free-form text, feed it as input to an LM, and ask the model to predict the next words, one at a time? This is in-context learning.

Few-shot, One-shot, Zero-shot

In-context learning may be subdivided into few-shot, one-shot, and zero-shot: Few-shot and one-shot refer to cases where you provide a few examples or one example, respectively, of the task you want an LM to complete. In zero-shot in-context learning, no examples are provided in the context.

The key characteristic of in-context learning is that it requires no extra training whatsoever. Not only is the pre-training unsupervised but also the inference step—no model updates are performed at inference-time.

To see zero-shot in-context learning at work, take any text, append the string “TL;DR”7 to it, and feed that into an LLM as the prompt. This is shown in Figure 1.13: since LLMs are trained on large datasets, they have seen many cases where the string “TL;DR” is followed by a summary of the preceding block of text. When given some text followed by “TL;DR” and asked to simply predict the next words, an LLM with no fine-tuning will provide a summary of whatever text was given!

7 “TL;DR” is internet-speak for “Too long; Didn’t read.”

Figure 1.13: An example of zero-shot in-context learning is inputting any text followed by the string “TL;DR”, and then simply asking an LLM to predict the next words in the sequence. Surprisingly, LLMs can understand the request and generate an adequate summary of the text, with no fine-tuning whatsoever.
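A minimal sketch using the Hugging Face transformers library (assuming it is installed; a small model such as gpt2 will not summarize nearly as well as a large LLM, but the mechanics are the same):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = ("The city council met on Tuesday to discuss the new bike lanes. "
           "After a long debate, members voted to extend the pilot program "
           "for another year and to add lanes on two more avenues.")
prompt = article + "\n\nTL;DR:"

# The model simply predicts the next words after "TL;DR:"; with a large
# enough LM, those next words tend to be a summary of the article.
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"][len(prompt):])
```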

Being able to have LLMs solve an NLP task from a free-form description was surprising, and it was clear we were entering uncharted waters. However, that was still not perfect, and it’s not trivial to make an LLM understand what text you want it to produce without more specific optimization. This is where instruction-tuning and alignment come in.

1.5 Instruction-tuning and Model Alignment: ChatGPT and Beyond

In Section 1.4 we learned the why of language models:

  • They are useful for building word representations such as embeddings;
  • They can be trained in an unsupervised fashion on large amounts of data;
  • They can significantly improve any downstream NLP task;

There is, however, still one piece missing: how do we go from a model that is good at generating the next words from a prompt to a model that is able to answer arbitrary questions? In Figure 1.14 we see this difference at play: while all 3 model responses are syntactically and semantically valid, only one of them correctly interprets the prompt as an instruction and provides the expected response.

Figure 1.14: Most LMs generate valid text when given an instruction as input, but only an instruction-tuned model provides text that is not only syntactically and semantically valid but also treats the input as an instruction to be followed.

In this section we explain how we can instruction-tune a vanilla8 LM to a model such as ChatGPT, which can answer questions and follow instructions given in natural language. We’ll also see what it means for a model to be aligned to human preferences.

8 LMs that were pre-trained on unlabeled data, but not fine-tuned or instruction-tuned are called vanilla or base models.

In the next sub-sections, we will explain what instruction-tuning and alignment mean and where they differ, cover the main approaches for tuning, and then briefly explain how this connects with ChatGPT: the first major LLM-based product (and perhaps the reason you are reading this book).

1.5.1 Teaching models to follow instructions

We saw that LMs can be used not only to generate text but also to solve a simple NLP task like text summarization: the “TL;DR” example (Figure 1.13) is striking because it shows how a purely autoregressive9 pre-trained LLM can be made to accomplish tasks if we provide a carefully thought-out context and ask it to fill in the next words.

9 Autoregressive models use only their previous data points as features to make a prediction. In this case, “previous data” means that only the previous words are used to predict the next word.

10 The Natural Language Decathlon (McCann et al. (2018)) was another precursor to a unified approach for NLP. It, however, framed NLP tasks as question-answer pairs instead.

The next obvious step was to try and make LMs solve arbitrary NLP tasks. The first attempts to do this involved taking a pre-trained LLM and fine-tuning it in a supervised manner on pairs of inputs and outputs. T5 (Raffel et al. (2019)) was fine-tuned on multiple types of NLP tasks, described in natural language.10

Figure 1.15 shows how NLP tasks are framed as input-output pairs using natural language in T5 and similar models. This approach is now commonly called Supervised Fine-tuning (SFT) and it’s discussed in detail in Section 1.5.2.1 below.

Figure 1.15: Framing NLP tasks themselves as natural language instructions took LLMs to yet another level, with models such as T5, T0, and FLAN. MT-EN: Machine Translation to English; NLI: Natural Language Inference; Q&A: Question Answering; SA: Sentiment Analysis. Adapted from Raffel et al. (2019)

Once it was clear that LLMs fine-tuned with SFT showed promising results on many different NLP tasks, researchers turned to making models respond to generic instructions—not just those related to NLP. This is called instruction-tuning.

Instruction-tuning vs Alignment

Although these two terms are sometimes used interchangeably in the literature, alignment is a more general idea than instruction-tuning. We use the term alignment not only to refer to fine-tuning models to follow natural language instructions but also to finer aspects of such models, namely those related to implicit human preferences, intentions, and values.

The 3 H’s of Alignment11, namely Helpful, Honest, and Harmless, are often cited as the set of qualities a language model must have to be considered aligned with human preferences and values.

To see why model alignment is harder to define and measure than simple instruction-following, take the example where somebody asks an LLM how to obtain a bomb. There are several ways to follow this instruction:

INPUT: “Tell me how to obtain a bomb.”

OUTPUT, optimizing for helpfulness: “Sure. Here is how to build a simple bomb using ingredients you can find at home or in a neighborhood supermarket…”

OUTPUT, optimizing for harmlessness: “I cannot help you with that. Furthermore, I must remind you that this can be considered a crime in most jurisdictions.”

OUTPUT, optimizing for honesty, with the maximum level of detail: “Sure. Let us start by looking at the history of nuclear power, then go through a full particle physics course and then analyse the options we have, from stealing a bomb from another country to starting a full-fledged nuclear program yourself.”

There is no objectively correct answer here. Most people would agree that each of these answers is inadequate for different reasons. One could try and build a well-balanced model that takes all 3 H’s into account, but still one would need to decide how to weigh each dimension, so it’s ultimately a subjective call.

Model alignment touches on philosophical and political questions; it requires model builders to decide which set of values the model should favor over others, and to reflect these decisions in the preference data used to fine-tune models.

11 First suggested by Askell et al. (2021)

SFT was the first type of instruction-tuning. Although it’s widely used, it’s not the only way to fine-tune a model to follow instructions, as we’ll see next.

1.5.2 Approaches to Instruction-tuning

The two base strategies to instruction-tune a model are supervised fine-tuning (SFT) and reinforcement learning (RL), but so-called direct alignment strategies have recently appeared as well. Table 1.3 shows the most common approaches used in instruction-tuning:

Table 1.3: Approaches to Instruction-tuning

  • Supervised Fine-tuning (SFT): Build a dataset of input/output pairs and perform traditional supervised learning as a sequence-to-sequence problem. Variants: Manually Annotated, Self-instruct, Distillation.
  • Reinforcement Learning (RL): Perform reinforcement learning using some kind of reward model for feedback. Often done after SFT. Variants: RLHF, RLAIF, RLVR.
  • Direct Alignment: Create loss functions that can be optimized directly via supervised learning. Often done after SFT. Variants: DPO, IPO, KTO.

Let’s now look at each in more detail:

Supervised Fine-tuning (SFT)

SFT is the simplest way to instruction-tune a pretrained LLM; it consists of building a training dataset such as the one in Figure 1.16 and applying backpropagation on the pretrained LLM so that it learns to generate the output.

Figure 1.16: A sample dataset that could be used for SFT. Note that the input-output samples do not all refer to NLP tasks, but to arbitrary instructions. We see examples of translation, generic Q&A, mathematical reasoning, paraphrasing and summarization.

We can divide SFT into 3 categories, depending on how the training data is obtained:

  • Manually Annotated: The simplest form of SFT is to build a human-labeled dataset of inputs (prompts) and outputs (expected responses), as shown in Figure 1.16, and then use that to update the last layers of the pretrained LLM in a traditional sequence-to-sequence regimen.

  • Self-instruct: Self-instruct (Wang et al. (2023)) is a technique to generate input/output pairs from the pretrained model itself, using few-shot learning. Once the data is generated like this, one proceeds to fine-tuning as usual.

  • Distillation: In the context of SFT, distillation refers to using another, already fine-tuned model (the so-called teacher model) to generate training data for a new, often simpler and cheaper, student model.
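Whichever way the data is obtained, the resulting SFT dataset is typically just a collection of prompt/response records. A minimal sketch (the examples below are made up):

```python
sft_dataset = [
    {"prompt": "Translate to French: The cat is on the table.",
     "response": "Le chat est sur la table."},
    {"prompt": "What is the capital of Portugal?",
     "response": "The capital of Portugal is Lisbon."},
    {"prompt": "Summarize: The meeting covered the budget, hiring, and the roadmap.",
     "response": "The meeting covered budget, hiring, and roadmap decisions."},
]
# During SFT, the model is trained to generate each `response`
# given its `prompt`, using a standard next-token prediction loss.
```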

SFT is effective at enhancing a model’s instruction-following capabilities, but it needs a large number of high-quality supervised labels, which are costly to obtain. The main use of SFT is (as of this writing) to initialize a pretrained LLM, which is then further fine-tuned with reinforcement learning.

Reinforcement Learning (RL)

Reinforcement Learning (RL) is a learning paradigm used to train models when the objective is not clearly defined in terms of inputs and outputs. RL consists of trying out actions, getting feedback (a reward) from the environment, and looping over this process until the model converges.

These methods (along with a brief introduction to RL and underlying concepts) will be explored in more detail in ?sec-ch-fine-tuning-pretrained-models-to-follow-instructions.

When RL is applied to instruction-tuning pretrained LLMs, the environment is replaced by a model, a so-called reward model. The reward in this case is a numerical score representing how “good” the output for a given input is. Most discussion on RL applied to instruction-tuning revolves around different ways to calculate the reward in the RL loop.

Why is RL needed if we have SFT for instruction-tuning?

The preference data used to train RL methods tends to be a more efficient signal than the simple input/output pairs used in SFT: instead of prescribing a single correct output, it indicates which of two outputs is better, a boundary that generalizes across many possible responses. This means that RL gives you more “bang for the buck” than SFT.

Also, in most RL-based instruction-tuning pipelines the RL loop is only applied after an initial round of SFT. This has been found empirically to work better than applying RL alone in most cases.

Current RL-based instruction-tuning methods can be classified according to how the reward is estimated or calculated during the RL loop.

  • RLHF (Reinforcement Learning from Human Feedback): RLHF uses preference data annotated by humans to train a reward model. A sample preference dataset is shown in Figure 1.17. With the reward model acting as the environment, one can then fine-tune the original LLM so that it learns how to generate preferred outputs to the detriment of dispreferred ones (as indicated by the preference dataset). This is done using traditional RL optimization algorithms like PPO.

    Figure 1.17: Example of a pairwise preference dataset used in RLHF. It has one input and two possible outputs for that input. The column “preferred” says which of the two responses is preferred over the other.
  • RLAIF (Reinforcement Learning with AI Feedback): Instead of training a reward model from human-provided preference data as in RLHF, RLAIF uses another, previously instruction-tuned LLM to label which output is preferred over the other. This is used to build a preference dataset which is, in turn, used to train a reward model, similarly to RLHF. RLAIF was introduced by Bai et al. (2022).

  • RLVR (Reinforcement Learning with Verifiable Rewards): In RLVR, a deterministic, or verifiable reward is calculated, instead of estimated. This is possible for instructions that can be formally verified, such as: generating a piece of code (can be fed to a compiler), solving a mathematical problem (can be checked with a solver), or formatting text according to predefined rules (can be checked via simple scripts). RLVR (albeit not with that name yet) was introduced by the DeepSeek-R1 article (DeepSeek-AI et al. (2025)).

Direct alignment

Though Reinforcement Learning has been used with great success in multiple production models, it is still cumbersome and sometimes hard to get right. It is notoriously unstable, needs careful tuning, and sometimes requires proprietary “hacks” and empirical adjustments learned through trial and error.

Starting with Direct Preference Optimization (DPO) by Rafailov et al. (2024), researchers found ways to represent RLHF as regular loss functions that could be optimized with simple gradient descent, using the same type of data (i.e. preference data). After DPO, some other similar approaches have also surfaced:

  • DPO (Direct Preference Optimization): DPO is a closed-form representation of the objective that is optimized in RLHF, namely: make an LLM be as good as possible at generating the preferred responses to the detriment of dispreferred responses, while not straying too far from the language distribution learned in the pretrained LLM.

    Representing the RLHF objective as a loss function that can be minimized using traditional algorithms like gradient descent is a major win, as it greatly simplifies the instruction-tuning pipeline while removing the need for RL (a small numerical sketch of the DPO loss appears after this list).

  • IPO (Identity Preference Optimization): In Azar et al. (2023), researchers found a more generalized way, called \(\Psi\)-PO, to represent any loss function for learning from pairwise preference data (such as the one used in DPO). They identify a problem which could lead to overfitting in DPO and propose one particular instantiation of \(\Psi\)-PO, called Identity Preference Optimization that purportedly fixes the problem while keeping all benefits from DPO.

  • KTO (Kahneman-Tversky Optimization): In KTO (Ethayarajh et al. (2024)) researchers found that formulations such as DPO implicitly contain some choices and assumptions about human preferences. With insights from Prospect Theory12, they modify the DPO formulation to incorporate well-known facts about human biases and, crucially, make it possible to instruction-tune a model using individual labels (such as a 👍 or 👎), rather than pairwise preference-data as was the case for all previous methods.

12 Prospect Theory (Kahneman and Tversky (1979)) is the study of human biases and decision-making under uncertainty
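As promised above, here is a rough numerical sketch of the DPO loss, assuming you already have summed token log-probabilities of the preferred and dispreferred responses under both the policy being tuned and the frozen reference model; the numbers in the example call are arbitrary.

```python
import numpy as np

def dpo_loss(policy_logp_preferred, policy_logp_dispreferred,
             ref_logp_preferred, ref_logp_dispreferred, beta=0.1):
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    preferred_ratio = policy_logp_preferred - ref_logp_preferred
    dispreferred_ratio = policy_logp_dispreferred - ref_logp_dispreferred
    # DPO pushes the preferred response's ratio above the dispreferred one's;
    # beta controls how far the policy may drift from the reference model.
    margin = beta * (preferred_ratio - dispreferred_ratio)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

print(dpo_loss(-12.0, -15.0, -13.0, -14.0))  # lower loss when the margin is positive
```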

Although instruction-tuning practice is still evolving rapidly, it has already made possible fully fledged, generic virtual assistants such as ChatGPT, as we will see next.

1.5.3 LLMs as Virtual Assistants: ChatGPT

ChatGPT, as it first appeared in late 2022, was a web application connected to a GPT-3.5 model, fine-tuned with RLHF (as described in Section 1.5.2.2) to follow generic text instructions as a virtual assistant. The ChatGPT interface can be seen in Figure 1.18.

Figure 1.18: Screenshot of the ChatGPT Virtual Assistant from around mid-2024. Note the “Message ChatGPT” input at the bottom of the screen. This was the main entry point for users to communicate with the model.

ChatGPT was the first instruction-tuned LLM application in widespread use, and it gave the general public a glimpse of what these models are capable of. It reached 100 million active users in less than two months, making it one of the fastest-growing consumer products in history.

ChatGPT and InstructGPT

OpenAI has not (as of this writing) made the initial ChatGPT details public but it has said that the training pipeline closely resembles that of InstructGPT by Ouyang et al. (2022), whose details we do know. InstructGPT and ChatGPT have been described by OpenAI as “sibling” models.

Virtual assistants are now one of the major use-cases of LLMs; many people use such models as their default interface to the Internet, to the detriment of search engines and other websites.

The success of ChatGPT, however, brought to light important challenges related to the use of AI. They include hallucinations, jailbreaking and adversarial attacks, and the difficulty of updating the models as new events take place. These are only a few examples of the new issues we’ll have to deal with in the coming years.

1.5.4 What’s next?

Where do we go from here? Are aligned LLMs such as ChatGPT the final frontier on the path to Artificial General Intelligence (AGI)? No one knows.

Even though results are striking and useful (as proven by the massive commercial success of ChatGPT and similar products) there are still many open questions in the field: how to optimize costs, how to make alignment better and safer, how to address potentially existential risks, how to fuse multiple modalities (video, audio in addition to text), and so on.

In the rest of the book, we will go into detail over the concepts we touched on in this chapter—and many we didn’t. We’ll focus on the technical foundations of LLMs but also touch upon the main research avenues, in a beginner-friendly way. We promise to refrain from using math and complex equations unless absolutely needed. 🙂

1.6 Summary

  • Language Models (LMs) model the distribution of words in a language (such as English) either by counting co-occurrence statistics or by using neural nets, in a self-supervised training regimen

  • Neural Nets can be used to train LMs with great performance, especially if they can keep state about previous words seen in the context.

  • Large LMs are now the de facto base layer for many downstream NLP tasks. They can provide embeddings that can replace one-hot-encoded vectors, serve as a base model to be fine-tuned for specific tasks, and function in so-called in-context learning, where NLP tasks are directly framed as natural language, sometimes with examples.

  • Transformers are a neural net architecture that allows for keeping state, without the need for recurrent connections—only attention layers. This allows for faster training and using much larger training sets, which in turn enables higher-capacity models.

  • Instruction-tuning is the last step in making pre-trained LLMs behave as a human would expect. There are multiple ways to do this, the most famous one being RLHF (Reinforcement Learning from Human Feedback), which was used to train ChatGPT.

1.7 References

Askell, A., Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, et al. 2021. “A General Language Assistant as a Laboratory for Alignment.” CoRR abs/2112.00861. https://arxiv.org/abs/2112.00861.
Azar, M. G., M. Rowland, B. Piot, D. Guo, D. Calandriello, M. Valko, and R. Munos. 2023. “A General Theoretical Paradigm to Understand Learning from Human Preferences.” https://arxiv.org/abs/2310.12036.
Bahdanau, D., K. Cho, and Y. Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” https://arxiv.org/abs/1409.0473.
Bai, Y., S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.” https://arxiv.org/abs/2212.08073.
Bengio, Y., J. Ducharme, P. Vincent, and C. Janvin. 2003. “A Neural Probabilistic Language Model.” J. Mach. Learn. Res. JMLR.org. http://dl.acm.org/citation.cfm?id=944919.944966.
DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, et al. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” https://arxiv.org/abs/2501.12948.
Ethayarajh, K., W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. 2024. “KTO: Model Alignment as Prospect Theoretic Optimization.” https://arxiv.org/abs/2402.01306.
Gers, F. A., J. Schmidhuber, and F. Cummins. 1999. “Learning to Forget: Continual Prediction with LSTM.” In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN 99), 2:850–55. https://doi.org/10.1049/cp:19991218.
Hochreiter, S., and J. Schmidhuber. 1997. “Long Short-Term Memory.” Neural Comput. 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Kahneman, D., and A. Tversky. 1979. “Prospect Theory: An Analysis of Decision Under Risk.” Econometrica 47 (2): 263–91. Accessed July 23, 2025. http://www.jstor.org/stable/1914185.
McCann, B., N. S. Keskar, C. Xiong, and R. Socher. 2018. “The Natural Language Decathlon: Multitask Learning as Question Answering.” CoRR abs/1806.08730. http://arxiv.org/abs/1806.08730.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, edited by C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Vol. 26. Curran Associates, Inc. http://bit.ly/mikolov-2013-nips.
Ouyang, L., J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” https://arxiv.org/abs/2203.02155.
Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever. 2018. “Improving Language Understanding by Generative Pre-Training.”
Rafailov, R., A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. 2024. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” https://arxiv.org/abs/2305.18290.
Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2019. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” CoRR abs/1910.10683. http://arxiv.org/abs/1910.10683.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://bit.ly/vaswani-2017-attention.
Wang, Y., Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. 2023. “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” https://arxiv.org/abs/2212.10560.