Building GPT-2 in PyTorch

GPT-2 was introduced in 2019, but rebuilding it in PyTorch is still a great way to learn the fundamentals of transformer training and inference.

Ben Hayes

Note: This post interprets and implements the Python GPT-2 tutorial by Andrej Karpathy while adding my own commentary and results. ▶️ You can watch the full 4-hour tutorial on Andrej's YouTube channel here: Let's reproduce GPT-2 (124M).

Introduction

OpenAI introduced GPT-2 (Generative Pre-trained Transformer 2) in 2019 with the paper Language Models are Unsupervised Multitask Learners, announced in the blog post Better Language Models and Their Implications. The four models trained as part of the GPT-2 research have “approximately log-uniformly spaced sizes”, starting with the smallest model at 124 million parameters and ending with the largest at 1.5 billion parameters. For comparison, last month Meta released and open-sourced Llama 3.1 405B, a model with 405 billion parameters.

While there has been a tremendous amount of interest and research following GPT-2’s introduction, driven in large part by faster GPUs, commercial interest in the technology, and the collection of more training data, the smallest GPT-2 model still offers a great way to understand how these large language models (LLMs) work. In this blog, we will walk through how to build GPT-2 (the 124 million parameter model). We’ll split the process into two parts: first we’ll focus on inferencing to build a foundation for how the model produces token predictions, and second we’ll focus on training to understand how the weights are calculated.

We will deviate from the paper and code that brought us GPT-2 in 2019 in one major way: we will use the PyTorch framework rather than TensorFlow, the (also popular) framework used for the original model.

This post is a continuation of my series on artificial intelligence (AI). If you’d like to learn more about the general theory of deep neural networks, please see my post from 2018 that introduces the topic for beginners. You can also find more recent posts that leverage vision and multimodal approaches to AI.

Background and Assumptions

To maximize your own learning from this post, please be sure to have at least rough familiarity with most of the concepts below. These concepts are critical for understanding what the inference process does, what the training process does, and how we use different metrics to evaluate performance, quality, etc. Here are the major concepts:

Transformer architecture
  • Encoder, decoder, encoder-decoder (seq-2-seq)
  • Positional encoding (learned absolute embeddings as in GPT-2; RoPE in many newer models)
  • Attention (self-attention, QKV, multiple heads)
  • Layer normalization
  • Softmax

Original transformer architecture
Source: Wikipedia

Graphics Processing Units (GPUs)
  • As opposed to CPUs
  • Memory hierarchy
  • CUDA (high-level knowledge)

General memory hierarchy
Source: Wikipedia

Python / PyTorch
  • Torch.nn layers, functions
  • Distributed Data Parallel (DDP)

PyTorch logo
Source: Wikipedia

For many of these, rough familiarity is enough to get you through the content below. For example, you do not need to understand all of the details of CUDA programming such as blocks, threads, streaming multiprocessors, or grid-stride loops. Knowing that GPUs work best when tensor dimensions are at or near powers of two, however, is helpful for understanding certain design choices below.


Inferencing

For this GPT-2 model, we’ll begin by looking at the inference process. Plainly, the inference process takes an input (a single example or a batch of examples) and returns an output. We may pass in “In the future, humans will be able to” and expect the model to return the next tokens, essentially completing the sentence.

For a model to work with inferencing, it needs weights. These weights are often trained over multiple iterations, batches, and epochs using dozens or hundreds of GPUs. In our case, we want to build the logic around using the model, plug in existing weights, and test our logic to see if it works. For the training portion of this exercise, we can then expand our inference logic, include training logic, and swap in randomly initialized weights. Here are the tools we’ll use to complete the inference process.

Tools

  • NVIDIA is the largest provider of enterprise GPUs. The company is one of the largest in the world by market capitalization and maintains the CUDA platform for running general-purpose computation on the GPU. For inferencing, a modest GPU is enough, but you’ll see later that we need to up the horsepower for training.
  • PyTorch is a popular machine learning framework that provides Python interfaces for building and running deep neural networks, including language models.
  • HuggingFace is a collaboration platform designed to simplify access to and sharing of models, data sets, and discoveries. We will use HuggingFace to provide us the weights to test language model inferencing.

Weights

For the first half of this post, we will reconstruct GPT-2 using PyTorch while relying on the weights from HuggingFace. Skip to the second section if you are looking for information on training the weights.

Code

Let’s dive into the code! We will not show the full code here, but if you would like to view it, you can visit Andrej Karpathy’s code at this commit checkpoint on GitHub. Subsequent checkpoints begin to add logic for training. We will review the snippets that are most relevant and that give us intuition for the design choices behind GPT-2.

Attention

import math
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # single linear layer producing the query, key, and value projections for all heads
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection back to the embedding dimension
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular causal mask so each position can only attend to earlier positions
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch size, sequence length, embedding dimension
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # reshape to (B, n_head, T, head_dim) so each head attends independently
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied before the softmax
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # re-assemble the head outputs side by side and project back to n_embd
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        return y

class MLP(nn.Module):
    ...

class Block(nn.Module):
    ...
In this first section, we focus on the imports and the CausalSelfAttention class. We import PyTorch with import torch and, since it is used frequently, we also import torch.nn as nn. Next we define CausalSelfAttention, which performs the Q, K, V projections, the masked softmax, and the weighted sum at the heart of the self-attention mechanism. Note that the computation is split across 12 heads for the 124M model, per the GPT-2 specification.

For brevity, we omit the MLP and Block classes but please see the source code for details.
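As a rough sketch of what those two omitted classes look like (this mirrors the GPT-2 architecture: pre-norm residual blocks and a 4x MLP expansion with GELU; the actual source additionally flags c_proj for a scaled initialization):

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc   = nn.Linear(config.n_embd, 4 * config.n_embd)   # expand 768 -> 3072
        self.gelu   = nn.GELU(approximate='tanh')                   # GPT-2 used the tanh approximation
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)   # project back 3072 -> 768

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp  = MLP(config)

    def forward(self, x):
        # pre-norm residual connections, as in GPT-2
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x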

GPT and Weights

@dataclass
class GPTConfig:
    block_size: int = 1024    # max sequence length
    vocab_size: int = 50257   # number of tokens: 50,000 BPE merges + 256 bytes tokens + 1 <|endoftext|> token
    n_layer: int = 12         # number of layers
    n_head: int = 12          # number of heads
    n_embd: int = 768         # embedding dimension

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.size()
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        pos_emb = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)
        return logits

    @classmethod
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257
        config_args['block_size'] = 1024
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]

        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

In this second section, we begin to configure our GPT. We need to specify the block size (maximum sequence length), vocab size, number of layers, number of heads, and embedding dimension. We'll use these values throughout to set the correct dimensions of the vectors, matrices, and tensors we work with. The numbers are taken from the GPT-2 specification. For example, the 50,257-token vocabulary consists of 50,000 BPE merges, 256 byte-level tokens, and one special <|endoftext|> token.
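As a sanity check on the “124M” label, we can tally the parameters this config implies. One caveat: GPT-2 ties the token embedding and the lm_head to a single shared matrix (a weight-sharing detail added later in the tutorial), so the output head contributes no extra parameters to the count. A rough sketch:

cfg = GPTConfig()
V, Tmax, L, E = cfg.vocab_size, cfg.block_size, cfg.n_layer, cfg.n_embd

wte = V * E                        # token embeddings: 50257 * 768
wpe = Tmax * E                     # position embeddings: 1024 * 768
per_block = (
    2 * 2 * E                      # ln_1 and ln_2 (weight + bias each)
    + (E * 3 * E + 3 * E)          # attn.c_attn
    + (E * E + E)                  # attn.c_proj
    + (E * 4 * E + 4 * E)          # mlp.c_fc
    + (4 * E * E + E)              # mlp.c_proj
)
ln_f = 2 * E
total = wte + wpe + L * per_block + ln_f   # lm_head shares wte's weights once tied
print(f"{total:,}")                # 124,439,808, i.e. ~124M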

We use our own from_pretrained() class method to load the model weights from HuggingFace via the transformers library. Note that if we want to generate from the larger versions of GPT-2, we can do so simply by changing the model_type argument.
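For example, loading the 350M-parameter medium model is a one-line change (expect a larger download from HuggingFace and more GPU memory use):

model = GPT.from_pretrained('gpt2-medium')  # 24 layers, 16 heads, n_embd=1024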

Generation

num_return_sequences = 5
max_length = 30

model = GPT.from_pretrained('gpt2')
model.eval()
model.to('cuda')

# Encoding
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I'm a language model,")
tokens = torch.tensor(tokens, dtype=torch.long)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)
x = tokens.to('cuda')

# Iterate
torch.manual_seed(42)
torch.cuda.manual_seed(42)
while x.size(1) < max_length:
    with torch.no_grad():
        logits = model(x)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1)
        xcol = torch.gather(topk_indices, -1, ix)
        x = torch.cat((x, xcol), dim=1)

for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    decoded = enc.decode(tokens)
    print(">", decoded)

In this final section, we execute the code defined above, including the .to('cuda') call that moves the model and its weights onto the GPU.

Next we use the tiktoken library to encode our input prompt; the resulting tokens are moved to the GPU as well. Lastly, once the generation loop completes, we decode the output tokens and see the results!
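As a small illustration of what the encoder is doing (we print the token IDs rather than hardcode them, since they depend on the GPT-2 BPE vocabulary):

import tiktoken

enc = tiktoken.get_encoding('gpt2')
ids = enc.encode("Hello, I'm a language model,")
print(ids)              # a short list of integers, each in [0, 50256]
print(enc.decode(ids))  # round-trips back to the original string
print(enc.n_vocab)      # 50257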

Outputs

Let’s execute our code. We call our Python script, which tells us the device is correctly configured (assuming you are using a CUDA-compatible GPU). We load the model and then we see some example outputs.

using device: cuda
loading weights from pretrained gpt: gpt2
> Hello, I'm a language model, not a program. So this morning I started studying for the interview in the lab. This was not
> Hello, I'm a language model, and one of the main things that bothers me when they create languages is how easy it becomes to create something that
> Hello, I'm a language model, and I wrote it off on the grounds that a language model would make me more fluent. But I'm not
> Hello, I'm a language model, I really like languages. I like languages because like, they're good. And the way we talk about languages
> Hello, I'm a language model, a language model I'm using for data modelling. All I did was test the results and then I wrote some

We can see that the results are pretty legitimate! They are sentences that attempt to complete the provided input. There’s also a sense of understanding of what a language model is and does.

Let’s recall that we did not yet train the model - we have so far only used existing weights that were previously trained and published on HuggingFace. Why is training important? If we initialize the weights with no structure and no information - essentially noise - we get relatively poor results. Below are a few examples. Notice how the output, while showing some semblance of words and vocabulary, is almost entirely unintelligible.

using device: cuda
> Hello, I'm a language model, approving behold Utilityadin guess precautions Jord curing Sampdiff toss925ultz Groundsome brazen chosen inadequvis cloves authoritieschat
> Hello, I'm a language model, furnacedt ElectoralchanginglvededPage Monitor specialty insiderslogin curtail Eva securingvo butterfly Consulting"},{" censusWeapon Cam cries
> Hello, I'm a language model, gitISO solutions� lia Secondly bunnyeous Spot Di ty anallarge Life lonely Idlib ExplanEuro commissioner Brendancutting woman
> Hello, I'm a language model,eki lifetime clearance Cust stemreset controbb Frostwhelmingmark Mitchellaxies Tao WW Fractnington McCartney resisting Survive Interview Steps
> Hello, I'm a language model, translations approaches inserts Lernerqq gearingnen-,using many mart collaboration trial Overt Pogamoto guarded tremb shifted scaresKingsthose

So how were these model weights that we used from HuggingFace trained? Let’s take a look in the next section.


Training

Let’s switch gears and focus on training the model’s weights. This is a fundamental part of the process, but it does not need to happen every time we want to generate new tokens or text. By distilling the knowledge of the training data set (or corpus) into the weights, we can freeze them as a checkpoint and use them for generation just as we did in the previous section. This two-stage process is advantageous because training is computationally heavy; even in 2019 it required arrays of GPUs, and today it requires data centers full of them, in addition to other tools.

Tools

When it comes to tools, we'll continue to use those from the first section: NVIDIA, PyTorch, and HuggingFace. In addition, we need two more:

  • Lambda Labs is a platform that lets us rent GPUs at hourly rates. The capital expenditure for the GPUs we want to use would be prohibitive for an individual; fortunately, we can rent them for a few dollars per hour.
  • FineWeb EDU is a highly curated data set containing educational sources. While this is not the exact data set that GPT-2 would have used for training, it is advantageous because of how refined and filtered the data set already is.

Note: In his video, Andrej Karpathy also uses the HellaSwag data set to evaluate the performance of the model. For the remainder of this post, we largely ignore this part of his process. If you wish to understand more, please watch his video here.

Before we dive in, let’s discuss how this will work:

  • We will use Lambda Labs for our training run; the ~19,000-step run (19,073 steps) completes within about 2 hours on 8 x A100 (80 GB SXM4) GPUs. (See the quick arithmetic after this list for where that step count comes from.)
  • We will modify the code to incorporate a loss function and backward passes so that our model weights are updated.
  • We will modify the code to iteratively update the weights and print out results regarding sample output and evaluation metrics.
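As a quick sketch of where the ~19,000-step figure comes from: each step will process a batch of 524,288 tokens (configured below), and the FineWeb-EDU sample used here contains roughly 10 billion tokens, so one pass over the data works out to about 19,073 steps.

tokens_per_step = 524288                 # 2**19, the total batch size configured below
dataset_tokens = 10_000_000_000          # approximate size of the FineWeb-EDU sample
print(dataset_tokens / tokens_per_step)  # ~19,073 steps for roughly one epoch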

Code

The following code snippets omit (with ...) pieces of code that were largely unchanged from above or otherwise unnecessary for explaining the training process. Please refer to the source code for the full code base.

Add Optimization to GPT Module

class GPT(nn.Module):
    def __init__(self, config):
        ...

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        ...
        logits = self.lm_head(x) # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    def configure_optimizers(self, weight_decay, learning_rate, device_type):
        param_dict = {pn: p for pn, p in self.named_parameters()}
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        if master_process:
            print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
            print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == "cuda"
        if master_process:
            print(f"using fused AdamW: {use_fused}")
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)
        return optimizer

For our GPT module to function in a training context, we need to modify and add several methods. First, we add a cross-entropy loss calculation to the forward() method; the loss is now returned alongside the logits at every step of the training process. We also add _init_weights() so that linear and embedding layers start from a normal distribution with a standard deviation of 0.02, matching GPT-2.
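One nice property of this loss: at initialization the model should assign roughly uniform probability across the 50,257-token vocabulary, so the expected cross-entropy is about ln(50257) ≈ 10.8. The training logs later in this post do start near ~11, which is a quick sanity check that the initialization is reasonable.

import math
print(math.log(50257))  # ~10.82, the cross-entropy of a uniform guess over the GPT-2 vocabulary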

Next we define a method to configure the optimizer. In this case we use AdamW, applying weight decay only to parameters with two or more dimensions (matrices and embeddings, not biases or LayerNorm parameters). Note that we use the fused version of AdamW when available, since the fused kernel avoids extra, costly memory round trips.

Add Distributed Data Parallel

from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist

ddp = int(os.environ.get('RANK', -1)) != -1
if ddp:
    assert torch.cuda.is_available(), "for now i think we need CUDA for DDP"
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])
    ddp_local_rank = int(os.environ['LOCAL_RANK'])
    ddp_world_size = int(os.environ['WORLD_SIZE'])
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0 # use this process to do logging, etc.
else:
    ddp_rank = 0
    ddp_local_rank = 0
    ddp_world_size = 1
    master_process = True
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"using device: {device}")

device_type = "cuda" if device.startswith("cuda") else "cpu"

We need to tell our training pipeline to leverage the additional GPUs. We can do so using torch.distributed and torch.nn.parallel, specifically DistributedDataParallel (DDP). In this setup, each GPU runs its own copy of the model on a different slice of the data, and gradients are synchronized across processes at the end of each backward pass.
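To launch one process per GPU on a single node, we can use torchrun, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables the code above checks for (the script filename here is just whatever you saved the training file as):

torchrun --standalone --nproc_per_node=8 train_gpt2.py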

Create Model and Define Learning Rate

torch.manual_seed(1337)
if torch.cuda.is_available():
    torch.cuda.manual_seed(1337)

enc = tiktoken.get_encoding("gpt2")

total_batch_size = 524288
B = 64 # micro batch size
T = 1024 # sequence length
assert total_batch_size % (B * T * ddp_world_size) == 0, "make sure total_batch_size is divisible by B * T * ddp_world_size"
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
if master_process:
    print(f"total desired batch size: {total_batch_size}")
    print(f"=> calculated gradient accumulation steps: {grad_accum_steps}")

train_loader = DataLoaderLite(B=B, T=T, process_rank=ddp_rank, num_processes=ddp_world_size, split="train")
val_loader = DataLoaderLite(B=B, T=T, process_rank=ddp_rank, num_processes=ddp_world_size, split="val")

torch.set_float32_matmul_precision('high')

# create model
model = GPT(GPTConfig(vocab_size=50304))
model.to(device)
use_compile = False
if use_compile:
    model = torch.compile(model)
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])
raw_model = model.module if ddp else model

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073
def get_lr(it):
    if it < warmup_steps:
        return max_lr * (it+1) / warmup_steps
    if it > max_steps:
        return min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

optimizer = raw_model.configure_optimizers(weight_decay=0.1, learning_rate=6e-4, device_type=device_type)

In this third section, we define the total batch size (524,288 tokens, which is 2^19), the micro batch size (64), and the sequence length (1,024, matching the block size used earlier). With the 8 GPUs used here, 64 x 1,024 x 8 = 524,288, so the calculated gradient accumulation step count is 1. Note also that the model is created with vocab_size=50304, a padded, more GPU-friendly number than the true 50,257.

Perhaps most importantly, we set up the learning rate schedule. We could keep the learning rate constant, but in practice letting it decay from a maximum to a minimum bound along a cosine curve improves training. Additionally, we include linear warmup steps so the learning rate ramps up from near zero over the first 715 steps before the decay begins.
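To make the schedule concrete, here is a quick sketch (run in the context of the script above, where get_lr and its constants are defined) that evaluates the learning rate at a few milestones:

for it in [0, 714, 9894, 19072]:
    print(it, f"{get_lr(it):.2e}")
# it=0      -> 6e-4 / 715 ~ 8.4e-07   (start of the linear warmup)
# it=714    -> 6.0e-04                (warmup complete, at the maximum learning rate)
# it=9894   -> ~3.3e-04               (roughly halfway down the cosine decay)
# it=19072  -> ~6.0e-05               (approaching the minimum learning rate floor)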

Execute Training Loop

log_dir = "log"
os.makedirs(log_dir, exist_ok=True)
log_file = os.path.join(log_dir, f"log.txt")
with open(log_file, "w") as f:
    pass

for step in range(max_steps):
    t0 = time.time()
    last_step = (step == max_steps - 1)

    # Periodically evaluate validation loss
    if step % 250 == 0 or last_step:
        model.eval()
        val_loader.reset()
        with torch.no_grad():
            val_loss_accum = 0.0
            val_loss_steps = 20
            for _ in range(val_loss_steps):
                x, y = val_loader.next_batch()
                x, y = x.to(device), y.to(device)
                with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
                    logits, loss = model(x, y)
                loss = loss / val_loss_steps
                val_loss_accum += loss.detach()
        if ddp:
            dist.all_reduce(val_loss_accum, op=dist.ReduceOp.AVG)
        if master_process:
            print(f"validation loss: {val_loss_accum.item():.4f}")
            with open(log_file, "a") as f:
                f.write(f"{step} val {val_loss_accum.item():.4f}\n")
            if step > 0 and (step % 5000 == 0 or last_step):
                checkpoint_path = os.path.join(log_dir, f"model_{step:05d}.pt")
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'config': raw_model.config,
                    'step': step,
                    'val_loss': val_loss_accum.item()
                }
                torch.save(checkpoint, checkpoint_path)

    model.train()
    optimizer.zero_grad()
    loss_accum = 0.0
    for micro_step in range(grad_accum_steps):
        x, y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)
        if ddp:
            model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
        with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
            logits, loss = model(x, y)
        loss = loss / grad_accum_steps
        loss_accum += loss.detach()
        loss.backward()
    if ddp:
        dist.all_reduce(loss_accum, op=dist.ReduceOp.AVG)
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    optimizer.step()
    if device_type == "cuda":
        torch.cuda.synchronize()
    t1 = time.time()
    dt = t1 - t0
    tokens_processed = train_loader.B * train_loader.T * grad_accum_steps * ddp_world_size
    tokens_per_sec = tokens_processed / dt
    if master_process:
        print(f"step {step:5d} | loss: {loss_accum.item():.6f} | lr {lr:.4e} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f}")
        with open(log_file, "a") as f:
            f.write(f"{step} train {loss_accum.item():.6f}\n")

if ddp:
    destroy_process_group()

As expected, in this last section we iterate up to max_steps. We use this loop to train the model, periodically evaluate the validation loss, write checkpoints, and (in the full script) print samples generated by the model as of that training step.
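Before looking at the outputs, it is worth doing the throughput arithmetic, since it explains the tok/sec figures in the logs below: every optimizer step consumes the full 524,288-token batch.

tokens_per_step = 64 * 1024 * 1 * 8    # B * T * grad_accum_steps * ddp_world_size = 524,288
step_time_s = 0.455                    # roughly the per-step time reported in the logs below
print(f"{tokens_per_step / step_time_s:,.0f} tokens/sec")  # ~1,150,000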

What does this look like when we’re training the model? Here’s a view of the training output in VSCode with nvidia-smi (right) showing the GPU array memory usage. We’re maxing out the memory usage on the GPUs which is great because we’re getting our money’s worth.

Training in action!

This approach is fantastic, but it does show its age! We have enabled reduced-precision math by using TF32 and BF16 where possible; you can learn more about these data types on NVIDIA’s blog here. We do not use torch.compile here, but we could leverage it to potentially speed up training: compilation finds efficiencies that accelerate the run, although it is less interactive, since compiling can take a few minutes before outputs appear. Additional enhancements include FlashAttention (via PyTorch’s fused scaled dot-product attention) and friendlier, power-of-two-adjacent parameters such as the padded vocabulary size.
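Both of those optimizations are small code changes. As a rough sketch: the manual mask/softmax/matmul inside CausalSelfAttention.forward() can be swapped for PyTorch's fused scaled dot-product attention (which dispatches to a FlashAttention-style kernel on supported GPUs), and the model can be compiled after construction.

# inside CausalSelfAttention.forward(), replacing the att = ... and y = att @ v lines:
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# after creating the model (the first training step is slow while compilation runs):
model = torch.compile(model)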

Outputs

Well, let’s take a look at samples generated during training. Our code will output a few dozen examples periodically throughout the training run.

Generated samples: ~250 steps in (1% training complete)

Early on (~250 steps into the training run), the results are largely indecipherable.

In the future, humans will  You do you!! So me that is a new book. A, a different. To do you can be too many
In the future, humans will  We will like that, the country (’t. It’s a great and that will have a little �
In the future, humans will |A)
In the future, humans will  AreD-1-s of those at the “The word “G”
In the future, humans will  So do they are a doctor and need to make? That is an overview of a child with a long-shaped or if

Generated samples: ~3000 steps in (15% training complete)

Now at ~3000 steps into the training run, we can see marked improvement in the output quality. The loss declined substantially (from ~11 to ~3.4).

step  2999 | loss: 3.433184 | lr 5.7964e-04 | norm: 0.3628 | dt: 453.42ms | tok/sec: 1156285.84
validation loss: 3.5127
In the future, humans will be able to control populations of many species and plants in their territory for generations. We are already creating tools for future economic and ecological
In the future, humans will be able to get access to the resources they need using technology as well as opportunities to do as they get access for themselves.
In the future, humans will be able to extract only 10% of their diet as whole food supplements on a daily basis. So that will be done through diet
In the future, humans will be able to interact with their pets, and have different needs. So, if you consider that, it’s a matter
In the future, humans will be able to breed with one another. They will create a home for themselves, which will be more prosperous and sustainable.

Generated samples: ~19071 steps in (100% training complete)

After the training run, we can see that the output has continued to improve. Our model’s validation loss dropped again but relatively little this time (from ~3.4 to ~3.1).

step 19071 | loss: 3.071546 | lr 6.0000e-05 | norm: 0.2978 | dt: 455.08ms | tok/sec: 1152086.52
validation loss: 3.0734
In the future, humans will be able to see how to use technology in an environment much more advanced than we are accustomed to today.
In the future, humans will be able to get enough oxygen from the ground. They will then be able to grow and take advantage of the oxygen on their leaves
In the future, humans will be able to monitor how well they use their sense of smell, and even how much they smell in their indoor environment. But a
In the future, humans will be able to control the way they’re using space, creating a much more advanced human intelligence with even more space capabilities.
In the future, humans will be able to navigate a world the size of Antarctica. And humans could move to the New World much as they did the last ten
In the future, humans will be able to do complex tasks where humans could not even imagine them.
In the future, humans will be able to see both planets as they are, and they will be able to use this information to make intelligent decisions on the one
In the future, humans will be able to use machine learning tools like predictive analytics to explore the risks and threats of AI security. Future improvements could include, for
In the future, humans will be able to survive a global temperature rise of 2 degrees by the end of the century, according to a new study. If the
In the future, humans will be able to eat more seaweed, which will not go bad, but we also would like to eat it more of it.

What do you think? Will humans in the future be able to get enough oxygen from the ground? Eat more seaweed? Monitor how well we use our sense of smell? 😅


Conclusion

As we’ve learned throughout this process, there are a lot of considerations to make when building a large language model. It helps to first understand how transformers work generally, how encoding and decoding work, how attention mechanisms avoid the need for recurrence, and how our training data can bias our model weights. In this post, we learned how to build a GPT-2-like model for inference/generation and how to train it against a loss function using NVIDIA GPUs, both locally and from a cloud provider (i.e., Lambda Labs). It’s easy to see why PyTorch is such a popular, sophisticated framework for designing and optimizing deep neural networks.

GPT-2 is approaching 5 years old, but it still demonstrates the fundamentals of transformer-based models: tokenization and how the training process iteratively improves model performance. What’s next for generative AI? In recent years, newer models have introduced capabilities for working with audio, images, and video, demonstrating the flexibility of the attention mechanisms inherent in transformers. There is also a growing emphasis on giving models function-calling capabilities to enable agentic behavior. Stay tuned for follow-ups!

Additional Resources