Programming Practical 1: Sampling from GPT-2¶
This Programming Practical focuses on sampling from a trained GPT-2 model. After completing the missing parts of model.py, we will use the sample.py script to generate text samples based on a given prompt.
This code is a skinny version of Karpathy's nanoGPT, focused on sampling and GPT-2 only. Check out the full repo for more features and training code.
Complete the Missing Parts¶
In this PP, you will mainly implement the missing parts of several building blocks of model.py. The goal is to get a working GPT-2 implementation that can sample text. First, take a look at the code to understand the overall structure and flow.
You will have to use config parameters, whose names can be found in the `GPTConfig` `@dataclass` on l. 136:
```python
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = (
        50304  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    )
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = (
        True  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
    )
```
Since these fields are declared with type annotations, the `@dataclass` decorator adds them automatically to the generated `__init__` method. Let's go!
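For example, because `GPTConfig` is a `@dataclass`, you can rely on the generated constructor directly (a minimal illustration, assuming model.py is importable from your working directory):

```python
from model import GPTConfig

config = GPTConfig()           # every field takes its declared default
print(config.n_embd)           # 768
small = GPTConfig(n_layer=6)   # any declared field can be overridden by keyword
```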
CausalSelfAttention¶
In model.py:30, CausalSelfAttention implements multi-head causal self-attention (projection to q/k/v, masked attention, then output projection).
```python
def __init__(self, config):
    super().__init__()
    assert config.n_embd % config.n_head == 0
    # key, query, value projections for all heads, but in a batch
    self.c_attn = # TODO. /!\ note that each k, q, v vector will be of size (n_embd // n_head)
    # output projection
    self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
```
- Complete `self.c_attn`.
- Make each key, query, and value vector use the per-head embedding size.
```python
def forward(self, x):
    B, T, C = (
        x.size()
    )  # batch size, sequence length, embedding dimensionality (n_embd)
    # calculate query, key, values for all heads in batch and move head forward to be the batch dim
    q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
    k = # TODO. output shape: (B, nh, T, hs)
    q = # TODO. output shape: (B, nh, T, hs)
    v = # TODO. output shape: (B, nh, T, hs)
    # [...]
    else:
        # manual implementation of attention
        att = # TODO matmul and scaling
        att = # TODO causal mask (using att.masked_fill)
        att = # TODO softmax and dropout
        y = att @ v
    y = (
        y.transpose(1, 2).contiguous().view(B, T, C)
    )  # re-assemble all head outputs side by side
    # output projection
    y = self.resid_dropout(
        # TODO
    )
    return y
```
- Reshape `k`, `q`, and `v` to the required output shape `(B, nh, T, hs)`.
- In the non-flash branch, compute attention scores with matrix multiplication and scaling.
- Apply a causal mask using `att.masked_fill`.
- Apply softmax, then attention dropout.
- Fill the missing lines inside `self.resid_dropout(...)` for the output projection path.
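If you get stuck, here is one possible completion, written as a standalone module so it can be run and inspected on its own. Treat it as a sketch rather than the reference solution: it covers only the manual (non-flash) branch, and the buffer name `mask` is chosen here for illustration (model.py may register its causal mask under another name).

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttentionSketch(nn.Module):
    """A possible completion of the TODOs above (manual attention only)."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # one matmul produces q, k, v for all heads: 3 * n_embd outputs,
        # so each per-head vector ends up with size n_embd // n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask: ones on and below the diagonal, shaped for broadcasting
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size
            ),
        )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # (B, T, C) -> (B, nh, T, hs)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # scaled dot-product attention with a causal mask
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.attn_dropout(F.softmax(att, dim=-1))
        y = att @ v  # (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # output projection followed by residual dropout
        return self.resid_dropout(self.c_proj(y))
```

The key point is that a single `(n_embd, 3 * n_embd)` linear layer produces q, k, and v for all heads in one matmul; the subsequent `view` and `transpose` are what give each head its slice of size `n_embd // n_head`.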
MLP¶
In model.py:98, MLP is the feed-forward sub-layer used inside each transformer block.
```python
def __init__(self, config):
    super().__init__()
    self.c_fc = # TODO the depth of the MLP will be 4 * config.n_embd
    self.gelu = nn.GELU()
    self.c_proj = # TODO
    self.dropout = nn.Dropout(config.dropout)
```
- Complete `self.c_fc` so that the hidden depth is `4 * config.n_embd`, as the TODO states.
- Complete `self.c_proj`.
```python
def forward(self, x):
    # TODO
    return x
```
- Complete the missing lines in the forward pass.
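For comparison, a completed version might look like the following; a sketch, with the `4 * config.n_embd` hidden depth taken straight from the TODO:

```python
import torch.nn as nn


class MLPSketch(nn.Module):
    """A possible completion of the MLP TODOs."""

    def __init__(self, config):
        super().__init__()
        # expand to a hidden depth of 4 * n_embd, then project back down
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        # linear -> GELU -> linear -> dropout
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))
```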
Block¶
In model.py:112, Block combines normalization, attention, and MLP with residual connections.
```python
def forward(self, x):
    # TODO layer_norm -> attention with residual connection -> layer_norm -> MLP with residual connection
    return x
```
- Implement this order in the missing lines: layer norm, attention with residual, layer norm, then MLP with residual.
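In code, that order could look like this; a sketch assuming the submodules are named `ln_1`, `attn`, `ln_2`, and `mlp` as in nanoGPT (check model.py for the actual names), reusing the sketch classes above:

```python
import torch.nn as nn


class BlockSketch(nn.Module):
    """A possible completion of Block, built from the sketches above."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttentionSketch(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLPSketch(config)

    def forward(self, x):
        # pre-norm design: normalize, transform, then add the residual back
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```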
GPT¶
In model.py:141, GPT defines the full language model and, in forward, applies embeddings, stacked blocks, and output logits.
```python
def forward(self, idx):
    # [...]
    # forward the GPT model itself
    tok_emb = self.transformer.wte(idx)
    pos_emb = self.transformer.wpe(pos)
    x = # TODO: Dropout + positional encoding
    # TODO apply blocks sequentially
    x = self.transformer.ln_f(x)
    logits = self.lm_head(
        # TODO keep the logits of the final position
    )
    return logits
```
- Build `x` from token embeddings and positional embeddings, then apply dropout.
- Apply each transformer block sequentially.
- Keep only the logits from the final time position.
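A possible completion of those three steps, as they would sit inside `forward`; this assumes the dropout layer and the block list live at `self.transformer.drop` and `self.transformer.h`, as in nanoGPT, so verify the names in model.py:

```python
# sum token and positional embeddings, then apply dropout
x = self.transformer.drop(tok_emb + pos_emb)
# apply the stacked transformer blocks in order
for block in self.transformer.h:
    x = block(x)
x = self.transformer.ln_f(x)
# for sampling, only the prediction at the last position is needed;
# indexing with the list [-1] keeps the time dimension: (B, 1, vocab_size)
logits = self.lm_head(x[:, [-1], :])
```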
Sample Text¶
Once you have completed the implementation, use sample.py to test it by generating text samples from the pre-trained GPT-2 model.
Note
It is likely that your implementation will fail at first; you will have to go back and forth between sampling and debugging the code from the previous sections.
Basic Usage¶
```bash
# Sample from GPT-2
uv run python sample.py --start="Hello, my name is"
```
CLI Options¶
All available command-line options for sample.py:
- `--start`: The prompt text to start generation (default: `"\n"`)
    - Can be any string: `--start="Once upon a time"`
    - Can load from a file: `--start="FILE:prompt.txt"`
    - Special tokens like `--start="<|endoftext|>"`
- `--num_samples`: Number of samples to generate (default: `1`)
- `--max_new_tokens`: Number of tokens to generate per sample (default: `100`)
- `--temperature`: Sampling temperature (default: `0.8`)
    - `1.0`: No change (standard sampling)
    - `< 1.0`: Less random (more deterministic)
    - `> 1.0`: More random (more creative)
- `--top_k`: Keep only the top k most likely tokens (default: `200`)
    - Higher values: More diverse outputs
    - Lower values: More focused outputs
- `--seed`: Random seed for reproducibility (default: `1337`)
- `--device`: Device to run on (default: `"cpu"`)
    - Examples: `"cpu"`, `"cuda"`, `"cuda:0"`, `"cuda:1"`
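To see how `--temperature` and `--top_k` interact, here is roughly what one generation step does with them, in the spirit of nanoGPT's `generate`; the function name and signature below are illustrative, not part of sample.py:

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits, temperature=0.8, top_k=200):
    """Sample one token id per row from final-position logits of shape (B, vocab)."""
    logits = logits / temperature  # < 1.0 sharpens the distribution, > 1.0 flattens it
    if top_k is not None:
        # discard everything below the k-th largest logit
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits = logits.masked_fill(logits < v[:, [-1]], -float("inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1) token indices
```

Dividing the logits by a temperature below 1.0 makes the softmax peakier before sampling, while top-k removes the long tail of unlikely tokens entirely, which is why the two options compose well.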
Love Letter¶
Try to generate a decent love letter using GPT-2, then send it to pnovello@insa-toulouse.fr. The most romantic, the funniest, and the clumsiest ones will be displayed on the website. Work hard for posterity!
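For instance, using only the options documented above (the flag values are just a starting point):

```bash
uv run python sample.py --start="My dearest," --max_new_tokens=200 --temperature=0.9
```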
Bonus: Implement Layer Norm by Yourself¶
In model.py, l. 27, layer norm is implemented by calling PyTorch's built-in F.layer_norm:
```python
def forward(self, input):
    return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
```
Try to implement it by yourself, without using PyTorch's built-in implementation.
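If you want to check your version, here is one possible from-scratch module; a sketch that normalizes over the last dimension with the same `1e-5` epsilon (note that `F.layer_norm` uses the biased variance):

```python
import torch
import torch.nn as nn


class ManualLayerNorm(nn.Module):
    """A from-scratch replacement for the F.layer_norm call above."""

    def __init__(self, ndim, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        # normalize over the last dimension: zero mean, unit variance
        mean = input.mean(dim=-1, keepdim=True)
        var = input.var(dim=-1, keepdim=True, unbiased=False)  # biased, as F.layer_norm uses
        x_hat = (input - mean) / torch.sqrt(var + 1e-5)
        # learnable affine transform, as in nn.LayerNorm
        return self.weight * x_hat + (self.bias if self.bias is not None else 0.0)
```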