Programming Practical 4: Tokenizers

This practical takes a close look at tokenizers, the first stage of most NLP pipelines. You will compare pre-trained tokenizers built on BPE, WordPiece, and SentencePiece, train your own from scratch, work through padding, batching, and masking, and see how a tokenizer fits into a full LLM pipeline.
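As a preview of the padding, batching, and masking topics listed above, here is a minimal pure-Python sketch. It is illustrative only and is not taken from the notebook: the `pad_batch` helper, the `PAD_ID` value of 0, and the token ids in the example batch are all hypothetical. Real tokenizers define their own padding id and usually handle this step for you.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding id; real tokenizers define their own


def pad_batch(
    sequences: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length in the batch.

    Returns the padded ids plus an attention mask with 1 for real
    tokens and 0 for padding positions.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids of different lengths, batched together.
batch = [[5, 8, 2], [7], [4, 9]]
ids, mask = pad_batch(batch)
print(ids)   # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The mask is what lets a model ignore padded positions, which is why padding and masking are usually introduced together.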

Tokenizers

Notebook: Open In Colab