Lec 09 | Tokenization Strategies

Uploaded: Feb 7, 2025
Duration: 4388 s

Channel: NPTEL IIT Delhi (77.9K subscribers)

12.6K views

This lecture covers three key tokenization strategies: Byte-Pair Encoding (BPE), WordPiece, and the Unigram Language Model. Together they determine how modern language models break text into manageable pieces before processing it.

🎓 Lecturer: Tanmoy Chakraborty [https://tanmoychak.com]
🔗 Get the Book: https://tanmoychak.com/llmbook
📚 Suggested Readings:
- Byte Pair Encoding [https://arxiv.org/abs/1508.07909]
- WordPiece [https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6289079]
- Unigram Language Model [https://arxiv.org/abs/1804.10959]
- Chapter 2, Intro to LLM, An Overview of Natural Language Processing and Neural Network, Section 2.4 (Tokenisation) [https://tanmoychak.com/llmbook]

This session is aimed at anyone who wants to understand the mechanics behind effective language model training and how tokenization applies across NLP tasks.
