Lec 19 | Pre-Training Strategies: Encoder-decoder and Decoder-only Models

7.8K views
Mar 1, 2025
54:32

tl;dr: This lecture explores pre-training strategies for encoder-decoder (T5, BART) and decoder-only (GPT, LLaMA) models, focusing on denoising-based and autoregressive objectives. It also covers data selection, scaling laws, and key empirical insights for optimizing pre-training efficiency.

🎓 Lecturer: Tanmoy Chakraborty [https://tanmoychak.com]
🔗 Get the Book: https://tanmoychak.com/llmbook

📚 Suggested Readings:
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) [https://arxiv.org/abs/1910.10683]
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension [https://arxiv.org/abs/1910.13461]
- Scaling Laws for Neural Language Models [https://arxiv.org/abs/2001.08361]
- Chapter-8, Intro to LLM, Sections 7.4 (Decoder-Based Pretraining), 7.5 (Encoder-Decoder Based Pretraining) [https://tanmoychak.com/llmbook]

This lecture provides a deep dive into the pre-training methodologies used in modern language models. We examine the denoising objectives central to encoder-decoder architectures like T5 and BART, as well as the autoregressive training strategies underpinning decoder-only models like GPT and LLaMA. Additionally, we discuss critical factors such as dataset selection, scaling laws, and empirical findings that guide the optimization of large-scale pre-training.
