L-6 Transformer Encoder Explained | Q, K, V Intuition + Math
In this lecture, we continue our Transformer series and move into the actual working of the Transformer encoder.

Before starting, I strongly recommend watching the previous two lectures on:
Tokenization and Embeddings - https://www.youtube.com/watch?v=a65rvfDxbUA
Positional Encoding - https://www.youtube.com/watch?v=OmTKTJXW3qE
because this video builds directly on those concepts.

🔹 What you’ll learn in this video
✅ How tokenized input becomes encoder input
✅ What vocabulary size really means
✅ How the embedding layer works internally
✅ Why embedding table shape = (vocab_size × d_model)
✅ How positional encoding is added
✅ What exactly goes into the Transformer encoder

🔹 Transformer Encoder Explained
We then move into the core of the Transformer:
📍 What does the encoder do?
It learns relationships between words in a sentence.
📍 Main components of the encoder
Multi-Head Self-Attention
Feed Forward Neural Network (FFN)
Residual Connections
Layer Normalization

🔹 Self-Attention Explained (Q, K, V)
You’ll clearly understand:
What Query (Q) means
What Key (K) represents
What Value (V) actually carries
Why Q, K, V are not learned directly
How linear projections create Q, K, V
Why the same weights are shared across tokens

We go step-by-step through:
Shapes of X, Q, K, V
Meaning of d_model, d_k, and d_v
Matrix multiplication intuition

🔹 Intuition + Math (Both Covered)
This lecture is designed to help you:
✔️ Build strong intuition
✔️ Understand matrix shapes without confusion
✔️ Read Transformer equations confidently
✔️ Prepare for advanced Transformer & LLM topics

👩‍🏫 Who is this video for?
Machine Learning beginners
Deep Learning students
NLP & Transformer learners
Anyone confused about Q, K, V

If you find this explanation helpful, don’t forget to like, share, and subscribe. It really helps the channel grow ❤️

📸 Follow me on Instagram: @codewithaarohihindi
🔗 https://instagram.com/codewithaarohihindi
📧 You can also reach me at: [email protected]

📌 Next videos will cover:
Attention score computation
Scaled Dot-Product Attention
Multi-Head Attention in detail
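
🔹 Shape walkthrough (sketch)
To make the shapes covered in this lecture concrete, here is a minimal sketch in plain Python of the path from token IDs to Q, K, V. It is not the code from the video. The toy sizes, the token IDs, the random initialization, and the helper names (sinusoidal_positions, random_matrix, matmul, shape) are all assumptions made for illustration, and the sinusoidal form of the positional encoding is assumed as well. A real model would use a tensor library and trained weights.

```python
import math
import random


def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model) (assumed variant)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random values, not trained)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shape(m):
    return (len(m), len(m[0]))


rng = random.Random(0)
vocab_size, d_model, d_k, d_v = 10, 8, 4, 4  # toy sizes (assumptions)

token_ids = [3, 7, 1]  # hypothetical tokenizer output for a 3-token sentence
seq_len = len(token_ids)

# Embedding table: one row of length d_model per vocabulary entry.
embedding_table = random_matrix(vocab_size, d_model, rng)  # (vocab_size, d_model)
embedded = [embedding_table[t] for t in token_ids]         # (seq_len, d_model)

# Encoder input X = token embeddings + positional encoding, element-wise.
pe = sinusoidal_positions(seq_len, d_model)
X = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embedded, pe)]

# Q, K, V are not stored directly; they come from projecting X with
# weight matrices that are shared by every token position.
W_q = random_matrix(d_model, d_k, rng)  # (d_model, d_k)
W_k = random_matrix(d_model, d_k, rng)  # (d_model, d_k)
W_v = random_matrix(d_model, d_v, rng)  # (d_model, d_v)

Q = matmul(X, W_q)  # (seq_len, d_k)
K = matmul(X, W_k)  # (seq_len, d_k)
V = matmul(X, W_v)  # (seq_len, d_v)

print(shape(X), shape(Q), shape(K), shape(V))
# prints (3, 8) (3, 4) (3, 4) (3, 4)
```

Because W_q, W_k, and W_v do not depend on position, projecting a single token's row of X on its own gives the same result as the matching row of Q, K, or V. That is what sharing the same weights across tokens means in practice.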