Rotary Positional Embeddings (RoPE) explained from first principles. This video covers how transformers encode relative positional information using rotations, dot products, and attention, and how RoPE works mathematically.
Unlike absolute positional encodings, Rotary Positional Embeddings allow transformers to reason about relative distance between tokens, which is crucial for long-context models and large language models.
We start by building intuition around relative positional information, then carefully derive how RoPE uses rotations to inject relative position into attention scores. From there, we generalize RoPE to d-dimensional embeddings and analyze how factors like base angles, frequency scaling parameters, and relative distance affect attention behavior.
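For a concrete picture of the core idea, here is a minimal NumPy sketch (assuming the standard parameterization from the RoFormer paper; the `rope` helper and the toy dimensions below are illustrative, not code from the video). Each consecutive pair of embedding dimensions is rotated by an angle proportional to the token's position, so the query-key dot product depends only on the relative offset between the two positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary positional embedding to a d-dimensional vector x
    (d even) at integer position `pos`. Pair i of dimensions is rotated
    by the angle pos * theta_i, with theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) * 2.0 / d)  # per-pair frequencies
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]          # split dims into 2D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin    # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# The attention score <rope(q, m), rope(k, n)> depends only on m - n:
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope(q, 5) @ rope(k, 2)        # positions (5, 2), offset 3
s2 = rope(q, 105) @ rope(k, 102)    # positions (105, 102), same offset
print(np.isclose(s1, s2))           # True: score depends on relative distance
```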
⏱️ Timestamps
00:00 In this video
00:40 What and Why of Relative Positional Information
04:29 2D Rotation Review
06:40 Rotary Positional Embeddings (RoPE) Explained
11:00 RoPE beyond 2D
13:29 Why & How Rotary Positional Encodings Work
📖 Resources
RoFormer: Enhanced Transformer with Rotary Position Embedding - https://arxiv.org/pdf/2104.09864
Round and Round We Go! What Makes Rotary Positional Encodings Useful? - https://arxiv.org/pdf/2410.06205
🔔 Subscribe:
https://tinyurl.com/exai-channel-link
Email - [email protected]