Slides are available at https://martinisadad.github.io/
Transformers are everywhere in AI, and almost all LLMs these days are built on them. The secret sauce is the attention mechanism. However, it can be very slow and eat up a lot of memory, especially as models get larger and larger.
That's where FlashAttention comes in. It's a game-changer and one of the most important breakthroughs in recent years: it makes attention dramatically faster and more memory-efficient.
As a regular SWE, I'd like to share this great technique with all of you :)
Related Video:
Transformer Deep Dive https://youtu.be/TcKJMBZySj0
#ai
#llm
#transformers
#attention
#flash
#maths
#machinelearning
0:00 Intro
0:56 CPU and GPU Memory Hierarchy
4:29 Standard Attention
8:26 Flash Attention Intro
11:31 Softmax Algorithms
14:23 Tiling
16:32 Online Safe Softmax
21:13 Final Forward Pass Algorithm
23:39 Memory IO Analysis
26:47 Backward Pass
34:11 Ending
FlashAttention V1 Deep Dive By Google Engineer | Fast and Memory-Efficient LLM Training