Modify the behavior or the personality of a model at inference time, without fine-tuning or prompt engineering.
Read the blog post: https://huggingface.co/spaces/dlouapre/eiffel-tower-llama
Explore SAEs on the Hub: https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability
Neuronpedia: https://www.neuronpedia.org
00:00 Introduction
00:25 Steering as Neurostimulation
02:18 Transformer architecture
04:25 Linear representation of concepts
09:04 Steering using 🤗 transformers
13:43 Finding steering vectors
14:36 Using Sparse AutoEncoders
16:28 Conclusion