Posts Coming Soon

This blog will feature reproductions of significant papers in mechanistic interpretability and related areas, along with my own mini-extensions and experiments. Each post will include an embedded interactive playground where you can run the models and experiments yourself in the browser.

Planned Posts
Reproducing "A Mathematical Framework for Transformer Circuits" (Elhage et al., 2021)
Planned
A from-scratch reproduction of the foundational circuit analysis framework, with interactive attention pattern visualizations and a sandbox for exploring induction heads and QK circuits in a small transformer.
Tags: Circuits, Transformers, Induction Heads
Reproducing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" (Wang et al., 2022)
Planned
Reproducing the IOI circuit discovery and implementing an interactive activation patching demo where you can ablate specific heads and see the effect on indirect object identification in real time.
Tags: IOI, Circuit Discovery, Activation Patching
Reproducing "Sparse Autoencoders Find Highly Interpretable Features in Language Models" (Cunningham et al., 2023)
Planned
Training a sparse autoencoder on a small language model and building an interactive feature browser to explore learned features, their activation patterns, and their effects on model behavior.
Tags: Sparse Autoencoders, Feature Decomposition, Monosemanticity
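
To give a flavor of the sparse autoencoder post, here is a minimal NumPy sketch of the forward pass and training objective of a generic sparse autoencoder: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. This is an illustration under stated assumptions, not necessarily the architecture used in the paper or in the eventual post. The names `sae_forward` and `sae_loss`, the dimensions, and the `l1_coeff` value are all hypothetical, and random vectors stand in for real model activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features >= 0
    recon = features @ W_dec + b_dec               # linear decoder
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = np.mean(np.sum((x - recon) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

# Illustrative sizes only: 16-dim activations, 64 dictionary features.
rng = np.random.default_rng(0)
d_model, d_hidden, batch = 16, 64, 8

x = rng.normal(size=(batch, d_model))  # stand-in for real model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
print(features.shape, recon.shape, float(loss))
```

The hidden layer is wider than the input so that individual features have room to specialize, while the L1 term pushes most of them to zero on any given input. Training would minimize this loss by gradient descent, which the sketch leaves out.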