In Progress
January 2026 – Present

Interpretability of Embodied VLAs and Cross-Modal Circuit Tracing

Tags: Mechanistic Interpretability · VLAs · Circuit Tracing · Cross-Modal · Embodied AI

Overview

This project develops both the theory and the experimental methodology for tracing causal circuits across multiple heterogeneous model types composed into a single pipeline. The primary application domain is Vision-Language-Action (VLA) models: systems that combine a vision encoder, a language model, and an action policy to control a robot or agent from natural language instructions and visual input.

The core challenge is that mechanistic interpretability techniques developed for single-modality transformers (e.g., activation patching, attention head analysis, circuit discovery) do not straightforwardly transfer to multi-component pipelines where representations must be translated between fundamentally different model architectures and representation spaces. We are developing a framework for cross-modal circuit tracing that can follow a causal chain of computation across these boundaries.
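To make the core operation concrete, here is a minimal sketch of activation patching at a module boundary, using toy linear-plus-nonlinearity stand-ins for an "encoder" and a "policy head" (all weights and names here are hypothetical, not from any real VLA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-module pipeline; the weights are random stand-ins.
W_enc = rng.normal(size=(4, 3))   # "encoder" weights
W_pol = rng.normal(size=(3, 2))   # "policy head" weights

def encoder(x):
    return np.tanh(x @ W_enc)

def policy(h):
    return h @ W_pol

def run(x, patch=None):
    """Run the pipeline, optionally overwriting the boundary activation."""
    h = encoder(x)
    if patch is not None:
        h = patch                 # activation patching at the module boundary
    return policy(h)

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

h_clean = encoder(x_clean)            # cache the clean boundary activation
y_corrupt = run(x_corrupt)            # corrupted baseline
y_patched = run(x_corrupt, patch=h_clean)  # splice the clean activation in

# Because every causal path from input to output passes through this
# boundary, the patched run recovers the clean output exactly.
assert np.allclose(y_patched, run(x_clean))
```

In a real VLA the patch site would be a specific layer activation inside one component, and the question is how much of the clean behavior a single patched site restores.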

Background

VLAs represent one of the most promising directions in embodied AI, allowing robots and agents to follow open-ended natural language instructions while perceiving rich visual scenes. However, as these systems are deployed in real-world settings, understanding why they produce particular actions becomes critical for safety, debugging, and alignment.

Existing interpretability methods are largely designed for single-model analysis. Techniques like activation patching (from the ROME and causal tracing literature) and sparse autoencoder feature decomposition (from the Anthropic and EleutherAI mechanistic interpretability programs) assume a single unified model. In a VLA, a vision encoder may produce patch embeddings that are fed into a language model via a projection layer, which then conditions an action diffusion policy. Each of these components has its own internal representation geometry.
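The compositional structure described above can be sketched schematically. The module widths, weights, and pooling below are illustrative placeholders only; the point is that each stage lives in its own representation space and the projection layer is the boundary an interpretability method must cross:

```python
import numpy as np

rng = np.random.default_rng(1)

D_VIS, D_LM, D_ACT = 8, 16, 4             # hypothetical per-module widths

W_proj = rng.normal(size=(D_VIS, D_LM)) * 0.1   # projection layer
W_act = rng.normal(size=(D_LM, D_ACT)) * 0.1    # action head

def vision_encoder(image_patches):
    """Map raw patches into the vision representation space."""
    return np.tanh(image_patches)               # (n_patches, D_VIS)

def language_model(text_emb, visual_tokens):
    """Stand-in LM: mix projected patches with text embeddings."""
    seq = np.concatenate([visual_tokens, text_emb], axis=0)
    return seq.mean(axis=0)                     # pooled hidden state, (D_LM,)

patches = rng.normal(size=(9, D_VIS))           # 3x3 grid of image patches
text = rng.normal(size=(5, D_LM))               # embedded instruction tokens

vis_emb = vision_encoder(patches)
vis_tok = vis_emb @ W_proj        # cross the vision -> language boundary
hidden = language_model(text, vis_tok)
action = hidden @ W_act           # condition the action head, (D_ACT,)
```

A circuit that explains `action` must be traced back through `W_act`, the pooled LM state, and `W_proj` into individual visual patches — three different geometries in one causal chain.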

Cross-modal circuit tracing requires solving new problems: How do we identify meaningful circuits that span the projection boundary? How do we attribute an action decision causally to features in the visual input versus the language instruction? What are the correct primitives for circuit analysis in diffusion-based action policies?

Discussion

The project is developing both theoretical tools and empirical benchmarks. On the theoretical side, we are working out a formalism for defining "circuits" in a compositional pipeline where modules have different architectures, and developing methods for propagating causal attribution across module boundaries.
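One way to see why cross-boundary attribution composes is the linear case, where each module's Jacobian is just its weight matrix and end-to-end attribution is their product. This is a deliberately simplified sketch (real modules are nonlinear, so Jacobians are local), with all matrices hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear stand-ins for three pipeline stages.
J_enc  = rng.normal(size=(4, 3))   # input space   -> encoder space
J_proj = rng.normal(size=(3, 5))   # encoder space -> language space
J_pol  = rng.normal(size=(5, 2))   # language space -> action space

# Attribution propagated across both module boundaries by composition:
# entry (i, j) is the sensitivity of action j to input feature i.
J_total = J_enc @ J_proj @ J_pol   # shape (4, 2)

x = rng.normal(size=4)
y = ((x @ J_enc) @ J_proj) @ J_pol
assert np.allclose(y, x @ J_total)  # composition is exact for linear stages
```

For nonlinear modules the same chain-rule structure holds only locally, which is one reason the project needs a formalism rather than naive gradient composition.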

On the experimental side, we are running activation patching and sparse autoencoder experiments on open-source VLA models, building a library of known circuits corresponding to specific behaviors, and testing whether the same circuit structures generalize across model scale and training variations.
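The sparse autoencoder component of those experiments has a simple core structure: a ReLU encoder into an overcomplete dictionary and a linear decoder back to the activation space. The sketch below shows that structure only — the widths are arbitrary and the weights are random rather than trained:

```python
import numpy as np

rng = np.random.default_rng(3)

D, F = 16, 64    # activation width, dictionary size (hypothetical)

W_e = rng.normal(size=(D, F)) * 0.1   # encoder weights (untrained)
b_e = np.zeros(F)
W_d = rng.normal(size=(F, D)) * 0.1   # decoder weights (untrained)

def sae(acts):
    """Decompose activations into sparse feature codes and reconstruct."""
    codes = np.maximum(acts @ W_e + b_e, 0.0)   # ReLU -> sparse codes
    recon = codes @ W_d                          # linear reconstruction
    return codes, recon

acts = rng.normal(size=(8, D))        # a batch of cached activations
codes, recon = sae(acts)
sparsity = (codes > 0).mean()         # fraction of active features
```

A trained SAE would minimize reconstruction error plus an L1 penalty on `codes`; circuit analysis then works with the learned feature directions (rows of `W_d`) instead of raw neurons.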

This work connects to broader goals in AI safety: understanding the internal decision-making of embodied agents before they are deployed in high-stakes physical environments.

Interactive Playground

An interactive circuit visualization and activation patching demo is planned for this project. Check back later.