In Progress
October 2025 – Present

Interpretability-Auditing for RAG LLMs in High-Regulation Contexts

RAG · Sparse Autoencoders · Transcoders · Finance AI · Circuit Tracing · Provenance

Overview

This project applies Sparse Autoencoders (SAEs) and Transcoder-based circuit tracing to Retrieval-Augmented Generation (RAG) LLM systems deployed in financial services. The goal is to provide auditable explanations of model behavior: specifically, to trace the chain of reasoning from retrieved documents through the model's internal computations to its final output, so that deployments can meet the compliance and explainability requirements of high-regulation environments.

Background

Retrieval-Augmented Generation (RAG) systems address a core limitation of static LLMs by retrieving relevant documents at inference time and conditioning the model's response on that retrieved context. In regulated industries such as finance, healthcare, and law, RAG LLMs are appealing because they can ground responses in specific, citable documents. However, the mere presence of a retrieved document does not guarantee that the model's output faithfully reflects it—the model may partially ignore retrieved context, blend it with parametric knowledge, or hallucinate details.

Regulatory frameworks (e.g., financial advisory regulations, explainability requirements under emerging AI legislation) increasingly require that automated systems be able to explain the basis for their outputs. "It retrieved document X and that's what it generated" is not a sufficient explanation if we cannot verify whether the model's internal computation actually used that document in the way we believe.

Sparse Autoencoders decompose model activations into interpretable, largely monosemantic features. Transcoders extend this by approximating a model's MLP layers with sparse, interpretable replacements, so that the layer's input-to-output computation can be analyzed feature by feature rather than neuron by neuron. Together, these tools allow us to trace which retrieved content causally influenced which parts of the output, through which internal circuits.
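To make the decomposition concrete, here is a minimal NumPy sketch of an SAE forward pass. The dimensions are toy-sized and the weights are random rather than trained, so this illustrates only the shape of the computation: a ReLU encoder producing a sparse feature vector over a learned dictionary, and a linear decoder reconstructing the original activation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU keeps only the dictionary features that fire on this activation
    return np.maximum(0.0, W_enc @ x + b_enc)

def sae_decode(f, W_dec, b_dec):
    # Reconstruct the residual-stream activation from the active features
    return W_dec @ f + b_dec

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32          # toy sizes; real SAEs use thousands of features
W_enc = rng.normal(size=(d_feat, d_model))
b_enc = -0.5 * np.ones(d_feat)   # a negative bias encourages sparsity
W_dec = rng.normal(size=(d_model, d_feat)) / np.sqrt(d_feat)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)     # stand-in for a residual-stream activation
f = sae_encode(x, W_enc, b_enc)
x_hat = sae_decode(f, W_dec, b_dec)
active = np.flatnonzero(f)       # indices of firing features: the audit targets
```

The sparse index set `active` is what makes the approach auditable: each firing feature can, in principle, be labeled and cited in a provenance report.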

Discussion

The audit pipeline works in three stages. First, we train or adapt SAEs on the target RAG LLM, obtaining a dictionary of interpretable features over the residual stream. Second, we use activation patching and transcoder analysis to construct a circuit graph showing which features activate in response to which retrieved passages, and how those features propagate through the model's layers to influence the final generation. Third, we produce a human-readable provenance report that maps each claim in the model's output to the features and retrieved passages that causally support it.
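The second stage above rests on feature-level activation patching. The toy sketch below shows that logic in isolation: ablate one SAE feature, rerun the downstream computation, and rank features by how much the output moves. `toy_readout` is a hypothetical stand-in for the model's actual forward pass from the SAE layer onward, not a real API.

```python
import numpy as np

def toy_readout(features, W_out):
    # Stand-in for the model's downstream computation after the SAE layer
    return W_out @ features

def patching_effect(features, W_out, feat_idx):
    """Causal effect of one feature: ablate it (set it to zero),
    rerun the readout, and compare to the clean run."""
    clean = toy_readout(features, W_out)
    patched_features = features.copy()
    patched_features[feat_idx] = 0.0
    patched = toy_readout(patched_features, W_out)
    return np.linalg.norm(clean - patched)

rng = np.random.default_rng(1)
features = np.maximum(0.0, rng.normal(size=16))  # sparse toy feature vector
W_out = rng.normal(size=(4, 16))
effects = [patching_effect(features, W_out, i) for i in range(16)]

# Features with large ablation effects become candidate nodes in the
# circuit graph; in the real pipeline this is repeated per retrieved passage.
ranked = np.argsort(effects)[::-1]
```

In practice the patching runs against the actual model with hooks on the SAE feature activations, and edge attribution between layers replaces the single linear readout used here.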

A further aspect of the project is circuit reduction: identifying and pruning circuits that do not correspond to legitimate retrieved-context reasoning—for instance, circuits that reflect parametric biases or hallucinated additions not grounded in any retrieved document. This reduction can be used to produce a "cleaned" model that is more faithful to retrieved context.
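At its simplest, pruning amounts to zeroing every SAE feature that the tracing stage could not attribute to a retrieved passage. The sketch below shows that operation; `grounded_idx` is a hypothetical input that, in the real pipeline, would come from the circuit-tracing stage rather than being supplied by hand.

```python
import numpy as np

def prune_ungrounded(features, grounded_idx):
    """Zero every SAE feature not attributed to a retrieved passage.

    grounded_idx is the set of feature indices the tracing stage
    judged to be grounded in retrieved context (hand-picked here)."""
    pruned = np.zeros_like(features)
    idx = list(grounded_idx)
    pruned[idx] = features[idx]
    return pruned

# Toy example: features 1 and 2 reflect parametric knowledge, not retrieval
features = np.array([0.8, 1.5, 0.3, 2.1])
cleaned = prune_ungrounded(features, grounded_idx={0, 3})
# cleaned == [0.8, 0.0, 0.0, 2.1]
```

The design choice here is conservative: ablation rather than fine-tuning, so the intervention is reversible and its effect on each output can itself be audited.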

The financial context provides a concrete evaluation environment: we can measure whether our provenance attributions align with compliance officer judgments about which document passages justify which output statements.
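One simple way to score that alignment is to treat both the pipeline's attributions and the compliance officers' judgments as sets of (claim, passage) pairs and compute precision and recall. This is a sketch of the metric, with invented claim and passage identifiers; the real evaluation would also need to handle partial or graded judgments.

```python
def attribution_agreement(predicted, reference):
    """predicted / reference: sets of (claim_id, passage_id) pairs.

    Returns (precision, recall) of the pipeline's provenance
    attributions against compliance-officer reference labels."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical example: the pipeline misattributes claim c2
pred = {("c1", "p1"), ("c2", "p3"), ("c3", "p2")}
ref  = {("c1", "p1"), ("c2", "p2"), ("c3", "p2")}
precision, recall = attribution_agreement(pred, ref)
# precision == recall == 2/3
```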

Interactive Playground

A demo of provenance tracing on sample RAG queries is planned. Check back later.