In Progress
January 2026 – Present

Towards Causal Benchmarks for LLM Hallucination

LLM Hallucination · Causal Circuits · Benchmarking · ConeCUT · Mechanistic Interpretability

Overview

This project builds a benchmark that measures and ranks language models' tendency to hallucinate based on their internal circuit structure, rather than on output-level evaluation alone. Using an approach inspired by ConeCUT, a technique for discovering and cutting causal circuits within neural networks, we construct tasks and metrics that expose hallucination-prone internal mechanisms and allow models to be ranked by their structural hallucination risk in a principled way.

Background

Hallucination in large language models is one of the most pressing reliability problems in applied AI. A model "hallucinates" when it generates confident, fluent text that is factually incorrect or unsupported by its context. Existing benchmarks (e.g., TruthfulQA, HaluEval) evaluate hallucination post-hoc by inspecting model outputs on curated question sets. This approach has a fundamental limitation: it measures the symptom, not the cause.

An output-level benchmark cannot distinguish a model that hallucinates rarely because it has good internal factual representations from one that hallucinates rarely on those particular test prompts because of surface-level patterns. Two models can achieve the same score on an output-level benchmark while having completely different internal mechanisms, and thus different failure modes in deployment.

ConeCUT and related circuit-discovery methods from mechanistic interpretability allow us to identify specific circuits within a model that are causally responsible for particular behaviors. The hypothesis behind this project is that hallucination-prone models have identifiable circuit signatures: e.g., factual recall circuits that are weak or easily overridden, attention heads that over-rely on positional or syntactic cues rather than semantic content, or suppression mechanisms that fail to inhibit false continuations.
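ConeCUT itself is only described at a high level here, but the general causal-tracing idea it builds on can be illustrated on a toy network: patch one internal activation from a "clean" run into a "corrupted" run and measure how far the output moves back toward the clean answer. Everything in the sketch below (the two-layer network, unit-level patching) is illustrative, not the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: x -> h = relu(W1 x) -> y = W2 h
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

def forward(x, patch=None):
    """Run the network; optionally overwrite one hidden unit
    (index, value) before the readout -- the activation 'patch'."""
    h = np.maximum(W1 @ x, 0.0)
    if patch is not None:
        i, v = patch
        h = h.copy()
        h[i] = v
    return (W2 @ h).item(), h

clean_x = np.array([1.0, 0.5, -0.3])
corrupt_x = np.array([-1.0, 0.2, 0.8])

clean_y, clean_h = forward(clean_x)
corrupt_y, _ = forward(corrupt_x)

# Causal effect of unit i: patch its clean-run activation into the
# corrupted run and measure how much the output changes.
effects = []
for i in range(4):
    patched_y, _ = forward(corrupt_x, patch=(i, clean_h[i]))
    effects.append(abs(patched_y - corrupt_y))

ranking = np.argsort(effects)[::-1]  # most causally important units first
```

Circuit-discovery methods apply the same patch-and-measure logic at the scale of attention heads and MLP blocks rather than single toy units.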

Discussion

The benchmark pipeline works as follows: given a model, we run a suite of diagnostic prompts designed to isolate factual recall, counterfactual resistance, and context grounding. We then apply ConeCUT-style causal tracing to identify which internal circuits activate during correct versus incorrect responses. The resulting circuit signatures form a fingerprint of the model's hallucination profile, and this fingerprint is used to rank models.
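The shape of that pipeline can be sketched as follows. All helper names here (`circuit_effect`, `fingerprint`, `hallucination_risk`) are hypothetical, and the tracing step is stubbed with a deterministic simulation in place of real model internals.

```python
import hashlib
import numpy as np

# The three diagnostic categories named in the pipeline description.
CATEGORIES = ["factual_recall", "counterfactual_resistance", "context_grounding"]

def circuit_effect(model_name, category):
    """Stand-in for the ConeCUT-style tracing step: in the real pipeline
    this would measure how strongly the circuits supporting correct
    behavior activate for this diagnostic category. Simulated here as a
    deterministic per-(model, category) value in [0, 1]."""
    digest = hashlib.sha256(f"{model_name}:{category}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def fingerprint(model_name):
    # One scalar per diagnostic category: the model's circuit fingerprint.
    return np.array([circuit_effect(model_name, c) for c in CATEGORIES])

def hallucination_risk(fp):
    # Weaker supporting circuits imply higher structural risk.
    return float(1.0 - fp.mean())

models = ["model-a", "model-b", "model-c"]
scores = {m: hallucination_risk(fingerprint(m)) for m in models}
ranking = sorted(models, key=scores.get)  # lowest structural risk first
```

In practice the fingerprint would be much higher-dimensional (per-head, per-layer effects rather than one scalar per category), but the flow is the same: probe, trace, fingerprint, rank.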

A key advantage of this approach is predictivity: if internal circuit signatures predict hallucination behavior better than output-level metrics alone, then the benchmark can identify latent hallucination risk in models that have not yet been deployed or tested on the specific failure-mode prompts that matter in production.
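One way a predictivity claim like this could be checked is with a rank correlation between circuit-derived risk scores and hallucination rates observed on held-out prompts. The numbers below are made up for illustration; only the Spearman computation itself is real.

```python
def rank(xs):
    # 0-based ranks of values (assumes no ties).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman rank correlation, no tie correction."""
    n = len(a)
    ra, rb = rank(a), rank(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical values for five models: circuit-derived risk scores
# versus hallucination rates measured on held-out prompts.
circuit_risk = [0.12, 0.45, 0.30, 0.80, 0.55]
observed_rate = [0.05, 0.40, 0.20, 0.60, 0.75]

print(round(spearman(circuit_risk, observed_rate), 3))  # → 0.9
```

A correlation near 1 would mean the circuit fingerprint orders models almost exactly as their deployed behavior does, which is the property the benchmark needs to hold.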

The project is currently in the experimental phase, validating circuit-level predictions against known hallucination behaviors across a set of open-source language models.

Interactive Playground

A demo allowing you to run circuit-level hallucination probes on sample prompts is planned. Check back later.