{
  "request": {
    "slug": "sparse-autoencoders-interpretability",
    "title": "Research sparse autoencoders for interpretability",
    "model": "moonshotai/Kimi-K2.6:novita",
    "input": "Research sparse autoencoders (SAEs) for mechanistic interpretability of language models. In 5 concise bullets, explain the current frontier, the main open problem, and cite 2 relevant Hugging Face paper pages if available. Keep under 350 words.",
    "background": true,
    "max_cost_usd": 3.0,
    "metadata": {
      "example_slug": "sparse-autoencoders-interpretability"
    }
  },
  "response": {
    "id": "resp_0a5e9ee6a94a43eda152c4310d7ddab2",
    "object": "response",
    "created_at": 1781314801,
    "completed_at": 1781314831,
    "status": "completed",
    "model": "moonshotai/Kimi-K2.6:novita",
    "background": true,
    "previous_response_id": null,
    "session_id": "b84ec8fc-8ace-4431-9566-99c401b1cfee",
    "max_cost_usd": 3.0,
    "instructions": null,
    "output": [
      {
        "type": "custom_tool_call",
        "id": "hf_papers_0",
        "name": "hf_papers",
        "input": "{\"operation\": \"search\", \"query\": \"sparse autoencoders mechanistic interpretability language models\", \"limit\": 15}",
        "output": "# Papers matching 'sparse autoencoders mechanistic interpretability language models'\nShowing 15 paper(s)\n\n## 1. The Birth of Knowledge: Emergent Features across Time, Space, and Scale\n  in Large Language Models\n**arxiv_id:** 2505.19440 | **upvotes:** 1\nhttps://huggingface.co/papers/2505.19440\n**Keywords:** large language models, interpretable categorical features, training checkpoints, transformer layers, varying model sizes\n**Summary:** The study examines interpretable categorical features in large language models, using sparse autoencoders to identify semantic concept emergence over time, across layers, and varying sizes, revealing spatial feature reactivation.\n\n## 2. Geospatial Mechanistic Interpretability of Large Language Models\n**arxiv_id:** 2505.03368 | **upvotes:** 13\nhttps://huggingface.co/papers/2505.03368\n**Keywords:** Large Language Models, geospatial mechanistic interpretability, spatial analysis, probing, mechanistic interpretability\n**GitHub:** https://github.com/sdesabbata/geospatial-mechanistic-interpretability (19 stars)\n**Summary:** A framework for understanding how Large Language Models process geographical information using spatial analysis and mechanistic interpretability techniques.\n\n## 3. Sparse Autoencoders Find Highly Interpretable Features in Language\n  Models\n**arxiv_id:** 2309.08600 | **upvotes:** 15\nhttps://huggingface.co/papers/2309.08600\n**Keywords:** polysemanticity, superposition, sparse autoencoders, internal activations, monosemantic\n**GitHub:** https://github.com/hoagyc/sparse_coding (305 stars)\n**Summary:** Using sparse autoencoders, this work identifies and removes overcomplete directions in language model activations, improving interpretability and model editability.\n\n## 4. A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models\n**arxiv_id:** 2503.05613 | **upvotes:** 0\nhttps://huggingface.co/papers/2503.05613\n**Keywords:** Sparse Autoencoders, Large Language Models, mechanistic interpretability, input-based explanation, output-based explanation\n**Summary:** A survey explores the use of Sparse Autoencoders for interpreting Large Language Models, detailing their architecture, explanation methods, evaluation metrics, and applications.\n\n## 5. Interpreting and Steering Protein Language Models through Sparse\n  Autoencoders\n**arxiv_id:** 2502.09135 | **upvotes:** 1\nhttps://huggingface.co/papers/2502.09135\n**Keywords:** sparse autoencoders, protein language models, latent components, protein annotations, transmembrane regions\n**GitHub:** https://github.com/edithvillegas/plm-sae (1 stars)\n**Summary:** Latent components of a protein language model (ESM-2) are analyzed using sparse autoencoders to guide sequence generation towards specific protein characteristics.\n\n## 6. Sparse Autoencoders Enable Scalable and Reliable Circuit Identification\n  in Language Models\n**arxiv_id:** 2405.12522 | **upvotes:** 2\nhttps://huggingface.co/papers/2405.12522\n**Keywords:** sparse autoencoders, discrete sparse autoencoders, attention head outputs, integer codes, indirect object identification\n**Summary:** A new method using discrete sparse autoencoders identifies interpretable circuits in large language models efficiently and robustly, outperforming baselines with reduced computational cost.\n\n## 7. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models\n**arxiv_id:** 2605.11887 | **upvotes:** 13\nhttps://huggingface.co/papers/2605.11887\n**Keywords:** sparse autoencoders, mechanistic interpretability, language models, inference-time steering, evaluation analysis\n**Summary:** Sparse autoencoders (SAEs) developed for Qwen models enable both interpretability and practical model development applications including inference steering, evaluation analysis, data-centric workflows, and post-training optimization.\n\n## 8. Evaluating and Designing Sparse Autoencoders by Approximating\n  Quasi-Orthogonality\n**arxiv_id:** 2503.24277 | **upvotes:** 8\nhttps://huggingface.co/papers/2503.24277\n**Keywords:** sparse autoencoders, $\\ell_0$, $\\ell_2$-norm, feature a\u2026 [truncated 3794 chars]",
        "status": "completed"
      },
      {
        "type": "custom_tool_call",
        "id": "hf_papers_1",
        "name": "hf_papers",
        "input": "{\"operation\": \"read_paper\", \"arxiv_id\": \"2503.05613\"}",
        "output": "# A Survey on Sparse Autoencoders:Interpreting the Internal Mechanisms of Large Language Models\nhttps://arxiv.org/abs/2503.05613\n\n## Abstract\nLarge Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Recently, mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Among various mechanistic interpretability approaches, Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components. This paper presents a comprehensive survey of SAEs for interpreting and understanding the internal workings of LLMs. Our major contributions include: (1) exploring the technical framework of SAEs, covering basic architecture, design improvements, and effective training strategies; (2) examining different approaches to explaining SAE features, categorized into input-based and output-based explanation methods; (3) discussing evaluation methods for assessing SAE performance, covering both structural and functional metrics; and (4) investigating real-world applications of SAEs in understanding and manipulating LLM behaviors.\n\n## Sections\n- **1 Introduction**: Large Language Models (LLMs), such as GPT-4 OpenAI et\u00a0al. ( 2024 ) , Claude-3.5 Anthropic ( 2024 ) , DeepSeek-R1 DeepSeek-AI et\u00a0al. ( 2025 ) , and Grok-3 xAI ( 2025 ) , have emerged as powerful tools in natural language processing, demonstrating remarkable capabilities in tasks r...\n  - **1.1 Contribution and Uniqueness**: Our Contributions. In this paper, we provide a comprehensive overview of SAE for LLM interpretability, with major contributions listed as following: (1) We explore the technical framework of SAEs, including their basic architecture, various design improvements, and effective trai...\n- **2 Technical Framework of SAEs**: 2.1 Basic SAE Framework SAE is a neural network that learns an overcomplete dictionary for representation reconstruction. As shown in Figure 1 a, the input of SAE is the representation of a token from LLMs, which is mapped onto a sparse vector of dictionary activations. Input. Gi...\n  - **2.1 Basic SAE Framework**: SAE is a neural network that learns an overcomplete dictionary for representation reconstruction. As shown in Figure 1 a, the input of SAE is the representation of a token from LLMs, which is mapped onto a sparse vector of dictionary activations.\n\nInput. Given a LLM denoted as f ...\n  - **2.2 Different SAE Variants**: As SAEs continue to emerge as a powerful tool for interpreting the internal representations of LLMs, researchers have increasingly focused on refining and extending their capabilities. Various SAE variants have been proposed to address the limitations of traditional SAEs, each in...\n- **3 Explainability Analysis of SAEs**: This section aims to interpret the learned feature vectors from a trained SAE with natural language.\nSpecifically, given a pre-defined vocabulary set \ud835\udcb1 \\mathcal{V} , the goal of the explainability analysis is to extract a subset of words \u2110 m \u2282 \ud835\udcb1 \\mathcal{I}_{m}\\subset\\mathcal{V} ...\n  - **3.1 Input-based Explanations**: MaxAct. The most straightforward way to collect natural language explanation is by selecting a set of texts whose hidden representation can maximally activate a certain feature vector we are interpreting Bricken et\u00a0al. ( 2023 ); Lee et\u00a0al. ( 2023 ) . Formally, given a large corpu...\n  - **3.2 Output-based Explanations**: VocabProj. Output-based explanations project the learned feature vectors to the output word embeddings of words to compute the activations.\nMathematically, f out \u200b ( w ) : \ud835\udcb1 \u2192 \u211d d f_{\\text{out}}(w):\\mathcal{V}\\rightarrow\\mathbb{R}^{d} denotes the output word embedding layer that ...\n- **4 Evaluation Metrics and Methods**: Evaluating SAEs is inherently challenging due to the absence of ground truth labels. Unlike traditional machine learning tasks where performance can be directly measured against labeled data, t\u2026 [truncated 7909 chars]",
        "status": "completed"
      },
      {
        "type": "custom_tool_call",
        "id": "hf_papers_2",
        "name": "hf_papers",
        "input": "{\"operation\": \"read_paper\", \"arxiv_id\": \"2601.03047\"}",
        "output": "# When the Coffee Feature Activates on Coffins:An Analysis of Feature Extraction and Steering for Mechanistic Interpretability\nhttps://arxiv.org/abs/2601.03047\n\n## Abstract\nRecent work by Anthropic on Mechanistic interpretability claims to understand and control Large Language Models by extracting human-interpretable features from their neural activation patterns using sparse autoencoders (SAEs). If successful, this approach offers one of the most promising routes for human oversight in AI safety. We conduct an initial stress-test of these claims by replicating their main results with open-source SAEs for Llama 3.1.\nWhile we successfully reproduce basic feature extraction and steering capabilities, our investigation suggests that major caution is warranted regarding the generalizability of these claims.\nWe find that feature steering exhibits substantial fragility, with sensitivity to layer selection, steering magnitude, and context. We observe non-standard activation behavior and demonstrate the difficulty to distinguish thematically similar features from one another.\nWhile SAE-based interpretability produces compelling demonstrations in selected cases, current methods often fall short of the systematic reliability required for safety-critical applications.\nThis suggests a necessary shift in focus from prioritizing interpretability of internal representations toward reliable prediction and control of model output.\nOur work contributes to a more nuanced understanding of what mechanistic interpretability has achieved and highlights fundamental challenges for AI safety that remain unresolved.\n\n## Sections\n- **1 Introduction**: Mechanistic interpretability (MI) is a form of explainable AI. It refers to a broad set of recent approaches to understanding, and ultimately controlling, the behavior of Large Language Models (LLMs).\nThe core idea is to construct semantic representations \u2013 so-called \u201cfeatures\u201d \u2013...\n- **2 Background: Mechanistic Interpretability**: Mechanistic interpretability is one approach to providing an understanding of the behavior of AI systems (mostly LLMs) by aiming [\u2026] to completely specify a neural network\u2019s computation, potentially in a format as explicit as pseudocode (also called reverse engineering), striving...\n- **3 Anthropic\u2019s Claims**: Our analysis of SAE-based features will focus on two claims by the Anthropic team that are explicitly laid out in the proof-of-concept work in bricken2023monosemanticity and the follow-up work by templeton2024scaling , who scale the approach up to production-level LLMs.\nBoth clai...\n- **4 Replicating Anthropic\u2019s Results**: Our goal is to critically examine the two claims underlying Anthropic\u2019s recent work on interpretable features in LLMs. As a starting point, we will first reproduce the relevant basic results found in Scaling Monosemanticity [ templeton2024scaling ] on a publicly available model.\n...\n  - **4.1 Replicating Interpretable Feature Extraction and Activation**: As a first step, we show that there exist interpretable features that activate in the appropriate contexts. To do so, we need to find a feature that is similar to the \u201cGolden Gate Bridge\u201d feature in templeton2024scaling : the feature should have a clear-cut meaning (as opposed to...\n  - **4.2 Replicating Feature Activation Specificity**: The next step is to check the specificity of our coffee-feature. We want our feature to activate only in the context that its description mentions. The further away the content of a text is to the feature, the lower its activation should be. While we are not able to test the feat...\n  - **4.3 Replicating Feature Steering in Output**: Finally, we need to check for the influence this feature can exert on the output of the LLM. In order to achieve an effect, a so-called steering coefficient can be used to clamp a specific feature. In that case, the output of the SAE for the clamped feature consists of the baseli...\n- **5 Do the Findings Generalize?**: In Section 4 we showed that we are able to reproduce Anthropic\u2019s basic results. We now turn to the stronger promise ass\u2026 [truncated 7230 chars]",
        "status": "completed"
      },
      {
        "type": "message",
        "id": "msg_3",
        "role": "assistant",
        "status": "completed",
        "content": [
          {
            "type": "output_text",
            "text": "**Sparse Autoencoders (SAEs) for Mechanistic Interpretability: Current State**\n\n- **Current frontier:** SAEs have scaled from proof-of-concept demonstrations to production-level LLMs (e.g., Llama 3.1, Qwen, Gemma), learning overcomplete sparse dictionaries that disentangle superimposed features into human-interpretable concepts. Recent variants include Gated SAEs, k-sparse autoencoders, Matryoshka-SAEs, and discrete SAEs for circuit identification, alongside applications in model steering, evaluation analysis, and multimodal interpretability.\n\n- **Feature extraction works, but\u2026** Replicated studies confirm SAEs can extract semantically meaningful features (e.g., \"coffee,\" \"Golden Gate Bridge\") and steer model outputs by amplifying or attenuating specific feature activations. Open-source SAE suites (LLaMa Scope, Gemma Scope, Pythia SAE) are now widely available.\n\n- **Main open problem: Reliability and faithfulness.** Feature steering exhibits substantial fragility\u2014sensitive to layer selection, steering magnitude, and context. Similar-looking features produce divergent activation patterns; features activate on semantically unrelated inputs (e.g., a \"coffee\" feature firing on \"coffins\"); and automated interpretations often mismatch actual behavior. Reconstruction errors remain poorly understood and can degrade model performance.\n\n- **Interpretation-behavior gap:** There is a critical disconnect between what SAE features appear to represent and how they actually influence model computation. Hyperactive features, spurious activations, and context-dependent behavior mean current methods fall short of the systematic reliability required for safety-critical oversight.\n\n- **Path forward:** The field is shifting focus from merely interpreting internal representations toward ensuring reliable prediction and control of model outputs. Key challenges remain: incomplete concept coverage, lack of theoretical foundations for linear feature representations, and the immense computational cost of training layer-wise SAEs on large models.\n\n**Relevant Hugging Face Papers:**\n1. [A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models](https://huggingface.co/papers/2503.05613) \u2014 Comprehensive survey of SAE architectures, explanation methods, evaluation metrics, and applications.\n2. [When the Coffee Feature Activates on Coffins: An Analysis of Feature Extraction and Steering for Mechanistic Interpretability](https://huggingface.co/papers/2601.03047) \u2014 Stress-test revealing fragility in feature steering and the interpretation-behavior gap in open-source SAEs."
          }
        ]
      }
    ],
    "error": null,
    "incomplete_details": null,
    "usage": {
      "session_id": "b84ec8fc-8ace-4431-9566-99c401b1cfee",
      "total_usd": 0.0,
      "inference_usd": 0.0,
      "hf_jobs_estimated_usd": 0.0,
      "sandbox_estimated_usd": 0.0,
      "llm_calls": 4,
      "hf_jobs_count": 0,
      "sandbox_count": 0,
      "prompt_tokens": 72069,
      "completion_tokens": 868,
      "cache_read_tokens": 49664,
      "cache_creation_tokens": 0,
      "total_tokens": 72937,
      "hf_jobs_billable_seconds_estimate": 0,
      "sandbox_billable_seconds_estimate": 0
    },
    "artifacts": [],
    "metadata": {
      "example_slug": "sparse-autoencoders-interpretability"
    }
  }
}