
Accelerating AI Interpretability To Promote U.S. Technological Leadership
The most advanced AI systems remain ‘black boxes’ whose inner workings even their developers cannot fully understand, leading to issues with reliability and trustworthiness. However, as AI systems become more capable, there is a growing desire to deploy them in high-stakes scenarios. The bipartisan National Security Commission on AI cautioned that AI systems perceived as unreliable or unpredictable will ‘stall out’: leaders will not adopt them, operators will mistrust them, Congress will not fund them, and the public will not support them (NSCAI, Final Report, 2021). AI interpretability research—the science of opening these black boxes and attempting to comprehend why they do what they do—could turn opacity into understanding and enable wider AI adoption.
With AI capabilities racing ahead, the United States should accelerate interpretability research now to keep its technological edge and pursue high-stakes AI deployment with justified confidence. This memorandum describes three policy recommendations that could help the United States seize the moment and maintain a lead on AI interpretability: (1) creatively investing in interpretability research, (2) entering into research and development agreements between interpretability experts and government agencies and laboratories, and (3) prioritizing interpretable AI in federal procurement.
Challenge and Opportunity
AI capabilities are progressing rapidly. According to many frontier AI companies’ CEOs and independent researchers, AI systems could reach general-purpose capabilities that equal or even surpass humans within the next decade. As capabilities progress, there is a growing desire to incorporate these systems into high-stakes use cases, from military and intelligence uses (DARPA, 2025; Ewbank, 2024) to key sectors of the economy (AI for American Industry, 2025).
However, the most advanced AI systems are still ‘black boxes’ (Sharkey et al., 2024) that we observe from the outside and that we ‘grow,’ more than we ‘build’ (Olah, 2024). Because our comprehension of the inner workings of neural networks is so limited, we do not fully understand what happens inside these black boxes, leaving uncertainty about their safety and reliability. This could have far-reaching consequences. As the 2021 final report of the National Security Commission on AI (NSCAI) highlighted, “[i]f AI systems routinely do not work as designed or are unpredictable in ways that can have significant negative consequences, then leaders will not adopt them, operators will not use them, Congress will not fund them, and the American people will not support them” (NSCAI, Final Report, 2021). In other words, if AI systems are not always reliable and secure, their adoption could be inhibited or limited, especially in high-stakes scenarios, potentially compromising the AI leadership and national security goals outlined in the Trump administration’s agenda (Executive Order, 2025).
AI interpretability is a subfield of AI safety that is specifically concerned with opening and peeking inside the black box to comprehend “why AI systems do what they do, and … put this into human-understandable terms” (Nanda, 2024; Sharkey et al., 2025). In other words, interpretability is the AI equivalent of an MRI (Amodei, 2025) because it attempts to provide observers with an understandable image of the hidden internal processes of AI systems.
The Challenge of Understanding AI Systems Before They Reach or Even Surpass Human-Level Capabilities
Recent years have brought breakthroughs across several research areas focused on making AI more trustworthy and reliable, including in AI interpretability. Among other efforts, the same companies developing the most advanced AI systems have designed systems that are easier to understand and have reached new research milestones (Marks et al., 2025; Lindsey et al., 2025; Lieberum et al., 2024; Kramar et al., 2024; Gao et al., 2024; Tillman & Mossing, 2025).
AI interpretability, however, is still trailing behind raw AI capabilities. AI companies project that it could take 5–10 years to reliably understand model internals (Amodei, 2025), while experts expect systems exhibiting human‑level general-purpose capabilities as early as 2027 (Kokotajlo et al., 2025). That gap will force policymakers into a difficult corner once AI systems reach such capabilities: deploy unprecedentedly powerful yet opaque systems, or slow deployment and fall behind. Unless interpretability accelerates, the United States risks forfeiting both competitive and security advantages.
The Challenge of Trusting Today’s Systems for High-Stakes Applications
We must understand the inner workings of highly advanced AI systems before they reach human or above-human general-purpose capabilities, especially if we want to trust them in high-stakes scenarios. Current AI systems are not always reliable and secure, for several reasons. First, AI systems inherit the blind spots of their training data. When the world changes—alliances shift, governments fall, regulations update—systems still reason from outdated facts, undermining reliability in high-stakes diplomatic or military settings (Jensen et al., 2025).
Second, AI systems are unusually easy to strip‑mine for memorized secrets, especially if these secrets come as uncommon word combinations (e.g., proprietary blueprints). Data‑extraction attacks are now “practical and highly realistic” and will grow even more effective as system size increases (Carlini et al., 2021; Nasr et al., 2023; Li et al., 2025). The result could be wholesale leakage of classified or proprietary information (DON, 2023).
Third, cleverly crafted prompts can still jailbreak cutting‑edge systems, bypassing safety rails and exposing embedded hazardous knowledge (Hughes et al., 2024; Ramesh et al., 2024). With attack success rates remaining uncomfortably high across even the leading systems, adversaries could manipulate AI systems with these vulnerabilities in real‑time national security scenarios (Caballero & Jenkins, 2024).
This is not a comprehensive list. Systems could exhibit vulnerabilities in high-stakes applications for many other reasons. For instance, AI systems could be misaligned and engage in scheming behavior (Meinke et al., 2024; Phuong et al., 2025) or have baked-in backdoors that an attacker could exploit (Hubinger et al., 2024; Davidson et al., 2025).
The Opportunity to Promote AI Leadership Through Interpretability
Interpretability offers an opportunity to address the challenges described above and reduce barriers to the safe adoption of the most advanced AI systems, thereby promoting innovation and widening the advantages those systems hold over adversaries’ systems. In this sense, accelerating interpretability could help promote and secure U.S. AI leadership (Bau et al., 2025; IFP, 2025). For example, by helping ensure that highly advanced AI systems are deployed safely in high-stakes scenarios, interpretability could improve national security and help mitigate the risk of state and non-state adversaries using AI capabilities against the United States (NSCAI, Final Report, 2021). Interpretability could therefore serve as a front‑line defense against vulnerabilities in today’s most advanced AI systems.
Making future AI systems safe and trustworthy could become easier the more we understand how they work (Shah et al., 2025). Anthropic’s CEO recently endorsed the importance and urgency of interpretability, noting that “every advance in interpretability quantitatively increases our ability to look inside models and diagnose their problems” (Amodei, 2025). Interpretability thus not only enhances the reliability of today’s AI systems at deployment; a deeper understanding of how AI systems work could also lead to breakthroughs in designing more targeted systems and in monitoring deployed systems more robustly. This could in turn enable the United States to deploy tomorrow’s human-level or above-human general-purpose AI systems with increased confidence, securing strategic advantages when engaging geopolitically. The following revisits the vulnerabilities discussed above to demonstrate three ways in which interpretability could improve the reliability of today’s AI systems when deployed in high-stakes scenarios.
First, interpretability could help systems selectively update outdated information through model editing, without risking a reduction in performance. Model editing allows us to selectively inject new facts or fix mistakes (Cohen et al., 2023; Hase et al., 2024) by making targeted edits to a model’s internals rather than retraining the entire model. However, this ‘surgical tool’ has shown ‘side effects’ causing performance degradation (Gu et al., 2024; Gupta et al., 2024). Interpretability could help us understand how stored knowledge is encoded in a model’s parameters, as well as develop stronger measures of memorization (Yao et al., 2023; Carlini et al., 2019), enabling us to ‘incise and excise’ AI models with fewer side effects.
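To make the mechanics concrete, the following is a minimal Python sketch of a rank-one, locate-then-edit style update on a toy linear layer, in the spirit of model-editing methods such as ROME; the layer, dimensions, and vectors are illustrative stand-ins rather than the internals of any production model.

```python
# Minimal sketch of rank-one model editing on a toy "fact recall" layer.
# Assumptions: a single linear map stands in for a transformer MLP projection;
# k_subject encodes the outdated fact, v_new is the corrected output we want.
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d) / d ** 0.5            # toy weight matrix mapping keys to values

k_subject = torch.randn(d)                  # activation pattern for the fact to update
v_new = torch.randn(d)                      # corrected value the layer should now produce

# Rank-one update: change W only along the direction of k_subject so that
# W_edited @ k_subject == v_new, leaving (near-)orthogonal keys mostly untouched.
residual = v_new - W @ k_subject
W_edited = W + torch.outer(residual, k_subject) / (k_subject @ k_subject)

print(torch.allclose(W_edited @ k_subject, v_new, atol=1e-4))   # True: the targeted fact is updated
k_other = torch.randn(d)                                        # an unrelated key
rel_change = (W_edited @ k_other - W @ k_other).norm() / (W @ k_other).norm()
print(float(rel_change))                                        # well below 1: limited side effects
```

The point of the sketch is the trade-off the paragraph describes: the targeted association changes exactly, while unrelated directions shift only slightly, and interpretability research aims to shrink those residual side effects further.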
Second, interpretability could help systems selectively forget training data through machine unlearning, once again without losing performance. Machine unlearning allows systems to forget specific data classes (such as memorized secrets or hazardous knowledge) while remembering the rest (Tarun et al., 2023). Like model editing, this ‘surgical tool’ suffers from performance degradation. Interpretability could help develop new unlearning techniques that preserve performance (Guo et al., 2024; Belrose et al., 2023; Zou et al., 2024).
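As an illustration of the basic recipe, the sketch below applies one common unlearning approach (gradient ascent on a ‘forget’ set combined with continued training on a ‘retain’ set) to a small synthetic classification task; the data, architecture, and hyperparameters are hypothetical and chosen only to show the mechanism and its retain/forget trade-off.

```python
# Minimal sketch of gradient-ascent unlearning on synthetic, class-separable data.
# Class 3 stands in for knowledge the model should forget; classes 0-2 are retained.
# Data, architecture, and hyperparameters are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()

means = torch.randn(4, 16) * 3.0                     # each class clusters around its own mean
y = torch.randint(0, 4, (512,))
x = means[y] + torch.randn(512, 16)
retain, forget = y != 3, y == 3

# Phase 1: ordinary training on everything (the model "learns" the sensitive class too).
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Phase 2: unlearning -- descend on the retain set while ascending on the forget set.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x[retain]), y[retain]) - 0.5 * loss_fn(model(x[forget]), y[forget])
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = lambda mask: (model(x[mask]).argmax(-1) == y[mask]).float().mean().item()
    print(f"retain acc: {acc(retain):.2f}  forget acc: {acc(forget):.2f}")  # retain stays high, forget drops
```

In real systems the forgetting target is far less cleanly separated from retained knowledge, which is why naive recipes like this one degrade performance and why interpretability-informed unlearning techniques are being pursued.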
Third, interpretability could help effectively block jailbreak attempts, which currently can only be discovered empirically (Amodei, 2025). Interpretability could lead to a breakthrough in understanding models’ persistent vulnerability to jailbreaking by allowing us to characterize dangerous knowledge. Existing interpretability research has already analyzed how AI models process harmful prompts internally (He et al., 2024; Ball et al., 2024; Lin et al., 2024; Zhou et al., 2024), and additional research could build on these initial findings.
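One concrete technique used in this line of research is fitting a simple linear probe, or difference-of-means direction, that separates a model’s internal activations on harmful versus benign prompts; the sketch below simulates such activations with labeled random vectors, since a real study would extract them from an actual model.

```python
# Minimal sketch of a difference-of-means probe for a "harmfulness" concept in
# activation space. Real work would use activations extracted from an actual model
# on harmful vs. benign prompts; here those activations are simulated.
import numpy as np

rng = np.random.default_rng(0)
d = 128
concept = rng.normal(size=d)                              # pretend harmfulness direction
benign = rng.normal(size=(500, d))                        # simulated activations on benign prompts
harmful = rng.normal(size=(500, d)) + 1.5 * concept       # simulated activations on harmful prompts

# Difference-of-means direction: a simple, widely used linear probe for a concept.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score held-out activations by projecting onto the direction and thresholding at the midpoint.
threshold = 0.5 * ((harmful @ direction).mean() + (benign @ direction).mean())
test = np.vstack([rng.normal(size=(100, d)), rng.normal(size=(100, d)) + 1.5 * concept])
labels = np.array([0] * 100 + [1] * 100)
preds = (test @ direction > threshold).astype(int)
print("probe accuracy:", (preds == labels).mean())        # near 1.0 on this toy data
```

A probe of this kind illustrates how characterizing dangerous knowledge inside a model could support runtime monitoring for jailbreak attempts, rather than relying solely on empirical red teaming of inputs and outputs.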
The conditions are ripe to promote technological leadership and national security through interpretability. Many of the problems highlighted in the 2019 National AI R&D Strategic Plan persist in its 2023 update, echoing those included in NSCAI’s 2021 final report, and relatively little progress has been made in addressing them. AI systems are still vulnerable to attacks (NSCAI, Final Report, 2021) and can still “be made to do the wrong thing, reveal the wrong thing” and “be easily fooled, evaded, and misled in ways that can have profound security implications” (National AI R&D Strategic Plan, 2019). The field of interpretability is gaining momentum among AI companies (Amodei, 2025; Shah et al., 2025; Goodfire, 2025) and AI researchers (IFP, 2025; Bau et al., 2025; FAS, 2025).
To be sure, despite recent progress, interpretability remains challenging and has attracted some skepticism (Hendrycks & Hiscott, 2025). Accordingly, a strong AI safety strategy must include many components beyond interpretability, including robust AI evaluations (Apollo Research, 2025) and control measures (Redwood Research, 2025).
Plan of Action
The United States has an opportunity to seize the moment and lead an acceleration of AI interpretability. The following three recommendations establish a strategy for how the United States could promptly incentivize AI interpretability research.
Recommendation 1. The federal government should prioritize and invest in foundational AI interpretability research, which would include identifying interpretability as a ‘strategic priority’ in the 2025 update of the National AI R&D Strategic Plan.
The National Science and Technology Council (NSTC) should identify AI interpretability as a ‘strategic priority’ in the upcoming National AI R&D Strategic Plan. Congress should then appropriate federal R&D funding for federal agencies (including DARPA and the NSF) to catalyze and support AI interpretability acceleration through various mechanisms, including grants and prizes, R&D credits, tax credits, advanced market commitments, and buyer-of-first-resort mechanisms.
This first recommendation echoes the 2019 update of the National AI R&D Strategic Plan and NSCAI’s 2021 final report, both of which recommended allocating more federal R&D investment to advance the interpretability of AI systems (NSCAI, Final Report, 2021; National AI R&D Strategic Plan, 2019), as well as more recent remarks by the Director of the Office of Science and Technology Policy (OSTP), who argued that we need creative R&D funding approaches to enable scientists and engineers to create new theories and put them into practice (OSTP Director’s Remarks, 2025). This recommendation is also in line with calls from AI companies, asserting that “we still need significant investment in ‘basic science’” (Shah et al., 2025).
The United States could incentivize and support AI interpretability work through various approaches. In addition to prize competitions, advanced market commitments, fast and flexible grants (OSTP Director’s Remarks, 2025; Institute for Progress, 2025), and challenge-based acquisition programs (Institute for Progress, 2025), funding mechanisms could include R&D tax credits for AI companies undertaking or investing in interpretability research, and tax credits to adopters of interpretable AI, such as downstream deployers. If the federal government acts as “an early adopter and avid promoter of American technology” (OSTP Director’s Remarks, 2025), federal agencies could also rely on buyer-of-first-resort mechanisms for interpretability platforms.
These strategies may require developing a clearer understanding of which frontier AI companies undertake sufficient interpretability efforts when developing their most advanced systems, and which companies currently do not. Requiring AI companies to disclose how they use interpretability to test models before release (Amodei, 2025) could be helpful, but might not be enough to devise a ‘ranking’ of interpretability efforts. While potentially premature given the state of the art in interpretability, an option could be to start developing standardized metrics and benchmarks to evaluate interpretability (Mueller et al., 2025; Stephenson et al., 2025). This task could be carried out by the National Institute of Standards and Technology (NIST), within which some AI researchers have recommended creating an AI Interpretability and Control Standards Working Group (Bau et al., 2025).
A concrete way to operationalize this first recommendation would be for the NSTC to include interpretability as a “strategic priority” in the 2025 update of the National AI R&D Strategic Plan (RFI, 2025). These “strategic priorities” seek to target and focus AI innovation for the next 3–5 years, paying particular attention to areas of “high-risk, high-reward AI research” that industry is unlikely to address because it may not provide immediate commercial returns (RFI, 2025). If interpretability were included as a “strategic priority,” the Office of Management and Budget (OMB) could instruct agencies to align their budgets with the 2025 National AI R&D Strategic Plan priorities in its memorandum to executive department heads. Relevant agencies, including DARPA and the National Science Foundation (NSF), would then develop their budget requests for Congress, aligning them with the 2025 National AI R&D Strategic Plan and the OMB memorandum. After Congress reviews these proposals and appropriates funding, agencies could launch initiatives that incentivize interpretability work, including grants and prizes, R&D credits, tax credits, advanced market commitments, and buyer-of-first-resort mechanisms.
Recommendation 2. The federal government should enter into research and development agreements with AI companies and interpretability research organizations to red team AI systems applied in high-stakes scenarios and conduct targeted interpretability research.
AI companies, interpretability organizations, and federal agencies and laboratories (such as DARPA, the NSF, and the U.S. Center for AI Standards and Innovation) should enter into research and development agreements to pursue targeted AI interpretability research to solve national security vulnerabilities identified through security-focused red teaming.
This second recommendation reflects the fact that the federal government possesses unique expertise and knowledge in national security issues to support national security testing and evaluation (FMF, 2025). Federal agencies and laboratories (such as DARPA, the NSF, and the U.S. Center for AI Standards and Innovation), frontier AI companies, and interpretability organizations could enter into research and development agreements to undertake red teaming of national security vulnerabilities (as, for instance, SABER, which aims to assess AI-enabled battlefield systems for the DoD; SABER, 2025) and to provide state-of-the-art interpretability platforms to patch the revealed vulnerabilities. In the future, AI companies could also apply the most advanced AI systems to support interpretability research itself.
Recommendation 3. The federal government should prioritize interpretable AI in federal procurement, especially for high-stakes applications.
If federal agencies are procuring highly advanced AI for high-stakes scenarios and national security missions, they should preferentially procure interpretable AI systems. This preference could be accounted for by weighing the lack of understanding of an AI system’s inner workings when calculating cost.
This third and final recommendation is intended for the interim period in which interpretable AI systems will coexist along a ‘gradient of interpretability’ with other, less interpretable systems. In that scenario, agencies procuring AI systems should give preference to systems that are more interpretable. One way to account for this preference would be to weigh the potential vulnerabilities of uninterpretable AI systems when calculating costs during federal acquisition analyses. This recommendation also requires a defined ‘ranking’ of interpretability efforts. While defining such a ranking is currently challenging, the research outlined in Recommendations 1 and 2 could better position the government to measure and rank the interpretability of different AI systems.
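Purely as an illustration of how such a weighting could work, the sketch below folds a hypothetical interpretability ‘risk premium’ into an evaluated-cost comparison; the scores, weight, and dollar figures are invented for the example and do not reflect any established federal acquisition formula.

```python
# Purely illustrative: folding an interpretability "risk premium" into evaluated cost.
# The scoring scale, weight, and bids are hypothetical, not an established federal formula.
def evaluated_cost(bid_price: float, interpretability_score: float, risk_weight: float = 0.3) -> float:
    """interpretability_score in [0, 1]; lower scores inflate the evaluated cost."""
    return bid_price * (1 + risk_weight * (1 - interpretability_score))

bids = {
    "System A (more interpretable)": (10_000_000, 0.8),
    "System B (more opaque)": (9_000_000, 0.2),
}
for name, (price, score) in bids.items():
    print(f"{name}: evaluated cost ${evaluated_cost(price, score):,.0f}")
# System A evaluates to $10,600,000 and System B to $11,160,000: the nominally
# cheaper but more opaque system loses once the opacity penalty is applied.
```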
Conclusion
Now is the time for the United States to take action and lead the charge on AI interpretability research. While research is never guaranteed to lead to desired outcomes or to solve persistent problems, the potential high reward—understanding and trusting future AI systems and making today’s systems more robust to adversarial attacks—justifies this investment. Not only could AI interpretability make AI safer and more secure, but it could also establish justified confidence in the prompt adoption of future systems that are as capable as or even more capable than humans, and enable the deployment of today’s most advanced AI systems to high-stakes scenarios, thus promoting AI leadership and national security. With this goal in mind, this policy memorandum recommends that the United States, through the relevant federal agencies and laboratories (including DARPA, the NSF, and the U.S. Center for AI Standards and Innovation), invest in interpretability research, form research and development agreements to red team high-stakes AI systems and undertake targeted interpretability research, and prioritize interpretable AI systems in federal acquisitions.
Acknowledgments
I wish to thank Oliver Stephenson, Dan Braun, Lee Sharkey, and Lucius Bushnaq for their ideas, comments, and feedback on this memorandum.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.