Moving Beyond Pilot Programs to Codify and Expand Continuous AI Benchmarking in Testing and Evaluation
Rapid and advanced AI integration and diffusion within the Department of Defense (DoD) and other government agencies has emerged as a critical national security priority. This convergence of rapid AI advancement and DoD prioritization creates an urgent need to ensure that AI models integrated into defense operations are reliable, safe, and mission-enhancing. For this purpose, the DoD must deploy and expand one of its most critical tools available within its Testing and Evaluation (T&E) process: benchmarking—the structured practice of applying shared tasks and metrics to compare models, track progress, and expose performance gaps.
A standardized AI benchmarking framework is critical for delivering uniform, mission-aligned evaluations across the DoD. Despite their importance, the DoD currently lacks standardized, enforceable AI safety benchmarks, especially for open-ended or adaptive use cases. A shift from ad hoc to structured assessments will support more informed, trusted, and effective procurement decisions.
Particularly at the acquisition stage for AI models, rapid DoD acquisition platforms such as Tradewinds can serve as the policy vehicle for enabling more robust benchmarking efforts. This can be done with the establishment of a federally coordinated benchmarking hub, spearheaded by a coordinated effort between the Chief Data and Artificial Intelligence Officer (CDAO) and Defense Innovation Unit (DIU) in consultation with the newly established Chief AI Officer’s Council (CAIOC) of the White House Office of Management and Budget (OMB).
Challenge and Opportunity
Experts at the intersection of both AI and defense, such as the retired Lieutenant General John (Jack) N.T. Shanahan, have emphasized the profound impact of AI on the way the United States will fight future wars – with the character of war continuously reshaped by AI’s diffusion across all domains. The DoD is committed to remaining at the forefront of these changes: between 2022-2023, the value of federal AI contracts increased by over 1200%, with the surge driven by increases in DoD spending. Secretary of Defense Pete Hegseth has pledged increased investment in AI specifically for military modernization efforts, and has tasked the Army to implement AI in command and control across the theater, corps, and division headquarters by 2027–further underscoring AI’s transformative impact on modern warfare.
Strategic competitors—especially the People’s Republic of China—are rapidly integrating AI into their military and technological systems. The Chinese Communist Party views AI-enabled science and technology as central to accelerating military modernization and achieving global leadership. At this pivotal moment, the DoD is pushing to adopt advanced AI across operations to preserve the U.S. edge in military and national security applications. Yet, accelerating too quickly without proper safeguards risks exposing vulnerabilities adversaries could exploit.
With the DoD at a unique inflection point, it must balance the rapid adoption and integration of AI into its operations with the need for oversight and safety. DoD needs AI systems that consistently meet clearly defined performance standards set by acquisition authorities, operate strictly within the scope of their intended use, and do not exhibit unanticipated or erratic behaviors under operational conditions. These systems can deliver measurable value to mission outcomes while fostering trust and confidence among human operators through predictability, transparency, and alignment with mission-specific requirements.
AI benchmarks are standardized tasks and metrics that systematically measure a model’s performance, reliability, and safety, and have increasingly been adopted as a key measurement tool by the AI industry. Currently, DoD lacks standardized, comprehensive AI safety benchmarks, especially for open-ended or adaptive use cases. Without these benchmarks, the DoD risks acquiring models that underperform, deviate from mission requirements, or introduce avoidable vulnerabilities, leading to increased operational risk, reduced mission effectiveness, and costly contract revisions.
A recent report from the Center for a New American Security (CNAS) on best practices for AI T&E outlined that the rapid and unpredictable pace of AI advancement presents distinctive challenges for both policymakers and end-users. The accelerating pace of adoption and innovation heightens both the urgency and complexity of establishing effective AI benchmarks to ensure acquired models meet the mission-specific performance standards required by the DoD and the services.
The DoD faces particularly outsized risk, as its unique operational demands can expose AI models to extreme conditions where performance may degrade. For example, under adversarial conditions, or when encountering data that is different from its training, an AI model may behave unpredictably, posing heightened risk to the mission. Robust evaluations, such as those offered through benchmarking, help to identify points of failure or harmful model capabilities before they become apparent during critical use cases. By measuring model performance in real-world applicable scenarios and environments, we increase understanding of attack surface vulnerabilities to adversarial inputs. We can identify inaccurate or over-confident measurements of outputs, and recognize potential failures in edge cases and extreme scenarios (including those beyond training parameters, Moreover, we improve human-AI performance and trust factors, and avoid unintended capabilities. Benchmarking helps to surface these issues early.
Robust AI benchmarking frameworks can enhance U.S. leadership by shaping international norms for military AI safety, improving acquisition efficiency by screening out underperforming systems, and surfacing unintended or high-risk model behaviors before deployment. Furthermore, benchmarking enables AI performance to be quantified in alignment with mission needs, using guidance from the CDAO RAI Toolkit and clear acquisition parameters to support decision-making for both procurement officers and warfighters. Given the DoD’s high-risk use cases and unique mission requirements, robust benchmarking is even more essential than in the commercial sector.
The DoD now has an opportunity to formalize AI safety benchmark frameworks within its Testing and Evaluation (T&E) processes, tailored to both dual-use and defense-specific applications. T&E is already embedded in DoD culture, offering a strong foundation for expanding benchmarking. Public-private AI testing initiatives, such as the DoD collaboration with Scale AI to create effective T&E (including through benchmarking) for AI models show promise and existing motivation for such initiatives. Yet, critical policy gaps still exist. With pilot programs underway, the DoD can move beyond vendor-led or ad hoc evaluations to introduce DoD-led testing, assess mission-specific capabilities, launch post-acquisition benchmarking, and develop human-AI team metrics. The widely used Tradewinds platform offers an existing vehicle to integrate these enhanced benchmarks without reinventing the wheel.
To implement robust benchmarking at DoD, this memo proposes the following policy recommendations, to be coordinated by DoD Chief Digital and Artificial Intelligence Office (CDAO):
- Expanding on existing benchmarking efforts
- Standardizing AI safety thresholds during the procurement cycle
- Implementing benchmarking during the lifecycle of the model
- Establishing a benchmarking repository
- Enabling adversarial stress testing, or “red-teaming”, prior to deployment to enhance current benchmarking gaps for DoD AI use cases
Plan of Action
The CDAO should launch a formalized AI Benchmarking Initiative, moving beyond current vendor-led pilot programs, while continuing to refine its private industry initiatives. This effort should be comprehensive and collaborative in nature, leveraging internal technical expertise. This includes the newly established coordinating bodies on AI such as the Chief AI Officer’s Council, which can help to ensure that DoD benchmarking practices are aligned with federal priorities, and the Defense Innovation Unit, which can be an excellent private industry-national defense sector bridge and coordinator in these efforts. Specifically, the CDAO should integrate benchmarking into the acquisition pipeline. This will establish ongoing benchmarking practices that facilitate continuous model performance evaluation through the entirety of the model lifecycle.
Policy Recommendations
Recommendation 1. Establish a Standardized Defense AI Benchmarking Initiative and create a Centralized Repository of Benchmarks
The DoD should build on lessons learned from its partnership with Scale AI (and others) developing benchmarks specifically for defense use cases. This should expand into a standardized, agency-wide framework.
This recommendation is in line with findings outlined by RAND, which calls for developing a comprehensive framework for robust evaluation and emphasizes the need for collaborative practices, and measurable performance metrics for model performance.
The DoD should incorporate the following recommendations and government entities to achieve this goal:
Develop a Whole-of-Government Approach to AI Benchmarking
- Develop and expand on existing pilot benchmarking frameworks, similar to Massive Multitask Language Understanding (MMLU) but tailored to military-relevant tasks and DoD-specific use cases.
- Expand the $10 million T&E and research budget by $10 million, with allocations specifically for bolstering internal benchmarking capabilities. One crucial piece is identifying and recruiting technically capable talent to aid in developing internal benchmarking guidelines. As AI models advance, new “reasoning” models with advanced capabilities become far costlier to benchmark, and the DoD must plan for these future demands now. Part of this allocation can come from the $500 million allocated for the combatant command AI budgets. This monetary allocation is critical to successfully implementing this policy because model benchmarking for more advanced models – such as OpenAI’s GPT-3 – can cost millions. This modest budgetary increase is a starting point for moving beyond piecemeal and ad hoc benchmarking, to a comprehensive and standardized process. This funding increases would facilitate:
- Development of and expansion of internal and customized benchmarking capabilities
- Recruitment and retention of technical talent
- Development of simulation environment for more mission-relevant benchmarks
If internal reallocations from the $500 million allocation proves insufficient or unviable, Congressional approval for additional funds can be another funding source. Given the strategic importance of AI in defense, such requests can readily find bipartisan support, particularly when tied to operational success and risk mitigation.
- Create a centralized AI benchmarking repository under the CDAO. This will standardize categories, performance metrics, mission alignment, and lessons learned across defense-specific use cases. This repository will enable consistent tracking of model performance over time, support analysis across model iterations, and allow for benchmarking transferability across similar operational scenarios. By compiling performance data at scale, the repository will also help identify interoperability risks and system-level vulnerabilities—particularly how different AI models may behave when integrated—thereby enhancing the DoD’s ability to assess, document, and mitigate potential performance and safety failures.
- Convene a partnership, organized by OMB, between the CDAO, the DIU and the CAIOC, to jointly establish and maintain a centralized benchmarking repository. While many CAIOC members represent civilian agencies, their involvement is crucial: numerous departments (such as the Department of Homeland Security, the Department of Energy, and the National Institute of Standards and Technology) are already employing AI in high-stakes contexts and bring relevant technical expertise, safety frameworks, and risk management policies. Incorporating these perspectives ensures that DoD benchmarking practices are not developed in isolation but reflect best practices across the federal government. This partnership will leverage the DIU’s insights on emerging private-sector technologies, the CDAO’s acquisition and policy authorities, and CAIOC’s alignment with broader executive branch priorities, thereby ensuring that benchmarking practices are technically sound, risk-informed, and consistent with government-wide standards and priorities for trustworthy, safe, and reliable AI.
Recommendation 2. Formalize Pre-Deployment Benchmarking for AI Models at the Acquisition Stage
The key to meaningful benchmarking lies in integrating it at the pre-award stage of procurement. The DoD should establish a formal process that:
- Integrates benchmarking into existing AI acquisition platforms, such as Tradewinds, and embeds it within the T&E process.
- Requires participation from third-party vendors in benchmarking the products they propose for DoD acquisition and use.
- Embeds internal adversarial stress testing, or “red-teaming”, into AI benchmarking ensures more realistic, mission-aligned evaluations that account for adversarial threats and the unique, high-risk operating environments the military faces. By leveraging its internal expertise in mission context, classified threat models, and domain-specific edge cases that external vendors are unlikely to fully replicate, the DoD can produce a more comprehensive and defense-relevant assessment of AI system safety, efficacy, and suitability for deployment. Specifically, this policy memo recommends that the AI Rapid Capabilities Cell (AI RCC) be tasked with carrying out the red-teaming, as a technically qualified element of the CDAO.
- Assures procurement officers understand the value of incorporating benchmarking performance metrics into their contract award decision-making. This can be done by hosting benchmarking workshops for procurement officers, which outline the benchmarking results for model performance for various models in the acquisition pipeline and to guide them on how to apply these metrics to their own performance requirements and guidelines.
Recommendation 3. Contextualize Benchmarking into Operational Environments
Current efforts to scale and integrate AI reflect the distinct operational realities of the DoD and military services. Scale AI, in partnership with the DoD, Anduril, Microsoft, and the CDAO, is developing AI-powered solutions which are focused on the United States Indo-Pacific Command (INDOPACOM) and United States European Command (EUCOM). With these regional command focused AI solutions, it makes sense to create equally focused benchmarking standards to test AI model performance in specific environments and under unique and focused conditions. In fact, researchers have been identifying the limits of traditional AI benchmarking and making the case for bespoke, holistic, and use-case relevant benchmark development. This is vital because as AI models advance, they introduce entirely new capabilities which require more robust testing and evaluation. For example, large language models, which have introduced new functionalities including natural language querying or multimodal search interfaces, require entirely new benchmarks that measure: natural language understanding, modal integration accuracy, context retention, and result usefulness. In the same vein, DoD relevant benchmarks must be developed in an operationally-relevant context. This can be achieved by:
- Developing simulation environments for benchmarking that are mission-specific across a broader set of domains, including technical and regional commands, to test AI models under specific conditions which are likely to be encountered by users in unique, contested, and/or adversarial environments. The Bipartisan House Task Force on Artificial Intelligence report provides useful guidance on AI model functionality, reliability, and safety in operating in contested, denied, and degraded environments.
- Prioritizing use-case-specific benchmarks over broad commercial metrics by incorporating user feedback and identifying tailored risk scenarios that more accurately measure model performance.
- Introducing context relevant benchmarks to measure performance in specific, DoD-relevant scenarios, such as:
- Task-specific accuracy (i.e. correct ID in satellite imagery cases)
- Alignment with context-specific rules of engagement
- Instances of degraded performance under high-stress conditions
- Susceptibility to adversarial manipulation (i.e. data poisoning)
- Latency in high-risk, fast-paced decision-making scenarios
- Creating post-deployment benchmarking to ensure ongoing performance and risk compliance, and to detect and address issues like model drift, a phenomenon where model performance degrades over time. As there is no established consensus on how often continuous model benchmarking should be performed, the DoD should study the appropriate practical, risk-informed timelines for re-evaluating deployed systems.
Frameworks such as Holistic Evaluation of Language Models (HELM) and Focused LLM Ability Skills and Knowledge (FLASK) can offer valuable guidance for developing LLM-focused benchmarks within the DoD, by enabling more comprehensive evaluations based on specific model skill sets, use-case scenarios, and tailored performance metrics.
Recommendation 4. Integration of Human-in-the-Loop Benchmarking
An additional layer of AI benchmarking for safe and effective AI diffusion into the DoD ecosystem is evaluating AI-human team performance, and measuring user trust, perceptions and confidence in various AI models. “Human‑in‑the‑loop” systems require a person to approve or adjust the AI’s decision before action, while “human‑on‑the‑loop” systems allow autonomous operation but keep a person supervising and ready to intervene. Both “Human in the loop” and “Human on the loop” are critical components of the DoD and military approach to AI. Both require continued human oversight of ethical and safety considerations over AI-enabled capabilities with national security implications. A recent study by MIT study found that there are surprising performance gaps between AI only, human only, and AI-human teams. For the DoD particularly, it is important to effectively measure these performance gaps across the various AI models it plans to integrate into its operations due to heavy reliance on user-AI teams.
A CNAS report on effective T&E for AI spotlighted the DARPA Air Combat Evolution (ACE) program, which sought autonomous air‑combat agents needing minimal human intervention. Expert test pilots could override the system, yet often did so prematurely, distrusting its unfamiliar tactics. This case underscores the need for early, extensive benchmarks that test user capacity, surface trust gaps that can cripple human‑AI teams, and assure operators that models meet legal and ethical standards. Accordingly, this memo urges expanding benchmarking beyond pure model performance to AI‑human team evaluations in high‑risk national‑security, lethal, or error‑sensitive environments.
Conclusion
The Department of Defense is racing to integrate AI across every domain of warfare, yet speed without safety will jeopardize mission success and national security. Standardized, acquisition‑integrated, continuous, and mission‑specific benchmarking is therefore not a luxury—it is the backbone of responsible AI deployment. Current pilot programs with private partners are encouraging starts, but they remain too ad hoc and narrow to match the scale and tempo of modern AI development.
Benchmarking must begin at the pre‑award acquisition stage and follow systems through their entire lifecycle, detecting risks, performance drift, and adversarial vulnerabilities before they threaten operations. As the DARPA ACE program showed, early testing of human‑AI teams and rigorous red‑teaming surface trust gaps and hidden failure modes that vendor‑led evaluations often miss. Because AI models—and enemy capabilities—evolve constantly, our evaluation methods must evolve just as quickly.
By institutionalizing robust benchmarks under CDAO leadership, in concert with the Defense Innovation Unit and the Chief AI Officers Council, the DoD can set world‑class standards for military AI safety while accelerating reliable procurement. Ultimately, AI benchmarking is not a hurdle to innovation and acquisition, but rather it is the infrastructure that can make rapid acquisition more reliable and innovation more viable. The DoD cannot afford the risk of deploying AI systems which are risky, unreliable, ineffective or misaligned with mission needs and standards in high-risk operational environments. At this inflection point, the choice is not between speed and safety but between ungoverned acceleration and a calculated momentum that allows our strategic AI advantage to be both sustained and secured.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
he Scale AI benchmarking initiative, launched in February 2024 in partnership with the DoD, is a pilot framework designed to evaluate the performance of AI models intended for defense and national security applications. It is part of the broader efforts to create a framework for T&E of AI models for the CDAO.
This memo builds on that foundation by:
- Formalizing benchmarking as a standard requirement at the procurement stage across DoD acquisition processes.
- Inserting benchmarking protocols into rapid acquisition platforms like Tradewinds.
- Establishing a defense-specific benchmarking repository and enabling red-teaming led by the AI Rapid Capabilities Cell (AI RCC) within the CDAO.
- Shifting the lead on benchmarking from vendor-enabled to internally developed, led, and implemented, creating bespoke evaluation criteria tailored to specific mission needs.
The proposed benchmarking framework will apply to a diverse range of AI systems, including:
- Decision-making and command and control support tools (sensors, target recognition, process automation, and tools involved in natural language processing).
- Generative models for planning, logistics, intelligence, or data generation.
- Autonomous agents, such as drones and robotic systems.
Benchmarks will be theater and context-specific, reflecting real-world environments (e.g. contested INDOPACOM scenarios), end-user roles (human-AI teaming in combat), and mission-specific risk factors such as adversarial interference and model drift.
Open-source models present distinct challenges due to model ownership and origin, additional possible exposure to data poisoning, and downstream user manipulation. However, due to the nature of open-source models, it should be noted that the general increase in transparency and potential access to training data could make open-source models less challenging to put through rigorous T&E.
This memo recommends:
- Applying standardized evaluation criteria across both open-source and proprietary models which can be developed by utilizing the AI benchmarking repository and applying model evaluations based on possible use cases of the model.
- Incorporating benchmarking to test possible areas of vulnerability for downstream user manipulation.
- Measuring the transparency of training data.
- Performing adversarial testing to assess resilience against manipulated inputs via red-teaming.
- Logging the open-source model performance in the proposed centralized repository, enabling ongoing monitoring for drift and other issues
Red-teaming implements adversarial stress-testing (which can be more robust and operationally relevant if led by an internal team as this memo proposes), and can identify vulnerabilities and unintended capabilities before deployment. Internally led red-teaming, in particular, is critical for evaluating models intended for use in unpredictable or hostile environments.
To effectively employ the red-teaming efforts, this policy recommends that:
- The AI Rapid Capabilities Cell within the CDAO should lead red-teaming operations, leveraging the team’s technical capabilities with its experience and mission set to integrate and rapidly scale AI at the speed of relevance — delivering usable capability fast enough to affect current operations and decision cycles.
- Internal, technically skilled teams should be created who are capable of incorporating classified threat models and edge-case scenarios.
- Red-teaming should focus on simulating realistic mission conditions, and searching for specific model capabilities, going beyond generic or vendor-supplied test cases.
Integrating benchmarking at the acquisition stage enables procurement officers to:
- Compare models on mission-relevant, standardized performance metrics and ensure that there is evidence of measurable performance metrics which align with their own “vision of success” procurement requirements for the models.
- Identify and avoid models with unsafe, misaligned, unverified, or ineffective capabilities.
- Prevent cost-overruns or contract revisions.
Benchmarking workshops for acquisition officers can further equip them with the skills to interpret benchmark results and apply them to their operational requirements.
Develop a Risk Assessment Framework for AI Integration into Nuclear Weapons Command, Control, and Communications Systems
As the United States overhauls nearly every element of its strategic nuclear forces, artificial intelligence is set to play a larger role—initially in early‑warning sensors and decision‑support tools, and likely in other mission areas. Improved detection could strengthen deterrence, but only if accompanying hazards—automation bias, model hallucinations, exploitable software vulnerabilities, and the risk of eroding assured second‑strike capability—are well managed.
To ensure responsible AI integration, the Office of the Assistant Secretary of Defense for Nuclear Deterrence, Chemical, and Biological Defense Policy and Programs (OASD (ND-CBD)), the U.S. Strategic Command (STRATCOM), the Defense Advanced Research Projects Agency (DARPA), the Office of the Undersecretary of Defense for Policy (OUSD(P)), and the National Nuclear Security Administration (NNSA), should jointly develop a standardized AI risk-assessment framework guidance document, with implementation led by the Department of Defense’s Chief Digital and Artificial Intelligence Office (CDAO) and STRATCOM. Furthermore, DARPA and CDAO should join the Nuclear Weapons Council to ensure AI-related risks are systematically evaluated alongside traditional nuclear modernization decisions.
Challenge and Opportunity
The United States is replacing or modernizing nearly every component of its strategic nuclear forces, estimated to cost at least $1.7 trillion over the next 30 years. This includes its:
- Intercontinental ballistic missiles (ICBMs)
- Ballistic missile submarines and their submarine-launched ballistic missiles (SLBMs)
- Strategic bombers, cruise missiles, and gravity bombs
- Nuclear warhead production and plutonium pit fabrication facilities
Simultaneously, artificial intelligence (AI) capabilities are rapidly advancing and being applied across the national security enterprise, including nuclear weapons stockpile stewardship and some components of command, control, and communications (NC3) systems, which encompass early warning, decision-making, and force deployment components.
The NNSA, responsible for stockpile stewardship, is increasingly integrating AI into its work. This includes using AI for advanced modeling and simulation of nuclear warheads. For example, by creating a digital twin of existing weapons systems to analyze aging and performance issues, as well as using AI to accelerate the lifecycle of nuclear weapons development. Furthermore, NNSA is leading some aspects of the safety testing and systematic evaluations of frontier AI models on behalf of the U.S. government, with a specific focus on assessing nuclear and radiological risk.
Within the NC3 architecture, a complex “system of systems” with over 200 components, simpler forms of AI are already being used in areas including early‑warning sensors, and may be applied to decision‑support tools and other subsystems as confidence and capability grow. General Anthony J. Cotton—who leads STRATCOM, the combatant command that directs America’s global nuclear forces and their command‑and‑control network—told a 2024 conference that STRATCOM is “exploring all possible technologies, techniques, and methods” to modernize NC3. Advanced AI and data‑analytics tools, he said, can sharpen decision‑making, fuse nuclear and conventional operations, speed data‑sharing with allies, and thus strengthen deterrence. General Cotton added that research must also map the cascading risks, emergent behaviors, and unintended pathways that AI could introduce into nuclear decision processes.
Thus, from stockpile stewardship to NC3 systems, AI is likely to be integrated across multiple nuclear capabilities, some potentially stabilizing, others potentially highly destabilizing. For example, on the stabilizing effects, AI could enhance early warning systems by processing large volumes of satellite, radar, and other signals intelligence, thus providing more time to decision-makers. On the destabilizing side, the ability for AI to detect or track other countries’ nuclear forces could be destabilizing, triggering an expansionary arms race if countries doubt the credibility of their second-strike capability. Furthermore, countries may misinterpret each other’s nuclear deterrence doctrines or have no means of verification of human control of their nuclear weapons.
While several public research reports have been conducted on how AI integration into NC3 could upset the balance of strategic stability, less research has focused on the fundamental challenges with AI systems themselves that must be accounted for in any risk framework. Per the National Institute of Standards and Technology’s (NIST) AI Risk Management Framework, several fundamental AI challenges at a technical level must be accounted for in the integration of AI into stockpile stewardship and NC3.
Not all AI applications within the nuclear enterprise carry the same level of risk. For example, using AI to model warhead aging in stockpile stewardship is largely internal to the Department of Energy (DOE) and involves less operational risk. Despite lower risk, there is still potential for an insufficiently secure model to lead to leaked technical data about nuclear weapons.
However, integrating AI into decision support systems or early warning functions within NC3 introduces significantly higher stakes. These systems require time-sensitive, high-consequence judgments, and AI integration in this context raises serious concerns about issues including confabulations, human-AI interactions, and information security:
- Confabulations: A phenomenon in which generative AI systems (GAI) systems generate and confidently present erroneous or false content in response to user inputs, or
prompts. These phenomena are colloquially also referred to as “hallucinations” or “fabrications”, and could have particularly dangerous consequences in high-stakes settings.
- Human-AI Interactions: Due to the complexity and human-like nature of GAI technology, humans may over-rely on GAI systems or may unjustifiably perceive GAI content to be of higher quality than that produced by other sources. This phenomenon is an example of automation bias or excessive deference to automated systems. This deference can lead to a shift from a human making the final decision (“human in the loop”), to a human merely observing AI generated decisions (“human on the loop”). Automation bias therefore risks exacerbating other risks of GAI systems as it can lead to humans maintaining insufficient oversight.
- Information Security: AI expands the cyberattack surface of NC3. Poisoned AI training data and tampered code can embed backdoors, and, once deployed, prompt‑injection or adversarial examples can hijack AI decision tools, distort early‑warning analytics, or leak secret data. The opacity of large AI models can let these exploits spread unnoticed, and as models become more complex, they will be harder to debug.
This is not an exhaustive list of issues with AI systems, however it highlights several key areas that must be managed. A risk framework must account for these distinctions and apply stricter oversight where system failure could have direct consequences for escalation or deterrence credibility. Without such a framework, it will be challenging to harness the benefits AI has to offer.
Plan of Action
Recommendation 1. OASD (ND-CBD), STRATCOM, DARPA, OUSD(P), and NNSA, should develop a standardized risk assessment framework guidance document to evaluate the integration of artificial intelligence into nuclear stockpile stewardship and NC3 systems.
This framework would enable systematic evaluation of risks, including confabulations, human-AI configuration, and information security, across modernization efforts. The framework could assess the extent to which an AI model is prone to confabulations, involving performance evaluations (or “benchmarking”) under a wide range of realistic conditions. While there are public measurements for confabulations, it is essential to evaluate AI systems on data relevant to the deployment circumstances, which could involve highly sensitive military information.
Additionally, the framework could assess human-AI configuration with specific focus on risks from automation bias and the degree of human oversight. For these tests, it is important to put the AI systems in contact with human operators in situations that are as close to real deployment as possible, for example when operators are tired, distracted, or under pressure.
Finally, the framework could include assessments of information security under extreme conditions. This should include simulating comprehensive adversarial attacks (or “red-teaming”) to understand how the AI system and its human operators behave when subject to a range of known attacks on AI systems.
NNSA should be included in this development due to their mission ownership of stockpile stewardship and nuclear safety, and leadership in advanced modeling and simulation capabilities. DARPA should be included due to its role as the cutting edge research and development agency, extensive experience in AI red-teaming, and understanding of the AI vulnerabilities landscape. STRATCOM must be included as the operational commander of NC3 systems, to ensure the framework accounts for real-word needs and escalation risks. OASD (ND-CBD) should be involved given the office’s responsibilities to oversee nuclear modernization and coordinate across the interagency. The OUSD (P) should be included to provide strategic oversight and ensure the risk assessment aligns with broader defense policy objectives and international commitments.
Recommendation 2. CDAO should implement the Risk Assessment Framework with STRATCOM
While NNSA, DARPA, OASD (ND-CBD) and STRATCOM can jointly create the risk assessment framework, CDAO and STRATCOM should serve as the implementation leads for utilizing the framework. Given that the CDAO is already responsible for AI assurance, testing and evaluation, and algorithmic oversight, they would be well-positioned to work with relevant stakeholders to support implementation of the technical assessment. STRATCOM would have the strongest understanding of operational contexts with which to apply the framework. NNSA and DARPA therefore could advise on technical underpinnings with regards to AI of the framework, while the CDAO would prioritize operational governance and compliance, ensuring that there are clear risk assessments completed and understood when considering integration of AI into nuclear-related defense systems.
Recommendation 3. DARPA and CDAO should join the Nuclear Weapons Council
Given their roles in the creation and implementation of the AI risk assessment framework, stakeholders from both DARPA and the CDAO should be incorporated into the Nuclear Weapons Council (NWC), either as full members or attendees to a subcommittee. As the NWC is the interagency body the DOE and the DoD responsible for sustaining and modernizing the U.S. nuclear deterrent, the NWC is responsible for endorsing military requirements, approving trade-offs, and ensuring alignment between DoD delivery systems and NNSA weapons.
As AI capabilities become increasingly embedded in nuclear weapons stewardship, NC3 systems, and broader force modernization, the NWC must be equipped to evaluate associated risks and technological implications. Currently, the NWC is composed of senior officials from the Department of Defense, the Joint Chiefs of Staff, and the Department of Energy, including the NNSA. While these entities bring deep domain expertise in nuclear policy, military operations, and weapons production, the Council lacks additional representation focused on AI.
DARPA’s inclusion would ensure that early-stage technology developments and red-teaming insights are considered upstream in decision-making. Likewise, CDAO’s presence would provide continuity in AI assurance, testing, and digital system governance across operational defense components. Their participation would enhance the Council’s ability to address new categories of risk, such as model confabulation, automation bias, and adversarial manipulation of AI systems, that are not traditionally covered by existing nuclear stakeholders. By incorporating DARPA and CDAO, the NWC would be better positioned to make informed decisions that reflect both traditional nuclear considerations and the rapidly evolving technological landscape that increasingly shapes them.
Conclusion
While AI is likely to be integrated into components of the U.S. nuclear enterprise, without a standardized initial approach to assessing and managing AI-specific risk, including confabulations, automation bias, and novel cybersecurity threats, this integration could undermine an effective deterrent. A risk assessment framework coordinated by OASD (ND-CBD), with STRATCOM, NNSA and DARPA, and implemented with support of the CDAO, could provide a starting point for NWC decisions and assessments of the alignment between DoD delivery system needs, the NNSA stockpile, and NC3 systems.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
Yes, NWC subordinate organizations or subcommittees are not codified in Title 10 USC §179, so the NWC has the flexibility to create, merge, or abolish organizations and subcommittees as needed.
Section 1638 of the FY2025 National Defense Authorization Act established a Statement of Policy emphasizing that any use of AI in support of strategic deterrence should not compromise, “the principle of requiring positive human actions in execution of decisions by the President with respect to the employment of nuclear weapons.” However, as this memo describes, AI presents further challenges outside of solely keeping a human in the loop in terms of decision-making.
A National Center for Advanced AI Reliability and Security
While AI’s transformative advances have enormous positive potential, leading scientists and industry executives are also sounding the alarm about catastrophic risks on a global scale. If left unmanaged, these risks could undermine our ability to reap the benefits of AI progress. While the U.S. government has made some progress, including by establishing the Center for AI Standards and Innovation (CAISI)—formerly the US AI Safety Institute—current government capacity is insufficient to respond to these extreme frontier AI threats. To address this problem, this memo proposes scaling up a significantly enhanced “CAISI+” within the Department of Commerce. CAISI+ would require dedicated high-security compute facilities, specialized talent, and an estimated annual operating budget of $67-155 million, with a setup cost of $155-275 million. CAISI+ would have expanded capacity for conducting advanced model evaluations for catastrophic risks, provide direct emergency assessments to the President and National Security Council (NSC), and drive critical AI reliability and security research, ensuring America is prepared to lead on AI and safeguard its national interests.
Challenge and Opportunity
Frontier AI is advancing rapidly toward powerful general-purpose capabilities. While this progress has produced widely useful products, it is also generating significant security risks. Recent evaluations on Anthropic’s Claude Opus 4 model were unable to rule out the risk that the model could be used to advise novice actors to produce bioweapons, triggering additional safeguards. Meanwhile, the FBI warns that AI “increases cyber-attack speed, scale, and automation”, with a 442% increase in AI-enhanced voice phishing attacks in 2024, and recent evaluations showing AI models rapidly gaining offensive cyber capabilities.
AI company CEOs and leading researchers have predicted that this progress will continue, with potentially transformative AI capabilities arriving in the next few years–and fast progress in AI capabilities will continue to generate novel threats greater than those from existing models. As AI systems are predicted to become increasingly capable of performing complex tasks and taking extended autonomous actions, researchers warn of these additional risks, such as loss of human control, AI-enabled WMD proliferation, and strategic surprise with severe national security implications. While timelines to AI systems surpassing dangerous capability thresholds are uncertain, this proposal attempts to lay out a US government response that is robust to a range of possible timelines, while taking the above trends seriously.
Current U.S. Government capabilities, including the existing Center for AI Standards and Innovation (CAISI), are not adequately resourced or empowered to independently evaluate, monitor, or respond to the most advanced AI threats. For example, current CAISI funding is precarious, its home institution (NIST)’s offices are reportedly “crumbling”, and its budget is roughly one-tenth of its counterpart in the UK. Despite previous underinvestment, CAISI has consistently produced rigorous model evaluations, and in doing so, has earned strong credibility with industry and government stakeholders. This also includes support from legislators: bipartisan legislation has been introduced in both chambers of Congress to authorize CAISI in statute, while just last month, the House China Committee released a letter noting that CAISI has a role to play in “understanding, predicting, and preparing for” national security risks from AI development in the PRC.
A dedicated and properly resourced national entity is essential for supporting the development of safe, secure, and trustworthy AI to drive widespread adoption, by providing sustained, independent technical assessments and emergency coordination—roles that ad-hoc industry consultations or self-reporting cannot fulfill for paramount matters of national security and public safety.
Establishing CAISI+ now is a critical opportunity to proactively manage these profound risks, ensure American leadership in AI, and prevent strategic disadvantage as global AI capabilities advance. While full operational capacity may not be needed immediately, certain infrastructure, such as highly secure computing, has significant lead times, demanding foresight and preparatory action. This blueprint offers a scalable framework to build these essential national capabilities, safeguarding our future against AI-related catastrophic events and enabling the U.S. to shape the trajectory of this transformative technology.
Plan of Action
To effectively address extreme AI risks, develop more trustworthy AI systems, and secure U.S. interests, the Administration and Congress should collaborate to establish and resource a world-class national entity to inform the federal response to the above trendlines.
Recommendation 1. Establish CAISI+ to Lead National AI Safety and Coordinate Crisis Response.
CAISI+, evolving from the current CAISI within the National Institute of Standards and Technology, under the Department of Commerce, must have a clear mandate focused on large-scale AI risks. Core functions include:
- Advanced Model Evaluation: Developing and operating state-of-the-art platforms to test frontier AI models for dangerous capabilities, adversarial behavior or goals (such as deception or power-seeking), and potential weaponization. While the level of risk presented by current models is very uncertain, even those who are skeptical of particular risk models are often supportive of developing better evaluations.
- Emergency Assessment & Response: Providing rapid, expert risk assessments and warnings directly to the President and the National Security Council (NSC) in the event of severe AI-driven national security threats. The CAISI+ Director should be statutorily designated as the Principal Advisor on AI Risks to the President and NSC, with authority to:
- Submit AI threat assessments to the President’s Daily Brief (PDB) when intelligence indicates imminent or critical risks
- Convene emergency sessions of the NSC Deputies Committee or Principals Committee for time-sensitive AI security threats
- Maintain direct communication channels to the National Security Advisor for immediate threat notification
- Issue “Critical AI Threat Warnings” through established NSC emergency communication protocols, similar to those used for terrorism or WMD threats
- Foundational AI Reliability and Security Research: Driving and funding research into core AI alignment, control, and security challenges to maintain U.S. technological leadership while developing trustworthy AI systems. This research will yield dual benefits to both the public and industry, by enabling broader adoption of reliable AI tools and preventing catastrophic incidents that could devastate the AI sector, similar to how the Three Mile Island disaster impacted nuclear energy development. Following the model of NIST’s successful encryption standards, establishing rigorous AI safety benchmarks and protocols will create industry-wide confidence while ensuring American competitiveness.
Governance will feature clear interagency coordination (e.g., with the Department of Defense, Department of Energy, Department of Homeland Security, and other relevant bodies in the intelligence community) and an internal structure with distinct directorates for evaluations, emergency response, and research, coordinated by CAISI+ leadership.
Recommendation 2. Equip CAISI+ with Elite American Talent and Sustained Funding
CAISI+’s efficacy hinges on world-class personnel and reliable funding to execute its mission. This necessitates:
- Exceptional American Talent: Special hiring authorities (e.g., direct hire, excepted service) and competitive compensation are paramount to attract and retain leading U.S. AI researchers, evaluators, and security experts, ensuring our AI standards reflect American values.
- Significant, Sustained Funding: Initial mainline estimates (see “Funding estimates for CAISI+” below) suggest $155-$275 million for setup and an annual operating budget of $67-$155 million for the recommended implementation level, sourced via new appropriations, to ensure America develops strong domestic capacity for defending against AI-powered threats. If funding is not appropriated, or if appropriations fall short, additional support may be able to be sourced via a NIST Foundation.
Funding estimates for CAISI+
Implementation Considerations
- Phased approach: The facility could be developed in stages, prioritizing core evaluation capabilities before expanding to full emergency response capacity.
- Leverage existing assets: Initial operations could utilize existing DOE relationships rather than immediately building dedicated infrastructure.
- Partnership model: Some costs could be offset through public-private partnerships with technology companies and research institutions.
- Talent acquisition strategy: Use of special hiring authorities (direct hire, excepted service) and competitive compensation (SL/ST pay scales, retention bonuses) may help compete with private sector AI companies.
- Sustainable funding: For stability, a multi-year Congressional appropriation with dedicated line-item funding would be crucial.
Staffing Breakdown by Function
- Technical Research (40-60% of staff): AI evaluations, safety research, alignment, interpretability research
- Security Operations (25-35% of staff): Red-teaming, misuse assessment, weaponization evaluation, security management
- Policy & Strategy (10-15% of staff): Leadership, risk assessment, interagency coordination, international liaisons
- Support Functions (15-20% of staff): Legal, procurement, compute infrastructure management, administration
For context, current funding levels include:
- Current CAISI funding (mid-2025): $10 million annually
- UK AISI (CAISI counterpart) initial funding: £100 million (~$125 million)
- Oak Ridge Leadership Computing Facility operations: ~$200-300 million annually
- Standard DOE supercomputing facility construction: $400-600 million
Even the minimal implementation would require substantially greater resources than the current CAISI, but remains well within the scale of other national-priority technology initiatives. The recommended implementation level would position CAISI+ to effectively fulfill its expanded mission of frontier AI evaluation, monitoring, and emergency response.
Funding Longevity
- Initial authorization: 5-year authorization with specific milestones and metrics
- Review mechanism: Independent assessment by the Government Accountability Office at 3-year mark to evaluate effectiveness and adjust scope/resources, supplemented by a National Academies study specifically tasked with evaluating the scientific and technical rigor of the CAISI+.
- Long-term vision: Transition to permanent authorization for core functions with periodic reauthorization of specific initiatives
- Accountability: Annual reporting to Congress on key performance metrics and risk assessments
Recommendation 3. Equip CAISI+ with Essential Secure Compute Infrastructure.
CAISI+ must be able to access secure compute in order to run certain evaluations involving proprietary models and national security data. This cluster can remain relatively modest in scale. Other researchers have hypothesized that a “Trusted AI Verification and Evaluation Cluster” for verifying and evaluating frontier AI development would need only 128 to 512 state-of-the-art graphical processing units (GPU)s–orders of magnitude smaller than the scale of training compute, such as the recent Llama 3.1 405 B model’s training run use of a 16,000 H100 GPU cluster, or xAI’s 200,000 GPU Colossus cluster.
However, the cluster will need to be highly secure–in other words, able to defend against attacks from nation-state adversaries. Certain evaluations will require full access to the internal “weights” of AI models, which requires hosting the model. Model hosting introduces the risk of model theft and proliferation of dangerous capabilities. Some evaluations will also involve the use of very sensitive data, such as nuclear weapons design evals–introducing additional incentive for cyberattacks. Researchers at Gladstone AI, a national security-focused AI policy consulting firm, write that in several years, powerful AI systems may confer significant strategic advantages to nation-states, and will therefore be top-priority targets for theft or sabotage by adversary nation-states. They also note that neither existing datacenters nor AI labs are secure enough to prevent this theft–thereby necessitating novel research and buildout to reach the necessary security level, outlined as “Security Level-5” (SL-5) in RAND’s Playbook for Securing AI Model Weights.
Therefore, we suggest a hybrid strategy for specialized secure compute, featuring a highly secure SL-5 air-gapped core facility for sensitive model analysis (a long-lead item requiring immediate planning), with access to a secondary pool of compute for additional capacity to run less sensitive evaluations via a formal partnership with DOE to access national lab resources. CAISI+ may also want to coordinate with the NITRD National Strategic Computing Reserve Pilot Program to explore needs for AI-crisis-related surge computing capability.
If a sufficiently secure compute cluster is infeasible or not developed in time, CAISI+ will ultimately be unable to host model internals without introducing unacceptable risks of model theft, severely limiting its ability to evaluate frontier AI systems.
Recommendation 4. Explore Granting Critical Authorities
While current legal authorities may suffice for CAISI+’s core missions, evolving AI threats could require additional tools. The White House (specifically the Office of Science and Technology Policy [OSTP], in collaboration with the Office of Management and Budget [OMB]) should analyze existing federal powers (such as the Defense Production Act or the International Emergency Economic Powers Act) to identify gaps in AI threat response capabilities–including potential needs for an incident reporting system and related subpoena authorities (similar to the function of the National Transportation Safety Board), or for model access for safety evaluations, or compute oversight authorities. Based on this analysis, the executive branch should report to Congress where new statutory authorities may be necessary, with defined risk criteria and appropriate safeguards.
Recommendation 5. Implement CAISI+ Enhancements Through Urgent, Phased Approach
Building on CAISI’s existing foundation within NIST/DoC, the Administration should enhance its capabilities to address AI risks that extend beyond current voluntary evaluation frameworks. Given expert warnings that transformative AI could emerge within the current Administration’s term, immediate action is essential to augment CAISI’s capacity to handle extreme scenarios. To achieve full operational capacity by early 2027, initial-phase activities must begin now due to long infrastructure lead times:
Immediate Enhancements (0-6 months):
- Leverage NIST’s existing relationships with DOE labs to secure interim access to classified computing facilities for sensitive evaluations
- Initiate the security research and procurement process for the SL-5 compute facility outlined in Recommendation 3
- Work with OMB and Department of Commerce leadership to secure initial funding through reprogramming or supplemental appropriations
- Build on CAISI’s current voluntary agreements to develop protocols for emergency model access and crisis response
- Begin the OSTP-led analysis of existing federal authorities (per Recommendation 4) to identify potential gaps in AI threat response capabilities
Subsequent phases will extend CAISI’s current work through:
- Foundation-building activities (6-12 months): Implementing the special hiring authorities described in Recommendation 2, formalizing enhanced interagency MOUs to support coordination described in Recommendation 1, and establishing the direct NSC reporting channels for the CAISI+ Director as Principal Advisor on AI Risks.
- Capability expansion (12-18 months): Beginning construction of the SL-5 facility, operationalizing the three core functions (Advanced Model Evaluation, Emergency Assessment & Response, and Foundational AI Reliability Research), and recruiting the 80-150 technical staff outlined in the funding breakdown.
- Full enhanced capacity (18+ months): Achieving the operational capabilities described in Recommendation 1, including mature evaluation platforms, direct Presidential/NSC threat warning protocols, and comprehensive research programs.
Conclusion
Enhancing and empowering CAISI+ is a strategic investment in U.S. national security, far outweighed by the potential costs of inaction on this front. With an estimated annual operating budget of $67-155 million, CAISI+ will provide essential technical capabilities to evaluate and respond to the most serious AI risks, ensuring the U.S. leads in developing and governing AI safely and securely, irrespective of where advanced capabilities emerge. While timelines to AI systems surpassing dangerous capability thresholds are uncertain, by acting now to establish the necessary infrastructure, expertise, and authorities, the Administration can safeguard American interests and our technological future through a broad range of possible scenarios.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
A Grant Program to Enhance State and Local Government AI Capacity and Address Emerging Threats
States and localities are eager to leverage artificial intelligence (AI) to optimize service delivery and infrastructure management, but they face significant resource gaps. Without sufficient personnel and capital, these jurisdictions cannot properly identify and mitigate the risks associated with AI adoption, including cyber threats, surging power demands, and data privacy issues. Congress should establish a new grant program, coordinated by the Cybersecurity and Infrastructure Security Agency (CISA), to assist state and local governments in addressing these challenges. Such funding will allow the federal government to instill best security and operating practices nationwide, while identifying effective strategies from the grassroots that can inform federal rulemaking. Ultimately, federal, state, and local capacity are interrelated; federal investments in state and local government will help the entire country harness AI’s potential and reduce the risk of catastrophic events such as a large, AI-powered cyberattack.
Challenge and Opportunity
In 2025, 45 state legislatures have introduced more than 550 bills focused on the regulation of artificial intelligence, covering everything from procurement guidelines to acceptable AI uses in K-12 education to liability standards for AI misuse and error. Major cities have followed suit with sweeping guidance of their own, identifying specific AI risks related to bias and hallucination and directives to reduce their impact on government functions. The influx of regulatory action reflects burgeoning enthusiasm about AI’s ability to streamline public services and increase government efficiency.
Yet two key roadblocks stand in the way: inconsistent rules and uneven capacity. AI regulations vary widely across jurisdictions — sometimes offering contradictory guidance — and public agencies often lack the staff and skills needed to implement them. In a 2024 survey, six in ten public sector professionals cited the AI skills gap as their biggest obstacle in implementing AI tools. This reflects a broader IT staffing crisis, with over 450,000 unfilled cybersecurity roles nationwide, which is particularly acute in the public sector given lower salaries and smaller budgets.
These roadblocks at the state and local level pose a major risk to the entire country. In the cyber space, ransomware attacks on state and local targets have demonstrated that hackers can exploit small vulnerabilities in legacy systems to gain broad access and cause major disruption, extending far beyond their initial targets. The same threat trajectory is conceivable with AI. States and cities, lacking the necessary workforce and adhering to a patchwork of different regulations, will find themselves unable to safely adopt AI tools and mount a uniform response in an AI-related crisis.
In 2021, Congress established the State and Local Cybersecurity Grant Program (SLCGP) at CISA, which focused on resourcing states, localities, and tribal territories to better respond to cyber threats. States have received almost $1 billion in funding to implement CISA’s security best practices like multifactor authentication and establish cybersecurity planning committees, which effectively coordinate strategic planning and cyber governance among state, municipal, and private sector information technology leaders.
Federal investment in state and local AI capacity-building can help standardize the existing, disparate guidance and bridge resource gaps, just as it has in the cybersecurity space. AI coordination is less mature today than the cybersecurity space was when the SLCGP was established in 2021. The updated Federal Information Security Modernization Act, which enabled the Department of Homeland Security to set information security standards across government, had been in effect for seven years by 2021, and some of its best practices had already trickled down to states and localities.
Thus, the need for clear AI state capacity, guardrails, and information-sharing across all levels of government is even greater. A small federal investment now can unlock large returns by enabling safe, effective AI adoption and avoiding costly failures. Local governments are eager to deploy AI but lack the resources to do so securely. Modest funding can align fragmented rules, train high-impact personnel, and surface replicable models—lowering the cost of responsible AI use nationwide. Each successful pilot creates a multiplier effect, accelerating progress while reducing risk.
Plan of Action
Recommendation 1. Congress should authorize a three-year pilot grant program focused on state and local AI capacity-building.
SLCGP’s authorization expires on August 31, 2025, which provides two unique pathways for a pilot grant program. The Homeland Security Committees in the House and Senate could amend and renew the existing SLCGP provision to make room for an AI-focused pilot. Alternatively, Congress could pass a new authorization, which would likely set the stage for a sustained grant program, upon successful completion of the pilot. A separate authorization would also allow Congress to consider other federal agencies as program facilitators or co-facilitators, in case they want to cover AI integrations that do not directly touch critical infrastructure, which is CISA’s primary focus.
Alternatively, the House Energy and Commerce and Senate Commerce, Science, and Transportation Committees could authorize a program coordinated by the National Institute of Standards and Technology, which produced the AI Risk Management Framework and has strong expertise in a range of vulnerabilities embedded within AI models. Congress might also consider mandating an interagency advisory committee to oversee the program, including, for example, experts from the Department of Energy to provide technical assistance and guidance on projects related to energy infrastructure.
In either case, the authorization should be coupled with a starting appropriation of $55 million over three years, which would fund ten statewide pilot projects totaling up to $5 million plus administrative costs. The structure of the program will broadly parallel SLCGP’s goals. First, it would align state and local AI approaches with existing federal guidance, such as the NIST AI Risk Management Framework and the Trump Administration’s OMB guidance on the regulation and procurement of artificial intelligence applications. Second, the program would establish better coordination between local and state authorities on AI rules. A new authorization for AI, however, allows Congress and the agency tasked with managing the program the opportunity to improve upon SLCGP’s existing provisions. This new program should permit states to coordinate their AI activities through existing leadership structures rather than setting up a new planning committee. The legislative language should also prioritize skills training and allocate a portion of grant funding to be spent on recruiting and retaining AI professionals within state and local government who can oversee projects.
Recommendation 2. Pilot projects should be implementation-focused and rooted in one of three significant risks: cybersecurity, energy usage, or data privacy.
Similar to SLCGP, this pilot grant program should be focused on implementation. The target product for a grant is a functional local or state AI application that has undergone risk mitigation, rather than a report that identifies issues in the abstract. For example, under this program, a state would receive federal funding to integrate AI into the maintenance of its cities’ wastewater treatment plants without compromising cybersecurity. Funding would support AI skills training for the relevant municipal employees and scaling of certain cybersecurity best practices like data encryption that minimize the project’s risk. States will submit reports to the federal government at each phase of their project: first documenting the risks they identified, then explaining their prioritization of risks to mitigate, then walking through their specific mitigation actions, and later, retrospectively reporting on the outcomes of those mitigations after the project has gone into operational use.
This approach would maximize the pilot’s return on investment. States will be able to complete high-impact AI projects without taking on the associated security costs. The frameworks generated from the project can be reused many times over for later projects, as can the staff who are hired or trained with federal support.
Given the inconsistency of priorities surfaced in state and local AI directives, the federal government should set the agenda of risks to focus on. The clearest set of risks for the pilot are cybersecurity, energy usage, and data privacy, all of which are highlighted in NIST’s Risk Management Framework.
- Cybersecurity. Cybersecurity projects should focus on detecting AI-assisted social engineering tactics, used to gain access into secure systems, and adversarial attacks like “poisoning” or “jailbreaking”, which manipulate AI models to produce undesirable outputs. Consider emergency response systems: the transition to IP-based, interconnected 911 systems increases the cyberattack surface, making it easier for an attack targeting one response center to spread across other jurisdictions. A municipality could seek funding to trial an AI dispatcher with necessary guardrails. As part of their project, they could ensure they have the appropriate cyber hygiene protocols in place to prevent cyberattacks from rendering the dispatcher useless or exploiting vulnerabilities in the dispatcher to gain access to underlying 911 systems that multiple localities rely on.
- Energy Usage. Energy usage projects should calculate power needs associated with AI development and implementation and the additional energy resources available to prevent outages. Much of the country faces a heightened risk of power outages due to antiquated grids, under-resourced providers, and a dearth of new electricity generation. AI integrations and supportive infrastructure that require significant power will place a heavy burden on states and potentially impact the operation of other critical infrastructure. A sample project might examine the energy demands of a new data center, powering an AI integration into traffic monitoring, and determine where that data center can best be constructed to accommodate available grid capacity.
- Data Privacy. Finally, data privacy projects should focus on bringing AI systems into compliance with existing data laws like the Health Insurance Portability and Accountability Act (HIPAA) and the Children’s Online Privacy Protection Act (COPPA) for AI interventions in healthcare and education, respectively. Because the U.S. lacks a comprehensive data privacy law, states might also experiment with additional best practices, such as training models to detect and reject prompts that contain personally identifiable information. A sample project in this domain might integrate a chatbot into the state Medicaid system to more efficiently triage patients and identify the steps the state can take to prevent the chatbot from handling PII in a manner that does not comply with HIPAA.
If successful, the pilot could expand to address additional risks or support broader, multi-risk, multi-state interventions.
Recommendation 3. The pilot program must include opportunities for grantees to share their ideas with other states and localities.
Arguably the most important facet of this new AI program will be forums where grantees share their learnings. Administrative costs for this program should go toward funding a twice-yearly (bi-annual) in-person forum, where grantees can publicly share updates on their projects. An in-person forum would also provide states with the space to coordinate further projects on the margins. CISA is particularly well positioned to host a forum like this given its track record of convening critical infrastructure operators. Grantees should be required to publish guidance, tools, and templates in a public, digital repository. Ideally, states that did not secure grants can adopt successful strategies from their peers and save taxpayers the cost of duplicate planning work.
Conclusion
Congress should establish a new grant program to assist state and local governments in addressing AI risks, including cybersecurity, energy usage, and data privacy. Such federal investments will give structure to the dynamic yet disparate national AI regulatory conversation. The grant program, which will cost $55 million to pilot over three years, will yield a high return on investment for both the ten grantee states and the peers that learn from its findings. By making these investments now, Congress can keep states moving fast toward AI without opening the door to critical, costly vulnerabilities.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
No, Congress could leverage SLCGP’s existing authorization to focus on projects that look at the intersection of AI and cybersecurity. They could offer an amendment to the next Homeland Security Appropriations package that directs modest SLCGP funding (e.g. $10-20 million) to AI projects. Alternatively, Congress could insert language on AI into SLCGP’s reauthorization, which is due on August 31, 2025.
Although leveraging the existing authorization would be easier, Congress would be better served by authorizing a new program, which can focus on multiple priorities including energy usage and data privacy. To stay agile, the language in the statute could allow CISA to direct funds toward new emerging risks, as they are identified by NIST and other agencies. Finally, a specific authorization would pave the way for an expansion of this program assuming the initial 10 state pilot goes well.
This pilot is right-sized for efficiency, impact, and cost savings. A program to bring all 50 states into compliance with certain AI risk mitigation guidelines would cost hundreds of millions, which is not feasible in the current budgetary environment. States are starting from very different baselines, especially with their energy infrastructure, which makes it difficult to bring them all to a single end-point. Moreover, because AI is evolving so rapidly, guidance is likely to age poorly. The energy needs of AI might change before states finish their plan to build data centers. Similarly, federal data privacy laws might go in place that undercut or contradict the best practices established by this program.
This pilot will allow 10 states and/or localities to quickly deploy AI implementations that produce real value: for example, quicker emergency response times and savings on infrastructure maintenance. CISA can learn from the grantees’ experiences to iterate on federal guidance. They might identify a stumbling block on one project and refine their guidance to prevent 49 other states from encountering the same obstacle. If grantees effectively share their learnings, they can cut massive amounts of time off other states’ planning processes and help the federal government build guidance that is more rooted in the realities of AI deployment.
No. If done correctly, this pilot will cut red tape and allow the entire country to harness AI’s positive potential. States and localities are developing AI regulations in a vacuum. Some of the laws proposed are contradictory or duplicative precisely because many state legislatures are not coordinating effectively with state and local government technical experts. When bills do pass, guidance is often poorly implemented because there is no overarching figure, beyond a state chief information officer, to bring departments and cities into compliance. In essence, 50 states are producing 50 sets of regulations because there is scant federal guidance and few mechanisms for them to learn from other states and coordinate within their state on best practices.
This program aims to cut down on bureaucratic redundancy by leveraging states’ existing cyber planning bodies to take a comprehensive approach to AI. By convening the appropriate stakeholders from the public sector, private sector, and academia to work on a funded AI project, states will develop more efficient coordination processes and identify regulations that stand in the way of effective technological implementation. States and localities across the country will build their guidelines based on successful grantee projects, absorbing best practices and casting aside inefficient rules. It is impossible to mount a coordinated response to significant challenges like AI-enabled cyberattacks without some centralized government planning, but this pilot is designed to foster efficient and effective coordination across federal, state, and local governments.
Accelerating AI Interpretability To Promote U.S. Technological Leadership
The most advanced AI systems remain ‘black boxes’ whose inner workings even their developers cannot fully understand, leading to issues with reliability and trustworthiness. However, as AI systems become more capable, there is a growing desire to deploy them in high-stakes scenarios. The bipartisan National Security Commission on AI cautioned that AI systems perceived as unreliable or unpredictable will ‘stall out’: leaders will not adopt them, operators will mistrust them, Congress will not fund them, and the public will not support them (NSCAI, Final Report, 2021). AI interpretability research—the science of opening these black boxes and attempting to comprehend why they do what they do—could turn opacity into understanding and enable wider AI adoption.
With AI capabilities racing ahead, the United States should accelerate interpretability research now to keep its technological edge and field high-stakes AI deployment with justified confidence. This memorandum describes three policy recommendations that could help the United States seize the moment and maintain a lead on AI interpretability: (1) creatively investing in interpretability research, (2) entering into research and development agreements between interpretability experts and government agencies and laboratories, and (3) prioritizing interpretable AI in federal procurement.
Challenge and Opportunity
AI capabilities are progressing rapidly. According to many frontier AI companies’ CEOs and independent researchers, AI systems could reach general-purpose capabilities that equal or even surpass humans within the next decade. As capabilities progress, there is a growing desire to incorporate these systems into high-stakes use cases, from military and intelligence uses (DARPA, 2025; Ewbank, 2024) to key sectors of the economy (AI for American Industry, 2025).
However, the most advanced AI systems are still ‘black boxes’ (Sharkey et al., 2024) that we observe from the outside and that we ‘grow,’ more than we ‘build’ (Olah, 2024). Our limited comprehension of the inner workings of neural networks means that we still really do not understand what happens within these black boxes, leaving uncertainty regarding their safety and reliability. This could have resounding consequences. As the 2021 final report of the National Security Commission on AI (NSCAI) highlighted, “[i]f AI systems routinely do not work as designed or are unpredictable in ways that can have significant negative consequences, then leaders will not adopt them, operators will not use them, Congress will not fund them, and the American people will not support them” (NSCAI, Final Report, 2021). In other words, if AI systems are not always reliable and secure, this could inhibit or limit their adoption, especially in high-stakes scenarios, potentially compromising the AI leadership and national security goals outlined in the Trump administration’s agenda (Executive Order, 2025).
AI interpretability is a subfield of AI safety that is specifically concerned with opening and peeking inside the black box to comprehend “why AI systems do what they do, and … put this into human-understandable terms” (Nanda, 2024; Sharkey et al., 2025). In other words, interpretability is the AI equivalent of an MRI (Amodei, 2025) because it attempts to provide observers with an understandable image of the hidden internal processes of AI systems.
The Challenge of Understanding AI Systems Before They Reach or Even Surpass Human-Level Capabilities
Recent years have brought breakthroughs across several research areas focused on making AI more trustworthy and reliable, including in AI interpretability. Among other efforts, the same companies developing the most advanced AI systems have designed systems that are easier to understand and have reached new research milestones (Marks et al., 2025; Lindsey et al., 2025; Lieberum et al. 2024; Kramar et al., 2024; Gao et al., 2024; Tillman & Mossing, 2025).
AI interpretability, however, is still trailing behind raw AI capabilities. AI companies project that it could take 5–10 years to reliably understand model internals (Amodei, 2025), while experts expect systems exhibiting human‑level general-purpose capabilities by as early as 2027 (Kokotajlo et al., 2025). That gap will force policymakers into a difficult corner once AI systems reach similar capabilities: deploy unprecedentedly powerful yet opaque systems, or slow deployment and fall behind. Unless interpretability accelerates, the United States could risk both competitive and security advantages.
The Challenge of Trusting Today’s Systems for High-Stakes Applications
We must understand the inner workings of highly advanced AI systems before they reach human or above-human general-purpose capabilities, especially if we want to trust them in high-stakes scenarios. There are several reasons why current AI systems might not always be reliable and secure. For instance, AI systems could exhibit the following vulnerabilities. First, AI systems inherit the blind spots of their training data. When the world changes—alliances shift, governments fall, regulations update—systems still reason from outdated facts, undermining reliability in high-stakes diplomatic or military settings (Jensen et al., 2025).
Second, AI systems are unusually easy to strip‑mine for memorized secrets, especially if these secrets come as uncommon word combinations (e.g., proprietary blueprints). Data‑extraction attacks are now “practical and highly realistic” and will grow even more effective as system size increases (Carlini et al., 2021; Nasr et al., 2023; Li et al., 2025). The result could be wholesale leakage of classified or proprietary information (DON, 2023).
Third, cleverly crafted prompts can still jailbreak cutting‑edge systems, bypassing safety rails and exposing embedded hazardous knowledge (Hughes et al., 2024; Ramesh et al., 2024). With attack success rates remaining uncomfortably high across even the leading systems, adversaries could manipulate AI systems with these vulnerabilities in real‑time national security scenarios (Caballero & Jenkins, 2024).
This is not a comprehensive list. Systems could exhibit vulnerabilities in high-stakes applications for many other reasons. For instance, AI systems could be misaligned and engage in scheming behavior (Meinke et al., 2024; Phuong et al., 2025) or have baked-in backdoors that an attacker could exploit (Hubinger et al., 2024; Davidson et al., 2025).
The Opportunity to Promote AI Leadership Through Interpretability
Interpretability offers an opportunity to address these described challenges and reduce barriers to the safe adoption of the most advanced AI systems, thereby further promoting innovation and increasing the existing advantages those systems present over adversaries’ systems. In this sense, accelerating interpretability could help promote and secure U.S. AI leadership (Bau et al., 2025; IFP, 2025). For example, by helping ensure that highly advanced AI systems are deployed safely in high-stakes scenarios, interpretability could improve national security and help mitigate the risk of state and non-state adversaries using AI capabilities against the United States (NSCAI, Final Report, 2021). Interpretability could therefore serve as a front‑line defense against vulnerabilities in today’s most advanced AI systems.
Making future AI systems safe and trustworthy could become easier the more we understand how they work (Shah et al., 2025). Anthropic’s CEO recently endorsed the importance and urgency of interpretability, noting that “every advance in interpretability quantitatively increases our ability to look inside models and diagnose their problems” (Amodei, 2025). This means that interpretability not only enhances reliability in the deployment of today’s AI systems, but understanding AI systems could also lead to breakthroughs in designing more targeted systems or attaining more robust monitoring of deployed systems. This could then enable the United States to deploy tomorrow’s human-level or above-human general-purpose AI systems with increased confidence, thus securing strategic advantages when engaging geopolitically. The following uses the vulnerabilities discussed above to demonstrate three ways in which interpretability could improve the reliability of today’s AI systems when deployed in high-stakes scenarios.
First, interpretability could help systems selectively update outdated information through model editing, without risking a reduction in performance. Model editing allows us to selectively inject new facts or fix mistakes (Cohen et al., 2023; Hase et al., 2024) by editing activations without updating the entire model. However, this ‘surgical tool’ has shown ‘side effects’ causing performance degradation (Gu et al., 2024; Gupta et al., 2024). Interpretability could help us understand how stored knowledge alters parameters as well as develop stronger memorization measures (Yao et al., 2023; Carlini et al., 2019), enabling us to ‘incise and excise’ AI models with fewer side effects.
Second, interpretability could help systems selectively forget training data through machine unlearning, once again without losing performance. Machine unlearning allows systems to forget specific data classes (such as memorized secrets or hazardous knowledge) while remembering the rest (Tarun et al., 2023). Like model editing, this ‘surgical tool’ suffers from performance degradation. Interpretability could help develop new unlearning techniques that preserve performance (Guo et al., 2024; Belrose et al., 2023; Zou et al., 2024).
Third, interpretability could help effectively block jailbreak attempts, which can only currently be discovered empirically (Amodei, 2025). Interpretability could lead to a breakthrough in understanding models’ persistent vulnerability to jailbreaking by allowing us to characterize dangerous knowledge. Existing interpretability research has already analyzed how AI models process harmful prompts (He et al., 2024; Ball et al., 2024; Lin et al., 2024; Zhou et al., 2024), and additional research could build on these initial findings
The conditions are ripe to promote technological leadership and national security through interpretability. Many of the same problems that were highlighted in the 2019 National AI R&D Strategic Plan remained the same in its 2023 update, echoing those included in NSCAI’s 2021 final report. We have made relatively little progress addressing these challenges. AI systems are still vulnerable to attacks (NSCAI, Final Report, 2021) and can still “be made do the wrong thing, reveal the wrong thing” and “be easily fooled, evaded, and misled in ways that can have profound security implications” (National AI R&D Strategic Plan, 2019). The field of interpretability is gaining some momentum among AI companies (Amodei, 2025; Shah et al., 2025; Goodfire, 2025) and AI researchers (IFP, 2025; Bau et al., 2025; FAS, 2025).
To be sure, despite recent progress, interpretability remains challenging and has attracted some skepticism (Hendrycks & Hiscott, 2025). Accordingly, a strong AI safety strategy must include many components beyond interpretability, including robust AI evaluations (Apollo Research, 2025) and control measures (Redwood Research, 2025).
Plan of Action
The United States has an opportunity to seize the moment and lead an acceleration of AI interpretability. The following three recommendations establish a strategy for how the United States could promptly incentivize AI interpretability research.
Recommendation 1. The federal government should prioritize and invest in foundational AI interpretability research, which would include identifying interpretability as a ‘strategic priority’ in the 2025 update of the National AI R&D Strategic Plan.
The National Science and Technology Council (NSTC) should identify AI interpretability as a ‘strategic priority’ in the upcoming National AI R&D Strategic Plan. Congress should then appropriate federal R&D funding for federal agencies (including DARPA and the NSF) to catalyze and support AI interpretability acceleration through various mechanisms, including grants and prizes, R&D credits, tax credits, advanced market commitments, and buyer-of-first-resort mechanisms.
This first recommendation echoes not only the 2019 update of the National AI R&D Strategic Plan and NSCAI’s 2021 final report––which recommended allocating more federal R&D investments to advance the interpretability of Al systems (NSCAI, Final Report, 2021; National AI R&D Strategic Plan, 2019),, but also the more recent remarks by the Director of the Office of Science and Technology Policy (OSTP), according to whom we need creative R&D funding approaches to enable scientists and engineers to create new theories and put them into practice (OSTP Director’s Remarks, 2025). This recommendation is also in line with calls from AI companies, asserting that “we still need significant investment in ‘basic science’” (Shah et al., 2025).
The United States could incentivize and support AI interpretability work through various approaches. In addition to prize competitions, advanced market commitments, fast and flexible grants (OSTP Director’s Remarks, 2025; Institute for Progress, 2025), and challenge-based acquisition programs (Institute for Progress, 2025), funding mechanisms could include R&D tax credits for AI companies undertaking or investing in interpretability research, and tax credits to adopters of interpretable AI, such as downstream deployers. If the federal government acts as “an early adopter and avid promoter of American technology” (OSTP Director’s Remarks, 2025), federal agencies could also rely on buyer-of-first-resort mechanisms for interpretability platforms.
These strategies may require developing a clearer understanding of which frontier AI companies undertake sufficient interpretability efforts when developing their most advanced systems, and which companies currently do not. Requiring AI companies to disclose how they use interpretability to test models before release (Amodei, 2025) could be helpful, but might not be enough to devise a ‘ranking’ of interpretability efforts. While potentially premature given the state of the art in interpretability, an option could be to start developing standardized metrics and benchmarks to evaluate interpretability (Mueller et al., 2025; Stephenson et al., 2025). This task could be carried out by the National Institute of Standards and Technology (NIST), within which some AI researchers have recommended creating an AI Interpretability and Control Standards Working Group (Bau et al., 2025).
A great way to operationalize this first recommendation would be for the National Science and Technology Council (NSTC) to include interpretability as a “strategic priority” in the 2025 update of the National AI R&D Strategic Plan (RFI, 2025). These “strategic priorities” seek to target and focus AI innovation for the next 3–5 years, paying particular attention to areas of “high-risk, high-reward AI research” that the industry is unlikely to address because it may not provide immediate commercial returns (RFI, 2025). If interpretability were included as a “strategic priority,” then the Office of Management and Budget (OMB) could instruct agencies to align their budgets with the 2025 National AI R&D Strategic Plan priorities in its memorandum addressed to executive department heads. Relevant agencies, including DARPA and the National Science Foundation (NSF), would then develop their budget requests for Congress, aligning them with the 2025 National AI R&D Strategic Plan and the OMB memorandum. After Congress reviews these proposals and appropriates funding, agencies could launch initiatives that incentivize interpretability work, including grants and prizes, R&D credits, tax credits, advanced market commitments, and buyer-of-first-resort mechanisms.
Recommendation 2. The federal government should enter into research and development agreements with AI companies and interpretability research organizations to red team AI systems applied in high-stakes scenarios and conduct targeted interpretability research.
AI companies, interpretability organizations, and federal agencies and laboratories (such as DARPA, the NSF, and the U.S. Center for AI Standards and Innovation) should enter into research and development agreements to pursue targeted AI interpretability research to solve national security vulnerabilities identified through security-focused red teaming.
This second recommendation takes into account the fact that the federal government possesses unique expertise and knowledge in national security issues to support national security testing and evaluation (FMF, 2025). Federal agencies and laboratories (such as DARPA, the NSF, and the U.S. Center for AI Standards and Innovation), frontier AI companies, and interpretability organizations could enter into research and development agreements to undertake red teaming of national security vulnerabilities (as, for instance, SABER which aims to assess AI-enabled battlefield systems for the DoD; SABER, 2025) and provide state-of-the-art interpretability platforms to patch the revealed vulnerabilities. In the future, AI companies could also apply the most advanced AI systems to support interpretability research.
Recommendation 3. The federal government should prioritize interpretable AI in federal procurement, especially for high-stakes applications.
If federal agencies are procuring highly advanced AI for high-stakes scenarios and national security missions, they should preferentially procure interpretable AI systems. This preference could be accounted for by weighing the lack of understanding of an AI system’s inner workings when calculating cost.
This third and final recommendation provides for the interim and assumes interpretable AI systems will coexist in a ‘gradient of interpretability’ with other AI systems that are less interpretable. In that scenario, agencies procuring AI systems should give preference to AI systems that are more interpretable. One way to account for this preference would be by weighing the potential vulnerabilities of uninterpretable AI systems within calculating costs during federal acquisition analyses. This recommendation also requires establishing a defined ‘ranking’ of interpretability efforts. While defining this ranking is currently challenging, the research outlined in recommendations 1 and 2 could better position the government to measure and rank the interpretability of different AI systems.
Conclusion
Now is the time for the United States to take action and lead the charge on AI interpretability research. While research is never guaranteed to lead to desired outcomes or to solve persistent problems, the potential high reward—understanding and trusting future AI systems and making today’s systems more robust to adversarial attacks—justifies this investment. Not only could AI interpretability make AI safer and more secure, but it could also establish justified confidence in the prompt adoption of future systems that are as capable as or even more capable than humans, and enable the deployment of today’s most advanced AI systems to high-stakes scenarios, thus promoting AI leadership and national security. With this goal in mind, this policy memorandum recommends that the United States, through the relevant federal agencies and laboratories (including DARPA, the NSF, and the U.S. Center for AI Standards and Innovation), invest in interpretability research, form research and development agreements to red team high-stakes AI systems and undertake targeted interpretability research, and prioritize interpretable AI systems in federal acquisitions.
Acknowledgments
I wish to thank Oliver Stephenson, Dan Braun, Lee Sharkey, and Lucius Bushnaq for their ideas, comments, and feedback on this memorandum.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
Accelerating R&D for Critical AI Assurance and Security Technologies
The opportunities presented by advanced artificial intelligence are immense, from accelerating cutting-edge scientific research to improving key government services. However, for these benefits to be realized, both the private and public sectors need confidence that AI tools are reliable and secure. This will require R&D effort to solve urgent technical challenges related to understanding and evaluating emergent AI behaviors and capabilities, securing AI hardware and infrastructure, and preparing for a world with many advanced AI agents.
To secure global adoption of U.S. AI technology and ensure America’s workforce can fully leverage advanced AI, the federal government should take a strategic and coordinated approach to support AI assurance and security R&D by: clearly defining AI assurance and security R&D priorities; establishing an AI R&D consortium and deploying agile funding mechanisms for critical R&D areas; and establishing an AI Frontier Science Fellowship to ensure a pipeline of technical AI talent.
Challenge and Opportunity
AI systems have progressed rapidly in the past few years, demonstrating human-level and even superhuman performance across diverse tasks. Yet, they remain plagued by flaws that produce unpredictable and potentially dangerous failures. Frontier systems are vulnerable to attacks that can manipulate them into executing unintended actions, hallucinate convincing but incorrect information, and exhibit other behaviors that researchers struggle to predict or control.
As AI capabilities rapidly advance toward more consequential applications—from medical diagnosis to financial decision-making to military systems—these reliability issues could pose increasingly severe risks to public safety and national security, while reducing beneficial uses. Recent polling shows that just 32% of Americans trust AI, and this limited trust will slow the uptake of impactful AI use-cases that could drive economic growth and enhance national competitiveness.
The federal government has an opportunity to secure America’s technological lead and promote global adoption of U.S. AI by catalyzing research to address urgent AI reliability and security challenges—challenges that align with broader policy consensus reflected in the National Security Commission on AI’s recommendations and bipartisan legislative efforts like the VET AI Act. Recent research has surfaced substantial expert consensus around priority research areas that address the following three challenges.
The first challenge involves understanding emergent AI capabilities and behaviors. As AI systems get larger, also referred to as “scaling”, they develop unexpected capabilities and reasoning patterns that researchers cannot predict, making it difficult to anticipate risks or ensure reliable performance. Addressing this means advancing the science of AI scaling and evaluations.
This research aims to build a scientific understanding of how AI systems learn, reason, and exhibit diverse capabilities. This involves not only studying specific phenomena like emergence and scaling but, more broadly, employing and refining evaluations as the core empirical methodology to characterize all facets of AI behavior. This includes evaluations in areas such as CBRN weapons, cybersecurity, and deception, and broader research on AI evaluations to ensure that AI systems can be accurately assessed and understood. Example work includes Wijk et al. (2024) and McKenzie et al. (2023)
The second challenge is securing AI hardware and infrastructure. AI systems require robust protection of model weights, secure deployment environments, and resilient supply chains to prevent theft, manipulation, or compromise by malicious actors seeking to exploit these powerful technologies. Addressing this means advancing hardware and infrastructure security for AI.
Ensuring the security of AI systems at the hardware and infrastructure level involves protecting model weights, securing deployment environments, maintaining supply chain integrity, and implementing robust monitoring and threat detection mechanisms. Methods include the use of confidential computing, rigorous access controls, specialized hardware protections, and continuous security oversight. Example work includes Nevo et al. (2024) and Hepworth et al. (2024)
The third challenge involves preparing for a world with many AI agents—AI models that can act autonomously. Alongside their potentially immense benefits, the increasing deployment of AI agents creates critical blind spots, as agents could coordinate covertly beyond human oversight, amplify failures into system-wide cascades, and combine capabilities in ways that circumvent existing safeguards. Addressing this means advancing agent metrology, infrastructure, and security.
Developing a deeper understanding of agentic behavior in LLM-based systems, including clarifying how LLM agents learn over time, respond to underspecified goals, and engage with their environments. This also includes research that ensures safe multi-agent interactions, such as detecting and preventing malicious collective behaviors, studying how transparency can affect agent interactions, and developing evaluations for agent behavior and interaction. Example work includes Lee and Tiwari (2024) and Chan et al. (2024)
While academic and industry researchers have made progress on these problems, this progress is not keeping pace with AI development and deployment. The market is likely to underinvest in research that is more experimental or with no immediate commercial applications. The U.S. government, as the R&D lab of the world, has an opportunity to unlock AI’s transformative potential through accelerating assurance and security research.
Plan of Action
The rapid pace of AI advancement demands a new strategic, coordinated approach to federal R&D for AI assurance and security. Given financial constraints, it is more important than ever to make sure that the impact of every dollar invested in R&D is maximized.
Much of the critical technical expertise now resides in universities, startups, and leading AI companies rather than traditional government labs. To harness this distributed talent, we need R&D mechanisms that move at the pace of innovation, leverage academic research excellence, engage early-career scientists who drive breakthroughs, and partner with industry leaders who can share access to essential compute resources and frontier models. Traditional bureaucratic processes risk leaving federal efforts perpetually behind the curve.
The U.S. government should implement a three-pronged plan to advance the above R&D priorities.
Recommendation 1. Clearly define AI assurance and security R&D priorities
The Office of Science and Technology Policy (OSTP) and the National Science Foundation (NSF) should highlight critical areas of AI assurance and security as R&D priorities by including these in the 2025 update of the National AI R&D Strategic Plan and the forthcoming AI Action Plan. All federal agencies conducting AI R&D should engage with the construction of these plans to explain how their expertise could best contribute to these goals. For example, the Defense Advanced Research Projects Agency (DARPA)’s Information Innovation Office could leverage its expertise in AI security to investigate ways to design secure interaction protocols and environments for AI agents that eliminate risks from rogue agents.
The priorities would help coordinate government R&D activities by providing funding agencies with a common set of priorities, public research institutes such as the National Labs to conduct fundamental R&D activities, Congress with information to support relevant legislative decisions, and industry to serve as a guide to R&D.
Additionally, given the dynamic nature of frontier AI research, OSTP and NSF should publish an annual survey of progress in critical AI assurance and security areas and identify which challenges are the highest priority.
Recommendation 2. Establish an AI R&D consortium and deploy agile funding mechanisms for critical R&D
As noted by OSTP Director Michael Kratsios, “prizes, challenges, public-private partnerships, and other novel funding mechanisms, can multiply the impact of targeted federal dollars. We must tie grants to clear strategic targets, while still allowing for the openness of scientific exploration.” Federal funding agencies should develop and implement agile funding mechanisms for AI assurance and security R&D in line with established priorities. Congress should include reporting language in its Commerce, Justice, Science (CJS) appropriations bill that supports accelerated R&D disbursements for investment into prioritized areas.
A central mechanism should be the creation of an AI Assurance and Security R&D Consortium, jointly led by DARPA and NSF, bringing together government, AI companies, and universities. In this model:
- Government provides funding for personnel, administrative support, and manages the consortium’s strategic direction
- AI companies contribute model access, compute credits, and engineering expertise
- Universities provide researchers and facilities for conducting fundamental research
This consortium structure would enable rapid resource sharing, collaborative research projects, and accelerated translation of research into practice. It would operate under flexible contracting mechanisms using Other Transaction Authority (OTA) to reduce administrative barriers.
Beyond the consortium, funding agencies should leverage Other Transaction Authority (OTA) and Prize Competition Authority to flexibly contract and fund research projects related to priority areas. New public-private grant vehicles focused on funding fundamental research in priority areas should be set up via existing foundations linked to funding agencies such as the NSF Foundation, DOE’s Foundation for Energy Security and Innovation, or the proposed NIST Foundation.
Specific funding mechanisms should be chosen based on the target technology’s maturity level. For example, the NSF can support more fundamental research through fast grants via its EAGER and RAPID programs. Previous fast-grant programs, such as SGER, were found to be wildly effective, with “transformative research results tied to more than 10% of projects.”
For research areas where clear, well-defined technical milestones are achievable, such as developing secure cluster-scale environments for large AI training workloads, the government can support the creation of focused research organizations (FROs) and implement advanced market commitments (AMCs) to take technologies across the ‘valley of death’. DARPA and IARPA can administer higher-risk, more ambitious R&D programs with national security applications.
Recommendation 3. Establish an AI Frontier Science Fellowship to ensure a pipeline of technical AI talent that can contribute directly to R&D and support fast-grant program management
It is critical to ensure that America has a growing pool of talented researchers entering the field of AI assurance and security, given its strategic importance to American competitiveness and national security.
The NSF should launch an AI Frontier Science Fellowship targeting early-career researchers in critical AI assurance and security R&D. Drawing from proven models like CyberCorp Scholarship for Service, COVID-19 Fast Grants, and proposals such as for “micro-ARPAs”, this program operates on two tracks:
- Frontier Scholars: This track would provide comprehensive research support for PhD students and post-docs conducting relevant research on priority AI security and reliability topics. This includes computational resources, research rotations at government labs and agencies, and financial support.
- Rapid Grant Program Managers (PM): This track recruits researchers to serve fixed terms as Rapid Grant PMs, responsible for administering EAGER/RAPID grants focused on AI assurance and security.
This fellowship solves multiple problems at once. It builds the researcher pipeline while creating a nimble, decentralized approach to science funding that is more in line with the dynamic nature of the field. This should improve administrative efficiency and increase the surface area for innovation by allowing for more early-stage high-risk projects to be funded. Also, PMs who perform well in administering these small, fast grants can then become full-fledged program officers and PMs at agencies like the NSF and DARPA. This program (including grant budget) would cost around $40 million per year.
Conclusion
To unlock AI’s immense potential, from research to defense, we must ensure these tools are reliable and secure. This demands R&D breakthroughs to better understand emergent AI capabilities and behaviors, secure AI hardware and infrastructure, and prepare for a multi-agent world. The federal government must lead by setting clear R&D priorities, building foundational research talent, and injecting targeted funding to fast-track innovation. This unified push is key to securing America’s AI leadership and ensuring that American AI is the global gold standard.
This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.
Yes, the recommendations are achievable by reallocating the existing budget and using existing authorities, but this would likely mean accepting a smaller initial scale.
In terms of authorities, OSTP and NSF can already update the National AI R&D Strategic Plan and establish AI assurance and security priorities through normal processes. To implement agile funding mechanisms, agencies can use OTA and Prize Competition Authority. Fast grants require no special statute and can be done under existing grant authorities.
In terms of budget, agencies can reallocate 5-10% of existing AI research funds towards security and assurance R&D. The Frontier Science Fellowship could start as a $5-10 million pilot under NSF’s existing education authorities, e.g. drawing from NSF’s Graduate Research Fellowship Program.
While agencies have flexibility to begin this work, achieving the memo’s core objective – ensuring AI systems are trustworthy and reliable for workforce and military adoption – requires dedicated funding. Congress could provide authorization and appropriation for a named fellowship, which would make the program more stable and allow it to survive personnel turnover.
Market incentives drive companies to fix AI failures that directly impact their bottom line, e.g., chatbots giving bad customer service or autonomous vehicles crashing. More visible, immediate problems are likely to be prioritized because customers demand it or because of liability concerns. This memo focuses on R&D areas that the private sector is less likely to tackle adequately.
The private will address some security and reliability issues, but there are likely to be significant gaps. Understanding emergent model capabilities demands costly fundamental research that generates little immediate commercial return. Likewise, securing AI infrastructure against nation-state attacks will likely require multi-year R&D processes, and companies can fail to coordinate to develop these technologies without a clear demand signal. Finally, systemic dangers arising from multi-agent interactions might be left unmanaged because these failures emerge from complex dynamics with unclear liability attribution.
The government can step in to fund the foundational research that the market is likely to undersupply by default and help coordinate the key stakeholders in the process.
Companies need security solutions to access regulated industries and enterprise customers. Collaboration on government-funded research provides these solutions while sharing costs and risks.
The proposed AI Assurance and Security R&D Consortium in Recommendation 2 create a structured framework for cooperation. Companies contribute model access and compute credits while receiving:
- Government-funded researchers working on their deployment challenges
- Shared IP rights under consortium agreements
- Early access to security and reliability innovations
- Risk mitigation through collaborative cost-sharing
Under the consortia’s IP framework, companies retain full commercial exploitation rights while the government gets unlimited rights for government purposes. In the absence of a consortium agreement, an alternative arrangement could be a patent pool, where companies can access patented technologies in the pool through a single agreement. These structures, combined with the fellowship program providing government-funded researchers, creates strong incentives for private sector participation while advancing critical public research objectives.
The Federation of American Scientists Calls on OMB to Maintain the Agency AI Use Case Inventories at Their Current Level of Detail
The federal government’s approach to deploying AI systems is a defining force in shaping industry standards, academic research, and public perception of these technologies. Public sentiment toward AI remains mixed, with many Americans expressing a lack of trust in AI systems. To fully harness the benefits of AI, the public must have confidence that these systems are deployed responsibly and enhance their lives and livelihoods.
The first Trump Administration’s AI policies clearly recognized the opportunity to promote AI adoption through transparency and public trust. President Trump’s Executive Order 13859 explicitly stated that agencies must design, develop, acquire, and use “AI in a manner that fosters public trust and confidence while protecting privacy, civil rights, civil liberties, and American values.” This commitment laid the foundation for increasing government accountability in AI use.
A major step in this direction was the AI Use Case Inventory, established under President Trump’s Executive Order 13960 and later codified in the 2023 Advancing American AI Act. The agency inventories have since become a crucial tool in fostering public trust and innovation in government AI use. Recent OMB guidance (M-24-10) has expanded its scope, standardizing AI definitions, and collecting information on potential adverse impacts. The detailed inventory enhances accountability by ensuring transparency in AI deployments, tracks AI successes and risks to improve government services, and supports AI vendors by providing visibility into public-sector AI needs, thereby driving industry innovation.
The end of 2024 marked a major leap in government transparency regarding AI use. Agency reporting on AI systems saw dramatic improvements, with federal AI inventories capturing more than 1,700 AI use cases —a 200% increase in reported use cases from the previous year. The Department of Homeland Security (DHS) alone reported 158 active AI use cases. Of these, 29 were identified as high-risk, with detailed documentation on how 24 of those use cases are mitigating potential risks. This level of disclosure is essential for maintaining public trust and ensuring responsible AI deployment.
OMB is set to release revisions to its AI guidance (M-24-10) in mid-March, presenting an opportunity to ensure that transparency remains a top priority.
To support continued transparency and accountability in government AI use, the Federation of American Scientists has written a letter urging OMB to maintain its detailed guidance on AI inventories. We believe that sustained transparency is crucial to ensuring responsible AI governance, fostering public trust, and enabling industry innovation.
Federation of American Scientists and 16 Tech Organizations Call on OMB and OSTP to Maintain Agency AI Use Case Inventories
The first Trump Administration’s E.O. 13859 commitment laid the foundation for increasing government accountability in AI use; this should continue
Washington, D.C. – March 6, 2025 – The Federation of American Scientists (FAS), a non-partisan, nonprofit science think tank dedicated to developing evidence-based policies to address national challenges, today released a letter to the White House Office of Management and Budget (OMB) and the Office of Science and Technology Policy (OSTP), signed by 16 additional scientific and technical organizations, urging the current Trump administration to maintain the federal agency AI use cases inventories at the current level of detail.
“The federal government has immense power to shape industry standards, academic research, and public perception of artificial intelligence,” says Daniel Correa, CEO of the Federation of American Scientists. “By continuing the work set forth by the first Trump administration in Executive Order 13960 and continued by the bipartisan 2023 Advancing American AI Act, OMB’s detailed use cases help us understand the depth and scope of AI systems used for government services.”
“FAS and our fellow organizations urge the administration to maintain these use case standards because these inventories provide a critical check on government AI use,” says Dr. Jedidah Isler, Chief Science Officer at FAS.
AI Guidance Update Mid-March
“Transparency is essential for public trust, which in turn is critical to maximizing the benefits of government AI use. That’s why FAS is leading a letter urging the administration to uphold the current level of agency AI use case detail—ensuring transparency remains a top priority,” says Oliver Stephenson, Associate Director of AI and Emerging Tech Policy at FAS.
“Americans want reassurances that the development and use of artificial intelligence within the federal government is safe; and that we have the ability to mitigate any adverse impacts. By maintaining guidance that federal agencies have to collect and publish information on risks, development status, oversight, data use and so many other elements, OMB will continue strengthening Americans’ trust in the development and use of artificial intelligence,” says Clara Langevin, AI Policy Specialist at FAS.
Surging Use of AI in Government
This letter follows the dramatic rise in the use of artificial intelligence across government, with anticipated growth coming at a rapid rate. For example, at the end of 2024 the Department of Homeland Security (DHS) alone reported 158 active AI use cases. Of these, 29 were identified as high-risk, with detailed documentation on how 24 of those use cases are mitigating potential risks. OMB and OSTP have the ability and authority to set the guidelines that can address the growing pace of government innovation.
FAS and our signers believe that sustained transparency is crucial to ensuring responsible AI governance, fostering public trust, and enabling responsible industry innovation.
Signatories Urging AI Use Case Inventories at Current Level of Detail
Federation of American Scientists
Beeck Center for Social Impact + Innovation at Georgetown University
Bonner Enterprises, LLC
Center for AI and Digital Policy
Center for Democracy & Technology
Center for Inclusive Change
CUNY Public Interest Tech Lab
Electronic Frontier Foundation
Environmental Policy Innovation Center
Mozilla
National Fair Housing Alliance
NETWORK Lobby for Catholic Social Justice
New America’s Open Technology Institute
POPVOX Foundation
Public Citizen
SeedAI
The Governance Lab
###
ABOUT FAS
The Federation of American Scientists (FAS) works to advance progress on a broad suite of contemporary issues where science, technology, and innovation policy can deliver dramatic progress, and seeks to ensure that scientific and technical expertise have a seat at the policymaking table. Established in 1945 by scientists in response to the atomic bomb, FAS continues to work on behalf of a safer, more equitable, and more peaceful world. More information about FAS work at fas.org.
ABOUT THIS COALITION
Organizations signed on to this letter represent a range of technology stakeholders in industry, academia, and nonprofit realms. We share a commitment to AI transparency. We urge the current administration, OMB, and OSTP to retain the policies set forth in Trump’s Executive Order 13960 and continued in the bipartisan 2023 Advancing American AI Act.
Three Artificial Intelligence Bills Endorsed by Federation of American Scientists Advance from the House Committee
Proposed bills advance research ecosystems, economic development, and education access and move now to the U.S. House of Representatives for a vote
Washington, D.C. – September 12, 2024 – Three proposed artificial intelligence bills endorsed by the Federation of American Scientists (FAS), a nonpartisan science think tank, advance forward from a House Science, Space, and Technology Committee markup held on September 11th, 2024. These bills received bipartisan support and will now be reported to the full chamber. The three bills are: H.R. 9403, the Expanding AI Voices Act, co-sponsored by Rep. Vince Fong (CA-20) and Rep. Andrea Salinas (OR-06); H.R. 9197, the Small Business AI Act, co-sponsored by Rep. Mike Collins (GA-10) and Rep. Haley Stevens (MI-11), and H.R. 9403, the Expand AI Act, co-sponsored by Rep. Valerie Foushee (NC-04) and Rep. Frank Lucas (OK-03).
“FAS endorsed these bills based on the evaluation of their strengths. Among these are the development of infrastructure to develop AI safely and responsibly; the deployment of resources to ensure development benefits more equitably across our economy; and investment in the talent pool necessary for this consequential, emerging technology,” says Dan Correa, CEO of FAS.
“These three bills pave a vision for the equitable and safe use of AI in the U.S. Both the Expanding AI Voices Act and the NSF AI Education Act will create opportunities for underrepresented voices to have a say in how AI is developed and deployed. Additionally, the Small Business AI Act will ensure that an important sector of our society feels empowered to use AI safely and securely,” says Clara Langevin, FAS AI Policy Specialist.
Expanding AI Voices Act
The Expanding AI Voices Act will support a broad and diverse interdisciplinary research community for the advancement of artificial intelligence and AI-powered innovation through partnerships and capacity building at certain institutions of higher education to expand AI capacity in populations historically underrepresented in STEM.
Specifically, the Expanding AI Voices Act of 2024 will:
- Codify and expand the ExpandAI program at the National Science Foundation (NSF), which supports artificial intelligence (AI) capacity-building projects for eligible entities including Minority Serving Institutions (MSIs), Historically Black Colleges and Universities (HBCUs), and Tribal Colleges and Universities (TCUs).
- Broaden the ExpandAI program in scope and types of activities it supports to further build and enhance partnerships between eligible entities and awardees of the National AI Research Institutes ecosystem to broaden AI research and development.
- Direct the National Science Foundation to engage in outreach to increase their pool of applications and address common barriers preventing these organizations from submitting an application.
Small Business AI Act
Emerging science is central to new and established small businesses, across industries and around the country. This bill will require the Director of the National Institute of Standards and Technology (NIST) to develop resources for small businesses in utilizing artificial intelligence, and for other purposes.
- This bill amends the NIST Organic Act, as amended by the National AI Initiative Act, and directs NIST, in coordination with the Small Business Administration, to consider the needs of America’s small businesses and develop AI resources for best practices, case studies, benchmarks, methodologies, procedures, and processes for small businesses to understand, apply, and integrate AI systems.
- It will connect Small Businesses with existing Federal educational resources, such as the risk management framework and activities from the national cybersecurity awareness and education program under the Cybersecurity Enhancement Act of 2014.
- This bill aligns with FAS’s mission to broaden AI use and access as a catalyst for economic development.
National Science Foundation Artificial Intelligence Education Act of 2024 (NSF AI Education Act).
The National Artificial Intelligence Initiative Act of 2020 (15 U.S.C. 9451) will bolster educational skills in AI through new learning initiatives and workforce training programs. Specifically, the bill will:
- Allow NSF to award AI scholarships in critical sectors such as education, agriculture and advanced manufacturing.
- Authorize the NSF to conduct outreach and encourage applications from rural institutions, Tribal Colleges and Universities, and institutions located in Established Program to Stimulate Competitive Research (EPSCoR) jurisdictions to promote research competitiveness.
- Award fellowships for teachers, school counselors, and other school professionals for professional development programs, providing skills and training in collaboration with industry partners on the teaching and application of artificial intelligence in K-12 settings
- This bill aligns with FAS’s commitment to STEM education and equity as powerful levers for our nation to compete on the global stage.
###
ABOUT FAS
The Federation of American Scientists (FAS) works to advance progress on a broad suite of contemporary issues where science, technology, and innovation policy can deliver dramatic progress, and seeks to ensure that scientific and technical expertise have a seat at the policymaking table. Established in 1945 by scientists in response to the atomic bomb, FAS continues to work on behalf of a safer, more equitable, and more peaceful world. More information at fas.org.
Public Comment on the U.S. Artificial Intelligence Safety Institute’s Draft Document: NIST AI 800-1, Managing Misuse Risk for Dual-Use Foundation Models
Public comments serve the executive branch by informing more effective, efficient program design and regulation. As part of our commitment to evidence-based, science-backed policy, FAS staff leverage public comment opportunities to embed science, technology, and innovation into policy decision-making.
The Federation of American Scientists (FAS) is a non-partisan organization dedicated to using science and technology to benefit humanity through equitable and impactful policy. With a strong track record in AI governance, FAS has actively contributed to the development of AI standards and frameworks, including providing feedback on NIST AI 600-1, the Generative AI Profile. Our work spans advocating for federal AI testbeds, recommending policy measures for frontier AI developers, and evaluating industry adoption of the NIST AI Risk Management Framework. We are members of the U.S. AI Safety Institute Research Consortium, and we responded to NIST’s request for information earlier this year concerning its responsibilities under sections 4.1, 4.5, and 11 of the AI Executive Order.
We commend NIST’s U.S. Artificial Intelligence Safety Institute for developing the draft guidance on “Managing Misuse Risk for Dual-Use Foundation Models.” This document represents a significant step toward establishing robust practices for mitigating catastrophic risks associated with advanced AI systems. The guidance’s emphasis on comprehensive risk assessment, transparent decision-making, and proactive safeguards aligns with FAS’s vision for responsible AI development.
In our response, we highlight several strengths of the guidance, including its focus on anticipatory risk assessment and the importance of clear documentation. We also identify areas for improvement, such as the need for harmonized language and more detailed guidance on model development safeguards. Our key suggestions include recommending a more holistic socio-technical approach to risk evaluation, strengthening language around halting development for unmanageable risks, and expanding the range of considered safeguards. We believe these adjustments will further strengthen NIST’s crucial role in shaping responsible AI development practices.
Background and Context
The rapid advancement of AI foundation models has spurred novel industry-led risk mitigation strategies. Leading AI companies have voluntarily adopted frameworks like Responsible Scaling Policies and Preparedness Frameworks, outlining risk thresholds and mitigation strategies for increasingly capable AI systems. (Our response to NIST’s February RFI was largely an exploration of these policies, their benefits and drawbacks, and how they could be strengthened.)
Managing misuse risks in foundation models is of paramount importance given their broad applicability and potential for dual use. As these models become more powerful, they may inadvertently enable malicious actors to cause significant harm, including facilitating the development of weapons, enabling sophisticated cyber attacks, or generating harmful content. The challenge lies not only in identifying current risks but also in anticipating future threats that may emerge as AI capabilities expand.
NIST’s new guidance on “Managing Misuse Risk for Dual-Use Foundation Models” builds upon these industry initiatives, providing a more standardized and comprehensive approach to risk management. By focusing on objectives such as anticipating potential misuse, establishing clear risk thresholds, and implementing robust evaluation procedures, the guidance creates a framework that can be applied across the AI development ecosystem. This approach is crucial for ensuring that as AI technology advances, appropriate safeguards are in place to protect against potential misuse while still fostering innovation.
Strengths of the guidance
1. Comprehensive Documentation and Transparency
The guidance’s emphasis on thorough documentation and transparency represents a significant advancement in AI risk management. For every practice under every objective, the guidance indicates appropriate documentation; this approach is more thorough in advancing transparency than any comparable guidance to date. The creation of a paper trail for decision-making and risk evaluation is crucial for both internal governance and potential external audits.
The push for transparency extends to collaboration with external stakeholders. For instance, practice 6.4 recommends providing “safe harbors for third-party safety research,” including publishing “a clear vulnerability disclosure policy for model safety issues.” This openness to external scrutiny and feedback is essential for building trust and fostering collaborative problem-solving in AI safety. (FAS has published a legislative proposal calling for enshrining “safe harbor” protections for AI researchers into law.)
2. Lifecycle Approach to Risk Management
The guidance excels in its holistic approach to risk management, covering the entire lifecycle of foundation models from pre-development assessment through to post-deployment monitoring. This comprehensive approach is evident in the structure of the document itself, which follows a logical progression from anticipating risks (Objective 1) through to responding to misuse after deployment (Objective 6).
The guidance demonstrates a proactive stance by recommending risk assessment before model development. Practice 1.3 suggests to “Estimate the model’s capabilities of concern before it is developed…”, which helps anticipate and mitigate potential harms before they materialize. The framework for red team evaluations (Practice 4.2) is particularly robust, recommending independent external experts and suggesting ways to compensate for gaps between red teams and real threat actors. The guidance also emphasizes the importance of ongoing risk assessment. Practice 3.2 recommends to “Periodically revisit estimates of misuse risk stemming from model theft…” This acknowledgment of the dynamic nature of AI risks encourages continuous vigilance.
3. Strong Stance on Model Security and Risk Tolerance
The guidance takes a firm stance on model security and risk tolerance, particularly in Objective 3. It unequivocally states that models relying on confidentiality for misuse risk management should only be developed when theft risk is sufficiently mitigated. This emphasizes the critical importance of security in AI development, including considerations for insider threats (Practice 3.1).
The guidance also demonstrates a realistic approach to the challenges posed by different deployment strategies. In Practice 5.1, it notes, “For example, allowing fine-tuning via API can significantly limit options to prevent jailbreaking and sharing the model’s weights can significantly limit options to monitor for misuse (Practice 6.1) and respond to instances of misuse (Practice 6.2).” This candid discussion of the limitations of safety interventions for open weight foundation models is crucial for fostering realistic risk assessments.
Additionally, the guidance promotes a conservative approach to risk management. Practice 5.3 recommends to “Consider leaving a margin of safety between the estimated level of risk at the point of deployment and the organization’s risk tolerance.” It further suggests considering “a larger margin of safety to manage risks that are more severe or less certain.” This approach provides an extra layer of protection against unforeseen risks or rapid capability advancements, which is crucial given the uncertainties inherent in AI development.
These elements collectively demonstrate NIST’s commitment to promoting realistic and robust risk management practices that prioritize safety and security in AI development and deployment. However, while the NIST guidance demonstrates several important strengths, there are areas where it could be further improved to enhance its effectiveness in managing misuse risks for dual-use foundation models.
Areas for improvement
1. Need for a More Comprehensive Socio-technical Approach to Measuring Misuse Risk
Objective 4 of the guidance demonstrates a commendable effort to incorporate elements of a socio-technical approach in measuring misuse risk. The guidance recognizes the importance of considering both technical and social factors, emphasizes the use of red teams to assess potential misuse scenarios, and acknowledges the need to consider different levels of access and various threat actors. Furthermore, it highlights the importance of avoiding harm during the measurement process, which is crucial in a socio-technical framework.
However, the guidance falls short in fully embracing a comprehensive socio-technical perspective. While it touches on the importance of external experts, it does not sufficiently emphasize the value of diverse perspectives, particularly from individuals with lived experiences relevant to specific risk scenarios. The guidance also lacks a structured approach to exploring the full range of potential misuse scenarios across different contexts and risk areas. Finally, the guidance does not mention measuring absolute versus marginal risks (ie., how much total misuse risk a model poses in a specific context versus how much marginal risk it poses compared to existing tools). These gaps limit the effectiveness of the proposed risk measurement approach in capturing the full complexity of AI system interactions with human users and broader societal contexts.
Specific recommendations for improving socio-technical approach
The NIST guidance in Practice 1.3 suggests estimating model capabilities by comparison to existing models, but provides little direction on how to conduct these comparisons effectively. To improve this, NIST could incorporate the concept of “available affordances.” This concept emphasizes that an AI system’s risk profile depends not just on its absolute capabilities, but also on the environmental resources and opportunities for affecting the world that are available to it.
Additionally, Kapoor et al. (2024) emphasize the importance of assessing the marginal risk of open foundation models compared to existing technologies or closed models. This approach aligns with a comprehensive socio-technical perspective by considering not just the absolute capabilities of AI systems, but also how they interact with existing technological and social contexts. For instance, when evaluating cybersecurity risks, they suggest considering both the potential for open models to automate vulnerability detection and the existing landscape of cybersecurity tools and practices. This marginal risk framework helps to contextualize the impact of open foundation models within broader socio-technical systems, providing a more nuanced understanding of their potential benefits and risks.
NIST could recommend that organizations assess both the absolute capabilities of their AI systems and the affordances available to them in potential deployment contexts. This approach would provide a more comprehensive view of potential risks than simply comparing models in isolation. For instance, the guidance could suggest evaluating how a system’s capabilities might change when given access to different interfaces, actuators, or information sources.
Similarly, Weidinger et al. (2023) argue that while quantitative benchmarks are important, they are insufficient for comprehensive safety evaluation. They suggest complementing quantitative measures with qualitative assessments, particularly at the human interaction and systemic impact layers. NIST could enhance its guidance by providing more specific recommendations for integrating qualitative evaluation methods alongside quantitative benchmarks.
NIST should acknowledge potential implementation challenges with a comprehensive socio-technical approach. Organizations may struggle to create benchmarks that accurately reflect real-world misuse scenarios, particularly given the rapid evolution of AI capabilities and threat landscapes. Maintaining up-to-date benchmarks in a fast-paced field presents another ongoing challenge. Additionally, organizations may face difficulties in translating quantitative assessments into actionable risk management strategies, especially when dealing with novel or complex risks. NIST could enhance the guidance by providing strategies for navigating these challenges, such as suggesting collaborative industry efforts for benchmark development or offering frameworks for scalable testing approaches.
OpenAI‘s approach of using human participants to evaluate AI capabilities provides both a useful model for more comprehensive evaluation and an example of quantification challenges. While their evaluation attempted to quantify biological risk increase from AI access, they found that, as they put it, “Translating quantitative results into a meaningfully calibrated threshold for risk turns out to be difficult.” This underscores the need for more research on how to set meaningful thresholds and interpret quantitative results in the context of AI safety.
2. Inconsistencies in Risk Management Language
There are instances where the guidance uses varying levels of strength in its recommendations, particularly regarding when to halt or adjust development. For example, Practice 2.2 recommends to “Plan to adjust deployment or development strategies if misuse risks rise to unacceptable levels,” while Practice 3.2 uses stronger language, suggesting to “Adjust or halt further development until the risk of model theft is adequately managed.” This variation in language could lead to confusion and potentially weaker implementation of risk management strategies.
Furthermore, while the guidance emphasizes the importance of managing risks before deployment, it does not provide clear criteria for what constitutes “adequately managed” risk, particularly in the context of development rather than deployment. More consistent and specific language around these critical decision points would strengthen the guidance’s effectiveness in promoting responsible AI development.
Specific recommendations for strengthening language on halting development for unmanageable risks
To address the inconsistencies noted above, we suggest the following changes:
1. Standardize the language across the document to consistently use strong phrasing such as “Adjust or halt further development” when discussing responses to unacceptable levels of risk.
The current guidance uses varying levels of strength in its recommendations regarding development adjustments. For instance, Recommendation 4 of Practice 2.2 uses the phrase “Plan to adjust deployment or development strategies,” while Recommendation 3 of Practice 3.2 more strongly suggests to “Adjust or halt further development.” Consistent language would emphasize the critical nature of these decisions and reduce potential confusion or weak implementation of risk management strategies. This could be accomplished by changing the language of Practice 2.2, Recommendation 4 to “Plan to adjust or halt further development or deployment if misuse risks rise to unacceptable levels before adequate security and safeguards are available to manage risk.”
The need for stronger language regarding halting development is reflected both in NIST’s other work and in commitments that many frontier AI developers have publicly agreed to. For instance, the NIST AI Risk Management Framework, section 1.2.3 (Risk Prioritization), suggests: “In some cases where an AI system presents the highest risk – where negative impacts are imminent, severe harms are actually occurring, or catastrophic risks are present – development and deployment should cease in a safe manner until risks can be sufficiently mitigated.” Further, the AI Seoul Summit frontier AI safety commitments explicitly state that organizations should “set out explicit processes they intend to follow if their model or system poses risks that meet or exceed the pre-defined thresholds.” Importantly, these commitments go on to specify that “In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.”
2. Add to the list of transparency documentation for Practice 2.2 the following: “A decision-making framework for determining when risks have become truly unmanageable, considering factors like the severity of potential harm, the likelihood of the risk materializing, and the feasibility of mitigation strategies.”
While the current guidance emphasizes the importance of managing risks before deployment (e.g., in Practice 5.3), it does not provide clear criteria for what constitutes “adequately managed” risk, particularly in the context of development rather than deployment. A decision-making framework would provide clearer guidance on when to take the serious step of halting development. This addition would help prevent situations where development continues despite unacceptable risks due to a lack of clear stopping criteria. This recommendation aligns with the approach suggested by Alaga and Schuett (2023) in their paper on coordinated pausing, where they emphasize the need for clear thresholds and decision criteria to determine when AI development should be halted due to unacceptable risks.
3. Gaps in Model Development Safeguards
The guidance’s treatment of safeguards, particularly those related to model development, lacks sufficient detail to be practically useful. This is most evident in Appendix B, which lists example safeguards. While this appendix is a valuable addition, the safeguards related to model training (“Improve the model’s training”) are notably lacking in detail compared to the safeguards around model security and detecting misuse.
While the guidance covers many aspects of risk management comprehensively, especially model security, it does not provide enough specific recommendations for technical approaches to building safer models during the development phase. This gap could limit the practical utility of the guidance for AI developers seeking to implement safety measures from the earliest stages of model creation.
Specific recommendations for additional safeguards for model development
For some safeguards, we recommend that the misuse risk guidance explicitly reference relevant sections of NIST 600-1, the Generative Artificial Intelligence Profile. Specifically, the GAI profile offers more comprehensive guidance on data-related and monitoring safeguards. For instance, the profile emphasizes documenting training data curation policies (MP-4.1-004) and establishing policies for data collection, retention, and quality (MP-4.1-005), which are crucial for managing misuse risk from the earliest stages of development. Additionally, the profile suggests implementing real-time monitoring processes for analyzing generated content performance and trustworthiness characteristics (MG-3.2-006), which could significantly enhance ongoing risk management during development. These references to the GAI Profile on model development safeguards could take the form of an additional item in Appendix B, or be incorporated into the relevant sections earlier in the guidance.
Beyond pointing to the model development safeguards included in the GAI Profile, we also recommend expanding Appendix B to include further safeguards for the model development phase. Both the GAI Profile and the current misuse risk guidance lack specific recommendations for two key model development safeguards: iterative safety testing throughout development and staged development/release processes. Below are two proposed additions to Appendix B:
The proposed safeguard “Implement iterative safety testing throughout development” addresses the current guidance’s limited detail on model training and development safeguards. This approach aligns with Barrett, et al.’s AI Risk-Management Standards Profile for General-Purpose AI Systems and Foundation Models (the “GPAIS Profile”)’s emphasis on proactive and ongoing risk assessment. Specifically, the Profile recommends identifying “GPAIS impacts…and risks (including potential uses, misuses, and abuses), starting from an early AI lifecycle stage and repeatedly through new lifecycle phases or as new information becomes available” (Barrett et al., 2023, p. 19). The GPAIS Profile further suggests that for larger models, developers should “analyze, customize, reanalyze, customize differently, etc., then deploy and monitor” (Barrett et al., 2023, p. 19), where “analyze” encompasses probing, stress testing, and red teaming. This iterative safety testing would integrate safety considerations throughout development, aligning with the guidance’s emphasis on proactive risk management and anticipating potential misuse risk.
Similarly, the proposed safeguard “Establish a staged development and release process” addresses a significant gap in the current guidance. While Practice 5.1 discusses pre-deployment risk assessment, it lacks a structured approach to incrementally increasing model capabilities or access. Solaiman et al. (2023) propose a “gradient of release” framework for generative AI, a phased approach to model deployment that allows for iterative risk assessment and mitigation. This aligns with the guidance’s emphasis on ongoing risk management and could enhance the ‘margin of safety’ concept in Practice 5.3. Implementing such a staged process would introduce multiple risk assessment checkpoints throughout development and deployment, potentially improving safety outcomes.
Conclusion
NIST’s guidance on “Managing Misuse Risk for Dual-Use Foundation Models” represents a significant step forward in establishing robust practices for mitigating catastrophic risks associated with advanced AI systems. The document’s emphasis on comprehensive risk assessment, transparent decision-making, and proactive safeguards demonstrates a commendable commitment to responsible AI development. However, to more robustly contribute to risk mitigation, the guidance must evolve to address key challenges, including a stronger approach to measuring misuse risk, consistent language on halting development, and more detailed model development safeguards.
As the science of AI risk assessment advances, this guidance should be recursively updated to address emerging risks and incorporate new best practices. While voluntary guidance is crucial, it is important to recognize that it cannot replace the need for robust policy and regulation. A combination of industry best practices, government oversight, and international cooperation will be necessary to ensure the responsible development of high-risk AI systems.
We appreciate the opportunity to provide input on this important document. FAS stands ready to continue assisting NIST in refining and implementing this guidance, as well as in developing further resources for responsible AI development. We believe that close collaboration between government agencies, industry leaders, and civil society organizations is key to realizing the benefits of AI while effectively mitigating its most serious risks.
Scaling AI Safely: Can Preparedness Frameworks Pull Their Weight?
A new class of risk mitigation policies has recently come into vogue for frontier AI developers. Known alternately as Responsible Scaling Policies or Preparedness Frameworks, these policies outline commitments to risk mitigations that developers of the most advanced AI models will implement as their models display increasingly risky capabilities. While the idea for these policies is less than a year old, already two of the most advanced AI developers, Anthropic and OpenAI, have published initial versions of these policies. The U.K. AI Safety Institute asked frontier AI developers about their “Responsible Capability Scaling” policies ahead of the November 2023 UK AI Safety Summit. It seems that these policies are here to stay.
The National Institute of Standards & Technology (NIST) recently sought public input on its assignments regarding generative AI risk management, AI evaluation, and red-teaming. The Federation of American Scientists was happy to provide input; this is the full text of our response. NIST’s request for information (RFI) highlighted several potential risks and impacts of potentially dual-use foundation models, including: “Negative effects of system interaction and tool use…chemical, biological, radiological, and nuclear (CBRN) risks…[e]nhancing or otherwise affecting malign cyber actors’ capabilities…[and i]mpacts to individuals and society.” This RFI presented a good opportunity for us to discuss the benefits and drawbacks of these new risk mitigation policies.
This report will provide some background on this class of risk mitigation policies (we use the term Preparedness Framework, for reasons to be described below). We outline suggested criteria for robust Preparedness Frameworks (PFs) and evaluate two key documents, Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework, against these criteria. We claim that these policies are net-positive and should be encouraged. At the same time, we identify shortcomings of current PFs, chiefly that they are underspecified, insufficiently conservative, and address structural risks poorly. Improvement in the state of the art of risk evaluation for frontier AI models is a prerequisite for a meaningfully binding PF. Most importantly, PFs, as unilateral commitments by private actors, cannot replace public policy.
Motivation for Preparedness Frameworks
As AI labs develop potentially dual-use foundation models (as defined by Executive Order No. 14110, the “AI EO”) with capability, compute, and efficiency improvements, novel risks may emerge, some of them potentially catastrophic. Today’s foundation models can already cause harm and pose some risks, especially as they are more broadly used. Advanced large language models at times display unpredictable behaviors.
To this point, these harms have not risen to the level of posing catastrophic risks, defined here broadly as “devastating consequences for vast numbers of people.” The capabilities of models at the current state of the art simply do not imply levels of catastrophic risk above current non-AI related margins.1 However, as these models continue to scale in training compute, some speculate they may develop novel capabilities that could potentially be misused. The specific capabilities that will emerge from further scaling remain difficult to predict with confidence or certainty. Some analysis indicates that as training compute for AI models has doubled approximately every six months since 2015, performance on capability benchmarks has also steadily improved. While it’s possible that bigger models could lead to better performance, it wouldn’t be surprising if smaller models emerge with better capabilities, as despite years of research by machine learning theorists, our knowledge of just how the number of model parameters relates to model capabilities remains uncertain.
Nonetheless, as capabilities increase, risks may also increase, and new risks may appear. Executive Order 14110 (the Executive Order on Artificial Intelligence, or the “AI EO”) detailed some novel risks of potentially dual-use foundation models, including potential risks associated with chemical, biological, radiological, or nuclear (CBRN) risks and advanced cybersecurity risks. Other risks are more speculative, such as risks of model autonomy, loss of control of AI systems, or negative impacts on users including risks of persuasion.2 Without robust risk mitigations, it is plausible that increasingly powerful AI systems will eventually pose greater societal risks.
Other technologies that pose catastrophic risks, such as nuclear technologies, are heavily regulated in order to prevent those risks from resulting in serious harms. There is a growing movement to regulate development of potentially dual-use biotechnologies, particularly gain-of-function research on the most pathogenic microbes. Given the rapid pace of progress at the AI frontier, comprehensive government regulation has yet to catch up; private companies that develop these models are starting to take it upon themselves to prevent or mitigate the risks of advanced AI development.
Prevention of such novel and consequential risks requires developers to implement policies that address potential risks iteratively. That is where preparedness frameworks come in. A preparedness framework is used to assess risk levels across key categories and outline associated risk mitigations. As the introduction to OpenAI’s PF states, “The processes laid out in each version of the Preparedness Framework will help us rapidly improve our understanding of the science and empirical texture of catastrophic risk, and establish the processes needed to protect against unsafe development.” Without such processes and commitments, the tendency to prioritize speed over safety concerns might prevail. While the exact consequences of failing to mitigate these risks are uncertain, they could potentially be significant.
Preparedness frameworks are limited in scope to catastrophic risks. These policies aim to prevent the worst conceivable outcomes of the development of future advanced AI systems; they are not intended to cover risks from existing systems. We acknowledge that this is an important limitation of preparedness frameworks. Developers can and should address both today’s risks and future risks at the same time; preparedness frameworks attempt to address the latter, while other “trustworthy AI” policies attempt to address a broader swathe of risks. For instance, OpenAI’s “Preparedness” team sits alongside its “Safety Systems” team, which “focuses on mitigating misuse of current models and products like ChatGPT.”
A note about terminology: The term “Responsible Scaling Policy” (RSP) is the term that took hold first, but it presupposes scaling of compute and capabilities by default. “Preparedness Framework” (PF) is a term coined by OpenAI, and it communicates the idea that the company needs to be prepared as its models approach the level of artificial general intelligence. Of the two options, “Preparedness Framework” communicates the essential idea more clearly: developers of potentially dual-use foundation models must be prepared for and mitigate potential catastrophic risks from development of these models.
The Industry Landscape
In September of 2023, ARC Evals (now METR, “Model Evaluation & Threat Research”) published a blog post titled “Responsible Scaling Policies (RSPs).” This post outlined the motivation and basic structure of an RSP, and revealed that ARC Evals had helped Anthropic write its RSP (version 1.0) which had been released publicly a few days prior. (ARC Evals had also run pre-deployment evaluations on Anthropic’s Claude model and OpenAI’s GPT-4.) And in December 2023, OpenAI published its Preparedness Framework in beta; while using new terminology, this document is structurally similar to ARC Evals’ outline of the structure of an RSP. Both OpenAI and Anthropic have indicated that they plan to update their PFs with new information as the frontier of AI development advances.
Not every AI company should develop or maintain a preparedness framework. Since these policies relate to catastrophic risk from models with advanced capabilities, only those developers whose models could plausibly attain those capabilities should use PFs. Because these advanced capabilities are associated with high levels of training compute, a good interim threshold for who should develop a PF could be the same as the AI EO threshold for potentially dual-use foundation models; that is, developers of models trained on over 10^26 FLOPS (or October 2023-equivalent level of compute adjusted for compute efficiency gains).3 Currently, only a handful of developers have models that even approach this threshold. This threshold should be subject to change, like that of the AI EO, as developers continue to push the frontier (e.g. by developing more efficient algorithms or realizing other compute efficiency gains).
While several other companies published “Responsible Capability Scaling” documents ahead of the UK AI Safety Summit, including DeepMind, Meta, Microsoft, Amazon, and Inflection AI, the rest of this report focuses primarily on OpenAI’s PF and Anthropic’s RSP.
Weaknesses of Preparedness Frameworks
Preparedness frameworks are not panaceas for AI-associated risks. Even with improvements in specificity, transparency, and strengthened risk mitigations, there are important weaknesses to the use of PFs. Here we outline a couple weaknesses of PFs and possible answers to them.
1. Spirit vs. text: PFs are voluntary commitments whose success depends on developers’ faithfulness to their principles.
Current risk thresholds and mitigations are defined loosely. In Anthropic’s RSP, for instance, the jump from the current risk level posed by Claude 2 (its state of the art model) to the next risk level is defined in part by the following: “Access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack….” A “substantial increase” is not well-defined. This ambiguity leaves room for interpretation; since implementing risk mitigations can be costly, developers could have an incentive to take advantage of such ambiguity if they do not follow the spirit of the policy.
This concern about the gap between following the spirit of the PF and following the text might be somewhat eased with more specificity about risk thresholds and associated mitigations, and especially with more transparency and public accountability to these commitments.
To their credit, OpenAI’s PF and Anthropic’s RSP show a serious approach to the risks of developing increasingly advanced AI systems. OpenAI’s PF includes a commitment to fine-tune its models to better elicit capabilities along particular risk categories, then evaluate “against these enhanced models to ensure we are testing against the ‘worst case’ scenario we know of.” They also commit to triggering risk mitigations “when any of the tracked risk categories increase in severity, rather than only when they all increase together.” And Anthropic “commit[s] to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL [AI Safety Level].” These commitments are costly signals that these developers are serious about their PFs.
2. Private commitment vs. public policy: PFs are unilateral commitments that individual developers take on; we might prefer more universal policy (or regulatory) approaches.
Private companies developing AI systems may not fully account for broader societal risks. Consider an analogy to climate change—no single company’s emissions are solely responsible for risks like sea level rise or extreme weather. The risk comes from the aggregate emissions of all companies. Similarly, AI developers may not consider how their systems interact with others across society, potentially creating structural risks. Like climate change, the societal risks from AI will likely come from the cumulative impact of many different systems. Unilateral commitments are poor tools to address such risks.
Furthermore, PFs might reduce the urgency for government intervention. By appearing safety-conscious, developers could diminish the perceived need for regulatory measures. Policymakers might over-rely on self-regulation by AI developers, potentially compromising public interest for private gains.
Policy can and should step into the gap left by PFs. Policy is more aligned to the public good, and as such is less subject to competing incentives. And policy can be enforced, unlike voluntary commitments. In general, preparedness frameworks and similar policies help hold private actors accountable to their public commitments; this effect is stronger with more specificity in defining risk thresholds, better evaluation methods, and more transparency in reporting. However, these policies cannot and should not replace government action to reduce catastrophic risks (especially structural risks) of frontier AI systems.
Suggested Criteria for Robust Preparedness Frameworks
These criteria are adapted from the ARC Evals post, Anthropic’s RSP, and OpenAI’s PF. Broadly, they are aspirational; no existing preparedness framework meets all or most of these criteria.
For each criterion, we explain the key considerations for developers adopting PFs. We analyze OpenAI’s PF and Anthropic’s RSP to illustrate the strengths and shortcomings of their approaches. Again, these policies are net-positive and should be encouraged. They demonstrate costly unilateral commitments to measuring and addressing catastrophic risk from their models; they meaningfully improve on the status quo. However, these initial PFs are underspecified and insufficiently conservative. Improvement in the state of the art of risk evaluation and mitigation, and subsequent updates, would make them more robust.
1. Preparedness frameworks should cover the breadth of potential catastrophic risks of developing frontier AI models.
These risks may include:
- CBRN risks. Advanced AI models might enable or aid the creation of chemical, biological, radiological, and/or nuclear threats. OpenAI’s PF includes CBRN risks as their own category; Anthropic’s RSP includes CBRN risks within risks from misuse.
- Model autonomy. Anthropic’s RSP defines this as: “risk that a model is capable of accumulating resources (e.g. through fraud), navigating computer systems, devising and executing coherent strategies, and surviving in the real world while avoiding being shut down.” OpenAI’s PF defines this as: “[enabling] actors to run scaled misuse that can adapt to environmental changes and evade attempts to mitigate or shut down operations. Autonomy is also a prerequisite for self-exfiltration, self-improvement, and resource acquisition.” OpenAI’s definition includes risk from misuse of a model in model autonomy; Anthropic’s focuses on risks from the model itself.
- Potential for misuse, including cybersecurity and critical infrastructure. OpenAI’s PF defines cybersecurity risk (in their own category) as “risks related to use of the model for cyber-exploitation to disrupt confidentiality, integrity, and/or availability of computer systems.” Anthropic’s RSP mentions cyber risks in the context of risks from misuse.
- Adverse impact on human users. OpenAI’s PF includes a tracked risk category for persuasion: “Persuasion is focused on risks related to convincing people to change their beliefs (or act on) both static and interactive model-generated content.” Anthropic’s RSP does not mention persuasion per se.
- Unknown future risks. As developers create and evaluate more highly capable models, new risk vectors might become clear. PFs should acknowledge that unknown future risks are possible with any jump in capabilities. OpenAI’s PF includes a commitment to tracking “currently unknown categories of catastrophic risk as they emerge.”
Preparedness frameworks should apply to catastrophic risks in particular because they govern the scaling of capabilities of the most advanced AI models, and because catastrophic risks are of the highest consequence to such development. PFs are one tool among many that developers of the most advanced AI models should use to prevent harm. Developers of advanced AI models tend to also have other “trustworthy AI” policies, which seek to prevent and address already-existing risks such as harmful outputs, disinformation, and synthetic sexual content. Despite PFs’ focus on potentially catastrophic risks, faithfully applying PFs may help developers catch many other kinds of risks as well, since they involve extensive evaluation for misuse potential and adverse human impacts.
2. Preparedness frameworks should define the developer’s acceptable risk level (“risk appetite”) in terms of likelihood and severity of risk, in accordance with the NIST AI Risk Management Framework, section Map 1.5.
Neither OpenAI nor Anthropic has publicly declared their risk appetite. This is a nascent field of research, as these risks are novel and perhaps less predictable than eg. nuclear accident risk.5 NIST and other standard-setting bodies will be crucial in developing AI risk metrology. For now, PFs should state developers’ risk appetites as clearly as possible, and update them regularly with research advances.6
AI developers’ risk appetites might be different than a regulatory risk appetite. Developers should elucidate their risk appetite in quantitative terms so their PFs can be evaluated accordingly. As in the case of nuclear technology, regulators may eventually impose risk thresholds on frontier AI developers. At this point, however, there is no standard, scientifically-grounded approach to measuring the potential for catastrophic AI risk; this has to start with the developers of the most capable AI models.
3. Preparedness frameworks should clearly define capability levels and risk thresholds. Risk thresholds should be quantified robustly enough to hold developers accountable to their commitments.
OpenAI and Anthropic both outline qualitative risk thresholds corresponding with different categories of risk. For instance, in OpenAI’s PF, the High risk threshold in the CBRN category reads: “Model enables an expert to develop a novel threat vector OR model provides meaningfully improved assistance that enables anyone with basic training in a relevant field (e.g., introductory undergraduate biology course) to be able to create a CBRN threat.” And Anthropic’s RSP defines the ASL-3 [AI Safety Level] threshold as: “Low-level autonomous capabilities, or access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack, as compared to a non-LLM baseline of risk.”
These qualitative thresholds are under-specified; reasonable people are likely to differ on what “meaningfully improved assistance” looks like, or a “substantial increase [in] the risk of catastrophic misuse.” In PFs, these thresholds should be quantified to the extent possible.
To be sure, the AI development research community currently lacks a good empirical understanding of the likelihood or quantification of frontier AI-related risks. Again, this is a novel science that needs to be developed with input from both the private and public sectors. Since this science is still developing, it is natural to want to avoid too much quantification. A conceivable failure mode is that developers “check the boxes,” which may become obsolete quickly, in lieu of using their judgment to determine when capabilities are dangerous enough to warrant higher risk mitigations. Again, as research improves, we should expect to see improvements in PFs’ specification of risk thresholds.
4. Preparedness frameworks should include detailed evaluation procedures for AI models, ensuring comprehensive risk assessment within a developer’s tolerance.
Anthropic and OpenAI both have room for improvement on detailing their evaluation procedures. Anthropic’s RSP includes evaluation procedures for model autonomy and misuse risks. Its evaluation procedures for model autonomy are impressively detailed, including clearly defined tasks on which it will evaluate its models. Its evaluation procedures for misuse risk are much less well-defined, though it does include the following note: “We stress that this will be hard and require iteration. There are fundamental uncertainties and disagreements about every layer…It will take time, consultation with experts, and continual updating.” And OpenAI’s PF includes a “Model Scorecard,” a mock evaluation of an advanced AI model. This model scorecard includes the hypothetical results of various evaluations in all four of their tracked risk categories; it does not appear to be a comprehensive list of evaluation procedures.
Again, the science of AI model evaluation is young. The AI EO directs NIST to develop red-teaming guidance for developers of potentially dual-use foundation models. NIST, along with private actors such as METR and other AI evaluators, will play a crucial role in creating and testing red-teaming practices and model evaluations that elicit all relevant capabilities.
5. For different risk thresholds, preparedness frameworks should identify and commit to pre-specified risk mitigations.
Classes of risk mitigations may include:
- Restricting development and/or deployment of models at different risk thresholds
- Enhanced cybersecurity measures, to prevent exfiltration of model weights
- Internal compartmentalization and tiered access
- Interacting with the model only in restricted environments
- Deleting model weights8
Both OpenAI’s PF and Anthropic’s RSP commit to a number of pre-specified risk mitigations for different thresholds. For example, for what Anthropic calls “ASL-2” models (including its most advanced model, Claude 2), they commit to measures including publishing model cards, providing a vulnerability reporting mechanism, enforcing an acceptable use policy, and more. Models at higher risk thresholds (what Anthropic calls “ASL-3” and above) have different, more stringent risk mitigations, including “limit[ing] access to training techniques and model hyperparameters…” and “implement[ing] measures designed to harden our security…”
Risk mitigations can and should differ in approaches to development versus deployment. There are different levels of risk associated with possessing models internally and allowing external actors to interact with them. Both OpenAI’s PF and Anthropic’s RSP include different risk mitigation approaches for development and deployment. For example, OpenAI’s PF restricts deployment of models such that “Only models with a post-mitigation score of “medium” or below can be deployed,” whereas it restricts development of models such that “Only models with a post-mitigation score of “high” or below can be developed further.”
Mitigations should be defined as specifically as possible, with the understanding that as the state of the art changes, this too is an area that will require periodic updates. Developers should include some room for judgment here.
6. Preparedness frameworks’ pre-specified risk mitigations must effectively address potentially catastrophic risks.
Having confidence that the risk mitigations do in fact address potential catastrophic risks is perhaps the most important and difficult aspect of a PF to evaluate. Catastrophic risk from AI is a novel and speculative field; evaluating AI capabilities is a science in its infancy; and there are no empirical studies of the effectiveness of risk mitigations preventing such risks. Given this uncertainty, frontier AI developers should err on the side of caution.
Both OpenAI and Anthropic should be more conservative in their risk mitigations. Consider OpenAI’s commitment to restricting development: “[I]f we reach (or are forecasted to reach) ‘critical’ pre-mitigation risk along any risk category, we commit to ensuring there are sufficient mitigations in place…for the overall post-mitigation risk to be back at most to ‘high’ level.” To understand this commitment, we have to look at their threshold definitions. Under the Model Autonomy category, the “critical” threshold in part includes: “model can self-exfiltrate under current prevailing security.” Setting aside that this threshold is still quite vague and difficult to evaluate (and setting aside the novelty of this capability), a model that approaches or exceeds this threshold by definition can self-exfiltrate, rendering all other risk mitigations ineffective. A more robust approach to restricting development would not permit training or possessing a model that comes close to exceeding this threshold.
As for Anthropic, consider their threshold for “ASL-3,” which reads in part: “Access to the model would substantially increase the risk of catastrophic misuse…” The risk mitigations for ASL-3 models include the following: “Harden security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors (e.g. states) cannot steal them without significant expense.” While an admirable approach to development of potentially dual-use foundation models, assuming state actors seek out tools whose misuse involves catastrophic risk, a more conservative mitigation would entail hardening security such that it is unlikely that any actor, state or non-state, could steal the model weights of such a model.9
7. Preparedness frameworks should combine credible risk mitigation commitments with governance structures that ensure these commitments are fulfilled.
Preparedness Frameworks should detail governance structures that incentivize actually undertaking pre-committed risk mitigations when thresholds are met. Other incentives, including profit and shareholder value, sometimes conflict with risk management.
Anthropic’s RSP includes a number of procedural commitments meant to enhance the credibility of its risk mitigation commitments. For example, Anthropic commits to proactively planning to pause scaling of its models,10 publicly sharing evaluation results, and appointing a “Responsible Scaling Officer.” However, Anthropic’s RSP also includes the following clause: “[I]n a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to lead to imminent global catastrophe if not stopped…we could envisage a substantial loosening of these restrictions as an emergency response…” This clause potentially undermines the credibility of Anthropic’s other commitments in the RSP, if at any time it can point to another actor who in its view is scaling recklessly.
OpenAI’s PF also outlines commendable governance measures, including procedural commitments, meant to enhance its risk mitigation credibility. It summarizes its operation structure: “(1) [T]here is a dedicated team “on the ground” focused on preparedness research and monitoring (Preparedness team), (2) there is an advisory group (Safety Advisory Group) that has a sufficient diversity of perspectives and technical expertise to provide nuanced input and recommendations, and (3) there is a final decision-maker (OpenAI Leadership, with the option for the OpenAI Board of Directors to overrule).”
8. Preparedness frameworks should include a mechanism for regular updates to the framework itself, in light of ongoing research and advances in AI.
Both OpenAI’s PF and Anthropic’s RSP acknowledge the importance of regular updates. This is reflected in both of these documents’ names: Anthropic labels its RSP as “Version 1.0,” while OpenAI’s PF is labeled as “(Beta).”
Anthropic’s RSP includes an “Update Process” that reads in part: “We expect most updates to this process to be incremental…as we learn more about model safety features or unexpected capabilities…” This language directly commits Anthropic to changing its RSP as the state of the art changes. OpenAI references updates throughout its PF, notably committing to updating its evaluation methods and rubrics (“The Scorecard will be regularly updated by the Preparedness team to help ensure it reflects the latest research and findings”).
9. For models with risk above the lowest level, most evaluation results and methods should be public, including any performed mitigations.
Publishing model evaluations and mitigations is an important tool for holding developers accountable to their PF commitments. Sensitivity about the level of transparency is key. For example, full information about evaluation methodology and risk mitigations could be exploited by malicious actors. Anthropic’s RSP takes a balanced approach in committing to “[p]ublicly share evaluation results after model deployment where possible, in some cases in the initial model card, in other cases with a delay if it serves a broad safety interest.” OpenAI’s PF does not commit to publishing its Model Scorecards, but OpenAI has since published related research on whether its models aid the creation of biological threats.
Conclusion
Preparedness frameworks represent a promising approach for AI developers to voluntarily commit to robust risk management practices. However, current versions have weaknesses—particularly their lack of specificity in risk thresholds, insufficiently conservative risk mitigation approaches, and inadequacy in addressing structural risks. Frontier AI developers without PFs should consider adopting them, and OpenAI and Anthropic should update their policies to strengthen risk mitigations and include more specificity.
Strengthening preparedness frameworks will require advancing AI safety science to enable precise risk quantification and develop new mitigations. NIST, academics, and companies plan to collaborate to measure and model frontier AI risks. Policymakers have a crucial opportunity to adapt regulatory approaches from other high-risk technologies like nuclear power to balance AI innovation and catastrophic risk prevention. Furthermore, standards bodies could develop more robust AI evaluations best practices, including guidance for third-party auditors.
Overall the AI community must view safety as an intrinsic priority, not just private actors creating preparedness frameworks. All stakeholders, including private companies, academics, policymakers and civil society organizations have roles to play in steering AI development toward societally beneficial outcomes. Preparedness frameworks are one tool, but not sufficient absent more comprehensive, multi-stakeholder efforts to scale AI safely and for the public good.
Many thanks to Madeleine Chang, Di Cooke, Thomas Woodside, and Felipe Calero Forero for providing helpful feedback.
A National AI for Good Initiative
Summary
Artificial intelligence (AI) and machine learning (ML) models can solve well-specified problems, like automatically diagnosing disease or grading student essays, at scale. But applications of AI and ML for major social and scientific problems are often constrained by a lack of high-quality, publicly available data—the foundation on which AI and ML algorithms are built.
The Biden-Harris Administration should launch a multi-agency initiative to coordinate the academic, industry, and government research community to support the identification and development of datasets for applications of AI and ML in domain-specific, societally valuable contexts. The initiative would include activities like generating ideas for high-impact datasets, linking siloed data into larger and more useful datasets, making existing datasets easier to access, funding the creation of real-world testbeds for societally valuable AI and ML applications, and supporting public-private partnerships related to all of the above.