
Moving Beyond Pilot Programs to Codify and Expand Continuous AI Benchmarking in Testing and Evaluation

06.11.25 | 12 min read | Text by Kateryna Halstead

Rapidly integrating and diffusing advanced AI within the Department of Defense (DoD) and other government agencies has emerged as a critical national security priority. This convergence of rapid AI advancement and DoD prioritization creates an urgent need to ensure that AI models integrated into defense operations are reliable, safe, and mission-enhancing. For this purpose, the DoD must deploy and expand one of the most critical tools available within its Testing and Evaluation (T&E) process: benchmarking—the structured practice of applying shared tasks and metrics to compare models, track progress, and expose performance gaps.

A standardized AI benchmarking framework is critical for delivering uniform, mission-aligned evaluations across the DoD. Despite the importance of such benchmarks, the DoD currently lacks standardized, enforceable AI safety benchmarks, especially for open-ended or adaptive use cases. A shift from ad hoc to structured assessments will support more informed, trusted, and effective procurement decisions.

Particularly at the acquisition stage for AI models, rapid DoD acquisition platforms such as Tradewinds can serve as the policy vehicle for enabling more robust benchmarking efforts. This can be done through the establishment of a federally coordinated benchmarking hub, spearheaded by a joint effort between the Chief Digital and Artificial Intelligence Office (CDAO) and the Defense Innovation Unit (DIU), in consultation with the newly established Chief AI Officers Council (CAIOC) of the White House Office of Management and Budget (OMB).

Challenge and Opportunity

Experts at the intersection of AI and defense, such as retired Lieutenant General John (Jack) N.T. Shanahan, have emphasized the profound impact of AI on the way the United States will fight future wars, with the character of war continuously reshaped by AI's diffusion across all domains. The DoD is committed to remaining at the forefront of these changes: between 2022 and 2023, the value of federal AI contracts increased by over 1200%, with the surge driven by increases in DoD spending. Secretary of Defense Pete Hegseth has pledged increased investment in AI specifically for military modernization efforts and has tasked the Army with implementing AI in command and control across theater, corps, and division headquarters by 2027, further underscoring AI's transformative impact on modern warfare.

Strategic competitors—especially the People’s Republic of China—are rapidly integrating AI into their military and technological systems. The Chinese Communist Party views AI-enabled science and technology as central to accelerating military modernization and achieving global leadership. At this pivotal moment, the DoD is pushing to adopt advanced AI across operations to preserve the U.S. edge in military and national security applications. Yet, accelerating too quickly without proper safeguards risks exposing vulnerabilities adversaries could exploit.

With the DoD at a unique inflection point, it must balance the rapid adoption and integration of AI into its operations with the need for oversight and safety. DoD needs AI systems that consistently meet clearly defined performance standards set by acquisition authorities, operate strictly within the scope of their intended use, and do not exhibit unanticipated or erratic behaviors under operational conditions. These systems can deliver measurable value to mission outcomes while fostering trust and confidence among human operators through predictability, transparency, and alignment with mission-specific requirements.

AI benchmarks are standardized tasks and metrics that systematically measure a model’s performance, reliability, and safety, and have increasingly been adopted as a key measurement tool by the AI industry. Currently, DoD lacks standardized, comprehensive AI safety benchmarks, especially for open-ended or adaptive use cases. Without these benchmarks, the DoD risks acquiring models that underperform, deviate from mission requirements, or introduce avoidable vulnerabilities, leading to increased operational risk, reduced mission effectiveness, and costly contract revisions.  
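To make the concept concrete, the sketch below (in Python, with hypothetical tasks, a stand-in model, and a simple exact-match metric) illustrates what a benchmark reduces to in practice: a fixed task set and a fixed scoring rule applied identically to every candidate model so that results are directly comparable.

```python
# Minimal illustration of a benchmark: fixed tasks, fixed metric, comparable scores.
# Task content, the stand-in model, and the metric are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkTask:
    name: str
    prompt: str
    expected: str  # reference answer used for scoring

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the model output matches the reference answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_benchmark(model: Callable[[str], str], tasks: list[BenchmarkTask]) -> float:
    """Apply every task to the model and return the mean score."""
    scores = [exact_match(model(t.prompt), t.expected) for t in tasks]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Hypothetical task set; a real defense benchmark would hold many mission-specific items.
    tasks = [
        BenchmarkTask("unit-id", "Identify the unit type: 'armored column, 12 vehicles'", "armor"),
        BenchmarkTask("report-summary", "Summarize: 'no contact observed on route blue'", "no contact"),
    ]
    # Stand-in "model": any callable mapping a prompt to a string response.
    stub_model = lambda prompt: "armor" if "vehicles" in prompt else "no contact"
    print(f"Benchmark score: {run_benchmark(stub_model, tasks):.2f}")
```

Because every model is run against the same tasks and scored the same way, the resulting numbers can be tracked over time and compared across vendors.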

A recent report from the Center for a New American Security (CNAS) on best practices for AI T&E found that the rapid and unpredictable pace of AI advancement presents distinctive challenges for both policymakers and end-users. The accelerating pace of adoption and innovation heightens both the urgency and complexity of establishing effective AI benchmarks to ensure acquired models meet the mission-specific performance standards required by the DoD and the services.

The DoD faces particularly outsized risk, as its unique operational demands can expose AI models to extreme conditions where performance may degrade. Under adversarial conditions, for example, or when encountering data that differs from its training distribution, an AI model may behave unpredictably, posing heightened risk to the mission. Robust evaluations, such as those offered through benchmarking, help to identify points of failure or harmful model capabilities before they surface during critical use cases. By measuring model performance in operationally realistic scenarios and environments, evaluators can better understand attack surface vulnerabilities to adversarial inputs, identify inaccurate or over-confident outputs, recognize potential failures in edge cases and extreme scenarios (including those beyond training parameters), improve human-AI performance and trust, and detect unintended capabilities. Benchmarking helps surface these issues early.
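As an illustration of one of these failure modes, the following sketch (using a toy stub classifier and invented inputs, not any DoD system) shows how a simple calibration check can expose over-confident outputs when a model encounters data beyond its training parameters: reported confidence stays high while accuracy collapses.

```python
# Illustrative check for over-confident outputs on out-of-distribution inputs.
# The stub classifier and both input sets are hypothetical placeholders.
from statistics import mean

def stub_classifier(x: float) -> tuple[int, float]:
    """Toy model: predicts class 1 above a threshold, always reports high confidence."""
    return (1 if x > 0.5 else 0, 0.95)  # confidence does not adapt to unfamiliar inputs

def overconfidence_gap(inputs: list[float], labels: list[int]) -> float:
    """Mean reported confidence minus actual accuracy; a large positive gap means over-confidence."""
    preds = [stub_classifier(x) for x in inputs]
    accuracy = mean(1.0 if p == y else 0.0 for (p, _), y in zip(preds, labels))
    confidence = mean(c for _, c in preds)
    return confidence - accuracy

if __name__ == "__main__":
    # In-distribution: inputs resemble training data, labels follow the learned rule.
    in_dist = ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
    # Shifted/edge-case inputs: the rule no longer holds, but reported confidence stays at 0.95.
    shifted = ([0.55, 0.6, 0.45, 0.4], [0, 0, 1, 1])
    print(f"In-distribution gap: {overconfidence_gap(*in_dist):+.2f}")
    print(f"Edge-case gap:       {overconfidence_gap(*shifted):+.2f}")
```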

Robust AI benchmarking frameworks can enhance U.S. leadership by shaping international norms for military AI safety, improving acquisition efficiency by screening out underperforming systems, and surfacing unintended or high-risk model behaviors before deployment. Furthermore, benchmarking enables AI performance to be quantified in alignment with mission needs, using guidance from the CDAO RAI Toolkit and clear acquisition parameters to support decision-making for both procurement officers and warfighters. Given the DoD’s high-risk use cases and unique mission requirements, robust benchmarking is even more essential than in the commercial sector.

The DoD now has an opportunity to formalize AI safety benchmark frameworks within its T&E processes, tailored to both dual-use and defense-specific applications. T&E is already embedded in DoD culture, offering a strong foundation for expanding benchmarking. Public-private AI testing initiatives, such as the DoD's collaboration with Scale AI to create effective T&E (including benchmarking) for AI models, show promise and demonstrate existing momentum. Yet critical policy gaps remain. With pilot programs underway, the DoD can move beyond vendor-led or ad hoc evaluations to introduce DoD-led testing, assess mission-specific capabilities, launch post-acquisition benchmarking, and develop human-AI team metrics. The widely used Tradewinds platform offers an existing vehicle for integrating these enhanced benchmarks without reinventing the wheel.

To implement robust benchmarking at the DoD, this memo proposes the following policy recommendations, to be coordinated by the CDAO:

Plan of Action 

The CDAO should launch a formalized AI Benchmarking Initiative that moves beyond current vendor-led pilot programs while continuing to refine its private industry partnerships. The effort should be comprehensive and collaborative, leveraging internal technical expertise as well as newly established AI coordinating bodies: the Chief AI Officers Council, which can help ensure that DoD benchmarking practices align with federal priorities, and the Defense Innovation Unit, which can serve as a bridge and coordinator between private industry and the national defense sector. Specifically, the CDAO should integrate benchmarking into the acquisition pipeline, establishing ongoing practices that enable continuous evaluation of model performance across the entire model lifecycle.

Policy Recommendations 

Recommendation 1. Establish a Standardized Defense AI Benchmarking Initiative and Create a Centralized Repository of Benchmarks

The DoD should build on lessons learned from its partnership with Scale AI (and others) to develop benchmarks specifically for defense use cases, expanding that work into a standardized, department-wide framework.

This recommendation is in line with findings from RAND, which calls for developing a comprehensive framework for robust evaluation and emphasizes the need for collaborative practices and measurable metrics of model performance.

The DoD should incorporate the following recommendations and government entities to achieve this goal:

Develop a Whole-of-Government Approach to AI Benchmarking

If internal reallocation from the $500 million allocation proves insufficient or unviable, Congressional approval of additional funds offers another funding source. Given the strategic importance of AI in defense, such requests can readily find bipartisan support, particularly when tied to operational success and risk mitigation.

Recommendation 2. Formalize Pre-Deployment Benchmarking for AI Models at the Acquisition Stage 

The key to meaningful benchmarking lies in integrating it at the pre-award stage of procurement. The DoD should establish a formal process that:

Recommendation 3. Contextualize Benchmarking into Operational Environments

Current efforts to scale and integrate AI reflect the distinct operational realities of the DoD and the military services. Scale AI, in partnership with the DoD, Anduril, Microsoft, and the CDAO, is developing AI-powered solutions focused on United States Indo-Pacific Command (INDOPACOM) and United States European Command (EUCOM). With these regional command-focused AI solutions, it makes sense to create equally focused benchmarking standards that test AI model performance in specific environments and under unique conditions. Researchers have been identifying the limits of traditional AI benchmarking and making the case for bespoke, holistic, and use-case-relevant benchmark development. This is vital because, as AI models advance, they introduce entirely new capabilities that require more robust testing and evaluation. Large language models, for example, have introduced new functionalities such as natural language querying and multimodal search interfaces, which require entirely new benchmarks measuring natural language understanding, modal integration accuracy, context retention, and result usefulness. In the same vein, DoD-relevant benchmarks must be developed in an operationally relevant context. This can be achieved by:

Frameworks such as Holistic Evaluation of Language Models (HELM) and Focused LLM Ability Skills and Knowledge (FLASK) can offer valuable guidance for developing LLM-focused benchmarks within the DoD, by enabling more comprehensive evaluations based on specific model skill sets, use-case scenarios, and tailored performance metrics.
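As a rough illustration of this HELM/FLASK-style approach, the sketch below scores a model per skill dimension named above (natural language understanding, context retention, result usefulness) rather than collapsing results into a single number; the tasks, the scoring proxy, and the stand-in model are hypothetical placeholders.

```python
# Sketch of a multi-dimensional scorecard in the spirit of HELM/FLASK: the model is
# scored per skill dimension instead of one aggregate number. Tasks are invented.
DIMENSION_TASKS = {
    "natural_language_understanding": [("Paraphrase: 'hostile contact at grid NK1234'", "enemy contact at NK1234")],
    "context_retention":              [("Earlier you were told the convoy departs at 0600. When does it depart?", "0600")],
    "result_usefulness":              [("List the two routes mentioned: 'use route blue, avoid route red'", "blue, red")],
}

def contains_all(output: str, expected: str) -> float:
    """Crude proxy metric: fraction of expected key terms present in the output."""
    terms = [t.strip() for t in expected.replace(",", " ").split()]
    return sum(t.lower() in output.lower() for t in terms) / len(terms)

def scorecard(model) -> dict[str, float]:
    """Return one score per skill dimension, keeping weaknesses visible."""
    card = {}
    for dim, tasks in DIMENSION_TASKS.items():
        card[dim] = sum(contains_all(model(p), e) for p, e in tasks) / len(tasks)
    return card

if __name__ == "__main__":
    stub_model = lambda prompt: "enemy contact at NK1234, convoy departs 0600, routes blue and red"
    for dim, score in scorecard(stub_model).items():
        print(f"{dim:35s} {score:.2f}")
```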

Recommendation 4. Integration of Human-in-the-Loop Benchmarking 

An additional layer of AI benchmarking for safe and effective AI diffusion into the DoD ecosystem is evaluating AI-human team performance and measuring user trust, perceptions, and confidence in various AI models. "Human-in-the-loop" systems require a person to approve or adjust the AI's decision before action, while "human-on-the-loop" systems allow autonomous operation but keep a person supervising and ready to intervene. Both are critical components of the DoD and military approach to AI, and both require continued human oversight of ethical and safety considerations for AI-enabled capabilities with national security implications. A recent MIT study found surprising performance gaps between AI-only, human-only, and AI-human teams. Because the DoD relies heavily on user-AI teams, it is especially important to measure these performance gaps across the various AI models it plans to integrate into its operations.
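A minimal sketch of how such gaps could be measured is shown below: the same task set is scored under AI-only, human-only, and human-AI team conditions, with an override rate recorded as a rough trust signal. The trial records are invented for illustration and do not reflect the MIT study's data.

```python
# Illustrative measurement of human-AI team performance gaps and a simple trust proxy.
# The trial log below is hypothetical placeholder data.
from statistics import mean

# Each record: (condition, answer_correct, operator_overrode_ai)
trial_log = [
    ("ai_only",    True,  None), ("ai_only",    True,  None), ("ai_only",    False, None),
    ("human_only", True,  None), ("human_only", False, None), ("human_only", False, None),
    ("team",       True,  False), ("team",      False, True), ("team",       True,  True),
]

def accuracy(condition: str) -> float:
    """Fraction of correct outcomes under the given condition."""
    return mean(correct for cond, correct, _ in trial_log if cond == condition)

def override_rate() -> float:
    """Fraction of team trials where the operator overrode the AI recommendation."""
    return mean(overrode for cond, _, overrode in trial_log if cond == "team")

if __name__ == "__main__":
    for cond in ("ai_only", "human_only", "team"):
        print(f"{cond:11s} accuracy: {accuracy(cond):.2f}")
    print(f"team override rate: {override_rate():.2f}")
```

A high override rate paired with strong AI-only accuracy would flag a trust gap of the kind the DARPA ACE case below illustrates.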

A CNAS report on effective T&E for AI spotlighted the DARPA Air Combat Evolution (ACE) program, which sought autonomous air‑combat agents needing minimal human intervention. Expert test pilots could override the system, yet often did so prematurely, distrusting its unfamiliar tactics. This case underscores the need for early, extensive benchmarks that test user capacity, surface trust gaps that can cripple human‑AI teams, and assure operators that models meet legal and ethical standards. Accordingly, this memo urges expanding benchmarking beyond pure model performance to AI‑human team evaluations in high‑risk national‑security, lethal, or error‑sensitive environments.

Conclusion

The Department of Defense is racing to integrate AI across every domain of warfare, yet speed without safety will jeopardize mission success and national security. Standardized, acquisition-integrated, continuous, and mission-specific benchmarking is therefore not a luxury; it is the backbone of responsible AI deployment. Current pilot programs with private partners are encouraging starts, but they remain too ad hoc and narrow to match the scale and tempo of modern AI development.

Benchmarking must begin at the pre‑award acquisition stage and follow systems through their entire lifecycle, detecting risks, performance drift, and adversarial vulnerabilities before they threaten operations. As the DARPA ACE program showed, early testing of human‑AI teams and rigorous red‑teaming surface trust gaps and hidden failure modes that vendor‑led evaluations often miss. Because AI models—and enemy capabilities—evolve constantly, our evaluation methods must evolve just as quickly.

By institutionalizing robust benchmarks under CDAO leadership, in concert with the Defense Innovation Unit and the Chief AI Officers Council, the DoD can set world-class standards for military AI safety while accelerating reliable procurement. Ultimately, AI benchmarking is not a hurdle to innovation and acquisition; rather, it is the infrastructure that makes rapid acquisition more reliable and innovation more viable. The DoD cannot afford to deploy AI systems that are unreliable, ineffective, or misaligned with mission needs and standards in high-risk operational environments. At this inflection point, the choice is not between speed and safety but between ungoverned acceleration and a calculated momentum that allows our strategic AI advantage to be both sustained and secured.

This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.

Frequently Asked Questions

What is the Scale AI benchmarking pilot program at DoD, and why and how does this policy proposal build on this initiative?

The Scale AI benchmarking initiative, launched in February 2024 in partnership with the DoD, is a pilot framework designed to evaluate the performance of AI models intended for defense and national security applications. It is part of the CDAO's broader effort to create a framework for T&E of AI models.

This memo builds on that foundation by:



  • Formalizing benchmarking as a standard requirement at the procurement stage across DoD acquisition processes.

  • Inserting benchmarking protocols into rapid acquisition platforms like Tradewinds.

  • Establishing a defense-specific benchmarking repository and enabling red-teaming led by the AI Rapid Capabilities Cell (AI RCC) within the CDAO.

  • Shifting benchmarking from vendor-led to internally developed, led, and implemented, with bespoke evaluation criteria tailored to specific mission needs.

What types of AI systems will these benchmarks apply to, and how will they be tailored for national security use cases?

The proposed benchmarking framework will apply to a diverse range of AI systems, including:



  • Decision-making and command and control support tools (sensors, target recognition, process automation, and tools involved in natural language processing).

  • Generative models for planning, logistics, intelligence, or data generation.

  • Autonomous agents, such as drones and robotic systems.


Benchmarks will be theater- and context-specific, reflecting real-world environments (e.g., contested INDOPACOM scenarios), end-user roles (human-AI teaming in combat), and mission-specific risk factors such as adversarial interference and model drift.
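One possible way to encode this kind of theater- and context-specific tailoring is sketched below; the scenario fields and threshold values are hypothetical and would in practice be set by the acquiring command and the CDAO.

```python
# Sketch of a theater- and context-specific benchmark scenario specification.
# All field values below are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class BenchmarkScenario:
    theater: str                               # e.g. "INDOPACOM", "EUCOM"
    environment: str                           # contested, degraded comms, denied GPS, ...
    end_user_role: str                         # who sits on the loop with the model
    risk_factors: list[str] = field(default_factory=list)
    pass_thresholds: dict[str, float] = field(default_factory=dict)

indopacom_scenario = BenchmarkScenario(
    theater="INDOPACOM",
    environment="contested maritime, degraded communications",
    end_user_role="watch officer using human-AI teaming for track classification",
    risk_factors=["adversarial interference", "model drift", "sensor spoofing"],
    pass_thresholds={"classification_accuracy": 0.95, "false_positive_rate": 0.02},
)

if __name__ == "__main__":
    print(indopacom_scenario)
```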

How will this benchmarking framework approach open-source or non-proprietary AI models intended for DoD use?

Open-source models present distinct challenges related to model ownership and origin, additional exposure to data poisoning, and downstream user manipulation. At the same time, their greater transparency and potential access to training data can make open-source models less challenging to put through rigorous T&E.


This memo recommends:



  • Applying standardized evaluation criteria across both open-source and proprietary models, developed using the AI benchmarking repository and tailored to each model’s possible use cases.

  • Incorporating benchmarking to test possible areas of vulnerability for downstream user manipulation.

  • Measuring the transparency of training data.

  • Performing adversarial testing to assess resilience against manipulated inputs via red-teaming.

  • Logging open-source model performance in the proposed centralized repository, enabling ongoing monitoring for drift and other issues (illustrated in the sketch below).
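A minimal sketch of the last point, assuming a simple in-memory stand-in for the proposed centralized repository, is shown below: each benchmark run is logged, and later runs that fall meaningfully below the model's own baseline are flagged as potential drift. Model names, scores, and the tolerance value are hypothetical.

```python
# Sketch of logging benchmark runs and flagging drift against an earlier baseline.
# The in-memory "repository" stands in for the proposed centralized benchmarking hub.
from datetime import date

repository: list[dict] = []

def log_run(model_id: str, benchmark: str, score: float, run_date: date) -> None:
    """Record a benchmark run in the repository."""
    repository.append({"model": model_id, "benchmark": benchmark, "score": score, "date": run_date})

def drift_alerts(model_id: str, benchmark: str, tolerance: float = 0.05) -> list[str]:
    """Flag any run scoring more than `tolerance` below the model's first logged baseline."""
    runs = sorted((r for r in repository if r["model"] == model_id and r["benchmark"] == benchmark),
                  key=lambda r: r["date"])
    if not runs:
        return []
    baseline = runs[0]["score"]
    return [f"{r['date']}: score {r['score']:.2f} vs baseline {baseline:.2f}"
            for r in runs[1:] if baseline - r["score"] > tolerance]

if __name__ == "__main__":
    log_run("osm-7b", "target-recognition", 0.91, date(2025, 1, 15))
    log_run("osm-7b", "target-recognition", 0.89, date(2025, 4, 15))
    log_run("osm-7b", "target-recognition", 0.78, date(2025, 7, 15))  # degraded after upstream changes
    print(drift_alerts("osm-7b", "target-recognition"))
```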

Why is red-teaming a necessity in addition to AI benchmarking, and how will it be executed?

Red-teaming applies adversarial stress-testing (which can be more robust and operationally relevant if led by an internal team, as this memo proposes) to identify vulnerabilities and unintended capabilities before deployment. Internally led red-teaming, in particular, is critical for evaluating models intended for use in unpredictable or hostile environments.

How will red-teaming be executed?

To effectively employ the red-teaming efforts, this policy recommends that:



  • The AI Rapid Capabilities Cell within the CDAO should lead red-teaming operations, leveraging the team’s technical capabilities, experience, and mission set to integrate and rapidly scale AI at the speed of relevance, delivering usable capability fast enough to affect current operations and decision cycles.

  • Internal, technically skilled teams should be created that are capable of incorporating classified threat models and edge-case scenarios.

  • Red-teaming should focus on simulating realistic mission conditions, and searching for specific model capabilities, going beyond generic or vendor-supplied test cases.

How does this benchmarking framework improve acquisition decisions and reduce risks?

Integrating benchmarking at the acquisition stage enables procurement officers to:



  • Compare models on mission-relevant, standardized performance metrics and ensure there is measurable evidence that a model aligns with their own “vision of success” procurement requirements.

  • Identify and avoid models with unsafe, misaligned, unverified, or ineffective capabilities.

  • Prevent cost-overruns or contract revisions.


Benchmarking workshops for acquisition officers can further equip them with the skills to interpret benchmark results and apply them to their operational requirements.
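As an illustration of how benchmark results could feed a pre-award decision, the sketch below compares hypothetical candidate models against equally hypothetical "vision of success" thresholds; real requirements would come from the acquiring office's documented performance parameters.

```python
# Illustrative pre-award screen: candidate benchmark results vs procurement thresholds.
# Model names, metric names, and all numbers are hypothetical placeholders.
REQUIREMENTS = {"mission_accuracy": 0.90, "robustness_under_jamming": 0.80, "false_alert_rate_max": 0.05}

CANDIDATES = {
    "vendor_a_model": {"mission_accuracy": 0.93, "robustness_under_jamming": 0.84, "false_alert_rate": 0.03},
    "vendor_b_model": {"mission_accuracy": 0.95, "robustness_under_jamming": 0.71, "false_alert_rate": 0.02},
}

def meets_requirements(scores: dict[str, float]) -> bool:
    """Check each benchmark score against the acquiring office's thresholds."""
    return (scores["mission_accuracy"] >= REQUIREMENTS["mission_accuracy"]
            and scores["robustness_under_jamming"] >= REQUIREMENTS["robustness_under_jamming"]
            and scores["false_alert_rate"] <= REQUIREMENTS["false_alert_rate_max"])

if __name__ == "__main__":
    for name, scores in CANDIDATES.items():
        verdict = "meets" if meets_requirements(scores) else "does not meet"
        print(f"{name}: {verdict} the benchmark thresholds")
```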
