Emerging Technology

day one project

Moving Beyond Pilot Programs to Codify and Expand Continuous AI Benchmarking in Testing and Evaluation

06.11.25 | 12 min read | Text by Kateryna Halstead

Rapid and advanced AI integration and diffusion within the Department of Defense (DoD) and other government agencies has emerged as a critical national security priority. This convergence of rapid AI advancement and DoD prioritization creates an urgent need to ensure that AI models integrated into defense operations are reliable, safe, and mission-enhancing. For this purpose, the DoD must deploy and expand one of its most critical tools available within its Testing and Evaluation (T&E) process: benchmarking—the structured practice of applying shared tasks and metrics to compare models, track progress, and expose performance gaps.

A standardized AI benchmarking framework is critical for delivering uniform, mission-aligned evaluations across the DoD. Despite their importance, the DoD currently lacks standardized, enforceable AI safety benchmarks, especially for open-ended or adaptive use cases. A shift from ad hoc to structured assessments will support more informed, trusted, and effective procurement decisions.

Particularly at the acquisition stage for AI models, rapid DoD acquisition platforms such as Tradewinds can serve as the policy vehicle for enabling more robust benchmarking efforts. This can be done with the establishment of a federally coordinated benchmarking hub, spearheaded by a coordinated effort between the Chief Data and Artificial Intelligence Officer (CDAO) and Defense Innovation Unit (DIU) in consultation with the newly established Chief AI Officer’s Council (CAIOC) of the White House Office of Management and Budget (OMB).

Challenge and Opportunity

Experts at the intersection of both AI and defense, such as the retired Lieutenant General John (Jack) N.T. Shanahan, have emphasized the profound impact of AI on the way the United States will fight future wars – with the character of war continuously reshaped by AI’s diffusion across all domains. The DoD is committed to remaining at the forefront of these changes: between 2022-2023, the value of federal AI contracts increased by over 1200%, with the surge driven by increases in DoD spending. Secretary of Defense Pete Hegseth has pledged increased investment in AI specifically for military modernization efforts, and has tasked the Army to implement AI in command and control across the theater, corps, and division headquarters by 2027–further underscoring AI’s transformative impact on modern warfare.

Strategic competitors—especially the People’s Republic of China—are rapidly integrating AI into their military and technological systems. The Chinese Communist Party views AI-enabled science and technology as central to accelerating military modernization and achieving global leadership. At this pivotal moment, the DoD is pushing to adopt advanced AI across operations to preserve the U.S. edge in military and national security applications. Yet, accelerating too quickly without proper safeguards risks exposing vulnerabilities adversaries could exploit.

With the DoD at a unique inflection point, it must balance the rapid adoption and integration of AI into its operations with the need for oversight and safety. DoD needs AI systems that consistently meet clearly defined performance standards set by acquisition authorities, operate strictly within the scope of their intended use, and do not exhibit unanticipated or erratic behaviors under operational conditions. These systems can deliver measurable value to mission outcomes while fostering trust and confidence among human operators through predictability, transparency, and alignment with mission-specific requirements.

AI benchmarks are standardized tasks and metrics that systematically measure a model’s performance, reliability, and safety, and have increasingly been adopted as a key measurement tool by the AI industry. Currently, DoD lacks standardized, comprehensive AI safety benchmarks, especially for open-ended or adaptive use cases. Without these benchmarks, the DoD risks acquiring models that underperform, deviate from mission requirements, or introduce avoidable vulnerabilities, leading to increased operational risk, reduced mission effectiveness, and costly contract revisions.

A recent report from the Center for a New American Security (CNAS) on best practices for AI T&E outlined that the rapid and unpredictable pace of AI advancement presents distinctive challenges for both policymakers and end-users. The accelerating pace of adoption and innovation heightens both the urgency and complexity of establishing effective AI benchmarks to ensure acquired models meet the mission-specific performance standards required by the DoD and the services.

The DoD faces particularly outsized risk, as its unique operational demands can expose AI models to extreme conditions where performance may degrade. For example, under adversarial conditions, or when encountering data that is different from its training, an AI model may behave unpredictably, posing heightened risk to the mission. Robust evaluations, such as those offered through benchmarking, help to identify points of failure or harmful model capabilities before they become apparent during critical use cases. By measuring model performance in real-world applicable scenarios and environments, we increase understanding of attack surface vulnerabilities to adversarial inputs. We can identify inaccurate or over-confident measurements of outputs, and recognize potential failures in edge cases and extreme scenarios (including those beyond training parameters, Moreover, we improve human-AI performance and trust factors, and avoid unintended capabilities. Benchmarking helps to surface these issues early.

Robust AI benchmarking frameworks can enhance U.S. leadership by shaping international norms for military AI safety, improving acquisition efficiency by screening out underperforming systems, and surfacing unintended or high-risk model behaviors before deployment. Furthermore, benchmarking enables AI performance to be quantified in alignment with mission needs, using guidance from the CDAO RAI Toolkit and clear acquisition parameters to support decision-making for both procurement officers and warfighters. Given the DoD’s high-risk use cases and unique mission requirements, robust benchmarking is even more essential than in the commercial sector.

The DoD now has an opportunity to formalize AI safety benchmark frameworks within its Testing and Evaluation (T&E) processes, tailored to both dual-use and defense-specific applications. T&E is already embedded in DoD culture, offering a strong foundation for expanding benchmarking. Public-private AI testing initiatives, such as the DoD collaboration with Scale AI to create effective T&E (including through benchmarking) for AI models show promise and existing motivation for such initiatives. Yet, critical policy gaps still exist. With pilot programs underway, the DoD can move beyond vendor-led or ad hoc evaluations to introduce DoD-led testing, assess mission-specific capabilities, launch post-acquisition benchmarking, and develop human-AI team metrics. The widely used Tradewinds platform offers an existing vehicle to integrate these enhanced benchmarks without reinventing the wheel.

To implement robust benchmarking at DoD, this memo proposes the following policy recommendations, to be coordinated by DoD Chief Digital and Artificial Intelligence Office (CDAO):

Expanding on existing benchmarking efforts
Standardizing AI safety thresholds during the procurement cycle
Implementing benchmarking during the lifecycle of the model
Establishing a benchmarking repository
Enabling adversarial stress testing, or “red-teaming”, prior to deployment to enhance current benchmarking gaps for DoD AI use cases

Plan of Action

The CDAO should launch a formalized AI Benchmarking Initiative, moving beyond current vendor-led pilot programs, while continuing to refine its private industry initiatives. This effort should be comprehensive and collaborative in nature, leveraging internal technical expertise. This includes the newly established coordinating bodies on AI such as the Chief AI Officer’s Council, which can help to ensure that DoD benchmarking practices are aligned with federal priorities, and the Defense Innovation Unit, which can be an excellent private industry-national defense sector bridge and coordinator in these efforts. Specifically, the CDAO should integrate benchmarking into the acquisition pipeline. This will establish ongoing benchmarking practices that facilitate continuous model performance evaluation through the entirety of the model lifecycle.

Policy Recommendations

Recommendation 1. Establish a Standardized Defense AI Benchmarking Initiative and create a Centralized Repository of Benchmarks

The DoD should build on lessons learned from its partnership with Scale AI (and others) developing benchmarks specifically for defense use cases. This should expand into a standardized, agency-wide framework.

This recommendation is in line with findings outlined by RAND, which calls for developing a comprehensive framework for robust evaluation and emphasizes the need for collaborative practices, and measurable performance metrics for model performance.

The DoD should incorporate the following recommendations and government entities to achieve this goal:

Develop a Whole-of-Government Approach to AI Benchmarking

Develop and expand on existing pilot benchmarking frameworks, similar to Massive Multitask Language Understanding (MMLU) but tailored to military-relevant tasks and DoD-specific use cases.
Expand the $10 million T&E and research budget by $10 million, with allocations specifically for bolstering internal benchmarking capabilities. One crucial piece is identifying and recruiting technically capable talent to aid in developing internal benchmarking guidelines. As AI models advance, new “reasoning” models with advanced capabilities become far costlier to benchmark, and the DoD must plan for these future demands now. Part of this allocation can come from the $500 million allocated for the combatant command AI budgets. This monetary allocation is critical to successfully implementing this policy because model benchmarking for more advanced models – such as OpenAI’s GPT-3 – can cost millions. This modest budgetary increase is a starting point for moving beyond piecemeal and ad hoc benchmarking, to a comprehensive and standardized process. This funding increases would facilitate:
- Development of and expansion of internal and customized benchmarking capabilities
- Recruitment and retention of technical talent
- Development of simulation environment for more mission-relevant benchmarks

If internal reallocations from the $500 million allocation proves insufficient or unviable, Congressional approval for additional funds can be another funding source. Given the strategic importance of AI in defense, such requests can readily find bipartisan support, particularly when tied to operational success and risk mitigation.

Create a centralized AI benchmarking repository under the CDAO. This will standardize categories, performance metrics, mission alignment, and lessons learned across defense-specific use cases. This repository will enable consistent tracking of model performance over time, support analysis across model iterations, and allow for benchmarking transferability across similar operational scenarios. By compiling performance data at scale, the repository will also help identify interoperability risks and system-level vulnerabilities—particularly how different AI models may behave when integrated—thereby enhancing the DoD’s ability to assess, document, and mitigate potential performance and safety failures.
Convene a partnership, organized by OMB, between the CDAO, the DIU and the CAIOC, to jointly establish and maintain a centralized benchmarking repository. While many CAIOC members represent civilian agencies, their involvement is crucial: numerous departments (such as the Department of Homeland Security, the Department of Energy, and the National Institute of Standards and Technology) are already employing AI in high-stakes contexts and bring relevant technical expertise, safety frameworks, and risk management policies. Incorporating these perspectives ensures that DoD benchmarking practices are not developed in isolation but reflect best practices across the federal government. This partnership will leverage the DIU’s insights on emerging private-sector technologies, the CDAO’s acquisition and policy authorities, and CAIOC’s alignment with broader executive branch priorities, thereby ensuring that benchmarking practices are technically sound, risk-informed, and consistent with government-wide standards and priorities for trustworthy, safe, and reliable AI.

Recommendation 2. Formalize Pre-Deployment Benchmarking for AI Models at the Acquisition Stage

The key to meaningful benchmarking lies in integrating it at the pre-award stage of procurement. The DoD should establish a formal process that:

Integrates benchmarking into existing AI acquisition platforms, such as Tradewinds, and embeds it within the T&E process.
Requires participation from third-party vendors in benchmarking the products they propose for DoD acquisition and use.
Embeds internal adversarial stress testing, or “red-teaming”, into AI benchmarking ensures more realistic, mission-aligned evaluations that account for adversarial threats and the unique, high-risk operating environments the military faces. By leveraging its internal expertise in mission context, classified threat models, and domain-specific edge cases that external vendors are unlikely to fully replicate, the DoD can produce a more comprehensive and defense-relevant assessment of AI system safety, efficacy, and suitability for deployment. Specifically, this policy memo recommends that the AI Rapid Capabilities Cell (AI RCC) be tasked with carrying out the red-teaming, as a technically qualified element of the CDAO.
Assures procurement officers understand the value of incorporating benchmarking performance metrics into their contract award decision-making. This can be done by hosting benchmarking workshops for procurement officers, which outline the benchmarking results for model performance for various models in the acquisition pipeline and to guide them on how to apply these metrics to their own performance requirements and guidelines.

Recommendation 3. Contextualize Benchmarking into Operational Environments

Current efforts to scale and integrate AI reflect the distinct operational realities of the DoD and military services. Scale AI, in partnership with the DoD, Anduril, Microsoft, and the CDAO, is developing AI-powered solutions which are focused on the United States Indo-Pacific Command (INDOPACOM) and United States European Command (EUCOM). With these regional command focused AI solutions, it makes sense to create equally focused benchmarking standards to test AI model performance in specific environments and under unique and focused conditions. In fact, researchers have been identifying the limits of traditional AI benchmarking and making the case for bespoke, holistic, and use-case relevant benchmark development. This is vital because as AI models advance, they introduce entirely new capabilities which require more robust testing and evaluation. For example, large language models, which have introduced new functionalities including natural language querying or multimodal search interfaces, require entirely new benchmarks that measure: natural language understanding, modal integration accuracy, context retention, and result usefulness. In the same vein, DoD relevant benchmarks must be developed in an operationally-relevant context. This can be achieved by:

Developing simulation environments for benchmarking that are mission-specific across a broader set of domains, including technical and regional commands, to test AI models under specific conditions which are likely to be encountered by users in unique, contested, and/or adversarial environments. The Bipartisan House Task Force on Artificial Intelligence report provides useful guidance on AI model functionality, reliability, and safety in operating in contested, denied, and degraded environments.
Prioritizing use-case-specific benchmarks over broad commercial metrics by incorporating user feedback and identifying tailored risk scenarios that more accurately measure model performance.
Introducing context relevant benchmarks to measure performance in specific, DoD-relevant scenarios, such as:
- Task-specific accuracy (i.e. correct ID in satellite imagery cases)
- Alignment with context-specific rules of engagement
- Instances of degraded performance under high-stress conditions
- Susceptibility to adversarial manipulation (i.e. data poisoning)
- Latency in high-risk, fast-paced decision-making scenarios
Creating post-deployment benchmarking to ensure ongoing performance and risk compliance, and to detect and address issues like model drift, a phenomenon where model performance degrades over time. As there is no established consensus on how often continuous model benchmarking should be performed, the DoD should study the appropriate practical, risk-informed timelines for re-evaluating deployed systems.

Frameworks such as Holistic Evaluation of Language Models (HELM) and Focused LLM Ability Skills and Knowledge (FLASK) can offer valuable guidance for developing LLM-focused benchmarks within the DoD, by enabling more comprehensive evaluations based on specific model skill sets, use-case scenarios, and tailored performance metrics.

Recommendation 4. Integration of Human-in-the-Loop Benchmarking

An additional layer of AI benchmarking for safe and effective AI diffusion into the DoD ecosystem is evaluating AI-human team performance, and measuring user trust, perceptions and confidence in various AI models. “Human‑in‑the‑loop” systems require a person to approve or adjust the AI’s decision before action, while “human‑on‑the‑loop” systems allow autonomous operation but keep a person supervising and ready to intervene. Both “Human in the loop” and “Human on the loop” are critical components of the DoD and military approach to AI. Both require continued human oversight of ethical and safety considerations over AI-enabled capabilities with national security implications. A recent study by MIT study found that there are surprising performance gaps between AI only, human only, and AI-human teams. For the DoD particularly, it is important to effectively measure these performance gaps across the various AI models it plans to integrate into its operations due to heavy reliance on user-AI teams.

A CNAS report on effective T&E for AI spotlighted the DARPA Air Combat Evolution (ACE) program, which sought autonomous air‑combat agents needing minimal human intervention. Expert test pilots could override the system, yet often did so prematurely, distrusting its unfamiliar tactics. This case underscores the need for early, extensive benchmarks that test user capacity, surface trust gaps that can cripple human‑AI teams, and assure operators that models meet legal and ethical standards. Accordingly, this memo urges expanding benchmarking beyond pure model performance to AI‑human team evaluations in high‑risk national‑security, lethal, or error‑sensitive environments.

Conclusion

The Department of Defense is racing to integrate AI across every domain of warfare, yet speed without safety will jeopardize mission success and national security. Standardized, acquisition‑integrated, continuous, and mission‑specific benchmarking is therefore not a luxury—it is the backbone of responsible AI deployment. Current pilot programs with private partners are encouraging starts, but they remain too ad hoc and narrow to match the scale and tempo of modern AI development.

Benchmarking must begin at the pre‑award acquisition stage and follow systems through their entire lifecycle, detecting risks, performance drift, and adversarial vulnerabilities before they threaten operations. As the DARPA ACE program showed, early testing of human‑AI teams and rigorous red‑teaming surface trust gaps and hidden failure modes that vendor‑led evaluations often miss. Because AI models—and enemy capabilities—evolve constantly, our evaluation methods must evolve just as quickly.

By institutionalizing robust benchmarks under CDAO leadership, in concert with the Defense Innovation Unit and the Chief AI Officers Council, the DoD can set world‑class standards for military AI safety while accelerating reliable procurement. Ultimately, AI benchmarking is not a hurdle to innovation and acquisition, but rather it is the infrastructure that can make rapid acquisition more reliable and innovation more viable. The DoD cannot afford the risk of deploying AI systems which are risky, unreliable, ineffective or misaligned with mission needs and standards in high-risk operational environments. At this inflection point, the choice is not between speed and safety but between ungoverned acceleration and a calculated momentum that allows our strategic AI advantage to be both sustained and secured.

This memo was written by an AI Safety Policy Entrepreneurship Fellow over the course of a six-month, part-time program that supports individuals in advancing their policy ideas into practice. You can read more policy memos and learn about Policy Entrepreneurship Fellows here.

What is the Scale AI benchmarking pilot program at DoD, and why and how does this policy proposal build on this initiative?

he Scale AI benchmarking initiative, launched in February 2024 in partnership with the DoD, is a pilot framework designed to evaluate the performance of AI models intended for defense and national security applications. It is part of the broader efforts to create a framework for T&E of AI models for the CDAO.

This memo builds on that foundation by:

Formalizing benchmarking as a standard requirement at the procurement stage across DoD acquisition processes.

Inserting benchmarking protocols into rapid acquisition platforms like Tradewinds.

Establishing a defense-specific benchmarking repository and enabling red-teaming led by the AI Rapid Capabilities Cell (AI RCC) within the CDAO.

Shifting the lead on benchmarking from vendor-enabled to internally developed, led, and implemented, creating bespoke evaluation criteria tailored to specific mission needs.

What types of AI systems will these benchmarks apply to, and how will they be tailored for national security use cases?

The proposed benchmarking framework will apply to a diverse range of AI systems, including:

Decision-making and command and control support tools (sensors, target recognition, process automation, and tools involved in natural language processing).

Generative models for planning, logistics, intelligence, or data generation.

Autonomous agents, such as drones and robotic systems.

Benchmarks will be theater and context-specific, reflecting real-world environments (e.g. contested INDOPACOM scenarios), end-user roles (human-AI teaming in combat), and mission-specific risk factors such as adversarial interference and model drift.

How will this benchmarking framework approach open-source or non-proprietary AI models intended for DoD use?

Open-source models present distinct challenges due to model ownership and origin, additional possible exposure to data poisoning, and downstream user manipulation. However, due to the nature of open-source models, it should be noted that the general increase in transparency and potential access to training data could make open-source models less challenging to put through rigorous T&E.

This memo recommends:

Applying standardized evaluation criteria across both open-source and proprietary models which can be developed by utilizing the AI benchmarking repository and applying model evaluations based on possible use cases of the model.

Incorporating benchmarking to test possible areas of vulnerability for downstream user manipulation.

Measuring the transparency of training data.

Performing adversarial testing to assess resilience against manipulated inputs via red-teaming.

Logging the open-source model performance in the proposed centralized repository, enabling ongoing monitoring for drift and other issues

Why is red-teaming a necessity in addition to AI benchmarking, and how will it be executed?

Red-teaming implements adversarial stress-testing (which can be more robust and operationally relevant if led by an internal team as this memo proposes), and can identify vulnerabilities and unintended capabilities before deployment. Internally led red-teaming, in particular, is critical for evaluating models intended for use in unpredictable or hostile environments.

How will red-teaming be executed?

To effectively employ the red-teaming efforts, this policy recommends that:

The AI Rapid Capabilities Cell within the CDAO should lead red-teaming operations, leveraging the team’s technical capabilities with its experience and mission set to integrate and rapidly scale AI at the speed of relevance — delivering usable capability fast enough to affect current operations and decision cycles.

Internal, technically skilled teams should be created who are capable of incorporating classified threat models and edge-case scenarios.

Red-teaming should focus on simulating realistic mission conditions, and searching for specific model capabilities, going beyond generic or vendor-supplied test cases.

How does this benchmarking framework improve acquisition decisions and reduce risks?

Integrating benchmarking at the acquisition stage enables procurement officers to:

Compare models on mission-relevant, standardized performance metrics and ensure that there is evidence of measurable performance metrics which align with their own “vision of success” procurement requirements for the models.

Identify and avoid models with unsafe, misaligned, unverified, or ineffective capabilities.

Prevent cost-overruns or contract revisions.

Benchmarking workshops for acquisition officers can further equip them with the skills to interpret benchmark results and apply them to their operational requirements.

publications

See all publications

Emerging Technology

day one project

Policy Memo

Behavioral Economics Megastudies are Necessary to Make America Healthy

When the U.S. government funds the establishment of a platform for testing hundreds of behavioral interventions on a large diverse population, we will start to better understand the interventions that will have an efficient and lasting impact on health behavior.

11.06.25 | 10 min read

Emerging Technology

Policy Memo

Making Healthcare AI Human-Centered through the Requirement of Clinician Input

Integrating AI tools into healthcare has an immense amount of potential to improve patient outcomes, streamline clinical workflows, and reduce errors and bias.

11.04.25 | 15 min read

Emerging Technology

Policy Memo

A National Blueprint for Whole Health Transformation

Whole Health is a proven, evidence-based framework that integrates medical care, behavioral health, public health, and community support so that people can live healthier, longer, and more meaningful lives.

11.03.25 | 8 min read

Emerging Technology

Blog

Innovation Ecosystem Job Board Launches to Connect Federal Talent to Opportunities

Innovation Ecosystem Job Board connects scientists, engineers, technologists, and skilled federal workers and contractors who have recently departed government service with the emerging innovation ecosystems across America.

09.19.25 | 3 min read