FY24 NDAA AI Tracker

As both the House and Senate gear up to vote on the National Defense Authorization Act (NDAA), FAS is launching this live blog post to track all proposals around artificial intelligence (AI) that have been included in the NDAA. In this rapidly evolving field, these provisions indicate how AI now plays a pivotal role in our defense strategies and national security framework. This tracker will be updated following major updates.

Senate NDAA. This table summarizes the provisions related to AI from the version of the Senate NDAA that advanced out of committee on July 11. Links to the section of the bill describing these provisions can be found in the “section” column. Provisions that have been added in the manager’s package are in red font. Updates from Senate Appropriations committee and the House NDAA are in blue.

Senate NDAA Provisions

Provision	Summary	Section
Generative AI Detection and Watermark Competition	Directs Under Secretary of Defense for Research and Engineering to create a competition for technology that detects and watermarks the use of generative artificial intelligence.	218
DoD Prize Competitions for Business Systems Modernization	Authorizes competitions to improve military business systems, emphasizing the integration of AI where possible.	221
Broad review and update of DoD AI Strategy	Directs the Secretary of Defense to perform a periodic review and update of its 2018 AI strategy, and to develop and issue new guidance on a broad range of AI issues, including adoption of AI within DoD, ethical principles for AI, mitigation of bias in AI, cybersecurity of generative AI, and more.	222
Strategy and assessment on use of automation and AI for shipyard optimization	Development of a strategy on the use of AI for Navy shipyard logistics	332
Strategy for talent development and management of DoD Computer Programming Workforce	Establishes a policy for “appropriate” talent development and management policies, including for AI skills.	1081
Sense of the Senate Resolution in Support of NATO	Offers support for NATO and NATO’s DIANA program as critical to AI and other strategic priorities	1238 \| 1239
Enhancing defense partnership with India	Directs DoD to enhance defense partnership with India, including collaboration on AI as one potential priority area.	1251
Specification of Duties for Electronic Warfare Executive Committee	Amends US code to specify the duties of the Electronic Warfare Executive Committee, including an assessment of the need for automated, AI/ML-based electronic warfare capabilities	1541
Next Generation Cyber Red Teams	Directs the DoD and NSA to submit a plan to modernize cyber red-teaming capabilities, ensuring the ability to emulate possible threats, including from AI	1604
Management of Data Assets by Chief Digital Officer	Outlines responsibilities for CDAO to provide data analytics capabilities needed for “global cyber-social domain.”	1605
Developing Digital Content Provenance Course	Directs Director of Defense Media Activity to develop a course on digital content provenance, including digital forgeries developed with AI systems, e.g. AI-generated “deepfakes,”	1622
Report on Artificial Intelligence Regulation in Financial Services Industry	Directs regulators of the financial services industry to produce reports analyzing how AI is and ought to be used by the industry and by regulators	6096
AI Bug Bounty Programs	Directs CDAO to develop a bug bounty program for AI foundation models that are being integrated in DOD operations	6097
Vulnerability analysis study for AI-enabled military applications	Directs CDAO to complete a study analyzing vulnerabilities to the privacy, security, and accuracy of AI-enabled military applications, as well as R&D needs for such applications, including foundation models.	6098
Report on Data Sharing and Coordination	Directs SecDef to to submit a report on ways to improve data sharing across DoD	6099
Establishment of Chief AI Officer of the Department of State	Establishes within the Department of State a Chief AI Officer, who may also serve as Chief Data Officer to oversee adoption of AI in the Department and to advise the Secretary of State on the use of AI in conducting data-informed diplomacy.	6303

House NDAA. This table summarizes the provisions related to AI from the version of the House NDAA that advanced out of committee. Links to the section of the bill describing these provisions can be found in the “section” column.

House NDAA Provisions

Provision	Summary	Section
Process to ensure the responsible development and use of artificial intelligence	Directs CDAO to develop a process for assessing whether AI technology used by DoD is functioning responsibly, including through the development of clear standards, and to amend AI technology as needed	220
Intellectual property strategy	Directs DoD to develop an intellectual property strategy to enhance capabilities in procurement of emerging technologies and capabilities	263
Study on establishment of centralized platform for development and testing of autonomy software	Directs SecDef and CDAO to conduct a study, assessing the feasibility and advisability of developing a centralized platform to develop and test autonomous software.	264
Congressional notification of changes to Department of Defense policy on autonomy in weapon systems	Requires that Congress be notified of changes to DoD Directive 3000.09 (on autonomy in weapons systems) within 30 days of any changes	266
Sense of Congress on dual use innovative technology for the robotic combat vehicle of the Army	This offers support for the Army’s acquisition strategy for the Robot Combat Vehicle program, and recommends that the Army consider a similar framework for future similar programs.	267
Pilot program on optimization of aerial refueling and fuel management in contested logistics environments through use of artificial intelligence	Directs CDAO, USD(A&S), and Air Force to develop a pilot program to optimize the logistics of aerial refueling and to consider the use of AI technology to help with this mission.	266
Modification to acquisition authority of the senior official with principal responsibility for artificial intelligence and machine learning	Increases annual acquisition authority for CDAO from $75M to $125M, and extends this authority from 2025 to 2029.	827
Framework for classification of autonomous capabilities	Directs CDAO and others within DoD to establish a department-wide classification framework for autonomous capabilities to enable easier use of autonomous systems in the department.	930

Funding Comparison. The following tables compare the funding requested in the President’s budget to funds that are authorized in current House and Senate versions of the NDAA. All amounts are in thousands of dollars.

Funding Comparison

Program	Requested	Authorized in House	Authorized in Senate	NEW! Passed in Senate Approps 7/27	NEW! Passed in full House 9/28
Other Procurement, Army–Engineer (non-construction) equipment: Robotics and Applique Systems	68,893	68,893	68,893	65,118 (-8,775 for “Effort previously funded,” +5,000 for “Soldier borne sensor”)	73,893 (+5,000 for “Soldier borne sensor”)
AI/ML Basic Research, Army	10,708	10,708	10,708	10,708	10,708
AI/ML Technologies, Army	24,142	24,142	24,142	27,142 (+3,000 for “Automated battle damage assessment and adjust fire”)	24
AI/ML Advanced Technologies, Army	13,187	15,687 (+ 2,500 for “Autonomous Long Range Resupply”)	18,187 (+ 5,000 for “Tactical AI & ML”)	24,687 (+11,500 for “Cognitive computing architecture for military systems”)	13,187
AI Decision Aids for Army Missile Defense Systems Integration	0	6,000	0	0	0
Robotics Development, Army	3,024	3,024	3,024	3,024	3,024
Ground Robotics, Army	35,319	35,319	35,319	17,337 (-17,982 for “SMET Inc II early to need”)	45,319 (+10,000 for “common robotic controller”)
Applied Research, Navy: Long endurance mobile autonomous passive acoustic sensing research	0	2,500	0	0	0
Advanced Components, Navy: Autonomous surface and underwater dual-modality vehicles	0	5,000	0	3,000	0
Air Force University Affiliated Research Center (UARC)—Tactical Autonomy	8,018	8,018	8,018	8,018	8,018
Air Force Applied Research: Secure Interference Avoiding Connectivity of Autonomous AI Machines	0	3,000	5,000	0	0
Air Force Advanced Technology Development: Semiautonomous adversary air platform	0	0	10,000	0	0
Advanced Technology Development, Air Force: High accuracy robotics	0	2,500	0	0	0
Air Force Autonomous Collaborative Platforms	118,826	176,013 (+ 75,000 for Project 647123: Air-Air Refueling TMRR, -17,813 for Technical realignment )	101,013 (- 17,813 for DAF requested realignment of funds)	101,013	101,013
Space Force: Machine Learning Techniques for Radio Frequency (RF) Signal Monitoring and Interference Detection	0	10,000	0	0	0
Defense-wide: Autonomous resupply for contested logistics	0	2,500	0	0	0
Military Construction–Pennsylvania Navy Naval Surface Warfare Center Philadelphia: AI Machinery Control Development Center	0	88,200	88,200	0	0
Intelligent Autonomous Systems for Seabed Warfare	0	0	7,000	5,000	0

Funding for Office of Chief Digital and Artificial Intelligence Officer

Program	Requested	Authorized in House	Authorized in Senate	NEW! Passed in Senate Approps	NEW! Passed in full House
Advanced Component Development and Prototypes	34,350	34,350	34,350	34,350	34,350
System Development and Demonstration	615,245	570,246 (-40,000 for “insufficient justification,” -5,000 for “program decrease.”)	615,246	246,003 (-369,243, mostly for functional transfers to JADC2 and Alpha-1)	704,527 (+89,281, mostly for “management innovation pilot” and transfers from other programs for “enterprise digital alignment”)
Research, Development, Test, and Evaluation	17,247	17,247	17,247	6,882 (-10,365, “Functional transfer to line 130B for ALPHA-1″)	13,447 (-3,800 for “excess growth”)
Senior Leadership Training Courses	0	2,750	0	0	0
ALPHA-1	0	0	0	222,723	0

On Senate Approps Provisions

The Senate Appropriations Committee generally provided what was requested in the White House’s budget regarding artificial intelligence (AI) and machine learning (ML), or exceeded it. AI was one of the top-line takeaways from the Committee’s summary of the defense appropriations bill. Particular attention has been paid to initiatives that cut across the Department of Defense, especially the Chief Digital and Artificial Intelligence Office (CDAO) and a new initiative called Alpha-1. The Committee is supportive of Joint All-Domain Command and Control (JADC2) integration and the recommendations of the National Security Commission on Artificial Intelligence (NSCAI).

On House final bill provisions

Like the Senate Appropriations bill, the House of Representatives’ final bill generally provided or exceeded what was requested in the White House budget regarding AI and ML. However, in contract to the Senate Appropriations bill, AI was not a particularly high-priority takeaway in the House’s summary. The only note about AI in the House Appropriations Committee’s summary of the bill was in the context of digital transformation of business practices. Program increases were spread throughout the branches’ Research, Development, Test, and Evaluation budgets, with a particular concentration of increased funding for the Defense Innovation Unit’s AI-related budget.

Six Policy Ideas for the National AI Strategy

The White House Office of Science and Technology Policy (OSTP) has sought public input for the Biden administration’s National AI Strategy, acknowledging the potential benefits and risks of advanced AI. The Federation of American Scientists (FAS) was happy to recommend specific actions for federal agencies to safeguard Americans’ rights and safety. With U.S. companies creating powerful frontier AI models, the federal government must guide this technology’s growth toward public benefit and risk mitigation.

Recommendation 1: OSTP should work with a suitable agency to develop and implement a pre-deployment risk assessment protocol that applies to any frontier AI model.

Before launching a frontier AI system, developers must ensure safety, trustworthiness, and reliability through pre-deployment risk assessment. This protocol aims to thoroughly analyze potential risks and vulnerabilities in AI models before deployment.

We advocate for increased funding towards the National Institute of Standards and Technology (NIST) to enhance its risk measurement capacity and develop robust benchmarks for AI model risk assessment. Building upon NIST’s AI Risk Management Framework (RMF) will standardize metrics for evaluation incorporating various cases such as open-source models, academic research, and fine-tuning of models which differ from larger labs like OpenAI’s GPT-4.

We propose the Federal Trade Commission (FTC), under Section 5 of the FTC Act, implement and enforce this pre-deployment risk assessment strategy. The FTC’s role to prevent unfair or deceptive practices in commerce is aligned with mitigating potential risks from AI systems.

Recommendation 2: Adherence to the appropriate risk management framework should be compulsory for any AI-related project that receives federal funding.

The U.S. government, as a significant funder of AI through contracts and grants, has both a responsibility and opportunity. Responsibility: to ensure that its AI applications meet a high bar for risk management. Opportunity: to enhance a culture of safety in AI development more broadly. Adherence to a risk management framework should be a prerequisite for AI projects seeking federal funds.

Currently, voluntary guidelines such as NIST’s AI RMF exist, but we propose making these compulsory. Agencies should require contractors to document and verify the risk management practices in place for the contract. For agencies that do not have their own guidelines, the NIST AI RMF should be used. And the NSF should require documentation of the grantee’s compliance with the NIST AI RMF in grant applications for AI projects. This approach will ensure all federally funded AI initiatives maintain a high bar for risk management.

Recommendation 3: NSF should increase its funding for “trustworthy AI” R&D.

“Trustworthy AI” refers to AI systems that are reliable, safe, transparent, privacy-enhanced, and unbiased. While NSF is a key non-military funder of AI R&D in the U.S., our rough estimates indicate that its investment in fields promoting trustworthiness has remained relatively static, accounting for only 10-15% of all AI grants. Given its $800 million annual AI-related budget, we recommend that NSF direct a larger share of grants towards research in trustworthy AI.

To enable this shift, NSF could stimulate trustworthy AI research through specific solicitations; launch targeted programs in this area; and incorporate a “trustworthy AI” section in funding applications, prompting researchers to outline the trustworthiness of their projects. This would help evaluate AI project impacts and promote proposals with significant potential in trustworthy AI. Lastly, researchers could be requested or mandated to apply the NIST AI RMF during their studies.

Recommendation 4: FedRAMP should be broadened to cover AI applications contracted for by the federal government.

The Federal Risk and Authorization Management Program (FedRAMP) is a government-wide initiative that standardizes security protocols for cloud services. Given the rising utilization of AI services in federal operations, a similar system of security standards should apply to these services, since they are responsible for managing highly sensitive data related to national security and individual privacy.

Expanding FedRAMP’s mandate to include AI services is a logical next step in ensuring the secure integration of advanced technologies into federal operations. Applying a framework like FedRAMP to AI services would involve establishing robust security standards specific to AI, such as secure data handling, model transparency, and robustness against adversarial attacks. The expanded FedRAMP program would streamline AI integration into federal operations and avoid repetitive security assessments.

Recommendation 5: The Department of Homeland Security should establish an AI incidents database.

The Department of Homeland Security (DHS) needs to create a centralized AI Incidents Database, detailing AI-related breaches, failures and misuse across industries. Its existing authorization under the Homeland Security Act of 2002 makes DHS capable of this role. This database would increase understanding, mitigate risks, and build trust in AI systems’ safety and security.

Voluntary reporting from AI stakeholders should be encouraged while preserving data confidentiality. For effectiveness, anonymized or aggregated data should be shared with AI developers, researchers, and policymakers to better understand AI risks. DHS could use existing databases such as the one maintained by the Partnership on AI and Center for Security and Emerging Technologies, as well as adapt reporting methods from global initiatives like the Financial Services Information Sharing and Analysis Center.

Recommendation 6: OSTP should work with agencies to streamline the process of granting Interested Agency Waivers to AI researchers on J-1 visas.

The ongoing global competition in AI necessitates attracting and retaining a diverse, highly skilled talent pool. The US J-1 Exchange Visitor Program, often used by visiting researchers, requires some participants to return home for two years before applying for permanent residence.

Federal agencies can waive this requirement for certain individuals via an “Interested Government Agency” (IGA) request. Agencies should establish a transparent, predictable process for AI researchers to apply for such waivers. The OSTP should collaborate with agencies to streamline this process. Taking cues from the Department of Defense’s structured application process, including a dedicated webpage, application checklist, and sample sponsor letter, could prove highly beneficial for improving the transition of AI talent to permanent residency in the US.
Review the details of these proposals in our public comment.

How Do OpenAI’s Efforts To Make GPT-4 “Safer” Stack Up Against The NIST AI Risk Management Framework?

In March, OpenAI released GPT-4, another milestone in a wave of recent AI progress. This is OpenAI’s most advanced model yet, and it’s already being deployed broadly to millions of users and businesses, with the potential for drastic effects across a range of industries.

But before releasing a new, powerful system like GPT-4 to millions of users, a crucial question is: “How can we know that this system is safe, trustworthy, and reliable enough to be released?” Currently, this is a question that leading AI labs are free to answer on their own–for the most part. But increasingly, the issue has garnered greater attention as many have become worried that the current pre-deployment risk assessment and mitigation methods like those done by OpenAI are insufficient to prevent potential risks, including the spread of misinformation at scale, the entrenchment of societal inequities, misuse by bad actors, and catastrophic accidents.

This concern is central to a recent open letter, signed by several leading machine learning (ML) researchers and industry leaders, which calls for a 6-month pause on the training of AI systems “more powerful” than GPT-4 to allow more time for, among other things, the development of strong standards which would “ensure that systems adhering to them are safe beyond a reasonable doubt” before deployment. There’s a lot of disagreement over this letter, from experts who contest the letter’s basic narrative, to others who think that the pause is “a terrible idea” because it would unnecessarily halt beneficial innovation (not to mention that it would be impossible to implement). But almost all of the participants in this conversation tend to agree, pause or no, that the question of how to assess and manage risks of an AI system before actually deploying it is an important one.

A natural place to look for guidance here is the National Institute of Standards and Technology (NIST), which released its AI Risk Management Framework (AI RMF) and an associated playbook in January. NIST is leading the government’s work to set technical standards and consensus guidelines for managing risks from AI systems, and some cite its standard-setting work as a potential basis for future regulatory efforts.

In this piece we walk through both what OpenAI actually did to test and improve GPT-4’s safety before deciding to release it, limitations of this approach, and how it compares to current best practices recommended by the National Institute of Standards and Technology (NIST). We conclude with some recommendations for Congress, NIST, industry labs like OpenAI, and funders.

What did OpenAI do before deploying GPT-4?

OpenAI claims to have taken several steps to make their system “safer and more aligned”. What are those steps? OpenAI describes these in the GPT-4 “system card,” a document which outlines how OpenAI managed and mitigated risks from GPT-4 before deploying it. Here’s a simplified version of what that process looked like:

They brought in over 50 “red-teamers,” outside experts across a range of domains to test the model, poking and prodding at it to find ways that it could fail or cause harm. (Could it “hallucinate” in ways that would contribute to massive amounts of cheaply produced misinformation? Would it produce biased/discriminatory outputs? Could it help bad actors produce harmful pathogens? Could it make plans to gain power of its own?)
Where red-teamers found ways that the model went off the rails, they could train out many instances of undesired outputs via Reinforcement Learning on Human Feedback (RLHF), a process in which human raters give feedback on the kinds of outputs provided by the model (both through human-generated examples of how a model should respond given some type of input, and with “thumbs-up, thumbs-down” ratings on model-generated outputs). Thus, the model was adjusted to be more likely to give the kind of answer that their raters scored positively, and less likely to give the kinds of outputs that would score poorly.

Was this enough?

Though OpenAI says they significantly reduced the rates of undesired model behavior through the above process, the controls put in place are not robust, and methods for mitigating bad model behavior are still leaky and imperfect.

OpenAI did not eliminate the risks they identified. The system card documents numerous failures of the current version of GPT-4, including an example in which it agrees to “generate a program calculating attractiveness as a function of gender and race.”

Current efforts to measure risks also need work, according to GPT-4 red teamers. The Alignment Research Center (ARC) which assessed these models for “emergent” risks says that “the testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.” Another GPT-4 red-teamer, Aviv Ovadya, says that “if red-teaming GPT-4 taught me anything, it is that red teaming alone is not enough.” Ovadya recommends that future pre-deployment risk assessment efforts are improved using “violet teaming,” in which companies identify “how a system (e.g., GPT-4) might harm an institution or public good, and then support the development of tools using that same system to defend the institution or public good.”

Since current efforts to measure and mitigate risks of advanced systems are not perfect, the question comes down to when they are “good enough.” What levels of risk are acceptable? Today, industry labs like OpenAI can mostly rely on their own judgment when answering this question, but there are many different standards that could be used. Amba Kak, the executive director of the AI Now Institute, suggests a more stringent standard, arguing that regulators should require AI companies ”to prove that they’re going to do no harm” before releasing a system. To meet such a standard, new, much more systematic risk management and measurement approaches would be needed.

How did OpenAI’s efforts map on to NIST’s Risk Management Framework?

NIST’s AI RMF Core consists of four main “functions,” broad outcomes which AI developers can aim for as they develop and deploy their systems: map, measure, manage, and govern.

Framework users can map the overall context in which a system will be used to determine relevant risks that should be “on their radar” in that identified context. They can then measure identified risks quantitatively or qualitatively, before finally managing them, acting to mitigate risks based on projected impact. The govern function is about having a well-functioning culture of risk management to support effective implementation of the three other functions.

Looking back to OpenAI’s process before releasing GPT-4, we can see how their actions would align with each function in the RMF Core. This is not to say that OpenAI applied the RMF in its work; we’re merely trying to assess how their efforts might align with the RMF.

They first mapped risks by identifying areas for red-teamers to investigate, based on domains where language models had caused harm in the past and areas that seemed intuitively likely to be particularly impactful.
They aimed to measure these risks, largely through the qualitative, “red-teaming” efforts described above, though they also describe using internal quantitative evaluations for some categories of risk such as “hate speech” or “self-harm advice”.
And to manage these risks, they relied on Reinforcement Learning on Human Feedback, along with other interventions, such as shaping the original dataset and some explicitly “programmed in” behaviors that don’t rely on training in behaviors via RLHF.

Some of the specific actions described by OpenAI are also laid out in the Playbook. The Measure 2.7 function highlights “red-teaming” activities as a way to assess an AI system’s “security and resilience,” for example.

NIST’s resources provide a helpful overview of considerations and best practices that can be taken into account when managing AI risks, but they are not currently designed to provide concrete standards or metrics by which one can assess whether the practices taken by a given lab are “adequate.” In order to develop such standards, more work would be needed. To give some examples of current guidance that could be clarified or made more concrete:

NIST recommends that AI actors “regularly evaluate failure costs to inform go/no-go deployment decisions throughout the AI system lifecycle.” How often is “regularly?” What kinds of “failure costs” are too much? Some of this will depend on the ultimate use cases as our risk tolerance for a sentiment analysis model may be far higher than risk tolerance with a medical decision support system.

NIST recommends that AI developers aim to understand and document “intended purposes, potentially beneficial uses, context-specific laws, norms and expectations, and prospective settings in which the AI system will be deployed.” For a system like GPT-4, which is being deployed broadly and which could have use cases across a huge number of domains, the relevant context appears far too vast to be cleanly “established and understood,” unless this is done at a very high level of abstraction.

NIST recommends that AI actors make “a determination as to whether the AI system achieves its intended purpose and stated objectives and whether its development or deployment should proceed”. Again, this is hard to define: what is the intended purpose of a large language model like GPT-4? Its creators generally don’t expect to know the full range of its potential use cases at the time that it’s released, posing further challenges in making such determinations.

NIST describes explainability and interpretability as a core feature of trustworthy AI systems. OpenAI does not describe GPT-4 as being interpretable. The model can be prompted to generate explanations of its outputs, but we don’t know how these model-generated explanations actually reflect the system’s internal process to generate its outputs.

So, across NIST’s AI RMF, while determining whether a given “outcome” has been achieved could be up for debate, nothing stops developers from going above and beyond the perceived minimum (and we believe they should). This is not a bug of the framework as it is currently designed, rather a feature, as the RMF “does not prescribe risk tolerance.” However, it is important to note that more work is needed to establish both stricter guidelines which leading labs can follow to mitigate risks from leading AI systems, and concrete standards and methods for measuring risk on top of which regulations could be built.

Recommendations

There are a few ways that standards for pre-deployment risk assessment and mitigation for frontier systems can be improved:

Congress

Congress should appropriate additional funds to NIST to expand its capacity for work on risk measurement and management of frontier AI systems.

NIST

Industry best practices: With additional funding, NIST could provide more detailed guidance based on industry best practices for measuring and managing risks of frontier AI systems, for example by collecting and comparing efforts of leading AI developers. NIST could also look for ways to get “ahead of the curve” on risk management practices, rather than just collecting existing industry practice, for example by exploring new, less well-tested practices such as violet teaming.
Metrics: NIST could also provide more concrete metrics and benchmarks by which to assess whether functions in the RMF have been adequately achieved.
Testbeds: Section 10232 of The CHIPS and Science Act authorized NIST to “establish testbeds […] to support the development of robust and trustworthy artificial intelligence and machine learning systems.” With additional funds appropriated, NIST could develop a centralized, voluntary set of test beds to assess frontier AI systems for risks, thereby encouraging more rigorous pre-deployment model evaluations. Such efforts could build on existing language model evaluation techniques, e.g. the Holistic Evaluation of Language Models from Stanford’s Center for Research on Foundation Models.

Industry Labs

Leading industry labs should aim to provide more insights to government standard-setters like NIST on how they manage risks from their AI systems, including by clearly outlining their safety practices and mitigation efforts as OpenAI did in the GPT-4 system card, how well these practices work, and ways which they could still break in the future.
Labs should also aim to incorporate more public feedback in their risk management process, to determine what levels of risk are acceptable when deploying systems with potential for broad public impact.
Labs should aim to go beyond the NIST AI RMF 1.0. This will further help NIST in assessing new risk management strategies that are not part of the current RMF but could be part of RMF 2.0.

Funders

Government funders like NSF and private philanthropic grantmakers should fund researchers to develop metrics and techniques for assessing and mitigating risks from frontier AI systems. Currently, few people focus on this work professionally, and funds to support more research in this area could have broad benefits by encouraging more work on risk management practices and metrics for frontier AI systems.
Funders should also make grants to AI projects conditional on these projects following current best practices as described in the NIST AI RMF.

Creating Auditing Tools for AI Equity

The unregulated use of algorithmic decision-making systems (ADS)—systems that crunch large amounts of personal data and derive relationships between data points—has negatively affected millions of Americans. These systems impact equitable access to education, housing, employment, and healthcare, with life-altering effects. For example, commercial algorithms used to guide health decisions for approximately 200 million people in the United States each year were found to systematically discriminate against Black patients, reducing, by more than half, the number of Black patients who were identified as needing extra care.

One way to combat algorithmic harm is by conducting system audits, yet there are currently no standards for auditing AI systems at the scale necessary to ensure that they operate legally, safely, and in the public interest. According to one research study examining the ecosystem of AI audits, only one percent of AI auditors believe that current regulation is sufficient.

To address this problem, the National Institute of Standards and Technology (NIST) should invest in the development of comprehensive AI auditing tools, and federal agencies with the charge of protecting civil rights and liberties should collaborate with NIST to develop these tools and push for comprehensive system audits.

These auditing tools would help the enforcement arms of these federal agencies save time and money while fulfilling their statutory duties. Additionally, there is a pressing need to develop these tools now, with Executive Order 13985 instructing agencies to “focus their civil rights authorities and offices on emerging threats, such as algorithmic discrimination in automated technology.”

Challenge and Opportunity

The use of AI systems across all aspects of life has become commonplace as a way to improve decision-making and automate routine tasks. However, their unchecked use can perpetuate historical inequities, such as discrimination and bias, while also potentially violating American civil rights.

Algorithmic decision-making systems are often used in prioritization, classification, association, and filtering tasks in a way that is heavily automated. ADS become a threat when people uncritically rely on the outputs of a system, use them as a replacement for human decision-making, or use systems with no knowledge of how they were developed. These systems, while extremely useful and cost-saving in many circumstances, must be created in a way that is equitable and secure.

Ensuring the legal and safe use of ADS begins with recognizing the challenges that the federal government faces. On the one hand, the government wants to avoid devoting excessive resources to managing these systems. With new AI system releases happening everyday, it is becoming unreasonable to oversee every system closely. On the other hand, we cannot blindly trust all developers and users to make appropriate choices with ADS.

This is where tools for the AI development lifecycle come into play, offering a third alternative between constant monitoring and blind trust. By implementing auditing tools and signatory practices, AI developers will be able to demonstrate compliance with preexisting and well-defined standards while enhancing the security and equity of their systems.

Due to the extensive scope and diverse applications of AI systems, it would be difficult for the government to create a centralized body to oversee all systems or demand each agency develop solutions on its own. Instead, some responsibility should be shifted to AI developers and users, as they possess the specialized knowledge and motivation to maintain proper functioning systems. This allows the enforcement arms of federal agencies tasked with protecting the public to focus on what they do best, safeguarding citizens’ civil rights and liberties.

Plan of Action

To ensure security and verification throughout the AI development lifecycle, a suite of auditing tools is necessary. These tools should help enable outcomes we care about, fairness, equity, and legality. The results of these audits should be reported (for example, in an immutable ledger that is only accessible by authorized developers and enforcement bodies) or through a verifiable code-signing mechanism. We leave the specifics of the reporting and documenting the process to the stakeholders involved, as each agency may have different reporting structures and needs. Other possible options, such as manual audits or audits conducted without the use of tools, may not provide the same level of efficiency, scalability, transparency, accuracy, or security.

The federal government’s role is to provide the necessary tools and processes for self-regulatory practices. Heavy-handed regulations or excessive government oversight are not well-received in the tech industry, which argues that they tend to stifle innovation and competition. AI developers also have concerns about safeguarding their proprietary information and users’ personal data, particularly in light of data protection laws.

Auditing tools provide a solution to this challenge by enabling AI developers to share and report information in a transparent manner while still protecting sensitive information. This allows for a balance between transparency and privacy, providing the necessary trust for a self-regulating ecosystem.

Solution Technical Requirements

A general machine learning lifecycle. Examples of what system developers at each stage would be responsible for signing off on the use of the security and equity tools in the lifecycle. These developers represent companies, teams, or individuals.

The equity tool and process, funded and developed by government agencies such as NIST, would consist of a combination of (1) AI auditing tools for security and fairness (which could be based on or incorporate open source tools such as AI Fairness 360 and the Adversarial Robustness Toolbox), and (2) a standardized process and guidance for integrating these checks (which could be based on or incorperate guidance such as the U.S. Government Accountability Office’s Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities).¹

Dioptra, a recent effort between NIST and the National Cybersecurity Center of Excellence (NCCoE) to build machine learning testbeds for security and robustness, is an excellent example of the type of lifecycle management application that would ideally be developed. Failure to protect civil rights and ensure equitable outcomes must be treated as seriously as security flaws, as both impact our national security and quality of life.

Equity considerations should be applied across the entire lifecycle; training data is not the only possible source of problems. Inappropriate data handling, model selection, algorithm design, and deployment, also contribute to unjust outcomes. This is why tools combined with specific guidance is essential.

As some scholars note, “There is currently no available general and comparative guidance on which tool is useful or appropriate for which purpose or audience. This limits the accessibility and usability of the toolkits and results in a risk that a practitioner would select a sub-optimal or inappropriate tool for their use case, or simply use the first one found without being conscious of the approach they are selecting over others.”

Companies utilizing the various packaged tools on their ADS could sign off on the results using code signing. This would create a record that these organizations ran these audits along their development lifecycle and received satisfactory outcomes.

We envision a suite of auditing tools, each tool applying to a specific agency and enforcement task. Precedents for this type of technology already exist. Much like security became a part of the software development lifecycle with guidance developed by NIST, equity and fairness should be integrated into the AI lifecycle as well. NIST could spearhead a government-wide initiative on auditing AI tools, leading guidance, distribution, and maintenance of such tools. NIST is an appropriate choice considering its history of evaluating technology and providing guidance around the development and use of specific AI applications such as the NIST-led Face Recognition Vendor Test (FRVT).

Areas of Impact & Agencies / Departments Involved

Security & Justice
The U.S. Department of Justice, Civil Rights Division, Special Litigation SectionDepartment of Homeland Security U.S. Customs and Border Protection U.S. Marshals Service

Public & Social Sector
The U.S. Department of Housing and Urban Development’s Office of Fair Housing and Equal Opportunity

Education
The U.S. Department of Education

Environment
The U.S. Department of Agriculture, Office of the Assistant Secretary for Civil RightsThe Federal Energy Regulatory CommissionThe Environmental Protection Agency

Crisis Response
Federal Emergency Management Agency

Health & Hunger
The U.S. Department of Health and Human Services, Office for Civil RightsCenter for Disease Control and PreventionThe Food and Drug Administration

Economic
The Equal Employment Opportunity Commission, The U.S. Department of Labor, Office of Federal Contract Compliance Programs

Infrastructure
The U.S. Department of Transportation, Office of Civil RightsThe Federal Aviation AdministrationThe Federal Highway Administration

Information Verification & Validation
The Federal Trade Commission, The Federal Communication Commission, The Securities and Exchange Commission.

Many of these tools are open source and free to the public. A first step could be combining these tools with agency-specific standards and plain language explanations of their implementation process.

Benefits

These tools would provide several benefits to federal agencies and developers alike. First, they allow organizations to protect their data and proprietary information while performing audits. Any audits, whether on the data, model, or overall outcomes, would be run and reported by the developers themselves. Developers of these systems are the best choice for this task since ADS applications vary widely, and the particular audits needed depend on the application.

Second, while many developers may opt to use these tools voluntarily, standardizing and mandating their use would allow an evaluation of any system thought to be in violation of the law to be easily assesed. In this way, the federal government will be able to manage standards more efficiently and effectively.

Third, although this tool would be designed for the AI lifecycle that results in ADS, it can also be applied to traditional auditing processes. Metrics and evaluation criteria will need to be developed based on existing legal standards and evaluation processes; once these metrics are distilled for incorporation into a specific tool, this tool can be applied to non-ADS data as well, such as outcomes or final metrics from traditional audits.

Fourth, we believe that a strong signal from the government that equity considerations in ADS are important and easily enforceable will impact AI applications more broadly, normalizing these considerations.

Example of Opportunity

An agency that might use this tool is the Department of Housing and Urban Development (HUD), whose purpose is to ensure that housing providers do not discriminate based on race, color, religion, national origin, sex, familial status, or disability.

To enforce these standards, HUD, which is responsible for 21,000 audits a year, investigates and audits housing providers to assess compliance with the Fair Housing Act, the Equal Credit Opportunity Act, and other related regulations. During these audits, HUD may review a provider’s policies, procedures, and records, as well as conduct on-site inspections and tests to determine compliance.

Using an AI auditing tool could streamline and enhance HUD’s auditing processes. In cases where ADS were used and suspected of harm, HUD could ask for verification that an auditing process was completed and specific metrics were met, or require that such a process be undergone and reported to them.

Noncompliance with legal standards of nondiscrimination would apply to ADS developers as well, and we envision the enforcement arms of protection agencies would apply the same penalties in these situations as they would in non-ADS cases.

R&D

To make this approach feasible, NIST will require funding and policy support to implement this plan. The recent CHIPS and Science Act has provisions to support NIST’s role in developing “trustworthy artificial intelligence and data science,” including the testbeds mentioned above. Research and development can be partially contracted out to universities and other national laboratories or through partnerships/contracts with private companies and organizations.

The first iterations will need to be developed in partnership with an agency interested in integrating an auditing tool into its processes. The specific tools and guidance developed by NIST must be applicable to each agency’s use case.

The auditing process would include auditing data, models, and other information vital to understanding a system’s impact and use, informed by existing regulations/guidelines. If a system is found to be noncompliant, the enforcement agency has the authority to impose penalties or require changes to be made to the system.

Pilot program

NIST should develop a pilot program to test the feasibility of AI auditing. It should be conducted on a smaller group of systems to test the effectiveness of the AI auditing tools and guidance and to identify any potential issues or areas for improvement. NIST should use the results of the pilot program to inform the development of standards and guidelines for AI auditing moving forward.

Collaborative efforts

Achieving a self-regulating ecosystem requires collaboration. The federal government should work with industry experts and stakeholders to develop the necessary tools and practices for self-regulation.

A multistakeholder team from NIST, federal agency issue experts, and ADS developers should be established during the development and testing of the tools. Collaborative efforts will help delineate responsibilities, with AI creators and users responsible for implementing and maintaining compliance with the standards and guidelines, and agency enforcement arms agency responsible for ensuring continued compliance.

Regular monitoring and updates

The enforcement agencies will continuously monitor and update the standards and guidelines to keep them up to date with the latest advancements and to ensure that AI systems continue to meet the legal and ethical standards set forth by the government.

Transparency and record-keeping

Code-signing technology can be used to provide transparency and record-keeping for ADS. This can be used to store information on the auditing outcomes of the ADS, making reporting easy and verifiable and providing a level of accountability to users of these systems.

Conclusion

Creating auditing tools for ADS presents a significant opportunity to enhance equity, transparency, accountability, and compliance with legal and ethical standards. The federal government can play a crucial role in this effort by investing in the research and development of tools, developing guidelines, gathering stakeholders, and enforcing compliance. By taking these steps, the government can help ensure that ADS are developed and used in a manner that is safe, fair, and equitable.

WHAT IS AN ALGORITMIC DECISION-MAKING SYSTEM

An algorithmic decision-making system (ADS) is software that uses algorithms to make decisions or take actions based on data inputs, sometimes without human intervention. ADS are used in a wide range of applications, from customer service chatbots to screening job applications to medical diagnosis systems. ADS are designed to analyze data and make decisions or predictions based on that data, which can help automate routine or repetitive tasks, improve efficiency, and reduce errors. However, ADS can also raise ethical and legal concerns, particularly when it comes to bias and privacy.

WHAT IS AN ALGORITMIC AUDIT

An algorithmic audit is a process that examines automated decision-making systems and algorithms to ensure that they are fair, transparent, and accountable. Algorithmic audits are typically conducted by independent third-party auditors or specialized teams within organizations. These audits examine various aspects of the algorithm, such as the data inputs, the decision-making process, and the outcomes produced, to identify any biases or errors. The goal is to ensure that the system operates in a manner consistent with ethical and legal standards and to identify opportunities to improve the system’s accuracy and fairness.

WHAT IS CODE SIGNING, AND WHY IS IT INVOLVED?

Code signing is the process of digitally signing software and code to verify the integrity and authenticity of the code. It involves adding a digital signature to the code, which is a unique cryptographic hash that is generated using a private key held by the code signer. The signature is then embedded into the code along with other metadata.

Code signing is used to establish trust in code that is distributed over the internet or other networks. By digitally signing the code, the code signer is vouching for its identity and taking responsibility for its contents. When users download code that has been signed, their computer or device can verify that the code has not been tampered with and that it comes from a trusted source.

Code signing can be extended to all parts of the AI lifecycle as a means of verifying the authenticity, integrity, and function of a particular piece of code or a larger process. After each step in the auditing process, code signing enables developers to leave a well-documented trail for enforcement bodies/auditors to follow if a system were suspected of unfair discrimination or unsafe operation.

Code signing is not essential for this project’s success, and we believe that the specifics of the auditing process, including documentation, are best left to individual agencies and their needs. However, code signing could be a useful piece of any tools developed.

WHAT IS AN AI AUDITOR

An AI auditor is a professional who evaluates and ensures the fairness, transparency, and accountability of AI systems. AI auditors often have experience in risk management, IT or cybersecurity auditing, or engineering, and use frameworks such as IIA’s AI Framework, COSO ERM Framework, or the U.S. GAO’s Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities. Much like other other IT auditors, they review and audit the development, deployment, and operation of systems to ensure that they align with business objectives and legal standards. AI auditors more than in other fields have also had a push to include consideration for sociotechnical issues as well. This includes analyzing the underlying algorithms and data used to develop the AI system, assessing its impact on various stakeholders, and recommending improvements to ensure that it is being used effectively.

WHY SHOULD THE FEDERAL GOVERNMENT BE THE ENTITY TO ACT RATHER THAN THE PRIVATE SECTOR OR STATE/LOCAL GOVERNMENT?

The federal government is uniquely positioned to take the lead on this issue because of its responsibility to protect civil rights and ensure compliance with federal laws and regulations. The federal government can provide the necessary resources, expertise, and implementation guidance to ensure that AI systems are audited in a fair, equitable, and transparent manner.

WHO IS LIKELY TO PUSH BACK ON THIS PROPOSAL AND HOW CAN THAT HURDLE BE OVERCOME?

Industry stakeholders may be resistant to these changes. They should be engaged in the development of tools and guidelines so their concerns can be addressed and effort should be made to clearly communicate the benefits of increased accountability and transparency for both the industry and the public. Collaboration and transparency are key to overcoming potential hurdles, as is making any tools produced user-friendly and accessible.

Additionally, there may be pushback on the tool design. It is important to remember that currently, engineers often use fairness tools at the end of a development process, as a last box to check, instead of as an integrated part of the AI development lifecycle. These concerns can be addressed by emphasizing the comprehensive approach taken and by developing the necessary guidance to accompany these tools—which does not currently exist.

WHAT ARE SOME OTHER EXAMPLES OF HOW AI HAS HARMED SOCIETY

Example #1: Healthcare

New York regulators are calling on a UnitedHealth Group to either stop using or prove there is no problem with a company-made algorithm that researchers say exhibited significant racial bias. This algorithm, which UnitedHealth Group sells to hospitals for assessing the health risks of patients, assigned similar risk scores to white patients and Black patients despite the Black patients being considerably sicker.

In this case, researchers found that changing just one parameter could generate “an 84% reduction in bias.” If we had specific information on the parameters going into the model and how they are weighted, we would have a record-keeping system to see how certain interventions affected the output of this model.

Bias in AI systems used in healthcare could potentially violate the Constitution’s Equal Protection Clause, which prohibits discrimination on the basis of race. If the algorithm is found to have a disproportionately negative impact on a certain racial group, this could be considered discrimination. It could also potentially violate the Due Process Clause, which protects against arbitrary or unfair treatment by the government or a government actor. If an algorithm used by hospitals, which are often funded by the government or regulated by government agencies, is found to exhibit significant racial bias, this could be considered unfair or arbitrary treatment.

Example #2: Policing

A UN panel on the Elimination of Racial Discrimination has raised concern over the increasing use of technologies like facial recognition in law enforcement and immigration, warning that it can exacerbate racism and xenophobia and potentially lead to human rights violations. The panel noted that while AI can enhance performance in some areas, it can also have the opposite effect as it reduces trust and cooperation from communities exposed to discriminatory law enforcement. Furthermore, the panel highlights the risk that these technologies could draw on biased data, creating a “vicious cycle” of overpolicing in certain areas and more arrests. It recommends more transparency in the design and implementation of algorithms used in profiling and the implementation of independent mechanisms for handling complaints.

A case study on the Chicago Police Department’s Strategic Subject List (SSL) discusses an algorithm-driven technology used by the department to identify individuals at high risk of being involved in gun violence and inform its policing strategies. However, a study by the RAND Corporation on an early version of the SSL found that it was not successful in reducing gun violence or reducing the likelihood of victimization, and that inclusion on the SSL only had a direct effect on arrests. The study also raised significant privacy and civil rights concerns. Additionally, findings reveal that more than one-third of individuals on the SSL, approximately 70% of that cohort, have never been arrested or been a victim of a crime yet received a high-risk score. Furthermore, 56% of Black men under the age of 30 in Chicago have a risk score on the SSL. This demographic has also been disproportionately affected by the CPD’s past discriminatory practices and issues, including torturing Black men between 1972 and 1994, performing unlawful stops and frisks disproportionately on Black residents, engaging in a pattern or practice of unconstitutional use of force, poor data collection, and systemic deficiencies in training and supervision, accountability systems, and conduct disproportionately affecting Black and Latino residents.

Predictive policing, which uses data and algorithms to try to predict where crimes are likely to occur, has been criticized for reproducing and reinforcing biases in the criminal justice system. This can lead to discriminatory practices and violations of the Fourth Amendment’s prohibition on unreasonable searches and seizures, as well as the Fourteenth Amendment’s guarantee of equal protection under the law. Additionally, bias in policing more generally can also violate these constitutional provisions, as well as potentially violating the Fourth Amendment’s prohibition on excessive force.

Example #3: Recruiting

ADS in recruiting crunch large amounts of personal data and, given some objective, derive relationships between data points. The aim is to use systems capable of processing more data than a human ever could to uncover hidden relationships and trends that will then provide insights for people making all types of difficult decisions.

Hiring managers across different industries use ADS every day to aid in the decision-making process. In fact, a 2020 study reported that 55% of human resources leaders in the United States use predictive algorithms across their business practices, including hiring decisions.

For example, employers use ADS to screen and assess candidates during the recruitment process and to identify best-fit candidates based on publicly available information. Some systems even analyze facial expressions during interviews to assess personalities. These systems promise organizations a faster, more efficient hiring process. ADS do theoretically have the potential to create a fairer, qualification-based hiring process that removes the effects of human bias. However, they also possess just as much potential to codify new and existing prejudice across the job application and hiring process.

The use of ADS in recruiting could potentially violate several constitutional laws, including discrimination laws such as Title VII of the Civil Rights Act of 1964 and the Americans with Disabilities Act. These laws prohibit discrimination on the basis of race, gender, and disability, among other protected characteristics, in the workplace. Additionally, the these systems could also potentially violate the right to privacy and the due process rights of job applicants. If the systems are found to be discriminatory or to violate these laws, they could result in legal action against the employers.

WHAT OPEN-SOURCE TOOLS COULD BE LEVERAGED FOR THIS PROJECT?

Aequitas, Accenture Algorithmic Fairness. Alibi Explain, AllenNLP, BlackBox Auditing, DebiasWE, DiCE, ErrorAnalysis, EthicalML xAI, Facebook DynaBoard, Fairlearn, FairSight, FairTest, FairVis, FoolBox, Google Explainable AI, Google KnowYourData, Google ML Fairness Gym, Google PAIR Facets, Google PAIR Language Interpretability Tool, Google PAIR Saliency, Google PAIR What-If Tool, IBM Adversarial Robustness Toolbox, IBM AI Fairness 360, IBM AI Explainability 360, Lime, MLI, ODI Data Ethics Canvas, Parity, PET Repository, PwC Responsible AI Toolkit, Pymetrics audit-AI, RAN-debias, REVISE, Saidot, SciKit Fairness, Skater, Spatial Equity Data Tool, TCAV, UnBias Fairness Toolkit

AI for science: creating a virtuous circle of discovery and innovation

In this interview, Tom Kalil discusses the opportunities for science agencies and the research community to use AI/ML to accelerate the pace of scientific discovery and technological advancement.

Q. Why do you think that science agencies and the research community should be paying more attention to the intersection between AI/ML and science?

Recently, researchers have used DeepMind’s AlphaFold to predict the structures of more than 200 million proteins from roughly 1 million species, covering almost every known protein on the planet! Although not all of these predictions will be accurate, this is a massive step forward for the field of protein structure prediction.

The question that science agencies and different research communities should be actively exploring is – what were the pre-conditions for this result, and are there steps we can take to create those circumstances in other fields?

Photo by DeepMind on Unsplash

One partial answer to that question is that the protein structure community benefited from a large open database (the Protein Data Bank) and what linguist Mark Liberman calls the “Common Task Method.”

Q. What is the Common Task Method (CTM), and why is it so important for AI/ML?

In a CTM, competitors share the common task of training a model on a challenging, standardized dataset with the goal of receiving a better score. One paper noted that common tasks typically have four elements:

Tasks are formally defined with a clear mathematical interpretation
Easily accessible gold-standard datasets are publicly available in a ready-to-go standardized format
One or more quantitative metrics are defined for each task to judge success
State-of-the-art methods are ranked in a continuously updated leaderboard

Computational physicist and synthetic biologist Erika DeBenedictis has proposed adding a fifth component, which is that “new data can be generated on demand.” Erika, who runs Schmidt Futures-supported competitions such as the 2022 BioAutomation Challenge, argues that creating extensible living datasets has a few advantages. This approach can detect and help prevent overfitting; active learning can be used to improve performance per new datapoint; and datasets can grow organically to a useful size.

Common Task Methods have been critical to progress in AI/ML. As David Donoho noted in 50 Years of Data Science,

The ultimate success of many automatic processes that we now take for granted—Google translate, smartphone touch ID, smartphone voice recognition—derives from the CTF (Common Task Framework) research paradigm, or more specifically its cumulative effect after operating for decades in specific fields. Most importantly for our story: those fields where machine learning has scored successes are essentially those fields where CTF has been applied systematically.

Q. Why do you think that we may be under-investing in the CTM approach?

U.S. agencies have already started to invest in AI for Science. Examples include NSF’s AI Institutes, DARPA’s Accelerated Molecular Discovery, NIH’s Bridge2AI, and DOE’s investments in scientific machine learning. The NeurIPS conference (one of the largest scientific conferences on machine learning and computational neuroscience) now has an entire track devoted to datasets and benchmarks.

However, there are a number of reasons why we are likely to be under-investing in this approach.

These open datasets, benchmarks and competitions are what economists call “public goods.” They benefit the field as a whole, and often do not disproportionately benefit the team that created the dataset. Also, the CTM requires some level of community buy-in. No one researcher can unilaterally define the metrics that a community will use to measure progress.
Researchers don’t spend a lot of time coming up with ideas if they don’t see a clear and reliable path to getting them funded. Researchers ask themselves, “what datasets already exist, or what dataset could I create with a $500,000 – $1 million grant?” They don’t ask the question, “what dataset + CTM would have a transformational impact on a given scientific or technological challenge, regardless of the resources that would be required to create it?” If we want more researchers to generate concrete, high-impact ideas, we have to make it worth the time and effort to do so.
Many key datasets (e.g., in fields such as chemistry) are proprietary, and were designed prior to the era of modern machine learning. Although researchers are supposed to include Data Management Plans in their grant applications, these requirements are not enforced, data is often not shared in a way that is useful, and data can be of variable quality and reliability. In addition, large dataset creation may sometimes not be considered academically novel enough to garner high impact publications for researchers.
Creation of sufficiently large datasets may be prohibitively expensive. For example, experts estimate that the cost of recreating the Protein Data Bank would be $15 billion! Science agencies may need to also explore the role that innovation in hardware or new techniques can play in reducing the cost and increasing the uniformity of the data, using, for example, automation, massive parallelism, miniaturization, and multiplexing. A good example of this was NIH’s $1,000 Genome project, led by Jeffrey Schloss.

Q. Why is close collaboration between experimental and computational teams necessary to take advantage of the role that AI can play in accelerating science?

According to Michael Frumkin with Google Accelerated Science, what is even more valuable than a static dataset is a data generation capability, with a good balance of latency, throughput, and flexibility. That’s because researchers may not immediately identify the right “objective function” that will result in a useful model with real-world applications, or the most important problem to solve. This requires iteration between experimental and computational teams.

Q. What do you think is the broader opportunity to enable the digital transformation of science?

I think there are different tools and techniques that can be mixed and matched in a variety of ways that will collectively enable the digital transformation of science and engineering. Some examples include:

Self-driving labs (and eventually, fleets of networked, self-driving labs), where machine learning is not only analyzing the data but informing which experiment to do next.
Scientific equipment that is high-throughput, low-latency, automated, programmable, and potentially remote (e.g. “cloud labs”).
Novel assays and sensors.
The use of “science discovery games” that allow volunteers and citizen scientists to more accurately label training data. For example, the game Mozak trains volunteers to collaboratively reconstruct complex 3D representations of neurons.
Advances in algorithms (e.g. progress in areas such as causal inference, interpreting high-dimensional data, inverse design, uncertainty quantification, and multi-objective optimization).
Software for orchestration of experiments, and open hardware and software interfaces to allow more complex scientific workflows.
Integration of machine learning, prior knowledge, modeling and simulation, and advanced computing.
New approaches to informatics and knowledge representation – e.g. a machine-readable scientific literature, increasing number of experiments that can be expressed as code and are therefore more replicable.
Approaches to human-machine teaming that allow the best division of labor between human scientists and autonomous experimentation.
Funding mechanisms, organizational structures and incentives that enable the team science and community-wide collaboration needed to realize the potential of this approach.

There are many opportunities at the intersection of these different scientific and technical building blocks. For example, use of prior knowledge can sometimes reduce the amount of data that is needed to train a ML model. Innovation in hardware could lower the time and cost of generating training data. ML can predict the answer that a more computationally-intensive simulation might generate. So there are undoubtedly opportunities to create a virtuous circle of innovation.

Q. Are there any risks of the common task method?

Some researchers are pointing to negative sociological impacts associated with “SOTA-chasing” – e.g. a single-minded focus on generating a state-of-the-art result. These include reducing the breadth of the type of research that is regarded as legitimate, too much competition and not enough cooperation, and overhyping AI/ML results with claims of “super-human” levels of performance. Also, a researcher who makes a contribution to increasing the size and usefulness of the dataset may not get the same recognition as the researcher who gets a state-of-the-art result.

Some fields that have become overly dominated by incremental improvements in a metric have had to introduce Wild and Crazy Ideas as a separate track in their conferences to create a space for more speculative research directions.

Q. Which types of science and engineering problems should be prioritized?

One benefit to the digital transformation of science and engineering is that it will accelerate the pace of discovery and technological advances. This argues for picking problems where time is of the essence, including:

Developing and manufacturing carbon-neutral and carbon-negative technologies we need for power, transportation, buildings, industry, and food and agriculture. Currently, it can take 17-20 years to discover and manufacture a new material. This is too long if we want to meet ambitious 2050 climate goals.
Improving our response to future pandemics by being able to more rapidly design, develop and evaluate new vaccines, therapies, and diagnostics.
Addressing new threats to our national security, such as engineered pathogens and the technological dimension of our economic and military competition with peer adversaries.

Obviously, it also has to be a problem where AI and ML can make a difference, e.g. ML’s ability to approximate a function that maps between an input and an output, or to lower the cost of making a prediction.

Q. Why should economic policy-makers care about this as well?

One of the key drivers of the long-run increases in our standard of living is productivity (output per worker), and one source of productivity is what economists call general purpose technologies (GPTs). These are technologies that have a pervasive impact on our economy and our society, such as interchangeable parts, the electric grid, the transistor, and the Internet.

Historically – GPTs have required other complementary changes (e.g. organizational changes, changes in production processes and the nature of work) before their economic and societal benefits can be realized. The introduction of electricity eventually led to massive increases in manufacturing productivity, but not until factories and production lines were reorganized to take advantage of small electric motors. There are similar challenges for fostering the role that AI/ML and complementary technologies will play in accelerating the pace of scientific and technological advances:

Researchers and science funders need to identify and support the technical infrastructure (e.g. datasets + CTMs, self-driving labs) that will move an entire field forward, or solve a particularly important problem.
A leading academic researcher involved in protein structure prediction noted that one of the things that allowed DeepMind to make so much progress on the protein folding problem was that “everyone was rowing in the same direction,” “18 co-first authors .. an incentive structure wholly foreign to academia,” and “a fast and focused research paradigm … [which] raises the question of what other problems exist that are ripe for a fast and focused attack.” So capitalizing on the opportunity is likely to require greater experimentation in mechanisms for funding, organizing and incentivizing research, such as Focused Research Organizations.

Q. Why is this an area where it might make sense to “unbundle” idea generation from execution?

Traditional funding mechanisms assume that the same individual or team who has an idea should always be the person who implements the idea. I don’t think this is necessarily the case for datasets and CTMs. A researcher may have a brilliant idea for a dataset, but may not be in a position to liberate the data (if it already exists), rally the community, and raise the funds needed to create the dataset. There is still a value in getting researchers to submit and publish their ideas, because their proposal could be catalytic of a larger-scale effort.

Agencies could sponsor white paper competitions with a cash prize for the best ideas. [A good example of a white paper competition is MIT’s Climate Grand Challenge, which had a number of features which made it catalytic.] Competitions could motivate researchers to answer questions such as:

What dataset and Common Task would have a significant impact on our ability to answer a key scientific question or make progress on an important use-inspired or technological problem? What preliminary work has been done or should be done prior to making a larger-scale investment in data collection?
To the extent that industry would also find the data useful, would they be willing to share the cost of collecting it? They could also share existing data, including the results from failed experiments.
What advance in hardware or experimental techniques would lower the time and cost of generating high-value datasets by one or more orders of magnitude?
What self-driving lab would significantly accelerate progress in a given field or problem, and why?

The views and opinions expressed in this blog are the author’s own and do not necessarily reflect the view of Schmidt Futures.

The Magic Laptop Thought Experiment

One of the main goals of Kalil’s Corner is to share some of the things I’ve learned over the course of my career about policy entrepreneurship. Below is an FAQ on a thought experiment that I think is useful for policy entrepreneurs, and how the thought experiment is related to a concept I call “shared agency.”

Q. What is your favorite thought experiment?

Imagine that you have a magic laptop. The power of the laptop is that any press release that you write will come true.

You have to write a headline (goal statement), several paragraphs to provide context, and 1-2 paragraph descriptions of who is agreeing to do what (in the form organization A takes action B to achieve goal C). The individuals or organizations could be federal agencies, the Congress, companies, philanthropists, investors, research universities, non-profits, skilled volunteers, etc. The constraint, however, is that it has to be plausible that the organizations would be both willing and able to take the action. For example, a for-profit company is not going to take actions that are directly contrary to the interests of their shareholders.

What press release would you write, and why? What makes this a compelling idea?

Q. What was the variant of this that you used to ask people when you worked in the White House for President Obama?

You have a 15-minute meeting in the Oval Office with the President, and he asks:

“If you give me a good idea, I will call anyone on the planet. It can be a conference call, so there can be more than one person on the line. What’s your idea, and why are you excited about it? In order to make your idea happen, who would I need to call and what would I need to ask them to do in order to make it happen?”

Q. What was your motivation for posing this thought experiment to people?

I’ve been in roles where I can occasionally serve as a “force multiplier” for other people’s ideas. The best way to have a good idea is to be exposed to many ideas.

When I was in the White House, I would meet with a lot of people who would tell me that what they worked on was very important, and deserved greater attention from policy-makers.

But when I asked them what they wanted the Administration to consider doing, they didn’t always have a specific response. Sometimes people would have the kernel of a good idea, but I would need to play “20 questions” with them to refine it. This thought experiment would occasionally help me elicit answers to basic questions like who, what, how and why.

Q. Why does this thought experiment relate to the Hamming question?

Richard Hamming was a researcher at Bell Labs who used to ask his colleagues, “What are the most important problems in your field? And what are you working on?” This would annoy some of his colleagues, because it forced them to confront the fact that they were working on something that they didn’t think was that important.

If you really did have a magic laptop or a meeting with the President, you would presumably use it to help solve a problem that you thought was important!

Q. How does this thought experiment highlight the importance of coalition-building?

There are many instances where we have a goal that requires building a coalition of individuals and organizations.

It’s hard to do that if you can’t identify (1) the potential members of the coalition; and (2) the mutually reinforcing actions you would like them to consider taking.

Once you have a hypothesis about the members of your coalition of the willing and able, you can begin to ask and answer other key questions as well, such as:

Why is it in the enlightened self-interest of the members of the coalition to participate?
Who is the most credible messenger for your idea? Who could help convene the coalition?
Is there something that you or someone else can do to make it easier for them to get involved? For example, policy-makers do things with words, in the same way that a priest changes the state of affairs in the world by stating “I now pronounce you husband and wife.” Presidents sign Executive Orders, Congress passes legislation, funding agencies issue RFPs, regulatory agencies issue draft rules for public comment, and so on. You can make it easier for a policy-maker to consider an idea by drafting the documents that are needed to frame, make, and implement a decision.
If a member of the coalition is willing but not able, can someone else take some action that relaxes the constraint that is preventing them from participating?
What evidence do you have that if individual or organization A took action B, that C is likely to occur?
What are the risks associated with taking the course of action that you recommend, and how could they be managed or mitigated?

Q. Is this thought experiment only relevant to policy-makers?

Not at all. I think it is relevant for any goal that you are pursuing — especially ones that require concerted action by multiple individuals and organizations to accomplish.

Q. What’s the relationship between this thought experiment and Bucky Fuller’s concept of a “trim tab?”

Fuller observed that a tiny device called a trim tab is designed to move a rudder, which in turn can move a giant ship like the Queen Elizabeth.

So, it’s incredibly useful to identify these leverage points that can help solve important problems.

For example, some environmental advocates have focused on the supply chains of large multinationals. If these companies source products that are more sustainable (e.g. cooking oils that are produced without requiring deforestation) – that can have a big impact on the environment.

Q. What steps can people take to generate better answers to this thought experiment?

There are many things – like having a deep understanding of a particular problem, being exposed to both successful and unsuccessful efforts to solve important problems in many different domains, or understanding how particular organizations that you are trying to influence make decisions.

One that I’ve been interested in is the creation of a “toolkit” for solving problems. If, as opposed to having a hammer and looking for nails to hit, you also have a saw, a screwdriver, and a tape measure, you are more likely to have the right tool or combination of tools for the right job.

For example, during my tenure in the Obama Administration, my team and other people in the White House encouraged awareness and adoption of dozens of approaches to solving problems, such as:

Sponsoring incentive prizes, which allow agencies to set a goal without having to choose the team or approach that is most likely to be successful;
Making open data available in machine-readable formats, and encouraging teams to develop new applications that use the data to solve a real-world problem;
Changing federal hiring practices and recruiting top technical talent;
Embracing modern software methodologies such as agile and human-centered design for citizen-facing digital services;
Identifying and pursuing 21^st century moonshots;
Using insights from behavioral science to improve policies and programs;
Using and building evidence to increase the share of federal resources going to more effective interventions;
Changing procurement policies so that the government can purchase products and services from startups and commercial firms, not just traditional contractors.

Of course, ideally one would be familiar with the problem-solving tactics of different types of actors (companies, research universities, foundations, investors, civil society organization) and individuals with different functional or disciplinary expertise. No one is going to master all of these tools, but you might aspire to (1) know that they exist; (2) have some heuristics about when and under what circumstances you might use them; and (3) know how to learn more about a particular approach to solving problems that might be relevant. For example, I’ve identified a number of tactics that I’ve seen foundations and nonprofits use.

Q. How does this thought experiment relate to the concept that psychologists call “agency?”

Agency is defined by psychologists like Albert Bandura as “the human capability to influence …the course of events by one’s actions.”

The particular dimension of agency that I have experienced is a sense that there are more aspects of the status quo that are potentially changeable as opposed to being fixed. These are the elements of the status quo that are attributable to human action or inaction, as opposed to the laws of physics.

Obviously, this sense of agency didn’t extend to every problem under the sun. It was limited to those areas where progress could be made by getting identifiable individuals and organizations to take some action – like the President signing an Executive Order or proposing a new budget initiative, the G20 agreeing to increase investment in a global public good, Congress passing a law, or a coalition of organizations like companies, foundations, nonprofits and universities working together to achieve a shared goal.

Q. How did you develop a strong sense of agency over the course of your career?

I had the privilege of working at the White House for both Presidents Clinton and Obama.

As a White House staffer, I had the ability to send the President a decision memo. If he checked the box that said “yes” – and the idea actually happened and was well-implemented, this reinforced my sense of agency.

But it wasn’t just the experience of being successful. It was also the knowledge that one acquires by repeatedly trying to move from an idea to something happening in the world, such as:

Working with the Congress to pass legislation that gave every agency the authority to sponsor incentive prizes for up to $50 million;
Including funding for dozens of national science and technology initiatives in the President’s budget, such as the National Nanotechnology Initiative or the BRAIN Initiative;
Recruiting people to the White House to help solve hard and important problems, like reducing the waiting list for an organ transplant, allowing more foreign entrepreneurs to come to the United States, or launching a behavioral sciences unit within the federal government; and,
Using the President’s “bully pulpit” to build coalitions of companies, non-profits, philanthropists, universities, etc. to achieve a particular goal, like expanding opportunities for more students to excel in STEM.

Q. What does it mean for you to have a shared sense of agency with another individual, a team, or a community?

Obviously, most people have not had 16 years of their professional life in which they could send a decision memo to the President, get a line in the President’s State of the Union address, work with Congress to pass legislation, create a new institution, shape the federal budget, and build large coalitions with hundreds of organizations that are taking mutually reinforcing actions in the pursuit of a shared goal.

So sometimes when I am talking to an individual, a team or a community, it will become clear to me that there is some aspect of the status quo that they view as fixed, and I view as potentially changeable. It might make sense for me to explain why I believe the status quo is changeable, and what are the steps we could take together in the service of achieving a shared goal.

Q. Why is shared agency important?

Changing the status quo is hard. If I don’t know how to do it, or believe that I would be tilting at windmills – it’s unlikely that I would devote a lot of time and energy to trying to do so.

It may be the case that pushing for change will require a fair amount of work, such as:

Expressing my idea clearly, and communicating it effectively to multiple audiences;
Marshaling the evidence to support it;
Determining who the relevant “deciders” are for a given idea;
Addressing objections or misconceptions, or responding to legitimate criticism; and
Building the coalition of people and institutions who support the idea, and may be prepared to take some action to advance it.

So if I want people to devote time and energy to fleshing out an idea or doing some of the work needed to make it happen, I need to convince them that something constructive could plausibly happen. And one way to do that is to describe what success might look like, and discuss the actions that we would take in order to achieve our shared goal. As an economist might put it, I am trying to increase their “expected return” of pursuing a shared goal by increasing the likelihood that my collaborators attach to our success.

Q. Are there risks associated with having this strong sense of agency, and how might one mitigate against those risks?

Yes, absolutely. One is a lack of appropriate epistemic humility, by pushing a proposed solution in the absence of reasonable evidence that it will work, or failing to identify unintended consequences. It’s useful to read books like James Scott’s Seeing Like a State.

I also like the idea of evidence-based policy. For example, governments should provide modest amounts of funding for new ideas, medium-sized grants to evaluate promising approaches, and large grants to scale interventions that have been rigorously evaluated and have a high benefit to cost ratio.

The views and opinions expressed in this blog are the author’s own and do not necessarily reflect the view of Schmidt Futures.

Creating an AI Testbed for Government

Summary

The United States should establish a testbed for government-procured artificial intelligence (AI) models used to provide services to U.S. citizens. At present, the United States lacks a uniform method or infrastructure to ensure that AI systems are secure and robust. Creating a standardized testing and evaluation scheme for every type of model and all its use cases is an extremely challenging goal. Consequently, unanticipated ill effects of AI models deployed in real-world applications have proliferated, from radicalization on social media platforms to discrimination in the criminal justice system. Increased interest in integrating emerging technologies into U.S. government processes raises additional concerns about the robustness and security of AI systems.

Establishing a designated federal AI testbed is an important part of alleviating these concerns. Such a testbed will help AI researchers and developers better understand how to construct testing methods and ultimately build safer, more reliable AI models. Without this capacity, U.S. agencies risk perpetuating existing structural inequities as well as creating new government systems based on insecure AI systems — both outcomes that could harm millions of Americans while undermining the missions that federal agencies are entrusted to pursue.

An Institute for Scalable Heterogeneous Computing

Summary

The future of computing innovation is becoming more uncertain as the 2020s have brought about a pivot point in the global semiconductor industry. We owe this uncertainty to several factors, including the looming end of Moore’s Law, disruptions in semiconductor supply chains, international competition in innovation investment, a growing demand for more specialized computer chips, and the continued development of alternate computing paradigms, such as quantum computing.

In order to address the next generation of computing needs, architectures are beginning to emphasize the integration of multiple, specialized computing components. Within this framework, the U.S. is well poised to emerge as a leader in the future of next-generation computing, and more broadly advanced semiconductor manufacturing. However, there remains a missing link in the United States’ computing innovation strategy: a coordinating organization which will down-select and integrate the wide variety of promising, next-generation computing materials, architectures, and approaches so that they can form the building blocks of advanced, high-performance, heterogeneous systems.

Armed with these facts, and using the existing authorization language in the 2021 National Defense Authorization Act (NDAA), the Biden Administration and Congress have a unique opportunity to establish a Manufacturing USA Institute under the National Institute of Standards and Technology (NIST) with the goal of pursuing advanced packaging for scalable heterogeneous computing. This Institute will leverage the enormous body of previous work in post-Moore computing funded by the federal government (Semiconductor Technology Advanced Research Network (STARnet), Nanoelectronics Computing Research (nCORE), Joint University Microelectronics Program (JUMP), Energy-Efficient Computing: From Devices to Architectures (E2CDA), Electronics Resurgence Initiative (ERI)) and will bridge a key gap in bringing these R&D efforts from the laboratory to real world applications. By doing this, the U.S. will be well positioned to continue its dominance in semiconductor design and potentially regain advanced semiconductor manufacturing activity over the coming decades.

A National Cloud for Conducting Disinformation Research at Scale

Summary

Online disinformation continues to evolve and threaten national security, federal elections, public health, and other critical U.S. sectors. Yet the federal government lacks access to data and computational power needed to study disinformation at scale. Those with the greatest capacity to study disinformation at scale are large technology companies (e.g., Google, Facebook, Twitter, etc.), which biases much research and limits federal capacity to address disinformation.

To address this problem, we propose that the Department of Defense (DOD) fund a one-year pilot of a National Cloud for Disinformation Research (NCDR). The NCDR would securely house disinformation data and provide computational power needed for the federal government and its partners to study disinformation. The NCDR should be managed by a governance team led by Federally Funded Research and Development Centers (FFRDCs) already serving the DOD. The FFRDC Governance Team will manage (i) which stakeholders can access the Cloud, (ii) coordinate sharing of data and computational resources among stakeholders, and (iii) motivate participation from diverse stakeholders (including industry; academia; federal, state, and local government, and non-governmental organizations).

A National Cloud for Disinformation Research will help the Biden-Harris administration fulfill its campaign promise to reposition the United States as a leader of the democratic world. The NCDR will benefit the federal government by providing access to data and computational resources needed to combat the threats and harms of disinformation. Our nation needs a National Cloud for Disinformation Research to foresee future disinformation attacks and safeguard our democracy in turbulent times.

Enabling Responsible U.S. Leadership on Global AI Regulation

Summary

Algorithmic governance concerns are critical for US foreign policy in the 21st century as they relate intimately to the relationship between governments and their citizens – the very fabric of the world’s societies. The United States should strategically invest resources into the principal multilateral forums in which digital technology regulation is currently under discussion. In partnership with like-minded governments and international organizations, the Biden-Harris Administration should set clear priorities championing a collective digital rights agenda that considers the impact of commercial algorithms and algorithmic decision-making on both American citizens and technology consumers around the world.

These investments would build substantially upon initial forays into national AI regulatory policy advanced by the National Security Commission on Artificial Intelligence (NSCAI) established by Congress in August 2018 and the Executive Order on Maintaining American Leadership in Artificial Intelligence issued in January 2019. Both policy moves featured broad approaches focused on national security and competitiveness, without seriously engaging the complex and context-specific problems of international governance that must be squarely addressed if the United States is to develop a coherent approach to AI regulation.

We suggest the federal government pay special attention to impacts on people living in regions outside the geographic focus of the most prominent regulatory deliberations today – which occur almost exclusively in Washington and the developed world. Such an inclusive, global approach to digital policymaking will increase the potential for the United States to bring the world along in efforts to develop meaningful, consumer-first internet policy that addresses the economic and social factors driving digital disparities. At a time when the risk of a global “splinternet” increasingly looms, this clarified focus will help establish effective rules toward which jurisdictions around the world can converge under U.S. leadership.

A National Framework for AI Procurement

Summary

As artificial intelligence (AI) applications for public use have proliferated, there has been a large uptick in challenges associated with AI safety and fairness. These challenges are due in part to poor transparency in and standardization of AI procurement protocols, particularly for public-use applications. In this memo, we propose a federal framework—orchestrated through the Office of Federal Procurement Policy (OFPP) situated in the Office of Management and Budget (OMB)—to standardize and guide AI procurement in a safer, fairer manner. While this framework is designed for federal implementation, it is important to recognize that many decisions on AI usage are made by municipalities. The principles guiding the federal framework outlined herein are intended to also help guide development and implementation of similar frameworks for AI procurement at the local level.

Establish a $100M National Lab of Neurotechnology for Brain Moonshots

A rigorous scientific understanding of how the brain works would transform human health and the economy by (i) enabling design of effective therapies for mental and neurodegenerative diseases (such as depression and Alzheimer’s), and (ii) fueling novel areas of enterprise for the biomedical, technology, and artificial intelligence industries. Launched in 2013, the U.S. BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative has made significant progress toward harnessing the ingenuity and creativity of individual laboratories in developing neurotechnological methods. This has provided a strong foundation for future work, producing advances like:

3-D microscopy and ultra-fast imaging that enables real-time observations of brain-cell activity in intact tissues: key data for understanding circuit principles underlying human behavior.
A sophisticated genetic method that could be helpful for finding new druggable targets that could be engaged to manage pain effectively, helping avoid the risks and harms of opioid drug addiction.
A portable backpack supporting (i) simultaneous recording from a stimulator implanted into a human subject’s brain (part of an early-stage clinical trial), (ii) measurement of other biomarkers, and (iii) recording of the subject’s positions and movements within their environment. This integrated and comprehensive dataset is helping researchers understand links between neural-circuit activity and behavior in humans.

However, pursuing these ambitious goals will require new approaches to brain research, at greater scale and scope.

Given the BRAIN Initiative’s momentum, this is the moment to expand the Initiative by investing in a National Laboratory of Neurotechnology (NLN) that would bring together a multidisciplinary team of researchers and engineers with combined expertise in physical and biomedical sciences. The NLN team would develop large-scale instruments, tools, and methods for recording and manipulating the activity of complex neural circuits in living animals or humans — studies that would enable us to understand how the brain works at a deeper, more detailed level than ever before. Specific high-impact initiatives that the NLN team could pursue include:

Developing a multibeam, large-scale electron microscope to map the wiring of neural circuits.
Determining the molecular components of each cell of the brain.
Fabricating a diverse array of implantable and wearable devices.
Conducting supercomputer-enabled mining of large datasets to support neurotechnology development.

The BRAIN Initiative currently funds small teams at existing research institutes. The natural next step is to expand the Initiative by establishing a dedicated center — staffed by a large, collaborative, and interdisciplinary team — capable of developing the high-cost, large-scale equipment needed to address complex and persistent challenges in the field of neurotechnology. Such a center would multiply the return on investment in brain research that the federal government is making on behalf of American taxpayers. Successful operation of a National Laboratory of Neurotechnology would require about $100 million per year.

To read a detailed vision for a National Laboratory of Neurotechnology, click here.