Emerging Technology

How Do OpenAI’s Efforts To Make GPT-4 “Safer” Stack Up Against The NIST AI Risk Management Framework?

05.11.23 | 9 min read | Text by Liam Alexander & Divyansh Kaushik

In March, OpenAI released GPT-4, another milestone in a wave of recent AI progress. This is OpenAI’s most advanced model yet, and it’s already being deployed broadly to millions of users and businesses, with the potential for drastic effects across a range of industries

But before releasing a new, powerful system like GPT-4 to millions of users, a crucial question is: “How can we know that this system is safe, trustworthy, and reliable enough to be released?” Currently, this is a question that leading AI labs are free to answer on their own–for the most part. But increasingly, the issue has garnered greater attention as many have become worried that the current pre-deployment risk assessment and mitigation methods like those done by OpenAI are insufficient to prevent potential risks, including the spread of misinformation at scale, the entrenchment of societal inequities, misuse by bad actors, and catastrophic accidents. 

This concern is central to a recent open letter, signed by several leading machine learning (ML) researchers and industry leaders, which calls for a 6-month pause on the training of AI systems “more powerful” than GPT-4 to allow more time for, among other things, the development of strong standards which would “ensure that systems adhering to them are safe beyond a reasonable doubt” before deployment. There’s a lot of disagreement over this letter, from experts who contest the letter’s basic narrative, to others who think that the pause is “a terrible idea” because it would unnecessarily halt beneficial innovation (not to mention that it would be impossible to implement). But almost all of the participants in this conversation tend to agree, pause or no, that the question of how to assess and manage risks of an AI system before actually deploying it is an important one. 

A natural place to look for guidance here is the National Institute of Standards and Technology (NIST), which released its AI Risk Management Framework (AI RMF) and an associated playbook in January. NIST is leading the government’s work to set technical standards and consensus guidelines for managing risks from AI systems, and some cite its standard-setting work as a potential basis for future regulatory efforts.

In this piece we walk through both what OpenAI actually did to test and improve GPT-4’s safety before deciding to release it, limitations of this approach, and how it compares to current best practices recommended by the National Institute of Standards and Technology (NIST). We conclude with some recommendations for Congress, NIST, industry labs like OpenAI, and funders.

What did OpenAI do before deploying GPT-4? 

OpenAI claims to have taken several steps to make their system “safer and more aligned”. What are those steps? OpenAI describes these in the GPT-4 “system card,” a document which outlines how OpenAI managed and mitigated risks from GPT-4 before deploying it. Here’s a simplified version of what that process looked like:

Was this enough? 

Though OpenAI says they significantly reduced the rates of undesired model behavior through the above process, the controls put in place are not robust, and methods for mitigating bad model behavior are still leaky and imperfect. 

OpenAI did not eliminate the risks they identified. The system card documents numerous failures of the current version of GPT-4, including an example in which it agrees to “generate a program calculating attractiveness as a function of gender and race.” 

Current efforts to measure risks also need work, according to GPT-4 red teamers. The Alignment Research Center (ARC) which assessed these models for “emergent” risks says that “the testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.” Another GPT-4 red-teamer,  Aviv Ovadya, says that “if red-teaming GPT-4 taught me anything, it is that red teaming alone is not enough.” Ovadya recommends that future pre-deployment risk assessment efforts are improved using “violet teaming,” in which companies identify “how a system (e.g., GPT-4) might harm an institution or public good, and then support the development of tools using that same system to defend the institution or public good.” 

Since current efforts to measure and mitigate risks of advanced systems are not perfect, the question comes down to when they are “good enough.” What levels of risk are acceptable?  Today, industry labs like OpenAI can mostly rely on their own judgment when answering this question, but there are many different standards that could be used. Amba Kak, the executive director of the AI Now Institute, suggests a more stringent standard, arguing that regulators should require AI companies ”to prove that they’re going to do no harm” before releasing a system. To meet such a standard, new, much more systematic risk management and measurement approaches would be needed.

How did OpenAI’s efforts map on to NIST’s Risk Management Framework? 

NIST’s AI RMF Core consists of four main “functions,” broad outcomes which AI developers can aim for as they develop and deploy their systems: map, measure, manage, and govern. 

Framework users can map the overall context in which a system will be used to determine relevant risks that should be “on their radar” in that identified context. They can then measure identified risks quantitatively or qualitatively, before finally managing them, acting to mitigate risks based on projected impact. The govern function is about having a well-functioning culture of risk management to support effective implementation of the three other functions.

Looking back to OpenAI’s process before releasing GPT-4, we can see how their actions would align with each function in the RMF Core. This is not to say that OpenAI applied the RMF in its work; we’re merely trying to assess how their efforts might align with the RMF.  

Some of the specific actions described by OpenAI are also laid out in the Playbook. The Measure 2.7 function highlights “red-teaming” activities as a way to assess an AI system’s “security and resilience,” for example.

NIST’s resources provide a helpful overview of considerations and best practices that can be taken into account when managing AI risks, but they are not currently designed to provide concrete standards or metrics by which one can assess whether the practices taken by a given lab are “adequate.” In order to develop such standards, more work would be needed. To give some examples of current guidance that could be clarified or made more concrete:

So, across NIST’s AI RMF, while determining whether a given “outcome” has been achieved could be up for debate, nothing stops developers from going above and beyond the perceived minimum (and we believe they should). This is not a bug of the framework as it is currently designed, rather a feature, as the RMF “does not prescribe risk tolerance.” However, it is important to note that more work is needed to establish both stricter guidelines which leading labs can follow to mitigate risks from leading AI systems, and concrete standards and methods for measuring risk on top of which regulations could be built. 

Recommendations

There are a few ways that standards for pre-deployment risk assessment and mitigation for frontier systems can be improved: 

Congress

NIST

Industry Labs

Funders