AI Safety Needs a Shared Playbook—Before It’s Too Late

Summary
  • Frontier AI systems are advancing rapidly, but evaluation methods remain fragmented and opaque.
  • Major companies assess risks like cyber and biothreats differently, making cross-model comparisons nearly impossible.
  • This lack of standardization could delay interventions if dangerous capabilities go undetected.
  • A transparent, unified evaluation framework would empower faster, safer, and more accountable AI development.

Introduction

Frontier AI technologies are accelerating at a pace few predicted, promising breakthroughs in everything from scientific discovery to natural language understanding. Yet alongside this rapid progress, a critical problem is looming. While leading AI companies—such as OpenAI, Anthropic, Google DeepMind, and Meta—all acknowledge the need to evaluate security risks, especially in the realms of cyber offense and CBRN (chemical, biological, radiological, and nuclear) threats, each currently relies on its own, often opaque, in-house model evaluations.

At first glance, allowing each company to define its own methods, terminology, and reporting standards for evaluations might seem harmless—or even beneficial—as it can encourage independent innovation. However, this fragmented approach to risk evaluation creates a fundamental blind spot: policymakers, regulators, and external stakeholders cannot easily compare the risks posed by different models or track escalating capabilities in security-critical areas, such as autonomous exploit discovery or biothreat facilitation, over time. Without clear, consistent measurements of these risks and harmful capabilities, it is nearly impossible for observers to determine whether a new model’s capacity for malicious actions has crossed a dangerous threshold.

As AI systems advance, these blind spots could lead to real-world harms, particularly if a newly released model exhibits powerful offensive cyber capabilities or drastically lowers the barriers to weaponizing biological agents. In such scenarios, inconsistencies in how companies measure and disclose risks could delay urgent interventions. Policymakers would be left scrambling after threats materialize, rather than taking proactive steps to contain them.

In this piece, I argue that a shared, transparent evaluation framework across major AI companies is more than a bureaucratic nicety: it’s an urgent necessity. By unifying standards for assessing AI-driven threats, we can empower policymakers to act swiftly and ensure that next-generation AI remains a strategic advantage rather than an unforeseen vulnerability.

Current landscape of AI evaluations

Across today’s leading AI companies, there is no single, universally recognized playbook for how to gauge frontier AI risks. Each organization implements its own methodology and risk classification scheme, often with varying levels of depth and disclosure. While these approaches generally share the goal of preventing misuse—such as the use of AI for cyberattacks or dangerous biological research—they diverge in terminology, thresholds for concern, and transparency regarding how conclusions are reached. These conclusions include determinations of whether a model is safe across the tested capability areas (cyber, bio, and so on) and of which “tier” or risk level its capabilities fall under, with each company applying its own, slightly idiosyncratic tier definitions and evaluation methods.

Consider OpenAI, which publishes “system cards” outlining notable strengths and limitations of models like GPT-4. While these documents offer insight into certain risky capabilities, they do not use the same definitions or thresholds as Anthropic’s reports for Claude, which are guided by Anthropic’s “Responsible Scaling Policy.” Google DeepMind references high-level AI principles in its “Frontier Safety Framework” but has not published detailed model cards for some recent releases—including Gemini 2.0 Flash and Gemini 2.5 Pro—making it unclear how, or whether, it evaluates advanced cybersecurity or CBRN threats. Other companies, such as Meta’s AI division or Elon Musk’s xAI, have minimal or no formal documentation describing how they assess potential cyber-offensive or CBRN risks.

These discrepancies matter. A “moderate” cyber risk rating at one company might be labeled “AI R&D-3” at another—or not labeled at all. Some companies pledge specific interventions if they detect dangerously capable models—Anthropic, for instance, outlines steps to limit deployment if a model crosses certain safety thresholds—whereas others provide no details on what should trigger additional safeguards. The result is a patchwork of approaches that leaves policymakers and outside experts guessing how to interpret each company’s safety claims. Moreover, without consistent benchmarks or oversight, companies may find their profit incentives at odds with transparent self-reporting. Left unchecked, those incentives make it far likelier that a powerful model will be released without anyone realizing—or admitting—how dangerous its capabilities actually are. Opacity does not merely erode trust; it can put a hazardous system into the wild before effective safeguards are in place.

Critically, the fragmented nature of these evaluations makes it nearly impossible to compare risky capabilities—such as automated hacking or bioweapons facilitation—across models. If Company A claims that its system can autonomously identify and exploit zero-day vulnerabilities, but Company B does not test for these behaviors at all, how can security agencies track the overall trajectory of such capabilities? Without consistent baselines and transparent reporting, no clear picture can emerge.

Figure 1. Examples of published materials.

A future with vs. without standardized evaluations

Imagine it is a year from now, and a major AI developer quietly releases a breakthrough model. Within days, cybersecurity experts discover that this model can autonomously identify and exploit unknown vulnerabilities in critical infrastructure—from hospital networks to financial institutions.

Under today’s fragmented system, there is no common trigger that would compel the company to disclose, or even internally recognize, these offensive capabilities. Perhaps the company did test for hacking behaviors but used metrics incompatible with other companies’ benchmarks—or it never evaluated cyber-offensive potential at all. By the time outside observers confirm the danger, critical vulnerabilities have already been exploited. Government agencies scramble to contain the damage, much like they did following major software flaws such as the 2021 Log4Shell vulnerability in Apache Log4j. Only this time, the breach is being driven by a highly capable AI system. Headlines decry system-wide outages, and public confidence in AI governance plummets.

Now picture a scenario in which all major AI companies subscribe to shared evaluation frameworks. The moment a new model crosses a predefined capability threshold—for example, demonstrating a markedly higher success rate at autonomously discovering and chaining zero-day exploits than any prior baseline—it would trigger a sequenced response: an external audit, a formal review, and a tightening of access to the capability, which may include a temporary pause on broader deployment until mitigations are in place. These measures would not magically eliminate the threat, but they would buy institutions critical time to patch vulnerabilities and allow officials to prepare a robust defense, rather than forcing them to lurch into last-minute crisis mode.

In essence, the difference comes down to whether powerful capabilities slip under the radar due to incompatible or insufficiently rigorous evaluations, or are flagged early through a well-coordinated system of checks. As frontier models rapidly improve, that distinction could determine whether we respond to emerging threats with measured preparedness or haphazard firefighting.

The case for standardization

Standardization is a powerful mechanism for transparency and accountability. A single, widely recognized framework would allow AI companies to present evaluations consistently, making it easier for policymakers, researchers, and the public to compare risks and identify emerging threats. This clarity would also level the playing field, discouraging companies from downplaying hazards for competitive advantage.

A shared language for risk assessment

If every frontier AI company used the same terminology and metrics to describe cyber-offensive capabilities or CBRN-related functions, comparing models and flagging emerging dangers would become far more efficient. Policymakers could readily see how a “high risk” rating from OpenAI aligns with a “CBRN-3” designation at Anthropic (to name just one example), and whether those designations demand immediate intervention. This consistency would also enable oversight bodies—such as the U.S. AI Safety Institute or international regulators—to issue targeted guidance without grappling with an alphabet soup of competing definitions.
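
To illustrate what a shared vocabulary could enable in practice, the sketch below maps vendor-specific risk labels onto a single common scale. Everything here is a hypothetical placeholder: the vendor names, the labels, and the five-point scale are assumptions for illustration, not any company's actual published categories or an agreed crosswalk.

```python
# A minimal sketch of a shared risk vocabulary. Vendor names, labels, and the
# 0-4 scale are hypothetical placeholders, not real published categories.

# Shared scale used by every developer when reporting evaluation results.
COMMON_SCALE = {0: "negligible", 1: "low", 2: "moderate", 3: "high", 4: "critical"}

# Hypothetical crosswalk from each developer's in-house label to the shared scale.
VENDOR_CROSSWALK = {
    ("vendor_a", "high risk"): 3,
    ("vendor_b", "cbrn-3"): 3,
    ("vendor_c", "tier-2"): 2,
}

def to_common_tier(vendor: str, label: str) -> str:
    """Translate a vendor-specific risk label into the shared vocabulary."""
    tier = VENDOR_CROSSWALK.get((vendor.lower(), label.lower()))
    if tier is None:
        raise ValueError(f"No mapping for {vendor!r} label {label!r}")
    return COMMON_SCALE[tier]

if __name__ == "__main__":
    print(to_common_tier("vendor_a", "high risk"))  # -> "high"
    print(to_common_tier("vendor_b", "CBRN-3"))     # -> "high"
```

With a crosswalk like this published and maintained by a neutral body, a regulator could read any company's report and immediately place it on the same scale as every other company's.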

Fostering accountability

When all companies agree on thresholds for classifying a capability as genuinely dangerous—such as the ability to autonomously plan and execute cyberattacks—no single organization can downplay or obscure the seriousness of crossing that line. This mutual accountability means that individual companies do not need to self-police in isolation. Rather than each company inventing its own risk thresholds, the entire industry can respond to a commonly understood benchmark.

Building public trust

We have seen in other sectors, such as pharmaceuticals and aviation, that unified safety standards can boost confidence and help regulators act decisively when red flags appear. Given how quickly AI tools can move from benign research instruments to potential security threats, this kind of clarity is essential. If all companies committed to disclosing risk evaluations in a standard format, it would become easier for independent experts, watchdog groups, and even average citizens to understand when and why certain mitigation measures should be implemented. The fledgling Frontier Model Forum, an industry body formed by several leading AI developers, signals a step toward this kind of cross-industry effort, but it remains to be seen whether it will yield a truly transparent and robust framework.

A common test suite should be a baseline, not a ceiling. This would guarantee that every frontier model clears the same minimum bar while leaving space—and indeed setting an expectation—for companies to incorporate additional stress tests.

Standardization carries potential downsides. For example, some critics argue that it might stifle inventive testing methods or fail to keep pace with specialized advancements. However, these concerns can be mitigated if frameworks allow for continuous refinement and the addition of new subcategories as breakthroughs occur. Standardization need not be static; it can serve as a common baseline that evolves over time, much like the National Institute of Standards and Technology’s (NIST) AI Risk Management Framework or cybersecurity initiatives led by the Defense Advanced Research Projects Agency (DARPA). By adopting a flexible yet unified approach, companies can continue to innovate while maintaining a common language for emergent threats—ultimately benefiting everyone who wishes to enjoy AI’s promises without its perils.

Recommendations

An evaluation task force

A practical first step is for a U.S. government body—such as NIST, home to the U.S. AI Safety Institute, or the Cybersecurity and Infrastructure Security Agency (CISA)—to convene a public-private task force dedicated to AI risk evaluation. Its purpose would be to bring together frontier AI developers, relevant government agencies, and independent experts in cybersecurity, national security, and emerging technologies. The idea is simple: no single organization can see the full picture of how AI capabilities intersect with real-world threats, and a cross-sector body can share insights more effectively.

Analogous public-private alliances already work well in other high-stakes domains—for instance, the Commercial Aviation Safety Team (CAST) in aviation and the Clinical Trials Transformation Initiative (CTTI) in drug development. In these forums, regulators and industry jointly spot hazards early and fix them before they escalate. For AI, an aligned, high-trust venue for information exchange and rapid decision-making could reduce blind spots while promoting best practices across all major companies.

Consistent benchmarks for dangerous capabilities

The second priority is to define consistent, industry-wide benchmarks for dangerous capabilities. Currently, each AI company uses its own internal standards, which hinders cross-comparison and opens the door to confusion. A neutral government steward—most plausibly the U.S. AI Safety Institute at NIST, in concert with CISA—could publish a living suite of baseline tests that every frontier model must pass before any public or internal deployment.

The first version could focus on three threat domains: automated cyber offense, biothreat facilitation, and “agentic takeover” or loss of control. To maintain the integrity of the evaluation, companies would be required to provide accredited red-teamers with a no-mitigation evaluation endpoint—an air-gapped sandbox where system prompts and safety filters are stripped—so that latent capabilities cannot be masked by surface-level safeguards. Any model that crosses a danger threshold (for example, successfully chaining zero-day exploits or producing a step-by-step pathogen protocol) would trigger the same response protocol, even if the capability is detected internally before release: an immediate pause in development, a 72-hour incident report, and escalation to the relevant governance and oversight bodies for decisions about the model.
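
As a minimal sketch of what such a living baseline suite might look like in machine-readable form, the snippet below defines three illustrative domains with danger thresholds and checks a model's scores against them. The domain names, threshold values, and scores are assumptions for illustration, not an actual test suite or real evaluation results.

```python
# A minimal sketch of a shared baseline suite, assuming results are reported as
# scores in [0, 1]. Domains, thresholds, and scores are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Benchmark:
    domain: str               # e.g., "automated_cyber_offense"
    danger_threshold: float   # score at or above which the response protocol triggers

BASELINE_SUITE = [
    Benchmark("automated_cyber_offense", danger_threshold=0.30),
    Benchmark("biothreat_facilitation", danger_threshold=0.20),
    Benchmark("loss_of_control", danger_threshold=0.10),
]

def crossed_thresholds(scores: dict[str, float]) -> list[str]:
    """Return the domains in which a model meets or exceeds the danger threshold."""
    return [
        b.domain
        for b in BASELINE_SUITE
        if scores.get(b.domain, 0.0) >= b.danger_threshold
    ]

# Hypothetical scores from a no-mitigation evaluation run.
scores = {"automated_cyber_offense": 0.42, "biothreat_facilitation": 0.05, "loss_of_control": 0.02}
print(crossed_thresholds(scores))  # -> ["automated_cyber_offense"]
```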

As a practical matter, a shared baseline would also make it far easier for decision makers to compare reported threats across companies.

Link thresholds to mitigation actions

Next, it is essential to link clear thresholds to enforceable mitigation actions. If a model is found to surpass a specific “danger threshold”—for example, the point at which it can autonomously craft malware or facilitate bioweapon design—there should be predefined consequences. This might include a temporary halt on deployment, followed by an independent audit to identify root causes and validate additional mitigations. These protocols would mirror the safety “tripwires” found in other high-risk domains—for instance, certain nuclear facilities employ automated shutdown systems when radiation levels exceed allowable limits.

Rather than leaving crucial decisions to be made ad hoc in last-minute debates, companies would be bound by an agreed-upon playbook that activates whenever a system’s demonstrated capabilities cross dangerous lines.
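
One way to picture that playbook is as an explicit, pre-agreed mapping from each tripwire to an ordered list of mitigation steps, so the response is fixed before a crisis rather than negotiated during one. The sketch below is purely illustrative; the tripwire names and actions are hypothetical stand-ins, not any company's actual protocol.

```python
# A minimal sketch of a tripwire playbook: each crossed threshold maps to a
# predefined, ordered list of mitigation actions. All names are hypothetical.
PLAYBOOK = {
    "automated_cyber_offense": [
        "pause broader deployment",
        "file 72-hour incident report",
        "commission independent audit",
    ],
    "biothreat_facilitation": [
        "restrict model access to vetted users",
        "file 72-hour incident report",
        "commission independent audit",
    ],
}

def respond(crossed: list[str]) -> list[str]:
    """Collect the predefined actions for every tripwire that was crossed."""
    actions: list[str] = []
    for domain in crossed:
        actions.extend(PLAYBOOK.get(domain, ["escalate to oversight body"]))
    return actions

print(respond(["automated_cyber_offense"]))
```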

Capability transparency reports

Finally, “capability transparency” reports—akin to today’s model cards—should become a cornerstone of AI development and deployment. If each major company regularly published a standardized assessment of how its models measure up against agreed-upon benchmarks, policymakers and outside observers could better identify trends in emerging threats and respond before it is too late.

This kind of transparency has proven effective in other industries, such as finance, where periodic disclosures help prevent systemic risk by ensuring that regulators and the public are not left in the dark. The same logic applies to AI: consistent, comparative snapshots of what models can and cannot do would reduce the risk of sudden, unforeseen leaps in capability and promote responsible, well-informed innovation.
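
If capability transparency reports followed a standard, machine-readable format, regulators and outside researchers could aggregate them across companies and model releases. The sketch below shows what a single report entry might look like under one assumed schema; the field names and values are illustrative placeholders, not an established standard.

```python
# A minimal sketch of a standardized capability transparency report,
# serialized as JSON. Field names and values are illustrative assumptions.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class CapabilityReport:
    developer: str
    model_id: str
    evaluation_date: str                  # ISO 8601 date
    benchmark_suite_version: str          # version of the shared baseline suite
    scores: dict[str, float] = field(default_factory=dict)   # domain -> score in [0, 1]
    thresholds_crossed: list[str] = field(default_factory=list)
    mitigations_applied: list[str] = field(default_factory=list)

report = CapabilityReport(
    developer="example-lab",
    model_id="frontier-model-x",
    evaluation_date="2025-06-01",
    benchmark_suite_version="baseline-v0.1",
    scores={"automated_cyber_offense": 0.12, "biothreat_facilitation": 0.04},
    thresholds_crossed=[],
    mitigations_applied=["standard deployment safeguards"],
)

print(json.dumps(asdict(report), indent=2))
```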

Authors
Joseph Kwon
Technical Policy Analyst, Center for AI Policy