As frontier AI systems advance toward superhuman capabilities, their developers increasingly acknowledge the potentially catastrophic consequences of failure. The three leading labs – OpenAI, Anthropic, and Google DeepMind – are planning to mitigate these risks, but their policies diverge in significant ways.
Common principles across the labs
Despite differences in design and execution, all three labs converge around a few core ideas:
- Structured capability thresholds: They evaluate models based on categories of dangerous capabilities, such as cyber offence, autonomy, and CBRN (chemical, biological, radiological, nuclear) risk.
- Tiered safety levels: They assign models to risk categories based on exhibited or forecasted capabilities, which then trigger escalating mitigation requirements.
- Commitment to regular evaluations: Each lab pledges to assess model capabilities as development progresses, especially following significant increases in compute or fine-tuning.
- Security, deployment, and development mitigations: All frameworks describe actions to limit risks, including halting model development or restricting access.
But these surface-level similarities in AI safety obscure profound differences in execution, specificity, and credibility of enforcement.
Anthropic: the most defined – but still incomplete – commitment to pause
Anthropic’s Responsible Scaling Policy (RSP) stands out for making a formal public commitment to not train or deploy models that exceed a defined risk threshold without implementing safety mitigations.
Strengths:
- Predefined pause commitment: RSP is unique in its promise to halt development or deployment if mitigation measures can’t be applied in time.
- Clear compute-based triggers: Evaluations are run at every 4x increase in effective compute or every 6 months of fine-tuning, whichever comes first.
- Substantial security investment: ASL-3 models (Anthropic’s third tier of risk) require 5–10% of employee labour to be dedicated to security.
Weaknesses:
- Vague evaluation criteria: While Anthropic commits to testing both unmitigated and fine-tuned models, it gives no specifics on which evaluations will actually be run.
- ASL-4 undefined: The next tier of risk – potentially catastrophic – is still not concretely specified.
- Crisis override clause: A footnote allows Anthropic to abandon safeguards if it believes a competitor poses a more imminent threat, potentially undermining the framework’s credibility.
Anthropic has the strongest theoretical commitment to pause development at critical thresholds, but gaps in definition and a competitive loophole weaken its enforceability.
OpenAI: the most transparent about uncertainty
OpenAI’s Preparedness Framework offers a tiered system for risk assessment and mitigation and is notable for its candour about the immaturity of AI safety science.
Strengths:
- Unique risk category – ‘persuasion’: OpenAI is the only lab to track models’ ability to manipulate human beliefs, acknowledging risks beyond pure technical misuse.
- Evaluation at multiple stages: OpenAI commits to testing both pre- and post-mitigation versions of its models, including those deliberately fine-tuned to maximise risk (e.g., for cyberattacks).
- More frequent evaluation: Triggers are every 2x increase in effective compute or after major algorithmic breakthroughs.
Weaknesses:
- Ambiguous mitigation thresholds: While models scoring “high” risk must not be deployed until downgraded, the actual steps to reduce that score are not prescriptively defined.
- Loosely defined “asset protection”: Security measures are mentioned but not clearly tied to capability levels, unlike Anthropic’s.
OpenAI demonstrates more awareness of the limits of its framework and openly admits that current safety science is underdeveloped. This candour may encourage trust, but it also raises concerns about operational readiness.
DeepMind: cautious refinement, but light on commitments
DeepMind’s Frontier Safety Framework (FSF) has recently evolved, especially with its 2025 update introducing new dimensions like deceptive alignment. It’s cautious, systems-oriented, and the most academically grounded.
Strengths:
- Most granular security scaling: Five security levels, each tied to model capability thresholds, offer the clearest operationalisation of access restrictions.
- Safety case requirement: Deployment of risky models requires a formal safety case reviewed by a corporate governing body – a potentially robust internal check.
- Industry-first focus on deceptive alignment: The only lab to include a framework for evaluating whether models are learning to conceal their capabilities.
Weaknesses:
- Low specificity on development halts: DeepMind vaguely states it would pause development if thresholds are hit without mitigations but provides no concrete conditions.
- No unique commitments to transparency: Compared to OpenAI’s model evaluation disclosures, DeepMind lags in public clarity.
DeepMind offers the most academically rigorous framework, particularly in addressing novel risks like deception. But its lack of hard deployment or development stopping rules could be a liability if those internal processes are not well-governed.
Side-by-side comparison
| Feature | OpenAI | Anthropic | DeepMind |
|---|---|---|---|
| Public pause commitment | Partial (deploy only under medium risk) | Yes (train/deploy halts if mitigations fail) | Vague |
| Risk categories | CBRN, Cyber, Autonomy, Persuasion | CBRN, Autonomy | CBRN, Cyber, Deceptive Alignment |
| Evaluation frequency | Every 2x compute or algorithmic leap | Every 4x compute or 6 months | Every 6x compute or 3 months |
| Security tiers | Generic measures, not tied to risk levels | Defined for ASL-3, more to come | 5 levels explicitly mapped to risk |
| Deceptive alignment | Not tracked | Not tracked | Actively tracked (Instrumental Reasoning CCLs) |
| Deployment standards | Do not deploy “high” risk models | Robust, layered defence (ASL-3) | Deployment requires a “safety case” |
| Model evaluation transparency | Illustrative examples given | No planned evals shared yet | Links to research, but little deployment detail |
Critical gaps and takeaways
All these frameworks are first-generation efforts – important signals, but incomplete strategies. AI leaders assessing partnerships or deployments with these labs should ask critical questions:
- Can safety frameworks hold under competition? Anthropic’s “get-out-of-policy-free” clause suggests that in a tight race, all bets are off.
- What counts as ‘acceptable’ risk? Even lab insiders admit to double-digit probabilities of catastrophic failure – far higher than in nuclear or aviation contexts.
- Are mitigations actionable or just aspirational? Many safety documents are more akin to plans to make plans – especially lacking in defined thresholds and binding consequences.
- Is there true external accountability? None of the frameworks mandate third-party oversight. Internal reviews dominate. This may suffice for early-stage models – but not for AGI.
Policy and global context
These safety frameworks are not being developed in a vacuum. Recent geopolitical developments are likely to shape the environment in which they evolve:
- Paris AI Action Summit (Feb 2025): 58 countries signed a declaration on inclusive and sustainable AI. Though the US and UK abstained, the Summit marked growing international alignment on AI governance.
- US policy shift under Trump administration: Executive Order 14179 repealed prior regulatory orders and favours market-driven AI development. At the same time, the US AI Safety Institute (under NIST) has been empowered to define voluntary evaluation frameworks – suggesting a bifurcated approach: deregulatory in tone, but still building public AI safety scaffolding.
Towards credible safety governance
For leaders, the message is clear: don’t outsource AI safety to lab frameworks alone. While OpenAI, Anthropic, and DeepMind are taking the problem seriously, their policies remain nascent, fragmented, and heavily populated with caveats.
Leaders should:
- Demand independent evaluation and benchmark standards across models
- Prioritise safety case reviews in vendor procurement
- Align internal AI deployments to worst-case evaluations, not just lab labels
- Engage with national and international AI governance initiatives
Get in touch with EthicAI if we can help with assurance at any stage of your AI lifecycle.