Responsible Scaling: Comparing Government Guidance and Company Policy
As advanced AI systems scale up in capability, companies will need practices to identify, monitor, and mitigate potential risks. "Responsible capability scaling" is the practice of specifying progressively higher risk levels, roughly corresponding to model size or capability, with each level triggering progressively more stringent response measures. We evaluate the original example of a Responsible Scaling Policy (RSP), Anthropic's, against guidance on responsible capability scaling from the UK Department for Science, Innovation and Technology (DSIT).
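To make this tiered structure concrete, here is a minimal illustrative sketch in Python. The level names, capability scores, and required measures below are hypothetical placeholders that loosely echo Anthropic's ASL scheme; they do not reproduce any actual policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SafetyLevel:
    name: str                 # placeholder label, e.g. "ASL-2"
    capability_floor: float   # hypothetical evaluated-capability score that triggers this level
    required_measures: tuple  # measures that must be in place before scaling further

# Hypothetical tiers: each step up in capability entails strictly more stringent measures.
LEVELS = (
    SafetyLevel("ASL-2", 0.0, ("security baseline", "misuse evaluations")),
    SafetyLevel("ASL-3", 0.5, ("security baseline", "misuse evaluations",
                               "hardened security", "deployment safeguards")),
    SafetyLevel("ASL-4", 0.8, ("security baseline", "misuse evaluations",
                               "hardened security", "deployment safeguards",
                               "external audit", "government notification")),
)

def required_level(capability_score: float) -> SafetyLevel:
    """Return the most stringent tier whose capability floor the model meets."""
    applicable = [lvl for lvl in LEVELS if capability_score >= lvl.capability_floor]
    return applicable[-1]  # LEVELS is ordered from least to most capable

print(required_level(0.6).name)              # -> ASL-3
print(required_level(0.6).required_measures)
```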
Our top recommendations based on our critique of Anthropic’s RSP are:
- Anthropic and other AI companies should define verifiable risk thresholds for their AI safety levels (ASLs, or equivalent), informed by tolerances for "societal risk" (SR) in other industries. Such risk thresholds should likely be lower than Anthropic's current thresholds, and should be defined in terms of absolute risk added above a given baseline rather than risk relative to that baseline (see the sketch after this list).
  - The literature we survey suggests that "maximum" SR tolerances for events involving ≥1,000 fatalities (Anthropic's definition of a "catastrophic risk") should fall between 1E-04 and 1E-10 such events per year. "Broadly acceptable" tolerances are generally two orders of magnitude lower, i.e., 1E-06 to 1E-12.
  - We tentatively suggest that Anthropic set its ASL-4 and ASL-3 thresholds in the "maximum" and "broadly acceptable" SR ranges, respectively. We think that Anthropic's current risk thresholds probably exceed those ranges.
- Ultimately, a government body such as UK DSIT or the US National Institute of Standards and Technology (NIST), or an industry body such as the Frontier Model Forum (FMF), should develop standardized operationalizations of risk thresholds for RSPs.
- Anthropic and other companies should specify thresholds for a more granular set of risk types at a given safety level: for example, not a single "misuse" category but separate thresholds for "biological misuse" and "cyber misuse."
- Anthropic and other companies should detail when they will alert government authorities to identified risks. Anthropic's current RSP mentions communication with governments only in one narrow case (its response to a bad actor scaling dangerously fast). We suggest that risks should, at minimum, be communicated to relevant agencies when they reach a given threshold, for example the ASL-3 or ASL-4 thresholds outlined above.
- Anthropic and other companies should commit to external scrutiny of both their evaluation methods (i.e., whether those methods work) and their individual evaluation results at ASL-3 or sooner.
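The threshold arithmetic in the first recommendation can be sketched briefly. This is a minimal illustration with hypothetical baseline and model-attributable risk figures; only the 1E-04 to 1E-10 "maximum" band, the two-orders-of-magnitude offset for "broadly acceptable" tolerances, and the tentative ASL-4/ASL-3 mapping come from the text.

```python
# Societal-risk (SR) tolerance bands from the surveyed literature, for events
# involving >= 1,000 fatalities (Anthropic's definition of "catastrophic risk"):
MAXIMUM_BAND = (1e-10, 1e-4)  # "maximum" tolerances, in events per year
# "Broadly acceptable" tolerances are generally two orders of magnitude lower:
BROADLY_ACCEPTABLE_BAND = (MAXIMUM_BAND[0] / 100, MAXIMUM_BAND[1] / 100)  # 1e-12 to 1e-6

# Hypothetical threshold picks, one from each band, following the text's
# tentative suggestion (ASL-4 in "maximum", ASL-3 in "broadly acceptable"):
ASL4_THRESHOLD = 1e-6  # hypothetical choice from MAXIMUM_BAND
ASL3_THRESHOLD = 1e-8  # hypothetical choice from BROADLY_ACCEPTABLE_BAND

# Hypothetical risk estimates, for illustration only:
baseline = 2e-6    # annual probability of a catastrophic event without the model
with_model = 5e-6  # annual probability with the model deployed

# Absolute formulation (recommended): cap the risk *added* above the baseline.
added_risk = with_model - baseline  # 3e-6 events/year attributable to the model

# Relative formulation (discouraged): a cap stated as a multiple of the
# baseline (here 2.5x) permits more added risk wherever the baseline is high.
risk_ratio = with_model / baseline

def triggered_level(added_risk: float) -> str:
    """Return the safety level whose threshold the added risk crosses.
    Per the recommendation above, risks crossing these thresholds should,
    at minimum, also be communicated to relevant government agencies."""
    if added_risk > ASL4_THRESHOLD:
        return "ASL-4"
    if added_risk > ASL3_THRESHOLD:
        return "ASL-3"
    return "below ASL-3"

print(triggered_level(added_risk))  # -> ASL-4, since 3e-6 > 1e-6
```

Note the design point this illustrates: under the absolute formulation, the same 3E-06 of added risk is flagged regardless of the baseline, whereas a 2.5x relative cap would tolerate far more added risk against an already-high baseline.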