Plans to build AGI with nuclear reactor-like safety lack 'systematic thinking,' say researchers


Aspirations to adapt the principle of 'defense in depth' from nuclear engineering to AI appear to fall short on key requirements.


In a preprint from October 13, two researchers from the Ruhr University Bochum and the University of Bonn in Germany found that while leading AI companies say they will design their most general-purpose AI, often called AGI, according to the most stringent safety principles, adapted from fields like nuclear engineering, the safety techniques those companies actually apply do not satisfy these principles.

In particular, the authors note that existing proposals fail to satisfy the principle known as defense in depth, which calls for the application of multiple, redundant, and independent safety mechanisms. The conventional safety methods that companies are known to apply are not independent; in certain problematic scenarios, which are relatively easy to foresee, they all tend to fail simultaneously.

Many leading AI companies, including Anthropic, Microsoft, and OpenAI, have published safety documents that explicitly mention their intention to implement defense in depth in the design of their most advanced AI systems.

In an interview with Foom, the study's first co-author, Leonard Dung of the Ruhr University Bochum, said it was not surprising that many of the methods for designing AI systems to be safe might fail. Research on making powerful AI systems safe is widely regarded as being at an early stage of maturity.

More surprising to Dung, and also concerning, was that it fell to him and his co-author, academic researchers in philosophy and machine learning, to make what is arguably a foundational contribution to the safety literature of a new branch of industrial engineering.

"There has not been much systematic thinking about what exactly does it mean to take a defense-in-depth approach to safety," said Dung. "The sort of basic way of thinking about risk that you would expect these companies—and policymakers who regulate these companies—to implement has not been implemented." 

A consensus that AI companies must look to more established fields for safety principles has emerged only recently, reflected, for example, in the International AI Safety Report, published in January 2025. The authors of that report describe defense in depth as a "prominent technical approach" that has emerged for managing the risks of advanced AI, motivated by its application in areas such as the nuclear power industry.

Nuclear power has a long and established history of industrial-scale deployment compared to AI technology. Because nuclear reactors generate power from highly radioactive fuel elements, which could poison or kill many people if released unexpectedly, they require extensive safety engineering.

The principal government agency overseeing nuclear reactor safety in the United States, the Nuclear Regulatory Commission (NRC), reviewed the defense-in-depth concept in a 2016 report. The agency described the concept as appearing frequently since 1957 and as lying at the "core" of the NRC's "safety philosophy," while noting that it has not always been discussed or defined consistently.

However, as described in the NRC report, there are at root only a few core requirements for designing a system with the defense-in-depth property. The first of these is redundancy. To take a simple example from automotive engineering, a modern car might be said to have redundant braking mechanisms: if the front brakes fail, the rear brakes should still function, at least in many cases.

The second requirement for defense in depth is that the multiple safety methods should not all fail simultaneously. For example, even though automotive designers might provide two redundant sets of brakes, they must also consider cases, such as icy road conditions, in which both sets fail at the same time. In such cases, the two safety 'layers' are said to have 'correlated' failure modes, which necessitates an additional layer of safety, such as a crash-resistant vehicle chassis.
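
To make the distinction concrete, here is a toy calculation with assumed, purely illustrative failure probabilities; the numbers are not drawn from the preprint or from automotive engineering. The point is only how much a single shared failure mode erodes the benefit of redundancy.

```python
# Illustrative sketch only: the probabilities below are assumptions,
# not figures from the preprint or from automotive practice.

# Assumed chance that each brake set fails on its own.
p_front_fails = 0.01
p_rear_fails = 0.01

# If the two layers fail independently, both fail together only rarely.
p_both_independent = p_front_fails * p_rear_fails  # 0.0001

# If a shared condition such as ice can defeat both layers at once,
# that correlated failure mode dominates the combined failure rate.
p_ice = 0.005  # assumed chance of encountering ice
p_both_correlated = p_ice + (1 - p_ice) * p_both_independent

print(f"independent layers fail together: {p_both_independent:.6f}")
print(f"correlated layers fail together:  {p_both_correlated:.6f}")  # roughly 50x worse
```

In other words, redundancy pays off only when the layers' failure modes are largely independent, which is exactly what this second requirement asks designers to verify.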

By analogy with nuclear reactor designers, AGI designers must work through three considerations for defense-in-depth engineering: identifying the distinct methods for ensuring safety, assessing the various situations in which each method might fail, and, lastly, analyzing which methods must be applied to ensure safety across all such cases.
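
One way to picture this three-step exercise, loosely in the spirit of the preprint's method-by-scenario comparison, is a simple coverage matrix. The method and scenario names below are illustrative placeholders, and the true/false entries are assumptions made for the example, not the authors' verdicts.

```python
# Hypothetical coverage matrix: which safety methods keep working in which
# failure scenarios. All entries are illustrative, not the preprint's findings.
methods = {
    "human_feedback_training":  {"deceptive_alignment": False, "underinvestment": False},
    "interpretability_probes":  {"deceptive_alignment": True,  "underinvestment": False},
    "non_agentic_architecture": {"deceptive_alignment": True,  "underinvestment": True},
}

scenarios = ["deceptive_alignment", "underinvestment"]

for scenario in scenarios:
    surviving = [name for name, holds in methods.items() if holds[scenario]]
    # Defense in depth asks for more than one independent layer per scenario.
    status = "redundant" if len(surviving) >= 2 else "NOT redundant"
    print(f"{scenario}: {len(surviving)} surviving layer(s) -> {status}")
```

The researchers' version of this exercise is conceptual rather than computational, but the underlying question is the same: for each foreseeable failure scenario, is more than one independent safety layer still standing?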

In their preprint, the researchers considered four widely known categories of techniques for designing AI systems safely. In particular, they considered AI alignment techniques, which aim to align the behavior of an AI with the values of its controllers, setting aside the 'sociotechnical' question of whether those values are aligned with society's.

Some of these safety techniques are already widely deployed, according to the authors, such as training language models to generate responses that human annotators are more likely to approve. The authors also considered techniques that are not yet widely developed, such as an AI architecture proposed in February, called Scientist AI, which is explicitly non-agentic and therefore less likely to behave unexpectedly.

They identified seven failure cases for study. These include scenarios that are problematic for non-technical reasons, such as a company's unwillingness to invest in a safety technique that does not contribute comparably to capabilities. They also considered failure scenarios of a technical nature, such as the known possibility that an AI can learn to be deceptive and fake its alignment during testing.

In their study, the authors acknowledge that their selection of scenarios is only preliminary. A problem they faced, also described in the International AI Safety Report, is that it is not clear which failure scenarios are the most impactful or the most likely. By comparison, a widely used framework for evaluating the security of enterprise networks, called MITRE ATT&CK, contains fourteen general categories of failure modes and over 300 individual failure mechanisms for network designers to consider.

Through a conceptual analysis of each safety method in each of the failure cases, the authors found that most of the known safety methods, with the exception of certain techniques like Scientist AI, failed to provide redundant safety across most scenarios.

Although leading AI companies commonly use more than one safety method, none of those companies appears to have presented an analysis showing that its safety methods have uncorrelated failure modes. Google DeepMind has presented perhaps the only high-level analysis of this kind, and only for the subproblem of detecting misuse of its AI systems by outside users.

The authors found that a combination of two known techniques may be the most promising for ensuring safety. The first is successfully interpreting the internal knowledge of the AI, which allows designers to make appropriate safety interventions prior to deployment. The second is getting two copies of an AI to debate each other in front of human evaluators, or human proxies, which enables designers to detect issues like deceptiveness.

Their analysis arguably only confirms the intuition, held by many, that if we can 'read the mind' of an unsafe AI, perhaps analogous to the mind of a psychopathic human, we should be able to change its mind to make it safer. More insightfully, the findings suggest that it is not enough to change the mind of a psychopathic AI to make it safe; designers must also watch one psychopath debate another.

The philosophical reason this combination of techniques works, as identified in the study, is that any deceptiveness on the part of the psychopaths would be exposed, thanks to the unique transparency of the debate process; and once treated, the non-psychopaths stay safe, even if their capabilities are scaled up so far that we can no longer comprehend what they are debating.

An elegant finding, if true, but one that is undoubtedly in need of further study. 


Author's note: No AI was used in writing or editing this post.