We Should Not Allow Powerful AI to Be Trained in Secret: The Case for Increased Public Transparency

Summary:
- Advanced AI systems may reach human-level intelligence within years, with potentially catastrophic consequences if developed irresponsibly.
- Current secretive development by corporations and governments creates risks in alignment, misuse, and dangerous concentrations of power without public oversight.
- The solution requires mandatory disclosure of capabilities, independent safety audits, and whistleblower protections to ensure accountability.
- Proactive measures are needed now to establish governance frameworks before AGI development becomes uncontrollable.
This is a long-form article expanding on an earlier draft by Daniel Kokotajlo, co-written with Sarah Hastings-Woodhouse.
A small handful of tech companies have the stated goal of building Artificial General Intelligence (AGI), a general-purpose AI system competitive with human experts across every domain. Besides having profound and unpredictable implications for the future of human labor, many experts worry that the consequences of developing such a system could be as catastrophic as human extinction.
Daniel worked at OpenAI for two years. Last year, he resigned after losing confidence that the company would handle the development of AGI responsibly. His time there, and events that have occurred since, have deepened his concern that AGI will soon be trained covertly by a private company and a small number of actors within the US government. Not only is taking such immense risks without the informed consent of the public unethical, but a lack of transparency in AI development could make catastrophic outcomes more likely.
These concerns may become pertinent in the near term. This article explains why AGI could be coming soon, why training it in secret is likely to go badly, and how we can change course.
Powerful AI might be coming soon
There’s a good chance AGI could be created within the next five years. Many of those closest to the technology are converging on a similar view. It’s easy to dismiss the idea that we’re on the cusp of such radical transformation simply because it sounds implausible. But the history of AI progress has repeatedly vindicated those predicting rapid improvements that took the world at large by surprise. The predicted year of AGI’s arrival on the forecasting platform Metaculus has dropped by over two decades since early 2022, with similarly dramatic shifts amongst machine learning experts.
The performance gap between humans and AIs is closing fast. In October 2024, OpenAI released o1, which exceeds PhD-level human accuracy on advanced science questions and ranks in the 89th percentile on Codeforces’ competitive programming problems. Three months later, its successor, o3, more than doubled this Codeforces score and leapt from just 2% to 25% on one of the toughest mathematics benchmarks in the world, and from 32% to 87% on the reasoning benchmark ARC-AGI (a breakthrough that the creator of ARC did not expect this decade).
But most significant is the speed with which AIs are improving at the task of AI research itself. Automated researchers could close any remaining gaps between here and human-level intelligence very quickly, and potentially achieve vastly superhuman capabilities shortly thereafter. State-of-the-art AI models already outperform top human engineers at AI R&D tasks over two-hour timescales – while agency and long-term planning are among the industry’s chief priorities for 2025. There are likely few remaining barriers to full automation of AI R&D.
A final (if anecdotal) piece of evidence in favour of imminent AGI is that during Daniel’s time at OpenAI, he learned that a significant proportion of employees agree with this assessment – coy as the company may be in public.
AGI is on track to be developed in secret
You may disagree that AGI is likely coming soon. But in light of accelerating capability advancements and collapsing timelines in and outside of the industry, you can hopefully concede that imminent AGI is at least plausible. What would this look like?
If and when an American AGI lab such as OpenAI, Anthropic or DeepMind develops AGI, or a system capable of massively accelerating AI R&D, its leaders may keep it secret from the public for some number of months, possibly up to a year. This information may even be siloed internally, which would be feasible because automated AI research could accelerate capabilities progress with very little human involvement. Those privy to knowledge of this system may want to prevent that knowledge from spreading due to several fears – of public backlash, of internal whistleblowers, of competitors racing even harder to catch up, and of government regulation that shuts down, nationalizes or otherwise hampers them. This kind of scenario is not without historical precedent. The Manhattan Project worked hard to stay hidden from Congress, in part because its leaders feared Congress would defund the effort if it found out.
The company in question likely wouldn’t keep its breakthrough entirely secret from the US government, however. Disclosing it to the President and a small number of other executive branch officials could benefit the company. For one, controlling how the President learns of a powerful technology preempts the risk of him hearing about it from a concerned whistleblower and bringing down the regulatory hammer. For another, the White House would be a powerful ally in improving the security of the project, both by helping to prevent whistleblowing and by discrediting any whistleblowers who do come forward. Officials in the Roosevelt administration were able to bring seven House and Senate members in on the Manhattan Project, while the rest of Congress remained in the dark.
We are already seeing increasing collaboration between the executive branch and OpenAI in the form of Project Stargate, which, while not a government-funded project, was publicly announced at the White House, alongside commitments by President Trump to accelerate the build-out of datacenters through “emergency declarations”.
Covertly training AGI will likely end in catastrophe
The upshot of AGI being trained in secret will be that a very small number of people – namely a subsection of the US government and a few employees at a private company – would bear responsibility for guaranteeing the safety of this incredibly capable AI system. Ensuring that powerful AI behaves as its developers intend, known as “alignment”, is currently an unsolved research problem; frontier AI companies do not expect that the methods they have used thus far to prevent language models from producing harmful outputs will scale to AIs that are more capable than humans.
AGI will likely be developed before this problem has been solved. As a result, this small group may be tasked with containing a system that is capable of causing catastrophic harm, with unpredictable, unknown goals. They will be the sole actors responsible for making decisions about which concerns to take seriously and which to dismiss as implausible, which solutions to implement and which to deprioritize as too costly (just as a small group of scientists working on the Manhattan Project scrambled to calculate the odds that the first nuclear detonation would ignite the atmosphere, with almost no outside oversight). They will be faced with innumerable thorny and high-stakes dilemmas: What sorts of constraints do we want the AGI to have? Should it faithfully follow every instruction, or sometimes disobey in the interests of humanity? If there is a conflict between the government and the AI company, who should it side with? How do we know whether the AGI is deceiving us, and what do we do if it is?
Successfully mitigating all the threats posed by AGI will be a mammoth task, and under the conditions described above, we will likely fail. There will be nowhere near as much brainpower dedicated to the problem of controlling AGI as there might have been (since most AI safety experts will not be in the know!). There will be few checks and balances in this tiny group, and few people to correct the mistakes of any one actor. Of the many alignment failures that could occur once we develop smarter-than-human systems, at least some are likely to result in the AI attempting to disguise the failure so that its human overseers do not intervene. In the context of hurried AI development, especially development that is largely automated, several failures of this kind could occur. In a world where AGI is broadly more capable than us, and has either been granted or gained access to all sorts of permissions and resources, just one could prove catastrophic.
Even if this group succeeds at getting AGI to reliably follow its instructions, we’d be faced with an unprecedented concentration of power. We can hope that they will benevolently devolve power to others, but we have no reason to be confident of this. They could instruct the AGI system to help them take over the US government (and later the world), or pursue any number of less extreme but still worrying alternatives (for example, perhaps they fear that devolving power would invite a backlash against them, so they ask their AGI system for advice on how to avoid this).
More transparency could help us avoid catastrophe
I (Daniel) was once of the opinion that openness in AGI development was bad for humanity. I believed it would lead to an intense, competitive race between companies and nations, likely to be won by the actor most willing to cut corners on safety. I worried that companies announcing AGI milestones would only cause their rivals to accelerate harder. But as we get closer to the finish line, I’ve changed my mind. The race I feared is essentially happening anyway – and actors will race as hard as they can regardless. I also expect that the Chinese government will find out what is happening even if the American public is kept in the dark (continuing the Manhattan Project analogy, the Roosevelt administration succeeded in keeping it a secret from Congress and the Vice President, but not from the USSR).
I have also moved from thinking about openness as a binary to a spectrum. “Openness” does not have to mean open-sourcing model weights and code to the entire world. I now envision a compromise, in which the public knows what the latest systems are capable of and is able to observe and critique the decisions being made by their developers. The scientific community should also be able to do alignment research on the latest models, without everyone having access to model weights, similar to the researcher access provisions in the Digital Services Act.
There are policies we can put in place now that will increase transparency into AGI development and help reduce the probability of the worst-case scenarios explained above.
First, leaders of AI companies should publicly commit not to train AGI in secret. CEOs should acknowledge that doing so would be unsafe and unethical, and encourage (and protect) their own employees to blow the whistle if the company reneges on this promise. This commitment may need to be organised through government summits or industry groups.
Second, labs should develop policies, ideally enforced by government regulation, that include public reporting requirements. Adherence to these requirements should make it impossible to train AGI in secret. For example, companies could commit to informing the public once they reach certain scores on particular benchmarks, and to regularly publishing safety cases – structured arguments for the safety of particular systems – that explain to the public why a system does not endanger them, and soliciting feedback on these cases. They should encourage their own employees to share criticisms of safety cases on social media or other public forums.
Third, companies should give a predetermined number of external safety researchers pre-deployment access to state-of-the-art models for the purpose of alignment research, to ensure that as much safety expertise as possible is directed at the challenge of aligning AGI.
These policies diverge significantly from what companies will be incentivized to do by default. They also have costs, such as making it easier for adversaries to learn about AI breakthroughs earlier than they might have otherwise. However, the benefits probably outweigh the costs. It seems likely that the Chinese government will learn most of this information anyway through espionage (recall that Stalin began receiving information about the Manhattan Project from Soviet spies in 1941, before it formally began). Much of the information, such as public safety cases, will not help them accelerate capabilities progress – and may even help to set useful precedents on safe development.
It is possible that the situation is not on track to play out as described above. Perhaps the default trajectory is far less secretive than imagined here. But if there’s a chance this story is right, then it makes sense to put basic transparency requirements in place now that will prevent an extremely small group from developing one of the riskiest technologies in human history without public accountability or oversight.