Years ago my ex-co-blogger Eliezer Yudkowsky and I argued here on this blog about his AI risk fear than an AI so small dumb & weak that few had ever heard of it, might without warning, suddenly “foom”, i.e., innovate very fast, and take over the world after one weekend. I mostly argued that we have a huge literature on economic growth at odds with this. Historically, the vast majority of innovation has been small, incremental, and spread across many industries and locations. Yes, humans mostly displaced other pre-human species, as such species can’t share innovations well. But since then sharing and complementing of innovations has allowed most of the world to gain from even the biggest lumpiest innovations ever seen. Eliezer said a deep conceptual analysis allowed him to see that this time is different.
Since then there’s been a vast increase in folks concerned about AI risk, many focused on scenarios like Yudkowsky’s. (But almost no interest in my critique.) In recent years I’ve heard many say they are now less worried about foom, but have new worries just as serious. Though I’ve found it hard to understand what worries could justify big efforts now, compared to later when we should know far more about powerful AI details. (E.g., worrying about cars, TV, or nukes in the year 1000 would have been way too early.)
Enter Paul Christiano (see also Vox summary). Paul says:
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity. I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. …
If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies … Or I can build good predictive models of Bob’s behavior … These are powerful techniques for achieving any goal that can be easily measured over short time periods. But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. …
It’s already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. … Eventually our society’s trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future. …over time [our] proxies will come apart:
Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft. Investors … instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact. Law enforcement will drive down complaints and increase … a false sense of security, … As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails. … human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. …
Patterns that want to seek and expand their own influence—organisms, corrupt bureaucrats, companies obsessed with growth. … will tend to increase their own influence and so can come to dominate the behavior of large complex systems unless there is competition or a successful effort to suppress them. … a wide variety of goals could lead to influence-seeking behavior, … an influence-seeker would be aggressively gaming whatever standard you applied …
If influence-seeking patterns do appear and become entrenched, it can ultimately lead to a rapid phase transition … where humans totally lose control. … For example, an automated corporation may just take the money and run; a law enforcement system may abruptly start seizing resources and trying to defend itself from attempted decommission. … Eventually we reach the point where we could not recover from a correlated automation failure. (more)
While I told Yudkowsky his fear doesn’t fit with our large literature on economic growth, I’ll tell Christiano his fear doesn’t fit with our large (mostly economic) literature on agency failures (see 1 2 3 4 5).
An agent is someone you pay to assist you. You must always pay to get an agent who consumes real resources. But agents can earn extra “agency rents” when you and other possible agents can’t see everything that they know and do. And even if an agent doesn’t earn more rents, a more difficult agency relation can cause “agency failure”, wherein you get less of what you want from your agent.
Now like any agent, an AI who costs real resources must be paid. And depending on the market and property setup this could let AIs save, accumulate capital, and eventually collectively control most capital. This is an well-known AI concern, that AIs who are more useful than humans might earn more income, and thus become richer and more influential than humans. But this isn’t Christiano’s fear.
It is easy to believe that agent rents and failures generally scale roughly with the overall important and magnitude of activities. That is, when we do twice as much, and get roughly twice as much value out of it, we also lose about twice as much potential via agency failures, relative to a perfect agency relation, and the agents gain about twice as much in agency rents. So it is plausible to think that this also happens with AIs as they become more capable; we get more but then so do they, and more potential is lost.
Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as as result. And not just a bit worse off; we apparently get apocalypse level worse off! This sort of agency apocalypse is not only a far larger problem than we’d expect via simple scaling, it is also not supported anywhere I know of in the large academic literature on agency problems.
This literature has found many factors that influence the difficulty of agency relations. Agency tends to be harder when more relevant agent info and actions are hidden both to principals and other agents, when info about outcomes get noisier, when there is more noise in the mapping between effort and outcomes, when agents and principals are more impatient and risk averse, when agents are more unique, when principals can threaten more extreme outcomes, and when agents can more easily coordinate.
But this literature has not found that smarter agents are more problematic, all else equal. In fact, the economics literature that models agency problems typically assumes perfectly rational and thus infinitely smart agents, who reason exactly correctly in every possible situation. This typically results in limited and modest agency rents and failures.
For concreteness, imagine a twelve year old rich kid, perhaps a king or queen, seeking agents to help manage their wealth or kingdom. It is far from obvious that this child is on average worse off when they choose a smarter more capable agent, or when the overall pool of agents from which they can choose becomes smarter and more capable. And its even less obvious that the kid becomes maximally worse off as their agents get maximally smart and capable. In fact, I suspect the opposite.
Of course it remains possible that there is something special about the human-AI agency relation that can justify Christiano’s claims. But surely the burden of “proof” (really argument) should lie on those say this case is radically different from most found in our large and robust agency literatures. (Google Scholar lists 234K papers with keyword “principal-agent”.)
And even if AI agency problems turn out to be unusual severe, that still doesn’t justify trying to solve them so far in advance of knowing about their details.
> And even if AI agency problems turn out to be unusual severe, that still doesn’t justify trying to solve them so far in advance of knowing about their details.
I agree that we need to look to empirical evidence of AI safety problems. To this end, we have started investigating how GPT models actually behave when faced with principal-agent conflict.
S. Phelps and R. Ranson. Of Models and Tin-Men - A Behavioral Economics Study of Principal-Agent Problems in AI Alignment Using Large-Language Models, July 2023, arXiv:2307.11137.
https://arxiv.org/abs/2307.11137
Abstract:
AI Alignment is often presented as an interaction between a single designer and an artificial agent in which the designer attempts to ensure the agent's behavior is consistent with its purpose, and risks arise solely because of conflicts caused by inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large-language models (LLMs), which are typically pre-trained, we argue this does not capture the essential aspects of AI safety because in the real world there is not a one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. Therefore, there is an economic aspect to AI safety and the principal-agent problem is likely to arise. In a principal-agent problem conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.
This all sounds to me like generic excuses. "Sure there's a literature but it can't have explored all possible combinations of assumptions, so couldn't the problem be with one of those unexplored combinations?" Which seems a pretty obvious attempt to ignore a literature.