> And even if AI agency problems turn out to be unusually severe, that still doesn’t justify trying to solve them so far in advance of knowing about their details.

I agree that we need to look to empirical evidence of AI safety problems. To this end, we have started investigating how GPT models actually behave when faced with principal-agent conflict.

S. Phelps and R. Ranson. Of Models and Tin-Men - A Behavioral Economics Study of Principal-Agent Problems in AI Alignment Using Large-Language Models, July 2023, arXiv:2307.11137.



AI Alignment is often presented as an interaction between a single designer and an artificial agent in which the designer attempts to ensure the agent's behavior is consistent with its purpose, and risks arise solely because of conflicts caused by inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. With the advent of agents instantiated with large-language models (LLMs), which are typically pre-trained, we argue this does not capture the essential aspects of AI safety because in the real world there is not a one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. Therefore, there is an economic aspect to AI safety and the principal-agent problem is likely to arise. In a principal-agent problem conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. We argue the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained AI models in real-world situations. Taking an empirical approach to AI safety, we investigate how GPT models respond in principal-agent conflicts. We find that agents based on both GPT-3.5 and GPT-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. Surprisingly, the earlier GPT-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later GPT-4 model is more rigid in adhering to its prior alignment. Our results highlight the importance of incorporating principles from economics into the alignment process.

This all sounds to me like generic excuses. "Sure there's a literature but it can't have explored all possible combinations of assumptions, so couldn't the problem be with one of those unexplored combinations?" Which seems a pretty obvious attempt to ignore a literature.

There's plenty in this literature that doesn't assume enforceability of contracts, and that accepts the relevance of norms.

There might be an important difference between /assuming/ and explicitly /modeling/ (e.g.) institutions, norms, and laws. Specifically, it could be that the literature does not /model/ these things, in the sense that it presents very parsimonious models that don't contain any variables intended to refer to laws etc. Yet the assumptions that justify our belief that the model accurately represents the real world, and so that its results are predictive of what we will actually see, may still require certain properties having to do with laws etc. to hold. My guess is Rohin's concern was that the literature "assumes" laws etc. in this latter sense.

It's not a question of a few counterexamples; it's a question of the relevance of the model. And the standard literature you are pointing to makes exactly the assumptions that Stuart noted: enforceability of contracts, etc.

Not only that, but (even though the economic literature mostly elides the point) the literature in sociology and political science that you note is very clear about the role of norms and institutions in the relevance of the model.

The standard literature on agency problems is used not just in economics, but also in law, business, sociology, and polisci. Yes of course there have been many agency failures in history, including with smart agents. But a few such examples hardly constitute evidence that agency problems are on average larger when agents are smarter.

I've argued here: https://www.lesswrong.com/p... that agents betraying their principals happens in politics all the time, sometimes with disastrous results. By restricting to the economic literature on this problem, we're only looking at a small subset of "agency problems", and implicitly assuming that institutions are sufficiently strong to detect and deter bad behaviour of very powerful AI agents - which is not at all evident.

As I said in response to Paul, "We do have some models of boundedly rational principals with perfectly rational agents, and those models don’t display huge added agency rents." And no, this literature usually does not model norms and laws.

I have a bunch of complicated thoughts on this post, many of which were said in Paul's comment reply, but I'll say a few things.

Firstly, I think that if you want to view the AI alignment problem in the context of the principal-agent literature, the natural way to think about it is with the principal being less rational than the agent. I claim that it is at least conceivable that an AI system could make humans worse off, but the standard principal-agent model cannot accommodate such a scenario because it assumes the principal is rational, which means the principal always does at least as well as not ceding any control to the agent at all.
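As a toy illustration of that last step (a minimal sketch, not a model from the literature): assume the principal has an outside option of acting alone and delegates only when delegation looks at least as good. All names and payoff numbers below are hypothetical. A principal who perceives payoffs correctly can never end up below the outside option, while one who misjudges the payoff from delegation can.

```python
def principal_choice(perceived_delegate: float, outside_option: float) -> str:
    """Delegate only if the principal believes it pays at least as much as acting alone."""
    return "delegate" if perceived_delegate >= outside_option else "self"


def realized_payoff(true_delegate: float, perceived_delegate: float,
                    outside_option: float) -> float:
    """Payoff the principal actually receives, given its (possibly wrong) beliefs."""
    choice = principal_choice(perceived_delegate, outside_option)
    return true_delegate if choice == "delegate" else outside_option


# Hypothetical numbers: delegating truly yields less than acting alone.
OUTSIDE = 10.0
TRUE_DELEGATE = 6.0

# Rational principal: beliefs match the truth, so it declines to delegate.
rational = realized_payoff(TRUE_DELEGATE, perceived_delegate=TRUE_DELEGATE,
                           outside_option=OUTSIDE)

# Boundedly rational principal: overestimates delegation and ends up worse off.
bounded = realized_payoff(TRUE_DELEGATE, perceived_delegate=14.0,
                          outside_option=OUTSIDE)

print(rational, bounded)
```

In this sketch the rational principal's payoff is bounded below by the outside option by construction, which is the sense in which the standard model cannot represent an agent making its principal worse off.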

More importantly, although I'm not too familiar with the principal-agent literature, I'm guessing that the literature assumes the presence of norms, laws and institutions that constrain both the principal and the agent, and in such cases it makes sense that the loss that the principal could incur would be bounded -- but it's not obvious that this would hold for sufficiently powerful AI systems.

Re: comparing your described scenario 2 vs scenario 3, it seems to me there's an important additional factor: differing empirical beliefs, not just moral ones. Disagreement on what a universe run by non-human AIs will actually look like - what features the AIs will have, what they'll spend their time doing, etc.

I think this is a more relevant question the broader your idea of moral value is. If moral value to you is very strongly focused on the detailed particularities of humans, and them doing only specific activities, then it's unlikely that AIs will share these features. If on the other hand you place fewer requirements for moral value to occur - perhaps you say moral value is "sentient creatures having experiences they find to be overall good" or something on similar lines - then it becomes much more plausible that AIs might represent that, and so the empirical questions of whether or not they actually do become relevant.

Your essay seems pretty clearly to me to be pointing to agency failures, and not simple changes in relative wealth. The relative wealth fear is a very old one, and you present yourself as talking about a newer worry. You talk about humans getting what they measure versus what they want, and about subtle strategies to persuade and get influence over humans, in the context of machines that on the surface just seem like they are trying to help humans. Those are agency problems, not simple relative wealth problems.

I don’t understand your point about property rights in space. We haven’t bothered to define such rights because we have almost no ability to use them now. Once such rights become substantially useful, we will work harder to define and protect such rights. And I don’t understand how you can say the literature on agency doesn’t help with understanding the consequences of agency failure; that’s kind of the point of that literature isn’t it?

Only particular kinds of info induce agency failures, not all info in general. For example, info about an agent’s ability and effort is problematic. There’s no obvious reason to expect smarter agents to have much more of this kind of info. And in human experience, smarter agents don’t seem to be worse agents. We do have some models of boundedly rational principals with perfectly rational agents, and those models don’t display huge added agency rents. If you want to claim that relative intelligence creates large agency problems, you should offer concrete models that show such an effect. Similarly for your claim that two competitors are both made worse off when the pool of competing agents they can pick from gets smarter and more capable.

>In particular, property rights in space are very weak and so we are likely to lose our influence over the stars if we are greatly outcompeted, even if in absolute terms our output increases. I don't really have strong views on the probability of outright war, I don't think it matters much to the scenario.

Have you seen Prefer Law To Values where Robin Hanson wrote:

>The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights. If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure. In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.

So it seems that Robin is mostly concerned about the difference between war / complete breakdown of property rights and humans having property rights over a small part of the universe that would provide us with a nice "retirement" but not much influence over the universe, whereas you and I are perhaps more concerned about the difference between this second scenario and humans or human values having central influence over the universe. It's not entirely clear to me why, but I guess it's a combination of Robin thinking there's not much we can do to achieve the third outcome, and his values being such that the second outcome is almost as good as the third outcome.

If machines add most of the value, they will probably get to choose what happens. Humans might retain some influence since we hold capital, but I think the extrapolation from contemporary experience doesn't look good.

> But this literature has not found that smarter agents are more problematic, all else equal.

I think most researchers in the area would view "more intelligent" as creating the same agency problems as "has more information" and indeed would typically model differences in cognitive abilities this way (many instances modeled as imperfect information in fact involve information the principal could deduce in principle). Do you think this is a bad way to think about the situation? (Do you disagree with my claim that it's how experts in the area would think about it?)

> But this isn’t Christiano’s fear.

I described the outcome as: "human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory. By the time we spread through the stars our current values are just one of many forces in the world, not even a particularly strong one."

I don't think this differs that much from "AIs save, accumulate capital, and eventually collectively control most capital."

> Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as a result.

Property rights aren't that secure. The literature on agency failures is (somewhat) helpful for understanding how much value we might lose due to these agency failures. It's not particularly helpful for understanding the consequences of that, and I'm not sure what you are pointing to here.

In particular, property rights in space are very weak and so we are likely to lose our influence over the stars if we are greatly outcompeted, even if in absolute terms our output increases. I don't really have strong views on the probability of outright war, I don't think it matters much to the scenario.

> But this literature has not found that smarter agents are more problematic, all else equal. In fact, the economics literature that models agency problems typically assumes perfectly rational and thus infinitely smart agents, who reason exactly correctly in every possible situation.

It also models the principal as infinitely smart. When we adapt such models to the case of bounded principals and agents, we will find agency costs similar to those of asymmetric information. I'm not sure if you are disputing this point---if you are, then I think you'll be on your own, if you aren't then I'm not sure what the relevance is.

> For concreteness, imagine a twelve year old rich kid, perhaps a king or queen, seeking agents to help manage their wealth or kingdom. It is far from obvious that this child is on average worse off when they choose a smarter more capable agent, or when the overall pool of agents from which they can choose becomes smarter and more capable. And it's even less obvious that the kid becomes maximally worse off as their agents get maximally smart and capable. In fact, I suspect the opposite.

No, the kid is made worse off when the king in the neighboring kingdom acquires access to much smarter advisors. This forces the kid to defer more and more to their advisors, and to cede more and more influence, as the competition between neighboring kingdoms (and the internal organization needed to remain competitive) becomes too complex for the kid to understand.

Would you be interested in discussing the specifics of the work that you sampled?

Is there another example, either historical or contemporary, besides AI safety/alignment, of a whole field of researchers doing work too early? If not, what explains people (both researchers and funders) being uniquely impatient about this field?

I have of course not carefully reviewed all work in the area, but I have sampled it, and haven't seen anything that seemed worth doing now, rather than later. As I said before, that sort of evaluation isn't possible if no one is doing anything, so someone should be trying a bit at all times so as to make such evaluations possible.

>But I'm still skeptical that much useful can be done now, relative to later.

Do you say this knowing what people have done in AI safety/alignment recently and what people are proposing to do in the near future? If so, is there any specific past or proposed work you can point to and say "I don't think that was/is worth funding/doing"? Or do you just want to reduce the overall funding and let people in the field figure out what projects to cut? If so, what overall budget seems reasonable to you, either as a fixed number or relative to AI as a whole or relative to some other field?
