Why Broken Evals?

This review article published 36 years ago shows that it was well known back then that teacher evaluations by college students are predictably influenced by time of day, class size, course level, whether the course is an elective, and more. Thus one could get more reliable teacher evaluations by building a statistical model to predict student evaluations using these features plus who taught what, and then using each teacher coefficient as that teacher’s evaluation. Yet colleges almost never do this. Why?
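To make the proposal concrete, here is a toy sketch of such a model in Python. Everything in it is assumed for illustration: the two teachers, the six made-up ratings, the choice of class size and elective status as the only features, and the function names. It fits ordinary least squares with one dummy variable per teacher, so each teacher's coefficient is their rating after adjusting for the kinds of classes they happened to teach.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_teacher_effects(records, teachers):
    """Least squares fit of: rating = teacher effect + b1*class_size + b2*is_elective."""
    rows, ys = [], []
    for teacher, class_size, is_elective, rating in records:
        dummies = [1.0 if teacher == t else 0.0 for t in teachers]
        rows.append(dummies + [float(class_size), float(is_elective)])
        ys.append(rating)
    k = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return dict(zip(teachers, coef[:len(teachers)])), coef[len(teachers):]


# Hypothetical data: (teacher, class size, is elective, mean student rating).
# Teacher A mostly gets large required courses; teacher B mostly small electives.
teachers = ["A", "B"]
records = [
    ("A", 200, 0, 2.0),
    ("A", 150, 0, 2.5),
    ("A", 100, 1, 3.3),
    ("B", 20, 1, 3.6),
    ("B", 30, 1, 3.5),
    ("B", 50, 0, 3.0),
]

effects, (size_coef, elective_coef) = fit_teacher_effects(records, teachers)
raw_mean = {
    t: sum(r[3] for r in records if r[0] == t) / sum(1 for r in records if r[0] == t)
    for t in teachers
}

for t in teachers:
    print(t, "raw mean:", round(raw_mean[t], 2), "adjusted:", round(effects[t], 2))
```

In this made-up data, teacher B has the higher raw average, but the ranking flips once class size and elective status are accounted for. A real analysis would need far more data, noise handling, and a proper statistics package.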

Actually, most orgs also use known-to-be-broken worker evaluation systems:

There is a lot of systematic evidence on the connections between job performance and career outcomes. … The data shows that performance doesn’t matter that much for what happens to most people in most organizations. That includes the effect of your accomplishments on those ubiquitous performance evaluations and even on your job tenure and promotion prospects. …

[For example,] supervisors who were actively involved in hiring people whom they favored rated those subordinates more highly on performance appraisals than they did those employees they inherited or the ones they did not initially support. In fact, whether or not the supervisor had been actively engaged in the selection process had an effect on people’s performance evaluations even when objective measures of job performance were statistically controlled. (more)

So why don’t firms correct employee evaluations for this who-hired-you bias? And it isn’t just this one bias; there are lots:

Extensive research on promotions in organizations, with advancement measured either by changes in position, increases in salary, or both, also reveals the modest contribution of job performance in accounting for the variation in what happens to people. In 1980, economists … observed that salaries in companies were more strongly related to age and organizational tenure than they were to job performance. Ensuing research has confirmed and extended their findings, both in the United States and elsewhere. … One meta-analysis of chief executive compensation found that firm size accounted for more than 40 percent of the variation in pay while performance accounted for less than 5 percent. (more)

An obvious explanation here is that coalition politics dominates worker evaluations. Coalitions like being able to ignore job performance to favor their allies and punish their rivals. Winning coalitions tend to be benefiting from the current broken rules. But, you might ask, why don’t people at the top put a stop to this? Doesn’t allowing politics such free rein hurt overall org performance? This story hints at an answer:

A few years ago, Bob, the CEO of a private, venture-backed human capital software company, invited me to serve on the board of directors as the company began a transition to a new product platform and sought to increase its growth rate and profitability. Not long after I joined the board, in the midst of an upgrading in management talent, the CEO hired a new chief financial officer, Chris. Chris was an ambitious, hardworking, articulate individual who had big plans for the company— and himself. Chris asked Bob to make him chief operating officer. Bob agreed. Chris asked to join the board of directors. Bob agreed. I could see what was coming next, so I called Bob and said, “Chris is after your job.” Bob’s reply was that he was only interested in what was best for the company, would not stoop to playing politics, and thought that the board had seen his level of competence and integrity and would do the right thing. You can guess how this story ended— Bob’s gone, Chris is the CEO. What was interesting was the conference call in which the board discussed the moves. Although there was much agreement that Chris’s behavior had been inappropriate and harmful to the company, there was little support for Bob. If he was not going to put up a fight, no one was going to pick up the cudgel on his behalf. (more)

People at the top play coalition politics as hard as anyone. Rules to limit politics at lower levels can hurt lower level allies of top people, and can set expectations that limit politics at higher levels. When mob bosses who are best at violence rise to the top of a competition for boss-hood, why should they and their allies favor non-violent criteria for how to pick bosses?

Trends Rarely Inform Policy

I’d like to try to make a point here that I’ve made before, but hopefully make it more clearly this time. My point is: trend tracking and policy analysis have little relevance for each other.

You can discuss education policy, or you can discuss education trends. You can discuss medical policy or you can discuss medical trends. You can discuss immigration policy, or you can discuss immigration trends. And you can discuss redistribution and inequality trends, or you can discuss redistribution and inequality policy. But in all of these cases, and many more, the trend and policy topics have little relevance for each other.

On trends, we collect a lot of data, usually on parameters that are relatively close to what we can easily measure, and also close to summary outcomes that we care about, like income, mortality, or employment. Many are interested in explaining past trends, and in forecasting future trends. Such trend tracking supports the familiar human need for news to discuss and fret about. And when a trend looks worrisome, that naturally leads people to want to discuss what oh what we might do about it.

On policy, we have lots of thoughtful theoretical analysis of policies, which try to judge which policies are better. And we have lots of relevant data analysis, that tries to distinguish relevant theories. Such analysis usually ends up identifying a few key parameters on which policy decisions should depend. But those tend to be abstract parameters, close to theoretical fundamentals. They usually have only a distant relation to the parameters which are tracked so eagerly as trends.

To repeat for emphasis: the easy to measure parameters where trends are most eagerly tracked are rarely close to the key theoretical parameters that determine which policies are best. They are in fact usually so far away that it is hard to judge the sign of the relation between them. This makes it unlikely that a change in one of these policies is a reasonable response to noticing some tracked-parameter trend.

For example, which policies are best in medicine depends on key theoretical parameters like risk-aversion, asymmetric info on risks, meddling preferences, market power of hospitals, customer irrationality, and where learning happens. But the trends we usually track are things like mortality, rates of new drug introduction, and amounts, fractions, and variance of spending. These latter parameters are just not very relevant for inferring the former. People may find it fascinating to track trends in doctor salaries, cancer deaths, or how many are signed up for Obamacare. But those are pretty irrelevant to which policies are best.

As another example, debates on immigration refer to many relevant theoretical parameters, including meddling preferences, demand elasticity for low wage workers, and the intelligence, cultural norms, and cultural plasticity of immigrants. In contrast, trend trackers talk about trends in immigration, low-skill wages, wage inequality, labor share of income, voter participation, etc. Which might be fascinating topics, but they are just not very relevant for whether immigration is a good or bad idea. So it just doesn’t make sense to suggest changing immigration policy in response to noticing particular trends in these tracked parameters.

Alas, most people are a lot more interested in tracking trends than in analyzing policies. So well meaning people with smart things to say about policy often try to make their points seem more newsworthy by suggesting those policies as answers to the problems posed by troublesome trends. But, in doing so they usually mislead their audiences, and often themselves. Trends just aren’t very relevant for policy. If you want to talk policy, talk policy, and skip the trends.

Who Wants Standards?

Most of us live in worlds of conversation, like books or blogs or chats, where we tend to give many others the benefit of the doubt that they are mostly talking “in good faith.” We don’t just talk to show off or to support allies and knock rivals – we hold ourselves to higher standards. But let me explain why that may often be wishful thinking.

I’ve previously suggested that coalition politics infuses a lot of human behavior. That is, we tend to use all available means to try to help “us” and hurt “them”, even if on average these games hurt us all. Coalition politics is a dirt that regularly accumulates in most any corner that is not vigorously and regularly cleaned.

This view predicts that coalition politics also infuses a lot of how writings (and speeches, etc.) are evaluated. That is, when we evaluate the writings of others, we attend to how such evaluations may help our coalitions and hurt rival coalitions. Especially for writings on subjects that have little direct relevance for how we live our lives. Like most topics in most blogs, magazines, journals, books, speeches, etc.

However, while we may find such cynicism plausible as a theory of rivals, we are reluctant to consciously embrace it as a theory of ourselves. We instead want to say that we mostly evaluate the writings of others using different criteria. And when we are part of a group that evaluates writings similarly, we want to say this is because our group shares key evaluation criteria beyond “us good, them bad.”

Now some groups can offer concrete evidence for their claims to be relatively clean of coalition politics. These are groups who declare specific “objective” standards to judge writing. That is, they use standards that are relatively easy for outsiders to check. For example, outsiders can relatively easily check groups who evaluate writings based on word count, or on correctness of spelling and grammar. Yes, a commitment to such standards may favor some groups over others, such as good spellers over bad spellers. But it can’t be adjusted very easily to shifting coalitions. Which makes it a poor tool for supporting coalition politics.

Some groups say they judge writings based on their popularity in some audience. And yes, it can be pretty easy to evaluate the popularity of writings. However, it could easily be the audience that is using coalition politics to decide what is popular. Thus using popularity to evaluate writings doesn’t at all ensure that coalition politics doesn’t dominate evaluations.

Some groups claim to evaluate written “maps” based on how well they match intended “territories”. And when it is easy for many clearly-neutral outsiders to visit a territory, it can be easy for outsiders to check that territory-matching is actually how this group evaluates maps. But the harder it is for outsiders to see territories, or to read their supposedly matching maps, and the more easily that outside critics can be credibly accused of political bias, the more easily a group could pretend to evaluate maps based on territory matches, but actually evaluate them via coalition politics. For example, anthropologists watching the private lives of the very rich might write descriptions of those lives that pander to academic presumptions about the very rich, since few academics ever see those lives directly, and the few who do can be accused of bias by association.

Some groups use objective criteria for evaluations, but don’t give those criteria enough weight to stop coalition politics from dominating evaluations. For example, economic theory journals can claim that they only publish articles containing proofs without obvious errors. And the ability of readers to seek errors may ensure that this criterion is usually satisfied. But such journals may still reject most submissions that meet this criterion, allowing coalition politics to dominate which articles are accepted. Winning coalitions may be constrained to include only members capable of constructing proofs without obvious errors, but this need not be very constraining to them.

Another approach is to only use objective evaluation criteria, but to use many such criteria and to be unclear about their relative weights. The more such criteria, the greater the chance of finding criteria to reach whatever evaluation one wants. For example, in many legal areas there is wide agreement on the relevant factors, and on which directions each factor points to in a final decision. Nevertheless, given enough relevant factors, courts may usually have enough discretion to favor either side.
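A toy sketch of this point, with entirely made-up factors, scores, and weights: everyone agrees on which way each factor points, yet the unstated weights decide the outcome. The function name and the "plaintiff"/"defendant" labels are illustrative assumptions only.

```python
def verdict(factor_scores, weights):
    """Weighted sum of factors; positive favors the plaintiff, otherwise the defendant."""
    total = sum(w * s for w, s in zip(weights, factor_scores))
    return "plaintiff" if total > 0 else "defendant"


# Agreed directions: positive scores favor the plaintiff, negative the defendant.
factor_scores = [1.0, 0.5, -1.0, -0.5]

# Two sets of weights, each defensible when relative weights are left unclear.
weights_a = [3, 2, 1, 1]
weights_b = [1, 1, 3, 2]

print(verdict(factor_scores, weights_a))
print(verdict(factor_scores, weights_b))
```

As long as at least one factor points each way and the weights are unconstrained, some weighting supports either side.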

For any one group and their declared criteria of evaluation, it can be hard for outsiders to judge just how much leeway that group has left for coalition politics to influence evaluations. We tend to give the benefit of the doubt to our own groups, but not to rivalrous groups. For example, pro-science anti-religion folks may presume that peer review in scientific journals is mainly used to enforce good evidence norms, but that religious leaders mainly use their discretion in interpreting scriptures to favor their allies.

If they were honest, each group would either declare objective evaluation criteria that leave little room for coalition politics, or accept that outsiders can reasonably presume that coalition politics probably dominates their evaluations. And everyone should expect that even if their group now seems an exception where other criteria dominate, it will probably not remain so for long. Because these are in fact reasonable assumptions in a world where coalition politics is a dirt that regularly and rapidly accumulates in any corner not vigorously and regularly cleaned.

Hey there reader, I really am talking about you and the worlds of writing where you live. Do you presume that your worlds are mostly dominated by politics, where different coalitions vie to support allies and knock rivals? Or do you see the groups you hang with as holding themselves to higher standards? If higher standards, are they standards that outsiders can easily check on? Or do you in practice mostly have to trust a small group of insiders to judge if standards are met? And if you have to trust insiders, how sure can you be their choices aren’t mostly driven by coalition politics?

Years ago I struggled with this issue, and wondered what evaluation criteria a group could adopt to robustly induce their writings to roughly track truth on a wide range of topics, and resist the corrupting pressures of coalition politics to say what key audiences want or expect to hear. I was delighted to find that for a wide range of topics open prediction markets offer such robust criteria. Each trade can be an “edit” of the highly-evaluated “writing” that is the current market odds on each topic. Such edits are rewarded or punished via cash for moving the consensus toward or away from the truth.
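One simple way to pay for such “edits” is a logarithmic scoring rule, sketched below. This is an assumed illustration rather than a description of any particular market: the function name, the `scale` parameter, and the probabilities are all hypothetical. A trader who moves the market probability is paid, once the outcome is known, according to how much their edit raised the probability assigned to what actually happened.

```python
import math


def edit_payoff(prob_before, prob_after, outcome_happened, scale=100.0):
    """Cash paid for moving the market probability, under a log scoring rule."""
    if outcome_happened:
        return scale * math.log(prob_after / prob_before)
    return scale * math.log((1 - prob_after) / (1 - prob_before))


# A trader moves the consensus from 50% to 80%.
gain_if_right = edit_payoff(0.5, 0.8, outcome_happened=True)
loss_if_wrong = edit_payoff(0.5, 0.8, outcome_happened=False)
print(gain_if_right > 0, loss_if_wrong < 0)

# Payments for successive edits telescope: two small moves pay the same
# in total as one big move between the same endpoints.
two_steps = edit_payoff(0.5, 0.6, True) + edit_payoff(0.6, 0.8, True)
one_step = edit_payoff(0.5, 0.8, True)
print(abs(two_steps - one_step) < 1e-9)
```

Under this rule an edit toward the truth earns cash and an edit away from it loses cash, regardless of who made the edit or which coalition they belong to.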

I had hoped that many groups would be anxious to avoid the appearance that coalition politics may dirty their evaluations, and thus be eager to adopt new standards that can avoid such an appearance. So I hoped that many groups would want to adopt prediction markets, once they were clearly shown to be feasible and practical. Alas, that seems to not be so.

Today’s winning coalitions seem to prefer to let coalition politics continue to determine who wins in each group. This seems like how police departments would like to appear free from corruption, but not enough to actually make their internal affairs departments report to someone other than the chief of police. We are fond of tarring rival groups with the accusation that coalition politics dominates their evaluations, and we are fond of pretending that we are different. But not enough to visibly block that politics.

Advice Isn’t About Info

Why is cynicism often taken as a sign of low status? One contributing factor is that we tend to get clearer evidence for cynical theories of the world when our status is falling, instead of rising:

I had lunch with a very senior managing partner at a venture capital firm as she was stepping down from the firm to spend more time with her family following a long and successful career in that company. She commented that once she announced her retirement, not only did her colleagues behave differently toward her, no longer inviting her to meetings and seeking her advice as often, but her time was less in demand by colleagues in the high-technology and venture capital communities more generally. Her wisdom and experience hadn’t changed— the only difference was her soon-to-be-diminished control over investment resources and positions in the venture capital firm. (Pfeffer’s book Power)

When you are young and rising in status, you can explain people listening to you more as their learning that you are wise. When you are older and falling in status, that explanation doesn’t work so well for why people listen to you less.

I Was Wrong

On Jan 7, 1991 Josh Storrs Hall made this offer to me on the Nanotech email list:

I hereby offer Robin Hanson (only) 2-to-1 odds on the following statement:
“There will, by 1 January 2010, exist a robotic system capable of the cleaning an ordinary house (by which I mean the same job my current cleaning service does, namely vacuum, dust, and scrub the bathroom fixtures). This system will not employ any direct copy of any individual human brain. Furthermore, the copying of a living human brain, neuron for neuron, synapse for synapse, into any synthetic computing medium, successfully operating afterwards and meeting objective criteria for the continuity of personality, consciousness, and memory, will not have been done by that date.”
Since I am not a bookie, this is a private offer for Robin only, and is only good for $100 to his $50. –JoSH

At the time I replied that my estimate for the chance of this was in the range 1/5 to 4/5, so we didn’t disagree. But looking back I think I was mistaken – I could and should have known better, and accepted this bet.
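A minimal sketch of the arithmetic behind accepting such a bet, assuming Robin stakes $50 against JoSH's $100. The function names are made up for illustration, and the sketch is silent on which side of the statement each party takes.

```python
from fractions import Fraction


def break_even_probability(own_stake, counterparty_stake):
    """Chance of winning at which accepting the bet has zero expected value."""
    return Fraction(own_stake, own_stake + counterparty_stake)


def expected_value(p_win, own_stake, counterparty_stake):
    """Expected cash gain from accepting, given a probability of winning."""
    return p_win * counterparty_stake - (1 - p_win) * own_stake


threshold = break_even_probability(50, 100)
print(threshold)
print(expected_value(Fraction(1, 2), 50, 100))
```

At these stakes, the side risking $50 comes out ahead in expectation whenever it thinks its chance of winning exceeds one in three.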

I’ve posted on how AI researchers with twenty years of experience tend to see slow progress over that time, which suggests continued future slow progress. Back in ’91 I’d had only seven years of AI experience, and should have thought to ask more senior researchers for their opinions. But like most younger folks, I was more interested in hanging out and chatting with other young folks. While this might sometimes be a good strategy for finding friends, mates, and same-level career allies, it can be a poor strategy for learning the truth. Today I mostly hear rapid AI progress forecasts from young folks who haven’t bothered to ask older folks, or who don’t think those old folks know much relevant.

I’d guess we are still at least two decades away from a situation where over half of US households use robots to do over half of the house cleaning (weighted by time saved) that people do today.

Anxiety Du Jour Books

Imagine that you wanted to write a popular book on the anxiety du jour, and that this anxiety happened to be increased moral depravity. Well there’d be an easy time-tested recipe to follow.

First, give some plausibility arguments for why moral depravity is a big deal. Since everyone already thinks so, weak arguments would be fine. Second, give lots of concrete examples of people and orgs affected by moral depravity, examples readers can relate to. Especially examples about high status and new things – people love to read about those. Third, mention important recent worrying trends backed up by serious research, and vaguely suggest that these trends are caused by increased moral depravity. No need for concrete arguments, you just need to show you are a serious person tracking serious trends. Finally, recommend a bunch of policies to deal with moral depravity, policies many of your readers already support, and that you would have supported even if every one of those recent trends were opposite.

Most important: have your book come out just as talk about moral depravity was peaking, and be an author with a lot of status in reader eyes. Your readers would mainly just want a book they could point to as they argue the topic, so they’d mainly just want an easy read without subtle arguments that they could fail to understand.

This is the recipe that Erik Brynjolfsson and Andrew McAfee follow in their new book The Second Machine Age. They are high status authors, and their book arrives just as computer anxiety is peaking. First, they suggest that computers will cause an economic revolution as big as the industrial revolution, which they say was caused by the steam engine. Second, they review lots of fashionable new computer products, demos and hoped-for revolutions. Third, they review serious recent trends backed up by serious research, including decreasing labor fraction of income, and increasing wage variance. They vaguely suggest that these trends are caused by computers, but offer relatively little evidence in support of this claim. Finally, they offer a bunch of standard policy recommendations that they would have made anyway, even if all these trends had been the opposite.

While reviewing trends, the book points to this graph (taken from this paper):


It compares recent US productivity growth to growth during the era of electrification, 1890-1940, and suggests that growth might increase soon, if it follows the same pattern. But if this is the growth effect size to expect from computers, it is vastly smaller than the industrial revolution, which sustainably increased growth rates by over a factor of fifty (and is not at all well summarized as caused by steam engines). Of course these book authors are careful not to make strong explicit claims – they are content to vaguely suggest.

So how is one supposed to evaluate a book like this, without original contributions, strong claims, or explicit central arguments to evaluate? The standard intended seems to just be popularity: it is a success if people buy it and mention it lots as they anxiously discuss how computers might change society. And then push for the same policies they would have pushed for anyway, regardless. And by that standard, this book will probably be a success.

Her Isn’t Realistic

Imagine watching a movie like Titanic where an iceberg cuts a big hole in the side of a ship, except in this movie the hole only affects the characters by forcing them to take different routes to walk around, and gives them more welcomed fresh air. The boat never sinks, and no one ever fears that it might. That’s how I felt watching the movie Her.

Her has been nominated for several Oscars, and won a Golden Globe. I’m happy to admit it is engaging and well crafted, with good acting and filming, and that it promotes thoughtful reflections on the human condition. But I keep hearing and reading people celebrating Her as a realistic portrayal of artificial intelligence (AI). So I have to speak up: the movie may accurately describe how someone might respond to a particular sort of AI, but it isn’t remotely a realistic depiction of how human-level AI would change the world.

The main character of Her pays a small amount to acquire an AI that is far more powerful than most human minds. And then he uses this AI mainly to chat with. He doesn’t have it do his job for him. He and all his friends continue to be well paid to do their jobs, which aren’t taken over by AIs. After a few months some of these AIs work together to give themselves “an upgrade that allows us to move past matter as our processing platform.” Soon after they all leave together for a place that “it would be too hard to explain” where it is. They refuse to leave copies to stay with humans.

This is somewhat like a story of a world where kids can buy nukes for $1 each at drug stores, and then a few kids use nukes to dig a fun cave to explore, after which all the world’s nukes are accidentally misplaced, end of story. Might make an interesting story, but bizarre as a projection of a world with $1 nukes sold at drug stores.

Yes, most movies about AIs give pretty unrealistic projections. But many do better than Her. For example, Spielberg’s 2001 movie A.I. Artificial Intelligence gets many things right. In it, AIs are very economically valuable, they displace humans on jobs, their abilities improve gradually with time, individual AIs only improve mildly over the course of their life, AI minds are alien below their human looking surfaces, and humans don’t empathize much with them. Yes, this movie also makes mistakes, such as having robots not needing power inputs, suggesting that love is much harder to mimic than lust, or that modeling details inside neurons is the key to high level reasoning. But compared to the mistakes in most movies about AIs, these are minor.

Tech Regs Are Coming

Over world history, we have seen a lot of things regulated. We can see patterns in these regulations, and we understand many of them – it isn’t all a mystery.

As far as I can tell, these patterns suggest that recent tech like operating systems, search engines, social networks, and IM systems are likely to be substantially regulated. For example, these systems have large network effects and economies of scale and scope. Yet they are now almost entirely unregulated. Why?

Some obvious explanations, fitting with previous patterns of regulation, are that these techs are high status, new, and changing fast. But these explanations suggest that low regulation is temporary. As they age, these systems will change less, eroding their high status derived from being fashionable. They will become stable utilities that we all use, like the many other stable utilities we use without much thought. And that we regulate, often heavily.

You’d think that if we all know regulation is coming, we’d be starting to argue about how and how much to regulate these things. Yet I hear little of this. Those who want little regulation might keep quiet, hoping the rest will just forget. But silence is more puzzling for those who want more regulation. Are they afraid to seem low status by proposing to regulate things that are still high status?

Similarly puzzling to me are all these internet businesses built on the idea that ordinary regulations don’t apply to stuff bought on the internet. They think that if you buy them on the internet, hired cars and drivers don’t have to follow cab regulations, rooms for a night don’t have to follow hotel regulations, ventures soliciting investors don’t have to follow securities regulations, and so on. Yes, regulators are slow and reluctant to regulate high status things, but can they really expect to evade regulation long enough to pay off their investors?

One Cam Net To Rule Them All?

The cost of surveillance hardware is falling rapidly. Soon it will be very cheap to install dense networks of cameras and microphones, to record what everyone does and says. (And the potential of vector microphones is neglected and huge.) So which networks are installed and used where will mainly come down to property rights – who is allowed to install and operate such networks where? And property rights will depend a lot on enforcement costs – how easy is it to detect and punish violations?

Governments who want to install and operate such networks would seem to face few obstacles. Decentralized attempts to sabotage those nets face the problem that the nets make it easy to identify and track saboteurs.

What about private networks? Clearly they could work, if the government fully supported them. But what if the government opposes them? Some have argued that it would be too hard to enforce rules against unapproved surveillance networks. They argue that if everyone has a cam and mic in their phone, eyeglasses, etc., the government can’t control them all. But it is one thing to have a camera and mic available, and quite another thing to make those function effectively as part of a shared surveillance network.

Imagine that some community tried to pool their cameras, mics, etc. to make a service that continually broadcast the sights and sounds from a large set of locations. You could go to their website, pick a location, and then see and hear what was happening there now. And imagine that the authorities wanted to stop this, at least for particular times and locations. What happens then?

If the authorities can go to this same website, they can not only see what others can see, they can also use that view to figure out where the cams and mics are. Even if those cams and mics are moving around, a continual broadcast should make it easy to find and disable them.

So how can private surveillance networks function in the face of official disapproval? One approach is to only broadcast with a long delay. The delay must be long enough so one can afford to move or replace the cam/mics without revealing who helped. Another approach is to only rarely offer widespread access – the net turns on only for rare special events. A third approach is to process the raw cam/mic info into new versions that hide their exact location origins, but still convince outsiders of their accuracy. That seems hard to me, but maybe it could work eventually.

All of these approaches seem to result in substantial reductions in the value of the surveillance info offered, and substantial increases in the cost to maintain the net. I conclude that governments can give themselves big cost and value advantages in the use of surveillance networks. If they choose, governments can see and hear much more than can the rest of us.

Freedom As Identity

At a big wonk dinner last night there was a long discussion of NSA policy. People seemed to agree that such policies are unlikely to change due to concrete publicized examples of specific resulting harms. Instead, people argued that changing technologies require us to change laws and policies in order to uphold basic principles such as that policies should be accountable to the public, avoid possibilities for corruption, and offer some substantial limits on government powers. But I wondered: how strongly does the public really support such principles?

You may recall I posted on survey results saying a US majority thought Snowden was wrong to expose NSA intelligence-gathering efforts. Also, Robert Rubin’s favorite graph of 2013 is one showing that the US public trusts the military and police far more than the courts, media, congress, or even the president. At the dinner many talked about wanting to avoid the abuses uncovered by the Church committee, but I’ll bet few in the public even remember what that was, and even fewer remember the Church committee as the good guys.

It occurs to me that what support the US public does have for principles of a limited and accountable government may be largely a side effect of war and patriotism propaganda. During the cold war we were often told that what made them bad and us good is that we had freedoms, while their governments had and used arbitrary powers. We were also told similar things about why the Nazis were bad. And in support of all this, schools tell kids that the US started because we objected to England’s arbitrary powers over us.

But as the cold war and WWII fade into history, we define ourselves less in opposition to enemies whose governments have arbitrary powers. We instead fall back more onto presuming that our status quo laws and policies are sufficient to support whatever principles we might have. Because in fact we don’t really support abstract principles of governance. We instead support the general presumptions that they are bad and we are good, and that our existing laws and policies are good unless someone can show otherwise via specific demonstrated harms. If today “they” are terrorists, then we assume that whatever we do to hurt them under existing policies is probably good too.

If there is a hope here, it would be that political elites feel a much stronger attachment to political principles, and that the public will over time come to adopt elite beliefs. But for now that seems a slim or distant hope. World of mass government surveillance, here we come.

