Best Combos Are Robust

I’ve been thinking a lot lately about what a future world of ems would be like, and in doing so I’ve been naturally drawn to a simple common intuitive way to deal with complexity: form best estimates on each variable one at a time, and then adjust each best estimate to take into account the others, until one has a reasonably coherent baseline combination: a set of variable values that each seem reasonable given the others.
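
As a rough illustration of this procedure, here is a minimal sketch on a toy problem. The variable names, candidate values, scores, and the `AGREE_BONUS` coupling are all invented for the example, and the update rule (re-pick each variable's best value given the current values of the others, until nothing changes) is only one simple way to formalize "adjust each best estimate to take into account the others."

```python
# Toy illustration only: variables, values, and scores are invented.
# UNARY[var][value] says how plausible a value looks on its own.
UNARY = {
    "population": {"low": 0.2, "high": 0.8},
    "trade": {"low": 0.55, "high": 0.45},
    "wages": {"low": 0.48, "high": 0.52},
}
# Pairs of variables that are assumed to fit better when their values agree.
LINKED = [("population", "trade"), ("trade", "wages")]
AGREE_BONUS = 0.3


def score(combo):
    """Total plausibility of a full combination of values."""
    total = sum(UNARY[var][val] for var, val in combo.items())
    total += sum(AGREE_BONUS for a, b in LINKED if combo[a] == combo[b])
    return total


def initial_guess():
    """Best estimate for each variable taken one at a time."""
    return {var: max(vals, key=vals.get) for var, vals in UNARY.items()}


def refine(combo):
    """Adjust each variable given the others until no change helps."""
    combo = dict(combo)
    changed = True
    while changed:
        changed = False
        for var, vals in UNARY.items():
            best = max(vals, key=lambda v: score({**combo, var: v}))
            # Only switch on a strict improvement, so the loop must terminate.
            if score({**combo, var: best}) > score(combo):
                combo[var] = best
                changed = True
    return combo


start = initial_guess()
baseline = refine(start)
print(start, score(start))
print(baseline, score(baseline))
```

In this toy case the one-at-a-time guess for trade gets revised once the other two variables are taken into account. In general a procedure like this can settle on a combination that is coherent but not the best possible one.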

I’ve gotten a lot of informal complaints that this approach is badly overconfident, unscientific, and just plain ignorant. Don’t I know that any particular forecasted combo is very unlikely to be realized? Well yes, I do know this. But I don’t think critics realize how robust and widely used this best combo approach is.

For example, this is the main approach historians use when studying ancient societies. A historian estimating Roman Empire copper trade will typically rely on the best estimates by other experts on Roman population, mine locations, trade routes, travel time, crime rates, lifespans, climate, wages, copper use in jewelry, etc. While such estimates are sometimes based on relatively direct clues about those parameters, historians usually rely more on consistency with other parameter estimates. While they usually acknowledge their uncertainty, and sometimes identify coherent sets of alternative values for small sets of variables, historians mostly build best estimates on other historians’ best estimates.

As another example, the scheduling of very complex projects, as in construction, is usually done via reference to “baseline schedules,” which specify a best estimate start time, duration, and resource use for each part. While uncertainties are often given for each part, and sophisticated algorithms can take complex uncertainty dependencies into account in constructing this schedule, most attention still focuses on that single best combination schedule.

As a third example, even when people go to all the trouble to set up a full formal joint probability distribution over a complex space, as in a complex Bayesian network, and so would seem to have the least need to crudely avoid complexity by focusing on just one joint state, they still quite commonly want to compute the “most probable explanation”, i.e., that single most likely joint state.
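
To make “most probable explanation” concrete, here is a minimal sketch on a hypothetical two-node network A -> B. All the probabilities are invented, and the brute-force search over every joint state is only workable for tiny models like this one. It also shows that the single most likely joint state need not match the values you would pick by looking at each variable's marginal separately.

```python
from itertools import product

# Hypothetical two-node network A -> B; all probabilities are invented.
P_A = {0: 0.4, 1: 0.6}
P_B_GIVEN_A = {
    0: {0: 1.0, 1: 0.0},
    1: {0: 0.5, 1: 0.5},
}
B_VALUES = (0, 1)


def joint(a, b):
    """Joint probability P(A=a, B=b) from the network's factors."""
    return P_A[a] * P_B_GIVEN_A[a][b]


def most_probable_explanation():
    """Single most likely joint state, found by checking every state."""
    return max(product(P_A, B_VALUES), key=lambda state: joint(*state))


def marginal_modes():
    """Most likely value of each variable considered on its own."""
    p_b = {b: sum(joint(a, b) for a in P_A) for b in B_VALUES}
    return max(P_A, key=P_A.get), max(p_b, key=p_b.get)


mpe = most_probable_explanation()
modes = marginal_modes()
print("MPE:", mpe, "probability", joint(*mpe))
print("marginal modes:", modes, "probability", joint(*modes))
```

Here the joint state built from the separate marginal modes is less probable than the most probable explanation, which is why the adjustment step between variables matters.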

We also robustly use best tentative combinations when solving puzzles like Sudoku, crossword, or jigsaw. In fact, it is hard to think of realistic complex decision or inference problems full of interdependencies where we don’t rely heavily on a few current best guess baseline combinations. Since I’m not willing to believe that we are so badly mistaken in all these areas as to heavily rely on a terribly mistaken method, I have to believe it is a reasonable and robust method. I don’t see why I should hesitate to apply it to future forecasting.

  • Siddharth

    Haven’t you simply described the genetic algorithm approach to solve optimization problems? Or am I missing something?

    • Daniel Carrier

      In an optimization problem, you’re given a way to test how optimal something is. For example, if you’re trying to make a fast robot, you time it running across a course. In this problem, you’re trying to optimize accuracy, but you have no way to test how accurate it is.

      • Siddharth

        Ah. I see. So I guess Robin is arguing for consistency as a reliable proxy for the true fitness function, which is the accuracy of the prediction.

      • Daniel Carrier

        Consistency alone isn’t very good. You can easily make a consistent theory, so long as you don’t bother giving it a basis in fact. Perhaps the world is about to be destroyed by a meteor. What he’s suggesting is that the method he uses is the best anyone can do, and it generally works fairly well.

  • Stephen Diamond

    But are combos always robust? You should also provide examples of where they aren’t (unless you truly can’t find any after diligent consideration). Then you can analyze which class your project belongs to.

    • Doug

      The No Free Lunch Theorem tells us that anytime we impose restrictions on our hypothesis space, we’re making a tradeoff: improved performance on some classes of problems in exchange for decreased performance on other classes.

      For example, when interpolating from data we usually restrict or bias our hypothesis space to the set of smooth functions. This is reasonable because the vast majority of real-world phenomena are described by smooth, instead of jagged, functions. Relative to the space of all functions, reality is highly biased towards smoothness.

      Robin’s method is biasing the hypothesis space towards conditional independence between the features. What determines whether this is a good assumption is whether the way we separate features “cuts reality at its joints.”

      In other words, if the axes of our feature space are special, then conditional independence is usually a pretty good assumption. On the other hand, if the feature space tends to be a random rotation of an underlying feature set, then the assumption is a bad one.

      For most problems that humans regularly deal with we already have separated the features out pretty well. A contractor is probably going to make separate estimates that add up together using reasonably separate components of the building. Whereas in more “black-box” type applications like genomic analysis the feature space is frequently rotated.

  • Jess Riedel

    > Since I’m not willing to believe that we are so badly mistaken in all these areas as to heavily rely on a terribly mistaken method

    You’re willing to believe that human beings are hopelessly wrong about why they do the things they do, and are grossly overconfident about anything where their status is on the line, but you’re not willing to believe that historians (who are rarely exposed to cross-checks) consistently overstate their knowledge, and that large construction projects (which are practically defined by their fumbling) could be consistently mismanaged? I don’t understand how you can argue against the very simple fact that the conjunction of more than a few independent hypotheses has a very low probability unless almost all of the hypotheses are each very likely.

    Or maybe you’re just making the more nebulous claim that the combo hypotheses *are* likely wrong, but reasoning as if they are true is a useful exercise. But, without quantification or at least an example, this seems too ambiguous a claim to assess. (After all, analysing counterfactuals can be useful too.)
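
The conjunction arithmetic behind this objection can be checked directly. A minimal sketch, assuming the hypotheses are fully independent and all equally likely (both simplifying assumptions; `conjunction_probability` is a name chosen here for illustration):

```python
def conjunction_probability(p, n):
    """Chance that n independent claims, each true with probability p, all hold."""
    return p ** n


# Ten hypotheses at a few per-hypothesis confidence levels.
for p in (0.7, 0.9, 0.99):
    print(f"p = {p}: all ten hold with probability {conjunction_probability(p, 10):.3f}")
```

With ten hypotheses at 90% each, the whole combination holds only about 35% of the time. It takes roughly 99% confidence in each before the conjunction itself reaches about 90%.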

    • A very concrete claim: you can’t make many billions of dollars by entering the construction industry and scheduling your projects some other way.

      • Jess Riedel

        The concreteness is very useful. Maybe I misunderstood before, but now it sounds like you’re only claiming that this method is the *best* method we have rather than a *good* method. In other words, you might be right in saying that the most-likely-combo method is widely used, but wrong in saying that it’s robust. Your critics might be right in saying that it’s badly overconfident, but wrong in saying that it’s unscientific.

        If this is your position, then I would defend it explicitly, e.g. “The scenario I describe is very unlikely to share many properties with the real future. However, (a) the future is so important that it’s worth theorizing about a 0.001% scenario and/or (b) I believe theorizing about this scenario is likely to give us useful insights even if it looks nothing like the real future.”

      • IAMSBA

        I agree, that would be a better phrasing, but we don’t know what Hanson really intends; maybe he doesn’t yet know himself.

        When it comes to his future scenarios he seems to be going back and forth between “omfg, the ruthless efficiency of it would be so awesome!”, “this is just one, still low probability, way it could be and it’s fun to theorize about it, don’t take every detail at face value” and “whether we like it or not, this happens to be the most likely scenario, because I used science, bitch!”. The confusion arises when he rotates between these stances on the same future scenario. I guess his conscience still manages to go on the offensive from time to time.

  • Doug

    What you’re describing is well understood in the field of machine learning. For example, it’s well documented that Naive Bayes, which assumes conditional independence between features, predicts very competitively even in the presence of interdependent features.

    • John_Maxwell_IV

      Hm, my analogy would have been to maximum a posteriori estimation. Heck, the “adjust each best estimate to take into account the others” sounds a lot like dual decomposition.

      • Ilya Shpitser

        MAP and MPE are both intractable. So whatever the reasons are for graphical models people to like these, it can’t be because they are “crude but efficient.”

        I think an approach that more resembles what Robin is talking about is copula methods (model each single variable marginal independently, then come up with something relatively simple that gives an entire joint consistent with the choices you made for each marginal).

  • Robert Koslover

    I think this is insightful. I would add to your examples that classic business-world project management involves making detailed plans (schedules, work breakdown structures, resource planning, etc.), which seldom actually work out as planned at the detailed level, yet (if done well) do tend to help move projects forward effectively. Another example is war planning: In war, he who makes the most careful and detailed plans will be victorious (to paraphrase that truly brilliant ancient war theorist, Sun Tzu), despite the fact that no plan survives contact with the enemy (to paraphrase Helmuth von Moltke) amidst the “fog of war” (per Carl von Clausewitz).

  • Emily

    So I’m the only one who plays completely deductive Sudoku?

    • Even then doesn’t each deduction add to a tentative best combination?

  • Cyan

    “form best estimates on each variable one at a time, and then adjust each best estimate to take into account the others, until one has a reasonably coherent baseline combination: a set of variable values that each seem reasonable given the others.”

    My intuition is that this approach suffers from a kind of curse of dimensionality. Consider a standard multivariate normal distribution in N dimensions. Whatever the dimension, the density is highest at the origin. But for large N, most of the hypervolume of an N-dimensional ball is close to its hypersurface; as a consequence, most of the probability mass of (and hence a “typical” random sample from) the standard multivariate normal is found in a thin shell near a hypersphere with radius around sqrt(N).

    This analogy suggests that you’d do better to try to generate a manageable number of “typical combos” rather than the “best combo”.
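
The thin-shell effect is easy to check numerically. A minimal simulation sketch, using only the Python standard library; the function names, sample count, and seed are arbitrary choices made for this illustration:

```python
import math
import random
import statistics


def sample_radii(dim, samples, seed=0):
    """Distances from the origin of draws from a standard normal in `dim` dimensions."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(samples)
    ]


def shell_summary(dim, samples=200):
    """Mean and spread of the sampled radii, both in units of sqrt(dim)."""
    radii = sample_radii(dim, samples)
    scale = math.sqrt(dim)
    return statistics.mean(radii) / scale, statistics.pstdev(radii) / scale


for dim in (1, 10, 400):
    mean_ratio, spread_ratio = shell_summary(dim)
    print(f"N = {dim}: mean radius / sqrt(N) = {mean_ratio:.3f}, spread / sqrt(N) = {spread_ratio:.3f}")
```

As N grows, the sampled radii cluster ever more tightly around sqrt(N), so a typical draw sits far from the single most probable point at the origin.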

    • Doug

      What you’re describing is the classic exploration-exploitation tradeoff in numerical optimization.

      In practice, though, what you describe is typically not a serious issue even in high-dimensional problems. Most seemingly high-dimensional problems have low effective dimensionality, i.e., the number of parameters that actually have a significant effect on the objective function tends to be small; we simply don’t know which ones are important ahead of time.

      In this scenario the kind of iterative gradient descent that Robin describes will most likely find a pretty good solution. The gradient will be heavily loaded on the hyperplane that the effective dimensionality lives in, so we’ll converge relatively quickly even with a bunch of “noise dimensions.”

      • Cyan

        “Most seemingly high-dimensional problems have low effective dimensionality.”

        Part of the reason I prefer a manageable number of typical combos to a single best combo is that while the latter will underperform in an actual (rather than “seemingly”) high-dimensional problem, the former works well regardless of the dimensionality.


    “I’ve gotten a lot of informal complaints that this approach is badly overconfident, unscientific, and just plain ignorant.”

    It’s not unscientific; it’s fine, not just in practice but also in theory (there are several regression techniques that work like that, though you often need more than one iteration to get a decent result). But there are cases when you can go wrong: if you want to improve on a copper estimate of the Roman Empire that was based on counting the number of known copper mines, and you do this by looking at population estimates that are based on that same counting of known copper mines, you are obviously engaging in circular reasoning. Your original best estimates of all the variables should be independent from each other and from the research you are trying to surpass.


    “they still quite commonly want to compute the “most probable explanation”, i.e., that single most likely joint state.”

    Yes, but when it has a spectacularly low probability it loses its significance; you may provide it for the sake of completeness, though.

    “I have to believe it is a reasonable and robust method. I don’t see why I should hesitate to apply it to future forecasting.”

    The method is robust, but the available input data is a different story. Your conclusions obviously depend on the quality of the data as well and even the best available data may be of subpar quality.

  • Kim Øyhus

    The Shannon limit for error-correcting codes was reached by methods like those in the article, so they can be VERY good.

  • Philon

    I presume that in choosing the variables, your best estimates of which you then try to get into an equilibrium, you rely on some theory, relating just those variables to the kind of outcome in which you are interested. Then everything depends on your theory’s being accurate and complete.

  • Stephen Diamond

    What’s absent from what you can conclude using this method is actual likelihood. A better description of our knowledge–conceding your combination argument turns out to be convincing–could still be: we really have hardly the faintest idea. It’s a contingent fact that for some endeavors like construction the best method is also a good method.

  • Pingback: Overcoming Bias : Exemplary Futurism