I’ve been thinking a lot lately about what a future world of ems would be like, and in doing so I’ve been naturally drawn to a simple common intuitive way to deal with complexity: form best estimates on each variable one at a time, and then adjust each best estimate to take into account the others, until one has a reasonably coherent baseline combination: a set of variable values that each seem reasonable given the others.

## Best Combos Are Robust

What's absent from what you can conclude using this method is actual likelihood. A better description of our knowledge--conceding your combination argument turns out to be convincing--could still be: we really have hardly the faintest idea. It's a contingent fact that for some endeavors like construction the best method is also a good method.

If I can extract a nontechnical understanding of the foregoing comments: best combinations are likely to give you not only your best single estimate but also to provide a gradient, so futures far from the best combination are less likely than futures near. (This would seem to provide the argument against multiple estimates--the world of possibilities isn't polycentric.)

But what shouldn't be ignored--and what this method doesn't give you--is any indication of how good the estimate is. It's entirely possible--and a priori likely for estimates of a century based on past performance--that your very best estimate is a very poor estimate indeed. A better description of our knowledge--conceding your combination argument turns out to be convincing--could still be: we really have hardly the faintest idea about 100 years hence. This would all be consistent with your combination argument being very strong, that is, with your having convincingly shown that you have constructed the most reliable combination available. The best combination is in all likelihood a very weak combination, and it's a contingent fact that in some endeavors, like construction estimates, the best estimate is also a good estimate.

Consistency alone isn't very good. You can easily make a consistent theory, so long as you don't bother giving it a basis in fact. Perhaps the world is about to be destroyed by a meteor. What he's suggesting is that the method he uses is the best anyone can do, and it generally works fairly well.

I presume that in choosing the variables, your best estimates of which you then try to get into an equilibrium, you rely on some theory, relating just those variables to the kind of outcome in which you are interested. Then everything depends on your theory's being accurate and complete.

"Most seemingly high-dimensional problems have low effective dimensionality."

Part of the reason I prefer a manageable number of typical combos to a single best combo is that while the latter will underperform in an actual (rather than "seemingly") high-dimensional problem, the former works well regardless of the dimensionality.

The Shannon limit for error-correcting codes was reached by methods like those in the article, so they can be VERY good.

I agree, that would be a better phrasing, but we don't know what Hanson really intends, maybe he doesn't yet know himself.

When it comes to his future scenarios he seems to be going back and forth between "omfg, the ruthless efficiency of it would be so awesome!", "this is just one, still low probability, way it could be and it's fun to theorize about it, don't take every detail at face value" and "whether we like it or not, this happens to be the most likely scenario, because I used science, bitch!". The confusion arises when he rotates between these stances on the same future scenario. I guess his conscience still manages to go on the offensive from time to time.

The concreteness is very useful. Maybe I misunderstood before, but now it sounds like you're only claiming that this method is the *best* method we have rather than a *good* method. In other words, you might be right in saying that the most-likely-combo method is widely used, but wrong in saying that it's robust. Your critics might be right in saying that it's badly overconfident, but wrong in saying that it's unscientific.

If this is your position, then I would defend it explicitly as such, e.g. "The scenario I describe is very unlikely to share many properties with the real future. However, (a) the future is so important that it's worth theorizing about a 0.001% scenario and/or (b) I believe theorizing about this scenario is likely to give us useful insights even if it looks nothing like the real future."

Even then doesn't each deduction add to a tentative best combination?

A very concrete claim: you can't make many billions of dollars by entering the construction industry and scheduling your projects some other way.

"they still quite commonly want to compute the “most probable explanation”, i.e., that single most likely joint state."

Yes, but when it has a spectacularly low probability it loses its significance; you may still provide it for the sake of completeness.

"I have to believe it is a reasonable and robust method. I don’t see why I should hesitate to apply it to future forecasting."

The method is robust, but the available input data is a different story. Your conclusions obviously depend on the quality of the data as well and even the best available data may be of subpar quality.

"I’ve gotten a lot of informal complaints that this approach is badly overconfident, unscientific, and just plain ignorant."

It's not unscientific, it's fine, not just in practice but also in theory (there are several regression techniques that work like that, though you often need more than one iteration to get a decent result). But there are cases when you can go wrong: if you want to improve on a copper estimate of the Roman Empire that was based on counting the number of known copper mines, and you do this by looking at population estimates that are based on that same count of known copper mines, you are obviously engaging in circular reasoning. Your original best estimates of all the variables should be independent from each other and from the research you are trying to surpass.

MAP and MPE are both intractable. So whatever the reasons are for graphical models people to like these, it can't be because they are "crude but efficient."

I think an approach that more resembles what Robin is talking about is copula methods (model each single variable marginal independently, then come up with something relatively simple that gives an entire joint consistent with the choices you made for each marginal).

http://en.wikipedia.org/wik...
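To make the copula idea concrete, here is a minimal sketch. It assumes a Gaussian copula, which is only one possible choice and not something the comment specifies; the marginals (an exponential and a uniform), the correlation of 0.8, and every function name are hypothetical and chosen purely for illustration.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_samples(rho, inv_cdf_x, inv_cdf_y, n, seed=0):
    """Draw n pairs whose marginals are set by the two inverse CDFs and
    whose dependence comes from a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Two standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms; the clamp avoids log(0) in the exponential marginal.
        u1 = min(std_normal.cdf(z1), 1.0 - 1e-12)
        u2 = min(std_normal.cdf(z2), 1.0 - 1e-12)
        # Push each uniform through its own, independently chosen marginal.
        samples.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return samples


def exponential_inv_cdf(rate):
    return lambda u: -math.log(1.0 - u) / rate


def uniform_inv_cdf(low, high):
    return lambda u: low + (high - low) * u


def pearson(pairs):
    """Plain sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy)


pairs = gaussian_copula_samples(
    0.8, exponential_inv_cdf(2.0), uniform_inv_cdf(10.0, 20.0), n=20000
)
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_y = sum(y for _, y in pairs) / len(pairs)
corr_xy = pearson(pairs)
print(mean_x, mean_y, corr_xy)
```

Each marginal keeps the shape it was given on its own, while the single parameter `rho` supplies the "relatively simple" piece that ties them into one joint distribution.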

Hm, my analogy would have been to maximum a posteriori estimation. Heck, the "adjust each best estimate to take into account the others" sounds a lot like dual decomposition.

The No Free Lunch Theorem tells us that anytime we impose restrictions on our hypothesis space we're making a tradeoff: improved performance on some classes of problems for decreased performance on other classes.

For example, when interpolating from data we usually restrict or bias our hypothesis space to the set of smooth functions. This is reasonable because the vast majority of real-world phenomena are described by smooth, instead of jagged, functions. Relative to the space of all functions, reality is highly biased towards smoothness.

Robin's method is biasing the hypothesis space towards conditional independence between the features. What determines whether this is a good assumption is whether the way we separate features "cuts reality at its joints."

In other words, if the axes of our feature space are special then conditional independence is usually a pretty good assumption. On the other hand, if the feature space tends to be a random rotation of an underlying feature set then the assumption is a bad one.

For most problems that humans regularly deal with we already have separated the features out pretty well. A contractor is probably going to make separate estimates that add up together using reasonably separate components of the building. Whereas in more "black-box" type applications like genomic analysis the feature space is frequently rotated.
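The point about axis-aligned versus rotated features can be illustrated with a toy problem. The sketch below is an assumption-laden stand-in, not Robin's actual procedure: it takes "adjust each estimate given the others" to mean re-optimising one coordinate at a time on a two-variable quadratic objective, and it compares the same objective before and after a 45-degree rotation. The function names and the numbers (curvatures 1 and 100) are hypothetical.

```python
import math


def coordinate_sweeps(A, b, tol=1e-8, max_sweeps=10_000):
    """Minimise 0.5*x'Ax - b'x by re-optimising one coordinate at a time.

    Returns (x, sweeps), where sweeps counts full passes over all
    coordinates, including the final pass that confirms nothing changed.
    """
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_sweeps + 1):
        biggest_change = 0.0
        for i in range(n):
            # Best value of x[i] while every other coordinate is held fixed.
            rest = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - rest) / A[i][i]
            biggest_change = max(biggest_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if biggest_change < tol:
            return x, sweep
    return x, max_sweeps


def rotated(d1, d2, theta):
    """2x2 matrix R * diag(d1, d2) * R' for a rotation R by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c * c * d1 + s * s * d2, c * s * (d1 - d2)],
        [c * s * (d1 - d2), s * s * d1 + c * c * d2],
    ]


b = [1.0, 1.0]
# Features that "cut reality at its joints": no interaction between them.
x_aligned, sweeps_aligned = coordinate_sweeps(rotated(1.0, 100.0, 0.0), b)
# The same curvatures seen through a 45-degree rotation of the features.
x_rotated, sweeps_rotated = coordinate_sweeps(
    rotated(1.0, 100.0, math.pi / 4), b
)
print(sweeps_aligned, sweeps_rotated)
```

In the aligned case each variable can be settled on its own, so the adjustment finishes almost immediately; in the rotated case every adjustment to one variable partly undoes the other, and many more passes are needed to reach the same answer.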

What you're describing is the classic exploration-exploitation tradeoff in numerical optimization.

http://en.wikipedia.org/wik...

In practice, though, what you describe is typically not a serious issue even in high-dimensional problems. Most seemingly high-dimensional problems have low effective dimensionality, i.e. the number of parameters that actually have a significant effect on the objective function tends to be small; we simply don't know which ones are important ahead of time.

In this scenario the kind of iterative gradient descent that Robin describes will most likely find a pretty good solution. The gradient will be heavily loaded on the hyperplane that the effective dimensionality lives in, so we'll converge relatively quickly even with a bunch of "noise dimensions."
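A toy version of this claim, under loudly stated assumptions: the objective is a weighted quadratic in which only the first three coordinates matter and the rest carry a negligible weight, and the optimiser is plain fixed-step gradient descent. The weights, step size, tolerance, and function name are all hypothetical.

```python
def steps_to_converge(n_dims, n_important=3, noise_weight=1e-9,
                      step=0.25, tol=1e-4, max_steps=1000):
    """Count gradient-descent steps until the loss drops below tol.

    The loss is sum_i w_i * (x_i - 1)^2, with w_i = 1 for the first
    n_important coordinates and w_i = noise_weight for the rest.
    """
    weights = [1.0 if i < n_important else noise_weight
               for i in range(n_dims)]
    target = [1.0] * n_dims
    x = [0.0] * n_dims

    def loss(point):
        return sum(w * (p - t) ** 2
                   for w, p, t in zip(weights, point, target))

    steps = 0
    while loss(x) >= tol and steps < max_steps:
        # The gradient of w*(x - t)^2 is 2*w*(x - t), so it is almost
        # entirely concentrated on the heavily weighted coordinates.
        x = [p - step * 2.0 * w * (p - t)
             for w, p, t in zip(weights, x, target)]
        steps += 1
    return steps


steps_small = steps_to_converge(n_dims=3)      # only the important dims
steps_large = steps_to_converge(n_dims=1000)   # plus 997 noise dims
print(steps_small, steps_large)
```

With these settings the step count is driven by the three coordinates that matter, so adding hundreds of near-irrelevant ones should leave it essentially unchanged. Note what the sketch does not show: the noise coordinates themselves stay almost exactly where they started, so "converged" here means the objective is good, not that every variable has been pinned down.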