
You are confused. Dynamic treatment regimes necessitate a causal connection between the policy and the outcome. They are defined, ultimately, in terms of counterfactuals; see, for instance:

http://www.stat.lsa.umich.e...

http://www.rss.org.uk/uploa...

etc.

EDT doesn't even know what those counterfactual things _are_. I am not sure you really understand the difference between CDT and EDT (there is more going on here than just "oh there is an expectation and a conditioning bar, therefore it's EDT"). So far, every clear-cut example of the use of CDT (such as policy selection in dynamic treatment regimes, or actions based on randomized controlled trials) you have classified as EDT. I can only conclude that the set of things under the heading of CDT is, to you, the empty set.

---

Read the "Mathematical foundation" paragraph of Wikipedia article you cited. That's the typical textbook version of the EDT formula.

The article mentions the difficulty of inducing the optimal policies from the data due to confounding variables, but it makes clear that this is an estimation problem.

You keep conflating estimation theory and decision theory. While actual algorithms that compute actions from data may combine them, estimation and decision are conceptually different problems.

---

What you are describing is this:

http://en.wikipedia.org/wik...

What you have to use there is a variation of the g-formula with a policy. I suggest reading any references by "Robins" linked by the above Wikipedia article.

See also this sentence in the article: "The use of experimental data, where treatments have been randomly assigned, is preferred because it helps eliminate bias caused by unobserved confounding variables that influence both the choice of the treatment and the clinical outcome."

If you can't use experimental data, you have to use the g-formula instead (assuming your study satisfies certain conditions) to eliminate this bias. EDT doesn't even know what "confounding" is, as it has no language to talk about causal concepts.

The policy uses L1, but in a very particular way. It is certainly not EDT.
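For concreteness, here is a standard textbook form of that g-formula with a policy (my own gloss, not a quote from the Robins references, written in the notation used elsewhere in this thread): if g is a rule mapping the observed vitals l1 to a second treatment, the value of the regime is

\sum_{l1} E[Y | a1, A2 = g(l1), l1] p(l1 | a1)

The vitals enter through the rule g and through the weight p(l1 | a1), not as a plain conditioning event, which is what makes this a causal quantity rather than an evidential one.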

The FDA and the NIH use RCTs to establish effects. The g-formula will give you the same answer as an RCT in the example I gave. Anything that isn't the g-formula will give you garbage instead.

---

Ilya,

A question for clarification. Do you contend that the procedures Wiblin recommended in his untimely April Fool's joke would be endorsed by competent decision theorists of the evidential school?

---

So, assume I'm a doctor and I have a patient who has already been administered treatment a1, and now has vitals l1. I have to choose the second treatment.

You are saying that I have to use a formula that ignores the value of l1. This is clearly absurd. I would love to see a reference to a guideline from the FDA, the NIH or whatever other medical authority supporting your point.

---

No, I am pretty sure I am not confusing anything.

\sum_{l1} E[Y | a1, a2, l1] p(l1 | a1) will give you the same number as if, instead of listening to L1, the doctor randomized both A1 and A2 (for an infinite population of patients). In this sense it gives "the causal effect". Or at least that's what e.g. the FDA and the NIH think about when they think about drug efficacy. Which is why you will get in trouble with them if you use E[Y | a1, a2, l1].
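For concreteness, here is a minimal sketch (my own construction, with illustrative column names) of how that number would be computed from a recorded dataset, next to the naive conditional mean it is being contrasted with:

```python
# Minimal sketch, assuming a hypothetical pandas DataFrame `df` with binary
# columns A1, A2, L1 and an outcome column Y. Names and setup are illustrative.
import pandas as pd

def g_formula(df: pd.DataFrame, a1: int, a2: int) -> float:
    """Estimate sum_{l1} E[Y | a1, a2, l1] * p(l1 | a1)."""
    p_l1_given_a1 = df.loc[df["A1"] == a1, "L1"].value_counts(normalize=True)
    total = 0.0
    for l1, p_l1 in p_l1_given_a1.items():
        stratum = df[(df["A1"] == a1) & (df["A2"] == a2) & (df["L1"] == l1)]
        total += stratum["Y"].mean() * p_l1  # E[Y | a1, a2, l1], weighted by p(l1 | a1)
    return total

def naive_conditional_mean(df: pd.DataFrame, a1: int, a2: int, l1: int) -> float:
    """The plain E[Y | a1, a2, l1] that the comment above says is biased here."""
    stratum = df[(df["A1"] == a1) & (df["A2"] == a2) & (df["L1"] == l1)]
    return stratum["Y"].mean()
```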

The functionals in (a) and (b) will not give you the causal effect for this study (i.e., they will give you bias). If you think (a), (b) and (c) are all the same functional, you need to do some reading on basic probability.

Meta comment: what I am talking about now is very very basic causal inference. "In causal inference with observational data we have to use the g-formula" is comparable to "in quantum mechanics we have to use complex numbers."

Again, I urge you to go read the appropriate Bible.

---

> As for links, google for Markov equivalence in DAGs, for instance this: http://www.multimedia-computin...

It's well known that the orientation of the edges of a Bayesian network is arbitrary to some extent, but I can't see your point.

---

I can't really follow you. You keep saying that using E[Y | A1, A2, L1] is improper, that it should send you to jail, but you don't provide an argument for that.

Note that, due to the linearity of the expectation operator, the other formulas you provided are also computations of the expected value of Y, with respect to different conditional probability distributions. Are you sure you are not confusing the true conditional probability distribution p(Y | A1, A2, L1) with its many possible estimators (which may or may not be biased depending on confounding variables and so on)?

---

"EDT uses P(O|A) i.e. probability of outcome O given your decision A (or that's the way I see it) . That's the math it has in it. The probability of outcome O given your decision A does not straightforwardly relate to an estimate obtained from a bunch of different people's decisions. "

Doesn't matter. I can replace different doctors in my example with the same doctor, and then ask that doctor whether he thinks A1 and A2 help or kill patients. If he then uses either E[Y | a1, l1, a2] or E[Y | a1, a2] (which is E(O|A) that EDT advocates) he should go to jail.

As for links, google for Markov equivalence in DAGs, for instance this:

http://www.multimedia-compu...

or any standard textbook on causal inference or graphical models. As long as AIXI is only observing, it runs into standard limits. Also (this is obvious but worth stating), nobody uses AIXI to make actual decisions about drugs or anything like that. People use causal inference now and have been for at least a century or two (depending on how you count).

---

> That sounds like a good reason to start flossing to me

That's a very bad reason to start flossing - the worst, in fact. Flossing is only evidence for underlying health if it is done for reasons correlated with health. But starting it for other reasons completely destroys its use as a signal.

---

Has math in it, you say...

EDT uses P(O|A), i.e. the probability of outcome O given your decision A (or that's the way I see it). That's the math it has in it. The probability of outcome O given your decision A does not straightforwardly relate to an estimate obtained from a bunch of different people's decisions. Controlling for confounders has got to be par for the course when finding the probability of the outcome given specifically your decision. I'm saying it is vague because you seem to see it differently, and I've seen plenty of varying understandings.

As for the impossibility theorems in question, I'd need links. AIXI tries every piece of computer code and uses those programs that correctly predict the observations, weighted by their length. It can learn anything computable.

---

I don't understand your first two paragraphs at all.

AIXI (or anything else) cannot learn causality from observations alone due to standard impossibility theorems, similarly to how it cannot predict if a Turing machine will halt.
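To illustrate the observational-equivalence point with a toy example (my own sketch, not from the references above): for two jointly Gaussian variables, the models X -> Y and Y -> X achieve exactly the same likelihood on passively observed data, so nothing in the observations orients the edge.

```python
# Toy illustration: the Markov-equivalent DAGs X -> Y and Y -> X fit
# linear-Gaussian observational data equally well. Entirely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)  # the data really comes from X -> Y

def gaussian_loglik(residuals):
    var = residuals.var()  # ML variance estimate
    return -0.5 * residuals.size * (np.log(2 * np.pi * var) + 1)

def loglik(cause, effect):
    """Log-likelihood of: cause ~ Gaussian, effect = b * cause + Gaussian noise."""
    c = cause - cause.mean()
    e = effect - effect.mean()
    b = (c @ e) / (c @ c)  # least-squares slope
    return gaussian_loglik(c) + gaussian_loglik(e - b * c)

print(loglik(x, y))  # model X -> Y
print(loglik(y, x))  # model Y -> X: the same number, up to floating point
```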

You seem to have very strong opinions about whether something is EDT or CDT for someone who thinks they are very vague. The definitions I am aware of are very precise (e.g. they have math formulas in them).

CDT is about counterfactual causation, which has been understood precisely since Neyman's time (the 1920s), and is certainly understood very well now, almost a century later. Causal inference is not philosophy anymore; it's a (vibrant and growing!) branch of statistics. I think there were 73 causality papers in last year's JSM. Everyone agrees on the underlying math now.

I do recommend doing the reading I suggested above.

---

Yes, it is probably the case that flossing is beneficial. You're talking of a causal consequence, though, not of a correlation... Wiblin is talking of correlation, and flossing is an example of his choosing which hides his error.

Consider a ritual Foo which people decide to do or not do, correlated with outcome Bar. The correlation may be the result of a common factor influencing both the decision and the outcome - e.g. people who want Bar do Foo, and then attain Bar by other means. In this case, once you know the desire for Bar, doing Foo does not provide any extra evidence that Bar will be obtained.

For a specific example, consider medicine in the past. People who used some Mercury Based Cream on their faces were rich, cared for their health, and lived longer than people who did not use Mercury Based Cream, even though the cream was actively harmful - and its harmfulness could have been inferred from the available data, if only you controlled for the variables that influenced the decision.
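A toy simulation of that story (my own construction, with invented numbers) in which the cream lowers survival in every stratum, yet cream users survive more often overall:

```python
# Toy simulation of the Mercury Based Cream story: wealth raises both cream
# use and survival, while the cream itself is harmful. Numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rich = rng.random(n) < 0.3
uses_cream = rng.random(n) < np.where(rich, 0.8, 0.1)  # the rich use the cream
p_survive = 0.5 + 0.3 * rich - 0.1 * uses_cream        # the cream is harmful
survives = rng.random(n) < p_survive

# Overall: cream users survive MORE (confounding by wealth) ...
print(survives[uses_cream].mean(), survives[~uses_cream].mean())

# ... but within each wealth stratum, cream users survive LESS.
for r in (True, False):
    m = rich == r
    print(survives[m & uses_cream].mean(), survives[m & ~uses_cream].mean())
```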

And it is universally true that a deterministic decision itself provides no new evidence, insofar as the decision is produced from what is already known to the decider.

---

> Look, this example is based on papers people actually wrote, which is based on a problem people in HIV actually have.

Hemophiliacs were actually treated with bloodletting and aspirin, too. Unless there's knowledge that lets you discern those two examples, you can only be correct by sheer luck.

You need to state what the decision theory knows about Foo and Bar, and then we can see what the decision theory does - without introducing all the extra facts that the decision theory does not have, and without being able to invoke counterexamples (e.g. what if Foo is bloodletting and Bar is hemophilia?).

> Second, even if you observe everything, you still need to figure out causal directionality, which you cannot do by observations alone due to standard issues with observational equivalence of different causal DAGs.

You don't need to observe everything. A formalized agent such as AIXI will, with enough observations, get the gist of the causality and use it. The problem with CDT, EDT, and so on, is that those aren't actual theories; they're very vague. (And causality is very ill defined, e.g. see http://plato.stanford.edu/e... )

---

" For instance, provided that your dataset contains enough information, you could use an opaque machine learning method, like a neural network, to learn a direct evidence-to-action mapping."

You still don't get it. Here's the example study:

We randomize a treatment, call it A1. We then wait and measure the patient's vitals; that's L1. Then, based on these vitals, the doctor gives (or doesn't give) some additional treatment A2. Finally, we measure the outcome (is the patient alive or not?), call it Y.

We want to know if A1 and A2 are killing patients or not, given that L1 and Y are hopelessly confounded. Here's what we can do:

(a) Look at E[Y | a1, a2, l1] (this is what EDT suggests). This is completely wrong (you will get bias, e.g. go to jail for malpractice). It doesn't matter if you use a neural net, a support vector machine, or a non-linear regression to figure out this mapping; it will still be garbage.

(b) Look at \sum_{l1} E[Y | a1, a2, l1] p(l1) (this is the standard 'adjusting for confounders', and what people did for a long time in these cases). This is _also_ completely wrong, but already this would not be EDT anymore. Incidentally, the reason this is wrong for this study is the exact same reason people think Simpson's paradox is a paradox (the functional is wrong for the graph we have).

(c) What you have to do is look at this: \sum_{l1} E[Y | a1, a2, l1] p(l1 | a1). This is the so-called 'g-formula', and this will get you the effect without bias. If you don't understand why, I suggest a google for 'g-formula' or perhaps a good read of Judea's book, specifically chapters 3 and 4. (A small simulation contrasting the three functionals is sketched below.)
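Here is a toy simulation of exactly this study design (my own construction, with made-up coefficients) in which only (c) recovers the effect you would see if both treatments were randomized:

```python
# Toy simulation: A1 randomized, vitals L1 confounded with Y by an unobserved
# U, second treatment A2 chosen by the doctor based on L1. Coefficients invented.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
u = rng.normal(size=n)                   # unobserved confounder of L1 and Y
a1 = rng.integers(0, 2, size=n)          # first treatment, randomized
l1 = (rng.random(n) < 1 / (1 + np.exp(-(a1 + 2 * u)))).astype(int)  # vitals
a2 = (rng.random(n) < np.where(l1 == 1, 0.8, 0.2)).astype(int)      # doctor reacts to L1
y = 1.0 * a1 + 1.0 * a2 + 2.0 * u + rng.normal(size=n)              # true effects: +1 each

def e_y(a1v, a2v, l1v):                  # E[Y | a1, a2, l1]
    m = (a1 == a1v) & (a2 == a2v) & (l1 == l1v)
    return y[m].mean()

for a1v, a2v in [(1, 1), (0, 0)]:
    p_l1_given_a1 = [np.mean(l1[a1 == a1v] == k) for k in (0, 1)]
    p_l1_marginal = [np.mean(l1 == k) for k in (0, 1)]
    est_a = e_y(a1v, a2v, 1)                                           # functional (a)
    est_b = sum(e_y(a1v, a2v, k) * p_l1_marginal[k] for k in (0, 1))   # functional (b)
    est_c = sum(e_y(a1v, a2v, k) * p_l1_given_a1[k] for k in (0, 1))   # g-formula (c)
    print(a1v, a2v, round(est_a, 2), round(est_b, 2), round(est_c, 2))

# The truth from randomizing both treatments: E[Y(1,1)] - E[Y(0,0)] = 2.
# Only the (c) column's difference recovers it; (a) and (b) come out biased.
```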

---

The point with EDT is that you just act based on maximizing a functional of conditional probabilities. If you are randomizing, or adjusting for confounders, or doing the more complicated thing in (c) that is correct for longitudinal studies, you are interested in a causal connection between your action and the outcome, so you aren't doing EDT anymore. I mean, you can call it EDT if you want to, but it's really CDT by a standard definition. You can check Wikipedia, or any standard textbook.

---

The doctors don't use the underlying health status either when deciding, they use the observable effects of same (which are recorded in the patient's file).

The doctors' decisions are evidence for the underlying health status, as patient files are if they are available to you.

Your advice to model arbitrarily complex latents is doomed: most such modeling will result in misspecification bias, or be hopelessly expensive.

Yes, and as I said, that's why double-blind interventional studies are used.

If your study just records doctors' decisions and outcomes, I don't think you can even feasibly distinguish between an effective therapy and a therapy that is only as good as a placebo, or even one that does more harm than good. If rather than HAART it were homeopathy or bloodletting, what difference would you expect?

Luckily, there are well-understood methods for this. These methods aren't "evidential decision theory" though, because all such a theory can do is condition. Conditioning sometimes gets you in trouble with confounding.

Clearly there are decision problems where it is more cost-effective to develop problem-specific decision procedures which don't involve an explicit probability distribution estimation step followed by an explicit expected value maximization step. For instance, provided that your dataset contains enough information, you could use an opaque machine learning method, like a neural network, to learn a direct evidence-to-action mapping. I suppose that in addition to black-box methods there are specialized (and lawsuit-friendly) methods for medicine.

This doesn't mean there is anything intrinsically wrong with evidential decision theory. You are just approximating it to make your decision process tractable.
