Causality APIs 05: A/B Tests Are Causal, Until You Condition on the Wrong Thing

[Figure: abstract A/B testing graph showing the randomized treatment split, hidden user intent, a post-treatment click node, and the outcome node.]

People often talk about A/B testing and causality as if one replaces the other. That is backwards. A clean randomized A/B test is already a causal design.

The interesting mistakes start after the randomization, not before it. Someone filters to clickers. Someone asks for the “real” effect inside an engagement slice. Someone conditions on a variable the treatment itself changed and then treats the result as if it were still experimental.

That is where a causal API becomes useful. It lets you say, precisely, which quantity the experiment identified and which quantity you broke by conditioning on the wrong node.

A/B testing is already causal. The hard part is keeping it causal while you analyze it.

Start with the graph the experiment actually implies

Here is a small setup in py-bbn:

  • U: pre-treatment user intent
  • T: randomized treatment assignment
  • M: post-treatment click
  • Y: downstream conversion

The important design choice is that T has no parents. That is the randomization.

Drawn explicitly, the structure looks like this:

  U ──→ M ←── T
  U ──→ Y ←── T

Note there is no M → Y edge here: in this model the click is a marker of underlying intent, not a mediator of conversion.

Randomize cleanly, then break the estimand by conditioning on the wrong node

from pybbn.factory import create_reasoning_model

d = {
    "nodes": ["U", "T", "M", "Y"],
    "edges": [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")],
}

p = {
    "U": {"columns": ["U", "__p__"], "data": [["low", 0.8], ["high", 0.2]]},
    "T": {"columns": ["T", "__p__"], "data": [["A", 0.5], ["B", 0.5]]},
    "M": {
        "columns": ["U", "T", "M", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.85],
            ["low", "B", "yes", 0.15],
            ["high", "A", "no", 0.25],
            ["high", "A", "yes", 0.75],
            ["high", "B", "no", 0.22],
            ["high", "B", "yes", 0.78],
        ],
    },
    "Y": {
        "columns": ["U", "T", "Y", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.95],
            ["low", "B", "yes", 0.05],
            ["high", "A", "no", 0.55],
            ["high", "A", "yes", 0.45],
            ["high", "B", "no", 0.45],
            ["high", "B", "yes", 0.55],
        ],
    },
}

model = create_reasoning_model(d, p)

This is not meant to be a production experiment model. It is meant to expose the logic cleanly.
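Before trusting any query, it is worth sanity-checking the one design property everything else depends on. A quick pass over the edge list (plain Python, no py-bbn call involved) confirms that T has no parents:

```python
# Edges copied from d["edges"] above.
edges = [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")]

# Build a child -> parents map from the directed edges.
parents = {}
for src, dst in edges:
    parents.setdefault(dst, []).append(src)

# Randomization = no incoming edges into the treatment node.
assert "T" not in parents
print(parents)  # {'M': ['U', 'T'], 'Y': ['U', 'T']}
```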

The good news: the aggregate A/B readout is already a causal estimate

Because treatment is randomized, the observational and interventional questions line up:

observed_a = model.pquery(nodes=["Y"], evidences=model.e({"T": "A"}))["Y"].prob_of({"Y": "yes"})
observed_b = model.pquery(nodes=["Y"], evidences=model.e({"T": "B"}))["Y"].prob_of({"Y": "yes"})

interventional_a = model.iquery(["Y"], ["yes"], ["T"], ["A"], method="graph").iloc[0]
interventional_b = model.iquery(["Y"], ["yes"], ["T"], ["B"], method="graph").iloc[0]

effect = model.iquery(
    ["Y"],
    ["yes"],
    ["T"],
    ["B"],
    x_ref=["A"],
    method="graph",
).iloc[0]

The numbers match exactly:

P(Y=yes | T=A)     = 0.1060
P(Y=yes | T=B)     = 0.1500

P(Y=yes | do(T=A)) = 0.1060
P(Y=yes | do(T=B)) = 0.1500

lift(B - A)        = 0.0440

That is the whole payoff of randomization. In the aggregate experiment readout, the difference in conversion between A and B is already causal.

So if all you want is the experiment-wide effect, the A/B test has already done the hardest part for you.
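You can reproduce both readouts by hand. Because T is randomized, P(Y=yes | T) is just P(Y=yes | U, T) averaged over the prior on U, and that is exactly what do(T) gives you as well (probabilities copied from the CPTs above):

```python
# P(U) and P(Y=yes | U, T), copied from the model above.
p_u = {"low": 0.8, "high": 0.2}
p_y_yes = {
    ("low", "A"): 0.02, ("low", "B"): 0.05,
    ("high", "A"): 0.45, ("high", "B"): 0.55,
}

def p_y_given_t(t):
    # Marginalize out U; with T randomized this equals P(Y=yes | do(T=t)).
    return sum(p_u[u] * p_y_yes[(u, t)] for u in p_u)

print(round(p_y_given_t("A"), 4))                     # 0.106
print(round(p_y_given_t("B"), 4))                     # 0.15
print(round(p_y_given_t("B") - p_y_given_t("A"), 4))  # 0.044
```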

The bad news: post-treatment slicing can destroy the identification

Now make the analysis mistake that teams make all the time. Instead of comparing all assigned users, compare only the users who clicked:

clicked_a = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})

clicked_b = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})

high_clicked_a = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["U"].prob_of({"U": "high"})

high_clicked_b = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["U"].prob_of({"U": "high"})

Now the story flips:

Among clickers only:
  P(Y=yes | T=A, M=yes) = 0.4086
  P(Y=yes | T=B, M=yes) = 0.3326

Intent mix among clickers:
  P(U=high | T=A, M=yes) = 0.9036
  P(U=high | T=B, M=yes) = 0.5652

The aggregate experiment says B > A.

The clicker-only slice says A > B.

Why? Because M is post-treatment. Treatment changes the click pool. Once you condition on M=yes, you are no longer comparing the randomized populations you started with. You are comparing two different mixtures of underlying user intent.

That is not a new experiment. It is a different estimand.
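The flip is pure selection, and you can reproduce it with Bayes' rule on the CPTs above: conditioning on M=yes reweights U differently under each arm, and the conversion gap among clickers follows from that reweighting, not from the treatment.

```python
# Copied from the model: P(U), P(M=yes | U, T), P(Y=yes | U, T).
p_u = {"low": 0.8, "high": 0.2}
p_m_yes = {("low", "A"): 0.02, ("low", "B"): 0.15,
           ("high", "A"): 0.75, ("high", "B"): 0.78}
p_y_yes = {("low", "A"): 0.02, ("low", "B"): 0.05,
           ("high", "A"): 0.45, ("high", "B"): 0.55}

def clicker_slice(t):
    # P(U | T=t, M=yes) by Bayes' rule, then P(Y=yes) under that intent mix.
    z = sum(p_u[u] * p_m_yes[(u, t)] for u in p_u)
    post_u = {u: p_u[u] * p_m_yes[(u, t)] / z for u in p_u}
    y = sum(post_u[u] * p_y_yes[(u, t)] for u in p_u)
    return post_u["high"], y

print([round(v, 4) for v in clicker_slice("A")])  # [0.9036, 0.4086]
print([round(v, 4) for v in clicker_slice("B")])  # [0.5652, 0.3326]
```

Arm A's clickers are almost all high-intent users; arm B's click pool is diluted by low-intent users that B's stronger click nudge pulled in. That compositional difference is the entire "effect" in the slice.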

This is where py-bbn helps

py-bbn is useful here because it makes the experiment structure explicit:

  • pquery(...) lets you inspect what the randomized experiment actually says
  • iquery(...) lets you state the intervention explicitly and verify that the randomized estimate is causal
  • the graph itself makes it obvious which nodes are post-treatment and dangerous to condition on

That is valuable even when the experimental design is already good, because most A/B test mistakes happen in the analysis layer.
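That last point can even be mechanized. Any node reachable from T along directed edges is post-treatment, so a small reachability pass over the edge list (a hypothetical helper, not a py-bbn API) flags the dangerous conditioning set automatically:

```python
edges = [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")]

def descendants(node, edges):
    # Directed reachability: everything downstream of `node`.
    seen, frontier = set(), [node]
    while frontier:
        cur = frontier.pop()
        for src, dst in edges:
            if src == cur and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# Conditioning on any of these can break the randomized comparison.
print(sorted(descendants("T", edges)))  # ['M', 'Y']
```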

Where py-scm fits

A lot of product experimentation is not binary conversion. It is revenue per user, latency, watch time, spend, or some other continuous KPI. That is where py-scm becomes the better fit.

If you have a linear-Gaussian setup with treatment coded numerically and continuous outcomes, the same logic carries over:

from pyscm.reasoning import create_reasoning_model

# Here d and p are the linear-Gaussian analogues of the structure and
# parameters above, with a continuous "revenue" node in place of Y.
model = create_reasoning_model(d, p)

# Expected revenue under do(T=1.0) minus under do(T=0.0).
effect = model.equery("revenue", {"T": 1.0}, {"T": 0.0})

So the split is straightforward:

  • use py-bbn when the experiment is naturally discrete and you want exact reasoning over assignments, funnel states, and post-treatment traps
  • use py-scm when the KPI is continuous and you want observational, interventional, effect, or counterfactual queries in a linear-Gaussian setting
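For intuition on what an effect query returns in the linear-Gaussian case, here is a toy structural model (the coefficients are made up for illustration, and this does not use py-scm itself): under do(T=t) the confounder U keeps its natural distribution, so the expected revenue shift is just the T coefficient times the change in t.

```python
# Toy linear SCM: revenue = 2.0*T + 1.5*U + noise, with E[U] = 0
# and E[noise] = 0. Coefficients are illustrative assumptions.
BETA_T, BETA_U = 2.0, 1.5

def expected_revenue_do(t):
    # Under do(T=t), U is untouched, so only the treatment term survives:
    # E[revenue | do(T=t)] = BETA_T * t + BETA_U * E[U] = BETA_T * t.
    return BETA_T * t + BETA_U * 0.0

effect = expected_revenue_do(1.0) - expected_revenue_do(0.0)
print(effect)  # 2.0
```

The same cancellation is why a randomized A/B test on a continuous KPI identifies the effect: the arms share the same distribution of U, so its contribution drops out of the difference.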

So what

The point is not that A/B tests need to be replaced by causality.

The point is that A/B tests are one clean entry point into causality, and causal APIs help you keep the experiment honest after the randomization has done its job.

That is a much stronger product story than saying “we also do A/B testing.” The useful claim is narrower and better: we help you state the estimand correctly, preserve the design when analyzing it, and avoid post-treatment mistakes that make a causal experiment sound observational again.
