Causality APIs 05: A/B Tests Are Causal, Until You Condition on the Wrong Thing

[Figure: abstract A/B testing graph showing the randomized treatment split, hidden user intent, a post-treatment click node, and the outcome node.]

People often talk about A/B testing and causality as if one replaces the other. That is backwards. A clean randomized A/B test is already a causal design.

The interesting mistakes start after the randomization, not before it. Someone filters to clickers. Someone asks for the “real” effect inside an engagement slice. Someone conditions on a variable the treatment itself changed and then treats the result as if it were still experimental.

That is where a causal API becomes useful. It lets you say, precisely, which quantity the experiment identified and which quantity you broke by conditioning on the wrong node.

A/B testing is already causal. The hard part is keeping it causal while you analyze it.

Start with the graph the experiment actually implies

Here is a small setup in py-bbn:

  • U: pre-treatment user intent
  • T: randomized treatment assignment
  • M: post-treatment click
  • Y: downstream conversion

The important design choice is that T has no parents. That is the randomization.

Drawn explicitly, the structure looks like this:

  U ──→ M ←── T
  U ──→ Y ←── T

Note there is no M → Y edge here: in this model the click is a marker of underlying intent, not a mediator of conversion.

Randomize cleanly, then break the estimand by conditioning on the wrong node

from pybbn.factory import create_reasoning_model

d = {
    "nodes": ["U", "T", "M", "Y"],
    "edges": [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")],
}

p = {
    "U": {"columns": ["U", "__p__"], "data": [["low", 0.8], ["high", 0.2]]},
    "T": {"columns": ["T", "__p__"], "data": [["A", 0.5], ["B", 0.5]]},
    "M": {
        "columns": ["U", "T", "M", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.85],
            ["low", "B", "yes", 0.15],
            ["high", "A", "no", 0.25],
            ["high", "A", "yes", 0.75],
            ["high", "B", "no", 0.22],
            ["high", "B", "yes", 0.78],
        ],
    },
    "Y": {
        "columns": ["U", "T", "Y", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.95],
            ["low", "B", "yes", 0.05],
            ["high", "A", "no", 0.55],
            ["high", "A", "yes", 0.45],
            ["high", "B", "no", 0.45],
            ["high", "B", "yes", 0.55],
        ],
    },
}

model = create_reasoning_model(d, p)

This is not meant to be a production experiment model. It is meant to expose the logic cleanly.
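Before trusting any query, it is worth sanity-checking the one design property everything else depends on. A quick pass over the edge list (plain Python, no py-bbn call involved) confirms that T has no parents:

```python
# Edges copied from d["edges"] above.
edges = [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")]

# Build a child -> parents map from the directed edges.
parents = {}
for src, dst in edges:
    parents.setdefault(dst, []).append(src)

# Randomization = no incoming edges into the treatment node.
assert "T" not in parents
print(parents)  # {'M': ['U', 'T'], 'Y': ['U', 'T']}
```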

The good news: the aggregate A/B readout is already a causal estimate

Because treatment is randomized, the observational and interventional questions line up:

observed_a = model.pquery(nodes=["Y"], evidences=model.e({"T": "A"}))["Y"].prob_of({"Y": "yes"})
observed_b = model.pquery(nodes=["Y"], evidences=model.e({"T": "B"}))["Y"].prob_of({"Y": "yes"})

interventional_a = model.iquery(["Y"], ["yes"], ["T"], ["A"], method="graph").iloc[0]
interventional_b = model.iquery(["Y"], ["yes"], ["T"], ["B"], method="graph").iloc[0]

effect = model.iquery(
    ["Y"],
    ["yes"],
    ["T"],
    ["B"],
    x_ref=["A"],
    method="graph",
).iloc[0]

The numbers match exactly:

P(Y=yes | T=A)     = 0.1060
P(Y=yes | T=B)     = 0.1500

P(Y=yes | do(T=A)) = 0.1060
P(Y=yes | do(T=B)) = 0.1500

lift(B - A)        = 0.0440

That is the whole payoff of randomization. In the aggregate experiment readout, the difference in conversion between A and B is already causal.

So if all you want is the experiment-wide effect, the A/B test has already done the hardest part for you.
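You can reproduce both readouts by hand. Because T is randomized, P(Y=yes | T) is just P(Y=yes | U, T) averaged over the prior on U, and that is exactly what do(T) gives you as well (probabilities copied from the CPTs above):

```python
# P(U) and P(Y=yes | U, T), copied from the model above.
p_u = {"low": 0.8, "high": 0.2}
p_y_yes = {
    ("low", "A"): 0.02, ("low", "B"): 0.05,
    ("high", "A"): 0.45, ("high", "B"): 0.55,
}

def p_y_given_t(t):
    # Marginalize out U; with T randomized this equals P(Y=yes | do(T=t)).
    return sum(p_u[u] * p_y_yes[(u, t)] for u in p_u)

print(round(p_y_given_t("A"), 4))                     # 0.106
print(round(p_y_given_t("B"), 4))                     # 0.15
print(round(p_y_given_t("B") - p_y_given_t("A"), 4))  # 0.044
```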

The bad news: post-treatment slicing can destroy the identification

Now make the analysis mistake that teams make all the time. Instead of comparing all assigned users, compare only the users who clicked:

clicked_a = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})

clicked_b = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})

high_clicked_a = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["U"].prob_of({"U": "high"})

high_clicked_b = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["U"].prob_of({"U": "high"})

Now the story flips:

Among clickers only:
  P(Y=yes | T=A, M=yes) = 0.4086
  P(Y=yes | T=B, M=yes) = 0.3326

Intent mix among clickers:
  P(U=high | T=A, M=yes) = 0.9036
  P(U=high | T=B, M=yes) = 0.5652

The aggregate experiment says B > A.

The clicker-only slice says A > B.

Why? Because M is post-treatment. Treatment changes the click pool. Once you condition on M=yes, you are no longer comparing the randomized populations you started with. You are comparing two different mixtures of underlying user intent.

That is not a new experiment. It is a different estimand.
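The flip is pure selection, and you can reproduce it with Bayes' rule on the CPTs above: conditioning on M=yes reweights U differently under each arm, and the conversion gap among clickers follows from that reweighting, not from the treatment.

```python
# Copied from the model: P(U), P(M=yes | U, T), P(Y=yes | U, T).
p_u = {"low": 0.8, "high": 0.2}
p_m_yes = {("low", "A"): 0.02, ("low", "B"): 0.15,
           ("high", "A"): 0.75, ("high", "B"): 0.78}
p_y_yes = {("low", "A"): 0.02, ("low", "B"): 0.05,
           ("high", "A"): 0.45, ("high", "B"): 0.55}

def clicker_slice(t):
    # P(U | T=t, M=yes) by Bayes' rule, then P(Y=yes) under that intent mix.
    z = sum(p_u[u] * p_m_yes[(u, t)] for u in p_u)
    post_u = {u: p_u[u] * p_m_yes[(u, t)] / z for u in p_u}
    y = sum(post_u[u] * p_y_yes[(u, t)] for u in p_u)
    return post_u["high"], y

print([round(v, 4) for v in clicker_slice("A")])  # [0.9036, 0.4086]
print([round(v, 4) for v in clicker_slice("B")])  # [0.5652, 0.3326]
```

Arm A's clickers are almost all high-intent users; arm B's click pool is diluted by low-intent users that B's stronger click nudge pulled in. That compositional difference is the entire "effect" in the slice.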

This is where py-bbn helps

py-bbn is useful here because it makes the experiment structure explicit:

  • pquery(...) lets you inspect what the randomized experiment actually says
  • iquery(...) lets you state the intervention explicitly and verify that the randomized estimate is causal
  • the graph itself makes it obvious which nodes are post-treatment and dangerous to condition on

That is valuable even when the experimental design is already good, because most A/B test mistakes happen in the analysis layer.
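That last point can even be mechanized. Any node reachable from T along directed edges is post-treatment, so a small reachability pass over the edge list (a hypothetical helper, not a py-bbn API) flags the dangerous conditioning set automatically:

```python
edges = [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")]

def descendants(node, edges):
    # Directed reachability: everything downstream of `node`.
    seen, frontier = set(), [node]
    while frontier:
        cur = frontier.pop()
        for src, dst in edges:
            if src == cur and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# Conditioning on any of these can break the randomized comparison.
print(sorted(descendants("T", edges)))  # ['M', 'Y']
```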

Where py-scm fits

A lot of product experimentation is not binary conversion. It is revenue per user, latency, watch time, spend, or some other continuous KPI. That is where py-scm becomes the better fit.

If you have a linear-Gaussian setup with treatment coded numerically and continuous outcomes, the same logic carries over:

from pyscm.reasoning import create_reasoning_model

# Here d and p are the linear-Gaussian analogues of the structure and
# parameters above, with a continuous "revenue" node in place of Y.
model = create_reasoning_model(d, p)

# Expected revenue under do(T=1.0) minus under do(T=0.0).
effect = model.equery("revenue", {"T": 1.0}, {"T": 0.0})

So the split is straightforward:

  • use py-bbn when the experiment is naturally discrete and you want exact reasoning over assignments, funnel states, and post-treatment traps
  • use py-scm when the KPI is continuous and you want observational, interventional, effect, or counterfactual queries in a linear-Gaussian setting
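For intuition on what an effect query returns in the linear-Gaussian case, here is a toy structural model (the coefficients are made up for illustration, and this does not use py-scm itself): under do(T=t) the confounder U keeps its natural distribution, so the expected revenue shift is just the T coefficient times the change in t.

```python
# Toy linear SCM: revenue = 2.0*T + 1.5*U + noise, with E[U] = 0
# and E[noise] = 0. Coefficients are illustrative assumptions.
BETA_T, BETA_U = 2.0, 1.5

def expected_revenue_do(t):
    # Under do(T=t), U is untouched, so only the treatment term survives:
    # E[revenue | do(T=t)] = BETA_T * t + BETA_U * E[U] = BETA_T * t.
    return BETA_T * t + BETA_U * 0.0

effect = expected_revenue_do(1.0) - expected_revenue_do(0.0)
print(effect)  # 2.0
```

The same cancellation is why a randomized A/B test on a continuous KPI identifies the effect: the arms share the same distribution of U, so its contribution drops out of the difference.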

So what

The point is not that A/B tests need to be replaced by causality.

The point is that A/B tests are one clean entry point into causality, and causal APIs help you keep the experiment honest after the randomization has done its job.

That is a much stronger product story than saying “we also do A/B testing.” The useful claim is narrower and better: we help you state the estimand correctly, preserve the design when analyzing it, and avoid post-treatment mistakes that make a causal experiment sound observational again.
