People often talk about A/B testing and causality as if one replaces the other. That is backwards. A clean randomized A/B test is already a causal design.
The interesting mistakes start after the randomization, not before it. Someone filters to clickers. Someone asks for the “real” effect inside an engagement slice. Someone conditions on a variable the treatment itself changed and then treats the result as if it were still experimental.
That is where a causal API becomes useful. It lets you say, precisely, which quantity the experiment identified and which quantity you broke by conditioning on the wrong node.
A/B testing is already causal. The hard part is keeping it causal while you analyze it.
Start with the graph the experiment actually implies
Here is a small setup in py-bbn:
- U: pre-treatment user intent
- T: randomized treatment assignment
- M: post-treatment click
- Y: downstream conversion
The important design choice is that T has no parents. That is the randomization.
Drawn explicitly, the structure looks like this:

from pybbn.factory import create_reasoning_model

d = {
    "nodes": ["U", "T", "M", "Y"],
    "edges": [("U", "M"), ("U", "Y"), ("T", "M"), ("T", "Y")],
}
p = {
    "U": {"columns": ["U", "__p__"], "data": [["low", 0.8], ["high", 0.2]]},
    "T": {"columns": ["T", "__p__"], "data": [["A", 0.5], ["B", 0.5]]},
    "M": {
        "columns": ["U", "T", "M", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.85],
            ["low", "B", "yes", 0.15],
            ["high", "A", "no", 0.25],
            ["high", "A", "yes", 0.75],
            ["high", "B", "no", 0.22],
            ["high", "B", "yes", 0.78],
        ],
    },
    "Y": {
        "columns": ["U", "T", "Y", "__p__"],
        "data": [
            ["low", "A", "no", 0.98],
            ["low", "A", "yes", 0.02],
            ["low", "B", "no", 0.95],
            ["low", "B", "yes", 0.05],
            ["high", "A", "no", 0.55],
            ["high", "A", "yes", 0.45],
            ["high", "B", "no", 0.45],
            ["high", "B", "yes", 0.55],
        ],
    },
}
model = create_reasoning_model(d, p)
This is not meant to be a production experiment model. It is meant to expose the logic cleanly.
The good news: the aggregate A/B readout is already a causal estimate
Because treatment is randomized, the observational and interventional questions line up:
observed_a = model.pquery(nodes=["Y"], evidences=model.e({"T": "A"}))["Y"].prob_of({"Y": "yes"})
observed_b = model.pquery(nodes=["Y"], evidences=model.e({"T": "B"}))["Y"].prob_of({"Y": "yes"})

interventional_a = model.iquery(["Y"], ["yes"], ["T"], ["A"], method="graph").iloc[0]
interventional_b = model.iquery(["Y"], ["yes"], ["T"], ["B"], method="graph").iloc[0]

effect = model.iquery(
    ["Y"],
    ["yes"],
    ["T"],
    ["B"],
    x_ref=["A"],
    method="graph",
).iloc[0]
The numbers match exactly:
P(Y=yes | T=A) = 0.1060
P(Y=yes | T=B) = 0.1500
P(Y=yes | do(T=A)) = 0.1060
P(Y=yes | do(T=B)) = 0.1500
lift(B - A) = 0.0440
That is the whole payoff of randomization. In the aggregate experiment readout, the difference in conversion between A and B is already causal.
So if all you want is the experiment-wide effect, the A/B test has already done the hardest part for you.
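To see why this works, you can reproduce the readout by hand with nothing but the conditional probability tables defined above. Because T has no parents, U is independent of T, so P(Y=yes | T=t) is simply the Y table marginalized over the prior on U. This is a plain-Python sanity check, not a py-bbn API:

```python
# Sanity check by hand: because T is randomized (no U -> T edge),
# P(Y=yes | T=t) is just P(Y=yes | U, T) marginalized over the prior on U.
# Table values are copied from the model specification above.
p_u = {"low": 0.8, "high": 0.2}
p_y_given_ut = {  # P(Y=yes | U=u, T=t)
    ("low", "A"): 0.02, ("high", "A"): 0.45,
    ("low", "B"): 0.05, ("high", "B"): 0.55,
}

def p_y_given_t(t):
    # Randomization makes U independent of T, so no reweighting is needed.
    return sum(p_u[u] * p_y_given_ut[(u, t)] for u in p_u)

print(round(p_y_given_t("A"), 4))                     # 0.106
print(round(p_y_given_t("B"), 4))                     # 0.15
print(round(p_y_given_t("B") - p_y_given_t("A"), 4))  # 0.044
```

The hand calculation lands on the same 0.1060, 0.1500, and 0.0440 as both pquery and iquery, which is exactly what randomization buys you.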
The bad news: post-treatment slicing can destroy the identification
Now make the analysis mistake teams make all the time. Instead of comparing all assigned users, compare only the users who clicked:
clicked_a = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})
clicked_b = model.pquery(
    nodes=["Y"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["Y"].prob_of({"Y": "yes"})

high_clicked_a = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "A", "M": "yes"}),
)["U"].prob_of({"U": "high"})
high_clicked_b = model.pquery(
    nodes=["U"],
    evidences=model.e({"T": "B", "M": "yes"}),
)["U"].prob_of({"U": "high"})
Now the story flips:
Among clickers only:
P(Y=yes | T=A, M=yes) = 0.4086
P(Y=yes | T=B, M=yes) = 0.3326
Intent mix among clickers:
P(U=high | T=A, M=yes) = 0.9036
P(U=high | T=B, M=yes) = 0.5652
The aggregate experiment says B > A.
The clicker-only slice says A > B.
Why? Because M is post-treatment. Treatment changes the click pool. Once you condition on M=yes, you are no longer comparing the randomized populations you started with. You are comparing two different mixtures of underlying user intent.
That is not a new experiment. It is a different estimand.
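The mixture mechanics are easy to verify by hand. Conditioning on M=yes reweights U by Bayes' rule, and since Y depends only on (U, T) in this graph, the clicker-only conversion rate is the Y table averaged under that reweighted intent mix. Again a plain-Python check against the tables above, not a py-bbn API:

```python
# Reproduce the clicker-only flip by hand. Conditioning on M=yes reweights
# U via Bayes' rule; Y depends only on (U, T), so:
#   P(Y=yes | T=t, M=yes) = sum_u P(U=u | T=t, M=yes) * P(Y=yes | U=u, T=t)
# Table values are copied from the model specification above.
p_u = {"low": 0.8, "high": 0.2}
p_click = {  # P(M=yes | U=u, T=t)
    ("low", "A"): 0.02, ("high", "A"): 0.75,
    ("low", "B"): 0.15, ("high", "B"): 0.78,
}
p_y = {  # P(Y=yes | U=u, T=t)
    ("low", "A"): 0.02, ("high", "A"): 0.45,
    ("low", "B"): 0.05, ("high", "B"): 0.55,
}

def conversion_among_clickers(t):
    # Posterior intent mix among clickers: P(U=u | T=t, M=yes)
    joint = {u: p_u[u] * p_click[(u, t)] for u in p_u}
    z = sum(joint.values())
    posterior = {u: joint[u] / z for u in joint}
    return sum(posterior[u] * p_y[(u, t)] for u in p_u)

print(round(conversion_among_clickers("A"), 4))  # 0.4086
print(round(conversion_among_clickers("B"), 4))  # 0.3326
```

Treatment B recruits far more low-intent users into the click pool (its click posterior on U=high is 0.5652 versus 0.9036 for A), and that composition shift alone is enough to reverse the sign of the slice.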
This is where py-bbn helps
py-bbn is useful here because it makes the experiment structure explicit:
- pquery(...) lets you inspect what the randomized experiment actually says
- iquery(...) lets you state the intervention explicitly and verify that the randomized estimate is causal
- the graph itself makes it obvious which nodes are post-treatment and dangerous to condition on
That is valuable even when the experimental design is already good, because most A/B test mistakes happen in the analysis layer.
Where py-scm fits
A lot of product experimentation is not binary conversion. It is revenue per user, latency, watch time, spend, or some other continuous KPI. That is where py-scm becomes the better fit.
If you have a linear-Gaussian setup with treatment coded numerically and continuous outcomes, the same logic carries over:
from pyscm.reasoning import create_reasoning_model

# d, p here describe the linear-Gaussian structure and parameters,
# not the discrete tables from the py-bbn example above
model = create_reasoning_model(d, p)

effect = model.equery("revenue", {"T": 1.0}, {"T": 0.0})
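What an effect query computes in a linear-Gaussian SCM is easy to see by hand: with T randomized and zero-mean noise, means propagate through the structural equations, and the total effect is the sum over directed paths of the products of path coefficients. A minimal sketch with hypothetical coefficients (the equations and numbers below are illustrative, not py-scm's API or output):

```python
# Hand-rolled sketch of a linear-Gaussian effect query (hypothetical
# coefficients). Structural equations, with T randomized:
#   M       = 0.5 * T + U + eps_M
#   revenue = 1.0 * M + 0.3 * T + eps_Y
# U, eps_M, eps_Y have mean zero, so expectations propagate linearly.

def expected_revenue(t):
    m = 0.5 * t  # E[M | do(T=t)], since E[U] = E[eps_M] = 0
    return 1.0 * m + 0.3 * t

# Total effect of do(T=1) vs do(T=0): mediated path (1.0 * 0.5)
# plus the direct path (0.3) = 0.8
effect = expected_revenue(1.0) - expected_revenue(0.0)
print(effect)  # 0.8
```

The same identification logic applies as in the discrete case: because T has no parents, the difference in expected revenue between the two interventions equals the difference you would observe between the randomized arms.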
So the split is straightforward:
- use py-bbn when the experiment is naturally discrete and you want exact reasoning over assignments, funnel states, and post-treatment traps
- use py-scm when the KPI is continuous and you want observational, interventional, effect, or counterfactual queries in a linear-Gaussian setting
So what
The point is not that A/B tests need to be replaced by causality.
The point is that A/B tests are one clean entry point into causality, and causal APIs help you keep the experiment honest after the randomization has done its job.
That is a much stronger product story than saying “we also do A/B testing.” The useful claim is narrower and better: we help you state the estimand correctly, preserve the design when analyzing it, and avoid post-treatment mistakes that make a causal experiment sound observational again.

