TWFE with Staggered Adoption

by DataMarvin

2 days ago

In the previous post, we saw how Two-Way Fixed Effects (TWFE) models absorb stable unit-level differences and common time trends to isolate a treatment effect. The setup is clean, the intuition is solid, and the implementation is a single OLS regression.

But TWFE has a serious flaw — one that's easy to miss and surprisingly common in practice. It shows up the moment your treatment doesn't roll out to everyone at the same time.

1. The Setup: Staggered Rollout

In a perfect experiment, treatment is assigned simultaneously. Everyone flips from control to treatment at the same moment, and your control group is clean throughout.

In reality, treatments almost always roll out in waves. A new feature gets enabled for 20% of users in Week 1, another 30% in Week 3, and the remaining 50% in Week 6. A policy change hits one region first, then spreads to others. A loyalty program launches in flagship stores before expanding to the full network.

This is staggered adoption — units receive treatment at different points in time, forming distinct cohorts based on when they were treated.

Standard TWFE handles simultaneous treatment well. With staggered adoption, it quietly breaks.

2. Why TWFE Breaks Under Staggered Adoption

The implicit comparison TWFE makes

When treatment timing varies, TWFE doesn't just compare treated units to never-treated controls. It also uses already-treated units as controls for later-treated units.

Go back to the coffee chain loyalty program example. Suppose:

Cohort A: 30 stores treated in Week 4
Cohort B: 30 stores treated in Week 8
Never-treated: 40 stores

When TWFE estimates the effect for Cohort B (treated Week 8), it compares them to all available "controls" at that point — including Cohort A stores, which have already been treated for four weeks.

If the loyalty program's effect grows over time (which is plausible — customers accumulate points and come back more), then Cohort A stores in Week 8 look better than they did in Week 4, not because of a new treatment, but because their treatment effect is maturing. Using them as a "control" for Cohort B makes Cohort B's effect look smaller than it actually is.

This is the contaminated control problem.
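To make the arithmetic concrete, here is a small sketch of the 2x2 comparison TWFE implicitly runs when Cohort A serves as a control for Cohort B. The numbers are assumptions for illustration only: a baseline of 100 weekly visits per store, no time trend, and an effect that adds 2 visits for every week a store has been in the program.

```python
# Hypothetical numbers: baseline visits and a treatment effect that grows
# by 2 visits for every week a store has been in the program.
BASELINE = 100.0
GROWTH_PER_WEEK = 2.0

def outcome(week, first_treated_week):
    """Expected weekly visits for a store first treated in `first_treated_week`."""
    if week < first_treated_week:
        return BASELINE
    weeks_exposed = week - first_treated_week + 1
    return BASELINE + GROWTH_PER_WEEK * weeks_exposed

def did(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences."""
    return (treated_post - treated_pre) - (control_post - control_pre)

pre_week, post_week = 7, 8  # Cohort B switches on in Week 8

# Clean comparison: Cohort B vs never-treated stores (flat at BASELINE).
clean = did(outcome(pre_week, 8), outcome(post_week, 8), BASELINE, BASELINE)

# Contaminated comparison: Cohort B vs Cohort A (treated since Week 4).
contaminated = did(outcome(pre_week, 8), outcome(post_week, 8),
                   outcome(pre_week, 4), outcome(post_week, 4))

print(clean)         # Cohort B's first-week effect
print(contaminated)  # smaller: Cohort A's own effect grew over the same window
```

In this toy setup the clean comparison recovers the 2-visit effect, while the contaminated one nets it out entirely, because Cohort A's maturing effect is mistaken for a common trend.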

Negative weights

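The contamination goes deeper than intuition suggests. Goodman-Bacon (2021) showed that the TWFE coefficient can be decomposed into a weighted average of all pairwise 2x2 DiD comparisons. The weights on those comparisons are positive, but some of the comparisons use already-treated units as controls. When effects change over time, those comparisons subtract the earlier cohort's evolving effect, which amounts to placing negative weights on some of the underlying cohort-period treatment effects, so some cells actively drag the estimate in the wrong direction.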

In the worst case, TWFE can produce a negative coefficient for a treatment that is uniformly positive across all cohorts. The sign of your estimate can be wrong.
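A minimal simulation sketch of the bias, using the coffee chain cohorts from above. Everything about the data-generating process is an assumption for illustration: 12 weeks, an effect that grows by 1 per week of exposure, and no noise, so any gap between the TWFE coefficient and the true average effect is pure bias. In a balanced panel, two-way fixed effects can be partialled out by double-demeaning, which is what `double_demean` does here.

```python
import numpy as np

n_weeks = 12
weeks = np.arange(1, n_weeks + 1)
# First treated week per store; np.inf marks never-treated stores.
first_treated = np.array([4.0] * 30 + [8.0] * 30 + [np.inf] * 40)

rel = weeks[None, :] - first_treated[:, None]   # weeks since treatment start
D = (rel >= 0).astype(float)                    # treatment indicator
effect = np.where(rel >= 0, rel + 1.0, 0.0)     # assumed: effect grows by 1 per week

rng = np.random.default_rng(0)
unit_fe = rng.normal(size=(len(first_treated), 1))
time_fe = rng.normal(size=(1, n_weeks))
Y = unit_fe + time_fe + effect                  # noise-free outcome

def double_demean(x):
    """Remove unit and time means (two-way FEs in a balanced panel)."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

D_tilde = double_demean(D)
twfe = (D_tilde * Y).sum() / (D_tilde ** 2).sum()  # OLS slope after partialling out FEs
true_att = effect[D == 1].mean()                   # average effect over treated cells

print(f"true ATT = {true_att:.2f}, TWFE = {twfe:.2f}")
```

Under these assumptions the true average effect over treated store-weeks is about 4.29, while TWFE reports about 2.79. The sign is still right in this example; the point is only that the single coefficient does not correspond to a sensible average of the effects.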

3. Diagnosing the Problem: Event Study

Before trusting any TWFE result under staggered adoption, run an event study. This decomposes the treatment effect into period-by-period estimates relative to the treatment date:

$Y_{it} = \alpha_i + \alpha_t + \sum_k \delta_k \cdot \mathbf{1}[t - G_i = k] + \varepsilon_{it}$

Where:

G_i = the period when unit i was first treated (its cohort)
k = periods relative to treatment (k = -3, -2, -1, 0, +1, +2, ...)
k = -1 is omitted as the reference period

Plot the δ_k coefficients over time. What you're looking for:

Pattern	Interpretation
Pre-period estimates flat around zero	Consistent with parallel trends (supportive evidence, not proof)
Pre-period estimates drifting upward	Pre-existing trend — parallel trends violated
Post-period estimates growing over time	Dynamic treatment effects — standard TWFE collapses these into one number
Pre-period estimates negative	Possible anticipation effects, or contamination from already-treated units serving as controls

A clean event study looks like a flat line before treatment and a step up (or gradual rise) after. If the pre-period shows drift, your TWFE estimate is unreliable regardless of how the post-period looks.
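The event-study regression above can be sketched with plain NumPy by stacking unit dummies, time dummies, and relative-time indicators into one design matrix. The simulated panel is an assumption for illustration: both cohorts share the same effect path (1 in the first treated week, growing by 1 per week), there is no noise, and never-treated stores are present so the relative-time dummies are identified.

```python
import numpy as np

n_weeks = 12
weeks = np.arange(1, n_weeks + 1)
first_treated = np.array([4.0] * 30 + [8.0] * 30 + [np.inf] * 40)
n_units = len(first_treated)

rel = weeks[None, :] - first_treated[:, None]    # -inf for never-treated stores
effect = np.where(rel >= 0, rel + 1.0, 0.0)      # assumed dynamic effect
rng = np.random.default_rng(1)
Y = rng.normal(size=(n_units, 1)) + rng.normal(size=(1, n_weeks)) + effect

# Relative periods observed among treated stores, with k = -1 as the reference.
ks = sorted(int(k) for k in np.unique(rel[np.isfinite(rel)]) if k != -1)

unit_idx = np.repeat(np.arange(n_units), n_weeks)
time_idx = np.tile(np.arange(n_weeks), n_units)
rel_flat = rel.ravel()

X = np.hstack([
    np.eye(n_units)[unit_idx],                                     # unit fixed effects
    np.eye(n_weeks)[time_idx],                                     # time fixed effects
    np.column_stack([(rel_flat == k).astype(float) for k in ks]),  # 1[t - G_i = k]
])
coef, *_ = np.linalg.lstsq(X, Y.ravel(), rcond=None)
delta = dict(zip(ks, coef[n_units + n_weeks:]))

for k in ks:
    print(f"k={k:+d}  delta={delta[k]: .2f}")
```

Because the two cohorts follow the same effect path here, the pooled coefficients recover it: zeros before treatment and a steady rise after. When cohorts follow different paths, the same regression can mix effects across relative periods, which is the problem the estimators below address.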

4. The Fix: Cohort-Specific DiD

The core principle behind modern staggered adoption estimators is simple:

Only compare a cohort to units that haven't been treated yet.

By restricting comparisons to "clean controls" — units that are either never treated or not yet treated — you eliminate the contamination problem. You estimate a separate treatment effect for each cohort-period combination, then aggregate.
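As a rough sketch of that principle, the function below computes a single cohort-period effect as a difference in means, using week g-1 as the base period and only stores still untreated at week t as controls. The panel is simulated under assumptions chosen for illustration (effect grows by 1 per week of exposure, no noise), and `clean_att` is a hypothetical helper, not any package's API.

```python
import numpy as np

n_weeks = 12
weeks = np.arange(1, n_weeks + 1)
first_treated = np.array([4.0] * 30 + [8.0] * 30 + [np.inf] * 40)
rel = weeks[None, :] - first_treated[:, None]
effect = np.where(rel >= 0, rel + 1.0, 0.0)      # assumed dynamic effect
rng = np.random.default_rng(2)
Y = rng.normal(size=(len(first_treated), 1)) + rng.normal(size=(1, n_weeks)) + effect

def clean_att(Y, first_treated, weeks, g, t):
    """DiD for cohort g at a post-treatment week t >= g.

    Base period is week g-1; controls are stores not yet treated at week t
    (never-treated stores have first_treated = inf, so they always qualify).
    """
    treated = first_treated == g
    controls = first_treated > t
    change = Y[:, weeks == t].ravel() - Y[:, weeks == g - 1].ravel()
    return change[treated].mean() - change[controls].mean()

for g in (4, 8):
    row = [round(float(clean_att(Y, first_treated, weeks, g, t)), 2)
           for t in range(g, n_weeks + 1)]
    print(f"cohort {g}: {row}")
```

Each cohort's row recovers its own effect path in this noise-free setup, because no already-treated store ever appears on the control side of a comparison.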

Two estimators implement this principle in different ways.

5. Callaway–Sant'Anna (2021)

The idea

Callaway and Sant'Anna propose estimating cohort-specific Average Treatment Effects on the Treated (ATT(g,t)) — the average effect for units first treated in period g, measured at calendar time t.

For each cohort g, they compare:

Treated units (first treated in g), post-treatment
Clean controls: either never-treated units or not-yet-treated units at time t

This produces a matrix of ATT(g,t) estimates — one for each cohort × time combination. These can then be aggregated in several ways:

Overall ATT → weighted average across all cohort-period cells
Dynamic ATT → average effect by "time since treatment" (event-study style)
Calendar-time ATT → average effect by calendar period
Group-specific ATT → separate estimate per cohort
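A small sketch of how the four aggregations relate to the same ATT(g,t) matrix. The inputs are hypothetical: two equally sized cohorts (Weeks 4 and 8) observed through Week 12, with ATT(g,t) = t - g + 1. The weighting here is a plain cohort-size-weighted average; real implementations offer other weighting schemes, so treat this as an illustration of the idea rather than any package's exact formula.

```python
from collections import defaultdict

# Hypothetical inputs: cohort sizes and ATT(g,t) for post-treatment cells.
cohort_size = {4: 30, 8: 30}
att = {(g, t): float(t - g + 1) for g in cohort_size for t in range(g, 13)}

def weighted_mean(cells):
    """Average ATT(g,t) over the given cells, weighting by cohort size."""
    total = sum(cohort_size[g] for g, _ in cells)
    return sum(cohort_size[g] * att[(g, t)] for g, t in cells) / total

def aggregate(key):
    """Group cells by key(g, t) and average within each group."""
    groups = defaultdict(list)
    for g, t in att:
        groups[key(g, t)].append((g, t))
    return {k: weighted_mean(cells) for k, cells in sorted(groups.items())}

overall = weighted_mean(list(att))
dynamic = aggregate(lambda g, t: t - g)   # by time since treatment
calendar = aggregate(lambda g, t: t)      # by calendar week
by_cohort = aggregate(lambda g, t: g)     # by cohort

print(round(overall, 2))
print(dynamic)
print(calendar)
print(by_cohort)
```

Note how the calendar-time view at Week 8 averages a mature Week 4 cohort with a brand-new Week 8 cohort, while the dynamic view keeps exposure length fixed. The choice of aggregation changes the question being answered.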

When to use CS-DiD

You want interpretable cohort-level estimates — to understand whether early and late adopters respond differently
You have enough sample size per cohort to support cell-level estimation
You're willing to accept a more complex output (a matrix of ATTs rather than a single coefficient)
You want maximum flexibility in how you aggregate

Practical note

CS-DiD requires choosing a comparison group: never-treated units only, or never-treated + not-yet-treated. The not-yet-treated option uses more data and improves precision, but assumes those units' trends are valid controls — which may not hold if treatment timing is correlated with outcomes.
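The two comparison-group choices differ only in which units are admitted as controls at a given period. A minimal sketch, where `control_mask` and its `include_not_yet_treated` flag are hypothetical names introduced for illustration:

```python
import numpy as np

def control_mask(first_treated, t, include_not_yet_treated=False):
    """Boolean mask of units usable as controls at period t.

    first_treated holds each unit's first treated period, np.inf if never treated.
    """
    first_treated = np.asarray(first_treated, dtype=float)
    never = np.isinf(first_treated)
    if not include_not_yet_treated:
        return never
    return never | (first_treated > t)

first_treated = [4, 4, 8, 8, np.inf, np.inf]
print(control_mask(first_treated, t=5).sum())                                # never-treated only
print(control_mask(first_treated, t=5, include_not_yet_treated=True).sum())  # adds the Week 8 cohort
print(control_mask(first_treated, t=9, include_not_yet_treated=True).sum())  # Week 8 cohort now treated
```

The not-yet-treated pool shrinks as the rollout proceeds, so the precision gain is largest for early periods and disappears once every eventually-treated unit has been treated.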

6. Sun–Abraham (2021)

The idea

Sun and Abraham take a different approach. Rather than stepping outside the OLS framework, they show that each coefficient in a TWFE event-study regression is a weighted sum of cohort-specific effects, including effects from other relative periods, and that those weights can be negative.

Their fix: interact each treatment indicator with cohort dummies to estimate cohort-specific effects directly within OLS, then aggregate with the correct (non-negative) weights.

The regression looks like:

$Y_{it} = \alpha_i + \alpha_t + \sum_g \sum_k \delta_{g,k} \cdot \mathbf{1}[G_i = g] \cdot \mathbf{1}[t - g = k] + \varepsilon_{it}$

This is more complex than standard TWFE, but it stays within the familiar OLS world — same software, same inference framework, just more interaction terms.
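A sketch of the interacted regression in plain NumPy, followed by a cohort-share aggregation. The setup is assumed for illustration: the Week 4 cohort's effect grows twice as fast as the Week 8 cohort's, there is no noise, k = -1 is the omitted period, and never-treated stores serve as the reference group. This shows the idea only; it is not a substitute for the packaged estimators, which also handle inference and binning of distant periods.

```python
import numpy as np

n_weeks = 12
weeks = np.arange(1, n_weeks + 1)
first_treated = np.array([4.0] * 30 + [8.0] * 30 + [np.inf] * 40)
n_units = len(first_treated)
rel = weeks[None, :] - first_treated[:, None]

# Assumed heterogeneity: the Week 4 cohort's effect grows twice as fast.
slope = np.where(first_treated == 4, 2.0, 1.0)[:, None]
effect = np.where(rel >= 0, slope * (rel + 1.0), 0.0)
rng = np.random.default_rng(3)
Y = rng.normal(size=(n_units, 1)) + rng.normal(size=(1, n_weeks)) + effect

unit_idx = np.repeat(np.arange(n_units), n_weeks)
time_idx = np.tile(np.arange(n_weeks), n_units)
g_flat = np.repeat(first_treated, n_weeks)
rel_flat = rel.ravel()

# One dummy per (cohort g, relative period k), omitting k = -1.
# Never-treated stores get no dummies and act as the reference group.
cells = [(g, int(k)) for g in (4, 8)
         for k in np.unique(rel[first_treated == g]) if k != -1]
X = np.hstack([
    np.eye(n_units)[unit_idx],    # unit fixed effects
    np.eye(n_weeks)[time_idx],    # time fixed effects
    np.column_stack([((g_flat == g) & (rel_flat == k)).astype(float)
                     for g, k in cells]),
])
coef, *_ = np.linalg.lstsq(X, Y.ravel(), rcond=None)
delta = dict(zip(cells, coef[n_units + n_weeks:]))

size = {4: 30, 8: 30}

def aggregate(k):
    """Cohort-share-weighted average of delta_{g,k} over cohorts observed at k."""
    gs = [g for g in size if (g, k) in delta]
    return sum(size[g] * delta[(g, k)] for g in gs) / sum(size[g] for g in gs)

print(round(delta[(4, 0)], 2), round(delta[(8, 0)], 2), round(aggregate(0), 2))
```

The cohort-specific coefficients recover each cohort's own path, and the aggregate at each relative period is a proper average with non-negative weights that sum to one.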

When to use Sun–Abraham

You want a single aggregate ATT rather than a full cohort matrix
Staggered timing is mild (few cohorts, similar treatment timing)
You want minimal implementation overhead — it runs in any regression package
You're presenting to an audience comfortable with OLS but not with semiparametric estimators

7. CS-DiD vs. Sun–Abraham — Side by Side

	Callaway–Sant'Anna	Sun–Abraham
Framework	Semiparametric (outside OLS)	OLS with interaction terms
Output	Matrix of ATT(g,t), then aggregated	Single aggregate ATT (or dynamic)
Clean control requirement	Explicit: never-treated or not-yet-treated	Implicit: relies on never-treated (or last-treated) units as the reference
Handles heterogeneous effects	Yes, by design	Yes, via cohort interactions
Implementation complexity	Higher	Moderate
Best for	Cohort-level analysis, large panels	Single-number estimate, familiar workflow
Common packages	`csdid` (Stata), `did` (R), `csdid` or `differences` (Python)	`eventstudyinteract` (Stata), `sunab()` in `fixest` (R)

8. Standard TWFE vs. Robust Estimators — When Does It Matter?

Not every staggered adoption setting requires the full CS-DiD or Sun–Abraham treatment. Standard TWFE remains valid when:

Treatment effects are homogeneous across cohorts and constant over time since treatment: if every cohort responds the same way and the effect neither grows nor fades, the negative weight problem doesn't materialize
Treatment timing is nearly simultaneous — if cohorts are treated within a short window, the contamination is minimal
You've verified flat pre-trends — a clean event study doesn't guarantee TWFE is right, but a contaminated one is a clear signal to switch

Use the robust estimators when:

Pre-trends are non-flat or noisy
You have many cohorts spread across a long time window
You have strong reason to believe treatment effects differ by cohort (early adopters vs. late adopters often behave differently)
Your results will be used to make consequential decisions

9. A Practical Workflow

Step 1: Identify cohorts
        → Which units were treated in which period?
        → How many never-treated units are available as clean controls?

Step 2: Run a standard event study
        → Plot pre- and post-treatment period effects
        → Flag if pre-period estimates drift or sit systematically away from zero

Step 3: Decide on estimator
        → Simultaneous treatment or homogeneous effects → TWFE acceptable
        → Staggered with heterogeneous effects → CS-DiD or Sun–Abraham

Step 4: Estimate cohort-specific ATTs
        → CS-DiD: compute ATT(g,t) matrix, aggregate
        → Sun–Abraham: interact cohort × relative-time dummies, aggregate

Step 5: Report
        → Dynamic ATT plot (effect by time since treatment)
        → Overall ATT with confidence intervals
        → Sensitivity check: does the result hold under alternative control group choices?
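Step 1 is mostly bookkeeping. A minimal sketch, assuming the treatment indicator is available as a units x periods 0/1 array and that treatment is absorbing (once on, it stays on); `first_treated_period` is a hypothetical helper name:

```python
import numpy as np
from collections import Counter

def first_treated_period(D, periods):
    """Return each unit's first treated period (np.inf if never treated).

    D is a units x periods 0/1 array; treatment is assumed to be absorbing.
    """
    D = np.asarray(D)
    ever = D.any(axis=1)
    # argmax returns the first column holding the row maximum (the first 1).
    first = np.asarray(periods, dtype=float)[D.argmax(axis=1)]
    return np.where(ever, first, np.inf)

D = np.array([
    [0, 0, 1, 1, 1],   # treated from period 3
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],   # treated from period 5
    [0, 0, 0, 0, 0],   # never treated
])
G = first_treated_period(D, periods=[1, 2, 3, 4, 5])
cohort_sizes = Counter(G.tolist())
print(cohort_sizes)
```

The count under `inf` is the number of never-treated units, which is worth checking before anything else: a small never-treated group limits how far out the clean comparisons can reach.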

Takeaway

TWFE is not wrong — it's misapplied when treatment timing varies and effects are heterogeneous. The contaminated control problem and negative weights are not edge cases; they show up regularly in product experiments, policy evaluations, and any observational setting with rolling rollouts.

The modern fix isn't complicated in principle: only compare treated units to clean controls, estimate cohort-specific effects, then aggregate. Callaway–Sant'Anna and Sun–Abraham both implement this — with different tradeoffs in flexibility and implementation complexity.

One sentence summary:

When treatment rolls out in waves, standard TWFE silently contaminates your control group — CS-DiD and Sun–Abraham fix this by restricting comparisons to units that haven't been treated yet.