TWFE with Staggered Adoption
In the previous post, we saw how Two-Way Fixed Effects (TWFE) models absorb stable unit-level differences and common time trends to isolate a treatment effect. The setup is clean, the intuition is solid, and the implementation is a single OLS regression.
But TWFE has a serious flaw — one that's easy to miss and surprisingly common in practice. It shows up the moment your treatment doesn't roll out to everyone at the same time.
1. The Setup: Staggered Rollout
In a perfect experiment, treatment is assigned simultaneously. Everyone flips from control to treatment at the same moment, and your control group is clean throughout.
In reality, treatments almost always roll out in waves. A new feature gets enabled for 20% of users in Week 1, another 30% in Week 3, and the remaining 50% in Week 6. A policy change hits one region first, then spreads to others. A loyalty program launches in flagship stores before expanding to the full network.
This is staggered adoption — units receive treatment at different points in time, forming distinct cohorts based on when they were treated.
Standard TWFE handles simultaneous treatment well. With staggered adoption, it can quietly break whenever treatment effects differ across cohorts or change over time.
2. Why TWFE Breaks Under Staggered Adoption
The implicit comparison TWFE makes
When treatment timing varies, TWFE doesn't just compare treated units to never-treated controls. It also uses already-treated units as controls for later-treated units.
Go back to the coffee chain loyalty program example. Suppose:
- Cohort A: 30 stores treated in Week 4
- Cohort B: 30 stores treated in Week 8
- Never-treated: 40 stores
When TWFE estimates the effect for Cohort B (treated Week 8), it compares them to all available "controls" at that point — including Cohort A stores, which have already been treated for four weeks.
If the loyalty program's effect grows over time (which is plausible — customers accumulate points and come back more), then Cohort A stores in Week 8 look better than they did in Week 4, not because of a new treatment, but because their treatment effect is maturing. Using them as a "control" for Cohort B makes Cohort B's effect look smaller than it actually is.
This is the contaminated control problem.
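To make the arithmetic concrete, here is a minimal sketch of the two comparisons for Cohort B around Week 8. The baseline trend and the effect path (1.0 at adoption, growing by 0.5 per week treated) are invented for illustration, and outcomes are noise-free group means rather than real store data.

```python
# Cohort start weeks from the example; anything else is never-treated.
COHORT_START = {"A": 4, "B": 8}

def effect(weeks_since_treatment):
    """Hypothetical effect path: 1.0 at adoption, +0.5 per extra week treated."""
    return 1.0 + 0.5 * weeks_since_treatment

def mean_outcome(group, week):
    """Noise-free average outcome for a group in a given week."""
    baseline = 10.0 + 0.2 * week          # common trend, so parallel trends holds
    start = COHORT_START.get(group)       # None for the never-treated group
    if start is not None and week >= start:
        return baseline + effect(week - start)
    return baseline

def did_2x2(treated, control, pre_week, post_week):
    """Classic 2x2 DiD: change in the treated group minus change in the control."""
    d_treated = mean_outcome(treated, post_week) - mean_outcome(treated, pre_week)
    d_control = mean_outcome(control, post_week) - mean_outcome(control, pre_week)
    return d_treated - d_control

true_effect_b = effect(0)                                   # B's effect in Week 8
clean = did_2x2("B", "never", pre_week=7, post_week=8)      # never-treated control
contaminated = did_2x2("B", "A", pre_week=7, post_week=8)   # already-treated control

print(round(true_effect_b, 3), round(clean, 3), round(contaminated, 3))
```

Under these assumptions the never-treated comparison recovers Cohort B's true first-week effect of 1.0. The comparison against Cohort A returns about 0.5, because Cohort A's own effect grew by 0.5 between Week 7 and Week 8 and that growth gets subtracted out.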
Negative weights
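The contamination goes deeper than intuition suggests. Goodman-Bacon (2021) showed that the TWFE coefficient can be decomposed into a weighted average of all pairwise 2x2 DiD comparisons. The weights on those comparisons are positive, but the comparisons that use already-treated units as controls subtract those units' still-evolving treatment effects. Written in terms of the underlying cohort-period effects, some effects therefore enter the TWFE coefficient with negative weight, meaning some cohorts actively drag the estimate in the wrong direction.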
In the worst case, TWFE can produce a negative coefficient for a treatment that is uniformly positive across all cohorts. The sign of your estimate can be wrong.
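You can see the negative weights directly by residualizing the treatment dummy on unit and week fixed effects. The sketch below uses a hypothetical variant of the coffee chain example with only 10 never-treated stores, a 12-week panel, and an invented effect path; none of these numbers come from real data. It relies on the fact that, in a balanced panel, the TWFE coefficient equals a weighted sum of the treated cells' effects, with weights proportional to the residualized treatment.

```python
import numpy as np

T = 12                                   # weeks 1..12 (assumed panel length)
weeks = np.arange(1, T + 1)
# Hypothetical design: 30 stores treated in week 4, 30 in week 8, 10 never treated.
start_week = np.array([4] * 30 + [8] * 30 + [np.inf] * 10)

# D[i, t] = 1 if store i is treated in week t.
D = (weeks[None, :] >= start_week[:, None]).astype(float)

# Residualize D on store and week fixed effects (exact for a balanced panel).
D_tilde = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()

# Weight TWFE places on each treated cell's effect; the weights sum to 1.
treated = D == 1
weights = D_tilde[treated] / D_tilde[treated].sum()

# Assumed effect path: 1.0 at adoption, +0.5 per additional week treated.
start_filled = np.where(np.isinf(start_week), 0.0, start_week)
exposure = (weeks[None, :] - start_filled[:, None]) * D     # 0 for untreated cells
tau = 1.0 + 0.5 * exposure

twfe_beta = float((weights * tau[treated]).sum())   # what TWFE would report
true_att = float(tau[treated].mean())               # simple average of true effects

# Cross-check against the actual TWFE regression on noise-free outcomes.
rng = np.random.default_rng(0)
Y = rng.normal(10, 2, size=(len(start_week), 1)) + 0.2 * weeks[None, :] + tau * D
beta_ols = float((D_tilde * Y).sum() / (D_tilde ** 2).sum())

print(f"negative-weight cells: {(weights < 0).sum()}")
print(f"TWFE: {twfe_beta:.3f}  OLS check: {beta_ols:.3f}  true ATT: {true_att:.3f}")
```

In this design the early cohort's late weeks receive negative weight, and those are exactly the cells where its effect is largest. The sign does not flip here, but every true effect is at least 1.0 and the TWFE number still lands far below their average.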
3. Diagnosing the Problem: Event Study
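Before trusting any TWFE result under staggered adoption, run an event study. This decomposes the treatment effect into period-by-period estimates relative to the treatment date:

$$
Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \delta_k \, \mathbb{1}[t - G_i = k] + \varepsilon_{it}
$$

Where:
- Y_it = the outcome for unit i in period t
- α_i, λ_t = unit and period fixed effects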
- G_i = the period when unit i was first treated (its cohort)
- k = periods relative to treatment (k = -3, -2, -1, 0, +1, +2, ...)
- k = -1 is omitted as the reference period
Plot the δ_k coefficients over time. What you're looking for:
| Pattern | Interpretation |
|---|---|
| Pre-period estimates flat around zero | Consistent with parallel trends (supporting evidence, not proof, that the control group is valid) |
| Pre-period estimates drifting upward | Pre-existing trend — parallel trends violated |
| Post-period estimates growing over time | Dynamic treatment effects — standard TWFE collapses these into one number |
| Pre-period estimates negative | Possible sign that already-treated units are contaminating the control group |
A clean event study looks like a flat line before treatment and a step up (or gradual rise) after. If the pre-period shows drift, your TWFE estimate is unreliable regardless of how the post-period looks.
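Here is a rough sketch of that event-study regression using only NumPy. The panel is simulated and noise-free, the 12-week length and effect path are assumptions, and never-treated stores are coded with a start week of 0 purely as a convention for this example. Fixed effects are swept out by double demeaning, which is exact only for a balanced panel.

```python
import numpy as np

T = 12
weeks = np.arange(1, T + 1)
# 30 stores treated in week 4, 30 in week 8, 40 never treated (coded as 0).
start_week = np.array([4] * 30 + [8] * 30 + [0] * 40)
ever_treated = start_week > 0

# Relative time k = t - G_i; only meaningful for ever-treated stores.
rel = weeks[None, :] - start_week[:, None]

# Simulated outcome: store effect + week effect + dynamic treatment effect.
rng = np.random.default_rng(0)
alpha = rng.normal(10, 2, size=len(start_week))[:, None]
lam = 0.2 * weeks[None, :]
effect = np.where(ever_treated[:, None] & (rel >= 0), 1.0 + 0.5 * rel, 0.0)
Y = alpha + lam + effect

def demean(M):
    """Sweep out unit and week fixed effects (exact for a balanced panel)."""
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

# One indicator per relative period, omitting k = -1 as the reference.
k_min, k_max = rel[ever_treated].min(), rel[ever_treated].max()
ks = [k for k in range(k_min, k_max + 1) if k != -1]
X = np.column_stack([
    demean((ever_treated[:, None] & (rel == k)).astype(float)).ravel() for k in ks
])

coef, *_ = np.linalg.lstsq(X, demean(Y).ravel(), rcond=None)
delta = dict(zip(ks, coef))

for k in ks:
    print(f"k={k:+d}  delta={delta[k]:.3f}")
```

Because the simulated effect depends only on time since treatment and is the same for both cohorts, the pre-period coefficients come out at zero and the post-period ones trace the assumed path. With real data you would plot `delta` with confidence intervals, and with cohort-specific effects this plain event study can itself be contaminated.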
4. The Fix: Cohort-Specific DiD
The core principle behind modern staggered adoption estimators is simple:
Only compare a cohort to units that haven't been treated yet.
By restricting comparisons to "clean controls" — units that are either never treated or not yet treated — you eliminate the contamination problem. You estimate a separate treatment effect for each cohort-period combination, then aggregate.
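The clean-control rule is easy to state in code. This is a small sketch with made-up store IDs, where `inf` marks a never-treated store; the function name and the `include_not_yet_treated` flag are hypothetical, not taken from any package.

```python
from math import inf

# Hypothetical mapping: store -> first treated week (inf = never treated).
first_treated = {"s1": 4, "s2": 4, "s3": 8, "s4": 8, "s5": inf, "s6": inf}

def clean_controls(first_treated, week, include_not_yet_treated=True):
    """Return the units that are still untreated at `week`."""
    controls = []
    for unit, g in first_treated.items():
        never_treated = g == inf
        not_yet_treated = (not never_treated) and g > week
        if never_treated or (include_not_yet_treated and not_yet_treated):
            controls.append(unit)
    return controls

print(clean_controls(first_treated, week=5))   # ['s3', 's4', 's5', 's6']
print(clean_controls(first_treated, week=9))   # ['s5', 's6']
print(clean_controls(first_treated, week=5, include_not_yet_treated=False))  # ['s5', 's6']
```

Note how the control pool shrinks over time: once the Week 8 stores are treated, they drop out, and only the never-treated stores remain.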
Two estimators implement this principle in different ways.
5. Callaway–Sant'Anna (2021)
The idea
Callaway and Sant'Anna propose estimating cohort-specific Average Treatment Effects on the Treated (ATT(g,t)) — the average effect for units first treated in period g, measured at calendar time t.
For each cohort g, they compare:
- Treated units (first treated in g), post-treatment
- Clean controls: either never-treated units or not-yet-treated units at time t
This produces a matrix of ATT(g,t) estimates — one for each cohort × time combination. These can then be aggregated in several ways:
- Overall ATT → weighted average across all cohort-period cells
- Dynamic ATT → average effect by "time since treatment" (event-study style)
- Calendar-time ATT → average effect by calendar period
- Group-specific ATT → separate estimate per cohort
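The core of the ATT(g,t) calculation can be sketched by hand. What follows is a simplified, unconditional version on simulated noise-free data: it uses week g-1 as the base period, has no covariates and no inference, and weights cells by cohort size when aggregating. It is an illustration of the idea rather than the `did` or `csdid` implementation, and the effect sizes are assumptions.

```python
import numpy as np

T = 12
weeks = np.arange(1, T + 1)
start_week = np.array([4] * 30 + [8] * 30 + [np.inf] * 40)   # inf = never treated
cohorts = [4, 8]

# Simulated outcomes with heterogeneous effects (assumed): cohort 4 gains twice as much.
rng = np.random.default_rng(1)
alpha = rng.normal(10, 2, size=len(start_week))[:, None]
lam = 0.2 * weeks[None, :]
rel = weeks[None, :] - np.where(np.isinf(start_week), 0.0, start_week)[:, None]
treated_now = weeks[None, :] >= start_week[:, None]
size = np.where(start_week == 4, 2.0, 1.0)[:, None]
Y = alpha + lam + np.where(treated_now, size * (1.0 + 0.5 * rel), 0.0)

def att_gt(g, t, not_yet_treated=True):
    """Cohort g's change from week g-1 to week t, minus the clean controls' change."""
    cohort = start_week == g
    controls = np.isinf(start_week)                   # never-treated stores
    if not_yet_treated:
        controls = controls | (start_week > t)        # plus stores untreated at week t
    change = Y[:, t - 1] - Y[:, g - 2]                # week numbers -> column indices
    return float(change[cohort].mean() - change[controls].mean())

# Matrix of post-treatment ATT(g,t) cells.
att = {(g, t): att_gt(g, t) for g in cohorts for t in range(g, T + 1)}

# Aggregations, weighting each cell by its cohort's size.
n = {g: int((start_week == g).sum()) for g in cohorts}

def weighted(cells):
    w = np.array([n[g] for g, _ in cells], dtype=float)
    return float(np.average([att[c] for c in cells], weights=w))

overall = weighted(list(att))
dynamic = {e: weighted([c for c in att if c[1] - c[0] == e])
           for e in range(0, T - min(cohorts) + 1)}

print(f"ATT(4,4)={att[(4, 4)]:.2f}  ATT(8,8)={att[(8, 8)]:.2f}  overall={overall:.2f}")
```

Because the data are noise-free and parallel trends holds by construction, each cell recovers the assumed effect for that cohort and week. On real data the cells are noisy, which is why per-cohort sample size matters.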
When to use CS-DiD
- You want interpretable cohort-level estimates — to understand whether early and late adopters respond differently
- You have enough sample size per cohort to support cell-level estimation
- You're willing to accept a more complex output (a matrix of ATTs rather than a single coefficient)
- You want maximum flexibility in how you aggregate
Practical note
CS-DiD requires choosing a comparison group: never-treated units only, or never-treated + not-yet-treated. The not-yet-treated option uses more data and improves precision, but assumes those units' trends are valid controls — which may not hold if treatment timing is correlated with outcomes.
6. Sun–Abraham (2021)
The idea
Sun and Abraham take a different approach. Rather than stepping outside the OLS framework, they show that each relative-time coefficient in a TWFE event study is a weighted sum of cohort-specific ATTs, including effects from other relative periods, and that those weights can be negative.
Their fix: interact each treatment indicator with cohort dummies to estimate cohort-specific effects directly within OLS, then aggregate with the correct (non-negative) weights.
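The regression looks like:

$$
Y_{it} = \alpha_i + \lambda_t + \sum_{g} \sum_{k \neq -1} \delta_{g,k} \, \mathbb{1}[G_i = g] \cdot \mathbb{1}[t - G_i = k] + \varepsilon_{it}
$$

Here g runs over the treated cohorts and δ_{g,k} is the effect for cohort g at k periods since treatment. The aggregate effect at relative period k is the average of the δ_{g,k}, weighted by each cohort's share of the units observed at k.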
This is more complex than standard TWFE, but it stays within the familiar OLS world — same software, same inference framework, just more interaction terms.
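A bare-bones version of the interacted regression might look like the sketch below. It runs on simulated noise-free data with assumed effect sizes, codes never-treated stores with a start week of 0 as a convention, uses them as the reference group, and sweeps out fixed effects by double demeaning (exact only for a balanced panel). It skips standard errors entirely, so treat it as a picture of the mechanics rather than a replacement for an existing implementation.

```python
import numpy as np

T = 12
weeks = np.arange(1, T + 1)
start_week = np.array([4] * 30 + [8] * 30 + [0] * 40)   # 0 = never treated (assumed coding)
cohorts = [4, 8]
rel = weeks[None, :] - start_week[:, None]

# Noise-free outcome with cohort-specific effect sizes (assumed for illustration).
rng = np.random.default_rng(2)
alpha = rng.normal(10, 2, size=len(start_week))[:, None]
size = np.where(start_week == 4, 2.0, 1.0)[:, None]
post = (start_week[:, None] > 0) & (rel >= 0)
Y = alpha + 0.2 * weeks[None, :] + np.where(post, size * (1.0 + 0.5 * rel), 0.0)

def demean(M):
    """Sweep out unit and week fixed effects (exact for a balanced panel)."""
    return M - M.mean(axis=1, keepdims=True) - M.mean(axis=0, keepdims=True) + M.mean()

# One dummy per (cohort, relative period), omitting k = -1 within each cohort.
cells = [(g, k) for g in cohorts for k in range(1 - g, T - g + 1) if k != -1]
X = np.column_stack([
    demean(((start_week[:, None] == g) & (rel == k)).astype(float)).ravel()
    for g, k in cells
])
coef, *_ = np.linalg.lstsq(X, demean(Y).ravel(), rcond=None)
delta = dict(zip(cells, coef))

# Aggregate by relative period, weighting by cohort size among cohorts observed at k.
n = {g: int((start_week == g).sum()) for g in cohorts}

def aggregate(k):
    gs = [g for g in cohorts if (g, k) in delta]
    return sum(n[g] * delta[(g, k)] for g in gs) / sum(n[g] for g in gs)

print(f"cohort 4, k=0: {delta[(4, 0)]:.2f}   cohort 8, k=0: {delta[(8, 0)]:.2f}")
print(f"aggregate k=0: {aggregate(0):.2f}")
```

The cohort-level coefficients recover each cohort's own effect path, and the aggregate at k = 0 is their size-weighted average. At later relative periods only the early cohort is observed, so the aggregate there reflects that cohort alone.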
When to use Sun–Abraham
- You want a single aggregate ATT rather than a full cohort matrix
- Staggered timing is mild (few cohorts, similar treatment timing)
- You want minimal implementation overhead — it runs in any regression package
- You're presenting to an audience comfortable with OLS but not with semiparametric estimators
7. CS-DiD vs. Sun–Abraham — Side by Side
| | Callaway–Sant'Anna | Sun–Abraham |
|---|---|---|
| Framework | Semiparametric (outside OLS) | OLS with interaction terms |
| Output | Matrix of ATT(g,t), then aggregated | Single aggregate ATT (or dynamic) |
| Clean control requirement | Explicit: never-treated or not-yet-treated | Implicit: never-treated (or last-treated) units serve as the reference group |
| Handles heterogeneous effects | Yes, by design | Yes, via cohort interactions |
| Implementation complexity | Higher | Moderate |
| Best for | Cohort-level analysis, large panels | Single-number estimate, familiar workflow |
| Common packages | csdid (Stata), did (R), csdid (Python) | eventstudyinteract (Stata), sunab() in fixest (R) |
8. Standard TWFE vs. Robust Estimators — When Does It Matter?
Not every staggered adoption setting requires the full CS-DiD or Sun–Abraham treatment. Standard TWFE remains valid when:
- Treatment effects are homogeneous across cohorts and constant over time since treatment: if every cohort responds the same way and the effect doesn't grow or fade, the negative weight problem doesn't materialize
- Treatment timing is nearly simultaneous — if cohorts are treated within a short window, the contamination is minimal
- You've verified flat pre-trends — a clean event study doesn't guarantee TWFE is right, but a contaminated one is a clear signal to switch
Use the robust estimators when:
- Pre-trends are non-flat or noisy
- You have many cohorts spread across a long time window
- You have strong reason to believe treatment effects differ by cohort (early adopters vs. late adopters often behave differently)
- Your results will be used to make consequential decisions
9. A Practical Workflow
Step 1: Identify cohorts
→ Which units were treated in which period?
→ How many never-treated units are available as clean controls?
Step 2: Run a standard event study
→ Plot pre- and post-treatment period effects
→ Flag if pre-periods show drift or post-period effects change over time
Step 3: Decide on estimator
→ Simultaneous treatment or homogeneous effects → TWFE acceptable
→ Staggered with heterogeneous effects → CS-DiD or Sun–Abraham
Step 4: Estimate cohort-specific ATTs
→ CS-DiD: compute ATT(g,t) matrix, aggregate
→ Sun–Abraham: interact cohort × relative-time dummies, aggregate
Step 5: Report
→ Dynamic ATT plot (effect by time since treatment)
→ Overall ATT with confidence intervals
→ Sensitivity check: does the result hold under alternative control group choices?
Takeaway
TWFE is not wrong — it's misapplied when treatment timing varies and effects are heterogeneous. The contaminated control problem and negative weights are not edge cases; they show up regularly in product experiments, policy evaluations, and any observational setting with rolling rollouts.
The modern fix isn't complicated in principle: only compare treated units to clean controls, estimate cohort-specific effects, then aggregate. Callaway–Sant'Anna and Sun–Abraham both implement this — with different tradeoffs in flexibility and implementation complexity.
One sentence summary:
When treatment rolls out in waves, standard TWFE silently contaminates your control group — CS-DiD and Sun–Abraham fix this by restricting comparisons to units that haven't been treated yet.
Dataeons