Run Your First Experiment

From question to published result in 90 days.

A practical guide for city officials, program managers, and analysts who want to move from "we should measure this" to "here is what we found." No academic background required.


Before you start: You need two things that no guide can provide — a question someone with budget authority cares about, and access to the data that will measure the answer. Without both, the experiment can't produce a decision. Secure them first.

Pick a tractable question.

The most common failure mode is starting with 'we want to improve outcomes' and trying to measure everything. A good experiment question has three properties: it is specific (which population, which behavior, which time window), it is uncertain (you genuinely don't know the answer), and it is answerable with the data you can collect. 'Does a follow-up phone call within 48 hours of a benefits application increase enrollment completion?' is a good question. 'Does our outreach improve program outcomes?' is not.

Checklist

Can you state the question in one sentence?
Is the answer genuinely uncertain to you?
Can you measure the outcome within 90 days?
Does someone with budget authority care about the answer?

Practitioner note

If you can't pass all four checks, refine the question before proceeding. The question is more important than the intervention.

Choose your intervention and comparison.

Every experiment compares something to something else. The comparison group — the control — is what tells you what would have happened anyway. The most common mistake is designing an intervention without specifying the comparison. 'We'll send reminder texts' is a description. 'We'll randomly assign 50% of eligible residents to receive weekly reminder texts; the other 50% receives the standard mailed notice' is an experiment design. Define both arms before you start.

Checklist

Is the intervention clearly specified (what, to whom, when, by whom)?
Is the comparison clearly specified?
Is random assignment feasible given your population and timeline?
Have you checked for ethical concerns with your supervisor or legal team?

Practitioner note

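If withholding the intervention from a control group isn't feasible, a waitlist design is often acceptable. Everyone eventually receives the intervention, but the order of rollout is randomized, so the units scheduled later serve as a valid comparison group until their turn comes.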
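A waitlist rollout can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name rollout_waves, the branch names, the number of waves, and the seed are all hypothetical.

```python
import random


def rollout_waves(unit_ids, n_waves, seed):
    """Randomly order units, then deal them into rollout waves 1..n_waves.

    Units in later waves act as the comparison group until their turn comes.
    """
    rng = random.Random(seed)   # record the seed so the draw can be repeated
    order = sorted(unit_ids)    # fixed starting order, independent of input order
    rng.shuffle(order)
    return {unit: position % n_waves + 1 for position, unit in enumerate(order)}


# Hypothetical example: 12 branches rolled out in 3 waves of 4.
branches = [f"branch-{i:02d}" for i in range(1, 13)]
waves = rollout_waves(branches, n_waves=3, seed=2024)
for wave in (1, 2, 3):
    print(wave, sorted(b for b, w in waves.items() if w == wave))
```

Outcomes measured before wave 2 starts compare wave 1 (treated) against waves 2 and 3 (not yet treated).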

Pre-specify your primary outcome.

Before the experiment begins, write down the one number that will determine whether the intervention worked. Not five numbers — one. This prevents the most common form of analysis error: running many outcome measures and reporting whichever one looks best. The primary outcome should be measurable (not a score you invent), meaningful (a real behavior or decision you care about), and available (you can retrieve it from administrative data or a follow-up survey within your timeline).

Checklist

Is the primary outcome stated in writing before the experiment launches?
Is it derived from administrative data you already collect, or a simple survey?
Do you know the baseline rate? (Required for sample size calculation.)
Have any secondary outcomes been pre-specified separately?

Practitioner note

Pre-registration — filing your outcome and analysis plan with a registry before data collection — protects you politically. If the result is null, pre-registration proves you ran a real experiment rather than searching for a positive result.

Calculate how many people you need.

Sample size determines whether your experiment can detect a meaningful effect. The key inputs are your baseline rate (what share of people already do the thing you want to increase?), your minimum detectable effect (what improvement would be worth knowing about?), your desired statistical power (usually 80%), and your significance level (conventionally 5%). A free online calculator handles the math. The typical civic experiment needs a few hundred to a few thousand participants to detect a 5–15 percentage point improvement. If your eligible population is smaller, design accordingly: aim to detect only larger effects, or extend the enrollment window so that more people become eligible.

Checklist

Do you know your baseline rate from administrative data?
Have you defined the smallest effect worth detecting?
Is your eligible population at least 3–5× larger than the required sample?
Have you accounted for attrition? (Add 15–20% to your target N.)
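The math behind a sample size calculator can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough illustration under stated assumptions: a two-sided test at the 5% level, equal-sized arms, and a binary outcome. The function names and the 40% baseline and 10-point effect are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Participants needed in EACH arm to detect a change from `baseline`
    to `baseline + mde` with a two-sided test of two proportions."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    spread = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * spread / mde ** 2)


def inflate_for_attrition(n, attrition=0.15):
    """Add the attrition allowance from the checklist to the target N."""
    return math.ceil(n * (1 + attrition))


# Hypothetical inputs: 40% complete enrollment today, and a 10 percentage
# point improvement is the smallest effect worth detecting.
n_arm = sample_size_per_arm(baseline=0.40, mde=0.10)
n_recruit = inflate_for_attrition(n_arm, attrition=0.15)
print(f"Analyzable per arm: {n_arm}; recruit per arm: {n_recruit}")
```

With these inputs the sketch gives 385 per arm before attrition and 443 after a 15% allowance. Online calculators use slightly different formulas and may differ by a few participants.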

Practitioner note

If your population is too small to power a formal test, run the experiment anyway and report effect sizes with confidence intervals. An underpowered experiment that finds a large effect is useful. An experiment you don't run teaches you nothing.

Assign participants to arms — and document the process.

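Random assignment is the core of an experiment. The method matters: generate a random number for each person on your list of eligible participants, sort the list by that number, and assign arms by position, recording the seed or the generated values so the draw can be repeated. A coin flip for each person works, but a spreadsheet or script is better because it leaves an auditable record. Note that RAND() recalculates every time the sheet changes, so paste the generated numbers as static values before sorting; a script with a recorded seed is reproducible without that step. Block randomization (assigning equal numbers in each arm within subgroups) improves balance on observable characteristics. After randomization, verify balance: are the arms similar on age, location, prior usage, and other characteristics you can measure?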
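One way to make the assignment reproducible is a short script with a recorded seed. The sketch below illustrates block randomization under simple assumptions: two arms, a 50/50 split, and one blocking variable. The function name block_randomize, the "district" field, and the seed value are hypothetical.

```python
import random
from collections import defaultdict


def block_randomize(participants, block_key, seed):
    """Assign participants to 'treatment' or 'control', balanced within blocks.

    participants: list of dicts, each with a unique 'id' field
    block_key:    name of the field to block on (for example, a district)
    seed:         integer recorded in the study log so the run can be repeated
    """
    rng = random.Random(seed)  # private generator; same seed gives same result
    blocks = defaultdict(list)
    # Sort by id first so the result does not depend on the input file's order.
    for person in sorted(participants, key=lambda p: p["id"]):
        blocks[person[block_key]].append(person)

    assignment = {}
    for key in sorted(blocks):
        members = blocks[key]
        rng.shuffle(members)
        half = len(members) // 2
        # In odd-sized blocks, let chance decide which arm gets the extra person.
        if len(members) % 2 == 1 and rng.random() < 0.5:
            half += 1
        for i, person in enumerate(members):
            assignment[person["id"]] = "treatment" if i < half else "control"
    return assignment


# Hypothetical eligible list: 200 residents across four districts.
eligible = [{"id": i, "district": "NSEW"[i % 4]} for i in range(200)]
arms = block_randomize(eligible, block_key="district", seed=20240501)
print(sum(arm == "treatment" for arm in arms.values()), "assigned to treatment")
```

Store the seed, the script, and the software version alongside the randomization list so an auditor can rerun the draw.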

Checklist

Is the randomization process documented and reproducible?
Has the randomization been done before any participants are contacted?
Have you verified balance between arms on key observable characteristics?
Is the randomization list stored securely and separately from outcome data?

Practitioner note

Never adjust assignments after the fact. If you randomize and then move people between groups to correct observed imbalances, you've introduced selection bias. Run balance checks; don't modify assignments based on them. If balance on a characteristic matters, block on it before you randomize.
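A balance check can be as simple as a standardized difference in means for each characteristic. The sketch below is one common way to do it, not the only one. The function name and the ages are hypothetical, and the rule of thumb that absolute values above roughly 0.1 deserve a closer look is a convention from the evaluation literature rather than something this guide prescribes.

```python
from statistics import mean, variance


def standardized_difference(treatment_values, control_values):
    """Difference in arm means, in units of the pooled standard deviation."""
    gap = mean(treatment_values) - mean(control_values)
    pooled_sd = ((variance(treatment_values) + variance(control_values)) / 2) ** 0.5
    if pooled_sd == 0:
        # No variation inside either arm: balanced only if the means match.
        return 0.0 if gap == 0 else float("inf")
    return gap / pooled_sd


# Hypothetical ages in each arm after randomization.
ages_treatment = [34, 45, 29, 61, 50, 38, 42, 57]
ages_control = [36, 44, 31, 58, 52, 35, 47, 55]
smd = standardized_difference(ages_treatment, ages_control)
print(f"Age: standardized difference = {smd:+.2f}")
```

Report the result for every characteristic you checked, including any that look imbalanced, and leave the assignments untouched.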

Deliver the intervention. Protect the comparison.

The treatment and control arms must stay separate throughout the intervention period. The most common implementation failure is 'contamination' — members of the control group receive the treatment anyway, either because staff didn't know who was in which arm, or because participants communicated with each other. Brief your delivery team, mark participant records clearly, and build monitoring into the implementation calendar. Check weekly that treatment rates are on target and that no controls are receiving the treatment.

Checklist

Have all staff delivering the intervention been briefed on the study arms?
Is participant assignment visible in the system staff use to deliver services?
Have you built in weekly compliance checks?
Is there a clear escalation path if the intervention isn't reaching the treatment group?
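A weekly compliance check can be a small script run against the delivery log. The sketch below assumes the log can be reduced to pairs of (assigned arm, whether the treatment was received); the function name and the example week are hypothetical.

```python
def weekly_compliance(records):
    """Summarize delivery from (assigned_arm, received_treatment) pairs."""
    received = {"treatment": 0, "control": 0}
    total = {"treatment": 0, "control": 0}
    for arm, got_treatment in records:
        total[arm] += 1
        received[arm] += bool(got_treatment)
    n_treatment = total["treatment"]
    return {
        # Share of the treatment arm actually reached so far.
        "treatment_delivery_rate": received["treatment"] / n_treatment if n_treatment else None,
        # Controls who received the treatment anyway (contamination).
        "contaminated_controls": received["control"],
    }


# Hypothetical week: 45 of 50 treatment residents were reached, and
# 2 of 50 control residents received the treatment by mistake.
week = ([("treatment", True)] * 45 + [("treatment", False)] * 5
        + [("control", True)] * 2 + [("control", False)] * 48)
summary = weekly_compliance(week)
print(summary)
```

Any nonzero count of contaminated controls is worth an entry in the deviation log, even if it is too small to change the analysis.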

Practitioner note

Keep a log of every deviation from the protocol — missed contacts, technical failures, emergency policy changes that affect delivery. This becomes essential context when interpreting results.

Analyze the pre-specified outcome. Report all results.

Once the outcome window closes, retrieve your data and run the pre-specified analysis. For binary outcomes (enrolled/not enrolled), a simple comparison of proportions with a chi-squared test works. For continuous outcomes, a t-test or regression. If you pre-specified covariates for adjustment, include them. Then report everything — the primary outcome, secondary outcomes, and any subgroup analyses — with effect sizes and confidence intervals, not just p-values. A 95% confidence interval tells the reader more than a binary significant/not-significant verdict.
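For a binary outcome, the comparison of proportions can be sketched as follows. The example reports the effect size, a confidence interval, and a p-value together. It uses the pooled z-test, whose squared statistic equals the chi-squared statistic for a 2x2 table without continuity correction, and a simple Wald interval. The function name and the enrollment counts are hypothetical.

```python
from statistics import NormalDist


def compare_proportions(success_t, n_t, success_c, n_c, confidence=0.95):
    """Treatment-minus-control difference in proportions, with a Wald
    confidence interval and a two-sided p-value from the pooled z-test."""
    p_t, p_c = success_t / n_t, success_c / n_c
    diff = p_t - p_c

    # Unpooled standard error for the confidence interval.
    se = (p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    # Pooled standard error for the test of "no difference".
    pooled = (success_t + success_c) / (n_t + n_c)
    se_pooled = (pooled * (1 - pooled) * (1 / n_t + 1 / n_c)) ** 0.5
    if se_pooled == 0:
        z, p_value = 0.0, 1.0  # every outcome is identical, nothing to test
    else:
        z = diff / se_pooled
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"difference": diff, "ci": ci, "chi_squared": z ** 2, "p_value": p_value}


# Hypothetical counts: 230 of 400 enrolled in treatment, 200 of 400 in control.
result = compare_proportions(230, 400, 200, 400)
low, high = result["ci"]
print(f"Effect: {result['difference']:+.3f} (95% CI {low:+.3f} to {high:+.3f}), "
      f"p = {result['p_value']:.3f}")
```

With small samples or rates near 0% or 100%, these normal approximations become unreliable and an exact method is a safer choice.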

Checklist

Have you analyzed the primary outcome exactly as pre-specified?
Are effect sizes and confidence intervals reported (not just p-values)?
Are null and negative results reported with the same detail as positive ones?
Have you assessed compliance: what share of the treatment group received the treatment?

Practitioner note

Intent-to-treat analysis — analyzing everyone in the arm they were assigned to, regardless of whether they received the treatment — is the correct default. It gives you a conservative, real-world estimate of the policy effect.
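The difference between analyzing by assignment and analyzing by delivery can be seen in a small made-up dataset. In the sketch below, the field names, function names, and counts are all hypothetical; the as-received contrast is shown only to illustrate what intent-to-treat avoids.

```python
def rate(rows):
    """Share of rows with a positive outcome."""
    return sum(r["enrolled"] for r in rows) / len(rows)


def intent_to_treat(records):
    """Outcome rate by ASSIGNED arm, regardless of what was delivered."""
    treatment = [r for r in records if r["assigned"] == "treatment"]
    control = [r for r in records if r["assigned"] == "control"]
    return rate(treatment) - rate(control)


def as_received(records):
    """Outcome rate by what was actually delivered (for contrast only)."""
    got_it = [r for r in records if r["received"]]
    did_not = [r for r in records if not r["received"]]
    return rate(got_it) - rate(did_not)


def make_rows(assigned, received, enrolled, count):
    return [{"assigned": assigned, "received": received, "enrolled": enrolled}
            for _ in range(count)]


# Hypothetical data: 100 people per arm; only 60 in the treatment arm were reached.
records = (
    make_rows("treatment", True, True, 36) + make_rows("treatment", True, False, 24)
    + make_rows("treatment", False, True, 12) + make_rows("treatment", False, False, 28)
    + make_rows("control", False, True, 40) + make_rows("control", False, False, 60)
)
print(f"Intent-to-treat effect: {intent_to_treat(records):+.3f}")
print(f"As-received contrast:   {as_received(records):+.3f}")
```

In this invented data the as-received contrast is much larger than the intent-to-treat effect, because the people staff managed to reach are not a random subset of the treatment arm. Only the comparison by assigned arm keeps the groups comparable as randomized.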

Write the report. Make it replicable.

The report should be written for two audiences: the decision-maker who needs to know whether to scale, modify, or stop the program; and the practitioner in another city who wants to try the same intervention. The second audience is often forgotten. Include everything they would need: exact wording of any communications, timing and channel of delivery, the randomization method, data sources, attrition rates, and implementation challenges. A replication-ready report is the difference between local knowledge and cumulative evidence.

Checklist

Does the report state the research question, design, and primary outcome?
Are exact intervention materials included or linked?
Are implementation challenges and deviations from protocol documented?
Does the report include a recommendation: scale, modify, or stop?

Practitioner note

A 3-page summary beats a 40-page report for decision-makers. Write both. Post both publicly.

Publish the result — including if it's null.

A civic experiment that isn't published didn't happen as far as the broader field is concerned. Post the report on your agency's website. Submit it to the Public Registry. If you have an academic partner, write a short working paper. Null results are especially important to share: if your intervention didn't work, the field needs to know before others repeat the same mistake. Negative results that are well-documented are valuable. Negative results that disappear into filing cabinets are not.

Checklist

Is the full report publicly available (not gated behind a login)?
Have you submitted the experiment to the Public Registry?
Have you shared the result with any peer cities running similar programs?
Have you briefed the decision-maker on the interpretation?

Practitioner note

If the result is positive and surprising, be skeptical. Replicate before scaling. If the result is null and you ran a well-powered experiment, believe it. Null results from well-designed experiments are not failures — they are expensive answers to important questions.

Decide: scale, modify, or stop.

Every experiment ends with a decision. Scale means rolling out the intervention broadly based on the evidence. Modify means changing the design based on what you learned and running again. Stop means the evidence is clear enough to discontinue or redirect resources. The decision belongs to the institution, not the researcher. The experiment's job is to produce the information needed to make that decision well. Document the decision — and the reasoning behind it — in the public record.

Checklist

Has a clear decision been made by someone with budget authority?
Is the decision documented alongside the study results?
If scaling: have you pre-specified how you'll monitor fidelity at scale?
If modifying: have you pre-specified the new primary outcome before the next experiment?

Practitioner note

The most common failure mode at this stage is 'scale regardless of results.' If the experiment showed no effect, the burden of proof for scaling should be very high. Political momentum is not evidence.

Common Mistakes

Five things that make experiments fail.

These are not hypothetical. Each appears in at least a dozen documented civic evaluations.

Measuring everything

Pre-specify one primary outcome. Track others as secondary. This keeps you from reporting a chance finding among many measures as if it were a real effect.

Running without a control group

Before-after comparisons without a control group are almost always misleading. External factors (seasonality, policy changes, economic conditions) change outcomes independently of your intervention.

Sample sizes too small

Most civic pilots are dramatically underpowered. If you can only reach 100 people, you need an effect of 20+ percentage points to have 80% power. Calculate before you start.
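The smallest effect a fixed sample can detect can be estimated by running the sample size formula in reverse. The sketch below is a rough illustration that assumes the 100 people are split into two arms of 50, a two-sided test at the 5% level, and the normal-approximation formula for two proportions. The function name and the baseline rates are hypothetical.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, baseline, alpha=0.05, power=0.80):
    """Smallest increase over `baseline` detectable with the given power,
    found by bisection on the two-proportion sample size formula."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)

    def required_n(effect):
        p2 = baseline + effect
        return z ** 2 * (baseline * (1 - baseline) + p2 * (1 - p2)) / effect ** 2

    low, high = 1e-6, 1 - baseline - 1e-6
    if required_n(high) > n_per_arm:
        return None  # even the largest possible increase is undetectable
    for _ in range(60):
        mid = (low + high) / 2
        if required_n(mid) > n_per_arm:
            low = mid    # effect too small for this sample, search higher
        else:
            high = mid   # detectable, try a smaller effect
    return high


# Hypothetical pilot: 100 people, 50 per arm, at two different baseline rates.
for baseline in (0.10, 0.50):
    mde = minimum_detectable_effect(n_per_arm=50, baseline=baseline)
    print(f"Baseline {baseline:.0%}: detectable effect of about {mde:.0%} points")
```

Under these assumptions, a pilot of 50 per arm with a 50% baseline can detect an effect of about 26 percentage points. The exact figure shifts with the baseline rate.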

Not publishing null results

If the intervention didn't work, publish that. Submit to the registry. The next practitioner needs to know.

Confusing implementation failure with intervention failure

If only 30% of the treatment group received the treatment, a null result tells you about implementation, not the intervention. Measure and report delivery compliance.

Go Deeper

Evidence and tools to support your pilot.

Registry →

68 documented experiments. Find one in your policy area and read the implementation notes before designing yours.

Pilot Templates →

Pre-specified templates for library engagement, permit simplification, and parks outreach — ready to adapt.

What Works →

Ten patterns from the evidence base. Know what mechanisms tend to work before designing your intervention.

Get Support →

The Experiment Society can provide study design review, power calculations, and randomization support at no cost for qualifying pilots.

Ready to Start?

We can help with design, power, and randomization.

If you have a question and a dataset, we can help you turn it into a rigorous pilot in 8 weeks. City partnerships are free for pilots under 2,000 participants.

Start a conversation →