TSTR vs TRTR: how to measure whether synthetic medical data actually works
By Soheil Fallah · Data Scientist & AI Consultant · peer-reviewed researcher in generative AI
Published 24 June 2026 · Updated 24 June 2026 · 2 min read
TSTR means train on synthetic, test on real: you train a model only on generated data, then evaluate it on real, held-out data. TRTR (train on real, test on real) is the baseline you judge that score against. The gap between the two scores tells you how much task-relevant signal the synthetic data actually kept, which is a different thing from how realistic it looks.
Why "looks real" is not enough
A generator can produce images that pass a visual check and still be useless for a task, because it captured texture without the features a classifier depends on. Fidelity scores like the Fréchet Inception Distance (FID) measure how close two image distributions are, not whether a model trained on one can do a job on the other. TSTR ties the synthetic data to a concrete model metric instead of an impression.
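For reference, this is the standard definition of FID, written out to show what it does and does not look at. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception feature embeddings for the real and generated images respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

No labels and no downstream model appear anywhere in the formula, which is why a low FID cannot tell you whether a classifier trained on the generated images will work on real ones.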
Reading the two numbers
A TSTR score close to TRTR means the synthetic data kept the signal a model needs, so it can stand in as a training set or augment a real one. A wide gap means the generator missed task-relevant structure, however good the images look. On its own, a TSTR score has no reference point: 0.75 means little until you know the real-data baseline sitting next to it.
A worked case: in a brain-MRI study I ran, a classifier reached a TSTR AUC of 0.754 against a TRTR AUC of 0.810. That is a gap of 0.056: close enough to be useful, and explicit enough to see and report.
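The arithmetic behind that comparison is small enough to write down. The sketch below uses the two AUC values from the study; the helper name `tstr_gap` and the choice to report both an absolute gap and a ratio are illustrative assumptions, not a standard API.

```python
def tstr_gap(tstr: float, trtr: float) -> dict:
    """Summarise a TSTR score relative to its TRTR baseline (hypothetical helper)."""
    return {
        "absolute_gap": trtr - tstr,    # how far the synthetic-trained model falls short
        "retained_ratio": tstr / trtr,  # share of the real-data score that was kept
    }

# Values reported in the brain-MRI example above.
summary = tstr_gap(tstr=0.754, trtr=0.810)
print(f"gap = {summary['absolute_gap']:.3f}, retained = {summary['retained_ratio']:.1%}")
```

Reporting the absolute gap alongside both raw scores keeps the baseline visible; the ratio alone hides it.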
The protocol, step by step
- Set aside a real test set, then train your generator on the remaining real data.
- Generate a labelled training set from it.
- Train the downstream model on the synthetic set only.
- Evaluate that model on the held-out real test set, which the generator never saw. This is your TSTR score.
- Train the same architecture on the real training data and evaluate on the same real test set. This is TRTR.
- Report both scores together, alongside a privacy measure.
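The steps above can be sketched end to end on toy data. This is a minimal illustration under loud assumptions, not the pipeline from the brain-MRI study: the "real" data is simulated with scikit-learn, the generator is a hypothetical stand-in that fits one Gaussian per class, and logistic regression plays the downstream model. The helper names are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for real data (assumption: tabular features, binary label).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
# The real test set is held out before the generator sees anything.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

def fit_toy_generator(X, y):
    """Hypothetical generator: per-class mean and covariance of the real data."""
    return {c: (X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)}

def sample_toy_generator(params, n_per_class, rng):
    """Draw a labelled synthetic set from the fitted per-class Gaussians."""
    Xs, ys = [], []
    for c, (mean, cov) in params.items():
        Xs.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        ys.append(np.full(n_per_class, c))
    return np.vstack(Xs), np.concatenate(ys)

def auc_on_real_test(X_fit, y_fit):
    """Train the same architecture on the given set, score it on the real test set."""
    model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

params = fit_toy_generator(X_train, y_train)                            # train generator
X_syn, y_syn = sample_toy_generator(params, n_per_class=700, rng=rng)   # generate labelled set
tstr_auc = auc_on_real_test(X_syn, y_syn)                               # TSTR
trtr_auc = auc_on_real_test(X_train, y_train)                           # TRTR

print(f"TSTR AUC = {tstr_auc:.3f}, TRTR AUC = {trtr_auc:.3f}, gap = {trtr_auc - tstr_auc:.3f}")
```

The one design choice that matters here is that both models share the architecture, the hyperparameters, and the test set, so the only thing that differs between the two scores is the training data.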
One caveat worth repeating
TSTR rewards usefulness, and usefulness can be gamed. A generator that memorised its training images and replayed them would score well on TSTR while leaking patient data. Run a privacy check before you call a synthetic dataset safe to share.
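One simple form such a check can take is a nearest-neighbour distance test: measure how close each synthetic sample sits to its closest training record, and compare that with how close real held-out records sit. The sketch below is one heuristic under stated assumptions, not a formal privacy guarantee: it uses random vectors as stand-ins for data, Euclidean distance on raw features (for images you would more likely compare embeddings), and hypothetical helper names.

```python
import numpy as np

def nearest_distance(queries, reference):
    """Euclidean distance from each query row to its closest reference row.

    Builds the full pairwise difference array, so it suits small sets only.
    """
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

def copy_rate(synthetic, train, tol=1e-8):
    """Fraction of synthetic rows that coincide with a training row (within tol)."""
    return float((nearest_distance(synthetic, train) <= tol).mean())

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 8))     # stand-in for the generator's training data
holdout = rng.normal(size=(100, 8))   # real records the generator never saw
fresh = rng.normal(size=(100, 8))     # a generator that samples new points
memorised = train[rng.integers(0, len(train), size=100)]  # a generator that replays training rows

# Held-out real data gives the reference for how close "unseen" records normally sit.
baseline = np.median(nearest_distance(holdout, train))
for name, synthetic in [("fresh", fresh), ("memorised", memorised)]:
    dcr = nearest_distance(synthetic, train)
    print(f"{name}: copy rate {copy_rate(synthetic, train):.2f}, "
          f"median distance {np.median(dcr):.3f} (holdout baseline {baseline:.3f})")
```

A synthetic set whose distances fall well below the held-out baseline is a warning sign that the generator is replaying training records, whatever its TSTR score says.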