TSTR vs TRTR: how to measure if synthetic medical data actually works?

By Soheil Fallah · Data Scientist & AI Consultant · peer-reviewed researcher in generative AI

Published 24 June 2026 · Updated 24 June 2026 · 2 min read

TSTR means train on synthetic, test on real: you train a model only on generated data, then evaluate it on real, held-out data. TRTR (train real, test real) is the reference you judge it against. The distance between the two scores tells you how much task-relevant signal the synthetic data actually kept, which is a different thing from how realistic it looks.

Why "looks real" is not enough

A generator can produce images that pass a visual check and still be useless for a task, because it captured texture without the features a classifier depends on. Fidelity scores like the Fréchet Inception Distance (FID) measure how close two image distributions are, not whether a model trained on one can do a job on the other. TSTR ties the synthetic data to a concrete model metric instead of an impression.

Reading the two numbers

A TSTR score close to TRTR means the synthetic data kept the signal a model needs, so it can stand in as a training set or augment a real one. A wide gap means the generator missed task-relevant structure, however good the images look. On its own, TSTR has no scale. A 0.75 means little until you know the real-data baseline sitting next to it.

A worked case: in a brain-MRI study I ran, a classifier reached TSTR AUC 0.754 against TRTR 0.810. Close enough to be useful, with a gap I could see and report.

The protocol, step by step

  1. Train your generator on real data.
  2. Generate a labelled training set from it.
  3. Train the downstream model on the synthetic set only.
  4. Evaluate that model on a held-out set of real data. This is your TSTR score.
  5. Train the same architecture on real data and evaluate on the same real test set. This is TRTR.
  6. Report both scores together, with a privacy measure.

One caveat worth repeating

TSTR rewards usefulness, and usefulness can be gamed. A generator that memorised its training images and replayed them would score well on TSTR while leaking patient data. Run a privacy check before you call a synthetic dataset safe to share.

Related

Related