Benchmarks#

GenSBI is validated on a suite of inference tasks from the Simulation-Based Inference Benchmark (SBIBM). This page summarises the key results; for full details, see the GenSBI paper.

Benchmark Tasks#

The following five SBIBM tasks cover a range of parameter dimensionalities and posterior geometries:

Task	dim(θ)	dim(x)	Key challenge
Two Moons	2	2	Bimodal crescent-shaped posterior
Gaussian Linear	10	10	Moderate-dimensional Gaussian posterior
Gaussian Mixture	2	2	Multi-scale posterior structure
SLCP	5	8	Complex multimodal geometry (4 modes)
Bernoulli GLM	10	10	Discrete data, correlated prior

All benchmark datasets are available on HuggingFace, and the gensbi-examples package provides utilities and notebooks to reproduce every result shown here.

Evaluation Metric: C2ST#

Posterior quality is measured using the Classifier Two-Sample Test (C2ST) (Lopez-Paz & Oquab, 2017). A binary classifier is trained to distinguish samples drawn from the learned posterior and the reference posterior:

C2ST = 0.50 → the classifier cannot tell them apart → perfect posterior recovery
C2ST → 1.0 → large discrepancy between learned and reference posteriors

A C2ST score ≤ 0.55 is generally considered very good.

Posterior Quality#

Comparison with the Literature#

The figure below compares GenSBI’s best C2ST scores (selecting the best architecture and generative method at each simulation budget) against three baselines: OneFlowSBI, SimFormer, and NPE (from the sbi library).

C2ST comparison with literature baselines

At the largest simulation budget of 10⁵, GenSBI performs on par with or improves on all baselines across every task. Gaussian Linear and Gaussian Mixture achieve C2ST scores of 0.500 and 0.501, respectively — essentially perfect posterior recovery. Two Moons reaches 0.502, closely matching SimFormer (0.505). SLCP and Bernoulli GLM, the two most challenging tasks, also show results in line with the best available methods. GenSBI achieves these results with a nearly uniform training configuration across all tasks.

Detailed Comparison at 30k Simulations#

The figures above show results across multiple simulation budgets. The table below zooms in on a single budget — 3×10⁴ simulations — and reports C2ST accuracy for all GenSBI model variants alongside three literature baselines. Best value per task in bold; second-best in italics.

Abbreviations: FM = flow matching, SM = score matching, EDM = Elucidating the Design Space of Diffusion-Based Generative Models.

Method	Two Moons	Gauss. Linear	Gauss. Mixture	SLCP	Bernoulli GLM
GenSBI FM (Flux1)	0.53	0.51	0.52	0.72	0.70
GenSBI SM (Flux1)	0.52	0.51	0.51	0.69	0.58
GenSBI FM (Flux1Joint)	0.53	0.51	0.52	0.59	0.58
GenSBI SM (Flux1Joint)	0.51	0.50	0.50	0.57	0.58
OneFlowSBI	0.51	0.51	0.51	0.73	0.58
SimFormer	0.51	0.50	0.51	0.57	0.59
NPE	0.57	0.55	0.56	0.84	0.65

GenSBI SM (Flux1Joint) — score matching with the joint architecture — matches or improves on all baselines across all five tasks.

Architecture Comparison: Flux1 vs Flux1Joint#

GenSBI brings two new transformer architectures to the field of SBI: Flux1 (conditional density estimation) and Flux1Joint (joint density estimation). Each can be combined with three generative methods: flow matching (FM), score matching (SM), and EDM Diffusion.

Flux1 (Conditional)

Best C2ST as a function of simulation budget for the Flux1 architecture. Lower is better; 0.5 = perfect match.

Flux1Joint (Joint)

C2ST vs simulation budget for Flux1Joint

Best C2ST as a function of simulation budget for Flux1Joint. Flux1Joint achieves stronger performance than Flux1, especially on SLCP.

Key observations:

All three generative methods converge to similar C2ST scores as the simulation budget increases, indicating that the methods are effectively interchangeable.
On simpler tasks (Two Moons, Gaussian Mixture, Gaussian Linear), all generative methods reach near-optimal scores (≤ 0.52) already at 3×10⁴ simulations.
On harder tasks (SLCP, Bernoulli GLM), flow matching and score matching converge faster than EDM.
Flux1Joint tends to match or improve on Flux1 across all tasks, consistent with an advantage of joint density estimation for tasks with unstructured data and complex posterior geometries.

Computational Cost#

All computational benchmarks below are measured on the Two Moons task (a low-dimensional problem) using a single NVIDIA Tesla V100 GPU, with batch size 256 and 50,000 training steps. Timings will be higher for larger models and higher-dimensional tasks.

Architecture	Method	Training speed (it/s)	Training time (50k steps)	Solver steps	Sampling time (10⁴ samples)
Flux1	Flow Matching	4.58	~3.0 h	100	6.9 s
Flux1	Score Matching	4.50	~3.1 h	1000	24.1 s
Flux1	EDM	4.73	~2.9 h	18	8.3 s
Flux1Joint	Flow Matching	11.56	~1.2 h	100	6.4 s
Flux1Joint	Score Matching	11.63	~1.2 h	1000	42.3 s
Flux1Joint	EDM	11.51	~1.2 h	18	5.3 s

Key takeaways:

Within a fixed architecture, the three generative methods have comparable training throughput — the choice of generative method has no meaningful impact on training cost.
Sampling speed varies by generative method: EDM is fastest (18 solver steps), flow matching is moderate (100 steps), and score matching is slowest (~1000 steps).
Even in the slowest configuration, drawing 10,000 posterior samples takes at most ~42 seconds.
All benchmark models can also be trained on a consumer-grade NVIDIA RTX 4070 GPU (12 GB VRAM) with batch size 256.

Note

While the Flux1Joint model seems to train faster, the benchmarked architechtures have different parameter counts and reflect the default initialization setup. At equal parameter counts, the training times are comparable.

Calibration#

Beyond C2ST, GenSBI posteriors are validated using several calibration diagnostics: the TARP test (Tests of Accuracy with Random Points), SBC (Simulation-Based Calibration) rank histograms, L-C2ST (Local C2ST), and marginal posterior coverage checks. All models trained with 10⁶ simulations produce well-calibrated posteriors — the TARP coverage curves fall on the diagonal within Jeffreys 95% confidence intervals across all tasks and posterior geometries.

The example notebooks include full calibration plots for each benchmark task.

Summary#

	GenSBI
Posterior quality	On par with or improving on existing baselines across all SBIBM tasks
Best configuration	Flux1Joint with score matching
Generative method interchangeability	Flow matching, score matching, and EDM converge to comparable quality
Training cost	1–3 hours on a single V100; feasible on consumer GPUs
Sampling speed	10,000 posterior samples in 5–42 seconds
Calibration	Well-calibrated posteriors verified by C2ST, TARP, SBC, L-C2ST, and marginal posterior coverage diagnostics
Configuration	Nearly uniform hyperparameters across all tasks