gensbi.diagnostics.lc2st

Classes

EnsembleClassifier

Base class for all estimators in scikit-learn.

LC2ST

L-C2ST: Local Classifier Two-Sample Test.

Functions

eval_lc2st(theta_p, x_o, clf[, return_proba])

Evaluates the classifier returned by train_lc2st for one observation x_o and over the samples P.

permute_data(theta_p, theta_q[, seed])

Permutes the concatenated data [P,Q] to create null samples.

plot_lc2st(lc2st, post_samples_star, x_o[, fig, ax, ...])

train_lc2st(theta_p, theta_q, x_p, x_q, clf)

Trains the classifier on the joint data for the L-C2ST.

Module Contents

class gensbi.diagnostics.lc2st.EnsembleClassifier(clf, num_ensemble=1, verbosity=1)

Bases: sklearn.base.BaseEstimator

Base class for all estimators in scikit-learn.

Inheriting from this class provides default implementations of:

  • setting and getting parameters used by GridSearchCV and friends;

  • textual and HTML representation displayed in terminals and IDEs;

  • estimator serialization;

  • parameters validation;

  • data validation;

  • feature names validation.

Read more in the User Guide.

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).

Examples

>>> import numpy as np
>>> from sklearn.base import BaseEstimator
>>> class MyEstimator(BaseEstimator):
...     def __init__(self, *, param=1):
...         self.param = param
...     def fit(self, X, y=None):
...         self.is_fitted_ = True
...         return self
...     def predict(self, X):
...         return np.full(shape=X.shape[0], fill_value=self.param)
>>> estimator = MyEstimator(param=2)
>>> estimator.get_params()
{'param': 2}
>>> X = np.array([[1, 2], [2, 3], [3, 4]])
>>> y = np.array([1, 0, 1])
>>> estimator.fit(X, y).predict(X)
array([2, 2, 2])
>>> estimator.set_params(param=3).fit(X, y).predict(X)
array([3, 3, 3])
fit(X, y)
predict_proba(X)
clf
num_ensemble = 1
trained_clfs = []
verbosity = 1
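
The ensembling idea behind this class (several classifiers whose predictions are averaged) can be illustrated with a minimal sketch. It is not the gensbi implementation: ConstantClassifier and ensemble_predict_proba are hypothetical names introduced here, and the only assumption is that each member exposes a scikit-learn style predict_proba.

```python
import numpy as np


class ConstantClassifier:
    """Hypothetical stand-in for a fitted classifier (not part of gensbi).

    It always predicts the same probability for class 0.
    """

    def __init__(self, p0):
        self.p0 = p0

    def predict_proba(self, X):
        p0 = np.full(X.shape[0], self.p0)
        # Columns follow the scikit-learn convention: [P(class 0), P(class 1)].
        return np.column_stack([p0, 1.0 - p0])


def ensemble_predict_proba(clfs, X):
    """Average the class probabilities predicted by several classifiers."""
    probas = np.stack([clf.predict_proba(X) for clf in clfs], axis=0)
    return probas.mean(axis=0)


X = np.zeros((4, 3))  # 4 samples, 3 features (made-up data)
avg = ensemble_predict_proba([ConstantClassifier(0.4), ConstantClassifier(0.8)], X)
```

Averaging probabilities rather than hard labels keeps the output usable as a predicted probability in the L-C2ST statistic.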
class gensbi.diagnostics.lc2st.LC2ST(thetas, xs, posterior_samples, seed=1, num_folds=1, num_ensemble=1, classifier=MLPClassifier, z_score=False, classifier_kwargs=None, num_trials_null=100, permutation=True)

L-C2ST: Local Classifier Two-Sample Test.

Implementation based on the official code from [1] and the existing C2ST metric [2], using scikit-learn classifiers.

L-C2ST tests the local consistency of a posterior estimator \(q\) with respect to the true posterior \(p\), at a fixed observation \(x_o\), i.e., whether the following null hypothesis holds:

\(H_0(x_o) := q(\theta \mid x_o) = p(\theta \mid x_o)\).

L-C2ST proceeds as follows:

  1. It first trains a classifier to distinguish between samples from two joint distributions \([\theta_p, x_p]\) and \([\theta_q, x_q]\), and evaluates the L-C2ST statistic at a given observation \(x_o\).

  2. The L-C2ST statistic is the mean squared error between the predicted probabilities of being in p (class 0) and a Dirac at 0.5, which corresponds to the chance level of the classifier, unable to distinguish between p and q.

  • If num_ensemble>1, the average prediction over all classifiers is used.

  • If num_folds>1 the average statistic over all cv-folds is used.
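
The statistic in step 2 can be sketched with NumPy as follows. The function name lc2st_score and the probability values are illustrative assumptions, not part of the gensbi API; the sketch only shows the mean squared error to 0.5 and the averaging over cross-validation folds.

```python
import numpy as np


def lc2st_score(probs_class0):
    """MSE between predicted class-0 probabilities and the chance level 0.5."""
    probs_class0 = np.asarray(probs_class0, dtype=float)
    return float(np.mean((probs_class0 - 0.5) ** 2))


# One array of predicted class-0 probabilities per cv fold (made-up values).
fold_probs = [
    np.array([0.5, 0.5, 0.5]),  # classifier at chance level
    np.array([0.6, 0.4, 0.7]),  # classifier slightly away from chance
]

fold_scores = np.array([lc2st_score(p) for p in fold_probs])
statistic = float(fold_scores.mean())  # average over folds
```

A score of zero means the classifier cannot tell p from q at x_o; larger values indicate a larger local discrepancy.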

To evaluate the test, steps 1 and 2 are performed over multiple trials under the null hypothesis (H0). If the null distribution is not known, it is estimated using the permutation method, i.e. by training the classifier on the permuted data. The statistics obtained under (H0) are then compared to the one obtained on observed data to compute the p-value, which is used to decide whether to reject (H0).

Parameters:
  • thetas (Array) – Samples from the prior, of shape (sample_size, dim).

  • xs (Array) – Corresponding simulated data, of shape (sample_size, dim_x).

  • posterior_samples (Array) – Samples from the estimated posterior, of shape (sample_size, dim).

  • seed (int, optional) – Seed for the sklearn classifier and the KFold cross validation. Default is 1.

  • num_folds (int, optional) – Number of folds for the cross-validation. Default is 1 (no cross-validation). This is useful to reduce variance coming from the data.

  • num_ensemble (int, optional) – Number of classifiers for ensembling. Default is 1. This is useful to reduce variance coming from the classifier.

  • classifier (str or Type[BaseEstimator], optional) – Classification architecture to use. Either the string “random_forest” or “mlp” (default is “mlp”), or a classifier class (e.g., RandomForestClassifier, MLPClassifier).

  • z_score (bool, optional) – Whether to z-score (normalize) the data. Default is False.

  • classifier_kwargs (Dict[str, Any], optional) – Custom kwargs for the sklearn classifier. Default is None.

  • num_trials_null (int, optional) – Number of trials to estimate the null distribution. Default is 100.

  • permutation (bool, optional) – Whether to use the permutation method for the null hypothesis. Default is True.
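
As an illustration of the z_score option, the sketch below normalizes both sample sets with the mean and standard deviation of the P samples. That both sets share the P statistics is an assumption suggested by the theta_p_mean and theta_p_std attributes listed below, not a confirmed detail of the implementation; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_p = rng.normal(loc=3.0, scale=2.0, size=(1000, 2))  # made-up samples from P
theta_q = rng.normal(loc=3.5, scale=2.0, size=(1000, 2))  # made-up samples from Q

# Statistics are computed on the P samples only (assumption, see lead-in).
theta_p_mean = theta_p.mean(axis=0)
theta_p_std = theta_p.std(axis=0)

# Both sets are shifted and scaled with the same statistics, so any
# difference between P and Q is preserved after normalization.
theta_p_z = (theta_p - theta_p_mean) / theta_p_std
theta_q_z = (theta_q - theta_p_mean) / theta_p_std
```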

References

[1] https://arxiv.org/abs/2306.03580, JuliaLinhart/lc2st

[2] sbi-dev/sbi

_train(theta_p, theta_q, x_p, x_q, verbosity=0)

Returns the classifiers trained on observed data.

Parameters:
  • theta_p (Array) – Samples from P, of shape (sample_size, dim).

  • theta_q (Array) – Samples from Q, of shape (sample_size, dim).

  • x_p (Array) – Observations corresponding to P, of shape (sample_size, dim_x).

  • x_q (Array) – Observations corresponding to Q, of shape (sample_size, dim_x).

  • verbosity (int, optional) – Verbosity level. Default is 0.

Returns:

List of trained classifiers for each cv fold.

Return type:

List[Any]

get_scores(theta_o, x_o, trained_clfs, return_probs=False)

Computes the L-C2ST scores given the trained classifiers.

Mean squared error (MSE) between 0.5 and the predicted probabilities of being in class 0 over the dataset (theta_o, x_o).

Parameters:
  • theta_o (Array) – Samples from the posterior conditioned on the observation x_o, of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (, dim_x).

  • trained_clfs (List[Any]) – List of trained classifiers, of length num_folds.

  • return_probs (bool, optional) – Whether to return the predicted probabilities of being in P. Default is False.

Returns:

  • scores: L-C2ST scores at x_o, of shape (num_folds,).

  • (probs, scores): Predicted probabilities and L-C2ST scores at x_o, each of shape (num_folds,).

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

get_statistic_on_observed_data(theta_o, x_o)

Computes the L-C2ST statistics for the observed data.

Mean over all cv-scores.

Parameters:
  • theta_o (Array) – Samples from the posterior conditioned on the observation x_o, of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (, dim_x).

Returns:

L-C2ST statistic at x_o.

Return type:

float

get_statistics_under_null_hypothesis(theta_o, x_o, return_probs=False, verbosity=0)

Computes the L-C2ST scores under the null hypothesis.

Parameters:
  • theta_o (Array) – Samples from the posterior conditioned on the observation x_o, of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (, dim_x).

  • return_probs (bool, optional) – Whether to return the predicted probabilities of being in P. Default is False.

  • verbosity (int, optional) – Verbosity level. Default is 0.

Returns:

  • scores: L-C2ST scores under (H0).

  • (probs, scores): Predicted probabilities and L-C2ST scores under (H0).

Return type:

Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

p_value(theta_o, x_o)

Computes the p-value for L-C2ST.

The p-value is the proportion of times the L-C2ST statistic under the null hypothesis is greater than the L-C2ST statistic at the observation x_o. It is computed by taking the empirical mean over statistics computed on several trials under the null hypothesis: \(\frac{1}{H} \sum_{h=1}^{H} \mathbb{I}(T_h > T_o)\), where \(T_h\) is the statistic of the \(h\)-th null trial and \(T_o\) the statistic on observed data.

Parameters:
  • theta_o (Array) – Samples from the posterior conditioned on the observation x_o, of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (, dim_x).

Returns:

p-value for L-C2ST at x_o.

Return type:

float
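
The empirical p-value described for p_value can be sketched as below. The function name empirical_p_value and the null statistics are made up for illustration; the sketch follows the prose definition (fraction of null statistics greater than the observed one) and is not taken from the gensbi source.

```python
import numpy as np


def empirical_p_value(stat_observed, stats_null):
    """Fraction of null-trial statistics strictly greater than the observed one."""
    stats_null = np.asarray(stats_null, dtype=float)
    return float(np.mean(stats_null > stat_observed))


# Made-up statistics from H = 5 trials under the null hypothesis.
stats_null = np.array([0.001, 0.002, 0.003, 0.004, 0.010])

# Two of the five null statistics exceed the observed value 0.0035.
p = empirical_p_value(0.0035, stats_null)
```

With num_trials_null trials the p-value can only take multiples of 1/num_trials_null, so the default of 100 trials resolves p-values down to 0.01.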

reject_test(theta_o, x_o, alpha=0.05)

Computes the test result for L-C2ST at a given significance level.

Parameters:
  • theta_o (Array) – Samples from the posterior conditioned on the observation x_o, of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (, dim_x).

  • alpha (float, optional) – Significance level. Default is 0.05.

Returns:

The L-C2ST result: True if rejected, False otherwise.

Return type:

bool
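
The decision rule can be sketched in a few lines. The helper reject_null is hypothetical, and the strict inequality is an assumption: the text above does not say whether a p-value exactly equal to alpha leads to rejection.

```python
def reject_null(p_value, alpha=0.05):
    """Return True if the null hypothesis is rejected at level alpha.

    Assumes a strict comparison (p_value < alpha).
    """
    return bool(p_value < alpha)


# A small p-value rejects (H0); a large one does not.
decisions = [reject_null(0.01), reject_null(0.40)]
```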

train_on_observed_data(seed=None, verbosity=1)

Trains the classifier on the observed data.

Saves the trained classifier(s) as a list of length num_folds.

Parameters:
  • seed (int, optional) – Random state of the classifier. Default is None.

  • verbosity (int, optional) – Verbosity level. Default is 1.

Return type:

Union[None, List[Any]]

train_under_null_hypothesis(verbosity=1)

Trains the classifiers under the null hypothesis (H0). Saves the trained classifiers for each null trial.

Parameters:

verbosity (int, optional) – Verbosity level. Default is 1.

Return type:

None

clf_class
null_distribution = None
num_ensemble = 1
num_folds = 1
num_trials_null = 100
permutation = True
rngs
seed = 1
theta_p
theta_p_mean
theta_p_std
theta_q
trained_clfs = None
trained_clfs_null = None
x_p
x_p_mean
x_p_std
x_q
z_score = False
gensbi.diagnostics.lc2st.eval_lc2st(theta_p, x_o, clf, return_proba=False)

Evaluates the classifier returned by train_lc2st for one observation x_o and over the samples P.

Parameters:
  • theta_p (Array) – Samples from p (class 0), of shape (sample_size, dim).

  • x_o (Array) – The observation, of shape (1, dim_x).

  • clf (BaseEstimator) – Trained classifier.

  • return_proba (bool, optional) – Whether to return the predicted probabilities of being in P. Default is False.

Returns:

L-C2ST score at x_o: MSE between 0.5 and the predicted classifier probability for class 0 on theta_p.

Return type:

Union[float, Tuple[np.ndarray, float]]
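
The evaluation step can be sketched as follows: the single observation is repeated once per sample, concatenated with the samples, and the class-0 probabilities are compared with 0.5. Both eval_lc2st_sketch and HalfBiasedClassifier are hypothetical names used only for this illustration, and the feature order [theta, x] is an assumption based on the joint notation used in the LC2ST description.

```python
import numpy as np


class HalfBiasedClassifier:
    """Hypothetical fitted classifier that always predicts P(class 0) = 0.7."""

    def predict_proba(self, X):
        p0 = np.full(X.shape[0], 0.7)
        return np.column_stack([p0, 1.0 - p0])


def eval_lc2st_sketch(theta_p, x_o, clf, return_proba=False):
    """Score a classifier at one observation x_o over the samples theta_p."""
    # Repeat the observation so every sample is paired with the same x_o.
    x_rep = np.repeat(np.reshape(x_o, (1, -1)), theta_p.shape[0], axis=0)
    joint = np.concatenate([theta_p, x_rep], axis=1)
    proba = clf.predict_proba(joint)[:, 0]  # probability of class 0 (P)
    score = float(np.mean((proba - 0.5) ** 2))
    return (proba, score) if return_proba else score


theta_p = np.zeros((5, 2))         # 5 samples of dimension 2 (made-up)
x_o = np.array([[1.0, 2.0, 3.0]])  # one observation of shape (1, dim_x)
proba, score = eval_lc2st_sketch(theta_p, x_o, HalfBiasedClassifier(), return_proba=True)
```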

gensbi.diagnostics.lc2st.permute_data(theta_p, theta_q, seed=1)

Permutes the concatenated data [P,Q] to create null samples.

Parameters:
  • theta_p (Array) – Samples from P, of shape (sample_size, dim).

  • theta_q (Array) – Samples from Q, of shape (sample_size, dim).

  • seed (int, optional) – Random seed. Default is 1.

Returns:

Permuted data [theta_p, theta_q].

Return type:

Tuple[Array, Array]
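
A NumPy sketch of the permutation idea is shown below: the two sample sets are pooled, shuffled, and split back into two sets of the original sizes. The function permute_sketch is a hypothetical illustration and makes no claim about the random number generator or array library used by permute_data.

```python
import numpy as np


def permute_sketch(theta_p, theta_q, seed=1):
    """Pool P and Q samples, shuffle the rows, and split them again."""
    rng = np.random.default_rng(seed)
    n_p = theta_p.shape[0]
    pooled = np.concatenate([theta_p, theta_q], axis=0)
    shuffled = pooled[rng.permutation(pooled.shape[0])]
    # After shuffling, both halves are mixtures of P and Q, so a classifier
    # trained on them sees data for which the null hypothesis holds.
    return shuffled[:n_p], shuffled[n_p:]


theta_p = np.zeros((50, 2))  # made-up samples from P
theta_q = np.ones((50, 2))   # made-up samples from Q
perm_p, perm_q = permute_sketch(theta_p, theta_q)
```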

gensbi.diagnostics.lc2st.plot_lc2st(lc2st, post_samples_star, x_o, fig=None, ax=None, conf_alpha=0.05)
Parameters:
  • lc2st (LC2ST)

  • post_samples_star (jax.Array)

  • x_o (jax.Array)

  • fig (Optional[matplotlib.figure.Figure])

  • ax (Optional[matplotlib.axes.Axes])

Return type:

Tuple[matplotlib.figure.Figure, matplotlib.axes.Axes]

gensbi.diagnostics.lc2st.train_lc2st(theta_p, theta_q, x_p, x_q, clf)

Trains the classifier on the joint data for the L-C2ST.

Parameters:
  • theta_p (Array) – Samples from P, of shape (sample_size, dim).

  • theta_q (Array) – Samples from Q, of shape (sample_size, dim).

  • x_p (Array) – Observations corresponding to P, of shape (sample_size, dim_x).

  • x_q (Array) – Observations corresponding to Q, of shape (sample_size, dim_x).

  • clf (BaseEstimator) – Classifier to train.

Returns:

Trained classifier.

Return type:

Any
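
The joint training set for the classifier can be sketched as below. The helper build_joint_training_set is hypothetical; it assumes the feature order [theta, x] and the label convention stated above (class 0 for P, class 1 for Q).

```python
import numpy as np


def build_joint_training_set(theta_p, theta_q, x_p, x_q):
    """Stack the joint samples [theta, x] of P and Q and label them 0 and 1."""
    joint_p = np.concatenate([theta_p, x_p], axis=1)
    joint_q = np.concatenate([theta_q, x_q], axis=1)
    features = np.concatenate([joint_p, joint_q], axis=0)
    labels = np.concatenate([np.zeros(len(joint_p)), np.ones(len(joint_q))])
    return features, labels


rng = np.random.default_rng(0)
theta_p, theta_q = rng.normal(size=(100, 2)), rng.normal(size=(100, 2))
x_p, x_q = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))

features, labels = build_joint_training_set(theta_p, theta_q, x_p, x_q)
# Any scikit-learn classifier could then be trained with clf.fit(features, labels).
```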