apt.risk.data_assessment package

Submodules

apt.risk.data_assessment.attack_strategy_utils module

class apt.risk.data_assessment.attack_strategy_utils.AttackStrategyUtils

Bases: ABC

Abstract base class for common utilities of various privacy attack strategies.

class apt.risk.data_assessment.attack_strategy_utils.DistributionValidationResult(distributions_validated: bool, distributions_valid: bool, member_column_distribution_diff: list, non_member_column_distribution_diff: list)

Bases: object

Holds the result of validating the similarity of distributions.

Attributes:

distributions_validated – False if distribution validation failed for some reason and no conclusion was drawn
distributions_valid – False if there are columns whose distribution differs between the datasets
member_column_distribution_diff (list) – Columns whose distribution differs between the member and the synthetic datasets
non_member_column_distribution_diff (list) – Columns whose distribution differs between the non-member and the synthetic datasets

distributions_valid: bool

distributions_validated: bool

member_column_distribution_diff: list

non_member_column_distribution_diff: list

class apt.risk.data_assessment.attack_strategy_utils.KNNAttackStrategyUtils(use_batches: bool = False, batch_size: int = 10, distribution_comparison_alpha: float = 0.05, distribution_comparison_numeric_test: str = 'KS', distribution_comparison_categorical_test: str = 'CHI')

Bases: AttackStrategyUtils

Common utilities for attack strategy based on KNN distances.

find_knn(knn_learner: NearestNeighbors, query_samples: ArrayDataset, distance_processor=None)

Nearest neighbor search function.

Parameters:

query_samples – query samples, to which nearest neighbors are to be found
knn_learner – unsupervised learner for implementing neighbor searches, after it was fitted
distance_processor – function for processing the distance into another, more relevant metric per sample. Its input is an array of distances (as returned by NearestNeighbors.kneighbors()), and its output should be another array of distance-based values from which the final risk score can be computed

Returns:

distances of the query samples to their nearest neighbors, or a metric based on that distance and calculated by the distance_processor function
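The search and the optional post-processing step can be illustrated with a minimal, brute-force sketch. It does not use NearestNeighbors or ArrayDataset; the function name brute_force_knn_distances and the sample processor are hypothetical and only mirror the role of the distance_processor parameter described above.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def brute_force_knn_distances(
    reference: List[Vector],
    queries: List[Vector],
    k: int,
    distance_processor: Optional[Callable[[List[List[float]]], List[float]]] = None,
):
    """For each query, return the sorted distances to its k nearest reference points.

    If distance_processor is given, it receives the full (n, k) distance table
    and its return value is passed back unchanged.
    """
    distances = [
        sorted(math.dist(query, ref) for ref in reference)[:k]
        for query in queries
    ]
    if distance_processor is not None:
        return distance_processor(distances)
    return distances


synthetic = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
queries = [(0.0, 0.0), (3.0, 0.0)]

# Raw distances: one row per query, k columns.
raw = brute_force_knn_distances(synthetic, queries, k=2)

# Hypothetical processor: keep only the distance to the single closest record.
closest = brute_force_knn_distances(
    synthetic, queries, k=2,
    distance_processor=lambda table: [row[0] for row in table],
)
```

Without a processor the caller gets one row of k distances per query sample; with a processor the caller gets whatever per-sample metric the processor derives from those rows.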

fit(knn_learner: NearestNeighbors, dataset: ArrayDataset)

Fit the KNN learner.

Parameters:

knn_learner – The KNN model to fit.
dataset – The training set to fit the model on.

validate_distributions(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, categorical_features: list | None = None)

Validate that column distributions are similar between the datasets. One advantage of the Epps-Singleton (ES) test compared to the Kolmogorov-Smirnov (KS) test is that it does not assume a continuous distribution. In [1], the authors conclude that the test also has a higher power than the KS test in many examples. They recommend the use of the ES test for discrete samples as well as continuous samples with at least 25 observations each, whereas the Anderson-Darling (AD) test is recommended for smaller sample sizes in the continuous case.

Parameters:

original_data_members – A container for the training original samples and labels
original_data_non_members – A container for the holdout original samples and labels
synthetic_data – A container for the synthetic samples and labels
categorical_features – a list of categorical features of the datasets

Returns:

DistributionValidationResult
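How per-column test results might be folded into such a result can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: ValidationSketch is a hypothetical stand-in that mirrors the fields of DistributionValidationResult, and the p-values are assumed to come from whichever statistical tests were configured. Only the rule "a p-value below alpha marks the column as different" is taken from the documentation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ValidationSketch:
    """Hypothetical stand-in mirroring the fields of DistributionValidationResult."""
    distributions_validated: bool
    distributions_valid: bool
    member_column_distribution_diff: List[str] = field(default_factory=list)
    non_member_column_distribution_diff: List[str] = field(default_factory=list)


def summarize(member_p: Dict[str, float],
              non_member_p: Dict[str, float],
              alpha: float = 0.05) -> ValidationSketch:
    """Flag every column whose p-value is below alpha as differently distributed."""
    member_diff = [col for col, p in member_p.items() if p < alpha]
    non_member_diff = [col for col, p in non_member_p.items() if p < alpha]
    return ValidationSketch(
        distributions_validated=True,
        distributions_valid=not member_diff and not non_member_diff,
        member_column_distribution_diff=member_diff,
        non_member_column_distribution_diff=non_member_diff,
    )


# Made-up p-values for two columns, comparing each original set with the synthetic set.
result = summarize(
    member_p={"age": 0.40, "income": 0.01},
    non_member_p={"age": 0.30, "income": 0.20},
)
```

In this example only the income column of the member comparison falls below alpha, so the overall result is marked as validated but not valid.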

apt.risk.data_assessment.dataset_assessment_manager module

class apt.risk.data_assessment.dataset_assessment_manager.DatasetAssessmentManager(config: ~apt.risk.data_assessment.dataset_assessment_manager.DatasetAssessmentManagerConfig | None = <class 'apt.risk.data_assessment.dataset_assessment_manager.DatasetAssessmentManagerConfig'>)

Bases: object

The main class for running dataset assessment attacks.

assess(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, dataset_name: str = 'dataset', categorical_features: list = []) → list[DatasetAttackScore]

Do dataset privacy risk assessment by running dataset attacks, and return their scores.

Parameters:

original_data_members – A container for the training original samples and labels, only samples are used in the assessment
original_data_non_members – A container for the holdout original samples and labels, only samples are used in the assessment
synthetic_data – A container for the synthetic samples and labels, only samples are used in the assessment
dataset_name – A name to identify this dataset, optional
categorical_features – A list of categorical feature names or numbers

Returns:

a list of dataset attack risk scores

attack_scores = {}

dump_all_scores_to_files(): Save assessment results to filesystem.

class apt.risk.data_assessment.dataset_assessment_manager.DatasetAssessmentManagerConfig(persist_reports: bool = False, timestamp_reports: bool = False, generate_plots: bool = False)

Bases: object

Configuration for DatasetAssessmentManager.

Parameters:

persist_reports – whether to save assessment results to the filesystem
timestamp_reports – if persist_reports is True, whether to create a separate report for each timestamp or append to the same reports
generate_plots – whether to generate and visualize plots as part of the assessment

generate_plots: bool = False

persist_reports: bool = False

timestamp_reports: bool = False

apt.risk.data_assessment.dataset_attack module

This module defines the interface for privacy risk assessment of synthetic datasets.

class apt.risk.data_assessment.dataset_attack.Config

Bases: ABC

The base class for dataset attack configurations.

class apt.risk.data_assessment.dataset_attack.DatasetAttack(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, config: Config, dataset_name: str, categorical_features: list = [], attack_strategy_utils: AttackStrategyUtils | None = None)

Bases: ABC

The interface for performing a privacy attack for the risk assessment of synthetic datasets to be used in AI model training. The original data members (training data) and non-members (the holdout data) should be available. For reliability, all the datasets should be preprocessed and normalized.

abstract assess_privacy() → DatasetAttackScore

Assess the privacy of the dataset.

Returns:

score: DatasetAttackScore, the privacy attack risk score

abstract property short_name

class apt.risk.data_assessment.dataset_attack.DatasetAttackMembership(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, config: Config, dataset_name: str, categorical_features: list = [], attack_strategy_utils: AttackStrategyUtils | None = None)

Bases: DatasetAttack

An abstract base class for performing privacy risk assessment for synthetic datasets on a per-record level.

static calculate_metrics(member_probabilities: ndarray, non_member_probabilities: ndarray)

Calculate attack performance metrics.

Parameters:

member_probabilities – probability estimates of the member samples (the training data)
non_member_probabilities – probability estimates of the non-member samples (the hold-out data)

Returns:

fpr – false positive rate
tpr – true positive rate
threshold – threshold
auc – area under the receiver operating characteristic curve
ap – average precision score
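To make the AUC metric concrete, the following sketch computes it directly from the two groups of probability estimates. It relies on the standard rank interpretation of ROC AUC (the chance that a randomly chosen member scores higher than a randomly chosen non-member, with ties counted as one half). The function name is hypothetical, and this is not claimed to be how calculate_metrics() is implemented.

```python
from typing import Sequence


def roc_auc_from_groups(member_probabilities: Sequence[float],
                        non_member_probabilities: Sequence[float]) -> float:
    """Pairwise (rank-based) estimate of the area under the ROC curve.

    Members are treated as the positive class and non-members as the negative class.
    """
    wins = 0.0
    for member_score in member_probabilities:
        for non_member_score in non_member_probabilities:
            if member_score > non_member_score:
                wins += 1.0
            elif member_score == non_member_score:
                wins += 0.5
    return wins / (len(member_probabilities) * len(non_member_probabilities))


# Made-up probability estimates for three members and three non-members.
members = [0.9, 0.8, 0.4]
non_members = [0.7, 0.3, 0.2]
auc = roc_auc_from_groups(members, non_members)
```

A value near 0.5 means the attack cannot tell members from non-members better than chance; values approaching 1.0 indicate that members are systematically scored higher.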

abstract calculate_privacy_score(dataset_attack_result: DatasetAttackResultMembership, generate_plot: bool = False) → DatasetAttackScore

Calculate the dataset privacy score based on the result of the privacy attack.

Returns:

score: DatasetAttackScore

static plot_roc_curve(dataset_name: str, member_probabilities: ndarray, non_member_probabilities: ndarray, filename_prefix: str = '')

Plot ROC curve.

Parameters:

dataset_name – dataset name, will become part of the plot filename
member_probabilities – probability estimates of the member samples (the training data)
non_member_probabilities – probability estimates of the non-member samples (the hold-out data)
filename_prefix – name prefix for the ROC curve plot

apt.risk.data_assessment.dataset_attack_membership_knn_probabilities module

This module implements privacy risk assessment of synthetic datasets based on the paper: “GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models” by D. Chen, N. Yu, Y. Zhang, M. Fritz published in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 343–62, 2020. https://doi.org/10.1145/3372297.3417238 and its implementation in https://github.com/DingfanChen/GAN-Leaks.

class apt.risk.data_assessment.dataset_attack_membership_knn_probabilities.DatasetAttackConfigMembershipKnnProbabilities(k: int = 5, use_batches: bool = False, batch_size: int = 10, compute_distance: Callable | None = None, distance_params: dict | None = None, generate_plot: bool = False, distribution_comparison_alpha: float = 0.05)

Bases: Config

Configuration for DatasetAttackMembershipKnnProbabilities.

Attributes:

k – Number of nearest neighbors to search.
use_batches – Whether to divide query samples into batches.
batch_size – Query sample batch size.
compute_distance – A callable that takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. See the ‘metric’ parameter in the sklearn.neighbors.NearestNeighbors documentation.
distance_params – Additional keyword arguments for the distance computation function; see ‘metric_params’ in the sklearn.neighbors.NearestNeighbors documentation.
generate_plot – Whether to generate an AUC ROC curve and persist it in a file.
distribution_comparison_alpha – The significance level of the statistical distribution test. If the p-value is less than alpha, we reject the null hypothesis that the observed samples are drawn from the same distribution and conclude that the distributions are different.

batch_size: int = 10

compute_distance: Callable = None

distance_params: dict = None

distribution_comparison_alpha: float = 0.05

generate_plot: bool = False

k: int = 5

use_batches: bool = False

class apt.risk.data_assessment.dataset_attack_membership_knn_probabilities.DatasetAttackMembershipKnnProbabilities(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, config: DatasetAttackConfigMembershipKnnProbabilities = DatasetAttackConfigMembershipKnnProbabilities(k=5, use_batches=False, batch_size=10, compute_distance=None, distance_params=None, generate_plot=False, distribution_comparison_alpha=0.05), dataset_name: str = 'dataset', categorical_features: list | None = None, **kwargs)

Bases: DatasetAttackMembership

Privacy risk assessment for synthetic datasets based on a black-box membership inference attack (MIA) that uses the distances of members (training set) and non-members (holdout set) from their nearest neighbors in the synthetic dataset. By default, the Euclidean distance (L2 norm) is used, but another compute_distance() method can be provided in the configuration instead. The area under the receiver operating characteristic curve (AUC ROC) gives the privacy risk measure.

SHORT_NAME = 'MembershipKnnProbabilities'

assess_privacy() → DatasetAttackScoreMembershipKnnProbabilities

Membership inference attack that calculates the probabilities of member and non-member samples being generated by the synthetic data generator. The assumption is that, since the generative model is trained to approximate the training data distribution, the probability that a sample is a member of the training data should be proportional to the probability that the generative model can generate it. So, if the probability that the query sample is generated by the generative model is large, it is more likely that the query sample was used to train the generative model. This probability is approximated by the Parzen window density estimation in probability_per_sample(), computed from the NN distances from the query samples to the synthetic data samples. Before running the assessment, the method validates that the distribution of the synthetic data is similar to that of the original data members and to that of the original data non-members.

Returns:

Privacy score of the attack together with the attack result, which holds the probabilities of member and non-member samples being generated by the synthetic data generator, based on the NN distances from the query samples to the synthetic data samples. The result also contains the distribution validation result and a warning if the distributions are not similar.

calculate_privacy_score(dataset_attack_result: DatasetAttackResultMembership, generate_plot: bool = False) → DatasetAttackScoreMembershipKnnProbabilities

Evaluate the privacy score from the probabilities of member and non-member samples being generated by the synthetic data generator. The probabilities are computed by the assess_privacy() method.

Parameters:

dataset_attack_result – attack result containing the probabilities of member and non-member samples being generated by the synthetic data generator
generate_plot – generate an AUC ROC curve plot and persist it

Returns:

score of the attack, based on distance-based probabilities - mainly the ROC AUC score

static probability_per_sample(distances: ndarray)

For every query sample, represented by the distances to its KNNs in the synthetic data, compute the probability of the synthetic data being part of the query dataset.

Parameters:

distances – distance between every query sample in the batch and its KNNs among the synthetic samples, a numpy array of size (n, k), with n being the number of samples and k the number of KNNs

Returns:

probability estimates of the query samples being generated, and so of being part of the synthetic set, a numpy array of size (n,)
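The shape of this computation can be sketched in plain Python. The kernel is an assumption made for illustration: each distance d contributes exp(-d), and the contributions of the k neighbors are averaged. The kernel actually used by the library may differ, and the function name below is hypothetical.

```python
import math
from typing import List, Sequence


def probability_per_sample_sketch(distances: Sequence[Sequence[float]]) -> List[float]:
    """Map an (n, k) table of KNN distances to n distance-based probability estimates.

    Assumed kernel: exp(-d), averaged over the k neighbors of each query sample.
    Smaller distances therefore yield larger estimates.
    """
    return [
        sum(math.exp(-d) for d in row) / len(row)
        for row in distances
    ]


# Two query samples with k = 2 neighbors each (made-up distances).
distances = [
    [0.0, 0.0],   # query coincides with its synthetic neighbors
    [1.0, 3.0],   # query lies farther from the synthetic data
]
probabilities = probability_per_sample_sketch(distances)
```

The first sample, which sits on top of its synthetic neighbors, receives the maximum estimate of 1.0, while the more distant sample receives a smaller value.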

short_name()

class apt.risk.data_assessment.dataset_attack_membership_knn_probabilities.DatasetAttackScoreMembershipKnnProbabilities(dataset_name: str, roc_auc_score: float, average_precision_score: float, result: DatasetAttackResultMembership)

Bases: DatasetAttackScore

DatasetAttackMembershipKnnProbabilities privacy risk score.

assessment_type: str = 'MembershipKnnProbabilities'

average_precision_score: float

distributions_validation_result: DistributionValidationResult

roc_auc_score: float

apt.risk.data_assessment.dataset_attack_result module

class apt.risk.data_assessment.dataset_attack_result.DatasetAttackResult

Bases: object

Basic class for storing privacy risk assessment results.

class apt.risk.data_assessment.dataset_attack_result.DatasetAttackResultMembership(member_probabilities: ndarray, non_member_probabilities: ndarray)

Bases: DatasetAttackResult

Class for storing membership attack results.

Parameters:

member_probabilities – The attack probabilities for member samples.
non_member_probabilities – The attack probabilities for non-member samples.

member_probabilities: ndarray

non_member_probabilities: ndarray

class apt.risk.data_assessment.dataset_attack_result.DatasetAttackScore(dataset_name: str, risk_score: float, result: DatasetAttackResult | None)

Bases: object

Basic class for storing privacy risk assessment scores.

Parameters:

dataset_name – The name of the dataset that was assessed.
risk_score – The privacy risk score.
result – An optional list of more detailed results.

dataset_name: str

result: DatasetAttackResult | None

risk_score: float

apt.risk.data_assessment.dataset_attack_whole_dataset_knn_distance module

This module implements privacy risk assessment of synthetic datasets based on the papers “Data Synthesis based on Generative Adversarial Networks” by N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim, in International Conference on Very Large Data Bases (VLDB), 2018, and “Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data” by M. Platzer and T. Reutterer, and on a variation of the latter's reference implementation in https://github.com/mostly-ai/paper-fidelity-accuracy.

class apt.risk.data_assessment.dataset_attack_whole_dataset_knn_distance.DatasetAttackConfigWholeDatasetKnnDistance(use_batches: bool = False, batch_size: int = 10, compute_distance: callable | None = None, distance_params: dict | None = None, distribution_comparison_alpha: float = 0.05, distribution_comparison_numeric_test: str = ('KS',), distribution_comparison_categorical_test: str = 'CHI')

Bases: Config

Configuration for DatasetAttackWholeDatasetKnnDistance.

Attributes:

use_batches – Whether to divide query samples into batches.
batch_size – Query sample batch size.
compute_distance – A callable that takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. See the ‘metric’ parameter in the sklearn.neighbors.NearestNeighbors documentation.
distance_params – Additional keyword arguments for the distance computation function; see ‘metric_params’ in the sklearn.neighbors.NearestNeighbors documentation.
distribution_comparison_alpha – The significance level of the statistical distribution test. If the p-value is less than alpha, we reject the null hypothesis that the observed samples are drawn from the same distribution and conclude that the distributions are different.

batch_size: int = 10

compute_distance: callable = None

distance_params: dict = None

distribution_comparison_alpha: float = 0.05

distribution_comparison_categorical_test: str = 'CHI'

distribution_comparison_numeric_test: str = ('KS',)

use_batches: bool = False

class apt.risk.data_assessment.dataset_attack_whole_dataset_knn_distance.DatasetAttackScoreWholeDatasetKnnDistance(dataset_name: str, share: float)

Bases: DatasetAttackScore

DatasetAttackWholeDatasetKnnDistance privacy risk score.

assessment_type: str = 'WholeDatasetKnnDistance'

distributions_validation_result: DistributionValidationResult

share: float

class apt.risk.data_assessment.dataset_attack_whole_dataset_knn_distance.DatasetAttackWholeDatasetKnnDistance(original_data_members: ArrayDataset, original_data_non_members: ArrayDataset, synthetic_data: ArrayDataset, config: DatasetAttackConfigWholeDatasetKnnDistance = DatasetAttackConfigWholeDatasetKnnDistance(use_batches=False, batch_size=10, compute_distance=None, distance_params=None, distribution_comparison_alpha=0.05, distribution_comparison_numeric_test=('KS',), distribution_comparison_categorical_test='CHI'), dataset_name: str = 'dataset', categorical_features: list | None = None, **kwargs)

Bases: DatasetAttack

Privacy risk assessment for synthetic datasets based on distances of synthetic data records from members (training set) and non-members (holdout set). The privacy risk measure is the share of synthetic records closer to the training than the holdout dataset. By default, the Euclidean distance is used (L2 norm), but another compute_distance() method can be provided in configuration instead.

SHORT_NAME = 'WholeDatasetKnnDistance'

assess_privacy() → DatasetAttackScoreWholeDatasetKnnDistance

Calculate the share of synthetic records closer to the training than the holdout dataset, based on the DCR computed by ‘calculate_distances()’. Before running the assessment, there is a validation that the distribution of the synthetic data is similar to that of the original data members and to that of the original data non-members.

Returns:

score of the attack, based on the NN distances from the query samples to the synthetic data samples. The result also contains the distribution validation result and a warning if the distributions are not similar.
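The share measure can be illustrated with a small brute-force sketch. The helper names are hypothetical, the Euclidean distance is used as in the documented default, and the tie handling (a tie does not count as "closer to training") is an assumption of this sketch rather than documented library behavior.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def distance_to_closest_record(record: Vector, dataset: List[Vector]) -> float:
    """Distance from one record to its nearest neighbor in dataset (the DCR)."""
    return min(math.dist(record, other) for other in dataset)


def share_closer_to_training(synthetic: List[Vector],
                             members: List[Vector],
                             non_members: List[Vector]) -> float:
    """Fraction of synthetic records strictly closer to the training set than to the holdout set."""
    closer = sum(
        1 for record in synthetic
        if distance_to_closest_record(record, members)
        < distance_to_closest_record(record, non_members)
    )
    return closer / len(synthetic)


# Made-up two-dimensional data.
members = [(0.0, 0.0), (1.0, 1.0)]          # training set
non_members = [(5.0, 5.0), (6.0, 6.0)]      # holdout set
synthetic = [(0.1, 0.0), (0.9, 1.0), (5.2, 5.0), (2.0, 2.0)]

share = share_closer_to_training(synthetic, members, non_members)
```

Here three of the four synthetic records lie closer to the training set, giving a share of 0.75. With training and holdout sets of equal size, a share near 0.5 would suggest that the synthetic data is no closer to the training records than to unseen records.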

calculate_distances()

Calculate member and non-member query probabilities, based on their distance to their KNN among synthetic samples. This distance is called distance to the closest record (DCR), as defined by N. Park et al. in “Data Synthesis based on Generative Adversarial Networks.”

Returns:

member_distances – distances of each synthetic data member from its nearest training sample
non_member_distances – distances of each synthetic data member from its nearest validation sample

short_name()

Module contents

Module providing privacy risk assessment for synthetic data.

The main interface, DatasetAttack, with the assess_privacy() main method assumes the availability of the training data, holdout data and synthetic data at the time of the privacy evaluation. It is to be implemented by concrete assessment methods, which can run the assessment on a per-record level, or on the whole dataset. The abstract class DatasetAttackMembership implements the DatasetAttack interface, but adds the result of the membership inference attack, so that the final score contains both the membership inference attack result for further analysis and the calculated score.
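The relationship between the abstract interface and a concrete assessment method can be sketched with the standard abc module. All names below (ScoreSketch, AttackSketch, ExactCopyAttack) are hypothetical and deliberately simplified; the toy attack is not one of the package's assessment methods and only shows the pattern of implementing assess_privacy() and short_name.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ScoreSketch:
    """Hypothetical, reduced counterpart of a dataset attack score."""
    dataset_name: str
    risk_score: float


class AttackSketch(ABC):
    """Hypothetical, reduced counterpart of the DatasetAttack interface."""

    def __init__(self, members, non_members, synthetic, dataset_name="dataset"):
        self.members = members
        self.non_members = non_members
        self.synthetic = synthetic
        self.dataset_name = dataset_name

    @property
    @abstractmethod
    def short_name(self) -> str:
        """Short identifier of the assessment method."""

    @abstractmethod
    def assess_privacy(self) -> ScoreSketch:
        """Run the assessment and return its score."""


class ExactCopyAttack(AttackSketch):
    """Toy whole-dataset attack: share of synthetic records that copy a training record."""

    @property
    def short_name(self) -> str:
        return "ExactCopy"

    def assess_privacy(self) -> ScoreSketch:
        member_set = set(self.members)
        copies = sum(1 for record in self.synthetic if record in member_set)
        return ScoreSketch(self.dataset_name, copies / len(self.synthetic))


attack = ExactCopyAttack(
    members=[(1, 2), (3, 4)],
    non_members=[(9, 9)],
    synthetic=[(1, 2), (7, 7)],
)
score = attack.assess_privacy()
```

The abstract base cannot be instantiated on its own; each concrete method supplies its own scoring logic while callers rely only on assess_privacy() and short_name.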