apt.anonymization package

Submodules

apt.anonymization.anonymizer module

class apt.anonymization.anonymizer.Anonymize(k: int, quasi_identifiers: ndarray | list, quasi_identifer_slices: list | None = None, categorical_features: list | None = None, is_regression: bool | None = False, train_only_QI: bool | None = False)

Bases: object

Class for performing tailored, model-guided anonymization of training datasets for ML models.

Based on the implementation described in: https://arxiv.org/abs/2007.13086

Parameters:
  • k (int) – The privacy parameter that determines the number of records that will be indistinguishable from each other (when looking at the quasi identifiers). Should be at least 2.

  • quasi_identifiers (np.ndarray or list of strings or integers) – The quasi-identifier features to be generalized: feature names for pandas data, or feature indexes for numpy data.

  • quasi_identifer_slices (list of lists of strings or integers, optional) – If some of the quasi-identifiers represent one-hot encoded features that need to remain consistent after anonymization, provide a list in which each inner list holds the column names or indexes that together represent a single feature.

  • categorical_features (list, optional) – The list of categorical features (if supplied, these features will be one-hot encoded before using them to train the decision tree model).

  • is_regression (boolean, optional) – Whether the model is a regression model or not (if False, assumes a classification model). Default is False.

  • train_only_QI (boolean, optional) – Whether to train the anonymizer decision tree only on the quasi-identifier features. Default is False, in which case the tree is trained on all features.
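
To illustrate what the privacy parameter k means, the following minimal sketch checks whether a small table is k-anonymous with respect to a set of quasi-identifier columns. It is not part of the toolkit: the helper name is_k_anonymous and the sample rows are hypothetical, and only the Python standard library is used.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k rows (hypothetical helper, not a toolkit API)."""
    if k < 2:
        raise ValueError("k should be at least 2")
    # Project each row onto its quasi-identifier columns and count duplicates.
    groups = Counter(tuple(row[i] for i in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Made-up data. Columns: age range, zip prefix, diagnosis.
# Only the first two columns are treated as quasi-identifiers.
rows = [
    ("30-39", "021", "flu"),
    ("30-39", "021", "asthma"),
    ("40-49", "946", "flu"),
    ("40-49", "946", "diabetes"),
]

print(is_k_anonymous(rows, quasi_identifiers=[0, 1], k=2))  # True
print(is_k_anonymous(rows, quasi_identifiers=[0, 1], k=3))  # False
```

Each quasi-identifier combination above is shared by exactly two records, so the table satisfies k=2 but not k=3.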

anonymize(dataset: ArrayDataset) → ndarray | DataFrame

Method for performing model-guided anonymization.

Parameters:

dataset (ArrayDataset) – Data wrapper containing the training data for the model and the predictions of the original model on the training data.

Returns:

The anonymized training dataset as either numpy array or pandas DataFrame (depending on the type of the original data used to create the ArrayDataset).
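
A typical call sequence might look like the sketch below. The wrapper function anonymize_training_data and its default k are hypothetical. The import path apt.utils.datasets for ArrayDataset, and the convention of passing the training data followed by the original model's predictions, are assumptions not stated in this section; check them against the installed version of the toolkit.

```python
def anonymize_training_data(x_train, predictions, quasi_identifiers, k=10):
    """Hypothetical convenience wrapper around Anonymize.anonymize().

    x_train: numpy array or pandas DataFrame with the training features.
    predictions: the original model's predictions on x_train.
    quasi_identifiers: column names (pandas) or indexes (numpy).
    """
    # Imported lazily so this sketch can be loaded without the toolkit installed.
    from apt.anonymization.anonymizer import Anonymize
    from apt.utils.datasets import ArrayDataset  # assumed import path

    anonymizer = Anonymize(k, quasi_identifiers)
    # Assumption: ArrayDataset takes the data first and the labels
    # (here, the model's predictions) second.
    return anonymizer.anonymize(ArrayDataset(x_train, predictions))


# Hypothetical usage with a fitted scikit-learn style model:
#   x_anon = anonymize_training_data(x_train, model.predict(x_train), [0, 2, 5])
#   model.fit(x_anon, y_train)
```

The result has the same container type as x_train, so it can be passed straight back to the model's training routine.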

Module contents

Module providing ML anonymization.

This module contains methods for anonymizing ML model training data, so that when a model is retrained on the anonymized data, the model itself will also be considered anonymous. This may help exempt the model from different obligations and restrictions set out in data protection regulations such as GDPR, CCPA, etc.

The module contains methods that enable anonymizing training datasets in a manner that is tailored to and guided by an existing, trained ML model. It uses the existing model’s predictions on the training data to train a second model, the anonymizer, which in turn determines the generalizations that will be applied to the training data. For more information about the method see: https://arxiv.org/abs/2007.13086
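
The idea of letting the original model's predictions guide the generalization can be sketched in a heavily simplified form. The code below is not the toolkit's implementation: instead of training a decision tree, it groups records by predicted label, sorts each group by a single numeric quasi-identifier, cuts it into chunks of at least k records, and replaces every value in a chunk with one representative value. The function name generalize_by_prediction and the sample data are hypothetical, and the sketch assumes every predicted label covers at least k records.

```python
from collections import defaultdict
from statistics import median_low


def generalize_by_prediction(qi_values, predictions, k):
    """Simplified stand-in for the anonymizer model, for one numeric
    quasi-identifier (illustration only, not the toolkit's algorithm)."""
    # Group record indexes by the original model's prediction.
    by_label = defaultdict(list)
    for idx, label in enumerate(predictions):
        by_label[label].append(idx)

    anonymized = list(qi_values)
    for indexes in by_label.values():
        # Keep similar values together before cutting into chunks.
        indexes.sort(key=lambda i: qi_values[i])
        chunks = [indexes[s:s + k] for s in range(0, len(indexes), k)]
        # Fold a too-small tail into the previous chunk so each chunk has >= k.
        if len(chunks) > 1 and len(chunks[-1]) < k:
            tail = chunks.pop()
            chunks[-1].extend(tail)
        for chunk in chunks:
            # Use an actual value from the chunk as its representative.
            representative = median_low(qi_values[i] for i in chunk)
            for i in chunk:
                anonymized[i] = representative
    return anonymized


ages = [23, 25, 31, 38, 52, 55, 61]       # made-up quasi-identifier values
predicted = [0, 0, 0, 0, 1, 1, 1]         # made-up model predictions
print(generalize_by_prediction(ages, predicted, k=2))
# [23, 23, 31, 31, 55, 55, 55]
```

Records that the original model treats alike end up sharing the same generalized value, which is the intuition behind using the model's predictions as the training target for the anonymizer.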

Once the anonymized training data is returned, it can be used to retrain the model.