apt.minimization package

Submodules

apt.minimization.minimizer module

This module implements all classes needed to perform data minimization.

class apt.minimization.minimizer.GeneralizeToRepresentative(estimator: BaseEstimator | Model | None = None, target_accuracy: float | None = 0.998, cells: list | None = None, categorical_features: ndarray | list | None = None, encoder: OrdinalEncoder | OneHotEncoder | None = None, features_to_minimize: ndarray | list | None = None, feature_slices: list | None = None, train_only_features_to_minimize: bool | None = True, is_regression: bool | None = False, generalize_using_transform: bool = True)

Bases: BaseEstimator, MetaEstimatorMixin, TransformerMixin

A transformer that generalizes data to representative points.

Learns data generalizations based on an original model's predictions and a target accuracy. Once the generalizations are learned, the transformer can receive one or more data records and map them to representative points based on the learned generalization. Alternatively, cells may be supplied in init or set_params, and those will be used to transform data to representatives. In that case, fit must still be called, but there is no need to supply it with X and y, nor to supply an existing estimator to init. In summary, supply either estimator and target_accuracy, or cells.

Parameters:
  • estimator (sklearn BaseEstimator or Model) – The original model for which generalization is being performed. Should be pre-fitted.

  • target_accuracy (float, optional) – The required relative accuracy when applying the base model to the generalized data. Accuracy is measured relative to the original accuracy of the model.

  • cells (list of objects, optional) – The cells used to generalize records. Each cell must define a range or subset of categories for each feature, as well as a representative value for each feature. This parameter should be used when instantiating a transformer object without first fitting it.

  • categorical_features (list of strings or integers, optional) – The list of categorical features (if supplied, these features will be one-hot encoded before being used to train the decision tree model).

  • encoder (sklearn OrdinalEncoder or OneHotEncoder) – Optional encoder for encoding data before feeding it into the estimator (e.g., for categorical features). If not provided, the data will be fed as is directly to the estimator.

  • features_to_minimize (list of strings or int, optional) – The features to be minimized. If not provided, all features will be minimized.

  • feature_slices (list of lists of strings or integers, optional) – If some of the features to be minimized represent one-hot encoded features that need to remain consistent after minimization, provide a list containing the list of column names or indexes that represent a single feature.

  • train_only_features_to_minimize (boolean, optional) – Whether to train the tree just on the features_to_minimize or on all features. Default is only on features_to_minimize.

  • is_regression (boolean, optional) – Whether the model is a regression model or not (if False, assumes a classification model). Default is False.

  • generalize_using_transform (boolean, optional) – Indicates how to calculate NCP and accuracy during the generalization process. True means that the transform method is used to transform original data into generalized data that is used for accuracy and NCP calculation. False indicates that the generalizations structure should be used. Default is True.
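The description of cells above can be made concrete with a small, self-contained sketch. The field names below (ranges, categories, representative) are illustrative assumptions, not the library's actual cell representation; they only show how a cell covers a region of feature space and carries one representative value per feature.

```python
# Hypothetical cell structure: each cell defines a range or category subset
# per feature, plus a representative value per feature (field names assumed).
cells = [
    {   # cell 1: ages in [0, 40), city in {"NY", "LA"}
        "ranges": {"age": (0, 40)},
        "categories": {"city": {"NY", "LA"}},
        "representative": {"age": 30, "city": "NY"},
    },
    {   # cell 2: ages in [40, 120), any of the three cities
        "ranges": {"age": (40, 120)},
        "categories": {"city": {"NY", "LA", "SF"}},
        "representative": {"age": 50, "city": "SF"},
    },
]

def to_representative(record, cells):
    """Map a record to the representative of the first cell that covers it."""
    for cell in cells:
        in_ranges = all(lo <= record[f] < hi
                        for f, (lo, hi) in cell["ranges"].items())
        in_cats = all(record[f] in allowed
                      for f, allowed in cell["categories"].items())
        if in_ranges and in_cats:
            return cell["representative"]
    return record  # no cell covers the record: leave it untouched

to_representative({"age": 25, "city": "LA"}, cells)
# → {'age': 30, 'city': 'NY'}
```

This is the essence of what transform does when cells are supplied directly instead of being learned by fit.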

calculate_ncp(samples: ArrayDataset)

Compute the NCP score of the generalization. The calculation depends on the value of the generalize_using_transform parameter. If samples are provided, updates the stored NCP value to the one computed on the provided data. If samples are not provided, returns the last NCP score computed by the fit or transform method.

Based on the NCP score presented in: Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss (https://www.vldb.org/conf/2007/papers/research/p758-ghinita.pdf)

Parameters:

samples (ArrayDataset, optional; feature_names should be set) – The input samples to compute the NCP score on.

Returns:

NCP score as float.
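A simplified per-value sketch of the NCP information-loss measure from the cited paper may help: a generalized numeric value loses information in proportion to the width of its interval relative to the feature's full domain, and a generalized categorical value in proportion to the size of its category subset relative to the full category domain. The library's actual weighting and aggregation may differ; this is only the core idea.

```python
def ncp_numeric(lo, hi, domain_lo, domain_hi):
    """NCP of one generalized numeric value: interval width relative
    to the width of the feature's full domain."""
    if domain_hi == domain_lo:
        return 0.0
    return (hi - lo) / (domain_hi - domain_lo)

def ncp_categorical(subset_size, domain_size):
    """NCP of one generalized categorical value: subset size relative
    to the full category domain (0 when the value is not generalized)."""
    if subset_size <= 1:
        return 0.0
    return subset_size / domain_size

def ncp_total(per_feature_scores):
    """Uniform average of per-feature NCP scores (the paper allows
    arbitrary per-feature weights)."""
    return sum(per_feature_scores) / len(per_feature_scores)

# Age generalized to [20, 40) over a domain of [0, 100) loses 20% of
# its precision; a 2-of-4 category subset loses 50%.
ncp_total([ncp_numeric(20, 40, 0, 100), ncp_categorical(2, 4)])
# → 0.35
```

A score of 0 means no generalization (no information loss); higher values mean coarser, more private representations.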

fit(X: ndarray | DataFrame | None = None, y: ndarray | DataFrame | None = None, features_names: Optional = None, dataset: ArrayDataset = None)

Learns the generalizations based on training data. Also sets the fit_score and generalizations_score in self.ncp.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The training input samples.

  • y (array-like, shape (n_samples,), optional) – The target values. This should contain the predictions of the original model on X.

  • features_names (list of strings, optional) – The feature names, in the order in which they appear in the data. Should be provided when passing X as a numpy array.

  • dataset (ArrayDataset, optional) – Data wrapper containing the training input samples and the predictions of the original model on the training data. Either X and y OR dataset must be provided, not both.

Returns:

self

fit_transform(X: ndarray | DataFrame | None = None, y: ndarray | DataFrame | None = None, features_names: list | None = None, dataset: ArrayDataset | None = None)

Learns the generalizations based on training data, and applies them to the data. Also sets the fit_score, transform_score and generalizations_score in self.ncp.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The training input samples.

  • y (array-like, shape (n_samples,), optional) – The target values. This should contain the predictions of the original model on X.

  • features_names (list of strings, optional) – The feature names, in the order in which they appear in the data. Can be provided when passing the data via X and y.

  • dataset (ArrayDataset, optional) – Data wrapper containing the training input samples and the predictions of the original model on the training data. Either X and y OR dataset must be provided, not both.

Returns:

Array containing the representative values to which each record in X is mapped, as numpy array or pandas DataFrame (depending on the type of X), shape (n_samples, n_features)

property generalizations

Return the generalizations derived from the model and test data.

Returns:

generalizations object. Contains 3 sections: ‘ranges’ that contains ranges for numerical features, ‘categories’ that contains sub-groups of categories for categorical features, and ‘untouched’ that contains the features that could not be generalized.
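The three-section structure described above can be illustrated with a small stdlib-only sketch. The concrete shapes below (threshold lists for ranges, lists of category sub-groups) are assumptions for illustration; the library's actual representation may differ.

```python
import bisect

# Hypothetical example of the three sections of a generalizations object.
generalizations = {
    "ranges": {"age": [30.0, 50.0]},                 # numeric split thresholds
    "categories": {"city": [["NY", "LA"], ["SF"]]},  # category sub-groups
    "untouched": ["postcode"],                       # could not be generalized
}

def numeric_interval(value, thresholds):
    """Return the (lo, hi) interval between learned thresholds that
    contains the given numeric value."""
    i = bisect.bisect_right(thresholds, value)
    lo = thresholds[i - 1] if i > 0 else float("-inf")
    hi = thresholds[i] if i < len(thresholds) else float("inf")
    return lo, hi

numeric_interval(42.0, generalizations["ranges"]["age"])
# → (30.0, 50.0)
```

Any value falling in the same interval (or the same category sub-group) is indistinguishable after generalization, which is exactly what makes the minimized data less identifying.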

get_params(deep=True)

Get parameters

Parameters:

deep (boolean, optional) – If True, will return the parameters for this estimator and contained sub-objects that are estimators.

Returns:

Parameter names mapped to their values

property ncp

Return the last calculated NCP scores. An NCP score is calculated upon calling fit (on the training data) or transform (on the test data), or when explicitly calling calculate_ncp and providing it a dataset.

Returns:

NCPScores object, containing a score corresponding to the last fit call, one for the last transform call, and a score based on the global generalizations.

set_params(**params)

Set parameters

Parameters:
  • target_accuracy (float, optional) – The required relative accuracy when applying the base model to the generalized data. Accuracy is measured relative to the original accuracy of the model.

  • cells (list of objects, optional) – The cells used to generalize records. Each cell must define a range or subset of categories for each feature, as well as a representative value for each feature. This parameter should be used when instantiating a transformer object without first fitting it.

Returns:

self

transform(X: ndarray | DataFrame | None = None, features_names: list | None = None, dataset: ArrayDataset | None = None)

Transforms data records to representative points. Also sets the transform_score in self.ncp.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The input samples to transform.

  • features_names (list of strings, optional) – The feature names, in the order in which they appear in the data. Should be provided when passing X as a numpy array.

  • dataset (ArrayDataset, optional) – Data wrapper containing the input samples to transform. Either X OR dataset must be provided, not both.

Returns:

Array containing the representative values to which each record in X is mapped, as numpy array or pandas DataFrame (depending on the type of X), shape (n_samples, n_features)

class apt.minimization.minimizer.NCPScores(fit_score: float = None, transform_score: float = None, generalizations_score: float = None)

Bases: object

fit_score: float = None
generalizations_score: float = None
transform_score: float = None
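Given the documented signature, NCPScores behaves like a plain dataclass holding one optional score per way an NCP value can be produced. A minimal stand-in (not the library's actual class) reproduces that behavior:

```python
from dataclasses import dataclass

# Stand-in mirroring the documented NCPScores container: each field starts
# as None and is filled in by the corresponding operation (fit, transform,
# or an explicit calculate_ncp on the generalizations structure).
@dataclass
class NCPScores:
    fit_score: float = None
    transform_score: float = None
    generalizations_score: float = None

scores = NCPScores(fit_score=0.35)
```

Fields left at None simply mean the corresponding operation has not been run yet.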

Module contents

Module providing data minimization for ML.

This module implements a first-of-its-kind method to help reduce the amount of personal data needed to perform predictions with a machine learning model, by removing or generalizing some of the input features. For more information about the method see: http://export.arxiv.org/pdf/2008.04113

The main class, GeneralizeToRepresentative, is a scikit-learn-compatible transformer that receives an existing estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for analysis by the original model. The fit() method learns the generalizations and the transform() method applies them to new data.

It is also possible to export the generalizations as feature ranges.