apt.minimization package
Submodules
apt.minimization.minimizer module
This module implements all classes needed to perform data minimization.
- class apt.minimization.minimizer.GeneralizeToRepresentative(estimator: BaseEstimator | Model | None = None, target_accuracy: float | None = 0.998, cells: list | None = None, categorical_features: ndarray | list | None = None, encoder: OrdinalEncoder | OneHotEncoder | None = None, features_to_minimize: ndarray | list | None = None, feature_slices: list | None = None, train_only_features_to_minimize: bool | None = True, is_regression: bool | None = False, generalize_using_transform: bool = True)
Bases: BaseEstimator, MetaEstimatorMixin, TransformerMixin
A transformer that generalizes data to representative points.
Learns data generalizations based on an original model’s predictions and a target accuracy. Once the generalizations are learned, the transformer can receive one or more data records and transform them to representative points based on the learned generalization. An alternative way to use the transformer is to supply cells in init or set_params; those cells are then used to transform data to representatives. In this case, fit must still be called, but there is no need to supply it with X and y, and there is no need to supply an existing estimator to init. In summary, either estimator and target_accuracy should be supplied, or cells should be supplied.
- Parameters:
estimator (sklearn BaseEstimator or Model) – The original model for which generalization is being performed. Should be pre-fitted.
target_accuracy (float, optional) – The required relative accuracy when applying the base model to the generalized data. Accuracy is measured relative to the original accuracy of the model.
cells (list of objects, optional) – The cells used to generalize records. Each cell must define a range or subset of categories for each feature, as well as a representative value for each feature. This parameter should be used when instantiating a transformer object without first fitting it.
categorical_features (list of strings or integers, optional) – The list of categorical features (if supplied, these features will be one-hot encoded before using them to train the decision tree model).
encoder (sklearn OrdinalEncoder or OneHotEncoder) – Optional encoder for encoding data before feeding it into the estimator (e.g., for categorical features). If not provided, the data will be fed as is directly to the estimator.
features_to_minimize (list of strings or int, optional) – The features to be minimized. If not provided, all features will be minimized.
feature_slices (list of lists of strings or integers, optional) – If some of the features to be minimized represent 1-hot encoded features that need to remain consistent after minimization, provide a list containing the list of column names or indexes that represent a single feature.
train_only_features_to_minimize (boolean, optional) – Whether to train the tree just on the features_to_minimize or on all features. Default is only on features_to_minimize.
is_regression (boolean, optional) – Whether the model is a regression model or not (if False, assumes a classification model). Default is False.
generalize_using_transform (boolean, optional) – Indicates how to calculate NCP and accuracy during the generalization process. True means that the transform method is used to transform original data into generalized data that is used for accuracy and NCP calculation. False indicates that the generalizations structure should be used. Default is True.
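The idea behind cells can be illustrated with a minimal, library-independent sketch. The dictionary layout below (the keys "ranges", "categories" and "representative", and the "start"/"end" bounds) is an assumption made for illustration, as are the helper names in_cell and to_representative; the exact structure expected by the cells parameter may differ.

```python
# Hypothetical cell layout, assumed for illustration only: numeric features
# carry a (start, end] range (None = unbounded), categorical features carry
# a list of allowed categories, and each cell names one representative value
# per generalized feature.
CELLS = [
    {
        "ranges": {"age": {"start": None, "end": 38}},
        "categories": {"sex": ["f", "m"]},
        "representative": {"age": 26, "sex": "f"},
    },
    {
        "ranges": {"age": {"start": 38, "end": None}},
        "categories": {"sex": ["f", "m"]},
        "representative": {"age": 58, "sex": "m"},
    },
]


def in_cell(record, cell):
    """Return True if every constrained feature of the record falls inside the cell."""
    for feature, bounds in cell["ranges"].items():
        value = record[feature]
        if bounds["start"] is not None and value <= bounds["start"]:
            return False
        if bounds["end"] is not None and value > bounds["end"]:
            return False
    for feature, allowed in cell["categories"].items():
        if record[feature] not in allowed:
            return False
    return True


def to_representative(record, cells):
    """Overwrite the generalized features of a record with its cell's representative."""
    for cell in cells:
        if in_cell(record, cell):
            return {**record, **cell["representative"]}
    return dict(record)  # no matching cell: leave the record unchanged


print(to_representative({"age": 31, "sex": "m"}, CELLS))
print(to_representative({"age": 45, "sex": "f"}, CELLS))
```

Every record that falls in the same cell is mapped to the same representative point, which is what makes the transformed data less specific than the original.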
- calculate_ncp(samples: ArrayDataset)
Compute the NCP score of the generalization. Calculation is based on the value of the generalize_using_transform param. If samples are provided, updates the stored NCP value to the one computed on the provided data. If samples are not provided, returns the last NCP score computed by the fit or transform method.
Based on the NCP score presented in: Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss (https://www.vldb.org/conf/2007/papers/research/p758-ghinita.pdf)
- Parameters:
samples (ArrayDataset, optional. feature_names should be set.) – The input samples to compute the NCP score on.
- Returns:
NCP score as float.
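As a rough illustration of the kind of measure NCP is, the sketch below computes a per-feature score in the spirit of the cited paper: the width of a generalized numeric interval relative to the feature's domain, and the share of the domain covered by a generalized category subset. The function names and the equal weighting across features are assumptions; this is not the library's implementation.

```python
def ncp_numeric(low, high, domain_low, domain_high):
    """Width of the generalized interval relative to the feature's full domain."""
    if domain_high == domain_low:
        return 0.0
    return (high - low) / (domain_high - domain_low)


def ncp_categorical(subset, domain):
    """0 for a single category, otherwise the share of the domain covered."""
    if len(subset) <= 1:
        return 0.0
    return len(subset) / len(domain)


def ncp_record(feature_scores):
    """Unweighted mean of per-feature scores (equal weights are an assumption)."""
    return sum(feature_scores) / len(feature_scores)


# Age generalized to [20, 40] within a domain of [0, 100].
age_score = ncp_numeric(20, 40, 0, 100)
# City generalized to 2 of 4 possible values; sex left as a single value.
city_score = ncp_categorical(["a", "b"], ["a", "b", "c", "d"])
sex_score = ncp_categorical(["f"], ["f", "m"])

score = ncp_record([age_score, city_score, sex_score])
print(round(score, 4))
```

A score of 0 means no information was lost, and a score of 1 means every feature was generalized to its entire domain.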
- fit(X: ndarray | DataFrame | None = None, y: ndarray | DataFrame | None = None, features_names: Optional = None, dataset: ArrayDataset = None)
Learns the generalizations based on training data. Also sets the fit_score and generalizations_score in self.ncp.
- Parameters:
X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The training input samples.
y (array-like, shape (n_samples,), optional) – The target values. This should contain the predictions of the original model on X.
features_names (list of strings, optional) – The feature names, in the order that they appear in the data. Should be provided when passing the data as X as a numpy array.
dataset (ArrayDataset, optional) – Data wrapper containing the training input samples and the predictions of the original model on the training data. Either X and y, or dataset, must be provided, but not both.
- Returns:
self
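The role of y as the original model's predictions, and of target_accuracy as a relative measure, can be sketched without the library. In the toy example below, toy_model and relative_accuracy are hypothetical stand-ins: a candidate generalization is scored by how often the model's predictions on generalized records agree with its predictions on the original records. How the library searches for generalizations is not shown here.

```python
def toy_model(record):
    """Stand-in for a pre-fitted classifier: predicts 1 for ages above 40."""
    return 1 if record["age"] > 40 else 0


def relative_accuracy(model, generalized, original_predictions):
    """Fraction of generalized records whose prediction matches the original one."""
    matches = sum(
        model(record) == prediction
        for record, prediction in zip(generalized, original_predictions)
    )
    return matches / len(original_predictions)


original = [{"age": 25}, {"age": 39}, {"age": 44}, {"age": 63}]
y = [toy_model(record) for record in original]  # predictions on the original data

# Coarse generalization: the age boundary (50) ignores the model's threshold (40).
coarse = [{"age": 45 if r["age"] <= 50 else 60} for r in original]
# Finer generalization: the age boundary coincides with the model's threshold.
fine = [{"age": 30 if r["age"] <= 40 else 55} for r in original]

coarse_score = relative_accuracy(toy_model, coarse, y)
fine_score = relative_accuracy(toy_model, fine, y)
print(coarse_score, fine_score)
```

With a target accuracy such as the default 0.998, the coarse generalization would fall short while the finer one would be acceptable.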
- fit_transform(X: ndarray | DataFrame | None = None, y: ndarray | DataFrame | None = None, features_names: list | None = None, dataset: ArrayDataset | None = None)
Learns the generalizations based on training data, and applies them to the data. Also sets the fit_score, transform_score and generalizations_score in self.ncp.
- Parameters:
X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The training input samples.
y (array-like, shape (n_samples,), optional) – The target values. This should contain the predictions of the original model on X.
features_names (list of strings, optional) – The feature names, in the order that they appear in the data. Can be provided when passing the data as X and y.
dataset (ArrayDataset, optional) – Data wrapper containing the training input samples and the predictions of the original model on the training data. Either X and y, or dataset, must be provided, but not both.
- Returns:
Array containing the representative values to which each record in X is mapped, as numpy array or pandas DataFrame (depending on the type of X), shape (n_samples, n_features)
- property generalizations
Return the generalizations derived from the model and test data.
- Returns:
generalizations object. Contains 3 sections: ‘ranges’ that contains ranges for numerical features, ‘categories’ that contains sub-groups of categories for categorical features, and ‘untouched’ that contains the features that could not be generalized.
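To make the 'ranges' section more concrete, the sketch below assumes that a numeric feature's generalization is stored as a list of split points and turns that list into explicit intervals. Both the assumed shape of the data and the helper name split_points_to_ranges are hypothetical; inspect the actual generalizations object before relying on this layout.

```python
def split_points_to_ranges(split_points):
    """Turn a list of split points into (lower, upper) intervals; None = unbounded."""
    bounds = [None] + sorted(split_points) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


# Assumed example of a generalizations-like structure (illustrative values only).
example = {
    "ranges": {"age": [50.5, 38.5]},
    "categories": {"sex": [["f", "m"]]},
    "untouched": ["height"],
}

age_ranges = split_points_to_ranges(example["ranges"]["age"])
print(age_ranges)
```

A feature with no split points yields a single unbounded interval, meaning the whole domain is treated as one group.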
- get_params(deep=True)
Get parameters
- Parameters:
deep (boolean, optional) – If True, will return the parameters for this estimator and contained sub-objects that are estimators.
- Returns:
Parameter names mapped to their values
- property ncp
Return the last calculated NCP scores. NCP score is calculated upon calling fit (on the training data), transform (on the test data) or when explicitly calling calculate_ncp and providing it a dataset.
- Returns:
NCPScores object that contains a score corresponding to the last fit call, one for the last transform call, and a score based on global generalizations.
- set_params(**params)
Set parameters
- Parameters:
target_accuracy (float, optional) – The required relative accuracy when applying the base model to the generalized data. Accuracy is measured relative to the original accuracy of the model.
cells (list of objects, optional) – The cells used to generalize records. Each cell must define a range or subset of categories for each feature, as well as a representative value for each feature. This parameter should be used when instantiating a transformer object without first fitting it.
- Returns:
self
- transform(X: ndarray | DataFrame | None = None, features_names: list | None = None, dataset: ArrayDataset | None = None)
Transforms data records to representative points. Also sets the transform_score in self.ncp.
- Parameters:
X ({array-like, sparse matrix}, shape (n_samples, n_features), optional) – The input samples to transform.
features_names (list of strings, optional) – The feature names, in the order that they appear in the data. Should be provided when passing the data as X as a numpy array.
dataset (ArrayDataset, optional) – Data wrapper containing the training input samples and the predictions of the original model on the training data. Either X or dataset must be provided, but not both.
- Returns:
Array containing the representative values to which each record in X is mapped, as numpy array or pandas DataFrame (depending on the type of X), shape (n_samples, n_features)
Module contents
Module providing data minimization for ML.
This module implements a first-of-a-kind method to help reduce the amount of personal data needed to perform predictions with a machine learning model, by removing or generalizing some of the input features. For more information about the method see: http://export.arxiv.org/pdf/2008.04113
The main class, GeneralizeToRepresentative, is a scikit-learn compatible Transformer that receives an existing estimator and labeled training data, and learns the generalizations that can be applied to any newly collected data for analysis by the original model. The fit() method learns the generalizations and the transform() method applies them to new data.
It is also possible to export the generalizations as feature ranges.