apt.utils.datasets package

Submodules

apt.utils.datasets.datasets module

The AI Privacy Toolbox (datasets). Implementation of utility classes for dataset handling

class apt.utils.datasets.datasets.ArrayDataset(x: ndarray | DataFrame | List | Tensor | csr_matrix, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, features_names: list | None = None, **kwargs)

Bases: Dataset

Dataset that is based on x and y arrays (e.g., numpy/pandas/list…)

Parameters:
  • x (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

  • features_names (list of strings, optional) – The feature names, in the order that they appear in the data

get_labels() ndarray

Get labels

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_samples() ndarray

Get data samples

Returns:

data samples as numpy array
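
The behaviour described above can be illustrated with a minimal stand-alone sketch. This is not apt's implementation — the real ArrayDataset also accepts pandas DataFrames, pytorch Tensors and scipy csr matrices — but it mirrors the documented interface of normalising inputs to numpy arrays:

```python
import numpy as np

# Minimal stand-in mirroring the documented ArrayDataset interface.
# Illustration only: the real apt class also handles pandas DataFrames,
# pytorch Tensors and scipy csr matrices.
class MiniArrayDataset:
    def __init__(self, x, y=None, features_names=None):
        self._x = np.asarray(x)                         # normalise samples to numpy
        self._y = None if y is None else np.asarray(y)  # labels are optional
        self.features_names = features_names

    def get_samples(self):
        return self._x

    def get_labels(self):
        return self._y

ds = MiniArrayDataset([[1, 2], [3, 4]], y=[0, 1], features_names=["a", "b"])
print(ds.get_samples().shape)    # (2, 2)
print(ds.get_labels().tolist())  # [0, 1]
```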

class apt.utils.datasets.datasets.Data(train: Dataset | None = None, test: Dataset | None = None, **kwargs)

Bases: object

Class for storing train and test datasets.

Parameters:
  • train (Dataset) – the training set

  • test (Dataset, optional) – the test set

get_test_labels() Collection[Any]

Get test set labels

Returns:

test labels, or None if no test labels provided

get_test_predictions() Collection[Any]

Get test set predictions, or None if no test predictions provided

Returns:

test predictions

get_test_samples() Collection[Any]

Get test set samples

Returns:

test samples, or None if no test data provided

get_test_set() Dataset

Get test set

Returns:

test Dataset

get_train_labels() Collection[Any]

Get train set labels, or None if no training labels provided

Returns:

training labels

get_train_predictions() Collection[Any]

Get train set predictions, or None if no training predictions provided

Returns:

training predictions

get_train_samples() Collection[Any]

Get train set samples, or None if no training data provided

Returns:

training samples

get_train_set() Dataset

Get training set

Returns:

training Dataset
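
The accessor pattern above can be sketched as follows. This is an illustration of the documented behaviour, not apt's code: each accessor delegates to the wrapped train/test Dataset and returns None when that set was not provided.

```python
import numpy as np

# Tiny stand-in dataset, used only for this illustration.
class SimpleDataset:
    def __init__(self, x, y=None):
        self._x = np.asarray(x)
        self._y = None if y is None else np.asarray(y)
    def get_samples(self):
        return self._x
    def get_labels(self):
        return self._y

# Mirrors the documented Data container semantics.
class MiniData:
    def __init__(self, train=None, test=None):
        self._train, self._test = train, test
    def get_train_samples(self):
        return None if self._train is None else self._train.get_samples()
    def get_train_labels(self):
        return None if self._train is None else self._train.get_labels()
    def get_test_samples(self):
        return None if self._test is None else self._test.get_samples()

data = MiniData(train=SimpleDataset([[1], [2]], [0, 1]))
print(data.get_train_samples().tolist())  # [[1], [2]]
print(data.get_test_samples())            # None (no test set provided)
```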

class apt.utils.datasets.datasets.Dataset(**kwargs)

Bases: object

Base Abstract Class for Dataset

abstract get_labels() Collection[Any]

Return labels

Returns:

the labels

abstract get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

abstract get_samples() Collection[Any]

Return data samples

Returns:

the data samples

class apt.utils.datasets.datasets.DatasetFactory

Bases: object

Factory class for dataset creation

classmethod create_dataset(name: str, **kwargs) Dataset

Factory command to create dataset instance.

This method gets the appropriate Dataset class from the registry and creates an instance of it, while passing in the parameters given in kwargs.

Parameters:
  • name (string) – The name of the dataset to create.

  • kwargs (keyword arguments as expected by the class) – dataset parameters

Returns:

An instance of the dataset that is created.

classmethod register(name: str) Callable

Class method to register Dataset to the internal registry

Parameters:

name (string) – dataset name

Returns:

a Callable that returns the registered dataset class

registry = {}

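
The register/create_dataset pattern described above can be sketched in a few lines: a class-method decorator stores Dataset subclasses in the registry keyed by name, and the factory looks the class up and instantiates it with the given kwargs. The class and dataset names below are illustrative, not apt's own:

```python
# Sketch of the documented factory pattern (names are illustrative).
class MiniDatasetFactory:
    registry = {}

    @classmethod
    def register(cls, name):
        # Decorator that records a dataset class under the given name.
        def wrapper(dataset_cls):
            cls.registry[name] = dataset_cls
            return dataset_cls
        return wrapper

    @classmethod
    def create_dataset(cls, name, **kwargs):
        # Look the class up in the registry and instantiate it.
        if name not in cls.registry:
            raise ValueError(f"Unknown dataset: {name}")
        return cls.registry[name](**kwargs)

@MiniDatasetFactory.register("toy")
class ToyDataset:
    def __init__(self, size=3):
        self.samples = list(range(size))

ds = MiniDatasetFactory.create_dataset("toy", size=5)
print(ds.samples)  # [0, 1, 2, 3, 4]
```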
class apt.utils.datasets.datasets.DatasetWithPredictions(pred: ndarray | DataFrame | List | Tensor | csr_matrix, x: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, features_names: list | None = None, **kwargs)

Bases: Dataset

Dataset that is based on arrays (e.g., numpy/pandas/list…). Includes predictions from a model, and possibly also features and true labels.

Parameters:
  • pred (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of model predictions

  • x (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

  • features_names (list of strings, optional) – The feature names, in the order that they appear in the data

get_labels() ndarray

Get labels

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_samples() ndarray

Get data samples

Returns:

data samples as numpy array

class apt.utils.datasets.datasets.PytorchData(x: ndarray | DataFrame | List | Tensor | csr_matrix, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, **kwargs)

Bases: Dataset

Dataset for pytorch models.

Parameters:
  • x (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

get_item(idx: int) Tensor

Get the sample and label according to the given index

Parameters:

idx (int) – the index of the sample to return

Returns:

the sample and label as pytorch Tensors. Returned as a tuple (sample, label)

get_labels() ndarray

Get labels.

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_sample_item(idx: int) Tensor

Get the sample according to the given index

Parameters:

idx (int) – the index of the sample to return

Returns:

the sample as a pytorch Tensor

get_samples() ndarray

Get data samples.

Returns:

samples as numpy array

class apt.utils.datasets.datasets.StoredDataset(**kwargs)

Bases: Dataset

Abstract Class for a Dataset that can be downloaded from a URL and stored in a file

static download(url: str, dest_path: str, filename: str, unzip: bool | None = False) None

Download the dataset from URL

Parameters:
  • url (string) – dataset URL, the dataset will be requested from this URL

  • dest_path (string) – local dataset destination path

  • filename (string) – local dataset filename

  • unzip (boolean, optional) – whether to extract the downloaded archive. Default is False.

Returns:

None

static extract_archive(zip_path: str, dest_path: str | None = None, remove_archive: bool | None = False)

Extract dataset from archived file

Parameters:
  • zip_path (string) – path to archived file

  • dest_path (string, optional) – directory path to uncompress the file to

  • remove_archive (boolean, optional) – whether to remove the archive file after extraction. Default is False.

Returns:

None
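
A self-contained sketch of the documented extract_archive behaviour (not apt's implementation): extract next to the archive when no destination is given, and optionally delete the archive afterwards.

```python
import os
import tempfile
import zipfile

def extract_archive_sketch(zip_path, dest_path=None, remove_archive=False):
    # Extract beside the archive when no destination directory is given.
    dest = dest_path or os.path.dirname(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    if remove_archive:
        os.remove(zip_path)

# Round-trip demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    extract_archive_sketch(archive, remove_archive=True)
    print(os.path.exists(os.path.join(tmp, "data.csv")))  # True
    print(os.path.exists(archive))                        # False
```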

abstract load(**kwargs)

Load dataset

Returns:

None

abstract load_from_file(path: str)

Load dataset from file

Parameters:

path (string) – the path to the file

Returns:

None

static split_debug(datafile: str, dest_datafile: str, ratio: int, shuffle: bool | None = True, delimiter: str | None = ',', fmt: str | list | None = None) None

Split the data and take only a part of it

Parameters:
  • datafile (string) – dataset file path

  • dest_datafile (string) – destination path for the partial dataset file

  • ratio (int) – part of the dataset to save

  • shuffle (boolean, optional) – whether to shuffle the data or not. Default is True.

  • delimiter (string, optional) – dataset delimiter. Default is ",".

  • fmt (string or sequence of strings, optional) – format used when saving the data, as defined by numpy.savetxt(). Default is None.

Returns:

None
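
The split-and-keep-a-part behaviour can be sketched with the stdlib alone. This is an illustration, not apt's code; in particular, `ratio` is treated here as the number of rows to keep, while the exact interpretation of apt's int `ratio` may differ.

```python
import csv
import os
import random
import tempfile

def split_debug_sketch(datafile, dest_datafile, ratio, shuffle=True,
                       delimiter=","):
    # Read all rows, optionally shuffle, then keep only the first `ratio`
    # rows (illustrative interpretation of the documented int parameter).
    with open(datafile, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    if shuffle:
        random.shuffle(rows)
    with open(dest_datafile, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows[:ratio])

# Demonstration on a small temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.csv")
    dst = os.path.join(tmp, "part.csv")
    with open(src, "w") as f:
        f.write("1,a\n2,b\n3,c\n4,d\n")
    split_debug_sketch(src, dst, ratio=2, shuffle=False)
    with open(dst) as f:
        print(sum(1 for _ in f))  # 2
```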

apt.utils.datasets.datasets.array2numpy(arr: ndarray | DataFrame | List | Tensor | csr_matrix) ndarray

Converts from INPUT_DATA_ARRAY_TYPE to a numpy array

apt.utils.datasets.datasets.array2torch_tensor(arr: ndarray | DataFrame | List | Tensor | csr_matrix) Tensor

Converts from INPUT_DATA_ARRAY_TYPE to a pytorch Tensor
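
The conversion helpers above amount to dispatching on the input type. A hedged sketch of the numpy-bound direction (the pandas and scipy branches are duck-typed here, and the pytorch branch is omitted so the example stays dependency-free):

```python
import numpy as np

def array2numpy_sketch(arr):
    # Normalise a supported input type to a numpy array. The real apt
    # helper also covers pytorch Tensors; this sketch handles only the
    # numpy, scipy-sparse, pandas and plain-sequence cases.
    if isinstance(arr, np.ndarray):
        return arr
    if hasattr(arr, "toarray"):    # scipy sparse matrix (csr_matrix)
        return arr.toarray()
    if hasattr(arr, "to_numpy"):   # pandas DataFrame/Series
        return arr.to_numpy()
    return np.asarray(arr)         # plain list/tuple

print(array2numpy_sketch([[1, 2], [3, 4]]).shape)  # (2, 2)
```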

Module contents

The AI Privacy Toolbox (datasets). Implementation of datasets utility components for datasets creation, load, and store