apt.utils.datasets package

Submodules

apt.utils.datasets.datasets module

The AI Privacy Toolbox (datasets). Implementation of utility classes for dataset handling

class apt.utils.datasets.datasets.ArrayDataset(x: ndarray | DataFrame | List | Tensor | csr_matrix, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, features_names: list | None = None, **kwargs)

Bases: Dataset

Dataset that is based on x and y arrays (e.g., numpy/pandas/list…)

Parameters:
  • x (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

  • features_names (list of strings, optional) – The feature names, in the order that they appear in the data

get_labels() ndarray

Get labels

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_samples() ndarray

Get data samples

Returns:

data samples as numpy array
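
The behaviour described above can be illustrated with a minimal stand-alone sketch. This is not apt's implementation — the real ArrayDataset also accepts pandas DataFrames, pytorch Tensors and scipy csr matrices — but it mirrors the documented interface of normalising inputs to numpy arrays:

```python
import numpy as np

# Minimal stand-in mirroring the documented ArrayDataset interface.
# Illustration only: the real apt class also handles pandas DataFrames,
# pytorch Tensors and scipy csr matrices.
class MiniArrayDataset:
    def __init__(self, x, y=None, features_names=None):
        self._x = np.asarray(x)                         # normalise samples to numpy
        self._y = None if y is None else np.asarray(y)  # labels are optional
        self.features_names = features_names

    def get_samples(self):
        return self._x

    def get_labels(self):
        return self._y

ds = MiniArrayDataset([[1, 2], [3, 4]], y=[0, 1], features_names=["a", "b"])
print(ds.get_samples().shape)    # (2, 2)
print(ds.get_labels().tolist())  # [0, 1]
```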

class apt.utils.datasets.datasets.Data(train: Dataset | None = None, test: Dataset | None = None, **kwargs)

Bases: object

Class for storing train and test datasets.

Parameters:
  • train (Dataset) – the training set

  • test (Dataset, optional) – the test set

get_test_labels() Collection[Any]

Get test set labels

Returns:

test labels, or None if no test labels provided

get_test_predictions() Collection[Any]

Get test set predictions, or None if no test predictions provided

Returns:

test predictions

get_test_samples() Collection[Any]

Get test set samples

Returns:

test samples, or None if no test data provided

get_test_set() Dataset

Get test set

Returns:

test Dataset

get_train_labels() Collection[Any]

Get train set labels, or None if no training labels provided

Returns:

training labels

get_train_predictions() Collection[Any]

Get train set predictions, or None if no training predictions provided

Returns:

training predictions

get_train_samples() Collection[Any]

Get train set samples, or None if no training data provided

Returns:

training samples

get_train_set() Dataset

Get training set

Returns:

training Dataset
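
The accessor pattern above can be sketched as follows. This is an illustration of the documented behaviour, not apt's code: each accessor delegates to the wrapped train/test Dataset and returns None when that set was not provided.

```python
import numpy as np

# Tiny stand-in dataset, used only for this illustration.
class SimpleDataset:
    def __init__(self, x, y=None):
        self._x = np.asarray(x)
        self._y = None if y is None else np.asarray(y)
    def get_samples(self):
        return self._x
    def get_labels(self):
        return self._y

# Mirrors the documented Data container semantics.
class MiniData:
    def __init__(self, train=None, test=None):
        self._train, self._test = train, test
    def get_train_samples(self):
        return None if self._train is None else self._train.get_samples()
    def get_train_labels(self):
        return None if self._train is None else self._train.get_labels()
    def get_test_samples(self):
        return None if self._test is None else self._test.get_samples()

data = MiniData(train=SimpleDataset([[1], [2]], [0, 1]))
print(data.get_train_samples().tolist())  # [[1], [2]]
print(data.get_test_samples())            # None (no test set provided)
```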

class apt.utils.datasets.datasets.Dataset(**kwargs)

Bases: object

Base Abstract Class for Dataset

abstract get_labels() Collection[Any]

Return labels

Returns:

the labels

abstract get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

abstract get_samples() Collection[Any]

Return data samples

Returns:

the data samples

class apt.utils.datasets.datasets.DatasetFactory

Bases: object

Factory class for dataset creation

classmethod create_dataset(name: str, **kwargs) Dataset

Factory command to create dataset instance.

This method gets the appropriate Dataset class from the registry and creates an instance of it, while passing in the parameters given in kwargs.

Parameters:
  • name (string) – The name of the dataset to create.

  • kwargs (keyword arguments as expected by the class) – dataset parameters

Returns:

An instance of the dataset that is created.

classmethod register(name: str) Callable

Class method to register Dataset to the internal registry

Parameters:

name (string) – dataset name

Returns:

a Callable that returns the registered dataset class

registry = {}

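
The register/create_dataset pattern described above can be sketched in a few lines: a class-method decorator stores Dataset subclasses in the registry keyed by name, and the factory looks the class up and instantiates it with the given kwargs. The class and dataset names below are illustrative, not apt's own:

```python
# Sketch of the documented factory pattern (names are illustrative).
class MiniDatasetFactory:
    registry = {}

    @classmethod
    def register(cls, name):
        # Decorator that records a dataset class under the given name.
        def wrapper(dataset_cls):
            cls.registry[name] = dataset_cls
            return dataset_cls
        return wrapper

    @classmethod
    def create_dataset(cls, name, **kwargs):
        # Look the class up in the registry and instantiate it.
        if name not in cls.registry:
            raise ValueError(f"Unknown dataset: {name}")
        return cls.registry[name](**kwargs)

@MiniDatasetFactory.register("toy")
class ToyDataset:
    def __init__(self, size=3):
        self.samples = list(range(size))

ds = MiniDatasetFactory.create_dataset("toy", size=5)
print(ds.samples)  # [0, 1, 2, 3, 4]
```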
class apt.utils.datasets.datasets.DatasetWithPredictions(pred: ndarray | DataFrame | List | Tensor | csr_matrix, x: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, features_names: list | None = None, **kwargs)

Bases: Dataset

Dataset that is based on arrays (e.g., numpy/pandas/list…). Includes predictions from a model, and possibly also features and true labels.

Parameters:
  • pred (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of model predictions

  • x (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

  • features_names (list of strings, optional) – The feature names, in the order that they appear in the data

get_labels() ndarray

Get labels

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_samples() ndarray

Get data samples

Returns:

data samples as numpy array

class apt.utils.datasets.datasets.PytorchData(x: ndarray | DataFrame | List | Tensor | csr_matrix, y: ndarray | DataFrame | List | Tensor | csr_matrix | None = None, **kwargs)

Bases: Dataset

Dataset for pytorch models.

Parameters:
  • x (numpy array or pandas DataFrame or list or pytorch Tensor) – collection of data samples

  • y (numpy array or pandas DataFrame or list or pytorch Tensor, optional) – collection of labels

get_item(idx: int) Tensor

Get the sample and label according to the given index

Parameters:

idx (int) – the index of the sample to return

Returns:

the sample and label as pytorch Tensors. Returned as a tuple (sample, label)

get_labels() ndarray

Get labels.

Returns:

labels as numpy array

get_predictions() ndarray

Get predictions

Returns:

predictions as numpy array

get_sample_item(idx: int) Tensor

Get the sample according to the given index

Parameters:

idx (int) – the index of the sample to return

Returns:

the sample as a pytorch Tensor

get_samples() ndarray

Get data samples.

Returns:

samples as numpy array

class apt.utils.datasets.datasets.StoredDataset(**kwargs)

Bases: Dataset

Abstract Class for a Dataset that can be downloaded from a URL and stored in a file

static download(url: str, dest_path: str, filename: str, unzip: bool | None = False) None

Download the dataset from URL

Parameters:
  • url (string) – dataset URL, the dataset will be requested from this URL

  • dest_path (string) – local dataset destination path

  • filename (string) – local dataset filename

  • unzip (boolean, optional) – whether to extract the downloaded archive. Default is False.

Returns:

None

static extract_archive(zip_path: str, dest_path: str | None = None, remove_archive: bool | None = False)

Extract dataset from archived file

Parameters:
  • zip_path (string) – path to archived file

  • dest_path (string, optional) – directory path to uncompress the file to

  • remove_archive (boolean, optional) – whether to remove the archive file after extraction. Default is False.

Returns:

None
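
A self-contained sketch of the documented extract_archive behaviour (not apt's implementation): extract next to the archive when no destination is given, and optionally delete the archive afterwards.

```python
import os
import tempfile
import zipfile

def extract_archive_sketch(zip_path, dest_path=None, remove_archive=False):
    # Extract beside the archive when no destination directory is given.
    dest = dest_path or os.path.dirname(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    if remove_archive:
        os.remove(zip_path)

# Round-trip demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    extract_archive_sketch(archive, remove_archive=True)
    print(os.path.exists(os.path.join(tmp, "data.csv")))  # True
    print(os.path.exists(archive))                        # False
```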

abstract load(**kwargs)

Load dataset

Returns:

None

abstract load_from_file(path: str)

Load dataset from file

Parameters:

path (string) – the path to the file

Returns:

None

static split_debug(datafile: str, dest_datafile: str, ratio: int, shuffle: bool | None = True, delimiter: str | None = ',', fmt: str | list | None = None) None

Split the data and take only a part of it

Parameters:
  • datafile (string) – dataset file path

  • dest_datafile (string) – destination path for the partial dataset file

  • ratio (int) – part of the dataset to save

  • shuffle (boolean, optional) – whether to shuffle the data or not. Default is True.

  • delimiter (string, optional) – dataset delimiter. Default is ",".

  • fmt (string or sequence of strings, optional) – format used when saving the data, as defined by numpy.savetxt(). Default is None.

Returns:

None
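
The split-and-keep-a-part behaviour can be sketched with the stdlib alone. This is an illustration, not apt's code; in particular, `ratio` is treated here as the number of rows to keep, while the exact interpretation of apt's int `ratio` may differ.

```python
import csv
import os
import random
import tempfile

def split_debug_sketch(datafile, dest_datafile, ratio, shuffle=True,
                       delimiter=","):
    # Read all rows, optionally shuffle, then keep only the first `ratio`
    # rows (illustrative interpretation of the documented int parameter).
    with open(datafile, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    if shuffle:
        random.shuffle(rows)
    with open(dest_datafile, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows[:ratio])

# Demonstration on a small temporary CSV file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.csv")
    dst = os.path.join(tmp, "part.csv")
    with open(src, "w") as f:
        f.write("1,a\n2,b\n3,c\n4,d\n")
    split_debug_sketch(src, dst, ratio=2, shuffle=False)
    with open(dst) as f:
        print(sum(1 for _ in f))  # 2
```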

apt.utils.datasets.datasets.array2numpy(arr: ndarray | DataFrame | List | Tensor | csr_matrix) ndarray

Converts from INPUT_DATA_ARRAY_TYPE to a numpy array

apt.utils.datasets.datasets.array2torch_tensor(arr: ndarray | DataFrame | List | Tensor | csr_matrix) Tensor

Converts from INPUT_DATA_ARRAY_TYPE to a pytorch Tensor
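
The conversion helpers above amount to dispatching on the input type. A hedged sketch of the numpy-bound direction (the pandas and scipy branches are duck-typed here, and the pytorch branch is omitted so the example stays dependency-free):

```python
import numpy as np

def array2numpy_sketch(arr):
    # Normalise a supported input type to a numpy array. The real apt
    # helper also covers pytorch Tensors; this sketch handles only the
    # numpy, scipy-sparse, pandas and plain-sequence cases.
    if isinstance(arr, np.ndarray):
        return arr
    if hasattr(arr, "toarray"):    # scipy sparse matrix (csr_matrix)
        return arr.toarray()
    if hasattr(arr, "to_numpy"):   # pandas DataFrame/Series
        return arr.to_numpy()
    return np.asarray(arr)         # plain list/tuple

print(array2numpy_sketch([[1, 2], [3, 4]]).shape)  # (2, 2)
```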

Module contents

The AI Privacy Toolbox (datasets). Implementation of datasets utility components for datasets creation, load, and store