p2pfl.learning.dataset.partition_strategies module

Data partitioning strategies for P2PFL Datasets.

class p2pfl.learning.dataset.partition_strategies.DataPartitionStrategy

Bases: object

Abstract class for defining data partitioning strategies in federated learning.

This class provides a common interface for generating partitions of a dataset, which can be used to simulate different data distributions across clients.

abstract static generate_partitions(train_data, test_data, num_partitions, **kwargs)

Generate partitions of the dataset based on the specific strategy.

Parameters:

train_data (Dataset) – The training Dataset object to partition.
test_data (Dataset) – The test Dataset object to partition.
num_partitions (int) – The number of partitions to create.
**kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists.
The first list contains the lists of indices for the training data partitions.
The second list contains the lists of indices for the test data partitions.

Return type:

tuple of two lists of lists
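As a rough illustration of the interface described above, the following sketch defines a stand-in abstract class and one trivial strategy. `PartitionStrategySketch` and `ContiguousSketch` are hypothetical names invented for this example and are not part of p2pfl. The sketch also assumes the datasets only need to support `len()`.

```python
from abc import ABC, abstractmethod


class PartitionStrategySketch(ABC):
    """Hypothetical stand-in for the interface above (not the p2pfl class)."""

    @staticmethod
    @abstractmethod
    def generate_partitions(train_data, test_data, num_partitions, **kwargs):
        """Return (train_partitions, test_partitions) as lists of index lists."""


class ContiguousSketch(PartitionStrategySketch):
    """Illustrative strategy: split each dataset into contiguous index blocks."""

    @staticmethod
    def generate_partitions(train_data, test_data, num_partitions, **kwargs):
        def split(n):
            # Spread the remainder over the first partitions so that
            # partition sizes differ by at most one.
            base, extra = divmod(n, num_partitions)
            parts, start = [], 0
            for i in range(num_partitions):
                size = base + (1 if i < extra else 0)
                parts.append(list(range(start, start + size)))
                start += size
            return parts

        # Assumption: len() is enough to enumerate sample indices.
        return split(len(train_data)), split(len(test_data))


train_parts, test_parts = ContiguousSketch.generate_partitions(
    list(range(10)), list(range(4)), num_partitions=3
)
```

Both returned values have one inner list per partition, and each inner list holds the sample indices assigned to that partition.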

class p2pfl.learning.dataset.partition_strategies.DirichletPartitionStrategy

Bases: DataPartitionStrategy

Data partition strategy based on the Dirichlet distribution.

It assigns data to different partitions (clients) so that the distribution of classes in each partition follows a Dirichlet distribution, where alpha determines the concentration of the distribution.

Inspired by the Flower implementation. Thank you so much for taking FL to another level :) Original implementation: https://github.com/adap/flower/blob/main/datasets/flwr_datasets/partitioner/dirichlet_partitioner.py

classmethod generate_partitions(train_data, test_data, num_partitions, label_tag='label', alpha=1, min_partition_size=2, self_balancing=False, **kwargs)

Generate partitions of the dataset using Dirichlet distribution.

It divides the data into partitions so that the distribution of classes in each partition follows a Dirichlet distribution controlled by the alpha parameter.

Parameters:

train_data (Dataset) – The training Dataset object to partition.
test_data (Dataset) – The test Dataset object to partition.
num_partitions (int) – The number of partitions to create.
label_tag (str) – The name of the column containing the labels.
alpha (Union[int, float, list[float]]) – The concentration parameter(s) of the Dirichlet distribution. Smaller values produce more skewed class distributions across partitions.
min_partition_size (int) – The minimum partition size allowed in train and test.
self_balancing (bool) – Whether to balance the partition sizes. Balancing is done by not allowing further label values to go into partitions that are already overly large.
shuffle – Whether to shuffle the indices.
**kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists.
The first list contains the lists of indices for the training data partitions.
The second list contains the lists of indices for the test data partitions.

Return type:

tuple of two lists of lists
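The core idea can be sketched with the standard library alone. This is a simplified illustration and not the p2pfl or Flower implementation. For each label it draws Dirichlet proportions (as normalized Gamma draws) and cuts that label's shuffled indices accordingly. It ignores `min_partition_size` and `self_balancing`, and the names `dirichlet_sample` and `dirichlet_indices` are made up for this example.

```python
import random


def dirichlet_sample(alpha, k, rng):
    # A symmetric Dirichlet draw is a vector of independent
    # Gamma(alpha, 1) draws divided by their sum.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_indices(labels, num_partitions, alpha=1.0, seed=0):
    rng = random.Random(seed)
    parts = [[] for _ in range(num_partitions)]
    for label in sorted(set(labels)):
        # Collect and shuffle the indices of all samples with this label.
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        props = dirichlet_sample(alpha, num_partitions, rng)
        start, cum = 0, 0.0
        for p in range(num_partitions):
            cum += props[p]
            # The last partition takes whatever is left, so no index is dropped.
            if p == num_partitions - 1:
                end = len(idx)
            else:
                end = max(start, round(cum * len(idx)))
            parts[p].extend(idx[start:end])
            start = end
    return parts


labels = [i % 3 for i in range(30)]  # toy labels: three classes, ten samples each
parts = dirichlet_indices(labels, num_partitions=4, alpha=0.5)
```

With a small `alpha`, most of a label's samples tend to land in a few partitions. With a large `alpha`, each label is spread more evenly.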

class p2pfl.learning.dataset.partition_strategies.LabelSkewedPartitionStrategy

Bases: DataPartitionStrategy

Partitions the dataset by grouping samples with the same label, resulting in a non-IID distribution.

This is generally considered the “worst-case” scenario for federated learning.

static generate_partitions(train_data, test_data, num_partitions, label_tag='label', **kwargs)

Generate partitions of the dataset by grouping samples with the same label.

Parameters:

train_data (Dataset) – The training Dataset object to partition.
test_data (Dataset) – The test Dataset object to partition.
num_partitions (int) – The number of partitions to create.
label_tag (str) – The name of the column containing the labels.
**kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists.
The first list contains the lists of indices for the training data partitions.
The second list contains the lists of indices for the test data partitions.

Return type:

tuple of two lists of lists
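One common way to group samples by label, shown here only as an assumption about how such a strategy might work, is to sort the indices by label and cut the sorted sequence into contiguous chunks. The function name `label_skewed_indices` is hypothetical and the labels are toy data.

```python
def label_skewed_indices(labels, num_partitions):
    # Stable sort by label, so samples with the same label end up adjacent.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    base, extra = divmod(len(order), num_partitions)
    parts, start = [], 0
    for p in range(num_partitions):
        # The first `extra` partitions get one additional sample.
        size = base + (1 if p < extra else 0)
        parts.append(order[start:start + size])
        start += size
    return parts


labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # toy labels, three per class
parts = label_skewed_indices(labels, num_partitions=3)
```

In this toy case each partition ends up holding a single class, which is the extreme non-IID situation the class description refers to.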

class p2pfl.learning.dataset.partition_strategies.PercentageBasedNonIIDPartitionStrategy

Bases: DataPartitionStrategy

Not implemented yet.

class p2pfl.learning.dataset.partition_strategies.RandomIIDPartitionStrategy

Bases: DataPartitionStrategy

Partition the dataset randomly, resulting in an IID distribution of data across clients.

static generate_partitions(train_data, test_data, num_partitions, **kwargs)

Generate partitions of the dataset using random sampling.

Parameters:

train_data (Dataset) – The training Dataset object to partition.
test_data (Dataset) – The test Dataset object to partition.
num_partitions (int) – The number of partitions to create.
**kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists.
The first list contains the lists of indices for the training data partitions.
The second list contains the lists of indices for the test data partitions.

Return type:

tuple of two lists of lists
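Random IID partitioning can be sketched as a shuffle followed by an even split. The snippet below is a minimal illustration using only the standard library. `random_iid_indices` and its `seed` parameter are assumptions made for this example and do not come from the p2pfl API.

```python
import random


def random_iid_indices(num_samples, num_partitions, seed=0):
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # Striding deals the shuffled indices round-robin, so partition
    # sizes differ by at most one.
    return [indices[i::num_partitions] for i in range(num_partitions)]


train_parts = random_iid_indices(10, num_partitions=3)
test_parts = random_iid_indices(4, num_partitions=3, seed=1)
```

Because every index is equally likely to land in any partition, each partition's label distribution approximates that of the full dataset when the dataset is large enough.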