p2pfl.learning.dataset.partition_strategies module

Data partitioning strategies for P2PFL Datasets.

class p2pfl.learning.dataset.partition_strategies.DataPartitionStrategy

Bases: object

Abstract class for defining data partitioning strategies in federated learning.

This class provides a common interface for generating partitions of a dataset, which can be used to simulate different data distributions across clients.

abstract static generate_partitions(train_data, test_data, num_partitions, **kwargs)

Generate partitions of the dataset based on the specific strategy.

Parameters:
  • train_data (Dataset) – The training Dataset object to partition.

  • test_data (Dataset) – The test Dataset object to partition.

  • num_partitions (int) – The number of partitions to create.

  • **kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists:

  • The first list contains lists of indices for the training data partitions.

  • The second list contains lists of indices for the test data partitions.
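The shape of the interface can be illustrated with a small stand-alone sketch. It does not import p2pfl: `PartitionStrategySketch` and `RoundRobinSketch` are hypothetical names, and the datasets are assumed to be anything that supports `len()`. It only shows the documented contract, namely one list of indices per partition for the training data and one for the test data.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Sequence, Tuple


class PartitionStrategySketch(ABC):
    """Hypothetical stand-in mirroring the documented interface (not p2pfl code)."""

    @staticmethod
    @abstractmethod
    def generate_partitions(
        train_data: Sequence[Any],
        test_data: Sequence[Any],
        num_partitions: int,
        **kwargs: Any,
    ) -> Tuple[List[List[int]], List[List[int]]]:
        """Return (train index lists, test index lists), one list per partition."""


class RoundRobinSketch(PartitionStrategySketch):
    """Illustrative strategy: sample i goes to partition i % num_partitions."""

    @staticmethod
    def generate_partitions(train_data, test_data, num_partitions, **kwargs):
        def split(n_samples: int) -> List[List[int]]:
            # Partition p receives indices p, p + num_partitions, p + 2 * num_partitions, ...
            return [list(range(p, n_samples, num_partitions)) for p in range(num_partitions)]

        return split(len(train_data)), split(len(test_data))


# Assumption: plain lists stand in for the Dataset objects.
train_parts, test_parts = RoundRobinSketch.generate_partitions(list("abcdefg"), list("xyz"), 3)
print(train_parts)  # [[0, 3, 6], [1, 4], [2, 5]]
print(test_parts)   # [[0], [1], [2]]
```

A real strategy would subclass `DataPartitionStrategy` itself and read whatever it needs from the p2pfl `Dataset` objects.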

class p2pfl.learning.dataset.partition_strategies.DirichletPartitionStrategy

Bases: DataPartitionStrategy

Data partition strategy based on the Dirichlet distribution.

It assigns data to different partitions (clients) so that the distribution of classes in each partition follows a Dirichlet distribution, where alpha determines the concentration of the distribution.

Inspired by the Flower implementation. Thank you so much for taking FL to another level :) Original implementation: https://github.com/adap/flower/blob/main/datasets/flwr_datasets/partitioner/dirichlet_partitioner.py

classmethod generate_partitions(train_data, test_data, num_partitions, seed=666, label_tag='label', alpha=1, min_partition_size=2, self_balancing=False, **kwargs)

Generate partitions of the dataset using Dirichlet distribution.

It divides the data into partitions so that the distribution of classes in each partition follows a Dirichlet distribution controlled by the alpha parameter.

Parameters:
  • train_data (Dataset) – The training Dataset object to partition.

  • test_data (Dataset) – The test Dataset object to partition.

  • num_partitions (int) – The number of partitions to create.

  • seed (int) – The random seed to use for reproducibility.

  • label_tag (str) – The name of the column containing the labels.

  • alpha (Union[int, float, list[float]]) – The concentration parameter(s) of the Dirichlet distribution. Smaller values yield more skewed class distributions within each partition.

  • min_partition_size (int) – The minimum partition size allowed in train and test.

  • self_balancing (bool) – Whether to balance partition sizes. Balancing works by preventing further label values from being assigned to partitions that are already overly large.

  • shuffle – Whether to shuffle the indices.

  • **kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists:

  • The first list contains lists of indices for the training data partitions.

  • The second list contains lists of indices for the test data partitions.
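The core idea can be sketched with the standard library alone. This is a simplified illustration, not the p2pfl or Flower code: `dirichlet_partition_sketch` and `dirichlet_sample` are hypothetical names, labels are assumed to be a plain list of integers, a single scalar `alpha` is assumed, and the `min_partition_size` and `self_balancing` behaviour is left out. For each class, a proportion vector over partitions is drawn from a Dirichlet distribution and that class's indices are cut accordingly.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_sample(rng: random.Random, alpha: Sequence[float]) -> List[float]:
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_partition_sketch(
    labels: Sequence[int], num_partitions: int, alpha: float = 1.0, seed: int = 666
) -> List[List[int]]:
    """Split sample indices so each class is spread over partitions by a Dirichlet draw."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_label: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(rng, [alpha] * num_partitions)

        # Turn cumulative shares into cut points over this class's indices.
        start, cumulative = 0, 0.0
        for p, share in enumerate(shares):
            cumulative += share
            if p == num_partitions - 1:
                end = len(indices)  # last partition takes whatever remains
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            partitions[p].extend(indices[start:end])
            start = end
    return partitions


# Assumption: 60 samples with three evenly represented classes.
example_labels = [i % 3 for i in range(60)]
parts = dirichlet_partition_sketch(example_labels, num_partitions=4, alpha=0.5)
print([len(p) for p in parts])
```

With a small `alpha` the per-class shares tend to concentrate on a few partitions; with a large `alpha` they approach an even split.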

class p2pfl.learning.dataset.partition_strategies.LabelSkewedPartitionStrategy

Bases: DataPartitionStrategy

Partitions the dataset by grouping samples with the same label, resulting in a non-IID distribution.

This is generally considered the "worst-case" scenario for federated learning.

static generate_partitions(train_data, test_data, num_partitions, seed=666, label_tag='label', **kwargs)

Generate partitions of the dataset by grouping samples with the same label.

Parameters:
  • train_data (Dataset) – The training Dataset object to partition.

  • test_data (Dataset) – The test Dataset object to partition.

  • num_partitions (int) – The number of partitions to create.

  • seed (int) – The random seed to use for reproducibility.

  • label_tag (str) – The name of the column containing the labels.

  • **kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists:

  • The first list contains lists of indices for the training data partitions.

  • The second list contains lists of indices for the test data partitions.
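One plausible way to group samples by label is to sort the indices by label and cut the sorted list into contiguous chunks. The sketch below takes that approach; it is an assumption about the general technique rather than a copy of the p2pfl code, `label_skewed_partition_sketch` is a hypothetical name, and labels are assumed to be a plain list of integers.

```python
import random
from typing import List, Sequence


def label_skewed_partition_sketch(
    labels: Sequence[int], num_partitions: int, seed: int = 666
) -> List[List[int]]:
    """Sort indices by label, then cut them into contiguous, near-equal chunks."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)                   # randomise the order within each label
    indices.sort(key=lambda i: labels[i])  # stable sort keeps equal labels adjacent

    base, extra = divmod(len(indices), num_partitions)
    partitions: List[List[int]] = []
    start = 0
    for p in range(num_partitions):
        size = base + (1 if p < extra else 0)  # spread the remainder over the first chunks
        partitions.append(indices[start:start + size])
        start += size
    return partitions


# Assumption: 12 samples, three classes with four samples each.
skew_labels = [0, 1, 2] * 4
skew_parts = label_skewed_partition_sketch(skew_labels, num_partitions=3)
print([sorted({skew_labels[i] for i in part}) for part in skew_parts])  # [[0], [1], [2]]
```

When the number of partitions does not line up with the class counts, some partitions will straddle two labels, but each still sees only a narrow slice of the label space.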

class p2pfl.learning.dataset.partition_strategies.PercentageBasedNonIIDPartitionStrategy

Bases: DataPartitionStrategy

Not implemented yet.

class p2pfl.learning.dataset.partition_strategies.RandomIIDPartitionStrategy

Bases: DataPartitionStrategy

Partition the dataset randomly, resulting in an IID distribution of data across clients.

static generate_partitions(train_data, test_data, num_partitions, seed=666, **kwargs)

Generate partitions of the dataset using random sampling.

Parameters:
  • train_data (Dataset) – The training Dataset object to partition.

  • test_data (Dataset) – The test Dataset object to partition.

  • num_partitions (int) – The number of partitions to create.

  • seed (int) – The random seed to use for reproducibility.

  • **kwargs – Additional keyword arguments that may be required by specific strategies.

Returns:

A tuple containing two lists of lists:

  • The first list contains lists of indices for the training data partitions.

  • The second list contains lists of indices for the test data partitions.
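Random IID partitioning amounts to shuffling the sample indices with a seeded generator and dealing them out evenly. The following is a minimal sketch under that assumption; `random_iid_partition_sketch` is a hypothetical helper, not the p2pfl function, and it takes a sample count instead of a `Dataset`.

```python
import random
from typing import List


def random_iid_partition_sketch(
    num_samples: int, num_partitions: int, seed: int = 666
) -> List[List[int]]:
    """Shuffle all indices with a seeded RNG and deal them out round-robin."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    # Partition p takes every num_partitions-th shuffled index, starting at offset p.
    return [indices[p::num_partitions] for p in range(num_partitions)]


iid_parts = random_iid_partition_sketch(num_samples=10, num_partitions=3)
print([len(p) for p in iid_parts])  # [4, 3, 3]
```

Calling the helper once for the training set size and once for the test set size would yield the pair of index lists described above.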