p2pfl.learning.dataset.p2pfl_dataset module

P2PFL dataset abstraction.

class p2pfl.learning.dataset.p2pfl_dataset.DataExportStrategy

Bases: ABC

Abstract base class for export strategies.

abstract static export(data, transforms=None, **kwargs)

Export the data using the specific strategy.

Parameters:
  • data (Dataset) – The data to export.

  • transforms (Optional[Callable]) – The transforms to apply to the data.

  • **kwargs – Additional keyword arguments for the export strategy.

Return type:

Any

Returns:

The exported data.
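
As a minimal sketch of the contract above, the following defines one concrete strategy against a local stand-in base class. The `DataExportStrategy` defined here only mirrors the documented signature so that the snippet runs without p2pfl installed, `ListExportStrategy` and its `limit` option are hypothetical names, and plain lists of dictionaries stand in for the `Dataset` the real method receives.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class DataExportStrategy(ABC):
    """Local stand-in mirroring the documented interface (assumption)."""

    @staticmethod
    @abstractmethod
    def export(data: List[Row], transforms: Optional[Callable] = None, **kwargs) -> Any:
        """Export the data using the specific strategy."""


class ListExportStrategy(DataExportStrategy):
    """Hypothetical strategy: return the rows as a list, optionally truncated."""

    @staticmethod
    def export(data: List[Row], transforms: Optional[Callable] = None, **kwargs) -> List[Row]:
        limit = kwargs.get("limit")  # hypothetical strategy-specific option
        rows = data if limit is None else data[:limit]
        # Apply the transform row by row when one is given; otherwise copy the row.
        return [transforms(row) if transforms else dict(row) for row in rows]


rows = [{"x": 1, "label": 0}, {"x": 2, "label": 1}, {"x": 3, "label": 0}]
exported = ListExportStrategy.export(
    rows, transforms=lambda r: {**r, "x": r["x"] * 10}, limit=2
)
print(exported)
```

Because `export` is static, a strategy carries no state of its own; anything it needs arrives through `data`, `transforms`, or `**kwargs`.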

class p2pfl.learning.dataset.p2pfl_dataset.P2PFLDataset(data, train_split_name='train', test_split_name='test', transforms=None)

Bases: object

Handle various data sources for Peer-to-Peer Federated Learning (P2PFL).

This class uses Hugging Face's datasets.Dataset as the intermediate representation for its flexibility and optimizations.

Supported data sources:
  • CSV files

  • JSON files

  • Parquet files

  • Python dictionaries

  • Python lists

  • Pandas DataFrames

  • Hugging Face datasets

  • SQL databases

To load different data sources, it is recommended to directly instantiate the datasets.Dataset object and pass it to the P2PFLDataset constructor.

Example

Load data from various sources and create a P2PFLDataset object:

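from datasets import DatasetDict, concatenate_datasets, load_dataset

from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# "data.csv" and "data.json" are placeholder paths.
# Load data from a CSV file (split="train" returns a Dataset, not a DatasetDict)
dataset_csv = load_dataset("csv", data_files="data.csv", split="train")

# Load data from a JSON file
dataset_json = load_dataset("json", data_files="data.json", split="train")

# Load from the Hub
dataset_hub = load_dataset("squad", split="train")

# Create the final dataset object.
# concatenate_datasets requires both datasets to share the same features.
p2pfl_dataset = P2PFLDataset(
    DatasetDict({
        "train": concatenate_datasets([dataset_csv, dataset_hub]),
        "test": dataset_json,
    })
)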

Todo

Add more complex integrations (databricks, etc.)

export(strategy, train=True, **kwargs)

Export the dataset using the given strategy.

Parameters:
  • strategy (Type[DataExportStrategy]) – The export strategy to use.

  • train (bool) – If True, export the training data. Otherwise, export the test data.

  • **kwargs – Additional keyword arguments for the export strategy.

Return type:

Any

Returns:

The exported data.
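
The `strategy` parameter is typed as `Type[DataExportStrategy]`, so the strategy class itself is passed rather than an instance. The sketch below illustrates one way such a dispatch could work; `TinyDataset` and `CountExportStrategy` are hypothetical stand-ins, and the internals of the real `P2PFLDataset.export` may differ.

```python
from typing import Any, Callable, Dict, List, Optional, Type

Row = Dict[str, Any]


class CountExportStrategy:
    """Hypothetical strategy exposing a static `export` with the documented signature."""

    @staticmethod
    def export(data: List[Row], transforms: Optional[Callable] = None, **kwargs) -> Any:
        return {"num_rows": len(data), "options": kwargs}


class TinyDataset:
    """Stand-in for P2PFLDataset holding two named splits (assumption)."""

    def __init__(self, train: List[Row], test: List[Row],
                 transforms: Optional[Callable] = None) -> None:
        self._splits = {"train": train, "test": test}
        self._transforms = transforms

    def export(self, strategy: Type[CountExportStrategy], train: bool = True, **kwargs) -> Any:
        # Pick the split from the flag, then delegate to the strategy class.
        split = self._splits["train" if train else "test"]
        return strategy.export(split, transforms=self._transforms, **kwargs)


ds = TinyDataset(train=[{"x": 1}, {"x": 2}, {"x": 3}], test=[{"x": 4}])
train_out = ds.export(CountExportStrategy, batch_size=2)  # the class, not an instance
test_out = ds.export(CountExportStrategy, train=False)
print(train_out, test_out)
```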

classmethod from_csv(data_files, **kwargs)

Create a P2PFLDataset from a CSV file.

Parameters:
  • data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the CSV file, a list of paths to CSV files, or a mapping from split name to one or more paths.

  • **kwargs – Keyword arguments to pass to datasets.load_dataset.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.
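
The type of `data_files` admits three shapes: a single path, a sequence of paths, or a mapping from split name to path(s). The sketch below builds each shape from small temporary files using only the standard library. The file names and columns are made up for illustration, and the commented call assumes that `from_csv` forwards `data_files` unchanged to `datasets.load_dataset`.

```python
import csv
import tempfile
from pathlib import Path

tmp_dir = Path(tempfile.mkdtemp())


def write_csv(name, rows):
    """Write rows (a list of dicts) to a CSV file and return its path as a string."""
    path = tmp_dir / name
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return str(path)


train_a = write_csv("train_a.csv", [{"text": "good", "label": 1}])
train_b = write_csv("train_b.csv", [{"text": "bad", "label": 0}])
test_file = write_csv("test.csv", [{"text": "fine", "label": 1}])

single_file = train_a                                          # str
several_files = [train_a, train_b]                             # Sequence[str]
per_split = {"train": [train_a, train_b], "test": test_file}   # Mapping[str, ...]

# With p2pfl installed, any of the three values could be passed, for example:
# dataset = P2PFLDataset.from_csv(per_split)
```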

classmethod from_generator(generator)

Create a P2PFLDataset from a generator function.

Parameters:

generator (Callable[[], Iterable[Dict[str, Any]]]) – A generator function that yields dictionaries.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.
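
A generator function matching the documented type `Callable[[], Iterable[Dict[str, Any]]]` takes no arguments and yields one dictionary per sample. The function name and fields below are illustrative assumptions, and the final call is commented out so that the snippet runs without p2pfl installed.

```python
from typing import Any, Dict, Iterator


def sample_generator() -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per sample, using the same keys for every sample."""
    for i in range(4):
        yield {"feature": float(i), "label": i % 2}


samples = list(sample_generator())
print(samples)

# Pass the function itself, not the result of calling it:
# dataset = P2PFLDataset.from_generator(sample_generator)
```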

classmethod from_huggingface(dataset_name, **kwargs)

Create a P2PFLDataset from a Hugging Face dataset.

Parameters:
  • dataset_name (str) – The name of the Hugging Face dataset.

  • **kwargs – Keyword arguments to pass to datasets.load_dataset.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.

classmethod from_json(data_files, **kwargs)

Create a P2PFLDataset from a JSON file.

Parameters:
  • data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the JSON file, a list of paths to JSON files, or a mapping from split name to one or more paths.

  • **kwargs – Keyword arguments to pass to datasets.load_dataset.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.

classmethod from_pandas(df)

Create a P2PFLDataset from a Pandas DataFrame.

Parameters:

df (DataFrame) – A Pandas DataFrame containing the data.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.

classmethod from_parquet(data_files, **kwargs)

Create a P2PFLDataset from a Parquet file or files.

Parameters:
  • data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the Parquet file, a list of paths to Parquet files, or a mapping from split name to one or more paths.

  • **kwargs – Keyword arguments to pass to datasets.load_dataset.

Return type:

P2PFLDataset

Returns:

A P2PFLDataset object.

generate_partitions(num_partitions, strategy, seed=666, label_tag='label')

Generate partitions of the dataset.

Parameters:
  • num_partitions (int) – The number of partitions to generate.

  • strategy (DataPartitionStrategy) – The partition strategy to use.

  • seed (int) – The random seed to use for reproducibility.

  • label_tag (str) – The name of the dataset column that holds the labels.

Return type:

List[P2PFLDataset]

Returns:

A list of P2PFLDataset objects, one per partition.
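
To illustrate what `num_partitions` and `seed` control, here is a minimal sketch of a seeded random partition over sample indices. It is not p2pfl's `DataPartitionStrategy`; the function name `random_iid_partitions` is hypothetical, and a label-aware strategy would additionally read the label column named by `label_tag`.

```python
import random
from typing import List


def random_iid_partitions(num_samples: int, num_partitions: int, seed: int = 666) -> List[List[int]]:
    """Shuffle the indices with a fixed seed, then deal them out round-robin."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Striding keeps partition sizes within one sample of each other.
    return [indices[i::num_partitions] for i in range(num_partitions)]


parts = random_iid_partitions(num_samples=10, num_partitions=3)
print([len(p) for p in parts])
```

Using a local `random.Random(seed)` rather than the global generator means the same seed always gives the same partitions, regardless of other random calls in the program.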

generate_train_test_split(test_size=0.2, seed=42, shuffle=True, **kwargs)

Generate a train/test split of the dataset.

Parameters:
  • test_size (float) – The proportion of the dataset to include in the test split.

  • seed (int) – The random seed to use for reproducibility.

  • shuffle (bool) – Whether to shuffle the data before splitting.

  • **kwargs – Additional keyword arguments to pass to the train_test_split method.

Return type:

None
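
The interplay of `test_size`, `seed`, and `shuffle` can be sketched on plain indices. This is an illustration under stated assumptions, not the library's implementation: `split_indices` is a hypothetical helper, and both the rounding rule (rounding the test size up) and the placement of the test samples at the end are choices made for this sketch.

```python
import math
import random
from typing import List, Tuple


def split_indices(num_samples: int, test_size: float = 0.2, seed: int = 42,
                  shuffle: bool = True) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for the given proportion."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_test = math.ceil(num_samples * test_size)  # rounding rule is an assumption
    n_train = num_samples - n_test
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(10, test_size=0.2, shuffle=False)
print(train_idx, test_idx)

shuffled_train, shuffled_test = split_indices(10, test_size=0.2, seed=42)
```

Since the documented method returns `None`, it presumably updates the dataset's splits in place instead of returning new objects.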

get(idx, train=True)

Get the item at the given index.

Parameters:
  • idx – The index of the item to retrieve.

  • train (bool) – If True, get the item from the training split. Otherwise, get the item from the test split.

Return type:

Dict[str, Any]

Returns:

The item at the given index.

get_num_samples(train=True)

Get the number of samples in the dataset.

Parameters:

train (bool) – If True, get the number of samples in the training split. Otherwise, get the number of samples in the test split.

Return type:

int

Returns:

The number of samples in the dataset.

set_transforms(transforms)

Set the transforms to apply to the data.

Parameters:

transforms (Callable) – The transforms to apply to the data.

Return type:

None
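
The documentation does not state the signature expected of the `transforms` callable. The sketch below assumes it receives a batch as a dictionary mapping column names to lists of values, which is the convention of Hugging Face's `Dataset.set_transform`; the function `scale_pixels` and the column names are hypothetical.

```python
from typing import Any, Dict, List

Batch = Dict[str, List[Any]]


def scale_pixels(batch: Batch) -> Batch:
    """Hypothetical transform: scale 0-255 pixel values to the range [0, 1]."""
    out = dict(batch)  # shallow copy so the input batch is left untouched
    out["pixels"] = [[value / 255.0 for value in row] for row in batch["pixels"]]
    return out


batch = {"pixels": [[0, 255], [51, 102]], "label": [0, 1]}
transformed = scale_pixels(batch)
print(transformed)

# With p2pfl installed and the batch-dictionary assumption holding:
# dataset.set_transforms(scale_pixels)
```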