# Datasets

The `P2PFLDataset` class is a key component of the P2PFL framework, providing a flexible and efficient way to manage datasets for federated learning experiments. It simplifies dataset loading, partitioning (both IID and non-IID), and exporting to various machine learning frameworks. It supports a wide range of data sources, making it easy to integrate your existing data into P2PFL.
## Key Features

- **Unified Data Handling:** `P2PFLDataset` provides a consistent API for working with datasets, regardless of their original format or source.
- **Flexible Data Loading:** Load data from various sources, including CSV, JSON, and Parquet files, Pandas DataFrames, Python dictionaries and lists, and the Hugging Face Hub.
- **Automated Partitioning:** Easily partition your data for federated learning using built-in strategies like `RandomIIDPartitionStrategy` or `DirichletPartitionStrategy`.
- **Framework-Specific Export:** Export your data in formats readily usable by popular machine learning frameworks like PyTorch, TensorFlow, and Flax.
- **Data Transformations:** Apply custom transformations to your data before training. A compact end-to-end sketch combining these pieces is shown below.
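The following sketch chains these steps together, using only the calls documented in the sections below (load, split, partition, export):

```python
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset
from p2pfl.learning.dataset.partition_strategies import RandomIIDPartitionStrategy
from p2pfl.learning.frameworks.pytorch.lightning_dataset import PyTorchExportStrategy

# Load a dataset from the Hugging Face Hub
dataset = P2PFLDataset.from_huggingface("p2pfl/MNIST", split="train")

# Create train/test splits and partition across 3 simulated nodes
dataset.generate_train_test_split(test_size=0.2, seed=42)
partitions = dataset.generate_partitions(3, RandomIIDPartitionStrategy)

# Export one node's training data for PyTorch
node_data = partitions[0].export(PyTorchExportStrategy, train=True, batch_size=32)
```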
## Loading Data

The `P2PFLDataset` class simplifies data loading by offering convenient methods for common data sources. You can directly instantiate a `P2PFLDataset` object from various formats, such as CSV, JSON, or Parquet files, as well as from Pandas DataFrames, Python dictionaries, or lists. For instance, to load data from a CSV file, you can use the `from_csv()` method, providing the file path as an argument. Similarly, `from_json()` and `from_parquet()` methods are available for JSON and Parquet files, respectively.

For those working with datasets hosted on the Hugging Face Hub, the `from_huggingface()` method provides a direct way to load datasets by specifying the dataset name. Additionally, if you have data already loaded in a Pandas DataFrame, you can use the `from_pandas()` method.
Here are some examples of how to use these methods:
```python
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From CSV
p2pfl_dataset = P2PFLDataset.from_csv("path/to/your/data.csv")
```

```python
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From JSON
p2pfl_dataset = P2PFLDataset.from_json("path/to/your/data.json")
```

```python
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From Parquet
p2pfl_dataset = P2PFLDataset.from_parquet("path/to/your/data.parquet")
```

```python
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From Hugging Face Hub (MNIST Example)
p2pfl_dataset = P2PFLDataset.from_huggingface("p2pfl/MNIST", split="train")
```
```python
import numpy as np
from datasets import Dataset
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From a generator function
def my_generator():
    for i in range(10):
        yield {"id": i, "image": np.random.rand(28, 28), "label": i % 10}  # Example MNIST-like data

dataset = Dataset.from_generator(my_generator)
p2pfl_dataset = P2PFLDataset(dataset)
```
```python
import numpy as np
import pandas as pd
from datasets import Dataset
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# From a Pandas DataFrame (example with image and label columns)
data = {"image": [np.random.rand(28, 28) for _ in range(10)], "label": [i % 10 for i in range(10)]}
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
p2pfl_dataset = P2PFLDataset(dataset)
```
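As noted above, `P2PFLDataset` also exposes a `from_pandas()` constructor, so the intermediate `datasets.Dataset` step can likely be skipped. A minimal sketch, assuming `from_pandas()` accepts the DataFrame as its first argument:

```python
# Directly wrap the DataFrame (assumed signature; see the from_pandas() docs)
p2pfl_dataset = P2PFLDataset.from_pandas(df)
```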
## Advanced Data Loading

For more complex scenarios, or when dealing with data sources not directly supported by the built-in methods, you can leverage the flexibility of the Hugging Face `datasets` library. This allows you to create a `datasets.Dataset` or `datasets.DatasetDict` instance using custom loading scripts or by applying intricate data manipulation techniques before integrating it with P2PFL.
### Example: Creating a Dataset from Python Lists/Dictionaries

You can easily create a `P2PFLDataset` from Python lists or dictionaries using the `datasets` library:
```python
import numpy as np
from datasets import Dataset
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# Example data (MNIST-like)
data = [{"image": np.random.rand(28, 28), "label": i % 10} for i in range(10)]

# Create a Hugging Face Dataset
dataset = Dataset.from_list(data)

# Create a P2PFLDataset
p2pfl_dataset = P2PFLDataset(dataset)
```
### Example: Creating a DatasetDict with Custom Splits

You can also create a `datasets.DatasetDict` with custom splits, allowing you to manage different portions of your data separately:
```python
import numpy as np
from datasets import Dataset, DatasetDict
from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# Create dummy data for train and test splits
train_data = {"image": [np.random.rand(28, 28) for _ in range(50)], "label": [i % 10 for i in range(50)]}
test_data = {"image": [np.random.rand(28, 28) for _ in range(20)], "label": [i % 10 for i in range(20)]}

# Create Dataset objects for each split
train_dataset = Dataset.from_dict(train_data)
test_dataset = Dataset.from_dict(test_data)

# Create a DatasetDict
dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})

# Create a P2PFLDataset object
p2pfl_dataset = P2PFLDataset(dataset_dict)
```
These examples demonstrate how you can utilize the `datasets` library's capabilities to load and preprocess your data in a customized manner before using it with `P2PFLDataset`. This approach provides maximum flexibility and control over your data handling pipeline, enabling you to tailor it to the specific needs of your federated learning experiments.
## Train/Test Split

If the dataset does not come with predefined splits, you can create them using `generate_train_test_split()`. Here's an example:
```python
# Generate a train/test split with an 80/20 ratio
p2pfl_dataset.generate_train_test_split(test_size=0.2, seed=42)

# Access the number of samples in each split
num_train_samples = p2pfl_dataset.get_num_samples(train=True)
num_test_samples = p2pfl_dataset.get_num_samples(train=False)
print(f"Number of training samples: {num_train_samples}")
print(f"Number of test samples: {num_test_samples}")
```
## Data Access and Manipulation

The `P2PFLDataset` class provides convenient methods for accessing and manipulating data.

**Accessing Data:** You can retrieve individual samples using the `get()` method. Specify the index of the desired sample and the split (train or test) you want to access:
```python
# Access the 5th item from the training split
item = p2pfl_dataset.get(4, train=True)
print(item)
```
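The same method reads from the held-out split when you pass `train=False` (assuming the dataset already has a test split, e.g. after `generate_train_test_split()`):

```python
# Access the first item from the test split
test_item = p2pfl_dataset.get(0, train=False)
print(test_item)
```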
**Transformations:** You can apply transformations to your data using `set_transforms()`. This allows you to preprocess your data before using it for training or evaluation. Here's an example demonstrating how to apply a simple transformation to the MNIST dataset:
```python
from torchvision import transforms

from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

# Load MNIST dataset
mnist_dataset = P2PFLDataset.from_huggingface("p2pfl/MNIST")

# Define a transform to convert the image to a tensor and normalize it
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

# Set the transform
mnist_dataset.set_transforms(transform)

# Access a transformed sample
transformed_sample = mnist_dataset.get(0, train=True)
print(transformed_sample)
```
## Partitioning for Federated Learning
Partitioning data is a crucial step in federated learning, replicating the scenario where data is spread across multiple devices or nodes. The method used for partitioning has a significant impact on the performance, convergence rate, and overall effectiveness of federated learning models. This process is not merely about dividing data; it involves strategically constructing a realistic simulation of a decentralized data environment.
P2PFL offers the capability to investigate various partitioning strategies, each tailored to emulate different real-world scenarios. The table below summarizes the available partitioning strategies within the P2PFL framework:
| Strategy | Description | Use Case |
|---|---|---|
| `RandomIIDPartitionStrategy` | Distributes data randomly across clients, creating an Independent and Identically Distributed (IID) scenario. | Simulates homogeneous client data, useful for baseline comparisons. |
| `DirichletPartitionStrategy` | Distributes data based on a Dirichlet distribution, controlled by the `alpha` parameter. | Simulates heterogeneous client data, reflecting real-world scenarios where clients have different distributions. |
Use `generate_partitions()` to generate a list of `P2PFLDataset` objects with a given `DataPartitionStrategy`. The following example demonstrates how to perform IID and non-IID partitioning using `RandomIIDPartitionStrategy` and `DirichletPartitionStrategy`, respectively:
```python
from p2pfl.learning.dataset.partition_strategies import DirichletPartitionStrategy, RandomIIDPartitionStrategy

# Generate 3 IID partitions
partitions = p2pfl_dataset.generate_partitions(3, RandomIIDPartitionStrategy)

# Generate 3 non-IID partitions using a Dirichlet distribution with alpha=0.1
partitions = p2pfl_dataset.generate_partitions(3, DirichletPartitionStrategy, alpha=0.1)
```
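Each returned partition is itself a `P2PFLDataset`, so you can inspect it with the same API shown above, for example to check how many training samples each simulated node receives:

```python
# Inspect the per-node training sample counts
for i, partition in enumerate(partitions):
    print(f"Partition {i}: {partition.get_num_samples(train=True)} training samples")
```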
## Exporting Data

Once you have loaded, preprocessed, and partitioned your data, the next step is to export it into a format that your chosen machine learning framework can understand. The `P2PFLDataset` class simplifies this process by providing an `export()` method that works seamlessly with different `DataExportStrategy` implementations. Each strategy is designed to handle the specific requirements of a particular framework, ensuring that your data is prepared correctly for training or evaluation.
```python
from p2pfl.learning.frameworks.pytorch.lightning_dataset import PyTorchExportStrategy

# Export the training data for PyTorch
pytorch_data = p2pfl_dataset.export(PyTorchExportStrategy, train=True, batch_size=32, num_workers=4)
```

```python
from p2pfl.learning.frameworks.tensorflow.keras_dataset import KerasExportStrategy

# Export the training data for TensorFlow/Keras
tensorflow_data = p2pfl_dataset.export(KerasExportStrategy, train=True, batch_size=32)
```

```python
from p2pfl.learning.frameworks.flax.flax_dataset import FlaxExportStrategy

# Export the training data for Flax
flax_data = p2pfl_dataset.export(FlaxExportStrategy, train=True, batch_size=32)
```
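The exported object is framework-native, so you consume it with that framework's usual training loop. As a minimal sketch, assuming the PyTorch export yields a standard `torch.utils.data.DataLoader` whose batches are dictionaries keyed by column name (both assumptions, not confirmed by this page):

```python
# Iterate over mini-batches as with any PyTorch DataLoader
# (assumes a DataLoader of dict-style batches; "image"/"label" are hypothetical keys)
for batch in pytorch_data:
    images, labels = batch["image"], batch["label"]
    print(images.shape, labels.shape)
    break
```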