p2pfl.learning.dataset.p2pfl_dataset module
P2PFL dataset abstraction.
- class p2pfl.learning.dataset.p2pfl_dataset.DataExportStrategy
Bases: ABC
Abstract base class for export strategies.
- abstract static export(data, transforms=None, **kwargs)
Export the data using the specific strategy.
- Parameters:
data (Dataset) – The data to export.
transforms (Optional[Callable]) – The transforms to apply to the data.
**kwargs – Additional keyword arguments for the export strategy.
- Return type:
Any
- Returns:
The exported data.
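As a rough illustration of the interface above, the following sketch defines a stand-in `DataExportStrategy` with the documented `export` signature and one hypothetical subclass, `ListExportStrategy`. Both classes are assumptions made so that the snippet runs without p2pfl installed. In real code the base class would be imported from this module, and `data` would be a `datasets.Dataset` rather than a plain list of rows. Applying `transforms` row by row is also an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional


class DataExportStrategy(ABC):
    """Stand-in mirroring the documented interface (not the p2pfl class)."""

    @staticmethod
    @abstractmethod
    def export(data: Any, transforms: Optional[Callable] = None, **kwargs) -> Any:
        """Export the data using the specific strategy."""


class ListExportStrategy(DataExportStrategy):
    """Hypothetical strategy: return the rows as a list, optionally transformed."""

    @staticmethod
    def export(data: Any, transforms: Optional[Callable] = None, **kwargs) -> List[Dict[str, Any]]:
        rows = list(data)
        if transforms is not None:
            # Assumption: the transform is applied to each row independently.
            rows = [transforms(row) for row in rows]
        return rows


rows = [{"x": 1, "label": 0}, {"x": 2, "label": 1}]
exported = ListExportStrategy.export(rows, transforms=lambda r: {**r, "x": r["x"] * 10})
```

Because `export` is static, the strategy is used through its class and never needs to be instantiated.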
- class p2pfl.learning.dataset.p2pfl_dataset.P2PFLDataset(data, train_split_name='train', test_split_name='test', transforms=None)
Bases: object
Handle various data sources for Peer-to-Peer Federated Learning (P2PFL).
This class uses Hugging Face's datasets.Dataset as the intermediate representation for its flexibility and optimizations.
- Supported data sources:
CSV files
JSON files
Parquet files
Python dictionaries
Python lists
Pandas DataFrames
Hugging Face datasets
SQL databases
To load different data sources, it is recommended to directly instantiate the datasets.Dataset object and pass it to the P2PFLDataset constructor.
Example
Load data from various sources and create a P2PFLDataset object:
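    from datasets import DatasetDict, concatenate_datasets, load_dataset

    from p2pfl.learning.dataset.p2pfl_dataset import P2PFLDataset

    # Load data from a CSV file (split="train" returns a Dataset, not a DatasetDict)
    dataset_csv = load_dataset("csv", data_files="data.csv", split="train")

    # Load data from a JSON file ("data.json" is a placeholder path)
    dataset_json = load_dataset("json", data_files="data.json", split="train")

    # Load from the Hub
    dataset_hub = load_dataset("squad", split="train")

    # Create the final dataset object.
    # concatenate_datasets requires both datasets to share the same features.
    p2pfl_dataset = P2PFLDataset(
        DatasetDict({
            "train": concatenate_datasets([dataset_csv, dataset_hub]),
            "test": dataset_json,
        })
    )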
Todo
Add more complex integrations (databricks, etc.)
- export(strategy, train=True, **kwargs)
Export the dataset using the given strategy.
- Parameters:
strategy (Type[DataExportStrategy]) – The export strategy to use.
train (bool) – If True, export the training data. Otherwise, export the test data.
**kwargs – Additional keyword arguments for the export strategy.
- Return type:
Any
- Returns:
The exported data.
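The delegation described above can be sketched with a small stand-in class. `MiniDataset` and `CountExportStrategy` are hypothetical names used only for illustration. The split selection and the forwarding of `transforms` are assumptions based on the documented signatures, not a copy of the p2pfl implementation.

```python
from typing import Any, Callable, Optional


class CountExportStrategy:
    """Hypothetical strategy exposing the documented static ``export`` method."""

    @staticmethod
    def export(data: Any, transforms: Optional[Callable] = None, **kwargs) -> int:
        return len(data)


class MiniDataset:
    """Stand-in for P2PFLDataset: holds named splits and delegates export."""

    def __init__(self, data, train_split_name="train", test_split_name="test", transforms=None):
        self._data = data
        self._train_split_name = train_split_name
        self._test_split_name = test_split_name
        self._transforms = transforms

    def export(self, strategy, train: bool = True, **kwargs) -> Any:
        # Pick the split by name, then hand it to the strategy class.
        split = self._train_split_name if train else self._test_split_name
        return strategy.export(self._data[split], transforms=self._transforms, **kwargs)


ds = MiniDataset({"train": [1, 2, 3], "test": [4]})
train_count = ds.export(CountExportStrategy)
test_count = ds.export(CountExportStrategy, train=False)
```

Note that the strategy is passed as a class, matching the `Type[DataExportStrategy]` annotation, rather than as an instance.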
- classmethod from_csv(data_files, **kwargs)
Create a P2PFLDataset from a CSV file.
- Parameters:
data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the CSV file or a list of paths to CSV files.
**kwargs – Keyword arguments to pass to datasets.load_dataset.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
- classmethod from_generator(generator)
Create a P2PFLDataset from a generator function.
- Parameters:
generator (Callable[[], Iterable[Dict[str, Any]]]) – A generator function that yields dictionaries.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
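A generator function matching the documented `Callable[[], Iterable[Dict[str, Any]]]` type could look like the sketch below. The function name `sample_rows` and the `text` and `label` keys are illustrative assumptions. The call to `from_generator` is left as a comment so that the snippet runs without p2pfl installed.

```python
from typing import Any, Dict, Iterable


def sample_rows() -> Iterable[Dict[str, Any]]:
    """Zero-argument generator function that yields one dictionary per example."""
    for i in range(3):
        yield {"text": f"example {i}", "label": i % 2}


rows = list(sample_rows())

# With p2pfl installed, the function itself (not its result) would be passed:
# dataset = P2PFLDataset.from_generator(sample_rows)
```

Every yielded dictionary should use the same keys, since each key becomes a column of the resulting dataset.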
- classmethod from_huggingface(dataset_name, **kwargs)
Create a P2PFLDataset from a Hugging Face dataset.
- Parameters:
dataset_name (str) – The name of the Hugging Face dataset.
**kwargs – Keyword arguments to pass to datasets.load_dataset.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
- classmethod from_json(data_files, **kwargs)
Create a P2PFLDataset from a JSON file.
- Parameters:
data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the JSON file or a list of paths to JSON files.
**kwargs – Keyword arguments to pass to datasets.load_dataset.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
- classmethod from_pandas(df)
Create a P2PFLDataset from a Pandas DataFrame.
- Parameters:
df (DataFrame) – A Pandas DataFrame containing the data.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
- classmethod from_parquet(data_files, **kwargs)
Create a P2PFLDataset from a Parquet file or files.
- Parameters:
data_files (Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None]) – The path to the Parquet file or a list of paths to Parquet files.
**kwargs – Keyword arguments to pass to datasets.load_dataset.
- Return type:
P2PFLDataset
- Returns:
A P2PFLDataset object.
- generate_partitions(num_partitions, strategy, seed=666, label_tag='label')
Generate partitions of the dataset.
- Parameters:
num_partitions (int) – The number of partitions to generate.
strategy (DataPartitionStrategy) – The partition strategy to use.
seed (int) – The random seed to use for reproducibility.
label_tag (str) – The tag to use for the label.
- Return type:
List[P2PFLDataset]
- Returns:
A list of P2PFLDataset objects.
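The concrete `DataPartitionStrategy` implementations are not described in this section. As a minimal sketch of what a seeded partition means, the hypothetical helper below shuffles example indices with a fixed seed and deals them into `num_partitions` groups. The default seed is copied from the signature above. The helper is an assumption for illustration, not p2pfl code.

```python
import random
from typing import List, Sequence


def random_partitions(indices: Sequence[int], num_partitions: int, seed: int = 666) -> List[List[int]]:
    """Shuffle indices with a fixed seed and deal them into num_partitions groups."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    # Striding spreads any remainder so partition sizes differ by at most one.
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


parts = random_partitions(range(10), num_partitions=3)
```

Calling the helper again with the same seed returns the same partitions, which is the reproducibility the `seed` parameter is meant to provide.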
- generate_train_test_split(test_size=0.2, seed=42, shuffle=True, **kwargs)
Generate a train/test split of the dataset.
- Parameters:
test_size (float) – The proportion of the dataset to include in the test split.
seed (int) – The random seed to use for reproducibility.
shuffle (bool) – Whether to shuffle the data before splitting.
**kwargs – Additional keyword arguments to pass to the train_test_split method.
- Return type:
None
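The roles of `test_size`, `seed` and `shuffle` can be illustrated with a plain-Python sketch. The helper below is hypothetical and only mimics the parameters listed above; the exact rounding and ordering used by the underlying `train_test_split` method may differ. Unlike the documented method, which returns None, the helper returns the two parts so they can be inspected.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split(rows: Sequence, test_size: float = 0.2, seed: int = 42, shuffle: bool = True) -> Tuple[List, List]:
    """Split rows into (train, test), holding out a test_size fraction."""
    items = list(rows)
    if shuffle:
        # A fixed seed makes the shuffle, and so the split, reproducible.
        random.Random(seed).shuffle(items)
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]


train, test = train_test_split(range(10))
ordered_train, ordered_test = train_test_split(range(10), shuffle=False)
```

With ten rows and `test_size=0.2`, two rows are held out for testing and eight remain for training.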
- get(idx, train=True)
Get the item at the given index.
- Parameters:
idx – The index of the item to retrieve.
train (bool) – If True, get the item from the training split. Otherwise, get the item from the test split.
- Return type:
Dict[str, Any]
- Returns:
The item at the given index.