Simulations
P2PFL leverages Ray, a powerful distributed computing framework, to enable efficient simulations of large-scale federated learning scenarios. This allows you to train and evaluate models across a cluster of machines or multiple processes on a single machine, significantly accelerating the process and overcoming the limitations of single-machine setups.
Ray Integration for Scalability
P2PFL seamlessly integrates with Ray to distribute the learning process. When Ray is installed, P2PFL automatically creates a pool of actors, which are independent Python processes that can be distributed across your cluster. Each actor hosts a Learner instance, allowing for parallel training and evaluation.
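The snippet below is not P2PFL's internal code; it is a minimal sketch of the Ray actor pattern this integration relies on, using a hypothetical ToyLearner class to stand in for a real Learner:
import ray

ray.init()  # local mode; connects to an existing cluster if one is already attached

# Hypothetical stand-in for a P2PFL Learner, used only to illustrate the actor pattern.
@ray.remote
class ToyLearner:
    def __init__(self, node_id):
        self.node_id = node_id

    def train(self, rounds):
        # ... real model training would happen here ...
        return f"node {self.node_id}: trained {rounds} round(s)"

# Each actor is an independent Python process; training runs in parallel.
learners = [ToyLearner.remote(i) for i in range(4)]
futures = [learner.train.remote(1) for learner in learners]
print(ray.get(futures))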
Actor Pool
The core of P2PFL's simulation capabilities is the SuperActorPool. This pool manages the lifecycle of VirtualNodeLearner actors. Each VirtualNodeLearner wraps a standard Learner, enabling it to be executed remotely by Ray. This means that each node in your federated learning simulation can be run as an independent actor, managed by the pool.
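SuperActorPool and VirtualNodeLearner are internal to P2PFL, but the pattern they follow can be illustrated with Ray's generic ActorPool; the LearnerActor class below is a hypothetical stand-in, not the actual implementation:
import ray
from ray.util import ActorPool

ray.init(ignore_reinit_error=True)

# Hypothetical learner actor; P2PFL's VirtualNodeLearner plays a similar role internally.
@ray.remote
class LearnerActor:
    def evaluate(self, round_id):
        # ... real evaluation would happen here ...
        return {"round": round_id, "loss": 0.0}

# An ActorPool schedules tasks onto a fixed set of actors, reusing them across rounds.
pool = ActorPool([LearnerActor.remote() for _ in range(2)])
results = pool.map(lambda actor, r: actor.evaluate.remote(r), [0, 1, 2, 3])
print(list(results))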
Benefits of Using Ray
Scalability: Distribute the learning process across multiple machines or processes, enabling larger-scale simulations.
Efficiency: Parallelize training and evaluation, significantly reducing overall experiment time.
Fault Tolerance: Ray's actor model provides fault tolerance. If an actor fails, Ray can automatically restart it.
Resource Management: Ray intelligently manages the allocation of resources (CPUs, GPUs) to actors.
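As an illustration of that resource management, Ray lets you declare resource requests per actor; the classes below are hypothetical examples of the mechanism, not part of P2PFL:
import ray

ray.init()

# Ray places an actor only on a node with the requested resources free.
@ray.remote(num_cpus=1)
class CpuLearner:
    def ping(self):
        return "running with 1 reserved CPU"

# A fractional num_gpus lets several learner actors share one physical GPU
# (such actors are only schedulable on machines that actually have a GPU).
@ray.remote(num_cpus=1, num_gpus=0.25)
class GpuLearner:
    def ping(self):
        return f"assigned GPU ids: {ray.get_gpu_ids()}"

learner = CpuLearner.remote()
print(ray.get(learner.ping.remote()))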
Setting Up a Ray Cluster
To run P2PFL simulations with Ray, you need to set up a Ray cluster. This can be done on a single machine (for smaller simulations) or across multiple machines (for larger simulations). To disable Ray even when it is installed, set the environment variable DISABLE_RAY before running your experiment:
export DISABLE_RAY=1
Single Machine Setup
For simulations on a single machine, you don't need to explicitly start a Ray cluster. P2PFL will automatically initialize Ray in local mode when you start your experiment.
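If you want to cap what a local simulation may consume, you can initialize Ray yourself before launching the experiment; whether P2PFL reuses an already-initialized Ray instance depends on its configuration, so treat this as an optional sketch (the resource values are arbitrary):
import ray

# Optional: initialize Ray yourself to cap what the local simulation may use.
# The values below are arbitrary examples, not recommendations.
ray.init(num_cpus=4, num_gpus=0)

# ... build your P2PFL nodes and start the experiment as usual ...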
Multi-Machine Setup
For larger simulations, you'll need to set up a Ray cluster across multiple machines:
Start the head node: On the machine designated as the head node, run:
ray start --head --port=6379
Start worker nodes: On each additional machine, run:
ray start --address='<head_node_ip>:6379'
Replace <head_node_ip> with the IP address of the head node.
Verify the cluster: Check the status of your Ray cluster using:
ray status
Once the cluster is set up, you can run your P2PFL experiment as usual. P2PFL will automatically distribute the VirtualNodeLearner actors across the available nodes in the cluster.
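If you need to attach a driver script to the running cluster explicitly (for example, to inspect the available resources before launching), Ray's standard connection pattern looks like this; P2PFL itself may handle the connection for you:
import ray

# Attach this script to the cluster started with `ray start`. "auto" discovers
# the local node's address; Ray Client ("ray://<head_node_ip>:10001") can be
# used instead when connecting from outside the cluster.
ray.init(address="auto")

# Aggregate CPUs/GPUs visible across all nodes in the cluster.
print(ray.cluster_resources())

# ... run your P2PFL experiment as usual; actors are distributed across nodes ...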