TensorFlow Data Pipeline Hangs When Used with Ray or OpenDP

Problem

When a tf.data.Dataset built with HuggingFace’s to_tf_dataset() is iterated in a program that also uses p2pfl, the program hangs on the first fetch:

sample = next(iter(tf_dataset))

Root Cause

Import order conflict: Ray or OpenDP is initialized before TensorFlow is imported, causing a threading deadlock.

  1. Importing p2pfl.management.logger triggers ray.init() at module level

  2. TensorFlow is imported afterwards

  3. TensorFlow’s data pipeline threads deadlock due to Ray’s modified threading environment

Call chain: p2pfl/management/logger/__init__.py:30 -> ray_installed() -> ray.init()
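
To check which order applies in a running program, the sketch below compares the positions of two module names in sys.modules. It is a heuristic, not part of p2pfl: it assumes that sys.modules keeps insertion order and that a module is inserted when its import starts, and the helper name imported_before is hypothetical.

```python
import sys
from typing import Iterable, Optional


def imported_before(
    first: str, second: str, modules: Optional[Iterable[str]] = None
) -> Optional[bool]:
    """Report whether `first` started importing before `second`.

    Returns None when either module has not been imported at all.
    """
    # sys.modules is a dict, so iterating it yields names in insertion order.
    names = list(sys.modules if modules is None else modules)
    if first not in names or second not in names:
        return None
    return names.index(first) < names.index(second)


if __name__ == "__main__":
    # False here would match the hanging import order described above.
    order = imported_before("tensorflow", "p2pfl.management.logger")
    print("tensorflow imported before p2pfl logger:", order)
```

A result of None only means that one of the two modules has not been imported yet, so call the helper after both imports have run.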

This fails (hangs):

from p2pfl.management.logger import logger  # Ray initialized here
import tensorflow as tf  # Too late
from datasets import Dataset

dataset = Dataset.from_dict({"x": [[1]*784], "y": [0]})
tf_dataset = dataset.to_tf_dataset(batch_size=1, columns=["x"], label_cols=["y"])
next(iter(tf_dataset))  # Hangs forever

This works:

import tensorflow as tf  # TensorFlow first
from p2pfl.management.logger import logger  # Ray after
from datasets import Dataset

dataset = Dataset.from_dict({"x": [[1]*784], "y": [0]})
tf_dataset = dataset.to_tf_dataset(batch_size=1, columns=["x"], label_cols=["y"])
next(iter(tf_dataset))  # Works
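
While reproducing the hang, it helps to make it fail fast and show where each thread is stuck. The sketch below uses the standard-library faulthandler module for that. The helper name hang_watchdog and the 30-second timeout are illustrative choices, not part of p2pfl or TensorFlow.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(seconds: float, file=None):
    """Dump every thread's traceback if the body runs longer than `seconds`."""
    target = sys.stderr if file is None else file
    # exit=True terminates the process after the dump, so a deadlock
    # cannot stall a script or CI job indefinitely.
    faulthandler.dump_traceback_later(seconds, exit=True, file=target)
    try:
        yield
    finally:
        # The body finished in time: disarm the timer.
        faulthandler.cancel_dump_traceback_later()


# Usage with the reproduction above (commented out because it needs TensorFlow):
# with hang_watchdog(30):
#     sample = next(iter(tf_dataset))
```

If the timer fires, the dumped tracebacks show which threads are blocked, which is useful to attach to a bug report.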

Solutions

Option 1: Import TensorFlow first (Quick Fix)

Import TensorFlow before Ray or OpenDP:

import tensorflow as tf  # FIRST
from p2pfl.management.logger import logger  # After TensorFlow

This is how p2pfl’s test suite handles it in test/conftest.py:

import contextlib

# Import TensorFlow first if it is installed; skip silently otherwise
with contextlib.suppress(ImportError):
    import tensorflow

Option 2: Don’t install Ray

Ray is an optional dependency. If you don’t need distributed computing features, install p2pfl without it:

pip install "p2pfl[tensorflow]"  # Without Ray
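
To confirm that Ray is really absent from the active environment (it may have been pulled in by another package), a quick check with the standard library is enough. This is a minimal sketch; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("ray installed:", is_installed("ray"))
```

Because find_spec only locates the module, this check does not initialize Ray.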

Option 3: Disable Ray at runtime

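Set the flag before importing p2pfl.management.logger, since that import is what triggers ray.init():

from p2pfl.settings import Settings
Settings.general.DISABLE_RAY = True  # must run before the logger import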

Environment

  • TensorFlow 2.20.0

  • Ray 2.53.0

  • Python 3.12

  • macOS (Darwin)
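
To compare another machine against this environment, the versions can be collected with the standard library. This is a sketch that assumes the distribution names tensorflow, ray, p2pfl and datasets; some platforms ship TensorFlow under a different distribution name, so adjust the list as needed.

```python
import platform
from importlib import metadata


def environment_report(packages=("tensorflow", "ray", "p2pfl", "datasets")) -> dict:
    """Collect the Python version, the OS name and installed package versions."""
    report = {"python": platform.python_version(), "os": platform.system()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The function reads package metadata only, so it does not import TensorFlow or Ray and cannot trigger the hang itself.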

Status

Fixed on macOS. p2pfl now registers a Ray worker setup hook that imports TensorFlow (and PyTorch, if installed) in each Ray worker process before any task or actor runs there.

The fix is in p2pfl/utils/check_ray.py:

import sys


def _worker_setup() -> None:
    """Import ML frameworks first in Ray workers to avoid deadlocks on macOS."""
    if sys.platform != "darwin":
        return
    import contextlib
    with contextlib.suppress(ImportError):
        import tensorflow
    with contextlib.suppress(ImportError):
        import torch

# Where ray.init() is called (init_kwargs holds its keyword arguments):
if sys.platform == "darwin":
    init_kwargs["runtime_env"] = {"worker_process_setup_hook": _worker_setup}

Related: ray-project/ray#59661