datasets massively slows data reads, even in memory #7753

@lrast

Describe the bug

Reading image data from a Hugging Face dataset is very slow: roughly 1000 times slower than reading the same data from a PyTorch dataset. The slowdown persists even when the dataset is loaded into RAM with the keep_in_memory=True flag.

The following script reproduces the result with random data, but it applies equally to datasets that are loaded from the hub.

Steps to reproduce the bug

The following script should reproduce the behavior:

import torch
import time
from datasets import Dataset

images = torch.randint(0, 255, (1000, 3, 224, 224), dtype=torch.uint8)
labels = torch.randint(0, 200, (1000,), dtype=torch.uint8)

pt_dataset = torch.utils.data.TensorDataset(images, labels)

hf_dataset = Dataset.from_dict({'image': images, 'label': labels})
hf_dataset.set_format('torch', dtype=torch.uint8)
hf_in_memory = hf_dataset.map(lambda x: x, keep_in_memory=True)

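# Measure the time needed to read the first 1000 items one at a time.
def time_access(dataset, img_col):
    # perf_counter is a monotonic, high-resolution clock suited to timing.
    start_time = time.perf_counter()
    for i in range(1000):
        _ = dataset[i][img_col].shape
    end_time = time.perf_counter()
    return end_time - start_time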


print(f"In-memory Tensor access: {time_access(pt_dataset, 0):.4f} seconds")
print(f"HF Dataset access: {time_access(hf_dataset, 'image'):.4f} seconds")
print(f"In-memory HF Dataset access: {time_access(hf_in_memory, 'image'):.4f} seconds")

Expected behavior

For me, the above script produces:

In-memory Tensor access: 0.0025 seconds
HF Dataset access: 2.9317 seconds
In-memory HF Dataset access: 2.8082 seconds

I think this difference, roughly three orders of magnitude, is much larger than expected.
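To put these numbers in perspective, here is a minimal sketch that converts the totals reported above into an average per-read latency and a slowdown ratio relative to the TensorDataset baseline. The helper names (per_item_ms, slowdown) are hypothetical and introduced only for illustration; the timings are copied from the output above and assume 1000 reads per run, as in the script.

```python
# Timings copied from the reported output (total seconds for 1000 reads).
N_READS = 1000
timings = {
    "In-memory Tensor": 0.0025,
    "HF Dataset": 2.9317,
    "In-memory HF Dataset": 2.8082,
}


def per_item_ms(total_seconds, n_reads=N_READS):
    """Average latency of a single read, in milliseconds (hypothetical helper)."""
    return total_seconds / n_reads * 1000.0


def slowdown(total_seconds, baseline_seconds):
    """How many times slower a run is than the baseline run (hypothetical helper)."""
    return total_seconds / baseline_seconds


baseline = timings["In-memory Tensor"]
for name, total in timings.items():
    print(
        f"{name}: {per_item_ms(total):.4f} ms/read, "
        f"{slowdown(total, baseline):.0f}x baseline"
    )
```

With these figures, each read from the Hugging Face dataset takes close to 3 ms, against a few microseconds for the tensor dataset, and keeping the dataset in memory changes the ratio only slightly.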

Environment info

  • datasets version: 4.0.0
  • Platform: macOS-14.7.7-arm64-arm-64bit
  • Python version: 3.12.11
  • huggingface_hub version: 0.34.3
  • PyArrow version: 18.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0
