Describe the bug
Loading image data from a Hugging Face Dataset is very slow, roughly 1000 times slower than reading the same data from a PyTorch TensorDataset. This applies even when the dataset is loaded into RAM with the keep_in_memory=True flag.
The following script reproduces the result with random data, but it applies equally to datasets loaded from the Hub (see the optional variant after the script below).
Steps to reproduce the bug
The following script should reproduce the behavior
import time

import torch
from datasets import Dataset

images = torch.randint(0, 255, (1000, 3, 224, 224), dtype=torch.uint8)
labels = torch.randint(0, 200, (1000,), dtype=torch.uint8)

pt_dataset = torch.utils.data.TensorDataset(images, labels)
hf_dataset = Dataset.from_dict({'image': images, 'label': labels})
hf_dataset.set_format('torch', dtype=torch.uint8)
hf_in_memory = hf_dataset.map(lambda x: x, keep_in_memory=True)

# measure access speeds
def time_access(dataset, img_col):
    start_time = time.time()
    for i in range(1000):
        _ = dataset[i][img_col].shape
    end_time = time.time()
    return end_time - start_time

print(f"In-memory Tensor access: {time_access(pt_dataset, 0):.4f} seconds")
print(f"HF Dataset access: {time_access(hf_dataset, 'image'):.4f} seconds")
print(f"In-memory HF Dataset access: {time_access(hf_in_memory, 'image'):.4f} seconds")
Expected behavior
For me, the above script produces:
In-memory Tensor access: 0.0025 seconds
HF Dataset access: 2.9317 seconds
In-memory HF Dataset access: 2.8082 seconds
This difference (a roughly 1000x slowdown, even for the in-memory copy) is much larger than I would expect.
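As a point of comparison, here is a minimal sketch (an assumption, not one of the measurements above) that fetches the whole column in a single batched read; it may help show whether the cost is per-item conversion overhead rather than raw read speed. The slicing form assumes the images all have the same shape, so torch formatting returns one stacked tensor.

# Sketch (not measured above): one batched read of all 1000 images, so the
# Arrow -> torch conversion happens once instead of once per example.
start_time = time.time()
batch = hf_dataset[:1000]['image']
print(f"Batched HF access: {time.time() - start_time:.4f} seconds, shape {tuple(batch.shape)}")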
Environment info
- datasets version: 4.0.0
- Platform: macOS-14.7.7-arm64-arm-64bit
- Python version: 3.12.11
- huggingface_hub version: 0.34.3
- PyArrow version: 18.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.9.0