Hugging Face Hub Dataset Upload CAS Error #7760

@n-bkoe

Describe the bug

I am getting persistent 401 Unauthorized errors when uploading datasets to the Hugging Face Hub with the datasets library. The error occurs specifically in the CAS (Content Addressable Storage) service during the upload. I tried setting HF_HUB_DISABLE_XET=1. Uploads seem to work for smaller files.

Exact error message:

Processing Files (0 / 0)                : |          |  0.00B /  0.00B
2025-09-10T09:44:35.657565Z ERROR  Fatal Error: "cas::upload_xorb" api call failed (request id 01b[...]XXX): HTTP status client error (401 Unauthorized) for url (https://cas-server.xethub.hf.co/xorb/default/7f3abdc[...]XXX)
    at /home/runner/work/xet-core/xet-core/cas_client/src/retry_wrapper.rs:113

Processing Files (0 / 0)                :   0%|          |  0.00B /  184kB,  0.00B/s
New Data Upload                         :   0%|          |  0.00B /  184kB,  0.00B/s
❌ Failed to push some_dataset: Data processing error: CAS service error : Reqwest Error: HTTP status client error (401 Unauthorized), domain: https://cas-server.xethub.hf.co/xorb/default/7f3abdc[...]XXX

Workaround Attempts

  1. Disabled XET: Set HF_HUB_DISABLE_XET=1 environment variable
  2. Pinned hf-xet: Used hf-xet==1.1.9 rather than the latest release
  3. Verified Authentication: Confirmed HF token is valid and has write permissions
  4. Tested with Smaller Datasets:
    • 100 samples: ✅ SUCCESS (uploaded successfully)
    • 10,000 samples: ❌ FAILS (401 Unauthorized)
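
For anyone retrying workaround 1 from Python rather than the shell, here is a minimal sketch. It assumes that huggingface_hub reads HF_HUB_DISABLE_XET when the library is first imported, so the variable has to be set before huggingface_hub or datasets is imported. The helper name disable_xet is hypothetical and not part of any library.

```python
import os
import sys


def disable_xet() -> bool:
    """Set HF_HUB_DISABLE_XET=1 and report whether it was set early enough.

    Returns True if huggingface_hub had not been imported yet, which (under
    the import-time assumption above) means the setting should be picked up.
    """
    already_imported = "huggingface_hub" in sys.modules
    os.environ["HF_HUB_DISABLE_XET"] = "1"
    return not already_imported


if __name__ == "__main__":
    if not disable_xet():
        print("huggingface_hub was already imported; restart the process "
              "and set the variable first.")
```

Call disable_xet() at the very top of the script or notebook, before any import of datasets, and rerun the upload in a fresh process.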

Steps to reproduce the bug

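from datasets import Dataset

# Placeholder data (an assumption; the real lists held ~10,000 samples)
questions = [f"question {i}" for i in range(10_000)]
answers = [f"answer {i}" for i in range(10_000)]

# Create dataset (example with 10,000 samples)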
dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    # ... other fields
})

# Split into train/test
dataset_dict = dataset.train_test_split(test_size=0.1)

# Upload to Hub
dataset_dict.push_to_hub("Org/some-dataset")
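
Since 100 samples upload fine and 10,000 fail, it may help to narrow down where the failure starts. The following is a sketch only: smallest_failing_size and try_upload_first_n are hypothetical helpers, and the search assumes that failures are monotonic in the sample count, which this report does not establish.

```python
from typing import Callable


def smallest_failing_size(try_upload: Callable[[int], bool], ok: int, bad: int) -> int:
    """Binary-search the smallest sample count for which try_upload fails.

    Assumes try_upload(ok) succeeds, try_upload(bad) fails, and that every
    size above the first failing one also fails.
    """
    while bad - ok > 1:
        mid = (ok + bad) // 2
        if try_upload(mid):
            ok = mid   # mid still uploads, so the boundary is above it
        else:
            bad = mid  # mid fails, so the boundary is at or below it
    return bad


def try_upload_first_n(dataset, repo_id: str, n: int) -> bool:
    """Push the first n rows of a datasets.Dataset; return True on success."""
    try:
        dataset.select(range(n)).push_to_hub(repo_id)
        return True
    except Exception as err:  # broad on purpose: any failure counts
        print(f"n={n}: upload failed: {err}")
        return False
```

With the dataset from the snippet above, a call could look like smallest_failing_size(lambda n: try_upload_first_n(dataset, "Org/some-dataset", n), ok=100, bad=10_000). Each probe performs a real upload, so a scratch repository is advisable.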

Expected behavior

  • Dataset should upload successfully to Hugging Face Hub
  • Progress bars should complete without authentication errors
  • Dataset should be accessible at the specified repository URL

Actual Behavior

  • Upload fails consistently with 401 Unauthorized error
  • Error occurs specifically during CAS service interaction
  • No progress is made on the upload (0% completion)
  • Dataset is created on Hugging Face Hub with no data folder

Environment info

  • Platform: SageMaker (AWS)
  • Python Version: 3.12
  • Libraries:
    • datasets library (latest version)
    • hf-xet==1.1.9 (attempted fix)
  • Authentication: Hugging Face token configured
  • Dataset Size: ~10,000 samples, works for smaller sizes (e.g. 100)
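
Because "latest version" changes over time, exact version numbers make the report easier to reproduce. Below is a small sketch that collects them using only the standard library. The package list is an assumption about which distributions are relevant here, and collect_versions is a hypothetical helper name.

```python
import platform
from importlib import metadata


def collect_versions(packages=("datasets", "huggingface_hub", "hf-xet")) -> dict:
    """Return installed versions, or 'not installed' for missing packages."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this in the same SageMaker environment that performs the upload gives the versions that are actually in use there.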
