DataFile objects display as empty DataFile[] due to __slots__ usage #2556

@rynewang

Apache Iceberg version

0.9.1

Please describe the bug 🐞

Printing a PyIceberg DataFile object, or calling repr() on it, displays an
empty DataFile[] instead of its actual attributes. This makes debugging and
logging difficult because the contents of DataFile objects are not visible.

Example Code to Reproduce

  from pyiceberg.manifest import DataFile, DataFileContent, FileFormat
  from pyiceberg.typedef import Record

  # Create a DataFile with data
  data_file = DataFile(
      content=DataFileContent.DATA,
      file_path="s3://my-bucket/data/part-00000.parquet",
      file_format=FileFormat.PARQUET,
      partition=Record(),
      record_count=50000,
      file_size_in_bytes=1048576,
      spec_id=0
  )

  # Print the DataFile
  print(data_file)
  # Output: DataFile[]

  print(repr(data_file))
  # Output: DataFile[]

  # But the data is there:
  print(data_file.file_path)
  # Output: s3://my-bucket/data/part-00000.parquet

Root Cause

The issue occurs because:

  1. DataFile class uses __slots__ for memory efficiency (defined in manifest.py)
  2. DataFile inherits from Record class (defined in typedef.py)
  3. Record's __repr__ method iterates over self.__dict__.items() to build the
    string representation
  4. Attributes declared in __slots__ are stored through per-class slot
    descriptors, not in the instance __dict__, so they never show up in
    __dict__.items()
  5. Therefore, __dict__ is empty and the repr shows "DataFile[]"
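
The mechanism can be reproduced without PyIceberg. The following is a minimal sketch using two hypothetical stand-in classes (Base and Slotted are illustrative names, not PyIceberg code): the base class has no __slots__, so instances still carry a __dict__, but values assigned to slotted attributes never land in it.

```python
class Base:
    """Hypothetical stand-in for Record: no __slots__, so instances get a __dict__."""

    def __repr__(self) -> str:
        # Same strategy as the Record.__repr__ quoted in this issue: only __dict__ is inspected.
        fields = ", ".join(
            f"{key}={value!r}" for key, value in self.__dict__.items() if not key.startswith("_")
        )
        return f"{self.__class__.__name__}[{fields}]"


class Slotted(Base):
    """Hypothetical stand-in for DataFile: fields are declared in __slots__."""

    __slots__ = ("file_path", "record_count")

    def __init__(self, file_path: str, record_count: int) -> None:
        # These assignments go through slot descriptors, not the instance __dict__.
        self.file_path = file_path
        self.record_count = record_count


obj = Slotted("s3://my-bucket/data/part-00000.parquet", 50000)
print(repr(obj))      # Slotted[]
print(obj.__dict__)   # {}
print(obj.file_path)  # s3://my-bucket/data/part-00000.parquet
```

The instance __dict__ exists but stays empty, which matches the "DataFile[]" output reported above.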

Relevant Code

In pyiceberg/typedef.py:

  class Record(StructProtocol):
      # ...
      def __repr__(self) -> str:
          """Return the string representation of the Record class."""
          return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"

In pyiceberg/manifest.py:

  class DataFile(Record):
      __slots__ = (
          "content",
          "file_path",
          "file_format",
          "partition",
          "record_count",
          "file_size_in_bytes",
          # ... many more fields
      )

Proposed Solution

The Record.__repr__ method should check if the subclass uses __slots__ and iterate
over those attributes instead of only checking __dict__. Here's a potential fix:

  def __repr__(self) -> str:
      """Return the string representation of the Record class."""
      attrs = []

      # Check if the class uses __slots__
      if hasattr(self.__class__, '__slots__'):
          for slot in self.__class__.__slots__:
              if hasattr(self, slot) and not slot.startswith('_'):
                  value = getattr(self, slot)
                  attrs.append(f'{slot}={repr(value)}')

      # Also include __dict__ items for non-slotted attributes
      for key, value in self.__dict__.items():
          if not key.startswith('_'):
              attrs.append(f'{key}={repr(value)}')

      return f"{self.__class__.__name__}[{', '.join(attrs)}]"
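
One caveat with reading self.__class__.__slots__ directly is that it only returns the slots declared on the most-derived class that defines them, so slots inherited from a parent class would be missed, and a __slots__ given as a single string would be iterated character by character. A possible variant, sketched here as standalone helpers with hypothetical names (iter_public_fields and slot_aware_repr are not part of PyIceberg), walks the MRO instead:

```python
from typing import Any, Iterator, Tuple


def iter_public_fields(obj: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (name, value) for public slot and __dict__ attributes of obj."""
    seen = set()
    # Walk the MRO from base to derived so inherited slots come first.
    for klass in reversed(type(obj).__mro__):
        slots = klass.__dict__.get("__slots__", ())
        if isinstance(slots, str):  # __slots__ = "name" declares a single slot
            slots = (slots,)
        for name in slots:
            # Skip private names, duplicates, and slots that were never assigned.
            if name.startswith("_") or name in seen or not hasattr(obj, name):
                continue
            seen.add(name)
            yield name, getattr(obj, name)
    # Fully slotted instances have no __dict__, hence the default.
    for name, value in getattr(obj, "__dict__", {}).items():
        if not name.startswith("_") and name not in seen:
            seen.add(name)
            yield name, value


def slot_aware_repr(obj: Any) -> str:
    """Build a Record-style repr from slots and __dict__."""
    fields = ", ".join(f"{name}={value!r}" for name, value in iter_public_fields(obj))
    return f"{type(obj).__name__}[{fields}]"


class Parent:
    __slots__ = ("a", "_hidden")


class Child(Parent):
    __slots__ = ("b", "c")


child = Child()
child.a, child.b, child._hidden = 1, "two", "secret"  # "c" is deliberately left unset
print(slot_aware_repr(child))  # Child[a=1, b='two']
```

Whether unset slots should be skipped or shown as None is a design choice; this sketch skips them, as the proposed fix above does via hasattr.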

Expected Behavior

  print(data_file)
  # Should output:
  # DataFile[content=DataFileContent.DATA,
  # file_path='s3://my-bucket/data/part-00000.parquet',
  # file_format=FileFormat.PARQUET, partition=Record[], record_count=50000,
  # file_size_in_bytes=1048576, spec_id=0]

Environment

  • PyIceberg version: 0.9.1
  • Python version: 3.11

Impact

This affects debugging and logging when working with Iceberg manifests. Developers
cannot easily inspect DataFile objects during development or when troubleshooting
issues.

Workaround

Until fixed, users can create a custom function to display DataFile contents:

  def format_datafile(datafile):
      """Format a DataFile object for display."""
      from pyiceberg.manifest import DataFile

      if not isinstance(datafile, DataFile):
          return str(datafile)

      attrs = []
      for slot in DataFile.__slots__:
          if hasattr(datafile, slot):
              value = getattr(datafile, slot)
              attrs.append(f"{slot}={value!r}")

      return f"DataFile[{', '.join(attrs)}]"

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
