Description
Apache Iceberg version
0.9.1
Please describe the bug 🐞
When printing or using repr() on a PyIceberg DataFile object, it displays as an
empty DataFile[] instead of showing its actual attributes. This makes debugging
and logging difficult as you cannot see the contents of DataFile objects.
Example Code to Reproduce
from pyiceberg.manifest import DataFile, DataFileContent, FileFormat
from pyiceberg.typedef import Record

# Create a DataFile with data
data_file = DataFile(
    content=DataFileContent.DATA,
    file_path="s3://my-bucket/data/part-00000.parquet",
    file_format=FileFormat.PARQUET,
    partition=Record(),
    record_count=50000,
    file_size_in_bytes=1048576,
    spec_id=0
)

# Print the DataFile
print(data_file)
# Output: DataFile[]
print(repr(data_file))
# Output: DataFile[]

# But the data is there:
print(data_file.file_path)
# Output: s3://my-bucket/data/part-00000.parquet
Root Cause
The issue occurs because:
- The DataFile class uses __slots__ for memory efficiency (defined in manifest.py)
- DataFile inherits from the Record class (defined in typedef.py)
- Record's __repr__ method iterates over self.__dict__.items() to build the string representation
- Attributes declared in __slots__ are not stored in __dict__; they are stored in per-class slot descriptors instead
- Therefore, __dict__ is empty and the repr shows "DataFile[]"
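The behavior described in the list above can be reproduced without PyIceberg. The following is a minimal sketch using two stand-in classes, `Base` and `Slotted`, which are hypothetical names for illustration and are not the library's actual `Record` and `DataFile`. It assumes, as the `self.__dict__` call in `Record.__repr__` implies, that the base class still provides a `__dict__`:

```python
class Base:
    """Stand-in for Record: no __slots__, so instances get a __dict__."""

    def __repr__(self) -> str:
        # Same strategy as the Record.__repr__ quoted in this issue.
        fields = ", ".join(
            f"{key}={value!r}"
            for key, value in self.__dict__.items()
            if not key.startswith("_")
        )
        return f"{self.__class__.__name__}[{fields}]"


class Slotted(Base):
    """Stand-in for DataFile: attributes are declared in __slots__."""

    __slots__ = ("file_path", "record_count")

    def __init__(self, file_path: str, record_count: int) -> None:
        # These assignments go to slot descriptors, not to __dict__.
        self.file_path = file_path
        self.record_count = record_count


item = Slotted("s3://my-bucket/data/part-00000.parquet", 50000)
print(repr(item))      # Slotted[]
print(item.__dict__)   # {}
print(item.file_path)  # s3://my-bucket/data/part-00000.parquet
```

The instance has a `__dict__` (inherited from the unslotted base), but it stays empty because the slotted attributes never pass through it.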
Relevant Code
In pyiceberg/typedef.py:
class Record(StructProtocol):
    # ...
    def __repr__(self) -> str:
        """Return the string representation of the Record class."""
        return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"
In pyiceberg/manifest.py:
class DataFile(Record):
    __slots__ = (
        "content",
        "file_path",
        "file_format",
        "partition",
        "record_count",
        "file_size_in_bytes",
        # ... many more fields
    )
Proposed Solution
The Record.__repr__ method should check if the subclass uses __slots__ and iterate over those attributes instead of only checking __dict__. Here's a potential fix:
def __repr__(self) -> str:
    """Return the string representation of the Record class."""
    attrs = []
    # Check if the class uses __slots__
    if hasattr(self.__class__, '__slots__'):
        for slot in self.__class__.__slots__:
            if hasattr(self, slot) and not slot.startswith('_'):
                value = getattr(self, slot)
                attrs.append(f'{slot}={repr(value)}')
    # Also include __dict__ items for non-slotted attributes
    for key, value in self.__dict__.items():
        if not key.startswith('_'):
            attrs.append(f'{key}={repr(value)}')
    return f"{self.__class__.__name__}[{', '.join(attrs)}]"
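One possible refinement, offered as a sketch rather than as the fix the maintainers would necessarily choose: `self.__class__.__slots__` only sees the slots declared on the most-derived class, and `self.__dict__` raises `AttributeError` when every class in the hierarchy is slotted. The helper below walks the MRO and tolerates a missing `__dict__`. The names `iter_public_fields`, `BaseRecord`, and `FileEntry` are hypothetical and exist only for this illustration:

```python
from typing import Any, Iterator, Tuple


def iter_public_fields(obj: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (name, value) for public attributes stored in slots or __dict__."""
    seen = set()
    # Walk the MRO from base to subclass so inherited slots are included.
    for klass in reversed(type(obj).__mro__):
        slots = klass.__dict__.get("__slots__", ())
        if isinstance(slots, str):  # a bare string declares a single slot
            slots = (slots,)
        for name in slots:
            # Skip private names, duplicates, and slots that were never assigned.
            if name.startswith("_") or name in seen or not hasattr(obj, name):
                continue
            seen.add(name)
            yield name, getattr(obj, name)
    # Fully slotted hierarchies have no __dict__, so fall back to an empty mapping.
    for name, value in getattr(obj, "__dict__", {}).items():
        if not name.startswith("_") and name not in seen:
            yield name, value


class BaseRecord:
    """Hypothetical slotted base class with a slot-aware __repr__."""

    __slots__ = ("spec_id",)

    def __repr__(self) -> str:
        fields = ", ".join(f"{k}={v!r}" for k, v in iter_public_fields(self))
        return f"{type(self).__name__}[{fields}]"


class FileEntry(BaseRecord):
    """Hypothetical subclass that adds its own slots."""

    __slots__ = ("file_path", "record_count")


entry = FileEntry()
entry.spec_id = 0
entry.file_path = "part-00000.parquet"
# record_count is left unset, so it is skipped instead of raising AttributeError.
print(repr(entry))  # FileEntry[spec_id=0, file_path='part-00000.parquet']
```

Whether inherited slots matter for `DataFile` depends on how `Record` itself is declared, which this issue does not show in full, so this should be read as a defensive variant of the proposal above.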
Expected Behavior
print(data_file)
# Should output:
# DataFile[content=DataFileContent.DATA,
# file_path='s3://my-bucket/data/part-00000.parquet',
# file_format=FileFormat.PARQUET, partition=Record[], record_count=50000,
# file_size_in_bytes=1048576, spec_id=0]
Environment
- PyIceberg version: 0.9.1
- Python version: 3.11
Impact
This affects debugging and logging when working with Iceberg manifests. Developers
cannot easily inspect DataFile objects during development or when troubleshooting
issues.
Workaround
Until fixed, users can create a custom function to display DataFile contents:
def format_datafile(datafile):
    """Format a DataFile object for display."""
    from pyiceberg.manifest import DataFile

    if not isinstance(datafile, DataFile):
        return str(datafile)
    attrs = []
    for slot in DataFile.__slots__:
        if hasattr(datafile, slot):
            value = getattr(datafile, slot)
            attrs.append(f"{slot}={value!r}")
    return f"DataFile[{', '.join(attrs)}]"
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time