Issue #14601 fix - pytables.py put() and append() docstring fix #61936

Closed

Conversation

JoeDediop

Problem Summary

The current documentation for min_itemsize in HDFStore methods is unclear about:

  1. How string length is calculated (bytes vs characters); see the sketch after this list
  2. How encoding affects the required size
  3. Proper usage with multi-byte characters
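
For illustration, the character-count versus byte-count distinction in points 1 and 2 can be checked directly in a Python session (a minimal sketch, independent of HDFStore):

>>> word = 'café'
>>> len(word)                    # characters
4
>>> len(word.encode('utf-8'))    # bytes under UTF-8 ('é' takes 2 bytes)
5
>>> len(word.encode('latin-1'))  # bytes under Latin-1
4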

Proposed Changes

  1. Enhanced put() method docstring
def put(
    self,
    key: str,
    value: DataFrame | Series,
    format=None,
    index: bool = True,
    append: bool = False,
    complib=None,
    complevel: int | None = None,
    min_itemsize: int | dict[str, int] | None = None,
    nan_rep=None,
    data_columns: Literal[True] | list[str] | None = None,
    encoding=None,
    errors: str = "strict",
    track_times: bool = True,
    dropna: bool = False,
) -> None:
    """
    Store object in HDFStore.

    Parameters
    ----------
    key : str
    value : {Series, DataFrame}
    format : 'fixed(f)|table(t)', default is 'fixed'
        Format to use when storing object in HDFStore. Value can be one of:

        ``'fixed'``
            Fixed format.  Fast writing/reading. Not-appendable, nor searchable.
        ``'table'``
            Table format.  Write as a PyTables Table structure which may perform
            worse but allow more flexible operations like searching / selecting
            subsets of the data.
    index : bool, default True
        Write DataFrame index as a column.
    append : bool, default False
        This will force Table format, append the input data to the existing.
    min_itemsize : int, dict of str: int, or None, default None
        Minimum size in bytes for string columns. This parameter is only used when
        format='table'. Can be:
        
        - int: Apply the same minimum size to all string columns
        - dict: Map column names to their minimum sizes
        - None: Use default sizing
        
        **Important**: The size refers to the number of bytes after encoding, not
        the number of characters. For multi-byte characters (e.g., Chinese, Arabic),
        you need to account for the encoding. For example, the character '香' is
        1 character but 3 bytes when encoded as UTF-8.
        
        See examples below for proper usage with encoded strings.
    data_columns : list of columns or True, default None
        List of columns to create as data columns, or True to use all columns.
        See `here
        <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>`__.
    encoding : str, default None
        Provide an encoding for strings. When using multi-byte characters,
        this affects how min_itemsize should be calculated.
    track_times : bool, default True
        Parameter is propagated to 'create_table' method of 'PyTables'.
        If set to False, identical h5 files (same hashes) can be produced
        independent of creation time.
    dropna : bool, default False, optional
        Remove missing values.

    Examples
    --------
    Basic usage with ASCII strings:
    
    >>> df = pd.DataFrame([['hello', 'world']], columns=['A', 'B'])
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> store.put('data', df, format='table', min_itemsize={'A': 10, 'B': 10})  # doctest: +SKIP
    
    Usage with multi-byte characters:
    
    >>> df_unicode = pd.DataFrame([['香港', '北京']], columns=['city1', 'city2'])  # doctest: +SKIP
    >>> # Each Chinese character is 3 bytes in UTF-8, so '香港' needs 6 bytes
    >>> store.put('cities', df_unicode, format='table',  # doctest: +SKIP
    ...           min_itemsize={'city1': 12, 'city2': 12}, encoding='utf-8')  # doctest: +SKIP
    
    Determining the correct size for encoded strings:
    
    >>> text = '香港'  # doctest: +SKIP
    >>> len(text)  # Character length  # doctest: +SKIP
    2
    >>> len(text.encode('utf-8'))  # Byte length  # doctest: +SKIP
    6
    >>> # Use the byte length for min_itemsize
    """
  2. Enhanced append() method docstring
def append(
    self,
    key: str,
    value: DataFrame | Series,
    format=None,
    axes=None,
    index: bool | list[str] = True,
    append: bool = True,
    complib=None,
    complevel: int | None = None,
    columns=None,
    min_itemsize: int | dict[str, int] | None = None,
    nan_rep=None,
    chunksize: int | None = None,
    expectedrows=None,
    dropna: bool | None = None,
    data_columns: Literal[True] | list[str] | None = None,
    encoding=None,
    errors: str = "strict",
) -> None:
    """
    Append to Table in file.

    Node must already exist and be Table format.

    Parameters
    ----------
    key : str
    value : {Series, DataFrame}
    format : 'table' is the default
        Format to use when storing object in HDFStore.  Value can be one of:

        ``'table'``
            Table format. Write as a PyTables Table structure which may perform
            worse but allow more flexible operations like searching / selecting
            subsets of the data.
    index : bool, default True
        Write DataFrame index as a column.
    append : bool, default True
        Append the input data to the existing.
    data_columns : list of columns, or True, default None
        List of columns to create as indexed data columns for on-disk
        queries, or True to use all columns. By default only the axes
        of the object are indexed. See `here
        <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>`__.
    min_itemsize : int, dict of str: int, or None, default None
        Minimum size in bytes for string columns. Can be:
        
        - int: Apply the same minimum size to all string columns
        - dict: Map column names to their minimum sizes  
        - None: Use the existing table's column sizes
        
        **Important**: This parameter is only effective when creating a new table.
        If the table already exists, the column sizes are fixed and cannot be
        changed. The size refers to the number of bytes after encoding, not
        the number of characters.
        
        For multi-byte characters, calculate the size using the encoded byte length.
        For example: len('香'.encode('utf-8')) returns 3, not len('香') which returns 1.
    nan_rep : str to use as str nan representation
    chunksize : size to chunk the writing
    expectedrows : expected TOTAL row size of this table
    encoding : str, default None
        Provide an encoding for strings. This should match the encoding used
        when the table was initially created.
    dropna : bool, default False, optional
        Do not write an ALL nan row to the store settable
        by the option 'io.hdf.dropna_table'.

    Notes
    -----
    Does *not* check if data being appended overlaps with existing
    data in the table, so be careful.
    
    When appending to an existing table, the min_itemsize parameter has no effect
    as column sizes are already fixed. Set min_itemsize when initially creating
    the table with put() or the first append() call.

    Examples
    --------
    Creating a table and appending data:
    
    >>> df1 = pd.DataFrame([['short', 'text']], columns=['A', 'B'])
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Set min_itemsize when creating the table
    >>> store.put('data', df1, format='table', min_itemsize={'A': 20, 'B': 20})  # doctest: +SKIP
    >>> 
    >>> df2 = pd.DataFrame([['longer text here', 'more text']], columns=['A', 'B'])
    >>> store.append('data', df2)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    
    Handling multi-byte characters:
    
    >>> df_en = pd.DataFrame([['hello']], columns=['text'])
    >>> df_zh = pd.DataFrame([['你好世界']], columns=['text'])  # "Hello World" in Chinese
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Calculate size needed: len('你好世界'.encode('utf-8')) = 12 bytes
    >>> store.put('messages', df_en, format='table', 
    ...           min_itemsize={'text': 15}, encoding='utf-8')  # doctest: +SKIP
    >>> store.append('messages', df_zh)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    
    Common error when a string exceeds the existing column size:

    >>> df_small = pd.DataFrame([['a']], columns=['char'])
    >>> df_big = pd.DataFrame([['香']], columns=['char'])  # 1 character, 3 bytes in UTF-8
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Without min_itemsize, the column is sized to fit the first write (1 byte
    >>> # here), so a later append of the longer string raises a ValueError because
    >>> # the column limit is exceeded.
    >>> # Correct usage: preset the size when creating the table
    >>> store.put('test', df_small, format='table', min_itemsize={'char': 3})  # doctest: +SKIP
    >>> store.append('test', df_big)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    """

Key Improvements Made

  1. Clear parameter type documentation: Explicitly state that min_itemsize can be int, dict, or None
  2. Byte vs character clarification: Emphasize that size refers to bytes after encoding (see the sizing helper sketched after this list)
  3. Multi-byte character examples: Show real examples with Chinese characters
  4. Encoding relationship: Explain how encoding affects size calculations
  5. Timing clarification: Explain when min_itemsize is effective vs ignored
  6. Error prevention: Show common mistakes and how to avoid them
  7. Practical examples: Include realistic use cases that users encounter
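
To make point 2 concrete, the required byte width for each string column can be computed from the data itself. This is a hypothetical helper (not part of pandas or of the proposed docstrings), assuming UTF-8 encoding:

import pandas as pd

def required_itemsize(df, encoding='utf-8'):
    # Maximum encoded byte length per object-dtype (string) column.
    sizes = {}
    for col in df.select_dtypes(include='object').columns:
        sizes[col] = int(df[col].astype(str).str.encode(encoding).str.len().max())
    return sizes

>>> df = pd.DataFrame({'city': ['香港', 'Reykjavík']})
>>> required_itemsize(df)
{'city': 10}

The returned sizes (or larger values, to leave headroom for future appends) can then be passed as the min_itemsize dict when creating the table.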

Notes
I'm a new contributor. I followed the contribution guide, and it looks like other modifications were made to the pytables.py file during that process. The only changes I implemented manually are the ones noted above.

@JoeDediop changed the title from "Modified pytables.py put() and append() docstring" to "Issue #14601 fix - pytables.py put() and append() docstring fix" on Jul 24, 2025
@jbrockmendel
Member

Is this AI?

@JoeDediop
Author

Is this AI?

Hello! Yes, this was written with AI assistance. I am a CS student in my junior year; the assignment was focused on simply submitting a PR, so I used AI to aid in the process in order to meet my deadlines.

However, if that causes problems, I would love to work with you to adjust the changes I've made so that they are usable!

@JoeDediop
Author

I want to add to my previous reply. I actually misunderstood my own assignment: I was not supposed to use AI to contribute to a repo at all. However, I would like to start fresh with helping resolve this issue, as I think it would be a good learning experience for me.

This time around I wouldn't generate code with AI. I will likely use it to help me understand what the code is doing, but I will work toward contributing proper documentation written in my own words that follows your style and guidelines; at that point I'd make a new PR with the new work.

So I'll start by asking, does my original PR do a good job at summarizing the issue and proposing a decent solution? Or is it off track on what the variable 'min_itemsize' is used for? Once I know that I can work toward fully understanding the code so I can help create better documentation, perhaps for more than just the two methods I changed!

Let me know and I'll get to work right away!

@jbrockmendel
Member

Please understand you're asking me to put more time and effort into reviewing this PR than you put into making it. If you'd like to start over without AI and a ton of unrelated (and wrong) edits, I'll take a look at that.

@JoeDediop
Author

JoeDediop commented Jul 25, 2025

That sounds perfectly reasonable. I will work over the next couple weeks to better understand the code and style guidelines! I'll close this PR so I can start fresh.

I'll work harder to properly understand how min_itemsize works and then once I have a better idea of how it works I'll write clear documentation in my own words. Then I'll make sure the new PR is more polished than this rushed one. Thank you for the feedback, I can definitely understand why you would want a proper level of effort for the contributions to a project this large!

@JoeDediop JoeDediop closed this Jul 25, 2025