Issue #14601 fix - pytables.py put() and append() docstring fix #61936

Closed

Conversation

JoeDediop

Problem Summary

The current documentation for min_itemsize in HDFStore methods is unclear about:

  1. How string length is calculated (bytes vs characters); see the sketch after this list
  2. How encoding affects the required size
  3. Proper usage with multi-byte characters
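
For illustration, the character-count versus byte-count distinction in points 1 and 2 can be checked directly in a Python session (a minimal sketch, independent of HDFStore):

>>> word = 'café'
>>> len(word)                    # characters
4
>>> len(word.encode('utf-8'))    # bytes under UTF-8 ('é' takes 2 bytes)
5
>>> len(word.encode('latin-1'))  # bytes under Latin-1
4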

Proposed Changes

  1. Enhanced put() method docstring
def put(
    self,
    key: str,
    value: DataFrame | Series,
    format=None,
    index: bool = True,
    append: bool = False,
    complib=None,
    complevel: int | None = None,
    min_itemsize: int | dict[str, int] | None = None,
    nan_rep=None,
    data_columns: Literal[True] | list[str] | None = None,
    encoding=None,
    errors: str = "strict",
    track_times: bool = True,
    dropna: bool = False,
) -> None:
    """
    Store object in HDFStore.

    Parameters
    ----------
    key : str
    value : {Series, DataFrame}
    format : 'fixed(f)|table(t)', default is 'fixed'
        Format to use when storing object in HDFStore. Value can be one of:

        ``'fixed'``
            Fixed format.  Fast writing/reading. Not-appendable, nor searchable.
        ``'table'``
            Table format.  Write as a PyTables Table structure which may perform
            worse but allow more flexible operations like searching / selecting
            subsets of the data.
    index : bool, default True
        Write DataFrame index as a column.
    append : bool, default False
        This will force Table format, append the input data to the existing.
    min_itemsize : int, dict of str: int, or None, default None
        Minimum size in bytes for string columns. This parameter is only used when
        format='table'. Can be:
        
        - int: Apply the same minimum size to all string columns
        - dict: Map column names to their minimum sizes
        - None: Use default sizing
        
        **Important**: The size refers to the number of bytes after encoding, not
        the number of characters. For multi-byte characters (e.g., Chinese, Arabic),
        you need to account for the encoding. For example, the character '香' is
        1 character but 3 bytes when encoded as UTF-8.
        
        See examples below for proper usage with encoded strings.
    data_columns : list of columns or True, default None
        List of columns to create as data columns, or True to use all columns.
        See `here
        <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>`__.
    encoding : str, default None
        Provide an encoding for strings. When using multi-byte characters,
        this affects how min_itemsize should be calculated.
    track_times : bool, default True
        Parameter is propagated to 'create_table' method of 'PyTables'.
        If set to False, identical h5 files (same hashes) can be produced
        independent of creation time.
    dropna : bool, default False, optional
        Remove missing values.

    Examples
    --------
    Basic usage with ASCII strings:
    
    >>> df = pd.DataFrame([['hello', 'world']], columns=['A', 'B'])
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> store.put('data', df, format='table', min_itemsize={'A': 10, 'B': 10})  # doctest: +SKIP
    
    Usage with multi-byte characters:
    
    >>> df_unicode = pd.DataFrame([['香港', '北京']], columns=['city1', 'city2'])  # doctest: +SKIP
    >>> # Each Chinese character is 3 bytes in UTF-8, so '香港' needs 6 bytes
    >>> store.put('cities', df_unicode, format='table',  # doctest: +SKIP
    ...           min_itemsize={'city1': 12, 'city2': 12}, encoding='utf-8')  # doctest: +SKIP
    
    Determining the correct size for encoded strings:
    
    >>> text = '香港'  # doctest: +SKIP
    >>> len(text)  # Character length  # doctest: +SKIP
    2
    >>> len(text.encode('utf-8'))  # Byte length  # doctest: +SKIP
    6
    >>> # Use the byte length for min_itemsize
    """
  2. Enhanced append() method docstring
def append(
    self,
    key: str,
    value: DataFrame | Series,
    format=None,
    axes=None,
    index: bool | list[str] = True,
    append: bool = True,
    complib=None,
    complevel: int | None = None,
    columns=None,
    min_itemsize: int | dict[str, int] | None = None,
    nan_rep=None,
    chunksize: int | None = None,
    expectedrows=None,
    dropna: bool | None = None,
    data_columns: Literal[True] | list[str] | None = None,
    encoding=None,
    errors: str = "strict",
) -> None:
    """
    Append to Table in file.

    Node must already exist and be Table format.

    Parameters
    ----------
    key : str
    value : {Series, DataFrame}
    format : 'table' is the default
        Format to use when storing object in HDFStore.  Value can be one of:

        ``'table'``
            Table format. Write as a PyTables Table structure which may perform
            worse but allow more flexible operations like searching / selecting
            subsets of the data.
    index : bool, default True
        Write DataFrame index as a column.
    append : bool, default True
        Append the input data to the existing.
    data_columns : list of columns, or True, default None
        List of columns to create as indexed data columns for on-disk
        queries, or True to use all columns. By default only the axes
        of the object are indexed. See `here
        <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>`__.
    min_itemsize : int, dict of str: int, or None, default None
        Minimum size in bytes for string columns. Can be:
        
        - int: Apply the same minimum size to all string columns
        - dict: Map column names to their minimum sizes  
        - None: Use the existing table's column sizes
        
        **Important**: This parameter is only effective when creating a new table.
        If the table already exists, the column sizes are fixed and cannot be
        changed. The size refers to the number of bytes after encoding, not
        the number of characters.
        
        For multi-byte characters, calculate the size using the encoded byte length.
        For example: len('香'.encode('utf-8')) returns 3, not len('香') which returns 1.
    nan_rep : str to use as str nan representation
    chunksize : size to chunk the writing
    expectedrows : expected TOTAL row size of this table
    encoding : str, default None
        Provide an encoding for strings. This should match the encoding used
        when the table was initially created.
    dropna : bool, default False, optional
        Do not write an ALL nan row to the store settable
        by the option 'io.hdf.dropna_table'.

    Notes
    -----
    Does *not* check if data being appended overlaps with existing
    data in the table, so be careful.
    
    When appending to an existing table, the min_itemsize parameter has no effect
    as column sizes are already fixed. Set min_itemsize when initially creating
    the table with put() or the first append() call.

    Examples
    --------
    Creating a table and appending data:
    
    >>> df1 = pd.DataFrame([['short', 'text']], columns=['A', 'B'])
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Set min_itemsize when creating the table
    >>> store.put('data', df1, format='table', min_itemsize={'A': 20, 'B': 20})  # doctest: +SKIP
    >>> 
    >>> df2 = pd.DataFrame([['longer text here', 'more text']], columns=['A', 'B'])
    >>> store.append('data', df2)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    
    Handling multi-byte characters:
    
    >>> df_en = pd.DataFrame([['hello']], columns=['text'])
    >>> df_zh = pd.DataFrame([['你好世界']], columns=['text'])  # "Hello World" in Chinese
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Calculate size needed: len('你好世界'.encode('utf-8')) = 12 bytes
    >>> store.put('messages', df_en, format='table', 
    ...           min_itemsize={'text': 15}, encoding='utf-8')  # doctest: +SKIP
    >>> store.append('messages', df_zh)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    
    Common error when a string exceeds the existing column size:

    >>> df_small = pd.DataFrame([['a']], columns=['char'])
    >>> df_big = pd.DataFrame([['香']], columns=['char'])  # 1 character, 3 bytes in UTF-8
    >>> store = pd.HDFStore("store.h5", 'w')  # doctest: +SKIP
    >>> # Without min_itemsize, the column is sized to fit the first write (1 byte
    >>> # here), so a later append of the longer string raises a ValueError because
    >>> # the column limit is exceeded.
    >>> # Correct usage: preset the size when creating the table
    >>> store.put('test', df_small, format='table', min_itemsize={'char': 3})  # doctest: +SKIP
    >>> store.append('test', df_big)  # doctest: +SKIP
    >>> store.close()  # doctest: +SKIP
    """

Key Improvements Made

  1. Clear parameter type documentation: Explicitly state that min_itemsize can be int, dict, or None
  2. Byte vs character clarification: Emphasize that size refers to bytes after encoding (see the sizing helper sketched after this list)
  3. Multi-byte character examples: Show real examples with Chinese characters
  4. Encoding relationship: Explain how encoding affects size calculations
  5. Timing clarification: Explain when min_itemsize is effective vs ignored
  6. Error prevention: Show common mistakes and how to avoid them
  7. Practical examples: Include realistic use cases that users encounter
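
To make point 2 concrete, the required byte width for each string column can be computed from the data itself. This is a hypothetical helper (not part of pandas or of the proposed docstrings), assuming UTF-8 encoding:

import pandas as pd

def required_itemsize(df, encoding='utf-8'):
    # Maximum encoded byte length per object-dtype (string) column.
    sizes = {}
    for col in df.select_dtypes(include='object').columns:
        sizes[col] = int(df[col].astype(str).str.encode(encoding).str.len().max())
    return sizes

>>> df = pd.DataFrame({'city': ['香港', 'Reykjavík']})
>>> required_itemsize(df)
{'city': 10}

The returned sizes (or larger values, to leave headroom for future appends) can then be passed as the min_itemsize dict when creating the table.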

Notes
I'm a new contributor. I followed the contribution guide, and it looks like other modifications were made to the pytables.py file during that process. The only changes I implemented manually are the ones noted above.

@JoeDediop changed the title from "Modified pytables.py put() and append() docstring" to "Issue #14601 fix - pytables.py put() and append() docstring fix" on Jul 24, 2025
@jbrockmendel
Member

Is this AI?

@JoeDediop
Author

Is this AI?

Hello! Yes, this was written with AI assistance. I am a CS student in my junior year; the assignment was focused on simply submitting a PR, so I used AI to aid in the process in order to meet my deadlines.

However, if that causes problems, I would love to work with you to adjust the changes I've made so that they are usable!

@JoeDediop
Author

I want to add to my previous reply. I actually misunderstood my own assignment: I was not supposed to use AI to contribute to a repo at all. However, I would like to start fresh with helping resolve this issue, as I think it would be a good learning experience for me.

This time around I wouldn't generate code with AI. I will likely use it to help me understand what the code is doing, but I will work toward contributing proper documentation written in my own words that follows your style and guidelines; at that point I'd make a new PR with the new work.

So I'll start by asking, does my original PR do a good job at summarizing the issue and proposing a decent solution? Or is it off track on what the variable 'min_itemsize' is used for? Once I know that I can work toward fully understanding the code so I can help create better documentation, perhaps for more than just the two methods I changed!

Let me know and I'll get to work right away!

@jbrockmendel
Member

Please understand you're asking me to put more time and effort into reviewing this PR than you put into making it. If you'd like to start over without AI and a ton of unrelated (and wrong) edits, I'll take a look at that.

@JoeDediop
Author

JoeDediop commented Jul 25, 2025

That sounds perfectly reasonable. I will work over the next couple weeks to better understand the code and style guidelines! I'll close this PR so I can start fresh.

I'll work harder to properly understand how min_itemsize works and then once I have a better idea of how it works I'll write clear documentation in my own words. Then I'll make sure the new PR is more polished than this rushed one. Thank you for the feedback, I can definitely understand why you would want a proper level of effort for the contributions to a project this large!

@JoeDediop JoeDediop closed this Jul 25, 2025