Equivalent to unite in dplyr #1445

@fkgruber

In the tidyverse, unite (from tidyr) combines several columns into a single new column. This is useful, for example, when you want to build an ID from multiple columns. It would be good to have an equivalent here, and we probably also want a counterpart to tidyr's separate function.

Here is an initial proposal. I also added an option to round numeric values to a fixed number of significant digits. Otherwise, floating-point noise can produce different IDs for values that should be identical.

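import janitor  # noqa: F401 (registers DataFrame.select)
from pandas_flavor import register_dataframe_method


def signif(x, digits=2):
    """Round numeric values to significant digits."""
    try:
        # bool is a subclass of int, so exclude it explicitly
        is_number = isinstance(x, (int, float)) and not isinstance(x, bool)
        return float(f"{x:.{digits}g}") if is_number else x
    except (ValueError, OverflowError):
        return x  # Return as-is if conversion fails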

@register_dataframe_method
def unite(df, prefix, new_column_name, sep="_", digits=4):
    """
    Combines all columns with a given prefix into a single column without removing the originals.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    prefix (str): The prefix to filter columns.
    new_column_name (str): The name of the new combined column.
    sep (str): Separator for concatenating values.
    digits (int): Number of significant digits for numeric values.

    Returns:
    pd.DataFrame: DataFrame with the new combined column.
    """
    df2 = df.copy()

    # Select columns with the given prefix using pyjanitor's select method
    config_cols = df2.select(columns=[f"{prefix}*"])

    # Round numeric values; signif passes non-numeric values through unchanged.
    # DataFrame.map needs pandas >= 2.1; use applymap on older versions.
    config_cols = config_cols.map(lambda x: signif(x, digits))

    # Create the new combined column
    df2[new_column_name] = config_cols.astype(str).agg(sep.join, axis=1)

    return df2
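
To illustrate why the rounding step matters for IDs, here is a minimal standalone sketch. It uses only the standard library and a simplified copy of signif (default of 4 digits assumed here to match unite). The values 0.1 + 0.2 and 0.3 are illustrative and not taken from the example data.

```python
def signif(x, digits=4):
    """Round a real number to `digits` significant digits; pass other values through."""
    if isinstance(x, (int, float)) and not isinstance(x, bool):
        return float(f"{x:.{digits}g}")
    return x


a = 0.1 + 0.2  # stored as 0.30000000000000004 in binary floating point
b = 0.3

# IDs built from the raw values differ even though the numbers are "equal" on paper
raw_ids = (f"{a}_B1", f"{b}_B1")

# IDs built from rounded values agree
rounded_ids = (f"{signif(a)}_B1", f"{signif(b)}_B1")

print(raw_ids[0] == raw_ids[1])          # False
print(rounded_ids[0] == rounded_ids[1])  # True
```

Note that signif also converts integers to floats, so an integer column would show up as "1.0" rather than "1" in the combined ID.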

Example run:

import pandas as pd
df = pd.DataFrame({
    "config_a": [1.234567, 2.345678, 3.456789],
    "config_b": ["B1", "B2", "B3"],
    "config_c": [100.567, 200.678, 300.789],
    "other_col": [1, 2, 3]
})

# Use the custom pandas method via pandas_flavor
df = df.unite(prefix="config", new_column_name="id")

print(df)
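
For the separate counterpart mentioned in the description, a rough sketch could look like the following. The function name separate, its into parameter, and the sample IDs are assumptions for illustration only, not an existing pyjanitor API. It uses plain pandas and is not registered as a DataFrame method.

```python
import pandas as pd


def separate(df, column, into, sep="_"):
    """Split `column` on `sep` into the new columns named in `into`.

    Hypothetical helper: the original column is kept and the new
    columns hold strings (no type conversion is attempted).
    """
    # regex=False treats `sep` as a literal string (pandas >= 1.4)
    parts = df[column].str.split(sep, expand=True, regex=False)
    if parts.shape[1] != len(into):
        raise ValueError(
            f"expected {len(into)} parts, but splitting produced {parts.shape[1]}"
        )
    parts.columns = into
    return df.join(parts)


ids = pd.DataFrame({"id": ["1.235_B1_100.6", "2.346_B2_200.7"]})
out = separate(ids, "id", into=["a", "b", "c"])
print(out)
```

One caveat with this approach: if a value itself contains the separator (for example "B_1"), the split produces too many parts, so a separator that cannot occur in the data is the safer choice.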
