Copilot AI commented Oct 17, 2025

Overview

This PR adds support for non-numeric (categorical) predictors in the discord_data() and make_mean_diffs() functions. It resolves the issue where using a categorical predictor such as location ("south" or "north") would fail.

Problem

Previously, the make_mean_diffs() function attempted to compute differences and means for all variables, including categorical ones. This caused errors or warnings when non-numeric predictors were used, as operations like subtraction and mean calculation are not meaningful for categorical data.

Solution

Modified both implementations of make_mean_diffs() to:

  1. Detect variable types using is.numeric() before processing
  2. For numeric variables: calculate differences and means as before
  3. For non-numeric variables: preserve individual values in _1 and _2 columns, set _diff and _mean to NA
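
The three steps above can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: summarise_pair() and the v_ column prefix are hypothetical names, and the sketch ignores any pair ordering or other processing that make_mean_diffs() performs.

```r
# Hypothetical helper for illustration only; not the code in R/helpers_regression.R.
# x1 and x2 hold one variable's values for the two members of each pair.
summarise_pair <- function(x1, x2) {
  n <- length(x1)
  if (is.numeric(x1) && is.numeric(x2)) {
    # Numeric variable: difference and mean are meaningful
    data.frame(
      v_1 = x1, v_2 = x2,
      v_diff = x1 - x2,
      v_mean = (x1 + x2) / 2
    )
  } else {
    # Non-numeric variable: keep the individual values, fill the rest with NA
    data.frame(
      v_1 = x1, v_2 = x2,
      v_diff = rep(NA_real_, n),
      v_mean = rep(NA_real_, n),
      stringsAsFactors = FALSE
    )
  }
}

numeric_out <- summarise_pair(c(25, 30), c(23, 28))
categorical_out <- summarise_pair(c("south", "north"), c("south", "south"))
```

Branching once per variable keeps the numeric path unchanged, which is what makes the change safe for existing numeric predictors.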

Example Usage

library(discord)

# Create data with categorical predictor
data <- data.frame(
  id = 1:10,
  age_s1 = c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
  age_s2 = c(23, 28, 33, 38, 43, 48, 53, 58, 63, 68),
  location_s1 = c("south", "north", "south", "north", "south", 
                  "north", "south", "north", "south", "north"),
  location_s2 = c("south", "south", "north", "north", "south", 
                  "south", "north", "north", "south", "south")
)

# Process with categorical predictor
result <- discord_data(
  data = data,
  outcome = "age",
  predictors = "location",
  id = "id",
  sex = NULL,
  race = NULL,
  pair_identifiers = c("_s1", "_s2"),
  demographics = "none"
)

# Result contains:
# - location_1, location_2: categorical values preserved ("south"/"north")
# - location_diff, location_mean: NA (not meaningful for categorical)
# - age_diff, age_mean: properly calculated (numeric outcome)

Changes Made

Core Functions

  • make_mean_diffs_ram_optimized(): Added type checking and conditional processing
  • make_mean_diffs_fast(): Added same logic for vectorized operations
  • Removed unnecessary suppressMessages() and suppressWarnings() since type handling is now explicit

Documentation

  • Updated roxygen comments to explain categorical variable behavior
  • Enhanced discord_data() parameter documentation with categorical predictor examples
  • Added NEWS.md entry documenting the feature

Tests

  • Created comprehensive test suite (test-categorical_predictors.R) covering:
    • Character string predictors
    • Mixed numeric and categorical predictors
    • Factor variables
    • Both fast=TRUE and fast=FALSE modes
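
For reference, a quick base R check of how is.numeric() classifies the column types the tests exercise. This is a sketch of base R behavior only. Logical columns also come out as non-numeric; whether the package treats logicals in any special way is not stated in this PR.

```r
# Which column types would take the numeric branch under is.numeric()?
checks <- c(
  double    = is.numeric(c(1.5, 2)),
  integer   = is.numeric(1:3),
  character = is.numeric(c("south", "north")),
  factor    = is.numeric(factor(c("south", "north"))),  # FALSE despite integer storage
  logical   = is.numeric(c(TRUE, FALSE))
)
print(checks)
```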

Use Cases

This enhancement enables researchers to:

  • Use demographic variables as predictors (region, city, occupation type)
  • Include categorical characteristics that vary between siblings
  • Fit models that combine continuous and categorical predictors
  • Predict outcomes based on location, school attended, or other categorical factors
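
Because _diff and _mean are NA for categorical predictors, downstream analyses would need to work from the preserved _1 and _2 columns. One possible approach, sketched under assumptions: the data frame below is a hand-built stand-in for discord_data() output (values taken from the first four pairs of the example above, assuming _s1 maps to _1), and location_discordant and location_pattern are hypothetical column names, not something the package creates.

```r
# Hand-built stand-in for discord_data() output; not produced by the package here.
pairs <- data.frame(
  age_diff   = c(2, 2, 2, 2),
  location_1 = c("south", "north", "south", "north"),
  location_2 = c("south", "south", "north", "north"),
  stringsAsFactors = FALSE
)

# TRUE when the two members of a pair differ on the categorical predictor
pairs$location_discordant <- pairs$location_1 != pairs$location_2

# Combined label for each pair, e.g. "north/south"
pairs$location_pattern <- paste(pairs$location_1, pairs$location_2, sep = "/")

print(pairs)
```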

Backward Compatibility

✅ All changes are backward compatible:

  • Existing numeric predictor functionality unchanged
  • No breaking changes to the API
  • Existing code continues to work without modification

Fixes #5 (creating discordant kinship data with non-numeric predictors).

Original prompt

This section details the original issue you should resolve

<issue_title>Create discordant kinship data with non-numeric predictors</issue_title>
<issue_description>For example, if we predict age based on location and the location variable is encoded as "south" or "north". Update make_mean_diffs() to work with that and return the discordant-style data.</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #5



Copilot AI and others added 3 commits October 17, 2025 16:25
Co-authored-by: smasongarrison <6001608+smasongarrison@users.noreply.github.com>
Copilot AI changed the title [WIP] Update make_mean_diffs to handle non-numeric predictors Add support for non-numeric (categorical) predictors in discord_data() Oct 17, 2025
Copilot finished work on behalf of smasongarrison October 17, 2025 16:31
Copilot AI requested a review from smasongarrison October 17, 2025 16:31

codecov bot commented Oct 18, 2025

Codecov Report

❌ Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.09%. Comparing base (368461d) to head (52be958).

Files with missing lines    Patch %    Lines
R/helpers_regression.R      88.88%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
- Coverage   91.20%   91.09%   -0.11%     
==========================================
  Files           7        7              
  Lines         853      865      +12     
==========================================
+ Hits          778      788      +10     
- Misses         75       77       +2     
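
The figures in the report can be reproduced from the hit and line counts. The sketch below is plain arithmetic on the numbers shown above. Two points are inferences rather than documented Codecov behavior: that the displayed percentages are truncated (floored) to two decimals, and that the patch covers 18 lines, which is back-calculated from 2 missing lines at 88.88889%.

```r
# Counts copied from the coverage diff above
base_hits <- 778; base_lines <- 853
head_hits <- 788; head_lines <- 865

base_cov <- 100 * base_hits / base_lines
head_cov <- 100 * head_hits / head_lines

# Assumption: displayed values are floored to two decimals
trunc2 <- function(x) floor(x * 100) / 100

# Patch size implied by 2 missing lines at 88.88889% patch coverage
patch_missing <- 2
patch_lines <- round(patch_missing / (1 - 88.88889 / 100))

cat("base:", trunc2(base_cov), "head:", trunc2(head_cov),
    "change:", trunc2(head_cov - base_cov), "patch lines:", patch_lines, "\n")
```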
