Skip to content

Conversation

kevinfjiang
Copy link

@kevinfjiang kevinfjiang commented Jun 11, 2025

Resolves the sample dataset portion of #326. Additional formatting changes were also made.

@zzachw
Copy link
Collaborator

zzachw commented Jun 12, 2025

Great thanks for the PR! I will check it out this weekend.

@zzachw zzachw requested review from Copilot and zzachw June 12, 2025 05:26
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces caching for task-specific sample datasets in BaseDataset and includes related stylistic and signature updates across task and dataset modules.

  • Implement in-memory caching of SampleDataset instances keyed by task name in BaseDataset.set_task
  • Update pre_filter signatures from pl.LazyFrame to pl.DataFrame and adjust filtering logic
  • Consistent formatting refinements: quote normalization, minor reflows, and commented-out legacy checks

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pyhealth/tasks/medical_coding.py Changed pre_filter signature, updated filters/formats, removed legacy check
pyhealth/tasks/benchmark_ehrshot.py Updated pre_filter signature and formatting of dict/appends
pyhealth/tasks/base_task.py Aligned pre_filter signature to pl.DataFrame
pyhealth/datasets/sample_dataset.py Normalized string quotes and assert formatting
pyhealth/datasets/base_dataset.py Added _sample_dataset_cache, use_cache param, and caching logic
Comments suppressed due to low confidence (2)

pyhealth/datasets/base_dataset.py:346

  • The docstring for set_task should be updated to describe the new use_cache parameter and its behavior.
def set_task(self, task: Optional[BaseTask] = None, num_workers: int = 1, use_cache: bool = True) -> SampleDataset:

pyhealth/tasks/medical_coding.py:83

  • The filtering logic for skipping empty text or code-less samples has been commented out; this may allow invalid or empty samples downstream. Verify this change is intentional or reintroduce a suitable filter.
# if text == "" or len(icd_codes) < 1:

added documentation for use_cache param in BaseDataset.set_task
@kevinfjiang
Copy link
Author

friendly ping if you've had a chance to take a look at the PR!

@jhnwu3 jhnwu3 added the Highlight for TAs to highlight label Aug 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Highlight for TAs to highlight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants