Added sample dataset cache for base dataset #515

kevinfjiang · 2025-06-11T17:41:25Z

Resolves the sample dataset portion of #326. Additional formatting changes were also made.

zzachw · 2025-06-12T05:26:49Z

Great, thanks for the PR! I will check it out this weekend.

Copilot

Pull Request Overview

This PR introduces caching for task-specific sample datasets in BaseDataset and includes related stylistic and signature updates across task and dataset modules.

Implement in-memory caching of SampleDataset instances keyed by task name in BaseDataset.set_task
Update pre_filter signatures from pl.LazyFrame to pl.DataFrame and adjust filtering logic
Consistent formatting refinements: quote normalization, minor reflows, and commented-out legacy checks
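
To make the first item concrete, here is a minimal sketch of a per-task in-memory cache. It is not the PR's actual code: `ToyDataset`, `ToyTask`, `task_name`, `_build_samples`, and `build_count` are hypothetical stand-ins, and only `set_task`, `use_cache`, and `_sample_dataset_cache` are names taken from the summary above. The choice to skip the cache entirely (neither read nor write) when `use_cache=False` is an assumption.

```python
from typing import Dict, List


class ToyTask:
    """Hypothetical stand-in for a task; only a name is needed here."""

    def __init__(self, task_name: str) -> None:
        self.task_name = task_name


class ToyDataset:
    """Hypothetical stand-in for BaseDataset, reduced to the caching idea."""

    def __init__(self) -> None:
        # Built samples, keyed by task name.
        self._sample_dataset_cache: Dict[str, List[dict]] = {}
        self.build_count = 0  # counts how often samples are rebuilt

    def _build_samples(self, task: ToyTask) -> List[dict]:
        # Placeholder for the expensive sample generation step.
        self.build_count += 1
        return [{"task": task.task_name, "sample_id": i} for i in range(3)]

    def set_task(self, task: ToyTask, use_cache: bool = True) -> List[dict]:
        key = task.task_name
        if use_cache and key in self._sample_dataset_cache:
            # Cache hit: return the previously built object.
            return self._sample_dataset_cache[key]
        samples = self._build_samples(task)
        if use_cache:
            # Assumption: results are stored only when caching is enabled.
            self._sample_dataset_cache[key] = samples
        return samples


dataset = ToyDataset()
task = ToyTask("example_task")
first = dataset.set_task(task)
second = dataset.set_task(task)  # served from the cache
fresh = dataset.set_task(task, use_cache=False)  # forces a rebuild
```

In this sketch the second call returns the same object as the first, while `use_cache=False` triggers a rebuild and leaves the cached entry untouched.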

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File	Description
pyhealth/tasks/medical_coding.py	Changed `pre_filter` signature, updated filters/formats, removed legacy check
pyhealth/tasks/benchmark_ehrshot.py	Updated `pre_filter` signature and formatting of dict/appends
pyhealth/tasks/base_task.py	Aligned `pre_filter` signature to `pl.DataFrame`
pyhealth/datasets/sample_dataset.py	Normalized string quotes and assert formatting
pyhealth/datasets/base_dataset.py	Added `_sample_dataset_cache`, `use_cache` param, and caching logic

Comments suppressed due to low confidence (2)

pyhealth/datasets/base_dataset.py:346

The docstring for set_task should be updated to describe the new use_cache parameter and its behavior.

def set_task(self, task: Optional[BaseTask] = None, num_workers: int = 1, use_cache: bool = True) -> SampleDataset:
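
One way the requested docstring update could look is sketched below. The signature follows the line quoted above, but the class name `ToyDataset`, the `Any` type placeholders, and all of the parameter descriptions are assumptions written for illustration, not the wording that was merged.

```python
from typing import Any, Optional


class ToyDataset:
    """Hypothetical stand-in for BaseDataset, used only to host the docstring."""

    def set_task(
        self,
        task: Optional[Any] = None,  # Optional[BaseTask] in the PR
        num_workers: int = 1,
        use_cache: bool = True,
    ) -> Any:  # SampleDataset in the PR
        """Builds the sample dataset for a task.

        Args:
            task: The task to apply. Assumed to fall back to a default
                task when None.
            num_workers: Number of workers used to generate samples.
            use_cache: If True, reuse a sample dataset already built for
                this task and store newly built ones. If False, always
                rebuild. (Assumed behavior, for illustration.)

        Returns:
            The sample dataset for the task.
        """
        raise NotImplementedError("docstring sketch only")
```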

pyhealth/tasks/medical_coding.py:83

The filtering logic for skipping empty text or code-less samples has been commented out; this may allow invalid or empty samples downstream. Verify this change is intentional or reintroduce a suitable filter.

# if text == "" or len(icd_codes) < 1:
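
For reference, the effect of the commented-out check can be sketched as follows. Only the condition itself comes from the quoted line; the function `make_sample`, its parameters, and the shape of the returned dict are hypothetical.

```python
from typing import List, Optional


def make_sample(patient_id: str, text: str, icd_codes: List[str]) -> Optional[dict]:
    """Return a sample dict, or None when the sample should be skipped."""
    # Same condition as the commented-out line: skip samples that have
    # no text or no ICD codes.
    if text == "" or len(icd_codes) < 1:
        return None
    return {"patient_id": patient_id, "text": text, "icd_codes": icd_codes}


# Hypothetical inputs: one valid sample, one with empty text, one without codes.
candidates = [
    ("p1", "discharge summary", ["401.9"]),
    ("p2", "", ["250.00"]),
    ("p3", "progress note", []),
]
samples = [s for s in (make_sample(*c) for c in candidates) if s is not None]
```

With the check in place, only the first candidate survives; without it, all three would reach downstream code.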

kevinfjiang · 2025-07-06T00:03:55Z

Friendly ping in case you've had a chance to take a look at the PR!

Added sample dataset cache for base dataset

3eeb99b

zzachw requested review from Copilot and zzachw June 12, 2025 05:26

Copilot AI reviewed Jun 12, 2025

Restored medical_coding.py unintended changes and added documentation for use_cache param in BaseDataset.set_task

9532d9e

jhnwu3 added the Highlight for TAs to highlight label Aug 4, 2025
