Skip to content

Conversation

@mikeknep
Copy link
Contributor

@mikeknep mikeknep commented Jan 8, 2026

Adds support for seed dataset plugins. Plugin authors can register new kinds of seed datasets by defining a plugin with a SeedSource and a corresponding SeedReader. Like column generation plugins, seed dataset plugins are automatically wired in so they Just Work with the config builder and the data designer interface. The mechanics are a little different from column generation plugins, since readers are passed in rather than registered in a purely backend registry, but the effect in the library is the same; in NMP, the service will instantiate the readers it can use (HuggingFace + a custom plugin it defines for NMP Files service) and pass them to create_resource_provider itself. Note: I think we could consider changing the seed reader mechanics to be more similar to column generators in the future, if we want.

Also adds a harness for really high-level e2e tests, with two initial tests added for the two plugin types. These tests prove the plugin registry system works all the way at the level of registering plugins in a pyproject.toml. I'm not sure how many additional tests we'd want to put at this high a level vs. keep as traditional unit tests, but there's room if we want it, and I really like the confidence it gives for plugins in particular.

@mikeknep mikeknep changed the title feature: Seed dataset plugins feat: Seed dataset plugins Jan 8, 2026
@mikeknep mikeknep force-pushed the mknepper/feature/seed-dataset-plugins branch 3 times, most recently from 71b2030 to e78bbe5 Compare January 9, 2026 00:51
Comment on lines +12 to +17
plugin_manager = PluginManager()

_SeedSourceT: TypeAlias = LocalFileSeedSource | HuggingFaceSeedSource | DataFrameSeedSource
_SeedSourceT = plugin_manager.inject_into_seed_source_type_union(_SeedSourceT)

SeedSourceT = Annotated[_SeedSourceT, Field(discriminator="seed_type")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type alias needs to live in a separate module instead of config/seed_source.py because:

  • The config classes for seed dataset plugins are expected to inherit from SeedSource
  • Since config code is always save to import/load for plugins, we do so as part of pydantic validation of each Plugin instance to validate the config discriminator is set properly
  • If the code highlighted here is defined in the same module as SeedSource, we run into a deadlock: the plugin registry is actively discovering/loading plugins and so has grabbed a lock, and to instantiate and load a seed dataset plugin we need to load the module containing the base SeedSource class, and when doing so we hit the PluginManager() call, which initializes a (singleton) registry, which tries to grab the lock that is already taken.

We didn't see this before* for column generator plugins because the base classes to inherit from are already defined in a different module than the ColumnConfigT type alias union (config.column_configs and config.column_types respectively)

*or maybe we did see this before and these two modules were set up the way they are now for this very reason; I don't know the history

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*or maybe we did see this before and these two modules were set up the way they are now for this very reason; I don't know the history

haha yes indeed that's why we split them up


class PluginType(str, Enum):
COLUMN_GENERATOR = "column-generator"
SEED_DATASET = "seed-dataset"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be SEED_SOURCE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe SEED_READER?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Probably SEED_READER I guess, since the existing plugin type is essentially aligned with the impl class (which makes sense).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in 7ee33ec

@echo "🧹 Cleaning e2e test environment..."
rm -rf e2e_tests/uv.lock e2e_tests/.pycache e2e_tests/.venv
@echo "🧪 Running e2e tests..."
uv run --no-cache --refresh --directory e2e_tests pytest -s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need --group dev

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe uv run automatically uses the dev group? Seems to be working fine as-is, but I'm not opposed to adding it explicitly

Comment on lines +2 to +4
John,Coltrane
Miles,Davis
Bill,Evans
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dream trio?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing bass and drums to fill out the quintet 😂 In all seriousness, I'm happy to change this to anything else if we want to have a "house style" for dummy data in tests like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Astronomy and jazz legends" sounds like a sweet house style to me!

johnnygreco
johnnygreco previously approved these changes Jan 9, 2026
Copy link
Contributor

@johnnygreco johnnygreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛸

@mikeknep mikeknep force-pushed the mknepper/feature/seed-dataset-plugins branch from da1c8d1 to e1c842c Compare January 9, 2026 19:48
@mikeknep mikeknep merged commit 2cfff52 into main Jan 9, 2026
21 checks passed
@mikeknep mikeknep deleted the mknepper/feature/seed-dataset-plugins branch January 9, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants