
Conversation


@DougManuel DougManuel commented Jun 9, 2025

This PR introduces formal YAML metadata schemas that define the structure, validation rules, and documentation of our core CSV files. The discussion of these metadata files arose from @StaceyFisher's note at our scope planning meeting, and also because we've been having too many CSV git conflicts in cchsflow, which are difficult to manage. More broadly, these schemas are the foundation for data quality, version control optimization, and future interoperability.

This metadata is helpful to discuss early in our recodeflow redevelopment because the metadata can form the basis of code refactoring. Currently, some parts of the YAML are encoded as environment variables.

In this PR

  • An addition at the bottom of the metadata.qmd scope file, which provides an overview.
  • A 1-page scope/specification.
  • Draft variables.yaml and variable_details.yaml.

The YAMLs

  • variables.yaml: Schema for the variable catalog with semantic row IDs (var_{variable_name}), formal transformation patterns, and comprehensive validation rules
  • variable_details.yaml: Schema for value mappings with hierarchical IDs (detail_{variable_name}_{sequence}), interval notation for ranges, and template system support.
  • CSV format specifications: Explicit formatting rules to ensure git-friendly diffs
  • Project-specific sections: CCHS metadata with database patterns, enums, and NA categories
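To make the semantic row-ID idea concrete, here is a rough sketch of what a variables.yaml entry might look like. Everything below is illustrative only: the field names, the enum values, and the regex are assumptions for discussion, not the settled schema.

```yaml
# Illustrative sketch only; not the final schema.
row_id:
  pattern: "var_{variable_name}"   # semantic ID, stable under row reordering
fields:
  - name: variable
    required: true
  - name: variableType
    required: true
    enum: [Categorical, Continuous]
  - name: variableStart
    required: true
    # example of a formal regex for a complex field (placeholder pattern)
    pattern: "^\\[.+\\]::.+$"
```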

Key features of the YAMLs

  • Semantic versioning: Both schemas at v2.1.0 to align with cchsflow.
  • Git-optimized: Semantic row IDs prevent arbitrary reordering in diffs. We want to use these now in cchsflow. That can serve as a test or prototype.
  • Validation rules: Cross-field dependencies and conditional requirements. This may be too much for the current discussion; perhaps treat it as a conceptual approach for now.
  • Formal patterns: Regex specifications for complex fields like variableStart
  • LinkML-ready: Structure designed for future semantic web integration

Schema highlights
Most of the schema details describe how variables.csv and variable_details.csv are currently implemented, but many parts are not in our documentation and certainly not machine-actionable.

  • Required fields determined by data analysis (100% completion rates)
  • Conditional validation (e.g., dummyVariable required for categorical types)
  • Interval notation support: [a,b] for closed, (a,b) for open intervals
  • Missing data category system: NA::a (valid skip), NA::b (don't know/refusal)
  • Template system for reusable transformation patterns
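A hedged sketch of how these highlights could be written down in the schema; the rule syntax below is an assumption for discussion, not settled notation:

```yaml
# Illustrative only: possible encodings of the highlights above.
validation_rules:
  - name: dummy_variable_conditional
    if:   { field: variableType, equals: Categorical }
    then: { field: dummyVariable, required: true }
interval_notation:
  closed: "[a,b]"   # both endpoints included
  open: "(a,b)"     # both endpoints excluded
missing_data_categories:
  "NA::a": "valid skip"
  "NA::b": "don't know / refusal"
```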

Implementation Notes

  • I have ideas where we can store these files.


@DougManuel DougManuel requested a review from Copilot June 9, 2025 00:32


zargot commented Jun 10, 2025

How will the yaml files be used? Is it supposed to complement the current documentation for human readers?

@DougManuel
Contributor Author

How will the yaml files be used? Is it supposed to complement the current documentation for human readers?

I stated the following uses in metadata-schema.qmd in this PR. Our discussion about CSV format and BOM is a good example of the benefit of having that format in a configuration/documentation file.

Primary Use Cases

1. Data Validation

  • Pre-commit hooks validate CSV structure
  • Type checking and constraint enforcement

2. Documentation Generation

  • Automated codebooks from schema definitions

3. Transformation Specifications

  • Formal recoding rules for reproducibility

4. Tool Integration

  • R package functions consume schemas directly
  • Validation tools parse YAML specifications
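As a concrete (and purely hypothetical) sketch of the validation use case, in base R: here the schema is an inline list so the example is self-contained, but in practice it would come from variables.yaml via yaml::read_yaml(). The field names (required_fields, variable_type_enum) and the example rows are assumptions, not the real schema.

```r
# Hypothetical validator sketch; schema field names are assumptions.
validate_variables <- function(sheet, schema) {
  # check that all required columns are present
  missing <- setdiff(schema$required_fields, names(sheet))
  if (length(missing) > 0)
    stop("Missing required columns: ", paste(missing, collapse = ", "))
  # enforce the variableType enum
  bad <- !sheet$variableType %in% schema$variable_type_enum
  if (any(bad))
    stop("Invalid variableType values: ",
         paste(unique(sheet$variableType[bad]), collapse = ", "))
  invisible(TRUE)
}

# In practice: schema <- yaml::read_yaml("./schemas/variables.yaml")
schema <- list(
  required_fields = c("variable", "variableType", "label"),
  variable_type_enum = c("Categorical", "Continuous")
)
sheet <- data.frame(
  variable = c("ALW_1", "HWTGBMI"),
  variableType = c("Categorical", "Continuous"),
  label = c("Drinks alcohol", "Body mass index")
)
validate_variables(sheet, schema)
```

A pre-commit hook would then just run a check like this over the CSVs and fail the commit on any error.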


zargot commented Jun 11, 2025

1. Data Validation

  • Pre-commit hooks validate CSV structure
  • Type checking and constraint enforcement

Is there an existing tool you have in mind for validating the CSV files with the yaml schemas? We can write our own of course.

4. Tool Integration

  • R package functions consume schemas directly

Do you mean the rec_with_table function, or new API functions that will be added in the next version? I don't fully understand the library yet, so how will the schema be used in these functions?

@DougManuel
Contributor Author

Is there an existing tool you have in mind for validating the CSV files with the yaml schemas? We can write our own of course.

No tool or code yet; this is just the specification stage of the metadata. I envisioned that we'd write our own, using the YAML as the configuration. So we'd need to keep that in mind during this review of the metadata specification.

4. Tool Integration

  • R package functions consume schemas directly

Do you mean the rec_with_table function, or new API functions that will be added in the next version? I don't fully understand the library yet, so how will the schema be used in these functions?

I was mainly thinking of the existing rec_with_table, but I haven't reviewed the existing code from that perspective, and I would value your perspective as you understand the library. We have environment variables, I believe, that hold some of the information in the metadata. If so, it seems that the metadata should be the starting point. If needed, the environment variables would be generated from the metadata.

It might be overkill, I think, but when I saw this line in your PR:

variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = "UTF-8-BOM")

I thought it could be:

variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = variables_encoding)

Where variables_encoding comes from line 7 of the variables.yaml file.

@DougManuel
Contributor Author

# Load YAML and extract the encoding (line 7 of variables.yaml)
schema <- yaml::read_yaml("./schemas/variables.yaml")
variables_encoding <- schema$csv_format$encoding

# Use it in the read.csv call
variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = variables_encoding)


zargot commented Jun 11, 2025

I see, thank you!

@yulric yulric left a comment


The purpose of the PR is good but needs work around:

  1. Separating the scope/specifications and implementation documentation
  2. Having documentation around the fields within the schema metadata files
  3. Sharing metadata between schemas, mainly validation rules


# How to identify template definitions
template_definitions:
marker_field: "templateVariable"
Collaborator

Are there documentation for these fields? Or are they coming from some other schema standard?

Contributor Author

A few notes:

  • I reviewed template variables and cleaned/removed these from variable_details.yaml. I am not clear why I added those.
  • I did catch that we are missing catStartLabel, which is needed for templateVariables and other uses. That was an important omission that has now been added.
  • There is a considerably expanded description of templateVariables. This could be pared down. However, a question is how and where we should define the rules for this variable.
  • As well, for our metadata and the actual variables and variable_details sheets, we may need to think a bit more about 'core', 'optional', 'extensions', and then database-specific metadata and rules. Are these the right concepts? How do we want to present them to users and capture them in metadata?
  • I noticed that custom-variables.rmd, the vignette for templateVariables, never made it to the pkgdown site (it is not in _pkgdown.yml). Can you check and make an issue (or just add it)?

@DougManuel
Contributor Author

The purpose of the PR is good but needs work around:

  1. Separating the scope/specifications and implementation documentation
  2. Having documentation around the fields within the schema metadata files
  3. Sharing metadata between schemas, mainly validation rules

1. Separating scope/specifications and implementation documentation

Restructured into 3-layer architecture:

Clear separation between "what we need" (scope) vs "how to build it" (implementation).

2. Having documentation around fields within schema metadata files

Solution: Added comprehensive notes sections to all major fields. I realize that usually we'd write the specifications and then implement, but I thought this two-birds-with-one-stone approach works because we've essentially drafted the implementation, and we can reuse the field documentation in the final YAMLs. The result is self-contained documentation within the schema files, repurposable for human-readable docs.

- name: "variableType" 
  notes: |
    This classification determines how rec_with_table() processes the variable:
    - "Categorical": Uses factor() with explicit levels and labels
    - "Continuous": Preserves numeric values and applies transformations

- name: "recEnd"
  notes: |
    Supports diverse patterns from real-world validation:
    - Simple values: "1", "2", "3" for categorical codes
    - Missing data: "NA::a", "NA::b" with haven::tagged_na() integration
    - Function references: "Func::bmi_cat_fun" for complex derivations

3. Sharing metadata between schemas mainly validation rules

Implemented shared validation architecture:

shared_validation_architecture:
  core_rules: "Defined in base schemas"
  extension_rules: "Inherit and extend (never remove) core rules"
  template_patterns: "Reusable across all projects"
  missing_data_categories: "Standardized across projects"

  • Template system enables pattern reuse between schemas
  • Inheritance model where cchsflow/raiflow extend (never remove) core rules
  • Standardized missing data with haven::tagged_na() integration across projects
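For example, with an entirely hypothetical inheritance syntax, a cchsflow extension schema might declare:

```yaml
# Hypothetical sketch; the actual inheritance mechanism is not finalized.
extends: "core/variable_details.yaml"
validation_rules:
  inherit: all          # extensions may add rules but never remove core ones
  add:
    - name: cchs_specific_rule
      description: "Placeholder for a project-specific rule (illustrative)"
```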

Integration with cchsflow advances

There is a range of updates from testing the YAMLs in practice. For example, the CCHS-specific YAML has been considerably expanded with more information on CCHS's missing data approaches, along with enhanced transformation patterns, more complex interval notation, etc.

Files changed:

  • Documentation restructured into 3-layer architecture
  • Both schema files were enhanced with field notes and shared validation
  • Natural row identifiers (variable names) replace the artificial fileRowId approach

@DougManuel
Contributor Author

Overall, there has been a modest-to-large refactoring of the metadata to address the comments, and also from experience using the metadata for a large cchsflow variable-addition exercise.

A re-review is in order, and there are a few outstanding discussion points in the conversation.

We currently have:

New Architecture

  1. metadata_registry.yaml - Central coordination layer
    - Shared specifications (CSV format, tier system, validation patterns)
    - Extension registry (template variables, project extensions)
    - Cross-validation rules and implementation guidance
    - DRY compliance: Eliminates duplication between schemas
  2. variables.yaml - Clean variable schema
    - Field definitions for variables.csv structure
    - References registry for shared specifications
    - Focused solely on variable-level metadata
  3. variable_details.yaml - Clean transformation schema
    - Field definitions for variable_details.csv structure
    - References registry for shared specifications
    - Focused solely on transformation rules

Key Architectural Benefit

  • DRY Principle: Single source of truth for shared specifications
  • Separation of Concerns: Each file has distinct, focused responsibility
  • Extension Management: Centralized registry for current and future extensions
  • AI-Friendly Documentation: Clear structure prevents AI-assist errors
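Under this DRY model, the two schema files would reference the registry instead of repeating shared blocks. A sketch, with placeholder key names:

```yaml
# variables.yaml (sketch; key names are placeholders)
registry_ref: "metadata_registry.yaml"
uses_from_registry:
  - csv_format
  - validation_patterns
  - missing_data_categories
fields:
  - name: variable
    required: true
```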

Introduces metadata_registry.yaml as central coordination layer, streamlined
variables.yaml and variable_details.yaml schemas, and ecosystem_roadmap.md
for future planning. Eliminates duplication while adding production-tested
validation patterns from real-world cchsflow usage.
@DougManuel DougManuel requested review from Copilot and yulric June 22, 2025 18:13

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a formal YAML metadata schema architecture for the recodeflow system to improve CSV file management, validation, and documentation. The key changes include:

  • Adding a new "Metadata Schema Architecture" section in metadata.qmd to outline the schema design.
  • Introducing new schema documentation in metadata-schemas.qmd and new YAML schema files (variables.yaml, variable_details.yaml, metadata_registry.yaml) defining the data structure and validation rules.
  • Adding an ecosystem roadmap (ecosystem_roadmap.md) to outline future metadata developments.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • scope-docs/metadata.qmd: Added a new section describing the metadata schema architecture.
  • scope-docs/metadata-schemas.qmd: New documentation detailing YAML metadata schema specifications.
  • scope-docs/metadata-schema/variables.yaml: New YAML file defining the variables schema for CSV files.
  • scope-docs/metadata-schema/variable_details.yaml: New YAML file defining the variable details schema.
  • scope-docs/metadata-schema/metadata_registry.yaml: New YAML registry for shared metadata validation and formatting.
  • scope-docs/ecosystem_roadmap.md: New roadmap outlining future improvements and ecosystem plans.
Comments suppressed due to low confidence (4)

scope-docs/metadata.qmd:39

  • [nitpick] The newly added line 'The sheets themselves' appears incomplete and may confuse readers. Consider elaborating or revising this sentence for clarity.
The sheets themselves 

scope-docs/metadata-schema/variables.yaml:1

  • The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"

scope-docs/metadata-schema/variable_details.yaml:1

  • The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"

scope-docs/metadata-schema/metadata_registry.yaml:1

  • The schema_version is set to '1.0.0', which is inconsistent with the 'v2.1.0' version mentioned in the PR description. Please update the version to align with the intended release.
schema_version: "1.0.0"

- Implements Dublin Core standard with 10 core fields from PRs #65 and #43
- Follows three-file architecture with registry reference for DRY principles
- Includes recodeflow-specific extensions for workflow integration
- Supports metadata file naming conventions and validation rules
@DougManuel
Contributor Author

added database metadata.

…age separation

Major changes:
- Clean metadata_registry.yaml: removed usage guidance, enhanced extension registry
- Add CCHS examples with validation rules and complete variable definitions
- Create separate metadata_schema_usage_guide.md for implementation guidance
- Register haven::tagged_na() integration and template system extensions
- Add cross-validation requirements for schema consistency

Architecture improvements:
- Clear separation between registry coordination and usage documentation
- Database-specific extensions with proper validation rules and examples
- Production statistics for template system (87% reduction validated)
- Comprehensive CCHS patterns based on 3,577 real records
@DougManuel
Contributor Author

  1. I've added the CCHS-specific yaml files. I envisioned these as just placeholders for discussion. We'd want our file examples to use the pbc data, because that is the example data in recodeflow.

  2. There are additional files, also for discussion. I had an interesting time trying out Claude Code. That AI assist generally worked really well, but with the challenge that additional context is needed beyond the yaml files. That is quite reasonable, and human collaborators need that documentation too. But then we get documentation bloat.

To sum up, I am not sure of the best approach for the final documentation of the YAMLs.

In this regard, tests will be helpful, and critical in some parts. @yulric and I are getting different results for our informal checks of the regex sections. We'll want tests for those.
