
Conversation


@DougManuel DougManuel commented Jun 9, 2025

This PR introduces formal YAML metadata schemas that define the structure, validation rules, and documentation of our core CSV files. The discussion of these metadata files arose from @StaceyFisher's note at our scope planning meeting, and also because we've been having too many CSV git conflicts in cchsflow, which are difficult to manage. More broadly, these schemas are the foundation for data quality, version control optimization, and future interoperability.

This metadata is helpful to discuss early in our recodeflow redevelopment because the metadata can form the basis of code refactoring. Currently, some parts of the YAML are encoded as environment variables.

In this PR

  • An addition at the bottom of the metadata.qmd scope file, which provides an overview.
  • A 1-page scope/specification.
  • Draft variables.yaml and variable_details.yaml.

The YAMLs

  • variables.yaml: Schema for the variable catalog with semantic row IDs (var_{variable_name}), formal transformation patterns, and comprehensive validation rules
  • variable_details.yaml: Schema for value mappings with hierarchical IDs (detail_{variable_name}_{sequence}), interval notation for ranges, and template system support.
  • CSV format specifications: Explicit formatting rules to ensure git-friendly diffs
  • Project-specific sections: CCHS metadata with database patterns, enums, and NA categories
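To make the semantic row-ID idea concrete, here is a rough sketch of what a variables.yaml entry might look like. Everything below is illustrative only: the field names, the enum values, and the regex are assumptions for discussion, not the settled schema.

```yaml
# Illustrative sketch only; not the final schema.
row_id:
  pattern: "var_{variable_name}"   # semantic ID, stable under row reordering
fields:
  - name: variable
    required: true
  - name: variableType
    required: true
    enum: [Categorical, Continuous]
  - name: variableStart
    required: true
    # example of a formal regex for a complex field (placeholder pattern)
    pattern: "^\\[.+\\]::.+$"
```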

Key features of the YAMLs

  • Semantic versioning: Both schemas at v2.1.0 to align with cchsflow.
  • Git-optimized: Semantic row IDs prevent arbitrary reordering in diffs. We want to use these now in cchsflow. That can serve as a test or prototype.
  • Validation rules: Cross-field dependencies and conditional requirements. This may be too much for the current discussion; perhaps treat it as a conceptual approach for now.
  • Formal patterns: Regex specifications for complex fields like variableStart
  • LinkML-ready: Structure designed for future semantic web integration

Schema highlights
Most of the schema details describe how variables.csv and variable_details.csv are currently implemented, but many parts are not in our documentation and certainly not machine-actionable.

  • Required fields determined by data analysis (100% completion rates)
  • Conditional validation (e.g., dummyVariable required for categorical types)
  • Interval notation support: [a,b] for closed, (a,b) for open intervals
  • Missing data category system: NA::a (valid skip), NA::b (don't know/refusal)
  • Template system for reusable transformation patterns
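A hedged sketch of how these highlights could be written down in the schema; the rule syntax below is an assumption for discussion, not settled notation:

```yaml
# Illustrative only: possible encodings of the highlights above.
validation_rules:
  - name: dummy_variable_conditional
    if:   { field: variableType, equals: Categorical }
    then: { field: dummyVariable, required: true }
interval_notation:
  closed: "[a,b]"   # both endpoints included
  open: "(a,b)"     # both endpoints excluded
missing_data_categories:
  "NA::a": "valid skip"
  "NA::b": "don't know / refusal"
```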

Implementation Notes

  • I have ideas where we can store these files.


@DougManuel DougManuel requested a review from Copilot June 9, 2025 00:32


zargot commented Jun 10, 2025

How will the yaml files be used? Is it supposed to complement the current documentation for human readers?

@DougManuel
Contributor Author

How will the yaml files be used? Is it supposed to complement the current documentation for human readers?

I stated the following uses in metadata-schema.qmd in this PR. Our discussion about CSV format and BOM is a good example of the benefit of having that format in a configuration/documentation file.

Primary Use Cases

1. Data Validation

  • Pre-commit hooks validate CSV structure
  • Type checking and constraint enforcement

2. Documentation Generation

  • Automated codebooks from schema definitions

3. Transformation Specifications

  • Formal recoding rules for reproducibility

4. Tool Integration

  • R package functions consume schemas directly
  • Validation tools parse YAML specifications
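As a concrete (and purely hypothetical) sketch of the validation use case, in base R: here the schema is an inline list so the example is self-contained, but in practice it would come from variables.yaml via yaml::read_yaml(). The field names (required_fields, variable_type_enum) and the example rows are assumptions, not the real schema.

```r
# Hypothetical validator sketch; schema field names are assumptions.
validate_variables <- function(sheet, schema) {
  # check that all required columns are present
  missing <- setdiff(schema$required_fields, names(sheet))
  if (length(missing) > 0)
    stop("Missing required columns: ", paste(missing, collapse = ", "))
  # enforce the variableType enum
  bad <- !sheet$variableType %in% schema$variable_type_enum
  if (any(bad))
    stop("Invalid variableType values: ",
         paste(unique(sheet$variableType[bad]), collapse = ", "))
  invisible(TRUE)
}

# In practice: schema <- yaml::read_yaml("./schemas/variables.yaml")
schema <- list(
  required_fields = c("variable", "variableType", "label"),
  variable_type_enum = c("Categorical", "Continuous")
)
sheet <- data.frame(
  variable = c("ALW_1", "HWTGBMI"),
  variableType = c("Categorical", "Continuous"),
  label = c("Drinks alcohol", "Body mass index")
)
validate_variables(sheet, schema)
```

A pre-commit hook would then just run a check like this over the CSVs and fail the commit on any error.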


zargot commented Jun 11, 2025

1. Data Validation

  • Pre-commit hooks validate CSV structure
  • Type checking and constraint enforcement

Is there an existing tool you have in mind for validating the CSV files with the yaml schemas? We can write our own of course.

4. Tool Integration

  • R package functions consume schemas directly

Do you mean the rec_with_table function, or new API functions that will be added in the next version? I don't fully understand the library yet, so how will the schema be used in these functions?

@DougManuel
Contributor Author

Is there an existing tool you have in mind for validating the CSV files with the yaml schemas? We can write our own of course.

No tool or code yet; this is just the specification stage of the metadata. I envisioned that we'd write our own, using the YAML as the configuration. So we'd need to keep that in mind during this review of the metadata specification.

4. Tool Integration

  • R package functions consume schemas directly

Do you mean the rec_with_table function, or new API functions that will be added in the next version? I don't fully understand the library yet, so how will the schema be used in these functions?

I was mainly thinking of the existing rec_with_table, but I haven't reviewed the existing code from that perspective, and I would value your perspective as you understand the library. We have environment variables, I believe, that hold some of the information in the metadata. If so, it seems that the metadata should be the starting point. If needed, the environment variables would be generated from the metadata.

It might be overkill, I think, but when I saw this line in your PR:

variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = "UTF-8-BOM")

I thought it could be:

variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = variables_encoding)

Where variables_encoding comes from line 7 of the variables.yaml file.

@DougManuel
Contributor Author

# Load YAML and extract the encoding (line 7 of variables.yaml)
schema <- yaml::read_yaml("./schemas/variables.yaml")
variables_encoding <- schema$csv_format$encoding

# Use it in the read.csv call
variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = variables_encoding)


zargot commented Jun 11, 2025

I see, thank you!

@yulric yulric left a comment


The purpose of the PR is good but needs work around:

  1. Separating the scope/specifications and implementation documentation
  2. Having documentation around the fields within the schema metadata files
  3. Sharing metadata between schemas, mainly validation rules


# How to identify template definitions
template_definitions:
marker_field: "templateVariable"
Collaborator

Are there documentation for these fields? Or are they coming from some other schema standard?

Contributor Author

A few notes:

  • I reviewed template variables and cleaned/removed these from variable_details.yaml. I am not clear why I added those.
  • I did catch that we are missing catStartLabel, which is needed for templateVariables and other uses. That was an important omission that has now been added.
  • There is a considerably expanded description of templateVariables. This could be pared down. However, a question is how and where we should define the rules for this variable.
  • As well, for our metadata and the actual variables and variable_details sheets, we may need to think a bit more about 'core', 'optional', 'extensions', and then database-specific metadata and rules. Are these the right concepts? How do we want to present them to users and capture them in metadata?
  • I noticed that custom-variables.rmd, the vignette for templateVariables, never made it to the pkgdown site (it is not in _pkgdown.yml). Can you check and make an issue (or just add it)?

@DougManuel
Contributor Author

The purpose of the PR is good but needs work around:

  1. Separating the scope/specifications and implementation documentation
  2. Having documentation around the fields within the schema metadata files
  3. Sharing metadata between schemas, mainly validation rules

1. Separating scope/specifications and implementation documentation

Restructured into 3-layer architecture:

Clear separation between "what we need" (scope) vs "how to build it" (implementation).

2. Having documentation around fields within schema metadata files

Solution: Added comprehensive notes sections to all major fields. I realize that usually we'd write the specifications and then implement, but I thought this two-birds-with-one-stone approach works because we've essentially drafted the implementation, and we can reuse the field documentation in the final YAMLs. The result is self-contained documentation within the schema files, repurposable for human-readable docs.

- name: "variableType" 
  notes: |
    This classification determines how rec_with_table() processes the variable:
    - "Categorical": Uses factor() with explicit levels and labels
    - "Continuous": Preserves numeric values and applies transformations

- name: "recEnd"
  notes: |
    Supports diverse patterns from real-world validation:
    - Simple values: "1", "2", "3" for categorical codes
    - Missing data: "NA::a", "NA::b" with haven::tagged_na() integration
    - Function references: "Func::bmi_cat_fun" for complex derivations

3. Sharing metadata between schemas mainly validation rules

Implemented shared validation architecture:

shared_validation_architecture:
  core_rules: "Defined in base schemas"
  extension_rules: "Inherit and extend (never remove) core rules"
  template_patterns: "Reusable across all projects"
  missing_data_categories: "Standardized across projects"

  • Template system enables pattern reuse between schemas
  • Inheritance model where cchsflow/raiflow extend (never remove) core rules
  • Standardized missing data with haven::tagged_na() integration across projects
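For example, with an entirely hypothetical inheritance syntax, a cchsflow extension schema might declare:

```yaml
# Hypothetical sketch; the actual inheritance mechanism is not finalized.
extends: "core/variable_details.yaml"
validation_rules:
  inherit: all          # extensions may add rules but never remove core ones
  add:
    - name: cchs_specific_rule
      description: "Placeholder for a project-specific rule (illustrative)"
```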

Integration with cchsflow advances

There is a range of updates from testing the YAMLs in practice. For example, the CCHS-specific YAML has been considerably expanded with more information on CCHS's missing data approaches, along with enhanced transformation patterns, more complex interval notation, etc.

Files changed:

  • Documentation restructured into 3-layer architecture
  • Both schema files were enhanced with field notes and shared validation
  • Natural row identifiers (variable names) replace the artificial fileRowId approach

@DougManuel
Contributor Author

Overall, there has been a modest-to-large refactoring of the metadata to address the comments, and also from experience using the metadata for a large cchsflow variable-addition exercise.

A re-review is in order, and there are a few outstanding discussion points in the conversation.

We currently have:

New Architecture

  1. metadata_registry.yaml - Central coordination layer
    - Shared specifications (CSV format, tier system, validation patterns)
    - Extension registry (template variables, project extensions)
    - Cross-validation rules and implementation guidance
    - DRY compliance: Eliminates duplication between schemas
  2. variables.yaml - Clean variable schema
    - Field definitions for variables.csv structure
    - References registry for shared specifications
    - Focused solely on variable-level metadata
  3. variable_details.yaml - Clean transformation schema
    - Field definitions for variable_details.csv structure
    - References registry for shared specifications
    - Focused solely on transformation rules

Key Architectural Benefit

  • DRY Principle: Single source of truth for shared specifications
  • Separation of Concerns: Each file has distinct, focused responsibility
  • Extension Management: Centralized registry for current and future extensions
  • AI-Friendly Documentation: Clear structure prevents AI-assist errors
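Under this DRY model, the two schema files would reference the registry instead of repeating shared blocks. A sketch, with placeholder key names:

```yaml
# variables.yaml (sketch; key names are placeholders)
registry_ref: "metadata_registry.yaml"
uses_from_registry:
  - csv_format
  - validation_patterns
  - missing_data_categories
fields:
  - name: variable
    required: true
```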

Introduces metadata_registry.yaml as central coordination layer, streamlined
variables.yaml and variable_details.yaml schemas, and ecosystem_roadmap.md
for future planning. Eliminates duplication while adding production-tested
validation patterns from real-world cchsflow usage.
@DougManuel DougManuel requested review from Copilot and yulric June 22, 2025 18:13

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a formal YAML metadata schema architecture for the recodeflow system to improve CSV file management, validation, and documentation. The key changes include:

  • Adding a new "Metadata Schema Architecture" section in metadata.qmd to outline the schema design.
  • Introducing new schema documentation in metadata-schemas.qmd and new YAML schema files (variables.yaml, variable_details.yaml, metadata_registry.yaml) defining the data structure and validation rules.
  • Adding an ecosystem roadmap (ecosystem_roadmap.md) to outline future metadata developments.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Summary per file:

  • scope-docs/metadata.qmd: Added a new section describing the metadata schema architecture.
  • scope-docs/metadata-schemas.qmd: New documentation detailing YAML metadata schema specifications.
  • scope-docs/metadata-schema/variables.yaml: New YAML file defining the variables schema for CSV files.
  • scope-docs/metadata-schema/variable_details.yaml: New YAML file defining the variable details schema.
  • scope-docs/metadata-schema/metadata_registry.yaml: New YAML registry for shared metadata validation and formatting.
  • scope-docs/ecosystem_roadmap.md: New roadmap outlining future improvements and ecosystem plans.
Comments suppressed due to low confidence (4)

scope-docs/metadata.qmd:39

  • [nitpick] The newly added line 'The sheets themselves' appears incomplete and may confuse readers. Consider elaborating or revising this sentence for clarity.
The sheets themselves 

scope-docs/metadata-schema/variables.yaml:1

  • The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"

scope-docs/metadata-schema/variable_details.yaml:1

  • The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"

scope-docs/metadata-schema/metadata_registry.yaml:1

  • The schema_version is set to '1.0.0', which is inconsistent with the 'v2.1.0' version mentioned in the PR description. Please update the version to align with the intended release.
schema_version: "1.0.0"

- Implements Dublin Core standard with 10 core fields from PRs #65 and #43
- Follows three-file architecture with registry reference for DRY principles
- Includes recodeflow-specific extensions for workflow integration
- Supports metadata file naming conventions and validation rules
@DougManuel
Contributor Author

added database metadata.

…age separation

Major changes:
- Clean metadata_registry.yaml: removed usage guidance, enhanced extension registry
- Add CCHS examples with validation rules and complete variable definitions
- Create separate metadata_schema_usage_guide.md for implementation guidance
- Register haven::tagged_na() integration and template system extensions
- Add cross-validation requirements for schema consistency

Architecture improvements:
- Clear separation between registry coordination and usage documentation
- Database-specific extensions with proper validation rules and examples
- Production statistics for template system (87% reduction validated)
- Comprehensive CCHS patterns based on 3,577 real records
@DougManuel
Contributor Author

  1. I've added the CCHS-specific yaml files. I envisioned these as just placeholders for discussion. We'd want our file examples to use the pbc data, because that is the example data in recodeflow.

  2. There are additional files, also for discussion. I had an interesting time trying out Claude Code. That AI assist generally worked really well, but with the challenge that additional context is needed beyond the yaml files. That is quite reasonable, and human collaborators need that documentation too. But then we get documentation bloat.

To sum up, I am not sure of the best approach for the final documentation of the YAMLs.

In this regard, tests will be helpful, and critical in some parts. @yulric and I are getting different results for our informal checks of the regex sections. We'll want tests for those.
