feat: Metadata schema architecture to scope #70
base: scoping-doc
How will the YAML files be used? Are they meant to complement the current documentation for human readers?
Primary Use Cases
1. Data Validation
2. Documentation Generation
3. Transformation Specifications
4. Tool Integration
Is there an existing tool you have in mind for validating the CSV files with the yaml schemas? We can write our own of course.
Do you mean the |
No tool or code yet. This is just the specification stage of the metadata. I envisioned that we'd write our own tool that would use the YAML as its configuration. So, we'd need to keep that in mind during this review of the metadata specification.
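To make the "YAML as configuration" idea concrete, here is a minimal sketch of a home-grown validator, written in Python purely for illustration (the eventual tool would presumably be in R). The `SCHEMA` dict stands in for a parsed YAML file, and the keys `fields`, `required`, and `allowed` are assumptions for this example, not names taken from the PR's schemas.

```python
import csv
import io

# Hypothetical schema, standing in for a parsed YAML file.
# The keys "fields", "required" and "allowed" are assumptions for illustration.
SCHEMA = {
    "fields": [
        {"name": "variable", "required": True},
        {"name": "variableType", "required": True,
         "allowed": ["Categorical", "Continuous"]},
        {"name": "label", "required": False},
    ]
}


def validate_csv(text, schema):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # Column-level check: every required field must appear in the header.
    for field in schema["fields"]:
        if field.get("required") and field["name"] not in header:
            problems.append(f"missing required column: {field['name']}")
    # Row-level checks. Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            name = field["name"]
            value = row.get(name)
            if value is None:
                continue  # column absent; already reported above if required
            if field.get("required") and value == "":
                problems.append(f"line {line_no}: {name} is empty")
            allowed = field.get("allowed")
            if allowed and value != "" and value not in allowed:
                problems.append(f"line {line_no}: {name}={value!r} not in {allowed}")
    return problems
```

Calling `validate_csv(csv_text, SCHEMA)` returns an empty list when the sheet conforms. Keeping the rules in data rather than in code is what would let the same YAML drive validation, documentation, and tooling.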
I was mainly thinking of existing It would be an overkill, I think, but when I saw this line in your PR:
I thought it could be:
Where |
# Load YAML and extract encoding from line 7
schema <- yaml::read_yaml("./schemas/variables.yaml")
# Use it in the read.csv call
variables_sheet <- read.csv("./integration-assets/variables.csv", fileEncoding = variables_encoding)
I see, thank you! |
The purpose of the PR is good, but it needs work on:
- Separating the scope/specifications and implementation documentation
- Having documentation around the fields within the schema metadata files
- Sharing metadata between schemas, mainly validation rules
# How to identify template definitions
template_definitions:
  marker_field: "templateVariable"
Is there documentation for these fields? Or are they coming from some other schema standard?
A few notes:
- I reviewed template variables and cleaned/removed these from the variable_details.yaml. I am not clear why I added those.
- I did catch that we are missing catStartLabel, which is needed for templateVariables and other uses. That was an important omission that has now been added.
- There is a considerably expanded description of templateVariables. This could be pared down. However, a question is how and where we should define the rules for this variable.
- As well, for our metadata and the actual variables and variable_details sheets, we may need to think a bit more about 'core', 'optional', 'extensions', and then database-specific metadata and rules. Are these the concepts? How do we want to present them to users and capture them in metadata?
- I noticed that custom-variables.rmd, the vignette for templateVariables.rmd never made it to the pkgdown site (not on _pkgdown.yml). Can you check and make an issue (or just add it)?
1. Separating scope/specifications and implementation documentation
Restructured into 3-layer architecture:
Clear separation between "what we need" (scope) vs "how to build it" (implementation).
2. Having documentation around fields within schema metadata files
Solution: Added comprehensive
- name: "variableType"
  notes: |
    This classification determines how rec_with_table() processes the variable:
    - "Categorical": Uses factor() with explicit levels and labels
    - "Continuous": Preserves numeric values and applies transformations
- name: "recEnd"
  notes: |
    Supports diverse patterns from real-world validation:
    - Simple values: "1", "2", "3" for categorical codes
    - Missing data: "NA::a", "NA::b" with haven::tagged_na() integration
    - Function references: "Func::bmi_cat_fun" for complex derivations
3. Sharing metadata between schemas mainly validation rules
Implemented shared validation architecture:
shared_validation_architecture:
  core_rules: "Defined in base schemas"
  extension_rules: "Inherit and extend (never remove) core rules"
  template_patterns: "Reusable across all projects"
  missing_data_categories: "Standardized across projects"
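One way to read the "inherit and extend (never remove)" rule is sketched below, in Python for illustration only. The function name `merge_rules` and the rule names in the usage note are hypothetical. Treating a redefinition of a core rule as an error is an assumption about what "extend" should mean, not something the registry states.

```python
def merge_rules(core, extension):
    """Combine core and extension validation rules.

    Core rules are always kept, so an extension cannot remove one.
    Assumption: an extension also may not redefine a core rule; it may
    only add new rules alongside the core set.
    """
    conflicts = [key for key in extension
                 if key in core and extension[key] != core[key]]
    if conflicts:
        raise ValueError(f"extension changes core rules: {sorted(conflicts)}")
    merged = dict(core)       # start from every core rule
    merged.update(extension)  # then add the extension's own rules
    return merged
```

For example, `merge_rules({"variable_required": True}, {"cchs_cycle_pattern": "^cchs"})` returns both rules, while an extension that sets `variable_required` to `False` raises `ValueError`.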
Integration with cchsflow advances
There are a range of updates from testing the YAMLs in practice. For example, the CCHS-specific YAML has been considerably expanded with more information for CCHS's missing data approaches. There are enhanced transformation patterns and complex interval notation, etc.
Files changed:
Overall, there has been a modest to large refactoring of the metadata to address the comments, and also to reflect experience using the metadata for a large cchsflow variable addition exercise. A re-review is in order, and there are a few outstanding discussion points in the conversation. We currently have:
New Architecture
Key Architectural Benefit
Introduces metadata_registry.yaml as central coordination layer, streamlined variables.yaml and variable_details.yaml schemas, and ecosystem_roadmap.md for future planning. Eliminates duplication while adding production-tested validation patterns from real-world cchsflow usage.
Pull Request Overview
This PR introduces a formal YAML metadata schema architecture for the recodeflow system to improve CSV file management, validation, and documentation. The key changes include:
- Adding a new "Metadata Schema Architecture" section in metadata.qmd to outline the schema design.
- Introducing new schema documentation in metadata-schemas.qmd and new YAML schema files (variables.yaml, variable_details.yaml, metadata_registry.yaml) defining the data structure and validation rules.
- Adding an ecosystem roadmap (ecosystem_roadmap.md) to outline future metadata developments.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
File | Description |
---|---|
scope-docs/metadata.qmd | Added a new section describing the metadata schema architecture. |
scope-docs/metadata-schemas.qmd | New documentation detailing YAML metadata schema specifications. |
scope-docs/metadata-schema/variables.yaml | New YAML file defining the variables schema for CSV files. |
scope-docs/metadata-schema/variable_details.yaml | New YAML file defining the variable details schema. |
scope-docs/metadata-schema/metadata_registry.yaml | New YAML registry for shared metadata validation and formatting. |
scope-docs/ecosystem_roadmap.md | New roadmap outlining future improvements and ecosystem plans. |
Comments suppressed due to low confidence (4)
scope-docs/metadata.qmd:39
- [nitpick] The newly added line 'The sheets themselves' appears incomplete and may confuse readers. Consider elaborating or revising this sentence for clarity.
The sheets themselves
scope-docs/metadata-schema/variables.yaml:1
- The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"
scope-docs/metadata-schema/variable_details.yaml:1
- The schema_version is set to '1.0.0', which does not match the intended 'v2.1.0' as described in the PR. Please update the version for consistency.
schema_version: "1.0.0"
scope-docs/metadata-schema/metadata_registry.yaml:1
- The schema_version is set to '1.0.0', which is inconsistent with the 'v2.1.0' version mentioned in the PR description. Please update the version to align with the intended release.
schema_version: "1.0.0"
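A version mismatch like the one flagged above could be caught by a small consistency check. The sketch below is in Python for illustration only. It assumes each schema file has already been parsed into a dict, and `check_versions` and `normalize` are hypothetical helper names.

```python
def normalize(version):
    """Drop a leading "v" so that "v2.1.0" and "2.1.0" compare equal."""
    return str(version).lstrip("v")


def check_versions(schemas, expected):
    """Return the names of schema files whose schema_version differs from expected.

    `schemas` maps a file name to its parsed contents (a dict).
    A file with no schema_version key is reported as a mismatch.
    """
    target = normalize(expected)
    return sorted(name for name, doc in schemas.items()
                  if normalize(doc.get("schema_version")) != target)
```

An empty result means every file agrees with the intended version. Otherwise the returned list names the files to update.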
added database metadata. |
…age separation
Major changes:
- Clean metadata_registry.yaml: removed usage guidance, enhanced extension registry
- Add CCHS examples with validation rules and complete variable definitions
- Create separate metadata_schema_usage_guide.md for implementation guidance
- Register haven::tagged_na() integration and template system extensions
- Add cross-validation requirements for schema consistency
Architecture improvements:
- Clear separation between registry coordination and usage documentation
- Database-specific extensions with proper validation rules and examples
- Production statistics for template system (87% reduction validated)
- Comprehensive CCHS patterns based on 3,577 real records
To sum up, I am not sure of the best approach for the final documentation of the YAMLs. In this regard, tests will be helpful, and critical in some parts. @yulric and I are getting different pass results in our informal checks of the RegEx sections. We'll want tests for those.
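A table-driven test is one way to pin down what each RegEx should accept, so that two reviewers run exactly the same cases. The sketch below is in Python for illustration only. `REC_END` is a hypothetical pattern written for this example from the recEnd value shapes mentioned earlier in the thread ("1", "NA::a", "Func::bmi_cat_fun"); it is not the regex in the YAML files.

```python
import re

# Hypothetical pattern: an assumption for illustration, not the regex in the YAML files.
REC_END = re.compile(r"(?:\d+|NA::[a-z]|Func::[A-Za-z_][A-Za-z0-9_.]*)")

# Each case pairs an input with whether the pattern should accept it.
CASES = [
    ("1", True),
    ("NA::a", True),
    ("NA::b", True),
    ("Func::bmi_cat_fun", True),
    ("NA::", False),
    ("Func::", False),
    ("func::bmi_cat_fun", False),
]


def run_cases(pattern, cases):
    """Return the cases whose match result differs from the expectation."""
    return [(value, expected) for value, expected in cases
            if (pattern.fullmatch(value) is not None) != expected]
```

`run_cases(REC_END, CASES)` returns an empty list when every case behaves as expected. The sketch uses `fullmatch`, so the whole value must match. Whether a check matches the whole value or only part of it is one possible source of differing results, so the test fixes that choice explicitly.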
This PR introduces formal YAML metadata schemas that define our core CSV files' structure, validation rules, and documentation. The discussion of these metadata files arose from @StaceyFisher's note at our scope planning meeting, and also because we've been having too many CSV git conflicts in cchsflow, which are difficult to manage. Beyond that, these schemas are the foundation for data quality, version control optimization, and future interoperability.
This metadata is helpful to discuss early in our recodeflow redevelopment because the metadata can form the basis of code refactoring. Currently, some parts of the YAML are encoded as environment variables.
In this PR
variables.yaml and variable_details.yaml
The yamls
Project-specific sections: CCHS metadata with database patterns, enums, and NA categories
Key features of the YAMLs
Schema highlights
Most of the schema details describe how we are currently implementing the variables.csv and variable_details.csv, but many parts are not in our documentation and certainly not machine-actionable.
Implementation Notes