Added genetic diversity fields - Fixes #1610 #1611

arschat · 2025-02-21T13:38:26Z

Release notes

#1610

For human_specific.json schema:

ethnicity_question
ethnicity_parents
primary_language
mother_father_language
current_residence
place_of_birth

For residence.json schema:

country
granular
duration
area_type

For medical_history.json schema:

diet_meat_consumption
reproduction_history

For reproduction_history.json schema:

menarche_age
menopause_status
parity
gravidity

Reviews requested

Need 4 Reviewers to approve because this is a major update

idazucchi

Nice job! It's well organised. I've left some comments so we can discuss a few things.

json_schema/module/biomaterial/human_specific.json

hannes-ucsc

Which project/dataset is this for?

arschat · 2025-05-23T08:50:54Z

Hi @hannes-ucsc this PR is not for a specific project/dataset.

These are the metadata fields recommended by the HCA Genetic Diversity Taskforce to record genetic, geographic and, more generally, human diversity in HCA studies.

Do you have any specific concerns?

hannes-ucsc · 2025-05-23T17:57:22Z

I may be out of the loop, but if this isn't going to be used for any actual projects, why is it being added to the schema?

I am worried that this isn't sufficiently modular, which risks turning the human-specific and medical history modules into a kitchen sink of fields, i.e. a flat, unstructured list of fields, some related and some not. This will make comprehending the schema (and the JSON documents compliant with it) increasingly difficult. For example, menarche_age, menopause_status, parity and gravidity are all clearly related but they are not encapsulated in a module. Another example is the place_of_birth_… fields. The fact that their names all share a prefix is somewhat of a design smell, indicating that they, too, want to be encapsulated. The ethnicity-related fields added here appear to relate to a questionnaire of some sort, similarly suggesting that they should be encapsulated in a module.
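As a rough illustration of the encapsulation described above, the following Python sketch moves the four related reproduction fields out of a flat record and into a nested object. The field names and the `reproduction_history` module name appear in this PR; the `encapsulate` helper, the sample record and its values are hypothetical and only meant to show the shape of the change, not the schema's actual layout.

```python
from typing import Any

# Field names taken from this PR; the grouping helper itself is illustrative.
REPRODUCTION_FIELDS = ("menarche_age", "menopause_status", "parity", "gravidity")


def encapsulate(record: dict[str, Any], module: str, fields: tuple[str, ...]) -> dict[str, Any]:
    """Return a copy of `record` with `fields` moved under a nested `module` object."""
    nested = {name: record[name] for name in fields if name in record}
    flat = {name: value for name, value in record.items() if name not in fields}
    if nested:
        flat[module] = nested
    return flat


# Hypothetical flat record; the values are made up for the example.
flat_record = {
    "diet_meat_consumption": "no",
    "menarche_age": 13,
    "parity": 2,
    "gravidity": 3,
}

nested_record = encapsulate(flat_record, "reproduction_history", REPRODUCTION_FIELDS)
print(nested_record)
```

With the related fields grouped, a reader of the document can tell at a glance which values belong together, and the parent module keeps a short list of top-level properties.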

arschat · 2025-05-27T09:26:24Z

Thank you for your comment Hannes.

Bionetwork coordinators have requested around 250 fields of Tier 2 bionetwork-specific metadata that do not currently exist in our schema. Since the Tier 2 collection has not officially started yet, we cannot be sure how frequently all of those fields will be filled. We've shared our concerns with bionetwork coordinators about the feasibility of the metadata collection, but we trust their confidence in the collection.

Regarding modularity: there are indeed some fields that could be clustered together, but we chose this modelling to avoid an extensive "module-in-a-module" structure, since we've avoided this until now (with the exception of ontology modules). I am happy to encapsulate similar fields in modules, either inside the human_specific/medical_history modules or as separate modules in donor_organism, if you prefer this modelling.

hannes-ucsc · 2025-05-27T16:50:27Z

Bionetwork coordinators have requested around 250 fields of Tier 2 bionetwork-specific metadata that do not exist

If I understand you correctly, this PR is just the first slate of a series of fairly involved changes. Since it is extremely difficult to fix the schema once metadata using it has been released, it is very important that we get this right from the beginning. Is there any substantive documentation about this effort that you could share?

avoid extensive "module-in-a-module" structure since we've avoided this until now

Given the sheer number of new fields you cite, lack of modularity is a serious concern. I don't see why nesting modules is problematic. Hierarchical structures are commonplace in computer science—in biology, too, for that matter—and have proven to be a useful modeling approach.

arschat · 2025-06-18T10:48:39Z

Hi @hannes-ucsc, I understand your concern about whether these definitions are going to be adopted by contributors in this way or not.
However, this PR is about the additional Genetic Diversity fields that have been suggested by the Human Cell Atlas Genetic Diversity Taskforce, have been accepted by the HCA Organization Committee and are going to be requested across all bionetworks' Tier 2 metadata. Thus, they are unlikely to change.

Given your feedback on modularity, I refactored some fields in a more modular way. Let me know if this works better for you. I could split into different PRs for each module but please let me know your comments here before we move into new PRs.

hannes-ucsc · 2025-07-08T18:08:52Z

json_schema/module/biomaterial/residence.json

+            "pattern": "^[0-9]{1,}.[0-9]{1,}.[0-9]{1,}$",
+            "example": "4.6.1"
+        },
+        "country": {


Some countries don't call the first level subdivision "state" and some countries are known under various different names, in addition to the fact that most countries have a name in their native language and English.

Taking inspiration from https://schema.org/PostalAddress, this should be split into country (as a two letter ISO code) and region.

This is an interesting suggestion, however ISO country codes are not commonly used in biological research so it would be an added burden on who is preparing the metadata.
Is there an easy way to validate the iso codes within our schema, other than manually adding all the codes as an enum? I think it's worth using the ISO standard only if it allows us to easily validate the country names, which we can't do with the current schema.

however ISO country codes are not commonly used in biological research …

I don't claim to be an authority on biomedical research but I find that hard to believe. Would you be able to provide evidence that supports your claim? A quick Google search immediately brings up anecdotal evidence that appears to contradict it. For example, NIH's NLM style guide recommends using country codes, the BCO ontology has a property for it, and Our World In Data tracks COVID across the globe using country codes, albeit in a different, now discouraged three-letter form.

… so it would be an added burden on who is preparing the metadata

To confirm, the burden you are referring to is having to Google "iso country code France"? Should we not also consider the burden on the user who has to consolidate various ambiguous, misspelled or localized country names (United Kingdom/Great Britain, Italy/Italia and so on)?

Is there an easy way to validate the iso codes within our schema.

Of course. All that's needed is an enum of the country codes. OCDS, for example, maintains one: https://github.com/open-contracting/standard/blob/68db85199bea30e22d4d936abcb7564d213c82e5/schema/release-schema.json#L1963
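A minimal sketch of what enum-based validation amounts to, written in plain Python rather than with a JSON Schema validator library. The `COUNTRY_PROPERTY` fragment and the `is_valid_country` helper are hypothetical, and the enum is deliberately abbreviated to five codes; a real schema would list every ISO 3166-1 alpha-2 code.

```python
# Abbreviated, illustrative subset; a real schema would list every
# ISO 3166-1 alpha-2 code in the enum.
COUNTRY_PROPERTY = {
    "type": "string",
    "enum": ["DE", "FR", "GB", "IT", "US"],
}


def is_valid_country(value: object, prop: dict = COUNTRY_PROPERTY) -> bool:
    """Mimic JSON Schema `type: string` plus `enum` for a single value."""
    return isinstance(value, str) and value in prop["enum"]


print(is_valid_country("FR"))      # True
print(is_valid_country("France"))  # False: free-text names are rejected
print(is_valid_country("fr"))      # False: enum matching is case-sensitive
```

Because the check is a simple membership test, misspelled or localized names fail validation at submission time instead of having to be consolidated later by data consumers.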

Hi @hannes-ucsc, INSDC uses a full-name country list. ENA and BioSamples use this list to record the name of the geographic sampling location. Do you think we could use this list here?

You'd maintain a copy of the list as an enum as part of the schema?

Yes, we can use the enum list that ENA uses after removing the "missing"-related and NA values at the end.
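One way the filtering step could look, as a hedged Python sketch. The `is_placeholder` rule and the `sample_terms` excerpt are assumptions made for illustration; they are not copied from the INSDC list, whose exact placeholder terms would need to be checked before adopting a rule like this.

```python
def is_placeholder(term: str) -> bool:
    """Treat 'missing...' and NA-style entries as non-countries (assumed rule)."""
    lowered = term.strip().lower()
    return lowered.startswith("missing") or lowered in {"na", "not applicable"}


def country_enum(controlled_vocabulary: list[str]) -> list[str]:
    """Keep only real country names, preserving the original order."""
    return [term for term in controlled_vocabulary if not is_placeholder(term)]


# Hypothetical excerpt of a controlled vocabulary, for illustration only.
sample_terms = [
    "France",
    "Italy",
    "United Kingdom",
    "missing",
    "missing: control sample",
    "not applicable",
]
print(country_enum(sample_terms))
```

The filtered list can then be copied into the schema as the `enum` for the country property and regenerated whenever the upstream list changes.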

hannes-ucsc · 2025-07-08T18:09:10Z

json_schema/module/biomaterial/residence.json

+            "guidelines": "Enter the country or state and country if available.",
+            "bionetworks": ["genetic diversity"]
+        },
+        "granular_location": {


Suggested change

-        "granular_location": {
+        "locality": {

json_schema/module/biomaterial/reproduction_history.json

hannes-ucsc · 2025-07-08T18:13:02Z

json_schema/module/biomaterial/human_specific.json

+            "example": "What is your ethnicity?; Are you Hispanic/Latino?; Which categories describe you? Select all that apply. Note You may select more than one group. 1. American Indian or Alaska Native (for example, Aztec, Blackfeet Tribe, Mayan, Navajo Nation, Native Village of Barrow (Utqiagvik) Inupiat Traditional Government, Nome Eskimo Community, etc.), 2 - Asian (for example, Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, etc.), 3 - Black, African American, or African (for example, African American, Ethiopian, Haitian, Jamaican, Nigerian, Somali, etc.), 4 - Hispanic, Latino, or Spanish (for example, Columbian, Cuban, Dominican, Mexican or Mexican American, Puerto Rican, Salvadoran, etc.), 5 - Middle Eastern or North African (for example, Algerian, Egyptian, Iranian, Lebanese, Moroccan, Syrian, etc.), 6 - Native Hawaiian or other Pacific Islander (for example, Chamorro, Fijian, Marshallese, Native Hawaiian, Tongan, etc.), 7 - White (for example, English, European, French, German, Irish, Italian, Polish, etc.), 8 - None of these fully describe me (optional free text answer), 9 - Prefer not to answer",
+            "bionetworks": ["genetic diversity"]
+        },
+        "ethnicity_parents": {


Suggested change

-        "ethnicity_parents": {
+        "parental_ethnicities": {

Just a suggestion

Or, in line with my comment below

Suggested change

-        "ethnicity_parents": {
+        "ethnicity_of_parents": {

hannes-ucsc · 2025-07-08T18:13:29Z

json_schema/module/biomaterial/human_specific.json

+            "example": "Mandarin Chinese; Hokkien; Bahasa Melayu",
+            "bionetworks": ["genetic diversity"]
+        },
+        "mother_father_language": {


Suggested change

-        "mother_father_language": {
+        "first_language": {

According to the Taskforce guidelines, this is about the language that the donor's parents speak, not necessarily the first language the donor spoke.
Maybe we could revert to parents_language to distinguish between "first language" and "language of mother".

In this PR, this field is currently documented as

Ancestral language(s), spoken by parents (“mother tongue” and / or “father tongue”) and / or grandparents. Can include dialects (for example, Hokkien).

"Mother tongue" typically means one's native language, not the language spoken by one's mother. To avoid confusion, I would prefer we avoid the terms "mother/father tongue" completely. "Ancestral language" usually means the opposite of modern language. I would also avoid that.

As for the name of the property, I propose that we establish a convention for naming properties of immediate ancestors (parents) or ancestors in general. Prefixing the property name with parental_ or ancestral_ comes to mind but then the aforementioned ambiguity arises. Since it is not possible to include an apostrophe in a property name, I wouldn't use "parents_" as that could be the plural "parents" or a possessive "parent's".

My preferred choice would be the suffixes "_of_parents" and "_of_ancestors", applied to all such properties across this PR. I would include great-grandparents in this property as it seems arbitrary to draw the line above the grandparents.
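The proposed convention could be checked mechanically. The sketch below is a hypothetical Python helper, not part of the schema tooling: it accepts names ending in "_of_parents" or "_of_ancestors" and flags names that use the ambiguous "parents_", "parental_" or "ancestral_" prefixes or a bare "_parents" suffix. The regular expressions are an assumption about how the convention might be encoded.

```python
import re

# Assumed encoding of the convention discussed above.
PREFERRED_SUFFIX = re.compile(r"_of_(parents|ancestors)$")
DISCOURAGED = re.compile(r"^(parents|parental|ancestral)_|(?<!_of)_parents$")


def check_name(name: str) -> str:
    """Classify a property name as 'ok', 'rename' or 'unflagged'."""
    if PREFERRED_SUFFIX.search(name):
        return "ok"
    if DISCOURAGED.search(name):
        return "rename"
    return "unflagged"


for candidate in ("ethnicity_of_parents", "ethnicity_parents",
                  "parents_language", "parental_ethnicities", "primary_language"):
    print(candidate, "->", check_name(candidate))
```

A name such as mother_father_language is left unflagged by this heuristic, so a check like this complements review rather than replacing it.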

json_schema/module/biomaterial/human_specific.json

arschat added 15 commits February 20, 2025 22:03

Added ethnicity genetic diversity fields

6640701

Added language genetic diversity fields

1615c72

Added residence and place of birth genetic diversity fields

7dd118b

Added dietary state fields

f75abbd

Added reproduction genetic diversity fields

d56461e

Removed trailing whitespaces

3b37ae4

Ran human_readable_json.py script

7049d18

Updated update_log.csv

836c795

Added diet_meat in place of dietary_state

078808e

Merge branch 'staging' of github.com:HumanCellAtlas/metadata-schema into ac-genetic-diversity-Issue1610

af5cc71

Merge branch 'staging' of github.com:HumanCellAtlas/metadata-schema into ac-genetic-diversity-Issue1610

b47e9d0

Fixed diet_meat_consumption field name

697794d

Added dependency for ethnicity_question field.

9e47276

Added ancestry genetic fields.

0773d49

Replaced special characters in ethnicity_question.

e6eae2d

idazucchi reviewed May 13, 2025

arschat added 3 commits May 13, 2025 17:41

Removed ancestry_genetic fields to be added in Liver

ad34acf

Updated place_of_birth_duration definition

19c68b3

Removed ancestry_genetic dependecies.

e359985

arschat requested a review from amnonkhen May 14, 2025 13:53

arschat assigned NoopDog and hannes-ucsc and unassigned NoopDog and hannes-ucsc May 14, 2025

arschat requested review from NoopDog, hannes-ucsc and ncalvanese1 May 14, 2025 13:53

HumanCellAtlas deleted a comment from idazucchi May 14, 2025

Updated examples in genetic diversity values

e25686d

ncalvanese1 approved these changes May 14, 2025

NoopDog approved these changes May 14, 2025

amnonkhen approved these changes May 22, 2025

hannes-ucsc reviewed May 22, 2025

arschat added 6 commits June 2, 2025 11:38

Replaced special character in diet description

a16c5e4

Added residence module in human_specific

32d999a

Updated update_log.csv

6717c3c

Added reproduction_history module

dc1ef4e

Removed unnecessary space

a9056e0

Fixed names of new modules

e8f4a70

Added user friendly name for reproduction_history

d9bcedc

arschat requested review from NoopDog, amnonkhen, hannes-ucsc and ncalvanese1 June 24, 2025 15:01

hannes-ucsc requested changes Jul 8, 2025

Rename reproductive_history module

e60820b

NoopDog approved these changes Aug 15, 2025

arschat added 2 commits October 17, 2025 20:56

Added INSDC country enum and region field

5f4b27f

Renamed fields to ethnicity_of_parents and to language_of_family in human_specific.json

cbca404

6 participants
