-
Notifications
You must be signed in to change notification settings - Fork 33
Added genetic diversity fields - Fixes #1610 #1611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: staging
Are you sure you want to change the base?
Conversation
…nto ac-genetic-diversity-Issue1610
…nto ac-genetic-diversity-Issue1610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job! It's well organised, I've left some comments so we can discuss a few things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which project/dataset is this for?
Hi @hannes-ucsc this PR is not for a specific project/dataset. These are the recommendation metadata fields from the HCA Genetic Diversity TaskForce to record the genetic, geographic and in generally human diversity in the HCA studies. Do you have any specific concerns? |
I may be out of the loop, but if this isn't going to be used for any actual projects, why is it being added to the schema? I am worried that this isn't sufficiently modular, risking for human-specific and medical history modules to become a kitchen sink of fields, i.e. a flat, unstructured list of fields, some related and some not. This will make comprehending the schema (and the JSON documents compliant with it) increasingly difficult. For example, |
Thank you for your comment Hannes. Bionetwork coordinators have requested around 250 fields of Tier 2 bionetwork-specific metadata that do not exist currently into our schema. Since the Tier 2 collection has not been officially started yet, we cannot be sure of how frequent all of those fields will be filled. We've shared our concerns with bionetwork coordinators on the feasibility of the metadata collection, but we trust their confidence on the collection. Regarding the modularity. There are indeed some fields that could be clustered together but we choose this modelling to avoid extensive "module-in-a-module" structure since we've avoided this until now (with the exception of ontology modules). I am happy to encapsulate similar fields in modules either inside |
If I understand you correctly, this PR is just the first slate of a series of fairly involved changes. Since it is extremely difficult to fix the schema once metadata using it has been released, it is very important that we get this right from the beginning. Is there any substantive documentation about this effort that you could share?
Given the sheer number of new fields you cite, lack of modularity is a serious concern. I don't see why nesting modules is problematic. Hierarchical structures are commonplace in computer science—in biology, too, for that matter—and have proven to be a useful modeling approach. |
Hi @hannes-ucsc I understand your concerns whether this definitions are going to be adopted by contributors in this way or not. Given your feedback on modularity, I refactored some fields in a more modular way. Let me know if this works better for you. I could split into different PRs for each module but please let me know your comments here before we move into new PRs. |
"pattern": "^[0-9]{1,}.[0-9]{1,}.[0-9]{1,}$", | ||
"example": "4.6.1" | ||
}, | ||
"country": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some countries don't call the first level subdivision "state" and some countries are known under various different names, in addition to the fact that most countries have a name in their native language and English.
Taking inspiration from https://schema.org/PostalAddress, this should be split into country (as a two letter ISO code) and region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting suggestion, however ISO country codes are not commonly used in biological research so it would be an added burden on who is preparing the metadata.
Is there an easy way to validate the iso codes within our schema, other than adding manually all the codes as an enum? I think it's worth using the ISO standard only if it can allow us to easily validate the country names, which we can't do with the current schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
however ISO country codes are not commonly used in biological research …
I don't claim to be an authority on biomedical research but I find that hard to believe. Would you be able provide evidence that supports your claim? A quick Google search immediately brings up anecdotal evidence that appears to contradict it. For example, NIH's NLM style guide recommends using country codes, the BCO ontology has a property for it, and Our World In Data tracks COVID across the globe using three-letter country codes, albeit in a different, now discouraged three-letter form.
… so it would be an added burden on who is preparing the metadata
To confirm, the burden you are referring to is having to Google "iso country code France"? Should we not also consider the burden on the user who has to consolidate various ambiguous, misspelled or localized country names (United Kingdom/Great Britain, Italy/Italia and so on)?
Is there an easy way to validate the iso codes within our schema.
Of course. All that's needed is an enum of the country codes. OCDS for example, maintains one: https://github.com/open-contracting/standard/blob/68db85199bea30e22d4d936abcb7564d213c82e5/schema/release-schema.json#L1963
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @hannes-ucsc , INSDC uses a full name country list. ENA and BioSamples use this list to record the name of geographic sampling location. Do you think we could use this list here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd maintain a copy of the list as an enum
as part of the schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can use the enum list that ENA uses after removing the missing related & NA values in the end.
"guidelines": "Enter the country or state and country if available.", | ||
"bionetworks": ["genetic diversity"] | ||
}, | ||
"granular_location": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"granular_location": { | |
"locality": { |
"example": "What is your ethnicity?; Are you Hispanic/Latino?; Which categories describe you? Select all that apply. Note You may select more than one group. 1. American Indian or Alaska Native (for example, Aztec, Blackfeet Tribe, Mayan, Navajo Nation, Native Village of Barrow (Utqiagvik) Inupiat Traditional Government, Nome Eskimo Community, etc.), 2 - Asian (for example, Asian Indian, Chinese, Filipino, Japanese, Korean, Vietnamese, etc.), 3 - Black, African American, or African (for example, African American, Ethiopian, Haitian, Jamaican, Nigerian, Somali, etc.), 4 - Hispanic, Latino, or Spanish (for example, Columbian, Cuban, Dominican, Mexican or Mexican American, Puerto Rican, Salvadoran, etc.), 5 - Middle Eastern or North African (for example, Algerian, Egyptian, Iranian, Lebanese, Moroccan, Syrian, etc.), 6 - Native Hawaiian or other Pacific Islander (for example, Chamorro, Fijian, Marshallese, Native Hawaiian, Tongan, etc.), 7 - White (for example, English, European, French, German, Irish, Italian, Polish, etc.), 8 - None of these fully describe me (optional free text answer), 9 - Prefer not to answer", | ||
"bionetworks": ["genetic diversity"] | ||
}, | ||
"ethnicity_parents": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ethnicity_parents": { | |
"parental_ethnicities": { |
Just a suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, in line with my comment below
"ethnicity_parents": { | |
"ethnicity_of_parents": { |
"example": "Mandarin Chinese; Hokkien; Bahasa Melayu", | ||
"bionetworks": ["genetic diversity"] | ||
}, | ||
"mother_father_language": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"mother_father_language": { | |
"first_language": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the Taskforce guidelines, this is about the language that the donor parents speak. Not necessarily the first language the donor spoke.
Maybe we could revert to parents_language
to distinguish between "first language" and "language of mother".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR, this field is currently documented as
Ancestral language(s), spoken by parents (“mother tongue” and / or “father tongue”) and / or grandparents. Can include dialects (for example, Hokkien).
"Mother tongue" typically means one's native language, not the language spoken by one's mother. To avoid confusion, I would prefer we avoid the terms "mother/father tongue" completely. "Ancestral language" usually means the opposite of modern language. I would also avoid that.
As for the name of the property, I propose that we establish a convention for naming properties of immediate ancestors (parents) or ancestors in general. Prefixing the property name with parental_
or ancestral_
comes to mind but then the aforementioned ambiguity arises. Since it is not possible to include an apostrophe in a property name, I wouldn't use "parents_" as that could be the plural "parents" or a possessive "parent's".
My preferred choice would be the suffix "_of_parents" and "_of_ancestors", applied to all such properties across this PR. I would include great grand parents in this property as it seems arbitrary to draw the line above the grand parents.
Release notes
#1610
For
human_specific.json
schema:For
residence.json
schema:For
medical_history.json
schema:For
reproduction_history.json
schema:Reviews requested