Phenotype data in practice

Overview

The goals of this document are:

  • Give a sense of the contexts in which phenotype data is produced (research and clinical)
  • Give a sense of the shape of different styles of phenotype data

Some examples of phenotype data

The interested reader should familiarise themselves with some of the following resources. They offer only a tiny glimpse into the world of phenotype data; the purpose of this list is to show how prevalent and diverse phenotype data is across the biomedical domain.

| Category | Example datasets | Example phenotype |
| --- | --- | --- |
| Gene to phenotype associations | Online Mendelian Inheritance in Man (OMIM), Human Phenotype Ontology (HPO) annotations, Gene Ontology (GO) | Achondroplasia (associated with FGFR3 gene mutations) |
| Gene to disease associations | The Cancer Genome Atlas (TCGA), Online Mendelian Inheritance in Man (OMIM), GWAS Catalog | Breast invasive carcinoma (associated with BRCA1/BRCA2 mutations) |
| Phenotype-phenotype semantic similarity | Human Phenotype Ontology (HPO), Monarch Initiative | Cardiac abnormalities (semantic similarity with congenital heart defects) |
| Quantified trait data (QTL etc) | NHGRI-EBI GWAS Catalog, Genotype-Tissue Expression (GTEx), The Human Protein Atlas | Height (quantified trait associated with SNPs in genomic regions) |
| Electronic health records | Medical Information Mart for Intensive Care III (MIMIC-III), UK Biobank, IBM Watson Health | Acute kidney injury (recorded diagnosis during ICU stay) |
| Epidemiological datasets | Framingham Heart Study, National Health and Nutrition Examination Survey (NHANES), Global Burden of Disease Study (GBD) | Cardiovascular disease (epidemiological study of risk factors and disease incidence) |
| Clinical trial datasets | ClinicalTrials.gov, European Union Clinical Trials Register (EUCTR), International Clinical Trials Registry Platform (ICTRP) | Treatment response (clinical trial data on efficacy and safety outcomes) |
| Environmental exposure datasets | Environmental Protection Agency Air Quality System (EPA AQS), Global Historical Climatology Network (GHCN), National Centers for Environmental Information Climate Data Online (NCEI CDO) | Respiratory diseases (association with air pollutant exposure) |
| Population surveys (e.g. UK Biobank) | UK Biobank, National Health Interview Survey (NHIS), National Health and Nutrition Examination Survey (NHANES) | Chronic diseases (population-based study on disease prevalence and risk factors) |
| Behavioral observation datasets | National Survey on Drug Use and Health (NSDUH), Add Health, British Cohort Study (BCS) | Substance abuse disorders (survey data on drug consumption and addiction) |

Different shapes of phenotype data

Phenotype data comes in many different shapes and forms. In the following, we will describe some of the most common features of such data:

Standardised/non-standardised

[Figure: Spectrum of semantics]

Phenotype data can be standardised to varying degrees. It is not uncommon for data to be completely unstandardised. Unfortunately, only a fraction of the available data is actually annotated using terms from controlled phenotype ontologies. Here are some of the more "typical" kinds of data on the standardised/non-standardised spectrum:

  1. Free text in clinical notes and scientific publications. For example, the paper "Rare genetic variants impact muscle strength" (Huang et al.) mentions phenotypic traits including muscle strength, hand grip strength, body size, body weight, BMI, and whole-body muscle mass.
  2. Free text in specific database fields (for example a "height" column in a table about measurements of giraffes)
  3. Controlled but non-standardised vocabulary like enums in a datamodel (for example the keyword "abnormal" in the ZFIN example below)
  4. Controlled standardised vocabulary (e.g. SNOMED CT)
  5. Controlled vocabulary terms with well-defined semantics (e.g. ontology terms from HP or MP)

Quantitative/qualitative

Qualitative and quantitative phenotype data represent two fundamental ways of describing characteristics or traits in biology, each providing different types of information:

Qualitative Phenotype Data describes qualities or characteristics that are observed but not measured with numbers. It often involves categorical or descriptive information.

  • Examples: The presence or absence of a specific physical trait (like eye color or wing shape in animals) or types of behavior (aggressive vs. passive).
  • Analysis: Qualitative data is analyzed by categorization and identification of patterns or variations. It is more about the 'type' or 'kind' of trait rather than its 'amount'.
  • Interpretation: Since it's descriptive, this data relies on subjective interpretation and classification.

Quantitative Phenotype Data is numerical and quantifies traits. It involves measurements of characteristics, often allowing for more precise and objective analysis.

  • Examples: Height, weight, blood pressure, cholesterol levels, or the number of fruits produced by a plant. Quantitative traits can often be measured on a continuous scale, for example a height of 35 cm, a weight of 67 kg or a blood pressure of 120/80 mmHg.
  • Analysis: It involves statistical analysis, such as calculating mean, median, standard deviation, and applying various statistical tests. It allows for a more objective and replicable assessment.
  • Interpretation: Quantitative data provides a more concrete and measurable understanding of traits, making comparisons and statistical testing more straightforward.
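The difference in how the two kinds of data are analysed can be illustrated with a small Python sketch. The observations below are invented purely for illustration, and the helper names `summarise_qualitative` and `summarise_quantitative` are hypothetical; the sketch only contrasts counting categories with computing descriptive statistics.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical observations, invented for illustration only.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]


def summarise_qualitative(values):
    """Qualitative data: count how often each category occurs."""
    return dict(Counter(values))


def summarise_quantitative(values):
    """Quantitative data: basic descriptive statistics of the measurements."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }


qualitative_summary = summarise_qualitative(eye_colours)
quantitative_summary = summarise_quantitative(heights_cm)

print(qualitative_summary)
print(quantitative_summary)
```

The qualitative summary can only say which categories occur and how often, whereas the quantitative summary supports comparisons of magnitude and spread.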

Pre-coordinated vs. post-coordinated

Pre-coordinated phenotype data is data where the various aspects of the phenotype term, such as the bearer ("retinal blood vessels") and the characteristic ("Attenuation", or "thinning/narrowing"), and the modifier (in the case of HPO terms, simply abnormal), are combined ("coordinated") into a single term, e.g. HP:0007843 "Attenuation of retinal blood vessels".

Pre-coordinated phenotype data is popular in the clinical domain, where many observations are made by a clinician and recorded as "phenotypic abnormalities" with the goal of eventual diagnosis. Phenopackets such as the one below are an emerging standard for capturing and sharing disease and phenotype information about patients. Phenotypic features are captured in phenopackets as pre-coordinated HPO terms.

[Figure: example Phenopacket]
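To give a rough idea of the shape of such a record, the following Python sketch builds a heavily simplified, unvalidated phenopacket-like structure that carries one pre-coordinated HPO term. The identifiers `example-phenopacket-1` and `patient-1` and the helper `feature_ids` are hypothetical. A real phenopacket also carries a metadata block and many optional fields that are omitted here, so the official Phenopacket Schema documentation should be treated as the authority on the exact structure.

```python
import json

# Simplified, unvalidated sketch of a phenopacket-like record.
# "example-phenopacket-1" and "patient-1" are hypothetical placeholders.
phenopacket = {
    "id": "example-phenopacket-1",
    "subject": {"id": "patient-1"},
    "phenotypicFeatures": [
        {
            # One pre-coordinated HPO term: bearer, characteristic and
            # modifier are all folded into a single identifier.
            "type": {
                "id": "HP:0007843",
                "label": "Attenuation of retinal blood vessels",
            }
        }
    ],
}


def feature_ids(packet):
    """Collect the ontology term IDs of all recorded phenotypic features."""
    return [f["type"]["id"] for f in packet.get("phenotypicFeatures", [])]


print(json.dumps(phenopacket, indent=2))
print(feature_ids(phenopacket))
```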

Apart from clinical diagnostics, pre-coordinated phenotype terms are used in many other contexts, such as model organism research (e.g. IMPC) or the curation of genome-wide association studies (GWAS). For example, IMPC can be searched using the pre-coordinated term "enlarged heart" to find knockout mice with this phenotype, and similarly the GWAS Catalog can be browsed by the pre-coordinated term "cardiac hypertrophy" to find gene associations with the phenotype.

Post-coordinated phenotype curation simply means that the different constituents of a phenotype (characteristic, bearer, modifier, etc.) are captured individually. This has certain advantages. For example, the phenotype space is enormous, as you can measure variations in many observable characteristics, from chemical entities present in the blood and the microbiome to a host of morphological and developmental abnormalities. Instead of having individual (controlled vocabulary) terms for "increased level of X", "decreased level of X", "abnormal level of X" and "increased level of X in blood" for each of the thousands of chemical compounds synthesized by the human body, you just have "increased level", "blood" and all the chemical compounds, and capture them separately.
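The size argument can be made concrete with a toy calculation. The vocabularies in this sketch are deliberately tiny and partly invented ("glucose", "cholesterol" and "urine" are illustrative assumptions, not taken from any dataset); the point is only that pre-coordination needs one term per combination, while post-coordination needs one term per building block.

```python
from itertools import product

# Tiny, partly hypothetical vocabularies for illustration.
modifiers = ["increased level", "decreased level", "abnormal level"]
compounds = ["lysine", "glucose", "cholesterol"]
locations = ["blood", "urine"]

# Pre-coordination: one dedicated term for every combination.
pre_coordinated_terms = [
    f"{modifier} of {compound} in {location}"
    for modifier, compound, location in product(modifiers, compounds, locations)
]

# Post-coordination: only the building blocks are needed as terms;
# combinations are expressed at curation time.
post_coordinated_vocabulary = modifiers + compounds + locations

print(len(pre_coordinated_terms))        # 3 * 3 * 2 = 18
print(len(post_coordinated_vocabulary))  # 3 + 3 + 2 = 8
```

The pre-coordinated count grows multiplicatively with each added vocabulary, the post-coordinated count only additively.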

There are at least three flavours (probably more) of post-coordinated phenotype curation prevalent in the biomedical domain (four if you count quantified phenotypes):

The trait + modifier pattern is used, for example, by databases such as the Saccharomyces Genome Database (SGD). Here are some examples:

| dateAssigned | evidence/publicationId | objectId | phenotypeStatement | phenotypeTermIdentifiers/0/termId | phenotypeTermIdentifiers/1/termId | conditionRelations/0/conditions/0/chemicalOntologyId | conditionRelations/0/conditions/0/conditionClassId |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-07-08T00:07:00-00:00 | PMID:1406694 | SGD:S000003901 | abnormal RNA accumulation | APO:0000002 | APO:0000224 | | |
| 2006-05-05T00:05:00-00:00 | PMID:785224 | SGD:S000000854 | decreased resistance to chemicals | APO:0000003 | APO:0000087 | CHEBI:78661 | ZECO:0000111 |
| 2010-07-07T00:07:00-00:00 | PMID:10545447 | SGD:S000000969 | decreased cell size | APO:0000003 | APO:0000052 | | |

  • APO:0000002 (abnormal) and APO:0000003 (decreased) are modifiers.
  • APO:0000087 (resistance to chemicals), APO:0000224 (RNA accumulation), APO:0000052 (cell size) are biological attributes/traits.
  • CHEBI:78661 (borrelidin) is recorded as an experimental condition, but should probably be interpreted as part of the bearer expression.
  • Note: SGD holds several kinds of phenotype data, so check carefully which kind a given record represents.

Info

Data was obtained from the Alliance of Genome Resources on the 30.03.2023 and simplified for illustration.
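As a minimal sketch of how the SGD rows above could be processed, the following Python code separates the modifier from the trait and rebuilds a readable statement. The identifiers and labels are copied from the rows and notes above; the record keys `object_id`, `term_ids` and `chemical`, and the helper names, are hypothetical simplifications of the longer column names.

```python
# Identifiers and labels copied from the example rows and notes above.
MODIFIERS = {"APO:0000002": "abnormal", "APO:0000003": "decreased"}
TRAITS = {
    "APO:0000224": "RNA accumulation",
    "APO:0000087": "resistance to chemicals",
    "APO:0000052": "cell size",
}

# Simplified records; the key names are hypothetical stand-ins for the
# longer column names in the table.
records = [
    {"object_id": "SGD:S000003901", "term_ids": ["APO:0000002", "APO:0000224"]},
    {
        "object_id": "SGD:S000000854",
        "term_ids": ["APO:0000003", "APO:0000087"],
        "chemical": "CHEBI:78661",
    },
    {"object_id": "SGD:S000000969", "term_ids": ["APO:0000003", "APO:0000052"]},
]


def split_terms(term_ids):
    """Return (modifier_id, trait_id); either is None if not recognised."""
    modifier = next((t for t in term_ids if t in MODIFIERS), None)
    trait = next((t for t in term_ids if t in TRAITS), None)
    return modifier, trait


def describe(record):
    """Rebuild a readable statement from the separately captured parts."""
    modifier, trait = split_terms(record["term_ids"])
    text = (
        f"{MODIFIERS.get(modifier, 'unknown modifier')} "
        f"{TRAITS.get(trait, 'unknown trait')}"
    )
    if "chemical" in record:
        text += f" (condition: {record['chemical']})"
    return text


statements = [describe(r) for r in records]
for statement in statements:
    print(statement)
```

Looking terms up by identifier rather than by column position keeps the sketch independent of the order in which the two term identifiers happen to be listed.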

The bearer-only pattern is used by many databases, such as FlyBase. In the data, we only find references to bearers, such as anatomical entities or biological processes. Phenotypic modifiers (abnormal, morphology, changed) are not stated explicitly; they are implicit in the definition of the dataset.

| dateAssigned | evidence/crossReference/id | evidence/publicationId | objectId | phenotypeStatement | phenotypeTermIdentifiers/0/termId |
| --- | --- | --- | --- | --- | --- |
| 2024-01-05T11:54:24-05:00 | FB:FBrf0052655 | PMID:2385293 | FB:FBal0016988 | embryonic telson | FBbt:00000184 |
| 2024-01-05T11:54:24-05:00 | FB:FBrf0058077 | PMID:8223248 | FB:FBal0001571 | larva | FBbt:00001727 |

Info

Data was obtained from the Alliance of Genome Resources on the 30.03.2023 and simplified for illustration.
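When bearer-only data is integrated with other sources, the implicit modifier usually has to be made explicit. The sketch below shows one way this could look, assuming that "abnormal" is the appropriate dataset-level default; that default, the class `ExplicitPhenotype`, the function `make_explicit` and the simplified row keys are all illustrative assumptions rather than anything prescribed by FlyBase or the Alliance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExplicitPhenotype:
    """A bearer-only record with its implicit modifier spelled out."""
    subject_id: str
    bearer_id: str
    bearer_label: str
    modifier: str


# Assumption: the dataset as a whole records abnormal phenotypes.
DATASET_DEFAULT_MODIFIER = "abnormal"

# Values copied from the two example rows above; key names are simplified.
rows = [
    {"object_id": "FB:FBal0016988", "label": "embryonic telson", "term_id": "FBbt:00000184"},
    {"object_id": "FB:FBal0001571", "label": "larva", "term_id": "FBbt:00001727"},
]


def make_explicit(row, modifier=DATASET_DEFAULT_MODIFIER):
    """Attach the dataset-level modifier to a bearer-only row."""
    return ExplicitPhenotype(
        subject_id=row["object_id"],
        bearer_id=row["term_id"],
        bearer_label=row["label"],
        modifier=modifier,
    )


explicit = [make_explicit(row) for row in rows]
for phenotype in explicit:
    print(f"{phenotype.subject_id}: {phenotype.modifier} {phenotype.bearer_label}")
```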

The most complex pattern for phenotype descriptions, which essentially decomposes the entire phenotype expression into its atomic constituents, can be found, for example, in the Zebrafish Information Network (ZFIN).

Examples:

| Fish ID | Affected Structure or Process 1 subterm ID | Affected Structure or Process 1 subterm Name | Post-composed Relationship ID | Post-composed Relationship Name | Affected Structure or Process 1 superterm ID | Affected Structure or Process 1 superterm Name | Phenotype Keyword ID | Phenotype Keyword Name | Phenotype Tag | Affected Structure or Process 2 subterm ID | Affected Structure or Process 2 subterm name | Post-composed Relationship (rel) ID | Post-composed Relationship (rel) Name | Affected Structure or Process 2 superterm ID | Affected Structure or Process 2 superterm name | Publication ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZDB-FISH-150901-29105 | ZFA:0009366 | hair cell | BFO:0000050 | part_of | ZFA:0000051 | otic vesicle | PATO:0000374 | increased distance | abnormal | ZFA:0009366 | hair cell | BFO:0000050 | part_of | ZFA:0000051 | otic vesicle | ZDB-PUB-171025-12 |
| ZDB-FISH-150901-29105 | ZFA:0009366 | hair cell | BFO:0000050 | part_of | ZFA:0000051 | otic vesicle | PATO:0000374 | increased distance | abnormal | ZFA:0009366 | hair cell | BFO:0000050 | part_of | ZFA:0000051 | otic vesicle | ZDB-PUB-171025-12 |
| ZDB-FISH-150901-11537 | | | | | ZFA:0000051 | otic vesicle | PATO:0001905 | has normal numbers of parts of type | normal | ZFA:0009366 | hair cell | BFO:0000050 | part_of | ZFA:0000051 | otic vesicle | ZDB-PUB-150318-1 |
| ZDB-FISH-150901-18770 | | | | | ZFA:0000119 | retinal inner nuclear layer | PATO:0002001 | has fewer parts of type | abnormal | ZFA:0009315 | horizontal cell | BFO:0000050 | part_of | ZFA:0000119 | retinal inner nuclear layer | ZDB-PUB-130222-28 |
| ZDB-FISH-190806-7 | BSPO:0000084 | ventral region | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | PATO:0002001 | has fewer parts of type | abnormal | ZFA:0009301 | dopaminergic neuron | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | ZDB-PUB-190216-5 |
| ZDB-FISH-190807-7 | BSPO:0000084 | ventral region | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | PATO:0001905 | has normal numbers of parts of type | normal | ZFA:0009301 | dopaminergic neuron | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | ZDB-PUB-190216-5 |
| ZDB-FISH-190807-8 | BSPO:0000084 | ventral region | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | PATO:0002001 | has fewer parts of type | abnormal | ZFA:0009301 | dopaminergic neuron | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | ZDB-PUB-190216-5 |
| ZDB-FISH-150901-29105 | | | | | ZFA:0000101 | diencephalon | PATO:0001555 | has number of | normal | ZFA:0009301 | dopaminergic neuron | BFO:0000050 | part_of | ZFA:0000101 | diencephalon | ZDB-PUB-161120-7 |
| ZDB-FISH-210421-9 | ZFA:0009290 | glutamatergic neuron | BFO:0000050 | part_of | ZFA:0000008 | brain | PATO:0040043 | increased proportionality to | abnormal | ZFA:0009276 | GABAergic neuron | BFO:0000050 | part_of | ZFA:0000008 | brain | ZDB-PUB-191011-2 |
| ZDB-FISH-210421-9 | ZFA:0009290 | glutamatergic neuron | BFO:0000050 | part_of | ZFA:0000008 | brain | PATO:0040043 | increased proportionality to | abnormal | ZFA:0009276 | GABAergic neuron | BFO:0000050 | part_of | ZFA:0000008 | brain | ZDB-PUB-191011-2 |

Let's break down the second-to-last row:

  • ZFA:0009290 (glutamatergic neuron): The primary entity whose characteristic is being observed
  • BFO:0000050 (part of): a relation used to connect the primary entity to the structure it is part of
  • ZFA:0000008 (brain): the location of the primary entity being observed
  • PATO:0040043 (increased proportionality to): the modified characteristic being observed
  • abnormal: the change modifier (note: not an ontology term)
  • ZFA:0009276 (GABAergic neuron): the secondary entity being observed in relation to which the characteristic is measured
  • ZFA:0000008 (brain): the location of the secondary entity

Example: brain increased proportionality to glutamatergic neuron GABAergic neuron brain, abnormal

The interested reader may look at an integrated version of that large post-coordinated expression: ZP:0141834 (brain increased proportionality to glutamatergic neuron GABAergic neuron brain, abnormal).
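The breakdown above maps naturally onto a small data structure. The following Python sketch is one possible representation, not ZFIN's own data model: the class names `Entity` and `PostComposedPhenotype`, their fields and the rendering format are hypothetical, while the identifiers and labels are taken from the row discussed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """An affected structure: a superterm, optionally refined by a subterm."""
    superterm_id: str
    superterm_label: str
    subterm_id: Optional[str] = None
    subterm_label: Optional[str] = None
    relation_label: Optional[str] = None

    def render(self) -> str:
        if self.subterm_label is None:
            return self.superterm_label
        return f"{self.subterm_label} {self.relation_label} {self.superterm_label}"


@dataclass(frozen=True)
class PostComposedPhenotype:
    """Primary entity, characteristic, tag and optional secondary entity."""
    entity_1: Entity
    quality_id: str
    quality_label: str
    tag: str  # e.g. "abnormal" or "normal"; a keyword, not an ontology term
    entity_2: Optional[Entity] = None

    def render(self) -> str:
        parts = [self.entity_1.render(), self.quality_label]
        if self.entity_2 is not None:
            parts.append(self.entity_2.render())
        return " | ".join(parts) + f" ({self.tag})"


glutamatergic = Entity(
    "ZFA:0000008", "brain", "ZFA:0009290", "glutamatergic neuron", "part_of"
)
gabaergic = Entity(
    "ZFA:0000008", "brain", "ZFA:0009276", "GABAergic neuron", "part_of"
)
statement = PostComposedPhenotype(
    entity_1=glutamatergic,
    quality_id="PATO:0040043",
    quality_label="increased proportionality to",
    tag="abnormal",
    entity_2=gabaergic,
)

print(statement.render())
```

Keeping the subterm, relation and secondary entity optional lets the same structure hold the simpler rows of the table, where only a superterm is given for the first entity.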

Info

Data was obtained from ZFIN (Phenotype of Zebrafish Genes) on the 30.03.2023 and simplified for illustration.

As one can see in the last example, bearers can be anything from simple atomic entities to arbitrarily complex compositions:

  • "lysine" (lysine)
  • "lysine in the blood" (lysine part_of blood)
  • "lysine in heart muscle cells" (lysine part_of (cell part_of (muscle part_of heart)))
  • "lysine in the cytoplasm of heart muscle cells" (lysine part_of (cytoplasm part_of (cell part_of (muscle part_of heart))))
  • etc.
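Because a bearer can itself contain another bearer, a recursive structure is a natural fit. The sketch below is a minimal illustration with a hypothetical `PartOf` class and `render` function; it handles only the part_of relation used in the examples above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PartOf:
    """A composed bearer: `part` located in `whole`."""
    part: "Bearer"
    whole: "Bearer"


# A bearer is either an atomic entity (a plain label) or a composition.
Bearer = Union[str, PartOf]


def render(bearer: Bearer, top_level: bool = True) -> str:
    """Render a bearer expression, parenthesising nested compositions."""
    if isinstance(bearer, str):
        return bearer
    text = f"{render(bearer.part, False)} part_of {render(bearer.whole, False)}"
    return text if top_level else f"({text})"


heart_muscle = PartOf("muscle", "heart")
heart_muscle_cell = PartOf("cell", heart_muscle)
cytoplasm = PartOf("cytoplasm", heart_muscle_cell)

print(render("lysine"))
print(render(PartOf("lysine", "blood")))
print(render(PartOf("lysine", heart_muscle_cell)))
print(render(PartOf("lysine", cytoplasm)))
```

Each additional level of anatomical context is just one more `PartOf` wrapper, which is why post-coordinated bearers can become arbitrarily deep.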