Name
ClinGen Linked Data Hub: a scalable infrastructure to support variant pathogenicity assessment by modeling and linking diverse types of evidence about variants using the GA4GH Variant Annotation standards
Description

Variant pathogenicity assessment requires evaluation of numerous data types, such as functional, population, case-level, and in-silico predictor data. The extreme diversity, volume, and rapid pace of updates of this information call for data aggregation strategies that go beyond centralized data warehousing. By employing a decentralized strategy based on Linked Data principles, the ClinGen Linked Data Hub (LDH, https://ldh.clinicalgenome.org/) aggregates diverse structured data from distributed sources. This data linking strategy ensures that curators of genes or variants have access to current variant data, and it provides the means and incentives for data contributors to link their databases into ClinGen via the LDH. In addition to collating information about variants, LDH is extensible and can provide efficient access to collated information about any subject. LDH stores a variety of linked “data entities”, each of which consists of one or more of the following:

Excerpts of information with attributions to external data sources

Links to more extensive information at the external data sources

LDH is being developed as a driver project for GA4GH, with an intent to implement the Variant Annotation (VA) standards for several types of variant information, such as molecular consequences, experimental functional impact, population frequency, as well as variant pathogenicity assessments. Select data and links are stored in LDH as “excerpts” indicating that specific information (i.e. “linked data”) about the variant (i.e. the “subject”) was provided by a specific source at a specific time (“provenance”). The goal is to adopt a minimum viable model for these “excerpts” using the GA4GH VA standards, which will include sufficient information about the entity for the curators to assess its utility in pathogenicity classification.
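
The "excerpt" described above (a subject, the linked data, and its provenance) can be sketched as a small data structure. This is a minimal illustration only, not the LDH schema or the GA4GH VA model: the class name, field names, the `latest_per_source` helper, the placeholder variant identifier, and the example URL are all assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Tuple


@dataclass
class Excerpt:
    """Hypothetical record: what a source said about a subject, and when."""
    subject: str                 # variant identifier (placeholder values below)
    evidence_type: str           # e.g. "population_frequency" (assumed label)
    linked_data: Dict[str, Any]  # the excerpted information itself
    source: str                  # provenance: who provided the information
    source_url: str              # link to fuller information at the source
    retrieved_at: datetime       # provenance: when it was provided


def latest_per_source(
    excerpts: Iterable[Excerpt],
) -> Dict[Tuple[str, str, str], Excerpt]:
    """Keep only the most recent excerpt per (subject, source, evidence type)."""
    latest: Dict[Tuple[str, str, str], Excerpt] = {}
    for ex in excerpts:
        key = (ex.subject, ex.source, ex.evidence_type)
        if key not in latest or ex.retrieved_at > latest[key].retrieved_at:
            latest[key] = ex
    return latest


# Two time-stamped excerpts about the same (made-up) variant from one source.
older = Excerpt(
    subject="VARIANT-PLACEHOLDER-1",
    evidence_type="population_frequency",
    linked_data={"allele_frequency": 0.0001},  # illustrative value, not real data
    source="example-source",
    source_url="https://example.org/variant/1",
    retrieved_at=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
newer = Excerpt(
    subject="VARIANT-PLACEHOLDER-1",
    evidence_type="population_frequency",
    linked_data={"allele_frequency": 0.0002},  # illustrative value, not real data
    source="example-source",
    source_url="https://example.org/variant/1",
    retrieved_at=datetime(2021, 1, 1, tzinfo=timezone.utc),
)

current = latest_per_source([older, newer])
```

Both excerpts are retained as input, so the older record remains available as a permanent reference, while `current` exposes only the newest view per source.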

LDH utilizes the ClinGen Allele Registry (CAR, https://reg.genome.network/) as a variant naming service. While the Registry provides unique and permanent variant identifiers, the LDH provides a versioned and time-stamped reference to information about variants and tracks provenance for inclusion as permanent records of supporting evidence for variant interpretations.

Currently, LDH integrates population frequency data from gnomAD, molecular consequences from Ensembl Variant Effect Predictor, in-silico predictors from MyVariant.info, allele functional impact from the ClinGen Functional Data Repository (FDRepo), literature curation output of ClinGen community annotators from Hypothes.is, and clinical interpretations from CIViC. Efforts are under way to integrate other external databases, such as BRCA Exchange, the mitochondrial DNA database MSeqDR, and LOVD. Adopting GA4GH standards to represent content in LDH allows new entities to be integrated in an efficient and streamlined manner.

Here, we present the beta version of the LDH resource, which links information for 40 genes and over a million variants for use in ClinGen variant curation using the Variant Curation Interface (VCI). We demonstrate tools for modeling layers of variant evidence from several external data sources using the GA4GH Variant Annotation standards, tracking provenance, and providing permanent references. Finally, we explore the future potential of the novel approaches to exchanging genomic information with standardized models, first implemented by LDH, to empower clinical genetics and open new roads to discovery in genomics.
