Copenhagen, Denmark
Onsite/Online

ESTRO 2022

Session Item

Other
5540
Poster (digital)
Interdisciplinary
Privacy-preserving dashboard for clinical data using open-source federated learning infrastructure
Varsha Gouthamchand, The Netherlands
PO-1062

Abstract

Authors:

Varsha Gouthamchand (1), Gauri K (2), Rajamenakshi Subramanian (2), Ananya Choudhury (1), Leonard Wee (1), Andre Dekker (1), Shwetabh Sinha (3), Sarbani Ghosh Laskar (3), Lohith Reddy (4)

(1) University of Maastricht, Faculty of Health, Medicine and Life Sciences, Maastricht, The Netherlands; (2) Centre for Development of Advanced Computing (C-DAC), IT, Pune, India; (3) Tata Memorial Hospital, Oncology, Mumbai, India; (4) HealthCare Global Enterprises Limited, Oncology, Bangalore, India

Purpose or Objective

Research on real-world cancer data requires large volumes of geographically distributed, multi-institutional data to be analyzed, and it is not always feasible to centralize these data for statistical analysis. “Distributed methods” are now making inroads into oncology research; however, the confidentiality of individual human subjects remains the highest priority. In keeping with the Findable, Accessible, Interoperable, Reusable (FAIR) data management principles, the data schema and idiosyncratic coding should be publishable as metadata, even though the data content itself is kept private. Additional privacy is achieved by exchanging only statistical summaries and never any individual-level data. In this way, privacy is maintained without relying exclusively on encryption or privacy-obscuring mechanisms (such as differential privacy). This is known as privacy by design: the risk of re-identifying individual subjects from statistical summaries is very low, even where encryption and differential privacy mechanisms are breached.
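The idea of exchanging only statistical summaries can be sketched as follows. This is a minimal illustration in plain Python, not the Vantage6 API and not the authors' algorithm: the function names `local_summary` and `pooled_mean_sd`, the variable (age), and the values are all assumptions made for the example. Each node returns only a count, a sum, and a sum of squares; the coordinating site derives a pooled mean and standard deviation from those aggregates alone.

```python
import math
from typing import Dict, List


def local_summary(ages: List[float]) -> Dict[str, float]:
    """Runs inside a data node (hypothetical helper).

    Returns only aggregates; the individual values never leave the node.
    """
    return {
        "n": len(ages),
        "sum": sum(ages),
        "sum_sq": sum(a * a for a in ages),
    }


def pooled_mean_sd(summaries: List[Dict[str, float]]) -> Dict[str, float]:
    """Runs at the coordinating site (hypothetical helper).

    Sees only the per-node aggregates, never patient-level rows.
    """
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # Sample variance reconstructed from the pooled sufficient statistics.
    variance = (total_sq - n * mean * mean) / (n - 1)
    # Guard against tiny negative values caused by floating-point rounding.
    return {"n": n, "mean": mean, "sd": math.sqrt(max(variance, 0.0))}


# Made-up values standing in for the private records held at two nodes.
node_a = [61.0, 58.0, 70.0]
node_b = [49.0, 66.0]

result = pooled_mean_sd([local_summary(node_a), local_summary(node_b)])
print(result)
```

In a real deployment a node would typically also refuse to answer when its cohort is very small, since aggregates over a handful of patients can approach individual-level data; that safeguard is omitted here for brevity.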

Material and Methods

We demonstrated a semi-automated method for FAIR-ification of structured clinical data in four simulated, geographically separated “data nodes”, each holding one of four open-access head and neck squamous cell carcinoma (HNSCC) datasets. The datasets were independent and had distinct schemas and coding dictionaries. Individual patient-level data were converted to Resource Description Framework (RDF). The anonymous schemas were separately extracted as Web Ontology Language (OWL) resources. Each OWL resource was shared, then individually mapped to a common oncology ontology using an extra annotation layer above the raw data. Three private HNSCC datasets, from two Indian centers and one Dutch clinic, were processed in the same manner. An interactive dashboard was created in Python to exploit the Vantage6 distributed learning infrastructure. Distributed algorithms for descriptive statistics of cohorts were transmitted through the Vantage6 infrastructure.
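The role of the annotation layer can be illustrated with a small sketch. The work described here used RDF and OWL; the example below substitutes plain Python dictionaries purely to show the principle, and every column name, local code, and `common:` term label in it is a placeholder invented for illustration rather than a real ontology identifier. The mapping is metadata that can be shared, while the raw rows stay unmodified inside the node.

```python
from typing import Dict, List

# Hypothetical local records of one data node (column names and codes invented).
LOCAL_ROWS: List[Dict[str, str]] = [
    {"sex_cd": "1", "t_stage": "T2"},
    {"sex_cd": "2", "t_stage": "T4a"},
]

# Annotation layer: shareable metadata that maps local columns and codes to
# terms of a common vocabulary. The "common:" labels are placeholders.
ANNOTATIONS: Dict[str, dict] = {
    "sex_cd": {
        "term": "common:Sex",
        "codes": {"1": "common:Male", "2": "common:Female"},
    },
    "t_stage": {
        "term": "common:ClinicalTStage",
        "codes": {"T2": "common:T2", "T4a": "common:T4"},
    },
}


def annotate(row: Dict[str, str], annotations: Dict[str, dict]) -> Dict[str, str]:
    """Return a harmonized view of one record; the raw row is not modified."""
    view: Dict[str, str] = {}
    for column, value in row.items():
        rule = annotations.get(column)
        if rule is None:
            continue  # Unmapped columns are left out of the harmonized view.
        view[rule["term"]] = rule["codes"].get(value, "common:Unknown")
    return view


harmonized = [annotate(row, ANNOTATIONS) for row in LOCAL_ROWS]
print(harmonized)
```

Because each node maintains its own mapping to the same target terms, an algorithm written once against the common terms can run at every node regardless of how the local data are coded.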

Results

The algorithms returned summary statistics for the seven datasets without combining any individual patient-level data into a single repository (see Fig. 1). The interactive display allows the researcher to manually explore the data at deeper levels. The prototype visualization can be easily adapted to other use cases.


Fig 1. Privacy-preserving distributed dashboard for four public HNSCC datasets
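A dashboard of this kind needs per-node and overall category counts to display the case mix. The sketch below shows one way such counts could be combined from node-level summaries alone. It is an assumption-laden illustration in plain Python, not the dashboard's actual code: `local_counts`, `combine_counts`, the node names, and the tumor-stage values are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def local_counts(values: Iterable[str]) -> Dict[str, int]:
    """Runs inside a data node: returns category frequencies only."""
    return dict(Counter(values))


def combine_counts(per_node: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Runs at the coordinating site: adds up the per-node frequency tables."""
    total: Counter = Counter()
    for counts in per_node.values():
        total.update(counts)  # Counter.update adds counts for matching keys.
    return dict(total)


# Made-up tumor-stage values standing in for two nodes' private records.
per_node = {
    "node_1": local_counts(["T1", "T2", "T2"]),
    "node_2": local_counts(["T2", "T3"]),
}
overall = combine_counts(per_node)
print(per_node)
print(overall)
```

Keeping the per-node tables alongside the combined one lets a display show both the overall distribution and the contribution of each site.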


Conclusion

An interactive distributed dashboard was deployed through Vantage6, an open-source, privacy-preserving federated infrastructure. The procedure was developed on four public datasets, then applied to three geographically separated private datasets. Annotating datasets as FAIR data allows researchers to explore and evaluate the case mix through a universal distributed dashboard without breaching the confidentiality of patient-level data. This allows institutions to plan collaborations while retaining total control over their own data.