FederatedHealth: A Nordic Federated Health Data Network

Description

An estimated 40-80% of the data in electronic health record (EHR) systems is unstructured (Dalianis 2018). As a result, much of the context captured by digital health information systems cannot easily be exploited to improve patient care, increase cost efficiency or generate new knowledge. This project proposes an innovative, language-agnostic, privacy-preserving solution that makes use of clinical text data in multiple Nordic languages.

Goals

O1. Develop a secure Nordic Health Data Space (NHDS) geared towards secondary use of health data and enabled by privacy-preserving distributed machine learning (federated learning). The shared space enables secure, distributed training of multilingual (no, se, dk, fi, ee) clinical language models and other machine learning models designed to improve patient safety.

O2. Demonstrator: Language-agnostic medical concept extraction for patient safety

O2a. Detection of medical implants

O2b. Detection of adverse drug reactions

Method

O1. A secure Nordic health data space

It can be argued that the biggest challenges in sharing structured clinical data across hospitals and across borders are organizational and political, rather than technical. Sharing unstructured data such as clinical text, on the other hand, presents significant technical challenges, since it is difficult to provide privacy guarantees. In addition to the privacy challenges, each country holds EHR data in its own native language. This presents special challenges for processing clinical text with data-driven AI algorithms.

To address the two challenges of privacy and multiple languages, we pursue two lines of inquiry. (i) By law, patient consent is required before patient data can be processed, but obtaining consent is not practical for the large datasets that data-driven algorithms require. To comply with regulation, we will therefore de-identify all data and, in addition, use privacy-preserving federated learning to train machine learning models, effectively making the data controller the same as the data processor. (ii) We apply recent advances in multilingual general language modeling using transformer models (Devlin et al. 2018) to the problem of processing clinical texts in multiple languages.
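The federated learning idea in (i) can be illustrated with a minimal sketch of federated averaging, a common aggregation scheme. This is not the project's implementation: the toy linear model, the site names, the data values and all function names below are assumptions made for illustration. The point it shows is that each site trains on its own data and only model weights, never records, are sent for aggregation.

```python
from typing import Dict, List, Tuple

Weights = List[float]             # [slope, intercept] of a toy linear model
Dataset = List[Tuple[float, float]]  # (x, y) pairs held locally at one site

# Hypothetical local datasets; in practice these never leave their site.
HYPOTHETICAL_SITES: Dict[str, Dataset] = {
    "site_a": [(0.0, 1.0), (1.0, 3.0)],
    "site_b": [(-1.0, -1.0), (0.5, 2.0), (2.0, 5.0)],
}


def local_update(weights: Weights, data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> Weights:
    """Run a few gradient-descent steps on one site's data only."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[Weights], sizes: List[int]) -> Weights:
    """Average site weights, weighting each site by its number of examples."""
    total = sum(sizes)
    return [
        sum(update[i] * n for update, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    ]


def train_federated(sites: Dict[str, Dataset], rounds: int = 100) -> Weights:
    """Coordinator loop: broadcast weights, collect local updates, average."""
    global_weights: Weights = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in sites.values()]
        sizes = [len(data) for data in sites.values()]
        global_weights = federated_average(updates, sizes)
    return global_weights
```

Calling `train_federated(HYPOTHETICAL_SITES)` returns weights close to the line y = 2x + 1 from which the toy data were drawn. A real deployment would replace the toy model with a language model and would typically add further protections, such as secure aggregation, that this sketch omits.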

The driving research question is: can we configure a sustainable multilingual shared data space within the constraints of current and emerging regulation, and organizational restrictions?

O2a. Detection of medical implants

Prior to an MRI (magnetic resonance imaging) examination, it is important to know in detail which medical implants (e.g. a pacemaker) a patient has, including the implant model, because implants can cause serious harm to the patient, or even death. About 30,000 MRI examinations are carried out in the County Council of Östergötland (Sweden) alone, and an increasing share of our patients, about 20-25%, have implants.

Medical record systems used in healthcare are now digitized, but the information is not structured in a way that makes it easy to find which implants an individual patient has. It is often very difficult to establish the necessary details, because patients usually do not know the model of their implant, and the model is crucial for deciding whether the implant is safe or conditional for MRI. Today, when a patient has, or is suspected of having, an implant, the procedure for obtaining this detailed information is entirely manual: it is laborious and involves a range of experts with specialized knowledge (Kihlberg & Lundberg, 2019) reading through the whole patient record.

To improve patient safety in this case, an AI solution is required to dramatically speed up the reading of the patient records and, in principle, drastically reduce the risk of missing any implant before an MRI examination. Background work has already been done by our partners in Sweden using implant terms and focused terminology extraction (Jerdhaf et al. 2021, 2022). The goal is to increase the robustness of the current model to handle different forms of peculiar linguistic features in clinical text, and enable detection of a wider range of medical implants.
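As a rough illustration of what automated reading for implant terms involves, the sketch below runs a plain dictionary lookup over a clinical note. It is not the terminology extraction method of Jerdhaf et al.; the lexicon entries, categories and function name are hypothetical, and the example text is in English for readability.

```python
import re
from typing import Dict, List

# Hypothetical implant lexicon: surface term -> coarse category.
IMPLANT_TERMS: Dict[str, str] = {
    "pacemaker": "cardiac device",
    "cochlear implant": "hearing device",
    "stent": "vascular device",
}


def find_implant_mentions(note: str,
                          lexicon: Dict[str, str] = IMPLANT_TERMS) -> List[dict]:
    """Return lexicon terms found in the note, ordered by position in the text."""
    mentions = []
    for term, category in lexicon.items():
        # Whole-word, case-insensitive match; inflected forms are not covered.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(note):
            mentions.append({
                "term": term,
                "category": category,
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(mentions, key=lambda m: m["start"])
```

An exact lookup like this misses misspellings, abbreviations, inflected forms and negated mentions, which is the kind of brittleness the robustness goal above is meant to address.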

We hypothesize that data aggregation from multiple sources will improve model performance and reduce time for finding existing and previous implants, and increase the number of devices that can be detected.

O2b. Detection of adverse drug reactions

Adverse drug reactions (ADRs) are unwanted or harmful effects that occur as a result of taking medication/drugs. They can range from mild symptoms, such as a rash or headache, to more severe reactions, such as an allergic reaction or organ damage. Many factors can contribute to the likelihood of an ADR, including a person's age, underlying health conditions, and other medications they may be taking. It is important to report any suspected ADRs to a healthcare provider, as reporting can help to identify and prevent future reactions. Clues about ADRs often exist in clinical text. Previously, consortium members reported linking adverse effects to specific drugs, using temporal text mining as well as structured data representing specific drugs in the EHR (Eriksson et al. 2014). The study revealed a number of specific drugs and their potentially unwanted effects. Elsewhere in our consortium, we used distributional semantics to detect adverse drug events (Henriksson et al. 2015).

The goal for this use case is to build on the ADR system developed by consortium members and apply it to new datasets in new languages. This process will involve two key components: (i) stacking multilingual capabilities onto the existing model, and (ii) additional training based on new data from other sites, so as to increase the range of medication/drug data from which the model can learn. The model, capable of learning in different Nordic languages, holds the potential to improve drug safety and accelerate the discovery of currently unknown effects.

We hypothesize that data aggregation from multiple sources will improve model performance, thus also patient safety, and increase the number of drugs for which adverse events can be detected.

Conclusion

Instead of aggregating data based only on the structured parts of the health record, we use unstructured clinical text to extract additional contextual information in multiple languages. We develop processing techniques and computational models based on transformer architectures to train multilingual clinical language models to solve the problem of aggregating data sources in multiple languages. The models are trained in a self-supervised/unsupervised manner to acquire a statistical understanding of the general language, as well as an understanding of the peculiarities of clinical text in each of the languages. Our two use cases will demonstrate the benefits of this approach.
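The self-supervised training mentioned above can be illustrated with the data preparation step of masked language modeling, the objective used by BERT-style transformers (Devlin et al. 2018). The sketch below is simplified: it always substitutes a mask token, whereas BERT also sometimes keeps the original token or swaps in a random one. The function name, the fixed seed and the example sentence are assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[Optional[str]]]:
    """Hide a fraction of tokens; the hidden originals become training labels."""
    if not tokens:
        return [], []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels are None wherever the model is not asked to predict anything.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels
```

No manual annotation is needed: the text itself supplies the labels, which is why the same procedure can be applied to clinical text in each language.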

Articles

0. Dagens Medicin - Nordisk-baltiskt konsortium driver projekt med hälsodatanätverk (paid access)

1. Federated Learning (popularized report) -- Norwegian version

2. Lamproudis A, Mora S, Svenning TO, Torsvik T, Chomutare T, Ngo PD, Dalianis H (2023) “De-identifying Norwegian Clinical Text using Resources from Swedish and Danish”. Proceedings of AMIA Annual Symposium, November 11-15. New Orleans, LA, USA. (to appear)