Klinisk NorBERT

Our main objective in this project is to develop a Norwegian BERT model finetuned on clinical data, so that clinical text is represented more accurately than is currently possible with general Norwegian language models. The model is planned to be made available via Helse Vest IKT, under certain conditions, to Norwegian health care providers and medical NLP researchers. This would provide a tool that is in demand (Pilan, et al., 2020) and significantly extend the research possibilities in this field.

In work package one, we develop an anonymization tool for clinical text data. This tool should identify names, dates, ages, locations, clinical institutions, phone numbers and email addresses and replace them with realistic random values. For example, “Askøy” could replace “Bønes”, “19.02.2013” could replace “05.07.2015” or “Erlend” could replace “Håkon”. Additionally, we split the patient journals by section and paragraph and shuffle all the documents. The anonymization process will be verified by random sample testing and must be approved by all four health institutions in Helse Vest as data owners.
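As a rough illustration of the replacement step, the following Python sketch swaps dates, first names and place names for random surrogates. The surrogate lists, the date range and all function names are assumptions made for this example. It does not describe the actual tool, which must also cover ages, institutions, phone numbers and email addresses and would need far more robust entity detection than fixed lists and a regular expression.

```python
import random
import re
from datetime import date, timedelta

# Hypothetical surrogate lists; a real tool would use much larger gazetteers
# and a proper entity recognizer instead of fixed lists.
FIRST_NAMES = ["Erlend", "Håkon", "Ingrid", "Marit"]
PLACES = ["Askøy", "Bønes", "Fana", "Os"]

# Matches dates written as dd.mm.yyyy, e.g. 05.07.2015.
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def random_date(rng, start=date(2010, 1, 1), end=date(2020, 12, 31)):
    """Return a random date between start and end, formatted as dd.mm.yyyy."""
    offset = rng.randrange((end - start).days + 1)
    return (start + timedelta(days=offset)).strftime("%d.%m.%Y")


def replace_from_list(text, candidates, rng):
    """Replace each whole-word occurrence of a listed value with a different listed value."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, candidates)) + r")\b")

    def pick(match):
        others = [c for c in candidates if c != match.group(0)]
        return rng.choice(others)

    return pattern.sub(pick, text)


def pseudonymize(text, seed=None):
    """Replace dates, first names and place names with random surrogates."""
    rng = random.Random(seed)
    text = DATE_PATTERN.sub(lambda m: random_date(rng), text)
    text = replace_from_list(text, FIRST_NAMES, rng)
    text = replace_from_list(text, PLACES, rng)
    return text


# Made-up example sentence, not taken from any patient journal.
print(pseudonymize("Håkon from Bønes was admitted on 05.07.2015.", seed=1))
```

Drawing surrogates from the same category as the original value (a name for a name, a date for a date) keeps the text realistic, which matters because the anonymized text is later used to train a language model.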

Before we can test the tool on 3-6 months of clinical text data, we build the server infrastructure, approved via a ROS (risk and vulnerability) analysis, that securely hosts and processes the sensitive data. The anonymization tool is then approved for use on the basis of random sample testing. Without approval of the anonymization tool by all four health institutions, the project cannot continue. This process is intended to ensure the privacy of the patients. While the anonymization process is under way, Helse Vest IKT prepares the computational resources for model training, so that the project has access to GPUs inside the secure environment in Helse Vest.

In work package two, Helse Vest IKT, with the help of DIPS, extracts all free-text information from the relevant patient journals and stores it in a secure environment for sensitive patient data provided by Helse Vest IKT. Both DIPS and Helse Vest IKT are familiar with handling sensitive patient data from Helse Vest. Nevertheless, we secure the process by carrying out a DPIA (data protection impact assessment) and a ROS analysis. We extract all patient journals from 2019, 2021 and 2022 from each of the four health institutions in Helse Vest: Bergen, Stavanger, Fonna and Førde. It is important for the project to receive data from all four institutions and from all medical fields within the hospitals, so that the model is not biased towards certain areas (for example urban vs rural). We do not extract texts from sensitive or blocked journals.

In work package three, we pre-process the anonymized text data that we received as a result of work package two. Pre-processing includes, for example, converting the text to string format and removing unnecessary characters. Additionally, we must adapt sentence segmentation to the clinical corpus.
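The two pre-processing steps mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the abbreviation list is hypothetical and would in practice be derived from the clinical corpus, and the rule-based splitter only illustrates why a general-purpose sentence segmenter needs adjusting for clinical notes, where abbreviations ending in a period are frequent.

```python
import re
import unicodedata

# Hypothetical abbreviations that should not end a sentence; the real list
# would be derived from the clinical corpus.
ABBREVIATIONS = {"bl.a.", "f.eks.", "pas.", "ca.", "dvs."}


def clean_text(raw):
    """Convert to a string, normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", str(raw))
    # Unicode categories starting with "C" cover control and format characters.
    text = "".join(" " if unicodedata.category(ch)[0] == "C" else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after a token ending in '.', '!' or '?' unless it is a known abbreviation."""
    if not text:
        return []
    sentences, current = [], []
    for token in text.split(" "):
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Made-up note with irregular whitespace and two abbreviations.
raw_note = "Pas.  innlagt\tmed feber.\nGitt ca. 500 ml væske."
print(split_sentences(clean_text(raw_note)))
```

Without the abbreviation check, the note would be split after “Pas.” and “ca.”, producing sentence fragments that would later distort the next sentence prediction task.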

After the anonymized text data is pre-processed, we start the core part of our project: the development of a Norwegian Clinical BERT model. We finetune two base models to develop two clinical BERT models. Our base models are NB-BERT-base (Kummervold, De la Rosa, Wetjen, & Brygfjeld, 2021) and NorBERT (Kutuzov, Barnes, Velldal, Øvrelid, & Oepen, 2021). Both represent a general understanding of the Norwegian language (bokmål and nynorsk), but not the specialized vocabulary or writing style used in clinical patient journals.

We finetune the models on the same pre-training tasks as Devlin, Chang, Lee, & Toutanova (2019), i.e., masked language modelling and next sentence prediction. In masked language modelling, single words in the text are masked and the model tries to predict each masked word using only the surrounding words. In the next sentence prediction task, the model is given two sentences and tries to predict whether they are consecutive or not. We estimate that the training process for each model will run on our servers for up to three weeks.
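To make the two tasks concrete, the following Python sketch builds training examples for both. The masking rates (15% of tokens selected, of which 80% become a mask token, 10% a random token and 10% stay unchanged) and the 50/50 split for sentence pairs follow Devlin, Chang, Lee, & Toutanova (2019); whether this project uses exactly these settings is an assumption. The function names are hypothetical, and the sketch works on whitespace tokens and draws the non-consecutive sentence from the same document for brevity, whereas BERT operates on sub-word tokens and samples from the whole corpus.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); a label is None where nothing is predicted."""
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


def make_nsp_pair(sentences, index, rng):
    """Return (sentence_a, sentence_b, is_next) for sentences[index].

    Requires at least three sentences and index < len(sentences) - 1.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], True
    # Negative example: any sentence other than A itself and its true successor.
    candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
    return sentence_a, sentences[rng.choice(candidates)], False


rng = random.Random(0)
tokens = "pasienten har feber og hoste".split()
print(mask_tokens(tokens, vocab=tokens, rng=rng))
print(make_nsp_pair(["s1", "s2", "s3", "s4"], 1, rng))
```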

We evaluate both models on the tasks they were trained on, masked language modelling and next sentence prediction. Additionally, we add a qualitative analysis based on medical term similarity for Norwegian medical terms, as used for the English Clinical BERT by Huang (2019) and introduced by Pedersen (2007). “The data is 30 pairs of medical terms whose similarity is rated by physicians. To compute an embedding for a medical term, Clinical NorBERT is fed a sequence of tokens corresponding to the term. Following (Devlin, Chang, Lee, & Toutanova, 2019), the sum of the last four hidden states of ClinicalBERT encoders is used to represent each medical term. Medical terms vary in length, so the average is computed over the hidden states of sub word units. This results in a fixed 768-dimensional vector for each medical term.” (Huang, 2019) We use the calculated distance between these vectors as a representation of their similarity.
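The pooling described in the quotation can be sketched in plain Python as follows. The nested-list layout of the hidden states (layer, then sub-word, then dimension) and the function names are assumptions made for this example, and the toy input uses 3 dimensions instead of 768. The text above does not name the distance measure, so the use of cosine similarity here is also an assumption.

```python
import math


def term_embedding(hidden_states):
    """Pool per-layer hidden states into one vector for a medical term.

    hidden_states[layer][subword][dim] holds the encoder outputs for the
    sub-word units of one term. The last four layers are summed per sub-word,
    then the result is averaged over all sub-words.
    """
    last_four = hidden_states[-4:]
    n_subwords = len(last_four[0])
    dim = len(last_four[0][0])
    summed = [
        [sum(layer[t][d] for layer in last_four) for d in range(dim)]
        for t in range(n_subwords)
    ]
    return [sum(vec[d] for vec in summed) / n_subwords for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy hidden states: 5 layers, 2 sub-words, 3 dimensions; layer l holds the value l.
toy_states = [[[float(layer)] * 3 for _ in range(2)] for layer in range(5)]
embedding = term_embedding(toy_states)
print(embedding)
print(cosine_similarity(embedding, [1.0, 0.0, 0.0]))
```

The model-derived similarities for the term pairs can then be compared with the physicians' ratings, for example by rank correlation.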

Collaboration partners

DIPS

Purpose

Problem statement

In Helse Bergen alone, health care professionals produce approximately 1.5 million clinical text documents each year in the form of discharge summaries (epikriser in Norwegian), nursing notes (sykepleienotater in Norwegian), etc. Hospitals invest a lot of time and resources in storing most of the information from these texts in structured, coded form to make it accessible for automatic evaluation by machines or algorithms. Nevertheless, the text itself and the nuanced information that remains in it are not accessible for automatic evaluation (Frønli, 2016). Making this nuanced information, so far unused by machines, available for automatic evaluation is the domain of natural language processing (NLP), one of the fastest-growing subfields of artificial intelligence (AI).

Improvements in NLP in recent years have extended the possibilities for medical researchers and clinical administrators to analyze unstructured text data. Contextual word embedding models such as BERT (“Bidirectional Encoder Representations from Transformers”) (Devlin, Chang, Lee, & Toutanova, 2019) and ELMo (“Embeddings from Language Models”) (Peters, et al., 2018) are the main contributors to these developments. Until recently, these models were available only in a few major languages such as English, Chinese, and Spanish, which limited the possibilities for Norwegian NLP research. This changed in 2021, when NorLM published the first fully functional Norwegian BERT, “NorBERT” (Kutuzov, Barnes, Velldal, Øvrelid, & Oepen, 2021), which was trained jointly on Bokmål and Nynorsk, followed by NB-BERT-base/NB-BERT-large from the AI-lab at NLN (Kummervold, De la Rosa, Wetjen, & Brygfjeld, 2021).

Unfortunately, these models, developed for a broad range of topics on publicly available texts such as Wikipedia, do not necessarily generalize well to specialized industry corpora, compared with simpler embedding models trained on smaller, domain-specific text collections (Alsentzer, et al., 2019) (Ezen-Can, 2020). These challenges led to the development, for English, of BioBERT (Lee, et al., 2020), a BERT model finetuned on non-clinical biomedical text, and Clinical BERT (Alsentzer, et al., 2019), a BERT model finetuned on clinical text (e.g., physician notes). The results of Clinical BERT (Alsentzer, et al., 2019) are of special interest to us, as they provide robust evidence that specialized clinical embeddings are superior to general-domain BERT models in the clinical setting.

We aim to develop a Norwegian Clinical BERT model (Klinisk NorBERT) based on anonymized Norwegian clinical text from Helse Vest. Klinisk NorBERT is planned to be made available via Helse Vest IKT, under certain conditions, to Norwegian health care providers and medical NLP researchers. This would provide a tool that is in demand (Pilan, et al., 2020) and significantly extend the research possibilities in this field.

Ongoing project

Project period

2023 - 2025

Categories

Focus area:

Clinical, ICT infrastructure/data access

Type of health service:

Type of data:

Free text

Data source:

Patient journal

Planned final phase:

Other

Task:

Facilitation

Project owner

HBE, HFD, HST, HFO, HVIKT

Health region

Helse Vest