Disease prediction from primary care consultation notes by using external documents

Scientific Research Project Number: MA 2020 03
Place: Amsterdam UMC, location AMC, department of Medical Informatics

Introduction

Description of the SRP Project/Problem


Various studies based on word embeddings have shown that the text in consultation notes has predictive value for an outcome (such as cancer). However, it is often unclear exactly which part of the text contributed to the prediction and why. In this SRP we address a simpler task by focusing on words in the consultation notes and looking at their similarity to known symptoms of the disease one intends to predict. Specifically, we assess the similarity between words in the consultation notes and symptoms and other terms appearing in relevant documents, such as guidelines, by considering their distance in the embedding space. In this space, every word is represented as a vector of numbers. In addition, we are interested in assessing the added value of text on top of structured (demographic and clinical) variables.
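
To make the idea of distance in the embedding space concrete, the following minimal sketch computes the cosine similarity between a word from a consultation note and a list of reference terms. It is only an illustration: the three-dimensional vectors, the `EMBEDDINGS` dictionary and the function names are hypothetical, cosine similarity is one possible choice of measure, and a real project would load embeddings from a trained model.

```python
import math

# Hypothetical toy embeddings (assumption); real vectors would come from a
# trained embedding model and have many more dimensions.
EMBEDDINGS = {
    "cough": [0.9, 0.1, 0.0],
    "coughing": [0.8, 0.2, 0.1],
    "fatigue": [0.1, 0.9, 0.2],
    "bicycle": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_similarity(word, reference_terms, embeddings):
    """Highest similarity between a note word and any known reference term."""
    if word not in embeddings:
        return 0.0  # out-of-vocabulary words contribute no evidence
    sims = [
        cosine_similarity(embeddings[word], embeddings[term])
        for term in reference_terms
        if term in embeddings
    ]
    return max(sims, default=0.0)


# Hypothetical symptom terms, e.g. extracted from a guideline.
symptoms = ["cough", "fatigue"]
for note_word in ["coughing", "bicycle"]:
    print(note_word, round(max_similarity(note_word, symptoms, EMBEDDINGS), 3))
```

In this toy example "coughing" lies much closer to the symptom terms than "bicycle" does, which is the kind of signal the project intends to exploit.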


The proposed approach can be decomposed into the following steps: 1) obtain word embeddings of the words appearing in consultation notes and in external sources, 2) compute the similarity between words in the consultation notes and those obtained from external sources, 3) use the similarities to predict the outcome, with and without structured variables, and 4) assess the added predictive performance of the text on top of the structured variables. The approach will be applied in a case study (open source or otherwise) that is determined jointly once a student selects this research proposal.
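
Steps 3 and 4 could be sketched as follows, assuming the similarities from step 2 have already been summarized into one number per patient. Everything here is an assumption made for illustration: the six synthetic patients, the choice of logistic regression fitted by gradient descent, and the use of the AUC as the performance measure. The AUC is computed on the training data only to keep the sketch short; an actual study would need a proper validation approach on held-out data.

```python
import math

# Synthetic patients (assumption): (age / 100, text similarity score, outcome).
PATIENTS = [
    (0.30, 0.10, 0), (0.70, 0.15, 0), (0.50, 0.20, 0),
    (0.60, 0.90, 1), (0.40, 0.85, 1), (0.80, 0.95, 1),
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Predicted outcome probability for every row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(scores, labels):
    """Probability that a random positive case is ranked above a random negative one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y = [outcome for _, _, outcome in PATIENTS]
X_structured = [[age] for age, _, _ in PATIENTS]       # structured variable only
X_full = [[age, sim] for age, sim, _ in PATIENTS]      # structured variable + text

auc_structured = auc(predict(X_structured, *fit_logistic(X_structured, y)), y)
auc_full = auc(predict(X_full, *fit_logistic(X_full, y)), y)
print("AUC structured only:", round(auc_structured, 3))
print("AUC structured + text:", round(auc_full, 3))
```

The difference between the two AUC values is one simple way to express the added value of the text; other measures of discrimination and calibration could be used instead.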


Research question

What is the predictive value of words from primary care consultation notes when their similarity to external terms is used in predicting the outcome?


Expected results

An algorithm and a trained prediction model using consultation notes and external text.

A validation approach and performance results of this algorithm, with and without text.

A master's thesis written in the form of a scientific article.


Time period:

7 months


Contact:

Mentors: Martijn Schut and Miguel Gaona, Amsterdam UMC, location AMC, department of Medical Informatics, m.c.schut/m.a.riosgaona@amsterdamumc.nl

Tutor: Ameen Abu-Hanna, Amsterdam UMC, location AMC, department of Medical Informatics, a.abu-hanna@amsterdamumc.nl


References:

Edward Choi, Cao Xiao, Walter F. Stewart, Jimeng Sun, Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare


Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics, 2020


Contact

Martijn Schut, Amsterdam UMC, location AMC, department of Medical Informatics, m.c.schut@amsterdamumc.nl
Miguel Angel Rios Gaona, Amsterdam UMC, location AMC, department of Medical Informatics, m.a.riosgaona@amsterdamumc.nl