Predicting ICD Codes from Clinical Notes
This repository contains code for training and evaluating several neural network models for predicting ICD codes from discharge summaries on the publicly accessible MIMIC-III dataset (v1.4). The models are described in the paper Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes, which uses the results on MIMIC-III as a benchmark. The implemented models are Constant, Logistic Regression (LR), CNN, GRU and CNN-Att.
This project depends on:
In data/, place the files below:

Run MIMIC_preprocessing.py to select discharge summaries and merge MIMIC-III tables.

The MIMIC-III tables NOTEEVENTS and DIAGNOSES_ICD are loaded and merged through admission IDs. From NOTEEVENTS, only a single discharge summary is selected per admission ID.
Outputs data/mimic3_data.pkl, a DataFrame containing 4 columns: the admission ID (HADM_ID), the patient ID (SUBJECT_ID, which may be linked to multiple HADM_IDs), the selected discharge summary for each HADM_ID, and the ICD codes assigned to each HADM_ID.
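The merge and selection described above might look roughly like the sketch below. This is not the repository's MIMIC_preprocessing.py: the file paths and the keep-last selection rule are assumptions, while the column names (CATEGORY, TEXT, ICD9_CODE) are the standard ones from the MIMIC-III v1.4 tables.

```python
# Hypothetical sketch of the preprocessing step, not the actual MIMIC_preprocessing.py.
# Assumes the standard MIMIC-III v1.4 CSVs are available under data/.
import pandas as pd

notes = pd.read_csv("data/NOTEEVENTS.csv.gz",
                    usecols=["SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT"])
diagnoses = pd.read_csv("data/DIAGNOSES_ICD.csv.gz",
                        usecols=["HADM_ID", "ICD9_CODE"])

# Keep discharge summaries only and select a single one per admission ID.
summaries = (notes[notes["CATEGORY"] == "Discharge summary"]
             .drop_duplicates(subset="HADM_ID", keep="last"))

# Gather all ICD codes assigned to each admission into one list per HADM_ID.
codes = diagnoses.groupby("HADM_ID")["ICD9_CODE"].apply(list).reset_index()

# Merge the two tables through admission IDs and store the result.
merged = summaries.merge(codes, on="HADM_ID", how="inner")
merged.to_pickle("data/mimic3_data.pkl")
```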
Run MIMIC_train_w2v.py to train Word2Vec word embeddings for the neural network models.

This script generates training instances by filtering data/mimic3_data.pkl with data/train_full_hadmids.csv to train gensim.models.Word2Vec word embeddings.
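A minimal sketch of that step is shown below, assuming gensim 4.x; the column names, the CSV layout of train_full_hadmids.csv and the Word2Vec hyperparameters are illustrative, not the values used in the repository.

```python
# Hypothetical sketch, not the actual MIMIC_train_w2v.py.
# Column names ("HADM_ID", "TEXT") and all hyperparameters are assumptions.
import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

data = pd.read_pickle("data/mimic3_data.pkl")
train_ids = set(pd.read_csv("data/train_full_hadmids.csv", header=None)[0])

# Keep only admissions in the full training split, then tokenize each summary.
train_texts = data[data["HADM_ID"].isin(train_ids)]["TEXT"]
sentences = [simple_preprocess(text) for text in train_texts]

# Train the word embeddings (vector size, window and min_count are illustrative).
w2v = Word2Vec(sentences=sentences, vector_size=300, window=5,
               min_count=5, workers=4)
w2v.save("data/MIMIC_w2v.model")
```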
Outputs:
Run MIMIC_train_baselines.py for LR and Constant models.

For Constant:
Computes the k most occurring ICDs in the training set and predicts them for all test samples. Nothing is stored here.
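One plausible reading of the Constant baseline is sketched below; the helper name and the use of scikit-learn's MultiLabelBinarizer and f1_score are assumptions, not the repository's code.

```python
# Hypothetical sketch of the Constant baseline (k most occurring codes).
from collections import Counter
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

def constant_baseline(train_codes, test_codes, k):
    """Predict the k most frequent training ICDs for every test sample."""
    counts = Counter(code for codes in train_codes for code in codes)
    top_k = [code for code, _ in counts.most_common(k)]

    # Binarize the label sets so micro F1 can be computed.
    mlb = MultiLabelBinarizer()
    mlb.fit(train_codes + test_codes)
    y_true = mlb.transform(test_codes)

    # Every test sample gets exactly the same k predicted codes.
    y_pred = np.tile(mlb.transform([top_k])[0], (len(test_codes), 1))
    return f1_score(y_true, y_pred, average="micro")
```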
For LR:
Computes TF-IDF features on the training set, then fits the LR model to the training set. After training, the weights of the epoch with best micro F1 on the validation set are restored and threshold-optimized metrics are computed for all subsets. The fitted model is stored in the TensorFlow SavedModel format.
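A sketch consistent with that description is shown below, under the assumption that the LR model is a single sigmoid layer trained with Keras over TF-IDF features (which would explain the SavedModel output); the vocabulary size, optimizer and epoch count are illustrative.

```python
# Hypothetical sketch of the TF-IDF + LR baseline, not the actual MIMIC_train_baselines.py.
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer

def train_lr(train_texts, y_train, val_texts, y_val, num_labels):
    # TF-IDF features are fit on the training set only.
    vectorizer = TfidfVectorizer(max_features=20000)
    X_train = vectorizer.fit_transform(train_texts).toarray()
    X_val = vectorizer.transform(val_texts).toarray()

    # Multi-label logistic regression: a single dense sigmoid layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(num_labels, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=10, batch_size=32)

    # (The repository additionally restores the epoch with the best validation
    # micro F1 before evaluation; that bookkeeping is omitted in this sketch.)
    # In TF 2.x, saving to a directory path produces a SavedModel.
    model.save("models/lr_savedmodel")
    return model, vectorizer
```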
Run MIMIC_train_nn.py for CNN, GRU and CNN-Att.

This script loads the data splits and Word2Vec embeddings, then fits the desired model to the training set.
After training, the weights of the epoch with best micro F1 on the validation set are restored and threshold-optimized metrics are computed for all subsets.
The fitted model is stored in the TensorFlow SavedModel format.
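One way to read "threshold-optimized metrics" is that a single decision threshold is swept on the validation predictions and the value that maximizes micro F1 is then applied to every subset; the sketch below illustrates that reading and is not taken from the repository.

```python
# Hypothetical sketch of threshold-optimized micro F1, not the repository's evaluation code.
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true_val, y_prob_val, candidates=np.arange(0.05, 0.95, 0.05)):
    """Pick the decision threshold that maximizes micro F1 on the validation set."""
    scores = [f1_score(y_true_val, (y_prob_val >= t).astype(int), average="micro")
              for t in candidates]
    return candidates[int(np.argmax(scores))]

def evaluate_subsets(model, subsets, threshold):
    """Apply the chosen threshold to every subset (e.g. train / validation / test)."""
    return {name: f1_score(y, (model.predict(X) >= threshold).astype(int),
                           average="micro")
            for name, (X, y) in subsets.items()}
```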
In notebooks/, you will find: