[MA 2025 04] Sampling for knowledge distillation by copying the behaviour of prediction models without access to training data
Department of Medical Informatics (KIK)
Proposed by: Iacer Calixto, assistant professor of artificial intelligence [i.coimbra@amsterdamumc.nl]
Introduction
Recently, researchers have developed an efficient method to “copy” the behaviour of a prediction model ‘A’ into another prediction model ‘B’ [1, 2]. For instance, ‘A’ can be a neural network trained to predict mortality given a patient’s medical history, and ‘B’ can be any kind of machine learning model (e.g., a decision tree, support vector machine, random forest, logistic regression, or neural network). We refer to model A as the teacher model and to model B as the student model. The method proposed in [1,2] allows one to distil the knowledge of the teacher model into the student model so that the two models perform similarly. In the formulation in [1,2], researchers assume no access to the original training data used to train the teacher model. This makes the problem more challenging, since one must first generate synthetic data points, then label these synthetic inputs with the teacher model, and finally train the student model on the synthetic inputs and their silver labels (i.e., the teacher's predictions). There are many reasons to distil knowledge from the teacher into the student. Examples include interpretability (e.g., the student model comes from a family of machine learning models that is more interpretable than the teacher's) and compute efficiency (e.g., the student model requires less specialised hardware to run and is therefore cheaper to deploy in terms of energy consumption).
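The three steps above (sample synthetic inputs, label them with the teacher, train the student) can be illustrated with a minimal sketch. This is not the algorithm of [1,2]. The teacher here is a made-up black-box rule, the student is a one-split decision stump, and all names (teacher_predict, sample_inputs, fit_stump) are hypothetical and chosen only for illustration.

```python
import random


def teacher_predict(x):
    """Hypothetical black-box teacher: a fixed linear rule on two features."""
    return 1 if 0.7 * x[0] + 0.3 * x[1] > 0.5 else 0


def sample_inputs(n, n_features, rng):
    """Step 1: draw synthetic points uniformly from [0, 1)^d, using no training data."""
    return [[rng.random() for _ in range(n_features)] for _ in range(n)]


def fit_stump(X, y):
    """Step 3: fit a one-split decision stump (the student) by exhaustive search."""
    best = None  # (errors, feature index, threshold, label predicted above threshold)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for above in (0, 1):
                errors = sum(
                    (above if row[j] > t else 1 - above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    _, j, t, above = best
    return lambda x: above if x[j] > t else 1 - above


rng = random.Random(0)                            # fixed seed for reproducibility
X_syn = sample_inputs(300, 2, rng)                # step 1: synthetic inputs
y_silver = [teacher_predict(x) for x in X_syn]    # step 2: silver labels from the teacher
student = fit_stump(X_syn, y_silver)              # step 3: train the student on the copy set
```

The stump cannot reproduce the teacher's oblique decision boundary exactly, which is the kind of capacity mismatch between teacher and student that a distillation study has to account for.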
Description of the SRP Project/Problem
In this SRP, you will focus on the problem of sampling synthetic tabular data and you will use the MIMIC-IV [3] dataset in your experiments. You will use synthetic data for the task of predicting the risk of death for an intensive care unit (ICU) patient, a task that can be modelled as a classification problem. MIMIC-IV includes data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC) in the United States, including demographics, labs, medications, and more. You will build neural network-based teacher models using all relevant tabular data available for a patient (i.e., structured data). You will then adapt the methodology introduced in [1,2] to distil the knowledge of the teacher into different student models, investigating different techniques for sampling synthetic data points under the assumption that no information about the input features used to train the model is available. More concretely, the only information available about the input features is their type and range (e.g., feature f1 can take values between 0 and 1 in the case of a normalised numerical feature, or feature f2 can take one of K possible classes in the case of a categorical feature). In the strictest case, you assume there is no access to extra information about the features: no feature names, no class names, no prior feature distribution, etc.
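As a minimal sketch of the strictest case, the snippet below samples synthetic rows when only each feature's type and range are known. Uniform sampling is assumed here purely as a naive baseline. It is not claimed to be the sampling strategy of [1,2], nor the one this project should end up with. The schema format and the names SCHEMA, sample_point and sample_dataset are hypothetical.

```python
import random

# Hypothetical schema: the only knowledge about each feature is its type and range.
SCHEMA = [
    {"type": "numerical", "low": 0.0, "high": 1.0},   # e.g. a normalised feature f1
    {"type": "numerical", "low": 0.0, "high": 1.0},
    {"type": "categorical", "n_classes": 4},          # e.g. f2 with K = 4 classes
]


def sample_point(schema, rng):
    """Draw one synthetic row, sampling every feature independently and uniformly."""
    point = []
    for spec in schema:
        if spec["type"] == "numerical":
            point.append(rng.uniform(spec["low"], spec["high"]))
        elif spec["type"] == "categorical":
            # Classes are anonymous, so they are represented by indices 0..K-1.
            point.append(rng.randrange(spec["n_classes"]))
        else:
            raise ValueError(f"unknown feature type: {spec['type']!r}")
    return point


def sample_dataset(schema, n, seed=0):
    """Draw n synthetic rows with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return [sample_point(schema, rng) for _ in range(n)]


synthetic = sample_dataset(SCHEMA, n=1000)
```

Independent uniform sampling ignores any dependence between features, so many sampled rows may be implausible as patient records. Comparing such a baseline against more informed samplers is one way to approach the questions below.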
Research questions
RQ 1) Using MIMIC-IV, how well can you generate synthetic data under the strictest case (no extra information available) when all features are numerical? And when the features are a mix of numerical and categorical?
RQ 2) To what extent does the synthetic data you generate, when used for knowledge distillation as proposed in [1,2], affect knowledge transfer to the student model when the teacher model is trained on a classification task, e.g., predicting mortality using MIMIC-IV?
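One possible way to quantify knowledge transfer for RQ 2, stated here as an assumption rather than as the metric prescribed by [1,2], is to report the agreement between student and teacher predictions (often called fidelity) next to the accuracy of each model against the true labels. The labels below are made up for illustration, and the function name agreement is hypothetical.

```python
def agreement(labels_a, labels_b):
    """Fraction of positions at which two label sequences coincide."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy labels, invented for this example only (1 = death, 0 = survival).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_teacher = [1, 0, 0, 1, 1, 1, 0, 0]
y_student = [1, 0, 0, 0, 1, 1, 0, 0]

fidelity = agreement(y_teacher, y_student)     # how closely the student copies the teacher
teacher_acc = agreement(y_true, y_teacher)     # teacher accuracy on held-out real data
student_acc = agreement(y_true, y_student)     # student accuracy on the same data
accuracy_gap = teacher_acc - student_acc       # performance lost through distillation
```

Fidelity can be computed on synthetic inputs alone, whereas the accuracy gap requires held-out real data with true labels, so the two numbers answer slightly different questions.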
Expected results
The main outcome of this SRP project is a scientific paper. We will publish the results of your work in a top-tier machine learning workshop, and you will use this paper as your thesis for defending your SRP.
You will deliver a publicly available code base that shares all the experiments conducted in your SRP with the research community.
Time period, please tick at least 1 time period
November – June ☐
May – November ☐
Contact
Iacer Calixto, assistant professor of artificial intelligence, KIK, i.coimbra@amsterdamumc.nl
References
[1] N. Statuto, I. Unceta, J. Nin, and O. Pujol. A scalable and efficient iterative method for copying machine learning classifiers. Journal of Machine Learning Research, 24(390):1–34, 2023. URL http://jmlr.org/papers/v24/23-0135.html.
[2] I. Unceta, J. Nin, and O. Pujol. Copying machine learning classifiers. IEEE Access, 8:160268–160284, 2019. URL https://api.semanticscholar.org/CorpusID:67877026.
[3] A. E. W. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, S. Hao, B. Moody, B. Gow, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.