[MA 2023 26] Online testing: Netflix, COVID-19, casinos, and N-of-1 clinical trials

Amsterdam UMC, Dept. Epidemiology and Data Science (EDS)
Proposed by: dr. Judith ter Schure [j.a.terschure@amsterdamumc.nl]

Introduction

Netflix releases a new version of its website seven times a day. The software goes through extensive testing within the company, but the final ‘lab rats’ are the users. If too many events occur that harm the user experience, Netflix needs to pull the new software as soon as possible and replace it with an older version. To find out what is ‘too many’, Netflix compares the performance of the new website software to that of an old version, each running on a random sample of users. This is called a canary test. The users of the new software are the ‘canaries in the coal mine’ that signal when something is wrong. Netflix hopes that the number of events is the same, or even smaller, with the new software: a null hypothesis. If that is not the case, the canary test needs to reject that hypothesis as soon as possible, while millions of users watch Netflix and the events form an open-ended stream of incoming data. This setting is not that familiar to most statisticians, but it is familiar in the field of ‘online machine learning’ (and the related field of reinforcement learning).


Canary tests at Netflix are never based on sample size calculations. In essence, these online hypothesis tests are open-ended and have 100% power [1][2]. So there is a need to think about hypothesis testing in a way that differs from most statistics courses, and to study ‘anytime-valid’ methods [3]. These methods are gaining popularity in the tech industry (Booking.com, Adobe Experience Platform, Amazon), and might be the future of efficient large-scale randomized clinical trials (such as during the COVID-19 pandemic [4][5]) and of precision medicine (N-of-1 trials, in which more than one treatment is assigned in random order to a single patient with a chronic condition).


Description of the SRP Project/Problem

The focus of this project is on data simulation and visualization, preferably in R using the ggplot2 package. Because anytime-valid tests are very different from the usual statistics, we need very good simulations, plots and figures to grasp what happens and to explain how they work. We will start by programming and visualizing what goes on in a casino offering roulette. A casino also keeps a close eye on the stream of data from gamblers who are winning and losing money. These data need to signal whether anything is wrong with the roulette wheels. There are no sample size calculations, and yet very few casinos go bankrupt because of gamblers who exploit mechanical deviations in their wheels. So casinos perform online hypothesis testing, and you can show that their tests are anytime-valid. From there, we will move on to examples closer to healthcare and focus on risks and the estimation of odds ratios. Along the way, you might help develop and publish a new anytime-valid test that will be very useful in N-of-1 trials (and probably in the tech industry as well).
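To give a first impression of what such a simulation could look like, here is a minimal sketch in R with ggplot2. It models a bet on ‘red’ as a Bernoulli outcome and tracks a likelihood-ratio test martingale; the alternative p1 = 0.52, the stream length and all variable names are illustrative choices, not a prescribed part of the project.

```r
library(ggplot2)

set.seed(1)
n     <- 5000        # length of the data stream (number of bets on 'red')
p0    <- 18 / 37     # probability of 'red' under H0: a fair European wheel
p1    <- 0.52        # illustrative alternative: a slightly biased wheel
alpha <- 0.05        # anytime-valid significance level

# Likelihood-ratio test martingale for H0: p = p0 against the fixed alternative p1.
simulate_martingale <- function(p_true) {
  x      <- rbinom(n, size = 1, prob = p_true)   # 1 = red, 0 = not red
  log_lr <- x * log(p1 / p0) + (1 - x) * log((1 - p1) / (1 - p0))
  exp(cumsum(log_lr))                            # running product of likelihood ratios
}

dat <- rbind(
  data.frame(spin = 1:n, M = simulate_martingale(p0), wheel = "fair (H0 true)"),
  data.frame(spin = 1:n, M = simulate_martingale(p1), wheel = "biased (H0 false)")
)

ggplot(dat, aes(spin, M, colour = wheel)) +
  geom_line() +
  geom_hline(yintercept = 1 / alpha, linetype = "dashed") +  # reject H0 when crossed
  scale_y_log10() +
  labs(y = "test martingale (log scale)",
       title = "Anytime-valid monitoring of a roulette wheel")
```

Under H0 this process is a nonnegative martingale starting at 1, so by Ville’s inequality the probability that it ever exceeds 1/alpha is at most alpha, however long the casino keeps watching: exactly the anytime-valid property we want to visualize.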


Requirements

The requirements depend on the interest of the student.

The focus can be on implementation and on how to write tutorials that explain the essence of anytime-valid methods and when to choose a certain option. Anytime-valid methods can be constructed to optimize different things, e.g. power given an effect size of minimal clinical importance, or 100% power in general (two such constructions are contrasted in the sketch at the end of this section). Some constructions are easy to understand, some are less intuitive but easy to implement, and some are computationally very challenging.

The focus can also be on scientific progress and how to develop new anytime-valid methods.

The focus on implementation mainly requires good abstract thinking and programming and visualization skills in R (ggplot2). The focus on development also requires the ability to independently read applied statistical literature (such as [4] and [5] below), and an interest in collaboratively studying more mathematical machine learning and statistics literature ([1], [2] and [3] below).
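As a small illustration of the trade-off mentioned above, here is a minimal sketch in R for a stream of 0/1 data, testing H0: p = 0.5. The fixed alternative p1 = 0.6, the true success probability 0.55 and the Beta(1,1) mixture are illustrative choices only.

```r
library(ggplot2)

set.seed(2)
n <- 2000
x <- rbinom(n, size = 1, prob = 0.55)  # true success probability 0.55, unknown in practice
s <- cumsum(x)                         # running number of successes
k <- seq_len(n)                        # running number of observations

# (a) Fixed alternative p1 = 0.6 (a 'minimal clinically important' effect size):
#     cumulative log likelihood ratio against H0: p = 0.5, a test martingale under H0.
p1          <- 0.6
log_e_fixed <- s * log(p1 / 0.5) + (k - s) * log((1 - p1) / 0.5)

# (b) Beta(1,1) mixture over all alternatives: grows for any true p != 0.5, so the test
#     eventually rejects ('100% power in general'), at the price of growing more slowly
#     than an e-process tuned to the true effect size.
log_e_mix <- lbeta(s + 1, k - s + 1) - lbeta(1, 1) - k * log(0.5)

dat <- data.frame(
  k      = rep(k, 2),
  e      = exp(c(log_e_fixed, log_e_mix)),
  method = rep(c("fixed alternative p1 = 0.6", "Beta(1,1) mixture"), each = n)
)

ggplot(dat, aes(k, e, colour = method)) +
  geom_line() +
  geom_hline(yintercept = 1 / 0.05, linetype = "dashed") +  # anytime-valid threshold
  scale_y_log10() +
  labs(x = "observations", y = "e-process (log scale)")
```

With this seed the true probability (0.55) lies below the assumed effect size (0.6), so only the mixture crosses the threshold; if the true probability were exactly 0.6, the fixed-alternative process would have the best possible growth rate.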


Research questions

What are possible criteria for evaluating a construction method for anytime-valid statistics on a stream of independent 0/1 data? How can we explain these criteria? Which criterion is the most important for an N-of-1 trial with 0/1 outcomes?

(For a focus on development)

Does the importance of the criteria change if instead of one patient (N-of-1) we evaluate more patients (cross-over trial/aggregate N-of-1 trial)? Can we propose a new anytime-valid method that is optimal for the analysis of cross-over trials/aggregate N-of-1 trials?


Expected results

Logistics

It is possible for more than one student to work on this project, and to sign up as a duo. Weekly supervision meetings, or more if necessary. November 2023 – June 2024. Working from home or at the Amsterdam UMC EDS department, depending on the student’s preference. Competitive internship allowance.


Future

This project might prepare you for a PhD on this topic starting September 2024 at Amsterdam UMC, or for a more mathematical PhD project elsewhere*. It might also give you an advantage if you are looking for a data science job in the tech industry.

* Possible through contacts working on the topic in the United States at UC Berkeley, Stanford University (both close to Silicon Valley) and Carnegie Mellon University, in Canada at the University of Waterloo, in Europe at the University of Bern and the University of London, and in the Netherlands at the UvA, VU and Leiden University.


Time period

· November – June


Contact

dr. Judith ter Schure, Amsterdam UMC, location AMC, Dept. Epidemiology and Data Science (EDS)

j.a.terschure@amsterdamumc.nl


References

[1] Lindon, M., & Malek, A. (2022). Anytime-valid inference for multinomial count data. Advances in Neural Information Processing Systems, 35, 2817-2831.

[2] ter Schure, J., Pérez-Ortiz, M. F., Ly, A., & Grünwald, P. (2020). The anytime-valid logrank test: error control under continuous monitoring with unlimited horizon. arXiv preprint arXiv:2011.06931.

[3] Ramdas, A., Grünwald, P., Vovk, V., & Shafer, G. (2022). Game-theoretic statistics and safe anytime-valid inference. arXiv preprint arXiv:2210.01948.

[4] ter Schure, J., & Grünwald, P. (2022). ALL-IN meta-analysis: breathing life into living systematic reviews. F1000Research, 11.

[5] ter Schure, J., Ly, A., Belin, L., Benn, C. S., Bonten, M. J., Cirillo, J. D., ... & van Werkhoven, H. (2022). Bacillus Calmette-Guérin vaccine to reduce COVID-19 infections and hospitalisations in healthcare workers: a living systematic review and prospective ALL-IN meta-analysis of individual participant data from randomised controlled trials. medRxiv, 2022-12.