Leonie Fohler
A comparison between Bayesian optimization frameworks and reinforcement learning for dose optimization
Leonie Fohler presents her master's thesis, which she recently defended as part of the applied mathematics degree program at Koblenz University of Applied Sciences. Her thesis was supervised by Prof. Dr. Holger Fröhlich and Prof. Dr. Michael Kinder.
Background and Motivation
Dose optimization is an important objective during drug development, in order to ensure that patients are treated with dosages that maximize clinical efficacy while minimizing adverse effects [1,2]. The effect of a drug depends strongly on the patient's characteristics, underscoring the need for individualized predictions of optimal doses. The variability between patients (inter-individual variability, IIV), including differences in drug response, is of crucial importance in the treatment of diseases, especially those with high mortality rates and heterogeneous manifestations [1]. An example is cancer, one of the leading causes of death worldwide, accounting for nearly 10 million deaths in 2018 [3]. Even within a single cancer type, tremendous variation can be observed between patients, within a patient, and even within a given tumor [4]. This makes the search for an optimal medication and dosage extremely challenging, as a failed or toxic treatment can be potentially life-threatening for the patient. In recent years, methods based on artificial intelligence have increasingly been developed which, combined with pharmacological expertise, facilitate drug development.
The publication of Valderrama et al. [5] introduced the multimodal pharmacokinetic SciML model (MMPK-SciML), a baseline for building optimization frameworks that include precise estimation of IIV while adhering to pharmacokinetic dynamics. The aim of this work is therefore to predict the optimal dose for each patient, given their specific characteristics, by integrating the MMPK-SciML model into two common optimization approaches: a Bayesian optimization (BO) framework and a reinforcement learning (RL) framework. Finally, both approaches are compared with regard to their success in patient-specific dose optimization.
For the most part, drug optimization models are trained on synthetic data simulated for the purpose of the specific model. Therefore, the objective here is not only to introduce models trained on synthetic data, but also to train them on a real dataset. The real dataset contains data from patients treated with 5-Fluorouracil (5FU), an anti-cancer medication. In addition, a simulated dataset was generated to exhibit a stronger population-wide dose-concentration relationship.
Methods
Datasets
The real 5FU dataset consists of 505 measurements from 120 patients, while the simulated 5FU dataset comprises 1128 measurements from 200 patients [5]. Only patients with between two and nine measurements from different treatment cycles are included. Per protocol, the drug is administered via intravenous infusion over 24 hours in a seven-day cycle, and its dose can be adjusted continuously in the range of 1000 to 5000 mg/m², depending on the body surface area (BSA) of the patient. Plasma concentrations are measured at steady state after 18 hours of infusion. A dose of 5FU is considered optimal if the corresponding area under the plasma concentration-time curve (AUC) lies in the range of [20,30) mg*h/L.
Pharmacometrics
The pharmacokinetic dynamics of an intravenous infusion can be described by the following ordinary differential equation, with $D$ being the dose, $T_{\mathrm{inf}}$ the infusion time, $CL$ the clearance, and $V$ the volume of distribution:

$$\frac{dC_1(t)}{dt} = \begin{cases} \dfrac{D}{T_{\mathrm{inf}}} - \dfrac{CL}{V}\, C_1(t), & 0 \le t \le T_{\mathrm{inf}}, \\[4pt] -\dfrac{CL}{V}\, C_1(t), & t > T_{\mathrm{inf}}, \end{cases}$$

with initial condition

$$C_1(0) = 0.$$

$C_1$ describes the amount of drug; therefore, the ODE solution needs to be divided by $V$ to obtain the concentration and, by integration, the AUC:

$$C(t) = \frac{C_1(t)}{V}, \qquad \mathrm{AUC} = \int_0^{\infty} C(t)\, dt.$$
IIV is introduced into the clearance by using estimates provided by a pretrained MMPK-SciML model: either patient-specific values from a patient-specific conditional posterior distribution based on real concentrations (IPRED PSPD) or on Gaussian-process-proposed concentrations (IPRED PSPDGP), or samples from the population's unconditional predictive posterior distribution (IPRED PD). If no IIV is included, typical population values of the clearance are used (PRED).
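To make these dynamics concrete, the following is a minimal sketch of how the infusion model and the resulting AUC could be evaluated numerically for a given clearance estimate; the function name, the numerical settings, and the example parameter values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np
from scipy.integrate import solve_ivp, trapezoid

def simulate_auc(dose_mg, t_inf_h, cl_l_per_h, v_l, t_end_h=168.0):
    """Simulate a one-compartment IV infusion and return the concentration AUC.

    dose_mg    : administered dose D
    t_inf_h    : infusion duration T_inf (24 h per protocol)
    cl_l_per_h : clearance CL (may include an individual deviation, e.g. from MMPK-SciML)
    v_l        : volume of distribution V
    """
    def rhs(t, y):
        # y[0] is the amount of drug C1 in the central compartment
        infusion_rate = dose_mg / t_inf_h if t <= t_inf_h else 0.0
        return [infusion_rate - (cl_l_per_h / v_l) * y[0]]

    sol = solve_ivp(rhs, (0.0, t_end_h), [0.0], max_step=0.1, dense_output=True)
    t = np.linspace(0.0, t_end_h, 2000)
    conc = sol.sol(t)[0] / v_l      # divide the amount by V to obtain the concentration
    return trapezoid(conc, t)       # AUC in mg*h/L

# Example: 3000 mg over a 24 h infusion with CL = 120 L/h and V = 40 L (illustrative values)
print(simulate_auc(3000.0, 24.0, 120.0, 40.0))
```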
Approach 1: Bayesian Optimization
BO is a probabilistic model-based optimization technique that uses a surrogate model to find the minimum of an objective function by iteratively selecting points to evaluate based on an acquisition function [6].
The BO framework used in this work applies as its objective a score function that attains its minimum within the AUC target range of [20,30) mg*h/L and, outside of it, equals the Euclidean distance of the AUC to this range (Figure 1A). To model the dose-concentration dynamics used to evaluate the objective function, two different surrogate models are applied: a Gaussian process (GP) model [6] on the one hand and a Tree-structured Parzen Estimator (TPE) model [7] on the other.
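For illustration, such a score function can be written as the distance of a predicted AUC to the target interval; the bounds follow the [20,30) mg*h/L range stated above, while the function and variable names are assumptions made for this sketch (in one dimension the Euclidean distance reduces to an absolute distance).

```python
AUC_LOW, AUC_HIGH = 20.0, 30.0  # target AUC range in mg*h/L

def bo_score(auc):
    """Objective minimized by BO: 0 inside the target range,
    distance of the AUC to the range outside of it."""
    if auc < AUC_LOW:
        return AUC_LOW - auc
    if auc >= AUC_HIGH:
        return auc - AUC_HIGH
    return 0.0
```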
BO is performed for each measurement of each patient individually: the MMPK-SciML model is pretrained on the training set and then applied to each measurement of the test set to find patient-specific optimized doses.
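A minimal sketch of this per-measurement loop is shown below, using Optuna's TPE sampler as one possible realization of the TPE surrogate (a GP surrogate would be set up analogously with a GP-based optimizer); `simulate_auc` and `bo_score` are the illustrative helpers from the sketches above, and the clearance and volume arguments stand in for the patient-specific MMPK-SciML estimates.

```python
import optuna

def optimize_dose_for_measurement(cl_i, v_i, n_trials=50):
    """Search the protocol dose range for the dose whose predicted AUC
    falls into the target range, for one patient measurement."""
    def objective(trial):
        dose = trial.suggest_float("dose_mg", 1000.0, 5000.0)  # protocol dose range
        auc = simulate_auc(dose, 24.0, cl_i, v_i)               # PK model with patient-specific CL
        return bo_score(auc)                                    # 0 inside [20, 30) mg*h/L

    study = optuna.create_study(direction="minimize",
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=n_trials)
    return study.best_params["dose_mg"]
```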
Approach 2: Reinforcement Learning
RL is a trial-and-error process in which a neural network agent takes actions (doses), transitions to the corresponding states (concentrations), and receives feedback in the form of rewards and penalties. The goal is to maximize the cumulative reward over time. In this RL framework, the reward function reflects the optimality of a dose based on the reached AUC: it attains its maximum within the target range and, outside of it, depends on the Euclidean distance of the AUC to this range, making it a modification of the score function used for BO (Figure 1B).
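As a sketch, such a reward can be obtained by mirroring the BO score so that AUC values inside the target range receive the highest reward; the bonus term below is an assumption for illustration, not the exact shape used in the thesis.

```python
def rl_reward(auc, in_range_bonus=1.0):
    """Reward for the agent: highest inside the target AUC range,
    negative distance to the range outside of it (mirroring bo_score)."""
    dist = bo_score(auc)  # bo_score from the BO sketch above
    return in_range_bonus if dist == 0.0 else -dist
```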
The RL algorithm used in this framework is the Twin Delayed Deep Deterministic policy gradient algorithm (TD3), which utilizes an actor-critic architecture, twin Q-networks, target networks, and delayed policy updates to provide stable estimates [8,9].
Dose optimization is performed by training a joint model for all patients and their measurements on the training data and validating it on unseen test data, using the model weights that achieved the highest cumulative reward during training.
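A minimal sketch of such a joint setup could look as follows, with the dosing problem wrapped as a Gymnasium environment and trained with the TD3 implementation from Stable-Baselines3; the environment structure, the one-step episodes, the patient representation, and all parameter values are illustrative assumptions rather than the thesis's exact configuration.

```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import TD3

class DosingEnv(gym.Env):
    """One step = propose a dose for one sampled patient; the observation holds
    patient-specific PK parameters (e.g. MMPK-SciML clearance estimates)."""

    def __init__(self, patients):
        super().__init__()
        self.patients = patients  # list of (CL, V) tuples, illustrative representation
        self.action_space = gym.spaces.Box(low=1000.0, high=5000.0, shape=(1,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.cl, self.v = self.patients[self.np_random.integers(len(self.patients))]
        return np.array([self.cl, self.v], dtype=np.float32), {}

    def step(self, action):
        dose = float(action[0])
        auc = simulate_auc(dose, 24.0, self.cl, self.v)  # PK sketch from above
        reward = rl_reward(auc)                          # reward sketch from above
        obs = np.array([self.cl, self.v], dtype=np.float32)
        return obs, reward, True, False, {}              # one-step episode; the wrapper resets afterwards

env = DosingEnv(patients=[(120.0, 40.0), (90.0, 35.0)])  # illustrative (CL, V) pairs
model = TD3("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)
```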
Results and Conclusion
The Bayesian optimization approach achieved strong performance on the individual level, but its inability to predict optimal doses for unseen patients hinders a potential application in clinical practice. The results show that the PD approach performs better than PSPD or PSPDGP and that the TPE surrogate model outperforms the GP. Figure 2 depicts the results for an exemplary patient. The reinforcement learning framework enables optimal dose predictions on a larger scale with a joint model, but it still lacks the consistent performance during training and validation that is necessary for reliable dose predictions for unseen patients. This approach benefited from the patient-specific introduction of IIV with PSPDGP. The RL results for an exemplary patient are shown in Figure 3.

Both approaches exhibited considerable performance discrepancies between cross-validation folds, suggesting instability due to a dependency on the distribution of the training and test sets. Overall, neither the real nor the simulated dataset showed consistently better performance. Although the simulated dataset was designed with a stronger population-wide dose-concentration relationship than the real dataset, this did not translate into better results, which could indicate either that this relationship was not simulated strongly enough to be effective, or that a patient-wise dependency could suffice. It was also observable that patients in both datasets were underdosed, as most optimized doses were increased, often doubled, compared with the administered doses. This could be a result of doctors tending to initially administer lower doses to minimize harmful side effects and then carefully approaching the optimal dose. Furthermore, it was demonstrated that the inclusion of IIV greatly improves the ability of the models to predict individualized optimal doses.
Combining machine learning methods with pharmacological knowledge is showing promising progress towards a joint model that personalizes dose optimization to ensure the best treatment for patients.
Citations
[1] A. Papachristos, J. Patel, M. Vasileiou, and G. P. Patrinos, “Dose optimization in oncology drug development: The emerging role of pharmacogenomics, pharmacokinetics, and pharmacodynamics,” Cancers, vol. 15, no. 12, p. 3233, Jun. 2023, ISSN: 2072-6694. https://doi.org/10.3390/cancers15123233
[2] P. Chotsiri, P. Yodsawat, R. M. Hoglund, J. A. Simpson, and J. Tarning, “Pharmacometric and statistical considerations for dose optimization,” CPT: Pharmacometrics & Systems Pharmacology, vol. 14, no. 2, pp. 279–291, Nov. 2024, ISSN: 2163-8306. https://doi.org/10.1002/psp4.13271
[3] World Health Organization, WHO report on cancer: setting priorities, investing wisely and providing care for all. Geneva, Switzerland: World Health Organization, Apr. 2020.
[4] A. Nguyen, M. Yoshida, H. Goodarzi, and S. F. Tavazoie, “Highly variable cancer subpopulations that exhibit enhanced transcriptome variability and metastatic fitness,” Nature Communications, vol. 7, no. 1, May 2016, ISSN: 2041-1723. https://doi.org/10.1038/ncomms11246
[5] D. Valderrama, O. Teplytska, L. M. Koltermann, E. Trunz, E. Schmulenson, A. Fritsch, U. Jaehde, and H. Fröhlich, “Comparing scientific machine learning with population pharmacokinetic and classical machine learning approaches for prediction of drug concentrations,” CPT: Pharmacometrics & Systems Pharmacology, Feb. 2025, ISSN: 2163-8306. https://doi.org/10.1002/psp4.13313
[6] P. I. Frazier, A tutorial on Bayesian optimization, 2018. https://doi.org/10.48550/arXiv.1807.02811
[7] S. Watanabe, Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance, 2023. https://doi.org/10.48550/arXiv.2304.11127
[8] H. Mashayekhi, M. Nazari, F. Jafarinejad, and N. Meskin, “Deep reinforcement learning-based control of chemo-drug dose in cancer treatment,” Computer Methods and Programs in Biomedicine, vol. 243, p. 107884, Jan. 2024. https://doi.org/10.1016/j.cmpb.2023.107884
[9] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, PMLR, Jul. 2018, pp. 1587–1596. https://proceedings.mlr.press/v80/fujimoto18a.html