Machine learning models for predicting the development of HF and CKD in early-stage type 2 diabetes patients

Ethics statement

This study was conducted in accordance with ethical principles consistent with the Declaration of Helsinki, the Good Clinical Practice guidelines of the International Council for Harmonisation, Good Pharmacoepidemiology Practices, and applicable legislation on non-interventional and/or observational studies. We used two anonymized, publicly available commercial databases obtained from Japan Medical Data Vision Co., Ltd. and Real World Data Co., Ltd. Institutional review board approval was not required because the study involved only secondary analysis of de-identified, anonymized data. Under the Japanese ethical guidelines for medical and health research involving human subjects, ethical approval and informed consent requirements do not apply to the use of anonymized secondary data.

Source of data and study population

Data were collated from the Japan Medical Data Vision (MDV) database from April 2008 to September 2018 (Supplementary Table S1). The database contains administrative claims and laboratory data from 376 Japanese diagnostic procedure combination (DPC) hospitals, representing 21.7% of the 1,730 DPC hospitals in Japan and covering approximately 20 million patients32. In this database, diagnoses are classified using International Classification of Diseases, 10th Revision (ICD-10) codes; disease names are coded using Japan-specific disease codes; and procedures, prescriptions, and drug administration are coded using Japan-specific receipt codes32.

Patients aged ≥ 18 years with a diagnosis of T2DM receiving antidiabetic treatment and with a run-in interval of 18 months before the index date were included. Patients with a diagnosis of type 1 diabetes mellitus at any time in the database, with a diagnosis of gestational diabetes at any time in the database, or with a medical history of CVD or CKD prior to the index date were excluded. Supplementary Table S2 lists the anatomical therapeutic chemical (ATC), ICD-10, and procedural codes used for inclusion and exclusion criteria.

The index date was defined as the date of the first oral T2DM medication prescription after the T2DM diagnosis, and had to fall more than 18 months after the start of the observation period (look-back period). The look-back period was set to a minimum of 18 months before the index date to ensure a pre-index period long enough to properly collect information on patients' medical history and to avoid information bias due to seasonal fluctuations.

Outcomes and variables

Risk prediction models were created for the following clinical outcomes. The primary outcomes were (1) diagnosis of CKD/HF in an inpatient or outpatient setting, and (2) hospitalization for CKD/HF or for uncertain reasons such that the maximal use of healthcare resources on admission was related to CKD/HF. Secondary outcomes were (1) HF diagnosis (inpatient or outpatient), (2) CKD diagnosis (inpatient or outpatient), and (3) hospitalization for HF or for uncertain reasons such that the maximal use of healthcare resources on admission was related to HF. Finally, exploratory outcomes were (1) composite major adverse cardiovascular events (MACE): diagnosis of myocardial infarction (MI) or stroke, or in-hospital death related to MI or stroke; (2) composite major adverse renal and cardiovascular events (MARCE): diagnosis of MI or stroke, hospitalization due to HF, renal outcomes (dialysis and kidney transplant), or in-hospital death related to MI, stroke, or HF; and (3) all in-hospital deaths. Supplementary Table S3 lists the ICD-10 and procedure codes used for the outcomes.

Variables included patient demographics (age, gender, BMI, frequency of outpatient visits, and frequency of hospitalizations), ICD-10 diagnostic codes, ATC drug codes, and laboratory values derived from the MDV database. Laboratory values were categorized: patients with measurements were classified as normal, subnormal, or above normal based on the Common Criteria for Major Laboratory Parameters in Japan, and patients without measurements were classified as having no measurement.
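As an illustration of this categorization step, the sketch below maps a continuous laboratory value to the subnormal/normal/above-normal classes and labels patients without a measurement separately. The reference-range bounds and the HbA1c example are hypothetical, not the study's actual criteria.

```python
import numpy as np
import pandas as pd

def categorize_lab(values, low, high):
    """Map a continuous lab value to a category; NaN means no measurement.

    `low`/`high` are illustrative reference-range bounds, not the actual
    Japanese common criteria used in the study.
    """
    cats = pd.cut(values, bins=[-np.inf, low, high, np.inf],
                  labels=["subnormal", "normal", "above normal"])
    # patients without a measurement get their own category
    return cats.astype(object).where(values.notna(), "no measurement")

# hypothetical HbA1c values (%) with one missing measurement
hba1c = pd.Series([5.2, 6.8, np.nan, 4.1])
print(categorize_lab(hba1c, low=4.6, high=6.2).tolist())
# → ['normal', 'above normal', 'no measurement', 'subnormal']
```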

Model building

The model architecture was developed in two distinct phases (Supplementary Fig. S1). The first phase assessed the feasibility of algorithm development and evaluated the candidate variables. The second phase comprised the development and tuning of the full prediction model to finalize and validate it. In both phases, 80% of the analysis data set was used for model building and 20% for internal validation.

Phase I: preliminary model

Data preprocessing included inputting explanatory variables, processing laboratory data, and handling missing data. Because laboratory data were treated as continuous variables, outlier detection was not performed (step 1). For the modeling, 32 models were built and evaluated, corresponding to eight outcomes and four time points (1, 2, 3, and 5 years after the index date). Preliminary model building differed from full prediction model building in several ways: a population of 10,000 individuals was randomly selected with a 1:1 positive/negative ratio; laboratory values were not categorized; missing values were imputed using mean values; the preliminary models were built using random forest and logistic regression methods; and model performance was assessed using the area under the receiver operating characteristic curve (AUROC), precision, accuracy, and recall (step 2).
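A minimal sketch of this preliminary step, with synthetic data standing in for the claims-derived variables (sample size, features, and the 80/20 split are illustrative, not the study's actual data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# synthetic stand-in for a 1:1 positive/negative preliminary sample
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.5, 0.5], random_state=0)
# 80% for model building, 20% for internal validation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for model in (RandomForestClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]   # scores for AUROC
    pred = model.predict(X_te)               # labels for the other metrics
    results[type(model).__name__] = {
        "AUROC": roc_auc_score(y_te, prob),
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred),
        "recall": recall_score(y_te, pred),
    }
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```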

Phase II: full prediction model

Two machine learning techniques (gradient boosting [XGB] and deep learning [multilayer perceptron]) were used for model building in phase II, with traditional statistical models (logistic regression and Cox proportional hazards) as comparators. All positive patients were selected, and negative patients were randomly sampled to outnumber the positive patients, yielding a 1:2 positive:negative ratio for model construction. For the primary outcomes, the number of explanatory variables used in the analysis was set to 60, with the coefficient of determination adjusted for degrees of freedom (adjusted R2) as the measure of model fit.
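The 1:2 sampling scheme described above can be sketched as follows; the labels and ratio parameter here are illustrative, not the study's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_negatives(y, ratio=2, rng=rng):
    """Keep all positives; randomly draw `ratio` negatives per positive."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), ratio * len(pos))
    neg_keep = rng.choice(neg, size=n_neg, replace=False)
    return np.sort(np.concatenate([pos, neg_keep]))

# toy cohort: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
idx = undersample_negatives(y)
print(len(idx), y[idx].sum())  # 15 rows kept, 5 positive → 1:2 ratio
```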

Explanatory variables were first screened by univariate regression analysis using a significance threshold of 0.05 for each outcome event. After this screening, 60 variables were extracted using the random forest method with Gini importance, and the variables were ranked by importance.
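This two-stage selection can be sketched on synthetic data. A univariate F-test stands in here for the per-variable univariate regressions, and the data, thresholds, and variable counts are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

# synthetic data: 100 candidate variables, few truly informative
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=8, random_state=0)

# stage 1: univariate screen at p < 0.05
_, pvals = f_classif(X, y)
kept = np.flatnonzero(pvals < 0.05)

# stage 2: rank the survivors by random-forest Gini importance
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, kept], y)
order = np.argsort(rf.feature_importances_)[::-1]  # most important first
top = kept[order][:60]                             # the study kept 60 variables

print(len(kept), top[:5])
```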

After building the XGB and neural network models, hyperparameters were determined using a random search method to increase the accuracy of model-based prediction33. Supplementary Tables S4 and S5 show the hyperparameter ranges used for model building. To refine the model, 16 categorized laboratory variables (Supplementary Fig. S2) were included in addition to the 60 selected variables; these 16 additional variables were selected using a factor analysis method to determine the number of factors (Supplementary Fig. S3). The models were validated by evaluating performance using AUROC, accuracy, precision, recall, and specificity.
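Random search over a hyperparameter space can be sketched as below. The search ranges here are invented for illustration (the actual ranges are in Supplementary Tables S4 and S5), and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# illustrative search space, not the study's actual ranges
params = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=params,
    n_iter=5,            # small for illustration; real searches use many more draws
    scoring="roc_auc",   # AUROC, as in the study's evaluation
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```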

All model development procedures were implemented using Python 3.9.5. Additionally, a Shapley additive explanations (SHAP) analysis was performed for XGB to identify whether the variables with the highest importance contributed positively or negatively to the occurrence of the event34 (step 4).

External validation

XGB, which had the best predictive performance for all outcomes, was subjected to external validation using a dataset obtained from Real World Data Co., Ltd. (RWD; Kyoto, Japan). This database contains electronic medical records and claims data for approximately 20 million patients from over 160 medical institutions across Japan, as of 2020. It includes information on patient characteristics, diagnoses, prescriptions, procedures, and laboratory data for inpatient and outpatient care. These data are systematically collected within each medical institution and anonymized using per-patient identifiers. We used only the DPC data in the RWD database to keep the analysis consistent with the internal validation.

In this analysis, model accuracy was assessed based on AUROC, precision, recall, and specificity for each outcome. Additionally, for the Kaplan-Meier analysis, patients were divided into high- and low-risk groups based on the best cut-off value determined from the receiver operating characteristic (ROC) curve, obtained as the point on the curve closest to the upper left corner of the unit square (sensitivity = 1, 1 − specificity = 0). This point was taken as the optimal cut-off (threshold) to distinguish the two groups in the survival analysis. The log-rank test was used to compare the two curves.
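The closest-to-corner cut-off selection can be sketched as follows; the toy labels and risk scores are illustrative, not study data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def closest_topleft_cutoff(y_true, y_score):
    """Threshold whose ROC point is nearest the top-left corner (FPR = 0, TPR = 1)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    dist = np.hypot(fpr, 1 - tpr)   # Euclidean distance to (0, 1)
    return thresholds[np.argmin(dist)]

# toy risk scores: positives tend to score higher
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])
cut = closest_topleft_cutoff(y, s)
print(cut)
```

Patients with a predicted risk at or above the returned threshold would form the high-risk group for the survival analysis.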

These external validation analyses were performed independently of model development to ensure the reliability of the results.