Original Article

Enhancing diabetes follow-up period prediction through classification algorithms with feature selection techniques

Paisit Khanarsa1, Sutikiat Suwanmanee2, Supichaya Chumpong3, Kittisak Chumpong2,4,5

1Institute of Field Robotics, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand; 2Division of Computational Science, Faculty of Science, Prince of Songkla University, Songkhla, Thailand; 3Department of Medicine, Pak Phanang Hospital, Pak Phanang, Thailand; 4Research Center in Mathematics and Statistics with Applications, Prince of Songkla University, Songkhla, Thailand; 5Financial Mathematics, Data Science and Computational Innovations Research Unit (FDC), Department of Mathematics, Faculty of Science, Kasetsart University, Bangkok, Thailand

Contributions: (I) Conception and design: S Chumpong, K Chumpong; (II) Administrative support: K Chumpong; (III) Provision of study materials or patients: S Suwanmanee, S Chumpong; (IV) Collection and assembly of data: P Khanarsa, S Suwanmanee, S Chumpong; (V) Data analysis and interpretation: P Khanarsa, S Suwanmanee, K Chumpong; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Kittisak Chumpong, PhD. Division of Computational Science, Faculty of Science, Prince of Songkla University, 15 Kanjanavanich Road, Hat Yai, Songkhla 90110, Thailand. Email: kittisak.ch@psu.ac.th.

Background: Despite the critical role of follow-up care in managing type 2 diabetes, limited research has focused on predicting follow-up periods using machine learning. Addressing this gap can improve patient management and optimize clinical resource allocation. Our objective is to develop and validate a machine learning-based model for predicting follow-up periods in patients with type 2 diabetes, using feature selection techniques to enhance predictive performance.

Methods: From 16,094 patient records retrieved from Pak Phanang Hospital, Thailand, 2,042 eligible follow-up records were retained after exclusion of patients aged below 35 years, missing or invalid values, and single-visit records. All included patients were diagnosed with type 2 diabetes in 2022. Follow-up periods were grouped into four categories: 1–4, 5–8, 9–12, and more than 12 weeks. Data preprocessing involved handling missing values, encoding categorical variables, scaling numerical features, and addressing class imbalance using Synthetic Minority Oversampling Technique (SMOTE). Three feature selection methods were applied: filter, wrapper, and embedded. Six classifiers (Support Vector Machine, Random Forest, K-Nearest Neighbors, Extra Trees Classifier, Adaptive Boosting and Artificial Neural Network) were evaluated using 5-fold cross-validation, with each fold consisting of 80% training and 20% testing data. Model performance was assessed using accuracy, precision, recall, and weighted F1-score.

Results: We analyzed 2,042 follow-up records from patients with type 2 diabetes diagnosed in 2022. The Extra Trees Classifier with Elastic Net feature selection achieved the highest performance, with a weighted F1-score of 90.69% [95% confidence interval (CI): 89.21–92.18%], precision of 89.54% (95% CI: 87.76–91.31%), and both weighted recall and accuracy of 91.97% (95% CI: 90.56–93.38%). Key predictors consistently identified included age, blood pressure, pulse, height, waist, fasting blood sugar, and creatinine. Elastic Net demonstrated strong feature selection performance, particularly with tree-based models.

Conclusions: Feature selection notably enhanced the predictive performance of machine learning models in classifying follow-up periods. The proposed model could assist clinicians in scheduling timely and personalized follow-up visits based on individual patient profiles, particularly key demographic and clinical features, thereby improving continuity of care, optimizing resource use, and supporting decision-making in real-world diabetes management.

Keywords: Diabetes; follow-up period; classification; feature selection


Received: 19 December 2024; Accepted: 14 May 2025; Published online: 14 July 2025.

doi: 10.21037/jphe-24-119


Highlight box

Key findings

• This study demonstrates that incorporating feature selection techniques, particularly Elastic Net, can significantly improve both the performance and stability of machine learning models used to classify diabetes follow-up periods. By systematically comparing multiple feature selection methods and classifiers, the study identifies that Elastic Net performs robustly across different algorithms. In addition, clinical features such as age, blood pressure, pulse, height, waist, fasting blood sugar, and creatinine were consistently selected as strong predictors of follow-up needs in real-world outpatient data.

What is known and what is new?

• Previous studies in diabetes care have focused primarily on predicting disease onset or complications, with limited attention given to the prediction of follow-up intervals. Moreover, few studies have compared how different feature selection techniques influence model performance in this context. This study is among the first to rigorously evaluate a combination of filter, wrapper, and embedded methods across six classifiers for follow-up prediction. The findings show that Elastic Net not only enhances predictive performance but also improves the interpretability of models by consistently selecting clinically relevant variables.

What is the implication, and what should change now?

• The results support the integration of machine learning models into clinical workflows to assist in scheduling follow-up visits based on patient-specific risk profiles. These tools could be deployed as user-friendly web, mobile, or desktop applications without requiring machine learning expertise. This broadens usability to general practitioners, endocrinologists, nurse practitioners, and potentially patients. The approach can help optimize resource use, improve care continuity, and reduce preventable complications through timely follow-up.


Introduction

Background

Diabetes mellitus is a long-term metabolic disorder that poses a serious global health concern, with type 2 diabetes being the most widespread form. It arises when the body either becomes resistant to insulin or does not produce adequate insulin, resulting in high blood sugar levels. Unlike type 1 diabetes, which is an autoimmune disease commonly diagnosed in younger individuals, type 2 diabetes typically develops in adults over 45 years old. However, lifestyle factors are contributing to an increasing incidence in younger populations. Obesity, lack of physical activity and a family history of diabetes are closely linked to the condition. Proper management of type 2 diabetes requires a combination of lifestyle changes, such as eating a balanced diet, exercising regularly and maintaining a healthy weight, alongside medical treatments. Regular follow-up and early clinical intervention are vital to prevent complications such as cardiovascular disease, nerve damage, and kidney failure (1-4).

In 2021, diabetes presented a major global health crisis, with 537 million adults aged 20–79 years—equivalent to 1 in 10 people—living with the condition. By 2030, this figure is expected to rise to 643 million, and by 2045, to 783 million. Over 75% of these cases were found in low- and middle-income countries. The disease contributed to 6.7 million deaths and healthcare costs totaling at least USD 966 billion, underscoring the urgent need for global initiatives to combat diabetes and mitigate its effects (5).

The follow-up period is essential for people with diabetes. Regular follow-ups ensure continuous monitoring and effective management of the disease, help adjust treatment plans according to patient progress, detect potential complications early, and reinforce necessary lifestyle changes. This process is crucial in improving patient outcomes and minimizing the risk of severe complications that stem from poorly managed diabetes (6).

In routine clinical practice, follow-up intervals are commonly scheduled based on fixed guidelines or physician judgment, rather than personalized risk assessment. This generalized approach can result in either overuse or underuse of healthcare resources and may delay timely interventions for high-risk patients.

Rationale and knowledge gap

While many studies have applied machine learning (ML) to predict diabetes diagnosis or complications (7-10), only a limited number have explored ML-based models for predicting follow-up duration (6). Yet, individualized follow-up scheduling is vital in type 2 diabetes care to ensure timely monitoring, support medication adherence, and reduce preventable complications.

Several clinical variables such as age, blood pressure, glycemic control, and comorbidity status have been associated with follow-up patterns, but few models have incorporated these variables into a predictive framework. Furthermore, most ML studies emphasize diagnostic prediction rather than downstream care planning. Thus, there is a significant gap in applying ML to inform follow-up decisions and optimize resource allocation in real-world diabetes care.

A recent study by Chapakiya et al. analyzed data from Pak Phanang Hospital, Thailand, comprising 2,042 records with 14 factors. The researchers applied six models to the dataset: Random Forest (RF), Extra Trees Classifier (ETC), Adaptive Boosting (AB), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Artificial Neural Network (ANN). Feature selection was performed using RF, while class imbalances were addressed through the Synthetic Minority Oversampling Technique (SMOTE). The SVM model, combined with SMOTE and feature selection by RF, achieved the highest weighted precision of 0.93, significantly improving the prediction accuracy for diabetes follow-up periods (6).

ML techniques are well-suited for supporting personalized follow-up scheduling by analyzing multidimensional clinical data and identifying patterns that may not be evident to clinicians (11,12). These models can enhance decision-making by helping providers prioritize patients based on individualized risk and care needs.

Objective

To address this gap, we developed and validated a ML-based model to predict follow-up periods for patients with type 2 diabetes. By comparing multiple classifiers and feature selection techniques, our goal is to identify the most effective approach for supporting personalized follow-up scheduling. The model is designed to assist clinicians in optimizing care timelines based on individual patient risk profiles. We present this article in accordance with the TRIPOD reporting checklist (available at https://jphe.amegroups.com/article/view/10.21037/jphe-24-119/rc).


Methods

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Ethical approval for the study was obtained from the PSU Human Research Ethics Committee, Prince of Songkla University (PSU-HREC-2024-068-1-1), which deemed the study to meet the criteria for Exempt Research Determination. Individual consent for this retrospective analysis was waived.

This paper aims to develop a ML framework for predicting follow-up periods for diabetes patients based on a comprehensive dataset. The methodology employed is illustrated in Figure 1 and encompasses several key stages: data preprocessing, feature selection, data splitting, model training, and performance evaluation. During the data preprocessing stage, missing values were addressed, categorical variables were one-hot encoded, and numerical features were normalized using the robust scaler technique. Following preprocessing, feature selection was conducted with a combination of methods that included Mutual Information (MI) for filtering, sequential feature selection (SFS) and a genetic algorithm (GA) as wrapper methods, and random forest feature importance (RFFI) and Elastic Net (EN) as embedded methods. SMOTE was applied to address class imbalance in the dataset. The dataset was split into an 80% training set and a 20% testing set, with 5-fold cross-validation applied to ensure robust model performance. The classifiers implemented during the model training stage included SVM, KNN, RF, ETC, AB, and ANN. The primary objective was to predict diabetes treatment follow-up periods, categorized into four distinct intervals: 1–4, 5–8, 9–12, and >12 weeks. This classification was based on standard appointment intervals used at the study hospital and clinical consultation with practicing endocrinologists. Model performances were evaluated using the classification metrics of accuracy, weighted precision, weighted recall, and weighted F1-score to identify the most effective predictive model. A detailed explanation of each step in the process is provided in the following subsections, while an overall view of the ML workflow is outlined in Figure S1 as pseudocode.

Figure 1 Machine learning process for predicting follow-up period category. SMOTE, Synthetic Minority Oversampling Technique.

Data collection

A total of 16,094 outpatient records from 2022 were retrieved from Pak Phanang Hospital, Nakhon Si Thammarat Province, Thailand. After excluding patients aged less than 35 years, records with missing or invalid data, and those with only one visit, 2,042 eligible records of patients with type 2 diabetes (ICD-10 codes E110–E119) remained for analysis. The selection process is illustrated in Figure 2.

Figure 2 Flowchart of patient selection for follow-up period prediction.

Predictor variables

The dataset consisted of 14 independent variables representing potential risk factors, along with one multiclass outcome variable related to the follow-up period. All predictor variables were measured during the patient’s outpatient visits in 2022 and were available to physicians prior to assigning the follow-up schedule. These clinical and laboratory indicators were routinely used by physicians to guide follow-up scheduling decisions. A detailed description of the features included in the dataset, along with their data types, is presented in Table 1. These features encompass demographic and clinical factors. These variables were selected as candidate predictors for model development based on their clinical relevance and previous studies demonstrating associations with diabetes progression and follow-up needs (13,14).

Table 1

Description of the diabetes dataset

Feature ID Feature name Data type Description
1 Age (age) Numerical The patient age (years)
2 Gender (sex) Nominal The biological sex assigned at birth (0: male, 1: female)
3,4 Blood pressure (bps, bpd) Numerical Blood pressure (systolic/diastolic) measured in mmHg
5 Body mass index (bmi) Numerical A standard measure for assessing body weight relative to height (kg/m2)
6 Heart rate (pulse) Numerical The heart rate measured in beats per minute (bpm)
7 Weight (w) Numerical The total body weight (kilograms)
8 Height (h) Numerical The standing body height (centimeters)
9 Waist circumference (waist) Numerical The waist circumference measured at the level of the navel (centimeters)
10 Smoking status (smoking_type) Nominal The patient smoking status:
0: when the patient does not smoke
1: when the patient either smokes or has quit smoking for less than a month
2: when the patient has quit smoking for greater than a month
3: when there is no smoking information available about the patient
11 Alcohol consumption (drinking_type) Nominal The patient alcohol consumption status:
0: when the patient does not drink alcohol
1: when the patient drinks alcohol
2: when the patient has quit drinking alcohol
3: when there is no drinking alcohol information available about the patient
12 Family history of diabetes (fh) Nominal Indicates a parental history of diabetes:
0: when the parents do not have diabetes
1: when one of the parents has diabetes
2: when both parents have diabetes
13 Fasting blood sugar (fbs) Numerical The blood sugar level measured after fasting for at least 8 hours (mg/dL)
14 Creatinine level (cr) Numerical The blood creatinine level, indicating kidney function, related to diabetes (mg/dL)
15 Follow-up period category (Target) Categorical Follow-up period categories:
Group 1: 1–4 weeks
Group 2: 5–8 weeks
Group 3: 9–12 weeks
Group 4: >12 weeks

Data preprocessing

Effective data preprocessing is essential to ensure the quality and usability of the dataset for subsequent analysis. The dataset used in this study was preprocessed using a series of steps to handle missing values, convert categorical variables into a format suitable for analysis, and scale the numerical data. First, the dataset was checked for missing or duplicated records, which were removed to maintain data integrity. Categorical features, such as gender, smoking status, alcohol consumption, and family history of diabetes, were transformed into dummy variables. This transformation ensured that qualitative data was represented numerically, facilitating its use in the ML models. Next, potential outliers were detected using the interquartile range (IQR) method. Outliers can significantly impact the performance of a model, so identifying them helps in deciding how to manage extreme values. In this case, no action was taken to remove outliers as part of the preprocessing step. Finally, the numerical features were scaled using the robust scaler transformation, which reduces the impact of outliers by scaling the data based on the IQR. This method was chosen over other scaling techniques to ensure that the presence of outliers did not disproportionately affect the scaling process. The final preprocessed dataset, with all features appropriately transformed and scaled, was saved for use in the modeling phase.
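The preprocessing steps above can be sketched as follows. This is an illustrative example on a small synthetic DataFrame, not the study data; the column names follow Table 1, and the IQR rule and robust scaling mirror the description in the text.

```python
# Illustrative preprocessing sketch (synthetic data; column names follow Table 1).
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({
    "age": [60, 72, 55, 68],
    "fbs": [110.0, 180.0, 95.0, 250.0],
    "smoking_type": [0, 1, 2, 0],
})

# Remove duplicated records to maintain data integrity.
df = df.drop_duplicates()

# Transform categorical features into dummy variables.
df = pd.get_dummies(df, columns=["smoking_type"])

# Detect (but do not remove) potential outliers using the IQR rule.
q1, q3 = df["fbs"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (df["fbs"] < q1 - 1.5 * iqr) | (df["fbs"] > q3 + 1.5 * iqr)

# Scale numerical features with the robust scaler (median/IQR based),
# so outliers do not disproportionately affect the scaling.
df[["age", "fbs"]] = RobustScaler().fit_transform(df[["age", "fbs"]])
```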

Feature selection

In this phase, feature selection techniques were applied to optimize model performances by identifying the most relevant attributes within the dataset. By focusing on key features, feature selection reduces the complexity of the model, minimizes training time, and improves generalization performance. The methods were chosen for their compatibility with both numerical and categorical data, making them versatile across the various feature types. This experiment employed three types of feature selection approaches: the filter method, wrapper method and embedded method.

As illustrated in Figure 3, the filter method used MI to evaluate the relevance of each feature independently of the learning algorithm. This method generated a reduced subset of features that were passed into the classifier, allowing for faster computation and a simpler model. The wrapper method explores different feature subsets in combination with a learning algorithm. SFS and GA were applied iteratively to evaluate the predictive power of each subset. While effective, this method can be computationally intensive due to its iterative nature. Lastly, the embedded method integrates feature selection within the training process itself, combining the strengths of both filter and wrapper methods. The EN and RFFI algorithms were used to select the most informative features while the model was being built. To generate optimal feature subsets, EN comprised logistic regression with combined Lasso (L1) and Ridge (L2) priors and RFFI employed a RF classifier. A complete overview of the feature selection framework, including the step-by-step procedures for each method, is presented in Figure S2, while detailed discussions of each method are provided in the following sections.

Figure 3 Framework of feature selection process.

Filter method

MI is a filter method based on a statistical measure that quantifies the amount of information obtained about one random variable through observing another. In the context of feature selection, MI is used to assess the dependency between features and the target variable (15). The higher the MI score, the more informative the feature is regarding the target variable. The formula for MI is

I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}

where p(x,y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal probability distributions of X and Y, respectively. In this study, MI was calculated using the mutual_info_classif function from the sklearn library. Ten features with an MI score greater than zero were selected. These selected features were consistent with those obtained by SFS and GA under the same settings.
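The MI filter step can be sketched as follows on synthetic data; the "score > 0" threshold follows the text, while the data and feature count are illustrative.

```python
# Minimal sketch of the MI filter method (synthetic data).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)   # only the first feature is informative

# Score each feature's dependency on the target, independent of any classifier.
scores = mutual_info_classif(X, y, random_state=0)

# Keep features whose MI score exceeds zero, as in the study.
selected = [i for i, s in enumerate(scores) if s > 0]
```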

Wrapper method

SFS

SFS is a greedy search method that selects features by sequentially adding (or removing) one feature at a time. The method evaluates the performance of a model trained on subsets of features and selects the subset that maximizes the evaluation metric (16). In this work, SFS was implemented using the mlxtend library, with a decision tree (DT) classifier as the base model. The number of selected features was set to 10, and the forward selection strategy was applied. The parameter settings for this method are presented in Table S1.
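The study implemented SFS with the mlxtend library; the sketch below uses scikit-learn's equivalent SequentialFeatureSelector with the same forward strategy and a decision tree base model. The synthetic data and the number of selected features (2 of 5, versus 10 of 14 in the study) are illustrative.

```python
# Forward sequential feature selection sketch (scikit-learn equivalent of
# the mlxtend implementation used in the study).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)   # two informative features

sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,    # illustrative; the study selected 10 features
    direction="forward",       # greedily add one feature at a time
    cv=5,
)
sfs.fit(X, y)
mask = sfs.get_support()       # boolean mask over the feature columns
```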

GA

GA is a search heuristic inspired by the process of natural selection. It evolves a population of candidate solutions over several generations, applying operations such as crossover and mutation. In feature selection, GA was used to explore different combinations of features, and the best subset was selected based on model performance (17). The GA was implemented using the DEAP library, and the fitness function was defined as the accuracy of a DT classifier trained on the selected subset of features. The parameter settings for this method are presented in Table S2.
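The study implemented the GA with the DEAP library (settings in Table S2); the following is a minimal pure-Python sketch of the same idea, with binary feature masks evolved by one-point crossover and bit-flip mutation, and decision tree cross-validation accuracy as the fitness function. Population size, generations, and rates here are illustrative, not the study's settings.

```python
# Minimal GA sketch for feature selection (illustrative settings,
# not the DEAP configuration from Table S2).
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 6))
y = ((X[:, 0] - X[:, 1]) > 0).astype(int)
random.seed(2)

def fitness(mask):
    """CV accuracy of a decision tree on the masked feature subset."""
    if not any(mask):
        return 0.0
    cols = [i for i, m in enumerate(mask) if m]
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, cols], y, cv=3).mean()

def mutate(mask, rate=0.1):
    """Flip each bit with a small probability."""
    return [1 - g if random.random() < rate else g for g in mask]

def crossover(a, b):
    """One-point crossover of two parent masks."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

# Evolve a small population for a few generations (truncation selection).
pop = [[random.randint(0, 1) for _ in range(X.shape[1])] for _ in range(10)]
for _ in range(5):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:4]
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(len(pop) - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)   # best-performing feature mask found
```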

Embedded method

EN

EN is a regularization technique that combines the properties of L1 and L2 regression. It encourages both sparsity and regularization in the model coefficients (18). The EN model solves the following optimization problem:

\hat{\beta} = \arg\min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \alpha \left( \lambda \|\beta\|_1 + \frac{1-\lambda}{2} \|\beta\|_2^2 \right) \right)

where \beta is the vector of model coefficients, \hat{\beta} is the estimated coefficient vector minimizing the loss function, n is the number of data samples, y_i is the true target value for the i-th sample, x_i is the feature vector for the i-th sample, \alpha is the regularization strength parameter, \lambda is a regularization parameter that balances the L1 and L2 components in the regularization term, \|\cdot\|_1 is the L1 norm, and \|\cdot\|_2 is the L2 norm.

In this study, EN was calculated using the EN function from the sklearn library. The model was optimized using grid search to find the best hyperparameters for regularization strength and L1 to L2 penalty ratio. Features with non-zero coefficients were selected. The parameter settings for this method are presented in Table S3.
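Since the text describes EN as logistic regression with combined L1 and L2 priors tuned by grid search, the step can be sketched as below. The parameter grid and synthetic data are illustrative, not the settings of Table S3.

```python
# Elastic-net feature selection via penalized logistic regression
# (illustrative grid; the study's settings are in Table S3).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0],        # regularization strength
                "l1_ratio": [0.2, 0.5, 0.8]},  # L1-to-L2 penalty balance
    cv=3,
)
grid.fit(X, y)

# Keep features whose coefficients survive the L1 shrinkage (non-zero).
coefs = grid.best_estimator_.coef_.ravel()
selected = np.flatnonzero(np.abs(coefs) > 1e-8)
```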

RFFI

RFFI is an ensemble learning method that operates by constructing multiple DTs during training. The importance of each feature is determined by how much it reduces impurity (e.g., Gini impurity or entropy) across all trees in the forest. Features with higher importance scores are considered more relevant (19). In this study, a RF model was trained on the RandomForestClassifier from the sklearn library with the number of trees in the forest equal to 100, and the top 10 features based on importance were selected. The parameter settings for this method are presented in Table S4.
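The RFFI step can be sketched as follows on synthetic data; the forest size (100 trees) follows the text, while the top-k cutoff here is 2 of 5 features rather than the study's 10 of 14.

```python
# Random-forest feature importance selection sketch (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # features 0 and 1 are informative

# Train the forest and rank features by mean impurity reduction.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:2]
```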

Training models

Classification is an essential task in predictive modeling that aims to assign input data to predefined categories based on patterns observed in the training data. The prediction of follow-up periods for patients with diabetes, the classification problem in this study, is a task that requires robust ML models due to the complexity and diversity of the dataset. Given the nature of the problem, we employed six distinct supervised learning algorithms: SVM, KNN, RF, ETC, AB, and ANN. Each of these models offers unique advantages in handling different data structures and relationships. By utilizing a range of algorithms, we aimed to determine which model performed optimally in predicting follow-up categories, thus providing insight into the most effective method for this healthcare-related classification task. The hyperparameters setting of each classifier is presented in Table S5. The following sections will discuss the specifics of each model and its application to the diabetes dataset.

SVM

SVM is a widely used supervised learning algorithm known for its effectiveness in both classification and regression tasks. The key principle behind SVM is the identification of an optimal hyperplane that maximally separates data points into different classes. For multi-class classification, strategies like “one-vs-one” or “one-vs-all” are typically employed. To classify the follow-up periods of diabetes patients (20), the SVM model was configured with a radial basis function kernel to capture non-linear relationships. The regularization parameter (C) was set to 1.0 to control the trade-off between margin maximization and classification error, while the gamma parameter was set to ‘scale’ to define the influence of each data point. SVM provided a balance between accuracy and generalization, making it effective for the classification of follow-up periods.

KNN

KNN is a non-parametric, instance-based learning classification algorithm. The key idea of KNN is to classify a data point based on the majority class of its k nearest neighbors. KNN was applied to predict follow-up periods by comparing patient features with the nearest neighbors (21). The value of k was set to 5, meaning the five nearest data points were considered to determine the class. The Minkowski distance with p=2, equivalent to the Euclidean distance, was used as the distance metric. The weights parameter was set to ‘uniform’, giving equal importance to all neighbors, and the algorithm was set to ‘auto’, allowing automatic selection of the optimal nearest neighbor search method. Although KNN is computationally expensive for large datasets, it performed reasonably well when applied to this task.

RF

RF is an ensemble learning method that constructs multiple DTs and aggregates their predictions to improve classification accuracy and reduce overfitting. The key concept of RF is to create each tree from a different bootstrap sample of the data, and at each node, a random subset of features is considered (22). To classify follow-up periods for diabetic patients, the RF model was configured with n_estimators =100 (100 trees), and Gini impurity (criterion = ‘gini’) was used as the splitting criterion. The max_depth parameter was set to None, allowing the trees to grow fully, and the min_samples_split was set to 2, requiring a minimum of two samples to split a node. The max_features was set to ‘sqrt’, meaning a square root subset of features was considered at each split. RF performed well, providing high accuracy and stability for the classification of follow-up periods.

ANN

An ANN is an ML model inspired by biological neural networks and consists of layers of interconnected neurons. The key principle of ANNs is their ability to model complex, non-linear relationships through multiple layers of abstraction (23). In this study, a feedforward ANN was implemented with one hidden layer containing 100 neurons [hidden_layer_sizes = (100,)]. The rectified linear unit (ReLU) activation function (activation = ‘relu’) was used to introduce non-linearity into the model, and the Adam optimizer (solver = ‘adam’) was employed for gradient-based optimization. The learning rate was initialized at 0.001 (learning_rate_init = 0.001), and the maximum number of iterations was set to 200 (max_iter = 200). The ANN performed well in capturing non-linear patterns within the dataset, though regularization techniques and early stopping were applied to mitigate overfitting, a particular risk in a complex healthcare dataset such as this one.

ETC

ETC is an extension of the RF model that introduces additional randomness by selecting split points randomly, rather than choosing the best split based on a criterion like Gini impurity. This added randomness helps reduce variance and improves generalization. To classify follow-up periods (24), the ETC model was configured with n_estimators =100, and Gini impurity (criterion = ‘gini’) was selected as the splitting criterion. Similar to RF, the max_depth was set to None, and min_samples_split was set to 2. The max_features parameter was also set to ‘sqrt’, allowing a random subset of features to be considered at each split. ETC performed well, with the additional randomness helping to reduce overfitting and improve generalization on unseen data.

AB

AB is an ensemble learning algorithm that combines multiple weak classifiers to form a strong classifier. The key concept of AB is to iteratively adjust the weights of misclassified samples, so subsequent classifiers focus more on these difficult cases (25). In this study, AB was applied to classify follow-up periods. The model was configured with 50 weak learners (n_estimators =50), and the learning rate was set to 1.0, controlling how much influence each weak learner had on the final model. The algorithm was set to ‘SAMME.R’, optimizing for multi-class classification. AB performed adequately, although, due to the complexity of the data, it was outperformed by the tree-based models RF and ETC. However, the ability of AB to reduce bias and variance proved useful in certain cases.
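The six classifier configurations described above can be collected as a sketch in scikit-learn; the hyperparameters shown follow the text, and everything not stated there is left at library defaults. The AdaBoost algorithm='SAMME.R' option is omitted here because it has been deprecated in recent scikit-learn releases.

```python
# The six classifiers, configured with the hyperparameters given in the text
# (remaining parameters are scikit-learn defaults).
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from sklearn.neural_network import MLPClassifier

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                metric="minkowski", p=2, algorithm="auto"),
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini",
                                 max_depth=None, min_samples_split=2,
                                 max_features="sqrt"),
    "ETC": ExtraTreesClassifier(n_estimators=100, criterion="gini",
                                max_depth=None, min_samples_split=2,
                                max_features="sqrt"),
    # algorithm="SAMME.R" (used in the study) omitted: deprecated upstream.
    "AB": AdaBoostClassifier(n_estimators=50, learning_rate=1.0),
    "ANN": MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                         solver="adam", learning_rate_init=0.001,
                         max_iter=200),
}
```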

Model evaluation

The evaluation of the models was essential for assessing their effectiveness in predicting follow-up periods for diabetes patients. The performance assessments were based on the metrics of accuracy, weighted precision, weighted recall (sensitivity) and weighted F1-score. Accuracy, defined as the ratio of the number of correct predictions to the total number of predictions made, was calculated using the formula from the confusion matrix:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total predictions}}

Weighted precision considers the proportion of true positive predictions relative to the total positive predictions across all classes, accounting for class imbalance. It was calculated as follows:

\text{Weighted precision} = \frac{\sum_{i=1}^{c} \text{Precision}_i \times \text{Support}_i}{\sum_{i=1}^{c} \text{Support}_i}

where \text{Precision}_i is the precision for class i, \text{Support}_i is the number of true instances for class i, and c is the total number of classes.

Weighted recall (Sensitivity) evaluates the ratio of true positive predictions to the actual number of positive instances across all classes, similarly weighted by the support for each class. It was calculated as follows:

\text{Weighted recall} = \frac{\sum_{i=1}^{c} \text{Recall}_i \times \text{Support}_i}{\sum_{i=1}^{c} \text{Support}_i}

where \text{Recall}_i is the recall for class i, \text{Support}_i is the number of true instances for class i, and c is the total number of classes.

The weighted F1-score is the harmonic mean of weighted precision and weighted recall, providing a single metric that balances the two, and was given by:

\text{Weighted F1-score} = \frac{2 \times \text{Weighted precision} \times \text{Weighted recall}}{\text{Weighted precision} + \text{Weighted recall}}

These metrics were computed for each classifier after applying K-fold cross-validation (K=5) (26), which ensured the robustness and generalizability of the models to unseen data. The results from these evaluations were analyzed to identify the most effective model for predicting the follow-up periods.
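The weighted metrics above correspond to scikit-learn's average="weighted" option, illustrated below on toy predictions. One caveat: f1_score(average="weighted") computes the support-weighted mean of per-class F1 scores, which can differ slightly from the harmonic mean of weighted precision and weighted recall; note also that support-weighted recall is identical to accuracy, consistent with the equal recall and accuracy values reported in the Results.

```python
# Weighted evaluation metrics on toy multi-class predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])

acc = accuracy_score(y_true, y_pred)
wp = precision_score(y_true, y_pred, average="weighted")
wr = recall_score(y_true, y_pred, average="weighted")   # equals accuracy
wf1 = f1_score(y_true, y_pred, average="weighted")
```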

Statistical analysis

Model performance was evaluated using accuracy, weighted precision, weighted recall, and weighted F1-score, which were derived from confusion matrices to account for both overall and class-specific performance under class imbalance. Five-fold cross-validation with an 80/20 train-test split was employed to ensure generalizability, and SMOTE was applied during training to address class imbalance. All analyses were conducted using Python 3.13 with the scikit-learn and imbalanced-learn libraries.


Results

This section provides a detailed analysis of the experimental results, focusing on the comparative performances of the ML models. It first presents the results of the feature selection methods, highlighting the most informative features selected for prediction in each model. A comparative evaluation of the classifiers then follows, analyzing model performances across the key metrics.

Prior to presenting the model-related results, the baseline characteristics of the patient cohort stratified by follow-up periods are summarized in Table 2. The majority of patients (n=1,923) belonged to the >12-week follow-up group, with a mean age of 65.23±11.88 years. Female patients were predominant in all groups. Clinical parameters, including blood pressure, body mass index (BMI), and fasting blood sugar (FBS) levels, varied slightly among the groups, with notably higher FBS levels observed in patients with shorter follow-up intervals. These characteristics provide essential context for interpreting the subsequent predictive analyses.

Table 2

Baseline characteristics of patients with type 2 diabetes categorized by follow-up periods

Patient characteristics Follow-up
1–4 weeks (n=37) 5–8 weeks (n=49) 9–12 weeks (n=33) >12 weeks (n=1,923)
Age, years 68.35±10.64 68.20±10.90 69.24±10.12 65.23±11.88
Sex
   Male 12 (32.43) 19 (38.78) 14 (42.42) 549 (28.55)
   Female 25 (67.57) 30 (61.22) 19 (57.58) 1,374 (71.45)
BPS, mmHg 130.65±15.12 134.69±16.39 129.12±14.25 129.33±13.38
BPD, mmHg 75.43±10.65 76.08±10.37 71.88±11.70 75.72±10.16
BMI, kg/m2 24.81±4.92 27.52±4.40 25.59±3.74 25.52±4.83
Heart rate, bpm 85.43±15.45 84.61±12.55 86.18±14.25 85.85±13.54
Weight, kg 63.00±13.38 70.87±14.08 64.49±11.19 64.30±13.99
Height, cm 159.19±6.45 160.10±7.79 158.61±7.41 158.48±9.43
Waist circumference, cm 83.68±7.28 88.82±10.60 85.45±8.21 84.79±9.43
Smoking status
   0 32 (86.49) 39 (79.59) 24 (72.73) 1,595 (82.94)
   1 1 (2.70) 4 (8.16) 5 (15.15) 149 (7.75)
   2 4 (10.81) 6 (12.24) 4 (12.12) 171 (8.89)
   3 0 (0) 0 (0) 0 (0) 8 (0.42)
Alcohol status
   0 33 (89.19) 42 (85.71) 31 (93.94) 1,719 (89.39)
   1 1 (2.70) 5 (10.20) 0 (0) 112 (5.82)
   2 3 (8.11) 2 (4.08) 2 (6.06) 85 (4.42)
   3 0 (0) 0 (0) 0 (0) 7 (0.36)
Family history of diabetes
   0 28 (75.68) 37 (75.51) 24 (72.73) 1,469 (76.39)
   1 8 (21.62) 11 (22.45) 9 (27.27) 415 (21.58)
   2 1 (2.70) 1 (2.04) 0 (0) 39 (2.03)
Fasting blood sugar, mg/dL 180.22±78.09 168.18±81.12 145.30±58.13 151.81±49.66
Creatinine level, mg/dL 1.79±1.01 2.03±1.51 1.59±1.22 1.03±0.79

Data are presented as mean ± standard deviation or n (%). BMI, body mass index; BPD, diastolic blood pressure; BPS, systolic blood pressure.

Feature selection results

Table 3 presents the results of the five feature selection techniques, considering both primary features and their sub-classes. Each feature in the table is represented by a value of either 1 or 0, where 1 indicates that the feature was included in the selected subset and 0 indicates its exclusion, following the attribute sequence given in the table header. Age, blood pressure, pulse, height, waist, FBS, and creatinine were consistently selected, highlighting their critical importance in predicting the diabetes follow-up period. Notably, the sub-classes of categorical variables such as smoking type, drinking type, and family history exhibited diverse selection patterns. For smoking type, the ‘non-smoker’ [0] sub-class was frequently selected, while sub-classes 1–3 (covering current and former smokers) were chosen less often. Similarly, for drinking type, the ‘non-drinker’ [0] and ‘current drinker’ [1] sub-classes were often selected, suggesting that alcohol consumption may influence follow-up needs, though not necessarily in proportion to the duration of cessation. In Table 3, gray shading distinguishes continuous variables from categorical sub-classes for visual clarity. The family history of diabetes feature showed a preference across most methods for the ‘no or one parent with diabetes’ sub-classes [0 and 1], indicating that even a single parent with a history of the disease substantially influenced follow-up considerations.

Table 3

Feature selection results across different methods for the prediction of diabetes follow-up periods

Method Age Sex (0, 1) bps bpd bmi pulse w h waist Smoking_type (0, 1, 2, 3) Drinking_type (0, 1, 2, 3) fh (0, 1, 2) fbs cr
MI 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1
SFS 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1
GA 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1
EN 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RFFI 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1
Total 4 5 5 4 5 3 4 3 5 5 5 3 3 3 4 4 3 2 5 4 1 5 5

The definition of features is provided in Table 1. EN, Elastic Net; GA, genetic algorithm; MI, Mutual Information; RFFI, random forest feature importance; SFS, sequential feature selection.
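One plausible way to realize the EN selection step is sketched below with scikit-learn; the dataset, the use of `SelectFromModel` wrapping an elastic-net-penalized logistic regression, and all hyperparameters are illustrative assumptions rather than the authors' exact procedure.

```python
# Illustrative elastic-net feature selection: features whose coefficients
# survive the combined L1/L2 penalty are kept as a 1/0 mask, analogous to
# the EN row of Table 3. Data and hyperparameters are hypothetical.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler

X, y = make_classification(n_samples=300, n_features=23, n_informative=8,
                           random_state=0)
X = RobustScaler().fit_transform(X)  # the study scaled with a robust scaler

en = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, C=0.1, max_iter=5000)
selector = SelectFromModel(en).fit(X, y)
mask = selector.get_support()        # boolean analogue of the 1/0 row
print(mask.astype(int))
```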

Evaluation of classification metrics

Figure 4 presents the classification accuracy of the six ML models in combination with the five feature selection methods. ETC achieved the highest overall accuracy of 91.97% with EN. RF followed closely, with a maximum accuracy of 89.81%, also using EN. Among the remaining models, KNN showed the lowest peak performance, 72.09% with EN, suggesting that distance-based models may struggle with this dataset. For SVM, the best accuracy was achieved with the MI and SFS selection methods, both yielding 81.29%, while AB performed markedly worse, reaching a maximum of only 71.40% with MI. Notably, GA yielded the weakest performance across all models, with the lowest accuracy observed for AB (62.04%), indicating that GA may not be suitable for this particular classification task. Overall, in terms of accuracy, ETC combined with EN was the most effective classification model-feature selection pairing and demonstrated robustness on this dataset.

Figure 4 Comparison of classification accuracy across different models and feature selection techniques. AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.

When the feature selection methods were evaluated across all models, EN stood out as the best-performing method overall. It consistently yielded high accuracy across most models, achieving the highest accuracy with ETC (91.97%), RF (89.81%) and ANN (88.44%), and also significantly improved the performance of weaker models like KNN (72.09%) and AB (69.15%). This consistency suggests that EN effectively balanced the selection of features, optimizing model performances across various algorithms. In contrast, GA underperformed in most cases, particularly with AB, where its accuracy was only 62.04%, indicating that GA may not be reliable on this dataset. The ability of EN to consistently enhance model accuracy highlights its robustness and suitability as the best feature selection method for this classification task.

Figure 5 illustrates the weighted precision of the classification models, each evaluated with the different feature selection methods. AB with GA returned the highest precision of 91.78%, indicating that this combination produced the most reliable positive predictions among all configurations. KNN with GA also performed well, achieving a weighted precision of 91.27%. Conversely, ETC’s best weighted precision, 89.91% with SFS, was the lowest best-case result among the models.

Figure 5 Comparison of classification weighted precision across different models and feature selection techniques. AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.

When considering the best feature selection method in terms of weighted precision, GA stood out as the top performer with SVM (91.25%), KNN (91.27%), AB (91.78%), RF (89.78%) and ANN (89.97%), while SFS performed best with ETC (89.91%). GA was the most reliable feature selection method overall, consistently yielding the best results when evaluated using weighted precision. GA returned the highest weighted precision with five out of six classification models, achieving excellent performances with AB (91.78%) and KNN (91.27%). These results demonstrated that the GA selection method could optimize feature selection to improve model precision across these algorithms.

Figure 6 shows the weighted recall (sensitivity) scores for all pairings. The ETC model with EN returned the highest recall at 91.97%, indicating the most balanced recall performance, reducing the chances of false negatives (FNs). On the other hand, AB with GA returned the lowest recall score at 62.04%, showing its limitations in capturing all relevant positive instances.

Figure 6 Comparison of classification weighted recall across different models and feature selection techniques. AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.

When analyzing the best feature selection method for each model in terms of weighted recall, EN performed best for KNN (72.09%), ETC (91.97%), RF (89.81%), and ANN (88.44%). MI and SFS were the best selection methods for SVM, yielding a recall of 81.29%, only marginally better than EN. In contrast, AB worked best with MI, achieving a recall of 71.40%. These results indicated that EN consistently excelled in recall across multiple models.

Figure 7 illustrates the weighted F1-scores of the six ML models across the five feature selection methods. ETC delivered the highest F1-score of 90.69% with MI, EN, and SFS, matching its best performances in the accuracy and weighted recall evaluations. RF performed well with the RFFI method, achieving a maximum F1-score of 89.45%, and ANN with EN scored 89.15%. KNN, as in the accuracy evaluation, performed poorly, with a best F1-score of 79.99% under EN. AB returned relatively low F1-scores across all feature selection methods, with a best score of 79.66% with MI.

Figure 7 Comparison of classification weighted F1-scores across different classification models and feature selection technique. AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.

Upon further examination, EN emerged as the most consistent feature selection method when assessing the F1-score across all models. It not only boosted ETC to its highest F1-score (90.69%) but also improved RF (89.71%) and ANN (89.15%) to competitive levels. Although SFS performed well with the ETC and ANN classification algorithms, EN produced more improvements, making it the best overall feature selection method based on weighted F1-scores. This consistency suggested that EN outperformed the other feature selection methods, optimizing feature subsets more effectively across diverse model architectures.

Comprehensive model performance comparison

Table 4 shows the performances of all the pairings of classification models and feature selection methods in terms of accuracy and weighted F1-score. The performances were scored from 1 to 6, where 1 represents the lowest performance and 6 denotes the highest performance for each metric. The ETC classification model outperformed all other models in both metrics, achieving the highest overall accuracy score of 6 (average) and the highest F1-score of 6 (average), giving a notable total score of 30 in both categories. RF also performed well, with an average accuracy of 5 and F1-score of 5. AB consistently produced the lowest performances in terms of accuracy, with an average score of 1, while KNN similarly exhibited weaker performances with an accuracy average of 2. The results demonstrated that ETC and RF were the most reliable models, performing well with all the feature selection methods.

Table 4

Classification model performance metrics: accuracy and weighted F1-score (scores range, 1–6)

Method Accuracy Weighted F1-score
SVM KNN ETC AB RF ANN SVM KNN ETC AB RF ANN
MI 3 2 6 1 5 4 3 1 6 2 5 4
SFS 3 2 6 1 5 4 3 2 6 1 5 4
GA 3 2 6 1 5 4 3 2 6 1 5 4
EN 3 2 6 1 5 4 3 2 6 1 5 4
RFFI 3 2 6 1 5 4 3 2 6 1 5 4
Sum 15 10 30 5 25 20 15 9 30 6 25 20
Average 3 2 6 1 5 4 3 1.8 6 1.2 5 4

AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.
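The 1–6 ranking scheme of Tables 4 and 5 can be sketched as follows; the accuracy values are the EN-column figures reported in the text where available, with the SVM value a hypothetical placeholder.

```python
# Sketch of the 1-6 ranking scheme: for a given feature selection method,
# the six models are ranked by a metric, 1 = lowest and 6 = highest.
# The SVM accuracy (80.0) is an assumed placeholder; the rest are the
# EN-column accuracies reported in the text.
import numpy as np

models = ["SVM", "KNN", "ETC", "AB", "RF", "ANN"]
acc_en = np.array([80.0, 72.09, 91.97, 69.15, 89.81, 88.44])

order = np.argsort(acc_en)                 # indices from worst to best
ranks = np.empty(len(models), dtype=int)
ranks[order] = np.arange(1, len(models) + 1)
print({m: int(r) for m, r in zip(models, ranks)})
# → {'SVM': 3, 'KNN': 2, 'ETC': 6, 'AB': 1, 'RF': 5, 'ANN': 4}
```

Under this placeholder assumption the output reproduces the EN row of the accuracy half of Table 4 (3, 2, 6, 1, 5, 4).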

Table 5 shows the performances of all the pairings of classification models and feature selection methods in terms of weighted precision and weighted recall. AB performed the best in terms of weighted precision, achieving the highest total score of 30 for an average of 6. This result indicated that AB predicted true positives (TPs) more accurately across the feature selection methods, making it the most precise model in this evaluation. ETC returned a low total precision score of 9, while RF scored the lowest in this category with a total precision score of 7 for an average of 1.4, indicating less consistency in accurate predictions across the different feature selection methods. In terms of weighted recall, ETC emerged as the top performer with a total score of 30 for an average of 6, showing its strong ability to correctly identify positive cases across all feature selection methods. AB, despite excelling in precision, scored low in recall, with a total of 5 and an average of 1. RF, however, achieved a total recall score of 25 with an average of 5, indicating a strong ability to minimize FNs and capture TPs effectively.

Table 5

Classification model performance metrics: weighted precision and weighted recall (scores range, 1–6)

Method Weighted precision Weighted recall
SVM KNN ETC AB RF ANN SVM KNN ETC AB RF ANN
MI 4 5 2 6 1 3 3 2 6 1 5 4
SFS 4 5 3 6 1 2 3 2 6 1 5 4
GA 4 5 1 6 2 3 3 2 6 1 5 4
EN 4 5 1 6 2 3 3 2 6 1 5 4
RFFI 4 5 2 6 1 3 3 2 6 1 5 4
Sum 20 25 9 30 7 14 15 10 30 5 25 20
Average 4 5 1.8 6 1.4 2.8 3 2 6 1 5 4

AB, Adaptive Boosting; ANN, Artificial Neural Network; EN, Elastic Net; ETC, Extra Trees Classifier; GA, Genetic Algorithm; KNN, K-Nearest Neighbors; MI, Mutual Information; RF, Random Forest; RFFI, Random Forest Feature Importance; SFS, Sequential Feature Selection; SVM, Support Vector Machine.

Considering both Table 4 and Table 5, it is evident that no single classification model performed best on every metric. When focusing on accuracy and weighted F1-score, ETC was the strongest classification model, making it an ideal choice for tasks that require a high overall rate of correct class predictions. ETC also outperformed the other models in terms of recall, indicating its strength in reliably identifying positive cases across the different classes.


Discussion

Key findings

The study demonstrated a significant enhancement in predictive accuracy for diabetes follow-up period classification through the integration of feature selection techniques. The ETC, combined with EN, emerged as the best-performing model-feature selection pairing, achieving a weighted F1-score of 90.69%. This result underscored the robustness of tree-based models and the effectiveness of EN in feature selection.

Key features identified as critical for prediction included age, blood pressure, pulse, height, waist, FBS, and creatinine. These features highlighted the importance of demographic and clinical variables in predicting follow-up periods. EN consistently improved the performance of various models across multiple metrics, reinforcing its utility as a versatile feature selection method.

The study’s comparative analysis across six classifiers revealed that ETC and RF performed robustly across accuracy, precision, recall, and F1-score metrics. In contrast, models such as KNN and AB showed lower effectiveness, particularly in managing imbalanced datasets. These findings emphasized the importance of selecting appropriate classifiers for clinical prediction tasks.

Extended clinical evaluation

Based on the experimental findings, the two best-performing models in terms of weighted F1-score were ETC with EN and ETC with MI, both achieving the highest F1-score of 90.69%. To further analyze their diagnostic performance, specificity, sensitivity, and area under the curve (AUC) were examined. ETC with MI demonstrated a slightly higher specificity at 76.96% compared to 76.61% for ETC with EN, indicating a marginally better performance in correctly identifying negative cases. In contrast, ETC with EN yielded a higher sensitivity of 91.97%, outperforming the 91.77% sensitivity observed in ETC with MI, suggesting a stronger ability to detect positive cases. Moreover, ETC with EN achieved a higher AUC of 65.76%, compared to 63.09% for ETC with MI, reflecting a more favorable trade-off between sensitivity and specificity across varying classification thresholds. These comparative results suggest that although ETC with MI slightly favors specificity, ETC with EN provides a more balanced and effective classification performance overall, making it the preferable configuration in scenarios where both precision and recall are critical.

The calibration analysis further supports the comparative performance of ETC with EN and ETC with MI observed in previous evaluations. As shown in Figure 8, both models exhibit under-confidence in Class 3, with calibration curves lying consistently above the diagonal reference line. This pattern aligns with their high sensitivity values: 91.97% for EN and 91.77% for MI, indicating strong ability to detect true positives. However, ETC with EN demonstrates a smoother and more stable calibration curve for Class 3, suggesting better reliability in probability estimates. In contrast, ETC with MI displays greater fluctuation in predicted probabilities for this class, reflecting less consistent calibration. Additionally, both models show over-confidence in Classes 0 to 2, as indicated by the calibration curves falling below the diagonal, which corresponds to their moderate specificity values of 76.61% for EN and 76.96% for MI. Although ETC with MI yields slightly higher specificity, ETC with EN provides a more balanced and stable performance overall.

Figure 8 Calibration curves of the ETC model using two different feature selection methods: EN (A) and MI (B). EN, Elastic Net; ETC, Extra Trees Classifier; MI, Mutual Information.
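A calibration curve of the kind shown in Figure 8 can be computed per class in a one-vs-rest fashion; the data and model settings below are synthetic stand-ins, not the study's dataset.

```python
# Sketch of a per-class calibration check: predicted probabilities for one
# class are binned and compared with the observed positive fraction.
# Dataset and hyperparameters are synthetic assumptions.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=4, n_informative=6,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y,
                                      random_state=0)
proba = ExtraTreesClassifier(random_state=0).fit(Xtr, ytr).predict_proba(Xte)

cls = 3  # e.g., the >12-week class in the study's labeling
frac_pos, mean_pred = calibration_curve((yte == cls).astype(int),
                                        proba[:, cls], n_bins=5)
# Points above the diagonal (frac_pos > mean_pred) indicate under-confidence;
# points below it indicate over-confidence, as discussed for Figure 8.
```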

The decision curve analyses of both ETC models, one with EN and the other with MI feature selection, offer consistent yet nuanced insights into their clinical utility across outcome classes. Figure 9 presents the net benefit curves for each class under both configurations. For Class 3, both models demonstrated the highest net benefit within the threshold range of 0.05 to 0.25, clearly outperforming the “Treat All” and “Treat None” strategies. This indicates strong decision-making value in identifying patients with follow-up durations longer than 12 weeks, where accurate positive identification minimizes unnecessary interventions. For Class 1, both configurations showed moderate net benefit within the threshold range of 0.2 to 0.6, suggesting potential utility in managing patients in the 5 to 8 weeks group. Differences emerged more clearly in Class 0 and Class 2. In the EN-based model, both classes showed net benefit between the “Treat All” and “Treat None” strategies in the 0 to 0.2 threshold range, while in the MI-based model, Class 0 extended slightly further, from 0 to 0.25, and Class 2 was limited to a narrower range, from 0 to 0.15. In both cases, the models offered only marginal improvement over the “Treat None” approach and failed to outperform the “Treat All” strategy, indicating limited decision-support value. Overall, both the ETC with EN and ETC with MI configurations are most effective in supporting clinical decisions for Class 3, moderately useful for Class 1, and of limited benefit for Classes 0 and 2.

Figure 9 Decision curve analysis of the ETC model using two different feature selection methods: EN (A) and MI (B). EN, Elastic Net; ETC, Extra Trees Classifier; MI, Mutual Information.
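The net-benefit quantity underlying a decision curve analysis such as Figure 9 can be sketched directly; the outcome labels and predicted probabilities below are synthetic, and the one-vs-rest framing mirrors the per-class curves described above.

```python
# Sketch of net benefit for one class treated one-vs-rest: at threshold pt,
# patients with predicted probability >= pt are "treated", and
# net benefit = TP/N - FP/N * pt/(1 - pt). Data are synthetic.
import numpy as np

def net_benefit(y, proba, pt):
    treat = proba >= pt
    n = len(y)
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)                          # 1 = class present
proba = np.clip(y * 0.6 + rng.normal(0.2, 0.2, 500), 0, 1)

thresholds = np.linspace(0.05, 0.25, 5)              # Class 3 range above
model_nb = [net_benefit(y, proba, t) for t in thresholds]
treat_all = [net_benefit(y, np.ones(500), t) for t in thresholds]
# "Treat None" has net benefit 0 at every threshold; a useful model should
# sit above both reference strategies over clinically relevant thresholds.
```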

Strengths and limitations

This study effectively demonstrated the value of integrating feature selection techniques with ML algorithms to enhance the accuracy of predicting diabetes follow-up periods. The use of EN as a feature selection method proved robust across multiple metrics and models, particularly when paired with ETC. The identification of key demographic and clinical features, such as FBS and creatinine levels, provided actionable insights for improving clinical decision-making. Additionally, the methodology included comprehensive preprocessing steps, such as handling missing data and addressing class imbalance, ensuring the reliability of the results. The study also provided a comparative analysis of multiple ML classifiers, offering a clear evaluation of their performance in a healthcare context.

Future research could expand the scope of this work by incorporating a broader dataset that includes a more diverse range of demographic and clinical variables from a variety of healthcare settings. A key limitation of the current study lies in its reliance on data from a single hospital, which may reduce the generalizability of the model to other populations. A more comprehensive dataset would address this limitation, improving the ability of models to make accurate predictions across different patient groups. Additionally, exploring more advanced ML models, such as deep learning and hybrid approaches, could capture complex, non-linear relationships that the current models may overlook. Another limitation is the use of standard features, which may not fully capture patient behaviors. Feature engineering would be crucial to generate new variables, such as treatment adherence and lifestyle habits, which could refine the accuracy of follow-up period predictions. These advancements would significantly improve the application of ML models in personalized diabetes care, aligning with our aim of supporting clinical decision-making through data-driven insights.

Another limitation is the lack of intervention-related variables during the follow-up period. Information such as medication changes, insulin initiation, or referrals to allied health services was not available in the dataset. These clinical actions may influence follow-up scheduling and represent potential confounding factors. Future studies should aim to incorporate such variables to improve interpretability and reduce bias in model predictions.

The model can be further developed into a user-friendly screening tool in the form of a web application, mobile app, or desktop program. It is designed for ease of use and does not require ML expertise, making it adaptable for a broad range of users, including endocrinologists, general practitioners, nurse practitioners, and potentially even patients or caregivers in community-based settings.

Comparison with similar research

The study built on and extended prior research by demonstrating the significant role of feature selection techniques and ML algorithms in predicting follow-up periods for diabetes patients. While previous studies in the literature primarily focused on diabetes diagnosis, they did not address the specific challenge of follow-up period prediction, which is critical for long-term diabetes management.

A related study by Chapakiya et al. (6) employed SVM with SMOTE and RF feature selection, achieving a weighted precision of 92.96%. In contrast, our study applied a broader comparison of classifiers and feature selection techniques, with the ETC and EN yielding the highest weighted F1-score of 90.69%. Furthermore, while their approach used min-max scaling, we applied a robust scaler to better manage outliers. Although robust scaling did not directly enhance model performance, it provided greater stability in the presence of extreme values. This suggests that while min-max scaling may optimize certain metrics, robust scaling enhances generalizability, especially in clinical datasets with inherent variability.

Beyond follow-up classification, other ML studies have addressed related challenges in diabetes care by focusing on the prediction of long-term complications. For example, Ravaut et al. (27) applied gradient boosting techniques to predict multiple adverse outcomes such as cardiovascular events, infections, and amputations over a three-year horizon using administrative health data. Similarly, Zhao et al. (28) developed an XGBoost model to forecast the onset of diabetic retinopathy using electronic health records. Although these studies targeted downstream complications, our model addresses an upstream need by predicting follow-up intervals based on routinely collected clinical indicators. Accurate and timely follow-up scheduling may facilitate earlier detection and intervention, thereby helping prevent the complications predicted by those models. Despite differing prediction targets, all three studies share a common goal of enhancing individualized care and optimizing resource allocation in diabetes management.


Conclusions

This study developed and validated a ML-based model to predict follow-up periods for patients with type 2 diabetes, incorporating feature selection techniques to enhance performance. Using routine clinical data from outpatient visits, the model effectively classified patients into follow-up groups, with the ETC and EN feature selection yielding the most robust results. The study highlights the value of applying structured feature selection and comprehensive preprocessing to improve predictive accuracy and clinical interpretability. The findings suggest that routine demographic and clinical variables can inform timely and personalized follow-up planning, potentially supporting early intervention and reducing complications. Although the model was trained on data from a single hospital, its framework can be generalized through further validation with diverse datasets. Future efforts could focus on incorporating behavioral and intervention-related variables and developing the model into a practical application for clinicians or community-based use.


Acknowledgments

This research was supported by the Faculty of Science Research Fund, Faculty of Science, Prince of Songkla University (2024) under Contract No. SCIGEN67004. We also gratefully acknowledge the Institute of Field Robotics, King Mongkut’s University of Technology Thonburi, for providing financial support toward the article processing charge of this publication. We extend our sincere thanks to the Medical Records Department at Pak Phanang Hospital for providing the diabetes dataset. We also wish to acknowledge Thomas Duncan Coyne for his invaluable assistance in refining the English language of this manuscript. Any remaining errors are solely the responsibility of the authors. With heartfelt gratitude, we thank Oab and Chul Chumpong for their unwavering support, which has brought joy and inspiration to the third and fourth authors, enriching both this research and our family life.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://jphe.amegroups.com/article/view/10.21037/jphe-24-119/rc

Data Sharing Statement: Available at https://jphe.amegroups.com/article/view/10.21037/jphe-24-119/dss

Peer Review File: Available at https://jphe.amegroups.com/article/view/10.21037/jphe-24-119/prf

Funding: This research was supported by the Faculty of Science Research Fund, Faculty of Science, Prince of Songkla University (2024) under Contract No. SCIGEN67004 to K.C.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://jphe.amegroups.com/article/view/10.21037/jphe-24-119/coif). P.K. reports that article processing charges were supported by King Mongkut’s University of Technology Thonburi. K.C. reports that this research was supported by the Faculty of Science Research Fund, Faculty of Science, Prince of Songkla University (2024) under Contract No. SCIGEN67004. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Ethical approval for the study was obtained from the PSU Human Research Ethics Committee, Prince of Songkla University (PSU-HREC-2024-068-1-1), which deemed the study to meet the criteria for Exempt Research Determination. Individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Alhussan AA, Abdelhamid AA, Towfek SK, et al. Classification of Diabetes Using Feature Selection and Hybrid Al-Biruni Earth Radius and Dipper Throated Optimization. Diagnostics (Basel) 2023;13:2038. [Crossref] [PubMed]
  2. Idris NF, Ismail MA, Jaya MIM, et al. Stacking with Recursive Feature Elimination-Isolation Forest for classification of diabetes mellitus. PLoS One 2024;19:e0302595. [Crossref] [PubMed]
  3. Sivaranjani S, Ananya S, Aravinth J, et al. Diabetes prediction using machine learning algorithms with feature selection and dimensionality reduction. Proceedings of 2021 7th international conference on advanced computing and communication systems (ICACCS); 2021 Mar 19-20; Coimbatore, India; 2021:141-6.
  4. Saxena R, Sharma SK, Gupta M, et al. A Novel Approach for Feature Selection and Classification of Diabetes Mellitus: Machine Learning Methods. Comput Intell Neurosci 2022;2022:3820360. [Crossref] [PubMed]
  5. International Diabetes Federation. IDF Diabetes Atlas 2021 [Internet]. Brussels, Belgium: International Diabetes Federation; 2021 [cited 2024 Dec 5]. Available online: https://diabetesatlas.org/atlas/tenth-edition/
  6. Chapakiya I, Traisuwan A, Chumpong S, et al. Follow-up period classification of type 2 diabetes patients using data mining techniques. J Health Sci Med Res 2024; [Crossref]
  7. Maniruzzaman M, Rahman MJ, Al-MehediHasan M, et al. Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers. J Med Syst 2018;42:92. [Crossref] [PubMed]
  8. Hasan MK, Alam MA, Das D, et al. Diabetes prediction using ensembling of different machine learning classifiers. IEEE Access 2020;8:76516-31.
  9. Maniruzzaman M, Rahman MJ, Ahammed B, et al. Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020;8:7. [Crossref] [PubMed]
  10. Kakoly IJ, Hoque MR, Hasan N. Data-driven diabetes risk factor prediction using machine learning algorithms with feature selection technique. Sustainability 2023;15:4930.
  11. Li JP, Haq AU, Din SU, et al. Heart disease identification method using machine learning classification in e-healthcare. IEEE Access 2020;8:107562-82.
  12. Tsanas A, Little MA, McSharry PE, et al. Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity. J R Soc Interface 2011;8:842-55. [Crossref] [PubMed]
  13. Saadine JB, Fong DS, Yao J. Factors associated with follow-up eye examinations among persons with diabetes. Retina 2008;28:195-200. [Crossref] [PubMed]
  14. Lugner M, Rawshani A, Helleryd E, et al. Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data. Sci Rep 2024;14:2102. [Crossref] [PubMed]
  15. Cai R, Hao Z, Yang X, et al. An efficient gene selection algorithm based on mutual information. Neurocomputing 2009;72:991-9.
  16. Marcano-Cedeño A, Quintanilla-Domínguez J, Cortina-Januchs MG, et al. Feature selection using sequential forward selection and classification applying artificial metaplasticity neural network. Proceedings of 36th annual conference on IEEE industrial electronics society (IECON); 2010 Nov 7-10; Arizona, United States; 2010:2845-50.
  17. Babatunde O, Armstrong L, Leng J, et al. A genetic algorithm-based feature selection. International Journal of Electronics Communication and Computer Engineering 2014;5:899-905.
  18. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005;67:301-20.
  19. Menze BH, Kelm BM, Masuch R, et al. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 2009;10:213. [Crossref] [PubMed]
  20. Widodo A, Yang BS. Machine health prognostics using survival probability and support vector machine. Expert Syst Appl 2011;38:8430-7.
  21. Gou J, Ma H, Ou W, et al. A generalized mean distance-based k-nearest neighbor classifier. Expert Syst Appl 2019;115:356-72.
  22. Khalilia M, Chakraborty S, Popescu M. Predicting disease risks from highly imbalanced data using random forest. BMC Med Inform Decis Mak 2011;11:51. [Crossref] [PubMed]
  23. Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 2000;22:717-27. [Crossref] [PubMed]
  24. Sharaff A, Gupta H. Extra-tree classifier with metaheuristics approach for email classification. Proceedings of advances in computer communication and computational sciences (IC4S); 2018 Oct 20-21; Bangkok, Thailand; 2018:189-97.
  25. An TK, Kim MH. A new diverse AdaBoost classifier. Proceedings of 2010 International conference on artificial intelligence and computational intelligence; 2010 Oct 23-24; Sanya, China; 2010:359-63.
  26. Jung Y. Multiple predicting K-fold cross-validation for model selection. J Nonparametr Stat 2018;30:197-215.
  27. Ravaut M, Sadeghi H, Leung KK, et al. Predicting adverse outcomes due to diabetes complications with machine learning using administrative health data. NPJ Digit Med 2021;4:24. [Crossref] [PubMed]
  28. Zhao Y, Li X, Li S, et al. Using Machine Learning Techniques to Develop Risk Prediction Models for the Risk of Incident Diabetic Retinopathy Among Patients With Type 2 Diabetes Mellitus: A Cohort Study. Front Endocrinol (Lausanne) 2022;13:876559. [Crossref] [PubMed]
doi: 10.21037/jphe-24-119
Cite this article as: Khanarsa P, Suwanmanee S, Chumpong S, Chumpong K. Enhancing diabetes follow-up period prediction through classification algorithms with feature selection techniques. J Public Health Emerg 2025;9:25.