Improved performance of machine learning models in predicting length of stay, discharge disposition, and inpatient mortality after total knee arthroplasty using patient-specific variables
Arthroplasty volume 5, Article number: 31 (2023)
This study aimed to compare the performance of ten predictive models using different machine learning (ML) algorithms and compare the performance of models developed using patient-specific vs. situational variables in predicting select outcomes after primary TKA.
Data from 2016 to 2017 from the National Inpatient Sample were used to identify 305,577 discharges undergoing primary TKA, which were included in the training, testing, and validation of 10 ML models. 15 predictive variables consisting of 8 patient-specific and 7 situational variables were utilized to predict length of stay (LOS), discharge disposition, and mortality. Using the best performing algorithms, models trained using either 8 patient-specific and 7 situational variables were then developed and compared.
For models developed using all 15 variables, Linear Support Vector Machine (LSVM) was the most responsive model for predicting LOS. LSVM and XGT Boost Tree were equivalently most responsive for predicting discharge disposition. LSVM and XGT Boost Linear were equivalently most responsive for predicting mortality. Decision List, CHAID, and LSVM were the most reliable models for predicting LOS and discharge disposition, while XGT Boost Tree, Decision List, LSVM, and CHAID were most reliable for mortality. Models developed using the 8 patient-specific variables outperformed those developed using the 7 situational variables, with few exceptions.
This study revealed that performance of different models varied, ranging from poor to excellent, and demonstrated that models developed using patient-specific variables were typically better predictive of quality metrics after TKA than those developed employing situational variables.
Level of Evidence
Total knee arthroplasty (TKA) is a safe and effective treatment for end-stage osteoarthritis and is among the most common surgical procedures performed in the USA. National projections anticipate a substantial increase in TKA utilization and its associated economic burden well into the foreseeable future . As healthcare systems shift toward an increasing focus on value and patient satisfaction, there has been increased emphasis placed on risk stratification, perioperative optimization, and improving value in TKA care delivery [2,3,4]. As such, considerable effort has been put forth to develop models to predict clinical outcomes after TKA . More recently, artificial intelligence and machine learning have been heavily explored as potential tools to improve the predictive capacity of these models [6,7,8].
Machine learning (ML) is a subset of artificial intelligence (AI) that automates analytical model building by employing algorithms to progressively “learn” and improve from data [9, 10]. As technological shifts in the healthcare system have allowed for the accumulation and organization of large amounts of data, ML has shown immense promise for numerous applications within healthcare system. Recent studies within the orthopedic literature have applied ML to develop models to predict mortality, readmission rates, complication rates, length of stay, and patient-reported postoperative outcomes [8, 11,12,13,14,15]. Such predictive models have numerous potential benefits, including identifying patients at risk for worse outcomes, which allows for improved patient selection, targeted perioperative optimization, and stratification for risk-based alternative payment models. While these promising studies demonstrate the potential of ML to predict outcomes and improve value within orthopedics, they typically are limited in their choice of training variables and often employ a single elementary algorithm, without justification for the selection of either algorithm or variables. As a whole, there remains a critical need to develop and comparatively analyze the predictive capacity of various ML algorithms and to identify and select the relevant input variables used to train these models.
In that context, the purpose of this study was to comparatively evaluate the performance of ten different machine learning models in predicting LOS, mortality, and discharge disposition following TKA and to compare the performance of the best performing models developed with patient-specific vs. situational variables.
Data source and study sample
The National Inpatient Sample, a public and expansive database containing data of more than 7 million hospital stays in the US for the years 2016 and 2017, was utilized for this retrospective analysis and ML model development. Given the use of the International Classification of Disease, Tenth Revision (ICD-10) coding system in the database during the study period, the ICD-10-Procedure Coding System (ICD-10-PCS) for TKA was utilized to identify the study population (Additional file 1). Patients undergoing a conversion or revision TKA, younger than 18 years of age, or missing age information were excluded from the study population. This strategy resulted in a total of 305,577 discharges that were included in the current study.
Predictive and outcome variables selection
All available variables in the NIS database were considered and assessed for inclusion in this study. For the initial step of the study, 15 predictive variables were included in building and assessing ten different ML models, and subsequently divided, in the second step of the study, into 8 patient-specific (including Age, Sex, Race, Total number of diagnoses, All Patient Refined Diagnosis Related Groups (APRDRG) Severity of illness, APRDRG Mortality risk, Income zip quartile, Primary payer) and 7 situational variables (including Patient Location, Month of the procedure, Hospital Division, Hospital Region, Hospital Teaching status, Hospital Bed size, and Hospital Control). These features were manually selected by the authors by screening from all available variables in the NIS database. The analysis outcome variables were in-hospital mortality (binary yes/no outcome), discharge disposition (home vs. facility), and length of stay (≤ 2 vs. > 2) among primary TKA recipients. The determination of the LOS cutoff level was guided by analysis of the average LOS for the entire cohort, and subsequently utilizing the closest lower integral number to create the binary outcomes. Patient discharge destination was coded as either home (discharge to home or home health care) or facility (all other dispositions to a facility, such as skilled nursing facilities or inpatient rehabilitation centers). Patient datasets missing information on these variables were removed from the study sample.
Data handling and machine learning models development
SPSS Modeler (IBM, Armonk, NY, USA), a data mining and predictive analytics software, was utilized to develop the models based on commonly used ML techniques. The algorithmic methods implemented included Random Forest (RF), Neural Network (NN), Extreme Gradient Boost Tree (XGBoost Tree), Extreme Gradient Boost Linear (XGBoost Linear), Linear Support Vector Machine (LSVM), Chi square Automatic Interaction Detector (CHAID), Decision lists, Linear Discriminant Analysis (Discriminant), Logistic Regression, and Bayesian Networks. These methods were selected as they are well-studied, commonly used ML methods in medical literature and are distinct in their pattern recognition methods (Table 1) [8,9,10, 12].
For each technique and for each outcome-variable, a new algorithm was developed. The overall data set was split using random sampling into three separate groups: a training, testing, and validation cohort. A total of 80% of the data were used to train-test the models, while the remaining 20% was employed to validate the model parameters. The training–testing subset was subsequently divided into 80% training and 20% testing, yielding a final distribution of 64% for training, 16% for testing, and 20% for model validation. In-between those phases, there were no leaks between the data sets, as mutually exclusive sets were used to train, test, and then validate each predictive algorithm.
When predicting outcomes with a low incidence rate, there exists a bias within the model, leading to an inaccurate imbalance in predictive capacity biased against the minority outcome . As such, and to avoid such implications, when imbalanced outcome frequencies were encountered, the Synthetic Minority Oversampling Technique (SMOTE) was deployed to resample the training set to avoid any implications on the training of the ML classification [17, 18]. Despite the validation of SMOTE, as a measure to successfully minimize the impact of the bias, the classifier’s predictive ability in minority outcomes is improved, however, it remains imperfect.
The comparative analysis of the different ML models consisted of assessment of responsiveness and reliability of the predictions for all models. Responsiveness is a measure of successful prediction of variable outcomes and was quantified with area under the curve (AUC) for the receiver operating characteristic (ROC) curve. AUCROC measurements were generated by assessing true positive rates vs. false positive rates under the training, testing, and validation phases of each model. For this study, responsiveness was considered as excellent for AUCROC was 0.90–1.00, good for 0.80–0.90, fair for 0.70–0.80, poor for 0.60–0.70, and fail for 0.50–0.60. Reliability of the ML models was measured by the overall performance accuracy quantified by the percentage of correct predictions achieved by the model.
All ten ML models were trained, tested, and validated to assess responsiveness and reliability. The first step of the study aimed at analyzing and comparing the predictive performance of these ML models in identifying the outcome variables after primary TKA: in-hospital mortality, discharge disposition, and LOS. The validation phase utilizing 20% of the sample was considered as the main assessment metric and quantified with responsiveness and reliability. Once the development and comparative assessment of the different ML models were completed, the three algorithmic methodologies with the highest accuracy for each outcome variable were identified. The second step of the study consisted of developing and comparing the predictive performance of the top three ML methodologies for the same set of outcome measures while using patient-specific and situational predictive variables. All statistical analyses were performed with SPSS Modeler version 18.2.2 (IBM, Armonk, NY, USA).
This study included a total of 305,577 discharges that underwent primary TKA with an average age of 66.51 years. Descriptive statistics for the distributions of the aforementioned predictive variables are included in Table 2. The study population had an average of 0.1% mortality during hospitalization, a home discharge rate of 79.6%, and an LOS of 2.41 days.
For models developed using all 15 variables, the three most responsive models for LOS were LSVM, Neural Network, and Bayesian Network, with poor results measuring 0.684, 0.668 and 0.664, respectively (Table 3). The three most reliable models for LOS were Decision List, LSVM, and CHAID. Decision List had a good reliability of 85.44%, while LSVM and CHAID had a poor reliability of 66.55% and 65.63%, respectively. Figure 1 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting LOS. The three most responsive models for discharge disposition were LSVM, XGT Boost Tree, and XGT Boost Linear had fair performance with respective values of 0.747, 0.747, and 0.722 (Table 4). The two most reliable models for discharge yielding good reliability were Decision List and LSVM measuring 89.81% and 80.26% respectively, and the third most reliable one for discharge with fair results was CHAID at 79.80%. Figure 2 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting discharge disposition. The top 4 models that yielded excellent responsiveness for in-hospital mortality were LSVM, XGT Boost Linear, Neural Network, and Logistic Regression. with their values being 0.997, 0.997, and 0.996, respectively (Table 5). The most reliable models, all with excellent reliability, for in-hospital mortality were XGT Boost Tree, Decision List, LSVM, and CHAID, with values of 99.98%, 99.91%, 99.89%, and 99.89%, respectively. Figure 3 provides the ROC curves for the training, testing, and validation phases for the LSVM model predicting in-hospital mortality.
Separate models were then developed using the three most reliable algorithms for each outcome and their predictive performance was compared using either patient-specific or situational variables. Tables 6 and 7 describe the performance of models developed with patient-specific variables and situational variables, respectively. For nearly all outcomes, responsiveness was higher for each algorithm when trained with patient-specific variables vs. situational variables, the only exception being CHAID having marginally better performance for predicting LOS when developed with situational variables. Similarly, reliability was higher for most algorithms when models were developed using patient-specific as opposed to situational variables, with the exception of higher reliability for CHAID for predicting LOS and Decision List for predicting discharge disposition when developed using situational variables, and equivalent reliability of XGT Boost Tree and LSVM for predicting mortality when developed using either patient-specific or situational variables.
TKA is one of the most common procedures performed in the United States, with a considerable associated economic burden. As healthcare systems continue to aim to optimize value of care delivery, there has been a growing focus on standardizing outcomes and establishing accurate risk assessment prior to TKA [5, 19]. More recently, ML has been applied to develop models to predict outcomes after TKA [8, 13, 14, 20]. As such, the aim of this study was to develop and compare the performance of multiple ML models to predict in-hospital mortality, LOS, and discharge disposition after TKA and to compare the performance of models trained using patient-specific and situational variables.
Selecting an appropriate algorithm for training is critical in developing a predictive ML model. As the number of ML algorithms abounds, there has been a concerted effort within the medical literature to compare ML algorithms to identify which are optimal for a given set of data and diseases [21, 22]. However, within the nascent orthopedic ML literature, different ML algorithms have been seldom compared when developing predictive models. Therefore, this study aimed to assess the performance of ten different ML models for prediction of LOS, mortality, and discharge disposition after TKA. When comparing the different ML models using fifteen independent variables available in the NIS database, the LSVM methodology was consistently the most responsive and reliable one, being within the top three best-performing ML models in predicting all tested outcomes. This result is not surprising, as support vector machine algorithms have consistently been one of the most widely used ML predictive algorithms . Still, other studies in the general medical literature have shown superior performance of other algorithms for the prediction of other outcomes [21,22,23]. As such, it should be noted that if different variables or outcomes are to be tested in a different study, it is possible that a different ML algorithm would be more effective and accurate within its predictive capacity. As clinical application of ML continues to evolve, it should be stressed that various ML methodologies should be tested prior to developing and deploying a model for clinical use.
The selection of the optimal independent variables or features to train models is a cornerstone of supervised ML. Redundant variables can complicate models without increasing the predictive accuracy, while a deficiency of variables can oversimplify models without capturing the true complexity of a given use case. In the nascent TKA-related ML literature, there has typically been little justification for the variables selected to train models. Therefore, the predictive capacity of various models trained with either patient-specific or situational variables were compared. As both patient-specific factors, such as age, and situational variables, such as hospital volume, have been shown to correlate with outcomes after TKA, this distinction would be useful for the development of future models for use in clinical practice. Our analysis demonstrated consistently better performance of models developed with the 8 patient-specific variables when compared to models developed using 7 situational variables. These results, while stressing the importance of patient-specific variables, also highlight the potential of a smaller number of variables to develop equivalent predictive models. A similar concept was demonstrated recently in a study on heart failure patients that reported equivalent performance of an ML model using only 8 variables compared to one using a full set of 47 variables . Continued research within the orthopedic literature on variable engineering and selection is critical, and identifying the most predictive variables will prove useful for the development of models that will be deployed to clinical practice.
There were several limitations to this study. The strength of ML models is dependent on the quality of the data used to train, test, and validate the algorithms, and administrative databases may be prone to incompleteness and errors . However, the NIS has been demonstrated as an appropriate database to utilize for predictive large population-based studies and administratively-coded comorbidity data has been previously validated as accurate . Another limitation is that the LOS outcome was adjusted to be binary to simplify outcomes and provide more accurate analysis. These adjusted outcomes are useful in the setting of predictive ML at the expense of precise predictions. However, despite the continuous nature of LOS as a variable, when quality-improvement efforts are implemented in the clinical setting, the target for improvement in LOS is generally a binary cutoff, and so a binary predictive model has practical use. Another limitation is that the findings of this study were not externally validated. Although external validation was not within the scope of the study, efforts were made to internally validate the results, as the dataset was split into 64% training, 16% testing, and 20% validating groups. The analysis of each phase was concurrent with all models with similar results, indicating the internal validity of the findings. Still, comparison with another data source would be useful to assess the generalizability of each ML model and the replicability of the findings in this study.
There were several strengths to this study. This study represents a novel attempt in the orthopedic literature to analyze a large variety of ML algorithms to develop the best-performing model. Our analysis of multiple ML algorithms generates insights into the performance of these various algorithms for multiple outcomes, which has seldom been encountered in the orthopedic literature. Additionally, by demonstrating the generally superior performance of models trained on patient-specific variables over situational variables, this study highlights the role that patient-specific factors play in determining critical quality outcome metrics within the available dataset. These insights should empower efforts aimed to influence both clinical practice and reimbursement models, which typically do not consider patient factors despite their demonstrably substantial impact on various quality metrics.
In summary, this study compared ten ML models developed using different algorithms to predict three important quality metrics: mortality, LOS, and discharge disposition. Models developed using patient-specific variables performed better than models developed using situational variables. As the effort to develop ML models and identify which ML algorithms are optimal for a given set of conditions and outcomes, these results prove useful in the development of predictive ML models for accurate risk assessment and stratification for TKA.
Availability of data and materials
The datasets generated and/or analyzed during the current study are available in the National Inpatient Sample repository, https://www.hcup-us.ahrq.gov/db/nation/nis/nisdbdocumentation.jsp.
Singh JA, Yu S, Chen L, Cleveland JD. Rates of total joint replacement in the United States: future projections to 2020–2040 Using the National Inpatient Sample. J Rheumatol. 2019;46(9):1134–40.
Bernstein DN, Liu TC, Winegar AL, et al. Evaluation of a preoperative optimization protocol for primary hip and knee arthroplasty patients. J Arthroplasty. 2018;33(12):3642–8.
Gronbeck C, Cote MP, Lieberman JR, Halawi MJ. Risk stratification in primary total joint arthroplasty: the current state of knowledge. Arthroplast Today. 2019;5(1):126–31.
Schwartz FH, Lange J. Factors that affect outcome following total joint arthroplasty: a review of the recent literature. Curr Rev Musculoskelet Med. 2017;10(3):346–55.
Batailler C, Lording T, De Massari D, Witvoet-Braam S, Bini S, Lustig S. Predictive models for clinical outcomes in total knee arthroplasty: a systematic analysis. Arthroplast Today. 2021;9:1–15.
Devana SK, Shah AA, Lee C, Roney AR, van der Schaar M, SooHoo NF. A novel, potentially universal machine learning algorithm to predict complications in total knee arthroplasty. Arthroplast Today. 2021;10:135–43.
Lu Y, Khazi ZM, Agarwalla A, Forsythe B, Taunton MJ. Development of a machine learning algorithm to predict nonroutine discharge following unicompartmental knee arthroplasty. J Arthroplasty. 2021;36(5):1568–76.
Navarro SM, Wang EY, Haeberle HS, et al. Machine learning and primary total knee arthroplasty: patient forecasting for a patient-specific payment model. J Arthroplasty. 2018;33(12):3617–23.
Bini SA. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty. 2018;33(8):2358–61.
Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15(4):233–4.
Arvind V, London DA, Cirino C, Keswani A, Cagle PJ. Comparison of machine learning techniques to predict unplanned readmission following total shoulder arthroplasty. J Shoulder Elbow Surg. 2021;30(2):e50–9.
Haeberle HS, Helm JM, Navarro SM, et al. Artificial intelligence and machine learning in lower extremity arthroplasty: a review. J Arthroplasty. 2019;34(10):2201–3.
Harris AHS, Kuo AC, Weng Y, Trickey AW, Bowe T, Giori NJ. Can machine learning methods produce accurate and easy-to-use prediction models of 30-day complications and mortality after knee or hip arthroplasty? Clin Orthop Relat Res. 2019;477(2):452–60.
Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak. 2019;19(1):3.
Ramkumar PN, Navarro SM, Haeberle HS, et al. Development and validation of a machine learning algorithm after primary total hip arthroplasty: applications to length of stay and payment models. J Arthroplasty. 2019;34(4):632–7.
Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intell Data Anal. 2002;6(5):429–49.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
Ho KC, Speier W, El-Saden S, et al. Predicting discharge mortality after acute ischemic stroke using balanced data. AMIA Annu Symp Proc. 2014;2014:1787–96.
Shah A, Memon M, Kay J, et al. Preoperative patient factors affecting length of stay following total knee arthroplasty: a systematic review and meta-analysis. J Arthroplasty. 2019;34(9):2124-2165 e2121.
Hinterwimmer F, Lazic I, Suren C, et al. Machine learning in knee arthroplasty: specific data are key-a systematic review. Knee Surg Sports Traumatol Arthrosc. 2022;30(2):376–88.
Subudhi S, Verma A, Patel AB, et al. Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19. NPJ Digit Med. 2021;4(1):87.
Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019;19(1):281.
Ahn I, Gwon H, Kang H, et al. Machine learning-based hospital discharge prediction for patients with cardiovascular diseases: development and usability study. JMIR Med Inform. 2021;9(11):e32662.
Awan SE, Bennamoun M, Sohel F, Sanfilippo FM, Chow BJ, Dwivedi G. Feature selection and transformation by machine learning reduce variable numbers and improve prediction for heart failure readmission or death. PLoS ONE. 2019;14(6):e0218760.
Johnson EK, Nelson CP. Values and pitfalls of the use of administrative databases for outcomes assessment. J Urol. 2013;190(1):17–8.
Bozic KJ, Bashyal RK, Anthony SG, Chiu V, Shulman B, Rubash HE. Is administratively coded comorbidity and complication data in total joint arthroplasty valid? Clin Orthop Relat Res. 2013;471(1):201–5.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zalikha, A.K., Court, T., Nham, F. et al. Improved performance of machine learning models in predicting length of stay, discharge disposition, and inpatient mortality after total knee arthroplasty using patient-specific variables. Arthroplasty 5, 31 (2023). https://doi.org/10.1186/s42836-023-00187-2
- Machine learning
- Total knee arthroplasty
- Artificial intelligence
- Postoperative outcomes
- Length of stay