An Advanced Machine Learning Method for Simultaneous Breast Cancer Risk Prediction and Risk Ranking in Chinese Population: A Prospective Cohort and Modeling Study

Breast cancer (BC) remains the most commonly diagnosed cancer among women globally, with an estimated 2.26 million new cases in 2020. In China, BC is the most frequently diagnosed cancer among women, with approximately 416,000 new cases in 2020. This growing burden underscores the urgent need for effective risk assessment tools tailored to the Chinese population. Traditional BC risk prediction models, such as the Gail, Claus, and Tyrer–Cuzick models, have shown limited accuracy, with area under the receiver operating characteristic curve (AUC) values typically ranging from 0.55 to 0.65. These models often rely on invasive methods, such as genetic testing and breast biopsies, which are not feasible for widespread use in China because of economic constraints and the uneven distribution of medical resources. This study aims to address these limitations by developing machine learning-based risk prediction models that are accurate, non-invasive, and suited to the Chinese population.

The study leverages data from the Breast Cancer Cohort Study in Chinese Women (BCCS-CW), a large prospective dynamic cohort that includes 122,058 women aged 25–70 years from eastern China. The cohort was established in 2008–2009, with follow-up conducted from 2017 to 2020. Participants provided detailed information on demographic characteristics, physiological and reproductive factors, medical and family history, dietary habits, lifestyle, and BC-related knowledge through face-to-face interviews and physiological measurements. Incident BC cases were identified through linkage with national health insurance claims databases, disease registries, and local residential records. The study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines and was approved by the Ethics Committees of the Second Hospital of Shandong University and the National Center for Chronic and Non-communicable Disease Control.

To develop the risk prediction models, the study employed advanced machine learning techniques, including penalized logistic regression (PLR), bootstrapping, and ensemble learning. The ensemble penalized logistic regression (EPLR) model was designed for short-term risk prediction, while the ensemble penalized long-term (EPLT) model was developed for long-term risk prediction. Both models were constructed using a bagging-based integrated framework, which aggregates multiple PLR models to enhance prediction accuracy and stability. The EPLR model incorporated 72 non-experimental risk factors, while the EPLT model included 51 variables. The models were trained and internally validated using data from Shandong province, with external validation conducted using data from Jiangsu and Hebei provinces and Tianjin municipality.
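The bagging idea described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: it assumes an L1 penalty fitted by proximal gradient descent, plain bootstrap resamples, and simple averaging of predicted probabilities. The function names (`fit_plr`, `fit_bagged_plr`, `predict_bagged`), the hyperparameters, and the toy data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_plr(X, y, lam=0.05, lr=0.5, n_iter=150):
    """L1-penalized logistic regression via proximal gradient descent.

    lam, lr and n_iter are illustrative values, not taken from the study.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

def predict_plr(model, x):
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_bagged_plr(X, y, n_models=10, seed=0):
    """Fit one PLR model per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_plr([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagged(models, x):
    # Aggregate by averaging the base models' predicted probabilities.
    return sum(predict_plr(m, x) for m in models) / len(models)

# Toy data: feature 0 drives the outcome, feature 1 is pure noise.
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    x0, x1 = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append([x0, x1])
    y.append(1 if rng.random() < sigmoid(2.0 * x0) else 0)

models = fit_bagged_plr(X, y)
high = predict_bagged(models, [2.0, 0.0])
low = predict_bagged(models, [-2.0, 0.0])
print(round(high, 3), round(low, 3))
```

Averaging across resamples is what gives a bagged model its stability: any single PLR fit may be pulled around by the particular rows it saw, while the mean of many fits varies less.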

The performance of the models was evaluated based on discrimination and calibration. Discrimination was assessed using the AUC, receiver operating characteristic (ROC) curves, and net reclassification improvement (NRI). Calibration was evaluated using calibration plots and the expected-to-observed (E/O) ratio. The EPLR model demonstrated strong discrimination, with AUC values of 0.800 and 0.751 in internal and external validation sets, respectively. The NRI of the EPLR model relative to the Social Network-inspired Breast Cancer Risk Assessment Model (BCRAM) was 0.164 and 0.268 in internal and external validation sets, respectively, indicating significant improvement in prediction accuracy. The EPLT model also performed well, with AUC values of 0.692 and 0.760 in internal and external validation sets, respectively. The NRI of the EPLT model relative to the Gail model and the Han Chinese Breast Cancer Prediction Model (HCBCP) was 0.109 and 0.171 in internal validation and 0.193 and 0.233 in external validation, respectively. Calibration plots and E/O ratios further confirmed the models’ ability to accurately predict BC risk.
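For readers unfamiliar with these metrics, the sketch below computes each one on a small made-up dataset. It rests on several assumptions: the AUC is computed as the fraction of case/non-case pairs ranked correctly, the NRI shown is the category-free (continuous) variant (the summary does not say which variant the study used), and the E/O ratio is the sum of predicted risks divided by the number of observed cases. The data values are invented for illustration only.

```python
def auc(y_true, scores):
    """Probability that a random case scores higher than a random non-case."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def continuous_nri(y_true, old, new):
    """Category-free NRI of `new` over `old` predicted risks.

    Cases should move up and non-cases should move down under the new model.
    """
    up_e = down_e = up_n = down_n = 0
    n_events = sum(y_true)
    n_nonevents = len(y_true) - n_events
    for t, o, p in zip(y_true, old, new):
        if t == 1:
            up_e += p > o
            down_e += p < o
        else:
            up_n += p > o
            down_n += p < o
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents

def expected_observed_ratio(y_true, probs):
    """Expected cases (sum of predicted risks) over observed cases."""
    return sum(probs) / sum(y_true)

# Invented example: two cases followed by three non-cases.
y_true = [1, 1, 0, 0, 0]
old = [0.6, 0.3, 0.4, 0.2, 0.1]   # hypothetical reference model
new = [0.7, 0.5, 0.3, 0.2, 0.1]   # hypothetical new model

print("AUC old:", round(auc(y_true, old), 3))
print("AUC new:", round(auc(y_true, new), 3))
print("NRI:", round(continuous_nri(y_true, old, new), 3))
print("E/O new:", round(expected_observed_ratio(y_true, new), 3))
```

An E/O ratio near 1 indicates that the model predicts roughly as many cases as were observed; values above 1 suggest overestimated risk and values below 1 suggest underestimated risk.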

One of the key findings of the study was the importance of non-experimental risk factors in predicting BC risk. The EPLR model identified “overall life satisfaction” as the most important predictor, highlighting the role of psychological factors in BC risk. Other significant predictors included menopause status, family history of BC, breast hyperplasia, and dietary habits. The study also found that the EPLR and EPLT models outperformed traditional models such as the Gail and HCBCP models, which rely on fewer variables and often require invasive testing. The inclusion of a large number of non-experimental factors in the EPLR and EPLT models makes them more suitable for widespread application in China, where access to advanced medical resources is limited.

The study also addressed the challenge of imbalanced data by using a bootstrap strategy to create balanced datasets for model training. This approach reduces bias and improves the accuracy of risk factor selection. Additionally, the integration of multiple PLR models through ensemble learning enhances the stability and generalization ability of the EPLR and EPLT models. The models’ ability to rank the importance of risk factors based on their frequency of selection in multiple PLR models provides valuable insights into the relative contribution of different factors to BC risk.
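The balanced bootstrap and frequency-based ranking can be illustrated as follows. This is a sketch under stated assumptions rather than the study's procedure: each resample draws equal numbers of cases and non-cases with replacement, sized by the minority class, and a factor's importance is the share of resamples in which it is selected. To keep the example short, `mean_gap_selector` is a deliberately simple stand-in for a fitted PLR model's non-zero coefficients. All names, thresholds, and data are hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap(y, rng):
    """Return row indices with equal numbers of cases and non-cases.

    Sampling is with replacement; the minority class sets the per-class
    sample size (an assumption, not taken from the source).
    """
    cases = [i for i, label in enumerate(y) if label == 1]
    controls = [i for i, label in enumerate(y) if label == 0]
    k = min(len(cases), len(controls))
    return ([rng.choice(cases) for _ in range(k)]
            + [rng.choice(controls) for _ in range(k)])

def mean_gap_selector(X, y, threshold=0.75):
    """Stand-in for PLR selection: keep features whose case/non-case
    mean difference exceeds `threshold` in absolute value."""
    cases = [x for x, label in zip(X, y) if label == 1]
    controls = [x for x, label in zip(X, y) if label == 0]
    selected = []
    for j in range(len(X[0])):
        gap = (sum(x[j] for x in cases) / len(cases)
               - sum(x[j] for x in controls) / len(controls))
        if abs(gap) > threshold:
            selected.append(j)
    return selected

def rank_by_selection_frequency(X, y, names, select, n_rounds=50, seed=0):
    """Rank features by how often `select` keeps them across balanced resamples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rounds):
        idx = balanced_bootstrap(y, rng)
        chosen = select([X[i] for i in idx], [y[i] for i in idx])
        counts.update(names[j] for j in chosen)
    freqs = [(name, counts[name] / n_rounds) for name in names]
    return sorted(freqs, key=lambda item: -item[1])

# Imbalanced toy data: 20 cases, 380 non-cases; only "signal" differs by class.
data_rng = random.Random(7)
X, y = [], []
for label, size in ((1, 20), (0, 380)):
    for _ in range(size):
        signal = data_rng.gauss(2.0 if label == 1 else 0.0, 1.0)
        noise = data_rng.gauss(0.0, 1.0)
        X.append([signal, noise])
        y.append(label)

ranking = rank_by_selection_frequency(X, y, ["signal", "noise"], mean_gap_selector)
for name, freq in ranking:
    print(name, freq)
```

A factor that is selected in nearly every resample is robust to which rows happened to be drawn, which is the intuition behind using selection frequency as an importance score.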

Despite its strengths, the study has several limitations. First, the external validation of the models was limited to data from three provincial-level regions, and the EPLT model’s long-term predictions were validated using only three years of follow-up data. Further validation in larger and more diverse populations is needed to confirm the models’ generalizability. Second, some established risk factors, such as alcohol consumption, were not included in the models because of their low importance scores. Finally, the study did not account for BC subtypes, as the dataset lacked information on estrogen receptor status.

In conclusion, this study developed and validated advanced machine learning-based risk prediction models for BC in Chinese women. The EPLR and EPLT models demonstrated superior discrimination and calibration compared to traditional models, making them valuable tools for risk-stratified screening and BC prevention in China. The models’ reliance on non-experimental risk factors and their ability to rank the importance of these factors provide a practical and cost-effective approach to BC risk assessment. Future research should focus on validating the models in larger and more diverse populations and incorporating additional risk factors, including BC subtypes, to further enhance their predictive accuracy.

doi.org/10.1097/CM9.0000000000002891
