E-mail addresses: wangdujuan@dlut.edu.cn (D. Wang), [email protected] (Y. Jin).
Survivability prediction, one of the three main tasks of cancer prognosis, predicts the outcome (survival or death) of cancer patients. So far, most research defines “survival” as any incidence of cancer in which the person is still living five years (sixty months) after the date of diagnosis. Five-year survivability is an indicator commonly used in medical science to evaluate surgical and therapeutic effects. In earlier years, cancer survivability prediction was an estimate based on the clinical features of malignant tumors and the medical experience of doctors. With the continuous development of medical informatization and the establishment of medical information systems to improve the operating efficiency of hospitals, a large amount of cancer data has been collected and stored in databases, where it is available for constructing machine learning models to predict disease survivability. In addition, the increasing application of data mining techniques in healthcare makes it possible to further exploit these under-utilized medical data. Machine learning models such as decision trees, neural networks, support vector machines, and random forests have become popular tools for identifying useful patterns among variables and for predicting the outcome of cancer patients using models trained on historical clinical data.
Most existing studies on data-driven cancer survival prediction use classification to predict whether a patient will survive more than five years [23,30,50]. However, predictions obtained in this way are not precise enough to support medical decision-making. For example, in five-year survivability classification, the exact outcome (survival time) of patients classified as negative (those who will not survive more than five years) remains unknown. This outcome deserves more attention, especially for high-mortality cancers. To predict more precisely, survival time prediction can be carried out, which is more challenging but also more meaningful for medical doctors. In traditional studies, it is common to construct prediction models with statistical tools based on survival-related factors, for example, the palliative prognostic score, the palliative performance index, the cancer prognostic score, and the intra-hospital cancer mortality risk model. Note, however, that these statistical prediction models are intended for terminal cancer patients whose survival time is less than one month, so that proper support can be provided.
This paper aims to use machine learning methods to predict survival time on a monthly basis, which can help in making efficient treatment decisions. Predicting survival time has so far proven very challenging, since large generalization errors often occur when one-stage regression models are used [14,24]. To address this challenge, we propose a two-stage model based on tree ensembles for cancer survival prediction. At the first stage, an effective classifier predicts whether a patient will survive for five years. At the second stage, a novel regression tree ensemble predicts the specific survival time of patients who are predicted not to survive for five years.
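The control flow of such a two-stage model can be sketched as follows. This is a minimal illustration only, not the implementation used in this work: the class names `TwoStageModel` and `TwoStagePrediction`, the single-number feature vector, and the two toy rules standing in for the trained classifier and regression ensemble are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

FIVE_YEARS_IN_MONTHS = 60  # the five-year threshold used in the text


@dataclass
class TwoStagePrediction:
    survives_five_years: bool
    survival_months: Optional[float]  # None when stage 1 predicts survival


@dataclass
class TwoStageModel:
    # Stage 1: any binary classifier; True means "predicted to survive five years".
    classifier: Callable[[Sequence[float]], bool]
    # Stage 2: any regressor returning a survival time in months.
    regressor: Callable[[Sequence[float]], float]

    def predict(self, features: Sequence[float]) -> TwoStagePrediction:
        # Only patients predicted NOT to survive five years reach stage 2.
        if self.classifier(features):
            return TwoStagePrediction(True, None)
        return TwoStagePrediction(False, self.regressor(features))


# Toy stand-ins (hypothetical hand-written rules, not trained models).
def toy_classifier(features: Sequence[float]) -> bool:
    return features[0] < 2.0


def toy_regressor(features: Sequence[float]) -> float:
    return max(0.0, 48.0 - 6.0 * features[0])


model = TwoStageModel(toy_classifier, toy_regressor)
print(model.predict([1.0]))  # stage 1 predicts survival; no survival time
print(model.predict([5.0]))  # stage 1 predicts non-survival; stage 2 gives months
```

Separating the stages means the regressor only has to cover survival times below the five-year threshold, which is one plausible reason a two-stage design could reduce the generalization error of a single regression model.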
A study of five-year survival rates across TNM stages, using 158,483 records from the SEER dataset, shows that the five-year survival rate for stage IV colorectal cancer patients is only 10.4%, far lower than for stage III (59.7%), stage II (81.4%), and stage I (92.5%) patients. Moreover, it is known that the performance of survivability prediction varies considerably across cancer stages, and that predicting survivability for all stages together is less accurate than predicting it for each stage separately. Based on these findings, this work concentrates on constructing prediction models for the most urgent stage, stage IV, the advanced stage of colorectal cancer, where the majority of patients need more detailed and accurate prediction. Classifying advanced-stage cancer survivability at the first stage is an imbalanced classification problem, which often results in low sensitivity when ordinary classification algorithms are used. Therefore, at the classification stage, a tree-based ensemble classification method is developed to handle the imbalanced classification of advanced-stage cancer patients. At the regression stage, we propose a new selective ensemble regression method built on a new base learner generation approach, which employs a priori knowledge for semi-random feature selection and an indicator called the mean proportion of error interval (MPEI) for selecting base learners. Here, MPEI is a regression performance indicator that we propose for screening the initially generated base learners in order to create an accurate and diverse ensemble. To the best of our knowledge, ensemble regression learning has not previously been reported for cancer survival time prediction.
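The effect of class imbalance on sensitivity can be illustrated with a small worked example. The cohort below is hypothetical: it assumes 1000 stage IV patients with the 10.4% five-year survival rate quoted above, takes survivors as the positive class, and uses a trivial classifier that always predicts the majority (negative) class. It is meant only to show why accuracy alone is misleading here.

```python
# Hypothetical cohort: 1000 stage IV patients, 10.4% of whom survive five years.
N_PATIENTS = 1000
N_SURVIVORS = 104  # 10.4% of 1000

# 1 = survives five years (positive class), 0 = does not (negative class).
y_true = [1] * N_SURVIVORS + [0] * (N_PATIENTS - N_SURVIVORS)

# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * N_PATIENTS

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / N_PATIENTS
sensitivity = true_pos / (true_pos + false_neg)  # recall on the positive class

print(f"accuracy    = {accuracy:.3f}")     # 0.896
print(f"sensitivity = {sensitivity:.3f}")  # 0.000
```

The classifier reaches 89.6% accuracy without identifying a single survivor, which is why an imbalance-aware method is needed at the classification stage.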