Interpretable machine learning driven biomarker identification and validation for prostate cancer

Jianxu Yuan; Dalin Zhou; Shengjie Yu

doi:10.21037/tau-2025-242

Original Article

Jianxu Yuan, Dalin Zhou, Shengjie Yu

Department of Surgery, The Second Affiliated Hospital of Chongqing Medical University, Chongqing Medical University, Chongqing, China

Contributions: (I) Conception and design: J Yuan, S Yu; (II) Administrative support: S Yu; (III) Provision of study materials or patients: J Yuan; (IV) Collection and assembly of data: J Yuan, D Zhou; (V) Data analysis and interpretation: J Yuan; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Shengjie Yu, MD, PhD. Chief Physician, Associate Professor, Department of Surgery, The Second Affiliated Hospital of Chongqing Medical University, Chongqing Medical University, No. 74-76 Linjiang Road, Chongqing 400010, China. Email: bbyddh@sina.com.

Background: Prostate cancer (PCa), a common malignancy among men globally, requires the identification of biomarkers for early diagnosis and prediction of progression. This study aimed to identify the key genes involved in the occurrence and development of PCa.

Methods: Leveraging data from the Gene Expression Omnibus (GEO) database, this study integrated multi-chip datasets, conducting differential expression analysis and enrichment analysis to pinpoint PCa-related genes. Subsequently, machine learning models were constructed using least absolute shrinkage and selection operator (LASSO) regression, support vector machine (SVM), and random forest (RF) methods. The optimal model was selected for further study and the contribution of related genes was explained using SHapley Additive exPlanations (SHAP) analysis. Furthermore, gene set enrichment analysis (GSEA) and immune cell infiltration analysis were utilized to uncover the underlying molecular mechanisms.

Results: In this study, 222 differentially expressed genes (DEGs) were identified and found to be enriched in functions and pathways potentially associated with PCa. Using multiple machine learning models, eight PCa-related core genes (TRPM4, EDN3, EFCAB4A, FAM83B, PENK, NUDT10, KRT14, and CXCL13) were identified. The most accurate RF model was selected for further study with SHAP analysis, which also revealed the contribution of the above genes. GSEA and immune cell infiltration analysis uncovered distinctions between PCa and normal tissues.

Conclusions: This study offered potential biomarkers and a theoretical basis for the diagnosis and treatment of PCa.

Keywords: Prostate cancer (PCa); SHapley Additive exPlanations (SHAP); least absolute shrinkage and selection operator regression (LASSO regression); support vector machine (SVM); random forest (RF)

Submitted Mar 31, 2025. Accepted for publication May 28, 2025. Published online Jun 26, 2025.

Highlight box

Key findings

• This study identified 8 core genes linked to the occurrence and progression of prostate cancer (PCa) through diverse machine learning models, which could offer novel insights for PCa’s diagnosis and treatment in the future.

What is known and what is new?

• PCa, a highly heritable malignant tumor linked to multiple genes, poses a higher risk for men with susceptible gene mutations. The inactivation of tumor suppressor genes and the activation/overexpression of oncogenes also drive its occurrence, development, invasion, and metastasis by impacting cell processes like proliferation, apoptosis, and differentiation.

• This study applied multiple machine learning models to identify eight core genes from 222 PCa-related differentially expressed genes and used SHapley Additive exPlanations analysis to further clarify each gene’s impact on PCa. We also conducted gene set enrichment analysis and immune cell infiltration analysis on these genes.

What is the implication, and what should change now?

• The identified core genes may become potential diagnostic and therapeutic targets for PCa. Subsequent in-vivo and in-vitro experiments are needed to validate the findings and facilitate their eventual clinical application.

Introduction

Prostate cancer (PCa), a common malignancy in men, shows rising global incidence and mortality with age, especially in those aged 65 and older. Its risk factors include uncontrollable elements like age, race, genetic predisposition, and family history, as well as lifestyle-related factors such as unhealthy diets, obesity, and smoking (1). Modern imaging techniques like multiparametric magnetic resonance imaging (MRI) and positron emission tomography (PET)/computed tomography (CT) play a key role in early diagnosis, precise staging, and treatment monitoring for PCa, greatly boosting diagnostic accuracy and the scientific basis of treatment decisions (2).

The clinical prostate-specific antigen (PSA) test lacks specificity, as elevated PSA can result from benign prostatic conditions. Common imaging modalities like MRI have limitations in early detection, often missing small tumors and sometimes misgrading or mis-staging disease. Advanced metastatic PCa is mainly treated with endocrine therapy, but most patients eventually develop castration-resistant PCa (CRPC), which is resistant to such therapy, leaving limited options and a poor prognosis. PCa is highly heterogeneous, differing significantly among patients in gene expression, growth rate, and invasive capacity, so treatment responses and prognoses vary and treatment becomes more complex. Moreover, the lack of effective biomarkers to predict treatment responses and prognoses complicates the development of individualized treatment plans and accurate risk assessment, influencing treatment decisions and patient follow-up management.

Many key genes are dysregulated or mutated and can serve as potential biomarkers. Measuring their levels in blood, tissue, or other body fluids may lead to more specific and sensitive early diagnosis for PCa, making up for the PSA test’s shortcomings, lowering false positives/negatives, and offering non-/minimally-invasive diagnosis for those unsuitable for biopsy, while enabling dynamic disease and treatment monitoring. For example, certain mutations might indicate sensitivity to targeted drugs, enhancing efficacy. Also, their expression/mutation status links closely to clinical outcomes, helping assess prognosis like survival/recurrence risks and adjust treatment. Key-gene identification aids in exploring pathogenesis, revealing new drug targets for novel therapies, especially for advanced/refractory PCa. Studying them and related pathways can also improve combination therapies. Moreover, they assist in molecular classification, categorizing patients for precision treatment. In summary, understanding their roles in PCa development and progression enhances our grasp of its biological behavior and offers a basis for new treatment strategies.

This study integrated PCa-related data from the Gene Expression Omnibus (GEO) database. By applying bioinformatics and machine learning methods, it aimed to identify genes closely linked to PCa development, construct a machine learning model, and uncover the model’s internal decision-making process via SHapley Additive exPlanations (SHAP) analysis. Furthermore, combined with gene set enrichment analysis (GSEA) and immune cell infiltration analysis, it comprehensively explored PCa’s molecular mechanisms and immune microenvironment characteristics, seeking potential biomarkers and therapeutic targets for PCa. This research may open up new avenues for the diagnosis and treatment of PCa, and the identified biomarkers and targets could provide a theoretical foundation for subsequent experimental and clinical work toward precision medicine in PCa. We present this article in accordance with the TRIPOD reporting checklist (available at https://tau.amegroups.com/article/view/10.21037/tau-2025-242/rc).

Methods

GEO data downloading and preprocessing

Multiple PCa-related datasets, namely GSE28680, GSE46602, GSE55945, and GSE69223, were retrieved from the GEO database. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Because these data had already passed ethical review during collection and organization, and the GEO database is free and open, no additional ethical review was required for this study. The GSE28680 dataset encompassed 24 samples, consisting of 4 benign prostate tissues and 20 PCa tissues. The GSE46602 dataset incorporated 50 samples, with 14 benign prostate tissues and 36 PCa tissues. The GSE55945 dataset included 21 samples, of which 8 were benign prostate tissues and 13 were PCa tissues. The GSE69223 dataset was made up of 30 samples, evenly divided into 15 benign prostate tissues and 15 PCa tissues. More detailed clinical information can be retrieved from the GEO database.
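As a quick check of the cohort composition, the per-dataset counts reported above can be tallied in a few lines. This is only an illustrative sketch in Python (the study itself used R); the function name `cohort_totals` is hypothetical.

```python
# (benign, PCa) sample counts per GEO series, as reported in the text.
datasets = {
    "GSE28680": (4, 20),
    "GSE46602": (14, 36),
    "GSE55945": (8, 13),
    "GSE69223": (15, 15),
}

def cohort_totals(counts):
    """Return (benign, tumour, total) summed over all datasets."""
    benign = sum(b for b, _ in counts.values())
    tumour = sum(t for _, t in counts.values())
    return benign, tumour, benign + tumour

benign, tumour, total = cohort_totals(datasets)
print(f"control={benign}, treat={tumour}, total={total}")
```

The tally makes the class imbalance between the control and treat groups explicit, which is relevant when interpreting accuracy-type metrics later on.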

These datasets encompassed gene-expression-profile data from PCa patients and normal controls, with PCa samples labeled as the “treat” group and normal samples as the “control” group. Data downloading and preliminary processing were performed in R (version 4.4.3) to guarantee data integrity and usability. The expression data and sample information for the control and treat groups were imported using the “limma (version 3.62.2)” package. The data were formatted into a matrix with gene identifiers as row names, creating a standardized expression matrix. Duplicate samples were merged to simplify the structure for downstream analysis. Log₂ transformation was applied where needed to meet statistical assumptions such as normal distribution. Normalization was conducted to remove systematic bias, ensuring data comparability across samples. Finally, the relevant sample data from both groups were extracted and combined for intergroup difference analysis.
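The preprocessing steps can be illustrated with a minimal Python sketch (the study used R and limma). Two points are assumptions rather than details given in the text: duplicates are merged by averaging rows that share an identifier, and the "where needed" rule for the log₂ transformation is approximated by a simple magnitude threshold. All function names and the threshold value are hypothetical.

```python
import math
from collections import defaultdict

def merge_duplicates(rows):
    """Average rows that share an identifier.

    rows: list of (identifier, [value per sample]) pairs.
    """
    grouped = defaultdict(list)
    for identifier, values in rows:
        grouped[identifier].append(values)
    return {
        identifier: [sum(col) / len(col) for col in zip(*value_lists)]
        for identifier, value_lists in grouped.items()
    }

def needs_log2(matrix, threshold=100.0):
    """Assumed heuristic: very large values suggest raw-scale intensities."""
    return max(v for values in matrix.values() for v in values) > threshold

def log2_transform(matrix, offset=1.0):
    """Apply log2(value + offset) to every entry (non-negative input assumed)."""
    return {
        identifier: [math.log2(v + offset) for v in values]
        for identifier, values in matrix.items()
    }

# Toy input: identifier "A" appears twice and is averaged.
rows = [("A", [1.0, 3.0]), ("A", [3.0, 5.0]), ("B", [255.0, 1023.0])]
matrix = merge_duplicates(rows)
if needs_log2(matrix):
    matrix = log2_transform(matrix)
```

Normalization across samples is not shown here because the text does not state which method was applied.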

Batch correction and data integration

The ComBat algorithm was used to correct batch effects across different microarray platforms, reducing systematic deviations caused by experimental variation (3). Principal component analysis (PCA) was applied to compare data distribution patterns before and after correction, assessing the effectiveness of batch correction to ensure data consistency and comparability (4). This step was completed using the “ggplot2 (version 3.5.2)” and “ggpubr (version 0.6.0)” packages in R (version 4.4.3), both with default settings.
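To convey the intuition behind batch correction, the sketch below shifts each batch of one gene so that its mean matches the overall mean. This is deliberately not ComBat: ComBat additionally adjusts batch-specific variance and shrinks the batch estimates with an empirical Bayes step. The function name `center_batches` is hypothetical, and the example is in Python rather than the R used in the study.

```python
from statistics import mean

def center_batches(values, batches):
    """Shift each batch so that its mean equals the overall mean.

    values  : expression of one gene across all samples
    batches : batch label of each sample (for example the GEO series ID)
    """
    overall = mean(values)
    batch_means = {
        b: mean([v for v, label in zip(values, batches) if label == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]

# Toy gene measured in two batches with an offset of 10 units.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["a", "a", "a", "b", "b", "b"]
print(center_batches(values, batches))
```

Plain mean-centering can also remove real biological differences when the proportion of tumour and normal samples differs between batches, which is one reason a model-based method is usually preferred.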

In PCA, principal component (PC) scores, calculated as projections of raw data onto PC1 and PC2, reflected sample positions in major variation directions. Sample grouping (type) was extracted via regular expressions. In the PCA scatter plot, samples were color-coded and shape-coded by type. Analyzing sample distribution in the plot allowed assessment of batch effects. Notably, changes in sample clustering before and after batch correction visually indicated correction efficacy through shifts in sample points. Subsequently, the corrected data were integrated into a unified gene-expression matrix, which included expression data from all samples. This matrix formed the basis for differential expression analysis and model construction.
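The projection onto PC1 and PC2 can be sketched as follows. This is a minimal pure-Python illustration using power iteration with deflation, not the routine used in the study; the function names are hypothetical, and the sign of each axis is arbitrary, as in any PCA.

```python
import math

def center(X):
    """Subtract each column (gene) mean from a samples-by-genes matrix."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix of already-centred data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(C, iters=500):
    """Power iteration: returns (eigenvalue, unit eigenvector)."""
    p = len(C)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left along any direction
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * C[i][j] * v[j] for i in range(p) for j in range(p))
    return lam, v

def pc_scores(X, n_components=2):
    """Project centred samples onto the leading principal axes."""
    Xc = center(X)
    C = covariance(Xc)
    p = len(C)
    axes = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(C)
        axes.append(v)
        # Deflation: remove the variance already explained by this axis.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return [[sum(r[j] * v[j] for j in range(p)) for v in axes] for r in Xc]

# Toy data: four samples, two genes, most variance along the first gene.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
scores = pc_scores(X)
```

Plotting the two score columns against each other, colored by dataset, gives the kind of before/after comparison described above.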

Differential expression analysis

The “limma (version 3.62.2)” package was used to analyze differences in gene expression between PCa and normal samples, identifying differentially expressed genes (DEGs) (5). Thresholds (P value <0.05, |logarithm fold change (logFC)| >1) were set to determine genes significantly up- or down-regulated in PCa. The identified DEGs were preliminarily evaluated to analyze the significance and biological implications of their expression differences.
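The thresholds above amount to a simple filter on the table of test results. The Python sketch below shows that filter only; the moderated statistics themselves come from limma and are not reproduced here. The gene names and values in the toy table are invented for illustration, and `classify_deg` is a hypothetical helper.

```python
def classify_deg(log_fc, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    if p_value < p_cutoff and log_fc > lfc_cutoff:
        return "up"
    if p_value < p_cutoff and log_fc < -lfc_cutoff:
        return "down"
    return "ns"

# Invented example rows: (gene, logFC, P value).
results = [
    ("gene_a", 1.8, 0.001),   # passes both thresholds, up-regulated
    ("gene_b", -2.3, 0.010),  # passes both thresholds, down-regulated
    ("gene_c", 0.4, 0.001),   # significant but effect too small
    ("gene_d", 3.0, 0.200),   # large effect but not significant
]

degs = {gene: classify_deg(lfc, p) for gene, lfc, p in results}
print(degs)
```

Whether a raw or an adjusted P value is passed in changes the stringency considerably, so the choice should match what is reported for the figure.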

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis

To explore common cellular functions and PCa-related mechanisms in identified DEGs, GO enrichment analysis was conducted using the “clusterProfiler (version 4.14.6)” package, covering biological processes (BPs), cellular components (CCs), and molecular functions (MFs) (6). The corresponding bar plot and bubble chart were generated using the “enrichplot (version 1.26.6)” and “ggplot2 (version 3.5.2)” packages (7,8). KEGG enrichment analysis was also performed using the same procedure.

Machine learning model construction and intersection feature genes identification

Least absolute shrinkage and selection operator (LASSO) regression, support vector machine (SVM), and random forest (RF) were employed to analyze DEGs and extract predictive genes. Models underwent training and optimization via cross-validation and independent test-set validation to boost predictive accuracy. By intersecting feature genes from these models, core genes consistently important across multiple models were pinpointed. The “ggvenn (version 0.1.10)” package was used to visualize this process. These genes might be crucial for PCa’s diagnosis and prognosis (9). Finally, the “ggplot2 (version 3.5.2)” package was utilized to visualize these intersecting feature genes.
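The intersection step is a plain set operation, sketched below in Python. The eight core genes are those reported in this article; every other name is a placeholder used only to pad each list to its reported size, because the full per-model gene lists are not given in the text. The placeholders also assume no pairwise overlap outside the core set, which may not hold for the real lists.

```python
def intersect_features(*selections):
    """Return the genes chosen by every selection method, sorted by name."""
    if not selections:
        return []
    common = set(selections[0])
    for selected in selections[1:]:
        common &= set(selected)
    return sorted(common)

CORE = ["TRPM4", "EDN3", "EFCAB4A", "FAM83B",
        "PENK", "NUDT10", "KRT14", "CXCL13"]

# Hypothetical padding so that the list sizes are 25, 24, and 17.
lasso_genes = CORE + [f"lasso_only_{i}" for i in range(17)]
svm_genes = CORE + [f"svm_only_{i}" for i in range(16)]
rf_genes = CORE + [f"rf_only_{i}" for i in range(9)]

core_genes = intersect_features(lasso_genes, svm_genes, rf_genes)
print(len(core_genes), core_genes)
```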

SHAP analysis

We rebuilt machine learning models using intersection feature genes as input variables. Through cross-validation and independent test-set validation, we trained and optimized these models to enhance their predictive accuracy. The entire dataset was partitioned into a training set and a test set. Specifically, 70% of the data was allocated to the training set for model development, and the remaining 30% was reserved for the test set to conduct the final assessment of the model. Throughout the training phase, a repeated five-fold cross-validation approach was utilized. This entailed subdividing the training set into five equally-sized subsets, commonly referred to as “folds”. During each cross-validation cycle, four of these folds were employed to train the model, while the fifth fold served as a validation set to evaluate the model’s performance. This process was repeated multiple times, each time with a different fold designated for validation, to comprehensively assess the model’s effectiveness across various data subsets.
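The partitioning scheme can be sketched with sample indices alone. The Python example below is an illustration, not the caret routine used in the study: the split is a simple random one without stratification, and the number of repeats (three) and the seeds are assumptions, since the text does not state them.

```python
import random

def split_indices(n, train_frac=0.7, seed=0):
    """Randomly split range(n) into training and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

def repeated_kfold(indices, k=5, repeats=3, seed=0):
    """Yield (train, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]  # near-equal fold sizes
        for held_out in range(k):
            validation = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            yield train, validation

# 125 is the total number of samples across the four datasets.
train_idx, test_idx = split_indices(125)
cv_splits = list(repeated_kfold(train_idx))
print(len(train_idx), len(test_idx), len(cv_splits))
```

The test indices are never passed to the cross-validation generator, which keeps the final assessment independent of model tuning.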

To explore different machine learning approaches, a loop was executed to train multiple machine learning algorithms. Within each iteration of the loop, the training set was used to train a specific machine learning algorithm. During the training process, the aforementioned cross-validation technique was applied to optimize the model’s parameters and improve its performance. The model’s performance was gauged by the prediction results on the test set, with the primary metric being the area under the receiver operating characteristic (ROC) curve. After all models had been trained and evaluated, the model exhibiting the highest area under the curve (AUC) value on the test set was selected as the optimal model.
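The selection rule described above can be written compactly. In this Python sketch the AUC is computed from its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half). The model names and predicted scores are invented; the study computed ROC curves with pROC in R.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best_model(test_scores, labels):
    """Return (best model name, dict of AUC per model)."""
    aucs = {name: roc_auc(labels, s) for name, s in test_scores.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs

# Invented test-set predictions for two hypothetical models.
labels = [0, 0, 1, 1]
test_scores = {
    "model_a": [0.1, 0.4, 0.35, 0.8],
    "model_b": [0.1, 0.2, 0.8, 0.9],
}
best, aucs = select_best_model(test_scores, labels)
print(best, aucs)
```

With a small test set, AUC values of different models can differ by less than their sampling uncertainty, so a confidence interval per model is a useful complement to the point estimate.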

To boost the interpretability and transparency of the machine learning models, we conducted SHAP analysis, which unveiled the contribution of each feature gene to the prediction (10,11). We measured the importance of each feature gene in model predictions by calculating SHAP values. Using bar plot and bee swarm plot, we visually presented the global and local impacts of feature genes. By analyzing the distribution and trends of SHAP values, we identified core feature genes significantly affecting model predictions.
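To make the notion of a SHAP value concrete, the sketch below computes exact Shapley values by enumerating all feature subsets, with "absent" features replaced by a single baseline value. This is an assumption-laden simplification: the study used the kernelshap package in R, which approximates the same quantity against a background dataset. The toy model, its coefficients, and the inputs are invented.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for one sample `x`."""
    n = len(x)

    def value(present):
        # Features outside `present` are set to their baseline value.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi

def mean_abs_shap(predict, samples, baseline):
    """Global importance: mean absolute Shapley value per feature."""
    all_phi = [shapley_values(predict, x, baseline) for x in samples]
    return [sum(abs(p[i]) for p in all_phi) / len(all_phi)
            for i in range(len(baseline))]

def toy_model(z):
    """Invented linear score: the third feature has no effect."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.0 * z[2]

baseline = [0.0, 1.0, 5.0]
phi = shapley_values(toy_model, [1.0, 3.0, 5.0], baseline)
importance = mean_abs_shap(toy_model, [[1.0, 3.0, 5.0], baseline], baseline)
```

The per-sample values correspond to the points in a bee swarm plot, and the mean absolute values correspond to the bars in an importance ranking. Exhaustive enumeration scales as 2^n in the number of features, which is feasible for eight genes but not for large feature sets.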

In order to select the optimal model, this study included 10 machine learning models: partial least squares (PLS), RF, decision trees (DTs), SVM, logistic regression (Logistic), K-nearest neighbors (KNN), extreme gradient boosting (XGBoost), gradient boosting machine (GBM), neural network (NeuralNet), and generalized linear model boosting (glmBoost). This crucial step utilized R packages such as “caret (version 7.0.1)”, “DALEX (version 2.4.3)”, “ggplot2 (version 3.5.2)”, “randomForest (version 4.7.1.2)”, “kernlab (version 0.9.33)”, “kernelshap (version 0.7.0)”, “pROC (version 1.18.5)”, “shapviz (version 0.9.7)”, “xgboost (version 1.7.9.1)”, and “klaR (version 1.7.3)”.

GSEA

We performed GSEA to compare predefined gene-set expression differences between PCa and normal samples. Unlike focusing on single genes, GSEA detects gene sets with coordinated expression changes, offering a comprehensive view of alterations in PCa-related pathways. For this analysis, the “org.Hs.eg.db (version 3.20.0)”, “clusterProfiler (version 4.14.6)”, and “enrichplot (version 1.26.6)” packages were used (12).
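The core statistic of GSEA is a running-sum enrichment score computed over a ranked gene list. The Python sketch below implements the commonly described weighted form of that score; it omits the permutation-based significance testing and multiple-testing correction that clusterProfiler performs, and the gene names and ranking scores are invented.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked   : list of (gene, score) pairs sorted by score, descending
    gene_set : set of gene names to test
    Returns the maximum deviation of the running sum from zero (signed).
    """
    in_set = [gene in gene_set for gene, _ in ranked]
    hit_weights = [abs(score) ** weight if hit else 0.0
                   for (_, score), hit in zip(ranked, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for hit, w in zip(in_set, hit_weights):
        # Step up at set members (weighted), step down otherwise.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Invented ranking, for example by logFC between two expression groups.
ranked = [("g1", 3.0), ("g2", 2.0), ("g3", 1.0),
          ("g4", -1.0), ("g5", -2.0), ("g6", -3.0)]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
```

A positive score indicates that the set is concentrated at the top of the ranking and a negative score that it is concentrated at the bottom.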

Immune cell infiltration analysis

We used immune cell infiltration analysis tools to assess the relative abundance of diverse immune cell types in PCa samples. This analysis aimed to explore the composition and dynamics of PCa immune microenvironment and its impact on tumor development and prognosis by examining the correlations between immune cell infiltration and gene expression. The CIBERSORT algorithm was applied to analyze immune cell infiltration using gene expression data (13). After inputting the gene expression matrix and obtaining the estimated immune cell proportions, we interpreted the infiltration results, examining the distribution and levels of various immune cell types in PCa tissues. Furthermore, we investigated the correlations between immune cell infiltration and core gene expression to identify potential immunotherapy targets.
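The correlation step between estimated immune cell proportions and core gene expression can be sketched as a rank correlation. Using Spearman correlation here is an assumption, since the text does not name the coefficient; the CIBERSORT deconvolution itself is not reproduced. The expression values and cell fractions are invented.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented values: one gene's expression and one cell type's fraction.
expression = [5.1, 6.3, 7.0, 8.2, 9.4]
cell_fraction = [0.02, 0.03, 0.05, 0.09, 0.15]
rho = spearman(expression, cell_fraction)
```

Because each gene is tested against many cell types, the resulting P values would normally need correction for multiple comparisons.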

Statistical analysis

The statistical analysis was performed using R software (version 4.4.3), and P value <0.05 was considered significant.

Results

Data preparation and DEG screening

Multiple PCa-related datasets were retrieved from GEO and preprocessed into a high-quality gene-expression matrix. PCA revealed dataset-specific batch effects pre-correction (Figure 1A), which were effectively mitigated post-correction (Figure 1B). Differential expression analysis identified 222 DEGs significantly associated with tumor biology (Figure 1C).

Figure 1 PCA and differential expression analysis. PCA results: (A) before batch correction; (B) after batch correction; (C) 222 PCa-related DEGs (P value was adjusted to control the false-positive rate while minimizing false-negative results). DEGs, differentially expressed genes; logFC, logarithm fold change; PC, principal component; PCA, principal component analysis; PCa, prostate cancer.

Enrichment analysis results

GO analysis showed DEGs were chiefly enriched in functions such as fatty acid metabolic process, collagen-containing extracellular matrix (ECM), and sulfur compound binding (Figure 2A,2B). KEGG analysis indicated enrichment in pathways like focal adhesion, cytoskeleton in muscle cells, and drug metabolism (Figure 2C,2D), suggesting crucial roles of DEGs in PCa progression.

Figure 2 Enrichment analysis. (A,B) GO enrichment analysis for DEGs. (C,D) KEGG pathway analysis for DEGs. BMP, bone morphogenetic protein; BP, biological process; CC, cellular component; DEGs, differentially expressed genes; ECM, extracellular matrix; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; MF, molecular function; TGF, transforming growth factor.

Machine learning model building and feature gene selection

Machine learning algorithms, including LASSO regression (Figure 3A), SVM (Figure 3B,3C), and RF (Figure 3D,3E), were used to build predictive models with high accuracy confirmed by cross-validation. A set of feature genes crucial for PCa prediction was identified. LASSO regression, SVM, and RF screened 25, 24, and 17 disease characteristic genes, respectively. Finally, eight core feature genes were identified through intersection (Figure 3F), showing significant expression differences between the PCa (treat) and normal (control) groups (Figure 3G).

Figure 3 Machine learning model building and feature gene selection. (A) The cross-validated outcome of LASSO regression analysis. Graphs showing (B) SVM analysis accuracy and (C) cross-validation errors. (D) Forest plot (green, red, and black lines respectively indicate the cross-validation errors of the control group, treat group, and all samples) and (E) gene importance scores from RF analysis. (F) Eight intersection feature genes. (G) Box plot for feature genes. ***, P value <0.001. CV, cross-validation; LASSO, least absolute shrinkage and selection operator; RF, random forest; SVM, support vector machine.

SHAP analysis

This study incorporated 10 machine learning models: PLS, RF, DTs, SVM, Logistic, KNN, XGBoost, GBM, NeuralNet, and glmBoost (Figure 4A). ROC curve analysis showed that the RF model had the highest AUC value (0.978), making it the optimal model for subsequent SHAP analysis.

Figure 4 SHAP analysis. (A) ROC curve. (B) Importance ranking of feature genes. (C) Visualization of SHAP variables, with the included feature genes sorted by the average absolute value of SHAP from highest to lowest. Yellow dots denote higher feature values (expression) and purple dots lower ones. DTs, decision trees; GBM, gradient boosting machine; glmBoost, generalized linear model boosting; KNN, K-nearest neighbors; Logistic, logistic regression; NeuralNet, neural network; PLS, partial least squares; RF, random forest; ROC, receiver operating characteristic; SHAP, SHapley Additive exPlanations; SVM, support vector machine; XGBoost, extreme gradient boosting.

The SHAP library was used to interpret and visualize the model, presenting a global view of the eight core genes (Figure 4B). On the X-axis, data points on the right indicated positive associations with PCa development, while those on the left showed negative associations. The scatter plot colors reflected feature values, with yellow for high expression and purple for low expression. For example, TRPM4 and EFCAB4A overexpression raised PCa risk, while low expression of EDN3, FAM83B, PENK, NUDT10, KRT14, and CXCL13 did the same (Figure 4C). The SHAP interpretation suggested that expression changes in these eight genes may play an important role in the onset and progression of PCa.

GSEA and immune cell infiltration analysis

In this study, GSEA was performed for each of the eight core genes by comparing high- and low-expression groups of PCa samples, revealing enrichment in PCa-related pathways. Only the expression groups associated with higher PCa risk are shown here (Figure 5).

Figure 5 GSEA results for (A) high EFCAB4A group; (B) high TRPM4 group; (C) low CXCL13 group; (D) low EDN3 group; (E) low FAM83B group; (F) low KRT14 group; (G) low NUDT10 group; (H) low PENK group. GSEA, gene set enrichment analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes.

CIBERSORT was used for immune cell infiltration analysis, yielding the relative proportions of 22 immune cell types. The infiltration levels of naive B cells, follicular helper T cells, activated natural killer (NK) cells, and resting mast cells differed significantly between the PCa group and normal group (Figure 6A). In normal samples, KRT14 expression was positively correlated with the infiltration levels of naive B cells and activated NK cells, and TRPM4 expression was negatively correlated with follicular helper T cell levels (Figure 6B). In PCa samples, these correlations between the core genes and immune cells were no longer observed (Figure 6C).

Figure 6 Immune infiltration analysis. (A) Immune infiltration analysis between PCa and normal groups. (B) Correlation analysis in normal samples. (C) Correlation analysis in PCa samples. NK, natural killer; PCa, prostate cancer.

Discussion

In this study, differential expression analysis was conducted on the PCa transcriptome dataset to identify DEGs, followed by the application of machine learning algorithms to distill feature genes. The potential influence of these signature genes on PCa was explored. Additionally, the correlation between PCa and immune responses was examined to reveal possible immune mechanisms in PCa. Furthermore, a machine learning model based on the RF algorithm was established. Using SHAP for interpretation and visualization, the model elucidated the key genes affecting PCa progression and offered a theoretical foundation for precision PCa treatment.

Starting from human transcriptome data in the GEO database, this study identified 222 DEGs. GO analysis revealed that these DEGs were chiefly enriched in functions such as fatty acid metabolic process, collagen-containing ECM, and sulfur compound binding. KEGG analysis indicated that these genes were predominantly enriched in pathways like focal adhesion, cytoskeleton in muscle cells, drug metabolism-cytochrome P450, and drug metabolism-other enzymes.

These functions and pathways may be linked to PCa development. Enhanced fatty acid metabolism supplies energy and biosynthetic materials for PCa cells, supporting their rapid proliferation. It also alters cell membrane composition, boosting invasion and metastasis, and shapes the tumor microenvironment via metabolic products, influencing immune cell function and associating with tumor invasion and prognosis. The collagen-containing ECM offers structural support and adhesion sites for PCa cells. It activates signaling pathways that promote cell proliferation and migration. Its remodeling facilitates tumor invasion, affects drug sensitivity, and is closely related to tumor metastasis. Sulfur compound binding is involved in the antioxidant defense and detoxification processes of PCa cells. It regulates oxidative stress, impacts signal transduction and immune cell function, and is associated with tumor invasion and metastasis.

The focal adhesion pathway plays a key role in PCa cell adhesion, migration, and invasion. It activates downstream signaling pathways through integrin-ECM interactions, driving tumor cell proliferation and survival, and impacting tumor microenvironment remodeling and angiogenesis, thus advancing PCa. The cytoskeleton is crucial for maintaining cell shape, movement, and signaling. In PCa, cytoskeletal changes can enhance tumor cell motility and invasive/metastatic potential. It also helps tumor cells respond to mechanical signals and interact with their microenvironment. The drug metabolism pathway significantly impacts PCa treatment. It affects drug pharmacokinetics (absorption, distribution, metabolism, excretion), influencing drug concentrations and efficacy in tumors. Abnormal drug-metabolism-enzyme expression can lead to drug resistance, a major challenge in prostate-cancer treatment. Moreover, toxic drug-metabolism by-products may harm normal tissues, increasing treatment side effects.

Using LASSO regression, SVM, and RF methods, we extracted eight core genes (TRPM4, EDN3, EFCAB4A, FAM83B, PENK, NUDT10, KRT14, and CXCL13). Then, we reconstructed and evaluated machine learning models based on these genes using various algorithms. The optimal RF model was selected and interpreted using SHAP. Our research demonstrated that these eight core genes were significantly associated with PCa. The model we developed could visualize the impact of these core genes on the disease and support PCa biomarker discovery. GSEA results also indicated that these genes were linked to pathways involved in PCa pathogenesis. Moreover, the findings of immune cell infiltration analysis further suggested a close relationship between these genes and PCa. Previous research has examined the relationship between the CXCL13 and TRPM4 genes and PCa (14,15), whereas studies on the link between the EFCAB4A gene and cancer remain scarce. Therefore, the following discussion focuses on the potential relationship between the remaining five genes and PCa.

EDN3 plays a significant role in PCa development and progression. EDN3 suppresses PCa cell glycolysis through the cGMP/PKG pathway, thereby reducing cellular proliferation. Research indicates that overexpression of EDNRB or EDN3 decreases the baseline extracellular acidification rate (ECAR) while increasing the baseline oxygen consumption rate (OCR) (16). This metabolic shift from glycolysis to oxidative phosphorylation in tumor cells inhibits their growth. By activating EDNRB and subsequent downstream pathways, such as the cGMP/PKG pathway, EDN3 curtails tumor cell proliferation and survival. Furthermore, the EDN3/EDNRB pathway influences the expression of apoptosis-related genes, promoting tumor cell apoptosis and, consequently, inhibiting tumor progression. In the context of PCa, EDNRB is frequently hypermethylated and underexpressed (17). This epigenetic alteration may compromise EDN3 function, thereby potentially impacting the progression of the disease.

A study indicates that in PCa, KRT14-positive cell subpopulations exhibit high proliferative capacity. They regulate cell cycle-related genes, promoting cell cycle progression and driving PCa cell proliferation. KRT14 expression may also interact with apoptosis mechanisms, influencing PCa cell apoptosis and affecting tumor progression (18). KRT14, a member of the cytokeratin family, plays a role in the cytoskeleton network. Altered KRT14 expression can enhance tumor cell motility, invasion, and metastasis. It also participates in tumor cells’ response to mechanical signals, facilitating their interaction with the microenvironment (19,20). In normal prostate tissue, basal cells expressing KRT14 and other cytokeratins have stem cell-like properties and may play a significant role in PCa development. Epigenetic dysregulation, such as DNA methylation, can affect KRT14 expression, promoting malignant PCa cell phenotypes. KRT14 may interact with other signaling pathways, such as the Wnt pathway to regulate cell proliferation and differentiation, or the transforming growth factor-β (TGF-β) pathway to impact the epithelial-mesenchymal transition (EMT) process, thereby enhancing tumor cell invasion and metastasis (21). Aberrant KRT14 expression can influence PCa cells’ drug sensitivity (22). Some studies have found that KRT14 expression levels correlate with PCa cells’ responses to certain chemotherapeutic or targeted drugs. KRT14 may modulate intracellular signaling pathways or metabolic processes, affecting drug efficacy and impacting PCa treatment outcomes and prognosis.

A study has shown that NUDT10 may promote PCa cell proliferation by regulating cell cycle-related gene expression, providing the energy and metabolic intermediates needed for rapid cell proliferation to meet the growth and division demands of tumor cells (23). Another study found that the NUDT10 gene is hypermethylated in PCa, and this epigenetic change can lead to dysregulated NUDT10 expression (24). Since hypermethylation generally suppresses gene expression, reduced NUDT10 expression may disrupt intracellular metabolic balance and signaling, thereby promoting a malignant phenotype in PCa cells. In addition to DNA methylation, other epigenetic mechanisms like histone modification may also affect NUDT10’s transcriptional activity, indirectly influencing PCa cell growth and survival. Some studies have reported that NUDT10 expression levels in PCa cells are associated with responses to certain chemotherapeutic or targeted drugs. Cells with low NUDT10 expression may have reduced drug sensitivity, thereby affecting treatment outcomes.

Studies have demonstrated that the PENK gene is underexpressed in PCa, which may be due to its hypermethylation. As a general rule, hypermethylation represses gene expression. The reduced expression of PENK can disrupt intracellular metabolic balance and signaling, thereby promoting the malignant phenotype of PCa cells (25,26). Similar findings have been observed in other cancers, such as pancreatic cancer, where the downregulation of the PENK gene is also associated with hypermethylation. In colon cancer, the PENK protein has been proven to act as an apoptosis activator, especially when chemotherapeutic drugs are used. In PCa, the downregulation of PENK may reduce tumor cells’ sensitivity to apoptosis signals, thus facilitating tumor progression. Research has also indicated that the absence of PENK expression might be linked to enhanced invasion and metastasis of PCa (27). The reduced expression of PENK in PCa cells can affect cell adhesion, movement, and invasion, thereby promoting tumor cell metastasis. Studies have found a significant relationship between PENK expression levels and immune cell infiltration. PENK may influence the immune response in the tumor microenvironment by modulating immune cell infiltration and function (28,29). The infiltration and functional status of immune cells are crucial for tumor progression and treatment response. The downregulation of PENK may affect immune cell recruitment, activation, and function, thereby altering the immune response in the tumor microenvironment and creating conditions for tumor cell immune escape.

FAM83B may function as a downstream effector in the EGFR-RAS-MAPK pathway, mediating oncogenic transformation driven by EGFR and RAS. In PCa, FAM83B might activate analogous signaling pathways, thereby promoting cell proliferation and survival and propelling tumor progression (30). FAM83B can also orchestrate the activation of the PI3K/AKT and MAPK pathways. Once activated, these two pathways collaborate to facilitate the transformation of epithelial cells and confer resistance to targeted therapies. In the context of PCa, FAM83B may amplify the activity of these two pathways, endowing cells with enhanced proliferative capacity and survival advantage, which in turn may lead to the development of PCa (31). The aberrant expression of FAM83B could result in cell cycle perturbations and the dysregulation of apoptosis mechanisms. For example, by activating the aforementioned signaling pathways, FAM83B may facilitate the transition of cells from the G1 phase to the S phase, thereby accelerating proliferation. Meanwhile, it may suppress the expression or function of apoptosis-related proteins, enabling cells to evade normal apoptotic regulation and gradually form tumor foci within prostate tissue (32).

This study integrated multi-chip data and combined differential expression analysis, functional enrichment, machine learning, and SHAP analysis to comprehensively mine PCa-related genes. Rigorous data preprocessing supported the reliability of the analysis. The study examined not only gene expression differences but also their enrichment in functions and pathways, revealing potential mechanisms. It also combined machine learning with SHAP analysis to build a highly accurate predictive model and to explain the contribution of each feature gene. GSEA and immune cell infiltration analyses further elucidated the tumor immune microenvironment and its relationship with gene expression, providing potential targets for immunotherapy.
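To make the idea behind SHAP concrete, the following minimal sketch computes exact Shapley values for a single sample under a toy linear score. It is an illustration of the underlying principle only, not the pipeline used in this study: the function names (`shapley_values`, `toy_score`), the placeholder gene labels, the weights, and the expression values are all hypothetical, and "absent" features are simply replaced by a reference (background) value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one sample x.

    Features outside a coalition are set to their background (reference)
    value. The cost grows as 2^n, so this is only practical for a handful
    of features.
    """
    n = len(x)
    phi = [0.0] * n

    def value(coalition):
        # Model output when only the features in `coalition` take the sample's values.
        z = [x[i] if i in coalition else background[i] for i in range(n)]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model: placeholder labels and arbitrary weights,
# NOT fitted to any data and NOT results from this study.
GENES = ["gene_a", "gene_b", "gene_c"]
WEIGHTS = [-0.8, -0.5, 1.2]


def toy_score(z):
    """Linear risk score standing in for a trained classifier."""
    return sum(w * v for w, v in zip(WEIGHTS, z))


sample = [1.0, 2.0, 3.0]      # made-up expression values for one sample
background = [0.0, 0.0, 0.0]  # made-up reference expression

phi = shapley_values(toy_score, sample, background)

# Rank features by the magnitude of their contribution to this prediction.
ranked = sorted(zip(GENES, phi), key=lambda pair: abs(pair[1]), reverse=True)
for gene, contribution in ranked:
    print(gene, round(contribution, 3))
```

For a linear score, each Shapley value reduces to the weight multiplied by the difference between the sample and reference values, and the contributions sum to the difference between the sample's score and the reference score. This additivity is what allows per-gene contributions to be read directly from a SHAP summary.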

This study was based on GEO database transcriptome data. Despite rigorous preprocessing, data bias might have existed due to diverse experimental designs and sample variations across studies. The identified core genes lacked in-depth experimental validation, such as through knockout or overexpression experiments. The analysis primarily focused on gene expression and functional enrichment, with insufficient exploration of gene regulatory networks and epigenetic mechanisms. Moreover, the immune cell infiltration analysis, being computationally predicted, needs further validation through experimental techniques.

Conclusions

In conclusion, this study integrated bioinformatics methodologies with interpretable machine learning techniques to uncover potential biomarkers for the diagnosis and treatment of PCa.

Acknowledgments

The authors thank the GEO database for providing open-access data.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tau.amegroups.com/article/view/10.21037/tau-2025-242/rc

Peer Review File: Available at https://tau.amegroups.com/article/view/10.21037/tau-2025-242/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tau.amegroups.com/article/view/10.21037/tau-2025-242/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Massanova M, Vere R, Robertson S, et al. Clinical and prostate multiparametric magnetic resonance imaging findings as predictors of general and clinically significant prostate cancer risk: A retrospective single-center study. Curr Urol 2023;17:147-52.
2. Barone B, Napolitano L, Calace FP, et al. Reliability of Multiparametric Magnetic Resonance Imaging in Patients with a Previous Negative Biopsy: Comparison with Biopsy-Naïve Patients in the Detection of Clinically Significant Prostate Cancer. Diagnostics (Basel) 2023;13:1939.
3. Wang Q, Li CL, Wu L, et al. Distinct molecular subtypes of systemic sclerosis and gene signature with diagnostic capability. Front Immunol 2023;14:1257802.
4. Ding Z, Deng Z, Li H. Single-cell transcriptome analysis reveals the key genes associated with macrophage polarization in liver cancer. Hepatol Commun 2023;7:e0304.
5. Tan Z, Chen X, Zuo J, et al. Comprehensive analysis of scRNA-Seq and bulk RNA-Seq reveals dynamic changes in the tumor immune microenvironment of bladder cancer and establishes a prognostic model. J Transl Med 2023;21:223.
6. Chen S, Zhang Y, Ding X, et al. Identification of lncRNA/circRNA-miRNA-mRNA ceRNA Network as Biomarkers for Hepatocellular Carcinoma. Front Genet 2022;13:838869.
7. Xiong W, Zhong J, Li Y, et al. Identification of Pathologic Grading-Related Genes Associated with Kidney Renal Clear Cell Carcinoma. J Immunol Res 2022;2022:2818777.
8. Wang J, Wu N, Feng X, et al. PROS1 shapes the immune-suppressive tumor microenvironment and predicts poor prognosis in glioma. Front Immunol 2022;13:1052692.
9. Zhou W, Li H, Zhang J, et al. Identification and mechanism analysis of biomarkers related to butyrate metabolism in COVID-19 patients. Ann Med 2025;57:2477301.
10. Lai Y, Lin P, Lin F, et al. Identification of immune microenvironment subtypes and signature genes for Alzheimer's disease diagnosis and risk prediction based on explainable machine learning. Front Immunol 2022;13:1046410.
11. Li Y, Yan F, Xiang J, et al. Identification and experimental validation of immune-related gene PPARG is involved in ulcerative colitis. Biochim Biophys Acta Mol Basis Dis 2024;1870:167300.
12. Feng J, Hu Y, Peng P, et al. Potential biomarkers of aortic dissection based on expression network analysis. BMC Cardiovasc Disord 2023;23:147.
13. Chen Y, Huang W, Ouyang J, et al. Identification of Anoikis-Related Subgroups and Prognosis Model in Liver Hepatocellular Carcinoma. Int J Mol Sci 2023;24:2862.
14. El-Haibi CP, Singh R, Sharma PK, et al. CXCL13 mediates prostate cancer cell proliferation through JNK signalling and invasion through ERK activation. Cell Prolif 2011;44:311-9.
15. Sagredo AI, Sagredo EA, Pola V, et al. TRPM4 channel is involved in regulating epithelial to mesenchymal transition, migration, and invasion of prostate cancer cell lines. J Cell Physiol 2019;234:2037-50.
16. Li X, Liu B, Wang S, et al. EDNRB negatively regulates glycolysis to exhibit anti-tumor functions in prostate cancer by cGMP/PKG pathway. Mol Cell Endocrinol 2025;598:112459.
17. Zhang P, Qian B, Liu Z, et al. Identification of novel biomarkers of prostate cancer through integrated analysis. Transl Androl Urol 2021;10:3239-54.
18. Sayan M, Tuac Y, Akgul M, et al. Prognostic Significance of the Cribriform Pattern in Prostate Cancer: Clinical Outcomes and Genomic Alterations. Cancers (Basel) 2024;16:1248.
19. D'Auria F, Valvano L, Rago L, et al. Monoclonal B-cell lymphocytosis and prostate cancer: incidence and effects of radiotherapy. J Investig Med 2019;67:779-82.
20. Wang Z, Wang Y, Peng M, et al. UBASH3B Is a Novel Prognostic Biomarker and Correlated With Immune Infiltrates in Prostate Cancer. Front Oncol 2019;9:1517.
21. Carneiro I, Quintela-Vieira F, Lobo J, et al. Expression of EMT-Related Genes CAMK2N1 and WNT5A is increased in Locally Invasive and Metastatic Prostate Cancer. J Cancer 2019;10:5915-25.
22. Su X, Long Q, Bo J, et al. Mutational and transcriptomic landscapes of a rare human prostate basal cell carcinoma. Prostate 2020;80:508-17.
23. Li W, Gu M. NUDT11 rs5945572 polymorphism and prostate cancer risk: a meta-analysis. Int J Clin Exp Med 2015;8:3474-81.
24. Kamdar S, Isserlin R, Van der Kwast T, et al. Exploring targets of TET2-mediated methylation reprogramming as potential discriminators of prostate cancer progression. Clin Epigenetics 2019;11:54.
25. Goo YA, Goodlett DR, Pascal LE, et al. Stromal mesenchyme cell genes of the human prostate and bladder. BMC Urol 2005;5:17.
26. Pascal LE, Ai J, Vêncio RZ, et al. Differential Inductive Signaling of CD90 Prostate Cancer-Associated Fibroblasts Compared to Normal Tissue Stromal Mesenchyme Cells. Cancer Microenviron 2011;4:51-9.
27. Liu AY. Prostate cancer research: tools, cell types, and molecular targets. Front Oncol 2024;14:1321694.
28. Liu AY. The opposing action of stromal cell proenkephalin and stem cell transcription factors in prostate cancer differentiation. BMC Cancer 2021;21:1335.
29. Ashour N, Angulo JC, Andrés G, et al. A DNA hypermethylation profile reveals new potential biomarkers for prostate cancer diagnosis and prognosis. Prostate 2014;74:1171-82.
30. Bartel CA, Parameswaran N, Cipriano R, et al. FAM83 proteins: Fostering new interactions to drive oncogenic signaling and therapeutic resistance. Oncotarget 2016;7:52597-612.
31. Grant S. FAM83A and FAM83B: candidate oncogenes and TKI resistance mediators. J Clin Invest 2012;122:3048-51.
32. Jiang Y, Yu J, Zhu T, et al. Involvement of FAM83 Family Proteins in the Development of Solid Tumors: An Update Review. J Cancer 2023;14:1888-903.

Cite this article as: Yuan J, Zhou D, Yu S. Interpretable machine learning driven biomarker identification and validation for prostate cancer. Transl Androl Urol 2025;14(6):1528-1541. doi: 10.21037/tau-2025-242
