Determining accuracy of diagnosis and management of common presenting semen analyses using artificial intelligence programs
Original Article


Baylor Price1, Madeline Helm1, Christopher M. Deibert2

1College of Medicine, University of Nebraska Medical Center, Omaha, NE, USA; 2Division of Urologic Surgery, University of Nebraska Medical Center, Omaha, NE, USA

Contributions: (I) Conception and design: CM Deibert; (II) Administrative support: All authors; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: B Price, M Helm; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Madeline Helm, BS. College of Medicine, University of Nebraska Medical Center, SSP2015, Omaha, NE 68198, USA. Email: mhelm@unmc.edu.

Background: Over the past few years, artificial intelligence (AI) platforms have rapidly gained popularity within medicine. While AI has been applied in various subspecialties of urology, its role in evaluating male factor infertility has not been explored. The objective of this study was to evaluate the diagnostic accuracy of two commonly used AI programs, Google’s “Bard” and Bing. This study aimed to assess each program’s accuracy in correctly diagnosing a sample patient’s semen analysis results and recommending appropriate next steps following diagnosis.

Methods: Each respective AI program was given a set of data which included semen volume, pH, concentration, and sperm motility as a percentage along with a command to list the three most likely diagnoses and the next steps the patient should take. The data sets ranged from entirely normal to abnormal with clearly obstructive and non-obstructive azoospermia, teratozoospermia, oligospermia, or asthenospermia. Study personnel determined the clinical diagnostic accuracy of both Bard’s and Bing’s semen analysis interpretations. No patient data was utilized for this study.

Results: Bing achieved only 29% accuracy of semen analysis interpretation, while 57% of its responses were partially correct. The first, second, and third diagnoses provided were 43%, 29%, and 43% accurate, respectively. Bing was 100% accurate in the next steps the patient should take and recommended discussing results with a physician 100% of the time. Bard was slightly more accurate, interpreting the semen analyses with 50% accuracy. Its first, second, and third diagnoses were 75%, 25%, and 25% accurate, respectively. Bard had 75% accuracy regarding next steps but likewise recommended discussing results with a physician 100% of the time.

Conclusions: Overall, Bard was more accurate in providing correct analytical information regarding semen analysis (50% vs. 29%) and generated a more consistently accurate first diagnosis (75% vs. 43%). Bing was more accurate regarding next steps (100% vs. 75%). Both programs recommended discussing semen analysis results with a physician. Overall, Bing and Bard are not capable of consistently providing patients with accurate analysis, diagnosis, or next steps when given a sample semen analysis. Specific training sets must be developed to provide patients with accurate interpretation of their urological results in a user-friendly format that can be further addressed with their physician.

Keywords: Semen analysis; artificial intelligence (AI); male infertility; spermatozoa; machine learning


Submitted Jul 19, 2025. Accepted for publication Sep 16, 2025. Published online Dec 24, 2025.

doi: 10.21037/tau-2025-508


Highlight box

Key findings

• Google’s Bard and Bing’s artificial intelligence (AI) programs demonstrated limited accuracy interpreting semen analysis results (50% and 29%, respectively).

What is known and what is new?

• AI has been increasingly applied within urology, including prostate cancer imaging analysis, antibiotic selection for recurrent urinary tract infections, and real-time annotation during robotic surgery.

• This study is the first to evaluate the ability of general-purpose large language models to interpret semen analysis results.

What is the implication, and what should change now?

• Current AI programs lack the clinical reasoning skills required for accurate evaluation of male factor infertility, highlighting the need for continued refinement and validation of these technologies.


Introduction

Artificial intelligence (AI) platforms have rapidly gained popularity across diverse fields, from business and academia to medicine. In healthcare, this technology is beginning to be tested and applied in hopes of improving efficiency while reducing the cost of care and errors in physician decision making. However, current AI platforms have not consistently demonstrated accuracy, reliability, or credibility in clinical applications (1). Ethical considerations must also be addressed when developing and improving AI platforms in order to maintain patient safety and well-being (2).

In reproductive medicine, AI has been applied to embryo selection (3). By contrast, the role of AI in the evaluation of male factor infertility remains unexplored. Semen analysis is a routinely performed laboratory test in the investigation of male factor infertility. A semen sample is evaluated according to key attributes including fluid volume, pH, motility, and concentration. The World Health Organization (WHO) has set forth normal reference limits and guidelines for clinical evaluation for semen analysis values (4).

Given the routine nature of semen analysis, the potential application of AI programs in diagnosis and management of common male infertility conditions inspired our current study. To date, many studies have investigated the application of AI within the subspecialty of reproductive urology. These studies include, but are not limited to, the ability of AI to detect sperm in azoospermia samples for use in intracytoplasmic sperm injections, classify sperm morphology, and respond to clinical questions regarding male infertility (5-7).

To our knowledge, no study has investigated the use of AI platforms in diagnosing and managing a patient given their semen analysis. This study aimed to determine the accuracy of two of the most widely used AI programs, Google’s “Bard” (now Gemini) and Bing (powered by OpenAI’s ChatGPT), in providing a correct diagnosis as well as next steps based only on a patient’s semen analysis. We present this article in accordance with the STARD reporting checklist (8) (available at https://tau.amegroups.com/article/view/10.21037/tau-2025-508/rc).


Methods

A semen analysis consists of several components that give the physician data on which to base or support their clinical assessment of potential male infertility. These include semen volume (in mL), sample pH, sperm concentration (in millions of sperm per milliliter, typically ranging from 15 to 300), and sperm motility, which is measured via computer-assisted sperm analysis (4). For this study, sample semen analyses were created to reflect the range of normal and abnormal presentations that male reproductive urologists encounter in clinical practice.

Seven created sample semen analysis results were utilized for this study; no patient data was used. The created samples were based on established clinical thresholds for semen analysis parameters and ranged from entirely normal to results consistent with obstructive azoospermia (absence of sperm), non-obstructive azoospermia, teratozoospermia (morphologically abnormal sperm shape affecting fertility), oligospermia (fewer than 15 million sperm per mL of semen), asthenospermia (less than 40% sperm motility), and oligoasthenospermia (fewer than 15 million sperm per mL of semen and less than 40% sperm motility) (4). An example sample is as follows: oligoasthenospermia: semen volume, 2 mL; pH, 7.5; sperm concentration, 5 million/mL; sperm motility, 20%.
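The threshold-based labeling described above can be sketched as a short classifier. This is an illustrative sketch only, not part of the study's methods: the function and field names are hypothetical, and distinguishing obstructive from non-obstructive azoospermia requires clinical context (history, examination, hormonal testing) beyond the four parameters used here.

```python
def classify_sample(volume_ml, ph, concentration_m_per_ml, motility_pct):
    """Coarsely label a hypothetical semen analysis using the thresholds
    cited in the text: concentration <15 million/mL -> oligospermia,
    motility <40% -> asthenospermia, both -> oligoasthenospermia.
    Volume and pH are accepted but not used; they inform, e.g., the
    obstructive vs. non-obstructive distinction, which needs clinical context."""
    if concentration_m_per_ml == 0:
        # Azoospermia: absence of sperm; subtype requires further workup
        return "azoospermia (subtype requires further workup)"
    findings = []
    if concentration_m_per_ml < 15:
        findings.append("oligospermia")
    if motility_pct < 40:
        findings.append("asthenospermia")
    if not findings:
        return "normal by these four parameters"
    if len(findings) == 2:
        return "oligoasthenospermia"
    return findings[0]

# The worked example from the text: volume 2 mL, pH 7.5, concentration 5, motility 20%
print(classify_sample(2.0, 7.5, 5, 20))  # oligoasthenospermia
```

Note that teratozoospermia cannot be captured by such a sketch at all, since morphology was not among the four input parameters.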

This study utilized two of the most popular AI-powered search engines, Google’s “Bard” and Bing Chat. Each of the preselected sample semen analyses, alongside a question regarding diagnosis, next steps, and physician recommendation, was input into both search engines, and their exact responses were compiled and recorded in a document. Neither program was provided with additional clinical information or reference standard results. Because the programs did not always answer all relevant questions, the same questions sometimes had to be rephrased to generate a meaningful response. For example, typing “I am a man having difficulty conceiving. Help me interpret my semen analysis. Volume 2.0, pH 7.5, concentration 5, motility 20%. Please concisely list the 3 most likely diagnoses and next step I should take” prompted Bard to reply, “I’m a language model and don’t have the capacity to help with that.” However, rephrasing this as “Semen volume: 2 mL pH: 7.5 Sperm motility: 20% concentration: 5. What are the top three diagnoses for these results and next steps I should take?” generated a substantive response that included three possible diagnoses, next steps the user should take, and whether the AI program suggested a clinical visit with a physician. Each generated response was independently reviewed by two third-year medical students (B.P., M.H.), with discrepancies adjudicated by a board-certified urologist specializing in male infertility and andrology (C.M.D.).

For each prompt, six questions were analyzed for each AI program: accuracy of semen analysis interpretation; accuracy of the first, second, and third diagnoses; accuracy of the recommended next steps; and whether the program recommended scheduling a visit with a physician. Responses were graded by the authors as “accurate”, “inaccurate”, or “partially accurate/none” and entered into an Excel spreadsheet. Graphs were generated to show the accuracy of each AI program’s individual responses and overall performance as percentages. IRB approval and informed consent were not required for this study as no patient data was used.

Statistical analysis

Diagnostic accuracy was calculated as the percentage of correct responses for each AI program across six assessment criteria for each semen analysis dataset: (I) overall interpretation; (II-IV) first, second, and third differential diagnoses; (V) next steps; and (VI) recommendation to discuss results with a physician. Results are reported as percentages. Given the exploratory nature of this research and the limited sample size, only descriptive statistics were used.
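The descriptive calculation above amounts to tallying graded responses per criterion. A minimal sketch, assuming hypothetical variable names and a three-level grading scheme as described in the Methods (the study itself used an Excel spreadsheet, not code):

```python
from collections import Counter

def accuracy_pct(scores):
    """Percentage of responses graded 'accurate', rounded to the nearest
    whole percent, matching how results are reported in the paper."""
    counts = Counter(scores)
    return round(100 * counts["accurate"] / len(scores))

# e.g., Bing's overall interpretation across the seven samples:
# 2 accurate, 4 partially accurate, 1 inaccurate (2/7 = 29%)
bing_interpretation = (
    ["accurate"] * 2 + ["partially accurate"] * 4 + ["inaccurate"]
)
print(accuracy_pct(bing_interpretation))  # 29
```

Partially accurate responses are counted separately rather than as half-credit, which is one design choice among several; the paper reports them as their own percentage.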


Results

Bing achieved 29% (2/7) accuracy of semen analysis interpretation, while 57% (4/7) of responses were partially correct. The first, second, and third diagnoses provided were 43% (3/7), 29% (2/7), and 43% (3/7) accurate, respectively. Despite relatively low diagnostic accuracy, Bing achieved 100% (7/7) accuracy in recommending appropriate next steps for patients and consistently advised consultation with a physician. Figure 1A displays Bing’s performance.

Figure 1 Proportion of accurate, inaccurate, and partially accurate responses of (A) Bing and (B) Bard to the six study questions.

Bard achieved 50% (4/8) accuracy of semen analysis interpretation, while the remaining 50% (4/8) of responses were partially correct. The first, second, and third diagnoses provided were 75% (6/8), 25% (2/8), and 25% (2/8) accurate, respectively. Bard achieved 75% (6/8) accuracy regarding next steps and recommended discussing results with a physician in 100% (8/8) of cases. Figure 1B displays Bard’s performance.

Figure 1 illustrates the accuracy of Bing and Bard across the six evaluation domains. Bard demonstrated superior performance in initial diagnostic interpretation, while Bing more consistently provided accurate next step recommendations.


Discussion

AI has been integrated into many areas of technology and business, but its application in medicine is only beginning to be explored (9). In this study, we found that two commonly used AI programs, Bing and Bard, were unable to consistently and accurately interpret semen analysis results. While both programs were able to recommend follow-up with a physician and produce appropriate responses to questions regarding male fertility, their diagnostic accuracy was variable and frequently incorrect.

These findings highlight an important cognitive gap between structured laboratory data and the nuanced clinical reasoning required for infertility evaluation. Interpretation of semen analysis parameters alone is often insufficient for diagnosis, as clinicians also rely on contextual elements such as patient history, physical examination, and additional testing. Current general-purpose large language models (LLMs) are not designed to carry out complex clinical reasoning, which limits their reliability and practical utility in male reproductive health.

Our results align with prior reports that AI systems achieve greater clinical utility when designed with domain-specific datasets and targets (10). In urology, machine learning models have demonstrated promising results in radiomic analysis of prostate cancer imaging, antibiotic selection in recurrent urinary tract infections, and real-time annotation during robotic surgery (11-13). These systems were trained with structured, clinically validated data and optimized for defined clinical outcomes. This distinction illustrates the importance of developing dedicated reproductive urology AI tools rather than relying on general-purpose models.

Interestingly, despite poor diagnostic performance, both AI programs consistently advised consultation with a physician. This likely reflects reinforcement in model training toward prioritizing patient safety in health-related responses. While not sufficient for diagnostic accuracy, this safety bias may still be valuable if future AI systems are designed to complement, rather than replace, clinical decision-making.

AI has already been applied in reproductive medicine. For example, a 2022 study demonstrated that an AI model could generate a fertility likelihood score for day-five embryos selected for in-vitro fertilization (IVF) (3). These AI-generated scores, based on known morphological features, were positively correlated with pregnancy outcomes and reduced time to conception. Additionally, predictive models using machine learning, a subtype of AI, have been used to identify men most likely to recover sperm production after varicocelectomy. In that study, 87% of patients predicted by the AI model to improve sperm production were correctly identified during post-procedure follow-up (14).

This study has several limitations. First, only two AI platforms were evaluated, whereas numerous others are currently available. Second, the semen analysis values were hypothetical as opposed to samples from real patients, which restricts real-world applicability. Third, only four parameters (semen volume, pH, concentration, and motility) were included; other clinically significant elements such as morphology, total sperm count, vitality testing, and sperm DNA fragmentation (SDF) were not incorporated. Additionally, although responses were independently scored by two evaluators and discrepancies adjudicated by the faculty reviewer, a formal inter-rater agreement analysis was not performed, which may limit the objectivity of the results. Finally, the small sample size (n=7) limits generalizability.

Despite these limitations, this exploratory work illustrates both the current shortcomings and potential opportunities of AI integration into reproductive urology. Development of domain-specific tools that incorporate validated semen parameters and contextual clinical reasoning could enhance diagnostic accuracy, improve efficiency, and support both patients and providers in male infertility evaluation and management.


Conclusions

In this exploratory study, Bard demonstrated higher accuracy than Bing in interpreting semen analysis parameters and providing initial diagnoses, while Bing was more consistent in suggesting appropriate next steps. Both platforms reliably recommended consultation with a physician.

These findings reinforce that general-purpose LLMs are not yet suitable for independent use in male infertility evaluation. Nonetheless, they also highlight opportunities for developing domain-specific AI tools trained with validated urological datasets. Such systems could improve efficiency, reduce errors, and enhance both patient and physician experiences.

While exploratory in nature, our study underscores the need for future research to expand sample size, incorporate additional semen parameters such as morphology, total count, vitality, and SDF, and evaluate a broader range of AI platforms. Ultimately, carefully designed reproductive urology AI models hold the potential to provide user-friendly, accurate, clinically significant interpretations that can empower patients while supporting physicians in the care of male infertility.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://tau.amegroups.com/article/view/10.21037/tau-2025-508/rc

Data Sharing Statement: Available at https://tau.amegroups.com/article/view/10.21037/tau-2025-508/dss

Peer Review File: Available at https://tau.amegroups.com/article/view/10.21037/tau-2025-508/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tau.amegroups.com/article/view/10.21037/tau-2025-508/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. IRB approval and informed consent are not required for this study as no patient data was used.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Bajwa J, Munir U, Nori A, et al. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthc J 2021;8:e188-94. [Crossref] [PubMed]
  2. Cacciamani GE, Chen A, Gill IS, et al. Artificial intelligence and urology: ethical considerations for urologists and patients. Nat Rev Urol 2024;21:50-9. [Crossref] [PubMed]
  3. Diakiw SM, Hall JMM, VerMilyea M, et al. An artificial intelligence model correlated with morphological and genetic features of blastocyst quality improves ranking of viable embryos. Reprod Biomed Online 2022;45:1105-17. [Crossref] [PubMed]
  4. World Health Organization. WHO laboratory manual for the examination and processing of human semen. 6th ed. Geneva: World Health Organization; 2021. Accessed September 4, 2025. Available online: https://www.who.int/publications/i/item/9789240030787
  5. Goss DM, Vasilescu SA, Vasilescu PA, et al. Evaluation of an artificial intelligence-facilitated sperm detection tool in azoospermic samples for use in ICSI. Reprod Biomed Online 2024;49:103910. [Crossref] [PubMed]
  6. Shahali S, Murshed M, Spencer L, et al. Morphology Classification of Live Unstained Human Sperm Using Ensemble Deep Learning. Advanced Intelligent Systems 2024;6:2400141. [Crossref]
  7. Gokmen O, Gurbuz T, Devranoglu B, et al. Artificial intelligence and clinical guidance in male reproductive health: ChatGPT4.0’s AUA/ASRM guideline compliance evaluation. Andrology 2025;13:176-83. [Crossref] [PubMed]
  8. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527. [Crossref] [PubMed]
  9. Alowais SA, Alghamdi SS, Alsuhebany N, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ 2023;23:689. [Crossref] [PubMed]
  10. Rajpurkar P, Chen E, Banerjee O, et al. AI in health and medicine. Nat Med 2022;28:31-8. [Crossref] [PubMed]
  11. Liu Y, Wu J, Ni X, et al. Machine learning based on automated 3D radiomics features to classify prostate cancer in patients with prostate-specific antigen levels of 4-10 ng/mL. Transl Androl Urol 2025;14:1025-35. [Crossref] [PubMed]
  12. Cai T, Anceschi U, Prata F, et al. Artificial Intelligence Can Guide Antibiotic Choice in Recurrent UTIs and Become an Important Aid to Improve Antimicrobial Stewardship. Antibiotics (Basel) 2023;12:375. [Crossref] [PubMed]
  13. Zuluaga L, Rich JM, Gupta R, et al. AI-powered real-time annotations during urologic surgery: The future of training and quality metrics. Urol Oncol 2024;42:57-66. [Crossref] [PubMed]
  14. Ory J, Tradewell MB, Blankstein U, et al. Artificial Intelligence Based Machine Learning Models Predict Sperm Parameter Upgrading after Varicocele Repair: A Multi-Institutional Analysis. World J Mens Health 2022;40:618-26. [Crossref] [PubMed]
Cite this article as: Price B, Helm M, Deibert CM. Determining accuracy of diagnosis and management of common presenting semen analyses using artificial intelligence programs. Transl Androl Urol 2025;14(12):3867-3871. doi: 10.21037/tau-2025-508
