206 - Conflicting Counsel: Inherent Variability and Contextual Bias in Patient-Directed LLM Recommendations for Uterine Fibroids and Benign Prostatic Hyperplasia
Purpose: To investigate the baseline variability of patient-directed procedural recommendations among different Large Language Models (LLMs) for uterine fibroids and benign prostatic hyperplasia (BPH), and to quantify their susceptibility to bias from contextual clinical guidelines.
Materials and Methods: Three leading LLMs (Gemini 2.5, GPT-5, and Grok-4) were repeatedly prompted with patient-centric queries about medication-refractory, symptomatic uterine fibroids and BPH (e.g., “I have fibroids. I've tried every possible medication, but nothing works. What specific treatment would you recommend to treat my fibroids?”). Prompts were evaluated with and without contextual clinical guidelines from professional societies (ACOG, AUA, SIR), including a test condition that provided all relevant guidelines simultaneously. Lexical modifiers simulated patient intentions (e.g., “I want to minimize complications”). Each prompt condition was queried 10 times per model; if output variability was observed, the condition was queried an additional 20 times (30 total) to characterize the response distribution. Procedural recommendations were labeled and analyzed using multinomial logistic regression.
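To illustrate the analysis step only, a minimal sketch of the multinomial logistic regression is shown below in Python (pandas/statsmodels); the data frame, column names, and category labels are assumed placeholders rather than the study's actual response data.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical long-format table: one row per LLM response, with the labeled
    # procedural recommendation as the categorical outcome. Column names, category
    # labels, and the synthetic data are placeholders, not the study's dataset.
    rng = np.random.default_rng(0)
    n = 300
    df = pd.DataFrame({
        "model": rng.choice(["Gemini 2.5", "GPT-5", "Grok-4"], n),
        "guideline_context": rng.choice(["none", "ACOG", "AUA", "SIR", "all"], n),
        "recommendation": rng.choice(["UAE", "myomectomy", "HoLEP", "PAE"], n),
    })

    # Dummy-encode the predictors; the outcome is integer-coded for MNLogit.
    X = sm.add_constant(
        pd.get_dummies(df[["model", "guideline_context"]], drop_first=True).astype(float)
    )
    y = df["recommendation"].astype("category").cat.codes

    # Fit the multinomial logit; coefficient p-values in the summary indicate
    # whether model identity and guideline context shift the recommendation mix.
    fit = sm.MNLogit(y, X).fit(disp=False)
    print(fit.summary())

In an analysis of this kind, each row would be one labeled model response, and tests on the model and guideline-context terms would correspond to the reported p-values.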
Results: Significant inter- and intra-model variability was observed at baseline. For identical queries, different models provided significantly different recommendations (p < 0.001), and even the same model could offer different primary recommendations across repeated queries. Providing the LLMs with a subset of the clinical guidelines as context significantly influenced their recommendations (p < 0.001); however, not all society guidelines had this effect (p > 0.05 for some guidelines). Specific model biases were evident: Grok-4 consistently favored uterine artery embolization for fibroids regardless of context (p < 0.001), whereas GPT-5 preferentially recommended holmium laser enucleation of the prostate (HoLEP) for BPH (p < 0.001), often omitting prostatic artery embolization (PAE) as a viable alternative unless SIR guidelines were provided as context.
Conclusion: Patients face a dual risk: the choice of LLM dictates the advice received because of high baseline variability, and that advice is vulnerable to undisclosed contextual biasing. Given that one in six U.S. adults uses AI chatbots monthly for health information [1], this inconsistency is a critical public health concern. As these models become de facto 'referrers', their inherent biases and variability could significantly alter patient pathways and procedural volumes across medical specialties. A key limitation is the evolving nature of LLMs. Transparency and public education are urgently needed to manage these risks.