SIR 2025
General IR
Traditional Poster
Hossam A. Zaki
Medical Student
The Warren Alpert Medical School, Brown University, United States
Andrew Aoun, BS
PhD Student
Center for Neural Science, New York University, United States
Hazem Abdel-Megid
Undergraduate
The Warren Alpert Medical School of Brown University, United States
Saminah Munshi, MS
Medical Assistant
The Warren Alpert Medical School of Brown University, United States
Jay Gopal
Undergraduate
The Warren Alpert Medical School of Brown University, United States
Michelle Mai
Medical Student
Warren Alpert Medical School of Brown University, Providence, RI 02901, United States
Imran S. Alam, MS
Medical Student
Royal College of Surgeons Ireland, Ireland
Sun-Ho Ahn, MD
Associate Professor
Warren Alpert Medical School of Brown University, United States
Aaron WP Maxwell, MD
Director of Vascular and Interventional Oncology
Alpert Medical School, Brown University, United States
Purpose:
Large language models (LLMs) have demonstrated competency in several medical fields, but their potential in interventional radiology (IR) has not yet been established. This study assesses the reliability and efficacy of three LLMs in predicting appropriate IR procedures and imaging studies based on patient presentations and evaluates their performance on a practice board exam.
Materials and Methods:
Two general-purpose LLMs, ChatGPT-4 (OpenAI) and Gemini (Google), and one healthcare-specific LLM, Glass AI (Glass Health), were chosen for the analysis. Patient presentations from the American College of Radiology (ACR) Appropriateness Criteria for Interventional Radiology served as the benchmark. Patient scenarios derived from the criteria were input into each LLM twice to assess response reliability. The models' recommended imaging modalities or interventions were scored from 0 to 3 based on ACR guidelines, with higher scores indicating more appropriate recommendations. Scores across the three LLMs were compared using the Friedman test. Additionally, ChatGPT-4 and Gemini were evaluated using a Self-Assessment Module for Vascular and Interventional Radiology to simulate a practice board exam environment.
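A minimal sketch of the described model comparison, assuming per-scenario appropriateness scores are collected in aligned lists for each model (the score values and variable names below are hypothetical, not the study data):

```python
from scipy.stats import friedmanchisquare

# Hypothetical appropriateness scores (0-3), one entry per patient scenario,
# aligned by scenario index across the three models.
chatgpt4_scores = [3, 2, 2, 3, 1, 2, 3, 2]
gemini_scores   = [2, 2, 3, 2, 2, 2, 3, 1]
glass_ai_scores = [3, 3, 2, 2, 2, 3, 2, 2]

# Friedman test: nonparametric comparison of three related samples
# (the same scenarios scored for each model).
statistic, p_value = friedmanchisquare(
    chatgpt4_scores, gemini_scores, glass_ai_scores
)
print(f"Friedman chi-square = {statistic:.2f}, p = {p_value:.2f}")
```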
Results:
All three LLMs reliably determined the most appropriate imaging studies or procedures for specific patient presentations. Glass AI achieved the highest average score (2.34 ± 0.52), followed by ChatGPT-4 (2.25 ± 0.59) and Gemini (2.21 ± 0.45), with no significant differences between the models (p = 0.40). On the practice board exam, ChatGPT-4 and Gemini attained accuracies of 71.88% and 79.69%, respectively, without a significant difference in performance.
Conclusion:
LLMs like ChatGPT-4, Glass AI, and Gemini demonstrate potential as decision-support tools in interventional radiology by reliably predicting appropriate imaging studies and procedures based on patient presentations and performing well on standardized exams. Their integration into IR practice could enhance clinical decision-making; however, further research is needed to address limitations such as knowledge cutoffs, explainability, and prompt standardization.