SIR 2024
Practice Development
Blair Warren, MD
Vascular and Interventional Radiology Fellow
University Health Network, Canada
Financial relationships: Full list of relationships is listed on the CME information page.
Sebastian Mafeld, MBBS FRCR (he/him/his)
Interventional Radiologist
University Health Network, Canada
Disclosure information not submitted.
Outline a method and demonstrate the feasibility of clinicians using a large language model (LLM) artificial intelligence (AI) to analyze and learn from device adverse event data. The example pipeline uses a common IR device (a microwave ablation system), a medical device adverse event database, and an LLM (OpenAI's GPT-4). {1}
Background:
Interventional radiologists rely on medical devices to provide safe patient care. Monitoring medical device databases is time intensive, and a delay may occur before failure patterns are identified. LLMs show promise in interpreting natural language and may be used to efficiently generate value from existing data. Learning how to use AI will improve the interventional radiologist's toolkit.
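MAUDE-derived device event reports can be retrieved programmatically through the openFDA device event API, which is one way the monitoring described above could be automated. The sketch below only constructs a query URL; the search fields and terms are illustrative assumptions, not the query used in this work.

```python
# Sketch (assumption): openFDA exposes MAUDE-derived device event reports at
# api.fda.gov. The field names and search terms below are illustrative and
# are not the exact query used by the authors.
from urllib.parse import urlencode

BASE = "https://api.fda.gov/device/event.json"

def build_maude_query(device_name: str, start: str, end: str, limit: int = 100) -> str:
    """Build an openFDA device-event query URL for a device name and a
    date_received range (dates as YYYYMMDD strings)."""
    search = (
        f'device.generic_name:"{device_name}" AND '
        f"date_received:[{start} TO {end}]"
    )
    return f"{BASE}?{urlencode({'search': search, 'limit': limit})}"

url = build_maude_query("microwave ablation", "20200101", "20230731")
print(url)
```

The resulting URL can be fetched with any HTTP client; openFDA paginates results via `limit` and `skip` parameters.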
Clinical Findings/Procedure Details:
A proposed pipeline for using an LLM to (1) label data and (2) generate summaries and conclusions from natural language data is demonstrated. LLMs such as GPT-4 can use prompts to guide their output, but these prompts require iterative development.
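An instruction prompt for the labelling task might be structured as follows. The authors' actual prompt is not published, so the wording, model name, and parameters here are hypothetical; the sketch only builds the request payload rather than calling the API.

```python
# Hypothetical sketch: the abstract does not publish its instruction prompt,
# so the wording below is illustrative, not the authors' actual prompt.

def build_label_request(event_text: str) -> dict:
    """Build a chat-completion payload asking the model to label one MAUDE
    narrative for the presence of a microwave tip fracture."""
    system = (
        "You label medical device adverse event narratives. "
        "Answer with exactly one word: YES if the narrative describes a "
        "fracture or separation of the microwave ablation device tip, else NO."
    )
    return {
        "model": "gpt-4",
        "temperature": 0,  # deterministic output aids reproducible labelling
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": event_text},
        ],
    }

req = build_label_request("Device tip separated and remained in the patient.")
```

Iterative prompt development would consist of editing the system message, re-running the request against the training set, and comparing the YES/NO outputs with the human labels.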
The labelling pipeline begins with data collection. The U.S. Food and Drug Administration's Manufacturer and User Facility Device Experience (MAUDE) database was queried for microwave ablation devices from January 1, 2020 to July 31, 2023, yielding 567 cases. Next, data were lightly cleaned by removing duplicates, and 250 randomly selected events were split into training (n = 50), validation (n = 25), and test (n = 175) sets. A reversed ratio of training to test data was used to simulate a real-world deployment. Data were labelled by a human for the presence of a microwave tip fracture. OpenAI's GPT-4 API was then instructed, via an instruction prompt, to label the data for the presence of a tip fracture. Iterative prompt development eventually achieved 100.0% accuracy (95% CI: 92.9, 100.0) on the training data relative to the human labels. The prompt was then validated (99.4% accuracy; 95% CI: 96.9, 100.0) and applied to the test data (99.4% accuracy; 95% CI: 96.9, 100.0).
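The reported training-set interval is consistent with a Clopper-Pearson exact confidence interval (an assumption; the abstract does not state the CI method). When all n labels are correct, the exact lower bound simplifies to a closed form, which reproduces the 92.9% figure for n = 50:

```python
# Clopper-Pearson exact 95% CI lower bound for the special case of 100%
# observed accuracy (x = n). In that case the lower limit reduces to
# (alpha/2)**(1/n); the upper limit is 1. Whether the authors used this
# method is an assumption based on the reported numbers.

def clopper_pearson_lower_all_correct(n: int, alpha: float = 0.05) -> float:
    """Exact lower bound of the (1 - alpha) CI when all n labels are correct."""
    return (alpha / 2) ** (1.0 / n)

lower = clopper_pearson_lower_all_correct(50)
print(f"50/50 correct -> 95% CI: ({lower * 100:.1f}, 100.0)")  # → (92.9, 100.0)
```

This matches the abstract's 100.0% (95% CI: 92.9, 100.0) on the 50-case training set.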
A secondary approach, using the LLM to generate summaries of all 250 cases and then draw conclusions from the data, was performed after prompt development. The LLM's conclusions were similar to the human conclusions, including concern for tip separation and temperature problems.
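A summarize-then-conclude workflow like this is often implemented in two stages: a per-case summary request, then a pooled request over all summaries. The prompts and function names below are hypothetical illustrations, not the authors' implementation; only request payloads are built.

```python
# Hypothetical two-stage "summarize then conclude" sketch. The prompts used
# in the abstract are not published; the wording here is illustrative.

def build_summary_request(event_text: str) -> dict:
    """Stage 1: request a one-sentence summary of a single adverse event."""
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system",
             "content": "Summarize this device adverse event report in one sentence."},
            {"role": "user", "content": event_text},
        ],
    }

def build_conclusions_request(summaries: list[str]) -> dict:
    """Stage 2: pool the per-case summaries and request overall conclusions."""
    joined = "\n".join(f"- {s}" for s in summaries)
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system",
             "content": "From these event summaries, list recurring failure patterns."},
            {"role": "user", "content": joined},
        ],
    }

req = build_conclusions_request([
    "Tip separated during ablation.",
    "Generator reported a temperature error.",
])
```

Pooling summaries rather than raw narratives keeps the second-stage input within the model's context window when the case count is large.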
Conclusion and/or Teaching Points:
The results from the example pipeline demonstrate the feasibility of, and outline a method for, using modern AI techniques to improve learning from medical device adverse event data. The outlined data pipeline could be implemented at small or large scale on many natural language databases, including the SIR's VIRTEX.