Recent AI models are still far from accurate when it comes to clinical recommendations, a study published Oct. 8 in Nature Communications found.
The study used a sample of 10,000 randomly selected emergency department visits to assess the accuracy of large language models for clinical recommendations. Researchers examined zero-shot clinical recommendations generated by GPT-3.5-turbo and GPT-4-turbo across three tasks: admission status, radiological investigation request status and antibiotic prescription status.
The LLMs were provided information from the presenting history and physical examination sections of each patient's first ED physician note.
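The study's actual prompts are not reproduced in the article, but a zero-shot setup of this kind is straightforward to sketch. The snippet below shows what one such query might look like using the OpenAI Python client; the prompt wording, yes/no framing, and synthetic note text are illustrative assumptions, not the researchers' materials.

```python
# Minimal sketch of a zero-shot clinical-recommendation query, assuming the
# OpenAI Python client (>=1.0). Prompt wording, task phrasing, and the note
# text are illustrative stand-ins, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The three binary recommendation tasks evaluated in the study.
TASKS = {
    "admission": "Should this patient be admitted to the hospital?",
    "radiology": "Should a radiological investigation be requested for this patient?",
    "antibiotics": "Should antibiotics be prescribed for this patient?",
}

def recommend(note_text: str, task: str, model: str = "gpt-4-turbo") -> str:
    """Ask the model one yes/no clinical question about an ED note,
    zero-shot (no worked examples included in the prompt)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[
            {
                "role": "user",
                "content": (
                    "Below are the presenting history and physical examination "
                    "sections of a patient's first ED physician note.\n\n"
                    f"{note_text}\n\n"
                    f"{TASKS[task]} Answer with a single word: yes or no."
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip().lower()

# Example: one synthetic (non-patient) note run through all three tasks.
note = "HPI: 67M with fever and productive cough for 3 days. PE: T 38.9C, crackles at right lung base."
for task in TASKS:
    print(task, "->", recommend(note, task))
```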
GPT-4-turbo and GPT-3.5-turbo performed "poorly" compared to a resident physician, according to the study, with accuracy scores 8% and 24% lower, respectively, than the physician average. Researchers found that the LLMs were "overly cautious" in their recommendations and that LLM performance must improve significantly before the models can be used as decision support for clinical recommendations.