LLMs are not ready to automate clinical coding, says Mount Sinai study

A new study from Mount Sinai suggests that using generative artificial intelligence to automate clinical coding has significant limitations.


For the research, Mount Sinai’s Icahn School of Medicine evaluated the potential application of large language models in healthcare to automate medical code assignment from clinical text for reimbursement and research purposes.

The study compared LLMs from OpenAI, Google and Meta to assess whether they could effectively match the right medical codes to their corresponding official text descriptions.

To assess and benchmark the performance of GPT-3.5, GPT-4, Gemini Pro and Llama2-70b, researchers extracted more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, excluding patient data.

“Previous studies indicate that newer large language models struggle with numerical tasks,” Dr. Eyal Klang, director of Icahn Mount Sinai’s Data-Driven and Digital Medicine (D3M) Generative AI Research Program and senior co-author of the study, explained in an announcement last week.

“However, the extent of their accuracy in assigning medical codes from clinical text had not been thoroughly investigated across different models.”

Assessing the four models with both qualitative and quantitative methods, the researchers determined that all of the LLMs scored below 50% accuracy in generating the correct diagnosis and procedure codes.

While GPT-4 performed the best in the study with the highest exact match rates for ICD-9-CM at 45.9%, ICD-10-CM at 33.9% and CPT codes at 49.8%, “unacceptably large” errors remained. 
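The benchmark hinges on exact-match accuracy: a prediction counts only if the generated code is identical to the reference code, so near-misses such as a correct category with the wrong specificity score zero. A minimal sketch of that metric, using hypothetical ICD-10-CM codes rather than the study's actual data:

```python
# Sketch of an exact-match evaluation for code assignment, analogous
# in spirit to the study's benchmark. All codes below are hypothetical
# examples; the study drew on more than 27,000 real Mount Sinai codes.

def exact_match_rate(predicted, reference):
    """Fraction of predictions that exactly equal the reference code."""
    assert len(predicted) == len(reference)
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# Hypothetical ICD-10-CM ground truth vs. model output:
reference = ["E11.9", "I10", "J45.909", "N18.3"]
predicted = ["E11.9", "I10", "J45.9", "N18.9"]  # two near-misses

print(f"Exact match: {exact_match_rate(predicted, reference):.1%}")
```

Note that "J45.9" vs. "J45.909" is the vagueness failure mode the researchers describe: a more general code in the right family still counts as a miss under exact matching.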

The researchers said GPT-4 generated the largest number of incorrect codes, while GPT-3.5 showed the greatest tendency toward vagueness, choosing general rather than precise codes.

The study results, published last week in NEJM AI, led the researchers to caution that LLM performance in real-world medical coding could be even worse.

“LLMs are not appropriate for use on medical coding tasks without additional research,” the researchers said in the report.

“While AI holds great potential, it must be approached with caution and ongoing development to ensure its reliability and efficacy in healthcare,” Dr. Ali Soroush, assistant professor of D3M and medicine, cautioned in a statement.

Mount Sinai noted that the researchers will look to develop tailored LLM tools for accurate medical data extraction and billing code assignment.


Despite the findings of the Mount Sinai study, others see value in AI-enabled coding, saying AI systems can help physician groups avoid missed revenue opportunities and improve their documentation compliance.


“As annual coding requirements are instituted, an AI-based system will integrate and implement those changes in real-time,” Dr. Bruce Cohen, a surgeon and former CEO at OrthoCarolina in Charlotte, North Carolina, told Healthcare IT News.

AI-based systems do not eliminate coders’ jobs, he added: “It expands the oversight and accuracy of every charge going out based on evaluation and management coding.”


“Our findings underscore the critical need for rigorous evaluation and refinement before deploying AI technologies in sensitive operational areas like medical coding,” Soroush asserted in a statement about the Mount Sinai research.

“This study sheds light on the current capabilities and challenges of AI in healthcare, emphasizing the need for careful consideration and additional refinement prior to widespread adoption,” added Dr. Girish Nadkarni, director of The Charles Bronfman Institute of Personalized Medicine and system chief of D3M. 

Andrea Fox is senior editor of Healthcare IT News.
Email: [email protected]

Healthcare IT News is a HIMSS Media publication.
