Researchers at Beth Israel Deaconess Medical Center found generative artificial intelligence tool ChatGPT-4 performed better than hospital physicians and residents in several — but not all — aspects of the clinical reasoning process, raising questions about the future use of AI in healthcare settings.
The research team gave the physicians and ChatGPT-4 clinical cases that were broken down into four different parts, and asked them to provide a summary and diagnosis, as well as an explanation of how they arrived at their answers. While ChatGPT-4 scored higher on clinical reasoning, it was also more likely to arrive at an incorrect diagnosis.
One metric used in the study to judge the diagnosis was completeness, which privileges ChatGPT-4, said Daniel Restrepo, an assistant professor of medicine at Harvard Medical School and internist at Massachusetts General Hospital.
“That scale rewards verbosity, and so the more words you use, the more likely you are to score,” he said.
However, Restrepo added that “it beat them with a very specific metric,” and said, “you still need doctors.”
Stephanie M. Cabral, a third-year resident in the Department of Internal Medicine at Beth Israel Deaconess Medical Center and lead author of the study, said the idea for the study came in April 2023.
After receiving IRB approval, the team of researchers sent out case surveys in July to test physicians on their clinical reasoning skills. The team analyzed the data from physicians and ChatGPT-4 in October 2023, and the study was published in February 2024.
“Overall, it was a very tight timeline. It was within a year, from April 2023 to February 2024,” Cabral said.
Zahir Kanjee, a clinician educator at Beth Israel Deaconess Medical Center and assistant professor of medicine at HMS, said he was “drawn to this work” because of the new opportunities available with AI in healthcare.
“I wanted to be part of the effort to think about what artificial intelligence and large language models are capable of, and how they can help us make care better,” he said.
Despite ChatGPT-4’s better overall performance, the model had “more instances of incorrect reasoning,” lead author Cabral said.
She described ChatGPT-4’s effort to conduct a pregnancy test on a 74-year-old woman with abdominal pain, even though “physicians wouldn't expect a 74-year-old woman to be pregnant.”
“We still have to be wary that these technologies can make errors,” she added.
OpenAI, the owner of ChatGPT-4, could not be reached for comment.
Restrepo also warned that the study was “a very artificial environment.”
“It beat trainees and faculty members in documentation of reasoning using a very specific metric,” he said. “Why did it do that? One of the reasons I told you is because humans don’t read instructions, and humans cut corners.”
“If you do that — based on this metric — you run the risk of scoring less,” Restrepo added.
Adam Rodman, an internal medicine physician at Beth Israel Deaconess Medical Center and another leader of the study, said that implementing large language models poses risks.
“There’s something called automation bias, this cognitive psychology automation bias, and that if you think that an output is from a machine, you will trust it more than if you think the output is from a human, which could have unpredictable effects on how humans actually behave in the real world,” he said.
Still, the researchers remain hopeful about AI’s usage in healthcare.
“I think it’d be really useful sort of as a cognitive second opinion,” Restrepo said. However, he added, “it is critical that private health information not be inputted into these.”
“I think AI technologies will hopefully bring back some of the humanity we’ve lost in medicine,” Rodman said.
“In the short run, I am cautiously optimistic that it will improve the human physician bond,” he said. “The medium to long run, I think it’s really hard to predict — just because we’re looking at something that’s really changing what it means to be a doctor and what it means to be a patient.”