Once upon a time, two villagers visited the fabled Mullah Nasreddin. They hoped that the Sufi philosopher, famed for his acerbic wisdom, could mediate a dispute that had driven a wedge between them. Nasreddin listened patiently to the first villager’s version of the story and, upon its conclusion, exclaimed, “You are absolutely right!” The second villager then presented his case. After hearing him out, Nasreddin again responded, “You are absolutely right!” An observant bystander, confused by Nasreddin’s proclamations, interjected, “But Mullah, they can’t both be right.” Nasreddin paused, regarding the bystander for a moment before replying, “You are absolutely right, too!”
In late May, the White House’s first “Make America Healthy Again” (MAHA) report was criticized for citing multiple research studies that did not exist. Fabricated citations like these are common in the outputs of generative artificial intelligence based on large language models, or LLMs, which are known to invent plausible-sounding sources, catchy titles, and even false data to support their conclusions. In this case, the White House initially pushed back on the journalists who broke the story before admitting to “minor citation errors.”
It is ironic that fake citations were used to support a principal recommendation of the MAHA report: addressing the health research sector’s “replication crisis,” wherein scientists’ findings often cannot be reproduced by other independent teams.
Yet the MAHA report’s use of phantom evidence is far from unique. Last year, The Washington Post reported on dozens of instances in which AI-generated falsehoods found their way into courtroom proceedings. Once the fabrications were uncovered, lawyers had to explain to judges how fictitious cases, citations, and decisions had ended up in their filings.
Despite these widely recognized problems, the MAHA roadmap released last month directs the Department of Health and Human Services to prioritize AI research to “…assist in earlier diagnosis, personalized treatment plans, real-time monitoring, and predictive interventions…” This breathless rush to embed AI in so many aspects of medicine could be forgiven if we believed that the technology’s “hallucinations” would be easy to fix through version updates. But as the industry itself acknowledges, these ghosts in the machine may be impossible to eliminate.
Consider the implications of accelerating AI use in health research for clinical decision making. Beyond the immediate problems described above, using AI in research without disclosure could create a feedback loop, supercharging the very biases that helped motivate its use. Once published, “research” based on false results and citations could become part of the datasets used to build future AI systems. Worse still, a recently published study highlights an industry of scientific fraudsters who could deploy AI to make their claims seem more legitimate.
In other words, a blind adoption of AI risks a downward spiral, where today’s flawed AI outputs become tomorrow’s training data, exponentially eroding research quality.
Three prongs of AI misuse
The challenge AI poses is threefold: hallucination, sycophancy, and the black box conundrum. Understanding these phenomena is critical for research scientists, policymakers, educators, and everyday citizens. Without that understanding, we are vulnerable to deception as AI systems are increasingly deployed to shape diagnoses, insurance claims, health literacy, research, and public policy.
Here’s how hallucination works: When a user inputs a query into an AI tool such as ChatGPT or Gemini, the model evaluates the input and generates a string of words that is statistically likely to make sense based on its training data. Current AI models will complete this task even if their training data is incomplete or biased, filling in the blanks regardless of their ability to answer. These hallucinations can take the form of nonexistent research studies, misinformation, or even clinical interactions that never happened. LLMs’ emphasis on producing authoritative-sounding language shrouds their false outputs in a facsimile of truth.
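To make the mechanism concrete, here is a deliberately toy sketch in Python. The journal names and probabilities below are invented for illustration and are not drawn from any real model; the point is only that a language model assigns probabilities to possible next words and samples one, so it always produces a fluent continuation whether or not a real source stands behind it.

```python
import random

# Toy next-word probabilities for the prompt
# "The study was published in the journal ..."
# These journal names and numbers are invented for illustration only;
# a real LLM derives such probabilities from billions of parameters.
next_word_probs = {
    "Nature": 0.30,
    "Lancet": 0.25,
    "JAMA": 0.20,
    "Pediatrics": 0.15,
    "Journal of Integrative Wellness": 0.10,  # plausible-sounding, possibly fictional
}

# The model always emits *something* fluent; there is no built-in
# notion of "I don't know" or "no such study exists."
words = list(next_word_probs)
weights = list(next_word_probs.values())
completion = random.choices(words, weights=weights, k=1)[0]

print(f"The study was published in the journal {completion}.")
```

However many times this sketch is rerun, it never declines to answer, which is the essence of hallucination: confident output with no check against reality.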
And as human trainers fine-tune generative AI responses, they tend to reward the responses that favor their prior beliefs, leading to sycophancy. Human bias, it appears, begets AI bias, and human users of AI then perpetuate the cycle. The consequence is that AI systems skew toward pleasing answers over truthful ones, often reinforcing the bias embedded in the query.
A recent illustration occurred in April, when OpenAI rolled back a ChatGPT update for being too sycophantic after users demonstrated that it agreed too quickly and enthusiastically with the assumptions embedded in their queries. Sycophancy and hallucination often interact: systems that aim to please are more apt to fabricate data to reach user-preferred conclusions.
Correcting hallucinations, sycophancy, and other LLM mishaps is cumbersome because human observers can’t always determine how an AI platform arrived at its conclusions. This is the “black box” problem. Behind the probabilistic mathematics, is the model even testing hypotheses? What methods did it use to derive an answer? Unlike traditional computer code or the explicit steps of the scientific method, AI models reach their answers through billions of opaque numerical computations. Looking at some well-structured outputs, it is easy to forget that the underlying processes are impenetrable to scrutiny and vastly different from a human’s approach to problem-solving.
This opacity can become dangerous when people can’t identify where computations went wrong, making it impossible to correct systematic errors or biases in the decision-making process. In health care, this black box raises questions about accountability, liability, and trust when neither physicians nor patients can explain the sequence of reasoning that leads to a medical intervention.
AI and health research
These AI challenges can exacerbate the existing sources of error and bias that creep into traditional health research publications. Several of those sources originate from the natural human motivation to find and publish meaningful, positive results. Journalists want to report on connections, e.g., that St. John’s wort improves mood (it might). Nobody wants to publish an article concluding that “the supplement has no significant effect.”
The problem compounds when researchers use a study design to test not just a single hypothesis but many. One quirk of statistics-backed research is that testing more hypotheses in a single study raises the likelihood of uncovering a spurious coincidence.
AI has the potential to supercharge these coincidences through its relentless ability to test hypotheses across massive datasets. In the past, a research assistant could use an existing dataset to test 10 to 20 of the most likely hypotheses; now, that assistant can set an AI loose to test millions of likely or unlikely hypotheses without human supervision. That all but guarantees some of the results will meet the criteria for statistical significance, regardless of whether the data includes any real biological effects.
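A minimal simulation makes the arithmetic plain. The sketch below (written in Python with purely random, invented data; no real biomarkers or patients) runs 1,000 comparisons on noise and still turns up dozens of “statistically significant” findings at the conventional p < 0.05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Simulate a dataset with no real effects: 1,000 random "biomarkers"
# measured in two groups of 50 patients each.
n_tests, n_per_group = 1000, 50
group_a = rng.normal(size=(n_tests, n_per_group))
group_b = rng.normal(size=(n_tests, n_per_group))

# Run a t-test for every biomarker, mimicking an automated sweep of hypotheses.
p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue

# Count "statistically significant" findings at the conventional 0.05 threshold.
false_positives = int((p_values < 0.05).sum())
print(f"{false_positives} of {n_tests} tests look 'significant' despite pure noise")
# Expect roughly 50 spurious hits (about 5%), by construction.
```

Roughly 5 percent of the tests come up “significant” by chance alone; scale the number of hypotheses into the millions, as automated AI-driven analysis allows, and the number of spurious “discoveries” scales with it.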
AI’s tireless capacity to investigate data, combined with its growing ability to develop authoritative-sounding narratives, expands the potential to elevate fabricated or bias-confirming errors into the collective public consciousness.
What’s next?
If you read the missives of AI luminaries, it would appear that society is on the cusp of superintelligence, which will transform every vexing societal conundrum into a trivial puzzle. While that’s highly unlikely, AI has certainly demonstrated promise in some health applications, despite its limitations. Unfortunately, it’s now being rapidly deployed sector-wide, even in areas where it has no prior track record.
This speed may leave us little time to reflect on the accountability needed for safe deployment. Sycophancy, hallucination, and the black box of AI are non-trivial challenges when conjoined with existing biases in health research. If people can’t easily understand the inner workings of current AI tools (reportedly comprising up to 1.8 trillion parameters), they will not be able to understand the workings of future, more complex versions (which may exceed 5 trillion parameters).
History shows that most technological leaps forward are double-edged swords. Electronic health records have helped clinicians coordinate care and aggregate data on population health, but they have also eroded doctor-patient interactions and become a source of physician burnout. The recent proliferation of telemedicine has expanded access to care, but it has also promoted lower-quality interactions that lack a physical examination.
The use of AI in health policy and research is no different. Wisely deployed, it could transform the health sector, leading to healthier populations and unfathomable breakthroughs (for example, by accelerating drug discovery). But without embedding it in new professional norms and practices, it has the potential to generate countless flawed leads and falsehoods.
Here are some potential solutions we see to the replicability crisis that AI poses for health research:
- Clinically focused models capable of admitting uncertainty in their outputs
- Greater transparency, requiring disclosure of AI model use in research
- Training for researchers, clinicians, and journalists on how to evaluate and stress-test AI-derived conclusions
- Pre-registered hypotheses and analysis plans before using AI tools
- AI audit trails
- Global system-level prompts that limit sycophantic tendencies across user queries
Regardless of the solutions deployed, we need to solve the failure points described here to fully realize the potential of AI for use in health research. The public, AI companies, and health researchers must be active participants in this journey. After all, in science, not everyone can be right.
Amit Chandra is an emergency physician and global health policy specialist based in Washington, DC. He is an adjunct professor of global health at Georgetown University’s School of Health, where he has explored AI solutions for global health challenges since 2021.
Luke Shors is an entrepreneur who focuses on energy, climate, and global health. He is the co-founder of the sustainability company Capture6 and previously worked on topics including computer vision and blockchain.