
Scientists and publishing specialists are concerned that the increasing sophistication of chatbots could undermine the integrity and accuracy of research. Credit: Ted Hsu/Alamy
An artificial intelligence (AI) chatbot can write bogus research-paper abstracts so convincing that scientists often can’t spot them, according to a preprint posted on the bioRxiv server in late December1. Researchers are divided on the implications for science.
“I am very concerned,” says Sandra Wachter, who studies technology and regulation at the University of Oxford, UK, and was not involved in the research. “If we now find ourselves in a situation where experts can’t determine what is true or not, we lose the intermediary we desperately need to guide us through difficult issues,” she adds.
The chatbot, ChatGPT, creates realistic and intelligent-sounding text in response to user prompts. It is a ‘large language model’, a system based on neural networks that learn to perform a task by digesting large amounts of existing human-generated text. The San Francisco, California-based software company OpenAI released the tool on November 30, and it is free to use.
Since its release, researchers have been grappling with the ethical issues surrounding its use, because much of its output can be difficult to distinguish from human-written text. Scientists have already published a preprint2 and an editorial3 written by ChatGPT. Now, a group led by Catherine Gao at Northwestern University in Chicago, Illinois, has used ChatGPT to generate artificial research-paper abstracts to test whether scientists can spot them.
The researchers asked the chatbot to write 50 medical research abstracts based on a selection published in JAMA, The New England Journal of Medicine, The BMJ, The Lancet and Nature Medicine. They then compared them with the original abstracts by running them through a plagiarism detector and an AI-output detector, and asked a group of medical researchers to spot the fabricated abstracts.
Under the radar
The abstracts generated by ChatGPT sailed through the plagiarism checker: the median originality score was 100%, indicating that no plagiarism was detected. The AI-output detector spotted 66% of the generated abstracts. But the human reviewers did not fare much better: they correctly identified only 68% of the generated abstracts and 86% of the genuine abstracts. They incorrectly identified 32% of the generated abstracts as real and 14% of the genuine abstracts as generated.
“ChatGPT writes credible scientific abstracts,” Gao and her colleagues say in the preprint. “The limits of the ethical and acceptable use of large language models to aid scientific writing remain to be determined.”
Wachter says that if scientists can’t determine whether research is true, there could be “dire consequences”. As well as being problematic for researchers, who could be pulled down faulty lines of investigation because the research they are reading has been fabricated, there are “implications for society at large because scientific research plays such an important role in our society”. For example, it could mean that research-informed policy decisions are wrong, she adds.
But Arvind Narayanan, a computer scientist at Princeton University in New Jersey, says, “It’s unlikely that any serious scientist would use ChatGPT to generate abstracts.” He adds that whether the generated abstracts can be detected is “irrelevant”. “The question is whether the tool can generate an abstract that is accurate and convincing. It can’t, so the upside of using ChatGPT is minuscule and the downside is significant,” he says.
Irene Solaiman, who researches the social impact of AI at Hugging Face, an AI company based in New York and Paris, fears any reliance on large language models for scientific thinking. “These models are trained on past information, and social and scientific progress can often come from thinking, or being open to thinking, differently from the past,” she adds.
The authors suggest that those who evaluate scientific communications, such as research papers and conference proceedings, should put policies in place to stamp out the use of AI-generated text. If institutions choose to allow the technology in certain cases, they should establish clear rules about disclosure. Earlier this month, the Fortieth International Conference on Machine Learning, a large AI conference taking place in Honolulu, Hawaii, in July, announced that it has banned papers written using ChatGPT and other AI language tools.
Solaiman adds that in fields where false information can endanger people’s safety, such as medicine, journals may need to take a more rigorous approach to verifying that information is accurate.
Narayanan says that solutions to these problems should focus not on the chatbot itself, “but rather on the perverse incentives that lead to this behaviour, such as universities conducting hiring and promotion reviews by counting papers with no regard to their quality or impact”.