Nonsensical research papers generated by a computer program are still popping up in the scientific literature many years after the problem was first seen, a study has revealed [1]. Some publishers have told Nature they will take down the papers, which could result in more than 200 retractions.
The issue began in 2005, when three PhD students created paper-generating software called SCIgen for “maximum amusement”, and to show that some conferences would accept meaningless papers. The program cobbles together words to generate research articles with random titles, text and charts, easily spotted as gibberish by a human reader. It is free to download, and anyone can use it.
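To make the idea of "cobbling together words" concrete, here is a minimal sketch of one common way to produce such text: randomly expanding a hand-written grammar. The grammar, the symbol names and the `expand` function are invented for illustration and are not SCIgen's actual rules or code.

```python
import random

# Toy grammar, invented for illustration only (not SCIgen's real rules).
# Keys are placeholders; each maps to a list of possible expansions.
GRAMMAR = {
    "TITLE": [
        ["A", "ADJ", "NOUN", "for", "NOUN"],
        ["Towards", "the", "ADJ", "NOUN"],
    ],
    "ADJ": [["scalable"], ["probabilistic"], ["decentralized"]],
    "NOUN": [["archetypes"], ["compilers"], ["hash tables"]],
}


def expand(symbol, rng):
    """Recursively replace a placeholder with a randomly chosen expansion."""
    if symbol not in GRAMMAR:
        return symbol  # an ordinary word: emit it unchanged
    production = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in production)


if __name__ == "__main__":
    rng = random.Random()  # pass a seed for repeatable output
    for _ in range(3):
        print(expand("TITLE", rng))
```

Every run yields grammatical-looking but meaningless titles, which is why such output is easy for a human reader to spot yet can slip past a purely formal check.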
By 2012, computer scientist Cyril Labbé had found 85 fake SCIgen papers in conference proceedings published by the Institute of Electrical and Electronics Engineers (IEEE); he went on to find more than 120 fake SCIgen papers published by the IEEE and by Springer [2]. It was unclear who had generated the papers or why. The articles were subsequently retracted (or sometimes deleted), and Labbé released a website allowing anyone to upload a manuscript and check whether it seems to be a SCIgen invention. Springer also sponsored a PhD project to help spot SCIgen papers, which resulted in free software called SciDetect. (Springer is now part of Springer Nature; Nature’s news team is editorially independent of its publisher.)
Labbé, who works at the University of Grenoble Alpes in France, originally searched manuscripts for words typical of SCIgen’s vocabulary. But he and another computer scientist, Guillaume Cabanac at the University of Toulouse, France, came up with a new idea: searching for key grammatical phrases characteristic of SCIgen’s output. Last May, Labbé and Cabanac searched for such phrases in millions of papers indexed in the Dimensions database.
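The phrase-based screen can be pictured with a short sketch. It assumes a small list of fingerprint phrases and plain substring matching on normalized text; the phrases, function names and threshold are hypothetical and are not taken from the researchers' actual queries or from the Dimensions search interface.

```python
import re

# Hypothetical fingerprint phrases, invented for illustration only;
# they are not the study's real query list.
FINGERPRINT_PHRASES = [
    "in this position paper we",
    "we disconfirm that",
    "the rest of this paper is organized as follows",
]


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_fingerprints(text, phrases=FINGERPRINT_PHRASES):
    """Return the phrases that occur as whole-word sequences in the text."""
    padded = f" {normalize(text)} "
    return [p for p in phrases if f" {normalize(p)} " in padded]


def flag_for_review(text, min_hits=1):
    """Mark a document as a candidate; a person still has to read it."""
    return len(find_fingerprints(text)) >= min_hits


if __name__ == "__main__":
    sample = "In this position paper, we disconfirm that robots and DHTs are incompatible."
    print(find_fingerprints(sample))
    print(flag_for_review(sample))
```

A match only marks a candidate: ordinary papers can contain stock phrases too, so each hit still needs to be read by a person before it is counted.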
After manually inspecting every hit, the researchers identified 243 nonsense articles created entirely or partly by SCIgen, they report in a study published on 26 May [1]. These articles, published between 2008 and 2020, appeared in various journals, conference proceedings and preprint sites, and were mostly in the computer-science field. Some appeared in open-access journals; others were paywalled. Forty-six of them had already been retracted or deleted from the websites where they were first published.
Since last year, the researchers have added another 20 papers to their list, including gibberish articles created by MATHgen (software that generates mathematics papers) and the SBIR proposal generator (which creates nonsense grant proposals). Cabanac and Labbé have posted some of their findings on Twitter and the post-publication peer review website PubPeer, and they are releasing their full results online.
Most of the latest batch of SCIgen papers were authored by researchers from China (64%) or India (22%), although Labbé notes that the manuscripts could have been submitted in anyone’s name without their knowledge. One author of several of the papers told Labbé and Cabanac that he’d submitted them as hoaxes. But other manuscripts appear to have been edited with genuine reference lists, suggesting that they might have been generated to inflate scientists’ citation counts. “I think the vast majority are created to pad CVs in order to fulfil a need to publish papers,” says Labbé.
The researchers found only two SCIgen papers that hadn’t been retracted at IEEE — which is evaluating both of them — and one Springer paper that included a fragment of MATHgen text. But other publishers were caught out more badly. IOP Publishing, a subsidiary of the London-based Institute of Physics, says it retracted ten papers “as there was clear evidence they had been computer-generated” and is investigating why they weren’t identified during peer review. “We have reasonable evidence to suggest that the peer review process for some of these papers was compromised,” says Kim Eggleton, the publisher’s integrity and inclusion manager.
The publishers that posted the most SCIgen content were Trans Tech Publications, a Swiss publisher, with 57 SCIgen papers; Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP), based in India, with 54; and Atlantis Press, a French publisher that was acquired by Springer Nature this March, with 39. Both Trans Tech Publications and Atlantis told Nature that they were investigating and were in the process of retracting the articles, but a spokesperson for BEIESP said that it published only articles with original content that passed double-blind peer review and plagiarism checks.
The popular SSRN preprint server, where papers are shared before peer review, had published 16 SCIgen articles, the study found. A spokesperson for SSRN said it was investigating the issue, and noted that it provided “limited screening” for its preprints (with “advanced screening” for health-care manuscripts).
Cabanac is concerned by the non-transparent way in which some publishers deal with such papers. The IEEE, for instance, has wiped some SCIgen papers off its website, but left formal retraction notices for others. Cabanac also notes that research papers — or earlier versions of them — sometimes disappear from the SSRN preprint server, without such changes being recorded.
An IEEE spokesperson said that its policy on removing a paper or leaving a retraction label was “contingent on the outcome of our evaluation”; SSRN did not respond to a question about its policies on retraction or deletion.
SCIgen papers are extremely rare: Labbé and Cabanac estimate from their screen that they make up a mere 75 papers per million in the computer-science literature. They are a far smaller problem than, for instance, suspected paper mills, which create seemingly real research papers to order for academics and which Labbé and Cabanac have also helped to uncover.
But, says Labbé, the existence of these papers is an indication of the harmful effects of a ‘publish or perish’ culture, and an example of how nonsensical work can still make it into conference proceedings or journals. “You shouldn’t find these things in the literature,” he says.
1. Cabanac, G. & Labbé, C. J. Assoc. Inf. Sci. Technol. https://doi.org/10.1002/asi.24495 (2021).
2. Labbé, C. & Labbé, D. Scientometrics 94, 379–396 (2013).