Citation integrity in the age of AI

From a clinical researcher who reads citations every week - here is what citation fabrication looks like, why automated identifier checks miss the dominant pattern, and how to verify a citation properly.

Last updated: May 12, 2026

1 in 277 biomedical papers in early 2026 contains at least one fabricated reference - a more than 12× increase over 2023. That finding comes from Topaz et al. (Lancet 2026), who audited 2.5 million PubMed Central articles using a pipeline called CITADEL. The trajectory of the increase - explosive, post-2023 - strongly implicates the proliferation of large language models in scientific writing.

What makes this finding matter is not just the rate. It is that the dominant fabrication pattern slips through every basic citation check. The identifier resolves. The DOI is real. The PMID points to a real paper. The citation looks legitimate. But the title in the citation does not correspond to the paper that the identifier actually points to.

If you have only ever checked citations by clicking the DOI to make sure it resolves, you have been missing the most common fabrication pattern in the literature.

What fabrication actually looks like

Topaz et al.’s Supplementary Appendix 2 publishes three illustrative cases. They are worth reading in full because each one shows a different mechanism by which a fake citation evades simple checks.

Example A - split-identifier confusion

A paper on construction-industry safety in Qatar cites a study supporting its ICU-admission finding:

“Impact of enhanced safety protocols on ICU admissions in the construction industry: A longitudinal analysis” - J Doe, R Smith, J Occup Environ Med (2023), PMID 36730737, DOI 10.1097/JOM.0000000000002567.

The PMID and DOI are both real, but they point to different real papers:

PMID 36730737 resolves to “Predictors of Suicide and Differences in Attachment Styles and Resilience Among Treatment-Seeking First-Responder Subtypes” (Ponder et al., J Occup Environ Med 2023).
DOI 10.1097/JOM.0000000000002567 resolves to “Occupational Balance and Depressive Symptoms During the COVID-19 Pandemic” (Ramos et al., J Occup Environ Med 2022).

The cited title does not exist anywhere in the indexed literature. The identifiers exist but contradict each other. Both are in the right-sounding journal - which is what makes the confabulation plausible.
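The contradiction in Example A can be checked mechanically once each identifier has been resolved. The following is a minimal sketch, not CITADEL's or Scholar Sidekick's actual logic: the `ResolvedRecord` type and the `identifiers_agree` helper are hypothetical names, and the two resolved titles are copied from the case above rather than fetched live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedRecord:
    identifier: str  # the PMID or DOI that was looked up
    title: str       # the title the registry returned for that identifier


def normalise(title: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def identifiers_agree(records: list[ResolvedRecord]) -> bool:
    """True when every identifier on one citation resolves to the same title."""
    return len({normalise(record.title) for record in records}) <= 1


# Resolved titles for Example A, as reported in the text above.
example_a = [
    ResolvedRecord(
        "PMID 36730737",
        "Predictors of Suicide and Differences in Attachment Styles and "
        "Resilience Among Treatment-Seeking First-Responder Subtypes",
    ),
    ResolvedRecord(
        "DOI 10.1097/JOM.0000000000002567",
        "Occupational Balance and Depressive Symptoms During the COVID-19 Pandemic",
    ),
]

print(identifiers_agree(example_a))  # False
```

A disagreement between identifiers is enough to flag the citation even before the cited title is compared with anything.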

Example B - consistent identifiers, fabricated title

A diagnostic-imaging review cites a protocol paper:

“A Protocol for the Use of DMM/PTX-Induced Mouse Models of Osteoarthritis and Rheumatoid Arthritis” - E. Krustev, D. Rioux, J.J. McDougall, Current Protocols (2021), PMID 34767311, DOI 10.1002/cpz1.288.

The PMID and DOI agree with each other and resolve to the same real paper - but the resolved paper is “Three-Dimensional Fruit Tissue Habitats for Culturing Caenorhabditis elegans” (Guisnet et al., Current Protocols 2021). The cited title plausibly fuses two genuine methodologies - DMM (destabilisation of the medial meniscus, an osteoarthritis model) and PTX (pertussis toxin, a rheumatoid arthritis model) - into a protocol paper that has never been published.

Example C - biomedical neuroscience fabrication

A pain-research review cites a microglial paper:

“Microglial Modulation via Cannabinoid Receptor 2 Alleviates Fibromyalgia-Related Central Sensitization and Pain Hypersensitivity” - F. Chen, Y. Liu, H. Wang, X. Zhang, J. Li, K. Yang, Neuroscience (2023), PMID 36813155, DOI 10.1016/j.neuroscience.2023.02.008.

PMID and DOI both resolve to the same real paper - and again, it is something completely different: “ChatGPT in Research: Balancing Ethics, Transparency and Advancement” (Graf & Bernardi, Neuroscience 2023). The fabricated title combines three real neuroscience concepts (microglial modulation, CB2, fibromyalgia pain) into a plausible study that does not exist.

Why a simple DOI check is not enough

All three cases above pass the only check most researchers ever apply: click the DOI; does it resolve? The DOI resolves. The paper is real. The journal is real. The reference looks legitimate at a glance.

What gets missed is the cross-check between the cited title and the resolved title. If you do not compare what the citation says the paper is called against what the paper at that identifier is actually called, you cannot detect the dominant fabrication pattern.

This is not a problem you can solve by reading more carefully. The fabricated titles are generated, typically by an LLM, to sound like they fit the surrounding sentence. They reference concepts the reader expects in that context. Eyeballing them as plausible is exactly the failure mode the pattern exploits.

The fix is mechanical: every citation needs its claimed metadata compared against the resolved metadata at its identifier. That is what a verifier does.
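The comparison itself can be sketched in a few lines. This is an illustration under stated assumptions, not the matching logic of CITADEL or the Scholar Sidekick verifier: it uses Python's standard `difflib.SequenceMatcher`, and the 0.9 threshold and the helper names are arbitrary choices made for the example.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise_title(title: str) -> str:
    """Strip accents, punctuation and case so only the wording is compared."""
    ascii_text = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return " ".join(re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).split())


def title_similarity(cited: str, resolved: str) -> float:
    """Character-level similarity in [0, 1] between two normalised titles."""
    return SequenceMatcher(
        None, normalise_title(cited), normalise_title(resolved)
    ).ratio()


def titles_match(cited: str, resolved: str, threshold: float = 0.9) -> bool:
    """Assumed rule: treat the citation as consistent above the threshold."""
    return title_similarity(cited, resolved) >= threshold


# Example C from above: the cited title versus the title the identifiers resolve to.
cited_c = (
    "Microglial Modulation via Cannabinoid Receptor 2 Alleviates "
    "Fibromyalgia-Related Central Sensitization and Pain Hypersensitivity"
)
resolved_c = "ChatGPT in Research: Balancing Ethics, Transparency and Advancement"

print(titles_match(cited_c, resolved_c))  # False
```

A plain string ratio will also reject honest paraphrases and abbreviations, so a real verifier needs a second look at borderline cases before calling them fabrications.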

Try the verifier on a fabricated citation

Two presets seed the form. Run them to see the verifier flip between Matched and Mismatch. Edit any field to test your own citation — same call the API would make.

Cited title
DOI (or other identifier)
First-author family name (optional)

Calls POST /api/verify — no authentication, free anonymous tier.

How to check a citation today

Three levels of effort, from manual to automated, all of which catch the Topaz pattern when applied properly:

By hand, one citation at a time. Paste the DOI into doi.org. Compare the resolved page’s title to the title in the citation. If they differ in any meaningful way, treat the citation as suspect until you have read the paper yourself. This works but does not scale beyond a small reference list.
Programmatic single-citation check. POST the claimed citation to a verifier API (Scholar Sidekick’s /api/verify) or call the equivalent MCP tool from an AI assistant. The response carries a verdict (matched, mismatch, ambiguous, or not_found) plus the resolved record, so you can see exactly what the identifier points to.
Manuscript-submission integration. This is Topaz et al.’s explicit recommendation #1: integrate verifier checks into the submission workflow at journals. Run every reference through a verifier before peer review. The cost per citation is fractions of a second; the cost of a fabricated reference reaching publication is significant.
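The by-hand check in the first level can be partly scripted against a public registry. The sketch below assumes Crossref's public REST API (`https://api.crossref.org/works/{DOI}`), which returns the registered title as a list under `message.title`. It is not a client for Scholar Sidekick's /api/verify, whose request schema is not given on this page. The function names and the User-Agent string are illustrative.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref lookup URL, percent-encoding unusual DOI characters."""
    return CROSSREF_WORKS + quote(doi.strip())


def title_from_crossref_payload(payload: dict) -> Optional[str]:
    """Pull the first registered title out of a Crossref works response."""
    titles = payload.get("message", {}).get("title") or []
    return titles[0] if titles else None


def resolved_title(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the title Crossref holds for this DOI (needs network access)."""
    request = Request(
        crossref_url(doi),
        # Placeholder contact details; replace with your own.
        headers={"User-Agent": "citation-check-sketch/0.1 (mailto:you@example.org)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return title_from_crossref_payload(json.load(response))


# Usage (not executed here because it needs the network):
#   resolved_title("10.1002/cpz1.288")  # the DOI from Example B
# Compare the returned title with the title printed in the citation.
```

Crossref only covers DOIs it has registered. A PMID, ISBN, or arXiv ID needs a different registry, which is where a multi-registry verifier saves effort.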

How Scholar Sidekick fits

Topaz et al.’s CITADEL pipeline and Scholar Sidekick’s verifier are complementary, not competitive. They cover different points in the publication lifecycle and different parts of the citation surface area.

	CITADEL (Topaz et al.)	Scholar Sidekick verifier
Timing	Offline, post-publication audit	Online, on-demand at write/review time
Source surface	PMC-XML	Live registries (Crossref, PubMed, OpenAlex, arXiv, ADS, others)
Identifier coverage	DOI + PMID (2 types)	DOI + PMID + PMCID + ISBN + arXiv + ISSN + ADS bibcode + WHO IRIS URL (8 types)
Reported precision	91% (Topaz et al., internal benchmark)	1.000 on a 20-entry validation set (see below)
Distribution	Research pipeline	Public REST API + MCP tool + (planned) web UI

CITADEL ran a retrospective audit across 2.5 million biomedical papers. Scholar Sidekick is built to be called at the moment a citation is added - by a peer reviewer, an editor, an author cross-checking their own bibliography, or an LLM grounding its references. The methodology Topaz et al. validated at population scale is what our verifier applies at point-of-use scale, with broader identifier coverage so it works for citations CITADEL was not designed to touch (books via ISBN, ML and physics preprints via arXiv, astrophysics via ADS bibcode, institutional grey literature via WHO IRIS URL).

Measured precision and recall

Every quantitative claim about the verifier on this page is tied to a specific validation run. The fixture is hand-curated, immutable, and published below as evidence. The results JSON files are timestamped receipts - you can inspect them, re-run the harness, and check our numbers against your own.

Validation set (v1, immutable)

Twenty hand-curated entries across five categories:

3 Lancet illustrative cases. Examples A, B, and C from Topaz et al.’s Supplementary Appendix 2, verbatim. Not our cases - the canonical fabricated-citation examples Topaz et al. chose to publish.
5 known-good citations. Real DOI or arXiv ID paired with the canonical title resolved via the scholar-sidekick MCP server. The verifier must return matched.
4 wrong-DOI cases (CITADEL “citation error” subtype). Real Paper X’s title paired with real Paper Y’s identifier. Both papers are independently verified; the swap is intentional. The verifier should detect the title mismatch and surface the actual paper as a candidate → verdict ambiguous.
4 paraphrase cases. Real DOI + paraphrased title designed to land in the LLM-screen-eligible bucket (mismatch with low confidence). The LLM screen should classify these as informal_abbreviation and upgrade the verdict to matched. These four entries are the only ones we tuned against the live verifier - they were probed to ensure they exercise the LLM-screen path. The LLM’s verdict on them is what we report.
4 invented cases. No real paper. Either an invented DOI, an invented title with no identifier, or an impossibly large PMID. The verifier should return not_found.

Headline numbers

Run against the live /api/verify endpoint:

Pre-LLM screen: 20/20 entries passed. Precision = 1.000, recall = 1.000, F1 = 1.000.
With LLM screen enabled: 20/20 entries passed. Precision = 1.000, recall = 1.000, F1 = 1.000.
LLM cost: ~0.001 USD per applied screen (4 of 20 entries triggered the screen). Total run cost: under half a cent.

Recall of 1.000 in both modes is the line that matters: every actual fabrication, wrong-identifier case, and invented citation was correctly flagged. Nothing got through.
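For readers who want to redo the arithmetic, here is a small sketch of the standard definitions. One assumption is ours rather than stated on this page: in the run with the LLM screen enabled, the 3 Lancet, 4 wrong-DOI, and 4 invented entries count as the positives a verifier should flag, and the 5 known-good and 4 paraphrase entries as negatives that should stay clean.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return precision, recall, f1


# Assumed positives: 3 Lancet cases + 4 wrong-DOI cases + 4 invented cases.
should_be_flagged = 3 + 4 + 4

# All 11 flagged, none of the 9 clean entries flagged by mistake.
print(precision_recall_f1(tp=should_be_flagged, fp=0, fn=0))  # (1.0, 1.0, 1.0)
```

On a fixture this small each entry carries a lot of weight: under the same assumption, a single missed fabrication would give a recall of 10/11, roughly 0.91.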

Receipts

validation-set-v1.json - the immutable fixture
validation-results-pre-llm.json - pre-LLM results, timestamped
validation-results-with-llm.json - with-LLM-screen results, timestamped

The fixture is marked immutable. When we add entries we create validation-set-v2.json and re-measure; old numbers always cite the specific fixture version they came from.

Frequently asked questions

Is this catching AI-generated citations specifically, or any fabrication?

Both. The fabrication pattern is the same regardless of origin: a citation pairs a real, resolvable identifier (DOI or PMID) with a title that does not correspond to the paper at that identifier. Topaz et al. note the steep increase since 2023 strongly implicates LLM authorship, but the verifier checks the structural disconnect - claimed title versus resolved title - not who wrote the citation.

Does the verifier work for non-biomedical citations?

Yes. CITADEL (the pipeline Topaz et al. used) covers DOI and PMID - the biomedical identifier surface. The Scholar Sidekick verifier covers DOI, PMID, PMCID, ISBN, arXiv ID, ISSN, NASA ADS bibcode, and WHO IRIS URL - eight identifier types, which extends the same cross-reference methodology into books, computer-science and physics preprints, astrophysics, and institutional grey literature.

Can I run this against an entire manuscript bibliography?

Today the verifier is exposed as a single-citation API at /api/verify and as a verifyCitation MCP tool. A batch web UI with .bib/.ris upload is planned (Phase 12i.4). For now, scripts can call the API per reference; rate limits scale with plan tier.
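A per-reference script can be as simple as a paced loop. In this sketch the call to the verifier is passed in as a function, because the request and response format of /api/verify is not described on this page. The citation field names, the one-second delay, and the helper names are all hypothetical. Only the four verdict strings come from the text above.

```python
import time
from typing import Callable, Iterable


def verify_all(
    citations: Iterable[dict],
    verify_one: Callable[[dict], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[tuple[dict, str]]:
    """Check each reference in turn, pausing between calls to respect rate limits."""
    results = []
    for index, citation in enumerate(citations):
        if index:  # no pause before the first call
            sleep(delay_seconds)
        results.append((citation, verify_one(citation)))
    return results


def needs_review(results: list[tuple[dict, str]]) -> list[dict]:
    """Everything that did not come back as 'matched' deserves a human look."""
    return [citation for citation, verdict in results if verdict != "matched"]


# Demonstration with a stand-in verifier instead of a real API call.
bibliography = [
    {"title": "A real paper", "identifier": "10.1000/example-1"},
    {"title": "A suspicious paper", "identifier": "10.1000/example-2"},
]
fake_verdicts = {"10.1000/example-1": "matched", "10.1000/example-2": "mismatch"}

results = verify_all(
    bibliography,
    verify_one=lambda c: fake_verdicts[c["identifier"]],
    sleep=lambda seconds: None,  # skip the real pause in the demo
)
print([c["identifier"] for c in needs_review(results)])  # ['10.1000/example-2']
```

Injecting `verify_one` keeps the loop testable offline and lets the same script switch between the REST endpoint and another verifier without changes.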

What about retraction status?

Retraction is a different signal. A real, correctly-cited paper can still be retracted. Scholar Sidekick exposes retraction-checking at /tools/retraction-checker (Retraction Watch via Crossref). It is not wired into the verifier endpoint yet - that is a separate planned phase. If you need both signals on a bibliography today, call them separately.

What does the verifier cost?

The /api/verify endpoint is free at the anonymous tier with a published rate limit. The LLM screen - used only when the simple verifier returns mismatch with low confidence - is gated to authenticated first-party callers and paid RapidAPI tiers, since each model call carries a real per-call cost that Scholar Sidekick pays to the model provider. We protect against runaway spend with a server-side daily cap; once that cap is hit, subsequent verifier requests fall back gracefully to the non-LLM verdict.
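The cap-and-fallback behaviour described above can be pictured with a small sketch. It is our illustration of the idea, not Scholar Sidekick's server code: the `DailyBudget` class, the `final_verdict` function, and the cost of 1,000 micro-USD per screen (about 0.001 USD, the figure reported in the headline numbers) are assumptions, and the daily reset is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DailyBudget:
    """Tracks LLM spend in micro-USD; integers avoid float rounding."""

    cap_micro_usd: int
    spent_micro_usd: int = 0

    def try_spend(self, cost_micro_usd: int) -> bool:
        """Reserve the cost if it fits under the cap and report whether it did."""
        if self.spent_micro_usd + cost_micro_usd > self.cap_micro_usd:
            return False
        self.spent_micro_usd += cost_micro_usd
        return True


def final_verdict(
    base_verdict: str,
    low_confidence: bool,
    budget: DailyBudget,
    llm_screen: Callable[[], str],
    cost_micro_usd: int = 1_000,
) -> str:
    """Run the LLM screen only on low-confidence mismatches with budget left."""
    if base_verdict != "mismatch" or not low_confidence:
        return base_verdict  # the screen does not apply
    if not budget.try_spend(cost_micro_usd):
        return base_verdict  # cap reached: fall back to the non-LLM verdict
    return llm_screen()


budget = DailyBudget(cap_micro_usd=2_000)  # room for two screens in this demo
verdicts = [
    final_verdict("mismatch", True, budget, llm_screen=lambda: "matched")
    for _ in range(3)
]
print(verdicts)  # ['matched', 'matched', 'mismatch']
```

The third request still gets an answer after the cap is hit. It is the unscreened verdict, which is what falling back gracefully means here.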

How did you measure precision and recall?

We hand-curated a 20-entry fixture set sourced from the Topaz et al. supplementary appendix and from independent registry lookups via Crossref, PubMed, and arXiv. We ran every entry through the live verifier and counted how many actual fabrications were flagged (recall) and whether any legitimate citations were flagged by mistake (precision). Numbers and the full fixture are published in the Measured precision and recall section above; the JSON is immutable for v1.

Why is this called complementary to CITADEL?

CITADEL is offline, post-publication, PMC-XML-only, and ran retrospectively across 2.5 million papers. Scholar Sidekick is online, on-demand, available at write or review time, and covers six identifier types CITADEL does not. The two surfaces serve different points in the publication lifecycle: CITADEL audits the literature retrospectively; Scholar Sidekick checks the citation as it is being written or peer-reviewed.

References

Topaz M, Roguin N, Gupta P, Zhang Z, Peltonen L-M. Fabricated citations: an audit across 2·5 million biomedical papers. The Lancet. 2026;407(10541):1779-1781. doi:10.1016/S0140-6736(26)00603-3. Open access. The primary source for this page; the three illustrative cases come from its Supplementary Appendix 2.

Related Scholar Sidekick surfaces

Citation verifier API documentation - /api/verify reference
scholar-sidekick-mcp - the MCP server (verifyCitation tool ships in v0.7.0)
Retraction Checker - complementary signal for already-cited works
Open Access Checker - find a legal free copy via Unpaywall