Scholar Sidekick is built as deterministic citation infrastructure. These principles guide development and deployment.
Identical inputs produce identical outputs. CSL styles and locales are pinned to specific versions to prevent silent behavioral drift.
Public API behavior does not change without explicit versioning. Headers, error semantics, and export structures are treated as contract guarantees.
Marketplace integrations and MCP servers wrap the canonical HTTP API. No formatting or resolution logic is duplicated outside the core service.
Export formats conform to their published specifications. Outputs are validated through semantic tests rather than brittle snapshots.
Request identifiers, rate-limit headers, and health endpoints are consistently exposed to support operational transparency. Rolling uptime and incident history for the public API are published at the status page; the headline reflects Scholar Sidekick’s own availability, with upstream metadata-source health reported in a separate, clearly labelled section.
Requests are processed on demand. Raw citation inputs are not retained as application data after processing.
Determinism is not just “same input, same output” - it is also “same failure mode, same response.” Scholar Sidekick documents how each edge case behaves so integrators can rely on the contract.
Inputs that fail format validation return a 400 response with the JSON envelope { ok: false, code: BAD_REQUEST, error: <message> }. Validation runs at the route boundary; malformed inputs never reach a fetch adapter.
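On the client side, the envelope above can be decoded into a small typed value. The following is a minimal sketch, assuming the body is JSON with the `ok`, `code`, and `error` keys described above; the `ApiError` class and `parse_error_envelope` helper are hypothetical names, not part of any published client.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiError:
    status: int
    code: str
    message: str


def parse_error_envelope(status: int, body: str) -> Optional[ApiError]:
    """Return an ApiError when body is an { ok: false, ... } envelope, else None."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None  # not JSON at all, so not a contract envelope
    if not isinstance(payload, dict) or payload.get("ok") is not False:
        return None  # a success payload or an unrelated shape
    return ApiError(status, str(payload.get("code", "")), str(payload.get("error", "")))
```

Returning `None` for anything that is not an envelope keeps the caller's success path separate from its error path.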
When an identifier passes validation but no upstream source returns a record, the response is 200 with items: [] (or a per-item error in batch mode), not a 404. Not-found results are cached briefly to avoid hammering the upstream on repeat lookups.
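Because a not-found lookup is a 200 rather than a 404, client code has to inspect the body as well as the status. A minimal sketch for single-item requests, assuming the parsed body is a dict with an `items` list; `classify_lookup` is a hypothetical helper, and the per-item error shape used in batch mode is not modelled here.

```python
from typing import Any, Mapping


def classify_lookup(status: int, payload: Mapping[str, Any]) -> str:
    """Map a single-item response onto the outcomes described above."""
    if status == 200:
        # A valid identifier with no upstream record yields an empty items list.
        return "resolved" if payload.get("items") else "not_found"
    if status == 400:
        return "bad_request"  # failed format validation at the route boundary
    return "other_error"
```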
A small set of fields is guaranteed when an identifier resolves successfully: type, title, and at least one of (DOI, PMID, ISBN, id). All other fields (authors, journal, year, page range, abstract) are best-effort - they are populated when the upstream source supplies them and omitted otherwise. Missing fields are not synthesised.
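An integrator can assert the guaranteed fields and treat everything else as optional. The sketch below assumes the JSON keys are spelled exactly as in the paragraph above (`type`, `title`, `DOI`, `PMID`, `ISBN`, `id`); the key casing and the `missing_guaranteed_fields` helper are assumptions for illustration.

```python
from typing import Any, List, Mapping

# Assumed key spellings, taken from the prose above rather than a published schema.
IDENTIFIER_KEYS = ("DOI", "PMID", "ISBN", "id")


def missing_guaranteed_fields(item: Mapping[str, Any]) -> List[str]:
    """List which guaranteed fields are absent from a resolved item."""
    missing = [key for key in ("type", "title") if not item.get(key)]
    if not any(item.get(key) for key in IDENTIFIER_KEYS):
        missing.append("identifier")
    return missing
```

Best-effort fields such as authors or year are read with `item.get(...)` and a fallback, never assumed.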
For identifiers with multiple resolvers (DOI → Crossref then DataCite then doi.org; ISBN → Open Library then Google Books), the chain is consulted in fixed order and the first non-empty record wins. The full resolver chain per identifier type is published at /.well-known/sources.json. The chain itself is part of the contract: reordering or substituting resolvers is treated as a transform-version change.
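The fixed-order, first-non-empty rule can be sketched as a short loop. This is an illustration of the rule as described, not the service's code: the resolver callables are hypothetical stand-ins and the real chain is the one published at /.well-known/sources.json.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, object]
Resolver = Callable[[str], Optional[Record]]


def resolve_in_order(
    identifier: str, chain: Sequence[Tuple[str, Resolver]]
) -> Tuple[Optional[str], Optional[Record], List[str]]:
    """Consult resolvers in fixed order; the first non-empty record wins."""
    tried: List[str] = []
    for name, resolver in chain:
        record = resolver(identifier)
        if record:  # None and {} both count as "empty"
            return name, record, tried
        tried.append(name)
    return None, None, tried
```

Because the chain is an ordered sequence rather than a set, reordering it visibly changes which record wins, which is why the order belongs to the contract.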
Outbound fetches use AbortController-bounded timeouts and a small fixed retry budget for transient failures (network errors, 5xx). Persistent upstream failure produces an error envelope rather than partial data; cached records remain available throughout. All outbound hosts are allowlisted; arbitrary user-supplied URLs are rejected before any fetch occurs.
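The combination of an allowlist, a bounded timeout, and a fixed retry budget can be sketched in a few lines. The service itself uses AbortController; this Python version only mirrors the logic. The host names, the retry count, the timeout value, and the injected `fetch` callable are all assumptions for illustration.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical allowlist; the real set of outbound hosts is not stated here.
ALLOWED_HOSTS = frozenset({"api.crossref.org", "api.datacite.org"})

Fetch = Callable[[str, float], Tuple[int, bytes]]


class UpstreamError(Exception):
    """Raised when the retry budget is spent without a usable response."""


def is_allowed(url: str) -> bool:
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def fetch_with_retry(url: str, fetch: Fetch, retries: int = 2, timeout_s: float = 5.0) -> Tuple[int, bytes]:
    """Reject non-allowlisted URLs, then retry only transient failures."""
    if not is_allowed(url):
        raise ValueError("host is not allowlisted")  # rejected before any fetch
    last_error: Optional[Exception] = None
    for _ in range(retries + 1):
        try:
            status, body = fetch(url, timeout_s)
        except OSError as exc:  # network errors; TimeoutError is a subclass
            last_error = exc
            continue
        if status >= 500:
            last_error = UpstreamError(f"upstream returned {status}")
            continue
        return status, body
    raise UpstreamError("retry budget exhausted") from last_error
```

Injecting `fetch` keeps the retry policy testable without touching the network.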
Rate limiting is sliding-window per plan tier (anonymous, free, pro, ultra, mega). Quota exhaustion returns 429 with a Retry-After header and standard RateLimit-* headers (IETF + legacy X-RateLimit-*). The response body uses the same error envelope as every other error.
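A client that receives a 429 can derive its wait time from Retry-After. HTTP allows that header to carry either a number of seconds or an HTTP-date; which form this API sends is not stated here, so the sketch below accepts both. `retry_after_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds to wait."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable; the caller should pick its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```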
Operational gates produce predictable status codes: 503 with code: MAINTENANCE when MAINTENANCE_MODE=1, and 405 on mutation routes when READ_ONLY_MODE=1. Health endpoints remain reachable in both modes.
transform_version

Every API response includes the x-scholar-transform-version header. The value is a date-stamped tag that identifies the active normalisation, formatting, and resolver chain. It mirrors the transform_version field in /.well-known/sources.json.
For a given transform_version, identical inputs (identifier, style, output format, locale) produce byte-identical output. This is the machine-checkable form of the determinism principle stated above. Send the same DOI in Vancouver style today and in six months, and as long as the response carries the same x-scholar-transform-version, the bytes will match.
The version is bumped whenever a change to normalisation, formatting, or the resolver chain could alter the bytes returned for the same input.
Cosmetic, infrastructure, or test-only changes do not bump the version. Bug fixes that correct previously-incorrect output do bump it; the integrity of the version contract requires that any change which alters output is observable.
Pin the x-scholar-transform-version in your tests or pipelines. If a future response carries a different value, treat it as a signal to re-baseline expected output rather than a regression. Two requests carrying the same value should produce identical bodies for identical inputs; if they do not, file an issue - that is a contract violation.
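The three outcomes described above (match, re-baseline, contract violation) reduce to a small comparison. A minimal sketch, assuming the test suite stores the pinned version string and the expected body bytes; `compare_to_baseline` is a hypothetical helper.

```python
def compare_to_baseline(
    baseline_version: str, baseline_body: bytes, version: str, body: bytes
) -> str:
    """Decide what a response means relative to a pinned baseline."""
    if version != baseline_version:
        # A new transform version: expected output may legitimately differ.
        return "rebaseline"
    # Same version, so the bytes must match for identical inputs.
    return "match" if body == baseline_body else "contract_violation"
```

Only the `contract_violation` outcome should fail a build; `rebaseline` is a prompt to review and re-record the expected output.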
/verification is a copy-paste curl kit that lets you (or any external evaluator) verify these claims against the live API in under a minute.
transform_version names the chain we have deployed right now; it is not a version you can request. We run one version at a time and do not retain or serve prior ones, so once the value is bumped the earlier behaviour is no longer reachable from the API. The guarantee is therefore consistency within the live version plus a drift signal you can detect - not a time machine. To reproduce a result as of an earlier date, keep the output you received (with its transform_version) at query time; that captured copy, not a later request to us, is your authoritative record. Note also that transform_version pins our processing, not the upstream metadata: Crossref, PubMed, and the rest can correct a record at any time. Such a correction surfaces as x-scholar-cache: miss with a moved upstream_fetched_at, and is expected even within a single version.
Reproducibility is reinforced by per-request provenance headers: x-request-id, x-scholar-cache, x-scholar-formatter, x-scholar-style-used, x-scholar-transform-version, plus conditional CSL headers (x-csl-warning, x-csl-alias, x-csl-dependent, x-csl-fetch-style-id) when relevant. Together they let an integrator reconstruct exactly which code path produced a response.
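For logging or archiving, the provenance headers can be pulled out of a response in one pass. A minimal sketch, assuming the headers arrive as a plain mapping whose key casing may vary; `provenance_headers` is a hypothetical helper, and the header names are the ones listed above.

```python
from typing import Dict, Mapping

PROVENANCE_HEADERS = (
    "x-request-id",
    "x-scholar-cache",
    "x-scholar-formatter",
    "x-scholar-style-used",
    "x-scholar-transform-version",
    # Conditional CSL headers, present only when relevant.
    "x-csl-warning",
    "x-csl-alias",
    "x-csl-dependent",
    "x-csl-fetch-style-id",
)


def provenance_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return only the provenance headers, keyed by lowercase name."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in PROVENANCE_HEADERS if name in lowered}
```

Storing this dictionary next to the response body gives a captured result the context needed to interpret it later.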
Beyond the per-request headers, each resolved item can carry an opt-in _provenance object. Add ?provenance=1 (acknowledged via x-scholar-provenance: 1) to attach, per item: transform_version, resolved_at, request_id, the sources that produced the record (each with cache state and, where the upstream supplies one, an upstream_fetched_at timestamp), fallbacks_tried (resolvers consulted before the winner - omitted when the primary resolved on the first try), fields_from_source / fields_absent, and any normalization steps applied to your input. The default payload is unchanged; provenance is purely additive. Use ?provenance=full (acknowledged via x-scholar-provenance: full) to also include a full_metadata block (funders, ORCID iDs, ROR iDs, license, clinical-trial IDs) drawn from the upstream record. The full field shape is published in /.well-known/sources.json under provenance_schema.
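A consumer might condense the _provenance object into a one-line audit trail. The sketch below assumes the per-entry shapes `{"name", "cache"}` for sources and `{"name", "outcome"}` for fallbacks_tried; the authoritative shape is the provenance_schema in /.well-known/sources.json, and `describe_provenance` is a hypothetical helper.

```python
from typing import Any, Mapping


def describe_provenance(item: Mapping[str, Any]) -> str:
    """Summarise which source won and which resolvers were tried first."""
    prov = item.get("_provenance")
    if prov is None:
        return "no provenance requested"  # the default payload carries none
    winners = ", ".join(
        f'{source["name"]} (cache {source["cache"]})' for source in prov.get("sources", [])
    )
    # fallbacks_tried is omitted when the primary resolver succeeded first time.
    fallbacks = prov.get("fallbacks_tried", [])
    if not fallbacks:
        return f"resolved by {winners}"
    misses = ", ".join(f'{entry["name"]}: {entry["outcome"]}' for entry in fallbacks)
    return f"resolved by {winners} after {misses}"
```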
?provenance=conflicts cross-checks the winning record against one genuinely independent source - a PubMed record for a Crossref-resolved DOI, or Google Books against Open Library - and reports a conflicts block whose status is agreed, conflict (with the disagreeing fields), or unavailable (no independent source, or the cross-check could not be fetched). We are deliberately honest that “couldn’t check” is not “agreed”. Because it depends on a live second fetch it is observational - like cache state it is exempt from byte-determinism - and it is gated behind a server flag (it adds one extra upstream fetch per item), so it is off unless explicitly enabled.
# A DataCite-resolved DOI whose Crossref + doi.org lookups missed first:
$ curl -s -X POST 'https://scholar-sidekick.com/api/format?provenance=1' \
    -H 'content-type: application/json' \
    -d '{"text":"10.5281/zenodo.1234567","style":"vancouver"}'

"items": [{
  "_provenance": {
    "transform_version": "2026-06-05",
    "resolved_at": "2026-06-05T12:00:00.000Z",
    "request_id": "…",
    "sources": [{ "name": "datacite", "cache": "miss",
                  "upstream_fetched_at": "2026-05-30T08:11:02Z" }],
    "fallbacks_tried": [
      { "name": "crossref", "outcome": "not_found" },
      { "name": "doi.org", "outcome": "not_found" }
    ],
    "fields_from_source": ["title","authors","issued","doi"],
    "fields_absent": ["pages","volume","issue","abstract"]
  }
}]

These principles - determinism, contract stability, observable infrastructure - underwrite the citation-integrity surface Scholar Sidekick exposes. The peer-reviewed work that motivates that surface is Topaz M, Roguin N, Gupta P, Zhang Z, Peltonen L-M. Fabricated citations: an audit across 2·5 million biomedical papers. The Lancet. 2026;407(10541):1779–1781 (doi:10.1016/S0140-6736(26)00603-3). The CITADEL pipeline in that paper is the methodological anchor; the determinism, source-provenance, and edge-case-handling principles on this page are what make a real-time, API-shaped analogue auditable. See /citation-integrity for the explainer and /tools/citation-verifier for the working implementation.