About Preprint Match

Every week, hundreds of new preprints appear on medRxiv. We match them to the journals that publish similar work.

What this is

Preprint Match identifies which journals best match a medRxiv preprint's content. We train on 17,958 preprints that were later published in peer-reviewed journals, learning what makes a paper a good fit for one journal over another.

For each of the 54,751 current preprints on medRxiv, we compute calibrated probability estimates across 501 journals with sufficient training data. You can explore this from either direction: pick a paper and see where it might land, or pick a journal and see what's coming. Preprints destined for niche journals outside this set will still get predictions, but only among the journals we do cover.

Why

Journals are filters. They take a stream of research and select what fits their scope, standards, and audience. Useful, but slow and opaque.

Preprints changed the first part: anyone can share their work immediately. But discovery is still hard. If you work in infectious disease epidemiology, how do you find the preprints that matter to you among thousands posted each month?

A personalised feed could do better: show you relevant work as it appears, not months later when a journal gets around to it.

Where this is going

Eventually, you should be able to define your own journal. Pick the topics, methods, and populations you care about, and get notified when a matching preprint appears. No existing journal covers exactly your interests, but a custom feed could.

bioRxiv, arXiv, and other preprint servers have the same discovery problem. The approach generalises.

Journal matching is a stepping stone. The real goal is connecting researchers with work they need to see, when it appears. Preprints don't need to wait for a journal to receive peer review, either: platforms like Peer Community In, PREreview, Rapid Reviews: Infectious Diseases, and eLife are already reviewing preprints openly. Showing those evaluations alongside predictions is a natural next step.

How it works

We look at which previously published papers are most similar to a new preprint, and use that to identify which journals publish similar work. The model picks up on topic, methods, writing style, and scope from the 17,958 preprints that have already been published.
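The nearest-neighbour idea above can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes precomputed embedding vectors for the published papers, and the function and variable names are made up for this example.

```python
import numpy as np

def journal_scores(query_vec, paper_vecs, paper_journals, k=20):
    """Score journals by similarity-weighted votes from the k most
    similar published papers (cosine similarity on embeddings)."""
    # Normalise so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = P @ q
    # Indices of the k most similar papers.
    top = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in top:
        journal = paper_journals[i]
        scores[journal] = scores.get(journal, 0.0) + float(sims[i])
    # Normalise votes so they read as a score distribution over journals.
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}
```

Each neighbour votes for the journal it was published in, weighted by how similar it is to the new preprint; journals whose back catalogue sits closest to the preprint accumulate the highest scores.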

Each paper's title and abstract are encoded into a numerical vector using SPECTER2, a scientific document embedding model trained on citation graphs. We fine-tune it with contrastive learning so that papers published in the same journal end up closer together, with training batches grouped by medRxiv category to force fine-grained distinctions. Predictions combine a nearest-neighbour lookup with a logistic regression classifier, blended 90/10 and tuned on a held-out validation set. Probabilities are calibrated using isotonic regression so that when the model says "15% chance of PLOS ONE", roughly 15% of such predictions turn out correct.
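The blend-and-calibrate step can be sketched like this, using scikit-learn for the isotonic fit. The weights match the 90/10 blend described above, but the scores, outcomes, and names are illustrative, not the production pipeline.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def blend(knn_probs, logreg_probs, w=0.9):
    """Combine nearest-neighbour and logistic-regression probabilities 90/10."""
    return w * np.asarray(knn_probs) + (1 - w) * np.asarray(logreg_probs)

# Calibration: on a held-out set, pair each raw blended score with whether
# the predicted journal turned out to be correct (1) or not (0). Isotonic
# regression fits a monotone map from raw score to observed hit rate.
raw_scores = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.60, 0.80])
was_correct = np.array([0, 0, 0, 0, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

# Calibrated probability for a new blended score:
p = calibrator.predict(blend(np.array([0.5]), np.array([0.3])))
```

With enough held-out pairs, the fitted map is what makes a reported "15%" correspond to roughly a 15% empirical hit rate.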

How well it works

We evaluate on 501 journals that have at least 10 papers in the training set (about 70% of all published preprints land in one of these). On a held-out test set of 6,950 papers, the model gets the exact journal right 20% of the time (versus under 1% by random chance), and the correct journal appears in the top 10 predictions 61% of the time. For the 20 most common journals, the top-10 hit rate is 82%.
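The headline numbers are top-k accuracy. As a minimal sketch (array shapes and names are assumptions for illustration): given a matrix of per-journal probabilities and each paper's true journal index,

```python
import numpy as np

def top_k_accuracy(probs, true_idx, k=10):
    """Fraction of papers whose true journal is among the k
    highest-probability predictions.
    probs: (n_papers, n_journals); true_idx: (n_papers,)."""
    # Indices of the k largest probabilities in each row.
    topk = np.argsort(probs, axis=1)[:, -k:]
    hits = (topk == np.asarray(true_idx)[:, None]).any(axis=1)
    return hits.mean()
```

The 20% exact-match figure is this quantity with k=1, and the 61% figure is the same with k=10.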

Performance drops for less common journals: about 10% exact accuracy for mid-frequency journals, and the model cannot reliably predict journals with fewer than 10 training papers. Preprints headed for niche journals outside the 501 we cover will still get predictions, but only among the journals we know about — the model cannot tell you it doesn't know.

Limitations

This is a content-based classifier. It recognises which journals tend to publish which kinds of content. It does not assess quality or editorial fit, and it cannot tell you whether a paper will survive peer review. It can only tell you that a paper looks like the kind of work a journal has published before.

The training data consists entirely of preprints that were successfully published. Preprints that were never published, or took a very different path, are absent. The model also has no temporal awareness: publication patterns shift over time, and a paper matching a journal's 2020 output may not match its 2026 editorial direction.

Probabilities are well-calibrated on average, but less reliable for low-frequency journals where training data is sparse.

The full methodology (dataset construction, embedding comparison, fine-tuning, per-tier evaluation, calibration) is described in RESULTS.md. All code is open source.

Who

Built by Sebastian Funk.