About Preprint Match

Every week, hundreds of new preprints appear on medRxiv. We match them to the journals that publish similar work.

What this is

Preprint Match identifies which journals best match a medRxiv preprint's content. We train on 17,958 preprints that were later published in peer-reviewed journals, learning what makes a paper a good fit for one journal over another.

For each of the 54,751 current preprints on medRxiv, we compute calibrated probability estimates across 501 journals with sufficient training data. You can explore this from either direction: pick a paper and see where it might land, or pick a journal and see what's coming. Preprints destined for niche journals outside this set will still get predictions, but only among the journals we do cover.

Why

Journals are filters. They take a stream of research and select what fits their scope, standards, and audience. Useful, but slow and opaque.

Preprints changed the first part: anyone can share their work immediately. But discovery is still hard. If you work in infectious disease epidemiology, how do you find the preprints that matter to you among thousands posted each month?

A personalised feed could do better: show you relevant work as it appears, not months later when a journal gets around to it.

Where this is going

Eventually, you should be able to define your own journal. Pick the topics, methods, and populations you care about, and get notified when a matching preprint appears. No existing journal covers exactly your interests, but a custom feed could.

bioRxiv, arXiv, and other preprint servers have the same discovery problem. The approach generalises.

Journal matching is a stepping stone. The real goal is connecting researchers with work they need to see, when it appears. Preprints don't need to wait for a journal to receive peer review, either: platforms like Peer Community In, PREreview, Rapid Reviews: Infectious Diseases, and eLife are already reviewing preprints openly. Showing those evaluations alongside predictions is a natural next step.

How it works

We look at which previously published papers are most similar to a new preprint, and use that to identify which journals publish similar work. The model picks up on topic, methods, writing style, and scope from the 17,958 preprints that have already been published.
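The nearest-neighbour idea above can be sketched as follows. This is a minimal illustration, not the actual implementation: it assumes precomputed embedding vectors for the published papers, and the function and variable names are made up for this example.

```python
import numpy as np

def journal_scores(query_vec, paper_vecs, paper_journals, k=20):
    """Score journals by similarity-weighted votes from the k most
    similar published papers (cosine similarity on embeddings)."""
    # Normalise so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    P = paper_vecs / np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    sims = P @ q
    # Indices of the k most similar papers.
    top = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in top:
        journal = paper_journals[i]
        scores[journal] = scores.get(journal, 0.0) + float(sims[i])
    # Normalise votes so they read as a score distribution over journals.
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}
```

Each neighbour votes for the journal it was published in, weighted by how similar it is to the new preprint; journals whose back catalogue sits closest to the preprint accumulate the highest scores.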

Each paper's title and abstract are encoded into a numerical vector using SPECTER2, a scientific document embedding model trained on citation graphs. We fine-tune it with contrastive learning so that papers published in the same journal end up closer together, with training batches grouped by medRxiv category to force fine-grained distinctions. Predictions combine a nearest-neighbour lookup with a logistic regression classifier, blended 90/10 and tuned on a held-out validation set. Probabilities are calibrated using isotonic regression so that when the model says "15% chance of PLOS ONE", roughly 15% of such predictions turn out correct.
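The blend-and-calibrate step can be sketched like this, using scikit-learn for the isotonic fit. The weights match the 90/10 blend described above, but the scores, outcomes, and names are illustrative, not the production pipeline.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def blend(knn_probs, logreg_probs, w=0.9):
    """Combine nearest-neighbour and logistic-regression probabilities 90/10."""
    return w * np.asarray(knn_probs) + (1 - w) * np.asarray(logreg_probs)

# Calibration: on a held-out set, pair each raw blended score with whether
# the predicted journal turned out to be correct (1) or not (0). Isotonic
# regression fits a monotone map from raw score to observed hit rate.
raw_scores = np.array([0.02, 0.05, 0.10, 0.20, 0.40, 0.60, 0.80])
was_correct = np.array([0, 0, 0, 0, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

# Calibrated probability for a new blended score:
p = calibrator.predict(blend(np.array([0.5]), np.array([0.3])))
```

With enough held-out pairs, the fitted map is what makes a reported "15%" correspond to roughly a 15% empirical hit rate.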

How well it works

We evaluate on 501 journals that have at least 10 papers in the training set (about 70% of all published preprints land in one of these). On a held-out test set of 6,950 papers, the model gets the exact journal right 20% of the time (versus under 1% by random chance), and the correct journal appears in the top 10 predictions 61% of the time. For the 20 most common journals, the top-10 hit rate is 82%.
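The headline numbers are top-k accuracy. As a minimal sketch (array shapes and names are assumptions for illustration): given a matrix of per-journal probabilities and each paper's true journal index,

```python
import numpy as np

def top_k_accuracy(probs, true_idx, k=10):
    """Fraction of papers whose true journal is among the k
    highest-probability predictions.
    probs: (n_papers, n_journals); true_idx: (n_papers,)."""
    # Indices of the k largest probabilities in each row.
    topk = np.argsort(probs, axis=1)[:, -k:]
    hits = (topk == np.asarray(true_idx)[:, None]).any(axis=1)
    return hits.mean()
```

The 20% exact-match figure is this quantity with k=1, and the 61% figure is the same with k=10.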

Performance drops for less common journals: about 10% exact accuracy for mid-frequency journals, and the model cannot reliably predict journals with fewer than 10 training papers. Preprints headed for niche journals outside the 501 we cover will still get predictions, but only among the journals we know about — the model cannot tell you it doesn't know.

Limitations

This is a content-based classifier. It recognises which journals tend to publish which kinds of content. It does not assess quality or editorial fit, and it cannot tell you whether a paper will survive peer review. It can only tell you that a paper looks like the kind of work a journal has published before.

The training data consists entirely of preprints that were successfully published. Preprints that were never published, or took a very different path, are absent. The model also has no temporal awareness: publication patterns shift over time, and a paper matching a journal's 2020 output may not match its 2026 editorial direction.

Probabilities are well-calibrated on average, but less reliable for low-frequency journals where training data is sparse.

The full methodology (dataset construction, embedding comparison, fine-tuning, per-tier evaluation, calibration) is described in RESULTS.md. All code is open source.

Who

Built by Sebastian Funk.