Can you predict which reverse transcriptases work for prime editing — without memorizing their evolutionary family?
Prime editing is arguably the most precise genome editing technology ever built. It can install all 12 possible base-to-base substitutions, make small insertions and deletions, and do it all without double-strand breaks or donor DNA templates. Where earlier CRISPR tools cut and hope the cell's repair machinery cooperates, prime editing writes exactly what you want. It has the potential to correct the majority of known pathogenic mutations in the human genome.
At the heart of every prime editor is a reverse transcriptase (RT) enzyme that reads an RNA template and writes the desired edit into DNA. The RT determines whether the edit works, how efficiently it works, and in which cell types it works. Today's best prime editors use MMLV-derived RTs — large enzymes (~700 amino acids) that are difficult to package into viral delivery vectors and challenging to deliver therapeutically. Smaller, more efficient RTs would unlock prime editing for a far wider range of applications.
Nature has evolved thousands of RTs across retroviruses, bacteria, and mobile genetic elements. Some will work for prime editing. Most won't. Experimentally screening candidates is expensive and slow — each round takes months of cloning, expression, and activity measurement. Every candidate you can eliminate computationally is a month saved at the bench.
Unlike antibody engineering, where binding affinity provides a single, well-defined metric to optimise against, enzyme function has no universal objective. An RT that works for prime editing must satisfy multiple constraints simultaneously — catalytic efficiency, processivity, fidelity, thermostability, structural compatibility with the Cas9 fusion, and more.
The 57 experimentally tested RTs in this dataset come from 7 evolutionary families, and activity is heavily confounded with family membership. Models that achieve high accuracy are often just memorising "Retroviral = active" rather than learning the underlying biophysics. When tested on a held-out evolutionary family, they collapse.
In leave-one-family-out evaluation, predicting activity for the Retroviral family — the most important family, containing the majority of known active RTs — is nearly impossible from patterns learned on other families. Breaking through this wall would be a meaningful advance in computational biology and protein engineering.
This dataset goes beyond raw activity labels. We've enriched every RT using our computational platform to maximise the signal available from 57 experimentally tested enzymes (Doman et al.):
- Biophysical features spanning thermostability, catalytic geometry, surface charge, solubility, structural contacts, and more — signals that go beyond sequence alone.
- 1280-dimensional protein language model embeddings for every RT, capturing evolutionary and structural patterns.
- ESMFold-predicted structures for all 57 enzymes in PDB format, ready for structural analysis and featurisation.
- Complete amino acid sequences and evolutionary family annotations. Use any additional tools, models, or databases you like.
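If you want to roll your own structural features, even something as simple as a CA contact count can be pulled from the PDB files with the standard library alone. This is an illustrative sketch, not part of the provided feature set; the inline four-residue fragment and the 8 Å cutoff are assumptions for the example:

```python
import math

def ca_contacts(pdb_text, cutoff=8.0):
    """Count non-adjacent CA-CA pairs closer than `cutoff` angstroms."""
    cas = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in cols 13-16, x/y/z in cols 31-54
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            cas.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return sum(
        1
        for i in range(len(cas))
        for j in range(i + 2, len(cas))  # skip sequence neighbours
        if math.dist(cas[i], cas[j]) < cutoff
    )

# Illustrative fragment: four CAs spaced 3.8 angstroms along a line (made-up coordinates)
pdb = "\n".join(
    f"ATOM  {i+1:5d}  CA  ALA A{i+1:4d}    {3.8*i:8.3f}{0.0:8.3f}{0.0:8.3f}  1.00  0.00           C"
    for i in range(4)
)
print(ca_contacts(pdb))  # → 2
```

The same loop applied to the 57 ESMFold PDBs gives one crude feature per enzyme; richer geometry is what the enriched feature set is for.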
The signal is in this dataset. The question is whether you can find it.
We'll provide compute credits for compelling ideas that need GPU resources to execute.
| Family | Total | Active | Inactive | Efficiency Range |
|---|---|---|---|---|
| Retroviral | 18 | 12 | 6 | 1.5% – 41.0% |
| Retron | 12 | 5 | 7 | 0.5% – 8.0% |
| LTR Retrotransposon | 11 | 2 | 9 | 9.0% – 34.0% |
| Group II Intron | 5 | 2 | 3 | 1.0% – 12.5% |
| CRISPR-associated | 5 | 0 | 5 | — |
| Other | 5 | 0 | 5 | — |
| Unclassified | 1 | 0 | 1 | — |
| Total | 57 | 21 | 36 | 0.5% – 41.0% |
Build a model that predicts RT activity based on sequence, structure, or biophysical properties — not family membership. We evaluate using Leave-One-Family-Out (LOFO) cross-validation to ensure your model learns generalizable biology, not shortcuts.
We use a single metric called CLS (Cross-Lineage Score) that forces your model to solve two problems at once: separating active RTs from inactive ones, and ranking them by how well they actually perform.
How well do your predicted scores separate active RTs from inactive ones?
At each threshold on your score: Precision = of the RTs you called active, how many actually are? (bench time efficiency). Recall = of all truly active RTs, how many did you catch? (discovery rate).
Why PR-AUC over accuracy or ROC-AUC? The dataset is imbalanced — 21 active vs 36 inactive. A model predicting "everything inactive" gets 63% accuracy for free. PR-AUC focuses on finding the active RTs without false alarms.
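You can verify the imbalance argument in a few lines with scikit-learn (the 21/36 split mirrors the table above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([1] * 21 + [0] * 36)      # 21 active, 36 inactive
y_all_inactive = np.zeros(57, dtype=int)    # the "everything inactive" predictor

print(accuracy_score(y_true, y_all_inactive))                # ~0.63, "for free"
print(average_precision_score(y_true, y_all_inactive))       # ~0.37, the base rate
```

Accuracy rewards the degenerate predictor; PR-AUC pins it to the base rate, which is exactly the floor the baselines table below reports.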
Random baseline: ~0.37 (the base rate of active RTs).

Among the RTs you score highly, are the best-performing ones actually ranked highest?
The weighting is efficiency-proportional: MMLV (41% efficiency) has weight 41.1 — getting its rank wrong is heavily penalized. Inactive RTs (0%) have weight ~0.1 — their relative ranking barely matters.
This directly encodes our wet-lab priority: do not miss the highest-performing candidates. A negative correlation (worse than random) is floored at 0.
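The authoritative formula lives in the evaluation scripts, but one construction consistent with the description above is a weighted Pearson correlation computed on ranks, with weight = efficiency + 0.1 (matching MMLV's stated weight of 41.1 and the ~0.1 for inactives). A sketch under that assumption:

```python
import numpy as np
from scipy.stats import rankdata

def weighted_spearman(true_eff, pred_score):
    """Weighted Pearson correlation on ranks; assumed weight = efficiency + 0.1."""
    true_eff = np.asarray(true_eff, dtype=float)
    w = true_eff + 0.1                      # efficiency-proportional weights (assumed offset)
    r_t, r_p = rankdata(true_eff), rankdata(pred_score)

    def wmean(x):
        return np.average(x, weights=w)

    dt, dp = r_t - wmean(r_t), r_p - wmean(r_p)
    corr = wmean(dt * dp) / np.sqrt(wmean(dt**2) * wmean(dp**2))
    return max(corr, 0.0)                   # worse-than-random is floored at 0, as stated
```

With this weighting, nailing the order of the top performers dominates the score, while shuffling the inactives among themselves costs almost nothing.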
Random baseline: ~0.000.

If either component is near zero, CLS collapses. A model with PR-AUC = 0.90 and WSpearman = 0.05 gets CLS = 0.095, not 0.475 (which a simple average would give). You cannot compensate for terrible ranking with great classification, or vice versa. Both problems must be solved.
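The worked example (0.90 and 0.05 combining to 0.095 rather than the 0.475 average) is consistent with a harmonic mean of the two components. Assuming that form, and deferring to the evaluation scripts for the authoritative definition:

```python
def cls(pr_auc, w_spearman):
    """Harmonic mean: collapses toward zero whenever either component does."""
    if pr_auc + w_spearman == 0:
        return 0.0
    return 2 * pr_auc * w_spearman / (pr_auc + w_spearman)

print(round(cls(0.90, 0.05), 3))  # → 0.095, matching the worked example
```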
All predictions are generated using LOFO cross-validation. For each of the 7 evolutionary families, the model is trained on the remaining 6 families and predicts on the held-out family. When predicting Retroviral RTs, the model has never seen a Retroviral RT during training.
The 7 LOFO folds produce 57 out-of-fold predictions (one per RT). Both PR-AUC and WSpearman are computed on these pooled predictions — not averaged per family. This avoids amplifying noise from small families (e.g., Group II Intron has only 2 active RTs).
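The LOFO protocol maps directly onto scikit-learn's `LeaveOneGroupOut`. The sketch below uses synthetic stand-in data and a logistic regression placeholder; the point is the pooling of out-of-fold predictions, not the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(57, 8))            # stand-in features (embeddings, biophysics, ...)
y = rng.integers(0, 2, size=57)         # stand-in activity labels
families = rng.integers(0, 7, size=57)  # stand-in family annotations

oof = np.empty(57)
for train, test in LeaveOneGroupOut().split(X, y, groups=families):
    # the held-out family is never seen during training
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    oof[test] = model.predict_proba(X[test])[:, 1]

# both metric components are computed on the pooled 57 OOF predictions
print(round(average_precision_score(y, oof), 3))
```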
| Baseline | PR-AUC | WSpearman | CLS |
|---|---|---|---|
| Predict all inactive | ~0.37 | 0.000 | 0.000 |
| Random scores | ~0.37 | ~0.000 | ~0.000 |
See how current submissions stack up. View Leaderboard →
Get the RT sequences, activity labels, enriched biophysical features, and evaluation scripts from our GitHub repository.
Use any approach: sequence embeddings, structural features, language models, or novel methods. Creativity is encouraged.
Submit your predictions CSV on Kaggle. All submissions go through the Kaggle competition page.
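A minimal submission file might look like the sketch below. The `predicted_score` column is the one the metric ranks on; the identifier column name and the RT names here are placeholders, so check the Kaggle page and the evaluation scripts for the exact required schema:

```python
import pandas as pd

# Hypothetical schema: one row per RT, any score scale (only the ordering matters).
submission = pd.DataFrame({
    "rt_id": ["MMLV", "RT_002", "RT_003"],   # illustrative identifiers
    "predicted_score": [0.95, 0.40, 0.05],
})
submission.to_csv("submission.csv", index=False)
```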
Top submissions selected based on reproducible performance and methodological novelty.
Finalist approaches will be run on Mandrake's proprietary experimental data. The winner will be determined by real-world generalization.
For the winning submission. The real prize isn't the cash.
Compute Credits: Have a compelling idea but need GPU resources to execute? Email us your proposed approach. We fund compute on a case-by-case basis for ideas we find promising.
Stage 1 scores are preliminary. The winner will be determined by Stage 2, where we take your submitted model and evaluate it on entirely new experimental data from our lab. You don't submit anything new — your Stage 1 model carries forward.
We have sent approximately 40 RT candidates to the wet lab for prime editing activity testing. These are distributed roughly evenly across Retroviral, Retron, and Group II Intron families. Activity distribution is unknown — we don't yet know how many will be functional.
Your Stage 1 submission stays on the leaderboard. We run the same model on the Stage 2 candidates and re-score using the same CLS metric — but now evaluated on actual measured PE efficiency from lab assays. No LOFO — the model predicts directly on blind candidates.
1. Stage 1 leaderboard is live now — submit, iterate, improve your model.
2. Stage 1 closes. Your best submission is locked in.
3. We run your model on ~40 new RT candidates with real wet-lab data.
4. Leaderboard is updated with Stage 2 CLS scores. Winner is announced.
Yes, absolutely. You can use protein language models (ESM, ProtTrans, etc.), general LLMs, or any other foundation model. We encourage creative use of modern ML tools. Just document what you use and how.
You don't need a biology degree to compete. But the winning approach will almost certainly require thinking about the biophysics — not just the data matrix. You can participate in teams. Watch our webinar recording for a deep dive into the problem.
You retain full ownership of your code. The winning submission will be open-sourced under MIT license as part of the prize terms — meaning it becomes publicly available, but you remain the author and owner. Non-winning submissions are not published or shared.
Yes, teams are welcome. There's no limit on team size. Just submit under a single team name and list all contributors. The prize will be awarded to the team, and distribution is your responsibility.
We use two mechanisms: (1) Leave-One-Family-Out cross-validation in Stage 1, which tests generalization across evolutionary families, and (2) in Stage 2, we re-run your submitted model on proprietary experimental data that has never been public. You don't submit new predictions — your Stage 1 model is tested as-is on new data. This ensures winning models truly generalize.
CLS requires both good classification (PR-AUC) and good ranking (WSpearman). If your predicted_score is just classification probability (e.g., 0.9 for all actives, 0.1 for all inactives), you might classify well but rank poorly among the actives. The metric rewards models that can distinguish a 41% RT from a 1.5% RT, not just active from inactive.
It can be anything on any scale. CLS only cares about the ranking your scores produce (via Spearman) and the separation between actives and inactives (via PR-AUC). A score of 100 for the best RT and 0 for the worst works just as well as 0.95 and 0.05. What matters is the order.
It reflects the real-world cost structure. Missing a 41% efficiency RT is far more costly than misranking a 1.5% RT. In the wet lab, you want to test the best candidates first. The weighting ensures models that identify top performers are rewarded more than models that only rank low-performers correctly.
In Stage 1, Retroviral carries more weight in pooled WSpearman because its active RTs have higher efficiencies. In Stage 2, where families are evenly distributed, each family contributes roughly equally. To win Stage 2, your model must generalize across all families — the LOFO cross-validation in Stage 1 is specifically designed to test this.
Mandrake Bio is an AI-first gene editing company based in Bangalore, India. We use the best tools of today to build the best gene editing tools of tomorrow — and make biology truly programmable at scale.
Backed By
The hardest problems in biology won't be solved by one team alone. A physicist, a self-taught ML engineer, or a grad student halfway across the world might see the pattern we missed. We have our own internal approaches to this problem. We're opening it because we believe the best ideas come from diverse perspectives — and because we want to find the people who think about these problems the way we do.
This isn't a one-off competition. Winning approaches will be validated on real experimental data from our lab. Top performers join us — through paid collaborations or full-time roles — to work on the next set of problems. This is how we build the team.
Our Team Comes From
Our mission: Build the gene editing infrastructure the world needs. We believe the future of biology is computational — and we're hiring the people who will make it happen.
Kaggle only accepts prediction CSVs — not source code. Share your GitHub repository or Google Drive link here so we can consider your approach as a top submission.
Your code remains your IP as per the terms defined. We will use it solely to assess your solution and verify that it genuinely performed well without gaming the metric.
Stay updated on this challenge, join our mailing list for future updates, and connect with other participants.