Webinar Recording Available  |  Deep dive into the challenge, evaluation metrics, and how to get started  |  Watch Now
Open Problem #1

Mandrake Open Problems #1:
The Retroviral Wall

Can you predict which reverse transcriptases work for prime editing — without memorizing their evolutionary family?

Participate View on GitHub View Leaderboard Submit Code Register Interest

Prime Editing's Bottleneck

Prime editing is arguably the most precise genome editing technology ever built. It can write any of the 12 possible point mutations, make small insertions and deletions, and do it all without double-strand breaks or donor DNA templates. Where earlier CRISPR tools cut and hope, prime editing writes exactly what you want. It has the potential to correct the majority of known pathogenic mutations in the human genome.

But prime editing has a bottleneck — and it's biological, not computational.

At the heart of every prime editor is a reverse transcriptase (RT) enzyme that reads an RNA template and writes the desired edit into DNA. The RT determines whether the edit works, how efficiently it works, and in which cell types it works. Today's best prime editors use MMLV-derived RTs — large enzymes (~700 amino acids) that are difficult to package into viral delivery vectors and challenging to deliver therapeutically. Smaller, more efficient RTs would unlock prime editing for a far wider range of applications.

The Dry-Lab to Wet-Lab Loop

Nature has evolved thousands of RTs across retroviruses, bacteria, and mobile genetic elements. Some will work for prime editing. Most won't. Experimentally screening candidates is expensive and slow — each round takes months of cloning, expression, and activity measurement. Every candidate you can eliminate computationally is a month saved at the bench.

No Universal Objective

Unlike antibody engineering, where binding affinity provides a single, well-defined metric to optimise against, enzyme function has no universal objective. An RT that works for prime editing must satisfy multiple constraints simultaneously — catalytic efficiency, processivity, fidelity, thermostability, structural compatibility with the Cas9 fusion, and more.

The ML Trap

The 57 experimentally tested RTs in this dataset come from 7 evolutionary families, and activity is heavily confounded with family membership. Models that achieve high accuracy are often just memorising "Retroviral = active" rather than learning the underlying biophysics. When tested on a held-out evolutionary family, they collapse.

The Retroviral Wall

In leave-one-family-out evaluation, predicting activity for the Retroviral family — the most important family, containing the majority of known active RTs — is nearly impossible from patterns learned on other families. Breaking through this wall would be a meaningful advance in computational biology and protein engineering.

Dataset

What We're Providing

This dataset goes beyond raw activity labels. We've enriched every RT using our computational platform to maximise the signal available from 57 experimentally tested enzymes (Doman et al.):

98 Biophysical Features

Spanning thermostability, catalytic geometry, surface charge, solubility, structural contacts, and more — signals that go beyond sequence alone.

ESM-2 Embeddings

1280-dimensional protein language model embeddings for every RT, capturing evolutionary and structural patterns.

Predicted 3D Structures

ESMFold-predicted structures for all 57 enzymes in PDB format, ready for structural analysis and featurisation.

Full Sequences & Annotations

Complete amino acid sequences and evolutionary family annotations. Use any additional tools, models, or databases you like.

The signal is in this dataset. The question is whether you can find it.
We'll provide compute credits for compelling ideas that need GPU resources to execute.

Data

RT Family Breakdown

Family               Total  Active  Inactive  Efficiency Range
Retroviral             18     12       6      1.5% – 41.0%
Retron                 12      5       7      0.5% – 8.0%
LTR Retrotransposon    11      2       9      9.0% – 34.0%
Group II Intron         5      2       3      1.0% – 12.5%
CRISPR-associated       5      0       5      N/A
Other                   5      0       5      N/A
Unclassified            1      0       1      N/A
Total                  57     21      36      0.5% – 41.0%

Your Goal

Build a model that predicts RT activity based on sequence, structure, or biophysical properties — not family membership. We evaluate using Leave-One-Family-Out (LOFO) cross-validation to ensure your model learns generalizable biology, not shortcuts.

Evaluation & Scoring

We use a single metric called CLS (Cross-Lineage Score) that forces your model to solve two problems at once: separating active RTs from inactive ones, and ranking them by how well they actually perform.

Cross-Lineage Score
CLS = 2 × PR-AUC × WSpearman / (PR-AUC + WSpearman)
Harmonic mean — both components must be good. One near zero collapses the score.
Classification
PR-AUC
Precision-Recall Area Under Curve

How well do your predicted scores separate active RTs from inactive ones?

At each threshold on your score: Precision = of the RTs you called active, how many actually are? (bench time efficiency). Recall = of all truly active RTs, how many did you catch? (discovery rate).

Why PR-AUC over accuracy or ROC-AUC? The dataset is imbalanced — 21 active vs 36 inactive. A model predicting "everything inactive" gets 63% accuracy for free. PR-AUC focuses on finding the active RTs without false alarms.

Random baseline: ~0.37 (the base rate of active RTs)
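As a concrete sketch, PR-AUC and its base-rate baseline can be computed with scikit-learn's standard estimator. The labels and scores below are toy values for illustration, not the challenge data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy example: 1 = active RT, 0 = inactive (not the real dataset).
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1])

# average_precision_score is a standard estimator of PR-AUC.
pr_auc = average_precision_score(y_true, y_score)

# A random scorer's expected PR-AUC is the positive base rate.
base_rate = y_true.mean()
print(round(pr_auc, 3), base_rate)  # 0.867 0.375
```

Note how a mostly-correct ordering already scores well above the base rate; a constant or random score cannot.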
Ranking
Weighted Spearman
weight_i = pe_efficiency_i + ε   (ε = 0.1)

Among the RTs you score highly, are the best-performing ones actually ranked highest?

The weighting is efficiency-proportional: MMLV (41% efficiency) has weight 41.1 — getting its rank wrong is heavily penalized. Inactive RTs (0%) have weight ~0.1 — their relative ranking barely matters.

This directly encodes our wet-lab priority: do not miss the highest-performing candidates. A negative correlation (worse than random) is floored at 0.

Random baseline: ~0.000
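The official scoring script lives in the GitHub repository; as an illustration, one standard way to implement a weighted Spearman is a weighted Pearson correlation on ranks, with the zero floor described above. Details such as tie handling in the real script may differ:

```python
import numpy as np
from scipy.stats import rankdata

def weighted_spearman(y_true, y_pred, eps=0.1):
    """Weighted Pearson correlation on ranks, weight_i = efficiency_i + eps."""
    w = np.asarray(y_true, dtype=float) + eps
    rt = rankdata(y_true)   # ranks of true PE efficiencies
    rp = rankdata(y_pred)   # ranks of predicted scores

    def wcov(a, b):
        am = np.average(a, weights=w)
        bm = np.average(b, weights=w)
        return np.average((a - am) * (b - bm), weights=w)

    rho = wcov(rt, rp) / np.sqrt(wcov(rt, rt) * wcov(rp, rp))
    return max(rho, 0.0)  # worse-than-random correlation is floored at 0

# A perfect ranking scores 1.0; a fully reversed ranking is floored to 0.0.
eff = np.array([41.0, 8.0, 1.5, 0.0, 0.0])
print(weighted_spearman(eff, eff), weighted_spearman(eff, -eff))
```

With these toy efficiencies, the 41% enzyme dominates the weighting exactly as described above: misranking it moves the score far more than misranking the inactive pair.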

Why the Harmonic Mean?

If either component is near zero, CLS collapses. A model with PR-AUC = 0.90 and WSpearman = 0.05 gets CLS = 0.095, not 0.475 (which a simple average would give). You cannot compensate for terrible ranking with great classification, or vice versa. Both problems must be solved.
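The collapse behaviour is easy to see in code. A small helper (hypothetical, but mirroring the stated definition) reproduces the numbers above:

```python
def cls_score(pr_auc, w_spearman):
    """Harmonic mean of PR-AUC and weighted Spearman."""
    if pr_auc <= 0 or w_spearman <= 0:
        return 0.0  # either component at zero collapses the score
    return 2 * pr_auc * w_spearman / (pr_auc + w_spearman)

print(round(cls_score(0.90, 0.05), 3))  # 0.095, vs 0.475 for a simple average
```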

Validation Method

Leave-One-Family-Out (LOFO) Cross-Validation

All predictions are generated using LOFO cross-validation. For each of the 7 evolutionary families, the model is trained on the remaining 6 families and predicts on the held-out family. When predicting Retroviral RTs, the model has never seen a Retroviral RT during training.

Retroviral (held out)
Retron
LTR Retrotransposon
Group II Intron
CRISPR-associated
Other
Unclassified

The 7 LOFO folds produce 57 out-of-fold predictions (one per RT). Both PR-AUC and WSpearman are computed on these pooled predictions — not averaged per family. This avoids amplifying noise from small families (e.g., Group II Intron has only 2 active RTs).
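A minimal sketch of this pooled out-of-fold procedure uses scikit-learn's LeaveOneGroupOut. Everything here is synthetic: random features, toy labels, and a placeholder logistic-regression model standing in for whatever you actually build:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-ins for the real data: X would be biophysical
# features or ESM-2 embeddings, y the active/inactive labels, and
# `families` the evolutionary-family annotation for each of the 57 RTs.
rng = np.random.default_rng(0)
X = rng.normal(size=(57, 8))
y = np.array([1, 0, 0] * 19)     # toy labels: 19 active, 38 inactive
families = np.arange(57) % 7     # 7 toy family labels

oof = np.empty(len(y), dtype=float)  # one out-of-fold score per RT
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=families):
    # The held-out family is never seen during training.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# PR-AUC and WSpearman are then computed once on the pooled 57 scores.
print(oof.shape)
```

The key point is the single pooled `oof` vector: metrics are computed once over all 57 predictions, never averaged across the 7 folds.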

Reference Points

Baseline Scores

Baseline              PR-AUC  WSpearman  CLS
Predict all inactive  ~0.37   0.000      0.000
Random scores         ~0.37   ~0.000     ~0.000

See how current submissions stack up. View Leaderboard →

5 Steps to Compete

1. Download the Dataset

Get the RT sequences, activity labels, enriched biophysical features, and evaluation scripts from our GitHub repository.

2. Build Your Model

Use any approach: sequence embeddings, structural features, language models, or novel methods. Creativity is encouraged.

3. Submit by April 30, 2026

Submit your predictions CSV on Kaggle. All submissions go through the Kaggle competition page.

4. Finalists Announced — May 2026

Top submissions selected based on reproducible performance and methodological novelty.

5. Winner Validated — Q2 2026

Finalist approaches will be run on Mandrake's proprietary experimental data. The winner will be determined by real-world generalization.

Challenge Guidelines

  • Open Worldwide — Individuals and teams welcome, regardless of location or affiliation.
  • Submit via Kaggle — All submissions must go through Kaggle. You may update your submission before the deadline.
  • Reproducibility Required — We will run your code independently. If we can't reproduce your results, the submission is disqualified.
  • External Data Allowed — Use any datasets, pre-trained models, or tools you want. Document all sources in your writeup.

Prizes & Benefits

$1,000 USD

For the winning submission

  • Your model validated on Mandrake's proprietary wet-lab data (Stage 2)
  • Co-authorship support: We will work with the winning team to co-author a publication on the methodology
  • Recognition on our website, GitHub, and social channels

The real prize isn't the cash

  • Paid research collaborations on active Mandrake projects
  • Full-time roles for outstanding candidates
  • Work on problems where your models get validated in a real lab

Compute Credits: Have a compelling idea but need GPU resources to execute? Email us your proposed approach. We fund compute on a case-by-case basis for ideas we find promising.

Stage 2: Wet-Lab Validation

Stage 1 scores are preliminary. The winner will be determined by Stage 2, where we take your submitted model and evaluate it on entirely new experimental data from our lab. You don't submit anything new — your Stage 1 model carries forward.

~40 New RT Candidates

We have sent approximately 40 RT candidates to the wet lab for prime editing activity testing. These are distributed roughly evenly across Retroviral, Retron, and Group II Intron families. Activity distribution is unknown — we don't yet know how many will be functional.

Same Model, New Data

Your Stage 1 submission stays on the leaderboard. We run the same model on the Stage 2 candidates and re-score using the same CLS metric — but now evaluated on actual measured PE efficiency from lab assays. No LOFO — the model predicts directly on blind candidates.

Timeline

1. Stage 1 leaderboard is live now — submit, iterate, improve your model.
2. Stage 1 closes. Your best submission is locked in.
3. We run your model on ~40 new RT candidates with real wet-lab data.
4. Leaderboard is updated with Stage 2 CLS scores. Winner is announced.

Frequently Asked Questions

Can I use pre-trained models or LLMs?

Yes, absolutely. You can use protein language models (ESM, ProtTrans, etc.), general LLMs, or any other foundation model. We encourage creative use of modern ML tools. Just document what you use and how.

Do I need a biology background?

You don't need a biology degree to compete. But the winning approach will almost certainly require thinking about the biophysics, not just the data matrix. You can participate in teams. Watch our webinar recording for a deep dive into the problem.

Who owns the code I submit?

You retain full ownership of your code. The winning submission will be open-sourced under the MIT license as part of the prize terms, meaning it becomes publicly available, but you remain the author and owner. Non-winning submissions are not published or shared.

Can I compete as a team?

Yes, teams are welcome. There's no limit on team size. Just submit under a single team name and list all contributors. The prize will be awarded to the team, and distribution is your responsibility.

How do you prevent overfitting or leaderboard gaming?

We use two mechanisms: (1) Leave-One-Family-Out cross-validation in Stage 1, which tests generalization across evolutionary families, and (2) in Stage 2, we re-run your submitted model on proprietary experimental data that has never been public. You don't submit new predictions — your Stage 1 model is tested as-is on new data. This ensures winning models truly generalize.

Why not just score classification?

CLS requires both good classification (PR-AUC) and good ranking (WSpearman). If your predicted_score is just a classification probability (e.g., 0.9 for all actives, 0.1 for all inactives), you might classify well but rank poorly among the actives. The metric rewards models that can distinguish a 41% RT from a 1.5% RT, not just active from inactive.

What scale should my predicted_score use?

It can be anything on any scale. CLS only cares about the ranking your scores produce (via Spearman) and the separation between actives and inactives (via PR-AUC). A score of 100 for the best RT and 0 for the worst works just as well as 0.95 and 0.05. What matters is the order.
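This order-only property is easy to verify with toy numbers (illustrative values, not challenge data):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score

y = np.array([1, 1, 0, 0])
a = np.array([0.95, 0.80, 0.30, 0.05])   # scores on [0, 1]
b = np.array([100.0, 80.0, 30.0, 0.0])   # same ordering, arbitrary scale

# Rank-based metrics only see the ordering, so both score vectors
# give identical PR-AUC, and they are perfectly rank-correlated.
same_ap = average_precision_score(y, a) == average_precision_score(y, b)
rho, _ = spearmanr(a, b)
print(same_ap, rho)
```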

Why weight the Spearman correlation by efficiency?

It reflects the real-world cost structure. Missing a 41% efficiency RT is far more costly than misranking a 1.5% RT. In the wet lab, you want to test the best candidates first. The weighting ensures models that identify top performers are rewarded more than models that only rank low performers correctly.

Does any one family dominate the score?

In Stage 1, Retroviral carries more weight in the pooled WSpearman because its active RTs have higher efficiencies. In Stage 2, where families are evenly distributed, each family contributes roughly equally. To win Stage 2, your model must generalize across all families — the LOFO cross-validation in Stage 1 is specifically designed to test this.

About Mandrake Bio

Mandrake Bio is an AI-first gene editing company based in Bangalore, India. We use the best tools of today to build the best gene editing tools of tomorrow — and make biology truly programmable at scale.


Why Open Challenges?

The hardest problems in biology won't be solved by one team alone. A physicist, a self-taught ML engineer, or a grad student halfway across the world might see the pattern we missed. We have our own internal approaches to this problem. We're opening it because we believe the best ideas come from diverse perspectives — and because we want to find the people who think about these problems the way we do.

What Happens After?

This isn't a one-off competition. Winning approaches will be validated on real experimental data from our lab. Top performers join us — through paid collaborations or full-time roles — to work on the next set of problems. This is how we build the team.

Our Team Comes From

ICAR
BITS Pilani
IIT Madras
IIIT Hyderabad
UCSF
University of Southampton
Lossfunk
IIT BHU
IIT Kharagpur
Imperial College London

Our mission: Build the gene editing infrastructure the world needs. We believe the future of biology is computational — and we're hiring the people who will make it happen.

Submit Your Code

Kaggle only accepts prediction CSVs — not source code. Share your GitHub repository or Google Drive link here so we can consider your approach as a top submission.

Your code remains your IP as per the terms defined. We will use it solely to assess your solution and verify that it genuinely performed well without gaming the metric.

Register Your Interest

Stay updated on this challenge, join our mailing list for future updates, and connect with other participants.