One forgetting curve, three kinds of test

June 28, 2026

I reimplemented FSRS-6 so flashcards, model-graded answers, and teach-backs all schedule on one memory model, with no due-date column anywhere.

Ohm, my learning app, tests you three ways. It shows you a flashcard. It asks a question and has a model grade the answer you write out. And it asks you to explain a topic back to it, then grades how well you taught it.

Whatever you just did, the same thing has to happen next: the app decides when to show that item again. One item, one schedule, regardless of which of the three ways you met it. That turned out to be the hard part, because the spaced-repetition algorithm everyone reaches for was built for flashcards and quietly assumes you're holding one.

What is FSRS?

I came to FSRS the way most people do, through Anki. FSRS, the Free Spaced Repetition Scheduler, is the open-source algorithm Anki uses to decide when to show a card again, and it's documented down to the formulas. Building Ohm, I had no interest in inventing a memory model from scratch when this one was sitting right there, proven and free to read.

FSRS models your memory of a single item with two numbers. The first, Stability, is how durable the memory is: in FSRS, the number of days until your chance of recalling the item decays to 90%. The second is Difficulty, or how hard the item is for you, and it sets how fast stability grows. Put the two together and you get a forgetting curve, your probability of recall as time passes since you last saw the item. Recall it correctly and stability rises, so the curve flattens and the gap to the next review stretches out. Miss it and stability collapses, and it comes back almost at once.

Recall decays after each review, and the item comes due when it reaches the target line. Each correct review grows stability, so the next interval runs longer than the last. The final stretch carries past the line: an item sitting due, not yet reviewed.
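
Here is a minimal sketch of that curve, assuming the published FSRS-6 power-law form, in which stability is the elapsed time at which recall equals 90%. The function names are hypothetical, and the decay value is a placeholder rather than a fitted parameter.

```typescript
// Placeholder decay exponent. In FSRS-6 this is a fitted parameter;
// 0.2 is an illustrative assumption, not a tuned value.
const DECAY = 0.2;

// Probability of recall after `elapsedDays`, for a memory with the given
// stability (in days). The factor is chosen so that R(stability) = 0.9.
function retrievability(
  elapsedDays: number,
  stability: number,
  decay: number = DECAY,
): number {
  const factor = Math.pow(0.9, -1 / decay) - 1;
  return Math.pow(1 + (factor * elapsedDays) / stability, -decay);
}

// Days until recall decays to `target`, found by inverting the curve.
function daysUntil(
  target: number,
  stability: number,
  decay: number = DECAY,
): number {
  const factor = Math.pow(0.9, -1 / decay) - 1;
  return (stability / factor) * (Math.pow(target, -1 / decay) - 1);
}
```

With a target below 90%, `daysUntil` returns an interval longer than the stability itself, which is why a lower target means fewer reviews.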

FSRS only reads a button

To FSRS, a review is one of four buttons, Again, Hard, Good, or Easy, and that button is the whole input, the only thing the algorithm ever learns about how the review went. On a flashcard that's the entire truth. You flip the card, you knew it or you didn't, and Good versus Hard handles the shades in between.

A free-response answer doesn't arrive as a button. A model reads what you wrote against a rubric and returns a number, say 73 out of 100. The teach-back works the same way, scoring how completely and correctly you explained the idea. Force that 73 into one of four buckets and you discard most of what the grader actually saw, and because the buckets aren't evenly spaced, the rounding isn't even honest about it.

I could have run a separate scheduler for each kind of test, but then "due" means three unrelated things and a day's queue is three piles that can't see one another. The point was one schedule, so I needed one engine that could take all three.

A score between 0 and 1

So the engine doesn't take a button. Instead, it takes a score from 0 to 1, plus when it happened and which kind of test produced it. That triple is the whole unit of evidence it ingests:

type Attempt = {
  score: number; // 0..1
  gradedAt: Date;
  modality: "flashcard" | "free_response" | "teach_back";
};

FSRS-6's formulas stay verbatim. I only changed how the grade enters them: everywhere the math used to read one of the four buttons, it now reads the continuous score, interpolated so it passes exactly through the original bucketed values at fixed anchor points. Where FSRS already had an answer the answer is unchanged; it just also does something reasonable in the gaps the buttons left.

Which means flashcards lose nothing. Their four buttons are simply four of those anchors:

function flashcardRatingToScore(rating: "again" | "hard" | "good" | "easy"): number {
  switch (rating) {
    case "again": return 0;
    case "hard":  return 0.75; // just into the passing range
    case "good":  return 0.9;
    case "easy":  return 1.0;
  }
}

Press Good, land on the Good anchor, and the math does exactly what FSRS always did. A 0.73 from the essay grader lands a hair below Hard, in territory the four buttons never had a name for. One engine, and the flashcard gave up nothing to share it.
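
One way to picture the interpolation is sketched below. It assumes piecewise-linear interpolation between the four anchor scores, which is only one of several schemes that would pass through the original values. The helper name and the `ButtonValues` type are hypothetical. `values` stands for any quantity FSRS defines once per button.

```typescript
// Any quantity that FSRS defines separately for each of the four buttons.
type ButtonValues = { again: number; hard: number; good: number; easy: number };

// Scores at which the continuous version must reproduce the button values.
// These mirror the flashcard mapping above.
const ANCHOR_SCORES = { again: 0, hard: 0.75, good: 0.9, easy: 1.0 } as const;

// Piecewise-linear interpolation through the anchors (an assumption; the
// only hard requirement is hitting the anchor values exactly).
function interpolateByScore(score: number, values: ButtonValues): number {
  const s = Math.min(1, Math.max(0, score)); // clamp to 0..1
  const points: Array<[number, number]> = [
    [ANCHOR_SCORES.again, values.again],
    [ANCHOR_SCORES.hard, values.hard],
    [ANCHOR_SCORES.good, values.good],
    [ANCHOR_SCORES.easy, values.easy],
  ];
  for (let i = 1; i < points.length; i++) {
    const [x0, y0] = points[i - 1];
    const [x1, y1] = points[i];
    if (s <= x1) {
      return y0 + ((s - x0) / (x1 - x0)) * (y1 - y0);
    }
  }
  return values.easy;
}
```

At a score of 0.9 this returns exactly the Good value, and at 0.73 it returns something between the Again and Hard values, close to Hard.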

The line between a pass and a miss sits at 0.7. Score above it and stability grows; score below it and you've lapsed, and stability falls off scaled by how far you dropped. A 0.69 is a near miss and barely moves the schedule; a 0.1 is a real failure and throws the item back to the front. Flashcard Hard, at 0.75, sits just over the line on purpose, because Hard in FSRS means you got it, barely, not that you blew it.
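
A rough sketch of that scaling follows. The linear severity and the blend toward a post-lapse stability are assumptions made for illustration, and every name here is hypothetical. `fullLapseStability` stands for whatever FSRS-6 would assign after an outright Again.

```typescript
const PASS_THRESHOLD = 0.7; // the pass line described above

// 0 at or above the pass line, rising to 1 at a score of 0.
// Linear scaling is an assumption. So is treating exactly 0.7 as a pass.
function lapseSeverity(score: number): number {
  if (score >= PASS_THRESHOLD) return 0;
  return (PASS_THRESHOLD - Math.max(0, score)) / PASS_THRESHOLD;
}

// Blend from the current stability toward the full post-lapse stability,
// moving further the worse the miss was.
function stabilityAfterLapse(
  currentStability: number,
  fullLapseStability: number,
  score: number,
): number {
  const severity = lapseSeverity(score);
  return currentStability + severity * (fullLapseStability - currentStability);
}
```

Under these assumptions a 0.69 has a severity of about 0.014, so stability hardly moves, while a 0.1 has a severity of about 0.86 and takes the item most of the way to the floor.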

Not every test is worth the same

One engine, three kinds of test, but they don't carry equal weight. Getting a flashcard right means you recognized the answer. Explaining a topic cold means you rebuilt it from nothing. The latter says far more about whether you know the thing, and the engine should reflect that.

So each kind of test has a gain that scales how far a single result can move stability:

const GAIN_FACTORS = {
  flashcard: 1.0,
  free_response: 1.3,
  teach_back: 1.8,
};

A teach-back shifts your stability nearly twice as much as a flashcard at the same score. The ordering, recognition under production under synthesis, is the testing-effect literature talking. The exact ratios are mine, picked by intuition rather than fitted against real retention data, and they're the first thing I'll fit once there's enough data to fit anything.
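
One plausible way to apply the gain is to scale the change in stability that plain FSRS-6 would have made. That choice is an assumption, as are the `applyGain` helper and the `MIN_STABILITY` floor. Only the three factors come from the table above.

```typescript
type Modality = "flashcard" | "free_response" | "teach_back";

const GAIN_FACTORS: Record<Modality, number> = {
  flashcard: 1.0,
  free_response: 1.3,
  teach_back: 1.8,
};

// Hypothetical floor, in days, so a heavily weighted miss cannot push
// stability to zero or below.
const MIN_STABILITY = 0.01;

// `fsrsStability` is the stability plain FSRS-6 would produce for this
// review. The gain scales how far the item moves from where it was.
function applyGain(
  currentStability: number,
  fsrsStability: number,
  modality: Modality,
): number {
  const delta = fsrsStability - currentStability;
  const next = currentStability + GAIN_FACTORS[modality] * delta;
  return Math.max(MIN_STABILITY, next);
}
```

With a gain of 1.0 the flashcard path reduces to unmodified FSRS, which keeps the flashcard behavior identical to the original algorithm.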

Three kinds of test collapse to one score, drive one shared memory state, and that single state answers both questions the app cares about.

A teach-back also reaches past its own item. It covers a whole phase of the course at once, so passing one nudges the individual flashcards and free-responses under that phase upward too, at a fraction of the direct gain; failing one nudges them down. Diffuse evidence, but real, that the pieces underneath have or haven't set.
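
The spillover could be sketched like this. The 0.25 fraction, the multiplicative form, and all the names are hypothetical, since the only fixed point is that children move in the same direction as the teach-back at a fraction of its effect.

```typescript
type ChildState = { id: string; stability: number };

// Hypothetical share of the teach-back's effect passed on to each child.
const PROPAGATION_FRACTION = 0.25;

// `ratio` is the teach-back item's own stability change, new / old:
// above 1 for a pass, below 1 for a failure.
function propagateTeachBack(
  children: ChildState[],
  ratio: number,
  fraction: number = PROPAGATION_FRACTION,
): ChildState[] {
  // Shrink the ratio toward 1 so children move less than the parent did.
  const childRatio = 1 + fraction * (ratio - 1);
  return children.map((child) => ({
    ...child,
    stability: child.stability * childRatio,
  }));
}
```

The function returns new objects rather than mutating its input, so a caller can compare before and after or discard the result.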

No more due dates... kind of

Most spaced-repetition systems store, for each item, a due date and a state (new, learning, review, relearning), and keep a queue sorted by that date. Ohm stores none of it. An item carries only this:

type RetentionState = {
  stability: number;
  difficulty: number;
  lastReviewedAt: Date | null;
  reps: number;
  lapses: number;
};

No dueAt. No state enum. Whether an item is due is a question I answer at read time, by dropping the time since its last review into the forgetting curve and checking whether recall has slipped under the target:

const isDue = retentionProbabilityAt(state, now) < REQUEST_RETENTION; // 0.85

I run that across the items and surface the ones that have decayed furthest. That line is the scheduler.
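
Spelled out, that read-time scan might look like the sketch below. It assumes the FSRS-6 power-law curve with a placeholder decay value, and it assumes that a never-reviewed item counts as fully decayed. The `dueItems` helper and the `Item` shape are hypothetical.

```typescript
type RetentionState = {
  stability: number; // days
  difficulty: number;
  lastReviewedAt: Date | null;
  reps: number;
  lapses: number;
};

type Item = { id: string; state: RetentionState };

const REQUEST_RETENTION = 0.85;
const DECAY = 0.2; // placeholder, not a fitted FSRS-6 parameter
const MS_PER_DAY = 86_400_000;

function retentionProbabilityAt(state: RetentionState, now: Date): number {
  // Assumption: an item that was never reviewed is treated as fully decayed.
  if (state.lastReviewedAt === null) return 0;
  const elapsedMs = now.getTime() - state.lastReviewedAt.getTime();
  const elapsedDays = Math.max(0, elapsedMs / MS_PER_DAY);
  const factor = Math.pow(0.9, -1 / DECAY) - 1;
  return Math.pow(1 + (factor * elapsedDays) / state.stability, -DECAY);
}

// Items whose recall has slipped under the target, most decayed first.
function dueItems(
  items: Item[],
  now: Date,
  target: number = REQUEST_RETENTION,
): Item[] {
  return items
    .map((item) => ({ item, recall: retentionProbabilityAt(item.state, now) }))
    .filter(({ recall }) => recall < target)
    .sort((a, b) => a.recall - b.recall)
    .map(({ item }) => item);
}
```

Because `target` is just an argument, calling `dueItems` with a different value gives a different queue from the same stored state, with nothing to rewrite.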

Computing due instead of storing it pays off three ways. The schedule can't fall out of step with reality, because there's no stored copy of it to disagree with the memory; due is recomputed from the state every time you ask. The fast "Again" loop you want after a miss comes for free: a lapse drops stability to the floor, the curve crosses the target inside a few minutes, and the item is back in the same session, with no separate relearning mode to track. And the day I decide 85% was the wrong target, every item in the system reschedules at once, because there were never any due dates to migrate. Store the smallest true thing, the state of the memory itself, and let the schedule be a question you ask of it.

Summary

It comes down to one move. A flashcard, a model-graded answer, and a teach-back all collapse to a single score from 0 to 1. That score drives one FSRS-6 memory state per item, and whether the item is due is a question I ask of that state at read time, never a date I store.

I'm only using three kinds of test today, but anything that can be scored from 0 to 1 drops into the same engine, so the day I add a fourth, the scheduler doesn't even need to know it happened.

If you'd like to try it along with some other AI-based learning tools, check out https://ohmlearn.com.