Building Yomi Sensei

2026-01-01

Why I Started This Project

In early 2025, I studied abroad in Osaka, Japan. It was a language study program, and I had already been studying Japanese for just under two years, so I was starting to reach a mid-N3-to-N2 level and reading novels like Re:Zero and Too Many Losing Heroines.

However, I struggled when reading certain books, like those by Murakami Haruki. Some of them I could “read” with heavy help from a dictionary, but I wasn’t able to grasp certain nuances or, in some cases, even the overall plot of the book.

I remember a particularly disappointing case from a visit to a bookstore in Denden Town. I had just finished reading the first Re:Zero light novel and felt quite cocky about my Japanese ability. I saw a plastic-wrapped copy of The Apothecary Diaries and, having heard about the anime, decided to purchase it without checking whether I could understand it first.

When I got home and took off the plastic wrap, I realized I was unable to read it even with the aid of a dictionary. That book is still sitting on my bookshelf.

Of course, I don’t regret reading any of the books that I didn’t fully understand. All of them contributed significantly to my reading ability and were enjoyable to read to some extent.

However, forcing yourself through books that are too difficult is a real slog. For someone earlier in their learning journey, I wouldn’t be surprised if it caused a massive slump or made them quit learning altogether.

The Problem I’m Solving

Of course, you shouldn’t force yourself to read anything you don’t want to, just like I haven’t read The Apothecary Diaries yet. But that leaves the question, “How do I find native content that is at my level?” - and by “at my level”, I also mean not “too easy”, because I still want to learn a lot when I read.

Solutions to this problem already exist, like Learn Natively and Tadoku Reader. Don’t get me wrong, these are great resources. But I don’t think native content is well suited to being pigeonholed into 5 arbitrary levels made up by the JLPT.

For example, the average learner takes anywhere from 1 to 2 years to get from N3 to N2. Should a learner halfway to N2 read the same content as a learner who just barely passed N3? Should an N3 learner who has up until now read mostly slice-of-life manga read the same content as an N3 learner who has mostly watched slice-of-life anime?

Yomi Sensei doesn’t solve all of these problems by any means, and in fact falls into a similar pitfall of evaluating learners on a linear scale instead of a multidimensional space of potential learning journeys. But what I was hoping to build is something more personal than a label drawn from 5 categories.

My Solution

I decided to build a content recommendation engine that actually evaluates your skill level instead of just having you guess.

I had heard about Aozora Bunko - a collection of over 15,000 out-of-copyright Japanese works - and figured it would be a great starting point: since the books are free to distribute, people get instant feedback on whether my recommendations are good.

I originally planned on making an e-reader for macOS, but scrapped the idea after struggling with Xcode and realizing that I was just making a shittier version of ttsu.app with book recs at the end. As someone who uses Yomitan, ttsu.app, and mpvacious, I realized it would be better to build something that integrates with the language learning ecosystem instead of trying to compete with existing technologies.

But from the failed e-reader attempt, I had made some progress on the user evaluation problem. Wait - what was I trying to do again?

Evaluating Reading Levels

So somehow I need to evaluate a user on their reading ability, and then using their reading ability, evaluate their capability of reading a book. Hmmmm.

The i+1 rule is pretty well known and could be applied here. It says that for optimal learning, any given sentence should contain at most 1 unknown word. If I can calculate, for each sentence in a book, how many words you’re likely not to know, then average that across sentences, I can assign a number to each book that says, “On average, you will look up this many words per sentence”.
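As a rough sketch of that scoring idea: assuming we already have a per-word recall probability for the user (the `p_recall` dict and tokenized sentences below are hypothetical inputs, not the real pipeline), the book-level number is just the average expected count of unknown words per sentence.

```python
# Sketch: score a book as the average expected number of unknown words
# per sentence. `p_recall` maps each word to the estimated probability
# that this particular user already knows it; words the model has never
# seen fall back to a default guess. All values here are made up.

def expected_lookups_per_sentence(sentences, p_recall, default=0.5):
    """Average, over sentences, of the expected count of unknown words."""
    totals = []
    for sentence in sentences:  # each sentence is a list of tokens
        # Expected number of lookups = sum over words of P(word is unknown).
        expected_unknown = sum(1.0 - p_recall.get(w, default) for w in sentence)
        totals.append(expected_unknown)
    return sum(totals) / len(totals) if totals else 0.0

# Toy example with made-up recall probabilities:
p_recall = {"猫": 0.99, "座る": 0.9, "顕著": 0.2}
sentences = [["猫", "座る"], ["顕著", "猫"]]
score = expected_lookups_per_sentence(sentences, p_recall)
```

A score near 0 means the book is trivial for this user; a score well above 1 means nearly every sentence breaks the i+1 rule.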

Aha Moment

“Simple”, I thought. I’ll just say, each user has a JLPT level from N5-N1 that I don’t know yet. Also, each word has a JLPT level from N5-N1 that I might know (from a curated list of words) but also might not. This is because I had a dictionary of 40,000 words but only around 6,000 of them were labeled with a JLPT level.

Then, given how users are able to recall these words, I can not only hone in on a user’s JLPT level, but I can also figure out how difficult a word is! For example, if 顕著 is being failed a lot by people who I think are N1, that must mean 顕著 is really hard. If a word is being recalled by people who I think are N5 and N4, it’s probably pretty easy.

Once I know the level of a word, and given your level and how people at your level respond to that word, I can figure out the probability that you will recall that word. For example, if I think you’re N3 and other people at N3 are 50/50 on 免税, I can guess that you’re probably gonna have a 50/50 chance of getting 免税 right.

Then I just have to do this calculation for every word in the book and average across its sentences!

Despair

So I did the math with a lot of help from Claude. Apparently, this algorithm is called variational Bayesian inference.

Each word is assigned 2 parameters per level (so 10 total) that determine how hard a word is for users of that level. Each user is assigned 1 parameter per level (5 total) that is the algorithm’s estimate for how likely they are to be that level.

When I get an observation from the assessment, I can update both the user parameters and the word parameters according to Bayes’ theorem to make them closer to what I would expect for that observation.

Let $\pi_u \in \mathbb{R}^5$ be the user’s level distribution (a categorical distribution over N5–N1), and let each word $w$ have parameters $\alpha_{w,k}$ and $\beta_{w,k}$ for each level $k$. The probability that user $u$ recalls word $w$ is:

$$P(X_{uw} = 1) = \sum_{k=1}^{5} \pi_{u,k} \cdot \frac{1}{1 + e^{-\alpha_{w,k}(\theta_k - \beta_{w,k})}}$$
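In code, this mixture is a weighted sum of one logistic curve per level. A minimal numpy sketch, where the level anchors and word parameters are made-up numbers rather than fitted values:

```python
import numpy as np

# Sketch of the mixture recall probability: a weighted sum, over the 5
# candidate JLPT levels, of a per-level logistic curve. All parameter
# values below are illustrative, not real model fits.

THETA = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # fixed ability anchors, N5..N1

def p_recall(pi_u, alpha_w, beta_w):
    """P(recall) = sum_k pi_k * sigmoid(alpha_k * (theta_k - beta_k))."""
    logits = alpha_w * (THETA - beta_w)
    return float(pi_u @ (1.0 / (1.0 + np.exp(-logits))))

pi_u = np.array([0.01, 0.02, 0.49, 0.47, 0.01])  # user level distribution
alpha_w = np.ones(5)                             # per-level discrimination
beta_w = np.array([3.0, 2.0, 1.0, 0.5, 0.0])     # per-level difficulty
prob = p_recall(pi_u, alpha_w, beta_w)
```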

where $\theta_k$ is a fixed ability anchor for level $k$. After observing a response $x \in \{0, 1\}$, we update the user’s level distribution via Bayes’ theorem:

$$\pi_{u,k}^{\,\text{new}} = \frac{\pi_{u,k} \cdot P(X_{uw} = x \mid \text{level} = k)}{\sum_{j=1}^{5} \pi_{u,j} \cdot P(X_{uw} = x \mid \text{level} = j)}$$
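One of these Bayes steps is a few lines of numpy. A sketch with illustrative parameters (not real fits), where the likelihood is the per-level sigmoid for a correct answer, or its complement for a miss:

```python
import numpy as np

# Sketch of the per-response Bayes update of the user's level
# distribution. Parameter values are made up for illustration.

THETA = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # fixed anchors, N5..N1

def update_pi(pi_u, alpha_w, beta_w, x):
    """One Bayes step: posterior over levels after observing response x."""
    p = 1.0 / (1.0 + np.exp(-alpha_w * (THETA - beta_w)))  # P(recall | level)
    likelihood = p if x == 1 else 1.0 - p
    posterior = pi_u * likelihood
    return posterior / posterior.sum()  # renormalize

pi_u = np.full(5, 0.2)            # uniform prior over the 5 levels
alpha_w = np.ones(5)
beta_w = np.full(5, 1.0)          # a fairly hard word
pi_after_miss = update_pi(pi_u, alpha_w, beta_w, x=0)
```

Missing a hard word shifts probability mass toward the lower levels, as you'd expect.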

For the word parameters, we maximize the expected complete-data log-likelihood (the M-step of variational EM):

$$\alpha_{w,k}^{\,\text{new}},\, \beta_{w,k}^{\,\text{new}} = \operatorname*{arg\,max}_{\alpha,\beta} \sum_{u \in \mathcal{U}_w} \pi_{u,k} \Big[ x_{uw} \log \sigma\!\big(\alpha(\theta_k - \beta)\big) + (1 - x_{uw}) \log \big(1 - \sigma\!\big(\alpha(\theta_k - \beta)\big)\big) \Big]$$

where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the logistic function and $\mathcal{U}_w$ is the set of users who have been tested on word $w$.

However, after building the algorithm, I noticed something strange about its behavior. The algorithm would hone in on your JLPT level way too fast.

And this is not a good thing. After pondering, I realized my mistake. My assumption was that the algorithm would tell me how much each user was in each level. For example, if you were halfway between N3 and N2, it would say something like

  • N5: 1%
  • N4: 2%
  • N3: 49%
  • N2: 47%
  • N1: 1%

But this is not what my model was saying. What I had instead implicitly forced the algorithm to assume was that each user is exactly one of these levels, and its job was simply to find the single level they’re most likely to be.

Instead of giving you personalized recommendations, I had inadvertently created a machine that did what I feared most: place me into 1 of 5 rigidly defined buckets.

Rethinking Life

OK, so this algorithm isn’t a very good model of real life. Or maybe it is if you’re a JLPT administrator. But I can still use it as a fun extra at the end of the assessment if you want to see your estimated JLPT level.

What I then wrote instead is a two-parameter logistic (2PL) IRT model. In this formulation, each user gets a single continuous ability parameter $\theta_u \in \mathbb{R}$ instead of a categorical distribution. Each word $w$ has a discrimination $a_w > 0$ and a difficulty $b_w \in \mathbb{R}$. The probability that user $u$ recalls word $w$ is:

$$P(X_{uw} = 1 \mid \theta_u, a_w, b_w) = \frac{1}{1 + e^{-a_w(\theta_u - b_w)}}$$

Estimating $\theta_u$ is then a maximum a posteriori problem with a standard normal prior:

$$\hat{\theta}_u = \operatorname*{arg\,max}_{\theta} \left[ \sum_{w \in \mathcal{W}_u} \Big( x_{uw} \log \sigma\!\big(a_w(\theta - b_w)\big) + (1 - x_{uw}) \log\big(1 - \sigma\!\big(a_w(\theta - b_w)\big)\big) \Big) - \frac{\theta^2}{2} \right]$$

This is much better. Instead of cramming you into 1 of 5 buckets, you get a real number on a continuous scale. A user at $\theta = 0.7$ and a user at $\theta = 1.3$ will get meaningfully different recommendations, even though both would be classified as “N3” by the categorical model.
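Because this objective is concave in $\theta$ (logistic log-likelihood plus a Gaussian log-prior), the MAP estimate can be found with a few Newton steps. A self-contained sketch, with made-up word parameters and responses:

```python
import numpy as np

# Sketch: MAP estimate of ability theta under the 2PL model with a
# standard normal prior, via Newton's method. The word parameters and
# responses below are illustrative, not real fitted values.

def estimate_theta(a, b, x, n_iter=50):
    """MAP estimate of theta given responses x to words with params (a, b)."""
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(recall | theta)
        grad = np.sum(a * (x - p)) - theta            # d/dtheta log posterior
        hess = -np.sum(a**2 * p * (1.0 - p)) - 1.0    # always negative (concave)
        theta -= grad / hess                          # Newton step
    return theta

a = np.array([1.0, 1.2, 0.8, 1.5])    # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.0])   # difficulties
x = np.array([1, 1, 0, 0])            # recalled the two easier words, missed the rest
theta_hat = estimate_theta(a, b, x)
```

With this pattern of responses, the estimate lands between the difficulties of the hardest word recalled and the easiest word missed, pulled slightly toward 0 by the prior.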