How to Become Fluent in Japanese by Reading Exclusively Yaoi (and Yuri)

A few days ago I was browsing the Yomi Sensei catalog for reading material and I ended up playing around with the filters I had added just a week prior. I wasn’t looking for BL per se, it was just one of the checkboxes I happened to click.

Anyway, I saw that I had around 500 BL works in the library and I was hit with a stroke of genius.

How much yaoi would you have to read to reach an N1 reading level?

[Image: the Yomi Sensei catalog search screen with the Web Novel and BL filters selected. The BL filter in question, for anyone else who wants to "research".]

(N1 here being our standard for fluency).

Actually, before we get ahead of ourselves, is it even possible to reach N1 by reading just yaoi? I mean, most of the vocabulary isn’t even going to be stuff that appears on the N1 exam, right? You’d be learning how to discuss the sensualities of collarbones before you even knew how to ask for directions to the nearest station.

I’ve been crunching the numbers ever since to find the answer. Spoiler alert: It’s a resounding yes. But before I show you everything I discovered, I’ll include a list of the 1,000 best dōseiai works to read to comfortably learn 80% of JLPT vocabulary. In total, it’s about 2 million words, which would take the average native speaker 172 hours to read from start to finish. For a learner, something closer to 1,000 hours with dictionary lookups is more realistic.

The rainbow road to N1

If you’re just here for the reading list, look no further. These 1,000 titles were selected by an algorithm that optimizes for reinforcing the most JLPT vocabulary while introducing new words efficiently. They’re sorted by difficulty, so start from the top and work your way down.

[Interactive reading list: 1,000 titles, sorted by difficulty]

The data foundation

For those who don’t know, Yomi Sensei is a webapp and browser extension that tracks your reading level and recommends content that matches it. As a natural result of evaluating works and recommending them to people, I already had data on around 10,000 written works from 3 sources:

Syosetsuka Ni Narou, a Japanese website for people to write their own web novels. Highly popular in Japan, great for learners to immerse in the language
Aozora Bunko, a collection of out-of-copyright Japanese works that have been made freely available to read and distribute
Various anime from AniList’s API

In addition to a 400,000-word dictionary, I also had a set of 6,000 labeled JLPT vocabulary words that I use as a target to measure how the content of BL/GL works differs from the rest of the corpus.

Of these 10,000 works, only around 1,000 were BL/GL, so I spent the next few days evaluating more using Syosetsu’s wonderful API. By the end of this process, I had:

5,600 BL and GL titles
14,000 other titles from Syosetsu, Aozora Bunko, and anime transcripts, forming a “general” corpus for comparison

If you’ve read the previous blog post, you’ll know that Yomi Sensei assigns every word a difficulty parameter $\theta_w$ based on how frequently it appears across the full corpus. We then estimate each user’s ability $\theta_u$ through an IRT-based assessment, and the probability that a user knows a given word falls out as:

$P(w \text{ is known} \mid \theta_u) = \frac{1}{1 + e^{-(\theta_u - \theta_w)}}$

To map this back to JLPT levels, I take the 6,000 sample words and compute the mean $\theta_w$ for each level. Those means become the JLPT landmarks you’ll see on the charts below. So when I say “at N3 level”, I mean $\theta_u$ equals the average difficulty of words labeled N3 in our sample. At that point the formula gives a 50% chance of knowing a word of average N3 difficulty, so the user is expected to know roughly half of all “N3 level” words.
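As a minimal sketch of the model above, assuming the word difficulties have already been fitted: the function names and the sample $\theta_w$ values here are hypothetical and only illustrate the formula and the landmark averaging, not Yomi Sensei's actual code.

```python
import math
from collections import defaultdict

def p_known(theta_u: float, theta_w: float) -> float:
    """Probability that a user with ability theta_u knows a word of difficulty theta_w."""
    return 1.0 / (1.0 + math.exp(-(theta_u - theta_w)))

def jlpt_landmarks(labeled_words: dict) -> dict:
    """Mean difficulty per JLPT level. Input maps word -> (level, theta_w)."""
    by_level = defaultdict(list)
    for level, theta_w in labeled_words.values():
        by_level[level].append(theta_w)
    return {level: sum(vals) / len(vals) for level, vals in by_level.items()}

# Made-up levels and difficulties, for illustration only.
sample = {
    "学校": ("N5", -2.0),
    "電車": ("N5", -1.0),
    "経済": ("N3", 0.5),
    "観察": ("N3", 1.5),
}

landmarks = jlpt_landmarks(sample)
n3_user = landmarks["N3"]  # a user sitting exactly on the N3 landmark

print(p_known(n3_user, landmarks["N3"]))  # 0.5 for a word of average N3 difficulty
print(p_known(n3_user, -2.0))             # well above 0.5 for a much easier word
```

Because the curve depends only on $\theta_u - \theta_w$, a user on a landmark is at exactly 50% for a word of that level's average difficulty, and above or below 50% for easier or harder words.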

On the feasibility of reaching N1 through Yaoi/Yuri

So if you read nothing but BL and GL, are you actually going to encounter enough JLPT vocabulary to pass N1?

[Interactive chart: percentage of JLPT words at each level that appear in the BL/GL corpus]

Short answer: overwhelmingly yes. For starters, this chart breaks down what percentage of JLPT words at each level appear somewhere in the BL/GL corpus. Even at N1, where you’d expect the most obscure vocabulary, the vast majority of words show up at least once. The sliver of uncovered words is tiny.

This makes intuitive sense if you think about it. BL and GL novels are still novels. Characters go to work, ride trains, eat food, argue about politics, get sick, visit hospitals, sign contracts, and do all the mundane things that make up daily life in Japan. They just also happen to do other things.

But raw coverage doesn’t tell the whole story. Just because a word appears in the corpus doesn’t mean it appears often enough for you to actually learn it. A word that shows up once across 5,600 titles is technically “covered” but practically useless for acquisition.
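The difference between "appears at least once" and "appears often enough" can be sketched as a single function with a minimum-count parameter. The word counts and JLPT labels below are toy values, and `coverage_by_level` is a hypothetical helper, not the code behind the chart.

```python
from collections import Counter

def coverage_by_level(corpus_counts, jlpt_levels, min_count=1):
    """Percent of JLPT words per level that occur at least min_count times in the corpus."""
    totals, hits = Counter(), Counter()
    for word, level in jlpt_levels.items():
        totals[level] += 1
        if corpus_counts.get(word, 0) >= min_count:
            hits[level] += 1
    return {level: 100.0 * hits[level] / totals[level] for level in totals}

# Toy corpus counts and made-up level labels, for illustration only.
corpus_counts = Counter({"高校": 120, "恋人": 45, "組合": 1})
jlpt_levels = {"高校": "N4", "恋人": "N3", "組合": "N1", "溶岩": "N1"}

at_least_once = coverage_by_level(corpus_counts, jlpt_levels)
five_plus = coverage_by_level(corpus_counts, jlpt_levels, min_count=5)
print(at_least_once)  # N1 is 50.0: 組合 appears once, 溶岩 never
print(five_plus)      # N1 drops to 0.0 once a word must appear 5+ times
```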

So let’s look at how word frequencies in BL/GL compare to the general corpus:

[Interactive chart: JLPT word frequency in the full corpus versus frequency in BL/GL]

Toggle JLPT levels and absent words on/off. Hover over any dot to see the word.

Each dot is a JLPT word, plotted by its frequency in the full corpus versus its frequency in BL/GL. Words on the diagonal appear at roughly the same rate in both corpora. Words above the line are overrepresented in BL/GL. Words below it are underrepresented.

Most words cluster near the diagonal, which means BL/GL frequency tracks general Japanese closely, but the outliers are interesting.
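One way to put a number on "above" or "below" the diagonal is the ratio of a word's relative frequency in BL/GL to its relative frequency in the full corpus. This is a sketch under that assumption; the helper names and the counts are invented for illustration.

```python
def relative_frequency(counts, word):
    """Share of all counted tokens that are this word."""
    total = sum(counts.values())
    return counts.get(word, 0) / total

def representation_ratio(genre_counts, full_counts, word):
    """Above 1 means overrepresented in the genre, below 1 underrepresented.
    Returns None when the word never occurs in the full corpus."""
    full = relative_frequency(full_counts, word)
    if full == 0:
        return None
    return relative_frequency(genre_counts, word) / full

# Toy counts: "other" stands in for every remaining token.
full_counts = {"高校": 100, "組合": 100, "other": 9800}
genre_counts = {"高校": 20, "組合": 5, "other": 975}

school = representation_ratio(genre_counts, full_counts, "高校")
union = representation_ratio(genre_counts, full_counts, "組合")
print(round(school, 2), round(union, 2))  # 2.0 0.5
```

On a log-log plot a ratio of exactly 1 sits on the diagonal, and a word that is "2× underrepresented" has a ratio of 0.5.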

[Interactive chart: the most overrepresented and underrepresented JLPT words in BL/GL]

Switch between overrepresented and underrepresented to see both sides. Hover any bar for exact frequencies.

The overrepresented words aren’t what you’d expect. Rather than romance vocabulary, the dominant theme is school life. Across every JLPT level, school words dominate: 高校 (high school, 2.7×), 教室 (classroom), 屋上 (rooftop), 体育 (PE), 遅刻 (being late), 昼休み (lunch break), 転校 (school transfer, 2.8×), 登校 (going to school), 受験 (entrance exams). BL and GL are overwhelmingly set in schools, and it shows. Romance does appear - 恋人 (lover), 恋 (love), 失恋 (heartbreak), 告白 (confession) - but it’s outnumbered by the mundane infrastructure of Japanese student life.

The underrepresented words paint the inverse picture: technical, industrial, and political vocabulary that rarely comes up in a school romance. Think 組合 (union, 3.3× less frequent), 航空 (aviation), 議員 (legislator), 時速 (speed per hour), 工学 (engineering), 溶岩 (lava), 直径 (diameter). These are the words you’d find in news articles or science textbooks, not in a story about two boys sharing an umbrella on a rooftop.

We’ll explore later how much of a problem this is, but the good thing is that the underrepresented words aren’t absent, they’re just less frequent. And since BL/GL novels span such a huge variety of settings, even the “underrepresented” vocabulary gets reasonable exposure across 5,600 titles.

What’s the best time to start?

So we know it’s feasible. But should you dive into BL as a beginner? Let’s look at what the difficulty landscape actually looks like.

[Interactive fan chart: percentage of unknown words by ability level across BL/GL titles]

Hover to see exact percentiles at any ability level.

This fan chart shows the percentage of unknown words you’d encounter at each ability level, with the shaded bands representing the range across different titles. The median line tells you what a “typical” BL/GL title looks like at your level, while the outer bands show the best and worst cases.
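A rough sketch of how a single point on such a chart could be computed: for one title, average the probability of not knowing each word, then take percentiles across titles. It assumes the percentage is measured over running words (so repeated words count each time), which the article does not state; the titles and difficulty values are made up.

```python
import math
import statistics

def p_known(theta_u, theta_w):
    """Logistic model from earlier in the article."""
    return 1.0 / (1.0 + math.exp(-(theta_u - theta_w)))

def expected_unknown_pct(theta_u, token_difficulties):
    """Expected percent of running words the reader does not know.
    token_difficulties holds one theta_w per token of the title."""
    unknown = sum(1.0 - p_known(theta_u, t) for t in token_difficulties)
    return 100.0 * unknown / len(token_difficulties)

# Hypothetical titles, each reduced to a short list of token difficulties.
titles = {
    "easy title": [-3.0, -2.5, -2.0, -1.0],
    "mid title": [-2.0, -1.0, 0.0, 1.0],
    "hard title": [-1.0, 0.0, 1.0, 2.0],
}

theta_u = 0.0  # the reader's ability
pcts = sorted(expected_unknown_pct(theta_u, d) for d in titles.values())
print(f"best {pcts[0]:.1f}%  median {statistics.median(pcts):.1f}%  worst {pcts[-1]:.1f}%")
```

Repeating this for a range of `theta_u` values gives the median line and the outer bands.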

At N4, the median title has you looking up 35% of words, which is a lot. It’s doable if you’re the type who finds suffering educational, and that’s simply the reality of trying to read native content so early in your learning journey.

For most learners, early N3 is the sweet spot. That’s where the median crosses into comfortable reading territory, and even the harder titles in the corpus become at least survivable.

[Interactive chart: difficulty of BL and GL titles by category at N3]

Hover over any dot to see title details.

Breaking it down by category, BL and GL titles have similar difficulty distributions at N3. So you’re not gaining or losing anything difficulty-wise by picking one over the other.

[Interactive chart: word count versus difficulty, by genre]

Toggle BL, GL, and Both on/off to compare genres.

I also checked whether longer works tend to be harder, but word count and difficulty are basically uncorrelated. So don’t shy away from longer titles thinking they’ll be more difficult; a 200,000-word web novel can be easier than a 5,000-word short story.

What’s the best way to actually do this?

OK, so you’re coming up on N3 and you’ve decided to gay the rest of your way to N1. With 5,601 titles and 18 million words of content, reading the entire library would take about 2 months of non-stop reading for a native speaker. You probably don’t want to do that.

So what should you read, and in what order?

First, let’s define what it means to “learn” a word through reading. In my experience, I don’t usually remember a word I looked up while reading unless I encounter it at least 5 times in context or I add it to my Anki deck. So when I measure “coverage” below, I mean the percentage of JLPT vocabulary you’ve seen 5 or more times.

My target is 80% coverage. The JLPT N1 pass mark is 100 out of 180 points (about 56%), so you don’t need to know every word on the test. 80% of vocabulary seen 5+ times should be enough to pass, especially since once you get to higher levels you can start guessing meanings from kanji and context much more easily.

Strategy 1: Read in order of difficulty

The obvious approach. Start with the easiest titles and work your way up.

[Interactive chart: cumulative JLPT vocabulary coverage when reading in difficulty order]

Switch between "by titles" and "by words" to see both axes. Hover for coverage details at any point.

This chart shows cumulative JLPT vocabulary coverage as you read more titles in difficulty order. The different bands represent words seen 1+ times, 2+ times, and so on up to 5+ times.

It works, but it’s slow. You need about 6.5 million words of reading to hit 80% coverage at the 5+ threshold, which would take a native speaker 560 hours.
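The coverage curve for any fixed reading order can be sketched as a running tally of encounters. This assumes each title has already been reduced to a total word count plus per-word occurrence counts; `coverage_curve` and the toy data are hypothetical.

```python
from collections import Counter

def coverage_curve(titles, jlpt_words, threshold=5):
    """titles: list of (word_count, Counter of word occurrences) in reading order.
    Returns [(cumulative words read, percent of jlpt_words seen >= threshold times)]."""
    seen = Counter()
    words_read = 0
    curve = []
    for word_count, counts in titles:
        words_read += word_count
        for word, n in counts.items():
            if word in jlpt_words:
                seen[word] += n
        covered = sum(1 for w in jlpt_words if seen[w] >= threshold)
        curve.append((words_read, 100.0 * covered / len(jlpt_words)))
    return curve

# Toy data, assumed to be sorted from easiest to hardest already.
jlpt_words = {"高校", "恋", "直径", "議員"}
titles = [
    (1000, Counter({"高校": 3, "恋": 5})),
    (2000, Counter({"高校": 2, "直径": 1})),
]

seen_once = coverage_curve(titles, jlpt_words, threshold=1)
seen_five = coverage_curve(titles, jlpt_words, threshold=5)
print(seen_once)  # [(1000, 50.0), (3000, 75.0)]
print(seen_five)  # [(1000, 25.0), (3000, 50.0)]
```

Running it once per threshold from 1 to 5 produces the stacked bands described above.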

Strategy 2: Greedy set cover

What if instead of reading in order of difficulty, we pick each next title to maximize $\frac{\text{new vocabulary exposure}}{\text{words in title}}$? This is basically the weighted set cover problem, and greedy algorithms are known to approximate it well.

I tested two variants:

Greedy unseen: pick the title that introduces the most new words per word of reading
Greedy reinforce: pick the title that pushes the most words past the 5-encounter threshold while still introducing new ones

[Interactive chart: JLPT vocabulary coverage versus words read, by reading strategy]

Toggle strategies on/off to compare them. Hover to see a side-by-side breakdown of words read and coverage for each strategy.

The greedy reinforce strategy crushes the naive difficulty ordering. It reaches 80% coverage in roughly 2 million words, about 3.25× fewer than the 6.5 million needed in difficulty order. That’s the reading list at the top of this article: 1,000 titles, ~2 million words, roughly 172 hours of reading for a native speaker.

For a learner, that’s probably closer to 1,000 hours. Spread over 2 years, that’s just under an hour and a half a day to go from N3 to knowing 80% of N1 vocabulary by reading yaoi and yuri.
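The time estimates in this article all follow from one implied reading speed. A quick check of the arithmetic, using only figures quoted in the text (the 1,000-hour learner estimate is the article's own guess, not a measurement):

```python
# Figures quoted in the article.
LIST_WORDS = 2_000_000
LIST_NATIVE_HOURS = 172

words_per_hour = LIST_WORDS / LIST_NATIVE_HOURS  # implied native reading speed

def native_hours(words):
    """Hours a native speaker would need at the implied speed."""
    return words / words_per_hour

difficulty_order_hours = native_hours(6_500_000)     # Strategy 1
full_library_days = native_hours(18_000_000) / 24    # reading non-stop
speedup = 6_500_000 / LIST_WORDS                     # greedy vs. difficulty order
learner_hours_per_day = 1_000 / (2 * 365)            # 1,000 hours over 2 years

print(f"{words_per_hour / 60:.0f} words per minute")
print(f"{difficulty_order_hours:.0f} hours in difficulty order")
print(f"{full_library_days:.1f} days for the whole library")
print(f"{speedup:.2f}x fewer words with the greedy list")
print(f"{learner_hours_per_day * 60:.0f} minutes a day for a learner")
```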

Limitations

This analysis has some obvious blind spots:

Vocabulary isn’t fluency. Knowing every word in a sentence doesn’t mean you understand the grammar, the idioms, or the cultural context, and it doesn’t help you at all in the listening section of the test.
Looking up a word isn’t necessarily “learning” it. The 5-encounter threshold is a rough heuristic from my own experience (translation: I made it up). Whether you actually acquire a word depends on attention, context, and a dozen other factors.
Grammar isn’t accounted for. BL/GL might overindex on certain patterns (casual speech, internal monologue) and underindex on others (敬語, academic writing, formal correspondence).
Genre familiarity is a hidden advantage. After a few BL titles, you develop schema knowledge like common plot structures and character archetypes that makes subsequent titles easier in ways my model doesn’t capture.

Future work

Extend to other genres. The same framework can analyze isekai, mystery, sci-fi, or any genre with enough titles. I’d love to compare optimal reading paths across genres.
Validate with real learners. Everything here is theoretical, but I’m working on tracking actual vocabulary acquisition for learners that use Yomi Sensei to read every day.
Account for grammar. The word segmentation algorithm already handles grammar, but that information is being thrown away. Cross-referencing with a grammar point database could identify which grammatical structures are over/underrepresented in BL/GL.

Why this matters

If you’ve made it this far, you might be wondering: “Did you really just write 2,000 words proving that you can learn Japanese by reading yaoi?”

Yes. And I don’t regret it one bit.

People learn fastest when they’re reading things they enjoy. This is one of the most robust findings in second language acquisition research. Extensive reading works best when the material is intrinsically motivating. And for a lot of learners, yaoi and yuri are exactly that material.

The problem is that most language learning resources treat genre as an afterthought. “Just read native content!” they say, without acknowledging that choosing the right content can make or break your motivation. If someone likes reading yaoi, telling them to grind through newspaper articles is a waste of everyone’s time.

This analysis shows that genre-specific learning paths aren’t just viable, they can actually be efficient. The vocabulary coverage, difficulty range, and volume of content are all there. All you need is a good reading order, which is exactly what Yomi Sensei is built to provide.