Slugifying Japanese URLs Is Harder Than You Think — I Built a Focused CLI
Most slug libraries silently drop Japanese. A few hallucinate pinyin. I built a small Python CLI that uses pykakasi to produce actual Hepburn romaji, and wrote down what I learned about why this is a surprising amount of work.
If you have ever tried to generate a URL slug from the title 東京タワー using a mainstream slug library, you probably got one of three results:
- an empty string,
- the string dong-jing-ta-wa (that's Mandarin pinyin, not Japanese at all),
- or an exception somewhere in an NFC normalization step.
None of those are what a Japanese blog, product catalog, or meeting-notes archive wants on its URLs. What you actually want is tokyo-tawa (or tokyo-tower if you think in English first), which is Hepburn romaji, the Romanization convention every Japanese road sign, passport, and train station has used for a century.
This article is about why getting from 東京タワー to tokyo-tawa is harder than it looks, and what the minimum viable Python CLI for doing it looks like. Source code: github.com/sen-ltd/slug-jp.
The pinyin mistake
The first time I needed URL slugs for a Japanese site, I reached for the library I always use: python-slugify. The call looks like this:
>>> from slugify import slugify
>>> slugify("東京タワー")
''
Empty. That's because python-slugify is a thin wrapper around Unidecode-style transliteration tables, and what you get for CJK depends on which backend and version is installed: some simply have no readings for these characters, while full Unidecode does — but its Chinese data wins the tie-break for characters Japanese shares with Chinese. Unidecode on its own gives:
>>> from unidecode import unidecode
>>> unidecode("東京タワー")
'Dong Jing tawa'
That first half is Mandarin pinyin: 东京 is Dōngjīng in Chinese, not Tōkyō in Japanese. So the library has neither dropped the input nor produced Japanese — it's produced a silent cross-language error. The second half (tawa) is the katakana タワー, which Unidecode does transliterate syllable by syllable, so it happens to come out roughly right. If you accept this output you end up with URLs like /blog/dong-jing-news on a Japanese blog, and both search engines and humans will find them confusing.
Django's built-in django.utils.text.slugify is more conservative — it uses unicodedata.normalize('NFKD', ...).encode('ascii', 'ignore') plus a regex, which means every Japanese character simply disappears:
>>> from django.utils.text import slugify
>>> slugify("東京タワー")
''
awesome-slugify does better if you explicitly pass a pretranslation map, but "writing your own kanji table" is not a small ask, and if you forget a character it goes back to empty. None of these tools were built with Japanese as a first-class input.
The reason they all fail is that Japanese is not a transliteration problem. It's a morphology problem. A Japanese string is a mixture of kanji (which carry meaning, not sound), hiragana/katakana (which carry sound), and ASCII (which carries itself). To go from a kanji sequence to a reading, you need a dictionary that knows how each kanji is pronounced in context. That's what unidecode refuses to do — it just maps characters — and that's what a real Japanese slugger has to do.
Why pykakasi
The usual advice for "I need to romanize Japanese in Python" is: install MeCab. MeCab is a beautiful tool, but the price of admission is a system-level C library plus a 40-to-300 MB dictionary (IPAdic, UniDic, NEologd — pick your poison). For a CLI that is supposed to do one small job, that's a lot. You also have to explain dictionary installation to every person who wants to use the tool.
pykakasi is the pragmatic middle ground. It is:
- pure Python (no C dependency),
- pip-installable with pip install pykakasi,
- ships a ~7 MB bundled dictionary,
- and produces Hepburn romaji directly, without having to chain kanji→hiragana→romaji steps yourself.
You call it like this:
import pykakasi
kks = pykakasi.kakasi()
kks.convert("2026年の予算会議")
# [
# {'orig': '2026', 'hepburn': '2026', ...},
# {'orig': '年', 'hepburn': 'nen', ...},
# {'orig': 'の', 'hepburn': 'no', ...},
# {'orig': '予算', 'hepburn': 'yosan', ...},
# {'orig': '会議', 'hepburn': 'kaigi', ...},
# ]
That's already close to what a slug wants. Five tokens, five short ASCII strings. Join with -, lowercase, and you have 2026-nen-no-yosan-kaigi. Ship it.
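That join step can be sketched without pykakasi installed, hard-coding the hepburn fields from the output above:

```python
# The hepburn fields from the pykakasi output above, hard-coded so this
# sketch runs without pykakasi installed
tokens = ["2026", "nen", "no", "yosan", "kaigi"]

slug = "-".join(tokens).lower()
# slug == "2026-nen-no-yosan-kaigi"
```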
Except there are about six edges still to file off, and every one of them taught me something.
Hepburn vs Kunrei-shiki, and why it matters for URLs
There are two main Romanization systems in common use:
- Hepburn (ヘボン式) — what road signs, passports, JR train stations, and Wikipedia use. しんじゅく → shinjuku, つ → tsu, ち → chi.
- Kunrei-shiki (訓令式) — taught in Japanese schools, preferred by the ISO 3602 standard. しんじゅく → sinzyuku, つ → tu, ち → ti.
If you're building URLs for a Japanese audience, Hepburn wins every time. It's what people expect to see, it's what maps to how the word actually sounds in English, and it's what every other Japanese service on the web uses. pykakasi gives you Hepburn via the hepburn field on each segment. Using the kunrei field instead would technically be valid output, but users would find it jarring.
The distinction is not hypothetical. Here's what the two look like on real place names:
| Kanji | Hepburn | Kunrei-shiki |
|---|---|---|
| 新宿 | shinjuku | sinzyuku |
| 築地 | tsukiji | tukizi |
| 渋谷 | shibuya | sibuya |
| 東京 | tōkyō / tokyo | tôkyô / tokyo |
The last row previews the next problem: what do you do with long vowels?
The long-vowel problem, and "URL Hepburn"
Japanese distinguishes long vowels from short ones. 東京 is Tōkyō (two long o's), not Tokyo. That distinction matters in the language, and Hepburn purists write it with a macron: Tōkyō. Macrons are not URL-safe in practice (they encode as %C5%8D) and they're not what anybody types into an address bar. So the convention for URL-friendly Hepburn, sometimes called "passport Hepburn", is to drop the macron and write the short vowel: Tokyo. (The other common macron-free convention, wāpuro rōmaji, spells the kana out — toukyou — and that, as we'll see, is what pykakasi emits.)
pykakasi handles long vowels by doubling the Latin letter. 東京 comes back from pykakasi as toukyou, 学校 comes back as gakkou, タワー comes back as tawaa. That's an internally consistent representation (and the right answer for many use cases — you can mechanically reconstruct the kana from it), but it is not what you want on a URL.
So slug-jp does one small post-processing step: collapse doubled vowels.
def _collapse_long_vowels(s: str) -> str:
    out = []
    i = 0
    n = len(s)
    while i < n:
        ch = s[i]
        nxt = s[i + 1] if i + 1 < n else ""
        # ou → o (most common: 東京 toukyou → tokyo)
        if ch == "o" and nxt == "u":
            out.append("o")
            i += 2
            continue
        # Doubled identical vowels → single (aa, ii, uu, ee, oo)
        if ch in "aiueo" and nxt == ch:
            out.append(ch)
            i += 2
            continue
        out.append(ch)
        i += 1
    return "".join(out)
After that step, toukyou becomes tokyo, tawaa becomes tawa, gakkou becomes gakko. Three characters shorter, infinitely more readable.
Is this lossy? Yes. yō and yo become indistinguishable, which means distinct words can collapse to the same slug (e.g. 小野 Ono and 大野 Ōno both come out as ono). But for slug purposes, this is the standard compromise, and it's what Japanese URLs across the web overwhelmingly use.
One subtlety: ou needs its own rule, because it is not a doubled vowel (the letters differ) — it's the Japanese long o as pykakasi encodes it. A generic "collapse doubled letters" pass on its own would never touch ou, and toukyou would come out as toukyou instead of tokyo.
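For illustration, the same collapse can be written as two regex substitutions — an equivalent sketch, not slug-jp's actual implementation:

```python
import re

def collapse_long_vowels_re(s: str) -> str:
    s = re.sub(r"ou", "o", s)             # long o, as pykakasi encodes it
    s = re.sub(r"([aiueo])\1", r"\1", s)  # doubled identical vowels
    return s

assert collapse_long_vowels_re("toukyou") == "tokyo"
assert collapse_long_vowels_re("tawaa") == "tawa"
assert collapse_long_vowels_re("gakkou") == "gakko"
```

It is exactly as lossy as the loop version: a genuine o-then-u sequence (思う omou) collapses too.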
NFKC first, always
Japanese text from the web is messy. You'll get:
- half-width katakana (ｱｲｳｴｵ) from legacy systems
- full-width ASCII (ＡＢＣ) from anywhere a Japanese keyboard sent Roman letters
- full-width digits (２０２６)
- circled numbers (①②③)
- ligatures and compatibility characters
None of these are things you want to feed into pykakasi's dictionary. The good news is that one call fixes nearly all of them:
import unicodedata
normalized = unicodedata.normalize("NFKC", text)
NFKC is "compatibility decomposition followed by canonical composition". In practice: full-width Ａ turns into ASCII A, half-width ｱ turns into full-width ア, full-width ２ turns into 2, ligatures expand. The text that reaches pykakasi is uniform and pykakasi only has to know about "real" Japanese characters.
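A quick stdlib-only check of those conversions (the left-hand strings are the full-width, half-width, and circled forms):

```python
import unicodedata

# Messy forms on the left, what NFKC makes of them on the right
samples = [
    ("ＡＢＣ", "ABC"),    # full-width Latin → ASCII
    ("２０２６", "2026"),  # full-width digits → ASCII digits
    ("ｱｲｳ", "アイウ"),    # half-width katakana → full-width
    ("①②③", "123"),      # circled numbers → plain digits
]
for raw, expected in samples:
    assert unicodedata.normalize("NFKC", raw) == expected
```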
slug-jp runs NFKC unconditionally before converting, so:
$ slug-jp "ＡＢＣ"
abc
works without pykakasi needing a "full-width Latin" table.
The emoji-echo bug
This is the weirdest thing I found. Give pykakasi an input where a Japanese character is sandwiched between emoji or other non-CJK glyphs, and it will occasionally emit the same segment twice:
>>> kks.convert("🎉祝🎉")
[{'orig': '祝', 'hepburn': 'shuku', ...},
{'orig': '祝', 'hepburn': 'shuku', ...}]
Two copies of 祝, even though the input has only one. I believe this is an artifact of how pykakasi handles boundaries when the "surrounding character" isn't in its alphabet tables — it walks the string and effectively visits the kanji twice. Whatever the cause, if you don't guard against it you get shuku-shuku on your URL, which looks like "celebrate celebrate" in a weird stutter.
The fix is tiny: dedupe consecutive segments where orig is identical.
# raw = kks.convert(text) — the list of segment dicts from pykakasi
tokens = []
last_orig = None
for piece in raw:
    orig = piece.get("orig", "")
    hepburn = piece.get("hepburn", "")
    if not hepburn:
        continue  # nothing pronounceable (e.g. the emoji themselves)
    if orig and orig == last_orig:
        continue  # emoji-echo guard: same source segment emitted twice in a row
    last_orig = orig
    tokens.append(_collapse_long_vowels(hepburn))
That's it. A couple of lines of defensive code that took me half an hour to diagnose.
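Fed the duplicated segments from the 🎉祝🎉 example, the guard behaves like this (a standalone sketch with hard-coded, pykakasi-shaped dicts):

```python
def dedupe_consecutive(segments):
    """Keep each segment's hepburn reading, dropping a segment whose
    'orig' repeats the previous one (the emoji-echo guard)."""
    out, last_orig = [], None
    for seg in segments:
        orig, hepburn = seg.get("orig", ""), seg.get("hepburn", "")
        if not hepburn:
            continue
        if orig and orig == last_orig:
            continue
        last_orig = orig
        out.append(hepburn)
    return out

# The duplicated output pykakasi produced for "🎉祝🎉"
raw = [{"orig": "祝", "hepburn": "shuku"},
       {"orig": "祝", "hepburn": "shuku"}]
assert dedupe_consecutive(raw) == ["shuku"]
```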
Truncation at the word boundary
URLs don't need to be infinite. A --max-length flag is standard in every slug library, but they all handle overflow differently. The naive approach is slug[:max_length], which is wrong: it can leave a dangling - at the end or slice a token in half. 2026-nen-no-yosan-ka is not useful for anybody.
slug-jp walks backward from the cut point to the previous separator and cuts there. The code is small enough to quote whole:
def _truncate_at_word_boundary(slug, max_length, separator):
    if len(slug) <= max_length:
        return slug, False
    window = slug[:max_length]
    last_sep = window.rfind(separator)
    if last_sep > 0:
        return window[:last_sep], True
    return window, True
The fallback — if the very first token is already longer than max_length — is to hard-cut anyway, because a truncated slug is still better than an empty one. In practice this never happens with real Japanese titles, because no single morpheme is more than ~15 characters. But the test suite exercises it.
slugify_tokens returns a (slug, truncated) tuple so the JSON output can expose a truncated: true flag. When you're building CMS URLs, knowing whether a slug was auto-shortened is useful — you can surface a warning to the user: "your title was truncated, maybe you want to set the slug manually."
The full slug loop
Put together, the whole thing is five steps:
1. NFKC normalize the input.
2. Run pykakasi, get (orig, hepburn) pairs.
3. Dedupe consecutive duplicates, collapse long vowels.
4. Join, lowercase, regex-replace any run of non-[A-Za-z0-9] with the separator, strip edges.
5. Truncate at the word boundary if needed.
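The last two steps can be sketched in a few lines. finish_slug is a hypothetical helper for illustration, not slug-jp's actual API, and it expects tokens that are already deduped and vowel-collapsed:

```python
import re

def finish_slug(tokens, separator="-", max_length=80):
    joined = separator.join(tokens).lower()
    # Replace any run of non-[a-z0-9] with the separator, strip the edges
    cleaned = re.sub(r"[^a-z0-9]+", separator, joined).strip(separator)
    # Truncate at a word boundary if needed
    if len(cleaned) > max_length:
        window = cleaned[:max_length]
        cut = window.rfind(separator)
        cleaned = window[:cut] if cut > 0 else window
    return cleaned

assert finish_slug(["2026", "nen", "no", "yosan", "kaigi"]) == "2026-nen-no-yosan-kaigi"
assert finish_slug(["2026", "nen", "no", "yosan", "kaigi"], max_length=12) == "2026-nen-no"
```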
Let me walk through 2026年の予算会議 one more time:
| Step | Value |
|---|---|
| Input | 2026年の予算会議 |
| After NFKC | 2026年の予算会議 (unchanged) |
| After pykakasi | ['2026', 'nen', 'no', 'yosan', 'kaigi'] |
| After long-vowel fix | ['2026', 'nen', 'no', 'yosan', 'kaigi'] (no doubled vowels) |
| After join + lowercase | 2026-nen-no-yosan-kaigi |
| After alnum clean | 2026-nen-no-yosan-kaigi |
| After truncate (80 max) | 2026-nen-no-yosan-kaigi |
Five steps. Fewer than 150 lines of actual code, including argparse and the JSON formatter.
Tradeoffs pykakasi doesn't hide
Pykakasi's small-dictionary approach gets ~95% of real-world slugs right for mainstream content, but there are genres where it falls over and you should know about them:
- Rare kanji names. 澁谷 (an old form of 渋谷 Shibuya) is missing from the dictionary and comes out weird. Product names that use 旧字体 ("old character forms") will surprise you.
- Personal names. Japanese names have ambiguous readings — 三郎 can be saburō, mitsurō, or sanrō depending on the person. pykakasi picks one reading, which may not be the right one. For people-facing URLs you want a human in the loop.
- Proper nouns generally. Company names, place names not in the dictionary, neologisms — pykakasi will often fall back to character-by-character kanji readings, which may or may not be the "correct" one.
- Mixed first/last-name order. slug-jp doesn't try to insert a hyphen between what might be a given name and a surname. 田中太郎 comes out as tanakataro, not tanaka-taro. That's a deliberate non-decision.
- Modern vs historical kana. ゐ / ゑ are historical kana that pykakasi maps to i / e. That's the right call for URLs (those characters are unusable in practice) but scholars care.
None of these are showstoppers for blog posts, product catalog entries, meeting notes, or any of the ordinary URL-slug use cases I built this for. They are things you should know before deploying slug-jp to user-generated content where names matter.
If you do need better coverage, the next step up is running MeCab with UniDic and letting it return the pronunciation field. But at that point you're installing 300 MB of stuff to do a 150-line job, which was the thing I was trying to avoid.
Try it in 30 seconds
The Docker image is ~70 MB total — roughly an Alpine Python base plus pykakasi's bundled dictionary:
git clone https://github.com/sen-ltd/slug-jp
cd slug-jp
docker build -t slug-jp .
docker run --rm slug-jp "東京タワー"
# tokyo-tawa
docker run --rm slug-jp "Hello 世界!"
# hello-sekai
docker run --rm slug-jp "2026年の予算会議" --format json
# {"input":"2026年の予算会議","slug":"2026-nen-no-yosan-kaigi",
# "tokens":["2026","nen","no","yosan","kaigi"],"truncated":false}
echo "日本語入力" | docker run --rm -i slug-jp -
# nihongonyuryoku
Or without Docker:
pip install -e .
slug-jp "東京タワー"
The test suite (67 tests) runs in under a second and covers every edge case in this article. Pull requests welcome — especially if you know a cleaner way to handle the long-vowel rules or have a better fix for the emoji-echo issue.
What slugifying taught me
The surprise from this project wasn't that Japanese romanization is hard — I knew that going in. The surprise was how much of the generic slug-library ecosystem is quietly wrong for non-Latin languages. python-slugify treating 東京 as "the empty string" is not a bug the maintainers chose; it's what happens when the library's foundation (Unidecode) was designed for European accents and then bolted onto CJK as an afterthought. If you care about URLs in your users' actual languages, you have to pick the right foundation for the language — and for Japanese, that foundation is pykakasi, not ASCII normalization with a regex.
The other lesson is the usual one: small, focused tools beat giant general-purpose ones when the domain is narrow enough. slug-jp is ~150 lines of Python with one dependency, and it fixes a real problem with every Japanese blog URL I've seen. That's the kind of scope I want more of my tools to have.
Source: github.com/sen-ltd/slug-jp
License: MIT
Deps: pykakasi (the whole runtime)
