haiku-score

Validate Japanese haiku and tanka against a mora pattern. Counts moras (拍), not syllables or characters — the unit haiku actually uses. Kana-only by design. Zero runtime dependencies.

git clone https://github.com/sen-ltd/haiku-score
cd haiku-score
docker build -t haiku-score .
printf 'ふるいけや\nかわずとびこむ\nみずのおと\n' | docker run --rm -i haiku-score -

Why

The standard English explanation of haiku is "5-7-5 syllables", and it is wrong in a way that makes every beginner write bad haiku. Japanese verse doesn't count syllables; it counts moras, a unit shorter than an English syllable. The difference is load-bearing:

  • きゃ is 1 mora, not 2. The small ゃ fuses into the previous kana (this is called youon, 拗音).
  • とうきょう is 4 moras — と・う・きょ・う — even though English speakers often call it 2 syllables.
  • ほん is 2 moras: the ん (hatsuon, 撥音) is its own beat.
  • にっぽん is 4 moras: the small っ (sokuon, 促音) is a beat of silence.
  • コーヒー is 4 moras: the 長音符 ー extends the previous vowel by one full beat.

haiku-score implements these rules and validates a pattern (default 5-7-5, or 5-7-5-7-7 with --tanka, or anything you like with --pattern). The core algorithm is about 50 lines of Python stdlib; the whole project is a crash course in basic Japanese phonology wearing a CLI costume.
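The counting rules above can be sketched in a few lines of Python. This is an illustration of the rules, not the project's actual implementation (that lives in src/haiku_score/mora.py), and it assumes the input is already clean kana:

```python
# Small kana (youon and small vowels) fuse into the previous kana: +0 moras.
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")
# Japanese punctuation and whitespace contribute nothing.
PUNCTUATION = set("、。・！？ 　")

def count_moras(line: str) -> int:
    moras = 0
    for ch in line:
        if ch in PUNCTUATION or ch in SMALL_KANA:
            continue  # きゃ: the き already counted; ゃ adds nothing
        moras += 1    # full kana, っ/ッ, ん/ン, and ー each count as one beat
    return moras

assert count_moras("きゃ") == 1        # youon fuses
assert count_moras("とうきょう") == 4  # と・う・きょ・う
assert count_moras("にっぽん") == 4    # sokuon is a beat
assert count_moras("コーヒー") == 4    # ー extends the vowel by a beat
```

Note that every rule except youon fusion turns out to be "count one beat per character": the sokuon, hatsuon, and long-vowel mark all behave like ordinary kana for counting purposes.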

Quickstart

# A valid haiku (Basho's 古池, in kana form).
printf 'ふるいけや\nかわずとびこむ\nみずのおと\n' | docker run --rm -i haiku-score -

# Tanka (5-7-5-7-7).
docker run --rm -i haiku-score - --tanka < poem.txt

# A custom pattern.
docker run --rm -i haiku-score - --pattern 3,5,3 < micro.txt

# Machine-readable output.
docker run --rm -i haiku-score - --format json < poem.txt

# A one-liner with Japanese punctuation as line breaks.
docker run --rm haiku-score 'ふるいけや、かわずとびこむ。みずのおと' --auto-break

The kana-only scope choice

haiku-score refuses kanji input:

$ printf '古池や\n' | haiku-score -
haiku-score: kanji '古' at position 0: haiku-score counts moras on kana input.
Convert kanji to kana first (e.g. with pykakasi, or your own lookup) then re-run.

This is deliberate. Kanji-to-kana is a hard problem: 日 is ひ or にち or じつ depending on context, and doing it right requires either a full morphological analyser (MeCab, Sudachi) or a large pronunciation dictionary. That's a whole different project, and once you're pulling in fugashi and a dictionary you've blown the "zero dependencies, ~50 lines of core logic" thesis. So the tool does one thing — count moras on kana input — and tells you how to convert kanji when it sees them.

If you need automatic kanji handling, pipe your text through pykakasi first:

echo "古池や" | python -c "
import sys, pykakasi
kks = pykakasi.kakasi()
for line in sys.stdin:
    print(''.join(item['hira'] for item in kks.convert(line.strip())))
" | haiku-score -

Exit codes

Code  Meaning
0     Input matches the pattern.
1     Input parsed, but the mora counts don't match.
2     Bad input: kanji, empty text, wrong number of lines, bad flags.
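The policy behind the table can be sketched as a small helper. The names below are hypothetical illustrations of the mapping, not the project's actual code:

```python
EXIT_MATCH, EXIT_MISMATCH, EXIT_BAD_INPUT = 0, 1, 2

def exit_code(counts: "list[int] | None", pattern: "list[int]") -> int:
    """counts: per-line mora counts, or None if parsing failed (kanji, empty text)."""
    if counts is None or len(counts) != len(pattern):
        return EXIT_BAD_INPUT   # 2: kanji, empty text, wrong number of lines
    if counts == pattern:
        return EXIT_MATCH       # 0: matches the pattern
    return EXIT_MISMATCH        # 1: parsed, but counts don't match

assert exit_code([5, 7, 5], [5, 7, 5]) == 0
assert exit_code([5, 8, 5], [5, 7, 5]) == 1
assert exit_code(None, [5, 7, 5]) == 2
```

The 1/2 split matters for scripting: exit 1 means "well-formed kana that doesn't scan", so a retry with edits makes sense, while exit 2 means the input itself needs fixing first.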

Running without Docker

pip install .
haiku-score poem.txt
haiku-score - < poem.txt
haiku-score poem.txt --tanka --format json

Requires Python 3.10+. No runtime dependencies — just argparse, json, and unicodedata from the standard library.

Tests

docker run --rm --entrypoint pytest haiku-score -q

Extensive coverage of the mora rules specifically, including:

  • The five vowels (あいうえお = 5).
  • Youon fusion (きゃ = 1, きゅうり = 3, しゃしん = 3).
  • Sokuon (にっぽん = 4).
  • Hatsuon (ほん = 2, ラーメン = 4).
  • Long-vowel mark (コーヒー = 4).
  • Punctuation is ignored (あ、い。う = 3).
  • Katakana parity (カタカナ = 4, ジャム = 2).
  • Basho's 古池 haiku in kana form: 5-7-5 ✓.
  • Kanji input raises MoraError with the position of the offending character.
  • CLI exit codes: 0 match, 1 mismatch, 2 bad input.
  • JSON output shape.
  • --tanka and custom --pattern.
  • --auto-break on 。 and 、.

How it works

The core loop is in src/haiku_score/mora.py. Every character is classified into one of: full-size kana (+1 mora), small youon (+0, fused with previous), long-vowel mark (+1), sokuon (+1), hatsuon (+1), punctuation (+0), or "reject with position". Classification is done purely by Unicode codepoint ranges — the hiragana block U+3040..U+309F, the katakana block U+30A0..U+30FF, the long-vowel mark U+30FC, and small-kana characters by their specific codepoints.

if classifier.is_small_yoon(ch):
    # きゃ is 1 mora, not 2. The previous full kana already contributed 1,
    # and this small ゃ fuses into it for 0 additional moras.
    previous_was_full_kana = False
    continue
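The full per-character classification can be sketched as a chain of checks in the order the categories were listed. This is an illustrative reconstruction from the description above, not the real classifier in src/haiku_score/mora.py:

```python
def classify(ch: str) -> str:
    if ch in "、。・！？ 　":
        return "punct"      # +0 moras
    if ch in "ゃゅょぁぃぅぇぉャュョァィゥェォ":
        return "small"      # +0, fuses into the previous full kana
    if ch == "\u30FC":      # ー long-vowel mark
        return "choon"      # +1
    if ch in "っッ":
        return "sokuon"     # +1 beat of silence
    if ch in "んン":
        return "hatsuon"    # +1
    if 0x3040 <= ord(ch) <= 0x30FF:   # hiragana + katakana blocks
        return "kana"       # +1, ordinary full-size kana
    return "reject"         # e.g. kanji: report character and position

assert classify("ゃ") == "small"
assert classify("ー") == "choon"
assert classify("古") == "reject"
```

The specific small-kana checks have to run before the generic block-range check, since small kana fall inside the same U+3040..U+30FF range as full-size kana.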

The scorer is a flat pass over the per-line counts, comparing each to the pattern, and the formatters render the result for humans (with colour and East-Asian width-aware alignment) or as JSON.

License

MIT. See LICENSE.
