Add semantic YMYL classification via LLM extraction #16

Merged
remete618 merged 2 commits into main from feat/semantic-ymyl on Apr 15, 2026

@remete618
Owner

Summary

Two-stage YMYL (Your Money or Your Life) pipeline that catches implied health, legal, and financial content and rejects false positives from metaphorical keyword usage.

Stage 1 (fast): Regex strong patterns for definitive matches (blood type, 401k, DNR order). No LLM call.

Stage 2 (smart): LLM classifies YMYL during fact extraction. Zero additional API calls; the classification piggybacks on the extraction call that already happens for every add().
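
The two stages can be sketched roughly as follows. This is a minimal illustration, not the code in extraction/llm_extractor.py: the names `STRONG_PATTERNS`, `strong_match`, and `classify_ymyl` are hypothetical, the regexes are guesses based on the three examples above, and filing "DNR order" under health is an assumption. The LLM label is passed in as an argument because it is assumed to arrive with the extraction result rather than from a separate call.

```python
import re
from typing import Optional

# Hypothetical strong patterns. The PR names blood type, 401k, and DNR order
# as examples but does not show the actual regexes or their categories.
STRONG_PATTERNS = {
    "health": re.compile(r"\b(blood type|dnr order)\b", re.IGNORECASE),
    "financial": re.compile(r"\b401\(?k\)?", re.IGNORECASE),
}


def strong_match(text: str) -> Optional[str]:
    """Stage 1: return a category on a definitive regex hit, else None."""
    for category, pattern in STRONG_PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def classify_ymyl(text: str, llm_category: Optional[str]) -> Optional[str]:
    """Regex override first; otherwise use the label the LLM returned
    alongside the extracted fact (Stage 2)."""
    return strong_match(text) or llm_category
```

With this shape, a strong regex hit wins even if the LLM returned a different label, and text with no strong match simply inherits whatever the extraction call produced, including null.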

What it catches that regex misses

  • "my chest has been hurting for three days" -> health
  • "I owe $40,000 and can't make payments" -> financial
  • "I stopped taking my pills" -> medical
  • "my ex is threatening to take the kids" -> legal

What it skips that regex falsely flags

  • "walked by the bank of the river" -> null (not financial)
  • "The Doctor is a great TV show" -> null (not medical)
  • "court of public opinion" -> null (not legal)
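
To see why keyword matching alone fails on both lists, consider a deliberately naive matcher. The `KEYWORDS` pattern below is a hypothetical stand-in, not the project's regex; it only illustrates the failure mode using the example sentences from this description.

```python
import re

# Hypothetical naive keyword matcher, for illustration only.
KEYWORDS = re.compile(r"\b(bank|doctor|court)\b", re.IGNORECASE)

implied = [
    "my chest has been hurting for three days",
    "I owe $40,000 and can't make payments",
    "I stopped taking my pills",
    "my ex is threatening to take the kids",
]
metaphorical = [
    "walked by the bank of the river",
    "The Doctor is a great TV show",
    "court of public opinion",
]

# Implied YMYL sentences contain no trigger keyword, so they are missed.
missed = [s for s in implied if not KEYWORDS.search(s)]

# Metaphorical sentences contain a keyword, so they are falsely flagged.
false_positives = [s for s in metaphorical if KEYWORDS.search(s)]
```

In this sketch every implied sentence ends up in `missed` and every metaphorical sentence ends up in `false_positives`, which is the gap the LLM stage is meant to close.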

Files changed

  • core/types.py - Add ymyl_category to Fact, ActionItem, Memory
  • extraction/prompts.py - Updated extraction prompt with YMYL examples
  • extraction/llm_extractor.py - Two-stage: regex override + LLM classification
  • conflict/batch_resolver.py - Thread ymyl_category through actions
  • core/pipeline.py - Store ymyl_category in metadata, use in YMYL-triggered active retrieval
  • core/memory.py - Reconstruct ymyl_category from metadata at search time
  • retrieval/temporal.py - Use stored ymyl_category for decay immunity
  • tests/test_ymyl_topics.py - 8 new tests
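
The metadata round trip described for core/pipeline.py and core/memory.py might look something like the sketch below. The `Fact` shape and the helpers `to_metadata` and `from_metadata` are hypothetical; the real models in core/types.py are not shown in this PR description. The one behavior taken from the description is that the key is omitted when the category is null.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Fact:
    # Hypothetical minimal model; the real Fact likely has more fields.
    text: str
    ymyl_category: Optional[str] = None


def to_metadata(fact: Fact) -> Dict[str, Any]:
    """Store ymyl_category only when it is set, so null never hits storage."""
    metadata: Dict[str, Any] = {"text": fact.text}
    if fact.ymyl_category is not None:
        metadata["ymyl_category"] = fact.ymyl_category
    return metadata


def from_metadata(metadata: Dict[str, Any]) -> Fact:
    """Rebuild the model at search time; a missing key reads back as None."""
    return Fact(text=metadata["text"], ymyl_category=metadata.get("ymyl_category"))
```

Reading with `dict.get` means records written before the field existed deserialize without a migration, which fits the backward-compatibility note in the test plan.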

Test plan

  • All 163 tests pass
  • New: Fact/ActionItem/Memory carry ymyl_category
  • New: ymyl_category stored in metadata (and omitted when null)
  • New: LLM-classified YMYL memories get decay immunity
  • New: Demonstrates that regex misses implied YMYL content that the LLM would catch
  • Backward compatible: old memories without ymyl_category fall back to regex
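
Decay immunity with the regex fallback could be sketched as below. Everything here is an assumption made for illustration: the exponential half-life decay, the `LEGACY_YMYL` pattern, and the function names are hypothetical and are not taken from retrieval/temporal.py.

```python
import re
from typing import Optional

# Hypothetical fallback pattern for memories that carry no stored category.
LEGACY_YMYL = re.compile(r"\b(blood type|401k|dnr order)\b", re.IGNORECASE)


def is_ymyl(text: str, stored_category: Optional[str]) -> bool:
    """Prefer the stored category; fall back to regex when it is absent."""
    if stored_category is not None:
        return True
    return LEGACY_YMYL.search(text) is not None


def decayed_score(
    score: float,
    age_days: float,
    text: str,
    stored_category: Optional[str] = None,
    half_life_days: float = 30.0,
) -> float:
    """Apply an assumed exponential decay unless the memory is YMYL."""
    if is_ymyl(text, stored_category):
        return score  # decay immunity
    return score * 0.5 ** (age_days / half_life_days)
```

A memory labeled by the LLM keeps its full score regardless of age, while an unlabeled memory is only immune if the regex still recognizes it.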

Two-stage YMYL pipeline:
- Stage 1 (fast): Regex strong patterns for definitive matches
  (blood type, 401k, DNR order). No LLM call needed.
- Stage 2 (smart): LLM classifies YMYL during fact extraction.
  Zero additional API calls (piggybacks on existing extraction).
  Catches implied YMYL ("my chest hurts") and rejects metaphorical
  usage ("bank of the river").

Changes:
- Add ymyl_category field to Fact, ActionItem, and Memory models
- Update extraction prompt to ask LLM for ymyl_category per fact
- Thread ymyl_category through resolver -> pipeline -> metadata
- Scoring uses stored ymyl_category for decay immunity (falls back
  to regex for memories created before this change)
- 8 new tests for semantic YMYL classification
remete618 merged commit 47c3dd1 into main on Apr 15, 2026
3 checks passed