SEN LLC

Posted on Apr 15

Generating a Valid RSS Feed From Scratch in Stdlib Python

#python #cli #rss #tutorial

Every static site generator ships a feed plugin. Until yours doesn't. Here's what I found out writing a zero-dependency RSS 2.0 + Atom 1.0 generator for a directory of markdown posts — the field formats people get wrong, why PyYAML was the wrong dependency for a blog, and the parts of the spec that don't matter.

📦 GitHub: https://github.com/sen-ltd/feedgen

The problem nobody has a good answer for

Most blog platforms hand you a feed for free. Hugo emits index.xml. Eleventy has a plugin. Jekyll ships one. Ghost, WordPress, Medium — done.

Until you're not on any of those. The cases I keep running into:

A small team maintaining a company blog with hand-rolled static tooling — scripts, a template, a build step, shipped to S3.
A docs site with an "articles" section that was never wired up to the main framework's feed plugin.
Someone's inherited WordPress site where the SEO plugin broke the feed two migrations ago and nobody's had 4 hours to fix it.
A JSON-API-backed CMS where the feed lives in a lambda someone wrote once and nobody touches.

In every case the answer "generate a feed" is blocked by a strange wall: it's 100 lines of stdlib Python, but nobody wants to write those 100 lines. Getting the feed format exactly right takes reading the spec and two other people's blog posts about the spec, and once you start doing that you find yourself reaching for feedparser and PyYAML and markdown2, and suddenly your tiny utility has a dependency tree.

I wanted to see if the direct approach was actually hard, or just unappealing. It turns out to be unappealing but not hard, so I wrote the tool I wish had existed: feedgen, which walks a directory of markdown files with YAML frontmatter and emits validator-clean RSS 2.0 or Atom 1.0 from stdlib only.

Here's what I learned about the field formats, the frontmatter parsing dodge, and the "minimal markdown renderer" tradeoff.

Design: the minimum needed to be correct

The tool needs three things:

Read the directory. Walk files, split frontmatter from body, coerce types.
Model the post. Enough fields to match what RSS and Atom ask for: title, date, slug, optional author / summary / tags / body.
Emit the feed. Validator-clean XML, right date formats, right required elements.

And I wanted zero dependencies — every external library I reached for felt too big for the problem.
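To make the "model the post" step concrete, here is a minimal sketch of what such a model could look like. The Post dataclass and the newest_first helper are illustrative assumptions built from the field list above, not feedgen's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Post:
    """Hypothetical post model: the fields RSS and Atom entries draw on."""

    # Required: every feed entry needs these.
    title: str
    date: datetime
    slug: str
    # Optional: emitted only when the frontmatter provides them.
    author: str | None = None
    summary: str | None = None
    tags: tuple[str, ...] = ()
    body: str = ""


def newest_first(posts: list[Post]) -> list[Post]:
    """Return posts sorted by date, newest first (the usual feed order)."""
    return sorted(posts, key=lambda p: p.date, reverse=True)


older = Post("Older", datetime(2026, 4, 1, tzinfo=timezone.utc), "older")
newer = Post("Newer", datetime(2026, 4, 10, tzinfo=timezone.utc), "newer", tags=("intro",))
print([p.slug for p in newest_first([older, newer])])
```

Keeping the model frozen and flat means the emitters only ever read attributes, which keeps the RSS and Atom code paths symmetric.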

Frontmatter: a 50-line YAML subset beats PyYAML

The obvious move is import yaml. But PyYAML is a sizeable install, its optional C extension is only available on some platforms, and it has a long history of security footguns around yaml.load vs yaml.safe_load. For a tool that reads blog frontmatter, the surface area that actually matters is tiny:

key: value
key: "quoted with: colons"
key: [item1, item2, item3]
key: true
key: 42

That's, what, five productions? Almost no frontmatter uses nested maps, multi-line scalars, anchors, or tags. If it does, the post probably wants a different generator anyway.

So I wrote a ~50-line parser that handles exactly the five shapes above. Each line becomes one key. Flow lists are split on commas, respecting quotes. Booleans and ints are coerced. Everything else is a string.

def parse(text: str) -> ParsedFrontmatter:
    fm_block, body = split_frontmatter(text)
    if fm_block is None:
        return ParsedFrontmatter(data={}, body=body)

    data: dict[str, Any] = {}
    for lineno, raw_line in enumerate(fm_block.splitlines(), start=1):
        line = raw_line.rstrip()
        if not line.strip() or line.lstrip().startswith("#"):
            continue

        if ":" not in line:
            raise FrontmatterError(
                f"line {lineno}: expected 'key: value', got {raw_line!r}"
            )

        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()

        if value.startswith("["):
            data[key] = _parse_list(value)
        else:
            data[key] = _parse_scalar(value)

    return ParsedFrontmatter(data=data, body=body)

Two things to notice. First, line.partition(":"), not split(":", 1). Same effect here, but partition is the canonical "split on first X" in Python and always returns three parts, so the unpacking can't fail. Second, the comment and blank-line handling is inside the loop, not a pre-pass, so lineno always counts from the top of the frontmatter block and there's no ambiguity about which line an error points to.

The flow-list parser is where it gets slightly fun, because you have to respect quotes. tags: ["hello, world", simple] should be two items, not three:

def _parse_list(raw: str) -> list[Any]:
    inner = raw.strip()[1:-1].strip()
    if not inner:
        return []

    items: list[str] = []
    buf: list[str] = []
    quote: str | None = None
    for ch in inner:
        if quote:
            if ch == quote:
                quote = None
            else:
                buf.append(ch)
            continue
        if ch in ("'", '"'):
            quote = ch
            continue
        if ch == ",":
            items.append("".join(buf).strip())
            buf = []
            continue
        buf.append(ch)
    items.append("".join(buf).strip())
    return [_parse_scalar(item) for item in items if item]

Not elegant, but predictable. And entirely self-contained — nothing to audit but these 20 lines.

The downside: posts that use YAML features the subset doesn't handle will either silently produce surprising output or fail outright with a parse error and a line number. I'm fine with that tradeoff for the target audience (people who write flat frontmatter). If someone brings a file that needs real YAML, the fix is one pip install away and the rest of the tool keeps working.

RSS 2.0 field formats people get wrong

This is the part I assumed was easy and was humbled by.

pubDate wants RFC 822. Not ISO 8601. Not RFC 3339. Not whatever str(datetime) produces. It wants:

Fri, 10 Apr 2026 12:30:45 +0000

With English day and month abbreviations, regardless of the locale of whoever's running the tool. The real spec says you can also use the obsolete two-digit year, which — please don't. If you reach for email.utils.formatdate(..., usegmt=True) you get a locale-independent version for free, but I wrote the format out by hand so the tests are obvious and the function doesn't smuggle in email module state:

from datetime import datetime, timezone


def rfc822(dt: datetime) -> str:
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)
    day = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[dt.weekday()]
    mon = (
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
    )[dt.month - 1]
    return (
        f"{day}, {dt.day:02d} {mon} {dt.year} "
        f"{dt.hour:02d}:{dt.minute:02d}:{dt.second:02d} +0000"
    )
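For comparison, here is a sketch of the stdlib route using email.utils.format_datetime, which takes a datetime directly. The function name rfc822_stdlib is made up for this example. It applies the same "naive means UTC" rule as the hand-written version and should produce the same string.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def rfc822_stdlib(dt: datetime) -> str:
    """Format dt as an RFC 822 date in UTC using email.utils."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # treat naive datetimes as UTC
    # For an aware datetime, format_datetime emits a numeric zone (+0000 here).
    return format_datetime(dt.astimezone(timezone.utc))


stamp = rfc822_stdlib(datetime(2026, 4, 10, 12, 30, 45))
print(stamp)  # Fri, 10 Apr 2026 12:30:45 +0000

# parsedate_to_datetime gives a cheap round-trip check for tests.
assert parsedate_to_datetime(stamp) == datetime(2026, 4, 10, 12, 30, 45, tzinfo=timezone.utc)
```

Either version works; the hand-written one makes the format visible in the source, while this one delegates the day and month tables to the standard library.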

guid needs isPermaLink="true" when the GUID is a URL. It defaults to "true" per spec, but readers get confused when you leave the attribute off and also give them a URL as the GUID — half of them treat it as an opaque string and never let the user open the link. The correct move is to set the attribute explicitly and use the permalink as the GUID:

guid = ET.SubElement(item, "guid", {"isPermaLink": "true"})
guid.text = link

This also means your GUID is stable as long as the permalink is stable, which is what feed readers need to avoid duplicate entries.

Use ElementTree, not f-strings. The single most common RSS mistake I've seen in the wild is busted XML because someone interpolated a post title into a string template and the title contained an ampersand. Assigning the title to the .text of an element created with xml.etree.ElementTree.SubElement handles all escaping for you, plus it emits well-formed XML, plus you get ET.fromstring(output) round-tripping for free in your tests. (Passing text=title as a keyword to SubElement would create an XML attribute named text, not element content.) The whole RSS generator is ~80 lines of ElementTree calls:

item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = post.title
ET.SubElement(item, "link").text = link
guid = ET.SubElement(item, "guid", {"isPermaLink": "true"})
guid.text = link
ET.SubElement(item, "pubDate").text = rfc822(post.date)

for tag in post.tags:
    ET.SubElement(item, "category").text = tag

That's it. No manual escaping, no f-string interpolation.
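A small self-contained demonstration of the escaping claim, using a made-up title that contains both an ampersand and an angle bracket. The channel values are placeholders.

```python
import xml.etree.ElementTree as ET

rss = ET.Element("rss", {"version": "2.0"})
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Test Blog"
ET.SubElement(channel, "link").text = "https://example.com/"
ET.SubElement(channel, "description").text = "A test"

item = ET.SubElement(channel, "item")
# The raw title goes in unescaped; ElementTree escapes it on serialization.
ET.SubElement(item, "title").text = "Tom & Jerry <3 XML"

xml_text = ET.tostring(rss, encoding="unicode")
print(xml_text)

# Round-trip: parsing the output gives back the original title.
parsed_title = ET.fromstring(xml_text).findtext("channel/item/title")
print(parsed_title)
```

The same title dropped into an f-string template would produce a document that ET.fromstring refuses to parse.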

Atom 1.0 vs RSS 2.0: the differences that matter

The tool emits both formats so I had to understand where they disagree.

Date format. Atom wants RFC 3339 (2026-04-10T12:30:45Z), RSS wants RFC 822 (see above). They're close enough that you'll mix them up once.
Required feed id. Atom requires a <feed>-level <id> which is a globally unique identifier. Convention is to use the canonical feed URL or a tag: URI. I use the canonical URL.
Required updated. Both formats have "when was the feed last touched" but they disagree on name: lastBuildDate in RSS, updated in Atom. I derive both from the newest post's date, or datetime.now(tz=UTC) for empty feeds.
Category shape. RSS has <category>simple text</category>. Atom has <category term="news" />. The data is the same; the XML isn't.
Author structure. RSS expects an <author> that's formally supposed to be an email, though most feeds put a name there and readers tolerate it. Atom has a proper <author><name>...</name></author> sub-element, which is cleaner.
Content vs summary. RSS optionally uses content:encoded (with an extra namespace) for full post bodies. Atom has <content type="html"> built in.

None of this is hard once you've seen it, but all of it is the kind of thing you find out by loading your feed into a validator and it complains.
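The Atom-side differences listed above can be sketched in one small function. Everything here (atom_feed, the dict-based entries, the sample values) is illustrative and assumes timezone-aware dates; it is not the tool's emitter.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def rfc3339(dt: datetime) -> str:
    """Format dt as an RFC 3339 UTC timestamp, e.g. 2026-04-10T12:30:45Z."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def atom_feed(feed_url: str, title: str, entries: list[dict]) -> ET.Element:
    """Build a minimal Atom feed; entries are dicts with aware datetimes."""
    feed = ET.Element("feed", {"xmlns": ATOM_NS})
    ET.SubElement(feed, "id").text = feed_url  # feed-level id: canonical URL
    ET.SubElement(feed, "title").text = title
    ET.SubElement(feed, "link", {"rel": "self", "href": feed_url})
    # updated: newest entry's date, or "now" for an empty feed.
    newest = max((e["date"] for e in entries), default=datetime.now(timezone.utc))
    ET.SubElement(feed, "updated").text = rfc3339(newest)
    for e in entries:
        entry = ET.SubElement(feed, "entry")
        ET.SubElement(entry, "id").text = e["link"]
        ET.SubElement(entry, "title").text = e["title"]
        ET.SubElement(entry, "link", {"href": e["link"]})
        ET.SubElement(entry, "updated").text = rfc3339(e["date"])
        author = ET.SubElement(entry, "author")  # nested name, not an email
        ET.SubElement(author, "name").text = e["author"]
        for tag in e["tags"]:
            ET.SubElement(entry, "category", {"term": tag})  # attribute, not text
    return feed


feed = atom_feed(
    "https://example.com/atom.xml",
    "Test Blog",
    [{
        "title": "Hello & co",
        "link": "https://example.com/hello",
        "date": datetime(2026, 4, 10, 12, 30, 45, tzinfo=timezone.utc),
        "author": "Sen",
        "tags": ["intro"],
    }],
)
atom_text = ET.tostring(feed, encoding="unicode")
print(atom_text)
```

Note that once the output is parsed back, every tag carries the Atom namespace, so test lookups need a namespace map.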

A minimal markdown renderer — tradeoff acknowledged

This is the part where I had to be honest with myself.

The default feedgen behavior is "summary only" — the feed entry's description is either whatever the frontmatter summary: field says, or the first paragraph of the body if no summary was given. This is the right default for feeds and it means you don't need a markdown renderer at all. Most users will never pass --include-content.
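The summary fallback described above could look roughly like this. summary_for is a hypothetical helper name, and the sketch assumes paragraphs are separated by blank lines.

```python
import re


def summary_for(frontmatter: dict, body: str) -> str:
    """Pick the feed description: explicit summary, else the first paragraph."""
    summary = frontmatter.get("summary")
    if summary:
        return str(summary)
    # Paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", body.strip())
    # Collapse internal newlines so the description is a single line.
    return " ".join(paragraphs[0].split())


print(summary_for({}, "\nFirst post on\nthe new blog.\n\nSecond paragraph.\n"))
print(summary_for({"summary": "Short and sweet."}, "Ignored body."))
```

One limitation of this sketch: if a post opens with a heading, the heading line becomes the summary, so an explicit summary: field is the safer choice for such posts.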

But for --include-content, I wanted something. Not a full CommonMark renderer — that's 2,000 lines on a good day. Just enough to produce valid HTML from the bits people actually use in blog posts: paragraphs, headings, bold/italic, inline code, links, fenced code blocks, bullet lists. That fit in ~80 lines:

import html


def render(md: str) -> str:
    lines = md.splitlines()
    out: list[str] = []
    i = 0
    in_list = False

    def close_list() -> None:
        nonlocal in_list
        if in_list:
            out.append("</ul>")
            in_list = False

    while i < len(lines):
        line = lines[i]

        if line.strip().startswith("```"):
            close_list()
            i += 1
            buf: list[str] = []
            while i < len(lines) and not lines[i].strip().startswith("```"):
                buf.append(html.escape(lines[i]))
                i += 1
            i += 1
            out.append("<pre><code>" + "\n".join(buf) + "</code></pre>")
            continue

        # ... headings, bullets, paragraphs ...

The README is explicit: this is not a real markdown renderer. It's "just enough to not embarrass us in a feed entry". If the rest of your site renders markdown with Marked or python-markdown or goldmark and the feed entry has small rendering differences with the on-site version, that's fine — users read feeds, not diffs.
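The excerpt above elides the inline handling, so here is a hedged sketch of how bold, italic, inline code, and links might be handled in the same "just enough" spirit. render_inline and its regexes are my own illustration, not the code in the repository.

```python
import html
import re


def render_inline(text: str) -> str:
    """Render a small inline-markdown subset to HTML (illustrative only)."""
    # Escape first so user text can never inject markup.
    out = html.escape(text)
    # Inline code: `code`
    out = re.sub(r"`([^`]+)`", r"<code>\1</code>", out)
    # Links: [label](url), where the url has no spaces or closing parens
    out = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r'<a href="\2">\1</a>', out)
    # Bold before italic, so ** is not consumed as two single asterisks.
    out = re.sub(r"\*\*([^*]+)\*\*", r"<strong>\1</strong>", out)
    out = re.sub(r"\*([^*]+)\*", r"<em>\1</em>", out)
    return out


print(render_inline("**bold** and *it* `x<y` [a](https://e.com/?a=1&b=2)"))
```

Known gaps in this sketch: asterisks inside code spans or URLs are still treated as emphasis, and nested emphasis is not supported. A real CommonMark renderer handles both.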

The alternative — not shipping the renderer at all and forcing users to pre-render their markdown and hand feedgen raw HTML — would actually be a cleaner design. I went the other way because the 80-line renderer unlocks feedgen posts/ --include-content --out feed.xml as a single-command experience, and single-command experiences matter more for a tool than design purity. Tradeoff accepted.
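When full content is included in RSS, the rendered HTML goes into content:encoded, which needs a namespace. A minimal sketch of that with ElementTree follows; the sample HTML is a placeholder.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
# Without registering the prefix, ElementTree invents one such as ns0.
ET.register_namespace("content", CONTENT_NS)

rss = ET.Element("rss", {"version": "2.0"})
channel = ET.SubElement(rss, "channel")
item = ET.SubElement(channel, "item")
ET.SubElement(item, "description").text = "First post on the new blog."

# Namespaced tags use Clark notation: {namespace-uri}localname
encoded = ET.SubElement(item, f"{{{CONTENT_NS}}}encoded")
encoded.text = "<p>First post on the <strong>new</strong> blog.</p>"

xml_text = ET.tostring(rss, encoding="unicode")
print(xml_text)
```

ElementTree writes the xmlns:content declaration on the root element and escapes the HTML as text, so no CDATA section is needed for the output to be well-formed.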

Permalink patterns are the interop surface

One thing that surprised me: the hardest UX question isn't about XML at all, it's about permalinks. Your feed items need a <link>. That link has to match exactly what your site is serving, including trailing slashes, including the /posts/ or /blog/ or /articles/ prefix, including year directories if you have them. Mismatch it and every post in your feed 404s.

I didn't want to hard-code one scheme. I went with a template:

--permalink "{link}/{slug}"
--permalink "{link}/posts/{year}/{slug}"
--permalink "{link}/blog/{year}/{month}/{slug}"

Substitution is just str.format with a handful of named keys (link, slug, year, month, day). One small function in the emitter:

def _permalink(meta: FeedMeta, post: Post) -> str:
    link = meta.link.rstrip("/")
    return meta.permalink_pattern.format(
        link=link,
        slug=post.slug,
        year=post.date.year,
        month=f"{post.date.month:02d}",
        day=f"{post.date.day:02d}",
    )

Covers every blog URL scheme I've seen. The alternative — accepting a callback, or a regex, or a config file — would have been ten times the code and one-tenth as discoverable.

Tradeoffs worth naming

No enclosures / podcasts / iTunes extensions. Trivial to add (half a dozen more ElementTree calls), out of scope for v1.
No image enclosures. Same reason. If you need podcast or media-enclosure support, this isn't the tool yet.
No full-content RSS by default. Feeds with content:encoded are bigger and readers handle them inconsistently. Summary-only is the right default; --include-content is the escape hatch.
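Frontmatter subset. ~95% of real blog frontmatter works. Most of the 5% that doesn't will fail loud with a line number, which is the kind of error you can fix by editing one line of the post; as noted earlier, a few unsupported shapes parse without error into surprising values instead.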
No validator integration. I validate the output by parsing it back with ET.fromstring in the tests, which catches well-formedness problems, but I don't ping the W3C Feed Validation Service. Good next step.
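A local check can go a little further than well-formedness without calling any external service. The sketch below, with the hypothetical name check_rss, verifies the channel elements the RSS 2.0 spec requires (title, link, description) and that each item has a title or a description. It is far from a full validator.

```python
import xml.etree.ElementTree as ET


def check_rss(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems: list[str] = []
    if root.tag != "rss" or root.get("version") != "2.0":
        problems.append('root must be <rss version="2.0">')
    channel = root.find("channel")
    if channel is None:
        return problems + ["missing <channel>"]
    # RSS 2.0 requires these three channel elements.
    for name in ("title", "link", "description"):
        if not channel.findtext(name):
            problems.append(f"channel is missing <{name}>")
    # Every item needs at least one of title or description.
    for n, item in enumerate(channel.findall("item"), start=1):
        if not (item.findtext("title") or item.findtext("description")):
            problems.append(f"item {n} needs a <title> or a <description>")
    return problems


good = (
    '<rss version="2.0"><channel><title>T</title>'
    "<link>https://example.com/</link><description>D</description>"
    "<item><title>Hello</title></item></channel></rss>"
)
print(check_rss(good))
print(check_rss(good.replace("<description>D</description>", "")))
```

Date formats and GUID uniqueness are not checked here; those are the kinds of issues a full feed validator would still catch.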

Try it in 30 seconds

git clone https://github.com/sen-ltd/feedgen
cd feedgen
docker build -t feedgen .

mkdir -p /tmp/posts
cat > /tmp/posts/hello.md << 'EOT'
---
title: "Hello world"
date: 2026-04-10
slug: hello
tags: [intro]
---

First post on the new blog.
EOT

docker run --rm -v /tmp/posts:/work feedgen . \
    --title "Test Blog" \
    --link "https://example.com/" \
    --description "A test" \
    --out /work/feed.xml

cat /tmp/posts/feed.xml

You get a validator-clean RSS 2.0 feed with one item for hello.md (with more posts, items are sorted newest-first), well-formed XML, escaped content, correct pubDate. No PyYAML. No markdown library. No feed library. No dependencies at all: pyproject.toml has an empty dependencies = [].

If you want Atom 1.0 instead:

docker run --rm -v /tmp/posts:/work feedgen . \
    --title "Test Blog" \
    --link "https://example.com/" \
    --format atom \
    --out /work/atom.xml

Or both formats in one run with --format both.

What this is for

This isn't trying to replace Hugo's feed plugin on Hugo sites. It's for the long tail of blogs that somehow ended up without one — the hand-rolled, the inherited, the half-migrated. For those sites, the cost of adopting a feed plugin is often "rewrite the build pipeline". The cost of adopting feedgen is a single CLI invocation in whatever script you already have.

If that's you, this is worth bookmarking. If you're on a framework that has a feed plugin already, ignore this and use that. Both answers are correct.

And if you've been staring at your own site going "I should probably have a feed again at some point" — you probably should, and it's one command.
