sen-ltd/ogp-fetch

ogp-fetch

Open Graph / Twitter Card metadata extractor CLI. Fetch a URL (or pipe HTML on stdin) and get back a clean JSON blob of everything you need to render a link preview: title, description, image, site name, Twitter Card fields, canonical URL, favicon.

Every team that builds a Slack bot, a Discord integration, a blog with link cards, or a notifications pipeline eventually writes this code. It's a small, well-defined problem that's easy to get wrong: og: tags take precedence over twitter: tags, both take precedence over <title>, relative image URLs need resolving, and HTML entities need decoding. ogp-fetch handles all of that in ~80 lines of Python, using only the standard library plus httpx.

ogp-fetch https://example.com/article
{
  "url": "https://example.com/article",
  "title": "Hello World",
  "description": "An example page",
  "image": "https://example.com/static/hero.png",
  "site_name": "Example",
  "type": "article",
  "twitter": { "card": "summary_large_image", "site": "@example", "creator": null },
  "canonical": "https://example.com/article",
  "favicon": "https://example.com/favicon.ico"
}

Install

pip install .

Runtime dependency: httpx only.

Usage

# Fetch a URL and emit JSON (default).
ogp-fetch https://example.com/

# Human-readable key: value layout.
ogp-fetch https://example.com/ --format human

# Markdown link-card preview (great for README / Slack).
ogp-fetch https://example.com/ --format markdown

# Pipe HTML from anywhere.
curl -s https://example.com/ | ogp-fetch - --no-resolve

# Pipe with a base URL so relative og:image paths still resolve.
curl -s https://example.com/ | ogp-fetch - --no-resolve --base-url https://example.com/

Options

| Flag | Default | Description |
| --- | --- | --- |
| --format {json,human,markdown} | json | Output format |
| --user-agent STRING | ogp-fetch/0.1.0 (…) | Sent as the User-Agent header |
| --timeout SECONDS | 10 | HTTP timeout |
| --max-size BYTES | 2097152 (2 MB) | Refuse responses larger than this |
| --base-url URL | (fetched URL) | Resolve relative links against this |
| --no-resolve | off | Skip HTTP entirely; requires - as the URL |
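
To illustrate what a --max-size style cap involves, here is a minimal sketch, not the project's actual code. It assumes the body arrives as an iterable of byte chunks (for example from httpx's `Response.iter_bytes()` on a streamed response); the names `read_capped` and `ResponseTooLarge` are hypothetical.

```python
from typing import Iterable

DEFAULT_MAX_SIZE = 2_097_152  # 2 MB, the documented --max-size default


class ResponseTooLarge(Exception):
    """Raised when the body exceeds the cap (hypothetical name)."""


def read_capped(chunks: Iterable[bytes], max_size: int = DEFAULT_MAX_SIZE) -> bytes:
    """Concatenate chunks, aborting as soon as the total exceeds max_size."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_size:
            # Stop reading early instead of buffering an arbitrarily large body.
            raise ResponseTooLarge(f"body exceeds {max_size} bytes")
    return bytes(buf)
```

Checking the running total while streaming, rather than trusting a Content-Length header, also covers servers that omit or misreport the header.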

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Metadata found |
| 1 | Fetched/parsed successfully but no OGP / Twitter / <title> data |
| 2 | Fetch, parse, or argument error |

Why the stdin mode

ogp-fetch - --no-resolve reads HTML from stdin and emits the same JSON without any network traffic. That lets you:

  • plug it into a curl pipeline without making ogp-fetch responsible for TLS or retries;
  • test your extraction on a captured HTML fixture;
  • run it in a sandbox or offline CI job.
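
One way to drive the stdin mode from Python, for instance inside a fixture-based test, is sketched below. It relies only on the flags and exit codes documented in this README. The helper name `extract_from_html` and the overridable `command` parameter are hypothetical, and returning `None` on exit code 1 is an assumption, since the README does not say what is printed in that case.

```python
import json
import subprocess
from typing import Optional, Sequence


def extract_from_html(
    html: str,
    base_url: Optional[str] = None,
    command: Sequence[str] = ("ogp-fetch",),  # overridable for testing
) -> Optional[dict]:
    """Pipe HTML into `ogp-fetch - --no-resolve` and parse the JSON output."""
    args = [*command, "-", "--no-resolve"]
    if base_url is not None:
        args += ["--base-url", base_url]
    proc = subprocess.run(args, input=html, capture_output=True, text=True)
    if proc.returncode == 0:  # metadata found
        return json.loads(proc.stdout)
    if proc.returncode == 1:  # parsed fine, but no metadata (assumed: no usable output)
        return None
    # Exit code 2 (or anything unexpected): fetch, parse, or argument error.
    raise RuntimeError(f"ogp-fetch failed (exit {proc.returncode}): {proc.stderr.strip()}")
```

A test can then read a captured page from disk and call `extract_from_html(page, base_url="https://example.com/")` without any network access.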

Precedence rules

The extractor collects every meta tag it finds; the normalizer picks the winner:

| Field | First checked | Then | Last resort |
| --- | --- | --- | --- |
| title | og:title | twitter:title | <title> |
| description | og:description | twitter:description | <meta name="description"> |
| image | og:image | og:image:url | twitter:image |
| canonical | <link rel="canonical"> | og:url | |
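
The collect-then-pick approach in the table can be sketched with the standard library's `html.parser`. This is an illustrative approximation, not the project's source: `MetaCollector`, `first_of`, and `normalize` are hypothetical names, and keeping the first occurrence of a duplicated tag is an assumption. `HTMLParser` decodes HTML entities in attribute values, so `&amp;` in a `content` attribute comes back as `&`.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCollector(HTMLParser):
    """Collect <meta>, <link rel=...>, and <title> without picking winners."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.meta = {}    # property/name -> content (first occurrence kept)
        self.links = {}   # rel -> href
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            key = a.get("property") or a.get("name")
            content = a.get("content")
            if key and content is not None:
                self.meta.setdefault(key.lower(), content)
        elif tag == "link":
            rel = (a.get("rel") or "").lower()
            if rel and a.get("href"):
                self.links.setdefault(rel, a["href"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def first_of(meta: dict, *keys: str) -> Optional[str]:
    """Return the first non-empty value among keys, in precedence order."""
    for key in keys:
        value = meta.get(key, "").strip()
        if value:
            return value
    return None


def normalize(html: str) -> dict:
    """Apply the precedence table to the collected tags."""
    c = MetaCollector()
    c.feed(html)
    c.close()
    return {
        "title": first_of(c.meta, "og:title", "twitter:title") or (c.title.strip() or None),
        "description": first_of(c.meta, "og:description", "twitter:description", "description"),
        "image": first_of(c.meta, "og:image", "og:image:url", "twitter:image"),
        "canonical": c.links.get("canonical") or first_of(c.meta, "og:url"),
    }
```

Separating collection from selection keeps the precedence order in one place, so changing a fallback means editing a single tuple of keys.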

Relative URLs in og:image, twitter:image, canonical, and favicon are resolved to absolute using urllib.parse.urljoin(base_url, value). Protocol-relative //cdn.example.com/x.png works too.
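
The following short sketch shows how `urllib.parse.urljoin` behaves for the cases mentioned above. The base URL and the `resolve` helper are made up for illustration.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve(base_url: str, value: Optional[str]) -> Optional[str]:
    """Resolve a possibly relative URL against base_url; pass None through."""
    if not value:
        return None
    return urljoin(base_url, value)


base = "https://example.com/blog/post"  # illustrative base URL

# Root-relative path: replaces the whole path of the base.
print(resolve(base, "/static/hero.png"))         # https://example.com/static/hero.png
# Document-relative path: replaces only the last path segment.
print(resolve(base, "hero.png"))                 # https://example.com/blog/hero.png
# Protocol-relative URL: inherits the scheme of the base.
print(resolve(base, "//cdn.example.com/x.png"))  # https://cdn.example.com/x.png
# Already absolute: returned unchanged.
print(resolve(base, "https://other.example/y.png"))  # https://other.example/y.png
```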

Docker

docker build -t ogp-fetch .
docker run --rm ogp-fetch --help

# Pipe HTML in:
cat page.html | docker run --rm -i ogp-fetch - --no-resolve --format markdown

The image is a multi-stage Alpine build, runs as a non-root user, and is under 90 MB.

Tests

pip install ".[dev]"
pytest -q

All network paths are exercised via httpx.MockTransport — the test suite never touches the real network.

License

MIT
