@shricodev
Created March 3, 2026 05:44
Prompt - Opus 4.6 vs. Sonnet 4.6 - Blog Demo

You are being evaluated on your ability to build a complete, working Tensorlake project in Python in one go.

Before you write any code, you MUST read these pages carefully to understand Tensorlake concepts, patterns, and how examples are structured:

  1. https://docs.tensorlake.ai/introduction
  2. https://docs.tensorlake.ai/applications/introduction
  3. https://docs.tensorlake.ai/document-ingestion/overview
  4. https://docs.tensorlake.ai/integrations/overview
  5. https://docs.tensorlake.ai/examples/overview

You MUST use Python. You MUST build this as a real Tensorlake app with a clean repo structure, clear README, and a polished CLI experience.

Goal

Build a "Deep Research Pack" system called research_pack that:

  • Takes a topic prompt from the user
  • Produces:
    1. a high quality report in Markdown with citations
    2. a machine-readable "source library" JSON file containing extracted text and metadata for each source
  • Uses parallel crawling and synthesis patterns
  • Is robust, repeatable, and pleasant to use

Core output contract

Given an input topic, the system must return an output folder containing:

  • report.md
  • library.json
  • artifacts/
    • sources/ raw extracted text per URL
    • html/ optional raw HTML (if you keep it)
    • logs/
  • run.json (full structured run state)
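The folder contract above can be materialized up front so every stage writes to predictable paths. A minimal sketch (the helper name `init_run_dirs` is ours, not from the spec; directory names follow the contract):

```python
from pathlib import Path

def init_run_dirs(out_dir: str) -> dict[str, Path]:
    """Create the output folder contract and return the key paths."""
    root = Path(out_dir)
    paths = {
        "report": root / "report.md",
        "library": root / "library.json",
        "sources": root / "artifacts" / "sources",
        "html": root / "artifacts" / "html",
        "logs": root / "artifacts" / "logs",
        "run": root / "run.json",
    }
    # Only the directories need to exist ahead of time; files are
    # written by their respective stages.
    for key in ("sources", "html", "logs"):
        paths[key].mkdir(parents=True, exist_ok=True)
    return paths
```

Creating the directories eagerly also supports the "always writes artifacts even if some sources fail" guarantee later in the spec.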

High-level workflow

Stage 1: Plan

  • Generate 6 to 10 targeted search queries for the topic
  • Also generate 10 to 20 seed URLs:
    • Use a mix of official docs, reputable blogs, academic sources, and vendor pages where relevant
  • Produce a plan object that will be saved to disk

Stage 2: Acquire sources (parallel)

Implement an acquisition pipeline that:

  • Accepts seed URLs and optionally a simple search provider abstraction
  • Crawls pages to depth 1 (configurable to depth 2)
  • Extracts main content text
  • Tracks provenance and crawl decisions

You must implement:

  • rate limiting
  • per-domain concurrency limits
  • timeouts and retries
  • robots.txt respect as a best-effort check (if implementing it is straightforward)
  • content-type handling:
    • HTML pages: extract main content
    • PDF files: if you choose to support PDFs, you may download and store them, and extract text only if a parser is available. If not, store the binary and mark it "pdf_text_unavailable".
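The rate-limiting, per-domain concurrency, timeout, and retry requirements compose naturally with `asyncio`. A minimal sketch under our own naming (`DomainLimiter`, `fetch_with_retries` are illustrative; the actual fetch callable is injected so transports can vary):

```python
import asyncio
from urllib.parse import urlparse

class DomainLimiter:
    """Caps concurrent requests per domain with one semaphore per netloc."""
    def __init__(self, per_domain: int = 2):
        self.per_domain = per_domain
        self._sems: dict[str, asyncio.Semaphore] = {}

    def sem_for(self, url: str) -> asyncio.Semaphore:
        domain = urlparse(url).netloc
        if domain not in self._sems:
            self._sems[domain] = asyncio.Semaphore(self.per_domain)
        return self._sems[domain]

async def fetch_with_retries(url, fetch, limiter, retries=3,
                             timeout=10.0, backoff=0.5):
    """Run `await fetch(url)` under the domain's semaphore, with a
    per-attempt timeout and exponential backoff between retries."""
    async with limiter.sem_for(url):
        for attempt in range(retries):
            try:
                return await asyncio.wait_for(fetch(url), timeout)
            except Exception:
                if attempt == retries - 1:
                    raise  # caller records this URL as failed, run continues
                await asyncio.sleep(backoff * 2 ** attempt)
```

Letting the final exception propagate (rather than swallowing it) lets the orchestrator record the failure and keep the run alive, per the guardrails section.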

Stage 3: Normalize and dedupe

  • Normalize text (whitespace, boilerplate removal)
  • Deduplicate near duplicates using a simple similarity strategy:
    • acceptable: simhash, minhash, or cosine similarity over TF-IDF
  • Keep the best representative copy, but record which URLs were duplicates
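Of the acceptable similarity strategies, cosine similarity is the simplest to sketch. The version below uses raw term frequencies (IDF weighting omitted for brevity); function names and the 0.9 threshold are our illustrative choices:

```python
import math
from collections import Counter

def _tf(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over term-frequency vectors."""
    ta, tb = _tf(a), _tf(b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe(texts: dict[str, str], threshold: float = 0.9):
    """Greedy near-duplicate pass: keep the first representative of each
    cluster, and record which source each duplicate maps to."""
    kept: dict[str, str] = {}
    duplicate_of: dict[str, str] = {}
    for sid, text in texts.items():
        match = next((k for k, t in kept.items()
                      if cosine_sim(text, t) >= threshold), None)
        if match:
            duplicate_of[sid] = match
        else:
            kept[sid] = text
    return kept, duplicate_of
```

The `duplicate_of` mapping feeds directly into the `duplicate_of` field on SourceRecord defined later in the spec.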

Stage 4: Enrich each source (parallel)

For each kept source:

  • Produce:
    • title
    • author if available
    • publish date if available
    • short summary (5 to 8 bullets)
    • reliability notes (1 to 5 sentences, explain why the source is strong or weak)
    • key quotes: up to 3 short quotes with exact text spans and offsets
  • Store this in library.json
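The "exact text spans and offsets" requirement for key quotes doubles as an anti-hallucination check: a quote that cannot be located verbatim in the source text is dropped. A sketch (helper name is ours):

```python
def locate_quotes(text: str, quotes: list[str]) -> list[dict]:
    """Resolve candidate quotes to exact character offsets in the source.
    Quotes not found verbatim are discarded rather than guessed."""
    spans = []
    for q in quotes[:3]:  # contract allows at most 3 quotes per source
        start = text.find(q)
        if start != -1:
            spans.append({
                "quote": q,
                "start_offset": start,
                "end_offset": start + len(q),
            })
    return spans
```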

Stage 5: Synthesize report

Write report.md:

  • Must have:
    • Executive summary
    • Key findings
    • Deep dive sections
    • Contradictions and uncertainty
    • "What to read next" section with top 5 sources
  • Every major claim must have an inline citation like [S3] pointing to the source id in library.json
  • Avoid hallucinating. If information is missing, say so and cite what you do have.
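The `[S3]`-style inline citations are mechanically checkable, which is worth doing before a run is declared complete. A possible validation helper (our naming; the `S\d+` pattern follows the id scheme in the data model):

```python
import re

def dangling_citations(report_md: str, source_ids: set[str]) -> set[str]:
    """Return citation ids used in the report that have no entry
    in library.json, e.g. a hallucinated [S9]."""
    cited = set(re.findall(r"\[(S\d+)\]", report_md))
    return cited - source_ids
```

An empty return set means every cited id resolves to a real source, which is a cheap line item for the acceptance checklist requested at the end.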

Tensorlake requirements

You MUST implement this as a Tensorlake application, designed like the documented examples:

  • Provide at least one HTTP endpoint that triggers a run, and another that fetches status/results.
  • Use durable execution concepts so long runs can resume.
  • Use parallelism:
    • crawling tasks in parallel
    • per-source enrichment in parallel
  • Emit streaming/progress updates with meaningful messages.
  • Use structured logging.
  • Use sandboxed execution for crawling and parsing steps where appropriate (containerized functions).
  • Keep the implementation in plain Python, but follow Tensorlake idioms from docs/examples.

CLI requirements (this matters a lot)

Build a delightful CLI called research-pack:

  • Use Typer + Rich (preferred)
  • Commands:
    1. research-pack run "<topic>" [--out DIR] [--depth 1|2] [--max-sources N] [--max-pages-per-domain N] [--offline]
    2. research-pack status <run_id>
    3. research-pack open <run_id>, which prints the paths to report.md and library.json
  • UX details:
    • Shows progress: planning, crawling, dedupe, enrichment, writing report
    • End summary shows:
      • sources_fetched
      • sources_kept
      • duplicates_removed
      • top_domains
    • Always writes artifacts even if some sources fail
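The end-summary fields above are all derivable from run state already tracked by earlier stages. A sketch of that derivation with the standard library (function name is ours; field names match the contract):

```python
from collections import Counter
from urllib.parse import urlparse

def run_summary(fetched_urls: list[str], kept_ids: list[str],
                duplicate_of: dict[str, str]) -> dict:
    """Compute the end-of-run stats the CLI prints."""
    return {
        "sources_fetched": len(fetched_urls),
        "sources_kept": len(kept_ids),
        "duplicates_removed": len(duplicate_of),
        # Most frequent domains among everything fetched, for top_domains.
        "top_domains": Counter(
            urlparse(u).netloc for u in fetched_urls
        ).most_common(5),
    }
```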

Normalized data model (must implement)

Use Pydantic models and keep them in research_pack/models.py.

SourceRecord:

  • id (S1, S2, ...)
  • url
  • canonical_url (if redirected)
  • title (optional)
  • author (optional)
  • published_at (optional)
  • retrieved_at
  • content_type
  • text_path (path to extracted text artifact)
  • raw_path (optional)
  • summary_bullets (list[str])
  • reliability_notes (str)
  • key_quotes (list of {quote, start_offset, end_offset})
  • duplicate_of (optional source id)
  • tags (list[str])

RunResult:

  • run_id
  • topic
  • plan (queries, seeds, parameters)
  • stats:
    • fetched_count
    • kept_count
    • duplicates_count
    • failures_count
  • sources (list[SourceRecord])
  • report_path
  • library_path

Engineering requirements

  • Clean repo layout:
    • research_pack/ package
    • research_pack/app.py Tensorlake entry
    • research_pack/cli.py CLI entry
    • research_pack/workflow.py orchestrator
    • research_pack/crawl/ crawler + extractors
    • research_pack/dedupe/ similarity methods
    • research_pack/render/ report writer
    • research_pack/utils/ helpers
    • tests/ minimal tests for your own code
  • Must be runnable locally:
    • pip install -e .
    • research-pack run "some topic"
  • README must include:
    • setup
    • how to run locally
    • how to run on Tensorlake
    • configuration knobs
    • troubleshooting

Constraints and guardrails

  • Be pragmatic. Do not build a full search engine. Keep it simple.
  • If you cannot access search APIs, implement a seed-url mode and document it clearly.
  • Implement best-effort extraction. Prioritize correctness and stability over fancy heuristics.
  • Be explicit about limitations.
  • Timeouts and retries must exist for network calls.
  • Do not crash the whole run if a few URLs fail.

Deliverables in your response

When you answer, output:

  1. A short explanation of how you mapped Tensorlake concepts to this project (durable workflow, sandboxes, parallel steps, progress updates)
  2. File-by-file code in code blocks for the entire project
  3. A complete README
  4. A quick "acceptance checklist" that a reviewer can run through to verify correctness

Quality bar

This must feel like a real tool. The CLI must be pleasant. The workflow must be robust. The artifacts must be complete and consistent.
