You are being evaluated on your ability to build a complete, working Tensorlake project in Python in one go.
Before you write any code, you MUST read these pages carefully to understand Tensorlake concepts, patterns, and how examples are structured:
- https://docs.tensorlake.ai/introduction
- https://docs.tensorlake.ai/applications/introduction
- https://docs.tensorlake.ai/document-ingestion/overview
- https://docs.tensorlake.ai/integrations/overview
- https://docs.tensorlake.ai/examples/overview
You MUST use Python. You MUST build this as a real Tensorlake app with a clean repo structure, clear README, and a polished CLI experience.
Goal
Build a "Deep Research Pack" system called research_pack that:
- Takes a topic prompt from the user
- Produces:
- a high quality report in Markdown with citations
- a machine-readable "source library" JSON file containing extracted text and metadata for each source
- Uses parallel crawling and synthesis patterns
- Is robust, repeatable, and pleasant to use
Core output contract
Given an input topic, the system must return an output folder containing:
- report.md
- library.json
- artifacts/
  - sources/ (raw extracted text per URL)
  - html/ (optional raw html, if you keep it)
  - logs/
- run.json (full structured run state)
High-level workflow
Stage 1: Plan
- Generate 6 to 10 targeted search queries for the topic
- Also generate 10 to 20 seed URLs, using this strategy:
- Use a mix: official docs, reputable blogs, academic sources, vendor pages where relevant
- Produce a plan object that will be saved to disk
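The plan object above can be sketched as a small serializable structure. This is an illustrative shape only (field names like `seed_urls` are assumptions, and the spec's real models should live in `research_pack/models.py`):

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

# Hypothetical shape for the Stage 1 plan object; field names are illustrative.
@dataclass
class Plan:
    topic: str
    queries: list                     # 6 to 10 targeted search queries
    seed_urls: list                   # 10 to 20 seed URLs
    parameters: dict = field(default_factory=dict)

def save_plan(plan: Plan, out_dir: Path) -> Path:
    """Persist the plan to disk so a resumed run can skip Stage 1."""
    path = out_dir / "plan.json"
    path.write_text(json.dumps(asdict(plan), indent=2))
    return path
```

Saving the plan as JSON keeps it both human-inspectable and loadable by a resumed run.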
Stage 2: Acquire sources (parallel)
Implement an acquisition pipeline that:
- Accepts seed URLs and optionally a simple search provider abstraction
- Crawls pages to depth 1 (configurable to depth 2)
- Extracts main content text
- Tracks provenance and crawl decisions
You must implement:
- rate limiting
- per-domain concurrency limits
- timeouts and retries
- robots.txt respect as a best-effort check (if implementing it is straightforward)
- content-type handling:
- html pages: extract main content
- pdf files: if you choose to support PDFs, you may download and store them, and extract text only if a parser is available. If not, store binary and mark "pdf_text_unavailable".
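The per-domain concurrency, timeout, and retry requirements can be combined in one small asyncio wrapper. A minimal sketch, assuming a hypothetical `fetch` coroutine (e.g. wrapping httpx or aiohttp, your choice of client):

```python
import asyncio
from urllib.parse import urlparse

class DomainLimiter:
    """Caps concurrent requests per domain via one semaphore per netloc."""
    def __init__(self, per_domain: int = 2):
        self.per_domain = per_domain
        self._sems = {}

    def sem(self, url: str) -> asyncio.Semaphore:
        domain = urlparse(url).netloc
        if domain not in self._sems:
            self._sems[domain] = asyncio.Semaphore(self.per_domain)
        return self._sems[domain]

async def fetch_with_retries(url, fetch, limiter, *, timeout=15.0, retries=3):
    """Fetch one URL under the per-domain cap, with a timeout and linear backoff."""
    last_exc = None
    for attempt in range(retries):
        try:
            async with limiter.sem(url):
                return await asyncio.wait_for(fetch(url), timeout)
        except Exception as exc:  # record the failure and retry after a short pause
            last_exc = exc
            await asyncio.sleep(0.1 * (attempt + 1))
    raise last_exc
```

For the best-effort robots.txt check, the stdlib `urllib.robotparser` module can be consulted before enqueueing a URL.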
Stage 3: Normalize and dedupe
- Normalize text (whitespace, boilerplate removal)
- Deduplicate near duplicates using a simple similarity strategy:
- acceptable: simhash, minhash, or cosine similarity over TF-IDF
- Keep the best representative copy, but record which URLs were duplicates
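The "keep the best representative, record duplicates" rule can be sketched with a pure-Python cosine similarity over term counts, a simpler stand-in for the TF-IDF/simhash/minhash options allowed above (threshold and function names are illustrative):

```python
import math
import re
from collections import Counter

def _terms(text: str) -> Counter:
    # Lowercase word counts; a crude but dependency-free vectorization.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_sim(a: str, b: str) -> float:
    ta, tb = _terms(a), _terms(b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mark_duplicates(docs: dict, threshold: float = 0.9) -> dict:
    """Map each doc id to the kept representative it duplicates, or None if kept."""
    kept, dup_of = [], {}
    for doc_id, text in docs.items():
        match = next((k for k in kept if cosine_sim(docs[k], text) >= threshold), None)
        dup_of[doc_id] = match
        if match is None:
            kept.append(doc_id)
    return dup_of
```

The returned mapping feeds directly into the `duplicate_of` field of the data model below; first-seen documents become the representatives.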
Stage 4: Enrich each source (parallel)
For each kept source:
- Produce:
- title
- author if available
- publish date if available
- short summary (5 to 8 bullets)
- reliability notes (1 to 5 sentences, explain why the source is strong or weak)
- key quotes: up to 3 short quotes with exact text spans and offsets
- Store this in library.json
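The "exact text spans and offsets" requirement for key quotes can be enforced by computing offsets from the stored extracted text. A minimal sketch (the helper name is an assumption):

```python
def quote_span(text: str, quote: str):
    """Return {quote, start_offset, end_offset} if `quote` is an exact span
    of `text`, else None. Guarantees offsets are verifiable against the artifact."""
    start = text.find(quote)
    if start == -1:
        return None
    return {"quote": quote, "start_offset": start, "end_offset": start + len(quote)}
```

Rejecting quotes that are not exact substrings keeps `library.json` honest: every recorded span can be re-checked against the text artifact on disk.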
Stage 5: Synthesize report
Write report.md:
- Must have:
- Executive summary
- Key findings
- Deep dive sections
- Contradictions and uncertainty
- "What to read next" section with top 5 sources
- Every major claim must have an inline citation like [S3] pointing to the source id in library.json
- Avoid hallucinating. If information is missing, say so and cite what you do have.
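The citation rule can be checked mechanically after the report is written. A sketch that flags dangling `[S#]` references (the function name is an assumption):

```python
import re

def dangling_citations(report_md: str, source_ids: set) -> set:
    """Return the set of [S#] ids cited in the report that are missing
    from library.json's source ids."""
    cited = set(re.findall(r"\[(S\d+)\]", report_md))
    return cited - source_ids
```

Running this as a final validation step (and failing loudly on a non-empty result) is a cheap guard against hallucinated sources.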
Tensorlake requirements
You MUST implement this as a Tensorlake application, designed like the examples:
- Provide at least one HTTP endpoint that triggers a run, and another that fetches status/results.
- Use durable execution concepts so long runs can resume.
- Use parallelism:
- crawling tasks in parallel
- per-source enrichment in parallel
- Emit streaming/progress updates with meaningful messages.
- Use structured logging.
- Use sandboxed execution for crawling and parsing steps where appropriate (containerized functions).
- Keep the implementation in plain Python, but follow Tensorlake idioms from docs/examples.
CLI requirements (this matters a lot)
Build a delightful CLI called research-pack:
- Use Typer + Rich (preferred)
- Commands:
- research-pack run "<topic>" [--out DIR] [--depth 1|2] [--max-sources N] [--max-pages-per-domain N] [--offline]
- research-pack status <run_id>
- research-pack open <run_id>, which prints the paths to report.md and library.json
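The command surface above can be sketched with stdlib argparse for illustration only; the deliverable should use Typer + Rich as stated, which offers the same subcommand/option structure:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Stdlib stand-in for the Typer app; mirrors the three commands and flags.
    parser = argparse.ArgumentParser(prog="research-pack")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="start a research run")
    run.add_argument("topic")
    run.add_argument("--out", default="out")
    run.add_argument("--depth", type=int, choices=[1, 2], default=1)
    run.add_argument("--max-sources", type=int, default=20)
    run.add_argument("--max-pages-per-domain", type=int, default=5)
    run.add_argument("--offline", action="store_true")

    status = sub.add_parser("status", help="show run status")
    status.add_argument("run_id")

    open_cmd = sub.add_parser("open", help="print report/library paths")
    open_cmd.add_argument("run_id")
    return parser
```

The default values shown are assumptions; pick and document your own in the README's configuration section.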
- UX details:
- Shows progress: planning, crawling, dedupe, enrichment, writing report
- End summary shows:
- sources_fetched
- sources_kept
- duplicates_removed
- top_domains
- Always writes artifacts even if some sources fail
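The `top_domains` counter in the end summary can be computed directly from the fetched URLs. A minimal sketch (the function name is an assumption):

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n: int = 5):
    """Return the n most common (domain, count) pairs among fetched URLs."""
    counts = Counter(urlparse(u).netloc for u in urls)
    return counts.most_common(n)
```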
Normalized data model (must implement)
Use Pydantic models and keep them in research_pack/models.py.
SourceRecord:
- id (S1, S2, ...)
- url
- canonical_url (if redirected)
- title (optional)
- author (optional)
- published_at (optional)
- retrieved_at
- content_type
- text_path (path to extracted text artifact)
- raw_path (optional)
- summary_bullets (list[str])
- reliability_notes (str)
- key_quotes (list of {quote, start_offset, end_offset})
- duplicate_of (optional source id)
- tags (list[str])
RunResult:
- run_id
- topic
- plan (queries, seeds, parameters)
- stats:
- fetched_count
- kept_count
- duplicates_count
- failures_count
- sources (list[SourceRecord])
- report_path
- library_path
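For illustration, the shape of these records can be sketched with stdlib dataclasses; the actual deliverable must define them as Pydantic models in research_pack/models.py, as required above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KeyQuote:
    quote: str
    start_offset: int
    end_offset: int

@dataclass
class SourceRecord:
    id: str                                 # "S1", "S2", ...
    url: str
    retrieved_at: str
    content_type: str
    text_path: str
    canonical_url: Optional[str] = None     # set if redirected
    title: Optional[str] = None
    author: Optional[str] = None
    published_at: Optional[str] = None
    raw_path: Optional[str] = None
    summary_bullets: list = field(default_factory=list)
    reliability_notes: str = ""
    key_quotes: list = field(default_factory=list)   # list of KeyQuote
    duplicate_of: Optional[str] = None
    tags: list = field(default_factory=list)
```

With Pydantic, the same fields gain validation on load, which matters when `library.json` is re-read by `status`/`open` commands.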
Engineering requirements
- Clean repo layout:
  - research_pack/: package
  - research_pack/app.py: Tensorlake entry
  - research_pack/cli.py: CLI entry
  - research_pack/workflow.py: orchestrator
  - research_pack/crawl/: crawler + extractors
  - research_pack/dedupe/: similarity methods
  - research_pack/render/: report writer
  - research_pack/utils/: helpers
  - tests/: minimal tests for your own code
- Must be runnable locally:
pip install -e .
research-pack run "some topic"
- README must include:
- setup
- how to run locally
- how to run on Tensorlake
- configuration knobs
- troubleshooting
Constraints and guardrails
- Be pragmatic. Do not build a full search engine. Keep it simple.
- If you cannot access search APIs, implement a seed-url mode and document it clearly.
- Implement best-effort extraction. Prioritize correctness and stability over fancy heuristics.
- Be explicit about limitations.
- Timeouts and retries must exist for network calls.
- Do not crash the whole run if a few URLs fail.
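The "do not crash the whole run" rule maps naturally to gathering per-URL tasks with `return_exceptions=True` and partitioning the outcomes afterwards. A minimal sketch, assuming a caller-supplied `fetch` coroutine:

```python
import asyncio

async def fetch_all(urls, fetch):
    """Fetch every URL, isolating failures so one bad URL never aborts the run.
    Returns (ok, failed): url -> result, and url -> error description."""
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}
    failed = {u: repr(r) for u, r in zip(urls, results) if isinstance(r, Exception)}
    return ok, failed
```

The `failed` mapping should still be written into run.json and counted in `failures_count`, so the artifacts stay complete even on partial success.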
Deliverables in your response
When you answer, output:
- A short explanation of how you mapped Tensorlake concepts to this project (durable workflow, sandboxes, parallel steps, progress updates)
- File-by-file code in code blocks for the entire project
- A complete README
- A quick "acceptance checklist" that a reviewer can run through to verify correctness
Quality bar
This must feel like a real tool. The CLI must be pleasant. The workflow must be robust. The artifacts must be complete and consistent.