
DEV Community

Ali Sher


Grants to Investments Part 2-3: Models and Pipelines

🚀 Grants ETL Pipeline: Rust + Transformer-Based Classification

📌 Overview

I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:

  • ⚡ High-performance data extraction using Rust
  • 🧠 Semantic classification using BERT (zero-shot)
  • 📊 Structured output ready for downstream analytics and dashboarding

This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.


🧩 Extraction Layer (Rust)

The Problem

The Grants Canada portal has no accessible API, only an HTML-rendered search interface. I needed a way to extract structured data at scale.

The Solution

I built a custom scraper targeting the paginated search endpoint:
https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc
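
To make the pagination concrete, here is a minimal Python sketch of how the `{}` placeholder expands into page URLs. It only illustrates the URL scheme and is not the Rust scraper itself; the helper names `page_url` and `page_urls` and the 1-based starting page are assumptions.

```python
# Template copied from the endpoint above; `{}` is the page-number slot.
SEARCH_URL = "https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc"


def page_url(page: int) -> str:
    """Return the search URL for one results page (hypothetical helper)."""
    if page < 1:
        # Assumption: the portal numbers its pages starting at 1.
        raise ValueError("page must be >= 1")
    return SEARCH_URL.format(page)


def page_urls(first: int, last: int) -> list[str]:
    """URLs for pages first..last inclusive, in the order they would be fetched."""
    return [page_url(p) for p in range(first, last + 1)]


if __name__ == "__main__":
    for url in page_urls(1, 3):
        print(url)
```

Because the results are sorted by agreement start date in descending order, a scraper walking these pages sees the newest grants first.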

Key Decisions

I started in Python but switched to Rust for performance at scale. The Rust scraper uses:

  • scraper: HTML parsing
  • csv: structured output

The scraper is designed to handle large-scale ingestion without excessive memory use or long runtimes.

Outcome

✅ Successfully extracted structured grant data into CSV
✅ Significantly faster ingestion than the prior Python-based workflow

📄 Sample Record

Agreement: European Space Agency (ESA)'s Space Weather Training Course
Agreement Number: 25COBLLAMY
Date Range: Mar 11, 2026 → Mar 27, 2026
Description: Supports Canadian students attending international space training events
Recipient: Canadian Space Agency
Amount: $1,000.00
Location: La Prairie, Quebec, CA
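
Fields like the amount and the date range arrive as display strings, so downstream analytics usually need them typed. A small sketch of that normalization, assuming the formats shown in the sample record; the function names are hypothetical and this step is not described in the original pipeline.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Convert a display string like '$1,000.00' into an exact Decimal."""
    return Decimal(raw.replace("$", "").replace(",", "").strip())


def parse_date_range(raw: str, sep: str = "→") -> tuple[date, date]:
    """Split 'Mar 11, 2026 → Mar 27, 2026' into (start, end) dates."""
    start_text, end_text = (part.strip() for part in raw.split(sep))
    fmt = "%b %d, %Y"  # e.g. 'Mar 11, 2026'; %b assumes English month names
    return (
        datetime.strptime(start_text, fmt).date(),
        datetime.strptime(end_text, fmt).date(),
    )


if __name__ == "__main__":
    start, end = parse_date_range("Mar 11, 2026 → Mar 27, 2026")
    print(parse_amount("$1,000.00"), start, end)
```

`Decimal` is used instead of `float` so that currency values keep their exact cents when summed.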


🧠 Transformation + Classification

Objective

Categorize grants into meaningful sectors for analytics and discovery, making the data explorable beyond raw fields.

Categories

CATEGORIES = [
    "Housing & Shelter",
    "Education & Training",
    "Employment & Entrepreneurship",
    "Business & Innovation",
    "Health & Wellness",
    "Environment & Energy",
    "Community & Nonprofits",
    "Research & Academia",
    "Indigenous Programs",
    "Public Safety & Emergency Services",
    "Agriculture & Rural Development",
    "Arts, Culture & Heritage",
    "Civic & Democratic Engagement"
]

🤖 Model Choice

I evaluated two approaches:

Approach | Verdict
Traditional ML (clustering) | Needs labeled data or manual review to map clusters onto categories; less semantic
BERT via Hugging Face (zero-shot) | ✅ Selected

Why zero-shot BERT?

  • No labeled dataset required
  • Strong semantic understanding out-of-the-box
  • Fast to implement and iterate

⚙️ Inference Pipeline

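# Assumes `classifier` is a Hugging Face zero-shot-classification pipeline
# and `df` is a pandas DataFrame with a 'text' column.
print("Running classification...")
predictions = []

for text in df['text']:
    # Labels come back sorted by score, highest first.
    result = classifier(text, candidate_labels=CATEGORIES)

    # Keep only the top-ranked label and its score.
    predictions.append({
        'predicted_category': result['labels'][0],
        'confidence_score': result['scores'][0]
    })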

Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.
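
The zero-shot pipeline in Hugging Face Transformers returns a dict whose `labels` and `scores` lists are sorted from most to least likely, which is why index 0 is the top prediction. Below is a hedged sketch of wrapping that loop so low-confidence rows get flagged for review. The helper names, the `needs_review` field, and the 0.5 threshold are illustrative assumptions; the post does not name the checkpoint, so `model_name` is left to the caller.

```python
def build_classifier(model_name: str):
    """Create a zero-shot pipeline for the given checkpoint.

    transformers is imported lazily so the rest of this sketch runs without it.
    """
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_name)


def classify_all(texts, classifier, labels, min_confidence=0.5):
    """Return one dict per text with the top label, its score, and a review flag.

    `classifier` can be any callable that accepts (text, candidate_labels=...)
    and returns a dict with score-sorted 'labels' and 'scores' lists.
    """
    rows = []
    for text in texts:
        result = classifier(text, candidate_labels=labels)
        top_label = result["labels"][0]
        top_score = result["scores"][0]
        rows.append({
            "predicted_category": top_label,
            "confidence_score": top_score,
            # Assumption: anything under the threshold deserves a manual look.
            "needs_review": top_score < min_confidence,
        })
    return rows
```

The returned rows can be turned into a DataFrame with `pd.DataFrame(rows)` and joined back onto `df`. With 13 candidate labels sharing the probability mass, top scores can be fairly low, so the threshold is worth tuning against a hand-checked sample.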


🧼 Data Quality

The source data was highly structured and clean, which meant:

  • Minimal preprocessing required
  • Faster iteration on modeling and pipeline integration
  • No time lost on data wrangling before getting to the interesting parts

📦 Next Steps

The pipeline is actively being extended:

  • 🗄️ Load Layer → Persist classified data in a database
  • 📊 Analytics Dashboard → Visualize funding trends by category, region, and time
  • ⏱️ Pipeline Orchestration → Automate ingestion + inference end-to-end
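
The load layer is still planned, so the following is only one possible shape for it: a sketch using SQLite from the Python standard library. The table name, column names, and the choice to store amounts as integer cents are all assumptions, not decisions taken from the project.

```python
import sqlite3

# Hypothetical schema: one row per grant, keyed by agreement number.
SCHEMA = """
CREATE TABLE IF NOT EXISTS grants (
    agreement_number   TEXT PRIMARY KEY,
    recipient          TEXT,
    amount_cents       INTEGER,
    predicted_category TEXT,
    confidence_score   REAL
)
"""

INSERT = """
INSERT OR REPLACE INTO grants
VALUES (:agreement_number, :recipient, :amount_cents,
        :predicted_category, :confidence_score)
"""


def load_rows(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Create the table if needed and upsert the classified rows."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT, rows)


def totals_by_category(conn: sqlite3.Connection) -> list[tuple]:
    """Total funding (in cents) per predicted category, sorted by category."""
    cur = conn.execute(
        "SELECT predicted_category, SUM(amount_cents) FROM grants "
        "GROUP BY predicted_category ORDER BY predicted_category"
    )
    return cur.fetchall()
```

Keying on the agreement number makes re-running the pipeline idempotent: a re-scraped grant overwrites its earlier row instead of duplicating it. A query like `totals_by_category` is the kind of aggregate the planned dashboard would sit on.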

💡 Key Takeaways

  1. Rust is a legit choice for ETL scraping, not just systems programming. In this project, ingestion was significantly faster than with the earlier Python workflow.
  2. Zero-shot BERT punches above its weight for classification tasks without labeled data. It's a great first-pass model.
  3. Modular pipeline design pays off early: separating extraction, transformation, and load made iteration much faster.
  4. Don't over-engineer: the right tool for each layer matters more than using a single stack.



Open to opportunities in Data Science, ML Engineering, and Data Engineering. Feel free to reach out at alisher213@outlook.com
