# Grants ETL Pipeline: Rust + Transformer-Based Classification

## Overview
I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:
- High-performance data extraction using Rust
- Semantic classification using BERT (zero-shot)
- Structured output ready for downstream analytics and dashboarding
This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.
## Extraction Layer (Rust)

### The Problem

The Grants Canada portal has no accessible API, only an HTML-rendered search interface. I needed a way to extract structured data at scale.

### The Solution
I built a custom scraper targeting the paginated search endpoint:
```
https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc
```

### Key Decisions
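The `{}` in the endpoint is a page-number placeholder. As a minimal sketch (in Python for illustration; the actual scraper is Rust), the crawl plan amounts to expanding that template into concrete URLs. The `page_urls` helper and the page count are illustrative, not part of the project:

```python
# Sketch: expanding the paginated search endpoint into concrete page URLs.
# The URL template comes from the post; the helper itself is hypothetical.
BASE = "https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc"

def page_urls(n_pages: int) -> list[str]:
    """Return the first n_pages search URLs, 1-indexed like the portal."""
    return [BASE.format(page) for page in range(1, n_pages + 1)]
```

Each URL is then fetched and parsed; sorting by `agreement_start_date desc` means the newest agreements arrive first, so incremental runs can stop early once known records reappear.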
I initially started with Python but switched to Rust for performance at scale. The Rust scraper uses:

- `scraper` for HTML parsing
- `csv` for structured output

It is designed to handle large-scale ingestion efficiently while keeping memory usage and runtime in check.
### Outcome

- Successfully extracted structured grant data into CSV
- Significantly faster ingestion than the prior Python-based workflow
### Sample Record

- **Agreement:** European Space Agency (ESA)'s Space Weather Training Course
- **Agreement Number:** 25COBLLAMY
- **Date Range:** Mar 11, 2026 – Mar 27, 2026
- **Description:** Supports Canadian students attending international space training events
- **Recipient:** Canadian Space Agency
- **Amount:** $1,000.00
- **Location:** La Prairie, Quebec, CA
## Transformation + Classification

### Objective

Categorize grants into meaningful sectors for analytics and discovery, making the data explorable beyond raw fields.
### Categories

```python
CATEGORIES = [
    "Housing & Shelter",
    "Education & Training",
    "Employment & Entrepreneurship",
    "Business & Innovation",
    "Health & Wellness",
    "Environment & Energy",
    "Community & Nonprofits",
    "Research & Academia",
    "Indigenous Programs",
    "Public Safety & Emergency Services",
    "Agriculture & Rural Development",
    "Arts, Culture & Heritage",
    "Civic & Democratic Engagement",
]
```
### Model Choice
I evaluated two approaches:
| Approach | Verdict |
|---|---|
| Traditional ML (clustering) | Requires labeled data, less semantic |
| BERT via Hugging Face (zero-shot) | Selected |

**Why zero-shot BERT?**
- No labeled dataset required
- Strong semantic understanding out-of-the-box
- Fast to implement and iterate
### Inference Pipeline

```python
from transformers import pipeline

# Zero-shot classifier via Hugging Face's pipeline API.
# `df` is the DataFrame of scraped grants; CATEGORIES is defined above.
classifier = pipeline("zero-shot-classification")

print("Running classification...")
predictions = []
for text in df["text"]:
    result = classifier(text, candidate_labels=CATEGORIES)
    predictions.append({
        "predicted_category": result["labels"][0],
        "confidence_score": result["scores"][0],
    })
```
Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.
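That top-label-plus-score shape lends itself to a simple post-processing step. As a sketch, a confidence floor can route low-certainty predictions to a catch-all bucket; the `pick_category` helper, the 0.5 threshold, and the "Uncategorized" fallback are illustrative assumptions, not part of the project:

```python
# Sketch: post-processing zero-shot outputs with a confidence floor.
# `result` mirrors the Hugging Face zero-shot pipeline output shape
# ({"labels": [...], "scores": [...]}, sorted best-first). The 0.5
# threshold and "Uncategorized" fallback are hypothetical choices.
def pick_category(result: dict, threshold: float = 0.5) -> dict:
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    if top_score < threshold:
        top_label = "Uncategorized"
    return {"predicted_category": top_label, "confidence_score": top_score}
```

Keeping the raw score alongside the label also makes it easy to revisit borderline records later, or to tune the threshold against a small hand-checked sample.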
## Data Quality
The source data was highly structured and clean, which meant:
- Minimal preprocessing required
- Faster iteration on modeling and pipeline integration
- No time lost on data wrangling before getting to the interesting parts
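Even with clean source data, a cheap sanity check before inference is worth having. A minimal sketch, assuming field names that follow the sample record above (the required set and helper name are illustrative):

```python
# Sketch: a lightweight record check of the kind "clean source data" makes
# almost unnecessary here. Field names follow the sample record; the
# REQUIRED set is an illustrative choice, not the project's schema.
REQUIRED = ("agreement", "recipient", "amount", "description")

def missing_fields(record: dict) -> list[str]:
    """Return required fields that are absent or blank in a record."""
    return [f for f in REQUIRED if not str(record.get(f, "")).strip()]
```

Records failing the check can be logged and skipped rather than silently producing low-quality classifications.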
## Next Steps

The pipeline is actively being extended:

- **Load Layer**: persist classified data in a database
- **Analytics Dashboard**: visualize funding trends by category, region, and time
- **Pipeline Orchestration**: automate ingestion + inference end-to-end
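For the load layer, even the standard library gets surprisingly far. A minimal sketch using stdlib `sqlite3`; the table name, columns, and helper are assumptions about a possible schema, not the project's actual design:

```python
# Sketch of a possible load layer using stdlib sqlite3. Table name,
# columns, and the load_rows helper are hypothetical, not the project's.
import sqlite3

def load_rows(rows, db_path=":memory:"):
    """Create the grants table if needed and bulk-insert classified rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS grants (
               agreement TEXT,
               recipient TEXT,
               amount REAL,
               predicted_category TEXT,
               confidence_score REAL
           )"""
    )
    conn.executemany("INSERT INTO grants VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Swapping SQLite for Postgres later would mostly be a connection-string change, which keeps the dashboard work decoupled from storage decisions.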
## Key Takeaways

- Rust is a legit choice for ETL scraping, not just systems programming. The performance gains over Python are real and measurable.
- Zero-shot BERT punches above its weight for classification tasks without labeled data. It's a great first-pass model.
- Modular pipeline design pays off early: separating extraction, transformation, and load made iteration much faster.
- Don't over-engineer: the right tool for each layer matters more than using a single stack.
## Links

- GitHub: github.com/Sher213/GrantsInvestments
Open to opportunities in Data Science, ML Engineering, and Data Engineering β feel free to reach out at alisher213@outlook.com