# Grants ETL Pipeline: Rust + Transformer-Based Classification

## Overview
I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:
- High-performance data extraction using Rust
- Semantic classification using BERT (zero-shot)
- Structured output ready for downstream analytics and dashboarding
This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.
## Extraction Layer (Rust)

### The Problem

The Grants Canada portal has no accessible API, only an HTML-rendered search interface. I needed a way to extract structured data at scale.

### The Solution
I built a custom scraper targeting the paginated search endpoint:
```
https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc
```

### Key Decisions
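The `{}` in the endpoint is a page-number placeholder. As a minimal sketch (in Python for illustration; the actual scraper is Rust), the crawl plan amounts to expanding that template into concrete URLs. The `page_urls` helper and the page count are illustrative, not part of the project:

```python
# Sketch: expanding the paginated search endpoint into concrete page URLs.
# The URL template comes from the post; the helper itself is hypothetical.
BASE = "https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc"

def page_urls(n_pages: int) -> list[str]:
    """Return the first n_pages search URLs, 1-indexed like the portal."""
    return [BASE.format(page) for page in range(1, n_pages + 1)]
```

Each URL is then fetched and parsed; sorting by `agreement_start_date desc` means the newest agreements arrive first, so incremental runs can stop early once known records reappear.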
I initially started with Python but switched to Rust for performance at scale. The Rust scraper uses:

- `scraper` for HTML parsing
- `csv` for structured output

It is designed to handle large-scale ingestion efficiently while keeping memory usage and runtime in check.
### Outcome

- Successfully extracted structured grant data into CSV
- Significantly faster ingestion than the prior Python-based workflow
### Sample Record

- **Agreement:** European Space Agency (ESA)'s Space Weather Training Course
- **Agreement Number:** 25COBLLAMY
- **Date Range:** Mar 11, 2026 – Mar 27, 2026
- **Description:** Supports Canadian students attending international space training events
- **Recipient:** Canadian Space Agency
- **Amount:** $1,000.00
- **Location:** La Prairie, Quebec, CA
## Transformation + Classification

### Objective

Categorize grants into meaningful sectors for analytics and discovery, making the data explorable beyond raw fields.
### Categories

```python
CATEGORIES = [
    "Housing & Shelter",
    "Education & Training",
    "Employment & Entrepreneurship",
    "Business & Innovation",
    "Health & Wellness",
    "Environment & Energy",
    "Community & Nonprofits",
    "Research & Academia",
    "Indigenous Programs",
    "Public Safety & Emergency Services",
    "Agriculture & Rural Development",
    "Arts, Culture & Heritage",
    "Civic & Democratic Engagement",
]
```
### Model Choice
I evaluated two approaches:
| Approach | Verdict |
|---|---|
| Traditional ML (clustering) | Requires labeled data, less semantic |
| BERT via Hugging Face (zero-shot) | Selected |

**Why zero-shot BERT?**
- No labeled dataset required
- Strong semantic understanding out-of-the-box
- Fast to implement and iterate
### Inference Pipeline

```python
from transformers import pipeline

# Zero-shot classifier via Hugging Face's pipeline API.
# `df` is the DataFrame of scraped grants; CATEGORIES is defined above.
classifier = pipeline("zero-shot-classification")

print("Running classification...")
predictions = []
for text in df["text"]:
    result = classifier(text, candidate_labels=CATEGORIES)
    predictions.append({
        "predicted_category": result["labels"][0],
        "confidence_score": result["scores"][0],
    })
```
Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.
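That top-label-plus-score shape lends itself to a simple post-processing step. As a sketch, a confidence floor can route low-certainty predictions to a catch-all bucket; the `pick_category` helper, the 0.5 threshold, and the "Uncategorized" fallback are illustrative assumptions, not part of the project:

```python
# Sketch: post-processing zero-shot outputs with a confidence floor.
# `result` mirrors the Hugging Face zero-shot pipeline output shape
# ({"labels": [...], "scores": [...]}, sorted best-first). The 0.5
# threshold and "Uncategorized" fallback are hypothetical choices.
def pick_category(result: dict, threshold: float = 0.5) -> dict:
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    if top_score < threshold:
        top_label = "Uncategorized"
    return {"predicted_category": top_label, "confidence_score": top_score}
```

Keeping the raw score alongside the label also makes it easy to revisit borderline records later, or to tune the threshold against a small hand-checked sample.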
## Data Quality
The source data was highly structured and clean, which meant:
- Minimal preprocessing required
- Faster iteration on modeling and pipeline integration
- No time lost on data wrangling before getting to the interesting parts
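Even with clean source data, a cheap sanity check before inference is worth having. A minimal sketch, assuming field names that follow the sample record above (the required set and helper name are illustrative):

```python
# Sketch: a lightweight record check of the kind "clean source data" makes
# almost unnecessary here. Field names follow the sample record; the
# REQUIRED set is an illustrative choice, not the project's schema.
REQUIRED = ("agreement", "recipient", "amount", "description")

def missing_fields(record: dict) -> list[str]:
    """Return required fields that are absent or blank in a record."""
    return [f for f in REQUIRED if not str(record.get(f, "")).strip()]
```

Records failing the check can be logged and skipped rather than silently producing low-quality classifications.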
## Next Steps

The pipeline is actively being extended:

- **Load Layer**: persist classified data in a database
- **Analytics Dashboard**: visualize funding trends by category, region, and time
- **Pipeline Orchestration**: automate ingestion + inference end-to-end
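For the load layer, even the standard library gets surprisingly far. A minimal sketch using stdlib `sqlite3`; the table name, columns, and helper are assumptions about a possible schema, not the project's actual design:

```python
# Sketch of a possible load layer using stdlib sqlite3. Table name,
# columns, and the load_rows helper are hypothetical, not the project's.
import sqlite3

def load_rows(rows, db_path=":memory:"):
    """Create the grants table if needed and bulk-insert classified rows."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS grants (
               agreement TEXT,
               recipient TEXT,
               amount REAL,
               predicted_category TEXT,
               confidence_score REAL
           )"""
    )
    conn.executemany("INSERT INTO grants VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Swapping SQLite for Postgres later would mostly be a connection-string change, which keeps the dashboard work decoupled from storage decisions.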
## Key Takeaways

- Rust is a legit choice for ETL scraping, not just systems programming. The performance gains over Python are real and measurable.
- Zero-shot BERT punches above its weight for classification tasks without labeled data. It's a great first-pass model.
- Modular pipeline design pays off early: separating extraction, transformation, and load made iteration much faster.
- Don't over-engineer: the right tool for each layer matters more than using a single stack.
## Links

- GitHub: github.com/Sher213/GrantsInvestments
Open to opportunities in Data Science, ML Engineering, and Data Engineering β feel free to reach out at alisher213@outlook.com