A National-Scale Giving Data Pipeline
How do you find a needle in a haystack the size of the internet? Philanthropic giving data is publicly available but buried within millions of unstructured annual report PDFs scattered across the web. We set out to build an automated data factory to find, process, and structure these documents at a national scale.
Engineering for a Haystack of Needles
Massive Scale: Processing petabytes of Common Crawl data is a logistical feat. The system had to be designed for immense parallelism and cost-efficiency to be viable.
Unstructured Data: PDFs are notoriously difficult to parse reliably. Extracting structured records (donor, recipient, amount, year) from prose requires sophisticated pattern matching and validation.
Verifiable Accuracy: To be trustworthy, every extracted record had to be validated and traceable back to both a source document and a registered entity in the IRS Exempt Organizations Business Master File (EO BMF).
Deployability: The entire cloud infrastructure was defined as code (IaC), allowing the whole system to be licensed and securely deployed into a client's own cloud environment with a single command.
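The extraction and validation challenges above can be illustrated with a minimal Python sketch. It is not the production code: the sentence pattern, the `GivingRecord` fields, the `normalize` helper, and the in-memory `registry` (a stand-in for a lookup against the IRS master file, keyed by normalized organization name) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for sentences such as
# "In 2021, the Acme Foundation gave $25,000 to River City Food Bank."
GIFT_RE = re.compile(
    r"In (?P<year>\d{4}), (?:the )?(?P<donor>[A-Z][\w&' -]+?) "
    r"(?:gave|donated|granted|awarded) \$(?P<amount>\d[\d,]*(?:\.\d{2})?) "
    r"to (?:the )?(?P<recipient>[A-Z][\w&' -]+?)[.;]"
)


@dataclass(frozen=True)
class GivingRecord:
    donor: str
    recipient: str
    recipient_ein: str  # identifier of the registered entity
    amount_usd: float
    year: int
    source_url: str     # keeps the record traceable to its document


def normalize(name: str) -> str:
    """Lowercase a name and strip punctuation so lookups are tolerant."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def extract_records(text: str, source_url: str, registry: dict) -> list:
    """Extract gift records, keeping only those with a registered recipient.

    `registry` maps normalized organization names to EINs (assumed shape).
    """
    records = []
    for m in GIFT_RE.finditer(text):
        ein = registry.get(normalize(m["recipient"]))
        if ein is None:
            continue  # unverifiable recipient: drop rather than guess
        records.append(
            GivingRecord(
                donor=m["donor"],
                recipient=m["recipient"],
                recipient_ein=ein,
                amount_usd=float(m["amount"].replace(",", "")),
                year=int(m["year"]),
                source_url=source_url,
            )
        )
    return records
```

Dropping records whose recipient cannot be matched trades recall for precision, which fits the stated goal that every record be validated; a real system would likely add fuzzy matching and many more sentence patterns.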
30 Million+: Giving Records Extracted
1 Million+: Donor Entities Identified
Fully Automated: End-to-End Process
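# Infrastructure-as-Code ensures the system is repeatable and deployable.
resource "aws_lambda_function" "pdf_processor" {
  function_name = "pdf-processor-prod"
  handler       = "process_pdf.handler"
  runtime       = "python3.9"
  memory_size   = 2048
  timeout       = 300
  role          = aws_iam_role.pdf_processor.arn # assumed IAM role, defined elsewhere
  filename      = "process_pdf.zip"              # assumed deployment package
}
# Triggered by a new PDF landing in an S3 bucket. S3 also needs an
# aws_lambda_permission (not shown) allowing it to invoke the function.
resource "aws_s3_bucket_notification" "raw_pdfs" {
  bucket = aws_s3_bucket.raw_pdfs.id
  lambda_function {
    lambda_function_arn = aws_lambda_function.pdf_processor.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }
}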
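For context, the `process_pdf.handler` entry point named in the configuration might begin along these lines. This is a hedged sketch, not the deployed code: the `pdf_objects` helper and the shape of the return value are assumptions, and the actual download and parsing steps are omitted. It relies only on the standard S3 event notification layout, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def pdf_objects(event: dict) -> list:
    """Return (bucket, key) pairs for the PDF objects named in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # S3 URL-encodes keys in notifications (spaces arrive as "+").
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key.lower().endswith(".pdf"):
            pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Lambda entry point: identify the PDFs to process in this invocation."""
    targets = pdf_objects(event)
    # Downloading each object and extracting records would happen here.
    return {"pdfs": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Keeping the event parsing in a pure function makes it easy to unit test without AWS credentials; only the omitted download step would need a client such as boto3.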