A National-Scale Giving Data Pipeline

How do you find a needle in a haystack the size of the internet? Philanthropic giving data is publicly available but buried within millions of unstructured annual report PDFs scattered across the web. We set out to build an automated data factory to find, process, and structure these documents at a national scale.

Engineering for a Haystack of Needles

Massive Scale: Processing petabytes of Common Crawl data is a logistical feat. The system had to be designed for immense parallelism and cost-efficiency to be viable.
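One way to keep a crawl of this size tractable is to filter the Common Crawl URL index for PDF captures first, then fan the resulting WARC pointers out to workers in small batches. The sketch below assumes that approach; the source does not say how PDFs were located. The function names are hypothetical. The field names (`mime-detected`, `filename`, `offset`, `length`) follow the published index record layout, and the batch size of 10 matches the per-call limit of SQS `SendMessageBatch`.

```python
import json
from itertools import islice
from typing import Iterable, Iterator

SQS_MAX_BATCH = 10  # SQS SendMessageBatch accepts at most 10 entries per call


def pdf_records(index_lines: Iterable[str]) -> Iterator[dict]:
    """Yield WARC pointers for successful PDF captures from index lines."""
    for line in index_lines:
        if not line.strip():
            continue
        # Each index line is "<urlkey> <timestamp> <json>"; only the JSON is needed.
        _, _, payload = line.strip().split(" ", 2)
        rec = json.loads(payload)
        mime = rec.get("mime-detected") or rec.get("mime")
        if rec.get("status") == "200" and mime == "application/pdf":
            yield {
                "url": rec["url"],
                "filename": rec["filename"],  # WARC file holding the capture
                "offset": int(rec["offset"]),  # byte range inside that file
                "length": int(rec["length"]),
            }


def batches(items: Iterable, size: int = SQS_MAX_BATCH) -> Iterator[list]:
    """Group work items into fixed-size batches, one per queue call."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk
```

Because each pointer is an independent byte range, workers can fetch and parse documents in parallel without coordinating with one another.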

Unstructured Data: PDFs are notoriously difficult to parse reliably. Extracting structured records—donor, recipient, amount, year—from prose requires sophisticated pattern matching and validation.
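As a minimal illustration of pattern matching plus validation, the sketch below pulls donor, recipient, amount, and year out of one sentence shape. The pattern, the `GiftRecord` type, and `extract_gifts` are hypothetical and cover only a single phrasing; a production extractor would need many patterns and would have to handle names containing periods or commas, which this one deliberately excludes so that matches do not run across sentences.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterator

# Hypothetical pattern: "<Donor> gave|donated|granted $<amount> to <Recipient> in <year>"
GIFT_RE = re.compile(
    r"(?P<donor>[A-Z][\w&' -]+?)\s+(?:gave|donated|granted)\s+"
    r"\$(?P<amount>\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\s+to\s+"
    r"(?P<recipient>[A-Z][\w&' -]+?)\s+in\s+(?P<year>(?:19|20)\d{2})\b"
)


@dataclass(frozen=True)
class GiftRecord:
    donor: str
    recipient: str
    amount: Decimal
    year: int


def extract_gifts(text: str) -> Iterator[GiftRecord]:
    """Yield one record per sentence that matches the gift pattern."""
    for m in GIFT_RE.finditer(text):
        amount = Decimal(m.group("amount").replace(",", ""))
        if amount <= 0:
            continue  # basic validation: drop zero-dollar matches
        yield GiftRecord(
            donor=m.group("donor").strip(),
            recipient=m.group("recipient").strip(),
            amount=amount,
            year=int(m.group("year")),
        )
```

Amounts are kept as `Decimal` rather than `float` so that dollar figures survive aggregation without rounding drift.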

Verifiable Accuracy: To be trustworthy, every extracted record needed to be validated and traceable back to a source document and a registered entity in the IRS Exempt Organizations Business Master File.
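The traceability requirement can be sketched as a record that is only emitted when the recipient matches a registered entity, and that always carries its source location. Everything here is an assumption for illustration: the `registry` is taken to be a dictionary built from the IRS master file that maps a normalized organization name to its EIN, and exact matching on normalized names stands in for whatever matching logic the real system used.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and a leading 'the' so name variants compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned[4:] if cleaned.startswith("the ") else cleaned


@dataclass(frozen=True)
class VerifiedRecord:
    recipient: str
    ein: str         # EIN of the matched registry entry
    source_url: str  # document the record was extracted from
    page: int        # page within that document


def verify(
    recipient: str, source_url: str, page: int, registry: Dict[str, str]
) -> Optional[VerifiedRecord]:
    """Return a traceable record only if the recipient matches a registered entity."""
    ein = registry.get(normalize_name(recipient))
    if ein is None:
        return None  # unmatched records are rejected rather than guessed
    return VerifiedRecord(recipient, ein, source_url, page)
```

Rejecting unmatched names trades recall for precision, which fits a dataset where each row must be defensible.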

Deployability: The entire cloud infrastructure was defined as code (IaC), allowing the whole system to be licensed and securely deployed into a client's own cloud environment with a single command.

30 Million+

Giving Records Extracted

1 Million+

Donor Entities Identified

Fully Automated

End-to-End Process

hcl

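# Infrastructure-as-Code ensures the system is repeatable and deployable.
resource "aws_lambda_function" "pdf_processor" {
  function_name = "pdf-processor-prod"
  handler       = "process_pdf.handler"
  runtime       = "python3.9"
  memory_size   = 2048
  timeout       = 300
  # Excerpt: the required role and deployment package arguments are omitted.
}

# Triggered by a new PDF landing in an S3 bucket. The trigger is declared on
# the bucket; an aws_lambda_permission for s3.amazonaws.com is also required.
resource "aws_s3_bucket_notification" "raw_pdfs" {
  bucket = aws_s3_bucket.raw_pdfs.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.pdf_processor.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }
}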
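
On the Python side, the function named by `process_pdf.handler` receives an S3 event describing the new object. The following is a minimal sketch of such an entry point, not the project's actual handler: `pdf_objects` is a hypothetical helper, and the download and parsing step is left as a placeholder. The event layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`) is the standard S3 notification structure.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def pdf_objects(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for the PDF objects named in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys in S3 notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(s3["object"]["key"])
        if key.lower().endswith(".pdf"):
            objects.append((s3["bucket"]["name"], key))
    return objects


def handler(event, context):
    """Entry point matching the `process_pdf.handler` setting."""
    objects = pdf_objects(event)
    for bucket, key in objects:
        # Placeholder: download, parse, and enqueue the document here.
        print(f"processing s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Keeping the event parsing in a pure function makes it testable without AWS credentials or a deployed bucket.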

Core Technologies

AWS (S3, Lambda, SQS), Python, Terraform, Docker, PostgreSQL, Common Crawl
