Oct 2024 – Present
A local-LLM data pipeline that corroborates and migrates 3,000 to 5,000 project records for a UK fire and security firm: five legacy sources reconciled into one record, validated by on-prem AI, and submitted through the destination CRM web forms under human review.
Tech Stack
Tags
About this project
The challenge
Marshall Fire needed roughly 3,000 to 5,000 project records moved into a new system, and no single source held the truth. The data was spread across a project-management Excel spreadsheet, an Azure SQL database, and a pile of SharePoint documents (emails, Word files, PDFs), and the destination system gave no direct database access, so every record had to go in through its web forms. Done by hand, each record meant cross-checking several sources, resolving where they disagreed, and re-keying the result into a multi-step form. At that volume it is hundreds of hours of careful, error-prone work.
The approach
I built a pipeline that does the corroboration and submission automatically but keeps a person as the final gate on every record. The parts that made it non-trivial are the cross-source matching, and the fact that the AI and the human are two separate checkpoints rather than one.
1. Ingest: five sources into one
- Project-management Excel spreadsheet
- Azure SQL backup
- Azure SQL live scrape for records added after the backup cut-off
- SharePoint files: emails, Word documents, PDFs
- A scrape of the new system to detect what has already been migrated
All five land in a single local SQLite database, the working store for everything downstream.
2. Match and reconcile
rapidfuzz token-scoring matches the same project across Excel, Azure SQL and SharePoint, and a three-way pass flags where the sources disagree. This is the step that turns five noisy inputs into one candidate record per project.
3. Local LLM (on-prem)
Qwen 2.5 7B runs locally through Ollama, on-premises, so none of the client's data leaves the machine. It runs two advisory passes: a judge and veto on the Azure SQL matches, and a verdict on fee lines before submission.
4. Validation and normalisation
Pydantic schemas validate every record, and canonical-value maps fold messy inputs (status, sector, project type) into the exact values the destination system will accept.
5. Human review
A Flask and HTMX review queue puts a person in control: a SharePoint evidence panel alongside each record, per-field overrides, and explicit approval before anything is submitted. The LLM is advisory; the human is the gate.
6. Submission
Playwright drives the destination's web forms in order, Contact then Company then Project, because direct database access was not available. It runs a list-page duplicate probe before each insert and re-queries after Save and Exit to catch silent failures, so the automation does not quietly skip or double-enter records. A post-submission audit and live-patch CLI verifies what landed and fixes records after the fact, so it is not fire-and-forget.
Outcome
The pipeline covers roughly 3,000 to 5,000 records that would otherwise have to be corroborated and submitted entirely by hand. Running the model locally kept a confidential migration confidential, and the two-gate design, an AI veto plus human approval, gave the speed of automation without giving up control over what actually got written into the destination system.