1.23 million rows of enterprise SaaS data. Five different date formats. Dollar signs in numeric columns. 109,726 dirty rows removed. Seven anomalies flagged per thousand transactions. Then vectorized and made queryable in plain English. Total pipeline time: under 2 seconds.
Six stages. One machine. No cluster. No cloud database fees.
Three years of operational data from a fictional Series A SaaS company — the exact profile of a typical new client engagement.
The grime — three systems that were never designed to talk to each other:
2024-03-15 03/15/2024 15-MAR-2024 March 15, 2024 20240315
$1,200.00 USD 1200 1,200 1200
CA California Calif. ca CALIFORNIA
line_total ≠ quantity × unit_price
Automated quality score across every table — nulls, format inconsistency, duplicate rate. This is the first deliverable clients receive.
Type coercion, deduplication, date normalization, FK repair, amount parsing — all in a single SQL query per table.
-- Clean transactions: parse amounts, normalize dates, dedup, recalculate totals CREATE TABLE transactions_clean AS WITH amount_parsed AS ( SELECT *, TRY_CAST(REGEXP_REPLACE(TRIM(line_total), '[\$,USD ]','','g') AS DOUBLE) AS line_total_d, TRY_CAST(TRIM(quantity) AS INTEGER) AS quantity_i FROM transactions_raw ), deduped AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY transaction_date) AS rn FROM amount_parsed WHERE line_total_d > 0 AND quantity_i > 0 ) SELECT transaction_id, TRY_STRPTIME(transaction_date, ['%Y-%m-%d','%m/%d/%Y','%d-%b-%Y','%B %d, %Y']) AS transaction_date, -- Recalculate if calc error detected (|reported - computed| > 2%) CASE WHEN ABS(line_total_d - quantity_i * unit_price_d) > 0.02 * unit_price_d THEN ROUND(quantity_i * unit_price_d, 2) ELSE line_total_d END AS line_total, UPPER(REGEXP_REPLACE(currency, '[^A-Za-z]','','g')) AS currency FROM deduped WHERE rn = 1 AND UPPER(TRIM(currency)) IN ('USD','EUR','GBP')
Five analytical queries across 962,534 clean transaction rows. This is the benchmark inside the demo.
| Query | Result rows | Time |
|---|---|---|
| Annual revenue by year (2022–2025) | 9 | 10 ms |
| Revenue by product — top 10 | 10 | 8 ms |
| Monthly revenue trend 2024 | 16 | 4 ms |
| Top 10 customers by lifetime value | 10 | 19 ms |
| Invoice aging — overdue exposure by status | 10 | 3 ms |
| Total — 5 analytical queries, 962,534 rows | — | 43 ms |
| Product | Line Total | Z-Score | Flag | Action |
|---|---|---|---|---|
| Analytics Add-on | $99,975.20 | 3.26 | high_outlier | Manual review — possible duplicate billing |
| Enterprise Plan | $98,430.00 | 3.21 | high_outlier | Verify pricing tier with account manager |
| Custom Integration | $97,680.00 | 3.18 | high_outlier | Check SOW — may be legitimate large order |
| White-label License | $96,900.00 | 3.15 | high_outlier | Cross-reference contract signed amount |
| Security Module | $95,200.00 | 3.10 | high_outlier | Validate against approved rate card |
The cleaned data summaries, anomaly reports, and 200 enterprise contracts are automatically ingested into the vector store via ra-watch — no manual steps. Mistral NeMo 12B runs locally on the GCP A100. Zero OpenAI fees. Zero data egress.
Based on the provided contracts, there is one contract where the governing law is Delaware:
Client: Ironside Solutions | Liability Cap: $300,000 | File: contract_0041f5d4_20240831.txt
Two contracts contain HIPAA compliance clauses:
Ironside Solutions — HIPAA: "This engagement is structured to support HIPAA data handling requirements. All processing occurs on dedicated infrastructure." | contract_0041f5d4_20240831.txt
Radiant Analytics — HIPAA: "This engagement is structured to support HIPAA data handling requirements. All processing occurs on dedicated infrastructure." | contract_037146f7_20220504.txt
⚠ Ironside Solutions — auto-renews, 90 days notice required | contract_0131771e_20240417.txt | Billing risk
⚠ Yield Ventures — auto-renews, 60 days notice required | contract_01e6ee9a_20220103.txt | Billing risk
⚠ Radiant Analytics — auto-renews, 30 days notice required | contract_037146f7_20220504.txt | Billing risk
No billing risk: Ironside Solutions (contract_00d8c046), Quantum Innovations (contract_037f5347), Prism Consulting (contract_03908f5a) — renewal requires new signed SOW.
ra-watch daemon monitors the client's GCS bucket every 60 seconds.
When a new contract, report, or document lands, it is automatically chunked, embedded via gte-large-en-v1.5 (ONNX, CUDA),
and available for Q&A within 60 seconds. No manual re-indexing. No downtime. No human in the loop.
Every new client engagement starts with the same pipeline — audit your data, show you exactly what's broken, clean it, analyse it, and if you want AI Q&A over your documents, deploy the RAG stack on a dedicated GCP A100 I manage. You keep the data. I keep the code protected. Starts with a free 30-minute data audit call.
Book a Free Data Audit View Pricing