Live Pipeline Demo — Real Data, Real Numbers

You Cannot Build AI on a Data Swamp.
We Clean It First.

1.23 million rows of enterprise SaaS data. Five different date formats. Dollar signs in numeric columns. 109,726 dirty rows removed. About eight anomalies flagged per thousand clean transactions. Then vectorized and made queryable in plain English. Total pipeline time: under 2 seconds.

1,231,897
Raw rows ingested
1,122,171
Clean rows output
109,726
Dirty rows removed
43 ms
5 analytical queries
7,568
Anomalies detected

The Sovereign Pipeline End to End

Five stages. One machine. No cluster. No cloud database fees.

Stage 1
🗑️
Dirty Data
3 tables · 200 contracts
Stage 2
🔍
Audit
Quality scores · Grime map
Stage 3
🧹
Clean
SQL · Parquet out
Stage 4
📊
Analyse
5 queries · Anomaly detect
Stage 5
🤖
RAG
Embed · Index · Query

Stage 1 — The Data Acme Analytics Corp · Series A SaaS

Three years of operational data from a fictional Series A SaaS company — the exact profile of a typical new client engagement.

Customers

90,950
rows · 15.1 MB · CRM export

Transactions

1,025,447
rows · 155.8 MB · Stripe export

Invoices

115,500
rows · 18.6 MB · QuickBooks export

Contracts

200
plain-text enterprise agreements

The grime — three systems that were never designed to talk to each other:

5 date formats in one column: 2024-03-15 · 03/15/2024 · 15-MAR-2024 · March 15, 2024 · 20240315
Amount fields as strings: $1,200.00 · USD 1200 · 1,200 · 1200
State field, 6 representations, including: CA · California · Calif. · ca · CALIFORNIA
Duplicate rows from re-imports: 6.5% of customers and 22.4% of invoices duplicated across system exports
Line total calculation errors: ~4% of transactions have line_total ≠ quantity × unit_price
Orphaned foreign keys: invoice customer_id values that exist in the invoice system but not in the CRM
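
To make the grime concrete, here is a minimal Python sketch of normalizers for the date, amount, and state variants listed above. It is an illustration, not the pipeline's own code: the function names and the `STATE_ALIASES` lookup are hypothetical, and the month-name parsing assumes an English locale.

```python
import re
from datetime import date, datetime
from typing import Optional

# The five date layouts listed above, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d, %Y", "%Y%m%d")

# Hypothetical alias table; a real one would cover every state.
STATE_ALIASES = {"ca": "CA", "california": "CA", "calif.": "CA", "calif": "CA"}


def parse_date(raw: str) -> Optional[date]:
    """Return a date if any known layout matches, else None."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(raw: str) -> Optional[float]:
    """Drop currency symbols, codes, and thousands separators, then cast."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_state(raw: str) -> Optional[str]:
    """Map a free-text state value to its two-letter code, else None."""
    return STATE_ALIASES.get(raw.strip().lower())
```

Values that match no known layout come back as `None` rather than raising, so they can be counted in the audit instead of silently dropped.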

Stage 2 — Data Quality Audit Before Cleaning

Automated quality score across every table — nulls, format inconsistency, duplicate rate. This is the first deliverable clients receive.

Customers  53.6 / 100

Null / empty
1.9%
Date format errors
30.2%
Amount format errors
12.7%
Duplicate rows
6.5%

Transactions  64.6 / 100

Null / empty
0.7%
Date format errors
32.0%
Amount format errors
17.5%
Calc errors (total≠qty×price)
~4.0%

Invoices  21.9 / 100  — worst table

Null / empty
1.0%
Date format errors
30.0%
Amount format errors
15.7%
Duplicate invoice numbers
22.4%
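
The component rates in these scorecards can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: a date counts as a format error when it is not already ISO `YYYY-MM-DD`, a duplicate is any repeat of the key column, and the column names are hypothetical. The weighting that turns the rates into a score out of 100 is not stated on this page, so the sketch stops at the rates.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def audit_rates(rows: List[Dict[str, Optional[str]]], key: str, date_col: str) -> Dict[str, float]:
    """Return null, date-format-error, and duplicate rates as percentages."""
    total = len(rows)
    if total == 0:
        return {"null_pct": 0.0, "date_format_error_pct": 0.0, "duplicate_pct": 0.0}

    # Null rate is measured per cell, across every column.
    cells = [value for row in rows for value in row.values()]
    null_cells = sum(1 for v in cells if v is None or v.strip() == "")

    # Assumption: any non-empty date that is not ISO counts as a format error.
    bad_dates = sum(
        1 for row in rows
        if row.get(date_col) and not ISO_DATE.match(row[date_col].strip())
    )

    # Every occurrence of a key after the first counts as a duplicate row.
    key_counts = Counter(row[key] for row in rows)
    duplicate_rows = sum(n - 1 for n in key_counts.values())

    return {
        "null_pct": 100 * null_cells / len(cells),
        "date_format_error_pct": 100 * bad_dates / total,
        "duplicate_pct": 100 * duplicate_rows / total,
    }
```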

Stage 3 — Cleaning SQL

Type coercion, deduplication, date normalization, FK repair, amount parsing — all in a single SQL query per table.

-- Clean transactions: parse amounts, normalize dates, dedup, recalculate totals
CREATE TABLE transactions_clean AS
WITH
  amount_parsed AS (
    SELECT *,
      -- Strip currency symbols, codes, and thousands separators before casting
      TRY_CAST(REGEXP_REPLACE(TRIM(line_total), '[\$,USD ]','','g') AS DOUBLE) AS line_total_d,
      -- unit_price is assumed to arrive as a raw string, like line_total
      TRY_CAST(REGEXP_REPLACE(TRIM(unit_price), '[\$,USD ]','','g') AS DOUBLE) AS unit_price_d,
      TRY_CAST(TRIM(quantity) AS INTEGER) AS quantity_i
    FROM transactions_raw
  ),
  deduped AS (
    -- Keep one row per transaction_id; rows with unparseable or non-positive values are dropped
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY transaction_date) AS rn
    FROM amount_parsed WHERE line_total_d > 0 AND quantity_i > 0
  )
SELECT
  transaction_id,
  -- Try each of the five observed date formats in turn
  TRY_STRPTIME(transaction_date, ['%Y-%m-%d','%m/%d/%Y','%d-%b-%Y','%B %d, %Y','%Y%m%d']) AS transaction_date,
  -- Recalculate if calc error detected (|reported - computed| > 2% of computed)
  CASE WHEN ABS(line_total_d - quantity_i * unit_price_d) > 0.02 * ABS(quantity_i * unit_price_d)
    THEN ROUND(quantity_i * unit_price_d, 2)
    ELSE line_total_d
  END AS line_total,
  UPPER(REGEXP_REPLACE(currency, '[^A-Za-z]','','g')) AS currency
FROM deduped WHERE rn = 1
  AND UPPER(TRIM(currency)) IN ('USD','EUR','GBP')
53.6 / 64.6 / 21.9
Quality scores — before
customers / transactions / invoices
100 / 100 / 100
Quality scores — after
typed · deduped · normalized

Customers cleaned

73,163
from 90,950 raw  ·  −17,787 removed  ·  74 ms

Transactions cleaned

962,534
from 1,025,447 raw  ·  −62,913 removed  ·  430 ms

Invoices cleaned

86,474
from 115,500 raw  ·  −29,026 removed  ·  84 ms
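
The per-table counts above reconcile with the headline totals. A short Python check, using only the figures printed on this page (the dictionary layout and function names are illustrative):

```python
from typing import Dict, Tuple

# (raw rows, clean rows) per table, as reported above.
STAGE3_COUNTS: Dict[str, Tuple[int, int]] = {
    "customers": (90_950, 73_163),
    "transactions": (1_025_447, 962_534),
    "invoices": (115_500, 86_474),
}


def reconcile(tables: Dict[str, Tuple[int, int]]) -> Tuple[int, int, int]:
    """Return (raw total, clean total, rows removed) across all tables."""
    raw_total = sum(raw for raw, _ in tables.values())
    clean_total = sum(clean for _, clean in tables.values())
    return raw_total, clean_total, raw_total - clean_total


def removal_pct(raw: int, clean: int) -> float:
    """Share of raw rows removed, as a percentage rounded to one decimal."""
    return round(100 * (raw - clean) / raw, 1)
```

`reconcile(STAGE3_COUNTS)` returns `(1231897, 1122171, 109726)`, matching the ingested, clean, and removed figures at the top of the page. By the same arithmetic, cleaning removes about 19.6% of customer rows, 6.1% of transaction rows, and 25.1% of invoice rows.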

Stage 4 — Analysis 5 Queries · 43 ms Total

Five analytical queries across 962,534 clean transaction rows. This is the benchmark inside the demo.

Query  |  Result rows  |  Time
Annual revenue by year (2022–2025)  |  9  |  10 ms
Revenue by product, top 10  |  10  |  8 ms
Monthly revenue trend 2024  |  16  |  4 ms
Top 10 customers by lifetime value  |  10  |  19 ms
Invoice aging, overdue exposure by status  |  10  |  3 ms
Total (5 analytical queries, 962,534 rows)  |    |  43 ms
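
For a sense of what one of these queries looks like, here is a hedged sketch of the annual-revenue query run against a toy table. The page does not name the query engine, so SQLite from the Python standard library stands in, and the three sample rows are made up for illustration. Only the `transactions_clean` table name and its `transaction_date` and `line_total` columns come from the cleaning SQL above.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for transactions_clean with toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions_clean ("
        "transaction_id TEXT, transaction_date TEXT, line_total REAL)"
    )
    conn.executemany(
        "INSERT INTO transactions_clean VALUES (?, ?, ?)",
        [
            ("t1", "2023-11-02", 1200.0),
            ("t2", "2024-03-15", 99.5),
            ("t3", "2024-07-01", 400.5),
        ],
    )
    return conn


def annual_revenue(conn: sqlite3.Connection) -> List[Tuple[int, float, int]]:
    """Return (year, revenue, transaction count) per calendar year."""
    return conn.execute(
        "SELECT CAST(strftime('%Y', transaction_date) AS INTEGER) AS year, "
        "       ROUND(SUM(line_total), 2) AS revenue, "
        "       COUNT(*) AS transactions "
        "FROM transactions_clean "
        "GROUP BY year ORDER BY year"
    ).fetchall()
```

The query only works this simply because Stage 3 already normalized every date to one format and every amount to a number.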

Anomaly Detection — IQR + Z-score

Transactions flagged

7,568
0.79% of clean transactions  ·  40 ms

High outliers (IQR)

7,568
above Q3 + 1.5×IQR fence

Overdue exposure

$120.9M
outstanding invoices flagged overdue
Product  |  Line Total  |  Z-Score  |  Flag  |  Action
Analytics Add-on  |  $99,975.20  |  3.26  |  high_outlier  |  Manual review: possible duplicate billing
Enterprise Plan  |  $98,430.00  |  3.21  |  high_outlier  |  Verify pricing tier with account manager
Custom Integration  |  $97,680.00  |  3.18  |  high_outlier  |  Check SOW: may be legitimate large order
White-label License  |  $96,900.00  |  3.15  |  high_outlier  |  Cross-reference contract signed amount
Security Module  |  $95,200.00  |  3.10  |  high_outlier  |  Validate against approved rate card
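
A minimal sketch of the IQR + Z-score flagging, assuming the IQR fence decides which rows are flagged and the z-score is reported alongside for ranking. How the demo combines the two is not spelled out on this page, and the function name and quartile method here are illustrative choices.

```python
import statistics
from typing import Dict, List


def flag_high_outliers(values: List[float], k: float = 1.5) -> List[Dict[str, float]]:
    """Flag values above Q3 + k * IQR and attach each one's z-score."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)

    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)

    flagged = []
    for value in values:
        if value > upper_fence:
            # Guard against zero spread, where a z-score is undefined.
            z = (value - mean) / stdev if stdev else 0.0
            flagged.append({"value": value, "z_score": round(z, 2)})
    return flagged
```

For example, `flag_high_outliers([100, 110, 120, 130, 140, 150, 160, 170, 180, 5000])` flags only the 5000 line total. The IQR fence is robust to the outliers themselves, which is why it is a reasonable gate. A z-score alone gets dragged around by the very values it is trying to catch.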

Stage 5 — AI Document Intelligence RAG · Mistral NeMo 12B

The cleaned data summaries, anomaly reports, and 200 enterprise contracts are automatically ingested into the vector store via ra-watch — no manual steps. Mistral NeMo 12B runs locally on the GCP A100. Zero OpenAI fees. Zero data egress.

Documents indexed

53
reports + contracts → auto-ingested via ra-watch

Embedding model

gte-large-en-v1.5
1024-dim · ONNX · CUDA · 8192-ctx

LLM

Mistral NeMo
12B · llama.cpp · local · $0 API fees
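
Conceptually, the query step ranks indexed documents by how close their embeddings sit to the embedding of the question. The sketch below assumes cosine similarity and a plain in-memory dictionary as the index. Both are assumptions for illustration: the page does not say which similarity metric or vector store the demo uses.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], index: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k document names most similar to the query, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In the real stack the vectors would be 1024-dimensional embeddings from gte-large-en-v1.5, and the top-ranked chunks would be passed to the LLM as context for the answer.
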
Query 6,425 ms · RTX 3060 · 50 contracts searched
List all contracts where the governing law is Delaware. For each, state the client name, liability cap amount, and filename.
Answer — Mistral NeMo 12B (local · 0 API fees)

Based on the provided contracts, there is one contract where the governing law is Delaware:

Client: Ironside Solutions  |  Liability Cap: $300,000  |  File: contract_0041f5d4_20240831.txt

Sources: contract_0041f5d4_20240831.txt, contract_00d8c046_20231227.txt, contract_0131771e_20240417.txt (+4 contracts)
Query 3,725 ms · RTX 3060 · 50 contracts searched
Which contracts contain ITAR compliance clauses or HIPAA data handling requirements? List the client name, compliance type, and filename for each.
Answer — Mistral NeMo 12B (local · 0 API fees)

Two contracts contain HIPAA compliance clauses:

Ironside Solutions — HIPAA: "This engagement is structured to support HIPAA data handling requirements. All processing occurs on dedicated infrastructure."  |  contract_0041f5d4_20240831.txt

Radiant Analytics — HIPAA: "This engagement is structured to support HIPAA data handling requirements. All processing occurs on dedicated infrastructure."  |  contract_037146f7_20220504.txt

Sources: contract_0041f5d4_20240831.txt, contract_037146f7_20220504.txt (+5 contracts)
Query 10,704 ms · RTX 3060 · 50 contracts searched
Which contracts auto-renew automatically without requiring a new signed SOW? Flag any that could create unexpected billing obligations.
Answer — Mistral NeMo 12B (local · 0 API fees)

Ironside Solutions — auto-renews, 90 days notice required  |  contract_0131771e_20240417.txt  |  Billing risk

Yield Ventures — auto-renews, 60 days notice required  |  contract_01e6ee9a_20220103.txt  |  Billing risk

Radiant Analytics — auto-renews, 30 days notice required  |  contract_037146f7_20220504.txt  |  Billing risk

No billing risk: Ironside Solutions (contract_00d8c046), Quantum Innovations (contract_037f5347), Prism Consulting (contract_03908f5a) — renewal requires new signed SOW.

Sources: contract_0131771e_20240417.txt, contract_01e6ee9a_20220103.txt, contract_037146f7_20220504.txt (+4 contracts)
How the continuous ingest works: The ra-watch daemon polls the client's GCS bucket every 60 seconds. When a new contract, report, or document lands, it is automatically chunked, embedded via gte-large-en-v1.5 (ONNX, CUDA), and made available for Q&A within 60 seconds. No manual re-indexing. No downtime. No human in the loop.
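
The poll, chunk, embed, index loop described above can be sketched roughly as follows. This is not ra-watch's code. Every name here is hypothetical, including `list_objects`, `read_text`, and `embed`, which are injected callables standing in for the bucket listing, the file read, and the embedding model. The chunk size and overlap are illustrative defaults.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += size - overlap
    return chunks


def poll_once(
    list_objects: Callable[[], Iterable[str]],
    read_text: Callable[[str], str],
    embed: Callable[[str], List[float]],
    index: Dict[Tuple[str, int], List[float]],
    seen: Set[str],
) -> int:
    """Ingest every object not seen before; return how many were new."""
    new_count = 0
    for name in sorted(list_objects()):
        if name in seen:
            continue
        for position, chunk in enumerate(chunk_text(read_text(name))):
            index[(name, position)] = embed(chunk)
        seen.add(name)
        new_count += 1
    return new_count
```

A daemon would call `poll_once` in a loop with a 60-second sleep between passes. Tracking `seen` names is what makes each pass incremental, so documents that are already indexed are never re-embedded.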

Run This Pipeline on Your Data

Every new client engagement starts with the same pipeline — audit your data, show you exactly what's broken, clean it, analyse it, and if you want AI Q&A over your documents, deploy the RAG stack on a dedicated GCP A100 I manage. You keep the data. I keep the code protected. Starts with a free 30-minute data audit call.

Book a Free Data Audit  ·  View Pricing