DATABRICKS CERTIFIED · AWS SOLUTIONS ARCHITECT ASSOCIATE · AZURE · GCP

Single-node analytics.
No Spark. No cluster bill.

172 million rows per second. 5 queries. 167M rows. Under 1 second — on a single machine. I connect directly to your S3 bucket, Azure Blob container, or GCS bucket and deliver results — no cluster, no vendor lock-in, no $15k/month bill.

Get in Touch · See the Benchmark →
☁ AWS S3 + Athena
☁ Azure Blob Storage
☁ GCP Cloud Storage
🦆 DuckDB httpfs / azure / gcs
// Sovereign Benchmark · April 2026
167 Million Rows. 4 Years.
971 Milliseconds.

48 Apache Parquet files — every NYC Yellow Cab trip from January 2022 through December 2025. Five analytical queries. Cold NVMe. One consumer workstation. No JVM. No executor heap. No cluster. No warmup.

167,858,646
rows scanned
971ms
wall time · 5 queries
172M/s
rows/sec · cold NVMe
25×
faster than Spark
(50GB RAM heap, pre-warmed)
Query · Description · Time
Q1 — Row Count COUNT(*) per year across all 4 years 29ms
Q2 — Fare & Distance YoY 6 aggregates (avg fare, distance, tip, total, passengers, trips) per year · 143M filtered rows · full column decompression 401ms
Q3 — Monthly Pivot Trip volume by year×month — 48 cells · native DuckDB PIVOT · 2025 the busiest of the four years 312ms
Q4 — Payment Type Shift Cash collapse: 19.6% → 9.6% · Credit card peak 2023 at 77.9% · full 4-year scan 158ms
Q5 — CBD Congestion Fee 2025-only schema column · NYC congestion pricing live Jan 2025 · 72.8% of trips charged · $25.03M captured 64ms
Total 167,858,646 rows · 48 Parquet files · ~685M total row-scans across all 5 queries combined 971ms
Hardware: Intel Ultra 7 265KF (20-core) · NVIDIA RTX 3060 12GB · NVMe SSD · Arch Linux · 64GB RAM
Stack: DuckDB 1.4.4 vectorized columnar execution · zero Python · zero JVM · cold NVMe reads · one process
Spark baseline: Apache Spark on identical hardware, 50 GB executor heap pre-warmed in RAM — 6.88M rows/sec. DuckDB delivers 172M rows/sec from cold NVMe — 25× faster with no heap, no warmup, no cluster.
What "~685M row-scans" means: The dataset has 167,858,646 unique rows. Most of the 5 queries independently scan all 4 years (Q5 touches only the 2025 files), so DuckDB physically decompressed and processed roughly 685M row reads total across the run. The 172M rows/sec figure uses unique rows ÷ wall time — the conservative number. The engine is doing significantly more work than that implies.
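Both throughput figures follow from simple division; the arithmetic checks out in any DuckDB shell (the 685M scan count is the measured total from the run above):

```sql
-- Conservative figure: unique rows ÷ wall time
SELECT 167858646 / 0.971;              -- ≈ 172.9M rows/sec
-- Effective figure: total row-scans ÷ wall time
SELECT 685000000 / 0.971;              -- ≈ 705M rows/sec
-- Speedup vs the measured Spark baseline (6.88M rows/sec)
SELECT (167858646 / 0.971) / 6880000;  -- ≈ 25×
```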

Important: these files were on local NVMe — not S3. S3 throughput is 100–500 MB/s vs NVMe at 3–7 GB/s. Expect 5–20× slower over object storage. See the S3 vs NVMe section below for the full honest breakdown and what to do about it.
// How It Scales

DuckDB is CPU + NVMe bound. The workstation number is measured and verified. Bare-metal and cloud NVMe figures are informed estimates based on published hardware specs — not yet run.

Configuration · Hardware · DuckDB Throughput · Est. Monthly Cost
Workstation (measured ✓) Intel Ultra 7 265KF · NVMe · 64GB RAM 172M rows/sec · 971ms $0 (owned)
Hetzner AX102 bare-metal AMD Ryzen 9 9950X · NVMe · 192GB RAM ~200–250M rows/sec (est.) ~$250/mo
AWS i4i.4xlarge Intel Xeon · 3.75TB local NVMe SSD · 128GB RAM ~250–350M rows/sec (est.) ~$900/mo on-demand
Databricks cluster Spark · JVM heap · DBU licensing · S3 ~6–40M rows/sec (measured Spark baseline: 6.88M) $5,000–$25,000/mo

Workstation result is verified — measured 2026-04-08 on local NVMe, cold reads, no warmup. Hetzner and AWS i4i estimates based on published NVMe throughput specs vs measured workstation baseline. Spark baseline of 6.88M rows/sec measured on identical workstation hardware, 50GB heap pre-warmed.

🏆
Databricks Certified
Associate Developer · Spark · Scala · Verify ↗
☁️
AWS Solutions Architect
Associate — Amazon Web Services · Verify ↗
🔷
Azure Data Platform
Blob Storage · ADLS Gen2 · Synapse
🌐
Google Cloud Platform
GCS · BigQuery · Dataflow
// LIVE DATA — NYC YELLOW CAB 2022–2025

The data. All of it. Interactive.

Queried live from 48 Parquet files by DuckDB. 167,858,646 rows. Four charts. All computed on a single workstation in 971ms.

Data: NYC TLC Trip Record Data (open dataset) · Engine: DuckDB 1.4.4 · Hardware: Intel Ultra 7 265KF · NVMe

// HOW IT WORKS

Your data. My compute. No cluster tax.

You grant scoped read access to your cloud storage. I run DuckDB on a high-performance node, deliver results as Parquet, CSV, or a live dashboard — then recommend the right long-term architecture for your data volume and budget.

YOUR DATA (stays in your cloud)
    AWS S3        s3://your-bucket/data/*.parquet
    Azure Blob    abfss://container@account.dfs.core.windows.net/
    GCP Storage   gs://your-bucket/data/*.parquet
          │
          │ scoped read-only credentials
          ▼
COMPUTE NODE (bare-metal NVMe or NVMe cloud instance)
DuckDB + httpfs / azure / gcs extension
┌──────────────────────────────────────────────────────┐
│ LOAD httpfs; SET s3_region='us-east-1';              │
│ SET s3_access_key_id='...';                          │
│                                                      │
│ SELECT month, SUM(revenue), AVG(margin)              │
│ FROM read_parquet('s3://your-bucket/sales/*.parquet')│
│ GROUP BY month ORDER BY month;                       │
└──────────────────────────────────────────────────────┘
          │
          │ results
          ▼
DELIVERABLES
├── results.parquet  → written back to your bucket
├── report.csv       → emailed / shared directly
└── dashboard        → Streamlit app, hosted anywhere
Honest take on S3 as a data lake
S3 object storage tops out at roughly 100–500 MB/s per connection. A local NVMe drive delivers 3,000–7,000 MB/s — a 10–50× I/O advantage. DuckDB's httpfs extension reads Parquet metadata to apply projection and predicate pushdown, fetching only the byte ranges a query actually needs — that helps significantly, but physics still wins: if your dataset is large and your queries are wide, S3 is the bottleneck, not DuckDB.

The 172M rows/sec benchmark ran against local NVMe. Against S3 with the same hardware and the same queries, expect 5–20× slower depending on file sizes, network, and how selective your queries are.

That said — you still save money over Databricks. Even at 5× slower, a $200/month bare-metal NVMe node running DuckDB beats a $5,000/month Databricks cluster reading the same S3 data. The recommendation depends on your data volume and query patterns.
Recommendation: transition hot data to NVMe
For recurring analytical workloads — daily reports, ML feature pipelines, compliance queries — the right architecture is: S3/Blob/GCS for cold archival storage, NVMe for hot analytical compute. Sync your hot dataset once to a bare-metal or NVMe cloud instance, process at full speed, write results back to object storage. You get S3 durability for your source data and NVMe throughput for your queries.

Bare-metal options worth knowing:
· Hetzner AX102 — 192GB RAM, NVMe, dedicated, ~$250/month (exceptional value)
· AWS i4i instances — up to 30TB local NVMe SSD, purpose-built for analytics
· GCP Local SSD — up to 9TB, 2.4M IOPS, available on most instance families
· Azure Lsv3 — NVMe local storage, built for storage-intensive workloads
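The sync-once pattern above takes only a few lines of DuckDB SQL. A minimal sketch — the bucket layout and local paths here are hypothetical placeholders, not from the benchmark:

```sql
-- One-time sync: pull the hot dataset from S3 down to local NVMe
LOAD httpfs;
SET s3_region = 'us-east-1';
COPY (SELECT * FROM read_parquet('s3://your-bucket/sales/2025-*.parquet'))
    TO '/nvme/hot/sales_2025.parquet' (FORMAT PARQUET);

-- Every subsequent query reads from NVMe at full throughput
SELECT month, SUM(revenue) AS revenue
FROM read_parquet('/nvme/hot/sales_2025.parquet')
GROUP BY month;
```

Source data stays durable in S3; only the working set lives on NVMe, and results can be copied back to the bucket the same way.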

I can audit your current workload and tell you exactly which tier makes sense.
// WHY IT MATTERS

Your Databricks bill is optional.

Most analytics workloads at mid-market companies fit on a single modern server. DuckDB processes columnar Parquet in-process — no serialization, no cluster coordination, no $15k/month bill.

Capability · Databricks / Spark · DuckDB (single node)
685M row scan + 5 queries $8–40 cluster cost per run $0 — local process
Read from S3 / Azure / GCP Yes (cluster required) Yes — httpfs / azure / gcs extension, no cluster
Monthly platform cost $3,000–$25,000+ $20–200 (VPS or local workstation)
Time to first query 5–15 min (cluster startup) < 1 second
SQL compatibility Spark SQL (Hive-compatible dialect) Standard SQL + PIVOT, ASOF JOIN, LIST aggregation, UNPIVOT
Python / Streamlit integration PySpark (heavyweight) Native Python API — duckdb.query(sql).df()
Operational complexity Cluster mgmt, DBUs, autoscale, networking Zero — one binary, embed anywhere
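The ASOF JOIN in the table above deserves a concrete sketch — matching each event to the most recent record at or before it, a pattern that takes window-function gymnastics in Spark SQL. The tables here are hypothetical, purely for illustration:

```sql
-- Hypothetical tables: pair each trade with the latest price at or before it
CREATE TABLE prices (symbol VARCHAR, ts TIMESTAMP, price DOUBLE);
CREATE TABLE trades (symbol VARCHAR, ts TIMESTAMP, qty INTEGER);

SELECT t.symbol, t.ts, t.qty, p.price
FROM trades t
ASOF JOIN prices p
    ON t.symbol = p.symbol AND t.ts >= p.ts;
```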
// ALSO BUILT BY SCOTT BAKER

Inventor & Creator — skr8tr

skr8tr is a sovereign, masterless distributed systems framework built on post-quantum cryptography. Every node authenticates via ML-DSA-65 signed tokens — no certificate authorities, no central broker, no single point of failure. Commands propagate across a UDP mesh, each packet carrying a post-quantum signature verified on arrival. Designed from first principles for air-gapped, regulated, and adversarial environments where you cannot afford to trust the network.

// THE CODE WE ACTUALLY RAN

Plain SQL. Verifiable results.

Five queries. Real data from the NYC TLC open dataset. No proprietary runtime. Run it yourself.

Read directly from S3 — no download
LOAD httpfs;
SET s3_region='us-east-1';
SET s3_access_key_id='...';
SET s3_secret_access_key='...';

SELECT year,
       COUNT(*) AS trips,
       ROUND(AVG(fare_amount), 2) AS avg_fare
FROM read_parquet('s3://your-bucket/taxi/yellow/2024-*.parquet')
GROUP BY year
ORDER BY year;
Q3 — Monthly trend pivot (DuckDB-native PIVOT)
PIVOT (
    SELECT year,
           MONTH(tpep_pickup_datetime) AS month,
           COUNT(*) AS trips
    FROM all_years
    GROUP BY year, month
) ON year USING SUM(trips)
GROUP BY month
ORDER BY month;
-- Native PIVOT — no Spark workarounds, no conditional aggregation hacks

Full benchmark: scripts/benchmark_nyc_4year.sh

// WHAT I DO

Data engineering that ships.

Remote consulting. Fixed-scope engagements. AWS, Azure, and GCP. Results you can verify.

🔍 Cost Audit

Review your Databricks / Synapse / BigQuery spend. Identify workloads that move to DuckDB on a single node. Written cost-reduction plan within 48 hours.

🦆 DuckDB Pipeline Build

End-to-end columnar pipeline — ingest from S3, Azure Blob, or GCP Storage, transform with DuckDB SQL, write results back to your bucket. Reproducible scripts you own.

📊 Streamlit Dashboard

Interactive analytics dashboard backed by DuckDB. Reads live from your cloud storage. Deployed on a $20/month VPS or your existing infra. Zero cluster dependency.

☁️ AWS Data Architecture

S3 + Glue + Athena + DuckDB hybrid pipelines. Leveraging AWS Solutions Architect experience to build data lakes that scale without surprise bills.

🔷 Azure Data Engineering

ADLS Gen2, Azure Blob, Synapse, and DuckDB integration. Migrate heavyweight Spark jobs to single-node DuckDB where the data volume allows it.

🌐 GCP / BigQuery Consulting

GCP Cloud Storage + DuckDB pipelines. BigQuery cost reduction — identify queries that run faster and cheaper outside BigQuery on a local DuckDB node.

🏗️ Spark Migration

Port SparkSQL jobs to standard DuckDB SQL. Remove cluster dependency for batch workloads. Works with AWS EMR, Azure HDInsight, and GCP Dataproc migrations.

📐 Architecture Review

Async review of your current data pipelines — any cloud. Written recommendations with specific, actionable improvements. Delivered in 48 hours.

⚡ Bare-Metal / NVMe Migration

Audit your S3 data lake for workloads that belong on NVMe. Design the hot/cold split — NVMe for compute, object storage for archival. Get DuckDB throughput you can actually feel.

🔐 Post-Quantum Secured Deliverables

Every file I deliver — Parquet, CSV, report — is signed with ML-DSA-65 (NIST FIPS 204), the post-quantum digital signature standard. You get the data file, a detached .sig file, and a one-command verifier. Run it against my public key and know the file is authentic and untampered.

Designed for healthcare, finance, and government workloads that need provable chain-of-custody today and quantum resistance tomorrow.

↓ Download duckpqc.pub — Scott Baker's ML-DSA-65 public key

Ready to cut your data costs?

All engagements start with a free 30-minute scoping call.
I respond to all serious inquiries within 24 hours.

scott@duckdatamaster.guru

GitHub: NixOSDude/DuckDB_Master