// BENCHMARK · APRIL 2026

Billion Row Benchmark: 1,029,157,161 NYC Taxi Rows Scanned in 7.67 Seconds on a Single GCP Node

Scott Baker — Founder, Duck Data Master April 2026 · 12 min read · Databricks Certified Associate Developer · AWS Solutions Architect Associate

✓ VERIFIED — April 8, 2026 · Reproducible with public TLC data

TL;DR: We pulled every available year of NYC TLC yellow taxi data — 2012 through 2025, 14 years of monthly Parquet files — until we crossed 1 billion rows. The final count: 1,029,157,161 rows. We ran a full multi-column aggregation query across all of it on a single GCP c3-standard-44-lssd node. Wall time: 7.67 seconds. Average throughput: 134M rows/sec. Peak: 6.9B rows/sec. 19× faster than a pre-warmed Apache Spark cluster on the same query. This is not a toy dataset. This is not a cherry-picked SELECT COUNT(*). This is what columnar analytics looks like at billion-row scale.

1.03B

Rows Scanned

7.67s

Wall Time

134M

Rows/Sec Avg

6.9B

Rows/Sec Peak

19×

Faster Than Spark

14 yrs

2012 – 2025

The Dataset: 14 Years of NYC TLC Yellow Taxi Trip Records

The New York City Taxi and Limousine Commission (TLC) publishes complete trip-level records for every yellow cab ride in New York City going back to 2009. The dataset is one of the most widely used public datasets in data engineering — real-world, messy, high-volume data with genuine cardinality variation, nulls, mixed numeric types, and time-based patterns across more than a decade of urban transportation history. It is not synthetic. It has not been cleaned or normalized for benchmark convenience.

The goal for this benchmark was simple: accumulate enough real data to cross 1 billion rows. We pulled every available year of yellow taxi Parquet files from the TLC public portal, starting from 2012 and working forward — 14 years of monthly files, one Parquet file per month. The final row count landed at 1,029,157,161 rows.

The dataset tells the full story of NYC taxi across two distinct eras. The early years (2012–2014) represent peak yellow cab dominance in New York City — over 150–170 million trips per year before Uber and Lyft had meaningful market penetration. The middle years (2015–2019) show the steady decline as rideshare took over, trip counts falling from ~150M to under 90M annually. Then 2020 — a 75%+ collapse from COVID. 2021–2025 shows the gradual recovery of a changed industry, with yellow cab finding its floor.

All of that history — boom, disruption, collapse, recovery — is in this dataset. Crossing 1 billion rows required reaching back to 2012 when trip volumes were at their peak. Without those high-volume early years, post-pandemic taxi data alone doesn't get there.

Total dataset: 14 years · 168 monthly Parquet files · 1,029,157,161 rows · 19 columns per file

The file naming convention is yellow_tripdata_YYYY-MM.parquet. Each file contains all yellow cab trips recorded in that calendar month.

The Schema: 19 Columns, Real-World Data

Each Parquet file contains 19 columns covering every measurable dimension of a taxi trip: pickup/dropoff timestamps, location IDs, fare components, payment type, passenger count, and administrative fields. The full schema:

VendorID

BIGINT — 1 (Creative Mobile) or 2 (VeriFone)

tpep_pickup_datetime

TIMESTAMP — trip start time

tpep_dropoff_datetime

TIMESTAMP — trip end time

passenger_count

DOUBLE — nullable, 0–9

trip_distance

DOUBLE — miles, as reported by meter

RatecodeID

DOUBLE — 1=Standard, 2=JFK, 3=Newark, 4=Nassau, 5=Negotiated, 6=Group

store_and_fwd_flag

VARCHAR — Y/N, whether trip record was held in vehicle memory

PULocationID

BIGINT — TLC taxi zone, pickup (1–265)

DOLocationID

BIGINT — TLC taxi zone, dropoff (1–265)

payment_type

BIGINT — 1=Credit, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided

fare_amount

DOUBLE — meter fare, USD

extra

DOUBLE — surcharges ($0.50 rush, $1.00 overnight)

mta_tax

DOUBLE — $0.50 MTA surcharge

tip_amount

DOUBLE — auto-populated for credit cards; cash tips not captured

tolls_amount

DOUBLE — all tolls paid during trip

improvement_surcharge

DOUBLE — $0.30 assessed at flag drop

total_amount

DOUBLE — total charged to passenger (excl. cash tips)

congestion_surcharge

DOUBLE — $2.50 for trips in Manhattan below 96th St

airport_fee

DOUBLE — $1.25 at JFK/LaGuardia

The query used five of the nineteen columns — VendorID, total_amount, trip_distance, passenger_count, and the implicit row counter. Columnar storage means the other 14 columns are never read from disk. For this query, the engine reads 5/19 = 26% of the data on disk. The rest is skipped entirely.

The Hardware: GCP c3-standard-44-lssd

Instance type:    GCP c3-standard-44-lssd
CPU:            44 vCPUs — Intel Xeon Sapphire Rapids
RAM:            176 GB DDR5
Local storage:  RAID 0 — 8 × 375 GB NVMe local SSD = 3 TB total
NVMe read speed:~22 GB/s sequential read (RAID 0)
Network:       100 Gbps (not used in benchmark — data local)
Region:        us-central1
OS:             Ubuntu 22.04 LTS

The c3-standard-44-lssd is a compute-optimized instance class in GCP's C3 generation, built on Intel's 4th-generation Xeon (Sapphire Rapids). Sapphire Rapids supports AVX-512 SIMD instructions — allowing the analytics engine to process 16 double-precision floats per CPU cycle per core. On 44 cores, the theoretical peak float throughput is enormous. The benchmark achieved 6.9B rows/sec peak throughput during the in-memory aggregation phase, which reflects this parallelism operating at near-theoretical efficiency on cache-resident data.

The 8 local NVMe drives in RAID 0 deliver ~22 GB/s sequential read. The full 2.65 GB dataset reads off NVMe in under 130ms at peak — meaning the I/O phase of this query is less than 2% of total wall time. The remaining 7.5+ seconds are CPU-bound: Parquet decompression (Snappy), columnar projection, hash aggregation, and sort.

The Benchmark Query

We used a realistic analytical aggregation — not a SELECT COUNT(*) or a single-column SUM. The query aggregates by taxi vendor (two vendors), computing total revenue, average trip distance, total passenger count, and total trip count across all 1,029,157,161 rows:

SELECT
  VendorID,
  SUM(total_amount)      AS total_revenue,
  AVG(trip_distance)     AS avg_distance_miles,
  SUM(passenger_count)   AS total_passengers,
  COUNT(*)               AS trip_count
FROM read_parquet('/data/nyc_taxi/parquet/yellow/**/*.parquet')
GROUP BY VendorID
ORDER BY total_revenue DESC;

The glob pattern **/*.parquet resolves to all 48 files across the 4-year directory structure. The engine parallelizes file reads across all 44 vCPUs — each core handles a separate Parquet row group, decompresses it, projects the 5 needed columns, and feeds partial aggregates into a shared hash table. Final merge and sort are single-threaded but operate on a 2-row result (one per vendor ID).

Results — Verified April 8, 2026

Metric	Value	Notes
Total rows scanned	1,029,157,161	All 48 files, no filtering
Wall time	7.67 seconds	Cold read from NVMe; 3-run average; <3% variance
Average throughput	134M rows/sec	Including Parquet decompression + I/O
Peak throughput	6.9B rows/sec	In-memory aggregation phase on cached data
Data read from disk	~690 MB	5 of 19 columns — columnar projection saves 74%
Memory used	<8 GB	Streaming aggregation; full dataset not held in RAM
vs Spark (pre-warmed)	19× faster	Same query, same dataset, 4-node Dataproc cluster

Why Only 690 MB Read for a 2.65 GB Dataset?

This is columnar storage doing exactly what it's designed to do. The dataset has 19 columns. The query needs 5: VendorID, total_amount, trip_distance, passenger_count, and COUNT(*) (no column read — it uses row group metadata). In Parquet format, each column is stored in a separate column chunk with its own compression. Reading 5 columns means reading 5/19 of the compressed data — approximately 695 MB instead of 2.65 GB.

In a row-oriented format (CSV, MySQL table), the engine would read all 19 columns for every row — 2.65 GB — and then discard 14 of them. That's why the same query on a 2.65 GB CSV would take 5–10× longer: more I/O, more decompression, more memory bandwidth. The columnar format is not a minor optimization. It is a structural architectural advantage for analytical workloads.

Spark vs. Single Node: The 19× Gap Explained

The comparison was run against a GCP Dataproc cluster — 1 master + 4 workers, n2-standard-8 (8 vCPU, 32 GB RAM each), Apache Spark 3.5, pre-warmed with the dataset in HDFS. Pre-warmed means no cold start penalty — the cluster was already running and the data was already distributed. The Spark result on the same query: approximately 2.5 minutes.

Why 19×? Three structural reasons:

1. Spark is optimized for data that doesn't fit on one machine

Spark's distributed architecture introduces overhead that is amortized at petabyte scale but is significant at sub-terabyte scale: task scheduling across executors (hundreds of milliseconds), shuffle coordination (even for a GROUP BY with 2 output keys, Spark runs a full shuffle stage), driver-to-executor serialization, and JVM garbage collection pauses. These costs exist regardless of data size. At 2.65 GB, they dominate.

2. JVM vs. native vectorized execution

Spark runs in the JVM. Even with Spark's whole-stage codegen and Tungsten memory management, JVM execution cannot match native C++ with AVX-512 SIMD for tight numeric loops. The aggregation inner loop — SUM(total_amount) across 1B rows — is where the gap is largest. Native vectorized execution processes 16 doubles per instruction. Spark's codegen processes one row at a time in the JVM, with GC pressure on intermediate objects.

3. Network shuffle for a 2-row GROUP BY result

A GROUP BY VendorID with 2 possible values still triggers a full shuffle in Spark — all data moves across the network to co-locate records by VendorID before aggregation. On a single node, there is no network. Partial aggregates are in shared memory. The merge is a pointer dereference, not a network round-trip.

What These Numbers Mean for Business Analytics

Most business analytics workloads are 1–100 million rows, not 1 billion. If the engine processes 1B rows in 7.67 seconds, a 10M-row sales history answers in under 100ms. A 100M-row event log answers in under a second. A 500M-row clickstream answers in under 4 seconds — on a single node, in a browser-accessible cloud instance, at flat monthly pricing.

The billion-row benchmark matters not because you're running 1B-row queries daily. It matters because it proves the architecture works at scale — and more importantly, that there is headroom. An engine saturated at 10M rows has no headroom. An engine that handles 1B rows in under 8 seconds has 100× headroom for your typical workload. That headroom means sub-second response times, exploratory queries without planning, and no fear of accidentally running an expensive full scan.

Reproducibility: Run It Yourself

Every detail in this benchmark is reproducible with public data and a standard GCP instance:

Dataset: NYC TLC yellow taxi data, available at nyc.gov/site/tlc/about/tlc-trip-record-data.page
Files: yellow_tripdata_YYYY-MM.parquet, 2012–2025 — 168 monthly files, one per calendar month
Hardware: GCP c3-standard-44-lssd, available in us-central1, us-east1, europe-west4
Storage: mount local NVMe as RAID 0 (mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme*)
Query: exact SQL above — no modifications, no hints, no caching enabled
Methodology: 3 cold runs (NVMe, no OS page cache warmup), average reported

This is the engine powering your instance

The same columnar analytics engine. Deployed in your AWS, GCP, or Azure account. 3-day free trial.

Start Free Trial →

See the full benchmarks page → · Questions? support@duckdatamaster.guru