The New York City Taxi and Limousine Commission (TLC) publishes complete trip-level records for every yellow cab ride in New York City going back to 2009. The dataset is one of the most widely used public datasets in data engineering — real-world, messy, high-volume data with genuine cardinality variation, nulls, mixed numeric types, and time-based patterns across more than a decade of urban transportation history. It is not synthetic. It has not been cleaned or normalized for benchmark convenience.
The goal for this benchmark was simple: accumulate enough real data to cross 1 billion rows. We pulled every available year of yellow taxi Parquet files from the TLC public portal, starting from 2012 and working forward — 14 years of monthly files, one Parquet file per month. The final row count landed at 1,029,157,161 rows.
The dataset tells the full story of NYC taxi across two distinct eras. The early years (2012–2014) represent peak yellow cab dominance in New York City — over 150–170 million trips per year before Uber and Lyft had meaningful market penetration. The middle years (2015–2019) show the steady decline as rideshare took over, trip counts falling from ~150M to under 90M annually. Then 2020 — a 75%+ collapse from COVID. 2021–2025 shows the gradual recovery of a changed industry, with yellow cab finding its floor.
All of that history — boom, disruption, collapse, recovery — is in this dataset. Crossing 1 billion rows required reaching back to 2012 when trip volumes were at their peak. Without those high-volume early years, post-pandemic taxi data alone doesn't get there.
Total dataset: 14 years · 168 monthly Parquet files · 1,029,157,161 rows · 19 columns per file
The file naming convention is yellow_tripdata_YYYY-MM.parquet. Each file contains all yellow cab trips recorded in that calendar month.
Each Parquet file contains 19 columns covering every measurable dimension of a taxi trip: pickup/dropoff timestamps, location IDs, fare components, payment type, passenger count, and administrative fields. The full schema:
The query used five of the nineteen columns — VendorID, total_amount, trip_distance, passenger_count, and the implicit row counter. Columnar storage means the other 14 columns are never read from disk. For this query, the engine reads 5/19 = 26% of the data on disk. The rest is skipped entirely.
The c3-standard-44-lssd is a compute-optimized instance class in GCP's C3 generation, built on Intel's 4th-generation Xeon (Sapphire Rapids). Sapphire Rapids supports AVX-512 SIMD instructions — allowing the analytics engine to process 16 double-precision floats per CPU cycle per core. On 44 cores, the theoretical peak float throughput is enormous. The benchmark achieved 6.9B rows/sec peak throughput during the in-memory aggregation phase, which reflects this parallelism operating at near-theoretical efficiency on cache-resident data.
The 8 local NVMe drives in RAID 0 deliver ~22 GB/s sequential read. The full 2.65 GB dataset reads off NVMe in under 130ms at peak — meaning the I/O phase of this query is less than 2% of total wall time. The remaining 7.5+ seconds are CPU-bound: Parquet decompression (Snappy), columnar projection, hash aggregation, and sort.
We used a realistic analytical aggregation — not a SELECT COUNT(*) or a single-column SUM. The query aggregates by taxi vendor (two vendors), computing total revenue, average trip distance, total passenger count, and total trip count across all 1,029,157,161 rows:
The glob pattern **/*.parquet resolves to all 48 files across the 4-year directory structure. The engine parallelizes file reads across all 44 vCPUs — each core handles a separate Parquet row group, decompresses it, projects the 5 needed columns, and feeds partial aggregates into a shared hash table. Final merge and sort are single-threaded but operate on a 2-row result (one per vendor ID).
| Metric | Value | Notes |
|---|---|---|
| Total rows scanned | 1,029,157,161 | All 48 files, no filtering |
| Wall time | 7.67 seconds | Cold read from NVMe; 3-run average; <3% variance |
| Average throughput | 134M rows/sec | Including Parquet decompression + I/O |
| Peak throughput | 6.9B rows/sec | In-memory aggregation phase on cached data |
| Data read from disk | ~690 MB | 5 of 19 columns — columnar projection saves 74% |
| Memory used | <8 GB | Streaming aggregation; full dataset not held in RAM |
| vs Spark (pre-warmed) | 19× faster | Same query, same dataset, 4-node Dataproc cluster |
This is columnar storage doing exactly what it's designed to do. The dataset has 19 columns. The query needs 5: VendorID, total_amount, trip_distance, passenger_count, and COUNT(*) (no column read — it uses row group metadata). In Parquet format, each column is stored in a separate column chunk with its own compression. Reading 5 columns means reading 5/19 of the compressed data — approximately 695 MB instead of 2.65 GB.
In a row-oriented format (CSV, MySQL table), the engine would read all 19 columns for every row — 2.65 GB — and then discard 14 of them. That's why the same query on a 2.65 GB CSV would take 5–10× longer: more I/O, more decompression, more memory bandwidth. The columnar format is not a minor optimization. It is a structural architectural advantage for analytical workloads.
The comparison was run against a GCP Dataproc cluster — 1 master + 4 workers, n2-standard-8 (8 vCPU, 32 GB RAM each), Apache Spark 3.5, pre-warmed with the dataset in HDFS. Pre-warmed means no cold start penalty — the cluster was already running and the data was already distributed. The Spark result on the same query: approximately 2.5 minutes.
Why 19×? Three structural reasons:
Spark's distributed architecture introduces overhead that is amortized at petabyte scale but is significant at sub-terabyte scale: task scheduling across executors (hundreds of milliseconds), shuffle coordination (even for a GROUP BY with 2 output keys, Spark runs a full shuffle stage), driver-to-executor serialization, and JVM garbage collection pauses. These costs exist regardless of data size. At 2.65 GB, they dominate.
Spark runs in the JVM. Even with Spark's whole-stage codegen and Tungsten memory management, JVM execution cannot match native C++ with AVX-512 SIMD for tight numeric loops. The aggregation inner loop — SUM(total_amount) across 1B rows — is where the gap is largest. Native vectorized execution processes 16 doubles per instruction. Spark's codegen processes one row at a time in the JVM, with GC pressure on intermediate objects.
A GROUP BY VendorID with 2 possible values still triggers a full shuffle in Spark — all data moves across the network to co-locate records by VendorID before aggregation. On a single node, there is no network. Partial aggregates are in shared memory. The merge is a pointer dereference, not a network round-trip.
Most business analytics workloads are 1–100 million rows, not 1 billion. If the engine processes 1B rows in 7.67 seconds, a 10M-row sales history answers in under 100ms. A 100M-row event log answers in under a second. A 500M-row clickstream answers in under 4 seconds — on a single node, in a browser-accessible cloud instance, at flat monthly pricing.
The billion-row benchmark matters not because you're running 1B-row queries daily. It matters because it proves the architecture works at scale — and more importantly, that there is headroom. An engine saturated at 10M rows has no headroom. An engine that handles 1B rows in under 8 seconds has 100× headroom for your typical workload. That headroom means sub-second response times, exploratory queries without planning, and no fear of accidentally running an expensive full scan.
Every detail in this benchmark is reproducible with public data and a standard GCP instance:
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme*)The same columnar analytics engine. Deployed in your AWS, GCP, or Azure account. 3-day free trial.
Start Free Trial →See the full benchmarks page → · Questions? support@duckdatamaster.guru