// DATA ENGINEERING · MAY 2026

Parquet vs CSV: Why Your Analytics Are 50x Slower Than They Need to Be

Scott Baker — Founder, Duck Data Master May 2026 · 8 min read · Databricks Certified Associate Developer · AWS Solutions Architect Associate

TL;DR: CSV reads every byte of every row for every query. Parquet reads only the columns your query touches, at 5–10x compression. For a typical analytics query touching 5 of 50 columns, Parquet reads roughly 1% of what CSV reads. That difference — a few hundred milliseconds vs. tens of seconds — is where your analytics time actually goes.

Every data team has a folder full of CSVs. It's the universal format — Excel opens it, every tool exports it, every API produces it. CSV is how data moves between systems, and that's fine. The problem is that CSV was designed for data exchange, not for analytics. When you run analytical queries on CSVs at scale, you're using the wrong tool for the job — and paying for it in query time, memory, and cost.

Parquet was designed specifically for the analytical workload. Understanding the difference — concretely, mechanically — makes the performance gap obvious instead of mysterious.

The Core Problem: How CSV Stores Data

CSV is row-oriented. Each line is one record, and the fields in that record sit next to each other on disk:

order_id,customer_id,product_sku,quantity,unit_price,order_date,region,channel,...
1001,C4821,SKU-9923,2,49.99,2024-01-15,West,Online,...
1002,C1193,SKU-0041,1,129.00,2024-01-15,East,Retail,...
1003,C7744,SKU-9923,4,49.99,2024-01-16,West,Online,...

Now imagine you run this query:

SELECT region, SUM(unit_price * quantity) AS revenue
FROM orders
GROUP BY region;

This query touches three columns: region, unit_price, and quantity. But to find those three columns in a CSV with 50 fields per row, the database engine has to read and parse every single field in every single row — all 50 — just to extract the 3 it needs. With 10 million rows and 50 columns, you're reading roughly 47 columns of data you will immediately discard.

There is no escape from this in the CSV format. The data is interleaved, row by row, and the only way to get column 12 is to read through columns 1 through 11 first.

How Parquet Stores Data Differently

Parquet is columnar. Instead of storing row 1 then row 2 then row 3, it stores all values of column 1, then all values of column 2, then all values of column 3:

[ region column ]    West | East | West | South | West | East | ...
[ unit_price col ]   49.99 | 129.00 | 49.99 | 89.99 | 49.99 | ...
[ quantity col ]     2 | 1 | 4 | 1 | 3 | 2 | ...
[ order_date col ]   2024-01-15 | 2024-01-15 | 2024-01-16 | ...
...

For the same GROUP BY region query, the analytics engine goes directly to the region, unit_price, and quantity column segments. It reads those three column chunks from disk. The other 47 columns are never touched. They don't even come off disk.

This is called column pruning, and it's the single biggest reason Parquet is faster for analytics.

The Compression Multiplier

Column storage makes compression dramatically more effective. When all values in a column are the same data type — and often very similar values — standard compression algorithms (Snappy, Zstd, GZIP) achieve ratios that are impossible in row-oriented storage.

Consider a region column with four possible values: West, East, South, North. Across 10 million rows, those four strings repeat constantly. Parquet uses dictionary encoding: it stores the dictionary {0:West, 1:East, 2:South, 3:North} once, and then stores 2-bit integers for each of the 10 million rows instead of 4–5 character strings. A column that would be ~60MB in CSV becomes ~2.5MB in Parquet.

Numeric columns compress similarly. A price column with many repeated values, or timestamps that cluster by day, compress to a fraction of their raw size. In practice, a 1GB CSV typically becomes 100–200MB as Parquet — a 5–10x reduction.

5–10×

typical file size reduction CSV → Parquet

~1%

data read for a 5-of-50 column query in Parquet vs CSV

10–50×

typical query speed improvement for analytical workloads

Predicate Pushdown: Skipping Entire Row Groups

Parquet has a third trick beyond column pruning and compression: row group statistics. Parquet files are divided into row groups (typically 64MB–1GB each). For every column in every row group, Parquet stores the minimum value, maximum value, and null count in the file metadata.

When you add a WHERE clause:

SELECT region, SUM(unit_price * quantity)
FROM orders
WHERE order_date >= '2024-06-01'
GROUP BY region;

The analytics engine reads the row group statistics before touching any actual data. If a row group has max(order_date) = '2024-03-31', it knows no row in that group can satisfy order_date >= '2024-06-01'. It skips the entire row group — without reading a single data byte from it.

On a dataset sorted or partitioned by date, this means a query for last month's data might skip 90% of the file entirely. CSV has no equivalent mechanism. The engine must read and evaluate every row.

Head-to-Head: Real Query Performance

These numbers are representative of analytical queries on a dataset with ~50 columns and 10 million rows, run on the same hardware:

Query Type	CSV (1.2 GB)	Parquet (180 MB)	Speedup
`COUNT(*)`	2.1s	0.04s	52×
`GROUP BY` on 3 columns	4.8s	0.18s	27×
Filtered `WHERE` + aggregation	5.2s	0.09s	58×
Multi-column `JOIN`	12.4s	0.41s	30×
`ORDER BY` + `LIMIT`	3.9s	0.22s	18×
Window function (`ROW_NUMBER`)	8.7s	0.31s	28×
Average improvement	—	—	35×

The COUNT(*) result deserves a call-out. In CSV, counting rows means reading the entire file to count newlines. In Parquet, the row count is stored in the file metadata — the engine reads a few bytes of header, not the full 1.2GB of data.

What About Memory?

Column pruning and compression don't just help with disk reads — they dramatically reduce memory pressure. A 1.2GB CSV query that reads the entire file needs to buffer and parse all 1.2GB. The equivalent Parquet query, reading 3 columns from a 180MB file, might only need 15–30MB in memory.

This matters especially when you're running analytics on a cloud compute instance with fixed RAM. Queries that would OOM on CSV complete cleanly in Parquet. You can run more concurrent queries. You can process larger datasets on smaller (cheaper) hardware.

When CSV Is the Right Choice

CSV isn't always wrong. It remains the right format for:

Data exchange between systems

When you're sending data to a vendor, receiving an export from a SaaS tool, or feeding data into a system that doesn't support Parquet, CSV is universal. Every language, every tool, every spreadsheet opens it. For a file that's moving from A to B once and then being processed once, the format overhead doesn't matter much.

Small files you'll read entirely

If your "analytics" is loading a 5,000-row lookup table in full, CSV is fine. The columnar advantage only materializes when you're reading a small fraction of columns across a large number of rows. For small datasets read in full, the overhead of Parquet's encoding sometimes makes it marginally slower than CSV.

Human inspection

CSV opens in any text editor. Parquet is a binary format — you can't cat it or open it in Excel directly (without a converter). If you need to spot-check data with minimal tooling, CSV is friendlier.

Converting CSV to Parquet

The conversion is straightforward — one query and you're done:

-- Load your CSV and write it out as Parquet
COPY (SELECT * FROM read_csv_auto('orders.csv'))
TO 'orders.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);

Or directly from S3:

COPY (SELECT * FROM read_csv_auto('s3://my-bucket/orders/*.csv'))
TO 's3://my-bucket/orders-parquet/'
(FORMAT PARQUET, COMPRESSION ZSTD, PARTITION_BY (year, month));

Duck Data Master handles this automatically in the Ingest Tab — drag in your CSVs, and the analytics engine converts, compresses, and partitions them before your first query runs. You get the Parquet performance without touching the command line.

Partition Your Parquet Files

Parquet's predicate pushdown becomes much more powerful when you organize files into partitions — subdirectories named by the most common filter column:

orders-parquet/
  year=2024/
    month=01/
      part-000.parquet
      part-001.parquet
    month=02/
      ...
  year=2025/
    ...

A query like WHERE year = 2024 AND month = 06 goes directly to that subdirectory and reads nothing else. This is how billion-row datasets stay fast — the engine never sees rows outside the partition it needs.

Column pruning removes column width. Partitioning removes row depth. Combined, you reduce the data your query touches by orders of magnitude.

The Operational Impact

If your analytics pipeline currently runs on CSV exports, switching to Parquet typically produces:

Query times drop from seconds or minutes to milliseconds for most analytical queries
Storage costs drop 5–10× (S3, GCS, and Azure Blob pricing is per-GB)
Downstream compute costs drop proportionally — less data read = less CPU time billed
Scheduled jobs that previously ran for hours complete in minutes
Interactive exploration becomes genuinely interactive instead of "submit and wait"

For a team running $3,000/month in S3 costs on CSV data and $8,000/month in Redshift or Athena compute, converting to partitioned Parquet is frequently a $7,000–$9,000/month reduction. The conversion takes an afternoon.

Load your CSVs. Get Parquet performance automatically.

The Ingest Tab converts, compresses, and partitions your files before your first query runs. No command line required.

Start Free Trial →

Questions? support@duckdatamaster.guru