Every data team has a folder full of CSVs. It's the universal format — Excel opens it, every tool exports it, every API produces it. CSV is how data moves between systems, and that's fine. The problem is that CSV was designed for data exchange, not for analytics. When you run analytical queries on CSVs at scale, you're using the wrong tool for the job — and paying for it in query time, memory, and cost.
Parquet was designed specifically for the analytical workload. Understanding the difference — concretely, mechanically — makes the performance gap obvious instead of mysterious.
CSV is row-oriented. Each line is one record, and the fields in that record sit next to each other on disk:
order_id,customer_id,product_sku,quantity,unit_price,order_date,region,channel,...
1001,C4821,SKU-9923,2,49.99,2024-01-15,West,Online,...
1002,C1193,SKU-0041,1,129.00,2024-01-15,East,Retail,...
1003,C7744,SKU-9923,4,49.99,2024-01-16,West,Online,...
Now imagine you run this query:
SELECT region, SUM(unit_price * quantity) AS revenue
FROM orders
GROUP BY region;
This query touches three columns: region, unit_price, and quantity. But to find those three columns in a CSV with 50 fields per row, the database engine has to read and parse every single field in every single row — all 50 — just to extract the 3 it needs. With 10 million rows and 50 columns, you're reading roughly 47 columns of data you will immediately discard.
There is no escape from this in the CSV format. The data is interleaved, row by row, and the only way to get column 12 is to read through columns 1 through 11 first.
Parquet is columnar. Instead of storing row 1 then row 2 then row 3, it stores all values of column 1, then all values of column 2, then all values of column 3:
[ region column ] West | East | West | South | West | East | ...
[ unit_price col ] 49.99 | 129.00 | 49.99 | 89.99 | 49.99 | ...
[ quantity col ] 2 | 1 | 4 | 1 | 3 | 2 | ...
[ order_date col ] 2024-01-15 | 2024-01-15 | 2024-01-16 | ...
...
For the same GROUP BY region query, the analytics engine goes directly to the region, unit_price, and quantity column segments. It reads those three column chunks from disk. The other 47 columns are never touched. They don't even come off disk.
This is called column pruning, and it's the single biggest reason Parquet is faster for analytics.
Column storage makes compression dramatically more effective. When all values in a column are the same data type — and often very similar values — standard compression algorithms (Snappy, Zstd, GZIP) achieve ratios that are impossible in row-oriented storage.
Consider a region column with four possible values: West, East, South, North. Across 10 million rows, those four strings repeat constantly. Parquet uses dictionary encoding: it stores the dictionary {0:West, 1:East, 2:South, 3:North} once, and then stores 2-bit integers for each of the 10 million rows instead of 4–5 character strings. A column that would be ~60MB in CSV becomes ~2.5MB in Parquet.
Numeric columns compress similarly. A price column with many repeated values, or timestamps that cluster by day, compress to a fraction of their raw size. In practice, a 1GB CSV typically becomes 100–200MB as Parquet — a 5–10x reduction.
Parquet has a third trick beyond column pruning and compression: row group statistics. Parquet files are divided into row groups (typically 64MB–1GB each). For every column in every row group, Parquet stores the minimum value, maximum value, and null count in the file metadata.
When you add a WHERE clause:
SELECT region, SUM(unit_price * quantity)
FROM orders
WHERE order_date >= '2024-06-01'
GROUP BY region;
The analytics engine reads the row group statistics before touching any actual data. If a row group has max(order_date) = '2024-03-31', it knows no row in that group can satisfy order_date >= '2024-06-01'. It skips the entire row group — without reading a single data byte from it.
On a dataset sorted or partitioned by date, this means a query for last month's data might skip 90% of the file entirely. CSV has no equivalent mechanism. The engine must read and evaluate every row.
These numbers are representative of analytical queries on a dataset with ~50 columns and 10 million rows, run on the same hardware:
| Query Type | CSV (1.2 GB) | Parquet (180 MB) | Speedup |
|---|---|---|---|
COUNT(*) | 2.1s | 0.04s | 52× |
GROUP BY on 3 columns | 4.8s | 0.18s | 27× |
Filtered WHERE + aggregation | 5.2s | 0.09s | 58× |
Multi-column JOIN | 12.4s | 0.41s | 30× |
ORDER BY + LIMIT | 3.9s | 0.22s | 18× |
Window function (ROW_NUMBER) | 8.7s | 0.31s | 28× |
| Average improvement | — | — | 35× |
The COUNT(*) result deserves a call-out. In CSV, counting rows means reading the entire file to count newlines. In Parquet, the row count is stored in the file metadata — the engine reads a few bytes of header, not the full 1.2GB of data.
Column pruning and compression don't just help with disk reads — they dramatically reduce memory pressure. A 1.2GB CSV query that reads the entire file needs to buffer and parse all 1.2GB. The equivalent Parquet query, reading 3 columns from a 180MB file, might only need 15–30MB in memory.
This matters especially when you're running analytics on a cloud compute instance with fixed RAM. Queries that would OOM on CSV complete cleanly in Parquet. You can run more concurrent queries. You can process larger datasets on smaller (cheaper) hardware.
CSV isn't always wrong. It remains the right format for:
When you're sending data to a vendor, receiving an export from a SaaS tool, or feeding data into a system that doesn't support Parquet, CSV is universal. Every language, every tool, every spreadsheet opens it. For a file that's moving from A to B once and then being processed once, the format overhead doesn't matter much.
If your "analytics" is loading a 5,000-row lookup table in full, CSV is fine. The columnar advantage only materializes when you're reading a small fraction of columns across a large number of rows. For small datasets read in full, the overhead of Parquet's encoding sometimes makes it marginally slower than CSV.
CSV opens in any text editor. Parquet is a binary format — you can't cat it or open it in Excel directly (without a converter). If you need to spot-check data with minimal tooling, CSV is friendlier.
The conversion is straightforward — one query and you're done:
-- Load your CSV and write it out as Parquet
COPY (SELECT * FROM read_csv_auto('orders.csv'))
TO 'orders.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);
Or directly from S3:
COPY (SELECT * FROM read_csv_auto('s3://my-bucket/orders/*.csv'))
TO 's3://my-bucket/orders-parquet/'
(FORMAT PARQUET, COMPRESSION ZSTD, PARTITION_BY (year, month));
Duck Data Master handles this automatically in the Ingest Tab — drag in your CSVs, and the analytics engine converts, compresses, and partitions them before your first query runs. You get the Parquet performance without touching the command line.
Parquet's predicate pushdown becomes much more powerful when you organize files into partitions — subdirectories named by the most common filter column:
orders-parquet/
year=2024/
month=01/
part-000.parquet
part-001.parquet
month=02/
...
year=2025/
...
A query like WHERE year = 2024 AND month = 06 goes directly to that subdirectory and reads nothing else. This is how billion-row datasets stay fast — the engine never sees rows outside the partition it needs.
Column pruning removes column width. Partitioning removes row depth. Combined, you reduce the data your query touches by orders of magnitude.
If your analytics pipeline currently runs on CSV exports, switching to Parquet typically produces:
For a team running $3,000/month in S3 costs on CSV data and $8,000/month in Redshift or Athena compute, converting to partitioned Parquet is frequently a $7,000–$9,000/month reduction. The conversion takes an afternoon.
The Ingest Tab converts, compresses, and partitions your files before your first query runs. No command line required.
Start Free Trial →Questions? support@duckdatamaster.guru