The first question any analytics workflow has to answer is: how do I get my data in? For most teams, the answer is a mess: a CSV export from the CRM, a Parquet file from the data lake, a JSON dump from an API. That means three different formats, three different loading workflows, and a data engineer in the loop for every one of them.
It doesn't have to be that way. Here's what you actually need to know about data formats and loading — and how to do it yourself.
| Format | Type | Analytics performance | File size | Human-readable? | Best for |
|---|---|---|---|---|---|
| Parquet | Columnar, binary | Excellent — 10–50x faster than CSV | Very small (4–10x compressed vs CSV) | No | Analytics workloads — use this whenever possible |
| CSV | Row-based, text | Slow — reads every column even if you only need one | Large | Yes | Excel exports, data sharing with non-technical users |
| JSON | Row-based, text | Slow — nested structures add parsing overhead | Large | Yes | API responses, config data, semi-structured records |
| Arrow (IPC) | Columnar, binary | Excellent — zero-copy in memory | Small | No | High-speed in-memory transfer between systems |
| Delta Lake | Parquet + transaction log | Excellent + ACID transactions | Small | No | Lakehouse workloads with upserts and deletes |
| Excel (.xlsx) | Row-based, binary | Slow, size-limited | Medium | With Excel only | Business reporting — convert to CSV/Parquet for analytics |
The single most impactful thing most analytics teams can do is convert their CSVs to Parquet. A 1 GB CSV typically shrinks to around 100 MB as Parquet. A query that takes 8 seconds to scan the 1 GB CSV can scan the Parquet equivalent in under 200 ms, because Parquet reads only the columns your query needs, while a CSV reader has to parse every byte of every row.
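To see why column pruning matters, here is a minimal sketch using only the Python standard library. It stores the same small table two ways: as CSV text, and as one byte blob per column. The per-column layout is a toy stand-in for the columnar idea, not Parquet's actual encoding, and the table contents and function names are made up for illustration.

```python
import csv
import io
import struct

# Hypothetical sample table: 1,000 rows, four columns.
rows = [(i, f"customer_{i}", "EMEA" if i % 2 else "APAC", i * 1.5) for i in range(1000)]

# Row-based text layout: every row carries every column, as in a CSV file.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = buf.getvalue().encode("utf-8")

# Toy columnar layout: each column is stored contiguously on its own.
# (Real Parquet adds encodings, compression, and row groups on top of this idea.)
columns = {
    "id": struct.pack(f"<{len(rows)}q", *(r[0] for r in rows)),
    "name": "\n".join(r[1] for r in rows).encode("utf-8"),
    "region": "\n".join(r[2] for r in rows).encode("utf-8"),
    "amount": struct.pack(f"<{len(rows)}d", *(r[3] for r in rows)),
}

def sum_amount_from_csv(data: bytes) -> tuple[float, int]:
    """Sum the fourth column; the whole file has to be parsed to reach it."""
    total = 0.0
    for record in csv.reader(io.StringIO(data.decode("utf-8"))):
        total += float(record[3])
    return total, len(data)

def sum_amount_from_columns(cols: dict[str, bytes]) -> tuple[float, int]:
    """Sum the same column; only that column's bytes are touched."""
    blob = cols["amount"]
    values = struct.unpack(f"<{len(blob) // 8}d", blob)
    return sum(values), len(blob)

csv_total, csv_bytes_read = sum_amount_from_csv(csv_bytes)
col_total, col_bytes_read = sum_amount_from_columns(columns)
print(f"CSV: {csv_bytes_read} bytes read, columnar: {col_bytes_read} bytes read")
```

Both functions return the same sum, but the columnar version reads only the 8,000 bytes of the `amount` column while the CSV version has to walk the whole file. The gap widens as the table gets more columns.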
Drag and drop any file — CSV, Parquet, JSON, Arrow, Excel — directly into the Ingest Tab. The analytics engine detects the format automatically, infers the schema, and loads the table. A 500MB Parquet file loads in seconds. Available on all plans.
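Format detection usually comes down to looking at the first few bytes of a file. The sketch below is an illustration of that idea, not the Ingest Tab's actual implementation. The function name `sniff_format` is hypothetical. The magic bytes are the standard ones: Parquet files start with `PAR1`, Arrow IPC files start with `ARROW1`, and `.xlsx` files are ZIP archives that start with `PK\x03\x04`.

```python
def sniff_format(path: str) -> str:
    """Guess a file's format from its leading bytes, falling back to CSV."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"PAR1"):        # Parquet files begin (and end) with PAR1
        return "parquet"
    if head.startswith(b"ARROW1"):      # Arrow IPC file-format magic
        return "arrow"
    if head.startswith(b"PK\x03\x04"):  # ZIP archive; assumed here to be .xlsx
        return "xlsx"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json"                   # JSON documents open with an object or array
    return "csv"                        # plain text with no recognizable marker
```

A production detector would also check the file extension, look inside the ZIP to confirm it is really a spreadsheet, and handle byte-order marks. This version only shows the core trick.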
Paste any public or authenticated URL — an S3 presigned link, a GitHub raw file, a public data portal download — and the engine fetches and loads it directly. No intermediate download to your laptop. Available on all plans.
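If you want to reproduce the "no intermediate download" idea in your own scripts, a rough standard-library sketch looks like the following. It assumes the URL points at a UTF-8 CSV small enough to hold in memory, and `load_csv_from_url` is a hypothetical helper name. A presigned link carries its credentials in the query string, so no extra headers are needed.

```python
import csv
import io
import urllib.request

def load_csv_from_url(url: str, timeout: float = 30.0) -> list[dict[str, str]]:
    """Fetch a CSV from a URL and parse it in memory, without saving a local copy."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    # newline="" lets the csv module handle line endings inside quoted fields.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

Every value comes back as a string here. Type inference is a separate step.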
Connect directly to your GCS bucket, S3 bucket, or Azure Blob container. Browse files and load tables straight from cloud storage. The analytics engine reads your files in place: no ETL pipeline, no copy, no data movement. Query files that live in your data lake directly from the Query Tab.
One of the most painful parts of data loading — especially with CSVs — is schema inference. Is that column a date or a string? Is that number an integer or a float? Did the export tool wrap numbers in quotes?
The Duck Data Master Ingest Tab handles this automatically. It samples the file, infers column types, and creates the table with the correct schema. If a column has mixed types (a common CSV problem), it casts conservatively to VARCHAR to avoid data loss. You can override any inferred type from the schema editor before finalizing the load.
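The sample-then-infer approach can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the Ingest Tab's real logic: the type names other than VARCHAR, the candidate order, and the `infer_schema` helper are all made up for the example. Each column gets the narrowest type that every sampled value parses as, and falls back to VARCHAR when values are mixed.

```python
import csv
import io
from datetime import date

def _parses(value: str, caster) -> bool:
    """Return True if caster accepts the value without raising ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False

# Candidate types, tried in order; VARCHAR is the fallback when none fits.
CANDIDATES = [
    ("BIGINT", int),
    ("DOUBLE", float),
    ("DATE", date.fromisoformat),
]

def infer_schema(csv_text: str, sample_size: int = 1000) -> dict[str, str]:
    """Infer one type per column from the first sample_size rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sample = [row for _, row in zip(range(sample_size), reader)]
    schema = {}
    for column in reader.fieldnames or []:
        # Empty cells are treated as NULLs and do not vote on the type.
        values = [row[column] for row in sample if row[column] not in ("", None)]
        schema[column] = "VARCHAR"
        for type_name, caster in CANDIDATES:
            if values and all(_parses(v, caster) for v in values):
                schema[column] = type_name
                break
    return schema

sample_csv = 'id,amount,signup,code\n1,9.5,2024-01-05,A1\n"2",10,2024-02-11,17\n'
print(infer_schema(sample_csv))
# {'id': 'BIGINT', 'amount': 'DOUBLE', 'signup': 'DATE', 'code': 'VARCHAR'}
```

Note that the quoted `"2"` still infers as an integer, because the CSV parser strips the quotes before the type check. The `code` column mixes `A1` and `17`, so it stays VARCHAR.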
| Operation | CSV (1M rows) | Parquet (1M rows) | Improvement |
|---|---|---|---|
| COUNT(*) | ~400ms | 2ms | 200x faster |
| SUM on one column | ~600ms | 8ms | 75x faster |
| GROUP BY + aggregate | ~1.2s | 38ms | 32x faster |
| Full table scan (SELECT *) | ~2s | ~1.8s | Similar (reads all columns) |
The full table scan is the only case where format barely matters — because you're reading every column anyway. For every real analytical query (aggregations, filters, GROUP BY), Parquet wins by a large margin.
If your data arrives as CSV, convert it once and store the Parquet version. In Python:
Or use the Python NL Mode in Duck Data Master — type "convert my loaded CSV table to Parquet and save it" and it generates and runs the conversion code automatically.
Any format. Any size. Zero pipeline engineering. 3-day free trial.
Start Free Trial →

Questions? support@duckdatamaster.guru