The traditional data pipeline looks like this: export from source → stage in S3 → ETL job copies it to a warehouse → analysts query the warehouse. Each hop adds latency, cost, complexity, and a point of failure. And the data is now in three places instead of one.
The modern answer is simpler: skip the copy. Query the data where it already lives.
Every data movement step has a cost. S3 egress fees. Warehouse ingestion costs. ETL compute costs. Engineering time to build and maintain the pipeline. And the hidden cost nobody talks about: the lag between when data lands in your source system and when analysts can query it.
A typical ETL pipeline introduces 15–60 minutes of lag for "near-real-time" data. Daily batch pipelines mean yesterday's data is the freshest your team can see. None of this is necessary if your analytics engine can read directly from cloud storage.
The Duck Data Master Guru Plan includes direct connectors to all three major cloud storage providers: Amazon S3, Google Cloud Storage, and Azure (Blob Storage and Data Lake Gen2). Once connected, you browse your bucket like a file system, select the files or prefixes you want to query, and the analytics engine reads them directly, using column projection and predicate pushdown to read only the columns and row groups your query needs.
For Parquet files in a well-partitioned data lake, this means a query like "revenue by region for Q1" scans only the Q1 partition files, reads only the region and revenue columns, and skips everything else. No full table scan. No data copy. No ETL.
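To make "reads only the columns and row groups your query needs" concrete, here is a minimal sketch in plain Python of how min/max statistics let a reader skip row groups, and how column projection limits the bytes read in the ones that remain. The `RowGroup` class, its fields, the file names, and the byte counts are all hypothetical and exist only for illustration; this is not Duck Data Master's engine or a real Parquet reader.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RowGroup:
    """Hypothetical stand-in for the footer metadata of one Parquet row group."""
    file: str
    min_order_date: date   # min/max statistics kept for the order_date column
    max_order_date: date
    column_bytes: dict     # column name -> bytes that column occupies here


def plan_scan(row_groups, start, end, needed_columns):
    """Return (row groups to read, bytes to read) for a date-range query.

    A row group is skipped when its [min, max] range cannot overlap
    [start, end]. In the surviving row groups only needed columns are read.
    """
    selected = [
        rg for rg in row_groups
        if rg.max_order_date >= start and rg.min_order_date <= end
    ]
    bytes_to_read = sum(
        rg.column_bytes[col] for rg in selected for col in needed_columns
    )
    return selected, bytes_to_read


# Made-up metadata for three quarterly files.
row_groups = [
    RowGroup("sales-2025q4.parquet", date(2025, 10, 1), date(2025, 12, 31),
             {"region": 10, "revenue": 20, "customer_notes": 500}),
    RowGroup("sales-2026q1.parquet", date(2026, 1, 1), date(2026, 3, 31),
             {"region": 12, "revenue": 24, "customer_notes": 600}),
    RowGroup("sales-2026q2.parquet", date(2026, 4, 1), date(2026, 6, 30),
             {"region": 11, "revenue": 22, "customer_notes": 550}),
]

# "Revenue by region for Q1": two columns, one quarter.
selected, bytes_to_read = plan_scan(
    row_groups, date(2026, 1, 1), date(2026, 3, 31), ["region", "revenue"]
)
total_bytes = sum(sum(rg.column_bytes.values()) for rg in row_groups)
print([rg.file for rg in selected], bytes_to_read, "of", total_bytes, "bytes")
```

The decision to skip is made from metadata alone, which is why a wide table with many unused columns costs little to query.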
| Cloud Storage | Supported Formats | Auth Method | Partition Pruning? |
|---|---|---|---|
| Amazon S3 | Parquet, CSV, JSON, Arrow, Delta Lake | IAM role or access key | Yes — Hive-style partitions |
| Google Cloud Storage | Parquet, CSV, JSON, Arrow | Service account JSON | Yes — Hive-style partitions |
| Azure Blob Storage | Parquet, CSV, JSON, Arrow | Storage account key or SAS token | Yes — Hive-style partitions |
| Azure Data Lake Gen2 | Parquet, CSV, JSON, Delta Lake | SAS token or ADLS credentials | Yes — Hive-style partitions |
If your data is stored in Hive-style partitions — folders named like year=2026/month=01/ — the analytics engine uses those partition values to skip irrelevant files entirely. A query with WHERE year=2026 AND month=01 reads zero files from 2025 or February, even if those partitions contain billions of rows.
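The pruning step can be sketched in a few lines of plain Python. The example below assumes integer-valued partitions and uses hypothetical object keys and helper names (`partition_values`, `prune`); a real engine does the same thing against the bucket listing, but this is only an illustration of the idea, not the product's implementation.

```python
from pathlib import PurePosixPath


def partition_values(key):
    """Parse Hive-style 'name=value' path segments into a dict of strings."""
    values = {}
    for segment in PurePosixPath(key).parts:
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


def prune(keys, **wanted):
    """Keep only object keys whose partition values match every filter.

    Assumes integer partition values; comparing as integers means
    month=1 and month=01 are treated as the same partition.
    """
    kept = []
    for key in keys:
        values = partition_values(key)
        if all(
            name in values and int(values[name]) == int(expected)
            for name, expected in wanted.items()
        ):
            kept.append(key)
    return kept


# Hypothetical object keys in a bucket. Only the paths are inspected,
# so no file is opened to decide what can be skipped.
keys = [
    "sales/year=2025/month=12/part-0000.parquet",
    "sales/year=2026/month=01/part-0000.parquet",
    "sales/year=2026/month=01/part-0001.parquet",
    "sales/year=2026/month=02/part-0000.parquet",
]

print(prune(keys, year=2026, month=1))
```

Because the filter runs on path names, the cost of skipping a partition does not depend on how many rows it holds.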
Partition pruning is often the single biggest performance win available in cloud storage analytics. For example, a properly partitioned Parquet data lake holding 10 TB can answer a one-month query by reading about 50 GB, roughly 200 times less data than a full scan.
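The arithmetic behind that figure is easy to check. This short calculation assumes decimal units (1 TB = 1,000 GB); the sizes are the illustrative ones from the paragraph above, not measurements.

```python
TB = 10**12  # decimal units assumed: 1 TB = 1,000 GB
GB = 10**9

lake_bytes = 10 * TB      # total size of the data lake
scanned_bytes = 50 * GB   # data read for the one-month query

reduction = lake_bytes / scanned_bytes
fraction_scanned = scanned_bytes / lake_bytes

print(f"{reduction:.0f}x less data, {fraction_scanned:.1%} of the lake scanned")
```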
Delta Lake is Parquet with a transaction log on top — the format that powers most modern Databricks data lakehouses. If your data engineering team has already built Delta Lake tables in S3 or ADLS, you can query them directly from Duck Data Master without any migration or conversion. The analytics engine reads the Delta log to find the latest snapshot and skips deleted or overwritten files automatically.
This matters most for teams that use upserts and deletes, operations that plain Parquet files do not support. A customer record that was updated three times returns only the latest version. A deleted record doesn't appear at all. Delta Lake handles this; plain Parquet readers don't.
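As a simplified sketch of how a reader finds the latest snapshot, the code below replays the `add` and `remove` actions from a Delta-style transaction log to work out which data files are still live. The commit contents and file names are made up, and real Delta logs carry more than this (extra fields on each action, metadata and protocol actions, checkpoints), so treat it as an illustration of the principle rather than a working Delta reader.

```python
import json


def live_files(commits):
    """Replay Delta-style commits in order and return the current data files.

    Each commit is a string of newline-delimited JSON actions. Only the
    'add' and 'remove' actions are handled in this sketch.
    """
    active = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return sorted(active)


# Hypothetical contents of three commit files under _delta_log/.
commits = [
    # Version 0: the initial load writes two files.
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    # Version 1: an upsert rewrites part-0000 with the updated customer row.
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
    # Version 2: a delete rewrites part-0001 without the deleted row.
    '{"remove": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0003.parquet"}}',
]

print(live_files(commits))
```

A plain Parquet reader pointed at the same folder would see all four files and return the stale and deleted rows alongside the current ones.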
| Metric | Traditional ETL → Warehouse | Direct Cloud Storage Query |
|---|---|---|
| Time to first query after data lands | 15 min – 24 hours (pipeline lag) | Seconds (read directly) |
| Data copies in flight | 3–5 (source → staging → warehouse → cache) | Zero — one source of truth |
| Monthly infrastructure cost | $2,000–$15,000 (warehouse + ETL compute) | S3/GCS storage + compute instance |
| Engineering time to maintain pipeline | Ongoing — every schema change breaks it | Near zero — schema changes auto-detected |
| Data governance complexity | High — data scattered across systems | Low — one canonical location |
Cloud storage credentials (IAM roles, service account keys, SAS tokens) are stored encrypted in your Duck Data Master instance configuration — not on any shared infrastructure. The analytics engine running in your GCP instance uses those credentials to make direct calls to your cloud provider's storage APIs. Duck Data Master's backend never sees your data or your files.
For S3, the recommended setup is an IAM role with read-only access to specific bucket prefixes. This is least-privilege access that can be audited in CloudTrail. For GCS, use a service account with the Storage Object Viewer role on the specific bucket. Your security team can review and revoke access at any time.
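A read-only, prefix-scoped S3 policy of the kind described above might look like the following sketch, which builds the policy document as a Python dictionary and prints it as JSON. The bucket name `example-analytics-bucket`, the `sales/` prefix, and the statement IDs are placeholders, and your security team may require additional conditions; this is an illustration, not an official Duck Data Master template.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an S3 IAM policy document limited to reading one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only for keys under the chosen prefix.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Allow reading objects under the prefix; no write or delete.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder bucket and prefix; substitute your own.
policy = read_only_policy("example-analytics-bucket", "sales/")
print(json.dumps(policy, indent=2))
```

Attaching this to a role grants nothing beyond listing and reading under the one prefix, so revoking access is a matter of detaching the policy or deleting the role.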
| Scenario | Recommendation |
|---|---|
| One-off analysis of a file you already have | File upload or URL — faster for small files |
| Recurring analysis of data that updates daily | Cloud storage connector — reads latest automatically |
| Data lake with TB+ of Parquet partitions | Cloud storage connector — partition pruning is essential |
| Sensitive data that must not leave your cloud | Cloud storage connector — nothing leaves your region |
| Data shared by a colleague via URL | URL load — simplest path |
| Delta Lake tables from Databricks/Spark | Cloud storage connector with Delta Lake support |
Cloud storage connectors are available on the Guru Plan. 3-day free trial.
Questions? support@duckdatamaster.guru