// FEATURE DEEP DIVE · MAY 2026

Cloud Storage Analytics: Query S3, GCS, and Azure Blob Without Moving Your Data

Scott Baker, Founder, Duck Data Master · May 2026 · 8 min read · Databricks Certified Associate Developer · AWS Solutions Architect Associate

TL;DR: The Guru Plan Cloud Storage Connector lets you query Parquet, CSV, and JSON files directly from your S3 bucket, GCS bucket, or Azure Blob container — no ETL pipeline, no data copy, no movement. Your data stays in your cloud. Your analytics engine reads it in place.

The traditional data pipeline looks like this: export from source → stage in S3 → ETL job copies it to a warehouse → analysts query the warehouse. Each hop adds latency, cost, complexity, and a point of failure. And the data is now in three places instead of one.

The modern answer is simpler: skip the copy. Query the data where it already lives.

The Data Movement Problem

Every data movement step has a cost. S3 egress fees. Warehouse ingestion costs. ETL compute costs. Engineering time to build and maintain the pipeline. And the hidden cost nobody talks about: the lag between when data lands in your source system and when analysts can query it.

A typical ETL pipeline introduces 15–60 minutes of lag for "near-real-time" data. Daily batch pipelines mean yesterday's data is the freshest your team can see. None of this is necessary if your analytics engine can read directly from cloud storage.

How Cloud Storage Analytics Works

The Duck Data Master Guru Plan includes direct connectors to all three major cloud storage providers. Once connected, you browse your bucket like a file system, select the files or prefixes you want to query, and the analytics engine reads them directly. For columnar formats such as Parquet, it uses projection and filter pushdown to read only the columns and row groups your query needs.

For Parquet files in a well-partitioned data lake, this means a query like "revenue by region for Q1" scans only the Q1 partition files, reads only the region and revenue columns, and skips everything else. No full table scan. No data copy. No ETL.
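To make the pushdown idea concrete, here is a minimal Python sketch of how a reader might plan a scan from Parquet-style footer statistics. It is an illustration only, not Duck Data Master's actual planner: the RowGroup class, the plan_scan function, the byte sizes, and the use of month as an ordinary data column are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Simplified stand-in for one Parquet row group's footer metadata (hypothetical)."""
    min_month: int        # smallest value of the `month` column in this row group
    max_month: int        # largest value of the `month` column in this row group
    column_bytes: dict    # column name -> size of that column chunk in bytes


def plan_scan(row_groups, needed_columns, month_lo, month_hi):
    """Pick the row groups and count the bytes needed for month_lo <= month <= month_hi."""
    selected = []
    bytes_to_read = 0
    for rg in row_groups:
        # Filter pushdown: skip a row group whose [min, max] range cannot match.
        if rg.max_month < month_lo or rg.min_month > month_hi:
            continue
        selected.append(rg)
        # Projection pushdown: count only the column chunks the query references.
        bytes_to_read += sum(rg.column_bytes[c] for c in needed_columns)
    return selected, bytes_to_read


# Made-up sizes: `payload` is a wide column the query never touches.
row_groups = [
    RowGroup(1, 2, {"region": 10, "revenue": 20, "month": 5, "payload": 500}),
    RowGroup(3, 4, {"region": 12, "revenue": 22, "month": 5, "payload": 480}),
    RowGroup(5, 6, {"region": 11, "revenue": 21, "month": 5, "payload": 510}),
]

# "Revenue by region for Q1" needs region and revenue, plus month for the filter.
selected, bytes_to_read = plan_scan(row_groups, ["region", "revenue", "month"], 1, 3)
full_scan_bytes = sum(sum(rg.column_bytes.values()) for rg in row_groups)

print(f"row groups read: {len(selected)} of {len(row_groups)}")
print(f"bytes read: {bytes_to_read} of {full_scan_bytes}")
```

The second row group is still read because its month range (3 to 4) overlaps the filter, so statistics narrow the scan but do not replace row-level filtering. The third row group and every payload chunk are skipped without being fetched.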

Supported Storage Providers and Formats

Cloud Storage	Supported Formats	Auth Method	Partition Pruning?
Amazon S3	Parquet, CSV, JSON, Arrow, Delta Lake	IAM role or access key	Yes — Hive-style partitions
Google Cloud Storage	Parquet, CSV, JSON, Arrow	Service account JSON	Yes — Hive-style partitions
Azure Blob Storage	Parquet, CSV, JSON, Arrow	Storage account key or SAS token	Yes — Hive-style partitions
Azure Data Lake Gen2	Parquet, CSV, JSON, Delta Lake	SAS token or ADLS credentials	Yes — Hive-style partitions

Partition Pruning: Why Your Data Lake Structure Matters

If your data is stored in Hive-style partitions — folders named like year=2026/month=01/ — the analytics engine uses those partition values to skip irrelevant files entirely. A query with WHERE year=2026 AND month=01 reads zero files from 2025 or February, even if those partitions contain billions of rows.

This is the single biggest performance win available in cloud storage analytics. A properly partitioned Parquet data lake with 10TB of data can answer a one-month query by reading 50GB — 200x less data than a full scan.
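The 200x figure is simple arithmetic, shown here as a short Python check. It assumes decimal units (1TB = 1,000GB), and the variable names are just for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units, 1 TB = 1,000 GB

lake_gb = 10 * GB_PER_TB   # the 10TB data lake from the example above
scanned_gb = 50            # data actually read for the one-month query

reduction_factor = lake_gb / scanned_gb   # how many times less data is read
scanned_fraction = scanned_gb / lake_gb   # share of the lake that is touched

print(f"{reduction_factor:.0f}x less data, {scanned_fraction:.1%} of the lake")
```

In other words, the one-month query touches half a percent of what a full scan would read.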

Recommended partition structure for time-series data

s3://your-bucket/events/
  year=2025/month=11/day=01/events_00001.parquet
  year=2025/month=11/day=02/events_00001.parquet
  year=2026/month=01/day=01/events_00001.parquet
  ...

-- Reads only the Q1 2026 partitions (January through March, 90 days)
SELECT region, SUM(revenue) AS total_revenue
FROM read_parquet('s3://your-bucket/events/**/*.parquet', hive_partitioning=true)
WHERE year = 2026 AND month BETWEEN 1 AND 3
GROUP BY region;
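Conceptually, partition pruning is a filter over file paths that runs before any file is opened. The Python sketch below shows the idea using the example paths above. The helper names (parse_hive_partitions, prune, is_q1_2026) are hypothetical, and a real engine applies the same logic while listing the bucket rather than over an in-memory list.

```python
def parse_hive_partitions(path):
    """Extract key=value directory segments from a Hive-style path."""
    partitions = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Treat purely numeric values as integers so "01" compares equal to 1.
            partitions[key] = int(value) if value.isdigit() else value
    return partitions


def prune(paths, predicate):
    """Keep only the files whose partition values satisfy the predicate."""
    return [p for p in paths if predicate(parse_hive_partitions(p))]


def is_q1_2026(partitions):
    """Mirror of: WHERE year = 2026 AND month BETWEEN 1 AND 3."""
    return partitions.get("year") == 2026 and 1 <= partitions.get("month", 0) <= 3


paths = [
    "s3://your-bucket/events/year=2025/month=11/day=01/events_00001.parquet",
    "s3://your-bucket/events/year=2025/month=11/day=02/events_00001.parquet",
    "s3://your-bucket/events/year=2026/month=01/day=01/events_00001.parquet",
]

files_to_read = prune(paths, is_q1_2026)
print(files_to_read)
```

Only the 2026 file survives the filter. The two November 2025 files are excluded on their path alone, with no request made for their contents.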

Delta Lake Support

Delta Lake is Parquet with a transaction log on top — the format that powers most modern Databricks data lakehouses. If your data engineering team has already built Delta Lake tables in S3 or ADLS, you can query them directly from Duck Data Master without any migration or conversion. The analytics engine reads the Delta log to find the latest snapshot and skips deleted or overwritten files automatically.

This matters most for teams that use upserts and deletes, operations that plain Parquet files have no way to express. A customer record that was updated three times returns only the latest version. A deleted record doesn't appear at all. Delta Lake handles this through its transaction log; a plain Parquet reader pointed at the same folder could also read the superseded files and return stale or deleted rows.
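The snapshot logic can be sketched in a few lines of Python. A Delta table's _delta_log folder holds JSON commit files in which each line is an action, and replaying the add and remove actions in version order yields the set of live data files. This is a simplified illustration, not Duck Data Master's reader: the file names are made up, the commits are inline strings rather than files, and real logs also carry metadata actions, checkpoints, and other features that are ignored here.

```python
import json


def live_files(commits):
    """Replay commits in version order and return the data files in the latest snapshot."""
    active = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                # The file stays in storage until cleaned up, but is no longer part of the table.
                active.discard(action["remove"]["path"])
            # Other action types (commitInfo, metaData, protocol) are skipped in this sketch.
    return sorted(active)


# Hypothetical commits: version 0 writes two files, version 1 rewrites one of them.
commits = [
    '{"add": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
    '{"commitInfo": {"operation": "MERGE"}}\n'
    '{"remove": {"path": "part-0001.parquet"}}\n'
    '{"add": {"path": "part-0003.parquet"}}',
]

snapshot = live_files(commits)
print(snapshot)
```

A reader that simply globbed the folder for *.parquet would pick up all three files, including the superseded part-0001.parquet. Reading the log first is what keeps updated and deleted rows out of the result.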

What Changes When You Stop Copying Data

Metric	Traditional ETL → Warehouse	Direct Cloud Storage Query
Time to first query after data lands	15 min – 24 hours (pipeline lag)	Seconds (read directly)
Data copies in flight	3–5 (source → staging → warehouse → cache)	Zero — one source of truth
Monthly infrastructure cost	$2,000–$15,000 (warehouse + ETL compute)	S3/GCS storage + compute instance
Engineering time to maintain pipeline	Ongoing — every schema change breaks it	Near zero — schema changes auto-detected
Data governance complexity	High — data scattered across systems	Low — one canonical location

Security: Your Credentials, Your Control

Cloud storage credentials (IAM roles, service account keys, SAS tokens) are stored encrypted in your Duck Data Master instance configuration — not on any shared infrastructure. The analytics engine running in your GCP instance uses those credentials to make direct calls to your cloud provider's storage APIs. Duck Data Master's backend never sees your data or your files.

For S3, the recommended setup is an IAM role with read-only access to specific bucket prefixes: least-privilege access that can be audited in CloudTrail (object-level reads are logged once S3 data events are enabled). For GCS, use a service account granted the Storage Object Viewer role on the specific bucket. Your security team can review and revoke access at any time.
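As a rough sketch of what least-privilege access to a prefix can look like, the Python snippet below builds an IAM policy document that allows listing and reading only under one prefix. The bucket name, prefix, statement IDs, and the read_only_policy helper are placeholders for illustration; your security team's actual policy may need additional conditions or KMS permissions.

```python
import json


def read_only_policy(bucket, prefix):
    """Build an IAM policy document granting read-only access to one bucket prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed with a prefix condition.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, scoped by the object ARN pattern.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


policy = read_only_policy("your-bucket", "events/")
print(json.dumps(policy, indent=2))
```

The policy grants no write or delete actions, so credentials attached to it can query the data lake but cannot change it.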

When to Use Cloud Storage vs. Uploaded Files

Scenario	Recommendation
One-off analysis of a file you already have	File upload or URL — faster for small files
Recurring analysis of data that updates daily	Cloud storage connector — reads latest automatically
Data lake with TB+ of Parquet partitions	Cloud storage connector — partition pruning is essential
Sensitive data that must not leave your cloud	Cloud storage connector: files are read directly by your own instance and never pass through Duck Data Master's backend
Data shared by a colleague via URL	URL load — simplest path
Delta Lake tables from Databricks/Spark	Cloud storage connector with Delta Lake support

Query your data lake directly

Cloud storage connectors are available on the Guru Plan. 3-day free trial.


Questions? support@duckdatamaster.guru