AWS S3, Glue, Data Lakes and DuckDB

The separation of storage and compute is here to stay, so I wanted to document a quick note on considerations depending on your access pattern.
Categories: AWS, Data Lake, DuckDB, Glue, Athena

Author: Brock Tibert

Published: December 23, 2024

It’s increasingly clear that data teams need to lean into patterns that embrace the separation of storage and compute. This brief post walks through some access patterns that hopefully highlight the benefits of writing the artifacts of your data pipelines to cloud storage services like AWS S3.

Moreover, while certainly not new, I want to emphasize the use of DuckDB as a data processing engine that drastically reduces the surface area of your data stack. Perhaps by the end you will start to question whether you should be leaning into large, bulky, provisioned warehouses like Redshift if you are an AWS shop.

This post will assume the following: an AWS account with access to S3, Glue, and Athena; AWS credentials configured locally (e.g. via an AWS CLI profile); and a Python environment with the duckdb package installed.

The Dataset

For the sake of illustration, I am going to use play-by-play data files from the NHL to highlight how a data team might store and analyze these datasets outside of a full load into a “data warehouse”. Common query patterns might summarize statistics by season, or more granularly, by month or game. With this in mind, let’s jump into how we might store the data on S3 to enable analysis.

Let’s assume I have a bucket called nhl-pbp. I am using “folders” to partition the data by a number of core fields. For example, here is the S3 URI for one file.

s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=22/gameid=2024020539/pbp.parquet
  • Think of the first folder pbp as the table in our database. Other tables might be players, teams, or venue, which would reside at /players, /teams, and /venue respectively.
  • Within the pbp table, we then partition the data by season.
  • Within a given season, we partition the data by calendar items like year, month, and day to enable granular queries such as “How many goals did Brad Marchand have in the month of December 2024?”
  • Finally, because there can be multiple games per calendar day, we add a partition for the game id itself, as sketched in the short snippet after this list.
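To make the layout concrete, here is a minimal sketch of how such a key might be assembled. The helper name and the decision not to zero-pad are illustrative only, not part of any real pipeline code.

# illustrative only: assemble a partitioned S3 key for one game's play-by-play file
def pbp_key(season: int, year: int, month: int, day: int, game_id: int) -> str:
    # whether to zero-pad month/day is a convention choice; match whatever your pipeline writes
    return (
        f"pbp/season={season}/year={year}/month={month}/"
        f"day={day}/gameid={game_id}/pbp.parquet"
    )

print(pbp_key(2024, 2024, 12, 22, 2024020539))
# pbp/season=2024/year=2024/month=12/day=22/gameid=2024020539/pbp.parquet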

Let’s consider a nightly pipeline that collects the data files from yesterday’s games. Our pipeline might do the following:

  • Run overnight via a schedule (e.g. EventBridge Scheduler)
  • Determine if there were any new games for today() - 1 day
  • If there are games, grab the JSON, perform a light transform to flatten it into a tabular format, and write a parquet file to a path on S3 that conforms to the partition scheme (a rough sketch of this step follows the list).
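Here is a rough sketch of what that nightly write step might look like in Python. The fetch_games_for and flatten_plays helpers are hypothetical stand-ins for the NHL API call and the light flattening transform; only the partitioned key and the S3 upload reflect the layout described above.

# sketch of the nightly write step; fetch_games_for() and flatten_plays() are hypothetical helpers
from datetime import date, timedelta

import boto3
import pandas as pd

BUCKET = "nhl-pbp"

def write_game_pbp(game: dict, plays: pd.DataFrame) -> None:
    # build the partitioned key from the game's metadata
    key = (
        f"pbp/season={game['season']}/year={game['year']}/month={game['month']}/"
        f"day={game['day']}/gameid={game['game_id']}/pbp.parquet"
    )
    local_path = "/tmp/pbp.parquet"
    plays.to_parquet(local_path, index=False)   # requires pyarrow
    boto3.client("s3").upload_file(local_path, BUCKET, key)

def nightly_run() -> None:
    target = date.today() - timedelta(days=1)
    for game in fetch_games_for(target):          # hypothetical: games played on that date
        plays = flatten_plays(game["raw_json"])   # hypothetical: JSON -> tabular DataFrame
        write_game_pbp(game, plays)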

For the sake of illustration, I already have some data on S3; it’s a handful of games to keep the surface area small.

Querying the Data via DuckDB

Without hyperbole, I tell everyone I can in the data space about DuckDB. I bring it up in nearly every conversation. To me, DuckDB will be considered the Swiss Army knife for data teams, if it isn’t considered that already. The benefits are many, but for this post, I will use DuckDB as a query engine with S3 as the data store. If you are new to DuckDB, let that sink in for a second.

To get started, you will need to ensure you have the httpfs DuckDB extension installed.

# import the library
import duckdb

# create an in memory connection
db = duckdb.connect()

# ensure the extension is ready to go
db.execute("INSTALL httpfs; LOAD httpfs;")
<duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0>

Newer versions of DuckDB have made it really simple to authenticate with cloud storage services. In my case, I have already installed and configured my AWS profile, and the snippet below tells DuckDB to use that configuration when making calls to S3.

sql = """
CREATE SECRET aws_secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);
"""

db.execute(sql)
<duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0>

For a basic example, below I will print out the number of plays for one specific file on S3.

sql = """ 
select count(*)
from 's3://nhl-pbp/pbp/season=2023/year=2023/month=12/day=14/gameid=2023020448/pbp.parquet'
"""

db.execute(sql).df()
count_star()
0 331

While this hopefully highlights the power, simplicity, and speed of DuckDB, it is a trivial example. The true power comes when we want to use S3 as our data store.

Below we are counting the number of rows from all of the games in the month of December across all years.

sql = """
SELECT COUNT(*)
FROM read_parquet(
  's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'
);
"""

db.execute(sql).df()
count_star()
0 1575

You may have noticed above that we are using the read_parquet function as a way to scan for files that match our path definition. DuckDB is smart enough to discover the matching files on S3 and compute results that span multiple parquet files. How cool is that?
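A related detail worth noting: because the folder names follow the key=value Hive convention, DuckDB can expose them as queryable columns, and filters on those columns let it skip files it never needs to read. Below is a small sketch; depending on your DuckDB version, hive partitioning may be detected automatically, and passing hive_partitioning = true simply makes it explicit.

# the partition folders (season, year, month, day, gameid) become queryable columns,
# and filtering on them lets DuckDB prune files it never needs to read
sql = """
SELECT season, COUNT(*) AS plays
FROM read_parquet(
  's3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet',
  hive_partitioning = true
)
WHERE month = 12
GROUP BY season;
"""

db.execute(sql).df()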

If you want to calculate how many unique games are being evaluated, it’s as simple as:

sql = """
SELECT COUNT(distinct gameid)
FROM read_parquet(
  's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'
);
"""

db.execute(sql).df()
count(DISTINCT gameid)
0 4

It’s not too hard to imagine the type of data apps you could build by simply writing parquet files to S3 and using DuckDB as the query engine that sits between the frontend and the S3 backend!
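As a tiny illustration of that idea, the backend of such an app might boil down to little more than a thin helper over the connection we created earlier. This is a sketch only; a real app would also care about connection handling and caching.

# a thin query helper over the duckdb connection created earlier in this post
def run_query(sql: str):
    """Run a read-only query against the parquet files on S3 and return a DataFrame."""
    return db.execute(sql).df()

# e.g., a frontend asking for play counts by season
run_query("""
SELECT season, COUNT(*) AS plays
FROM read_parquet('s3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet')
GROUP BY season
""")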

AWS Glue Data Catalog and Athena

AWS Glue includes a number of excellent services to empower data teams. One central service is the AWS Glue Data Catalog, which allows us to formalize a “database” and tables over the data stored in S3. There are some tradeoffs relative to the DuckDB approach above. While it’s entirely possible to configure the steps below programmatically, I am going to use the console for the sake of demonstration.

Step 1: Create a Database

After navigating to AWS Glue, select Databases from the left-hand menu. Click Add Database.

Give the database a name, and click Create Database. In my case, I am calling it nhl-blog-example.
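If you prefer code to the console for this step, a minimal boto3 sketch might look like the following, assuming your AWS credentials and region are already configured.

import boto3

glue = boto3.client("glue")

# create the Glue database that will hold our table metadata
glue.create_database(DatabaseInput={"Name": "nhl-blog-example"})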

Glue? Database?

AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data integration by enabling data preparation, transformation, and cataloging at scale. A Glue Database acts as a logical container in the Glue Data Catalog, organizing metadata about data sources, such as schemas and table definitions, to support querying and processing. Glue Tables represent structured metadata about datasets, defining their schema and location, which allows tools like Athena or Spark to interact with data seamlessly in storage, such as S3.

Step 2: Create a Crawler

From the left-hand menu, select Crawlers > Create Crawler.

For the unique name, I am calling this nhl-blog-crawler. Click Next. For step 2, we haven’t yet mapped data from our source into Glue (keep that option selected), and our source is S3. Let’s get that set up now.

Below I am configuring the crawler for the pbp folder in the bucket.

The Step 2 screen should look similar to below. Click next.

If you do not already have an IAM service role, let AWS create and select one for you before moving on to the next screen.

On step 4, select the database we created above. Under Advanced Options, select the first checkbox to Create a Single Schema for each S3 path.

There are a host of options we can configure here, but they are outside the scope for this post.

Finally, on the last page, select Create Crawler.
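For reference, the same crawler could also be defined programmatically. A sketch is below; the role name is a hypothetical placeholder for whatever Glue service role you created or let AWS create for you, and console-only settings (such as the single-schema option) map to the crawler's Configuration, which is omitted here.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="nhl-blog-crawler",
    Role="AWSGlueServiceRole-nhl-blog",  # hypothetical; use your own Glue service role
    DatabaseName="nhl-blog-example",
    Targets={"S3Targets": [{"Path": "s3://nhl-pbp/pbp/"}]},
)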

Step 3: Run the Crawler

After completing the steps above, you should be on a screen where you can start an on-demand run of the Crawler. We will do that now. Select Run Crawler.

In my case, this will only take a minute or so since I only have a handful of files on S3.
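The on-demand run can likewise be kicked off, and polled, from code, which is handy if the crawler run becomes part of the pipeline itself. A short sketch:

import time

import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="nhl-blog-crawler")

# poll until the crawler returns to the READY state
while glue.get_crawler(Name="nhl-blog-crawler")["Crawler"]["State"] != "READY":
    time.sleep(15)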

Once it completes, select Databases from the left under Data Catalog.

We now have created a table of data in Glue based on the parquet files in S3. Selecting the table name > Partitions highlights how the crawler used the S3 folders to partition our data.

Step 4: Query with Athena

Now we can use Athena to query our data via the Glue Data Catalog backed by data in S3. Search for Athena in the search bar on the web console.

Important

I already have Athena set up. If this is your first time using the service, you may need to select an S3 location for the Athena query output.

From the left-hand side, you can see that we have our Data Catalog, the database within the catalog, and a query editor to execute queries against S3.

This is the query I will run:

select year, month, day, count(*) as total from "AwsDataCatalog"."nhl-blog-example"."pbp" group by 1,2,3;

This returns the play counts grouped by year, month, and day.
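For completeness, the same query can also be submitted through Athena's API rather than the console. Here is a sketch with boto3; the output location is a placeholder for whatever results bucket you configured for Athena.

import time

import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString=(
        'select year, month, day, count(*) as total '
        'from "AwsDataCatalog"."nhl-blog-example"."pbp" group by 1,2,3'
    ),
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder
)

# wait for the query to finish, then page back the rows
qid = resp["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(2)

results = athena.get_query_results(QueryExecutionId=qid)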

Adding new Data

Let’s step back, and imagine that our nightly service has added data to S3. To simulate this, I am going to add a new play-by-play parquet file for a single game. To add context, this will be for the 21st of December 2024. The S3 URI is below.

s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=21/gameid=2024020539/pbp.parquet

To demonstrate the final point of this post, let’s hop back to Athena and re-run the query.

We get the same results. This happens because Athena relies on the Glue Data Catalog to resolve queries against our table, and since the new file landed in a partition the catalog has not yet seen, the Data Catalog, and as a result Athena, does not know to look for it.

To emphasize the core difference relative to the DuckDB example, we will run the same query.

sql = """

select year, month, day, count(*) as total 
FROM read_parquet(
  's3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet'
)
group by 1,2,3;
"""

db.execute(sql).df()
year month day total
0 2024 12 21 323
1 2024 12 15 277
2 2024 12 22 323
3 2024 12 16 321
4 2023 12 14 331

DuckDB correctly returns the results as expected.

What’s important to note above is that if our data pipeline adds new files to S3, DuckDB will automatically factor those files in as long as they match the path in the read_parquet definition. You could consider this schema-on-read.

You are likely asking what we need to do to have Athena properly pick up the new data added by our pipeline. There are ways to programmatically update the table in the Data Catalog, or we could have our pipeline fire a new run of the crawler. For the latter, it’s important to note that you are billed for the crawler’s processing time, so keep that in mind when deciding which approach makes the most sense.
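For example, if you wanted the pipeline itself to trigger the refresh, a couple of options sketched with boto3 might look like this; the crawler name comes from earlier in the post, and the Athena output location is again a placeholder.

import boto3

# option 1: re-run the crawler so it discovers the new partition (billed per run)
boto3.client("glue").start_crawler(Name="nhl-blog-crawler")

# option 2: ask Athena to register any new Hive-style partitions on the existing table
boto3.client("athena").start_query_execution(
    QueryString="MSCK REPAIR TABLE pbp",
    QueryExecutionContext={"Database": "nhl-blog-example"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder
)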

Summary

We saw above, via a simple example, that data teams can leverage services like S3 as the data store for their pipelines. Tools like DuckDB or services from AWS such as Glue’s Data Catalog and Athena remove the need to have a “warehouse-first” data strategy.

There isn’t always one right answer, but in future posts, we will explore the concepts of data lakes and lakehouses via Apache Iceberg in order to provide alternatives to the traditional cloud data warehouse pattern.