Brock Tibert
December 23, 2024
It’s increasingly clear that data teams need to lean into patterns that embrace the separation of storage and compute. This brief post will walk through some access patterns that hopefully highlight the benefits of writing the artifacts of your data pipelines to cloud storage services like AWS S3.
Moreover, and while certainly not new, I want to emphasize the use of DuckDB as a data processing engine that drastically reduces the surface area of your data stack. Perhaps by the end you might start to question whether you should be leaning into large, bulky, provisioned warehouses like Redshift if you are an AWS shop.
This post will assume the following:
For the sake of illustration, I am going to use play-by-play data files from the NHL to highlight how a data team might store and analyze these datasets outside of a full load into a “data warehouse”. Common query patterns might summarize statistics by season, or more granularly, by month or game. With this in mind, let’s jump into how we might store the data on S3 to enable that analysis.
Let’s assume I have a bucket called nhl-pbp. I am using “folders” to partition the data via a number of core fields. For example, here is the S3 URI for one file.
s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=22/gameid=2024020539/pbp.parquet
Think of pbp as the table in our database. Other tables might be players, teams, or venues, which would reside at /players, /teams, and /venue accordingly. Let’s consider a nightly pipeline that collects the data files from yesterday’s games and writes one parquet file per game into this layout, as sketched below.
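This is only a rough, hypothetical sketch of that write step; fetch_yesterdays_games and fetch_pbp stand in for whatever actually collects the play-by-play data, and it assumes AWS credentials are already configured.

# a rough, hypothetical sketch of the nightly write step
import duckdb

db = duckdb.connect()
db.execute("INSTALL httpfs; LOAD httpfs;")

for game in fetch_yesterdays_games():        # hypothetical helper returning game metadata
    pbp = fetch_pbp(game["gameid"])          # hypothetical helper returning a pandas DataFrame
    path = (
        f"s3://nhl-pbp/pbp/season={game['season']}/year={game['year']}/"
        f"month={game['month']}/day={game['day']}/gameid={game['gameid']}/pbp.parquet"
    )
    # DuckDB can write the DataFrame straight to S3 as parquet via httpfs
    db.execute(f"COPY (SELECT * FROM pbp) TO '{path}' (FORMAT PARQUET)")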
For the sake of illustration, I already have some data on S3; it’s a handful of games to keep the surface area small.
Without hyperbole, I tell everyone I can in the data space about DuckDB. I bring it up in nearly every conversation. To me, DuckDB will be considered the Swiss Army knife for data teams, if it isn’t considered that already. The benefits are many, but for this post, I will use DuckDB as a query engine, with S3 as the data store. If you are new to DuckDB, let that sink in for a second.
To get started, you will need to ensure you have the httpfs DuckDB extension installed.
# import the library
import duckdb
# create an in memory connection
db = duckdb.connect()
# ensure the extension is ready to go
db.execute("INSTALL httpfs; LOAD httpfs;")
<duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0>
Newer versions of DuckDB have made it really simple to authenticate with cloud storage services. In my case, I have already installed and configured my AWS profile, and the snippet below tells DuckDB to use that configuration when making calls to S3.
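One way to do this in recent DuckDB releases, and an assumption about the exact statement used here, is to create an S3 secret backed by the AWS credential chain; a minimal sketch, assuming the aws extension is available:

# a minimal sketch: the aws extension's credential chain provider picks up
# the locally configured AWS profile
db.execute("INSTALL aws; LOAD aws;")
db.execute("CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);")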
<duckdb.duckdb.DuckDBPyConnection at 0x16a2e50b0>
For a basic example, below I will print out the number of plays for one specific file on S3.
sql = """
select count(*)
from 's3://nhl-pbp/pbp/season=2023/year=2023/month=12/day=14/gameid=2023020448/pbp.parquet'
"""
db.execute(sql).df()
 | count_star() |
---|---|
0 | 331 |
While this hopefully highlights the power, simplicity, and speed of DuckDB, it is only a basic example. The true power comes when we use S3 as our data store.
Below we are counting the number of rows from all of the games in the month of December across all years.
sql = """
SELECT COUNT(*)
FROM read_parquet(
's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'
);
"""
db.execute(sql).df()
 | count_star() |
---|---|
0 | 1575 |
You may have noticed above that we are using the read_parquet function as a way to scan for files that match our path pattern. DuckDB is smart enough to enumerate the matching files on S3 and compute results that span multiple parquet files. How cool is that?
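Because the folder names follow the key=value (Hive) convention, those partition fields can also be used directly in queries, and DuckDB can use them to skip files that don’t match a filter. Here is a small sketch; the specific filter values are just for illustration.

sql = """
SELECT season, COUNT(*) AS plays
FROM read_parquet(
  's3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet',
  hive_partitioning = true
)
WHERE year = 2024 AND month = 12
GROUP BY season;
"""
db.execute(sql).df()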
If you want to calculate how many unique games are being evaluated, it’s as simple as:
sql = """
SELECT COUNT(distinct gameid)
FROM read_parquet(
's3://nhl-pbp/pbp/season=*/year=*/month=12/day=*/gameid=*/pbp.parquet'
);
"""
db.execute(sql).df()
 | count(DISTINCT gameid) |
---|---|
0 | 4 |
It’s not too hard to imagine the kind of data apps you could build by simply writing parquet files to S3 and using DuckDB as the query engine that sits between the frontend and the S3 backend!
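As one illustration, a thin helper like the hypothetical sketch below is often all the “backend” such an app needs; the function name and parameters are made up here.

# hypothetical helper a data app might call; DuckDB does the heavy lifting
# against the parquet files sitting on S3
def plays_by_game(db, year: int, month: int):
    sql = f"""
        SELECT gameid, COUNT(*) AS plays
        FROM read_parquet(
          's3://nhl-pbp/pbp/season=*/year={year}/month={month}/day=*/gameid=*/pbp.parquet'
        )
        GROUP BY gameid;
    """
    return db.execute(sql).df()

plays_by_game(db, 2024, 12)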
AWS Glue includes a number of excellent services to empower data teams. One central service is the AWS Glue Data Catalog, which allows us to formalize a “database” and tables over the data stored in S3. There are some tradeoffs relative to the DuckDB approach above. While it’s entirely possible to configure the steps below programmatically, I am going to use the console for the sake of demonstration.
After navigating to AWS Glue, select Databases from the left-hand menu and click Add Database.
Give the database a name, and click Create Database. In my case, I am calling it nhl-blog-example.
AWS Glue is a fully managed ETL (Extract, Transform, Load) service that simplifies data integration by enabling data preparation, transformation, and cataloging at scale. A Glue Database acts as a logical container in the Glue Data Catalog, organizing metadata about data sources, such as schemas and table definitions, to support querying and processing. Glue Tables represent structured metadata about datasets, defining their schema and location, which allows tools like Athena or Spark to interact with data seamlessly in storage, such as S3.
From the left-hand menu, select Crawler > Create Crawler.
For the unique name, I am calling this nhl-blog-crawler. Click Next. For step 2, we haven’t yet mapped our source data into Glue (keep that option selected), and our source is S3. Let’s get that set up now.
Below I am configuring the crawler for the pbp folder in the bucket.
The Step 2 screen should look similar to the one below. Click Next.
If you do not already have an IAM service role, let AWS create and select one for you before moving on to the next screen.
On step 4, select the database we created above. Under Advanced Options, select the first checkbox to Create a Single Schema for each S3 path.
There are a host of options we can configure here, but they are outside the scope for this post.
Finally, on the last page, select Create Crawler.
After completing the steps above, you should be on a screen where you can kick off an on-demand run of the crawler. We will do that now. Select Run Crawler.
In my case, this will only take a minute or so since I only have a handful of files on S3.
Once it completes, select Databases from the left under Data Catalog.
We now have created a table of data in Glue based on the parquet files in S3. Selecting the table name > Partitions highlights how the crawler used the S3 folders to partition our data.
Now we can use Athena to query our data via the Glue Data Catalog backed by data in S3. Search for Athena in the search bar on the web console.
I already have Athena set up. If this is your first time using the service, you may need to select an S3 location for the Athena output.
From the left-hand side, you can see our Data Catalog, the database within that catalog, and a query editor to execute queries against S3.
This is the query I will run:
select year, month, day, count(*) as total
from "AwsDataCatalog"."nhl-blog-example"."pbp"
group by 1,2,3;
This returns the totals by year, month, and day.
Let’s step back, and imagine that our nightly service has added data to S3. To simulate this, I am going to add a new play-by-play parquet file for a single game. To add context, this will be for the 21st of December 2024. The S3 URI is below.
s3://nhl-pbp/pbp/season=2024/year=2024/month=12/day=21/gameid=2024020539/pbp.parquet
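To simulate that drop, something as simple as a boto3 upload to the key above works; the local filename is hypothetical.

import boto3

# upload the new game's parquet file into its partition "folder"
s3 = boto3.client("s3")
s3.upload_file(
    Filename="pbp.parquet",  # hypothetical local file produced by the pipeline
    Bucket="nhl-pbp",
    Key="pbp/season=2024/year=2024/month=12/day=21/gameid=2024020539/pbp.parquet",
)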
To demonstrate the final point of this post, let’s hop back to Athena and re-run the query.
We get the same results. This happens because Athena relies on the Glue Data Catalog to resolve the table, and because the new file landed in a partition the catalog has never seen, neither the Data Catalog nor Athena knows to look for it.
To emphasize the core difference relative to the DuckDB example, we will run the same query.
sql = """
select year, month, day, count(*) as total
FROM read_parquet(
's3://nhl-pbp/pbp/season=*/year=*/month=*/day=*/gameid=*/pbp.parquet'
)
group by 1,2,3;
"""
db.execute(sql).df()
 | year | month | day | total |
---|---|---|---|---|
0 | 2024 | 12 | 21 | 323 |
1 | 2024 | 12 | 15 | 277 |
2 | 2024 | 12 | 22 | 323 |
3 | 2024 | 12 | 16 | 321 |
4 | 2023 | 12 | 14 | 331 |
DuckDB correctly returns the results as expected.
What’s important to note above is that if our data pipeline adds new files to S3, DuckDB will automatically factor them in as long as they match the path in the read_parquet definition. You could consider this schema-on-read.
You are likely asking what we need to do to have Athena properly pick up the new data added by our pipeline. We could programmatically update the table in the Data Catalog, or we could have the pipeline trigger a new run of the crawler. For the latter, it’s important to note that you are billed for the crawler’s processing time, so keep that in mind when deciding which approach makes the most sense.
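For the crawler route, a single boto3 call at the end of the pipeline is enough; the name matches the crawler created earlier.

import boto3

# kick off an on-demand run so the new partition gets cataloged
glue = boto3.client("glue")
glue.start_crawler(Name="nhl-blog-crawler")

Alternatively, because the layout uses Hive-style key=value folders, an Athena MSCK REPAIR TABLE (or an explicit ALTER TABLE ... ADD PARTITION) can register the new partition without paying for a crawler run.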
We saw above, via a simple example, that data teams can leverage services like S3 as the data store for their pipelines. Tools like DuckDB or services from AWS such as Glue’s Data Catalog and Athena remove the need to have a “warehouse-first” data strategy.
There isn’t always one right answer, but in future posts, we will explore the concepts of data lakes and lakehouses via Apache Iceberg in order to provide alternatives to the traditional cloud data warehouse pattern.