1. Tutorial overview
Last Updated: 2024-10-24
Background
Several factors can affect the performance of a data lake. Ensuring optimal performance requires attention to how data is ingested, stored, and processed. This tutorial focuses on how table partitioning and bucketing, used independently or together, can improve data lake performance. Through hands-on exercises, you will see how both strategies allow a query to read far less data than a full table scan when a filter aligns with the partitioning or bucketing column.
Learning objectives
Once you've completed this tutorial, you will be able to:
- Create partitioned tables.
- Explain how to write a query on a partitioned table so that less data is read.
- Create bucketed tables.
- Explain how to write a query on a bucketed table so that less data is read.
Prerequisites
- You must have a Starburst Galaxy account set up before you can complete this tutorial. You can find instructions for creating a free account here.
- You must complete the prerequisite tutorial titled Configure a Starburst Galaxy data lake catalog and schema before attempting this tutorial. It will guide you in setting up the required environment for this tutorial.
2. What is table partitioning?
Background
Table partitioning is an optimization technique that data lakes employ as an alternative to indexing. A partition is simply a way to divide data into subdirectories that are designed around logical groupings. In other words, partitioning divides data into subfolders. The field(s) you partition a table on become subfolders of the table's data lake folder. When a WHERE clause is aligned to the partitioned field, only the folder(s) that meet the particular criteria are read – the rest are ignored.
For example, suppose you have five years of historical data. Rather than putting all of that data into a single folder, you may choose to create a folder for each year, ensuring that the data for each year is in the appropriate folder. This is considered partitioning by year. Efficiency is gained when a query includes the partition key in the predicate or join, because the query engine can avoid reading partitions that don't match.
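To make the idea concrete, here is a minimal sketch of partitioning by year, using a hypothetical sales table and source location that are not part of this tutorial's exercises:
-- Hypothetical table partitioned by year; one folder per distinct sale_year value.
CREATE TABLE sales_by_year
WITH (
  partitioned_by = ARRAY['sale_year'],
  type = 'hive'
)
AS
SELECT sale_id, amount, sale_year
FROM some_catalog.some_schema.sales;

-- Only the sale_year=2023 folder is read; the other years are ignored.
SELECT *
FROM sales_by_year
WHERE sale_year = 2023;
Because the WHERE clause filters on the partition column, the query engine only needs to read the sale_year=2023 subfolder.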
3. Create and query an unpartitioned table
Background
Your first exercise will be to create two tables with identical data. The only difference will be that one is partitioned and one is not. This will allow you to compare the amount of data being read when each table is queried. You'll begin with the unpartitioned table.
Step 1: Verify cluster, catalog, and schema
When you use the query editor in Starburst Galaxy, you have the option to set the cluster, catalog, and schema for your session. This lets you avoid typing out the fully-qualified table name every time you write a query. As a reminder, you should have created the catalog and schema you'll be using during the prerequisite tutorial.
- Log in to your Starburst Galaxy account.
- Use the left-hand navigation menu to select Query>>Query editor.
- Use the dropdown menus above the query editor window to select the following:
- Cluster: aws-us-east-1-free
- Catalog: tmp_cat
- Schema: tmp_firstname_lastname_postalcode
Step 2: Create table
Next, it's time to create a new, unpartitioned table following the format and contents of the tpch.sf1.orders table.
- Use the following SQL to create a new table from the tpch.sf1.orders table:
CREATE TABLE orders_no_pttn
WITH (
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
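If you'd like a quick sanity check before moving on (an optional aside, not a required step), a simple row count should confirm the roughly 1.5 million rows referenced in the query metrics shortly:
-- Expect roughly 1.5M rows from the TPC-H sf1 orders data.
SELECT count(*) FROM orders_no_pttn;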
Step 3: Query new table
Now that you've created a new table, it's time to query it. After you run the query, you'll observe how much data was read from your unpartitioned table.
- Run the following query. Your result should return 941 rows:
SELECT * FROM orders_no_pttn
WHERE totalprice >= 430000.00;
Step 4: Locate metrics for data read
When you run a query in Starburst Galaxy, you can learn a lot of information about the query by going to the query details page.
- Click the Query details icon (denoted by an eye) on the bottom right of the screen.
- Select the Advanced tab below the Query ID label.
- In the Tables section of the Advanced tab you will see values for the number of Rows and Bytes read.
- For the query you just ran, all rows (1.5M) were read from the orders_no_pttn table, which accounted for 33.7MB of data. That equates to a full table scan of all records from the data lake.
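As an optional aside beyond the tutorial's steps, you can also gather similar row and byte counts directly from SQL with EXPLAIN ANALYZE; the exact output layout varies by version, but it reports input sizes for each operator in the plan:
-- Runs the query and annotates the plan with actual row and byte counts.
EXPLAIN ANALYZE
SELECT *
FROM orders_no_pttn
WHERE totalprice >= 430000.00;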
Step 5: Add additional filter to query
Let's enhance the filtering of the previous query by adding a filter on the orderpriority field.
- Run the following query, which should return 187 rows:
SELECT * FROM orders_no_pttn
WHERE totalprice >= 430000.00
AND orderpriority = '4-NOT SPECIFIED';
- Use the process shown in Step 4 to verify that another full table scan occurred. As a reminder, a full table scan would be indicated by 1.5 million rows read.
4. Create and query a partitioned table
Background
In the last section, you created and queried an unpartitioned table. Through that exercise, you learned that when a table is unpartitioned, the entire table is scanned during query execution, even when filters are applied.
In this section, you'll create another orders table, but this one will be partitioned on orderpriority. This will allow you to verify that less data is read from a partitioned table when the WHERE clause is aligned to the partitioned field, thus enhancing the performance of the data lake.
Step 1: Create partitioned table
To create a partitioned version of the orders table, you'll need to modify the CTAS statement you used to create the unpartitioned table.
- Try to run the following SQL to create the table. Notice the addition of the partitioned_by clause:
CREATE TABLE orders_pttn_priority
WITH (
partitioned_by = ARRAY['orderpriority'],
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
- You should have received an error message.
- Just as the error message states, the partition field needs to be the last column in this (Hive) table.
- Execute the following statement, which reorders the columns so that orderpriority is last:
CREATE TABLE orders_pttn_priority
WITH (
partitioned_by = ARRAY['orderpriority'],
type = 'hive'
)
AS
SELECT orderkey, custkey, orderstatus, totalprice,
orderdate, clerk, shippriority, comment,
orderpriority
FROM tpch.sf1.orders;
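As an optional check, the Hive connector also exposes a partitions metadata table you can query to list the partitions that were just created; expect one row per distinct orderpriority value:
-- Lists the partition values (folders) of the table.
SELECT * FROM "orders_pttn_priority$partitions";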
Step 2: Query partitioned table
It's time to query the partitioned table, using the same query from Step 3 in the previous section. This query does not include the partition column in the WHERE clause, which means that the entire table will still be scanned. After you've queried the table, you'll see how to use a query plan to confirm a full table scan.
- Run the following query on the new table:
SELECT * FROM orders_pttn_priority
WHERE totalprice >= 430000.00;
- Use the information provided on the Query details page to confirm that a full table scan occurred. Recall that the table has 1.5M rows.
Step 3: Review query plan
You can also review the Query plan to see if a full table scan occurred. You should already be on the Query details page under the Advanced tab. The Query plan is located at the bottom of the page.
- Click on the Query plan to bring up a larger view.
- Review the lines that are circled in the image below. Because orderpriority is a partitioned column, the query plan will show you the folders that were read. The folders listed represent all of the distinct values for the orderpriority column. This is further evidence that a full table scan occurred.
Step 4: Leverage partition pruning
In this step, you'll take advantage of the partitioning on your table by adding an orderpriority filter.
- Run the following query:
SELECT * FROM orders_pttn_priority
WHERE totalprice >= 430000.00
AND orderpriority = '4-NOT SPECIFIED';
- Navigate to the Query details page to see how many rows and bytes were read while executing the query.
In our example, 300K rows and 6.92 MB were read. This represents about 20% of the full table scan's metrics. The sample data from TPCH was relatively evenly distributed across the five order priority values, which provides anecdotal evidence that only one of the partition folders was read. In the next step, you'll confirm that by looking at the query plan.
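To see that even spread for yourself (an optional check rather than one of the numbered steps), group the partitioned table by its partition column; each of the five priorities should account for roughly 300K rows:
-- Row count per partition value.
SELECT orderpriority, count(*) AS order_count
FROM orders_pttn_priority
GROUP BY orderpriority
ORDER BY orderpriority;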
Step 5: Review query plan
Let's take a look at the query plan to further confirm that only one partition folder was read during the execution of the last query. You should already be in the Advanced tab of the Query details page.
- Click to open an expanded view of the query plan.
- As you did before, look for the orderpriority row in the query plan.
Note that only one folder was read. This solidifies the anecdotal evidence with actual results. When the query engine can read a subset of the partition folders, this is referred to as partition pruning.
5. What is table bucketing?
Background
Table bucketing is a data lake technique where a table is divided into a fixed number of smaller segments, called buckets, based on the values of a specific column (or columns). Bucketing optimizes data storage and speeds up query performance, especially for queries involving joins or aggregations on the bucketed columns.
How it works
In table bucketing:
- A column is selected as the bucketing key, often one that is frequently used in join or filter conditions.
- A hash function is applied to the values in the bucket key column. This hash function assigns each row to one of a fixed number of buckets.
- Each bucket is stored separately, usually in its own file, within the table directory.
For example, if a table is bucketed by user_id into 4 buckets, each unique user_id is hashed to determine its bucket. Rows with the same hash value (for user_id) end up in the same bucket.
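For instance, assuming a hypothetical hash value of 1047 for a particular user_id, that row lands in bucket 1047 mod 4 = 3, and so does every other row carrying the same user_id.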
Let's take a look at this in practice.
6. Create and query a bucketed table
Background
In this section, you are going to recreate the orders table, this time bucketing it on the customer identifier column.
Step 1: Create table
- Run the following SQL to create an orders table, bucketed on custkey:
CREATE TABLE orders_bckt_custkey
WITH (
bucketed_by = ARRAY['custkey'],
bucket_count = 4,
type = 'hive'
)
AS
SELECT *
FROM tpch.sf1.orders;
You do not have access to review the data in S3, but here is an example of the contents of this table. The data was evenly spread by custkey into four separate files of similar size. The sum of these file sizes adds up to about the size of the full table.
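If you'd like to approximate that view from SQL (an optional check, not one of the tutorial's steps), the Hive connector's hidden $path column reports which file each row was read from, so grouping by it should show four files with similar row counts:
-- One row per bucket file, with the number of rows it holds.
SELECT "$path" AS bucket_file, count(*) AS row_count
FROM orders_bckt_custkey
GROUP BY "$path";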
Step 2: Query two tables
In this step, you're going to run the same query on two different tables. One will be the unpartitioned, unbucketed orders table, and the other will be the bucketed table you just created. This exercise will compare the amount of data processed for each query.
- Begin by running the following query against the orders_no_pttn table:
SELECT *
FROM orders_no_pttn
WHERE custkey = 130000;
- A look at the query details page shows that the full table was scanned:
- Now run the same query against the orders_bckt_custkey table:
SELECT *
FROM orders_bckt_custkey
WHERE custkey = 130000;
- Review the query details again, and this time you'll see that far less data was read. In fact, only about 25% of the data was read because the specific custkey could only be in one of the four buckets you created.
Remember, partitioning and bucketing can be used independently or in combination.
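As a closing sketch (not one of the exercises above), a single CTAS can apply both techniques at once; the table name below is hypothetical, and the Hive table properties mirror the ones used throughout this tutorial:
-- Partitioned on orderpriority and bucketed on custkey into 4 buckets.
CREATE TABLE orders_pttn_bckt
WITH (
  partitioned_by = ARRAY['orderpriority'],
  bucketed_by = ARRAY['custkey'],
  bucket_count = 4,
  type = 'hive'
)
AS
SELECT orderkey, custkey, orderstatus, totalprice,
       orderdate, clerk, shippriority, comment,
       orderpriority
FROM tpch.sf1.orders;
A query that filters on both orderpriority and custkey could then benefit from partition pruning and bucket pruning at the same time.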
7. Tutorial wrap-up
Tutorial complete
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You should now have a better understanding of how partitioning and bucketing can improve data lake performance. The data used in this tutorial was relatively small in size, which means that the performance improvements weren't very dramatic. With larger volumes of data you will see a much more obvious improvement.
Continuous learning
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will build on the concepts covered here.
Next steps
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Tutorials available
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!