Last Updated: 2024-01-22

Background

Memory management

When using Starburst Galaxy to run a query, only the necessary data for the query is read from the query engine. The coordinator then divides the workload into splits, and allocates them to worker nodes. Collectively, workers execute the workload by processing, in parallel, their assigned splits and only keeping in memory the data that is not yet ready to send to the next stage. This approach increases performance compared to other frameworks that spill this intermediary data to disk.

Resource constraints

However, some queries require a large amount of memory. In such cases, problems can arise if individual workers run out of memory. A similar situation can happen if a worker is disconnected from the group due to a network error. This situation often leads to query failures because the scheduled resources prove unavailable when the time comes to execute the workload.

Fault-tolerant execution

To address this issue, Starburst has introduced a fault-tolerant execution (FTE) feature that enhances the resilience of Starburst platforms when handling memory-intensive workloads. This model allows intermediary data to be stored into an immensely scalable persistence layer, typically by pushing this data to disk that would normally be held only in memory. This prevents the out of memory conditions and saves the query from failure.

Additionally, fault-tolerant query execution allows for failed tasks to be reprocessed. This allows long-running workloads to complete without worrying about query failures, even with limited hardware resources. This means that failed workers or network errors no longer result in query failure.

Scope of tutorial

In this tutorial, you will learn how to enable fault-tolerant execution mode on a cluster. You will also have the chance to observe FTE in action and compare it to a cluster without FTE enabled.

Prerequisites

You need a Starburst Galaxy account to complete this tutorial. Please be sure to complete the tutorial titled Starburst Galaxy: Getting started before attempting this tutorial.

Learning outcomes

Upon successful completion of this tutorial, you will be able to:

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Tutorial scenario

Chryse Corp. is a large retail organization that heavily relies on data analytics to drive business decisions. They have implemented Starburst Galaxy as their data analytics platform to query and analyze vast amounts of data from various sources. To ensure uninterrupted query execution and prevent costly disruptions, Chryse Corp. has decided to leverage Starburst's fault-tolerant execution mode. This mode enhances resilience by reprocessing failed tasks and ensures that long-running queries can be executed without the fear of query failures, regardless of the hardware limitations.

Help the Chryse Corp. team test the fault-tolerant execution feature by creating two clusters - one with FTE mode enabled and one without. Run a resource-intensive query on both clusters to compare the results.

Background

To get started, you need to create a new cluster with fault-tolerant execution enabled. This process will be similar to the creation of other clusters that do not have fault-tolerance engaged.

Step 1: Create a new cluster in Starburst Galaxy

Begin by creating a new cluster. This follows the usual cluster creation process seen in other tutorials.

Step 2: Name the cluster

You're going to start by giving the new cluster a name. This should be something descriptive of its use and location.

Step 3: Add catalogs to new cluster

Now it's time to choose which catalogs to include in the cluster. For this tutorial, the only catalog you need for testing purposes is the tpch catalog.

Step 4: Add cluster details

Next, you'll need to add the remaining cluster details to set up the new cluster. We recommend using the following configuration.

Step 5: Confirm cluster start

It is considered best practice to confirm that the new cluster has started before proceeding.

The cluster will start up automatically, but can take 3 to 5 minutes to start.

Background

Now it's time to create a standard cluster without FTE enabled. This cluster will be used for comparison purposes in the next section.

Step 1: Create a new cluster

Begin by creating a new cluster, this will become the non-FTE cluster.

Step 2: Name the cluster

Now it's time to provide a cluster name. This will be used to identify the cluster, so it should be descriptive.

Step 3: Add catalogs to new cluster

Next, you'll need to add the necessary catalogs to the cluster. For this tutorial, the only catalog you need is tpch.

Step 4: Add cluster details

Finally, you'll need to add the remaining cluster details to set up the new non-FTE cluster. We recommend using the following configuration.

Step 5: Confirm cluster start

Just as you did for the FTE cluster, it is considered best practice to confirm that the new cluster has started before proceeding.

Background

Now it's time to test out the differences between FTE and non-FTE clusters so you can see the difference first hand.

To do this, you'll run the same query on each cluster and then compare the results.

Let's get going!

Step 1: Open the Query editor

You're going to launch your query from the query editor.

Step 2: Run SQL on the no-FTE cluster

Next, you need to enter a suitable query that will be run on the non-FTE cluster, and then again on the FTE cluster. The query below will fit this role perfectly because it requires a lot of memory to execute. This is exactly the scenario that FTE is designed to assist with, so we would expect this non-FTE cluster to encounter problems.

EXPLAIN ANALYZE SELECT
  nation,
  o_year,
  SUM(amount) AS sum_profit
FROM (
       SELECT
         n.name                                                          AS nation,
         extract(YEAR FROM o.orderdate)                                  AS o_year,
         l.extendedprice * (1 - l.discount) - ps.supplycost * l.quantity AS amount
       FROM
         tpch.sf1000.part AS p,
         tpch.sf1000.supplier AS s,
         tpch.sf1000.lineitem AS l,
         tpch.sf1000.partsupp AS ps,
         tpch.sf1000.orders AS o,
         tpch.sf1000.nation AS n
       WHERE
         s.suppkey = l.suppkey
         AND ps.suppkey = l.suppkey
         AND ps.partkey = l.partkey
         AND p.partkey = l.partkey
         AND o.orderkey = l.orderkey
         AND s.nationkey = n.nationkey
         AND p.name LIKE '%green%'
     ) AS profit
GROUP BY
  nation,
  o_year
ORDER BY
  nation,
  o_year DESC;

Step 3: Review error

Once the query has been executed, you'll see that it does indeed encounter a problem. Specifically, it runs out of available memory. The error you get should be similar to the image below.

Step 4: View query details

Starburst Galaxy allows you to view more details about queries. This is used to troubleshoot scenarios like this one, where a query failed, or to help analyze query performance.

You'll need to navigate to the query details section to view this information.

Step 5: Run SQL on the FTE cluster

Now it's time to run the same query on the FTE cluster and see the difference.

Step 6: Delete the two clusters

Your testing is complete, and you can now delete your clusters so that you don't incur any unnecessary charges.

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of how and when to use fault-tolerant execution mode for a cluster in Starburst Galaxy.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.