
Building the Data Lake Analytics Stack

By: Kamil Bajda-Pawlikowski
August 24, 2022

In recent years, big data initiatives and investments have reached an all-time high across enterprises, and the pace of investment is accelerating. Yet there is a huge gap between initiatives and results.

According to a 2021 NewVantage Partners survey of Fortune 1000 executives, big data is still not used effectively, and firms continue to struggle to derive value from their investments in this area.

For example:

  • Only 48.5% are driving innovation with data
  • Only 41.2% are competing on analytics
  • Only 29.2% are experiencing transformational business impact
  • Only 24.0% have created a data-driven organization

The good news is that the ongoing demand for agile, flexible data analytics to leverage big data investments has fueled the rise of data lakes and distributed SQL query engines (such as Trino, formerly PrestoSQL). Together, they maximize data lake ROI and turn big data into a strategic asset.

Before we get to why data lakes are better with a distributed SQL query engine, let’s take a closer look at the traditional data lake stack itself.

What are the characteristics of a data lake?

The biggest advantage of data lakes is flexibility. Data remains in its native, granular format: it is not modeled in advance, transformed in flight, or transformed at target storage.

The result is an up-to-date stream of data that is available for analysis at any time, for any business purpose.

The main value organizations derive from the data lake stack is three-fold:

  1. It enables instant, easy access to their wealth of data, regardless of where it resides, with near-zero time-to-market (no need for IT / data teams to prepare or move data)
  2. It creates a pervasive, data-driven culture
  3. It transforms data into the digital intelligence that is a prerequisite for achieving a competitive advantage in today’s data-driven ecosystem

But data lakes only have meaning to an organization’s vision when they help solve business problems through data democratization, re-use, and exploration by agile and flexible analytics. Data lake accessibility becomes a real force multiplier when the lake is used across a company's business units.

In practice, however, even after a successful implementation, many enterprises use the data lake only on the fringes, limiting it to occasional ad hoc, high-value queries. As a result, they fall dramatically short of the data lake's potential and experience poor ROI.

Data lake query engine limitations

The single most common cause of poor ROI is that traditional data lake query engines rely on brute-force query processing, combing through all of the data to return the result sets needed for application responses or analytics.

In fact, 80% of compute resources are squandered on full scans!

This unnecessary use of wildly excessive resources runs up significant costs. As a result, performance falls short of the SLAs that interactive use cases require, and the data lake realistically supports only ad hoc analytics or experimental queries. To effectively support a wide range of analytics use cases, dataops teams have no choice but to revert to optimized data silos and traditional data warehouses.
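To make the cost of brute-force processing concrete, here is a minimal Python sketch that compares a full scan with a scan that skips blocks using precomputed min/max statistics. It is a toy illustration of data skipping in general, not the implementation of any particular query engine; the `Block` class, the function names, and the data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of rows, standing in for a file or row group in the lake."""
    values: List[int]
    # Statistics computed once when the block is written (hypothetical layout).
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def full_scan(blocks: List[Block], target: int) -> Tuple[int, int]:
    """Brute force: read every row of every block. Returns (matches, rows_read)."""
    matches = 0
    rows_read = 0
    for block in blocks:
        for value in block.values:
            rows_read += 1
            if value == target:
                matches += 1
    return matches, rows_read


def pruned_scan(blocks: List[Block], target: int) -> Tuple[int, int]:
    """Skip any block whose min/max range cannot contain the target."""
    matches = 0
    rows_read = 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # the statistics prove this block has no match
        for value in block.values:
            rows_read += 1
            if value == target:
                matches += 1
    return matches, rows_read


# Ten blocks of 1,000 consecutive values each: 0..999, 1000..1999, and so on.
blocks = [Block(list(range(i * 1000, (i + 1) * 1000))) for i in range(10)]

full_matches, full_rows = full_scan(blocks, 4242)
pruned_matches, pruned_rows = pruned_scan(blocks, 4242)
print(f"full scan:   {full_matches} match, {full_rows} rows read")
print(f"pruned scan: {pruned_matches} match, {pruned_rows} rows read")
```

Both scans return the same answer, but the pruned scan reads one block instead of ten. The benefit depends on how the data is laid out: skipping helps most when values are clustered so that most blocks can be ruled out.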

Dataops teams are overstretched

However, dataops teams are already spread thin with responsibilities for managing the data analytics budget, prioritizing query requests, and optimizing query performance.

Manual query optimization is time-consuming, and the optimization backlog grows every day, creating a vicious cycle. The lack of workload-level visibility prevents dataops teams from identifying which workloads need priority based on business needs rather than on the needs of an individual user or query.

The frustration of data users throughout the organization and the burnout experienced by the dataops team can stymie even the best-made plans to capitalize on big data and build a data-driven culture.

These are a few real-world obstacles that prevent organizations from utilizing the power of their data lake stack, all of which require organizations to rethink their data lake architecture in order to capitalize on their investment in big data and analytics.

Analytics is paramount to the data lake architecture paradigm shift

Overcoming these obstacles to leveraging the power of the data lake demands a transition to an analytics-ready data lake stack, which is composed of:

  • Scalable and massive storage (petabyte to exabyte scale) such as object storage
  • Data federation layer that provides access to many data sources and formats
  • Distributed SQL query engine such as Trino (formerly PrestoSQL)
  • Query acceleration and workload optimization engine for performance/cost balance, to eliminate the disadvantages of the brute-force approach and its implications

These tools enable agile data lake analytics that harnesses near-perfect data, with performance and cost comparable to a traditional data warehouse.

By implementing these tools, the business no longer needs to adapt to an existing data architecture that limits which queries can be run. Instead, the data architecture adapts itself to specific business needs, which are highly elastic and dynamic. These tools offer a simple and cost-effective way for enterprises to shift their analytics to the data lake, making it the one-stop shop for data analytics focused on agility and flexibility.

Autonomous query acceleration: the missing link in the data lake analytics stack

Data lake query acceleration platforms like Starburst are the missing link in your data lake stack. Sitting on top of your data lake and query engine, they serve as a smart acceleration layer on your data lake, which remains the single source of truth. The data lake becomes the business’s mainstream data analytics platform, enabling enterprises to turn it into a strategic competitive advantage and achieve data lake ROI. Data also becomes a strategic asset, as businesses can use it to respond with agility to new opportunities and threats through innovations that drive business growth and competitive advantage.

Starburst gives you control over the performance and cost of your data lake analytics. Starburst Smart Indexing and Caching autonomously and continuously learns and adapts to the users, the queries they’re running, and the data being used. Workload-level visibility gives dataops teams a clear view of how data is being used across the entire organization, so they can better focus dataops resources on business priorities.

Dataops teams can specify which workloads are more important and how to allocate budgets. Based on this information, Starburst automatically and dynamically creates appropriate indexes, refines which queries to cache, and even materializes tables with the right column sets, including pre-joining dimensions.
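As a rough illustration of what prioritizing workloads under a budget can look like, the sketch below greedily picks the highest-priority workloads whose acceleration cost still fits. This is not Starburst's algorithm, which the article does not describe; the `Workload` fields, the priority scale, the cost units, and the workload names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    name: str
    priority: int      # hypothetical scale: higher means more business-critical
    accel_cost: float  # hypothetical cost of indexing/caching this workload


def allocate_budget(workloads: List[Workload], budget: float) -> List[str]:
    """Greedy sketch: accelerate workloads in priority order while budget remains."""
    chosen: List[str] = []
    remaining = budget
    for workload in sorted(workloads, key=lambda w: w.priority, reverse=True):
        if workload.accel_cost <= remaining:
            chosen.append(workload.name)
            remaining -= workload.accel_cost
    return chosen


workloads = [
    Workload("executive_dashboard", priority=3, accel_cost=40.0),
    Workload("ad_hoc_exploration", priority=1, accel_cost=30.0),
    Workload("fraud_alerts", priority=5, accel_cost=50.0),
]

accelerated = allocate_budget(workloads, budget=100.0)
print(accelerated)
```

With a budget of 100, the two highest-priority workloads are accelerated and the low-priority exploratory workload is left on the unaccelerated path. A real system would also need to weigh the expected benefit of each acceleration, which a simple greedy pass ignores.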

Better together: data lakes and distributed SQL query engines

In short, the power of a data lake is realized when it is combined with a distributed SQL query engine. The data lake holds vast amounts of raw data, in native formats, until it’s needed by the business. Then, when it’s combined with the agility and flexibility of distributed engines in querying that data, it promises organizations the ability to maximize data-driven growth.

Kamil Bajda-Pawlikowski

Co-Founder and CTO, Starburst

Kamil is a co-founder and CTO of Starburst. Previously, Kamil was the chief architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto, and the co-founder and chief software architect of Hadapt, the first SQL-on-Hadoop company (acquired by Teradata). Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University, where he co-invented HadoopDB, the original foundation of Hadapt’s technology. He holds an MS in computer science from Wroclaw University of Technology and both an MS and an MPhil in computer science from Yale University. Kamil is a co-author of several US patents and a recipient of the 2019 VLDB Test of Time Award.
