Data Lake Analytics for Smart, Modern Data Management

By: Cindy Ng
May 27, 2022

Best-in-class organizations need fast, reliable data analytics that help business leaders identify patterns and key insights, so they can chart the best path to maximize future revenue, reduce costs, or minimize risk.

However, going from data to data-driven was a technical challenge long before Big Data started trending. For one thing, data sprawl and the complexity it brings are pervasive across every industry and region. Most organizations use an average of 4-6 different data platforms, and at least 11% of organizations use 10-12.

Nevertheless, organizations that are able to produce and deliver data-driven results are also leading their respective industries. To see how these winning organizations traversed the world of Big Data, let’s retrace our steps so that you can perhaps examine what’s possible for your organization.

What does it mean to centralize data?

Even when data was a scarce commodity, business leaders were sold on the promise of data monetization. They envisioned that gathering all the data in one central location would be the best path to a single source of truth. At first glance, centralizing data would mean that all users know where the data resides, as well as where and why it was collected. A centralized data infrastructure would then enable fast and easy data analysis that delivers business value to key stakeholders. Even better, you’d have clear oversight to meet regulatory compliance, visibility into how the data was used, and more data accessibility for everyone.

Centralized data infrastructure: What is a data warehouse?

After business leaders bought into centralizing data, they spent millions on data warehouses to equip the organization with business intelligence through data visualizations, reports, and dashboards, resulting in a data architecture that looks like this:

Current data landscape

There’s operational data (transactional data that supports the business), analytical data (data generated by running the business), and extract-transform-load (ETL) data pipelines that sit between the operational and analytical data planes. The analytical data plane is where a data warehouse resides. A data warehouse aggregates large amounts of structured data from multiple sources into a central data store to perform queries and run powerful analytics on high-volume data (think: petabytes). Subsequently, data warehouses cemented their name as the future of big data analytics.
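To make the ETL step concrete, here is a minimal sketch of a pipeline that sits between an operational source and a warehouse-style table. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the order records, the table name sales_by_region, and the function names are all hypothetical and only illustrate the extract, transform, and load stages described above.

```python
import sqlite3

# Hypothetical operational records (one row per order), standing in for a
# transactional system. None of these values come from the article.
operational_orders = [
    {"order_id": 1, "region": "east", "amount_cents": 1250},
    {"order_id": 2, "region": "west", "amount_cents": 800},
    {"order_id": 3, "region": "east", "amount_cents": 450},
]


def extract(source):
    """Extract: read rows from the operational source."""
    return list(source)


def transform(rows):
    """Transform: aggregate order amounts per region, converted to dollars."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount_cents"]
    return [(region, cents / 100) for region, cents in sorted(totals.items())]


def load(conn, rows):
    """Load: write the aggregated rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total_usd REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", rows)
    conn.commit()


# sqlite3 in memory plays the role of the analytical data store (an assumption).
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(operational_orders)))

result = warehouse.execute(
    "SELECT region, total_usd FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```

Every new source or report typically means another pipeline like this one, which is part of why transformation work takes up so much engineering time.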

However, over time, those who invested in a centralized data architecture experienced problems with vendor lock-in and high storage and compute costs. Data engineers spent 70% of their time transforming data between data sources and data warehouses. Needless to say, data teams faced an enormous amount of cognitive and process overload with their existing data architectures.

Meanwhile, data continued to grow, especially in the variety of its formats. Data warehouses only supported structured data, and now roughly 80%-90% of data in the enterprise is unstructured or semi-structured. Increasingly, it became evident that data warehouses couldn’t efficiently scale to create value from the volume, velocity, and variety of data across all the big data platforms.

Data lakes, a solution to data warehouses

In response to the challenges of data warehouses, the data lake architecture emerged as a solution. Unlike the data warehouse, a data lake consolidates data stored in its native, raw, and open format, either on-premise, within private cloud infrastructure, or in the cloud. The data lake approach offers more affordable storage and lowers the overall costs compared to data warehouses.

Benefits of a modern data lake

Separation of compute and storage

With the emergence of the cloud, compute and storage are now separate. As a result, an organization can scale each one up or down independently of the other. You can leave all your data in cheaper storage and scale compute up or down as needed.

Raw data, supports open data formats

Data in a data lake can still be extracted from the operational data plane, but unlike a data warehouse that only processes structured data, the data can be raw and minimally processed.
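One way to picture raw, minimally processed data is schema-on-read: records are stored as they arrive, and a schema is applied only when someone queries them. The sketch below assumes newline-delimited JSON as the raw format; the event fields and the read_with_schema helper are hypothetical and not tied to any particular product.

```python
import json

# Hypothetical raw events as they might land in a data lake: newline-delimited
# JSON where records do not all share the same fields.
raw_events = "\n".join([
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "click", "ms": 80, "device": "mobile"}',
])


def read_with_schema(raw, fields):
    """Apply a schema at read time: keep the requested fields and fill any
    field a record lacks with None, leaving the stored data untouched."""
    for line in raw.splitlines():
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


rows = list(read_with_schema(raw_events, ["user", "ms"]))
latencies = [row["ms"] for row in rows if row["ms"] is not None]
average_ms = sum(latencies) / len(latencies)

print(rows)
print(average_ms)
```

Because the schema lives in the reader, a different consumer can request a different set of fields from the same raw data without reloading or restructuring it.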

Data lake analytics: 5 key differentiators to smart, modern data management

Still, exploring the data became difficult. Many organizations have both a data warehouse and a data lake, which creates far more duplicate data and/or data drift than you’d want. It’s not ideal for analytics or governance. Also, even though the data lake approach decoupled compute and storage, it has failed to deliver quality data (the so-called data swamp) and the performance modern enterprises need to thrive in an uncertain economic climate.

It’s time for a new take on modern data management: live, interactive queries directly on your cloud data lake storage. Starburst is the fastest data lake query engine. It leverages cheap data lake storage and gives data consumers easy, stable access to their data in open file formats, resulting in reduced data management costs and faster time to insights for critical business decisions. In other words, Starburst offers data warehouse analytics capabilities, without the data warehouse. Below are five key differentiators:

#1 Single point of access accelerates time-to-insight

Connecting to data anywhere (on-premise, hybrid, multi-cloud and/or any public/private cloud) with support for cross-cloud analytics across any geographical location makes data analytics easier and better.

Previously, it would take a leading food and beverage giant weeks to move data out of its Teradata warehouses and into its data lake. Now business units, wholesalers, and brands can immediately access and query this data where it lies and join it with larger data sets.
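The idea of querying data where it lies can be sketched with a single SQL statement that joins tables living in two separate databases. The example below is only an analogy: it uses Python's built-in sqlite3 module and SQLite's ATTACH DATABASE, not Starburst, and the database alias lake, the table names, and the rows are hypothetical.

```python
import sqlite3

# One connection plays the role of the single point of access.
conn = sqlite3.connect(":memory:")

# Attach a second, separate in-memory database to stand in for another source.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Hypothetical "warehouse" table in the main database.
conn.execute("CREATE TABLE main.products (sku TEXT PRIMARY KEY, brand TEXT)")
conn.executemany(
    "INSERT INTO main.products VALUES (?, ?)",
    [("s1", "Acme"), ("s2", "Bolt")],
)

# Hypothetical "data lake" table in the attached database.
conn.execute("CREATE TABLE lake.sales (sku TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO lake.sales VALUES (?, ?)",
    [("s1", 3), ("s2", 5), ("s1", 4)],
)

# A single query joins both sources; no data is copied between them first.
query = """
    SELECT p.brand, SUM(s.units) AS units
    FROM main.products AS p
    JOIN lake.sales AS s ON s.sku = p.sku
    GROUP BY p.brand
    ORDER BY p.brand
"""
units_by_brand = conn.execute(query).fetchall()
print(units_by_brand)
```

The point of the sketch is the shape of the query: each table is addressed by the source it lives in, and the join happens at query time rather than after a weeks-long migration.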

#2 Advanced data lake connectors improve data value

Delivering data warehouse functionality to data lakes requires connectivity to data lakes. With Starburst, that means having access to 50+ data sources, including modern and legacy enterprise sources. You can also increase the value of data that can’t move by combining it with data from external sources.

#3 Scale compute resources as needed to control costs

By leveraging the elasticity of the cloud (separation of compute and storage), you can scale your compute resources up or down to meet demand. 

Initially, Optum’s data lake architecture couldn’t support its needs at scale. Tired of poor query performance and inefficient resource utilization, Optum’s Advanced Research & Analytics group deployed Starburst Enterprise to improve data access, accelerate time to insight, maintain strong security, and reduce costs.

#4 Data lake query engine built for performance and flexibility

Starburst’s MPP query engine was built for speed and performance at scale, which gives organizations control over query response time and cost as well as the flexibility to perform ad-hoc and batch queries. 

Getting data to DoorDash’s teams faster is critical in providing a great user experience and ensuring the success of their many initiatives. With Starburst Enterprise, the team is able to run queries on S3 in ML Spark jobs instead of a data warehouse, which has resulted in a 10X to 15X performance improvement in overall run times.

#5 Improve data access while adhering to security and compliance

Conducting business knowing that you’ll be adhering to regulatory compliance requirements with centralized and fine-grained control over access to all of your data is huge. 
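As a rough illustration of what centralized, fine-grained access control means, the sketch below keeps one policy table that decides which columns and rows each role may see. It is a toy model under stated assumptions: the roles, columns, and the read_as helper are hypothetical and do not represent how any specific product implements access control.

```python
# Hypothetical central policy: each role maps to the columns it may read and a
# row filter that decides which records it may see.
POLICIES = {
    "analyst": {
        "columns": {"region", "amount"},
        "row_filter": lambda row: row["region"] == "EU",
    },
    "admin": {
        "columns": {"region", "amount", "email"},
        "row_filter": lambda row: True,
    },
}

# Hypothetical records containing a sensitive column (email).
ROWS = [
    {"region": "EU", "amount": 10, "email": "x@example.com"},
    {"region": "US", "amount": 20, "email": "y@example.com"},
]


def read_as(role, rows):
    """Return only the rows and columns the given role is allowed to see."""
    policy = POLICIES[role]
    return [
        {key: value for key, value in row.items() if key in policy["columns"]}
        for row in rows
        if policy["row_filter"](row)
    ]


analyst_view = read_as("analyst", ROWS)
admin_view = read_as("admin", ROWS)
print(analyst_view)
print(admin_view)
```

Keeping the policy in one place means every query path applies the same rules, which is what makes compliance auditable.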

Data Lake Analytics solutions on your terms

Data Lake Analytics is the ability to efficiently and effectively generate insights from data in a data lake to inform confident data-driven decision making. You’ll enable agility, speed, ease, and performance for your entire team of platform administrators, data engineers, data scientists, and data analysts. See how you can break down your data silos and start making your data work for you, on your own terms. 

Cindy Ng

Sr. Manager, Content, Starburst

Cindy Ng writes about what new data management and analytics strategies mean for both large enterprises and startups. She also serves as the producer of Data Mesh TV, a monthly educational program for data leaders about data monetization, aligning data strategy with business goals, and accelerating digital transformation initiatives with Data Mesh. Prior to Starburst, she wrote and spoke about ransomware, insider threats, data security, data compliance standards, algorithmic audits, and data ethics.
