The 4 Stages to Big Data Nirvana (In the Cloud)

July 18, 2019

Shaun Bruno
Marketing
Starburst

Nirvana – a state of perfect happiness; an ideal or idyllic place.

In big data, “Nirvana” is a wishlist of items:

The ability to scale your storage and computing power to your exact needs on demand
A drastically simpler architecture, with fewer locations in which your data is stored
The agility to let anyone in the organization access your data from anywhere

In other words: “flexibility”.

It’s widely accepted that, today, the most flexible and efficient place to store your data is the cloud. That’s why, for the better part of the last 20 years, the major movement in IT has been toward this goal. For most businesses this shift is no longer a question of “if” but “when”, and with mounting pressure from leadership, that “when” is often sooner than many are prepared for. However, this move should not be something to fear, and it should not interrupt your users or bring your enterprise to a standstill.

These are the 4 stages needed to help you reach your big data “Nirvana”, and to ensure your migration to the cloud is a smooth and efficient one.

Stage 1: The Perfect Storm of Data

The data landscape of today’s Fortune 500 company is in a state of disarray. If the company has followed the progression of technology over the last 50 years, it currently supports something like this:

Legacy database systems
A data warehouse and a number of data marts
An HDFS data lake
Possibly even some cloud systems

“Customers’ data now sits in an average of six to eight clouds, as well as their own data centers. Data integration technologies are shifting from extract-transform-load to a process- and pipeline-driven approach, with data management and governance capabilities to support both centralized and federated models.”
Ken Tsai, global VP and head of cloud platform and data management for SAP

That’s a lot to manage, and little has been done to make this management easier over time. The result is valuable data that resides in a number of closed-off silos. This is a problem both for analysts, who need regular access to this data, and for administrators, who need to provide that access.

As it stands, analysts are required to maintain access to each of these silos if they want to interact with the data within. This creates complicated workflows and ultimately extends the time it takes for them to get the answers that they need. Additionally, when working with disparate sets of data stored across systems, they must engage IT for resource-intensive ETL projects. As data volumes increase, so do the time and cost of these projects, leaving IT as the bottleneck to your analytics.

Stage 2: Finding a Little Bit of Sunlight

Analytics has become a core component of every department’s strategy. As companies continue to become more data-centric, they will also need to hire more and more data consumers (analysts, data scientists, business intelligence users, etc.). As mentioned above, if these consumers have to go through IT every time they want to interact with their data, then IT becomes the bottleneck that prevents critical business needs from being met.

To provide immediate relief in this scenario, you can introduce the concept of a “Consumption Layer”.

This can be accomplished with a tool like Trino. Trino is an incredibly fast and scalable distributed SQL engine that can be inserted between your analysts and your data sources. This immediately provides your analysts with a number of performance benefits but, most importantly, it creates a separation between your computing power and your storage sources, even though the underlying systems have not changed.

With this addition of a consumption layer, analysts no longer need to be aware of where their data resides. They can simply issue their SQL queries to Trino using the analytics tool of their choice and access the data wherever it lives. They can even federate data across systems without the need for any ETL.
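To make the idea of federating data across systems more concrete, here is a minimal sketch of what such a query might look like. It only builds the SQL text; it does not connect to a Trino cluster. Trino addresses tables as catalog.schema.table, but every catalog, schema, table, and column name below (hive, postgresql, sales, orders, customers, customer_id, and so on) is a hypothetical placeholder, as are the helper functions themselves.

```python
# Sketch only: all catalog, schema, table, and column names are hypothetical.


def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a Trino-style fully qualified table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(orders_table: str, customers_table: str, key: str) -> str:
    """Build one SQL statement that joins tables living in two different systems."""
    return (
        f"SELECT o.{key}, c.name, sum(o.total) AS revenue\n"
        f"FROM {orders_table} AS o\n"
        f"JOIN {customers_table} AS c ON o.{key} = c.{key}\n"
        f"GROUP BY o.{key}, c.name"
    )


# One table is assumed to live in a data lake, the other in a legacy database.
orders = qualified("hive", "sales", "orders")
customers = qualified("postgresql", "public", "customers")

query = federated_join_sql(orders, customers, "customer_id")
print(query)
```

The analyst writes a single SQL statement in this picture; which system each table lives in is expressed only through the catalog prefix, so no separate ETL job is needed to bring the two tables together first.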

Sounds great, but how does this help you get to the cloud?

Stage 3: The Big Data Cleanup

Effectively, Stage 2 has separated your users from your infrastructure.

Trino becomes your compute layer and all of your database systems simply become “storage”. With your analysts issuing queries directly to Trino, and Trino speaking to any data source, your data can now be stored anywhere you want.

With this, YOU decide what storage configuration will be the most efficient for YOUR business (i.e., what will yield the maximum performance per dollar and the most flexibility). The majority of the time this will be a data lake model, but rather than using traditional Hadoop, you can now accomplish this with cloud object storage like Amazon S3, Azure Blob Storage, or Google Cloud Storage (see “Data Lakes Without Hadoop”).

The next question is how you should move your data to the cloud. Historically, traditional RDBMSs have used proprietary formats that require a lot of effort to access and move your data. These legacy vendors may offer their system as a VM in the cloud, but don’t be fooled. Choosing this path will only restrict your flexibility and leave you with the same vendor lock-in problem that you started with.

Instead, leverage the power of open data formats like ORC and Parquet. These columnar formats came out of the Hadoop ecosystem, can be stored in object storage, and have been optimized for high-performance analytics. Now you can have full control of your data and never have to worry about vendor lock-in again.
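As one illustration of how a table might be rewritten into an open format, the sketch below generates a CREATE TABLE ... AS SELECT statement. It assumes the copy is done through Trino and that the target catalog uses Trino’s Hive connector, whose format table property accepts values such as ORC and PARQUET. The article does not prescribe this method, and the function name and all table names here are hypothetical.

```python
# Sketch only: assumes a Trino catalog backed by the Hive connector.
# The function name and every table name below are hypothetical.

OPEN_FORMATS = {"ORC", "PARQUET"}


def ctas_to_open_format(source: str, target: str, fmt: str = "PARQUET") -> str:
    """Build a CTAS statement that copies `source` into `target` in an open format."""
    fmt = fmt.upper()
    if fmt not in OPEN_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{fmt}')\n"
        f"AS SELECT * FROM {source}"
    )


# Copy a table from a legacy database catalog into an object-storage-backed catalog.
statement = ctas_to_open_format(
    source="legacy_db.public.orders",
    target="lake.sales.orders",
    fmt="orc",
)
print(statement)
```

Because the statement is ordinary SQL, the same pattern could be repeated table by table, which is one way a migration could proceed gradually while analysts keep querying through the same layer.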

Stage 4: Nirvana – Complete Control

Finally, your data is stored in open, high-performance formats. Trino has removed your analysts from your infrastructure (separating your storage from compute). You have successfully, and strategically, moved your data to the cloud without your business feeling any interruptions or pain.

What was once your consumption layer in Trino now becomes the computing power of your high-performance cloud data lake.

Your analysts are happy because they can issue queries directly into the cloud with the same performance benefits they’re used to with Trino. IT is happy because they can flexibly scale compute and storage on demand, and ultimately pay ONLY for what is used. This maximizes the efficiency of your IT spend and gives you the most control of your analytics platform into the future.