On Wednesday, August 3rd, I had the opportunity to present a hands-on lab exploring data lake reporting structures with my AWS partner in crime, Antony Thevaraj. The intent of the tutorial was to demonstrate a feasible method for creating data lake reporting structures while also sharing a tangible example that anyone could test out on their own. The lab uses AWS S3 as the data lake and Starburst Galaxy as the analytics engine, and I hope that you will run the tutorial and experience firsthand the benefits of implementing comprehensive data lake analytics solutions.

I chose to use a public dataset because transparency is extremely important to me, and I wanted the lab to be reproducible by anyone at any time, without any barriers. I consider myself at least partially a kinesthetic learner, and I have only ever bought into the value of something once I could explore it and adopt it on my own. Since we are using the AWS COVID-19 data lake, all you need to try this tutorial for yourself is a set of AWS credentials (you can create a free account) and a Starburst Galaxy free trial.

Data Lake Analytics

“The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents.” – AWS

A data lake is a centralized repository that allows you to store all your structured and unstructured data at scale, as-is. This data can then be used for different types of analytics, from dashboards and visualizations to big data processing or near real-time analytics. Interestingly, an Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. If you incorporate proper reporting structures within your data lake, your organization is likely to become even more efficient. Take advantage of the separation of storage and compute by using Starburst Galaxy as the analytics engine to create reporting structures that provide helpful business insight.

Tutorial Summary 

The three levels or layers we will create in our reporting structure are:

  • Land layer: stores unmodified source data at any level of granularity
  • Structure layer: stores joined, enriched, cleansed data
  • Consume layer: stores aggregated data that is ready to be queried
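To make the three layers concrete, here is a minimal sketch of the land, structure, and consume flow. It is not the lab's actual code: it uses Python's built-in sqlite3 module as a stand-in for Starburst Galaxy, and the table names, column names, and rows are all made up for illustration (the real COVID-19 data lake schemas will differ).

```python
import sqlite3

# Hypothetical raw records for a single day, one row per sub-region.
# Names and values are invented; they are NOT taken from the real dataset.
LAND_ROWS = [
    ("US", "New York", "100"),
    ("US", "New York", "150"),
    ("US", " Texas ", "80"),          # stray whitespace, as raw data often has
    ("Australia", "Victoria", "40"),
    ("Australia", "Victoria", None),  # missing value
]

conn = sqlite3.connect(":memory:")

# Land layer: unmodified source data, everything stored as text.
conn.execute(
    "CREATE TABLE land_cases (country TEXT, province_state TEXT, confirmed TEXT)"
)
conn.executemany("INSERT INTO land_cases VALUES (?, ?, ?)", LAND_ROWS)

# Structure layer: cleansed data (trimmed names, typed counts, no missing values).
conn.execute(
    """
    CREATE TABLE structure_cases AS
    SELECT country,
           TRIM(province_state) AS province_state,
           CAST(confirmed AS INTEGER) AS confirmed
    FROM land_cases
    WHERE confirmed IS NOT NULL
    """
)

# Consume layer: aggregated data that is ready to be queried.
conn.execute(
    """
    CREATE TABLE consume_us_cases AS
    SELECT province_state, SUM(confirmed) AS confirmed
    FROM structure_cases
    WHERE country = 'US'
    GROUP BY province_state
    """
)

# Each result row is a (province_state, confirmed) pair, so dict() works directly.
us_cases = dict(conn.execute("SELECT province_state, confirmed FROM consume_us_cases"))
structure_row_count = conn.execute("SELECT COUNT(*) FROM structure_cases").fetchone()[0]
print(us_cases)
```

The point of the sketch is the direction of flow: each layer is built only from the one before it, so analysts querying the consume table never have to touch the raw land data.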

For this tutorial, we will explore two different datasets from the COVID-19 data lake. The first dataset covers the Daily Global and US COVID-19 Cases provided by Enigma. We will use this dataset to create tables in the Consume layer that center on confirmed cases in the United States and Australia, aggregating the data for each province or state. The second dataset shares information on US Hospital Beds, provided by Rearc. With this hospital information, we will be able to create tables in the Consume layer that show the capacity and occupancy of hospital beds for each state. Finally, we will implement Role-Based Access Control so that our data analysts only have access to select from the Consume layer tables.
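The access-control idea can be illustrated with a small plain-Python model. This is only a sketch of the concept, not Starburst Galaxy's API: the `Role` class, the role names, and the layer names used as schemas are all hypothetical, and in the lab itself roles and privileges are configured in Starburst Galaxy.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Role:
    """Hypothetical role that holds privileges per schema (one schema per layer)."""

    name: str
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, privilege: str, schema: str) -> None:
        # Record that this role holds `privilege` on `schema`.
        self.grants.setdefault(schema, set()).add(privilege)

    def can(self, privilege: str, schema: str) -> bool:
        # Anything not explicitly granted is denied.
        return privilege in self.grants.get(schema, set())


# The engineer builds every layer, so this role gets broad privileges.
data_engineer = Role("data_engineer")
for layer in ("land", "structure", "consume"):
    for privilege in ("SELECT", "INSERT", "CREATE"):
        data_engineer.grant(privilege, layer)

# The analyst may only read from the consume layer.
data_analyst = Role("data_analyst")
data_analyst.grant("SELECT", "consume")

print(data_analyst.can("SELECT", "consume"))  # allowed
print(data_analyst.can("SELECT", "land"))     # denied
```

Denying by default keeps the analyst role simple: it only ever needs a single grant, and new land or structure tables stay hidden without any extra configuration.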

Final Thoughts

I hope that this tutorial will enable and inspire you to experiment with data lake analytics and create a set of reporting structures of your own. This may sound lame, but I really enjoyed getting to perform my own analysis on the COVID-19 data lake as I created this lab. If you have any questions, comments, concerns, or ideas to make this tutorial even better, please reach out to me on GitHub. I'd love to hear from you.

 
