On Wednesday, August 3rd, I had the opportunity to share a hands-on lab exploring data lake reporting structures with my AWS partner in crime, Antony Thevaraj. The intent of the tutorial was to demonstrate a feasible method for creating data lake reporting structures, while also sharing a tangible example that anyone could test out on their own. The lab uses AWS S3 as the data lake and Starburst Galaxy as the analytics engine, and I hope that you will run the tutorial and experience firsthand the benefits of implementing comprehensive data lake analytics solutions.
I chose to use a public dataset because transparency is extremely important to me, and I wanted the lab to be reproducible by anyone at any time, without any barriers. I consider myself at least partially a kinesthetic learner, and I have only ever bought into the value of something once I could explore it and adopt it on my own. Since we are using the AWS COVID-19 Data Lake, all you need to try this tutorial for yourself is a set of AWS credentials (you can create a free account) and a Starburst Galaxy free trial.
Data Lake Analytics
“The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents.” – AWS
A data lake is a centralized repository that allows you to store all your structured and unstructured data at scale, as-is. This data can then be used for different types of analytics, from dashboards and visualizations to big data processing or near real-time analytics. Interestingly, an Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. Incorporating proper reporting structures within your data lake can help your organization get even more out of it. By taking advantage of the separation of storage and compute, you can use Starburst Galaxy as the analytics engine to create reporting structures that provide helpful business insight.
Tutorial Summary
The three layers we will create in our reporting structure are:
- Land layer: stores unmodified source data at any level of granularity
- Structure layer: stores joined, enriched, cleansed data
- Consume layer: stores aggregated data that is ready to be queried
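To make the three layers concrete, here is a minimal sketch in plain Python. It is not the tutorial's actual implementation: in the lab the layers live in S3 and are built with Starburst Galaxy, whereas this example keeps everything in memory. The sample rows, field names, and the helper functions `build_structure` and `build_consume` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Land layer: raw rows kept exactly as they arrived.
# (Hypothetical sample data, not taken from the COVID-19 data lake.)
LAND_CASES = [
    {"date": "2022-08-01", "state": "ga", "new_cases": "120"},
    {"date": "2022-08-02", "state": "GA", "new_cases": "80"},
    {"date": "2022-08-01", "state": "ny", "new_cases": "200"},
    {"date": "2022-08-02", "state": "NY", "new_cases": ""},  # missing value
]
LAND_STATES = [
    {"code": "GA", "name": "Georgia"},
    {"code": "NY", "name": "New York"},
]


def build_structure(cases, states):
    """Structure layer: cleanse, type, and enrich the land data."""
    names = {s["code"]: s["name"] for s in states}
    structured = []
    for row in cases:
        if not row["new_cases"]:
            continue  # cleanse: drop rows with no case count
        code = row["state"].upper()  # cleanse: normalize state codes
        structured.append(
            {
                "date": row["date"],
                "state_code": code,
                "state_name": names[code],  # enrich: join to the lookup
                "new_cases": int(row["new_cases"]),  # type: string to int
            }
        )
    return structured


def build_consume(structured):
    """Consume layer: aggregate to a shape that is ready to query."""
    totals = defaultdict(int)
    for row in structured:
        totals[row["state_name"]] += row["new_cases"]
    return dict(totals)


structure = build_structure(LAND_CASES, LAND_STATES)
consume = build_consume(structure)
print(consume)
```

Notice that the land data is never modified. Each layer is derived from the one before it, so the structure and consume layers can always be rebuilt from the untouched source.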
Final Thoughts
I hope that this tutorial will enable and inspire you to experiment with data lake analytics and create a set of reporting structures of your own. This sounds lame, but I really enjoyed getting to perform my own analysis on the COVID-19 data lake as I created this lab. If you have any questions, comments, concerns, or brainstorming ideas to make this tutorial even better, please reach out to me on GitHub. I’d love to hear from you.