Almost every company I speak to has some sort of data lake. One thing we can thank Hadoop for: companies turned their landing/staging/raw zones into data lakes, and from there some of the data flows into a data warehouse.
In reading the new report on data lakes and lakehouses from GigaOm, I was excited to see companies like ours and others continuing to innovate around providing value directly from the lake rather than copying the data into a data warehouse.
Much of my job lately has been educating companies on the hidden value their data lake can provide while serving most of their analytical use cases. There is still a lot of work to do, and in my experience there are four concerns that make companies hesitant to use their data lakes for the majority of their analytics:
- The state of the data – surprisingly, many companies I talk to don’t realize a data lake can be structured much like a data warehouse. Tables and different layers/zones can be built, queried, and joined together with any BI tool. I often hear, “my data lake is a mess and I don’t trust it.”
- Performance – Hadoop created the mindset that only data scientists running multi-terabyte queries would benefit from a data lake. With advancements in networking, hardware, and storage, this is no longer the case. Most vendors incorporate a fast indexing layer to speed up queries when object storage alone cannot meet the SLA.
- Security – Many vendors on this list, including Starburst, offer security across a data lake that matches or exceeds legacy data warehouse solutions. Access control down to the column level is considered table stakes at this point.
- Modifying data – Many companies realized they were in a bad spot when GDPR became a reality and they had no way to delete customers from their data lakes. With the introduction of table formats such as Apache Iceberg and Delta Lake, deletes, updates, and merges are now possible.
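To make the first and last points concrete, here is a minimal sketch of what this looks like in practice. The schema and table names (`lake.curated.orders`, `lake.curated.customers`) are hypothetical, and the exact syntax varies by engine, but the shape is plain SQL:

```sql
-- Lake tables can be queried and joined like warehouse tables
-- (hypothetical catalog/schema/table names):
SELECT o.order_id, c.country, o.total
FROM lake.curated.orders o
JOIN lake.curated.customers c
  ON o.customer_id = c.customer_id;

-- With a table format such as Apache Iceberg or Delta Lake,
-- a GDPR-style deletion becomes a single statement:
DELETE FROM lake.curated.customers
WHERE customer_id = 42;
```

Any SQL-speaking BI tool can issue the first query against the lake directly; the second relies on the table format tracking row-level changes, which plain files on object storage cannot do on their own.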
As we see from the report, the features and abilities of the different data lake vendors are converging and have become almost equal to a data warehouse. Data lakes are transforming from swamps into very well-structured architectures with the ability to handle a wide range of use cases from large multi-terabyte queries to seconds and even sub-seconds.
The diagram below illustrates what most companies perceive as a common data flow and use cases between a data lake and a warehouse:
This report shows more and more companies choosing to serve up their analytics from data stored in a single location because of the simplicity and cost savings:
I presented these same ideas in a dbt Coalesce presentation, which can be found here, with slides here. The talk covers the misconceptions of data lakes and an open, multi-engine approach that differs from some of the vendors in this report.
Building an open data lake architecture offers the greatest flexibility, future-proofing your company against being locked into one engine or even one storage format. It enables data users to access data in their data lake directly via SQL, reduces complexity, and makes life easier for data teams.
Here are the top four things to consider when building or repurposing your data lake to handle more use cases:
- Avoid lock-in – choose a vendor or set of vendors with the least amount of lock-in. Lock-in ranges from proprietary file and table formats to proprietary caching layers that tie up your data in exchange for performance.
- Open source matters – Although companies have no problem paying for software they perceive as good value, a vendor built on a thriving open source project not only stays honest; it also lets the community help direct the project's future, something you won't find in traditional enterprise software.
- Experience – with the unstable state of the economy and startups quickly running out of steam and money, it’s vital to choose a vendor that has not only proven itself in companies from small businesses to large enterprises but also continues to innovate by adding new features and improving performance.
- Additional features – beyond querying capabilities, look for platform features such as cataloging, attribute-based security, data products, cross-cloud connectivity, observability, and global search. A vendor that offers these in its platform saves you money by avoiding additional vendor purchases.
A modern open data lakehouse, with data stored in vendor-agnostic formats, is the architecture that best enables data democratization, both today and for years to come.
As data lakes and lakehouses become more mainstream, the waters will quickly become muddied with new vendors jumping in. This further validates considering whether to enhance your data lake to handle more use cases rather than continuing to duplicate data into a data warehouse and taking on the additional expense that comes with it.