One of the true pillars of the tech revolution, PostgreSQL is an OLTP database designed primarily to handle transactional workloads. The technology has been around for over 25 years and, nevertheless, clearly demonstrates that it’s stable, performant, and feature rich enough to support significant analytical workloads too. Just a quick look at the list of functions and operators, the list of different indexes supported, Windows functions, and JSON support, should convince you of that. The analytical capabilities available support the argument that PostgreSQL is an amazing choice to start with when building out a company’s data infrastructure.
PostgreSQL has a reputation of being a very approachable technology with very low barriers to entry that can be easily adopted by even less technically savvy companies out there. Finding people who have experience with the technology is relatively easy and you can even start by getting some help setting up from your application developers. It has been so widely adopted that pretty much every analytical framework or tool out there supports it. Actually, there are plenty of examples of newer databases that implement the Postgres protocol, so migrating to them from a postgres instance should be super easy. Finally, Postgres is one of the databases that can be deployed in every possible way from on-premise binary installations to cloud services and every cloud provider out there.
So, if PostgreSQL is so good, why would someone consider a different technology for analytics?
The reason is simple, PostgreSQL was and still is developed for transactional workloads. It’s not optimized for analytical workloads, and although it can accomodate them, at some point, each company has a responsibility to consider a technology and architecture that is built specifically for analytical purposes. This breaking point usually goes hand in hand with the topic of scalability. As scalability becomes a factor in the data architecture decisions, it is imperative to start considering how a data lake can relieve the analytical responsibilities previously served by PostgreSQL. Seems simple right? However, the trickiest decision is identifying the proper time that an organization needs to scale. Let me take you through some of the main signals I’ve seen that can help you identify that it’s time to consider a data lake over a PostgreSQL instance.
Scaling your data infrastructure is probably the most common reason that someone would consider moving away from Postgres and into an OLAP system like a data lake because of the conscious storage limitations within the Postgres technology. For example, if you check AWS you’ll notice that you can’t provision more than 64TB of storage per database instance. Can you use more than one database instance if you need more? Of course you can. You can partition your data somehow or even develop a way to run a cluster of PostgreSQL. But here’s the issue, PostgreSQL was not built for this unique type of architecture. You will very quickly get into trouble and start looking for some highly skilled engineers who can architect something out of nothing.
Consider the performance of PostgreSQL for analytical queries. Postgres supports a wide range of aggregation functions but it’s a row-based database. That’s important for Postgres because it optimizes for transactional workloads, where what matters is super high concurrency and the ability to safely alter specific rows in the database. Think of a database that runs a web-banking application for example. On the other hand, analytics is all about aggregations over large amounts of data. For example, we want to calculate the total revenue our orders have generated this month, to do that we want to apply a SUM function over all the orders that were created this month. Because of this nature of our workload, we rarely need to go and update a specific row, or in our example a specific order. This row by row processing model becomes extremely expensive as we start working on millions or billions of rows. That’s the reasonOLAP systems exists. So, there’s a point where you’ll have too much data for PostgreSQL to satisfy your analytics SLAs.
Another big difference is that the growth of data for Postgres is assumed differently compared to a system like a data lake. When you have a transactional system, deletes and updates happen equally if not more frequently than additions of new data. This is not the case with analytical systems though and that’s why the term warehouse was used. Ideally in your data analytics infrastructure, you never want to delete data. Actually in some industries you are not *allowed* to delete data for auditing purposes. This has a tremendous effect on how operating Postgres and a data lake is priced. Let’s take as an example the price per GB for RDS PostgreSQL, which is $0.23 USD and compare it with the price for AWS S3 (commonly used as the storage layer for a data lake) which is $0.023 USD per GB. It might not sound like a lot when we talk in cents per GB but as you scale and accumulate more and more data it starts to make a difference. Keep in mind that the above numbers are just for storage and while you can potentially negotiate a deal there’s an order of magnitude difference in price. Keep an eye on your RDS spending and compare with S3, this is a good way to get a signal that you should start considering a different architecture.
Data Lakes as an alternative
While there are many technological signals related to your need to scale your infrastructure, there are organizational scaling considerations that are equally, if not more, important of a reason to consider a data lake.
There must be conscious decisions into the organizational methods for building and iterating within your company. If you are driving towards a data driven culture that will fuel many rapid iterations on your products and business, a data lake is probably a better fit for you.
Because of the nature of PostgreSQL, the database follows a schema on write philosophy. That’s great for the use cases that PostgreSQL wants to serve best but it can slow your organization down if you want to iterate fast. The reason is that if you want to store something on the database, you will have to be very explicit about everything around the data, specifically about the schema. These choices have deeper implications that lead to creating complex ETL pipelines that take ridiculous amounts of time to get implemented. When you want to store your data and figure out later how to best monetize it, the above approach can become a problem. That’s where the data lake with its schema on read approach shines. You can literally throw anything you want on the data lake and take care of it when the timing is right, without losing any information. The above is also important in some technical edge cases, for example when there’s huge throughput of data generated that has to be stored before it can be processed, e.g. storing data from streams like Kafka. Adopting such an architecture will allow you to iterate faster and innovate more with your data.
The second type of change has to do with the type of processing you plan to perform on your data. The moment your organization decides to invest into ML and Data Science, the data lake is the best option you have. Moving away from the standard descriptive analytics of BI and getting into predictive and ML workloads, requires a completely different infrastructure because of the different data types. Ranging from strongly typed data, like the data PostgreSQL is built to handle, to completely unstructured data like free text, these different types of data have to be used together to deliver value. That’s where the data lake really shines. You can put anything you want in a lake, ranging from video files to tabular data stored in Parquet Files. The lake will handle it well and store it efficiently.
Not only does a data lake offer the maximum flexibility when it comes to the data types you can store but it also offers the maximum interoperability when it comes to what technologies can be used to process the data. Something that is also extremely important in ML and predictive analytics. Python, Tensorflow, R, SQL, GPUs and even FPGAs can be brought together and access data from a lake in an extremely scalable way. You might argue here that ML and predictive analytics is only for companies like Google and Facebook but this is far from the current reality. Many businesses require a data lake to scale in order to efficiently provide personalization, recommendations and advanced marketing segmentation.
To properly execute a data infrastructure strategy for a new company, starting lean is important. I’m a big fan of using proven technology that doesn’t cost much to use to kickstart a new process or function. In my opinion 95% of the companies out there should start with PostgreSQL if they are just starting considering their data infrastructure. It’s also important though, to know when it’s time to switch into a dedicated data analytics solution like a data lake. I hope I gave you a few good signals to consider in your data infrastructure journey.
Is everything so perfect with data lakes? No it’s not. Data lakes have been historically very difficult and technically challenging to build and operate. It was the solution of the FAANGs of the world that had the engineering talent to build and maintain them. But things are changing.
With products like Starburst Galaxy, you can build and operate your data lake having a similar experience like a cloud data warehouse. Making data lakes accessible to everyone.