×
×

PostgreSQL

PostgreSQL is an open-source relational database management system (RDBMS) with a rich feature set, reliability, and performance that competes with a commercial RDBMS like Oracle.

More specifically, PostgreSQL is an object-relational database that blends a relational database’s storage and query techniques with object-oriented features like user-defined data types and schema inheritance.

PostgreSQL features

PostgreSQL supports enterprise-class features like materialized views, foreign keys, and more. For example, features enabling ACID transactions include:

Multiversion concurrency control (MVCC): to prevent conflicts between concurrent transactions, PostgreSQL creates a database snapshot for each transaction.

Write-ahead logging: Changes to data files only occur after logging the changes to ensure data integrity.

Point-in-time recovery: Since all changes are logged, replaying these logs lets PostgreSQL reconstruct the database from any checkpoint.

Isolation levels: Of the four SQL-defined isolation levels, PostgreSQL doesn’t bother with read uncommitted. Instead, it defaults to read committed and allows configurations with repeatable read (without phantom reads) or serializable.

PostgreSQL benefits

With more than three decades of development by an active community, this open-source database platform has established itself in the enterprise space thanks to a proven track record for reliability and performance. Other benefits include:

Extensibility: The project’s core database is not as full-featured as commercial systems, but PostgreSQL has a rich extension ecosystem that lets organizations optimize their database architecture to meet specific use cases. When companies need more customized features, PostgreSQL supports a wide range of programming languages, including Python, Java, and Ruby.

Flexibility: Organizations can deploy PostgreSQL databases to their on-premises infrastructure with versions available for Windows, MacOS, Linux, BSD, and Solaris. The world’s largest cloud service providers, including AWS and Microsoft Azure, offer PostgreSQL database services.

Affordability: The free, open-source license lets small and large enterprises alike deploy PostgreSQL databases at any scale at no cost. In addition, the database’s popularity creates a large pool of developers with relevant skills.

PostgreSQL use cases

PostgreSQL’s core feature set and its dynamic ecosystem of open-source and licensable extensions let companies in any industry use this database platform for varied use cases.

Web applications: Along with the Linux operating system, Apache HTTP Server, and PHP, PostgreSQL replaces MySQL in the traditional LAMP technology stack that powers many websites and apps. PostgreSQL integrates well thanks to compatibility with JavaScript, Perl, and other programming languages. In addition, the database system allows for more complex queries than MySQL and is better for write-intensive applications.

Financial services: PostgreSQL’s most common use case is as an online transaction processing (OLTP) database for operational applications. In particular, the combination of high availability, high performance with enormous datasets, and ACID-compliant transactions makes it ideal for the financial services industry.

Geospatial: Extensions like PostGIS let PostgreSQL store geospatial data types like points, lines, and polygons. Specialized features let users search for and process geometric data before handing it off to ArcGIS and other geographic information systems.

Data warehousing and business intelligence: As the foundation of a company’s transactional systems, PostgreSQL often becomes a company’s de facto data warehouse platform. SQL compatibility lets business intelligence software or data pipelines query the database for descriptive analysis projects.

Scaling Postgres: migrating from a database to a data lake

Despite its many advantages, PostgreSQL becomes less compelling as an analytics platform once a company’s data consumption begins to scale.

1. Analytical workloads

Read operations dominate analytical workloads, but PostgreSQL’s success as a transactional database is due, in part, to its performance under operational workloads where quickly and efficiently writing data to storage is critical. In addition, PostgreSQL transactional databases use row-oriented tables unsuitable for analytical queries. PostgreSQL’s read performance becomes a bottleneck as a company’s decision-making needs evolve beyond descriptive analysis to include machine learning and predictive analytics.

Data lakes, on the other hand, are designed from the ground up to optimize analytical workloads. Columnar open file formats like Parquet, as well as open table formats like Iceberg, provide rich metadata for minimizing unproductive reads during analytical queries.

2. Scaling open source database

Storage is another limitation when using PostgreSQL as an analytics platform. Cloud platforms like AWS cap the size of a PostgreSQL server instance, forcing data teams to find ways to spread the database across multiple instances. This bespoke architecture becomes increasingly difficult — and expensive — to maintain as the company’s data climbs toward the petabyte range.

Data lakes leverage the inherent scalability of cloud object storage. Adding storage capacity to the file systems PostgreSQL databases run on often requires time-consuming changes to directories and file paths. Object storage services save objects in a flat structure, relying on object metadata for search and discovery. Scaling a data lake happens instantly

3. Unstructured and semi-structured data

JSON, XML, and key-value pairs will let you store unstructured and semi-structured data types in a relational PostgreSQL database. Doing so delivers many of the benefits of a NoSQL database like MongoDB while retaining PostgreSQL’s benefits like reliability and data integrity. However, you will sacrifice scalability and optimizations NoSQL solutions provide.

A data lake’s object storage can hold any data type — structured, unstructured, or semi-structured. Moreover, data lakes do not share the schema-on-write restrictions of a database. Instead, they store raw data and only apply schema when a query retrieves it.

4. Real-time analytics or streaming data

When companies build their transactional and operational systems on PostgreSQL, it’s only natural to use that database for real-time analytics. It’s already there, and the data team knows how to use it. Yet, the issues discussed in this section still apply. At scale, PostgreSQL analytics queries hit bottlenecks that can undermine the database’s operational performance.

Flowing real-time data into a data lake is a better approach for use cases that do not require instantaneous responses. Ingestion pipelines using technologies like Apache Kafka and Flink will aggregate streaming data to rapidly update the data lake with minimal latency for applications like security monitoring.

5. Complex data pipelines

PostgreSQL provides tools for developing complex data processing pipelines. In addition to support for Python and other programming languages, the project includes its own procedural language, pgsql, for efficiently executing complex queries.

A data lake lacks these kinds of analytics resources by itself, which is why most organizations add a high-performance SQL query engine like Trino. Massively parallel processing lets Trino handle petabyte-scale workloads, while federation lets a Trino SQL statement access data from multiple systems. Trino transforms a data lake into a scalable, performant data lakehouse analytics platform.

6. Performance

As discussed above, OLTP databases like PostgreSQL are optimized for transactional workloads where write performance under high concurrency must preserve data integrity. Analyzing huge datasets stored in row-oriented files imposes a significant performance penalty that becomes increasingly expensive to mitigate. Proprietary data warehouses get hit with these same issues as data volume, velocity, and variety keep growing.

Data lakes decouple storage from compute. Data teams can scale their storage infrastructure without impacting their processing budgets and vice versa. Their cloud computing expenses can scale up and down with actual demand. Furthermore, data lakehouse analytics platforms like Starburst provide cost and performance optimizations that boost query performance while managing cloud computing budgets.

7. Cost efficiency and pricing

Finally, the cost model of a transactional database like PostgreSQL works against its long-term use as an analytics platform. OLTP systems have a more even distribution of deletes, updates, and additions as they keep their databases up-to-date. To use it as an analytics platform requires keeping data for longer periods. The share of rarely accessed data grows leading to accelerating cloud storage costs.

Data lakes are optimized for storing infrequently accessed data in more affordable object storage services, where the cost-per-gigabyte is a fraction of cloud file storage service pricing.

Other related PostgreSQL database terms

PostgreSQL lives in an ecosystem of data technologies. For example, MySQL lacks the rich features and flexibility of PostgreSQL while having a shallower learning curve. MongoDB is a non-relational, NoSQL database optimized for rapid retrieval and analysis of JSON files and other documents. More direct competitors to PostgreSQL include:

PostgreSQL vs. Snowflake

PostgreSQL can be the foundation for an open-source data warehouse. However, data teams bear the full burden of developing and maintaining that infrastructure, whether on-premises or in the cloud. Snowflake is a commercial cloud data warehouse that delivers similar capabilities in a fully managed service — for a price. The choice requires carefully evaluating resources and priorities, especially since going with a proprietary solution carries the long-term risk of vendor lock-in.

PostgreSQL vs. cloud databases

Cloud service providers offer a number of cloud database options. Amazon Aurora and similar proprietary cloud databases offer PostgreSQL compatibility. Other options, like IBM Cloud Databases and Google Cloud SQL, offer cloud-based PostgreSQL services.

In both cases, you let the service provider manage infrastructure and software maintenance so your data team can focus on managing data. However, that level of service comes at an additional cost compared to hosting your own PostgreSQL instances in the cloud.

Related tutorial: Configure a PostgreSQL catalog

 

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.