Data Glossary

ACID Transactions

ACID transactions are methods for ensuring database integrity.
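
As a minimal sketch of the atomicity part of ACID, the example below uses Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the `balances` helper are hypothetical names chosen for illustration. A transfer that would overdraw an account violates a `CHECK` constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer(conn, "alice", "bob", 500), balances(conn))  # overdraft: nothing changes
print(transfer(conn, "alice", "bob", 30), balances(conn))   # valid transfer: both rows change
```

The sketch covers atomicity and consistency only; isolation and durability depend on the database engine and its configuration.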

AI Analytics

AI data analytics is the application of artificial intelligence and machine learning technologies to traditional analytics.

AI Data Strategy

Both data strategy and AI strategy are integral to an organization’s success in the modern technological landscape, and yet they serve distinct purpos...

Anti-Money Laundering

Anti-money laundering consists of the regulations and practices used to prevent the abuse of the financial system in support of terrorism and other cr...

Apache Airflow

Apache Airflow is an open-source data workflow management framework based on Python that makes pipelines more dynamic, extensible, and scalable than t...

Apache Hadoop

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets.

Apache Hive

Apache Hive is a data warehouse system built on top of Hadoop’s distributed storage architecture.

Apache Hudi

Apache Hudi (pronounced “hoodie”) is a transactional data lake platform first developed by Uber to bring data warehouse-like analytics capabilities to...

Apache Iceberg

Apache Iceberg is an open-source table format that adds data warehouse-level capabilities to a traditional data lake.

Apache Impala

Impala is an SQL query engine for Hadoop-based data architectures.

Apache Parquet

The Apache Parquet file format is a way to bring columnar storage to Hadoop-based data lakes. Parquet supports efficient compression and encoding sche...

Apache Spark

Apache Spark is an analytics engine built for processing massive datasets. Spark’s ability to process vast quantities of data within Apache’s big data...

Attribute-Based Access Control (ABAC)

Attribute-based access control (ABAC) is a method for dynamically applying access policies based on specific attributes of the user, the data or system...

Business Analytics

Although businesses have always crunched the numbers, “business analytics” refers to a more rigorous approach that applies statistical analysis and ot...

Centralized Data

Centralized data is the long-established practice of gathering all data the company generates into an enterprise database, a data warehouse, or, more...

Change Data Capture (CDC)

Change data capture (CDC) is the process of identifying incremental changes to source systems and transmitting those changes in real time to a target...
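
One simple way to picture CDC is a comparison of two snapshots of a source table. The sketch below assumes snapshots are plain dictionaries keyed by primary key; `capture_changes` and `apply_changes` are hypothetical helper names. Production CDC tools typically read the database's transaction log rather than diffing snapshots, so treat this as an illustration of the idea, not of any particular product.

```python
def capture_changes(previous, current):
    """Return (operation, key, row) events that turn `previous` into `current`."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("insert", key, row))
        elif previous[key] != row:
            events.append(("update", key, row))
    for key in previous:
        if key not in current:
            events.append(("delete", key, None))
    return events


def apply_changes(target, events):
    """Replay change events against a target copy of the data."""
    for operation, key, row in events:
        if operation == "delete":
            target.pop(key, None)
        else:  # insert and update both write the latest row
            target[key] = row
    return target


before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}

events = capture_changes(before, after)
replica = apply_changes(dict(before), events)
print(events)
print(replica == after)
```

Only the changed rows travel to the target, which is what makes CDC cheaper than reloading the full table.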

Cloud Data

Cloud data covers any data stored or processed on internet-accessible remote servers, whether company-owned or hosted by third-party cloud services.

Cloud Data Lakehouse

A cloud data lakehouse is a data platform that unifies enterprise data sources within a performant, cost-effective cloud architecture.

Cloud Data Migration

Cloud data migration is the process that moves data from legacy systems to cloud platforms.

Cloud Data Warehouse

A cloud data warehouse is a cloud-based version of the traditional on-premises enterprise data warehouse. Given the large amounts of data businesses g...

Cloud Native

A cloud-native approach to software development takes full advantage of the cloud’s scalability, elasticity, resiliency, and efficiency.

Cloud Object Storage

Cloud computing makes extensive use of object storage. This has many advantages, including cost, speed, and scalability.

Customer Data Platform

Customer 360 is a strategic priority that requires the entire organization to create unified, end-to-end customer experiences. However, harnessing all...

Dark Data

Dark data is the dormant contents of a company’s data lakes and other repositories.

Data Analytics

Data analytics is the process that converts raw data into actionable insights. In data-driven organizations, analytics increasingly relies on large da...

Data Analytics Architecture

A data analytics architecture is a set of policies and standards that guides the organization as it builds analytical processes. More than technical o...

Data Applications

A data application (or data app) processes and analyzes big data to rapidly deliver insights or take autonomous action.

Data Architecture

Data architecture is a framework that guides how to collect, store, manage, and use data in ways that support an organization’s business goals.

Data Blending

Data blending is the process of combining data sets from different data sources to generate actionable insights that answer specific business question...

Data Catalog

Data catalogs are data source inventories. They collect metadata about the source’s various assets.

Data Classification

Data classification is a framework for organizing data in ways that improve data management, information security, and risk management.

Data Complexity

Data complexity is an emergent property of enterprise data shaped by volume, velocity, variety, veracity, value, and vigilance — the V’s of big data.

Data Compliance

Data compliance consists of the governance processes for meeting the requirements of internal, industry, and regulatory standards for data security an...

Data Democratization

Data democratization is the goal for organizations and employees to quickly and securely access data so that they can analyze it and make data-driven...

Data Discovery

Data discovery is a technique for gathering data, evaluating it for potential insights, and performing advanced analytics to create actionable insight...

Data Engineering

Data engineering emerged as a specialization of software engineering in response to exploding data volumes.

Data Exploration

Data exploration is an essential preliminary step to analyzing large datasets. Analysts use visualization and statistical methods to understand the qu...

Data Fabric

A data fabric is a data management architecture that uses artificial intelligence and machine learning algorithms to automate data ingestion best prac...

Data Federation

Data federation involves the creation of a virtual database that maps an enterprise’s many different sources and makes them accessible through a singl...

Data Governance

Data governance is a concept within the discipline of data management that takes a holistic approach to an organization’s data and its lifecycle: data...

Data Ingestion

Ingestion lands raw data from external sources into a central repository. From there, integration pipelines will transform data to meet data quality,...

Data Integration

Data integration is a series of data management procedures for bringing datasets from different sources into data lakes, data warehouses, or other dat...

Data Lake

A data lake is a single store of data that can include structured data from relational databases, semi-structured data, and unstructured data.

Data Lake Storage

Data lake storage houses a wide variety of data types, including structured, semi-structured, and unstructured data. Each of these data types serves...

Data Lakehouse

Combining data lakes and data warehouses, a data lakehouse is a centralized data repository that uses cost-effective data storage, usually in the clo...

Data Lineage

Data lineage refers to the process and tools used to track the origin, movement, characteristics, and transformations of data as it flows through the...

Data Mart

A data mart is a repository of data curated to support the needs of a specific department, line of business, or business function.

Data Mesh

Data mesh, a concept introduced by Zhamak Dehghani, is a decentralized, distributed approach to enterprise data management. It is a holistic c...

Data Modernization

Data modernization is the process of moving data from the legacy systems of a fragmented, siloed infrastructure to an interconnected ecosystem of mode...

Data Observability

Data observability is the set of practices that help organizations understand data health and performance across the enterprise.

Data Pipeline

A data pipeline moves data from raw state to another location by executing a series of processing steps. This allows the data to be used by data consu...

Data Platform

A data platform is a technology stack or single solution for managing enterprise data. This system ingests and prepares data at scale for operational...

Data Preparation

Data preparation is the process that turns raw data from disparate internal and external sources into usable datasets.

Data Privacy

Data privacy comprises the rights of consumers to control when and how organizations may collect and use their personally identifiable information (PI...

Data Products

Data products are curated collections of datasets and business-approved metadata designed to solve specific, targeted questions.

Data Quality

Data quality is the state of the data, reflected in its accuracy, completeness, reliability, relevance, and timeliness.

Data Security

A data security strategy protects digital information from the consequences of human error, unauthorized access, and cyberattacks. These consequences...

Data Sharing

Data sharing gives multiple users or applications simultaneous, consistent, and high-fidelity access to the same datasets.

Data Silos

Data silos are partially or wholly inaccessible data sets that result from a combination of technical and cultural forces. Proprietary databases and l...

Data Sovereignty

Data sovereignty is a legal concept defining jurisdiction over data. Specifically, sovereignty establishes the principle that any data collected or st...

Data Swamp

A data swamp is the inevitable outcome of a company’s misunderstanding of how data lakes work. Without a clear and well-supported big data strategy, l...

Data Transformation

Data transformation is the process of converting and cleaning raw data from one data source to meet the requirements of its new location. Also called...

Data Virtualization

Data virtualization is a solution that creates intermediate layers between data consumers and disparate data source systems. These systems give consum...

Data Warehouse

A data warehouse is a central repository for structured enterprise data. These systems ingest raw data from various data sources through extract, tran...

Data Warehouse Architecture

A data warehouse architecture refers to how data gets loaded from source systems into data warehouses and how it is accessed by data consumers. In the...


Database

A database is a large collection of data organized for rapid search and retrieval by a computer.

Database Management System

A database management system (DBMS) manages a database and enables users to create, read, update, delete, and secure data within it.

Decentralized Data

Decentralized data architectures decouple the operational plane — where and how data is stored — from the analytical plane — how the business uses dat...

Delta Lake

Delta Lake is an open-source data platform architecture that addresses the weaknesses of data warehouses and data lakes in modern big data analytics...

Distributed Data

Distributed data is the practice of leaving data in the systems where it originates while giving business analysts a single point of access.

Extract, Transform, Load (ETL)

ETL pipelines are automated data migration techniques for the ingestion of data from various sources into a target system.
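
The three ETL stages can be sketched with Python's standard library alone. Everything here is illustrative: the `SOURCE_CSV` sample, the `orders` table, and the cleaning rules (trim whitespace, drop rows without an amount, upper-case the country code) are assumptions, not part of any specific ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
SOURCE_CSV = "order_id,amount,country\n1, 19.90 ,us\n2,,de\n3,5.00,DE\n"


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values and drop rows that fail a basic quality rule."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping each stage as its own function makes it easy to swap the source or the target without touching the cleaning logic.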

Fault Tolerance

Fault tolerance is the degree to which failures in a subsystem do not cause the overall system to stop operating. In the context of enterprise analyti...

Hadoop Cluster

Apache Hadoop clusters let companies manage big data processing on commodity hardware. This distributed computing model provided a more cost-effective...

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a scalable, open-source file system designed to run on commodity hardware while managing the large amount...

Hadoop Ecosystem

The Apache Hadoop Ecosystem is a collection of open-source software projects designed to work with Hadoop distributed data processing platforms.

Hybrid Cloud

Hybrid cloud is an architecture that manages storage, networking, and compute resources across different environments. This structure may include on-p...

Hypothesis-Driven Development

Hypothesis-driven development (HDD), also known as hypothesis-driven product development, is an approach used in software development and product mana...

Incident Response

According to the National Institute of Standards and Technology (NIST), incident response is the reaction to violations of computer security policies...

Massively Parallel Processing

Massively parallel processing is an architecture for distributing workloads across hundreds or thousands of separate processors. Although parallel com...


Multi-Cloud

A multi-cloud infrastructure uses cloud services from two or more vendors.

Object Storage

Object storage is an alternative to traditional file systems for storing large amounts of unstructured data in scalable, cost-efficient, and performan...

Online Analytical Processing (OLAP)

Online analytical processing (OLAP) systems are data analysis platforms that centralize large amounts of data from disparate sources.

Open Data Lakehouse

An open data lakehouse is a data analytics architecture that combines a data lake’s cost-effective storage with a data warehouse’s robust analytics.

Open Data Warehouse

An open data warehouse is an open source alternative to monolithic, proprietary applications like Teradata or Snowflake.

Open File Formats

An open file format is a specification for the way data gets written to storage.

Open Table Formats

Open table formats are designed to provide enhanced performance and compliance capabilities for data lakes using cloud-based object storage.


PostgreSQL

PostgreSQL is an open-source relational database management system (RDBMS) with a rich feature set, reliability, and performance that competes with a...


Presto

Presto (formerly PrestoDB) and Trino (formerly PrestoSQL) are both SQL query engines. Both are designed for high-performance SQL...

Query Acceleration

Query acceleration is a set of techniques for minimizing data processing workloads when analyzing a large amount of data.

Query Engine

A query engine takes a request for data, translates it from human to machine language, and then fulfills the request by retrieving specific data.

Reference Data

Reference data categorizes information and defines the ranges of permissible values to ensure consistency in use across business processes and between...

Risk Management

Risk management is the process of identifying, assessing, analyzing, prioritizing, mitigating, controlling, and monitoring potential exposures to busi...

Role-Based Access Control

Role-based access control is a system of fine-grained access privileges granted to authorized users to perform a defined set of tasks.
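
A role-based check can be sketched as two lookups: which roles a user holds, and which actions each role permits. The role names, user names, and the `is_allowed` function below are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical mapping of roles to the actions they permit.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "grant"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "dana": {"analyst"},
    "eli": {"analyst", "engineer"},
}


def is_allowed(user, action):
    """Return True if any role held by the user permits the action."""
    roles = USER_ROLES.get(user, set())
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed("dana", "write"))
print(is_allowed("eli", "write"))
```

Privileges attach to roles rather than to individuals, so changing what an analyst may do is a single edit to `ROLE_PERMISSIONS`.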

Schema Discovery

Schema discovery is a data engineering practice for finding and documenting the structure of data sources within a repository, such as a relational da...

Schema on Read

Schema-on-read approaches only apply a schema when a query accesses a table. Any required transformations happen at runtime.
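
A minimal sketch of the idea, assuming raw JSON lines as the stored data: records are kept exactly as they arrived, and a schema (here a hypothetical `SCHEMA` dictionary of field names and casting functions) is applied only when the data is read.

```python
import json

# Raw records are stored as-is; nothing is validated or converted at write time.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "u2", "amount": "5", "extra": "ignored"}',
]

# The schema lives with the query, not with the storage (hypothetical example).
SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project/cast it to the schema at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
print(rows)
```

A different query could read the same raw lines with a different schema, for example one that includes `ts`, without rewriting the stored data.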

Security Lake

A data lake is a centralized repository for large volumes of raw data from multiple sources that simplifies big data analytics and optimizes data infr...

Semantic Layer

A semantic layer is an interface sitting between data consumers and enterprise data sources, abstracting the underlying data architecture.

Single Source of Truth (SSOT)

A single source of truth (SSOT) is a centralized location of master data for an organization’s decision-making processes. Theoretically, a data wareho...


SQL

SQL stands for Structured Query Language. SQL is a powerful language that plays a vital role in managing and analyzing data in relational databases, m...

Star Schema

In the context of data warehousing, the star schema is a popular architecture for organizing data. It is characterized by a central fact table that is...
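
The layout can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`) and sample rows are hypothetical; the point is only that a central fact table holds measures and foreign keys, and queries join it to the surrounding dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
    """
)


def revenue_by_category_and_year(conn):
    """Aggregate the fact table, labelling results through the dimensions."""
    return conn.execute(
        """
        SELECT p.category, d.year, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
        """
    ).fetchall()


print(revenue_by_category_and_year(conn))
```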


Starburst

Starburst is the data company, not the candy company. Our data lakehouse platform combines the best of data lakes, data warehouses and data virtualiza...

Streaming Data

Streaming data is the continuous dataflow generated by transactional systems, activity logs, Internet of Things (IoT) devices, and other real-time dat...


Trino

Trino is an open source distributed SQL query engine built in Java, designed to run fast analytic queries against various data sources ranging in size...

Unstructured Data

Unstructured data does not conform to any preset schema or format. Traditionally, unstructured data was rare, but this has changed due to the rise of...
