A data pipeline moves data from one location to another by executing a series of processing steps. Data pipelines are designed to efficiently move data from the point at which it’s collected to places where it can be analyzed, stored, or otherwise processed.
Modern IT environments rely on many different data pipelines, often overlapping and interconnecting, to facilitate their big data initiatives and extract value from data while meeting compliance, security, and accessibility requirements.
Any data pipeline has three core components: a source, one or more processing steps, and a destination.
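To make those three components concrete, here is a minimal sketch in Python. The file names, fields, and functions are hypothetical placeholders for illustration, not part of any particular product.

```python
import csv

# Minimal sketch of a pipeline: a source, one processing step, and a
# destination. All names and paths here are hypothetical placeholders.

def extract(path):
    """Source: read raw records from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Processing step: normalize a field and drop incomplete rows."""
    return [
        {**r, "email": r["email"].strip().lower()}
        for r in records
        if r.get("email")
    ]

def load(records, path):
    """Destination: write the cleaned records to a new CSV file."""
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

load(transform(extract("raw_signups.csv")), "clean_signups.csv")
```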
Data can move through a data pipeline in batches, processed at timed intervals or once enough data has accumulated; batching is more cost-effective but means the data at the destination isn't always up to date. Streaming data, like the feeds Uber uses to track drivers' exact locations, costs more but provides real-time visibility when it's needed. Choosing between batch and streaming transmission is one of many important considerations when constructing a data pipeline.
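As a rough illustration of that trade-off, the toy sketch below contrasts buffering events for a periodic batch run with handling each event the moment it arrives. The function names and batch size are assumptions; a production pipeline would rely on a scheduler or a streaming platform rather than a single function.

```python
from collections import deque

# Toy contrast between batch and streaming handling (illustrative only).

buffer = deque()

def handle_batch(events):
    """Batch: process everything accumulated since the last run."""
    print(f"processing {len(events)} events in one batch")

def handle_stream(event):
    """Streaming: process each event as it arrives."""
    print(f"processing event immediately: {event}")

def on_event(event, mode="batch", batch_size=100):
    if mode == "stream":
        handle_stream(event)        # real-time visibility, higher cost
    else:
        buffer.append(event)        # cheaper, but the destination lags
        if len(buffer) >= batch_size:
            handle_batch(list(buffer))
            buffer.clear()
```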
Data pipelines can move data into a data lake or warehouse or move it out of these repositories into operational systems for specialized analysis. The source and destination can also be the same (a loop) if the data pipeline simply serves to modify the data. Anytime data moves between systems for any reason (or returns to the same system in a different form), it travels through a data pipeline.
Organizations that rely on data pipelines to collect, move, and integrate data enjoy a number of benefits that apply to IT, executive leadership, and every decision that gets made.
The alternative to data pipelines is having staff run ad-hoc queries for data, which can be a time- and labor-intensive process. Data pipelines improve the completeness, quality, and integrity of data sets while lowering risks associated with noncompliance or bad information. Repeatability and automation become especially important as the volume, speed, and diversity of data all increase.
With data pipelines to efficiently shuttle around information, decision makers have more complete, current, and accurate information at their disposal and make the “right” choice more often. Having trustworthy, abundant data to back up decision-making benefits an organization in endless ways.
Creating, maintaining, modifying, and repurposing data pipelines can all pose various challenges that create risks (sometimes significant) if not addressed.
Having multiple types of data moving through dense webs of data pipelines to reach one location can easily become an inefficient exercise that slows down the arrival of data or compromises the integrity of what arrives.
Data routinely defies expectations, and when it does, it can result in the wrong data being stored in the wrong location, leading to compliance violations or security issues. Inconsistent data can also cause data pipelines to break down. Data pipelines require constant observation, analysis, and adjustment to work as efficiently as possible.
End users (often independent data consumers) may try to create their own data pipelines that are redundant, noncompliant, and likely less effective than official ones. They may also alter data or existing data pipelines to fit their objectives without getting approval or documenting the change.
Furthermore, since data pipelines make data more accessible, they can inadvertently create compliance and security problems by giving the wrong users access to sensitive or private data. Data pipeline privileges must be carefully managed.
More data coming from more sources leads to an ever-expanding number of data pipelines that get harder to manage and document with time. New data pipelines can compromise old ones until the entire data strategy suffers.
Turning a data model into a sustainable series of data pipelines may be harder than anticipated and require tools, skills, or staff not currently in the organization. The disconnect between data “creators” and data “consumers” can further exacerbate the issue.
The automated nature of data pipelines does not eliminate the need for oversight and maintenance. Following the best practices outlined below helps prevent common problems:
Data coming out of a data pipeline at high volume or velocity can put stress on the destination that compromises performance and, in extreme cases, causes an overload. Be aware of how much data is going into the pipeline and how often, but pay closer attention to what’s coming out and the effect it’s having.
Also important to monitor is the quality of the data coming out of the pipeline. Are there duplicates, missing segments, or alterations between the source and the destination? Tracking data quality helps highlight when there are problems with data pipelines and prevents data integrity issues from undermining the decisions based on that data.
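One lightweight way to run such checks is to compare what comes out of the pipeline against what went in. The sketch below is illustrative only; the thresholds, field names, and the idea of feeding results into alerting are assumptions rather than a standard API.

```python
# Illustrative post-pipeline quality checks. Thresholds and field names
# are assumptions for the sake of the example.

def check_output(source_count, records, required_fields=("id", "timestamp")):
    issues = []

    # Volume: did roughly as many records arrive as were sent?
    if len(records) < 0.99 * source_count:
        issues.append(f"missing rows: {source_count - len(records)}")

    # Duplicates: the same record delivered more than once.
    ids = [r["id"] for r in records if "id" in r]
    if len(ids) != len(set(ids)):
        issues.append("duplicate ids detected")

    # Completeness: required fields present on every record.
    for field in required_fields:
        missing = sum(1 for r in records if not r.get(field))
        if missing:
            issues.append(f"{missing} records missing '{field}'")

    return issues  # feed these into monitoring or alerting
```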
Data can become degraded in various ways as it is translated between the different languages and systems (SQL, Python, etc.) that make up a data pipeline. Keeping pipeline code under version control calls attention to any changes and makes it simple to revert to an earlier version when necessary.
Data pipeline is an umbrella term. One specific example of data pipelines is extract, transform, and load (ETL) pipelines, which extract data from a source, transform it somehow, and load it into a separate system for processing. Though that sounds similar to a data pipeline, a data pipeline doesn’t necessarily transform the data – it may simply transmit it from source to destination. Every ETL pipeline is also a data pipeline, but not every data pipeline is an ETL pipeline.
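The distinction can be summed up in a few lines of Python; both functions below are hypothetical sketches rather than any specific tool's API.

```python
# Hypothetical sketch: every ETL pipeline is a data pipeline, but a data
# pipeline may simply move records without transforming them.

def data_pipeline(read_source, write_destination):
    """A plain data pipeline may just move records unchanged."""
    for record in read_source():
        write_destination(record)

def etl_pipeline(read_source, transform, write_destination):
    """An ETL pipeline always transforms data between extract and load."""
    for record in read_source():
        write_destination(transform(record))
```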
Starburst raises the bar for data pipelines and solves the problems associated with past ETL pipelines. See what a better way looks like. Explore ETL pipelines from Starburst.
https://www.starburst.io/resources/introduction-to-starburst-trino/