
Apache Spark

Apache Spark is an analytics engine built for processing massive datasets. Spark’s ability to process vast quantities of data within Apache’s big data ecosystem makes it particularly useful for large, complex ETL data pipelines.

With reliability and fault tolerance as the emphasis, Spark is best equipped to handle data transformation workloads: streaming and batch processing of large amounts of data with ETL queries. Spark was developed in the early 2010s at the University of California, Berkeley’s Algorithms, Machines and People Lab (AMPLab) to achieve big data processing performance beyond what could be attained with the Apache Software Foundation’s Hadoop distributed computing platform. Thanks to its origins in academia, Spark also includes optimizations for machine learning and other data science workflows. It handles distributed operations and SQL query execution, with a particular focus on implementing machine learning algorithms.

Spark vs Trino

What’s the difference between Spark and Trino? We take a closer look below.

Trino: MPP query engine

Trino is a massively parallel distributed query engine that federates multiple enterprise data sources to create an accessible, unified resource for interactive data analysis and high-performance analytics. The open-source project’s heritage traces back to Presto, an effort to improve query performance on the massive Hadoop data warehouse within Meta. Trino is a Presto fork announced in 2019 and aggressively developed to become an analytics engine for modern data lakehouses.

Performance and cost optimizations yield efficient, performant, and cost-effective queries on exabyte-scale data lakes. Trino’s key use case is interactive and ad hoc analytics, and it was built from the ground up to enable its users to access their data in seconds, not hours. Trino also features a fault-tolerant execution mode which trades off some of its performance for reliability, allowing it to perform similarly to Spark for ETL and ELT queries.

In addition to performance and accessibility, Trino addresses a significant challenge in big data analytics: insights depend on data not stored in a data warehouse. Connectors for dozens of data sources let Trino create a virtualized access layer that unifies an enterprise’s data architecture, enabling users to use one query to access data scattered across the company. Trino eliminates the silos and swamps that stand in the way of analysis and insight generation.
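
As a rough illustration of that federation, here is a minimal sketch using the open-source trino Python client; the coordinator hostname, catalogs, schemas, and table names are hypothetical placeholders, not a definitive setup.

```python
# Minimal sketch of a federated Trino query from Python.
# All connection details and table names below are illustrative only.
from trino.dbapi import connect

conn = connect(
    host="trino.example.com",  # hypothetical Trino coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One query joins a Hive-connector table in the data lake with a
# PostgreSQL-connector table, without first copying data into a warehouse.
cur.execute("""
    SELECT c.region, count(*) AS page_views
    FROM hive.web.page_views AS v
    JOIN postgresql.crm.customers AS c
      ON v.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY page_views DESC
""")
for row in cur.fetchall():
    print(row)
```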

Spark

Spark consists of four primary elements:

  • Spark Core — Manages scheduling, transformation, and optimization to provide the foundation for Spark’s other elements.
  • Spark SQL — Uses ANSI-standard SQL to query data through Spark’s DataFrame and Dataset abstractions.
  • Machine Learning library (MLlib) — Provides a library of processing techniques essential to machine learning projects.
  • Structured Streaming — Allows near-real-time processing of data streams.

Spark shines when utilizing its resilient distributed datasets to process or train on large amounts of data. It can handle data analytics workloads, though it is not optimized for pure analytics and is not perfectly suited for interactive or ad hoc analysis of your data.
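
To show how Spark Core, DataFrames, and Spark SQL fit together, here is a minimal PySpark sketch; the data, view name, and column names are illustrative only.

```python
# Minimal sketch: build a DataFrame and query it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

# In practice this DataFrame would be read from distributed storage,
# e.g. spark.read.parquet("s3://bucket/path"); a tiny in-memory one suffices here.
orders = spark.createDataFrame(
    [("widget", 3, 9.99), ("gadget", 1, 24.50), ("widget", 2, 9.99)],
    ["product", "quantity", "unit_price"],
)

# Register the DataFrame as a temporary view and query it with SQL.
orders.createOrReplaceTempView("orders")
revenue = spark.sql("""
    SELECT product, sum(quantity * unit_price) AS revenue
    FROM orders
    GROUP BY product
""")
revenue.show()
```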

Spark also isn’t a standalone platform; it requires cluster management frameworks like Kubernetes or Apache Mesos and a distributed storage system like Amazon S3, Cassandra, or Hadoop.

Spark and Trino: Key similarities and differences

Trino and Spark both make analytics more accessible by using ANSI-standard SQL, allowing engineers, analysts, and data scientists to access data with queries that work on a variety of other engines. Both are built to run at massive scale, handling huge amounts of data.

Spark excels at reliable processing and transformations of data, particularly when used with machine learning. Trino excels at fast SQL analytics on a huge variety of data sources. Many data stacks include both Spark and Trino, and this allows teams to use the engine appropriate for their specific use case. 

However, even ANSI-standard SQL dialects have some differences that can make maintaining two analytics engines a headache, as Spark SQL and Trino SQL are not fully compatible. With fault-tolerant execution released for Trino in 2022, data stacks that use Trino in conjunction with Spark for its resilience and ETL reliability may be better off utilizing the Trino Gateway to run multiple Trino clusters for different needs. This can greatly simplify the stack and ensure all queries are 100% interoperable.

How does Apache Spark handle fault tolerance?

Spark’s resilient distributed datasets (RDDs) provide the framework’s fault-tolerant capabilities. RDDs are immutable, partitioned collections of elements created by parallelizing data in a driver program or by reading from a stored data file. Transformations on RDDs are lazy: Spark records the lineage of operations that produced each RDD rather than executing them immediately. If a partition is lost to a node failure, Spark replays that lineage to reconstruct the missing data.
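
A minimal PySpark sketch of that laziness and lineage; the data and partition count are illustrative only.

```python
# Minimal sketch of RDD laziness and lineage, which underpin fault tolerance.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000), numSlices=8)

# These transformations are lazy: nothing executes yet; Spark only records
# the lineage graph (parallelize -> map -> filter).
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The lineage Spark would replay to rebuild a lost partition.
print(evens.toDebugString().decode())

# Only this action triggers actual computation across the cluster.
print(evens.count())
```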

What are the use cases of Apache Spark with Delta Lake?

Delta Lake is a table format, built on the open Apache Parquet file format, that improves the performance of data lakes. Delta Lake’s features include:

  • ACID transactions
  • Scalable metadata
  • Schema enforcement
  • Schema evolution
  • Time travel

Originally the proprietary storage framework for Databricks’ analytics platform, Delta Lake is now an open-source project managed under the Linux Foundation. Spark also has close associations with Databricks, so the two frameworks often go together.

However, Spark is one of many analytics engines companies can use with their Delta Lake-based distributed repositories. Delta Lake is supported by several alternatives, including Trino.
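
As a rough sketch of the pairing, here is how a Spark session might write a Delta table and read it back with time travel, assuming the delta-spark package is installed and configured on the session; the local path and sample data are illustrative only.

```python
# Minimal sketch of Spark writing and time-travel reading a Delta Lake table.
# Assumes the delta-spark (Delta Lake) package is on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

events = spark.createDataFrame(
    [(1, "signup"), (2, "login")], ["user_id", "event"]
)

# ACID write: a Delta table is Parquet data files plus a transaction log.
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()
```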

Hadoop vs Spark: How is Apache Spark different from Hadoop?

Apache Hadoop is a distributed computing framework developed in the early 2000s and comprising four components:

  • Hadoop Distributed File System (HDFS) — A Java-based file system that distributes large datasets across multiple machines, using multiple copies of each block to provide fault tolerance.
  • Hadoop YARN — A resource management and scheduling platform.
  • Hadoop MapReduce — An engine for large-scale parallel data processing.
  • Hadoop Common — Utilities shared across Hadoop’s other modules.

Spark was developed ten years later in response to Hadoop’s limitations. The framework’s RDDs keep data in memory, avoiding MapReduce’s disk-heavy approach of writing intermediate results between processing stages. However, Spark does not include a file system of its own, so it often runs on top of HDFS implementations.

Related reading: Hadoop migration

Kafka vs Spark

Apache Kafka is a distributed event streaming platform designed to manage the distribution, integration, and analysis of real-time data streams. Spark was originally a batch-oriented processing system. The release of Spark Structured Streaming gave Spark streaming data processing capabilities, albeit with higher latency than Kafka.

The two frameworks are complementary. Kafka’s fast stream processing optimizations make it an appropriate choice for managing high throughput event data streams. Spark provides a more complete set of transformational tools for data pipelines and machine learning projects.
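
A minimal Structured Streaming sketch of that pairing, assuming the spark-sql-kafka connector package is available on the cluster; the broker address and topic name are illustrative placeholders.

```python
# Minimal sketch: Structured Streaming consumes a Kafka topic in micro-batches.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.com:9092")  # hypothetical broker
    .option("subscribe", "clickstream")  # hypothetical topic
    .load()
)

# Kafka records arrive as binary key/value pairs; decode the key and count per key.
counts = (
    clicks.select(col("key").cast("string").alias("page"))
    .groupBy("page")
    .count()
)

# Write the running aggregation to the console for demonstration purposes.
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```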

Related reading: Streaming data into Iceberg tables, Kafka to Iceberg

Databricks and Spark

Databricks is a commercial developer of big data analytics platforms built upon Spark. The company’s founders include many of Spark’s original developers who still contribute to the open-source project. Delta Lake was the proprietary table format for Databricks platforms but was eventually released as an open-source project.

PySpark vs Apache Spark

Spark is written in Scala, a programming language developed to address issues in Java. However, a steep learning curve and relatively low adoption make Scala developers a rare commodity and add friction to Spark-based analytics. Python adoption is significantly higher by comparison — especially in the academic research community where data scientists are born.

PySpark is an API for integrating Spark’s large-scale data processing in Python code, including support for Spark’s DataFrames, MLlib, Spark Core, Spark SQL, and Structured Streaming.
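
As a rough sketch of the PySpark API in practice, here is a tiny pipeline using the DataFrame-based MLlib interface; the toy data and feature names are illustrative only.

```python
# Minimal sketch of PySpark's DataFrame-based MLlib API: a logistic regression pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pyspark-mllib-sketch").getOrCreate()

training = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.3), (1.0, 2.8, 1.9)],
    ["label", "feature_a", "feature_b"],
)

# Assemble raw columns into the single vector column MLlib estimators expect.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b"], outputCol="features"
)
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(training)

model.transform(training).select("label", "prediction").show()
```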

Starburst Galaxy supports two Python DataFrame libraries, now in public preview, to give engineers more flexibility when building complex data pipelines. PyStarburst makes it easier to migrate PySpark workloads to Starburst, while Ibis provides a uniform Python API for writing portable code to query in Starburst.
