2024 is well underway and New Year’s resolutions are in full swing. What better way to celebrate what’s ahead than reflecting on the many product innovations, partnerships, and data memories created in 2023! Welcome to Galaxy Wrapped, the 2023 edition.
From AI features to smart indexing and caching, to cross-cloud analytics and enhanced data observability, 2023 marked a significant leap forward for Galaxy. Let’s kick off by taking a closer look at some of the major product highlights from the last year.
We built. A lot.
Universal search across data sources
Universal search allows you to quickly locate relevant data sets in and around your lake using advanced search capabilities. Users can search for the names of catalogs, schemas, tables, views, columns, data products, tags, and more.
Automatic schema discovery
Schema discovery assists teams in locating and documenting new files in a data lake, along with any additional tables and views that have been added since the last discovery. It examines a root object within an object storage location and then provides the structure of any tables identified during the analysis.
AI-powered data classification
2023 was all about AI and we are excited to continue that momentum in 2024. We introduced AI-powered data classification to automatically classify data in catalogs, schemas, tables, and views via data classifier jobs. Users can run these classifier jobs on demand or schedule them against an attached cluster.
Attribute-based access controls
Attribute-based access controls empower users to seamlessly combine policies and attributes, including tags. This dynamic approach enhances the management of role access to various entities, such as catalogs, schemas, tables, views, and columns.
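To make the idea concrete, here is a minimal sketch of how a tag-based policy check could work. It is an illustration only, not Galaxy's policy engine: the `Entity`, `Policy`, and `is_allowed` names are hypothetical, and the "deny overrides allow" rule is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A catalog, schema, table, view, or column, plus its tags."""
    name: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: applies when the entity carries all required tags."""
    role: str
    privilege: str            # e.g. "SELECT"
    required_tags: frozenset  # attributes the entity must have
    effect: str = "allow"     # "allow" or "deny"


def is_allowed(role, privilege, entity, policies):
    """Return True if any matching policy allows and none denies.

    Assumption of this sketch: deny wins over allow, and no matching
    policy means no access.
    """
    matches = [
        p for p in policies
        if p.role == role
        and p.privilege == privilege
        and p.required_tags <= entity.tags
    ]
    if any(p.effect == "deny" for p in matches):
        return False
    return any(p.effect == "allow" for p in matches)


# Example data: analysts may read anything tagged "public" unless it is also "pii".
policies = [
    Policy("analyst", "SELECT", frozenset({"public"})),
    Policy("analyst", "SELECT", frozenset({"pii"}), effect="deny"),
]
orders = Entity("sales.orders", frozenset({"public"}))
customers = Entity("sales.customers", frozenset({"public", "pii"}))
```

With this data, `is_allowed("analyst", "SELECT", orders, policies)` is true, while the same check against `customers` is false because the `pii` tag triggers the deny policy. Tagging a new table is enough to bring it under the existing policies, which is the appeal of the attribute-based approach.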
Data products offer curated, high-quality datasets including relevant metadata, enhancing data discoverability. With the data products details view, users can provide detailed business context, including SQL statements and supporting images or links. These data products serve as a semantic layer, mapping complex data into familiar business terms, enabling easy access and high visibility for data consumers within a single pane.
We further added support for external data sharing, enabling organizations to share data directly from the source with third parties, eliminating the need for data copying.
Data observability, which spans data lineage as well as data profiling and quality, gives organizations enhanced visibility into their data. As data moves through Galaxy, users can now gain insights into data lake table statistics, collaborate with data consumers to establish quality rules, and swiftly identify where quality issues originate and what they affect downstream.
Fault-tolerant execution (FTE)
Fault-tolerant execution (FTE) allows a cluster to retry specific queries or parts of query processing in case of failures, eliminating the need to restart the entire query. This feature is particularly valuable for long-running queries, commonly associated with batch processing and extract, transform, load (ETL) workloads.
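The benefit is easiest to see in a toy model. The sketch below is a simplified simulation, not how Trino or Galaxy implements FTE: it treats a query as a list of hypothetical tasks and compares retrying only the failed task against restarting the whole query. All names (`RunStats`, `flaky`, `make_query`, and the two runner functions) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    succeeded: bool
    task_executions: int  # total task runs, including retries


def flaky(failures):
    """Build a task that fails `failures` times, then succeeds."""
    remaining = [failures]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated worker failure")

    return task


def run_with_task_retries(tasks, max_attempts=3):
    """Retry only the task that failed; completed tasks are never rerun."""
    executions = 0
    for task in tasks:
        for _ in range(max_attempts):
            executions += 1
            try:
                task()
                break  # task finished, move on to the next one
            except RuntimeError:
                continue  # retry this task only
        else:
            return RunStats(False, executions)  # task exhausted its attempts
    return RunStats(True, executions)


def run_with_full_restart(tasks, max_attempts=3):
    """Restart the whole query from the first task after any failure."""
    executions = 0
    for _ in range(max_attempts):
        try:
            for task in tasks:
                executions += 1
                task()
            return RunStats(True, executions)
        except RuntimeError:
            continue  # throw away all progress and start over
    return RunStats(False, executions)


def make_query():
    """Five tasks; the fourth fails once before succeeding."""
    return [flaky(0), flaky(0), flaky(0), flaky(1), flaky(0)]


print(run_with_task_retries(make_query()))  # 6 task executions
print(run_with_full_restart(make_query()))  # 9 task executions
```

In this model a single failure costs one extra task execution with task-level retries, but forces the four tasks already attempted to run again under a full restart. The gap widens as queries get longer, which is why the feature matters most for batch and ETL workloads.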
Teams managing multi-cloud architectures often face challenges when handling one-off requests to merge two distinct datasets for a specific point-in-time analysis. This means data engineers invest valuable time constructing one-time-use ETL pipelines solely to copy data from one cloud to another. With cross-cloud support in Galaxy, you can discover, access, and manage diverse data sources, irrespective of the cloud platform.
Python DataFrames
We introduced two new libraries to Galaxy in 2023, PyStarburst and Ibis. Users no longer need to stand up two separate engines for their analytical and transformation workloads to get the performance and flexibility they need. Instead, they can leverage Starburst, powered by Trino, as their single source of compute, reducing cost, complexity, and time.
Warp Speed is a smart indexing and caching solution that autonomously accelerates your interactive workloads. Coupled with our best-in-class query engine, we are raising the bar in data lake analytics. Organizations can accelerate the discovery and extraction of insights from their data, achieving up to a 7x improvement in query performance and reducing cloud compute costs by up to 40%.
Streaming ingest is a first-of-its-kind solution. Organizations can now stream data into their lake in real time and hydrate their lake with ease. This feature is in our private preview program. If you’re interested in becoming one of our early testers and want to help us shape our streaming roadmap, apply here.
Automated data lake optimization
Users can schedule routine data maintenance operations on lake tables with automated data optimization, consisting of four main operations: data compaction, profiling and statistics, vacuuming, and data retention.
We delivered transformative experiences across industries
Halliburton partnered with Starburst to transform their data swamp into a source of real-time insights. Through the creation of data products with Starburst and the integration of Generative AI, they achieved instantaneous data access, eliminated fragile data movement pipelines, and empowered swift data-driven decision-making. This transformation cut the time needed to answer questions from 2-3 weeks to immediate responses.
7bridges, a leading AI data-driven supply chain management platform, selected Starburst Galaxy for its lakehouse architecture, addressing challenges posed by individual databases and diverse data sources. The deployment significantly enhanced data accessibility, query execution speed, and decision-making processes, resulting in expedited development cycles, heightened client satisfaction, cost savings, and increased efficiency. Galaxy’s flexibility also enabled non-technical users to interact with data effectively, contributing to 7bridges’ position as an industry leader in supply chain management with cutting-edge solutions.
Vectra is leading the charge for AI-powered cybersecurity. Previously, Vectra faced costly challenges in querying log data with solutions like Elasticsearch and EMR. These limitations restricted users to a narrow window of log data and ultimately impeded platform growth. By adopting Starburst Galaxy as an embedded query engine, Vectra extended query capabilities, unlocking new use cases and offloading Trino management. This transformative step enabled Vectra to enter 9+ new markets, facilitating the expansion of their real-time, AI-driven threat detection across 1,100+ enterprises.
There’s more to these stories than can fit in one blog post, so if you want to read more on how we’re helping customers unlock their potential and cut costs, check it out here.
We came, we saw, we conferenced
What an exhilarating year it was for events! Looking back, our larger gatherings were truly remarkable. Engaging with individuals worldwide who are pushing the boundaries of modern-day analytics was nothing short of thrilling. From the virtual excitement of Datanova in February to an impactful re:Invent, these events have showcased the growing industry enthusiasm for Starburst. Trino Fest and Trino Summit, held virtually in June and December, continued the momentum, highlighting Trino’s ongoing role in powering big data management. With Gartner’s participation adding an extra layer of success, our events collectively attracted substantial participation, fostering connections and fruitful interactions. The energy and excitement around our events reflect the focus of the industry, highlighting that open data lake platforms like Starburst are shaping the future of big data.
Speaking of events, we invite you to join us for Datanova, live in New York City April 10-11, 2024! We are so excited to be hosting Datanova in person for the first time ever this year at Data Universe. Come join us to learn more about AI/ML, data management, query processing and optimization, and so much more.
What’s in store for 2024
At Starburst, we are hyper-focused on freeing our customers to see the invisible and achieve the impossible. Already this year, the Galaxy product team is making significant strides across several key focus areas.
We are dedicated to exploring new and responsible approaches to enhance user journeys with AI, streamline data lake management and hydration through improved streaming ingest, and further empower real-time analytics – all while expanding the reach of Starburst Gravity and our data sharing ecosystem.