How to choose the right solution for your big data analytics engine
When optimizing your analytics database performance, one of the most important decisions is to choose how data is stored and accessed. There are two approaches that are optimized for efficient, high performance analytics: micro-partitioning and indexing.
Micro-partitioning uses the values of a predefined (small) subset of columns to divide a table into blocks of tens to hundreds of megabytes in size each.
In contrast, indexing expands beyond this predefined subset and separates out how data is stored (typically in similar blocks to the micro-partition architecture) with how selected column values are accessed, allowing users to optimize access to any subset of columns.
The system that you choose has a significant impact on cost and performance.
Accelerating queries with micro-partitioning
Micro-partitioning is an approach for creating relatively small blocks of data, usually a few tens or hundreds of MB. Those blocks are organized around some partitioning or clustering key. Micro-partitioning based systems enable the query engine to only retrieve the data blocks that are required to answer a query based on the query predicate for the partitioning key, and to do so in parallel.
While micro-partitioning is more efficient and faster than full table scans, this approach still requires reading extensive amounts of data from disk, even when compressed. Micro-partitioning doesn’t help with very large join-based predicates, where tables are restricted based on the output of another table that itself is large, unless the predicate happens to align with the partitioned column.
When you have few columns and few or no joins, micro-partitions can be a simple, hassle-free approach to structuring your data.
A fresh look at big data indexing
Indexing is the tried and true method of creating separate files that can be used to quickly identify where to retrieve records from the data set. The query engine uses indexes and the query predicate to reduce the amount of data retrieved before reading all of the columns for each record.
In some cases, indexes can be used to answer the entire query, a technique known as covered indexes. Indexes are more efficient at pruning data based on joins, for example when looking for transactions from different people at the same organization. In this case, the query engine may need to join two or three tables — indexes can make this join more efficient.
Though indexes can be highly performant and cost efficient at query time, they typically require careful design considerations and most implementations are not optimized for loading data quickly.
Furthermore, traditional indexes are most optimized for row-based data layouts and not for columnar layouts that are typically used with big data. With columnar data, you can’t index every column without exploding your storage and grinding your load times to halt.
The key to big data indexing solutions is to have a dynamic, intelligent indexing system.
Taking indexing to the extreme, with nanoblocks
Starburst Smart Indexing and Caching uses an indexing mechanism that is uniquely optimized for high performance analytics called nanoblock indexing. Instead of storing one large index for each column that the user selects, Smart Indexing and Caching dynamically creates millions of nanoblocks – a few dozen kilobyte sized sub sections of the indexed column.
Each nanoblock only stores an index for a subset of the data, and each nanoblock index is independent and uniquely adapted to that nanoblock. By taking advantage of modern storage systems, such as SSDs, nanoblocks are fast to load, update, and systems like Smart Indexing and Caching can quickly read and execute queries against nanoblock indexes without the overhead inherent in either traditional indexes or micro-partitions. Smart Indexing and Caching is also able to dynamically create, modify, and remove both indexes and table columns without the overhead of rewriting table data.
Combining columnar storage and smart indexing delivers performance
In both micro-partitions and nanoblock index architectures, data is usually stored in a columnar format and can be processed efficiently by the query engine. How much data needs to be read and filtered can make a huge difference in both query performance and cost efficiency.
While micro-partitions exhibit faster load and grouping performance over traditional indexes, Smart Indexing and Caching nanoblock index implementation combined with columnar storage delivers both fast loads and faster queries than micro-partitions.
Nanoblock indexes enable the query engine to read just the columns required to compute query predicates, and in some cases can answer queries entirely using the indexes. With nanoblock indexes, users are free to add and remove indexes and table columns dynamically.
Micro-partitions must be optimized against a single partition key and contain all the columns, dramatically reducing both query speed and flexibility compared to using indexes. As the system is used and data keeps changing, micro-partitions may require heavy background maintenance such as defragmentation and rebalancing. The independent and columnar nature of nanoblock-based indexes removes the need for background heavy data reordering operations.
Example: analyzing ride sharing data
As an example, consider an analytics system for analyzing data from a ride sharing application. Users might need to track a couple dozen data points about each ride, as well as information about drivers and riders. Any queries that include constraints on drivers, riders, start locations, and end locations, will benefit from using one or more indexes. With a large number of columns, any query that returns or aggregates a subset of the columns outperforms micro-partitioning by better managing I/O, and reducing query cost. On average, testing shows a 40x increase in performance for these types of queries. In contrast, queries that can be optimized in advance, such as constraints or aggregates by date where data is micro-partitioned by data, ignoring indexes will perform better, often 4-10x.
Future proof your data platform
The choice of a micro-partitioning or indexing based analytics engine can impact individual query performance and overall cost efficiency. In particular, as user load increases, the impact of hundreds or thousands of queries a day can add up to significant I/O cost.
Micro-partitions often result in larger overall I/O depending on the selectivity of your queries and the number of columns in your data set. While micro-partitions work well for skinny tables, the more columns, either in your main tables or in the tables you join, the higher the cost of using micro-partitions.
To decide between a micro-partition based solution or a nanoblock indexing-based query engine, take a look at your query logs and identify the number of joins and complexity of predicates.
While popular implementations often use micro-partitions, it’s worth taking another look from a performance perspective at Starburst Smart Indexing and Caching, which uses nanoblock indexing. Starburst has taken the approach of building intelligent indexing directly into the query engine and integrated it with the query optimizer and query statistics.
Rather than requiring ongoing design, Smart Indexing and Caching creates and maintains indexes based on how users query your data, delivering critical observability capabilities. It also responds dynamically to changes in data and to workload prioritization set by administrators and gives you the performance and cost efficiency of indexes with the ease of use of micro-partitions.
With Starburst Smart Indexing and Caching, you don’t need to skip on wide columns or complex joins just to get the query times down and costs within budget.
Smart Indexing and Caching
Patented autonomous indexing technology that accelerates queries