4 Key Things You Should Know About Indexing

Strategy | September 22, 2022

Roman Vainbrand
Director, Cache Strategy
Starburst

Data indexing can significantly shorten query run times and improve concurrency, reducing the need for massive compute resources. But before you expect indexing to solve every problem, here are four things you need to know for it to make the desired impact:

1. Indexing is useless if your queries need to perform a full scan

To benefit from indexing, you first need to have a deep understanding of your business.

For example, take a telephone book. If you know a person’s family name, first name, and the area they live in, you won’t have to scan the entire list of entries in the phone book. However, if you don’t know the family name and only know, say, the address, the way the phone book is indexed is useless for your search. You will have to scan the phone book line by line.

The same is true of indexing databases. If a table is indexed in a specific way and you run queries that do not match the indexed columns, the engine will have to scan the rows one by one.

For indexing to deliver its full benefit, the indexes must cover your query patterns and business needs. Everything needs to be indexed according to the questions you need answered from your database. This eliminates the need for full scans, which consume CPU resources and incur costs.
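The phone book analogy can be reproduced with a small, hedged sketch. It uses SQLite only because it ships with Python; the table, column, and index names are made up for illustration, and the exact wording of the plan text varies between SQLite versions. The point is the difference between a search that uses the index and a scan that cannot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phonebook ("
    "family_name TEXT, first_name TEXT, area TEXT, address TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO phonebook VALUES (?, ?, ?, ?, ?)",
    [
        (f"Family{i % 100}", f"First{i}", f"Area{i % 10}", f"{i} Main St", f"555-{i:04d}")
        for i in range(1000)
    ],
)

# The "phone book order": indexed by family name, then first name, then area.
conn.execute(
    "CREATE INDEX idx_phonebook_name ON phonebook (family_name, first_name, area)"
)


def plan(sql):
    """Return the human-readable query plan SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Matches the index: the engine can search instead of reading every row.
by_name = plan(
    "SELECT * FROM phonebook WHERE family_name = 'Family7' AND first_name = 'First7'"
)

# Does not match the index: the engine falls back to scanning the table.
by_address = plan("SELECT * FROM phonebook WHERE address = '7 Main St'")

print(by_name)
print(by_address)
```

The first plan mentions the index, while the second reports a scan of the whole table, which is the "line by line" case described above.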

2. The way you write your SQL query matters…a lot

Writing a query inefficiently can quickly turn a good query into a slow one.

When writing SQL queries, two things can go wrong. First, choosing the wrong join strategy (partitioned or replicated) can lead to poor performance. The second, which is easier to act upon, is the order in which you list your tables in the join.

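So, you need to start with the big table and then join the smaller ones to it. In a hash join, the smaller table should be the “build side”, the input that is loaded into memory, while the big table is streamed against it as the “probe side”. This is the most efficient order, and doing the opposite can be catastrophic.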
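Why the order matters can be seen in a minimal sketch of a hash join. In a hash join, one input is loaded into an in-memory hash table and the other is streamed against it, so the in-memory input should be the smaller one. The function and the sample tables below are hypothetical illustrations, not the engine's actual implementation.

```python
def hash_join(build_rows, probe_rows, key):
    """Join two lists of dicts on `key`.

    Returns the joined rows and the number of rows that had to be
    held in memory for the hash table.
    """
    # Build phase: the whole build input is materialized in memory.
    hash_table = {}
    for row in build_rows:
        hash_table.setdefault(row[key], []).append(row)

    # Probe phase: the other input is streamed row by row.
    joined = []
    for probe in probe_rows:
        for match in hash_table.get(probe[key], []):
            joined.append({**match, **probe})

    rows_in_memory = sum(len(bucket) for bucket in hash_table.values())
    return joined, rows_in_memory


# Hypothetical data: a large fact table and a small dimension table.
orders = [{"customer_id": i % 50, "order_id": i} for i in range(10_000)]
customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(50)]

# Small table in memory, big table streamed.
good_rows, good_mem = hash_join(build_rows=customers, probe_rows=orders, key="customer_id")

# Big table in memory, small table streamed.
bad_rows, bad_mem = hash_join(build_rows=orders, probe_rows=customers, key="customer_id")

print(len(good_rows), good_mem)
print(len(bad_rows), bad_mem)
```

Both calls return the same number of joined rows, but the second one holds the entire large table in memory to do it. On real data volumes that difference is what separates a fast query from one that spills or fails.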

You can solve this in two ways. The first is to educate your users to write efficient queries, which can be challenging because people often want to complete their tasks and may not always consider the most optimized approach. The payoff can be large: we have encountered cases where simply rewriting the join sequence accelerated a query by three to five times.

The second option is to maintain table statistics, so that the query optimizer can choose the join order itself. This is also challenging for companies, because collecting statistics requires running separate procedures, which take time and incur costs.
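The maintenance burden of table statistics can be illustrated with a small sketch. It uses SQLite's `ANALYZE` statement purely as a stand-in, since SQLite ships with Python; the engine discussed in this article has its own statistics procedure, and the table and index names here are hypothetical. The sketch shows that statistics are collected by a separate step and go stale until that step is run again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i % 50) for i in range(1000)]
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")


def stats_table_exists():
    """True once statistics have been collected at least once."""
    count = conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()[0]
    return count > 0


def recorded_row_count():
    """Row count the optimizer believes the orders table has."""
    stat = conn.execute(
        "SELECT stat FROM sqlite_stat1 WHERE tbl = 'orders'"
    ).fetchone()[0]
    return int(stat.split()[0])


before = stats_table_exists()   # no statistics until someone collects them

conn.execute("ANALYZE")         # the separate procedure
fresh = recorded_row_count()

# The table grows, but the statistics do not follow automatically.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i % 50) for i in range(1000, 2000)]
)
stale = recorded_row_count()

conn.execute("ANALYZE")         # must be re-run to catch up
refreshed = recorded_row_count()

print(before, fresh, stale, refreshed)
```

Each refresh reads the data again, which is where the time and cost mentioned above come from.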

3. Manage indexes to correspond to changes in query requirements

For questions that we ask frequently, it is relatively easy to optimize our dataset using indexing. However, business is changing, and research questions are becoming increasingly complex, spanning more and more dimensions of the data. Each of these dimensions translates into a column in a table.

Traditional indexes are generally optimized for row-based data layouts, rather than columnar layouts, which are typically used with big data. With columnar data, you cannot index every column without rapidly expanding your storage and significantly slowing down your load times.

The key to big data indexing solutions today is to have a dynamic, intelligent indexing system that can adapt to the changing needs of business analytics.

4. A new way to index big data

Nano-block indexing takes a different approach. Nano-blocks are written independently and read in parallel at query time. Users can create big data indexes on any column, and can add or remove column indexes without updating the primary dataset.
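The article does not describe how nano-blocks work internally, so the following is only a toy model of the two properties named above: index blocks that are built independently and read in parallel, and column indexes that can be added or dropped without touching the primary data. The class, its methods, and the block size are all assumptions made for illustration and are not the product's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # hypothetical block size, chosen small for the example


class BlockIndexedTable:
    def __init__(self, rows):
        self.rows = rows      # primary dataset; index operations never modify it
        self.indexes = {}     # column name -> list of independent per-block indexes

    def add_index(self, column):
        """Build one small index per block of rows for the given column."""
        blocks = []
        for start in range(0, len(self.rows), BLOCK_SIZE):
            block = {}
            for pos in range(start, min(start + BLOCK_SIZE, len(self.rows))):
                block.setdefault(self.rows[pos][column], []).append(pos)
            blocks.append(block)
        self.indexes[column] = blocks

    def drop_index(self, column):
        """Remove a column index; the primary dataset is left as it was."""
        self.indexes.pop(column, None)

    def lookup(self, column, value):
        if column not in self.indexes:
            # No index on this column: fall back to a full scan.
            return [row for row in self.rows if row[column] == value]
        # Each block index is consulted independently, here via a thread pool.
        with ThreadPoolExecutor() as pool:
            hits = pool.map(lambda block: block.get(value, []), self.indexes[column])
            positions = [pos for block_hits in hits for pos in block_hits]
        return [self.rows[pos] for pos in positions]


rows = [{"id": i, "country": ["US", "DE", "IL"][i % 3]} for i in range(10)]
table = BlockIndexedTable(rows)

scan_result = table.lookup("country", "IL")      # full scan, no index yet
table.add_index("country")
indexed_result = table.lookup("country", "IL")   # answered from block indexes
table.drop_index("country")

print(indexed_result == scan_result)
```

The indexed lookup and the full scan return the same rows, and the primary dataset is identical before and after the index is added and dropped.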

By integrating nano-block indexing deep into a query engine that runs directly on data lake solutions, Smart Indexing and Caching can deliver faster big data analytics than partitioning alone allows, along with the flexibility to change “partitions” as needed.