4 Key Things You Should Know About Indexing

By: Roman Vainbrand
September 22, 2022
Share: Linked In

Data indexing radically accelerates query run time and concurrency without the need for massive compute resources. But before expecting indexing to solve all your needs these are the four things you need to know before this solution will make the desired impact:

1. Indexing is useless if your queries need to perform a full scan

To benefit from indexing, first, you have to intimately understand your business.

Take a telephone book, for example. If you know the family name, the first name and area a person lives in, you will not have to scan the entire list of entries in the phonebook. However, if you don’t know the family name and, let’s say, only know the address, the way the phonebook is indexed is useless to your search. You will have to scan the phone book line by line.

The same is true of indexing databases. If they are indexed in a specific way and you want to run queries that do not match the indexing parameters, you will have to scan line by line.

For indexing to bring optimum efficiencies, the indexes must cover the query connotations and business needs. Everything needs to be indexed according to the questions you need answered from your database. This will eliminate the need to perform full scans, which burn CPU resources and money.

2. The way you write your SQL query matters…a lot

A query written in an inefficient way quickly turns a good query into a bad and slow query.

When writing SQL queries, two things can go wrong. First, if you’re choosing the wrong join strategy (partitioned or replicated) it can lead to poor performance. The second thing, which is easier to act upon, is how you order your tables when you’re doing the join.

So, you need to start with the big table, which is called the “Build Side“, and then to go to the smaller ones. This is the most efficient way. And if you’re doing the opposite, it can be catastrophic.

You can solve it in two ways. You can educate your users to write queries in a smart and efficient way, which is hard, because people want to do their job and are not always thinking about writing the sequel in the most optimized way. And the second option is to maintain table statistics.

And it’s hard for companies to maintain these statistics, because you need to run separate procedures and that takes time and it costs money. We have come across cases where just rewriting the sequence accelerated the query by three-to-five times more.

3. Manage indexes to correspond to changes in query requirements

For questions that we are always asking, it is relatively easy to optimize our data set using indexing. However, business is changing and research questions can have more and more dimensions. These dimensions will be translated into columns in a table.

Traditional indexes are generally optimized for row-based data layouts and not for columnar layouts that are typically used with big data. With columnar data, you cannot index every column without rapidly expanding your storage and grinding your load times to halt.

The key to big data indexing solutions these days is to have a dynamic, intelligent indexing system that can cope with the changing needs of business analytics.

4. A new way to index big data

Nano-blocks are written independently and read in parallel at query time. Users can create big data indexes on any column, adding and removing column indexes without updating the primary dataset.

By building nano-block indexing deep into a query engine that runs directly on data lake solutions, Smart Indexing and Caching can deliver faster big data analytics than is possible with partitioning and the flexibility of changing “partitions” when needed.

In fact, when using the nano-blocks indexing approach, partitioning becomes useless since every query will find its index. This also takes advantage of the flexibility inherent in nano-block indexing by dynamically and automatically adding and removing indexes depending on changing workloads, so future proof for any schema changes.

Roman Vainbrand

Director, Cache Strategy, Starburst

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

  • Query your data lake fast with Starburst's best-in-class MPP SQL query engine
  • Get up and running in less than 5 minutes
  • Easily deploy clusters in AWS, Azure and Google Cloud
For more deployment options:
Download Starburst Enterprise

Please fill in all required fields and ensure you are using a valid email address.