In big data analytics, the ability to analyze unstructured text is a powerful capability. This post highlights the difference between “text analytics” and “text search”, explains how text search can help data teams and data scientists, and then goes deeper into the technical components, including the open source Apache Lucene library.
Understanding text analytics
Ask 10 people ‘what is text analytics’ and you will probably get 10 different answers. From our perspective, text analytics falls into two categories:
The first category is extracting new information and insights from text. Natural Language Processing (NLP) has been around since the 1950s and focuses on automatically understanding text. It includes techniques such as stemming (reducing words to their root form), text summarization, etc.
The second category is search on text. Many analytics use cases require this functionality, for example identifying a specific section of a gene, log and event analytics, and many more.
Scaling text analytics to enable agile text searches
As use cases evolve and organizations collect billions (and more) of rows that they want to analyze, text analytics needs a fresh approach.
In practice, most text searches are applied using “starts with”, “contains”, “ends with” or “regex” (pattern matching) lookups.
These lookups typically appear in one of these common text search use cases:
- Log analysis – looking for specific text in massive amounts of logs and events.
- Cyber threat detection and anomaly detection – in many cases a sub-type of the log analysis use case.
- Folder analysis – often used in marketing analytics. URLs essentially include a series of folders that describe a relevant product or service, which makes it possible to identify how often specific categories (i.e. folders) are consumed by customers, for example ‘men shoes’ for an online retailer.
- General text lookup and pattern matching – find all the records in a “description” field that contain the word “garden”, or look up all the records in a postal code field that start with 1002.
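The four lookup types can be sketched in a few lines of plain Python. This is only an illustration of what each lookup means, not how a data lake engine executes it. The sample records, field names and helper functions are all hypothetical.

```python
import re

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"description": "garden hose, 15 m", "postal_code": "10021"},
    {"description": "Kitchen table", "postal_code": "10025"},
    {"description": "rose garden kit", "postal_code": "94103"},
    {"description": "desk lamp", "postal_code": "30310"},
]


def starts_with(rows, field, prefix):
    """Keep rows whose field value begins with the given prefix."""
    return [r for r in rows if r[field].startswith(prefix)]


def ends_with(rows, field, suffix):
    """Keep rows whose field value ends with the given suffix."""
    return [r for r in rows if r[field].endswith(suffix)]


def contains(rows, field, needle):
    """Keep rows whose field value contains the given substring."""
    return [r for r in rows if needle in r[field]]


def matches(rows, field, pattern):
    """Keep rows whose field value matches the regular expression."""
    compiled = re.compile(pattern)
    return [r for r in rows if compiled.search(r[field])]


# The two examples from the list above, plus one of each remaining type.
garden_rows = contains(records, "description", "garden")
prefix_rows = starts_with(records, "postal_code", "1002")
kit_rows = ends_with(records, "description", "kit")
pattern_rows = matches(records, "postal_code", r"^\d{4}3$")
```

Each helper scans every row, which is fine for four records but is exactly the cost that becomes a problem at billions of rows, and is why indexing matters for these lookups.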
Searching for text to support these use cases requires a technology that can run on billions of rows directly on the data lake, without the need to move the data to a proprietary platform. To get these search capabilities, organizations often overpay for advanced text analytics solutions that focus more on extracting insights from text than on enabling agile text searches. These heavy platforms are expensive not just in TCO, but also in time-to-market and maintenance, which slows down the pace of innovation for data-driven organizations.
Size matters! Text search requires a very specific solution that can easily be served on the data lake, at massive scale, without the need to move data to heavy text-optimized platforms.
The role of smart indexing for text analytics
Starburst’s innovative approach to big data and data lake indexing is based on cutting the data into tiny pieces, called nanoblocks, which are stored on the worker machines’ SSDs. Each nanoblock holds approximately 60,000 rows of a specific column. Starburst uses this prism to continuously analyze data on the data lake.
Each nanoblock that the platform indexes is assigned the optimal index, including Lucene, according to its structure and type. Cardinality, the number of distinct values that appear in a dataset, is critical to that choice. Because indexes are assigned at the nanoblock level, each index only has to handle the distinct values within its own block, so the cardinality challenge is dramatically smaller.
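The effect of block-level indexing on cardinality can be illustrated with a small sketch. It is not Starburst’s implementation: the function names and the sample column are hypothetical, and only the block size of roughly 60,000 rows comes from the text. It also assumes that similar values sit close together in the column. A block can never hold more distinct values than it has rows, but for randomly distributed data the reduction may be much smaller than shown here.

```python
from itertools import islice

BLOCK_ROWS = 60_000  # approximate nanoblock size mentioned in the text


def split_into_blocks(values, block_rows=BLOCK_ROWS):
    """Yield consecutive lists of at most block_rows values."""
    iterator = iter(values)
    while True:
        block = list(islice(iterator, block_rows))
        if not block:
            return
        yield block


def block_cardinalities(values, block_rows=BLOCK_ROWS):
    """Return the number of distinct values in each block."""
    return [len(set(block)) for block in split_into_blocks(values, block_rows)]


# Hypothetical column of 180,000 rows in which each value repeats three
# times in a row, so neighbouring rows share values (clustered data).
column = [f"user-{i // 3}" for i in range(180_000)]

overall = len(set(column))                # distinct values in the whole column
per_block = block_cardinalities(column)   # distinct values in each block

print("overall cardinality:", overall)
print("per-block cardinality:", per_block)
```

With this sample column, the whole column has 60,000 distinct values while each of the three blocks has 20,000, so an index built per block deals with a third of the distinct values.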
Enabling text searches directly on the data lake
Starburst Smart Indexing and Caching leverages the Apache Lucene open-source library as part of its extensive indexing capabilities to enable advanced text search. Starburst’s unique data lake indexing technology, coupled with Lucene text indexing and search, enables data teams and data scientists to run text searches on petabytes of data, at scale, directly on the data lake.