As organizations strive to become more agile, many are jumping headfirst into what is called a security data lake.
Gartner defines data lakes as “a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.”
Expanding this concept to include security-specific data, “security data lakes” can help you centralize and store unlimited amounts of data to power investigations, analytics, threat detection, and compliance initiatives. Analysts and applications can access logs from a single source to perform data-driven investigations at optimal speed with centralized, easily searchable data.
The global data lake market size was valued at over $8 billion USD in 2019 and is expected to grow at a compound annual growth rate (CAGR) of over 21% from 2021 to 2028. According to Gartner, over half of organizations planned to implement a data lake by 2022. An enormous amount of information is generated daily on digital information platforms and requires efficient processing and indexing architectures.
Source: Gartner – Data Hubs, Data Lakes and Data Warehouses: How They Are Different and Why They Are Better Together
Security Data Lakes: Hunting for the Right Data
Security data lakes are designed to centralize all of your data so you can support complex use cases for security analysis, including threat hunting and anomaly detection at scale.
A top challenge is long-term data retention and the ability to search across collected telemetry. Most vendors cap data retention at between 7 and 30 days and often pass costs on to the buyer, whether the buyer knows it or not.
For example, according to Gartner and multiple cloud benchmark studies over the years, it costs on average $6 USD per endpoint per year to retain 7 days of continuously recorded endpoint detection and response (EDR) data, which is why EDR solutions are so expensive.
Access to all of your historical data is critical for having the right contextual information to conduct an effective and efficient security investigation.
In the SolarWinds supply chain attack, it was months before the security community became aware of the malicious artifacts, the adversarial tactics, techniques, and procedures (TTPs), and the motivations and scope behind such a complex attack.
This meant that many organizations could not perform historical hunting across the relevant time window because those logs had already aged out of the platform or been moved into offline archives, making it difficult to triage the scope of the attack.
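The retention problem above comes down to simple date arithmetic. The following is a minimal sketch, assuming a rolling retention window measured in days; the function names and the dates are hypothetical and are not taken from any real incident or product.

```python
from datetime import date, timedelta


def earliest_searchable(today: date, retention_days: int) -> date:
    """Oldest day still searchable under a rolling retention window."""
    return today - timedelta(days=retention_days)


def can_hunt(activity_start: date, today: date, retention_days: int) -> bool:
    """True if logs from the start of the suspicious activity are still online."""
    return activity_start >= earliest_searchable(today, retention_days)


# Hypothetical dates: the activity began 200 days before it was disclosed.
disclosed = date(2024, 7, 1)
activity_start = disclosed - timedelta(days=200)

for retention in (7, 30, 365):
    # Only the 365-day window still covers the start of the activity.
    print(retention, can_hunt(activity_start, disclosed, retention))
```

With a 7-day or 30-day cap, the start of the activity falls outside the searchable window, so the hunt cannot cover the full scope.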
Security Data Lake Success Criteria
There are four key data-related criteria that must be met for a security data lake architecture to operate efficiently and effectively.
- Access to all key data (any type, volume, timeframe, format)
Security applications and analysts need access to every piece of information they can get their hands on to conduct proper security investigations with the highest levels of fidelity.
- Instant access (zero time to insights)
Security investigations need to operate at the speed of now with zero delays in system responsiveness.
- Elastic scaling (scale out and in on demand)
The approach needs to scale out and in elastically and effectively as needed for a dynamically expanding digital ecosystem and volatile demand.
- Price-performance balance
This additional functionality needs to reduce costs rather than add to them, removing barriers to implementation and delivering long-term operational and financial benefits.
Benefits of Starting with a Security Data Lake
Organizations are taking extra care to implement a best-of-breed approach that not only addresses immediate needs but also holds up in the long run.
- Efficient resource utilization
- Consistent performance
- Access to all operational data sets of historical data
- Predictable cost structure
- Ability to quickly access critical business operational data
- Full control of the data format (original raw form vs. being modified and/or truncated)
The approach also comes with pitfalls:
- Security controls and compliance are sacrificed in favor of basic functionality
- Do-It-Yourself is not sustainable and is very costly in the long term
Resource inefficiency is the main pitfall for data lake architectures, especially when evaluated against existing SIEM solutions and other optimized platforms. Data lake query engines are often based on brute-force technology that scans the entire data set. The result is that 80%-90% of compute resources are squandered on scan and filter operations.
Organizations that have attempted to leverage data lake architectures often find themselves managing huge clusters to ensure performance and concurrency requirements are met. This is extremely expensive, both in compute resources and in the large data teams needed to maintain them.
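To make the cost of scan and filter concrete, here is a minimal sketch that compares a brute-force scan with an index lookup on a toy in-memory log. The data set, column names, and helper functions are hypothetical; the row counts it prints illustrate the mechanism only and are not a measurement of any real engine.

```python
from collections import defaultdict

# Hypothetical event log: 100,000 rows spread over 1,000 distinct source IPs.
events = [
    {"id": i, "src_ip": f"10.0.{(i % 1000) // 250}.{i % 250}", "action": "login"}
    for i in range(100_000)
]


def scan(rows, ip):
    """Brute force: examine every row and filter."""
    touched = 0
    hits = []
    for row in rows:
        touched += 1
        if row["src_ip"] == ip:
            hits.append(row["id"])
    return hits, touched


def build_index(rows, column):
    """Map each column value to the positions of the rows that contain it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index


def lookup(rows, index, value):
    """Index path: touch only the rows the index points at."""
    positions = index.get(value, [])
    return [rows[p]["id"] for p in positions], len(positions)


target = "10.0.2.17"
ip_index = build_index(events, "src_ip")
scan_hits, scan_touched = scan(events, target)
index_hits, index_touched = lookup(events, ip_index, target)

# Same answer either way; the scan touches every row, the index only the matches.
print(scan_touched, index_touched)
```

Both paths return the same rows, but the scan reads all 100,000 while the index reads only the 100 that match. The index has its own build and storage cost, which this sketch does not model.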
The Benefits of a Better Approach: The Power of Big Data Indexing
Partitioning-based optimizations reduce the amount of data scanned, and so boost performance or reduce cost, by partitioning the data on frequently used columns. Starburst’s big data indexing technology, by contrast, is not limited to a few dimensions: it can quickly find relevant data across any dimension (column).
Because data lakes are not homogeneous and include data from many different sources and formats, the platform leverages a rich suite of indexes, such as Bitmap, Trees, Bloom, and Lucene (for the text searches that are so critical for log and event analytics).
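As an illustration of one index type from that list, the following is a minimal sketch of a bitmap index, using Python integers as bitsets. It shows the general technique only and makes no claim about how the platform implements it; the column data and function names are hypothetical.

```python
def build_bitmaps(values):
    """One integer bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_of(bitmap):
    """Expand a bitmap back into a list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical columns from a small authentication log.
action = ["login", "login", "logout", "login", "delete", "login"]
status = ["fail", "ok", "ok", "fail", "ok", "fail"]

action_idx = build_bitmaps(action)
status_idx = build_bitmaps(status)

# WHERE action = 'login' AND status = 'fail' becomes a single bitwise AND.
matches = rows_of(action_idx["login"] & status_idx["fail"])
print(matches)  # [0, 3, 5]
```

Bitmaps suit columns with few distinct values, because each value needs its own bitmap. That is one reason a suite of index types is useful for heterogeneous data.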
The platform identifies the best index automatically, so no special skill set is required. Taking it a step further, to deliver optimal performance on varying cardinality, Smart Indexing and Caching breaks each column down into small pieces called nano-blocks and finds the optimal index for each nano-block.
Indexes, as well as cached data, are stored on SSDs to enable highly effective access and extremely fast performance.
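The idea of choosing an index per block can be sketched as follows. This is an assumption-laden illustration, not the platform's algorithm: the block size, the thresholds, and the mapping from cardinality to index type are all invented here to show why one column may end up with different indexes in different blocks.

```python
def split_blocks(values, block_size):
    """Cut a column into fixed-size blocks (a stand-in for nano-blocks)."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def choose_index(block, low=0.2, high=0.8):
    """Pick an index type from the block's distinct-value ratio.

    The thresholds and the type chosen for each range are illustrative assumptions.
    """
    ratio = len(set(block)) / len(block)
    if ratio <= low:
        return "bitmap"  # few distinct values
    if ratio >= high:
        return "bloom"   # nearly unique values
    return "tree"        # everything in between


# Hypothetical column whose cardinality changes along its length.
column = (
    ["GET"] * 10
    + [f"user{i % 5}" for i in range(10)]
    + [f"session-{i}" for i in range(10)]
)

plan = [choose_index(block) for block in split_blocks(column, 10)]
print(plan)  # ['bitmap', 'tree', 'bloom']
```

The same column receives three different index types because its cardinality differs from block to block.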
Text Analytics is a Useful Feature
As part of Starburst’s Smart Indexing and Caching suite, text search with Apache Lucene is native to the platform and is applied automatically.
Organizations collect massive amounts of data on various events from many different applications and systems. These events need to be analyzed effectively to enable real-time threat detection, anomaly detection, and incident management. In various security-related use cases, text analytics is leveraged to provide deep insights into traffic and user behavior (segmentation, URL categorization, etc.).
Text analytics has proven to be critical for security information and event management (SIEM) and other SOC tools in reducing the overall time and resources required to investigate a security incident.
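Text search over logs typically relies on an inverted index, which maps each term to the lines that contain it. The sketch below shows that concept in a few lines of Python. It is not Lucene and omits what Lucene adds (analyzers, scoring, on-disk segments); the log lines and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(lines):
    """Map each token to the set of line numbers where it appears."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(line_no)
    return index


def search(index, query):
    """Return the line numbers that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical log lines.
logs = [
    "Failed password for admin from 10.0.0.5",
    "Accepted password for alice from 10.0.0.9",
    "Failed password for root from 10.0.0.5",
    "GET /login.php?user=admin HTTP/1.1",
]
index = build_inverted_index(logs)

print(search(index, "failed password"))  # [0, 2]
print(search(index, "admin"))            # [0, 3]
```

A query for "admin" finds both the SSH failure and the web request without reading the lines that do not mention it.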
Meeting the Security Data Lake Success Criteria
By leveraging Starburst for the security data lake, the success criteria can be met:
- Security teams can access all key data instantly on the data lake, without the need to manage ELT/ETL pipelines for moving data into a centralized platform
- Clusters can scale in and out based on demand, so spending can be controlled and tracks actual usage
- Price-performance is balanced by significantly reducing compute costs, using indexing to read only the required data from the data lake