Security Lake

A data lake is a centralized repository for large volumes of raw data from multiple sources that simplifies big data analytics and optimizes data infrastructure investments. A security lake applies a data lake architecture to streamline the analysis of activity logs, event notifications, and data sourced from across an enterprise’s security and network infrastructure.

To combat the growing sophistication of cyber attacks on increasingly complex information architectures, enterprises have turned to security data lakes, also called security lakes. These solutions combine big data management practices with on-premises and cloud security analytics to drive real-time incident detection and response.

This guide will introduce the security data lake, discuss its advantages over traditional security data management systems, and explain how modern data lake analytics accelerate incident responses.

Why organizations need security data lakes

In a dynamic cybersecurity landscape, security teams must react quickly to prevent breaches from compromising sensitive data. At the same time, traditional security tools can’t keep pace with the snowballing complexity and volume of security data.

The period between an initial security breach and the adversary’s lateral movement within an enterprise network averages only 84 minutes. This breakout time is how long security teams have to identify, analyze, and eliminate the threat.
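To make the breakout window concrete, here is a minimal sketch that turns the 84-minute figure into a response deadline. The timestamps are invented for illustration, and the function names are hypothetical rather than part of any product.

```python
from datetime import datetime, timedelta, timezone

# Breakout window quoted in the text above.
BREAKOUT_WINDOW = timedelta(minutes=84)

def response_deadline(initial_access: datetime) -> datetime:
    """Latest moment to contain the breach before expected lateral movement."""
    return initial_access + BREAKOUT_WINDOW

def minutes_remaining(initial_access: datetime, now: datetime) -> float:
    """Minutes left in the breakout window, never below zero."""
    remaining = response_deadline(initial_access) - now
    return max(0.0, remaining.total_seconds() / 60)

# Hypothetical incident: breach at 09:00 UTC, detected at 09:30 UTC.
breach_time = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
detected_at = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
print(minutes_remaining(breach_time, detected_at))
```

A detection that arrives 30 minutes after the breach leaves less than an hour for analysis and containment, which is why query latency matters so much in this setting.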

However, threat detection and response require systems that can analyze increasingly complex data at scale in real time. Each log or event from an enterprise firewall can have hundreds of attributes. Improving signal-to-noise ratios requires enriching those events with data from identity and access management (IAM), vulnerability scanning, and other security systems.
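As a rough illustration of how enrichment can raise the signal-to-noise ratio, the sketch below joins firewall events with IAM context and keeps only events tied to privileged or unknown accounts. All field names and sample records are hypothetical, and the filter rule is an assumption chosen for the example.

```python
from typing import Dict, List, Optional

# Hypothetical firewall events; real logs can carry hundreds of attributes.
firewall_events: List[Dict] = [
    {"user": "alice", "src_ip": "10.0.0.5", "dst_port": 443},
    {"user": "bob", "src_ip": "10.0.0.9", "dst_port": 22},
    {"user": "mallory", "src_ip": "10.0.0.77", "dst_port": 3389},
]

# Hypothetical IAM export keyed by user name.
iam_records: Dict[str, Dict] = {
    "alice": {"department": "finance", "is_privileged": False},
    "bob": {"department": "it-ops", "is_privileged": True},
}

def enrich(events: List[Dict], iam: Dict[str, Dict]) -> List[Dict]:
    """Attach IAM context to each event; users missing from IAM are flagged."""
    enriched = []
    for event in events:
        context: Optional[Dict] = iam.get(event["user"])
        enriched.append({
            **event,
            "known_user": context is not None,
            "is_privileged": bool(context and context["is_privileged"]),
        })
    return enriched

def high_signal(events: List[Dict]) -> List[Dict]:
    """Keep events from privileged accounts or from users IAM does not know."""
    return [e for e in events if e["is_privileged"] or not e["known_user"]]

enriched_events = enrich(firewall_events, iam_records)
for event in high_signal(enriched_events):
    print(event["user"], event["dst_port"])
```

Without the IAM join, all three events look alike. With it, the routine HTTPS session drops out and the two sessions worth a closer look remain.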

The rise of security data lakes

Data lake storage is the logical way to manage this vast quantity of structured and unstructured data efficiently.

Traditional security data management systems, even when offered as software-as-a-service (SaaS), combine storage and compute within a proprietary solution that is expensive to maintain and scale.

Other storage architectures, such as a data warehouse, are too inflexible and cannot handle the wide range of data types security systems collect.

By itself, a data lake lacks the analytics features needed to monitor network security. A security data lake addresses this weakness by combining the efficient storage of a data lake with the compute optimizations of an analytics platform.

Security data lake vs SIEM

Security information and event management (SIEM) solutions collect logs and event notifications, providing a single source for security staff to monitor and protect enterprise networks.

However, scaling these SIEM systems to match accelerating data volumes and complexity is difficult and expensive. Because these systems combine compute and storage in a single package, companies must pay for just-in-case capacity. Much of that capacity sits idle most of the time, held in reserve so the SIEM solution does not become slow and unresponsive at peak load.

SIEM providers try to minimize these challenges by sampling or aggregating data rather than collecting every log from every system. To further reduce storage costs, SIEM solutions will offload older data to less accessible long-term data stores.
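The trade-off can be sketched in a few lines. In this hypothetical example, an aggregated per-host count answers "how busy was each host?" but can no longer answer "which destinations did this host contact?", a question the raw records still support. The log records and field names are invented.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical raw connection logs.
raw_logs: List[Dict] = [
    {"host": "web-1", "dst": "10.0.0.8", "bytes": 500},
    {"host": "web-1", "dst": "203.0.113.7", "bytes": 90_000},
    {"host": "db-1", "dst": "10.0.0.8", "bytes": 700},
]

# Aggregated view: only the number of events per host survives.
aggregated = Counter(log["host"] for log in raw_logs)

def destinations_for(host: str, logs: List[Dict]) -> List[str]:
    """Answerable only from raw logs: every destination a host contacted."""
    return sorted({log["dst"] for log in logs if log["host"] == host})

print(dict(aggregated))
print(destinations_for("web-1", raw_logs))
```

The large transfer from web-1 to an external address is invisible in the aggregated view, which is the kind of investigative context that raw retention preserves.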

Security data lakes keep storage and compute separate. Chief security officers can use inexpensive cloud storage services from providers like Microsoft Azure or AWS. They can then use scalable cloud computing services to run their analytics, ensuring they only pay for needed processing capacity.

Other advantages of a security data lake include:

Rich data for security analytics – Petabyte-scale data lake architectures can ingest data from multiple sources, so there’s no need to filter or aggregate data at ingestion. Low-cost storage allows a lake to retain granular historical data in its original raw form, providing more investigation context. In addition, a security data lake is not limited to security data. It can enrich security events with data from other sources to enable more holistic analyses of incidents.

Leverage data management practices – Freed from a SIEM’s proprietary features, security teams can use data pipelines and automations to simplify security monitoring. Machine learning workflows make the most of large datasets to distinguish abnormal network activity quickly.
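As a much simpler stand-in for the machine learning workflows mentioned above, the following sketch flags hours whose login count deviates sharply from the historical baseline using a z-score. The counts and the threshold of 2.0 are assumptions for illustration; a production workflow would use richer features and models.

```python
from statistics import mean, stdev
from typing import List

def find_anomalies(counts: List[int], threshold: float = 2.0) -> List[int]:
    """Return indexes whose value lies more than `threshold` sample
    standard deviations away from the mean of the series."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# Hypothetical hourly login counts; the last hour shows a sudden spike.
hourly_logins = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95]
print(find_anomalies(hourly_logins))
```

The more history a lake retains, the more stable the baseline becomes, which is one reason large raw datasets help this kind of analysis.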

Five stages of attack

Assume-breach is the touchstone of modern cybersecurity. No security technique, whether role-based access control, encryption key management, or anything else, is perfect. Even the most extensive, advanced system of security controls will eventually fail, giving adversaries entry to the enterprise’s network.

The challenge for security operations centers (SOCs) is ensuring that an initial breach cannot spread to critical systems. The 84-minute window for cutting off an attack covers the first three of a cyber attack’s five stages:

1. Initial access/exploitation

The first breach can happen anywhere on an enterprise network. Threat actors are quick to leverage zero-day vulnerabilities. Through 2021 and 2022, Google’s Mandiant security service observed a 32-day average time-to-exploit. Network managers do not react as quickly. By one estimate, the mean-time-to-patch of high-severity vulnerabilities was 146 days. The 114-day difference between the two figures, a gap of nearly four months, gives adversaries ample opportunity to penetrate defenses.

In another technology-based attack, threat actors penetrate enterprise networks by compromising a software vendor’s API, SDK, or source code. These supply chain attacks increased 663% in 2022, partly driven by open-source software’s security weaknesses.

However, human error is by far the most common factor. Phishing and other social engineering tactics leverage the fallibility of human nature to steal access permissions. Trend Micro’s security service blocked 21 million phishing attacks in 2022, a 21% increase over the previous year.

2. Persistence

Victims could patch the original vulnerability or reset the compromised credential at any time, so the attackers’ priority is turning the initial breach into a beachhead. Attackers will set up backdoors and other tools to allow remote access. They will also create pathways for communication with command-and-control (C&C) servers.

Compromised credentials give attackers free access to networks through the company’s virtual private network (VPN) or remote desktop protocol (RDP) gateways. Compromising privileged credentials lets attackers use SSH and other network tools. At first glance, this activity will seem legitimate since attackers appear to be authorized users.

Rather than running attacks directly, threat actors establish C&C channels that let their servers operate on the compromised network. These servers install additional malware and create hard-to-detect pipelines for data exfiltration.
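One common heuristic for spotting C&C traffic, offered here as an assumption rather than something the guide prescribes, is to look for beaconing: outbound connections to the same destination at suspiciously regular intervals. The sketch below measures how much the gaps between connections vary. The thresholds and sample timestamps (in seconds) are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

def looks_like_beacon(timestamps: List[float],
                      max_jitter_ratio: float = 0.1,
                      min_events: int = 5) -> bool:
    """True when connection gaps are nearly constant, i.e. the standard
    deviation of the gaps is small relative to their mean."""
    if len(timestamps) < min_events:
        return False  # too few events to judge regularity
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return False
    return pstdev(gaps) / average_gap <= max_jitter_ratio

# Hypothetical connection times to one external host, in seconds.
regular_checkins = [0, 60, 121, 180, 241, 300]   # roughly every minute
human_browsing = [0, 5, 200, 210, 900, 905]      # bursty and irregular

print(looks_like_beacon(regular_checkins))
print(looks_like_beacon(human_browsing))
```

Real C&C tooling often adds random jitter to evade exactly this check, so a heuristic like this is a starting point for investigation rather than a verdict.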

3. Discovery

There is no guarantee that the initial breach will give adversaries access to valuable resources. Discovery gives them the information needed to reach more sensitive data.

Threat actors combine active and passive techniques to explore the compromised network. They aim to understand as much as possible about potential weaknesses that could advance the attack. For example, discovering password policies tells attackers when compromised passwords are due to change.

Once attackers completely understand the compromised network’s structure, systems, and activity, they can plan to turn their beachhead into a full-scale invasion. The attack has reached its breakout point.

4. Lateral movement

Lateral movement allows the attackers to traverse enterprise networks and find valuable data systems to target. While malware installed through C&C channels helps the attack spread, attackers often use the compromised network’s own resources. The inappropriate use of SSH and other network management tools is much harder to detect: with so much legitimate activity going on, the attackers’ movements blend into the background.

Similarly tricky to spot, escalated privileges are essential elements of lateral movement. Compromising the accounts of administrators and others authorized to modify systems lets the attackers bypass defenses, deliberately open security holes, and move across subnetworks.

Once an attack reaches the lateral movement stage, it becomes increasingly difficult to identify and stop. Threat actors camouflage their activities beneath the guise of approved credentials and resource usage. They gain greater control over networks and systems.
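One way to surface this kind of camouflaged activity, sketched here under assumed field names and an arbitrary threshold, is to count how many distinct hosts each account reaches over SSH. An account that suddenly fans out to many machines stands out even when every individual session looks legitimate.

```python
from collections import defaultdict
from typing import Dict, List

def ssh_fan_out(sessions: List[Dict]) -> Dict[str, int]:
    """Count the distinct destination hosts each account connected to."""
    targets = defaultdict(set)
    for session in sessions:
        targets[session["account"]].add(session["dst_host"])
    return {account: len(hosts) for account, hosts in targets.items()}

def flag_accounts(sessions: List[Dict], max_hosts: int = 3) -> List[str]:
    """Accounts whose fan-out exceeds the (hypothetical) threshold."""
    fan_out = ssh_fan_out(sessions)
    return sorted(a for a, count in fan_out.items() if count > max_hosts)

# Hypothetical SSH session log for a single hour.
ssh_sessions = [
    {"account": "alice", "dst_host": "build-1"},
    {"account": "alice", "dst_host": "build-2"},
    {"account": "alice", "dst_host": "build-1"},
] + [
    {"account": "svc-backup", "dst_host": f"db-{n}"} for n in range(1, 6)
]

print(ssh_fan_out(ssh_sessions))
print(flag_accounts(ssh_sessions))
```

In practice the threshold would be set per account from historical behavior, which again depends on retaining granular session data.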

5. Objective

In its final stage, a cyberattack’s objectives become clear. Financially motivated criminals will launch data encryption ransomware that renders the victim’s files inaccessible. State-backed attacks will focus on accessing sensitive data for exfiltration through the C&C channels.

In the first stage of a supply chain attack, the adversaries won’t do any noticeable damage. Instead, they will slip malicious code into a software company’s applications to create downstream vulnerabilities.

Regardless of the attackers’ objectives, the direct costs of a successful breach run into the tens of millions.

Security data lake with smart indexing

Starburst’s data lake analytics platform becomes the foundation for a more efficient and responsive security data lake. For example, Starburst Galaxy reduces the costs and risks of large-scale data duplication. Rather than copying data from other sources to enrich log and event data, Starburst federates sources from across the enterprise. Data remains at the source yet is instantly accessible to security analysts.
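To give a feel for what a federated query might look like, the sketch below builds a SQL statement that joins log data in a lake with user records in a separate operational database. Starburst is built on the Trino query engine, which addresses tables as catalog.schema.table. The catalog, schema, table, and column names here are all hypothetical, and the helper function is an illustration, not a Starburst API.

```python
import re

# Conservative pattern for each part of a catalog.schema.table name.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")

def federated_lookup_sql(log_table: str, iam_table: str,
                         window_minutes: int = 84) -> str:
    """Build a query joining recent log events with user records that
    live in a different catalog (that is, a different source system)."""
    for name in (log_table, iam_table):
        for part in name.split("."):
            if not _IDENTIFIER.match(part):
                raise ValueError(f"unsafe identifier: {name!r}")
    return (
        "SELECT f.event_time, f.src_ip, f.user_name, u.department\n"
        f"FROM {log_table} AS f\n"
        f"JOIN {iam_table} AS u ON f.user_name = u.user_name\n"
        "WHERE f.event_time > current_timestamp"
        f" - INTERVAL '{int(window_minutes)}' MINUTE"
    )

# Hypothetical catalogs: object storage for logs, PostgreSQL for IAM data.
query = federated_lookup_sql(
    "lake.security.firewall_events",
    "postgres_iam.public.users",
)
print(query)
```

The point of the sketch is that both sources appear in one statement, so the IAM data never has to be copied into the lake before analysts can join against it.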

Starburst also provides performance optimizations that support the real-time demands of incident investigations. Effectively indexing the heterogeneous formats in a data lake can speed queries but requires extensive expertise to design correctly. Starburst Smart Indexing automatically makes the most appropriate choice from a rich suite of indexes.

Unnecessary table scanning is another source of friction in query performance. Starburst Smart Caching accelerates data retrieval by evaluating data usage frequency and business priority to cache data in high-performance solid-state storage.

With performance optimizations, data source federation, and other features, Starburst speeds analytics so security teams can identify, investigate, and control security breaches before breakout.
