As we enter 2024, what’s certain is more of everything AI. However, before we forget about data and just “do” AI, whatever that means 👀, here are five trends based on my recent discussions with chief data officers around the world. This list is meant to fuel discussion and debate.
At the forefront of our discussion is the concept of data virtualization. I will highlight some ideas that may initially seem disconnected from routine data tasks but are, in fact, integral to shaping the future of your data strategy and its implementation, and that in turn shape your approach to AI.
And finally, I conclude with the significance of these five trends, because after all, I do have a point, as in I like to think I’m not completely pointless.
1. The inevitable future of data is decentralized
Traditionally, the goal has been to centralize data — to gather it from various sources and store it in a single physical location for analysis. I think this has arisen from a misunderstanding that the concept of a data warehouse requires data to be physically resident in a single location. This approach (or misunderstanding) has been the backbone of systems like Oracle, Hadoop, and more recently, cloud data warehouse solutions like Snowflake. However, as we venture deeper into the era of data-driven (or -enabled, or -empowered) organizations, it’s becoming increasingly clear that this centralized model is unsustainable.
The sheer volume and diversity of data, coupled with the rapid pace at which it’s created and used, make centralization a daunting, if not impossible, task. This is where the future lies in decentralization. We’re moving towards a landscape where data resides in multiple locations — analytical stores, data lakes, lake houses, cloud data warehouses, and on-premises databases and stores.
This decentralized approach is a practical necessity. Data will exist in multiple forms and locations, whether we’re ready for it or not. The challenge now is to adapt to this reality, embracing new strategies and technologies that allow us to manage, govern and analyze data across various decentralized platforms effectively.
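To make the decentralized model concrete, here is a minimal sketch of a federated query layer: one logical query is fanned out to several physical stores and the partial results are merged, leaving the data where it lives. All store connectors and data here are hypothetical stand-ins, not a real virtualization product.

```python
# Minimal sketch of federated querying across decentralized stores.
# Each function stands in for a connector to a real system; the data
# returned here is illustrative only.

def query_warehouse(sql):
    # Stand-in for a cloud data warehouse connector.
    return [{"region": "EU", "revenue": 120}]

def query_lake(sql):
    # Stand-in for a data lake / lakehouse engine.
    return [{"region": "US", "revenue": 310}]

def query_on_prem(sql):
    # Stand-in for an on-premises relational database.
    return [{"region": "APAC", "revenue": 95}]

def federated_query(sql):
    """Run the same logical query against every registered store
    and concatenate the results, without moving the data first."""
    sources = [query_warehouse, query_lake, query_on_prem]
    rows = []
    for source in sources:
        rows.extend(source(sql))
    return rows

rows = federated_query("SELECT region, SUM(revenue) ...")
print(rows)
```

A real engine would also push filters and aggregations down to each source; the point of the sketch is only that the query, not the data, travels.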
2. You can read data across a network faster than reading from a hard drive
Traditionally, it was assumed that co-locating data with processing units was the most efficient way to handle data operations. This belief was grounded in the idea that transferring data across networks was slower than accessing it from a local hard drive, particularly with the limitations of earlier network infrastructures.
However, a recent conversation with Daniel Abadi, a computer science professor at the University of Maryland with whom I am co-authoring a book on data virtualization, suggests a different reality. Reading data across a modern network can be faster than reading it from a traditional hard drive, specifically those spinning at 7,200 RPM. This revelation is not about comparing network speeds to solid-state drives (SSDs), but rather to the classical mechanical hard drives that have been a staple in computing for decades. This increase in network performance is demonstrated by the infrastructure that supports the public and private clouds that we have come to rely on.
This evolution challenges the necessity of keeping data and processing power in close physical proximity, and it opens up new possibilities for data storage, access, and processing strategies.
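A back-of-envelope calculation shows why the comparison favors the network. The figures below are illustrative ballparks (roughly 150 MB/s sequential reads for a 7,200 RPM drive, a 25 Gbps datacenter link), not benchmarks of any particular hardware:

```python
# Back-of-envelope throughput comparison: modern network link vs.
# a classical 7,200 RPM mechanical hard drive. Ballpark figures only.

hdd_mb_per_s = 150                    # typical sequential read for a 7,200 RPM HDD
nic_gbps = 25                         # common modern datacenter network link
nic_mb_per_s = nic_gbps * 1000 / 8    # gigabits/s -> megabytes/s

print(f"HDD:     {hdd_mb_per_s:.0f} MB/s")
print(f"Network: {nic_mb_per_s:.0f} MB/s")
print(f"Network is ~{nic_mb_per_s / hdd_mb_per_s:.0f}x faster")
```

Even granting the HDD generous sequential numbers, the network link wins by more than an order of magnitude, which is the crux of the argument above.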
3. CDOs typically allocate 20% of their resources to focus on innovation
In conversations I’ve had with various EU-based CDOs, a common pattern has emerged: they dedicate about 20% of their team’s time and resources to innovative data pursuits.
A particularly intriguing practice involves setting aside time, typically Fridays or Friday afternoons, for data engineers to engage freely with data. During this period, these professionals have access to various data sets and tools, allowing them to explore, experiment, and play with data without the pressure of specific outcomes. This approach fosters an environment where creativity is encouraged, and failure is seen as part of the learning process, not as an endpoint.
This emerging trend underscores a broader understanding: innovation in data management and usage isn’t just about coming up with new ideas that are relevant to specific business functions. It’s about creating a culture where experimentation and creative thinking are integral to the organization’s approach to data.
4. Data mesh drives decentralized data ownership with data products
Data Mesh represents a shift in data management, moving away from centralized control to a decentralized, business-centric approach. It’s more than just a new technology; it’s a philosophy, a way of rethinking how data is extracted, transformed, and served within an organization.
The core of Data Mesh is the decentralization of data responsibilities. Traditionally, central data teams have been responsible for handling data from operational systems and making it available for analytical purposes. Data Mesh challenges this model by distributing these responsibilities to the individual lines of business within the organization. This approach empowers those who are closest to the data and best understand its nuances to manage and utilize it effectively.
An integral component of Data Mesh is the concept of ‘data products’. A data product can be a dataset, a table, a view, or any data artifact accompanied by contextual metadata. These data products are not just collections of data; they are dynamic entities that are owned, built, and maintained by the specific lines of business. This ownership ensures that the data is not only more relevant but also more accurately reflects the needs and insights of each department.
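The idea of a data product as “data plus contextual metadata” can be sketched in a few lines. The field names below are hypothetical, not taken from any specific Data Mesh tooling:

```python
# Illustrative sketch of a 'data product' record: the artifact itself
# plus the contextual metadata that makes it usable beyond the owning
# line of business. All names and values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                  # e.g. "customer_churn_monthly"
    owner: str                 # owning line of business, not a central team
    location: str              # where the artifact lives (table, view, path)
    schema: dict               # column name -> type
    freshness_sla: str         # how often consumers can expect updates
    tags: list = field(default_factory=list)

churn = DataProduct(
    name="customer_churn_monthly",
    owner="Retail Banking",
    location="lakehouse.marts.churn_monthly",
    schema={"customer_id": "string", "churn_score": "float"},
    freshness_sla="daily",
    tags=["pii-free", "certified"],
)
print(churn.owner)
```

The essential point is the `owner` field: accountability for relevance and accuracy sits with the business line, as the paragraph above describes.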
Data Mesh, therefore, represents a significant departure from traditional data management practices. It breaks down the silos between operational and analytical systems and views data as a product that flows through a pipeline, with different parts of the organization responsible for various stages of its lifecycle. This approach raises important considerations regarding governance, security, and the overall management of data.
The adoption of Data Mesh varies globally. In the United States, there is a focus on developing data products, while in Asia, there is a broader interest in the entire Data Mesh concept. European countries show mixed approaches; for example, German enterprises lean heavily towards a complete Data Mesh approach, whereas organizations in the UK behave more like their American counterparts.
Implementing Data Mesh requires careful consideration of data ownership, the right technology infrastructure, skilled staff, and robust governance and security measures. To make this work, some companies deploy their central data teams as ‘enablement teams’ within business lines, acting as coaches to guide and support the development of data products. Alternatively, some organizations use ‘capability squads’ – specialized units from the central data team that assist business lines in building and managing data products.
5. Your AI Strategy starts with your Data Strategy
No list is complete without mentioning generative AI, because it’s all the rage. While generative AI is currently a buzzworthy topic, its role and application in analytics need to be carefully considered, especially when understanding the relationship between operational and analytical data in organizations.
To illustrate, consider the operational data in a banking system. This data, like account balances, is dynamic, constantly updated to reflect real-time transactions. Operational systems are designed to efficiently manage this flow of data, ensuring accuracy and immediacy. In contrast, analytical data serves a different purpose. It’s used to understand patterns, behaviors, and risks – like determining the spending habits of customers and assessing potential financial exposure.
Generative AI, which is adept at creating data, fits naturally within the operational realm of data management. It’s about generating new data points, scenarios, or simulations that can be used in real-time operational contexts. However, its role in analytics is less direct. Analytical processes often involve using historical data to derive insights and make predictions, which is where classical AI comes into play.
A practical example of this distinction can be seen in the use of generative AI in customer service, such as AI-driven call centers. These systems generate new data – like audio files from customer interactions – which are then analyzed to derive insights. The analytical side might involve using machine learning to assess customer satisfaction or to understand common queries. This analysis produces structured data that can be further enriched and supplied to the GenAI model to optimize future interactions.
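The loop described above can be sketched as a tiny pipeline: a generative step produces the interaction, an analytical step derives structured insight, and that insight is fed back to improve future generations. Every function here is a hypothetical stand-in, not a real call-center API:

```python
# Sketch of the GenAI <-> analytics loop: generate, analyze, enrich.
# All functions are hypothetical illustrations.

def generate_reply(transcript):
    # Generative step: a GenAI model would produce the agent's reply.
    return f"[generated reply to: {transcript}]"

def analyze_interaction(transcript):
    # Analytical step: classical ML derives structured insight
    # (trivially faked here with a keyword check).
    return {"satisfied": "thanks" in transcript.lower()}

def enrich_model(insights):
    # Feed structured insight back to tune future generations.
    rate = sum(i["satisfied"] for i in insights) / len(insights)
    return {"satisfaction_rate": rate}

transcripts = ["Thanks, that solved it", "Still broken"]
insights = [analyze_interaction(t) for t in transcripts]
print(enrich_model(insights))
```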
This differentiation underscores a crucial point: your AI strategy must be grounded in a solid data strategy. It’s not just about choosing the right AI technology; it’s about understanding how that technology will interact with and leverage your existing data.
Bottom line: map out your data strategy. Understand how data flows through your organization, identify the key data points that drive your business, and then consider how AI, and perhaps GenAI, can enhance or optimize these processes.
According to a recent survey, only 54% of managers believe that their company’s AI initiatives create tangible business value.
ROI > TCO: Understand the value of your data before determining your spend on that data
Why are these five data trends worth pointing out?
When organizations run queries to answer business questions, the focus tends to be on the query response time. However, this is just the tip of the iceberg. The real measure of efficiency and value lies in the entire data management process — the time and resources spent from acquiring the data to generating actionable insights.
Consider the scenario where a query takes only five seconds to run, but the preparatory data management process takes a year. The effective response time is not just five seconds; it’s a year and five seconds. Reducing the time to acquire and process data can significantly impact the overall efficiency of data-driven decision-making.
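The arithmetic is worth spelling out, because the preparatory lead time dwarfs the query runtime by seven orders of magnitude:

```python
# Effective response time = data management lead time + query runtime,
# using the year-plus-five-seconds scenario from the text.

prep_days = 365          # time spent acquiring, moving, and modeling the data
query_seconds = 5        # time the query itself takes to run

effective_seconds = prep_days * 24 * 3600 + query_seconds
print(f"Effective response time: {effective_seconds:,} seconds "
      f"(~{effective_seconds / 86400:.0f} days)")
```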
Similarly, the cost of answering a question is not just the computational cost of running a query; it encompasses the entire expenditure on data management, including data migration, storage, processing, and the labor of data engineers. Traditional approaches often involve heavy investment in these processes before even understanding the potential value of the data being manipulated. This method can lead to disproportionate spending with little regard for the actual return on investment.
The key takeaway is that organizations must shift their focus to valuing their data upfront. Before moving data from operational systems to analytical platforms and investing in complex data pipelines, it is imperative to assess the potential value that the data holds. This approach involves understanding the relevance of the data to business goals, the insights it can generate, and how it can influence decision-making.
Understand the balance between cost and value. Companies should adopt a more strategic approach, where the value of data is assessed and understood before significant investments are made in managing and analyzing it. This shift in perspective can lead to more efficient use of resources, better alignment of data strategies with business objectives, and ultimately, a more value-driven approach to data analytics.