Peruse the technology solutions in today’s data space and you’ll find it very difficult to put your finger on exactly what each firm’s products or services do. Two solutions that provide very different capabilities often carry nearly identical messaging. Moreover, buzzwords drift to the front of the line, ahead of hard realities like data gravity, data silos, data recency, and other very real limiting factors standing in the way of data-driven digital transformation. Data science is no exception. The hairy challenges of getting models, algorithms, and other key data assets to production often lurk in the shadows while we praise the amazing tools available and the business value they will enable. When it comes to data science, I have been most puzzled by the omission of data access as a very real blocker to success. In this blog, I detail just how significant data access really is to a successful data science strategy.
Benefits of Data Access in ML/AI Projects
Data access is one of the most overlooked hardships of ML/AI, a towering obstacle standing between brilliant data minds, their tools, and the promise of data-driven decisions. The benefits on the other side of that obstacle are regularly on display in marketing efforts: faster time to market, operational efficiency, risk mitigation, customer 360, increased profits, and more. The potential upside and business transformation are seemingly limitless.
For instance, in healthcare, Nina Schwalbe, MPH, adjunct professor in the Heilbrunn Department of Population, and Brian Wahl, PhD, assistant scientist in the Department of International Health at the Johns Hopkins Bloomberg School of Public Health, said, “Enabling access across borders will require new types of data sharing protocols and standards on interoperability and data labeling. This global movement could be facilitated by an international collaboration so that data are rapidly and equitably available for the development and testing of AI-driven health interventions.”
ML/AI Models Are As Good As Your Data
When we unpack the process of getting ML/AI into production, we learn that these solutions are only as valuable as the data used to create them and to continually enrich them. This puts data exploration, data discovery, and unimpeded, iterative questioning front and center. This type of data work is rapidly becoming the majority of the workload. In essence, your data teams’ ability to make a difference will rely heavily on how quickly and efficiently they can fold this exploratory work into their model creation workflows.
When we look at market leaders and market challengers alike, they lay out a path to success for ML/AI that starts with a curious assumption: that all of the data you need is at your fingertips. The unfortunate reality is that this is rarely the case, and the data needed to shape and reshape ML/AI solutions is constantly evolving. This seldom-recognized “elephant in the room” (no pun intended with Hadoop… or is there?) is data access, one of the most prohibitive blockers to getting data science frameworks up and running and out the door adding value.
Data Access for Data Consumers
The volume of data (and its growth) creates an opportunity for everyone within the organization to be a data consumer. However, the number of questions those consumers can ask is effectively unlimited, and answering them has created acute pain around data access and the requirement to centralize data through constant movement and copying. This pain is compounded further when you consider the importance of data discovery and data exploration.
Let’s think about how data discovery has traditionally been done: data scientists or analysts request large volumes of data from all corners of the data estate to be moved into a lake or warehouse. Seems straightforward, right? Wrong. This is a messy, laborious, and all-around painful process. For all of that pain, it typically takes a lot of time and produces data quality and recency issues. Imagine if archeologists scoured the Earth for dig sites but, instead of setting up shop and excavating on location, carved out however many square miles or tons of earth and had it airlifted back to their labs and museums. Then, hoping they had grabbed the right plots of land, they began uncovering artifacts to show to the public. How impractical would that be?
How Starburst Helps With Data Access
Good news! There is a way to provide secure, performant access to data without moving or copying it. Without any lifting, shifting, ripping, replacing, or migrating, your data consumers, and data scientists in particular, can begin to leverage SQL as a common language for data exploration on any data, in any system! Starburst can deploy anywhere: on all major clouds, Kubernetes, VMs, or bare metal. It connects with the data consumers’ BI tool, SQL editor, or custom application of choice and provides near-real-time access for data discovery and exploration. We are not a data science tool, but a friend of ML/AI, able to accelerate key assets (models, algorithms, etc.) to production by streamlining the discovery process.
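To make the idea of exploring data where it lives a little more concrete, here is a minimal sketch in Python. It is only an analogy and not Starburst’s API: two SQLite files stand in for separate source systems, and SQLite’s ATTACH stands in for a query layer that can reach both. The file names, table names, helper functions, and sample rows are all invented for illustration.

```python
import os
import sqlite3
import tempfile


def build_sources(directory):
    """Create two independent SQLite files standing in for separate systems.

    Both files and their contents are hypothetical sample data.
    """
    crm_path = os.path.join(directory, "crm.db")
    orders_path = os.path.join(directory, "orders.db")

    crm = sqlite3.connect(crm_path)
    crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    crm.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
    )
    crm.commit()
    crm.close()

    orders = sqlite3.connect(orders_path)
    orders.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    orders.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 120.0), (2, 80.0), (3, 40.0), (1, 60.0)],
    )
    orders.commit()
    orders.close()

    return crm_path, orders_path


def revenue_by_region(crm_path, orders_path):
    """Join across both sources with one SQL statement, without copying rows
    from either file into a central store first."""
    conn = sqlite3.connect(":memory:")
    # ATTACH makes each file addressable under its own schema name.
    conn.execute("ATTACH DATABASE ? AS crm", (crm_path,))
    conn.execute("ATTACH DATABASE ? AS sales", (orders_path,))
    rows = conn.execute(
        """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        crm_db, orders_db = build_sources(workdir)
        print(revenue_by_region(crm_db, orders_db))
        # prints [('APAC', 80.0), ('EMEA', 220.0)]
```

The point of the sketch is the shape of the workflow: the question is asked in SQL against the sources as they are, so a follow-up question is just another query and not another data movement request.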