Organizations today are using artificial intelligence and machine learning to answer complex business questions. They often follow an artificial intelligence development life cycle to answer those questions. Historically, there have been two well-adopted methodologies: SEMMA and CRISP-DM. There is clearly a lot of merit to decomposing the activities needed to perform advanced analytics on data, and each of these approaches has its own strengths and weaknesses.
However, I tend to think about a higher-level framework for analytics, which I refer to as the 3Ds of artificial intelligence. Before we take a closer look at the AI life cycle, let’s understand what AI can do.
AI use cases | AI solutions
AI has a wide range of real-world applications, with the following three use cases illustrating why the AI life cycle is critical:
Healthcare: AI is making significant contributions to healthcare, including improved disease diagnostics, faster drug discovery, better treatment plans, and more personalized patient care.
Finance: The finance industry leverages AI to improve algorithmic trading, anti-money laundering, fraud detection, and data-driven decision-making.
Autonomous vehicles and transportation: AI is a driving force behind the development of self-driving cars, traffic management and predictive maintenance.
What are the phases of the AI lifecycle?
In data management, a common pattern has emerged: a three-phase life cycle that all organizations adopting artificial intelligence will go through, no matter what technology they’re using and no matter what outcome they’re looking for. The activities within each phase depend on the type of analytics used and will be discussed in more detail in later posts. It’s important to note that answering business questions is likely to require a degree of iteration within each phase and across the different phases of the life cycle.
In this post, I outline the three phases of the artificial intelligence life cycle: (1) data discovery, (2) model development, and (3) model deployment (hence the 3Ds). Also, I will explain what it means to manage one of your precious resources: your data scientists.
Phases of the artificial intelligence (AI) life cycle
Let’s dive into the three phases:
1. Data discovery | Data preparation
Data discovery is the process that data scientists go through at the beginning of an exploration exercise. The business has a question that we need to answer. Once we know the question, we want to know:
- What data is relevant?
- What data do we have in our organization?
- Where is that data and how do I get access to it?
The data team or the data scientist spends a great deal of time in the data discovery phase: performing data analysis, assessing data quality, labeling data, and wrangling datasets. Imagine them going to various DBAs, data stewards, and data owners to request access to their data, ask questions about it, and find out how they’re allowed to use it. It’s very time consuming.
Sure, if data catalogs are available, that helps, but what we tend to see is that data scientists still spend far too much time in this phase. Data discovery takes anywhere from 60% to 80% of a data scientist’s time and effort, and this work forms the basis of what we now call feature engineering.
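The profiling and quality-assessment work described above can be sketched in a few lines. Here is a minimal, illustrative example; the customer records and field names are hypothetical stand-ins for whatever data the discovery phase surfaces:

```python
# Minimal data-profiling sketch: per-column missing-value and distinct counts,
# the kind of quick quality check done early in data discovery.
def profile(rows):
    """Return {column: {"missing": n, "distinct": n}} for a list of dicts."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        report[col] = {
            "missing": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None}),
        }
    return report

# Hypothetical customer records as they might arrive from a source system.
customers = [
    {"id": 1, "plan": "basic", "tenure_months": 12},
    {"id": 2, "plan": "pro", "tenure_months": None},
    {"id": 3, "plan": "basic", "tenure_months": 4},
]
print(profile(customers))
```

In practice this is where profiling tools or a data catalog would take over, but the output is the same in spirit: a quick read on which columns are usable as features.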
One thing that is clear is that this is low-value work for data scientists to be spending the majority of their time on. Where data scientists offer the most value is in the analytical model development phase.
2. Model development phase | What is the AI development process?
Once we’ve got access to the data, done some analysis to confirm that the data is relevant to the question we’re trying to answer, and completed the data discovery process, we can begin to think about data products.
Data products serve as the proverbial platform for the next phase in the cycle: analytical model development. This allows us to consider creating and using data products as training data, validation data, and test data to support the model development phase. Data products are perfectly suited to this because they can include metadata describing provenance and lineage, so that different data scientists can collaborate effectively.
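One way to picture a data product carrying provenance and lineage metadata is as a small structure that records its upstream sources and every transformation applied to it. This is only a sketch; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

# Sketch of a "data product": rows plus the provenance/lineage metadata that
# lets collaborating data scientists see where the data came from.
@dataclass
class DataProduct:
    name: str
    rows: list
    sources: list = field(default_factory=list)          # upstream systems
    transformations: list = field(default_factory=list)  # lineage steps

    def derive(self, name, fn, step):
        """Create a downstream product, appending one step to the lineage."""
        return DataProduct(
            name=name,
            rows=[fn(r) for r in self.rows],
            sources=list(self.sources),
            transformations=self.transformations + [step],
        )

# Hypothetical usage: a raw extract becomes a training data product.
raw = DataProduct("customers_raw", [{"tenure": 12}], sources=["crm_db"])
training = raw.derive(
    "customers_training",
    lambda r: {**r, "tenure_years": r["tenure"] / 12},
    "add tenure_years",
)
print(training.sources, training.transformations)
```

Because every derived product carries its full lineage, a second data scientist picking up `customers_training` can see at a glance where it came from and how it was built.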
In the model development (or model training) phase, we explore the data further, using various statistical routines and mathematical approaches to find a way of bringing the data together to answer the specific business questions we have. As part of this process, the data scientist performs model evaluation, including model performance optimization, using the various data products created during the data discovery phase. This is really where data scientists provide a significant amount of value.
Once we have completed the model development process, we have an analytical model (or ML model). This model could be as simple as a list of rules and thresholds that, when combined, provide an answer to the question.
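The simplest version of such a rules-and-thresholds model can be sketched directly: pick the threshold on a single feature that best separates the two outcomes in the training data. The feature (tenure) and the tiny training set here are invented for illustration:

```python
# Sketch of the simplest "rules and thresholds" model: choose the threshold
# on one feature that maximizes training accuracy. Purely illustrative.
def fit_threshold(examples):
    """examples: list of (feature_value, churned) pairs.
    Returns the candidate threshold with the highest training accuracy."""
    best = (None, -1)
    for candidate, _ in examples:
        # Rule under test: "churned if feature_value <= candidate".
        correct = sum((x <= candidate) == churned for x, churned in examples)
        if correct > best[1]:
            best = (candidate, correct)
    return best[0]

# Hypothetical training data: (tenure in months, did the customer churn?)
train = [(2, True), (3, True), (10, False), (14, False)]
threshold = fit_threshold(train)
predict = lambda tenure: tenure <= threshold  # the resulting "model"
```

Real model development would use richer algorithms and proper validation and test splits, but the shape is the same: training data in, a compact decision rule out.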
3. Model deployment phase | AI models
In the model deployment process, we want to put the output of the analytical model in front of decision makers and stakeholders.
Ideally, we want data scientists to spend as little time here as possible, but remember that this is arguably the most important step. Without getting the outcome into a production environment and in front of decision makers, we won’t make better decisions, and the promised return on our investment will be lost.
We can consider the output of this process to be another data product, possibly one which calls an API or a user-defined function (UDF) in SQL that executes the analytical model and gives us answers to our business questions.
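A deployment of that kind can be sketched as a single scoring function that an API endpoint or a SQL UDF would wrap. The threshold constant and record fields below are hypothetical; in reality they would come out of the model development phase:

```python
# Sketch of the deployment step: the trained model packaged behind one
# scoring function that an API handler or SQL UDF could call.
CHURN_TENURE_THRESHOLD = 3  # hypothetical output of model development

def score_customer(record):
    """Return the model's answer for one customer record."""
    return {
        "customer_id": record["id"],
        "likely_to_churn": record["tenure_months"] <= CHURN_TENURE_THRESHOLD,
    }

# Single-record ("API") call:
print(score_customer({"id": 42, "tenure_months": 2}))

# Batch call across many records:
batch = [score_customer(r) for r in (
    {"id": 1, "tenure_months": 1},
    {"id": 2, "tenure_months": 24},
)]
```

The point of the wrapper is that callers never see the model internals; whether the rule is a threshold or a neural network, the production interface stays the same.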
Lastly, when we deploy the analytical model, we need to consider the data that is input into the model. As in the discovery phase, this data might come from multiple sources and need transformation, cleansing, and organization so that the analytical model can be applied to the latest version of the data efficiently.
This end-to-end life cycle for creating machine learning models forms the basis of the modern MLOps discipline.
Artificial intelligence life cycle example
Consider a marketing example where we’re looking for customer churn. Our data discovery will include understanding a number of different characteristics of our customers, which will become our features.
Model development is where we apply math to figure out the propensity of a given customer to churn.
Then, at the end of the process, we might call a specific API to execute the analytical model for an individual customer and answer the question ‘Is this customer likely to churn?’ Or we may run the model across our entire customer base in a batch process, where the outcome would be a propensity-to-churn score per customer.
Now we can see, in near real time or in batch, that a customer with a higher score is more likely to churn, and a customer with a lower score is less likely to churn.
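The batch scoring just described can be sketched with a logistic function that maps a weighted sum of customer features to a propensity score between 0 and 1. The weights and features here are made up for illustration; in practice they would come out of the model development phase:

```python
import math

# Sketch of batch propensity-to-churn scoring. A logistic (sigmoid) function
# turns a weighted feature sum into a score in (0, 1): higher means more
# likely to churn. Weights and bias are hypothetical.
WEIGHTS = {"tenure_months": -0.2, "support_tickets": 0.5}
BIAS = 1.0

def churn_propensity(customer):
    z = BIAS + sum(WEIGHTS[f] * customer[f] for f in WEIGHTS)
    return 1 / (1 + math.exp(-z))

# Hypothetical customer base scored in one batch run.
customers = [
    {"id": 1, "tenure_months": 2, "support_tickets": 4},
    {"id": 2, "tenure_months": 36, "support_tickets": 0},
]
scores = {c["id"]: round(churn_propensity(c), 3) for c in customers}
```

A short-tenured customer with many support tickets ends up near 1 and a long-tenured, quiet customer near 0, which is exactly the per-customer ranking the marketing team would consume.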
AI project: How Starburst enables your data scientists to do their best work and meet your business metrics
I often get asked, “Is Starburst a data science toolset?” and my default answer is “No, but…” The reason is that, at its heart, Starburst is a SQL query engine, and writing logistic regressions or neural networks in SQL is, if not impossible, close to it.
The “but” in my response is important, though. If you want your data scientists to spend less time on data discovery, and instead spend their time innovating, being productive, and building the best possible analytical model of reality to drive the best possible outcome from your artificial intelligence investment, then Starburst should be part of your AI life cycle.
If your data scientists spend 60% to 80% of their time wrangling, profiling, discovering, and managing relevant data, Starburst can help. With our ability to natively query data lakes and combine that data with data from other systems with no data movement, we can significantly reduce this effort. Anecdotally, this has resulted in organizations spending as little as 20% of their overall effort in the data discovery phase.
Further, when deploying the output of the life cycle to support better decision making and end users, it is very likely that we will have to replicate at least some of the data pipeline and transformation steps performed in the initial data discovery phase. This activity might fall to a data engineer rather than a data scientist, but having a platform that enables data management at scale across data sources will only accelerate the process.
What does this mean?
It means that with Starburst, your data scientists can spend approximately 70% of their time in model development, which means they can experiment, innovate, and build more analytical models.
Moreover, they can build better analytical models and get the results of those models into the operations of the business faster and more efficiently, which will ultimately result in better organizational performance.