Early Adoption of the Self-Service Data Infrastructure

By: Phillip Radley
May 16, 2022

Ten years ago, an earlier version of Data Mesh (self-service data infrastructure and data as a product) was implemented at British Telecom (BT). As a former Chief Data Architect at BT and a key stakeholder in the development of its decentralized data infrastructure, I took away many lessons that I hope to pass on to today’s Data Mesh adopters. I presented these ideas at Starburst’s second annual conference, Datanova: The Data Mesh Summit.

 

Building a big data infrastructure

Before Data Mesh, there was Big Data. By the time our data had volume, velocity, and variety, our challenge at BT was to scale: our existing infrastructure couldn’t keep up with the needs of the business and the ambitions of our team. We needed a big data infrastructure and we needed it fast, but we didn’t have the tools, the budget, or an experienced team.

However, we enlisted the data center team, who built the servers and infrastructure, and partnered with Cloudera, who provided a Hadoop cluster and support. With all the ingredients in place, we pitched our big data infrastructure plan, with a speculative (best-guess) ROI, to the senior leadership team, who approved the initiative. Four months later, our HaaS (Hadoop as a Service) platform was up and running!

Self-service requires a balance of governance 

With a self-service platform that anyone could use, we ran into an access problem. Our onboarding chart showed a flat onboarding rate, and we later learned why: we had too much governance and hadn’t made self-service easy to use.

This was how one data scientist depicted the access process:

How we simplified access governance

Ultimately, we reduced the bureaucracy and simplified the governance process with the Hadoop User Group (HUG). We had a new plan in place with a four-step process for data scientists, business analysts, and product managers. The user would:

  1. Fill out a simple SharePoint form.
  2. Attend a weekly governance meeting. New projects had 10 minutes per request to pitch.

At that meeting, we wanted to know:

  1. Do we have a business domain owner?
  2. Any GDPR or ethical issues?
  3. Is somebody already performing the request on the platform?

From a compliance perspective, if there were no new requests or pitches, we’d audit services at random to see what was being used and whether it was being used for the correct purpose.

  3. Set up their environment.
  4. Log in and create their data product!
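
The three questions from the governance meeting could be captured as a simple record, which also leaves a trail for later audits. What follows is a minimal Python sketch under my own assumptions, not the tooling we used at BT: the `OnboardingRequest` class, its field names, and the `review` function are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OnboardingRequest:
    """Hypothetical record of one pitch at the weekly governance meeting."""
    project: str
    domain_owner: Optional[str]      # business domain owner, if one exists
    gdpr_or_ethical_issues: bool     # were any GDPR or ethical concerns raised?
    duplicates_existing_work: bool   # is somebody already doing this on the platform?


def review(request: OnboardingRequest) -> List[str]:
    """Return the blocking concerns; an empty list means the request can proceed."""
    concerns = []
    if not request.domain_owner:
        concerns.append("no business domain owner")
    if request.gdpr_or_ethical_issues:
        concerns.append("GDPR or ethical issues to resolve")
    if request.duplicates_existing_work:
        concerns.append("work already being done on the platform")
    return concerns


# Example values are invented for illustration.
request = OnboardingRequest(
    project="churn-model",
    domain_owner="Consumer",
    gdpr_or_ethical_issues=False,
    duplicates_existing_work=False,
)
print(review(request))
```

Returning a list of concerns rather than a yes/no answer keeps the 10-minute pitch focused: the team leaves knowing exactly what to fix before they come back.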

Within four years of implementation, we were managing 104 distinct data products that we had identified as viable, across seven domains, and had settled into our early-stage Data Mesh. We now had the ability to scale with the business and put to rest the problems we faced when managing fixed compute and storage resources. It was a successful experiment that was ahead of its time.

Six Lessons on Self-Service Data Infrastructure 

The journey toward our early Data Mesh was not without challenges. Here are six tips for those looking to do the same at their organizations:

#1 Find a domain-agnostic team to build and run it

Currently, most organizations have a data and analytics team that is tasked with managing and orchestrating all of the pipelines. In practice, it’s a real struggle to create and move pipelines, which makes it an operational challenge for the business.

Instead, keep onboarding and the user experience simple! Expose features through standard APIs. Let the project/product teams do their own data ingestion, processing, and serving of data. You can certainly help them with simple tools and examples showing how to ingest data and develop their domain-specific processing jobs.

#2 Keep the launch feature set small 

Offer the basics to start. This will streamline your launch, and you can then iterate on and advance the features based on what your users ask for. You’ll find that the users taking advantage of this program won’t be from a technical background, so they won’t know what to ask for. At launch, start them out with a small amount of storage and compute, then allow them to come back and ask for more in increments, if necessary. This was key to getting teams to use the platform.
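
The idea of starting small and growing in increments can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the `TeamQuota` class, the 100 GB starting allocation, the 100 GB increment, and the 1,000 GB ceiling are all hypothetical, not the figures we used.

```python
from dataclasses import dataclass


@dataclass
class TeamQuota:
    """Hypothetical storage allocation for one onboarded team."""
    team: str
    storage_gb: int = 100        # assumed small starting allocation
    increment_gb: int = 100      # assumed size of each extra grant
    max_storage_gb: int = 1000   # assumed ceiling before a fuller review

    def request_more(self) -> bool:
        """Grant one increment if the total stays within the ceiling."""
        if self.storage_gb + self.increment_gb > self.max_storage_gb:
            return False
        self.storage_gb += self.increment_gb
        return True


quota = TeamQuota(team="network-analytics")
granted = quota.request_more()
print(quota.team, quota.storage_gb, granted)
```

A fixed increment keeps each request cheap to approve, while the ceiling marks the point where a team should come back for a proper conversation about its needs.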

#3 Reduce the barrier to entry for new data project/product teams

You only need a few successful products for the platform to deliver a good ROI. Lower the barrier to entry so you can onboard platform teams quickly and eliminate the bottlenecks that many larger enterprises face when they’re too selective.

#4 Access source-aligned data products

Help operational systems land and expose their data. Spend time helping and coaching your data scientists on how to access that platform. It will be a new experience for them since they are used to going through the data warehousing teams, but setting a daily cadence for them can ensure they are taking full advantage of those products.

#5 Be very clear on the terms and conditions of your charging cycle

If you are not charging, you don’t need to offer SLAs, but you need to set boundaries in case people run immense loads. Incorporate that into your onboarding to make sure all team members are aligned on what their limits are.

#6 Audit for governance

When you adopt a self-service approach, you need to monitor the checkout/onboarding. Review records of who is onboarded and for what purpose. Just like the self-checkout at the supermarket, it can’t be completely self-driven. Assign an owner to keep a close eye on what’s going on. You don’t want to find unexpected items in your bagging area, especially when dealing with sensitive customer data!
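
A random spot check of onboarding records is easy to automate. The sketch below is a hedged Python illustration, not our actual audit process: the `pick_for_audit` function and the record fields (`team`, `purpose`) are hypothetical names, and the sample data is invented.

```python
import random
from typing import Dict, List, Optional


def pick_for_audit(records: List[Dict[str, str]], sample_size: int,
                   seed: Optional[int] = None) -> List[Dict[str, str]]:
    """Pick a random subset of onboarding records for manual review."""
    rng = random.Random(seed)              # pass a seed for a repeatable selection
    k = min(sample_size, len(records))     # never ask for more than exist
    return rng.sample(records, k)


# Invented example records: who was onboarded and for what purpose.
records = [
    {"team": "network-analytics", "purpose": "fault prediction"},
    {"team": "consumer-insight", "purpose": "churn modelling"},
    {"team": "finance-reporting", "purpose": "billing reconciliation"},
]

for record in pick_for_audit(records, sample_size=2, seed=42):
    print(record["team"], "-", record["purpose"])
```

The selection is only the first step: the assigned owner still has to compare each sampled team’s stated purpose with what is actually running on the platform.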

Data Mesh implementation

Whether you’re in the middle of a Data Mesh implementation or just getting started, I hope you found some value in this story of the early days. The problems we faced at BT 10 years ago are the same ones that numerous enterprises are still facing today. That said, it’s great to see so many voices and experts emerge around Data Mesh these last few years, providing teams with more guidance and education. I now work at Thoughtworks alongside Zhamak Dehghani herself, offering technical leadership to clients engaged in complex enterprise-level data and AI challenges. I have come full circle from my humble beginnings!

For more on Data Mesh, visit the Data Mesh Resource Center.

Phillip Radley

Principal Data & AI Strategy Consultant, Thoughtworks
