Before I joined Starburst, I worked in the AdTech industry, where companies buy and sell user data for targeted online advertising campaigns or ML/AI-based product development. There, I experienced three technical pain points that are widespread in the industry.
First, building a suitable pipeline can be highly arduous, because each company manages its data in its own way.
Second, data transfers produce duplicate historical data, so the same data sits unnecessarily in multiple object stores.
Third, companies have to run recurring ETL jobs, such as filtering out the data of opted-out users or refreshing timestamps on existing data sets, to comply with privacy laws (CCPA, GDPR, etc.) or to preserve the value of the data.
At Starburst, I discovered that its data federation and access control features, together with Trino's fault tolerance and granular task retries, have great potential to solve these three pain points that AdTech companies experience today.
Figure 1: Existing User Data Journey
Figure 1 shows my understanding of how user data travels from the moment it originates until it comes back to the user. An app collects user data (email address, IP address, GPS location, etc.). Because an individual app rarely has enough scale, data brokers aggregate data from many apps to create bigger data sets. These aggregators then sell the data sets to different players, such as data marketplaces and agencies that help their brand clients purchase data. The brand customers use the data for their own business goals, such as targeted ad campaigns or product research and development.
Every time a data set changes hands between players, the same pain points recur:
- Identifying similarities and differences between each other’s systems
- Building custom pipelines
- Transferring data
- Maintaining the pipeline health
- Managing historical data
Starburst's data federation and access control features, together with Trino's fault tolerance and granular task retries, can alleviate these struggles.
Data federation simplifies data management
Figure 2: External Database Federation
First, data federation saves companies the time and effort of bridging the gap between different systems. Figure 2 shows a common painful situation in data transfer. If one party uses AWS and the other uses Azure, finding an optimal and safe way to transfer data on a regular basis can become highly complicated. But even if one party uses MongoDB on Azure and the other uses a MySQL database on AWS, data federation with Starburst can give both parties easier access to the data set. It also lets companies keep using their existing tools and systems, saving the large amount of engineering and hardware resources that an architecture change would require.
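To make the idea concrete, here is a minimal sketch of what a federated query could look like. It only builds the SQL text and does not connect to anything. Trino addresses tables as catalog.schema.table, so a single query can join two systems. The catalog, schema, table, and column names below (mongodb_azure, mysql_aws, device_events, user_segments, and so on) are hypothetical placeholders, not names from any real deployment.

```python
def federated_join_sql(left: str, right: str, key: str, columns: list[str]) -> str:
    """Build a SELECT that joins two tables living in different catalogs.

    Both table names must be fully qualified as catalog.schema.table.
    """
    for name in (left, right):
        if name.count(".") != 2:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list}\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}"
    )


# Hypothetical catalogs: one backed by MongoDB on Azure, one by MySQL on AWS.
query = federated_join_sql(
    left="mongodb_azure.adtech.device_events",
    right="mysql_aws.crm.user_segments",
    key="user_id",
    columns=["l.user_id", "l.ip_address", "r.segment"],
)
print(query)
```

Neither party has to move or reshape its data for this query to be expressible. The join happens in the query engine rather than in a custom pipeline.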
Secure and fast access to data
Figure 3: Single point of access
Second, Starburst's access control capability, via built-in access control or Apache Ranger, can reduce data duplication by providing a single point of access for multiple users. Figure 3 shows that users in the marketing department and users in the R&D department can query the same database instead of copying duplicate data sets into each department's separate database. Administrators can configure each user's access rights to catalogs, individual schemas, and tables, which saves the resources otherwise spent duplicating data just to enforce access boundaries.
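The single-point-of-access idea can be illustrated with a toy model. The sketch below is not Starburst's built-in access control or the Apache Ranger API. It is a simplified, assumed rule format that shows how one shared data set can serve several departments with different rights. The group names, table names, and the first-match, default-deny evaluation order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TableRule:
    group: str             # department or role, e.g. "marketing"
    table: str             # catalog.schema.table pattern; "*" matches any run of characters
    privileges: frozenset  # e.g. frozenset({"SELECT"})


def allowed(rules, group, table, privilege):
    """Return True if the first rule matching (group, table) grants the privilege."""
    for rule in rules:
        if rule.group == group and fnmatchcase(table, rule.table):
            return privilege in rule.privileges
    return False  # no matching rule: deny by default


# Hypothetical rules: both departments read the same tables, with different scopes.
RULES = [
    TableRule("marketing", "mysql_aws.crm.user_segments", frozenset({"SELECT"})),
    TableRule("rnd", "mysql_aws.crm.*", frozenset({"SELECT"})),
    TableRule("rnd", "mongodb_azure.adtech.*", frozenset({"SELECT", "INSERT"})),
]

print(allowed(RULES, "marketing", "mysql_aws.crm.user_segments", "SELECT"))
print(allowed(RULES, "marketing", "mongodb_azure.adtech.device_events", "SELECT"))
```

The rights live in the rules rather than in separate copies of the data, so adding a department means adding a rule instead of another database.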
Stable ETL jobs
Third, Trino's fault tolerance and granular task retries can help companies keep their user data clean and legally compliant in a federated data ecosystem. These new features help Starburst users run more stable ETL jobs through better adaptive planning, resource management, and fine-grained failure recovery. Under privacy laws like CCPA and GDPR, AdTech companies are responsible for deleting the personal data of users who request its removal. In addition, data loses value unless its scan timestamp is refreshed frequently, because old user data is less accurate. Jobs that filter out opted-out users and refresh timestamps can run across all databases federated under Starburst, so companies can expect more cost- and time-efficient ETL processes.
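As a rough sketch of such a recurring job, the snippet below generates the two kinds of statements described above for every federated table: a DELETE for opted-out users and an UPDATE that refreshes the scan timestamp. It only builds SQL text. All table and column names (opt_out, recently_seen, scanned_at, user_id) are hypothetical, and the idea of driving the refresh from a "recently seen" table is an assumption. Whether a given connector accepts DELETE or UPDATE depends on its write support. Task-level retries themselves are enabled on the cluster (in Trino, through the retry-policy=TASK configuration property plus an exchange manager), which is not shown here.

```python
def opt_out_delete_sql(target: str, opt_out: str, key: str = "user_id") -> str:
    """DELETE rows in `target` whose key appears in the opt-out table."""
    return f"DELETE FROM {target} WHERE {key} IN (SELECT {key} FROM {opt_out})"


def refresh_timestamp_sql(target: str, recently_seen: str,
                          ts_column: str = "scanned_at", key: str = "user_id") -> str:
    """UPDATE the scan timestamp for rows whose key was recently observed."""
    return (
        f"UPDATE {target} SET {ts_column} = current_timestamp "
        f"WHERE {key} IN (SELECT {key} FROM {recently_seen})"
    )


def compliance_statements(targets, opt_out, recently_seen):
    """Return the statements to run, deleting before refreshing for each target."""
    statements = []
    for target in targets:
        statements.append(opt_out_delete_sql(target, opt_out))
        statements.append(refresh_timestamp_sql(target, recently_seen))
    return statements


# Hypothetical federated tables; each statement would be submitted as its own query.
statements = compliance_statements(
    targets=["lake.adtech.profiles", "mysql_aws.crm.user_segments"],
    opt_out="lake.privacy.opt_out",
    recently_seen="lake.adtech.recently_seen",
)
for statement in statements:
    print(statement)
```

Because each statement is an independent query, a failure in one target does not force the whole batch to restart, and with task retries enabled a long-running statement can recover from a failed task instead of starting over.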
Balancing risk, cost optimization, and query performance
AdTech companies and Starburst should carefully weigh legal risk, cost optimization, and query performance. In particular, federating databases that belong to different companies requires a more in-depth review. Still, Starburst is a pioneer and leader in data query engines, especially in data federation, which in my view makes it the best choice for AdTech companies to reach out to.