The Trino open source distributed query engine is known as a choice for running ad-hoc analysis where there’s no need to model the data and it can be easily sliced and diced. As such, Trino can be also leveraged for running geospatial workloads atop different data sources.
When it comes to running geospatial queries, Trino is OpenGIS compliant and supports a wealth of geospatial-specific functions using WKT/WKB vector data formats. Its diverse array of functions gives you the ability to unify and join geospatial data from multiple sources. For example, you can join points of interest by using Trino’s Postgres connector with events tables stored on S3 by using the Hive connector.
In this post, we’ll walk you through two methods for running geospatial queries on the data lake using Trino’s Hive connector to query parquet files on S3, explore some optimizations to help you accelerate and improve the interactivity for your geospatial queries.
Deep-Dive: Geospatial Analysis Methods with Trino
Leveraging Trino’s native Geospatial support and Bing tiles system are two methods for running interactive geospatial queries for common use cases. In this section, we’ll demonstrate these two methods by querying the trips_data table (4.4 billion rows stored in parquet files) and the places table (10 rows stored in parquet files) from the public ride-sharing dataset by using Trino’s Hive connector.
For reference the trips_data table DDL is as follows:
CREATE TABLE trips_data( tripid int, t_hour int, t_min int, d_weekday int, t_start_ts timestamp, fare double, src_zone int, dst_zone int, duration double, distance double, rider_id int, rider_age int, rider_first varchar, rider_last varchar, rider_gender varchar, driver_id int, driver_age int, driver_first varchar, driver_last varchar, driver_gender varchar, ts timestamp, lon double, lat double, point_id int, last_point int, d_date date) WITH ( external_location= '(location)', partitioned_by = ARRAY['d_date'], format = 'PARQUET') );
Places table locations:
Here is a list with some common geospatial analysis use-cases we came across:
- Getting the details of events near points of interest, for example, trips in a ride-sharing app passing by various distances from a billboard or a store — by using Trino’s great_circle_distance function
- Intersecting events (such as taxi trips) with a well-defined polygon (such as neighborhoods) — by using Trino’s ST_Contains function
- Finding out if a polygon contains a specific point, for example, if a trip started or ended in a specific area of the city — by using Trino’s ST_Contains function
Method 1 – Leveraging Trino’s Geospatial Support
Trino offers a wide range of geospatial functions that can be used out of the box.
The first use case might require running queries that get all the riders that were up to some distance from the points of interest (restaurants, billboard, etc.).
Below you will find such a query that joins the trips_data table and the places table, and counts the number of riders in trips that passed up to 500 meters from the points of interest stored in the places table.
Finding Adjacent Geographical Data Points
SELECT count(distinct rider_id) FROM hive.ride_sharing_dataset.trips_data as t, hive.ride_sharing_dataset.places as p WHERE great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.5;
Finding Data Points in Defined Geographical Locations
Another example for using the Geospatial functions in Trino is running queries related to use-case 2 in which we look up for events that took place in a well-defined area. Below you can find such a query that counts the number of drivers that drove in a specific area:
SELECT count(distinct driver_id) FROM ride_sharing_dataset.trips_data t WHERE ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752, -122.45635986328125 37.072130604487681))'), st_point(lon, lat));
In the above query the Geospatial functions used in the predicate were not pushed down to the Hive connector, resulting in a massive data read. In order to mitigate this, it’s possible to build an additional bounding box by using the lon and lat columns as suggested in query #2.1 or to use the Bing tiles system as suggested in query #4.
Building Additional Bounding Box
SELECT count(distinct driver_id) FROM ride_sharing_dataset.trips_data t WHERE lon between -122.45635986328125 and -122.45567321777344 and lat between 37.72130604487681 and 37.72184917678752 and ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752))'), st_point(lon, lat));
Here we built a bounding box using the lat, lon columns which include the polygon of interest. The filters on these columns are being pushed down to the Hive connector, which results in reading less data from S3, compared to query #2.
Although we saw a performance improvement by adding the bounding box predicate both on the Hive connector, it’s important to note that dynamically building a bounding box is not always straightforward.
Method 2 – Using Bing Tiles
Bing tiles are a way to define map areas at different resolutions:
They are stored as “quadkeys”:
- Quadkey = “03” defines a tile at zoom level 2, tile side length is ~7,000 kilometers
- Quadkey = “03023323” is zoom level 8, which side length is ~60 kilometers at the equator
- We used level 19, which side length is ~75 meters
Note that a prefix of a tile quadkey is always a larger tile that includes that tile at a lower resolution.
Building Bing Tile Using Trino’s CTAS Command
In order to utilize the Bing tile segmentation, we can either create the Bing tile during the query runtime (as we did in query #4) or by using an ETL/ELT procedure that will add a new quadkey column to both the trips_data and the places table. In this article, we used Trino’s CTAS command for the ELT procedure described here.
Once the quadkey column is in place we can now JOIN the places table and the table trips_data table by the quakey column.
CTAS for creating the trips_data_bing table
CREATE TABLE hive.ride_sharing_dataset.trips_data_bing with (format = 'PARQUET') as SELECT bing_tile_quadkey(bing_tile_at(lat, lon, 19)) as qkey, * FROM hive.ride_sharing_dataset.trips_data;
Finding adjacent geographical data points using Bing tiles
Below in Query #3 we implement the same logic as query #1 and utilize the Bing tiles for doing the join between the places and trips table.
with p as ( SELECT lat, lon, qkey FROM ride_sharing_dataset.places cross join unnest (transform(bing_tiles_around(lat, lon, 19, 0.5), t -> bing_tile_quadkey(t))) as t(qkey)) -- use that list to filter matching tiles SELECT count(distinct rider_id) FROM ride_sharing_dataset.trips_data_bing as v, r WHERE t.qkey = p.qkey -- match the bounding tile and great_circle_distance(t.lat, t.lon, p.lat, p.lon) <= 0.5;
This query joins the table by the quadkey column, which significantly reduces the number of rows read and boosts the query performance thanks to the dynamic filtering optimization.
Finding Data Points in Defined Geographical Locations Using Bing Tiles
Bing tiles can also be leveraged as can be seen below in query #4, which implements the same login we saw on query #2:
with q as ( SELECT qkey FROM ( VALUES ('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752)) ') ) as a(p) CROSS JOIN unnest (transform(geometry_to_bing_tiles(ST_Polygon(a.p), 19), t -> bing_tile_quadkey(t))) as t(qkey) ) SELECT count(distinct driver_id) FROM ride_sharing_dataset.trips_data_bing v, q WHERE q.qkey = t.qkey and ST_Contains(ST_Polygon('POLYGON ((-122.45635986328125 37.72130604487681, -122.45567321777344 37.72130604487681, -122.45567321777344 37.72184917678752, -122.45635986328125 37.72184917678752))'), st_point(lon, lat));
When creating the Bing tile during query execution, execution time was slower for this query compared to the equivalent query #2, which might suggest that for this type of queries creating the Bing tile during query execution might not always be beneficial. That being said, when we ran the same experiment using Starburst Warp Speed we did see a significant improvement.
After running the above queries we can then compile the data into a mini benchmark that details the queries’ runtime and showing the improvement factor after implementing the suggested optimization. Here are the results of a mini benchmark data, using 4Xr6gd.4xlarge machines on Amazon Web Services.
Improve geospatial query performance by 4X-6.8X
Running these queries underlines the fact that you can use Trino to run geospatial queries on your data lake thanks to its out-of-the-box support for geospatial functions.
However, by simply applying the suggested methods above you can, on average, we saw our geospatial query performance from 4X to a maximum of 6.8X.
You can leverage your investment in Trino to run geospatial analysis today on the Hive connector by applying any of the methods and optimizations described in this article.
Originally published on thenewstack.io
Webinar: Setting A New Standard of Data Lake Analytics
Reduce costs and improve query response times