Data Lakes without Hadoop
Cloud migration dominates the news, and many companies are shuttering their data centers and letting cloud providers run their infrastructure for them. Elasticity, simplicity, and infrastructure agility are all good reasons to move, but many companies continue to host their own infrastructure, whether for security reasons or because they believe the cloud doesn’t provide a cost benefit in their scenario.
For these companies, building a data lake usually means setting up a Hadoop cluster and choosing a vendor to support it (although this is less of a need than it used to be). Organizations like the idea of a company-wide object store that can hold a variety of data, both structured and unstructured. A number of companies offer S3-compatible object storage software that can be installed anywhere. Minio, which we use in this post, is one example.
One of the advantages of deploying your own object store is that you get to use your own storage. This could be storage that you already own, or a new cluster built from commodity servers that combine into one large storage pool. Since most of these storage engines support Amazon’s S3 protocol, they work seamlessly with Presto and allow you to query data directly out of your on-premises data lake.
In this blog post, we’ll walk you through connecting Presto to a Minio object storage server. For our testing, we used a single Minio server with some attached storage, but this could just as well be a fully distributed system with numerous hosts serving up the object storage.
For this simple demo, we had the following:
- Hive metastore installed in stand-alone mode on a virtual machine
- Presto Coordinator and one Presto worker installed on virtual machines
- Linux VM with a 100GB mount point named /data01 for the Minio server
To set up the Minio server:
- Download the Minio server at https://www.minio.io/downloads.html. We chose the x64 Linux version.
- Set the storage point. For our example, we had a small Linux VM already installed with a storage mount point of /data01.
- Start the Minio server and point it at this mount point. We simply typed: ./minio server /data01/. Note the AccessKey and SecretKey for later.
[root@minio ~]# ./minio server /data01/
Drive Capacity: 89 GiB Free, 90 GiB Total
Endpoint: http://10.70.0.105:9000 http://127.0.0.1:9000
Command-line Access: https://docs.minio.io/docs/minio-client-quickstart-guide
$ mc config host add myminio http://10.70.0.105:9000 OQ4K2YFH7S1BE6ZM10M0 Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq
Object API (Amazon S3 compatible):
Now let’s add some data into Minio
- For our example, we created a very simple text file:
[root@minio]# cat customer.csv
5 Bob Jones
6 Phil Brune
- Then we created a new bucket to store our files: mkdir /data01/testbucket
- Lastly, we copied customer.csv into this new bucket: cp customer.csv /data01/testbucket
- To verify that the file and bucket are there, we pointed a browser at the Minio IP on port 9000.
The Minio browser lists our bucket, testbucket, and our customer.csv file. Now that our Minio server is started and has data, let’s connect Presto to query the data.
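A note on the file format: a table declared with format = 'TEXTFILE' and no explicit delimiter reads fields separated by the Ctrl-A character (octal 001), which is Hive's default and does not show up in a terminal. The steps above can be sketched as follows, assuming the sample file uses that default delimiter. DATA_ROOT is a hypothetical variable standing in for the Minio mount point (/data01 in our setup); it defaults to a scratch directory so the script is safe to try anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: DATA_ROOT stands in for the Minio mount point (/data01 in this post).
# It defaults to a scratch directory so nothing on a real server is touched.
DATA_ROOT="${DATA_ROOT:-$(mktemp -d)}"
BUCKET_DIR="$DATA_ROOT/testbucket"

# As in the steps above, the bucket is a directory under the mount point.
mkdir -p "$BUCKET_DIR"

# Assumption: fields are separated by Ctrl-A (octal 001), Hive's default
# TEXTFILE delimiter.
printf '5\001Bob\001Jones\n6\001Phil\001Brune\n' > "$BUCKET_DIR/customer.csv"

# cat -v makes the non-printing delimiter visible as ^A,
# so this prints 5^ABob^AJones and 6^APhil^ABrune.
cat -v "$BUCKET_DIR/customer.csv"
```

If your file uses commas or tabs instead, the table definition would need a matching delimiter setting, which is outside the scope of this sketch.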
Configuring Hive Metastore
In order for Presto to connect to Minio, it needs a cataloging service which the Hive Metastore provides. It tells Presto how the tables are defined and where the data is located.
1. We add the following configuration to the core-site.xml on Hadoop:
<description>Enables or disables SSL connections to S3.</description>
<description>AWS S3 endpoint to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, the standard region (s3.amazonaws.com) is assumed.</description>
<description>AWS access key ID.</description>
<description>AWS secret key.</description>
<description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.</description>
<description>The implementation class of the S3A Filesystem</description>
2. Note the access key and secret key. These were output when we started the Minio server.
3. Restart Hadoop and Hive and we’re ready to go.
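The descriptions in step 1 match the standard S3A properties in Hadoop's core-default.xml. As a rough sketch, assuming those property names and the endpoint the Minio server printed at startup, the entries could look like the following. The key values are placeholders, and the MINIO_* variables and the scratch snippet file are hypothetical names introduced here for illustration; the output is meant to be merged by hand into the configuration element of core-site.xml.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variables: replace the placeholders with the values Minio
# printed when it started.
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://10.70.0.105:9000}"
MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:-YOUR_ACCESS_KEY}"
MINIO_SECRET_KEY="${MINIO_SECRET_KEY:-YOUR_SECRET_KEY}"

# Write to a scratch file instead of editing core-site.xml in place.
# SSL is disabled because the endpoint above is plain http; path-style
# access avoids needing DNS entries for each bucket.
SNIPPET="$(mktemp)"
cat > "$SNIPPET" <<EOF
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>$MINIO_ENDPOINT</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>$MINIO_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>$MINIO_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
EOF

# Print the snippet for review before merging it into core-site.xml.
cat "$SNIPPET"
```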
For our demo, we had a single-node Presto configuration.
1. We first create a new Hive catalog for our Minio server:
[root@coordinator catalog]# more minio.properties
2. This file also requires the access key and secret key from Minio.
3. We then add this new catalog to our Presto cluster:
presto-admin catalog add minio
4. Lastly, we restart Presto to allow the changes to take effect:
presto-admin server restart
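The contents of minio.properties are not reproduced in step 1, so here is a minimal sketch of what such a catalog file could contain, using property names from the Presto Hive connector. The metastore host name (metastore-host), the MINIO_* variables, and the scratch output path are assumptions made for illustration; the endpoint is the one the Minio server printed at startup.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values: metastore-host is a placeholder host name, and 9083
# is the Hive metastore's default Thrift port.
METASTORE_URI="${METASTORE_URI:-thrift://metastore-host:9083}"
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://10.70.0.105:9000}"
MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:-YOUR_ACCESS_KEY}"
MINIO_SECRET_KEY="${MINIO_SECRET_KEY:-YOUR_SECRET_KEY}"

# Write the catalog file to a scratch directory for review.
CATALOG_FILE="$(mktemp -d)/minio.properties"
cat > "$CATALOG_FILE" <<EOF
connector.name=hive-hadoop2
hive.metastore.uri=$METASTORE_URI
hive.s3.endpoint=$MINIO_ENDPOINT
hive.s3.aws-access-key=$MINIO_ACCESS_KEY
hive.s3.aws-secret-key=$MINIO_SECRET_KEY
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
EOF

cat "$CATALOG_FILE"
```

Once reviewed, the file would be copied to the directory presto-admin reads catalog files from before running the catalog add and restart steps.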
Now we can create our table:
presto:minio> create table customer(id varchar,fname varchar,lname varchar) with (format = 'TEXTFILE', external_location = 's3a://testbucket/');
Then we can run a query from Presto:
presto> select * from customer;
id | first | last
5 | Bob | Jones
6 | Phil | Brune
Query 20180506_202139_00002_2dcgj, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [2 rows, 25B] [3 rows/s, 38B/s]
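The same query can also be run non-interactively, which makes the catalog and schema explicit instead of relying on the prompt. This is only a sketch: the launcher name (presto), the coordinator address (coordinator:8080), and the schema (default) are assumptions and may differ in your installation. If the CLI is not on the PATH, the script just prints the command it would have run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumptions: launcher name, coordinator address, and schema are placeholders.
PRESTO_CLI="${PRESTO_CLI:-presto}"
PRESTO_SERVER="${PRESTO_SERVER:-coordinator:8080}"
PRESTO_SCHEMA="${PRESTO_SCHEMA:-default}"

# Build the command as an array so the SQL stays a single argument.
CMD=("$PRESTO_CLI" --server "$PRESTO_SERVER" --catalog minio \
     --schema "$PRESTO_SCHEMA" --execute "select * from customer")

if command -v "$PRESTO_CLI" >/dev/null 2>&1; then
  "${CMD[@]}"
else
  # %q quotes each argument so the printed line can be pasted into a shell.
  printf 'Presto CLI not found; would run:'
  printf ' %q' "${CMD[@]}"
  printf '\n'
fi
```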
As you can see, there is very little difference between querying data in HDFS, in S3, or in an object store such as Minio. This opens up a whole new world for companies that want to deploy their own object stores within their own walls or in third-party data centers: Presto lets them query that data just as if it were in AWS.
This way, you can create your own Data Lake without the headaches of setting up and managing a Hadoop environment.