
Migrating to the cloud dominates the news, and many companies are shuttering their data centers and letting cloud providers handle infrastructure for them. Elasticity, simplicity, and infrastructure agility are all great reasons to move, but plenty of companies continue to host their own infrastructure, whether for security reasons or because they believe the cloud doesn't provide a cost benefit in their scenario.

For these companies, building a data lake usually means setting up a Hadoop cluster and choosing a vendor to support it (although this is less of a need than it used to be). Organizations like the idea of a company-wide object store that can hold a variety of data, both structured and unstructured. A number of vendors offer S3-compatible object storage software that can be installed anywhere, including open-source options such as Minio, which we use in this post.

One of the advantages of deploying your own object store is that you can use storage you already own, or build a new cluster from commodity servers combined into one large storage pool. Since most of these storage engines support Amazon's S3 protocol, they work seamlessly with Presto and let you query data directly out of your on-premises data lake.


In this blog post, we'll walk you through connecting Presto to a Minio object storage server. For our testing we used a single Minio server with some attached storage, but this could just as easily be a fully distributed deployment with many hosts serving the object storage.

For this simple demo, we had the following:

  1. Hive metastore installed in stand-alone mode on a virtual machine
  2. Presto Coordinator and one Presto worker installed on virtual machines
  3. Linux VM with a 100GB mount point named /data01 for the Minio server

Install Minio:

  1. Download the Minio server at https://www.minio.io/downloads.html. We chose the x64 Linux version.
  2. Set the storage point. For our example, we had a small Linux VM already installed with a storage mount point of: /data01.
  3. Start the Minio server and use this mount point. We simply typed: ./minio server /data01/. Note the AccessKey and SecretKey for later.
[root@minio ~]# ./minio server /data01/

Drive Capacity: 89 GiB Free, 90 GiB Total

Endpoint: http://10.70.0.105:9000 http://127.0.0.1:9000
AccessKey: OQ4K2YFH7S1BE6ZM10M0
SecretKey: Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq
Browser Access:
http://10.70.0.105:9000 http://127.0.0.1:9000

Command-line Access: https://docs.minio.io/docs/minio-client-quickstart-guide
$ mc config host add myminio http://10.70.0.105:9000 OQ4K2YFH7S1BE6ZM10M0 Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq

Object API (Amazon S3 compatible):
Go: https://docs.minio.io/docs/golang-client-quickstart-guide
Java: https://docs.minio.io/docs/java-client-quickstart-guide
Python: https://docs.minio.io/docs/python-client-quickstart-guide
JavaScript: https://docs.minio.io/docs/javascript-client-quickstart-guide
.NET: https://docs.minio.io/docs/dotnet-client-quickstart-guid
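
If you want reproducible credentials rather than the randomly generated ones shown above, Minio (in releases of this vintage) also lets you set them via environment variables before starting the server. A sketch, using this demo's keys and mount point:

```shell
# Set the credentials Minio should use (values from this demo's output)
export MINIO_ACCESS_KEY=OQ4K2YFH7S1BE6ZM10M0
export MINIO_SECRET_KEY=Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq

# Serve objects out of the /data01 mount point
./minio server /data01/
```

This keeps the keys stable across restarts, which matters because we hard-code them into the Hadoop and Presto configuration below.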

Now let’s add some data to Minio.

  1. For our example, we created a very simple text file:

[root@minio]# cat customer.csv

5 Bob Jones
6 Phil Brune

  2. Then we created a new bucket to store our files: mkdir /data01/testbucket
  3. Next, we copied customer.csv into the new bucket: cp customer.csv /data01/testbucket
  4. To verify the file and bucket are there, point your browser to the Minio IP on port 9000:
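
Copying files straight into the mount point works because Minio exposes the filesystem layout as buckets, but you can do the same thing over the S3 API with Minio's mc client. A sketch, assuming mc is installed and configured with the myminio alias from the server output above:

```shell
# Create the bucket over the S3 API instead of mkdir on the mount point
mc mb myminio/testbucket

# Upload the sample file into the bucket
mc cp customer.csv myminio/testbucket/

# Confirm the object landed
mc ls myminio/testbucket
```

The API route is the one you'd use against a remote or distributed Minio deployment, where you don't have direct access to the underlying disks.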

 

The browser lists our bucket, testbucket, and the customer.csv file inside it. Now that the Minio server is running and has data, let's connect Presto to query it.

Configuring Hive Metastore

In order for Presto to connect to Minio, it needs a catalog service, which the Hive Metastore provides: it tells Presto how the tables are defined and where the data is located.

  1. We add the following configuration to the core-site.xml on Hadoop:
<property>
   <name>fs.s3a.connection.ssl.enabled</name>
   <value>false</value>
   <description>Enables or disables SSL connections to S3.</description>
</property>

<property>
   <name>fs.s3a.endpoint</name>
   <value>http://minio:9000</value>
   <description>S3 endpoint to connect to. Without this property, the
   standard AWS endpoint (s3.amazonaws.com) is assumed.</description>
</property>

<property>
   <name>fs.s3a.awsAccessKeyId</name>
   <value>OQ4K2YFH7S1BE6ZM10M0</value>
   <description>AWS access key ID.</description>
</property>

<property>
   <name>fs.s3a.awsSecretAccessKey</name>
   <value>Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq</value>
   <description>AWS secret key.</description>
</property>

<property>
   <name>fs.s3a.path.style.access</name>
   <value>true</value>
   <description>Enable S3 path-style access, i.e. disable the default
   virtual-hosting behaviour. Useful for S3-compatible storage providers
   as it removes the need to set up DNS for virtual hosting.</description>
</property>

<property>
   <name>fs.s3a.impl</name>
   <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
   <description>The implementation class of the S3A filesystem.</description>
</property>

  2. Note the access key ID and secret key. These were output when we started the Minio server.

  3. Restart Hadoop and Hive, and we're ready to go.
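
Before moving on to Presto, it's worth a quick sanity check that Hadoop itself can reach the bucket through the S3A connector. A sketch, assuming the configuration above is in place and the hadoop CLI is on the path:

```shell
# List the demo bucket through the s3a:// filesystem
hadoop fs -ls s3a://testbucket/

# Print the sample file to confirm reads work end to end
hadoop fs -cat s3a://testbucket/customer.csv
```

If these commands fail, fix the core-site.xml settings first; Presto errors further down the stack are much harder to diagnose.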

Configuring Presto

For our demo, we had a single node Presto configuration.

1. We first create a new Hive catalog for our Minio server:

[root@coordinator catalog]# more minio.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hivemeta:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=OQ4K2YFH7S1BE6ZM10M0
hive.s3.aws-secret-key=Ft5iKPkY0YZNoDpp6r5HgbjFuGAxn+jt1avBPyHq

2. This file also requires the access key and secret key from the Minio server output
3. We then add this new catalog to our Presto cluster:
presto-admin catalog add minio
4. Lastly, we restart Presto to allow the changes to take effect:
presto-admin server restart
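
After the restart, you can confirm the catalog is registered from the Presto CLI. A sketch; the coordinator hostname and port are this demo's, so adjust them to your cluster:

```shell
# Confirm the new catalog shows up
presto --server coordinator:8080 --execute "show catalogs;"

# List schemas available through the minio catalog
presto --server coordinator:8080 --catalog minio --execute "show schemas;"
```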

Now we can create our table:

presto:minio> create table customer(id varchar, fname varchar, lname varchar) with (format = 'TEXTFILE', external_location = 's3a://testbucket/');

Then we can run a query from Presto:

presto> select * from customer;
 id | fname | lname
----+-------+-------
  5 | Bob   | Jones
  6 | Phil  | Brune
(2 rows)

Query 20180506_202139_00002_2dcgj, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [2 rows, 25B] [3 rows/s, 38B/s]

As you can see, there really isn't much difference between querying data in HDFS, in Amazon S3, or in an object store such as Minio. This opens up a whole new world for companies that want to deploy their own object stores, whether inside their own walls or in third-party data centers: Presto lets them query that data just as if it were in AWS.

This way, you can create your own Data Lake without the headaches of setting up and managing a Hadoop environment.

Need help with this setup or want to learn more? Contact us today; we're THE Presto experts.
