This page contains information about connecting Looker to an Apache Hive database.
Looker can only fully integrate with Apache Hive 3 databases on versions specifically 3.1.2+. This is because of a Query Parsing Issue from Hive versions 2.4.0 - 3.1.2 that resulted in extremely long parsing times for Looker generated SQL.
Looker is architected to connect to a database server via JDBC. In the case of Hive, this is the thrift server (HiveServer2). See the Apache documentation for more information.
By default, this server will listen on port 10000.
Looker is an interactive querying tool, so it expects to work with an interactive SQL engine. If Hive is running on MapReduce —
hive.execution.engine is set to
mr — then Hive will return query results too slowly to be practical.
Looker was tested with Hive on Tez (
hive.execution.engine=tez), although it is also possible to run Looker against Hive on Spark. Spark support was added in Hive version 1.1. (Looker supports Hive 1.2.1 and later.)
Hosted Hive Services
Most hosted Hive services run Hive with Tez or Spark.
- Amazon Elastic Map Reduce (EMR): Runs Hive on Tez by default.
- Microsoft HDInsight: Runs Hive on Tez.
- Google Dataproc: Runs Hive on MapReduce, though there is a script that installs Tez in their dataproc-initialization-actions GitHub repo.
- Qubole: Runs Hive on Tez. To connect to Qubole Hive, you must connect directly to the master node.
- Cloudera CDH: The latest version of CDH (as of September 2018) ships with Hive 1.1, which is unsupported by Looker.
- MapR: MapR 4.1 documentation refers to Tez as a “developer preview.”
- Hortonworks HDP: Hortonworks provides full support for Hive on Tez.
Verifying the JDBC Connection
Before connecting Looker to the Hive thrift server, we recommend you connect another JDBC client and run queries to make sure that Hive performance is satisfactory.
Hive comes with a JDBC client called
beeline. Verify that the thrift server is running with a command similar to:
or if you want to connect as a specific user:
Once connected, verify that the command
!tables lists all databases and tables. The
!tables command makes a similar call as the one Looker uses in SQL Runner or when creating a new LookML project.
Also verify that you can run queries, with a query that invokes the Tez engine, using a command similar to:
Queries such as
select * from table and
show tables don’t invoke the Tez engine.
If you are going to enable PDTs in Looker on Hive, create a scratch schema for Looker to use, using a command similar to:
The user that Looker uses to connect to Hive (which can be anonymous if no authentication is used) must have the following abilities in the scratch schema:
- Create tables
- Alter tables
- Drop tables
Test this with a JDBC client before attempting to create PDTs with Hive.
Sometimes you want queries from Looker to go into a specific queue. To configure that, enter the queue name parameter in the Additional Params field on the Connection Settings page:
Other Hive parameters can be set this way in the Additional Params field on the Connection Settings page.
Using user attributes, it is possible for queries from different users or different groups of users to go into different queues. To do this, create a user attribute named something like
queue_name, then in the Additional Params field, add:
You can use this to customize other
hive-site.xml parameters on a per-user or per-group basis as well.
Looker’s ability to provide some features depends on whether the database dialect can support them.
In the current Looker release, Apache Hive 1+ supports the following Looker features:
In the current Looker release, Apache Hive 3.1.2+ supports the following Looker features: