Looker is architected to connect to a database server via JDBC.
In the case of Spark SQL 1.x, this means that the Hive Thrift server needs to be compiled into Spark. For Spark 2.x, JDBC via a Thrift server is included in all builds.
By default, this Thrift server will listen on port 10000. However, if the cluster also has Hive installed on it, then port 10000 may already be used by a Thrift server that serves Hive queries. In that case, the Spark SQL Thrift server will need to use another port, like 10001 or 10010.
Installing the Thrift Server
If building Spark from source, see the Spark documentation for more detail on how to compile in the Hive Thrift server.
If downloading Spark from spark.apache.org, select one of the package types that include “Pre-built for Apache Hadoop…” to ensure that Spark has the Thrift server compiled in.
Hosted Spark Services
Most hosted Spark services contain versions of Spark that have the Thrift service built in. Looker has run the Thrift service and connected Looker to the following:
- Amazon Elastic Map Reduce (EMR) (EMR’s Spark release notes)
- Qubole (Qubole’s Spark documentation)
- Databricks
- Google Dataproc (supported Spark versions) (connecting Looker to Dataproc)
Your Looker analyst can help you connect Looker to these services.
Starting and Stopping the Thrift Server
See the Spark documentation for more detail.
The Thrift server is controlled by two scripts in `sbin`, called `start-thriftserver.sh` and `stop-thriftserver.sh`.
To run on an alternate port and as an alternate user, use a command such as:
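For example (a sketch; the installation path and the `spark` user are illustrative and vary by cluster):

```shell
# Start the Thrift server as the "spark" user, listening on port 10001
# instead of the default 10000
sudo -u spark /usr/lib/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001
```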
or:
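Equivalently, the port can be set through the environment variable that the Thrift server honors (again, path and user are illustrative):

```shell
# HIVE_SERVER2_THRIFT_PORT overrides the default listening port
sudo -u spark HIVE_SERVER2_THRIFT_PORT=10001 \
  /usr/lib/spark/sbin/start-thriftserver.sh
```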
Depending on how Hadoop group permissions are set up, you may want to add something like `HADOOP_USER_NAME=hadoop` to the command as well.
Verifying the JDBC Connection
Spark comes with a JDBC client in `bin` called `beeline`. Verify that the Thrift server is running, with a command such as:
or if you want to connect as a specific user:
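A sketch, with the `hadoop` username as an illustration (`-n` supplies the username to `beeline`):

```shell
# Connect as a specific user rather than the default
./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop
```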
Saving Tables from a Spark Script
See the Spark documentation for more detail.
To write a table that will persist after the Spark script exits and that Looker can search, it needs to be saved with `.saveAsTable()`.
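A minimal PySpark sketch using the Spark 2.x API (the app, file, and table names are illustrative; Hive support must be enabled so the table lands in the metastore that the Thrift server reads):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() registers tables in the Hive metastore,
# which is what the Thrift server (and therefore Looker) queries.
spark = (SparkSession.builder
         .appName("persist_table_example")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.json("events.json")  # any DataFrame source will do

# saveAsTable() writes a managed table that survives after
# this script exits, unlike a temporary view.
df.write.saveAsTable("events")
```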
Connecting Looker to Spark SQL
In the Admin section of Looker, navigate to the Connections page and click New Connection. Looker displays the Connection Settings page.
Fill out the page as follows:
- Name: The name of the connection. This is how the connection will be referred to in the LookML model.
- Host:Port: The Spark SQL Thrift server host and port (10000 by default).
- Database: The default schema/database that will be modeled. When no database is specified for a table, this will be assumed.
- Username: The user that Looker will authenticate as.
- Password: The optional password for the Looker user.
- Persistent Derived Tables: Check this if you will be using PDTs with Looker.
- Temp Database: A temporary schema/database for storing PDTs. It must be created beforehand, with a statement such as `CREATE SCHEMA looker_scratch;`.
- SSL: Leave this unchecked.
- Database Time Zone: The time zone of data stored in Spark. Usually it can be left blank or set to UTC.
- Query Time Zone: The time zone to display data queried in Looker.
- Additional Params: Usually this is left blank, but you can add additional parameters to the JDBC string here, such as `;spark.sql.inMemoryColumnarStorage.compressed=true`.
Click Test These Settings to test the connection and make sure that it is configured correctly. If you see Can Connect, then click Add Connection. This runs the rest of the connection tests to verify that the service account was set up correctly and with the proper roles.
For more information about connection settings, see the Connecting Looker to Your Database documentation page.
Debugging the Connection to the Thrift Server
This section describes how to interpret a few possible connection errors and explains several strategies for fixing a faulty connection.
Understanding Connection Errors
If you do not see Can Connect after testing your connection, below are some example errors along with what caused them:
Cannot connect: connection refused: Java::JavaSql::SQLException: Could not open connection to jdbc:hive2://ec2-52-20-232-248.compute-1.amazonaws.com:10001/imdb;user=hadoop: java.net.ConnectException: Operation timed out
In this case, the firewall settings are blocking Looker’s attempts to connect to the Spark SQL Thrift server and the TCP connection attempt is timing out. To resolve this, set up an SSH tunnel or adjust the firewall settings.
Cannot connect: connection refused: Java::JavaSql::SQLException: Could not open connection to jdbc:hive2://localhost:10004/imdb;user=hadoop: null
In this case, Looker’s attempts to reach the Spark SQL Thrift server machine are successful, but the connection cannot be completed.
There may be no Thrift server running, or the wrong port may be configured. If this is the case, the error message will return within a few seconds.
Another possibility is that a different type of server (for example, a web server) is listening at that port instead of the Thrift server.
Yet another possibility is that the wrong authentication protocol is being sent to the Thrift server; for example, `;auth=noSasl` is configured in the Additional Params field of the Connection Settings page when SASL has not been configured on the Thrift server. If one of the latter two possibilities is the case, the error message will return after more than a few seconds.
Troubleshooting Faulty Connections
If Looker can’t connect to the Thrift server, investigate the following possibilities.
Verify That the Spark Thrift Server Is Running
If you have access to the machine running the Spark Thrift server, verify that it is running with a command such as:
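For example (a sketch; the bracketed first letter keeps `grep` from matching its own process):

```shell
# List any running Thrift server processes along with their PIDs
ps aux | grep -i [t]hriftserver
```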
Then you can use the PID to verify the listening port, with a command such as:
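A sketch, where `<pid>` stands for the process ID reported by `ps` (left as a placeholder; substitute your own value):

```shell
# Show listening TCP sockets owned by the Thrift server process
sudo netstat -tlnp | grep <pid>
```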
In this case, the listening port is 10001.
There may also be a Hive Thrift server running on the same node.
Verify the Connection from Looker to the Spark Thrift Server
The example below uses an SSH tunnel set up to the Spark Thrift server, so `localhost` is the host used to connect to it.
Use telnet to verify that the client machine is able to connect on the appropriate port.
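For example (the port is illustrative; use whatever port your Thrift server listens on):

```shell
# Attempt a raw TCP connection to the Thrift server port
telnet localhost 10001
```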
A successful connection will look similar to:
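For example, on a typical Linux client (exact wording varies by platform):

```
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
```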
An unsuccessful connection will look similar to:
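For example (again, exact wording varies by platform):

```
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host
```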
Verify the JDBC Connection String
Once you have verified the connection, verify the correct JDBC string to connect for the Spark Thrift server. It may be different based on methods of authentication and allowed usernames.
Spark comes with a client called `beeline`. (See above for more detail.) Use `beeline` to verify the correct JDBC URL for connecting to the Thrift server. For example:
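A sketch of a successful invocation (the host, port, database, and username here are illustrative and match the tunnel set up above):

```shell
# Connect with the full JDBC URL, including database and username
./bin/beeline -u "jdbc:hive2://localhost:10001/imdb;user=hadoop"
```

On success, `beeline` prints a `Connected to: Spark SQL (version ...)` banner and drops you at a SQL prompt.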
From this, you can tell that to connect you need to use `localhost` as the host (after setting up the SSH tunnel), `10001` as the port, `hadoop` as the username, no password, `imdb` as the default database, and no additional parameters.
Another popular tool for testing JDBC connections is SQuirreL SQL.
Feature Support
For Looker to support some features, your database dialect must also support them.
In Looker 21.0, Spark 1.5.2+ supports the following Looker features:
In Looker 21.0, Spark 2.0+ supports the following Looker features:
Next Steps
Once you are finished creating the connection, you can proceed to setting authentication options.