Apache Spark 1.5.2+ and 2.x

Looker is architected to connect to a database server via JDBC.

For Spark SQL 1.x, the Hive Thrift server must be compiled into Spark in order to run. For Spark 2.x, JDBC via a Thrift server is included in all builds.

By default, this Thrift server will listen on port 10000. However, if the cluster also has Hive installed on it, then port 10000 may already be used by a Thrift server that serves Hive queries. In that case, the Spark SQL Thrift server will need to use another port, like 10001 or 10010.
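One way to set an alternate port persistently, as a sketch (the file location varies by distribution), is to set hive.server2.thrift.port in Spark's conf/hive-site.xml:

```xml
<!-- conf/hive-site.xml (illustrative) -->
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10001</value>
  </property>
</configuration>
```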

Installing the Thrift Server

If building Spark from source, see this section of the Spark documentation for more detail on how to compile in the Hive Thrift server.

If downloading Spark from spark.apache.org, select one of the package types that include “Pre-built for Apache Hadoop…” to ensure that Spark has the Thrift server compiled in.

Hosted Spark Services

Most hosted Spark services include versions of Spark with the Thrift server built in, and Looker has run the Thrift server on, and connected to, several of them. Your Looker analyst can help you connect Looker to these services.

Starting and Stopping the Thrift Server

See the Spark documentation for more detail.

The Thrift server is controlled by two scripts in sbin, called start-thriftserver.sh and stop-thriftserver.sh.

To run the Thrift server on an alternate port and as an alternate user, use one of the following commands:

sudo -u spark HIVE_SERVER2_THRIFT_PORT=10001 /path/to/spark/sbin/start-thriftserver.sh


sudo -u spark /path/to/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001

Depending on how Hadoop group permissions are set up, you may want to add something like HADOOP_USER_NAME=hadoop to the command as well.
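For example, combining the alternate port with a Hadoop user override (the user and variable values are illustrative):

```
sudo -u spark HADOOP_USER_NAME=hadoop HIVE_SERVER2_THRIFT_PORT=10001 /path/to/spark/sbin/start-thriftserver.sh
```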

Verifying the JDBC Connection

Spark comes with a JDBC client in bin called beeline. Verify that the Thrift server is running with a command such as:

bin/beeline -u 'jdbc:hive2://localhost:10001/'

or if you want to connect as a specific user:

bin/beeline -u 'jdbc:hive2://localhost:10001/' -n user -p abc123

Saving Tables from a Spark Script

See the Spark documentation for more detail.

For a table to persist after the Spark script exits and be queryable by Looker, save it with .saveAsTable().
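As a minimal sketch (PySpark, using the Spark 2.x SparkSession API; the data and table name are illustrative):

```python
from pyspark.sql import SparkSession

# Hive support is needed so the table lands in the Hive metastore,
# where the Thrift server (and therefore Looker) can see it.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Persists the table beyond the life of this script.
df.write.saveAsTable("my_table")
```

In Spark 1.x, the same call is made through a HiveContext rather than a SparkSession.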

Connecting Looker to Spark SQL

In the Admin section of Looker, navigate to the Connections page and click New Connection. Looker displays the connection settings page.

Fill out the page with the connection details for your Spark server.

Click Test These Settings to verify that the connection is configured correctly. If you see Can Connect, click Add Connection. This runs the rest of the connection tests to verify that the service account was set up correctly and with the proper roles.

For more information about connection settings, see the Connecting Looker to Your Database documentation page.

Debugging the Connection to the Thrift Server

This section describes how to interpret a few possible connection errors and explains several strategies for fixing a faulty connection.

Understanding Connection Errors

If you do not see Can Connect after testing your connection, examine the error message to identify the cause.

Troubleshooting Faulty Connections

If Looker can’t connect to the Thrift server, investigate the following possibilities.

Verify That the Spark Thrift Server Is Running

If you have access to the machine running the Spark Thrift server, verify that it is running with a command such as:

$ ps -efwww | grep spark | grep thriftserver
spark 17495 1 7 Sep25 ? 3-15:22:04 /usr/lib/jvm/java-1.7.0-openjdk- -cp /usr/lib/hadoop/:/usr/lib/hadoop/../hadoop-hdfs/:/usr/lib/hadoop/../hadoop-mapreduce/:/usr/lib/hadoop/../hadoop-yarn/:/etc/hive/conf/:/usr/lib/hadoop/../hadoop-lzo/lib/:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/*:/usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/ -Xms512m -Xmx512m -Dspark.driver.log.level=INFO -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=512M org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal

Then you can verify the port number for that pid with a command such as:

$ sudo netstat -plten | grep 17495
tcp 0 0 :::43120 :::* LISTEN 492 96763982 17495/java
tcp 0 0 :::10001 :::* LISTEN 492 96762603 17495/java
tcp 0 0 ::ffff: :::* LISTEN 492 96762998 17495/java
tcp 0 0 :::4040 :::* LISTEN 492 96762551 17495/java
tcp 0 0 :::52968 :::* LISTEN 492 96752416 17495/java

In this case, the listening port is 10001.
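The port lookup can also be scripted. The sketch below pulls the listening ports for a pid out of netstat-style lines; the sample data mirrors the output above, and on a live system you would pipe `sudo netstat -plten` in instead:

```shell
# Extract listening ports for a given pid from `netstat -plten`-style lines.
pid=17495
sample='tcp 0 0 :::43120 :::* LISTEN 492 96763982 17495/java
tcp 0 0 :::10001 :::* LISTEN 492 96762603 17495/java
tcp 0 0 :::4040 :::* LISTEN 492 96762551 17495/java'
# Match the "pid/java" column, then strip everything up to the
# last colon of the local-address column to leave just the port.
ports=$(printf '%s\n' "$sample" |
  awk -v pid="$pid" '$NF == (pid "/java") { sub(/.*:/, "", $4); print $4 }')
echo "$ports"
```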

There may also be a Hive Thrift server running on the same node.

Verify the Connection from Looker to the Spark Thrift Server

The examples below use an SSH tunnel to the Spark Thrift server, so localhost is the host to connect to.

Use telnet to verify that the client machine is able to connect on the appropriate port.

A successful connection will look similar to:

$ telnet localhost 10001
Trying ::1...
Connected to localhost.
Escape character is '^]'.
^]        (type this with CTRL+])
telnet> quit
Connection closed.

An unsuccessful connection will look similar to:

$ telnet localhost 10004
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.
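If telnet is not installed, bash's /dev/tcp pseudo-device can make the same check (a sketch; requires bash, and uses the same tunnel host and port as above):

```
if (exec 3<>/dev/tcp/localhost/10001) 2>/dev/null; then
  echo "port 10001 is accepting connections"
else
  echo "cannot connect to port 10001"
fi
```

As with telnet, this only proves that something is listening on the port; use beeline to verify the JDBC handshake itself.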

Verify the JDBC Connection String

Once you have verified the connection, determine the correct JDBC connection string for the Spark Thrift server. It may vary based on the authentication method and allowed usernames.

Spark comes with a client called beeline. (See above for more detail.)

Use beeline to verify the correct JDBC URL for connecting to the Thrift server. For example:

$ bin/beeline --color=yes -u 'jdbc:hive2://localhost:10001/imdb;username=hadoop'
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10001/imdb;username=hadoop
Connected to: Spark SQL (version 1.4.1)
Driver: Spark Project Core (version 1.4.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.4.1 by Apache Hive
0: jdbc:hive2://localhost:10001/imdb> show tables;
+------------------+--------------+
|    tableName     | isTemporary  |
+------------------+--------------+
| accidents2       | false        |
| aka_title        | false        |
| cast_info        | false        |
| char_name        | false        |
| company_name     | false        |
| complete_cast    | false        |
| keyword          | false        |
| movie_companies  | false        |
| movie_info       | false        |
| movie_keyword    | false        |
| movie_link       | false        |
| name             | false        |
| person_info      | false        |
| states           | false        |
| title            | false        |
| title2           | false        |
+------------------+--------------+
16 rows selected (0.186 seconds)

From this, you can tell that to connect you need to use localhost as the host (after setting up the SSH tunnel), 10001 as the port, hadoop as the username, no password, imdb as the default database, and no additional parameters.

Another popular tool for testing JDBC connections is SQuirreL SQL.

Feature Support

For Looker to support some features, your database dialect must also support them.

In Looker 21.0, Spark 1.5.2+ supports the following Looker features:

In Looker 21.0, Spark 2.0+ supports the following Looker features:

Next Steps

Once you are finished creating the connection, you can proceed to setting authentication options.