Caching Queries and Rebuilding PDTs with Datagroups


Overview

Looker reduces the load on your database and improves performance by using cached results of prior queries when available and permitted by your caching policy. In addition, you can create complex queries as PDTs which store their results to simplify later queries.

Datagroups (defined with datagroup parameters) combined with persist_with parameters are the most powerful way to define a caching policy for queries on explores and to control when persistent derived tables (PDTs) are rebuilt.

You use one or more datagroup parameters to name caching policies and specify the desired behavior. Then you can use persist_with parameters at the model or explore level to specify which explores use each policy and use datagroup_trigger in a PDT definition to specify which policy to use in rebuilding the PDTs.
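
As a sketch, the three parameters fit together like this (the datagroup name, table, and explore here are illustrative, not from this page):

    datagroup: nightly_etl {
      sql_trigger: SELECT max(etl_job_id) FROM etl_log ;;  # fires when a new ETL job lands
      max_cache_age: "24 hours"                            # backstop if the ETL never runs
    }

    explore: orders {
      persist_with: nightly_etl          # queries on this explore follow the datagroup's policy
    }

    view: order_facts {
      derived_table: {
        sql: SELECT user_id, COUNT(*) AS lifetime_orders FROM orders GROUP BY user_id ;;
        datagroup_trigger: nightly_etl   # this PDT is rebuilt when the datagroup triggers
      }
    }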

Typical use cases for datagroups include:

  • integrating Looker more closely with the ETL phase of your data pipeline, such as loading data in batch through a nightly ETL job and clearing any cached queries once that data is loaded.
  • regenerating PDTs infrequently even though the underlying tables update often, so that queries against the PDTs can be cached longer.

How Looker Uses Cached Queries

For queries, the caching mechanism in Looker works as follows:

  1. Once a user runs a specific query, the result of that query is cached. Cache results are stored in an encrypted file on your Looker instance.
  2. When a new query is written, the cache is checked to see if the exact same query was previously run. Everything must be the same, including things such as the row limits, etc. If the query is not found, then Looker runs the query against the database to get fresh database results. (Those new results are also cached.)
  3. If the query is found in the cache, then Looker checks whether the query results are still valid. By default, Looker invalidates cached results after an hour (unless you enabled the legacy feature that defaults to 5 minutes). You can use a persist_for parameter (defined here and here) or the more powerful datagroup parameter to specify the caching policy for when or in what circumstances the cached results become invalid and should be ignored. An administrator can also cause the cached results for a datagroup to become invalid.
    • If the query results are still valid, then those results are used.
    • If the query results are no longer valid, then Looker runs the query against the database to get fresh query results. (Those new results are also cached.)

How Looker Uses PDTs and Rebuilds Them

PDTs are queries, typically very complex queries, that you can create as the basis of a view. For example, your query might identify the customer’s first and most recent orders plus the lifetime value of all of their orders.

These PDTs are rebuilt periodically based on one of three settings:

  • a change trigger (using sql_trigger_value, defined here)
  • a set time period (using persist_for, defined here)
  • a datagroup’s caching policy, based on a change trigger and/or a maximum age for the cached values (using datagroup_trigger)

Using a datagroup gives you the most control over when the PDT is rebuilt. This page discusses setting up a datagroup for query caching and PDT rebuilding.
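
For comparison, the three rebuild settings look like this in a derived table definition (hypothetical view and datagroup names; a PDT uses only one of these options):

    view: pdt_with_trigger {
      derived_table: {
        sql: SELECT … ;;
        sql_trigger_value: SELECT CURRENT_DATE ;;  # rebuild when this value changes
      }
    }

    view: pdt_with_duration {
      derived_table: {
        sql: SELECT … ;;
        persist_for: "24 hours"                    # rebuild after a set time period
      }
    }

    view: pdt_with_datagroup {
      derived_table: {
        sql: SELECT … ;;
        datagroup_trigger: nightly_etl             # follow a datagroup's caching policy
      }
    }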

Specifying Caching Policies With datagroup Parameters

You define one or more datagroup parameters at the model level to specify the caching policy that you will assign to explores and/or PDTs. If you want different caching policies for different explores and/or PDTs, then specify each caching policy in a separate datagroup.

If the connection uses a user attribute to specify the connection username, then you cannot use a datagroup for models using that connection. Instead, use a persist_for parameter (defined here and here) for queries and either sql_trigger_value or persist_for for derived tables.

Each datagroup specifies a policy using one or both of these subparameters:

  • a sql_trigger parameter specifying a SQL query that returns one row with one column

    Looker checks whether to invalidate the cached results by running the sql_trigger parameter’s SQL query at the interval specified in the PDT Maintenance Schedule field of the database connection. If the value returned by the query differs from the result of the prior check, the datagroup goes into a triggered state, and Looker rebuilds any PDTs that specify that datagroup in their datagroup_trigger parameter. Once all PDTs using the datagroup are rebuilt, the datagroup goes into a ready state and the cached results for any associated explores are invalidated.

    Typically the SQL query is designed to detect that the next data load (ETL) into your database has happened, such as querying the max(ID) in a table. Alternatively, you can target a certain time of day by querying the current date and adding hours to that timestamp as needed (for example, to trigger at 4 AM).

    The query must return exactly one row with one column.

  • a max_cache_age parameter specifying a string defining a time period

    You can specify a maximum time period for the cached results to be used. In the max_cache_age parameter, you pass a string containing an integer followed by "minutes", "hours", or "days".

    When that time has elapsed since a query’s results were cached, then Looker invalidates that query’s results and the next time the same query is used, Looker will send the query to the database for fresh results.

    Often the best solution is to use the two parameters in combination: a sql_trigger SQL query that is triggered by your data load (ETL) into your database, plus a max_cache_age that eventually invalidates old data if your ETL fails. max_cache_age ensures that if the cache for a datagroup isn’t cleared by the sql_trigger, the cache entries still expire after a certain time. That way, the failure mode for a datagroup is to query the database rather than to serve stale data from the Looker cache.
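
The combined pattern can be sketched as follows (the table name and interval are illustrative): the datagroup triggers when the nightly ETL writes a new job ID, and max_cache_age acts as a backstop in case the ETL fails:

    datagroup: orders_datagroup {
      sql_trigger: SELECT max(etl_job_id) FROM etl_log ;;  # normally fires once per ETL run
      max_cache_age: "26 hours"                            # slightly longer than the ETL cycle,
                                                           # so the cache expires if the ETL fails
    }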

Applying a Datagroup’s Caching Policy

You can apply a datagroup’s caching policy to:

  • queries using a specific model
  • queries on one or a group of explores
  • the contents of one or more PDTs

Using a Datagroup for Query Results

You can specify that a datagroup’s caching policy be used for all queries on a specific explore, a group of explores, or as a default caching policy for all explores in a model.

To specify a datagroup’s caching policy:

  • as a default for all explores in a model, use the persist_with parameter at the model level and specify the name of the datagroup.

  • for a specific explore, use the persist_with parameter in that explore’s definition and specify the name of the datagroup.

  • for a group of explores, use the persist_with parameter in each of those explores’ definitions and specify the name of the same datagroup.

In the following example, the orders and order_facts explores both use the orders_datagroup caching policy. That policy uses the query SELECT max(etl_job_id) FROM my_tablename to detect when an ETL has happened. Even if that ETL doesn’t happen for a while, the cached data will be used for a maximum of 24 hours.

Meanwhile, the customer_facts and customer_background explores both use the customers_datagroup caching policy. That policy uses the query SELECT max(etl_job_id) FROM my_other_tablename to detect when an ETL has happened. If no ETL happens, the cached results will continue to be used.

Old LookML:

    - datagroup: orders_datagroup
      sql_trigger: SELECT max(etl_job_id) FROM my_tablename
      max_cache_age: 24 hours

    - datagroup: customers_datagroup
      sql_trigger: SELECT max(etl_job_id) FROM my_other_tablename

    - persist_with: orders_datagroup

    - explore: orders
      …

    - explore: order_facts
      …

    - explore: customer_facts
      persist_with: customers_datagroup
      …

    - explore: customer_background
      persist_with: customers_datagroup
      …

New LookML:

    datagroup: orders_datagroup {
      sql_trigger: SELECT max(etl_job_id) FROM my_tablename ;;
      max_cache_age: "24 hours"
    }

    datagroup: customers_datagroup {
      sql_trigger: SELECT max(etl_job_id) FROM my_other_tablename ;;
    }

    persist_with: orders_datagroup

    explore: orders { … }

    explore: order_facts { … }

    explore: customer_facts {
      persist_with: customers_datagroup
      …
    }

    explore: customer_background {
      persist_with: customers_datagroup
      …
    }

If both persist_with and persist_for are specified, then you will receive a validation warning and the persist_with will be used.
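
For example, this hypothetical explore would produce that validation warning, and only the datagroup’s policy would apply:

    explore: orders {
      persist_with: orders_datagroup  # used
      persist_for: "2 hours"          # ignored; triggers a validation warning
    }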

Using a Datagroup for a PDT

To use a datagroup’s caching policy to trigger rebuilding of a PDT, use the datagroup_trigger parameter in the PDT’s definition and specify the name of the datagroup.

Do not also specify a sql_trigger_value or persist_for parameter for the PDT — if you do, then:

  • you’ll get a warning in the Looker IDE
  • the sql_trigger_value or persist_for will not be used by Looker
  • the datagroup_trigger will be used

Example:

If we wanted the PDT to rebuild as specified by the customers_datagroup datagroup, we could use the following definition. This definition also adds two indexes, on customer_id and first_order_date. For more information about defining PDTs, see this page.

Old LookML:

    - view: customer_order_facts
      derived_table:
        sql: |
          …
        datagroup_trigger: customers_datagroup
        indexes: [customer_id, first_order_date]

New LookML:

    view: customer_order_facts {
      derived_table: {
        sql: … ;;
        datagroup_trigger: customers_datagroup
        indexes: ["customer_id", "first_order_date"]
      }
    }

If you have PDTs that are dependent on other PDTs, be careful not to specify incompatible datagroup caching policies.
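
One straightforward pattern (a sketch with illustrative names) is to give a PDT and the PDTs that depend on it the same datagroup, so they are rebuilt together on the same trigger:

    view: base_orders {
      derived_table: {
        sql: SELECT … ;;
        datagroup_trigger: customers_datagroup
      }
    }

    view: orders_rollup {
      derived_table: {
        # builds on the base PDT via its generated table name
        sql: SELECT … FROM ${base_orders.SQL_TABLE_NAME} ;;
        datagroup_trigger: customers_datagroup  # same datagroup as the PDT it depends on
      }
    }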

Using the Admin Panel for Datagroups

If you have the Admin role, you can use the Admin tab’s Datagroups page to view the existing datagroups. You can see the connection and model of each datagroup, as well as the datagroups’ current status. You can also reset the cache for a datagroup, trigger the datagroup, or navigate to the datagroup’s LookML.

Datagroups can also be used to trigger scheduled data delivery, as described on this page. With this option, Looker sends your data once the datagroup finishes rebuilding, so that the scheduled content is up to date.

Seeing Whether a Query Was Returned From Cache

You can determine whether or not a query has been returned from the cache by looking in the upper right corner after running a query.

  • If the results are from a fresh query, you will see …
  • If the results are from the cache, you will see …

Forcing New Results To Be Generated From the Database

Users can also force new results to be retrieved from the database. Select the Clear Cache & Refresh option from the Gear dropdown menu, which you’ll find in the upper right of the screen after running a query:

A persistent derived table normally is regenerated based on the specified datagroup, persist_for, or sql_trigger_value. You can force the derived table to regenerate early if your admin has given you the develop permission, and you are viewing an Explore that includes fields from the PDT. Select the Rebuild Derived Tables & Run option from the Gear dropdown menu, which you’ll find in the upper right of the screen after running a query:
