Looker reduces the load on your database and improves performance by using cached results of prior queries when available and permitted by your caching policy. In addition, you can create complex queries as PDTs, which store their results to simplify later queries.
Using datagroups is the most powerful technique for defining a caching policy, and for specifying when to rebuild persistent derived tables (PDTs).
You can use one or more
datagroup parameters to name caching policies and specify the desired behavior. Then you can use
persist_with parameters at the model level or the Explore level to specify which Explores use each policy. You also can use the
datagroup_trigger parameter in a PDT definition to specify which policy to use in rebuilding the PDTs.
Typical use cases for datagroups include:
- Wanting to integrate Looker more closely with the ETL phase of your data pipeline, such as loading data in batch through a nightly ETL job and wanting to clear any cached queries once that data is loaded.
- Having PDTs that are regenerated infrequently, even though the underlying tables update frequently, so queries against the PDTs can be cached longer.
How Looker Uses Cached Queries
For queries, the caching mechanism in Looker works as follows:
- Once a user runs a specific query, the result of that query is cached. Cache results are stored in an encrypted file on your Looker instance.
When a new query is written, the cache is checked to see if the exact same query was previously run. All fields, filters, and parameters must be the same, including things such as the row limits. If the query is not found, then Looker runs the query against the database to get fresh database results (and those results are then cached).
Context Comments do not affect caching. Looker adds a unique comment to the beginning of each SQL query. But as long as the SQL query itself is the same as a previous query (not including the context comments), Looker will use cached results.
- If the query is found in the cache, then Looker checks the caching policy defined in the model to see if the cache is still valid. By default, Looker invalidates cached results after an hour. You can use a
persist_forparameter (at the model level or the Explore level) or the more powerful
datagroupparameter to specify the caching policy for when or in what circumstances the cached results become invalid and should be ignored. An admin can also invalidate the cached results for a datagroup.
- If the cache is still valid, then those results are used.
- If the cache is no longer valid, then Looker runs the query against the database to get fresh query results. (Those new results are also cached.)
How Looker Uses PDTs and Rebuilds Them
PDTs are queries, typically very complex queries, that you can create as the basis of a view. For example, your query might identify the customer’s first and most recent orders plus the lifetime value of all of their orders.
These PDTs are rebuilt periodically based on one of three settings:
- triggered by a change (using
- a set time period (using
- a caching policy definition in a
datagroup, triggered by a change using
sql_trigger, and assigned to the PDT using
max_cache_age does not cause PDTs to rebuild. See the
datagroup documentation page for more information.
Using a datagroup gives you the most control over when the PDT is rebuilt. This page discusses setting up a datagroup for query caching and PDT rebuilding.
Specifying Caching Policies with
You use one or more
datagroup parameters at the model level to assign a caching policy to Explores and/or PDTs. If you want different caching policies for different Explores and/or PDTs, then specify each caching policy in a separate datagroup.
For connections using user attributes to specify the connection parameters, you must create a separate connection using the PDT override fields if you want to do either of the following:
• Use PDTs in your model.
• Define a datagroup caching policy using a SQL query trigger.
Without the PDT overrides, you can still use a datagroup for the model and its Explores, as long as you define the datagroup’s caching policy using only
Each datagroup specifies a policy using one or both of these subparameters:
sql_triggerparameter specifies a SQL query that returns one row with one column. If the value returned by the query is different than the query’s prior results, then the datagroup goes into a triggered state. See this page for more information.
max_cache_ageparameter is a string defining a time period. When the age of a query’s cache exceeds the time period, Looker invalidates the cache. The next time the query is issued, Looker sends the query to the database for fresh results. See this page for more information.
Often the best solution is to use the two parameters in combination. You specify a
sql_trigger SQL query that will be triggered by your data load (ETL) into your database. In addition, you specify a
max_cache_age that will invalidate old data if your ETL fails. The
max_cache_age parameter ensures that if the cache for a datagroup isn’t cleared by the
sql_trigger, then the cache entries will expire by a certain time. So the failure mode for a datagroup will be to query the database rather than serve stale data from the Looker cache.
Specifying Using a Datagroup’s Caching Policy
You can use a datagroup’s caching policy for
- queries using a specific model
- queries on one or a group of Explores
- the contents of one or more PDTs
Using a Datagroup for Query Results
You can specify a datagroup’s caching policy based on how you’d like it applied:
- For a specific Explore: Use the datagroup in a
persist_withparameter for that Explore’s definition.
- For a group of Explores: Use the datagroup in a
persist_withparameter for each of those Explore’s definitions.
- As a default for all Explores: Use the datagroup in a single
persist_withparameter at the model level.
In the following example, we’ve included in the model file a datagroup named
orders_datagroup. The datagroup has a
sql_trigger parameter specifying that the query
select max(id) from my_tablename will be used to detect when an ETL has happened. Even if that ETL doesn’t happen for a while, the datagroup’s
max_cache_age specifies that the cached data will be used only for a maximum of 24 hours.
persist_with parameter points to the
orders_datagroup caching policy, which means this will be the default caching policy for all Explores in the model. But we don’t want to use the model’s default caching policy for the
customer_background Explores, so we can add the
persist_with parameter to specify a different caching policy for these two Explores. The
orders_facts Explores don’t have a
persist_with parameter, so they will use the model’s default caching policy:
persist_for are specified, then you will receive a validation warning and the
persist_with will be used.
Using a Datagroup for a PDT
To use a datagroup’s caching policy to trigger rebuilding a PDT, you use the
datagroup_trigger in the PDT’s definition and specify the name of the datagroup.
Do not also specify a
persist_for parameter for the PDT—if you do, then:
- you’ll get a warning in the Looker IDE
persist_forwill not be used by Looker
datagroup_triggerwill be used
If we wanted to set the PDT to rebuild as specified by the
customers_datagroup datagroup, we could use the following definition. This definition also adds several indexes, on both
first_order_date. For more information about defining PDTs, see this page.
If you have PDTs that are dependent on other PDTs, be careful not to specify incompatible datagroup caching policies.
Using the Admin Panel for Datagroups
If you have the Looker admin role, you can use the Admin tab’s Datagroups page to view the existing datagroups. You can see the connection and model of each datagroup, as well as the datagroups’ current status. You can also reset the cache for a datagroup, trigger the datagroup, or navigate to the datagroup’s LookML.
Using Datagroups to Trigger Scheduled Deliveries
Datagroups can also be used to trigger scheduled data delivery, as described on the Scheduling Data Deliveries page. With this option, Looker will send your data when the datagroup completes, so that the scheduled content is up to date.
Seeing Whether a Query Was Returned from Cache
You can determine whether or not a query has been returned from the cache by looking in the upper right corner after running a query.
|If results are from a fresh query,
you will see …
|If the results are from the cache,
you will see …
Forcing New Results to Be Generated from the Database
Users can also force new results to be retrieved from the database. Select the Clear Cache & Refresh option from the gear menu, which you’ll find in the upper right of the screen after running a query:
You can also clear the cache and refresh for merged results queries:
A persistent derived table normally is regenerated based on the specified
sql_trigger_value. You can force the derived table to regenerate early if your admin has given you the
develop permission, and you are viewing an Explore that includes fields from the PDT. Select the Rebuild Derived Tables & Run option from the Gear drop-down menu, which you’ll find in the upper right of the screen after running a query:
How Long Is Data Stored in the Cache?
To specify the amount of time before the cached results become invalid, use the
persist_for parameter (for a model or for an Explore) or the
max_cache_age parameter (for a datagroup). Here are the different behaviors along the timeline, depending on whether the
max_cache_age time has expired:
- Before the
max_cache_agetime expires: If the query is rerun, Looker pulls data from the cache.
- When the
max_cache_agetime expires: Looker deletes data from the cache, unless you have enabled the Instant Dashboards Looker Labs feature.
- After the
max_cache_agetime expires: If the query is rerun, Looker pulls the data from the database directly and resets the
One key point here is that the data is deleted from the cache when the
max_cache_age time expires, as long as the Instant Dashboards Looker Labs feature is disabled. (The Instant Dashboards feature requires the cache in order to immediately load cached results into a dashboard.) If you enable Instant Dashboards, data stays in the cache for 30 days, or until the cache storage limits are reached. If the cache reaches the storage limit, data is ejected based on a Least Recently Used (LRU) algorithm, with no guarantee that data with expired
max_cache_age timers will be deleted all at once.
Minimizing the Time Your Data Spends in the Cache
Looker requires the disk cache for internal processes, so data will always be written to the cache, even if you set the
max_cache_age parameters to 0. Once written to the cache, the data will be flagged for deletion but may live up to 10 minutes on disk.
However, all customer data that appears in the disk cache is Advanced Encryption Standard (AES) encrypted, and you can minimize the amount of time that data is stored in the cache by making all of these changes:
- Disable the Instant Dashboards Looker Labs feature, which requires Looker to store data in cache.
- For any
persist_forparameter (for a model or Explore) or
max_cache_ageparameter (for a datagroup), set the value to 0. In Looker 5.14+ with Instant Dashboards off, Looker deletes the cache when the
persist_fortime expires, or when the data reaches the
max_cache_agespecified in its datagroup. (This is not necessary for the
persist_forparameter of persistent derived tables, because persistent derived tables are written to the database itself, and not to the cache.)
- Set the
suggest_persist_forparameter to a small amount of time. The
suggest_persist_forvalue specifies how long Looker should keep filter suggestions in the cache. The filter suggestions are based on a query of the values for the field being filtered. These query results are kept in the cache so Looker can quickly provide suggestions as the user types in the filter text field. The default is to cache the filter suggestions for 6 hours. To minimize the amount of time your data is in the cache, set the
suggest_persist_forvalue to something lower, such as 5 minutes.