Exploring the Latest BigQuery Features: Materialized Views Update
Written on
BigQuery has introduced exciting updates this April, particularly with the advent of Google Next. To keep you informed, I will share the latest developments separately, beginning with Materialized Views. This topic follows our previous discussion on views in the "Hidden Gems of BigQuery" series.
Materialized Views Now Generally Available
The options for allow_non_incremental_definition and max_staleness for Materialized Views have reached general availability (GA).
Given that this subject can be quite intricate—something I struggled with for a while—I believe a brief overview of Materialized Views is beneficial.
Regular views are generally understood as a "syntax sugar," "abstraction layer," or "security layer," depending on the use case. When a query is executed against a regular view, its definition is replaced with the corresponding SQL statement.
Materialized Views, however, are distinct as they store data physically.
Why opt for them? Primarily, Materialized Views can act like regular views, but they also enhance query performance in OLTP systems and help reduce costs in OLAP systems. This efficiency arises because the query can utilize the pre-calculated and stored data of the Materialized View instead of processing the entire underlying dataset.
While the benefits of improved performance and reduced costs are appealing, implementing Materialized Views requires careful management of when to compute and refresh the data. Different systems adopt various strategies for this; let's explore how BigQuery handles these challenges.
Data Freshness
The first consideration is whether to ensure that Materialized Views always reflect the latest data.
We have two choices:
- Guarantee the latest data (data freshness), necessitating constant data refresh.
- Allow for outdated data (data staleness), simplifying implementation (for example, refreshing every 5 minutes), but potentially limiting use cases.
BigQuery supports both methods, but by default, it prioritizes data freshness, so we'll focus on that approach first.
Data Refreshing
Having agreed on the need for data freshness, several strategies arise:
- Avoid storing view data, always querying the base tables. This guarantees freshness but operates like a regular view.
- Store view data on request and refresh it when base data changes, functioning similarly to a cache, which BigQuery already offers.
- Calculate and refresh view data on a schedule. This allows for pre-calculated data availability when a query arrives, provided no base data changes occurred post-refresh. This is indeed how BigQuery operates.
Upon declaring a Materialized View, BigQuery will:
- Initiate the initial data calculation (which may take time depending on data volume).
- Refresh the Materialized View data within the next 5 minutes if there are changes to the base tables (but not more frequently than every 30 minutes).
Instead of adhering to a fixed schedule (like every 5 minutes), which may not be effective if data updates infrequently, BigQuery opts to refresh data upon changes to base tables.
But how do we ensure the promised data freshness? If a query for the Materialized View arrives shortly after the base data changes, don't worry. BigQuery will directly query the base data if the Materialized View data is outdated (similar to a regular view).
In summary, BigQuery's Materialized View strategy includes:
- Calculating view data upon creation.
- Updating view data within 5 minutes following any changes to base data, with a maximum cap of 30 minutes.
When querying Materialized View data:
- If the data is current, it is utilized (reducing the data size queried).
- If not current, the query will fetch data from the base tables.
This leads to several follow-up questions:
Can I modify the 5 and 30-minute settings? You can enable or disable automatic refresh.
ALTER MATERIALIZED VIEW PROJECT.DATASET.MATERIALIZED_VIEW
SET OPTIONS (enable_refresh = true);
However, the 5-minute interval is not adjustable, as it is a best-effort approach without guarantees on refresh timing. The 30-minute frequency cap can be modified.
ALTER MATERIALIZED VIEW PROJECT.DATASET.MATERIALIZED_VIEW
SET OPTIONS (refresh_interval_minutes = 60);
Is manual data refresh possible? Yes, you can manually refresh the data regardless of the automatic refresh settings.
CALL BQ.REFRESH_MATERIALIZED_VIEW('PROJECT.DATASET.MATERIALIZED_VIEW');
What if my data changes frequently? If your base data updates often (e.g., every minute) and queries are constant, the Materialized View may essentially function as a regular view. This is something to consider.
What if my data is extensive? Is constant refreshing costly? Yes, it can be expensive. BigQuery addresses this by employing a method of incremental updates rather than full recalculations, only modifying data that has changed.
Can incremental updates apply to any SQL statements in the view? Unfortunately, no. Materialized Views do come with restrictions on allowable SQL statements.
What if I need to use SQL features outside these restrictions? This is where the allow_non_incremental_definition option comes into play, which is now GA!
What if I don’t require strict data freshness? Since BigQuery guarantees freshness, it queries base data whenever the Materialized View data is outdated. If you can work with data that is 15, 30, or more minutes old, you can set a max_staleness interval.
CREATE MATERIALIZED VIEW project-id.my_dataset.my_mv_table
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60,
max_staleness = INTERVAL "4:0:0" HOUR TO SECOND)
AS
...
This allows you to bypass querying the base dataset when changes are relatively recent, balancing freshness and cost.
Combining allow_non_incremental_definition and max_staleness Returning to the more complex queries within views, BigQuery aims to avoid full recalculations through incremental updates. However, if you need to bypass these restrictions, you can use the allow_non_incremental_definition statement. You must specify a max_staleness interval to understand the implications of this choice.
CREATE MATERIALIZED VIEW my_project.my_dataset.my_mv_table
OPTIONS (
enable_refresh = true, refresh_interval_minutes = 60,
max_staleness = INTERVAL "4" HOUR,
allow_non_incremental_definition = true)
AS
...
Is that everything? One more point: Materialized Views with allow_non_incremental_definition enabled are not compatible with smart-tuning.
What is smart-tuning? Smart-tuning is a significant topic. In essence, BigQuery automatically rewrites queries to utilize Materialized Views wherever feasible. This improves query performance and cost without altering results. For a query to be rewritten, the Materialized View must meet certain conditions:
- It must belong to the same dataset as one of its base tables.
- It must utilize the same set of base tables as the query.
- It must include all columns being read.
- It must cover all rows being read.
If you're interested in learning more, check out the smart-tuning documentation.
I hope this overview of the newly available features has been informative!
For more insights on Hidden Gems of BigQuery, check out the following articles:
- Auto, Transactions and Indexes — Hidden Gems of BigQuery
- Change History and Defaults — Hidden Gems of BigQuery — Part 2
- Data Lineage and Routines — Hidden Gems of BigQuery — Part 3
- Views, Caching, Scheduled Queries, Materialized Views — Hidden Gems of BigQuery — Part 4 — Five types of “Views”
- ML in BigQuery, including new Generative AI — BigQuery Breaking News — Generative AI
- Time-travel window, fail-safe, snapshots, and clones — Hidden Gems of BigQuery — P6 — Time-traveling and clones
- TRY…CATCH, RAISE, ERROR, SAFE_ and SAFE. — Hidden Gems of BigQuery — P7 — Safety first