karasms.com

PostgreSQL to BigQuery: Incremental Materialization with DBT

In this article, we'll explore the process of swiftly ingesting PostgreSQL data into BigQuery using Airbyte, and demonstrate how to implement incremental materialization with DBT. It is assumed that Airbyte is already installed. Let's dive in.

Connecting Airbyte to PostgreSQL

Once the source setup is complete, the next step involves establishing a connection to BigQuery as the destination.

I have configured the system to extract and load two tables, Employee and Branch, from PostgreSQL to BigQuery every 24 hours with a full refresh. You can adjust the configuration according to your needs before finalizing the connection.

Normalization and Transformation

Airbyte provides the capability to ingest raw data in JSON format and transform it into a normalized tabular structure. This ensures that the data is efficiently organized for storage and analysis in a relational database or data warehouse. After the initial sync, the branch and employee tables from PostgreSQL were successfully loaded into BigQuery.

Transformation with DBT

DBT, or Data Build Tool, is an open-source data transformation tool that enhances the quality of analytics workflows while ensuring data accuracy through testing and monitoring. It allows users to compile and execute analytics code directly on their data platform.

Materialization in DBT

Materialization in DBT refers to how the results of a DBT model are stored in the target data warehouse or database. DBT offers various materialization types, each serving different requirements.

Common materialization types include:

  • Table Materialization: When a model is defined as a table, DBT creates a physical table in the destination warehouse by executing the model's SQL query. Each subsequent run rebuilds the table and replaces the existing one.

    Example usage: {{config(materialized='table')}}

  • View Materialization: This is the default option. DBT creates a logical view instead of a physical table. The SQL query runs every time the view is accessed, which is useful for creating virtual tables based on underlying models.

    Example usage: {{config(materialized='view')}}

  • Incremental Materialization: This approach processes only new or updated data during each run, which reduces resource use and processing time, making it ideal for large datasets.

    Example usage: {{config(materialized='incremental')}}

  • Ephemeral Materialization: Nothing is built in the warehouse. Instead, DBT interpolates the model's SQL into the models that reference it as a common table expression (CTE) at compile time.

    Example usage: {{config(materialized='ephemeral')}}

Installing and Configuring DBT

To set up DBT, initiate a new project using the command dbt init <project_name>. For example, to create a project named "test," run:

dbt init test

Connect DBT to your BigQuery dataset to efficiently transform the ingested data.

After selecting your database, choose your authentication method. If using a GCP service account, provide the key_file.json for your project, along with the corresponding project ID and dataset.

Next, set the number of threads and the job execution timeout, and select the dataset location from the available options. These settings can be updated later in the .dbt/profiles.yml file.

The profiles.yml file is created in the .dbt folder of the current user's home directory (~/.dbt/profiles.yml) when dbt init sets up the first profile, and it holds the connection details DBT uses for the project.

Once the DBT project is created, default directories will be established.

DBT models are SQL artifacts that define the logic for data transformation, promoting modularity, dependency management, and efficient processing. A well-structured DBT model is essential for effective data pipeline management.

Best Practices for DBT Model Structure

Creating a clear structure in DBT requires establishing guidelines for organizing, naming, and documenting data models to ensure maintainability as the data scales. This consistent approach enhances collaboration and overall project quality.

In DBT, three main types of structures can be defined:

  • Staging
  • Intermediate
  • Mart

Within the models/example folder, two default models, my_first_dbt_model.sql and my_second_dbt_model.sql, are created. The schema.yml file describes the models DBT builds: it documents models and their columns, declares tests on them, and can carry model-level configuration such as materialization options.

To manage transformation queries effectively, three folders—staging, intermediate, and mart—are established.

Staging Models serve as foundational components for data models, focusing on basic calculations and renaming without complex joins or aggregations. They are not the final outputs.

I am creating two SQL files, employee.sql and branch.sql, within the Staging folder, using ephemeral materialization.
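As a rough illustration, a staging model along these lines could look like the sketch below. The column names (id, name, branch_id, updated_at) and the table path are assumptions made for the example, since the actual schema is not shown here; branch.sql would follow the same pattern.

```sql
-- models/staging/employee.sql (illustrative sketch)
-- Ephemeral: no table or view is built in BigQuery for this model.
{{ config(materialized='ephemeral') }}

select
    id         as employee_id,    -- hypothetical column
    name       as employee_name,  -- hypothetical column
    branch_id,                    -- hypothetical column
    updated_at                    -- hypothetical column
from `your-gcp-project.your_dataset.employee`  -- placeholder: table loaded by Airbyte
```

Only renaming and column selection happen at this layer; joins are left to the intermediate model.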

Intermediate Models combine multiple models to clean and transform data, including derived fields, aggregates, joins, and enforcing business rules.

I am creating a SQL file named join.sql, which integrates the employee and branch SQL files, utilizing ephemeral materialization.
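A minimal sketch of such an intermediate model is shown below. It assumes the staging models expose employee_id, employee_name, branch_id, branch_name, and updated_at; these column names are hypothetical and should be replaced with the real ones.

```sql
-- models/intermediate/join.sql (illustrative sketch)
{{ config(materialized='ephemeral') }}

select
    e.employee_id,
    e.employee_name,
    b.branch_name,    -- hypothetical column from the branch staging model
    e.updated_at
from {{ ref('employee') }} as e
left join {{ ref('branch') }} as b
    on e.branch_id = b.branch_id   -- hypothetical join key
```

Using ref() rather than hard-coded table names lets DBT work out that join.sql depends on the two staging models.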

Mart Models represent detailed business entities intended for end-users and BI tools. They store enriched and transformed data, providing a reliable source of information.

I will create a SQL file called employee_details.sql using incremental materialization.

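When defining a model with materialized='incremental', a unique_key should be specified whenever source rows can be updated. DBT uses this key to match incoming rows against rows already in the target table and update them in place. Without a unique_key, incoming rows are simply appended, which can lead to duplicates.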
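One possible shape for employee_details.sql is sketched below. The unique_key of id matches the id column of employee_details; the remaining column names and the updated_at filter are assumptions for illustration and are not necessarily what was used in this project.

```sql
-- models/mart/employee_details.sql (illustrative sketch)
{{ config(materialized='incremental', unique_key='id') }}

select
    employee_id as id,   -- hypothetical upstream column, exposed as the unique key
    employee_name,       -- hypothetical column
    branch_name,         -- hypothetical column
    updated_at           -- hypothetical change-tracking column
from {{ ref('join') }}

{% if is_incremental() %}
-- This filter is applied only when the target table already exists and the run
-- is not a full refresh; it limits the query to rows changed since the last run.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

On the first run DBT builds the full table. On later runs it merges the selected rows into the existing table by id, so an updated employee replaces the old row instead of being inserted a second time. Running dbt run --full-refresh rebuilds the table from scratch.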

The staging, intermediate, and mart layers are critical components of a well-structured DBT model, facilitating an efficient data warehouse design for insightful analytics.

Collaboration and Version Control with GitHub

Maintaining the integrity and consistency of data analytics projects requires effective collaboration and version control. GitHub is an excellent platform for integrating with Airbyte and DBT.

Begin by installing Git on your local machine, logging into GitHub, and creating a repository.

Once the repository is set up, you can connect DBT to GitHub, allowing your transformations to be tracked and versioned. Follow these steps to initiate Git and host your DBT project files:

git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin <paste github repository>
git push -u origin main

After linking DBT with GitHub, the created models in DBT will be reflected in the GitHub repository.

GitHub displays the repository's web URL under the Code button. This URL is what Airbyte needs in order to run the DBT project as part of a sync.

Enabling Transformation in Airbyte

To enhance the previously established connection, we will now integrate transformation capabilities via DBT. This requires incorporating the DBT GitHub link into Airbyte for seamless data transformations.

In Custom Transformation, select "Add Transformation," paste the GitHub URL (the repository link for your custom transformation project), and save the transformation.

Output

After enabling sync in Airbyte, data from PostgreSQL is extracted and loaded into BigQuery with transformations applied, resulting in an incremental table named employee_details.

The source tables (employee and branch) in PostgreSQL:

The transformed table in BigQuery (employee_details):

In this scenario, both the source and transformed tables initially contained 8 rows. After updating one row and adding another in the source PostgreSQL table, a manual sync was triggered in Airbyte. The incremental model, keyed on the id column in employee_details, ensured that only new and modified data from the source table were processed and transformed in BigQuery.

It was confirmed that the newly added and updated rows from the PostgreSQL table were accurately reflected in the employee_details table in BigQuery. This strategy significantly reduces processing time when dealing with large datasets, as only the pertinent data is synchronized.
