Scalable data warehouse

Amanuel Zewdu
6 min read · Nov 12, 2022

Using Airflow, PostgreSQL, dbt, and Redash

Overview

Business understanding

The challenge is to build a data warehouse that organizes data so that a number of downstream projects can query it efficiently. The project builds a data pipeline following the Extract, Load, Transform (ELT) framework with dbt. We first create a PostgreSQL table and load the preprocessed data into it in an orchestrated manner using Airflow. Finally, we transform the data (e.g. aggregations) and prepare visualizations and a dashboard with Redash.

Data source and format

The data used for this project comes from pNEUMA, an open large-scale dataset of naturalistic trajectories of half a million vehicles, collected in a one-of-a-kind experiment by a swarm of drones over the congested downtown area of Athens, Greece. DVC is used to track versions of the 86 MB of data downloaded from the source.

Theoretical background

A technology stack is a set of technologies combined to build an application. Also known as a technology infrastructure or solutions stack, a well-chosen stack is essential for building easy-to-maintain, scalable applications. The chart below shows the ELT pipeline together with the tech stack used for this project.

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. It is one of the most robust ETL (Extract, Transform, Load) workflow management tools, used by data engineers to orchestrate workflows and pipelines. With Airflow you can visualize your data pipelines' dependencies, logs, code, triggered tasks, and progress status. Its benefits include scalability and easy handling of the flow between jobs. In this project, Airflow automates and orchestrates tasks through DAGs (Directed Acyclic Graphs), which are Python scripts that define the data pipeline. DAGs were created in Airflow that use Airflow's built-in operators to load the data files into the PostgreSQL database.
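To make this concrete, here is a minimal sketch of such a DAG using the Postgres operator, matching the create_table and load_data tasks described later in the methodology. The connection ID, file paths, and dates are illustrative assumptions, not the exact project code.

```python
# dags/load_data.py -- illustrative sketch; connection ID, paths and dates are assumptions
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="loaddata",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    # Create the target table in the warehouse if it does not exist yet.
    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="postgres_dwh",  # assumed Airflow connection name
        sql="sql/create_table.sql",
    )

    # Insert the preprocessed trajectory records into the table.
    load_data = PostgresOperator(
        task_id="load_data",
        postgres_conn_id="postgres_dwh",
        sql="sql/load_data.sql",
    )

    create_table >> load_data  # load only after the table exists
```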

PostgreSQL is a useful and common data warehouse tool maintained by an active community. It can handle more than one kind of data processing, which makes it a compelling option, and it works with almost any programming language used in modern data extraction, from Python to .NET and more. It is also free and easy to install and try right away. Its drawbacks include potential security vulnerabilities and weak built-in load balancing.

dbt enables analytics engineers to transform data in their warehouses by simply writing select statements; dbt handles turning these select statements into tables and views. As a transformation tool it covers the 'T' in ELT: it doesn't extract or load data, but it is extremely good at transforming data that is already loaded into your warehouse. A dbt project is a directory of .sql and .yml files. The directory must contain models and a project file. A model is a single .sql file containing a single select statement that either transforms raw data into a dataset ready for analytics or, more often, is an intermediate step in such a transformation. In this project we have models that aggregate average speed and total distance per vehicle. The project file (dbt_project.yml) configures and defines your dbt project.
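As a minimal sketch of one such model (the source table and column names are assumptions based on the pNEUMA trajectory fields, and the raw table is assumed to be declared as a dbt source in a schema .yml file):

```sql
-- models/vehicle_aggregates.sql
-- Illustrative sketch only: the source and column names are assumptions,
-- not the exact project code.
{{ config(materialized='table') }}

select
    type            as vehicle_type,
    avg(avg_speed)  as average_speed,
    sum(traveled_d) as total_distance
from {{ source('pneuma', 'trajectories') }}
group by type
```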

dbt connects to a data warehouse to run data transformation queries. It natively supports connections to Snowflake, BigQuery, Redshift, and Postgres, and there are a number of community-supported adapters for other warehouses. A Postgres data warehouse is used for this project, so the PostgreSQL target was set up in the profiles.yml configuration file at the root of the project. Finally, the transformed data is connected to the reporting environment and a dashboard is created from it using Redash.
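A minimal profiles.yml sketch for such a Postgres target is shown below; apart from the profile and database names mentioned in this article, the host, credentials, and schema are placeholder assumptions.

```yaml
# profiles.yml -- illustrative sketch; host, credentials and schema are placeholders
pg_dwh:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dwh_user
      password: "{{ env_var('DBT_PASSWORD') }}"  # keep secrets out of the file
      dbname: postgres_dwh
      schema: public
      threads: 4
```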

Figure 1: Tech-stack — Airflow, PostgreSQL, DBT and Redash

Methodology — discussion

  • The first sub-task is to create a DAG in Airflow that loads the data into the PostgreSQL database. A 'load_data.py' DAG was created with the id 'loaddata', scheduled on a daily interval. Inside this DAG there are two Postgres operators, create_table and load_data: the first creates the tables using the create_table.sql script, and the second loads the tabular data into postgres_dwh using the load_data.sql file in the sql folder (a sketch of the table-creation script follows after this list). The PostgreSQL database was created using DBeaver and integrated into Airflow.
Figure 2: Airflow dags to create table and populate it
Figure 3: Table creation and data insertion through airflow (DBeaver management tool)
  • Then, a connection from dbt to the data warehouse, named 'pg_dwh', was made, and transformation code for the data was written in SQL. These transformation models aggregate the data by vehicle type, average speed, and total distance travelled.
Figure 4: DBT — Lineage graph
  • Finally, connecting to the Redash environment to create visual reports and a dashboard completed the challenge. Normally we would install Redash locally using Docker by hand, but the developers behind Redash have created a very easy-to-use setup script that installs everything you need (docker-ce and docker-compose) and then deploys everything necessary for Redash to function. We used this setup.sh (the commands are sketched after this list) and visualized our data from the local PostgreSQL database to create a dashboard.
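As referenced in the first step above, here is a sketch of what the create_table.sql script might contain. The column names and types are assumptions based on the pNEUMA trajectory fields, not the project's exact schema.

```sql
-- sql/create_table.sql
-- Illustrative sketch only: column names and types are assumptions.
CREATE TABLE IF NOT EXISTS trajectories (
    track_id    INTEGER,          -- unique id of a tracked vehicle
    type        TEXT,             -- vehicle type (car, taxi, bus, motorcycle, ...)
    traveled_d  DOUBLE PRECISION, -- total distance travelled
    avg_speed   DOUBLE PRECISION  -- average speed over the trajectory
);
```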
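For the Redash step, the setup flow described above roughly amounts to fetching the official setup repository and running its script. This is a sketch assuming a single-host Docker deployment; the exact commands and paths may differ on your machine.

```bash
# Illustrative sketch of the single-host Redash setup described above.
git clone https://github.com/getredash/setup.git redash-setup
cd redash-setup
chmod +x setup.sh
./setup.sh   # installs docker-ce / docker-compose and brings Redash up
```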

Results

The following visualizations were produced at the end of this project.

Conclusion

As set out in the plan document, successfully completing this challenge required finishing the four specified sub-tasks. The first was the orchestration and automation of feeding data into the database, implemented with Airflow DAGs that run on a daily basis. The data transformation task was then completed using dbt: models were written in SQL to aggregate the data over fields such as average speed, total distance, and vehicle type. The last sub-task was visualization through Redash, completed by connecting the database to a local setup of the tool.

Challenges: Lessons

The key takeaway from this project was the orchestration of tasks using Airflow. It was challenging at first, since the data could not be fed to the DAGs as obtained; it had to be preprocessed first. Moving on to the dbt transformations, setting up dbt was itself challenging because we wanted to put it in the same Docker file and run its console and serving alongside Airflow. Finally, when we tried to visualize the data, we had to find a way to connect the locally hosted database to Redash. This was also a new and challenging task until we discovered that the Redash developers had already prepared a setup.sh that can be run with docker-compose.

Future work

In the near future, the project could be improved so that it is fully implemented inside one docker-compose setup and run at once. Another important improvement would be to extend the pipeline so that Airflow continuously and automatically reads new data from the pNEUMA source based on file names and feeds those files to the DAGs that load data into the PostgreSQL database, allowing the workflow to run daily and provide downstream users with an up-to-date data warehouse.

References: Further reading

Airflow documentation

DBT documentation

Redash documentation

DVC documentation

PostgreSQL documentation

https://medium.com/@komal1491/installing-airflow-dbt-python-with-docker-52ea5196c7ca

https://www.techrepublic.com/article/how-to-deploy-redash-data-visualization-dashboard-help-docker/

