Rewriting PostgreSQL Queries
This is Part 2 of a guide on migrating from PostgreSQL to ClickHouse. Using a practical example, it demonstrates how to efficiently carry out the migration with a real-time replication (CDC) approach. Many of the concepts covered are also applicable to manual bulk data transfers from PostgreSQL to ClickHouse.
Most SQL queries from your PostgreSQL setup should run in ClickHouse without modification and will likely execute faster.
Deduplication using CDC
When using real-time replication with CDC, keep in mind that updates and deletes may result in duplicate rows. To manage this, you can use techniques involving Views and Refreshable Materialized Views.
Refer to this guide to learn how to migrate your application from PostgreSQL to ClickHouse with minimal friction when migrating using real-time replication with CDC.
Optimize queries in ClickHouse
While this is possible to migrate with minimum query rewriting, it is recommended to leverage ClickHouse features to significantly simplify queries and further improve query performance.
The examples here covers common query patterns and show how to optimize them with ClickHouse. They use the full Stack Overflow dataset (up to April 2024) on equivalent resources in PostgreSQL and ClickHouse (8 cores, 32GiB RAM).
For simplicity, the queries below omit the use of techniques to deduplicate the data.
Counts here will slightly differ as the Postgres data only contains rows which satisfy the referential integrity of the foreign keys. ClickHouse imposes no such constraints and thus has the full dataset e.g. inc. anon users.
Users (with more than 10 questions) which receive the most views:
Which tags
receive the most views
:
Aggregate functions
Where possible, users should exploit ClickHouse aggregate functions. Below we show the use of the argMax function to compute the most viewed question of each year.
This is significantly simpler (and faster) than the equivalent Postgres query:
Conditionals and Arrays
Conditional and array functions make queries significantly simpler. The following query computes the tags (with more than 10000 occurrences) with the largest percentage increase from 2022 to 2023. Note how the following ClickHouse query is succinct thanks to the conditionals, array functions, and ability to reuse aliases in the HAVING and SELECT clauses.