Cache Ugly Reporting Queries With Materialized Views and Docker

Confidence and trust in your SaaS product depend, in part, on the continual conveyance of the value of the solution you provide. The reporting vectors (web-based dashboards, daily emails, etc.) obviously depend upon the specifics of your product and your engagement plan with your customers. But underlying all sorts of reporting is the need to derive hard metrics from databases: What's the usage of your application by seat? How has that usage driven value and efficiency for your customers? What are the trends and anomalies worth calling out?

The bad news is that many of the most insightful metrics require complex joins across tables, and as you scale out to more and more customers, queries across multitenant databases will take longer and longer. The good news is that, unlike interactive exploration or real-time monitoring and alerting, many of the queries against your production databases can be evaluated lazily and refreshed periodically.

At Wise.io, we needed a way to cache and periodically update long-running/expensive queries so that we could have more responsive dashboards for our customers and our implementation engineers. After some research, including exploration with 3rd party vendors, we settled on leveraging materialized views. This is a brief primer on a lightweight caching/update solution that uses materialized views coupled with Docker.

Materialized views are database objects that act as snapshots of certain queries. Whereas a SELECT against a view dynamically reissues the associated stored query, a SELECT against a materialized view queries a cache of the original query result. Without the overhead of query planning, scanning, and executing against the database, a SELECT against a materialized view can be extremely fast. Materialized views come out of the box in Oracle, DB2, Microsoft SQL Server ("indexed views"), and Postgres.
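To make the distinction concrete, here is a minimal sketch (the orders table and its columns are hypothetical):

    -- A plain view: every SELECT re-executes the stored query.
    CREATE VIEW daily_revenue AS
      SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
      FROM orders
      GROUP BY 1;

    -- A materialized view: the result is computed once and cached; subsequent
    -- SELECTs read the cache until you issue a REFRESH MATERIALIZED VIEW.
    CREATE MATERIALIZED VIEW daily_revenue_cached AS
      SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
      FROM orders
      GROUP BY 1;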

For reporting purposes, we can use materialized views to cache long-running/expensive queries and then refresh those views on a cadence that is appropriate for the use case. The outline of the strategy is as follows:

  1. Identify and hone the SQL statements that are ripe for MVs. These are queries that take a long time to execute and/or only need refreshing periodically (as opposed to in real time).
  2. Set a database-wide naming convention for the MVs that will be easy to remember (e.g., rolling_roi_all_clients_daily).
  3. Create the MVs in your database. We do this programmatically in staging/production using Python/Alembic. But you can certainly build these by hand.
  4. Determine an approach to identify which MVs need refreshing. You can maintain a DB table that contains the time until the next update for each MV. Another approach is to name the MVs in a way that makes it easy to determine on what timescale they should be updated, for instance by adding a "daily" or "hourly" suffix; this is the strategy we employ in the toy example below. Scheduling the refreshes allows you to control when the underlying expensive queries will add load to your DB. You might want to schedule MV refreshes for times of the day when the load is typically small.
  5. Create a job that performs the refreshes. We build these jobs into our middleware (via Python/Celery), but you could also have a cron job running on the DB machine. Or, as we show below, you could have a small Docker container that runs the update jobs periodically.

Creating the Materialized Views (in Postgres)

To create a materialized view, just add the appropriate line above your super ugly SQL query:
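For instance, using the naming convention from step 2 (the tables and columns in the query body are hypothetical stand-ins for your actual expensive SQL):

    CREATE MATERIALIZED VIEW rolling_roi_all_clients_daily AS
      -- everything below the CREATE line is your existing long-running query
      SELECT c.client_id,
             c.client_name,
             sum(u.value_delivered) / nullif(sum(u.cost), 0) AS rolling_roi
      FROM clients c
      JOIN usage_events u ON u.client_id = c.client_id
      WHERE u.event_time > now() - interval '30 days'
      GROUP BY c.client_id, c.client_name;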

This should be done by a DB user that has the CREATE privilege as well as SELECT privileges on the appropriate tables/rows. Next, to allow the view to be refreshed without blocking reads against it, you need to add a unique index on one of the columns returned in that view (REFRESH MATERIALIZED VIEW CONCURRENTLY requires one):
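For the hypothetical view above, that could look like:

    -- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
    CREATE UNIQUE INDEX rolling_roi_all_clients_daily_idx
      ON rolling_roi_all_clients_daily (client_id);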

Scripting to Refresh the Materialized Views (in Python)

The key here is to leverage the strategy you set up for identifying which MVs should be updated and on what schedule. A simple strategy is to identify the MVs by name (e.g., those ending in 'hourly'), using the pg_class table to query for MVs. Then loop over all the MVs that need updating and issue a REFRESH MATERIALIZED VIEW CONCURRENTLY for each. For example:
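Here is a minimal sketch of such a script; the connection string and the default suffix are assumptions to adapt to your environment:

    #!/usr/bin/env python
    """Refresh all materialized views whose names end with a given suffix."""
    import sys

    import psycopg2

    DSN = "host=mydbhost dbname=mydb user=mv_refresher"  # hypothetical connection string

    def refresh_matviews(suffix):
        conn = psycopg2.connect(DSN)
        # REFRESH MATERIALIZED VIEW CONCURRENTLY cannot run inside a transaction block.
        conn.autocommit = True
        cur = conn.cursor()
        # pg_class has one row per relation; relkind 'm' marks materialized views.
        cur.execute(
            "SELECT relname FROM pg_class WHERE relkind = 'm' AND relname LIKE %s",
            ("%" + suffix,),
        )
        for (name,) in cur.fetchall():
            # Names come straight from pg_class, so simple interpolation is fine here.
            cur.execute('REFRESH MATERIALIZED VIEW CONCURRENTLY "{}"'.format(name))
        cur.close()
        conn.close()

    if __name__ == "__main__":
        refresh_matviews(sys.argv[1] if len(sys.argv) > 1 else "hourly")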

Run the Refresh Script Periodically (in Docker using cron)

You can create a lightweight Docker image that can run cron and Python; it also needs psycopg2 (to connect to the Postgres DB). The 'root' file contains the cron jobs that you wish to run:
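A minimal sketch of the Dockerfile, assuming the refresh script above is saved as refresh_matviews.py (the base image tag and Alpine package names are assumptions to verify):

    FROM gliderlabs/alpine:3.3
    RUN apk add --no-cache python py-psycopg2
    COPY refresh_matviews.py /refresh_matviews.py
    # BusyBox cron reads per-user crontabs from /etc/crontabs.
    COPY root /etc/crontabs/root
    # Run cron in the foreground so the container stays alive.
    CMD ["crond", "-f"]

And the 'root' crontab, which here refreshes the 'hourly' views at the top of each hour and the 'daily' views at 03:15, when load is presumably low:

    # min hour day month weekday  command
    0  *  *  *  *  python /refresh_matviews.py hourly
    15 3  *  *  *  python /refresh_matviews.py daily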

This Docker image is about 52 MB in size (thanks Gliderlabs!).

Once this is up and running, your long-running/expensive queries should be properly cached, allowing you to turn your attention to the million other unoptimized things in your stack. ;)

I created a GitHub gist with all the files you need, including the files needed to install this in an Elastic Beanstalk environment. Remember: you'll need to make sure that the instance you are running on is given the proper permissions (via security groups) to access the database. Enjoy! And let us know if you have any suggestions for improvements.


Originally posted at wise.io/blog … see the archive.org link.

Joshua Bloom
Professor of Astronomy

Astrophysics Prof at UC Berkeley, former Wise.io cofounder (acquired by GE); Former Department Chair; Inventor; Dad, Tennis everything. Anti #TransparentMoon. Check out his group activities at ml4science.org and art exhibition CuratingAI.art (Spring 2024).
