Cache Ugly Reporting Queries With Materialized Views and Docker

Confidence and trust in your SaaS product depend, in part, on the continual conveyance of the value of the solution you provide. The reporting vectors (web-based dashboards, daily emails, etc.) obviously depend upon the specifics of your product and your engagement plan with your customers. But underlying all sorts of reporting is the need to derive hard metrics from databases: What's the usage of your application by seat? How has that usage driven value and efficiency for your customers? What are the trends and anomalies worth calling out?

The bad news is that many of the most insightful metrics require complex joins across tables, and as you scale out to more and more customers, queries across multitenant databases will take longer and longer. The good news is that, unlike interactive exploration or real-time monitoring and alerting, many of the queries against your production databases can be evaluated lazily and refreshed periodically.

At Wise.io, we needed a way to cache and periodically update long-running/expensive queries so that we could have more responsive dashboards for our customers and our implementation engineers. After some research, including exploration with 3rd party vendors, we settled on leveraging materialized views. This is a brief primer on a lightweight caching/update solution that uses materialized views coupled with Docker.

Materialized views are database objects that act as snapshots of certain queries. Whereas a SELECT against a view dynamically reissues the associated stored query, a SELECT against a materialized view queries a cache of the original query result. Without the overhead of query planning, scanning, and executing against the database, a SELECT against a materialized view can be extremely fast. Materialized views come out of the box in Oracle, DB2, Microsoft SQL Server ("indexed views"), and Postgres.
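To make the distinction concrete, here is a minimal sketch (the orders table and its columns are hypothetical):

    -- A plain view: every SELECT re-executes the stored query.
    CREATE VIEW daily_revenue AS
      SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
      FROM orders
      GROUP BY 1;

    -- A materialized view: the result is computed once and cached; subsequent
    -- SELECTs read the cache until you issue a REFRESH MATERIALIZED VIEW.
    CREATE MATERIALIZED VIEW daily_revenue_cached AS
      SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
      FROM orders
      GROUP BY 1;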

For reporting purposes, we can use materialized views to cache long-running/expensive queries and then refresh those views on a cadence that is appropriate for the use case. The outline of the strategy is as follows:

  1. Identify and hone the SQL statements that are ripe for MVs. These are queries that take a long time to execute and/or only need refreshing periodically (as opposed to in real time).
  2. Set a database-wide naming convention for the MVs that will be easy to remember (e.g., rolling_roi_all_clients_daily).
  3. Create the MVs in your database. We do this programmatically in staging/production using Python/Alembic. But you can certainly build these by hand.
  4. Determine an approach to identify which MVs need refreshing. You can maintain a DB table that contains the time until the next update for each MV. Another approach is to name the MVs in a way that makes it easy to determine on what timescale they should be updated, for instance by adding a "daily" or "hourly" suffix; this is the strategy we employ in the toy example below. Scheduling the refreshes allows you to control when the underlying expensive queries will add load to your DB. You might want to schedule MV refreshes for times of the day when the load is typically small.
  5. Create a job that performs the refreshes. We build these jobs into our middleware (via Python/Celery), but you could also have a cron job running on the DB machine. Or, as we show below, you could have a small Docker container that runs the update jobs periodically.

Creating the Materialized Views (in Postgres)

To create a materialized view, just add the appropriate line above your super ugly SQL query:
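For instance, using the naming convention from step 2 (the tables and columns in the query body are hypothetical stand-ins for your actual expensive SQL):

    CREATE MATERIALIZED VIEW rolling_roi_all_clients_daily AS
      -- everything below the CREATE line is your existing long-running query
      SELECT c.client_id,
             c.client_name,
             sum(u.value_delivered) / nullif(sum(u.cost), 0) AS rolling_roi
      FROM clients c
      JOIN usage_events u ON u.client_id = c.client_id
      WHERE u.event_time > now() - interval '30 days'
      GROUP BY c.client_id, c.client_name;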

This should be done by a DB user that has the CREATE privilege as well as SELECT privileges on the appropriate tables/rows. Next, to allow the view to be refreshed without blocking reads against it, you need to add a unique index on one of the columns returned in that view (REFRESH MATERIALIZED VIEW CONCURRENTLY requires one):
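For the hypothetical view above, that could look like:

    -- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
    CREATE UNIQUE INDEX rolling_roi_all_clients_daily_idx
      ON rolling_roi_all_clients_daily (client_id);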

Scripting to Refresh the Materialized Views (in Python)

The key here is to leverage the strategy you set up for identifying which MVs should be updated and on what schedule. A simple strategy is to identify the MVs by name (e.g., those ending in 'hourly'), using the pg_class table to query for MVs. Then loop over all the MVs that need updating and issue a REFRESH MATERIALIZED VIEW CONCURRENTLY for each. For example:
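Here is a minimal sketch of such a script; the connection string and the default suffix are assumptions to adapt to your environment:

    #!/usr/bin/env python
    """Refresh all materialized views whose names end with a given suffix."""
    import sys

    import psycopg2

    DSN = "host=mydbhost dbname=mydb user=mv_refresher"  # hypothetical connection string

    def refresh_matviews(suffix):
        conn = psycopg2.connect(DSN)
        # REFRESH MATERIALIZED VIEW CONCURRENTLY cannot run inside a transaction block.
        conn.autocommit = True
        cur = conn.cursor()
        # pg_class has one row per relation; relkind 'm' marks materialized views.
        cur.execute(
            "SELECT relname FROM pg_class WHERE relkind = 'm' AND relname LIKE %s",
            ("%" + suffix,),
        )
        for (name,) in cur.fetchall():
            # Names come straight from pg_class, so simple interpolation is fine here.
            cur.execute('REFRESH MATERIALIZED VIEW CONCURRENTLY "{}"'.format(name))
        cur.close()
        conn.close()

    if __name__ == "__main__":
        refresh_matviews(sys.argv[1] if len(sys.argv) > 1 else "hourly")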

Run the Refresh Script Periodically (in Docker using cron)

You can create a lightweight Docker image that can run cron and Python; it also needs psycopg2 (to connect to the Postgres DB). The 'root' file contains the cron jobs that you wish to run:
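A minimal sketch of the Dockerfile, assuming the refresh script above is saved as refresh_matviews.py (the base image tag and Alpine package names are assumptions to verify):

    FROM gliderlabs/alpine:3.3
    RUN apk add --no-cache python py-psycopg2
    COPY refresh_matviews.py /refresh_matviews.py
    # BusyBox cron reads per-user crontabs from /etc/crontabs.
    COPY root /etc/crontabs/root
    # Run cron in the foreground so the container stays alive.
    CMD ["crond", "-f"]

And the 'root' crontab, which here refreshes the 'hourly' views at the top of each hour and the 'daily' views at 03:15, when load is presumably low:

    # min hour day month weekday  command
    0  *  *  *  *  python /refresh_matviews.py hourly
    15 3  *  *  *  python /refresh_matviews.py daily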

This Docker image is about 52 MB in size (thanks Gliderlabs!).

Once this is up and running, your long-running/expensive queries should be properly cached, allowing you to turn your attention to the million other unoptimized things in your stack. ;)

I created a GitHub gist with all the files you need, including the files needed to install this in an Elastic Beanstalk environment. Remember: you'll need to make sure that the instance you are running on is given the proper permissions (via security groups) to access the database. Enjoy! And let us know if you have any suggestions for improvements.


Originally posted at wise.io/blog … see the archive.org link.

Joshua Bloom
Professor of Astronomy

Astrophysics Prof at UC Berkeley, former Wise.io cofounder (acquired by GE); Former Department Chair; Inventor; Dad, Tennis everything. Anti #TransparentMoon. Check out his group activities at ml4science.org and art exhibition CuratingAI.art (Spring 2024).
