clypd’s Lessons Learned in PostgreSQL Performance

By March 24, 2014 June 5th, 2014 clypd, Engineering, Technology

At clypd, we have been using PostgreSQL and it has been a great experience. We are impressed with its simplicity, good documentation, active community, abundance of functions and lot more. This post jots lessons learned and insights acquired during a recent journey in improving performance of one of our queries. Most resources and tools in this post point to other sites, however we think it would be helpful to list various ‘performance improvement’ resources in one page.

Our query involved 3 sub queries, each sub query performing joins over 2-3 tables; results from the sub queries are joined, aggregated and sorted. The query probably has same number of words as this post (ok…at least half). For easier reading we will group our findings in three buckets.

Early improvements

Early in the process we found that there were several small changes we could make to the first version of query implementation. These tips required a bit more insight into the workings of the query planner, but were pretty easy once we realized how the query was being executed:

  • Take another pass at your WHERE clauses. Make sure applicable criteria are specified to reduce the size of data. Keep in mind that less data means faster JOINs
  • Indexes. Make sure they exist on appropriate column(s) and ANALYZE is run on the table after adding new indexes. We run the VACUUM ANALYZE combination on tables after making changes related to indexes, updating data, etc.
  • Pre compute data whenever possible to avoid JOINS for each run. One of the four sub-queries was performing the same JOIN each time we ran the query. We precomputed the joined data in a table which helped the performance.

(Analyze and iterate) * N

Once we had eliminated the low-hanging fruit, we had to dig deeper into the inner workings of the Postgres query planner to really see how it was executing our query, and understand what we could do to make this job easier:

  • Its important to know how query planner is executing your query – use EXPLAIN ANALYZE. Output of our query was daunting (and definitely had more words than this post). Here are some resources that helped us
    • Explaining EXPLAIN – slides on PostgreSQL’s wiki, a must read
    • A picture is worth thousand words – pgAdmin III shows a nice graphical flow
    • EXPLAIN Output Visualisation – This tool helped a lot. The visual hierarchy with the colors helped us to hone in on the merge joins and nested loop operations that had major contributions to overall response times.
  • Know the difference between various JOIN operations and tune the query to utilize simpler JOINS. The Explaining EXPLAIN slides mentioned above are the best resource. A short one page version of all operations can be found here. One of the sub-queries was performing a MERGE JOIN on two large tables. The query structure was something like

    SELECT tx.xa, tx,xb, SUM(ty.ya)
    FROM tx
    JOIN ty on
    JOIN tz ON
    AND tx.xq = tz.zq AND between X and Y
    GROUP BY tx.xa,tx.xb
    ORDER BY tx.xa, tx.xb

    In our case moving the tz filter into a WHERE EXISTS clause replaced MERGE JOIN with a HASH JOIN. Merges joins sort data on both sides before doing the join. Here’s the restructured query

    SELECT tx.xa, tx,xb, SUM(ty.ya)
    FROM tx
    JOIN ty on
    FROM tz
    WHERE AND tz.xr = tz.zr
    AND between X and Y)
    GROUP BY tx.xa,tx.xb
    ORDER BY tx.xa, tx.xb

PostgreSQL Server tuning

Apart from tuning the queries themselves, there is definitely some performance improvement to be gained from ensuring you are getting the most from your Postgres server’s hardware. Everyone’s setup is different here, but this is how we got the most from our machines:

  • PostgreSQL 9.0 High Performance and PostgreSQL 9 Administration Cookbook are 2 nice books on PostgreSQL server tuning. If you have a Safari Books account, both books are carried by the site (clypd has an account)
  • PostgreSQL wiki has nicely documented various tuning parameters in the config file
  • There is literature on the web with recommended values of these parameters based on your hardware. pgtune is a nice utility which will, in their words expand the database server to be as powerful as the hardware it’s being deployed on
  • PostgreSQL stores various configuration values in its pg_settings catalog. It’s a good practice to make sure he values we specify are being used. Here’s a simple query that checks the value of shared_buffers
  • select name, setting from pg_settings where name = 'shared_buffers';

  • Lastly, though not a tuning utility, pgtop, for monitoring PostgreSQL processes can definitely help to hunt down issues.
  • This is by no means a complete list of tips and tricks, simply an aggregation of resources and tips we found useful. We will post updates in this space as we learn more. Which tools do you use to get the most out of your PostgreSQL?

    Wrestling databases into submission is one of many things that Sumit Shah does as Principal Engineer at clypd.

    Leave a Reply