Optimizing queries to retrieve data
Thu 18 Jul 2019

When we work on a complex system, queries get complicated, with a ton of subqueries, and sooner or later we run into the problem of query optimization. There are many ways to reach the same goal, so which one is better? Sometimes we never notice or try another approach until the current one turns out to be the worse solution. In this article I'd like to talk about how to design a query with many subqueries to achieve a goal.

Divide Dataset

We might get the same result, as in the picture below, via two different solutions. In the first, one query filters the data into four subsets, and we keep the two subsets that are what we want. In the second, we use two subqueries to fetch two smaller datasets directly. This involves two extra queries, but the final result is the same as with the first solution.
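The two solutions can be sketched in plain Ruby, with arrays standing in for datasets (the records and predicates here are made up for illustration, not taken from the real system):

```ruby
# Stand-in "records": each is a hash with a :type and a :status.
records = [
  { type: "A", status: "done" },
  { type: "A", status: "pending" },
  { type: "B", status: "done" },
  { type: "B", status: "pending" },
]

# Solution 1: load everything, split into four subsets, then keep two.
all_subsets = records.group_by { |r| [r[:type], r[:status]] }
wanted_v1 = all_subsets
  .select { |(_, status), _| status == "done" }
  .values.flatten

# Solution 2: two narrow "subqueries" fetch only what we need;
# the unwanted rows are never materialized as separate subsets.
subquery = ->(type) { records.select { |r| r[:type] == type && r[:status] == "done" } }
wanted_v2 = subquery.call("A") + subquery.call("B")

wanted_v1.sort_by { |r| r[:type] } == wanted_v2.sort_by { |r| r[:type] } # => true
```

Both paths end at the same two subsets; the difference is how much intermediate data the first path has to hold and filter on the way there.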

Comparing two ways to play with data

Of the two solutions above, which one is better? I measured both of them, because the first solution was in fact a design problem of my own. What is the problem? With the first solution, we start with four datasets, so we need enough memory to hold all four of them. It's heavy! The system holds not only the input data but also the data being processed, and then it has to filter a ton of records to reach the final result. The second solution is clearly better. As an algorithm, we would call it "DIVIDE AND CONQUER". We have to make two extra queries, which looks like more work, but it actually reduces the total work to be done. It's like resting regularly instead of doing one big push: we avoid bottlenecks and reduce the risk in the system. Instead of getting stuck in one place for a while, we take other roads, and everything stays under control.

Be careful with queries like the one below. Be careful with IN in SQL.

where(email: JA.subscribable_guru.select(:email).distinct).approved

Only use IN with a column that already has an index. Otherwise, use JOIN for filtering instead. By "for filtering" I mean we only join the other table to narrow down rows, not to select its columns.

where(id: JA.subscribable_guru.joins(:documents).where(documents: { status: 'Done' }))

This would be faster. 
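The reason the index matters can be felt even in plain Ruby: probing an indexed column behaves roughly like a set lookup, while an unindexed one forces a scan of the whole column for each candidate value. This is only an analogy for illustration, not how the database actually executes IN:

```ruby
require "set"

# A made-up "column" of 50,000 email values.
emails_in_table = (1..50_000).map { |i| "user#{i}@example.com" }
wanted = ["user1@example.com", "user49999@example.com", "nobody@example.com"]

# Unindexed: each IN probe scans the whole column.
scan_hits = wanted.select { |e| emails_in_table.include?(e) }

# Indexed: the column is backed by a lookup structure, so each probe is cheap.
index = emails_in_table.to_set
index_hits = wanted.select { |e| index.include?(e) }

scan_hits == index_hits # => true, but the second form does far less work
```

Same answer either way; the index only changes how much work each membership check costs.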

However, be careful with UNION and OR syntax in Rails. It's a trap. Whether a query is fast depends on how much work it does. For example, suppose we have to join two tables and retrieve three datasets:

  • unsubscribed_at: nil (1)
  • unsubscribed_at: [value] (2)
  • table A has no corresponding record in table B (3)

We can use OR here: one scope for the records whose unsubscribed_at is nil (subscribed) and another for the records with no row in B at all (which we also treat as subscribed), combined with OR. OR looks nice, so we get started developing a separate scope for each condition. But then we do more work than we need to. Instead, a single LEFT OUTER JOIN query can eliminate the records that have a value for unsubscribed_at. When we want to optimize a SQL query, we should count the total work to be done, not just one part of it. Too many GROUPs and JOINs make a query slower.
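To make the three datasets concrete, here is a plain-Ruby simulation of the LEFT OUTER JOIN approach; the table names, columns, and rows are hypothetical stand-ins, not the real schema:

```ruby
# Hypothetical rows: table A (gurus) and table B (subscriptions).
gurus = [{ id: 1 }, { id: 2 }, { id: 3 }]
subscriptions = [
  { guru_id: 1, unsubscribed_at: nil },          # nil value: still subscribed (1)
  { guru_id: 2, unsubscribed_at: "2019-07-01" }, # has a value: unsubscribed (2)
  # guru 3 has no row in B at all (3)
]

# LEFT OUTER JOIN: every A row paired with its B row, or nil when none exists.
joined = gurus.map do |g|
  [g, subscriptions.find { |s| s[:guru_id] == g[:id] }]
end

# One pass eliminates only the rows with a value for unsubscribed_at,
# keeping both the "nil value" and the "no record in B" cases at once.
subscribed = joined
  .reject { |_, s| s && s[:unsubscribed_at] }
  .map { |g, _| g[:id] }

subscribed # => [1, 3]
```

One join plus one rejection covers conditions (1) and (3) together, with no OR of separate scopes.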

Using Caching

The idea is to use a cache to hold data temporarily. In some cases, if we only ever call query A on its own, things stay simple. However, a system grows over time and the logic becomes complicated. When A becomes a subquery of query B, CPU usage can climb unexpectedly. That is when we should consider the caching technique.
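A minimal sketch of the idea in plain Ruby. In a Rails app this would typically be Rails.cache.fetch with an expiry; here a plain Hash stands in for the cache, and expensive_query_a is a hypothetical stand-in for query A:

```ruby
CACHE = {}

def expensive_query_a
  # Stand-in for the costly subquery; imagine a slow database call here.
  sleep 0.01
  [1, 2, 3]
end

def cached_query_a
  # Compute once, then serve later callers from the cache, so query B
  # no longer re-runs A as a subquery on every call.
  CACHE[:query_a] ||= expensive_query_a
end

first  = cached_query_a # slow: runs the query
second = cached_query_a # fast: served from the cache
first.equal?(second) # => true, the same cached object
```

Note that `||=` re-runs the query whenever the cached value is nil or false, and a real cache also needs an invalidation or expiry policy so stale data does not outlive the rows it came from.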