In preparation for the next session of Masters Summit for QlikView, which is coming up on October 1-3 in Amsterdam (look to the right for more info), we’ve been searching for new interesting insights in the area of performance tuning. Our Performance Tuning session has been one of the most popular highlights of the previous Summits, and we are hoping to make it even better this time.
Well, the members of the Masters Summit alumni club already know that the data structures with Concatenated Facts perform much better than data structures with Link Tables (if you are not too sure if it’s true and why, than you REALLY need to check out our agenda).
With this knowledge in mind, let’s ask ourselves a question that may appear silly at first. Which QlikView command should be used in the process of building the Concatenated Fact – Concatenate or Join?
Well, you must be thinking, one doesn’t need to be a Master to answer this question. Surely, Concatenated Facts are built using CONCATENATE!
Not so fast, grasshopper!
In our recent experience, we were healing a troubled dataset built by someone else, who clearly never attended the Masters Summit. One Fact table of approximately 500 Million rows was linked to another Fact table of approximately 125 Million rows, and various measures were scattered among various Dimension tables. Our first reaction was – we need to build a Star schema! Getting ahead, we can proudly say that we saw more than a tenfold improvement of performance as a result of this brief data modeling exercise.
Anyway, in the process of concatenating the Facts into a single Fact table, we noticed that the LOAD that performed the concatenation worked extremely slowly, compared to other LOAD statements. It was even slower than the statement that joined data from the two huge Fact tables. That caught our attention, and we decided to monitor the process more closely. We discovered that apparently the CONCATENATE LOAD runs as a single-threaded operation, only leveraging one core out of the 32 available cores on our server. Conversely, the JOIN LOAD performed as a multi-threaded operation, leveraging several cores at a time. The JOIN load appeared to be almost ten times faster than the CONCATENATE LOAD! This discovery seems illogical, but our results suggested just that.
With that new knowledge, we decided to use JOIN instead of CONCATENATE in order to build our Concatenated Fact. And yes, we are aware of the specifics of each command and the differences in their functionality. However, an easy trick can help you teach the OUTER JOIN to act just like CONCATENATE.
What are the main differences between OUTER JOIN and CONCATENATE ? The JOIN load will attempt to match identical key fields. If the two joining tables have no identical keys, the join becomes a Cartesian Join, which is certainly undesired. When there are identical fields, they are used as Join keys, and for each matching set of keys, a joined row will get created. When no match can be found, the OUTER JOIN will create separate rows for all the unmatched keys, practically acting as a CONCATENATE.
So, our goal is to ensure that the two tables share at least one identical field, and that there are no matching sets of key values. This is easily achieved by adding a new field, such as “Fact Type”, which we usually add anyway, to differentiate between the different types of Facts in the concatenated table. When the data from one Fact is marked with one Fact Type and the data from another Fact is marked with another Fact Type, then surely no matching keys will ever exist!
So, our simple work around that saves a lot of load time, schematically looks like the following:
If you work with a large data set, try replacing your CONCATENATE loads with JOINs and please share your experience with us, we’d love to hear from you! We could include the most valuable insights in our next session of Masters Summit for QlikView (once again, look to the right!).