In preparation for the next session of Masters Summit for QlikView, which is coming up on October 1-3 in Amsterdam (look to the right for more info), we’ve been searching for new interesting insights in the area of performance tuning. Our Performance Tuning session has been one of the most popular highlights of the previous Summits, and we are hoping to make it even better this time.
Well, the members of the Masters Summit alumni club already know that the data structures with Concatenated Facts perform much better than data structures with Link Tables (if you are not too sure if it’s true and why, than you REALLY need to check out our agenda).
With this knowledge in mind, let’s ask ourselves a question that may appear silly at first. Which QlikView command should be used in the process of building the Concatenated Fact – Concatenate or Join?
Well, you must be thinking, one doesn’t need to be a Master to answer this question. Surely, Concatenated Facts are built using CONCATENATE!
Not so fast, grasshopper!
In our recent experience, we were healing a troubled dataset built by someone else, who clearly never attended the Masters Summit. One Fact table of approximately 500 Million rows was linked to another Fact table of approximately 125 Million rows, and various measures were scattered among various Dimension tables. Our first reaction was – we need to build a Star schema! Getting ahead, we can proudly say that we saw more than a tenfold improvement of performance as a result of this brief data modeling exercise.
Anyway, in the process of concatenating the Facts into a single Fact table, we noticed that the LOAD that performed the concatenation worked extremely slowly, compared to other LOAD statements. It was even slower than the statement that joined data from the two huge Fact tables. That caught our attention, and we decided to monitor the process more closely. We discovered that apparently the CONCATENATE LOAD runs as a single-threaded operation, only leveraging one core out of the 32 available cores on our server. Conversely, the JOIN LOAD performed as a multi-threaded operation, leveraging several cores at a time. The JOIN load appeared to be almost ten times faster than the CONCATENATE LOAD! This discovery seems illogical, but our results suggested just that.
With that new knowledge, we decided to use JOIN instead of CONCATENATE in order to build our Concatenated Fact. And yes, we are aware of the specifics of each command and the differences in their functionality. However, an easy trick can help you teach the OUTER JOIN to act just like CONCATENATE.
What are the main differences between OUTER JOIN and CONCATENATE ? The JOIN load will attempt to match identical key fields. If the two joining tables have no identical keys, the join becomes a Cartesian Join, which is certainly undesired. When there are identical fields, they are used as Join keys, and for each matching set of keys, a joined row will get created. When no match can be found, the OUTER JOIN will create separate rows for all the unmatched keys, practically acting as a CONCATENATE.
So, our goal is to ensure that the two tables share at least one identical field, and that there are no matching sets of key values. This is easily achieved by adding a new field, such as “Fact Type”, which we usually add anyway, to differentiate between the different types of Facts in the concatenated table. When the data from one Fact is marked with one Fact Type and the data from another Fact is marked with another Fact Type, then surely no matching keys will ever exist!
So, our simple work around that saves a lot of load time, schematically looks like the following:
If you work with a large data set, try replacing your CONCATENATE loads with JOINs and please share your experience with us, we’d love to hear from you! We could include the most valuable insights in our next session of Masters Summit for QlikView (once again, look to the right!).
0 Comments
Finally, an explanation to something I’ve also noticed. I’m going to try this on one of my larger apps. Thanks!
Hi Brian,
glad you found it useful! Please share your results if you find out something interesting!
Oleg
Good idea to use an outer join for concatenation. But why not doing a normal second load without CONCATENATE prefix in full speed where the first load was prepared to have the upcoming fields of the second table?
– Ralf
Hi Ralf!
Hmm, interesting idea – to prepare both loads for automatic concatenation (I assume, by adding all the missing fields on both sides). Sounds like a lot of extra handling, however it might be worth the trouble for huge data sets.
thanks!
Oleg
Maybe, this can be automated (by a sub) where one dummy record (without data) is loaded from both tables with concatenate just to create an empty table upfront with all fields..
Thanks for the post!
How does this change if we consider optimized loads?
I have a case where I concatenate two similar tables and the second is tweaked so that it loads optimized. As I understand it the load is not optimized when joining. Does it make sense to change to join?
Matus
Hi Matus!
Thank you for your question. I never compared the load times between the optimized concatenated load and a join load. Actually, I was under the impression that the concatenated load cannot be optimized. It would be great if you tried it both ways and posted your finding here.
thanks!
Oleg
Hi Oleg.
I found it in discussion under Steve Dark’s article about optimized loads (http://www.quickintelligence.co.uk/qlikview-optimised-qvd-loads/). The trick is to have all the fields of the first table in the concatenated table (additional fields are allowed).
Now I noticed in the discussion that also LEFT JOIN could be used with optimized load (never used that before). That basically renders my question obsolete – the load is optimized in both cases and the difference should be only between CONCATENATE / LEFT JOIN.
I’ll try to look at it later and if I find something interesting I will get back.
Matus
The LOAD and preceding JOIN or LOAD and preceding CONCATENATE are two discrete steps. Check the timestamps in the log file to see what I mean.
Optimized QVD loads only affect the LOAD step, not the JOIN or CONCATENATE, although it’s worth noting that forcing concatenation of what would otherwise be an optimized QVD load will deoptimize it–at least, if we are to believe the parenthetical notes in the progress window.
Mike
I suspect that concatenate loads have the hidden overhead of resident loading the entire table again for each concatenate load, as evidenced by the fact that you can retroactively make an entire table distinct by adding ‘DISTINCT’ to the last concatenate load only (see: http://blog.axc.net/?p=1062). As such, I guessed, and have verified through testing, that the order in which tables are concatenate loaded has a material impact on the overall reload time: load your data sources in order from fewest to most rows.
I found a performance improvement when outer joining instead of concatenate loading, as well. The fastest overall time in my tests was when ordering data sources loaded from fewest to most rows and outer joining rather than concatenate loading.
That was just in my one test, though, of a series of LOAD * from several QVDs. The results may vary with resident loads, the option of doing optimized QVD loads, number of key fields, etc.
Mike
Mike,
the DISTINCT phenomenon that you described in your blog, is not something related to CONCATENATE. It’s actually a little known and little understood quality of DISTINCT:
DISTINCT is always applied to the resulting table. Whether you JOIN, CONCATENATE or load a single table, the keyword DISTINCT makes the end result distinct.
So, in your case, you loaded table 1 and then table 2. Applying DISTINCT to the last table made the end result DISTINCT. Why? Because, once again, DISTINCT is always applied to the end result.
I described this behavior in my very first blog post:
http://www.naturalsynergies.com/q-tip-1-beware-of-distinct/
cheers,
Oleg
That’s true–it applies to both concatenating and joining.
I wouldn’t overlook the point about the order of loads, however. On reading your post, I tested using data from a current project where we are creating a concatenated fact table from 6 QVDs, and the only result I have repeated consistently is that ordering the loads from fewest to most rows results in a lower reload time than the opposite. Ordering from most to least actually caused outer joining to take longer than concatenate loading. As I alluded to above, however, I wouldn’t claim that this will always be true, given additional variables.
Best to worst reload time:
1. fewest rows to most rows outer join
2. fewest rows to most rows concatenate
3. most rows to fewest rows concatenate
4. most rows to fewest rows outer join
Hi Mike,
>>load your data sources in order from fewest to most rows.
..very interesting findings! I have to measure this a bit.
– Ralf
In this article it was stated:
“We discovered that apparently the CONCATENATE LOAD runs as a single-threaded operation, only leveraging one core out of the 32 available cores on our server. Conversely, the JOIN LOAD performed as a multi-threaded operation, leveraging several cores at a time.”
How did you “DISCOVER” this? Did you somehow monitor CPUs while the load script was running? If so, how did you go about that?
Do you know of ANY information out there that explains what load script commands will multi-thread vs single-thread other than Concatenate?
Hi Tim,
Thank you for your comment. In response to your question – I simply watched the Task Manager on the server during the load process. During the Concatenate processing, it was clearly visible that only 1 CPU was engaged (total CPU load of 3%). Unfortunately, I’m not aware of any centralized list of all single-threaded vs. multi-threaded operations. Please share if you fine one.
best,
Oleg Troyansky
Hi Oleg,
The join seems to be not exactly the same as concatenate. The charts all seem to work the same but the document file itself is now smaller. Do you know if there are any performance advantages for charts or selections when using tables built this way?
Thanks,
-Tim
Hi Tim,
thank you for your interest. If the JOIN is performed in the way I described it (when one of the field values is always different and there is no joining happening per se), I can’t see why there should be any difference in the document size. If there is a difference, I can’t explain it. Similarly, I wouldn’t expect any performance advantages. Please let me know if you find any material differences and what was your use case.
best,
Oleg Troyansky
Hi Oleg,
First of all thanks for the post , I found it incredibly amazing trick , the performance when reloading the script is awesome , and it getts better if you prepare the facttype field on a previous step so when you load the tables to join them you read optimized qvds, I just want to ask … after seeing this what is the scenario where you would use concatenate instead of the join ? , just with small datasets?
Regards!
Edgar Baltazar
Hi Edgar,
thank you for your inquiry. Concatenation is the logically correct solution, so in essence it should be used by default, unless performance issues force you to use tricks like this one. Also, keep in mind that automated concatenation performs faster than joining. Only the forced concatenation is slower.
best,
Oleg