This material is going to be a part of my presentation on Advanced Set Analysis and Advanced Aggregation at the Master Summit for QlikView, which is coming up in October in London and Barcelona. During my 4-hours “deep-dive” session, we analyze complex business requirements and learn how to solve them using Advanced Set Analysis and AGGR (see full agenda here.)
The need to show “Top N Performers” is hardly new, and QlikView offers a number of ways to facilitate the solution. Some of those techniques are very simple, and some are not so much. Let’s walk through a number of solutions, climbing up from simple to complex.
If you have a Straight Table, sorted by Sales, as an example, and you need to show top 1,000 Items based on their Sales, the job is as easy as setting the “Max Number” of Dimension Values (the setting can be found on the Presentation Page). Easy, fast and elegant. If your needs stop here, you can skip the rest and move on to the next challenge.
Since ver. 11.0, QlikView offers a new feature “Dimension Limits” that allows to limit the number of chart entries based on a variety of conditions – you can restrict content based on Largest, Smallest, or First number of Values, or based on a certain % relative to total, or even only show values that accumulate to a certain % to total. You can chose to show the remaining balance under “Others” and even show the totals for the Dimension. This is truly a remarkable tool that allows a lot of flexibility in restricting the content of any chart.
So, you must be wondering “where is the catch?”. Well there are a couple…
Imagine that you developed your Sales Analysis application for a reasonably large company that sells several thousands of Items to several thousands of customers. You created your insightful visualizations, and, following the advice from Stephen Few, kept “Details on demand” in the form of Straight Tables and Pivot Tables aggregated by Customer and/or Item, sometimes with a few other dimensions, including a nice Monthly cross-tab. So, your users can see the trends visually and then get specific numbers from the detailed tables. Well, so far so good.
Straight tables and Pivot tables with several thousand rows may take a couple of seconds to render, but nothing is out of the ordinary.
Then, something unexpected happens. You need to implement your solution for a truly huge company that sells millions of products to millions of customers. All of a sudden, your application becomes very slow, and you are forced to perform a Performance Tuning exercise in order to understand what goes wrong (by the way, “Performance Tuning” is the second session that I am delivering at the Masters Summit, where we go into the most common performance pitfalls and the most common solutions for performance improvements). As a result of your Performance Tuning exercise, you discover that your Straight Tables take too long, and your Pivot Tables take even longer to render.
You have to go back to your users and tell them that at those data volumes, showing detailed tables becomes impossible – the tables with millions of rows just take too long to render. After lengthy discussions and negotiations, you agree to limit the content of the tables to the top 1,000 lines (Customers or Items). After all, who cares about millions of small Items and millions of small Customers? Let us see the “whales”…
Now, this is better. We can use Dimension Limits to filter our detailed charts and to make them more scalable. And this is where two unpleasant discoveries are waiting for you:
Discovery #1: Adding Dimension Limits to a heavy Straight Table makes it … heavier! Believe it or not, but if you are using Dimension limits on a large data set, a Straight Table that takes 10 sec. to render, may take up to 40 seconds when Dimension Limits are being used. Apparently, because of the flexibility of the Dimension Limits, QlikView needs to spend all the initial time to calculate all millions of rows of the original chart, and then apply the Dimension Limits conditions on top of that. Hence, our conclusion: Dimension Limits are very nice but not very scalable.
Discovery #2: Dimension Limits are not available for Pivot Tables. Yes, folks, this is probably not new to most of you, but Dimension Limits are available for all charts except for Pivot Tables. Not too surprising, knowing that the Pivot Table is rendered using its own separate charting engine. However, have fun explaining it to your non-technical users – “Yes, I can either limit the number of values in the chart, or show you months running across. Your choice…”
So, the bottom line is – Dimension Limits can be a nice solution, but many times it can’t be used for your needs. Certainly not if you hoped to improve performance on a very heavy dataset.
Here is another interesting question that might not be too easy to answer… What if you need to show a Pivot Table with Regions, and show Top 5 Customers in each Region? And remember, no Dimension Limits for the Pivot Tables…
So, at this point we all must be convinced that the question “Can you show Top N Performers” is not always very trivial. So far, we know how to do it in a Straight Table but not in a Pivot Table, and on a mid-size dataset, but not on a huge dataset. The following techniques can be used universally in all types of charts, and on all sizes of datasets.
For the sake of this example, let’s assume that we use Dimensions Brand and Item, as well as Region and Customer, and sum(Sales) as the main expression. We’ll keep thing simple and not worry about time frames or any other special conditions.
So, our goal is to limit the dimension values to the top 5 Items per Brand, or to the top 5 Customers within Region, based on their Sales. As you surely know, we use function Rank in QlikView to determine the numerical order of our data items:
Rank(sum(Sales), 4)
,where the second optional parameter drives the behavior when multiple items have the same values (look it up, we won’t bore you with details here).
Now, using Rank in a condition that selects top 5 Items or Customers would look like this:
IF(Rank(sum(Sales), 4)<=5, Item) or IF(Rank(sum(Sales), 4)<=5, Customer)
In order to use this condition in a Calculated Dimension, we need to use function AGGR, to specify at what level to aggregate our results. The full calculation for Items (or Customers) is presented below:
AGGR(IF(Rank(sum(Sales), 4)<=5, Item), Brand, Item)
or
AGGR(IF(Rank(sum(Sales), 4)<=5, Customer), Region, Customer)
These Calculated Dimensions will return the values of the first 5 Items or Customers, and a NULL() value for all others. To complete the job, check the box “Suppress When Value is Null” to suppress all the values for the Items and the Customers that didn’t pass the test.
As for the Chart Expression, we can keep it simple – Sum(Sales), and we can add any other calculations as needed, without worrying about limiting the chart to top 5 Items/Customers within each Group – because the filtering was done at the Dimension level.
The advantage of this technique is its ease of use and the fact that you can limit the data in the chart by applying a single condition in the Calculated Dimension (as we’ll see very soon, the alternative is to apply the condition individually in each expression). The downside is, once again, performance. Calculated Dimensions are known for their heavy impact on performance, so you may not want to chose this path on an extremely large dataset.
The following technique uses advanced Set Analysis Search to limit the chart to Top 1,000 Items or Customers. This is perhaps the best way to provide detailed information in Straight or Pivot Tables over a truly large dataset, – it will always outperform Dimension Limits and Calculated Dimensions, with a visibly evident difference.
Well, we need to construct a Set Analysis condition to select Items (Customers) that fit the same criteria (the Rank of Sales <= 1000). This is achieved with the following Set Analysis Modifier:
Item = {“=rank(sum(Sales),4)<=1000”}
Let’s explain every detail here:
Now, the full expression will look like the following:
=sum({<Item = {“=rank(sum(Sales),4)<=1000”}>} Sales)
or, for Top Customers:
=sum({<Customer = {“=rank(sum(Sales),4)<=1000”}>} Sales)
The only inconvenience is that the same condition needs to be applied to all the expressions in the chart. If your Straight Table should have 10 expressions, all of them will need to be equipped with the same Set Analysis condition. Alternatively, you can condition all other expressions by the value of the first one:
If(Column(1) <>0, … )
If you are curious enough to follow those instructions, you might notice an interesting behavior. If you implemented this solution in your chart by Item, try selecting individual Items. You will notice that the chart will still show the same Top 1,000 Items, disregarding your selection. It took me a moment to realize that the Set Analysis condition overrides any user selections in the field Item (or Customer). In order to respect user selections in the same fields, we need to add one last brushstroke to our masterpiece – replace the Modifier sign “=” with the sign “*=” that facilitates the intersection of current selections with our condition. The final expression looks like this:
=sum({<Item *= {“=rank(sum(Sales),4)<=1000”}>} Sales)
Notice that this solution cannot be used for the previous problem, where we wanted to show Top 5 Items per Brand. The reason for that is the well-known limitation of Set Analysis – it only gets calculated once per chart, not once per cell, and hence it cannot be sensitive to individual Dimension values. Hence, we can’t use Set Analysis to calculate the Top Items per Brand (i.e. per Dimension value).
So far, we managed to solve quite complex problems with relatively simple calculations. Let’s add one more degree of complexity. Let’s say that your dashboard includes a special analysis of “Luxury Items” performance. For simplicity, let’s assume that Luxury Items are signified with a special flag called “LuxuryFlag”. So, the chart that shows detailed information about Luxury Items has a number of expressions with set analysis conditions that select LuxuryFlag = 1:
Sum({<LuxuryFlag={1}>} Sales)
Now, remember our story about implementing the same dashboard in a huge company with millions of Items? Well, after performing your Performance Tuning exercise, you must have noticed that the “Luxury Items” detailed chart has also become sluggish, because almost half of the company’s products are considered “luxury” and hence the chart is almost as heavy as all other detailed charts. Your agreement with your users is to replace the chart with a similar chart showing details for the “Top 1,000 Luxury Items”.
Well, after everything we just learned, we must be ready for such as simple task. We will simply add the new condition into the existing Set Analysis:
=sum({<LuxuryFlag={1}, Item = {“*=rank(sum(Sales),4)<=1000”}>} Sales)
Problem solved, performance is significantly improved, and you are sending your new and improved version to the users for testing. The first thing our users do with our detailed charts is exporting the data into Excel. And then the user finds out that instead of the promised Top 1,000 Luxury Items, the spreadsheet only shows about 500… The user calls you back, categorically demanding all the missing Items to be immediately returned! At this point, you are trying to understand why would the chart only show about 500 Items when the condition is clearly asking for 1,000!
Those of you who had a fair share of dealing with Set Analysis, will quickly realize that the result that we are getting from the condition above, is the intersection of two sets – Luxury Items and Items that have a rank of the Top 1,000. So, only about 500 Items within the first 1,000 belong to the group of Luxury Items. Eureka! Now we understand that the missing Items were not stolen, they were missing for a good logical reason!!! We just need to come up with a better solution… How do we show the top 1,000 Items out of the Luxury Items?
Well, our problem started with the Set Modifier selecting the Top 1000 items without necessarily filtering only Luxury Items. This set of 1,000 will include both Luxury and Non-Luxury items. In order to get a “clean” count of 1,000 Luxury Items, we need to apply the Luxury filter to the inner sum(Sales) listed within the ranking function. Then the ranking will be happening within the pre-filtered group of luxury Items:
=sum({<Item = {“*=rank(sum( {< LuxuryFlag={1}>} Sales),4)<=1000”}>} Sales)
Now we should receive the list of Top 1,000 Luxury Items that includes, indeed, exactly 1,000 items. The user is happy and our reputation is restored!
I hope I managed to convince you that showing “Top N Performers” might not always be trivial to do. The techniques demonstrated in this article should help you find your own scalable high-performing solutions for your business questions. This is one of the lighter examples of Advanced Set Analysis and Aggregation that we are going to discuss at the Masters Summit for QlikView. Please review our full agenda here, and decide for yourself if Masters Summit for QlikView can help you take your QlikView skills to the next level. The next Master Summit events are coming up in October in two magnificent cities – London and Barcelona. Would love to see you there!
0 Comments
Wow, that is a very detailed and clearly delivered guide to a very complex area.
I will try to use this for a similar problem I am trying to solve, which looks aggregate up and concatenate purchases per customer to show the top 5 most common purchases. It involves millions of customers, but they generally only make 1 or 2 purchases.
Using aggr requires too much calculation time, so do you think your guide will work as an alternative
Thanks, mike
I think you should give it a try and if you don’t mind, report the results back. I’d be interested to hear from you.
thanks!
Oleg Troyansky
Very nice demonstration and a very clear explanation. Thanks heaps.
It looks like the “Using “Max Number” Setting in Presentation Properties” only allows you to go to 100. Is the 1000 example just a typo?
Yes, apparently it was. Thanks for the correction!
So, I am trying to limit the dimensión of my pie chart to the largest 7, and the rest are calculated on ‘others’. One of those 7, called ‘rest’ has a value of x%, I want to include them in the fill of ‘others’ so the value of ‘others’ will increase in x% and the field rest will disappear from the largest 7. Is there a way to do this or since ‘others’ field is calculated automatically by the program and ‘rest’ is a field on my ddbb, is imposible to exclude ‘rest’ form my top 7 and include it in the field ‘others’?
Thank you!
Hi Valeria,
thank you for your question. You may want to double it on the Qlik Community forum, to get more responses…
I think that the default calculation of “others” cannot be modified. However, you can replace it with your own calculation. If your data size allows, you can create a calculated dimension in which you question the Rank of the item and also check if the value is equal to ‘rest’. Something along those lines:
IF( Rank ( … ) > 7 or Item = ‘rest’, ‘Others’, Item)
then, the chart will show you the 6 Items and the combined slice of ‘Others’ that also includes the item ‘rest’
good luck!
Oleg
[…] 2. Show Top Performers. Common But Not Too Trivial via Naturalsynergies […]
Thanks for the great tip and detailed explanation, Oleg. Apologies for coming in so late, but I’m wondering if there is a typo around the use of the implicit set operator “*” used to find the intersection with current settings. Should the expression:
=sum({<Item = {“*=rank(sum(Sales),4)} Sales)
actually read
=sum({<Item *= {“=rank(sum(Sales),4)} Sales)
And a related question from me, can set analysis be used to return the set of items accounting for the top 90% of sales?
Hi Scott,
thank you for your message. Yes, indeed, you found a typo – the asterisk was misplaced, thanks for that!
On the other topic, I’m not away of a Set Analysis solution that allows implementing a Pareto selection. I know that it can be done using Actions and Dimension Limits, but not in Set Analysis. I’d be happy to learn such a solution if it existed.
thanks!
Oleg Troyansky
Oleg,
I needed a solution that allowed limiting all sales aggregations in the document to those products that made up the top 90% of spend, ie a constant product set regardless of all other dimensions and selections. In the end I precalculated a “percentile” value per product in the load script and filtered on this using set analysis controlled by a slider. So I too would be happy to know of a purely set analysis based solution….
Cheers
s2
Hi Oleg,
I realize i come bit late to the party but i have a question.
your solution for the performance is using set analysis in the expression and not using a calculated dimension, but as you said this method doesn’t work if we have a pivot with more the one dimension and we want the topN for each value in the 2nd dimension and i understand why. but that only leaves us with the option to use a calculated dimension which is
very heavy performance wise. so basically there is no good way to achieve this without a toll on performance?
Daniel Chotzen
Hi Daniel,
Good question! One answer I’d suggest is that we need to pick and choose our battles, depending on a specific situation. Some solutions are more scalable than others.
The other answer is more complex than that. I’d say that there is no easy way of doing the same using Set Analysis, however I think a more complex solution could be built with a bit more tinkering. Perhaps I should include the solution in the next Masters Summit for Qlik.
Stay tuned, I will blog about it one day, but before that I’ll teach the solution at the Masters Summit.
cheers,
Oleg
Hi Oleg,
Thank you very much for replying.
i am a renown tinker so i will take your suggestion and start tinkering:)
i am planning to attend the next masters summit so hopefully i will be able to learn this and other subjects
thanks again and cya there 🙂
Daniel
I need to have another expression that shows percentages of the Sales, will this same expression use as “Relative” to get the result.
Hi Balanandam,
thank you for your question! The best way to know for sure is to try it. Many times things that we expected to work, don’t work for some reason, and other times things that we didn’t expect to work, magically do. Why don’t you try it and let everyone know your results?
Happy New Year!
Oleg Troyansky