202013Aug

Q-Tip #5 – Show Top Performers. Common But Not Too Trivial.

This material is going to be a part of my presentation on Advanced Set Analysis and Advanced Aggregation at the Master Summit for QlikView, which is coming up in October in London and Barcelona. During my 4-hours “deep-dive” session, we analyze complex business requirements and learn how to solve them using Advanced Set Analysis and AGGR (see full agenda here.)

The need to show “Top N Performers” is hardly new, and QlikView offers a number of ways to facilitate the solution. Some of those techniques are very simple, and some are not so much. Let’s walk through a number of solutions, climbing up from simple to complex.

Using “Max Number” Setting in Presentation Properties

If you have a Straight Table, sorted by Sales, as an example, and you need to show top 1,000 Items based on their Sales, the job is as easy as setting the “Max Number” of Dimension Values (the setting can be found on the Presentation Page).  Easy, fast and elegant. If your needs stop here, you can skip the rest and move on to the next challenge.

Using Dimension Limits

Since ver. 11.0, QlikView offers a new feature “Dimension Limits” that allows to limit the number of chart entries based on a variety of conditions – you can restrict content based on Largest, Smallest, or First number of Values, or based on a certain % relative to total, or even only show values that accumulate to a certain % to total. You can chose to show the remaining balance under “Others” and even show the totals for the Dimension. This is truly a remarkable tool that allows a lot of flexibility in restricting the content of any chart.

So, you must be wondering “where is the catch?”. Well there are a couple…

Imagine that you developed your Sales Analysis application for a reasonably large company that sells several thousands of Items to several thousands of customers. You created your insightful visualizations, and, following the advice from Stephen Few, kept  “Details on demand” in the form of Straight Tables and Pivot Tables aggregated by Customer and/or Item, sometimes with a few other dimensions, including a nice Monthly cross-tab. So, your users can see the trends visually and then get specific numbers from the detailed tables. Well, so far so good.

Straight tables and Pivot tables with several thousand rows may take a couple of seconds to render, but nothing is out of the ordinary.

Pic. 1 Dimension Limits
















Then, something unexpected happens. You need to implement your solution for a truly huge company that sells millions of products to millions of customers. All of a sudden, your application becomes very slow, and you are forced to perform a Performance Tuning exercise in order to understand what goes wrong (by the way, “Performance Tuning” is the second session that I am delivering at the Masters Summit, where we go into the most common performance pitfalls and the most common solutions for performance improvements). As a result of your Performance Tuning exercise, you discover that your Straight Tables take too long, and your Pivot Tables take even longer to render.

You have to go back to your users and tell them that at those data volumes, showing detailed tables becomes impossible – the tables with millions of rows just take too long to render. After lengthy discussions and negotiations, you agree to limit the content of the tables to the top 1,000 lines (Customers or Items). After all, who cares about millions of small Items and millions of small Customers? Let us see the “whales”…

Now, this is better. We can use Dimension Limits to filter our detailed charts and to make them more scalable. And this is where two unpleasant discoveries are waiting for you:

Discovery #1: Adding Dimension Limits to a heavy Straight Table makes it … heavier! Believe it or not, but if you are using Dimension limits on a large data set, a Straight Table that takes 10 sec. to render, may take up to 40 seconds when Dimension Limits are being used. Apparently, because of the flexibility of the Dimension Limits, QlikView needs to spend all the initial time to calculate all millions of rows of the original chart, and then apply the Dimension Limits conditions on top of that. Hence, our conclusion: Dimension Limits are very nice but not very scalable.

Discovery #2: Dimension Limits are not available for Pivot Tables. Yes, folks, this is probably not new to most of you, but Dimension Limits are available for all charts except for Pivot Tables. Not too surprising, knowing that the Pivot Table is rendered using its own separate charting engine. However, have fun explaining it to your non-technical users – “Yes, I can either limit the number of values in the chart, or show you months running across. Your choice…”

So, the bottom line is – Dimension Limits can be a nice solution, but many times it can’t be used for your needs. Certainly not if you hoped to improve performance on a very heavy dataset.

How About Top 10 in a Group?

Here is another interesting question that might not be too easy to answer… What if you need to show a Pivot Table with Regions, and show Top 5 Customers in each Region? And remember, no Dimension Limits for the Pivot Tables…

So, at this point we all must be convinced that the question “Can you show Top N Performers” is not always very trivial. So far, we know how to do it in a Straight Table but not in a Pivot Table, and on a mid-size dataset, but not on a huge dataset. The following techniques can be used universally in all types of charts, and on all sizes of datasets.

Limiting Dimension Values Using Calculated Dimensions

For the sake of this example, let’s assume that we use Dimensions Brand and Item, as well as Region and Customer, and sum(Sales) as the main expression. We’ll keep thing simple and not worry about time frames or any other special conditions.

So, our goal is to limit the dimension values to the top 5 Items per Brand, or to the top 5 Customers within Region, based on their Sales. As you surely know, we use function Rank in QlikView to determine the numerical order of our data items:

Rank(sum(Sales), 4)

,where the second optional parameter drives the behavior when multiple items have the same values (look it up, we won’t bore you with details here).

Now, using Rank in a condition that selects top 5 Items or Customers would look like this:

IF(Rank(sum(Sales), 4)<=5, Item)   or   IF(Rank(sum(Sales), 4)<=5, Customer)

In order to use this condition in a Calculated Dimension, we need to use function AGGR, to specify at what level to aggregate our results. The full calculation for Items (or Customers) is presented below:

AGGR(IF(Rank(sum(Sales), 4)<=5, Item), Brand, Item)

or

AGGR(IF(Rank(sum(Sales), 4)<=5, Customer), Region, Customer)

These Calculated Dimensions will return the values of the first 5 Items or Customers, and a NULL() value for all others. To complete the job, check the box “Suppress When Value is Null” to suppress all the values for the Items and the Customers that didn’t pass the test.

As for the Chart Expression, we can keep it simple – Sum(Sales), and we can add any other calculations as needed, without worrying about limiting the chart to top 5 Items/Customers within each Group – because the filtering was done at the Dimension level.

The advantage of this technique is its ease of use and the fact that you can limit the data in the chart by applying a single condition in the Calculated Dimension (as we’ll see very soon, the alternative is to apply the condition individually in each expression). The downside is, once again, performance. Calculated Dimensions are known for their heavy impact on performance, so you may not want to chose this path on an extremely large dataset.

Limiting Chart Values Using Expressions

The following technique uses advanced Set Analysis Search to limit the chart to Top 1,000 Items or Customers. This is perhaps the best way to provide detailed information in Straight or Pivot Tables over a truly large dataset, – it will always outperform Dimension Limits and Calculated Dimensions, with a visibly evident difference.

Well, we need to construct a Set Analysis condition to select Items (Customers) that fit the same criteria (the Rank of Sales <= 1000). This is achieved with the following Set Analysis Modifier:

Item = {“=rank(sum(Sales),4)<=1000”}

Let’s explain every detail here:

      • Item is the field the we apply the condition to
      • Double quotes signify “search”
      • The condition, starting with the equal sign, signifies the search condition – those Items that fit this criteria, will get selected. Notice, that here, unlike in Calculated Dimensions, we don’t need to use AGGR. The fact that this condition is applied to the field Item, implies that the expression will be aggregated at the Item level.

Now, the full expression will look like the following:

=sum({<Item = {“=rank(sum(Sales),4)<=1000”}>}   Sales)

or, for Top Customers:

=sum({<Customer = {“=rank(sum(Sales),4)<=1000”}>}   Sales)

The only inconvenience is that the same condition needs to be applied to all the expressions in the chart. If your Straight Table should have 10 expressions, all of them will need to be equipped with the same Set Analysis condition. Alternatively, you can condition all other expressions by the value of the first one:

If(Column(1) <>0, … )

If you are curious enough to follow those instructions, you might notice an interesting behavior. If you implemented this solution in your chart by Item, try selecting individual Items. You will notice that the chart will still show the same Top 1,000 Items, disregarding your selection. It took me a moment to realize that the Set Analysis condition overrides any user selections in the field Item (or Customer). In order to respect user selections in the same fields, we need to add one last brushstroke to our masterpiece – replace the Modifier sign “=” with the sign “*=” that facilitates the intersection of current selections with our condition. The final expression looks like this:

=sum({<Item *= {“=rank(sum(Sales),4)<=1000”}>}   Sales)

Notice that this solution cannot be used for the previous problem, where we wanted to show Top 5 Items per Brand. The reason for that is the well-known limitation of Set Analysis – it only gets calculated once per chart, not once per cell, and hence it cannot be sensitive to individual Dimension values. Hence, we can’t use Set Analysis to calculate the Top Items per Brand (i.e. per Dimension value).

Extra Credit: Top 1,000 Items with Other Conditions

So far, we managed to solve quite complex problems with relatively simple calculations. Let’s add one more degree of complexity. Let’s say that your dashboard includes a special analysis of “Luxury Items” performance. For simplicity, let’s assume that Luxury Items are signified with a special flag called “LuxuryFlag”. So, the chart that shows detailed information about Luxury Items has a number of expressions with set analysis conditions that select LuxuryFlag = 1:

Sum({<LuxuryFlag={1}>} Sales)

Now, remember our story about implementing the same dashboard in a huge company with millions of Items? Well, after performing your Performance Tuning exercise, you must have noticed that the “Luxury Items” detailed chart has also become sluggish, because almost half of the company’s products are considered “luxury” and hence the chart is almost as heavy as all other detailed charts. Your agreement with your users is to replace the chart with a similar chart showing details for the “Top 1,000 Luxury Items”.

Well, after everything we just learned, we must be ready for such as simple task. We will simply add the new condition into the existing Set Analysis:

=sum({<LuxuryFlag={1}, Item = {“*=rank(sum(Sales),4)<=1000”}>}   Sales)

Problem solved, performance is significantly improved, and you are sending your new and improved version to the users for testing. The first thing our users do with our detailed charts is exporting the data into Excel. And then the user finds out that instead of the promised Top 1,000 Luxury Items, the spreadsheet only shows about 500… The user calls you back, categorically demanding all the missing Items to be immediately returned! At this point, you are trying to understand why would the chart only show about 500 Items when the condition is clearly asking for 1,000!

Those of you who had a fair share of dealing with Set Analysis, will quickly realize that the result that we are getting from the condition above, is the intersection of two sets – Luxury Items and Items that have a rank of the Top 1,000. So, only about 500 Items within the first 1,000 belong to the group of Luxury Items. Eureka! Now we understand that the missing Items were not stolen, they were missing for a good logical reason!!! We just need to come up with a better solution… How do we show the top 1,000 Items out of the Luxury Items?

Well, our problem started with the Set Modifier selecting the Top 1000 items without necessarily filtering only Luxury Items. This set of 1,000 will include both Luxury and Non-Luxury items. In order to get a “clean” count of 1,000 Luxury Items, we need to apply the Luxury filter to the inner sum(Sales) listed within the ranking function. Then the ranking will be happening within the pre-filtered group of luxury Items:

=sum({<Item = {“*=rank(sum(  {< LuxuryFlag={1}>} Sales),4)<=1000”}>}   Sales)

Now we should receive the list of Top 1,000 Luxury Items that includes, indeed, exactly 1,000 items. The user is happy and our reputation is restored!

Summary

I hope I managed to convince you that showing “Top N Performers” might not always be trivial to do. The techniques demonstrated in this article should help you find your own scalable high-performing solutions for your business questions. This is one of the lighter examples of Advanced Set Analysis and Aggregation that we are going to discuss at the Masters Summit for QlikView. Please review our full agenda here, and decide for yourself if Masters Summit for QlikView can help you take your QlikView skills to the next level. The next Master Summit events are coming up in October in two magnificent cities – London and Barcelona. Would love to see you there!



Leave a Reply

Your email address will not be published. Required fields are marked *