We all know and use QlikView keyword DISTINCT, which makes the resulting table of the load unique. What we don’t usually know, until we learn it the hard way, is that DISTINCT can be very dangerous and deceiving.
We will mention two major issues with DISTINCT that are not too obvious:
If you are loading your data from an unverified data source (for example, a user-maintained spreadsheet), you may expect that you can ensure uniqueness of the data using the keyword DISTINCT.
For example, if you are loading a spreadsheet with Salesperson Names:
LOAD DISTINCT
SalespersonID,
SalespersonName
FROM
Salespersons.xlsx…
;
Let’s assume you have 2 entries with the same Salesperson ID = 1000. One entry says Peter Smith, and the other entry says Peter A. Smith.
In this case, two entries with SalespersonID=1000 will get loaded into the system, leading to a possible duplication of the Sales numbers. The reason is the nature of the keyword DISTINCT. It means that the combination of all the fields in the load needs to be distinct. Since the names are different, the whole row will be considered different for the purpose of a DISTINCT load.
This better solution in this case is to load the data using GROUP BY, aggregating the questionable dimensional data by the key field. For example:
LOAD
SalespersonID,
MinString(SalespersonName) as SalespersonName
FROM
Salespersons.xlsx
…
GROUP BY
SalespersonID
;
This time, the load statement will certainly generate a unique list of SalespersonID numbers, with the rest of attributes selected accordingly.
2. DISTINCT is always applied on the resulting data set.
This statement sounds harmless and almost obvious – sure, what’s the problem? DISTINCT is always applied on the resulting data set…
Well, the good part is that when you concatenate a number of distinct loads, you don’t have to repeat one more DISTINCT reload to ensure that no duplicate records got in from multiple sources (this is particularly helpful when you are building a LinkTable from multiple data tables). Why don’t you need another reload? Because DISTINCT is Always Applies on the Resulting Data Set. So, the data got already de-duplicated in the last concatenated load. Pretty cool so far…
Now, to the horror stories…
Assume that you have a Fact table with multiple Key fields and metrics. Let’s assume that the table doesn’t have to be distinct (the same customers may be buying the same products with the same quantities on the same dates…).
Assume you loaded your non-distinct Fact Table, and your results showed $100 M in sales.
Then, you are adding Salesperson Name from the same spreadsheet, and you decided to use the keyword DISTINCT to make sure that the data is UNIQUE (which we already know is not completely accurate, but bear with me another moment…). So, you are adding the following statement:
LEFT JOIN (Fact)
LOAD DISTINCT
SalespersonID,
SalespersonName
FROM
Salespersons.xlsx
…
;
You check the result, and now your Sales decreased to $95 M. WHAT??? We all know that JOIN can lead to duplication of data, when the joined data is not unique. But how in the world are we losing data in a LEFT JOIN?
Well, let’s repeat the rule once again – DISTINCT is always applied on the resulting data set… So, the SalespersonName will get joined to the Fact table, and then DISTINCT will get applied to the whole result. So, not just the Salesperson data will be made distinct, but the whole fact table.
I shared this interesting finding at the Masters Summit for QlikView (www.masterssummit.com), and Barry Harmsen wrote an excellent blog on this subject: http://www.qlikfix.com/2013/07/30/distinct-can-be-deceiving/. Make sure to visit Barry’s blog frequently, and check out if Masters Summit for QlikView might be an opportunity for you.
0 Comments
[…] as it “infects” the tables he comes in contact with. This was covered in detail by Oleg Troyansky and Barry […]