Welcome back to Qlik Sense Q-Tips, it’s been a while…
If reading technical blogs is your idea of fun, then you have likely heard about Wordle, and you probably cherish your impeccable winning streak.
Just in case you’ve been living under a rock, here is a quick rundown…
Once a day the Gods of Wordle select a 5-letter word and we, mere mortals, need to guess it by entering up to six attempts. Each attempt has to be a valid 5-letter word. The game gives you feedback by color-coding the letters that you entered:
That’s it, folks. Happy guessing!
As impossible as it sounds, it’s actually quite doable to guess most common words in 6 attempts or less, however sometimes you may need to struggle a bit, before all 5 letters settle in their correct spots.
How about using some data analytics as our helping tools? Some may call them “cheating”, but we don’t have to subscribe to that definition 😉.
So, let’s prepare the data and get creative! For starters, we need a long (and preferably complete) list of 5-letter words as our raw dataset. I found a publicly available list of about 10,000 5-letter words on one of the web sites that cater to the Scrabble fan club. (Note to self: the same ideas that will be described in this article, can be equally applied to Scrabble). Here is the link to the desired JSON file:
https://www.wordgamedictionary.com/word-lists/5-letter-words/5-letter-words.json
With Qlik Sense Cloud, I loaded the file into the Data Manager using the Rest connector. The Data Manager created the necessary data connection and generated the following script in the Auto-generated Section:
Set dataManagerTables = '','root';
//This block renames script tables from non generated section which conflict with the names of managed tables
For each name in $(dataManagerTables)
Let index = 0;
Let currentName = name;
Let tableNumber = TableNumber(name);
Let matches = 0;
Do while not IsNull(tableNumber) or (index > 0 and matches > 0)
index = index + 1;
currentName = name & '-' & index;
tableNumber = TableNumber(currentName)
matches = Match('$(currentName)', $(dataManagerTables));
Loop
If index > 0 then
Rename Table '$(name)' to '$(currentName)';
EndIf;
Next;
Set dataManagerTables = ;
Unqualify *;
LIB CONNECT TO [REST_httpswww.wordgamedictionary.comword-lists5-letter-words5-letter-words.json];
[root]:
SQL SELECT
"word"
FROM JSON (wrap on) "root" QDL;
In order to enable a detailed analysis of words and their letters, I decided to add some minor transformations after the words are loaded into Qlik Sense. First, I converted all the words to upper case, and then I separated the 5 individual letters from each word, along with their respective position numbers. At the end, I added the three Wordle colors as variables and I cleaned up the variables that were left over by the Data Manager (while grumpily muttering under my breath something like “those data managers never clean up after themselves…”). The additional script looks like this:
Words:
NOCONCATENATE
Load
UPPER(word) as word
resident
root
;
drop Table root;
for i=1 to 5
Letters:
LOAD
word,
mid(word, $(i), 1) as letter,
$(i) as position
resident
Words
;
next
let color_Greeen='#74B25B';
let color_Yellow='#BFBE4B';
let color_Gray='#797C7E';
let i=;
let dataManagerTables=;
let name=;
Let index=;
Let currentName =;
Let tableNumber =;
Let matches =;
The data load only takes a second, and voila – now we have a list of all 5-letter words, along with their individual letters and their positions. Now let’s do some data analytics!
For starters, let’s determine what letters are the most commonly used in 5-letter words. A simple bar chart with the field letter as a Dimension and the count of words as a Measure would reveal the answer:
This simple analysis reveals that the most common five letters happen to be S, E, A, O, and R. So a word like AROSE would probably be the best candidate for your first guess. The next five letters are I, L, T, N, and U, so the word UNTIL could be your second candidate, at least statistically. With these two guesses, you can hopefully establish enough letters to figure out the rest relatively easily.
Now, how about the probability of each letter to be found in each one of the 5 positions? What is the most common letter in each spot? A simple Pivot table with letters in rows, positions in columns, and the count of words as a Measure, shows the distribution of letters based on their positions in the words:
I color-coded the numbers for better visibility, by using the following formula for the background color:
argb(
count(word)/
max(total aggr(count(word), letter, position)) * 255,
116, 178, 91
)
The ARGB() function allows us to add opacity to any RGB() color. In order to calculate opacity as a number between 0 and 255, I divided the Measure (count of words) value by the highest possible word count per letter per position, and multiplied the result by 255. In this calculation, I used the function AGGR(), which is one of the most powerful yet least understood functions in Qlik Sense.
(Shameless plug: If you’d like to learn more about AGGR(), consider attending my session about AGGR and Set Analysis at the Masters Summit for Qlik, which is going to take place IN-PERSON(that’s right, we are back to IN-PERSON again!) this fall in Madrid and in New Orleans. Details and registration can be found here.)
Back to the Pivot table… Wouldn’t it be nice to show the most common letters for each position, to make our discovery even easier? This can also be achieved with the use of AGGR(), in combination with the function FirstSortedValue(). This solution is also described and practiced in my Masters Summit session 😉. Our Measure needs to be enhanced to produce a different result at the total level:
= IF(Dimensionality() = 1, // Detailed row:
count(word)
, // Total row:
FirstSortedValue(letter, -aggr(count(word), position, letter))
)
Now the table looks like this:
Now we can easily see that the most common letter in positions 1 and 5 is the letter S, while A is most commonly found in positions 2 and 3. E is most common in the position 4. So, from the statistics standpoint, words like SALES , SAFES, or SADES could land a maximum of green letters from the first guess.
Now, let’s take our cheating helping tools up a notch, shall we? I’d like to enter up to 5 letters and get a list of words that contain these letters, color-coded the same way Wordle does it, and sorted by the best fit score. How do we do that?
I added 5 variables v_L1, v_L2, …, v_L5 for the 5 letter spots and I placed 5 Variable Input objects to allow data entry. Next to them I placed a table with the field word as a dimension and 5 measures, each representing one letter of the word, color-coded by the same Green-Yellow-Gray principle (sounds vaguely familiar, doesn’t it?). Finally, I added the measure for the score, in which each green letter adds 10 points, and each yellow letter adds 1 point:
Here are the expressions that I used to produce the table:
//-------------------------------------------------------------
// Letters:
// this formula is for letter 1. Replace the highlighted number 1 for the other four positions:
//-------------------------------------------------------------
only(
{<word={"=FindOneOf(word, '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0"}>}
mid(word, 1,1)
)
//-------------------------------------------------------------
// Background colors (for position 1):
//-------------------------------------------------------------
IF(
mid(word, 1,1) = UPPER(v_L1), '$(color_Greeen)',
IF(
FindOneOf(mid(word, 1,1),
'$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0,
'$(color_Yellow)',
'$(color_Gray)'
)
)
//-------------------------------------------------------------
// The Score Calculation:
//-------------------------------------------------------------
Rangesum(
IF(mid(word, 1,1) = UPPER(v_L1), 10, IF(FindOneOf(mid(word, 1,1), '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0, 1,0)),
IF(mid(word, 2,1) = UPPER(v_L2), 10, IF(FindOneOf(mid(word, 2,1), '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0, 1,0)),
IF(mid(word, 3,1) = UPPER(v_L3), 10, IF(FindOneOf(mid(word, 3,1), '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0, 1,0)),
IF(mid(word, 4,1) = UPPER(v_L4), 10, IF(FindOneOf(mid(word, 4,1), '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0, 1,0)),
IF(mid(word, 5,1) = UPPER(v_L5), 10, IF(FindOneOf(mid(word, 5,1), '$(=UPPER(v_L1 & v_L2 & v_L3 & v_L4 & v_L5))')>0, 1,0))
)
In the calculation of Letters (the first formula listed above), we are using Set Analysis with an Advanced Search condition. If you’d like to learn more… yes, you guessed it – join us in Madrid or in New Orleans for the next Masters Summit for Qlik.
Well, this is already pretty useful, but we can do more. As we keep guessing our Wordles, we learn which letters exist in the mysterious word (green and yellow), but even more importantly, we learn which letters don’t exist in the word (the gray ones) – just like in Qlik Sense we learn a lot of information from those data elements that get excluded by our search. Ideally, we’d like to limit our list of possible words, excluding those words that include the letters that were colored in gray.
For this purpose, I added another variable with an input box, in which we can enter the letters that we’d like to exclude. We can continue using the same table with the list of words, but this time the dimension needs to get a lot more elaborate. Rather than using the field word, we need to select only those words that don’t contain the “gray” letters:
=AGGR(only({<word={"=max(index('$(v_Exclude)', letter))=0"} >} word), word)
In order to add a Set Analysis condition that would limit our words, we need to use the function Only(), enclosed in the AGGR function. In the Set Analysis condition, we select those words that only contain letters that cannot be found in the variable v_Exclude. Another AGGR, another Advanced Search condition. If you’d like to learn more… exactly right, you should include Madrid or New Orleans in your travel plans for this fall!
With the addition of this feature, our list of words becomes even more useful. Review the screen shot below. The first table includes all the words, scored based on the entered word “SENSE”. The first five suggestions are: SENSE, CENSE, DENSE, LENSE, and MENSE. If we request to exclude letters C,D,L, and M from the word, the four words with these letters disappear from the list, as presented in the lower table.
Hmm… We are almost there, but not quite there yet. The list of suggested words is helpful, but it’s not smart enough yet. It can suggest words that have some of the selected letters, but not necessarily ALL of the selected letters. For example, the word JESSE doesn’t include the letter N, even though N was entered as one of the selected letters. Ideally, I’d like to create a list of the words that include all of the letters that were entered. After a bit of tinkering, I came up with the following formula. I admit, it’s a bit bulky and not too elegant. If you can think of a more elegant solution, please post it in the comments, I’d be happy to see it.
=AGGR(only({<
word= {"=max(index('$(v_Exclude)', letter))=0 and
(len(trim('$(v_L1)'))=0 or index(word, UPPER('$(v_L1)'))>0 ) and
(len(trim('$(v_L2)'))=0 or index(word, UPPER('$(v_L2)'))>0 ) and
(len(trim('$(v_L3)'))=0 or index(word, UPPER('$(v_L3)'))>0 ) and
(len(trim('$(v_L4)'))=0 or index(word, UPPER('$(v_L4)'))>0 ) and
(len(trim('$(v_L5)'))=0 or index(word, UPPER('$(v_L5)'))>0 ) "}
>} word), word)
With the Dimension calculated like this, I can get a shorter list of only those words that contain all of the entered letters. In case you are puzzled by the formula, it basically says “either the variable is empty, or the entered value needs to be present in the word.” As you can see in the image below, the table contains a short list of words that have the letters O, L, E, and G in them.
This is it, folks. With a little bit of Qlik Sense, Set Analysis, and AGGR, you can build yourself your own Wordle cheating helping tool and never miss another Wordle again! There is still a lot of room for improvement. You can enter your green and yellow letters into separate variables (or use upper case for green letters, and lower case for yellow letters) and enhance your conditions to require that green letters should remain in place, and yellow letters should be found anywhere but in the spot that they were entered in. The sky is the limit. Post your enhancement ideas in the comments – I’d love to hear them!
Did I mention that the Masters Summit for Qlik is going to be LIVE again? In addition to my Set Analysis and AGGR session, I’m also teaching Performance Tuning, with the use of the new tool QSDA Pro by Rob Wunderlich. Rob, speaking of which, will cover Advanced Scripting. Barry Harmsen will teach you Advanced Data Modeling. Bill Lay will take you through an unforgettable session on Data Visualization techniques, and Nick Webster will introduce you to Qlik Sense APIs and the wonderful world of mashups and extensions. And not to forget – all of that while networking with like-minded colleagues in some of the most entertaining cities in the world – Madrid and New Orleans. Pretty hard to beat that, huh?
While I was writing this blog article, I found yet another fun puzzle game, Nerdle – it’s a new Wordle-like game for math fans. We need to guess an 8-character math formula in 6 attempts or less. That might be the subject of the next blog article. Stay tuned!