Quantcast
Channel: domain completion – Drawing with Numbers
Viewing all articles
Browse latest Browse all 5

Comparing Each Against Each Other: The No-SQL Cross Product

$
0
0

Here’s a problem that has been bouncing around in my brain since I first used Tableau. How do I compare the results of every permutation of one item vs. another? Here’s an example using Superstore Sales – I put Region on Rows and Columns, and SUM(Sales) on the Text Shelf, and only see four values: Screen Shot 2013-12-11 at 9.27.56 PM

What if I want to compare Sales in Central to those in East, South, and West, and Sales in East to South and West, and Sales in West to Sales in South simultaneously? We can compare two at a time using parameters or a self-blend, or one vs. the rest in different ways via sets or table calcs or calculated fields, but how about each against each other? What if we want a correlation matrix? Read on to find out how to do this without any SQL, and learn a little bit about domain completion.

When we put Region on both Rows and Columns, even though we have 4 rows by 4 columns = 16 cells, Tableau generates only 4 marks, because there are only 4 Regions in the data. Here’s the status bar for the above view, showing the 4 marks:Screen Shot 2013-12-11 at 9.30.23 PM

So we need to pad out the data somehow to be able to compare each to the other. The typical solution to this is to do a cross product, cross join, or Cartesian join of the data where every row in the data is crossed with the same rows from other sources. In a simple data set using Microsoft JET SQL syntax, the cross product might be written as:

SELECT [1].[Region] as [Region], [1].[Sales] AS [Sales], [2].[Region] AS [Region cross], [2].[Sales] AS [Sales cross] FROM [Data] AS [1], [Data] AS [2]

If we started out with 4 rows of data, we’d end up with 16 rows. If we started out with 4000 rows, we’d end up with 16,000,000 rows, which is a whole lot of extra data for a seemingly simple analysis. If we did some work to pre-aggregate the data in SQL, we could cut that down significantly, but then we’d have to be careful about what dimensions we wanted to filter or blend on to keep them in the data, and besides not everybody knows SQL and, perhaps more importantly, nor should everybody have to.

Now, with a little dose of table calculations, you don’t have to. The key to doing a no-SQL cross product in Tableau is domain completion. Domain completion is a form of padding (also called densification) that Tableau does when building crosstabs for table calculations to pad out sparse data and return meaningful results. For example, here’s Superstore Sales with Region on Rows and Order Date on Columns for the first 10 days of December and there are 8 empty cells where there are no values in the data for a given Region or Order Date: Screen Shot 2013-12-11 at 9.54.42 PM  

We can confirm that looking at the status bar, where Tableau shows 4 rows by 10 columns, but only 32 marks:Screen Shot 2013-12-11 at 9.55.12 PM

When we turn that Sum of Sales into a Running Total Quick Table Calculation, Tableau fills in those empty cells, in other words “completing the domain” of the combinations of Region and Order Date: Screen Shot 2013-12-11 at 9.55.24 PM

The domain padding that Noah described earlier this week in his post on Fibonacci sequences, Serpinski carpets and more is another kind of densification. Domain padding and domain completion can look like one another, and even happen at the same time, however in this case they are not, one part of checking is looking at the Show Missing Values option on the Order Date pill: Screen Shot 2013-12-11 at 9.56.04 PM

With an understanding of domain completion, now comes the beginnings of the cross product. Going back to our original view, Tableau has created 16 cells:

Screen Shot 2013-12-11 at 9.27.56 PM

All we have to do is have a table calculation (or two) that will fill in those cells. Here’s an introductory calculation to do just that:

WINDOW_MAX(SUM([Sales]))

Here’s what happens when we bring that into the Region/Region view: Screen Shot 2013-12-11 at 10.05.13 PM

Why isn’t it padded out? This reveals another factor in domain completion, that we need to have different dimensions in play, right now there’s only Region. The next step is to duplicate the Region dimension, and we can use that copy instead:Screen Shot 2013-12-11 at 10.05.54 PM

With the default Compute Using of Table (Across), now we’re seeing Tableau complete the domain for each Region (copy) for each Region and therefore generating a mark for every combination of Region and Region (copy). We can also see this down below in the status bar, Tableau is showing 8 rows (4 Regions x 2 measures) by 4 columns, and 32 marks, indicating a fully completed domain: Screen Shot 2013-12-11 at 10.07.09 PM

The reason why we use the WINDOW_MAX(SUM([Sales])) is that within a given row there’s only a single value of sales and we want to return that value to every column. We could use WINDOW_AVG, _MIN, _MEDIAN, or _SUM, I’m just used to using WINDOW_MAX. Also, we can make the calculation more efficient by computing the result only once for each partition by wrapping it in PREVIOUS_VALUE(), to get:

PREVIOUS_VALUE(WINDOW_MAX(SUM([Sales])))

In this case PREVIOUS_VALUE() returns the value of the calculation from the prior column to the next column, and for the very first column uses the WINDOW_MAX(SUM([Sales])) as the argument. Now we can take that formula and use it in two identical calculated fields to fully pad out the data, one for the columns and one for the rows:Screen Shot 2013-12-11 at 10.08.30 PM

Here, I’m using Table (Across) as the Compute Using for across and Table (Down) for down. To compare each row vs. the other, I created a basic % difference calc:

([Sales (Down)]-[Sales (Across)])/ABS([Sales (Across)])

Brought into the view, it takes on the Compute Usings of the existing table calcs:Screen Shot 2013-12-11 at 10.09.34 PM

We can set up for a heatmap view by dragging SUM(Sales), Sales (Across), and Sales (Down) off of the view: Screen Shot 2013-12-11 at 10.10.26 PM

However this looks confusing, is that 59.0% in the first column indicating that South had 59% more Sales than Central or vice versa? The easier choice would be a form of triangular matrix, showing one against the other. In a regular crosstab, we could potentially filter out the dimension combinations that we don’t want, however here we’ve used domain completion and a filter on dimensions wouldn’t remove the data that’s been padded in by the table calculations. In this case, the table calcs call for a table calculation filter. To set that up, we can use the INDEX() table calculation twice to identify each cell going across or down, and then do a simple comparison to return True for only the upper/right triangular matrix:

//creates triangle in upper right. flip < sign to have triangle in lower left [Index (Down)]<=[Index (Across)]

Screen Shot 2013-12-11 at 10.11.10 PM

And putting all the calcs together into one view: all workout

Finally, we can build the matrix as a highlight table. This is a little convoluted to work around what I believe is a a Tableau bug in version 8.0 and 8.1 (at least) where we have to use Show Me to get the highlight table with text that automatically becomes lighter on a dark background (if we manually build the highlight table Tableau will only have a single text color), here are the steps:

  1. Duplicate the workout crosstab worksheet.
  2. Moved the Upper/Right Triangle filter calc to the Filters Shelf, filtering for True.
  3. Drag the SUM(Sales), Index (Down), and Index (Across) measures off the Measure Values Shelf.
  4. Change the Sales (Down) measure into a discrete. (Tableau Tip – Once a measure is on the Measure Values card, you can change it into a discrete, even though you can’t drag a discrete measure onto the Measure Values card in the first place.)
  5. Change the Sales (Across) measure into a discrete.
  6. Drag Sales (Down) and Sales (Across) pills onto the Filters Shelf, filtering for All values. At this point, Tableau will replace Measure Values with the % Difference measure on the Text Shelf.  (This is to park two measures, because using Show Me would otherwise remove the measure). Screen Shot 2013-12-11 at 10.16.41 PM
  7. On Show Me, click the Highlight Table. Tableau creates the highlight table, but moves the Region (copy) dimension to the Rows Shelf: Screen Shot 2013-12-11 at 10.17.02 PM
  8. Move Region (copy) back to the Columns Shelf. Now the table looks like an upper triangular matrix: Screen Shot 2013-12-11 at 10.17.20 PM
  9. Drag Sales (Down) off the Filters Shelf to the Columns Shelf to provide an extra label for comparison:
  10. Drag Sales (Across) off the Filters Shelf to the Rows Shelf to provide an extra label as well. Screen Shot 2013-12-11 at 10.18.08 PM
  11. For cleanup, change the Color Scheme to a red/black or something else that will work for color blind folks, get rid of the extra row and column dividers, and hide the labels to have a usable view: upperright triangular matrix
  12. Finally, I added an additional calc that returns above/below/the same as to build a useful tooltip:Screen Shot 2013-12-11 at 11.03.33 PM

Here’s an alternate lower triangular matrix, and the whole workbook on Tableau Public:

I’m not sure how useful that kind of triangular matrix view is, so I’d been sitting on this post for awhile. It wasn’t until Monday that I got inspired to do something more…

Building a Correlation Matrix

This week Bora Beran had a post on creating a correlation matrix in Tableau using the car data from R, with two versions: one demonstrating R integration and the other table calcs:

I think correlation coefficients are very cool, and underutilized because historically the’ve been more painful to build in Tableau due to the combination of pre-processing the data and needing to build out the table calcs. I was excited to see this and tried out this technique on the correlation matrix Bora had created and got it working for both table calcs and R. The R side got a bit more complicated because I had to pad out more of the measures, but it works, here’s a download link because Tableau Public doesn’t support R. (Soon, I hope?). This makes a great example of how using this technique can reduce load on Tableau, the original data set with the cross product has over 11000 rows in it, using domain completion and table calcs cuts it down to 1/352nd of its size while retaining full ability to filter the data.

Conclusion

In a sign of not seeing the forest for the trees, I hadn’t actually realized that this technique is a cross product without SQL until working on Bora’s post, and with his encouragement I finished up this post (and thanks for the shout-out, too!). Also, thanks to Jim Wahl, I’d forgotten about the dual-index filter trick until he used it in a Tableau forum post. If you’re looking for other ways to compare ones to others, there’s a list of different tricks in the Analyses wiki. If you use this or some other comparison technique in one of your analyses, please link to it below!


Viewing all articles
Browse latest Browse all 5

Trending Articles