---
title: "The GoodmanKruskal package: Measuring association between categorical variables"
author: "Ron Pearson"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{The GoodmanKruskal package: Measuring association between categorical variables}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---

The standard association measure between numerical variables is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century. This measure characterizes the degree of linear association between numerical variables and is both normalized to lie between -1 and +1 and symmetric: the correlation between variables x and y is the same as that between y and x. Categorical variables arise commonly in many applications and the best-known association measure between two categorical variables is probably the chi-square measure, also introduced by Karl Pearson. Like the product-moment correlation coefficient, this association measure is symmetric, but it is not normalized. This lack of normalization provides one motivation for *Cramer's V*, defined as the square root of a normalized chi-square value; the resulting association measure varies between 0 and 1 and is conveniently available in the **vcd** package. An interesting alternative to Cramer's V is *Goodman and Kruskal's tau*, which is not nearly as well known and is *asymmetric*. This asymmetry arises because the tau measure is based on the fraction of variability in the categorical variable y that can be explained by the categorical variable x. In particular, the fraction of variability in x that is explainable by variations in y may be very different from the fraction of variability in y that is explainable by variations in x, as examples presented here demonstrate. While this asymmetry is initially disconcerting, it turns out to be extremely useful, particularly in exploratory data analysis.
This combination of utility and relative obscurity motivated the **GoodmanKruskal** package, developed to make this association measure readily available to the *R* community.

## 1. Introduction

Both in developing predictive models and in understanding relations between different variables in a dataset, association measures play an important role. In the case of numerical variables, the standard association measure is the product-moment correlation coefficient introduced by Karl Pearson at the end of the nineteenth century, which provides a normalized measure of *linear* association between two variables. In building linear regression models, it is desirable to have high correlations - either positive or negative - between the prediction covariates and the response variable, but small correlations between the different prediction covariates. In particular, large correlations between prediction covariates lead to the problem of *collinearity* in linear regression, which can result in extreme sensitivity of the estimated model parameters to small changes in the data, incorrect signs of some model parameters, and large standard errors, causing the statistical significance of some parameters to be greatly underestimated. In addition, the presence of highly correlated predictors can also cause difficulties for newer predictive model types: the tendency for the original random forest model class to preferentially include highly correlated variables at the expense of other predictors was one of the motivations for developing the conditional random forest method included in the **party** package (see the paper by Strobl et al., "Conditional Variable Importance for Random Forests," *BMC Bioinformatics*, 2008, 9:307, http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307).

Alternatives to the product-moment correlation for numerical data include Spearman's rank correlation and Kendall's tau, both of which measure *monotone* association between variables (i.e., the tendency for "large" values of one variable to be associated with "large" values of the other). All three of these correlation measures may be computed with the **cor** function in the base *R* **stats** package by specifying the **method** parameter appropriately. The Kendall and Spearman measures are easily extended to ordinal variables (i.e., ordered factors), but none of these measures are applicable to categorical (i.e., unordered factor) variables. Finally, note that all three of these association measures are symmetric: the correlation between $x$ and $y$ is equal to that between $y$ and $x$. Categorical variables arise frequently in practice, either because certain variables are inherently categorical (e.g., state, country, political affiliation, merchandise type, color, medical condition, etc.) or because numerical variables are frequently grouped in some application areas, converting them to categorical variables (e.g., replacing age with age group in demographic analysis). There is loss of information in making this conversion, but the original numerical data values are often not available, leaving us with categorical data for analysis and modeling. Also, numerical values are sometimes used to code categorical variables (e.g., numerical patient group identifiers), or integer-valued variables where there is relatively little loss of information in treating them as categorical (e.g., the variables **cyl**, **gear**, and **carb** in the **mtcars** dataset). In any of these cases, quantitative association measures may be of interest, and the most popular measures available for categorical variables are the chi-square and Cramer's V measures defined in Section 1.1 and available via the **assocstats** function in the **vcd** package. 
Like the correlation measures described above for numerical data, both of these association measures are symmetric: the association between $x$ and $y$ is the same as that between $y$ and $x$.

Much less well known than the chi-square and Cramer's V measures is *Goodman and Kruskal's tau measure,* which is described in Section 1.2 and forms the basis for the **GoodmanKruskal** package. In contrast to all of the association measures discussed in the preceding two paragraphs, Goodman and Kruskal's tau measure is *asymmetric*: the association between variables $x$ and $y$ is generally *not* the same as that between $y$ and $x$. This asymmetry is inherent in the way Goodman and Kruskal's tau is defined and, although it may be initially disconcerting, this characteristic of the association measure can actually be quite useful, as examples presented in Sections 2 and 3 illustrate.

It is possible to apply any of the categorical association measures just described to numerical data, but the results are frequently not useful. There are exceptions - the cases noted above where numerical variables are either effectively encodings of categorical variables or nearly so - but in the case of "typical" numerical data with few or no repeated values, these categorical association measures generally give meaningless results, a point discussed in detail in Section 4. Similarly, these association measures perform poorly in assessing relationships between a "typical" numerical variable and a categorical variable with a moderate number of levels. Because it is sometimes desirable to attempt to measure the association between mixed variable types, the function **GroupNumeric** has been included in the **GoodmanKruskal** package to convert numerical variables into categorical ones, which may then be used as a basis for association analysis between mixed variable types.
This function is described and illustrated in Section 5, but it should be noted that this approach is somewhat experimental: there is loss of information in grouping a numerical variable into a categorical variable, but neither the extent of this information loss nor its impact is clear. Also, it is not obvious how many groups should be chosen, or how the results are influenced by different grouping strategies (the **GroupNumeric** function is based on the **classInt** package, which provides a number of different grouping methods). Nevertheless, the preliminary results presented in Section 5 suggest that this strategy does have promise for mixed-variable association analysis, and a simple rule-of-thumb is offered for selecting the number of groups.

The rest of this note is organized as follows. Section 1.1 presents a detailed problem formulation and describes the chi-square and Cramer's V association measures. Section 1.2 then describes Goodman and Kruskal's tau measure, and Section 2 gives a brief overview of the functionality included in the **GoodmanKruskal** package based on this association measure. Section 3 presents three examples to illustrate the kinds of results we can obtain with Goodman and Kruskal's tau measure, demonstrating its ability to uncover unexpected relations between variables that may be very useful in exploratory data analysis. Section 4 then considers the important special case where the variable $x$ has no repeated values - e.g., continuously-distributed random variables or categorical variables equivalent to "record indices" - showing how Goodman and Kruskal's tau measure breaks down completely in this case. Section 5 then introduces the **GroupNumeric** function to address this problem and describes its use. Finally, this note concludes with a brief summary in Section 6.

### 1.1 Problem formulation, chi-square, and Cramer's V

The basic problem of interest here may be formulated as follows.
We are given two categorical variables, $x$ and $y$, having $K$ and $L$ distinct values, respectively, and we wish to quantify the extent to which these variables are associated or "vary together." It is assumed that we have $N$ records available, each listing values for $x$ and $y$; for convenience, introduce the notation $x \rightarrow i$ to indicate that $x$ assumes its $i^{th}$ possible value. The basis for all categorical association measures is the contingency table $N_{ij}$, which counts the number of times $x \rightarrow i$ and $y \rightarrow j$:

$$
\begin{equation}
N_{ij} = |\{ k \; | \; x_k \rightarrow i, y_k \rightarrow j\}|,
\end{equation}
$$

where $| {\cal S} |$ indicates the number of elements in the set $\cal S$. The raw counts in this contingency table may be turned into simple probability estimates by dividing by the number of records $N$:

$$
\begin{equation}
\pi_{ij} = \frac{N_{ij}}{N}.
\end{equation}
$$

The chi-square association measure is given by:

$$
\begin{equation}
X^2 = N \sum_{i=1}^{K} \sum_{j=1}^{L} \; \frac{(\pi_{ij} - \pi_{i+} \pi_{+j})^2}{\pi_{i+} \pi_{+j}},
\end{equation}
$$

where the marginals $\pi_{i+}$ and $\pi_{+j}$ are defined as:

$$
\begin{eqnarray}
\pi_{i+} & = & \sum_{j=1}^{L} \; \pi_{ij}, \\
\pi_{+j} & = & \sum_{i=1}^{K} \; \pi_{ij}.
\end{eqnarray}
$$

The idea behind this association measure is based on the observation that, if $x$ and $y$ are regarded as discrete-valued random variables, then $\pi_{ij}$ is an empirical estimate of their joint distribution, while $\pi_{i+}$ and $\pi_{+j}$ are estimates of the corresponding marginal distributions. If $x$ and $y$ are statistically independent, the joint distribution is simply the product of the marginal distributions, and the $X^2$ measure characterizes the extent to which the estimated probabilities depart from this independence assumption.
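
As a quick numerical check of this definition, the following sketch computes $X^2$ directly from the estimated probabilities for two few-level variables from the **mtcars** dataframe and compares it with the statistic returned by **chisq.test** in base *R*. The choice of variables and all object names are illustrative assumptions, not part of the package.

```r
# Contingency table N_ij for two few-level variables (illustrative choice)
tab <- table(mtcars$cyl, mtcars$gear)
N   <- sum(tab)

p     <- tab / N        # joint probability estimates pi_ij
p_row <- rowSums(p)     # marginals pi_i+
p_col <- colSums(p)     # marginals pi_+j

# Product of the marginals: the joint distribution under independence
p_indep <- outer(p_row, p_col)

# Chi-square measure, computed term by term from the definition
X2 <- N * sum((p - p_indep)^2 / p_indep)

# Reference value from base R (warnings about small expected counts suppressed)
X2_ref <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

c(from_definition = X2, from_chisq.test = X2_ref)
```

The two values should agree up to rounding error, since **chisq.test** evaluates the same sum in terms of observed and expected counts.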
Unfortunately, the $X^2$ measure is not normalized, varying between a minimum value of $0$ under the independence assumption to a maximum value of $N \min \{ K-1, L-1 \}$ (see Alan Agresti's book, *Categorical Data Analysis*, Wiley, 2002, second edition, page 112). This observation motivates Cramer's V measure, defined as:

$$
\begin{equation}
V = \sqrt{ \frac{X^2}{N \min \{ K-1, L-1 \} } }.
\end{equation}
$$

This normalized measure varies from a minimum value of $0$ when $x$ and $y$ are statistically independent to a maximum value of $1$ when one variable is perfectly predictable from the other.

### 1.2 Goodman and Kruskal's tau measure

Goodman and Kruskal's $\tau$ measure of association between two variables, $x$ and $y$, is one member of a more general class of association measures defined by:

$$
\begin{equation}
\alpha(x, y) = \frac{V(y) - E [V(y|x)]}{V(y)}
\end{equation}
$$

where $V(y)$ denotes a measure of the unconditional variability in $y$ and $V(y|x)$ is the same measure of variability, but conditional on $x$, and its expectation is taken with respect to $x$. Different members of this family are obtained by selecting different definitions of these variability measures, as discussed in Section 2.4.2 of Agresti's book. The specific choices that lead to Goodman and Kruskal's $\tau$ measure are:

$$
\begin{eqnarray}
V(y) & = & 1 - \sum_{j=1}^{L} \; \pi_{+j}^2, \\
E [V(y|x)] & = & 1 - \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\pi_{ij}^2}{\pi_{i+}}.
\end{eqnarray}
$$

These equations form the basis for the function **GKtau** included in the **GoodmanKruskal** package.
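
The two variability definitions above translate almost line for line into base *R*. The helper `gk_tau` below is a hypothetical illustration written for this note, not the package's **GKtau** implementation; in particular, it simply drops records with missing values and does no rounding.

```r
# Hypothetical helper (not part of the GoodmanKruskal package):
# Goodman and Kruskal's tau from x to y, computed from V(y) and E[V(y|x)].
gk_tau <- function(x, y) {
  tab <- table(factor(x), factor(y))  # contingency table N_ij (NAs dropped)
  p   <- tab / sum(tab)               # joint probability estimates pi_ij
  px  <- rowSums(p)                   # marginals pi_i+
  py  <- colSums(p)                   # marginals pi_+j
  Vy   <- 1 - sum(py^2)               # V(y)
  EVyx <- 1 - sum(p^2 / px)           # E[V(y|x)]; row i is divided by pi_i+
  (Vy - EVyx) / Vy
}

# Forward and backward associations for two few-level mtcars variables
tau_forward  <- gk_tau(mtcars$cyl, mtcars$vs)   # tau(cyl, vs)
tau_backward <- gk_tau(mtcars$vs, mtcars$cyl)   # tau(vs, cyl)
round(c(forward = tau_forward, backward = tau_backward), 3)
```

Swapping the arguments changes which marginal appears in the conditional term, which is why the two values differ.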
Before concluding this discussion, however, it is worth noting that substituting these expressions into the general expression for $\alpha(x, y)$ given above and simplifying (using $\sum_{i} \pi_{i+} = 1$ to write $\sum_{j} \pi_{+j}^2 = \sum_{i} \sum_{j} \pi_{i+} \pi_{+j}^2$), we obtain the following explicit expression for Goodman and Kruskal's $\tau$ measure:

$$
\begin{equation}
\tau(x, y) = \frac{ \sum_{i=1}^K \sum_{j=1}^L \; \left( \frac{ \pi_{ij}^2 - \pi_{i+}^2 \pi_{+j}^2 }{ \pi_{i+} } \right) }{ 1 - \sum_{j=1}^L \pi_{+j}^2 }.
\end{equation}
$$

It follows from the fact that $i$ and $j$ are not interchangeable on the right-hand side of this equation that $\tau(y, x) \neq \tau(x, y)$, in general.

## 2. The GoodmanKruskal R package

```{r, echo = FALSE, warning = FALSE, message = FALSE}
require(GoodmanKruskal)
require(MASS)
require(car)
```

The **GoodmanKruskal** package includes four functions to compute Goodman and Kruskal's $\tau$ measure and support some simple extensions. These functions are:

1. __GKtau__ is the basic function to compute both the forward association $\tau(x, y)$ and the backward association $\tau(y, x)$ between two categorical vectors $x$ and $y$;
1. __GKtauDataframe__ computes the Goodman Kruskal association measures between all pairwise combinations of variables in a dataframe;
1. __GroupNumeric__ groups a numeric vector, returning a factor that can be used in association analysis, for reasons discussed in Sections 4 and 5;
1. __plot.GKtauMatrix__ is a plot method for the S3 objects of class __GKtauMatrix__ returned by the __GKtauDataframe__ function.

As noted, **GKtau** is the basic function on which the **GoodmanKruskal** package is built. This function is called with two variables, $x$ and $y$, and it returns a single-row dataframe with six columns, giving the name of each variable, the number of distinct values each exhibits, and both the forward association $\tau(x,y)$ and the backward association $\tau(y,x)$.
By default, missing values are treated as a separate level, if they are present (i.e., the presence of missing data increases the number of distinct values by one); alternatively, any valid value for the **useNA** parameter of the **table** function in base *R* may be specified for the optional **includeNA** parameter in the **GKtau** call. The other optional parameter for the **GKtau** function is **dgts**, which specifies the number of digits to retain in the results; the default value is $3$.

As a specific illustration of the **GKtau** function, consider its application to the categorical variables **Manufacturer** and **Cylinders** from the **Cars93** dataframe in the **MASS** package, a dataframe considered further in Section 3.1:

```{r, echo = TRUE}
GKtau(Cars93$Manufacturer, Cars93$Cylinders)
```

This example illustrates the asymmetry of the Goodman-Kruskal tau measure: knowledge of **Manufacturer** is somewhat predictive of **Cylinders**, but the reverse association is much weaker; knowing the number of cylinders tells us almost nothing about who manufactured the car. An even more dramatic example from the same dataframe is the association between the **Manufacturer** variable and the variable **Origin**, with levels "USA" and "non-USA":

```{r, echo = TRUE}
GKtau(Cars93$Manufacturer, Cars93$Origin)
```

Here, knowledge of the manufacturer is enough to completely determine the car's origin - implying that each manufacturer has been characterized as either "foreign" or "domestic" - but knowledge of **Origin** provides essentially no ability to predict manufacturer, since each origin is represented by approximately $16$ different manufacturers. (It is interesting to note that the Cramer's V value returned by the **assocstats** function from the **vcd** package for this pair of variables is $1$, correctly identifying the strength of the relationship between these variables, but giving no indication of its extreme directionality.)

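
The symmetry of Cramer's V for this pair of variables can also be checked with a few lines of base *R*, without the **vcd** package. The helper `cramers_v` is a hypothetical name introduced here for illustration; it applies the definition from Section 1.1 to the statistic returned by **chisq.test**, and it assumes the **MASS** package is installed so that **Cars93** can be loaded.

```r
# Load the Cars93 dataframe from MASS (assumed to be installed)
data(Cars93, package = "MASS")

# Hypothetical helper: Cramer's V from the chi-square statistic
cramers_v <- function(x, y) {
  tab <- table(factor(x), factor(y))   # drop unused factor levels
  X2  <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  sqrt(X2 / (sum(tab) * min(nrow(tab) - 1, ncol(tab) - 1)))
}

# The argument order does not matter: both calls give the same value
v_forward  <- cramers_v(Cars93$Manufacturer, Cars93$Origin)
v_backward <- cramers_v(Cars93$Origin, Cars93$Manufacturer)
c(v_forward, v_backward)
```

Both calls return the same number, so Cramer's V by itself cannot reveal which of the two variables determines the other.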
The function **GKtauDataframe** is a wrapper that applies **GKtau** to all pairs of variables in a dataframe. This function returns an S3 object of class "GKtauMatrix" that consists of a square matrix with one row and one column for each variable included in the dataframe. The diagonal elements of this matrix give the number of unique values for each variable, and the off-diagonal elements contain the forward and backward tau measures for each variable pair. The **GoodmanKruskal** package includes a plot method for the S3 objects returned by the **GKtauDataframe** function, based on the **corrplot** package; detailed demonstrations of both the **GKtauDataframe** function and the associated plot method are given in Section 3.

The **GroupNumeric** function converts numeric variables into categorical variables, to serve as a basis for association analysis between variables of different types. Motivation for this function comes from the fact, discussed in Sections 4 and 5, that a continuously distributed random variable $x$ exhibits no "ties" or duplicated values, implying that the number of levels for $x$ is equal to $N$, the number of records. As shown in Section 4, this means $\tau(x, y) = 1$ for any other variable $y$, rendering the Goodman-Kruskal measure useless in such situations. Grouping numerical variables reduces the number of distinct values, and this approach can - at least in some cases - provide a useful basis for characterizing the association between numerical and categorical variables. The **GroupNumeric** function is based on the **classIntervals** function from the **classInt** *R* package, which provides a variety of different procedures for grouping numerical variables. A more detailed discussion of the **GroupNumeric** function is given in Section 5, which illustrates its use.

## 3. Three examples

The following examples illustrate the use of Goodman and Kruskal's $\tau$ measure of association between categorical variables in uncovering possibly surprising features in a dataset. The two examples presented in Section 3.1 are both based on the **Cars93** dataframe in the **MASS** package, and they illustrate two key points, each based on a different subset of the `r ncol(Cars93)` columns from the dataframe. The first example provides a useful illustration of the general behavior of Goodman and Kruskal's $\tau$ measure, including its asymmetry, while the second example illustrates an important special case, discussed in detail in Section 4.1. The third example is presented in Section 3.2 and it illustrates the utility of Goodman and Kruskal's $\tau$ measure in exploratory data analysis, uncovering a relationship that is not obvious, although easily understood once it is identified.

### 3.1 The Cars93 dataframe: two examples

The **Cars93** dataframe from the **MASS** package characterizes 93 different cars in terms of `r ncol(Cars93)` attributes. The plot below gives a graphical summary of the results obtained using the **GKtauDataframe** procedure described in Section 2, applied to a subset of five of these attributes. As noted, this function returns an S3 object of class "GKtauMatrix" and the plot shown here was generated using the default options of the plot method for this object class. For this example, the resulting plot is in the form of a $5 \times 5$ array, with the variable names across the top and down the left side. The diagonal entries in this display give the numbers of unique levels for each variable, while the off-diagonal elements give both numeric and graphical representations of the Goodman-Kruskal $\tau$ values. Specifically, the numerical values appearing in each row represent the association measure $\tau(x, y)$ from the variable $x$ indicated in the row name to the variable $y$ indicated in the column name.

Looking at the upper left $2 \times 2$ sub-array from this plot provides a graphical representation of the second **GKtau** function example presented in Section 2, emphasizing the extreme asymmetry possible for the Goodman-Kruskal $\tau$ measure. Specifically, the association from **Manufacturer** to **Origin** is $\tau(x, y) = 1$, as indicated by the degenerate ellipse (i.e., straight line) in the $(1, 2)$-element of this plot array. In contrast, the opposite association - from **Origin** to **Manufacturer** - has a $\tau$ value of only $0.05$, small enough to be regarded as zero. As noted, this result means that **Origin** is perfectly predictable from **Manufacturer**, but **Origin** gives essentially no information about **Manufacturer**; in practical terms, this suggests we can uniquely associate an origin (i.e., "USA" or "non-USA") with every manufacturer, but that each of these origin designations includes multiple manufacturers. Looking carefully at the $32 \times 2$ contingency table constructed from these variables confirms this result - the 48 cars in the "USA" group come from 14 manufacturers and the 45 cars in the "non-USA" group come from 18 different manufacturers - but it is *much* easier to see this from the plot shown below.

```{r, echo = TRUE, fig.width = 7.5, fig.height = 6}
# Select five Cars93 attributes and compute all pairwise tau values
varSet1 <- c("Manufacturer", "Origin", "Cylinders", "EngineSize", "Passengers")
CarFrame1 <- subset(Cars93, select = varSet1)
GKmatrix1 <- GKtauDataframe(CarFrame1)
plot(GKmatrix1)
```

More generally, it appears from the plot of $\tau$ values that the variable **Origin** explains essentially no variability in any of the other variables, while *all* of the reverse associations are larger, ranging from a small association seen with **Cylinders** ($0.14$) to the complete predictability from **Manufacturer** just noted.
The variable **Cylinders** exhibits a slight ability to explain variations in the other variables (ranging from $0.06$ to $0.14$), but two of the reverse associations are much larger: the $\tau$ value from **Manufacturer** to **Cylinders** is $0.36$, while that from **EngineSize** is $0.85$, indicating quite a strong association. Again, after carefully examining the underlying data, it appears that larger engines generally have more cylinders, but that for each cylinder count, there exists a range of engine sizes, with significant overlap between some of these ranges. The next plot is basically the same as that just considered, with the addition of a single variable: **Make**, which completely specifies the car described by each record of the **Cars93** dataframe. Because of the way it is constructed, the only new features are the bottom row and the right-most column. Here, the asymmetry of the Goodman-Kruskal $\tau$ measure is even more extreme, since the variable **Make** is *perfectly predictive of all other variables in the dataset*. As shown in Section 4.1, this behavior is a consequence of the fact that **Make** exhibits a unique value for every record in the dataset, meaning it is effectively a record index. Conversely, note that these other variables are at best moderate predictors of the variations seen in **Make** (specifically, **Manufacturer** and **EngineSize** are somewhat predictive). Also, it is important to emphasize that perfect predictors need not be record indices, as in the case of **Manufacturer** and **Origin** discussed above. 

```{r, echo = TRUE, fig.width = 7.5, fig.height = 6}
# Same five attributes as before, plus Make
varSet2 <- c("Manufacturer", "Origin", "Cylinders", "EngineSize", "Passengers", "Make")
CarFrame2 <- subset(Cars93, select = varSet2)
GKmatrix2 <- GKtauDataframe(CarFrame2)
plot(GKmatrix2)
```

### 3.2 The Greene dataframe

The third and final example presented here is based on the **Greene** dataframe from the **car** package, which has `r nrow(Greene)` rows and `r ncol(Greene)` columns, with each row characterizing a request to the Canadian Federal Court of Appeal filed in 1990 to overturn a rejection of a refugee status request by the Immigration and Refugee Board. A more detailed description of these variables is given in the **help** file for this dataframe, but a preliminary idea of its contents may be obtained with the **str** function:

```{r, echo = TRUE}
str(Greene)
```

Applying the **GKtauDataframe** function to this dataframe yields the association plot shown below, which reveals several interesting details. The most obvious feature of this plot is the fact that the variable **success** is perfectly predictable from **nation** (i.e., $\tau(x, y) = 1$ for this association). Similarly, the reverse association, while not perfect, is also quite strong ($\tau(y, x) = 0.85$); taken together, these results suggest a very strong connection between these variables. Referring to the **help** file for this dataframe, we see that **success** is defined as the "logit of success rate, for all cases from the applicant's nation," which is completely determined by **nation**, consistent with the results seen here. Conversely, an examination of the numbers reveals that, while most **success** values are unique to a single nation, a few are duplicates (e.g., Ghana and Nigeria both exhibit the **success** value $-1.20831$), explaining the strong but not perfect reverse association between these variables.
Note, however, that this case of perfect association is not due to the "record index" issue seen in the second **Cars93** example and discussed further in Section 4, since the number of levels of **nation** is only $17$, far fewer than the number of data records ($N = 384$).

```{r, echo = TRUE, fig.width = 7.5, fig.height = 6}
GKmatrix3 <- GKtauDataframe(Greene)
plot(GKmatrix3)
```

The other reasonably strong association seen in this plot is that between **location** and **language**, where the forward association is $0.8$ and the reverse association is $0.5$; these numbers suggest that **location** (which has levels "Montreal", "Toronto", and "other") is highly predictive of **language** (which has levels "English" and "French"), which seems reasonable given the language landscape of Canada. We can obtain a more complete picture of this relationship by looking at the contingency table for these two variables:

```{r, echo = TRUE}
table(Greene$language, Greene$location)
```

This table also suggests the reason for the much weaker reverse association between these variables: while almost all French petitions are heard in Montreal, there is a significant split in the English petitions between Toronto and the "other" locations. The key point here is that, while the contingency table provides a more detailed view of what is happening here than the forward and reverse Goodman and Kruskal's $\tau$ measures do, the plot of these measures helps us quickly identify which of the $21$ pairs of the seven variables included in this dataframe are worthy of further scrutiny.

Finally, it is worth noting that, if we look at the **decision** variable, *none* of the other variables in the dataset are strongly associated, in either direction. This suggests that none of the refugee characteristics included in this dataset are strongly predictive of the outcome of their appeal.

## 4. An important special case: $K = N$

The special case $K = N$ arises in two distinct but extremely important circumstances. The first is the case of effective record labels like **Make** in the **Cars93** example discussed in Section 3.1, while the second is the case of continuously-distributed numerical variables discussed in Section 5. The point of the following discussion is to show that if $K = N$, then $\tau(x, y) = 1$ for any other variable $y$.

To see this point, proceed as follows. First, note that if $K = N$, there is a one-to-one association between the record index $k$ and the levels of $x$, so the contingency table matrix $N_{ij}$ may be re-written as:

$$
\begin{equation}
N_{ij} = | \{ i \; | \; y_i \rightarrow j \} | = \left\{ \begin{array}{ll} 1 & \mbox{if $y_i \rightarrow j$}, \\ 0 & \mbox{otherwise,} \end{array} \right.
\end{equation}
$$

which implies:

$$
\begin{equation}
\pi_{ij} = \left\{ \begin{array}{ll} 1/N & \mbox{if $y_i \rightarrow j$}, \\ 0 & \mbox{otherwise.} \end{array} \right.
\end{equation}
$$

From this result, it follows that:

$$
\begin{equation}
\pi_{i+} = \sum_{j=1}^{L} \; \pi_{ij} = 1/N,
\end{equation}
$$

since only the single nonzero term $1/N$ appears in this sum. Thus, we have:

$$
\begin{eqnarray}
E [V(y|x)] & = & 1 - \sum_{i=1}^{K} \sum_{j=1}^{L} \frac{\pi_{ij}^2}{\pi_{i+}} \\
& = & 1 - \sum_{i=1}^{N} \sum_{j = 1}^{L} \; N \pi_{ij}^2 \\
& = & 1 - \sum_{i=1}^{N} \; N (1/N)^2 \\
& = & 1 - \sum_{i=1}^{N} \; (1/N) \\
& = & 0.
\end{eqnarray}
$$

Substituting this result into the defining equation for Goodman and Kruskal's $\tau$, we obtain the final result:

$$
\begin{equation}
K = N \; \Rightarrow \; \tau(x, y) = 1,
\end{equation}
$$

for any variable $y$.

Before leaving this discussion, it is important to emphasize that the condition $K = N$ is *sufficient* for $\tau(x, y) = 1$, but *not necessary*.
This point was illustrated by the fact that the variable **Manufacturer** completely explains the variability in **Origin** in the **Cars93** example discussed in Section 3.1, despite the fact that $K = 32$ for the **Manufacturer** variable while $N = 93$.

## 5. Grouping numeric variables

The basic machinery of Goodman and Kruskal's $\tau$ can be applied to numerical variables, but the results may or may not be useful, depending strongly on circumstances. Specifically, for continuously distributed numerical variables (e.g., Gaussian data), repeated values or "ties" have zero probability, so if $x$ is continuously distributed it follows from the results presented in Section 4 that $\tau(x, y) = 1$ for *any* variable $y$, regardless of its degree of association with $x$. *Thus, for continuously distributed numerical variables, Goodman and Kruskal's tau should not be used to measure association; either the standard product-moment correlation coefficient or the other options available from the __cor__ function should be used instead.*

Also, continuity arguments suggest that numerical variables with "only a few ties," for which $K$ is strictly less than $N$ but not a *lot* less than $N$, tend to give inflated association values under the Goodman and Kruskal tau measure. The **mtcars** dataframe is a case in point: this dataframe has $N = 32$ records and $11$ variables, all numeric, but with between $2$ and $30$ distinct values. Here, the forward associations from the $30$-level variable **qsec** range from $0.87$ for the binary variable **am** to $1.00$ for the binary variable **vs**. Similarly, the forward associations from the $27$-level numerical variable **disp** range from $0.84$ for **qsec** to $1.00$ for the few-level variables **cyl**, **vs**, **am**, and **gear**. The reverse associations are much smaller.

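
The breakdown for continuous data is easy to reproduce by simulation. The sketch below uses a hypothetical base *R* helper, `gk_tau`, that evaluates the definitions from Section 1.2 directly (it is not the package's **GKtau** function), and applies it to a Gaussian variable and an independently generated three-level variable; the sample size, seed, and level names are arbitrary assumptions.

```r
# Hypothetical helper: tau from x to y, straight from the Section 1.2 definitions
gk_tau <- function(x, y) {
  tab <- table(factor(x), factor(y))
  p   <- tab / sum(tab)               # pi_ij
  px  <- rowSums(p)                   # pi_i+
  py  <- colSums(p)                   # pi_+j
  Vy   <- 1 - sum(py^2)               # V(y)
  EVyx <- 1 - sum(p^2 / px)           # E[V(y|x)]
  (Vy - EVyx) / Vy
}

set.seed(1)
N <- 200
x <- rnorm(N)                                     # continuous, so K = N
y <- sample(c("a", "b", "c"), N, replace = TRUE)  # generated independently of x

n_ties <- sum(duplicated(x))   # number of repeated x values
tau_xy <- gk_tau(x, y)         # forward association from x to y
tau_yx <- gk_tau(y, x)         # reverse association from y to x
c(ties = n_ties, forward = tau_xy, reverse = tau_yx)
```

Even though $x$ and $y$ are unrelated by construction, the forward value equals $1$ up to rounding error, while the reverse value is close to zero.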
Conversely, integer variables often have many repeated values, implying that the number of levels $K$ is much smaller than the number $N$ of observations, and in these cases Goodman and Kruskal's tau measure may yield useful association results. Also, it frequently happens that numerical variables are used to encode what are effectively categorical phenomena, either ordered or unordered. As a specific example, the **cyl** variable in the **mtcars** dataframe is numeric but has only three distinct levels (corresponding to 4-, 6-, and 8-cylinder cars): the forward associations from this variable to the other 10 in the dataframe vary from $0.06$ for the $30$-level variable **qsec** to $0.67$ for the $2$-level variable **vs**. The **vs** variable encodes the general engine geometry - either a "V-shaped" (when **vs = 0**) or a "straight" design (when **vs = 1**) - and looking at the contingency table between these variables reveals that the V-shaped design is strongly associated with engine designs having more cylinders:

```{r, echo = TRUE}
table(mtcars$cyl, mtcars$vs)
```

To provide more useful association measures between continuous numerical variables with no ties or few ties and categorical variables with a small to moderate number of levels, the strategy proposed here is to group these numerical variables, creating a related categorical variable with fewer levels. This grouping strategy does entail a loss of information relative to the original numerical variable, but to the extent that the groups are representative, applying Goodman and Kruskal's tau measure to this categorical variable should give a more reasonable measure of the degree of association between the approximate value of the numerical variable and the other categorical - or few-level numerical - variables under consideration.
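
The effect of such grouping can be sketched with base *R* alone. The code below cuts **disp** into five quantile-based groups using **quantile** and **cut**, then compares the forward association to **am** before and after grouping. This is only an approximation of the idea, not the **classInt**-based implementation used by the package, and `gk_tau` is again a hypothetical helper that evaluates the definitions from Section 1.2.

```r
# Hypothetical helper: tau from x to y, straight from the Section 1.2 definitions
gk_tau <- function(x, y) {
  tab <- table(factor(x), factor(y))
  p   <- tab / sum(tab)               # pi_ij
  px  <- rowSums(p)                   # pi_i+
  py  <- colSums(p)                   # pi_+j
  (sum(p^2 / px) - sum(py^2)) / (1 - sum(py^2))
}

x <- mtcars$disp

# Five groups with (roughly) equal counts, based on sample quantiles
breaks  <- quantile(x, probs = seq(0, 1, length.out = 6))
grouped <- cut(x, breaks = breaks, include.lowest = TRUE)

tau_raw     <- gk_tau(x, mtcars$am)        # many-level numeric variable
tau_grouped <- gk_tau(grouped, mtcars$am)  # five-level grouped factor

table(grouped)
c(raw = tau_raw, grouped = tau_grouped)
```

Because the grouped factor has far fewer levels than the raw variable, its forward association can no longer be driven to $1$ simply by near-uniqueness of the values.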
It is also worth noting that, despite the loss of information, this grouping strategy is popular in business applications (e.g., demographic analysis is often done by age group instead of by age itself).

In the **GoodmanKruskal** package, this grouping is accomplished with the function **GroupNumeric**, which has the following parameters:

1. the required parameter __x__, specifying the numerical vector to be grouped;
1. the optional parameter __n__, an integer specifying the number of groups for the resulting categorical variable; if this value is not specified, it is inferred from the __groupNames__ parameter;
1. the optional parameter __groupNames__, a character vector giving the names of the __n__ groups formed in creating the categorical variable returned by the function; if this value is not specified, the value of __n__ must be specified and the default names from the __cut__ function in base _R_ will be used;
1. the optional parameter __orderedFactor__, a logical variable with default value __FALSE__ specifying whether the categorical variable returned should be ordered or not (note that this option has no influence on the computed value of Goodman and Kruskal's tau measure, but it is included as a convenience for those who may wish to use this function for other purposes);
1. the optional parameter __style__, passed to the __classIntervals__ function from the __classInt__ package on which __GroupNumeric__ is based;
1. un-named optional parameters (...) to be passed to the __classIntervals__ function for certain grouping methods.

To illustrate the results obtained with this function and its potential utility, consider the application of Goodman and Kruskal's tau measure to the **mtcars** dataframe. Applying the **GKtauDataframe** function to the unmodified **mtcars** dataframe gives the results summarized in the plot below.
Note that - consistent with the discussion above - the forward associations between any variable with more than $20$ distinct levels and any other variable tend to be quite large, while the reverse associations from variables having few levels are typically quite small. The $27$-level variable **disp** provides a good illustration of this point: the forward associations range from $0.84$ to $1.00$, with the four variables having two or three unique levels (**cyl**, **vs**, **am**, and **gear**) all perfectly predictable. In contrast, the reverse associations for all of these variables are less than $0.10$.

```{r, echo = TRUE, fig.width = 7.5, fig.height = 6, fig.cap = "Goodman-Kruskal tau matrix for the mtcars dataframe."}
GKmat <- GKtauDataframe(mtcars)
plot(GKmat, diagSize = 0.8)
```

To see the effect of grouping on these numerical variables with few ties, the following *R* code uses the **GroupNumeric** function to construct grouped versions of the six variables with $K > 20$. These grouped variables are then used to build the modified dataframe **groupedMtcars**, replacing each of these variables with a factor with $n = 5$ groups, constructed with the default option **style = 'quantile'**:

```{r, echo = TRUE}
# Group each of the six many-level variables into n = 5 quantile-based groups
groupedMpg <- GroupNumeric(mtcars$mpg, n = 5)
groupedDisp <- GroupNumeric(mtcars$disp, n = 5)
groupedHp <- GroupNumeric(mtcars$hp, n = 5)
groupedDrat <- GroupNumeric(mtcars$drat, n = 5)
groupedWt <- GroupNumeric(mtcars$wt, n = 5)
groupedQsec <- GroupNumeric(mtcars$qsec, n = 5)

# Copy mtcars, dropping each original variable and appending its grouped version
groupedMtcars <- mtcars
groupedMtcars$mpg <- NULL
groupedMtcars$groupedMpg <- groupedMpg
groupedMtcars$disp <- NULL
groupedMtcars$groupedDisp <- groupedDisp
groupedMtcars$hp <- NULL
groupedMtcars$groupedHp <- groupedHp
groupedMtcars$drat <- NULL
groupedMtcars$groupedDrat <- groupedDrat
groupedMtcars$wt <- NULL
groupedMtcars$groupedWt <- groupedWt
groupedMtcars$qsec <- NULL
groupedMtcars$groupedQsec <- groupedQsec
```

Applying the **GKtauDataframe** function to this modified **mtcars** dataframe
yields the plot shown below:

```{r, echo = TRUE, fig.width = 7.5, fig.height = 6, fig.cap = "Goodman-Kruskal tau matrix for the groupedMtcars dataframe."}
GKmat2 <- GKtauDataframe(groupedMtcars)
plot(GKmat2, diagSize = 0.8)
```

Comparing this plot with the previous one, we see that grouping the variables with more than $20$ distinct levels greatly reduces most of their associations. For example, the original **qsec** variable has $K = 30$ distinct values in $N = 32$ records and it exhibits very large forward Goodman-Kruskal tau values, ranging from $0.87$ for the two-level **am** variable to $1.00$ for the two-level **vs** variable. Replacing **qsec** with the 5-level variable **groupedQsec** dramatically reduces these values, which range from a minimum of $0.19$ for the 5-level **groupedWt** variable to $0.89$ for the two-level **vs** variable. To better understand why this last value is so large, we can look at the underlying contingency table:

```{r, echo = TRUE}
table(groupedQsec, mtcars$vs)
```

The **qsec** variable is described in the **help** file for the **mtcars** dataframe as the "quarter mile time," so smaller values correspond to faster cars. It is clear from this contingency table that all of the "V-shaped" engine designs correspond to faster cars, with quarter-mile times between $14.5$ and $18.2$ seconds, while all but one of the "straight" engine designs have quarter-mile times greater than $18.2$ seconds. The point of this example is to show that grouping numerical covariates and applying Goodman and Kruskal's tau measure can provide useful insights into the relationships between variables of mixed types, a setting where working directly with the ungrouped numerical variables gives spuriously high association measures.

An important practical question is how many levels to select when grouping a numerical variable, a question to which there appears to be no obvious answer.
One possibility is to take the number of groups $n$ as the square root of the number of data observations, $N$, rounded to the nearest integer. This strategy was adopted in the previous example: the numbers of levels in the original numerical variables ranged from $22$ to $30$, and this rule of thumb led to the choice $n = 5$ used here. Also, the default grouping method - **style = "quantile"** - was used in this example because it probably represents the most familiar grouping strategy for numerical variables. As noted, the **GroupNumeric** function is based on the **classIntervals** function in the **classInt** package, which supports 10 different grouping methods, but takes "quantile" as the default option. The obvious questions - how many groups, and what method do we use in constructing them - appear to be fruitful areas for future research, and a key reason for including the **GroupNumeric** function in the **GoodmanKruskal** package is to facilitate work in this area.

## 6. Summary

This note has described Goodman and Kruskal's tau measure of association between categorical variables and its implementation in the **GoodmanKruskal** *R* package. In contrast to the more popular chi-square and Cramer's V measures, Goodman and Kruskal's tau is *asymmetric*, an unusual characteristic that can be exploited in exploratory data analysis. Specifically, the tau measure belongs to a family of association measures that attempt to quantify the variability in a target variable $y$ that can be explained by variations in a source variable $x$. Since this relationship is not symmetric, Goodman and Kruskal's $\tau$ can be used to identify cases where one variable is highly predictable from another, but the reverse implication is not true.
As a specific and extreme example, applying this measure to the variables **Manufacturer** and **Origin** in the **Cars93** dataframe from the **MASS** package shows that **Origin** - with values "USA" and "non-USA" - is completely predictable from **Manufacturer**, but knowledge of **Origin** has essentially no power to predict **Manufacturer** since each of the two origin classes is represented by many different manufacturers (the $32$ manufacturers are divided between the two classes).

A limitation of Goodman and Kruskal's tau measure is that it is not applicable to numerical variables with few ties (e.g., continuously distributed random variables where the probability of duplicated values is zero), as demonstrated in Section 4. This is not a problem by itself since much better-known correlation measures are available for this case: the Pearson product-moment correlation coefficient, Spearman's rank correlation, and Kendall's tau, all computable as options of the **cor** function in base *R*. Where this failure does become an issue is in the assessment of associations between numerical variables with few ties or no ties and categorical variables with a moderate number of levels. For example, applying Goodman and Kruskal's tau measure between mileage (**mpg** in the **mtcars** dataframe, with $25$ distinct values in $32$ records) and the number of cylinders (**cyl**, a numerical variable with only three levels) suggests near-perfect predictability of cylinders from gas mileage ($\tau(x,y) = 0.90$) but essentially no predictability in the other direction ($\tau(y,x) = 0.08$). This limitation prompted the numerical variable grouping strategy described in Section 5 and embodied in the **GroupNumeric** function included in the **GoodmanKruskal** package.
Replacing **mpg** with the 5-level categorical variable **groupedMpg** created using this function gives association measures that appear more reasonable for these two variables: the forward association remains quite large ($\tau(x, y) = 0.70$), but the reverse association is no longer negligible ($\tau(y, x) = 0.36$). As noted in Section 5, the questions of "how many groups?" and "which of many grouping methods should be used?" appear to be open research questions at present, and one purpose for including the function **GroupNumeric** in the **GoodmanKruskal** package is to encourage research in this area.

More immediately, the function **GKtauDataframe** and its associated plot method can be an extremely useful screening tool for exploratory data analysis. In particular, in cases where we have many categorical variables, plots like those shown in Section 3 can be useful in identifying variables that appear to be related. A more complete understanding of any relationship seen in these plots can be obtained by using the **table** function to construct and carefully examine the contingency table on which Goodman and Kruskal's tau measure is based, but the advantage of plots like those presented in Section 3 is that they allow us to focus our attention on interesting variable pairs. Given that the number of pairs grows quadratically with the number of variables in a dataset, this ability to identify interesting pairs for further analysis can be extremely useful in the increasingly common situation where we have a dataset with many variables.
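
As a rough illustration of this follow-up step, the sketch below groups **mpg** and tabulates it against **cyl** using only base *R*. The **quantile** and **cut** calls are a stand-in for **GroupNumeric**, which relies on **classIntervals**, so the group boundaries here are an assumption and may differ slightly from those the package produces; the names **mpgGroup** and **tbl** are illustrative.

```r
# Base-R stand-in for quantile grouping into n = 5 groups; GroupNumeric itself
# uses classInt::classIntervals, whose breaks may differ slightly.
breaks <- quantile(mtcars$mpg, probs = seq(0, 1, length.out = 6))
mpgGroup <- cut(mtcars$mpg, breaks = breaks, include.lowest = TRUE)

# Contingency table between the grouped mileage and the number of cylinders
tbl <- table(mpgGroup, mtcars$cyl)
tbl

# Row-wise proportions: the distribution of cyl within each mileage group
round(prop.table(tbl, margin = 1), 2)
```

Reading the table by rows shows how well the mileage group predicts the number of cylinders, while reading it by columns shows the reverse, which is the asymmetry that the two tau values summarize.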