| Title: | A Set of Tools for Exploratory Data Analysis |
|---|---|
| Description: | Functions to profile a dataset, identify anomalies (special values, outliers, and inliers, defined as data values that are repeated unusually often), and compare data subsets with respect to either numerical or categorical variable distributions. |
| Authors: | Ronald Pearson [aut, cre] |
| Maintainer: | Ronald Pearson <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-22 07:35:57 UTC |
| Source: | https://github.com/cran/ExploreTheData |
Small dataset illustrating various unexpected data formats arising from the accounting data format in an Excel spreadsheet. Variables that appear to be numeric based on the name are represented as character strings with embedded commas, dollar signs, percent signs, and parentheses to indicate negative numbers.
AccountingExample
AccountingExample: a data frame with 8 rows and 6 columns:
Four-digit integer year, with missing values coded NA
Two-character quarter designation, Q1 through Q4
Dollar amount with dollar signs, commas, and decimal points
Dollar amount with dollar signs, commas, and decimal points
Dollar amount with dollar signs, commas, decimal points and parentheses to indicate negative values
Ratio of YOYchange to CurrentTotal, converted to a percentage, with percent sign
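Character columns in this accounting format need cleaning before any numerical analysis. The following is a minimal base R sketch of one way to do that; `ParseAccounting` is a hypothetical helper written for illustration and is not part of the package. It assumes parentheses mark negative values and that percentages should be returned as fractions.

```r
# Hypothetical helper (not part of the package): convert accounting-format
# strings such as "$1,234.50", "($250.00)", or "12.5%" to numeric values.
ParseAccounting <- function(s) {
  s <- trimws(s)
  isNeg <- grepl("^\\(.*\\)$", s)        # parentheses mark negative values
  isPct <- grepl("%", s, fixed = TRUE)   # percent sign marks a percentage
  cleaned <- gsub("[$,()%]", "", s)      # drop the formatting characters
  value <- suppressWarnings(as.numeric(cleaned))
  value <- ifelse(isNeg, -value, value)
  ifelse(isPct, value / 100, value)      # assumption: return fractions
}

ParseAccounting(c("$1,234.50", "($250.00)", "12.5%", NA))
```

Dividing percentages by 100 is a design choice; drop that step if the values should stay on the percentage scale.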
Compute binomial probabilities over categorical variable levels
BinomialCIsByCategorical(DF, binVar, catVar, targetLevel, keepNA = "ifany", keepLevels = NULL, cLevel = 0.95)
DF |
A data frame containing |
binVar |
Binary variable for binomial probabilities |
catVar |
Categorical variable over which binomial probabilities are computed |
targetLevel |
Positive response level for |
keepNA |
Missing data handling option: |
keepLevels |
Optional subset of |
cLevel |
Confidence level for binomial probabilities (default 0.95) |
Data frame with one row for each catVar level in the analysis and these 6 columns:
Level the catVar level
nWith the number of records with catVar equal to Level and binVar equal to targetLevel
nTotal the total number of records with catVar equal to Level
pEst the estimated probability that binVar equals targetLevel
loCI the lower cLevel confidence limit for pEst
upCI the upper cLevel confidence limit for pEst
catVar <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVar <- c(rep(0, 80), rep(1, 20), rep(0, 50), rep(1, 50), rep(0, 20), rep(1, 80))
DF <- data.frame(catVar = catVar, binVar = binVar)
BinomialCIsByCategorical(DF, "binVar", "catVar", 1)
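To make the returned columns concrete, the sketch below builds a similar table in base R for the same example data. It is not the package's implementation: it assumes exact (Clopper-Pearson) intervals from `stats::binom.test`, whereas the interval method used by `BinomialCIsByCategorical` may differ, so the limits need not match.

```r
# Sketch only: per-level binomial estimates with exact intervals.
catVar <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVar <- c(rep(0, 80), rep(1, 20), rep(0, 50), rep(1, 50), rep(0, 20), rep(1, 80))
DF <- data.frame(catVar = catVar, binVar = binVar)

levelCIs <- do.call(rbind, lapply(split(DF, DF$catVar), function(sub) {
  nWith <- sum(sub$binVar == 1)   # records at the target level
  nTotal <- nrow(sub)             # all records at this catVar level
  ci <- stats::binom.test(nWith, nTotal, conf.level = 0.95)$conf.int
  data.frame(Level = sub$catVar[1], nWith = nWith, nTotal = nTotal,
             pEst = nWith / nTotal, loCI = ci[1], upCI = ci[2])
}))
levelCIs
```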
Given two data subsets, defined by indexA and indexB, and a
categorical variable catVar, compute the probability that each
level of catVar appears in each subset and the Agresti-Caffo
confidence interval for the difference in these probabilities,
based on the PropCIs::wald2ci function.
CompareCategoricalLevels(DF, catVar, indexA, indexB = NULL, cLevel = 0.95, includeNA = "ifany")
DF |
A data frame containing |
catVar |
Categorical variable whose distribution is compared between two subsets |
indexA |
Defines records in the first subset |
indexB |
Defines records in the second subset; default NULL uses all records not in the first subset |
cLevel |
Confidence level for estimated probability differences |
includeNA |
Missing data handling option: |
Data frame with one row for each catVar level and these 10 columns:
Level the catVar level
xA the number of times Level appears in the first subset
nA the total records in the first subset
xB the number of times Level appears in the second subset
nB the total records in the second subset
pA the estimated probability that Level appears in the first subset
pB the estimated probability that Level appears in the second subset
loCI the lower confidence limit on the difference pA - pB
upCI the upper confidence limit on the difference pA - pB
signif a logical indicator of whether pA - pB is significantly different from zero
catVar <- c(rep("a", 100), rep("b", 100), rep("c", 100))
auxVar <- c(rep("Set1", 30), rep("Set2", 70), rep("Set1", 50), rep("Set2", 50), rep("Set1", 90), rep("Set2", 10))
DF <- data.frame(catVar = catVar, auxVar = auxVar)
indexA <- which(DF$auxVar == "Set1")
CompareCategoricalLevels(DF, "catVar", indexA)
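The Agresti-Caffo interval can be written out by hand: add one success and one failure to each subset, then apply the usual Wald formula to the adjusted proportions. The helper `AgrestiCaffoCI` below is a hypothetical base R sketch of that calculation, not the package's code, which relies on the PropCIs package. The counts are taken from the example above, where Set1 holds 170 records and Set2 holds 130.

```r
# Hypothetical helper: Agresti-Caffo interval for pA - pB.
AgrestiCaffoCI <- function(xA, nA, xB, nB, cLevel = 0.95) {
  pA <- (xA + 1) / (nA + 2)   # adjusted proportion, first subset
  pB <- (xB + 1) / (nB + 2)   # adjusted proportion, second subset
  se <- sqrt(pA * (1 - pA) / (nA + 2) + pB * (1 - pB) / (nB + 2))
  z <- stats::qnorm(1 - (1 - cLevel) / 2)
  c(loCI = pA - pB - z * se, upCI = pA - pB + z * se)
}

ciA <- AgrestiCaffoCI(30, 170, 70, 130)   # level "a": 30 of 170 vs. 70 of 130
ciB <- AgrestiCaffoCI(50, 170, 50, 130)   # level "b": 50 of 170 vs. 50 of 130
ciA
ciB
```

An interval that excludes zero corresponds to a significant difference in the sense of the signif column.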
Sets up and calls WelchRankTest to compare the distributions of a set of numerical variables between two record subsets. If the set of numerical variables contains a single element, this function effectively reduces to WelchRankTest.
CompareNumericSets(DF, IndexA, numVars, IndexB = NULL, cLevel = 0.95)
DF |
data frame containing all variables in |
IndexA |
record index defining the first record subset to be compared |
numVars |
vector of numerical variable names from |
IndexB |
record index defining the second record subset to be compared (default NULL means the second set contains all records not included in the first) |
cLevel |
confidence level for the Welch rank test (default = 0.95) |
data frame with one row for each element of numVars and columns
containing the numVars element name and all columns from WelchRankTest
for that variable
x <- seq(-1, 1, length = 200)
a <- rep(c("a", "b"), 100)
offset <- rep(c(0, 0.2), 100)
xMod <- x + offset
DF <- data.frame(numVar = x, numVar2 = xMod, setVar = a)
indexA <- which(DF$setVar == "a")
CompareNumericSets(DF, indexA, c("numVar", "numVar2"))
Compute upper and lower outlier limits by one of three detection rules: the 3-sigma edit rule, the Hampel identifier, or the boxplot rule
ComputeOutlierLimits(x, method, t = NULL)
x |
numerical vector in which outliers are to be detected |
method |
single character specifying the outlier rule: T (three-sigma edit rule), H (Hampel identifier), or B (boxplot rule) |
t |
threshold parameter (default NULL, gives 3 for T and H rules, 1.5 for B rule) |
named numerical vector with these 4 elements:
nRec the number of elements in x
nonMiss the number of non-missing elements in x
loLim the lower outlier threshold for x elements
upLim the upper outlier threshold for x elements
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
ComputeOutlierLimits(x, "T")
ComputeOutlierLimits(x, "H")
ComputeOutlierLimits(x, "B")
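The three rules differ only in their choice of center and scale. The sketch below shows the textbook form of each rule in base R; `OutlierLimitsSketch` is a hypothetical name, and details such as the MAD scaling constant or the quantile type are assumptions that may not match the package.

```r
# Hypothetical sketch of the three outlier rules (textbook definitions).
OutlierLimitsSketch <- function(x, method, t = NULL) {
  xOK <- x[!is.na(x)]
  if (is.null(t)) t <- if (method == "B") 1.5 else 3
  lims <- switch(method,
    # three-sigma edit rule: mean +/- t standard deviations
    T = mean(xOK) + c(-1, 1) * t * stats::sd(xOK),
    # Hampel identifier: median +/- t scaled MADs
    H = stats::median(xOK) + c(-1, 1) * t * stats::mad(xOK),
    # boxplot rule: quartiles extended by t interquartile ranges
    B = stats::quantile(xOK, c(0.25, 0.75), names = FALSE) +
      c(-1, 1) * t * stats::IQR(xOK),
    stop("method must be 'T', 'H', or 'B'"))
  c(nRec = length(x), nonMiss = length(xOK), loLim = lims[1], upLim = lims[2])
}

x <- seq(-1, 1, length = 100)
x[1:10] <- 10
limits <- rbind(T = OutlierLimitsSketch(x, "T"),
                H = OutlierLimitsSketch(x, "H"),
                B = OutlierLimitsSketch(x, "B"))
limits
```

With ten identical outliers at 10, the mean and standard deviation are inflated enough that the three-sigma upper limit exceeds 10, while the two rules based on order statistics still place it well below 10.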
Returns an index to elements of a numerical vector whose frequency is unusually large relative to most elements, applying the three-sigma edit rule to counts of individual values. Inliers often represent data values that are incorrect but consistent with the overall data distribution, as in the case of numerically-coded disguised missing data
FindInliers(x, t = 3)
x |
numerical vector in which inliers are to be detected |
t |
threshold parameter for detecting outlying counts (default value 3) |
index to elements of x that occur unusually often, if any
x <- seq(-1, 1, length = 100)
x[45:54] <- 0
FindInliers(x)
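The idea of applying the three-sigma edit rule to counts can be sketched in a few lines of base R. `FindInliersSketch` is a hypothetical illustration; it assumes only the upper limit matters and that counts are taken over exact distinct values, which may differ from the package's handling.

```r
# Hypothetical sketch: flag values whose repeat count is an upper
# three-sigma outlier among the counts of all distinct values.
FindInliersSketch <- function(x, t = 3) {
  vals <- unique(x[!is.na(x)])
  pos <- match(x, vals)                          # distinct-value position per element
  counts <- tabulate(pos, nbins = length(vals))  # occurrences of each distinct value
  if (length(counts) < 2) return(integer(0))     # sd undefined for a single count
  upLim <- mean(counts) + t * stats::sd(counts)
  which(counts[pos] > upLim)
}

x <- seq(-1, 1, length = 100)
x[45:54] <- 0
FindInliersSketch(x)
```

Here 90 distinct values occur once and the value 0 occurs ten times, so only the ten zeros exceed the count threshold.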
Returns an index into outlying points, if any, identified by one of three outlier detection rules: the three-sigma edit rule, the Hampel identifier, or the boxplot rule
FindOutliers(x, method, t = NULL)
x |
numerical vector in which outliers are to be detected |
method |
single character specifying the outlier rule: T (three-sigma edit rule), H (Hampel identifier), or B (boxplot rule) |
t |
threshold parameter (default NULL, gives 3 for T and H rules, 1.5 for B rule) |
index into elements of x identified as outliers
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
Tindex <- FindOutliers(x, "T")
x[Tindex] # Example where the three-sigma rule fails
Hindex <- FindOutliers(x, "H")
x[Hindex]
Bindex <- FindOutliers(x, "B")
x[Bindex]
Small dataset illustrating standard missing values (NA and NaN) and other values sometimes used to represent missing data (e.g., blanks, spaces, and zeros)
FirstAnomalyDataFrame
FirstAnomalyDataFrame: a data frame with 5 rows and 6 columns:
numerical variable with positive, zero, negative and missing (NA) values
numerical variable with positive, zero, and missing (NA) values
the ratio of NumVar1 to NumVar2
categorical variable with missing data represented as NA
categorical variable with missing data represented with blanks or spaces
categorical variable with missing data represented with multiple spaces
Plot method for the S3 object class BinomCIframe generated by the
BinomialCIsByCategorical() function
## S3 method for class 'BinomCIframe'
plot(x, ..., CIrange = NULL, addRef = TRUE)
x |
an S3 object of class BinomCIframe |
... |
optional named parameters to be passed to |
CIrange |
two-element vector giving the minimum and maximum y-axis
values to plot (default NULL uses minimum lower confidence limit and
maximum upper confidence limit from |
addRef |
logical: add a reference line at the average probability of a positive response? (default = TRUE) |
None: this method generates a plot from x
catVar <- c(rep("A", 100), rep("B", 100), rep("C", 100))
binVar <- c(rep(0, 80), rep(1, 20), rep(0, 50), rep(1, 50), rep(0, 20), rep(1, 80))
DF <- data.frame(catVar = catVar, binVar = binVar)
CIframe <- BinomialCIsByCategorical(DF, "binVar", "catVar", 1)
plot(CIframe)
Plot method for S3 objects of class CatDiffs generated by the
CompareCategoricalLevels() function, creating a horizontal barplot
of categorical variable level frequencies that differ significantly
between two data subsets
## S3 method for class 'CatDiffs'
plot(x, ..., labelA, labelB, nMax = 20, levelFrac = 0.5, xLims = NULL)
x |
an S3 object of class CatDiffs |
... |
optional named parameters to be passed to |
labelA |
plot label identifying the first data subset |
labelB |
plot label identifying the second data subset |
nMax |
maximum number of levels to include in the barplot (default = 20) |
levelFrac |
relative position of the level labels on the barplot (default = 0.5) |
xLims |
two-element vector of x-axis limits for the barplot (default sets the range from 0 to 1.2 times the length of the longest bar on the plot) |
None: this method generates a plot from x
catVar <- c(rep("a", 100), rep("b", 100), rep("c", 100))
auxVar <- c(rep("Set1", 30), rep("Set2", 70), rep("Set1", 50), rep("Set2", 50), rep("Set1", 90), rep("Set2", 10))
DF <- data.frame(catVar = catVar, auxVar = auxVar)
indexA <- which(DF$auxVar == "Set1")
CatDiffObj <- CompareCategoricalLevels(DF, "catVar", indexA)
plot(CatDiffObj, labelA = "Set1", labelB = "Set2")
Given the data frame DF, create a new data frame with one row for
each column of DF that characterizes that column in terms of the
number and fraction of missing values, the most frequent value and
its frequency and other characteristics like the Shannon homogeneity
measure computed by the ShannonHomogeneity() function.
ProfileDataFrame(DF, dgts = 3, charMax = 20)
DF |
data frame to be characterized |
dgts |
digits retained for numerical characterizations like fractions (default = 3) |
charMax |
maximum number of characters retained in representing the most frequent value for a variable (default = 20) |
data frame with one row for each column of DF and these columns:
Variable the name of the column from DF being characterized
Type the class of Variable (e.g., numeric, integer, character, etc.)
nMiss the number of missing (NA) or blank Variable records
fracMiss the fraction of total records represented by nMiss
nLevels the number of distinct values Variable exhibits
topValue the most frequently occurring Variable value, truncated to charMax characters
topChars the actual number of characters required to represent topValue
topFreq the number of times topValue occurs
topFrac the fraction of total records represented by topFreq
Homog the Shannon homogeneity measure for Variable
ProfileDataFrame(ChickWeight)
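For readers who want to see how such a per-column profile can be assembled, here is a reduced base R sketch covering a subset of the columns listed above. `ProfileSketch` is a hypothetical function; it omits rounding, truncation of the top value, and the homogeneity measure, and its treatment of blanks is an assumption.

```r
# Hypothetical sketch: one profile row per column of a data frame.
ProfileSketch <- function(DF) {
  rows <- lapply(names(DF), function(v) {
    col <- DF[[v]]
    isMiss <- is.na(col) | as.character(col) == ""   # NA or zero-length string
    tbl <- sort(table(col), decreasing = TRUE)       # value frequencies, largest first
    data.frame(Variable = v,
               Type = class(col)[1],
               nMiss = sum(isMiss),
               fracMiss = mean(isMiss),
               nLevels = sum(tbl > 0),               # distinct observed values
               topValue = names(tbl)[1],
               topFreq = as.integer(tbl[1]),
               topFrac = as.integer(tbl[1]) / length(col))
  })
  do.call(rbind, rows)
}

profile <- ProfileSketch(ChickWeight)
profile
```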
Computes the Shannon homogeneity (normalized Shannon entropy) for a vector, typically categorical but the procedure also works with numerical vectors. Returns a value in the range from 0 (for a highly inhomogeneous vector, concentrated entirely on one of L > 1 levels) to 1 (for a completely homogeneous vector). By convention, vectors of length 0 or 1 return homogeneity values of 1.
ShannonHomogeneity(x, dgts = 3)
x |
the vector to be characterized |
dgts |
number of digits in the return value (default = 3) |
a numerical homogeneity measure between 0 and 1
x <- rep(c("a", "b", "c", "d", "e"), 200)
y <- c(rep("a", 497), rep("b", 497), rep("c", 2), rep("d", 2), rep("e", 2))
z <- c(rep("a", 996), "b", "c", "d", "e")
ShannonHomogeneity(x)
ShannonHomogeneity(y)
ShannonHomogeneity(z)
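Normalized Shannon entropy divides the entropy of the observed level frequencies by its maximum possible value, log(L), where L is the number of levels. The sketch below is a hypothetical base R version, `ShannonSketch`, under two assumptions that may not match the package: L counts only the levels actually observed, and a vector longer than one element with a single observed level returns 0.

```r
# Hypothetical sketch: normalized Shannon entropy of a vector.
ShannonSketch <- function(x, dgts = 3) {
  if (length(x) <= 1) return(1)          # convention stated in the description
  counts <- as.numeric(table(x))
  counts <- counts[counts > 0]           # ignore unused factor levels
  L <- length(counts)
  if (L <= 1) return(0)                  # assumption: one observed level gives 0
  p <- counts / sum(counts)              # observed level probabilities
  round(-sum(p * log(p)) / log(L), dgts)
}

x <- rep(c("a", "b", "c", "d", "e"), 200)
y <- c(rep("a", 497), rep("b", 497), rep("c", 2), rep("d", 2), rep("e", 2))
z <- c(rep("a", 996), "b", "c", "d", "e")
hx <- ShannonSketch(x)
hy <- ShannonSketch(y)
hz <- ShannonSketch(z)
c(hx, hy, hz)
```

The uniform vector x reaches the maximum, y sits in between because two levels dominate, and z is close to zero because almost all records share one level.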
Applies the three-sigma edit rule to the frequencies of distinct values
of a numerical vector, finding those that occur unusually often and
identifying them either by record number or an associated identifying
characteristic specified by label. Inliers often represent data
values that are incorrect but consistent with the overall data distribution,
as in the case of numerically-coded disguised missing data
SummarizeInliers(x, label = NULL, labelName = NULL, t = 3)
x |
numerical vector in which inliers are to be detected |
label |
optional identifying tag for inliers (default NULL gives
an index into the elements of |
labelName |
optional name for the |
t |
detection threshold for the three-sigma edit rule applied to record counts (default value 3) |
Data frame with one row for each inlier detected and two columns:
Record (or labelName value) identifying or characterizing each inlier
Value the numerical value that occurs unusually often
Note that this data frame is empty (0 rows) if no inliers are detected
x <- seq(-1, 1, length = 100)
x[45:54] <- 0
SummarizeInliers(x)
Generates a summary of outliers detected by the three-sigma edit rule, the Hampel identifier, and the boxplot rule, including an optional label to identify the outlying points
SummarizeOutliers(x, label = NULL, labelName = NULL, thresh = c(3, 3, 1.5))
x |
numerical vector in which outliers are to be detected |
label |
optional identifying tag for outliers (default NULL gives
an index into the elements of |
labelName |
optional name for the |
thresh |
vector of threshold values for each outlier detection rule (default = c(3, 3, 1.5)) |
Data frame with one row for each outlier detected by any of the three methods and these 5 columns:
Record (or labelName) giving the location or label for each outlier
Value the value detected as an outlier by at least one method
ThreeSigma 1 if the outlier is detected by the three-sigma rule, 0 otherwise
Hampel 1 if the outlier is detected by the Hampel identifier, 0 otherwise
Boxplot 1 if the outlier is detected by the boxplot rule, 0 otherwise
Note that this data frame is empty (0 rows) if no outliers are detected by any method
x <- seq(-1, 1, length = 100)
x[1:10] <- 10
SummarizeOutliers(x)
Generates a summary of counts and fractions of records from the
variables listed in xVars that are missing (using the standard R
designation NA), blank (0 length, common in character data),
spaces (one or more, also common in character data), and zeros
or negative values in numerical data (sometimes indicative of
range errors or disguised missing data)
TabulateSpecialValues(DF, xVars = NULL, subsetIndex = NULL, dgts = 3)
DF |
data frame containing all variables in the |
xVars |
character vector of the names of variables to be examined
(default NULL means characterize all variables in data frame |
subsetIndex |
index into record subset in |
dgts |
number of digits in frequency results (default = 3) |
data frame with one row for each variable in xVars list and
these columns:
Variable an element of the xVars list
nMiss number of records exhibiting the missing value NA (or NaN)
fracMiss fraction of records represented by nMiss
nBlank number of records listing the value blank (0 length character string)
fracBlank fraction of records represented by nBlank
nSpaces number of records consisting only of one or more spaces
fracSpaces fraction of records represented by nSpaces
nZero number of records listing the numerical value zero
fracZero fraction of records represented by nZero
nNeg number of records listing a negative numerical value
fracNeg fraction of records represented by nNeg
FirstAnomalyDataFrame
TabulateSpecialValues(FirstAnomalyDataFrame)
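The special-value categories above can be counted for a single vector with a few base R expressions. `SpecialValueCounts` is a hypothetical sketch rather than the package's code; it assumes blanks and spaces apply only to character or factor data, and zeros and negatives only to numeric data.

```r
# Hypothetical sketch: count special values in one vector.
SpecialValueCounts <- function(x) {
  isChar <- is.character(x) || is.factor(x)
  s <- as.character(x)
  c(nMiss = sum(is.na(x)),                                     # NA or NaN
    nBlank = if (isChar) sum(s == "", na.rm = TRUE) else 0,    # zero-length strings
    nSpaces = if (isChar) sum(grepl("^ +$", s)) else 0,        # one or more spaces only
    nZero = if (is.numeric(x)) sum(x == 0, na.rm = TRUE) else 0,
    nNeg = if (is.numeric(x)) sum(x < 0, na.rm = TRUE) else 0)
}

numCounts <- SpecialValueCounts(c(1.5, 0, -2, NA, NaN))
chrCounts <- SpecialValueCounts(c("a", "", " ", "   ", NA))
numCounts
chrCounts
```

Dividing each count by the vector length gives the corresponding fraction columns.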
Uses the Welch rank-test (a robust alternative to the classical t-test, with better resistance to outliers and asymmetry) to compare the distributions of two subsets of the same numerical variable. The result characterizes the subsets in terms of their median values, and a small p-value (traditionally less than 0.05) implies significant distributional differences between the two subsets.
WelchRankTest(DF, xVar, indexA, indexB = NULL, cLevel = 0.95)
DF |
data frame containing |
xVar |
numerical variable whose subsets are to be compared |
indexA |
record index defining the first subset of |
indexB |
record index defining the second subset of |
cLevel |
confidence level for the test (default = 0.95) |
a named vector with these 5 elements:
nA the number of records in the first xVar subset
nB the number of records in the second xVar subset
medianA the median xVar value in the first subset
medianB the median xVar value in the second subset
pValue the p-value returned by the Welch rank test
x <- seq(-1, 1, length = 200)
a <- rep(c("a", "b"), 100)
DF <- data.frame(numVar = x, setVar = a)
indexA <- which(DF$setVar == "a")
WelchRankTest(DF, "numVar", indexA) # No difference in distribution
offset <- rep(c(0, 0.2), 100)
DF$numVar2 <- x + offset
WelchRankTest(DF, "numVar2", indexA) # Significant difference
xMod <- x
xMod[indexA[1:4]] <- x[indexA[1:4]] + 10
DF$numVar3 <- xMod
WelchRankTest(DF, "numVar3", indexA) # No difference even with outliers
stats::t.test(DF[indexA, "numVar3"], DF[-indexA, "numVar3"]) # Compare t-test
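One common reading of a Welch rank test is Welch's unequal-variance t-test applied to the ranks of the pooled data. The sketch below implements that reading in base R under that assumption; `WelchRankSketch` is a hypothetical function, it takes a plain vector rather than a data frame, and its p-values need not match those of `WelchRankTest`.

```r
# Hypothetical sketch: Welch t-test on pooled ranks.
WelchRankSketch <- function(x, indexA, indexB = NULL, cLevel = 0.95) {
  if (is.null(indexB)) indexB <- setdiff(seq_along(x), indexA)
  r <- rank(x[c(indexA, indexB)])               # pooled ranks, ties averaged
  rA <- r[seq_along(indexA)]
  rB <- r[length(indexA) + seq_along(indexB)]
  tst <- stats::t.test(rA, rB, var.equal = FALSE, conf.level = cLevel)
  c(nA = length(indexA), nB = length(indexB),
    medianA = stats::median(x[indexA]), medianB = stats::median(x[indexB]),
    pValue = tst$p.value)
}

x <- seq(-1, 1, length = 200)
indexA <- seq(1, 199, by = 2)                   # every other record
same <- WelchRankSketch(x, indexA)              # interleaved subsets
shifted <- WelchRankSketch(x + rep(c(0, 0.2), 100), indexA)  # second subset shifted
same
shifted
```

Because ranks cap the influence of any single extreme value, a few large outliers in one subset move the rank means far less than they move the raw means.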