Introduction to transformr

What is transformr?

transformr, as you might have guessed, is an R package that helps you easily transform your variables.

So why should you consider using transformr for your data analysis work?

Robustness: All of the functions within transformr have been loaded with argument checks, unit tests, and informative error messages so good luck trying to break them! (But, by all means, do try to break them and then create an issue so we can fix what’s broken!)
Convenience: Could you write functions to do these transformations? Absolutely, but why bother when it’s already been done for you?

This introductory vignette provides a brief description of the functions provided by transformr along with some simple examples. Please refer to the function documentation (e.g. ?trim or help(trim)) for more technical information on each of these functions.

trim()

Trimming handles outliers in your numeric variable.

A common use of trim is “capping” a numeric variable to reduce the effect of high-valued outliers on predictive algorithms. For example, imagine that we have a variable x that looks like this:

x <- c(1, 1, 1, 2, 6)

The last value of 6 is noticeably higher than the rest of the values, so, depending on the context, we might want to do one of the following:

“Round” the outlying value of 6 down to the next highest value of 2
Convert the outlying value of 6 to NA (e.g. if the high value of 6 indicates a data quality issue)

trim allows you to quickly and easily do these types of transformations.

# "Round" to 2
trim(x, hi=2)

## [1] 1 1 1 2 2

# Convert to NA
trim(x, hi=2, replace=NA)

## [1]  1  1  1  2 NA

The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded.

x	“Round” to 2	Convert to NA
1	1	1
1	1	1
1	1	1
2	2	2
6	*2*	*NA*

You can also easily trim via percentiles by adding the argument method="percentile". Check out trim’s documentation (?trim) for some more examples on using trim!
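To make the capping logic concrete, here is a minimal base R sketch of the same idea. It is not transformr's implementation: cap_hi() is a hypothetical helper written for this illustration, and the percentile step simply uses quantile() to pick the threshold, which may differ from how method="percentile" works inside trim.

```r
# Base R sketch only; cap_hi() is a hypothetical helper, not part of transformr.
cap_hi <- function(x, hi, replace = hi) {
  stopifnot(is.numeric(x), is.numeric(hi), length(hi) == 1)
  # Overwrite every non-missing value above the threshold
  x[!is.na(x) & x > hi] <- replace
  x
}

x <- c(1, 1, 1, 2, 6)

# "Round" values above 2 down to 2
capped <- cap_hi(x, hi = 2)

# Convert values above 2 to NA
blanked <- cap_hi(x, hi = 2, replace = NA)

# Percentile-based cap: use the 75th percentile of x as the threshold
hi_75 <- unname(quantile(x, probs = 0.75))
capped_75 <- cap_hi(x, hi = hi_75)
```

For this particular x the 75th percentile happens to be 2, so capped_75 matches the "Round" to 2 result above.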

rescale()

Rescaling, well, rescales your numeric variables.

Common uses of rescaling are to make a variable mimic a standard normal distribution or to make a variable lie between two specific values. For example, imagine that we have a variable x that is approximately normal with a mean of -3 and a standard deviation of 1/2:

set.seed(666)
size <- 1e4
x <- rnorm(size, mean=-3, sd=1/2)

Depending on the context we might want to do one of the following:

Rescale x to have a standard normal distribution
Rescale x to have a minimum value of 0 and a maximum value of 1

rescale allows you to quickly and easily do these types of transformations.

# Standard normal
rescale(x)
# Between 0 and 1
rescale(x, method="minmax")

The graph below shows how these different versions of x compare with each other.

You can also easily rescale to other normal distributions (e.g. mean of 99 and standard deviation of 16) or to be between other minimum and maximum values (e.g. minimum of -132 and maximum of 89). Check out rescale’s documentation (?rescale) for some more examples on using rescale!
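The arithmetic behind these two kinds of rescaling is short enough to sketch in base R. The snippet below is an illustration of the formulas only, not transformr's code, and the variable names (z, unit, z_custom, ranged) are made up for this example. The target values 99, 16, -132 and 89 are the ones mentioned above.

```r
# Base R sketch of the rescaling formulas; not transformr's implementation.
set.seed(666)
x <- rnorm(1e4, mean = -3, sd = 1/2)

# Standardize: subtract the sample mean and divide by the sample sd,
# giving a variable with mean 0 and standard deviation 1
z <- (x - mean(x)) / sd(x)

# Min-max: map the smallest value to 0 and the largest to 1
unit <- (x - min(x)) / (max(x) - min(x))

# Other targets follow by stretching and shifting the two results
z_custom <- z * 16 + 99                      # mean 99, sd 16
ranged   <- unit * (89 - (-132)) + (-132)    # minimum -132, maximum 89
```

Note that standardizing only changes the location and scale of x. The result looks standard normal here because x was drawn from a normal distribution to begin with.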

corral()

Corralling groups together uncommon/uninteresting values of a categorical variable and “levels” the resulting factor variable.

A common use of corral is grouping secondary values into an “Other” category. For example, imagine that we have a variable x that looks like this:

x <- c("Red", "Red", "Red", "Blue", "Blue", "Green", "Orange", "Pink")

Depending on the context we might want to do one of the following:

Corral x to keep the two most common colors (Red and Blue) distinct and group all other values
Corral x to keep Blue and Green distinct and group all other values
Corral x to keep Blue and Green distinct and change other values to NA

corral allows you to quickly and easily do these types of transformations.

# Keep two most common colors distinct
corral(x, groups=3)

## [1] Red   Red   Red   Blue  Blue  Other Other Other
## Levels: Red Blue Other

# Keep blue and green distinct
corral(x, groups=c("Blue", "Green"))

## [1] Other Other Other Blue  Blue  Green Other Other
## Levels: Blue Green Other

# Keep blue and green distinct and change other values to NA
corral(x, groups=c("Blue", "Green"), collect=NA)

## [1] <NA>  <NA>  <NA>  Blue  Blue  Green <NA>  <NA> 
## Levels: Blue Green

The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded.

x	Two most common colors	Blue and Green	Blue and Green and others as NA
Red	Red	*Other*	*NA*
Red	Red	*Other*	*NA*
Red	Red	*Other*	*NA*
Blue	Blue	Blue	Blue
Blue	Blue	Blue	Blue
Green	*Other*	Green	Green
Orange	*Other*	*Other*	*NA*
Pink	*Other*	*Other*	*NA*

You can also easily corral based on other criteria (e.g. level by alphabetical order rather than by size). Check out corral’s documentation (?corral) for some more examples on using corral!
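If you are curious what this kind of grouping involves, the three cases above can be approximated in base R as follows. This is a sketch under stated assumptions rather than corral's actual code: lump_top() is a hypothetical helper, and the "Other" label and level ordering are chosen to mirror the output shown earlier.

```r
# Base R sketch only; lump_top() is a hypothetical helper, not part of transformr.
lump_top <- function(x, n_keep, other = "Other") {
  # Count each value and keep the names of the n_keep most frequent ones
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n_keep, length(counts)))]
  # Everything outside the kept set is collected under one label
  out <- ifelse(x %in% keep, x, other)
  factor(out, levels = c(keep, other))
}

x <- c("Red", "Red", "Red", "Blue", "Blue", "Green", "Orange", "Pink")

# Keep the two most common colors distinct
top2 <- lump_top(x, n_keep = 2)

# Keep Blue and Green distinct and group all other values
keep <- c("Blue", "Green")
named <- factor(ifelse(x %in% keep, x, "Other"), levels = c(keep, "Other"))

# Keep Blue and Green distinct and change other values to NA:
# values that are not listed in `levels` become NA
as_na <- factor(x, levels = keep)
```

Setting the levels explicitly is what "levels" the factor: the kept values come first, in the order given, and the collected group comes last.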

impute()

Imputing replaces missing values with non-missing values.

A common use of impute is substituting non-NA values for NA values before running a statistical or machine learning algorithm. For example, imagine that we have a variable x that looks like this:

x <- c(1, 1, 1, 2, NA, NA)

Depending on the context we might want to do one of the following:

Impute the mean of x
Impute -1
Impute “Missing”

impute allows you to quickly and easily do these types of transformations.

# Impute mean
impute(x, mean)

## [1] 1.00 1.00 1.00 2.00 1.25 1.25

# Impute -1
impute(x, -1)

## [1]  1  1  1  2 -1 -1

# Impute "Missing"
impute(x, "Missing")

## [1] "1"       "1"       "1"       "2"       "Missing" "Missing"

There are also a few impute_* helper functions for more sophisticated imputation.

# Impute mode (i.e. most common value)
impute(x, impute_mode)

## [1] 1 1 1 2 1 1

# Impute from ecdf
impute(x, impute_ecdf)

## [1] 1.00000 1.00000 1.00000 2.00000 1.00000 1.53081

# Impute from resampling
impute(x, impute_sample)

## [1] 1 1 1 2 1 2

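The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded. The ECDF and Sample columns come from random draws, so their values can differ from the output shown above.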

x	Mean	-1	Missing	Mode	ECDF	Sample
1	1	1	1	1	1	1
1	1	1	1	1	1	1
1	1	1	1	1	1	1
2	2	2	2	2	2	2
NA	*1.25*	*-1*	*Missing*	*1*	*1.448*	*1*
NA	*1.25*	*-1*	*Missing*	*1*	*1*	*2*

You can also easily impute based on other criteria (e.g. by writing your own imputation function). Check out impute’s documentation (?impute) for some more examples on using impute!
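As a rough picture of what "a value or a function" imputation looks like, here is a small base R sketch. It is not how impute is implemented, and it makes no claim about the interface impute expects from a custom function: fill_na() and most_common() are hypothetical helpers invented for this example.

```r
# Base R sketch only; fill_na() and most_common() are hypothetical helpers,
# not functions exported by transformr.
fill_na <- function(x, fill) {
  is_missing <- is.na(x)
  # A function is applied to the observed values; anything else is used as-is
  value <- if (is.function(fill)) fill(x[!is_missing]) else fill
  x[is_missing] <- value
  x
}

# Most common value, returned with the same type as the input
most_common <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

x <- c(1, 1, 1, 2, NA, NA)

by_mean   <- fill_na(x, mean)         # summary function
by_const  <- fill_na(x, -1)           # constant
by_label  <- fill_na(x, "Missing")    # coerces the vector to character
by_median <- fill_na(x, median)       # any other summary function works too
by_mode   <- fill_na(x, most_common)  # a hand-written rule
```

Passing either a constant or a function through a single argument keeps the call sites short, at the cost of one is.function() check inside the helper.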

Derek Damron

2016-07-09
