transformr, as you might have guessed, is an R package that helps you easily transform your variables.
So why should you consider using transformr for your data analysis work?
This introductory vignette provides a brief description of the functions provided by transformr along with some simple examples. Please refer to the function documentation (e.g. ?trim or help(trim)) for more technical information on each of these functions.
Trimming handles outliers in your numeric variable.
A common use of trim is “capping” a numeric variable to reduce the effect of high-valued outliers on predictive algorithms. For example, imagine that we have a variable x that looks like this:
x <- c(1, 1, 1, 2, 6)
The last value of 6 is noticeably higher than the rest of the values, so depending on the context we might want to do one of the following:
- “Round” the 6 down to 2
- Convert the 6 to NA (e.g. if the high value of 6 indicates a data quality issue)
trim allows you to quickly and easily do these types of transformations.
# "Round" to 2
trim(x, hi=2)
## [1] 1 1 1 2 2
# Convert to NA
trim(x, hi=2, replace=NA)
## [1] 1 1 1 2 NA
The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded.
x | “Round” to 2 | Convert to NA |
---|---|---|
1 | 1 | 1 |
1 | 1 | 1 |
1 | 1 | 1 |
2 | 2 | 2 |
6 | **2** | **NA** |
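To see what capping amounts to, here is a minimal base R sketch of the two transformations above. It is not transformr's implementation. It only reproduces the same results for this x using pmin() and replace(), and the cap of 2 is hard-coded as an assumption.

```r
x <- c(1, 1, 1, 2, 6)
cap <- 2  # upper limit, chosen by hand for this example (an assumption)

# "Round" values above the cap down to the cap
capped <- pmin(x, cap)

# Convert values above the cap to NA instead
blanked <- replace(x, x > cap, NA)

print(capped)
print(blanked)
```

The first result keeps every observation but pulls the outlier in, while the second drops the suspicious value and leaves the decision about it to a later step such as imputation.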
You can also easily trim via percentiles by adding the argument method="percentile". Check out trim’s documentation (?trim) for some more examples on using trim!
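One way to picture percentile-based trimming is to compute the cap from the data with quantile() and then cap as before. The base R sketch below is an assumption about the general idea only: the 75th percentile is an arbitrary choice, and trim with method="percentile" may compute or apply its cutoffs differently.

```r
x <- c(1, 1, 1, 2, 6)

# Derive the cap from the data instead of choosing it by hand.
# The 0.75 probability is an arbitrary choice for illustration.
cap <- quantile(x, probs = 0.75, names = FALSE)

# Cap every value above that percentile
capped <- pmin(x, cap)

print(cap)
print(capped)
```

Deriving the cap from a percentile means the same code can be reused on variables with very different scales.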
Rescaling, well, rescales your numeric variables.
Common uses of rescaling are to make a variable mimic a standard normal distribution or to make a variable lie between two specific values. For example, imagine that we have a variable x that is approximately normal with a mean of -3 and a standard deviation of 1/2:
set.seed(666)
size <- 1e4
x <- rnorm(size, mean=-3, sd=1/2)
Depending on the context we might want to do one of the following:
- Rescale x to have a standard normal distribution
- Rescale x to have a minimum value of 0 and a maximum value of 1
rescale allows you to quickly and easily do these types of transformations.
# Standard normal
rescale(x)
# Between 0 and 1
rescale(x, method="minmax")
The graph below shows how these different versions of x compare with each other.
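The arithmetic behind these two kinds of rescaling is short enough to write out in base R. The sketch below uses the standard z-score and min-max formulas. It is an illustration of the technique, not a copy of how rescale is implemented.

```r
set.seed(666)
x <- rnorm(1e4, mean = -3, sd = 1/2)

# Standardize: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# Min-max: shift and scale so the values span the interval [0, 1]
mm <- (x - min(x)) / (max(x) - min(x))

print(c(mean = mean(z), sd = sd(z)))
print(range(mm))
```

Standardizing gives the variable a mean of 0 and a standard deviation of 1. The result only resembles a standard normal distribution when x is roughly normal to begin with, as it is here.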
You can also easily rescale to other normal distributions (e.g. mean of 99 and standard deviation of 16) or to lie between other minimum and maximum values (e.g. minimum of -132 and maximum of 89). Check out rescale’s documentation (?rescale) for some more examples on using rescale!
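Rescaling to other targets is the same arithmetic with one extra step: standardize or min-max first, then stretch and shift to the target. The base R sketch below uses the example targets from the paragraph above. The variable names are made up for illustration and say nothing about rescale's own arguments.

```r
set.seed(666)
x <- rnorm(1e4, mean = -3, sd = 1/2)

# Target normal distribution: mean 99, standard deviation 16
z <- (x - mean(x)) / sd(x)
y_normal <- z * 16 + 99

# Target range: minimum -132, maximum 89
lo <- -132
hi <- 89
unit <- (x - min(x)) / (max(x) - min(x))
y_range <- unit * (hi - lo) + lo

print(c(mean = mean(y_normal), sd = sd(y_normal)))
print(range(y_range))
```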
Corralling groups together uncommon/uninteresting values of a categorical variable and “levels” the resulting factor variable.
A common use of corral is grouping secondary values into an “Other” category. For example, imagine that we have a variable x that looks like this:
x <- c("Red", "Red", "Red", "Blue", "Blue", "Green", "Orange", "Pink")
Depending on the context we might want to do one of the following:
- Corral x to keep the two most common colors (Red and Blue) distinct and group all other values
- Corral x to keep Blue and Green distinct and group all other values
- Corral x to keep Blue and Green distinct and change other values to NA
corral allows you to quickly and easily do these types of transformations.
# Keep two most common colors distinct
corral(x, groups=3)
## [1] Red Red Red Blue Blue Other Other Other
## Levels: Red Blue Other
# Keep blue and green distinct
corral(x, groups=c("Blue", "Green"))
## [1] Other Other Other Blue Blue Green Other Other
## Levels: Blue Green Other
# Keep blue and green distinct and change other values to NA
corral(x, groups=c("Blue", "Green"), collect=NA)
## [1] <NA> <NA> <NA> Blue Blue Green <NA> <NA>
## Levels: Blue Green
The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded.
x | Two most common colors | Blue and Green | Blue and Green and others as NA |
---|---|---|---|
Red | Red | **Other** | **NA** |
Red | Red | **Other** | **NA** |
Red | Red | **Other** | **NA** |
Blue | Blue | Blue | Blue |
Blue | Blue | Blue | Blue |
Green | **Other** | Green | Green |
Orange | **Other** | **Other** | **NA** |
Pink | **Other** | **Other** | **NA** |
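For comparison, the same groupings can be sketched in base R with table(), ifelse() and factor(). This is a rough equivalent under the assumption that "most common" means highest frequency. It is not how corral works internally, and the variable names are made up for illustration.

```r
x <- c("Red", "Red", "Red", "Blue", "Blue", "Green", "Orange", "Pink")

# Keep the two most frequent values and group the rest as "Other"
counts <- sort(table(x), decreasing = TRUE)
top_two <- names(counts)[1:2]
by_size <- factor(ifelse(x %in% top_two, x, "Other"),
                  levels = c(top_two, "Other"))

# Keep an explicit set of values and group the rest as "Other"
keep <- c("Blue", "Green")
by_name <- factor(ifelse(x %in% keep, x, "Other"),
                  levels = c(keep, "Other"))

# Keep an explicit set of values and turn the rest into NA
by_name_na <- factor(ifelse(x %in% keep, x, NA), levels = keep)

print(by_size)
print(by_name)
print(by_name_na)
```

Setting the levels explicitly controls their order, which matters later for plots and for models that treat the first level as the reference category.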
You can also easily corral based on other criteria (e.g. level by alphabetical order rather than by size). Check out corral’s documentation (?corral) for some more examples on using corral!
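The difference between the level orderings is easy to see in a small base R sketch. The grouping here is done by hand with ifelse(), and only the levels argument of factor() changes. How corral exposes this choice is not shown and is left to its documentation.

```r
x <- c("Red", "Red", "Red", "Blue", "Blue", "Green", "Orange", "Pink")
keep <- c("Red", "Blue")  # listed from most to least common
grouped <- ifelse(x %in% keep, x, "Other")

# Levels in the order given by keep, with the catch-all last
by_given <- factor(grouped, levels = c(keep, "Other"))

# Levels in alphabetical order, still with the catch-all last
by_alpha <- factor(grouped, levels = c(sort(keep), "Other"))

print(levels(by_given))
print(levels(by_alpha))
```

The values are identical in both versions. Only the order of the levels differs.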
Imputing replaces missing values with non-missing values.
A common use of impute is substituting non-NA values for NA values before running a statistical or machine learning algorithm. For example, imagine that we have a variable x that looks like this:
x <- c(1, 1, 1, 2, NA, NA)
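Depending on the context we might want to do one of the following:
- Replace the missing values of x with the mean of the non-missing values
- Replace the missing values of x with a fixed number such as -1
- Replace the missing values of x with a label such as "Missing"
impute allows you to quickly and easily do these types of transformations.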
# Impute mean
impute(x, mean)
## [1] 1.00 1.00 1.00 2.00 1.25 1.25
# Impute -1
impute(x, -1)
## [1] 1 1 1 2 -1 -1
# Impute "Missing"
impute(x, "Missing")
## [1] "1" "1" "1" "2" "Missing" "Missing"
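The pattern behind these calls can be sketched in a few lines of base R. The helper below, fill_missing, is a hypothetical name made up for this illustration and is not part of transformr. It assumes that a function argument should be applied to the non-missing values and that anything else is used as the replacement directly.

```r
# Hypothetical helper (not part of transformr): replace NA values in x
# with a fixed value, or with the result of a function of the observed values
fill_missing <- function(x, with) {
  value <- if (is.function(with)) with(x[!is.na(x)]) else with
  x[is.na(x)] <- value
  x
}

x <- c(1, 1, 1, 2, NA, NA)

filled_mean  <- fill_missing(x, mean)       # mean of the observed values
filled_const <- fill_missing(x, -1)         # fixed number
filled_label <- fill_missing(x, "Missing")  # label; the vector becomes character

print(filled_mean)
print(filled_const)
print(filled_label)
```

Imputing a character label into a numeric vector coerces the whole vector to character, which is why the last result above is printed with quotes.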
There are also a few impute_* helper functions for more sophisticated imputation.
# Impute mode (i.e. most common value)
impute(x, impute_mode)
## [1] 1 1 1 2 1 1
# Impute from ecdf
impute(x, impute_ecdf)
## [1] 1.00000 1.00000 1.00000 2.00000 1.00000 1.53081
# Impute from resampling
impute(x, impute_sample)
## [1] 1 1 1 2 1 2
The table below shows how these different versions of x compare with each other, where the values that have changed from x are bolded.
x | Mean | -1 | Missing | Mode | ECDF | Sample |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 | 2 | 2 | 2 |
NA | **1.25** | **-1** | **Missing** | **1** | **1.448** | **1** |
NA | **1.25** | **-1** | **Missing** | **1** | **1** | **2** |
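To give a feel for what the more sophisticated strategies do, here is a base R sketch of mode, resampling and ECDF-style imputation. These are guesses at the general techniques and not the code behind impute_mode, impute_sample or impute_ecdf. In particular, drawing from quantile() at uniform random probabilities is only one plausible reading of "impute from ecdf". The seed is arbitrary, and the random columns will differ from run to run without it.

```r
set.seed(1)  # arbitrary seed so the random draws are repeatable
x <- c(1, 1, 1, 2, NA, NA)
observed <- x[!is.na(x)]
n_missing <- sum(is.na(x))

# Mode: the most frequent observed value (the first one wins on ties)
values <- unique(observed)
mode_value <- values[which.max(tabulate(match(observed, values)))]
x_mode <- replace(x, is.na(x), mode_value)

# Resampling: draw replacements from the observed values, with replacement.
# Indexing via sample.int() avoids sample()'s special case for length-1 input.
picks <- sample.int(length(observed), n_missing, replace = TRUE)
x_sample <- replace(x, is.na(x), observed[picks])

# ECDF-style: evaluate the empirical quantile function at random probabilities
draws <- quantile(observed, probs = runif(n_missing), names = FALSE)
x_ecdf <- replace(x, is.na(x), draws)

print(x_mode)
print(x_sample)
print(x_ecdf)
```

Mode imputation always returns the same values, whereas the two random strategies preserve more of the spread of x at the cost of run-to-run variation.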
You can also easily impute based on other criteria (e.g. by writing your own imputation function). Check out impute’s documentation (?impute) for some more examples on using impute!