In R, factors represent categorical variables. Factors resemble character vectors, but they take only a limited number of distinct values. In this article, we will create factors and look at their distinctive behavior.
What is a Factor?
The factor data type is used to represent categorical data, that is, data that takes a small number of distinct values. Each distinct value is called a level of the factor, and R stores it internally as an integer code. For example:
- The variable blood type is a factor variable with three levels: A, AB and B
- The variable gender is a factor variable with two levels: male and female
Data analysts sometimes confuse the factor type with the character type. Characters are typically used for labels in graphs, column names or row names. Factors should be used when you want to represent a discrete variable in a data frame and analyze it.
How to create a Factor?
Factor objects can be created from character objects or from numeric objects using the function factor( ).
blood_type <- c("A","B","B+","A+","A","A+","B","B")
blood_type
[1] "A" "B" "B+" "A+" "A" "A+" "B" "B"
The object blood_type is a character object, and we need to convert it into a factor.
blood_type <- factor(blood_type)
blood_type
[1] A B B+ A+ A A+ B B
Levels: A A+ B B+
Note that printing a factor shows slightly different information from printing a character vector. The factor output shows no quotes around the values, and the levels are listed in alphabetical order.
Use the levels( ) function to see the levels of a factor. We can also define the order in which the levels should appear.
levels(blood_type) # alphabetical level order
# new level order
factor(blood_type,levels=c("A","B","A+","B+"))
[1] "A" "A+" "B" "B+"
[1] A B B+ A+ A A+ B B
Levels: A B A+ B+
Inner workings of Factor
Note that the result of the levels( ) function is of type character, while the class of the factor variable is factor.
class(blood_type)
typeof(levels(blood_type))
[1] "factor"
[1] "character"
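To make the internal representation more concrete, the following sketch rebuilds the blood_type factor and inspects the integer codes that sit behind the labels. It only uses base R functions (typeof, unclass, as.integer); the code values in the comments assume the alphabetical level order shown earlier.

```r
# Recreate the blood_type factor from the example above
blood_type <- factor(c("A", "B", "B+", "A+", "A", "A+", "B", "B"))

typeof(blood_type)      # storage type of a factor: "integer"
unclass(blood_type)     # the integer codes plus the "levels" attribute
as.integer(blood_type)  # codes only; 1 3 4 2 1 2 3 3 with levels A, A+, B, B+

# Indexing the levels by the codes gives back the original labels
levels(blood_type)[as.integer(blood_type)]
```

In other words, a factor can be thought of as an integer vector that carries a lookup table of labels.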
A factor can be created from a numeric vector as well.
gender <- c(1,2,1,2,2,1)
gender <- factor(gender)
gender
levels(gender)
[1] 1 2 1 2 2 1
Levels: 1 2
[1] "1" "2"
The object gender looks like an integer variable, but it is not. The following arithmetic operation produces a vector of NA values, along with a warning that + is not meaningful for factors. We can use the as.integer( ) function to convert a factor into an integer vector of its level codes.
gender + 3
[1] NA NA NA NA NA NA
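One caveat is worth spelling out: as.integer( ) returns the internal level codes, which match the original numbers only when those numbers happen to be 1, 2, 3 and so on. The sketch below uses a hypothetical scores factor (not part of the example above) to show the difference and two common ways of recovering the original values.

```r
# Hypothetical factor built from numeric values that are not 1, 2, 3, ...
scores <- factor(c(10, 20, 10, 30))

as.integer(scores)                  # internal codes: 1 2 1 3
as.numeric(as.character(scores))    # original values: 10 20 10 30
as.numeric(levels(scores))[scores]  # same values, converting each level only once
```

Going through the character labels first is what restores the original numbers; converting the factor directly would silently return the codes instead.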
It is better to rename the levels, so that 1 represents “Male” and 2 represents “Female”.
levels(gender) <- c("Male", "Female")
gender
[1] Male Female Male Female Female Male
Levels: Male Female
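As an alternative to renaming the levels afterwards, the labels can be supplied when the factor is created. The sketch below assumes the same coding as above (1 for Male, 2 for Female) and uses the levels and labels arguments of factor( ); the name gender_codes is introduced here only for illustration.

```r
# Raw numeric codes, as in the example above
gender_codes <- c(1, 2, 1, 2, 2, 1)

# Map code 1 to "Male" and code 2 to "Female" in a single step
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Male", "Female"))

gender
levels(gender)   # "Male" "Female"
```

Stating the levels explicitly keeps the mapping between codes and labels visible in the code, rather than relying on the default sort order.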
Sometimes the order of the levels matters. In this example, level 1 is assigned to Average, because Average comes first alphabetically.
Income <- c("High","Low","Average","Low","Average","High","Low")
Income <- factor(Income)
Income
[1] High Low Average Low Average High Low
Levels: Average High Low
We can set the order of the levels using the ordered( ) function, which creates an ordered factor.
Income <- ordered(Income,levels=c("Low","Average","High"))
Income
[1] High Low Average Low Average High Low
Levels: Low < Average < High
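One practical consequence of an ordered factor is that comparison operators become meaningful. The following sketch rebuilds the Income factor in one step, using factor( ) with ordered = TRUE as an equivalent of the ordered( ) call above, and then compares values against a level.

```r
# Same data as above, created directly as an ordered factor
Income <- factor(c("High", "Low", "Average", "Low", "Average", "High", "Low"),
                 levels = c("Low", "Average", "High"), ordered = TRUE)

is.ordered(Income)   # TRUE
Income < "High"      # TRUE for every Low and Average entry
max(Income)          # High, the largest level present in the data
sort(Income)         # values sorted by level order, not alphabetically
```

With an unordered factor, the same comparison would return NA values and a warning.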
When you convert an ordered factor into an integer vector, the order of the levels determines the number assigned to each value.
Income.numeric <- as.integer(Income)
Income.numeric
[1] 3 1 2 1 2 3 1
Why are Factors necessary?
Factors are important when including categorical variables in regression models and when plotting data.
In the simple case of regression models, factor variables can be used in the lm( ) and glm( ) functions. For an unordered factor, R by default creates a dummy variable for each level except the first, which serves as the reference group. For example, if a factor encodes income levels as low, medium and high, it might make sense to use the low income level as the reference class so that the other income levels are interpreted in comparison to it.
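The choice of reference class can be sketched with relevel( ) and model.matrix( ), both from the stats package that ships with R. The income vector below is hypothetical, and the sketch assumes R's default treatment contrasts for unordered factors.

```r
# Hypothetical income factor; levels default to alphabetical order
income <- factor(c("high", "low", "medium", "low", "medium", "high"))
levels(income)   # "high" "low" "medium", so "high" would be the reference

# Make "low" the reference level instead
income <- relevel(income, ref = "low")
levels(income)   # "low" "high" "medium"

# Columns that lm() would build: an intercept plus one dummy per non-reference level
colnames(model.matrix(~ income))
```

The coefficients for the high and medium columns would then be read as differences from the low income group.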
In data visualization, factors play a key role, especially where the categories of a chart need to be ordered in an ascending or descending sequence based on some parameter.
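As a rough illustration of ordering categories for a chart, the sketch below sorts the levels of a hypothetical city variable by frequency. Plotting functions generally draw categories in level order, so the commented barplot( ) call would show the bars from most to least frequent.

```r
# Hypothetical survey responses
city <- c("Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi")

# Count each category and sort the counts in descending order
counts <- sort(table(city), decreasing = TRUE)

# Use the sorted category names as the level order
city <- factor(city, levels = names(counts))

levels(city)   # "Pune" "Delhi" "Mumbai"
table(city)    # counts now follow the level order: 3 2 1
# barplot(table(city))   # bars would be drawn in the same order
```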
Factors also help with efficient storage of data. Consider a variable describing gender with the categories Male and Female. In R, there are two ways to store this information: as a series of character strings, or as a factor.
Storing categorical data as a factor can be more efficient than storing the same data as strings, because a factor stores each label only once and keeps a compact integer code for every observation. In current versions of R the saving is smaller than it once was, since character vectors also share identical strings internally.
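The storage difference can be checked with object.size( ) from the utils package. This is only a sketch: the vector names are made up for the example, the exact sizes depend on the R version and platform, and the comparison assumes a typical 64-bit build of R.

```r
set.seed(1)  # make the random sample reproducible

# 100,000 gender values stored as character strings and as a factor
gender_chr <- sample(c("Male", "Female"), 1e5, replace = TRUE)
gender_fct <- factor(gender_chr)

utils::object.size(gender_chr)   # size of the character vector in bytes
utils::object.size(gender_fct)   # size of the factor in bytes

# On a 64-bit build the factor is expected to be the smaller of the two
as.numeric(utils::object.size(gender_fct)) < as.numeric(utils::object.size(gender_chr))
```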
Summary
Factors are used to represent categorical data. Each distinct categorical value becomes a level. R uses the factor class, together with these levels, to distinguish a factor from other data types. By default, factor levels are created in sorted (alphanumeric) order; for example, out of “male” and “female”, “female” comes first alphabetically and so becomes the first level.
The factor levels can be renamed using the levels( ) function. If the order of the factor levels is important, the ordered( ) function is used to arrange the levels in ascending or descending order.
Factors are tricky to work with. If a factor is created from a numeric vector, it belongs to the factor class and therefore does not support arithmetic operations.
It is therefore recommended that, whenever you perform an operation involving a factor, you examine the output and the intermediate steps.