Unraveling R data types – The Data Frame – A gentle introduction

The River Loire at Nevers
Image Source: The Cleveland Museum of Art https://www.jstor.org/stable/community.24592825

Data frames are a lot like spreadsheets or database tables. A data frame represents data as a table with a number of rows and columns. It looks like a matrix, but its columns need not be of the same data type (in a matrix, all columns must be of the same data type). In a data frame, each row is an individual observation and each column is a variable recorded for that observation.

Technically, a data frame is a list whose elements are equal-length vectors, and that’s why it permits heterogeneity.

A data frame is a natural way to represent such heterogeneous tabular data. Every element within a column must be of the same type, but different elements within a row may be of different types. That’s why we say that a data frame is a heterogeneous data structure.

In this article, we will create a data frame, subset it and run some useful functions that make working with data frames efficient.

We can use the data.frame( ) function to create a data frame. We supply the data for each column as a vector of the corresponding data type.

persons <- data.frame(Name =c("Siddharth", "Gayatri", "Kunal"),
                      Gender =c("Male", "Female", "Male"),
                      Age =c(24,25,23),
                      Subjects =c("Maths", "Science", "History"))

persons

       Name Gender Age Subjects
1 Siddharth   Male  24    Maths
2   Gayatri Female  25  Science
3     Kunal   Male  23  History

Note that creating a data frame looks just like creating a list. This is because, in essence, a data frame is a list in which each element is a vector that represents one table column, and all elements have the same length.
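The list nature of a data frame can be checked directly. The following is a minimal sketch: the data frame demo_df and its values are made up for illustration, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
# Hypothetical example data (not the article's persons table)
demo_df <- data.frame(name = c("Asha", "Ravi", "Meera"),
                      age = c(21, 22, 23),
                      stringsAsFactors = FALSE)

is.list(demo_df)          # TRUE: a data frame is a list ...
length(demo_df)           # ... whose length is the number of columns
sapply(demo_df, class)    # each column keeps its own type
sapply(demo_df, length)   # every column has the same number of elements

# Columns whose lengths cannot be recycled to a common length are rejected.
# tryCatch() captures the error message instead of stopping the script.
unequal <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
unequal
```

A plain list would accept elements of length 3 and 2 without complaint; the equal-length requirement is what makes the list a table.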

We can coerce a list into a data frame by calling the as.data.frame( ) function.

temp_list <- list(Name =c("Siddharth", "Gayatri", "Kunal"),
                  Gender =c("Male", "Female", "Male"),
                  Age =c(24,25,23),
                  Subjects =c("Maths", "Science", "History"))

temp_list

$Name
[1] "Siddharth" "Gayatri" "Kunal"

$Gender
[1] "Male" "Female" "Male"

$Age
[1] 24 25 23

$Subjects
[1] "Maths"   "Science" "History"


as.data.frame(temp_list) # coerce the list into data frame.

       Name Gender Age Subjects
1 Siddharth   Male  24    Maths
2   Gayatri Female  25  Science
3     Kunal   Male  23  History

We can coerce a matrix into a data frame with the same function. Note, however, that because a matrix holds a single data type, all the columns of the resulting data frame will be of the same data type.

temp_matrix <- matrix(c(1:6), nrow = 2)
temp_matrix

       [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

as.data.frame(temp_matrix)

  V1 V2 V3
1  1  3  5
2  2  4  6

Note that conversion automatically assigns column names to the new data frame.
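Both points about matrix conversion, the shared column type and the automatic names, can be verified with a short sketch. The variable names m and df_from_matrix are illustrative only.

```r
m <- matrix(1:6, nrow = 2)
df_from_matrix <- as.data.frame(m)

names(df_from_matrix)          # automatically assigned: "V1" "V2" "V3"
sapply(df_from_matrix, class)  # every column has the matrix's single type

# If the matrix already has column names, the conversion keeps them
colnames(m) <- c("a", "b", "c")
names(as.data.frame(m))        # "a" "b" "c"
```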

We can rename columns and rows just like we did for a matrix.

col_name <-c("student_name", "sex", "student_age", "major")
row_name <-c("student 1","student 2","student 3")

colnames(persons) <- col_name
rownames(persons) <- row_name

persons

          student_name    sex   student_age   major
student 1 Siddharth     Male          24     Maths
student 2 Gayatri       Female        25    Science
student 3 Kunal         Male          23    History

Since a data frame is a list of vectors or matrices that all have the same length, it generally has two attributes: the names attribute, which labels the variables, and the row.names attribute, which labels the individual observations. Note also that a data frame has its own class, which makes it easy to manipulate and to combine with other data frames.

attributes(persons)

$names
[1] "student_name" "sex" "student_age"  "major"

$class
[1] "data.frame"

$row.names
[1] "student 1" "student 2" "student 3"

Since a data frame is a matrix-like list of column vectors, we can use both list notation and matrix notation to access its elements and subsets. We can use $ to extract one column by its name, or [[ to extract one column by its position or name.
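As a small sketch of $ and [[ in action, the snippet below builds a reduced copy of the table under the hypothetical name demo_persons, so that running it does not overwrite persons.

```r
# Hypothetical two-column copy of the article's data
demo_persons <- data.frame(student_name = c("Siddharth", "Gayatri", "Kunal"),
                           student_age = c(24, 25, 23),
                           stringsAsFactors = FALSE)

ages_by_dollar <- demo_persons$student_age        # column by name with $
names_by_pos   <- demo_persons[[1]]               # column by position with [[
ages_by_string <- demo_persons[["student_age"]]   # [[ also accepts a name

class(ages_by_dollar)                      # "numeric": a plain vector
identical(ages_by_dollar, ages_by_string)  # TRUE: both give the same vector
```

Both operators return the column itself as a vector, not a one-column data frame.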

The sub-setting operator ([) allows us to use a numeric vector to extract columns by position, a character vector to extract columns by name, or a logical vector to extract columns by TRUE and FALSE selection:

persons[c(1,3)] # extract first and third column

         student_name  student_age   
student 1 Siddharth       24      
student 2 Gayatri         25    
student 3 Kunal           23    

# extract first and fourth columns. The three-element logical vector is recycled to TRUE, FALSE, FALSE, TRUE to match the four columns of the data frame.

persons[c(TRUE,FALSE,FALSE)]

          student_name    major
student 1 Siddharth        Maths
student 2 Gayatri          Science
student 3 Kunal            History

persons[c("student_age", "student_name")] # columns are returned in the order requested

           student_age  student_name 
student 1      24       Siddharth
student 2      25       Gayatri
student 3      23       Kunal 
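Recycling of a short logical index is easy to misread, so here is a minimal sketch on a made-up four-column data frame (the name wide_df and its columns a to d are hypothetical). Only the selected column names are printed.

```r
# Hypothetical data frame with four columns
wide_df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Length-3 index on 4 columns: recycled to TRUE, FALSE, FALSE, TRUE
names(wide_df[c(TRUE, FALSE, FALSE)])          # "a" "d"

# Length-2 index: recycled to TRUE, FALSE, TRUE, FALSE
names(wide_df[c(TRUE, FALSE)])                 # "a" "c"

# A full-length index avoids recycling altogether
names(wide_df[c(TRUE, FALSE, FALSE, FALSE)])   # "a"
```

Supplying one logical value per column, as in the last call, is the safest way to select exactly the columns you intend.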

We can subset a data frame as a matrix by specifying row and column position [ row, column ]. The position vector can be numeric, character or logical vector.

# extract all elements of the student_name and student_age columns

persons[,c("student_name","student_age")]

          student_name student_age
student 1 Siddharth       24
student 2 Gayatri         25
student 3 Kunal           23

persons[,c(1,2)] # extract all elements of the first and second columns

          student_name    sex
student 1 Siddharth      Male
student 2 Gayatri        Female
student 3 Kunal          Male

# extract all elements of first, third and fourth columns. Observe how the logical vector is recycled to match the number of columns.

persons[,c(TRUE,FALSE,TRUE)]

          student_name  student_age   major
student 1 Siddharth            24     Maths
student 2 Gayatri              25    Science
student 3 Kunal                23    History

persons[c(1,2),] # extract first and second row elements of all columns

          student_name   sex  student_age   major
student 1 Siddharth     Male      24         Maths
student 2 Gayatri       Female    25         Science

# extract second and third row elements of all column

persons[c(FALSE,TRUE,TRUE),]

          student_name    sex    student_age   major
student 2 Gayatri       Female    25          Science
student 3 Kunal         Male      23          History

persons[c("student 1","student 2"), ] # extract first and second row elements of all columns

          student_name    sex     student_age   major
student 1  Siddharth     Male      24          Maths
student 2  Gayatri       Female    25          Science
 

We can specify both selectors at the same time:

persons[2,1] # extract second row element of first column

[1] "Gayatri"

# extract first and second row elements of first and third column

persons[c(1,2),c(1,3)]

           student_name   student_age
student 1      Siddharth    24
student 2      Gayatri      25

# extract second row element of third and fourth column
 
persons[c(FALSE,TRUE),c(FALSE,FALSE,TRUE,TRUE)]

          student_age   major
student 2  25          Science 

# extract first row element of student_name column

persons["student 1","student_name"]

[1] "Siddharth"

Please note that after you subset a data frame, the result has the same class as the original data frame, except when matrix-style indexing selects a single column or a single element, in which case the result is simplified to a vector.

class(persons[ ,c(1,3)]) # output is data frame

[1] "data.frame"

class(persons[,4]) # output is single column

[1] "character"

class(persons[2,3]) # output is single observation

[1] "numeric"

If you strictly want the output to be a data frame irrespective of edge cases (such as user input selecting only one column or element), you can set the drop=FALSE flag.

persons[,4,drop=FALSE]

            major
student 1   Maths
student 2   Science
student 3   History

class(persons[,4,drop=FALSE])

[1] "data.frame"

class(persons[2,3,drop=FALSE])

[1] "data.frame"

You can set values of the subset of a data frame as:

# change values of first three rows of third column

persons[1:3,3] <-c(41,42,43)

persons

         student_name    sex   student_age   major
student 1 Siddharth      Male     41          Maths
student 2 Gayatri        Female   42          Science
student 3 Kunal          Male     43          History

persons$student_name <-c("Sachin","Aarya","Mandar")

persons

         student_name    sex   student_age   major
student 1  Sachin       Male     41         Maths
student 2  Aarya        Female   42         Science
student 3  Mandar       Male     43         History

There are many useful functions for working with data frames; the most commonly used ones are discussed here. The summary( ) function works with a data frame by generating a table that shows summary statistics of each column. Here are the summary statistics of the iris data set, which ships with R in the datasets package. The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.

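head(iris) # prints first six rows of data set

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa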

summary(iris) # displays statistical summary of each variable in data frame

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

For a numeric column, the summary shows the minimum, the quartiles, the mean and the maximum of the numbers. For the factor column Species, the summary counts the number of rows taking each value.
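The difference between numeric and factor columns can also be seen by summarising single columns of iris. This is only a sketch; the variable names below are illustrative.

```r
# Numeric column: minimum, quartiles, mean and maximum
length_summary <- summary(iris$Sepal.Length)
length_summary

# Factor column: one count per level
species_summary <- summary(iris$Species)
species_summary

# table() gives the same per-level counts
species_counts <- table(iris$Species)
species_counts
```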

If the data frame contains any missing values, NA, then complete.cases( ) function returns a logical vector of length equal to the number of rows, and which contains a TRUE value for those rows that don’t have any NA values and FALSE for those that have at least one such value.

persons_NA <- data.frame(Name = c("Siddharth", "Gayatri", "Kunal"),
                      Gender = c("Male", "Female", "Male"),
                      Age = c(24,NA,NA),
                      Subjects = c("Maths", "Science", NA))
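persons_NA

       Name Gender Age Subjects
1 Siddharth   Male  24    Maths
2   Gayatri Female  NA  Science
3     Kunal   Male  NA     <NA>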

complete.cases(persons_NA)

[1]  TRUE FALSE FALSE

As you can observe, the first row does not contain any missing value, hence TRUE is displayed, while the second and third rows each contain at least one missing value, hence FALSE is displayed.
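A common use of this logical vector is to keep only the complete rows. The sketch below rebuilds the same data under the hypothetical name persons_demo; na.omit( ) and colSums(is.na( )) are base R helpers that the article itself does not cover.

```r
# Same values as persons_NA, under a separate illustrative name
persons_demo <- data.frame(Name = c("Siddharth", "Gayatri", "Kunal"),
                           Gender = c("Male", "Female", "Male"),
                           Age = c(24, NA, NA),
                           Subjects = c("Maths", "Science", NA),
                           stringsAsFactors = FALSE)

# Use the logical vector as a row selector to keep complete rows only
complete_rows <- persons_demo[complete.cases(persons_demo), ]
complete_rows

# na.omit() drops the same incomplete rows in one call
omitted <- na.omit(persons_demo)

# Count the missing values in each column
na_per_column <- colSums(is.na(persons_demo))
na_per_column
```

Here only the first row survives, and the per-column counts show that Age has two missing values and Subjects has one.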

If we want to append new rows or columns to a data frame, we can use the rbind( ) or cbind( ) functions, respectively.

If we want to append a new row to a data frame, in this case a new entry for a student, we can use the rbind( ) function.

rbind(persons,data.frame(student_name="Aarya",sex="Female",student_age = 39, major="statistics"))

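If we want to append a new column to a data frame, in this case the grades received in each subject, we can use the cbind( ) function. The output shown here uses the original values of persons, from before the assignments above.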

cbind(persons,grades=c("A","A+","B"))

          student_name    sex student_age   major grades
student 1    Siddharth   Male          24   Maths      A
student 2      Gayatri Female          25 Science     A+
student 3        Kunal   Male          23 History      B

Note that rbind( ) and cbind( ) do not modify the original data but create a new data frame with given rows or columns appended.
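Because neither function changes its input, the result has to be assigned to a name if we want to keep it. The following is a minimal sketch with a made-up data frame called roster; the names roster and with_grades are illustrative only.

```r
# Hypothetical two-column table
roster <- data.frame(student_name = c("Siddharth", "Gayatri", "Kunal"),
                     major = c("Maths", "Science", "History"),
                     stringsAsFactors = FALSE)

# cbind() returns a new data frame; roster itself is unchanged
with_grades <- cbind(roster, grades = c("A", "A+", "B"))
ncol(roster)        # 2
ncol(with_grades)   # 3

# To keep an appended row, assign the result back to the same name
roster <- rbind(roster,
                data.frame(student_name = "Aarya", major = "Statistics",
                           stringsAsFactors = FALSE))
nrow(roster)        # 4
```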

A data frame represents data as a table with a number of rows and columns. Unlike a matrix, a data frame can contain variables of different data types; data frames are therefore heterogeneous. This heterogeneity is possible because a data frame is fundamentally a list whose elements are equal-length vectors. We create a data frame using the data.frame( ) function. A data frame can also be coerced from a matrix or a list using the as.data.frame( ) function.

There are many different ways to subset a data frame. We can use $ to extract the value of one column by its name, or use [[ to extract the value by column position. We can also subset a data frame as a matrix by specifying row and column position [row, column].

Finally, we looked at some useful functions for working with data frames. The summary( ) function generates summary statistics of the data frame variables. The complete.cases( ) function returns a logical vector indicating which cases are complete, i.e. have no missing values. We will be using data frames extensively in our analysis.

When analyzing data, it’s quite common to encounter categorical values. R provides a good way to represent categorical values using factors. In the next article we will look at factors and their use in analysis.
