Learn 6 easy functions from base R to spot check your data set

Exterior of Schwartz residence in winter.
Image Source: Bentley Historical Library

When you are handed a new data set, it helps to get a first-hand feel for it before diving deep into the analysis. This is similar to doing warm-up exercises before a high-intensity workout. In this tutorial we will look at six functions that give you that first look at a data set. They are extremely useful when you want to glance through the data quickly before starting your analysis. So let’s get started.

The six functions to spot check the data set

Here is a list of six functions we will look at in this tutorial:

dim( ) shows the dimensions of the data set
head( ) shows the first six rows of the data set (by default)
tail( ) shows the last six rows of the data set (by default)
str( ) compactly displays the internal structure of the data set
summary( ) shows summary statistics for each variable in the data set
View( ) shows the data in a spreadsheet-style viewer

All of the above functions are very basic. In their simplest form, they require only the name of the data set as an argument.

These six functions will help you to quickly spot check the data set. This step will also guide you in charting out the analysis strategy for your data. Let’s start with the first function.

Checking the dimension of the data

The dim() function reports the dimensions of the data set as the number of rows followed by the number of columns. Let’s check the dimensions of the mtcars data set. This data set is built into R and comprises various automobile parameters for 32 different car models.

dim(mtcars)

[1] 32 11

We can see that this data set has 32 rows and 11 columns. This function is useful because it tells us whether it would be okay to print the entire data frame to the console. With this data set, it’s probably okay. If, however, there were 5,000 rows and 50 columns, we’d definitely want to view the data frame in smaller chunks.
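If you need the row and column counts separately, nrow() and ncol() return each one on its own. The sketch below uses them to decide whether to print the whole data frame; the 100-row threshold and the variable names are arbitrary choices made for illustration, not a recommendation from R itself.

```r
# dim() returns an integer vector: c(rows, columns)
dims <- dim(mtcars)
n_rows <- nrow(mtcars)   # same value as dims[1]
n_cols <- ncol(mtcars)   # same value as dims[2]

# Hypothetical rule of thumb: print everything only if the data is small
max_rows_to_print <- 100
if (n_rows <= max_rows_to_print) {
  print(mtcars)
} else {
  print(head(mtcars))
}
```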

Check first few rows of the data

The head() function serves as a quick window into your data set, providing a snapshot of the initial rows. It is incredibly useful for preliminary data analysis because it gives you a sense of the data structure. By default, it prints the first 6 rows of the data.

head(mtcars)

You can pass a second argument to print a different number of rows. For example, to print the first 10 rows, use:

head(mtcars,10)

This displays the first 10 rows of the data set.
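The second argument of head() is called n, and it also accepts a negative value, which means "all rows except the last n". The following sketch shows both forms, plus a preview restricted to a few columns; the variable names and the column selection are only for illustration.

```r
# Name the argument for readability
first_three <- head(mtcars, n = 3)

# A negative n drops rows from the end instead:
# everything except the last 29 rows (32 - 29 = 3 rows remain)
all_but_last_29 <- head(mtcars, n = -29)

# Preview only a few columns of interest
print(head(mtcars[, c("mpg", "cyl", "hp")], n = 3))
```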

Check last few rows of the data

The tail() function is the counterpart of the head() function. By default, it shows the last 6 rows of the data.

tail(mtcars)

Like head(), tail() accepts a second argument, so you can print the last 10 rows of the data:

tail(mtcars,10)
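Because head() and tail() both return data frames, you can stack their results to see the top and bottom of the data in one glance. A minimal sketch, where the name preview is just an illustrative choice:

```r
# Stack the first and last three rows into one small preview
preview <- rbind(head(mtcars, n = 3), tail(mtcars, n = 3))
print(preview)
```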


Get compact display of the data set

The str() function compactly displays the internal structure of the data. It shows not only the dimensions of the data set but also the class of each variable, along with its first few observations.

str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
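str() prints its result for reading. If you want the column classes as a value you can work with, one option is to apply class() to every column. This is a small sketch that assumes each column has a single class, as is the case for mtcars; col_classes is an illustrative name.

```r
# Class of every column, returned as a named character vector
col_classes <- sapply(mtcars, class)
print(col_classes)

# Count how many columns there are of each class
print(table(col_classes))

# str() takes extra arguments; vec.len controls how many values are shown
str(mtcars, vec.len = 2)
```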

Check the summary statistics

The summary() function shows descriptive statistics for each variable in the data set: the minimum, first quartile, median, mean, third quartile and maximum. It is super useful for checking the spread of the data and assessing the possibility of outliers. For example, for the miles per gallon (mpg) variable, one vehicle achieves 33.90 mpg, well above the mean of 20.09 mpg for all the vehicles. The interquartile range, which summary() does not print directly, is the difference between the third quartile and the first quartile.

summary(mtcars)

     mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am        
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
      gear            carb      
 Min.   :3.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000  
 Median :4.000   Median :2.000  
 Mean   :3.688   Mean   :2.812  
 3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :8.000
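To follow up on a suspiciously large or small value, you can compute the interquartile range yourself and apply the common 1.5 * IQR rule of thumb. The sketch below does this for mpg. It is an illustrative check rather than a formal outlier test, and the variable names are chosen for this example.

```r
mpg <- mtcars$mpg

# Quartiles as reported by summary(), and the interquartile range
q1 <- quantile(mpg, 0.25, names = FALSE)
q3 <- quantile(mpg, 0.75, names = FALSE)
iqr <- IQR(mpg)   # equals q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Rows whose mpg falls outside the fences
flagged <- mtcars[mpg < lower_fence | mpg > upper_fence, "mpg", drop = FALSE]
print(flagged)
```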

Display the data in a spreadsheet viewer

The View() function opens a new window that shows the data set in a spreadsheet format. It has horizontal and vertical scroll bars for navigating through the entire data set. In RStudio, you can also perform basic filtering and sorting on the variables. However, you cannot edit the data in this window.

View(mtcars)
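View() needs a graphical session, so it is best suited to interactive work. If the same script might also run non-interactively, one possible approach is to guard the call with interactive(). In this sketch, show_data() is a hypothetical helper name, not a function that ships with R.

```r
# show_data() is a hypothetical helper used for illustration
show_data <- function(df) {
  if (interactive()) {
    utils::View(df)          # opens the spreadsheet-style viewer
  } else {
    print(utils::head(df))   # fallback for scripts: print the first rows
  }
  invisible(interactive())   # TRUE if the viewer was opened
}

show_data(mtcars)
```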

Summary

In this tutorial we looked at six different ways of spot checking a data set. These functions are extremely useful for getting first-hand information about the data. Getting comfortable with them should make it easier for you to work with data frames in a more logical and efficient manner.
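If you find yourself running the same checks on every new data set, you could bundle them into a small wrapper. The function below is only a sketch: spot_check() is a hypothetical name, and View() is left out because it needs an interactive session.

```r
# spot_check() is a hypothetical convenience wrapper, not a base R function
spot_check <- function(df, n = 6) {
  cat("Dimensions (rows, columns):", dim(df), "\n\n")
  cat("First", n, "rows:\n")
  print(head(df, n))
  cat("\nLast", n, "rows:\n")
  print(tail(df, n))
  cat("\nStructure:\n")
  str(df)
  cat("\nSummary statistics:\n")
  print(summary(df))
  invisible(df)   # return the data unchanged so the call can be chained
}

spot_check(mtcars, n = 3)
```

Returning the data frame invisibly keeps the console output limited to the checks themselves while still letting you assign or pipe the result.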
