When you are handed a new data set, it helps to get a first-hand look at it before diving deep into analysis. This is similar to doing warm-up exercises before a high-intensity workout. In this tutorial we will look at six functions that give you that first look at a data set. They are extremely useful when you want to glance through the data quickly before starting your analysis. So let’s get started.
The six functions to spot check the data set
Here is a list of six functions we will look at in this tutorial:
- dim( ) shows the dimensions of the data set
- head( ) shows the first six rows of the data set
- tail( ) shows the last six rows of the data set
- str( ) compactly displays the internal structure of the data set
- summary( ) shows the summary statistics of the data
- View( ) shows the data in a spreadsheet-style viewer
All of these functions are very basic. Each requires only the name of the data set as an argument.
These six functions will help you to quickly spot check the data set. This step will also guide you in charting out the analysis strategy for your data. Let’s start with the first function.
Checking the dimension of the data
The dim() function checks the dimensions of the data set, that is, its number of rows and columns. Let’s check the dimensions of the mtcars data set. This is a built-in data set in R that records various automobile parameters for 32 different car models.
dim(mtcars)
[1] 32 11
We can see that this data set has 32 rows and 11 columns. This function is useful because it tells us whether it would be okay to print the entire data frame to the console. With this data set, it’s probably okay. If, however, there were 5,000 rows and 50 columns, we’d definitely want to view the data frame in smaller chunks.
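As a rough sketch of that decision, the row count reported by dim() can be used to choose between printing everything and printing only a preview. The helper name print_or_preview and the 100-row cut-off (max_rows) are illustrative assumptions, not part of base R.

```r
# dim() returns c(rows, columns); nrow() and ncol() return each part on its own
dims <- dim(mtcars)
n_rows <- dims[1]
n_cols <- dims[2]

# Hypothetical helper: print small data frames in full,
# otherwise print only the first rows
print_or_preview <- function(df, max_rows = 100) {
  if (nrow(df) <= max_rows) {
    print(df)
  } else {
    print(head(df))
  }
  invisible(dim(df))
}

print_or_preview(mtcars)
```

Because mtcars has only 32 rows, the call above prints the whole data frame. A larger data frame would fall through to the head() preview instead.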
Check first few rows of the data
The head() function is a quick window into your data set, providing a snapshot of the initial rows. It is useful for preliminary data analysis because it gives you a sense of the data structure. By default it prints the first 6 rows of the data.
head(mtcars)
The function can be modified to print more rows. If you want to print, say, the first 10 rows, you can use:
head(mtcars,10)
This displays the first 10 rows of the data set.
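A few variations on the same idea, offered as a sketch rather than a recommended workflow: the second argument of head() is named n, a negative n keeps everything except the last rows, and head() can be combined with column selection to preview a wide data frame. The variable names below are only illustrative.

```r
# Name the argument for readability: first 3 rows
first_three <- head(mtcars, n = 3)

# A negative n drops rows from the end: all rows except the last 27
all_but_last_27 <- head(mtcars, n = -27)

# Select a few columns first to keep the preview narrow
narrow_preview <- head(mtcars[, c("mpg", "cyl", "hp")], n = 5)

print(first_three)
print(narrow_preview)
```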
Check last few rows of the data
The tail() function works as the opposite of head(): it shows the last 6 rows of the data.
tail(mtcars)
As with head(), you can pass a second argument to tail() to print the last 10 rows of the data.
tail(mtcars,10)
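The n argument of tail() mirrors the one in head(). As a small sketch (variable names are illustrative), a negative n here drops rows from the start instead of the end:

```r
# Last 3 rows of the data set
last_three <- tail(mtcars, n = 3)

# A negative n drops rows from the start: all rows except the first 27
all_but_first_27 <- tail(mtcars, n = -27)

print(last_three)
print(rownames(all_but_first_27))
```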
Get compact display of the data set
The str() function compactly displays the internal structure of the data. It shows not only the dimensions of the data set but also the class of each variable along with its first few observations.
str(mtcars)
'data.frame':	32 obs. of  11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
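If only the variable classes are needed, they can also be pulled out programmatically. The following is a minimal sketch using sapply() and class(); the variable name col_classes is an illustrative choice. It also shows the vec.len argument of str(), which controls how many leading values are displayed per variable.

```r
# Class of every column, returned as a named character vector
col_classes <- sapply(mtcars, class)
print(col_classes)

# How many columns there are of each class
print(table(col_classes))

# Show fewer leading values per variable to shorten the str() output
str(mtcars, vec.len = 2)
```

For mtcars every column is numeric, which matches the num labels in the str() output above.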
Check the summary statistics
The summary() function shows descriptive statistics for each variable in the data set: the minimum, first quartile, median, mean, third quartile and maximum. This function is super useful for checking the spread of the data and assessing the possibility of outliers. For example, for the miles per gallon variable, mpg, there is one vehicle with 33.90 mpg, well above the mean of 20.09 across all the vehicles. The interquartile range is the difference between the third quartile and the first quartile. summary() does not print it directly, but it can be read off from the two quartile values.
summary(mtcars)
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am
Min. :1.513 Min. :14.50 Min. :0.0000 Min. :0.0000
1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000
Median :3.325 Median :17.71 Median :0.0000 Median :0.0000
Mean :3.217 Mean :17.85 Mean :0.4375 Mean :0.4062
3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :5.424 Max. :22.90 Max. :1.0000 Max. :1.0000
gear carb
Min. :3.000 Min. :1.000
1st Qu.:3.000 1st Qu.:2.000
Median :4.000 Median :2.000
Mean :3.688 Mean :2.812
3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :8.000
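To make the outlier check concrete, here is a minimal sketch for the mpg variable. It computes the interquartile range with IQR() and applies the common 1.5 × IQR rule of thumb. That rule is an assumption brought in for illustration and is not something summary() applies itself. The variable names are illustrative.

```r
mpg <- mtcars$mpg

# First and third quartiles, as reported by summary()
q <- quantile(mpg, probs = c(0.25, 0.75))

# Interquartile range: third quartile minus first quartile
iqr <- IQR(mpg)

# Rule-of-thumb fences: values outside them are flagged as possible outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

flagged <- rownames(mtcars)[mpg < lower_fence | mpg > upper_fence]

print(iqr)
print(flagged)
```

With these fences the only car flagged is the one with the 33.9 mpg maximum. Whether such a value is a genuine outlier or simply a very efficient car is a judgment call for the analysis that follows.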
Display the data in spread sheet viewer
The View() function pops up a new window that shows the data set in a spreadsheet format. It has horizontal and vertical scroll bars for navigating through the entire data set. In RStudio, you can also perform basic filtering and sorting operations on the variables. However, you cannot edit the data in this window.
View(mtcars)
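View() needs a session that can open a window, so a script run without one (for example through Rscript) may not be able to display it. A minimal sketch of a guard, assuming a hypothetical helper named preview that falls back to head() when the session is not interactive:

```r
# Hypothetical helper: open the viewer in an interactive session,
# otherwise print the first n rows to the console
preview <- function(df, n = 6) {
  if (interactive()) {
    View(df)
  } else {
    print(head(df, n))
  }
  invisible(df)
}

preview(mtcars)
```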
Summary
In this tutorial we looked at six different ways of spot checking a data set. These functions are extremely useful for getting first-hand information about the data. Getting comfortable with them should make it easier for you to work with data frames in a more logical and efficient manner.