This is a second article on ongoing series on why you should use R. In the last article we have done a comparison between R and Excel in the context of analytical tools. R and Excel are beneficial in different ways. Excel starts off easier and probably your go to tool for the data analysis. R, on the other hand has a steeper learning curve, but in the long run, the use of R pays off as complexity of analysis increases.
In this article, we will compare the capabilities of R and Excel using various data exploration tasks. The idea is to show you a glimpse of how the common data exploration tasks are carried out in R by writing a single line of code as against series of point and click steps in Excel.
Install R and RStudio
Before we start, I assume you have latest R version installed on your machine. You can download R from CRAN server – the Comprehensive R Archive Network. CRAN is network of servers around the world on which R .exe files are hosted.
You also need Integrated Development Environment (IDE) to work with R. IDE makes using R more easy and efficient. You can download ope source version of IDE from Posit website. Posit is name of company that has developed IDE.
When you open RStudio for the first time you should see following layout.
The area on the left is where you will write R code in scripts, use terminals and run jobs. The right hand side of the IDE is comprised of two sections. The top is the files section where you can see all the files and folders in your current working directory and bottom is environment section that stores a list of defined variables and data sets, view the history, and connect to other database.
Loading the Data set
There are many different file types we can import into R. For the ease of demonstration, we will stick to excel file format. We will work with Gapminder data set. The data is about life expectancy, GDP per capita, and Population by country.
Excel
- Go to File > Open and browse to the location that contains the text file.
- Locate and double-click the excel file that you want to open.
R
Importing excel file in R is super easy. You have to use read_excel
function from readxl
package. What is R package? R packages are extensions to the R programming language. R packages contain code, data, and documentation in a standardized collection format that can be installed by users of R, typically via a centralized software repository such as CRAN.
Consider, R packages as mobile apps. Each app is designed for specific purpose. The app developer designs the mobile app and releases it so that others can use its features. Similarly, R users develop the R code to solve the specific problem and release the code in the form of package.
Before we use any function from a package, we need to install it from CRAN server, load it in current R session and then use its functions.
# install the package
install.packages("readxl")
# load the package
library(readxl)
# import the excel file in R
gapminder_data <- read_excel("C:\\Blogs\\R and Excel differences\\gapminder_data.xlsx")
In the above code, we have imported gapminder.xlsx
file located on “C:\Blogs\R and Excel differences” file path, and stored its content into a variable called gapminder_data
Exploring the Data
We will explore the data.
Excel
Excel has one basic data structure, which is the cell. These Excel cells are extremely flexible as they store data of various types (numeric, logical and characters). In Excel, you can view the data in the form of row and column table. Now let’s imagine you have a data set running down into thousands of rows and hundreds of columns. In this situation, you will have to scroll through vertically and horizontally to obtain an overview of the data. This manually scrolling task is time consuming and messy process.
R
In R, exploring data feels like a breeze. You have several functions at your disposal that can be effectively used to explore data in multiple forms.
We can view subset of the data using head()
and tail()
functions. head
function prints first six rows of a data, while tail
function prints last six rows.
# show first six rows of data
head(gapminder_data)
# show last six rows of data
tail(gapminder_data)
We can get more elaborate view of the data using str()
function. This function shows the number of rows and columns in the data along with the data type of each variable and some of its initial values.
This data set has 1704 rows and 6 columns.
Explore the data using summary statistics
Excel
To generate summary statistics (such as the minimum and maximum values) of our data in Excel, we followed a few steps:
- select the cell next to the numbers you want to sum
- select AutoSum on the Home tab
- click on min, max, average or any other function you want
- press enter
These steps are very easy to follow. However, if you refer to the analysis six months later, it is often difficult to recall how did you arrive to the answer or where had you clicked or which tab had you selected? If you were to discuss this workflow with your colleague, you would also likely to face a version issue, since there could be slight differences in the steps for different version of Excel.
Even if you had used the excel formula to calculate the summary statistics, the formula is often hidden inside the cell, making it impossible to check and validate the solution steps.
All theses hassles suggest, that the analysis is not reproducible in Excel.
R
In R, the approach is quite different. You write R code in the notepad like document, called as scripts. These scripts are written in the form of linear progression of the analysis steps you had taken to get to the answer. Thus every little step in the analysis is documented in the form of script. This makes your work reproducible, since the other person reading your script clearly understands how the analysis has been carried out.
Using summary()
function, we can generate summary statistics of a entire data set.
# generate summary statistics
summary(gapminder_data)
We can also calculate other statistical measures such as standard deviation.
# calculate standard deviation for population
sd(gapminder_data$pop)
[1] 106157897
The $ symbol is used to select the column from the data set.
R clearly shows the code(instructions), data and columns used in the analysis in a ways that Excel can not. Moreover, the code can be used repeatedly on variety of different data sets. If this script was shared with your colleagues they would have complete understanding on how the summary statistics is calculated because of R’s human readable syntax. This makes the analysis reproducible.
Summary
In this article, we compared and contrasted some of the data exploration functions in Excel and R. We started out by installing R and RStudio – an IDE. Then we had brief orientation of Rstudio user interface.
We touched upon the concept of R packages. Using readxl
package we imported data set in R. Finally we generated some summary statistics in R and Excel, and demonstrated R’s advantage in terms of reproducibility.
In the next article, we will continue with the comparison between R and Excel using common Data Wrangling tasks.