Data manipulation is the process of changing or transforming data to make it more organized, readable, and useful. It’s used to extract insights, perform calculations, and prepare data for analysis and presentation.

In this fourth article on why should you use R , we will compare the capabilities of R and Excel using various data manipulation tasks. The idea is to show you a glimpse of how the common data manipulation tasks are carried out in R by writing a single line of code as against series of point and click steps in Excel.

If you want to read all previous articles in this series, then here are the links:

Why should I use R: The Excel R Data Visualization comparison

Why Should I use R: The Excel R Data Exploration comparison

R Vs Excel – What is the difference?

Let’s continue with this article. We will work with `gapminder`

data set. Let’s first import the data into R.

```
library(readxl)
# import the excel file in R
gapminder_data <- read_excel("C:\\Blogs\\R and Excel differences\\gapminder_data.xlsx")
```

The common data manipulation tasks that we will apply in R and Excel are as below:

- Rename the columns
- Arrange the GDP Per capita column in ascending order.
- select and create new column
- remove column from data
- filter the observations for year 1997

**Rename the columns in R and Excel**

**Excel**

Renaming column in Excel is completely manual process which involves double clicking on the column heading you want to rename, typing out new column name and pressing enter. Since this is completely manual process, it will be absolutely nightmare if you want to change, say 20 column names out of 500 columns data set.

**R**

For data manipulation in R, we use powerful package called `dplyr`

. Lets load and install the package.

`library(dplyr)`

To rename column in R, there is handy function called `rename()`

. We simply pass this function the name of our data set (gapminder_data), and then rename selected column with new name.

There are other efficient functions available, which can be used to rename the column in variety of different ways. For e.g. you can rename the columns starting/ending with specific prefix/suffix, using column position index in data set and so on.

Let’s rename the `pop`

column to `polulation`

`rename(gapminder_data,popuation=pop)`

In R, renaming column is super easy. There are no point and clicks involved and rename function, because of its intuitive syntax, clearly shows which columns are renamed and what are there new names.

**Arrange the GDP Per capita column in ascending order.**

**Excel**

To change the column to ascending order, we first had to:

- Select
`gdpPercap`

column - Go to Sort and Filter tab under editing group
- Select the option to sort from the largest to the smallest value

**R**

In R, we will use the `arrange()`

function to sort the data in ascending order. We will need to pass the name of the dataset and the variable name to be sorted as arguments.

`arrange(gapminder_data,gdpPercap)`

Again, with Excel due to its point-and-click nature, it is impossible to identify, by looking at a column, how the data was modified. If I were to replicate these steps in two years time I would likely have forgotten where to point and click. With R however, we have our code which clearly shows each step used to manipulate the data. If I were to return to my script in two years time, I would easily be able to replicate the analysis.

**Selecting and Adding new column**

We will select country, continent and population column. We will also create new column called population_million, which will show the country population in millions.

**Excel**

If we want to create a separate data table by selecting a handful of the columns from main data set, we have to copy paste those columns into new tab. Also creating new column requires writing a formula to convert the population into millions.

**R**

In R, we will use `select`

function to select the columns we need for the analysis, and then use `mutate`

function to create new column. We can also save this transformed data set into new variable for later analysis.

```
# select the columns, then create new column and store the data into
# new variable called gapminder_data_new
gapminder_data_new <- gapminder_data %>%
select(country,continent,pop) %>%
mutate(population_million = pop/1e6)
head(gapminder_data_new)
```

This code will be completely reproducible, since after reading the script, user would clearly know which columns were selected for analysis and how the new column was added to the data. If user were to refer to the script two years later, he would have no difficulty in clearly understanding the analysis steps. That is a power of reproducible data analysis work, and it is only possible, since R is, by design a text based language.

**Remove Columns**

**Excel**

In Excel, inserting or deleting a column is a manual process. First, we select the column then right-click at the top of a column and then select the Delete option.

**R**

In R, we use `select`

function to select the column we want to remove and put negative sign – in front of the variable name we want to remove.

`select(gapminder_data,-continent)`

**Filter the observations**

**Excel**

In excel, we need to *point and click* several steps before filtering the data.

- Select the data that you want to filter
- click the filter arrow in the column header
- When you click the filter arrow in a column, all values in that column appear in a list.
- select the value that you want to filter

While applying this manual technique of filtering data that we are interested in, it is very easy to select the wrong data or even accidentally delete data.

**R**

In R, it is one liner code. Use `filter`

function, pass on the name of column to filter and use conditional statements to filter the column data.

`filter(gapminder_data,year == 1997)`

**Summary**

In this post we compared the methods for data manipulation in R and Excel. We used some of the functions from `dplyr`

package to manipulate data. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code. This makes code easy to read and to reproduce.

In Excel, we had to undergo series of point and click steps to carry out common data manipulation tasks. This point and click nature of excel makes the analysis non-reproducible.