What R is and what it is not

R, as a programming language, has been evolving and developing over the last 20 years. Its goal is clear: to make comprehensive statistical computing, data exploration, and visualization easy and flexible. In this post, we will look at some of the advantages of R, its inheritance from the S language, and its limitations.

The inspiration of R – the S language

R was inspired by the S statistical language developed by John Chambers at AT&T. R is an independent, open-source, and free implementation and extension of the S language, developed by an international team of statisticians.
One key limitation of the S language was that it was only available as a commercial package, S-PLUS. In 1991, Ross Ihaka and Robert Gentleman created R in the Department of Statistics at the University of Auckland. R is the first letter of both Ross and Robert, and it is also the letter before S in the alphabet. The general S philosophy set the stage for the design of the R language itself. The S language had its roots in data analysis and did not come from a traditional programming-language background. Its inventors were focused on making data analysis easier for themselves and for others.

“We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

John Chambers (creator of the S programming language and core member of the R Foundation)

They needed to build a language that would be suitable for interactive data analysis (more command-line based) as well as for writing longer programs (more like a traditional programming language). Ross Ihaka and Robert Gentleman's experience developing R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics:

“Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics.”

Journal of Computational and Graphical Statistics, 5(3):299–314, 1996

R is a high-quality statistical computing system

R is comparable, and often superior, to commercial products when it comes to data analysis, statistical computing, and graphics. Because of its sophisticated computational routines, researchers in statistics and machine learning use R for their research work and often publish R packages to accompany their publications. This gives the public immediate access to the latest techniques and implementations. As a result, many state-of-the-art techniques in machine learning and statistics appear in R before they are added to other software packages.

R is free

Unlike S, which was only available in a commercial package, R is free software. The copyright for the primary source code for R is held by the R Foundation, and the code is published under the GNU General Public License (GPL). According to the Free Software Foundation, with free software (free as in freedom) you are granted the following four freedoms:

  • Freedom 0: Run the program for any purpose
  • Freedom 1: Study how the program works, and adapt it to your needs
  • Freedom 2: Redistribute copies so you can help your neighbor
  • Freedom 3: Improve the program and release your improvements to the public

These freedoms have allowed R to develop a strong, prolific community that includes world-class statisticians and programmers as well as many volunteers, who help improve and extend the language.

R is a flexible programming language

R has a strong foundation for functional programming, which is well suited to many of the challenges of data analysis, such as automating processes that make use of complex systems, creating complex data visualizations, and interfacing seamlessly with different databases. R allows users to write powerful, concise, and descriptive code.
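
To make the functional style a little more concrete, here is a minimal sketch. The data and the helper names (`returns`, `scale_by`, `to_percent`) are made up for illustration; only base R functions are used.

```r
# Hypothetical data: three months of returns for three assets.
returns <- list(
  a = c(0.02, -0.01, 0.03),
  b = c(0.01,  0.00, -0.02),
  c = c(0.05,  0.04, -0.03)
)

# Functions are ordinary values, so `mean` can be passed to sapply()
# and applied to every element of the list.
avg <- sapply(returns, mean)

# A function that returns a function (a closure that remembers `k`).
scale_by <- function(k) function(x) x * k
to_percent <- scale_by(100)
avg_percent <- to_percent(avg)

# Filter() keeps the elements for which an anonymous function is TRUE.
positive_avg <- Filter(function(m) m > 0, avg)

# Reduce() folds a list with a binary function; here it adds the
# three return vectors element by element.
total <- Reduce(`+`, returns)

print(avg_percent)
print(names(positive_avg))
print(total)
```

None of these steps needs an explicit loop, which is a large part of why R code for data analysis can stay short.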

R as an ecosystem

The majority of R users are not professional developers but data analysts and statisticians. These users may not write the best-quality code, but they may contribute cutting-edge tools to the R ecosystem, and everyone else has free access to these tools without having to reinvent the wheel.
For example, let's say an econometrician writes an extension package that includes a new method to detect a category of time series patterns; it may attract several users who find it interesting and useful. Some professional users may improve the original code to make it faster and more general-purpose. A financial analyst may use this package and incorporate its methods into a trading strategy to find risk patterns in a portfolio. This is how the ecosystem works.

R Packages

The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. This is known as base R. R packages extend the functionality of R by providing additional functions, data, and documentation. They are written by a worldwide community of R users and can be downloaded for free from the CRAN website. There are more than 10,000 packages hosted on CRAN.

“A good analogy for R packages is they are like apps you can download onto a mobile phone: So base R is like a new mobile phone: while it has a certain amount of features when you use it for the first time, it doesn’t have everything. R packages are like the apps you can download onto your phone from Apple’s App Store or Android’s Google Play.”

Page 8, Statistical Inference via Data Science – A ModernDive into R and the Tidyverse, 2020
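
In practice, "downloading an app" comes down to two base R calls, `install.packages()` and `library()`. The sketch below wraps them in a small helper; the name `use_package` is hypothetical, not part of R, and the example deliberately loads `stats`, which ships with R, so that nothing is actually downloaded.

```r
# Hypothetical helper: attach a package, installing it from CRAN first
# only if it is not already available locally.
use_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is part of every R installation, so this call only attaches it.
use_package("stats")

# Number of packages currently installed on this machine.
n_installed <- nrow(installed.packages())
print(n_installed)
```

Calling `use_package("data.table")` on a machine without that package would trigger a download, assuming a CRAN mirror is configured and reachable.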

What is R not good for?

R is essentially based on 40-year-old technology, going back to the original S system developed at Bell Labs. When R was born, large-scale data analysis and high-performance computing were rare. Computer hardware was expensive and the internet was just getting started.

Fast-forward to the present: hardware costs a fraction of what it used to, computing power is available online for pennies, and, with the proliferation of social media, everyone is interested in collecting and analyzing data at large scale. This surge in data analysis has brought two of R's fundamental limitations to the forefront: it is single-threaded and memory-bound. Both can drastically slow it down. Because R is single-threaded, by default it uses only one core for a given process. R is memory-bound because it stores all of its objects in physical memory, which makes it more memory-intensive than some other statistical packages. However, there have been a number of advancements to deal with this, both in the R core and in packages developed by contributors (see, for example, the data.table package, which was developed to handle large data frames using multi-core routines).
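
Both limitations can be seen from the console. The sketch below is only an illustration under a few assumptions: it uses the `parallel` package (which ships with R but is not discussed above), a made-up task called `square_sum`, and two worker processes, which presumes the machine allows local worker processes to be started.

```r
library(parallel)  # bundled with R, no installation needed

# Memory-bound: a vector lives entirely in RAM.
# One million doubles take roughly 8 MB.
x <- numeric(1e6)
print(object.size(x), units = "MB")

# Hypothetical CPU-bound task: sum of squares from 1 to n.
square_sum <- function(n) sum(as.numeric(seq_len(n))^2)
inputs <- c(1e5, 2e5, 3e5, 4e5)

# Single-threaded: lapply() handles the inputs one after another on one core.
serial <- lapply(inputs, square_sum)

# Opt-in parallelism: start two worker processes and split the inputs
# between them. This has to be requested explicitly.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, inputs, square_sum)
stopCluster(cl)

# Both approaches should give the same answers.
print(all.equal(serial, parallel_result))
```

For a task this small, the cost of starting the workers will likely outweigh any gain; the point is only that using more than one core is something you ask for, not something R does by default.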

Another drawback of R stems from its user base. Most of its users do not think of themselves as programmers and are more concerned with results than with process (which is not necessarily a bad thing). This means that much of the R code you can find online is written without regard for elegance, speed, or readability. The result is code that is patchy and not rigorously tested.

Summary

R is a powerful programming language and environment for statistical computing, data exploration, analysis, and visualization. It is free, open source, and has a strong, rapidly growing community of users. Like every programming language, R has its limitations. By default it is single-threaded and keeps its data in memory, which can make it slow on large workloads; however, many R packages are designed to work around these limits with more efficient memory management and multi-core processing.
