Summarizing your data, either numerically or graphically, is an important component of any data analysis. Fortunately, R has excellent graphics capabilities and can be used whether you want to produce plots for initial data exploration, model validation or highly complex publication quality figures.
Base R Graphics
The base R graphics system is the original plotting system that comes when you install R. When creating plots with base R we tend to use high-level functions (like the plot() function) to first create our plot and then use one or more low-level functions (like lines() and text()) to add additional information to these plots.
You can provide a wide variety of objects to the plot() function and R will “magically” present something that makes sense for that particular object. To illustrate this point, we will plot two very simple objects: a vector and a data frame.
x <- c(1:20)^2
plot(x)
When a numeric vector is passed to the plot() function, it generates a scatter plot of that vector. The x-axis shows the index of each element in the vector and the y-axis shows the value of that element.
df <- data.frame("a"=x,"b"=1/x,"c"=log(x),"d"=sqrt(x))
plot(df)
When a data frame is passed to the plot() function, it generates a matrix of scatter plots in which each column is plotted against every other column of the data frame. The panels on the main diagonal of the matrix show the column names.
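The same kind of scatter plot matrix can also be requested explicitly with the pairs() function. The following is a minimal sketch that rebuilds the example data frame from above; the pch value and the title are illustrative choices, not part of the original example.

```r
# Rebuild the example data frame used above
x <- (1:20)^2
df <- data.frame(a = x, b = 1 / x, c = log(x), d = sqrt(x))

# Draw the scatter plot matrix explicitly.
# pch and main are illustrative choices only.
pairs(df, pch = 16, main = "Scatter plot matrix of df")
```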
The plot() function will generate:
- a scatter plot, if a numeric variable is supplied as input
- a bar plot, if a single factor is supplied as input
- a box plot, if one factor and one numeric variable are supplied as input
- a scatter plot matrix, if a data frame is supplied as input
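The factor cases in this list can be tried out with the built-in mtcars data set. This is only a sketch: converting the cyl column to a factor and the axis labels used here are assumptions made for illustration.

```r
# Treat the number of cylinders as a categorical variable (a factor)
cyl <- factor(mtcars$cyl)

# One factor: plot() draws a bar plot of the number of cars per level
plot(cyl, xlab = "Cylinders", ylab = "Number of cars")

# One factor and one numeric variable: plot() draws one box plot per level
plot(cyl, mtcars$mpg, xlab = "Cylinders", ylab = "Miles per gallon")
```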
With this basic introduction to the base R graphics system, let’s delve into scatter plots.
Scatter Plots
The scatter plot is the most basic chart type you can think of. It shows points plotted on the Cartesian plane (x-y axes). Each point represents a combination of two variables: one variable is plotted on the horizontal axis and the other on the vertical axis. Scatter plots are widely used to check the relationship between two variables.
The function used for scatter plots:
plot(x,y,main,xlab,ylab,xlim,ylim,pch)
where,
- x: the data for the horizontal axis
- y: the data for the vertical axis
- main: the title of the graph
- xlab: the title of the x-axis
- ylab: the title of the y-axis
- xlim: the range of values on the x-axis
- ylim: the range of values on the y-axis
- pch: the display symbol
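As a rough illustration of how these arguments fit together, the sketch below sets every one of them in a single call on the mtcars data set. The axis ranges, labels and symbol are assumed values picked by hand so that all points stay visible; they are not prescribed anywhere.

```r
# Axis ranges chosen by hand (assumed values that cover the mtcars data)
x_range <- c(10, 35)   # miles per gallon
y_range <- c(0, 350)   # horsepower

plot(x = mtcars$mpg, y = mtcars$hp,
     main = "Horsepower vs. miles per gallon",
     xlab = "Miles per gallon", ylab = "Horsepower",
     xlim = x_range, ylim = y_range,
     pch = 16)  # pch = 16 draws filled circles
```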
Let’s generate a scatter plot for the mtcars data set. We want to look at the relationship between engine horsepower (hp) and miles per gallon (mpg). To plot one numeric variable against another, we just need to pass both variables as arguments to the plot() function.
plot(x=mtcars$mpg,y=mtcars$hp)
The hp variable is placed on the y-axis and the mpg variable on the x-axis. The axis labels are taken automatically from the expressions passed to plot() (mtcars$mpg and mtcars$hp), and the axis scales are also set automatically.
Looking at the scatter plot, you can quickly see the negative relationship between engine horsepower and miles per gallon. As horsepower increases, fuel economy (miles per gallon) tends to decrease.
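The direction of this relationship can also be checked numerically. The sketch below uses the cor() function, which is not part of the original walkthrough, to compute the Pearson correlation between the two variables.

```r
# Pearson correlation between miles per gallon and horsepower
r <- cor(mtcars$mpg, mtcars$hp)

# A negative value means hp tends to be lower for cars with higher mpg
print(r)
```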
You can also use formula notation with the plot() function. However, in the formula method you need to specify the y-axis variable first, then ~ and then the x-axis variable.
plot(mtcars$hp ~ mtcars$mpg)
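The formula method also accepts a data argument, which avoids repeating the mtcars$ prefix and gives shorter axis labels. A minimal sketch of that variant:

```r
# Formula interface: y variable to the left of ~, x variable to the right.
# The column names are looked up in the data frame given via `data`.
plot(hp ~ mpg, data = mtcars)
```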
Adding layers to scatter plot
Once the basic scatter plot is ready, we can add different layers to it. These layers are used to add a title, colors and legends to the scatter plot.
We will add x-axis and y-axis labels and give the scatter plot a title.
plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
,main="Miles per Gallon Vs Engine Horsepower")
With the argument pch (short form for “plot character”), it is possible to change the symbol that is displayed on the scatter plot. Integer values 0 to 25 specify a symbol as shown in the figure below.
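The symbols are easiest to compare side by side, so the following sketch draws all 26 of them in a small grid with their pch numbers underneath. The grid layout (seven symbols per row), the symbol size and the grey fill are arbitrary choices made for this illustration.

```r
# All integer symbol codes described above
pch_values <- 0:25

# Arrange the symbols in a grid with 7 columns, filling rows from the top
grid_x <- pch_values %% 7
grid_y <- 3 - pch_values %/% 7

# Draw each symbol at its grid position; bg fills symbols 21 to 25
plot(grid_x, grid_y, pch = pch_values, cex = 2, bg = "grey",
     xlim = c(-0.5, 6.5), ylim = c(-0.7, 3.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "pch symbols 0 to 25")

# Label each symbol with its pch number, placed just below it
text(grid_x, grid_y, labels = pch_values, pos = 1, offset = 1)
```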
It is possible to change the color via the col argument.
plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
,main="Miles per Gallon Vs Engine Horsepower",
pch=2,col="red")
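The col argument also accepts a vector of colors, one per point, which makes it possible to color points by group and describe the groups in a legend. The sketch below is one way to do this; grouping by number of cylinders, the three colors and the legend position are all assumptions made for illustration.

```r
# Group the cars by number of cylinders (illustrative choice of grouping)
cyl <- factor(mtcars$cyl)

# One color per factor level (arbitrary colors)
group_cols <- c("red", "darkgreen", "blue")

# Indexing by a factor uses its integer codes, giving one color per point
plot(x = mtcars$mpg, y = mtcars$hp,
     xlab = "Miles Per Gallon", ylab = "Engine Horsepower",
     main = "Miles per Gallon Vs Engine Horsepower",
     pch = 16, col = group_cols[cyl])

# Add a legend that maps each color to its cylinder count
legend("topright", legend = levels(cyl), col = group_cols,
       pch = 16, title = "Cylinders")
```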
Adding Regression line in scatter plot
A regression line is a straight line that describes how two numeric variables change with respect to each other. It can be used to predict the value of y for a given value of x. Adding a regression line to a scatter plot makes the nature of the relationship between the two variables easier to see.
To draw a regression line, we need two functions:
- abline(): draws a straight line through the scatter plot
- lm(): stands for linear model and is used to fit a simple linear model
plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
,main="Miles per Gallon Vs Engine Horsepower",
pch=2,col="red")
abline(lm(hp~mpg,data=mtcars),col='blue')
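To see what the fitted line represents, the model can be stored and inspected before it is drawn. The sketch below fits the same model, prints its intercept and slope, and predicts horsepower for a hypothetical car; the value of 25 miles per gallon is an arbitrary example input.

```r
# Fit the simple linear model hp = intercept + slope * mpg
fit <- lm(hp ~ mpg, data = mtcars)

# Intercept and slope of the regression line drawn by abline()
print(coef(fit))

# Predict horsepower for a hypothetical car with 25 miles per gallon
new_car <- data.frame(mpg = 25)
predicted_hp <- predict(fit, newdata = new_car)
print(predicted_hp)
```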
Summary
The base R graphics system is the original plotting system that comes when you install R. Base R graphics is built on the generic plot() function, which generates a visualization that depends on the nature of the object provided to it.
A scatter plot uses points to represent values for two different numeric variables. Scatter plots are used to observe relationships between variables. We also looked at various options to customize the scatter plot.
Finally, we used a linear regression line to represent the relationship between the two variables.