How to generate a scatter plot in R?


Summarizing your data, either numerically or graphically, is an important component of any data analysis. Fortunately, R has excellent graphics capabilities and can be used whether you want to produce plots for initial data exploration, model validation, or highly complex publication-quality figures.

The base R graphics system is the original plotting system that ships with R. When creating plots with base R, we typically use a high-level function (such as plot()) to create the plot first, and then use one or more low-level functions (such as lines() and text()) to add further information to it.

You can pass a wide variety of objects to the plot() function, and R will “magically” present something that makes sense for that particular object. To illustrate this point, we will plot two very simple objects: a vector and a data frame.

x <- c(1:20)^2
plot(x)

Scatter Plot

When a numeric vector is passed to the plot() function, it generates a scatter plot of that vector. The x-axis shows the index (position) of each element in the vector, and the y-axis shows the value of that element.
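To make the role of the index concrete, the sketch below builds the index explicitly and passes it as the x-coordinate. The variable name idx and the axis labels are illustrative choices, not part of the original example.

```r
# Same vector as in the example above
x <- c(1:20)^2

# plot(x) uses the positions 1, 2, ..., length(x) as x-coordinates;
# seq_along() builds those positions explicitly
idx <- seq_along(x)

# Passing the index ourselves draws the same points as plot(x)
plot(idx, x, xlab = "Index", ylab = "x")
```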

df <- data.frame("a"=x,"b"=1/x,"c"=log(x),"d"=sqrt(x))
plot(df)

Scatter Plot Matrix

When a data frame is passed to the plot() function, it generates a matrix of scatter plots in which each column is plotted against every other column of the data frame. The main diagonal of the matrix shows the column names.
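For a data frame with more than two columns, the same kind of figure can be requested directly with the base R function pairs(). The sketch below rebuilds the data frame from the example and calls pairs() on it, and then on a subset of columns; the choice of subset is only for illustration.

```r
# Rebuild the example data frame
x <- c(1:20)^2
df <- data.frame(a = x, b = 1 / x, c = log(x), d = sqrt(x))

# Scatter plot matrix of all four columns
pairs(df)

# Scatter plot matrix restricted to three columns of interest
pairs(df[, c("a", "c", "d")])
```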

The plot() function will generate:

  • a scatter plot, if a numeric variable is supplied as input
  • a bar plot, if one factor is supplied as input
  • a box plot, if one factor and one numeric variable are supplied as input
  • a scatter plot matrix, if a data frame is supplied as input
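The factor cases in the list above can be tried with the built-in mtcars data set. This is a minimal sketch: converting the cyl column to a factor and the axis labels are choices made here for illustration.

```r
# Treat the number of cylinders as a categorical variable
cyl <- factor(mtcars$cyl)

# One factor: bar plot of the number of cars per cylinder count
plot(cyl, xlab = "Cylinders", ylab = "Number of cars")

# One factor and one numeric variable: box plot of mpg per cylinder count
plot(cyl, mtcars$mpg, xlab = "Cylinders", ylab = "Miles Per Gallon")
```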

With this basic introduction to the base R graphics system, let’s look at scatter plots in more detail.

The scatter plot is one of the most basic chart types. It shows points plotted on the Cartesian plane (x-y axes), where each point represents the combination of two variables: one variable is placed on the horizontal axis and the other on the vertical axis. Scatter plots are widely used to check the relationship between two variables.

The function used for scatter plot:

plot(x, y, main, xlab, ylab, xlim, ylim, pch)

where,
x: the data for horizontal axis
y: the data for vertical axis
main: the title of the graph
xlab: the title of x-axis
ylab: the title of y-axis
xlim: the range of values on x axis
ylim: the range of values on y axis
pch: the display symbol
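As a sketch of how these arguments fit together, the call below sets every one of them for the mtcars data set. The axis ranges and the symbol code are illustrative values chosen to cover the data, not recommendations.

```r
# Axis ranges chosen by hand so that all points fall inside them
x_range <- c(10, 35)   # miles per gallon
y_range <- c(0, 350)   # horsepower

plot(x = mtcars$mpg, y = mtcars$hp,
     main = "Miles per Gallon Vs Engine Horsepower",
     xlab = "Miles Per Gallon", ylab = "Engine Horsepower",
     xlim = x_range, ylim = y_range,
     pch = 19)  # 19 is a solid circle
```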

Let’s generate a scatter plot for the mtcars data set. We want to look at the relationship between the engine horsepower (hp) and miles per gallon (mpg) variables. To plot one numeric variable against another, we just need to pass both variables as arguments to the plot() function.

plot(x=mtcars$mpg,y=mtcars$hp)

Scatter Plot for hp Vs mpg variable

Because no axis labels were supplied, R labels the axes with the expressions passed to plot(): mtcars$mpg on the x-axis and mtcars$hp on the y-axis. The axis scales are also chosen automatically.

Looking at the scatter plot, you can quickly see the negative relationship between engine horsepower and miles per gallon. As horsepower increases, miles per gallon tends to decrease, which means that fuel consumption goes up.
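The visual impression can be checked numerically with the base R function cor(), which computes the Pearson correlation coefficient by default. This is only a quick check of the direction and strength of a linear relationship; the variable name r is an illustrative choice.

```r
# Pearson correlation between miles per gallon and horsepower
r <- cor(mtcars$mpg, mtcars$hp)

# A negative value confirms that the two variables move in opposite directions
print(r)
```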

You can also use formula notation with the plot() function. In the formula method, however, you specify the y-axis variable first, then ~, and then the x-axis variable.

plot(mtcars$hp ~ mtcars$mpg)

Scatter Plot for hp Vs mpg variable
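The formula method of plot() also accepts a data argument, which avoids repeating mtcars$ for every variable. A minimal sketch, storing the formula in a variable named f purely for illustration:

```r
# y ~ x: horsepower on the y-axis, miles per gallon on the x-axis
f <- hp ~ mpg

# Column names in the formula are looked up in the data frame given as data
plot(f, data = mtcars)
```

With this form the axes are labelled with the plain column names, mpg and hp.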

Once the basic scatter plot is ready, we can add different layers to it. These layers are used to add title, colors and legends to scatter plot.

We will add x-axis and y-axis labels and give the scatter plot a title.

plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
     ,main="Miles per Gallon Vs Engine Horsepower")

Scatter Plot with X and Y Axis labels and Title

With the argument pch (short form for “plot character”), it is possible to change the symbol that is displayed on the scatter plot. Integer values 0 to 25 specify a symbol as shown in the figure below.

Plot characters
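A reference chart of the symbols can also be drawn in R itself. The sketch below lays the codes 0 to 25 out in two rows and prints each code under its symbol; the grid layout, symbol size, and grey fill are illustrative choices.

```r
codes <- 0:25

# Arrange the 26 symbols in two rows of 13
grid_x <- codes %% 13        # column position: 0 to 12
grid_y <- 1 - codes %/% 13   # row position: 1 for codes 0-12, 0 for codes 13-25

# Draw each symbol; bg sets the fill colour of symbols 21 to 25
plot(grid_x, grid_y, pch = codes, cex = 2, bg = "grey",
     xlim = c(-0.5, 12.5), ylim = c(-0.5, 1.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "pch values 0 to 25")

# Print the numeric code below each symbol
text(grid_x, grid_y, labels = codes, pos = 1, offset = 1)
```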

The color of the symbols can be changed with the col argument.

plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
     ,main="Miles per Gallon Vs Engine Horsepower",
     pch=2,col="red")

Scatter Plot with color and different plot character

A regression line is a straight line that describes how two numeric variables change with respect to each other. It can be used to predict the value of y for a given value of x. Adding a regression line to a scatter plot shows the nature of the relationship between the two variables more clearly.

To draw a regression line, we need two functions:

  • abline(), which draws a straight line through the scatter plot
  • lm(), short for “linear model”, which fits a simple linear model

plot(x=mtcars$mpg,y=mtcars$hp, xlab="Miles Per Gallon",ylab="Engine Horsepower"
     ,main="Miles per Gallon Vs Engine Horsepower",
     pch=2,col="red")

abline(lm(hp ~ mpg, data = mtcars), col = "blue")

Scatter Plot with Regression Line
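Storing the fitted model in a variable makes it possible to inspect the line before drawing it. In the sketch below, the name fit is an illustrative choice; coef() returns the intercept and the slope that abline() uses.

```r
# Fit the simple linear model hp = intercept + slope * mpg
fit <- lm(hp ~ mpg, data = mtcars)

# Intercept and slope of the regression line
print(coef(fit))

# Draw the scatter plot and add the fitted line
plot(hp ~ mpg, data = mtcars,
     xlab = "Miles Per Gallon", ylab = "Engine Horsepower",
     main = "Miles per Gallon Vs Engine Horsepower",
     pch = 2, col = "red")
abline(fit, col = "blue")
```

A negative slope matches the downward trend seen in the scatter plot.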

The base R graphics system is the original plotting system that ships with R. It is built around the generic plot() function, which chooses a visualization based on the type of object passed to it.

A scatter plot uses points to represent values for two different numeric variables. Scatter plots are used to observe relationships between variables. We also looked at various options to customize the scatter plot.

Finally, we used a linear regression line to represent the relationship between two variables.
