Visualizations in R: All in One View

Content from High Level Plot Functions

Last updated on 2024-12-24 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How do you create and customize basic plots in R using base graphics?

Objectives

Understand high-level plotting functions.
Create basic plots like histograms and barplots.

Histogram

A histogram is a plot which groups values from a single numeric variable into intervals, called bins, and displays the frequency (or relative frequency) of values within each bin.

The hist() function creates histograms. The first argument (generically called x) specifies a numeric vector that represents the numeric variable we want to visualize. There are various optional arguments that can be used to control the details of the plot. We will first consider the hist() function with its default settings using the pre-existing data set trees within R.

R

hist(trees$Girth)

With that, we have successfully created our first histogram by simply just specifying the data input. However, there are a few components of the plot to note:

By default, R chooses the number of bins (or, equivalently, the bin width) by splitting the range of the values in x into approximately $log2(n) + 1$ intervals of equal length, where n is the number of values in x. This is known as Sturges’ rule. This is used as a guideline, but R also attempts to put breaks at round numbers. For example, in this case, the bin width is exactly 2, which produces more bins than Sturges’ rule would produce.
breaks: The number of breaks can be (approximately) specified using the optional breaks argument. There are several ways to use breaks, such as specifying the number of bins as a single value or specifying a different rule for computing the bins (either “Freedman-Diaconis” or “Scott”). By default, breaks is set to breaks = “Sturges”.
freqs: The hist() function by default outputs a frequency histogram, where the height of each bar indicates the frequency (i.e., number) of values within the corresponding bin. By setting the freq argument to FALSE, hist() will output a relative frequency histogram, where the area of each bar indicates the relative frequency (i.e., proportion) of values within the corresponding bin.

There are more optional arguments that control how the histogram is visualized and/or constructed. If you are interested in learning more, please refer to the histogram documentations.

In addition, there are components that are not specific to histograms (or the hist() function) but are components that are common to most statistical plots. The arguments to change these components are the same for nearly every high level plotting function in base graphics. The most commonly used arguments are given below.

xlim: The limits (or range) of the x-axis shown on the plot are set by default to span the range of the bins that contain all the values in x.
ylim: The limits (or range) of the y-axis shown on the plot are set by default to range from 0 to the height of the highest bar(s) in the histogram.
xlab: The label on the x-axis (set by default to the name of the input vector).
ylab: The label on the y-axis
main: The title of the histogram
col: The color of the bars in the histogram

Challenge 1: Can you do it?

Generate a histogram with trees$Girth and modify several arguments to customize the plot as follows:

Set the number of breaks (bins) to 10
Display relative frequencies instead of counts
Set the x-axis limits to range from 5 to 25
Use “blue” and “goldenrod” as the colors for the bars
Label the x-axis as “Girth of Black Cherry Trees”
Set the title of the histogram to “Histogram of Girth of Black Cherry Trees”.

Show me the solution

R

hist(
  trees$Girth,
  breaks = 10,
  freq = FALSE,
  xlim = c(5, 25),
  col = c("blue", "goldenrod"),
  xlab = "Girth of Black Cherry Trees",
  main = "Histogram of Girth\nof Black Cherry Trees"
)

Barplots

The most common statistical plot for visualizing categorical data is the bar plot (or bar chart or bar graph), which shows a bar for each observed category. The height of the bar is proportional to the frequency of that category.

The barplot() function creates bar plots. The first argument, called height, specifies the heights of the bars in the bar plot, which correspond to the frequencies (or relative frequencies) for the categorical variable(s) we want to visualize. The height argument can be a numeric vector or matrix. The type of bar plot that barplot() outputs depends on the class of the height object.

Vector Inputs

If the height argument is a numeric vector, the barplot() function will return a (simple) bar plot. We can demonstrate it again using the built-in data set mtcars，which contains various characteristics of 32 different car models. The am variable represents the type of transmission in the cars.

R

# Summarize the 'am' variable
transmission_freqs <- table(mtcars$am)
barplot(transmission_freqs,
        main = "Bar Plot of Transmission Types",
        xlab = "Transmission (0 = Automatic, 1 = Manual)",
        ylab = "Frequency",
        col = c("lightblue", "lightgreen"))

Matrix Inputs

If the height argument is a numeric matrix, the barplot() function will return a stacked (or segmented) bar plot. The matrix input typically represents a two-way (contingency) table for two categorical variables. Each bar of the plot corresponds to a column of height, and the values in the column correspond to the heights of the stacked sub-bars.

R

library(ggplot2) # Load ggplot2 package
data(diamonds) # Load diamonds data
diamonds_table <- with(diamonds, table(cut, color)) # Create two-way table of cut and color
barplot(diamonds_table)

In addition, if we want to create a side-by-side barplot instead of a stacked barplot, we simply set the optional argument beside = TRUE.

The differently shaded sub-bars correspond to the levels of the cut factor. To make the plot more read-able/informative, we can add a legend by setting the argument legend = TRUE. Notice that the same arguments that control the shading and color of histogram bars in hist() also work for barplot(). If you are interested in learning more, please refer to the barplot documentations.

Challenge 2: Play around on your own!

Can you modify the previous plot to display the bars side by side instead of stacked? Add a legend and customize the colors and shading of the bars.Ensure that the legend is included and that the plot is properly labeled. Try to play around with optional arguments so that the barplot can be more informative.

Show me the solution

R

barplot(diamonds_table,
        beside = TRUE,
        legend = TRUE,
        density = 40,
        col = 1:5,
        main = "Side-by-Side Distribution of Diamond Cuts by Color",
        xlab = "Diamond Color",
        ylab = "Frequency")

Notice that histograms and barplots are not the only high-level plot functions in R. There are more functions, like boxplot and scattergraph. Please refer to specific documentations in R regarding those functions.

Key Points

Know how to generate histogramsusing the hist() function
Know how to generate barplots using the barplot() function
Creatively use optional arguments to make high-level plots more informative

Content from Low Level Plot Functions

Last updated on 2024-12-24 | Edit this page

Estimated time: 12 minutes

Overview

Questions

How do you create and customize basic plots in R using base graphics?

Objectives

Understand low-level plotting functions.
Be able to customize existing plots with low-level plotting functions.

There are numerous low level plot functions in base graphics that add components to an existing plot.

Caution: Low level plot functions can only be used if there is already a plot open. If you run a low level plot command when there is not a plot already open, R will give you an error.

Points

The points() function is used to add points to an existing plot. Similar to the basic form of plot(), the syntax is points(x, y, ...), where x and y are numeric vectors that correspond to the coordinates of the points to add. The same optional arguments common to plot() can be used to specify the color, size, and type of the points. To demonstrate, we again use the built-in trees data set:

R

plot(Girth ~ Height, data = trees)
points(c(65, 70, 75), c(12, 17, 20), pch = 4, col = "red", cex = 1.5)

The coordinate pairs can alternatively be specified using a a two-column matrix or data frame, a list with two components called x and y, or a formula y ~ x.

Challenge 1: Can you do it?

Use the points()function to note which trees have an above average volume.

Show me the solution

R

# Find the observations (trees) with an above average (mean) volume
volume_index <- with(trees, Volume > mean(Volume))
# Plot the tree girths against height
plot(Girth ~ Height, data = trees)
# Add a blue + to the observations with an above average volume
points(Girth ~ Height, data = trees[volume_index, ], pch = "+", col = "blue", cex = 1.5)

Lines

The lines() function is used to add connected line segments to an existing plot. The syntax is identical to points(), but the output will connect specified coordinates by straight line segments.

R

plot(Girth ~ Height, data = trees) 
# Self-generated points
coords_mat <- cbind(c(65, 70, 75), c(12, 17, 20))
lines(coords_mat, col = "purple")

Notice: Even though lines() and points() are separate functions, the functionality of both are actually the same. The points() function can be used to add line segments by setting the optional argument type = "l". The lines() function can be used to add points by setting the optional argument type = "p".

Lines constructed in base graphics functions (like plot() or lines()) can be modified using line-specific optional arguments. Commonly used arguments are:

lty: The lty argument controls the line type. Line types can either be specified as an integer (0 is blank, 1 is solid (default), 2 is dashed, 3 is dotted, 4 is dotdash, 5 is longdash, and 6 is twodash) or as one of the character values “blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, or “twodash”, where “blank” uses “invisible lines” (i.e., the lines are not drawn).
lwd: The lwd argument controls the line width. Similar to the cex argument for points, the value corresponds to the amount by which the line width should be scaled relative to the default of lwd = 1. Values above 1 will make the line wider, and values below 1 will make the line thinner.

R

plot(Girth ~ Height, data = trees)
lines(Girth ~ Height, data = trees[volume_index, ], col = "green", lty = 2, lwd = 3)

Notice:The lines() function can also be used to add a smooth density curve over a relative frequency histogram (prob = TRUE). The density() function computes a kernel density estimate of the data, which can be visualized as a smooth curve superimposed over a histogram using the lines() function.

Challenge 2

Can you create a histogram of Girth with relative frequencies and then use the lines() function to add a smooth density curve over the histogram?

Show me the solution

R

hist(trees$Girth, prob = TRUE, main = "Histogram of Tree Girth with Density Curve", xlab = "Girth")
lines(density(trees$Girth), lwd = 2, col = "blue")

Key Points

Know how to add points to an existing plot using the points() function
Know how to add connected line segments to an existing plot using the lines() function
Understand how to use the lines() function to add points by setting the type argument to “p”
Learn how to add a smooth density curve over a histogram