Linear regression: Abalone growth and pH example

Author

Sumaiya Alvi

Published

February 20, 2026

Set up

library(tidyverse) # general use
library(janitor) # cleaning data frames
library(here) # file/folder organization
library(readxl) # reading .xlsx files
library(ggeffects) # generating model predictions
library(gtsummary) # generating summary tables for models

# abalone data from Hamilton et al. 2022
abalone <- read_xlsx(here("data", "Abalone IMTA_growth and pH.xlsx"))

Abalone example

Data from Hamilton et al. 2022. “Integrated multi-trophic aquaculture mitigates the effects of ocean acidification: Seaweeds raise system pH and improve growth of juvenile abalone.” https://doi.org/10.1016/j.aquaculture.2022.738571

a. Questions and hypotheses

Question: How does pH predict abalone growth (measured in change in shell surface area per day, mm2 d-1)?

H0: pH does not predict abalone growth (change in shell surface area per day, mm2 d-1).

HA: pH predicts abalone growth (change in shell surface area per day, mm2 d-1).

b. Cleaning

# creating a new object called abalone_clean from the original dataset
abalone_clean <- abalone |> 
  # cleaning up all column names
  clean_names() |> 
  # selecting columns of interest
  select(mean_p_h, change_in_area_mm2_d_1_25) |> 
  # renaming column names for easier use
  rename(mean_ph = mean_p_h, 
         change_in_area = change_in_area_mm2_d_1_25)

Don’t forget to look at your data! Use View(abalone_clean) in the Console.

c. Exploratory data visualization

# base layer: ggplot call
ggplot(data = abalone_clean, # dataset should be clean
       aes(x = mean_ph, # x-axis should be predictor, y-axis should be response
           y = change_in_area)) +
  # first layer: points to represent every change in shell surface area for each pair of tanks
  geom_point(size = 4,
             stroke = 1,
             fill = "firebrick4",
             shape = 21)

Where does the missing value come from?

Tank B1B, where the pH sensor failed.

Don’t forget to turn your warnings off either in the individual code chunk or in the YAML at the top of the file.

d. Abalone model

Model fitting

abalone_model <- lm(
  change_in_area ~ mean_ph, # formula: change in shell surface area per day as a function of pH
  data = abalone_clean      # data: clean abalone dataset 
)

Diagnostics

par(mfrow = c(2, 2))   # display residual plots in 2x2 grid
plot(abalone_model)    # plot residuals

Personal notes: Residuals seem homoscedastic (consistent across values) based on Residuals vs Fitted and Scale-Location plots. Residuals also seem normally distributed based on QQ plot. No major outliers based on Residuals vs. Leverage plot

Written communication: We evaluated model assumptions visually. For homoscedasticity of residuals we used Residuals vs. Fitted and Scale-Location plots. For normality of residuals we used QQ plots. Based on our diagnostics, there were no major deviations from homoscedasticity or normality of residuals. There were also no outliers that might influence model estimates or predictions.

Model summary

# more information about the model
summary(abalone_model)

Call:
lm(formula = change_in_area ~ mean_ph, data = abalone_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42714 -0.29282  0.08769  0.21258  0.43965 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -18.5010     6.1601  -3.003  0.01488 * 
mean_ph       2.9827     0.7596   3.926  0.00348 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3063 on 9 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6314,    Adjusted R-squared:  0.5905 
F-statistic: 15.42 on 1 and 9 DF,  p-value: 0.003477

What is the slope estimate (with standard error?)

2.98 +/- 0.76 (SE). With each increase in pH, the model predicts an increase of 2.98 +/- 0.76 (SE) in change in shell surface area per day (mm2 d-1).

What is the intercept estimate (with standard error?)

-18.5 +/- 6.16 (SE). When pH = 0, the model predicts that abalone change in shell surface area per day is -18.5 +/- 6.16. (Note to self: this is not necessarily biologically realistic, but is something the model generates as an estimate).

Generating predictions

# creating a new object called abalone_preds
abalone_preds <- ggpredict(
  abalone_model,      # model object
  terms = "mean_ph"   # predictor (in quotation marks)
)

# display the predictions
abalone_preds
# Predicted values of change_in_area

mean_ph | Predicted |     95% CI
--------------------------------
   7.94 |      5.19 | 4.83, 5.54
   7.95 |      5.21 | 4.86, 5.55
   8.06 |      5.53 | 5.30, 5.76
   8.10 |      5.67 | 5.46, 5.88
   8.10 |      5.67 | 5.46, 5.88
   8.14 |      5.77 | 5.55, 5.98
   8.15 |      5.81 | 5.59, 6.04
   8.33 |      6.36 | 5.92, 6.80

Look at the column names using colnames(abalone_preds) in the Console.

What are the column names?

x: predictor variable values (pH)

predicted: model predictions for change in shell surface area per day

std. error: standard error, precision of the prediction

conf.low and conf.high: range for the 95% CI around model predictions

Look at the actual object (by clicking on it in Environment).

Look at the class of the object using class(abalone_preds) in the Console.

What is this type of object?

This object is a ggeffects output object, AND a data frame

View abalone_preds like a data frame.

Why generate model predictions?

This dataset does not include any actual observations of abalone growth at pH = 8.

However, the model can tell us a prediction for what abalone growth could be if pH = 8 based on the existing data.

Figure out what abalone growth the model would predict at pH = 8.

# finding the model prediction at a specific value
ggpredict(
  abalone_model,      # model object
  terms = "mean_ph[8]"   # predictor (in quotation marks) and predictor value in brackets
)
# Predicted values of change_in_area

mean_ph | Predicted |     95% CI
--------------------------------
      8 |      5.36 | 5.08, 5.64

How would you interpret this?

At pH = 8, the model predicts that abalone change in shell surface area per day should be 5.36 mm2 d-1 (95% CI: [5.08, 5.64]).

Visualizing model predictions and data

# base layer: ggplot
# using clean data frame
ggplot(data = abalone_clean,
       aes(x = mean_ph,
           y = change_in_area)) +
  # first layer: points representing abalones
  geom_point(size = 4,
             stroke = 1,
             fill = "firebrick4",
             shape = 21) +
  # second layer: ribbon representing confidence interval
  # using predictions data frame
  geom_ribbon(data = abalone_preds,
              aes(x = x,
                  y = predicted,
                  ymin = conf.low,
                  ymax = conf.high),
              alpha = 0.1) + 
  # third layer: line representing model predictions
  # using predictions data frame
  geom_line(data = abalone_preds,
            aes(x = x,
                y = predicted)) +
  # axis labels
  labs(x = "Mean pH", 
       y = expression("Change in shell area ("*mm^{2}~d^-1*")")) +
  # theme
  theme_minimal()

Creating a table with model coefficients, 95% confidence intervals, and more

Note: both these functions are from gtsummary.

tbl_regression(abalone_model,
               # make sure the y-intercept estimate is shown
               intercept = TRUE,
               # changing labels in "Characteristic" column
               label = list(`(Intercept)` = "Intercept",
                            mean_ph = "pH")) |> 
  # changing header text
  modify_header(label = "**Variable**",
                estimate = "**Estimate**") |> 
  # turning table into a flextable (makes things easier to render to word or PDF)
  as_flex_table()

Variable

Estimate

95% CI

p-value

Intercept

-19

-32, -4.6

0.015

pH

3.0

1.3, 4.7

0.003

Abbreviation: CI = Confidence Interval

END OF ABALONE EXAMPLE