Modeling Diamond Prices with Regression Analysis

Using regression analysis in R to identify how carat, cut, and color influence diamond prices

Overview

Timeline: Spring 2025
Methods: Exploratory Data Analysis, ANOVA, Simple Linear Regression, Multiple Linear Regression, Log Transformation, Model Diagnostics
Course: PSTAT 126 - Regression Analysis

This project explored how different diamond characteristics influence price using a random sample of 1,000 observations from a larger diamonds dataset. The goal was to determine which features were most useful for predicting diamond prices, and to build a regression model that balanced strong fit with clear interpretation.

Using R, I analyzed relationships between price and several predictors, including carat, cut, color, and depth. Through exploratory data analysis, hypothesis testing, and model comparison, I found that carat, cut, and color were the most important predictors of diamond price, while depth did not meaningfully improve the model.

Tools and Skills

R
ggplot2
dplyr
Exploratory Data Analysis (EDA)
ANOVA
Simple Linear Regression
Multiple Linear Regression
Log Transformation
Residual Analysis
Model Selection

Project Goals

This project focused on a few main questions:

Which diamond characteristics are most strongly associated with price?
How well can diamond price be predicted using carat, cut, color, and depth?
Do transformations improve model fit and better satisfy regression assumptions?
Which combination of predictors produces the strongest and most interpretable final model?

My Role

In this project, I completed the full analysis independently in R. This included selecting a random sample from the dataset, creating visualizations, fitting and interpreting regression models, checking model assumptions, comparing adjusted \(R^2\) values across models, and identifying the final best-performing model.

Methods

I combined visualization, statistical testing, and regression modeling to understand the drivers of diamond price.

Data Collection and Sampling

I worked with a diamonds dataset from Kaggle that included variables such as price, carat, cut, color, depth, and clarity. For this project, I selected a random sample of 1,000 diamonds and focused on carat, cut, color, depth, and price in order to build a manageable and interpretable regression analysis.
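A minimal sketch of the sampling step, using base R. The `full` data frame below is a randomly generated stand-in, not the Kaggle file, and the seed, the 5,000-row size, and the object names are all assumptions made for illustration.

```r
# Stand-in for the full dataset (hypothetical values; the project used a Kaggle file).
set.seed(2025)
n_full <- 5000
full <- data.frame(
  price   = round(exp(rnorm(n_full, mean = 8, sd = 1))),
  carat   = exp(rnorm(n_full, mean = -0.4, sd = 0.55)),
  cut     = sample(c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                   n_full, replace = TRUE),
  color   = sample(LETTERS[4:10], n_full, replace = TRUE),  # D through J
  depth   = rnorm(n_full, mean = 61.7, sd = 1.4),
  clarity = sample(c("SI1", "VS2", "VVS1"), n_full, replace = TRUE)
)

# Draw 1,000 distinct rows at random and keep only the variables of interest.
keep <- c("price", "carat", "cut", "color", "depth")
diamonds_sample <- full[sample(nrow(full), 1000), keep]

# Store cut and color as factors so lm() and aov() treat them as categories.
diamonds_sample$cut   <- factor(diamonds_sample$cut)
diamonds_sample$color <- factor(diamonds_sample$color)

str(diamonds_sample)
```

Setting the seed before `sample()` makes the same 1,000 rows come back on every run, which keeps later model results reproducible. With dplyr loaded, `slice_sample(full, n = 1000)` is an equivalent way to draw the rows.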

Exploratory Data Analysis

I began by creating summary statistics, histograms for the quantitative variables, and bar plots for the categorical variables. The distributions showed that carat and price were both strongly right-skewed, indicating that the sample contained many lower-carat, lower-priced diamonds and relatively few high-end diamonds.

The bar plots also showed that Ideal cuts were the most common, while the distribution of diamond colors was fairly even except for J, which appeared much less frequently. These early patterns gave a first picture of how carat, cut, and color were distributed, which set up testing how each one relates to price.
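The checks described above could look roughly like the sketch below. It runs on simulated data from a hypothetical helper, `simulate_diamonds()`, so the counts will not reproduce the project's patterns (the simulated cut and color levels are evenly spread). Base graphics are used here only to keep the sketch free of dependencies; the project itself lists ggplot2.

```r
# Hypothetical helper: simulates a diamonds-like data frame (NOT the project's data).
simulate_diamonds <- function(n, seed = 126) {
  set.seed(seed)
  cut_levels   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
  color_levels <- c("D", "E", "F", "G", "H", "I", "J")
  d <- data.frame(
    carat = exp(rnorm(n, mean = -0.4, sd = 0.55)),
    depth = rnorm(n, mean = 61.7, sd = 1.4),
    cut   = factor(sample(cut_levels, n, replace = TRUE), levels = cut_levels),
    color = factor(sample(color_levels, n, replace = TRUE), levels = color_levels)
  )
  # Price follows a log-log relationship with carat plus small cut/color effects.
  d$price <- exp(8.4 + 1.7 * log(d$carat) + 0.05 * as.integer(d$cut) -
                   0.07 * as.integer(d$color) + rnorm(n, sd = 0.25))
  d
}

d <- simulate_diamonds(1000)

# Summary statistics for the quantitative variables.
print(summary(d[, c("price", "carat", "depth")]))

# Sample skewness: positive values indicate a long right tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew <- sapply(d[, c("price", "carat")], skewness)
print(round(skew, 2))

# Category counts that the bar plots display.
cut_counts   <- table(d$cut)
color_counts <- table(d$color)
print(cut_counts)
print(color_counts)

# Draw the plots only in an interactive session.
if (interactive()) {
  par(mfrow = c(2, 2))
  hist(d$carat, main = "Carat", xlab = "carat")
  hist(d$price, main = "Price", xlab = "price")
  barplot(cut_counts, main = "Cut")
  barplot(color_counts, main = "Color")
}
```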

Initial Relationship Testing

To examine relationships between variables, I calculated correlations among the quantitative variables and used ANOVA to test whether price differed across levels of cut and color. The strongest relationship was between price and carat, with a correlation of about 0.92, indicating that larger diamonds tended to be much more expensive.

The ANOVA results also showed statistically significant differences in price across both cut and color categories. In contrast, depth showed almost no linear relationship with price, suggesting that it would likely be less useful as a predictor.
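As a rough illustration of these tests, the sketch below computes the correlation matrix and fits a separate one-way ANOVA for cut and for color. Whether the project used one-way or combined ANOVA models is an assumption here, and the data come from a hypothetical simulator, so the numbers will differ from the ones reported above.

```r
# Hypothetical helper: simulates a diamonds-like data frame (NOT the project's data).
simulate_diamonds <- function(n, seed = 126) {
  set.seed(seed)
  cut_levels   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
  color_levels <- c("D", "E", "F", "G", "H", "I", "J")
  d <- data.frame(
    carat = exp(rnorm(n, mean = -0.4, sd = 0.55)),
    depth = rnorm(n, mean = 61.7, sd = 1.4),
    cut   = factor(sample(cut_levels, n, replace = TRUE), levels = cut_levels),
    color = factor(sample(color_levels, n, replace = TRUE), levels = color_levels)
  )
  # Price follows a log-log relationship with carat plus small cut/color effects.
  d$price <- exp(8.4 + 1.7 * log(d$carat) + 0.05 * as.integer(d$cut) -
                   0.07 * as.integer(d$color) + rnorm(n, sd = 0.25))
  d
}

d <- simulate_diamonds(1000)

# Pairwise Pearson correlations among the quantitative variables.
cor_mat <- cor(d[, c("price", "carat", "depth")])
print(round(cor_mat, 2))

# One-way ANOVA: does mean price differ across levels of each factor?
aov_cut   <- aov(price ~ cut, data = d)
aov_color <- aov(price ~ color, data = d)
print(summary(aov_cut))
print(summary(aov_color))

# Pull the F-test p-values out of the ANOVA tables.
p_cut   <- summary(aov_cut)[[1]][["Pr(>F)"]][1]
p_color <- summary(aov_color)[[1]][["Pr(>F)"]][1]
print(c(cut = p_cut, color = p_color))
```

A small p-value says only that at least one group mean differs from the others. It does not say which levels differ or by how much.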

Regression Modeling

I first fit a multiple linear regression model using price as the response and carat, depth, cut, and color as predictors. This model explained a large proportion of the variability in price, but it also showed that depth was not statistically significant, while some levels of cut and color were.

I then fit a simple linear regression model using carat alone to better understand its individual effect on price. Carat was a highly significant predictor, but residual diagnostics showed clear violations of the linearity and constant variance assumptions.
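The two fits described in this section can be sketched as follows. This is not the project's code: it uses data from a hypothetical simulator, treats cut and color as unordered factors (an assumption), and will not reproduce the reported coefficients.

```r
# Hypothetical helper: simulates a diamonds-like data frame (NOT the project's data).
simulate_diamonds <- function(n, seed = 126) {
  set.seed(seed)
  cut_levels   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
  color_levels <- c("D", "E", "F", "G", "H", "I", "J")
  d <- data.frame(
    carat = exp(rnorm(n, mean = -0.4, sd = 0.55)),
    depth = rnorm(n, mean = 61.7, sd = 1.4),
    cut   = factor(sample(cut_levels, n, replace = TRUE), levels = cut_levels),
    color = factor(sample(color_levels, n, replace = TRUE), levels = color_levels)
  )
  # Price follows a log-log relationship with carat plus small cut/color effects.
  d$price <- exp(8.4 + 1.7 * log(d$carat) + 0.05 * as.integer(d$cut) -
                   0.07 * as.integer(d$color) + rnorm(n, sd = 0.25))
  d
}

d <- simulate_diamonds(1000)

# Multiple regression: each factor contributes one dummy variable per
# non-reference level, so cut adds 4 coefficients and color adds 6.
fit_full   <- lm(price ~ carat + depth + cut + color, data = d)
coef_table <- coef(summary(fit_full))
print(round(coef_table, 4))

# Simple regression on carat alone.
fit_carat <- lm(price ~ carat, data = d)
r2_carat  <- summary(fit_carat)$r.squared
print(round(r2_carat, 3))

# Diagnostic plots for the simple model, drawn only in an interactive session.
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(fit_carat, which = 1)  # residuals vs. fitted: curvature suggests non-linearity
  plot(fit_carat, which = 3)  # scale-location: a rising trend suggests non-constant variance
}
```

Each row of the coefficient table for a factor level compares that level with the reference level (here `Fair` for cut and `D` for color), which is why some levels can be significant while others are not.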

Model Transformation and Diagnostics

To address these issues, I transformed the variables by taking the logs of both price and carat. After fitting a model with log(price) as the response and log(carat) as the predictor, the residual plots improved substantially and the assumptions of linearity, normality, and homoscedasticity were much better satisfied.
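One way to see why the transformation helps is to write the log-log model back on the original scale. The symbols \(\beta_0\), \(\beta_1\), and \(\varepsilon\) are notation introduced here, and the exponential form assumes natural logs, which is what R's `log()` uses by default:

\[
\log(\text{price}) = \beta_0 + \beta_1 \log(\text{carat}) + \varepsilon
\quad\Longleftrightarrow\quad
\text{price} = e^{\beta_0}\,\text{carat}^{\beta_1}\,e^{\varepsilon}
\]

On the original scale the error term multiplies the price instead of adding to it, so the spread of prices grows with their level. That matches the fanning residual pattern seen in the untransformed model. The slope also has a direct reading: scaling carat by a factor \(1 + \delta\) scales the predicted price by

\[
(1 + \delta)^{\beta_1} \approx 1 + \beta_1 \delta \quad \text{for small } \delta,
\]

so a 1% increase in carat corresponds to roughly a \(\beta_1\)% change in predicted price.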

I confirmed the improvement using both the residuals vs. fitted plot and a Q-Q plot. The transformed model also produced a stronger fit, with the \(R^2\) increasing from about 0.85 in the simple linear model to about 0.93 after the log transformation.
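The comparison above can be sketched as below, again on data from a hypothetical simulator, so the \(R^2\) values will differ from the 0.85 and 0.93 reported here. The sketch also checks a standard identity: in simple linear regression \(R^2\) equals the squared correlation, which is consistent with the correlation of about 0.92 found earlier, since \(0.92^2 \approx 0.85\).

```r
# Hypothetical helper: simulates a diamonds-like data frame (NOT the project's data).
simulate_diamonds <- function(n, seed = 126) {
  set.seed(seed)
  cut_levels   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
  color_levels <- c("D", "E", "F", "G", "H", "I", "J")
  d <- data.frame(
    carat = exp(rnorm(n, mean = -0.4, sd = 0.55)),
    depth = rnorm(n, mean = 61.7, sd = 1.4),
    cut   = factor(sample(cut_levels, n, replace = TRUE), levels = cut_levels),
    color = factor(sample(color_levels, n, replace = TRUE), levels = color_levels)
  )
  # Price follows a log-log relationship with carat plus small cut/color effects.
  d$price <- exp(8.4 + 1.7 * log(d$carat) + 0.05 * as.integer(d$cut) -
                   0.07 * as.integer(d$color) + rnorm(n, sd = 0.25))
  d
}

d <- simulate_diamonds(1000)

# Untransformed and log-log simple regressions.
fit_raw <- lm(price ~ carat, data = d)
fit_log <- lm(log(price) ~ log(carat), data = d)

r2 <- c(raw     = summary(fit_raw)$r.squared,
        log_log = summary(fit_log)$r.squared)
print(round(r2, 3))

# In simple regression, R^2 is the squared correlation of x and y.
r2_matches_cor <- isTRUE(all.equal(unname(r2["raw"]), cor(d$carat, d$price)^2))
print(r2_matches_cor)

# Diagnostic plots for the transformed model, drawn only in an interactive session.
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(fit_log, which = 1)  # residuals vs. fitted
  plot(fit_log, which = 2)  # normal Q-Q
}
```

The two \(R^2\) values describe different responses (price and log price), so the comparison is a rough guide to fit. The residual plots are the stronger evidence that the transformed model is better specified.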

Model Selection

To determine the best final model, I compared adjusted \(R^2\) values after adding depth, cut, and color to the transformed model. Adding depth slightly decreased adjusted \(R^2\), indicating that it did not improve the model. Adding cut and color, however, both improved fit, and the best-performing model included:

log(carat)
cut
color

This final model achieved an \(R^2\) of about 0.9447, meaning it explained roughly 94% of the variation in log diamond price.
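The comparison behind this choice can be sketched as a loop over candidate formulas. Adjusted \(R^2\) is \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\) for \(n\) observations and \(p\) estimated slopes, so it rises only when a new term reduces residual variance by more than the degrees of freedom it costs. The candidate set and its labels are assumptions, and the data come from a hypothetical simulator in which cut and color effects are built in, so only the workflow carries over, not the values.

```r
# Hypothetical helper: simulates a diamonds-like data frame (NOT the project's data).
simulate_diamonds <- function(n, seed = 126) {
  set.seed(seed)
  cut_levels   <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
  color_levels <- c("D", "E", "F", "G", "H", "I", "J")
  d <- data.frame(
    carat = exp(rnorm(n, mean = -0.4, sd = 0.55)),
    depth = rnorm(n, mean = 61.7, sd = 1.4),
    cut   = factor(sample(cut_levels, n, replace = TRUE), levels = cut_levels),
    color = factor(sample(color_levels, n, replace = TRUE), levels = color_levels)
  )
  # Price follows a log-log relationship with carat plus small cut/color effects.
  d$price <- exp(8.4 + 1.7 * log(d$carat) + 0.05 * as.integer(d$cut) -
                   0.07 * as.integer(d$color) + rnorm(n, sd = 0.25))
  d
}

d <- simulate_diamonds(1000)

# Candidate models, all built on the log-log base model.
candidates <- list(
  base       = log(price) ~ log(carat),
  plus_depth = log(price) ~ log(carat) + depth,
  plus_cut   = log(price) ~ log(carat) + cut,
  plus_color = log(price) ~ log(carat) + color,
  cut_color  = log(price) ~ log(carat) + cut + color
)

# Fit each candidate and collect its adjusted R^2.
adj_r2 <- sapply(candidates, function(f) summary(lm(f, data = d))$adj.r.squared)
print(round(sort(adj_r2, decreasing = TRUE), 4))

# Name of the candidate with the highest adjusted R^2.
best <- names(which.max(adj_r2))
print(best)
```

Keeping the candidates in a named list makes it easy to add or drop a model without touching the fitting code.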

Key Takeaways

A few major findings emerged from the analysis:

Carat was the strongest predictor of diamond price
Cut and color both improved the model and helped explain additional variation in price
Depth did not meaningfully contribute to predicting price and was dropped from the final model
Log-transforming price and carat substantially improved the regression assumptions and overall model fit
The final model using log(carat), cut, and color explained about 94% of the variability in log price

View the full report

Reflection

This project strengthened my understanding of how regression modeling can be used not just to make predictions, but also to compare predictors, evaluate assumptions, and improve model performance through thoughtful transformations. It gave me more experience interpreting categorical and quantitative predictors together in a multiple regression setting, and helped me see how model diagnostics directly shape the decisions made during analysis.

What I found especially valuable was the process of refining the model rather than stopping at the first strong result. Seeing how the log transformation improved the residual patterns, and how model comparison helped justify dropping depth while keeping cut and color, made the final model feel much more intentional and statistically sound.