What is a correlation?

A correlation quantifies the linear association between two variables. From one perspective, a correlation has two parts: one part quantifies the association, and the other part sets the scale of that association.

The first part—the covariance, which serves as the correlation’s numerator—amounts to a sort of “average cross-product” of the two variables’ deviation scores:

\(cov_{(X, Y)} = \frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}\)

It may be easier to interpret the covariance as an “average of the X-Y matches”: Deviations of X scores above the X mean multiplied by deviations of Y scores below the Y mean will be negative, and deviations of X scores above the X mean multiplied by deviations of Y scores above the Y mean will be positive. More “mismatches” lead to a negative covariance, and more “matches” lead to a positive covariance.
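To make this concrete, here’s a minimal sketch in R with two made-up score vectors (the values are hypothetical, chosen only for illustration):

# two made-up score vectors for illustration
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# covariance "by hand": average cross-product of deviation scores
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
## [1] 4

# same value from R's built-in function
cov(x, y)
## [1] 4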

The second part—the product of the standard deviations, which serves as the correlation’s denominator—restricts the association to values from -1.00 to 1.00.

\(\sqrt{var_X var_Y} = \sqrt{\frac{\sum(X - \bar X)^2}{N - 1} \frac{\sum(Y - \bar Y)^2}{N - 1}}\)

Divide the numerator by the denominator and you get a sort of “standardized covariance”, the Pearson correlation coefficient:

\(r_{XY} = \frac{\frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}}{\sqrt{\frac{\sum(X - \bar X)^2}{N - 1} \frac{\sum(Y - \bar Y)^2}{N - 1}}} = \frac{cov_{(X, Y)}}{\sqrt{var_X var_Y}}\)

Square this “standardized covariance” for an estimate of the proportion of variance of Y that can be accounted for by a linear function of X, \(R^2_{XY}\).
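Continuing the made-up vectors from above, the “by hand” correlation matches R’s built-in cor():

# Pearson correlation "by hand": covariance over the product of the SDs
cov(x, y) / (sd(x) * sd(y))
## [1] 0.8

# same value from R's built-in function
cor(x, y)
## [1] 0.8

# squared, it estimates the proportion of variance explained
cor(x, y)^2
## [1] 0.64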

By the way, the correlation equation is very similar to the equation for the slope coefficient in bivariate linear regression. The only difference is in the denominator, which uses the X variance alone rather than the product of the two standard deviations:

\(\hat{\beta} = \frac{\frac{\sum(X - \bar X)(Y - \bar Y)}{N - 1}}{\frac{\sum(X - \bar X)^2}{N - 1}} = \frac{cov_{(X, Y)}}{var_X}\)
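Likewise, the “by hand” slope matches the coefficient that lm() estimates for the made-up vectors:

# regression slope "by hand": covariance over the predictor's variance
cov(x, y) / var(x)
## [1] 0.4

# same value from R's linear model function
coef(lm(y ~ x))["x"]
##   x 
## 0.4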

What does it mean to “adjust” a correlation?

An adjusted correlation refers to the (square root of the) change in a regression model’s \(R^2\) after adding a single predictor to the model: \(R^2_{full} - R^2_{reduced}\). This change quantifies that additional predictor’s “unique” contribution to observed variance explained. Put another way, this value quantifies observed variance in Y explained by a linear function of X after removing variance shared between X and the other predictors in the model.
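Here’s a minimal sketch of that computation; the data frame d and the variables y, x1, and x2 are hypothetical placeholders (a real example with the HSB data appears later in this post):

# hypothetical data frame d with outcome y and predictors x1 and x2
r2_full <- summary(lm(y ~ x1 + x2, data = d))$r.squared
r2_reduced <- summary(lm(y ~ x1, data = d))$r.squared

# x2's unique contribution to observed variance explained ...
r2_full - r2_reduced

# ... and its square root, the "adjusted correlation"
sqrt(r2_full - r2_reduced)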

Model and Conceptual Assumptions for Linear Regression

  • Correct functional form. Your model variables share linear relationships.
  • No omitted influences. This one is hard: Your model accounts for all relevant influences on the variables included. All models are wrong, but how wrong is yours?
  • Accurate measurement. Your measurements are valid and reliable. Note that unreliable measures can’t be valid, and reliable measures don’t necessarily measure just one construct or even your construct.
  • Well-behaved residuals. Residuals (i.e., prediction errors) aren’t correlated with predictor variables or each other, and residuals have constant variance across values of your predictor variables (see the sketch after this list).
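Base R offers a quick, partial check of that last assumption; a minimal sketch, assuming a fitted model object named fit (hypothetical here):

# assuming a fitted model, e.g., fit <- lm(y ~ x, data = d)
plot(fit, which = 1)  # residuals vs. fitted values: look for patterns or fanning spread
plot(fit, which = 2)  # normal Q-Q plot: check whether residuals follow a straight line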

Libraries

# library("tidyverse")
# library("knitr")
# library("effects")
# library("psych")
# library("candisc")

library(tidyverse)
library(knitr)
library(effects)
library(psych)
library(candisc)

# use select() and recode() from dplyr (avoid masking by other loaded packages)
select <- dplyr::select
recode <- dplyr::recode

Load data

From help("HSB"): “The High School and Beyond Project was a longitudinal study of students in the U.S. carried out in 1980 by the National Center for Education Statistics. Data were collected from 58,270 high school students (28,240 seniors and 30,030 sophomores) and 1,015 secondary schools. The HSB data frame is sample of 600 observations, of unknown characteristics, originally taken from Tatsuoka (1988).”

HSB <- as_tibble(HSB)

# print a random subset of rows from the dataset
HSB %>% sample_n(size = 15) %>% kable()
  id  gender  race          ses     sch      prog      locus  concept   mot  career      read  write  math   sci    ss
 575  female  white         middle  private  general    0.68     0.03  0.00  prof2       44.2   35.9  43.6  47.1  40.6
  42  male    hispanic      middle  public   academic   0.03     0.28  0.33  operative   52.1   54.1  52.0  55.3  50.6
 221  male    white         low     public   general   -0.36    -1.67  1.00  farmer      44.2   43.7  56.4  58.0  60.5
 254  female  white         high    public   academic   0.48    -0.47  0.33  prof1       52.1   61.9  55.5  60.7  60.5
 463  male    white         middle  public   general   -0.11     0.25  1.00  prof1       44.2   44.3  45.6  39.0  50.6
 294  male    white         low     public   academic  -0.82    -0.76  0.00  clerical    57.4   43.7  59.6  52.6  50.6
 419  male    white         middle  public   vocation  -0.19     0.03  0.33  craftsman   54.8   51.5  42.8  60.7  50.6
 486  female  white         middle  public   academic   0.53     0.81  0.67  prof1       54.8   59.3  61.4  47.1  55.6
 458  male    white         middle  public   academic   0.46     0.65  1.00  technical   49.5   48.9  60.5  55.3  55.6
 137  male    african-amer  middle  public   academic  -0.37    -1.90  0.67  manager     54.8   36.5  37.7  49.8  60.5
 200  male    white         high    public   academic  -0.27     0.88  1.00  sales       52.1   64.5  60.6  60.7  45.6
 440  female  white         high    public   academic   1.36     0.94  1.00  homemaker   52.1   48.9  51.3  41.7  45.6
  11  female  hispanic      low     public   academic   0.25     0.34  1.00  prof1       49.5   61.9  42.9  41.7  50.6
 390  female  white         high    public   academic   0.45     0.03  0.67  prof1       60.1   61.9  51.9  53.1  58.1
 173  female  white         low     public   general   -0.61     0.03  0.33  proprietor  44.2   54.1  40.3  52.6  40.6

Do students who score higher on a standardized math test tend to score higher on a standardized science test?

Scatterplot

Below, alpha refers to the points’ transparency (0.5 = 50%), method = "lm" fits a linear model, and se = FALSE suppresses the standard error bands around the regression line.

HSB %>% 
  ggplot(mapping = aes(x = math, y = sci)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red")

Center the standardized math scores

If the standardized math scores are centered around their mean (i.e., 0 = mean), then we can interpret the regression intercept—the predicted score where the regression line crosses the y-axis at x = 0—as the grand mean standardized science score.

HSB <- HSB %>% mutate(math_c = math - mean(math, na.rm = TRUE))

Fit linear regression model

scimath1 <- lm(sci ~ math_c, data = HSB)

Summarize model

summary(scimath1)
## 
## Call:
## lm(formula = sci ~ math_c, data = HSB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.7752  -4.8505   0.3355   5.1096  25.4184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 51.76333    0.30154  171.66   <2e-16 ***
## math_c       0.66963    0.03206   20.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.386 on 598 degrees of freedom
## Multiple R-squared:  0.4219, Adjusted R-squared:  0.4209 
## F-statistic: 436.4 on 1 and 598 DF,  p-value: < 2.2e-16
# print the standardized science score descriptive statistics
HSB %>% pull(sci) %>% describe()
##    vars   n  mean   sd median trimmed   mad min  max range  skew kurtosis
## X1    1 600 51.76 9.71   52.6   51.93 12.01  26 74.2  48.2 -0.16     -0.7
##     se
## X1 0.4

Interpretation

On average, students scored 51.76 points (SD = 9.71 points) on the standardized science test. Moreover, for every additional point students scored on the standardized math test, they scored 0.67 more points (SE = 0.03) on the standardized science test, t(598) = 20.89, p < .001.

If we account for the fact that students who score higher on a standardized math test also tend to score higher on a standardized reading test, do students who score higher on the standardized math test still tend to score higher on the standardized science test?

Center the standardized reading scores

Same explanation as above: Because the regression intercept is the predicted score when all predictors equal 0, centering the predictors so that 0 reflects their means allows us to interpret the intercept as the grand mean standardized science score.

HSB <- HSB %>% mutate(read_c = read - mean(read, na.rm = TRUE))

Fit linear regression model

scimath2 <- lm(sci ~ math_c + read_c, data = HSB)

Summarize model

summary(scimath2)
## 
## Call:
## lm(formula = sci ~ math_c + read_c, data = HSB)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.5139  -4.5883   0.0933   4.5700  22.4739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 51.76333    0.26995 191.754   <2e-16 ***
## math_c       0.34524    0.03910   8.829   <2e-16 ***
## read_c       0.44503    0.03644  12.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.612 on 597 degrees of freedom
## Multiple R-squared:  0.5375, Adjusted R-squared:  0.5359 
## F-statistic: 346.8 on 2 and 597 DF,  p-value: < 2.2e-16

Compute \(R^2\) change and compare models

# adjusted R-squared is a less biased estimate of the population R-squared
summary(scimath2)$adj.r.squared - summary(scimath1)$adj.r.squared
## [1] 0.114985
# compare models
anova(scimath1, scimath2)
## Analysis of Variance Table
## 
## Model 1: sci ~ math_c
## Model 2: sci ~ math_c + read_c
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    598 32624                                  
## 2    597 26102  1    6521.7 149.16 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Save both model predictions in tables

Below, I use the effect() function to estimate predicted standardized science scores across a range of unique values of standardized math scores; for scimath2, the full model, the predictions adjust for (i.e., partial out) the linear effect of standardized reading scores. I transform the result from effect() into a tibble, which includes the predicted (fitted) values, the predictor values, the standard errors of the predictions, and the upper and lower confidence limits for the predictions. I can use this table to create a regression line and confidence bands in a plot.

(scimath_predtable1 <- effect(term = "math_c", mod = scimath1) %>% as_tibble())
## # A tibble: 5 x 5
##   math_c   fit    se lower upper
## *  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    -20  38.4 0.708  37.0  39.8
## 2     -9  45.7 0.417  44.9  46.6
## 3      2  53.1 0.308  52.5  53.7
## 4     10  58.5 0.440  57.6  59.3
## 5     20  65.2 0.708  63.8  66.5
(scimath_predtable2 <- effect(term = "math_c", mod = scimath2) %>% as_tibble())
## # A tibble: 5 x 5
##   math_c   fit    se lower upper
## *  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    -20  44.9 0.827  43.2  46.5
## 2     -9  48.7 0.444  47.8  49.5
## 3      2  52.5 0.281  51.9  53.0
## 4     10  55.2 0.475  54.3  56.1
## 5     20  58.7 0.827  57.0  60.3

Plot adjusted relationship

Below, I create the lines and the confidence “ribbons” from the tables I created above. The points come from the original data frame, though. Follow the code line by line: geom_point uses the HSB data, and the two geom_line calls use the different tables of predicted values. In other words, layers of lines and ribbons are added on top of the layer of points.

HSB %>% 
  ggplot(mapping = aes(x = math_c, y = sci)) +
  geom_point(alpha = 0.5) +
  geom_line(data = scimath_predtable1, mapping = aes(x = math_c, y = fit), color = "red") +
  geom_line(data = scimath_predtable2, mapping = aes(x = math_c, y = fit), color = "blue") +
  geom_ribbon(data = scimath_predtable2, mapping = aes(x = math_c, ymin = lower, ymax = upper), fill = "blue", alpha = 0.25) +
  labs(x = "Standardized math score (grand mean centered)", y = "Standardized science score")

Interpretation

After partialling out variance shared between standardized math and reading scores, for every additional point students scored on the standardized math test, they scored 0.35 more points (SE = 0.04) on the standardized science test, t(597) = 8.83, p < .001. Importantly, the model that includes standardized reading scores explained 53.59% of the observed variance in standardized science scores, an 11.50% improvement over the model that included only standardized math scores.

Resources

  • Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. New York, NY: Routledge.
  • Gonzalez, R. (2016, December). Lecture notes #8: Advanced regression techniques I. Retrieved June 28, 2018, from http://www-personal.umich.edu/~gonzo/coursenotes/file8.pdf
  • MacKinnon, D. P. (2008). Introduction to statistical mediation analysis. New York, NY: Lawrence Erlbaum Associates.

General word of caution

Above, I listed resources prepared by experts on these and related topics. Although I generally do my best to write accurate posts, don’t assume my posts are 100% accurate or that they apply to your data or research questions. Trust statistics and methodology experts, not blog posts.