# install.packages("lavaan", dependencies = TRUE)
# install.packages("qgraph", dependencies = TRUE)
# install.packages("semPlot", dependencies = TRUE)
library(lavaan)
library(qgraph)
library(semPlot)
The lavaan package is a versatile tool for Structural Equation Modeling (SEM) (or latent variable analysis, covariance structure analysis). SEM is a good (?) friend of a researcher who aims 1) to understand the patterns of correlation/covariance amongst the variables, and 2) to explain their variance with the model specified (see Suhr, 2006).
We will first look into regression and path models using only observed variables. Second, we will build a SEM that contains also latent (unobserved) variables.
First (regression and path) models will be run with the familiar "dfttu" dataset that contains autonomy, competence and score variables alongside with "gender" and "group" variables. The SEM model will be run with another dataset which will be described later.
If you use lavaan in your own analyses, please use the following
reference:
Yves Rosseel (2012). lavaan: An R Package for Structural Equation
Modeling. Journal of Statistical Software, 48(2), 1-36. URL http://www.jstatsoft.org/v48/i02/
observed/manifest variable (something that you have measured: one column in the data frame, square box in diagram)
latent variable (factor that is composed of more than one observed variables: does not exist in the data frame, circle in diagram)
exogenous variable (independent: sends arrows, explains endogenous variable(s), may be observed (x) or latent (\(\xi\))
endogenous variable (dependent: receives arrows, is explained by other vars, may be observed (y) or latent (\(\eta\))
"=~" latent variable ("is measured by", e.g., SelfEfficacy =~ se1 + se2 + se3)
"~" regression ("is regressed on", e.g., score ~ autonomy + competence + gender)
"~~" (co)variance ("is correlated with", e.g., autonomy ~~ autonomy (variance) or autonomy ~~ competence (covariance))
"~1" intercept/mean (triangle in diagrams, e.g., autonomy ~ 1 estimates the mean of variable autonomy)
"1" fixes parameter or loading to one (e.g., SelfEfficacy =~ 1se1 + se2 + se3)
NA* frees parameter or loading (overrides default, e.g., SelfEfficacy =~ NA*se1 + se2 + se3)
The R lavaan offers several estimator options (see also the manual), but we use in these examples one of the most common one, Maximum Likelihood (ML) estimation. As one of the names for "SEM" is "covariance structure modeling", it indicates that the task of ML (for details, see Crisci, 2012) is to compute the implied covariance matrix (specified in the model) and compare that against the actual covariance matrix based on the empirical data.
Next we go through the steps that are needed before the estimation of a regression model.
# where this current script is located?
current_dir <- dirname(rstudioapi::getSourceEditorContext()$path)
# set working directory to current script location
setwd(current_dir)
# check out the result
getwd()
## [1] "/Users/nokelaip/CharveLEAD Dropbox/Petri Nokelainen/home1_cloud/home1/homepage TUNI/public_html/mmx9120/3_sem/exercises/R_lavaan"
# read the data into data frame (df) called "dfttu"
dfttu <- read.table("dfttu_sem.txt", header=TRUE, sep= "\t")
# display summary of df
summary(dfttu)
## group gender autonomy competence
## Min. :0.0000 Min. :0.0000 Min. :2.250 Min. :2.500
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.500 1st Qu.:3.500
## Median :1.0000 Median :1.0000 Median :3.750 Median :4.000
## Mean :0.5144 Mean :0.6971 Mean :3.786 Mean :3.907
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.250
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :5.000
## deep organised score
## Min. :2.000 Min. :1.250 Min. : 0.00
## 1st Qu.:3.750 1st Qu.:3.250 1st Qu.: 22.22
## Median :4.000 Median :3.750 Median : 50.00
## Mean :4.023 Mean :3.739 Mean : 47.76
## 3rd Qu.:4.250 3rd Qu.:4.250 3rd Qu.: 70.14
## Max. :5.000 Max. :5.000 Max. :100.00
# View(dfttu)
# display first six rows of df
head(dfttu)
## group gender autonomy competence deep organised score
## 1 0 1 3.75 4.25 4.00 3.0 16.66667
## 2 1 1 3.25 3.75 4.00 3.5 41.66667
## 3 0 1 4.00 4.00 4.00 4.0 88.88889
## 4 0 1 4.00 4.25 4.25 4.5 61.11111
## 5 0 0 3.50 3.75 4.50 5.0 38.88889
## 6 1 1 4.75 4.75 4.50 4.5 44.44444
# install.packages("stargazer")
library(stargazer)
stargazer(dfttu, type = "text")
##
## ============================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------
## group 208 0.514 0.501 0 1
## gender 208 0.697 0.461 0 1
## autonomy 208 3.786 0.607 2.250 5.000
## competence 208 3.907 0.605 2.500 5.000
## deep 208 4.023 0.507 2.000 5.000
## organised 208 3.739 0.678 1.250 5.000
## score 208 47.756 29.118 0.000 100.000
## --------------------------------------------
# install.packages("DescTools")
library(DescTools)
Desc(dfttu, plotit = TRUE)
## ------------------------------------------------------------------------------
## Describe dfttu (data.frame):
##
## data frame: 208 obs. of 7 variables
## 208 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 group integer .
## 2 gender integer .
## 3 autonomy numeric .
## 4 competence numeric .
## 5 deep numeric .
## 6 organised numeric .
## 7 score numeric .
##
##
## ------------------------------------------------------------------------------
## 1 - group (integer - dichotomous)
##
## length n NAs unique
## 208 208 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 101 48.6% 41.9% 55.3%
## 1 107 51.4% 44.7% 58.1%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## 2 - gender (integer - dichotomous)
##
## length n NAs unique
## 208 208 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 63 30.3% 24.4% 36.8%
## 1 145 69.7% 63.2% 75.6%
##
## ' 95%-CI (Wilson)
## ------------------------------------------------------------------------------
## 3 - autonomy (numeric)
##
## length n NAs unique 0s mean meanCI'
## 208 208 0 12 0 3.786 3.703
## 100.0% 0.0% 0.0% 3.869
##
## .05 .10 .25 median .75 .90 .95
## 2.750 3.000 3.500 3.750 4.000 4.575 4.750
##
## range sd vcoef mad IQR skew kurt
## 2.750 0.607 0.160 0.371 0.500 -0.178 0.031
##
##
## value freq perc cumfreq cumperc
## 1 2.25 5 2.4% 5 2.4%
## 2 2.5 3 1.4% 8 3.8%
## 3 2.75 6 2.9% 14 6.7%
## 4 3 17 8.2% 31 14.9%
## 5 3.25 12 5.8% 43 20.7%
## 6 3.5 33 15.9% 76 36.5%
## 7 3.75 38 18.3% 114 54.8%
## 8 4 43 20.7% 157 75.5%
## 9 4.25 20 9.6% 177 85.1%
## 10 4.5 10 4.8% 187 89.9%
## 11 4.75 11 5.3% 198 95.2%
## 12 5 10 4.8% 208 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 4 - competence (numeric)
##
## length n NAs unique 0s mean meanCI'
## 208 208 0 11 0 3.91 3.82
## 100.0% 0.0% 0.0% 3.99
##
## .05 .10 .25 median .75 .90 .95
## 2.75 3.00 3.50 4.00 4.25 4.75 5.00
##
## range sd vcoef mad IQR skew kurt
## 2.50 0.61 0.15 0.37 0.75 -0.23 -0.30
##
##
## value freq perc cumfreq cumperc
## 1 2.5 4 1.9% 4 1.9%
## 2 2.75 11 5.3% 15 7.2%
## 3 3 12 5.8% 27 13.0%
## 4 3.25 11 5.3% 38 18.3%
## 5 3.5 15 7.2% 53 25.5%
## 6 3.75 37 17.8% 90 43.3%
## 7 4 55 26.4% 145 69.7%
## 8 4.25 17 8.2% 162 77.9%
## 9 4.5 19 9.1% 181 87.0%
## 10 4.75 13 6.2% 194 93.3%
## 11 5 14 6.7% 208 100.0%
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 5 - deep (numeric)
##
## length n NAs unique 0s mean meanCI'
## 208 208 0 13 0 4.02 3.95
## 100.0% 0.0% 0.0% 4.09
##
## .05 .10 .25 median .75 .90 .95
## 3.00 3.50 3.75 4.00 4.25 4.75 4.75
##
## range sd vcoef mad IQR skew kurt
## 3.00 0.51 0.13 0.37 0.50 -0.65 1.42
##
## lowest : 2.0, 2.25, 2.5, 2.75 (2), 3.0 (7)
## highest: 4.0 (62), 4.25 (31), 4.5 (26), 4.75 (15), 5.0 (8)
##
## heap(?): remarkable frequency (29.8%) for the mode(s) (= 4)
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 6 - organised (numeric)
##
## length n NAs unique 0s mean meanCI'
## 208 208 0 13 0 3.739 3.646
## 100.0% 0.0% 0.0% 3.832
##
## .05 .10 .25 median .75 .90 .95
## 2.500 2.925 3.250 3.750 4.250 4.500 4.750
##
## range sd vcoef mad IQR skew kurt
## 3.750 0.678 0.181 0.741 1.000 -0.310 -0.183
##
## lowest : 1.25, 2.25, 2.5 (12), 2.75 (7), 3.0 (18)
## highest: 4.0 (38), 4.25 (20), 4.5 (26), 4.75 (8), 5.0 (8)
##
## heap(?): remarkable frequency (18.3%) for the mode(s) (= 4)
##
## ' 95%-CI (classic)
## ------------------------------------------------------------------------------
## 7 - score (numeric)
##
## length n NAs unique 0s mean meanCI'
## 208 208 0 32 15 47.756410 43.775973
## 100.0% 0.0% 7.2% 51.736847
##
## .05 .10 .25 median .75 .90 .95
## 0.000000 5.555556 22.222222 50.000000 70.138889 85.000000 94.444444
##
## range sd vcoef mad IQR skew kurt
## 100.000000 29.118418 0.609728 32.946667 47.916667 -0.109178 -1.151072
##
## lowest : 0.0 (15), 5.555556 (9), 8.333333 (2), 11.111111 (17), 16.666667 (4)
## highest: 80.555556 (2), 83.333333 (7), 88.888889 (9), 94.444444 (8), 100.0 (4)
##
## heap(?): remarkable frequency (8.2%) for the mode(s) (= 11.1111111111111)
##
## ' 95%-CI (classic)
# install.packages("psych")
library(psych)
describe(dfttu)
## vars n mean sd median trimmed mad min max range skew
## group 1 208 0.51 0.50 1.00 0.52 0.00 0.00 1 1.00 -0.06
## gender 2 208 0.70 0.46 1.00 0.74 0.00 0.00 1 1.00 -0.85
## autonomy 3 208 3.79 0.61 3.75 3.79 0.37 2.25 5 2.75 -0.18
## competence 4 208 3.91 0.61 4.00 3.92 0.37 2.50 5 2.50 -0.23
## deep 5 208 4.02 0.51 4.00 4.04 0.37 2.00 5 3.00 -0.65
## organised 6 208 3.74 0.68 3.75 3.76 0.74 1.25 5 3.75 -0.31
## score 7 208 47.76 29.12 50.00 47.85 32.95 0.00 100 100.00 -0.11
## kurtosis se
## group -2.01 0.03
## gender -1.28 0.03
## autonomy 0.03 0.04
## competence -0.30 0.04
## deep 1.42 0.04
## organised -0.18 0.05
## score -1.15 2.02
The data (208 participants, i.e., rows) contains 7 variables:
group 0=control, 1=intervention
gender 0=female, 1=male
autonomy 1=totally disagree, …, 5=totally agree (average of four
items)
competence 1=totally disagree, …, 5=totally agree (average of four
items)
deep 1=totally disagree, …, 5=totally agree (average of four
items)
organised 1=totally disagree, …, 5=totally agree (average of four
items)
score 0-100 (mathematics score)
We can see from the summary of the "dfttu" data frame (df) that it contains no missing values (NA). Usually we are happy about that, but in the real life NA's quite often do exist.
Next we examine the distributions of the three central variables that we are going to use in the following exercises: Competence, autonomy and score.
Frequency histograms show the count of each variable value:
hist(dfttu$competence, main="Histogram of Competence", xlab="Competence")
hist(dfttu$autonomy, main="Histogram of Autonomy", xlab="Autonomy")
hist(dfttu$score, main="Histogram of Score", xlab="Score")
Density plots of these three variables are based on kernel smoothing:
# # basic plot
# plot(density(dfttu$competence))
# plot(density(dfttu$autonomy))
# plot(density(dfttu$score))
# "bells and whistles" version
# https://datavizpyr.com/add-vertical-line-to-density-plot-with-ggplot2/
# competence
# shorter way
mean_competence <- as.numeric(dec2(mean(dfttu$competence)))
median_competence <- as.numeric(dec2(median(dfttu$competence)))
mode_competence <- as.numeric(dec2(Mode(dfttu$competence))) # Mode from DescTools library
variance_competence <- as.numeric(dec2(var(dfttu$competence)))
sd_competence <- as.numeric(dec2(sd(dfttu$competence)))
skewness_competence <- as.numeric(dec2(skewness(dfttu$competence)))
kurtosis_competence <- as.numeric(dec2(kurtosis(dfttu$competence)))
# # longer way
# mean_competence <- dfttu %>%
# pull(competence) %>%
# mean() %>%
# signif(3)
# display plot
ggplot(dfttu, aes(x=competence)) +
geom_density( fill="dodgerblue", alpha=0.2) +
geom_vline(xintercept=mean_competence, size=1, color="red") +
geom_text(aes(x=4.5, label=paste0("Mean (red)\n",mean_competence), y=1.5)) +
geom_vline(xintercept=median_competence, size=1, color="blue") +
geom_text(aes(x=4.5, label=paste0("Median (blue)\n",median_competence), y=1.2)) +
geom_vline(xintercept=mode_competence, size=1, color="orange", linetype="dashed") +
geom_text(aes(x=4.5, label=paste0("Mode (orange)\n",mode_competence), y=0.9)) +
geom_text(aes(x=3, label=paste0("Variance\n",variance_competence), y=1.5)) +
geom_text(aes(x=3, label=paste0("Std dev\n",sd_competence), y=1.2)) +
geom_text(aes(x=3, label=paste0("Skewness\n",skewness_competence), y=0.9)) +
geom_text(aes(x=3, label=paste0("Kurtosis\n",kurtosis_competence), y=0.6))
# autonomy
mean_autonomy <- as.numeric(dec2(mean(dfttu$autonomy)))
median_autonomy <- as.numeric(dec2(median(dfttu$autonomy)))
mode_autonomy <- as.numeric(dec2(Mode(dfttu$autonomy))) # Mode from DescTools library
variance_autonomy <- as.numeric(dec2(var(dfttu$autonomy)))
sd_autonomy <- as.numeric(dec2(sd(dfttu$autonomy)))
skewness_autonomy <- as.numeric(dec2(skewness(dfttu$autonomy)))
kurtosis_autonomy <- as.numeric(dec2(kurtosis(dfttu$autonomy)))
ggplot(dfttu, aes(x=autonomy)) +
geom_density( fill="dodgerblue", alpha=0.2) +
geom_vline(xintercept=mean_autonomy, size=1, color="red") +
geom_text(aes(x=4.5, label=paste0("Mean (red)\n",mean_autonomy), y=1.5)) +
geom_vline(xintercept=median_autonomy, size=1, color="blue") +
geom_text(aes(x=4.5, label=paste0("Median (blue)\n",median_autonomy), y=1.2)) +
geom_vline(xintercept=mode_autonomy, size=1, color="orange", linetype="dashed") +
geom_text(aes(x=4.5, label=paste0("Mode (orange)\n",mode_autonomy), y=0.9)) +
geom_text(aes(x=3, label=paste0("Variance\n",variance_autonomy), y=1.5)) +
geom_text(aes(x=3, label=paste0("Std dev\n",sd_autonomy), y=1.2)) +
geom_text(aes(x=3, label=paste0("Skewness\n",skewness_autonomy), y=0.9)) +
geom_text(aes(x=3, label=paste0("Kurtosis\n",kurtosis_autonomy), y=0.6))
# score
mean_score <- as.numeric(dec2(mean(dfttu$score)))
median_score <- as.numeric(dec2(median(dfttu$score)))
mode_score <- as.numeric(dec2(Mode(dfttu$score))) # Mode from DescTools library
variance_score <- as.numeric(dec2(var(dfttu$score)))
sd_score <- as.numeric(dec2(sd(dfttu$score)))
skewness_score <- as.numeric(dec2(skewness(dfttu$score)))
kurtosis_score <- as.numeric(dec2(kurtosis(dfttu$score)))
ggplot(dfttu, aes(x=score)) +
geom_density( fill="dodgerblue", alpha=0.2) +
geom_vline(xintercept=mean_score, size=1, color="red") +
geom_text(aes(x=75, label=paste0("Mean (red)\n",mean_score), y=0.01)) +
geom_vline(xintercept=median_score, size=1, color="blue") +
geom_text(aes(x=75, label=paste0("Median (blue)\n",median_score), y=0.007)) +
geom_vline(xintercept=mode_score, size=1, color="orange", linetype="dashed") +
geom_text(aes(x=75, label=paste0("Mode (orange)\n",mode_score), y=0.004)) +
geom_text(aes(x=25, label=paste0("Variance\n",variance_score), y=0.01)) +
geom_text(aes(x=25, label=paste0("Std dev\n",sd_score), y=0.007)) +
geom_text(aes(x=25, label=paste0("Skewness\n",skewness_score), y=0.004)) +
geom_text(aes(x=25, label=paste0("Kurtosis\n",kurtosis_score), y=0.001))
# install.packages("gridExtra")
# library(gridExtra)
# grid.arrange(plot_competence,plot_autonomy,plot_score, ncol=3)
Last figure of score density looks odd by the value of mode (11.11). This makes sense when we look at the frequencies of the original and rounded (zero digits) math score values: There are 17 occurences of 11.11 value.
# table(round(dfttu$score, digits=0))
table(dfttu$score)
##
## 0 5.55555555555556 8.33333333333333 11.1111111111111
## 15 9 2 17
## 16.6666666666667 19.4444444444444 22.2222222222222 25
## 4 1 6 4
## 27.7777777777778 30.5555555555556 33.3333333333333 36.1111111111111
## 6 3 6 2
## 38.8888888888889 41.6666666666667 44.4444444444444 47.2222222222222
## 4 5 15 1
## 50 52.7777777777778 55.5555555555556 58.3333333333333
## 7 5 7 4
## 61.1111111111111 63.8888888888889 66.6666666666667 69.4444444444445
## 10 5 12 6
## 72.2222222222222 75 77.7777777777778 80.5555555555555
## 7 2 13 2
## 83.3333333333333 88.8888888888889 94.4444444444445 100
## 7 9 8 4
# # install.packages("moments")
# library(moments)
#
# # is the data more or less "tailed" than normal distribution ("bell curve")?
# kurtosis(dfttu)
group gender autonomy competence deep organised score
1.003331 1.736070 3.059942 2.721777 4.460560 2.844617 1.866836
Kurtosis value of 3 indicates normal distribution (and it is called mesokurtic). If the value is less than 3, then the distribution is said to be platykurtic: It has thinner tails than a normal distribution, resulting in fewer extreme positive or negative values. If the value is greater than 3, then it is said to be leptokurtic: The shape is wider or flatter and the tails are fatter resulting in a greater chance of extreme positive or negative values. Competence and autonomy variables are quite close to (more or less) normal value of 3, but the score is clearly platycurtic.
Skewness is a distribution shape measure that tells if the response values are concentrated more on the left (too many smaller values) or right (too many greater values) side of the mean.
# library(moments)
#
# # is the data more towards low or high response values than the normal distribution?
# skewness(dfttu)
group gender autonomy competence deep organised score -0.05771633 -0.85794525 -0.17889672 -0.23497594 -0.65758103 -0.31181557 -0.10997036
Usually values greater than -1 or 1 indicate highly skewed (non-normal) data. Results above show that this is not the case here. Moderate skewness is attached to values -1 - -0.5 and 0.5 - 1. However, in our data autonomy, competence and score all have negative skewness values. It means that greater amount of values are larger than the mean value. Following density plots visualise this.
# sqrt
dfttu$scoreSqrt <- sqrt(dfttu$score)
# inverse sqrt
dfttu$scoreSqrtInv <- sqrt(max(dfttu$score+1)-dfttu$score)
# log10
dfttu$scoreLog <- log(dfttu$score)
# inverse log10
dfttu$scoreLogInv <- log(max(dfttu$score+1)-dfttu$score)
skewness(dfttu$score)
## [1] -0.1099704
skewness(dfttu$scoreSqrt)
## [1] -0.8831435
skewness(dfttu$scoreSqrtInv)
## [1] -0.4939812
skewness(dfttu$scoreLog)
## [1] NaN
skewness(dfttu$scoreLogInv)
## [1] -1.92719
plot(density(dfttu$score))
plot(density(dfttu$scoreSqrt))
plot(density(dfttu$scoreSqrtInv))
plot(density(dfttu$scoreLog))
plot(density(dfttu$scoreLogInv))
Skewness is not getting better by transforming the score (as skewness values go further from the desired zero value). Density plots below show that the square root transformation is milder than logarithmic transformation.
kurtosis(dfttu$score)
## [1] 1.866836
kurtosis(dfttu$scoreSqrt)
## [1] 2.915251
kurtosis(dfttu$scoreSqrtInv)
## [1] 2.591751
kurtosis(dfttu$scoreLog)
## [1] NaN
kurtosis(dfttu$scoreLogInv)
## [1] 8.184114
plot(density(dfttu$score))
plot(density(dfttu$scoreSqrt))
plot(density(dfttu$scoreSqrtInv))
plot(density(dfttu$scoreLog))
plot(density(dfttu$scoreLogInv))
The kurtosis of score benefits most from the square root transformation as the kurtosis value gets close to 3 (2.92, second figure). However, we decide to go on without transforming the score variable.
We create a second df ("dfttuNA") and add approximately 40% of NA's in the autonomy and organised variables (in that new df).
# create a copy of the "whole"no missing data" df ("dfttu")
dfttuNA <- dfttu
# create a new temporary df that only has autonomy and organised columns
# one way is to use "dplyr"
# library(dplyr)
# df_temp <- select(dfttuNA, autonomy, organised)
# but we opt to use "base R"
df_temp <- dfttu[c("autonomy","competence")]
# add appr. 40% of NA's into these two columns
set.seed(2021) # reproduce results
# https://stackoverflow.com/questions/27454265/randomly-insert-nas-into-dataframe-proportionaly/27454361#27454361
df_temp <- as.data.frame(lapply(df_temp,
function(cc) cc[ sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE) ]
)
)
# overwrite dfttuNA autonomy and organised columns with the same vars containing NA's
dfttuNA$autonomy <- as.numeric(df_temp[,1]) # first column of df_temp is autonomy
dfttuNA$organised <- as.numeric(df_temp[,2]) # seconf column of df_temp is organised
# display summary of the new df with NA's
summary(dfttuNA)
## group gender autonomy competence
## Min. :0.0000 Min. :0.0000 Min. :2.250 Min. :2.500
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.500 1st Qu.:3.500
## Median :1.0000 Median :1.0000 Median :3.750 Median :4.000
## Mean :0.5144 Mean :0.6971 Mean :3.773 Mean :3.907
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.250
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :5.000
## NA's :79
## deep organised score scoreSqrt
## Min. :2.000 Min. :2.500 Min. : 0.00 Min. : 0.000
## 1st Qu.:3.750 1st Qu.:3.500 1st Qu.: 22.22 1st Qu.: 4.714
## Median :4.000 Median :4.000 Median : 50.00 Median : 7.071
## Mean :4.023 Mean :3.937 Mean : 47.76 Mean : 6.357
## 3rd Qu.:4.250 3rd Qu.:4.500 3rd Qu.: 70.14 3rd Qu.: 8.375
## Max. :5.000 Max. :5.000 Max. :100.00 Max. :10.000
## NA's :82
## scoreSqrtInv scoreLog scoreLogInv
## Min. : 1.000 Min. : -Inf Min. :0.000
## 1st Qu.: 5.554 1st Qu.:3.101 1st Qu.:3.429
## Median : 7.141 Median :3.912 Median :3.932
## Mean : 6.949 Mean : -Inf Mean :3.732
## 3rd Qu.: 8.876 3rd Qu.:4.250 3rd Qu.:4.367
## Max. :10.050 Max. :4.605 Max. :4.615
##
head(dfttuNA)
## group gender autonomy competence deep organised score scoreSqrt
## 1 0 1 3.75 4.25 4.00 4.25 16.66667 4.082483
## 2 1 1 NA 3.75 4.00 3.75 41.66667 6.454972
## 3 0 1 NA 4.00 4.00 4.00 88.88889 9.428090
## 4 0 1 4.00 4.25 4.25 4.25 61.11111 7.817360
## 5 0 0 NA 3.75 4.50 3.75 38.88889 6.236096
## 6 1 1 NA 4.75 4.50 4.75 44.44444 6.666667
## scoreSqrtInv scoreLog scoreLogInv
## 1 9.183318 2.813411 4.434777
## 2 7.702813 3.729701 4.083171
## 3 3.480102 4.487387 2.494123
## 4 6.315765 4.112694 3.686098
## 5 7.881060 3.660709 4.128925
## 6 7.520343 3.794240 4.035223
# write new df
write.table (dfttuNA, file = "dfttu_sem_na.txt", row.names = FALSE, sep= "\t")
Here we have it - 79 missing values in autonomy and 82 in organised - great! Let's observe the data patterns:
library(mice)
md.pattern(dfttuNA, rotate.names = TRUE)
## group gender competence deep score scoreSqrt scoreSqrtInv scoreLog
## 75 1 1 1 1 1 1 1 1
## 54 1 1 1 1 1 1 1 1
## 51 1 1 1 1 1 1 1 1
## 28 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## scoreLogInv autonomy organised
## 75 1 1 1 0
## 54 1 1 0 1
## 51 1 0 1 1
## 28 1 0 0 2
## 0 79 82 161
First pattern is the complete data (75 participants) while the second one is the missing data in organised variable (54 participants). The third pattern represents the missigness in autonomy (51 participants), and finally, the fourth pattern are the 28 participants who have both autonomy and organised values missing. We'll return to this "dfttuNA" data later.
Linear regression has several assumptions related to the data that is to be analysed:
1. Linearity of data: Relationship between the y
(response/dependent) and x (predictor/independent) variable is linear
(instead of being non-linear).
How to investigate: Scatterplot of y and x (clear linear association).
Scatterplot of residuals vs fitted (model predicted) values of y
(horizontal line without distinct pattern).
2. Independence of errors: There is no correlation
between the residuals. Residuals are the difference between the observed
and model predicted values of y.
How to investigate: Scatterplot of residuals vs model predicted (fitted)
values of y (r is approximately zero).
3. Normality of errors: The residuals are normally
distributed.
How to investigate: Normal Q-Q ("Quantile-Quantile") plot.
4. Homogeneity of errors variance
(homoscedasticity): The variance in the residuals (of y) is
constant (i.e., same) regardless of the value of predictor (x).
How to investigate: Scatterplot of standardized residuals vs model
predicted (fitted) values of y (residuals should show no pattern across
x-axis).
5. Multicollinearity: No strong correlations among
the predictor/independent (x) variables.
How to investigate: Correlation matrix; Variance Inflation Factor;
Durbin-Watson's d.
First we produce diagnostic values for both predictors (autonomy, score).
Autonomy
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
library(tidyverse)
library(broom)
theme_set(theme_classic())
# re-run with lm()
m1_lm <- lm(competence ~ autonomy, data = dfttu)
# see model diagnostics
m1_lm_diag <- augment(m1_lm)
head(m1_lm_diag)
## # A tibble: 6 × 8
## competence autonomy .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.25 3.75 3.88 0.370 0.00482 0.398 0.00210 0.930
## 2 3.75 3.25 3.50 0.246 0.00858 0.399 0.00166 0.619
## 3 4 4 4.07 -0.0683 0.00541 0.399 0.0000804 -0.172
## 4 4.25 4 4.07 0.182 0.00541 0.399 0.000569 0.457
## 5 3.75 3.5 3.69 0.0576 0.00588 0.399 0.0000623 0.145
## 6 4.75 4.75 4.63 0.118 0.0170 0.399 0.000768 0.298
head() command produces "glimpse" of the first six rows in the data:
fitted: fitted (predicted) values of y (competence)
resid: residual errors (difference between the observed and fitted
values)
hat: extreme values in the predictors (x variable here is the
autonomy)
sigma: estimated residual standard deviation when corresponding
observation is dropped from model
cooksd: Cook's distance (influential values that may indicate an
outlier)
std.resid: outliers (or extreme values) in the outcome y variable
(competence)
To see the last six rows, try tail() command.
Score
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
library(tidyverse)
library(broom)
theme_set(theme_classic())
# re-run with lm()
m2_lm <- lm(competence ~ score, data = dfttu)
# see model diagnostics
m2_lm_diag <- augment(m2_lm)
head(m2_lm_diag)
## # A tibble: 6 × 8
## competence score .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.25 16.7 3.87 0.385 0.0103 0.606 0.00213 0.639
## 2 3.75 41.7 3.90 -0.149 0.00502 0.607 0.000154 -0.247
## 3 4 88.9 3.96 0.0364 0.0144 0.607 0.0000269 0.0606
## 4 4.25 61.1 3.93 0.324 0.00582 0.606 0.000846 0.537
## 5 3.75 38.9 3.90 -0.145 0.00526 0.607 0.000153 -0.241
## 6 4.75 44.4 3.90 0.847 0.00487 0.604 0.00482 1.40
Next we plot the observed values of the DV (competence) against the fitted regression line. First predictor is autonomy and then score.
Autonomy as predictor
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# plot residual errors (magenta) btw obs values and the fitted reg line
ggplot(m1_lm_diag, aes(autonomy, competence)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = autonomy, yend = .fitted), color = "magenta", size = 0.3) +
ggtitle("Competence ~ Autonomy (original)")
Association between competence and autonomy looks strong and linear.
Score as predictor
# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/
# plot residual errors (magenta) btw obs values and the fitted reg line
ggplot(m2_lm_diag, aes(score, competence)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = score, yend = .fitted), color = "magenta", size = 0.3) +
ggtitle("Competence ~ score (original)")
We can see that score is not a good predictor of competence. However, there is a weak positive association.
Next we look into the scatterplot of residuals vs fitted (model predicted) values of y (DV). Residual (or ‘error') is the difference between the dependent (‘endogenous',‘response') variable's observed and fitted (estimation based on the model) values. A horizontal line without distinct patterns is an indication of a linear relationship.
Autonomy
# run linear regression
m1_lm <- lm(competence ~ autonomy, data = dfttu)
# display first (of four) plot
plot(m1_lm,1)
We have here no distinct pattern in the residual plot.
Score
# run linear regression
m2_lm <- lm(competence ~ score, data = dfttu)
# display first (of four) plot
plot(m2_lm,1)
As was with competence ~ autonomy, we have no distinct pattern in the residual plot with competence ~ score.
Plots below show that there is no correlation between the residuals (red line is horizontal and close to zero).
plot(m1_lm,1)
plot(m2_lm,1)
Normal Q-Q plot is widely used way to see if the residuals are normally distributed. This is the case if they (approximately) follow the straight (dashed) line.
Competence ~ autonomy
Plots below show that the competence residuals (with autonomy as a predictor) follow the line quite nicely and we assume that they are normally distributed.
plot(m1_lm,2)
Not just to rely on a plot, we calculate Shapiro-Wilk and Kolmogorov-Smirnov tests.
# install.packages("olsrr")
library(olsrr)
# Shapiro-Wilk and Kolmogorov-Smirnov test values should be > .05
ols_test_normality(m1_lm) # both values are above .05
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9873 0.0594
## Kolmogorov-Smirnov 0.0741 0.2039
## Cramer-von Mises 29.92 0.0000
## Anderson-Darling 0.8217 0.0333
## -----------------------------------------------
# correlations btw observed and fitted values
ols_test_correlation(m1_lm) # high correlation (r = .994) as it should be
## [1] 0.9943387
S-W and K-S values are above .05, so the null hypothesis that the residuals are normally distributed cannot be rejected.
Competence ~ score
Plots below show that the competence residuals (with score as a predictor) do not follow the line as nicely as with autonomy, so we do some further checks.
plot(m2_lm,2)
Not just to rely on a plot, we will calculate Shapiro-Wilk and Kolmogorov-Smirnov tests.
# install.packages("olsrr")
library(olsrr)
# Shapiro-Wilk and Kolmogorov-Smirnov test value should be > .05
ols_test_normality(m2_lm) # both below .05
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9715 3e-04
## Kolmogorov-Smirnov 0.1023 0.0258
## Cramer-von Mises 21.9112 0.0000
## Anderson-Darling 1.9328 1e-04
## -----------------------------------------------
# correlations btw observed and fitted values
ols_test_correlation(m2_lm) # high correlation .987 as it should be
## [1] 0.9871404
S-W and K-S values are below .05, so the null hypothesis that the residuals are normally distributed can be rejected. However, normality assumption violation is not a big problem with larger sample sizes.
Scale-Location (or spread-Location): Homogeneity of
variance of the residuals (homoscedasticity) is met if the line is
horizontal and the points are equally spread.
- Our residuals are spread equally along the ranges of predictors (so we
have no problem with heteroscedasticity). If there would be a problem,
we could use a log or square root transformation of the outcome variable
(y):
lm(log(competence) ~ autonomy, data = dfttu)
or
lm(sqrt(competence) ~ autonomy, data = dfttu)
# raw data competence ~ autonomy
plot(m1_lm,3)
# square root transformed competence
m1b_lm_sqrt <- lm(sqrt(competence) ~ autonomy, data = dfttu)
# log transformed autonomy
m1c_lm_log <- lm(log(competence) ~ autonomy, data = dfttu)
plot(m1b_lm_sqrt,3)
plot(m1c_lm_log,3)
First plot shows that the residuals (of competence) are spread quite equally along the range of autonomy. Doing sqrt or log transformation to DV (competence) does not improve the situation.
# raw data competence ~ score
plot(m2_lm,3)
# square root transformed competence
m2b_lm_sqrt <- lm(sqrt(competence) ~ score, data = dfttu)
# log transformed autonomy
m2c_lm_log <- lm(log(competence) ~ score, data = dfttu)
plot(m2b_lm_sqrt,3)
plot(m2c_lm_log,3)
We also leave score untransformed.
Next we will investigate the correlations between the variables in the model.
# create new df with only two variables
dfttu2 <- dfttu[,c("autonomy","competence","score")]
# compute correlation matrix
lavCor(dfttu2)
## autnmy cmptnc score
## autonomy 1.000
## competence 0.754 1.000
## score -0.014 0.066 1.000
We see that autonomy and competence have a strong correlation (r = 0.75).
Cohen (1988) has defined some effect size "rules" for
r:
small .1
medium .3
large > .5
So we conclude at this point that autonomy has "a large bi-variate association" with competence. We further check if there is a multicollinearity problem with the regression models using Durbin-Watson d test.
library(car)
# Null hypothesis: There is no correlation among the residuals (p-value >.05)
durbinWatsonTest(m1_lm) # com ~ aut
## lag Autocorrelation D-W Statistic p-value
## 1 -0.03773094 2.071179 0.646
## Alternative hypothesis: rho != 0
durbinWatsonTest(m2_lm) # com ~ sco
## lag Autocorrelation D-W Statistic p-value
## 1 0.009978909 1.97743 0.912
## Alternative hypothesis: rho != 0
It seems that multicollinearity is not a problem in this data.
Final step before the linear regression analysis is to check if there are extreme values that might influence the regression results. Residuals vs Leverage plot is useful for this purpose.
Outlier: extreme outcome variable (DV) value (competence).
- Examination of standardized residuals (residual / standard error).
Potential outlier if std.res. > 3.
High leverage point: extreme predictor (IV) value (autonomy,
score).
- Examination of .hat value. Potential high leverage point if .hat >
2(p+1)/n. Note: p = number of predictors, n = number of
observations.
- In "competence ~ autonomy" model: 2(1+1)/208 = 0.02.
# plot autonomy
plot(m1_lm,5)
# plot score
plot(m2_lm,5)
First plot (competence ~ autonomy) shows that the results regarding outliers seem to be quite ok as the standardized residuals are not exceeding |3|. However, leverage values exceed 0.02 level. Top three potential outliers are participants 43, 81 and 167.
Score as a predictor performs better: No outliers or high leverage point problems are present.
Next we investigate the most influential values with Cook's distance method. An observation is influential if its Cook's distance value is greater than 4 / (n - p - 1). The value with our sample of 208 (n) and one predictor (p) is: 4 / (208 - 1 - 1) = 0.01941748. Also a simpler equation is applicable: 4/n = 4/208 = 0.01923077. Roughly speaking we are looking at Cook's distance values greater than 0.02.
First we plot competence ~ autonomy:
# autonomy
# plot 1
n <- nrow(dfttu)
plot(cooks.distance(m1_lm), main = "Cook's distance")
abline(h = 4/n, lty = 2, col = "darkred") # add cutoff line
# plot 2
plot(m1_lm, 4,
id.n = 10 # ask to indicate top 10 influential values
)
abline(h = 4/n, lty = 2, col = "darkred")
Row 81 value is clearly the highest, followed by 167 and 43.
And then the competence ~ score:
# score
# plot 1
n <- nrow(dfttu)
plot(cooks.distance(m2_lm), main = "Cook's distance")
abline(h = 4/n, lty = 2, col = "darkgreen")
# plot 2
plot(m2_lm, 4,
id.n = 10 # ask to indicate top 10 influential values
)
abline(h = 4/n, lty = 2, col = "darkgreen")
Plots above show that some rows in the df have higher values than 0.02. So we proceed to further check their behavior ..
# install.packages("performance")
library(performance)
# ---------------
# Cook's distance
# https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/
# measures how much all of the fitted values in the model change when the one data point (in turn) is deleted.
# check outliers with Cook's distance, m1_lm (competence ~ autonomy)
check_outliers(m1_lm, method = "cook") # result is OK, no outliers
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)
# display Cook's distance values in descending order
m1_lm_outlier_cook_list <- as.data.frame(check_outliers(m1_lm, method = "cook"))
m1_lm_outlier_cook_list[order(-m1_lm_outlier_cook_list$Distance_Cook),]
## Row Distance_Cook Outlier_Cook Outlier
## 81 81 5.777000e-02 0 0
## 167 167 3.305738e-02 0 0
## 43 43 2.597894e-02 0 0
## 156 156 2.473148e-02 0 0
## 112 112 2.442319e-02 0 0
## 162 162 2.389570e-02 0 0
## 62 62 2.216170e-02 0 0
## 192 192 2.216170e-02 0 0
## 89 89 1.966533e-02 0 0
## 13 13 1.787055e-02 0 0
## 140 140 1.665546e-02 0 0
## 152 152 1.665546e-02 0 0
## 161 161 1.665546e-02 0 0
## 174 174 1.529027e-02 0 0
## 26 26 1.495651e-02 0 0
## 41 41 1.396961e-02 0 0
## 7 7 1.350780e-02 0 0
## 66 66 1.350780e-02 0 0
## 207 207 1.350780e-02 0 0
## 49 49 1.339995e-02 0 0
## 116 116 1.339995e-02 0 0
## 171 171 1.339995e-02 0 0
## 146 146 1.223388e-02 0 0
## 141 141 1.189914e-02 0 0
## 73 73 1.161223e-02 0 0
## 189 189 1.161223e-02 0 0
## 60 60 1.153839e-02 0 0
## 82 82 1.153839e-02 0 0
## 83 83 1.144899e-02 0 0
## 138 138 8.990546e-03 0 0
## 31 31 8.856107e-03 0 0
## 44 44 8.856107e-03 0 0
## 28 28 8.193462e-03 0 0
## 17 17 8.006772e-03 0 0
## 32 32 8.006772e-03 0 0
## 119 119 8.006772e-03 0 0
## 68 68 7.853409e-03 0 0
## 77 77 7.853409e-03 0 0
## 94 94 7.853409e-03 0 0
## 97 97 7.725194e-03 0 0
## 154 154 7.725194e-03 0 0
## 163 163 7.725194e-03 0 0
## 24 24 7.496670e-03 0 0
## 64 64 7.496670e-03 0 0
## 195 195 7.428471e-03 0 0
## 25 25 7.317882e-03 0 0
## 47 47 7.317882e-03 0 0
## 71 71 7.317882e-03 0 0
## 90 90 6.996524e-03 0 0
## 8 8 6.755855e-03 0 0
## 200 200 6.755855e-03 0 0
## 27 27 6.260564e-03 0 0
## 29 29 6.260564e-03 0 0
## 123 123 6.260564e-03 0 0
## 108 108 5.952696e-03 0 0
## 110 110 5.895542e-03 0 0
## 134 134 5.895542e-03 0 0
## 136 136 5.895542e-03 0 0
## 51 51 5.832270e-03 0 0
## 133 133 5.832270e-03 0 0
## 57 57 5.565256e-03 0 0
## 50 50 4.181400e-03 0 0
## 127 127 4.181400e-03 0 0
## 130 130 4.181400e-03 0 0
## 175 175 4.181400e-03 0 0
## 185 185 4.181400e-03 0 0
## 194 194 4.181400e-03 0 0
## 122 122 3.670049e-03 0 0
## 172 172 3.670049e-03 0 0
## 183 183 3.464947e-03 0 0
## 16 16 3.210846e-03 0 0
## 52 52 3.210846e-03 0 0
## 72 72 3.210846e-03 0 0
## 87 87 3.210846e-03 0 0
## 95 95 3.210846e-03 0 0
## 143 143 3.210846e-03 0 0
## 98 98 3.196412e-03 0 0
## 169 169 3.196412e-03 0 0
## 37 37 2.581296e-03 0 0
## 63 63 2.581296e-03 0 0
## 137 137 2.581296e-03 0 0
## 139 139 2.581296e-03 0 0
## 150 150 2.581296e-03 0 0
## 173 173 2.581296e-03 0 0
## 180 180 2.581296e-03 0 0
## 54 54 2.221047e-03 0 0
## 55 55 2.221047e-03 0 0
## 135 135 2.221047e-03 0 0
## 179 179 2.221047e-03 0 0
## 188 188 2.221047e-03 0 0
## 19 19 2.098082e-03 0 0
## 178 178 2.098082e-03 0 0
## 1 1 2.098082e-03 0 0
## 70 70 1.779610e-03 0 0
## 96 96 1.779610e-03 0 0
## 23 23 1.775084e-03 0 0
## 42 42 1.775084e-03 0 0
## 65 65 1.775084e-03 0 0
## 78 78 1.775084e-03 0 0
## 79 79 1.775084e-03 0 0
## 115 115 1.775084e-03 0 0
## 144 144 1.775084e-03 0 0
## 153 153 1.775084e-03 0 0
## 168 168 1.775084e-03 0 0
## 191 191 1.775084e-03 0 0
## 101 101 1.745936e-03 0 0
## 158 158 1.745936e-03 0 0
## 186 186 1.745936e-03 0 0
## 190 190 1.745936e-03 0 0
## 2 2 1.659275e-03 0 0
## 22 22 1.659275e-03 0 0
## 75 75 1.659275e-03 0 0
## 131 131 1.659275e-03 0 0
## 106 106 1.604352e-03 0 0
## 107 107 1.604352e-03 0 0
## 120 120 1.604352e-03 0 0
## 160 160 1.604352e-03 0 0
## 193 193 1.604352e-03 0 0
## 56 56 1.450418e-03 0 0
## 124 124 1.450418e-03 0 0
## 205 205 1.450418e-03 0 0
## 80 80 1.408130e-03 0 0
## 126 126 1.408130e-03 0 0
## 34 34 1.019818e-03 0 0
## 125 125 9.698473e-04 0 0
## 128 128 9.698473e-04 0 0
## 184 184 9.143197e-04 0 0
## 6 6 7.684244e-04 0 0
## 10 10 7.684244e-04 0 0
## 85 85 7.684244e-04 0 0
## 121 121 7.684244e-04 0 0
## 182 182 7.684244e-04 0 0
## 132 132 6.939733e-04 0 0
## 155 155 6.939733e-04 0 0
## 165 165 6.939733e-04 0 0
## 4 4 5.687311e-04 0 0
## 12 12 5.687311e-04 0 0
## 53 53 5.687311e-04 0 0
## 76 76 5.687311e-04 0 0
## 103 103 5.687311e-04 0 0
## 111 111 5.687311e-04 0 0
## 159 159 5.687311e-04 0 0
## 196 196 3.942467e-04 0 0
## 35 35 2.608347e-04 0 0
## 46 46 2.608347e-04 0 0
## 59 59 2.608347e-04 0 0
## 67 67 2.608347e-04 0 0
## 91 91 2.608347e-04 0 0
## 104 104 2.608347e-04 0 0
## 114 114 2.608347e-04 0 0
## 118 118 2.608347e-04 0 0
## 147 147 2.608347e-04 0 0
## 15 15 2.198466e-04 0 0
## 18 18 2.198466e-04 0 0
## 30 30 2.198466e-04 0 0
## 38 38 2.198466e-04 0 0
## 58 58 2.198466e-04 0 0
## 84 84 2.198466e-04 0 0
## 86 86 2.198466e-04 0 0
## 149 149 2.198466e-04 0 0
## 151 151 2.198466e-04 0 0
## 164 164 2.198466e-04 0 0
## 176 176 2.198466e-04 0 0
## 187 187 2.198466e-04 0 0
## 201 201 2.198466e-04 0 0
## 202 202 2.198466e-04 0 0
## 203 203 2.198466e-04 0 0
## 109 109 1.841271e-04 0 0
## 129 129 1.841271e-04 0 0
## 181 181 1.841271e-04 0 0
## 14 14 1.150826e-04 0 0
## 105 105 1.150826e-04 0 0
## 204 204 1.150826e-04 0 0
## 3 3 8.042786e-05 0 0
## 21 21 8.042786e-05 0 0
## 36 36 8.042786e-05 0 0
## 39 39 8.042786e-05 0 0
## 48 48 8.042786e-05 0 0
## 61 61 8.042786e-05 0 0
## 69 69 8.042786e-05 0 0
## 92 92 8.042786e-05 0 0
## 100 100 8.042786e-05 0 0
## 102 102 8.042786e-05 0 0
## 117 117 8.042786e-05 0 0
## 142 142 8.042786e-05 0 0
## 145 145 8.042786e-05 0 0
## 148 148 8.042786e-05 0 0
## 157 157 8.042786e-05 0 0
## 170 170 8.042786e-05 0 0
## 197 197 8.042786e-05 0 0
## 198 198 8.042786e-05 0 0
## 5 5 6.231817e-05 0 0
## 9 9 6.231817e-05 0 0
## 11 11 6.231817e-05 0 0
## 20 20 6.231817e-05 0 0
## 33 33 6.231817e-05 0 0
## 40 40 6.231817e-05 0 0
## 45 45 6.231817e-05 0 0
## 74 74 6.231817e-05 0 0
## 99 99 6.231817e-05 0 0
## 113 113 6.231817e-05 0 0
## 208 208 6.231817e-05 0 0
## 166 166 9.702329e-07 0 0
## 177 177 9.702329e-07 0 0
## 199 199 9.702329e-07 0 0
## 88 88 7.263328e-07 0 0
## 93 93 5.265104e-07 0 0
## 206 206 5.265104e-07 0 0
# we see that observation 81 has the highest value (0.058), but also several other values are above the 0.02 level
# however, the test indicates that we do not have an outlier problem here
# --------------------
# Mahalanobis distance
# to get a second opinion, we turn to Mahalanobis distance
# https://www.statology.org/mahalanobis-distance-r/
check_outliers(dfttu[c("competence","autonomy")], method = "mahalanobis")
## OK: No outliers detected.
## - Based on the following method and threshold: mahalanobis (13.82).
## - For variables: competence, autonomy
# we have an indication of one possible outlier, participant 81
# this finding is against the Cook distance result above, but resonates with the earlier plot
# display Mahalanobis values in descending order
m1_lm_outlier_mah_list <- as.data.frame(check_outliers(dfttu[c("competence","autonomy")], method = "mahalanobis"))
m1_lm_outlier_mah_list <- m1_lm_outlier_mah_list[order(-m1_lm_outlier_mah_list$Distance_Mahalanobis),]
# add also p-values, usually we look for values smaller than .001
m1_lm_outlier_mah_list$pvalue <- pchisq(m1_lm_outlier_mah_list$Distance_Mahalanobis,
df=1, # number of variables minus one (2-1=1)
lower.tail=FALSE
)
m1_lm_outlier_mah_list
## Row Distance_Mahalanobis Outlier_Mahalanobis Outlier pvalue
## 81 81 8.64253893 0 0 0.003284035
## 89 89 7.35326766 0 0 0.006694098
## 167 167 7.03191732 0 0 0.008006958
## 156 156 6.99851302 0 0 0.008157745
## 97 97 6.81076737 0 0 0.009060982
## 154 154 6.81076737 0 0 0.009060982
## 163 163 6.81076737 0 0 0.009060982
## 195 195 6.79526286 0 0 0.009140007
## 88 88 6.40714450 0 0 0.011366206
## 43 43 6.06137486 0 0 0.013817008
## 140 140 5.84697068 0 0 0.015603889
## 152 152 5.84697068 0 0 0.015603889
## 161 161 5.84697068 0 0 0.015603889
## 26 26 5.62230507 0 0 0.017733311
## 112 112 5.41611661 0 0 0.019951688
## 162 162 5.36209425 0 0 0.020579046
## 62 62 5.05524349 0 0 0.024551532
## 192 192 5.05524349 0 0 0.024551532
## 141 141 4.91230737 0 0 0.026665997
## 73 73 4.79394504 0 0 0.028559942
## 189 189 4.79394504 0 0 0.028559942
## 98 98 4.72094619 0 0 0.029797343
## 169 169 4.72094619 0 0 0.029797343
## 28 28 4.65128824 0 0 0.031030252
## 13 13 4.43727106 0 0 0.035162409
## 60 60 4.36580300 0 0 0.036667198
## 82 82 4.36580300 0 0 0.036667198
## 146 146 4.35374393 0 0 0.036927681
## 174 174 4.30170177 0 0 0.038074256
## 41 41 4.20745756 0 0 0.040246617
## 37 37 4.20633873 0 0 0.040273174
## 63 63 4.20633873 0 0 0.040273174
## 137 137 4.20633873 0 0 0.040273174
## 139 139 4.20633873 0 0 0.040273174
## 150 150 4.20633873 0 0 0.040273174
## 173 173 4.20633873 0 0 0.040273174
## 180 180 4.20633873 0 0 0.040273174
## 7 7 4.08769083 0 0 0.043196619
## 66 66 4.08769083 0 0 0.043196619
## 207 207 4.08769083 0 0 0.043196619
## 196 196 4.03294286 0 0 0.044620043
## 31 31 3.82182965 0 0 0.050588968
## 44 44 3.82182965 0 0 0.050588968
## 49 49 3.70979489 0 0 0.054094038
## 116 116 3.70979489 0 0 0.054094038
## 171 171 3.70979489 0 0 0.054094038
## 24 24 3.37970765 0 0 0.066003762
## 64 64 3.37970765 0 0 0.066003762
## 83 83 3.34015836 0 0 0.067608102
## 138 138 3.25842886 0 0 0.071057178
## 17 17 3.06758106 0 0 0.079868232
## 32 32 3.06758106 0 0 0.079868232
## 119 119 3.06758106 0 0 0.079868232
## 34 34 3.01928394 0 0 0.082279788
## 184 184 3.00847947 0 0 0.082829957
## 68 68 2.86873133 0 0 0.090316039
## 77 77 2.86873133 0 0 0.090316039
## 94 94 2.86873133 0 0 0.090316039
## 25 25 2.63436604 0 0 0.104573796
## 47 47 2.63436604 0 0 0.104573796
## 71 71 2.63436604 0 0 0.104573796
## 125 125 2.63399309 0 0 0.104598356
## 128 128 2.63399309 0 0 0.104598356
## 6 6 2.61097975 0 0 0.106126177
## 10 10 2.61097975 0 0 0.106126177
## 85 85 2.61097975 0 0 0.106126177
## 121 121 2.61097975 0 0 0.106126177
## 182 182 2.61097975 0 0 0.106126177
## 110 110 2.43563022 0 0 0.118606218
## 134 134 2.43563022 0 0 0.118606218
## 136 136 2.43563022 0 0 0.118606218
## 90 90 2.39163211 0 0 0.121986211
## 8 8 2.33620536 0 0 0.126397119
## 200 200 2.33620536 0 0 0.126397119
## 50 50 2.31191714 0 0 0.128385615
## 127 127 2.31191714 0 0 0.128385615
## 130 130 2.31191714 0 0 0.128385615
## 175 175 2.31191714 0 0 0.128385615
## 185 185 2.31191714 0 0 0.128385615
## 194 194 2.31191714 0 0 0.128385615
## 27 27 2.20814334 0 0 0.137283808
## 29 29 2.20814334 0 0 0.137283808
## 123 123 2.20814334 0 0 0.137283808
## 51 51 2.19183832 0 0 0.138743637
## 133 133 2.19183832 0 0 0.138743637
## 57 57 2.17007958 0 0 0.140718952
## 108 108 2.12829885 0 0 0.144601158
## 183 183 1.97609842 0 0 0.159802035
## 80 80 1.89138536 0 0 0.169045749
## 126 126 1.89138536 0 0 0.169045749
## 109 109 1.70578063 0 0 0.191533728
## 129 129 1.70578063 0 0 0.191533728
## 181 181 1.70578063 0 0 0.191533728
## 122 122 1.46162827 0 0 0.226671223
## 172 172 1.46162827 0 0 0.226671223
## 14 14 1.40377972 0 0 0.236091747
## 105 105 1.40377972 0 0 0.236091747
## 204 204 1.40377972 0 0 0.236091747
## 16 16 1.30459828 0 0 0.253375000
## 52 52 1.30459828 0 0 0.253375000
## 72 72 1.30459828 0 0 0.253375000
## 87 87 1.30459828 0 0 0.253375000
## 95 95 1.30459828 0 0 0.253375000
## 143 143 1.30459828 0 0 0.253375000
## 70 70 1.19016357 0 0 0.275296566
## 96 96 1.19016357 0 0 0.275296566
## 2 2 1.16245019 0 0 0.280957972
## 22 22 1.16245019 0 0 0.280957972
## 75 75 1.16245019 0 0 0.280957972
## 131 131 1.16245019 0 0 0.280957972
## 106 106 1.00057036 0 0 0.317172538
## 107 107 1.00057036 0 0 0.317172538
## 120 120 1.00057036 0 0 0.317172538
## 160 160 1.00057036 0 0 0.317172538
## 193 193 1.00057036 0 0 0.317172538
## 56 56 0.96064811 0 0 0.327023640
## 124 124 0.96064811 0 0 0.327023640
## 205 205 0.96064811 0 0 0.327023640
## 54 54 0.91978336 0 0 0.337531860
## 55 55 0.91978336 0 0 0.337531860
## 135 135 0.91978336 0 0 0.337531860
## 179 179 0.91978336 0 0 0.337531860
## 188 188 0.91978336 0 0 0.337531860
## 1 1 0.86905665 0 0 0.351216601
## 19 19 0.86905665 0 0 0.351216601
## 178 178 0.86905665 0 0 0.351216601
## 23 23 0.82167395 0 0 0.364691196
## 42 42 0.82167395 0 0 0.364691196
## 65 65 0.82167395 0 0 0.364691196
## 78 78 0.82167395 0 0 0.364691196
## 79 79 0.82167395 0 0 0.364691196
## 115 115 0.82167395 0 0 0.364691196
## 144 144 0.82167395 0 0 0.364691196
## 153 153 0.82167395 0 0 0.364691196
## 168 168 0.82167395 0 0 0.364691196
## 191 191 0.82167395 0 0 0.364691196
## 93 93 0.78043626 0 0 0.377007750
## 206 206 0.78043626 0 0 0.377007750
## 101 101 0.76609739 0 0 0.381427069
## 158 158 0.76609739 0 0 0.381427069
## 186 186 0.76609739 0 0 0.381427069
## 190 190 0.76609739 0 0 0.381427069
## 166 166 0.58473862 0 0 0.444460906
## 177 177 0.58473862 0 0 0.444460906
## 199 199 0.58473862 0 0 0.444460906
## 132 132 0.45656892 0 0 0.499231872
## 155 155 0.45656892 0 0 0.499231872
## 165 165 0.45656892 0 0 0.499231872
## 4 4 0.33335675 0 0 0.563689168
## 12 12 0.33335675 0 0 0.563689168
## 53 53 0.33335675 0 0 0.563689168
## 76 76 0.33335675 0 0 0.563689168
## 103 103 0.33335675 0 0 0.563689168
## 111 111 0.33335675 0 0 0.563689168
## 159 159 0.33335675 0 0 0.563689168
## 5 5 0.24325082 0 0 0.621867999
## 9 9 0.24325082 0 0 0.621867999
## 11 11 0.24325082 0 0 0.621867999
## 20 20 0.24325082 0 0 0.621867999
## 33 33 0.24325082 0 0 0.621867999
## 40 40 0.24325082 0 0 0.621867999
## 45 45 0.24325082 0 0 0.621867999
## 74 74 0.24325082 0 0 0.621867999
## 99 99 0.24325082 0 0 0.621867999
## 113 113 0.24325082 0 0 0.621867999
## 208 208 0.24325082 0 0 0.621867999
## 3 3 0.15385645 0 0 0.694876906
## 21 21 0.15385645 0 0 0.694876906
## 36 36 0.15385645 0 0 0.694876906
## 39 39 0.15385645 0 0 0.694876906
## 48 48 0.15385645 0 0 0.694876906
## 61 61 0.15385645 0 0 0.694876906
## 69 69 0.15385645 0 0 0.694876906
## 92 92 0.15385645 0 0 0.694876906
## 100 100 0.15385645 0 0 0.694876906
## 102 102 0.15385645 0 0 0.694876906
## 117 117 0.15385645 0 0 0.694876906
## 142 142 0.15385645 0 0 0.694876906
## 145 145 0.15385645 0 0 0.694876906
## 148 148 0.15385645 0 0 0.694876906
## 157 157 0.15385645 0 0 0.694876906
## 170 170 0.15385645 0 0 0.694876906
## 197 197 0.15385645 0 0 0.694876906
## 198 198 0.15385645 0 0 0.694876906
## 35 35 0.11113322 0 0 0.738857652
## 46 46 0.11113322 0 0 0.738857652
## 59 59 0.11113322 0 0 0.738857652
## 67 67 0.11113322 0 0 0.738857652
## 91 91 0.11113322 0 0 0.738857652
## 104 104 0.11113322 0 0 0.738857652
## 114 114 0.11113322 0 0 0.738857652
## 118 118 0.11113322 0 0 0.738857652
## 147 147 0.11113322 0 0 0.738857652
## 15 15 0.09422431 0 0 0.758873977
## 18 18 0.09422431 0 0 0.758873977
## 30 30 0.09422431 0 0 0.758873977
## 38 38 0.09422431 0 0 0.758873977
## 58 58 0.09422431 0 0 0.758873977
## 84 84 0.09422431 0 0 0.758873977
## 86 86 0.09422431 0 0 0.758873977
## 149 149 0.09422431 0 0 0.758873977
## 151 151 0.09422431 0 0 0.758873977
## 164 164 0.09422431 0 0 0.758873977
## 176 176 0.09422431 0 0 0.758873977
## 187 187 0.09422431 0 0 0.758873977
## 201 201 0.09422431 0 0 0.758873977
## 202 202 0.09422431 0 0 0.758873977
## 203 203 0.09422431 0 0 0.758873977
# p-values are not smaller than .001, so the end result is in accordance with Cook's distance: no outliers here
# install.packages("performance")
library(performance)
# ---------------
# Cook's distance
# check outliers with Cook's distance, m2_lm (competence ~ score)
check_outliers(m2_lm, method = "cook") # result is OK, no outliers
## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)
# display Cook distance values in descending order
m2_lm_outlier_cook_list <- as.data.frame(check_outliers(m2_lm, method = "cook"))
m2_lm_outlier_cook_list[order(-m2_lm_outlier_cook_list$Distance_Cook),]
## Row Distance_Cook Outlier_Cook Outlier
## 112 112 4.349204e-02 0 0
## 154 154 4.349204e-02 0 0
## 137 137 3.376143e-02 0 0
## 207 207 3.020325e-02 0 0
## 173 173 2.782953e-02 0 0
## 64 64 2.181044e-02 0 0
## 161 161 2.138833e-02 0 0
## 32 32 2.075458e-02 0 0
## 196 196 2.075458e-02 0 0
## 34 34 1.787142e-02 0 0
## 138 138 1.787142e-02 0 0
## 163 163 1.634844e-02 0 0
## 116 116 1.617318e-02 0 0
## 169 169 1.617318e-02 0 0
## 184 184 1.537804e-02 0 0
## 89 89 1.536176e-02 0 0
## 171 171 1.491481e-02 0 0
## 140 140 1.462045e-02 0 0
## 97 97 1.321338e-02 0 0
## 50 50 1.282176e-02 0 0
## 175 175 1.282176e-02 0 0
## 194 194 1.282176e-02 0 0
## 122 122 1.266227e-02 0 0
## 108 108 1.255755e-02 0 0
## 150 150 1.172364e-02 0 0
## 44 44 1.111755e-02 0 0
## 88 88 1.111755e-02 0 0
## 98 98 1.111755e-02 0 0
## 28 28 1.089646e-02 0 0
## 110 110 1.089646e-02 0 0
## 26 26 1.078020e-02 0 0
## 66 66 1.078020e-02 0 0
## 17 17 1.059711e-02 0 0
## 37 37 9.283488e-03 0 0
## 31 31 9.253806e-03 0 0
## 152 152 9.253806e-03 0 0
## 90 90 9.011594e-03 0 0
## 127 127 9.011594e-03 0 0
## 130 130 9.011594e-03 0 0
## 49 49 8.923769e-03 0 0
## 119 119 8.897112e-03 0 0
## 63 63 8.734415e-03 0 0
## 180 180 8.386897e-03 0 0
## 109 109 8.377824e-03 0 0
## 73 73 8.147832e-03 0 0
## 139 139 8.079897e-03 0 0
## 7 7 8.046487e-03 0 0
## 24 24 7.912767e-03 0 0
## 83 83 7.912767e-03 0 0
## 82 82 7.570818e-03 0 0
## 183 183 7.467486e-03 0 0
## 16 16 7.199920e-03 0 0
## 124 124 7.199920e-03 0 0
## 195 195 6.892093e-03 0 0
## 95 95 6.809739e-03 0 0
## 185 185 6.525593e-03 0 0
## 70 70 6.435082e-03 0 0
## 129 129 6.435082e-03 0 0
## 141 141 5.992125e-03 0 0
## 10 10 5.854913e-03 0 0
## 14 14 5.842673e-03 0 0
## 105 105 5.842673e-03 0 0
## 125 125 5.842673e-03 0 0
## 182 182 5.153303e-03 0 0
## 6 6 4.816981e-03 0 0
## 121 121 4.767457e-03 0 0
## 85 85 4.697992e-03 0 0
## 189 189 4.697992e-03 0 0
## 143 143 4.658250e-03 0 0
## 60 60 4.220776e-03 0 0
## 96 96 3.609483e-03 0 0
## 204 204 3.566008e-03 0 0
## 132 132 3.494977e-03 0 0
## 206 206 3.494977e-03 0 0
## 52 52 3.407767e-03 0 0
## 156 156 3.392245e-03 0 0
## 166 166 3.368245e-03 0 0
## 128 128 3.103397e-03 0 0
## 134 134 3.103397e-03 0 0
## 87 87 3.042874e-03 0 0
## 205 205 3.042874e-03 0 0
## 172 172 3.012455e-03 0 0
## 57 57 2.951455e-03 0 0
## 181 181 2.862053e-03 0 0
## 146 146 2.851714e-03 0 0
## 56 56 2.830691e-03 0 0
## 93 93 2.784165e-03 0 0
## 72 72 2.393235e-03 0 0
## 136 136 2.393235e-03 0 0
## 1 1 2.130405e-03 0 0
## 55 55 1.981862e-03 0 0
## 179 179 1.981862e-03 0 0
## 76 76 1.893256e-03 0 0
## 155 155 1.778205e-03 0 0
## 188 188 1.773929e-03 0 0
## 51 51 1.501351e-03 0 0
## 133 133 1.501351e-03 0 0
## 80 80 1.457427e-03 0 0
## 41 41 1.417386e-03 0 0
## 43 43 1.259443e-03 0 0
## 174 174 1.259443e-03 0 0
## 178 178 1.259443e-03 0 0
## 126 126 1.245011e-03 0 0
## 165 165 1.242661e-03 0 0
## 101 101 1.190774e-03 0 0
## 147 147 1.190774e-03 0 0
## 208 208 1.190774e-03 0 0
## 54 54 1.089950e-03 0 0
## 135 135 1.089950e-03 0 0
## 167 167 1.089950e-03 0 0
## 19 19 1.086357e-03 0 0
## 177 177 9.504576e-04 0 0
## 111 111 9.119107e-04 0 0
## 199 199 9.119107e-04 0 0
## 4 4 8.460711e-04 0 0
## 103 103 8.090240e-04 0 0
## 53 53 8.086590e-04 0 0
## 159 159 7.831465e-04 0 0
## 12 12 7.683000e-04 0 0
## 25 25 6.264717e-04 0 0
## 193 193 6.264717e-04 0 0
## 33 33 5.450939e-04 0 0
## 35 35 5.450939e-04 0 0
## 46 46 5.450939e-04 0 0
## 99 99 5.450939e-04 0 0
## 190 190 5.450939e-04 0 0
## 59 59 4.757915e-04 0 0
## 100 100 4.741431e-04 0 0
## 102 102 4.741431e-04 0 0
## 144 144 4.741431e-04 0 0
## 58 58 4.105127e-04 0 0
## 15 15 3.543362e-04 0 0
## 86 86 3.543362e-04 0 0
## 120 120 3.543362e-04 0 0
## 157 157 3.543362e-04 0 0
## 198 198 3.543362e-04 0 0
## 20 20 3.181235e-04 0 0
## 162 162 3.181235e-04 0 0
## 22 22 2.800687e-04 0 0
## 158 158 2.800687e-04 0 0
## 61 61 2.618524e-04 0 0
## 11 11 2.145915e-04 0 0
## 40 40 2.145915e-04 0 0
## 75 75 2.145915e-04 0 0
## 114 114 2.145915e-04 0 0
## 123 123 2.145915e-04 0 0
## 27 27 2.098712e-04 0 0
## 29 29 2.098712e-04 0 0
## 113 113 2.098712e-04 0 0
## 68 68 2.060278e-04 0 0
## 9 9 2.013686e-04 0 0
## 91 91 2.013686e-04 0 0
## 104 104 2.013686e-04 0 0
## 186 186 2.013686e-04 0 0
## 77 77 2.008647e-04 0 0
## 84 84 1.920380e-04 0 0
## 45 45 1.902360e-04 0 0
## 67 67 1.902360e-04 0 0
## 74 74 1.717528e-04 0 0
## 131 131 1.662128e-04 0 0
## 38 38 1.643160e-04 0 0
## 81 81 1.612125e-04 0 0
## 94 94 1.570163e-04 0 0
## 118 118 1.570163e-04 0 0
## 13 13 1.539395e-04 0 0
## 2 2 1.539395e-04 0 0
## 5 5 1.531850e-04 0 0
## 18 18 1.407420e-04 0 0
## 47 47 1.407420e-04 0 0
## 78 78 1.407420e-04 0 0
## 117 117 1.407420e-04 0 0
## 92 92 1.042790e-04 0 0
## 200 200 9.057769e-05 0 0
## 164 164 7.939761e-05 0 0
## 21 21 7.039568e-05 0 0
## 170 170 7.039568e-05 0 0
## 153 153 6.325273e-05 0 0
## 202 202 6.325273e-05 0 0
## 203 203 5.767296e-05 0 0
## 62 62 5.338332e-05 0 0
## 23 23 4.769338e-05 0 0
## 106 106 4.769338e-05 0 0
## 149 149 4.769338e-05 0 0
## 42 42 4.443710e-05 0 0
## 148 148 4.443710e-05 0 0
## 107 107 4.326893e-05 0 0
## 30 30 4.220760e-05 0 0
## 48 48 4.220760e-05 0 0
## 71 71 4.220760e-05 0 0
## 145 145 4.220760e-05 0 0
## 36 36 4.112895e-05 0 0
## 176 176 4.112895e-05 0 0
## 8 8 3.992922e-05 0 0
## 160 160 3.992922e-05 0 0
## 192 192 3.992922e-05 0 0
## 65 65 3.685330e-05 0 0
## 69 69 3.685330e-05 0 0
## 191 191 3.685330e-05 0 0
## 115 115 3.487146e-05 0 0
## 142 142 3.487146e-05 0 0
## 151 151 3.255743e-05 0 0
## 197 197 3.255743e-05 0 0
## 3 3 2.694825e-05 0 0
## 187 187 2.694825e-05 0 0
## 39 39 2.026786e-05 0 0
## 201 201 2.026786e-05 0 0
## 79 79 1.310395e-05 0 0
## 168 168 1.310395e-05 0 0
# we see that observations 112 and 154 have the highest value (0.043), but also several other values are above the 0.02 level
# however, the test indicates that we do not have an outlier problem here
# --------------------
# Mahalanobis distance
# to the a second opinion, we turn to Mahalanobis distance
# https://www.statology.org/mahalanobis-distance-r/
check_outliers(dfttu[c("competence","score")], method = "mahalanobis")
## OK: No outliers detected.
## - Based on the following method and threshold: mahalanobis (13.82).
## - For variables: competence, score
# we have an indication of two possible outliers: 112 and 154
# this finding is against the Cook distance result above
# display Mahalanobis values in descending order
m2_lm_outlier_mah_list <- as.data.frame(check_outliers(dfttu[c("competence","score")], method = "mahalanobis"))
m2_lm_outlier_mah_list <- m2_lm_outlier_mah_list[order(-m2_lm_outlier_mah_list$Distance_Mahalanobis),]
# add also p-values, usually we look for values smaller than .001
m2_lm_outlier_mah_list$pvalue <- pchisq(m2_lm_outlier_mah_list$Distance_Mahalanobis,
df=1,
lower.tail=FALSE
)
m2_lm_outlier_mah_list
## Row Distance_Mahalanobis Outlier_Mahalanobis Outlier pvalue
## 112 112 7.87183918 0 0 0.005021056
## 154 154 7.87183918 0 0 0.005021056
## 137 137 6.36673337 0 0 0.011627945
## 207 207 6.08041175 0 0 0.013668891
## 163 163 5.78638138 0 0 0.016150801
## 173 173 5.72932169 0 0 0.016683926
## 97 97 5.41252542 0 0 0.019992772
## 32 32 4.95018461 0 0 0.026087883
## 196 196 4.95018461 0 0 0.026087883
## 161 161 4.94845464 0 0 0.026114001
## 64 64 4.94233364 0 0 0.026206631
## 184 184 4.67592755 0 0 0.030588181
## 34 34 4.63618566 0 0 0.031304507
## 138 138 4.63618566 0 0 0.031304507
## 116 116 4.59629592 0 0 0.032041111
## 169 169 4.59629592 0 0 0.032041111
## 140 140 4.42036779 0 0 0.035512381
## 171 171 4.22636174 0 0 0.039800670
## 44 44 4.00226074 0 0 0.045439277
## 88 88 4.00226074 0 0 0.045439277
## 98 98 4.00226074 0 0 0.045439277
## 89 89 4.00001978 0 0 0.045499730
## 122 122 3.99755297 0 0 0.045566373
## 28 28 3.87655879 0 0 0.048964821
## 110 110 3.87655879 0 0 0.048964821
## 31 31 3.74866970 0 0 0.052849558
## 152 152 3.74866970 0 0 0.052849558
## 108 108 3.69213096 0 0 0.054669728
## 49 49 3.65959467 0 0 0.055747194
## 150 150 3.65445988 0 0 0.055919283
## 50 50 3.60082660 0 0 0.057750849
## 175 175 3.60082660 0 0 0.057750849
## 194 194 3.60082660 0 0 0.057750849
## 26 26 3.54359614 0 0 0.059775730
## 66 66 3.54359614 0 0 0.059775730
## 37 37 3.37670733 0 0 0.066124034
## 180 180 3.36840253 0 0 0.066458167
## 95 95 3.33812441 0 0 0.067691729
## 63 63 3.32068226 0 0 0.068413432
## 139 139 3.31475026 0 0 0.068660752
## 7 7 3.26347079 0 0 0.070839041
## 24 24 3.26228439 0 0 0.070890306
## 83 83 3.26228439 0 0 0.070890306
## 17 17 3.22237261 0 0 0.072638275
## 79 79 3.22031170 0 0 0.072729780
## 168 168 3.22031170 0 0 0.072729780
## 82 82 3.08763855 0 0 0.078889218
## 57 57 3.01128493 0 0 0.082686721
## 109 109 2.86015850 0 0 0.090798551
## 90 90 2.85794045 0 0 0.090923847
## 127 127 2.85794045 0 0 0.090923847
## 130 130 2.85794045 0 0 0.090923847
## 119 119 2.82573249 0 0 0.092764562
## 14 14 2.78485121 0 0 0.095159531
## 105 105 2.78485121 0 0 0.095159531
## 125 125 2.78485121 0 0 0.095159531
## 76 76 2.78416424 0 0 0.095200347
## 25 25 2.75807598 0 0 0.096764549
## 193 193 2.75807598 0 0 0.096764549
## 16 16 2.71644017 0 0 0.099319296
## 124 124 2.71644017 0 0 0.099319296
## 11 11 2.71321898 0 0 0.099519991
## 40 40 2.71321898 0 0 0.099519991
## 75 75 2.71321898 0 0 0.099519991
## 114 114 2.71321898 0 0 0.099519991
## 123 123 2.71321898 0 0 0.099519991
## 101 101 2.70501272 0 0 0.100033284
## 147 147 2.70501272 0 0 0.100033284
## 208 208 2.70501272 0 0 0.100033284
## 73 73 2.65483177 0 0 0.103235678
## 70 70 2.59612151 0 0 0.107125586
## 129 129 2.59612151 0 0 0.107125586
## 39 39 2.57312701 0 0 0.108692685
## 201 201 2.57312701 0 0 0.108692685
## 195 195 2.56106303 0 0 0.109524915
## 166 166 2.53963412 0 0 0.111020510
## 183 183 2.50221060 0 0 0.113686620
## 185 185 2.40752720 0 0 0.120752976
## 141 141 2.38411941 0 0 0.122573945
## 100 100 2.16225081 0 0 0.141437377
## 102 102 2.16225081 0 0 0.141437377
## 144 144 2.16225081 0 0 0.141437377
## 10 10 2.15402444 0 0 0.142196734
## 27 27 2.12779045 0 0 0.144649134
## 29 29 2.12779045 0 0 0.144649134
## 113 113 2.12779045 0 0 0.144649134
## 132 132 2.06322016 0 0 0.150891008
## 206 206 2.06322016 0 0 0.150891008
## 182 182 2.01329811 0 0 0.155926025
## 3 3 1.99906055 0 0 0.157396735
## 187 187 1.99906055 0 0 0.157396735
## 6 6 1.98137454 0 0 0.159245677
## 121 121 1.94569001 0 0 0.163052051
## 85 85 1.93930530 0 0 0.163743984
## 189 189 1.93930530 0 0 0.163743984
## 58 58 1.89175757 0 0 0.169003819
## 68 68 1.86249552 0 0 0.172337058
## 143 143 1.84879446 0 0 0.173923671
## 60 60 1.68722113 0 0 0.193967691
## 15 15 1.63954388 0 0 0.200388061
## 86 86 1.63954388 0 0 0.200388061
## 120 120 1.63954388 0 0 0.200388061
## 157 157 1.63954388 0 0 0.200388061
## 198 198 1.63954388 0 0 0.200388061
## 9 9 1.61548015 0 0 0.203723170
## 91 91 1.61548015 0 0 0.203723170
## 104 104 1.61548015 0 0 0.203723170
## 186 186 1.61548015 0 0 0.203723170
## 93 93 1.61459690 0 0 0.203846822
## 204 204 1.56374097 0 0 0.211118310
## 1 1 1.54654521 0 0 0.213646186
## 151 151 1.49811232 0 0 0.220962042
## 197 197 1.49811232 0 0 0.220962042
## 96 96 1.45161788 0 0 0.228268528
## 52 52 1.38999552 0 0 0.238405856
## 156 156 1.37037500 0 0 0.241747464
## 188 188 1.31856551 0 0 0.250849658
## 43 43 1.31250943 0 0 0.251940812
## 174 174 1.31250943 0 0 0.251940812
## 178 178 1.31250943 0 0 0.251940812
## 87 87 1.30294067 0 0 0.253676774
## 205 205 1.30294067 0 0 0.253676774
## 115 115 1.27505753 0 0 0.258820030
## 142 142 1.27505753 0 0 0.258820030
## 128 128 1.27362165 0 0 0.259088358
## 134 134 1.27362165 0 0 0.259088358
## 172 172 1.24038712 0 0 0.265396459
## 56 56 1.19995986 0 0 0.273329701
## 61 61 1.18995517 0 0 0.275338599
## 181 181 1.18250198 0 0 0.276847230
## 45 45 1.17628807 0 0 0.278112969
## 67 67 1.17628807 0 0 0.278112969
## 146 146 1.17552734 0 0 0.278268424
## 33 33 1.17097813 0 0 0.279200352
## 35 35 1.17097813 0 0 0.279200352
## 46 46 1.17097813 0 0 0.279200352
## 99 99 1.17097813 0 0 0.279200352
## 190 190 1.17097813 0 0 0.279200352
## 55 55 1.07875867 0 0 0.298975416
## 179 179 1.07875867 0 0 0.298975416
## 65 65 1.07028231 0 0 0.300881688
## 69 69 1.07028231 0 0 0.300881688
## 191 191 1.07028231 0 0 0.300881688
## 51 51 0.99366388 0 0 0.318848536
## 133 133 0.99366388 0 0 0.318848536
## 72 72 0.99092175 0 0 0.319517194
## 136 136 0.99092175 0 0 0.319517194
## 59 59 0.97928415 0 0 0.322375603
## 155 155 0.93670504 0 0 0.333126664
## 41 41 0.86730768 0 0 0.351701735
## 84 84 0.81348469 0 0 0.367092004
## 8 8 0.71557053 0 0 0.397600712
## 160 160 0.71557053 0 0 0.397600712
## 192 192 0.71557053 0 0 0.397600712
## 80 80 0.70743645 0 0 0.400296163
## 177 177 0.69699735 0 0 0.403794471
## 126 126 0.65786691 0 0 0.417314015
## 38 38 0.65266878 0 0 0.419160134
## 19 19 0.60529856 0 0 0.436563499
## 36 36 0.56563397 0 0 0.451999095
## 176 176 0.56563397 0 0 0.451999095
## 165 165 0.55128609 0 0 0.457792673
## 131 131 0.51725859 0 0 0.472013440
## 20 20 0.51387954 0 0 0.473464240
## 162 162 0.51387954 0 0 0.473464240
## 18 18 0.51013243 0 0 0.475081534
## 47 47 0.51013243 0 0 0.475081534
## 78 78 0.51013243 0 0 0.475081534
## 117 117 0.51013243 0 0 0.475081534
## 4 4 0.49891865 0 0 0.479975643
## 54 54 0.45834005 0 0 0.498400776
## 135 135 0.45834005 0 0 0.498400776
## 167 167 0.45834005 0 0 0.498400776
## 111 111 0.43778612 0 0 0.508192940
## 199 199 0.43778612 0 0 0.508192940
## 30 30 0.43397697 0 0 0.510043933
## 48 48 0.43397697 0 0 0.510043933
## 71 71 0.43397697 0 0 0.510043933
## 145 145 0.43397697 0 0 0.510043933
## 103 103 0.42729864 0 0 0.513317451
## 81 81 0.39820011 0 0 0.528020264
## 22 22 0.39530378 0 0 0.529524602
## 158 158 0.39530378 0 0 0.529524602
## 159 159 0.37395818 0 0 0.540854587
## 53 53 0.34339192 0 0 0.557877857
## 12 12 0.32211593 0 0 0.570338812
## 107 107 0.32059953 0 0 0.571247572
## 92 92 0.27989840 0 0 0.596767817
## 42 42 0.22550164 0 0 0.634879540
## 148 148 0.22550164 0 0 0.634879540
## 200 200 0.19220072 0 0 0.661091256
## 5 5 0.15070201 0 0 0.697865393
## 77 77 0.14925385 0 0 0.699249431
## 164 164 0.12278260 0 0 0.726035919
## 2 2 0.10476176 0 0 0.746188483
## 13 13 0.10476176 0 0 0.746188483
## 23 23 0.09014454 0 0 0.763993481
## 106 106 0.09014454 0 0 0.763993481
## 149 149 0.09014454 0 0 0.763993481
## 94 94 0.07710106 0 0 0.781265092
## 118 118 0.07710106 0 0 0.781265092
## 74 74 0.07661834 0 0 0.781933541
## 21 21 0.07164403 0 0 0.788957947
## 170 170 0.07164403 0 0 0.788957947
## 153 153 0.03878502 0 0 0.843875178
## 202 202 0.03878502 0 0 0.843875178
## 62 62 0.02790567 0 0 0.867330792
## 203 203 0.02420556 0 0 0.876363008
# p-values are not smaller than .001, so the end result is in accordance with Cook distance: no outliers here
Simple regression has only one x (or ‘independent', ‘predictor', ‘exogenous', ‘explanatory', ‘manifest') variable and one y (or ‘dependent', ‘endogenous', ‘response') variable.
y = b0 + b1*x
y predicted value of the dependent/endogenous/response variable
b0 intercept
b1 regression coefficient
x value of the independent/predictor/exogenous/explanatory variable
We specify a model that predicts competence with autonomy.
Lectures (PPT) contain more information about the different "roles" of the variables in SEM.
# specify simple regression model
regression_simple <- '
# regression
competence ~ 1 + autonomy
# variance
autonomy ~~ autonomy
'
# estimate model with lavaan
regression_simple_fit <- sem(regression_simple,
data=dfttu)
# display unstandardized results in LISREL format
summary(regression_simple_fit,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 22 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 208
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomy 0.752 0.045 16.562 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 1.061 0.174 6.093 0.000
## autonomy 3.786 0.042 90.196 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomy 0.366 0.036 10.198 0.000
## .competence 0.157 0.015 10.198 0.000
##
## R-Square:
## Estimate
## competence 0.569
In the output we see several values, of which the most important are:
Estimate: Parameter value for each model parameter (here unstandardized "B")
Std.err: Standard error for each estimated parameter (average distance of observed values from the regression line)
Z-value: Wald statistic (Z = Estimate / Std.err)
P(>|z|)): p-value for testing the null hypothesis that the parameter equals zero in the population
R-Square: Explained variance of the dependent variable (DV)
According to the output above, the intercept (average value) of autonomy is 3.786. Interpretation of the intercept of competence is a bit odd as the result now says that "Competence gets the value of 1.061 when predictor variable (autonomy) is set to zero." We do not have a zero value for autonomy (range 1-5), but we can create it with "centering grand mean" (CGM): autonomy - mean(autonomy).
# of course we can change autonomy and competency values from 1-5 -> 0-4
dfttu$autonomy04 <- dfttu$autonomy - 1
dfttu$competence04 <- dfttu$competence - 1
# but still, it makes more sense to interpret zero as an average value of autonomy
# Note: usually we do not center the DV!
# centering by the grand mean
dfttu$autonomyCGM1 <- dfttu$autonomy - mean(dfttu$autonomy)
# or
# install.packages("misty")
library(misty) # center()
dfttu$autonomyCGM2 <- dfttu$autonomy # we need to do this extra step as center() writes new values directly into df given in the first parameter
dfttu$autonomyCGM2 <- center(dfttu$autonomyCGM2, # function writes new values into this variable
type = "CGM", # center grand mean, other option is "CWC" (center withing cluster)
cluster = NULL, # CWC needs the cluster name (e.g., "group")
value = NULL # wish to center to a specific value?
)
# check how different "autonomy" variants look now
# table
print(dfttu[c(1:5),c("autonomy","autonomyCGM1","autonomyCGM2")])
## autonomy autonomyCGM1 autonomyCGM2
## 1 3.75 -0.03605769 -0.03605769
## 2 3.25 -0.53605769 -0.53605769
## 3 4.00 0.21394231 0.21394231
## 4 4.00 0.21394231 0.21394231
## 5 3.50 -0.28605769 -0.28605769
# plots
par(mfrow = c(1, 2)) # place plots in one row and two columns
plot(density(dfttu$autonomy), col = "black", main = "Autonomy original")
plot(density(dfttu$autonomyCGM1), col = "blue", main = "Autonomy CGM")
# original mean value of autonomy
mean(dfttu$autonomy) # 3.786058
## [1] 3.786058
# grand mean centered mean value of autonomy
mean(dfttu$autonomyCGM1) # -2.049622e-16 # naturally we have here exactly the same values
## [1] 2.297841e-16
mean(dfttu$autonomyCGM2) # -2.049622e-16
## [1] -2.465976e-16
As the plot shows, the shape of the density distribution does not change, only the average value of autonomy moves to zero.
Now we can re-run the analysis ..
# specify simple regression
regression_simple_CGM <- '
# regression
competence ~ 1 + autonomyCGM1
# variance
autonomyCGM1 ~~ autonomyCGM1
'
# estimate model
regression_simple_fit_CGM <- sem(regression_simple_CGM,
data=dfttu,
group = NULL, # no grouping variable (e.g., gender)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
)
# display unstandardized results in LISREL format
summary(regression_simple_fit_CGM,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 14 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 208
## Number of missing patterns 1
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomyCGM1 0.752 0.045 16.562 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 3.907 0.027 142.170 0.000
## autonomyCGM1 0.000 0.042 0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomyCGM1 0.366 0.036 10.198 0.000
## .competence 0.157 0.015 10.198 0.000
##
## R-Square:
## Estimate
## competence 0.569
Results stay the same, but now we can say that "competence has an average value of 3.9 when autonomy is on its average level."
# extract values from the lavaan output for the Results section
# unstandardized output values of competence ~ autonomy
reg_m1_aut_est <- parameterEstimates(regression_simple_fit_CGM)$est[2] # estimate, "B"
reg_m1_aut_se <- parameterEstimates(regression_simple_fit_CGM)$se[2] # Standard error of the estimate
reg_m1_aut_z <- parameterEstimates(regression_simple_fit_CGM)$z[2] # Z value
reg_m1_aut_p <- parameterEstimates(regression_simple_fit_CGM)$pvalue[2] # p-value
reg_m1_aut_r2 <- as.numeric(lavaan::inspect(regression_simple_fit_CGM,'r2')[1]) # R^2 value Note: need to use "as.numeric()" to extract numerical value
Results of linear regression with R lavaan (Rosseel, 2012) showed that autonomy was statistically significant positive predictor of competence, B = 0.752, Z = 16.562, p = 0.000. This unstandardized result indicates that one unit increase in autonomy increases competence by 0.752 units. This model explains 56.87% of the variance of competence variable.
What if we had some missing values? We created earlier a data frame ("dfttuNA") with approximately 40% of missing values in autonomy variable. Let's re-run the analysis with that data.
# estimate model with lavaan
regression_simple_fit_NA <- sem(regression_simple,
data=dfttuNA)
# display unstandardized results in LISREL format
summary(regression_simple_fit_NA,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 32 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Used Total
## Number of observations 129 208
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomy 0.756 0.063 12.003 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 1.052 0.240 4.375 0.000
## autonomy 3.773 0.051 73.543 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomy 0.340 0.042 8.031 0.000
## .competence 0.174 0.022 8.031 0.000
##
## R-Square:
## Estimate
## competence 0.528
Output shows that only 129 rows (of the complete data of 208 rows) have been analysed. Results stay quite the same … except that the standard errors are bigger (indicating worse predicting power of the model).
Instead of applying listwise deletion to the rows with missing data, we can use multiple imputation (MI) for the missing values. However, it's use is not that straightforward as it generates several imputed data sets that need to be analysed at the same time (as pooled data). This requires additional steps in the modeling process.
Full Information Maximum Likelihood (FIML) estimator (e.g., Cham et al., 2017) is a good alternative to MI, as it is part of the model estimation process in most SEM software packages (e.g., Mplus, AMOS, lavaan).
Next we utilize FIML to analyze the data. FIML estimates a likelihood function for each participant based on the observed values (so there has to be at least one observed value for each participant).
regression_simple_fit_NA_FIML <- sem(regression_simple,
data=dfttuNA,
group = NULL, # no grouping variable (e.g., gender)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml", # full information maximum likelihood approach (good if there are missing observations)
fixed.x = FALSE # instead of using the default fixed.x = TRUE, we ask to estimate the means and variances of the variables
)
# display unstandardized results in LISREL format
summary(regression_simple_fit_NA_FIML,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 19 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 208
## Number of missing patterns 2
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomy 0.753 0.057 13.274 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 1.066 0.217 4.919 0.000
## autonomy 3.776 0.046 82.457 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomy 0.338 0.040 8.509 0.000
## .competence 0.173 0.021 8.419 0.000
##
## R-Square:
## Estimate
## competence 0.525
Summary shows that there are two missing data patterns: First one is the complete observations pattern and the second one has missing values in the autonomy variable. Number of observations is again 208: FIML allows the analysis of all rows (participants), also those containing missing data. Results stay the same, but now the standard errors are smaller.
Multiple regression means that we add predictors to the model. We specify a model that predicts competence with autonomy and score. We use centered autonomy ("autonomyCGM") but retain score uncentered (as it has a meaningful zero value: it is plausible to get zero points from the math test).
# specify multiple regression
regression_multiple <- '
# regression
competence ~ 1 + autonomyCGM1 + score
# variances
autonomyCGM1 ~~ autonomyCGM1
score ~~ score
'
# estimate model
regression_multiple_fit <- sem(regression_multiple,
data=dfttu,
group = NULL, # no grouping variable (e.g., gender)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
)
Now we get a warning that the variances of autonomy and competence are quite different. We follow the instruction and look into their variances:
# check variances
varTable(regression_multiple_fit)
## name idx nobs type exo user mean var nlev lnam
## 1 competence 4 208 numeric 0 0 3.907 0.366 0
## 2 autonomyCGM1 14 208 numeric 0 0 0.000 0.368 0
## 3 score 7 208 numeric 0 0 47.756 847.882 0
# extract variances of autonomy and score
reg_m2_aut_s2 <- varTable(regression_multiple_fit)$var[2] # s2 is abbreviation of s^2 = sigma^2 = variance
reg_m2_score_s2 <- varTable(regression_multiple_fit)$var[3]
Indeed, the "var" column of the output indicates that the variance of autonomy is 0.368 and the variance of score is 847.882 - quite a difference! Next we will use R base library's scale() function to center and scale both autonomy and score (i.e., set their mean M to zero and standard deviation SD to one).
# We start by plotting density of autonomy and score before any centering or scaling actions
# The probability density function describes the probability of the variable taking certain value
# It is a smoothed (Parzen–Rosenblatt / kernel estimator) version of the histogram
# Default kernel is "Gaussian", but there are several options
# Bandwidth controls the smoothing: Too big values oversmooth and too small values undersmooth (overfit)
# density() uses "rule of thumb" as default method
# https://r-coder.com/density-plot-r/
# AUTONOMY
# create a histogram
hist(dfttu$autonomy, freq = F, main = "Autonomy histogram and density")
# add density line to the histogram
lines(density(dfttu$autonomy), lwd = 2, col = "blue")
# compare bandwidth settings
par(mfrow = c(1, 3)) # place plots in one row and three columns
plot(density(dfttu$autonomy,
kernel="gaussian", # type "?density()" in Console to see other options
bw="nrd0"), # default bw is Rule of Thumb (in this case 0.1155)
col = "black", main = "Aut bw=RoT")
plot(density(dfttu$autonomy, kernel="gaussian", bw=0.05), # bw set too small
col = "blue", main = "Aut bw=0.05")
plot(density(dfttu$autonomy, kernel="gaussian", bw=0.5), # set too big
col = "red", main = "Aut bw=0.5")
# # other bw options if you wish to try
# plot(density(dfttu$autonomy, kernel="gaussian", bw="UCV"), # Unbiased cross validation
# col = "green", main = "Aut bw=UCV")
# plot(density(dfttu$autonomy, kernel="gaussian", bw="SJ"), # Plug-in by Sheather and Jones
# col = "orange", main = "Aut bw=SJ")
# using scale() function ... AND DOING NOTHING :-D
aut_ori <- scale(dfttu$autonomy,
center = F, # DO NOT center (F = FALSE)
scale = F # DO NOT scale (F = FALSE)
)
# -----
# SCORE
# histogram + density line
hist(dfttu$score, freq = F, main = "Score histogram and density")
lines(density(dfttu$score), lwd = 2, col = "darkred")
# AUTONOMY AND SCORE
d_a <- density(dfttu$autonomy)
d_s <- density(dfttu$score)
plot(d_a, lwd = 2, col = "darkblue",
main = "Autonomy and score densities", xlab = "",
xlim = c(min(d_a$x, d_s$x), c(max(d_a$x, d_s$x))), # Min and Max X-axis limits
ylim = c(min(d_a$y, d_s$y), c(max(d_a$y, d_s$y)))) # Min and Max Y-axis limits
lines(d_s, col = "darkred", lwd = 2)
Last figure clearly shows that there is a need for centering and scaling if we use autonomy and score in the same analysis as predictors.
# -----------------
# CENTERED autonomy
# center = subtract mean of autonomy from each autonomy value
# to produce M = 0
# equation approach
dfttu$autonomyCentered <- dfttu$autonomy - mean(dfttu$autonomy)
# using scale() function
aut_cen <- scale(dfttu$autonomy,
center = T, # center (T = TRUE)
scale = F # do not scale (F = FALSE)
)
# plot centered autonomy
plot(density(aut_cen), col = "black", main = "Autonomy Centered")
Centering works as it should: The shape of distribution is the same, but the average value has moved to zero.
# ---------------
# SCALED autonomy
# scale = divide each autonomy value by
# root mean square RMS (scale function default if not centered)
# or
# standard deviation SD (scale function default if centered)
# to make all variables comparable (by one unit change)
# equation approach
# RMS
dfttu$autonomyScaledRMS <- dfttu$autonomy / sqrt(sum(dfttu$autonomy^2)/(length(dfttu$autonomy)-1))
# SD
dfttu$autonomyScaledSD <- dfttu$autonomy / sd(dfttu$autonomy)
# using scale() function
# RMS
aut_scaRMS <- scale(dfttu$autonomy,
center = F, # do not center (F = FALSE)
scale = T # scale (T = TRUE)
)
# SD
aut_scaSD <- scale(dfttu$autonomy,
center = F,
scale = apply(dfttu[,4,drop=F], # fourth var (autonomy) in df, dimensions of df are kept
2, # 1=apply to rows, 2=apply to columns
sd) # calculate SD
)
# plot RMS and SD scaled autonomy
d_a_scaRMS <- density(aut_scaRMS)
d_a_scaSD <- density(aut_scaSD)
plot(d_a_scaRMS, lwd = 2, col = "darkblue",
main = "Autonomy RMS (blue) and SD (red) scaled densities", xlab = "",
xlim = c(min(d_a_scaRMS$x, d_a_scaSD$x), c(max(d_a_scaRMS$x, d_a_scaSD$x))), # Min and Max X-axis limits
ylim = c(min(d_a_scaRMS$y, d_a_scaSD$y), c(max(d_a_scaRMS$y, d_a_scaSD$y)))) # Min and Max Y-axis limits
lines(d_a_scaSD, col = "darkred", lwd = 2)
# ----------------------------
# CENTERED and SCALED autonomy
# equation approach
dfttu$autonomyCenSca <- (dfttu$autonomy - mean(dfttu$autonomy)) / sd(dfttu$autonomy)
# using scale() function
aut_cen_sca <- scale(dfttu$autonomy, center = T, scale = T)
# plot centered and scaled autonomy
plot(density(aut_cen_sca), col = "blue", main = "Aut Scale RMS")
# --------------------------------------------------------------------------
# plot 1) original, 2) centered, 3) scaled and 4) centered & scaled autonomy
par(mfrow = c(2, 2))
plot(density(dfttu$autonomy), col = "black", main = "Autonomy")
plot(density(dfttu$autonomyCentered), col = "blue", main = "Autonomy Centered")
plot(density(dfttu$autonomyScaledRMS), col = "red", main = "Autonomy Scaled")
plot(density(dfttu$autonomyCenSca), col = "green", main = "Autonomy Centered and Scaled")
# --------------------
# compare descriptives
df_scale <- data.frame(aut_ori, aut_cen, aut_scaRMS, aut_scaSD, aut_cen_sca)
library(psych)
describe(df_scale)
## vars n mean sd median trimmed mad min max range skew
## aut_ori 1 208 3.79 0.61 3.75 3.79 0.37 2.25 5.00 2.75 -0.18
## aut_cen 2 208 0.00 0.61 -0.04 0.00 0.37 -1.54 1.21 2.75 -0.18
## aut_scaRMS 3 208 0.99 0.16 0.98 0.99 0.10 0.59 1.30 0.72 -0.18
## aut_scaSD 4 208 6.26 1.00 6.20 6.26 0.61 3.72 8.26 4.55 -0.18
## aut_cen_sca 5 208 0.00 1.00 -0.06 0.01 0.61 -2.53 2.00 4.53 -0.18
## kurtosis se
## aut_ori 0.03 0.04
## aut_cen 0.03 0.04
## aut_scaRMS 0.03 0.01
## aut_scaSD 0.03 0.07
## aut_cen_sca 0.03 0.07
Figures above show that the distributional shape is the same, but the average point varies. The last (green) density line represents both centered and scaled autonomy with M = 0 and SD = 1 (this can also be seen in the last row of the table "aut_cen_sca"). Note that the mad stands for "the median of the absolute deviations from the median".
Next we do centering and scaling to score.
# center and scale
dfttu$scoreCenSca <- (dfttu$score - mean(dfttu$score)) / sd(dfttu$score)
# plot against the original score
d_s <- density(dfttu$score)
d_s_cenSca <- density(dfttu$scoreCenSca)
plot(d_s, lwd = 2, col = "black",
main = "Score original (black), centered and scaled (red) densities", xlab = "",
xlim = c(min(d_s$x, d_s_cenSca$x), c(max(d_s$x, d_s_cenSca$x))), # Min and Max X-axis limits
ylim = c(min(d_s$y, d_s_cenSca$y), c(max(d_s$y, d_s_cenSca$y)))) # Min and Max Y-axis limits
lines(d_s_cenSca, col = "darkred", lwd = 2)
Red density line is the centered and scaled mathematics score. Next we plot autonomy and score into same linegraph.
# we run everything once more ...
# center and scale
dfttu$autonomyCenSca <- (dfttu$autonomy - mean(dfttu$autonomy)) / sd(dfttu$autonomy)
dfttu$scoreCenSca <- (dfttu$score - mean(dfttu$score)) / sd(dfttu$score)
# plot densities
da <- density(dfttu$autonomyCenSca)
ds <- density(dfttu$scoreCenSca)
plot(da, lwd = 2, col = "blue",
main = "Centered and scaled densities of Autonomy (blue) and Score (red)", xlab = "",
xlim = c(min(da$x, ds$x), c(max(da$x, ds$x))), # Min and Max X-axis limits
ylim = c(min(da$y, ds$y), c(max(da$y, ds$y))) # Min and Max Y-axis limits
)
lines(ds, col = "darkred", lwd = 2)
Now we have both predictors centered and scaled (mean is zero and standard deviation is one). We re-run the analysis with the new variables.
# specify multiple regression2
regression_multiple2 <- '
# regression
competence ~ 1 + autonomyCenSca + scoreCenSca
# variances
autonomyCenSca ~~ autonomyCenSca
scoreCenSca ~~ scoreCenSca
'
# estimate model2
regression_multiple_fit2 <- sem(regression_multiple2,
data=dfttu,
group = NULL, # no grouping variable (e.g., gender)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
)
# check variances
varTable(regression_multiple_fit2)
## name idx nobs type exo user mean var nlev lnam
## 1 competence 4 208 numeric 0 0 3.907 0.366 0
## 2 autonomyCenSca 19 208 numeric 0 0 0.000 1.000 0
## 3 scoreCenSca 20 208 numeric 0 0 0.000 1.000 0
Now the mean for both scaled predictors is zero and the sd is one. Let's see the outputs of the original (uncentered/scaled) and new analyses ..
# display unstandardized results in LISREL format
# original variables
summary(regression_multiple_fit,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 22 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 208
## Number of missing patterns 1
##
## Model Test User Model:
##
## Test statistic 0.042
## Degrees of freedom 1
## P-value (Chi-square) 0.838
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomyCGM1 0.753 0.045 16.698 0.000
## score 0.002 0.001 1.689 0.091
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 3.832 0.053 72.939 0.000
## autonomyCGM1 0.000 0.042 0.000 1.000
## score 47.756 2.014 23.711 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomyCGM1 0.366 0.036 10.198 0.000
## score 843.806 82.742 10.198 0.000
## .competence 0.155 0.015 10.198 0.000
##
## R-Square:
## Estimate
## competence 0.575
# scaled variables
summary(regression_multiple_fit2,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 8 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 208
## Number of missing patterns 1
##
## Model Test User Model:
##
## Test statistic 0.042
## Degrees of freedom 1
## P-value (Chi-square) 0.838
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## competence ~
## autonomyCenSca 0.457 0.027 16.698 0.000
## scoreCenSca 0.046 0.027 1.689 0.091
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .competence 3.907 0.027 143.141 0.000
## autonomyCenSca 0.000 0.069 0.000 1.000
## scoreCenSca -0.000 0.069 -0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## autonomyCenSca 0.995 0.098 10.198 0.000
## scoreCenSca 0.995 0.098 10.198 0.000
## .competence 0.155 0.015 10.198 0.000
##
## R-Square:
## Estimate
## competence 0.575
Results are identical and the interpretation stays the same - except that after centering and scaling the intercept value of competence refers to a situation where predictors (autonomy, score) are on their average level (M = 0).
# unstandardized ("B")
# uncentered and unscaled (= original)
parameterEstimates(regression_multiple_fit)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 competence ~1 3.832 0.053 72.939 0.000 3.729 3.935
## 2 competence ~ autonomyCGM1 0.753 0.045 16.698 0.000 0.665 0.841
## 3 competence ~ score 0.002 0.001 1.689 0.091 0.000 0.003
## 4 autonomyCGM1 ~~ autonomyCGM1 0.366 0.036 10.198 0.000 0.296 0.437
## 5 score ~~ score 843.806 82.742 10.198 0.000 681.635 1005.977
## 6 competence ~~ competence 0.155 0.015 10.198 0.000 0.125 0.185
## 7 autonomyCGM1 ~1 0.000 0.042 0.000 1.000 -0.082 0.082
## 8 score ~1 47.756 2.014 23.711 0.000 43.809 51.704
# centered and scaled
parameterEstimates(regression_multiple_fit2)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 competence ~1 3.907 0.027 143.141 0.000 3.854 3.961
## 2 competence ~ autonomyCenSca 0.457 0.027 16.698 0.000 0.403 0.511
## 3 competence ~ scoreCenSca 0.046 0.027 1.689 0.091 -0.007 0.100
## 4 autonomyCenSca ~~ autonomyCenSca 0.995 0.098 10.198 0.000 0.804 1.186
## 5 scoreCenSca ~~ scoreCenSca 0.995 0.098 10.198 0.000 0.804 1.186
## 6 competence ~~ competence 0.155 0.015 10.198 0.000 0.125 0.185
## 7 autonomyCenSca ~1 0.000 0.069 0.000 1.000 -0.136 0.136
## 8 scoreCenSca ~1 0.000 0.069 0.000 1.000 -0.136 0.136
Although unstandardized estimates differ after centering and scaling, conclusions remain similar: Autonomy is a stat sig predictor of competence.
# standardized ("beta")
# uncentered and unscaled (= original)
standardizedSolution(regression_multiple_fit)
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 competence ~1 6.343 0.328 19.347 0.000 5.700 6.985
## 2 competence ~ autonomyCGM1 0.755 0.030 25.371 0.000 0.696 0.813
## 3 competence ~ score 0.076 0.045 1.688 0.091 -0.012 0.165
## 4 autonomyCGM1 ~~ autonomyCGM1 1.000 0.000 NA NA 1.000 1.000
## 5 score ~~ score 1.000 0.000 NA NA 1.000 1.000
## 6 competence ~~ competence 0.425 0.045 9.529 0.000 0.337 0.512
## 7 autonomyCGM1 ~1 0.000 0.069 0.000 1.000 -0.136 0.136
## 8 score ~1 1.644 0.106 15.462 0.000 1.436 1.852
# centered and scaled
standardizedSolution(regression_multiple_fit2)
## lhs op rhs est.std se z pvalue ci.lower
## 1 competence ~1 6.468 0.320 20.243 0.000 5.842
## 2 competence ~ autonomyCenSca 0.755 0.030 25.371 0.000 0.696
## 3 competence ~ scoreCenSca 0.076 0.045 1.688 0.091 -0.012
## 4 autonomyCenSca ~~ autonomyCenSca 1.000 0.000 NA NA 1.000
## 5 scoreCenSca ~~ scoreCenSca 1.000 0.000 NA NA 1.000
## 6 competence ~~ competence 0.425 0.045 9.529 0.000 0.337
## 7 autonomyCenSca ~1 0.000 0.069 0.000 1.000 -0.136
## 8 scoreCenSca ~1 0.000 0.069 0.000 1.000 -0.136
## ci.upper
## 1 7.095
## 2 0.813
## 3 0.165
## 4 1.000
## 5 1.000
## 6 0.512
## 7 0.136
## 8 0.136
Standardized estimates between original and centered/scaled predictors are (not surprisingly) quite the same, as the standardization is also applied to the results of original (uncentered and unscaled) variables.
# store predictor estimates for inline text
# https://lavaan.ugent.be/tutorial/inspect.html
#----------------------
# unstandardized values
# autonomy
# estimate
reg_m2_aut_est <- parameterEstimates(regression_multiple_fit)$est[2] # stores estimate 0.7530086
# or
reg_m2_aut_est <- regression_multiple_fit@ParTable$est[2]
# standard error
reg_m2_aut_se <- parameterEstimates(regression_multiple_fit)$se[2] # stores standard error 0.045
# or
reg_m2_aut_se <- regression_multiple_fit@ParTable$se[2]
# p-value
reg_m2_aut_p <- parameterEstimates(regression_multiple_fit)$pvalue[2] # stores p-value < . 001
# score
# estimate
reg_m2_score_est <- parameterEstimates(regression_multiple_fit)$est[3] # 0.001587065
# standard error
reg_m2_score_se <- parameterEstimates(regression_multiple_fit)$se[3] # 0.0009398335
# p-value
reg_m2_score_p <- parameterEstimates(regression_multiple_fit)$pvalue[3] # 0.0912835
#--------------------
# standardized values
# these are usually presented in journal articles
# autonomy
# estimate
reg_m2_aut_est.std <- standardizedSolution(regression_multiple_fit)$est.std[2] # stores estimate 0.7546165
# standard error
reg_m2_aut_se.std <- standardizedSolution(regression_multiple_fit)$se[2] # stores standard error 0.02974363
# p-value
reg_m2_aut_p.std <- standardizedSolution(regression_multiple_fit)$pvalue[2] # stores p-value < . 001
# score estimate
reg_m2_score_est.std <- standardizedSolution(regression_multiple_fit)$est.std[3] # 0.07631541
# score standard error
reg_m2_score_se.std <- standardizedSolution(regression_multiple_fit)$se[3] # 0.04519976
# score p-value
reg_m2_score_p.std <- standardizedSolution(regression_multiple_fit)$pvalue[3] # 0.09133388
#_-----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
reg_m2_r2 <- as.numeric(lavaan::inspect(regression_multiple_fit,'r2')[1])
Results of multiple regression analysis (Figure 1) with R lavaan showed that autonomy was a significant predictor of competence, \(\beta\) = 0.755, p = 0.000. According to the model, one standard deviation increase in autonomy (SD = 0.61) produces 0.755 SD increase in competence (SD = 0.61). In practice this means that when a student's autonomy level goes up by 0.61 units (on a 5-point self-assessment scale), then his/her competence goes up by 0.755 * 0.61 = 0.457 units (on a similar 5-point self-assessment scale). Although the mathematics score was a non significant positive predictor of competence, we can make a similar prediction on its effect: When a student's mathematics score goes up by 1 SD (29.12 points (on a scale from 0-100), his/her competence goes up by 0.076 * 0.61 = 0.046 units (on a 5-point self-assessment scale).
Standard errors (average distance of observed values from the regression line) were quite small for autonomy (0.030) and mathematics score (0.045). Autonomy and mathematics score explained together 57.53% of the variance of competence. The improvement was quite small compared to the previous model that had only one predictor (autonomy, R2 = 56.87%).
Figure 1. Multiple regression of competence predicted by autonomy and mathematics score (standardized estimates).
# produce path diagram
semPaths(regression_multiple_fit,
layout="tree",
"std", # unstandardized parameter estimates OR use "par" for standardized parameter estimates
# "par",
edge.label.cex=0.7,
edge.width=0.2,
edge.color="black",
# optimizeLatRes=TRUE,
label.scale=FALSE, # do not make labels smaller to fit inside boxes
# normalize=TRUE,
fade=FALSE, # do not fade non sig lines
as.expression=c("nodes"), # display full labels instead of abbreviated
# levels=c(1,3,6,9) # set distances between elements of the graph
)
Report here the unstandardized results of the above analysis: Describe the "B" estimates and produce a diagram/figure of the model with unstandardized values.
Multivariate regression model has more than one endogenous variable (DV). We specify a model that predicts both competence (endogenous, DV) and autonomy (endogenous, DV) with deep approach to learning (exogenous, IV) and organised approach to studying (exogenous, IV). Now that autonomy is not a predictor anymore, we use a non-centered version of it (autonomy).
# center deep and organised by grand mean
dfttu$deepCGM <- dfttu$deep - mean(dfttu$deep)
dfttu$organisedCGM <- dfttu$organised - mean(dfttu$organised)
# specify multivariate regression
regression_multivariate <- '
# regressions
autonomy ~ 1 + deepCGM
competence ~ 1 + deepCGM + organisedCGM
# variances and covariance
deepCGM ~~ deepCGM
organisedCGM ~~ organisedCGM
deepCGM ~~ organisedCGM
'
# estimate model
regression_multivariate_fit <- sem(regression_multivariate,
data=dfttu,
group = NULL, # no grouping variable (e.g., gender)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
)
# display results in LISREL format
summary(regression_multivariate_fit,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 20 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 208
## Number of missing patterns 1
##
## Model Test User Model:
##
## Test statistic 0.130
## Degrees of freedom 1
## P-value (Chi-square) 0.718
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## autonomy ~
## deepCGM 0.360 0.079 4.545 0.000
## competence ~
## deepCGM 0.478 0.080 5.993 0.000
## organisedCGM 0.050 0.044 1.136 0.256
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## deepCGM ~~
## organisedCGM 0.164 0.026 6.241 0.000
## .autonomy ~~
## .competence 0.228 0.027 8.473 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .autonomy 3.786 0.040 94.569 0.000
## .competence 3.907 0.038 103.514 0.000
## deepCGM 0.000 0.035 0.000 1.000
## organisedCGM -0.000 0.047 -0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## deepCGM 0.256 0.025 10.198 0.000
## organisedCGM 0.458 0.045 10.198 0.000
## .autonomy 0.333 0.033 10.198 0.000
## .competence 0.296 0.029 10.196 0.000
##
## R-Square:
## Estimate
## autonomy 0.090
## competence 0.185
# unstandardized
parameterEstimates(regression_multivariate_fit)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 autonomy ~1 3.786 0.040 94.569 0.000 3.708 3.865
## 2 autonomy ~ deepCGM 0.360 0.079 4.545 0.000 0.205 0.515
## 3 competence ~1 3.907 0.038 103.514 0.000 3.833 3.981
## 4 competence ~ deepCGM 0.478 0.080 5.993 0.000 0.322 0.634
## 5 competence ~ organisedCGM 0.050 0.044 1.136 0.256 -0.036 0.135
## 6 deepCGM ~~ deepCGM 0.256 0.025 10.198 0.000 0.207 0.305
## 7 organisedCGM ~~ organisedCGM 0.458 0.045 10.198 0.000 0.370 0.546
## 8 deepCGM ~~ organisedCGM 0.164 0.026 6.241 0.000 0.113 0.216
## 9 autonomy ~~ autonomy 0.333 0.033 10.198 0.000 0.269 0.397
## 10 competence ~~ competence 0.296 0.029 10.196 0.000 0.239 0.353
## 11 autonomy ~~ competence 0.228 0.027 8.473 0.000 0.175 0.281
## 12 deepCGM ~1 0.000 0.035 0.000 1.000 -0.069 0.069
## 13 organisedCGM ~1 0.000 0.047 0.000 1.000 -0.092 0.092
# standardized
standardizedSolution(regression_multivariate_fit)
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 autonomy ~1 6.254 0.314 19.938 0.000 5.639 6.869
## 2 autonomy ~ deepCGM 0.301 0.063 4.765 0.000 0.177 0.424
## 3 competence ~1 6.479 0.323 20.041 0.000 5.845 7.112
## 4 competence ~ deepCGM 0.401 0.062 6.514 0.000 0.280 0.521
## 5 competence ~ organisedCGM 0.056 0.049 1.135 0.256 -0.041 0.152
## 6 deepCGM ~~ deepCGM 1.000 0.000 NA NA 1.000 1.000
## 7 organisedCGM ~~ organisedCGM 1.000 0.000 NA NA 1.000 1.000
## 8 deepCGM ~~ organisedCGM 0.480 0.053 8.995 0.000 0.375 0.585
## 9 autonomy ~~ autonomy 0.910 0.038 23.992 0.000 0.835 0.984
## 10 competence ~~ competence 0.815 0.049 16.796 0.000 0.720 0.910
## 11 autonomy ~~ competence 0.726 0.033 22.142 0.000 0.662 0.790
## 12 deepCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
## 13 organisedCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
#--------------------
# standardized values
# aut ~ deep estimate
reg_m3_aut_deep_est.std <- standardizedSolution(regression_multivariate_fit)$est.std[2]
# aut ~ deep standard error
reg_m3_aut_deep_se.std <- standardizedSolution(regression_multivariate_fit)$se[2]
# aut ~ deep p-value
reg_m3_aut_deep_p.std <- standardizedSolution(regression_multivariate_fit)$pvalue[2]
# com ~ deep estimate
reg_m3_com_deep_est.std <- standardizedSolution(regression_multivariate_fit)$est.std[4]
# com ~ deep standard error
reg_m3_com_deep_se.std <- standardizedSolution(regression_multivariate_fit)$se[4]
# com ~ deep p-value
reg_m3_com_deep_p.std <- standardizedSolution(regression_multivariate_fit)$pvalue[4]
#_-----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
reg_m3_aut_r2 <- lavaan::inspect(regression_multivariate_fit,'r2')[1]
reg_m3_com_r2 <- lavaan::inspect(regression_multivariate_fit,'r2')[2]
Results of multivariate regression analysis (Figure 2) with R lavaan showed that deep approach to learning was a significant positive predictor for both autonomy (\(\beta\) = 0.301, p = 0.000) and competence (\(\beta\) = 0.401, p = 0.000). Organised approach to studying … [Add results here]. Deep approach to studying explained [Add]% of the variance of autonomy. Deep and organised explained together [Add]% of the variance of competence.
Add "missing" results to the above analysis.
Figure 2. Multivariate regression of competence and autonomy predicted by deep approach to learning and organised approach to studying (standardized estimates).
# produce diagram
semPaths(regression_multivariate_fit,
layout="tree",
"std", # unstandardized parameter estimates OR "par" for standardized parameter estimates
edge.label.cex=0.7,
edge.width=0.2,
edge.color="black",
# optimizeLatRes=TRUE,
label.scale=FALSE, # do not make labels smaller to fit inside boxes
# normalize=TRUE,
fade=FALSE, # do not fade non sig lines
as.expression=c("nodes"), # display full labels instead of abbreviated
# levels=c(1,3,6,9) # set distances between elements of the graph
)
Multivariate regression model becomes a path model after endogenous variables are allowed to predict other endogenous variables. We specify a model that predicts 1) competence with deep approach to learning, organised approach to studying and autonomy (which is the "path" part of the model), and 2) autonomy with deep approach to learning.
# create new df with the four variables in the path model
dfttu3 <- dfttu[,c("autonomy","competence","deep", "organised")]
# produce variance-covariance matrix
cov(dfttu3)
## autonomy competence deep organised
## autonomy 0.36825878 0.2769039 0.09246969 0.06802478
## competence 0.27690392 0.3660915 0.13104881 0.10768969
## deep 0.09246969 0.1310488 0.25702428 0.16510330
## organised 0.06802478 0.1076897 0.16510330 0.46032928
# store some interesting values for the inline text
# mean of autonomy
dfttu_m_aut <- mean(dfttu3$autonomy)
# variance of autonomy
dfttu_var_aut <- cov(dfttu3)[1,1]
# square root of the variance of autonomy (SD)
dfttu_sd_aut <- sqrt(cov(dfttu3)[1,1])
Diagonal values represent variances and other values covariances. Variance-covariance matrix is symmetric (same values below and above the diagonal). Square root of the diagonal values produce standard deviations. For example, autonomy variance (\(\sigma^2\)) is 0.368 and SD (\(\sigma\)) is 0.61. As the mean of the autonomy is 3.79, +/- 1SD gives range where approximately 68% of the autonomy values are. All the covariances are positive (e.g., if autonomy increases, also competence increases).
# specify regressions
path_model <- '
autonomy ~ 1 + deepCGM
competence ~ 1 + deepCGM + organisedCGM + autonomy # we add autonomy here as a predictor of competence
# variances and covariances
deepCGM ~~ deepCGM
organisedCGM ~~ organisedCGM
deepCGM ~~ organisedCGM
'
# estimate model
path_model_fit <- sem(path_model,
data = dfttu,
group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
)
# display results in LISREL format
summary(path_model_fit,
rsq=TRUE) # display R^2
## lavaan 0.6.14 ended normally after 38 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
##
## Number of observations 208
## Number of missing patterns 1
##
## Model Test User Model:
##
## Test statistic 0.130
## Degrees of freedom 1
## P-value (Chi-square) 0.718
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## autonomy ~
## deepCGM 0.360 0.079 4.545 0.000
## competence ~
## deepCGM 0.232 0.061 3.828 0.000
## organisedCGM 0.050 0.044 1.136 0.256
## autonomy 0.685 0.045 15.223 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## deepCGM ~~
## organisedCGM 0.164 0.026 6.241 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .autonomy 3.786 0.040 94.569 0.000
## .competence 1.316 0.172 7.639 0.000
## deepCGM 0.000 0.035 0.000 1.000
## organisedCGM -0.000 0.047 -0.000 1.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## deepCGM 0.256 0.025 10.198 0.000
## organisedCGM 0.458 0.045 10.198 0.000
## .autonomy 0.333 0.033 10.198 0.000
## .competence 0.140 0.014 10.198 0.000
##
## R-Square:
## Estimate
## autonomy 0.090
## competence 0.615
# unstandardized
parameterEstimates(path_model_fit)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 autonomy ~1 3.786 0.040 94.569 0.000 3.708 3.865
## 2 autonomy ~ deepCGM 0.360 0.079 4.545 0.000 0.205 0.515
## 3 competence ~1 1.316 0.172 7.639 0.000 0.978 1.653
## 4 competence ~ deepCGM 0.232 0.061 3.828 0.000 0.113 0.350
## 5 competence ~ organisedCGM 0.050 0.044 1.136 0.256 -0.036 0.135
## 6 competence ~ autonomy 0.685 0.045 15.223 0.000 0.596 0.773
## 7 deepCGM ~~ deepCGM 0.256 0.025 10.198 0.000 0.207 0.305
## 8 organisedCGM ~~ organisedCGM 0.458 0.045 10.198 0.000 0.370 0.546
## 9 deepCGM ~~ organisedCGM 0.164 0.026 6.241 0.000 0.113 0.216
## 10 autonomy ~~ autonomy 0.333 0.033 10.198 0.000 0.269 0.397
## 11 competence ~~ competence 0.140 0.014 10.198 0.000 0.113 0.167
## 12 deepCGM ~1 0.000 0.035 0.000 1.000 -0.069 0.069
## 13 organisedCGM ~1 0.000 0.047 0.000 1.000 -0.092 0.092
# standardized
standardizedSolution(path_model_fit)
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 autonomy ~1 6.254 0.314 19.938 0.000 5.639 6.869
## 2 autonomy ~ deepCGM 0.301 0.063 4.765 0.000 0.177 0.424
## 3 competence ~1 2.181 0.357 6.106 0.000 1.481 2.882
## 4 competence ~ deepCGM 0.194 0.051 3.827 0.000 0.095 0.294
## 5 competence ~ organisedCGM 0.056 0.049 1.135 0.256 -0.041 0.152
## 6 competence ~ autonomy 0.687 0.036 19.298 0.000 0.617 0.757
## 7 deepCGM ~~ deepCGM 1.000 0.000 NA NA 1.000 1.000
## 8 organisedCGM ~~ organisedCGM 1.000 0.000 NA NA 1.000 1.000
## 9 deepCGM ~~ organisedCGM 0.480 0.053 8.995 0.000 0.375 0.585
## 10 autonomy ~~ autonomy 0.910 0.038 23.992 0.000 0.835 0.984
## 11 competence ~~ competence 0.385 0.042 9.209 0.000 0.303 0.467
## 12 deepCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
## 13 organisedCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
#--------------------
# unstandardized values
# first, we investigate the structure of std output object
str(parameterEstimates(path_model_fit)) # we learn that "$est.sdt", "$se" and "$pvalue" contain estimates that we are interested in reporting
## Classes 'lavaan.data.frame' and 'data.frame': 13 obs. of 9 variables:
## $ lhs : chr "autonomy" "autonomy" "competence" "competence" ...
## $ op : chr "~1" "~" "~1" "~" ...
## $ rhs : chr "" "deepCGM" "" "deepCGM" ...
## $ est : num 3.7861 0.3598 1.3156 0.2317 0.0497 ...
## $ se : num 0.04 0.0792 0.1722 0.0605 0.0437 ...
## $ z : num 94.57 4.54 7.64 3.83 1.14 ...
## $ pvalue : num 0.00 5.50e-06 2.20e-14 1.29e-04 2.56e-01 ...
## $ ci.lower: num 3.708 0.205 0.978 0.113 -0.036 ...
## $ ci.upper: num 3.865 0.515 1.653 0.35 0.135 ...
# second, we look into the std output to see the "rows" we wish to refer
parameterEstimates(path_model_fit)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 autonomy ~1 3.786 0.040 94.569 0.000 3.708 3.865
## 2 autonomy ~ deepCGM 0.360 0.079 4.545 0.000 0.205 0.515
## 3 competence ~1 1.316 0.172 7.639 0.000 0.978 1.653
## 4 competence ~ deepCGM 0.232 0.061 3.828 0.000 0.113 0.350
## 5 competence ~ organisedCGM 0.050 0.044 1.136 0.256 -0.036 0.135
## 6 competence ~ autonomy 0.685 0.045 15.223 0.000 0.596 0.773
## 7 deepCGM ~~ deepCGM 0.256 0.025 10.198 0.000 0.207 0.305
## 8 organisedCGM ~~ organisedCGM 0.458 0.045 10.198 0.000 0.370 0.546
## 9 deepCGM ~~ organisedCGM 0.164 0.026 6.241 0.000 0.113 0.216
## 10 autonomy ~~ autonomy 0.333 0.033 10.198 0.000 0.269 0.397
## 11 competence ~~ competence 0.140 0.014 10.198 0.000 0.113 0.167
## 12 deepCGM ~1 0.000 0.035 0.000 1.000 -0.069 0.069
## 13 organisedCGM ~1 0.000 0.047 0.000 1.000 -0.092 0.092
# third, we store values from the output into variables for inline text referencing ...
# autonomy is only predicted by deep
# aut ~ deep
# estimate
path_m4_aut_deep_est <- parameterEstimates(path_model_fit)$est[2] # row 2 in the column "est.sdt" contains beta (standardized) estimate of autonomy ~ deep approach to learning
# standard error
path_m4_aut_deep_se <- parameterEstimates(path_model_fit)$se[2]
# p-value
path_m4_aut_deep_p <- parameterEstimates(path_model_fit)$pvalue[2]
# competence is predicted by deep, organised and autonomy
# com ~ deep
# estimate
path_m4_com_deep_est <- parameterEstimates(path_model_fit)$est[4]
# standard error
path_m4_com_deep_se <- parameterEstimates(path_model_fit)$se[4]
# p-value
path_m4_com_deep_p <- parameterEstimates(path_model_fit)$pvalue[4]
# com ~ organised
# estimate
path_m4_com_org_est <- parameterEstimates(path_model_fit)$est[5]
# standard error
path_m4_com_org_se <- parameterEstimates(path_model_fit)$se[5]
# p-value
path_m4_com_org_p <- parameterEstimates(path_model_fit)$pvalue[5]
# com ~ autonomy
# estimate
path_m4_com_aut_est <- parameterEstimates(path_model_fit)$est[6]
# standard error
path_m4_com_aut_se <- parameterEstimates(path_model_fit)$se[6]
# p-value
path_m4_com_aut_p <- parameterEstimates(path_model_fit)$pvalue[6]
# -----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
path_m4_aut_r2 <- lavaan::inspect(path_model_fit,'r2')[1] # how much "deep" explains the variance of "autonomy"?
path_m4_com_r2 <- lavaan::inspect(path_model_fit,'r2')[2] # how much "deep", "organised" and "autonomy" explain the variance of "competence"?
#--------------------
# standardized values
# first, we investigate the structure of std output object
str(standardizedSolution(path_model_fit)) # "$est.sdt", "$se" and "$pvalue" contain estimates that we are interested in reporting
## Classes 'lavaan.data.frame' and 'data.frame': 13 obs. of 9 variables:
## $ lhs : chr "autonomy" "autonomy" "competence" "competence" ...
## $ op : chr "~1" "~" "~1" "~" ...
## $ rhs : chr "" "deepCGM" "" "deepCGM" ...
## $ est.std : num 6.254 0.3006 2.1813 0.1943 0.0558 ...
## $ se : num 0.3137 0.0631 0.3573 0.0508 0.0491 ...
## $ z : num 19.94 4.77 6.11 3.83 1.13 ...
## $ pvalue : num 0.00 1.89e-06 1.02e-09 1.30e-04 2.56e-01 ...
## $ ci.lower: num 5.6392 0.1769 1.4811 0.0948 -0.0405 ...
## $ ci.upper: num 6.869 0.424 2.882 0.294 0.152 ...
# second, we look into the std output to see the "rows" we wish to refer
standardizedSolution(path_model_fit)
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 autonomy ~1 6.254 0.314 19.938 0.000 5.639 6.869
## 2 autonomy ~ deepCGM 0.301 0.063 4.765 0.000 0.177 0.424
## 3 competence ~1 2.181 0.357 6.106 0.000 1.481 2.882
## 4 competence ~ deepCGM 0.194 0.051 3.827 0.000 0.095 0.294
## 5 competence ~ organisedCGM 0.056 0.049 1.135 0.256 -0.041 0.152
## 6 competence ~ autonomy 0.687 0.036 19.298 0.000 0.617 0.757
## 7 deepCGM ~~ deepCGM 1.000 0.000 NA NA 1.000 1.000
## 8 organisedCGM ~~ organisedCGM 1.000 0.000 NA NA 1.000 1.000
## 9 deepCGM ~~ organisedCGM 0.480 0.053 8.995 0.000 0.375 0.585
## 10 autonomy ~~ autonomy 0.910 0.038 23.992 0.000 0.835 0.984
## 11 competence ~~ competence 0.385 0.042 9.209 0.000 0.303 0.467
## 12 deepCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
## 13 organisedCGM ~1 0.000 0.069 0.000 1.000 -0.136 0.136
# third, we store values from the std output into variables for inline text referencing
# aut ~ deep
# estimate (row 2 in the column "est.sdt" contains beta (standardized) estimate of autonomy ~ deep approach to learning)
path_m4_aut_deep_est.std <- standardizedSolution(path_model_fit)$est.std[2]
# standard error
path_m4_aut_deep_se.std <- standardizedSolution(path_model_fit)$se[2]
# p-value
path_m4_aut_deep_p.std <- standardizedSolution(path_model_fit)$pvalue[2]
# com ~ deep
# estimate
path_m4_com_deep_score_est.std <- standardizedSolution(path_model_fit)$est.std[4]
# standard error
path_m4_com_deep_se.std <- standardizedSolution(path_model_fit)$se[4]
# p-value
path_m4_com_deep_p.std <- standardizedSolution(path_model_fit)$pvalue[4]
# com ~ organised
# estimate
path_m4_com_org_est.std <- standardizedSolution(path_model_fit)$est.std[5]
# standard error
path_m4_com_org_se.std <- standardizedSolution(path_model_fit)$se[5]
# p-value
path_m4_com_org_p.std <- standardizedSolution(path_model_fit)$pvalue[5]
# com ~ autonomy
# estimate
path_m4_com_aut_est.std <- standardizedSolution(path_model_fit)$est.std[6]
# standard error
path_m4_com_aut_se.std <- standardizedSolution(path_model_fit)$se[6]
# p-value
path_m4_com_aut_p.std <- standardizedSolution(path_model_fit)$pvalue[6]
Results of path analysis (Figure 3) with R lavaan showed that deep approach to learning was a positive statistically significant predictor of autonomy, B = 0.360, SE = 0.079, p = 0.000. This unstandardized result indicates that a one unit increase in deep increases autonomy by 0.360 units. Deep approach to learning explained 9.0% of the variance of autonomy variable. It was also investigated if deep approach to learning, organised approach to studying and autonomy are related to competence. Results showed that all three predictors had positive associations with competence, but only deep (B = 0.232, SE = 0.061, p = 0.000) and autonomy (B = 0.685, SE = 0.045, p = 0.000) were statistically significant. These variables explained 61.47% of the variance of competence variable.
Figure 3. Path model of competence and autonomy with selected predictors (unstandardized estimates).
# produce diagram
semPaths(path_model_fit,
layout="tree",
"par", # unstandardized parameter estimates OR "par" for standardized parameter estimates
edge.label.cex=0.7,
edge.width=0.2,
edge.color="black",
# optimizeLatRes=TRUE,
label.scale=FALSE, # do not make labels smaller to fit inside boxes
# normalize=TRUE,
fade=FALSE, # do not fade non sig lines
as.expression=c("nodes"), # display full labels instead of abbreviated
# levels=c(1,3,6,9) # set distances between elements of the graph
)
Produce text that describes the standardized results of the analyses above (alongside with a figure).
dfttu4 <- read.csv("data1_csv.dat", header = T)
summary(dfttu4)
## v1 v2 v3 v4
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.941 Mean :2.713 Mean :2.655 Mean :2.849
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :5 NA's :4 NA's :6 NA's :11
## v5 v6 v7 v8
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :3.000
## Mean :3.817 Mean :3.689 Mean :3.432 Mean :3.356
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4 NA's :3 NA's :3 NA's :3
## v9 v12 v13 v14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :2.000 Median :2.000 Median :3.000
## Mean :3.235 Mean :2.153 Mean :2.068 Mean :2.977
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :4 NA's :3 NA's :6 NA's :8
## v15 v16 v17 v18 v29
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :3.00 Median :3.000 Median :3.000
## Mean :2.836 Mean :3.536 Mean :3.28 Mean :3.089 Mean :2.837
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000
## NA's :9 NA's :3 NA's :7 NA's :7 NA's :5
## v30 v31 v33 v34
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
## Median :3.000 Median :3.000 Median :4.000 Median :4.000
## Mean :3.086 Mean :3.038 Mean :3.765 Mean :3.979
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.750 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :6 NA's :4 NA's :9 NA's :9
## v42r v43 v44 v45 v46
## Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.0 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :4.0 Median :3.000 Median :2.000 Median :2.000
## Mean :2.991 Mean :3.8 Mean :3.278 Mean :2.388 Mean :2.202
## 3rd Qu.:4.000 3rd Qu.:5.0 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.0 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1 NA's :12 NA's :1 NA's :1 NA's :2
## v47
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.208
## 3rd Qu.:3.000
## Max. :5.000
## NA's :5
Our data is from a Finnish polytechnics of higher education, collected in 2000 (n = 447). The respondents are the staff members of the organization. The original measurement instrument (Growth-oriented Atmosphere Questionnaire, GOAQ) has 13 factors and 92-items (Ruohotie, 1996; Ruohotie, Nokelainen & Tirri, 2002), but for the purposes of this exercise we have selected the following factors and sample items. Five –point scale from 1 (totally disagree) to 5 (totally agree) was used in the online survey.
Factor 1: Psychic stress of the work (PSY)
v44 I feel that I am beginning to dislike my work.
v45 I feel that it is getting more difficult for me to take the
initiative.
Factor 2: Encouraging leadership (ENC)
v5 My manager is friendly and easily approachable.
v7 My manager pays attention to my suggestions and wishes.
v17 My manager works with a team to find solutions.
Factor 3: Build-up of work requirements (BUI)
v42r My workload has increased during the past years.
v43 My working pace has increased in recent years.
# create new df with only relevant variables for correlation matrix
dfttu4cor <- dfttu4[,c("v44","v45","v5","v7","v17","v42r","v43")]
# compute correlation matrix
lavCor(dfttu4cor)
## v44 v45 v5 v7 v17 v42r v43
## v44 1.000
## v45 0.486 1.000
## v5 -0.254 -0.290 1.000
## v7 -0.263 -0.343 0.626 1.000
## v17 -0.233 -0.312 0.731 0.684 1.000
## v42r 0.525 0.309 -0.239 -0.298 -0.226 1.000
## v43 0.483 0.167 -0.062 -0.118 -0.018 0.387 1.000
Correlation matrix shows that the observed variables for each factor have quite strong and positive correlations, as expected. For example, "v44" and "v45" are manifestations of the same latent variable ("PSY") and so they have a strong positive correlation (r = 0.486). Also the negative correlations make sense: It is plausible that the three observed variables ("v5","v7","v17") that belong to the latent ENC variable (Encouraging leadership) are negatively associated to observed variables related to psychic stress ("v44","v45") and build-up of work requirements ("v42r","v43").
However, there is one suspicious correlation that we need to examine further: "v44" (PSY) and "v43" (BUI) are quite strongly positively correlated ( r = 0.483) although they belong to different factors?! We will return to this issue in the last section of this exercise (via "modification indices").
According to Beaujean (2014, p. 37), "A structural equation model (SEM) is a broad class of statistical models that consists of two parts: the structural model and the latent variable model. The structural model consists of the regression-like relationships among the variables (i.e., the path analysis). The latent variable model (LVM) forms the latent variables (LVs) used in the structural model. When a LVM is analyzed without a structural model, it is sometimes called a confirmatory factor analysis (CFA). If there was not a hypothesized structure for the latent variable model, then it would be an exploratory factor analysis (EFA)."
The SEM model (Figure 4) aims to analyse if Encouraging leadership (ENC) and Build-up of work requirements (BUI) are related to the Psychic stress of the work (PSY).
Figure 4. SEM model between Encouraging leadership, Build-up of work requirements and Psychic stress of the work
ENC is a latent exogenous (IV) variable that is constructed with three observed variables (v5, v7, v17). The code below shows that the highest loading variable is not the first one (v5), but the last one (v17) instead. To make the model identifiable, we fix one indicator path to 1 [see more details] for each latent variable. In the case of PSY, we fix the strongest loading variable (v17) instead of the first variable in the list (using "NA*" removes the automatic fix to 1 for v5).
BUI is also an exogenous latent variable, but based only on two observed variables (v42r, v43). The "r" in v42r stands for a reverse coded variable: The original statement in v42 was "My workload has decreased during the past years.", but as we wish to measure build-up of work, we use reversed wording "My workload has increased during the past years." instead (and indicate that by naming the variable "v42r").
PSY is an endogenous (DV) latent variable with two observed indicators (v44, v45).
sem_model <- '
# Factors
ENC =~ NA*v5 + v7 + 1*v17 # Encouraging leadership
BUI =~ 1*v42r + v43 # Build-up of work requirements
PSY =~ 1*v44 + v45 # Psychic stress of the work
# Regressions
PSY ~ ENC + BUI # Predict PSY with ENC and BUI
# Covariances
ENC ~~ BUI # Let ENC and BUI correlate
'
The model specified in the previous section is now estimated. The default estimation method is Maximum Likelihood (ML). As there are some missing data, we use Full Information Maximum Likelihood (FIML) instead of casewise deletion.
sem_model_fit <- sem(sem_model,
data=dfttu4,
group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach
)
# summary(sem_model_fit,
# standardized=TRUE)
#
# # Latent Variables:
# # Estimate Std.Err z-value P(>|z|) Std.lv Std.all
# # ENC =~
# # v5 0.995 NA 0.962 0.814
# # v7 0.850 NA 0.822 0.768
# # v17 1.110 NA 1.074 0.886
# # BUI =~
# # v42r 0.864 NA 0.777 0.672
# # v43 0.689 NA 0.620 0.571
# # PSY =~
# # v44 1.053 NA 1.118 0.901
# # v45 0.623 NA 0.662 0.551
# #
# # Regressions:
# # Estimate Std.Err z-value P(>|z|) Std.lv Std.all
# # PSY ~
# # ENC -0.110 NA -0.100 -0.100
# # BUI 1.003 NA 0.850 0.850
# #
# # Covariances:
# # Estimate Std.Err z-value P(>|z|) Std.lv Std.all
# # ENC ~~
# # BUI -0.264 NA -0.303 -0.303
Before proceeding with the analysis, we need to check if our model is underidentified, just-identified or overidentified (Kline, 2016, pp.127-128). Model specifications can not require more parameters to be analysed than there are observations available. Thus, only just-identified (equal number of observations and parameters) and overidentified (less parameters than observations) models can be analysed (underidentification produces an error message).
The number of observations (free parameters) to be estimated (non-reduntant elements) can be calculated with k(k+1)/2. "k" is the number of observed (manifest) variables in the model.
Figure 4 above shows that we have seven observed manifest variables in our model (v5, v7, v17, v42r, v43, v44, v45). If we place that number into "k" and calculate, we get 7(7+1)/2 = 56/2 = 28. So, our number of observations is 28. Next, we need to figure out how many parameters are to be estimated in the model (Figure 4). We will use lavaan::lavInspect() to analyse this:
# how many parameters are to be estimated in the model?
# lavaan::inspect(sem_model_fit) # n of parameters: 24 # older version
lavaan::lavInspect(sem_model_fit) # n of parameters: 24
## $lambda
## ENC BUI PSY
## v5 1 0 0
## v7 2 0 0
## v17 0 0 0
## v42r 0 0 0
## v43 0 3 0
## v44 0 0 0
## v45 0 0 4
##
## $theta
## v5 v7 v17 v42r v43 v44 v45
## v5 8
## v7 0 9
## v17 0 0 10
## v42r 0 0 0 11
## v43 0 0 0 0 12
## v44 0 0 0 0 0 13
## v45 0 0 0 0 0 0 14
##
## $psi
## ENC BUI PSY
## ENC 15
## BUI 7 16
## PSY 0 0 17
##
## $beta
## ENC BUI PSY
## ENC 0 0 0
## BUI 0 0 0
## PSY 5 6 0
##
## $nu
## intrcp
## v5 18
## v7 19
## v17 20
## v42r 21
## v43 22
## v44 23
## v45 24
##
## $alpha
## intrcp
## ENC 0
## BUI 0
## PSY 0
Result shows that there are 24 parameters to be estimated:
4 manifest variable loadings on three factors (lambda matrix)
7 error variances of manifest variables (theta matrix)
3 latent variances and 1 covariance (psi matrix)
2 regressions (beta matrix)
7 intercepts(nu matrix)
The model is overidentified (good!), as the number of non-redundant elements [7(7+1)/2 = 56/2 = 28] is greater than the number of parameters to be estimated: 28 - 24 = 4.
Another issue is that do we have enough data for the SEM model? According to Jackson (2003), we can get a rough estimate of the sample size needed by multiplying the number of participants (N) by the number of parameters in the model (q). Sample-size-to-parameters ratio according to this N:q rule would ideally be 20:1 (Kline, 2016, p.16).
We have 24 parameters to be estimated, so our sample size should be 20*24 = 480. Our data frame has 208 rows (=participants), so we do fine here. However, Kline (2016, p.16) mentions that "Less ideal would be an N:q ratio of 10:1 …", which in our case would mean radically smaller sample size expectation (of 240).
A more clever way of investigating sample size needed is to use simulation alongside with desired alpha and power values. Usually alpha is set to 0.05 (five percent chance of Type I error) and power to 0.80 (80 percent chance to avoid rejecting H1 when it is true).
# install.packages("semPower")
library(semPower)
##
## ### Welcome to semPower 1.2.0 ###
##
## See https://github.com/moshagen/semPower for quick examples and a detailed manual.
##
## Please cite as:
## Moshagen, M., & Erdfelder, E. (2016). A new strategy for testing structural equation models.
## Structural Equation Modeling, 23, 54-60. doi: 10.1080/10705511.2014.950896
aPrioriSampleSize <- semPower::semPower.aPriori(effect = .05, effect.measure = 'RMSEA', alpha = .05, power = .80, df = 24)
summary(aPrioriSampleSize)
##
## semPower: A-priori power analysis
##
## F0 0.060000
## RMSEA 0.050000
## Mc 0.970446
##
## df 24
## Required Num Observations 375
##
## Critical Chi-Square 36.41502
## NCP 22.44000
## Alpha 0.050000
## Beta 0.201086
## Power (1-beta) 0.798914
## Implied Alpha/Beta Ratio 0.248650
# F0 0.060000
# RMSEA 0.050000
# Mc 0.970446
#
# df 24
# Required Num Observations 375
#
# Critical Chi-Square 36.41502
# NCP 22.44000
# Alpha 0.050000
# Beta 0.201086
# Power (1-beta) 0.798914
# Implied Alpha/Beta Ratio 0.248650
We learn that if we estimate 24 parameters (alpha = .05, power = .80), required sample size would be 375.
Let's take another opinion from simsem …
# install.packages("simsem")
library(simsem)
##
## #################################################################
## This is simsem 0.5-16
## simsem is BETA software! Please report any bugs.
## simsem was first developed at the University of Kansas Center for
## Research Methods and Data Analysis, under NSF Grant 1053160.
## #################################################################
##
## Attaching package: 'simsem'
## The following object is masked from 'package:lavaan':
##
## inspect
## The following object is masked from 'package:psych':
##
## sim
# https://github.com/simsem/simsem/wiki/Example-18:-Simulation-with-Varying-Sample-Size
# define population model
sem_population_model <- '
# Factors
ENC =~ 0.81*v5 + 0.77*v7 + 0.89*v17 # standardized loadings from "Std.all" output above
BUI =~ 0.67*v42r + 0.57*v43
PSY =~ 0.90*v44 + 0.55*v45
# Regressions
PSY ~ -0.10*ENC + 0.85*BUI # standardized path coefficients from "Std.all" output above
# Covariances
ENC ~~ -.30*BUI # covariance from "Std.all" output above
'
# simulation takes some time, so we'll save the results into a file ...
if (file.exists("SEM_sampleSize.rds")) {
SEM_sampleSize <- readRDS("SEM_sampleSize.rds")
} else {
SEM_sampleSize <- simsem::sim(NULL, n=50:500, model = sem_model, generate = sem_population_model, std.lv = TRUE, lavaanfun = "cfa")
saveRDS(SEM_sampleSize, "SEM_sampleSize.rds")
}
summary(SEM_sampleSize)
## RESULT OBJECT
## Model Type
## [1] "lavaan"
## ========= Fit Indices Cutoffs ============
## N chisq aic bic rmsea cfi tli srmr
## 1 50 20.243 1193.386 1232.024 0.088 0.911 0.830 0.073
## 2 162 19.751 3724.860 3775.449 0.074 0.931 0.868 0.061
## 3 275 19.254 6278.937 6341.583 0.059 0.951 0.907 0.049
## 4 388 18.758 8833.014 8907.717 0.045 0.971 0.945 0.037
## 5 500 18.266 11364.488 11451.142 0.031 0.991 0.984 0.025
## ========= Parameter Estimates and Standard Errors ============
## Estimate Average Estimate SD Average SE Power (Not equal 0) Std Est
## ENC=~v5 0.805 0.106 0.102 1.000 0.627
## ENC=~v7 0.766 0.107 0.099 1.000 0.609
## ENC=~v17 0.886 0.122 0.108 1.000 0.663
## BUI=~v42r 0.813 2.098 0.126 0.982 0.678
## BUI=~v43 0.577 0.153 0.126 0.982 0.504
## PSY=~v44 0.880 0.257 0.249 0.901 0.782
## PSY=~v45 0.518 0.135 0.136 0.924 0.585
## PSY~ENC 0.976 21.691 0.410 0.103 -0.055
## PSY~BUI 2.713 32.504 1.008 0.756 0.646
## ENC~~BUI -0.307 0.142 0.118 0.763 -0.307
## v5~~v5 0.992 0.160 0.146 0.989 0.602
## v7~~v7 0.985 0.148 0.137 0.993 0.624
## v17~~v17 0.986 0.184 0.164 0.979 0.554
## v42r~~v42r -3.601 72.477 0.174 0.956 -2.607
## v43~~v43 0.955 0.342 0.218 0.986 0.728
## v44~~v44 0.904 0.517 0.416 0.733 0.372
## v45~~v45 0.990 0.199 0.169 0.977 0.648
## Std Est SD Std Ave SE Average Param Average Bias Coverage r_coef.n
## ENC=~v5 0.073 0.066 0.81 -0.005 0.959 0.002
## ENC=~v7 0.073 0.066 0.77 -0.004 0.943 0.065
## ENC=~v17 0.078 0.067 0.89 -0.004 0.945 0.027
## BUI=~v42r 1.776 0.097 0.67 0.143 0.949 -0.101
## BUI=~v43 0.135 0.104 0.57 0.007 0.940 -0.035
## PSY=~v44 0.126 0.103 0.90 -0.020 0.977 -0.043
## PSY=~v45 0.098 0.083 0.55 -0.032 0.961 0.054
## PSY~ENC 0.153 0.131 -0.10 1.076 0.989 -0.050
## PSY~BUI 0.170 0.153 0.85 1.863 0.945 -0.058
## ENC~~BUI 0.142 0.118 -0.30 -0.007 0.910 0.007
## v5~~v5 0.092 0.083 1.00 -0.008 0.959 0.025
## v7~~v7 0.088 0.080 1.00 -0.015 0.938 0.008
## v17~~v17 0.104 0.089 1.00 -0.014 0.952 -0.002
## v42r~~v42r 50.903 0.110 1.00 -4.601 0.949 0.096
## v43~~v43 0.286 0.163 1.00 -0.045 0.954 0.125
## v44~~v44 0.222 0.175 1.00 -0.096 0.959 0.175
## v45~~v45 0.124 0.101 1.00 -0.010 0.949 0.000
## r_se.n
## ENC=~v5 -0.877
## ENC=~v7 -0.884
## ENC=~v17 -0.863
## BUI=~v42r -0.740
## BUI=~v43 -0.176
## PSY=~v44 -0.466
## PSY=~v45 -0.339
## PSY~ENC -0.083
## PSY~BUI -0.068
## ENC~~BUI -0.809
## v5~~v5 -0.815
## v7~~v7 -0.838
## v17~~v17 -0.780
## v42r~~v42r -0.638
## v43~~v43 -0.102
## v44~~v44 -0.389
## v45~~v45 -0.231
## ========= Correlation between Fit Indices ============
## chisq aic bic rmsea cfi tli srmr n
## chisq 1.000 -0.013 -0.013 0.872 -0.673 -0.781 0.486 -0.014
## aic -0.013 1.000 1.000 -0.228 0.283 -0.005 -0.743 1.000
## bic -0.013 1.000 1.000 -0.228 0.283 -0.005 -0.743 1.000
## rmsea 0.872 -0.228 -0.228 1.000 -0.866 -0.813 0.618 -0.229
## cfi -0.673 0.283 0.283 -0.866 1.000 0.817 -0.641 0.284
## tli -0.781 -0.005 -0.005 -0.813 0.817 1.000 -0.456 -0.005
## srmr 0.486 -0.743 -0.743 0.618 -0.641 -0.456 1.000 -0.743
## n -0.014 1.000 1.000 -0.229 0.284 -0.005 -0.743 1.000
## ================== Replications =====================
## Number of replications = 451
## Number of converged replications = 412
## Number of nonconverged replications:
## 1. Nonconvergent Results = 12
## 2. Nonconvergent results from multiple imputation = 0
## 3. At least one SE were negative or NA = 2
## 4. Nonpositive-definite latent or observed (residual) covariance matrix
## (e.g., Heywood case or linear dependency) = 25
## NOTE: The sample size is varying.
# look at alpha value at 0.05 and sample sizes
simsem::plotCutoff(SEM_sampleSize, 0.05)
simsem::getCutoff(SEM_sampleSize, 0.05, nVal = 240) # 10:1 rule
## chisq aic bic rmsea cfi tli srmr
## 1 19.40819 5487.851 5546.762 0.06378119 0.9449914 0.8949836 0.05273879
# chisq aic bic rmsea cfi tli srmr
# 1 19.40819 5487.851 5546.762 0.06378119 0.9449914 0.8949836 0.05273879
simsem::getCutoff(SEM_sampleSize, 0.05, nVal = 480) # 20:1 rule
## chisq aic bic rmsea cfi tli srmr
## 1 18.35418 10912.44 10996.96 0.03367612 0.9878986 0.9768973 0.02682591
# chisq aic bic rmsea cfi tli srmr
# 1 18.35418 10912.44 10996.96 0.03367612 0.9878986 0.9768973 0.02682591
# find the sample size that provides the power just over 0.80
Cpow <- simsem::getPower(SEM_sampleSize)
simsem::findPower(Cpow, "N", 0.80)
## ENC=~v5 ENC=~v7 ENC=~v17 BUI=~v42r BUI=~v43 PSY=~v44 PSY=~v45
## Inf Inf Inf 68 68 142 101
## PSY~ENC PSY~BUI ENC~~BUI v5~~v5 v7~~v7 v17~~v17 v42r~~v42r
## NA 246 266 50 54 65 92
## v43~~v43 v44~~v44 v45~~v45
## 50 306 58
# ENC=~v5 ENC=~v7 ENC=~v17 BUI=~v42r BUI=~v43 PSY=~v44 PSY=~v45 PSY~ENC PSY~BUI ENC~~BUI v5~~v5 v7~~v7 v17~~v17 v42r~~v42r v43~~v43 v44~~v44 v45~~v45
# Inf Inf Inf 68 68 142 101 NA 246 266 50 54 65 92 50 306 58
simsem::plotPower(SEM_sampleSize, powerParam=c("ENC=~v5", "BUI=~v43", "PSY=~v44", "PSY~ENC", "PSY~BUI", "ENC~~BUI"))
Results of the simulation show that no sample size is enough to make "PSY~ENC" regression plausible if the power level is set to .80. However, investigation of "PSY~BUI" regression would require 246 participants (result in accordance with earlier "rule-of-thumb" 10:1 estimation). The greatest suggested sample size is 306, quite close to the earlier estimation of 375.
We conclude that our model is overidentified and that we have enough participants.
summary(sem_model_fit, standardized=TRUE, rsq=TRUE) # display R Squared
## lavaan 0.6.14 ended normally after 41 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 24
##
## Number of observations 447
## Number of missing patterns 10
##
## Model Test User Model:
##
## Test statistic 62.390
## Degrees of freedom 11
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ENC =~
## v5 0.896 0.048 18.766 0.000 0.962 0.814
## v7 0.765 0.043 17.629 0.000 0.822 0.768
## v17 1.000 1.074 0.886
## BUI =~
## v42r 1.000 0.777 0.672
## v43 0.798 0.089 8.924 0.000 0.620 0.571
## PSY =~
## v44 1.000 1.118 0.901
## v45 0.592 0.068 8.702 0.000 0.662 0.551
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PSY ~
## ENC -0.104 0.063 -1.653 0.098 -0.100 -0.100
## BUI 1.223 0.152 8.028 0.000 0.850 0.850
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ENC ~~
## BUI -0.253 0.062 -4.092 0.000 -0.303 -0.303
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .v5 3.818 0.056 68.091 0.000 3.818 3.229
## .v7 3.436 0.051 67.756 0.000 3.436 3.211
## .v17 3.280 0.058 56.971 0.000 3.280 2.705
## .v42r 2.992 0.055 54.600 0.000 2.992 2.584
## .v43 3.803 0.052 73.253 0.000 3.803 3.501
## .v44 3.277 0.059 55.754 0.000 3.277 2.639
## .v45 2.387 0.057 42.027 0.000 2.387 1.990
## ENC 0.000 0.000 0.000
## BUI 0.000 0.000 0.000
## .PSY 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .v5 0.473 0.047 9.978 0.000 0.473 0.338
## .v7 0.470 0.041 11.380 0.000 0.470 0.410
## .v17 0.317 0.048 6.588 0.000 0.317 0.216
## .v42r 0.736 0.080 9.226 0.000 0.736 0.549
## .v43 0.795 0.067 11.829 0.000 0.795 0.674
## .v44 0.291 0.119 2.458 0.014 0.291 0.189
## .v45 1.002 0.079 12.747 0.000 1.002 0.696
## ENC 1.153 0.105 10.934 0.000 1.000 1.000
## BUI 0.604 0.098 6.186 0.000 1.000 1.000
## .PSY 0.269 0.132 2.035 0.042 0.215 0.215
##
## R-Square:
## Estimate
## v5 0.662
## v7 0.590
## v17 0.784
## v42r 0.451
## v43 0.326
## v44 0.811
## v45 0.304
## PSY 0.785
# https://rpubs.com/mkearney/103040
library(stargazer)
# unstandardised
pars.factors <- parameterEstimates(sem_model_fit)[ parameterEstimates(sem_model_fit)[,'op']=='=~', c(1:5)]
stargazer(pars.factors, summary=FALSE, type='text', # use 'html' if you wish to save the table and edit it further in Word etc.
rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings')
##
## Factor loadings
## =====================
## lhs op rhs est se
## ---------------------
## ENC =~ v5 .896 .048
## ENC =~ v7 .765 .043
## ENC =~ v17 1 0
## BUI =~ v42r 1 0
## BUI =~ v43 .798 .089
## PSY =~ v44 1 0
## PSY =~ v45 .592 .068
## ---------------------
pars.regressions <- parameterEstimates(sem_model_fit)[ parameterEstimates(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
stargazer(pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY')
##
## Predicting PSY
## =====================
## rhs est se pvalue
## ---------------------
## ENC -.104 .063 .098
## BUI 1.223 .152 0
## ---------------------
# standardised
std.pars.factors <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='=~', c(1:5)]
stargazer(std.pars.factors, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings (std)')
##
## Factor loadings (std)
## ========================
## lhs op rhs est.std se
## ------------------------
## ENC =~ v5 .814 .022
## ENC =~ v7 .768 .025
## ENC =~ v17 .886 .019
## BUI =~ v42r .672 .043
## BUI =~ v43 .571 .044
## PSY =~ v44 .901 .043
## PSY =~ v45 .551 .042
## ------------------------
std.pars.regressions <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
stargazer(std.pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY (std)')
##
## Predicting PSY (std)
## =======================
## rhs est.std se pvalue
## -----------------------
## ENC -.100 .062 .106
## BUI .850 .058 0
## -----------------------
# produce joint tables
# display both un/standardised parameter estimates in a same table
joint.pars.factors <- cbind(pars.factors, std.pars.factors[4:5]) # pick only last two columns from std parameter estimates
names(joint.pars.factors)[7] <- "se.std" # rename the second "se" in column 7 as "se.std"
stargazer(joint.pars.factors, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings')
##
## Factor loadings
## ====================================
## lhs op rhs est se est.std se.std
## ------------------------------------
## ENC =~ v5 .896 .048 .814 .022
## ENC =~ v7 .765 .043 .768 .025
## ENC =~ v17 1 0 .886 .019
## BUI =~ v42r 1 0 .672 .043
## BUI =~ v43 .798 .089 .571 .044
## PSY =~ v44 1 0 .901 .043
## PSY =~ v45 .592 .068 .551 .042
## ------------------------------------
# display both un/standardised regressions in a same table
joint.pars.regressions <- cbind(pars.regressions, std.pars.regressions[2:4]) # pick three columns from std regressions
names(joint.pars.regressions)[6] <- "se.std" # rename the second "se" in column 6 as "se.std"
names(joint.pars.regressions)[7] <- "pvalue.std" # rename the second "pvalue" in column 7 as "pvalue.std"
stargazer(joint.pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY')
##
## Predicting PSY
## ===============================================
## rhs est se pvalue est.std se.std pvalue.std
## -----------------------------------------------
## ENC -.104 .063 .098 -.100 .062 .106
## BUI 1.223 .152 0 .850 .058 0
## -----------------------------------------------
Lavaan produces a whole bunch of fit measures:
fitMeasures(sem_model_fit)
## npar fmin
## 24.000 0.070
## chisq df
## 62.390 11.000
## pvalue baseline.chisq
## 0.000 1131.720
## baseline.df baseline.pvalue
## 21.000 0.000
## cfi tli
## 0.954 0.912
## cfi.robust tli.robust
## 0.954 0.912
## nnfi rfi
## 0.912 0.895
## nfi pnfi
## 0.945 0.495
## ifi rni
## 0.954 0.954
## nnfi.robust rni.robust
## 0.912 0.954
## logl unrestricted.logl
## -4334.157 -4302.962
## aic bic
## 8716.315 8814.776
## ntotal bic2
## 447.000 8738.610
## rmsea rmsea.ci.lower
## 0.102 0.078
## rmsea.ci.upper rmsea.ci.level
## 0.128 0.900
## rmsea.pvalue rmsea.close.h0
## 0.000 0.050
## rmsea.notclose.pvalue rmsea.notclose.h0
## 0.938 0.080
## rmsea.robust rmsea.ci.lower.robust
## 0.103 0.079
## rmsea.ci.upper.robust rmsea.pvalue.robust
## 0.128 0.000
## rmsea.notclose.pvalue.robust rmr
## 0.940 0.083
## rmr_nomean srmr
## 0.093 0.062
## srmr_bentler srmr_bentler_nomean
## 0.062 0.070
## crmr crmr_nomean
## 0.070 0.080
## srmr_mplus srmr_mplus_nomean
## 0.062 0.070
## cn_05 cn_01
## 141.966 178.146
## gfi agfi
## 0.996 0.989
## pgfi mfi
## 0.313 0.944
## ecvi
## 0.247
# produce fit indices for inline text
sem.fits <- as.numeric(fitMeasures(sem_model_fit, c("chisq", # sem.fits[1]
"df", # sem.fits[2]
"pvalue", # sem.fits[3]
"cfi", # sem.fits[4]
"tli", # sem.fits[5]
"aic", # sem.fits[6]
"bic", # sem.fits[7]
"rmsea", # sem.fits[8]
"rmsea.ci.lower", # sem.fits[9]
"rmsea.ci.upper", # sem.fits[10]
"rmsea.pvalue", # sem.fits[11]
"srmr", # sem.fits[12]
"baseline.chisq", # sem.fits[13]
"baseline.df" # sem.fits[14]
)))
# create vector of desired measure names
fit.names <- c("chisq", "df", "pvalue", "cfi", "tli", "aic", "bic", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "rmsea.pvalue", "srmr", "baseline.chisq", "baseline.df")
# merge names and values (from the previous "sem.fits") into df
df.fit.measures <- data.frame(fit.names,sem.fits)
names(df.fit.measures)[1] <- "Measure"
names(df.fit.measures)[2] <- "Value"
# use Stargazer to produce a table
stargazer(df.fit.measures, summary=FALSE, type='text', # use 'html' if you wish to save the table and edit it further in Word etc.
rownames=FALSE, initial.zero=FALSE, digits=3, title='Fit measures')
##
## Fit measures
## ========================
## Measure Value
## ------------------------
## chisq 62.390
## df 11
## pvalue 0
## cfi .954
## tli .912
## aic 8,716.315
## bic 8,814.776
## rmsea .102
## rmsea.ci.lower .078
## rmsea.ci.upper .128
## rmsea.pvalue .0003
## srmr .062
## baseline.chisq 1,131.720
## baseline.df 21
## ------------------------
Here are short descriptions of selected fit measures:
chisq (Chi square).
df (degrees of freedom) indicate free (not fixed) parameters of the model.
pvalue (probability value) is the probability of getting as large a discrepancy as occurred with the present sample (under appropriate distributional assumptions and assuming a correctly specified model) (Arbuckle & Wothke, 1999). It tests the hypothesis that the model fits perfectly to the population. The value should be greater than pre-set probability level (usually .05). If the value is .05 or less, the departure of the data from the model is significant at the .05 level (model does not fit into data). However, the appropriateness of hypothesis testing in model fitting, even when the necessary distributional assumptions are met, is routinely questioned.
NFI (Normed Fit Index) tells how big discrepancy there is between the model being evaluated (default model) and the baseline model (terribly fitting ‘independence model'). TLI, CFI, RNI (Tucker-Lewis Index, Comparative fit index, Relative Noncentrality Index) similarly compare the proposed model to a baseline model that all other models should be expected to exceed (Hair, Anderson, Tatham & Black, 1995, p. 685). Values close to one indicate a very good fit. According to Bentler & Bonett (1980, p. 600), referring to both the NFI and the TLI, "… models with overall fit indices of less than .90 can usually be improved substantially."
RMSEA (Root Mean Square Error of Approximation). The RMSEA is designed to evaluate the approximate fit of the model in the population (Kaplan, 2000, p.112). This indice is getting smaller as the df increases. In practice this means that models with large RMSEA values (e.g., 0.12) simplify the reality. This error could be estimated as follows (Browne & Cudeck, 1993; Kaplan, 2000, p. 113): < 0.05 ‘close fit', 0.05 – 0.08 ‘fair fit', 0.081 – 0.10 ‘mediocre fit', > 0.10 ‘poor fit'.
SRMR (Standardized Root Mean square Residual). The SRMR is a measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations (Kline, 2016, p. 277). This "badness-of-fit" statistic (Kline, 2016) indicates if the square root of the discrepancy between the sample and model covariance matrixes is too big. Cangur and Ercan (2015) have collected synthesis of several fit measures: An acceptable SRMR value should be below .10, while a good value is less than .05.
Next, we list the minimum set of fit statistics (Kline, 2016, p. 269) that should be reported alongside with the model results.
Absolute fit measures tell how well the model explains the data (Kline, 2016, p. 266). The do that by determining the degree to which the model predicts the observed correlation matrix (Hair et al., 1995, p. 683).
First, relative \(\chi^2\) is calculated by dividing the \(\chi^2\) with df, resulting in 5.67. Usually, values less than five are considered to be adequate (Marsh & Hocevar, 1985). However, some researchers argue that the value should be less than two (Byrne, 1989).
Second, the RMSEA estimate of 0.102 is in the upper bound of the mediocre fit level (.08 - .10), indicating that the model over simplifies the reality. The upper limit of the 90 per cent confidence interval (0.128) supports the assumption that we have quite poor fitting model here.
Third, SRMR estimate of 0.062 indicates that the discrepancy between the sample and model covariance matrix is not too big (as the value is below .10).
All three absolute fit measures are singing the same song: The model is not poor, but it is (perhaps due to being so simple) not doing best justice to the sample.
Incremental fit measures compare the proposed model (default model) to a baseline model (independence model) that all other models should be expected to exceed. Here we discuss only one measure, Comparative fit index (CFI).
The CFI value of 0.954 is calculated by dividing the relative \(\chi^2\) of the proposed model by the relative \(\chi^2\) of the baseline model and subtracting the output from 1:
CFI = 1 - (\(\chi^2\)proposed -
dfproposed) / (\(\chi^2\)baseline -
dfbaseline)
= 1 - (chisq-df)/(baseline.chisq-baseline.df)
= 0.954
CFI value shows that the proposed model has a discrepancy that is 95.37 per cent of the way between the (terribly fitting) independence model (CFI = 0) and the (perfectly fitting) saturated model (CFI = 1).
Quite high incremental fit measure (CFI) value of 0.95 indicates that this model has clearly better fit than the poor fitting baseline (independence) model.
# we report only stat sig results in the text (PSY ~ BUI)
sem_m5_bui_psy_est <- parameterEstimates(sem_model_fit)$est[9]
sem_m5_bui_psy_se <- parameterEstimates(sem_model_fit)$se[9]
sem_m5_bui_psy_p <- parameterEstimates(sem_model_fit)$pvalue[9]
# R Square
sem_m5_psy_r2 <- lavaan::inspect(sem_model_fit,'r2')[8] # use inspect(sem_model_fit,'r2') to see how many columns the R^2 table has (we are interested in the last value for the PSY)
Results showed that Build-up of work requirements was statistically significantly positively related to the Psychic stress of the work, B = 1.223, SE = 0.152, p = 0.000. Encouraging leadership had, as expected, a negative relation to the Psychic stress of the work, but it was not significant. This model (see Figure 5) explained (almost solely by the Build-up of work requirements) 78.49% of the variance of the Psychic stress of the work.
Fit measures are presented in Table 1. The model explained the data quite well (RMSEA = 0.102, C.I. = 0.078 - 0.128; SRMR = 0.062). However, the p-value of \(\chi^2\) test was below the desired .05 value (p = 0.000) and the relative \(\chi^2\) (5.67) was above the desired value of 5. Incremental fit measure (CFI = 0.954) showed that the model is superior compared to a baseline model.
Figure 5. Relationship between Build-up of work requirements (BUI), Encouraging leadership (ENC) and Psychic stress of the work (PSY) (unstandardized estimates).
semPaths(sem_model_fit,
layout="tree",
"par", # unstandardized parameter estimates
edge.label.cex=0.7,
edge.width=0.2,
edge.color="black",
optimizeLatRes=TRUE,
label.scale=TRUE,
normalize=TRUE,
fade=FALSE,
as.expression=c("nodes"))
Table 1. Fit measures
##
## =====================
## Measure Value
## ---------------------
## chisq 62.390
## df 11
## pvalue 0
## rmsea .102
## rmsea.ci.lower .078
## rmsea.ci.upper .128
## srmr .062
## cfi .954
## ---------------------
Modification indices (MI) demonstrate how re-specification of the model may improve the model fit [read more]. However, any modification of the model (according to modification indices) should be in accordance with the theoretical assumptions behind the model (see critical discussion about MI's in Hermida, 2015).
Table below shows the first 15 MI values in descending order.
modindices(sem_model_fit, # or modificationindices(sem_model_fit)
sort = TRUE,
maximum.number = 15) # show the 15 highest MI values
## lhs op rhs mi epc sepc.lv sepc.all sepc.nox
## 63 v43 ~~ v44 25.145 0.420 0.420 0.872 0.872
## 39 BUI =~ v45 23.728 -3.294 -2.561 -2.135 -2.135
## 34 ENC =~ v45 23.728 -0.281 -0.302 -0.252 -0.252
## 33 ENC =~ v44 23.728 0.475 0.510 0.411 0.411
## 38 BUI =~ v44 23.728 5.567 4.328 3.486 3.486
## 44 PSY =~ v43 13.345 -12.942 -14.469 -13.323 -13.323
## 43 PSY =~ v42r 13.344 16.226 18.141 15.671 15.671
## 32 ENC =~ v43 13.344 0.213 0.228 0.210 0.210
## 31 ENC =~ v42r 13.344 -0.267 -0.286 -0.247 -0.247
## 64 v43 ~~ v45 13.262 -0.199 -0.199 -0.223 -0.223
## 36 BUI =~ v7 8.710 -0.174 -0.135 -0.126 -0.126
## 57 v17 ~~ v43 8.442 0.103 0.103 0.204 0.204
## 37 BUI =~ v17 7.406 0.173 0.135 0.111 0.111
## 52 v7 ~~ v42r 6.256 -0.087 -0.087 -0.148 -0.148
## 55 v7 ~~ v45 6.073 -0.092 -0.092 -0.134 -0.134
The first row represents a modification (MI = 25.145) that is expected to have the strongest impact on the \(\chi^2\) value of the model. In this case, it is suggested that we add a covariance (~~) between v43 ("My working pace has increased in recent years") and v44 ("I feel that I am beginning to dislike my work"). As we recall from the earlier investigation of the correlation matrix, the correlation between these two variables was quite high (r = 0.483).
Although these two observed variables load on two different latent variables (PSY v44; BUI v43), they are quite close to each other in theoretical sense (as both PSY and BUI measure negative experiences towards work). If this covariance is added to the model specification, previous \(\chi^2\) value of 62.390 would improve approximately by 25.145). On the other hand, as Hermida (2015) argues, one should be critical towards the initial model (and its psychometric properties) and ask "Why the association between these two observed variables was not made visible when specifying the original model?"
For the demonstrational purposes we re-specify the original model according to the highest MI value.
# specify the modified model
sem_model_mod <- '
# Factors
ENC =~ NA*v5 + v7 + 1*v17 # Encouraging leadership
BUI =~ 1*v42r + v43 # Build-up of work requirements
PSY =~ 1*v44 + v45 # Psychic stress of the work
# Regressions
PSY ~ ENC + BUI # Predict PSY with ENC and BUI
# Covariances
ENC ~~ BUI # Let ENC and BUI correlate
v43 ~~ v44 # Covariance added by modification indice recommendation
'
# estimate the modified model
sem_model_mod_fit <- sem(sem_model_mod,
data=dfttu4,
group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
missing = "fiml.x" # full information maximum likelihood approach
)
# produce diagram of the model
semPaths(sem_model_mod_fit,
layout="tree",
"std", # standardized parameter estimates
edge.label.cex=0.7,
edge.width=0.2,
edge.color="black",
optimizeLatRes=TRUE,
label.scale=TRUE,
normalize=TRUE,
fade=FALSE,
as.expression=c("nodes"))
Figure above shows that now there is a connecting arc between v43 and v44.
Here are the estimates of the modified model:
summary(sem_model_mod_fit, standardized=TRUE, rsq=TRUE) # display R Squared
## lavaan 0.6.14 ended normally after 41 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 25
##
## Number of observations 447
## Number of missing patterns 10
##
## Model Test User Model:
##
## Test statistic 39.661
## Degrees of freedom 10
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Observed
## Observed information based on Hessian
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ENC =~
## v5 0.895 0.047 18.909 0.000 0.961 0.813
## v7 0.767 0.043 17.722 0.000 0.824 0.770
## v17 1.000 1.073 0.885
## BUI =~
## v42r 1.000 1.042 0.900
## v43 0.430 0.098 4.397 0.000 0.448 0.414
## PSY =~
## v44 1.000 1.018 0.820
## v45 0.706 0.078 9.009 0.000 0.719 0.599
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## PSY ~
## ENC -0.220 0.057 -3.852 0.000 -0.232 -0.232
## BUI 0.606 0.143 4.254 0.000 0.621 0.621
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ENC ~~
## BUI -0.345 0.066 -5.252 0.000 -0.308 -0.308
## .v43 ~~
## .v44 0.318 0.064 4.979 0.000 0.318 0.455
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .v5 3.818 0.056 68.096 0.000 3.818 3.229
## .v7 3.436 0.051 67.754 0.000 3.436 3.210
## .v17 3.280 0.058 56.980 0.000 3.280 2.705
## .v42r 2.992 0.055 54.600 0.000 2.992 2.584
## .v43 3.802 0.052 73.569 0.000 3.802 3.516
## .v44 3.277 0.059 55.765 0.000 3.277 2.640
## .v45 2.387 0.057 42.027 0.000 2.387 1.990
## ENC 0.000 0.000 0.000
## BUI 0.000 0.000 0.000
## .PSY 0.000 0.000 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .v5 0.475 0.047 10.100 0.000 0.475 0.340
## .v7 0.467 0.041 11.361 0.000 0.467 0.408
## .v17 0.318 0.047 6.708 0.000 0.318 0.217
## .v42r 0.255 0.221 1.151 0.250 0.255 0.190
## .v43 0.969 0.077 12.640 0.000 0.969 0.829
## .v44 0.504 0.104 4.838 0.000 0.504 0.327
## .v45 0.923 0.080 11.592 0.000 0.923 0.641
## ENC 1.152 0.105 10.959 0.000 1.000 1.000
## BUI 1.086 0.238 4.570 0.000 1.000 1.000
## .PSY 0.489 0.115 4.239 0.000 0.472 0.472
##
## R-Square:
## Estimate
## v5 0.660
## v7 0.592
## v17 0.783
## v42r 0.810
## v43 0.171
## v44 0.673
## v45 0.359
## PSY 0.528
We can see that the \(\chi^2\) ("Test statistic") value has become smaller and the df has also dropped by one (as we asked to estimate one additional covariance). Interestingly, the overall explanation of the PSY variance (R Square) has after the modification dropped from the original model's (quite high) value of 0.78 down to 0.53.
Here is the comparison of the regressions of the original and modified models:
std.pars.regressions <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
std.pars.regressions.mod <- standardizedSolution(sem_model_mod_fit)[ standardizedSolution(sem_model_mod_fit)[,'op']=='~', c(3,4,5,7)]
joint.pars.regressions.comparison <- cbind(std.pars.regressions, std.pars.regressions.mod[2:4]) # pick three columns from mod regressions
names(joint.pars.regressions.comparison)[2] <- "est.orig"
names(joint.pars.regressions.comparison)[3] <- "se.orig"
names(joint.pars.regressions.comparison)[4] <- "p.orig"
names(joint.pars.regressions.comparison)[5] <- "est.mod"
names(joint.pars.regressions.comparison)[6] <- "se.mod"
names(joint.pars.regressions.comparison)[7] <- "p.mod"
stargazer(joint.pars.regressions.comparison, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY (std regressions)')
##
## Predicting PSY (std regressions)
## ================================================
## rhs est.orig se.orig p.orig est.mod se.mod p.mod
## ------------------------------------------------
## ENC -.100 .062 .106 -.232 .064 .0003
## BUI .850 .058 0 .621 .079 0
## ------------------------------------------------
We see that the regression on ENC has become statistically significant (p.mod column) as we have included a covariance between variables v43 and v44.
Here is the comparison of the fit indices of the original ("Value") and modified ("Value.mod") models:
sem.fits.mod <- as.numeric(fitMeasures(sem_model_mod_fit, c("chisq", # sem.fits.table1[1]
"df", # sem.fits.table1[2]
"pvalue", # sem.fits.table1[3]
"rmsea", # sem.fits.table1[4]
"rmsea.ci.lower", # sem.fits.table1[5]
"rmsea.ci.upper", # sem.fits.table1[6]
"srmr", # sem.fits.table1[7]
"cfi" # sem.fits.table1[8]
)))
df.fit.measures.mod <- data.frame(fit.names.table1,sem.fits.mod)
names(df.fit.measures.mod)[2] <- "Value.mod"
joint.fit.comparison <- cbind(df.fit.measures.table1, df.fit.measures.mod[2])
stargazer(joint.fit.comparison, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Comparison of fit indices')
##
## Comparison of fit indices
## ===============================
## Measure Value Value.mod
## -------------------------------
## chisq 62.390 39.661
## df 11 10
## pvalue 0 .00002
## rmsea .102 .081
## rmsea.ci.lower .078 .056
## rmsea.ci.upper .128 .109
## srmr .062 .043
## cfi .954 .973
## -------------------------------
As the table above shows, the fit indices of the modified model are considerably better:
The data frame contains also the fourth factor (latent variable),
"Know-how rewarding" (REW). It is measured with two variables:
v13 The organization rewards its employees' professional knowledge and
skills.
v14 Employees with increased knowledge are given extra
responsibility.
Add "Know-how rewarding" (REW) to the previous model as an exogenous variable (alongside with ENC and BUI). Produce standardized results in a similar manner than in section "7.5 Report the results". That is, write text describing findings and produce figure and table similar to Figure 5 and Table 1.
Beaujean, A. A. (2014). Latent Variable Modeling Using R. A Step-by-Step Guide. Routledge.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606.
Bollen, K. (1989). Structural Equations with Latent Variables. New York: John Wiley & Sons.
Cangur, S., & Ercan, I. (2015). Comparison of Model Fit Indices Used in Structural Equation Modeling Under Multivariate Normality. Journal of Modern Applied Statistical Methods, 14(1), Article 14. https://digitalcommons.wayne.edu/jmasm/vol14/iss1/14/
Hair, J. F., Anderson, R. E., Tatham R. L., & Black, W. C. (1995). Multivariate Data Analysis. Fourth edition. Prentice Hall.
Hermida, R. (2015). The problem of allowing correlated errors in structural equation modeling: concerns and considerations. Computational Methods in Social Sciences, 3(1), 5-17.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.
Kaplan, D. (2000). Structural Equation Modeling. Thousand Oaks: Sage.
Kline, R. B. (2016). Principles and practice of structural equation modeling. Fourth edition. Guilford Press.
Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of self-concept: First- and higher-order factor models and their invariance across groups. Psychological Bulletin, 97, 562–582.
Nokelainen, P., & Ruohotie, P. (2009). Non-linear Modeling of Growth Prerequisites in a Finnish Polytechnic Institution of Higher Education. Journal of Workplace Learning, 21(1), 36-57.
Raykov, T., & Marcoulides, G. A. (2000). A First Course in Structural Equation Modeling. Lawrence Erlbaum Associates.
Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. http://www.jstatsoft.org/v48/i02/
Schumacker, R. E., & Lomax, R. G. (2004). A Beginner's Guide to Structural Equation Modeling. Second edition. Lawrence Erlbaum Associates.
Suhr, D. (2006). [The Basics of Structural Equation Modeling]((https://www.lexjansen.com/wuss/2006/tutorials/TUT-Suhr.pdf).