SEM with R lavaan

# install.packages("lavaan", dependencies = TRUE)
# install.packages("qgraph", dependencies = TRUE)
# install.packages("semPlot", dependencies = TRUE)

library(lavaan)
library(qgraph)
library(semPlot)

1. Introduction

The lavaan package is a versatile tool for Structural Equation Modeling (SEM) (or latent variable analysis, covariance structure analysis). SEM is a good (?) friend of a researcher who aims 1) to understand the patterns of correlation/covariance amongst the variables, and 2) to explain their variance with the model specified (see Suhr, 2006).

We will first look into regression and path models using only observed variables. Second, we will build a SEM that contains also latent (unobserved) variables.

First (regression and path) models will be run with the familiar "dfttu" dataset that contains autonomy, competence and score variables alongside with "gender" and "group" variables. The SEM model will be run with another dataset which will be described later.

If you use lavaan in your own analyses, please use the following reference:
Yves Rosseel (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. URL http://www.jstatsoft.org/v48/i02/

1.1. SEM terminology

Greek symbols

observed/manifest variable (something that you have measured: one column in the data frame, square box in diagram)

latent variable (factor that is composed of more than one observed variables: does not exist in the data frame, circle in diagram)

exogenous variable (independent: sends arrows, explains endogenous variable(s), may be observed (x) or latent (\(\xi\))

endogenous variable (dependent: receives arrows, is explained by other vars, may be observed (y) or latent (\(\eta\))

1.2 Basic model specification operators in lavaan

"=~" latent variable ("is measured by", e.g., SelfEfficacy =~ se1 + se2 + se3)

"~" regression ("is regressed on", e.g., score ~ autonomy + competence + gender)

"~~" (co)variance ("is correlated with", e.g., autonomy ~~ autonomy (variance) or autonomy ~~ competence (covariance))

"~1" intercept/mean (triangle in diagrams, e.g., autonomy ~ 1 estimates the mean of variable autonomy)

"1" fixes parameter or loading to one (e.g., SelfEfficacy =~ 1se1 + se2 + se3)

NA* frees parameter or loading (overrides default, e.g., SelfEfficacy =~ NA*se1 + se2 + se3)

1.3 Estimators

The R lavaan offers several estimator options (see also the manual), but we use in these examples one of the most common one, Maximum Likelihood (ML) estimation. As one of the names for "SEM" is "covariance structure modeling", it indicates that the task of ML (for details, see Crisci, 2012) is to compute the implied covariance matrix (specified in the model) and compare that against the actual covariance matrix based on the empirical data.

2 Before the analysis

Next we go through the steps that are needed before the estimation of a regression model.

2.1 Set the working directory

# where this current script is located?
current_dir <- dirname(rstudioapi::getSourceEditorContext()$path)

# set working directory to current script location
setwd(current_dir)

# check out the result
getwd()

## [1] "/Users/nokelaip/CharveLEAD Dropbox/Petri Nokelainen/home1_cloud/home1/homepage TUNI/public_html/mmx9120/3_sem/exercises/R_lavaan"

2.2 Read and inspect the data

# read the data into data frame (df) called "dfttu"
dfttu <- read.table("dfttu_sem.txt", header=TRUE, sep= "\t")

# display summary of df
summary(dfttu)

##      group            gender          autonomy       competence   
##  Min.   :0.0000   Min.   :0.0000   Min.   :2.250   Min.   :2.500  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.500   1st Qu.:3.500  
##  Median :1.0000   Median :1.0000   Median :3.750   Median :4.000  
##  Mean   :0.5144   Mean   :0.6971   Mean   :3.786   Mean   :3.907  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.250  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :5.000  
##       deep         organised         score       
##  Min.   :2.000   Min.   :1.250   Min.   :  0.00  
##  1st Qu.:3.750   1st Qu.:3.250   1st Qu.: 22.22  
##  Median :4.000   Median :3.750   Median : 50.00  
##  Mean   :4.023   Mean   :3.739   Mean   : 47.76  
##  3rd Qu.:4.250   3rd Qu.:4.250   3rd Qu.: 70.14  
##  Max.   :5.000   Max.   :5.000   Max.   :100.00

# View(dfttu)

# display first six rows of df
head(dfttu)

##   group gender autonomy competence deep organised    score
## 1     0      1     3.75       4.25 4.00       3.0 16.66667
## 2     1      1     3.25       3.75 4.00       3.5 41.66667
## 3     0      1     4.00       4.00 4.00       4.0 88.88889
## 4     0      1     4.00       4.25 4.25       4.5 61.11111
## 5     0      0     3.50       3.75 4.50       5.0 38.88889
## 6     1      1     4.75       4.75 4.50       4.5 44.44444

# install.packages("stargazer")
library(stargazer)
stargazer(dfttu, type = "text")

## 
## ============================================
## Statistic   N   Mean  St. Dev.  Min    Max  
## --------------------------------------------
## group      208 0.514   0.501     0      1   
## gender     208 0.697   0.461     0      1   
## autonomy   208 3.786   0.607   2.250  5.000 
## competence 208 3.907   0.605   2.500  5.000 
## deep       208 4.023   0.507   2.000  5.000 
## organised  208 3.739   0.678   1.250  5.000 
## score      208 47.756  29.118  0.000 100.000
## --------------------------------------------

# install.packages("DescTools")
library(DescTools)
Desc(dfttu, plotit = TRUE)

## ------------------------------------------------------------------------------ 
## Describe dfttu (data.frame):
## 
## data frame:  208 obs. of  7 variables
##      208 complete cases (100.0%)
## 
##   Nr  ColName     Class    NAs  Levels
##   1   group       integer  .          
##   2   gender      integer  .          
##   3   autonomy    numeric  .          
##   4   competence  numeric  .          
##   5   deep        numeric  .          
##   6   organised   numeric  .          
##   7   score       numeric  .          
## 
## 
## ------------------------------------------------------------------------------ 
## 1 - group (integer - dichotomous)
## 
##   length      n    NAs unique
##      208    208      0      2
##          100.0%   0.0%       
## 
##    freq   perc  lci.95  uci.95'
## 0   101  48.6%   41.9%   55.3%
## 1   107  51.4%   44.7%   58.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 2 - gender (integer - dichotomous)
## 
##   length      n    NAs unique
##      208    208      0      2
##          100.0%   0.0%       
## 
##    freq   perc  lci.95  uci.95'
## 0    63  30.3%   24.4%   36.8%
## 1   145  69.7%   63.2%   75.6%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 3 - autonomy (numeric)
## 
##   length       n    NAs  unique     0s    mean  meanCI'
##      208     208      0      12      0   3.786   3.703
##           100.0%   0.0%           0.0%           3.869
##                                                       
##      .05     .10    .25  median    .75     .90     .95
##    2.750   3.000  3.500   3.750  4.000   4.575   4.750
##                                                       
##    range      sd  vcoef     mad    IQR    skew    kurt
##    2.750   0.607  0.160   0.371  0.500  -0.178   0.031
##                                                       
## 
##     value  freq   perc  cumfreq  cumperc
## 1    2.25     5   2.4%        5     2.4%
## 2     2.5     3   1.4%        8     3.8%
## 3    2.75     6   2.9%       14     6.7%
## 4       3    17   8.2%       31    14.9%
## 5    3.25    12   5.8%       43    20.7%
## 6     3.5    33  15.9%       76    36.5%
## 7    3.75    38  18.3%      114    54.8%
## 8       4    43  20.7%      157    75.5%
## 9    4.25    20   9.6%      177    85.1%
## 10    4.5    10   4.8%      187    89.9%
## 11   4.75    11   5.3%      198    95.2%
## 12      5    10   4.8%      208   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 4 - competence (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##      208     208      0      11     0   3.91    3.82
##           100.0%   0.0%          0.0%           3.99
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     2.75    3.00   3.50    4.00  4.25   4.75    5.00
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     2.50    0.61   0.15    0.37  0.75  -0.23   -0.30
##                                                     
## 
##     value  freq   perc  cumfreq  cumperc
## 1     2.5     4   1.9%        4     1.9%
## 2    2.75    11   5.3%       15     7.2%
## 3       3    12   5.8%       27    13.0%
## 4    3.25    11   5.3%       38    18.3%
## 5     3.5    15   7.2%       53    25.5%
## 6    3.75    37  17.8%       90    43.3%
## 7       4    55  26.4%      145    69.7%
## 8    4.25    17   8.2%      162    77.9%
## 9     4.5    19   9.1%      181    87.0%
## 10   4.75    13   6.2%      194    93.3%
## 11      5    14   6.7%      208   100.0%
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 5 - deep (numeric)
## 
##   length       n    NAs  unique    0s   mean  meanCI'
##      208     208      0      13     0   4.02    3.95
##           100.0%   0.0%          0.0%           4.09
##                                                     
##      .05     .10    .25  median   .75    .90     .95
##     3.00    3.50   3.75    4.00  4.25   4.75    4.75
##                                                     
##    range      sd  vcoef     mad   IQR   skew    kurt
##     3.00    0.51   0.13    0.37  0.50  -0.65    1.42
##                                                     
## lowest : 2.0, 2.25, 2.5, 2.75 (2), 3.0 (7)
## highest: 4.0 (62), 4.25 (31), 4.5 (26), 4.75 (15), 5.0 (8)
## 
## heap(?): remarkable frequency (29.8%) for the mode(s) (= 4)
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 6 - organised (numeric)
## 
##   length       n    NAs  unique     0s    mean  meanCI'
##      208     208      0      13      0   3.739   3.646
##           100.0%   0.0%           0.0%           3.832
##                                                       
##      .05     .10    .25  median    .75     .90     .95
##    2.500   2.925  3.250   3.750  4.250   4.500   4.750
##                                                       
##    range      sd  vcoef     mad    IQR    skew    kurt
##    3.750   0.678  0.181   0.741  1.000  -0.310  -0.183
##                                                       
## lowest : 1.25, 2.25, 2.5 (12), 2.75 (7), 3.0 (18)
## highest: 4.0 (38), 4.25 (20), 4.5 (26), 4.75 (8), 5.0 (8)
## 
## heap(?): remarkable frequency (18.3%) for the mode(s) (= 4)
## 
## ' 95%-CI (classic)

## ------------------------------------------------------------------------------ 
## 7 - score (numeric)
## 
##       length          n        NAs     unique         0s       mean     meanCI'
##          208        208          0         32         15  47.756410  43.775973
##                  100.0%       0.0%                  7.2%             51.736847
##                                                                               
##          .05        .10        .25     median        .75        .90        .95
##     0.000000   5.555556  22.222222  50.000000  70.138889  85.000000  94.444444
##                                                                               
##        range         sd      vcoef        mad        IQR       skew       kurt
##   100.000000  29.118418   0.609728  32.946667  47.916667  -0.109178  -1.151072
##                                                                               
## lowest : 0.0 (15), 5.555556 (9), 8.333333 (2), 11.111111 (17), 16.666667 (4)
## highest: 80.555556 (2), 83.333333 (7), 88.888889 (9), 94.444444 (8), 100.0 (4)
## 
## heap(?): remarkable frequency (8.2%) for the mode(s) (= 11.1111111111111)
## 
## ' 95%-CI (classic)

# install.packages("psych")
library(psych)
describe(dfttu)

##            vars   n  mean    sd median trimmed   mad  min max  range  skew
## group         1 208  0.51  0.50   1.00    0.52  0.00 0.00   1   1.00 -0.06
## gender        2 208  0.70  0.46   1.00    0.74  0.00 0.00   1   1.00 -0.85
## autonomy      3 208  3.79  0.61   3.75    3.79  0.37 2.25   5   2.75 -0.18
## competence    4 208  3.91  0.61   4.00    3.92  0.37 2.50   5   2.50 -0.23
## deep          5 208  4.02  0.51   4.00    4.04  0.37 2.00   5   3.00 -0.65
## organised     6 208  3.74  0.68   3.75    3.76  0.74 1.25   5   3.75 -0.31
## score         7 208 47.76 29.12  50.00   47.85 32.95 0.00 100 100.00 -0.11
##            kurtosis   se
## group         -2.01 0.03
## gender        -1.28 0.03
## autonomy       0.03 0.04
## competence    -0.30 0.04
## deep           1.42 0.04
## organised     -0.18 0.05
## score         -1.15 2.02

The data (208 participants, i.e., rows) contains 7 variables:

group 0=control, 1=intervention
gender 0=female, 1=male
autonomy 1=totally disagree, …, 5=totally agree (average of four items)
competence 1=totally disagree, …, 5=totally agree (average of four items)
deep 1=totally disagree, …, 5=totally agree (average of four items)
organised 1=totally disagree, …, 5=totally agree (average of four items)
score 0-100 (mathematics score)

We can see from the summary of the "dfttu" data frame (df) that it contains no missing values (NA). Usually we are happy about that, but in the real life NA's quite often do exist.

Next we examine the distributions of the three central variables that we are going to use in the following exercises: Competence, autonomy and score.

Frequency histograms show the count of each variable value:

hist(dfttu$competence, main="Histogram of Competence", xlab="Competence")

hist(dfttu$autonomy, main="Histogram of Autonomy", xlab="Autonomy")

hist(dfttu$score, main="Histogram of Score", xlab="Score")

Density plots of these three variables are based on kernel smoothing:

# # basic plot
# plot(density(dfttu$competence))
# plot(density(dfttu$autonomy))
# plot(density(dfttu$score))

# "bells and whistles" version
# https://datavizpyr.com/add-vertical-line-to-density-plot-with-ggplot2/

# competence
# shorter way
mean_competence <- as.numeric(dec2(mean(dfttu$competence)))
median_competence <- as.numeric(dec2(median(dfttu$competence)))
mode_competence <- as.numeric(dec2(Mode(dfttu$competence))) # Mode from DescTools library
variance_competence <- as.numeric(dec2(var(dfttu$competence)))
sd_competence <- as.numeric(dec2(sd(dfttu$competence)))
skewness_competence <- as.numeric(dec2(skewness(dfttu$competence)))
kurtosis_competence <- as.numeric(dec2(kurtosis(dfttu$competence)))

# # longer way
# mean_competence <- dfttu %>% 
#   pull(competence) %>% 
#   mean() %>%
#   signif(3)

# display plot
ggplot(dfttu, aes(x=competence)) +
  geom_density( fill="dodgerblue", alpha=0.2) + 
  geom_vline(xintercept=mean_competence, size=1, color="red") +
  geom_text(aes(x=4.5, label=paste0("Mean (red)\n",mean_competence), y=1.5)) +
  geom_vline(xintercept=median_competence, size=1, color="blue") +
  geom_text(aes(x=4.5, label=paste0("Median (blue)\n",median_competence), y=1.2)) + 
  geom_vline(xintercept=mode_competence, size=1, color="orange", linetype="dashed") +
  geom_text(aes(x=4.5, label=paste0("Mode (orange)\n",mode_competence), y=0.9)) + 
  geom_text(aes(x=3, label=paste0("Variance\n",variance_competence), y=1.5)) + 
  geom_text(aes(x=3, label=paste0("Std dev\n",sd_competence), y=1.2)) + 
  geom_text(aes(x=3, label=paste0("Skewness\n",skewness_competence), y=0.9)) + 
  geom_text(aes(x=3, label=paste0("Kurtosis\n",kurtosis_competence), y=0.6))

# autonomy
mean_autonomy <- as.numeric(dec2(mean(dfttu$autonomy)))
median_autonomy <- as.numeric(dec2(median(dfttu$autonomy)))
mode_autonomy <- as.numeric(dec2(Mode(dfttu$autonomy))) # Mode from DescTools library
variance_autonomy <- as.numeric(dec2(var(dfttu$autonomy)))
sd_autonomy <- as.numeric(dec2(sd(dfttu$autonomy)))
skewness_autonomy <- as.numeric(dec2(skewness(dfttu$autonomy)))
kurtosis_autonomy <- as.numeric(dec2(kurtosis(dfttu$autonomy)))

ggplot(dfttu, aes(x=autonomy)) +
  geom_density( fill="dodgerblue", alpha=0.2) + 
  geom_vline(xintercept=mean_autonomy, size=1, color="red") +
  geom_text(aes(x=4.5, label=paste0("Mean (red)\n",mean_autonomy), y=1.5)) +
  geom_vline(xintercept=median_autonomy, size=1, color="blue") +
  geom_text(aes(x=4.5, label=paste0("Median (blue)\n",median_autonomy), y=1.2)) + 
  geom_vline(xintercept=mode_autonomy, size=1, color="orange", linetype="dashed") +
  geom_text(aes(x=4.5, label=paste0("Mode (orange)\n",mode_autonomy), y=0.9)) + 
  geom_text(aes(x=3, label=paste0("Variance\n",variance_autonomy), y=1.5)) + 
  geom_text(aes(x=3, label=paste0("Std dev\n",sd_autonomy), y=1.2)) + 
  geom_text(aes(x=3, label=paste0("Skewness\n",skewness_autonomy), y=0.9)) + 
  geom_text(aes(x=3, label=paste0("Kurtosis\n",kurtosis_autonomy), y=0.6))

# score
mean_score <- as.numeric(dec2(mean(dfttu$score)))
median_score <- as.numeric(dec2(median(dfttu$score)))
mode_score <- as.numeric(dec2(Mode(dfttu$score))) # Mode from DescTools library
variance_score <- as.numeric(dec2(var(dfttu$score)))
sd_score <- as.numeric(dec2(sd(dfttu$score)))
skewness_score <- as.numeric(dec2(skewness(dfttu$score)))
kurtosis_score <- as.numeric(dec2(kurtosis(dfttu$score)))

ggplot(dfttu, aes(x=score)) +
  geom_density( fill="dodgerblue", alpha=0.2) + 
  geom_vline(xintercept=mean_score, size=1, color="red") +
  geom_text(aes(x=75, label=paste0("Mean (red)\n",mean_score), y=0.01)) +
  geom_vline(xintercept=median_score, size=1, color="blue") +
  geom_text(aes(x=75, label=paste0("Median (blue)\n",median_score), y=0.007)) + 
  geom_vline(xintercept=mode_score, size=1, color="orange", linetype="dashed") +
  geom_text(aes(x=75, label=paste0("Mode (orange)\n",mode_score), y=0.004)) + 
  geom_text(aes(x=25, label=paste0("Variance\n",variance_score), y=0.01)) +
  geom_text(aes(x=25, label=paste0("Std dev\n",sd_score), y=0.007)) +
  geom_text(aes(x=25, label=paste0("Skewness\n",skewness_score), y=0.004)) +
  geom_text(aes(x=25, label=paste0("Kurtosis\n",kurtosis_score), y=0.001))

# install.packages("gridExtra")
# library(gridExtra)
# grid.arrange(plot_competence,plot_autonomy,plot_score, ncol=3)

Last figure of score density looks odd by the value of mode (11.11). This makes sense when we look at the frequencies of the original and rounded (zero digits) math score values: There are 17 occurences of 11.11 value.

# table(round(dfttu$score, digits=0))
table(dfttu$score)

## 
##                0 5.55555555555556 8.33333333333333 11.1111111111111 
##               15                9                2               17 
## 16.6666666666667 19.4444444444444 22.2222222222222               25 
##                4                1                6                4 
## 27.7777777777778 30.5555555555556 33.3333333333333 36.1111111111111 
##                6                3                6                2 
## 38.8888888888889 41.6666666666667 44.4444444444444 47.2222222222222 
##                4                5               15                1 
##               50 52.7777777777778 55.5555555555556 58.3333333333333 
##                7                5                7                4 
## 61.1111111111111 63.8888888888889 66.6666666666667 69.4444444444445 
##               10                5               12                6 
## 72.2222222222222               75 77.7777777777778 80.5555555555555 
##                7                2               13                2 
## 83.3333333333333 88.8888888888889 94.4444444444445              100 
##                7                9                8                4

# # install.packages("moments")
# library(moments)
# 
# # is the data more or less "tailed" than normal distribution ("bell curve")?
# kurtosis(dfttu)

group gender autonomy competence deep organised score
1.003331 1.736070 3.059942 2.721777 4.460560 2.844617 1.866836

Kurtosis value of 3 indicates normal distribution (and it is called mesokurtic). If the value is less than 3, then the distribution is said to be platykurtic: It has thinner tails than a normal distribution, resulting in fewer extreme positive or negative values. If the value is greater than 3, then it is said to be leptokurtic: The shape is wider or flatter and the tails are fatter resulting in a greater chance of extreme positive or negative values. Competence and autonomy variables are quite close to (more or less) normal value of 3, but the score is clearly platycurtic.

Skewness is a distribution shape measure that tells if the response values are concentrated more on the left (too many smaller values) or right (too many greater values) side of the mean.

# library(moments)
# 
# # is the data more towards low or high response values than the normal distribution?
# skewness(dfttu)

group gender autonomy competence deep organised score -0.05771633 -0.85794525 -0.17889672 -0.23497594 -0.65758103 -0.31181557 -0.10997036

Usually values greater than -1 or 1 indicate highly skewed (non-normal) data. Results above show that this is not the case here. Moderate skewness is attached to values -1 - -0.5 and 0.5 - 1. However, in our data autonomy, competence and score all have negative skewness values. It means that greater amount of values are larger than the mean value. Following density plots visualise this.

# sqrt
dfttu$scoreSqrt <- sqrt(dfttu$score)
# inverse sqrt
dfttu$scoreSqrtInv <- sqrt(max(dfttu$score+1)-dfttu$score)
# log10
dfttu$scoreLog <- log(dfttu$score)
# inverse log10
dfttu$scoreLogInv <- log(max(dfttu$score+1)-dfttu$score)

skewness(dfttu$score)

## [1] -0.1099704

skewness(dfttu$scoreSqrt)

## [1] -0.8831435

skewness(dfttu$scoreSqrtInv)

## [1] -0.4939812

skewness(dfttu$scoreLog)

## [1] NaN

skewness(dfttu$scoreLogInv)

## [1] -1.92719

plot(density(dfttu$score))

plot(density(dfttu$scoreSqrt))

plot(density(dfttu$scoreSqrtInv))

plot(density(dfttu$scoreLog))

plot(density(dfttu$scoreLogInv))

Skewness is not getting better by transforming the score (as skewness values go further from the desired zero value). Density plots below show that the square root transformation is milder than logarithmic transformation.

kurtosis(dfttu$score)

## [1] 1.866836

kurtosis(dfttu$scoreSqrt)

## [1] 2.915251

kurtosis(dfttu$scoreSqrtInv)

## [1] 2.591751

kurtosis(dfttu$scoreLog)

## [1] NaN

kurtosis(dfttu$scoreLogInv)

## [1] 8.184114

plot(density(dfttu$score))

plot(density(dfttu$scoreSqrt))

plot(density(dfttu$scoreSqrtInv))

plot(density(dfttu$scoreLog))

plot(density(dfttu$scoreLogInv))

The kurtosis of score benefits most from the square root transformation as the kurtosis value gets close to 3 (2.92, second figure). However, we decide to go on without transforming the score variable.

2.3 Deal with the missing data

We create a second df ("dfttuNA") and add approximately 40% of NA's in the autonomy and organised variables (in that new df).

# create a copy of the "whole"no missing data" df ("dfttu")
dfttuNA <- dfttu

# create a new temporary df that only has autonomy and organised columns
# one way is to use "dplyr"
# library(dplyr)
# df_temp <- select(dfttuNA, autonomy, organised)

# but we opt to use "base R"
df_temp <- dfttu[c("autonomy","competence")]

# add appr. 40% of NA's into these two columns
set.seed(2021) # reproduce results
# https://stackoverflow.com/questions/27454265/randomly-insert-nas-into-dataframe-proportionaly/27454361#27454361
df_temp <- as.data.frame(lapply(df_temp,
                     function(cc) cc[ sample(c(TRUE, NA), prob = c(0.60, 0.40), size = length(cc), replace = TRUE) ]
                     )
              )

# overwrite dfttuNA autonomy and organised columns with the same vars containing NA's 
dfttuNA$autonomy <- as.numeric(df_temp[,1]) # first column of df_temp is autonomy
dfttuNA$organised <- as.numeric(df_temp[,2]) # seconf column of df_temp is organised

# display summary of the new df with NA's
summary(dfttuNA)

##      group            gender          autonomy       competence   
##  Min.   :0.0000   Min.   :0.0000   Min.   :2.250   Min.   :2.500  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.500   1st Qu.:3.500  
##  Median :1.0000   Median :1.0000   Median :3.750   Median :4.000  
##  Mean   :0.5144   Mean   :0.6971   Mean   :3.773   Mean   :3.907  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.250  
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :5.000  
##                                    NA's   :79                     
##       deep         organised         score          scoreSqrt     
##  Min.   :2.000   Min.   :2.500   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:3.750   1st Qu.:3.500   1st Qu.: 22.22   1st Qu.: 4.714  
##  Median :4.000   Median :4.000   Median : 50.00   Median : 7.071  
##  Mean   :4.023   Mean   :3.937   Mean   : 47.76   Mean   : 6.357  
##  3rd Qu.:4.250   3rd Qu.:4.500   3rd Qu.: 70.14   3rd Qu.: 8.375  
##  Max.   :5.000   Max.   :5.000   Max.   :100.00   Max.   :10.000  
##                  NA's   :82                                       
##   scoreSqrtInv       scoreLog      scoreLogInv   
##  Min.   : 1.000   Min.   : -Inf   Min.   :0.000  
##  1st Qu.: 5.554   1st Qu.:3.101   1st Qu.:3.429  
##  Median : 7.141   Median :3.912   Median :3.932  
##  Mean   : 6.949   Mean   : -Inf   Mean   :3.732  
##  3rd Qu.: 8.876   3rd Qu.:4.250   3rd Qu.:4.367  
##  Max.   :10.050   Max.   :4.605   Max.   :4.615  
##

head(dfttuNA)

##   group gender autonomy competence deep organised    score scoreSqrt
## 1     0      1     3.75       4.25 4.00      4.25 16.66667  4.082483
## 2     1      1       NA       3.75 4.00      3.75 41.66667  6.454972
## 3     0      1       NA       4.00 4.00      4.00 88.88889  9.428090
## 4     0      1     4.00       4.25 4.25      4.25 61.11111  7.817360
## 5     0      0       NA       3.75 4.50      3.75 38.88889  6.236096
## 6     1      1       NA       4.75 4.50      4.75 44.44444  6.666667
##   scoreSqrtInv scoreLog scoreLogInv
## 1     9.183318 2.813411    4.434777
## 2     7.702813 3.729701    4.083171
## 3     3.480102 4.487387    2.494123
## 4     6.315765 4.112694    3.686098
## 5     7.881060 3.660709    4.128925
## 6     7.520343 3.794240    4.035223

# write new df
write.table (dfttuNA, file = "dfttu_sem_na.txt", row.names = FALSE, sep= "\t")

Here we have it - 79 missing values in autonomy and 82 in organised - great! Let's observe the data patterns:

library(mice)
md.pattern(dfttuNA, rotate.names = TRUE)

##    group gender competence deep score scoreSqrt scoreSqrtInv scoreLog
## 75     1      1          1    1     1         1            1        1
## 54     1      1          1    1     1         1            1        1
## 51     1      1          1    1     1         1            1        1
## 28     1      1          1    1     1         1            1        1
##        0      0          0    0     0         0            0        0
##    scoreLogInv autonomy organised    
## 75           1        1         1   0
## 54           1        1         0   1
## 51           1        0         1   1
## 28           1        0         0   2
##              0       79        82 161

First pattern is the complete data (75 participants) while the second one is the missing data in organised variable (54 participants). The third pattern represents the missigness in autonomy (51 participants), and finally, the fourth pattern are the 28 participants who have both autonomy and organised values missing. We'll return to this "dfttuNA" data later.

2.4 Check assumptions

Linear regression has several assumptions related to the data that is to be analysed:

1. Linearity of data: Relationship between the y (response/dependent) and x (predictor/independent) variable is linear (instead of being non-linear).
How to investigate: Scatterplot of y and x (clear linear association). Scatterplot of residuals vs fitted (model predicted) values of y (horizontal line without distinct pattern).

2. Independence of errors: There is no correlation between the residuals. Residuals are the difference between the observed and model predicted values of y.
How to investigate: Scatterplot of residuals vs model predicted (fitted) values of y (r is approximately zero).

3. Normality of errors: The residuals are normally distributed.
How to investigate: Normal Q-Q ("Quantile-Quantile") plot.

4. Homogeneity of errors variance (homoscedasticity): The variance in the residuals (of y) is constant (i.e., same) regardless of the value of predictor (x).
How to investigate: Scatterplot of standardized residuals vs model predicted (fitted) values of y (residuals should show no pattern across x-axis).

5. Multicollinearity: No strong correlations among the predictor/independent (x) variables.
How to investigate: Correlation matrix; Variance Inflation Factor; Durbin-Watson's d.

2.4.1 Linearity of data

First we produce diagnostic values for both predictors (autonomy, score).

Autonomy

# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/

library(tidyverse)
library(broom)
theme_set(theme_classic())

# re-run with lm()
m1_lm <- lm(competence ~ autonomy, data = dfttu)

# see model diagnostics
m1_lm_diag <- augment(m1_lm)
head(m1_lm_diag)

## # A tibble: 6 × 8
##   competence autonomy .fitted  .resid    .hat .sigma   .cooksd .std.resid
##        <dbl>    <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       4.25     3.75    3.88  0.370  0.00482  0.398 0.00210        0.930
## 2       3.75     3.25    3.50  0.246  0.00858  0.399 0.00166        0.619
## 3       4        4       4.07 -0.0683 0.00541  0.399 0.0000804     -0.172
## 4       4.25     4       4.07  0.182  0.00541  0.399 0.000569       0.457
## 5       3.75     3.5     3.69  0.0576 0.00588  0.399 0.0000623      0.145
## 6       4.75     4.75    4.63  0.118  0.0170   0.399 0.000768       0.298

head() command produces "glimpse" of the first six rows in the data:

fitted: fitted (predicted) values of y (competence)
resid: residual errors (difference between the observed and fitted values)
hat: extreme values in the predictors (x variable here is the autonomy)
sigma: estimated residual standard deviation when corresponding observation is dropped from model
cooksd: Cook's distance (influential values that may indicate an outlier)
std.resid: outliers (or extreme values) in the outcome y variable (competence)

To see the last six rows, try tail() command.

Score

# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/

library(tidyverse)
library(broom)
theme_set(theme_classic())

# re-run with lm()
m2_lm <- lm(competence ~ score, data = dfttu)

# see model diagnostics
m2_lm_diag <- augment(m2_lm)
head(m2_lm_diag)

## # A tibble: 6 × 8
##   competence score .fitted  .resid    .hat .sigma   .cooksd .std.resid
##        <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1       4.25  16.7    3.87  0.385  0.0103   0.606 0.00213       0.639 
## 2       3.75  41.7    3.90 -0.149  0.00502  0.607 0.000154     -0.247 
## 3       4     88.9    3.96  0.0364 0.0144   0.607 0.0000269     0.0606
## 4       4.25  61.1    3.93  0.324  0.00582  0.606 0.000846      0.537 
## 5       3.75  38.9    3.90 -0.145  0.00526  0.607 0.000153     -0.241 
## 6       4.75  44.4    3.90  0.847  0.00487  0.604 0.00482       1.40

Next we plot the observed values of the DV (competence) against the fitted regression line. First predictor is autonomy and then score.

Autonomy as predictor

# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/

# plot residual errors (magenta) btw obs values and the fitted reg line
ggplot(m1_lm_diag, aes(autonomy, competence)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = autonomy, yend = .fitted), color = "magenta", size = 0.3) + 
  ggtitle("Competence ~ Autonomy (original)")

Association between competence and autonomy looks strong and linear.

Score as predictor

# http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/

# plot residual errors (magenta) btw obs values and the fitted reg line
ggplot(m2_lm_diag, aes(score, competence)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = score, yend = .fitted), color = "magenta", size = 0.3) + 
  ggtitle("Competence ~ score (original)")

We can see that score is not a good predictor of competence. However, there is a weak positive association.

Next we look into the scatterplot of residuals vs fitted (model predicted) values of y (DV). Residual (or ‘error') is the difference between the dependent (‘endogenous',‘response') variable's observed and fitted (estimation based on the model) values. A horizontal line without distinct patterns is an indication of a linear relationship.

Autonomy

# run linear regression
m1_lm <- lm(competence ~ autonomy, data = dfttu)

# display first (of four) plot
plot(m1_lm,1)

We have here no distinct pattern in the residual plot.

Score

# run linear regression
m2_lm <- lm(competence ~ score, data = dfttu)

# display first (of four) plot
plot(m2_lm,1)

As was with competence ~ autonomy, we have no distinct pattern in the residual plot with competence ~ score.

2.4.2 Independence of errors

Plots below show that there is no correlation between the residuals (red line is horizontal and close to zero).

plot(m1_lm,1)

plot(m2_lm,1)

2.4.3 Normality of errors

Normal Q-Q plot is widely used way to see if the residuals are normally distributed. This is the case if they (approximately) follow the straight (dashed) line.

Competence ~ autonomy

Plots below show that the competence residuals (with autonomy as a predictor) follow the line quite nicely and we assume that they are normally distributed.

plot(m1_lm,2)

Not just to rely on a plot, we calculate Shapiro-Wilk and Kolmogorov-Smirnov tests.

# install.packages("olsrr")
library(olsrr)

# Shapiro-Wilk and Kolmogorov-Smirnov test values should be > .05
ols_test_normality(m1_lm) # both values are above .05

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9873         0.0594 
## Kolmogorov-Smirnov        0.0741         0.2039 
## Cramer-von Mises          29.92          0.0000 
## Anderson-Darling          0.8217         0.0333 
## -----------------------------------------------

# correlations btw observed and fitted values
ols_test_correlation(m1_lm) # high correlation (r = .994) as it should be

## [1] 0.9943387

S-W and K-S values are above .05, so the null hypothesis that the residuals are normally distributed cannot be rejected.

Competence ~ score

Plots below show that the competence residuals (with score as a predictor) do not follow the line as nicely as with autonomy, so we do some further checks.

plot(m2_lm,2)

Not just to rely on a plot, we will calculate Shapiro-Wilk and Kolmogorov-Smirnov tests.

# install.packages("olsrr")
library(olsrr)

# Shapiro-Wilk and Kolmogorov-Smirnov test value should be > .05
ols_test_normality(m2_lm) # both below .05

## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9715          3e-04 
## Kolmogorov-Smirnov        0.1023         0.0258 
## Cramer-von Mises         21.9112         0.0000 
## Anderson-Darling          1.9328          1e-04 
## -----------------------------------------------

# correlations btw observed and fitted values
ols_test_correlation(m2_lm) # high correlation .987 as it should be

## [1] 0.9871404

S-W and K-S values are below .05, so the null hypothesis that the residuals are normally distributed can be rejected. However, normality assumption violation is not a big problem with larger sample sizes.

2.4.4 Homogeneity of errors variance

Scale-Location (or spread-Location): Homogeneity of variance of the residuals (homoscedasticity) is met if the line is horizontal and the points are equally spread.
- Our residuals are spread equally along the ranges of predictors (so we have no problem with heteroscedasticity). If there would be a problem, we could use a log or square root transformation of the outcome variable (y):
lm(log(competence) ~ autonomy, data = dfttu)
or
lm(sqrt(competence) ~ autonomy, data = dfttu)

# raw data competence ~ autonomy
plot(m1_lm,3)

# square root transformed competence
m1b_lm_sqrt <- lm(sqrt(competence) ~ autonomy, data = dfttu)
# log transformed autonomy
m1c_lm_log <- lm(log(competence) ~ autonomy, data = dfttu)

plot(m1b_lm_sqrt,3)

plot(m1c_lm_log,3)

First plot shows that the residuals (of competence) are spread quite equally along the range of autonomy. Doing sqrt or log transformation to DV (competence) does not improve the situation.

# raw data competence ~ score
plot(m2_lm,3)

# square root transformed competence
m2b_lm_sqrt <- lm(sqrt(competence) ~ score, data = dfttu)
# log transformed autonomy
m2c_lm_log <- lm(log(competence) ~ score, data = dfttu)

plot(m2b_lm_sqrt,3)

plot(m2c_lm_log,3)

We also leave score untransformed.

2.4.5 Multicollinearity

Next we will investigate the correlations between the variables in the model.

# create new df with only two variables
dfttu2 <- dfttu[,c("autonomy","competence","score")]

# compute correlation matrix
lavCor(dfttu2)

##            autnmy cmptnc  score
## autonomy    1.000              
## competence  0.754  1.000       
## score      -0.014  0.066  1.000

We see that autonomy and competence have a strong correlation (r = 0.75).

Cohen (1988) has defined some effect size "rules" for r:
small .1
medium .3
large > .5

So we conclude at this point that autonomy has "a large bi-variate association" with competence. We further check if there is a multicollinearity problem with the regression models using Durbin-Watson d test.

library(car)

# Null hypothesis: There is no correlation among the residuals (p-value >.05)
durbinWatsonTest(m1_lm) # com ~ aut

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.03773094      2.071179   0.646
##  Alternative hypothesis: rho != 0

durbinWatsonTest(m2_lm) # com ~ sco

##  lag Autocorrelation D-W Statistic p-value
##    1     0.009978909       1.97743   0.912
##  Alternative hypothesis: rho != 0

It seems that multicollinearity is not a problem in this data.

2.4.6 Outliers and high leverage points

Final step before the linear regression analysis is to check if there are extreme values that might influence the regression results. Residuals vs Leverage plot is useful for this purpose.

Outlier: extreme outcome variable (DV) value (competence).
- Examination of standardized residuals (residual / standard error). Potential outlier if std.res. > 3.

High leverage point: extreme predictor (IV) value (autonomy, score).
- Examination of .hat value. Potential high leverage point if .hat > 2(p+1)/n. Note: p = number of predictors, n = number of observations.
- In "competence ~ autonomy" model: 2(1+1)/208 = 0.02.

# plot autonomy
plot(m1_lm,5)

# plot score
plot(m2_lm,5)

First plot (competence ~ autonomy) shows that the results regarding outliers seem to be quite ok as the standardized residuals are not exceeding |3|. However, leverage values exceed 0.02 level. Top three potential outliers are participants 43, 81 and 167.

Score as a predictor performs better: No outliers or high leverage point problems are present.

Next we investigate the most influential values with Cook's distance method. An observation is influential if its Cook's distance value is greater than 4 / (n - p - 1). The value with our sample of 208 (n) and one predictor (p) is: 4 / (208 - 1 - 1) = 0.01941748. Also a simpler equation is applicable: 4/n = 4/208 = 0.01923077. Roughly speaking we are looking at Cook's distance values greater than 0.02.

First we plot competence ~ autonomy:

# autonomy

# plot 1
n <- nrow(dfttu)
plot(cooks.distance(m1_lm), main = "Cook's distance")
abline(h = 4/n, lty = 2, col = "darkred") # add cutoff line

# plot 2
plot(m1_lm, 4, 
     id.n = 10 # ask to indicate top 10 influential values
     )
abline(h = 4/n, lty = 2, col = "darkred")

Row 81 value is clearly the highest, followed by 167 and 43.

And then the competence ~ score:

# score

# plot 1
n <- nrow(dfttu)
plot(cooks.distance(m2_lm), main = "Cook's distance")
abline(h = 4/n, lty = 2, col = "darkgreen")

# plot 2
plot(m2_lm, 4, 
     id.n = 10 # ask to indicate top 10 influential values
     )
abline(h = 4/n, lty = 2, col = "darkgreen")

Plots above show that some rows in the df have higher values than 0.02. So we proceed to further check their behavior ..

# install.packages("performance")
library(performance)

# ---------------
# Cook's distance
# https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/
# measures how much all of the fitted values in the model change when the one data point (in turn) is deleted.

# check outliers with Cook's distance, m1_lm (competence ~ autonomy)
check_outliers(m1_lm, method = "cook") # result is OK, no outliers

## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)

# display Cook's distance values in descending order
m1_lm_outlier_cook_list <- as.data.frame(check_outliers(m1_lm, method = "cook"))
m1_lm_outlier_cook_list[order(-m1_lm_outlier_cook_list$Distance_Cook),]

##     Row Distance_Cook Outlier_Cook Outlier
## 81   81  5.777000e-02            0       0
## 167 167  3.305738e-02            0       0
## 43   43  2.597894e-02            0       0
## 156 156  2.473148e-02            0       0
## 112 112  2.442319e-02            0       0
## 162 162  2.389570e-02            0       0
## 62   62  2.216170e-02            0       0
## 192 192  2.216170e-02            0       0
## 89   89  1.966533e-02            0       0
## 13   13  1.787055e-02            0       0
## 140 140  1.665546e-02            0       0
## 152 152  1.665546e-02            0       0
## 161 161  1.665546e-02            0       0
## 174 174  1.529027e-02            0       0
## 26   26  1.495651e-02            0       0
## 41   41  1.396961e-02            0       0
## 7     7  1.350780e-02            0       0
## 66   66  1.350780e-02            0       0
## 207 207  1.350780e-02            0       0
## 49   49  1.339995e-02            0       0
## 116 116  1.339995e-02            0       0
## 171 171  1.339995e-02            0       0
## 146 146  1.223388e-02            0       0
## 141 141  1.189914e-02            0       0
## 73   73  1.161223e-02            0       0
## 189 189  1.161223e-02            0       0
## 60   60  1.153839e-02            0       0
## 82   82  1.153839e-02            0       0
## 83   83  1.144899e-02            0       0
## 138 138  8.990546e-03            0       0
## 31   31  8.856107e-03            0       0
## 44   44  8.856107e-03            0       0
## 28   28  8.193462e-03            0       0
## 17   17  8.006772e-03            0       0
## 32   32  8.006772e-03            0       0
## 119 119  8.006772e-03            0       0
## 68   68  7.853409e-03            0       0
## 77   77  7.853409e-03            0       0
## 94   94  7.853409e-03            0       0
## 97   97  7.725194e-03            0       0
## 154 154  7.725194e-03            0       0
## 163 163  7.725194e-03            0       0
## 24   24  7.496670e-03            0       0
## 64   64  7.496670e-03            0       0
## 195 195  7.428471e-03            0       0
## 25   25  7.317882e-03            0       0
## 47   47  7.317882e-03            0       0
## 71   71  7.317882e-03            0       0
## 90   90  6.996524e-03            0       0
## 8     8  6.755855e-03            0       0
## 200 200  6.755855e-03            0       0
## 27   27  6.260564e-03            0       0
## 29   29  6.260564e-03            0       0
## 123 123  6.260564e-03            0       0
## 108 108  5.952696e-03            0       0
## 110 110  5.895542e-03            0       0
## 134 134  5.895542e-03            0       0
## 136 136  5.895542e-03            0       0
## 51   51  5.832270e-03            0       0
## 133 133  5.832270e-03            0       0
## 57   57  5.565256e-03            0       0
## 50   50  4.181400e-03            0       0
## 127 127  4.181400e-03            0       0
## 130 130  4.181400e-03            0       0
## 175 175  4.181400e-03            0       0
## 185 185  4.181400e-03            0       0
## 194 194  4.181400e-03            0       0
## 122 122  3.670049e-03            0       0
## 172 172  3.670049e-03            0       0
## 183 183  3.464947e-03            0       0
## 16   16  3.210846e-03            0       0
## 52   52  3.210846e-03            0       0
## 72   72  3.210846e-03            0       0
## 87   87  3.210846e-03            0       0
## 95   95  3.210846e-03            0       0
## 143 143  3.210846e-03            0       0
## 98   98  3.196412e-03            0       0
## 169 169  3.196412e-03            0       0
## 37   37  2.581296e-03            0       0
## 63   63  2.581296e-03            0       0
## 137 137  2.581296e-03            0       0
## 139 139  2.581296e-03            0       0
## 150 150  2.581296e-03            0       0
## 173 173  2.581296e-03            0       0
## 180 180  2.581296e-03            0       0
## 54   54  2.221047e-03            0       0
## 55   55  2.221047e-03            0       0
## 135 135  2.221047e-03            0       0
## 179 179  2.221047e-03            0       0
## 188 188  2.221047e-03            0       0
## 19   19  2.098082e-03            0       0
## 178 178  2.098082e-03            0       0
## 1     1  2.098082e-03            0       0
## 70   70  1.779610e-03            0       0
## 96   96  1.779610e-03            0       0
## 23   23  1.775084e-03            0       0
## 42   42  1.775084e-03            0       0
## 65   65  1.775084e-03            0       0
## 78   78  1.775084e-03            0       0
## 79   79  1.775084e-03            0       0
## 115 115  1.775084e-03            0       0
## 144 144  1.775084e-03            0       0
## 153 153  1.775084e-03            0       0
## 168 168  1.775084e-03            0       0
## 191 191  1.775084e-03            0       0
## 101 101  1.745936e-03            0       0
## 158 158  1.745936e-03            0       0
## 186 186  1.745936e-03            0       0
## 190 190  1.745936e-03            0       0
## 2     2  1.659275e-03            0       0
## 22   22  1.659275e-03            0       0
## 75   75  1.659275e-03            0       0
## 131 131  1.659275e-03            0       0
## 106 106  1.604352e-03            0       0
## 107 107  1.604352e-03            0       0
## 120 120  1.604352e-03            0       0
## 160 160  1.604352e-03            0       0
## 193 193  1.604352e-03            0       0
## 56   56  1.450418e-03            0       0
## 124 124  1.450418e-03            0       0
## 205 205  1.450418e-03            0       0
## 80   80  1.408130e-03            0       0
## 126 126  1.408130e-03            0       0
## 34   34  1.019818e-03            0       0
## 125 125  9.698473e-04            0       0
## 128 128  9.698473e-04            0       0
## 184 184  9.143197e-04            0       0
## 6     6  7.684244e-04            0       0
## 10   10  7.684244e-04            0       0
## 85   85  7.684244e-04            0       0
## 121 121  7.684244e-04            0       0
## 182 182  7.684244e-04            0       0
## 132 132  6.939733e-04            0       0
## 155 155  6.939733e-04            0       0
## 165 165  6.939733e-04            0       0
## 4     4  5.687311e-04            0       0
## 12   12  5.687311e-04            0       0
## 53   53  5.687311e-04            0       0
## 76   76  5.687311e-04            0       0
## 103 103  5.687311e-04            0       0
## 111 111  5.687311e-04            0       0
## 159 159  5.687311e-04            0       0
## 196 196  3.942467e-04            0       0
## 35   35  2.608347e-04            0       0
## 46   46  2.608347e-04            0       0
## 59   59  2.608347e-04            0       0
## 67   67  2.608347e-04            0       0
## 91   91  2.608347e-04            0       0
## 104 104  2.608347e-04            0       0
## 114 114  2.608347e-04            0       0
## 118 118  2.608347e-04            0       0
## 147 147  2.608347e-04            0       0
## 15   15  2.198466e-04            0       0
## 18   18  2.198466e-04            0       0
## 30   30  2.198466e-04            0       0
## 38   38  2.198466e-04            0       0
## 58   58  2.198466e-04            0       0
## 84   84  2.198466e-04            0       0
## 86   86  2.198466e-04            0       0
## 149 149  2.198466e-04            0       0
## 151 151  2.198466e-04            0       0
## 164 164  2.198466e-04            0       0
## 176 176  2.198466e-04            0       0
## 187 187  2.198466e-04            0       0
## 201 201  2.198466e-04            0       0
## 202 202  2.198466e-04            0       0
## 203 203  2.198466e-04            0       0
## 109 109  1.841271e-04            0       0
## 129 129  1.841271e-04            0       0
## 181 181  1.841271e-04            0       0
## 14   14  1.150826e-04            0       0
## 105 105  1.150826e-04            0       0
## 204 204  1.150826e-04            0       0
## 3     3  8.042786e-05            0       0
## 21   21  8.042786e-05            0       0
## 36   36  8.042786e-05            0       0
## 39   39  8.042786e-05            0       0
## 48   48  8.042786e-05            0       0
## 61   61  8.042786e-05            0       0
## 69   69  8.042786e-05            0       0
## 92   92  8.042786e-05            0       0
## 100 100  8.042786e-05            0       0
## 102 102  8.042786e-05            0       0
## 117 117  8.042786e-05            0       0
## 142 142  8.042786e-05            0       0
## 145 145  8.042786e-05            0       0
## 148 148  8.042786e-05            0       0
## 157 157  8.042786e-05            0       0
## 170 170  8.042786e-05            0       0
## 197 197  8.042786e-05            0       0
## 198 198  8.042786e-05            0       0
## 5     5  6.231817e-05            0       0
## 9     9  6.231817e-05            0       0
## 11   11  6.231817e-05            0       0
## 20   20  6.231817e-05            0       0
## 33   33  6.231817e-05            0       0
## 40   40  6.231817e-05            0       0
## 45   45  6.231817e-05            0       0
## 74   74  6.231817e-05            0       0
## 99   99  6.231817e-05            0       0
## 113 113  6.231817e-05            0       0
## 208 208  6.231817e-05            0       0
## 166 166  9.702329e-07            0       0
## 177 177  9.702329e-07            0       0
## 199 199  9.702329e-07            0       0
## 88   88  7.263328e-07            0       0
## 93   93  5.265104e-07            0       0
## 206 206  5.265104e-07            0       0

# we see that observation 81 has the highest value (0.058), but also several other values are above the 0.02 level
# however, the test indicates that we do not have an outlier problem here




# --------------------
# Mahalanobis distance

# to get a second opinion, we turn to Mahalanobis distance
# https://www.statology.org/mahalanobis-distance-r/
check_outliers(dfttu[c("competence","autonomy")], method = "mahalanobis")

## OK: No outliers detected.
## - Based on the following method and threshold: mahalanobis (13.82).
## - For variables: competence, autonomy

# we have an indication of one possible outlier, participant 81
# this finding is against the Cook distance result above, but resonates with the earlier plot

# display Mahalanobis values in descending order
m1_lm_outlier_mah_list <- as.data.frame(check_outliers(dfttu[c("competence","autonomy")], method = "mahalanobis"))
m1_lm_outlier_mah_list <- m1_lm_outlier_mah_list[order(-m1_lm_outlier_mah_list$Distance_Mahalanobis),]
# add also p-values, usually we look for values smaller than .001
m1_lm_outlier_mah_list$pvalue <- pchisq(m1_lm_outlier_mah_list$Distance_Mahalanobis, 
                                        df=1, # number of variables minus one (2-1=1)
                                        lower.tail=FALSE
                                        )
m1_lm_outlier_mah_list

##     Row Distance_Mahalanobis Outlier_Mahalanobis Outlier      pvalue
## 81   81           8.64253893                   0       0 0.003284035
## 89   89           7.35326766                   0       0 0.006694098
## 167 167           7.03191732                   0       0 0.008006958
## 156 156           6.99851302                   0       0 0.008157745
## 97   97           6.81076737                   0       0 0.009060982
## 154 154           6.81076737                   0       0 0.009060982
## 163 163           6.81076737                   0       0 0.009060982
## 195 195           6.79526286                   0       0 0.009140007
## 88   88           6.40714450                   0       0 0.011366206
## 43   43           6.06137486                   0       0 0.013817008
## 140 140           5.84697068                   0       0 0.015603889
## 152 152           5.84697068                   0       0 0.015603889
## 161 161           5.84697068                   0       0 0.015603889
## 26   26           5.62230507                   0       0 0.017733311
## 112 112           5.41611661                   0       0 0.019951688
## 162 162           5.36209425                   0       0 0.020579046
## 62   62           5.05524349                   0       0 0.024551532
## 192 192           5.05524349                   0       0 0.024551532
## 141 141           4.91230737                   0       0 0.026665997
## 73   73           4.79394504                   0       0 0.028559942
## 189 189           4.79394504                   0       0 0.028559942
## 98   98           4.72094619                   0       0 0.029797343
## 169 169           4.72094619                   0       0 0.029797343
## 28   28           4.65128824                   0       0 0.031030252
## 13   13           4.43727106                   0       0 0.035162409
## 60   60           4.36580300                   0       0 0.036667198
## 82   82           4.36580300                   0       0 0.036667198
## 146 146           4.35374393                   0       0 0.036927681
## 174 174           4.30170177                   0       0 0.038074256
## 41   41           4.20745756                   0       0 0.040246617
## 37   37           4.20633873                   0       0 0.040273174
## 63   63           4.20633873                   0       0 0.040273174
## 137 137           4.20633873                   0       0 0.040273174
## 139 139           4.20633873                   0       0 0.040273174
## 150 150           4.20633873                   0       0 0.040273174
## 173 173           4.20633873                   0       0 0.040273174
## 180 180           4.20633873                   0       0 0.040273174
## 7     7           4.08769083                   0       0 0.043196619
## 66   66           4.08769083                   0       0 0.043196619
## 207 207           4.08769083                   0       0 0.043196619
## 196 196           4.03294286                   0       0 0.044620043
## 31   31           3.82182965                   0       0 0.050588968
## 44   44           3.82182965                   0       0 0.050588968
## 49   49           3.70979489                   0       0 0.054094038
## 116 116           3.70979489                   0       0 0.054094038
## 171 171           3.70979489                   0       0 0.054094038
## 24   24           3.37970765                   0       0 0.066003762
## 64   64           3.37970765                   0       0 0.066003762
## 83   83           3.34015836                   0       0 0.067608102
## 138 138           3.25842886                   0       0 0.071057178
## 17   17           3.06758106                   0       0 0.079868232
## 32   32           3.06758106                   0       0 0.079868232
## 119 119           3.06758106                   0       0 0.079868232
## 34   34           3.01928394                   0       0 0.082279788
## 184 184           3.00847947                   0       0 0.082829957
## 68   68           2.86873133                   0       0 0.090316039
## 77   77           2.86873133                   0       0 0.090316039
## 94   94           2.86873133                   0       0 0.090316039
## 25   25           2.63436604                   0       0 0.104573796
## 47   47           2.63436604                   0       0 0.104573796
## 71   71           2.63436604                   0       0 0.104573796
## 125 125           2.63399309                   0       0 0.104598356
## 128 128           2.63399309                   0       0 0.104598356
## 6     6           2.61097975                   0       0 0.106126177
## 10   10           2.61097975                   0       0 0.106126177
## 85   85           2.61097975                   0       0 0.106126177
## 121 121           2.61097975                   0       0 0.106126177
## 182 182           2.61097975                   0       0 0.106126177
## 110 110           2.43563022                   0       0 0.118606218
## 134 134           2.43563022                   0       0 0.118606218
## 136 136           2.43563022                   0       0 0.118606218
## 90   90           2.39163211                   0       0 0.121986211
## 8     8           2.33620536                   0       0 0.126397119
## 200 200           2.33620536                   0       0 0.126397119
## 50   50           2.31191714                   0       0 0.128385615
## 127 127           2.31191714                   0       0 0.128385615
## 130 130           2.31191714                   0       0 0.128385615
## 175 175           2.31191714                   0       0 0.128385615
## 185 185           2.31191714                   0       0 0.128385615
## 194 194           2.31191714                   0       0 0.128385615
## 27   27           2.20814334                   0       0 0.137283808
## 29   29           2.20814334                   0       0 0.137283808
## 123 123           2.20814334                   0       0 0.137283808
## 51   51           2.19183832                   0       0 0.138743637
## 133 133           2.19183832                   0       0 0.138743637
## 57   57           2.17007958                   0       0 0.140718952
## 108 108           2.12829885                   0       0 0.144601158
## 183 183           1.97609842                   0       0 0.159802035
## 80   80           1.89138536                   0       0 0.169045749
## 126 126           1.89138536                   0       0 0.169045749
## 109 109           1.70578063                   0       0 0.191533728
## 129 129           1.70578063                   0       0 0.191533728
## 181 181           1.70578063                   0       0 0.191533728
## 122 122           1.46162827                   0       0 0.226671223
## 172 172           1.46162827                   0       0 0.226671223
## 14   14           1.40377972                   0       0 0.236091747
## 105 105           1.40377972                   0       0 0.236091747
## 204 204           1.40377972                   0       0 0.236091747
## 16   16           1.30459828                   0       0 0.253375000
## 52   52           1.30459828                   0       0 0.253375000
## 72   72           1.30459828                   0       0 0.253375000
## 87   87           1.30459828                   0       0 0.253375000
## 95   95           1.30459828                   0       0 0.253375000
## 143 143           1.30459828                   0       0 0.253375000
## 70   70           1.19016357                   0       0 0.275296566
## 96   96           1.19016357                   0       0 0.275296566
## 2     2           1.16245019                   0       0 0.280957972
## 22   22           1.16245019                   0       0 0.280957972
## 75   75           1.16245019                   0       0 0.280957972
## 131 131           1.16245019                   0       0 0.280957972
## 106 106           1.00057036                   0       0 0.317172538
## 107 107           1.00057036                   0       0 0.317172538
## 120 120           1.00057036                   0       0 0.317172538
## 160 160           1.00057036                   0       0 0.317172538
## 193 193           1.00057036                   0       0 0.317172538
## 56   56           0.96064811                   0       0 0.327023640
## 124 124           0.96064811                   0       0 0.327023640
## 205 205           0.96064811                   0       0 0.327023640
## 54   54           0.91978336                   0       0 0.337531860
## 55   55           0.91978336                   0       0 0.337531860
## 135 135           0.91978336                   0       0 0.337531860
## 179 179           0.91978336                   0       0 0.337531860
## 188 188           0.91978336                   0       0 0.337531860
## 1     1           0.86905665                   0       0 0.351216601
## 19   19           0.86905665                   0       0 0.351216601
## 178 178           0.86905665                   0       0 0.351216601
## 23   23           0.82167395                   0       0 0.364691196
## 42   42           0.82167395                   0       0 0.364691196
## 65   65           0.82167395                   0       0 0.364691196
## 78   78           0.82167395                   0       0 0.364691196
## 79   79           0.82167395                   0       0 0.364691196
## 115 115           0.82167395                   0       0 0.364691196
## 144 144           0.82167395                   0       0 0.364691196
## 153 153           0.82167395                   0       0 0.364691196
## 168 168           0.82167395                   0       0 0.364691196
## 191 191           0.82167395                   0       0 0.364691196
## 93   93           0.78043626                   0       0 0.377007750
## 206 206           0.78043626                   0       0 0.377007750
## 101 101           0.76609739                   0       0 0.381427069
## 158 158           0.76609739                   0       0 0.381427069
## 186 186           0.76609739                   0       0 0.381427069
## 190 190           0.76609739                   0       0 0.381427069
## 166 166           0.58473862                   0       0 0.444460906
## 177 177           0.58473862                   0       0 0.444460906
## 199 199           0.58473862                   0       0 0.444460906
## 132 132           0.45656892                   0       0 0.499231872
## 155 155           0.45656892                   0       0 0.499231872
## 165 165           0.45656892                   0       0 0.499231872
## 4     4           0.33335675                   0       0 0.563689168
## 12   12           0.33335675                   0       0 0.563689168
## 53   53           0.33335675                   0       0 0.563689168
## 76   76           0.33335675                   0       0 0.563689168
## 103 103           0.33335675                   0       0 0.563689168
## 111 111           0.33335675                   0       0 0.563689168
## 159 159           0.33335675                   0       0 0.563689168
## 5     5           0.24325082                   0       0 0.621867999
## 9     9           0.24325082                   0       0 0.621867999
## 11   11           0.24325082                   0       0 0.621867999
## 20   20           0.24325082                   0       0 0.621867999
## 33   33           0.24325082                   0       0 0.621867999
## 40   40           0.24325082                   0       0 0.621867999
## 45   45           0.24325082                   0       0 0.621867999
## 74   74           0.24325082                   0       0 0.621867999
## 99   99           0.24325082                   0       0 0.621867999
## 113 113           0.24325082                   0       0 0.621867999
## 208 208           0.24325082                   0       0 0.621867999
## 3     3           0.15385645                   0       0 0.694876906
## 21   21           0.15385645                   0       0 0.694876906
## 36   36           0.15385645                   0       0 0.694876906
## 39   39           0.15385645                   0       0 0.694876906
## 48   48           0.15385645                   0       0 0.694876906
## 61   61           0.15385645                   0       0 0.694876906
## 69   69           0.15385645                   0       0 0.694876906
## 92   92           0.15385645                   0       0 0.694876906
## 100 100           0.15385645                   0       0 0.694876906
## 102 102           0.15385645                   0       0 0.694876906
## 117 117           0.15385645                   0       0 0.694876906
## 142 142           0.15385645                   0       0 0.694876906
## 145 145           0.15385645                   0       0 0.694876906
## 148 148           0.15385645                   0       0 0.694876906
## 157 157           0.15385645                   0       0 0.694876906
## 170 170           0.15385645                   0       0 0.694876906
## 197 197           0.15385645                   0       0 0.694876906
## 198 198           0.15385645                   0       0 0.694876906
## 35   35           0.11113322                   0       0 0.738857652
## 46   46           0.11113322                   0       0 0.738857652
## 59   59           0.11113322                   0       0 0.738857652
## 67   67           0.11113322                   0       0 0.738857652
## 91   91           0.11113322                   0       0 0.738857652
## 104 104           0.11113322                   0       0 0.738857652
## 114 114           0.11113322                   0       0 0.738857652
## 118 118           0.11113322                   0       0 0.738857652
## 147 147           0.11113322                   0       0 0.738857652
## 15   15           0.09422431                   0       0 0.758873977
## 18   18           0.09422431                   0       0 0.758873977
## 30   30           0.09422431                   0       0 0.758873977
## 38   38           0.09422431                   0       0 0.758873977
## 58   58           0.09422431                   0       0 0.758873977
## 84   84           0.09422431                   0       0 0.758873977
## 86   86           0.09422431                   0       0 0.758873977
## 149 149           0.09422431                   0       0 0.758873977
## 151 151           0.09422431                   0       0 0.758873977
## 164 164           0.09422431                   0       0 0.758873977
## 176 176           0.09422431                   0       0 0.758873977
## 187 187           0.09422431                   0       0 0.758873977
## 201 201           0.09422431                   0       0 0.758873977
## 202 202           0.09422431                   0       0 0.758873977
## 203 203           0.09422431                   0       0 0.758873977

# p-values are not smaller than .001, so the end result is in accordance with Cook's distance: no outliers here

# install.packages("performance")
library(performance)

# ---------------
# Cook's distance

# check outliers with Cook's distance, m2_lm (competence ~ score)
check_outliers(m2_lm, method = "cook") # result is OK, no outliers

## OK: No outliers detected.
## - Based on the following method and threshold: cook (0.7).
## - For variable: (Whole model)

# display Cook distance values in descending order
m2_lm_outlier_cook_list <- as.data.frame(check_outliers(m2_lm, method = "cook"))
m2_lm_outlier_cook_list[order(-m2_lm_outlier_cook_list$Distance_Cook),]

##     Row Distance_Cook Outlier_Cook Outlier
## 112 112  4.349204e-02            0       0
## 154 154  4.349204e-02            0       0
## 137 137  3.376143e-02            0       0
## 207 207  3.020325e-02            0       0
## 173 173  2.782953e-02            0       0
## 64   64  2.181044e-02            0       0
## 161 161  2.138833e-02            0       0
## 32   32  2.075458e-02            0       0
## 196 196  2.075458e-02            0       0
## 34   34  1.787142e-02            0       0
## 138 138  1.787142e-02            0       0
## 163 163  1.634844e-02            0       0
## 116 116  1.617318e-02            0       0
## 169 169  1.617318e-02            0       0
## 184 184  1.537804e-02            0       0
## 89   89  1.536176e-02            0       0
## 171 171  1.491481e-02            0       0
## 140 140  1.462045e-02            0       0
## 97   97  1.321338e-02            0       0
## 50   50  1.282176e-02            0       0
## 175 175  1.282176e-02            0       0
## 194 194  1.282176e-02            0       0
## 122 122  1.266227e-02            0       0
## 108 108  1.255755e-02            0       0
## 150 150  1.172364e-02            0       0
## 44   44  1.111755e-02            0       0
## 88   88  1.111755e-02            0       0
## 98   98  1.111755e-02            0       0
## 28   28  1.089646e-02            0       0
## 110 110  1.089646e-02            0       0
## 26   26  1.078020e-02            0       0
## 66   66  1.078020e-02            0       0
## 17   17  1.059711e-02            0       0
## 37   37  9.283488e-03            0       0
## 31   31  9.253806e-03            0       0
## 152 152  9.253806e-03            0       0
## 90   90  9.011594e-03            0       0
## 127 127  9.011594e-03            0       0
## 130 130  9.011594e-03            0       0
## 49   49  8.923769e-03            0       0
## 119 119  8.897112e-03            0       0
## 63   63  8.734415e-03            0       0
## 180 180  8.386897e-03            0       0
## 109 109  8.377824e-03            0       0
## 73   73  8.147832e-03            0       0
## 139 139  8.079897e-03            0       0
## 7     7  8.046487e-03            0       0
## 24   24  7.912767e-03            0       0
## 83   83  7.912767e-03            0       0
## 82   82  7.570818e-03            0       0
## 183 183  7.467486e-03            0       0
## 16   16  7.199920e-03            0       0
## 124 124  7.199920e-03            0       0
## 195 195  6.892093e-03            0       0
## 95   95  6.809739e-03            0       0
## 185 185  6.525593e-03            0       0
## 70   70  6.435082e-03            0       0
## 129 129  6.435082e-03            0       0
## 141 141  5.992125e-03            0       0
## 10   10  5.854913e-03            0       0
## 14   14  5.842673e-03            0       0
## 105 105  5.842673e-03            0       0
## 125 125  5.842673e-03            0       0
## 182 182  5.153303e-03            0       0
## 6     6  4.816981e-03            0       0
## 121 121  4.767457e-03            0       0
## 85   85  4.697992e-03            0       0
## 189 189  4.697992e-03            0       0
## 143 143  4.658250e-03            0       0
## 60   60  4.220776e-03            0       0
## 96   96  3.609483e-03            0       0
## 204 204  3.566008e-03            0       0
## 132 132  3.494977e-03            0       0
## 206 206  3.494977e-03            0       0
## 52   52  3.407767e-03            0       0
## 156 156  3.392245e-03            0       0
## 166 166  3.368245e-03            0       0
## 128 128  3.103397e-03            0       0
## 134 134  3.103397e-03            0       0
## 87   87  3.042874e-03            0       0
## 205 205  3.042874e-03            0       0
## 172 172  3.012455e-03            0       0
## 57   57  2.951455e-03            0       0
## 181 181  2.862053e-03            0       0
## 146 146  2.851714e-03            0       0
## 56   56  2.830691e-03            0       0
## 93   93  2.784165e-03            0       0
## 72   72  2.393235e-03            0       0
## 136 136  2.393235e-03            0       0
## 1     1  2.130405e-03            0       0
## 55   55  1.981862e-03            0       0
## 179 179  1.981862e-03            0       0
## 76   76  1.893256e-03            0       0
## 155 155  1.778205e-03            0       0
## 188 188  1.773929e-03            0       0
## 51   51  1.501351e-03            0       0
## 133 133  1.501351e-03            0       0
## 80   80  1.457427e-03            0       0
## 41   41  1.417386e-03            0       0
## 43   43  1.259443e-03            0       0
## 174 174  1.259443e-03            0       0
## 178 178  1.259443e-03            0       0
## 126 126  1.245011e-03            0       0
## 165 165  1.242661e-03            0       0
## 101 101  1.190774e-03            0       0
## 147 147  1.190774e-03            0       0
## 208 208  1.190774e-03            0       0
## 54   54  1.089950e-03            0       0
## 135 135  1.089950e-03            0       0
## 167 167  1.089950e-03            0       0
## 19   19  1.086357e-03            0       0
## 177 177  9.504576e-04            0       0
## 111 111  9.119107e-04            0       0
## 199 199  9.119107e-04            0       0
## 4     4  8.460711e-04            0       0
## 103 103  8.090240e-04            0       0
## 53   53  8.086590e-04            0       0
## 159 159  7.831465e-04            0       0
## 12   12  7.683000e-04            0       0
## 25   25  6.264717e-04            0       0
## 193 193  6.264717e-04            0       0
## 33   33  5.450939e-04            0       0
## 35   35  5.450939e-04            0       0
## 46   46  5.450939e-04            0       0
## 99   99  5.450939e-04            0       0
## 190 190  5.450939e-04            0       0
## 59   59  4.757915e-04            0       0
## 100 100  4.741431e-04            0       0
## 102 102  4.741431e-04            0       0
## 144 144  4.741431e-04            0       0
## 58   58  4.105127e-04            0       0
## 15   15  3.543362e-04            0       0
## 86   86  3.543362e-04            0       0
## 120 120  3.543362e-04            0       0
## 157 157  3.543362e-04            0       0
## 198 198  3.543362e-04            0       0
## 20   20  3.181235e-04            0       0
## 162 162  3.181235e-04            0       0
## 22   22  2.800687e-04            0       0
## 158 158  2.800687e-04            0       0
## 61   61  2.618524e-04            0       0
## 11   11  2.145915e-04            0       0
## 40   40  2.145915e-04            0       0
## 75   75  2.145915e-04            0       0
## 114 114  2.145915e-04            0       0
## 123 123  2.145915e-04            0       0
## 27   27  2.098712e-04            0       0
## 29   29  2.098712e-04            0       0
## 113 113  2.098712e-04            0       0
## 68   68  2.060278e-04            0       0
## 9     9  2.013686e-04            0       0
## 91   91  2.013686e-04            0       0
## 104 104  2.013686e-04            0       0
## 186 186  2.013686e-04            0       0
## 77   77  2.008647e-04            0       0
## 84   84  1.920380e-04            0       0
## 45   45  1.902360e-04            0       0
## 67   67  1.902360e-04            0       0
## 74   74  1.717528e-04            0       0
## 131 131  1.662128e-04            0       0
## 38   38  1.643160e-04            0       0
## 81   81  1.612125e-04            0       0
## 94   94  1.570163e-04            0       0
## 118 118  1.570163e-04            0       0
## 13   13  1.539395e-04            0       0
## 2     2  1.539395e-04            0       0
## 5     5  1.531850e-04            0       0
## 18   18  1.407420e-04            0       0
## 47   47  1.407420e-04            0       0
## 78   78  1.407420e-04            0       0
## 117 117  1.407420e-04            0       0
## 92   92  1.042790e-04            0       0
## 200 200  9.057769e-05            0       0
## 164 164  7.939761e-05            0       0
## 21   21  7.039568e-05            0       0
## 170 170  7.039568e-05            0       0
## 153 153  6.325273e-05            0       0
## 202 202  6.325273e-05            0       0
## 203 203  5.767296e-05            0       0
## 62   62  5.338332e-05            0       0
## 23   23  4.769338e-05            0       0
## 106 106  4.769338e-05            0       0
## 149 149  4.769338e-05            0       0
## 42   42  4.443710e-05            0       0
## 148 148  4.443710e-05            0       0
## 107 107  4.326893e-05            0       0
## 30   30  4.220760e-05            0       0
## 48   48  4.220760e-05            0       0
## 71   71  4.220760e-05            0       0
## 145 145  4.220760e-05            0       0
## 36   36  4.112895e-05            0       0
## 176 176  4.112895e-05            0       0
## 8     8  3.992922e-05            0       0
## 160 160  3.992922e-05            0       0
## 192 192  3.992922e-05            0       0
## 65   65  3.685330e-05            0       0
## 69   69  3.685330e-05            0       0
## 191 191  3.685330e-05            0       0
## 115 115  3.487146e-05            0       0
## 142 142  3.487146e-05            0       0
## 151 151  3.255743e-05            0       0
## 197 197  3.255743e-05            0       0
## 3     3  2.694825e-05            0       0
## 187 187  2.694825e-05            0       0
## 39   39  2.026786e-05            0       0
## 201 201  2.026786e-05            0       0
## 79   79  1.310395e-05            0       0
## 168 168  1.310395e-05            0       0

# we see that observations 112 and 154 have the highest value (0.043), but also several other values are above the 0.02 level
# however, the test indicates that we do not have an outlier problem here


# --------------------
# Mahalanobis distance

# to the a second opinion, we turn to Mahalanobis distance
# https://www.statology.org/mahalanobis-distance-r/
check_outliers(dfttu[c("competence","score")], method = "mahalanobis")

## OK: No outliers detected.
## - Based on the following method and threshold: mahalanobis (13.82).
## - For variables: competence, score

# we have an indication of two possible outliers: 112 and 154
# this finding is against the Cook distance result above

# display Mahalanobis values in descending order
m2_lm_outlier_mah_list <- as.data.frame(check_outliers(dfttu[c("competence","score")], method = "mahalanobis"))
m2_lm_outlier_mah_list <- m2_lm_outlier_mah_list[order(-m2_lm_outlier_mah_list$Distance_Mahalanobis),]
# add also p-values, usually we look for values smaller than .001
m2_lm_outlier_mah_list$pvalue <- pchisq(m2_lm_outlier_mah_list$Distance_Mahalanobis, 
                                        df=1, 
                                        lower.tail=FALSE
                                        )
m2_lm_outlier_mah_list

##     Row Distance_Mahalanobis Outlier_Mahalanobis Outlier      pvalue
## 112 112           7.87183918                   0       0 0.005021056
## 154 154           7.87183918                   0       0 0.005021056
## 137 137           6.36673337                   0       0 0.011627945
## 207 207           6.08041175                   0       0 0.013668891
## 163 163           5.78638138                   0       0 0.016150801
## 173 173           5.72932169                   0       0 0.016683926
## 97   97           5.41252542                   0       0 0.019992772
## 32   32           4.95018461                   0       0 0.026087883
## 196 196           4.95018461                   0       0 0.026087883
## 161 161           4.94845464                   0       0 0.026114001
## 64   64           4.94233364                   0       0 0.026206631
## 184 184           4.67592755                   0       0 0.030588181
## 34   34           4.63618566                   0       0 0.031304507
## 138 138           4.63618566                   0       0 0.031304507
## 116 116           4.59629592                   0       0 0.032041111
## 169 169           4.59629592                   0       0 0.032041111
## 140 140           4.42036779                   0       0 0.035512381
## 171 171           4.22636174                   0       0 0.039800670
## 44   44           4.00226074                   0       0 0.045439277
## 88   88           4.00226074                   0       0 0.045439277
## 98   98           4.00226074                   0       0 0.045439277
## 89   89           4.00001978                   0       0 0.045499730
## 122 122           3.99755297                   0       0 0.045566373
## 28   28           3.87655879                   0       0 0.048964821
## 110 110           3.87655879                   0       0 0.048964821
## 31   31           3.74866970                   0       0 0.052849558
## 152 152           3.74866970                   0       0 0.052849558
## 108 108           3.69213096                   0       0 0.054669728
## 49   49           3.65959467                   0       0 0.055747194
## 150 150           3.65445988                   0       0 0.055919283
## 50   50           3.60082660                   0       0 0.057750849
## 175 175           3.60082660                   0       0 0.057750849
## 194 194           3.60082660                   0       0 0.057750849
## 26   26           3.54359614                   0       0 0.059775730
## 66   66           3.54359614                   0       0 0.059775730
## 37   37           3.37670733                   0       0 0.066124034
## 180 180           3.36840253                   0       0 0.066458167
## 95   95           3.33812441                   0       0 0.067691729
## 63   63           3.32068226                   0       0 0.068413432
## 139 139           3.31475026                   0       0 0.068660752
## 7     7           3.26347079                   0       0 0.070839041
## 24   24           3.26228439                   0       0 0.070890306
## 83   83           3.26228439                   0       0 0.070890306
## 17   17           3.22237261                   0       0 0.072638275
## 79   79           3.22031170                   0       0 0.072729780
## 168 168           3.22031170                   0       0 0.072729780
## 82   82           3.08763855                   0       0 0.078889218
## 57   57           3.01128493                   0       0 0.082686721
## 109 109           2.86015850                   0       0 0.090798551
## 90   90           2.85794045                   0       0 0.090923847
## 127 127           2.85794045                   0       0 0.090923847
## 130 130           2.85794045                   0       0 0.090923847
## 119 119           2.82573249                   0       0 0.092764562
## 14   14           2.78485121                   0       0 0.095159531
## 105 105           2.78485121                   0       0 0.095159531
## 125 125           2.78485121                   0       0 0.095159531
## 76   76           2.78416424                   0       0 0.095200347
## 25   25           2.75807598                   0       0 0.096764549
## 193 193           2.75807598                   0       0 0.096764549
## 16   16           2.71644017                   0       0 0.099319296
## 124 124           2.71644017                   0       0 0.099319296
## 11   11           2.71321898                   0       0 0.099519991
## 40   40           2.71321898                   0       0 0.099519991
## 75   75           2.71321898                   0       0 0.099519991
## 114 114           2.71321898                   0       0 0.099519991
## 123 123           2.71321898                   0       0 0.099519991
## 101 101           2.70501272                   0       0 0.100033284
## 147 147           2.70501272                   0       0 0.100033284
## 208 208           2.70501272                   0       0 0.100033284
## 73   73           2.65483177                   0       0 0.103235678
## 70   70           2.59612151                   0       0 0.107125586
## 129 129           2.59612151                   0       0 0.107125586
## 39   39           2.57312701                   0       0 0.108692685
## 201 201           2.57312701                   0       0 0.108692685
## 195 195           2.56106303                   0       0 0.109524915
## 166 166           2.53963412                   0       0 0.111020510
## 183 183           2.50221060                   0       0 0.113686620
## 185 185           2.40752720                   0       0 0.120752976
## 141 141           2.38411941                   0       0 0.122573945
## 100 100           2.16225081                   0       0 0.141437377
## 102 102           2.16225081                   0       0 0.141437377
## 144 144           2.16225081                   0       0 0.141437377
## 10   10           2.15402444                   0       0 0.142196734
## 27   27           2.12779045                   0       0 0.144649134
## 29   29           2.12779045                   0       0 0.144649134
## 113 113           2.12779045                   0       0 0.144649134
## 132 132           2.06322016                   0       0 0.150891008
## 206 206           2.06322016                   0       0 0.150891008
## 182 182           2.01329811                   0       0 0.155926025
## 3     3           1.99906055                   0       0 0.157396735
## 187 187           1.99906055                   0       0 0.157396735
## 6     6           1.98137454                   0       0 0.159245677
## 121 121           1.94569001                   0       0 0.163052051
## 85   85           1.93930530                   0       0 0.163743984
## 189 189           1.93930530                   0       0 0.163743984
## 58   58           1.89175757                   0       0 0.169003819
## 68   68           1.86249552                   0       0 0.172337058
## 143 143           1.84879446                   0       0 0.173923671
## 60   60           1.68722113                   0       0 0.193967691
## 15   15           1.63954388                   0       0 0.200388061
## 86   86           1.63954388                   0       0 0.200388061
## 120 120           1.63954388                   0       0 0.200388061
## 157 157           1.63954388                   0       0 0.200388061
## 198 198           1.63954388                   0       0 0.200388061
## 9     9           1.61548015                   0       0 0.203723170
## 91   91           1.61548015                   0       0 0.203723170
## 104 104           1.61548015                   0       0 0.203723170
## 186 186           1.61548015                   0       0 0.203723170
## 93   93           1.61459690                   0       0 0.203846822
## 204 204           1.56374097                   0       0 0.211118310
## 1     1           1.54654521                   0       0 0.213646186
## 151 151           1.49811232                   0       0 0.220962042
## 197 197           1.49811232                   0       0 0.220962042
## 96   96           1.45161788                   0       0 0.228268528
## 52   52           1.38999552                   0       0 0.238405856
## 156 156           1.37037500                   0       0 0.241747464
## 188 188           1.31856551                   0       0 0.250849658
## 43   43           1.31250943                   0       0 0.251940812
## 174 174           1.31250943                   0       0 0.251940812
## 178 178           1.31250943                   0       0 0.251940812
## 87   87           1.30294067                   0       0 0.253676774
## 205 205           1.30294067                   0       0 0.253676774
## 115 115           1.27505753                   0       0 0.258820030
## 142 142           1.27505753                   0       0 0.258820030
## 128 128           1.27362165                   0       0 0.259088358
## 134 134           1.27362165                   0       0 0.259088358
## 172 172           1.24038712                   0       0 0.265396459
## 56   56           1.19995986                   0       0 0.273329701
## 61   61           1.18995517                   0       0 0.275338599
## 181 181           1.18250198                   0       0 0.276847230
## 45   45           1.17628807                   0       0 0.278112969
## 67   67           1.17628807                   0       0 0.278112969
## 146 146           1.17552734                   0       0 0.278268424
## 33   33           1.17097813                   0       0 0.279200352
## 35   35           1.17097813                   0       0 0.279200352
## 46   46           1.17097813                   0       0 0.279200352
## 99   99           1.17097813                   0       0 0.279200352
## 190 190           1.17097813                   0       0 0.279200352
## 55   55           1.07875867                   0       0 0.298975416
## 179 179           1.07875867                   0       0 0.298975416
## 65   65           1.07028231                   0       0 0.300881688
## 69   69           1.07028231                   0       0 0.300881688
## 191 191           1.07028231                   0       0 0.300881688
## 51   51           0.99366388                   0       0 0.318848536
## 133 133           0.99366388                   0       0 0.318848536
## 72   72           0.99092175                   0       0 0.319517194
## 136 136           0.99092175                   0       0 0.319517194
## 59   59           0.97928415                   0       0 0.322375603
## 155 155           0.93670504                   0       0 0.333126664
## 41   41           0.86730768                   0       0 0.351701735
## 84   84           0.81348469                   0       0 0.367092004
## 8     8           0.71557053                   0       0 0.397600712
## 160 160           0.71557053                   0       0 0.397600712
## 192 192           0.71557053                   0       0 0.397600712
## 80   80           0.70743645                   0       0 0.400296163
## 177 177           0.69699735                   0       0 0.403794471
## 126 126           0.65786691                   0       0 0.417314015
## 38   38           0.65266878                   0       0 0.419160134
## 19   19           0.60529856                   0       0 0.436563499
## 36   36           0.56563397                   0       0 0.451999095
## 176 176           0.56563397                   0       0 0.451999095
## 165 165           0.55128609                   0       0 0.457792673
## 131 131           0.51725859                   0       0 0.472013440
## 20   20           0.51387954                   0       0 0.473464240
## 162 162           0.51387954                   0       0 0.473464240
## 18   18           0.51013243                   0       0 0.475081534
## 47   47           0.51013243                   0       0 0.475081534
## 78   78           0.51013243                   0       0 0.475081534
## 117 117           0.51013243                   0       0 0.475081534
## 4     4           0.49891865                   0       0 0.479975643
## 54   54           0.45834005                   0       0 0.498400776
## 135 135           0.45834005                   0       0 0.498400776
## 167 167           0.45834005                   0       0 0.498400776
## 111 111           0.43778612                   0       0 0.508192940
## 199 199           0.43778612                   0       0 0.508192940
## 30   30           0.43397697                   0       0 0.510043933
## 48   48           0.43397697                   0       0 0.510043933
## 71   71           0.43397697                   0       0 0.510043933
## 145 145           0.43397697                   0       0 0.510043933
## 103 103           0.42729864                   0       0 0.513317451
## 81   81           0.39820011                   0       0 0.528020264
## 22   22           0.39530378                   0       0 0.529524602
## 158 158           0.39530378                   0       0 0.529524602
## 159 159           0.37395818                   0       0 0.540854587
## 53   53           0.34339192                   0       0 0.557877857
## 12   12           0.32211593                   0       0 0.570338812
## 107 107           0.32059953                   0       0 0.571247572
## 92   92           0.27989840                   0       0 0.596767817
## 42   42           0.22550164                   0       0 0.634879540
## 148 148           0.22550164                   0       0 0.634879540
## 200 200           0.19220072                   0       0 0.661091256
## 5     5           0.15070201                   0       0 0.697865393
## 77   77           0.14925385                   0       0 0.699249431
## 164 164           0.12278260                   0       0 0.726035919
## 2     2           0.10476176                   0       0 0.746188483
## 13   13           0.10476176                   0       0 0.746188483
## 23   23           0.09014454                   0       0 0.763993481
## 106 106           0.09014454                   0       0 0.763993481
## 149 149           0.09014454                   0       0 0.763993481
## 94   94           0.07710106                   0       0 0.781265092
## 118 118           0.07710106                   0       0 0.781265092
## 74   74           0.07661834                   0       0 0.781933541
## 21   21           0.07164403                   0       0 0.788957947
## 170 170           0.07164403                   0       0 0.788957947
## 153 153           0.03878502                   0       0 0.843875178
## 202 202           0.03878502                   0       0 0.843875178
## 62   62           0.02790567                   0       0 0.867330792
## 203 203           0.02420556                   0       0 0.876363008

# p-values are not smaller than .001, so the end result is in accordance with Cook distance: no outliers here

3 Simple regression model

Simple regression has only one x (or ‘independent', ‘predictor', ‘exogenous', ‘explanatory', ‘manifest') variable and one y (or ‘dependent', ‘endogenous', ‘response') variable.

y = b0 + b1*x

y predicted value of the dependent/endogenous/response variable
b0 intercept
b1 regression coefficient
x value of the independent/predictor/exogenous/explanatory variable

We specify a model that predicts competence with autonomy.

Lectures (PPT) contain more information about the different "roles" of the variables in SEM.

3.1 Estimate the model

# specify simple regression model
regression_simple <- '
      # regression
      competence ~ 1 + autonomy
      # variance
      autonomy ~~ autonomy
      '

# estimate model with lavaan
regression_simple_fit <- sem(regression_simple,
                        data=dfttu)

# display unstandardized results in LISREL format
summary(regression_simple_fit, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 22 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                           208
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomy          0.752    0.045   16.562    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        1.061    0.174    6.093    0.000
##     autonomy          3.786    0.042   90.196    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomy          0.366    0.036   10.198    0.000
##    .competence        0.157    0.015   10.198    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.569

In the output we see several values, of which the most important are:

Estimate: Parameter value for each model parameter (here unstandardized "B")

Std.err: Standard error for each estimated parameter (average distance of observed values from the regression line)

Z-value: Wald statistic (Z = Estimate / Std.err)

P(>|z|)): p-value for testing the null hypothesis that the parameter equals zero in the population

R-Square: Explained variance of the dependent variable (DV)

According to the output above, the intercept (average value) of autonomy is 3.786. Interpretation of the intercept of competence is a bit odd as the result now says that "Competence gets the value of 1.061 when predictor variable (autonomy) is set to zero." We do not have a zero value for autonomy (range 1-5), but we can create it with "centering grand mean" (CGM): autonomy - mean(autonomy).

# of course we can change autonomy and competency values from 1-5 -> 0-4
dfttu$autonomy04 <- dfttu$autonomy - 1
dfttu$competence04 <- dfttu$competence - 1

# but still, it makes more sense to interpret zero as an average value of autonomy
# Note: usually we do not center the DV!

# centering by the grand mean
dfttu$autonomyCGM1 <- dfttu$autonomy - mean(dfttu$autonomy) 
# or
# install.packages("misty")
library(misty) # center()
dfttu$autonomyCGM2 <- dfttu$autonomy # we need to do this extra step as center() writes new values directly into df given in the first parameter
dfttu$autonomyCGM2 <- center(dfttu$autonomyCGM2, # function writes new values into this variable
                             type = "CGM", # center grand mean, other option is "CWC" (center withing cluster) 
                             cluster = NULL, # CWC needs the cluster name (e.g., "group")
                             value = NULL # wish to center to a specific value?
                             )

# check how different "autonomy" variants look now
# table
print(dfttu[c(1:5),c("autonomy","autonomyCGM1","autonomyCGM2")])

##   autonomy autonomyCGM1 autonomyCGM2
## 1     3.75  -0.03605769  -0.03605769
## 2     3.25  -0.53605769  -0.53605769
## 3     4.00   0.21394231   0.21394231
## 4     4.00   0.21394231   0.21394231
## 5     3.50  -0.28605769  -0.28605769

# plots
par(mfrow = c(1, 2)) # place plots in one row and two columns
plot(density(dfttu$autonomy), col = "black", main = "Autonomy original")
plot(density(dfttu$autonomyCGM1), col = "blue", main = "Autonomy CGM")

# original mean value of autonomy
mean(dfttu$autonomy) # 3.786058

## [1] 3.786058

# grand mean centered mean value of autonomy
mean(dfttu$autonomyCGM1) # -2.049622e-16 # naturally we have here exactly the same values

## [1] 2.297841e-16

mean(dfttu$autonomyCGM2) # -2.049622e-16

## [1] -2.465976e-16

As the plot shows, the shape of the density distribution does not change, only the average value of autonomy moves to zero.

Now we can re-run the analysis ..

# specify simple regression
regression_simple_CGM <- '
      # regression
      competence ~ 1 + autonomyCGM1
      # variance
      autonomyCGM1 ~~ autonomyCGM1
      '

# estimate model
regression_simple_fit_CGM <- sem(regression_simple_CGM,
                        data=dfttu,
                        group = NULL, # no grouping variable (e.g., gender)
                        cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                        missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
                        )

# display unstandardized results in LISREL format
summary(regression_simple_fit_CGM, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 14 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                           208
##   Number of missing patterns                         1
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomyCGM1      0.752    0.045   16.562    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        3.907    0.027  142.170    0.000
##     autonomyCGM1      0.000    0.042    0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomyCGM1      0.366    0.036   10.198    0.000
##    .competence        0.157    0.015   10.198    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.569

Results stay the same, but now we can say that "competence has an average value of 3.9 when autonomy is on its average level."

# extract values from the lavaan output for the Results section

# unstandardized output values of competence ~ autonomy
reg_m1_aut_est <- parameterEstimates(regression_simple_fit_CGM)$est[2] # estimate, "B"
reg_m1_aut_se <- parameterEstimates(regression_simple_fit_CGM)$se[2] # Standard error of the estimate
reg_m1_aut_z <- parameterEstimates(regression_simple_fit_CGM)$z[2] # Z value
reg_m1_aut_p <- parameterEstimates(regression_simple_fit_CGM)$pvalue[2] # p-value
reg_m1_aut_r2 <- as.numeric(lavaan::inspect(regression_simple_fit_CGM,'r2')[1]) # R^2 value Note: need to use "as.numeric()" to extract numerical value

3.4 Report the results

Results of linear regression with R lavaan (Rosseel, 2012) showed that autonomy was statistically significant positive predictor of competence, B = 0.752, Z = 16.562, p = 0.000. This unstandardized result indicates that one unit increase in autonomy increases competence by 0.752 units. This model explains 56.87% of the variance of competence variable.

3.5 Missing data

What if we had some missing values? We created earlier a data frame ("dfttuNA") with approximately 40% of missing values in autonomy variable. Let's re-run the analysis with that data.

# estimate model with lavaan
regression_simple_fit_NA <- sem(regression_simple,
                        data=dfttuNA)

# display unstandardized results in LISREL format
summary(regression_simple_fit_NA, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 32 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##                                                   Used       Total
##   Number of observations                           129         208
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomy          0.756    0.063   12.003    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        1.052    0.240    4.375    0.000
##     autonomy          3.773    0.051   73.543    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomy          0.340    0.042    8.031    0.000
##    .competence        0.174    0.022    8.031    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.528

Output shows that only 129 rows (of the complete data of 208 rows) have been analysed. Results stay quite the same … except that the standard errors are bigger (indicating worse predicting power of the model).

Instead of applying listwise deletion to the rows with missing data, we can use multiple imputation (MI) for the missing values. However, it's use is not that straightforward as it generates several imputed data sets that need to be analysed at the same time (as pooled data). This requires additional steps in the modeling process.

3.5.1 Full information maximum likelihood

Full Information Maximum Likelihood (FIML) estimator (e.g., Cham et al., 2017) is a good alternative to MI, as it is part of the model estimation process in most SEM software packages (e.g., Mplus, AMOS, lavaan).

Next we utilize FIML to analyze the data. FIML estimates a likelihood function for each participant based on the observed values (so there has to be at least one observed value for each participant).

regression_simple_fit_NA_FIML <- sem(regression_simple,
                        data=dfttuNA,
                        group = NULL, # no grouping variable (e.g., gender)
                        cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                        missing = "fiml", # full information maximum likelihood approach (good if there are missing observations)
                        fixed.x = FALSE # instead of using the default fixed.x = TRUE, we ask to estimate the means and variances of the variables
                        )

# display unstandardized results in LISREL format
summary(regression_simple_fit_NA_FIML, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 19 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                           208
##   Number of missing patterns                         2
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomy          0.753    0.057   13.274    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        1.066    0.217    4.919    0.000
##     autonomy          3.776    0.046   82.457    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomy          0.338    0.040    8.509    0.000
##    .competence        0.173    0.021    8.419    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.525

Summary shows that there are two missing data patterns: First one is the complete observations pattern and the second one has missing values in the autonomy variable. Number of observations is again 208: FIML allows the analysis of all rows (participants), also those containing missing data. Results stay the same, but now the standard errors are smaller.

4. Multiple regression model

Multiple regression means that we add predictors to the model. We specify a model that predicts competence with autonomy and score. We use centered autonomy ("autonomyCGM") but retain score uncentered (as it has a meaningful zero value: it is plausible to get zero points from the math test).

# specify multiple regression
regression_multiple <- '
      # regression
      competence ~ 1 + autonomyCGM1 + score
      # variances
      autonomyCGM1 ~~ autonomyCGM1
      score ~~ score
      '

# estimate model
regression_multiple_fit <- sem(regression_multiple,
                        data=dfttu,
                        group = NULL, # no grouping variable (e.g., gender)
                        cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                        missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
                        )

Now we get a warning that the variances of autonomy and competence are quite different. We follow the instruction and look into their variances:

# check variances
varTable(regression_multiple_fit)

##           name idx nobs    type exo user   mean     var nlev lnam
## 1   competence   4  208 numeric   0    0  3.907   0.366    0     
## 2 autonomyCGM1  14  208 numeric   0    0  0.000   0.368    0     
## 3        score   7  208 numeric   0    0 47.756 847.882    0

# extract variances of autonomy and score
reg_m2_aut_s2 <- varTable(regression_multiple_fit)$var[2] # s2 is abbreviation of s^2 = sigma^2 = variance
reg_m2_score_s2 <- varTable(regression_multiple_fit)$var[3]

Indeed, the "var" column of the output indicates that the variance of autonomy is 0.368 and the variance of score is 847.882 - quite a difference! Next we will use R base library's scale() function to center and scale both autonomy and score (i.e., set their mean M to zero and standard deviation SD to one).

# We start by plotting density of autonomy and score before any centering or scaling actions

# The probability density function describes the probability of the variable taking certain value
# It is a smoothed (Parzen–Rosenblatt / kernel estimator) version of the histogram
# Default kernel is "Gaussian", but there are several options
# Bandwidth controls the smoothing: Too big values oversmooth and too small values undersmooth (overfit)
# density() uses "rule of thumb" as default method
# https://r-coder.com/density-plot-r/

# AUTONOMY
# create a histogram
hist(dfttu$autonomy, freq = F, main = "Autonomy histogram and density")
# add density line to the histogram
lines(density(dfttu$autonomy), lwd = 2, col = "blue")

# compare bandwidth settings
par(mfrow = c(1, 3)) # place plots in one row and three columns
plot(density(dfttu$autonomy, 
             kernel="gaussian", # type "?density()" in Console to see other options
             bw="nrd0"), # default bw is Rule of Thumb (in this case 0.1155)
     col = "black", main = "Aut bw=RoT")
plot(density(dfttu$autonomy, kernel="gaussian", bw=0.05),  # bw set too small
     col = "blue", main = "Aut bw=0.05")
plot(density(dfttu$autonomy, kernel="gaussian", bw=0.5), # set too big
     col = "red", main = "Aut bw=0.5")

# # other bw options if you wish to try
# plot(density(dfttu$autonomy, kernel="gaussian", bw="UCV"), # Unbiased cross validation
#      col = "green", main = "Aut bw=UCV")
# plot(density(dfttu$autonomy, kernel="gaussian", bw="SJ"), # Plug-in by Sheather and Jones
#      col = "orange", main = "Aut bw=SJ")

# using scale() function ... AND DOING NOTHING :-D
aut_ori <- scale(dfttu$autonomy, 
                center = F, # DO NOT center (F = FALSE)
                scale = F   # DO NOT scale (F = FALSE)
                )

# -----
# SCORE
# histogram + density line
hist(dfttu$score, freq = F, main = "Score histogram and density")
lines(density(dfttu$score), lwd = 2, col = "darkred")

# AUTONOMY AND SCORE
d_a <- density(dfttu$autonomy)
d_s <- density(dfttu$score)

plot(d_a, lwd = 2, col = "darkblue",
    main = "Autonomy and score densities", xlab = "",
    xlim = c(min(d_a$x, d_s$x), c(max(d_a$x, d_s$x))),  # Min and Max X-axis limits
    ylim = c(min(d_a$y, d_s$y), c(max(d_a$y, d_s$y))))  # Min and Max Y-axis limits

lines(d_s, col = "darkred", lwd = 2)

Last figure clearly shows that there is a need for centering and scaling if we use autonomy and score in the same analysis as predictors.

# -----------------
# CENTERED autonomy
# center = subtract mean of autonomy from each autonomy value
# to produce M = 0

# equation approach
dfttu$autonomyCentered <- dfttu$autonomy - mean(dfttu$autonomy)

# using scale() function
aut_cen <- scale(dfttu$autonomy, 
                center = T, # center (T = TRUE)
                scale = F   # do not scale (F = FALSE)
                )

# plot centered autonomy
plot(density(aut_cen), col = "black", main = "Autonomy Centered")

Centering works as it should: The shape of distribution is the same, but the average value has moved to zero.

# ---------------
# SCALED autonomy
# scale = divide each autonomy value by 
# root mean square RMS (scale function default if not centered)
# or
# standard deviation SD (scale function default if centered)
# to make all variables comparable (by one unit change)

# equation approach
# RMS
dfttu$autonomyScaledRMS <- dfttu$autonomy / sqrt(sum(dfttu$autonomy^2)/(length(dfttu$autonomy)-1))
# SD
dfttu$autonomyScaledSD <- dfttu$autonomy / sd(dfttu$autonomy)

# using scale() function
# RMS
aut_scaRMS <- scale(dfttu$autonomy, 
                 center = F, # do not center (F = FALSE)
                 scale = T   # scale (T = TRUE)
                 )
# SD
aut_scaSD <- scale(dfttu$autonomy, 
                 center = F, 
                 scale = apply(dfttu[,4,drop=F], # fourth var (autonomy) in df, dimensions of df are kept
                               2,  # 1=apply to rows, 2=apply to columns
                               sd) # calculate SD 
                 )

# plot RMS and SD scaled autonomy
d_a_scaRMS <- density(aut_scaRMS)
d_a_scaSD <- density(aut_scaSD)

plot(d_a_scaRMS, lwd = 2, col = "darkblue",
    main = "Autonomy RMS (blue) and SD (red) scaled densities", xlab = "",
    xlim = c(min(d_a_scaRMS$x, d_a_scaSD$x), c(max(d_a_scaRMS$x, d_a_scaSD$x))),  # Min and Max X-axis limits
    ylim = c(min(d_a_scaRMS$y, d_a_scaSD$y), c(max(d_a_scaRMS$y, d_a_scaSD$y))))  # Min and Max Y-axis limits

lines(d_a_scaSD, col = "darkred", lwd = 2)

# ----------------------------
# CENTERED and SCALED autonomy

# equation approach
dfttu$autonomyCenSca <- (dfttu$autonomy - mean(dfttu$autonomy)) / sd(dfttu$autonomy)

# using scale() function
aut_cen_sca <- scale(dfttu$autonomy, center = T, scale = T)

# plot centered and scaled autonomy
plot(density(aut_cen_sca), col = "blue", main = "Aut Scale RMS")

# --------------------------------------------------------------------------
# plot 1) original, 2) centered, 3) scaled and 4) centered & scaled autonomy
par(mfrow = c(2, 2))
plot(density(dfttu$autonomy), col = "black", main = "Autonomy")
plot(density(dfttu$autonomyCentered), col = "blue", main = "Autonomy Centered")
plot(density(dfttu$autonomyScaledRMS), col = "red", main = "Autonomy Scaled")
plot(density(dfttu$autonomyCenSca), col = "green", main = "Autonomy Centered and Scaled")

# --------------------
# compare descriptives
df_scale <- data.frame(aut_ori, aut_cen, aut_scaRMS, aut_scaSD, aut_cen_sca)
library(psych)
describe(df_scale)

##             vars   n mean   sd median trimmed  mad   min  max range  skew
## aut_ori        1 208 3.79 0.61   3.75    3.79 0.37  2.25 5.00  2.75 -0.18
## aut_cen        2 208 0.00 0.61  -0.04    0.00 0.37 -1.54 1.21  2.75 -0.18
## aut_scaRMS     3 208 0.99 0.16   0.98    0.99 0.10  0.59 1.30  0.72 -0.18
## aut_scaSD      4 208 6.26 1.00   6.20    6.26 0.61  3.72 8.26  4.55 -0.18
## aut_cen_sca    5 208 0.00 1.00  -0.06    0.01 0.61 -2.53 2.00  4.53 -0.18
##             kurtosis   se
## aut_ori         0.03 0.04
## aut_cen         0.03 0.04
## aut_scaRMS      0.03 0.01
## aut_scaSD       0.03 0.07
## aut_cen_sca     0.03 0.07

Figures above show that the distributional shape is the same, but the average point varies. The last (green) density line represents both centered and scaled autonomy with M = 0 and SD = 1 (this can also be seen in the last row of the table "aut_cen_sca"). Note that the mad stands for "the median of the absolute deviations from the median".

Next we do centering and scaling to score.

# center and scale
dfttu$scoreCenSca <- (dfttu$score - mean(dfttu$score)) / sd(dfttu$score)

# plot against the original score
d_s <- density(dfttu$score)
d_s_cenSca <- density(dfttu$scoreCenSca)

plot(d_s, lwd = 2, col = "black",
    main = "Score original (black), centered and scaled (red) densities", xlab = "",
    xlim = c(min(d_s$x, d_s_cenSca$x), c(max(d_s$x, d_s_cenSca$x))),  # Min and Max X-axis limits
    ylim = c(min(d_s$y, d_s_cenSca$y), c(max(d_s$y, d_s_cenSca$y))))  # Min and Max Y-axis limits

lines(d_s_cenSca, col = "darkred", lwd = 2)

Red density line is the centered and scaled mathematics score. Next we plot autonomy and score into same linegraph.

# we run everything once more ...

# center and scale
dfttu$autonomyCenSca <- (dfttu$autonomy - mean(dfttu$autonomy)) / sd(dfttu$autonomy)
dfttu$scoreCenSca <- (dfttu$score - mean(dfttu$score)) / sd(dfttu$score)

# plot densities
da <- density(dfttu$autonomyCenSca)
ds <- density(dfttu$scoreCenSca)

plot(da, lwd = 2, col = "blue",
    main = "Centered and scaled densities of Autonomy (blue) and Score (red)", xlab = "",
    xlim = c(min(da$x, ds$x), c(max(da$x, ds$x))),  # Min and Max X-axis limits
    ylim = c(min(da$y, ds$y), c(max(da$y, ds$y)))  # Min and Max Y-axis limits
    )

lines(ds, col = "darkred", lwd = 2)

Now we have both predictors centered and scaled (mean is zero and standard deviation is one). We re-run the analysis with the new variables.

# specify multiple regression2
regression_multiple2 <- '
      # regression
      competence ~ 1 + autonomyCenSca + scoreCenSca
      # variances
      autonomyCenSca ~~ autonomyCenSca
      scoreCenSca ~~ scoreCenSca
      '

# estimate model2
regression_multiple_fit2 <- sem(regression_multiple2,
                        data=dfttu,
                        group = NULL, # no grouping variable (e.g., gender)
                        cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                        missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
                        )

# check variances
varTable(regression_multiple_fit2)

##             name idx nobs    type exo user  mean   var nlev lnam
## 1     competence   4  208 numeric   0    0 3.907 0.366    0     
## 2 autonomyCenSca  19  208 numeric   0    0 0.000 1.000    0     
## 3    scoreCenSca  20  208 numeric   0    0 0.000 1.000    0

Now the mean for both scaled predictors is zero and the sd is one. Let's see the outputs of the original (uncentered/scaled) and new analyses ..

# display unstandardized results in LISREL format
# original variables
summary(regression_multiple_fit, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 22 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                           208
##   Number of missing patterns                         1
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.042
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.838
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomyCGM1      0.753    0.045   16.698    0.000
##     score             0.002    0.001    1.689    0.091
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        3.832    0.053   72.939    0.000
##     autonomyCGM1      0.000    0.042    0.000    1.000
##     score            47.756    2.014   23.711    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomyCGM1      0.366    0.036   10.198    0.000
##     score           843.806   82.742   10.198    0.000
##    .competence        0.155    0.015   10.198    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.575

# scaled variables
summary(regression_multiple_fit2, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 8 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                           208
##   Number of missing patterns                         1
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.042
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.838
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   competence ~                                        
##     autonomyCenSca    0.457    0.027   16.698    0.000
##     scoreCenSca       0.046    0.027    1.689    0.091
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .competence        3.907    0.027  143.141    0.000
##     autonomyCenSca    0.000    0.069    0.000    1.000
##     scoreCenSca      -0.000    0.069   -0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     autonomyCenSca    0.995    0.098   10.198    0.000
##     scoreCenSca       0.995    0.098   10.198    0.000
##    .competence        0.155    0.015   10.198    0.000
## 
## R-Square:
##                    Estimate
##     competence        0.575

Results are identical and the interpretation stays the same - except that after centering and scaling the intercept value of competence refers to a situation where predictors (autonomy, score) are on their average level (M = 0).

# unstandardized ("B")
# uncentered and unscaled (= original)
parameterEstimates(regression_multiple_fit)

##            lhs op          rhs     est     se      z pvalue ci.lower ci.upper
## 1   competence ~1                3.832  0.053 72.939  0.000    3.729    3.935
## 2   competence  ~ autonomyCGM1   0.753  0.045 16.698  0.000    0.665    0.841
## 3   competence  ~        score   0.002  0.001  1.689  0.091    0.000    0.003
## 4 autonomyCGM1 ~~ autonomyCGM1   0.366  0.036 10.198  0.000    0.296    0.437
## 5        score ~~        score 843.806 82.742 10.198  0.000  681.635 1005.977
## 6   competence ~~   competence   0.155  0.015 10.198  0.000    0.125    0.185
## 7 autonomyCGM1 ~1                0.000  0.042  0.000  1.000   -0.082    0.082
## 8        score ~1               47.756  2.014 23.711  0.000   43.809   51.704

# centered and scaled
parameterEstimates(regression_multiple_fit2)

##              lhs op            rhs   est    se       z pvalue ci.lower ci.upper
## 1     competence ~1                3.907 0.027 143.141  0.000    3.854    3.961
## 2     competence  ~ autonomyCenSca 0.457 0.027  16.698  0.000    0.403    0.511
## 3     competence  ~    scoreCenSca 0.046 0.027   1.689  0.091   -0.007    0.100
## 4 autonomyCenSca ~~ autonomyCenSca 0.995 0.098  10.198  0.000    0.804    1.186
## 5    scoreCenSca ~~    scoreCenSca 0.995 0.098  10.198  0.000    0.804    1.186
## 6     competence ~~     competence 0.155 0.015  10.198  0.000    0.125    0.185
## 7 autonomyCenSca ~1                0.000 0.069   0.000  1.000   -0.136    0.136
## 8    scoreCenSca ~1                0.000 0.069   0.000  1.000   -0.136    0.136

Although unstandardized estimates differ after centering and scaling, conclusions remain similar: Autonomy is a stat sig predictor of competence.

# standardized ("beta")
# uncentered and unscaled (= original)
standardizedSolution(regression_multiple_fit)

##            lhs op          rhs est.std    se      z pvalue ci.lower ci.upper
## 1   competence ~1                6.343 0.328 19.347  0.000    5.700    6.985
## 2   competence  ~ autonomyCGM1   0.755 0.030 25.371  0.000    0.696    0.813
## 3   competence  ~        score   0.076 0.045  1.688  0.091   -0.012    0.165
## 4 autonomyCGM1 ~~ autonomyCGM1   1.000 0.000     NA     NA    1.000    1.000
## 5        score ~~        score   1.000 0.000     NA     NA    1.000    1.000
## 6   competence ~~   competence   0.425 0.045  9.529  0.000    0.337    0.512
## 7 autonomyCGM1 ~1                0.000 0.069  0.000  1.000   -0.136    0.136
## 8        score ~1                1.644 0.106 15.462  0.000    1.436    1.852

# centered and scaled
standardizedSolution(regression_multiple_fit2)

##              lhs op            rhs est.std    se      z pvalue ci.lower
## 1     competence ~1                  6.468 0.320 20.243  0.000    5.842
## 2     competence  ~ autonomyCenSca   0.755 0.030 25.371  0.000    0.696
## 3     competence  ~    scoreCenSca   0.076 0.045  1.688  0.091   -0.012
## 4 autonomyCenSca ~~ autonomyCenSca   1.000 0.000     NA     NA    1.000
## 5    scoreCenSca ~~    scoreCenSca   1.000 0.000     NA     NA    1.000
## 6     competence ~~     competence   0.425 0.045  9.529  0.000    0.337
## 7 autonomyCenSca ~1                  0.000 0.069  0.000  1.000   -0.136
## 8    scoreCenSca ~1                  0.000 0.069  0.000  1.000   -0.136
##   ci.upper
## 1    7.095
## 2    0.813
## 3    0.165
## 4    1.000
## 5    1.000
## 6    0.512
## 7    0.136
## 8    0.136

Standardized estimates between original and centered/scaled predictors are (not surprisingly) quite the same, as the standardization is also applied to the results of original (uncentered and unscaled) variables.

# store predictor estimates for inline text
# https://lavaan.ugent.be/tutorial/inspect.html

#----------------------
# unstandardized values

# autonomy 
# estimate
reg_m2_aut_est <-  parameterEstimates(regression_multiple_fit)$est[2] # stores estimate 0.7530086
# or 
reg_m2_aut_est <- regression_multiple_fit@ParTable$est[2]

# standard error
reg_m2_aut_se <-  parameterEstimates(regression_multiple_fit)$se[2] # stores standard error 0.045
# or
reg_m2_aut_se <- regression_multiple_fit@ParTable$se[2]

# p-value
reg_m2_aut_p <-  parameterEstimates(regression_multiple_fit)$pvalue[2] # stores p-value < . 001


# score 
# estimate
reg_m2_score_est <-  parameterEstimates(regression_multiple_fit)$est[3] # 0.001587065

# standard error
reg_m2_score_se <-  parameterEstimates(regression_multiple_fit)$se[3] # 0.0009398335

# p-value
reg_m2_score_p <-  parameterEstimates(regression_multiple_fit)$pvalue[3] # 0.0912835


#--------------------
# standardized values
# these are usually presented in journal articles

# autonomy 
# estimate
reg_m2_aut_est.std <-  standardizedSolution(regression_multiple_fit)$est.std[2] # stores estimate 0.7546165

# standard error
reg_m2_aut_se.std <-  standardizedSolution(regression_multiple_fit)$se[2] # stores standard error 0.02974363

# p-value
reg_m2_aut_p.std <-  standardizedSolution(regression_multiple_fit)$pvalue[2] # stores p-value < . 001

# score estimate
reg_m2_score_est.std <-  standardizedSolution(regression_multiple_fit)$est.std[3] # 0.07631541

# score standard error
reg_m2_score_se.std <-  standardizedSolution(regression_multiple_fit)$se[3] # 0.04519976

# score p-value
reg_m2_score_p.std <-  standardizedSolution(regression_multiple_fit)$pvalue[3] # 0.09133388


#_-----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
reg_m2_r2 <- as.numeric(lavaan::inspect(regression_multiple_fit,'r2')[1])

4.1 Report the results

Results of multiple regression analysis (Figure 1) with R lavaan showed that autonomy was a significant predictor of competence, \(\beta\) = 0.755, p = 0.000. According to the model, one standard deviation increase in autonomy (SD = 0.61) produces 0.755 SD increase in competence (SD = 0.61). In practice this means that when a student's autonomy level goes up by 0.61 units (on a 5-point self-assessment scale), then his/her competence goes up by 0.755 * 0.61 = 0.457 units (on a similar 5-point self-assessment scale). Although the mathematics score was a non significant positive predictor of competence, we can make a similar prediction on its effect: When a student's mathematics score goes up by 1 SD (29.12 points (on a scale from 0-100), his/her competence goes up by 0.076 * 0.61 = 0.046 units (on a 5-point self-assessment scale).

Standard errors (average distance of observed values from the regression line) were quite small for autonomy (0.030) and mathematics score (0.045). Autonomy and mathematics score explained together 57.53% of the variance of competence. The improvement was quite small compared to the previous model that had only one predictor (autonomy, R² = 56.87%).

Figure 1. Multiple regression of competence predicted by autonomy and mathematics score (standardized estimates).

# produce path diagram
semPaths(regression_multiple_fit,
         layout="tree",
         "std", # unstandardized parameter estimates OR use "par" for standardized parameter estimates
         # "par", 
         edge.label.cex=0.7,
         edge.width=0.2,
         edge.color="black",
         # optimizeLatRes=TRUE,
         label.scale=FALSE, # do not make labels smaller to fit inside boxes
         # normalize=TRUE,
         fade=FALSE, # do not fade non sig lines
         as.expression=c("nodes"), # display full labels instead of abbreviated
         # levels=c(1,3,6,9) # set distances between elements of the graph
         )

Exercise 1

Report here the unstandardized results of the above analysis: Describe the "B" estimates and produce a diagram/figure of the model with unstandardized values.

5. Multivariate regression model

Multivariate regression model has more than one endogenous variable (DV). We specify a model that predicts both competence (endogenous, DV) and autonomy (endogenous, DV) with deep approach to learning (exogenous, IV) and organised approach to studying (exogenous, IV). Now that autonomy is not a predictor anymore, we use a non-centered version of it (autonomy).

# center deep and organised by grand mean
dfttu$deepCGM <- dfttu$deep - mean(dfttu$deep) 
dfttu$organisedCGM <- dfttu$organised - mean(dfttu$organised) 

# specify multivariate regression
regression_multivariate <- '
 # regressions
   autonomy ~ 1 + deepCGM
   competence ~ 1 + deepCGM + organisedCGM
   
 # variances and covariance
   deepCGM ~~ deepCGM
   organisedCGM ~~ organisedCGM
   deepCGM ~~ organisedCGM
'

# estimate model
regression_multivariate_fit <- sem(regression_multivariate, 
                                   data=dfttu,
                                   group = NULL, # no grouping variable (e.g., gender)
                                   cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                                   missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
                        )

# display results in LISREL format
summary(regression_multivariate_fit, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 20 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
## 
##   Number of observations                           208
##   Number of missing patterns                         1
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.130
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.718
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   autonomy ~                                          
##     deepCGM           0.360    0.079    4.545    0.000
##   competence ~                                        
##     deepCGM           0.478    0.080    5.993    0.000
##     organisedCGM      0.050    0.044    1.136    0.256
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   deepCGM ~~                                          
##     organisedCGM      0.164    0.026    6.241    0.000
##  .autonomy ~~                                         
##    .competence        0.228    0.027    8.473    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .autonomy          3.786    0.040   94.569    0.000
##    .competence        3.907    0.038  103.514    0.000
##     deepCGM           0.000    0.035    0.000    1.000
##     organisedCGM     -0.000    0.047   -0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     deepCGM           0.256    0.025   10.198    0.000
##     organisedCGM      0.458    0.045   10.198    0.000
##    .autonomy          0.333    0.033   10.198    0.000
##    .competence        0.296    0.029   10.196    0.000
## 
## R-Square:
##                    Estimate
##     autonomy          0.090
##     competence        0.185

# unstandardized
parameterEstimates(regression_multivariate_fit)

##             lhs op          rhs   est    se       z pvalue ci.lower ci.upper
## 1      autonomy ~1              3.786 0.040  94.569  0.000    3.708    3.865
## 2      autonomy  ~      deepCGM 0.360 0.079   4.545  0.000    0.205    0.515
## 3    competence ~1              3.907 0.038 103.514  0.000    3.833    3.981
## 4    competence  ~      deepCGM 0.478 0.080   5.993  0.000    0.322    0.634
## 5    competence  ~ organisedCGM 0.050 0.044   1.136  0.256   -0.036    0.135
## 6       deepCGM ~~      deepCGM 0.256 0.025  10.198  0.000    0.207    0.305
## 7  organisedCGM ~~ organisedCGM 0.458 0.045  10.198  0.000    0.370    0.546
## 8       deepCGM ~~ organisedCGM 0.164 0.026   6.241  0.000    0.113    0.216
## 9      autonomy ~~     autonomy 0.333 0.033  10.198  0.000    0.269    0.397
## 10   competence ~~   competence 0.296 0.029  10.196  0.000    0.239    0.353
## 11     autonomy ~~   competence 0.228 0.027   8.473  0.000    0.175    0.281
## 12      deepCGM ~1              0.000 0.035   0.000  1.000   -0.069    0.069
## 13 organisedCGM ~1              0.000 0.047   0.000  1.000   -0.092    0.092

# standardized
standardizedSolution(regression_multivariate_fit)

##             lhs op          rhs est.std    se      z pvalue ci.lower ci.upper
## 1      autonomy ~1                6.254 0.314 19.938  0.000    5.639    6.869
## 2      autonomy  ~      deepCGM   0.301 0.063  4.765  0.000    0.177    0.424
## 3    competence ~1                6.479 0.323 20.041  0.000    5.845    7.112
## 4    competence  ~      deepCGM   0.401 0.062  6.514  0.000    0.280    0.521
## 5    competence  ~ organisedCGM   0.056 0.049  1.135  0.256   -0.041    0.152
## 6       deepCGM ~~      deepCGM   1.000 0.000     NA     NA    1.000    1.000
## 7  organisedCGM ~~ organisedCGM   1.000 0.000     NA     NA    1.000    1.000
## 8       deepCGM ~~ organisedCGM   0.480 0.053  8.995  0.000    0.375    0.585
## 9      autonomy ~~     autonomy   0.910 0.038 23.992  0.000    0.835    0.984
## 10   competence ~~   competence   0.815 0.049 16.796  0.000    0.720    0.910
## 11     autonomy ~~   competence   0.726 0.033 22.142  0.000    0.662    0.790
## 12      deepCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136
## 13 organisedCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136

#--------------------
# standardized values

# aut ~ deep estimate
reg_m3_aut_deep_est.std <-  standardizedSolution(regression_multivariate_fit)$est.std[2]

# aut ~ deep standard error
reg_m3_aut_deep_se.std <-  standardizedSolution(regression_multivariate_fit)$se[2]

# aut ~ deep p-value
reg_m3_aut_deep_p.std <-  standardizedSolution(regression_multivariate_fit)$pvalue[2]


# com ~ deep estimate
reg_m3_com_deep_est.std <-  standardizedSolution(regression_multivariate_fit)$est.std[4]

# com ~ deep standard error
reg_m3_com_deep_se.std <-  standardizedSolution(regression_multivariate_fit)$se[4]

# com ~ deep p-value
reg_m3_com_deep_p.std <-  standardizedSolution(regression_multivariate_fit)$pvalue[4]


#_-----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
reg_m3_aut_r2 <- lavaan::inspect(regression_multivariate_fit,'r2')[1]
reg_m3_com_r2 <- lavaan::inspect(regression_multivariate_fit,'r2')[2]

5.1 Report the results

Results of multivariate regression analysis (Figure 2) with R lavaan showed that deep approach to learning was a significant positive predictor for both autonomy (\(\beta\) = 0.301, p = 0.000) and competence (\(\beta\) = 0.401, p = 0.000). Organised approach to studying … [Add results here]. Deep approach to studying explained [Add]% of the variance of autonomy. Deep and organised explained together [Add]% of the variance of competence.

Exercise 2

Add "missing" results to the above analysis.

Figure 2. Multivariate regression of competence and autonomy predicted by deep approach to learning and organised approach to studying (standardized estimates).

# produce diagram
semPaths(regression_multivariate_fit,
         layout="tree",
         "std", # unstandardized parameter estimates OR "par" for standardized parameter estimates
         edge.label.cex=0.7,
         edge.width=0.2,
         edge.color="black",
         # optimizeLatRes=TRUE,
         label.scale=FALSE, # do not make labels smaller to fit inside boxes
         # normalize=TRUE,
         fade=FALSE, # do not fade non sig lines
         as.expression=c("nodes"), # display full labels instead of abbreviated
         # levels=c(1,3,6,9) # set distances between elements of the graph
         )

6. Path model

Multivariate regression model becomes a path model after endogenous variables are allowed to predict other endogenous variables. We specify a model that predicts 1) competence with deep approach to learning, organised approach to studying and autonomy (which is the "path" part of the model), and 2) autonomy with deep approach to learning.

# create new df with the four variables in the path model
dfttu3 <- dfttu[,c("autonomy","competence","deep", "organised")]

# produce variance-covariance matrix
cov(dfttu3)

##              autonomy competence       deep  organised
## autonomy   0.36825878  0.2769039 0.09246969 0.06802478
## competence 0.27690392  0.3660915 0.13104881 0.10768969
## deep       0.09246969  0.1310488 0.25702428 0.16510330
## organised  0.06802478  0.1076897 0.16510330 0.46032928

# store some interesting values for the inline text

# mean of autonomy
dfttu_m_aut <- mean(dfttu3$autonomy)

# variance of autonomy
dfttu_var_aut <- cov(dfttu3)[1,1]

# square root of the variance of autonomy (SD)
dfttu_sd_aut <- sqrt(cov(dfttu3)[1,1])

Diagonal values represent variances and other values covariances. Variance-covariance matrix is symmetric (same values below and above the diagonal). Square root of the diagonal values produce standard deviations. For example, autonomy variance (\(\sigma^2\)) is 0.368 and SD (\(\sigma\)) is 0.61. As the mean of the autonomy is 3.79, +/- 1SD gives range where approximately 68% of the autonomy values are. All the covariances are positive (e.g., if autonomy increases, also competence increases).

# specify regressions
path_model <- '
   autonomy ~ 1 + deepCGM
   competence ~ 1 + deepCGM + organisedCGM + autonomy # we add autonomy here as a predictor of competence
   
 # variances and covariances
   deepCGM ~~ deepCGM
   organisedCGM ~~ organisedCGM
   deepCGM ~~ organisedCGM
'

# estimate model
path_model_fit <- sem(path_model, 
                      data = dfttu,
                      group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
                      cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                      missing = "fiml.x" # full information maximum likelihood approach (good if there are missing observations)
                      )

# display results in LISREL format
summary(path_model_fit, 
        rsq=TRUE) # display R^2

## lavaan 0.6.14 ended normally after 38 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
## 
##   Number of observations                           208
##   Number of missing patterns                         1
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.130
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.718
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   autonomy ~                                          
##     deepCGM           0.360    0.079    4.545    0.000
##   competence ~                                        
##     deepCGM           0.232    0.061    3.828    0.000
##     organisedCGM      0.050    0.044    1.136    0.256
##     autonomy          0.685    0.045   15.223    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   deepCGM ~~                                          
##     organisedCGM      0.164    0.026    6.241    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .autonomy          3.786    0.040   94.569    0.000
##    .competence        1.316    0.172    7.639    0.000
##     deepCGM           0.000    0.035    0.000    1.000
##     organisedCGM     -0.000    0.047   -0.000    1.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     deepCGM           0.256    0.025   10.198    0.000
##     organisedCGM      0.458    0.045   10.198    0.000
##    .autonomy          0.333    0.033   10.198    0.000
##    .competence        0.140    0.014   10.198    0.000
## 
## R-Square:
##                    Estimate
##     autonomy          0.090
##     competence        0.615

# unstandardized
parameterEstimates(path_model_fit)

##             lhs op          rhs   est    se      z pvalue ci.lower ci.upper
## 1      autonomy ~1              3.786 0.040 94.569  0.000    3.708    3.865
## 2      autonomy  ~      deepCGM 0.360 0.079  4.545  0.000    0.205    0.515
## 3    competence ~1              1.316 0.172  7.639  0.000    0.978    1.653
## 4    competence  ~      deepCGM 0.232 0.061  3.828  0.000    0.113    0.350
## 5    competence  ~ organisedCGM 0.050 0.044  1.136  0.256   -0.036    0.135
## 6    competence  ~     autonomy 0.685 0.045 15.223  0.000    0.596    0.773
## 7       deepCGM ~~      deepCGM 0.256 0.025 10.198  0.000    0.207    0.305
## 8  organisedCGM ~~ organisedCGM 0.458 0.045 10.198  0.000    0.370    0.546
## 9       deepCGM ~~ organisedCGM 0.164 0.026  6.241  0.000    0.113    0.216
## 10     autonomy ~~     autonomy 0.333 0.033 10.198  0.000    0.269    0.397
## 11   competence ~~   competence 0.140 0.014 10.198  0.000    0.113    0.167
## 12      deepCGM ~1              0.000 0.035  0.000  1.000   -0.069    0.069
## 13 organisedCGM ~1              0.000 0.047  0.000  1.000   -0.092    0.092

# standardized
standardizedSolution(path_model_fit)

##             lhs op          rhs est.std    se      z pvalue ci.lower ci.upper
## 1      autonomy ~1                6.254 0.314 19.938  0.000    5.639    6.869
## 2      autonomy  ~      deepCGM   0.301 0.063  4.765  0.000    0.177    0.424
## 3    competence ~1                2.181 0.357  6.106  0.000    1.481    2.882
## 4    competence  ~      deepCGM   0.194 0.051  3.827  0.000    0.095    0.294
## 5    competence  ~ organisedCGM   0.056 0.049  1.135  0.256   -0.041    0.152
## 6    competence  ~     autonomy   0.687 0.036 19.298  0.000    0.617    0.757
## 7       deepCGM ~~      deepCGM   1.000 0.000     NA     NA    1.000    1.000
## 8  organisedCGM ~~ organisedCGM   1.000 0.000     NA     NA    1.000    1.000
## 9       deepCGM ~~ organisedCGM   0.480 0.053  8.995  0.000    0.375    0.585
## 10     autonomy ~~     autonomy   0.910 0.038 23.992  0.000    0.835    0.984
## 11   competence ~~   competence   0.385 0.042  9.209  0.000    0.303    0.467
## 12      deepCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136
## 13 organisedCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136

#--------------------
# unstandardized values

# first, we investigate the structure of std output object
str(parameterEstimates(path_model_fit)) # we learn that "$est.sdt", "$se" and "$pvalue" contain estimates that we are interested in reporting

## Classes 'lavaan.data.frame' and 'data.frame':    13 obs. of  9 variables:
##  $ lhs     : chr  "autonomy" "autonomy" "competence" "competence" ...
##  $ op      : chr  "~1" "~" "~1" "~" ...
##  $ rhs     : chr  "" "deepCGM" "" "deepCGM" ...
##  $ est     : num  3.7861 0.3598 1.3156 0.2317 0.0497 ...
##  $ se      : num  0.04 0.0792 0.1722 0.0605 0.0437 ...
##  $ z       : num  94.57 4.54 7.64 3.83 1.14 ...
##  $ pvalue  : num  0.00 5.50e-06 2.20e-14 1.29e-04 2.56e-01 ...
##  $ ci.lower: num  3.708 0.205 0.978 0.113 -0.036 ...
##  $ ci.upper: num  3.865 0.515 1.653 0.35 0.135 ...

# second, we look into the std output to see the "rows" we wish to refer
parameterEstimates(path_model_fit)

##             lhs op          rhs   est    se      z pvalue ci.lower ci.upper
## 1      autonomy ~1              3.786 0.040 94.569  0.000    3.708    3.865
## 2      autonomy  ~      deepCGM 0.360 0.079  4.545  0.000    0.205    0.515
## 3    competence ~1              1.316 0.172  7.639  0.000    0.978    1.653
## 4    competence  ~      deepCGM 0.232 0.061  3.828  0.000    0.113    0.350
## 5    competence  ~ organisedCGM 0.050 0.044  1.136  0.256   -0.036    0.135
## 6    competence  ~     autonomy 0.685 0.045 15.223  0.000    0.596    0.773
## 7       deepCGM ~~      deepCGM 0.256 0.025 10.198  0.000    0.207    0.305
## 8  organisedCGM ~~ organisedCGM 0.458 0.045 10.198  0.000    0.370    0.546
## 9       deepCGM ~~ organisedCGM 0.164 0.026  6.241  0.000    0.113    0.216
## 10     autonomy ~~     autonomy 0.333 0.033 10.198  0.000    0.269    0.397
## 11   competence ~~   competence 0.140 0.014 10.198  0.000    0.113    0.167
## 12      deepCGM ~1              0.000 0.035  0.000  1.000   -0.069    0.069
## 13 organisedCGM ~1              0.000 0.047  0.000  1.000   -0.092    0.092

# third, we store values from the output into variables for inline text referencing ...

# autonomy is only predicted by deep
# aut ~ deep 
# estimate
path_m4_aut_deep_est <-  parameterEstimates(path_model_fit)$est[2] # row 2 in the column "est.sdt" contains beta (standardized) estimate of autonomy ~ deep approach to learning
# standard error
path_m4_aut_deep_se <-  parameterEstimates(path_model_fit)$se[2]
# p-value
path_m4_aut_deep_p <-  parameterEstimates(path_model_fit)$pvalue[2]


# competence is predicted by deep, organised and autonomy
# com ~ deep
# estimate
path_m4_com_deep_est <-  parameterEstimates(path_model_fit)$est[4]
# standard error
path_m4_com_deep_se <-  parameterEstimates(path_model_fit)$se[4]
# p-value
path_m4_com_deep_p <-  parameterEstimates(path_model_fit)$pvalue[4]

# com ~ organised
# estimate
path_m4_com_org_est <-  parameterEstimates(path_model_fit)$est[5]
# standard error
path_m4_com_org_se <-  parameterEstimates(path_model_fit)$se[5]
# p-value
path_m4_com_org_p <-  parameterEstimates(path_model_fit)$pvalue[5]

# com ~ autonomy
# estimate
path_m4_com_aut_est <-  parameterEstimates(path_model_fit)$est[6]
# standard error
path_m4_com_aut_se <-  parameterEstimates(path_model_fit)$se[6]
# p-value
path_m4_com_aut_p <-  parameterEstimates(path_model_fit)$pvalue[6]


# -----------------------
# R squared for the model
# percentage of the dependent variable variation that the (linear) model explains
path_m4_aut_r2 <- lavaan::inspect(path_model_fit,'r2')[1] # how much "deep" explains the variance of "autonomy"?
path_m4_com_r2 <- lavaan::inspect(path_model_fit,'r2')[2] # how much "deep", "organised" and "autonomy" explain the variance of "competence"?

#--------------------
# standardized values

# first, we investigate the structure of std output object
str(standardizedSolution(path_model_fit)) # "$est.sdt", "$se" and "$pvalue" contain estimates that we are interested in reporting

## Classes 'lavaan.data.frame' and 'data.frame':    13 obs. of  9 variables:
##  $ lhs     : chr  "autonomy" "autonomy" "competence" "competence" ...
##  $ op      : chr  "~1" "~" "~1" "~" ...
##  $ rhs     : chr  "" "deepCGM" "" "deepCGM" ...
##  $ est.std : num  6.254 0.3006 2.1813 0.1943 0.0558 ...
##  $ se      : num  0.3137 0.0631 0.3573 0.0508 0.0491 ...
##  $ z       : num  19.94 4.77 6.11 3.83 1.13 ...
##  $ pvalue  : num  0.00 1.89e-06 1.02e-09 1.30e-04 2.56e-01 ...
##  $ ci.lower: num  5.6392 0.1769 1.4811 0.0948 -0.0405 ...
##  $ ci.upper: num  6.869 0.424 2.882 0.294 0.152 ...

# second, we look into the std output to see the "rows" we wish to refer
standardizedSolution(path_model_fit)

##             lhs op          rhs est.std    se      z pvalue ci.lower ci.upper
## 1      autonomy ~1                6.254 0.314 19.938  0.000    5.639    6.869
## 2      autonomy  ~      deepCGM   0.301 0.063  4.765  0.000    0.177    0.424
## 3    competence ~1                2.181 0.357  6.106  0.000    1.481    2.882
## 4    competence  ~      deepCGM   0.194 0.051  3.827  0.000    0.095    0.294
## 5    competence  ~ organisedCGM   0.056 0.049  1.135  0.256   -0.041    0.152
## 6    competence  ~     autonomy   0.687 0.036 19.298  0.000    0.617    0.757
## 7       deepCGM ~~      deepCGM   1.000 0.000     NA     NA    1.000    1.000
## 8  organisedCGM ~~ organisedCGM   1.000 0.000     NA     NA    1.000    1.000
## 9       deepCGM ~~ organisedCGM   0.480 0.053  8.995  0.000    0.375    0.585
## 10     autonomy ~~     autonomy   0.910 0.038 23.992  0.000    0.835    0.984
## 11   competence ~~   competence   0.385 0.042  9.209  0.000    0.303    0.467
## 12      deepCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136
## 13 organisedCGM ~1                0.000 0.069  0.000  1.000   -0.136    0.136

# third, we store values from the std output into variables for inline text referencing

# aut ~ deep 
# estimate (row 2 in the column "est.sdt" contains beta (standardized) estimate of autonomy ~ deep approach to learning)
path_m4_aut_deep_est.std <-  standardizedSolution(path_model_fit)$est.std[2] 
# standard error
path_m4_aut_deep_se.std <-  standardizedSolution(path_model_fit)$se[2]
# p-value
path_m4_aut_deep_p.std <-  standardizedSolution(path_model_fit)$pvalue[2]

# com ~ deep 
# estimate
path_m4_com_deep_score_est.std <-  standardizedSolution(path_model_fit)$est.std[4]
# standard error
path_m4_com_deep_se.std <-  standardizedSolution(path_model_fit)$se[4]
# p-value
path_m4_com_deep_p.std <-  standardizedSolution(path_model_fit)$pvalue[4]

# com ~ organised
# estimate
path_m4_com_org_est.std <-  standardizedSolution(path_model_fit)$est.std[5]
# standard error
path_m4_com_org_se.std <-  standardizedSolution(path_model_fit)$se[5]
# p-value
path_m4_com_org_p.std <-  standardizedSolution(path_model_fit)$pvalue[5]

# com ~ autonomy
# estimate
path_m4_com_aut_est.std <-  standardizedSolution(path_model_fit)$est.std[6]
# standard error
path_m4_com_aut_se.std <-  standardizedSolution(path_model_fit)$se[6]
# p-value
path_m4_com_aut_p.std <-  standardizedSolution(path_model_fit)$pvalue[6]

6.1 Report the results

Results of path analysis (Figure 3) with R lavaan showed that deep approach to learning was a positive statistically significant predictor of autonomy, B = 0.360, SE = 0.079, p = 0.000. This unstandardized result indicates that a one unit increase in deep increases autonomy by 0.360 units. Deep approach to learning explained 9.0% of the variance of autonomy variable. It was also investigated if deep approach to learning, organised approach to studying and autonomy are related to competence. Results showed that all three predictors had positive associations with competence, but only deep (B = 0.232, SE = 0.061, p = 0.000) and autonomy (B = 0.685, SE = 0.045, p = 0.000) were statistically significant. These variables explained 61.47% of the variance of competence variable.

Figure 3. Path model of competence and autonomy with selected predictors (unstandardized estimates).

# produce diagram
semPaths(path_model_fit,
         layout="tree",
         "par", # unstandardized parameter estimates OR "par" for standardized parameter estimates
         edge.label.cex=0.7,
         edge.width=0.2,
         edge.color="black",
         # optimizeLatRes=TRUE,
         label.scale=FALSE, # do not make labels smaller to fit inside boxes
         # normalize=TRUE,
         fade=FALSE, # do not fade non sig lines
         as.expression=c("nodes"), # display full labels instead of abbreviated
         # levels=c(1,3,6,9) # set distances between elements of the graph
         )

Exercise 3

Produce text that describes the standardized results of the analyses above (alongside with a figure).

7. SEM model

dfttu4 <- read.csv("data1_csv.dat", header = T)
summary(dfttu4)

##        v1              v2              v3              v4       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.941   Mean   :2.713   Mean   :2.655   Mean   :2.849  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :5       NA's   :4       NA's   :6       NA's   :11     
##        v5              v6              v7              v8       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :4.000   Median :4.000   Median :4.000   Median :3.000  
##  Mean   :3.817   Mean   :3.689   Mean   :3.432   Mean   :3.356  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4       NA's   :3       NA's   :3       NA's   :3      
##        v9             v12             v13             v14       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :3.000   Median :2.000   Median :2.000   Median :3.000  
##  Mean   :3.235   Mean   :2.153   Mean   :2.068   Mean   :2.977  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :4       NA's   :3       NA's   :6       NA's   :8      
##       v15             v16             v17            v18             v29       
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :4.000   Median :3.00   Median :3.000   Median :3.000  
##  Mean   :2.836   Mean   :3.536   Mean   :3.28   Mean   :3.089   Mean   :2.837  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :5.000  
##  NA's   :9       NA's   :3       NA's   :7      NA's   :7       NA's   :5      
##       v30             v31             v33             v34       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :4.000   Median :4.000  
##  Mean   :3.086   Mean   :3.038   Mean   :3.765   Mean   :3.979  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.750   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :6       NA's   :4       NA's   :9       NA's   :9      
##       v42r            v43           v44             v45             v46       
##  Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.0   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :4.0   Median :3.000   Median :2.000   Median :2.000  
##  Mean   :2.991   Mean   :3.8   Mean   :3.278   Mean   :2.388   Mean   :2.202  
##  3rd Qu.:4.000   3rd Qu.:5.0   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.0   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  NA's   :1       NA's   :12    NA's   :1       NA's   :1       NA's   :2      
##       v47       
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.208  
##  3rd Qu.:3.000  
##  Max.   :5.000  
##  NA's   :5

Our data is from a Finnish polytechnics of higher education, collected in 2000 (n = 447). The respondents are the staff members of the organization. The original measurement instrument (Growth-oriented Atmosphere Questionnaire, GOAQ) has 13 factors and 92-items (Ruohotie, 1996; Ruohotie, Nokelainen & Tirri, 2002), but for the purposes of this exercise we have selected the following factors and sample items. Five –point scale from 1 (totally disagree) to 5 (totally agree) was used in the online survey.

Factor 1: Psychic stress of the work (PSY)
v44 I feel that I am beginning to dislike my work.
v45 I feel that it is getting more difficult for me to take the initiative.

Factor 2: Encouraging leadership (ENC)
v5 My manager is friendly and easily approachable.
v7 My manager pays attention to my suggestions and wishes.
v17 My manager works with a team to find solutions.

Factor 3: Build-up of work requirements (BUI)
v42r My workload has increased during the past years.
v43 My working pace has increased in recent years.

# create new df with only relevant variables for correlation matrix
dfttu4cor <- dfttu4[,c("v44","v45","v5","v7","v17","v42r","v43")]

# compute correlation matrix
lavCor(dfttu4cor)

##         v44    v45     v5     v7    v17   v42r    v43
## v44   1.000                                          
## v45   0.486  1.000                                   
## v5   -0.254 -0.290  1.000                            
## v7   -0.263 -0.343  0.626  1.000                     
## v17  -0.233 -0.312  0.731  0.684  1.000              
## v42r  0.525  0.309 -0.239 -0.298 -0.226  1.000       
## v43   0.483  0.167 -0.062 -0.118 -0.018  0.387  1.000

Correlation matrix shows that the observed variables for each factor have quite strong and positive correlations, as expected. For example, "v44" and "v45" are manifestations of the same latent variable ("PSY") and so they have a strong positive correlation (r = 0.486). Also the negative correlations make sense: It is plausible that the three observed variables ("v5","v7","v17") that belong to the latent ENC variable (Encouraging leadership) are negatively associated to observed variables related to psychic stress ("v44","v45") and build-up of work requirements ("v42r","v43").

However, there is one suspicious correlation that we need to examine further: "v44" (PSY) and "v43" (BUI) are quite strongly positively correlated ( r = 0.483) although they belong to different factors?! We will return to this issue in the last section of this exercise (via "modification indices").

7.1 Specify the SEM model

According to Beaujean (2014, p. 37), "A structural equation model (SEM) is a broad class of statistical models that consists of two parts: the structural model and the latent variable model. The structural model consists of the regression-like relationships among the variables (i.e., the path analysis). The latent variable model (LVM) forms the latent variables (LVs) used in the structural model. When a LVM is analyzed without a structural model, it is sometimes called a confirmatory factor analysis (CFA). If there was not a hypothesized structure for the latent variable model, then it would be an exploratory factor analysis (EFA)."

The SEM model (Figure 4) aims to analyse if Encouraging leadership (ENC) and Build-up of work requirements (BUI) are related to the Psychic stress of the work (PSY).

Figure 4. SEM model between Encouraging leadership, Build-up of work requirements and Psychic stress of the work

ENC is a latent exogenous (IV) variable that is constructed with three observed variables (v5, v7, v17). The code below shows that the highest loading variable is not the first one (v5), but the last one (v17) instead. To make the model identifiable, we fix one indicator path to 1 [see more details] for each latent variable. In the case of PSY, we fix the strongest loading variable (v17) instead of the first variable in the list (using "NA*" removes the automatic fix to 1 for v5).

BUI is also an exogenous latent variable, but based only on two observed variables (v42r, v43). The "r" in v42r stands for a reverse coded variable: The original statement in v42 was "My workload has decreased during the past years.", but as we wish to measure build-up of work, we use reversed wording "My workload has increased during the past years." instead (and indicate that by naming the variable "v42r").

PSY is an endogenous (DV) latent variable with two observed indicators (v44, v45).

sem_model <- '
# Factors
ENC =~ NA*v5 + v7 + 1*v17   # Encouraging leadership
BUI =~ 1*v42r + v43         # Build-up of work requirements
PSY =~ 1*v44 + v45          # Psychic stress of the work
# Regressions
PSY ~ ENC + BUI             # Predict PSY with ENC and BUI
# Covariances
ENC ~~ BUI                  # Let ENC and BUI correlate
'

7.2 Estimate the model

The model specified in the previous section is now estimated. The default estimation method is Maximum Likelihood (ML). As there are some missing data, we use Full Information Maximum Likelihood (FIML) instead of casewise deletion.

sem_model_fit <- sem(sem_model, 
                    data=dfttu4,
                    
                    group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
                    cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                    missing = "fiml.x" # full information maximum likelihood approach
                    )

# summary(sem_model_fit, 
#         standardized=TRUE)
# 
# # Latent Variables:
# #                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
# #   ENC =~                                                                
# #     v5                0.995       NA                      0.962    0.814
# #     v7                0.850       NA                      0.822    0.768
# #     v17               1.110       NA                      1.074    0.886
# #   BUI =~                                                                
# #     v42r              0.864       NA                      0.777    0.672
# #     v43               0.689       NA                      0.620    0.571
# #   PSY =~                                                                
# #     v44               1.053       NA                      1.118    0.901
# #     v45               0.623       NA                      0.662    0.551
# # 
# # Regressions:
# #                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
# #   PSY ~                                                                 
# #     ENC              -0.110       NA                     -0.100   -0.100
# #     BUI               1.003       NA                      0.850    0.850
# # 
# # Covariances:
# #                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
# #   ENC ~~                                                                
# #     BUI              -0.264       NA                     -0.303   -0.303

Identification

Before proceeding with the analysis, we need to check if our model is underidentified, just-identified or overidentified (Kline, 2016, pp.127-128). Model specifications can not require more parameters to be analysed than there are observations available. Thus, only just-identified (equal number of observations and parameters) and overidentified (less parameters than observations) models can be analysed (underidentification produces an error message).

The number of observations (free parameters) to be estimated (non-reduntant elements) can be calculated with k(k+1)/2. "k" is the number of observed (manifest) variables in the model.

Figure 4 above shows that we have seven observed manifest variables in our model (v5, v7, v17, v42r, v43, v44, v45). If we place that number into "k" and calculate, we get 7(7+1)/2 = 56/2 = 28. So, our number of observations is 28. Next, we need to figure out how many parameters are to be estimated in the model (Figure 4). We will use lavaan::lavInspect() to analyse this:

# how many parameters are to be estimated in the model?
# lavaan::inspect(sem_model_fit) # n of parameters: 24 # older version
lavaan::lavInspect(sem_model_fit) # n of parameters: 24

## $lambda
##      ENC BUI PSY
## v5     1   0   0
## v7     2   0   0
## v17    0   0   0
## v42r   0   0   0
## v43    0   3   0
## v44    0   0   0
## v45    0   0   4
## 
## $theta
##      v5 v7 v17 v42r v43 v44 v45
## v5    8                        
## v7    0  9                     
## v17   0  0  10                 
## v42r  0  0   0   11            
## v43   0  0   0    0  12        
## v44   0  0   0    0   0  13    
## v45   0  0   0    0   0   0  14
## 
## $psi
##     ENC BUI PSY
## ENC  15        
## BUI   7  16    
## PSY   0   0  17
## 
## $beta
##     ENC BUI PSY
## ENC   0   0   0
## BUI   0   0   0
## PSY   5   6   0
## 
## $nu
##      intrcp
## v5       18
## v7       19
## v17      20
## v42r     21
## v43      22
## v44      23
## v45      24
## 
## $alpha
##     intrcp
## ENC      0
## BUI      0
## PSY      0

Result shows that there are 24 parameters to be estimated:

4 manifest variable loadings on three factors (lambda matrix)
7 error variances of manifest variables (theta matrix)
3 latent variances and 1 covariance (psi matrix)
2 regressions (beta matrix)
7 intercepts(nu matrix)

The model is overidentified (good!), as the number of non-redundant elements [7(7+1)/2 = 56/2 = 28] is greater than the number of parameters to be estimated: 28 - 24 = 4.

Sample size

Another issue is that do we have enough data for the SEM model? According to Jackson (2003), we can get a rough estimate of the sample size needed by multiplying the number of participants (N) by the number of parameters in the model (q). Sample-size-to-parameters ratio according to this N:q rule would ideally be 20:1 (Kline, 2016, p.16).

We have 24 parameters to be estimated, so our sample size should be 20*24 = 480. Our data frame has 208 rows (=participants), so we do fine here. However, Kline (2016, p.16) mentions that "Less ideal would be an N:q ratio of 10:1 …", which in our case would mean radically smaller sample size expectation (of 240).

A more clever way of investigating sample size needed is to use simulation alongside with desired alpha and power values. Usually alpha is set to 0.05 (five percent chance of Type I error) and power to 0.80 (80 percent chance to avoid rejecting H1 when it is true).

# install.packages("semPower")
library(semPower)

## 
## ### Welcome to semPower 1.2.0 ###
## 
## See https://github.com/moshagen/semPower for quick examples and a detailed manual.
## 
## Please cite as:
## Moshagen, M., & Erdfelder, E. (2016). A new strategy for testing structural equation models.
## Structural Equation Modeling, 23, 54-60. doi: 10.1080/10705511.2014.950896

aPrioriSampleSize <- semPower::semPower.aPriori(effect = .05, effect.measure = 'RMSEA', alpha = .05, power = .80, df = 24)
summary(aPrioriSampleSize)

## 
##  semPower: A-priori power analysis
##                                    
##  F0                        0.060000
##  RMSEA                     0.050000
##  Mc                        0.970446
##                                    
##  df                        24      
##  Required Num Observations 375     
##                                    
##  Critical Chi-Square       36.41502
##  NCP                       22.44000
##  Alpha                     0.050000
##  Beta                      0.201086
##  Power (1-beta)            0.798914
##  Implied Alpha/Beta Ratio  0.248650

 # F0                        0.060000
 # RMSEA                     0.050000
 # Mc                        0.970446
 #                                   
 # df                        24      
 # Required Num Observations 375     
 #                                   
 # Critical Chi-Square       36.41502
 # NCP                       22.44000
 # Alpha                     0.050000
 # Beta                      0.201086
 # Power (1-beta)            0.798914
 # Implied Alpha/Beta Ratio  0.248650

We learn that if we estimate 24 parameters (alpha = .05, power = .80), required sample size would be 375.

Let's take another opinion from simsem …

# install.packages("simsem")
library(simsem)

##

## #################################################################

## This is simsem 0.5-16

## simsem is BETA software! Please report any bugs.

## simsem was first developed at the University of Kansas Center for

## Research Methods and Data Analysis, under NSF Grant 1053160.

## #################################################################

## 
## Attaching package: 'simsem'

## The following object is masked from 'package:lavaan':
## 
##     inspect

## The following object is masked from 'package:psych':
## 
##     sim

# https://github.com/simsem/simsem/wiki/Example-18:-Simulation-with-Varying-Sample-Size

# define population model
sem_population_model <- '

# Factors
ENC =~ 0.81*v5   + 0.77*v7 + 0.89*v17   # standardized loadings from "Std.all" output above
BUI =~ 0.67*v42r + 0.57*v43
PSY =~ 0.90*v44  + 0.55*v45

# Regressions
PSY ~ -0.10*ENC + 0.85*BUI              # standardized path coefficients from "Std.all" output above

# Covariances
ENC ~~ -.30*BUI                         # covariance from "Std.all" output above
'

# simulation takes some time, so we'll save the results into a file ...
if (file.exists("SEM_sampleSize.rds")) {
SEM_sampleSize <- readRDS("SEM_sampleSize.rds")
} else {
SEM_sampleSize <- simsem::sim(NULL, n=50:500, model = sem_model, generate = sem_population_model, std.lv = TRUE, lavaanfun = "cfa")
saveRDS(SEM_sampleSize, "SEM_sampleSize.rds")
}

summary(SEM_sampleSize)

## RESULT OBJECT
## Model Type
## [1] "lavaan"
## ========= Fit Indices Cutoffs ============
##     N  chisq       aic       bic rmsea   cfi   tli  srmr
## 1  50 20.243  1193.386  1232.024 0.088 0.911 0.830 0.073
## 2 162 19.751  3724.860  3775.449 0.074 0.931 0.868 0.061
## 3 275 19.254  6278.937  6341.583 0.059 0.951 0.907 0.049
## 4 388 18.758  8833.014  8907.717 0.045 0.971 0.945 0.037
## 5 500 18.266 11364.488 11451.142 0.031 0.991 0.984 0.025
## ========= Parameter Estimates and Standard Errors ============
##            Estimate Average Estimate SD Average SE Power (Not equal 0) Std Est
## ENC=~v5               0.805       0.106      0.102               1.000   0.627
## ENC=~v7               0.766       0.107      0.099               1.000   0.609
## ENC=~v17              0.886       0.122      0.108               1.000   0.663
## BUI=~v42r             0.813       2.098      0.126               0.982   0.678
## BUI=~v43              0.577       0.153      0.126               0.982   0.504
## PSY=~v44              0.880       0.257      0.249               0.901   0.782
## PSY=~v45              0.518       0.135      0.136               0.924   0.585
## PSY~ENC               0.976      21.691      0.410               0.103  -0.055
## PSY~BUI               2.713      32.504      1.008               0.756   0.646
## ENC~~BUI             -0.307       0.142      0.118               0.763  -0.307
## v5~~v5                0.992       0.160      0.146               0.989   0.602
## v7~~v7                0.985       0.148      0.137               0.993   0.624
## v17~~v17              0.986       0.184      0.164               0.979   0.554
## v42r~~v42r           -3.601      72.477      0.174               0.956  -2.607
## v43~~v43              0.955       0.342      0.218               0.986   0.728
## v44~~v44              0.904       0.517      0.416               0.733   0.372
## v45~~v45              0.990       0.199      0.169               0.977   0.648
##            Std Est SD Std Ave SE Average Param Average Bias Coverage r_coef.n
## ENC=~v5         0.073      0.066          0.81       -0.005    0.959    0.002
## ENC=~v7         0.073      0.066          0.77       -0.004    0.943    0.065
## ENC=~v17        0.078      0.067          0.89       -0.004    0.945    0.027
## BUI=~v42r       1.776      0.097          0.67        0.143    0.949   -0.101
## BUI=~v43        0.135      0.104          0.57        0.007    0.940   -0.035
## PSY=~v44        0.126      0.103          0.90       -0.020    0.977   -0.043
## PSY=~v45        0.098      0.083          0.55       -0.032    0.961    0.054
## PSY~ENC         0.153      0.131         -0.10        1.076    0.989   -0.050
## PSY~BUI         0.170      0.153          0.85        1.863    0.945   -0.058
## ENC~~BUI        0.142      0.118         -0.30       -0.007    0.910    0.007
## v5~~v5          0.092      0.083          1.00       -0.008    0.959    0.025
## v7~~v7          0.088      0.080          1.00       -0.015    0.938    0.008
## v17~~v17        0.104      0.089          1.00       -0.014    0.952   -0.002
## v42r~~v42r     50.903      0.110          1.00       -4.601    0.949    0.096
## v43~~v43        0.286      0.163          1.00       -0.045    0.954    0.125
## v44~~v44        0.222      0.175          1.00       -0.096    0.959    0.175
## v45~~v45        0.124      0.101          1.00       -0.010    0.949    0.000
##            r_se.n
## ENC=~v5    -0.877
## ENC=~v7    -0.884
## ENC=~v17   -0.863
## BUI=~v42r  -0.740
## BUI=~v43   -0.176
## PSY=~v44   -0.466
## PSY=~v45   -0.339
## PSY~ENC    -0.083
## PSY~BUI    -0.068
## ENC~~BUI   -0.809
## v5~~v5     -0.815
## v7~~v7     -0.838
## v17~~v17   -0.780
## v42r~~v42r -0.638
## v43~~v43   -0.102
## v44~~v44   -0.389
## v45~~v45   -0.231
## ========= Correlation between Fit Indices ============
##        chisq    aic    bic  rmsea    cfi    tli   srmr      n
## chisq  1.000 -0.013 -0.013  0.872 -0.673 -0.781  0.486 -0.014
## aic   -0.013  1.000  1.000 -0.228  0.283 -0.005 -0.743  1.000
## bic   -0.013  1.000  1.000 -0.228  0.283 -0.005 -0.743  1.000
## rmsea  0.872 -0.228 -0.228  1.000 -0.866 -0.813  0.618 -0.229
## cfi   -0.673  0.283  0.283 -0.866  1.000  0.817 -0.641  0.284
## tli   -0.781 -0.005 -0.005 -0.813  0.817  1.000 -0.456 -0.005
## srmr   0.486 -0.743 -0.743  0.618 -0.641 -0.456  1.000 -0.743
## n     -0.014  1.000  1.000 -0.229  0.284 -0.005 -0.743  1.000
## ================== Replications =====================
## Number of replications = 451 
## Number of converged replications = 412 
## Number of nonconverged replications: 
##    1. Nonconvergent Results = 12 
##    2. Nonconvergent results from multiple imputation = 0 
##    3. At least one SE were negative or NA = 2 
##    4. Nonpositive-definite latent or observed (residual) covariance matrix 
##       (e.g., Heywood case or linear dependency) = 25 
## NOTE: The sample size is varying.

# look at alpha value at 0.05 and sample sizes
simsem::plotCutoff(SEM_sampleSize, 0.05)

simsem::getCutoff(SEM_sampleSize, 0.05, nVal = 240) # 10:1 rule

##      chisq      aic      bic      rmsea       cfi       tli       srmr
## 1 19.40819 5487.851 5546.762 0.06378119 0.9449914 0.8949836 0.05273879

#      chisq      aic      bic      rmsea       cfi       tli       srmr
# 1 19.40819 5487.851 5546.762 0.06378119 0.9449914 0.8949836 0.05273879

simsem::getCutoff(SEM_sampleSize, 0.05, nVal = 480) # 20:1 rule

##      chisq      aic      bic      rmsea       cfi       tli       srmr
## 1 18.35418 10912.44 10996.96 0.03367612 0.9878986 0.9768973 0.02682591

#      chisq      aic      bic      rmsea       cfi       tli       srmr
# 1 18.35418 10912.44 10996.96 0.03367612 0.9878986 0.9768973 0.02682591

# find the sample size that provides the power just over 0.80
Cpow <- simsem::getPower(SEM_sampleSize)
simsem::findPower(Cpow, "N", 0.80)

##    ENC=~v5    ENC=~v7   ENC=~v17  BUI=~v42r   BUI=~v43   PSY=~v44   PSY=~v45 
##        Inf        Inf        Inf         68         68        142        101 
##    PSY~ENC    PSY~BUI   ENC~~BUI     v5~~v5     v7~~v7   v17~~v17 v42r~~v42r 
##         NA        246        266         50         54         65         92 
##   v43~~v43   v44~~v44   v45~~v45 
##         50        306         58

   # ENC=~v5    ENC=~v7   ENC=~v17  BUI=~v42r   BUI=~v43   PSY=~v44   PSY=~v45    PSY~ENC    PSY~BUI   ENC~~BUI     v5~~v5     v7~~v7   v17~~v17 v42r~~v42r   v43~~v43   v44~~v44   v45~~v45 
   #     Inf        Inf        Inf         68         68        142        101         NA        246        266         50         54         65         92         50        306         58

simsem::plotPower(SEM_sampleSize, powerParam=c("ENC=~v5", "BUI=~v43", "PSY=~v44", "PSY~ENC", "PSY~BUI", "ENC~~BUI"))

Results of the simulation show that no sample size is enough to make "PSY~ENC" regression plausible if the power level is set to .80. However, investigation of "PSY~BUI" regression would require 246 participants (result in accordance with earlier "rule-of-thumb" 10:1 estimation). The greatest suggested sample size is 306, quite close to the earlier estimation of 375.

We conclude that our model is overidentified and that we have enough participants.

7.3 Extract the results

summary(sem_model_fit, standardized=TRUE, rsq=TRUE) # display R Squared

## lavaan 0.6.14 ended normally after 41 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        24
## 
##   Number of observations                           447
##   Number of missing patterns                        10
## 
## Model Test User Model:
##                                                       
##   Test statistic                                62.390
##   Degrees of freedom                                11
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ENC =~                                                                
##     v5                0.896    0.048   18.766    0.000    0.962    0.814
##     v7                0.765    0.043   17.629    0.000    0.822    0.768
##     v17               1.000                               1.074    0.886
##   BUI =~                                                                
##     v42r              1.000                               0.777    0.672
##     v43               0.798    0.089    8.924    0.000    0.620    0.571
##   PSY =~                                                                
##     v44               1.000                               1.118    0.901
##     v45               0.592    0.068    8.702    0.000    0.662    0.551
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   PSY ~                                                                 
##     ENC              -0.104    0.063   -1.653    0.098   -0.100   -0.100
##     BUI               1.223    0.152    8.028    0.000    0.850    0.850
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ENC ~~                                                                
##     BUI              -0.253    0.062   -4.092    0.000   -0.303   -0.303
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .v5                3.818    0.056   68.091    0.000    3.818    3.229
##    .v7                3.436    0.051   67.756    0.000    3.436    3.211
##    .v17               3.280    0.058   56.971    0.000    3.280    2.705
##    .v42r              2.992    0.055   54.600    0.000    2.992    2.584
##    .v43               3.803    0.052   73.253    0.000    3.803    3.501
##    .v44               3.277    0.059   55.754    0.000    3.277    2.639
##    .v45               2.387    0.057   42.027    0.000    2.387    1.990
##     ENC               0.000                               0.000    0.000
##     BUI               0.000                               0.000    0.000
##    .PSY               0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .v5                0.473    0.047    9.978    0.000    0.473    0.338
##    .v7                0.470    0.041   11.380    0.000    0.470    0.410
##    .v17               0.317    0.048    6.588    0.000    0.317    0.216
##    .v42r              0.736    0.080    9.226    0.000    0.736    0.549
##    .v43               0.795    0.067   11.829    0.000    0.795    0.674
##    .v44               0.291    0.119    2.458    0.014    0.291    0.189
##    .v45               1.002    0.079   12.747    0.000    1.002    0.696
##     ENC               1.153    0.105   10.934    0.000    1.000    1.000
##     BUI               0.604    0.098    6.186    0.000    1.000    1.000
##    .PSY               0.269    0.132    2.035    0.042    0.215    0.215
## 
## R-Square:
##                    Estimate
##     v5                0.662
##     v7                0.590
##     v17               0.784
##     v42r              0.451
##     v43               0.326
##     v44               0.811
##     v45               0.304
##     PSY               0.785

# https://rpubs.com/mkearney/103040

library(stargazer)

# unstandardised
pars.factors <- parameterEstimates(sem_model_fit)[ parameterEstimates(sem_model_fit)[,'op']=='=~', c(1:5)]
stargazer(pars.factors, summary=FALSE, type='text', # use 'html' if you wish to save the table and edit it further in Word etc.
          rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings')

## 
## Factor loadings
## =====================
## lhs op rhs  est   se 
## ---------------------
## ENC =~  v5  .896 .048
## ENC =~  v7  .765 .043
## ENC =~ v17   1    0  
## BUI =~ v42r  1    0  
## BUI =~ v43  .798 .089
## PSY =~ v44   1    0  
## PSY =~ v45  .592 .068
## ---------------------

pars.regressions <- parameterEstimates(sem_model_fit)[ parameterEstimates(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
stargazer(pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY')

## 
## Predicting PSY
## =====================
## rhs  est   se  pvalue
## ---------------------
## ENC -.104 .063  .098 
## BUI 1.223 .152   0   
## ---------------------

# standardised
std.pars.factors <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='=~', c(1:5)]
stargazer(std.pars.factors, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings (std)')

## 
## Factor loadings (std)
## ========================
## lhs op rhs  est.std  se 
## ------------------------
## ENC =~  v5   .814   .022
## ENC =~  v7   .768   .025
## ENC =~ v17   .886   .019
## BUI =~ v42r  .672   .043
## BUI =~ v43   .571   .044
## PSY =~ v44   .901   .043
## PSY =~ v45   .551   .042
## ------------------------

std.pars.regressions <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
stargazer(std.pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY (std)')

## 
## Predicting PSY (std)
## =======================
## rhs est.std  se  pvalue
## -----------------------
## ENC  -.100  .062  .106 
## BUI  .850   .058   0   
## -----------------------

# produce joint tables
# display both un/standardised parameter estimates in a same table
joint.pars.factors <- cbind(pars.factors, std.pars.factors[4:5]) # pick only last two columns from std parameter estimates
names(joint.pars.factors)[7] <- "se.std" # rename the second "se" in column 7 as "se.std"
stargazer(joint.pars.factors, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Factor loadings')

## 
## Factor loadings
## ====================================
## lhs op rhs  est   se  est.std se.std
## ------------------------------------
## ENC =~  v5  .896 .048  .814    .022 
## ENC =~  v7  .765 .043  .768    .025 
## ENC =~ v17   1    0    .886    .019 
## BUI =~ v42r  1    0    .672    .043 
## BUI =~ v43  .798 .089  .571    .044 
## PSY =~ v44   1    0    .901    .043 
## PSY =~ v45  .592 .068  .551    .042 
## ------------------------------------

# display both un/standardised regressions in a same table
joint.pars.regressions <- cbind(pars.regressions, std.pars.regressions[2:4]) # pick three columns from std regressions
names(joint.pars.regressions)[6] <- "se.std" # rename the second "se" in column 6 as "se.std"
names(joint.pars.regressions)[7] <- "pvalue.std" # rename the second "pvalue" in column 7 as "pvalue.std"
stargazer(joint.pars.regressions, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY')

## 
## Predicting PSY
## ===============================================
## rhs  est   se  pvalue est.std se.std pvalue.std
## -----------------------------------------------
## ENC -.104 .063  .098   -.100   .062     .106   
## BUI 1.223 .152   0     .850    .058      0     
## -----------------------------------------------

7.4 Show fit measures

Lavaan produces a whole bunch of fit measures:

fitMeasures(sem_model_fit)

##                         npar                         fmin 
##                       24.000                        0.070 
##                        chisq                           df 
##                       62.390                       11.000 
##                       pvalue               baseline.chisq 
##                        0.000                     1131.720 
##                  baseline.df              baseline.pvalue 
##                       21.000                        0.000 
##                          cfi                          tli 
##                        0.954                        0.912 
##                   cfi.robust                   tli.robust 
##                        0.954                        0.912 
##                         nnfi                          rfi 
##                        0.912                        0.895 
##                          nfi                         pnfi 
##                        0.945                        0.495 
##                          ifi                          rni 
##                        0.954                        0.954 
##                  nnfi.robust                   rni.robust 
##                        0.912                        0.954 
##                         logl            unrestricted.logl 
##                    -4334.157                    -4302.962 
##                          aic                          bic 
##                     8716.315                     8814.776 
##                       ntotal                         bic2 
##                      447.000                     8738.610 
##                        rmsea               rmsea.ci.lower 
##                        0.102                        0.078 
##               rmsea.ci.upper               rmsea.ci.level 
##                        0.128                        0.900 
##                 rmsea.pvalue               rmsea.close.h0 
##                        0.000                        0.050 
##        rmsea.notclose.pvalue            rmsea.notclose.h0 
##                        0.938                        0.080 
##                 rmsea.robust        rmsea.ci.lower.robust 
##                        0.103                        0.079 
##        rmsea.ci.upper.robust          rmsea.pvalue.robust 
##                        0.128                        0.000 
## rmsea.notclose.pvalue.robust                          rmr 
##                        0.940                        0.083 
##                   rmr_nomean                         srmr 
##                        0.093                        0.062 
##                 srmr_bentler          srmr_bentler_nomean 
##                        0.062                        0.070 
##                         crmr                  crmr_nomean 
##                        0.070                        0.080 
##                   srmr_mplus            srmr_mplus_nomean 
##                        0.062                        0.070 
##                        cn_05                        cn_01 
##                      141.966                      178.146 
##                          gfi                         agfi 
##                        0.996                        0.989 
##                         pgfi                          mfi 
##                        0.313                        0.944 
##                         ecvi 
##                        0.247

# produce fit indices for inline text
sem.fits <- as.numeric(fitMeasures(sem_model_fit, c("chisq",            # sem.fits[1]
                                                    "df",               # sem.fits[2]
                                                    "pvalue",           # sem.fits[3] 
                                                    "cfi",              # sem.fits[4] 
                                                    "tli",              # sem.fits[5] 
                                                    "aic",              # sem.fits[6]
                                                    "bic",              # sem.fits[7]
                                                    "rmsea",            # sem.fits[8]
                                                    "rmsea.ci.lower",   # sem.fits[9]
                                                    "rmsea.ci.upper",   # sem.fits[10]
                                                    "rmsea.pvalue",     # sem.fits[11]
                                                    "srmr",             # sem.fits[12]
                                                    "baseline.chisq",   # sem.fits[13]
                                                    "baseline.df"       # sem.fits[14]
                                                    )))

# create vector of desired measure names
fit.names <- c("chisq", "df", "pvalue", "cfi", "tli", "aic", "bic", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "rmsea.pvalue", "srmr", "baseline.chisq", "baseline.df")  
# merge names and values (from the previous "sem.fits") into df
df.fit.measures <- data.frame(fit.names,sem.fits)
names(df.fit.measures)[1] <- "Measure"
names(df.fit.measures)[2] <- "Value"
# use Stargazer to produce a table
stargazer(df.fit.measures, summary=FALSE, type='text', # use 'html' if you wish to save the table and edit it further in Word etc.
          rownames=FALSE, initial.zero=FALSE, digits=3, title='Fit measures')

## 
## Fit measures
## ========================
## Measure          Value  
## ------------------------
## chisq           62.390  
## df                11    
## pvalue             0    
## cfi              .954   
## tli              .912   
## aic            8,716.315
## bic            8,814.776
## rmsea            .102   
## rmsea.ci.lower   .078   
## rmsea.ci.upper   .128   
## rmsea.pvalue     .0003  
## srmr             .062   
## baseline.chisq 1,131.720
## baseline.df       21    
## ------------------------

Here are short descriptions of selected fit measures:

chisq (Chi square).

df (degrees of freedom) indicate free (not fixed) parameters of the model.

pvalue (probability value) is the probability of getting as large a discrepancy as occurred with the present sample (under appropriate distributional assumptions and assuming a correctly specified model) (Arbuckle & Wothke, 1999). It tests the hypothesis that the model fits perfectly to the population. The value should be greater than pre-set probability level (usually .05). If the value is .05 or less, the departure of the data from the model is significant at the .05 level (model does not fit into data). However, the appropriateness of hypothesis testing in model fitting, even when the necessary distributional assumptions are met, is routinely questioned.

NFI (Normed Fit Index) tells how big discrepancy there is between the model being evaluated (default model) and the baseline model (terribly fitting ‘independence model'). TLI, CFI, RNI (Tucker-Lewis Index, Comparative fit index, Relative Noncentrality Index) similarly compare the proposed model to a baseline model that all other models should be expected to exceed (Hair, Anderson, Tatham & Black, 1995, p. 685). Values close to one indicate a very good fit. According to Bentler & Bonett (1980, p. 600), referring to both the NFI and the TLI, "… models with overall fit indices of less than .90 can usually be improved substantially."

RMSEA (Root Mean Square Error of Approximation). The RMSEA is designed to evaluate the approximate fit of the model in the population (Kaplan, 2000, p.112). This indice is getting smaller as the df increases. In practice this means that models with large RMSEA values (e.g., 0.12) simplify the reality. This error could be estimated as follows (Browne & Cudeck, 1993; Kaplan, 2000, p. 113): < 0.05 ‘close fit', 0.05 – 0.08 ‘fair fit', 0.081 – 0.10 ‘mediocre fit', > 0.10 ‘poor fit'.

SRMR (Standardized Root Mean square Residual). The SRMR is a measure of the mean absolute correlation residual, the overall difference between the observed and predicted correlations (Kline, 2016, p. 277). This "badness-of-fit" statistic (Kline, 2016) indicates if the square root of the discrepancy between the sample and model covariance matrixes is too big. Cangur and Ercan (2015) have collected synthesis of several fit measures: An acceptable SRMR value should be below .10, while a good value is less than .05.

Next, we list the minimum set of fit statistics (Kline, 2016, p. 269) that should be reported alongside with the model results.

7.4.1 Measures of absolute fit

Absolute fit measures tell how well the model explains the data (Kline, 2016, p. 266). The do that by determining the degree to which the model predicts the observed correlation matrix (Hair et al., 1995, p. 683).

First, relative \(\chi^2\) is calculated by dividing the \(\chi^2\) with df, resulting in 5.67. Usually, values less than five are considered to be adequate (Marsh & Hocevar, 1985). However, some researchers argue that the value should be less than two (Byrne, 1989).

Second, the RMSEA estimate of 0.102 is in the upper bound of the mediocre fit level (.08 - .10), indicating that the model over simplifies the reality. The upper limit of the 90 per cent confidence interval (0.128) supports the assumption that we have quite poor fitting model here.

Third, SRMR estimate of 0.062 indicates that the discrepancy between the sample and model covariance matrix is not too big (as the value is below .10).

All three absolute fit measures are singing the same song: The model is not poor, but it is (perhaps due to being so simple) not doing best justice to the sample.

7.4.2 Measures of incremental (or relative) fit

Incremental fit measures compare the proposed model (default model) to a baseline model (independence model) that all other models should be expected to exceed. Here we discuss only one measure, Comparative fit index (CFI).

The CFI value of 0.954 is calculated by dividing the relative \(\chi^2\) of the proposed model by the relative \(\chi^2\) of the baseline model and subtracting the output from 1:

CFI = 1 - (\(\chi^2\)_proposed - df_proposed) / (\(\chi^2\)_baseline - df_baseline)
= 1 - (chisq-df)/(baseline.chisq-baseline.df)
= 0.954

CFI value shows that the proposed model has a discrepancy that is 95.37 per cent of the way between the (terribly fitting) independence model (CFI = 0) and the (perfectly fitting) saturated model (CFI = 1).

Quite high incremental fit measure (CFI) value of 0.95 indicates that this model has clearly better fit than the poor fitting baseline (independence) model.

7.5 Report the results

# we report only stat sig results in the text (PSY ~ BUI)
sem_m5_bui_psy_est <-  parameterEstimates(sem_model_fit)$est[9]
sem_m5_bui_psy_se <-  parameterEstimates(sem_model_fit)$se[9]
sem_m5_bui_psy_p <-  parameterEstimates(sem_model_fit)$pvalue[9]

# R Square
sem_m5_psy_r2 <- lavaan::inspect(sem_model_fit,'r2')[8] # use inspect(sem_model_fit,'r2') to see how many columns the R^2 table has (we are interested in the last value for the PSY)

Results showed that Build-up of work requirements was statistically significantly positively related to the Psychic stress of the work, B = 1.223, SE = 0.152, p = 0.000. Encouraging leadership had, as expected, a negative relation to the Psychic stress of the work, but it was not significant. This model (see Figure 5) explained (almost solely by the Build-up of work requirements) 78.49% of the variance of the Psychic stress of the work.

Fit measures are presented in Table 1. The model explained the data quite well (RMSEA = 0.102, C.I. = 0.078 - 0.128; SRMR = 0.062). However, the p-value of \(\chi^2\) test was below the desired .05 value (p = 0.000) and the relative \(\chi^2\) (5.67) was above the desired value of 5. Incremental fit measure (CFI = 0.954) showed that the model is superior compared to a baseline model.

Figure 5. Relationship between Build-up of work requirements (BUI), Encouraging leadership (ENC) and Psychic stress of the work (PSY) (unstandardized estimates).

semPaths(sem_model_fit, 
         layout="tree",
         "par", # unstandardized parameter estimates
         edge.label.cex=0.7,
         edge.width=0.2,
         edge.color="black",
         optimizeLatRes=TRUE,
         label.scale=TRUE,
         normalize=TRUE,
         fade=FALSE,
         as.expression=c("nodes"))

Table 1. Fit measures

## 
## =====================
## Measure        Value 
## ---------------------
## chisq          62.390
## df               11  
## pvalue           0   
## rmsea           .102 
## rmsea.ci.lower  .078 
## rmsea.ci.upper  .128 
## srmr            .062 
## cfi             .954 
## ---------------------

7.6 Can we make the model better?

Modification indices (MI) demonstrate how re-specification of the model may improve the model fit [read more]. However, any modification of the model (according to modification indices) should be in accordance with the theoretical assumptions behind the model (see critical discussion about MI's in Hermida, 2015).

Table below shows the first 15 MI values in descending order.

modindices(sem_model_fit, # or modificationindices(sem_model_fit) 
           sort = TRUE, 
           maximum.number = 15) # show the 15 highest MI values

##    lhs op  rhs     mi     epc sepc.lv sepc.all sepc.nox
## 63 v43 ~~  v44 25.145   0.420   0.420    0.872    0.872
## 39 BUI =~  v45 23.728  -3.294  -2.561   -2.135   -2.135
## 34 ENC =~  v45 23.728  -0.281  -0.302   -0.252   -0.252
## 33 ENC =~  v44 23.728   0.475   0.510    0.411    0.411
## 38 BUI =~  v44 23.728   5.567   4.328    3.486    3.486
## 44 PSY =~  v43 13.345 -12.942 -14.469  -13.323  -13.323
## 43 PSY =~ v42r 13.344  16.226  18.141   15.671   15.671
## 32 ENC =~  v43 13.344   0.213   0.228    0.210    0.210
## 31 ENC =~ v42r 13.344  -0.267  -0.286   -0.247   -0.247
## 64 v43 ~~  v45 13.262  -0.199  -0.199   -0.223   -0.223
## 36 BUI =~   v7  8.710  -0.174  -0.135   -0.126   -0.126
## 57 v17 ~~  v43  8.442   0.103   0.103    0.204    0.204
## 37 BUI =~  v17  7.406   0.173   0.135    0.111    0.111
## 52  v7 ~~ v42r  6.256  -0.087  -0.087   -0.148   -0.148
## 55  v7 ~~  v45  6.073  -0.092  -0.092   -0.134   -0.134

The first row represents a modification (MI = 25.145) that is expected to have the strongest impact on the \(\chi^2\) value of the model. In this case, it is suggested that we add a covariance (~~) between v43 ("My working pace has increased in recent years") and v44 ("I feel that I am beginning to dislike my work"). As we recall from the earlier investigation of the correlation matrix, the correlation between these two variables was quite high (r = 0.483).

Although these two observed variables load on two different latent variables (PSY v44; BUI v43), they are quite close to each other in theoretical sense (as both PSY and BUI measure negative experiences towards work). If this covariance is added to the model specification, previous \(\chi^2\) value of 62.390 would improve approximately by 25.145). On the other hand, as Hermida (2015) argues, one should be critical towards the initial model (and its psychometric properties) and ask "Why the association between these two observed variables was not made visible when specifying the original model?"

For the demonstrational purposes we re-specify the original model according to the highest MI value.

# specify the modified model
sem_model_mod <- '
# Factors
ENC =~ NA*v5 + v7 + 1*v17   # Encouraging leadership
BUI =~ 1*v42r + v43         # Build-up of work requirements
PSY =~ 1*v44 + v45          # Psychic stress of the work
# Regressions
PSY ~ ENC + BUI             # Predict PSY with ENC and BUI
# Covariances
ENC ~~ BUI                  # Let ENC and BUI correlate
v43 ~~ v44                  # Covariance added by modification indice recommendation
'

# estimate the modified model
sem_model_mod_fit <- sem(sem_model_mod, 
                    data=dfttu4,
                    group = NULL, # no grouping variable (e.g., gender, control/intervention,...)
                    cluster = NULL, # no clustering variable (e.g., multilevel data may contain several values for each participant)
                    missing = "fiml.x" # full information maximum likelihood approach
                    )

# produce diagram of the model
semPaths(sem_model_mod_fit,
         layout="tree",
         "std", # standardized parameter estimates
         edge.label.cex=0.7,
         edge.width=0.2,
         edge.color="black",
         optimizeLatRes=TRUE,
         label.scale=TRUE,
         normalize=TRUE,
         fade=FALSE,
         as.expression=c("nodes"))

Figure above shows that now there is a connecting arc between v43 and v44.

Here are the estimates of the modified model:

summary(sem_model_mod_fit, standardized=TRUE, rsq=TRUE) # display R Squared

## lavaan 0.6.14 ended normally after 41 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        25
## 
##   Number of observations                           447
##   Number of missing patterns                        10
## 
## Model Test User Model:
##                                                       
##   Test statistic                                39.661
##   Degrees of freedom                                10
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Observed
##   Observed information based on                Hessian
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ENC =~                                                                
##     v5                0.895    0.047   18.909    0.000    0.961    0.813
##     v7                0.767    0.043   17.722    0.000    0.824    0.770
##     v17               1.000                               1.073    0.885
##   BUI =~                                                                
##     v42r              1.000                               1.042    0.900
##     v43               0.430    0.098    4.397    0.000    0.448    0.414
##   PSY =~                                                                
##     v44               1.000                               1.018    0.820
##     v45               0.706    0.078    9.009    0.000    0.719    0.599
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   PSY ~                                                                 
##     ENC              -0.220    0.057   -3.852    0.000   -0.232   -0.232
##     BUI               0.606    0.143    4.254    0.000    0.621    0.621
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ENC ~~                                                                
##     BUI              -0.345    0.066   -5.252    0.000   -0.308   -0.308
##  .v43 ~~                                                                
##    .v44               0.318    0.064    4.979    0.000    0.318    0.455
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .v5                3.818    0.056   68.096    0.000    3.818    3.229
##    .v7                3.436    0.051   67.754    0.000    3.436    3.210
##    .v17               3.280    0.058   56.980    0.000    3.280    2.705
##    .v42r              2.992    0.055   54.600    0.000    2.992    2.584
##    .v43               3.802    0.052   73.569    0.000    3.802    3.516
##    .v44               3.277    0.059   55.765    0.000    3.277    2.640
##    .v45               2.387    0.057   42.027    0.000    2.387    1.990
##     ENC               0.000                               0.000    0.000
##     BUI               0.000                               0.000    0.000
##    .PSY               0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .v5                0.475    0.047   10.100    0.000    0.475    0.340
##    .v7                0.467    0.041   11.361    0.000    0.467    0.408
##    .v17               0.318    0.047    6.708    0.000    0.318    0.217
##    .v42r              0.255    0.221    1.151    0.250    0.255    0.190
##    .v43               0.969    0.077   12.640    0.000    0.969    0.829
##    .v44               0.504    0.104    4.838    0.000    0.504    0.327
##    .v45               0.923    0.080   11.592    0.000    0.923    0.641
##     ENC               1.152    0.105   10.959    0.000    1.000    1.000
##     BUI               1.086    0.238    4.570    0.000    1.000    1.000
##    .PSY               0.489    0.115    4.239    0.000    0.472    0.472
## 
## R-Square:
##                    Estimate
##     v5                0.660
##     v7                0.592
##     v17               0.783
##     v42r              0.810
##     v43               0.171
##     v44               0.673
##     v45               0.359
##     PSY               0.528

We can see that the \(\chi^2\) ("Test statistic") value has become smaller and the df has also dropped by one (as we asked to estimate one additional covariance). Interestingly, the overall explanation of the PSY variance (R Square) has after the modification dropped from the original model's (quite high) value of 0.78 down to 0.53.

Here is the comparison of the regressions of the original and modified models:

std.pars.regressions <- standardizedSolution(sem_model_fit)[ standardizedSolution(sem_model_fit)[,'op']=='~', c(3,4,5,7)]
std.pars.regressions.mod <- standardizedSolution(sem_model_mod_fit)[ standardizedSolution(sem_model_mod_fit)[,'op']=='~', c(3,4,5,7)]

joint.pars.regressions.comparison <- cbind(std.pars.regressions, std.pars.regressions.mod[2:4]) # pick three columns from mod regressions
names(joint.pars.regressions.comparison)[2] <- "est.orig"
names(joint.pars.regressions.comparison)[3] <- "se.orig"
names(joint.pars.regressions.comparison)[4] <- "p.orig"
names(joint.pars.regressions.comparison)[5] <- "est.mod"
names(joint.pars.regressions.comparison)[6] <- "se.mod"
names(joint.pars.regressions.comparison)[7] <- "p.mod"
stargazer(joint.pars.regressions.comparison, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Predicting PSY (std regressions)')

## 
## Predicting PSY (std regressions)
## ================================================
## rhs est.orig se.orig p.orig est.mod se.mod p.mod
## ------------------------------------------------
## ENC  -.100    .062    .106   -.232   .064  .0003
## BUI   .850    .058     0     .621    .079    0  
## ------------------------------------------------

We see that the regression on ENC has become statistically significant (p.mod column) as we have included a covariance between variables v43 and v44.

Here is the comparison of the fit indices of the original ("Value") and modified ("Value.mod") models:

sem.fits.mod <- as.numeric(fitMeasures(sem_model_mod_fit, c("chisq",           # sem.fits.table1[1]
                                                           "df",               # sem.fits.table1[2]
                                                           "pvalue",           # sem.fits.table1[3] 
                                                           "rmsea",            # sem.fits.table1[4]
                                                           "rmsea.ci.lower",   # sem.fits.table1[5]
                                                           "rmsea.ci.upper",   # sem.fits.table1[6]
                                                           "srmr",             # sem.fits.table1[7]
                                                           "cfi"               # sem.fits.table1[8] 
                                                           )))

df.fit.measures.mod <- data.frame(fit.names.table1,sem.fits.mod)
names(df.fit.measures.mod)[2] <- "Value.mod"

joint.fit.comparison <- cbind(df.fit.measures.table1, df.fit.measures.mod[2]) 
stargazer(joint.fit.comparison, summary=FALSE, type='text', rownames=FALSE, initial.zero=FALSE, digits=3, title='Comparison of fit indices')

## 
## Comparison of fit indices
## ===============================
## Measure        Value  Value.mod
## -------------------------------
## chisq          62.390  39.661  
## df               11      10    
## pvalue           0     .00002  
## rmsea           .102    .081   
## rmsea.ci.lower  .078    .056   
## rmsea.ci.upper  .128    .109   
## srmr            .062    .043   
## cfi             .954    .973   
## -------------------------------

As the table above shows, the fit indices of the modified model are considerably better:

chisq/df ratio is now below the suggested value of 5 (39.66 / 10 = 4.0)
RMSEA and SRMR are now smaller (as desired)
CFI is closer to desired value of 1

Exercise 4

The data frame contains also the fourth factor (latent variable), "Know-how rewarding" (REW). It is measured with two variables:
v13 The organization rewards its employees' professional knowledge and skills.
v14 Employees with increased knowledge are given extra responsibility.

Add "Know-how rewarding" (REW) to the previous model as an exogenous variable (alongside with ENC and BUI). Produce standardized results in a similar manner than in section "7.5 Report the results". That is, write text describing findings and produce figure and table similar to Figure 5 and Table 1.

References

Beaujean, A. A. (2014). Latent Variable Modeling Using R. A Step-by-Step Guide. Routledge.

Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606.

Bollen, K. (1989). Structural Equations with Latent Variables. New York: John Wiley & Sons.

Cangur, S., & Ercan, I. (2015). Comparison of Model Fit Indices Used in Structural Equation Modeling Under Multivariate Normality. Journal of Modern Applied Statistical Methods, 14(1), Article 14. https://digitalcommons.wayne.edu/jmasm/vol14/iss1/14/

Hair, J. F., Anderson, R. E., Tatham R. L., & Black, W. C. (1995). Multivariate Data Analysis. Fourth edition. Prentice Hall.

Hermida, R. (2015). The problem of allowing correlated errors in structural equation modeling: concerns and considerations. Computational Methods in Social Sciences, 3(1), 5-17.

Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.

Kaplan, D. (2000). Structural Equation Modeling. Thousand Oaks: Sage.

Kline, R. B. (2016). Principles and practice of structural equation modeling. Fourth edition. Guilford Press.

Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of self-concept: First- and higher-order factor models and their invariance across groups. Psychological Bulletin, 97, 562–582.

Nokelainen, P., & Ruohotie, P. (2009). Non-linear Modeling of Growth Prerequisites in a Finnish Polytechnic Institution of Higher Education. Journal of Workplace Learning, 21(1), 36-57.

Raykov, T., & Marcoulides, G. A. (2000). A First Course in Structural Equation Modeling. Lawrence Erlbaum Associates.

Rosseel, Y. (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. http://www.jstatsoft.org/v48/i02/

Schumacker, R. E., & Lomax, R. G. (2004). A Beginner's Guide to Structural Equation Modeling. Second edition. Lawrence Erlbaum Associates.

Suhr, D. (2006). [The Basics of Structural Equation Modeling]((https://www.lexjansen.com/wuss/2006/tutorials/TUT-Suhr.pdf).