In this script we demonstrate how to perform multiple imputation for \(\textsf{RSiena}\) as described in Krause, Huisman and Snijders, ‘Multiple imputation for longitudinal network data’, 2018. The paper is also available on the \(\textsf{RSiena}\) website and ResearchGate. We ask the reader to read the paper first; here we only demonstrate how to perform the imputations. Explanations of and guidelines for the procedure can be found in the paper.
Running this script as is can be quite time-consuming. A major issue is deciding on the number of imputations. Usual values are between 20 and 50, depending on the amount of missing data. In this script we use D = 20, but in practice you might need more; in the paper D = 50 was more appropriate. To get a feeling for how the procedure works, you might decrease this to D = 5 to reduce computation time.
In the paper we present two methods for imputation of the first wave:
Multiple Imputation using Stationary SAOMs
Multiple Imputation using Bayesian ERGMs
We will detail both methods in this script.
The following packages are required before we start the procedure: \(\textsf{RSiena}\) and \(\textsf{Bergm}\).
Therefore, if necessary:
install.packages("RSiena", repos = "http://cran.us.r-project.org")
install.packages("Bergm", repos = "http://cran.us.r-project.org")
This procedure requires \(\textsf{RSiena}\) or \(\textsf{RSienaTest}\) version 1.2-12 or higher, and \(\textsf{Bergm}\) version 4.2.0 or higher.
library("RSiena")
library("Bergm")
packageVersion("RSiena")
## [1] '1.2.12'
packageVersion("Bergm")
## [1] '4.2.0'
Before we start, we create some missing data in the s50 data set, which is included in the \(\textsf{RSiena}\) package. See ?s50.
Select the data internal to \(\textsf{RSiena}\)
s501miss <- s501
s502miss <- s502
s503miss <- s503
alc <- s50a
Arbitrarily create some missing data
s501miss[c(5,13,21,28,30,31,39,42,43,48),] <- NA
s502miss[c(11,13,15,17,33,34,37,41,45,46),] <- NA
s503miss[c(1,8,14,19,20,26,35,39,42,44),] <- NA
We will define the following siena07ToConvergence() function to (hopefully) ensure convergence. This function runs a given SAOM until the algorithm has converged. A circuit breaker is applied after 20 runs, or when the divergence is seen as too large. You may use any other specification of siena07() to get converged estimates. See also Section 6.2 of the manual.
siena07ToConvergence <- function(alg, dat, eff, ans0 = NULL, threshold = 0.25,
                                 nodes = 1, ...) {
  # Parameters:
  # alg, dat, eff: arguments for siena07: algorithm, data, effects object.
  # ans0: previous answer, if available; used as prevAns in siena07.
  # threshold: largest satisfactory value for the overall maximum
  #            convergence ratio (indicating convergence).
  # nodes: number of processes for parallel processing.
  if (!is.null(ans0)) {
    alg$nsub <- 6
  }
  numr <- 0 # number of repetitions
  ans <- siena07(alg, data = dat, effects = eff, prevAns = ans0,
                 nbrNodes = nodes, returnDeps = TRUE,
                 useCluster = (nodes >= 2), ...) # the first run
  repeat {
    save(ans, file = paste("ans", numr, ".RData", sep = "")) # to be safe
    numr <- numr + 1        # count number of repeated runs
    tm <- ans$tconv.max     # convergence indicator
    cat(numr, tm, "\n")     # report how far we are
    if (tm < threshold) {break}  # success
    if (tm > 10) {break}         # divergence without much hope
                                 # of returning to good parameter values
    if (numr > 20) {break}       # now it has lasted too long
    alg$nsub <- 1
    alg$n2start <- 2 * (sum(eff$include) + 7) * 2.52^4
    alg$n3 <- 3000
    if (numr == 1) {alg$firstg <- alg$firstg/5}
    ans <- siena07(alg, data = dat, effects = eff, prevAns = ans,
                   nbrNodes = nodes, returnDeps = TRUE,
                   useCluster = (nodes >= 2), ...)
  }
  if (tm > threshold) {
    stop("Convergence inadequate.\n")
  }
  ans
}
Since version 1.2-12, Maximum Likelihood (ML) estimation by \(\textsf{RSiena}\) with returnDeps = TRUE returns an edgelist of the final network at the end of the Phase 3 simulation. The following function, getNet(), uses this edgelist to impute the data.
getNet <- function(observedNet, edgeList) {
  # observedNet: observed network as adjacency matrix with missing data
  # edgeList: edgelist as returned by siena07(...)$sims
  observedNet[is.na(observedNet)] <- 0
  for (i in 1:nrow(edgeList)) {
    observedNet[edgeList[i, 1], edgeList[i, 2]] <- 1
  }
  return(observedNet)
}
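To see what getNet() does, here is a minimal illustration with hypothetical toy data (not part of the original script): missing entries are first set to 0, and every tie in the simulated edgelist is then set to 1.
# toy example (illustrative only): a 3-actor network with two missing entries
toyObs <- matrix(c(0,  1, NA,
                   0,  0,  1,
                   NA, 0,  0), 3, 3, byrow = TRUE)
toyEdges <- rbind(c(1, 2), c(2, 3), c(3, 1)) # ties in the simulated network
getNet(toyObs, toyEdges)
# the NA at (3,1) is imputed as 1 (it appears in the edgelist),
# the NA at (1,3) as 0 (it does not)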
It is important that the imputation is done with a well-fitting model that includes at least all parameters also included in the final model. For evaluating the fit of \(\textsf{RSiena}\) models, please see sienaGOF.R.
First, we create lists to store the imputed networks in, and specify the number of imputations. For time reasons we set it to only D = 5 imputations. For a proper analysis this is, however, not enough!
wave1imp <- list()
D <- 5
We start the stationary SAOM imputation like a usual \(\textsf{RSiena}\) analysis by defining our data and effects objects. It is important to add wave 2 as a predictor.
friends <- sienaDependent(array(c(s501miss, s501miss), dim = c( 50, 50, 2)) ,
allowOnly = FALSE)
alco <- coCovar(s50a[,1])
w2 <- coDyadCovar(s502miss)
Data.stationary <- sienaDataCreate(friends,alco,w2)
effects.stationary <- getEffects(Data.stationary)
effects.stationary <- includeEffects(effects.stationary, density, recip,
outActSqrt, inPopSqrt, gwespFF, gwespBB)
effects.stationary <- includeEffects(effects.stationary, egoX, altX, simX,
interaction1 = "alco")
effects.stationary <- includeEffects(effects.stationary, X, name = "friends",
interaction1 = "w2")
## effectName include fix test initialValue parm
## 1 outdegree (density) TRUE FALSE FALSE 0 0
## 2 reciprocity TRUE FALSE FALSE 0 0
## 3 GWESP I -> K -> J (#) TRUE FALSE FALSE 0 69
## 4 GWESP I <- K <- J (#) TRUE FALSE FALSE 0 69
## 5 indegree - popularity (sqrt) TRUE FALSE FALSE 0 0
## 6 outdegree - activity (sqrt) TRUE FALSE FALSE 0 1
## effectName include fix test initialValue parm
## 1 alco alter TRUE FALSE FALSE 0 0
## 2 alco ego TRUE FALSE FALSE 0 0
## 3 alco similarity TRUE FALSE FALSE 0 0
## effectName include fix test initialValue parm
## 1 w2 TRUE FALSE FALSE 0 0
\(\textsf{RSiena}\) was not designed to estimate stationary models; it assumes that there is some change in the data. However, we can give the same network as the start and end of our period and tell \(\textsf{RSiena}\) that the rate function is to be fixed at a large value, say 20. Please see Tom’s and Christian’s work on stationary SAOMs.
effects.stationary <- setEffect(effects.stationary, Rate, initialValue = 20,
name = "friends", fix = TRUE, type = "rate")
## effectName include fix test initialValue parm
## 1 basic rate parameter friends TRUE TRUE FALSE 20 0
Now we can estimate the stationary SAOM using Method of Moments (MoM) estimation. One converged estimate will suffice.
estimation.options <- sienaAlgorithmCreate(useStdInits = FALSE,
seed = 2214,
n3 = 1000, maxlike = FALSE,
cond = FALSE)
period0saom <- siena07ToConvergence(alg = estimation.options,
dat = Data.stationary,
eff = effects.stationary, threshold = 0.25)
## 1 2.071337
## 2 0.253957
## 3 0.1233461
After obtaining one converged model, we change the \(\textsf{RSiena}\) algorithm to imputation. This is achieved by the following settings:
imputation.options <- sienaAlgorithmCreate(seed = 13848,
useStdInits = FALSE,
maxlike = TRUE,
cond = FALSE,
nsub = 0,
simOnly = TRUE,
n3 = 10)
All that is left to do now is to obtain the D imputations. One minor caveat remains: the Maximum Likelihood (ML) simulation will only start if there is at least one tie different between the two observed waves. We achieve this by setting one random tie in the “starting network” from 1 to 0. The ML algorithm guarantees that this tie will have been changed back to 1 by the end of the simulation.
Further, to reduce random changes in our imputation we set all observed ties (except the changed one) to structurally fixed values (1 \(\rightarrow\) 11, 0 \(\rightarrow\) 10). This way no redundant changes are simulated on the observed values.
Both of these steps are only done for the stationary imputation and will not be repeated for the imputation of later waves.
set.seed(142)
for (i in 1:D) {
  cat('imputation', i, '\n')

  # code observed ties as structural values: 1 -> 11, 0 -> 10
  n1 <- s501miss
  n1 <- n1 + 10
  diag(n1) <- 0
  n2 <- n1

  # sample one observed tie and set it to 0 in the starting network
  # and to a free (non-structural) 1 in the target network
  tieList <- c(1:(nrow(n1)^2))[c(n1 == 11)]
  tieList <- tieList[!is.na(tieList)]
  changedTie <- sample(tieList, 1)
  n1[changedTie] <- 0
  n2[changedTie] <- 1

  friends <- sienaDependent(array(c(n1, n2), dim = c(50, 50, 2)),
                            allowOnly = FALSE)
  Data.stationary <- sienaDataCreate(friends, alco, w2)

  # simulate with ML and impute from the last (10th) simulated network
  sims <- siena07(imputation.options, data = Data.stationary,
                  effects = effects.stationary,
                  prevAns = period0saom,
                  returnDeps = TRUE)$sims[[10]][[1]]
  wave1imp[[i]] <- getNet(s501miss, sims)
}
## imputation 1
## imputation 2
## imputation 3
## imputation 4
## imputation 5
Alternatively, one can draw all imputations from one single ML run. However, these simulations would then all use the same forced tie change on the observed data. I have not investigated whether this has a meaningful impact.
imputation.options <- sienaAlgorithmCreate(seed = 13848,
useStdInits = FALSE,
maxlike = TRUE,
cond = FALSE,
nsub = 0,
simOnly = TRUE,
n3 = 10*D)
sim.all <- siena07(imputation.options, data = Data.stationary,
effects = effects.stationary,
prevAns = period0saom,
returnDeps = TRUE)
for (i in (1:D)) {
wave1imp[[i]] <- getNet(s501miss, sim.all$sims[[10*i]][[1]])
}
Now we have D imputed networks stored in wave1imp.
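As a quick optional sanity check (not part of the original script), we can verify that no missing entries remain and see how the imputed networks differ:
# each imputed network should contain no NAs;
# tie counts will typically vary somewhat across the D imputations
sapply(wave1imp, function(net) sum(is.na(net)))
sapply(wave1imp, sum)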
The missing data augmentation using Bayesian ERGMs to impute the first wave was introduced by Johan Koskinen and colleagues.
\(\textsf{RSiena}\) automatically centers covariates; for \(\textsf{Bergm}\) we have to do this manually.
alcCent <- as.vector(alc[,1] - mean(alc[,1]))
Further, we make s501miss a network object so that we can add nodal attributes.
s501missNetwork <- as.network(s501miss)
set.vertex.attribute(s501missNetwork, 'alc', alcCent)
As in the stationary SAOM, we would like to add the second wave as a predictor. Unfortunately, the \(\textsf{ergm}\) package (and thus \(\textsf{Bergm}\)) does not allow missing data in covariates. Thus we simply impute the missing data in the second wave ad hoc with zeroes.
s502missI <- s502miss
s502missI[is.na(s502missI)] <- 0
Bayesian ERGMs are, well, Bayesian; thus they benefit from prior knowledge about the model family and the data. We can express this by setting our priors. We keep the priors for the means at the default (0 for every parameter), but we set priors for the variances. In ERGM-family models the parameters can be expected to fall in the range of +/- 10, so setting the prior variance to, say, 9 is reasonable.
priorVar <- diag(9,10,10)
Now we can start. We use the missBergm() function to impute the missing data, retaining the imputed data by setting nImp = D.
Please be advised that running missBergm() can take a very long time.
bergmImputation <- missBergm(s501missNetwork ~ edges
+ mutual
+ gwidegree(log(2), fixed = TRUE, cutoff = 30)
+ gwodegree(log(2), fixed = TRUE, cutoff = 6)
+ gwesp(decay = log(2), fixed = TRUE)
+ gwdsp(decay = log(2), fixed = TRUE)
+ edgecov(s502missI)
+ absdiff('alc')
+ nodeicov('alc')
+ nodeocov('alc'),
burn.in = 3000,
main.iters = D*100,
nchains = 5,
nImp = D,
gamma = .3,
prior.sigma = priorVar,
seed = 142)
Convergence of the algorithm can be assessed by looking at:
bergm.output(bergmImputation)
You can read more about this output by requesting ?bergm.output. The first figure shows the posterior probability density distribution and should look like a normal distribution. The second figure shows the estimated parameters at each iteration; a converged algorithm will resemble the picture above, mixing well through the space without any meaningful slope. The last plot shows the auto-correlation between the estimated parameters over the iterations; ideally this reduces to zero. It is advised to set main.iters to a number D times as large as the value at which the auto-correlation is lowest. missBergm automatically spaces the D imputations as much as possible over the main.iters iterations. Thus, setting main.iters = D * X spaces the sampling of imputed networks by X.
Now we need to extract the imputed networks from the bergmImputation object:
wave1imp <- list()
for (i in 1:D) {
wave1imp[[i]] <- as.matrix.network(bergmImputation$impNets[[i]])
}
Again, we start by preparing some lists to save the networks in:
wave2imp <- list()
wave3imp <- list()
The procedure goes wave by wave and is straightforward for \(\textsf{RSiena}\) veterans, like you. We first define the data and effects objects as usual, using one imputed network for wave 1 and the observed wave 2, with its missing data, as the end network. Then we estimate using MoM. After we have a converged model, we use the estimated parameters in an ML simulation to retain one imputation of wave 2. This imputation is passed on to the next period as the starting network: we define a new data object with the imputed wave 2 as the starting network and the observed wave 3, with its missing data, as the target. We again estimate with MoM and then impute with ML, retaining one imputation. This process is repeated D times, starting once with each of the previously obtained first-wave imputations.
set.seed(1307)
for (i in 1:D) {
cat('imputation',i,'\n')
# now impute wave2
friends <- sienaDependent(array(c(wave1imp[[i]],s502miss),
dim = c(50,50,2)))
alco <- sienaDependent((as.matrix(alc[, 1:2])), type = "behavior")
Data.w2 <- sienaDataCreate(friends, alco)
effects.twoWaves <- getEffects(Data.w2)
effects.twoWaves <- includeEffects(effects.twoWaves, density, recip,
outActSqrt, inPopSqrt, gwespFF, gwespBB)
effects.twoWaves <- includeEffects(effects.twoWaves, egoX, altX, simX,
interaction1 = "alco")
effects.twoWaves <- includeEffects(effects.twoWaves, avSim, name = 'alco',
interaction1 = "friends")
if (i == 1) {
period1saom <- siena07ToConvergence(alg = estimation.options,
dat = Data.w2,
eff = effects.twoWaves,
threshold = 0.25)
} else {
period1saom <- siena07ToConvergence(alg = estimation.options,
dat = Data.w2,
eff = effects.twoWaves,
threshold = 0.25,
ans0 = period1saom)
}
sims <- siena07(imputation.options, data = Data.w2,
effects = effects.twoWaves,
prevAns = period1saom,
returnDeps = TRUE)$sims[[10]][[1]]
wave2imp[[i]] <- getNet(s502miss,sims)
# impute wave 3
friends <- sienaDependent(array( c(wave2imp[[i]], s503miss),
dim = c( 50, 50, 2)))
alco <- sienaDependent((as.matrix(alc[, 2:3])), type = "behavior")
Data.w3 <- sienaDataCreate(friends, alco)
if (i == 1) {
period2saom <- siena07ToConvergence(alg = estimation.options,
dat = Data.w3,
eff = effects.twoWaves,
threshold = 0.25)
} else {
period2saom <- siena07ToConvergence(alg = estimation.options,
dat = Data.w3,
eff = effects.twoWaves,
threshold = 0.25,
ans0 = period2saom)
}
sims <- siena07(imputation.options, data = Data.w3,
effects = effects.twoWaves,
prevAns = period2saom,
returnDeps = TRUE)$sims[[10]][[1]]
wave3imp[[i]] <- getNet(s503miss, sims)
save.image('mi.RData')
}
The save.image('mi.RData') is inside the loop to save the data after every completed imputation. This way we are able to restart the imputation process at the current i in case of a crash. If we do so, we should set the random number seed again, set.seed(####), and note down the new seed and the i at which we restarted. This way we ensure that we will later be able to obtain the exact same results again.
Keeping save.image() outside the loop is more time-efficient, especially in larger data sets where save.image() can take several seconds to back up your data. Personally, I prefer the security of not having to restart from scratch in case of a crash over some minutes (or maybe an hour?) of extra computation time.
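A minimal sketch of such a restart (assuming the loop above crashed partway through; the seed value below is arbitrary and should be noted down together with the restart point):
# illustrative restart after a crash; wave3imp holds the imputations
# that were completed before the crash
load('mi.RData')
done <- length(wave3imp)  # number of fully completed imputations
set.seed(65823)           # new seed: note it down together with 'done'
if (done < D) {
  for (i in (done + 1):D) {
    # ... rerun the same loop body as above ...
  }
}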
Now that we have obtained our imputed data sets, we need to run our model on the D complete data sets. This works like any normal \(\textsf{RSiena}\) analysis, except that we run it in a loop, once for each of the D imputed data sets, and save the results in a list.
saomResults <- list()
for (i in 1:D) {
cat('Imputation',i,'\n')
friends <- sienaDependent(array(c(wave1imp[[i]], wave2imp[[i]],
wave3imp[[i]]), dim = c( 50, 50, 3)))
alco <- sienaDependent((as.matrix(alc)), type = "behavior")
Data <- sienaDataCreate(friends, alco)
effectsData <- getEffects(Data)
effectsData <- includeEffects(effectsData, density, recip, outActSqrt,
inPopSqrt, gwespFF, gwespBB)
effectsData <- includeEffects(effectsData, egoX, altX, simX,
interaction1 = "alco")
effectsData <- includeEffects(effectsData, avSim, name = 'alco',
interaction1 = "friends")
estimation.options <- sienaAlgorithmCreate(useStdInits = FALSE, seed = 312,
n3 = 1000, maxlike = FALSE,
lessMem = TRUE)
if (i == 1) {
saomResults[[i]] <- siena07ToConvergence(alg = estimation.options,
dat = Data, eff = effectsData,
threshold = 0.25,
nodes = 2)
} else {
saomResults[[i]] <- siena07ToConvergence(alg = estimation.options,
dat = Data, eff = effectsData,
threshold = 0.25,
ans0 = saomResults[[i - 1]],
nodes = 2)
}
save.image('mi.RData')
}
Now we have D \(\textsf{RSiena}\) results, and all that is left is to combine them. We need to extract all parameter and standard error estimates from the models and combine them using Rubin’s rules.
Rubin’s rules for combining results include combining parameter estimates and covariances. Let \(\hat{\gamma}_d\) denote the \(d\)th estimate of the parameter \(\gamma\) and \(W_d= \mbox{cov}(\hat{\gamma}_d \, | \, x_d)\) the (within-imputation) covariance matrix of the parameters of data set \(x_d\). The combined estimate for the parameters is the average of the estimates of the \(D\) analyses:
\(\bar{\gamma}_D = \frac{1}{D} \sum_{d=1}^D \hat{\gamma}_d.\)
Obtaining the proper standard errors is a bit less straightforward. The combined estimate for the standard error needs to take into account the variance within and between imputations. It requires the average within-imputation covariance matrix \(\bar{W}_D\) and the between-imputation covariance matrix \(B_D\). The average within-imputation covariance matrix is given by
\(\bar{W}_D = \frac{1}{D} \sum_{d=1}^D W_d\)
and the between covariance matrix by
\(B_D = \frac{1}{D-1} \sum_{d=1}^D (\hat{\gamma}_d - \bar{\gamma}_D)(\hat{\gamma}_d - \bar{\gamma}_D)'.\)
The total variability for \(\bar{\gamma}_D\) is estimated by
\(T_D = \hat{\mbox{cov}}(\bar{\gamma}_D) = \bar{W}_D + \left(1 + \frac{1}{D} \right) B_D.\)
The standard errors for the parameters are given by the square roots of the diagonal elements of \(T_D\).
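As a toy numerical check of these formulas (made-up numbers, a single parameter, D = 3):
# hypothetical estimates and within-imputation standard errors for D = 3
gammas <- c(0.9, 1.1, 1.0)
ses    <- c(0.30, 0.28, 0.31)
gammaBar <- mean(gammas)    # combined estimate
WD <- mean(ses^2)           # average within-imputation variance
BD <- var(gammas)           # between-imputation variance
TD <- WD + (1 + 1/3) * BD   # total variability
sqrt(TD)                    # combined standard error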
To obtain this in R, a function for the row variance will be helpful:
rowVar <- function(x) {
rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
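As a quick illustrative check, rowVar() should agree with applying var() over the rows of a matrix:
# sanity check on random data: rowVar() matches apply(..., 1, var)
m <- matrix(rnorm(12), nrow = 3)
all.equal(rowVar(m), apply(m, 1, var))  # TRUE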
How many parameters do we have?
npar <- sum(effectsData$include)
We start by creating a data frame for our results and fill it with the D estimated parameters and standard errors.
MIResults <- as.data.frame(matrix(NA, npar, 2 * D))
for (i in 1:D) {
names(MIResults)[i * 2 - 1] <- paste("imp" , "mean", sep = as.character(i))
names(MIResults)[i * 2] <- paste("imp" , "se", sep = as.character(i))
MIResults[,i * 2 - 1] <- saomResults[[i]]$theta
MIResults[,i * 2] <- sqrt(diag(saomResults[[i]]$covtheta))
}
Now we get the average covariance structure between the parameters.
WDMIs <- matrix(0,npar,npar)
for (i in 1:D) {
WDMIs <- WDMIs + saomResults[[i]]$covtheta
}
WDMIs <- (1/D) * WDMIs
Using Rubin’s rules, we combine the parameters and standard errors and complete the procedure.
finalResults <- as.data.frame(matrix(NA, npar, 2))
names(finalResults) <- c("combinedEstimate", "combinedSE")
rownames(finalResults) <- effectsData$effectName[effectsData$include]
finalResults$combinedEstimate <- rowMeans(MIResults[,seq(1,2*D,2)])
finalResults$combinedSE <- sqrt(diag(WDMIs) + ((D + 1)/D) *
rowVar(MIResults[,seq(1,2*D,2)]))
library(knitr)  # provides kable() for formatted tables
kable(round(finalResults, 3))
|                                  | combinedEstimate | combinedSE |
|----------------------------------|------------------|------------|
| constant friends rate (period 1) | 5.946 | 1.188 |
| constant friends rate (period 2) | 6.651 | 1.722 |
| outdegree (density) | -0.090 | 0.652 |
| reciprocity | 2.506 | 0.310 |
| GWESP I -> K -> J (#) | 2.281 | 0.317 |
| GWESP I <- K <- J (#) | -0.062 | 0.287 |
| indegree - popularity (sqrt) | -0.637 | 0.266 |
| outdegree - activity (sqrt) | -0.842 | 0.270 |
| alco alter | -0.090 | 0.139 |
| alco ego | 0.231 | 0.178 |
| alco similarity | 1.201 | 0.620 |
| rate alco (period 1) | 1.294 | 0.359 |
| rate alco (period 2) | 1.802 | 0.461 |
| alco linear shape | 0.363 | 0.166 |
| alco quadratic shape | -0.097 | 0.103 |
| alco average similarity | 2.761 | 1.829 |
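If desired, rough t-ratios (combined estimate divided by combined standard error) can be computed from these results; this simple normal-approximation check is not part of the original script:
# approximate t-ratios from the combined results
finalResults$tratio <- finalResults$combinedEstimate / finalResults$combinedSE
round(finalResults, 3)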
And let us not forget to save…
save.image('mi.RData')