Last updated 2024-06-24
This tutorial is a joint product of the Statnet Development Team:
Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender de-Moll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)
The network modeling software demonstrated in this tutorial is
authored by Pavel Krivitsky (ergm
) and Carter Butts
(network
and sna
).
All Statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org
Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserve.
Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.
Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).
For all other issues, please email us at contact@statnet.org.
This workshop and tutorial provide an introduction to statistical
modeling of network data with exponential-family random graph
models (ERGMs) using statnet
software. This online
tutorial is also designed for self-study, with example code and
self-contained data. The statnet
packages we will be
demonstrating are:
network
— storage and manipulation of network dataergm
— statistical tools for estimating ERGMs, model
assessment, and network simulation.The ergm
package has more advanced functionality that is
not covered in this workshop. An overview can be found in this preprint.
This workshop assumes basic familiarity with R; experience with network concepts, terminology and data; and familiarity with the general framework for statistical modeling and inference. While previous experience with ERGMs is not required, some of the topics covered here may be difficult to understand without a strong background in linear and generalized linear models in statistics.
Minimally, you will need to install the latest version of
R (available
here) and the statnet
packages ergm
and
network
to run the code presented here (ergm
will automatically install network
when it is
loaded).
The workshops are conducted using the free version of
Rstudio
(available
here).
The full set of installation intructions with details can be found on
the statnet
workshop wiki.
If you have not already downloaded the statnet
packages
for this workshop, the quickest way to install these (and the other most
commonly used packages from the statnet
suite), is to open
an R session and type:
Loading required package: network
'network' 1.17.2 (2022-05-20), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4.2.2 (2022-06-01), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
You can check the version number with:
[1] '4.2.2'
Throughout, we will set a random seed via set.seed()
for
commands in tutorial that require simulating random values—this is not
necessary, but it ensures that you will get the same results as this
tutorial (assuming that you are using the same ergm
version
or at least a version in which the algorithms you are using have not
changed).
Here we provide only a brief overview of the modeling framework, as
the primary purpose of this tutorial is to show how to implement
statistical analysis of network data with ERGMs using the
statnet
software tools, rather than to explain the
framework in detail. For more details, and to really understand ERGMs,
please see the references at
the end of this tutorial.
Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:
Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;
Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;
Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and
Simulate new networks from the underlying probability distribution implied by the fitted model.
ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is
\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]
where
\(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,
\(g(y)\) is a vector of model statistics for network \(y\),
\(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))
\(\theta\) is the vector of coefficients for those statistics, and
\(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.
In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).
If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:
\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.
\(h(y)\) is important for specifying model behavior for valued ERGMs, for controlling for network size in extrapolative or multi-network settings, and other applications. When modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the counting measure). We will adopt that convention here. Under the counting measure, the baseline behavior of an ERGM approaches a uniform random graph when \(\theta \to 0\), and the ERGM terms serve to modify this baseline distribution. We will see many examples below.
The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models, but they are different in one key respect:
These \(g(y)\) statistics are functions of the network itself — each represents the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.
As a result, every term in an ERGM must have an associated algorithm
for computing its value for your network. The ergm
package
in statnet
includes about 150 term-computing algorithms,
and you can code up your own with our helper package. We will explore
some of the most commonly used terms in this tutorial, and links to more
information are provided in section
3.
Terms can be grouped in different ways, like the type of network they can be used on (e.g., binary, bipartite, directed, etc.), the type of covariate they represent (continuous, dyad-dependent, etc.) and their purpose (term operator, soft constraint, etc). In Statnet, these groupings are called “keywords”, and they provide a useful way to quickly find help for terms used in different contexts.
One key property of model terms is worth keeping in mind: terms are either dyad independent or dyad dependent.
Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. Dyad dependent terms have very different effects, and much of what is different about network models comes from these terms. They introduce complex cascading effects that can often lead to counter-intuitive and highly non-linear outcomes. In addition, a model with at least one dyad dependent term requires a different estimation algorithm, so when we use these terms below you will see some different components in the output.
There are several resources in the ergm
package that
provide a range of help and instructions for using these terms:
The main ergm terms reference page
Accessed via:
?ergmTerms
This contains an overview of model specification, some links to additional references and a complete up-to-date list of available terms with short descriptions and a link to the man page for that term.
Man pages for specific terms
If you know the name of the term, and are looking for help with syntax, you can type either
`help("[name]-ergmTerm")`
or the shorthand version
`ergmTerm?[name]`
where [name]
is the name of the term.
Keywords
Keywords are useful for narrowing down your search to find terms that have specific functionality. Available keywords and their meanings can be obtained by typing
?ergmKeyword`
You can search for terms with keywords, for example:
Found 11 matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
Alternating k-star
gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the first mode in a bipartite network
gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition
gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the second mode in a bipartite network
gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition
gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution
gwdsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution
gwesp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted edgewise shared partner distribution
gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted in-degree distribution
gwnsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted nonedgewise shared partner distribution
gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted out-degree distribution
Cross-reference guide
This guide is a good place to find a tables that summarize the keywords for terms (with tables for commonly used terms, operator terms, and all terms), definitions for all terms, and a list of keywords with the terms they refer to.
vignette("ergm-term-crossRef")
Note: When using RStudio, it is possible to press the tab key after
starting a line with ?ergm
to view the wide range of
possible help options beginning with the letters ergm
.
The ERGM expression for the probability of the entire graph that was shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors:
\[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where
\(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and
\(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\) except for \(y_{ij}\).
\(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So
\[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \]
where
\(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and
\(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.
So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.
In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.
We will see exactly how this works in the sections that follow.
Network data can come in many different forms — ties can be stored as
edgelists or sociomatrices in .csv
files, or as exported
data from other programs like Pajek. Attributes for the nodes, ties, and
dyads can also come in various forms. All can be read into R
using either standard R tools (e.g., for .csv
files), or
methods from the network
package. For more information,
refer to the following:
However you read them in, the data will need to be transformed into a
network
object, the format that Statnet packages use to
store and work with network data. For information on how to do this,
refer to:
The ergm
package also contains several network data
sets, and we will use those here for demonstration purposes.
We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.
set.seed(1) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print(flomarriage): Examine properties
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 20
missing edges= 0
non-missing edges= 20
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area
# Plot the network, saving the coordinates for the next plot
coords <- plot(flomarriage,
main="Florentine Marriage",
cex.main=0.8,
label = network.vertex.names(flomarriage),
label.cex=0.4,
pad=3) # Equivalent to plot.network(...)
wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes, equivalent to get.vertex.attribute(flomarriage, "wealth")
wealth
[1] 10 36 55 44 20 32 8 42 103 48 49 3 27 10 146 48
ergm
and summary
functionsThese are the two most commonly used functions in ergm.
ergm
is the function you use to specify your model;
summary
is the function you use if you want to see the
numerical values of the terms in your model (the \(g(y)\) statistics.)
The syntax for specifying a model in the ergm
package
follows R’s formula convention:
\[ \mbox{my.network} \sim \mbox{my.model.terms} \]
This syntax is used for both the ergm
and
summary
functions.
It is good practice to run a summmary
command on any
model before fitting it with ergm
. Recall that the
terms in an ERGM are not like covariates in other model settings – they
are not attributes that have a pre-measured value, but instead functions
of the ties in your network, typically sums of specific tie
configurations. The summary
command calculates these
functions and allows you to see their values. This is the ERGM
equivalent of performing a bit of descriptive analysis on your
covariates. Inspecting these values can help you make sure you
understand what the term represents, and flag potential problems that
will lead to poor modeling results.
To get a sense of the types of terms that can be used in an ERGM and how to interpret their coefficients, it’s useful to start with some simple models.
We begin with the simplest possible model: one that assumes every tie
is equally likely. The model contains only one term that represents the
total number of edges in the network, \(\sum{y_{ij}}\). The name of this ergm-term
is edges
, and when included in an ERGM its coefficient
controls the overall density of the network.
edges
20
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
Call:
ergm(formula = flomarriage ~ edges)
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -1.6094 0.2449 0 -6.571 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 166.4 on 120 degrees of freedom
Residual Deviance: 108.1 on 119 degrees of freedom
AIC: 110.1 BIC: 112.9 (Smaller is better. MC Std. Err. = 0)
The probability of a tie is captured by the coefficient of the
edges
term.
How should we interpret the above estimate \(\hat\theta\) of this coefficient? The easiest way is to return to the logit form of the ERGM. The log-odds that a tie is present is given by \[\begin{align*} \operatorname{logit}(p) &= \hat\theta \times \delta_{ij}(y) \\ & = -1.61 \times \mbox{change in $g(y)$ when $y_{ij}$ goes from 0 to 1}\\ & = -1.61 \times 1. \end{align*}\]
In this model, the log-odds will be the same for every tie, since the
change statistic for the edges
term equals one for all
\(y_{ij}\). Do you see why \(\delta_{ij}(y)=1\) no matter which \(i\) and \(j\) you specify?
To convert \(\mbox{logit}(p)\) to \(p\), we take the inverse logit of \(\hat\theta\):
\[\begin{align*} & = \exp(-1.61)/ (1+\exp(-1.61))\\ & = 0.167 \end{align*}\]
This probability corresponds to the density we observe in the flomarriage network: there are \(20\) ties and \(\binom{16}{2} = (16 \times 15)/2 = 120\) dyads,so the density of ties is \(20/120 = 0.167\).
Let’s add a term often thought to be a measure of “clustering”: the
number of completed triangles in the network, or \(\frac13\sum{y_{ij}y_{ik}y_{jk}}\). The name
for this ergm-term is triangle
.
This is an example of a dyad dependent term, as the status of any
triangle containing dyad \(y_{ij}\)
depends on the status of dyads of the form \(y_{ik}\) and \(y_{jk}\). This means that any model
containing the ergm-term triangle
has the property that
dyads are not independent of one another.
Dyad dependent models introduce a number of important changes. One is
that the estimation can no longer be done using MPLE. There are some
different estimation options; ergm
uses an MCMC-based
maximum likelihood estimation algorithm. Because this is a stochastic
algorithm, your results may differ slightly unless you use the same
set.seed
value:
edges triangle
20 3
flomodel.02 <- ergm(flomarriage~edges+triangle) # Estimate the theta coefficients
summary(flomodel.02)
Call:
ergm(formula = flomarriage ~ edges + triangle)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -1.6900 0.3620 0 -4.668 <1e-04 ***
triangle 0.1901 0.5982 0 0.318 0.751
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 166.4 on 120 degrees of freedom
Residual Deviance: 108.1 on 118 degrees of freedom
AIC: 112.1 BIC: 117.6 (Smaller is better. MC Std. Err. = 0.01102)
Now, how should we interpret coefficients?
The conditional log-odds of two actors having a tie, keeping the rest of the network fixed, is
\[ -1.69 \times\mbox{change in the number of ties} + 0.19 \times\mbox{change in number of triangles.} \]
Another difference with dyad dependent terms is that the change statistic for a specific tie can take more values. So even though there is a single coefficient, this does not specify a homogeneous propensity for triad formation for each tie.
For a tie that will create no triangles, the conditional log-odds is \(-1.69\).
For a tie that will create one triangle: \(-1.69 + 0.19 = -1.5\)
For a tie that will create two triangles: \(-1.69 + 2 \times0.19 = -1.31\)
The corresponding probabilities are shown below (note the use of the
plogis
and coef
functions):
[1] 0.1557799 0.1824455 0.2125265
We have been extracting the coefficients from the object that is
returned by a call to ergm
. Let’s take a closer look at
that object, to see what else is in there:
[1] "ergm"
[1] "coefficients" "sample" "sample.obs" "iterations"
[5] "MCMCtheta" "loglikelihood" "gradient" "hessian"
[9] "covar" "failure" "network" "newnetworks"
[13] "newnetwork" "coef.init" "est.cov" "coef.hist"
[17] "stats.hist" "steplen.hist" "control" "etamap"
[21] "call" "ergm_version" "MPLE_is_MLE" "formula"
[25] "nw.stats" "constrained" "constraints" "obs.constraints"
[29] "reference" "estimate" "estimate.desc" "offset"
[33] "drop" "estimable" "null.lik" "mle.lik"
NULL
edges triangle
-1.689969 0.190103
We saw earlier that wealth appeared to be associated with higher
degree in this network. We can use ergm
to test this.
Wealth is a nodal covariate, so we use the ergm-term
nodecov.
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 17.50 39.00 42.56 48.25 146.00
# plot(flomarriage,
# vertex.cex=wealth/25,
# main="Florentine marriage by wealth",
# cex.main=0.8) # network plot with vertex size proportional to wealth
summary(flomarriage~edges+nodecov('wealth')) # observed statistics for the model
edges nodecov.wealth
20 2168
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate.
Call:
ergm(formula = flomarriage ~ edges + nodecov("wealth"))
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -2.594929 0.536056 0 -4.841 <1e-04 ***
nodecov.wealth 0.010546 0.004674 0 2.256 0.0241 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 166.4 on 120 degrees of freedom
Residual Deviance: 103.1 on 118 degrees of freedom
AIC: 107.1 BIC: 112.7 (Smaller is better. MC Std. Err. = 0)
And yes, there is a significant positive wealth effect on the probability of a tie.
What does the value of the nodecov statistic represent, and how should we interpret the coefficients here? The wealth effect operates on both nodes in a dyad. The conditional log-odds of a tie between two actors is \[ \small{ -2.59\times\mbox{change in the number of ties} + 0.01\times\mbox{the wealth of node 1} + 0.01\times\mbox{the wealth of node 2} }, \] or \[ \small{ -2.59 + 0.01\times\mbox{the sum of the wealth of the two nodes} }. \]
As in the triangle model above, the change statistic here is not the same for every pair of nodes.
This model specification does not include a term for homophily by
wealth, i.e., a term that measures similarity in wealth of the two
nodes. It just specifies a relation between wealth and the number of
ties a node will have. To specify homophily on wealth, you could use the
ergm-term absdiff
. This term is appropriate when the
covariate on which homophily operates is continuous. In the next section
we will use another term for homophily that is appropriate for
covariates that are discrete. See
section 3 below for more information on ergm-terms.
Let’s try a larger network, a simulated mutual friendship network
based on one of the schools from the AddHealth study. Here, we’ll
examine the homophily in friendships by grade and race. Both are
discrete attributes so we use the ergm-term nodematch
.
Network attributes:
vertices = 205
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 203
missing edges= 0
non-missing edges= 203
Vertex attribute names:
Grade Race Sex
No edge attributes
par(mfrow=c(1,1)) # Back to 1-panel plots
plot(mesa, vertex.col='Grade')
legend('bottomleft',fill=7:12,
legend=paste('Grade',7:12),cex=0.75)
fauxmodel.01 <- ergm(mesa ~edges +
nodefactor('Grade') + nodematch('Grade',diff=T) +
nodefactor('Race') + nodematch('Race',diff=T))
Observed statistic(s) nodematch.Race.Black and nodematch.Race.Other are at their smallest attainable values. Their coefficients will be fixed at -Inf.
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
Call:
ergm(formula = mesa ~ edges + nodefactor("Grade") + nodematch("Grade",
diff = T) + nodefactor("Race") + nodematch("Race", diff = T))
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -8.0538 1.2561 0 -6.412 < 1e-04 ***
nodefactor.Grade.8 1.5201 0.6858 0 2.216 0.026663 *
nodefactor.Grade.9 2.5284 0.6493 0 3.894 < 1e-04 ***
nodefactor.Grade.10 2.8652 0.6512 0 4.400 < 1e-04 ***
nodefactor.Grade.11 2.6291 0.6563 0 4.006 < 1e-04 ***
nodefactor.Grade.12 3.4629 0.6566 0 5.274 < 1e-04 ***
nodematch.Grade.7 7.4662 1.1730 0 6.365 < 1e-04 ***
nodematch.Grade.8 4.2882 0.7150 0 5.997 < 1e-04 ***
nodematch.Grade.9 2.0371 0.5538 0 3.678 0.000235 ***
nodematch.Grade.10 1.2489 0.6233 0 2.004 0.045111 *
nodematch.Grade.11 2.4521 0.6124 0 4.004 < 1e-04 ***
nodematch.Grade.12 1.2987 0.6981 0 1.860 0.062824 .
nodefactor.Race.Hisp -1.6659 0.2963 0 -5.622 < 1e-04 ***
nodefactor.Race.NatAm -1.4725 0.2869 0 -5.132 < 1e-04 ***
nodefactor.Race.Other -2.9618 1.0372 0 -2.856 0.004296 **
nodefactor.Race.White -0.8488 0.2958 0 -2.869 0.004112 **
nodematch.Race.Black -Inf 0.0000 0 -Inf < 1e-04 ***
nodematch.Race.Hisp 0.6912 0.3451 0 2.003 0.045153 *
nodematch.Race.NatAm 1.2482 0.3550 0 3.517 0.000437 ***
nodematch.Race.Other -Inf 0.0000 0 -Inf < 1e-04 ***
nodematch.Race.White 0.3140 0.6405 0 0.490 0.623947
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 28958 on 20910 degrees of freedom
Residual Deviance: 1798 on 20889 degrees of freedom
AIC: 1836 BIC: 1987 (Smaller is better. MC Std. Err. = 0)
Warning: The following terms have infinite coefficient estimates:
nodematch.Race.Black nodematch.Race.Other
Note that we specified diff=T
for the
nodematch
terms . This specification allows the strength of
the homophily to vary across the levels of the covariate (“differential
homophily”), so we see a coefficient estimate for each level. You can
find the specification options for each ergm
term using the
help function.
(hint, just type ?node
to see the R autocomplete
options, so you don’t need to remember the exact help command)
Two of the coefficients are estimated as -Inf
(the
nodematch coefficients for race Black and Other). Why is this?
Black Hisp NatAm Other White
6 109 68 4 18
Black Hisp NatAm Other White
Black 0 8 13 0 5
Hisp 8 53 41 1 22
NatAm 13 41 46 0 10
Other 0 1 0 0 0
White 5 22 10 0 4
Note: Marginal totals can be misleading for undirected mixing matrices.
We can see that there are very few students in this school in the Black and Other race categories, and these few students form no within-group ties. The empty diagonal cells are what produce the -Inf estimates.
We would have caught this earlier if we had looked at the \(g(y)\) statistics:
summary(mesa ~edges +
nodefactor('Grade') + nodematch('Grade',diff=T) +
nodefactor('Race') + nodematch('Race',diff=T))
edges nodefactor.Grade.8 nodefactor.Grade.9
203 75 65
nodefactor.Grade.10 nodefactor.Grade.11 nodefactor.Grade.12
36 49 28
nodematch.Grade.7 nodematch.Grade.8 nodematch.Grade.9
75 33 23
nodematch.Grade.10 nodematch.Grade.11 nodematch.Grade.12
9 17 6
nodefactor.Race.Hisp nodefactor.Race.NatAm nodefactor.Race.Other
178 156 1
nodefactor.Race.White nodematch.Race.Black nodematch.Race.Hisp
45 0 53
nodematch.Race.NatAm nodematch.Race.Other nodematch.Race.White
46 0 4
Moral: Check the descriptive statistics of a model in the observed network before fitting the model.
See also the ergm-term nodemix
for fitting mixing
patterns other than homophily on discrete nodal attributes.
Let’s try a model for a directed network and examine the tendency for
ties to be reciprocated (“mutuality”). The ergm-term for the
corresponding statistic is mutual
. We’ll fit this model to
the third wave of the classic Sampson Monastery data, and we’ll start by
taking a look at the network.
[1] "faux.mesa.high" "fauxmodel.01" "flobusiness" "flomarriage"
[5] "flomodel.01" "flomodel.02" "flomodel.03" "mesa"
[9] "samplk1" "samplk2" "samplk3" "wealth"
Network attributes:
vertices = 18
directed = TRUE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 56
missing edges= 0
non-missing edges= 56
Vertex attribute names:
cloisterville group vertex.names
No edge attributes
edges mutual
56 15
The plot now shows the direction of a tie, and the \(g(y)\) statistics for this model in this network are 56 total ties and 15 mutual dyads. This means 30 (\(=15*2\)) of the 56 ties are reciprocated, i.e., they are part of dyads in which both directional ties are present.
Call:
ergm(formula = samplk3 ~ edges + mutual)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -2.1639 0.2211 0 -9.789 <1e-04 ***
mutual 2.3118 0.4860 0 4.757 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 424.2 on 306 degrees of freedom
Residual Deviance: 268.1 on 304 degrees of freedom
AIC: 272.1 BIC: 279.6 (Smaller is better. MC Std. Err. = 0.3199)
There is a statistically significant mutuality effect. The coefficients for the edges and mutual terms add to roughly zero for a mutual tie, so the conditional log-odds of a mutual tie are about zero. Thus, the conditional probability that a tie exists, given that the tie in the reverse direction exists, is about 50%. (Do you see why a log-odds of zero corresponds to a probability of 50%?) By contrast, a non-mutual tie has a conditional log-odds of -2.16, or 10% probability.
Triangle terms in directed networks can have many different configurations. Many of these configurations are coded as ergm-terms, and we’ll talk about these more below.
With network data it is important to distinguish between the absence of a tie and the absence of data on whether a tie exists. The former is an observed zero, whereas the latter is unobserved. We should not code both of these as “0”.
The ergm
package recognizes and handles missing data
appropriately, as long as we identify the data as missing. Let’s explore
this with a simple example.
Start by estimating an ergm on a 10-node network with three missing ties.
set.seed(4)
missnet <- network.initialize(10,directed=F) # initialize an empty net with 10 nodes
missnet[1,2] <- missnet[2,7] <- missnet[3,6] <- 1 # add a few ties
missnet[4,6] <- missnet[4,9] <- missnet[5,6] <- NA # mark a few dyads missing
summary(missnet)
Network attributes:
vertices = 10
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges = 6
missing edges = 3
non-missing edges = 3
density = 0.06666667
Vertex attributes:
vertex.names:
character valued attribute
10 valid vertex names
No edge attributes
Network adjacency matrix:
1 2 3 4 5 6 7 8 9 10
1 0 1 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 NA 0 0 NA 0
5 0 0 0 0 0 NA 0 0 0 0
6 0 0 1 NA NA 0 0 0 0 0
7 0 1 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 NA 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
# plot missnet with missing dyads colored red.
tempnet <- missnet
tempnet[4,6] <- tempnet[4,9] <- tempnet[5,6] <- 1
missnetmat <- as.matrix(missnet)
missnetmat[is.na(missnetmat)] <- 2
plot(tempnet,label = network.vertex.names(tempnet),
edge.col = missnetmat)
edges
3
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
Call:
ergm(formula = missnet ~ edges)
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -2.5649 0.5991 0 -4.281 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 58.22 on 42 degrees of freedom
Residual Deviance: 21.61 on 41 degrees of freedom
AIC: 23.61 BIC: 25.35 (Smaller is better. MC Std. Err. = 0)
The coefficient estimate equals -2.56, which corresponds to a probability of 7.14%. Our network has 3 ties, out of the 42 non-missing nodal pairs (10 choose 2 minus 3), and 3/42 = 7.14%. So our estimate represents the density of ties in the observed sample.
Now let’s assign those missing ties the (observed) value “0” and
check how the value of the coefficient will change. Can you predict
whether it will get bigger or smaller? Can you calculate it directly
before checking the output of an ergm
fit?
missnet_bad <- missnet # create network with missing dyads set to 0
missnet_bad[4,6] <- missnet_bad[4,9] <- missnet_bad[5,6] <- 0
# fit an ergm to the network with missing dyads set to 0
summary(missnet_bad)
Network attributes:
vertices = 10
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges = 3
missing edges = 0
non-missing edges = 3
density = 0.06666667
Vertex attributes:
vertex.names:
character valued attribute
10 valid vertex names
No edge attributes
Network adjacency matrix:
1 2 3 4 5 6 7 8 9 10
1 0 1 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0 0
7 0 1 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
Call:
ergm(formula = missnet_bad ~ edges)
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -2.6391 0.5976 0 -4.416 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 62.38 on 45 degrees of freedom
Residual Deviance: 22.04 on 44 degrees of freedom
AIC: 24.04 BIC: 25.85 (Smaller is better. MC Std. Err. = 0)
The coefficient is smaller now because the missing ties are counted as “0”, and this translates to a conditional tie probability of 3/45 or 6.67%.
MORAL: If you have missing data on ties, be sure to identify them by assigning the “NA” code. This is particularly important if you’re reading in data as an edgelist, as all dyads without edges are implicitly set to “0” in this case.
Model terms are the expressions (e.g., edges
and
triangle
) used to represent predictors on the right-hand
side of formulas used in:
summary
(to obtain measurements of network
statistics on a dataset)ergm
(to estimate, or fit, an ERGM’s
coefficients)simulate
(to simulate networks from a fitted
ERGM)Because these terms are not exogeneous measures, but functions of the dyad states in the network, they must be calculated for the network that is being modeled. Many ERGM terms are simple counts of configurations (e.g., edges, nodal degrees, stars, triangles), but others are more complex functions of these configurations (e.g., geometrically weighted degrees and shared partners). In theory, any configuration or function of configurations can be a term in an ERGM. In practice, however, these terms have to be constructed before they can be used—that is, one has to explicitly write an algorithm that defines and calculates the network statistic of interest. This is another key way that ERGMs differ from traditional linear and general linear models.
The terms that can be used in a model also depend on the type of network being analyzed: directed or undirected, one-mode or two-mode (“bipartite”), binary or valued edges.
The ergm
package provides a wide range of terms, and it
is not possible to cover all of them in this workshop. This is
particularly true with the release of ergm
version 4.0,
which expands the user’s ability to modify how terms are calculated
through the use of “term operators”.
You can explore all of the available terms using help resources covered in Section 1 above. But it may be helpful to start with a more concise list such as the one found here and the overview in volume 24, issue 4 of the Journal of Statistical Software.
To appreciate the expanded capabilities of the ergm
package as of the release of version 4.0, we recommend Krivitsky et al (2023).
In this article, Section 3 describes the enhanced flexibility to create
specialized model terms involving functions of nodal covariates, and
Section 4 explains how operators further extend the types of terms at
the user’s disposal (term operators are not covered in this introductory
workshop, but are covered in the Temporal ERGM and Advanced ERGM
workshops).
ERG models specification is not limited to the terms included in the
ergm
package.
So if you need a term that is not included, don’t despair! If you can
code (a little bit), you can write your own.
There is a statnet
package called
ergm.userterms
that provides utilities for writing custom
ergm-terms so they can be used with Statnet. The package is available
via GitHub at https://github.com/statnet/ergm.userterms,
and installing it will include the tutorial, called ergmuserterms.pdf.
There is a tutorial for writing custom terms in the Journal of Statistical
Software 52(2), and some introductory slides and installation
instructions from the workshop we teach on coding ergm-terms can be
found on
GitHub.
Writing up new ergm
terms requires some knowledge of C
and the ability to build R from source.
When dyad dependent terms are in the model, the computational
algorithms in ergm
use Markov chain Monte Carlo (MCMC) to
estimate the parameters. This approach basically works as follows:
Start with an initial vector of coefficient values; the default is to use the maximum pseudo-likelihood estimate, or MPLE. (We do not cover MPLE in this tutorial, but this estimator is easy to compute using a standard logistic regression algorithm.)
Choose a dyad at random, and flip a coin, weighted by the model, to decide whether there will be a tie.
Repeat this for 1024 steps, the default control value of
MCMC.interval
(see ?control.simulate
)
Calculate and store the \(g(y)\) statistics for the resulting network.
Repeat this process until either MCMC.samplesize
vectors of statistics have been collected, or until a certain
MCMC.effectiveSize
criterion is reached (see
?control.simulate
).
Calculate the sample average of the sampled \(g(y)\) statistics, then compare this to the vector of observed statistics.
Update the coefficient estimates as needed.
Repeat until the process converges: The difference between the MCMC sample average and the observed statistic is sufficiently small.
Whenever MCMC is used for estimation, it is important to assess model
convergence before interpreting the model results, i.e., before
evaluating statistical significance, interpreting coefficients, or
assessing goodness of fit. To do this, we use the function
mcmc.diagnostics
, as we now demonstrate.
We will first consider a simple dyadic dependent model where the algorithm works using the program defaults, with a degree(1) term that captures whether there are more (or less) degree 1 nodes than we would expect, given the density.
edges degree1
15 3
Call:
ergm(formula = flobusiness ~ edges + degree(1))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -2.1177 0.3032 0 -6.984 <1e-04 ***
degree1 -0.6272 0.6010 0 -1.044 0.297
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 166.36 on 120 degrees of freedom
Residual Deviance: 89.39 on 118 degrees of freedom
AIC: 93.39 BIC: 98.96 (Smaller is better. MC Std. Err. = 0.03364)
Sample statistics summary:
Iterations = 7168:131072
Thinning interval = 512
Number of chains = 1
Sample size per chain = 243
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
edges 0.2140 3.601 0.2310 0.2310
degree1 -0.1276 1.817 0.1166 0.1166
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
edges -8 -2 1 2 7
degree1 -3 -1 0 1 4
Are sample statistics significantly different from observed?
edges degree1 Overall (Chi^2)
diff. 0.2139918 -0.1275720 NA
test stat. 0.9262347 -1.0944112 1.4827084
P-val. 0.3543240 0.2737747 0.4790184
Sample statistics cross-correlations:
edges degree1
edges 1.0000000 -0.3929828
degree1 -0.3929828 1.0000000
Sample statistics auto-correlation:
Chain 1
edges degree1
Lag 0 1.000000000 1.00000000
Lag 512 -0.018373256 0.06527360
Lag 1024 -0.008853872 -0.00419178
Lag 1536 -0.006593784 -0.05395258
Lag 2048 0.033260731 0.02580333
Lag 2560 -0.059894956 0.02109630
Sample statistics burn-in diagnostic (Geweke):
Chain 1
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
edges degree1
1.3769 0.1355
Individual P-values (lower = worse):
edges degree1
0.1685490 0.8922124
Joint P-value (lower = worse): 0.1257365 .
MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
What this shows is a summary of the statistics generated by the MCMC process, with each row summarizing a different statistic and with each statistic measured in terms of its value relative to the corresponding value for the original observed network. On the left is a “traceplot” in which the values are plotted as a function of iteration number; on the right is a histogram-like plot of the whole sample of statistics without regard to their order in the MCMC process.
This example exhibits the sort of behavior that we want to see in the MCMC diagnostics:
The MCMC sample statistics are varying randomly around the observed values at each step; we might say that the chain is “mixing well”.
The sampled values show little serial correlation, indicating that they are independent draws, and they have a roughly bell-shaped distribution, centered at zero.
Note: The sawtooth pattern visible in the degree term deviation plot is due to the combination of discrete values and small range in the statistics: the observed number of degree 1 nodes is 3, and only a few discrete values are produced by the simulations. So the sawtooth pattern is is an inherent property of the statistic, not a problem with the model.
There are many control parameters for the MCMC algorithm
(help(snctrl)
), and we’ll play with some of these below. To
see what the algorithm is doing at each step, we can drop the sampling
interval down to 1:
This runs an MCMC algorithm where the statistics are returned after only one tie is evaluated, which might be useful if we are trying to debug a bad model fit.
In the last section we’ll look at some models that don’t converge properly, and how to use MCMC diagnostics to identify and address this.
Once we have estimated the coefficients of an ERGM, the model is completely specified. It defines a probability distribution across all networks on the given set of nodes. If the model is a good fit to the observed data, then networks drawn from this distribution will be more likely to “resemble” the observed data. Thus, one way we use simulations from a model is to assess that model’s goodness of fit to our data. Here, we will take a quick look at how the simulation function works.
The simulate
command is easy to run if we have an ERGM
that has already been fitted. Let’s use the flomodel.03
object from earlier:
set.seed(101)
flomodel.03.sim <- simulate(flomodel.03,nsim=10)
class(flomodel.03.sim) # Reveal the class of the object created
[1] "network.list"
Number of Networks: 10
Model: flomarriage ~ edges + nodecov("wealth")
Reference: ~Bernoulli
Constraints: ~.
Stored network statistics:
edges nodecov.wealth
[1,] 17 1539
[2,] 18 1742
[3,] 21 2471
[4,] 16 1304
[5,] 18 1779
[6,] 26 3143
[7,] 22 2239
[8,] 26 2905
[9,] 24 2792
[10,] 15 1682
attr(,"monitored")
[1] FALSE FALSE
Number of Networks: 10
Model: flomarriage ~ edges + nodecov("wealth")
Reference: ~Bernoulli
Constraints: ~.
$coefficients
edges nodecov.wealth
-2.59492903 0.01054591
$control
Control parameter list generated by 'control.simulate.formula' or equivalent. Non-empty parameters:
MCMC.burnin: 16384
MCMC.interval: 1024
MCMC.scale: 1
MCMC.prop: ~sparse
MCMC.prop.weights: "default"
MCMC.batch: 0
MCMC.effectiveSize.damp: 10
MCMC.effectiveSize.maxruns: 1000
MCMC.effectiveSize.burnin.pval: 0.2
MCMC.effectiveSize.burnin.min: 0.05
MCMC.effectiveSize.burnin.max: 0.5
MCMC.effectiveSize.burnin.nmin: 16
MCMC.effectiveSize.burnin.nmax: 128
MCMC.effectiveSize.burnin.PC: FALSE
MCMC.effectiveSize.burnin.scl: 1024
MCMC.maxedges: Inf
MCMC.runtime.traceplot: FALSE
network.output: "network"
parallel: 0
parallel.version.check: TRUE
parallel.inherit.MT: FALSE
MCMC.samplesize: 10
obs.MCMC.mul: 0.25
obs.MCMC.samplesize.mul: 0.5
obs.MCMC.interval.mul: 0.5
obs.MCMC.burnin.mul: 0.5
obs.MCMC.prop: ~sparse
obs.MCMC.prop.weights: "default"
MCMC.save_networks: TRUE
$response
[1] NA
$class
[1] "network.list"
$stats
edges nodecov.wealth
[1,] 17 1539
[2,] 18 1742
[3,] 21 2471
[4,] 16 1304
[5,] 18 1779
[6,] 26 3143
[7,] 22 2239
[8,] 26 2905
[9,] 24 2792
[10,] 15 1682
attr(,"monitored")
[1] FALSE FALSE
$formula
flomarriage ~ edges + nodecov("wealth")
attr(,".Basis")
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 20
missing edges= 0
non-missing edges= 20
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
$constraints
~.
<environment: 0x115440798>
$reference
~Bernoulli
<environment: 0x11546aea8>
We can check whether it appears that the simulated sample mean statistics are in fact close to the observed statistics:
rbind("obs"=summary(flomarriage~edges+nodecov("wealth")),
"sim mean"=colMeans(attr(flomodel.03.sim, "stats")))
edges nodecov.wealth
obs 20.0 2168.0
sim mean 20.3 2159.6
By default, our network.list
object contains all ten of
the networks we simulated. If it were important to save memory, we could
have asked that only the network statistics be stored by passing the
output="stats"
option to the earlier simulate
command; see ?simulate.ergm
for more details. Let’s take a
look at the seventh network in our list of ten:
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 22
missing edges= 0
non-missing edges= 22
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
plot(flomodel.03.sim[[7]],
label= flomodel.03.sim[[7]] %v% "vertex.names",
label.cex = 0.5,
vertex.cex = (flomodel.03.sim[[7]] %v% "wealth")/25)
Voilà. Your plot may look different since randomness is involved in both the simulation and in the plotting of a network.
Simulation from a model is a very powerful tool for examining the
range of variation that can be expected from this model, both in the
sufficient statistics that define the model and in other statistics not
explicitly specified by the model. Simulation plays a large role in
analyzing egocentrically sampled data, and if you take the
tergm
workshop, you will see how we can use simulation to
examine the temporal implications of a model based on a single
cross-sectional egocentrically sampled dataset.
Next, we will examine a primary use of simulation in the
ergm
package: we simulate networks from a fitted model to
evaluate goodness of fit to the observed network.
ERGMs can be seen as generative models when they represent the process that governs the global patterns of tie prevalence from a local perspective, i.e., the perspective of the nodes involved in the particular micro-configurations represented by the ergm-terms in the model. The locally generated processes in turn aggregate to produce characteristic global network properties, even those global properties that are not explicit terms in the model.
One test of whether an ERGM fits the data is therefore how well it
reproduces observed global network properties that are not in the
model. We do this by using the gof
function to choose
network statistics that are not in the model, then compare the values of
these statistics observed in the original network to the distribution of
values we get in simulated networks from our model.
The gof
function is a bit different than the
summary
, ergm
, and simulate
functions, in that it currently (for undirected networks) only takes
three ergm-terms as arguments: degree
,
espartners
(edgewise shared partners), and
distance
(geodesic distances). Each of these terms captures
an aggregate network distribution at either the node level
(degree
), the edge level (espartners
), or the
dyad level (distance
).
set.seed(54321) # The gof function uses random values
flomodel.03.gof <- gof(flomodel.03)
flomodel.03.gof
Goodness-of-fit for degree
obs min mean max MC p-value
degree0 1 0 1.20 5 1.00
degree1 4 0 3.64 8 1.00
degree2 2 0 3.98 9 0.44
degree3 6 0 3.43 7 0.20
degree4 2 0 1.86 7 1.00
degree5 0 0 1.03 5 0.68
degree6 1 0 0.48 4 0.70
degree7 0 0 0.24 2 1.00
degree8 0 0 0.11 1 1.00
degree9 0 0 0.02 1 1.00
degree10 0 0 0.01 1 1.00
Goodness-of-fit for edgewise shared partner
obs min mean max MC p-value
esp0 12 5 12.65 19 0.86
esp1 7 0 5.49 15 0.72
esp2 1 0 1.71 8 1.00
esp3 0 0 0.22 5 1.00
esp4 0 0 0.03 2 1.00
Goodness-of-fit for minimum geodesic distance
obs min mean max MC p-value
1 20 13 20.10 37 1.00
2 35 17 35.34 67 1.00
3 32 11 27.79 41 0.58
4 15 2 12.20 26 0.76
5 3 0 3.68 13 0.94
6 0 0 0.88 11 1.00
7 0 0 0.19 8 1.00
8 0 0 0.03 2 1.00
Inf 15 0 19.79 65 1.00
Goodness-of-fit for model statistics
obs min mean max MC p-value
edges 20 13 20.10 37 1.0
nodecov.wealth 2168 1287 2201.89 3467 0.9
Let’s see how the gof
function operates on a larger
network by fitting the simplistic edges-only model to the
faux.mesa.high
dataset used earlier:
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Stopping at the initial estimate.
Evaluating log-likelihood at the estimate.
Warning in gof.formula(object = object$formula, coef = coef, GOF = GOF, : No
parameter values given, using 0.
Unsurprisingly, networks simulated from the simplistic model do not
appear to capture the global structure present in the AddHealth-based
faux.mesa.high
network.
For a good example of model exploration and fitting for the Add Health Friendship networks, see Goodreau, Kitts & Morris, Demography 2009. For more technical details on the approach, see Hunter, Goodreau and Handcock JASA 2008
When a model is not a good representation of the observed network, the simulated networks produced in the MCMC chains may be far enough away from the observed network that the estimation process is affected. In the worst case scenario, the simulated networks will be so different that the algorithm fails altogether. When this happens, it basically means the model specified would not have produced the network observed. Some classes of models, we now know, can almost never produce an interesting network, such as we might observe. This behavior is what we call “model degeneracy.”
For more detailed discussion of model degeneracy in the ERGM context, see the papers listed in the reference section.
In that worst case scenario, we end up not being able to obtain coefficent estimates, so we can’t use the GOF function to identify how the model simulations deviate from the observed data. We can, however, still use the MCMC diagnostics to observe what is happening with the simulation algorithm, and this (plus some experience and intuition about the behavior of ergm-terms) can help us improve the model specification.
For this purpose, we’ll use a larger network,
faux.magnolia.high
, and look at a simple model for triad
closure that includes only edges
and triangle
terms.
Network attributes:
vertices = 1461
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 974
missing edges= 0
non-missing edges= 974
Vertex attribute names:
Grade Race Sex vertex.names
Edge attribute names not shown
edges triangle
974 169
We now try to fit this “simple” model:
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
...
Iteration 4 of at most 60:
Optimizing with step length 0.3963.
The log-likelihood improved by 1.1568.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
Error in ergm.MCMLE(init, nw, model, initialfit = (initialfit <- NULL), :
Number of edges in a simulated network exceeds that in the observed by a factor of more than 20. This is a strong indicator of model degeneracy or a very poor starting parameter configuration. If you are reasonably certain that neither of these is the case, increase the MCMLE.density.guard control.ergm() parameter.
Very interesting. Instead of converging, the algorithm heads off into
networks that are much much more dense than the observed network. This
is such a clear indicator of a degenerate model specification that the
algorithm stops after 3 iterations, to avoid storage problems. To peek a
bit more under the hood, we can stop the algorithm earlier, by setting
MCMLE.maxit=2
, to catch where it’s heading:
set.seed(1000)
fit <- ergm(magnolia~edges+triangle,
control=snctrl(MCMLE.maxit=2,MCMLE.effectiveSize=NULL))
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 2:
Optimizing with step length 0.2805.
The log-likelihood improved by 3.0798.
Estimating equations are not within tolerance region.
Iteration 2 of at most 2:
Optimizing with step length 0.0420.
The log-likelihood improved by 4.6627.
Estimating equations are not within tolerance region.
MCMLE estimation did not converge after 2 iterations. The estimated coefficients may not be accurate. Estimation may be resumed by passing the coefficients as initial values; see 'init' under ?control.ergm for details.
Finished MCMLE.
Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
Bridging between the dyad-independent submodel and the full model...
Setting up bridge sampling...
Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
Bridging finished.
This model was fit using MCMC. To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
Let’s use the MCMC diagnostics from Section 4 to get a sense of what happened:
For the diagnostic plots, the simulated network statistics are subtracted from their observed values so that the observed values equal zero. Clearly, this Markov chain is heading somewhere very bad!
The edges + triangle
model class turns out to be one of
the classic degenerate model specifications, and we now understand much
more about why it does not produce reasonable levels of triadic
closure.
We also now have a more robust way of modeling triangles: the geometrically-weighed edgewise shared partner term (GWESP). For a technical introduction to GWESP, see Hunter and Handcock, 2006; for a more intuitive description and empirical application, see Goodreau, Kitts & Morris, 2009 )
Let’s see what using gwesp
instead of
triangle
can do. We can also control the number of
Metropolis-Hastings (MCMC) proposals between sampled statistics in our
Markov chain, one of the many control parameters that may be passed to
functions in the ergm
package using the
control=snctrl()
syntax. (To see the many control
parameters that may be set by the user in the ergm
package,
type ?snctrl
.)
set.seed(10101)
fit <- ergm(magnolia~edges+gwesp(0.25, fixed=T),
control=snctrl(MCMC.interval = 10000),
verbose=T)
Evaluating network in model.
Initializing unconstrained Metropolis-Hastings proposal: ‘ergm:MH_TNT’.
Initializing model...
Model initialized.
Using initial method 'MPLE'.
Fitting initial model.
Starting maximum pseudolikelihood estimation (MPLE):
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
... (output snipped)
Bridging finished.
This model was fit using MCMC. To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
Sample statistics summary:
Iterations = 2800000:55400000
Thinning interval = 40000
Number of chains = 1
Sample size per chain = 1316
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
edges 7.866 39.98 1.1020 4.141
gwesp.fixed.0.25 7.206 31.99 0.8819 3.360
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
edges -66.12 -18.25 6.000 34.00 93.12
gwesp.fixed.0.25 -57.14 -13.50 8.166 27.07 71.32
Are sample statistics significantly different from observed?
edges gwesp.fixed.0.25 Overall (Chi^2)
diff. 7.86626140 7.20579674 NA
test stat. 1.89941461 2.14437967 6.19661638
P-val. 0.05750998 0.03200248 0.04675916
Sample statistics cross-correlations:
edges gwesp.fixed.0.25
edges 1.0000000 0.7833691
gwesp.fixed.0.25 0.7833691 1.0000000
Sample statistics auto-correlation:
Chain 1
edges gwesp.fixed.0.25
Lag 0 1.0000000 1.0000000
Lag 40000 0.5460541 0.8587880
Lag 80000 0.4618254 0.7501801
Lag 120000 0.4129546 0.6642087
Lag 160000 0.3832516 0.5940199
Lag 2e+05 0.3082655 0.5300815
Sample statistics burn-in diagnostic (Geweke):
Chain 1
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
edges gwesp.fixed.0.25
0.1445 -0.1057
Individual P-values (lower = worse):
edges gwesp.fixed.0.25
0.8850903 0.9158320
Joint P-value (lower = worse): 0.3733633 .
MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
MORAL: Unwanted degeneracy is an indicator of a poorly specified model. It is not a property of all ERGMs (nor is it unique to ERGMs!), but it is associated with some dyadic-dependent terms, in particular, the reduced homogeneous Markov specifications (e.g., 2-stars and triangle terms). For a good technical discussion of unstable terms, see Schweinberger 2012. For a discussion of alternative terms that exhibit more stable behavior, see Snijders et al. 2006.. For the gwesp term and the curved exponential family terms in general, see Hunter and Handcock 2006.. Note that there are cases in which degenerate behavior may be realistic and desired (e.g., in modeling small groups that do genuinely collapse into cliques or other regular structures, or in some physical applications), but these are not frequently encountered in modeling medium-to-large social networks. If your system is not degenerate but your model is, this suggests that you may need to rethink the theoretical motivation behind your choice of model terms.
One of the most powerful features of ERGMs is that they can be used to estimate models from egocentrically sampled data, and the fitted models can then be used to simulate complete networks (of any size) that will have the properties of the original network that are observed and represented in the model.
In many empirical contexts, it is not feasible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.
Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here is to combine what is observed with assumptions to define a class of models that describes the distribution of networks that are centered on the observed properties. The variation in these networks quantifies some of the uncertainty introduced by the assumptions.
The egocentric estimation/simulation framework extends to temporal
ERGMs (“TERGMs”) as well, with the minimal addition of an estimate of
partnership duration. This makes it possible to simulate complete
dynamic networks from a single cross-sectional egocentrically sampled
network. For an example of what one can accomplish with this framework,
check out the network movie we developed to explore the impact of
dynamic network structure on HIV transmission at http://statnet.org/movies. Similar methods can be used
to simulate continuous time dynamics (see references below, and the
ergmgp
package).
While the ergm
package has had this capability for many
years, and old versions of this workshop had a detailed section on it,
there is now a specific package that makes this much easier:
ergm.ego
. The new package includes accurate statistical
inference, i.e., standard errors for model coefficient estimates, along
with many utilities that simplify the task of reading in the data,
conducting exploratory analyses, calculating the sample target
statistics, and specifying model options.
We now have a separate workshop/tutorial for ergm.ego
,
so we no longer cover this material in the current ERGM workshop. As
always, this workshop material can be found online at the Statnet Workshops
wiki.
“Statnet” refers to a suite of R
packages that are
designed to work together, providing tools for a wide range of different
types of network data analysis.
There is also an R package called statnet
, whose sole
function is to make it easy to install and load all of the packages
produced by the Statnet Project team in a single step.
Examples of Statnet Suite functionality beyond the ergm
package include temporal network models and dynamic network
vizualizations, analysis of egocentrically sampled network data,
multilevel network modeling, latent cluster models, and network
diffusion and epidemic models. Development is ongoing, with new packages
and new functionality added to existing packages on a regular basis.
Most of the Statnet packages can be downloaded from CRAN, and all are
available via GitHub. For more detailed information, please visit the
statnet
webpage at www.statnet.org.
Many of the packages have associated training workshops. Our tutorials can be found online, on the GitHub statnet Workshops wiki.
ergm
The ergm
package has considerable functionality beyond
what has been discussed here. This includes support for:
Many of these topics are covered in our Advanced ERGM and Valued ERGMs workshops.
There are now a number of excellent packages developed by others that
extend the functionality of statnet. The easiest way to find these is to
look at the “reverse depends” of the ergm
package on CRAN. Examples include:
Bergm
— Bayesian Exponential Random Graph Modelsbtergm
— Temporal Exponential Random Graph Models by
Bootstrapped Pseudolikelihoodhergm
— hierarchical ERGMs for multi-level network
dataxergm
— extensions to ERGM modelingYou will see the terms ergm and network used in multiple contexts throughout the documentation. This is common in R, but often confusing to newcomers. To clarify:
ergm
statnet
suitenetwork
statnet
suite; used to create, store, modify and plot the
information found in network objects.R
used to represent a network.For a general orientation to the statnet
packages, the
best place to start is the special volume of the Journal of
Statistical Software (JSS) devoted to statnet
: https://www.jstatsoft.org/issue/view/v024.
The nine papers in this volume cover a wide range of theoretical and
practical topics related to ERGMs, and their implementation in
statnet
.
However, this volume was written in 2008. The statnet
code base has evolved considerably since that time, and with the release
of ergm
version 4.0, the most current paper describing the
capabilities of the ergm
package is the following:
Krivitsky, P. N., Hunter, D. R., Morris, M., & Klumb, C. (2023). ergm 4: New Features for Analyzing Exponential-Family Random Graph Models. Journal of Statistical Software, 105(6), 1–44. https://doi.org/10.18637/jss.v105.i06
For social scientists, a good introductory application paper is:
Goodreau, S., J. Kitts and M. Morris (2009). Birds of a Feather, or Friend of a Friend? Using Statistical Network Analysis to Investigate Adolescent Social Networks. Demography 46(1): 103-125. link
For a good technical summary and clarification of the properties of ERGMs (which addresses several widely held misconceptions regarding degeneracy and consistency):
Michael Schweinberger, Pavel N. Krivitsky, Carter T. Butts, and Jonathan Stewart (2020). Exponential-Family Models of Random Graphs: Inference in Finite-, Super-, and Infinite Population Scenarios. Statistical Science 35 (4):627-662. link link to arXiv version
The literature on ERGMs is large and growing. The following are
examples of papers that may be particularly useful starting points for
those interested in exploring additional ERGM functionality implemented
in statnet
ERGM functionality per se, but should
not be taken as an exhaustive list.
Dealing with Model Degeneracy
Handcock M.S. (2003a). Assessing Degeneracy in Statistical Models of Social Networks. Working Paper 39, Center for Statistics and the Social Sciences, University of Washington. link
Snijders, T.A.B. et al. (2006). New Specifications For Exponential
Random Graph Models
Sociological Methodology 36(1): 99-153 link
Hunter, D. R. (2007). Curved Exponential Family Models for Social Networks. Social Networks, 29(2), 216-230.link
Schweinberger, M. (2011).
Instability, Sensitivity, and Degeneracy of Discrete Exponential
Families. JASA 106(496): 1361-1370. link
And a dynamic network modeling application where model “degeneracy” turns out to be appropriate, and eventually generates the patterns empirically observed in data:
Gianmarc Grazioli, Yue Yu, Megha H. Unhelkar, Rachel W. Martin, and Carter T. Butts (2019) Network-Based Classification and Modeling of Amyloid Fibrils The Journal of Physical Chemistry B 123 (26), 5452-5462. DOI: 10.1021/acs.jpcb.9b03494. link
Valued ERGMs
Krivitsky, P.N. (2012).
Exponential-Family Random Graph Models for Valued Networks.
Electronic Journal of Statistics 6: 1100-1128;
10.1214/12-EJS696
Krivitsky, P.N., Butts, C.T. (2017). Exponential-Family Random Graph Models for Rank-Order Relational Data. Sociological Methodology, 47: 68-112.
Temporal ERGMs and Network Dynamics
Krivitsky, P.N., Handcock, M.S. (2014). A Separable Model for Dynamic Networks. JRSS Series B-Statistical Methodology, 76(1):29-46; 10.1111/rssb.12014 JAN 2014 link
Krivitsky, P. N., Handcock, M.S., and Morris, M. (2011). Adjusting for Network Size and Composition Effects in Exponential-family Random Graph Models, Statistical Methodology 8(4): 319-339, ISSN 1572-3127 link
Butts, C.T. (2023). Continuous Time Graph Processes with Known ERGM Equilibria: Contextual Review, Extensions, and Synthesis. Journal of Mathematical Sociology; 10.1080/0022250X.2023.2180001 link
Egocentric ERGMs
Krivitsky, P. N., and Morris, M. (2017). Inference for Social Network Models from Egocentrically Sampled Data, with Application to Understanding Persistent Racial Disparities in HIV Prevalence in the US. Annals of Applied Statistics, 11(1), 427-455.link