Last updated 2024-06-24
This tutorial is a joint product of the Statnet Development Team:
Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender de-Moll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)
The network modeling software demonstrated in this tutorial is
authored by Pavel Krivitsky (ergm
) and Carter Butts
(network
and sna
).
All Statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org
Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserve.
Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.
Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).
For all other issues, please email us at contact@statnet.org.
This workshop and tutorial provide an introduction to statistical
modeling of network data with exponential-family random graph
models (ERGMs) using statnet
software. This online
tutorial is also designed for self-study, with example code and
self-contained data. The statnet
packages we will be
demonstrating are:
network
— storage and manipulation of network dataergm
— statistical tools for estimating ERGMs, model
assessment, and network simulation.The ergm
package has more advanced functionality that is
not covered in this workshop. An overview can be found in this preprint.
This workshop assumes basic familiarity with R; experience with network concepts, terminology and data; and familiarity with the general framework for statistical modeling and inference. While previous experience with ERGMs is not required, some of the topics covered here may be difficult to understand without a strong background in linear and generalized linear models in statistics.
Minimally, you will need to install the latest version of
R (available
here) and the statnet
packages ergm
and
network
to run the code presented here (ergm
will automatically install network
when it is
loaded).
The workshops are conducted using the free version of
Rstudio
(available
here).
The full set of installation intructions with details can be found on
the statnet
workshop wiki.
If you have not already downloaded the statnet
packages
for this workshop, the quickest way to install these (and the other most
commonly used packages from the statnet
suite), is to open
an R session and type:
Loading required package: network
'network' 1.17.2 (2022-05-20), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4.2.2 (2022-06-01), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
You can check the version number with:
[1] '4.2.2'
Throughout, we will set a random seed via set.seed()
for
commands in tutorial that require simulating random values—this is not
necessary, but it ensures that you will get the same results as this
tutorial (assuming that you are using the same ergm
version
or at least a version in which the algorithms you are using have not
changed).
Here we provide only a brief overview of the modeling framework, as
the primary purpose of this tutorial is to show how to implement
statistical analysis of network data with ERGMs using the
statnet
software tools, rather than to explain the
framework in detail. For more details, and to really understand ERGMs,
please see the references at
the end of this tutorial.
Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:
Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;
Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;
Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and
Simulate new networks from the underlying probability distribution implied by the fitted model.
ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is
\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]
where
\(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,
\(g(y)\) is a vector of model statistics for network \(y\),
\(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))
\(\theta\) is the vector of coefficients for those statistics, and
\(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.
In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).
If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:
\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.
\(h(y)\) is important for specifying model behavior for valued ERGMs, for controlling for network size in extrapolative or multi-network settings, and other applications. When modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the counting measure). We will adopt that convention here. Under the counting measure, the baseline behavior of an ERGM approaches a uniform random graph when \(\theta \to 0\), and the ERGM terms serve to modify this baseline distribution. We will see many examples below.
The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models, but they are different in one key respect:
These \(g(y)\) statistics are functions of the network itself — each represents the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.
As a result, every term in an ERGM must have an associated algorithm
for computing its value for your network. The ergm
package
in statnet
includes about 150 term-computing algorithms,
and you can code up your own with our helper package. We will explore
some of the most commonly used terms in this tutorial, and links to more
information are provided in section
3.
Terms can be grouped in different ways, like the type of network they can be used on (e.g., binary, bipartite, directed, etc.), the type of covariate they represent (continuous, dyad-dependent, etc.) and their purpose (term operator, soft constraint, etc). In Statnet, these groupings are called “keywords”, and they provide a useful way to quickly find help for terms used in different contexts.
One key property of model terms is worth keeping in mind: terms are either dyad independent or dyad dependent.
Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. Dyad dependent terms have very different effects, and much of what is different about network models comes from these terms. They introduce complex cascading effects that can often lead to counter-intuitive and highly non-linear outcomes. In addition, a model with at least one dyad dependent term requires a different estimation algorithm, so when we use these terms below you will see some different components in the output.
There are several resources in the ergm
package that
provide a range of help and instructions for using these terms:
The main ergm terms reference page
Accessed via:
?ergmTerms
This contains an overview of model specification, some links to additional references and a complete up-to-date list of available terms with short descriptions and a link to the man page for that term.
Man pages for specific terms
If you know the name of the term, and are looking for help with syntax, you can type either
`help("[name]-ergmTerm")`
or the shorthand version
`ergmTerm?[name]`
where [name]
is the name of the term.
Keywords
Keywords are useful for narrowing down your search to find terms that have specific functionality. Available keywords and their meanings can be obtained by typing
?ergmKeyword`
You can search for terms with keywords, for example:
Found 11 matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
Alternating k-star
gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the first mode in a bipartite network
gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition
gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the second mode in a bipartite network
gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition
gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution
gwdsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution
gwesp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted edgewise shared partner distribution
gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted in-degree distribution
gwnsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted nonedgewise shared partner distribution
gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted out-degree distribution
Cross-reference guide
This guide is a good place to find a tables that summarize the keywords for terms (with tables for commonly used terms, operator terms, and all terms), definitions for all terms, and a list of keywords with the terms they refer to.
vignette("ergm-term-crossRef")
Note: When using RStudio, it is possible to press the tab key after
starting a line with ?ergm
to view the wide range of
possible help options beginning with the letters ergm
.
The ERGM expression for the probability of the entire graph that was shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors:
\[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where
\(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and
\(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\) except for \(y_{ij}\).
\(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So
\[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \]
where
\(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and
\(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.
So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.
In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.
We will see exactly how this works in the sections that follow.
Network data can come in many different forms — ties can be stored as
edgelists or sociomatrices in .csv
files, or as exported
data from other programs like Pajek. Attributes for the nodes, ties, and
dyads can also come in various forms. All can be read into R
using either standard R tools (e.g., for .csv
files), or
methods from the network
package. For more information,
refer to the following:
However you read them in, the data will need to be transformed into a
network
object, the format that Statnet packages use to
store and work with network data. For information on how to do this,
refer to:
The ergm
package also contains several network data
sets, and we will use those here for demonstration purposes.
We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.
set.seed(1) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print(flomarriage): Examine properties
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 20
missing edges= 0
non-missing edges= 20
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area
# Plot the network, saving the coordinates for the next plot
coords <- plot(flomarriage,
main="Florentine Marriage",
cex.main=0.8,
label = network.vertex.names(flomarriage),
label.cex=0.4,
pad=3) # Equivalent to plot.network(...)
wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes, equivalent to get.vertex.attribute(flomarriage, "wealth")
wealth
[1] 10 36 55 44 20 32 8 42 103 48 49 3 27 10 146 48