Last updated 2024-06-24

This tutorial is a joint product of the Statnet Development Team:

Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender de-Moll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)

The network modeling software demonstrated in this tutorial is authored by Pavel Krivitsky (ergm) and Carter Butts (network and sna).


The Statnet Project

All Statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org

  • Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserve.

  • Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.

  • Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).

  • For all other issues, please email us at contact@statnet.org.


Introduction to this workshop/tutorial.

This workshop and tutorial provide an introduction to statistical modeling of network data with exponential-family random graph models (ERGMs) using statnet software. This online tutorial is also designed for self-study, with example code and self-contained data. The statnet packages we will be demonstrating are:

  • network — storage and manipulation of network data
  • ergm — statistical tools for estimating ERGMs, model assessment, and network simulation.

The ergm package has more advanced functionality that is not covered in this workshop. An overview can be found in this preprint.

Prerequisites

This workshop assumes basic familiarity with R; experience with network concepts, terminology and data; and familiarity with the general framework for statistical modeling and inference. While previous experience with ERGMs is not required, some of the topics covered here may be difficult to understand without a strong background in linear and generalized linear models in statistics.

Software installation

Minimally, you will need to install the latest version of R (available here) and the statnet packages ergm and network to run the code presented here (ergm will automatically install network when it is loaded).
The workshops are conducted using the free version of Rstudio (available here).

The full set of installation intructions with details can be found on the statnet workshop wiki.

If you have not already downloaded the statnet packages for this workshop, the quickest way to install these (and the other most commonly used packages from the statnet suite), is to open an R session and type:

install.packages('ergm')
library(ergm)
Loading required package: network

'network' 1.17.2 (2022-05-20), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information

'ergm' 4.2.2 (2022-06-01), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.

You can check the version number with:

packageVersion("ergm")
[1] '4.2.2'

Throughout, we will set a random seed via set.seed() for commands in tutorial that require simulating random values—this is not necessary, but it ensures that you will get the same results as this tutorial (assuming that you are using the same ergm version or at least a version in which the algorithms you are using have not changed).

1. Statistical network modeling with ERGMs

Here we provide only a brief overview of the modeling framework, as the primary purpose of this tutorial is to show how to implement statistical analysis of network data with ERGMs using the statnet software tools, rather than to explain the framework in detail. For more details, and to really understand ERGMs, please see the references at the end of this tutorial.

Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:

  • Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;

  • Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;

  • Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and

  • Simulate new networks from the underlying probability distribution implied by the fitted model.

The general form for an ERGM

ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is

\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]

where

  • \(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,

  • \(g(y)\) is a vector of model statistics for network \(y\),

  • \(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))

  • \(\theta\) is the vector of coefficients for those statistics, and

  • \(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.

In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).

If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:

\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.

\(h(y)\) is important for specifying model behavior for valued ERGMs, for controlling for network size in extrapolative or multi-network settings, and other applications. When modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the counting measure). We will adopt that convention here. Under the counting measure, the baseline behavior of an ERGM approaches a uniform random graph when \(\theta \to 0\), and the ERGM terms serve to modify this baseline distribution. We will see many examples below.

The model statistics \(g(y)\): ERGM terms

The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models, but they are different in one key respect:


These \(g(y)\) statistics are functions of the network itself — each represents the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.


As a result, every term in an ERGM must have an associated algorithm for computing its value for your network. The ergm package in statnet includes about 150 term-computing algorithms, and you can code up your own with our helper package. We will explore some of the most commonly used terms in this tutorial, and links to more information are provided in section 3.

Terms can be grouped in different ways, like the type of network they can be used on (e.g., binary, bipartite, directed, etc.), the type of covariate they represent (continuous, dyad-dependent, etc.) and their purpose (term operator, soft constraint, etc). In Statnet, these groupings are called “keywords”, and they provide a useful way to quickly find help for terms used in different contexts.


One key property of model terms is worth keeping in mind: terms are either dyad independent or dyad dependent.


Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. Dyad dependent terms have very different effects, and much of what is different about network models comes from these terms. They introduce complex cascading effects that can often lead to counter-intuitive and highly non-linear outcomes. In addition, a model with at least one dyad dependent term requires a different estimation algorithm, so when we use these terms below you will see some different components in the output.

Help for ERGM terms

There are several resources in the ergm package that provide a range of help and instructions for using these terms:

The main ergm terms reference page

Accessed via:

?ergmTerms

This contains an overview of model specification, some links to additional references and a complete up-to-date list of available terms with short descriptions and a link to the man page for that term.

Man pages for specific terms

If you know the name of the term, and are looking for help with syntax, you can type either

`help("[name]-ergmTerm")` 

or the shorthand version

`ergmTerm?[name]`

where [name] is the name of the term.

Keywords

Keywords are useful for narrowing down your search to find terms that have specific functionality. Available keywords and their meanings can be obtained by typing

?ergmKeyword`

You can search for terms with keywords, for example:

search.ergmTerms(keyword='curved')
Found  11  matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
    Alternating k-star

gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution for the first mode in a bipartite network

gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition

gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution for the second mode in a bipartite network

gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition

gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution

gwdsp(decay, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted dyadwise shared partner distribution

gwesp(decay, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted edgewise shared partner distribution

gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted in-degree distribution

gwnsp(decay, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted nonedgewise shared partner distribution

gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted out-degree distribution

Cross-reference guide

This guide is a good place to find a tables that summarize the keywords for terms (with tables for commonly used terms, operator terms, and all terms), definitions for all terms, and a list of keywords with the terms they refer to.

vignette("ergm-term-crossRef")

Note: When using RStudio, it is possible to press the tab key after starting a line with ?ergm to view the wide range of possible help options beginning with the letters ergm.

ERGM probabilities: at the tie level

The ERGM expression for the probability of the entire graph that was shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors:

\[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where

  • \(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and

  • \(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\) except for \(y_{ij}\).

  • \(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So

\[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \]

where

  • \(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and

  • \(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.

So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.

In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.

We will see exactly how this works in the sections that follow.

Loading network data

Network data can come in many different forms — ties can be stored as edgelists or sociomatrices in .csv files, or as exported data from other programs like Pajek. Attributes for the nodes, ties, and dyads can also come in various forms. All can be read into R using either standard R tools (e.g., for .csv files), or methods from the network package. For more information, refer to the following:

?read.paj
?read.paj.simplify
?loading.attributes

However you read them in, the data will need to be transformed into a network object, the format that Statnet packages use to store and work with network data. For information on how to do this, refer to:

?network

The ergm package also contains several network data sets, and we will use those here for demonstration purposes.

data(package='ergm') # tells us the datasets in our packages

We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.

set.seed(1) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print(flomarriage): Examine properties
 Network attributes:
  vertices = 16 
  directed = FALSE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 20 
    missing edges= 0 
    non-missing edges= 20 

 Vertex attribute names: 
    priorates totalties vertex.names wealth 

No edge attributes
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area

# Plot the network, saving the coordinates for the next plot
coords <- plot(flomarriage, 
     main="Florentine Marriage", 
     cex.main=0.8, 
     label = network.vertex.names(flomarriage), 
     label.cex=0.4,
     pad=3) # Equivalent to plot.network(...)

wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes, equivalent to get.vertex.attribute(flomarriage, "wealth")
wealth
 [1]  10  36  55  44  20  32   8  42 103  48  49   3  27  10 146  48
# Plot with vertex size proportional to wealth
plot(flomarriage, coord=coords, jitter=FALSE,
     vertex.cex=wealth/25, 
     main="Florentine marriage by wealth", cex.main=0.8,
     pad=0.5)