Last updated 2023-12-04

This tutorial is a joint product of the Statnet Development Team:

Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender de-Moll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)

The network modeling software demonstrated in this tutorial is authored by Pavel Krivitsky (ergm) and Carter Butts (network and sna).


The statnet Project

All statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org

  • Need help? For general questions and comments, please email the statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: statnet_help listserve.

  • Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.

  • Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).

  • For all other issues, please email us at contact@statnet.org.


Introduction to this workshop/tutorial.

This workshop and tutorial provide an introduction to statistical modeling of network data with exponential-family random graph models (ERGMs) using statnet software. This online tutorial is also designed for self-study, with example code and self-contained data. The statnet packages we will be demonstrating are:

  • network — storage and manipulation of network data
  • ergm — statistical tools for estimating ERGMs, model assessment, and network simulation.

The ergm package has more advanced functionality that is not covered in this workshop. An overview can be found in this preprint.

Prerequisites

This workshop assumes basic familiarity with R; experience with network concepts, terminology and data; and familiarity with the general framework for statistical modeling and inference. While previous experience with ERGMs is not required, some of the topics covered here may be difficult to understand without a strong background in linear and generalized linear models in statistics.

Software installation

Minimally, you will need to install the latest version of R (available here) and the statnet packages ergm and network to run the code presented here (ergm will automatically install network when it is loaded).
The workshops are conducted using the free version of Rstudio (available here).

The full set of installation intructions with details can be found on the statnet workshop wiki.

If you have not already downloaded the statnet packages for this workshop, the quickest way to install these (and the other most commonly used packages from the statnet suite), is to open an R session and type:

install.packages('ergm')
library(ergm)
Loading required package: network

'network' 1.18.1 (2023-01-24), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information

'ergm' 4.5.0 (2023-05-27), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.

You can check the version number with:

packageVersion("ergm")
[1] '4.5.0'

Throughout, we will set a random seed via set.seed() for commands in tutorial that require simulating random values—this is not necessary, but it ensures that you will get the same results as this tutorial (assuming that you are using the same ergm version or at least a version in which the algorithms you are using have not changed).

1. Statistical network modeling with ERGMs

Here we provide only a brief overview of the modeling framework, as the primary purpose of this tutorial is to show how to implement statistical analysis of network data with ERGMs using the statnet software tools, rather than to explain the framework in detail. For more details, and to really understand ERGMs, please see the references at the end of this tutorial.

Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:

  • Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;

  • Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;

  • Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and

  • Simulate new networks from the underlying probability distribution implied by the fitted model.

The general form for an ERGM

ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is

\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]

where

  • \(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,

  • \(g(y)\) is a vector of model statistics for network \(y\),

  • \(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))

  • \(\theta\) is the vector of coefficients for those statistics, and

  • \(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.

In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).

If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:

\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.

\(h(y)\) is important for specifying model behavior for valued ERGMs, for controlling for network size in extrapolative or multi-network settings, and other applications. When modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the counting measure). We will adopt that convention here. Under the counting measure, the baseline behavior of an ERGM approaches a uniform random graph when \(\theta \to 0\), and the ERGM terms serve to modify this baseline distribution. We will see many examples below.

The model statistics \(g(y)\): ERGM terms

The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models. But they are different in one important respect: these \(g(y)\) statistics are functions of the network itself — each is defined by the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.

As a result, every term in an ERGM must have an associated algorithm for computing its value for your network. The ergm package in statnet includes about 150 term-computing algorithms. We will explore some of these terms in this tutorial, and links to more information are provided in section 3.

You can get an up-to-date list of all available terms, and the syntax for using them, by typing ?ergmTerm. When using RStudio, it is possible to press the tab key after starting a line with ?ergm to view the wide range of possible help options beginning with the letters ergm.

Available keywords and their meanings can be obtained by typing ?ergmKeyword. You can also search for terms with keywords, as in

search.ergmTerms(keyword='curved')
Found  8  matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
    Alternating k-star

gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution for the first mode in a bipartite network

gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition

gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution for the second mode in a bipartite network

gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
    Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition

gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted degree distribution

gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted in-degree distribution

gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
    Geometrically weighted out-degree distribution

To obtain help for a specific term, use either help("[name]-ergmTerm") or the shorthand version ergmTerm?[name], where [name] is the name of the term.

For more guidance on ergm terms, there is a vignette in the ergm package entitled ergm-term-crossRef.Rmd that can be compiled as an RMarkdown document.

One key categorization of model terms is worth keeping in mind: terms are either dyad independent or dyad dependent. Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. Dyad dependent terms have very different effects, and much of what is different about network models comes from these terms. They introduce complex cascading effects that can often lead to counter-intuitive and highly non-linear outcomes. In addition, a model with at least one dyad dependent term requires a different estimation algorithm, so when we use these terms below you will see some different components in the output.

ERGM probabilities: at the tie level

The ERGM expression for the probability of the entire graph shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors: \[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where

  • \(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and

  • \(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\) except for \(y_{ij}\).

  • \(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So \[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \] where

  • \(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and

  • \(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.

So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.

In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.

We will see exactly how this works in the sections that follow.

Loading network data

Network data can come in many different forms — ties can be stored as edgelists or sociomatrices in .csv files, or as exported data from other programs like Pajek. Attributes for the nodes, ties, and dyads can also come in various forms. All can be read into R using either standard R tools (e.g., for .csv files), or methods from the network package. For more information, refer to the following:

?read.paj
starting httpd help server ... done
?read.paj.simplify
?loading.attributes

However you read them in, the data will need to be transformed into a network object, the format that Statnet packages use to store and work with network data. For information on how to do this, refer to:

?network

The ergm package also contains several network data sets, and we will use those here for demonstration purposes.

data(package='ergm') # tells us the datasets in our packages

We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.

set.seed(123) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print.network(flomarriage): Examine properties
 Network attributes:
  vertices = 16 
  directed = FALSE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 20 
    missing edges= 0 
    non-missing edges= 20 

 Vertex attribute names: 
    priorates totalties vertex.names wealth 

No edge attributes
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area
plot(flomarriage, 
     main="Florentine Marriage", 
     cex.main=0.8, 
     label = network.vertex.names(flomarriage)) # Equivalent to plot.network(...)
wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes
wealth
 [1]  10  36  55  44  20  32   8  42 103  48  49   3  27  10 146  48
plot(flomarriage, 
     vertex.cex=wealth/25, # Make vertex size proportional to wealth attribute
     main="Florentine marriage by wealth", cex.main=0.8) 

The summary and ergm functions, and supporting functions

We’ll start by running some simple models to demonstrate the most commonly used functions for ERG modeling.

The syntax for specifying a model in the ergm package follows R’s formula convention:

\[ \mbox{my.network} \sim \mbox{my.model.terms} \]

This syntax is used for both the summary and ergm functions. The summary function simply returns the numerical values of the network statistics in the model. The ergm function estimates the model with those statistics.

It is good practice to run a summmary command on any model before fitting it with ergm. This is the ERGM equivalent of performing some descriptive analysis on your covariates. This can help you make sure you understand what the term represents, and it can help to flag potential problems that will lead to poor modeling results. We will now demonstrate the summary and ergm commands using a simple model.

A Bernoulli (“Erdős/Rényi”) model

We begin with a simple model, containing only one term that represents the total number of edges in the network, \(\sum{y_{ij}}\). The name of this ergm-term is edges, and when included in an ERGM its coefficient controls the overall density of the network.

summary(flomarriage ~ edges) # Calculate the edges statistic for this network
edges 
   20 
flomodel.01 <- ergm(flomarriage ~ edges) # Estimate the model 
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Evaluating log-likelihood at the estimate. 
summary(flomodel.01) # Look at the fitted model object
Call:
ergm(formula = flomarriage ~ edges)

Maximum Likelihood Results:

      Estimate Std. Error MCMC % z value Pr(>|z|)    
edges  -1.6094     0.2449      0  -6.571   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     Null Deviance: 166.4  on 120  degrees of freedom
 Residual Deviance: 108.1  on 119  degrees of freedom
 
AIC: 110.1  BIC: 112.9  (Smaller is better. MC Std. Err. = 0)

This simple model specifies a single homogeneous probability for all ties, which is captured by the coefficient of the edges term. How should we interpret the above estimate \(\hat\theta\) of this coefficient? The easiest way is to return to the logit form of the ERGM. The log-odds that a tie—any tie, since the change statistic for the edges term equals one for all \(y_{ij}\)—is present is \[\begin{align*} \operatorname{logit}(p) &= \hat\theta \times \delta_{ij}(y) \\ & = -1.61 \times \mbox{change in $g(y)$ when $y_{ij}$ goes from 0 to 1}\\ & = -1.61 \times 1. \end{align*}\] Do you see why \(\delta_{ij}(y)=1\) no matter which \(i\) and \(j\) you specify?

To convert \(\mbox{logit}(p)\) to \(p\), we take the inverse logit of \(\hat\theta\): \[\begin{align*} & = \exp(-1.61)/ (1+\exp(-1.61))\\ & = 0.167 \end{align*}\] This probability corresponds to the density we observe in the flomarriage network: there are \(20\) ties and \(\binom{16}{2} = (16 \times 15)/2 = 120\) dyads, so the density of ties is \(20/120 = 0.167\).

Triad formation

Let’s add a term often thought to be a measure of “clustering”: the number of completed triangles in the network, or \(\frac13\sum{y_{ij}y_{ik}y_{jk}}\). The name for this ergm-term is triangle.

This is an example of a dyad dependent term, as the status of any triangle containing dyad \(y_{ij}\) depends on the status of dyads of the form \(y_{ik}\) and \(y_{jk}\). This means that any model containing the ergm-term triangle has the property that dyads are not probabilistically independent of one another. As a result, ergm automatically uses its stochastic MCMC-based estimation algorithm, so your results may differ slightly unless you use the same set.seed value:

set.seed(321)
summary(flomarriage~edges+triangle) # Look at the g(y) statistics for this model
   edges triangle 
      20        3 
flomodel.02 <- ergm(flomarriage~<