*Last updated 2023-12-04*

*This tutorial is a joint product of the Statnet Development
Team:*

Pavel N. Krivitsky (University of New South Wales)

Martina Morris (University of Washington)

Mark S. Handcock (University of California, Los Angeles)

Carter T. Butts (University of California, Irvine)

David R. Hunter (Penn State University)

Steven M. Goodreau (University of Washington)

Chad Klumb (University of Washington)

Skye Bender de-Moll (Oakland, CA)

Michał Bojanowski (Kozminski University, Poland)

The network modeling software demonstrated in this tutorial is
authored by Pavel Krivitsky (`ergm`

) and Carter Butts
(`network`

and `sna`

).

`statnet`

ProjectAll `statnet`

packages are open-source, written for the
**R** computing environment, and published on CRAN. The
source repositories are hosted on GitHub. Our website is statnet.org

Need help? For general questions and comments, please email the statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: statnet_help listserve.

Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.

Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).

For all other issues, please email us at contact@statnet.org.

This workshop and tutorial provide an introduction to statistical
modeling of network data with *exponential-family random graph
models* (ERGMs) using `statnet`

software. This online
tutorial is also designed for self-study, with example code and
self-contained data. The `statnet`

packages we will be
demonstrating are:

`network`

— storage and manipulation of network data`ergm`

— statistical tools for estimating ERGMs, model assessment, and network simulation.

The `ergm`

package has more advanced functionality that is
not covered in this workshop. An overview can be found in this preprint.

This workshop assumes basic familiarity with **R**;
experience with network concepts, terminology and data; and familiarity
with the general framework for statistical modeling and inference. While
previous experience with ERGMs is not required, some of the topics
covered here may be difficult to understand without a strong background
in linear and generalized linear models in statistics.

Minimally, you will need to install the latest version of
**R** (available
here) and the `statnet`

packages `ergm`

and
`network`

to run the code presented here (`ergm`

will automatically install `network`

when it is
loaded).

The workshops are conducted using the free version of
`Rstudio`

(available
here).

The full set of installation intructions with details can be found on
the `statnet`

workshop wiki.

If you have not already downloaded the `statnet`

packages
for this workshop, the quickest way to install these (and the other most
commonly used packages from the `statnet`

suite), is to open
an R session and type:

`Loading required package: network`

```
'network' 1.18.1 (2023-01-24), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
```

```
'ergm' 4.5.0 (2023-05-27), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
```

```
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
```

You can check the version number with:

`[1] '4.5.0'`

Throughout, we will set a random seed via `set.seed()`

for
commands in tutorial that require simulating random values—this is not
necessary, but it ensures that you will get the same results as this
tutorial (assuming that you are using the same `ergm`

version
or at least a version in which the algorithms you are using have not
changed).

Here we provide only a brief overview of the modeling framework, as
the primary purpose of this tutorial is to show how to implement
statistical analysis of network data with ERGMs using the
`statnet`

software tools, rather than to explain the
framework in detail. For more details, and to really understand ERGMs,
please see the references at
the end of this tutorial.

Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:

Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;

Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;

Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and

Simulate new networks from the underlying probability distribution implied by the fitted model.

ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is

\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]

where

\(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,

\(g(y)\) is a vector of model statistics for network \(y\),

\(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))

\(\theta\) is the vector of coefficients for those statistics, and

\(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.

In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).

If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:

\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.

\(h(y)\) is important for specifying
model behavior for valued ERGMs, for controlling for network size in
extrapolative or multi-network settings, and other applications. When
modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the *counting
measure*). We will adopt that convention here. Under the counting
measure, the baseline behavior of an ERGM approaches a uniform random
graph when \(\theta \to 0\), and the
ERGM terms serve to modify this baseline distribution. We will see many
examples below.

The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models. But they are different in one important respect: these \(g(y)\) statistics are functions of the network itself — each is defined by the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.

As a result, every term in an ERGM must have an associated algorithm
for computing its value for your network. The `ergm`

package
in `statnet`

includes about 150 term-computing algorithms. We
will explore some of these terms in this tutorial, and links to more
information are provided in section
3.

You can get an up-to-date list of all available terms, and the syntax
for using them, by typing `?ergmTerm`

. When using RStudio, it
is possible to press the tab key after starting a line with
`?ergm`

to view the wide range of possible help options
beginning with the letters `ergm`

.

Available keywords and their meanings can be obtained by typing
`?ergmKeyword`

. You can also search for terms with keywords,
as in

```
Found 8 matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
Alternating k-star
gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the first mode in a bipartite network
gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition
gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the second mode in a bipartite network
gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition
gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution
gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted in-degree distribution
gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted out-degree distribution
```

To obtain help for a specific term, use either
`help("[name]-ergmTerm")`

or the shorthand version
`ergmTerm?[name]`

, where `[name]`

is the name of
the term.

For more guidance on ergm terms, there is a vignette in the
`ergm`

package entitled `ergm-term-crossRef.Rmd`

that can be compiled as an RMarkdown document.

One key categorization of model terms is worth keeping in mind: terms
are either *dyad independent* or *dyad dependent*. Dyad
*independent* terms (like nodal homophily terms) imply no
dependence between dyads—the presence or absence of a tie may depend on
nodal attributes, but not on the state of other ties. Dyad
*dependent* terms (like degree terms, or triad terms), by
contrast, imply dependence between dyads. Dyad dependent terms have very
different effects, and much of what is different about network models
comes from these terms. They introduce complex cascading effects that
can often lead to counter-intuitive and highly non-linear outcomes. In
addition, a model with at least one dyad dependent term requires a
different estimation algorithm, so when we use these terms below you
will see some different components in the output.

The ERGM expression for the probability of the entire graph shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors: \[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where

\(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and

\(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\)

*except for*\(y_{ij}\).\(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So \[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \] where

\(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and

\(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.

So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.

In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.

We will see exactly how this works in the sections that follow.

Network data can come in many different forms — ties can be stored as
edgelists or sociomatrices in `.csv`

files, or as exported
data from other programs like Pajek. Attributes for the nodes, ties, and
dyads can also come in various forms. All can be read into *R*
using either standard R tools (e.g., for `.csv`

files), or
methods from the `network`

package. For more information,
refer to the following:

`starting httpd help server ... done`

However you read them in, the data will need to be transformed into a
`network`

object, the format that Statnet packages use to
store and work with network data. For information on how to do this,
refer to:

The `ergm`

package also contains several network data
sets, and we will use those here for demonstration purposes.

We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.

```
set.seed(123) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print.network(flomarriage): Examine properties
```

```
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 20
missing edges= 0
non-missing edges= 20
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
```

```
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area
plot(flomarriage,
main="Florentine Marriage",
cex.main=0.8,
label = network.vertex.names(flomarriage)) # Equivalent to plot.network(...)
wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes
wealth
```

` [1] 10 36 55 44 20 32 8 42 103 48 49 3 27 10 146 48`

`summary`

and `ergm`

functions, and
supporting functionsWe’ll start by running some simple models to demonstrate the most commonly used functions for ERG modeling.

The syntax for specifying a model in the `ergm`

package
follows **R**’s formula convention:

\[ \mbox{my.network} \sim \mbox{my.model.terms} \]

This syntax is used for both the `summary`

and
`ergm`

functions. The `summary`

function simply
returns the numerical values of the network statistics in the model. The
`ergm`

function estimates the model with those
statistics.

It is good practice to run a `summmary`

command on any
model before fitting it with `ergm`

. This is the ERGM
equivalent of performing some descriptive analysis on your covariates.
This can help you make sure you understand what the term represents, and
it can help to flag potential problems that will lead to poor modeling
results. We will now demonstrate the `summary`

and
`ergm`

commands using a simple model.

We begin with a simple model, containing only one term that
represents the total number of edges in the network, \(\sum{y_{ij}}\). The name of this ergm-term
is `edges`

, and when included in an ERGM its coefficient
controls the overall density of the network.

```
edges
20
```

`Starting maximum pseudolikelihood estimation (MPLE):`

`Obtaining the responsible dyads.`

`Evaluating the predictor and response matrix.`

`Maximizing the pseudolikelihood.`

`Finished MPLE.`

`Evaluating log-likelihood at the estimate. `

```
Call:
ergm(formula = flomarriage ~ edges)
Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
edges -1.6094 0.2449 0 -6.571 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null Deviance: 166.4 on 120 degrees of freedom
Residual Deviance: 108.1 on 119 degrees of freedom
AIC: 110.1 BIC: 112.9 (Smaller is better. MC Std. Err. = 0)
```

This simple model specifies a single homogeneous probability for all
ties, which is captured by the coefficient of the `edges`

term. How should we interpret the above estimate \(\hat\theta\) of this coefficient? The
easiest way is to return to the logit form of the ERGM. The log-odds
that a tie—*any* tie, since the change statistic for the
`edges`

term equals one for all \(y_{ij}\)—is present is \[\begin{align*}
\operatorname{logit}(p) &= \hat\theta \times \delta_{ij}(y) \\
& = -1.61 \times \mbox{change in $g(y)$ when
$y_{ij}$ goes from 0 to 1}\\
& = -1.61 \times 1.
\end{align*}\] Do you see why \(\delta_{ij}(y)=1\) no matter which \(i\) and \(j\) you specify?

To convert \(\mbox{logit}(p)\) to \(p\), we take the inverse logit of \(\hat\theta\): \[\begin{align*} & = \exp(-1.61)/ (1+\exp(-1.61))\\ & = 0.167 \end{align*}\] This probability corresponds to the density we observe in the flomarriage network: there are \(20\) ties and \(\binom{16}{2} = (16 \times 15)/2 = 120\) dyads, so the density of ties is \(20/120 = 0.167\).

Let’s add a term often thought to be a measure of “clustering”: the
number of completed triangles in the network, or \(\frac13\sum{y_{ij}y_{ik}y_{jk}}\). The name
for this ergm-term is `triangle`

.

This is an example of a dyad dependent term, as the status of any
triangle containing dyad \(y_{ij}\)
depends on the status of dyads of the form \(y_{ik}\) and \(y_{jk}\). This means that any model
containing the ergm-term `triangle`

has the property that
dyads are not probabilistically independent of one another. As a result,
`ergm`

automatically uses its stochastic MCMC-based
estimation algorithm, so your results may differ slightly unless you use
the same `set.seed`

value:

```
edges triangle
20 3
```