*Last updated 2024-06-24*

*This tutorial is a joint product of the Statnet Development
Team:*

Pavel N. Krivitsky (University of New South Wales)

Martina Morris (University of Washington)

Mark S. Handcock (University of California, Los Angeles)

Carter T. Butts (University of California, Irvine)

David R. Hunter (Penn State University)

Steven M. Goodreau (University of Washington)

Chad Klumb (University of Washington)

Skye Bender de-Moll (Oakland, CA)

Michał Bojanowski (Kozminski University, Poland)

The network modeling software demonstrated in this tutorial is
authored by Pavel Krivitsky (`ergm`

) and Carter Butts
(`network`

and `sna`

).

All Statnet packages are open-source, written for the
**R** computing environment, and published on CRAN. The
source repositories are hosted on GitHub. Our website is statnet.org

Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserve.

Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.

Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).

For all other issues, please email us at contact@statnet.org.

This workshop and tutorial provide an introduction to statistical
modeling of network data with *exponential-family random graph
models* (ERGMs) using `statnet`

software. This online
tutorial is also designed for self-study, with example code and
self-contained data. The `statnet`

packages we will be
demonstrating are:

`network`

— storage and manipulation of network data`ergm`

— statistical tools for estimating ERGMs, model assessment, and network simulation.

The `ergm`

package has more advanced functionality that is
not covered in this workshop. An overview can be found in this preprint.

This workshop assumes basic familiarity with **R**;
experience with network concepts, terminology and data; and familiarity
with the general framework for statistical modeling and inference. While
previous experience with ERGMs is not required, some of the topics
covered here may be difficult to understand without a strong background
in linear and generalized linear models in statistics.

Minimally, you will need to install the latest version of
**R** (available
here) and the `statnet`

packages `ergm`

and
`network`

to run the code presented here (`ergm`

will automatically install `network`

when it is
loaded).

The workshops are conducted using the free version of
`Rstudio`

(available
here).

The full set of installation intructions with details can be found on
the `statnet`

workshop wiki.

If you have not already downloaded the `statnet`

packages
for this workshop, the quickest way to install these (and the other most
commonly used packages from the `statnet`

suite), is to open
an R session and type:

`Loading required package: network`

```
'network' 1.17.2 (2022-05-20), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
```

```
'ergm' 4.2.2 (2022-06-01), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
```

```
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
```

You can check the version number with:

`[1] '4.2.2'`

Throughout, we will set a random seed via `set.seed()`

for
commands in tutorial that require simulating random values—this is not
necessary, but it ensures that you will get the same results as this
tutorial (assuming that you are using the same `ergm`

version
or at least a version in which the algorithms you are using have not
changed).

Here we provide only a brief overview of the modeling framework, as
the primary purpose of this tutorial is to show how to implement
statistical analysis of network data with ERGMs using the
`statnet`

software tools, rather than to explain the
framework in detail. For more details, and to really understand ERGMs,
please see the references at
the end of this tutorial.

Exponential-family random graph models (ERGMs) are a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks:

Define a model for a network that includes covariates representing features like homophily, mutuality, triad effects, and a wide range of other structural features of interest;

Obtain maximum-likehood estimates for the parameters of the specified model for a given data set;

Test individual coefficients, assess models for convergence and goodness-of-fit, and perform various types of model comparison; and

Simulate new networks from the underlying probability distribution implied by the fitted model.

ERGMs are a class of models, superficially resembling linear regression or GLMs. The general form of the model specifies the probability of the entire network (the left hand side), as a function of terms that represent network features we hypothesize may occur more or less likely than expected by chance (the right hand side). The general form of the model is

\[ \Pr(Y=y)=\frac{\exp[\theta^\top g(y)] h(y)}{k(\theta)} \]

where

\(Y\) is the random variable for the state of the network and \(y\) is a particular realization \(Y\) could take,

\(g(y)\) is a vector of model statistics for network \(y\),

\(h(y)\) is the reference measure (which defines the baseline behavior of the model when \(\theta=0\))

\(\theta\) is the vector of coefficients for those statistics, and

\(k(\theta)\) is the summation of the numerator’s value over the set of all possible networks \(y\), typically taken to be all networks with the same node set as the observed network.

In particular, the model implies that the probability attached to a network \(y\) only depends on the network via the vector of statistics \(g(y)\). Among other things, this means that maximum likelihood estimation may be carried out even if we don’t observe the network itself, as long as we know the observed value of \(g(y)\).

If you’re not familiar with the compact notation above, the numerator represents a formula that is linear in the log form:

\[ \log(\exp[\theta^\top g(y)]) = \theta_1g_1(y) + \theta_2g_2(y)+ ... + \theta_pg_p(y) \] where \(p\) is the number of terms in the model. From this one can more easily observe the analogy to a traditional statistical model: the coefficients \(\theta\) represent the size and direction of the effects of the covariates \(g(y)\) on the overall probability of the network.

\(h(y)\) is important for specifying
model behavior for valued ERGMs, for controlling for network size in
extrapolative or multi-network settings, and other applications. When
modeling single, unvalued networks, however, it is common to take \(h(y)\propto 1\) (the *counting
measure*). We will adopt that convention here. Under the counting
measure, the baseline behavior of an ERGM approaches a uniform random
graph when \(\theta \to 0\), and the
ERGM terms serve to modify this baseline distribution. We will see many
examples below.

The statistics \(g(y)\) can be thought of as the “covariates” in the model. In the network modeling context, these represent network features like density, homophily, triads, etc. In one sense, they are like covariates you might use in other statistical models, but they are different in one key respect:

These \(g(y)\) statistics are functions of the network itself — each represents the frequency of a specific configuration of dyads observed in the network — so they are not measured by a question you include in a survey (e.g., the income of a node), but instead need to be computed on the specific network you have, after you have collected the data.

As a result, every term in an ERGM must have an associated algorithm
for computing its value for your network. The `ergm`

package
in `statnet`

includes about 150 term-computing algorithms,
and you can code up your own with our helper package. We will explore
some of the most commonly used terms in this tutorial, and links to more
information are provided in section
3.

Terms can be grouped in different ways, like the type of network they can be used on (e.g., binary, bipartite, directed, etc.), the type of covariate they represent (continuous, dyad-dependent, etc.) and their purpose (term operator, soft constraint, etc). In Statnet, these groupings are called “keywords”, and they provide a useful way to quickly find help for terms used in different contexts.

One key property of model terms is worth keeping in mind: terms are either

dyad independentordyad dependent.

Dyad *independent* terms (like nodal homophily terms) imply no
dependence between dyads—the presence or absence of a tie may depend on
nodal attributes, but not on the state of other ties. Dyad
*dependent* terms (like degree terms, or triad terms), by
contrast, imply dependence between dyads. Dyad dependent terms have very
different effects, and much of what is different about network models
comes from these terms. They introduce complex cascading effects that
can often lead to counter-intuitive and highly non-linear outcomes. In
addition, a model with at least one dyad dependent term requires a
different estimation algorithm, so when we use these terms below you
will see some different components in the output.

There are several resources in the `ergm`

package that
provide a range of help and instructions for using these terms:

*The main ergm terms reference page*

Accessed via:

`?ergmTerms`

This contains an overview of model specification, some links to additional references and a complete up-to-date list of available terms with short descriptions and a link to the man page for that term.

*Man pages for specific terms*

If you know the name of the term, and are looking for help with syntax, you can type either

``help("[name]-ergmTerm")` `

or the shorthand version

``ergmTerm?[name]``

where `[name]`

is the name of the term.

*Keywords*

Keywords are useful for narrowing down your search to find terms that have specific functionality. Available keywords and their meanings can be obtained by typing

`?ergmKeyword``

You can search for terms with keywords, for example:

```
Found 11 matching ergm terms:
altkstar(lambda, fixed=FALSE) (binary)
Alternating k-star
gwb1degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the first mode in a bipartite network
gwb1dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the first bipartition
gwb2degree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution for the second mode in a bipartite network
gwb2dsp(decay=0, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution for dyads in the second bipartition
gwdegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted degree distribution
gwdsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted dyadwise shared partner distribution
gwesp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted edgewise shared partner distribution
gwidegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted in-degree distribution
gwnsp(decay, fixed=FALSE, cutoff=30) (binary)
Geometrically weighted nonedgewise shared partner distribution
gwodegree(decay, fixed=FALSE, attr=NULL, cutoff=30, levels=NULL) (binary)
Geometrically weighted out-degree distribution
```

*Cross-reference guide*

This guide is a good place to find a tables that summarize the keywords for terms (with tables for commonly used terms, operator terms, and all terms), definitions for all terms, and a list of keywords with the terms they refer to.

`vignette("ergm-term-crossRef")`

Note: When using RStudio, it is possible to press the tab key after
starting a line with `?ergm`

to view the wide range of
possible help options beginning with the letters `ergm`

.

The ERGM expression for the probability of the entire graph that was shown above can be re-expressed in terms of the conditional log-odds (that is, the logit of the conditional probability) of a single tie between two actors:

\[ \operatorname{logit}{P(Y_{ij}=1|y^{c}_{ij})=\theta^\top\delta_{ij}(y)}, \] where

\(Y_{ij}\) is the random variable for the state of the actor pair \(i,j\) (with realization \(y_{ij}\)), and

\(y^{c}_{ij}\) signifies the complement of \(y_{ij}\), i.e. the entire network \(y\)

*except for*\(y_{ij}\).\(\delta_{ij}(y)\) is a vector of the “change statistics” for each model term. The change statistic records how the \(g(y)\) term changes if the \(y_{ij}\) tie is toggled from off to on while fixing the rest of the network. So

\[ \delta_{ij}(y) = g(y^{+}_{ij})-g(y^{-}_{ij}), \]

where

\(y^{+}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 1, and

\(y^{-}_{ij}\) is defined as \(y^{c}_{ij}\) along with \(y_{ij}\) set to 0.

So \(\delta_{ij}(y)\) equals the value of \(g(y)\) when \(y_{ij}=1\) minus the value of \(g(y)\) when \(y_{ij}=0\), but all other dyads are as in \(y\). When this vector of change statistics is multiplied by the vector of coefficients \(\theta\), the equation above shows that this dot product is the log-odds of the tie between \(i\) and \(j\), conditional on all other dyads remaining the same.

In other words, for an individual statistic, its change value for \(Y_{ij}\) times its corresponding coefficient can be interpreted as that term’s contribution to the log-odds of that tie, conditional on all other dyads remaining the same.

We will see exactly how this works in the sections that follow.

Network data can come in many different forms — ties can be stored as
edgelists or sociomatrices in `.csv`

files, or as exported
data from other programs like Pajek. Attributes for the nodes, ties, and
dyads can also come in various forms. All can be read into *R*
using either standard R tools (e.g., for `.csv`

files), or
methods from the `network`

package. For more information,
refer to the following:

However you read them in, the data will need to be transformed into a
`network`

object, the format that Statnet packages use to
store and work with network data. For information on how to do this,
refer to:

The `ergm`

package also contains several network data
sets, and we will use those here for demonstration purposes.

We’ll start with Padgett’s data on Renaissance Florentine families for our first example. As with all data analysis, it is good practice to start by summarizing our data using graphical and numerical descriptives.

```
set.seed(1) # The plot.network function uses random values
data(florentine) # loads flomarriage and flobusiness data
flomarriage # Equivalent to print(flomarriage): Examine properties
```

```
Network attributes:
vertices = 16
directed = FALSE
hyper = FALSE
loops = FALSE
multiple = FALSE
bipartite = FALSE
total edges= 20
missing edges= 0
non-missing edges= 20
Vertex attribute names:
priorates totalties vertex.names wealth
No edge attributes
```

```
par(mfrow=c(1,2)) # Set up a 2-column (and 1-row) plot area
# Plot the network, saving the coordinates for the next plot
coords <- plot(flomarriage,
main="Florentine Marriage",
cex.main=0.8,
label = network.vertex.names(flomarriage),
label.cex=0.4,
pad=3) # Equivalent to plot.network(...)
wealth <- flomarriage %v% 'wealth' # %v% references vertex attributes, equivalent to get.vertex.attribute(flomarriage, "wealth")
wealth
```

` [1] 10 36 55 44 20 32 8 42 103 48 49 3 27 10 146 48`