*Last updated: June 23, 2015*

*This tutorial is a joint product of the Statnet Development Team:*

Mark S. Handcock (University of California, Los Angeles)

Carter T. Butts (University of California, Irvine)

David R. Hunter (Penn State University)

Steven M. Goodreau (University of Washington)

Skye Bender-deMoll (Oakland)

Pavel N. Krivitsky (University of Wollongong)

Martina Morris (University of Washington)

For general questions and comments, please refer to the `statnet` wiki and the `statnet` users group and mailing list: http://statnet.org/statnet_users_group.shtml

The package `ergm.ego` is still in development and has not yet been released to CRAN. For this workshop, you will therefore download the `ergm.ego` package from the statnet repository.

Open an R session, and set your working directory to the location where you would like to save this work.

To install all of the CRAN packages in the statnet suite:

```
install.packages('statnet', repos='http://statnet.csde.washington.edu')
library(statnet)
```

To only install the specific statnet packages needed for this tutorial:

`install.packages('ergm.ego',repos='http://statnet.csde.washington.edu')`

The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.

In many empirical contexts, it is not feasible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.

Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here is to combine what is observed, with assumptions, to define a class of models that represent a distribution of networks that are centered on the observed properties. The variation in these networks quantifies some of the uncertainty introduced by the assumptions.

The package comprises:

- a set of utilities to manage the data,
- a set of “ergm terms” that can be used in models,
- and a set of functions for estimation and inference that rely largely on the existing `ergm` package, but include the specific modifications needed in the egocentric data context.

The package is designed to work with the other statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from `ergm`, use `simulate` to simulate complete network realizations from the model, use the network descriptives from `sna` to explore the properties of the network, and use other **R** functions and packages as well after converting the network data structure into a data frame.

Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.
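As a rough illustration of what "appropriately scaled" can mean, the base-R sketch below rescales one simple statistic, the edge count, to networks of different sizes. The degree reports are made up, and the sketch assumes an equal-probability sample of egos; it does not use `ergm.ego` itself.

```r
# Hypothetical egocentric sample: number of alters reported by each of 10 egos.
degree_sample <- c(0, 1, 1, 2, 0, 3, 1, 2, 1, 1)
mean_degree <- mean(degree_sample)  # per-capita statistic observed in the sample

# Each tie is reported by both of its endpoints, so a network of size n with
# the same mean degree is expected to contain n * mean_degree / 2 edges.
scaled_edges <- function(n) n * mean_degree / 2

# Target edge counts for simulated networks of 500 and 5000 nodes.
scaled_edges(c(500, 5000))
```

The per-capita quantity (mean degree) stays fixed while the total scales with the network size, which is why networks of any size can be simulated from the same sample.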

The full technical details on ERGM estimation and inference from egocentrically sampled data are in a paper that is currently under review. The working paper can be found here. This tutorial provides a brief introduction to the key concepts.

ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks, obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness-of-fit; perform various types of model comparison; and simulate additional networks with the underlying probability distribution implied by that model.

The general form for an ERGM can be written as:

\[ P(Y=y)=\frac{\exp(\theta'g(y))}{k(\theta)} \qquad (1) \]

where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(g(y)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(k(\theta)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).
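To make the roles of \(g(y)\), \(\theta\), and \(k(\theta)\) concrete, here is a minimal base-R sketch that evaluates the general form by brute force on a 3-node undirected network, using the edge count as the only statistic. The value of `theta` is arbitrary and chosen purely for illustration; enumeration like this is only feasible for very small networks.

```r
# Enumerate all undirected graphs on 3 nodes: each of the 3 dyads is 0 or 1.
n_dyads <- choose(3, 2)
graphs <- as.matrix(expand.grid(rep(list(0:1), n_dyads)))  # 8 rows, one per graph

# Model statistic g(y): the number of edges (a single, dyad-independent term).
g <- rowSums(graphs)

theta <- -1  # illustrative coefficient, not an estimate

# k(theta): the numerator summed over all possible networks.
k_theta <- sum(exp(theta * g))

# P(Y = y) for every graph in the sample space.
p <- exp(theta * g) / k_theta
sum(p)  # the probabilities sum to 1 over the sample space

# With only an edges term, each dyad is an independent Bernoulli trial with
# tie probability plogis(theta), so the expected edge count can be checked:
expected_edges <- sum(p * g)
all.equal(expected_edges, n_dyads * plogis(theta))
```

The number of possible networks grows as \(2^{\binom{n}{2}}\), so \(k(\theta)\) cannot be computed this way for networks of realistic size.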

The model terms \(g(y)\) are functions of network statistics that we hypothesize may be more or less common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, these statistics must be observable in the sample (more details in section 4.2).

A key distinction in model terms is *dyad independence* or *dyad dependence*. Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are dyad dependent.

Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link tracing designs (e.g., snowball and respondent driven sampling) and egocentric designs.

Link-trace designs have traditionally been used to sample hard-to-reach populations. Sampling begins with a set of seed nodes. The seeds are asked to nominate alters, the alters are then recruited into the sample and asked to nominate their own alters, and so on. Each new set of alters is called a *wave* or a *generation*, and the number of waves of sampling is a study design variable. At each wave, either a census or a sample of the alters may be elicited and/or recruited; this too is a study design variable. Alter recruitment may be investigator-driven or respondent-driven. This gives rise to a wide range of possible link-trace designs. When the decision to elicit a new wave of alters depends on an attribute of the current node (e.g., is this node an injection drug user), the design is called an adaptive sample.

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and then asked to provide information on the characteristics of the alters and/or the ties. The alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.

Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

In the current package, we focus on the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected (see Smith (2012) and Gjoka, Smith, and Butts (2014a) for considerations of when they are). The minimal design is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices.

The literature on statistical estimation and inference from network samples has become very active in the past decade. Handcock and Gile (2010) established a general framework for model-based inference for networks based on sampled data that allows for egocentrically sampled data as a special case: when only dyads incident on those in the sample are observed.

Koskinen and Robins (2010) developed a similar approach in a Bayesian framework. Much of the remaining literature has focused on developing model- or design-based inference for link tracing designs (Thompson and Frank, 2000; Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008; Snijders, 2010; Handcock and Gile, 2010; Tomas and Gile, 2011; Illenberger and Flötteröd, 2012; Pattison et al., 2013).

But, despite the widespread availability of egocentrically sampled network data, the statistical framework for analyzing such data is still relatively undeveloped (see Krivitsky and Morris 2015 for a brief review).

The general likelihood approach of Handcock and Gile (2010), while theoretically applicable, is not feasible to implement for most egocentrically sampled data. The approach requires fitting an ERGM to a network the size of the population from which the egos were sampled, which is often on the order of millions and possibly unknown. When alters are not uniquely identifiable, the likelihood requires integration over the space of networks that produce **exactly** the observed dataset, a more complex constraint. And finally, if the data come from a complex (even just weighted) design, ignorability of the sampling process might not hold, requiring nested integration over the sampling process as well.

Krivitsky and Handcock (2011) developed a method for estimating ERGMs from egocentrically sampled data. This approach is described in the current **ergm** and **tergm** Sunbelt workshops, is used in the **EpiModel** package, and is the foundation of the estimation we build on here.

What has been lacking, however, is a general, rigorous framework for ERGM inference for such data; that is what we focus on here.

Let \(N\) be the population being studied: a very large, but finite, set of actors whose relations are of interest, and let \(x_i\) be a vector of attributes (e.g., age, sex, race) of an actor \(i \in N\), with \(x_N\) (or just \(x\), when there is no ambiguity) being the attributes of actors in \(N\). Let \(Y(N)\) be the set of **dyads** (potential ties) in an undirected network of these actors, let \(y_{ij}\) be an indicator of whether a tie between \(i\) and \(j\) is present in **\(y\)**, and let \(y_i=\{j\in N: y_{ij}=1\}\) be the set of \(i\)’s network neighbors.

Throughout, **\(y\)** will refer to what we will call the **population network**: a fixed but unknown network of relationships of interest.

Now, let \(e_i\) be the view of network **\(y\)** from the point of view of actor \(i\). It comprises \(e_i \equiv x_i\): \(i\)’s own attributes, and \(a_i \equiv (x_{j})_{j\in y_i}\): an unordered list (technically, a multiset) of attribute vectors of \(i\)’s immediate neighbors, but **not** their identities (indices in \(N\)). For convenience, we refer to the \(k\)th attribute/covariate observed on ego \(i\) and its alters as \(e_{i,k}\equiv x_{i,k}\) and \(a_{i,k}\equiv( x_{j,k})_{j\in y_i}\).

Then, \(e_{N}\) represents the **egocentric census**, the information retained by the minimal egocentric sampling design discussed above. The information about **\(y\)** contained in an **egocentric sample** of actors \(S\subseteq N\) can then be represented as \(e_{S}\).
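The notation can be made concrete with a small base-R sketch. The 5-actor network and the single attribute below are made up for illustration, and the helper names (`neighbors`, `ego_view`) are hypothetical rather than part of `ergm.ego`.

```r
# Toy population network y on N = {1, ..., 5}, stored as an undirected edge list.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5))
# One attribute per actor (x_i); the values are made up.
x <- c("F", "M", "F", "M", "M")

# Neighbor set y_i = {j : y_ij = 1}.
neighbors <- function(i) {
  c(edges[edges[, 1] == i, 2], edges[edges[, 2] == i, 1])
}

# Egocentric view of actor i: own attributes plus the alters' attributes,
# sorted so that no information about alter identity (index in N) survives.
ego_view <- function(i) {
  list(ego = x[i], alters = sort(x[neighbors(i)]))
}

# Egocentric census e_N, and an egocentric sample e_S for S = {1, 4}.
e_N <- lapply(seq_along(x), ego_view)
e_S <- e_N[c(1, 4)]

str(e_S)
```

Note that actor 1's view records one "F" alter and one "M" alter, but not that those alters are actors 3 and 2, nor that they are tied to each other.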

Note that the statistical inference here is more straightforward than inference from data on complete networks. For complete networks you have the network census, so, in one sense, there is no inference to be made from the “sample” to the “population”, and no uncertainty to quantify. Statistical inference for complete networks is instead motivated in terms of an unobserved “super-population” from which the observed network was drawn.

With sampled networks, by contrast, the traditional framework of sampling distributions (of the estimates) arising from repeated samples from a fixed population applies.

Model terms are the expressions (e.g. “nodematch”) used to represent predictors on the right-hand side of equations used in:

- calls to `summary` (to obtain measurements of network statistics on a dataset)
- calls to `ergm.ego` (to estimate an ergm model)
- calls to `simulate` (to simulate networks from an ergm model fit)

The terms that can be used in an ERGM depend on the type of network being analyzed (directed or undirected, one-mode or two-mode (“bipartite”), binary or valued edges) and on the statistics that can be observed in the sample.

Even if the whole population is egocentrically observed (i.e., \(S=N\), a census), the alters are still not uniquely identifiable. This limits the kinds of network statistics that can be observed, and the ERGM terms that can be fit to such data. We turn to the notion of sufficiency to identify those that can be.

Define an ERGM of the form in eqn (1) to be **egocentric** if both its sufficient statistic and its sample space constraints (if any) can be recovered from an egocentric census. We discuss them in turn.

We call a network statistic \(g_{k}(\cdot,\cdot)\) **egocentric** if it can be expressed as \[
g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i)
\] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.

The space of egocentric statistics includes **dyadic-independent** statistics that can be expressed in the general form of \[
g_{k}(y,x)=\sum_{ij\in y} f_k(x_i,x_j)
\] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some **dyadic-dependent** statistics that can be expressed as \[
g_{k}(y,x)=\sum_{i\in N} f_k ({x_{i},(x_j)_{j\in y_i}})
\] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.
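The following base-R sketch checks these identities on a made-up 5-actor network: statistics computed from the full edge list agree with sums of per-ego contributions \(h_k(e_i)\) that use only each ego's own attribute and the attributes of its alters. The data and helper names are hypothetical and do not come from `ergm.ego`.

```r
# Toy population network (undirected edge list) and one nodal attribute.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5))
x <- c("F", "M", "F", "M", "M")
n <- length(x)

# Whole-network statistics, computed from the full edge list.
edges_full <- nrow(edges)
match_full <- sum(x[edges[, 1]] == x[edges[, 2]])  # ties within the same category

# Alter attributes reported by ego i (identities are discarded).
alters_of <- function(i) {
  x[c(edges[edges[, 1] == i, 2], edges[edges[, 2] == i, 1])]
}

# Dyad-independent terms: every tie is reported by both of its endpoints,
# so each ego contributes half of f_k(x_i, x_j) for each tie it reports.
h_edges <- function(i) length(alters_of(i)) / 2
h_match <- function(i) sum(alters_of(i) == x[i]) / 2

# A dyad-dependent term that is still egocentric: ego i has degree exactly 1.
h_deg1 <- function(i) as.numeric(length(alters_of(i)) == 1)

edges_ego <- sum(sapply(seq_len(n), h_edges))
match_ego <- sum(sapply(seq_len(n), h_match))
deg1_ego <- sum(sapply(seq_len(n), h_deg1))

c(edges = edges_ego, match = match_ego, degree1 = deg1_ego)
```

A statistic such as the triangle count has no such decomposition under the minimal design, because it requires knowing whether an ego's alters are tied to each other.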

Egocentric statistics induce at most Markov graph dependence (Frank and Strauss, 1986) and are **local** by the definition of Krivitsky and Handcock (2011).