*Last updated: April 05, 2016*

*This tutorial is a joint product of the Statnet Development Team:*

Mark S. Handcock (University of California, Los Angeles)

Carter T. Butts (University of California, Irvine)

David R. Hunter (Penn State University)

Steven M. Goodreau (University of Washington)

Skye Bender-deMoll (Oakland)

Pavel N. Krivitsky (University of Wollongong)

Martina Morris (University of Washington)

For general questions and comments, please refer to the `statnet` wiki and the `statnet` users group and mailing list:

http://statnet.org/statnet_users_group.shtml

Open an R session, and set your working directory to the location where you would like to save this work.

To install all of the CRAN packages in the statnet suite:

```
install.packages('statnet')
library(statnet)
```

To install the `ergm.ego` package:

`install.packages('ergm.ego')`

The `ergm.ego` package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.

In many empirical contexts, it is not feasible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.

Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here is to combine what is observed, with assumptions, to define a class of models that represent a distribution of networks that are centered on the observed properties. The variation in these networks quantifies some of the uncertainty introduced by the assumptions.

The package comprises:

- a set of utilities to manage the data,
- a set of “ergm terms” that can be used in models,
- and a set of functions for estimation and inference that rely largely on the existing `ergm` package, but include the specific modifications needed in the egocentric data context.

The package is designed to work with the other statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from `ergm`, `simulate` to simulate complete network realizations from the model, and the network descriptives from `sna` to explore the properties of the network. You can also use other **R** functions and packages after converting the network data structure into a data frame.

Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.

The full technical details on ERGM estimation and inference from egocentrically sampled data are in a paper that is currently under review. The working paper can be found here. This tutorial provides a brief introduction to the key concepts.

ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks, obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness-of-fit; perform various types of model comparison; and simulate additional networks with the underlying probability distribution implied by that model.

The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(g(y,x)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(\kappa(\theta,x)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).
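To make equation (1) concrete, the following is a minimal sketch in base R that enumerates every undirected network on three nodes and computes its probability by brute force. The choice of statistic (the edge count) and the value of `theta` are assumptions made purely for illustration; this is not how `ergm` or `ergm.ego` compute anything, since enumeration is only feasible for tiny networks.

```r
# Enumerate all undirected graphs on 3 nodes: each of the 3 dyads is 0 or 1.
dyads <- expand.grid(y12 = 0:1, y13 = 0:1, y23 = 0:1)

# Model statistic g(y): the number of edges (an illustrative choice).
g <- unname(rowSums(dyads))

# Hypothetical coefficient value, chosen only for illustration.
theta <- -1

# Numerator exp(theta * g(y)) for every graph; kappa is the sum of numerators.
numerator <- exp(theta * g)
kappa <- sum(numerator)
prob <- numerator / kappa

# The probabilities over the whole sample space sum to 1.
sum(prob)

# In an edges-only model each dyad is an independent Bernoulli draw with
# probability plogis(theta), so the empty graph has probability (1 - p)^3.
p <- plogis(theta)
all.equal(prob[g == 0], (1 - p)^3)
```

The last check works because the edges-only model is the simple random graph mentioned in the text: with no other terms, \(\kappa\) factorizes over dyads.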

The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more or less common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, these statistics must be observable in the sample (more details in section 4.2).

A key distinction in model terms is *dyad independence* or *dyad dependence*. Dyad independent terms (like nodal homophily terms) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree terms, or triad terms), by contrast, imply dependence between dyads. The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are dyad dependent.

Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link tracing designs (e.g., snowball and respondent driven sampling) and egocentric designs.

Link-trace designs have traditionally been used to sample hard-to-reach populations. Sampling begins with a set of seed nodes. The seeds are asked to nominate alters; the alters are then recruited into the sample and asked to nominate their alters, and so on. Each new set of alters is called a *wave* or a *generation*, and the number of waves of sampling is a study design variable. At each wave, either a census or a sample of the alters may be elicited and/or recruited; this too is a study design variable. Alter recruitment may be investigator-driven or respondent-driven. This gives rise to a wide range of possible link-trace designs. When the decision to elicit a new wave of alters depends on an attribute of the current node (e.g., is this node an injection drug user), the design is called an adaptive sample.
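The wave structure described above can be sketched in a few lines of base R. This is a toy illustration under simplifying assumptions: the network, the function name `snowball`, and its arguments are all hypothetical, and every nominated alter is recruited (a census at each wave) rather than a sample.

```r
# Toy undirected network as an adjacency list (hypothetical data).
neighbors <- list(
  a = c("b", "c"), b = c("a", "d"), c = c("a"),
  d = c("b", "e"), e = c("d"), f = character(0)
)

# Snowball sample: recruit every nominated alter, for a fixed number of waves.
snowball <- function(seeds, n_waves) {
  sampled <- seeds
  current <- seeds
  for (wave in seq_len(n_waves)) {
    # Everyone nominated by the actors recruited in the previous wave.
    nominated <- unique(unlist(neighbors[current], use.names = FALSE))
    # Keep only alters who are not already in the sample.
    current <- setdiff(nominated, sampled)
    sampled <- c(sampled, current)
  }
  sampled
}

snowball(seeds = "a", n_waves = 2)
```

Here `n_waves` plays the role of the study design variable mentioned in the text; an adaptive design would add a condition on the current node's attributes before eliciting its alters.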

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and then asked to provide information on the characteristics of the alters and/or the ties. The alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.

Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

For the moment, `ergm.ego` uses the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected. The minimal design is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices. However, development of estimation where alter–alter matrices are available is being planned.

**Handcock and Gile (2010):** Likelihood inference for partially observed networks, has egocentric data as a special case.

**Koskinen and Robins (2010):** Bayesian inference for partially observed networks, has egocentric data as a special case.

- Pros:
  - Can fit any ERGM that can be identified.
  - Can handle link-tracing designs.
- Cons:
  - Requires alters to be identifiable.
  - Cannot take into account sampling weights (unless all attributes that affect sampling weights are part of the model).
  - Might not scale.
  - Requires knowledge of the *population* distribution of actor attributes used in the model.

**Krivitsky and Morris (2015):** Use design-based estimators for sufficient statistics of the ERGM of interest and then transfer their properties to the ERGM estimate.

- Pros:
  - Does not require alters to be identifiable.
  - Borrows directly from design-based inference methods. (Can easily incorporate sampling weights, stratification, etc.)
  - Can fit any ERGM that can be identified (though see below).
  - Can be made invariant to network size for some models.
- Cons:
  - Requires “reimplementation” of the model statistics as “`EgoStats`”: currently does not support alter–alter statistics or directed or bipartite networks.
  - Relies on independent sampling from the population of interest in some form.
  - Cannot be fit to more complex (e.g., RDS) designs.
  - Requires knowledge of the *population* distribution of actor attributes used in the model.

- \(N\): the population being studied, a very large but finite set of actors whose relations are of interest
- \(x_i\): the attribute (e.g., age, sex, race) vector of actor \(i \in N\)
- \(x_N\) (or just \(x\), when there is no ambiguity): the attributes of actors in \(N\)
- \(\mathbb{Y}(N)\): the set of **dyads** (potential ties) in an undirected network of actors in \(N\)
- \(y\subseteq \mathbb{Y}(N)\): the **population network**, a fixed but unknown network (a set of relationships of interest)

In particular,

- \(y_{ij}\): an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\)
- \(y_i=\{j\in N: y_{ij}=1\}\): the set of \(i\)’s network neighbors
- \(e_i\): the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts:
  - \(e^e_i \equiv x_i\): \(i\)’s own attributes
  - \(e^a_i \equiv (x_{j})_{j\in y_i}\): an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but **not** their identities (indices in \(N\))

Also, denote the \(k\)th attribute/covariate observed on ego \(i\) and its alters as \(e^e_{i,k}\equiv x_{i,k}\) and \(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\).
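As an illustration of this notation, the sketch below builds the egocentric view \(e_i\) for every actor in a small toy network using base R. The network, the attribute values, and the helper name `ego_view` are all hypothetical; the point is only that each \(e_i\) keeps the alters' attributes while discarding their indices.

```r
# Toy undirected population network on 4 actors (hypothetical data).
y <- matrix(0, 4, 4)
ties <- rbind(c(1, 2), c(1, 3), c(3, 4))
y[ties] <- 1
y[ties[, 2:1]] <- 1   # symmetrize: ties are undirected

# One attribute per actor (hypothetical values).
x <- c("F", "M", "F", "M")

# Build e_i: the ego's own attribute and the alters' attributes.
# Alter identities (indices in N) are deliberately dropped; sorting
# emphasizes that the alter list carries no ordering information.
ego_view <- function(i) {
  alters <- which(y[i, ] == 1)
  list(ego = x[i], alters = sort(x[alters]))
}

# Applying the view to every actor gives the egocentric census.
e_N <- lapply(seq_along(x), ego_view)

e_N[[1]]
```

Subsetting `e_N` to a set of sampled egos would give the analogue of an egocentric sample.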

Then,

- \(e_{N}\): the **egocentric census**, the information retained by the minimal egocentric sampling design
- \(S\subseteq N\): the set of egos in the sample
- \(e_{S}\): the data contained in an egocentric sample

Egocentric ERGMs are specified the same way as plain `ergm` models: via terms (e.g., `nodematch`) used to represent predictors on the right-hand side of equations used in:

- calls to `summary` (to obtain measurements of network statistics on a dataset)
- calls to `ergm.ego` (to estimate an ERGM)
- calls to `simulate` (to simulate networks from an ERGM fit)

The terms that can be used in an ERGM depend on the type of network being analyzed (directed or undirected, one-mode or two-mode (“bipartite”), binary or valued edges) and on the statistics that can be observed in the sample.

Even if the whole population is egocentrically observed (i.e., \(S=N\), a census), the alters are still not uniquely identifiable. This limits the kinds of network statistics that can be observed, and the ERGM terms that can be fit to such data. We turn to the notion of sufficiency to identify those that can be.

We call a network statistic \(g_{k}(\cdot,\cdot)\) **egocentric** if it can be expressed as \[
g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i)
\] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.
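Because an egocentric statistic is a population total of per-ego contributions \(h_k(e_i)\), it can be estimated from a sample of egos with ordinary survey methods. The sketch below assumes the simplest case, a simple random sample with equal weights, and scales the sample total by \(|N|/|S|\). The per-ego values are simulated, and the helper name `scale_up` is hypothetical; it is not a function from `ergm.ego`.

```r
# Hypothetical per-ego contributions h_k(e_i) for a population of 1000 actors.
# The values are simulated for illustration only; they are not real data.
set.seed(1)
N <- 1000
h <- rpois(N, lambda = 2) / 2

# Scale a sample total up to the population: (|N| / |S|) * sum of h over S.
# Assumes a simple random sample with equal weights.
scale_up <- function(h_sample, pop_size) {
  pop_size / length(h_sample) * sum(h_sample)
}

# Draw a simple random sample of egos and estimate the population statistic.
S <- sample(N, size = 200)
g_hat <- scale_up(h[S], pop_size = N)

c(estimate = g_hat, population_total = sum(h))
```

With unequal sampling probabilities, the equal weights would be replaced by the survey weights, which is the sense in which the approach borrows from design-based inference.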

The space of egocentric statistics includes **dyadic-independent** statistics that can be expressed in the general form of \[
g_{k}(y,x)=\sum_{ij\in y} f_k(x_i,x_j)
\] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some **dyadic-dependent** statistics that can be expressed as \[
g_{k}(y,x)=\sum_{i\in N} f_k ({x_{i},(x_j)_{j\in y_i}})
\] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.
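A small base R check may help show why the dyad-independent form is egocentric. Every tie is reported by both of its endpoints, so summing \(f_k\) over each ego's alters and halving the result recovers the sum over ties. The network, attributes, and function names below are hypothetical and chosen for illustration; the homophily indicator stands in for a `nodematch`-style statistic.

```r
# Toy undirected network on 4 actors with one categorical attribute
# (hypothetical data).
y <- matrix(0, 4, 4)
ties <- rbind(c(1, 2), c(1, 3), c(3, 4))
y[ties] <- 1
y[ties[, 2:1]] <- 1   # symmetrize: ties are undirected
x <- c("F", "M", "F", "F")

# Symmetric dyad-level function f_k: 1 if the two actors share the attribute.
f <- function(xi, xj) as.numeric(xi == xj)

# Network-level view: sum f_k over the ties in y, each tie counted once.
g_network <- sum(f(x[ties[, 1]], x[ties[, 2]]))

# Egocentric view: each ego sums f_k over its alters' attributes and
# contributes half, because every tie is seen from both of its endpoints.
h <- function(i) sum(f(x[i], x[y[i, ] == 1])) / 2
g_ego <- sum(sapply(seq_along(x), h))

# A dyad-dependent egocentric statistic: the number of actors with degree 1.
g_degree1 <- sum(rowSums(y) == 1)

c(g_network = g_network, g_ego = g_ego, g_degree1 = g_degree1)
```

Note that `h` uses only the ego's attribute and the attributes of its alters, never the alters' indices, so it is a function of \(e_i\) alone.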

*What is “egocentric” depends on available data.*

- Egocentric with basic design
  - Homophily
  - Covariate effects
  - Degree distribution
- Egocentric with alter-alter ties
  - Triadic closure (transitive/cyclical ties, triangles)
  - 4-cycles (possibly)
- Egocentric with star sample (full set of alter’s ties)
  - Degree assortativity
- Not egocentric for other reasons
  - Mean degree (\(g_{k}(y,x)=2|y|/|N|\)): \(e_i\) doesn’t know how big the network is\(^{1}\)