This tutorial is a joint product of the Statnet Development Team:

Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender-deMoll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)

The network modeling software demonstrated in this tutorial is authored by Pavel Krivitsky (ergm.ego), with contributions from Michał Bojanowski.


The Statnet Project

All Statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org.

  • Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserv.

  • Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.

  • Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).

  • For all other issues, please email us at contact@statnet.org.


1 Introduction

This tutorial provides an introduction to statistical modeling of egocentrically sampled network data with Exponential family Random Graph Models (ERGMs). The primary package we will be demonstrating is ergm.ego (Krivitsky 2023), but we will make use of utilities from other Statnet packages at various points. As of version 1.0, ergm.ego depends on the egor (Krenz et al. 2024) package for egocentric network data management.

1.1 Prerequisites

This workshop assumes basic familiarity with R, experience with network concepts, terminology and data, and familiarity with the basic principles of statistical modeling and inference. Previous experience with ERGMs is not required, but is strongly recommended (the introductory ERGM workshop is a good place to start).

The workshops are conducted using RStudio.

1.2 Software Installation

Open an R session, and set your working directory to the location where you would like to save this work.

To install the ergm.ego package from CRAN:

install.packages('ergm.ego')

This will install all of the “dependencies” – the other R packages that ergm.ego needs.

Although we recommend using the CRAN versions of Statnet packages, it is also possible to install the development version of the package from Statnet’s R-universe using:

install.packages(
  "ergm.ego", 
  repos = c("https://statnet.r-universe.dev", "https://cloud.r-project.org")
)

Load the package into R and verify the package version:

library('ergm.ego')
Loading required package: ergm
Loading required package: network

'network' 1.18.2 (2023-12-04), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information

'ergm' 4.7-7368 (2024-06-11), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
Loading required package: egor
Loading required package: dplyr

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Loading required package: tibble

'ergm.ego' 1.1-704 (2023-05-30), part of the Statnet Project
* 'news(package="ergm.ego")' for changes since last version
* 'citation("ergm.ego")' for citation information
* 'https://statnet.org' for help, support, and other information

Attaching package: 'ergm.ego'
The following objects are masked from 'package:ergm':

    COLLAPSE_SMALLEST, snctrl
The following object is masked from 'package:base':

    sample
packageVersion('ergm.ego')
[1] '1.1.704'

Installation of a development version is also possible; see the Package Development section.

2 Overview of ergm.ego

The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.

This dramatically reduces the burden of data collection, which is typically one of the largest obstacles to empirical research on networks. In many contexts the collection of a network census or an adaptive (link-traced) sample is not possible. Even when one of these may be possible in theory, however, egocentrically sampled data are much cheaper and easier to collect.

Long regarded as the poor country cousin in the network data family, egocentric data actually contain a remarkable amount of information. With the right statistical methods, such data can be used to explore, summarize and simulate the complete networks in which they are embedded.

The basic idea here will be familiar to anyone who has worked with survey data: you combine what is observed (the data) with assumptions (the model terms and their sampling distributions), to define a class of models (the coefficients on the terms) that can be estimated.

Once estimated, the fitted model can be used for prediction. In the network context this means that the fitted model can be used to simulate complete networks. Each simulated network is a probabilistic draw (“realization”) from the distribution of networks specified by the model, and these draws will be centered on the observed statistics of the (appropriately scaled) sampled network. The stochastic variation in the simulated networks reflects both sampling uncertainty and the variation in network properties that are not included in the model terms.

It is worth emphasizing this point: the ERGM framework allows you to simulate the distribution of complete networks that are consistent with the egocentrically sampled data you have collected. You can exploit this feature to explore the whole network properties (e.g., connectivity, component size distributions, etc.) consistent with your data but not observable in egocentric samples.
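
To make this concrete, here is a hedged sketch of the simulate-from-fit workflow, using the faux.mesa.high data introduced later in this tutorial; the model terms are illustrative choices, not a recommended specification:

```r
# Sketch: simulate complete networks from a fitted ergm.ego model.
library(ergm.ego)
data(faux.mesa.high)
mesa.ego <- as.egor(faux.mesa.high)

# An illustrative dyad-independent model
fit <- ergm.ego(mesa.ego ~ edges + nodematch("Grade"))

# Each simulated network is a draw ("realization") from the fitted distribution
sims <- simulate(fit, nsim = 5)
sapply(sims, network.edgecount)  # statistics vary stochastically around the target
```

Whole-network properties that are unobservable in the egocentric sample, such as connectivity and component sizes, can then be computed on each simulated network.
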


ERGMs offer two powerful advantages to social network analysts:

  1. Estimation of complex models from egocentrically sampled data, and
  2. Simulation of complete networks from these egodata that are consistent with the observed model statistics.

Both of these tasks can be accomplished using the ergm.ego package. The package comprises:

  • Utilities to manage the data, though it relies mostly on the egor package (Krenz et al. 2024),
  • Egocentric terms that can be used in models,
  • Functions for estimation and inference that rely largely on the existing ergm package (Handcock et al. 2022), but include the specific modifications needed in the egocentric data context.

ergm.ego is designed to work with the other Statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from ergm to evaluate the model fit, the ergm simulate function to simulate complete network realizations from the model, the network descriptives from sna (Butts 2020) to explore the networks simulated from the model, and you can use other R functions and packages as well after converting the network data structure into a data frame.

Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.

3 Key concepts

The full technical details on ERGM estimation and inference from egocentrically sampled data can be found in Krivitsky and Morris (2017). This section of the tutorial provides a brief introduction to the key concepts.

3.1 ERGMs

This section provides a brief overview of the key principles of ERGMs that are needed to understand how estimation from egocentric data works. For a more thorough introduction to ERGM theory and its implementation in the Statnet packages, see the special issue of the Journal of Statistical Software devoted to Statnet (Handcock et al. 2008). For an introduction to the ergm package, see the Statnet ERGM Workshop.

ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can—among other tasks—obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness of fit; perform various types of model comparison; and simulate additional networks from the underlying probability distribution implied by that model.

The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(g(y,x)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(\kappa(\theta,x)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).

The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more (or less) common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, the statistics one can include in the model are limited by the requirement that they can be observed in the sample data (a detailed discussion can be found in Appendix C).

A key distinction in ERG model terms is whether they are dyad independent or dyad dependent. Dyad independent terms (such as nodematch for attribute homophily) imply no dependence between dyads—the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (such as degree for nodal degree, or triad-related terms such as gwesp) imply dependence between dyads.

The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are observable and dyad dependent.

3.2 Network Sampling

Network data are distinguished by having two units of analysis: the actors and the links between the actors. A dataset that contains information on all nodes and all links is called a “network census”. The two units give rise to a range of sampling designs that can be classified into two groups: link-tracing designs (e.g., snowball and respondent-driven sampling) and egocentric designs.

Network census

A network census is, as the name suggests, a dataset that contains information on every node and every link in the population of interest. In the SNA literature, this type of data is sometimes referred to as a “sociometric” design. As with all census designs, the data collection process tends to be expensive and time-consuming. As a result, this type of data tends to show up in two different application contexts: either small, well-bounded groups like classrooms, business firms and community organizations, or online settings where the data can be efficiently scraped.

Egocentric Designs

Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample \(S\) of respondents (“egos”) from the population \(N\). Via interview, the egos are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”). The egos are then asked to provide information on the characteristics of the alters and/or the ties, but the alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.

Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.

The minimal design (without the alter matrices) is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices.


3.3 Methods for sampled network data

Model-based

Handcock and Gile (2010) propose likelihood inference for partially observed networks, with egocentric data as a special case.

Koskinen et al. (2013) develop Bayesian inference for partially observed networks, again with egocentric data as a special case.

Pros:
  • Can fit any ERGM that can be identified.
  • Can handle link-tracing designs.
Cons:
  • Requires alters to be identifiable across ego-networks
  • Cannot take into account sampling weights (unless all attributes that affect sampling weights are part of the model).
  • Might not scale.
  • Requires knowledge of the population distribution of actor attributes used in the model.

Design-based

Krivitsky and Morris (2017) use design-based estimators for the sufficient statistics of the ERGM of interest, and then transfer their properties to the ERGM coefficient estimates.

Krivitsky, Bojanowski, and Morris (2019) demonstrate estimating triadic effects and scenarios in which an attribute is only observed on the ego.

Krivitsky, Morris, and Bojanowski (2022) discuss sampling design for inference.

Pros:
  • Does not require alters to be identifiable.
  • Borrows directly from design-based inference methods. (Can easily incorporate sampling weights, stratification, etc.)
  • Can fit any ERGM that can be identified (though see below).
  • Can be made invariant to network size for some models.
Cons:
  • Requires “reimplementation” of the ERG model statistics as “EgoStats”.
  • Relies on independent sampling from the population of interest.
  • Cannot be fit to more complex (e.g., RDS) designs.
  • Requires knowledge of the population distribution of actor attributes used in the model.

As currently implemented, modeling in the ergm.ego package does not support alter–alter statistics or directed or bipartite networks.


4 Theoretical Framework

Consider an egocentric view of the entire population: every node is observed (i.e., \(S=N\), a census), but alters are not uniquely identifiable across the egos. This limits the kinds of network statistics that can be observed, which in turn restricts the terms that can be fit (the models that can be identified) in an ERGM. We can use the notion of sufficiency from statistical theory to identify the terms amenable to egocentric inference.

4.1 Estimation

The framework for estimation and inference relies on two basic properties of exponential family models:

  1. For MLEs, the expected value of a sufficient statistic (\(g(y,x)\)) in the model is equal to its observed value.
  2. The MLE is a smooth function of the sufficient statistic, and is defined for “in between” values of the statistics as well (e.g., fractional edges).

MLEs uniquely maximize the probability of the observed statistics under the model, and any network with the same observed statistics will have the same probability.

Design-based estimation of ERGMs is done in three steps:

  1. Estimate the ERGM’s sufficient statistics from an egocentric sample \(S\) via \[g(y,x)\approx \tilde{g}(e_S)=\frac{|N|}{|S|} \textstyle\sum_{i\in S} h_{k}(e_i).\]
    • As it involves summing over all sampled nodes, it is an estimate for a population total. By the Central Limit Theorem, it is consistent and asymptotically normal.
    • Sampling variances of these estimates can be estimated similarly to sampling variance of a mean.
  2. Fit the ERGM by finding \(\hat{\theta}\) that corresponds to (that is, produces) \(\tilde{g}(e_S)\).
    • It exists because of ERGM’s properties.
  3. Use Delta Method to transfer properties of \(\tilde{g}(e_S)\) to \(\hat{\theta}\).
    • We can do this, because the MLE is a smooth function of \(g(y,x)\).

Together, these allow us to use any statistic that can be observed in an egocentric sample as a term in an ERG model and to estimate the model from a complete “pseudo-network” that has the same (or appropriately scaled) sufficient statistics. The networks simulated from the fitted model will be centered on the (scaled) observed statistics.

4.1.1 Practical issues

In practice, egocentric sample statistics generally need to be adjusted for network size and some types of observable discrepancies. This is one of the key differences between working with sampled and unsampled network data.

Network Size

The treatment of network size is perhaps the most obvious way that egocentric estimation differs from a standard ERGM estimation on a completely observed network. With a network census, the network size is known; by contrast, with a network sample, we don’t typically know the size of the network from which it is drawn.

If the statistics we observe in the sample scale in a known way with network size, then we can adjust for this in the estimation, and the resulting parameter estimates (with the exception of the edges term) will be “size invariant”.

Here we will follow Krivitsky, Handcock, and Morris (2011), who showed that one can obtain a “per capita” size invariant parameterization for dyad-independent statistics in any network by using an offset, approximately equal to \(-\log(N)\), where \(N\) is the number of nodes in the network. The intuition is that this transforms the density-based parameterization (ties per dyad) that is the natural scale for ERGMs into a mean degree-based parameterization (ties per node):

\[ \text{Mean Degree} = \frac{2\times\text{ties}}{\text{nodes}} = \frac{2T}{n} \] \[ \text{Density} = \frac{\text{ties}}{\text{dyads}} = \frac{T}{\frac{N(N-1)}{2}} = \frac{\text{Mean Degree}}{(N-1)} \]
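
The identities above can be checked numerically (the counts here are arbitrary illustrative numbers):

```r
# Numeric check: density = mean degree / (N - 1)
N <- 205; ties <- 203
mean_degree <- 2 * ties / N                # ties per node
density <- ties / (N * (N - 1) / 2)        # ties per dyad
all.equal(density, mean_degree / (N - 1))  # TRUE
# On the log scale, log(density) = log(mean_degree) - log(N - 1), which is
# approximately log(mean_degree) - log(N): the intuition behind the offset.
```
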

Once the number of edges is adjusted to preserve the mean degree, all of the dyad-independent terms are properly scaled (Krivitsky, Handcock, and Morris 2011). For degree-based terms, we would want, by analogy, the per-capita invariance to preserve the degree probability distribution.
Experimental results suggest that the mean-degree-preserving offset has this property, but a mathematical proof is elusive. Scaling properties for triadic terms are less well developed (Krivitsky and Kolaczyk 2015).


Observable discrepancies

What we mean by discrepancy is: undirected tie subtotals that are required to balance in theory, but are observed not to balance in the sample. This can happen, for example, when ties are broken down by nodal attributes and the number of ties that group 1 reports to group 2 are not equal to the number that group 2 reports to group 1.

This is another unique feature of egocentrically sampled network data. With a network census, you have the complete edgelist, with the nodal attributes for each member of the dyad, so the reports will always balance. For an egocentrically sampled network, and even for an egocentric census, a discrepancy can arise, either from sampling variability, or from measurement error (if ego mis-reports the attribute of themselves or their alter).

The natural assumption, in the absence of specific knowledge, is that any discrepancy is due to sampling variation. Under this assumption, the average of the discrepant reports is the appropriate estimate of the number of ties for that ego-alter configuration. This is the approach implemented in ergm.ego. If you know the source of the discrepancy, or want to make a different assumption, you may address this before fitting the data in ergm.ego.
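
A toy illustration of the averaging rule, with hypothetical counts:

```r
# Undirected cross-group tie counts should balance, but sample reports disagree
g1_to_g2 <- 40  # ties that group 1 egos report to group 2 alters
g2_to_g1 <- 50  # ties that group 2 egos report to group 1 alters
mean(c(g1_to_g2, g2_to_g1))  # 45: the averaged estimate used under this assumption
```
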


Egocentric target statistics

Once the network size-invariant parameterization and consistency issues are addressed it is straightforward to construct the target statistics needed for ERGM estimation: we scale up the values of the sample statistics to the desired network size.

The way we do this is by specifying an offset term in the model. The offset used will depend on the context.

  • For unweighted samples: To obtain population estimates from ergm.ego from an unweighted sample of size \(|S|\) to a population with a known (or specified) size \(N\), fit the model with an offset of \(\log(N/|S|)=\log(N) - \log(|S|)\).

  • For weighted samples: To obtain population estimates from ergm.ego from a weighted sample to a population with a known (or specified) size \(N\), first choose a network size, \(|N'|\), to be used for estimation (a pseudo-population that will have the correct nodal attribute distribution specified by the weights), and then fit the model with an offset of \(\log(N/|N'|)=\log(N) - \log(|N'|)\). The criteria for choosing a good value of \(|N'|\) are discussed in the Model Fitting section below.

  • If the population network size is unknown: This is the most general case. If we do not know \(N\), or do not wish to specify it, we often fit with an offset of \(-\log(|S|)\) (for an unweighted sample) or \(-\log(|N'|)\) (for a weighted sample). This will return per-capita estimates that can be easily rescaled to any value post-estimation, e.g., for simulation purposes.
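
In practice these offsets are typically supplied through the fitting function rather than constructed by hand. A hedged sketch, assuming the popsize argument of ergm.ego() (consult ?ergm.ego for the authoritative interface) and using the faux.mesa.high data from the Example section:

```r
# Sketch: request population-size scaling when fitting. The popsize argument
# is an assumption about the interface; the model terms are illustrative.
library(ergm.ego)
data(faux.mesa.high)
mesa.ego <- as.egor(faux.mesa.high)
fit.N <- ergm.ego(mesa.ego ~ edges + nodematch("Grade"), popsize = 1000)
summary(fit.N)
```
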


4.2 Statistical Inference

The standard errors for coefficients in an ergm.ego fit are designed to represent the uncertainty in our estimate. For ERGMs, this uncertainty can be thought of as coming from three possible sources:

  1. A superpopulation of networks, from which this one network is a single realization: What other networks could have been produced by the social process of interest?

  2. The sampling process of egos: What other samples could have been drawn?

  3. The stochastic MCMC algorithm used to estimate the coefficient: What other MCMC samples could we have gotten?

Most treatments of ERGM estimation treat the coefficient \(\theta\) as a parameter of a superpopulation process of which \(y\) is a single realization. The variance of the MLE of \(\theta\) is then conceived as coming from (1) and (3) above.

In contrast, in ergm.ego we treat the network as a fixed, unknown, finite population, so it is not a source of uncertainty. Rather, uncertainty comes from sampling from this network, and from the MCMC algorithm, (2) and (3) above.

This makes ergm.ego inference much more like traditional (frequentist) statistical inference: we imagine repeatedly drawing an egocentric sample, and estimating the ERGM on each replicate. The sampling distribution of the estimate reflects how our estimate will vary from sample to sample.

4.3 Survey design effects

The ergm.ego package can be used with weighted survey data and complex sampling designs. In this context, the egor package transforms the ego tibble into a srvyr object. The srvyr package (Freedman Ellis and Schneider 2023) can be used for descriptive statistics, and ergm.ego will incorporate the survey design into its estimation and inference.

This topic is beyond the scope of this introductory workshop but the ergm.ego package has an example you can run for more information:

example(sample.egor)

5 The package ergm.ego

Since ergm.ego is essentially a wrapper around ergm, there are relatively few functions in the ergm.ego package itself. The functions that are there deal with the specific requirements associated with data management, estimation and inference for egocentrically sampled data.

To get a list of documented functions, type:

library(help='ergm.ego')

The main R objects unique to ergm.ego are:

  • egor objects for storing the original data (egor is the analog to network in ergm),

  • ergm.ego objects, which store the model fit results (the analog to ergm objects in ergm).

Once you simulate from the fit, the resulting objects are just network objects.

The functionality can be divided into groups as follows:

Data structure and input

Stripped down to the basics, egocentric network data comprise:

  • data on egos (nodal attributes)

  • data on alters (a combination of nodal and edge attributes, since each alter represents a tie)

  • data on ties between alters (which may also have edge attributes)

The egor object has simple, analogous structure for storing this information: a list object with 3 components

  • ego - data frame of egos and their attributes

  • alter - a data frame of the alters nominated by each ego, with their attributes (by default identified by the column egoID), or a list of data frames (one for each ego).

  • aatie - a data frame with edge list of alter-alter ties or a list of data frames (one for each ego).

In addition, one can specify:

  • ego_design - a list of arguments passed to srvyr::as_survey_design() specifying the sample design for egos. For example: probs for unequal probability independent sample, strata for stratified samples etc.

  • alter_design - currently a list with one element, max=, providing the maximum number of alters an ego could nominate under a Fixed Choice Design (Holland and Leinhardt 1973).
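
Putting the pieces together, here is a minimal toy construction from three data frames (the data are hypothetical, and the egor() constructor arguments shown are assumptions; see ?egor for the authoritative interface):

```r
# Toy egor object built from ego, alter, and alter-alter tie data frames
library(egor)
egos   <- data.frame(.egoID = 1:3, sex = c("F", "M", "F"))
alters <- data.frame(.altID = c(101, 102, 103),
                     .egoID = c(1, 1, 2),
                     sex    = c("M", "F", "M"))
aaties <- data.frame(.egoID = 1, .srcID = 101, .tgtID = 102)
toy <- egor(alters = alters, egos = egos, aaties = aaties,
            ID.vars = list(ego = ".egoID", alter = ".altID",
                           source = ".srcID", target = ".tgtID"))
```
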

The capacity to represent survey design elements makes egor a flexible and powerful foundation for network data analysis.

The simplicity of the data structure makes it easy to construct egor objects from external data read into R, and there are transformation utilities for working with other data formats (like network and igraph objects), which we will demonstrate in the Example section below.

For more information:

?as.egor

Model terms

The possible terms in an ergm.ego model are inherently limited to those that are egocentrically observable: statistics that can be inferred from an egocentric sample. In general, these will include terms that are functions of nodal attributes and attribute mixing, degree distribution terms, and triadic terms (when the alter–alter ties are observed). The ergm.ego terms have the same names and arguments as their ergm counterparts; there are just far fewer of them available (n=14).

Dyad independent terms include density and nodal attribute based measures:

  • Density: edges
  • Vertex attribute effects: nodefactor (for discrete/nominal vars) and nodecov (for continuous)
  • Homophily and mixing: nodematch (for homophily), nodemix (for general mixing patterns) and absdiff

Dyad dependent terms include degree- and triad-based measures:

  • Degree: degree, degrange, gwdegree, and degreepopularity
  • Concurrency: concurrent and concurrentties
  • Triadic terms include esp, gwesp, transitiveties and cyclicalties, but these can only be used if alter–alter ties have been observed.

For the full list of ergm.ego terms and their syntax, type:

help('ergm.ego-terms')

As in ergm, these terms can be used on the right-hand side of formulas in calls to model and simulation functions.
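
For example, just as with ergm formulas, you can compute the observed egocentric statistics for a candidate model before fitting (the data setup repeats the Example section below; the terms are illustrative):

```r
# Observed egocentric sample statistics for a candidate model specification
library(ergm.ego)
data(faux.mesa.high)
mesa.ego <- as.egor(faux.mesa.high)
summary(mesa.ego ~ edges + nodematch("Grade") + degree(0:1))
```
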


6 Example Analysis

We will work with the faux.mesa.high dataset that is included with the ergm package, using the as.egor function to transform it into an egocentric dataset. In essence, this creates an egocentric census of the network: a census of all nodes in the network, not a sample.

In this egocentric census, every node is the center of their own egonet – we know their alters, and the ties between their alters, but we cannot match the alters across the egonets because they are not uniquely identified. We can still compare the fits we get from ergm.ego (from the ego data) and ergm (from the original network) for models with the same terms.

Preliminaries:

Check package versions

sessionInfo()

Set seed for simulations – this is not necessary, but it ensures that we all get the same results (if we execute the same commands in the same order).

set.seed(1)

6.1 Data construction

We’ll show 2 examples of how to create an egor object here.

6.1.1 From a network object

Read in the faux.mesa.high data:

data(faux.mesa.high)
mesa <- faux.mesa.high

Take a quick look at the complete network

plot(mesa, vertex.col="Grade")
legend('bottomleft',fill=7:12,legend=paste('Grade',7:12),cex=0.75)

Now, let’s turn this into an egor object:

mesa.ego <- as.egor(mesa) 

Take a look at this object – there are several ways to do this:

names(mesa.ego) # what are the components of this object?
[1] "ego"   "alter" "aatie"
mesa.ego # shows the dimensions of each component
# EGO data (active): 205 × 4
  .egoID Grade Race  Sex  
*  <int> <dbl> <chr> <chr>
1      1     7 Hisp  F    
2      2     7 Hisp  F    
3      3    11 NatAm M    
4      4     8 Hisp  M    
5      5    10 White F    
# ℹ 200 more rows
# ALTER data: 406 × 5
  .altID .egoID Grade Race  Sex  
*  <int>  <int> <dbl> <chr> <chr>
1    174      1     7 Hisp  F    
2    161      1     7 Hisp  F    
3    151      1     7 Hisp  F    
# ℹ 403 more rows
# AATIE data: 372 × 3
  .egoID .srcID .tgtID
*  <int>  <int>  <int>
1      1    151    127
2      1    127     52
3      1    127     87
# ℹ 369 more rows
#View(mesa.ego) # opens the component in the Rstudio source window
class(mesa.ego) # what type of "object" is this?
[1] "egor" "list"

Each of the components of the egor object is a simple table, or data.frame.

class(mesa.ego$ego) # and what type of objects are the components?
[1] "tbl_df"     "tbl"        "data.frame"
class(mesa.ego$alter)
[1] "tbl_df"     "tbl"        "data.frame"
class(mesa.ego$aatie)
[1] "tbl_df"     "tbl"        "data.frame"

The ego table contains the ego ID (.egoID), and the nodal attributes Race, Grade and Sex. This is equivalent to a standard person-based survey sample flat file format.

mesa.ego$ego # first few rows of the ego table
# A tibble: 205 × 4
   .egoID Grade Race  Sex  
    <int> <dbl> <chr> <chr>
 1      1     7 Hisp  F    
 2      2     7 Hisp  F    
 3      3    11 NatAm M    
 4      4     8 Hisp  M    
 5      5    10 White F    
 6      6    10 Hisp  F    
 7      7     8 NatAm M    
 8      8    11 NatAm M    
 9      9     9 White M    
10     10     9 NatAm F    
# ℹ 195 more rows

The alter table is a type of edgelist: it lists the edges for each ego. It contains the alter ID (.altID), the corresponding ego ID, and a set of alter nodal attributes. Note that this is a slightly different data structure than a standard network edgelist.

  • The standard network edgelist contains one unique record for each edge; both ego and alter ID may appear more than once (depending on their degree), but each link is only represented once.

  • The alter table in this egor object is a different type of edgelist, as it is “egocentric.”

    • When this table is derived from a network census, as we did above, each tie will be represented twice: once with the first node as the ego, and once with the other node as the ego. As a result, the number of times alters will appear in the .altID list is equal to their degree, as is the number of times their ID will appear in the .egoID list.
    • When this table is derived from an egocentric sample of a network, it is likely that each tie will only be represented once, unless some of the alters are sampled as egos. When the sampling fraction is small (e.g., for the General Social Survey friendship network data), it is very unlikely that alters will be sampled as egos.
mesa.ego$alter # first few rows of the alter table
# A tibble: 406 × 5
   .altID .egoID Grade Race  Sex  
    <int>  <int> <dbl> <chr> <chr>
 1    174      1     7 Hisp  F    
 2    161      1     7 Hisp  F    
 3    151      1     7 Hisp  F    
 4    127      1     7 Hisp  F    
 5    110      1     7 Hisp  F    
 6    100      1     7 Hisp  F    
 7     96      1     7 NatAm F    
 8     92      1     7 NatAm F    
 9     87      1     7 White F    
10     70      1     7 NatAm F    
# ℹ 396 more rows
# ties show up twice, but alter info is linked to .altID
mesa.ego$alter %>% filter((.altID==1 & .egoID==25) | (.egoID==1 & .altID==25))
# A tibble: 2 × 5
  .altID .egoID Grade Race  Sex  
   <int>  <int> <dbl> <chr> <chr>
1     25      1     7 White F    
2      1     25     7 Hisp  F    

The aatie table lists the egoID, and the IDs of the two alters that have a tie. The alters are distinguished as .srcID and .tgtID to allow for the possibility of directed tie data. In the case of undirected tie data, as we have here, each alter–alter tie will be represented twice, just swapping the source and target IDs.

mesa.ego$aatie # first few rows of the alter-alter tie table
# A tibble: 372 × 3
   .egoID .srcID .tgtID
    <int>  <int>  <int>
 1      1    151    127
 2      1    127     52
 3      1    127     87
 4      1    127    151
 5      1    110     87
 6      1    110     92
 7      1    110     96
 8      1    100     96
 9      1     96     87
10      1     96    110
# ℹ 362 more rows
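The symmetric duplication can be checked directly (a sketch): for every (source, target) row in the aatie table, there should be a matching row for the same ego with the two alter IDs swapped.

```r
aat <- mesa.ego$aatie
# build a key for each row, and for its source/target-swapped version
keys    <- paste(aat$.egoID, aat$.srcID, aat$.tgtID)
swapped <- paste(aat$.egoID, aat$.tgtID, aat$.srcID)
all(keys %in% swapped)
```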

6.1.2 From external data

Since each of the egor components is a simple rectangular table, it’s easy to read in external data and use it to construct an egor object. You just need to make sure that the structure of the external files is consistent with the structure of the tables we looked at above.

To demonstrate, we will construct an egor object derived from our mesa.ego data that has the features of an egocentrically sampled data set: the alters are not uniquely identified, and the alter–alter ties are ignored.

First, we write out the first two tables in our mesa.ego into external datafiles, deleting the .altID from the alter file.

# egos
write.csv(mesa.ego$ego, file="mesa.ego.table.csv", row.names = F)

# alters
write.csv(mesa.ego$alter[,-1], file="mesa.alter.table.csv", row.names = F)

Now read them back in:

mesa.egos <- read.csv("mesa.ego.table.csv")
head(mesa.egos)
  .egoID Grade  Race Sex
1      1     7  Hisp   F
2      2     7  Hisp   F
3      3    11 NatAm   M
4      4     8  Hisp   M
5      5    10 White   F
6      6    10  Hisp   F
mesa.alts <- read.csv("mesa.alter.table.csv")
head(mesa.alts)
  .egoID Grade Race Sex
1      1     7 Hisp   F
2      1     7 Hisp   F
3      1     7 Hisp   F
4      1     7 Hisp   F
5      1     7 Hisp   F
6      1     7 Hisp   F

To create an egor object from data frames, we use the egor() function:

my.egodata <- egor(egos = mesa.egos, 
                   alters = mesa.alts, 
                   ID.vars = list(ego = ".egoID"))
my.egodata
# EGO data (active): 205 × 4
  .egoID Grade Race  Sex  
* <chr>  <int> <chr> <chr>
1 1          7 Hisp  F    
2 2          7 Hisp  F    
3 3         11 NatAm M    
4 4          8 Hisp  M    
5 5         10 White F    
# ℹ 200 more rows
# ALTER data: 406 × 4
  .egoID Grade Race  Sex  
* <chr>  <int> <chr> <chr>
1 1          7 Hisp  F    
2 1          7 Hisp  F    
3 1          7 Hisp  F    
# ℹ 403 more rows
# AATIE data: 0 × 3
# ℹ 3 variables: .egoID <chr>, .srcID <chr>, .tgtID <chr>
my.egodata$alter
# A tibble: 406 × 4
   .egoID Grade Race  Sex  
   <chr>  <int> <chr> <chr>
 1 1          7 Hisp  F    
 2 1          7 Hisp  F    
 3 1          7 Hisp  F    
 4 1          7 Hisp  F    
 5 1          7 Hisp  F    
 6 1          7 Hisp  F    
 7 1          7 NatAm F    
 8 1          7 NatAm F    
 9 1          7 White F    
10 1          7 NatAm F    
# ℹ 396 more rows

Note that the alter data no longer have a unique alter identifier.

For another example that uses the alter–alter ties, see:

example("egor")

We will explore some of the other functions available for manipulating the egor object in a later section.


6.2 Exploratory analysis

Prior to model specification, we can explore the data using descriptive statistics observable in the original egocentric sample. In general, the observable statistics are the same as those that ergm.ego can estimate.

We can use standard R commands to view nodal attribute frequencies:

# to reduce typing, we'll pull the ego and alter data frames
egos <- mesa.ego$ego
alters <- mesa.ego$alter

table(egos$Sex) # Distribution of `Sex`

  F   M 
 99 106 
table(egos$Race) # Distribution of `Race`

Black  Hisp NatAm Other White 
    6   109    68     4    18 
barplot(table(egos$Grade), 
        main = "Ego grade distribution",
        ylab="frequency")

Compare egos and alters:

layout(matrix(1:2, 1, 2))
barplot(table(egos$Race)/nrow(egos),
        main="Ego Race Distn", ylab="percent",
        ylim = c(0,0.5), las = 3)
barplot(table(alters$Race)/nrow(alters),
        main="Alter Race Distn", ylab="percent",
        ylim = c(0,0.5), las = 3)
layout(1)

To look at the mixing matrix, we’ll use the mixingmatrix() function on the egor object, and
we’ll compare the output to what we would get from using this function on the original network object.

Note how the ties on the diagonal are counted twice in the ergm.ego data, compared with the original network data, but the off-diagonal tie counts are the same. Note also, though, that these off-diagonal counts are symmetric, because this is undirected data. So, in both cases, the off-diagonal ties are actually being counted twice (once above and once below the diagonal), but in the original network version, the ties on the diagonal are only counted once.

# to get the crosstabulated counts of ties:
mixingmatrix(mesa.ego,"Grade")
     7   8   9  10  11  12
7  150   0   0   1   1   1
8    0  66   2   4   2   1
9    0   2  46   7   6   4
10   1   4   7  18   1   5
11   1   2   6   1  34   5
12   1   1   4   5   5  12
Note:  Marginal totals can be misleading for undirected mixing matrices.
# contrast with the original network crosstab:
mixingmatrix(mesa, "Grade")
    7  8  9 10 11 12
7  75  0  0  1  1  1
8   0 33  2  4  2  1
9   0  2 23  7  6  4
10  1  4  7  9  1  5
11  1  2  6  1 17  5
12  1  1  4  5  5  6
Note:  Marginal totals can be misleading for undirected mixing matrices.

You can also use this function to calculate the row probabilities of the mixing matrix:

# to get the row conditional probabilities:

round(mixingmatrix(mesa.ego, "Grade", rowprob=T), 2)
      7    8    9   10   11   12
7  0.98 0.00 0.00 0.01 0.01 0.01
8  0.00 0.88 0.03 0.05 0.03 0.01
9  0.00 0.03 0.71 0.11 0.09 0.06
10 0.03 0.11 0.19 0.50 0.03 0.14
11 0.02 0.04 0.12 0.02 0.69 0.10
12 0.04 0.04 0.14 0.18 0.18 0.43
Note:  Marginal totals can be misleading for undirected mixing matrices.
round(mixingmatrix(mesa.ego, "Race", rowprob=T), 2)
      Black Hisp NatAm Other White
Black  0.00 0.31  0.50  0.00  0.19
Hisp   0.04 0.60  0.23  0.01  0.12
NatAm  0.08 0.26  0.59  0.00  0.06
Other  0.00 1.00  0.00  0.00  0.00
White  0.11 0.49  0.22  0.00  0.18
Note:  Marginal totals can be misleading for undirected mixing matrices.

We can also examine the observed number of ties, mean degree, and degree distributions.

# first, using the original network
network.edgecount(faux.mesa.high)
[1] 203
# compare to `egor`
# note that the ties are double counted, so we need to divide by 2.
nrow(mesa.ego$alter)/2
[1] 203
# mean degree -- here we want to count each "stub", so we don't divide by 2
nrow(mesa.ego$alter)/nrow(mesa.ego$ego)
[1] 1.980488
# overall degree distribution
summary(mesa.ego ~ degree(0:20))
         scaled mean     SE
degree0           57 6.4306
degree1           51 6.2048
degree2           30 5.0730
degree3           28 4.9289
degree4           18 4.0620
degree5           10 3.0917
degree6            2 1.4107
degree7            4 1.9852
degree8            1 1.0000
degree9            2 1.4107
degree10           1 1.0000
degree11           0 0.0000
degree12           0 0.0000
degree13           1 1.0000
degree14           0 0.0000
degree15           0 0.0000
degree16           0 0.0000
degree17           0 0.0000
degree18           0 0.0000
degree19           0 0.0000
degree20           0 0.0000
# and stratified by sex
summary(mesa.ego ~ degree(0:13, by="Sex"))
           scaled mean     SE
deg0.SexF           23 4.5299
deg1.SexF           23 4.5299
deg2.SexF           10 3.0917
deg3.SexF           17 3.9581
deg4.SexF           12 3.3694
deg5.SexF            7 2.6066
deg6.SexF            1 1.0000
deg7.SexF            3 1.7235
deg8.SexF            1 1.0000
deg9.SexF            0 0.0000
deg10.SexF           1 1.0000
deg11.SexF           0 0.0000
deg12.SexF           0 0.0000
deg13.SexF           1 1.0000
deg0.SexM           34 5.3385
deg1.SexM           28 4.9289
deg2.SexM           20 4.2588
deg3.SexM           11 3.2343
deg4.SexM            6 2.4193
deg5.SexM            3 1.7235
deg6.SexM            1 1.0000
deg7.SexM            1 1.0000
deg8.SexM            0 0.0000
deg9.SexM            2 1.4107
deg10.SexM           0 0.0000
deg11.SexM           0 0.0000
deg12.SexM           0 0.0000
deg13.SexM           0 0.0000

For the degree distribution we used the summary function in the same way that we would use it in ergm with a network object. But the summary function also has an egor-specific argument, scaleto, that allows you to scale the summary statistics to a network of arbitrary size. So, for example, we can obtain the degree distribution scaled to a network of size 100,000, or to a network that is 100 times larger than the sample.

summary(mesa.ego ~ degree(0:10), scaleto=100000)
         scaled mean      SE
degree0     27804.88 3136.89
degree1     24878.05 3026.75
degree2     14634.15 2474.63
degree3     13658.54 2404.34
degree4      8780.49 1981.47
degree5      4878.05 1508.16
degree6       975.61  688.17
degree7      1951.22  968.41
degree8       487.80  487.80
degree9       975.61  688.17
degree10      487.80  487.80
summary(mesa.ego ~ degree(0:10), scaleto=nrow(mesa.ego$ego)*100)
         scaled mean     SE
degree0         5700 643.06
degree1         5100 620.48
degree2         3000 507.30
degree3         2800 492.89
degree4         1800 406.20
degree5         1000 309.17
degree6          200 141.07
degree7          400 198.52
degree8          100 100.00
degree9          200 141.07
degree10         100 100.00

Note that the first scaling results in fractional numbers of nodes at each degree, because the proportion at each degree level does not scale to an integer for this population size. Again, this is not a problem for estimation, but one should be careful with descriptive statistics that expect integer values. The second scaling does result in integer counts because it is a multiple of the sample size.
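The fractional counts are just the observed proportions rescaled (a sketch of the arithmetic, using the degree-0 row as an example):

```r
# 57 of the 205 egos are isolates; scale that proportion to each size
57 / 205 * 100000       # 27804.88 -- fractional, matching the table above
57 / 205 * (205 * 100)  # 5700 -- an integer, since 20500 is a multiple of 205
```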

We can plot the degree distribution using another egor-specific function: degreedist. As with the mixingmatrix function, this can return either the counts or the proportions at each degree.

To get the frequency counts:

# degreedist(mesa.ego, plot=TRUE, prob=FALSE) # bug statnet/ergm.ego#82.
degreedist(mesa.ego, by="Sex", plot=TRUE, prob=FALSE)

To get the proportion at each degree level:

degreedist(mesa.ego, by="Sex", plot=TRUE, prob=TRUE)

The degreedist method for egor objects also has an argument that lets you overplot the expected degree distribution for a Bernoulli random graph with the same expected density. This is the plot equivalent of a CUG test (“conditional uniform graph”).

set.seed(1)
degreedist(mesa.ego, brg=TRUE)

degreedist(mesa.ego, by="Sex", prob=TRUE, brg=TRUE)

The brg overplot is based on 50 simulations of a Bernoulli random graph with the same number of nodes and expected density, implemented using an ergm.ego simulation from an edges-only model with \(\theta=\mbox{logit}(\mbox{probability of a tie})\) from the observed data. The overplot shows the mean and 2 standard deviations obtained for each degree value from the 50 simulations. Note that the brg automatically scales to the proportions when prob=T.
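The reference value of \(\theta\) can be sketched from the observed data as follows (an illustration of the idea, not the package’s internal code):

```r
# expected density implied by the observed mean degree
mean.deg <- nrow(mesa.ego$alter) / nrow(mesa.ego$ego)  # about 1.98
p.tie    <- mean.deg / (nrow(mesa.ego$ego) - 1)        # probability of a tie
theta    <- qlogis(p.tie)                              # logit(probability of a tie)
theta
```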

What does the plot suggest about the distribution of degree in this network?

6.3 Model Fitting

From the exploratory work, several characteristics emerged that we might want to capture in a model:

  • Variation in mean degree by nodal attributes (race, sex and grade)

  • Patterns of mixing by race, sex and grade

  • The degree distribution (in particular, the disproportionate isolate fraction)

We can use ergm.ego to fit a sequence of nested models, both to estimate the parameters associated with these statistics and to test their significance. We can diagnose both the estimation process (to verify convergence and good mixing in the MCMC sampler) and the fit of the model to the data. In both cases, we will use functionality that will be familiar to ergm users: MCMC diagnostics and GOF.

6.3.1 Preliminaries

One thing that is different from a standard ergm call is that we need to specify the scaling, both for the pseudo-population (\(N'\)) that will be used to set the target statistics during estimation, and for the population (N) size that the final rescaled coefficients will represent. Recall,

Population size \((|N|)\)
If we wanted to rescale the final estimates to reproduce the expected values in the true population network, we would need to know \(N\), the size of the population. In general, we do not know this, so the most useful scaling is a per capita scaling, which can easily be transformed into any value of \(|N|\) later (for simulation, or other purposes). For per capita scaling, \(|N|=1\). In ergm.ego, this is controlled by the popsize top-level argument.
Pseudo-population \((|N'|)\)
Recall from above that the target statistics can be scaled to an arbitrary population size for estimation. What size should we use? Several principles guide this choice:
  • Bias – In general, estimation bias is reduced the closer \(N'\) is to \(N\) (usually larger).

  • Computing time – The larger the pseudo-population, the longer the estimation takes.

  • Sample weights – In general, it is good practice for the smallest sample weight to produce at least 1 observation in the pseudo-population network, though more is better.

This leads to different guidelines for data with and without weights.

Simulation studies in Krivitsky & Morris (2017) suggest that a good rule of thumb is a minimum pseudo-population size of 1,000 for unweighted data. For weighted data, the pseudo-population size should be at least 1 * sampleSize/smallestWeight (or 3 * sampleSize/smallestWeight to be safe), or 1,000, whichever is larger.
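Applying the rule of thumb, with hypothetical values for the sample size and smallest weight:

```r
sample.size     <- 500   # hypothetical number of egos
smallest.weight <- 0.1   # hypothetical smallest sampling weight
# conservative minimum pseudo-population size
max(1000, 3 * sample.size / smallest.weight)
```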

In ergm.ego, \(|N'|\) is controlled by a combination of four factors:

  • the sample size \(|S|\) (i.e., number of egos),
  • the top-level argument popsize (\(|N|\) or 1) (default: 1),
  • the control.ergm.ego control parameter ppopsize (default: "auto"),
  • the control.ergm.ego control parameter ppopsize.mul (default: 1).

If ppopsize is left at its default ("auto"),

  • If popsize is left at 1, use \(|S|\times\)ppopsize.mul.
  • If popsize is specified, use \(|N|\times\)ppopsize.mul.

You can also force one of these two regimes by setting ppopsize to "samp" or "pop", respectively, or set it to a number to force a particular \(|N'|\) ignoring ppopsize.mul.
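For example, the regimes described above could be selected explicitly (a sketch; argument names as documented in ?control.ergm.ego):

```r
# |N'| = ppopsize.mul x sample size, regardless of popsize
control.ergm.ego(ppopsize = "samp", ppopsize.mul = 5)
# |N'| = ppopsize.mul x popsize
control.ergm.ego(ppopsize = "pop")
# force a particular |N'|, ignoring ppopsize.mul
control.ergm.ego(ppopsize = 2000)
```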

For more information, see

?control.ergm.ego

In both cases, the scaling affects only the estimate of the edges term; the other coefficients are invariant to network size. We demonstrate this below.

6.3.2 Fit a simple model

Let’s start with a simple edges-only model to see what’s the same and what’s different from a call to ergm:

fit.edges <- ergm.ego(mesa.ego ~ edges)
summary(fit.edges)
Call:
ergm.ego(formula = mesa.ego ~ edges)

Monte Carlo Maximum Likelihood Results:

                    Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj) -5.32301    0.00000      0    -Inf   <1e-04 ***
edges                0.69590    0.07717      0   9.018   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

This is a model with homogeneous tie probability – a Bernoulli random graph with the mean degree observed in our sampled data. The only difference in the syntax from standard ergm is the function call to ergm.ego. Let’s look under the hood at the components of the fit.edges object:

names(fit.edges)
 [1] "coefficients"     "sample"           "iterations"       "MCMCtheta"       
 [5] "loglikelihood"    "gradient"         "hessian"          "covar"           
 [9] "failure"          "newnetwork"       "coef.init"        "est.cov"         
[13] "coef.hist"        "stats.hist"       "steplen.hist"     "control"         
[17] "etamap"           "MCMCflag"         "nw.stats"         "call"            
[21] "network"          "ergm_version"     "info"             "MPLE_is_MLE"     
[25] "drop"             "offset"           "estimable"        "formula"         
[29] "target.stats"     "target.esteq"     "reference"        "constraints"     
[33] "obs.constraints"  "estimate"         "estimate.desc"    "v"               
[37] "m"                "ergm.formula"     "popnw"            "ergm.offset.coef"
[41] "egor"             "ppopsize"         "popsize"          "netsize.adj"     
[45] "ergm.covar"       "DtDe"             "ergm.call"       
fit.edges$ppopsize
[1] 205
fit.edges$popsize
[1] 1

Many of the elements of the object are the same as you would get from an ergm fit, but the last few elements are unique to ergm.ego. Here you can see the ppopsize – the pseudo-population size used to construct the target statistics, and popsize – the final scaled population size after network size adjustment is applied. The values that were used in the fit were the default values, since we did not specify otherwise. So, ppopsize\(=205\) (the sample size, or number of egos), and popsize\(= 1\), so the scaling returns the per capita estimates from the model parameters.

The summary shows the netsize.adj is \(-5.32301= -\log(205)\).
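This offset can be verified directly:

```r
-log(205)  # minus the log of the pseudo-population size
# [1] -5.32301
```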

The summary function also reports that:

 The following terms are fixed by offset and are not estimated:
  netsize.adj

So what would happen if we fit the model instead with target statistics from a pseudo-population of size 1000? To do this, we explicitly change the value of the ppopsize parameter through the control argument:

summary(ergm.ego(mesa.ego ~ edges, 
                 control = control.ergm.ego(ppopsize=1000)))
Constructing pseudopopulation network.
Note: Constructed network has size 1025, different from requested 1000. Estimation should not be meaningfully affected.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0019.
Convergence test p-value: 0.0140. Not converged with 99% confidence; increasing sample size.
Iteration 2 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0129.
Convergence test p-value: 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC.  To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
Call:
ergm.ego(formula = mesa.ego ~ edges, control = control.ergm.ego(ppopsize = 1000))

Monte Carlo Maximum Likelihood Results:

                    Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj) -6.93245    0.00000      0    -Inf   <1e-04 ***
edges                0.69348    0.08023      0   8.644   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 

Now the netsize.adj value is \(-6.93245 = -\log(1025)\), reflecting the constructed pseudo-population of size 1025 rather than the requested 1000 (see the note in the output above).

Note that the estimated edges coefficient is essentially the same in both models (0.696 vs. 0.693). This is the behavior we expect – the model is returning the same per capita value in both cases; it is just using a different scaling for the target statistics used in the fit. For this simple model, there may not be much difference in the properties of the estimates for these two different pseudo-population sizes.
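As a sanity check, the per capita parameterization can be related back to the observed mean degree (a sketch for the edges-only model, where the total log-odds of a tie is the offset plus the edges coefficient):

```r
# logit(p) = netsize.adj + edges for the Bernoulli model, with N' = 205
theta <- -5.32301 + 0.69590  # values from the first fit above
p.tie <- plogis(theta)       # implied tie probability
p.tie * (205 - 1)            # implied mean degree, close to the observed 1.98
```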

We will examine the impact of modifying the popsize parameter in a later section below.

As the output shows, the model was fit using MCMC. This, too, is different from an edges-only model fit with ergm. For ergm, models with only dyad-independent terms are fit using Newton-Raphson algorithms (the same algorithm used for logistic regression), not MCMC. For ergm.ego, estimation is always based on MCMC, regardless of the terms in the model.

6.3.3 Convergence assessment

Now let’s see what the MCMC diagnostics for this model look like

mcmc.diagnostics(fit.edges, which ="plots")


Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).

The diagnostics show good mixing, and the distribution of the sample statistic deviations from the targets (on the right panel) in the last iteration is well centered around zero. To verify that simulations from the fitted model match the target stats, we can use the gof function with the argument “model”.

plot(gof(fit.edges, GOF="model"))

Networks simulated from the model appear to be nicely centered around the values of the observed edges statistic.

6.3.4 GOF assessment

Finally, we should evaluate the model fit. We can also use gof to do this, by comparing observed statistics that are not in the model, like the full degree distribution, with simulations from the fitted model. This is the same procedure that we use for ergm, but now with a more limited set of observed higher-order statistics to use for assessment.

plot(gof(fit.edges, GOF="degree"))

Here, finally, we see some bad behavior, but this is expected from such a simple model. The GOF plot shows there are almost twice as many isolates in the observed data as would be predicted from a simple edges-only model.
Of course, we knew this from having looked at the degree distribution plots with the Bernoulli random graph overlay.

Ok, so that’s a full cycle of description, estimation, and model assessment.

6.3.5 Improve the fit

Let’s try adding a degree(0) term to see how that changes the degree distribution assessment. Note that in this example, we’re using a shortcut for control.ergm.ego – the snctrl function. The snctrl shortcut can be used in all of the Statnet packages (ergm, tergm, etc.) to specify the controls specific to each type of model.

set.seed(1)
fit.deg0 <- ergm.ego(mesa.ego ~ edges + degree(0), 
                     control = snctrl(ppopsize=1000))
summary(fit.deg0)
Call:
ergm.ego(formula = mesa.ego ~ edges + degree(0), control = snctrl(ppopsize = 1000))

Monte Carlo Maximum Likelihood Results:

                    Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)  -6.9324     0.0000      0    -Inf   <1e-04 ***
edges                 1.1704     0.1042      0  11.234   <1e-04 ***
degree0               1.4815     0.2592      0   5.716   <1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 
mcmc.diagnostics(fit.deg0, which = "plots")


Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).
plot(gof(fit.deg0, GOF="model"))

plot(gof(fit.deg0, GOF="degree"))

So, we’ve now fit the isolates exactly, and the overall fit is better, but the deviations suggest there are more nodes with just one tie than would be expected, given the mean degree and the number of isolates.

And just to round things off, let’s fit a relatively large model. Here we’ll specify the omitted category for Race as the largest group.

fit.full <- ergm.ego(mesa.ego ~ edges + degree(0:1) 
                     + nodefactor("Sex")
                     + nodefactor("Race", levels = -LARGEST)
                     + nodefactor("Grade")
                     + nodematch("Sex") 
                     + nodematch("Race") 
                     + nodematch("Grade"))
summary(fit.full)
Call:
ergm.ego(formula = mesa.ego ~ edges + degree(0:1) + nodefactor("Sex") + 
    nodefactor("Race", levels = -LARGEST) + nodefactor("Grade") + 
    nodematch("Sex") + nodematch("Race") + nodematch("Grade"))

Monte Carlo Maximum Likelihood Results:

                      Estimate Std. Error MCMC % z value Pr(>|z|)    
offset(netsize.adj)   -5.32301    0.00000      0    -Inf  < 1e-04 ***
edges                 -1.38926    0.19665      0  -7.065  < 1e-04 ***
degree0                2.09717    0.36081      0   5.812  < 1e-04 ***
degree1                1.00401    0.28150      0   3.567 0.000362 ***
nodefactor.Sex.M      -0.17310    0.06319      0  -2.739 0.006155 ** 
nodefactor.Race.Black  1.20790    0.21176      0   5.704  < 1e-04 ***
nodefactor.Race.NatAm  0.30280    0.05821      0   5.202  < 1e-04 ***
nodefactor.Race.Other -0.90243    0.61221      0  -1.474 0.140466    
nodefactor.Race.White  0.57599    0.13107      0   4.394  < 1e-04 ***
nodefactor.Grade.8     0.14240    0.05373      0   2.650 0.008044 ** 
nodefactor.Grade.9     0.14073    0.04792      0   2.937 0.003319 ** 
nodefactor.Grade.10    0.31597    0.07197      0   4.391  < 1e-04 ***
nodefactor.Grade.11    0.40663    0.05753      0   7.068  < 1e-04 ***
nodefactor.Grade.12    0.77803    0.07399      0  10.515  < 1e-04 ***
nodematch.Sex          0.64352    0.12148      0   5.297  < 1e-04 ***
nodematch.Race         0.83975    0.12813      0   6.554  < 1e-04 ***
nodematch.Grade        3.05340    0.15340      0  19.904  < 1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 The following terms are fixed by offset and are not estimated:
  offset(netsize.adj) 
mcmc.diagnostics(fit.full, which = "plots")


Note: To save space, only one in every 2 iterations of the MCMC sample
  used for estimation was stored for diagnostics. Sample size per chain
  was originally around 5266 with thinning interval 2048.

Note: MCMC diagnostics shown here are from the last round of
  simulation, prior to computation of final parameter estimates.
  Because the final estimates are refinements of those used for this
  simulation run, these diagnostics may understate model performance.
  To directly assess the performance of the final model on in-model
  statistics, please use the GOF command: gof(ergmFitObject,
  GOF=~model).
plot(gof(fit.full, GOF="model"))

plot(gof(fit.full, GOF="degree"))

In general the model diagnostics look good. If this were a genuine sample of 205 students from a larger school, we could infer the following:

  • there are many more isolates, and more degree 1 nodes than expected by chance;

  • there are significant differences in mean degree by race, with the largest group (Hispanics, the reference category) nominating fewer friends than most of the other groups;

  • 7th graders nominate fewer friends than all other grades;

  • there are strong and significant homophily effects, for all three attributes.

It is possible to simulate complete networks from this ergm.ego fit object – just as we would from an ergm fit object:

sim.full <- simulate(fit.full)
summary(mesa.ego ~ edges + degree(0:1)
                      + nodefactor("Sex")
                      + nodefactor("Race", levels = -LARGEST)
                      + nodefactor("Grade")
                      + nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
                      scaled mean      SE
edges                         203 15.2022
degree0                        57  6.4306
degree1                        51  6.2048
nodefactor.Sex.M              171 17.1990
nodefactor.Race.Black          26  6.5507
nodefactor.Race.NatAm         156 19.7787
nodefactor.Race.Other           1  0.7054
nodefactor.Race.White          45  9.1943
nodefactor.Grade.8             75 17.3212
nodefactor.Grade.9             65 11.2475
nodefactor.Grade.10            36  8.0931
nodefactor.Grade.11            49 11.4861
nodefactor.Grade.12            28  7.2756
nodematch.Sex                 132 12.1128
nodematch.Race                103 10.0369
nodematch.Grade               163 13.6309
summary(sim.full ~ edges + degree(0:1)
                      + nodefactor("Sex")
                      + nodefactor("Race", levels = -LARGEST)
                      + nodefactor("Grade")
                      + nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
                edges               degree0               degree1 
                  167                    54                    66 
     nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm 
                  130                    12                   150 
nodefactor.Race.Other nodefactor.Race.White    nodefactor.Grade.8 
                    3                    32                    63 
   nodefactor.Grade.9   nodefactor.Grade.10   nodefactor.Grade.11 
                   60                    29                    36 
  nodefactor.Grade.12         nodematch.Sex        nodematch.Race 
                   22                   113                    84 
      nodematch.Grade 
                  138 
plot(sim.full, vertex.col="Grade")
legend('bottomleft',fill=7:12,legend=paste('Grade',7:12),cex=0.75)

(Note that we have implicitly used simulate already – it is the basis of the GOF results.)

We can use network size invariance to simulate networks of a different size, although one has to be careful when the observed statistics are too small to be reliable (e.g., the nodefactor.Race.Other statistic here):

sim.full2 <- simulate(fit.full, popsize=network.size(mesa)*2)
summary(mesa~edges + degree(0:1)
                      + nodefactor("Sex")
                      + nodefactor("Race", levels = -LARGEST)
                      + nodefactor("Grade")
                      + nodematch("Sex") + nodematch("Race") + nodematch("Grade"))*2
                edges               degree0               degree1 
                  406                   114                   102 
     nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm 
                  342                    52                   312 
nodefactor.Race.Other nodefactor.Race.White    nodefactor.Grade.8 
                    2                    90                   150 
   nodefactor.Grade.9   nodefactor.Grade.10   nodefactor.Grade.11 
                  130                    72                    98 
  nodefactor.Grade.12         nodematch.Sex        nodematch.Race 
                   56                   264                   206 
      nodematch.Grade 
                  326 
summary(sim.full2~edges + degree(0:1)
                      + nodefactor("Sex")
                      + nodefactor("Race", levels = -LARGEST)
                      + nodefactor("Grade")
                      + nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
                edges               degree0               degree1 
                  405                   120                    90 
     nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm 
                  312                    67                   307 
nodefactor.Race.Other nodefactor.Race.White    nodefactor.Grade.8 
                    2                    86                   146 
   nodefactor.Grade.9   nodefactor.Grade.10   nodefactor.Grade.11 
                  125                    74                    95 
  nodefactor.Grade.12         nodematch.Sex        nodematch.Race 
                   57                   279                   206 
      nodematch.Grade 
                  334 

We have only demonstrated the functionality briefly here, but this kind of simulation is a powerful way to diagnose structural properties of the fitted model, and to identify and remedy systematic lack of fit.

We will leave this model here and go on to explore how sampling uncertainty produces the standard errors of our coefficients.

6.4 Parameter recovery and sampling

When we estimate parameters based on sampled data, the sampling uncertainty in our estimates comes from the differences in the observations we draw from sample to sample, and the magnitude of uncertainty is a function of our sample size. This is why we typically see something like \(\sqrt{n}\) in the denominator of the standard error of a sample mean or sample proportion. The same principle holds in the context of egocentric network sampling: the standard errors will depend on the number of egos sampled.
This holds even though we rescale first up to the pseudo-population size and then back down to per capita values: neither rescaling affects the standard errors, which depend only on the size of the egocentric sample.
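The \(\sqrt{n}\) scaling is easy to verify by simulation. A minimal base-R sketch (toy data, not part of the tutorial):

```r
# Standard error of a sample mean, estimated by simulation: quadrupling the
# sample size should roughly halve the SE (1/sqrt(100) = 0.10, 1/sqrt(400) = 0.05).
set.seed(42)
se.of.mean <- function(n, reps = 2000) sd(replicate(reps, mean(rnorm(n))))
round(c(n100 = se.of.mean(100), n400 = se.of.mean(400)), 3)
```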

So let’s use the sample function from ergm.ego to demonstrate this effect. For this section we will use the larger built-in network, faux.magnolia.high.

data(faux.magnolia.high)
faux.magnolia.high -> fmh
N <- network.size(fmh)

Let’s start by fitting an ERGM to the complete network, and looking at the coefficients:

fit.ergm <- ergm(fmh ~ degree(0:3) 
                 + nodefactor("Race", levels=TRUE) + nodematch("Race")
                 + nodefactor("Sex") + nodematch("Sex") 
                 + absdiff("Grade"))
round(coef(fit.ergm), 3)
              degree0               degree1               degree2 
                0.954                 0.274                 0.034 
              degree3 nodefactor.Race.Asian nodefactor.Race.Black 
               -0.240                -2.476                -3.045 
 nodefactor.Race.Hisp nodefactor.Race.NatAm nodefactor.Race.Other 
               -2.693                -2.263                -2.634 
nodefactor.Race.White        nodematch.Race      nodefactor.Sex.M 
               -3.385                 1.679                -0.087 
        nodematch.Sex         absdiff.Grade 
                0.860                -2.116 

Egocentric census

Now, suppose we only observe an egocentric view of the data – as an egocentric census. With an egocentric census, it’s as though we give a survey to all of the students. Each student nominates her friends but does not report their names; she reports only each friend’s sex, race, and grade. How does the fit from ergm.ego to this egocentric census compare to the complete-network ergm estimates?

fmh.ego <- as.egor(fmh)
head(fmh.ego)
# EGO data (active): 3 × 5
  .egoID Grade Race  Sex   vertex.names
*  <int> <dbl> <chr> <chr> <chr>       
1      1     9 Black F     1           
2      2    10 Black M     2           
3      3    12 Black F     3           
# ALTER data: 6 × 6
  .altID .egoID Grade Race  Sex   vertex.names
*  <int>  <int> <dbl> <chr> <chr> <chr>       
1    669      1     9 Black F     669         
2    963      2    10 White F     963         
3    912      2    10 White M     912         
# ℹ 3 more rows
# AATIE data: 0 × 3
# ℹ 3 variables: .egoID <int>, .srcID <int>, .tgtID <int>
egofit <- ergm.ego(fmh.ego ~ degree(0:3) 
                   + nodefactor("Race", levels=TRUE) + nodematch("Race")
                   + nodefactor("Sex") + nodematch("Sex") 
                   + absdiff("Grade"), popsize=N,
                  control = snctrl(ppopsize=N))

# A convenience function.
model.se <- function(fit) sqrt(diag(vcov(fit)))

# Parameters recovered:
coef.compare <- data.frame(
  "NW est" = coef(fit.ergm), 
  "Ego Cen est" = coef(egofit)[-1],
  "diff Z" = (coef(fit.ergm)-coef(egofit)[-1])/model.se(egofit)[-1])

round(coef.compare, 3)
                      NW.est Ego.Cen.est diff.Z
degree0                0.954       0.939  0.035
degree1                0.274       0.262  0.034
degree2                0.034       0.032  0.008
degree3               -0.240      -0.243  0.015
nodefactor.Race.Asian -2.476      -2.485  0.065
nodefactor.Race.Black -3.045      -3.048  0.028
nodefactor.Race.Hisp  -2.693      -2.708  0.125
nodefactor.Race.NatAm -2.263      -2.282  0.136
nodefactor.Race.Other -2.634      -2.648  0.049
nodefactor.Race.White -3.385      -3.387  0.025
nodematch.Race         1.679       1.677  0.028
nodefactor.Sex.M      -0.087      -0.087  0.033
nodematch.Sex          0.860       0.860  0.006
absdiff.Grade         -2.116      -2.111 -0.080

Again, we can diagnose the fitted egocentric model for proper convergence. (We include the code but leave this as an exercise for you.)

# MCMC diagnostics. 
mcmc.diagnostics(egofit, which="plots")

And check whether the model converged to the right statistics:

plot(gof(egofit, GOF="model"))

Now let’s check whether the fitted model can be used to reconstruct the degree distribution.

plot(gof(egofit, GOF="degree"))


Egocentric Sample: Same size

What if we only had an equally large sample, instead of an egocentric census? Here, we sample N students with replacement.

set.seed(1)
fmh.egosampN <- sample(fmh.ego, N, replace=TRUE)

egofitN <- ergm.ego(fmh.egosampN ~ degree(0:3) 
                    + nodefactor("Race", levels=TRUE) + nodematch("Race") 
                    + nodefactor("Sex") + nodematch("Sex")
                    + absdiff("Grade"),
                    popsize=N)
Constructing pseudopopulation network.
Unable to match target stats. Using MCMLE estimation.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 0.2525.
The log-likelihood improved by 1.9777.
Estimating equations are not within tolerance region.
Iteration 2 of at most 60:
1 Optimizing with step length 0.4057.
The log-likelihood improved by 2.1715.
Estimating equations are not within tolerance region.
Iteration 3 of at most 60:
1 Optimizing with step length 0.6318.
The log-likelihood improved by 2.0157.
Estimating equations are not within tolerance region.
Iteration 4 of at most 60:
1 Optimizing with step length 0.9426.
The log-likelihood improved by 1.7975.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
1 Optimizing with step length 0.8976.
The log-likelihood improved by 1.8591.
Estimating equations are not within tolerance region.
Iteration 6 of at most 60:
1 Optimizing with step length 0.6572.
The log-likelihood improved by 2.1477.
Estimating equations are not within tolerance region.
Iteration 7 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9768.
Estimating equations are not within tolerance region.
Iteration 8 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.7612.
Estimating equations are not within tolerance region.
Iteration 9 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.6097.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 10 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1171.
Estimating equations are not within tolerance region.
Iteration 11 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1323.
Estimating equations are not within tolerance region.
Iteration 12 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1681.
Estimating equations are not within tolerance region.
Iteration 13 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1115.
Estimating equations are not within tolerance region.
Iteration 14 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2142.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 15 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1448.
Estimating equations are not within tolerance region.
Iteration 16 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2262.
Estimating equations are not within tolerance region.
Iteration 17 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1817.
Estimating equations are not within tolerance region.
Iteration 18 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0152.
Convergence test p-value: 0.8234. Not converged with 99% confidence; increasing sample size.
Iteration 19 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0651.
Convergence test p-value: 0.7643. Not converged with 99% confidence; increasing sample size.
Iteration 20 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0410.
Convergence test p-value: 0.0149. Not converged with 99% confidence; increasing sample size.
Iteration 21 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0150.
Convergence test p-value: 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC.  To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
# compare the coef
coef.compare <- data.frame(
  "NW est" = coef(fit.ergm), 
  "Ego SampN est" = coef(egofitN)[-1],
  "diff Z" = (coef(fit.ergm)-coef(egofitN)[-1])/model.se(egofitN)[-1])

round(coef.compare, 3)
                      NW.est Ego.SampN.est diff.Z
degree0                0.954         1.388 -0.933
degree1                0.274         0.516 -0.661
degree2                0.034         0.363 -1.184
degree3               -0.240        -0.021 -1.068
nodefactor.Race.Asian -2.476        -2.397 -0.516
nodefactor.Race.Black -3.045        -2.911 -1.206
nodefactor.Race.Hisp  -2.693        -2.532 -1.294
nodefactor.Race.NatAm -2.263        -2.112 -1.136
nodefactor.Race.Other -2.634        -2.623 -0.034
nodefactor.Race.White -3.385        -3.275 -1.037
nodematch.Race         1.679         1.613  0.812
nodefactor.Sex.M      -0.087        -0.142  2.012
nodematch.Sex          0.860         0.883 -0.407
absdiff.Grade         -2.116        -2.023 -1.353
# compare the s.e.'s
se.compare <- data.frame(
  "NW SE" = model.se(fit.ergm), 
  "Ego census SE" =model.se(egofit)[-1], 
  "Ego SampN SE" = model.se(egofitN)[-1])

round(se.compare, 3)
                      NW.SE Ego.census.SE Ego.SampN.SE
degree0               0.462         0.430        0.464
degree1               0.365         0.345        0.367
degree2               0.277         0.261        0.277
degree3               0.198         0.193        0.204
nodefactor.Race.Asian 0.150         0.144        0.152
nodefactor.Race.Black 0.115         0.102        0.112
nodefactor.Race.Hisp  0.144         0.121        0.124
nodefactor.Race.NatAm 0.165         0.141        0.133
nodefactor.Race.Other 0.402         0.292        0.335
nodefactor.Race.White 0.111         0.102        0.105
nodematch.Race        0.103         0.080        0.081
nodefactor.Sex.M      0.032         0.029        0.028
nodematch.Sex         0.070         0.054        0.056
absdiff.Grade         0.072         0.068        0.069

Egocentric Sample: Smaller sample

What if we have a smaller sample? If we have a sample of \(N/4=365\) students, how will our standard errors be affected?

set.seed(0) # Some samples have different sets of alter levels from ego levels.

fmh.egosampN4 <- sample(fmh.ego, round(N/4), replace=TRUE)

egofitN4 <- ergm.ego(fmh.egosampN4 ~ degree(0:3) 
                    + nodefactor("Race", levels=TRUE) + nodematch("Race") 
                    + nodefactor("Sex") + nodematch("Sex")
                    + absdiff("Grade"),
                    popsize=N)
Constructing pseudopopulation network.
Note: Constructed network has size 1460, different from requested 1461. Estimation should not be meaningfully affected.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 0.2258.
The log-likelihood improved by 2.0248.
Estimating equations are not within tolerance region.
Iteration 2 of at most 60:
1 Optimizing with step length 0.3843.
The log-likelihood improved by 2.0104.
Estimating equations are not within tolerance region.
Iteration 3 of at most 60:
1 Optimizing with step length 0.8968.
The log-likelihood improved by 2.6821.
Estimating equations are not within tolerance region.
Iteration 4 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.5568.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2369.
Estimating equations are not within tolerance region.
Iteration 6 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2663.
Estimating equations are not within tolerance region.
Iteration 7 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1536.
Estimating equations are not within tolerance region.
Iteration 8 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.8350.
Estimating equations are not within tolerance region.
Iteration 9 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9888.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 10 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9633.
Estimating equations are not within tolerance region.
Iteration 11 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0807.
Convergence test p-value: 0.8179. Not converged with 99% confidence; increasing sample size.
Iteration 12 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1154.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 13 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0990.
Convergence test p-value: 0.8092. Not converged with 99% confidence; increasing sample size.
Iteration 14 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0206.
Convergence test p-value: < 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC.  To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
# compare the coef
coef.compare <- data.frame(
  "NW est" = coef(fit.ergm), 
  "Ego SampN4 est" = coef(egofitN4)[-1],
  "diff Z" = (coef(fit.ergm)-coef(egofitN4)[-1])/model.se(egofitN4)[-1])

round(coef.compare, 3)
                      NW.est Ego.SampN4.est diff.Z
degree0                0.954          0.529  0.458
degree1                0.274         -0.239  0.697
degree2                0.034         -0.041  0.141
degree3               -0.240         -0.363  0.314
nodefactor.Race.Asian -2.476         -2.190 -1.215
nodefactor.Race.Black -3.045         -2.991 -0.237
nodefactor.Race.Hisp  -2.693         -2.808  0.489
nodefactor.Race.NatAm -2.263         -2.498  0.980
nodefactor.Race.Other -2.634         -2.412 -0.455
nodefactor.Race.White -3.385         -3.431  0.219
nodematch.Race         1.679          1.579  0.750
nodefactor.Sex.M      -0.087         -0.192  1.755
nodematch.Sex          0.860          0.889 -0.262
absdiff.Grade         -2.116         -2.078 -0.267
# compare the s.e.'s
se.compare <- data.frame(
  "NW SE" = model.se(fit.ergm), 
  "Ego census SE" =model.se(egofit)[-1], 
  "Ego SampN SE" = model.se(egofitN)[-1],
  "Ego Samp4 SE" = model.se(egofitN4)[-1])

round(se.compare, 3)
                      NW.SE Ego.census.SE Ego.SampN.SE Ego.Samp4.SE
degree0               0.462         0.430        0.464        0.929
degree1               0.365         0.345        0.367        0.736
degree2               0.277         0.261        0.277        0.539
degree3               0.198         0.193        0.204        0.394
nodefactor.Race.Asian 0.150         0.144        0.152        0.236
nodefactor.Race.Black 0.115         0.102        0.112        0.230
nodefactor.Race.Hisp  0.144         0.121        0.124        0.235
nodefactor.Race.NatAm 0.165         0.141        0.133        0.240
nodefactor.Race.Other 0.402         0.292        0.335        0.488
nodefactor.Race.White 0.111         0.102        0.105        0.213
nodematch.Race        0.103         0.080        0.081        0.134
nodefactor.Sex.M      0.032         0.029        0.028        0.060
nodematch.Sex         0.070         0.054        0.056        0.109
absdiff.Grade         0.072         0.068        0.069        0.143

As with ordinary statistics, the standard error is inversely proportional to the square root of the sample size.
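We can check this against the se.compare table above: the ratios of the \(N/4\)-sample SEs to the full-\(N\)-sample SEs (values copied from the printed output) should be roughly \(\sqrt{4}=2\):

```r
# SE ratios: N/4 sample vs. full-N sample (values from the se.compare table).
seN  <- c(0.464, 0.367, 0.277, 0.204, 0.152, 0.112, 0.124, 0.133,
          0.335, 0.105, 0.081, 0.028, 0.056, 0.069)
seN4 <- c(0.929, 0.736, 0.539, 0.394, 0.236, 0.230, 0.235, 0.240,
          0.488, 0.213, 0.134, 0.060, 0.109, 0.143)
round(median(seN4 / seN), 2)  # close to 2, as predicted
```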


7 Package Development

The ergm.ego package is under active development on GitHub at statnet/ergm.ego. This repository is the place to go to report bugs or request features (feature requests accompanied by a pull request are especially appreciated). If you are interested in contributing to the development of ergm.ego, please contact us through the GitHub interface.

Additional functionality is planned in the near future:

  • Support for directed relations.

  • Support for automatic fitting of TERGMs.

  • Support for target statistics distinct from ERGM statistics.

  • Support for degree censoring.

Appendices

A Real world example

Motivation: Analyzing racial disparities in HIV in the US

The work on ergm.ego was originally motivated by a specific question in the field of HIV epidemiology—Does network structure help explain the persistent racial disparities in HIV prevalence in the United States?

An African American today is 10 times more likely than a white American to be living with HIV/AIDS. The disparity begins early in life, persists through to old age, and is evident among all risk groups: heterosexuals, men who have sex with men (MSM), and injection drug users. The disproportionate risks faced by heterosexual African-American women are especially steep. In 2010, an African-American woman was over 40 times more likely to be diagnosed with HIV than a heterosexual white man (Figure 1).

Figure 1

Empirical studies repeatedly find that these disparities cannot be explained by individual behavior, or biological differences.

A growing body of work is therefore focused on the role of the underlying transmission network. This network can channel the spread of infection in the same way that a transportation network channels the flow of traffic, with emergent patterns that reflect the connectivity of the system, rather than the behavior of any particular element.

Descriptive analyses and simulation studies have focused attention on two structural features: homophily and concurrency. Homophily is the strong propensity for within-group partner selection. Concurrency is non-monogamy: having partners that overlap in time, which increases network connectivity by allowing stable connected components larger than dyads (pairs of individuals) to emerge.

The hypothesis is that these two network properties together can produce the sustained HIV/STI prevalence differentials we observe: differences in concurrency between groups are the mechanism that generates the prevalence disparity, while homophily is the mechanism that sustains it.

We will never observe the complete dynamic sexual network that transmits HIV. But ergm.ego allows us to test the network hypothesis with egocentrically sampled data, and we demonstrate that here using data from the 1994 National Health and Social Life Survey. The analysis comes from a recent paper (Krivitsky and Morris 2017).

First, ergm.ego allows us to assess whether empirical patterns of homophily and concurrency are in the predicted directions and statistically significant. We do this in the usual way – comparing sequential model fits with terms that represent the hypotheses of interest, using t-tests for their coefficients. We will discuss these terms in more detail in later sections, but here we test the concurrency effects with “monogamy bias” terms.

Table 1

Result: Yes, the homophily and concurrency effects are in the predicted directions and statistically significant.

Next, we can assess the goodness of fit of each model in the way we usually do in ERGMs, by checking whether the models reproduce observed network properties that are not in the model. We do this here by simulating from each model and comparing the fits to the full observed degree distribution:

Figure 2

Result: Only Model 3 (with both hypothesized network effects) is able to reproduce the observed degree distribution.

The ability to simulate complete networks from the model, however, allows us to do much more–we can now examine the connectivity in the overall network that each of these models would generate. For example, we can examine the component size distributions under each model:

Figure 3

Result: Model 3, with its “monogamy bias,” dramatically reduces the right skew of the component size distribution and places most people in components of size 2, or size 3 if they are in anything larger.

Finally, we can define a measure of “network exposure” that represents the signature feature of a network effect: indirect exposure to HIV via a partner’s behavior, rather than direct exposure via one’s own behavior. One metric for network exposure is the probability of being in a component of size 3 or more. Because this is a node-level metric, we can break it down by race and sex for each of the three models:
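As a toy base-R illustration (not the paper’s code) of how such a node-level exposure metric can be computed, here is the share of nodes in a connected component of size 3 or more, using a made-up edge list and a simple union-find:

```r
# Made-up undirected edge list: components {1,2,3}, {4,5}, {6,7,8,9}.
edges <- rbind(c(1, 2), c(2, 3), c(4, 5), c(6, 7), c(7, 8), c(8, 9))
n <- 9
parent <- seq_len(n)
find <- function(i) { while (parent[i] != i) i <- parent[i]; i }
for (k in seq_len(nrow(edges))) {            # union-find over the edge list
  a <- find(edges[k, 1]); b <- find(edges[k, 2])
  if (a != b) parent[a] <- b
}
root  <- vapply(seq_len(n), find, integer(1))
csize <- table(root)                         # component sizes
mean(csize[as.character(root)] >= 3)         # 7 of 9 nodes are "exposed"
```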

Figure 3

Result: Only model 3 produces a pattern of network exposure that is consistent with the observed disparities in HIV incidence.

ergm.ego provides a powerful analytic framework that uses extremely limited network data and testable models to investigate the unobservable patterns of complete network connectivity that are consistent with the sampled data.

B TERGMs with egocentrically sampled data

The principles of egocentric inference can be extended to temporal ERGMs (TERGMs). While we will not cover that in this workshop, an example can be found in another paper that sought to evaluate the network hypothesis for racial disparities in HIV in the US (Morris et al. 2009). In that paper, egocentric data from the National Longitudinal Survey of Adolescent Health (AddHealth) were analyzed, and an example of the resulting dynamic complete network simulation (on 10,000 nodes) can be found in this “network movie”.

The movie below is another simpler example – an epidemic spreading on a small dynamic contact network that is simulated with a STERGM estimated from egocentrically sampled network data. The movie was produced by the R packages EpiModel and ndtv, which are based on the Statnet tools.

C Formal definitions of egocentric statistics

C.1 Notation

We’ll need some notation for this (sorry, and a warning that it will get hairier).

C.1.1 Population network

Parameter Meaning
\(N\) the population being studied: a very large, but finite, set of actors whose relations are of interest
\(x _ i\) attribute (e.g., age, sex, race) vector of actor \(i \in N\)
\(x_N\) (or just \(x\), when there is no ambiguity) the attributes of actors in \(N\)
\(\mathbb{Y}(N)\) the set of dyads (potential ties) in an undirected network of actors in \(N\)
\(y\subseteq \mathbb{Y}(N)\) the population network: a fixed but unknown set of relationships of interest. In particular,
\(y_{ij}\) an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\)
\(y _ i=\{j\in N: y _ {ij}=1\}\) the set of \(i\)’s network neighbors.

C.1.2 Egocentric sample

Parameter Meaning
\(e_{N}\) the egocentric census, the information retained by the minimal egocentric sampling design when all nodes are sampled
\(S\subseteq N\) the set of egos in a sample
\(e_{S}\) the data contained in an egocentric sample
\(e_i\) the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts:
\(e^e_i \equiv x_i\) \(i\)’s own attributes
\(e^a_i \equiv (x_{j})_{j\in y_i}\) an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but not their identities (indices in \(N\))
\(e^e_{i,k}\equiv x_{i,k}\) the \(k\)th attribute/covariate observed on ego \(i\)
\(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\) the \(k\)th attribute/covariate observed on \(i\)’s alters

C.2 Egocentric Statistics

We call a network statistic \(g_{k}(\cdot,\cdot)\) egocentric if it can be expressed as \[ g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i) \] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.
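For example, the edge count is egocentric with \(h_k(e_i)=\lvert e^a_i\rvert/2\): each ego reports only how many alters it has, and each tie is counted once from each end. A toy base-R check (made-up adjacency matrix):

```r
# |y| recovered from ego reports alone: h(e_i) = |e^a_i| / 2.
adj <- matrix(0, 4, 4)
adj[cbind(c(1, 1, 2, 3), c(2, 3, 3, 4))] <- 1
adj <- adj + t(adj)            # undirected: symmetrize
sum(adj) / 2                   # network-level edge count: 4
sum(rowSums(adj) / 2)          # egocentric reconstruction: 4
```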

The space of egocentric statistics includes dyadic-independent statistics that can be expressed in the general form of \[ g_{k}(y,x)=\sum_{ij\in y} f_k(x_i,x_j) \] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some dyadic-dependent statistics that can be expressed as \[ g_{k}(y,x)=\sum_{i\in N} f_k ({x_{i},(x_j)_{j\in y_i}}) \] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.

The statistics that are identifiable in an egocentric sample depend on the specific egocentric study design.

Basic (minimal) egocentric design (alter attributes only)
  • Nodal Covariate/Factor effects
  • Homophily
  • Degree distribution
With ego reports of alter degree (the number of alter’s ties)
  • Degree assortativity
With alter-alter ties
  • Triadic closure (transitive/cyclical ties, triangles)
  • 4-cycles (possibly)
Not Egocentric for other reasons, but estimable
  • Mean degree (\(g_{k}(y,x)=2|y|/|N|\)): \(e _ i\) doesn’t know how big the network is

The table below (from Krivitsky & Morris 2017) shows some examples of egocentric statistics, and gives their representations in terms of \(h_{k}(\cdot)\).

Examples of egocentric statistics for undirected networks, reproduced from Krivitsky and Morris (2017). \(x _ {i,k}\) may be a dummy variable indicating \(i\)’s membership in a particular exogenously defined group. \(h_{k}(e_i)\) that sum over ties are halved because each tie is observed egocentrically twice: once at each end.
Statistic \(g_{k}( y,x)\) \(h _ {k}(e_i)\)
General sum over ties \(\sum _ {(i,j)\in y} f _ k(x _ i,x _ j)\) \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} f _ k\big(e^\text{e}_i,e^\text{a}_{i,j'}\big)\)
Number of ties in the network \(\lvert y \rvert\equiv \sum _ {(i,j) \in y} 1\) \(\frac{1}{2}\lvert e^\text{a}_{i}\rvert\)
weighted by actor covariate \(x _ {i,k}\) \(\sum _ {(i,j) \in y} (x _ {i,k}+x _ {j,k})\) \(\frac{1}{2} \big(e^\text{e}_{i,k} \lvert e^\text{a}_{i}\rvert + \sum _ {j'\in e^\text{a} _ i} e^\text{a}_{i,j',k} \big)\)
weighted by difference in \(x _ {i,k}\) \(\sum _ {(i,j) \in y} \lvert x _ {i,k}-x _ {j,k}\rvert\) \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} \lvert e^\text{e}_{i,k}-e^\text{a}_{i,j',k}\rvert\)
within groups identified by \(x _ {i,k}\) \(\sum _ {(i,j) \in y} 1_{x _ {i,k}=x _ {j,k}}\) \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} 1_{ e^\text{e}_{i,k}= e^\text{a}_{i,j',k}}\)
General sum over actors \(\sum _ {i\in N} f _ k\big\{x _ {i},(x _ j) _ {j\in y_{i}}\big\}\) \(f _ k\big(e^\text{e}_i,e^\text{a}_{i}\big)\)
Number of actors with \(d\) neighbors \(\sum _ {i\in N} 1_{\lvert y_{i}\rvert=d}\) \(1_{\lvert e^\text{a}_{i}\rvert=d}\)
weighted by actor covariate \(x _ {i,k}\) \(\sum _ {i\in N} x _ {i,k} 1_{\lvert y_{i}\rvert=d}\) \(e^\text{e}_{i,k}1_{\lvert e^\text{a}_{i}\rvert=d}\)
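The halving in the \(h_{k}(e_i)\) column can be checked directly: summing the halved ego-side counts over all egos recovers the network-level statistic exactly. A toy base-R sketch for the within-group (nodematch-type) row, with made-up attributes and ties:

```r
# Within-group tie count, computed two ways on a random toy network.
set.seed(1)
n <- 20
x <- sample(c("A", "B"), n, replace = TRUE)      # a nodal attribute x_{i,k}
adj <- matrix(rbinom(n * n, 1, 0.2), n, n)
adj[lower.tri(adj, diag = TRUE)] <- 0
adj <- adj + t(adj)                               # symmetric, no self-ties
g.match <- sum(adj[outer(x, x, "==")]) / 2        # network-level sum over ties
h.match <- sum(sapply(seq_len(n), function(i)     # egocentric: half per ego
  sum(adj[i, ] * (x == x[i])) / 2))
c(network = g.match, egocentric = h.match)        # identical
```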

D Session Info

Session info
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14)
 os       Ubuntu 22.04.4 LTS
 system   x86_64, linux-gnu
 ui       X11
 language en
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-06-23
 pandoc   3.1.2 @ /usr/bin/ (via rmarkdown)

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version       date (UTC) lib source
 bookdown         0.39          2024-04-15 [1] CRAN (R 4.4.0)
 bslib            0.7.0         2024-03-29 [1] CRAN (R 4.4.0)
 cachem           1.1.0         2024-05-16 [1] CRAN (R 4.4.0)
 cli              3.6.3         2024-06-21 [1] CRAN (R 4.4.1)
 coda             0.19-4.1      2024-01-31 [1] CRAN (R 4.4.0)
 DBI              1.2.3         2024-06-02 [1] CRAN (R 4.4.0)
 deldir           2.0-4         2024-02-28 [1] CRAN (R 4.4.1)
 DEoptimR         1.1-3         2023-10-07 [1] CRAN (R 4.4.0)
 digest           0.6.35        2024-03-11 [1] CRAN (R 4.4.0)
 dplyr          * 1.1.4         2023-11-17 [1] CRAN (R 4.4.0)
 egor           * 1.24.2        2024-06-20 [1] Github (tilltnet/egor@44d87a0)
 ergm           * 4.7-7368      2024-06-20 [1] Github (statnet/ergm@93ecb25)
 ergm.ego       * 1.1-704       2024-06-20 [1] local
 evaluate         0.24.0        2024-06-10 [1] CRAN (R 4.4.0)
 fansi            1.0.6         2023-12-08 [1] CRAN (R 4.4.0)
 fastmap          1.2.0         2024-05-15 [1] CRAN (R 4.4.0)
 generics         0.1.3         2022-07-05 [1] CRAN (R 4.4.0)
 glue             1.7.0         2024-01-09 [1] CRAN (R 4.4.0)
 highr            0.11          2024-05-26 [1] CRAN (R 4.4.0)
 htmltools        0.5.8.1       2024-04-04 [1] CRAN (R 4.4.0)
 igraph           2.0.3         2024-03-13 [1] CRAN (R 4.4.0)
 interp           1.1-6         2024-01-26 [1] CRAN (R 4.4.1)
 jpeg             0.1-10        2022-11-29 [1] CRAN (R 4.4.1)
 jquerylib        0.1.4         2021-04-26 [1] CRAN (R 4.4.0)
 jsonlite         1.8.8         2023-12-04 [1] CRAN (R 4.4.0)
 knitr          * 1.47          2024-05-29 [1] CRAN (R 4.4.0)
 lattice          0.22-6        2024-03-20 [4] CRAN (R 4.4.1)
 latticeExtra     0.6-30        2022-07-04 [1] CRAN (R 4.4.1)
 lifecycle        1.0.4         2023-11-07 [1] CRAN (R 4.4.0)
 lpSolveAPI       5.5.2.0-17.11 2023-11-28 [1] CRAN (R 4.4.0)
 magrittr         2.0.3         2022-03-30 [1] CRAN (R 4.4.0)
 Matrix           1.7-0         2024-04-26 [4] CRAN (R 4.4.0)
 memoise          2.0.1         2021-11-26 [1] CRAN (R 4.4.0)
 mitools          2.4           2019-04-26 [1] CRAN (R 4.4.0)
 network        * 1.18.2        2024-06-20 [1] Github (statnet/network@c1b2084)
 pillar           1.9.0         2023-03-22 [1] CRAN (R 4.4.0)
 pkgconfig        2.0.3         2019-09-22 [1] CRAN (R 4.4.0)
 png              0.1-8         2022-11-29 [1] CRAN (R 4.4.0)
 purrr            1.0.2         2023-08-10 [1] CRAN (R 4.4.0)
 R6               2.5.1         2021-08-19 [1] CRAN (R 4.4.0)
 rbibutils        2.2.16        2023-10-25 [1] CRAN (R 4.4.0)
 RColorBrewer     1.1-3         2022-04-03 [1] CRAN (R 4.4.0)
 Rcpp             1.0.12        2024-01-09 [1] CRAN (R 4.4.0)
 Rdpack           2.6           2023-11-08 [1] CRAN (R 4.4.0)
 Rglpk            0.6-5.1       2024-01-13 [1] CRAN (R 4.4.0)
 rlang            1.1.4         2024-06-04 [1] CRAN (R 4.4.0)
 rle              0.9.2-234     2024-06-20 [1] Github (statnet/rle@d08b185)
 rmarkdown        2.27          2024-05-17 [1] CRAN (R 4.4.0)
 robustbase       0.99-2        2024-01-27 [1] CRAN (R 4.4.0)
 sass             0.4.9         2024-03-15 [1] CRAN (R 4.4.0)
 sessioninfo      1.2.2         2021-12-06 [1] CRAN (R 4.4.0)
 slam             0.1-50        2022-01-08 [1] CRAN (R 4.4.0)
 srvyr            1.2.0         2023-02-21 [1] CRAN (R 4.4.0)
 statnet.common   4.10.0-442    2024-06-20 [1] Github (statnet/statnet.common@4e8cb54)
 survey           4.4-2         2024-03-20 [1] CRAN (R 4.4.0)
 survival         3.7-0         2024-06-05 [4] CRAN (R 4.4.0)
 tibble         * 3.2.1         2023-03-20 [1] CRAN (R 4.4.0)
 tidygraph        1.3.1         2024-01-30 [1] CRAN (R 4.4.0)
 tidyr            1.3.1         2024-01-24 [1] CRAN (R 4.4.0)
 tidyselect       1.2.1         2024-03-11 [1] CRAN (R 4.4.0)
 trust            0.1-8         2020-01-10 [1] CRAN (R 4.4.0)
 utf8             1.2.4         2023-10-22 [1] CRAN (R 4.4.0)
 vctrs            0.6.5         2023-12-01 [1] CRAN (R 4.4.0)
 withr            3.0.0         2024-01-16 [1] CRAN (R 4.4.0)
 xfun             0.45          2024-06-16 [1] CRAN (R 4.4.1)
 yaml             2.3.8         2023-12-11 [1] CRAN (R 4.4.0)

 [1] /home/mbojan/R/library/4.4
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

References

Butts, Carter T. 2020. Sna: Tools for Social Network Analysis. https://CRAN.R-project.org/package=sna.
Freedman Ellis, Greg, and Ben Schneider. 2023. Srvyr: ’Dplyr’-Like Syntax for Summary Statistics of Survey Data. https://CRAN.R-project.org/package=srvyr.
Handcock, Mark S., and Krista J. Gile. 2010. “Modeling Social Networks from Sampled Data.” Annals of Applied Statistics 4 (1): 5–25. https://doi.org/10.1214/08-aoas221.
Handcock, Mark S., David R. Hunter, Carter T. Butts, Steven M. Goodreau, Pavel N. Krivitsky (maintainer), Martina Morris, Chad Klumb, Michał Bojanowski, and other contributors. 2022. Ergm: Fit, Simulate and Diagnose Exponential-Family Models for Networks. https://CRAN.R-project.org/package=ergm.
Handcock, Mark S., David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris. 2008. “Statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data.” Journal of Statistical Software 24 (1-9). https://www.jstatsoft.org/v24/.
Holland, Paul W., and Samuel Leinhardt. 1973. “The Structural Implications of Measurement Error in Sociometry.” Journal of Mathematical Sociology 3 (1): 85–111. https://doi.org/10.1080/0022250X.1973.9989825.
Koskinen, Johan H., Gary L. Robins, Peng Wang, and Philippa E. Pattison. 2013. “Bayesian Analysis for Partially Observed Network Data, Missing Ties, Attributes and Actors.” Social Networks 35 (4): 514–27. https://doi.org/10.1016/j.socnet.2013.07.003.
Krenz, Till, Pavel N. Krivitsky, Raffaele Vacca, Michał Bojanowski, and Andreas Herz. 2024. Egor: Import and Analyse Ego-Centered Network Data. https://CRAN.R-project.org/package=egor.
Krivitsky, Pavel N. 2023. Ergm.ego: Fit, Simulate and Diagnose Exponential-Family Random Graph Models to Egocentrically Sampled Network Data. The Statnet Project (https://statnet.org). https://CRAN.R-project.org/package=ergm.ego.
Krivitsky, Pavel N., Michał Bojanowski, and Martina Morris. 2019. “Inference for Exponential-Family Random Graph Models from Egocentrically-Sampled Data with Alter–Alter Relations.” https://documents.uow.edu.au/content/groups/public/@web/@inf/@math/documents/doc/uow259552.pdf.
Krivitsky, Pavel N., Mark S. Handcock, and Martina Morris. 2011. “Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models.” Statistical Methodology 8 (4): 319–39. https://doi.org/10.1016/j.stamet.2011.01.005.
Krivitsky, Pavel N., and Eric D. Kolaczyk. 2015. “On the Question of Effective Sample Size in Network Modeling: An Asymptotic Inquiry.” Statistical Science 30 (2): 184–98. https://doi.org/10.1214/14-sts502.
Krivitsky, Pavel N., and Martina Morris. 2017. “Inference for Social Network Models from Egocentrically Sampled Data, with Application to Understanding Persistent Racial Disparities in HIV Prevalence in the US.” Annals of Applied Statistics 11 (1): 427–55. https://doi.org/10.1214/16-AOAS1010.
Krivitsky, Pavel N., Martina Morris, and Michał Bojanowski. 2022. “Impact of Survey Design on Estimation of Exponential-Family Random Graph Models from Egocentrically-Sampled Data.” Social Networks 69: 22–34. https://doi.org/10.1016/j.socnet.2020.10.001.

  1. This does not mean that the mean degree itself cannot be estimated from egocentric data, only that our inferential results might not apply.