This tutorial is a joint product of the Statnet Development Team:
Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender-deMoll (Oakland, CA)
Michał Bojanowski (Kozminski University, Poland)
The network modeling software demonstrated in this tutorial is authored by Pavel Krivitsky (ergm.ego), with contributions from Michał Bojanowski.
All Statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org.
Need help? For general questions and comments, please email the Statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: Statnet_help listserv.
Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.
Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).
For all other issues, please email us at contact@statnet.org.
This tutorial provides an introduction to statistical modeling of egocentrically sampled network data with Exponential-family Random Graph Models (ERGMs). The primary package we will be demonstrating is ergm.ego (Krivitsky 2023), but we will make use of utilities from other Statnet packages at various points. As of version 1.0, ergm.ego depends on the egor (Krenz et al. 2024) package for egocentric network data management.
This workshop assumes basic familiarity with R, experience with network concepts, terminology and data, and familiarity with the basic principles of statistical modeling and inference. Previous experience with ERGMs is not required, but is strongly recommended (the introductory ERGM workshop is a good place to start).
The workshops are conducted using RStudio.
Open an R session, and set your working directory to the location where you would like to save this work.
To install the ergm.ego package from CRAN, run install.packages("ergm.ego") in R.
This will install all of the “dependencies” – the other R packages that ergm.ego needs.
Even though we recommend using the CRAN versions of Statnet packages, it is also possible to install the development version of the package from Statnet’s R-universe using:
install.packages(
"ergm.ego",
repos = c("https://statnet.r-universe.dev", "https://cloud.r-project.org")
)
Load the package into R and verify the package version:
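The startup messages and version string that follow are the kind of output produced by two standard calls, sketched here. The exact versions printed will depend on your installation.

```r
# Attach ergm.ego; this also attaches ergm, network, egor, dplyr and tibble
library(ergm.ego)

# Report the installed version of the package
packageVersion("ergm.ego")
```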
Loading required package: ergm
Loading required package: network
'network' 1.18.2 (2023-12-04), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4.7-7368 (2024-06-11), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
Loading required package: egor
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Loading required package: tibble
'ergm.ego' 1.1-704 (2023-05-30), part of the Statnet Project
* 'news(package="ergm.ego")' for changes since last version
* 'citation("ergm.ego")' for citation information
* 'https://statnet.org' for help, support, and other information
Attaching package: 'ergm.ego'
The following objects are masked from 'package:ergm':
COLLAPSE_SMALLEST, snctrl
The following object is masked from 'package:base':
sample
[1] '1.1.704'
Installation of a development version is also possible; see the Package Development section.
The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.
Working from egocentric samples dramatically reduces the burden of data collection, which is typically one of the largest obstacles to empirical research on networks. In many contexts the collection of a network census or an adaptive (link-traced) sample is not possible. Even when one of these is possible in theory, egocentrically sampled data are much cheaper and easier to collect.
Long regarded as the poor country cousin in the network data family, egocentric data actually contain a remarkable amount of information. With the right statistical methods, such data can be used to explore, summarize and simulate the complete networks in which they are embedded.
The basic idea here will be familiar to anyone who has worked with survey data: you combine what is observed (the data) with assumptions (the model terms and their sampling distributions), to define a class of models (the coefficients on the terms) that can be estimated.
Once estimated, the fitted model can be used for prediction. In the network context this means that the fitted model can be used to simulate complete networks. Each simulated network is a probabilistic draw (“realization”) from the distribution of networks specified by the model, and these draws will be centered on the observed statistics of the (appropriately scaled) sampled network. The stochastic variation in the simulated networks reflects both sampling uncertainty and the variation in network properties that are not included in the model terms.
It is worth emphasizing this point: the ERGM framework allows you to simulate the distribution of complete networks that are consistent with the egocentrically sampled data you have collected. You can exploit this feature to explore whole-network properties (e.g., connectivity, component size distributions, etc.) that are consistent with your data but not observable in egocentric samples.
ERGMs offer two powerful advantages to social network analysts:
- Estimation of complex models from egocentrically sampled data, and
- Simulation of complete networks from these egodata that are consistent with the observed model statistics.
All of these tasks can be accomplished using the ergm.ego package. The package comprises:
ergm.ego is designed to work with the other Statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from ergm to evaluate the model fit, the ergm simulate function to simulate complete network realizations from the model, the network descriptives from sna (Butts 2020) to explore the networks simulated from the model, and you can use other R functions and packages as well after converting the network data structure into a data frame.
Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.
The full technical details on ERGM estimation and inference from egocentrically sampled data can be found in Krivitsky and Morris (2017). This section of the tutorial provides a brief introduction to the key concepts.
This section provides a brief overview of the key principles of ERGMs that are needed to understand how estimation from egocentric data works. For a more thorough introduction to ERGM theory and its implementation in the Statnet packages, see the special issue of the Journal of Statistical Software devoted to Statnet (Handcock et al. 2008). For an introduction to the ergm package, see the Statnet ERGM Workshop.
ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks, obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness-of-fit; perform various types of model comparison; and simulate additional networks from the underlying probability distribution implied by that model.
The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(x\) denotes covariates such as nodal attributes, \(g(y,x)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(\kappa(\theta,x)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).
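To make the roles of the numerator and the normalizing constant concrete, here is a minimal base R sketch that evaluates equation (1) by brute force for all undirected networks on 3 nodes. It assumes a model with a single edges statistic and an arbitrary illustrative coefficient; it is not how ergm or ergm.ego compute anything internally, since enumeration is only feasible for tiny networks.

```r
# All 2^3 = 8 undirected networks on 3 nodes, coded by their 3 dyads (0/1)
dyads <- expand.grid(d12 = 0:1, d13 = 0:1, d23 = 0:1)

# Model statistic g(y): the number of edges in each network
g <- rowSums(dyads)

# Illustrative coefficient (an arbitrary value chosen for this sketch)
theta <- -1

# Numerator exp(theta * g(y)) for every network, and its sum, kappa(theta)
numerator <- exp(theta * g)
kappa <- sum(numerator)

# ERGM probability of each of the 8 networks
p <- numerator / kappa

# Probability mass grouped by edge count
tapply(p, g, sum)

# With only an edges term, each dyad is an independent Bernoulli trial
# with tie probability plogis(theta), so the expected edge count is:
c(by_enumeration = sum(p * g), by_formula = 3 * plogis(theta))
```

The two numbers in the last line agree, which is the dyad-independence property of the edges-only model.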
The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more (or less) common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, the statistics one can include in the model are limited by the requirement that they can be observed in the sample data (a detailed discussion can be found in Appendix C).
A key distinction in ERG model terms is whether they are dyad independent or dyad dependent. Dyad independent terms (such as nodematch for attribute homophily) imply no dependence between dyads: the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (such as degree for nodal degree, or triad-related terms such as gwesp) imply dependence between dyads.
The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are observable and dyad dependent.
Network data are distinguished by having two units of analysis: the actors and the links between the actors. A dataset that contains information on all nodes and all links is called a “network census”. The two units give rise to a range of sampling designs that can be classified into two groups: link-tracing designs (e.g., snowball and respondent-driven sampling) and egocentric designs.
A network census is, as the name suggests, a dataset that contains information on every node and every link in the population of interest. In the SNA literature, this type of data is sometimes referred to as a “sociometric” design. As with all census designs, the data collection process tends to be expensive and time-consuming. As a result, this type of data tends to show up in two different application contexts: either small, well-bounded groups like classrooms, business firms and community organizations, or online settings where the data can be efficiently scraped.
Link-trace designs have traditionally been used to sample hard-to-reach populations. Sampling begins with a set of seed nodes. The seeds are asked to nominate alters, the alters are then recruited into the sample and asked to nominate their alters, and so on. Each new set of alters is called a wave or a generation, and the number of waves of sampling is a study design variable. At each wave, either a census or a sample of the alters may be elicited and/or recruited; this too is a study design variable. Alter recruitment may be investigator-driven or respondent-driven. This gives rise to a wide range of possible link-trace designs. When the decision to elicit a new wave of alters depends on an attribute of the current node (e.g., whether the node is an injection drug user), the design is called an adaptive sample.
Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample \(S\) of respondents (“egos”) from the population \(N\). Via interview, the egos are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”). The egos are then asked to provide information on the characteristics of the alters and/or the ties, but the alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.
Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.
The minimal design (without the alter matrices) is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices.
Handcock and Gile (2010) propose likelihood inference for partially observed networks, which has egocentric data as a special case.
Koskinen et al. (2013) developed Bayesian inference for partially observed networks, which also has egocentric data as a special case.
Krivitsky and Morris (2017) use design-based estimators for sufficient statistics of the ERGM of interest, and then transfer their properties to the ERGM coefficient estimates.
Krivitsky, Bojanowski, and Morris (2019) demonstrate estimating triadic effects and scenarios in which an attribute is only observed on the ego.
Krivitsky, Morris, and Bojanowski (2022) discuss sampling design for inference.
EgoStats
As currently implemented in the ergm.ego package, modeling does not support alter–alter statistics or directed or bipartite networks.
Consider an egocentric view of the entire population: every node is observed (i.e., \(S=N\), a census), but alters are not uniquely identifiable across the egos. This limits the kinds of network statistics that can be observed, which in turn restricts the terms that can be fit (the models that can be identified) in an ERGM. We can use the notion of sufficiency from statistical theory to identify the terms amenable to egocentric inference.
The framework for estimation and inference relies on two basic properties of exponential family models:
MLEs uniquely maximize the probability of the observed statistics under the model, and any network with the same observed statistics will have the same probability.
Design-based estimation of ERGMs is done in three steps:
Together, these allow us to use any statistic that can be observed in an egocentric sample as a term in an ERG model and to estimate the model from a complete “pseudo-network” that has the same (or appropriately scaled) sufficient statistics. The networks simulated from the fitted model will be centered on the (scaled) observed statistics.
In practice, egocentric sample statistics generally need to be adjusted for network size and some types of observable discrepancies. This is one of the key differences between working with sampled and unsampled network data.
The treatment of network size is perhaps the most obvious way that egocentric estimation differs from a standard ERGM estimation on a completely observed network. With a network census, the network size is known; by contrast, with a network sample, we don’t typically know the size of the network from which it is drawn.
If the statistics we observe in the sample scale in a known way with network size, then we can adjust for this in the estimation, and the resulting parameter estimates (with the exception of the edges term) will be “size invariant”.
Here we will follow Krivitsky, Handcock, and Morris (2011), who showed that one can obtain a “per capita” size invariant parameterization for dyad-independent statistics in any network by using an offset, approximately equal to \(-\log(N)\), where \(N\) is the number of nodes in the network. The intuition is that this transforms the density-based parameterization (ties per dyad) that is the natural scale for ERGMs into a mean degree-based parameterization (ties per node):
\[ \text{Mean Degree} = \frac{2\times\text{ties}}{\text{nodes}} = \frac{2T}{N} \] \[ \text{Density} = \frac{\text{ties}}{\text{dyads}} = \frac{T}{\frac{N(N-1)}{2}} = \frac{\text{Mean Degree}}{N-1} \]
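As a quick numerical check of these two formulas, the following base R sketch plugs in the node and tie counts of the faux.mesa.high network used later in this tutorial (205 nodes, 203 ties). The variable names are chosen here for illustration only.

```r
# Counts for the faux.mesa.high example network
N <- 205       # number of nodes
n_ties <- 203  # number of ties (T in the formulas; T is avoided as a name
               # because it is shorthand for TRUE in R)

mean_degree <- 2 * n_ties / N
density <- n_ties / (N * (N - 1) / 2)

mean_degree
density

# The two parameterizations differ by a factor of (N - 1)
all.equal(density, mean_degree / (N - 1))

# On the log scale, that factor is close to the -log(N) offset
c(exact = -log(N - 1), approximate = -log(N))
```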
Once the number of edges is adjusted to preserve the mean degree, all of the dyad-independent terms are properly scaled (Krivitsky, Handcock, and Morris 2011). For degree-based terms, we would want, by analogy, the per-capita invariance to preserve the degree probability distribution. Experimental results suggest that the mean-degree preserving offset has this property, but a mathematical proof is elusive. Scaling properties for triadic terms are less well developed (Krivitsky and Kolaczyk 2015).
What we mean by discrepancy is: undirected tie subtotals that are required to balance in theory, but are observed not to balance in the sample. This can happen, for example, when ties are broken down by nodal attributes and the number of ties that group 1 reports to group 2 is not equal to the number that group 2 reports to group 1.
This is another unique feature of egocentrically sampled network data. With a network census, you have the complete edgelist, with the nodal attributes for each member of the dyad, so the reports will always balance. For an egocentrically sampled network, and even for an egocentric census, a discrepancy can arise, either from sampling variability, or from measurement error (if ego mis-reports the attribute of themselves or their alter).
The natural assumption, in the absence of specific knowledge, is that any discrepancy is due to sampling variation. Under this assumption, the average of the discrepant reports is the appropriate estimate of the number of ties for that ego-alter configuration. This is the approach implemented in ergm.ego. If you know the source of the discrepancy, or want to make a different assumption, you may address this in the data before fitting the model in ergm.ego.
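The averaging rule can be illustrated with a few lines of base R. The counts below are made up for illustration, and the code only shows the arithmetic of the assumption; it is not the internal ergm.ego implementation.

```r
# Hypothetical tie counts between two attribute groups (made-up numbers)
reports_1_to_2 <- 40  # ties that egos in group 1 report to alters in group 2
reports_2_to_1 <- 46  # ties that egos in group 2 report to alters in group 1

# In an undirected network both totals describe the same set of ties,
# so in theory they should be equal; here they are off by 6
discrepancy <- reports_2_to_1 - reports_1_to_2
discrepancy

# Treating the discrepancy as sampling variation, average the two reports
ties_1_2 <- mean(c(reports_1_to_2, reports_2_to_1))
ties_1_2
```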
Once the network size-invariant parameterization and consistency issues are addressed it is straightforward to construct the target statistics needed for ERGM estimation: we scale up the values of the sample statistics to the desired network size.
The way we do this is by specifying an offset term in the model. The offset used will depend on the context.
For unweighted samples: To obtain population estimates from ergm.ego from an unweighted sample of size \(|S|\) for a population with a known (or specified) size \(N\), fit the model with an offset of \(\log(N/|S|)=\log(N) - \log(|S|)\).
For weighted samples: To obtain population estimates from ergm.ego from a weighted sample for a population with a known (or specified) size \(N\), first choose a network size, \(|N'|\), to be used for estimation (a pseudo-population that will have the correct nodal attribute distribution specified by the weights), and then fit the model with an offset of \(\log(N/|N'|)=\log(N) - \log(|N'|)\). The criteria for choosing a good value of \(|N'|\) are discussed in the Model Fitting section below.
If the population network size is unknown: This is the most general case. If we neither know \(N\) nor wish to specify it, we often fit with an offset of \(-\log(|S|)\) (for the unweighted sample) or \(-\log(|N'|)\) (for the weighted sample). This will return per-capita estimates that can be easily rescaled to any value post-estimation, e.g., for simulation purposes.
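The three cases differ only in the constant that is supplied as the offset. The base R sketch below computes each one for hypothetical sizes (all three numbers are invented for illustration) and checks how the per-capita offsets relate to the known-size offsets arithmetically.

```r
# Hypothetical sizes (illustrative values, not from any real study)
N       <- 10000  # known or specified population size
S       <- 500    # number of egos in an unweighted sample, |S|
N_prime <- 2000   # pseudo-population size |N'| chosen for a weighted sample

# Unweighted sample, known N
offset_unweighted <- log(N / S)       # equals log(N) - log(S)

# Weighted sample, known N
offset_weighted <- log(N / N_prime)   # equals log(N) - log(N_prime)

# Unknown N: per-capita offsets
offset_percap_unweighted <- -log(S)
offset_percap_weighted   <- -log(N_prime)

# The per-capita offsets differ from the known-N offsets by exactly log(N)
all.equal(offset_percap_unweighted + log(N), offset_unweighted)
all.equal(offset_percap_weighted + log(N), offset_weighted)
```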
The standard errors for coefficients in an ergm.ego fit are designed to represent the uncertainty in our estimate. For ERGMs, this uncertainty can be thought of as coming from three possible sources:
1. A superpopulation of networks, from which this one network is a single realization: What other networks could have been produced by the social process of interest?
2. The sampling process of egos: What other samples could have been drawn?
3. The stochastic MCMC algorithm used to estimate the coefficient: What other MCMC samples could we have gotten?
Most treatments of ERGM estimation treat the coefficient \(\theta\) as a parameter of a superpopulation process of which \(y\) is a single realization. The variance of the MLE of \(\theta\) is then conceived as coming from (1) and (3) above.
In contrast, in ergm.ego we treat the network as a fixed, unknown, finite population, so it is not a source of uncertainty. Rather, uncertainty comes from sampling from this network, and from the MCMC algorithm, (2) and (3) above.
This makes ergm.ego inference much more like traditional (frequentist) statistical inference: we imagine repeatedly drawing an egocentric sample, and estimating the ERGM on each replicate. The sampling distribution of the estimate reflects how our estimate will vary from sample to sample.
The ergm.ego package can be used with weighted survey data and complex sampling designs. In this context, the egor package transforms the ego tibble into a srvyr object. The srvyr package (Freedman Ellis and Schneider 2023) can be used for descriptive statistics, and ergm.ego will incorporate the survey design into its estimation and inference.
This topic is beyond the scope of this introductory workshop but the ergm.ego package has an example you can run for more information:
Since ergm.ego is essentially a wrapper around ergm, there are relatively few functions in the ergm.ego package itself. The functions that are there deal with the specific requirements associated with data management, estimation and inference for egocentrically sampled data.
To get a list of documented functions, type:
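One standard way to do this in R is the package index, sketched below. It assumes ergm.ego is installed.

```r
# Show the package index: description plus a list of documented topics
library(help = "ergm.ego")
```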
The main R objects unique to ergm.ego are:
- egor objects for storing the original data (egor is the analog to network in ergm),
- ergm.ego objects, which store the model fit results (the analog to ergm objects in ergm).
Once you simulate from the fit, the resulting objects are just network objects.
The functionality can be divided into groups as follows:
Stripped down to the basics, egocentric network data comprise:
- data on egos (nodal attributes)
- data on alters (a combination of nodal and edge attributes, since each alter represents a tie)
- data on ties between alters (which may also have edge attributes)
The egor object has a simple, analogous structure for storing this information: a list object with 3 components:
- ego: a data frame of egos and their attributes
- alter: a data frame of alters and their attributes nominated by egos (by default identified by column egoID), or a list of data frames (one for each ego)
- aatie: a data frame with an edge list of alter-alter ties, or a list of data frames (one for each ego)
In addition, one can specify:
- ego_design: a list of arguments passed to srvyr::as_survey_design() specifying the sample design for egos. For example: probs for an unequal-probability independent sample, strata for stratified samples, etc.
- alter_design: currently a list with one element, max=, providing the maximal number of alters an ego could nominate for a Fixed Choice Design (Holland and Leinhardt 1973).
The capacity to represent survey design elements makes egor a flexible and powerful foundation for network data analysis.
The simplicity of the data structure makes it easy to construct egor objects from external data read into R, and there are transformation utilities for working with other data formats (like network and igraph objects), which we will demonstrate in the Example section below.
For more information:
The possible terms in an ergm.ego model are inherently limited to those that are egocentrically observable: statistics that can be inferred from an egocentric sample. In general, these will include terms that are functions of nodal attributes and attribute mixing, degree distribution terms, and triadic terms (when the alter–alter ties are observed). The ergm.ego terms have the same names and arguments as their ergm counterparts; there are just far fewer (n=14) of them available.
Dyad independent terms include density and nodal attribute based measures:
- edges
- nodefactor (for discrete/nominal vars) and nodecov (for continuous)
- nodematch (for homophily), nodemix (for general mixing patterns) and absdiff
Dyad dependent terms include degree- and triad-based measures:
- degree, degrange, gwdegree, and degreepopularity
- concurrent and concurrentties
- esp, gwesp, transitiveties and cyclicalties, but these can only be used if alter–alter ties have been observed.
For the full list of ergm.ego terms and their syntax, type:
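A likely form of the call is sketched below. It assumes the help topic is named ergm.ego-terms, as in recent releases of the package; if that alias is not found in your version, help(package = "ergm.ego") lists the available topics.

```r
library(ergm.ego)

# Open the help page listing the model terms supported by ergm.ego
# (the topic name "ergm.ego-terms" is assumed from recent package versions)
help("ergm.ego-terms")
```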
As in ergm, these terms can be used on the right-hand side of formulas in calls to model and simulation functions.
We will work with the faux.mesa.high dataset that is included with the ergm package, using the as.egor function to transform it into an egocentric dataset. In essence, this creates an egocentric census of the network: a census of all nodes in the network, not a sample.
In this egocentric census, every node is the center of their own egonet – we know their alters, and the ties between their alters, but we cannot match the alters across the egonets because they are not uniquely identified. We can still compare the fits we get from ergm.ego (from the ego data) and ergm (from the original network) for models with the same terms.
Preliminaries:
Check package versions
Set seed for simulations – this is not necessary, but it ensures that we all get the same results (if we execute the same commands in the same order).
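These two preliminaries might look as follows. The seed value is an arbitrary choice made for this sketch; any fixed integer gives reproducible results.

```r
# Check which versions of the key packages are installed
packageVersion("ergm.ego")
packageVersion("egor")

# Fix the random seed so that repeated runs give the same simulated results
# (the value 1 is arbitrary)
set.seed(1)
```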
We’ll show 2 examples of how to create an egor object here.
From a network object
Read in the faux.mesa.high data:
Take a quick look at the complete network
Now, let’s turn this into an egor object:
Take a look at this object – there are several ways to do this:
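The steps described so far, from loading the data to inspecting the converted object, can be sketched as follows. The object name mesa.ego matches the one used in the rest of this section; the name mesa for the network is an assumption made here for brevity.

```r
library(ergm.ego)

# Load the example network shipped with ergm
data(faux.mesa.high)
mesa <- faux.mesa.high   # `mesa` is a name chosen for this sketch

# Quick look at the complete network, coloring nodes by grade
plot(mesa, vertex.col = "Grade")

# Convert the network into an egocentric census
mesa.ego <- as.egor(mesa)

# Two simple ways to inspect the result
names(mesa.ego)  # names of the components
mesa.ego         # printed preview of the ego, alter and alter-alter tables
```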
[1] "ego" "alter" "aatie"
# EGO data (active): 205 × 4
.egoID Grade Race Sex
* <int> <dbl> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
# ℹ 200 more rows
# ALTER data: 406 × 5
.altID .egoID Grade Race Sex
* <int> <int> <dbl> <chr> <chr>
1 174 1 7 Hisp F
2 161 1 7 Hisp F
3 151 1 7 Hisp F
# ℹ 403 more rows
# AATIE data: 372 × 3
.egoID .srcID .tgtID
* <int> <int> <int>
1 1 151 127
2 1 127 52
3 1 127 87
# ℹ 369 more rows
#View(mesa.ego) # opens the component in the Rstudio source window
class(mesa.ego) # what type of "object" is this?
[1] "egor" "list"
Each of the components of the egor object is a simple table, or data.frame.
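This can be confirmed by asking for the class of each component. The sketch below rebuilds mesa.ego from the packaged data so that it runs on its own.

```r
library(ergm.ego)
data(faux.mesa.high)
mesa.ego <- as.egor(faux.mesa.high)

# Each component is a tibble, which is also a data.frame
class(mesa.ego$ego)
class(mesa.ego$alter)
class(mesa.ego$aatie)
```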
[1] "tbl_df" "tbl" "data.frame"
[1] "tbl_df" "tbl" "data.frame"
[1] "tbl_df" "tbl" "data.frame"
The ego table contains the ego ID (.egoID), and the nodal attributes Race, Grade and Sex. This is equivalent to a standard person-based survey sample flat file format.
# A tibble: 205 × 4
.egoID Grade Race Sex
<int> <dbl> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
6 6 10 Hisp F
7 7 8 NatAm M
8 8 11 NatAm M
9 9 9 White M
10 10 9 NatAm F
# ℹ 195 more rows
The alter table is a type of edgelist: it lists the edges for each ego. It contains the alter ID (.altID), the corresponding ego ID, and a set of alter nodal attributes. Note that this is a slightly different data structure than a standard network edgelist.
The standard network edgelist contains one unique record for each edge; both ego and alter ID may appear more than once (depending on their degree), but each link is only represented once.
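The alter table in this egor object is a different type of edgelist, as it is “egocentric”: each tie appears twice, once from the perspective of each of its two nodes. The number of times a node's ID appears in the .altID list is equal to their degree, as is the number of times their ID will appear in the .egoID list.
# A tibble: 406 × 5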
.altID .egoID Grade Race Sex
<int> <int> <dbl> <chr> <chr>
1 174 1 7 Hisp F
2 161 1 7 Hisp F
3 151 1 7 Hisp F
4 127 1 7 Hisp F
5 110 1 7 Hisp F
6 100 1 7 Hisp F
7 96 1 7 NatAm F
8 92 1 7 NatAm F
9 87 1 7 White F
10 70 1 7 NatAm F
# ℹ 396 more rows
# ties show up twice, but alter info is linked to .altID
mesa.ego$alter %>% filter((.altID==1 & .egoID==25) | (.egoID==1 & .altID==25))
# A tibble: 2 × 5
.altID .egoID Grade Race Sex
<int> <int> <dbl> <chr> <chr>
1 25 1 7 White F
2 1 25 7 Hisp F
The aatie table lists the egoID, and the IDs of the two alters that have a tie. The alters are distinguished as .srcID and .tgtID to allow for the possibility of directed tie data. In the case of undirected tie data, as we have here, each alter-alter tie will be represented twice, just swapping the target and source IDs.
# A tibble: 372 × 3
.egoID .srcID .tgtID
<int> <int> <int>
1 1 151 127
2 1 127 52
3 1 127 87
4 1 127 151
5 1 110 87
6 1 110 92
7 1 110 96
8 1 100 96
9 1 96 87
10 1 96 110
# ℹ 362 more rows
Since each of the egor components is a simple rectangular matrix, it’s easy to read in external data and use it to construct an egor object. You just need to make sure that the structure of the external file is consistent with the structure of the tables we looked at above.
To demonstrate, we will construct an egor object derived from our mesa.ego data that has the features of an egocentrically sampled data set: the alters are not uniquely identified. We will also ignore the alter–alter ties.
First, we write out the first two tables in our mesa.ego into external data files, deleting the .altID from the alter file.
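# egos: write the ego table to a CSV file
write.csv(mesa.ego$ego, file="mesa.ego.table.csv", row.names = FALSE)
# alters: drop the first column (.altID) so alters are no longer uniquely identified
write.csv(mesa.ego$alter[,-1], file="mesa.alter.table.csv", row.names = FALSE)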
Now read them back in:
.egoID Grade Race Sex
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
6 6 10 Hisp F
.egoID Grade Race Sex
1 1 7 Hisp F
2 1 7 Hisp F
3 1 7 Hisp F
4 1 7 Hisp F
5 1 7 Hisp F
6 1 7 Hisp F
To create an egor object from data frames, we use the egor() function:
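A sketch of the whole round trip (write the tables, read them back, build the egor object) is given below. The object names mesa.egos, mesa.alts and my.egodata are assumptions introduced for this illustration. The ID.vars argument tells egor() which column holds the ego identifier; here that is .egoID in both files.

```r
library(ergm.ego)
data(faux.mesa.high)
mesa.ego <- as.egor(faux.mesa.high)

# Write the ego and alter tables out, dropping .altID from the alters
write.csv(mesa.ego$ego, file = "mesa.ego.table.csv", row.names = FALSE)
write.csv(mesa.ego$alter[, -1], file = "mesa.alter.table.csv", row.names = FALSE)

# Read the two files back in as ordinary data frames
mesa.egos <- read.csv("mesa.ego.table.csv")
head(mesa.egos)
mesa.alts <- read.csv("mesa.alter.table.csv")
head(mesa.alts)

# Build the egor object; no alter-alter ties are supplied
my.egodata <- egor(egos = mesa.egos,
                   alters = mesa.alts,
                   ID.vars = list(ego = ".egoID"))
my.egodata
my.egodata$alter  # the alters carry no unique identifier
```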
# EGO data (active): 205 × 4
.egoID Grade Race Sex
* <chr> <int> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
# ℹ 200 more rows
# ALTER data: 406 × 4
.egoID Grade Race Sex
* <chr> <int> <chr> <chr>
1 1 7 Hisp F
2 1 7 Hisp F
3 1 7 Hisp F
# ℹ 403 more rows
# AATIE data: 0 × 3
# ℹ 3 variables: .egoID <chr>, .srcID <chr>, .tgtID <chr>
# A tibble: 406 × 4
.egoID Grade Race Sex
<chr> <int> <chr> <chr>
1 1 7 Hisp F
2 1 7 Hisp F
3 1 7 Hisp F
4 1 7 Hisp F
5 1 7 Hisp F
6 1 7 Hisp F
7 1 7 NatAm F
8 1 7 NatAm F
9 1 7 White F
10 1 7 NatAm F
# ℹ 396 more rows
Note that the alter data no longer have a unique alter identifier.
For another example that uses the alter–alter ties, see:
We will explore some of the other functions available for manipulating the egor object in a later section.
Prior to model specification, we can explore the data using descriptive statistics observable in the original egocentric sample. In general, the observable statistics are the same as those that ergm.ego can estimate.
We can use standard R commands to view nodal attribute frequencies:
# to reduce typing, we'll pull the ego and alter data frames
egos <- mesa.ego$ego
alters <- mesa.ego$alter
table(egos$Sex) # Distribution of `Sex`
F M
99 106
Black Hisp NatAm Other White
6 109 68 4 18
Compare egos and alters:
# Side-by-side bar plots of the race distribution among egos and alters
layout(matrix(1:2, 1, 2))
barplot(table(egos$Race)/nrow(egos),
main="Ego Race Distn", ylab="proportion",
ylim = c(0,0.5), las = 3)
barplot(table(alters$Race)/nrow(alters),
main="Alter Race Distn", ylab="proportion",
ylim = c(0,0.5), las = 3)
layout(1) # restore the single-panel layout
To look at the mixing matrix, we’ll use the mixingmatrix() function on the egor object, and we’ll compare the output to what we would get from using this function on the original network object.
Note how the ties on the diagonal are counted twice in the ergm.ego data, compared with the original network data, but the off-diagonal tie counts are the same. Note also, though, that these off-diagonal counts are symmetric, because this is undirected data. So, in both cases, the off-diagonal ties are actually being counted twice (once above and once below the diagonal), but in the original network version, the ties on the diagonal are only counted once.
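The two matrices printed next, by Grade, correspond to calls of the following form. This is a self-contained sketch that rebuilds both objects; the name mesa for the network is an assumption.

```r
library(ergm.ego)
data(faux.mesa.high)
mesa <- faux.mesa.high
mesa.ego <- as.egor(mesa)

# Mixing by grade from the egocentric data (diagonal ties counted twice)
mixingmatrix(mesa.ego, "Grade")

# Mixing by grade from the original network (diagonal ties counted once)
mixingmatrix(mesa, "Grade")
```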
7 8 9 10 11 12
7 150 0 0 1 1 1
8 0 66 2 4 2 1
9 0 2 46 7 6 4
10 1 4 7 18 1 5
11 1 2 6 1 34 5
12 1 1 4 5 5 12
Note: Marginal totals can be misleading for undirected mixing matrices.
7 8 9 10 11 12
7 75 0 0 1 1 1
8 0 33 2 4 2 1
9 0 2 23 7 6 4
10 1 4 7 9 1 5
11 1 2 6 1 17 5
12 1 1 4 5 5 6
Note: Marginal totals can be misleading for undirected mixing matrices.
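To see why the egocentric mixing matrix doubles the diagonal, here is a minimal base-R sketch on a made-up four-node network. The node names, grades and ties are hypothetical, and the loop is only a stand-in for what mixingmatrix() reports for a network object. Each tie is reported once by each of its two endpoints, so a within-grade tie lands in the same diagonal cell twice, while a between-grade tie lands once in each of the two symmetric off-diagonal cells.

```r
# Hypothetical toy data: four students, two grades, three undirected ties.
grade <- c(a = 7, b = 7, c = 8, d = 8)
ties  <- data.frame(from = c("a", "a", "c"),
                    to   = c("b", "c", "d"),
                    stringsAsFactors = FALSE)

# Network-style count: each tie is counted once; a between-grade tie is
# shown in both symmetric off-diagonal cells.
lev    <- as.character(sort(unique(grade)))
mm_net <- matrix(0, length(lev), length(lev), dimnames = list(lev, lev))
for (i in seq_len(nrow(ties))) {
  g1 <- as.character(grade[ties$from[i]])
  g2 <- as.character(grade[ties$to[i]])
  mm_net[g1, g2] <- mm_net[g1, g2] + 1
  if (g1 != g2) mm_net[g2, g1] <- mm_net[g2, g1] + 1
}

# Egocentric-style count: every tie is reported by both of its endpoints.
reports <- rbind(ties,
                 data.frame(from = ties$to, to = ties$from,
                            stringsAsFactors = FALSE))
mm_ego  <- table(ego = grade[reports$from], alter = grade[reports$to])

mm_net
mm_ego
```

In this toy example the two matrices agree off the diagonal, and the diagonal of mm_ego is exactly twice the diagonal of mm_net, which is the pattern seen in the Grade matrices above.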
You can also use this function to calculate the row probabilities of the mixing matrix:
7 8 9 10 11 12
7 0.98 0.00 0.00 0.01 0.01 0.01
8 0.00 0.88 0.03 0.05 0.03 0.01
9 0.00 0.03 0.71 0.11 0.09 0.06
10 0.03 0.11 0.19 0.50 0.03 0.14
11 0.02 0.04 0.12 0.02 0.69 0.10
12 0.04 0.04 0.14 0.18 0.18 0.43
Note: Marginal totals can be misleading for undirected mixing matrices.
Black Hisp NatAm Other White
Black 0.00 0.31 0.50 0.00 0.19
Hisp 0.04 0.60 0.23 0.01 0.12
NatAm 0.08 0.26 0.59 0.00 0.06
Other 0.00 1.00 0.00 0.00 0.00
White 0.11 0.49 0.22 0.00 0.18
Note: Marginal totals can be misleading for undirected mixing matrices.
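The row probabilities are simply the counts divided by their row totals. As a check, the sketch below re-enters the egocentric Grade counts printed earlier by hand and uses base R's prop.table(); it does not call mixingmatrix() itself, and the object names are ours.

```r
# Egocentric Grade mixing counts, copied by hand from the output above.
grade_mm <- matrix(c(
  150,  0,  0,  1,  1,  1,
    0, 66,  2,  4,  2,  1,
    0,  2, 46,  7,  6,  4,
    1,  4,  7, 18,  1,  5,
    1,  2,  6,  1, 34,  5,
    1,  1,  4,  5,  5, 12),
  nrow = 6, byrow = TRUE,
  dimnames = list(ego = 7:12, alter = 7:12))

# Divide each row by its total to get the conditional distribution of
# alter grade given ego grade.
row_prob <- prop.table(grade_mm, margin = 1)
round(row_prob, 2)
```

For example, the Grade 7 row has 153 reported ties, 150 of them to other 7th graders, giving the 0.98 in the top-left cell.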
We can also examine the observed number of ties, mean degree, and degree distributions.
[1] 203
# compare to `egor`
# note that the ties are double counted, so we need to divide by 2.
nrow(mesa.ego$alter)/2
[1] 203
# mean degree -- here we want to count each "stub", so we don't divide by 2
nrow(mesa.ego$alter)/nrow(mesa.ego$ego)
[1] 1.980488
scaled mean SE
degree0 57 6.4306
degree1 51 6.2048
degree2 30 5.0730
degree3 28 4.9289
degree4 18 4.0620
degree5 10 3.0917
degree6 2 1.4107
degree7 4 1.9852
degree8 1 1.0000
degree9 2 1.4107
degree10 1 1.0000
degree11 0 0.0000
degree12 0 0.0000
degree13 1 1.0000
degree14 0 0.0000
degree15 0 0.0000
degree16 0 0.0000
degree17 0 0.0000
degree18 0 0.0000
degree19 0 0.0000
degree20 0 0.0000
scaled mean SE
deg0.SexF 23 4.5299
deg1.SexF 23 4.5299
deg2.SexF 10 3.0917
deg3.SexF 17 3.9581
deg4.SexF 12 3.3694
deg5.SexF 7 2.6066
deg6.SexF 1 1.0000
deg7.SexF 3 1.7235
deg8.SexF 1 1.0000
deg9.SexF 0 0.0000
deg10.SexF 1 1.0000
deg11.SexF 0 0.0000
deg12.SexF 0 0.0000
deg13.SexF 1 1.0000
deg0.SexM 34 5.3385
deg1.SexM 28 4.9289
deg2.SexM 20 4.2588
deg3.SexM 11 3.2343
deg4.SexM 6 2.4193
deg5.SexM 3 1.7235
deg6.SexM 1 1.0000
deg7.SexM 1 1.0000
deg8.SexM 0 0.0000
deg9.SexM 2 1.4107
deg10.SexM 0 0.0000
deg11.SexM 0 0.0000
deg12.SexM 0 0.0000
deg13.SexM 0 0.0000
For the degree distribution we used the summary function in the same way that we would use it in ergm with a network object. But the summary function also has an egor-specific argument, scaleto, that allows you to scale the summary statistics to a network of arbitrary size. So, for example, we can obtain the degree distribution scaled to a network of size 100,000, or a network that is 100 times larger than the sample.
scaled mean SE
degree0 27804.88 3136.89
degree1 24878.05 3026.75
degree2 14634.15 2474.63
degree3 13658.54 2404.34
degree4 8780.49 1981.47
degree5 4878.05 1508.16
degree6 975.61 688.17
degree7 1951.22 968.41
degree8 487.80 487.80
degree9 975.61 688.17
degree10 487.80 487.80
scaled mean SE
degree0 5700 643.06
degree1 5100 620.48
degree2 3000 507.30
degree3 2800 492.89
degree4 1800 406.20
degree5 1000 309.17
degree6 200 141.07
degree7 400 198.52
degree8 100 100.00
degree9 200 141.07
degree10 100 100.00
Note that the first scaling results in fractional numbers of nodes at each degree, because the proportion at each degree level does not scale to an integer for this population size. Again, this is not a problem for estimation, but one should be careful with descriptive statistics that expect integer values. The second scaling does result in integer counts because it is a multiple of the sample size.
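The scaled means and standard errors above appear to follow the usual formula for a sample proportion: with \(n\) egos, of which a fraction \(\hat p\) have a given degree, the count scaled to a network of size \(N\) is \(N\hat p\), with standard error \(N\sqrt{\hat p(1-\hat p)/(n-1)}\). The sketch below is a hand calculation under that assumption. It is not the ergm.ego implementation, and the helper name scale_degree_counts is made up. It uses the first three degree counts from the unscaled summary.

```r
n_egos     <- 205
deg_counts <- c(degree0 = 57, degree1 = 51, degree2 = 30)  # from the summary above

# Hypothetical helper: scale observed counts to a network of size `scaleto`.
scale_degree_counts <- function(counts, n, scaleto) {
  p_hat <- counts / n                            # sample proportion at each degree
  se_p  <- sqrt(p_hat * (1 - p_hat) / (n - 1))   # SE of a proportion (n - 1 denominator)
  data.frame(scaled_mean = p_hat * scaleto, SE = se_p * scaleto)
}

round(scale_degree_counts(deg_counts, n_egos, scaleto = n_egos), 4)
round(scale_degree_counts(deg_counts, n_egos, scaleto = 100000), 2)
round(scale_degree_counts(deg_counts, n_egos, scaleto = 100 * n_egos), 2)
```

The three calls correspond to the unscaled summary, the scaling to 100,000, and the scaling to 100 times the sample size. Because both the mean and the standard error are multiplied by the same factor, scaling changes the units but not the relative precision.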
We can plot the degree distribution using another egor-specific function: degreedist. As with the mixingmatrix function, this can return either the counts or the proportions at each degree.
To get the frequency counts:
# degreedist(mesa.ego, plot=TRUE, prob=FALSE) # bug statnet/ergm.ego#82.
degreedist(mesa.ego, by="Sex", plot=TRUE, prob=FALSE)
To get the proportion at each degree level:
The degreedist method for egor objects also has an argument that lets you overplot the expected degree distribution for a Bernoulli random graph with the same expected density. This is the plot equivalent of a CUG test (“conditional uniform graph”).
The brg overplot is based on 50 simulations of a Bernoulli random graph with the same number of nodes and expected density, implemented by using an ergm.ego simulation from an edges-only model with \(\theta=\mbox{logit}(\mbox{probability of a tie})\) from the observed data. The overplot shows the mean and 2 standard deviations obtained for each degree value from the 50 simulations. Note that the brg overplot automatically scales to the proportions when prob=T.
What does the plot suggest about the distribution of degree in this network?
From the exploratory work, several characteristics emerged that we might want to capture in a model:
Variation in mean degree by nodal attributes (race, sex and grade)
Patterns of mixing by race, sex and grade
The degree distribution (in particular the disproportionate isolate fraction)
We can use ergm.ego to fit a sequence of nested models, both to estimate the parameters associated with these statistics and to test their significance. We can diagnose both the estimation process (to verify convergence and good mixing in the MCMC sampler) and the fit of the model to the data. In both cases, we will use functionality that will be familiar to ergm users: MCMC diagnostics and GOF.
One thing that is different from a standard ergm call is that we need to specify the scaling, both for the pseudo-population (\(N'\)) that will be used to set the target statistics during estimation, and for the population (\(N\)) size that the final rescaled coefficients will represent. Recall that \(|N|\) is set with the popsize top-level argument.
The choice of \(|N'|\) involves three considerations:
Bias: In general, estimation bias is reduced the closer \(N'\) is to \(N\) (usually larger).
Computing time: The larger the pseudo-population, the longer the estimation takes.
Sample weights: In general, it is good practice for the smallest sample weight to produce at least 1 observation in the pseudo-population network, though more is better.
This leads to different guidelines for data with and without weights.
Simulation studies in Krivitsky & Morris (2017) suggest that a good rule of thumb is to have a minimum pseudo-population size of 1,000 for unweighted data. For weighted data, the pseudo-population size should be at least 1 * sampleSize/smallestWeight (or 3 * sampleSize/smallestWeight to be safe), or 1,000, whichever is larger.
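The rule of thumb can be written as a small helper. This is only a sketch: suggest_ppopsize is a hypothetical function (not part of ergm.ego), and it assumes the sampling weights are on a scale where they average 1, so that sampleSize/smallestWeight is the pseudo-population size at which the lowest-weight ego is represented about once. The weights in the example are made up.

```r
# Hypothetical helper, not part of ergm.ego.
suggest_ppopsize <- function(n_egos, weights = NULL, safety = 1, floor_size = 1000) {
  if (is.null(weights)) {
    # Unweighted data: at least 1,000 (and, as an added assumption, never
    # smaller than the sample itself).
    return(max(floor_size, n_egos))
  }
  stopifnot(length(weights) == n_egos, all(weights > 0))
  w <- weights / mean(weights)            # rescale so the weights average 1
  max(floor_size, ceiling(safety * n_egos / min(w)))
}

suggest_ppopsize(205)                            # unweighted sample of 205 egos
w <- rep(c(0.25, 1.75), times = 250)             # made-up weights for 500 egos
suggest_ppopsize(500, weights = w)               # 1 * sampleSize/smallestWeight
suggest_ppopsize(500, weights = w, safety = 3)   # the more cautious multiplier
```

A value chosen this way would then be supplied through the ppopsize control parameter described next.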
In ergm.ego, \(|N'|\) is controlled by a combination of four factors:
the egocentric sample size (\(|S|\)),
the top-level argument popsize (\(|N|\) or 1) (default: 1),
the control.ergm.ego control parameter ppopsize (default: "auto"),
the control.ergm.ego control parameter ppopsize.mul (default: 1).
If ppopsize is left at its default ("auto"), then:
if popsize is left at 1, use \(|S|\times\)ppopsize.mul;
if popsize is specified, use \(|N|\times\)ppopsize.mul.
You can also force one of these two regimes by setting ppopsize to "samp" or "pop", respectively, or set it to a number to force a particular \(|N'|\), ignoring ppopsize.mul.
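The rules above can be summarised in a few lines of base R. The function resolve_ppopsize below is a hypothetical illustration of the decision logic as described in this section, not the code ergm.ego runs internally. In particular, the network that is actually constructed may differ slightly from the requested size.

```r
# Hypothetical sketch of the rules described above (not ergm.ego internals).
resolve_ppopsize <- function(n_egos, popsize = 1, ppopsize = "auto", ppopsize.mul = 1) {
  if (is.numeric(ppopsize)) return(ppopsize)      # explicit number: ppopsize.mul is ignored
  regime <- switch(ppopsize,
                   auto = if (popsize == 1) "samp" else "pop",
                   samp = "samp",
                   pop  = "pop",
                   stop("unknown ppopsize setting: ", ppopsize))
  base <- if (regime == "samp") n_egos else popsize
  base * ppopsize.mul
}

resolve_ppopsize(205)                       # all defaults: the sample size
resolve_ppopsize(205, ppopsize.mul = 5)     # five times the sample size
resolve_ppopsize(205, popsize = 5000)       # a specified population size
resolve_ppopsize(205, ppopsize = 1000)      # a forced number
```

Writing the logic out this way makes the precedence explicit: a numeric ppopsize wins outright, and otherwise ppopsize.mul multiplies whichever of the sample size or popsize the regime selects.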
For more information, see
In both cases, the scaling will only affect the estimate of the edges term, and we demonstrate this below.
Let’s start with a simple edges-only model to see what’s the same and what is different from a call to ergm:
Call:
ergm.ego(formula = mesa.ego ~ edges)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
offset(netsize.adj) -5.32301 0.00000 0 -Inf <1e-04 ***
edges 0.69590 0.07717 0 9.018 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
offset(netsize.adj)
This is a model with homogeneous tie probability: a Bernoulli random graph with the mean degree observed in our sampled data. The only difference in the syntax from standard ergm is the function call to ergm.ego.
Let’s look under the hood at the components that are output to the fit.edges object:
[1] "coefficients" "sample" "iterations" "MCMCtheta"
[5] "loglikelihood" "gradient" "hessian" "covar"
[9] "failure" "newnetwork" "coef.init" "est.cov"
[13] "coef.hist" "stats.hist" "steplen.hist" "control"
[17] "etamap" "MCMCflag" "nw.stats" "call"
[21] "network" "ergm_version" "info" "MPLE_is_MLE"
[25] "drop" "offset" "estimable" "formula"
[29] "target.stats" "target.esteq" "reference" "constraints"
[33] "obs.constraints" "estimate" "estimate.desc" "v"
[37] "m" "ergm.formula" "popnw" "ergm.offset.coef"
[41] "egor" "ppopsize" "popsize" "netsize.adj"
[45] "ergm.covar" "DtDe" "ergm.call"
[1] 205
[1] 1
Many of the elements of the object are the same as you would get from an ergm fit, but the last few elements are unique to ergm.ego. Here you can see ppopsize, the pseudo-population size used to construct the target statistics, and popsize, the final scaled population size after network size adjustment is applied. The values that were used in the fit were the default values, since we did not specify otherwise. So, ppopsize \(=205\) (the sample size, or number of egos), and popsize \(= 1\), so the scaling returns the per capita estimates from the model parameters.
The summary shows the netsize.adj is \(-5.32301= -\log(205)\).
The summary function also reports that:
The following terms are fixed by offset and are not estimated:
netsize.adj
So what would happen if we fit the model instead with target statistics from a pseudo-population of size 1000? To do this, we explicitly change the value of the ppopsize parameter through the control argument:
Constructing pseudopopulation network.
Note: Constructed network has size 1025, different from requested 1000. Estimation should not be meaningfully affected.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0019.
Convergence test p-value: 0.0140. Not converged with 99% confidence; increasing sample size.
Iteration 2 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0129.
Convergence test p-value: 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC. To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
Call:
ergm.ego(formula = mesa.ego ~ edges, control = control.ergm.ego(ppopsize = 1000))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
offset(netsize.adj) -6.93245 0.00000 0 -Inf <1e-04 ***
edges 0.69348 0.08023 0 8.644 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
offset(netsize.adj)
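Now the netsize.adj value is \(-6.93245 = -\log(1025)\), not \(-\log(1000) = -6.90776\): as the note in the fitting output says, the constructed pseudo-population network has 1025 nodes (a multiple of the 205 egos) rather than the requested 1000.
Note that the estimated edges coefficient is essentially the same in both models (0.696 and 0.693, a difference far smaller than the standard errors of about 0.08). This is the behavior we expect: the model is returning the same per capita value in both cases; it is just using a different scaling for the target statistics used in the fit. For this simple model, there may not be much difference in the properties of the estimates for these two different pseudo-population sizes.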
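The offsets and the edges estimates can be checked by hand. The sketch below assumes that, in an edges-only model, the log-odds of a tie in a pseudo-population of size \(|N'|\) is the edges coefficient plus the netsize.adj offset, \(-\log(|N'|)\), and that the target density is the observed mean degree divided by \(|N'|-1\). Under those assumptions the edges coefficient has a closed form, which can be compared with the MCMC-based estimates. The function name edges_coef is ours.

```r
n_egos      <- 205
n_alters    <- 406
mean_degree <- n_alters / n_egos          # about 1.98, as computed earlier

# Closed-form edges coefficient for a pseudo-population of size `ppop`,
# under the assumptions stated in the text above.
edges_coef <- function(ppop) {
  density <- mean_degree / (ppop - 1)     # tie probability matching the mean degree
  offset  <- -log(ppop)                   # the netsize.adj offset
  qlogis(density) - offset                # logit(density) = edges coefficient + offset
}

ppops <- c(sample = 205, constructed = 1025, requested = 1000)
round(data.frame(netsize.adj = -log(ppops),
                 edges       = edges_coef(ppops),
                 exp.edges   = exp(edges_coef(ppops))), 5)
```

The offsets for 205 and 1025 reproduce the values in the two summaries, and the closed-form coefficients (about 0.698 and 0.686) lie well within one standard error of the fitted values. The exponentiated coefficient is close to the mean degree, which is what the per capita scaling is meant to deliver.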
We will examine the impact of modifying the popsize parameter in a later section.
As the output shows, the model was fit using MCMC. This, too, is different from the edges-only model using ergm. For ergm, models with only dyad-independent terms are fit using Newton-Raphson algorithms (the same algorithm used for logistic regression), not MCMC. For ergm.ego, estimation is always based on MCMC, regardless of the terms in the model.
Now let’s see what the MCMC diagnostics for this model look like:
Note: MCMC diagnostics shown here are from the last round of
simulation, prior to computation of final parameter estimates.
Because the final estimates are refinements of those used for this
simulation run, these diagnostics may understate model performance.
To directly assess the performance of the final model on in-model
statistics, please use the GOF command: gof(ergmFitObject,
GOF=~model).
The diagnostics show good mixing, and the distribution of the sample statistic deviations from the targets (on the right panel) in the last iteration is well centered around zero. To verify that simulations from the fitted model match the target stats, we can use the gof function with the argument “model”.
Networks simulated from the model appear to be nicely centered around the values of the observed edges statistic.
Finally, we should evaluate the model fit. We can also use gof to do this, by comparing observed statistics that are not in the model, like the full degree distribution, with simulations from the fitted model. This is the same procedure that we use for ergm, but now with a more limited set of observed higher-order statistics to use for assessment.
Here, finally, we see some bad behavior, but this is expected from such a simple model. The GOF plot shows there are almost twice as many isolates in the observed data as would be predicted from a simple edges-only model. Of course we knew this from having looked at the degree distribution plots with the Bernoulli random graph overlay.
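The isolate shortfall can also be seen with a quick analytic check. In a Bernoulli random graph with \(n\) nodes and tie probability \(p\), a node's degree is Binomial\((n-1, p)\). The sketch below uses that distribution directly, rather than the simulations behind the brg overlay or the GOF plot, with the sample size and mean degree from this section. It compares the expected counts with the first few observed degree counts from the earlier summary.

```r
n           <- 205
mean_degree <- 406 / 205
p_tie       <- mean_degree / (n - 1)      # density matching the observed mean degree

# Observed counts copied from the degree summary earlier in this section.
observed <- c(degree0 = 57, degree1 = 51, degree2 = 30, degree3 = 28, degree4 = 18)
expected <- n * dbinom(0:4, size = n - 1, prob = p_tie)

round(rbind(observed, expected), 1)
unname(observed["degree0"] / expected[1])  # observed relative to expected isolates
```

Under this approximation the Bernoulli model expects roughly 28 isolates against the 57 observed, which is in line with the GOF plot.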
Ok, so that’s a full cycle of description, estimation, and model assessment.
Let’s try fitting a degree(0) term to see how that changes the degree distribution assessment. Note that in this example, we’re using a shortcut for control.ergm.ego: the snctrl function. The snctrl shortcut can be used in all of the Statnet packages (ergm, tergm, etc.) to specify controls specific for each type of model.
set.seed(1)
fit.deg0 <- ergm.ego(mesa.ego ~ edges + degree(0),
control = snctrl(ppopsize=1000))
summary(fit.deg0)
Call:
ergm.ego(formula = mesa.ego ~ edges + degree(0), control = snctrl(ppopsize = 1000))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
offset(netsize.adj) -6.9324 0.0000 0 -Inf <1e-04 ***
edges 1.1704 0.1042 0 11.234 <1e-04 ***
degree0 1.4815 0.2592 0 5.716 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
offset(netsize.adj)
Note: MCMC diagnostics shown here are from the last round of
simulation, prior to computation of final parameter estimates.
Because the final estimates are refinements of those used for this
simulation run, these diagnostics may understate model performance.
To directly assess the performance of the final model on in-model
statistics, please use the GOF command: gof(ergmFitObject,
GOF=~model).
So, we’ve now fit the isolates exactly, and the overall fit is better, but the deviations suggest there are more nodes with just one tie than would be expected, given the mean degree, and the number of isolates.
And just to round things off, let’s fit a relatively large model. Here we’ll specify the omitted category for Race as the largest group.
fit.full <- ergm.ego(mesa.ego ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex")
+ nodematch("Race")
+ nodematch("Grade"))
summary(fit.full)
Call:
ergm.ego(formula = mesa.ego ~ edges + degree(0:1) + nodefactor("Sex") +
nodefactor("Race", levels = -LARGEST) + nodefactor("Grade") +
nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
offset(netsize.adj) -5.32301 0.00000 0 -Inf < 1e-04 ***
edges -1.38926 0.19665 0 -7.065 < 1e-04 ***
degree0 2.09717 0.36081 0 5.812 < 1e-04 ***
degree1 1.00401 0.28150 0 3.567 0.000362 ***
nodefactor.Sex.M -0.17310 0.06319 0 -2.739 0.006155 **
nodefactor.Race.Black 1.20790 0.21176 0 5.704 < 1e-04 ***
nodefactor.Race.NatAm 0.30280 0.05821 0 5.202 < 1e-04 ***
nodefactor.Race.Other -0.90243 0.61221 0 -1.474 0.140466
nodefactor.Race.White 0.57599 0.13107 0 4.394 < 1e-04 ***
nodefactor.Grade.8 0.14240 0.05373 0 2.650 0.008044 **
nodefactor.Grade.9 0.14073 0.04792 0 2.937 0.003319 **
nodefactor.Grade.10 0.31597 0.07197 0 4.391 < 1e-04 ***
nodefactor.Grade.11 0.40663 0.05753 0 7.068 < 1e-04 ***
nodefactor.Grade.12 0.77803 0.07399 0 10.515 < 1e-04 ***
nodematch.Sex 0.64352 0.12148 0 5.297 < 1e-04 ***
nodematch.Race 0.83975 0.12813 0 6.554 < 1e-04 ***
nodematch.Grade 3.05340 0.15340 0 19.904 < 1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
offset(netsize.adj)
Note: To save space, only one in every 2 iterations of the MCMC sample
used for estimation was stored for diagnostics. Sample size per chain
was originally around 5266 with thinning interval 2048.
Note: MCMC diagnostics shown here are from the last round of
simulation, prior to computation of final parameter estimates.
Because the final estimates are refinements of those used for this
simulation run, these diagnostics may understate model performance.
To directly assess the performance of the final model on in-model
statistics, please use the GOF command: gof(ergmFitObject,
GOF=~model).
In general the model diagnostics look good. If this were a genuine sample of 205 students from a larger school, we could infer the following:
there are many more isolates, and more degree 1 nodes than expected by chance;
there are significant differences in mean degree by race, with the largest group (Hispanics, the reference category) nominating fewer friends than most of the other groups;
7th graders nominate fewer friends than all other grades;
there are strong and significant homophily effects, for all three attributes.
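Because ERGM coefficients are on the log-odds scale, exponentiating them gives conditional odds ratios for a tie, holding the rest of the network fixed. The sketch below copies three estimates and standard errors by hand from the summary above (the object names are ours) and adds approximate 95% Wald intervals. With the fitted object at hand, the same calculation could start from coef(fit.full).

```r
# Estimates and standard errors copied from the summary(fit.full) output above.
est <- c(nodematch.Sex = 0.64352, nodematch.Race = 0.83975, nodematch.Grade = 3.05340)
se  <- c(nodematch.Sex = 0.12148, nodematch.Race = 0.12813, nodematch.Grade = 0.15340)

odds_ratio <- exp(est)                      # conditional odds ratios
ci_lower   <- exp(est - qnorm(0.975) * se)  # approximate 95% Wald interval, lower bound
ci_upper   <- exp(est + qnorm(0.975) * se)  # approximate 95% Wald interval, upper bound

round(cbind(odds_ratio, ci_lower, ci_upper), 2)
```

For example, the conditional odds of a tie within the same grade are roughly 21 times those of an otherwise comparable cross-grade tie, which is why grade homophily dominates the mixing matrix.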
It is possible to simulate complete networks from this ergm.ego fit object – just as we would from an ergm fit object:
sim.full <- simulate(fit.full)
summary(mesa.ego ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
scaled mean SE
edges 203 15.2022
degree0 57 6.4306
degree1 51 6.2048
nodefactor.Sex.M 171 17.1990
nodefactor.Race.Black 26 6.5507
nodefactor.Race.NatAm 156 19.7787
nodefactor.Race.Other 1 0.7054
nodefactor.Race.White 45 9.1943
nodefactor.Grade.8 75 17.3212
nodefactor.Grade.9 65 11.2475
nodefactor.Grade.10 36 8.0931
nodefactor.Grade.11 49 11.4861
nodefactor.Grade.12 28 7.2756
nodematch.Sex 132 12.1128
nodematch.Race 103 10.0369
nodematch.Grade 163 13.6309
summary(sim.full ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
edges degree0 degree1
167 54 66
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
130 12 150
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
3 32 63
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
60 29 36
nodefactor.Grade.12 nodematch.Sex nodematch.Race
22 113 84
nodematch.Grade
138
plot(sim.full, vertex.col="Grade")
legend('bottomleft',fill=7:12,legend=paste('Grade',7:12),cex=0.75)
(Note that we have implicitly used simulate already: it’s the basis of the GOF results.)
We can use network size invariance to simulate networks of a different size, although one has to be careful if the observed statistics are too small to be reliable (e.g., the nodefactor.Race.Other statistic here):
sim.full2 <- simulate(fit.full, popsize=network.size(mesa)*2)
summary(mesa~edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))*2
edges degree0 degree1
406 114 102
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
342 52 312
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
2 90 150
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
130 72 98
nodefactor.Grade.12 nodematch.Sex nodematch.Race
56 264 206
nodematch.Grade
326
summary(sim.full2~edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
edges degree0 degree1
405 120 90
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
312 67 307
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
2 86 146
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
125 74 95
nodefactor.Grade.12 nodematch.Sex nodematch.Race
57 279 206
nodematch.Grade
334
We have only demonstrated the functionality briefly here, but this kind of simulation is a powerful way to diagnose structural properties of the fitted model, and to identify and remedy systematic lack of fit.
We will leave this model here and go on to explore how the idea of sampling uncertainty is being used to produce the standard errors for our coefficients.
When we estimate parameters based on sampled data, the sampling uncertainty in our estimates comes from the differences in the observations we draw from sample to sample, and the magnitude of uncertainty is a function of our sample size. This is why we typically see something like \(\sqrt{n}\) in the denominator of the standard error of a sample mean or sample proportion. The same principle holds in the context of egocentric network sampling: the standard errors will depend on the number of egos sampled.
This is true despite the fact that we are rescaling first to pseudo-population size, then back down to per capita values. Neither of these influences the estimates of the standard errors; those are influenced only by the size of the egocentric sample.
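As a reminder of the familiar case, the standard error of a sample proportion is \(\sqrt{p(1-p)/n}\), so quartering the sample size roughly doubles the standard error. The sketch below uses an arbitrary illustrative proportion (\(p = 0.3\) is made up) and, for concreteness, the full and quarter sample sizes that appear later in this section.

```r
se_prop <- function(p, n) sqrt(p * (1 - p) / n)   # SE of a sample proportion

p  <- 0.3                                         # arbitrary illustrative value
n  <- c(full = 1461, quarter = 365)
se <- se_prop(p, n)

round(se, 4)
unname(se["quarter"] / se["full"])                # ratio of the two standard errors
```

The ratio depends only on the two sample sizes, not on \(p\), which is the pattern to look for when comparing the standard errors from the egocentric fits below.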
So let’s use the sample function from ergm.ego to demonstrate this effect.
For this section we will use the larger built-in network, faux.magnolia.high.
Let’s start by fitting an ERGM to the complete network, and looking at the coefficients:
fit.ergm <- ergm(fmh ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"))
round(coef(fit.ergm), 3)
degree0 degree1 degree2
0.954 0.274 0.034
degree3 nodefactor.Race.Asian nodefactor.Race.Black
-0.240 -2.476 -3.045
nodefactor.Race.Hisp nodefactor.Race.NatAm nodefactor.Race.Other
-2.693 -2.263 -2.634
nodefactor.Race.White nodematch.Race nodefactor.Sex.M
-3.385 1.679 -0.087
nodematch.Sex absdiff.Grade
0.860 -2.116
Now, suppose we only observe an egocentric view of the data, as an egocentric census. With an egocentric census, it’s as though we give a survey to all of the students. Each student nominates her friends but does not report their names; she reports only their sex, race and grade. How does the fit from ergm.ego to this egocentric census compare to the complete-network ergm estimates?
# EGO data (active): 3 × 5
.egoID Grade Race Sex vertex.names
* <int> <dbl> <chr> <chr> <chr>
1 1 9 Black F 1
2 2 10 Black M 2
3 3 12 Black F 3
# ALTER data: 6 × 6
.altID .egoID Grade Race Sex vertex.names
* <int> <int> <dbl> <chr> <chr> <chr>
1 669 1 9 Black F 669
2 963 2 10 White F 963
3 912 2 10 White M 912
# ℹ 3 more rows
# AATIE data: 0 × 3
# ℹ 3 variables: .egoID <int>, .srcID <int>, .tgtID <int>
egofit <- ergm.ego(fmh.ego ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"), popsize=N,
control = snctrl(ppopsize=N))
# A convenience function.
model.se <- function(fit) sqrt(diag(vcov(fit)))
# Parameters recovered:
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego Cen est" = coef(egofit)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofit)[-1])/model.se(egofit)[-1])
round(coef.compare, 3)
NW.est Ego.Cen.est diff.Z
degree0 0.954 0.939 0.035
degree1 0.274 0.262 0.034
degree2 0.034 0.032 0.008
degree3 -0.240 -0.243 0.015
nodefactor.Race.Asian -2.476 -2.485 0.065
nodefactor.Race.Black -3.045 -3.048 0.028
nodefactor.Race.Hisp -2.693 -2.708 0.125
nodefactor.Race.NatAm -2.263 -2.282 0.136
nodefactor.Race.Other -2.634 -2.648 0.049
nodefactor.Race.White -3.385 -3.387 0.025
nodematch.Race 1.679 1.677 0.028
nodefactor.Sex.M -0.087 -0.087 0.033
nodematch.Sex 0.860 0.860 0.006
absdiff.Grade -2.116 -2.111 -0.080
Again, we can diagnose the fitted egocentric model for proper convergence. (We include the code, but leave this as an exercise for you)
And check whether the model converged to the right statistics:
Now let’s check whether the fitted model can be used to reconstruct the degree distribution.
What if we only had an equally large sample, instead of an egocentric census? Here, we sample N students with replacement.
set.seed(1)
fmh.egosampN <- sample(fmh.ego, N, replace=TRUE)
egofitN <- ergm.ego(fmh.egosampN ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"),
popsize=N)
Constructing pseudopopulation network.
Unable to match target stats. Using MCMLE estimation.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 0.2525.
The log-likelihood improved by 1.9777.
Estimating equations are not within tolerance region.
Iteration 2 of at most 60:
1 Optimizing with step length 0.4057.
The log-likelihood improved by 2.1715.
Estimating equations are not within tolerance region.
Iteration 3 of at most 60:
1 Optimizing with step length 0.6318.
The log-likelihood improved by 2.0157.
Estimating equations are not within tolerance region.
Iteration 4 of at most 60:
1 Optimizing with step length 0.9426.
The log-likelihood improved by 1.7975.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
1 Optimizing with step length 0.8976.
The log-likelihood improved by 1.8591.
Estimating equations are not within tolerance region.
Iteration 6 of at most 60:
1 Optimizing with step length 0.6572.
The log-likelihood improved by 2.1477.
Estimating equations are not within tolerance region.
Iteration 7 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9768.
Estimating equations are not within tolerance region.
Iteration 8 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.7612.
Estimating equations are not within tolerance region.
Iteration 9 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.6097.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 10 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1171.
Estimating equations are not within tolerance region.
Iteration 11 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1323.
Estimating equations are not within tolerance region.
Iteration 12 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1681.
Estimating equations are not within tolerance region.
Iteration 13 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1115.
Estimating equations are not within tolerance region.
Iteration 14 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2142.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 15 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1448.
Estimating equations are not within tolerance region.
Iteration 16 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2262.
Estimating equations are not within tolerance region.
Iteration 17 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1817.
Estimating equations are not within tolerance region.
Iteration 18 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0152.
Convergence test p-value: 0.8234. Not converged with 99% confidence; increasing sample size.
Iteration 19 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0651.
Convergence test p-value: 0.7643. Not converged with 99% confidence; increasing sample size.
Iteration 20 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0410.
Convergence test p-value: 0.0149. Not converged with 99% confidence; increasing sample size.
Iteration 21 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0150.
Convergence test p-value: 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC. To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
# compare the coef
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego SampN est" = coef(egofitN)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofitN)[-1])/model.se(egofitN)[-1])
round(coef.compare, 3)
NW.est Ego.SampN.est diff.Z
degree0 0.954 1.388 -0.933
degree1 0.274 0.516 -0.661
degree2 0.034 0.363 -1.184
degree3 -0.240 -0.021 -1.068
nodefactor.Race.Asian -2.476 -2.397 -0.516
nodefactor.Race.Black -3.045 -2.911 -1.206
nodefactor.Race.Hisp -2.693 -2.532 -1.294
nodefactor.Race.NatAm -2.263 -2.112 -1.136
nodefactor.Race.Other -2.634 -2.623 -0.034
nodefactor.Race.White -3.385 -3.275 -1.037
nodematch.Race 1.679 1.613 0.812
nodefactor.Sex.M -0.087 -0.142 2.012
nodematch.Sex 0.860 0.883 -0.407
absdiff.Grade -2.116 -2.023 -1.353
# compare the s.e.'s
se.compare <- data.frame(
"NW SE" = model.se(fit.ergm),
"Ego census SE" =model.se(egofit)[-1],
"Ego SampN SE" = model.se(egofitN)[-1])
round(se.compare, 3)
NW.SE Ego.census.SE Ego.SampN.SE
degree0 0.462 0.430 0.464
degree1 0.365 0.345 0.367
degree2 0.277 0.261 0.277
degree3 0.198 0.193 0.204
nodefactor.Race.Asian 0.150 0.144 0.152
nodefactor.Race.Black 0.115 0.102 0.112
nodefactor.Race.Hisp 0.144 0.121 0.124
nodefactor.Race.NatAm 0.165 0.141 0.133
nodefactor.Race.Other 0.402 0.292 0.335
nodefactor.Race.White 0.111 0.102 0.105
nodematch.Race 0.103 0.080 0.081
nodefactor.Sex.M 0.032 0.029 0.028
nodematch.Sex 0.070 0.054 0.056
absdiff.Grade 0.072 0.068 0.069
What if we have a smaller sample? If we have a sample of \(N/4 \approx 365\) students, how will our standard errors be affected?
set.seed(0) # Some samples have different sets of alter levels from ego levels.
fmh.egosampN4 <- sample(fmh.ego, round(N/4), replace=TRUE)
egofitN4 <- ergm.ego(fmh.egosampN4 ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"),
popsize=N)
Constructing pseudopopulation network.
Note: Constructed network has size 1460, different from requested 1461. Estimation should not be meaningfully affected.
Starting maximum pseudolikelihood estimation (MPLE):
Obtaining the responsible dyads.
Evaluating the predictor and response matrix.
Maximizing the pseudolikelihood.
Finished MPLE.
Starting Monte Carlo maximum likelihood estimation (MCMLE):
Iteration 1 of at most 60:
1 Optimizing with step length 0.2258.
The log-likelihood improved by 2.0248.
Estimating equations are not within tolerance region.
Iteration 2 of at most 60:
1 Optimizing with step length 0.3843.
The log-likelihood improved by 2.0104.
Estimating equations are not within tolerance region.
Iteration 3 of at most 60:
1 Optimizing with step length 0.8968.
The log-likelihood improved by 2.6821.
Estimating equations are not within tolerance region.
Iteration 4 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.5568.
Estimating equations are not within tolerance region.
Iteration 5 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2369.
Estimating equations are not within tolerance region.
Iteration 6 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.2663.
Estimating equations are not within tolerance region.
Iteration 7 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1536.
Estimating equations are not within tolerance region.
Iteration 8 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.8350.
Estimating equations are not within tolerance region.
Iteration 9 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9888.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 10 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.9633.
Estimating equations are not within tolerance region.
Iteration 11 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0807.
Convergence test p-value: 0.8179. Not converged with 99% confidence; increasing sample size.
Iteration 12 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.1154.
Estimating equations are not within tolerance region.
Estimating equations did not move closer to tolerance region more than 1 time(s) in 4 steps; increasing sample size.
Iteration 13 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0990.
Convergence test p-value: 0.8092. Not converged with 99% confidence; increasing sample size.
Iteration 14 of at most 60:
1 Optimizing with step length 1.0000.
The log-likelihood improved by 0.0206.
Convergence test p-value: < 0.0001. Converged with 99% confidence.
Finished MCMLE.
This model was fit using MCMC. To examine model diagnostics and check
for degeneracy, use the mcmc.diagnostics() function.
# compare the coef
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego SampN4 est" = coef(egofitN4)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofitN4)[-1])/model.se(egofitN4)[-1])
round(coef.compare, 3)
NW.est Ego.SampN4.est diff.Z
degree0 0.954 0.529 0.458
degree1 0.274 -0.239 0.697
degree2 0.034 -0.041 0.141
degree3 -0.240 -0.363 0.314
nodefactor.Race.Asian -2.476 -2.190 -1.215
nodefactor.Race.Black -3.045 -2.991 -0.237
nodefactor.Race.Hisp -2.693 -2.808 0.489
nodefactor.Race.NatAm -2.263 -2.498 0.980
nodefactor.Race.Other -2.634 -2.412 -0.455
nodefactor.Race.White -3.385 -3.431 0.219
nodematch.Race 1.679 1.579 0.750
nodefactor.Sex.M -0.087 -0.192 1.755
nodematch.Sex 0.860 0.889 -0.262
absdiff.Grade -2.116 -2.078 -0.267
# compare the s.e.'s
se.compare <- data.frame(
"NW SE" = model.se(fit.ergm),
"Ego census SE" =model.se(egofit)[-1],
"Ego SampN SE" = model.se(egofitN)[-1],
"Ego Samp4 SE" = model.se(egofitN4)[-1])
round(se.compare, 3)
NW.SE Ego.census.SE Ego.SampN.SE Ego.Samp4.SE
degree0 0.462 0.430 0.464 0.929
degree1 0.365 0.345 0.367 0.736
degree2 0.277 0.261 0.277 0.539
degree3 0.198 0.193 0.204 0.394
nodefactor.Race.Asian 0.150 0.144 0.152 0.236
nodefactor.Race.Black 0.115 0.102 0.112 0.230
nodefactor.Race.Hisp 0.144 0.121 0.124 0.235
nodefactor.Race.NatAm 0.165 0.141 0.133 0.240
nodefactor.Race.Other 0.402 0.292 0.335 0.488
nodefactor.Race.White 0.111 0.102 0.105 0.213
nodematch.Race 0.103 0.080 0.081 0.134
nodefactor.Sex.M 0.032 0.029 0.028 0.060
nodematch.Sex 0.070 0.054 0.056 0.109
absdiff.Grade 0.072 0.068 0.069 0.143
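As with ordinary statistics, the standard error is inversely proportional to the square root of the sample size: with a quarter of the egos, most standard errors roughly double (for example, 0.464 versus 0.929 for degree0).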
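As a rough check of this relationship, the sketch below copies the Ego.SampN.SE and Ego.Samp4.SE columns from the table above by hand and computes their ratio. Under inverse square-root scaling, a fourfold reduction in sample size should give ratios near \(\sqrt{4}=2\). The object names here are ours and are not part of ergm.ego.

```r
# Standard errors copied by hand from the se.compare output above.
term_names <- c("degree0", "degree1", "degree2", "degree3",
                "nodefactor.Race.Asian", "nodefactor.Race.Black",
                "nodefactor.Race.Hisp", "nodefactor.Race.NatAm",
                "nodefactor.Race.Other", "nodefactor.Race.White",
                "nodematch.Race", "nodefactor.Sex.M",
                "nodematch.Sex", "absdiff.Grade")
se_sampN  <- c(0.464, 0.367, 0.277, 0.204, 0.152, 0.112, 0.124,
               0.133, 0.335, 0.105, 0.081, 0.028, 0.056, 0.069)
se_sampN4 <- c(0.929, 0.736, 0.539, 0.394, 0.236, 0.230, 0.235,
               0.240, 0.488, 0.213, 0.134, 0.060, 0.109, 0.143)

# Ratio of standard errors: sample of size N/4 relative to sample of size N.
se_ratio <- setNames(se_sampN4 / se_sampN, term_names)
round(se_ratio, 2)

# Typical ratio compared with the square-root rule.
median(se_ratio)
sqrt(4)
```

The ratios are not exactly 2 for every term, presumably because both fits are based on random samples of egos and on Monte Carlo estimation, but the typical ratio is close to what the square-root rule suggests.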
The ergm.ego package is under active development on GitHub at statnet/ergm.ego. This repository is the place to go to report bugs or request features (feature requests accompanied by a pull request are especially appreciated). If you are interested in contributing to the development of ergm.ego, please contact us through the GitHub interface.
Additional functionality is planned in the near future:
Support for directed relations.
Support for automatic fitting of tergms.
Support for target statistics distinct from ERGM statistics.
Support for degree censoring.
Motivation: Analyzing racial disparities in HIV in the US
The work on ergm.ego was originally motivated by a specific question in the field of HIV epidemiology: does network structure help explain the persistent racial disparities in HIV prevalence in the United States?
An African American today is 10 times more likely than a white American to be living with HIV/AIDS. The disparity begins early in life, persists through to old age, and is evident among all risk groups: heterosexuals, men who have sex with men (MSM), and injection drug users. The disproportionate risks faced by heterosexual African-American women are especially steep. In 2010, an African-American woman was over 40 times more likely to be diagnosed with HIV than a heterosexual white man (Figure 1).
Empirical studies repeatedly find that these disparities cannot be explained by individual behavior or biological differences.
A growing body of work is therefore focused on the role of the underlying transmission network. This network can channel the spread of infection in the same way that a transportation network channels the flow of traffic, with emergent patterns that reflect the connectivity of the system, rather than the behavior of any particular element.
Descriptive analyses and simulation studies have focused attention on two structural features: homophily and concurrency. Homophily is the strong propensity for within-group partner selection. Concurrency is non-monogamy (having partners that overlap in time), which increases network connectivity by allowing for the emergence of stable connected components larger than dyads (pairs of individuals).
The hypothesis is that these two network properties together can produce the sustained HIV/STI prevalence differentials we observe: differences in concurrency between groups are the mechanism that generates the prevalence disparity, while homophily is the mechanism that sustains it.
We will never observe the complete dynamic sexual network that transmits HIV. But ergm.ego allows us to test the network hypothesis with egocentrically sampled data, and we will demonstrate that here using data collected by the National Health and Social Life Survey from 1994. The analysis comes from a recent paper (Krivitsky and Morris, 2017).
First, ergm.ego allows us to assess whether empirical patterns of homophily and concurrency are in the predicted directions and statistically significant. We do this in the usual way: comparing sequential model fits with terms that represent the hypotheses of interest, and using t-tests for their coefficients. We will discuss these terms in more detail in later sections, but here we test the concurrency effects with “monogamy bias” terms.
Result: Yes, the homophily and concurrency effects are in the predicted directions and statistically significant.
Next, we can assess the goodness of fit of each model in the way we usually do in ERGMs, by checking whether the models reproduce observed network properties that are not in the model. We do this here by simulating from each model and comparing the fits to the full observed degree distribution:
Result: Only Model 3 (with both hypothesized network effects) is able to reproduce the observed degree distribution.
The ability to simulate complete networks from the model, however, allows us to do much more: we can now examine the connectivity in the overall network that each of these models would generate. For example, we can examine the component size distributions under each model:
Result: Model 3, with its “monogamy bias”, dramatically reduces the right skew of the component size distribution, and places most people in components of size 2, or 3 if they are in a larger component.
Finally, we can define a measure of “network exposure” that represents the signature feature of a network effect: indirect exposure to HIV via a partner’s behavior, rather than direct exposure via one’s own behavior. One metric for network exposure is the probability of being in a component of size 3 or more. Because this is a node-level metric, we can break it down by race and sex for each of the three models:
Result: Only Model 3 produces a pattern of network exposure that is consistent with the observed disparities in HIV incidence.
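To make the network exposure metric concrete, here is a minimal base R sketch on a made-up seven-node network. It is not the analysis from the paper, which works with networks simulated from the fitted models. The data, the grouping variable, and the helper `component_size()` are all hypothetical illustrations.

```r
# Toy undirected network on 7 nodes (hypothetical; not the survey data).
n     <- 7
el    <- rbind(c(1, 2), c(2, 3), c(4, 5))        # nodes 6 and 7 are isolates
group <- c("A", "A", "B", "A", "B", "B", "A")    # hypothetical node attribute

# Size of the connected component containing each node, found by repeatedly
# propagating the smallest label across every tie until nothing changes.
component_size <- function(n, el) {
  label <- seq_len(n)
  repeat {
    changed <- FALSE
    for (r in seq_len(nrow(el))) {
      ends <- el[r, ]
      m <- min(label[ends])
      if (any(label[ends] != m)) {
        label[ends] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  as.vector(table(label)[as.character(label)])
}

size    <- component_size(n, el)
exposed <- size >= 3            # node-level indicator of network exposure

mean(exposed)                   # overall proportion in components of size 3+
tapply(exposed, group, mean)    # the same proportion broken down by group
```

Because the indicator is defined for each node, any node attribute can be used in the final `tapply()` call to break the metric down by group.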
ergm.ego provides a powerful analytic framework that uses extremely limited network data and testable models to investigate the unobservable patterns of complete network connectivity that are consistent with the sampled data.
The principles of egocentric inference can be extended to temporal ERGMs (TERGMs). While we will not cover that in this workshop, an example can be found in another paper that sought to evaluate the network hypothesis for racial disparities in HIV in the US (Morris et al. 2009). In that paper, egocentric data from the National Longitudinal Study of Adolescent Health (AddHealth) were analyzed, and an example of the resulting dynamic complete network simulation (on 10,000 nodes) can be found in this “network movie”.
The movie below is another, simpler example: an epidemic spreading on a small dynamic contact network that is simulated with a STERGM estimated from egocentrically sampled network data. The movie was produced by the R packages EpiModel and ndtv, which are based on the Statnet tools.
We’ll need some notation for this (sorry, and a warning that it will get hairier).
Parameter | Meaning |
---|---|
\(N\) | the population being studied: a very large, but finite, set of actors whose relations are of interest |
\(x _ i\) | attribute (e.g., age, sex, race) vector of actor \(i \in N\) |
\(x_N\) | (or just \(x\), when there is no ambiguity) the attributes of actors in \(N\) |
\(\mathbb{Y}(N)\) | the set of dyads (potential ties) in an undirected network of actors in \(N\) |
\(y\subseteq \mathbb{Y}(N)\) | the population network: a fixed but unknown set of relationships of interest. In particular, |
\(y_{ij}\) | an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\) |
\(y _ i=\{j\in N: y _ {ij}=1\}\) | the set of \(i\)’s network neighbors. |
Parameter | Meaning |
---|---|
\(e_{N}\) | the egocentric census, the information retained by the minimal egocentric sampling design when all nodes are sampled |
\(S\subseteq N\) | the set of egos in a sample |
\(e_{S}\) | the data contained in an egocentric sample |
\(e_i\) | the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts: |
\(e^e_i \equiv x_i\) | \(i\)’s own attributes |
\(e^a_i \equiv (x_{j})_{j\in y_i}\) | an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but not their identities (indices in \(N\)) |
\(e^e_{i,k}\equiv x_{i,k}\) | The \(k\)th attribute/covariate observed on ego \(i\) |
\(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\) | and its alters. |
We call a network statistic \(g_{k}(\cdot,\cdot)\) egocentric if it can be expressed as \[ g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i) \] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.
The space of egocentric statistics includes dyadic-independent statistics that can be expressed in the general form of \[ g_{k}(y,x)=\sum_{(i,j)\in y} f_k(x_i,x_j) \] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some dyadic-dependent statistics that can be expressed as \[ g_{k}(y,x)=\sum_{i\in N} f_k\big\{x_{i},(x_j)_{j\in y_i}\big\} \] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.
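To see why the tie-level form fits the definition of an egocentric statistic, note that each tie between \(i\) and \(j\) appears exactly once in \(i\)’s neighbor set \(y_i\) and once in \(j\)’s neighbor set \(y_j\), so summing over egos counts every tie twice. For symmetric \(f_k\), this gives \[ \sum_{(i,j)\in y} f_k(x_i,x_j) = \frac{1}{2}\sum_{i\in N}\sum_{j\in y_i} f_k(x_i,x_j), \] so one can take \(h_k(e_i)=\frac{1}{2}\sum_{j\in y_i} f_k(x_i,x_j)\), which depends only on the ego’s attributes \(e^e_i\) and the alters’ attributes \(e^a_i\), not on the alters’ identities.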
The statistics that are identifiable in an egocentric sample depend on the specific egocentric study design.
The table below (from Krivitsky & Morris 2017) shows some examples of egocentric statistics, and gives their representations in terms of \(h_{k}(\cdot)\).
Statistic | \(g_{k}( y,x)\) | \(h _ {k}(e_i)\) |
---|---|---|
General sum over ties | \(\sum _ {(i,j)\in y} f _ k(x _ i,x _ j)\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} f _ k\big(e^\text{e}_i,e^\text{a}_{i,j'}\big)\) |
Number of ties in the network | \(\lvert y \rvert\equiv \sum _ {(i,j) \in y} 1\) | \(\frac{1}{2}\lvert e^\text{a}_{i}\rvert\) |
weighted by actor covariate \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} (x _ {i,k}+x _ {j,k})\) | \(\frac{1}{2} \big(e^\text{e}_{i,k} \lvert e^\text{a}_{i}\rvert + \sum _ {j'\in e^\text{a} _ i} e^\text{a}_{i,j',k} \big)\) |
weighted by difference in \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} \lvert x _ {i,k}-x _ {j,k}\rvert\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} \lvert e^\text{e}_{i,k}-e^\text{a}_{i,j',k}\rvert\) |
within groups identified by \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} 1_{x _ {i,k}=x _ {j,k}}\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} 1_{ e^\text{e}_{i,k}= e^\text{a}_{i,j',k}}\) |
General sum over actors | \(\sum _ {i\in N} f _ k\big\{x _ {i},(x _ j) _ {j\in y_{i}}\big\}\) | \(f _ k\big(e^\text{e}_i,e^\text{a}_{i}\big)\) |
Number of actors with \(d\) neighbors | \(\sum _ {i\in N} 1_{\lvert y_{i}\rvert=d}\) | \(1_{\lvert e^\text{a}_{i}\rvert=d}\) |
weighted by actor covariate \(x _ {i,k}\) | \(\sum _ {i\in N} x _ {i,k} 1_{\lvert y_{i}\rvert=d}\) | \(e^\text{e}_{i,k}1_{\lvert e^\text{a}_{i}\rvert=d}\) |
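As an illustration of the table, the following base R sketch builds a small made-up network, extracts each actor’s egocentric view, and checks that summing \(h_k(e_i)\) over egos reproduces four of the statistics computed from the complete network. The data and all object names are hypothetical and do not use the ergm.ego API.

```r
# Toy undirected network on 5 actors (hypothetical data, not from the tutorial).
el    <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5))  # edge list: the network y
grade <- c(9, 9, 10, 11, 12)                         # quantitative attribute
sex   <- c("F", "M", "F", "M", "M")                  # categorical attribute
n     <- length(grade)

# Egocentric view e_i: the ego's attributes and the alters' attributes,
# with alter identities discarded.
ego_view <- lapply(seq_len(n), function(i) {
  alters <- c(el[el[, 1] == i, 2], el[el[, 2] == i, 1])
  list(ego    = list(grade = grade[i], sex = sex[i]),
       alters = list(grade = grade[alters], sex = sex[alters]))
})

# h_k(e_i) for four rows of the table.
h_edges   <- function(e) length(e$alters$grade) / 2
h_match   <- function(e) sum(e$alters$sex == e$ego$sex) / 2
h_absdiff <- function(e) sum(abs(e$ego$grade - e$alters$grade)) / 2
h_degree2 <- function(e) as.numeric(length(e$alters$grade) == 2)

# g_k obtained as a sum of h_k over all egos ...
g_ego <- c(edges         = sum(sapply(ego_view, h_edges)),
           nodematch.sex = sum(sapply(ego_view, h_match)),
           absdiff.grade = sum(sapply(ego_view, h_absdiff)),
           degree2       = sum(sapply(ego_view, h_degree2)))

# ... and the same statistics computed directly from the complete network.
g_net <- c(edges         = nrow(el),
           nodematch.sex = sum(sex[el[, 1]] == sex[el[, 2]]),
           absdiff.grade = sum(abs(grade[el[, 1]] - grade[el[, 2]])),
           degree2       = sum(tabulate(c(el), nbins = n) == 2))

rbind(g_ego, g_net)
```

The factor of \(\frac{1}{2}\) in the tie-based statistics compensates for each tie being reported by both of its endpoints; the actor-based degree statistic needs no such correction.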
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14)
os Ubuntu 22.04.4 LTS
system x86_64, linux-gnu
ui X11
language en
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/London
date 2024-06-23
pandoc 3.1.2 @ /usr/bin/ (via rmarkdown)
─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
bookdown 0.39 2024-04-15 [1] CRAN (R 4.4.0)
bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
coda 0.19-4.1 2024-01-31 [1] CRAN (R 4.4.0)
DBI 1.2.3 2024-06-02 [1] CRAN (R 4.4.0)
deldir 2.0-4 2024-02-28 [1] CRAN (R 4.4.1)
DEoptimR 1.1-3 2023-10-07 [1] CRAN (R 4.4.0)
digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
egor * 1.24.2 2024-06-20 [1] Github (tilltnet/egor@44d87a0)
ergm * 4.7-7368 2024-06-20 [1] Github (statnet/ergm@93ecb25)
ergm.ego * 1.1-704 2024-06-20 [1] local
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
igraph 2.0.3 2024-03-13 [1] CRAN (R 4.4.0)
interp 1.1-6 2024-01-26 [1] CRAN (R 4.4.1)
jpeg 0.1-10 2022-11-29 [1] CRAN (R 4.4.1)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr * 1.47 2024-05-29 [1] CRAN (R 4.4.0)
lattice 0.22-6 2024-03-20 [4] CRAN (R 4.4.1)
latticeExtra 0.6-30 2022-07-04 [1] CRAN (R 4.4.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
lpSolveAPI 5.5.2.0-17.11 2023-11-28 [1] CRAN (R 4.4.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
Matrix 1.7-0 2024-04-26 [4] CRAN (R 4.4.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
mitools 2.4 2019-04-26 [1] CRAN (R 4.4.0)
network * 1.18.2 2024-06-20 [1] Github (statnet/network@c1b2084)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
png 0.1-8 2022-11-29 [1] CRAN (R 4.4.0)
purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
rbibutils 2.2.16 2023-10-25 [1] CRAN (R 4.4.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.4.0)
Rdpack 2.6 2023-11-08 [1] CRAN (R 4.4.0)
Rglpk 0.6-5.1 2024-01-13 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rle 0.9.2-234 2024-06-20 [1] Github (statnet/rle@d08b185)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
robustbase 0.99-2 2024-01-27 [1] CRAN (R 4.4.0)
sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
slam 0.1-50 2022-01-08 [1] CRAN (R 4.4.0)
srvyr 1.2.0 2023-02-21 [1] CRAN (R 4.4.0)
statnet.common 4.10.0-442 2024-06-20 [1] Github (statnet/statnet.common@4e8cb54)
survey 4.4-2 2024-03-20 [1] CRAN (R 4.4.0)
survival 3.7-0 2024-06-05 [4] CRAN (R 4.4.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
tidygraph 1.3.1 2024-01-30 [1] CRAN (R 4.4.0)
tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
trust 0.1-8 2020-01-10 [1] CRAN (R 4.4.0)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
xfun 0.45 2024-06-16 [1] CRAN (R 4.4.1)
yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
[1] /home/mbojan/R/library/4.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
This does not mean that the mean degree itself cannot be estimated from egocentric data, only that our inferential results might not apply.