Stochastic block models for multiplex networks: an application to a multilevel network of researchers

Modelling relationships between individuals is a classical question in social sciences and clustering individuals according to the observed patterns of interactions allows us to uncover a latent structure in the data. The stochastic block model is a popular approach for grouping individuals with respect to their social comportment. When several relationships of various types can occur jointly between individuals, the data are represented by multiplex networks where more than one edge can exist between the nodes. We extend stochastic block models to multiplex networks to obtain a clustering based on more than one kind of relationship. We propose to estimate the parameters—such as the marginal probabilities of assignment to groups (blocks) and the matrix of probabilities of connections between groups—through a variational expectation–maximization procedure. Consistency of the estimates is studied. The number of groups is chosen by using the integrated completed likelihood criterion, which is a penalized likelihood criterion. Multiplex stochastic block models arise in many situations but our applied example is motivated by a network of French cancer researchers. The two possible links (edges) between researchers are a direct connection or a connection through their laboratories. Our results show strong interactions between these two kinds of connection and the groups that are obtained are discussed to emphasize the common features of researchers grouped together.


Introduction
Network analysis has emerged as a key technique for understanding and for investigating social interactions through the properties of relations between and within units. From a statistical point of view, a network is a realization of a random graph formed by a set of nodes V representing the units (e.g. individuals, actors, companies) and a set of edges E representing relationships between pairs of nodes.
The system in which the same nodes belong to multiple networks is typically referred to as a multiplex network or multigraph (see Wasserman (1994) for example). In recent literature, there has been an upsurge of interest in multiplex networks (see for example Cozzo et al. (2012); Loe and Jeldtoft Jensen (2014); Rank et al. (2010); Szell et al. (2010); Mucha et al. (2010); Maggioni et al. (2013); Brummitt et al. (2012); Saumell-Mendiola et al. (2012); Bianconi (2013); Nicosia et al. (2013)). In these multiplex networks, different kinds of links (or connections) are possible for each pair of nodes. This induced link multiplexity is a fundamental aspect of social relations (Snijders and Baerveldt, 2003) since these multiple links are frequently interdependent: links in one network may have 2 an influence on the formation or dissolution of links in other networks.
The simultaneous analysis of several networks also arises when one is interested in the social comportment of individuals belonging to organized entities (such as companies, laboratories, political groups, etc.), with some individuals possibly belonging to the same institution. While the actors will exchange resources (such as advice for instance) at the individual level, their respective organizations of affiliation will also share resources at the institutional level (financial resources for instance). Each level (individuals and organizations) constitutes a system of exchange of different resources that has its own logic and could be studied separately. However, studying the two networks jointly (and hence embedding the individuals in the multilevel relational and organizational structures constituting the inter-organizational context of their actions) would allow us to identify the individuals that benefit from relatively easy access to the resources circulating in each level, which is of much more interest. In other words, studying the two levels jointly could help us understand how an individual can benefit from the position of its organization in the institutional network.
In this paper, we are interested in studying the advice relations between researchers and the exchanges of resources between laboratories. We adopt the following individualoriented strategy (this point is discussed in the paper): the institutional network is used to define a new network on the individual level i.e. the set of nodes consists in the set of individuals and for a pair of individuals, two kinds of links are possible: a direct connection given by the individual network and a connection through their organizations given by the organizational network. As a consequence, the individual and institutional levels are fused into a multiplex.
We then develop a statistical model able to detect in multiplex substantial nontrivial topological features, with patterns of connection between their elements that are 3 not purely regular. Several models such as scale-free networks and small-world networks have been proposed to describe and understand the heterogeneity observed in networks.
These models allow to derive properties of the network at the macro-scale and to understand the outcomes of interactions. To explore heterogeneity at others scales (such as micro or meso-scale) in social networks, specific models such as the stochastic block models (SBM) (Snijders and Nowicki (1997)) have been developed for uniplex networks.
In this paper, we propose an original extension of the SBMs to the multiplex case. Our model is efficient to model not only the main effects (that correspond to a classical uniplex) but also the pairwise interactions between the nodes. We estimate the parameters of the multiplex SBM using an extension of the variational EM algorithm. Consistency of the estimation of the parameters is proved. As for uniplex SBM, a key issue is to choose the number of blocks. We use a penalized likelihood criterion, namely Integrated Completed Likelihood (ICL). The inference procedure is performed on the cancer researchers / laboratories dataset.
The paper is organized as follows. The extension of SBM to multiplex network is presented in Section 2, the proofs of model identifiability and the consistency of variational EM procedure are postponed in Appendices A and B. In Section 3, we describe Lazega et al. (2008)'s dataset, apply the new modeling and discuss the results. Eventually, the contribution of multiplex SBM to the analysis of multiplex networks is highlighted in Section 4.

Multiplex stochastic block model
The main objective is to cluster the individuals (or nodes) into blocks sharing connection properties with the other individuals of the multiplex-network. Stochastic block models (Nowicki and Snijders, 2001) for random graphs have emerged as a natural tool to perform such a clustering based on uniplex networks (directed or not, valued or not).
In the following, we propose an extension of the Stochastic Block Model (SBM) to multiplex networks. The SBM for multiplex networks is derived from a multiplex Erdös-Rényi model which is described in subsection 2.1. The SBM for multiplex networks is derived in subsection 2.2.
The maximum likelihood estimate of the parameter of interest π = (π (w) ) w∈{0,1} K is, for all w ∈ {0, 1} K : This model is quite simple since any relation between two individuals (a relation being a collection of edges) does not depend on the relations between the other individuals.
However, the different kind of relations between two individuals (edges) are not assumed to be independent.
Remark 1. This model is clearly an extension of the Erdös-Rényi model since the marginal distribution of X k ij (for any k = 1 . . . K) is Bernoulli with density : Moreover, any conditional distribution of X k ij given (X l ij ) l∈S \k (where S \k is a subset of {1, . . . , K} not containing k) is also univariate Bernoulli. For instance, if K = 2 the conditional distribution of X 1 ij given X 2 ij is .
Introduction of explanatory variables. Naturally, the Erdös-Rényi model for multiplex networks can be extended to take into account explanatory variables. Let y ij denote the covariates characterizing the couple of nodes (i, j), the model is defined by the probabilities: where x denotes the transposed vector of x.
Remark 2. Note that in the multiplex Erdös-Rényi model, the modeling is actor based, which means that the individuals are the same for all the networks X 1 , . . . , X K and we model conjointly all the connections. As a consequence, the covariates y ij only depend on the couple (i, j) and are not linked to the network under consideration.
Since the multiplex Erdös-Rényi model belongs to exponential models, the generalized linear model theory applies when we introduce the covariates as in model (2) and the estimates are obtained using standard optimization strategies.

Stochastic block model for multiplex networks
When the goal is to cluster individuals according to their social comportment, we can derive a Stochastic Block Model version of the multiplex Erdös-Rényi model. Let Q be 6 the number of blocks and Z i the latent variable such that Z i = q if the individual i belongs to block q (note that an individual can only belong to one block in this version).
The multiplex version of SBM is written as follows: Such a model includes (2 K − 1)Q 2 + (Q − 1) parameters. Introducing the following notations: the likelihood function is written as: where the latent variable (the block affectations) are integrated out. The identifiability of the model can be proved (see Appendix A, theorem A.1) and the maximum likelihood estimators are consistent (theorem A.2).
Remark 3. Note that, as before, covariates on the couple (i, j) can be introduced in the model: ∀w = (0, . . . , 0), 7 However, the number of parameters in this new model will increase drastically, leading to estimation issues. Maximum likelihood and model selection As soon as n or Q are large, the observed likelihood (4) is not tractable (due to the sum on Z ∈ {1, . . . Q} n ) and its maximization is a challenging task. Several approaches have been developed in the literature (for a review, see Matias and Robin, 2014), both in the frequentist and Bayesian frameworks, starting from Snijders and Nowicki (1997) and Nowicki and Snijders (2001).
However, when the latent data space is really large, these techniques can be burdensome. (2009) which relying on a profile-likelihood optimization or the moment estimation proposed by Ambroise and Matias (2012), to name but a few .

Some other strategies have been proposed, such as Bickel and Chen
The variational EM in the context of SBM proposed by Daudin et al. (2008) is a flexible tool to tackle the computational challenge in many types of graphs. Simulation studies showed its practical efficiency (Mariadassou et al., 2010). Moreover, its theoretical convergence towards the maximum likelihood estimates has been studied by Celisse et al. (2012) for binary graphs. In this paper, we adapt the variational EM to multiplex networks. The algorithm is described in Appendix B and its convergence towards the true parameter is proved (Theorem B.1).
The selection of the most adequate number of blocks Q is performed using a modification of the ICL criterion (as in Daudin et al., 2008;Mariadassou et al., 2010). Let M Q denote the model defined (5) with Q blocks.
where Z are the predictions of the assignments Z obtained as a sub-product of the variational EM algorithm (see Appendix B). As in the BIC criteria, the log refers to the number of data. Thus, the n nodes are used to estimate the Q − 1 probabilities α 1 , . . . , α Q−1 . The Kn(n − 1) edges are used to estimate π. No theoretical results exist for the ICL properties, but this criteria has proved its efficiency in practice.
Remark 4. If we consider covariates as in equation (5), then the ICL is adapted : where P Q is the dimension of (µ, β).
3 Analysis for the laboratory-researcher data At the inter-individual level, the actors (researchers) were submitted a list of cancer researchers and asked in interviews whom they sought advice from. The advice were of five types, namely advice to deal with choices about the direction of projects, advice to find institutional support, advice to handle financial resources, advice for recruitment, and finally advice about manuscripts before submitting them to journals. The five advice networks are too sparse to be studied separately. That is why they were aggregated.
Therefore two researchers are considered as linked if at least one kind of relationship exists. Obviously the links are directed.
At the laboratory level (concerning only laboratories with "Elite" researchers), the lab-oratory directors were asked to specify what type of resources they exchanged with the other laboratories on the list. The examined resources were the recruitment of post-docs and researchers, the development of programs of joint research, joint responses to tender offers, sharing of technical equipment, sharing of experimental material, mobility of administrative personnel, and invitations to conferences and seminars. Once again, to avoid over-sparsity, the various networks were aggregated and two laboratories are said to be linked if there is at least one link between them. From this network on labs, we can derive indirect links between the researchers, i.e. two researchers are connected if their laboratories exchange resources. We finally have two adjacency matrices on the same set of nodes (researchers).
This corresponds to transforming the multilevel network (individual/ organization) into a multiplex network (several kinds of relations among individuals). This is reasonable since the majority of laboratories contains a unique "Elite" researcher. Thus, there is no big difference in the number of nodes between the institutional and individual levels.
In addition, auxiliary covariates are available to describe the researchers: their age, their specialty, two publication performance scores based on two periods of five years, their status (director of the lab or not). Auxiliary covariates are also available the laboratories: their size (number of researchers) and their location.
Complete data for 95 researchers identified as the "Elite" of French cancer research working in 76 laboratories are available.

Statistical inference through SBM for multiplex
To estimate the parameters of the multiplex SBM model on this dataset, we use a modified version of the variational EM algorithm (Daudin et al., 2008) described in the Appendix Section B. The optimization of the ICL criterion derived from likelihood (4) leads to four blocks (indistinctly denominated clusters or groups). For the sake of clarity, we index by R and L (rather than X 1 , X 2 ) the two adjacency matrices (respectively the direct and indirect ones) In Figures 1 and 2 we plot the marginal and conditional probabilities of the connections of researchers (respectively labs) between and within blocks.
Note that the study of the estimated marginal distributions allows us to have results on the researchers without considering the laboratories. This gives a clear interpretation of the importance of the lab for the researcher network structure. The obtained blocks are described in Table 1: (1a) gives the sizes of the four blocks; (1b), (1c) and (1d) describe the blocks with respect to the covariates "location", "director of not" and "specialty". The estimations by the variational EM procedure were conducted by wmixnet (Leger, 2014) with our specific implementation of the bivariate Bernoulli model.
We now discuss the results. Multiplex SBM reveals interesting structural features of the multiplex network. More precisely, collaboration takes place in a clustered manner for both researchers and laboratories; collaborating laboratories tend to have affiliated researchers seeking advice from one another. Indeed, Figure 1 shows are clustered in blocks that make sense from the perspective of the categories of people and organizations, at each level separately and together.
Block 1 members work in the biggest labs in terms of size (Figure 3(b)). SBM clusters them because they have many more relations than the other members of the network. They have among the highest indegrees, and average outdegrees in labs that have highest average indegrees and outdegrees (Figures 3(e), 3(f), 3(g), 3(h)). They have the highest performance in terms of publication performance scores in both periods (Figures 3(c) and 3(d)). In fact, they have a similar relational profile since they are providers of transgenic mice for the experiments of many similar colleagues.
Block 2 members are among the lowest indegrees and outdegrees in labs that have the lowest average indegrees and outdegrees (Figures 3(e), 3(f), 3(g), 3(h)), slightly older than the others (Figure 3(a)), mainly in the smallest labs in terms of size. This block is heterogeneous in terms of specialties (especially 40% clinicians and 25% diagnostic/prevention/epidemiology specialists) except fundamental research (Table 1(d)).
They also have among the lowest performance levels for both periods (Figures 3(c), 3(d)), although this is increasing. This is the biggest block. Their behavior may be described as fusional as proposed in Lazega et al. (2008) since it corresponds to individuals for whom the probabilities of connection are the most affected by the connections of their laboratories.
In block 3, younger fundamental researchers in laboratories carrying out fundamental research (mostly molecular research) prevail (Figure 3(a) and Table 1(d)). They have relatively low indegrees and outdegrees, in labs that have the highest indegrees (after Block 1) and average outdegrees (Figures 3(e), 3(f), 3(g), 3(h)). 70% are among the top performers of this population, i.e. highest performance levels after Block 1 members, for both periods (Figures 3(c), 3(d)). Their dominant relational strategies are individualist or independentists (as shown in Figure 1) since the probabilities of connection between researchers remain quite unchanged, no matter if their laboratories are connected.
Block 4 is also heterogeneous in terms of specialties but its largest subgroup is composed of hematologists (clinical and fundamentalists) (

3(c), 3(d)
). In this block, researchers can take advantage of their laboratories to be connected with colleagues but they can also hold relations apart from their laboratories.
In this case, SBM highlight in the data a specific kind of block structure that provides an interesting understanding of the effect of dual positioning. SBM mixes people and laboratories that previous analyses used to separate. It is interesting to notice, for example, that SBM partitions the population regardless of geographic location (Table   1(b)) -a criterion that was previously shown to be meaningful to understand collective action in this research milieu. In fact this partition may highlight the links between laboratories and between individuals that cut across the boundaries and remoteness created by geography, showing that certain categories of actors tend to reach across these separations when it suits them, either as investments to prepare future collaborations, or as a follow-up for past investments.

Discussion
The essence of 'networks' is to help actors cut across organizational boundaries to create new relationships (Baker, 1992), to identify new opportunities and, eventually, to create new organizations to use or hoard these new opportunities (Lazega, 2012;Lazega et al., 2013;Tilly, 1998). This work is a step forward a more precise comprehension of such mechanisms.
In this paper, we proposed a SBM modelisation for multiplex networks. This way of partitioning the graph and calculating the probabilities that two individuals are connected takes into account more parameters in the graph than just centrality scores and size as in Lazega et al. (2007). Besides, compared to MERGMs (Wang et al., 2013), this method takes inhomogeneities of actors more into account because the separate blocks tend to make sense as blocks with a specific identity. Multiplex SBM statistical analysis designed for the analysis of multilevel networks given as "superposed" (Lazega et al., 2007) network data aims at identifying the rules that govern the formation of links at the inter-individual level conditioned by the characteristics of the inter-organizational network. Our analysis found a general structure that shows the ways in which these rules can be derived in this case and help assess actor inhomogeneities in terms of their influence on parameter estimates. This assessment is new in the sense that it shows that formation of connections in the inter-individual network does not apply to all actors in the network identically.
In this paper, the model is applied on K = 2 networks. Applying with K much larger would imply estimation difficulties since the number of parameters exponentially depends on K. However, particular assumption (such as Markovian dependencies) can reduce the difficulty and lead to feasible cases. Besides, in our example, going from multilevel to multiplex was quite natural since few laboratories contained more than one researcher.
If the number of organizations were far smaller then the number of individuals than maybe it would be interesting to introduce an other kind of relation, specifying if the individuals are in the same laboratory or not. In this case, the probabilities vector would have a special structure, taking this specificity into account (if the individuals belong to the same organization then they are automatically linked through their organization) .
In this version of the analysis, the covariates were studied a posteriori, the classification being purely done on the network. From a practical point of view, including the covariates in the model would imply estimation difficulty (and even more if the effect of the covariates on the connexion probabilities depends on the blocks). However, even if attractive at first sight, the choice of including the covariates in the modeling has to be questioned. Including them will cluster the data beyond the effect of these covariates, while not including them will provide a description of blocks on the basis of the relation, may this relation be influenced by these covariates. This second strategy may lead to more interpretable results.
The first aim of Lazega et al. (2008) was cancer research management. Thus, the covariate "performance in terms of publication" catches more attention. Indeed, this modelisation can generate other questions such as the influence of the networks on the performance of the actors (publications in our case). In this work, the performance is treated as a factor to explain the relations. However, one could dream of a model which would help the actors build the ideal network to optimize the performance. If it is true that contemporary society is an "organizational society" (Coleman, 1982;Perrow, 1991;Presthus, 1962) -in the sense that action and performance measured at the individual level strongly depend on the capacity of the actor to construct and to use organizations as instruments, and thus to manage his/her interdependencies at different levels in a strategic manner-, then the study of interdependencies jointly at the inter-individual and the inter-organizational level is important for numerous sets of problems. Proposing a hierarchical model in this direction is out of the scope of the paper but is clearly a possible extension.
To conclude, application to sociological network analysis will clearly benefit from this 18 methodological contribution. In spite of several limitations listed before, this analysis of multiplex networks seems therefore adapted to certain types of questions that sociologists ask when they try to combine both individual and contextual factors in order to estimate the likelihood of an individual or a group to adopt a given behavior or to reach a given level of performance. More generally, this approach explores a complex meso-social level of accumulation, of appropriation and of sharing of multiple resources. This level, still poorly known, is difficult to observe without a structural approach.
Proof. The proof given by Celisse et al. (2012) can be directly extended to our case leading to the expected result. As explained in the original paper, the identifiability can be proved using algebraic tools. The only difference in the multiplex context is that the proof has to be applied for any type of tie w ∈ {0, 1} K .

We can write the
Using the fact that R (w) (θ) = R (w) (θ ) and A(θ) = A(θ ), we get And the theorem is demonstrated.
A.2 Consistency of the maximum likelihood estimator in multiplex SBM models We study the asymptotic properties of the maximum likelihood estimator in multiplex SBM models. Let A 1 , A 2 and A 3 be the following assumptions: Assumption A 1 : for every q = q , there exists l ∈ {1, . . . , Q} such that π ql = π q l , or π lq = π lq .
Let X 1:K be a realization from the multiplex SBM: P( · ; Z * , α * , π * ) where Z * = (Z * 1 , . . . , Z * n ) is the true group label sequence and (α * , π * ) are the true parameters. Let ( α, π) be the maximum likelihood estimator defined as: ( α, π) = Argmax (α,π) L(X 1:K ; α, π) , where (X 1:K ; α, π) = Z∈{1,...Q} n e 1 (X 1: Theorem A.2. Let (A 1 ), (A 2 ) and (A 3 ) hold. Then, for any distance d(·, ·) on the set of parameter π, we have: Moreover, assume that π − π * ∞ = o P ( log(n)/n) then, for any distance d(·, ·) in R Q , Note the rate o P ( log(n)/n) has not been proved yet. However, in the unilevel context, there is empirical evidence that the rate convergence on π is 1/n (Gazal et al., 2012 which is not tractable as soon as n and Q are large. The variational EM is an alternative method to maximize the marginal likelihood with respect to θ. The variational EM (applied in the SBM context by Daudin et al. (2008)) relies on the following decomposition of (10). Let R X 1:K be any probability distribution on Z, we have: where KL is the Kullback-Leibler distance, p(X 1:K , Z; θ) is the joint density of X 1:K and Z (namely the complete likelihood) and p(·|X 1:K ; θ) is the posterior distribution of Z given the data X 1:K and the parameters θ. Instead of maximizing log (X 1:K ; θ), the variational EM optimizes a lower bound I θ (R X 1:K ) of log (X 1:K |θ) where: ≤ log (X 1:K |θ) .
The details of steps 1 and 2 directly depend on the considered statistical model. For uniplex SBM without covariates, they are given in Daudin et al. (2008). The details for the multiplex SBM are given here after.

B.2 Details of the calculus for multiplex SBM models
We now detail Step 1 and Step 2 for multiplex SBM models. • Step 1: τ (t) verifies We first rewrite I θ (R X 1:K ,τ ) for this special context: with log p(X 1:K , Z; θ) = 1 (X 1:K |Z; θ) + log p(Z; θ) , This quantity has to be integrated over Z where Z ∼ R X 1:K ,τ which means that Z = (Z i ) i=1...n are independent variables such that P(Z i = q) = τ iq . We obtain: = q,l i,j,i =j log p(X 1:K ij |Z i = q, Z j = l; θ)τ iq τ jl where log p(X 1:K ij |Z i = q, Z j = l; θ)'s expression is given in equation (3).
The maximization with respect to α is quite direct and in any case, we obtain: ql ) at step 2 of the VarEm is not explicit anymore and one should resort to optimization algorithms such as Newton-Raphson algorithm.

B.4 About the proofs
The proofs of these results require many intermediate results which won't be given here because their adaptation to the multiplex context is quite direct. Indeed, at any step of the proof, the original Bernoulli distribution needs to replaced by its K-dimensional version. More precisely, a central quantity in the proof is the ratio log p(Z|X 1:K ;α,π) p(Z * |X 1:K ;α,π) . In