519 Matching Annotations
  1. Apr 2021
    1. re $\mathcal L_\theta(A) \propto \mathbb P_\theta(A)$;

      do everything in terms of probabilities

    2. The reason we write $\propto$ instead of $=$ is that, in the case of probability distributions, it is often rather tedious to take care of all constants when simplifying expressions, and these constants can detract from much of the intuition of what is going on. If we instead focus on the likelihood, we only need to worry about the parts of the probability statement that deal directly with the parameters $\theta$ and the realization itself $A$.

      tedious. remove until we decide we need it

    3. What does the likelihood for the a priori SBM look like? Fortunately, since $\vec\tau$ is a parameter of the a priori SBM, the likelihood is a bit simpler than for the a posteriori SBM. This is because the a posteriori SBM requires a marginalization over potential realizations of $\vec{\pmb\tau}$, whereas the a priori SBM does not. The likelihood is as follows, omitting detailed explanations of steps that are described above:

       \begin{align*}
       \mathcal L_\theta(A) &\propto \mathbb P_\theta(\mathbf A = A) \\
       &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij}) \;\;\;\;\textrm{Independence Assumption} \\
       &= \prod_{j > i} b_{\ell k}^{a_{ij}}(1 - b_{\ell k})^{1 - a_{ij}} \;\;\;\;\textrm{p.m.f. of Bernoulli distribution} \\
       &= \prod_{k, \ell} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}
       \end{align*}

       Like the ER model, there are again equivalence classes of the sample space $\mathcal A_n$ in terms of their likelihood. Let $|\mathcal E_{\ell k}(A)|$ denote the number of edges in the $(\ell, k)$ block of adjacency matrix $A$. For a two-community setting, with $\vec\tau$ and $B$ given, the equivalence classes are the sets:

       \begin{align*}
       E_{a,b,c}(\vec\tau, B) &= \left\{A \in \mathcal A_n : |\mathcal E_{11}(A)| = a,\; |\mathcal E_{21}(A)| = |\mathcal E_{12}(A)| = b,\; |\mathcal E_{22}(A)| = c\right\}
       \end{align*}

       The number of possible equivalence classes scales with the number of communities and with the manner in which vertices are assigned to communities (particularly, the number of nodes in each community). As before, we have the following. For any $\vec\tau$ and $B$: if $A, A' \in E_{a,b,c}(\vec\tau, B)$ (that is, $A$ and $A'$ are in the same equivalence class), then $\mathcal L_\theta(A) = \mathcal L_\theta(A')$; and if $A \in E_{a,b,c}(\vec\tau, B)$ but $A' \in E_{a',b',c'}(\vec\tau, B)$ where $a \neq a'$, $b \neq b'$, or $c \neq c'$, then $\mathcal L_\theta(A) \neq \mathcal L_\theta(A')$.

      goes in starred section
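
      For concreteness, here is a minimal numerical sketch of this block-wise likelihood. It assumes an undirected, loopless binary adjacency matrix `A`, a community-assignment vector `tau` with values in {0, ..., K-1}, and a K x K block matrix `B` with entries strictly between 0 and 1, with each unordered block pair counted once; the function name and interface are hypothetical, not the book's code.

      ```python
      import numpy as np

      def apriori_sbm_log_likelihood(A, tau, B):
          """Log-likelihood (up to a constant) of an a priori SBM.

          A   : (n, n) symmetric, hollow {0, 1} adjacency matrix
          tau : (n,) community assignments, values in {0, ..., K - 1}
          B   : (K, K) block probability matrix, entries in (0, 1)
          """
          K = B.shape[0]
          log_lik = 0.0
          for l in range(K):
              for k in range(l, K):
                  in_l, in_k = tau == l, tau == k
                  block = A[np.ix_(in_l, in_k)]
                  if l == k:
                      # count each unordered pair of nodes once
                      edges = np.triu(block, 1).sum()
                      possible = in_l.sum() * (in_l.sum() - 1) / 2
                  else:
                      edges = block.sum()
                      possible = in_l.sum() * in_k.sum()
                  # |E_{lk}| edges succeed, n_{lk} - |E_{lk}| fail
                  log_lik += edges * np.log(B[l, k]) \
                             + (possible - edges) * np.log(1 - B[l, k])
          return log_lik
      ```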

    4. What does the likelihood for the a posteriori SBM look like? In this case, $\theta = (\vec\pi, B)$ are the parameters for the model, so the likelihood for a realization $A$ of $\mathbf A$ is:

       \begin{align*}
       \mathcal L_\theta(A) &\propto \mathbb P_\theta(\mathbf A = A)
       \end{align*}

       Next, we use the fact that the probability that $\mathbf A = A$ is, in fact, the marginalization (over realizations of $\pmb\tau$) of the joint $(\mathbf A, \pmb\tau)$. In the line after that, we use Bayes' Theorem to separate the joint probability into a conditional probability and a marginal probability:

       \begin{align}
       &= \int_\tau \mathbb P_\theta(\mathbf A = A, \pmb\tau = \tau)\,\textrm{d}\tau \nonumber \\
       &= \int_\tau \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb\tau = \tau)\,\mathbb P_\theta(\pmb\tau = \tau)\,\textrm{d}\tau \label{eqn:apost_sbm_eq1}
       \end{align}

       Let's think about each of these probabilities separately. Remember that for $\pmb\tau$, each entry $\pmb\tau_i$ is sampled independently and identically from $Categorical(\vec\pi)$. The probability mass for a $Categorical(\vec\pi)$-valued random variable is $\mathbb P(\pmb\tau_i = \tau_i; \vec\pi) = \pi_{\tau_i}$. Finally, note that if we are taking the product of $n$ terms $\pi_{\tau_i}$, many of these values will end up being the same. Consider, for instance, the vector $\tau = [1,2,1,2,1]$. We end up with three terms of $\pi_1$ and two terms of $\pi_2$, and it does not matter in which order we multiply them. Rather, all we need to keep track of are the counts of each $\pi_k$ term. Written another way, we can use the indicator that $\tau_i = k$, given by $\mathbb 1_{\tau_i = k}$, and a running counter over all of the community probability assignments $\pi_k$ to make this expression a little more sensible. We will use the symbol $n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$, the number of nodes assigned to community $k$, to denote this value:

       \begin{align*}
       \mathbb P_\theta(\pmb\tau = \tau) &= \prod_{i = 1}^n \mathbb P_\theta(\pmb\tau_i = \tau_i) \;\;\;\;\textrm{Independence Assumption} \\
       &= \prod_{i = 1}^n \pi_{\tau_i} \;\;\;\;\textrm{p.m.f. of a Categorical R.V.} \\
       &= \prod_{k = 1}^K \pi_k^{n_k}
       \end{align*}

       Next, let's think about the conditional probability term, $\mathbb P_\theta(\mathbf A = A \,\big|\, \pmb\tau = \tau)$. Remember that the entries are all independent conditional on $\pmb\tau$ taking the value $\tau$. This means that we can separate the probability of the entire event $\mathbf A = A$ into the product of the edge-wise probabilities. Further, remember that conditional on $\pmb\tau_i = \ell$ and $\pmb\tau_j = k$, $\mathbf a_{ij}$ is $Bern(b_{\ell k})$. The distribution of $\mathbf a_{ij}$ does not depend on any of the other entries of $\pmb\tau$. Remembering that the probability mass function of a Bernoulli R.V. is $\mathbb P(\mathbf a_{ij} = a_{ij}; p) = p^{a_{ij}}(1 - p)^{1 - a_{ij}}$, this gives:

       \begin{align*}
       \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb\tau = \tau) &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij} \,\big|\, \pmb\tau = \tau) \;\;\;\;\textrm{Independence Assumption} \\
       &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij} \,\big|\, \pmb\tau_i = \ell, \pmb\tau_j = k) \;\;\;\;\textrm{$\mathbf a_{ij}$ depends only on $\tau_i$ and $\tau_j$} \\
       &= \prod_{j > i} b_{\ell k}^{a_{ij}}(1 - b_{\ell k})^{1 - a_{ij}}
       \end{align*}

       Again, we can simplify this expression a bit. Recall the indicator function above. Let $|\mathcal E_{\ell k}| = \sum_{j > i} \mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j = k}\, a_{ij}$, and let $n_{\ell k} = \sum_{j > i} \mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j = k}$. Note that $|\mathcal E_{\ell k}|$ is the number of edges between nodes in community $\ell$ and community $k$, and $n_{\ell k}$ is the number of possible edges between nodes in community $\ell$ and community $k$. This expression can be simplified to:

       \begin{align*}
       \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb\tau = \tau) &= \prod_{\ell, k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}
       \end{align*}

       Combining these into the integrand from Equation (\ref{eqn:apost_sbm_eq1}) gives:

       \begin{align*}
       \mathcal L_\theta(A) &\propto \int_\tau \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb\tau = \tau)\,\mathbb P_\theta(\pmb\tau = \tau)\,\textrm{d}\tau \\
       &= \int_\tau \prod_{k = 1}^K \pi_k^{n_k} \cdot \prod_{\ell, k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}\,\textrm{d}\tau
       \end{align*}

      i love it. it's complicated. make it a 'starred subsection' or something.
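
      Relatedly (and echoing the "make it a sum" note just below), the marginalization over a discrete $\pmb\tau$ can be written as a sum over all $K^n$ assignments. A brute-force sketch, reusing the hypothetical `apriori_sbm_log_likelihood` above and assuming $\vec\pi$ has strictly positive entries; this is only feasible for very small $n$ and is purely illustrative.

      ```python
      from itertools import product

      import numpy as np

      def aposteriori_sbm_likelihood(A, pi, B):
          """Marginal likelihood of an a posteriori SBM by brute-force summation.

          pi : (K,) community-assignment probabilities (all > 0).
          Only practical for tiny n, since there are K ** n assignments.
          """
          n, K = A.shape[0], len(pi)
          total = 0.0
          for assignment in product(range(K), repeat=n):
              tau = np.array(assignment)
              n_k = np.bincount(tau, minlength=K)          # nodes per community
              log_p_tau = (n_k * np.log(pi)).sum()         # prod_k pi_k^{n_k}
              log_p_A_given_tau = apriori_sbm_log_likelihood(A, tau, B)
              total += np.exp(log_p_tau + log_p_A_given_tau)
          return total
      ```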

    5. $\int_\tau$

      make it a sum

    6. $b_{\ell k}^{|\mathcal E_{\ell k}|}$

      use m_lk

    7. $n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$

      spell out in words what n_k is

    8. $\pmb\tau$

      can put paragraph back in, but must introduce latent variables

    9. $p$

      didn't render properly

    10. Theory

      motivation

    11. $= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})$

      please write out the left side of this equation

    12. model

      model is not a random process, etc.

    1. Fig. 3.1 The MASE algorithm

      i'd elaborate here showing that we went from graph layouts to adjacency matrices. and explain the colors.

    2. Well, you could embed

      this paragraph should be about considering the many different ways one could do 'joint embedding'

      see figure from cep's paper, probably include a version of it.

      include just averaging graphs and then embedding the average, which is optimal in the absence of across-graph heterogeneity
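
      A sketch of that "average the graphs, then embed the average" baseline, assuming a list of adjacency matrices on the same node set (same ordering) and using graspologic's `AdjacencySpectralEmbed`; the helper name and the choice of two embedding dimensions are assumptions for illustration.

      ```python
      import numpy as np
      from graspologic.embed import AdjacencySpectralEmbed

      def average_then_embed(adjacency_matrices, n_components=2):
          """Element-wise average of several adjacency matrices (same nodes,
          same order), followed by a spectral embedding of the average.

          A reasonable baseline when the networks differ only by noise; by
          construction it ignores across-network heterogeneity.
          """
          A_bar = np.mean(adjacency_matrices, axis=0)
          ase = AdjacencySpectralEmbed(n_components=n_components)
          # (n_nodes, n_components) array for symmetric input
          return ase.fit_transform(A_bar)
      ```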

    3. The goal of MASE is to embed the networks into a single space, with each point in that space representing a single node

      not quite

    4. However, what you’d really like to do is combine them all into a single representation to learn from every network at once.

      not necessarily

  2. Mar 2021
    1. Non-Assortative Case

      move it later

    2. silhouette

      replace with BIC, talk to tingshan
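
      A sketch of the suggested BIC-based selection, using scikit-learn's `GaussianMixture`; `X` is assumed to be the (n_nodes, d) embedding, and the sweep range is arbitrary.

      ```python
      import numpy as np
      from sklearn.mixture import GaussianMixture

      def choose_k_by_bic(X, k_max=10, random_state=0):
          """Fit Gaussian mixtures with 1..k_max components and return the
          number of components that minimizes BIC (lower is better)."""
          bics = []
          for k in range(1, k_max + 1):
              gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
              bics.append(gmm.bic(X))
          return int(np.argmin(bics)) + 1
      ```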

    3. section

      which is a computationally efficient line-search approach

    4. K-means

      lower case k

    5. Lloyd2

      Lloyd made up Lloyd's algorithm for approximately solving k-means

    6. a searching procedure until all the cluster centers are in nice places

      when no points move clusters from one iteration to the next

    7. essentially random places in our data

      k-means++ does not do this, the cluster centers are far from one another
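
      Tying the last few notes together, here is a minimal sketch of Lloyd's algorithm for approximately solving k-means, stopping exactly when no point changes cluster between iterations. The naive initialization below is only a placeholder; k-means++ instead chooses initial centers that tend to be far from one another.

      ```python
      import numpy as np

      def lloyds_kmeans(X, k, seed=0):
          """Approximately solve k-means with Lloyd's algorithm.

          X : (n, d) data matrix; k : number of clusters.
          Stops when no point changes cluster from one iteration to the next.
          """
          rng = np.random.default_rng(seed)
          # naive initialization: k distinct data points chosen at random
          centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
          labels = np.full(len(X), -1)
          while True:
              # assignment step: nearest center for each point
              dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
              new_labels = dists.argmin(axis=1)
              if np.array_equal(new_labels, labels):   # nothing moved: done
                  return centers, labels
              labels = new_labels
              # update step: each center becomes the mean of its points
              for j in range(k):
                  if np.any(labels == j):
                      centers[j] = X[labels == j].mean(axis=0)
      ```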

    8. faster implementation

      remove

    9. 1

      this equation is still wrong, needs a transpose

    10. covariates

      explain why we have ~50 dimensions, especially given that we are only embedding into 2

    11. from statistics

      not a necessary clause.

    12. Stochastic Block Model

      i always want: words --> math --> figure

    13. 4.2359312775571826e-05 and a maximum weight of 40.00562586173658.

      just use 2 sig digs

    14. CASE simply sums these two matrices together, using a weight for $XX^T$ so that they both contribute an equal amount of useful information to the result.

      CASE is a weighted sum
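
      A sketch of the weighted sum this note refers to, assuming a Laplacian-like matrix `L` and a node-covariate matrix `X`; the weight `alpha` and the leading-eigenvalue heuristic for balancing the two terms are assumptions for illustration, not necessarily the book's exact procedure.

      ```python
      import numpy as np

      def case_matrix(L, X, alpha):
          """Covariate-assisted spectral embedding works on a weighted sum of
          the graph term L and the covariate similarity term X @ X.T."""
          return L + alpha * (X @ X.T)

      def leading_eigenvalue_ratio(L, X):
          """One possible weight: the ratio of leading eigenvalues, so that
          neither term dominates the other (both matrices assumed symmetric)."""
          l_top = np.linalg.eigvalsh(L)[-1]
          x_top = np.linalg.eigvalsh(X @ X.T)[-1]
          return l_top / x_top
      ```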

    15. $XX^T_{i, j}$

      is not valid notation.

    16. which we denote here by L for brevity

    17. .

      cite CASC paper here

    18. them

      "then plots the results, color coding each node by its true community"

    19. best

      avoid 'best'

    1. With a single network observed (or really, any number of networks we could collect in the real world) we would never be able to estimate $2^{n^2}$ parameters. The number grows too quickly with $n$ for any realistic choice of $n$ in real-world data. This would lead to a thing called a lack of identifiability with a single network, which means that we would never be able to estimate $2^{n^2}$ parameters from $1$ network.

      unclear what you mean. MLE is [1,0,....,0]

    2. , we would need about 30,000,000 times the total number of storage available in the world to represent the parameters of a single distribution.

      confused
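
      A back-of-the-envelope sketch of why the count is hopeless (the storage comparison above is the book's claim; the digit counts below are just easy-to-check arithmetic):

      ```python
      # The fully general model puts one probability on each of the 2 ** (n ** 2)
      # possible binary adjacency matrices, i.e. (2 ** (n ** 2)) - 1 free
      # parameters, since the probabilities must sum to 1.
      for n in [5, 10, 50]:
          n_params = 2 ** (n ** 2) - 1
          print(n, len(str(n_params)), "decimal digits")
      # Even n = 50 gives a parameter count with more than 750 decimal digits.
      ```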

    3. We use a semi-colon to denote that the parameters $\theta$ are supposed to be fixed quantities for a given $\mathbf A$.

      remove

    4. What is the most natural choice for $\mathcal P(\Theta)$ that makes any sense?

      remove

    5. t is, in general, good for $\mathcal P(\Theta)$ to be fairly rich; that is, when we specify a parametrized statistical model $(\mathcal A_n, \mathcal P(\Theta))$, we want $\mathcal P(\Theta)$ to contain distributions that we think faithfully could represent our network realization $A$

      remove

    6. Note that by construction, we have that $\left|\mathcal P(\Theta)\right| = \left|\Theta\right|$. That is, the two sets have the same number of elements, since each $\theta \in \Theta$ has a particular distribution $\mathbb P_\theta \in \mathcal P(\Theta)$, and vice-versa.

      remove

    7. (Θ)|=|Θ|

      define notation

    8. So, now we know that we have probability distributions on networks, and a set $\mathcal A_n$ which defines all of the adjacency matrices that every probability distribution must assign a probability to. Now, just what is a single network model? The single network model is the tuple $(\mathcal A_n, \mathcal P)$. Above, we learned that $\mathcal A_n$ was the set of all possible adjacency matrices for unweighted networks with $n$ nodes. We will call $\mathcal A_n$ the sample space of $n$-node networks. In general, $\mathcal A_n$ will be the same sample space for all $n$-node network models. This means that for any $n$-node network realization $A$, we can calculate a probability that $A$ is described by any probability distribution on $\mathcal A_n$ found in $\mathcal P$. What is $\mathcal P$? It depends on the model we want to use! In general, $\mathcal P$ has only one rule: it is a nonempty set (it contains at least something), where for every $\mathbb P \in \mathcal P$, $\mathbb P$ is a probability distribution on $\mathcal A_n$. Note that this says only that $\mathcal P$ cannot be empty, but it doesn't say anything about how big or diverse it can be! In general, we will simplify $\mathcal P$ through something called parametrization; that is, we will write $\mathcal P$ as the set:

      don't love it

    9. $\{A : A \in \{0, 1\}^{n \times n}\}$

      prob remove this

    10. When you see the short-hand expression $\mathbb P(A)$, you should typically think back to the most recent random network $\mathbf A$ that has been discussed, and it is typically assumed that $\mathbb P$ refers to the probability distribution of that random network; e.g., $\mathbf A \sim \mathbb P$.

      delete

    11. Is this set the same for any unweighted random network, or is it ever different? It turns out that the answer here is fairly straightforward: for any unweighted random network with $n$ nodes, the set of possible realizations, which we represent with the symbol $\mathcal A_n$, is exactly the same!

      i think this is more confusing than helpful.

    12. possibly

      remove

    13. if $\mathbf A$ is an unweighted random network with $n$ nodes,

      not necessary

    14. topology

      topology

    15. random network $\mathbf A$

      network-valued random variable \mathbf{A}

    16. topology

      remove

    17. $B$

      why is it symmetric?

    18. This is an especially common approach when people deal with networks that are said to be sparse. A sparse network is a network in which the number of edges is much less than the total possible number of edges. This contrasts with a dense network, which is a network in which the number of edges is close to the maximum number of possible edges. In the case of an $ER_n(p)$ network, the network is sparse when $p$ is small (closer to $0$), and dense when $p$ is large (closer to $1$).

      why talk about sparse? if so, let's dedicate a section to it, not have it here.

      sparse can mean lots of different things:

      1. computationally sparse, meaning that storing the graph as an edge list is smaller than as an adjacency matrix.
      2. the expected number of edges scales with n, rather than n^2. Thus, p must be a function of n. This is an asymptotic claim, and therefore does not make sense to apply to any given network.
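
      A small sketch of the first (computational) sense of sparsity: for a sparse graph, a compressed sparse format stores far fewer numbers than a dense adjacency matrix. The network size and edge probability below are illustrative only.

      ```python
      import numpy as np
      from scipy import sparse

      n, p = 2_000, 0.005
      rng = np.random.default_rng(0)
      A = (rng.random((n, n)) < p).astype(np.uint8)
      A = np.triu(A, 1)
      A = A + A.T                       # undirected, loopless

      dense_bytes = A.nbytes            # every one of the n * n entries stored
      A_csr = sparse.csr_matrix(A)      # only the nonzero entries stored
      sparse_bytes = A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes
      print(dense_bytes, sparse_bytes)  # dense is orders of magnitude larger
      ```
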
    19. Probability that an edge exists between a pair of vertices

      iid on edges

    20. 0.3

      use the same notation in the title as in the rest of the paper: we write

      ER_n(p)

      in the paper, so let's also use it in the title.

    21. True

      false

    22. True

      undirected

    23. ps

      i'd use p

    24. same

      i'd write

      "is the same number, n*p

    25. The

      also need to be able to estimate parameters of the model

    26. unlikely

      impossible

    27. the model

      model is a set

    28. framework

      not a framework

    29. .3

      always write 0.3 instead of .3

    30. Structured Independent Edge Model (SIEM)

      i think this should go at the end, ie, right before IER, since it is a different special case?

    31. _{s}

      remove _s i think

    32. $\pmb A \sim ER_n(p)$

      write this out by factorizing, eg, show

      A ~ Bern(P) = \prod_ij Bern(p_ij) = \prod_ij Bern(p)

      and explain why
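
      One way the requested factorization could be written out, sketched in the chapter's notation (the edge count $m = \sum_{j > i} a_{ij}$ is a symbol introduced here for illustration):

      \begin{align*}
      \mathbb P_\theta(\mathbf A = A) &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij}) \;\;\;\;\textrm{edges are mutually independent} \\
      &= \prod_{j > i} p^{a_{ij}}(1 - p)^{1 - a_{ij}} \;\;\;\;\textrm{each $\mathbf a_{ij} \sim Bern(p)$, identically} \\
      &= p^{m}(1 - p)^{\binom{n}{2} - m} \;\;\;\;\textrm{$m$ is the number of observed edges}
      \end{align*}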

    1. Model Selection: The model is appropriate for the data w

      appropriateness

    2. Machine Learning

      is this capitalized?

    3. underlies

      nix for govern, everywhere maybe?

    4. underlying

      remove

    5. Stated another way, even if we believe that the process underlying the network isn’t random at all, we can still extract value by using a statistical model.

      clarify

    6. underlies

      governs

    7. statistics

      ML

    8. statistics

      replace everywhere with network machine learning.

    9. Comparing

      somewhere in here we need to remind people of our notational conventions.

    10. lives

      and also talk about missing people, people with multiple accounts (eg, famous people have personal and pro accounts), etc. also do that above.

    11. quantity

      variable

    12. not

      on a specific social media site

    13. random

      if we are assuming iid, we should say it here.

    14. For instance, we might think that

      In this simple example,

    15. Instead

      remove

    16. this

      remove

    17. discriminative modeling,

      is discriminative always classification? seems not, please clarify

    18. relating to how the network behaves

      about properties of the network

    19. network

      and potential network, node, and edge attributes

    20. statistics

      network ML

    21. $d$-dimensions

      i don't want to corner stats into Euclidean stats. The 'problem' with classical stats with regards to network machine learning is that it doesn't tell us how to leverage the structure of a network efficiently

    22. presentation

      representation

    23. well

      remove

    24. is that we have

      is concerned with

    25. reaization

      realization.

      This sentence is not right, however. We assume that our observed network is merely a realization of a random network.

    26. with which we seek

      we use

    27. Statistical modelling is a technique in which

      In statistical modeling

    28. Perhaps

      what about missing people from the social network? that is a big one

    29. perhaps

      remove

    30. is filled with

      includes much

    31. might

      remove

    32. the question that we

      we may

    33. A common question as a scientist that we might have is how, exactly, we could describe the network in the simplest way possible

      this is just not where to start

    34. Topology

      simple

    35. before

      no time yet please

    36. are

      correspond to

      or

      represent

    37. are

      correspond to

      or

      represent

    38. topology of a network

      simple network

    39. $X$

      A is a realization of a vector/matrix; \mathbf{A} is a vector/matrix-valued RV

      a is a scalar realization; \mathbf{a} is a scalar-valued RV

      a_ij is the realization of an edge

    40. $\pmb a$

      i don't think we can use 'a', A is observed, and \mathbf{A} is random variable

    41. $x_i$

      should be x not x_i since we use X later.

    42. value

      whereas x takes on two possible values: heads or tails

    43. $X$

      but x is not, it is the actual observed realization of a coin flip

    44. generative

      we also use discriminative modeling, eg, signal subgraph

    45. might have a slightly different group of friends depending on when we look at their friend circle

      let's not introduce time varying yet

    46. we assume that the true network is a network for which we could never observe completely, as each time we look at the network, we will see it slightly differently.

      we have measurement error and other sources of uncertainty in our data. that is an empirical fact. let's preserve the word assumption for explicit model assumptions.

    47. The way we characterize the network is called the choice of an underlying statistical model.

      grammar

    48. if we know people within the social network are groups of student

      grammar

    1. What Is A Network?
      • simple network
      • weighted
      • loopless
      • directed
      • attributed

      then we discuss different representations of networks again

      • edge list
      • adjacency matrix
      • various laplacians.

      maybe in its own subsection

  3. Oct 2020
    1. The p-value determines the probability of observing data (or more extreme results), contingent on having assumed the null-hypothesis is true. The formal definition can be expressed as follows: P(X ≥ x | H0) or P(X ≤ x | H0),

      I'm sure you know this, but this sentence, as written, I do not think is quite right. Assume that X is a random variable, and x is a realization of that random variable, and we sample n times identically and independently from some true but unknown distribution P. Then, choose a test statistic, T, which maps from the data (X1, ..., Xn) to a scalar t. Now, we can define the p-value as the probability of observing data with a test statistic as extreme as or more extreme than the observed one, contingent on having assumed the null-hypothesis is true.

      The fact that there is a test statistic in there, I think, is incredibly important, because obviously (to you), if one chooses a different test statistic, one can obtain a different p-value.
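
      To make the dependence on the test statistic concrete, a small hypothetical sketch (not from the text under review): the same two samples give a large p-value under a mean-difference statistic and a tiny one under a variance-ratio statistic, with both p-values computed from a permutation null.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.normal(loc=0.0, scale=1.0, size=100)   # equal means,
      y = rng.normal(loc=0.0, scale=3.0, size=100)   # different variances

      def permutation_p_value(x, y, stat, n_perm=5000):
          """Estimate P(T >= t_observed | H0) by permuting group labels."""
          observed = stat(x, y)
          pooled = np.concatenate([x, y])
          count = 0
          for _ in range(n_perm):
              perm = rng.permutation(pooled)
              if stat(perm[:len(x)], perm[len(x):]) >= observed:
                  count += 1
          return (count + 1) / (n_perm + 1)

      mean_diff = lambda a, b: abs(a.mean() - b.mean())
      var_ratio = lambda a, b: max(a.var(), b.var()) / min(a.var(), b.var())

      print(permutation_p_value(x, y, mean_diff))   # typically large: means match
      print(permutation_p_value(x, y, var_ratio))   # typically tiny: variances differ
      ```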

      This is also important for your PV8, where "result" is ambiguous. The result here implicitly refers to the test statistic, which was not previously mentioned. It is easy to mistakenly believe that the 'result' somehow magically implies something about 'the data'. For example, anecdotally, I often find that people think a big p-value on a t-test implies no effect, whereas had they used a robust test, or tested for a change in variance rather than the mean, the effect is clear.