519 Matching Annotations
  1. Apr 2021
    1. re $\mathcal L_\theta(A) \propto \mathbb P_\theta(A)$;

      do everything in terms of probabilities

    2. The reason we write $\propto$ instead of $=$ is that, in the case of probability distributions, it is often rather tedious to take care of all constants when simplifying expressions, and these constants can detract from much of the intuition of what is going on. If we instead focus on the likelihood, we only need to worry about the parts of the probability statement that deal directly with the parameters $\theta$ and the realization itself $A$.

      tedious. remove until we decide we need it

    3. What does the likelihood for the a priori SBM look like? Fortunately, since $\vec \tau$ is a parameter of the a priori SBM, the likelihood is a bit simpler than for the a posteriori SBM. This is because the a posteriori SBM requires a marginalization over potential realizations of $\vec{\pmb \tau}$, whereas the a priori SBM does not. The likelihood is as follows, omitting detailed explanations of steps that are described above:

       $$\begin{align*} \mathcal L_\theta(A) &\propto \mathbb P_{\theta}(\mathbf A = A) \\ &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})\;\;\;\;\textrm{Independence Assumption} \\ &= \prod_{j > i} b_{\ell k}^{a_{ij}}(1 - b_{\ell k})^{1 - a_{ij}}\;\;\;\;\textrm{p.m.f. of Bernoulli distribution} \\ &= \prod_{k, \ell}b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|} \end{align*}$$

       Like the ER model, there are again equivalence classes of the sample space $\mathcal A_n$ in terms of their likelihood. Let $|\mathcal E_{\ell k}(A)|$ denote the number of edges in the $(\ell, k)$ block of adjacency matrix $A$. For a two-community setting, with $\vec \tau$ and $B$ given, the equivalence classes are the sets:

       $$\begin{align*} E_{a,b,c}(\vec \tau, B) &= \left\{A \in \mathcal A_n : |\mathcal E_{11}(A)| = a, |\mathcal E_{21}(A)| = |\mathcal E_{12}(A)| = b, |\mathcal E_{22}(A)| = c\right\} \end{align*}$$

       The number of possible equivalence classes scales with the number of communities and the manner in which vertices are assigned to communities (particularly, the number of nodes in each community). As before, we have the following. For any $\vec \tau$ and $B$: if $A, A' \in E_{a,b,c}(\vec \tau, B)$ (that is, $A$ and $A'$ are in the same equivalence class), then $\mathcal L_\theta(A) = \mathcal L_\theta(A')$; and if $A \in E_{a,b,c}(\vec \tau, B)$ but $A' \in E_{a',b',c'}(\vec \tau, B)$ where $a \neq a'$, $b \neq b'$, or $c \neq c'$, then $\mathcal L_\theta(A) \neq \mathcal L_\theta(A')$.

      goes in starred section
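
      A minimal Python sketch of how the block-wise form of this likelihood could be evaluated, assuming `A` is a symmetric, hollow, binary adjacency matrix, `tau` a length-$n$ integer label vector with values in $\{0, \dots, K-1\}$, and `B` a symmetric $K \times K$ block matrix with entries strictly between 0 and 1; the function name is illustrative, not taken from the book's code.

      ```python
      import numpy as np

      def apriori_sbm_log_likelihood(A, tau, B):
          """Log-likelihood (up to a constant) of A under an a priori SBM
          with known labels tau and symmetric block matrix B."""
          K = B.shape[0]
          log_lik = 0.0
          for ell in range(K):
              for k in range(ell, K):                    # each unordered block pair once
                  in_ell, in_k = (tau == ell), (tau == k)
                  if ell == k:
                      n_lk = in_ell.sum() * (in_ell.sum() - 1) // 2   # possible within-block edges
                      E_lk = A[np.ix_(in_ell, in_ell)].sum() / 2      # A symmetric: each edge counted twice
                  else:
                      n_lk = in_ell.sum() * in_k.sum()                # possible between-block edges
                      E_lk = A[np.ix_(in_ell, in_k)].sum()
                  log_lik += E_lk * np.log(B[ell, k]) + (n_lk - E_lk) * np.log(1 - B[ell, k])
          return log_lik
      ```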

    4. What does the likelihood for the a posteriori SBM look like? In this case, $\theta = (\vec \pi, B)$ are the parameters for the model, so the likelihood for a realization $A$ of $\mathbf A$ is:

       $$\begin{align*} \mathcal L_\theta(A) &\propto \mathbb P_\theta(\mathbf A = A) \end{align*}$$

       Next, we use the fact that the probability that $\mathbf A = A$ is, in fact, the marginalization (over realizations of $\pmb \tau$) of the joint $(\mathbf A, \pmb \tau)$. In the line after that, we use Bayes' Theorem to separate the joint probability into a conditional probability and a marginal probability:

       $$\begin{align} &= \int_\tau \mathbb P_\theta(\mathbf A = A, \pmb \tau = \tau)\,\textrm{d}\tau \nonumber\\ &= \int_\tau \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb \tau = \tau)\,\mathbb P_\theta(\pmb \tau = \tau)\,\textrm{d}\tau \tag{2.1} \end{align}$$

       Let's think about each of these probabilities separately. Remember that for $\pmb \tau$, each entry $\pmb \tau_i$ is sampled independently and identically from $Categorical(\vec \pi)$. The probability mass for a $Categorical(\vec \pi)$-valued random variable is $\mathbb P(\pmb \tau_i = \tau_i; \vec \pi) = \pi_{\tau_i}$. Finally, note that if we take the product of $n$ $\pi_{\tau_i}$ terms, many of these values will end up being the same. Consider, for instance, the vector $\tau = [1,2,1,2,1]$. We end up with three terms of $\pi_1$ and two terms of $\pi_2$, and it does not matter in which order we multiply them. Rather, all we need to keep track of are the counts of each $\pi_k$ term. Written another way, we can use the indicator that $\tau_i = k$, given by $\mathbb 1_{\tau_i = k}$, and a running counter over all of the community probability assignments $\pi_k$ to make this expression a little more sensible. We will use the symbol $n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$, the number of nodes assigned to community $k$, to denote this value:

       $$\begin{align*} \mathbb P_\theta(\pmb \tau = \tau) &= \prod_{i = 1}^n \mathbb P_\theta(\pmb \tau_i = \tau_i)\;\;\;\;\textrm{Independence Assumption} \\ &= \prod_{i = 1}^n \pi_{\tau_i} \;\;\;\;\textrm{p.m.f. of a Categorical R.V.}\\ &= \prod_{k = 1}^K \pi_{k}^{n_k} \end{align*}$$

       Next, let's think about the conditional probability term, $\mathbb P_\theta(\mathbf A = A \,\big|\, \pmb \tau = \tau)$. Remember that the entries are all independent conditional on $\pmb \tau$ taking the value $\tau$. This means that we can separate the probability of the entire event $\mathbf A = A$ into the product of the edge-wise probabilities. Further, remember that conditional on $\pmb \tau_i = \ell$ and $\pmb \tau_j = k$, $\mathbf a_{ij}$ is $Bern(b_{\ell k})$. The distribution of $\mathbf a_{ij}$ does not depend on any of the other entries of $\pmb \tau$. Remembering that the probability mass function of a Bernoulli R.V. is given by $\mathbb P(\mathbf a_{ij} = a_{ij}; p) = p^{a_{ij}}(1 - p)^{1 - a_{ij}}$, this gives:

       $$\begin{align*} \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb \tau = \tau) &= \prod_{j > i}\mathbb P_\theta(\mathbf a_{ij} = a_{ij} \,\big|\, \pmb \tau = \tau)\;\;\;\;\textrm{Independence Assumption} \\ &= \prod_{j > i}\mathbb P_\theta(\mathbf a_{ij} = a_{ij} \,\big|\, \pmb \tau_i = \ell, \pmb \tau_j = k) \;\;\;\;\textrm{$\mathbf a_{ij}$ depends only on $\tau_i$ and $\tau_j$}\\ &= \prod_{j > i} b_{\ell k}^{a_{ij}} (1 - b_{\ell k})^{1 - a_{ij}} \end{align*}$$

       Again, we can simplify this expression a bit. Recall the indicator function above. Let $|\mathcal E_{\ell k}| = \sum_{j > i}\mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j = k}a_{ij}$, and let $n_{\ell k} = \sum_{j > i}\mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j = k}$. Note that $|\mathcal E_{\ell k}|$ is the number of edges between nodes in community $\ell$ and community $k$, and $n_{\ell k}$ is the number of possible edges between nodes in community $\ell$ and community $k$. This expression can be simplified to:

       $$\begin{align*} \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb \tau = \tau) &= \prod_{\ell,k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|} \end{align*}$$

       Combining these into the integrand from Equation (2.1) gives:

       $$\begin{align*} \mathcal L_\theta(A) &\propto \int_\tau \mathbb P_\theta(\mathbf A = A \,\big|\, \pmb \tau = \tau)\,\mathbb P_\theta(\pmb \tau = \tau)\,\textrm{d}\tau \\ &= \int_\tau \prod_{k = 1}^K \pi_k^{n_k} \cdot \prod_{\ell, k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}\,\textrm{d}\tau \end{align*}$$

      i love it. it's complicated. make it a 'starred subsection' or something.
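
      Since $\pmb \tau$ is discrete, the "integral" over $\tau$ is a sum over all $K^n$ label vectors, which is only tractable for toy examples; still, a brute-force sketch can make the marginalization concrete. It assumes `A` is a symmetric, hollow, binary adjacency matrix, `pi` a length-$K$ probability vector, and `B` a $K \times K$ block matrix; the function name is illustrative.

      ```python
      import itertools
      import numpy as np

      def aposteriori_sbm_likelihood(A, pi, B):
          """Likelihood (up to a constant) of A under an a posteriori SBM,
          summing over all K^n label vectors. Exponential in n: toy use only."""
          n = A.shape[0]
          K = len(pi)
          iu = np.triu_indices(n, k=1)                 # upper-triangular node pairs (i < j)
          total = 0.0
          for tau in itertools.product(range(K), repeat=n):
              tau = np.asarray(tau)
              # P(tau) = prod_k pi_k^{n_k}, with n_k the number of nodes in community k
              n_k = np.bincount(tau, minlength=K)
              p_tau = np.prod(np.asarray(pi) ** n_k)
              # P(A | tau) = prod_{i<j} b_{tau_i tau_j}^{a_ij} (1 - b_{tau_i tau_j})^{1 - a_ij}
              b = B[tau[iu[0]], tau[iu[1]]]
              a = A[iu]
              p_A_given_tau = np.prod(b ** a * (1 - b) ** (1 - a))
              total += p_tau * p_A_given_tau
          return total
      ```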

    5. $n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$

      spell out in words what is n_k
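
      One way to spell it out: $n_k$ is simply the number of nodes whose community label equals $k$. A small check in Python, using the example label vector from the passage:

      ```python
      import numpy as np

      tau = np.array([1, 2, 1, 2, 1])    # three nodes in community 1, two in community 2
      n_k = np.bincount(tau)[1:]         # array([3, 2]) -> n_1 = 3, n_2 = 2
      ```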

    6. $p$

      didn't render properly

    7. $= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})$

      please write out the left side of this equation

    1. Fig. 3.1 The MASE algorithm

      i'd elaborate here showing that we went from graph layouts to adjacency matrices. and explain the colors.

    2. Well, you could embed

      this paragraph should be about considering the many different ways one could do 'joint embedding'

      see figure from cep's paper, probably include a version of it.

      include just averaging graphs and then embedding the average, which is optimal in the absence of across-graph heterogeneity
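
      A minimal sketch of that "average the graphs, then embed the average" baseline, assuming `graphs` is a list of same-sized symmetric adjacency matrices and using a plain truncated SVD as the spectral embedding; names are illustrative, not the book's code.

      ```python
      import numpy as np

      def embed_mean_graph(graphs, d):
          """Average a list of adjacency matrices, then spectrally embed the mean.
          A reasonable baseline when the graphs share the same latent structure."""
          Abar = np.mean(graphs, axis=0)          # entrywise mean adjacency matrix
          U, S, Vt = np.linalg.svd(Abar)          # SVD of the mean matrix
          return U[:, :d] * np.sqrt(S[:d])        # top-d adjacency spectral embedding
      ```

      With across-graph heterogeneity, this collapses the differences between graphs, which is the limitation the other joint-embedding approaches try to address.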

    3. However, what you’d really like to do is combine them all into a single representation to learn from every network at once.

      not necessarily

  2. Mar 2021
    1. CASE simply sums these two matrices together, using a weight for $XX^T$ so that they both contribute an equal amount of useful information to the result.

      CASE is a weighted sum
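
      A rough sketch of that weighted sum, assuming `L` is an $n \times n$ symmetric (possibly regularized) Laplacian, `X` an $n \times p$ covariate matrix, and `alpha` the weight on $XX^T$; the embedding step is a plain top-$d$ eigendecomposition, and all names here are illustrative rather than CASE's exact formulation.

      ```python
      import numpy as np

      def case_embedding(L, X, alpha, d):
          """Covariate-assisted embedding sketch: weighted sum of network structure
          (L) and covariate similarity (X X^T), then a top-d eigendecomposition."""
          C = L + alpha * (X @ X.T)                 # weighted combination of the two matrices
          evals, evecs = np.linalg.eigh(C)          # eigendecomposition (C is symmetric)
          top = np.argsort(evals)[::-1][:d]         # indices of the d largest eigenvalues
          return evecs[:, top] * np.sqrt(np.abs(evals[top]))
      ```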

    1. With a single network observed (or really, any number of networks we could collect in the real world) we would never be able to estimate $2^{n^2}$ parameters. The number grows too quickly with $n$ for any realistic choice of $n$ in real-world data. This would lead to a thing called a lack of identifiability with a single network, which means that we would never be able to estimate $2^{n^2}$ parameters from $1$ network.

      unclear what you mean. MLE is [1,0,....,0]

    2. , we would need about 30,000,000 times the total amount of storage available in the world to represent the parameters of a single distribution.

      confused
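
      To make the scale in these two passages concrete: treating the fully general model as needing one probability per possible simple network on $n$ nodes, the count explodes even for modest $n$. The exact exponent ($n^2$ versus $n(n-1)/2$) depends on whether loops and edge direction are allowed, so the figure below is only illustrative.

      ```python
      import math

      n = 50
      n_pairs = n * (n - 1) // 2                  # potential edges in a simple n-node network
      log10_models = n_pairs * math.log10(2)      # log10 of the 2^(n_pairs) possible networks
      print(f"a fully general model on {n} nodes needs roughly 10^{log10_models:.0f} "
            "probabilities, one per possible network")
      ```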

    3. We use a semi-colon to denote that the parameters $\theta$ are supposed to be fixed quantities for a given $\mathbf A$.

      remove

    4. t is, in general, good for $\mathcal P(\Theta)$ to be fairly rich; that is, when we specify a parametrized statistical model $(\mathcal A_n, \mathcal P(\Theta))$, we want $\mathcal P(\Theta)$ to contain distributions that we think could faithfully represent our network realization $A$

      remove

    5. Note that by construction, we have that $\left|\mathcal P(\Theta)\right| = \left|\Theta\right|$. That is, the two sets have the same number of elements, since each $\theta \in \Theta$ has a particular distribution $\mathbb P_\theta \in \mathcal P(\Theta)$, and vice-versa.

      remove

    6. So, now we know that we have probability distributions on networks, and a set $\mathcal A_n$ which defines all of the adjacency matrices that every probability distribution must assign a probability to. Now, just what is a single network model? The single network model is the tuple $(\mathcal A_n, \mathcal P)$. Above, we learned that $\mathcal A_n$ was the set of all possible adjacency matrices for unweighted networks with $n$ nodes. We will call $\mathcal A_n$ the sample space of $n$-node networks. In general, $\mathcal A_n$ will be the same sample space for all $n$-node network models. This means that for any $n$-node network realization $A$, we can calculate a probability that $A$ is described by any probability distribution on $\mathcal A_n$ found in $\mathcal P$. What is $\mathcal P$? It depends on the model we want to use! In general, $\mathcal P$ has only one rule: it is a nonempty set (it contains at least something), where every $\mathbb P \in \mathcal P$ is a probability distribution on $\mathcal A_n$. Note that this says only that $\mathcal P$ cannot be empty, but it doesn't say anything about how big or diverse it can be! In general, we will simplify $\mathcal P$ through something called parametrization; that is, we will write $\mathcal P$ as the set:

      don't love it

    7. When you see the short-hand expression $\mathbb P(A)$, you should typically think back to the most recent random network $\mathbf A$ that has been discussed, and it is typically assumed that $\mathbb P$ refers to the probability distribution of that random network; e.g., $\mathbf A \sim \mathbb P$.

      delete

    8. Is this set the same for any unweighted random network, or is it ever different? It turns out that the answer here is fairly straightforward: for any unweighted random network with $n$ nodes, the set of possible realizations, which we represent with the symbol $\mathcal A_n$, is exactly the same!

      i think this is more confusing than helpful.

    9. This is an especially common approach when people deal with networks that are said to be sparse. A sparse network is a network in which the number of edges is much less than the total possible number of edges. This contrasts with a dense network, which is a network in which the number of edges is close to the maximum number of possible edges. In the case of an $ER_n(p)$ network, the network is sparse when $p$ is small (closer to $0$), and dense when $p$ is large (closer to $1$).

      why talk about sparse? if so, let's dedicate a section to it, not have it here.

      sparse can mean lots of different things:

      1. computationally sparse, meaning that storing the graph as an edge list is smaller than as an adjacency matrix.
      2. asymptotically sparse, meaning the expected number of edges scales with n rather than n^2, so p must be a function of n (shrinking as n grows). This is an asymptotic claim, and therefore does not make sense to apply to any single observed network.
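
      A quick illustration of the first (computational) sense of sparsity, comparing edge-list storage against a dense adjacency matrix for one sparse $ER_n(p)$ sample; the sampling and byte counts below are rough and use plain numpy.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, p = 2000, 0.001                             # sparse ER_n(p): roughly p * n(n-1)/2 edges expected
      iu = np.triu_indices(n, k=1)                   # all potential edges (i < j)
      keep = rng.random(iu[0].shape[0]) < p          # sample each potential edge independently
      edge_list = np.column_stack((iu[0][keep], iu[1][keep]))

      dense_bytes = n * n                            # adjacency matrix at one byte per entry
      edge_list_bytes = edge_list.nbytes
      print(f"{len(edge_list)} edges: adjacency ~{dense_bytes / 1e6:.1f} MB, "
            f"edge list ~{edge_list_bytes / 1e6:.3f} MB")
      ```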
    10. Structured Independent Edge Model (SIEM)

      i think this should go at the end, ie, right before IER, since it is a different special case?

    1. Stated another way, even if we believe that the process underlying the network isn’t random at all, we can still extract value by using a statistical model.

      clarify

    2. lives

      and also talk about missing people, people with multiple accounts (eg, famous people have personal and pro accounts), etc. also do that above.

    3. $d$-dimensions

      i don't want to corner stats into Euclidean stats. The 'problem' with classical stats with regards to network machine learning is that it doesn't tell us how to leverage the structure of a network efficiently

    4. reaization

      realization.

      This sentence is not right, however. We assume that our observed network is merely a realization of a random network.

    5. A common question we might have as scientists is how, exactly, we could describe the network in the simplest way possible

      this is just not where to start

    6. $X$

      A is a realization of a vector/matrix; \mathbf{A} is a vector/matrix-valued RV

      a is a scalar realization; \mathbf{a} is a scalar-valued RV

      a_ij is the realization of an edge

    7. might have a slightly different group of friends depending on when we look at their friend circle

      let's not introduce time varying yet

    8. we assume that the true network is a network which we could never observe completely, as each time we look at the network, we will see it slightly differently.

      we have measurement error and other sources of uncertainty in our data. that is an empirical fact. let's preserve the word assumption for explicit model assumptions.

    1. What Is A Network?
      • simple network
      • weighted
      • loopless
      • directed
      • attributed

      then we discuss different representations of networks again

      • edge list
      • adjacency matrix
      • various laplacians.

      maybe in its own subsection
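
      A small sketch of the representations listed above for one toy undirected network (edge list, adjacency matrix, and two of the usual Laplacians), using plain numpy rather than any particular package.

      ```python
      import numpy as np

      # toy undirected network on 4 nodes, given as an edge list
      edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]

      # adjacency matrix representation
      n = 4
      A = np.zeros((n, n))
      for i, j in edge_list:
          A[i, j] = A[j, i] = 1

      # combinatorial Laplacian L = D - A, and the normalized Laplacian D^{-1/2} L D^{-1/2}
      degrees = A.sum(axis=1)
      D = np.diag(degrees)
      L = D - A
      D_inv_sqrt = np.diag(1 / np.sqrt(degrees))
      L_norm = D_inv_sqrt @ L @ D_inv_sqrt
      ```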

  3. Oct 2020
    1. The p-value determines the probability of observing data (or more extreme results), contingent on having assumed the null hypothesis is true. The formal definition can be expressed as follows: $P(X \geq x \mid H_0)$ or $P(X \leq x \mid H_0)$,

      I'm sure you know this, but this sentence, as written, I do not think is quite right. Assume that X is a random variable, and x is a realization of that random variable, and we sample n times identically and independently from some true but unknown distribution P. Then, choose a test statistic, T, which maps from the data (X1, ..., Xn) to a scalar t. Now, we can define the p-value as the probability of observing data with a test statistic as extreme as, or more extreme than, the observed one, contingent on having assumed the null hypothesis is true.

      The fact that there is a test statistic in there, I think, is incredibly important, because obviously (to you), if one chooses a different test statistic, one can obtain a different p-value.

      This is also important for your PV8, where "result" is ambiguous. The result here implicitly refers to the test statistic, which was not previously mentioned. It is easy to mistakenly believe that the 'result' somehow magically implies something about 'the data'. For example, anecdotally, I often find that people think a big p-value on a t-test implies no effect, whereas had they used a robust test, or tested for a change in variance rather than the mean, the effect is clear.
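
      The test-statistic point can be made concrete with a small simulation: two samples with the same mean but different spread, where a two-sample t-test (a mean-based statistic) reports a large p-value while Levene's test (a spread-based statistic) reports a tiny one. The data below are simulated purely for illustration.

      ```python
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      x = rng.normal(loc=0, scale=1, size=200)    # group 1: mean 0, sd 1
      y = rng.normal(loc=0, scale=3, size=200)    # group 2: same mean, much larger sd

      t_stat, p_mean = stats.ttest_ind(x, y)      # tests a difference in means
      w_stat, p_var = stats.levene(x, y)          # tests a difference in spread
      print(f"t-test p-value: {p_mean:.3f}; Levene p-value: {p_var:.3g}")
      ```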