- Apr 2021
-
docs.neurodata.io
-
re $\mathcal L_\theta(A) \propto \mathbb P_\theta(A)$;
do everything in terms of probabilities
-
The reason we write $\propto$ instead of $=$ is that, in the case of probability distributions, it is often rather tedious to take care of all constants when simplifying expressions, and these constants can detract from much of the intuition of what is going on. If we instead focus on the likelihood, we only need to worry about the parts of the probability statement that deal directly with the parameters $\theta$ and the realization itself $A$.
tedious. remove until we decide we need it
-
What does the likelihood for the a priori SBM look like? Fortunately, since $\vec \tau$ is a parameter of the a priori SBM, the likelihood is a bit simpler than for the a posteriori SBM. This is because the a posteriori SBM requires a marginalization over potential realizations of $\vec{\pmb \tau}$, whereas the a priori SBM does not. The likelihood is as follows, omitting detailed explanations of steps that are described above:
\begin{align*}
\mathcal L_\theta(A) &\propto \mathbb P_{\theta}(\mathbf A = A) \\
&= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})\;\;\;\;\textrm{Independence Assumption} \\
&= \prod_{j > i} b_{\ell k}^{a_{ij}}(1 - b_{\ell k})^{1 - a_{ij}}\;\;\;\;\textrm{p.m.f. of Bernoulli distribution} \\
&= \prod_{k, \ell}b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}
\end{align*}
Like the ER model, there are again equivalence classes of the sample space $\mathcal A_n$ in terms of their likelihood. Let $|\mathcal E_{\ell k}(A)|$ denote the number of edges in the $(\ell, k)$ block of adjacency matrix $A$. For a two-community setting, with $\vec \tau$ and $B$ given, the equivalence classes are the sets:
\begin{align*}
E_{a,b,c}(\vec \tau, B) &= \left\{A \in \mathcal A_n : \mathcal E_{11}(A) = a, \mathcal E_{21}(A) = \mathcal E_{12}(A) = b, \mathcal E_{22}(A) = c\right\}
\end{align*}
The number of equivalence classes possible scales with the number of communities, and the manner in which vertices are assigned to communities (particularly, the number of nodes in each community). As before, we have the following. For any $\vec \tau$ and $B$: if $A, A' \in E_{a,b,c}(\vec \tau, B)$ (that is, $A$ and $A'$ are in the same equivalence class), then $\mathcal L_\theta(A) = \mathcal L_\theta(A')$, and if $A \in E_{a, b, c}(\vec \tau, B)$ but $A' \in E_{a', b', c'}(\vec \tau, B)$ where either $a \neq a'$, $b \neq b'$, or $c \neq c'$, then $\mathcal L_\theta(A) \neq \mathcal L_\theta(A')$.
goes in starred section
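If the starred section keeps this derivation, a minimal numpy sketch of evaluating the block-wise form could accompany it (hypothetical helper, not code from the book; assumes an undirected, loopless network, integer labels `tau` in `0..K-1`, and `0 < B[l, k] < 1`):

```python
import numpy as np

def apriori_sbm_log_likelihood(A, tau, B):
    """Log-likelihood (up to the proportionality constant) of adjacency matrix A
    under an a priori SBM with known labels tau and block matrix B."""
    A, tau, B = np.asarray(A), np.asarray(tau), np.asarray(B)
    upper = np.triu(np.ones(A.shape, dtype=bool), k=1)   # each pair (i, j), i < j, counted once
    log_lik = 0.0
    for l in range(B.shape[0]):
        for k in range(B.shape[1]):
            block = np.outer(tau == l, tau == k) & upper
            m_lk = A[block].sum()          # |E_{lk}|: edges in the (l, k) block
            n_lk = block.sum()             # n_{lk}: possible edges in the (l, k) block
            log_lik += m_lk * np.log(B[l, k]) + (n_lk - m_lk) * np.log(1 - B[l, k])
    return log_lik
```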
-
What does the likelihood for the a posteriori SBM look like? In this case, $\theta = (\vec \pi, B)$ are the parameters for the model, so the likelihood for a realization $A$ of $\mathbf A$ is:
\begin{align*}
\mathcal L_\theta(A) &\propto \mathbb P_\theta(\mathbf A = A)
\end{align*}
Next, we use the fact that the probability that $\mathbf A = A$ is, in fact, the marginalization (over realizations of $\pmb \tau$) of the joint $(\mathbf A, \pmb \tau)$. In the line after that, we use Bayes' Theorem to separate the joint probability into a conditional probability and a marginal probability:
\begin{align}
&= \int_\tau \mathbb P_\theta(\mathbf A = A, \pmb \tau = \tau)\textrm{d}\tau \nonumber\\
&= \int_\tau \mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau)\mathbb P_\theta(\pmb \tau = \tau)\textrm{d}\tau \label{eqn:apost_sbm_eq1}
\end{align}
Let's think about each of these probabilities separately. Remember that for $\pmb \tau$, each entry $\pmb \tau_i$ is sampled independently and identically from $Categorical(\vec \pi)$. The probability mass for a $Categorical(\vec \pi)$-valued random variable is $\mathbb P(\pmb \tau_i = \tau_i; \vec \pi) = \pi_{\tau_i}$. Finally, note that if we are taking the products of $n$ $\pi_{\tau_i}$ terms, many of these values will end up being the same. Consider, for instance, the vector $\tau = [1,2,1,2,1]$. We end up with three terms of $\pi_1$ and two terms of $\pi_2$, and it does not matter which order we multiply them in. Rather, all we need to keep track of are the counts of each $\pi_k$ term. Written another way, we can use the indicator that $\tau_i = k$, given by $\mathbb 1_{\tau_i = k}$, and a running counter over all of the community probability assignments $\pi_k$ to make this expression a little more sensible. We will use the symbol $n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$ to denote this value:
\begin{align*}
\mathbb P_\theta(\pmb \tau = \tau) &= \prod_{i = 1}^n \mathbb P_\theta(\pmb \tau_i = \tau_i)\;\;\;\;\textrm{Independence Assumption} \\
&= \prod_{i = 1}^n \pi_{\tau_i} \;\;\;\;\textrm{p.m.f. of a Categorical R.V.}\\
&= \prod_{k = 1}^K \pi_{k}^{n_k}
\end{align*}
Next, let's think about the conditional probability term, $\mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau)$. Remember that the entries are all independent conditional on $\pmb \tau$ taking the value $\tau$. This means that we can separate the probability of the entire $\mathbf A = A$ into the product of the probabilities edge-wise. Further, remember that conditional on $\pmb \tau_i = \ell$ and $\pmb \tau_j = k$, $\mathbf a_{ij}$ is $Bern(b_{\ell k})$. The distribution of $\mathbf a_{ij}$ does not depend on any of the other entries of $\pmb \tau$. Remembering that the probability mass function of a Bernoulli R.V. is given by $\mathbb P(\mathbf a_{ij}=a_{ij}; p) = p^{a_{ij}}(1 - p)^{1 - a_{ij}}$, this gives:
\begin{align*}
\mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau) &= \prod_{j > i}\mathbb P_\theta(\mathbf a_{ij} = a_{ij} \big | \pmb \tau = \tau)\;\;\;\;\textrm{Independence Assumption} \\
&= \prod_{j > i}\mathbb P_\theta(\mathbf a_{ij} = a_{ij} \big | \pmb \tau_i = \ell, \pmb \tau_j = k) \;\;\;\;\textrm{$\mathbf a_{ij}$ depends only on $\tau_i$ and $\tau_j$}\\
&= \prod_{j > i} b_{\ell k}^{a_{ij}} (1 - b_{\ell k})^{1 - a_{ij}}
\end{align*}
Again, we can simplify this expression a bit. Recall the indicator function above. Let $|\mathcal E_{\ell k}| = \sum_{j > i}\mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j=k}a_{ij}$, and let $n_{\ell k}= \sum_{j>i}\mathbb 1_{\tau_i = \ell}\mathbb 1_{\tau_j = k}$. Note that $|\mathcal E_{\ell k}|$ is the number of edges between nodes in community $\ell$ and community $k$, and $n_{\ell k}$ is the number of possible edges between nodes in community $\ell$ and community $k$. This expression can be simplified to:
\begin{align*}
\mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau) &= \prod_{\ell,k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}
\end{align*}
Combining these into the integrand from Equation (\ref{eqn:apost_sbm_eq1}) gives:
\begin{align*}
\mathcal L_\theta(A) &\propto \int_\tau \mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau)\mathbb P_\theta(\pmb \tau = \tau)\textrm{d}\tau \\
&= \int_\tau \prod_{k = 1}^K \pi_k^{n_k} \cdot \prod_{\ell, k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}\textrm{d}\tau
\end{align*}
i love it. it's complicated. make it a 'starred subsection' or something.
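If the starred subsection keeps the marginalization, a brute-force numerical check might also help readers: a sketch that sums over every assignment vector for a tiny network (hypothetical helper, not the book's code; only feasible for very small n, since there are K^n terms):

```python
import numpy as np
from itertools import product

def aposteriori_sbm_likelihood(A, pi, B):
    """Likelihood (up to a constant) of A under an a posteriori SBM,
    computed by summing over all K^n community-assignment vectors."""
    A, pi, B = np.asarray(A), np.asarray(pi), np.asarray(B)
    n, K = A.shape[0], len(pi)
    total = 0.0
    for tau in product(range(K), repeat=n):
        tau = np.asarray(tau)
        term = np.prod(pi[tau])                     # P(tau) = prod_i pi_{tau_i}
        for i in range(n):
            for j in range(i + 1, n):               # undirected, loopless: j > i
                b = B[tau[i], tau[j]]
                term *= b ** A[i, j] * (1 - b) ** (1 - A[i, j])
        total += term
    return total
```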
-
$\int_\tau$
make it a sum
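A possible sum form (a sketch, assuming the marginalization runs over the finite set of assignment vectors $\tau \in \{1, \dots, K\}^n$ rather than an integral):
\begin{align*}
\mathcal L_\theta(A) &\propto \sum_{\tau \in \{1, \dots, K\}^n} \mathbb P_\theta(\mathbf A = A \big | \pmb \tau = \tau)\,\mathbb P_\theta(\pmb \tau = \tau) \\
&= \sum_{\tau \in \{1, \dots, K\}^n} \prod_{k = 1}^K \pi_k^{n_k} \prod_{\ell, k} b_{\ell k}^{|\mathcal E_{\ell k}|}(1 - b_{\ell k})^{n_{\ell k} - |\mathcal E_{\ell k}|}
\end{align*}
where $n_k$, $|\mathcal E_{\ell k}|$, and $n_{\ell k}$ are all computed from the particular $\tau$ in each summand.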
-
$|\mathcal E_{\ell k}|$
use m_lk
-
$n_k = \sum_{i = 1}^n \mathbb 1_{\tau_i = k}$
spell out in words what n_k is
-
$\pmb \tau$
can put paragraph back in, but must introduce latent variables
-
$\mathcal A_n$
didn't render properly
-
Theory
motivation
-
$= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})$
please write out the left side of this equation
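e.g., something along the lines of (the left-hand side is implied by the preceding line of the derivation):
\begin{align*}
\mathbb P_\theta(\mathbf A = A) &= \prod_{j > i} \mathbb P_\theta(\mathbf a_{ij} = a_{ij})
\end{align*}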
-
model
model is not a random process, etc.
-
-
docs.neurodata.io
-
Fig. 3.1 The MASE algorithm
i'd elaborate here showing that we went from graph layouts to adjacency matrices. and explain the colors.
-
Well, you could embed
this paragraph should be about considering the many different ways one could do 'joint embedding'
see figure from cep's paper, probably include a version of it.
include just averaging graphs and then embedding the average, which is optimal in the absence of across-graph heterogeneity
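For the "average the graphs, then embed the average" baseline, a minimal numpy sketch (toy data and hypothetical variable names; graspologic's spectral embedding could be substituted for the eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d = 50, 4, 2
# toy collection: M symmetric adjacency matrices on the same n nodes
graphs = [np.triu(rng.binomial(1, 0.2, (n, n)), 1) for _ in range(M)]
graphs = [G + G.T for G in graphs]

Abar = np.mean(graphs, axis=0)                 # entry-wise average network

# embed the average: top-d spectral embedding
evals, evecs = np.linalg.eigh(Abar)
top = np.argsort(np.abs(evals))[::-1][:d]      # keep the d largest-magnitude eigenvalues
latents = evecs[:, top] * np.sqrt(np.abs(evals[top]))   # one d-dimensional point per node
```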
-
The goal of MASE is to embed the networks into a single space, with each point in that space representing a single node
not quite
-
However, what youβd really like to do is combine them all into a single representation to learn from every network at once.
not necessarily
-
- Mar 2021
-
docs.neurodata.io
-
Non-Assortative Case
move it later
-
silhouette
replace with BIC, talk to tingshan
-
section
which is a computationally efficient line-search approach
-
K-means
lower case k
-
Lloyd2
Lloyd made up Lloyd's algorithm for approximately solving k-means
-
a searching procedure until all the cluster centers are in nice places
when no points move clusters from one iteration to the next
-
essentially random places in our data
k-means++ does not do this, the cluster centers are far from one another
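For whoever rewrites that passage, a bare-bones sketch of Lloyd's algorithm with a simple farthest-point seeding (in the spirit of k-means++, which samples proportionally to squared distance rather than taking the max; hypothetical function, not the book's code):

```python
import numpy as np

def lloyd_kmeans(X, k, seed=None):
    """Approximate k-means via Lloyd's algorithm.
    Stops when no point changes cluster between two iterations."""
    rng = np.random.default_rng(seed)
    # seeding: first center at random, then repeatedly take the point
    # farthest from its nearest already-chosen center
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)

    labels = np.full(len(X), -1)
    while True:
        # assignment step: each point goes to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no point moved: converged
            return centers, labels
        labels = new_labels
        # update step: each center moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
```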
-
faster implementation
remove
-
1
this equation is still wrong, needs a transpose
-
covariates
explain why we have ~50 dimensions, especially given that we are only embedding into 2
-
from statistics
not a necessary clause.
-
Stochastic Block Model
i always want: words --> math --> figure
-
4.2359312775571826e-05 and a maximum weight of 40.00562586173658.
just use 2 sig digs
-
CASE simply sums these two matrices together, using a weight for $XX^T$ so that they both contribute an equal amount of useful information to the result.
CASE is a weighted sum
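A sketch of the weighted sum itself may make this concrete (toy data; `L` is a regularized Laplacian and `X` a node-covariate matrix, and the eigenvalue-matching weight below is just one plausible balancing choice, not necessarily the one the book uses):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = np.triu(rng.binomial(1, 0.1, (n, n)), 1)
A = A + A.T                                        # toy symmetric adjacency matrix
tau = A.sum(axis=1).mean()                         # average-degree regularizer
D_inv_sqrt = np.diag(1 / np.sqrt(A.sum(axis=1) + tau))
L = D_inv_sqrt @ A @ D_inv_sqrt                    # regularized Laplacian
X = rng.normal(size=(n, d))                        # toy node covariates

LL = L @ L                                         # connectivity term
XXt = X @ X.T                                      # covariate-similarity term
# pick the weight so the two terms are on comparable scales,
# here by matching their leading eigenvalues
a = np.linalg.eigvalsh(LL)[-1] / np.linalg.eigvalsh(XXt)[-1]
case_matrix = LL + a * XXt                         # the matrix CASE embeds
```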
-
$XX^T_{i, j}$
is not valid notation.
-
which we denote here by L for brevity
-
.
cite CASC paper here
-
them
"then plots the results, color coding each node by its true community"
-
best
avoid 'best'
-
-
docs.neurodata.io
-
With a single network observed (or really, any number of networks we could collect in the real world) we would never be able to estimate $2^{n^2}$ parameters. The number grows too quickly with $n$ for any realistic choice of $n$ in real-world data. This would lead to a thing called a lack of identifiability with a single network, which means that we would never be able to estimate $2^{n^2}$ parameters from $1$ network.
unclear what you mean. MLE is [1,0,....,0]
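One concrete way to phrase the comment (a sketch of the reasoning, assuming the single observed network is $A^{(1)}$ and the model places no constraints on the distribution): the maximum likelihood estimate simply memorizes the observation,
\begin{align*}
\hat{\mathbb P}(A) &= \begin{cases} 1 & A = A^{(1)} \\ 0 & \textrm{otherwise},\end{cases}
\end{align*}
i.e., the vector $[1, 0, \dots, 0]$ once $\mathcal A_n$ is ordered with the observed network first. The estimate exists, but it says nothing about networks we did not observe.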
-
, we would need about 30,000,000 times the total number of storage available in the world to represent the parameters of a single distribution.
confused
-
We use a semi-colon to denote that the parameters $\theta$ are supposed to be fixed quantities for a given $\mathbf A$.
remove
-
What is the most natural choice for $\mathcal P(\Theta)$ that makes any sense?
remove
-
It is, in general, good for $\mathcal P(\Theta)$ to be fairly rich; that is, when we specify a parametrized statistical model $(\mathcal A_n, \mathcal P(\Theta))$, we want $\mathcal P(\Theta)$ to contain distributions that we think faithfully could represent our network realization $A$
remove
-
Note that by construction, we have that $\left|\mathcal P(\Theta)\right| = \left|\Theta\right|$. That is, the two sets have the same number of elements, since each $\theta \in \Theta$ has a particular distribution $\mathbb P_\theta \in \mathcal P(\Theta)$, and vice-versa.
remove
-
$\left|\mathcal P(\Theta)\right| = \left|\Theta\right|$
define notation
-
So, now we know that we have probability distributions on networks, and a set $\mathcal A_n$ which defines all of the adjacency matrices that every probability distribution must assign a probability to. Now, just what is a single network model? The single network model is the tuple $(\mathcal A_n, \mathcal P)$. Above, we learned that $\mathcal A_n$ was the set of all possible adjacency matrices for unweighted networks with $n$ nodes. We will call $\mathcal A_n$ the sample space of $n$-node networks. In general, $\mathcal A_n$ will be the same sample space for all $n$-node network models. This means that for any $n$-node network realization $A$, we can calculate a probability that $A$ is described by any probability distribution on $\mathcal A_n$ found in $\mathcal P$. What is $\mathcal P$? It depends on the model we want to use! In general, $\mathcal P$ has only one rule: it is a nonempty set (it contains at least something), where for every $\mathbb P \in \mathcal P$, $\mathbb P$ is a probability distribution on $\mathcal A_n$. Note that this says only that $\mathcal P$ cannot be empty, but it doesn't say anything about how big or diverse it can be! In general, we will simplify $\mathcal P$ through something called parametrization; that is, we will write $\mathcal P$ as the set:
don't love it
-
$\left\{A : A \in \{0, 1\}^{n \times n}\right\}$
prob remove this
-
When you see the short-hand expression $\mathbb P(A)$, you should typically think back to the most recent random network $\mathbf A$ that has been discussed, and it is typically assumed that $\mathbb P$ refers to the probability distribution of that random network; e.g., $\mathbf A \sim \mathbb P$.
delete
-
Is this set the same for any unweighted random network, or is it ever different? It turns out that the answer here is fairly straightforward: for any unweighted random network with $n$ nodes, the set of possible realizations, which we represent with the symbol $\mathcal A_n$, is exactly the same!
i think this is more confusing than helpful.
-
possibly
remove
-
if $\mathbf A$ is an unweighted random network with $n$ nodes,
not necessary
-
topology
topology
-
random network $\mathbf A$
network-valued random variable \mathbf{A}
-
topology
remove
-
$B$
why is it symmetric?
-
This is an especially common approach when people deal with networks that are said to be sparse. A sparse network is a network in which the number of edges is much less than the total possible number of edges. This contrasts with a dense network, which is a network in which the number of edges is close to the maximum number of possible edges. In the case of an $ER_n(p)$ network, the network is sparse when $p$ is small (closer to $0$), and dense when $p$ is large (closer to $1$).
why talk about sparse? if so, let's dedicate a section to it, not have it here.
sparse can mean lots of different things:
- computationally sparse, meaning that storing the graph as an edge list is smaller than as an adjacency matrix (see the sketch after this list).
- the probability of an edge scales with n, rather than n^2. Thus, p must be a function of n. This is an asymptotic claim, and therefore, does not make sense to apply to any given network.
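A quick illustration of the first (computational) sense, using scipy's sparse matrices (a sketch; the graph here is just a random directed example for the storage comparison, not data from the book):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n, p = 2000, 0.001                      # few edges relative to n^2
A_dense = (rng.random((n, n)) < p).astype(np.uint8)   # toy directed graph

A_sparse = sparse.csr_matrix(A_dense)   # stores only the nonzero entries
dense_bytes = A_dense.nbytes            # ~n^2 bytes, regardless of edge count
sparse_bytes = (A_sparse.data.nbytes
                + A_sparse.indices.nbytes
                + A_sparse.indptr.nbytes)
print(dense_bytes, sparse_bytes)        # dense ~4,000,000 bytes; sparse only tens of KB
```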
-
Probability that an edge exists between a pair of vertices
iid on edges
-
0.3
use the same notation in the title as in the rest of the paper; we write
ER_n(p)
throughout the text, so let's also use it in the title.
-
True
false
-
True
undirected
-
ps
i'd use p
-
same
i'd write
"is the same number, n*p
-
The
also need to be able to estimate parameters of the model
-
unlikely
impossible
-
the model
model is a set
-
framework
not a framework
-
.3
always write 0.3 instead of .3
-
Structured Independent Edge Model (SIEM)
i think this should go at the end, ie, right before IER, since it is a different special case?
-
_{s}
remove _s i think
-
$\pmb A \sim ER_n(p)$
write this out by factorizing, eg, show
A ~ Bern(P) = \prod_ij Bern(p_ij) = \prod_ij Bern(p)
and explain why
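One way that write-up might go (a sketch; assumes an undirected, loopless network, so the product runs over pairs $j > i$):
\begin{align*}
\mathbb P_\theta(\mathbf A = A) &= \prod_{j > i} \mathbb P(\mathbf a_{ij} = a_{ij})\;\;\;\;\textrm{edges are independent} \\
&= \prod_{j > i} p_{ij}^{a_{ij}}(1 - p_{ij})^{1 - a_{ij}}\;\;\;\;\textrm{each } \mathbf a_{ij} \sim Bern(p_{ij}) \\
&= \prod_{j > i} p^{a_{ij}}(1 - p)^{1 - a_{ij}}\;\;\;\;\textrm{ER: every } p_{ij} = p
\end{align*}
so the ER model is the special case of edge-wise Bernoullis in which every edge shares the same probability $p$; the product then collapses to $p^m(1-p)^{\binom{n}{2} - m}$, where $m$ is the number of edges in $A$.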
-
-
docs.neurodata.io
-
Model Selection: The model is appropriate for the data w
appropriateness
-
Machine Learning
is this capitalized?
-
underlies
nix for govern, everywhere maybe?
-
underlying
remove
-
Stated another way, even if we believe that the process underlying the network isnβt random at all, we can still extract value by using a statistical model.
clarify
-
underlies
governs
-
statistics
ML
-
statistics
replace everywhere with network machine learning.
-
Comparing
somewhere in here we need to remind people of our notational conventions.
-
lives
and also talk about missing people, people with multiple accounts (eg, famous people have personal and pro accounts), etc. also do that above.
-
quantity
variable
-
not
on a specific social media site
-
random
if we are assuming iid, we should say it here.
-
For instance, we might think that
In this simple example,
-
Instead
remove
-
this
remove
-
discrimative modeling,
is discriminative always classification? seems not, please clarify
-
relating to how the network behaves
about properties of the network
-
network
and potential network, node, and edge attributes
-
statistics
network ML
-
$d$-dimensions
i don't want to corner stats into Euclidean stats. The 'problem' with classical stats with regards to network machine learning is that it doesn't tell us how to leverage the structure of a network efficiently
-
presentation
representation
-
well
remove
-
is that we have
is concerned with
-
reaization
realization.
This sentence is not right, however. We assume that our observed network is merely a realization of a random network.
-
with which we seek
we use
-
Statistical modelling is a technique in which
In statistical modeling
-
Perhaps
what about missing people from the social network? that is a big one
-
perhaps
remove
-
is filled with
includes much
-
might
remove
-
the question that we
we may
-
A common question as a scientist that we might have is how, exactly, we could describe the network in the simplest way possible
this is just not where to start
-
Topology
simple
-
before
no time yet please
-
are
correspond to
or
represent
-
are
correspond to
or
represent
-
topology of a network
simple network
-
$\mathbf A$
A is realization of a vector/matrix \mathbf{A} is vector/matrix valued RV
a is scalar realization \mathbf{a} is a scalar valued RV
a_ij is the realization of an edge
-
$\pmb a$
i don't think we can use 'a', A is observed, and \mathbf{A} is random variable
-
$x_i$
should be x not x_i since we use X later.
-
value
whereas x takes on two possible values: heads or tails
-
$x$
but x is not, it is the actual observed realization of a coin flip
-
generative
we also use discriminative modeling, eg, signal subgraph
-
might have a slightly different group of friends depending on when we look at their friend circle
let's not introduce time varying yet
-
we assume that the true network is a network for which we could never observe completely, as each time we look at the network, we will see it slightly differently.
we have measurement error and other sources of uncertainty in our data. that is an empirical fact. let's preserve the word assumption for explicit model assumptions.
-
The way we characterize the network is called the choice of an underlying statistical model.
grammar
-
if we know people within the social network are groups of student
grammar
-
-
docs.neurodata.io
-
What Is A Network?
- simple network
- weighted
- loopless
- directed
- attributed
then we discuss different representations of networks again
- edge list
- adjacency matrix
- various laplacians.
maybe in its own subsection
-
-
docs.neurodata.io
-
Preface
add pretty cover art from @pedigo
-
-
docs.neurodata.io
-
$LL + aXX^T$
make latex work :)
-
- Oct 2020
-
-
The p-value determines the probability of observing data (or more extreme results), contingent on having assumed the null-hypothesis is true. The formal definition can be expressed as follows: $P(X \geq x \mid H_0)$ or $P(X \leq x \mid H_0)$,
I'm sure you know this, but this sentence, as written, I do not think is quite right. Assume that X is a random variable, and x is a realization of that random variable, and we sample n times identically and independently from some true but unknown distribution P. Then, choose a test statistic, T, which maps from the data (X1, ..., Xn) to a scalar t. Now, we can define the p-value as the probability of observing data with a test statistic as or more extreme than the observed, contingent on having assumed the null-hypothesis is true.
The fact that there is a test statistic in there, I think, is incredibly important, because obviously (to you), if one chooses a different test statistic, one can obtain a different p-value.
This is also important for your PV8, where "result" is ambiguous. The result here implicitly refers to the test statistic, which was not previously mentioned. It is easy to mistakenly believe that the 'result' somehow magically implies something about 'the data'. For example, anecdotally, I often find that people think a big p-value on a t-test implies no effect, whereas had they used a robust test, or tested for a change in variance rather than the mean, the effect is clear.
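A small illustration of that last point (a sketch using scipy; the data are simulated, not from any cited study): two samples with equal means but different variances, where a t-test reports a large p-value while a variance-sensitive test reports a small one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=200)   # mean 0, sd 1
y = rng.normal(loc=0.0, scale=3.0, size=200)   # mean 0, sd 3

t_stat, p_mean = stats.ttest_ind(x, y, equal_var=False)   # test statistic sensitive to a mean shift
w_stat, p_var = stats.levene(x, y)                         # test statistic sensitive to a variance change

print(f"t-test p-value:   {p_mean:.3f}")   # typically large: no mean effect
print(f"Levene's p-value: {p_var:.3g}")    # typically tiny: clear variance effect
```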
-