404 Matching Annotations
  1. Nov 2021
    1. Note

      Important point.

    2. predictions

      \(\hat{P}\) values are grouped into intervals
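
      A minimal sketch of that grouping (function name and bin count are my own; the usual reliability-diagram/ECE recipe): bin each confidence \(\hat{P}\) into equal-width intervals, then compare average confidence against empirical accuracy per bin.

      ```python
      import numpy as np

      def calibration_bins(confidences, correct, n_bins=10):
          # Group \hat{P} values into equal-width intervals; per bin, report
          # (interval, mean confidence, empirical accuracy).
          edges = np.linspace(0.0, 1.0, n_bins + 1)
          rows = []
          for lo, hi in zip(edges[:-1], edges[1:]):
              mask = (confidences > lo) & (confidences <= hi)
              if mask.any():
                  rows.append(((lo, hi), confidences[mask].mean(), correct[mask].mean()))
          return rows  # |mean confidence - accuracy| per bin gives an ECE-style score
      ```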

    3. joint distribution

      \(\hat{P}\) is also a random variable, as its value may differ when the same input is passed repeatedly through the same model; for example, due to dropout running at test time, or some other source of randomness inside the model.

    1. highly variable representations of syntax

      The models learn representations of syntax during pre-training. Fine-tuning on the MNLI dataset changes the representations (mostly of the last few layers) to fit that dataset. The process of transferring knowledge from the NSP task to the MNLI task, by fine-tuning the CLS position on the MNLI dataset:

      1. sometimes causes the model to get "caught" in wrong simple heuristics.
      2. sometimes does "something" to the model so that it performs around 50% on HANS.

      It could be that the heuristic is built within the model due to NSP, and sometimes MNLI training allows you to overcome it, sometimes not.

      Or it could be that the MNLI training collapses the models to these heuristics?

    2. or to random luck in the choice of the model’s initial weights.

      Or maybe to both.

    3. low-bias learner

      The bias of the bias-variance tradeoff.

    1. Figure 8.

      Maybe worth trying: slightly randomly shift each model in the deep ensemble (by running SGD on two or three training batches), add those shifted copies to the ensemble too, and then compare performances, to better ablate the improvements that the authors attribute to ESPRO.
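
      A sketch of this ablation (PyTorch assumed; all names are mine):

      ```python
      import copy
      import torch

      def jittered_ensemble(models, loader, loss_fn, copies=2, steps=3, lr=1e-3):
          # For each trained member, make `copies` slightly shifted clones by
          # running SGD on a few fresh batches, and add them to the ensemble.
          out = list(models)
          batches = iter(loader)
          for model in models:
              for _ in range(copies):
                  clone = copy.deepcopy(model)
                  opt = torch.optim.SGD(clone.parameters(), lr=lr)
                  for _ in range(steps):
                      x, y = next(batches)
                      opt.zero_grad()
                      loss_fn(clone(x), y).backward()
                      opt.step()
                  out.append(clone)
          return out
      ```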

    2. shows that we are able to discover simplexes in parameter space containing models that lead to diverse predictions, meaning that we can ensemble within a simplex and gain some of the benefits seen by deep ensembles (Lakshminarayanan et al., 2017).

      Very counter-intuitive. This means good alternate ways to solve a problem lie nearby: a model that identifies dogs by their tongues lies near one that identifies dogs by their upright tails.

      Possibly because of the hierarchical nature of neural networks? Once the individual parts of an image have been identified, you can change the weights slightly to decide whether the prediction of an image containing a dog should depend on the image having a tail or a tongue.

      Possibly worth checking which weights change the most as we move inside the simplex: top-layer weights or bottom-layer weights?

    3. We can continue adding new modes until we reach k = 11, when the volume collapses to approximately \(10^{-4}\), from a maximum of \(10^5\)

      Let's say there are two 2-D low-loss surfaces connecting \(w_1\) and \(w_2\). If \(\theta_1\) lands on surface 1 after being tuned to reduce loss on the polygonal connector \(w_1 \rightarrow \theta_1 \rightarrow w_2\), we can be sure that \(\theta_2\) also lands on surface 1, as our loss function moves \(\theta_2\) in the direction that minimizes expected loss over the entire triangular surfaces \(\Delta w_1\theta_1\theta_2\) and \(\Delta w_2\theta_1\theta_2\).

    4. at least 10

      Why only a lower bound? Can we find an upper bound too?

    5. both low loss parameter settings

      As soon as we add a point outside the simplex, a point chosen at random from the volume will, almost everywhere, be a point of high loss.

      For example, consider a triangle of low-loss parameter settings such that the 3-D space containing the triangle consists entirely of high-loss parameters. As soon as you consider any pyramid with the triangle as its base, almost every point in the pyramid is a point of high loss.

    6. volume

      We calculate a higher-dimensional volume at higher \(k\).

    7. b

      Projection of a 9-dimensional simplex into 3-dimensional space?

    8. Note that in this case not all modes are in shared simplexes with all connecting points.

      How to choose which modes form simplexes with which connecting points?

    9. All samples from the SWAG posterior lie in the region of low loss, as do the sampled connecting paths, indicating that there is indeed an entire volume of connected low loss solutions induced by the SWAG posterior over θ.

      Does the color correspond to the loss of the projection, or to the loss of the actual SWAG sample?

      Probably the loss of the actual SWAG samples, as the loss of points in the space spanned by \(\theta_0\), \(w_0\), \(w_1\) doesn't look like this, as seen in Garipov et al. (2018).

    1. In this instance, we expect the distance between turning points to be less than the distance between endpoints

      Why?

    2. coincide

      The losses may differ (by a constant coefficient) if the "linear in \(t\)" relation doesn't have \(t\) with a coefficient of 1. We need \(\|\phi_\theta'(t)\|=1 \;\forall t \in [0,1]\) for the two losses to be exactly equal.

    3. uniform distribution

      The expectation of the uniform distribution on the "curve \(\phi_\theta\)" is \(\hat{l}(\theta)\).

      A uniform distribution on the "curve \(\phi_\theta\)" means the probability of being in each unit-length segment of the curve is the same for every unit-length segment of the curve.
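
      Side by side, in my own notation (\(\mathcal{L}\) the loss, \(\operatorname{len}\) the curve length):

      ```latex
      % Uniform in the parameter t:
      \hat{\ell}(\theta) = \mathbb{E}_{t \sim U[0,1]}\big[\mathcal{L}(\phi_\theta(t))\big]
                         = \int_0^1 \mathcal{L}(\phi_\theta(t))\,dt
      % Uniform on the curve itself (arc-length measure):
      \frac{1}{\operatorname{len}(\phi_\theta)} \int_0^1 \mathcal{L}(\phi_\theta(t))\,
          \lVert\phi_\theta'(t)\rVert\,dt,
      \qquad \operatorname{len}(\phi_\theta) = \int_0^1 \lVert\phi_\theta'(t)\rVert\,dt
      % The two coincide exactly when \lVert\phi_\theta'(t)\rVert = 1 for all t (unit speed).
      ```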

    1. First, it suffers from the herd effect. If there are many clients waiting to acquire a lock, they will all vie for the lock when it is released even though only one client can acquire the lock.

      That is, all waiting processes that have registered watches on the ephemeral znode will be notified of its deletion, and will accordingly try to create the same znode again. Even if one client has already created the znode after the deletion of the first one, the remaining clients will try nonetheless.
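
      A sketch of the standard fix with sequential ephemeral znodes (real kazoo API, but paths and structure are my own; error handling omitted): each waiter watches only its immediate predecessor, so a release wakes exactly one client instead of the whole herd.

      ```python
      import threading

      from kazoo.client import KazooClient

      def acquire_lock(zk: KazooClient, lock_dir: str = "/lock") -> str:
          zk.ensure_path(lock_dir)
          path = zk.create(lock_dir + "/n-", ephemeral=True, sequence=True)
          me = path.rsplit("/", 1)[1]
          while True:
              waiters = sorted(zk.get_children(lock_dir))
              if waiters[0] == me:
                  return path                  # lowest sequence number holds the lock
              predecessor = waiters[waiters.index(me) - 1]
              gone = threading.Event()
              # Watch only the znode just before ours; only its deletion wakes us.
              if zk.exists(lock_dir + "/" + predecessor, watch=lambda event: gone.set()):
                  gone.wait()
      ```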

  2. Oct 2021
    1. For example, there could be a non-deterministic result if an application/OS in a VM is reading a memory block at the same time a disk read is occurring to that block.

      How??

    2. simultaneous disk operations that access the same disk location can lead to non-determinism.

      For example, if the primary tries to write to the same location in a file in a loop: as disk operations are non-blocking, we don't know in what order all the writes will actually be done.

      The VMware FT will detect any such operations on the primary's side and force them to execute sequentially, then ask the backup's VMware FT to execute them in the same sequence.

    3. try associated

      Additionally, if a client doesn't receive an output indicating that the state change it intended for the server has been done, the client can't assume that it hasn't been done. It has to consider that the state change may have been done and the packet acknowledging it lost on the way; so it needs to query the current state again, to check whether the change it intended has been made.

      This also helps avoid confusion in case the primary fails after the backup gets updated, but before it receives the ACK from the backup.

    4. point of the last log entry. However, suppose a failure were to happen immediately after the primary executed the output o

      Difficult to ensure if the backup VM is quite far away and there are lots of log entries to send: we would have to wait for all the log entries to reach the backup VM.

  3. storage.googleapis.com
    1. Leases

      To avoid "Split Brain". The idea is that both the master and primary know that uptil a point(the lease expiration time), even if communication to primary(from Master) fails, the Master assumes that the Primary is still alive & well. And the Primary, continues to act as though it is primary. Once the lease expiration time is over, the primary will stop acting as Primary if its lease is not extended.

      The master must wait for lease to expire before it can designate a new primary.

      This prevents "split brain" problem, which would have occurred if the Master had assumed that the Primary was dead(upon losing conversation with it) and designated a new primary immediately. Then, the clients who knew the old primary, will continue to do work with it, while the clients who know new primary, will ask it to do work, leading to conflicts.

    2. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder.

      If some client thinks that some chunkserver is the primary (because that info exists in its cache) and sends a write request to that chunkserver, the chunkserver will reply saying it doesn't have the lease. Hearing this, the client will ask the master who holds the lease.

    3. file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas.

      The only way clients may end up writing something they didn't intend.

    4. Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

      The only way clients can get slightly wrong information while reading.

    5. checkpoints

      For example, after a client has tried to write, say, 2 MB of data to a file, and the write completed successfully, it will next write a checksum for that 2 MB of data to the same file. If there was a concurrent write by another client to the same location where the 2 MB of data was to be written, the write is considered unsuccessful.

      Any reader will only read data that is followed by a checksum.

    6. atomically

      Atomicity is guaranteed by the atomicity of the master's file namespace operations.

    7. chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes.

      The only way to lose data.

  4. Aug 2021
    1. but it can also be trained to maximize entropy over all valid insertions for robustness.

      Autoregressive models maximize the probability of only one possible sequence of insertions, i.e., the one corresponding to inserting one token at a time, left to right.

    1. tangent planes to the surface (3) and the envelope coincide.

      Read : "tangent plane to the surface" and "the envelope". Not : "tangent plane to the surface" and "the tangent plane to the envelope".

    1. sentence

      That is, they are checking for the presence of "MajorClaim", "Claim" and "Premise" in each full-stop-separated sentence in the essay. They don't check whether they begin at the same token in the sentence, or whether a span is a continuation of the previous sentence or marked as a different claim/premise.

    1. Prompt Engineering

      Choosing what and where (in what context) we want the model to predict.

    2. Answer Engineering:

      Choosing which possible outputs of the model will be considered valid answers.

    3. Sea Change

      Profound change.

    1. the optimization directions of the first few epochs are divergent from the final optimization direction.

      This means that with random initialization we are not sure whether locally optimal choices are globally optimal, but pre-training makes sure that they are. (That is, it makes sure that an almost straight-line path remains between the pre-trained weights and the optimal weights for the downstream task. This is basically due to the relation between the pre-training and downstream tasks.)

    2. The optimization process also converges faster

      Notice the long first arrow in the bottom graphs of Fig. 3. All arrows are probably plotted after an equal number of steps.

    3. path

      The straight line path from (0,0) to (0,1) or (1,0).

    1. fiction

      BERT is trained on books; maybe the positive \(\Delta\) here is related to that?

    2. closely aligned tasks

      I think there are two kinds of "closely-related tasks": "subset tasks" (like those we can get from prompt engineering, where the class prediction of one task can be used to completely determine the class on another task) and "overlapping tasks" (where the class of one task only defines a prior over the class predictions for another task).

      Here we are in the territory of "overlapping tasks" and neglecting the other kind.

      While there is no need to change representations in the case of subset tasks, there is a need to change representations in the case of overlapping tasks. Heated training allows for this, and is hence better for such "closely aligned tasks".

      But the results show that if the overlap between the tasks is not sufficient, heated training may actually destroy useful information present in the embeddings (which is preserved in frozen training). Is it actually destroying information useful to the downstream task? Or is the better performance in the frozen setting due to the parameters of ESIM?

      A useful comparison could be frozen training (with ESIM) followed by heated training (with the ESIM learned during frozen training) for the STS tasks. This is done in the "gradual unfreeze" settings below.

    3. A confounding factor may be the suitability of the inductive bias of the model architecture for sentence pair tasks

      Another factor is the data the model was trained on. The increase in BERT's frozen performance over ELMo's frozen performance is mainly due to these factors.

    4. When extracting features, it is important to expose the internal layers as they typically encode the most transferable representations.

      This is how they differ from probing. Very nice point.

    5. LM pretraining has also been successfully applied to sentence level tasks.

      The gap between frozen and heated settings will be large compared to the gap between frozen and heated settings for other pre-training objectives like NSP (when that objective and the adaptation task are related). The LM objective is expected to be quite far from the adaptation task.

    1. Pinsker’s inequality

      Remember to use this anywhere you want to convert between an algebraic difference in probabilities of events under different distributions and the KL divergence/cross-entropies of the distributions.

    2. \(\ell_{xent,s}(p_{\cdot|s})-\ell_{xent,s}(p^*_{\cdot|s})\)

      This is \(\epsilon\).

    3. \(\|v\|_\infty^2\)

      This here is \(B^2\).

    4. \(\ell_{\mathcal{T}}(\{p^*_{\cdot|s}\},v^*)\)

      This quantity is \(\tau\)

    5. Jensen

      Used only to exchange the square and the root, so that Pinsker's inequality without the root can be used.

    6. a \(\gamma(p_{\mathcal{T}})\)-fraction

      An initial result is that \((\gamma (p_{\mathcal{T}}))^{-1} l_{xent}(\{\mathbf{p}_{.|s}\}) = (\gamma (p_{\mathcal{T}}))^{-1} E_{s \sim p_L} [ E_{w \sim \mathbf{p^*}_{.|s}} [-\log(\mathbf{p}_{.|s}(w))] ] \ge E_{s \sim p_{\mathcal{T}}} [ E_{w \sim \mathbf{p^*}_{.|s}} [-\log(\mathbf{p}_{.|s}(w))] ]\)

      For proof, observe that to obtain the expectation over \(s \sim p_{\mathcal{T}}\) from the expectation over \(s \sim p_L\), we need to multiply each term in the expectation for \(s \sim p_L\) by \(\frac{p_{\mathcal{T}}(s)}{p_L(s)}\). We can instead multiply each term by \( (\gamma (p_{\mathcal{T}}))^{-1} = \max_{s \in S} \frac{p_{\mathcal{T}}(s)}{p_L(s)} \) to get an upper bound on the expectation over \(s \sim p_{\mathcal{T}}\). We can replace the innermost expression \(-\log(\mathbf{p}_{.|s}(w))\) with \(-\log(\mathbf{p}_{.|s}(w))+\log(\mathbf{p^*}_{.|s}(w))\). This allows us to get the same inequality on the deviation from the optimal cross entropy, \(\epsilon\), viz., \(\epsilon_{\mathcal{T}} \le (\gamma (p_{\mathcal{T}}))^{-1}\epsilon_L\)

    7. 5

      Only exists when \(\mathrm{support}(p_{\mathcal{T}}) \subseteq \mathrm{support}(p_L)\). Similar to \(\inf_{s \in S} \frac{p_L(s)}{p_{\mathcal{T}}(s)}\).

    8. \(s\) for every \(s \in \mathrm{support}(p_L)\)

      This means we can only match, for the distribution obtained from each context, on a \(d\)-dimensional subspace. Moreover, the \(d\)-dimensional subspace is the same for every context. It may be useful to let it vary with context, while keeping \(d\) the same.

    9. xent

      In reality, just the self-entropy of \(\mathbf{p^*}_{.|s}\).

  5. Jul 2021
    1. per-component

      They consider the PDTB parser as one component and their EDU segment generator as another. This terminology ignores the various components inside the PDTB parser. Need to confirm whether this is what the author intended.

    2. The gold standard parsing and EDUs boundaries

      That is, the PDTB parser (which identifies spans and the relations between them based on a parsing of the sentences in the text) is given the gold parsing of the sentences in the text and the gold EDUs.

      The authors probably modify the "Argument Extractor" to predict Arg1 as one of the EDUs, and parse each EDU individually. Or probably they choose the EDU which has maximum overlap with Arg1. Need to confirm.

    3. tasks

      respectively

    1. subtree nodes

      Arg1, Arg2 spans can only correspond to the subtrees rooted at the nodes of the constituency parse tree.

    2. The span to which the connective is syntactically attached

      That is, when we do dependency parsing (or just use the one given in PTB), the span in which the connective's head lies is called \(Arg2\) and the other one \(Arg1\).

      This is a link between dependency parsing and discourse parsing/argument mining. Can try to extend/generalize this kind of link further!

    1. This shows when one compares the number of Sup1 or Sup2 annotations on explicit discourse relations, which were annotated first, with the number of such annotations on implicit discourse relations, which were annotated on a subsequent pass: 1,571 explicit relations were annotated with supplementary information, whereas only 126 implicit relations were, despite nearly equal numbers of both.

      i.e., annotators became lazy after the first pass.

    2. arguments

      *distinct arguments

    3. nevertheless likely that such discourse relations are under-annotated in the PDTB and should be addressed.

      Because they'd have been annotated as AltLex, in which case they haven't been grounded properly (unlike implicit relations, which were grounded by adding in extra words, as in example (2)).

    4. deictic

      Relating to or denoting a word or expression whose meaning is dependent on the context in which it is used (such as here, you, me, that one there, or next Tuesday).

    1. 45c9c7c6Figure

      c6, c7, c9 are just random ids with no meaning, derived from the XML representation. For example, see here. The actual type of an edge is specified in the "type" attribute of the <edge> XML tags.

  6. www.ling.uni-potsdam.de
    1. The segment status we need for argument mining is less genre-dependent and instead tailored to the argument notation or schema we want to use.

      In argument mining, a segment can correspond to a connected sub-graph of the argument complex, constructed from contiguous ADUs. Construct a new graph with each sub-graph reduced to a single node. Each of these nodes can be classified as belonging to a broad set of classes, depending on the local structure around the node in the graph. The set of classes becomes more specific and expanded as we consider more and more structure around the sub-graph.

      In the simple case, where connections in the argument complex are sparse, the detected sub-graphs, and the graph formed from them, are also sparse and admit a much smaller number of possible classes, yielding simple classes like "claim" and "premise" (as described in the previous section).

    2. formulaic expressions

      May consider integrating these into the argument notation/schema itself, or learning them.

    3. linear order of zones

      That is, which is more likely to occur first, which second etc.

    4. see (Stede, 2011, Sct. 4.2)

      to see

    5. Since the presence of such a signal is a valuable feature in the study of the principles of argument presentation,

      We need to know this in order to understand whether the proponent is accepting defeat, being humble, etc. Although the intent is recoverable entirely from the text (as is the entire argument complex), the authors still choose to represent this factor in their diagrams. If we use a weighting for the "importance" of each of the attack edges, then we can probably omit this.

    6. weighty reasons

      An important question: How to assign and compose weights of various parts of the argument complex.

    7. The author does not even need to address whether the anticipated exception to his argument holds or not

      That is, in the below example, "in principle it is possible" can be replaced by "even if it is possible".

    8. An example diagram is shown in Figure 5b.

      The exact figure for the provided "harmless asbestos" example will consist of a circular node \(3\) rather than a square node \(3\) in figure \(5b\) (assuming \(3\) and \(4\) are said by the proponent himself).

    9. Figure 5a.

      In \(4a\), it'd have looked like a circular node \(4\) connected on top of \(3\), with an edge with a circular end point towards \(3\).

    10. We should tear the building down,]1 [even though it’s supposed to be some touristic attraction.]

      Imagine this as a rhetorical comment by the attacker. In this comment, \(1\) is a reference to the original claim of the proponent, and \(2\) is the rebutter (i.e., a set of premises against the referenced claim). Imagine the following sentence as a response by the proponent (i.e., \(3\)). It is then a counter-rebutter (a set of premises against the rebutter). We can see that the counter-rebutter is indeed evidence against the premises that compose the rebutter.

    11. counter-rebutter

      a set of premises against the set of premises in the rebutter.

    12. rebutter

      a set of premises against some claim.

    13. according to the main claim.

      And the shapes of the nodes (circle/box).

    14. rebutterFigure

      Had the squares been circles, it'd mean that the counter-argument is presented by the proponent himself.

    15. supporting force

      That is, the reason why the claim follows from the premise.

    16. discussion of that difference can be found in (Freeman, 2011, ch. 5).

      to see.

    17. they

      whether a node is a "supporting evidence" or some "data node"

    18. amodal

      Alethic Modality

      Basically, modal operators refer to "can't be", "may be", "may not be", which indicate different strengths of inferential links.

    19. argumentative text portion

      in contrast with an instructive, or narrative, or descriptive, or expository text.

    1. generalization

      Generalization of the model weights learned from the MLM task to other tasks.

      Given 2 models (or the same model with different pre-trained weights) which perform exactly the same on the MLM task, it may still be the case that one performs better than the other when fine-tuned on RTE.

      But here, the generalization we are talking about corresponds to the fact that even if two models (or a single model with different weights) obtain the same loss on the training data for a downstream task like RTE, their performance on the test set may still differ.

    2. barrier

      The color stays the same around the red star, and the barrier is where the color changes quickly (those are the walls of the valley): around the line \(x=0.25\) in fig. \(7(a)\).

    3. CoLAFigure 7:

      Moving along the X-axis from (0,0) (black star) corresponds to the failed run (red star), and moving along the Y-axis corresponds to the successful run (white star).

    4. Importantly, we note that the vanishing gradients we observe during fine-tuning are harder to resolve than the standard vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994). In particular, common weight initialization schemes (Glorot and Bengio, 2010; He et al., 2015) ensure that the pre-activations of each layer of the network have zero mean and unit variance in expectation. However, we cannot simply modify the weights of a pre-trained model on each layer to ensure this property since this would conflict with the idea of using the pre-trained weights.

      This is also probably the reason why vanishing gradients occur despite the skip connections in the transformer.

    5. optimization problem

      i.e., a problem with the optimization procedure, as it can't reduce the loss!

    1. but does not in itself say if these changes are important for the model’s behavior, i.e. for the processing necessary to solve the downstream task.

      That is, we can see from the RSA that at layer 6, dependency-finetuned BERT's embeddings are far away from pre-trained BERT's. But are these changes in dependency-finetuned BERT actually being used in dependency parsing, or are they being ignored/grouped together into sub-classes by the layers after the 6th one?

    2. These baselines are important because inspection-based analysis can often discover patterns that are not obviously present due to the high capacity of auxiliary classifiers. For example, Zhang and Bowman (2018); Hewitt and Liang (2019) found that expressive-enough probing methods can have decent performance even when trained on random noise.

      As an extreme example, BERT itself can be thought of as a probe on word embeddings, if we freeze the word embeddings. Now, if by chance the weights we use to initialize the classifier head (BERT) happen to be the same as those of pre-trained BERT, then we can get awesome performance on the probing task without our initial model (i.e., the word embeddings) having any knowledge related to the probing task.

    3. The corresponding metrics are the prediction accuracy of the root node and the Spearman correlation between the predicted and true tree depths.

      These are 2 metrics for the first structural probe only. The word \(i\) with the lowest \(\|\mathbf{h}_i^l\|_B\) is the predicted root.

    4. with models appearing to use dataset heuristics such as lexical overlap (McCoy et al., 2019b) or word priors (Poliak et al., 2018),

      That is, when fine-tuned on a downstream task, models tend to use spurious artefacts in the dataset instead of the actual linguistic features learned during pre-training.

    5. these

      The linguistic features

  7. arxiv.org
    1. 2017 DEEP BIAFFINE ATTENTION FOR NEURAL DEPENDENCY PARSING

      Main idea: using a classifier-like (softmax) loss in addition to an MST (CRF)-type loss leads to better performance.

      Can implement the classifier-like loss for the connection from each word using the scores given in equation (2).

    2. Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. That is, every top recurrent state \(r_i\) will need to carry enough information to identify word \(i\)’s head, find all its dependents, exclude all its non-dependents, assign itself the correct label, and assign all its dependents their correct labels, as well as transfer any relevant information to the recurrent states of words before and after it. Thus \(r_i\) necessarily contains significantly more information than is needed to compute any individual score, and training on this superfluous information needlessly reduces parsing speed and increases the risk of overfitting. Reducing dimensionality and applying a nonlinearity (4 - 6) addresses both of these problems.

      Nice logic. Can be applied anywhere we have embeddings that are supposed to work for multiple tasks / encode multiple kinds of information. Usually we add a "head" to transform the embeddings for the particular task, but forget to do dimensionality reduction!

    3. biaffine

      The two affine transformations are not applied one after another, but rather in parallel: \(RU^{(1)}\) is one and \(R\mathbf{u}^{(2)}\) the other. \(U^{(1)}\) and \(\mathbf{u}^{(2)}\) are the learnable parameters of the bi-affine transformation. \(R\) is the matrix of token-wise embeddings of all words, stacked together.
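
      A sketch of that parallel form in PyTorch (shapes are my assumption: \(R\) is \(n \times d\) after the MLPs; this follows the structure of eq. (2), not the paper's exact code):

      ```python
      import torch

      def biaffine_arc_scores(R_dep, R_head, U1, u2):
          # R_dep, R_head: (n, d) token states after the dependent/head MLPs.
          # U1: (d, d) and u2: (d,) are the two parallel (bi)affine parameters.
          # scores[i, j] = r_i_dep @ U1 @ r_j_head + r_j_head @ u2
          return R_dep @ U1 @ R_head.T + (R_head @ u2).unsqueeze(0)

      n, d = 7, 32
      R = torch.randn(n, d)          # same states fed to both sides in this toy call
      scores = biaffine_arc_scores(R, R, torch.randn(d, d), torch.randn(d))
      print(scores.shape)            # (n, n): a head score for every word pair
      ```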

    4. Variable

      The classes are the various words in the sentence; they change as the sentence changes.

    1. latter

      i.e., the downstream tasks.

    2. The figure shows that NMT probing performance is largely independent of target language, with strikingly similar development patterns across French, German and Finnish

      Can use the same encoder for all the target languages!

    3. not easily readable by our MLP

      But readable by the decoder of the AutoEncoder

    4. No target form occurs across the train/dev/test split,

      That is, a word in the past tense (say "saw") occurring as the verb of the main clause will occur as such in only one of the train/dev/test splits.

      Forms of verbs:

      VB (base form)(be),

      VBD (past tense)(was),

      VBG (gerund/present participle)(being),

      VBN (past participle)(been),

      VBP (sing. present, non-3d)(am),

      VBZ (3rd person sing. present)(is).

    5. train parameter-rich multi-layer classifiers,

      Need to be careful with the complexity of the classifier and the amount of data used to train it.

      A BERT-like model has a max sequence length of 512, so it has \(|V|^{512}\) possible inputs (\(V\) is the vocabulary of the model). Suppose it maps each of them to a distinct fixed-size sentence embedding. Then, if we provide all our samples to the classifier and it memorizes the labels for all of them, we can't really say whether the linguistic information was captured in the sentence embedding.

      For binary classification, an accuracy greater than \(50\%\) on a held-out test set is a must to conclude that at least some traces of the linguistic property corresponding to the binary classification task are captured in the sentence embedding.

    1. Spearman’s ρ

      It is Pearson's \(r\) computed on the ranks of the actual and observed values (the values are ranked within each variable). It measures to what extent the function mapping one of the r.v.s (actual/observed) to the other can be approximated by a monotonic function.

      Pearson's \(r\), on the other hand, measures how well the function mapping one of the r.v.s to the other can be approximated by a linear relation of the form \(Y=aX+b\). If it can be approximated exactly (with \(a>0\)), then Pearson's \(r\) is \(1\).

    2. Pearson’s r

      It is the covariance between the observed and actual values (of relatedness) normalized by the product of the standard deviations of the observed and actual values.
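
      A quick check of the difference (scipy): a monotonic but non-linear relation gets a perfect Spearman \(\rho\) but an imperfect Pearson \(r\).

      ```python
      import numpy as np
      from scipy.stats import pearsonr, spearmanr

      x = np.linspace(1.0, 10.0, 50)
      y = np.exp(x)                  # monotonic in x, but far from linear

      print(spearmanr(x, y)[0])      # 1.0: the ranks agree perfectly
      print(pearsonr(x, y)[0])       # < 1: no relation of the form Y = aX + b
      ```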

    1. Interestingly, we observe no difference between CoLA and SST-2 fine-tuning in this experiment.

      Meaning that the deviations induced by CoLA/SST-2 are equivalent relative to the pre-trained weights of the last layer, although one induces more deviation than the other... probably because of LayerNorm?

    2. This affirms our hypothesis that improvements in the fine-tuning with CLS-pooling can be attributed to a change in the attention distribution which is less necessary for the mean-pooling.

      This hypothesis explains why huge changes appeared when probing with [CLS] rather than mean-pooling. As we had used [CLS] during fine-tuning, the attention distribution would have adapted to allow the [CLS] token to capture better sentence-level representations, which probably wouldn't have been the case before.

      That is, there are at least two factors bringing about the changes (compared to the base pre-trained models) in the accuracy of probing tasks, viz.:

      1. The [CLS] token learning better representations of the sentence (compared to those captured by the [CLS] token in the base pre-trained model) when the model is fine-tuned for the given fine-tuning task.

      2. How the probing task is related to the fine-tuning task. (That is, whether the fine-tuning and probing tasks are both syntactic-level tasks, or one is syntactic-level and the other semantic, etc.)

    3. The improvement is only visible when using CLS-pooling and becomes negligible when probing with mean-pooling.

      Probably because during fine-tuning we took the first token's representation and passed gradients through it. Hence the first token's representation (where [CLS] is) was the most affected. Mean-pooling during probing allows one to access more of the un-corrupted pre-trained knowledge in that way.

      To check: What would have happened had the fine-tuning involved mean-pooling too?

    4. highlighting the importance of sentence-level context for each of the prob-

      Or the inability of the [CLS] token to capture sentence-level context.

    5. outFigure

      No fine-tuning. These are results on base model.

    6. Our hypothesis is that if a pre-trained model encodes certain linguistic knowledge, this acquired knowledge should lead to a good performance on a probing task testing for the same linguistic phenomenon.

      Very logical, though I don't think it is that discrete. How quickly the knowledge can be recovered, if lost, would make this phenomenon continuous.

      If we replace probing everywhere in the paper with actual fine-tuning on the probing task (that is, unfreeze the weights of the transformer model while doing the probe), and then measure the number of samples required to reach saturated accuracy on the probing task, this will correspond to how quickly the knowledge corresponding to the probing task can be recovered.

    7. coordinate clause

      In English grammar, a coordinate clause is a clause (i.e., a word group containing a subject and predicate) that is introduced by one of the coordinating conjunctions, most commonly and or but.

      Example: "It was apple-blossom time, and the days were getting warmer." (E.B. White, Charlotte's Web. Harper, 1952)

    8. changes

      expected improvement

    1. \(\mathrm{Var}[\nabla_{\xi_i}f(\xi)]\)

      In (7) we have already assumed that \(p(\xi)\) is multivariate Gaussian. The variance is calculated for the term whose expected value needs to be computed in order to get the gradient of \(f(\xi)\) with respect to a single mean parameter \(\mu_i\) of the multivariate Gaussian.

      Since \(f(\xi)\) depends on \(\xi_i\) only via \(f(\xi_i)\), the gradient \(\nabla_{\xi_i} f(\xi)\), whose expected value is to be found in (7), is independent of how many \(\xi_i\)'s there are in total.

    2. as \(\mathrm{Var}\left[\frac{(\xi_i-\mu_i)}{\sigma_i^2}\left(f(\xi)-E[f(\xi)]\right)\right] \sim O(K)\)

      The \(f(\xi)\) is the cause of the \(O(K)\). The term in the bracket is obtained by assuming \(p(\xi|\theta)\) is multivariate Gaussian, and the gradient w.r.t. a single mean \(\mu_i \in \theta\) is being calculated in equation (34). The variance given is the variance of the term whose expected value is being calculated to get the gradient.

    1. One limitation of the ST estimator is that backpropagating with respect to the sample-independent mean may cause discrepancies between the forward and backward pass, leading to higher variance.

      Let's say we use a function \(f\) to compute the loss from the sample \(i \sim z\), where \(z\) is a collection of \(n\) Bernoulli variables with probabilities \(\{p_1, p_2, \ldots, p_n\}\) of turning out 1. Then the gradients of the loss with respect to \(z\), calculated during the backward pass up to \(z\), will actually be gradients at the sampled value \(i \in \{0,1\}^n\). Now, in reality we want to update our \(p\)'s, and hence we copy over these gradients w.r.t. the \(0,1\)'s ("straight through") as the gradients of the loss with respect to the "mean parameters" \(p_j\), for each \(j \in [1,n]\).

      The idea is that if we were to actually backpropagate gradients of the actual loss, that is, to minimize \(y=\sum_{i \in \{0,1\}^n} P(i) f(i)\), the direction in which we would like to move our \(p\)'s would be the component of the vector \(\left[\frac{\partial y}{\partial p_j}\right]_j\) along the plane \(\sum_j p_j=1\). Here \(P(i)\) is the probability of the configuration \(i\) occurring, that is, the product of \(p_j\) over the \(j\) with \(i_j=1\) and \(1-p_j\) over the \(j\) with \(i_j=0\). The \(j^{th}\) partial derivative evaluates to the expected value of \(f(i)\) given that the \(j^{th}\) component is set to \(1\), minus the expected value of \(f(i)\) given that the \(j^{th}\) component is set to \(0\).

      Now, the straight-through estimator, on average, sends a gradient of \(E\left[\frac{\partial f(z)}{\partial z_j}\big|_{z_j=1} \,\middle|\, z_j=1\right]\) with probability \(p_j\) and \(E\left[\frac{\partial f(z)}{\partial z_j}\big|_{z_j=0} \,\middle|\, z_j=0\right]\) with probability \(1-p_j\).

      The first thing to notice is that, as expectation and partial derivative are linear operators, we can move the expectation inside.

      This is a biased estimator of the gradients. Also, the gradients may vary hugely around their biased means, and this problem grows with \(n\).
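
      A minimal PyTorch rendering of the estimator (the usual detach trick; the loss is an arbitrary stand-in for \(f\)): the forward pass sees the hard \(\{0,1\}\) sample, while the backward pass copies the sample's gradient straight onto the means.

      ```python
      import torch

      logits = torch.zeros(5, requires_grad=True)
      p = torch.sigmoid(logits)        # Bernoulli means p_j

      z_hard = torch.bernoulli(p)      # hard {0,1}^n sample, non-differentiable
      z = z_hard + p - p.detach()      # forward: equals z_hard; backward: dz/dp = 1

      loss = ((z - 0.7) ** 2).sum()    # stand-in for f(z)
      loss.backward()
      print(logits.grad)               # gradient of f at the sampled point, copied "straight through"
      ```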

  8. Jun 2021
    1. showing higher agreement between keys and values on the identity of the top-ranked token

      Probably should additionally confirm that in the lower layers many of the 4096 activations are high, while comparatively few are active in the higher layers.

      If that is not the case, then the model's layers may not be operating in the same embedding space.

    2. The vast majority of retrieved prefixes (65%-80%) were associated with at least one identified pattern

      What patterns exist in the remaining ones? Are they adversarial? Is the output on those correct?

    3. We assume that patterns stored in memory cells originate from examples the model was trained on

      They can also originate from the initialization.

    1. \(\rho(h)(x \star \theta)(g)\)

      Only defined when each element of the group can be identified with an element of the domain.

    2. \(\langle x,\rho(g)\theta\rangle=\langle\rho(g^{-1})x,\theta\rangle\)

      It is because the measure in the inner product is invariant to the group under consideration that this holds.

    3. invariant

      Invariant w.r.t. the symmetry group under consideration.

    1. knowledge neurons are activated by knowledge-probing prompts;

      Are we sure about this? Knowledge-probing prompts have only heads.

      Yes, we are sure, because sentences having only heads will not contain the relation, most of the time.

      A more precise formulation is: "knowledge neurons identified (for a fact) based on the activations of neurons on the samples from ParaRel (which are variations of that fact, with masked tails) also get activated by prompts constructed by finding sentences that contain both the head and the tail (of the fact) in natural language data, and masking the tail."

    2. activated

      We check the activations at the location of the masked words in each prompt.

    3. 3

      Random words are masked here too.

    4. 2

      A random word is masked in these prompts.

    5. 1

      The tail entity is masked in these prompts.

    6. knowledge neurons are exclusive to their corresponding factual knowledge.

      How?

    7. Example prompt templates of relations are shown in Table 2.

      Such statistics and examples from training data are a must for any good paper!

    8. 3)

      If we have limited the context to size \(n\) and the size of the vocabulary is \(|V|\), then the first layer (word embedding) can take at most \(|V|^n\) different possible values over the entire training set; at any single position there are \(|V|\) possible word embedding vectors. Since the output of the first self-attention layer at the \(i\)-th location depends on all input positions, there are \(|V|^n\) different possible values for the output at the \(i\)-th location. Also, since the output of the entire self-attention layer depends on its entire input, and its input takes only \(|V|^n\) different values, so does its entire output. That is, there are only \(|V|^n\) distinct sequences that can be output by self-attention and will be input to the FFN.

      The keys of the FFN are \(d\)-dimensional vectors, and the values too. There are \(m\) distinct keys and values, where \(m>d\) is the intermediate dimension of the FFN. These key-value pairs are used to map the \(|V|^n\) possible vectors at a particular location to a new set of \(|V|^n\) vectors.

    9. First, suppressing and amplifying knowledge neurons can control the expressions of the corresponding knowledge, without affecting other facts. Second, we find that knowledge-probing prompts tend to activate knowledge neurons of the specific fact. Third, given the knowledge neurons of a relational fact, the top activating prompts retrieved from open-domain texts usually express the corresponding fact, while the bottom activating texts describe other information.

      The first means that the facts under consideration are contained only in the knowledge neurons under consideration, and nowhere else in the model.

      The second is just the way knowledge neurons are defined/identified initially.

      The third provides some evidence that the set of knowledge neurons under consideration contains information regarding only the particular fact under consideration (the converse direction of the first).

    1. y∆x

      In normal backpropagation, we keep multiplying by the Jacobians of the layers, going backwards through the network, to get the Jacobian of the output w.r.t. the input.

      In this method, we instead keep multiplying by discretized Jacobians (i.e., a Jacobian with \(\frac{\partial y_i}{\partial x_i}\) replaced by \(\frac{y_{i_1}-y_{i_0}}{x_{i_1}-x_{i_0}}\), where \(x_{i_0}, y_{i_0}\) are the values at the reference input).

      Now, we define the matrix obtained by multiplying the discretized Jacobians of two consecutive layers (L1, L2) as the "multiplier matrix" of the transformation from the inputs of L1 to the outputs of L2, and multiply this "multiplier matrix" with the changes in the inputs of L1 to get the "contributions" of the various inputs of L1 in producing the output of L2.

      All of the above are definitions. The discretized Jacobian of \(f(g(x))\), where \(f,g: R^d \rightarrow R^d\), is not equal to the product of the discretized Jacobians of \(f\) and \(g\). Had there been only linear/convolutional layers, they would have been equal, as the discretized gradient is the same as the natural one for linear layers. But this is not true for non-linearities.
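
      A tiny numeric illustration of that last point (my own example): for a ReLU, the discretized "derivative" \(\frac{\Delta y}{\Delta x}\) at a reference differs from the local gradient, so per-layer discretized Jacobians do not compose like true Jacobians.

      ```python
      import numpy as np

      def multiplier(layer, x, x_ref):
          # Discretized Jacobian entry for a pointwise layer:
          # (y - y_ref) / (x - x_ref) instead of the local derivative dy/dx.
          return (layer(x) - layer(x_ref)) / (x - x_ref)

      relu = lambda v: np.maximum(v, 0.0)
      x, x_ref = np.array([1.0]), np.array([-1.0])
      print(multiplier(relu, x, x_ref))   # [0.5], while the local gradient at x=1 is 1.0
      ```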

    2. \(w_i\Delta x_i>0\)

      This is 1 if the actual change in \(x_i\) caused \(y_i\) to increase, 0 otherwise.

    3. contributions: \(C_{\Delta x_i^+\Delta y^+}\)

      This is the contribution of an increase in \(x_i\) to an increase in \(y_i\).

    4. importance

      There are two possible notions of "importance" here.

      If the model doesn't care about the value of an input/activation while it is below a certain threshold, and starts caring about it just after it crosses that threshold, then you expect a sudden jump in the importance score. Here "caring about the value of an input/activation" means: does the output change if we slightly shift that input/activation; in other words, the gradient.

      The other notion is attribution, i.e., how much each feature contributed to the current value of the output. A logical way to define it seems to be grad × input, but since the gradient is discontinuous, so will be any continuous function of it.

      But if we assume that our model is a differentiable function (we are backprop-ing through it, so it is indeed differentiable), and that there is some known attribution of each reference input in producing the reference output, then when we shift the reference input slightly, the reference output shifts too. Now we add to the original attribution of each feature the contribution of that feature's change to the change in the reference output. This contribution is given by grad × \(\delta\)(input). More mathematically, \(\delta(y)=\sum_{i=1}^n \nabla_{x_i}y \cdot \delta(x_i)\), where \(y\) is the reference output. If \(a_i\) was the attribution of the reference input \(x_i\) in the reference output, then after moving slightly, the attribution of the new \(x_i\) is \(a_i+\delta(x_i)\nabla_{x_i}y\). We can integrate this to calculate attributions for features at longer distances from the reference inputs, as sketched below.
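
      A Riemann-sum sketch of that integration (my own rendering of the idea above, not the paper's method; `grad_fn` is a hypothetical gradient oracle):

      ```python
      import numpy as np

      def integrated_attribution(grad_fn, x, x_ref, steps=50):
          # Accumulate delta(x_i) * grad_i along the straight path from the
          # reference input to x: a Riemann-sum form of a_i += delta(x_i) * dy/dx_i.
          alphas = (np.arange(steps) + 0.5) / steps
          avg_grad = np.mean([grad_fn(x_ref + a * (x - x_ref)) for a in alphas], axis=0)
          return (x - x_ref) * avg_grad

      # Example with y = x0^2 + 3*x1, whose gradient is (2*x0, 3):
      grad_fn = lambda v: np.array([2.0 * v[0], 3.0])
      print(integrated_attribution(grad_fn, np.array([2.0, 1.0]), np.zeros(2)))  # ~[4. 3.]
      ```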

    5. 1

      Shouldn't this hold for the summation over the \(x_i\) in each layer, separately?

    1. Proposition 2.2

      As seen in the annotation for equation (2), the cross entropy depends only on the projection of \(p^{*}_{.|s}\) onto the \(d\)-dimensional subspace induced by \(\phi\). This means that the optimal value for \(f(s)\) is the same for any other \(p^{*}_{.|s}\), given that it has the same projection onto the \(d\)-dimensional subspace.

      All the \(p^{*}_{.|s}\) lie in \(|V|-1\)-dimensional simplex.

      All the \(p^{*}_{.|s}\) having the same projection onto the \(d\)-dimensional subspace hence lie on the intersection of this simplex with various \((|V|-d)\)-dimensional hyperplanes (each hyperplane containing all vectors whose projections lead to the same point in the \(d\)-dimensional subspace). This intersection forms a "section" of the simplex of dimension \(|V|-d\) [assuming \(d>1\)].

      As each point in the \(d\)-dimensional subspace was made to correspond to a hyperplane of all the vectors having their projection as that point; each point in the \(d\)-dimensional subspace can also be associated with a probability distribution over the vocabulary(by taking soft-max of the vector corresponding to that point).

      The first point to note is that this mapping maps the \(d\)-dimensional subspace onto (due to the nature of soft-max) a \(d\)-dimensional simplex lying inside the original \((|V|-1)\)-dimensional simplex. Since we can generate any element of the \(d\)-dimensional subspace by choosing an appropriate \(f(s)\), we can generate any element of this simplex by choosing an appropriate \(f(s)\).

      The other important thing is that if this simplex intersects each section (generated previously) in at least one point \(t_0\), then at that point (i.e., when the real distribution is the \(p^{*}_{.|s}(t_0)\) corresponding to \(t_0\)) we can find an \(f(s)\) that makes our predicted distribution match the real distribution \(p^{*}_{.|s}(t_0)\) exactly. Since this is the solution when our predicted distribution is un-restricted (proposition 1), it is also the solution when our predicted distribution is restricted. Moreover, since all points on a section have the same optimal probability distribution for minimizing cross-entropy, we have found the optimal solution for all the distributions in the section containing \(t_0\): all of them have \(p^{*}_{.|s}(t_0)\) as their optimal solution. And hence, by the definition of a section, proposition 2.2 follows.

      Now the only thing that remains to be proved is that the \(d\)-dimensional simplex indeed intersects every section in one and only one point. The first thing to see is that if the simplex intersects each section at all, then it can intersect each one only once: the sections are \((|V|-d)\)-dimensional objects filling a \((|V|-1)\)-dimensional simplex, so if each section is reduced to a single point, the points occupy a \(|V|-1-(|V|-d)=d-1\)-dimensional subspace. So, since we have a \(d\)-dimensional simplex, it can intersect every section at least once.

      The orientation of the \(d\)-dimensional simplex and the \(|V|-d\)-dimensional sections are linked together by \(\phi\). This linkage makes sure that both the things intersect each other at least once. As you change \(\phi\), you end up rotating and shifting around the \(d\)-dimensional simplex as the \(d\)-dimensional subspace rotates and shifts. Correspondingly, this rotation of \(d\) dimensional subspace also causes the space of vectors (having same projections in the \(d\)-dimensional subspace(i.e., the sections)) to rotate and shift correspondingly.

      EDIT 1: The \(d\)-dimensional simplex above is actually a \(d\)-dimensional surface/manifold over a \(d\)-dimensional subset. That is, when you set the probabilities of the first \(d\) words, the probabilities of the rest of the words are automatically determined, using the fact that the probabilities are supposed to be written as the soft-max of a vector in the \(d\)-dimensional subspace.

      EDIT 2: Proposition 2.2 already assumes the existence of the minimum, so the whole of the above procedure is not needed; but the above can be thought of as an attempt to explore the existence, as well as the exact construction, of it.

    2. (2)

      The first term can be written as \(-f(s)^T \phi\, p^{*}_{.|s}\), where \(\phi\) is the \(d \times V\) matrix of output word embeddings, and \(p^{*}_{.|s}\) is a \(V \times 1\) vector of the actual probabilities of the different words in the vocabulary given the context \(s\).

    3. rewrite it as \(\ell_{xent}(f,\Phi) = E_{s\sim p_L}[\ell_{xent,s}(f(s),\Phi)]\)

      Assuming that \(f\) can be as expressive a function as we want (that is, \(f\) can assign any \(d\)-dimensional vector to any \(s\)), we can reduce the problem of minimizing the expected value of the cross entropy between the predicted and actual probability distributions over words given the context, to the problems of finding, independently for each \(s\), the \(d\)-dimensional vector \(f(s)\) that minimizes the cross entropy for that \(s\).

  9. May 2021
    1. we can also interpolate between them.

      That is, we optimize the image with different objectives \(\alpha L_{1} + (1-\alpha) L_{2}\), where \(L_{i}\) is the loss from the \(i\)-th neuron whose activation is to be maximized. For each value of \(\alpha\) we get a different objective and a corresponding optimized image.
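
      A sketch of that interpolation loop (PyTorch; the two "neuron activations" are stand-ins, and a real objective would also include the usual image-regularization terms):

      ```python
      import torch

      def act1(img): return img[:, 0].mean()   # stand-ins for the activations of
      def act2(img): return img[:, 1].mean()   # the two neurons being maximized

      for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
          img = torch.zeros(1, 3, 32, 32, requires_grad=True)
          opt = torch.optim.Adam([img], lr=0.05)
          for _ in range(100):
              opt.zero_grad()
              loss = -(alpha * act1(img) + (1 - alpha) * act2(img))
              loss.backward()
              opt.step()
          # each alpha yields a differently optimized image
      ```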

    1. We note that this metric is not a perfect measure of information content in single units; for example, a unit with a little information about every class would have a low class selectivity index. However, it does measure the discriminability of classes along a given direction.

      For example, consider 2-D vectors belonging to two classes: all the vectors above the X-axis belong to class 1 and all vectors below the X-axis to class 2. Suppose the unit only lights up (with the same activation value, a.k.a. activity) for two vectors of class 1 in the second quadrant and two vectors of class 2 in the fourth quadrant. Then it would have a class selectivity index of 0, but it can still discriminate the classes along the direction of \(y=-x\). That is, if there is another activation that lets us know that (in other words, lights up when) our input vector is somewhere in the second or the fourth quadrant, then we can use our original activation to classify the vectors into the appropriate category.
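
      In code (assuming the index is \((\mu_{max} - \mu_{-max})/(\mu_{max} + \mu_{-max})\) over class-conditional mean activity, as in the paper), the quadrant example above gives exactly 0:

      ```python
      import numpy as np

      def class_selectivity(acts, labels):
          mus = np.array([acts[labels == c].mean() for c in np.unique(labels)])
          mu_max = mus.max()
          mu_rest = np.delete(mus, mus.argmax()).mean()
          return (mu_max - mu_rest) / (mu_max + mu_rest)

      # The unit fires equally often for both classes -> selectivity index 0,
      # even though it separates them along y = -x given one extra unit.
      acts = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
      labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])
      print(class_selectivity(acts, labels))   # 0.0
      ```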

    2. To test networks’ reliance upon random single directions, we added Gaussian noise to all units with zero mean and progressively increasing variance.

      No clamping this time. The addition of noise to a feature blurs that particular feature; the amount of blurring increases with increasing variance, and the relative proportion of blurring decides which single direction we are removing from the input. This addition of noise allows us to calculate the effect of blurring all dimensions together (blurring in any dimension is proportional to the variance of that dimension on the training set).

    1. 2019 THE LOTTERY TICKET HYPOTHESIS

      Sometimes we have to change our hyper-parameters many times to figure out the configuration that leads to good results. We disregard the fact that we may end up overfitting our data, and are satisfied as long as the model generalizes well in the end; we don't care about having any a-priori statistical guarantees for how the model will perform. We just find the best hyper-parameter settings without caring about overfitting, and see how the learned model generalizes.

      Now, what if we could run all the hyper-parameter settings in parallel? We could save a lot of time!

      This paper views the network structure and the initialized weights as hyper-parameters of training, and the hypothesis is roughly that over-parametrized neural networks allow us to check a lot of hyper-parameter settings at once. They provide evidence for the hypothesis by recovering a pruned network from the entire NN (initializing it with the weights it was originally initialized with and retraining it), and showing that this network can perform as well as the original one, which would mean that the other connections and initialized weights were just extra "try-outs" that are ultimately unnecessary.

    1. Exercise

      Lesson: a well-ordered set is a set consisting of a chain of points, which can have a limit point in the direction of increasing value, but not in the reverse direction.

      Another implication is that we can propagate values through induction, along and through infinite sequences, even beyond them, if the direction of propagation matches with the direction in which limit points occur.

    2. Exercise

      Use the fact that if \(x \ne y\), then either \(x<y\) or \(y<x\), but not both. This means that either \(x\) is included in the set of all elements \(>y\), or \(y\) is included in the set of all elements \(>x\), but not both. But \(succ(x)\) and \(succ(y)\) are the same, equal to say \(z\) (since the set is well-ordered, the minimum in the definition of \(succ\) exists, and we can call it \(z\)). The set of all elements greater than \(x\) is then the set of all elements greater than or equal to \(z\); similarly, the set of all elements greater than \(y\) is the same set. So both sets are equal. But earlier we saw that they must be different if \(x \ne y\). This means that \(x=y\).

    3. Successor case

      \(x\) is smaller than any element greater than \(y\), for some \(y\) with \(x > y\).

      If this is true, it means that the set of all elements \(<x\) has a maximum, in which case the \(\sup\) just becomes that maximum, rather than \(x\).

      If this is false, it means that the set of all elements \(<x\) has no maximum, in which case \(x\) becomes the supremum and the limit case holds.

    4. (Limit case)

      \(x\) is smaller than any element greater than or equal to all elements smaller than \(x\).

    5. show that a totally ordered set X enjoys the principle of strong induction if and only if it is well-ordered

      If the set is not well-ordered, it means that there is a subset \(A\) which has no minimum element. We can set \(P(\cdot)\) to be false for all the elements of this set and for any element (\(\in X\)) greater than some element of \(A\). Also, set \(P(\cdot)\) to be true for any \(x \in X\) with \(x<y \;\forall y \in A\).

      This setting of \(P(\cdot)\) satisfies both the property that \(P(x)=\)true if \(\forall y<x, P(y)=\)true, and the fact that \(P(\min(X))=\)true if the \(\min\) exists. But still, not all elements of \(X\) have property \(P\) as true.

    6. This is called the principle of strong induction.

      We need to assume \(P(\min(X))=\)true.

      To prove this, use a proof by contradiction. Assume that for some \(x \in X\), \(P(x)=\)false. Then this means that \(\exists y<x : P(y)=\)false, which means \(\exists z<y : P(z)=\)false, and so on. We will end up at \(\min(X)\)... or we may not, if the sequence \(y, z, \ldots\) tends to some element \(> \min(X)\); and even if the sequence tends to \(\min(X)\), we can't conclude that \(\min(X)\) will be part of the sequence.

      So, a new approach: consider the set of all \(y \in X\) with \(y<x\) and \(P(y)=\)false; this set has a minimum element. This minimum element must be \(\min(X)\), because otherwise, if some \(z\) is the minimum element of that set, then \(z\) is an element such that all elements less than it have \(P(\cdot)=\)true, which means that \(P(z)=\)true, which is a contradiction.

      Lesson: sometimes infinite iteration can be expressed as a set of elements; and as we have clearly defined properties of sets, it is preferable to express the result of an infinite iteration as a set, making the iteration invisible/irrelevant.

    7. formed by gluing all the partial choices together

      First you assume the existence of something known as \(\infty\), which concretizes your notion of infinitely many objects.

      Then you assume the existence of sets of infinitely many objects, while defining them.

      Then you define the union of infinitely many sets (assuming that one exists, in the process of defining).

      Then you assert that the union of infinitely many sets that you have constructed here must equal the underlying family in the axiom of choice, if Zorn's Lemma is true.

    1. recover the sample in the original representation \(z\in R^d\) and obtain \(f(z)\),

      This process of reverting, combined with the choice of our interpretable data representations, constraints what regions around \(x\) can be reached, upon inversion; which in turn constraints the regions around \(x\) from which we can obtain \((z', f(inv(z')))\). That is the interpretation model \(g\) will only be trained to interpret samples obtained from the original one, by perturbing along interpretable directions.

      So, if you are unable to interpret the predictions of the model, you can only conclude that you can't interpret the model in terms of the interpretable data representation you chose. YOU MIGHT HAVE MISSED SOME FEATURE IMPORTANT TO THE MODEL; in which case, you won't be able to tell what exactly the model is doing wrong (i.e., which feature is it assigning weight to, instead of the ones you chose?).

      One also needs to be careful with the design of the \(inv(\cdot)\) function, as it is also instrumental in determining where exactly the perturbed version of \(x\), i.e., \(z'\), will fall. If the \(inv(\cdot)\) function introduces features that are detected by the model \(f\) but not considered in our interpretable data representation, then the model's predictions are actually based on those features, but to our interpretation model \(g\) they appear to be related to the features in the interpretable data representation, so it might learn the wrong things / pick up spurious correlations.
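
      A minimal sketch of this perturb–invert–query loop, in the spirit of LIME (all names here are hypothetical; the interpretable representation is a binary mask over tokens, and inversion is just string reassembly):

      ```python
      import numpy as np

      def explain(x_words, f, n_samples=500, seed=0):
          """Fit an interpretable linear model g around one sample.

          x_words : list of tokens; the interpretable representation is a
                    binary mask z' over these tokens
          f       : black-box model, f(text) -> score
          """
          rng = np.random.default_rng(seed)
          d = len(x_words)
          Z = rng.integers(0, 2, size=(n_samples, d))   # perturbed masks z'
          # "inversion": map each interpretable mask back to a raw sample
          texts = [" ".join(w for w, keep in zip(x_words, z) if keep) for z in Z]
          y = np.array([f(t) for t in texts])           # f(inv(z'))
          # weight perturbations by proximity to the original (all-ones) mask
          w = np.exp(-(((d - Z.sum(axis=1)) / d) ** 2) / 0.25)
          # weighted least squares for g(z') = z' @ coef
          sw = np.sqrt(w)[:, None]
          coef, *_ = np.linalg.lstsq(Z * sw, y * sw[:, 0], rcond=None)
          return coef  # per-token importance, valid only along the chosen directions
      ```

      Note that the coefficients say nothing about features the mask cannot express, which is exactly the caveat above.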

    1. every bounded linear functional is given by the inner product with a unique vector in \(H\)

      Every dot product with an element of \(H\) can be seen as a bounded linear functional.

      Similarly, we can associate a dot product with a bounded linear functional in the following way: consider the subspace of \(H\) consisting of all elements that get mapped to \(0 \in \mathbb{R}\) under the bounded linear functional. Consider the quotient space of \(H\) by this subspace. Now choose an \(f \in H\) such that the linear functional doesn't evaluate to 0 on \(f\).

      Claim: there exists a unique element in this quotient space whose dot product with \(f \in H\) is equal to the value of the functional on \(f\).

      The dot product with this element will equal the evaluation of the bounded linear functional on the other functions in \(H\) as well.
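
      For comparison, the standard construction uses the orthogonal complement of the kernel instead of the quotient space (the two pictures agree, since \(H/\ker L \cong (\ker L)^{\perp}\)). A sketch, for a nonzero bounded linear functional \(L\): pick a unit vector \(g \in (\ker L)^{\perp}\); then $$ f - \frac{L(f)}{L(g)}\,g \in \ker L \;\Rightarrow\; \Big\langle f - \frac{L(f)}{L(g)}\,g,\; g \Big\rangle = 0 \;\Rightarrow\; L(f) = \Big\langle f,\; \overline{L(g)}\,g \Big\rangle \;\; \forall f \in H, $$ so \(\overline{L(g)}\,g\) is the unique representing vector.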

    1. The models are connected because they use normalized latent factors as topic distributions. In other words, latent factors are regularized such that they are also good at explaining the topic distributions in review text.

      VERY INNOVATIVE TYING TOGETHER OF MODELS

    2. global relationships

      That is, present throughout the data. These "what the model is focusing on" relations might not exist for some adversarial samples.

    3. perfect matches for the real-life tasks they are meant to solve.

      First observation. This is the first thing that makes people ask for interpretability. If people properly understood what the models are meant for, they wouldn't ask for so-called "interpretability".

    4. the optimization objective for most supervised learning models is simply to minimize error, a feat that might be achieved in a purely correlative fashion.

      This optimization objective was a result of the model we chose. The model we chose doesn't distinguish between correlation and causation; hence the model learned by minimizing error also doesn't know whether smoking and cancer are merely correlated or one causes the other. So basically our model was too simple! It should have included some component that modelled the process by which one observation can lead to another.

    1. Combining kernels by multiplication

      See Schur's Theorem first. It states that the pointwise product of two positive semi-definite matrices is positive semi-definite.
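 
      A quick numerical sanity check of this (a sketch, not a proof):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def random_psd(n):
          """Random positive semi-definite matrix of the form A @ A.T."""
          A = rng.standard_normal((n, n))
          return A @ A.T

      K1, K2 = random_psd(5), random_psd(5)
      K = K1 * K2  # element-wise (Hadamard/Schur) product, NOT the matrix product

      # all eigenvalues of the symmetric result are (numerically) non-negative
      print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
      ```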

    2. Decreasing the scale-mixture \(\alpha\) will allow for more minor local variations while still keeping the longer scale trends defined by the lengthscale parameter. Increasing the scale-mixture to a large value reduces the minor local variations.

      If you look at (4.20) of the GPML book, you'll realise that decreasing \(\alpha\) means the gamma distribution gives more weight to smaller values of \( \tau = \frac{1}{l^2} \); which means more weight is given to higher values of \(l\), hence flatter distributions are seen with decreasing \(\alpha\). If we make \( \alpha \) small enough, like 0.1 in the given figure, we can completely overwhelm any effect of \(l\); but if we keep \( \alpha \) in the right range, we can use it to modulate the rate of decay of covariance in a much more fine-grained way than is possible with \(l\) alone.

    3. The rational quadratic can be interpreted as an infinite sum of different exponentiated quadratic kernels with different lengthscales, with \(\alpha\) determining the weighting between different lengthscales.

      A non-negative linear combination of positive semi-definite matrices is positive semi-definite (follows directly by plugging the combination into the definition of positive semi-definiteness).

      If we consider exponentiated quadratic kernels (EQKs) for different lengthscales, define a probability distribution over all possible lengthscales of the EQK (which in turn defines a probability distribution over the set of all EQKs), and then take the expected value of the EQK under this distribution, we get another positive semi-definite kernel.

      In case the distribution over lengthscales is a gamma distribution, with rate parameter \(\beta = \frac{1}{l^2}\) and shape parameter \(\alpha\), the expected value over all the EQKs leads to the rational quadratic kernel, with the lengthscale \(l\) and scale-mixture \(\alpha\) as specified in the gamma distribution.

      For derivation see page 87 of GPML book.
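
      A Monte Carlo sketch of this mixture view (the parameterisation is assumed here: \(\tau \sim \mathrm{Gamma}(\mathrm{shape}=\alpha,\ \mathrm{rate}=\alpha l^2)\), so that \(E[\tau]=1/l^2\); numpy's gamma takes a scale \(=1/\mathrm{rate}\)):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      alpha, l, r = 2.0, 1.5, 0.8          # scale-mixture, lengthscale, distance

      # closed-form rational quadratic kernel (GPML eq. 4.20)
      k_rq = (1 + r**2 / (2 * alpha * l**2)) ** (-alpha)

      # mixture of exponentiated quadratic kernels exp(-tau * r^2 / 2),
      # with tau = 1/lengthscale^2 drawn from a gamma distribution
      tau = rng.gamma(shape=alpha, scale=1 / (alpha * l**2), size=1_000_000)
      k_mix = np.exp(-0.5 * tau * r**2).mean()

      print(k_rq, k_mix)   # the two values agree up to Monte Carlo error
      ```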

    1. \(p_{k+1}(\mathbf{n})\,G_0\)

      This term is added after the summation, to account for the probability that a new (\(k+1\)-th) species comes along in the \(n+1\)-th sample.
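
      A sketch of the resulting predictive step, specialised to the Chinese-restaurant/Dirichlet-process case where \(p_j \propto n_j\) and the new-species term has weight \(\alpha/(\alpha+n)\) (the helper base_draw, standing in for a draw from \(G_0\), is hypothetical):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def next_sample(counts, atoms, alpha, base_draw):
          """One predictive draw in a species-sampling model (CRP special case).

          counts    : sizes n_1..n_k of the species seen so far (mutated in place)
          atoms     : the distinct species values seen so far (mutated in place)
          alpha     : concentration parameter
          base_draw : callable returning a fresh atom from the base measure G_0
          """
          n = sum(counts)
          # existing species j with weight n_j/(alpha + n); the trailing alpha
          # is the G_0 term: a brand-new (k+1)-th species, weight alpha/(alpha + n)
          probs = np.array(counts + [alpha]) / (alpha + n)
          j = int(rng.choice(len(probs), p=probs))
          if j == len(counts):
              counts.append(1)
              atoms.append(base_draw())
          else:
              counts[j] += 1
          return atoms[j]
      ```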

    2. \(p_j\)

      \(p_j\) is a function that takes in a sequence of cluster sizes, one for each of the various possible species, and outputs a number corresponding to the weight of the \(j\)-th species.

    3. \(\delta_x\)

      It is a vector with a 1 at the \(j\)-th position and 0 elsewhere.

    4. exchangeable sequence of r.v.’s

      In statistics, an exchangeable sequence of random variables (also sometimes interchangeable) is a sequence X1, X2, X3, ... (which may be finitely or infinitely long) whose joint probability distribution does not change when the positions in the sequence in which finitely many of them appear are altered.

    5. model

      The previous line could be considered sufficient as the definition of a model, but that model isn't enough for us to do Bayesian inference; hence we restrict the model.

    6. positive

      That is, non-zero.. which in turn means that the family of distributions considered is very large.

    7. these competing goals are not always honestly spelled out, and the resulting uncertainties are not fully described.

      Although we may state that these two random variables are independent, or that this is the family of distributions that the actual distribution belongs to, we tend to forget/ignore what trade-offs we made to arrive at those assumptions.

    8. Often the choice of the final inference model is a compromise of an accurate representation of the experimental conditions, a preference for parsimony and the need for a practicable implementation.

      That is, in practice we first limit our models to some parametric family of distributions, with a finite number of parameters to infer. In restricting ourselves to this parametric family, we often assume independence of certain random variables, and we assume that the data follows one particular type of distribution. All these assumptions can be thought of as compromising an accurate representation of the experimental conditions (two random variables may only be very nearly independent, but we assume them to be independent nonetheless).. for the sake of "parsimony and practicable implementation".


  10. Mar 2021
    1. all: say_hello generate

      Unlike a declaration, this runs all its pre-requisites, even though they may not be files.

    2. .PHONY: all say_hello generate clean

      Like a declaration. Declaring all, say_hello, generate and clean as .PHONY tells make that none of them is a file, so make won't look for files with those names when deciding whether to run their recipes.

      For ordinary (file) targets, make only runs a recipe when the target is missing or one of the pre-requisites has been updated; phony targets are always considered out of date.

  11. cseweb.ucsd.edu
    1. hard to invert

      That is, at some non-negligible fraction of all possible binary strings of \(k\) bits, the probability of getting the inverse by following the coin tosses of \(A\) is a negligible function of \(k\).

    2. polynomial fraction of the k-bit composite integers

      \(k\)-bit composite integers are composite integers in the range \([2^k, 2^{k+1})\) .

      If you pair up primes like \((2^2-1, 2^{k-1}-1), (2^3-1, 2^{k-2}-1), \ldots\) and multiply the elements of each pair together, you'll get composite numbers that are products of two primes and lie in \([2^k, 2^{k+1})\).

      So, the number of \(k\)-bit composite integers in \([2^k, 2^{k+1})\) that can be written as a product of two primes is \( \ge \frac{k}{2}\), which is a polynomial in \(k\).

    3. For every

      No matter what algorithm the adversary comes up with.. the inequality must hold for each one of them.

    4. adversary

      The PPT \(A\).

    5. (1)

      This also means that the length of the output \(f(x)\) is at most a polynomial function of the length \(k\) of the input \(x\). But it can be a lower-order function too, e.g., \(|y| = \log(|x|)\); so, to make sure that \(A\) is a PPT w.r.t. the size \(k\) of the input \(x\) to the one-way function, we add an extra \(k\) symbols to the input of \(A\) (i.e., \(1^k\)).

    6. N

      The natural number corresponding to the number of symbols in the output of the one-way function.

      \(\nu\) maps the size(\(n\)) of an element in the domain of a one-way function, to the average probability of finding the inverse of f(a string of size \(n\)) using a PPT algorithm.

      Requiring that \(\nu\) be negligible means requiring that the probability of finding an inverse decreases faster than the inverse of any polynomial function (of the input size of the one-way function).
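
      A tiny numeric illustration (the \(k^3\)-guess inverter and the comparison polynomial are made up for illustration): an inverter making \(k^3\) random guesses at a \(k\)-bit preimage succeeds with probability about \(k^3/2^k\), which eventually falls below any fixed inverse polynomial.

      ```python
      # nu(k) = k^3 / 2^k is negligible: it eventually stays below 1 / k^10
      for k in (16, 32, 64, 128):
          nu = k**3 / 2**k
          print(k, nu, nu < 1 / k**10)   # flips to True at large k and stays True
      ```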

    7. input length t

      The input to \(\nu\) here is the output length of the one-way function.

    8. worst-case hardness is a poor measure of security. Security requires hardness on most cases or at least average-case hardness

      This means that the non-deterministic TM should branch out into multiple states at almost every one of the polynomial number of steps it takes, on almost every input string.

    9. machine

      Turing machine, that is.

    10. n

      This is the integer being factored.

      \(\log n \approx\) the number of bits required to represent \(n\).

      Factoring a number \(n\) has (roughly) exponential complexity in terms of the square root of the number of bits required to store \(n\).

      Hence the difficulty of the factoring problem increases exponentially w.r.t. how long (in terms of bits) our chosen number \(n\) is.

    11. (x, p) = 1

      \(GCD(x,p)=1\)

    1. really think through what this definition is saying

      We can see it as follows:

      An efficient certifier will exist iff

      You can find \(2^t=2^{p(|s|)}\) polynomial-time algorithms \(\{B_i(s)\}_{i=1}^{i=2^t}\) such that at least one of the \(B_i(s)\) outputs "yes" iff \(s \in X\).

      You can choose between these polynomial-time algorithms based on the value of the certificate \(t\): each \(B_i\) corresponds to running the certifier with one fixed certificate.

      So, in effect, we are saying that you must be able to check whether \(s \in X\) using at most \(2^{p(|s|)}\) polynomial-time algorithms.

      This is equivalent to the other definition of NP, which defines it as the set of all problems solvable in a polynomial number of steps using non-deterministic Turing machines. Each step of the NTM can result in branching. Each of the branches leading to some leaf (the state the branch reaches upon finishing reading the string) corresponds to one of the \(2^{p(|s|)}\) algorithms. Notice that if we replace \(p(|s|)\) with \(\log_{2}(a) \times p(|s|)\) we can change the base of the exponent; in particular, it can be matched with the maximum branching factor \(a\) (a constant for a given NTM) of the NTM.
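
      A toy sketch of this picture for SUBSET-SUM (hypothetical code, not from the text): certifier plays the role of the polynomial-time \(B_i(s)\) for one fixed certificate, and deciding membership without a certificate in hand amounts to trying all \(2^{p(|s|)}\) of them.

      ```python
      from itertools import product

      def certifier(instance, certificate):
          """Polynomial-time check: does the chosen subset hit the target?"""
          numbers, target = instance
          return sum(x for x, chosen in zip(numbers, certificate) if chosen) == target

      def decide(instance):
          """Exponential search: try all 2^n certificates, accept iff one works."""
          numbers, _ = instance
          return any(certifier(instance, c) for c in product([0, 1], repeat=len(numbers)))

      print(decide(([3, 34, 4, 12, 5, 2], 9)))   # True: 4 + 5 = 9
      ```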

    2. set of strings

      Forms a language over the alphabet \(\{0,1\}\)

  12. ajc.maths.uq.edu.au
    1. S-FIRE is NP-complete, even if \(S\) is the set of leaves of a tree of maximum degree three

      \(S\) can be broken down into disjoint subtrees of the whole tree. If there are \(n\) subtrees, we need to protect (at worst) \(n\) vertices, just on top of the \(n\) roots of the \(n\) subtrees. There are \(n!\) possible orderings in which we can choose to defend the nodes.

    1. do not incorporate a measure of runtime

      Maybe their compiler will skip some computations/make them less complex based on the measure of runtime of the compiled program.. Let's read on..

    2. target-dependent

      Dependent on the GPU/TPU/end device you will run the code on. XLA does linear-algebra-specific optimization that is target-independent.

    3. white-box

      That is, the entire code for executing commands on the GPU is known.

  13. Feb 2021
  14. ajc.maths.uq.edu.au
    1. QUESTION

      It is clever how the author abstains from stating that the sequence \(\{d_i\}\) is the sequence of nodes saved.

      Asserting the existence of this (finite) sequence makes sure that there are nodes that can be saved at every time step.

      The finiteness of \(t\), combined with conditions \((ii)\) and \((iii)\), implies that \(k\) vertices have been saved at time \(t\), while we defended 1 vertex per time step.

      The important thing to note is that this sequence will exist iff we have chosen to save one vertex per time step. The backward direction is easily proved by taking \(\{d_i\}\) as the sequence of nodes saved. The forward…

    1. for all

      It is interesting to note that this is not "almost everywhere" !!

    2. almost all \(x \in \mathbb{R}^n\)

      at the remaining \(x\), there are singularities. These may be removable, or not.

    3. a.e. \(x \in E\)

      cannot be extended to "for every \(x \in E\)", as \(f(x)=\infty\) may be true for some \(x \in E\); and we can't claim \(\infty = \infty\) or \(\infty < \infty\) at that point.

    4. \(\|f\|_p\)

      missing multiplication by \(p\) here.

    1. The system (6)

      The \((x(t), y(t), z(t))\) satisfying (6) will be the line passing through \((x_0, y_0, z_0)\) and lying on the Monge cone of all the planes with normal \((p, q(p), -1)\) passing through \((x_0, y_0, z_0)\). The tangent plane to the cone in which the line lies will depend on the particular value of \(p\) plugged into the equations.

      Suppose we have our solution \(z=u(x,y)\) . Then we know \(p, q\) at every point.

      Now, we actually look towards finding the characteristic curves.. Note that the \((x(t), y(t), z(t))\) we have found just tells the direction in which the tangent to the characteristic curve through \((x_0, y_0, z_0)\) should point.

      Now let the characteristic curve be given by \((x(t), y(t), z(t))\), which are unknown right now. These \(x(t), y(t), z(t)\) will give a new curve through \(\mathbb{R}^3\) which may not be a straight line.

      But still, at each point \(s=(x(t_0),y(t_0),z(t_0))\), with partial derivatives \( (p(t_0), q(t_0)) \), the tangent to this new curve must point in the characteristic direction at \(s\). Also note that since we have assumed the existence of a solution, the partial derivatives at each point exist. Now our problem is to find all five unknowns \((x, y, z, p, q)\) along the curve; see the system below.
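
      For reference (a standard fact about first-order PDEs, not derived in this excerpt): for \(F(x,y,z,p,q)=0\), the five unknowns evolve along a characteristic according to the Charpit–Lagrange system $$ \dot{x} = F_p, \quad \dot{y} = F_q, \quad \dot{z} = p F_p + q F_q, \quad \dot{p} = -(F_x + p F_z), \quad \dot{q} = -(F_y + q F_z). $$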

    2. generator of the Monge cone

      Also called the "line of contact" with the surface (although it may share only a single point with the surface, where it is tangent).

      This line is the \(\gamma_{p_0}\) for some value \(p_0\) of \(p\), at the point \((x_0,y_0,z_0)\).

      This \(\gamma_{p_0}\) gives the direction in the tangent plane along which : $$ \frac{ \partial z }{ \partial p }(p_0) = 0 $$

    3. field of cones

      This solves our problem of having multiple tangent planes at every point.

      It is solved by constructing another surface \(N_s\) at every point \(s\), such that there are multiple tangent planes to \(N_s\) at \(s\), and only those planes that are possible tangent planes to \(z=u(x,y)\) are tangent planes to \(N_s\) at \(s=(x,y,z)\). This \(N_s\) is a generalized cone, called a Monge cone, and \(s\) corresponds to the vertex of the cone. Now, instead of choosing a small 2-D element of the tangent space at every point,

      you have a cone at every point: choose a small 1-D line (or a small 2-D surface?) (infinitely close to the vertex of the cone) from the cone at each point, and then merge these 1-D lines together to create a 2-D surface.

    4. possible tangent planes

      at \((x_0, y_0, z_0) \)

      In case of linear PDE, the directional derivative at each \((x_0,y_0)\) was known. So, we could propagate initial values along the characteristic curves. (Assuming one and only one characteristic curve through every point)

      In case of quasi-linear PDE, the directional derivative at each point was known in terms of the surface. That is, each surface had a corresponding directional derivative field. In other words, the directional derivative was known at each \((x_0, y_0, z_0)\). We had to join together characteristic curves to form a surface. (Assuming one and only one characteristic curve through every point)

      Here, we know a set of possible tangent planes at each \((x_0, y_0, z_0)\). Each tangent plane will contribute a small 2-D element to the surface and we shall combine these tangent spaces to form a surface. But the problem is that we have multiple tangent planes at a single point.

    5. (6)

      The solution to these 3 equations at \(p=p_0\) gives \(\gamma_{p_0}\).