404 Matching Annotations
  1. Nov 2021
    1. Note

      Important point.

    2. predictions

      \(\hat{P}\) values are grouped into intervals
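
      A minimal sketch of that grouping (function name and bin count are my own; the usual reliability-diagram/ECE recipe): bin each confidence \(\hat{P}\) into equal-width intervals, then compare average confidence against empirical accuracy per bin.

      ```python
      import numpy as np

      def calibration_bins(confidences, correct, n_bins=10):
          # Group \hat{P} values into equal-width intervals; per bin, report
          # (interval, mean confidence, empirical accuracy).
          edges = np.linspace(0.0, 1.0, n_bins + 1)
          rows = []
          for lo, hi in zip(edges[:-1], edges[1:]):
              mask = (confidences > lo) & (confidences <= hi)
              if mask.any():
                  rows.append(((lo, hi), confidences[mask].mean(), correct[mask].mean()))
          return rows  # |mean confidence - accuracy| per bin gives an ECE-style score
      ```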

    3. joint distribution

      \(\hat{P}\) is also a random variable, as its value may differ when the same input is passed repeatedly through the same model; for example, due to dropout running at test time, or some other source of randomness inside the model.

    1. highly variable representations of syntax

      The models learn representations of syntax during pre-training. Fine-tuning on the MNLI dataset changes the representations (mostly of the last few layers) to fit that dataset. The process of transferring knowledge from the NSP task to the MNLI task, by fine-tuning the CLS position on the MNLI dataset:

      1. sometimes causes the model to get "caught" in wrong simple heuristics.
      2. sometimes does "something" to the model so that it performs around 50% on HANS.

      It could be that the heuristic is built within the model due to NSP, and sometimes MNLI training allows you to overcome it, sometimes not.

      Or it could be that the MNLI training collapses the models to these heuristics?

    2. or to random luck in the choice of the model’s initial weights.

      Or maybe to both.

    3. low-bias learner

      The bias of the bias-variance tradeoff.

    1. Figure 8.

      Maybe worth trying: slightly randomly shift each model in the deep ensemble (by running SGD on two or three training batches), add those shifted copies to the ensemble too, and then compare performances, to better ablate the improvements that the authors attribute to ESPRO.
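
      A sketch of this ablation (PyTorch assumed; all names are mine):

      ```python
      import copy
      import torch

      def jittered_ensemble(models, loader, loss_fn, copies=2, steps=3, lr=1e-3):
          # For each trained member, make `copies` slightly shifted clones by
          # running SGD on a few fresh batches, and add them to the ensemble.
          out = list(models)
          batches = iter(loader)
          for model in models:
              for _ in range(copies):
                  clone = copy.deepcopy(model)
                  opt = torch.optim.SGD(clone.parameters(), lr=lr)
                  for _ in range(steps):
                      x, y = next(batches)
                      opt.zero_grad()
                      loss_fn(clone(x), y).backward()
                      opt.step()
                  out.append(clone)
          return out
      ```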

    2. shows that we are able to discover simplexes in parameter space containing models that lead to diverse predictions, meaning that we can ensemble within a simplex and gain some of the benefits seen by deep ensembles (Lakshminarayanan et al., 2017).

      Very counter-intuitive. This means good alternate ways to solve a problem lie nearby: a model that identifies dogs by their tongues lies near one that identifies dogs by their upright tails.

      Possibly because of the hierarchical nature of neural networks? Once the individual parts of an image have been identified, you can change the weights slightly to decide whether the prediction of an image containing a dog should depend on the image having a tail or a tongue.

      Possibly worth checking which weights change the most as we move inside the simplex: top-layer weights or bottom-layer weights?

    3. We can continue adding new modes until we reach k = 11, when the volume collapses to approximately \(10^{-4}\), from a maximum of \(10^5\)

      Let's say there are two 2-D low-loss surfaces connecting \(w_1\) and \(w_2\). If \(\theta_1\) lands on surface 1 after being tuned to reduce loss on the polygonal connector \(w_1 \rightarrow \theta_1 \rightarrow w_2\), we can be sure that \(\theta_2\) also lands on surface 1, as our loss function moves \(\theta_2\) in the direction that minimizes expected loss over the entire triangular surfaces \(\Delta w_1\theta_1\theta_2\) and \(\Delta w_2\theta_1\theta_2\).

    4. at least 10

      Why only a lower bound? Can we find an upper bound too?

    5. both low loss parameter settings

      As soon as we add a point outside the simplex, a point chosen at random from the volume will, almost everywhere, be a point of high loss.

      For example, consider a triangle of low-loss parameter settings such that the 3-D space containing the triangle consists entirely of high-loss parameters. As soon as you consider any pyramid with the triangle as its base, almost every point in the pyramid is a point of high loss.

    6. volume

      We calculate a higher-dimensional volume at higher \(k\).

    7. b

      Projection of a 9-dimensional simplex into 3-dimensional space?

    8. Note that in this case not all modes are in shared simplexes with all connecting points.

      How to choose which modes form simplexes with which connecting points?

    9. All samples from the SWAG posterior lie in the region of low loss, as do the sampled connecting paths, indicating that there is indeed an entire volume of connected low loss solutions induced by the SWAG posterior over θ.

      Does the color correspond to the loss of the projection, or to the loss of the actual SWAG sample?

      Probably the loss of the actual SWAG samples, as the loss of points in the space spanned by \(\theta_0\), \(w_0\), \(w_1\) doesn't look like this, as seen in Garipov et al. (2018).

    1. In this instance, we expect the distance between turning points to be less than the distance between endpoints

      Why?

    2. coincide

      The losses may differ (by a constant coefficient) if the "linear in \(t\)" relation doesn't have \(t\) with a coefficient of 1. We need \(\|\phi_\theta'(t)\|=1 \;\forall t \in [0,1]\) for the two losses to be exactly equal.

    3. uniform distribution

      The expectation of the uniform distribution on the "curve \(\phi_\theta\)" is \(\hat{l}(\theta)\).

      A uniform distribution on the "curve \(\phi_\theta\)" means the probability of being in each unit-length segment of the curve is the same for every unit-length segment of the curve.
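
      Side by side, in my own notation (\(\mathcal{L}\) the loss, \(\operatorname{len}\) the curve length):

      ```latex
      % Uniform in the parameter t:
      \hat{\ell}(\theta) = \mathbb{E}_{t \sim U[0,1]}\big[\mathcal{L}(\phi_\theta(t))\big]
                         = \int_0^1 \mathcal{L}(\phi_\theta(t))\,dt
      % Uniform on the curve itself (arc-length measure):
      \frac{1}{\operatorname{len}(\phi_\theta)} \int_0^1 \mathcal{L}(\phi_\theta(t))\,
          \lVert\phi_\theta'(t)\rVert\,dt,
      \qquad \operatorname{len}(\phi_\theta) = \int_0^1 \lVert\phi_\theta'(t)\rVert\,dt
      % The two coincide exactly when \lVert\phi_\theta'(t)\rVert = 1 for all t (unit speed).
      ```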

    1. First, it suffers from the herd effect. If there are many clients waiting to acquire a lock, they will all vie for the lock when it is released even though only one client can acquire the lock.

      That is, all waiting processes that have registered watches on the ephemeral znode will be notified of its deletion, and will accordingly try to create the same znode again. Even if one client has already created the znode after the deletion of the first one, the remaining clients will try nonetheless.
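
      A sketch of the standard fix with sequential ephemeral znodes (real kazoo API, but paths and structure are my own; error handling omitted): each waiter watches only its immediate predecessor, so a release wakes exactly one client instead of the whole herd.

      ```python
      import threading

      from kazoo.client import KazooClient

      def acquire_lock(zk: KazooClient, lock_dir: str = "/lock") -> str:
          zk.ensure_path(lock_dir)
          path = zk.create(lock_dir + "/n-", ephemeral=True, sequence=True)
          me = path.rsplit("/", 1)[1]
          while True:
              waiters = sorted(zk.get_children(lock_dir))
              if waiters[0] == me:
                  return path                  # lowest sequence number holds the lock
              predecessor = waiters[waiters.index(me) - 1]
              gone = threading.Event()
              # Watch only the znode just before ours; only its deletion wakes us.
              if zk.exists(lock_dir + "/" + predecessor, watch=lambda event: gone.set()):
                  gone.wait()
      ```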

  2. Oct 2021
    1. For example, there could be a non-deterministic result if an application/OS in a VM is reading a memory block at the same time a disk read is occurring to that block.

      How??

    2. simultaneous disk operations that access the same disk location can lead to non-determinism.

      For example, if the primary tries to write to the same location in a file in a loop: as disk operations are non-blocking, we don't know in what order all the writes will actually be done.

      The VMware FT will detect any such operations on the primary's side and force them to execute sequentially, then ask the backup's VMware FT to execute them in the same sequence.

    3. try associated

      Additionally, if a client doesn't receive an output indicating that the state change it intended for the server has been done, the client can't assume that it hasn't been done. It has to consider that the state change may have been done and the packet acknowledging it lost on the way; so it needs to query the current state again, to check whether the change it intended has been made.

      This also helps avoid confusion in case the primary fails after the backup gets updated, but before it receives the ACK from the backup.

    4. point of the last log entry. However, suppose a failure were to happen immediately after the primary executed the output o

      Difficult to ensure if the backup VM is quite far away and there are lots of log entries to send: we would have to wait for all the log entries to reach the backup VM.

  3. storage.googleapis.com
    1. Leases

      To avoid "Split Brain". The idea is that both the master and primary know that uptil a point(the lease expiration time), even if communication to primary(from Master) fails, the Master assumes that the Primary is still alive & well. And the Primary, continues to act as though it is primary. Once the lease expiration time is over, the primary will stop acting as Primary if its lease is not extended.

      The master must wait for lease to expire before it can designate a new primary.

      This prevents "split brain" problem, which would have occurred if the Master had assumed that the Primary was dead(upon losing conversation with it) and designated a new primary immediately. Then, the clients who knew the old primary, will continue to do work with it, while the clients who know new primary, will ask it to do work, leading to conflicts.

    2. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder.

      If some client thinks that some chunkserver is the primary (because that info exists in its cache) and sends a write request to that chunkserver, the chunkserver will reply saying it doesn't have the lease. Hearing this, the client will ask the master who holds the lease.

    3. file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas.

      The only way clients may end up writing something they didn't intend.

    4. Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

      The only way clients can get slightly wrong information while reading.

    5. checkpoints

      For example, after a client has tried to write, say, 2 MB of data to a file, and the write completed successfully, it will next write a checksum for that 2 MB of data to the same file. If there was a concurrent write by another client to the same location where the 2 MB of data was to be written, the write is considered unsuccessful.

      Any reader will only read data that is followed by a checksum.

    6. atomically

      Atomicity is guaranteed by the atomicity of the master's file namespace operations.

    7. chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes.

      The only way to lose data.

  4. Aug 2021
    1. but it can also be trained to maximize entropy over all valid insertions for robustness.

      Autoregressive models maximize the probability of only one possible sequence of insertions, i.e., the one corresponding to inserting one token at a time, left to right.

    1. tangent planes to the surface (3) and the envelope coincide.

      Read : "tangent plane to the surface" and "the envelope". Not : "tangent plane to the surface" and "the tangent plane to the envelope".

    1. sentence

      That is, they are checking for the presence of "MajorClaim", "Claim" and "Premise" in each full-stop-separated sentence in the essay. They don't check whether they begin at the same token in the sentence, or whether a span is a continuation of the previous sentence or marked as a different claim/premise.

    1. Prompt Engineering

      Choosing what and where (in what context) we want the model to predict.

    2. Answer Engineering:

      Choosing which possible outputs of the model will be considered valid answers.

    3. Sea Change

      Profound change.

    1. the optimization directions of the first few epochs are divergent from the final optimization direction.

      This means that with random initialization we are not sure whether locally optimal choices are globally optimal, but pre-training makes sure that they are. (That is, it makes sure that an almost straight-line path remains between the pre-trained weights and the optimal weights for the downstream task. This is basically due to the relation between the pre-training and downstream tasks.)

    2. The optimization process also converges faster

      Notice the long first arrow in the bottom graphs of Fig. 3. All arrows are probably plotted after an equal number of steps.

    3. path

      The straight line path from (0,0) to (0,1) or (1,0).

    1. fiction

      BERT is trained on books; maybe the positive \(\Delta\) here is related to that?

    2. closely aligned tasks

      I think there are two kinds of "closely-related tasks": "subset tasks" (like those we can get from prompt engineering, where the class prediction of one task can be used to completely determine the class on another task) and "overlapping tasks" (where the class of one task only defines a prior over the class predictions for another task).

      Here we are in the territory of "overlapping tasks" and neglecting the other kind.

      While there is no need to change representations in the case of subset tasks, there is a need to change representations in the case of overlapping tasks. Heated training allows for this, and is hence better for such "closely aligned tasks".

      But the results show that if the overlap between the tasks is not sufficient, heated training may actually destroy useful information present in the embeddings (which is preserved in frozen training). Is it actually destroying information useful to the downstream task? Or is the better performance in the frozen setting due to the parameters of ESIM?

      A useful comparison could be frozen training (with ESIM) followed by heated training (with the ESIM learned during frozen training) for the STS tasks. This is done in the "gradual unfreeze" settings below.

    3. A confounding factor may be the suitability of the inductive bias of the model architecture for sentence pair tasks

      Another factor is the data the model was trained on. The increase in BERT's frozen performance over ELMo's frozen performance is mainly due to these factors.

    4. When extracting features, it is important to expose the internal layers as they typically encode the most transferable representations.

      This is how they differ from probing. Very nice point.

    5. LM pretraining has also been successfully applied to sentence level tasks.

      The gap between frozen and heated settings will be large compared to the gap between frozen and heated settings for other pre-training objectives like NSP (when that objective and the adaptation task are related). The LM objective is expected to be quite far from the adaptation task.

    1. Pinsker’s inequality

      Remember to use this anywhere you want to convert between an algebraic difference in probabilities of events under different distributions and the KL divergence/cross-entropies of the distributions.

    2. \(\ell_{xent,s}(p_{\cdot|s})-\ell_{xent,s}(p^*_{\cdot|s})\)

      This is \(\epsilon\).

    3. \(\|v\|_\infty^2\)

      This here is \(B^2\).

    4. \(\ell_{\mathcal{T}}(\{p^*_{\cdot|s}\},v^*)\)

      This quantity is \(\tau\)

    5. Jensen

      Used only to exchange the square and the root, so that Pinsker's inequality without the root can be used.

    6. a \(\gamma(p_{\mathcal{T}})\)-fraction

      An initial result is that \((\gamma (p_{\mathcal{T}}))^{-1} l_{xent}(\{\mathbf{p}_{.|s}\}) = (\gamma (p_{\mathcal{T}}))^{-1} E_{s \sim p_L} [ E_{w \sim \mathbf{p^*}_{.|s}} [-\log(\mathbf{p}_{.|s}(w))] ] \ge E_{s \sim p_{\mathcal{T}}} [ E_{w \sim \mathbf{p^*}_{.|s}} [-\log(\mathbf{p}_{.|s}(w))] ]\)

      For proof, observe that to obtain the expectation over \(s \sim p_{\mathcal{T}}\) from the expectation over \(s \sim p_L\), we need to multiply each term in the expectation for \(s \sim p_L\) by \(\frac{p_{\mathcal{T}}(s)}{p_L(s)}\). We can instead multiply each term by \( (\gamma (p_{\mathcal{T}}))^{-1} = \max_{s \in S} \frac{p_{\mathcal{T}}(s)}{p_L(s)} \) to get an upper bound on the expectation over \(s \sim p_{\mathcal{T}}\). We can replace the innermost expression \(-\log(\mathbf{p}_{.|s}(w))\) with \(-\log(\mathbf{p}_{.|s}(w))+\log(\mathbf{p^*}_{.|s}(w))\). This allows us to get the same inequality on the deviation from the optimal cross entropy, \(\epsilon\), viz., \(\epsilon_{\mathcal{T}} \le (\gamma (p_{\mathcal{T}}))^{-1}\epsilon_L\)

    7. 5

      Only exists when \(\mathrm{support}(p_{\mathcal{T}}) \subseteq \mathrm{support}(p_L)\). Similar to \(\inf_{s \in S} \frac{p_L(s)}{p_{\mathcal{T}}(s)}\).

    8. \(s\) for every \(s \in \mathrm{support}(p_L)\)

      This means we can only match, for the distribution obtained from each context, on a \(d\)-dimensional subspace. Moreover, the \(d\)-dimensional subspace is the same for every context. It may be useful to let it vary with context, while keeping \(d\) the same.

    9. xent

      In reality, just the self-entropy of \(\mathbf{p^*}_{.|s}\).

  5. Jul 2021
    1. per-component

      They consider the PDTB parser as one component and their EDU segment generator as another. This terminology ignores the various components inside the PDTB parser. Need to confirm whether this is what the author intended.

    2. The gold standard parsing and EDUs boundaries

      That is, the PDTB parser (which identifies spans and the relations between them based on a parsing of the sentences in the text) is given the gold parsing of the sentences in the text and the gold EDUs.

      The authors probably modify the "Argument Extractor" to predict Arg1 as one of the EDUs, and parse each EDU individually. Or probably they choose the EDU which has maximum overlap with Arg1. Need to confirm.

    3. tasks

      respectively

    1. subtree nodes

      Arg1, Arg2 spans can only correspond to the subtrees rooted at the nodes of the constituency parse tree.

    2. The span to which the connective is syntactically attached

      That is, when we do dependency parsing (or just use the one given in PTB), the span in which the connective's head lies is called \(Arg2\) and the other one \(Arg1\).

      This is a link between dependency parsing and discourse parsing/argument mining. Can try to extend/generalize this kind of link further!

    1. This shows when one compares the number of Sup1 or Sup2 annotations on explicit discourse relations, which were annotated first, with the number of such annotations on implicit discourse relations, which were annotated on a subsequent pass: 1,571 explicit relations were annotated with supplementary information, whereas only 126 implicit relations were, despite nearly equal numbers of both.

      i.e., annotators became lazy after the first pass.

    2. arguments

      *distinct arguments

    3. nevertheless likely that such discourse relations are under-annotated in the PDTB and should be addressed.

      Because they'd have been annotated as AltLex, in which case they haven't been grounded properly (unlike implicit relations, which were grounded by adding in extra words, as in example (2)).

    4. deictic

      Relating to or denoting a word or expression whose meaning is dependent on the context in which it is used (such as here, you, me, that one there, or next Tuesday).

    1. 45c9c7c6Figure

      c6, c7, c9 are just random ids with no meaning, derived from the XML representation. For example, see here. The actual type of an edge is specified in the "type" attribute of the <edge> XML tags.

  6. www.ling.uni-potsdam.de
    1. The segment status we need for argument mining is less genre-dependent and instead tailored to the argument notation or schema we want to use.

      In argument mining, a segment can correspond to a connected sub-graph of the argument complex, constructed from contiguous ADUs. Construct a new graph with each sub-graph reduced to a single node. Each of these nodes can be classified as belonging to a broad set of classes, depending on the local structure around the node in the graph. The set of classes becomes more specific and expanded as we consider more and more structure around the sub-graph.

      In the simple case, where connections in the argument complex are sparse, the detected sub-graphs, and the graph formed from them, are also sparse and admit a much smaller number of possible classes, yielding simple classes like "claim" and "premise" (as described in the previous section).

    2. formulaic expressions

      May consider integrating these into the argument notation/schema itself, or learning them.

    3. linear order of zones

      That is, which is more likely to occur first, which second etc.

    4. see (Stede, 2011, Sct. 4.2)

      to see

    5. Since the presence of such a signal is a valuable feature in the study of the principles of argument presentation,

      We need to know this in order to understand whether the proponent is accepting defeat, being humble, etc. Although the intent is recoverable entirely from the text (as is the entire argument complex), the authors still choose to represent this factor in their diagrams. If we use a weighting for the "importance" of each of the attack edges, then we can probably omit this.

    6. weighty reasons

      An important question: How to assign and compose weights of various parts of the argument complex.

    7. The author does not even need to address whether the anticipated exception to his argument holds or not

      That is, in the below example, "in principle it is possible" can be replaced by "even if it is possible".

    8. An example diagram is shown in Figure 5b.

      The exact figure for the provided "harmless asbestos" example will consist of a circular node \(3\) rather than a square node \(3\) in figure \(5b\) (assuming \(3\) and \(4\) are said by the proponent himself).

    9. Figure 5a.

      In \(4a\), it'd have looked like a circular node \(4\) connected on top of \(3\), with an edge with a circular end point towards \(3\).

    10. We should tear the building down,]1 [even though it’s supposed to be some touristic attraction.]

      Imagine this as a rhetorical comment by the attacker. In this comment, \(1\) is a reference to the original claim of the proponent, and \(2\) is the rebutter (i.e., a set of premises against the referenced claim). Imagine the following sentence as a response by the proponent (i.e., \(3\)). It is then a counter-rebutter (a set of premises against the rebutter). We can see that the counter-rebutter is indeed evidence against the premises that compose the rebutter.

    11. counter-rebutter

      a set of premises against the set of premises in the rebutter.

    12. rebutter

      a set of premises against some claim.

    13. according to the main claim.

      And the shapes of the nodes (circle/box).

    14. rebutterFigure

      Had the squares been circles, it'd mean that the counter-argument is presented by the proponent himself.

    15. supporting force

      That is, the reason why the claim follows from the premise.

    16. discussion of that difference can be found in (Freeman, 2011, ch. 5).

      to see.

    17. they

      whether a node is a "supporting evidence" or some "data node"

    18. amodal

      Alethic Modality

      Basically, modal operators refer to "can't be", "may be", "may not be", which indicate different strengths of inferential links.

    19. argumentative text portion

      in contrast with an instructive, or narrative, or descriptive, or expository text.

    1. generalization

      Generalization of the model weights learned from the MLM task to other tasks.

      Given 2 models (or the same model with different pre-trained weights) which perform exactly the same on the MLM task, it may still be the case that one performs better than the other when fine-tuned on RTE.

      But here, the generalization we are talking about corresponds to the fact that even if two models (or a single model with different weights) obtain the same loss on the training data for a downstream task like RTE, their performance on the test set may still differ.

    2. barrier

      The color stays the same around the red star, and the barrier is where the color changes quickly (those are the walls of the valley): around the line \(x=0.25\) in fig. \(7(a)\).

    3. CoLAFigure 7:

      Moving along the X-axis from (0,0) (black star) corresponds to the failed run (red star), and moving along the Y-axis corresponds to the successful run (white star).

    4. Importantly, we note that the vanishing gradients we observe during fine-tuning are harder to resolve than the standard vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994). In particular, common weight initialization schemes (Glorot and Bengio, 2010; He et al., 2015) ensure that the pre-activations of each layer of the network have zero mean and unit variance in expectation. However, we cannot simply modify the weights of a pre-trained model on each layer to ensure this property since this would conflict with the idea of using the pre-trained weights.

      This is also probably the reason why vanishing gradients occur despite the skip connections in the transformer.

    5. optimization problem

      i.e., a problem with the optimization procedure, as it can't reduce the loss!

    1. but does not in itself say if these changes are important for the model’s behavior, i.e. for the processing necessary to solve the downstream task.

      That is, we can see from the RSA that at layer 6, dependency-finetuned BERT's embeddings are far away from pre-trained BERT's. But are these changes in dependency-finetuned BERT actually being used in dependency parsing, or are they being ignored/grouped together into sub-classes by the layers after the 6th one?

    2. These baselines are important because inspection-based analysis can often discover patterns that are not obviously present due to the high capacity of auxiliary classifiers. For example, Zhang and Bowman (2018); Hewitt and Liang (2019) found that expressive-enough probing methods can have decent performance even when trained on random noise.

      As an extreme example, BERT itself can be thought of as a probe on word embeddings, if we freeze the word embeddings. Now, if by chance the weights we use to initialize the classifier head (BERT) happen to be the same as those of pre-trained BERT, then we can get awesome performance on the probing task without our initial model (i.e., the word embeddings) having any knowledge related to the probing task.

    3. The corresponding metrics are the prediction accuracy of the root node and the Spearman correlation between the predicted and true tree depths.

      These are 2 metrics for the first structural probe only. The word \(i\) with the lowest \(\|\mathbf{h}_i^l\|_B\) is the predicted root.

    4. with models appearing to use dataset heuristics such as lexical overlap (McCoy et al., 2019b) or word priors (Poliak et al., 2018),

      That is, when fine-tuned on a downstream task, models tend to use spurious artefacts in the dataset instead of the actual linguistic features learned during pre-training.

    5. these

      The linguistic features

  7. arxiv.org
    1. 2017 DEEP BIAFFINE ATTENTION FOR NEURAL DEPENDENCY PARSING

      Main idea: using a classifier-like (softmax) loss in addition to an MST (CRF)-type loss leads to better performance.

      Can implement the classifier-like loss for the connection from each word using the scores given in equation (2).

    2. Applying smaller MLPs to the recurrent output states before the biaffine classifier has the advantage of stripping away information not relevant to the current decision. That is, every top recurrent state \(r_i\) will need to carry enough information to identify word \(i\)’s head, find all its dependents, exclude all its non-dependents, assign itself the correct label, and assign all its dependents their correct labels, as well as transfer any relevant information to the recurrent states of words before and after it. Thus \(r_i\) necessarily contains significantly more information than is needed to compute any individual score, and training on this superfluous information needlessly reduces parsing speed and increases the risk of overfitting. Reducing dimensionality and applying a nonlinearity (4 - 6) addresses both of these problems.

      Nice logic. Can be applied anywhere we have embeddings that are supposed to work for multiple tasks / encode multiple kinds of information. Usually we add a "head" to transform the embeddings for the particular task, but forget to do dimensionality reduction!

    3. biaffine

      The two affine transformations are not applied one after another, but rather in parallel: \(RU^{(1)}\) is one and \(R\mathbf{u}^{(2)}\) the other. \(U^{(1)}\) and \(\mathbf{u}^{(2)}\) are the learnable parameters of the bi-affine transformation. \(R\) is the matrix of token-wise embeddings of all words, stacked together.
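
      A sketch of that parallel form in PyTorch (shapes are my assumption: \(R\) is \(n \times d\) after the MLPs; this follows the structure of eq. (2), not the paper's exact code):

      ```python
      import torch

      def biaffine_arc_scores(R_dep, R_head, U1, u2):
          # R_dep, R_head: (n, d) token states after the dependent/head MLPs.
          # U1: (d, d) and u2: (d,) are the two parallel (bi)affine parameters.
          # scores[i, j] = r_i_dep @ U1 @ r_j_head + r_j_head @ u2
          return R_dep @ U1 @ R_head.T + (R_head @ u2).unsqueeze(0)

      n, d = 7, 32
      R = torch.randn(n, d)          # same states fed to both sides in this toy call
      scores = biaffine_arc_scores(R, R, torch.randn(d, d), torch.randn(d))
      print(scores.shape)            # (n, n): a head score for every word pair
      ```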

    4. Variable

      The classes are the various words in the sentence; they change as the sentence changes.

    1. latter

      i.e., the downstream tasks.

    2. The figure shows that NMT probing performance is largely independent of target language, with strikingly similar development patterns across French, German and Finnish

      Can use the same encoder for all the target languages!

    3. not easily readable by our MLP

      But readable by the decoder of the AutoEncoder

    4. No target form occurs across the train/dev/test split,

      That is, a word in the past tense (say "saw") occurring as the verb of the main clause will occur as such in only one of the train/dev/test splits.

      Forms of verbs:

      VB (base form)(be),

      VBD (past tense)(was),

      VBG (gerund/present participle)(being),

      VBN (past participle)(been),

      VBP (sing. present, non-3d)(am),

      VBZ (3rd person sing. present)(is).

    5. train parameter-rich multi-layer classifiers,

      Need to be careful with the complexity of the classifier and the amount of data used to train it.

      A BERT-like model has a max sequence length of 512, so it has \(|V|^{512}\) possible inputs (\(V\) is the vocabulary of the model). Suppose it maps each of them to a distinct fixed-size sentence embedding. Then, if we provide all our samples to the classifier and it memorizes the labels for all of them, we can't really say whether the linguistic information was captured in the sentence embedding.

      For binary classification, an accuracy greater than \(50\%\) on a held-out test set is a must to conclude that at least some traces of the linguistic property corresponding to the binary classification task are captured in the sentence embedding.

    1. Spearman’s ρ

      It is Pearson's \(r\) computed on the ranks of the actual and observed values (the values are ranked within each variable). It measures to what extent the function mapping one of the r.v.s (actual/observed) to the other can be approximated by a monotonic function.

      Pearson's \(r\), on the other hand, measures how well the function mapping one of the r.v.s to the other can be approximated by a linear relation of the form \(Y=aX+b\). If it can be approximated exactly (with \(a>0\)), then Pearson's \(r\) is \(1\).

    2. Pearson’s r

      It is the covariance between the observed and actual values (of relatedness) normalized by the product of the standard deviations of the observed and actual values.
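
      A quick check of the difference (scipy): a monotonic but non-linear relation gets a perfect Spearman \(\rho\) but an imperfect Pearson \(r\).

      ```python
      import numpy as np
      from scipy.stats import pearsonr, spearmanr

      x = np.linspace(1.0, 10.0, 50)
      y = np.exp(x)                  # monotonic in x, but far from linear

      print(spearmanr(x, y)[0])      # 1.0: the ranks agree perfectly
      print(pearsonr(x, y)[0])       # < 1: no relation of the form Y = aX + b
      ```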

    1. Interestingly, we observe no difference between CoLA and SST-2 fine-tuning in this experiment.

      Meaning that the deviations induced by CoLA/SST-2 are equivalent relative to the pre-trained weights of the last layer, although one induces more deviation than the other... probably because of LayerNorm?

    2. This affirms our hypothesis that improvements in the fine-tuning with CLS-pooling can be attributed to a change in the attention distribution which is less necessary for the mean-pooling.

      This hypothesis explains why huge changes appeared when probing with [CLS] rather than mean-pooling. As we had used [CLS] during fine-tuning, the attention distribution would have adapted to allow the [CLS] token to capture better sentence-level representations, which probably wouldn't have been the case before.

      That is, there are at least two factors bringing about the changes (compared to the base pre-trained models) in the accuracy of probing tasks, viz.:

      1. The [CLS] token learning better representations of the sentence (compared to those captured by the [CLS] token in the base pre-trained model) when the model is fine-tuned for the given fine-tuning task.

      2. How the probing task is related to the fine-tuning task. (That is, whether the fine-tuning and probing tasks are both syntactic-level tasks, or one is syntactic-level and the other semantic, etc.)

    3. The improvement is only visible when using CLS-pooling and becomes negligible when probing with mean-pooling.

      Probably because during fine-tuning we took the first token's representation and passed gradients through it. Hence the first token's representation (where [CLS] is) was the most affected. Mean-pooling during probing allows one to access more of the un-corrupted pre-trained knowledge in that way.

      To check: What would have happened had the fine-tuning involved mean-pooling too?

    4. highlighting the importance of sentence-level context for each of the prob-

      Or the inability of the [CLS] token to capture sentence-level context.

    5. outFigure

      No fine-tuning. These are results on base model.

    6. Our hypothesis is that if a pre-trained model encodes certain linguistic knowledge, this acquired knowledge should lead to a good performance on a probing task testing for the same linguistic phenomenon.

      Very logical, though I don't think it is that discrete. How quickly the knowledge can be recovered, if lost, would make this phenomenon continuous.

      If we replace probing everywhere in the paper with actual fine-tuning on the probing task (that is, unfreeze the weights of the transformer model while doing the probe), and then measure the number of samples required to reach saturated accuracy on the probing task, this will correspond to how quickly the knowledge corresponding to the probing task can be recovered.

    7. coordinate clause

      In English grammar, a coordinate clause is a clause (i.e., a word group containing a subject and predicate) that is introduced by one of the coordinating conjunctions, most commonly and or but.

      Example: "It was apple-blossom time, and the days were getting warmer." (E.B. White, Charlotte's Web. Harper, 1952)

    8. changes

      expected improvement

    1. \(\mathrm{Var}[\nabla_{\xi_i}f(\xi)]\)

      In (7) we have already assumed that \(p(\xi)\) is multivariate Gaussian. The variance is calculated for the term whose expected value needs to be computed in order to get the gradient of \(f(\xi)\) with respect to a single mean parameter \(\mu_i\) of the multivariate Gaussian.

      Since \(f(\xi)\) depends on \(\xi_i\) only via \(f(\xi_i)\), the gradient \(\nabla_{\xi_i} f(\xi)\), whose expected value is to be found in (7), is independent of how many \(\xi_i\)'s there are in total.

    2. as \(\mathrm{Var}\left[\frac{(\xi_i-\mu_i)}{\sigma_i^2}\left(f(\xi)-E[f(\xi)]\right)\right] \sim O(K)\)

      The \(f(\xi)\) is the cause of the \(O(K)\). The term in the bracket is obtained by assuming \(p(\xi|\theta)\) is multivariate Gaussian, and the gradient w.r.t. a single mean \(\mu_i \in \theta\) is being calculated in equation (34). The variance given is the variance of the term whose expected value is being calculated to get the gradient.

    1. One limitation of the ST estimator is that backpropagating with respect to the sample-independent mean may cause discrepancies between the forward and backward pass, leading to higher variance.

      Let's say we use a function \(f\) to compute the loss from the sample \(i \sim z\), where \(z\) is a collection of \(n\) Bernoulli variables with probabilities \(\{p_1, p_2, \ldots, p_n\}\) of turning out 1. Then the gradients of the loss with respect to \(z\), calculated during the backward pass up to \(z\), will actually be gradients at the sampled value \(i \in \{0,1\}^n\). Now, in reality we want to update our \(p\)'s, and hence we copy over these gradients w.r.t. the \(0,1\)'s ("straight through") as the gradients of the loss with respect to the "mean parameters" \(p_j\), for each \(j \in [1,n]\).

      The idea is that if we were to actually backpropagate gradients of the actual loss, that is, to minimize \(y=\sum_{i \in \{0,1\}^n} P(i) f(i)\), the direction in which we would like to move our \(p\)'s would be the component of the vector \(\left[\frac{\partial y}{\partial p_j}\right]_j\) along the plane \(\sum_j p_j=1\). Here \(P(i)\) is the probability of the configuration \(i\) occurring, that is, the product of \(p_j\) over the \(j\) with \(i_j=1\) and \(1-p_j\) over the \(j\) with \(i_j=0\). The \(j^{th}\) partial derivative evaluates to the expected value of \(f(i)\) given that the \(j^{th}\) component is set to \(1\), minus the expected value of \(f(i)\) given that the \(j^{th}\) component is set to \(0\).

      Now, the straight-through estimator, on average, sends a gradient of \(E\left[\frac{\partial f(z)}{\partial z_j}\big|_{z_j=1} \,\middle|\, z_j=1\right]\) with probability \(p_j\) and \(E\left[\frac{\partial f(z)}{\partial z_j}\big|_{z_j=0} \,\middle|\, z_j=0\right]\) with probability \(1-p_j\).

      The first thing to notice is that, as expectation and partial derivative are linear operators, we can move the expectation inside.

      This is a biased estimator of the gradients. Also, the gradients may vary hugely around their biased means, and this problem grows with \(n\).
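
      A minimal PyTorch rendering of the estimator (the usual detach trick; the loss is an arbitrary stand-in for \(f\)): the forward pass sees the hard \(\{0,1\}\) sample, while the backward pass copies the sample's gradient straight onto the means.

      ```python
      import torch

      logits = torch.zeros(5, requires_grad=True)
      p = torch.sigmoid(logits)        # Bernoulli means p_j

      z_hard = torch.bernoulli(p)      # hard {0,1}^n sample, non-differentiable
      z = z_hard + p - p.detach()      # forward: equals z_hard; backward: dz/dp = 1

      loss = ((z - 0.7) ** 2).sum()    # stand-in for f(z)
      loss.backward()
      print(logits.grad)               # gradient of f at the sampled point, copied "straight through"
      ```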

  8. Jun 2021
    1. showing higher agreement between keys and values on the identity of the top-ranked token

      Probably should additionally confirm that in the lower layers many of the 4096 activations are high, while comparatively few are active in the higher layers.

      If that is not the case, then the model's layers may not be operating in the same embedding space.

    2. The vast majority of retrieved prefixes (65%-80%) were associated with at least one identified pattern

      What patterns exist in the remaining ones? Are they adversarial? Is the output on those correct?

    3. We assume that patterns stored in memory cells originate from examples the model was trained on

      They can also originate from the initialization.

    1. \(\rho(h)(x \star \theta)(g)\)

      Only defined when each element of the group can be identified with an element of the domain.

    2. \(\langle x,\rho(g)\theta\rangle=\langle\rho(g^{-1})x,\theta\rangle\)

      It is because the measure in the inner product is invariant to the group under consideration that this holds.

    3. invariant

      Invariant w.r.t. the symmetry group under consideration.

    1. knowledge neurons are activated by knowledge-probing prompts;

      Are we sure about this? Knowledge-probing prompts have only heads.

      Yes, we are sure, because sentences having only heads will not contain the relation, most of the time.

      A more precise formulation is: "knowledge neurons identified (for a fact) based on the activations of neurons on the samples from ParaRel (which are variations of that fact, with masked tails) also get activated by prompts constructed by finding sentences that contain both the head and the tail (of the fact) in natural language data, and masking the tail."

    2. activated

      We check the activations at the location of the masked words in each prompt.

    3. 3

      Random words are masked here too.

    4. 2

      A random word is masked in these prompts.

    5. 1

      The tail entity is masked in these prompts.

    6. knowledge neurons are exclusive to their corresponding factual knowledge.

      How?

    7. Example prompt templates of relations are shown in Table 2.

      Such statistics and examples from training data are a must for any good paper!

    8. 3)

      If we have limited the context to size \(n\) and the size of the vocabulary is \(|V|\), then the first layer (word embedding) can take at most \(|V|^n\) different possible values over the entire training set; at any single position there are \(|V|\) possible word embedding vectors. Since the output of the first self-attention layer at the \(i\)-th location depends on all input positions, there are \(|V|^n\) different possible values for the output at the \(i\)-th location. Also, since the output of the entire self-attention layer depends on its entire input, and its input takes only \(|V|^n\) different values, so does its entire output. That is, there are only \(|V|^n\) distinct sequences that can be output by self-attention and will be input to the FFN.

      The keys of the FFN are \(d\)-dimensional vectors, and the values too. There are \(m\) distinct keys and values, where \(m>d\) is the intermediate dimension of the FFN. These key-value pairs are used to map the \(|V|^n\) possible vectors at a particular location to a new set of \(|V|^n\) vectors.

    9. First, suppressing and amplifying knowledge neurons can control the expressions of the corresponding knowledge, without affecting other facts. Second, we find that knowledge-probing prompts tend to activate knowledge neurons of the specific fact. Third, given the knowledge neurons of a relational fact, the top activating prompts retrieved from open-domain texts usually express the corresponding fact, while the bottom activating texts describe other information.

      The first means that the facts under consideration are contained only in the knowledge neurons under consideration, and nowhere else in the model.

      The second is just the way knowledge neurons are defined/identified initially.

      The third provides some evidence that the set of knowledge neurons under consideration contains information regarding only the particular fact under consideration (the converse direction of the first).

    1. y∆x

      In normal backpropagation, we keep multiplying by the Jacobians of the layers, going backwards through the network, to get the Jacobian of the output w.r.t. the input.

      In this method, we instead keep multiplying by discretized Jacobians (i.e., a Jacobian with \(\frac{\partial y_i}{\partial x_i}\) replaced by \(\frac{y_{i_1}-y_{i_0}}{x_{i_1}-x_{i_0}}\), where \(x_{i_0}, y_{i_0}\) are the values at the reference input).

      Now, we define the matrix obtained by multiplying the discretized Jacobians of two consecutive layers (L1, L2) as the "multiplier matrix" of the transformation from the inputs of L1 to the outputs of L2, and multiply this "multiplier matrix" with the changes in the inputs of L1 to get the "contributions" of the various inputs of L1 in producing the output of L2.

      All of the above are definitions. The discretized Jacobian of \(f(g(x))\), where \(f,g: R^d \rightarrow R^d\), is not equal to the product of the discretized Jacobians of \(f\) and \(g\). Had there been only linear/convolutional layers, they would have been equal, as the discretized gradient is the same as the natural one for linear layers. But this is not true for non-linearities.
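
      A tiny numeric illustration of that last point (my own example): for a ReLU, the discretized "derivative" \(\frac{\Delta y}{\Delta x}\) at a reference differs from the local gradient, so per-layer discretized Jacobians do not compose like true Jacobians.

      ```python
      import numpy as np

      def multiplier(layer, x, x_ref):
          # Discretized Jacobian entry for a pointwise layer:
          # (y - y_ref) / (x - x_ref) instead of the local derivative dy/dx.
          return (layer(x) - layer(x_ref)) / (x - x_ref)

      relu = lambda v: np.maximum(v, 0.0)
      x, x_ref = np.array([1.0]), np.array([-1.0])
      print(multiplier(relu, x, x_ref))   # [0.5], while the local gradient at x=1 is 1.0
      ```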

    2. \(w_i\Delta x_i>0\)

      This is 1 if the actual change in \(x_i\) caused \(y_i\) to increase, 0 otherwise.

    3. contributions: \(C_{\Delta x_i^+\Delta y^+}\)

      This is the contribution of an increase in \(x_i\) to an increase in \(y_i\).

    4. importance

      There are two possible notions of "importance" here.

      If the model doesn't care about the value of an input/activation while it is below a certain threshold, and starts caring about it just after it crosses that threshold, then you expect a sudden jump in the importance score. Here "caring about the value of an input/activation" means: does the output change if we slightly shift that input/activation; in other words, the gradient.

      The other notion is attribution, i.e., how much each feature contributed to the current value of the output. A logical way to define it seems to be grad × input, but since the gradient is discontinuous, so will be any continuous function of it.

      But if we assume that our model is a differentiable function (we are backprop-ing through it, so it is indeed differentiable), and that there is some known attribution of each reference input in producing the reference output, then when we shift the reference input slightly, the reference output shifts too. Now we add to the original attribution of each feature the contribution of that feature's change to the change in the reference output. This contribution is given by grad × \(\delta\)(input). More mathematically, \(\delta(y)=\sum_{i=1}^n \nabla_{x_i}y \cdot \delta(x_i)\), where \(y\) is the reference output. If \(a_i\) was the attribution of the reference input \(x_i\) in the reference output, then after moving slightly, the attribution of the new \(x_i\) is \(a_i+\delta(x_i)\nabla_{x_i}y\). We can integrate this to calculate attributions for features at longer distances from the reference inputs, as sketched below.
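
      A Riemann-sum sketch of that integration (my own rendering of the idea above, not the paper's method; `grad_fn` is a hypothetical gradient oracle):

      ```python
      import numpy as np

      def integrated_attribution(grad_fn, x, x_ref, steps=50):
          # Accumulate delta(x_i) * grad_i along the straight path from the
          # reference input to x: a Riemann-sum form of a_i += delta(x_i) * dy/dx_i.
          alphas = (np.arange(steps) + 0.5) / steps
          avg_grad = np.mean([grad_fn(x_ref + a * (x - x_ref)) for a in alphas], axis=0)
          return (x - x_ref) * avg_grad

      # Example with y = x0^2 + 3*x1, whose gradient is (2*x0, 3):
      grad_fn = lambda v: np.array([2.0 * v[0], 3.0])
      print(integrated_attribution(grad_fn, np.array([2.0, 1.0]), np.zeros(2)))  # ~[4. 3.]
      ```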

    5. 1

      Shouldn't this hold for the summation over the \(x_i\) in each layer, separately?

    1. Proposition 2.2

      As seen in the annotation for equation (2), the cross entropy depends only on the projection of \(p^{*}_{.|s}\) onto the \(d\)-dimensional subspace induced by \(\phi\). This means that the optimal value for \(f(s)\) is the same for any other \(p^{*}_{.|s}\), given that it has the same projection onto the \(d\)-dimensional subspace.

      All the \(p^{*}_{.|s}\) lie in \(|V|-1\)-dimensional simplex.

      All the \(p^{*}_{.|s}\) having the same projection onto the \(d\)-dimensional subspace hence lie on the intersection of this simplex with various \((|V|-d)\)-dimensional hyperplanes (each hyperplane containing all vectors whose projections lead to the same point in the \(d\)-dimensional subspace). This intersection forms a "section" of the simplex of dimension \(|V|-d\) [assuming \(d>1\)].

      As each point in the \(d\)-dimensional subspace was made to correspond to a hyperplane of all the vectors having their projection as that point; each point in the \(d\)-dimensional subspace can also be associated with a probability distribution over the vocabulary(by taking soft-max of the vector corresponding to that point).

      The first point to note is that this mapping maps the \(d\)-dimensional subspace onto (due to the nature of soft-max) a \(d\)-dimensional simplex lying inside the original \((|V|-1)\)-dimensional simplex. Since we can generate any element of the \(d\)-dimensional subspace by choosing an appropriate \(f(s)\), we can generate any element of this simplex by choosing an appropriate \(f(s)\).

      The other important thing is that if this simplex intersects each section (generated previously) in at least one point \(t_0\), then at that point (i.e., when the real distribution is the \(p^{*}_{.|s}(t_0)\) corresponding to \(t_0\)) we can find an \(f(s)\) that makes our predicted distribution match the real distribution \(p^{*}_{.|s}(t_0)\) exactly. Since this is the solution when our predicted distribution is un-restricted (proposition 1), it is also the solution when our predicted distribution is restricted. Moreover, since all points on a section have the same optimal probability distribution for minimizing cross-entropy, we have found the optimal solution for all the distributions in the section containing \(t_0\): all of them have \(p^{*}_{.|s}(t_0)\) as their optimal solution. And hence, by the definition of a section, proposition 2.2 follows.

      Now the only thing that remains to be proved is that the \(d\)-dimensional simplex indeed intersects every section in one and only one point. The first thing to see is that if the simplex intersects each section at all, then it can intersect each one only once: the sections are \((|V|-d)\)-dimensional objects filling a \((|V|-1)\)-dimensional simplex, so if each section is reduced to a single point, the points occupy a \(|V|-1-(|V|-d)=d-1\)-dimensional subspace. So, since we have a \(d\)-dimensional simplex, it can intersect every section at least once.

      The orientation of the \(d\)-dimensional simplex and the \(|V|-d\)-dimensional sections are linked together by \(\phi\). This linkage makes sure that both the things intersect each other at least once. As you change \(\phi\), you end up rotating and shifting around the \(d\)-dimensional simplex as the \(d\)-dimensional subspace rotates and shifts. Correspondingly, this rotation of \(d\) dimensional subspace also causes the space of vectors (having same projections in the \(d\)-dimensional subspace(i.e., the sections)) to rotate and shift correspondingly.

      EDIT 1: The \(d\)-dimensional simplex above is actually a \(d\)-dimensional surface/manifold over a \(d\)-dimensional subset. That is, when you set the probabilities of the first \(d\) words, the probabilities of the rest of the words are automatically determined, using the fact that the probabilities are supposed to be written as the soft-max of a vector in the \(d\)-dimensional subspace.

      EDIT 2: Proposition 2.2 already assumes the existence of the minimum, so the whole of the above procedure is not needed; but the above can be thought of as an attempt to explore the existence, as well as the exact construction, of it.

    2. (2)

      The first term can be written as \(-f(s)^T \phi\, p^{*}_{.|s}\), where \(\phi\) is the \(d \times V\) matrix of output word embeddings, and \(p^{*}_{.|s}\) is a \(V \times 1\) vector of the actual probabilities of the different words in the vocabulary given the context \(s\).

    3. rewrite it as \(\ell_{xent}(f,\Phi) = E_{s\sim p_L}[\ell_{xent,s}(f(s),\Phi)]\)

      Assuming that \(f\) can be as expressive a function as we want (that is, \(f\) can assign any \(d\)-dimensional vector to any \(s\)), we can reduce the problem of minimizing the expected value of the cross entropy between the predicted and actual probability distributions over words given the context, to the problems of finding, independently for each \(s\), the \(d\)-dimensional vector \(f(s)\) that minimizes the cross entropy for that \(s\).

  9. May 2021
    1. we can also interpolate between them.

      That is, we optimize the image with different objectives \(\alpha L_{1} + (1-\alpha) L_{2}\), where \(L_{i}\) is the loss from the \(i\)-th neuron whose activation is to be maximized. For each value of \(\alpha\) we get a different objective and a corresponding optimized image.
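
      A sketch of that interpolation loop (PyTorch; the two "neuron activations" are stand-ins, and a real objective would also include the usual image-regularization terms):

      ```python
      import torch

      def act1(img): return img[:, 0].mean()   # stand-ins for the activations of
      def act2(img): return img[:, 1].mean()   # the two neurons being maximized

      for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
          img = torch.zeros(1, 3, 32, 32, requires_grad=True)
          opt = torch.optim.Adam([img], lr=0.05)
          for _ in range(100):
              opt.zero_grad()
              loss = -(alpha * act1(img) + (1 - alpha) * act2(img))
              loss.backward()
              opt.step()
          # each alpha yields a differently optimized image
      ```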

    1. We note that this metric is not a perfect measure of information content in single units; for example, a unit with a little information about every class would have a low class selectivity index. However, it does measure the discriminability of classes along a given direction.

      For example, consider 2-D vectors belonging to two classes: all the vectors above the X-axis belong to class 1 and all vectors below the X-axis to class 2. Suppose the unit only lights up (with the same activation value, a.k.a. activity) for two vectors of class 1 in the second quadrant and two vectors of class 2 in the fourth quadrant. Then it would have a class selectivity index of 0, but it can still discriminate the classes along the direction of \(y=-x\). That is, if there is another activation that lets us know that (in other words, lights up when) our input vector is somewhere in the second or the fourth quadrant, then we can use our original activation to classify the vectors into the appropriate category.
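
      In code (assuming the index is \((\mu_{max} - \mu_{-max})/(\mu_{max} + \mu_{-max})\) over class-conditional mean activity, as in the paper), the quadrant example above gives exactly 0:

      ```python
      import numpy as np

      def class_selectivity(acts, labels):
          mus = np.array([acts[labels == c].mean() for c in np.unique(labels)])
          mu_max = mus.max()
          mu_rest = np.delete(mus, mus.argmax()).mean()
          return (mu_max - mu_rest) / (mu_max + mu_rest)

      # The unit fires equally often for both classes -> selectivity index 0,
      # even though it separates them along y = -x given one extra unit.
      acts = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
      labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])
      print(class_selectivity(acts, labels))   # 0.0
      ```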

    2. To test networks’ reliance upon random single directions, we added Gaussian noise to all units with zero mean and progressively increasing variance.

      No clamping this time. The addition of noise to a feature blurs that particular feature; the amount of blurring increases with increasing variance, and the relative proportion of blurring decides which single direction we are removing from the input. This addition of noise allows us to calculate the effect of blurring all dimensions together (blurring in any dimension is proportional to the variance of that dimension on the training set).

    1. 2019 THE LOTTERY TICKET HYPOTHESIS

      Sometimes we have to change our hyper-parameters many times to figure out the configuration that leads to good results. We disregard the fact that we may end up overfitting our data, and are satisfied as long as the model generalizes well in the end; we don't care about having any a-priori statistical guarantees for how the model will perform. We just find the best hyper-parameter settings without caring about overfitting, and see how the learned model generalizes.

      Now, what if we could run all the hyper-parameter settings in parallel? We could save a lot of time!

      This paper views the network structure and the initialized weights as hyper-parameters of training, and the hypothesis is roughly that over-parametrized neural networks allow us to check a lot of hyper-parameter settings at once. They provide evidence for the hypothesis by recovering a pruned network from the entire NN (initializing it with the weights it was originally initialized with and retraining it), and showing that this network can perform as well as the original one, which would mean that the other connections and initialized weights were just extra "try-outs" that are ultimately unnecessary.

    1. Exercise

      Lesson: a well-ordered set is a set consisting of a chain of points, which can have a limit point in the direction of increasing value, but not in the reverse direction.

      Another implication is that we can propagate values through induction, along and through infinite sequences, even beyond them, if the direction of propagation matches with the direction in which limit points occur.

    2. Exercise

      Use the fact that if \(x \ne y\), then either \(x<y\) or \(y<x\), but not both. This means that either \(x\) is included in the set of all elements \(>y\), or \(y\) is included in the set of all elements \(>x\), but not both. But \(succ(x)\) and \(succ(y)\) are the same, equal to say \(z\) (since the set is well-ordered, the minimum in the definition of \(succ\) exists, and we can call it \(z\)). The set of all elements greater than \(x\) is then the set of all elements greater than or equal to \(z\); similarly, the set of all elements greater than \(y\) is the same set. So both sets are equal. But earlier we saw that they must be different if \(x \ne y\). This means that \(x=y\).

    3. Successor case

      \(x\) is smaller than any element greater than \(y\), for some \(y\) with \(x > y\).

      If this is true, it means that the set of all elements \(<x\) has a maximum, in which case the \(\sup\) just becomes that maximum, rather than \(x\).

      If this is false, it means that the set of all elements \(<x\) has no maximum, in which case \(x\) becomes the supremum and the limit case holds.

    4. (Limit case)

      \(x\) is smaller than any element greater than or equal to all elements smaller than \(x\).

    5. show that a totally ordered set X enjoys the principle of strong induction if and only if it is well-ordered

      If the set is not well-ordered, it means that there is a subset \(A\) which has no minimum element. We can set \(P(\cdot)\) to be false for all the elements of this set and for any element (\(\in X\)) greater than some element of \(A\). Also, set \(P(\cdot)\) to be true for any \(x \in X\) with \(x<y \;\forall y \in A\).

      This setting of \(P(\cdot)\) satisfies both the property that \(P(x)=\)true if \(\forall y<x, P(y)=\)true, and the fact that \(P(\min(X))=\)true if the \(\min\) exists. But still, not all elements of \(X\) have property \(P\) as true.

    6. This is called the principle of strong induction.

      We need to assume \(P(\min(X))=\)true.

      To prove this, use a proof by contradiction. Assume that for some \(x \in X\), \(P(x)=\)false. Then this means that \(\exists y<x : P(y)=\)false, which means \(\exists z<y : P(z)=\)false, and so on. We will end up at \(\min(X)\)... or we may not, if the sequence \(y, z, \ldots\) tends to some element \(> \min(X)\); and even if the sequence tends to \(\min(X)\), we can't conclude that \(\min(X)\) will be part of the sequence.

      So, a new approach: consider the set of all \(y \in X\) with \(y<x\) and \(P(y)=\)false; this set has a minimum element. This minimum element must be \(\min(X)\), because otherwise, if some \(z\) is the minimum element of that set, then \(z\) is an element such that all elements less than it have \(P(\cdot)=\)true, which means that \(P(z)=\)true, which is a contradiction.

      Lesson: sometimes infinite iteration can be expressed as a set of elements; and as we have clearly defined properties of sets, it is preferable to express the result of an infinite iteration as a set, making the iteration invisible/irrelevant.

    7. formed by gluing all the partial choices together

      First you assume the existence of something known as \(\infty\), which concretizes your notion of infinitely many objects.

      Then you assume the existence of sets of infinitely many objects, while defining them.

      Then you define the union of infinitely many sets (assuming that one exists, in the process of defining).

      Then you assert that the union of infinitely many sets that you have constructed here must equal the underlying family in the axiom of choice, if Zorn's Lemma is true.

    1. recover the sample in the original representation \(z\in R^d\) and obtain \(f(z)\),

      This process of reverting, combined with the choice of our interpretable data representations, constraints what regions around \(x\) can be reached, upon inversion; which in turn constraints the regions around \(x\) from which we can obtain \((z', f(inv(z')))\). That is the interpretation model \(g\) will only be trained to interpret samples obtained from the original one, by perturbing along interpretable directions.

      So, if you are unable to interpret the predictions of the model, you can only conclude that you can't interpret the model in terms of the interpretable data representation you chose. YOU MIGHT HAVE MISSED SOME FEATURE IMPORTANT TO THE MODEL; in which case, you won't be able to tell what exactly the model is doing wrong (i.e., which feature is it assigning weight to, instead of the ones you chose?).

      One also needs to be careful with the design of the \(inv(\cdot)\) function, as it is also instrumental in determining where exactly the perturbed version of \(x\), i.e., \(z'\), will fall. If the \(inv(\cdot)\) function introduces features that are detected by the model \(f\) but not considered in our interpretable data representation, then the model's predictions are actually based on those features, but to our interpretation model \(g\) they appear to be related to the features in the interpretable data representation, so it might learn the wrong things / pick up spurious correlations.
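
      A minimal sketch of this perturb–invert–query loop, in the spirit of LIME (all names here are hypothetical; the interpretable representation is a binary mask over tokens, and inversion is just string reassembly):

      ```python
      import numpy as np

      def explain(x_words, f, n_samples=500, seed=0):
          """Fit an interpretable linear model g around one sample.

          x_words : list of tokens; the interpretable representation is a
                    binary mask z' over these tokens
          f       : black-box model, f(text) -> score
          """
          rng = np.random.default_rng(seed)
          d = len(x_words)
          Z = rng.integers(0, 2, size=(n_samples, d))   # perturbed masks z'
          # "inversion": map each interpretable mask back to a raw sample
          texts = [" ".join(w for w, keep in zip(x_words, z) if keep) for z in Z]
          y = np.array([f(t) for t in texts])           # f(inv(z'))
          # weight perturbations by proximity to the original (all-ones) mask
          w = np.exp(-(((d - Z.sum(axis=1)) / d) ** 2) / 0.25)
          # weighted least squares for g(z') = z' @ coef
          sw = np.sqrt(w)[:, None]
          coef, *_ = np.linalg.lstsq(Z * sw, y * sw[:, 0], rcond=None)
          return coef  # per-token importance, valid only along the chosen directions
      ```

      Note that the coefficients say nothing about features the mask cannot express, which is exactly the caveat above.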

    1. every bounded linear functional is given by the inner product with a unique vector in \(H\)

      Every dot product with an element of \(H\) can be seen as a bounded linear functional.

      Similarly, we can associate a dot product with a bounded linear functional in the following way: consider the subspace of \(H\) consisting of all elements that get mapped to \(0 \in \mathbb{R}\) under the bounded linear functional. Consider the quotient space of \(H\) by this subspace. Now choose an \(f \in H\) such that the linear functional doesn't evaluate to 0 on \(f\).

      Claim: there exists a unique element in this quotient space whose dot product with \(f \in H\) is equal to the value of the functional on \(f\).

      The dot product with this element will equal the evaluation of the bounded linear functional on the other functions in \(H\) as well.
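
      For comparison, the standard construction uses the orthogonal complement of the kernel instead of the quotient space (the two pictures agree, since \(H/\ker L \cong (\ker L)^{\perp}\)). A sketch, for a nonzero bounded linear functional \(L\): pick a unit vector \(g \in (\ker L)^{\perp}\); then $$ f - \frac{L(f)}{L(g)}\,g \in \ker L \;\Rightarrow\; \Big\langle f - \frac{L(f)}{L(g)}\,g,\; g \Big\rangle = 0 \;\Rightarrow\; L(f) = \Big\langle f,\; \overline{L(g)}\,g \Big\rangle \;\; \forall f \in H, $$ so \(\overline{L(g)}\,g\) is the unique representing vector.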

    1. The models are connected because they use normalized latent factors as topic distributions. In other words, latent factors are regularized such that they are also good at explaining the topic distributions in review text.

      VERY INNOVATIVE TYING TOGETHER OF MODELS

    2. global relationships

      That is, present throughout the data. These "what the model is focusing on" relations might not exist for some adversarial samples.

    3. perfect matches for the real-life tasks they are meant to solve.

      First observation. This is the first thing that makes people ask for interpretability. If people properly understood what the models are meant for, they wouldn't ask for so-called "interpretability".

    4. the optimization objective for most supervised learning models is simply to minimize error, a feat that might be achieved in a purely correlative fashion.

      This optimization objective was a result of the model we chose. The model we chose doesn't distinguish between correlation and causation; hence the model learned by minimizing error also doesn't know whether smoking and cancer are merely correlated or one causes the other. So basically our model was too simple! It should have included some component that modelled the process by which one observation can lead to another.

    1. Combining kernels by multiplication

      See Schur's Theorem first. It states that the pointwise product of two positive semi-definite matrices is positive semi-definite.
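 
      A quick numerical sanity check of this (a sketch, not a proof):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def random_psd(n):
          """Random positive semi-definite matrix of the form A @ A.T."""
          A = rng.standard_normal((n, n))
          return A @ A.T

      K1, K2 = random_psd(5), random_psd(5)
      K = K1 * K2  # element-wise (Hadamard/Schur) product, NOT the matrix product

      # all eigenvalues of the symmetric result are (numerically) non-negative
      print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
      ```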

    2. Decreasing the scale-mixture \(\alpha\) will allow for more minor local variations while still keeping the longer scale trends defined by the lengthscale parameter. Increasing the scale-mixture to a large value reduces the minor local variations.

      If you look at (4.20) of the GPML book, you'll realise that decreasing \(\alpha\) means the gamma distribution gives more weight to smaller values of \( \tau = \frac{1}{l^2} \); which means more weight is given to higher values of \(l\), hence flatter distributions are seen with decreasing \(\alpha\). If we make \( \alpha \) small enough, like 0.1 in the given figure, we can completely overwhelm any effect of \(l\); but if we keep \( \alpha \) in the right range, we can use it to modulate the rate of decay of covariance in a much more fine-grained way than is possible with \(l\) alone.

    3. The rational quadratic can be interpreted as an infinite sum of different exponentiated quadratic kernels with different lengthscales, with \(\alpha\) determining the weighting between different lengthscales.

      A non-negative linear combination of positive semi-definite matrices is positive semi-definite (follows directly by plugging the combination into the definition of positive semi-definiteness).

      If we consider exponentiated quadratic kernels (EQKs) for different lengthscales, define a probability distribution over all possible lengthscales of the EQK (which in turn defines a probability distribution over the set of all EQKs), and then take the expected value of the EQK under this distribution, we get another positive semi-definite kernel.

      In case the distribution over lengthscales is a gamma distribution, with rate parameter \(\beta = \frac{1}{l^2}\) and shape parameter \(\alpha\), the expected value over all the EQKs leads to the rational quadratic kernel, with the lengthscale \(l\) and scale-mixture \(\alpha\) as specified in the gamma distribution.

      For derivation see page 87 of GPML book.
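
      A Monte Carlo sketch of this mixture view (the parameterisation is assumed here: \(\tau \sim \mathrm{Gamma}(\mathrm{shape}=\alpha,\ \mathrm{rate}=\alpha l^2)\), so that \(E[\tau]=1/l^2\); numpy's gamma takes a scale \(=1/\mathrm{rate}\)):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      alpha, l, r = 2.0, 1.5, 0.8          # scale-mixture, lengthscale, distance

      # closed-form rational quadratic kernel (GPML eq. 4.20)
      k_rq = (1 + r**2 / (2 * alpha * l**2)) ** (-alpha)

      # mixture of exponentiated quadratic kernels exp(-tau * r^2 / 2),
      # with tau = 1/lengthscale^2 drawn from a gamma distribution
      tau = rng.gamma(shape=alpha, scale=1 / (alpha * l**2), size=1_000_000)
      k_mix = np.exp(-0.5 * tau * r**2).mean()

      print(k_rq, k_mix)   # the two values agree up to Monte Carlo error
      ```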

    1. \(p_{k+1}(\mathbf{n})\,G_0\)

      This term is added after the summation, to account for the probability that a new (\(k+1\)-th) species comes along in the \(n+1\)-th sample.
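
      A sketch of the resulting predictive step, specialised to the Chinese-restaurant/Dirichlet-process case where \(p_j \propto n_j\) and the new-species term has weight \(\alpha/(\alpha+n)\) (the helper base_draw, standing in for a draw from \(G_0\), is hypothetical):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def next_sample(counts, atoms, alpha, base_draw):
          """One predictive draw in a species-sampling model (CRP special case).

          counts    : sizes n_1..n_k of the species seen so far (mutated in place)
          atoms     : the distinct species values seen so far (mutated in place)
          alpha     : concentration parameter
          base_draw : callable returning a fresh atom from the base measure G_0
          """
          n = sum(counts)
          # existing species j with weight n_j/(alpha + n); the trailing alpha
          # is the G_0 term: a brand-new (k+1)-th species, weight alpha/(alpha + n)
          probs = np.array(counts + [alpha]) / (alpha + n)
          j = int(rng.choice(len(probs), p=probs))
          if j == len(counts):
              counts.append(1)
              atoms.append(base_draw())
          else:
              counts[j] += 1
          return atoms[j]
      ```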

    2. \(p_j\)

      \(p_j\) is a function that takes in a sequence of cluster sizes, one for each of the various possible species, and outputs a number corresponding to the weight of the \(j\)-th species.

    3. \(\delta_x\)

      It is a vector with a 1 at the \(j\)-th position and 0 elsewhere.

    4. exchangeable sequence of r.v.’s

      In statistics, an exchangeable sequence of random variables (also sometimes interchangeable) is a sequence X1, X2, X3, ... (which may be finitely or infinitely long) whose joint probability distribution does not change when the positions in the sequence in which finitely many of them appear are altered.

    5. model

      The previous line could be considered sufficient as the definition of a model, but that model isn't enough for us to do Bayesian inference; hence we restrict the model.

    6. positive

      That is, non-zero.. which in turn means that the family of distributions considered is very large.

    7. these competing goals are not always honestly spelled out, and the resulting uncertainties are not fully described.

      Although we may state that these two random variables are independent, or that this is the family of distributions that the actual distribution belongs to, we tend to forget/ignore what trade-offs we made to arrive at those assumptions.

    8. Often the choice of the final inference model is a compromise of an accurate representation of the experimental conditions, a preference for parsimony and the need for a practicable implementation.

      That is, in practice we first limit our models to some parametric family of distributions, with a finite number of parameters to infer. In restricting ourselves to this parametric family, we often assume independence of certain random variables, and we assume that the data follows one particular type of distribution. All these assumptions can be thought of as compromising an accurate representation of the experimental conditions (two random variables may only be very nearly independent, but we assume them to be independent nonetheless).. for the sake of "parsimony and practicable implementation".


  10. Mar 2021
    1. all: say_hello generate

      Unlike a declaration, this runs all its pre-requisites, even though they may not be files.

    2. .PHONY: all say_hello generate clean

      Like a declaration. Declaring all, say_hello, generate and clean as .PHONY tells make that none of them is a file, so make won't look for files with those names when deciding whether to run their recipes.

      For ordinary (file) targets, make only runs a recipe when the target is missing or one of the pre-requisites has been updated; phony targets are always considered out of date.

  11. cseweb.ucsd.edu
    1. hard to invert

      That is, at some non-negligible fraction of all possible binary strings of \(k\) bits, the probability of getting the inverse by following the coin tosses of \(A\) is a negligible function of \(k\).

    2. polynomial fraction of the k-bit composite integers

      \(k\)-bit composite integers are composite integers in the range \([2^k, 2^{k+1})\) .

      If you pair up primes like \((2^2-1, 2^{k-1}-1), (2^3-1, 2^{k-2}-1), \ldots\) and multiply the elements of each pair together, you'll get composite numbers that are products of two primes and lie in \([2^k, 2^{k+1})\).

      So, the number of \(k\)-bit composite integers in \([2^k, 2^{k+1})\) that can be written as a product of two primes is \( \ge \frac{k}{2}\), which is a polynomial in \(k\).

    3. For every

      No matter what algorithm the adversary comes up with.. the inequality must hold for each one of them.

    4. adversary

      The PPT \(A\).

    5. (1)

      This also means that the length of the output \(f(x)\) is at most a polynomial function of the length \(k\) of the input \(x\). But it can be a lower-order function too, e.g., \(|y| = \log(|x|)\); so, to make sure that \(A\) is a PPT w.r.t. the size \(k\) of the input \(x\) to the one-way function, we add an extra \(k\) symbols to the input of \(A\) (i.e., \(1^k\)).

    6. N

      The natural number corresponding to the number of symbols in the output of the one-way function.

      \(\nu\) maps the size(\(n\)) of an element in the domain of a one-way function, to the average probability of finding the inverse of f(a string of size \(n\)) using a PPT algorithm.

      Requiring that \(\nu\) be negligible means requiring that the probability of finding an inverse decreases faster than the inverse of any polynomial function (of the input size of the one-way function).
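
      A tiny numeric illustration (the \(k^3\)-guess inverter and the comparison polynomial are made up for illustration): an inverter making \(k^3\) random guesses at a \(k\)-bit preimage succeeds with probability about \(k^3/2^k\), which eventually falls below any fixed inverse polynomial.

      ```python
      # nu(k) = k^3 / 2^k is negligible: it eventually stays below 1 / k^10
      for k in (16, 32, 64, 128):
          nu = k**3 / 2**k
          print(k, nu, nu < 1 / k**10)   # flips to True at large k and stays True
      ```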

    7. input length t

      The input to \(\nu\) here is the output length of the one-way function.

    8. worst-case hardness is a poor measure of security. Security requires hardness on most cases or at least average-case hardness

      This means that the non-deterministic TM should branch out into multiple states at almost every one of the polynomial number of steps it takes, on almost every input string.

    9. machine

      Turing machine, that is.

    10. n

      This is the integer being factored.

      \(\log n \approx\) the number of bits required to represent \(n\).

      Factoring a number \(n\) has (roughly) exponential complexity in terms of the square root of the number of bits required to store \(n\).

      Hence the difficulty of the factoring problem increases exponentially w.r.t. how long (in terms of bits) our chosen number \(n\) is.

    11. (x, p) = 1

      \(GCD(x,p)=1\)

    1. really think through what this definition is saying

      We can see it as follows:

      An efficient certifier will exist iff

      You can find \(2^t=2^{p(|s|)}\) polynomial-time algorithms \(\{B_i(s)\}_{i=1}^{i=2^t}\) such that at least one of the \(B_i(s)\) outputs "yes" iff \(s \in X\).

      You can choose between these polynomial-time algorithms based on the value of the certificate \(t\): each \(B_i\) corresponds to running the certifier with one fixed certificate.

      So, in effect, we are saying that you must be able to check whether \(s \in X\) using at most \(2^{p(|s|)}\) polynomial-time algorithms.

      This is equivalent to the other definition of NP, which defines it as the set of all problems solvable in a polynomial number of steps using non-deterministic Turing machines. Each step of the NTM can result in branching. Each of the branches leading to some leaf (the state the branch reaches upon finishing reading the string) corresponds to one of the \(2^{p(|s|)}\) algorithms. Notice that if we replace \(p(|s|)\) with \(\log_{2}(a) \times p(|s|)\) we can change the base of the exponent; in particular, it can be matched with the maximum branching factor \(a\) (a constant for a given NTM) of the NTM.
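
      A toy sketch of this picture for SUBSET-SUM (hypothetical code, not from the text): certifier plays the role of the polynomial-time \(B_i(s)\) for one fixed certificate, and deciding membership without a certificate in hand amounts to trying all \(2^{p(|s|)}\) of them.

      ```python
      from itertools import product

      def certifier(instance, certificate):
          """Polynomial-time check: does the chosen subset hit the target?"""
          numbers, target = instance
          return sum(x for x, chosen in zip(numbers, certificate) if chosen) == target

      def decide(instance):
          """Exponential search: try all 2^n certificates, accept iff one works."""
          numbers, _ = instance
          return any(certifier(instance, c) for c in product([0, 1], repeat=len(numbers)))

      print(decide(([3, 34, 4, 12, 5, 2], 9)))   # True: 4 + 5 = 9
      ```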

    2. set of strings

      Forms a language over the alphabet \(\{0,1\}\)

  12. ajc.maths.uq.edu.au
    1. S-FIRE is NP-complete, even if \(S\) is the set of leaves of a tree of maximum degree three

      \(S\) can be broken down into disjoint subtrees of the whole tree. If there are \(n\) subtrees, we need to protect (at worst) \(n\) vertices, just on top of the \(n\) roots of the \(n\) subtrees. There are \(n!\) possible orderings in which we can choose to defend the nodes.

    1. do not incorporate a measure of runtime

      Maybe their compiler will skip some computations/make them less complex based on the measure of runtime of the compiled program.. Let's read on..

    2. target-dependent

      Dependent on the GPU/TPU/end device you will run the code on. XLA does linear-algebra-specific optimization that is target-independent.

    3. white-box

      That is, the entire code for executing commands on the GPU is known.

  13. Feb 2021
  14. ajc.maths.uq.edu.au
    1. QUESTION

      It is clever how the author abstains from stating that the sequence \(\{d_i\}\) is the sequence of nodes saved.

      Asserting the existence of this (finite) sequence makes sure that there are nodes that can be saved at every time step.

      The finiteness of \(t\), combined with conditions \((ii)\) and \((iii)\), implies that \(k\) vertices have been saved at time \(t\), while we defended 1 vertex per time step.

      The important thing to note is that this sequence will exist iff we have chosen to save one vertex per time step. The backward direction is easily proved by taking \(\{d_i\}\) as the sequence of nodes saved. The forward…

    1. for all

      It is interesting to note that this is not "almost everywhere" !!

    2. almost all \(x \in \mathbb{R}^n\)

      at the remaining \(x\), there are singularities. These may be removable, or not.

    3. a.e. \(x \in E\)

      cannot be extended to "for every \(x \in E\)", as \(f(x)=\infty\) may be true for some \(x \in E\); and we can't claim \(\infty = \infty\) or \(\infty < \infty\) at that point.

    4. \(\|f\|_p\)

      missing multiplication by \(p\) here.

    1. The system (6)

      The \((x(t), y(t), z(t))\) satisfying (6) will be the line passing through \((x_0, y_0, z_0)\) and lying on the Monge cone of all the planes with normal \((p, q(p), -1)\) passing through \((x_0, y_0, z_0)\). The tangent plane to the cone in which the line lies will depend on the particular value of \(p\) plugged into the equations.

      Suppose we have our solution \(z=u(x,y)\) . Then we know \(p, q\) at every point.

      Now, we actually look towards finding the characteristic curves.. Note that the \((x(t), y(t), z(t))\) we have found just tells the direction in which the tangent to the characteristic curve through \((x_0, y_0, z_0)\) should point.

      Now let the characteristic curve be given by \((x(t), y(t), z(t))\), which are unknown right now. These \(x(t), y(t), z(t)\) will give a new curve through \(\mathbb{R}^3\) which may not be a straight line.

      But still, at each point \(s=(x(t_0),y(t_0),z(t_0))\), with partial derivatives \( (p(t_0), q(t_0)) \), the tangent to this new curve must point in the characteristic direction at \(s\). Also note that since we have assumed the existence of a solution, the partial derivatives at each point exist. Now our problem is to find all five unknowns \((x, y, z, p, q)\) along the curve; see the system below.
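
      For reference (a standard fact about first-order PDEs, not derived in this excerpt): for \(F(x,y,z,p,q)=0\), the five unknowns evolve along a characteristic according to the Charpit–Lagrange system $$ \dot{x} = F_p, \quad \dot{y} = F_q, \quad \dot{z} = p F_p + q F_q, \quad \dot{p} = -(F_x + p F_z), \quad \dot{q} = -(F_y + q F_z). $$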

    2. generator of the Monge cone

      Also called the "line of contact" with the surface (although it may share only a single point with the surface, where it is tangent).

      This line is the \(\gamma_{p_0}\) for some value \(p_0\) of \(p\), at the point \((x_0,y_0,z_0)\).

      This \(\gamma_{p_0}\) gives the direction in the tangent plane along which : $$ \frac{ \partial z }{ \partial p }(p_0) = 0 $$

    3. field of cones

      This solves our problem of having multiple tangent planes at every point.

      It is solved by constructing another surface \(N_s\) at every point \(s\), such that there are multiple tangent planes to \(N_s\) at \(s\), and only those planes that are possible tangent planes to \(z=u(x,y)\) are tangent planes to \(N_s\) at \(s=(x,y,z)\). This \(N_s\) is a generalized cone, called a Monge cone, and \(s\) corresponds to the vertex of the cone. Now, instead of choosing a small 2-D element of the tangent space at every point,

      you have a cone at every point: choose a small 1-D line (or a small 2-D surface?) (infinitely close to the vertex of the cone) from the cone at each point, and then merge these 1-D lines together to create a 2-D surface.

    4. possible tangent planes

      at \((x_0, y_0, z_0) \)

      In case of linear PDE, the directional derivative at each \((x_0,y_0)\) was known. So, we could propagate initial values along the characteristic curves. (Assuming one and only one characteristic curve through every point)

      In case of quasi-linear PDE, the directional derivative at each point was known in terms of the surface. That is, each surface had a corresponding directional derivative field. In other words, the directional derivative was known at each \((x_0, y_0, z_0)\). We had to join together characteristic curves to form a surface. (Assuming one and only one characteristic curve through every point)

      Here, we know a set of possible tangent planes at each \((x_0, y_0, z_0)\). Each tangent plane will contribute a small 2-D element to the surface and we shall combine these tangent spaces to form a surface. But the problem is that we have multiple tangent planes at a single point.

    5. (6)

      The solution to these 3 equations at \(p=p_0\) gives \(\gamma_{p_0}\).