404 Matching Annotations
  1. Mar 2024
    1. Latent variables

      Notice how flat the second graph is along the \(2x_1+2x_2 = 1\) direction (zero derivative throughout).

      It seems that flatness of the loss, viewed as a function of some representation, along certain directions indicates how good (disentangled) the representations themselves are. The next layer, which operates on these representations, can then discard a flat direction and focus on the others. Hence, the more directions the curve is flat in (i.e., the flatter the curve), the more efficiently the next layer can learn.
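      A quick way to check this numerically — a minimal sketch with a made-up loss, assuming JAX is available:

      ```python
      import jax
      import jax.numpy as jnp

      # Hypothetical loss that depends on x only through 2*x_1 + 2*x_2,
      # so it is flat along the direction (1, -1).
      def loss(x):
          s = 2 * x[0] + 2 * x[1]
          return (s - 1.0) ** 2

      x = jnp.array([0.3, 0.7])
      g = jax.grad(loss)(x)
      flat_dir = jnp.array([1.0, -1.0]) / jnp.sqrt(2.0)
      steep_dir = jnp.array([1.0, 1.0]) / jnp.sqrt(2.0)
      print(jnp.dot(g, flat_dir))   # ~0: the flat direction the next layer can discard
      print(jnp.dot(g, steep_dir))  # non-zero: the direction that carries information
      ```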

    2. Smoothness of the target function.

      Essentially, smoothness limits how many times the network can "turn" in a given small subset of the input. This limit then forces all the data in that subset to be fit by the function without wiggling around.

      One can imagine how this can be used to create an equivalent problem by collapsing many points into a single point, and many wiggly functions into a single smooth variant.

  2. Aug 2023
    1. However, the gradient is non-zero for directions u in the ID subspace and the corresponding features Bu change across the fine-tuning process. We call this feature distortion: the features in some directions are changed but not others

      If we go from ID data to OOD data via a covariate shift, this is true.

      But not if a concept shift has happened when going from ID to OOD data.

  3. Jul 2023
    1. Note that the node representations do not explicitly encode their ground-truth positions.

      Are we sure?

    2. BD

      Brownian Dynamics

    1. Lemma 2.

      But there can be other fixed points too! In particular, \(\theta^*\) is a global minimum. We may get stuck in local minima, or oscillate around them due to the changing nature of \(P_t(X_i; \theta)\).

      This pattern is observed in other places in Deep Learning too: introducing new/meta-variables and varying quantities often doesn't hurt convergence.

      Something about the architecture/initialisation/optimisation in DL seems to enable this.

    2. We will call the model well-specified if there exists \(\theta^*\) such that \(\theta^* \in \arg\min_{\theta} \int \ell(X, Y; \theta)\, Q(dY \mid X)\) for every \(X\).

      That is, there exists a parameter setting \(\theta^*\) that minimizes the loss on each and every input, meaning it can predict the correct probabilities over \(Y\) for every \(x \in X\).

  4. Jun 2023
    1. (ii)

      The \(\sigma\)-algebra of the probability space is closed under complements and countable intersections.

      Combining countable additivity of the measure with the above fact leads to the result.

    2. (i)

      As each event = \(\bar{\phi}\).

    1. Generally we need more assumptions about a specific problem and hypothesis class to bound absolute population risk, hence we focus on bounding the excess risk

      In particular, using the relation between the structure of the problem and the hypothesis class we have chosen. The better the two structures align, the better the best-performing model will be, and the easier it will be to find that model. Hence it is clear that more information helps in obtaining better bounds, but not clear why it is difficult to calculate \(\inf_{g \in \mathcal{H}} L(g)\) and hence convert all the bounds obtained on excess risk to bounds on absolute risk.

      It will probably be difficult for two reasons:

      1. You will need to know the actual distribution \(p\); just samples from it won't do. But maybe we can have limiting-case-like bounds, where as the number of samples \(\rightarrow N\) the actual distribution \(p\) becomes known and the bound becomes valid.

      2. Finding the optimal loss or \(g\) may be a computationally intractable problem.

      Though in some function classes, the optimal loss may just be \(0\)?

    1. model the most basic assumption instead: the best example weighting should minimize the loss of a set of unbiased clean validation examples that are consistent with the evaluation procedure

      Very clean.

    2. Here we argue that in order to learn general forms of training set biases, it is necessary to have a small unbiased validation to guide training.

      Very clean idea.

    1. \(f_{\text{lin}}(x; \theta_0 + \tau)\)

      The prediction of \(f_{lin}\) at \(x\) equals the original value of \(f\) at \(x\) at parameter \(\theta_0\), plus the component of the task vector along the direction of the gradient at \(x\). That is, we add a term corresponding to how well the gradient on the sample \(x\) aligns with the "average" gradient (task vector) direction.

      Another useful way to make sense of it is to replace \(\tau\) with \(\Delta \tau\) and then see what the equation looks like: it is a first-order approximation of \(f\) around \(\theta_0\).
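      A minimal sketch of this first-order view (the toy model and numbers are my own, assuming JAX):

      ```python
      import jax
      import jax.numpy as jnp

      # Toy differentiable model f(x; theta); any parameterization works here.
      def f(theta, x):
          return jnp.tanh(jnp.dot(theta, x))

      theta0 = jnp.array([0.1, -0.2, 0.3])
      tau = jnp.array([0.05, 0.02, -0.01])   # "task vector", e.g. theta_ft - theta0
      x = jnp.array([1.0, 2.0, 0.5])

      # f_lin(x; theta0 + tau) = f(x; theta0) + <grad_theta f(x; theta0), tau>
      f0, dot_along_tau = jax.jvp(lambda th: f(th, x), (theta0,), (tau,))
      f_lin = f0 + dot_along_tau
      print(f_lin, f(theta0 + tau, x))  # close, since tau is small
      ```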

    2. task arithmetic property

      For a particular input \(x\): if I add any linear combination of task vectors, only the component along the task vector corresponding to the task to which \(x\) belongs determines the value of \(f\) at that point. That is, the other task vectors lie in the tangent space of \(f\), seen as a function of the parameters, at the point \(x\).

    3. functional variation is entirely captured by each τ

      This means that \(g_t\) varies the same way as \(f\) varies when we take the parameters \(\alpha_t\mathbf{\tau_t}\) away from the original \(\theta_0\), and \(g_t\)'s value doesn't get affected when we move along any other task vector.

      So, if I move to \(A=\theta_0 + \alpha_1\tau_1\) first, then all the variation in \(f\) due to that movement is captured by \(g_1\), and all other \(g_t\) stay the same.

      Then if we move to \(\theta_0 + \alpha_1\tau_1 + \alpha_2\tau_2\) from \(A\), all the variation will be captured by \(g_2\) and the rest of the \(g_t\) will stay the same.

      If moving along \(\tau_t\) only changes the value of \(f\) on \(\mathcal{D}_t\) and not on any other task's dataset, then the functions:

      \(g_t(x;\alpha_t\tau_t) := f(x; \theta_0+\alpha_t\tau_t)\mathbb{1}[x \in \mathcal{D}_t]\)

      will satisfy the condition required by equation (4).

    4. spatially-localized components

      Spatially localized in the input space. That is, functions that are 0 outside of \(\mathcal{D}_t\).

    5. single mixing coefficient

      Do the results hold if we allow multiple mixing coefficients? One per task?

    6. if adding τt does not modify the output of the model outside Dt

      Moreover, the output is assumed to be linear in the parameters, and the addition can also be of scaled versions of \(\tau_t\).

    1. Figure 1:

      What makes a task transferable?

      Conjecture 1: The diversity of inputs for a task is what determines how transferable the task will be.

      Three experiments for comparing transferability:

      1. distributions of cosine similarities between samples => task diversity (?=>) transferability.

      2. Compare Transf.(model trained on QQP) with Transf.(model trained on MNLI with labels generated by a QQP fine-tuned model).

      3. Slowly shuffle a larger proportion of labels. Does the task still stay transferable?

    1. Figure 1

      It is not necessary that the dashed line is a bound on performance.

  5. May 2023
    1. (i)

      Here the equality is in the probabilistic sense.

      A probabilistic notion of equality of two events \(E, F\) is to say that \(E\) and \(F\) are equal when \(E\subset F\) and \(F\subset E\).

      Another one can be: Two events \(E, F\) are said to be equal if \(P(E\backslash F)=0\) and \(P(F\backslash E)=0\).

      Both the definitions of equality are preserved under extensions as extensions are probability preserving. And both form equivalence relations on the underlying \(\sigma\)-algebra. The former is a refinement of the latter.

    2. \(C_A n^{-A}\)

      This is a sequence that decays to 0 as \(n\) increases. As we increase \(A\), the sequence decays faster and faster with increasing \(n\).

      The statement that "event \(E\) holds with overwhelming probability" means that the probability of this event is an upper bound on all of these faster and faster decaying sequences; that is, \(\mathbf{P}(E) \ge 1 - C_A n^{-A}\) for every \(A\).

      In \((iv.)\), instead of having sequences that decay faster and faster, we only require that \(\mathbf{P}(E)\) is an upper bound on \(1\) minus one sequence that decays polynomially in \(n\).

      In \((v.)\), this is further relaxed to \(\mathbf{P}(E)\) being an upper bound on \(1\) minus some sequence that decays to \(0\).

      Note that in the above, the upper bounds are on sequences tending to 1.

      Also, as \(E=E_n\), \(\mathbf{P}(E)\) is itself a sequence in \(n\). So, in fact, one sequence forms an upper bound on another.

      In \((iii.)\), the sequence of probabilities is an upper bound on every sequence that tends to 1 at a polynomial rate in \(n\).

      In \((iv.)\), the sequence of probabilities is an upper bound on some sequence that tends to 1 at a polynomial rate in \(n\).

      In \((v.)\), the sequence of probabilities is an upper bound on some sequence that tends to 1.

      Also, the constants are omitted for brevity, but choosing apt constants is part of the above statements too.
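      For reference, a hedged reconstruction of the three definitions being paraphrased above (from memory of the standard statements, so treat this as an assumption rather than a quote):

      $$\text{(iii)}\ \mathbf{P}(E_n) \ge 1 - C_A n^{-A}\ \text{for every } A>0, \qquad \text{(iv)}\ \mathbf{P}(E_n) \ge 1 - C n^{-c}\ \text{for some } c>0, \qquad \text{(v)}\ \mathbf{P}(E_n) = 1 - o(1).$$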

    3. Indeed, once one is no longer working at the foundational level, it is best to try to suppress the fact that events are being modeled as sets altogether

      So that we don't think about set-theoretic constructs like cardinality, which may not be probabilistic properties. One should be aware that the unions/intersections are not between sets, but between events.

    1. permutation o

      Same permutation for both \(x, z\).

    2. values from a random data point z,

      Generalization of zeroing out the features instead.

  6. Apr 2023
    1. \(b, d \le \sup_{\mathbf{f} \in \mathcal{F}^{\otimes t},\, \mathbf{h} \in \mathcal{H}} |R_{\text{train}}(\mathbf{f}, \mathbf{h}) - \hat{R}_{\text{train}}(\mathbf{f}, \mathbf{h})|\)

      The RHS is the maximum difference between the empirical loss (averaged over all the tasks) and the actual loss (averaged over all the tasks), for any shared representation \(\mathbf{h}\) and any task-specific mappings \(\mathbf{f}\).

      The inequality holds as both \(b\) and \(d\) resolve to the difference of empirical and actual losses, for the functions \((\hat{\mathbf{f}}, \hat{\mathbf{h}})\) and \((\mathbf{f}^\star, \mathbf{h}^\star)\) respectively.

    2. c ≤ 0

      As \(\hat{\mathbf{h}}\) and \(\hat{\mathbf{f}}\) are the minimizers over the data using which \(\hat{L}\) is computed.

      On the other hand, \(\mathbf{h}^\star\) and \(\mathbf{f}^\star\) are the functions that minimize the actual centered training risk (\(L\)), and hence the empirical centered training risk (\(\hat{L}\)) may be larger when we use these values.

    3. \(L(\mathbf{f}^\star, \mathbf{h}^\star, \mathbf{f}^\star, \mathbf{h}^\star)\)

      This is 0.

    4. \(d_{\mathcal{F}, \mathbf{f}^\star}(\hat{\mathbf{h}}; \mathbf{h}^\star)\)

      Distance between the actual and learned representations, under the function class \(\mathcal{F}\), and given the optimal fit over the real shared representation \(\mathbf{h}^*\).

    5. \(\tilde{\mathbf{f}}\)
      1. \(\hat{\mathbf{h}}\) is the learned shared representation.

      2. \(\tilde{\mathbf{f}}\) is the function that actually optimizes the performance for each task \(j \in [t]\) using the learned shared representation \(\hat{\mathbf{h}}\).

      3. \(\hat{\mathbf{f}}\) is the learned approximation to \(\tilde{\mathbf{f}}\).

    6. A division by \(n\) seems to be missing..

    7. centered training risk

      The training risk over and above the one attained by the actual underlying feature representation \(\mathbf{h}^\star\) and the optimal \(f_j^\star\) corresponding to it.

    8. h⋆

      The actual underlying shared representation of the tasks.

    9. \(R_X(\mathcal{Q})\)

      How well can any function from the function class \(\mathcal{Q}\) match a set of completely random samples from \(\{+1, -1\}\)?

    10. Rad
    11. polylogarithmic factors
    12. \(d_{\mathcal{F}, \mathbf{f}}(\mathbf{h}'; \mathbf{h})\)
      1. If \(\mathbf{h}'=\mathbf{h}\) you can directly put \(f'=f_j\) and the distance will go to 0.

      2. If \(\mathbf{h}'\ne\mathbf{h}\), but you can find some \(g \in \mathcal{F}\) such that \(g \circ \mathbf{h}'= \mathbf{h}\), you can still make the distance 0.

      3. This metric can also be negative, depending on what \(\mathbf{f}\) and \(\mathbf{h}\) are.

    13. \(y \mid f_j^\star \circ \mathbf{h}^\star(x)\)

      \(\mathbf{x}\) influences the distribution over \(y\) only via the task-specific representation \(f_j^\star \circ \mathbf{h}^\star\).

    14. \(L(\mathcal{F})\)

      This is a Lipschitz constant dependent on the function class \(\mathcal{F}\).

    15. \(\bar{G}_X(\mathcal{Q})\)

      What happens to this definition if \(N \to \infty\)?

    16. N data points

      The assumption here is that the value of \(\mathbf{q}\) at any point \(\mathbf{x}_i\) is constrained by its value on the other points; otherwise, if \(\mathbf{q}(\mathbf{x}_i)\) can go to \(\infty\) without any other value going to \(-\infty\), then the complexity is \(\infty\).

    17. common-design assumption

      The assumption that the tasks share a common design.

    18. One of the most promising methods for multitask and transfer learning is founded on the belief that multiple, differing tasks are distinguished by a small number of task-specific parameters, but often share a common low-dimensional representation

      Low-dimensional being the keyword here.

    1. \(g_B\) is a deep network and in our theory (Section 3), \(g_B\) is a linear projection.

      Why is this valid?

      Why is a deep network approximated with a linear projection?

    2. the head only accommodates the distorted features of ID points and performs poorly

      The point being made is that the head performs poorly, while the feature extractor may still be giving appropriate features for the head to perform well.

      Can be verified by probing the representations of the finetuned model.

    1. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited.

      Also, we can't guarantee sequential reads/writes in a B-Tree-like structure, no matter how we store it in memory.

    2. per-consumer queue

      A queue per consumer, to efficiently distribute all messages among the various consumers. The per-consumer queue is probably maintained by each consumer.

      The Kafka broker manages consumer groups, maintaining the state of each consumer and distributing load amongst them.

    3. we at least double the available cache by having automatic access to all free memory,

      On the other hand, if we make an in-memory cache ourselves, that cache will take half the memory, and the other half will be taken by the OS pagecache when it writes to disk.

      Though we can probably free our cache as we send the writes to the OS, we don't control when the OS will flush those out to disk, so we are only guaranteed access to half of the available memory for our (in-memory) cache.

    4. relying on pagecache

      That is, write directly to the filesystem; the OS will buffer your writes.

    5. performance divergence,

      The disk is fast when writing/reading lots of data sequentially, but not so fast when writing/reading data in a random-access fashion.

      This is the divergence.

      The OS maintains a cache of writes, and only when enough have accumulated are they sent to the disk. This allows for efficient writes.

  7. Mar 2023
    1. Figure 39.3

      There is one entry in the OFT per open() call. Each call to open() in any process creates a new entry in the OFT.

      Each process maintains its own set of file descriptors in its PCB (process control block). These file descriptors are just numbered pointers to entries in the OFT.

      The fork() call creates a child process and copies the contents of the parent's file descriptor list over to the child, without executing open() in the child process again and again; instead it increments the refcnt of each entry in the OFT.

      A dup() call in a process creates a new entry in that process's file descriptor list and updates the refcnt. It returns the position of the new entry in the file descriptor list.

      dup2() does the same as dup() but puts the new entry at a specified position, so that any other function accessing the entry at that position is redirected to the file referred to by the new entry.

      See here too.

    2. dup

      Returns a copied file descriptor that points to the same entry in the OFT. The new file descriptor has a new number.
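      A minimal sketch of this shared-OFT-entry behavior, using Python's os wrappers around these syscalls (POSIX assumed; the file name is made up):

      ```python
      import os

      # dup'ed descriptors share one OFT entry, hence one file offset.
      fd = os.open("demo.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
      os.write(fd, b"abcdef")
      os.lseek(fd, 0, os.SEEK_SET)

      fd2 = os.dup(fd)        # new descriptor number, same OFT entry (refcnt bumped)
      os.read(fd, 2)          # advances the shared offset to 2
      print(os.read(fd2, 2))  # b'cd' -- fd2 sees the offset advanced via fd
      os.close(fd)
      os.close(fd2)
      ```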

    1. sleep(b, &ide_lock);

      Releases lock atomically.

    2. ide intr()

      Called by OS, when the application program requests a read/write.

    3. ide start request()

      Called by ISR, (ide_intr()).

    4. ide rw(

      Called by the process which wants to read/write.

    5. as if they were memory locations

      That is, locations of RAM.

    6. device registers

      These are registers of the external device.

    1. Quantitative

      No discounting or similar thing in their definition

    2. online

      that is, concurrent learning of reward predictor and the agent.

    3. non-expert users

      Expert users know how to break the decision problem correctly into possible answers and, as a result, can provide very fine-grained feedback.

      We want to succeed even in the absence of such detailed feedback.

    4. we can only recognize the desired behavior, but notnecessarily demonstrate it

      Like NP problems: a solution can be verified in polynomial time, but not necessarily generated.

  8. Aug 2022
    1. For each field we rank the most likely value from the document for that field. If the most likely prediction falls below a tuned likelihood threshold, we emit null to indicate no field value is predicted

      Better than having a separate label for null?

    2. Filtering

      using regexes

    3. token

      the amount of "gross" written and "net" written in the invoice.

    1. Field Extraction from Administrative Documents byIncremental Structural Templates

      D’Andecy et al. (2018) build upon this approach by incorporating an a-priori model of term-positions into their iterative layout-specific extraction model, significantly boosting performance on difficult fields.

    2. sidw

      This can be adjusted when only one document is available. For itf-df we need more than one document, though.

    3. From this list of matched nodes, weobtain a set of polar coordinates that indicate a possible spatialrelationship between the word wi and the target field to extract

      That is, suppose we are extracting a word that is 1.5 cm away along a line at -45 degrees from a node whose transcript is "Total".

      Then we don't only consider a "Total" appearing in one particular location (say at the end of the document), but all "Total"s occurring throughout the document, and cast a vote for a location that is 1.5 cm along the -45 degree line from each of those locations.

      In effect, the model has the following assumptions:

      1. A field to extract has a particular position relative to all other words in the invoice/document. All of these are constraints for the field. (Alternatively, the constraints are the edges of the star graph.)

      2. The position that satisfies most of these constraints is the one that gets extracted.

      Additionally, the model is made better and more in line with reality when the three kinds of weightings in section VI are done incrementally.

    4. j

      \(j\) iterates over all the nodes having the same transcription as the \(i^{th}\) word, \(w_i\).

    5. definition of fixed spatial templates

      The document processing (structured information extraction) problem can then be broken into two tasks:

      1. Template identification (identifying templates from a single document, or grouping documents with the same templates together).

      2. OCR.

    1. Algorithm 1

      All (uni/bi/tri/quad)-grams from any value of a field will have that field as their label in the final results dict.

      This training data generation produces {nGram, field} pairs, which can be used to improve and adapt the classifier. But it is less clear how to use this to improve the OCR, or the character-level language model in case we decide to use one. [Think more about this later.]
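      A minimal sketch of the labelling scheme as I read it (the field names and values are made up):

      ```python
      # Every (1..4)-gram drawn from a field's value gets that field as its label.
      def ngrams(tokens, n_max=4):
          for n in range(1, n_max + 1):
              for i in range(len(tokens) - n + 1):
                  yield " ".join(tokens[i:i + n])

      fields = {"total_amount": "100.00 USD", "invoice_date": "12 Aug 2022"}
      training_pairs = [(gram, field)
                        for field, value in fields.items()
                        for gram in ngrams(value.split())]
      print(training_pairs)  # {nGram, field} pairs for adapting the classifier
      ```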

    2. Hungarian algorithm

      The agents are the various classes. The tasks are all the words labelled with that class.
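      A sketch of such an assignment with scipy's Hungarian-algorithm implementation (the scores are invented):

      ```python
      import numpy as np
      from scipy.optimize import linear_sum_assignment

      # Rows: fields (the "agents"); columns: candidate words (the "tasks").
      scores = np.array([[0.9, 0.1, 0.3],   # e.g. "amount" vs three candidate words
                         [0.2, 0.8, 0.1],   # "date"
                         [0.1, 0.3, 0.7]])  # "invoice_no"
      rows, cols = linear_sum_assignment(-scores)  # negate: the solver minimises cost
      print(list(zip(rows, cols)))  # one-to-one field -> word assignment, max total score
      ```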

    3. The UBL invoice produced by the engine is presented to the user along with the original PDF invoice in a graphical user interface (GUI)

      We can use this data to further finetune our models. We would need an extra step to find exactly which model went wrong. The architecture of our pipeline should be such that this step can be performed easily.

      [Aside: use a large model like T5 and condition on a natural-language description of what exactly went wrong?]

    4. Post Processor

      There is a consistency issue with the previous step:

      1. The N-grams that are classified as belonging to a particular field should all compose a single contiguous unit (like a word, a date, or an entire amount) that can be the value of that field. That is, the Classifier should output "amount" for all characters in "100.0o" for the post-processor to convert it to "100.00". This means that the classifier should have language-modelling-like abilities.

      2. The post-processor can be thought of as a character-level language model that is used to correct spellings/common mistakes of the OCR*. In the case of poor images, it may be necessary to guess the characters based on the output of the classifier. For example, when, in an invoice, half of a zero has evaporated so it looks like "100.0C". In this case it would be wrong to expect the OCR to predict 0 at the last position. The character-level language model may also let it go undetected (reading C as Celsius, which occurs fairly commonly), unless the language model is conditioned on the predictions of the previous step, i.e., it knows that it is producing an amount, not a temperature.

      An architecture that will allow us to handle these dependencies between the two tasks efficiently can be constructed by making a few layers common to the two tasks of language-model spell correction and classifying the n-grams.

      Other approaches can be to ensure proper images, perform some image enhancement/edge detection, do data augmentation to include such poor samples in training, or finetune a pre-trained vision model.

      It can also be the case that there are invoices where C stands for cents. In this case, the sample will be truly ambiguous. [Probably we can do a joint modelling of replacements and classifier labels, with a CRF that is aware of this?]

      [*Probably we could have had a character-level language model correct characters output by the OCR (step 1) earlier? As this would be character-by-character replacement, we can preserve the features associated with various character spans. Regexes make sense in case we already know what we are searching for and it has a fixed format, like dates and amounts.]

    5. Classifier.

      Can also be a CRF. The CRF can be a linear-chain CRF (only one dimension) or a CRF in 2 directions too.

      In case all classes are not known a priori, we can train a model that learns to extract {key: value} pairs instead (using just B-K, I-K, O, B-V, I-V tags, for example). That is, cast the problem as a sequence-tagging one.

      We would need a new model to include different kinds of annotations (for example, list-item data), though.

    6. N-grammer.

      This step is necessary to find units with which features in the next step can be associated.

      Converting to actual words at this point will result in a loss of correspondence between the characters identified by OCR (to which features like size etc. can be associated) and the actual words.

    7. Text Extractor.

      The OCR for a character is independent of the structure/context in which the character occurs. Hence it makes sense to include this step as a separate one.

    8. for each field,

      Pre-specified by the user? Or will they differ based on the extraction from various templates?

      What about list-item type data, rather than key-value data?

    1. ```python
       def __init__(self, key):
           key1, key2 = jrandom.split(key)
           self.layers = [eqx.nn.Linear(2, 8, key=key1),
                          eqx.nn.Linear(8, 2, key=key2)]
           self.activation = jnn.relu
           self.bias = jnp.ones(2)
       ```
      1. We don't need to write this __init__. If we already construct the layers, activation and bias outside the class, we can pass them through the automatically constructed __init__() (see the sketch after this list). Note that __init__() is also present, and the one we wrote will only be called if we pass a single argument (the key) while initializing.

      2. We can also add in_dim, out_dim, hidden_dim, activation_fn as arguments in this __init__, rather than having fixed values, if we want.

      3. All the fields specified in the annotation list are considered as the children of the PyTree defined by the class.
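      A minimal sketch of the point in note 1 (assuming equinox and jax are installed; the class and field names are my own):

      ```python
      import equinox as eqx
      import jax.nn as jnn
      import jax.numpy as jnp
      import jax.random as jrandom

      # Declaring the fields is enough: eqx.Module auto-generates an __init__
      # that takes them as arguments, so the hand-written one above is optional.
      class MLP(eqx.Module):
          layers: list
          activation: callable
          bias: jnp.ndarray

      key1, key2 = jrandom.split(jrandom.PRNGKey(0))
      model = MLP(layers=[eqx.nn.Linear(2, 8, key=key1),
                          eqx.nn.Linear(8, 2, key=key2)],
                  activation=jnn.relu,
                  bias=jnp.ones(2))
      ```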

    2. This cannot be true in general. We may wish to parameterise a function by arbitrary Python types, unknown to JAX.

      This is a problem because:

      As we won't be writing our own tree_flatten and tree_unflatten (they will be inherited from eqx.Module), all the attributes of self that are set in __init__ will be passed through these flatten and unflatten functions.

      So, if we want to write a class like in Appendix A that has the activation function as a class attribute (we could also have written an __init__() that takes in the activation function alongside the key), we will end up having a leaf in the PyTree that is not a jnp.array but an activation function.

    3. which is a pure function

      With the first argument (self) being a PyTree instance holding all the parameters (as jax.grad differentiates w.r.t. the first argument, it is ideal that the first argument contains all the parameters!).

    4. The gradient returned by higher_order_func will itself be an instance of Adder

      As the gradient of a PyTree basically corresponds to calculating the gradient w.r.t. all leaf nodes and storing them in the same PyTree structure.

    5. referential transparenc

      An expression is called referentially transparent if it can be replaced with its corresponding value (and vice-versa) without changing the program's behavior.

    1. JAX sometimes needs to compare treedef for equality.

      For example, when ensuring that the treedefs of two arguments of a function are the same, as in the vmap example.

      As the auxiliary data gets added to the treedef, it becomes essential that the auxiliary data also matches for vmap to be applied properly. So, we need to keep this in mind while defining/registering a new pytree node.
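      A minimal sketch of treedef comparison (the structure below is made up):

      ```python
      import jax
      import jax.numpy as jnp

      # Two pytrees with the same structure but different leaves share a treedef,
      # which is what lets vmap/tree_map pair their leaves up.
      t1 = {"w": jnp.zeros((2, 2)), "b": jnp.zeros(2)}
      t2 = {"w": jnp.ones((2, 2)), "b": jnp.ones(2)}
      _, def1 = jax.tree_util.tree_flatten(t1)
      _, def2 = jax.tree_util.tree_flatten(t2)
      print(def1 == def2)  # True: same structure (and same auxiliary data, if any)
      ```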

  9. Jul 2022
    1. Gradient-enhanced physics-informed neural networks for forward andinverse PDE problems

      The gradient of the loss with respect to the inputs (rather than the NN parameters) contains quite meaningful information that can easily be used to learn better NN parameters via backprop.

    1. this is whythe dual map and adjoint use the same notation

      Adjoint is also denoted as \(T^*\). What about the conjugation in the adjoint, though?

    1. \(\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{1}{n I(\theta)}\)

      Nice result: the variance of the estimator is bounded below by 1/[information content]! This is the meaning of "maximum information" in a sample.
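      A numeric spot-check of the bound for a case where it is tight (Bernoulli MLE; the numbers are arbitrary):

      ```python
      import numpy as np

      # For Bernoulli(p), the MLE p_hat is the sample mean, I(p) = 1/(p(1-p)),
      # so the Cramer-Rao bound 1/(n I(p)) = p(1-p)/n should match Var(p_hat).
      rng = np.random.default_rng(0)
      p, n, trials = 0.3, 200, 20000
      p_hat = rng.binomial(n, p, size=trials) / n
      print(p_hat.var())       # empirical variance of the estimator
      print(p * (1 - p) / n)   # the bound: nearly equal, since the MLE is efficient
      ```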

    2. n

      scales proportional to \(n^{-1}\).

    3. \(m'(\theta)\)

      Measures how sensitive the expected value of the estimator is to small changes in the underlying distribution's real parameter \(\theta\).

    4. \(E_\theta(\hat{\theta})\)

      The expected value of the estimator \(\hat{\theta}\) over all possible samples drawn from the "real" distribution \(f(x|\theta)\), as a function of the "real" parameter \(\theta\).

      Here we suppose "ah, if \(\theta\) were our real parameter.." and then ask, "what would the expected value of our estimator be with this \(\theta\)?"

    5. We will show how to use Fisher information to determine the lower bound for the variance of an estimator of the parameter θ

      Fisher information tells us the maximum information we can get about the parameter \(\theta\) from a sample \(X_1, ..., X_n\).

      As this is the maximum information, our inference must have a variance related to this maximum information.

    1. Exercise 4

      Consider the set of all \(x\) for which \(P(x)\) is false. Consider a minimal element of this set, \(y\). \(P(x)\) is true \(\forall x<y \).

      Now, \(y=\mathrm{succ}(z)\) for some \(z \in X\), xor \(y=\sup([\min(X), y))\). In both cases, we can apply one of the suppositions (2 and 3 respectively) to conclude that \(P(y)\) is true, which is a contradiction.

      Hence \(P(x)\) is true \(\forall x\in X\).

    2. n particular, is not the successor of any element in X

      It is assumed \(min(X) \in [min(X), min(X))\).

    3. a totally ordered set

      Not needed, as being totally ordered is implied by the existence of minimal elements for subsets consisting of only \(2\) elements of \(X\).

    4. a collection

      The proof doesn't require \(\mathcal{F}\) to be the collection of all subsets of \(X\). Hence we can, for one, consider \(\mathcal{F}\) to be a collection of mutually disjoint subsets of \(X\).

    5. non-empty

      This is essential as \(f(\phi) \notin \phi\), no matter how you choose \(f\).

    1. \(d|_{f_0}(x_j)\)

      shape\(=n_L\)

    2. \(K(x, x_j)\)

      shape\(=n_L\times n_L\)

    3. As a result, the (functional) derivative of the cost C at apoint f0 ∈ F can be viewed as an element of F

      \(C\) is defined to have a (functional) derivative at \(f_0\) if the map from \(\mathcal{F}\) to \(\mathbb{R}\) defined as:

      \(\rho \longrightarrow \lim_{x \rightarrow 0} \frac{C[f_0+x\rho]-C[f_0]}{x}\)

      is an element of \(\mathcal{F^*}\).

    4. \(\Phi_K\)

      maps linear forms [functions from \(\mathcal{F}\) to \(\mathbb{R}\)] to functions on the data \(\in \mathcal{F}\).

    5. on the data

      or in the support of \(p^{in}\).

      Note that \(||f||_K=||f||_{p^{in}}=0\) if \(f\) is \(0\) at all data points.

      Or in other words, \(||f-g||_K=||f-g||_{p^{in}}=0\) if \(f\) and \(g\) are the same on all the data points.

    6. The kernel \(K\) is positive definite with respect to \(||\cdot||_{p^{in}}\) if \(||f||_{p^{in}} > 0 \implies ||f||_K > 0\)

      If \(K(x,x')\) is not positive definite for some \((x,x')\) pair,

      \(\implies \exists v \in \mathbb{R}^{n_L} \setminus \{\mathbf{0}\} : v^T K(x,x') v \le 0\).

      Setting \(f(x)=v \forall x \in \mathbb{R}^{n_0}\), we have \(||f||_K = \sqrt{\langle f, f \rangle_K}\) is not defined or 0.

      whereas \( ||f||_{p^{in}} \) is still well defined and \(>0\) as \(v\ne\mathbf{0}\).

      This means that if \(K(x,x')\) is not positive definite for some \((x,x')\), then the kernel \(K\) is not positive definite as per the definition here.

      Moreover, if \(K(x,x')\) is positive definite for every \((x,x')\), then

      \(\langle f,f \rangle > 0 \forall f \) that are non-zero in some region(\(\subseteq \mathbb{R}^{n_0}\)) of non-zero probability,

      which means that \(||f||_{p^{in}} > 0 \implies ||f||_K > 0\) for all \(p^{in}\). This means that \(K\) is positive definite w.r.t. \(||\cdot||_{p^{in}}\).

      This means the current definition is equivalent to saying that K is positive definite iff \(K(x,x')\) is a positive definite \(n_L \times n_L\) matrix, for all \((x,x') \sim p^{in}\).

    7. twice differentiable nonlinearity function \(\sigma : \mathbb{R} \to \mathbb{R}\), with bounded second derivative

      The second derivative of the output of the ANN with respect to the input of the ANN may not be bounded?

    8. empirical distribution

      The expected value in the seminorm then resolves to the empirical average over the entire dataset of \(N\) samples.

    9. seminorm

      A seminorm satisfies: 1. the triangle inequality; 2. absolute homogeneity, i.e., \(||s \cdot f||=|s| \cdot ||f||\) for all scalars \(s\) and all functions \(f \in \mathcal{F}\); 3. (1 and 2 imply) non-negativity, i.e., \(||f|| \ge 0\ \forall f \in \mathcal{F}\).

      For a seminorm to be a norm, it is additionally required that: 4. \(||f||=0 \implies f=0\).

      (https://en.wikipedia.org/wiki/Seminorm#Definition)

      The triangle inequality here follows from: 1. \(\langle v, u \rangle \le ||v||\,||u||\ \forall v, u \in \mathbb{R}^{n_L}\), and 2. the linearity of integration.

    1. We assume that the local curvature of the loss surface (the spectral norm of the Hessian) increases or decreases monotonically along the optimization trajectory.

      Does it? [CONFIRM]

    1. Equation 5

      Equation 4?

      The first term is positive semidefinite, as \(vv^T\) is, and a sum of PSD matrices is PSD.
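      A numeric spot-check of that PSD argument, with random stand-ins for the gradients and the nonnegative loss curvatures:

      ```python
      import numpy as np

      # A nonnegative-weighted sum of outer products v v^T is PSD,
      # since each v v^T is PSD and the PSD cone is closed under such sums.
      rng = np.random.default_rng(1)
      vs = rng.normal(size=(10, 5))    # ten stand-in gradient vectors v_i
      weights = rng.uniform(size=10)   # nonnegative multipliers (e.g. l''_i >= 0)
      H = sum(w * np.outer(v, v) for w, v in zip(weights, vs))
      print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True: no negative eigenvalues
      ```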

    2. Figure 6

      Figure 5?

    3. larger positive eigenvalues

      That is sharper minima.

      This kind of analysis (comparing the magnitudes of the eigenvalues) tells us that the sharper minima found by large-batch gradient descent are indeed sharper in terms of the curvature, i.e., the eigenvalues of the Hessian.

      That is, it is not the case that sharper minima just span a smaller number of dimensions.

      The number of dimensions spanned by the minima can be guessed from the number of positive eigenvalues.

    4. correlated data

      In this case \(>M-N\) eigenvalues are expected to be trivial, i.e., \(0\).

    5. if \(\ell'(f(\hat{w}))\) and \(\nabla^2 f(\hat{w})\) are not correlated

      When is this true?

    6. Figure 1

      The second derivative is 0 along most directions at the final trained point in parameter space!

    7. bottom of the landscape

      low loss regions of parameter space.

    8. The bulk is mostly full of zero eigenvalues

      A problem for the non-degeneracy assumption.

    9. Hessian of theloss is non-degenerate

      If the Hessian is degenerate, then it has at least one 0 eigenvalue. That is, there is some subspace of directions along which the derivative stays 0. When the number of parameters \(M\) is much larger than the number of samples \(N\), it is possible that all of the \(N\) samples allow gradients only in the kernel of the Hessian, in which case we will converge to a wrong solution.

    10. gradient descent

      *full batch

    1. Figure 3

      The green peaks occur at small plateaus just before big slingshots of the weight norm. The sharpness increases first, indicating that a big change in the last-layer norm is about to come.

    2. Cosine distance

      Why cosine distance? What happens with Euclidean distance?

    3. We are unable to observe Slingshot Effects with Adagrad [6] and also with stochastic gradient descent (SGD) or SGD with momentum, pointing to the generality of the mechanism across architectures and datasets.

      Is this sentence correct?

    4. the training loss retains a low value in periods of rapid norm growth, and then wildly fluctuating in periods of norm plateau.

      Why does the training loss retain a low value when the norm grows rapidly?

      Probably because, at first, the norm increase is just trying to increase the difference between the logit of the correct label and the other labels. But, as we have momentum in our optimizers, maybe it increases the norm too much. This leads to an incompatibility between the features available to the last layer and the classifying function implemented by the last layer. The features and classifier need to be made compatible again. This causes feature updates, and wild loss fluctuations, to happen after the norm growth?

    5. at the point of reaching perfect classification of the training set, the cross-entropy (CE) loss by design pressures the classification layer to grow in norm at a relatively fast rate

      so that the difference in the logit of the correct label and other labels increases.

    1. This system would consist of three stages. A given question is read and reformulated in the first stage, followed by information retrieval via a search engine. An answer is then synthesized based on the query and a set of retrieved documents.

      Here, the search engine can be thought of as checking whether the asked question is answered in the available documents.

      The first stage corresponds to finding questions related to the actual question that are answered in the available documents, the answers to which will help answer the actual final question.

      The first step (and the last), although it looks simple at first glance, will be quite complex to perform as the questions become more and more complex.

      For example, say we want to answer a question like "What would be the consequences of the Riemann hypothesis being false?", given only documents that explain the Riemann Hypothesis and results derived thereof.

      Then we would need to reformulate the question as "What results follow from the Riemann Hypothesis?", find answers for that from our search engine, and negate them in the synthesis stage to answer the counterfactual question.

  10. Jun 2022
    1. because we didn't observe any significant V-composition

      i.e., \(\frac{||W_{OV}^{h_1}W_{OV}^{h_2}||_F}{||W_{OV}^{h_1}||_F\,||W_{OV}^{h_2}||_F}\) was small for all pairs \((h_1, h_2)\).
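      A minimal sketch of computing that ratio (random stand-ins for the OV matrices):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      W_OV_h1 = rng.normal(size=(64, 64))  # stand-in OV circuit for head h1
      W_OV_h2 = rng.normal(size=(64, 64))  # stand-in OV circuit for head h2

      # np.linalg.norm on a matrix defaults to the Frobenius norm.
      ratio = (np.linalg.norm(W_OV_h1 @ W_OV_h2)
               / (np.linalg.norm(W_OV_h1) * np.linalg.norm(W_OV_h2)))
      print(ratio)  # small => little V-composition between the two heads
      ```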

    2. for the present token

      for the present token, with the considered head on the logit..

    3. It's also less vulnerable to distributional shift, since it doesn't depend on learned statistics about whether one token can plausibly follow another.

      Transformers may learn exact parsing of sentences, rather than just using heuristics like word matching, bigrams, etc.

    4. the value vector at these default positions will be small,

      It is so because otherwise it would end up interfering with information transfer from other source tokens.

    5. the effect on the output is solely a function of that source token

      In the case of higher layers, the source token's embedding after the first layer.

    6. if attended to

      As \(A^h\) is in tensor product with \(W_UW_{OV}^hW_E\).

    7. One basic consequence is that the residual stream doesn't have a "privileged basis"; we could rotate it by rotating all the matrices interacting with it, without changing model behavior.

      As the residual stream gets rotated, each of the matrices used to read from and write to the residual stream is also rotated in the opposite direction. The attention heads and other MLPs are kept as-is.

      Rotating these bases to be as axis-aligned as possible could reveal new (or easily interpretable) insights.
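      A numeric sketch of that invariance, with a toy one-layer "read, transform, write" computation standing in for the model:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      d = 8
      W_in, W_out = rng.normal(size=(d, d)), rng.normal(size=(d, d))
      x = rng.normal(size=d)                        # residual stream state
      R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation (orthogonal)

      y1 = W_out @ np.tanh(W_in @ x)  # original computation
      # Rotate the stream (x -> Rx) and counter-rotate every matrix touching it:
      y2 = R.T @ ((R @ W_out) @ np.tanh((W_in @ R.T) @ (R @ x)))
      print(np.allclose(y1, y2))  # True: behavior unchanged, so no privileged basis
      ```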

    8. These virtual weights are the product of the output weights of one layer with the input weights of another (i.e. W_{I}^2W_{O}^1), and describe the extent to which a later layer reads in the information written by a previous layer. [Footnote: for attention layers, there are three different kinds of input weights: W_Q, W_K, and W_V. For simplicity and generality, we think of layers as just having input and output weights here.]

      What could be the analogue of this relation (/information exchange) between more than 2 layers?

    1. abductively

      Related to abductive reasoning. Abductive reasoning is "making a probable conclusion from what you already know."

    1. new

      any of the new. The max in eqn 8 selects the constraint that is closest to being fully satisfied in the next few steps.

    1. Table 4:

      What does the performance of adapters with PET look like?

    1. For an arbitrary point x

      Can be a test point outside of \(\mathcal{X}\).

      Rather than using the expressions in (10) and (11), we can also obtain \(\theta_t\) from equation (8) and use it to compute the output on an arbitrary \(x\).

    2. ˆΘt(x, X )

      This is of dimension \(k \times k|\mathcal{D}|\).

    3. (6)

      \(f_t^{lin}(x)\) depends on both \(\theta_0\) and \(\theta_t\).

      We take the gradient on both sides of equation (5), to get \(\nabla_{\theta_t} f_t^{lin}(\mathcal{X})=\) coefficient of \(\theta_t\) in (5), which is \(\nabla_\theta f_0(\mathcal{X}) |_{\theta=\theta_0}\).

      The gradient on both sides of eqn (5) is taken w.r.t. the running parameters, \(\theta_t\), as the gradient in equation (2) is also w.r.t. the running parameters.

    4. 3

      The second equality here is obtained by substituting for \(\dot{\theta_t}\) from equation (2).

    5. Unlike the standard parameterization that only normalizes the forward dynamics of the network, the NTK-parameterization also normalizes its backward dynamics.

      The standard parameterization samples directly from \(\mathcal{N}(0,\frac{\sigma_w}{\sqrt{n_l}})\). Thus, gradients are passed only up to \(W_{i,j}^l\). But the gradients in this paper are passed back one step further, to \(w_{ij}^l\), and hence are scaled by a factor of \(\frac{\sigma_w}{\sqrt{n_l}}\) for all weights of the network.
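      A minimal sketch of the contrast (my own toy setup):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, sigma_w = 128, 1.0

      # Standard parameterization: the scale is baked into the stored weights.
      W_standard = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))

      # NTK parameterization: store unit-scale w and apply the scale in the forward
      # pass, so d(output)/d(w) also picks up the sigma_w/sqrt(n) factor.
      w_ntk = rng.normal(0.0, 1.0, size=(n, n))

      def forward_ntk(x):
          return (sigma_w / np.sqrt(n)) * w_ntk @ x

      print(W_standard.std(), (sigma_w / np.sqrt(n)) * w_ntk.std())  # same forward scale
      ```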

    1. it makes sense to consider the two-dimensional case as shown in the following figure.

      The initialised plot of the conditional distribution \(P(Y|X)\) is again not aligned with the initialised contour plot.

    2. The following figure shows the influence of these parameters on a two-dimensional Gaussian distribution.

      The initialized values in the covariance matrix do not correspond to the contour plot on the left. Slightly perturbing the yellow eigenvectors will correct the covariance matrix.

    1. if the data can be classified by a function in the NTRF function class \(\mathcal{F}(\mathbf{W}^{(1)}, \widetilde{O}(1))\) with a small training error, the over-parameterized ReLU network learnt by Algorithm 1 will have a small generalization error

      Even though we have a lot of parameters, the complexity of the function class \(\mathcal{F}\) is still low because the output of the central function in \(\mathcal{F}\) [corresponding to the weights \(\mathbb{W}^{(1)}\)] can only be perturbed in a linear way (specified in the definition of the NTRF) for every input.

      This low complexity of \(\mathcal{F}\) probably allows us to use VC-dimension-like generalization bounds of \(\widetilde{O}(\frac{1}{\sqrt{n}})\).

    2. Definition 3.2
      1. \(\mathbf{W}\) acts as a specification for perturbation to each and every weight of a neural network.

      2. The perturbation in each sample's output is proportional to the sum over all weights of the network of: {perturbation of weight multiplied by the gradient of the output on that sample}.

      This NTRF function class sort of describes all the functions available in an \(\frac{R}{\sqrt{m}}\) neighborhood of \(\mathbf{W}^{(1)}\). This is true when \(\frac{R}{\sqrt{m}}\) is small enough that \(f(x;\mathbf{W}^{(1)})\) is approximately linear in the weights in the \(\frac{R}{\sqrt{m}}\) neighborhood of \(\mathbf{W}^{(1)}\) for all inputs \(x\).

    3. In fact, the empirical observation by Zhang et al. (2017) indicates that in order to understand deep learning, it is important to distinguish the true data labels from random labels when studying generalization. In other words, it is essential to quantify the “classifiability” of the underlying data distribution, i.e., how difficult it can be classified.

      New thought-1.

    1. constant learning rate c = 1

      \(c=1\) "epoch"

    2. Starting from ˆ

      Starting from the \(\hat{w}\) that has been trained for \(B\) or \(0.75B\) time.

    3. Another important insight we can get from Figure 3 is that while the train loss and test error surfaces are qualitatively similar, they are not perfectly aligned

      The cyclical and constant learning rate schedules are leading to much more aligned train-test losses than in Figure 1.

    4. The key insight from Figure 3 is that both methods explore points close to the periphery of the set of high-performing networks.

      Because of the learning rate schedule and the use of SGD [not Adam].

  11. May 2022
    1. as long as the discretization approximation is justified.

      Figure out when this is satisfied, exactly.

    2. higher noise levels in gradients

      Correspond to smaller batch sizes and larger learning rates.

    1. stochastic differential equation

      A differential equation in which one or more terms is a stochastic process.

  12. Apr 2022
    1. We also find evidencethat some latent features that the model learns maynot be related to linguistic phenomena.

      What are these?

  13. Feb 2022
    1. the noun phrase (NP) the hedgehog, which has only been observed as a subject, needs to be interpreted in the direct object position.

      Here, we are expecting the model to consolidate the following points:

      1) A particular noun phrase that occurs at the first argument position of a transitive verb can also occur at the second.

      2) love, which has been seen before as a transitive verb, must be a transitive verb this time too (not an abstract noun as in "Love is the greatest thing ever.").

      3) "The boy", "the hedgehog" which occurred as NP before, must still be NP.

      4) The second argument position of the transitive verb "love" was the NP that came later on in the sequence, so it must be the same this time too.

    2. expressions providing evidence that definite NPs may appear in both argument positions of a transitive verb.

      If the model considers semantic information too, the ability of an NP to occur at both argument positions of a transitive verb depends on what the NP is, as well as what the transitive verb is. E.g., you can't "eat" "water"; you can only "drink" "water".

      But here, we are assuming that the model tries to learn to generalize the grammatical information only.

  14. Jan 2022
    1. MLP layers are local and translationally equivariant

      translationally equivariant to shifts of size \(P\).

    1. Hence, at testing time the same object may appe

      But we didn't train our network's prediction to be invariant to this \(\frac{1}{3}^{rd}\) factor!

      This means the network will not detect, or will mis-detect, objects of this smaller size, leading to errors.

    2. \(K_{test} \times K_{test}\)

      No re-scaling is required after taking the RoC of this size, unlike at training time, when we needed to re-scale the RoC.

    3. isotropically resizing the image so that the shorter dimension is Kimage

      \(K_{test}^{image}\) is user specified.

      \(K_{train} = K_{test} \le K_{test}^{image}\).

      The difference between \(K_{test}\) and \(K_{test}^{image}\) allows to express bias towards important things being present at the centre of the image.

    4. Hence, pre-processing standardizes the apparent size, which otherwise would depend on the input image resolution. This is important as networks do not have built-in scale invariance.

      The final number of pixels an object occupies in a sampled RoC only depends on the actual height of the object and its distance from the camera. It doesn't depend on the resolution of the camera. This is to be expected, as we are re-scaling to a fixed size (\(K_{train}\)).

      The number of pixels occupied by the object in an original image, however, would depend on the resolution of the image. In particular, \(r_1 = k\sqrt{HW}\frac{R}{Z}\) is dependent on both \(H\) and \(W\).

    5. that α = 1

      Meaning we are taking a square RoC out of a square image.

    6. s

      \(s\) is the geometric mean of \(\frac{K_{train}}{H_{RoC}}\) (scaling factor in the \(X\) direction) and \(\frac{K_{train}}{W_{RoC}}\) (scaling factor in the \(Y\) direction).

    7. If an object has an extent of r × r pixels in the input image, and if s is the scaling factor between input image and the crop, then by the time the object is analysed by the CNN, it will have the new size of rs × rs pixels.

      Assuming that the entire \(r \times r\) object happens to be in the RoC. If only a part of it is in the RoC, then that part's pixels in both directions will be scaled by \(s\).
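      Toy arithmetic for the rescaling (the numbers are invented):

      ```python
      import numpy as np

      # An object of r x r pixels in the input image ends up ~ rs x rs pixels
      # after the crop is resized, with s the geometric-mean scale factor.
      K_train, H_RoC, W_RoC = 224, 448, 400
      s = np.sqrt((K_train / H_RoC) * (K_train / W_RoC))
      r = 100
      print(r * s)  # apparent size (in pixels) of the object as the CNN sees it
      ```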

    8. While the focal length is variable, the field of view angle \(\theta_{FOV}\) of most cameras is usually in the \([40°, 60°]\) range.

      See here for a detailed explanation.

    9. \(H \times W\)

      Not pixels, but can be converted to pixels.

    10. Since different pre-processing protocols are used at training and testing time

      If we took random RoCs at testing time too, they would have had the same distribution.

    1. spread information between the patch tokens and the class token

      That is, initially, all information regarding the class is contained in the patch tokens, and the class token is independent of the patch tokens. As training progresses, the class token's final embedding becomes correlated with (hence dependent on) the patch tokens. It is as if information has travelled from the patch tokens to the class token.

    2. invariant

      *equivariant

    1. \((\nabla^2 L)(\theta)\)

      Is this positive semi-definite, though? It must be for the square root to be valid. This positive semi-definiteness of the Hessian seems to be an additional assumption here.

    2. \(\min_{k \le K}(n_k)\)

      The minimum number of weights in any layer.

    3. r

      non-zero.

    4. sorted

      By multiplicity.

    5. singular values of A

      These are square roots of (Eigenvalues of \(A^TA\)). As A^TA is positive semi-definite always, the square roots are valid.

    6. positive diagonal element

      As we are at a minima, the second derivative in some direction must be positive. Otherwise, if all directions, have second derivative negative or zero, we are at a maxima or a completely flat region(in case all second derivatives on the diagonal are 0).

    7. Since all norms are equivalent in finite dimension
    8. Therefore, \(T_\alpha(B_\infty(r,\theta)) \cap B_\infty(r,\theta)\)

      By the particular choice of \(\alpha\).

    9. is a connected region

      Intuitively, you can imagine each point, \(\theta'=(\theta'_1, \theta'_2)\) in the initial \(B_{\infty}(r, \theta)\) moving along the curve \((\alpha\theta'_1, \alpha^{-1}\theta'_2)\).

    10. the

      The \(\mathcal{L}_{\infty}\) ball is the ball constructed using the \(L^{\infty}\) norm on the parameter space. It is just a cube, of side length \(2r\).

    1. 1

      Add 1 to prevent sharpness from exploding when \(f(x)\) is close to 0.

    2. A

      Provides a mapping between the \(n\)-dimensional model-parameter space and a \(p\)-dimensional manifold.

    3. \(|(A^{+}x)_i|\)

      If this term were always 0, \(C_{\epsilon}\) would have been a perfect cube with side length \(2\epsilon\).

      This term stretches the \(i^{th}\) edge on both sides by \(\epsilon|(A^{+}x)_i|\). As there is an absolute value, no side will be shorter than \(2\epsilon\).

  15. Dec 2021
    1. linear kernel

      The equation of the linear kernel is:

      \(K(x,y) = x^{T}y+c\)

      where \(c\) is some constant.
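      A minimal sketch of the kernel and its Gram matrix (the data is invented):

      ```python
      import numpy as np

      # Linear kernel from the note: K(x, y) = x^T y + c.
      def linear_kernel(x, y, c=0.0):
          return np.dot(x, y) + c

      X = np.array([[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]])
      K = np.array([[linear_kernel(a, b) for b in X] for a in X])  # Gram matrix
      print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: PSD, as a kernel must be
      ```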

    2. isotropic scaling

      Simple scaling corresponds to transforming the space by multiplication with a diagonal matrix. Isotropic scaling corresponds to multiplication by a diagonal matrix all of whose diagonal entries are the same.

    3. orthog-onal transformation

      A linear transformation that preserves dot products between elements of the vector space. In 2-D and 3-D it consists of rotations and reflections, or any combinations thereof.

    1. Cxy

      The entries of the matrix corresponding to this operator are not like those in the covariance matrix. They are of the form \([p(e_i, e'_j) - p(e_i)p(e'_j)]_{ij}\), where:

      1. \(\{e_i\}\) and \(\{e'_j\}\) are basis vectors of \(\mathcal{X}\) and \(\mathcal{Y}\) respectively.
      2. \(p(e_i, e'_j)\) is the probability of sampling the pair of elements corresponding to \(e_i, e'_j\) from \(\mathcal{X} \times \mathcal{Y}\).
    2. Ex,y

      This is only symbolic for the operator. The expectation will actually be calculated when a \(g\) is sent inside it and the tensor product operator gets applied to \(g\).

    3. tensor product operator

      This too is a linear operator.

      In the finite case, it is similar to the tensor product of two vectors, which yields a matrix.

    4. elements

      A single element per Hilbert space (most probably).

    5. 4

      These must be true for all \(f\) and \(g\) respectively.

    6. complete orthonormal system

      A countable complete orthonormal basis \(\{e_i\}_{i=0}^{\infty}\) for a Hilbert space has the following properties:

      1. \(\langle e_i, e_j \rangle = 0\) for \(i \neq j\).
      2. \(||e_i|| = 1\).
      3. The linear vector space spanned by the countably many basis vectors must be dense in the Hilbert space.

      The last point means that not every element of the Hilbert space may be written as a linear combination of the basis vectors, but it can be approximated as closely as we want (in terms of the norm/metric derived from the inner product of the Hilbert space) by some linear combination of the basis vectors.

    7. Hilbert-Schmidt Operator

      In effect, while calculating the inner product in the Hilbert space of linear operators, we multiply corresponding elements of \(C (:= [c_{ij}])\) and \(D (:= [d_{ij}])\) and sum them up.

      $$ \langle C,D \rangle_{HS} := \sum_{i,j} c_{ij}d_{ij}$$

      where \(i, j\) belong to a countable index.
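      The finite-dimensional analogue is just the Frobenius inner product; a quick check with random matrices:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      C, D = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
      # <C, D>_HS = sum_ij c_ij d_ij, equivalently trace(C^T D).
      print(np.sum(C * D), np.trace(C.T @ D))  # equal
      ```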

    8. Hilbert-Schmidt Norm

      Because both \(\mathcal{G}\) and \(\mathcal{F}\) admit countable orthonormal bases, we just sum over two countable sets while calculating this norm.

      The linear operator \(C\) can be imagined as a matrix with a countable (possibly infinite) number of rows and columns.

    9. that F be separable

      In topology, a topological space is called separable if it has a countable dense subset.

      A Hilbert space is separable if it has a countable dense subset too. Together with Zorn's Lemma, this means that the Hilbert space admits a countable orthonormal basis.

    1. universal kernels

      Letting \(C(\mathcal{X})\) denote the space of continuous bounded functions on a compact domain \(\mathcal{X}\), we call a kernel \(k\) "universal" if \(k(x,\cdot)\) is continuous for all \(x\) and the RKHS induced by \(k\) is dense in \(C(\mathcal{X})\).