404 Matching Annotations
  1. Mar 2024
    1. Latent variables

      Notice how flat the second graph is along the \(2x_1+2x_2 = 1\) direction (zero derivative throughout).

      It seems that flatness of the loss, viewed as a function of some representation, along certain directions indicates how good (disentangled) the representations themselves are. The next layer, which operates on these representations, can then discard a flat direction and focus on the others. Hence, the more directions the curve is flat in (i.e., the flatter the curve), the more efficiently the next layer can learn.
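      A quick way to check this numerically — a minimal sketch with a made-up loss, assuming JAX is available:

      ```python
      import jax
      import jax.numpy as jnp

      # Hypothetical loss that depends on x only through 2*x_1 + 2*x_2,
      # so it is flat along the direction (1, -1).
      def loss(x):
          s = 2 * x[0] + 2 * x[1]
          return (s - 1.0) ** 2

      x = jnp.array([0.3, 0.7])
      g = jax.grad(loss)(x)
      flat_dir = jnp.array([1.0, -1.0]) / jnp.sqrt(2.0)
      steep_dir = jnp.array([1.0, 1.0]) / jnp.sqrt(2.0)
      print(jnp.dot(g, flat_dir))   # ~0: the flat direction the next layer can discard
      print(jnp.dot(g, steep_dir))  # non-zero: the direction that carries information
      ```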

    2. Smoothness of the target function.

      Essentially, smoothness limits how many times the network can "turn" in a given small subset of the input. This limit then forces all the data in that subset to be fit by the function without wiggling around.

      One can imagine how this can be used to create an equivalent problem by collapsing many points into a single point, and many wiggly functions into a single smooth variant.

  2. Aug 2023
    1. However, the gradient is non-zero for directions u in the ID subspace and the corresponding features Bu change across the fine-tuning process. We call this feature distortion: the features in some directions are changed but not others

      If we go from ID data to OOD data via a covariate shift, this is true.

      But not if a concept shift has happened when going from ID to OOD data.

  3. Jul 2023
    1. Note that the node representations do not explicitly encode their ground-truth positions.

      Are we sure?

    2. BD

      Brownian Dynamics

    1. Lemma 2.

      But there can be other fixed points too! In particular, \(\theta^*\) is a global minimum. We may get stuck in local minima, or oscillate around them due to the changing nature of \(P_t(X_i; \theta)\).

      This pattern is observed in other places in Deep Learning too: introducing new/meta-variables and varying quantities often doesn't hurt convergence.

      Something about the architecture/initialisation/optimisation in DL seems to enable this.

    2. We will call the model well-specified if there exists \(\theta^*\) such that \(\theta^* \in \arg\min_{\theta} \int \ell(X, Y; \theta)\, Q(dY \mid X)\) for every \(X\).

      That is, there exists a parameter setting \(\theta^*\) that minimizes the loss on each and every input, meaning it can predict the correct probabilities over \(Y\) for every \(x \in X\).

  4. Jun 2023
    1. (ii)

      The \(\sigma\)-algebra of the probability space is closed under complements and countable intersections.

      Combining countable additivity of the measure with the above fact leads to the result.

    2. (i)

      As each event = \(\bar{\phi}\).

    1. Generally we need more assumptions about a specific problem and hypothesis class to bound absolute population risk, hence we focus on bounding the excess risk

      In particular, using the relation between the structure of the problem and the hypothesis class we have chosen. The better the two structures align, the better the best-performing model will be, and the easier it will be to find that model. Hence it is clear that more information helps in obtaining better bounds, but not clear why it is difficult to calculate \(\inf_{g \in \mathcal{H}} L(g)\) and hence convert all the bounds obtained on excess risk to bounds on absolute risk.

      It will probably be difficult for two reasons:

      1. You will need to know the actual distribution \(p\); just samples from it won't do. But maybe we can have limiting-case-like bounds, where as the number of samples \(\rightarrow N\) the actual distribution \(p\) becomes known and the bound becomes valid.

      2. Finding the optimal loss or \(g\) may be a computationally intractable problem.

      Though in some function classes, the optimal loss may just be \(0\)?

    1. model the most basic assumption instead: the best example weighting should minimize the loss of a set of unbiased clean validation examples that are consistent with the evaluation procedure

      Very clean.

    2. Here we argue that in order to learn general forms of training set biases, it is necessary to have a small unbiased validation to guide training.

      Very clean idea.

    1. \(f_{\text{lin}}(x; \theta_0 + \tau)\)

      The prediction of \(f_{lin}\) at \(x\) equals the original value of \(f\) at \(x\) at parameter \(\theta_0\), plus the component of the task vector along the direction of the gradient at \(x\). That is, we add a term corresponding to how well the gradient on the sample \(x\) aligns with the "average" gradient (task vector) direction.

      Another useful way to make sense of it is to replace \(\tau\) with \(\Delta \tau\) and then see what the equation looks like: it is a first-order approximation of \(f\) around \(\theta_0\).
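      A minimal sketch of this first-order view (the toy model and numbers are my own, assuming JAX):

      ```python
      import jax
      import jax.numpy as jnp

      # Toy differentiable model f(x; theta); any parameterization works here.
      def f(theta, x):
          return jnp.tanh(jnp.dot(theta, x))

      theta0 = jnp.array([0.1, -0.2, 0.3])
      tau = jnp.array([0.05, 0.02, -0.01])   # "task vector", e.g. theta_ft - theta0
      x = jnp.array([1.0, 2.0, 0.5])

      # f_lin(x; theta0 + tau) = f(x; theta0) + <grad_theta f(x; theta0), tau>
      f0, dot_along_tau = jax.jvp(lambda th: f(th, x), (theta0,), (tau,))
      f_lin = f0 + dot_along_tau
      print(f_lin, f(theta0 + tau, x))  # close, since tau is small
      ```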

    2. task arithmetic property

      For a particular input \(x\): if I add any linear combination of task vectors, only the component along the task vector corresponding to the task to which \(x\) belongs determines the value of \(f\) at that point. That is, the other task vectors lie in the tangent space of \(f\), seen as a function of the parameters, at the point \(x\).

    3. functional variation is entirely captured by each τ

      This means that \(g_t\) varies the same way as \(f\) varies when we take the parameters \(\alpha_t\mathbf{\tau_t}\) away from the original \(\theta_0\), and \(g_t\)'s value doesn't get affected when we move along any other task vector.

      So, if I move to \(A=\theta_0 + \alpha_1\tau_1\) first, then all the variation in \(f\) due to that movement is captured by \(g_1\), and all other \(g_t\) stay the same.

      Then if we move to \(\theta_0 + \alpha_1\tau_1 + \alpha_2\tau_2\) from \(A\), all the variation will be captured by \(g_2\) and the rest of the \(g_t\) will stay the same.

      If moving along \(\tau_t\) only changes the value of \(f\) on \(\mathcal{D}_t\) and not on any other task's dataset, then the functions:

      \(g_t(x;\alpha_t\tau_t) := f(x; \theta_0+\alpha_t\tau_t)\mathbb{1}[x \in \mathcal{D}_t]\)

      will satisfy the condition required by equation (4).

    4. spatially-localized components

      Spatially localized in the input space. That is, functions that are 0 outside of \(\mathcal{D}_t\).

    5. single mixing coefficient

      Do the results hold if we allow multiple mixing coefficients? One per task?

    6. if adding τt does not modify the output of the model outside Dt

      Moreover, the output is assumed to be linear in the parameters, and the addition can also be of scaled versions of \(\tau_t\).

    1. Figure 1:

      What makes a task transferable?

      Conjecture 1: The diversity of inputs for a task is what determines how transferable the task will be.

      Three experiments for comparing transferability:

      1. distributions of cosine similarities between samples => task diversity (?=>) transferability.

      2. Compare Transf.(model trained on QQP) with Transf.(model trained on MNLI with labels generated by a QQP fine-tuned model).

      3. Slowly shuffle a larger proportion of labels. Does the task still stay transferable?

    1. Figure 1

      It is not necessary that the dashed line is a bound on performance.

  5. May 2023
    1. (i)

      Here the equality is in the probabilistic sense.

      A probabilistic notion of equality of two events \(E, F\) is to say that \(E\) and \(F\) are equal when \(E\subset F\) and \(F\subset E\).

      Another one can be: Two events \(E, F\) are said to be equal if \(P(E\backslash F)=0\) and \(P(F\backslash E)=0\).

      Both the definitions of equality are preserved under extensions as extensions are probability preserving. And both form equivalence relations on the underlying \(\sigma\)-algebra. The former is a refinement of the latter.

    2. \(C_A n^{-A}\)

      This is a sequence that decays to 0 as \(n\) increases. As we increase \(A\), the sequence decays faster and faster with increasing \(n\).

      The statement that "event \(E\) holds with overwhelming probability" means that the probability of this event is an upper bound on all of these faster and faster decaying sequences; that is, \(\mathbf{P}(E) \ge 1 - C_A n^{-A}\) for every \(A\).

      In \((iv.)\), instead of having sequences that decay faster and faster, we only require that \(\mathbf{P}(E)\) is an upper bound on \(1\) minus one sequence that decays polynomially in \(n\).

      In \((v.)\), this is further relaxed to \(\mathbf{P}(E)\) being an upper bound on \(1\) minus some sequence that decays to \(0\).

      Note that in the above, the upper bounds are on sequences tending to 1.

      Also, as \(E=E_n\), \(\mathbf{P}(E)\) is itself a sequence in \(n\). So, in fact, one sequence forms an upper bound on another.

      In \((iii.)\), the sequence of probabilities is an upper bound on every sequence that tends to 1 at a polynomial rate in \(n\).

      In \((iv.)\), the sequence of probabilities is an upper bound on some sequence that tends to 1 at a polynomial rate in \(n\).

      In \((v.)\), the sequence of probabilities is an upper bound on some sequence that tends to 1.

      Also, the constants are omitted for brevity, but choosing apt constants is part of the above statements too.
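      For reference, a hedged reconstruction of the three definitions being paraphrased above (from memory of the standard statements, so treat this as an assumption rather than a quote):

      $$\text{(iii)}\ \mathbf{P}(E_n) \ge 1 - C_A n^{-A}\ \text{for every } A>0, \qquad \text{(iv)}\ \mathbf{P}(E_n) \ge 1 - C n^{-c}\ \text{for some } c>0, \qquad \text{(v)}\ \mathbf{P}(E_n) = 1 - o(1).$$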

    3. Indeed, once one is no longer working at the foundational level, it is best to try to suppress the fact that events are being modeled as sets altogether

      So that we don't think about set-theoretic constructs like cardinality, which may not be probabilistic properties. One should be aware that the unions/intersections are not between sets, but between events.

    1. permutation o

      Same permutation for both \(x, z\).

    2. values from a random data point z,

      Generalization of zeroing out the features instead.

  6. Apr 2023
    1. \(b, d \le \sup_{\mathbf{f} \in \mathcal{F}^{\otimes t},\, \mathbf{h} \in \mathcal{H}} |R_{\text{train}}(\mathbf{f}, \mathbf{h}) - \hat{R}_{\text{train}}(\mathbf{f}, \mathbf{h})|\)

      The RHS is the maximum difference between the empirical loss (averaged over all the tasks) and the actual loss (averaged over all the tasks), for any shared representation \(\mathbf{h}\) and any task-specific mappings \(\mathbf{f}\).

      The inequality holds as both \(b\) and \(d\) resolve to the difference of empirical and actual losses, for the functions \((\hat{\mathbf{f}}, \hat{\mathbf{h}})\) and \((\mathbf{f}^\star, \mathbf{h}^\star)\) respectively.

    2. c ≤ 0

      As \(\hat{\mathbf{h}}\) and \(\hat{\mathbf{f}}\) are the minimizers over the data using which \(\hat{L}\) is computed.

      On the other hand, \(\mathbf{h}^\star\) and \(\mathbf{f}^\star\) are the functions that minimize the actual centered training risk (\(L\)), and hence the empirical centered training risk (\(\hat{L}\)) may be larger when we use these values.

    3. \(L(\mathbf{f}^\star, \mathbf{h}^\star, \mathbf{f}^\star, \mathbf{h}^\star)\)

      This is 0.

    4. \(d_{\mathcal{F}, \mathbf{f}^\star}(\hat{\mathbf{h}}; \mathbf{h}^\star)\)

      Distance between the actual and learned representations, under the function class \(\mathcal{F}\), and given the optimal fit over the real shared representation \(\mathbf{h}^*\).

    5. \(\tilde{\mathbf{f}}\)
      1. \(\hat{\mathbf{h}}\) is the learned shared representation.

      2. \(\tilde{\mathbf{f}}\) is the function that actually optimizes the performance for each task \(j \in [t]\) using the learned shared representation \(\hat{\mathbf{h}}\).

      3. \(\hat{\mathbf{f}}\) is the learned approximation to \(\tilde{\mathbf{f}}\).

    6. A division by \(n\) seems to be missing..

    7. centered training risk

      The training risk over and above the one attained by the actual underlying feature representation \(\mathbf{h}^\star\) and the optimal \(f_j^\star\) corresponding to it.

    8. h⋆

      The actual underlying shared representation of the tasks.

    9. \(R_X(\mathcal{Q})\)

      How well can any function from the function class \(\mathcal{Q}\) match a set of completely random samples from \(\{+1, -1\}\)?

    10. Rad
    11. polylogarithmic factors
    12. \(d_{\mathcal{F}, \mathbf{f}}(\mathbf{h}'; \mathbf{h})\)
      1. If \(\mathbf{h}'=\mathbf{h}\) you can directly put \(f'=f_j\) and the distance will go to 0.

      2. If \(\mathbf{h}'\ne\mathbf{h}\), but you can find some \(g \in \mathcal{F}\) such that \(g \circ \mathbf{h}'= \mathbf{h}\), you can still make the distance 0.

      3. This metric can also be negative, depending on what \(\mathbf{f}\) and \(\mathbf{h}\) are.

    13. \(y \mid f_j^\star \circ \mathbf{h}^\star(x)\)

      \(\mathbf{x}\) influences the distribution over \(y\) only via the task-specific representation \(f_j^\star \circ \mathbf{h}^\star\).

    14. \(L(\mathcal{F})\)

      This is a Lipschitz constant dependent on the function class \(\mathcal{F}\).

    15. \(\bar{G}_X(\mathcal{Q})\)

      What happens to this definition if \(N \to \infty\)?

    16. N data points

      The assumption here is that the value of \(\mathbf{q}\) at any point \(\mathbf{x}_i\) is constrained by its value on the other points; otherwise, if \(\mathbf{q}(\mathbf{x}_i)\) can go to \(\infty\) without any other value going to \(-\infty\), then the complexity is \(\infty\).

    17. common-design assumption

      The assumption that the tasks share a common design.

    18. One of the most promising methods for multitask and transfer learning is founded on the belief that multiple, differing tasks are distinguished by a small number of task-specific parameters, but often share a common low-dimensional representation

      Low-dimensional being the keyword here.

    1. \(g_B\) is a deep network and in our theory (Section 3), \(g_B\) is a linear projection.

      Why is this valid?

      Why is a deep network approximated with a linear projection?

    2. the head only accommodates the distorted features of ID points and performs poorly

      The point being made is that the head performs poorly, while the feature extractor may still be giving appropriate features for the head to perform well.

      Can be verified by probing the representations of the finetuned model.

    1. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited.

      Also, we can't guarantee sequential reads/writes in a B-Tree-like structure, no matter how we store it in memory.

    2. per-consumer queue

      A queue per consumer, to efficiently distribute all messages among the various consumers. The per-consumer queue is probably maintained by each consumer.

      The Kafka broker manages consumer groups, maintaining the state of each consumer and distributing load amongst them.

    3. we at least double the available cache by having automatic access to all free memory,

      On the other hand, if we make an in-memory cache ourselves, that cache will take half the memory, and the other half will be taken by the OS pagecache when it writes to disk.

      Though we can probably free our cache as we send the writes to the OS, we don't control when the OS will flush those out to disk, so we are only guaranteed access to half of the available memory for our (in-memory) cache.

    4. relying on pagecache

      That is, write directly to the filesystem; the OS will buffer your writes.

    5. performance divergence,

      The disk is fast when writing/reading lots of data sequentially, but not so fast when writing/reading data in a random-access fashion.

      This is the divergence.

      The OS maintains a cache of writes, and only when enough have accumulated are they sent to the disk. This allows for efficient writes.

  7. Mar 2023
    1. Figure 39.3

      There is one entry in the OFT per open() call. Each call to open() in any process creates a new entry in the OFT.

      Each process maintains its own set of file descriptors in its PCB (process control block). These file descriptors are just numbered pointers to entries in the OFT.

      The fork() call creates a child process and copies the contents of the parent's file descriptor list over to the child, without executing open() in the child process again and again; instead it increments the refcnt of each entry in the OFT.

      A dup() call in a process creates a new entry in that process's file descriptor list and updates the refcnt. It returns the position of the new entry in the file descriptor list.

      dup2() does the same as dup() but puts the new entry at a specified position, so that any other function accessing the entry at that position is redirected to the file referred to by the new entry.

      See here too.

    2. dup

      Returns a copied file descriptor that points to the same entry in the OFT. The new file descriptor has a new number.
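      A minimal sketch of this shared-OFT-entry behavior, using Python's os wrappers around these syscalls (POSIX assumed; the file name is made up):

      ```python
      import os

      # dup'ed descriptors share one OFT entry, hence one file offset.
      fd = os.open("demo.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC)
      os.write(fd, b"abcdef")
      os.lseek(fd, 0, os.SEEK_SET)

      fd2 = os.dup(fd)        # new descriptor number, same OFT entry (refcnt bumped)
      os.read(fd, 2)          # advances the shared offset to 2
      print(os.read(fd2, 2))  # b'cd' -- fd2 sees the offset advanced via fd
      os.close(fd)
      os.close(fd2)
      ```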

    1. sleep(b, &ide_lock);

      Releases lock atomically.

    2. ide intr()

      Called by OS, when the application program requests a read/write.

    3. ide start request()

      Called by ISR, (ide_intr()).

    4. ide rw(

      Called by the process which wants to read/write.

    5. as if they were memory locations

      That is, locations of RAM.

    6. device registers

      These are registers of the external device.

    1. Quantitative

      No discounting or similar thing in their definition

    2. online

      that is, concurrent learning of reward predictor and the agent.

    3. non-expert users

      Expert users know how to break the decision problem correctly into possible answers and, as a result, can provide very fine-grained feedback.

      We want to succeed even in the absence of such detailed feedback.

    4. we can only recognize the desired behavior, but notnecessarily demonstrate it

      Like NP problems: a solution can be verified in polynomial time, but not necessarily generated.

  8. Aug 2022
    1. For each field we rank the most likely value from the document for that field. If the most likely prediction falls below a tuned likelihood threshold, we emit null to indicate no field value is predicted

      Better than having a separate label for null?

    2. Filtering

      using regexes

    3. token

      the amount of "gross" written and "net" written in the invoice.

    1. Field Extraction from Administrative Documents byIncremental Structural Templates

      D’Andecy et al. (2018) build upon this approach by incorporating an a-priori model of term-positions into their iterative layout-specific extraction model, significantly boosting performance on difficult fields.

    2. sidw

      This can be adjusted when only one document is available. For itf-df we need more than one document, though.

    3. From this list of matched nodes, weobtain a set of polar coordinates that indicate a possible spatialrelationship between the word wi and the target field to extract

      That is, suppose we are extracting a word that is 1.5 cm away along a line at -45 degrees from a node whose transcript is "Total".

      Then we don't only consider a "Total" appearing in one particular location (say at the end of the document), but all "Total"s occurring throughout the document, and cast a vote for a location that is 1.5 cm along the -45 degree line from each of those locations.

      In effect, the model has the following assumptions:

      1. A field to extract has a particular position relative to all other words in the invoice/document. All of these are constraints for the field. (Alternatively, the constraints are the edges of the star graph.)

      2. The position that satisfies most of these constraints is the one that gets extracted.

      Additionally, the model is made better and more in line with reality when the three kinds of weightings in section VI are done incrementally.

    4. j

      \(j\) iterates over all the nodes having the same transcription as the \(i^{th}\) word, \(w_i\).

    5. definition of fixed spatial templates

      The document processing (structured information extraction) problem can then be broken into two tasks:

      1. Template identification (identifying templates from a single document, or grouping documents with the same templates together).

      2. OCR.

    1. Algorithm 1

      All (uni/bi/tri/quad)-grams from any value of a field will have that field as their label in the final results dict.

      This training data generation produces {nGram, field} pairs, which can be used to improve and adapt the classifier. But it is less clear how to use this to improve the OCR, or the character-level language model in case we decide to use one. [Think more about this later.]
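      A minimal sketch of the labelling scheme as I read it (the field names and values are made up):

      ```python
      # Every (1..4)-gram drawn from a field's value gets that field as its label.
      def ngrams(tokens, n_max=4):
          for n in range(1, n_max + 1):
              for i in range(len(tokens) - n + 1):
                  yield " ".join(tokens[i:i + n])

      fields = {"total_amount": "100.00 USD", "invoice_date": "12 Aug 2022"}
      training_pairs = [(gram, field)
                        for field, value in fields.items()
                        for gram in ngrams(value.split())]
      print(training_pairs)  # {nGram, field} pairs for adapting the classifier
      ```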

    2. Hungarian algorithm

      The agents are the various classes. The tasks are all the words labelled with that class.
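      A sketch of such an assignment with scipy's Hungarian-algorithm implementation (the scores are invented):

      ```python
      import numpy as np
      from scipy.optimize import linear_sum_assignment

      # Rows: fields (the "agents"); columns: candidate words (the "tasks").
      scores = np.array([[0.9, 0.1, 0.3],   # e.g. "amount" vs three candidate words
                         [0.2, 0.8, 0.1],   # "date"
                         [0.1, 0.3, 0.7]])  # "invoice_no"
      rows, cols = linear_sum_assignment(-scores)  # negate: the solver minimises cost
      print(list(zip(rows, cols)))  # one-to-one field -> word assignment, max total score
      ```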

    3. The UBL invoice produced by the engine is presented to the user along with the original PDF invoice in a graphical user interface (GUI)

      We can use this data to further finetune our models. We would need an extra step to find exactly which model went wrong. The architecture of our pipeline should be such that this step can be performed easily.

      [Aside: use a large model like T5 and condition on a natural-language description of what exactly went wrong?]

    4. Post Processor

      There is a consistency issue with the previous step:

      1. The N-grams that are classified as belonging to a particular field should all compose a single contiguous unit (like a word, a date, or an entire amount) that can be the value of that field. That is, the Classifier should output "amount" for all characters in "100.0o" for the post-processor to convert it to "100.00". This means that the classifier should have language-modelling-like abilities.

      2. The post-processor can be thought of as a character-level language model that is used to correct spellings/common mistakes of the OCR*. In the case of poor images, it may be necessary to guess the characters based on the output of the classifier. For example, when, in an invoice, half of a zero has evaporated so it looks like "100.0C". In this case it would be wrong to expect the OCR to predict 0 at the last position. The character-level language model may also let it go undetected (reading C as Celsius, which occurs fairly commonly), unless the language model is conditioned on the predictions of the previous step, i.e., it knows that it is producing an amount, not a temperature.

      An architecture that will allow us to handle these dependencies between the two tasks efficiently can be constructed by making a few layers common to the two tasks of language-model spell correction and classifying the n-grams.

      Other approaches can be to ensure proper images, perform some image enhancement/edge detection, do data augmentation to include such poor samples in training, or finetune a pre-trained vision model.

      It can also be the case that there are invoices where C stands for cents. In this case, the sample will be truly ambiguous. [Probably we can do a joint modelling of replacements and classifier labels, with a CRF that is aware of this?]

      [*Probably we could have had a character-level language model correct characters output by the OCR (step 1) earlier? As this would be character-by-character replacement, we can preserve the features associated with various character spans. Regexes make sense in case we already know what we are searching for and it has a fixed format, like dates and amounts.]

    5. Classifier.

      Can also be a CRF. The CRF can be a linear-chain CRF (only one dimension) or a CRF in 2 directions too.

      In case all classes are not known a priori, we can train a model that learns to extract {key: value} pairs instead (using just B-K, I-K, O, B-V, I-V tags, for example). That is, cast the problem as a sequence-tagging one.

      We would need a new model to include different kinds of annotations (for example, list-item data), though.

    6. N-grammer.

      This step is necessary to find units with which features in the next step can be associated.

      Converting to actual words at this point will result in a loss of correspondence between the characters identified by OCR (to which features like size etc. can be associated) and the actual words.

    7. Text Extractor.

      The OCR for a character is independent of the structure/context in which the character occurs. Hence it makes sense to include this step as a separate one.

    8. for each field,

      Pre-specified by the user? Or will they differ based on the extraction from various templates?

      What about list-item type data, rather than key-value data?

    1. ```python
       def __init__(self, key):
           key1, key2 = jrandom.split(key)
           self.layers = [eqx.nn.Linear(2, 8, key=key1),
                          eqx.nn.Linear(8, 2, key=key2)]
           self.activation = jnn.relu
           self.bias = jnp.ones(2)
       ```
      1. We don't need to write this __init__. If we already construct the layers, activation and bias outside the class, we can pass them through the automatically constructed __init__() (see the sketch after this list). Note that __init__() is also present, and the one we wrote will only be called if we pass a single argument (the key) while initializing.

      2. We can also add in_dim, out_dim, hidden_dim, activation_fn as arguments in this __init__, rather than having fixed values, if we want.

      3. All the fields specified in the annotation list are considered as the children of the PyTree defined by the class.
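      A minimal sketch of the point in note 1 (assuming equinox and jax are installed; the class and field names are my own):

      ```python
      import equinox as eqx
      import jax.nn as jnn
      import jax.numpy as jnp
      import jax.random as jrandom

      # Declaring the fields is enough: eqx.Module auto-generates an __init__
      # that takes them as arguments, so the hand-written one above is optional.
      class MLP(eqx.Module):
          layers: list
          activation: callable
          bias: jnp.ndarray

      key1, key2 = jrandom.split(jrandom.PRNGKey(0))
      model = MLP(layers=[eqx.nn.Linear(2, 8, key=key1),
                          eqx.nn.Linear(8, 2, key=key2)],
                  activation=jnn.relu,
                  bias=jnp.ones(2))
      ```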

    2. This cannot be true in general. We may wish to parameterise a function by arbitrary Python types, unknown to JAX.

      This is a problem because:

      As we won't be writing our own tree_flatten and tree_unflatten (they will be inherited from eqx.Module), all the attributes of self that are set in __init__ will be passed through these flatten and unflatten functions.

      So, if we want to write a class like in Appendix A that has the activation function as a class attribute (we could also have written an __init__() that takes in the activation function alongside the key), we will end up having a leaf in the PyTree that is not a jnp.array but an activation function.

    3. which is a pure function

      With the first argument (self) being a PyTree instance holding all the parameters (as jax.grad differentiates w.r.t. the first argument, it is ideal that the first argument contains all the parameters!).

    4. The gradient returned by higher_order_func will itself be an instance of Adder

      As the gradient of a PyTree basically corresponds to calculating the gradient w.r.t. all leaf nodes and storing them in the same PyTree structure.

    5. referential transparenc

      An expression is called referentially transparent if it can be replaced with its corresponding value (and vice-versa) without changing the program's behavior.

    1. JAX sometimes needs to compare treedef for equality.

      For example, when ensuring that the treedefs of two arguments of a function are the same, as in the vmap example.

      As the auxiliary data gets added to the treedef, it becomes essential that the auxiliary data also matches for vmap to be applied properly. So, we need to keep this in mind while defining/registering a new pytree node.
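      A minimal sketch of treedef comparison (the structure below is made up):

      ```python
      import jax
      import jax.numpy as jnp

      # Two pytrees with the same structure but different leaves share a treedef,
      # which is what lets vmap/tree_map pair their leaves up.
      t1 = {"w": jnp.zeros((2, 2)), "b": jnp.zeros(2)}
      t2 = {"w": jnp.ones((2, 2)), "b": jnp.ones(2)}
      _, def1 = jax.tree_util.tree_flatten(t1)
      _, def2 = jax.tree_util.tree_flatten(t2)
      print(def1 == def2)  # True: same structure (and same auxiliary data, if any)
      ```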

  9. Jul 2022
    1. Gradient-enhanced physics-informed neural networks for forward andinverse PDE problems

      The gradient of the loss with respect to the inputs (rather than the NN parameters) contains quite meaningful information that can easily be used to learn better NN parameters via backprop.

    1. this is whythe dual map and adjoint use the same notation

      Adjoint is also denoted as \(T^*\). What about the conjugation in the adjoint, though?

    1. \(\mathrm{Var}_\theta[\hat{\theta}] \ge \frac{1}{n I(\theta)}\)

      Nice result: the variance of the estimator is bounded below by 1/[information content]! This is the meaning of "maximum information" in a sample.
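      A numeric spot-check of the bound for a case where it is tight (Bernoulli MLE; the numbers are arbitrary):

      ```python
      import numpy as np

      # For Bernoulli(p), the MLE p_hat is the sample mean, I(p) = 1/(p(1-p)),
      # so the Cramer-Rao bound 1/(n I(p)) = p(1-p)/n should match Var(p_hat).
      rng = np.random.default_rng(0)
      p, n, trials = 0.3, 200, 20000
      p_hat = rng.binomial(n, p, size=trials) / n
      print(p_hat.var())       # empirical variance of the estimator
      print(p * (1 - p) / n)   # the bound: nearly equal, since the MLE is efficient
      ```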

    2. n

      scales proportional to \(n^{-1}\).

    3. \(m'(\theta)\)

      Measures how sensitive the expected value of the estimator is to small changes in the underlying distribution's real parameter \(\theta\).

    4. \(E_\theta(\hat{\theta})\)

      The expected value of the estimator \(\hat{\theta}\) over all possible samples drawn from the "real" distribution \(f(x|\theta)\), as a function of the "real" parameter \(\theta\).

      Here we suppose "ah, if \(\theta\) were our real parameter.." and then ask, "what would the expected value of our estimator be with this \(\theta\)?"

    5. We will show how to use Fisher information to determine the lower bound for the variance of an estimator of the parameter θ

      Fisher information tells us the maximum information we can get about the parameter \(\theta\) from a sample \(X_1, ..., X_n\).

      As this is the maximum information, our inference must have a variance related to this maximum information.

    1. Exercise 4

      Consider the set of all \(x\) for which \(P(x)\) is false. Consider a minimal element of this set, \(y\). \(P(x)\) is true \(\forall x<y \).

      Now, \(y=\mathrm{succ}(z)\) for some \(z \in X\), xor \(y=\sup([\min(X), y))\). In both cases, we can apply one of the suppositions (2 and 3 respectively) to conclude that \(P(y)\) is true, which is a contradiction.

      Hence \(P(x)\) is true \(\forall x\in X\).

    2. n particular, is not the successor of any element in X

      It is assumed \(min(X) \in [min(X), min(X))\).

    3. a totally ordered set

      Not needed, as being totally ordered is implied by the existence of minimal elements for subsets consisting of only \(2\) elements of \(X\).

    4. a collection

      The proof doesn't require \(\mathcal{F}\) to be the collection of all subsets of \(X\). Hence we can, for one, consider \(\mathcal{F}\) to be a collection of mutually disjoint subsets of \(X\).

    5. non-empty

      This is essential as \(f(\phi) \notin \phi\), no matter how you choose \(f\).

    1. \(d|_{f_0}(x_j)\)

      shape\(=n_L\)

    2. \(K(x, x_j)\)

      shape\(=n_L\times n_L\)

    3. As a result, the (functional) derivative of the cost C at apoint f0 ∈ F can be viewed as an element of F

      \(C\) is defined to have a (functional) derivative at \(f_0\) if the map from \(\mathcal{F}\) to \(\mathbb{R}\) defined as:

      \(\rho \longrightarrow \lim_{x \rightarrow 0} \frac{C[f_0+x\rho]-C[f_0]}{x}\)

      is an element of \(\mathcal{F^*}\).

    4. \(\Phi_K\)

      maps linear forms [functions from \(\mathcal{F}\) to \(\mathbb{R}\)] to functions on the data \(\in \mathcal{F}\).

    5. on the data

      or in the support of \(p^{in}\).

      Note that \(||f||_K=||f||_{p^{in}}=0\) if \(f\) is \(0\) at all data points.

      Or in other words, \(||f-g||_K=||f-g||_{p^{in}}=0\) if \(f\) and \(g\) are the same on all the data points.

    6. The kernel \(K\) is positive definite with respect to \(||\cdot||_{p^{in}}\) if \(||f||_{p^{in}} > 0 \implies ||f||_K > 0\)

      If \(K(x,x')\) is not positive definite for some \((x,x')\) pair,

      \(\implies \exists v \in \mathbb{R}^{n_L} \setminus \{\mathbf{0}\} : v^T K(x,x') v \le 0\).

      Setting \(f(x)=v \forall x \in \mathbb{R}^{n_0}\), we have \(||f||_K = \sqrt{\langle f, f \rangle_K}\) is not defined or 0.

      whereas \( ||f||_{p^{in}} \) is still well defined and \(>0\) as \(v\ne\mathbf{0}\).

      This means that if \(K(x,x')\) is not positive definite for some \((x,x')\), then the kernel \(K\) is not positive definite as per the definition here.

      Moreover, if \(K(x,x')\) is positive definite for every \((x,x')\), then

      \(\langle f,f \rangle > 0 \forall f \) that are non-zero in some region(\(\subseteq \mathbb{R}^{n_0}\)) of non-zero probability,

      which means that \(||f||_{p^{in}} > 0 \implies ||f||_K > 0\) for all \(p^{in}\). This means that \(K\) is positive definite w.r.t. \(||\cdot||_{p^{in}}\).

      This means the current definition is equivalent to saying that K is positive definite iff \(K(x,x')\) is a positive definite \(n_L \times n_L\) matrix, for all \((x,x') \sim p^{in}\).

    7. twice differentiable nonlinearity function \(\sigma : \mathbb{R} \to \mathbb{R}\), with bounded second derivative

      The second derivative of the output of the ANN with respect to the input of the ANN may not be bounded?

    8. empirical distribution

      The expected value in the seminorm then resolves to the empirical average over the entire dataset of \(N\) samples.

    9. seminorm

      A seminorm satisfies: 1. the triangle inequality; 2. absolute homogeneity, i.e., \(||s \cdot f||=|s| \cdot ||f||\) for all scalars \(s\) and all functions \(f \in \mathcal{F}\); 3. (1 and 2 imply) non-negativity, i.e., \(||f|| \ge 0\ \forall f \in \mathcal{F}\).

      For a seminorm to be a norm, it is additionally required that: 4. \(||f||=0 \implies f=0\).

      (https://en.wikipedia.org/wiki/Seminorm#Definition)

      The triangle inequality here follows from: 1. \(\langle v, u \rangle \le ||v||\,||u||\ \forall v, u \in \mathbb{R}^{n_L}\), and 2. the linearity of integration.

    1. We assume that the local curvature of the loss surface (the spectral norm of the Hessian) increases or decreases monotonically along the optimization trajectory.

      Does it? [CONFIRM]

    1. Equation 5

      Equation 4?

      The first term is positive semidefinite, as \(vv^T\) is, and a sum of PSD matrices is PSD.
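      A numeric spot-check of that PSD argument, with random stand-ins for the gradients and the nonnegative loss curvatures:

      ```python
      import numpy as np

      # A nonnegative-weighted sum of outer products v v^T is PSD,
      # since each v v^T is PSD and the PSD cone is closed under such sums.
      rng = np.random.default_rng(1)
      vs = rng.normal(size=(10, 5))    # ten stand-in gradient vectors v_i
      weights = rng.uniform(size=10)   # nonnegative multipliers (e.g. l''_i >= 0)
      H = sum(w * np.outer(v, v) for w, v in zip(weights, vs))
      print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True: no negative eigenvalues
      ```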

    2. Figure 6

      Figure 5?

    3. larger positive eigenvalues

      That is sharper minima.

      This kind of analysis (comparing the magnitudes of the eigenvalues) tells us that the sharper minima found by large-batch gradient descent are indeed sharper in terms of the curvature, i.e., the eigenvalues of the Hessian.

      That is, it is not the case that sharper minima just span a smaller number of dimensions.

      The number of dimensions spanned by the minima can be guessed from the number of positive eigenvalues.

    4. correlated data

      In this case \(>M-N\) eigenvalues are expected to be trivial, i.e., \(0\).

    5. if \(\ell'(f(\hat{w}))\) and \(\nabla^2 f(\hat{w})\) are not correlated

      When is this true?

    6. Figure 1

      The second derivative is 0 along most directions at the final trained point in parameter space!

    7. bottom of the landscape

      low loss regions of parameter space.

    8. The bulk is mostly full of zero eigenvalues

      A problem for the non-degeneracy assumption.

    9. Hessian of theloss is non-degenerate

      If the Hessian is degenerate, then it has at least one 0 eigenvalue. That is, there is some subspace of directions along which the derivative stays 0. When the number of parameters \(M\) is much larger than the number of samples \(N\), it is possible that all of the \(N\) samples allow gradients only in the kernel of the Hessian, in which case we will converge to a wrong solution.

    10. gradient descent

      *full batch

    1. Figure 3

      The green peaks occur at small plateaus just before big slingshots of the weight norm. The sharpness increases first, indicating that a big change in the last-layer norm is about to come.

    2. Cosine distance

      Why cosine distance? What happens with Euclidean distance?

    3. We are unable to observe Slingshot Effects with Adagrad [6] and also with stochastic gradient descent (SGD) or SGD with momentum, pointing to the generality of the mechanism across architectures and datasets.

      Is this sentence correct?

    4. the training loss retains a low value in periods of rapid norm growth, and then wildly fluctuating in periods of norm plateau.

      Why does the training loss retain a low value when the norm grows rapidly?

      Probably because, at first, the norm increase is just trying to increase the difference between the logit of the correct label and the other labels. But, as we have momentum in our optimizers, maybe it increases the norm too much. This leads to an incompatibility between the features available to the last layer and the classifying function implemented by the last layer. The features and classifier need to be made compatible again. This causes feature updates, and wild loss fluctuations, to happen after the norm growth?

    5. at the point of reaching perfect classification of the training set, the cross-entropy (CE) loss by design pressures the classification layer to grow in norm at a relatively fast rate

      so that the difference in the logit of the correct label and other labels increases.

    1. This system would consist of three stages. A given question is read and reformulated in the first stage, followed by information retrieval via a search engine. An answer is then synthesized based on the query and a set of retrieved documents.

      Here, the search engine can be thought of as checking whether the asked question is answered in the available documents.

      The first stage corresponds to finding questions related to the actual question that are answered in the available documents, the answers to which will help answer the actual final question.

      The first step (and the last), although it looks simple at first glance, will be quite complex to perform as the questions become more and more complex.

      For example, say we want to answer a question like "What would be the consequences of the Riemann hypothesis being false?", given only documents that explain the Riemann Hypothesis and results derived thereof.

      Then we would need to reformulate the question as "What results follow from the Riemann Hypothesis?", find answers for that from our search engine, and negate them in the synthesis stage to answer the counterfactual question.

  10. Jun 2022
    1. because we didn't observe any significant V-composition

      i.e., \(\frac{||W_{OV}^{h_1}W_{OV}^{h_2}||_F}{||W_{OV}^{h_1}||_F\,||W_{OV}^{h_2}||_F}\) was small for all pairs \((h_1, h_2)\).
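      A minimal sketch of computing that ratio (random stand-ins for the OV matrices):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      W_OV_h1 = rng.normal(size=(64, 64))  # stand-in OV circuit for head h1
      W_OV_h2 = rng.normal(size=(64, 64))  # stand-in OV circuit for head h2

      # np.linalg.norm on a matrix defaults to the Frobenius norm.
      ratio = (np.linalg.norm(W_OV_h1 @ W_OV_h2)
               / (np.linalg.norm(W_OV_h1) * np.linalg.norm(W_OV_h2)))
      print(ratio)  # small => little V-composition between the two heads
      ```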

    2. for the present token

      for the present token, with the considered head on the logit..

    3. It's also less vulnerable to distributional shift, since it doesn't depend on learned statistics about whether one token can plausibly follow another.

      Transformers may learn exact parsing of sentences, rather than just using heuristics like word matching, bigrams, etc.

    4. the value vector at these default positions will be small,

      It is so because otherwise it would end up interfering with information transfer from other source tokens.

    5. the effect on the output is solely a function of that source token

      In the case of higher layers, the source token's embedding after the first layer.

    6. if attended to

      As \(A^h\) is in tensor product with \(W_UW_{OV}^hW_E\).

    7. One basic consequence is that the residual stream doesn't have a "privileged basis"; we could rotate it by rotating all the matrices interacting with it, without changing model behavior.

      As the residual stream gets rotated, each of the matrices used to read from and write to the residual stream is also rotated in the opposite direction. The attention heads and other MLPs are kept as-is.

      Rotating these bases to be as axis-aligned as possible could reveal new (or easily interpretable) insights.
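      A numeric sketch of that invariance, with a toy one-layer "read, transform, write" computation standing in for the model:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      d = 8
      W_in, W_out = rng.normal(size=(d, d)), rng.normal(size=(d, d))
      x = rng.normal(size=d)                        # residual stream state
      R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation (orthogonal)

      y1 = W_out @ np.tanh(W_in @ x)  # original computation
      # Rotate the stream (x -> Rx) and counter-rotate every matrix touching it:
      y2 = R.T @ ((R @ W_out) @ np.tanh((W_in @ R.T) @ (R @ x)))
      print(np.allclose(y1, y2))  # True: behavior unchanged, so no privileged basis
      ```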

    8. These virtual weights are the product of the output weights of one layer with the input weights of another (i.e. W_{I}^2W_{O}^1), and describe the extent to which a later layer reads in the information written by a previous layer. [Footnote: for attention layers, there are three different kinds of input weights: W_Q, W_K, and W_V. For simplicity and generality, we think of layers as just having input and output weights here.]

      What could be the analogue of this relation (/information exchange) between more than 2 layers?

    1. abductively

      Related to abductive reasoning. Abductive reasoning is "making a probable conclusion from what you already know."

    1. new

      any of the new. The max in eqn 8 selects the constraint that is closest to being fully satisfied in the next few steps.

    1. Table 4:

      What does the performance of adapters with PET look like?

    1. For an arbitrary point x

      Can be a test point outside of \(\mathcal{X}\).

      Rather than using the expressions in (10) and (11), we can also obtain \(\theta_t\) from equation (8) and use it to compute the output on an arbitrary \(x\).

    2. ˆΘt(x, X )

      This is of dimension \(k \times k|\mathcal{D}|\).

    3. (6)

      \(f_t^{lin}(x)\) depends on both \(\theta_0\) and \(\theta_t\).

      We take the gradient on both sides of equation (5), to get \(\nabla_{\theta_t} f_t^{lin}(\mathcal{X})=\) coefficient of \(\theta_t\) in (5), which is \(\nabla_\theta f_0(\mathcal{X}) |_{\theta=\theta_0}\).

      The gradient on both sides of eqn (5) is taken w.r.t. the running parameters, \(\theta_t\), as the gradient in equation (2) is also w.r.t. the running parameters.

    4. 3

      The second equality here is obtained by substituting for \(\dot{\theta_t}\) from equation (2).

    5. Unlike the standard parameterization that only normalizes the forward dynamics of the network, the NTK-parameterization also normalizes its backward dynamics.

      The standard parameterization samples directly from \(\mathcal{N}(0,\frac{\sigma_w}{\sqrt{n_l}})\). Thus, gradients are passed only up to \(W_{i,j}^l\). But the gradients in this paper are passed back one step further, to \(w_{ij}^l\), and hence are scaled by a factor of \(\frac{\sigma_w}{\sqrt{n_l}}\) for all weights of the network.
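      A minimal sketch of the contrast (my own toy setup):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, sigma_w = 128, 1.0

      # Standard parameterization: the scale is baked into the stored weights.
      W_standard = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))

      # NTK parameterization: store unit-scale w and apply the scale in the forward
      # pass, so d(output)/d(w) also picks up the sigma_w/sqrt(n) factor.
      w_ntk = rng.normal(0.0, 1.0, size=(n, n))

      def forward_ntk(x):
          return (sigma_w / np.sqrt(n)) * w_ntk @ x

      print(W_standard.std(), (sigma_w / np.sqrt(n)) * w_ntk.std())  # same forward scale
      ```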

    1. it makes sense to consider the two-dimensional case as shown in the following figure.

      The initialised plot of the conditional distribution \(P(Y|X)\) is again not aligned with the initialised contour plot.

    2. The following figure shows the influence of these parameters on a two-dimensional Gaussian distribution.

      The initialized values in the covariance matrix do not correspond to the contour plot on the left. Slightly perturbing the yellow eigenvectors will correct the covariance matrix.

    1. if the data can be classified by a function in the NTRF function class \(\mathcal{F}(\mathbf{W}^{(1)}, \widetilde{O}(1))\) with a small training error, the over-parameterized ReLU network learnt by Algorithm 1 will have a small generalization error

      Even though we have a lot of parameters, the complexity of the function class \(\mathcal{F}\) is still low because the output of the central function in \(\mathcal{F}\) [corresponding to the weights \(\mathbb{W}^{(1)}\)] can only be perturbed in a linear way (specified in the definition of the NTRF) for every input.

      This low complexity of \(\mathcal{F}\) probably allows us to use VC-dimension-like generalization bounds of \(\widetilde{O}(\frac{1}{\sqrt{n}})\).

    2. Definition 3.2
      1. \(\mathbf{W}\) acts as a specification for perturbation to each and every weight of a neural network.

      2. The perturbation in each sample's output is proportional to the sum over all weights of the network of: {perturbation of weight multiplied by the gradient of the output on that sample}.

      This NTRF function class sort of describes all the functions available in an \(\frac{R}{\sqrt{m}}\) neighborhood of \(\mathbf{W}^{(1)}\). This is true when \(\frac{R}{\sqrt{m}}\) is small enough that \(f(x;\mathbf{W}^{(1)})\) is approximately linear in the weights in the \(\frac{R}{\sqrt{m}}\) neighborhood of \(\mathbf{W}^{(1)}\) for all inputs \(x\).

    3. In fact, the empirical observation by Zhang et al. (2017) indicates that in order to understand deep learning, it is important to distinguish the true data labels from random labels when studying generalization. In other words, it is essential to quantify the “classifiability” of the underlying data distribution, i.e., how difficult it can be classified.

      New thought-1.

    1. constant learning rate c = 1

      \(c=1\) "epoch"

    2. Starting from ˆ

      Starting from the \(\hat{w}\) that has been trained for \(B\) or \(0.75B\) time.

    3. Another important insight we can get from Figure 3 is that while the train loss and test error surfaces are qualitatively similar, they are not perfectly aligned

      The cyclical and constant learning rate schedules are leading to much more aligned train-test losses than in Figure 1.

    4. The key insight from Figure 3 is that both methods explore points close to the periphery of the set of high-performing networks.

      Because of the learning rate schedule and the use of SGD [not Adam].

  11. May 2022
    1. as long as the discretization approximation is justified.

      Figure out when this is satisfied, exactly.

    2. higher noise levels in gradients

      Correspond to smaller batch sizes and larger learning rates.

    1. stochastic differential equation

      A differential equation in which one or more terms is a stochastic process.

  12. Apr 2022
    1. We also find evidencethat some latent features that the model learns maynot be related to linguistic phenomena.

      What are these?

  13. Feb 2022
    1. the noun phrase (NP) the hedgehog, which has only been observed as a subject, needs to be interpreted in the direct object position.

      Here, we are expecting the model to consolidate the following points:

      1) A particular noun phrase that occurs at the first argument position of a transitive verb can also occur at the second.

      2) love, which has been seen before as a transitive verb, must be a transitive verb this time too (not an abstract noun as in "Love is the greatest thing ever.").

      3) "The boy", "the hedgehog" which occurred as NP before, must still be NP.

      4) The second argument position of the transitive verb "love" was the NP that came later on in the sequence, so it must be the same this time too.

    2. expressions providing evidence that definite NPs may appear in both argument positions of a transitive verb.

      If the model considers semantic information too, the ability of an NP to occur at both argument positions of a transitive verb depends on what the NP is, as well as what the transitive verb is. E.g., you can't "eat" "water"; you can only "drink" "water".

      But here, we are assuming that the model tries to learn to generalize the grammatical information only.

  14. Jan 2022
    1. MLP layers are local and translationally equivariant

      translationally equivariant to shifts of size \(P\).

    1. Hence, at testing time the same object may appe

      But we didn't train our network's prediction to be invariant to this \(\frac{1}{3}^{rd}\) factor!

      This means the network will not detect, or will mis-detect, objects of this smaller size, leading to errors.

    2. \(K_{test} \times K_{test}\)

      No re-scaling is required after taking the RoC of this size, unlike at training time, when we needed to re-scale the RoC.

    3. isotropically resizing the image so that the shorter dimension is Kimage

      \(K_{test}^{image}\) is user specified.

      \(K_{train} = K_{test} \le K_{test}^{image}\).

      The difference between \(K_{test}\) and \(K_{test}^{image}\) allows to express bias towards important things being present at the centre of the image.

    4. Hence, pre-processing standardizes the apparent size, which otherwise would depend on the input image resolution. This is important as networks do not have built-in scale invariance.

      The final number of pixels an object occupies in a sampled RoC only depends on the actual height of the object and its distance from the camera. It doesn't depend on the resolution of the camera. This is to be expected, as we are re-scaling to a fixed size (\(K_{train}\)).

      The number of pixels occupied by the object in an original image, however, would depend on the resolution of the image. In particular, \(r_1 = k\sqrt{HW}\frac{R}{Z}\) is dependent on both \(H\) and \(W\).

    5. that α = 1

      Meaning we are taking a square RoC out of a square image.

    6. s

      \(s\) is the geometric mean of \(\frac{K_{train}}{H_{RoC}}\) (scaling factor in the \(X\) direction) and \(\frac{K_{train}}{W_{RoC}}\) (scaling factor in the \(Y\) direction).

    7. If an object has an extent of r × r pixels in the input image, and if s is the scaling factor between input image and the crop, then by the time the object is analysed by the CNN, it will have the new size of rs × rs pixels.

      Assuming that the entire \(r \times r\) object happens to be in the RoC. If only a part of it is in the RoC, then that part's pixels in both directions will be scaled by \(s\).
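      Toy arithmetic for the rescaling (the numbers are invented):

      ```python
      import numpy as np

      # An object of r x r pixels in the input image ends up ~ rs x rs pixels
      # after the crop is resized, with s the geometric-mean scale factor.
      K_train, H_RoC, W_RoC = 224, 448, 400
      s = np.sqrt((K_train / H_RoC) * (K_train / W_RoC))
      r = 100
      print(r * s)  # apparent size (in pixels) of the object as the CNN sees it
      ```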

    8. While the focal length is variable, the field of view angle \(\theta_{FOV}\) of most cameras is usually in the \([40°, 60°]\) range.

      See here for a detailed explanation.

    9. \(H \times W\)

      Not pixels, but can be converted to pixels.

    10. Since different pre-processing protocols are used at training and testing time

      If we took random RoCs at testing time too, they would have had the same distribution.

    1. spread information between the patch tokens and the class token

      That is, initially, all information regarding the class is contained in the patch tokens, and the class token is independent of the patch tokens. As training progresses, the class token's final embedding becomes correlated with (hence dependent on) the patch tokens. It is as if information has travelled from the patch tokens to the class token.

    2. invariant

      *equivariant

    1. \((\nabla^2 L)(\theta)\)

      Is this positive semi-definite, though? It must be for the square root to be valid. This positive semi-definiteness of the Hessian seems to be an additional assumption here.

    2. \(\min_{k \le K}(n_k)\)

      The minimum number of weights in any layer.

    3. r

      non-zero.

    4. sorted

      By multiplicity.

    5. singular values of A

      These are square roots of (Eigenvalues of \(A^TA\)). As A^TA is positive semi-definite always, the square roots are valid.

    6. positive diagonal element

      As we are at a minima, the second derivative in some direction must be positive. Otherwise, if all directions, have second derivative negative or zero, we are at a maxima or a completely flat region(in case all second derivatives on the diagonal are 0).

    7. Since all norms are equivalent in finite dimension
    8. Therefore, \(T_\alpha(B_\infty(r,\theta)) \cap B_\infty(r,\theta)\)

      By the particular choice of \(\alpha\).

    9. is a connected region

      Intuitively, you can imagine each point, \(\theta'=(\theta'_1, \theta'_2)\) in the initial \(B_{\infty}(r, \theta)\) moving along the curve \((\alpha\theta'_1, \alpha^{-1}\theta'_2)\).

    10. the

      The \(\mathcal{L}_{\infty}\) ball is the ball constructed using the \(L^{\infty}\) norm on the parameter space. It is just a cube, of side length \(2r\).

    1. 1

      Add 1 to prevent sharpness from exploding when \(f(x)\) is close to 0.

    2. A

      Provides a mapping between the \(n\)-dimensional model-parameter space and a \(p\)-dimensional manifold.

    3. \(|(A^{+}x)_i|\)

      If this term were always 0, \(C_{\epsilon}\) would have been a perfect cube with side length \(2\epsilon\).

      This term stretches the \(i^{th}\) edge on both sides by \(\epsilon|(A^{+}x)_i|\). As there is an absolute value, no side will be shorter than \(2\epsilon\).

  15. Dec 2021
    1. linear kernel

      The equation of the linear kernel is:

      \(K(x,y) = x^{T}y+c\)

      where \(c\) is some constant.
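      A minimal sketch of the kernel and its Gram matrix (the data is invented):

      ```python
      import numpy as np

      # Linear kernel from the note: K(x, y) = x^T y + c.
      def linear_kernel(x, y, c=0.0):
          return np.dot(x, y) + c

      X = np.array([[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]])
      K = np.array([[linear_kernel(a, b) for b in X] for a in X])  # Gram matrix
      print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: PSD, as a kernel must be
      ```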

    2. isotropic scaling

      Simple scaling corresponds to transforming the space by multiplication with a diagonal matrix. Isotropic scaling corresponds to multiplication by a diagonal matrix all of whose diagonal entries are the same.

    3. orthog-onal transformation

      A linear transformation that preserves dot products between elements of the vector space. In 2-D and 3-D it consists of rotations and reflections, or any combinations thereof.

    1. Cxy

      The entries of the matrix corresponding to this operator are not like those in the covariance matrix. They are of the form \([p(e_i, e'_j) - p(e_i)p(e'_j)]_{ij}\), where:

      1. \(\{e_i\}\) and \(\{e'_j\}\) are basis vectors of \(\mathcal{X}\) and \(\mathcal{Y}\) respectively.
      2. \(p(e_i, e'_j)\) is the probability of sampling the pair of elements corresponding to \(e_i, e'_j\) from \(\mathcal{X} \times \mathcal{Y}\).
    2. Ex,y

      This is only symbolic for the operator. The expectation will actually be calculated when a \(g\) is sent inside it and the tensor product operator gets applied to \(g\).

    3. tensor product operator

      This too is a linear operator.

      In the finite case, it is similar to the tensor product of two vectors, which yields a matrix.

    4. elements

      A single element per Hilbert space (most probably).

    5. 4

      These must be true for all \(f\) and \(g\) respectively.

    6. complete orthonormal system

      A countable complete orthonormal basis \(\{e_i\}_{i=0}^{\infty}\) for a Hilbert space has the following properties:

      1. \(\langle e_i, e_j \rangle = 0\) for \(i \neq j\).
      2. \(||e_i|| = 1\).
      3. The linear vector space spanned by the countably many basis vectors must be dense in the Hilbert space.

      The last point means that not every element of the Hilbert space may be written as a linear combination of the basis vectors, but it can be approximated as closely as we want (in terms of the norm/metric derived from the inner product of the Hilbert space) by some linear combination of the basis vectors.

    7. Hilbert-Schmidt Operator

      In effect, while calculating the inner product in the Hilbert space of linear operators, we multiply corresponding elements of \(C (:= [c_{ij}])\) and \(D (:= [d_{ij}])\) and sum them up.

      $$ \langle C,D \rangle_{HS} := \sum_{i,j} c_{ij}d_{ij}$$

      where \(i, j\) belong to a countable index.
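      The finite-dimensional analogue is just the Frobenius inner product; a quick check with random matrices:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      C, D = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
      # <C, D>_HS = sum_ij c_ij d_ij, equivalently trace(C^T D).
      print(np.sum(C * D), np.trace(C.T @ D))  # equal
      ```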

    8. Hilbert-Schmidt Norm

      Because both \(\mathcal{G}\) and \(\mathcal{F}\) admit countable orthonormal bases, we just sum over two countable sets while calculating this norm.

      The linear operator \(C\) can be imagined as a matrix with a countable (possibly infinite) number of rows and columns.

    9. that F be separable

      In topology, a topological space is called separable if it has a countable dense subset.

      A Hilbert space is separable if it has a countable dense subset too. Together with Zorn's Lemma, this means that the Hilbert space admits a countable orthonormal basis.

    1. universal kernels

      Letting \(C(\mathcal{X})\) denote the space of continuous bounded functions on a compact domain \(\mathcal{X}\), we call a kernel \(k\) "universal" if \(k(x,\cdot)\) is continuous for all \(x\) and the RKHS induced by \(k\) is dense in \(C(\mathcal{X})\).