- Apr 2022
-
arxiv.org
-
We evenly split a pre-trained backbone into 4 subsets of blocks (e.g., 6 in each subset for the 24-block ViT-L).
Note subsets and blocks here
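As a sanity check on the terminology, here is a trivial sketch (my own, not from the paper) of what that split looks like: 24 blocks chunked into 4 contiguous subsets of 6.

```python
# Placeholder block names; the real backbone would be e.g. a 24-block ViT-L.
blocks = [f"block_{i}" for i in range(24)]

n_subsets = 4
size = len(blocks) // n_subsets                  # 6 blocks per subset for ViT-L
subsets = [blocks[i * size:(i + 1) * size] for i in range(n_subsets)]

print([len(s) for s in subsets])                 # [6, 6, 6, 6]
```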
-
-
arxiv.org
-
We propose a cross-covariance based self-attention function that operates along the feature dimension, rather than along the token dimension as in token self-attention.
Does "along the feature dimension" refer solely to the reversed order of the matrix multiply?
-
We restrict the magnitude of the query and key matrices by \(\ell_2\)-normalising them, such that each column of length \(N\) of the normalised matrices \(\hat{Q}\) and \(\hat{K}\) has unit norm, and every element in the \(d \times d\) cross-covariance matrix \(\hat{K}^\top \hat{Q}\) is in the range \([-1, 1]\).
Is this the only difference with \(\hat{Q}\) and \(\hat{K}\)?
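Answering my own questions above, here is a minimal numpy sketch of how I read the operation: the attention map is a \(d \times d\) feature-by-feature matrix instead of the usual \(N \times N\) token-by-token one, so the matrix multiply is effectively reversed, and on top of that \(Q\) and \(K\) are column-wise \(\ell_2\)-normalised. The shapes, the `V @ A` ordering, and `tau` (my stand-in for the paper's temperature) reflect my reading of the paper, not their code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_attention(Q, K, V):
    # ordinary self-attention: an N x N map over tokens
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))    # (N, N)
    return A @ V                                  # (N, d)

def cross_covariance_attention(Q, K, V, tau=1.0):
    # l2-normalise each length-N column of Q and K
    Qh = Q / np.linalg.norm(Q, axis=0, keepdims=True)
    Kh = K / np.linalg.norm(K, axis=0, keepdims=True)
    C = Kh.T @ Qh                                 # (d, d) cross-covariance, entries in [-1, 1]
    return V @ softmax(C / tau)                   # (N, d)

N, d = 196, 64                                    # e.g. 196 patch tokens, one 64-dim head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(token_attention(Q, K, V).shape, cross_covariance_attention(Q, K, V).shape)
```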
-
Relationship between Gram and covariance matrices. To motivate our cross-covariance attention operation, we recall the relation between Gram and covariance matrices. The unnormalised \(d \times d\) covariance matrix is obtained as \(C = X^\top X\). The \(N \times N\) Gram matrix contains all pairwise inner products: \(G = X X^\top\). The non-zero part of the eigenspectrum of the Gram and covariance matrix are equivalent, and the eigenvectors of \(C\) and \(G\) can be computed in terms of each other. If \(V\) are the eigenvectors of \(G\), then the eigenvectors of \(C\) are given by \(U = X^\top V\). To minimise the computational cost, the eigendecomposition of either the Gram or covariance matrix can be obtained in terms of the decomposition of the other, depending on which of the two matrices is the smallest.
This looks a lot like what I had guessed about how Performers derived their kernelized approximation of attention. Definitely read and understand this.
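Before digging further, a quick numpy check of the algebra quoted above (just me verifying the statement, nothing from the paper): the non-zero eigenvalues of \(C\) and \(G\) coincide, and \(X^\top v\) turns an eigenvector of \(G\) into an (unnormalised) eigenvector of \(C\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.standard_normal((N, d))

C = X.T @ X          # d x d covariance-style matrix
G = X @ X.T          # N x N Gram matrix

evals_C = np.sort(np.linalg.eigvalsh(C))[::-1]
evals_G = np.sort(np.linalg.eigvalsh(G))[::-1]

# non-zero eigenvalues agree (G has N - d extra zeros here since d < N)
print(np.allclose(evals_C, evals_G[:d]))              # True

# eigenvectors of C from eigenvectors of G: u = X^T v (up to scaling)
lam, V = np.linalg.eigh(G)
v = V[:, -1]                                           # top eigenvector of G
u = X.T @ v
print(np.allclose(C @ u, lam[-1] * u))                 # True: u is an eigenvector of C
```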
-
Touvron et al. [65]
ref 65 = DeIT
-
[22]
ref 22 = ViT paper
-
- Nov 2018
-
scp-wiki.wikidot.com
-
We live in a connected world, and modern software has to navigate this world. So the building blocks for tomorrow's very largest solutions are connected and massively parallel. It's not enough for code to be "strong and silent" any more. Code has to talk to code. Code has to be chatty, sociable, well-connected. Code has to run like the human brain, trillions of individual neurons firing off messages to each other, a massively parallel network with no central control, no single point of failure, yet able to solve immensely difficult problems. And it's no accident that the future of code looks like the human brain, because the endpoints of every network are, at some level, human brains.
This is an observation that is quite important but frequently missed
-
- Jun 2018
-
www.informatics.indiana.edu
-
connectionism
neural nets...
-
So far, we have dealt with self-reference, but the situation is quite similar with the notion of self-modification. Partial self-modification is easy to achieve; the complete form goes beyond ordinary mathematics and anything we can formulate. Consider, for instance, recursive programs. Every recursive program can be said to modify itself in some sense, since (by the definition of recursiveness) the exact operation carried out at time t depends on the result of the operation at t-1, and so on: therefore, the final "shape" of the transformation is getting defined iteratively, in runtime (a fact somewhat obscured by the usual way in which recursion is written down in high-level programming languages like C). At the same time, as we can expect, to every finite recursive program there belongs an equivalent "straight" program, that uses no recursion at all, and is perfectly well defined in advance, so that it does not change in any respect; it is simply a fixed sequence of a priori given elementary operations.
So unbounded recursion automatically implies a form of self-reference and self-modification?
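To partly answer my own question: bounded recursion, at least, can always be flattened out. A toy sketch of the contrast the author is drawing (factorial is my example, not theirs):

```python
# Recursive definition vs. an equivalent "straight" formulation whose
# shape is fixed in advance.

def factorial_recursive(n: int) -> int:
    # the operation performed "now" depends on the result of the call below it
    return 1 if n == 0 else n * factorial_recursive(n - 1)

def factorial_unrolled_4() -> int:
    # for a fixed, finite input the same computation is just a
    # predetermined sequence of elementary operations
    acc = 1
    acc = acc * 2
    acc = acc * 3
    acc = acc * 4
    return acc

assert factorial_recursive(4) == factorial_unrolled_4() == 24
```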
-
- Nov 2017
-
-
It seems that one-hot encoding isn’t necessary for tree-based algorithms, but you may run into a problem using this on linear models. I personally find ordinal encoding is more than sufficient and it significantly reduces your memory footprint if you have quite a few categorical features.
Interesting! I suppose it makes sense though.
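A small sketch of the two encodings for my own reference (the column and categories are made up); the memory argument is just that one-hot adds a column per category while ordinal stays at a single integer column.

```python
import pandas as pd

df = pd.DataFrame({"city": ["tokyo", "paris", "tokyo", "lima"]})

# one-hot: one new column per category (memory grows with cardinality)
one_hot = pd.get_dummies(df["city"], prefix="city")

# ordinal: a single integer column; fine for tree-based models,
# but a linear model would read a spurious ordering into it
ordinal = df["city"].astype("category").cat.codes

print(one_hot.shape, ordinal.shape)   # (4, 3) vs (4,)
```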
-
- Aug 2017
-
arxiv.org
-
This is a very easy paper to follow, but it looks like their methodology is a simple way to improve performance on limited data. I'm curious how well this is reproduced elsewhere.
-
Deep learning refers to a network of connected artificial neurons in multiple layers, which can perform feature extraction from observed data and learn the complicated relationships among the features of data.
That's a little constrained of a definition - at least, that's the idea that I have from "Deep Learning" by Goodfellow, Bengio, and Courville. Currently, ANNs are the preferred way (possibly only established way) of doing deep learning, but deep learning is more general than ANNs.
-
-
blog.athelas.com
-
Excellent overview. I found the papers a little hard to grasp, and this cleared a lot of that up.
-
- Jul 2017
-
-
Kind of a clickbait title - but makes some good points. Personally, though, I considered it common knowledge that storytelling can very effectively incorporate philosophy (and vice versa). Consider Twilight Zone. Consider 12 Angry Men. Consider every movie that has ever presented a moral dilemma and made the viewer think deeply about it.
-
-
turbomack.github.io
-
This is good to know. My page uses Hakyll, and I've used Pandoc many times in the past, and I use org-mode nearly daily... yet somehow I am here fretting that I don't know an easy way to publish from my org-mode documents.
-
-
web.hypothes.is
-
Can I get an RSS feed to follow new annotations on a document? Yes! You can get RSS and Atom feeds by URL, Tag and User.
Handy! I wonder if I could tie this in with RSS readers to get something akin to Pocket, but not siloed.
-
-
-
The web today is full of disposable speech acts, that are not maintained, enriched, or returned to. Tweets, Facebook posts, contextually dependent blog posts. Consequently entering new conversations feels like sifting through the Nixon tapes.
This expresses neatly something I've complained about in many other contexts - particularly in trying to piece together coherent information on a project when people haven't taken the time to "garden", so everything is just full of old ephemera at best.
-
- Jan 2017
-
-
There is a conceptual and language gap. The sciences of neural networks and probability models do not have a shared language. My goal is to bridge this idea gap and allow for more collaboration and discussion between these fields, and provide a consistent implementation (Github link).
Compare with https://colah.github.io/posts/2015-09-NN-Types-FP/ and the three narratives (biological, representative, probabilistic)
-
- Dec 2016
-
s-ben.github.io
-
This is brilliant
-
-
web.hypothes.is
-
Handy! I would like to use this to let Wallabag (which has its own annotation support) send annotations to hypothes.is.
-
-
-
This is an excellent article - simultaneously detailed and concise.
-
-
www.thefelderreport.com
-
Critics have suggested this way of thinking about the stock market is outdated. In other words, “this time is different.” And even if they admit that comparing equity valuations to net worth has some value they insist that value does not include timing the market.
Empirically does this fit?
-
- Nov 2016
-
dailystoic.com
-
“10,000 hours” theory of expertise
Oh, I thought that was Gladwell.
-
Accept Everything. But Don’t Be Passive
This seems like a principle that Nassim Nicholas Taleb would agree with. He talked at length about stoicism in one of his books - I cannot recall whether Black Swan or Antifragile.
-
-
ipfs.io
-
This is a picture of the first HTTP web server in the world. It was Tim Berners-Lee's NeXT computer at CERN. Pasted on the machine is an ominous sticker: "This machine is a server, do not power it down!!". The reason it couldn't be powered down is that web sites on other servers were starting to link to it. Once they linked to it, they then depended on that machine continuing to exist. If the machine was powered down, the links stopped working. If the machine failed or was no longer accessible at the same location, a far worse thing happened: the chain between sites becomes permanently broken, and the ability to access that content is lost forever. That sticker perfectly highlights the biggest problem with HTTP: it erodes.
This is interesting, since the opening video for https://hypothes.is/ mentions the early web also - in this case, for its annotation features that were removed.
It seems to me that hypothes.is is even more powerful when used on IPFS content identified by hash, since that underlying content cannot change.
Thanks to both services I'm doing exactly this right now!
-
I think this is exactly what I've wanted - and what a lot of people have wanted - for a long time. It's certainly not the first time I've seen someone call for using hashes for referring to files, but the design and implementation behind this look like they do a lot of things right.
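A bare-bones sketch of the content-addressing idea (plain SHA-256 here, not IPFS's actual multihash/CID format): the identifier is derived from the bytes themselves, so a link to a hash can never silently start pointing at different content.

```python
import hashlib

def content_address(data: bytes) -> str:
    # the "name" of the content is a digest of the content itself
    return hashlib.sha256(data).hexdigest()

original = b"This machine is a server, do not power it down!!"
print(content_address(original))

# any edit yields a different address, so links to the old hash keep
# meaning exactly the old bytes (or fail, but never silently drift)
print(content_address(original + b" (edited)"))
```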
-
- Oct 2016
-
ubiquity.acm.org
-
In the 1970s, software problems were more accessible in the sense that fewer regulations were applied to restrict programmers to access key materials (codes peculiarly). There was a so-called "collaborative hacker" approach, sharing knowledge and improving each other's work, that encouraged programmers to continually redefine the boundaries of the problem (e.g. conducting reverse engineering to deconstruct a software to understand how a code was written). As a result, a wide range of software programming tools (languages, editors, compilers etc.) was created in the 1970s (Ceruzzi, 2003).
That is a perspective I hadn't considered, despite having read RMS's biography. I had really assumed that "modern" open source led to code being much less encumbered.
-