- May 2024
-
arxiv.org
-
less than 5%);
what about the rest?
-
Table 9.
No AWQ only? :(
-
Raspberry Pi 4B, achieving 0.7 tokens/s for 7B models
more info?
-
prompt length of 4 tokens
what's that prompt lol?
-
Overall, using the same calibration and evaluation distribution works the best
What about mixing calibration sets?
-
Figure 5.
Baseline for FP16 missing
-
Table 4.
perplexity is lower, but is it that much lower?
-
perplexity
Only perplexity :(?
-
For attention layers, we fuse QKV projections into a single kernel
How?
-
We avoid writing dequantized weights into DRAM by fusing dequantization kernels with the matrix multiplication kernel
How?
-
Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by 4×. Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus, weight-only quantization is more effective for on-device LLMs.
Great analysis and argument for why weight-only scaling matters!
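Quick back-of-envelope check of the 4× number (my own arithmetic, assuming a plain [n, k] GEMV per decoded token and counting only weight traffic):

```python
# During generation each token does a GEMV per linear layer: for an [n, k]
# weight matrix that's ~2*n*k FLOPs while reading all n*k weights once.

def arithmetic_intensity(n, k, bits_per_weight):
    flops = 2 * n * k                          # one multiply-accumulate per weight
    weight_bytes = n * k * bits_per_weight / 8
    return flops / weight_bytes                # FLOPs per byte of weight traffic

fp16 = arithmetic_intensity(4096, 4096, 16)    # W16A16 baseline
w4 = arithmetic_intensity(4096, 4096, 4)       # W4A16

print(fp16, w4, w4 / fp16)  # 1.0 4.0 4.0
```

So the 4× falls straight out of the 16-bit → 4-bit weight shrink, independent of the matrix shape.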
-
we take a small calibration
what if it is not representative of the overall activations? How to choose it?
-
he
grammar nitpick
-
We empirically find that: (1) The expected error from Round(·) (denoted as RoundErr(·)) does not change
how?
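Partially answering my own "how?" with a numerical check (my sketch; note Python's round() is banker's rounding, but the |error| statistics are the same):

```python
# If the value is "random enough", the fractional part is ~uniform, so the
# expected |x - Round(x)| is ~0.25 -- and multiplying by a scale before
# rounding doesn't change that expectation.
import random

random.seed(0)
xs = [random.uniform(-8, 8) for _ in range(100_000)]

def round_err(vals):
    return sum(abs(v - round(v)) for v in vals) / len(vals)

print(round_err(xs))                      # ~0.25
print(round_err([x * 1.7 for x in xs]))   # still ~0.25 after scaling
```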
-
group/block
Never defined arghhh
-
such a mixed-precision data type will make the system implementation difficult.
Why? Unclear, since fp16 is already emulated through software according to https://stackoverflow.com/questions/70446181/using-half-precision-floating-point-on-x86-cpus
-
activation magnitude
defined as?
-
Quantization maps a floating-point number into lower-bit integers.
Pretty late intro
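For my own notes, a minimal sketch of that mapping (symmetric round-to-nearest; my simplification, not necessarily the paper's exact scheme):

```python
# Map floats to b-bit integers via a shared scale factor, then reconstruct.

def quantize(xs, bits=4):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for INT4
    delta = max(abs(x) for x in xs) / qmax       # quantization step
    q = [max(-qmax, min(qmax, round(x / delta))) for x in xs]
    return q, delta

def dequantize(q, delta):
    return [qi * delta for qi in q]

q, delta = quantize([0.1, -0.52, 0.7])
print(q)                      # [1, -5, 7] -- integers in [-7, 7]
print(dequantize(q, delta))   # approximate reconstruction
```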
-
group-wise INT4 quantization
what's that?
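Answering my own question with a sketch of how I understand it (group size 4 here for readability; 128 is the setting I've seen most often elsewhere, not confirmed from this paper):

```python
# "Group-wise": the weight row is split into groups of g consecutive values
# and each group gets its own scale, so an outlier in one group doesn't
# blow up the precision of the whole row.

def groupwise_quant(w, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    out = []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        delta = max(abs(x) for x in group) / qmax  # per-group scale
        out.append(([round(x / delta) for x in group], delta))
    return out

# small weights and an outlier-heavy group quantize independently:
w = [0.01, 0.02, -0.03, 0.01, 5.0, -4.0, 3.0, 1.0]
for q, delta in groupwise_quant(w):
    print(q, delta)
```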
-
We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up the token generation (remedies memory-bound workload).
Technically, you could achieve both by (1) too if you quantize to W4A4 no?
-
is
grammar nitpick
-
scaling up the salient channels can reduce their relative quantization error
How is this avoiding hardware-inefficient mixed-precisions impls. ? Isn't a scale-up still going to result in the same "mixed-precision"?
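Trying to answer my own question: the point seems to be that w·x = (w·s)·(x/s), so s can be folded into the already-FP16 scales and the weights stay uniformly INT4, never FP16. Toy numbers (my own; I keep the step delta fixed for simplicity, whereas the real scheme recomputes it per group):

```python
# A small salient weight rounds away to 0; after scaling it up (and dividing
# the activation down to keep the product), it survives quantization.

def quant_dequant(x, delta):
    return round(x / delta) * delta

delta = 1.0          # shared quantization step (assumed fixed here)
w, x = 0.2, 3.0      # small salient weight, large activation
s = 4.0              # per-channel scale factor

plain = quant_dequant(w, delta) * x              # 0.2 rounds to 0 -> output 0.0
scaled = quant_dequant(w * s, delta) * (x / s)   # (0.8 -> 1.0) * 0.75 = 0.75

print(w * x, plain, scaled)  # exact 0.6, plain 0.0, scaled 0.75
```

So the relative error on the salient channel shrinks, with no mixed-precision data type anywhere.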
-
activation distribution
Talk a lot about it without yet defining it
-
, let alone edge devices
grammar nitpick
-
reducing the chance of data breaches
IoT/non-smartphone edge devices (=embedded devices) are notoriously less secure than cloud providers (source?) --> is this really a claim that can be done?
-
modalities
tested different modalities?
-
edge devices
Uses edge devices as motivation --> tested on them afterwards?
-
- Nov 2023
-
arxiv.org
-
For each application,
Would've been interesting to see if training on one application could've generalized to inferring/prefetching on another one
-
unordered future block indexes under the same page as the current memory address
I don't understand this
-
There are three significant advantages of using the binary vector as input
I was thinking of doing something similar, but instead of parsing it as binary, having it as a hex string, separated into hextets, and then having a simple embedding layer (w/ positional encoding).
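For reference, my understanding of the binary-vector input (width and bit order are my guesses):

```python
# Each address/block index becomes a fixed-width vector of 0/1 features
# that a model can consume directly, with no embedding table needed.

def to_binary_vector(addr, width=16):
    # most-significant bit first
    return [(addr >> i) & 1 for i in range(width - 1, -1, -1)]

print(to_binary_vector(0x1A2B))
# [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1]
```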
-
-
arxiv.org
-
architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence
Example annotation for you guys to understand
-
- Oct 2023
-
www.usenix.org
-
workloads vary over time
I wonder what the scale is for CDNs compared to pages
-
The good decision ratio allows us to run a simulation only once
I don't understand how this is better than simply counting the number of faults
-
- Sep 2023
-
www.micahlerner.com
-
Figure 11
PMU seems like it improves the system, though 30s+PMU and PMU-only would have been great for isolating its impact
-
However, the impact in hugepage coverage loss to application working sets can be seen in an increase of 7% in TLB misses on 2-tier machines compared to 1-tier machines, corroborating the causal explanation of the 1.5% impact.
I don't understand this
-
hot age as number of scan periods during which the page was recently accessed
Something of a "reverse-CLOCK" mechanism, interesting
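My mental model of the bookkeeping, as a sketch (the decay-on-idle step is my guess; the paper may reset the counter instead of decrementing it):

```python
# Every scan period, look at the page's accessed bit: if it was set, bump
# the hot age; otherwise decay it. Higher hot age ~ hotter page, which is
# why it reads like CLOCK run in reverse.

def update_hot_age(hot_age, accessed_bit):
    return hot_age + 1 if accessed_bit else max(0, hot_age - 1)

age = 0
for bit in [1, 1, 1, 0, 1, 0, 0]:   # accessed bits over 7 scan periods
    age = update_hot_age(age, bit)
print(age)  # 1
```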
-
showing the observed range of STAR and the corresponding impact to application performance
What should it be compared to? Isn't higher STAR directly correlated to lower performance (since the workload is longer/more expensive)?
-
- Jul 2023
-
dl.acm.org
-
#addresses
My PARSEC traces are 4 to 10x bigger than theirs...
-
Figure 8: An overview of key components of our LSTM based prefetcher.
Wouldn't this be too slow for a L1 cache?
-
The model can be used to generate predictions right away (e.g. after the first epoch of training), and while more epochs are completed, more accurate estimates can be generated
In a real system, I guess you'd have to synchronize accessing the model, with training it, potentially adding to the size in memory and/or inference time
-
100K accesses as warmup, and then use the next 200K for training and the next 200
I'd have to check if this holds for the PARSEC benchmarks as well, but for my other benchmarks, the first 7M out of 800M accesses were to a very small set of pages (probably some shared standard library)
-
caps the vocabulary at the 2K most popular deltas
Would be interesting to see the performance of an LSTM capped to the 2K most popular deltas
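Sketching what I think the vocabulary construction looks like (top-2 here instead of 2K, and I'm omitting whatever OOV handling they use):

```python
# Successive address deltas are counted and only the most frequent ones
# become output classes for the model.
from collections import Counter

def delta_vocab(addresses, cap):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    return [d for d, _ in Counter(deltas).most_common(cap)]

trace = [100, 104, 108, 112, 120, 124, 128, 200]
print(delta_vocab(trace, cap=2))  # most frequent deltas, e.g. [4, 8]
```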
-
autocorrelation coefficients (ACFs)
Had to google what ACF is; I guess the random variable is the memory access itself?
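What I ended up with after googling, as a sketch: the series is the access (or delta) stream itself, and the ACF at lag k is the normalized covariance with the k-shifted copy:

```python
# Sample autocorrelation of a sequence at a given lag.

def acf(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# a period-2 delta pattern shows up as strong correlation at even lags
deltas = [4, 8, 4, 8, 4, 8, 4, 8]
print(acf(deltas, 1), acf(deltas, 2))  # -0.875 0.75
```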
-
-
people.ece.ubc.ca
-
Last Level Cache (LLC)
Is extra hardware really needed if only LLC is taken into account?
-
predefined distance
The predefined distance acts as a maximum stride as well then? Also, doesn't work if the pattern is not a stride?
-
- Jun 2023
-
sigops.org
-
confidence on the current and previous patterns, which is the probability it assigns to the correct prediction
Is this afterwards correlated with the prefetcher's prediction performance (accuracy/coverage)?
-
miss history length of 1
Isn't this very low?
-
average similar examples, producing single representative cases
Poses the same challenge as when selecting which misses to train on in my opinion
-
The LSTM takes >150 𝜇s per inference and >1 ms per example for training
Could this be decreased if the model "lived" on GPU (training done there, maybe also inference?), and the CPU accessed the model through the GPU VRAM directly?
-
interference
Is this a ML-specific term, or do they want to say that the DNN is still too costly resource-wise?
-
-
-
If more than two frequent kernels exist, we ignore them entirely for prefetching.
I wonder what actual patterns the loops with "only two kernels" reflect
-
Complex
Complex, as in composition of several patterns?
-
single thread,
The same can probably be argued for multithreaded programs if we look into other threads' execution traces
-
Intel’s Precise Event-Based Sampling (PEBS)
Could actually be a great use case for PEBS since we want to focus only on mem. instr.s generating the most cache misses
-
Delta
Could be affected by loop unrolling
-
-
proceedings.mlr.press
-
we first cluster the raw address space using k-means
Similarly to why deltas might not work with dynamic allocation, the clusters' start&end addresses could also differ
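To make the step concrete (and to illustrate my worry about allocation-dependent boundaries), a minimal 1-D k-means of my own, not the paper's code:

```python
# Plain 1-D k-means over page addresses: assign each address to the nearest
# center, recompute centers as group means, repeat. Cluster boundaries are
# entirely data-dependent, so a different allocation layout shifts them.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

addrs = [0x1000, 0x1008, 0x1010, 0x8000, 0x8004, 0x8010]
print(kmeans_1d(addrs, centers=[0x0, 0x9000]))  # two centers, one per region
```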
-