- May 2024
-
arxiv.org
-
less than 5%);
what about the rest?
-
Table 9.
No AWQ only? :(
-
Raspberry Pi 4B, achieving 0.7 tokens/s for 7B models
more info?
-
prompt length of 4 tokens
what's that prompt lol?
-
Overall, using the same calibration and evaluation distribution works the best
What about mixing calibration sets?
-
Figure 5.
Baseline for FP16 missing
-
Table 4.
perplexity is lower, but is it that much lower?
-
perplexity
Only perplexity :(?
-
For attention layers, we fuse QKV projections into a single kernel
How?
-
We avoid writing dequantized weights into DRAM by fusing dequantization kernels with the matrix multiplication kernel
How?
-
Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by 4×. Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus, weight-only quantization is more effective for on-device LLMs.
Great analysis and argument for why weight-only scaling matters!
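Quick back-of-envelope check of the 4× number (my own arithmetic, assuming a plain [n, k] GEMV per decoded token and counting only weight traffic):

```python
# During generation each token does a GEMV per linear layer: for an [n, k]
# weight matrix that's ~2*n*k FLOPs while reading all n*k weights once.

def arithmetic_intensity(n, k, bits_per_weight):
    flops = 2 * n * k                          # one multiply-accumulate per weight
    weight_bytes = n * k * bits_per_weight / 8
    return flops / weight_bytes                # FLOPs per byte of weight traffic

fp16 = arithmetic_intensity(4096, 4096, 16)    # W16A16 baseline
w4 = arithmetic_intensity(4096, 4096, 4)       # W4A16

print(fp16, w4, w4 / fp16)  # 1.0 4.0 4.0
```

So the 4× falls straight out of the 16-bit → 4-bit weight shrink, independent of the matrix shape.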
-
we take a small calibration
what if it is not representative of the overall activations? How to choose it?
-
he
grammar nitpick
-
We empirically find that: (1) The expected error from Round(·) (denoted as RoundErr(·)) does not change
how?
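Partially answering my own "how?" with a numerical check (my sketch; note Python's round() is banker's rounding, but the |error| statistics are the same):

```python
# If the value is "random enough", the fractional part is ~uniform, so the
# expected |x - Round(x)| is ~0.25 -- and multiplying by a scale before
# rounding doesn't change that expectation.
import random

random.seed(0)
xs = [random.uniform(-8, 8) for _ in range(100_000)]

def round_err(vals):
    return sum(abs(v - round(v)) for v in vals) / len(vals)

print(round_err(xs))                      # ~0.25
print(round_err([x * 1.7 for x in xs]))   # still ~0.25 after scaling
```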
-
group/block
Never defined arghhh
-
such a mixed-precision data type will make the system implementation difficult.
Why? Unclear, since fp16 is already emulated through software according to https://stackoverflow.com/questions/70446181/using-half-precision-floating-point-on-x86-cpus
-
activation magnitude
defined as?
-
Quantization maps a floating-point number into lower-bit integers.
Pretty late intro
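For my own notes, a minimal sketch of that mapping (symmetric round-to-nearest; my simplification, not necessarily the paper's exact scheme):

```python
# Map floats to b-bit integers via a shared scale factor, then reconstruct.

def quantize(xs, bits=4):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for INT4
    delta = max(abs(x) for x in xs) / qmax       # quantization step
    q = [max(-qmax, min(qmax, round(x / delta))) for x in xs]
    return q, delta

def dequantize(q, delta):
    return [qi * delta for qi in q]

q, delta = quantize([0.1, -0.52, 0.7])
print(q)                      # [1, -5, 7] -- integers in [-7, 7]
print(dequantize(q, delta))   # approximate reconstruction
```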
-
group-wise INT4 quantization
what's that?
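Answering my own question with a sketch of how I understand it (group size 4 here for readability; 128 is the setting I've seen most often elsewhere, not confirmed from this paper):

```python
# "Group-wise": the weight row is split into groups of g consecutive values
# and each group gets its own scale, so an outlier in one group doesn't
# blow up the precision of the whole row.

def groupwise_quant(w, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    out = []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        delta = max(abs(x) for x in group) / qmax  # per-group scale
        out.append(([round(x / delta) for x in group], delta))
    return out

# small weights and an outlier-heavy group quantize independently:
w = [0.01, 0.02, -0.03, 0.01, 5.0, -4.0, 3.0, 1.0]
for q, delta in groupwise_quant(w):
    print(q, delta)
```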
-
We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up the token generation (remedies memory-bound workload).
Technically, you could achieve both by (1) too if you quantize to W4A4 no?
-
is
grammar nitpick
-
scaling up the salient channels can reduce their relative quantization error
How is this avoiding hardware-inefficient mixed-precisions impls. ? Isn't a scale-up still going to result in the same "mixed-precision"?
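Trying to answer my own question: the point seems to be that w·x = (w·s)·(x/s), so s can be folded into the already-FP16 scales and the weights stay uniformly INT4, never FP16. Toy numbers (my own; I keep the step delta fixed for simplicity, whereas the real scheme recomputes it per group):

```python
# A small salient weight rounds away to 0; after scaling it up (and dividing
# the activation down to keep the product), it survives quantization.

def quant_dequant(x, delta):
    return round(x / delta) * delta

delta = 1.0          # shared quantization step (assumed fixed here)
w, x = 0.2, 3.0      # small salient weight, large activation
s = 4.0              # per-channel scale factor

plain = quant_dequant(w, delta) * x              # 0.2 rounds to 0 -> output 0.0
scaled = quant_dequant(w * s, delta) * (x / s)   # (0.8 -> 1.0) * 0.75 = 0.75

print(w * x, plain, scaled)  # exact 0.6, plain 0.0, scaled 0.75
```

So the relative error on the salient channel shrinks, with no mixed-precision data type anywhere.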
-
activation distribution
Talk a lot about it without yet defining it
-
, let alone edge devices
grammar nitpick
-
reducing the chance of data breaches
IoT/non-smartphone edge devices (=embedded devices) are notoriously less secure than cloud providers (source?) --> is this really a claim that can be done?
-
modalities
tested different modalities?
-
edge devices
Uses edge devices as motivation --> tested on them afterwards?
-
- Nov 2023
-
arxiv.org
-
For each application,
Would've been interesting to see if training on one application could've generalized to inferring/prefetching on another one
-
unordered future block indexes under the same page as the current memory address
I don't understand this
-
There are three significant advantages of using the binary vector as input
I was thinking of doing something similar, but instead of parsing it as binary, having it as a hex string, separated into hextets, and then having a simple embedding layer (w/ positional encoding).
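For reference, my understanding of the binary-vector input (width and bit order are my guesses):

```python
# Each address/block index becomes a fixed-width vector of 0/1 features
# that a model can consume directly, with no embedding table needed.

def to_binary_vector(addr, width=16):
    # most-significant bit first
    return [(addr >> i) & 1 for i in range(width - 1, -1, -1)]

print(to_binary_vector(0x1A2B))
# [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1]
```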
-
-
arxiv.org
-
architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence
Example annotation for you guys to understand
-
- Oct 2023
-
www.usenix.org
-
workloads vary over time
I wonder what the scale is for CDNs compared to pages
-
The good decision ratio allows us to run a simulation only once
I don't understand how this is better than simply counting the number of faults
-
- Sep 2023
-
www.micahlerner.com
-
Figure 11
PMU seems like it improves the system, though 30s+PMU and PMU-only would have been great for isolating its impact
-
However, the impact in hugepage coverage loss to application working sets can be seen in an increase of 7% in TLB misses on 2-tier machines compared to 1-tier machines, corroborating the causal explanation of the 1.5% impact.
I don't understand this
-
hot age as number of scan periods during which the page was recently accessed
Something of a "reverse-CLOCK" mechanism, interesting
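My mental model of the bookkeeping, as a sketch (the decay-on-idle step is my guess; the paper may reset the counter instead of decrementing it):

```python
# Every scan period, look at the page's accessed bit: if it was set, bump
# the hot age; otherwise decay it. Higher hot age ~ hotter page, which is
# why it reads like CLOCK run in reverse.

def update_hot_age(hot_age, accessed_bit):
    return hot_age + 1 if accessed_bit else max(0, hot_age - 1)

age = 0
for bit in [1, 1, 1, 0, 1, 0, 0]:   # accessed bits over 7 scan periods
    age = update_hot_age(age, bit)
print(age)  # 1
```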
-
showing the observed range of STAR and the corresponding impact to application performance
What should it be compared to? Isn't higher STAR directly correlated to lower performance (since the workload is longer/more expensive)?
-
- Jul 2023
-
dl.acm.org
-
#addresses
My PARSEC traces are 4 to 10x bigger than theirs...
-
Figure 8: An overview of key components of our LSTM based prefetcher.
Wouldn't this be too slow for a L1 cache?
-
The model can be used to generate predictions right away (e.g. after the first epoch of training), and while more epochs are completed, more accurate estimates can be generated
In a real system, I guess you'd have to synchronize accessing the model, with training it, potentially adding to the size in memory and/or inference time
-
100K accesses as warmup, and then use the next 200K for training and the next 200
I'd have to check if this holds for the PARSEC benchmarks as well, but for my other benchmarks, the first 7M out of 800M accesses were to a very small set of pages (probably some shared standard library)
-
caps the vocabulary at the 2K most popular deltas
Would be interesting to see the performance of an LSTM capped to the 2K most popular deltas
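Sketching what I think the vocabulary construction looks like (top-2 here instead of 2K, and I'm omitting whatever OOV handling they use):

```python
# Successive address deltas are counted and only the most frequent ones
# become output classes for the model.
from collections import Counter

def delta_vocab(addresses, cap):
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    return [d for d, _ in Counter(deltas).most_common(cap)]

trace = [100, 104, 108, 112, 120, 124, 128, 200]
print(delta_vocab(trace, cap=2))  # most frequent deltas, e.g. [4, 8]
```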
-
autocorrelation coefficients (ACFs)
Had to google what ACF is; I guess the random variable is the memory access itself?
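What I ended up with after googling, as a sketch: the series is the access (or delta) stream itself, and the ACF at lag k is the normalized covariance with the k-shifted copy:

```python
# Sample autocorrelation of a sequence at a given lag.

def acf(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# a period-2 delta pattern shows up as strong correlation at even lags
deltas = [4, 8, 4, 8, 4, 8, 4, 8]
print(acf(deltas, 1), acf(deltas, 2))  # -0.875 0.75
```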
-
-
people.ece.ubc.ca
-
Last Level Cache (LLC)
Is extra hardware really needed if only LLC is taken into account?
-
predefined distance
The predefined distance acts as a maximum stride as well then? Also, doesn't work if the pattern is not a stride?
-
- Jun 2023
-
sigops.org
-
confidence on the current and previous patterns, which is the probability it assigns to the correct prediction
Is this afterwards correlated with the prefetcher's prediction performance (accuracy/coverage)?
-
miss history length of 1
Isn't this very low?
-
average similar examples, producing single representative cases
Poses the same challenge as when selecting which misses to train on in my opinion
-
The LSTM takes >150 𝜇s per inference and >1 ms per example for training
Could this be decreased if the model "lived" on GPU (training done there, maybe also inference?), and the CPU accessed the model through the GPU VRAM directly?
-
interference
Is this a ML-specific term, or do they want to say that the DNN is still too costly resource-wise?
-
-
-
If more than two frequent kernels exist, we ignore them entirely for prefetching.
I wonder what actual patterns the loops with "only two kernels" reflect
-
Complex
Complex, as in composition of several patterns?
-
single thread,
The same can probably be argued for multithreaded programs if we look into other threads' execution traces
-
Intel’s Precise Event-Based Sampling (PEBS)
Could actually be a great use case for PEBS since we want to focus only on mem. instr.s generating the most cache misses
-
Delta
Could be affected by loop unrolling
-
-
proceedings.mlr.press
-
we first cluster the raw address space using k-means
Similarly to why deltas might not work with dynamic allocation, the clusters' start&end addresses could also differ
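To make the step concrete (and to illustrate my worry about allocation-dependent boundaries), a minimal 1-D k-means of my own, not the paper's code:

```python
# Plain 1-D k-means over page addresses: assign each address to the nearest
# center, recompute centers as group means, repeat. Cluster boundaries are
# entirely data-dependent, so a different allocation layout shifts them.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

addrs = [0x1000, 0x1008, 0x1010, 0x8000, 0x8004, 0x8010]
print(kmeans_1d(addrs, centers=[0x0, 0x9000]))  # two centers, one per region
```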
-