56 Matching Annotations
  1. May 2024
    1. less than 5%);

      what about the rest?

    2. Table 9.

      No AWQ-only results? :(

    3. Raspberry Pi 4B, achieving 0.7 tokens/s for 7B models

      more info?

    4. prompt length of 4 tokens

      what's that prompt lol?

    5. Overall, using the same calibration and evaluation distribution works the best

      What about mixing calibration sets?

    6. Figure 5.

      Baseline for FP16 missing

    7. Table 4.

      perplexity is lower, but is it that much lower?

    8. perplexity

      Only perplexity :(?

    9. For attention layers, we fuse QKV projections into a single kernel

      How?
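
      Presumably something like this: concatenate the Q/K/V weight matrices so one matmul (one kernel launch) produces all three projections. A minimal PyTorch sketch of the general idea, not their actual CUDA kernel:

      ```python
      import torch

      d_model, n_tokens = 512, 8
      x = torch.randn(n_tokens, d_model)

      # Separate projection weights (illustrative shapes only).
      w_q = torch.randn(d_model, d_model)
      w_k = torch.randn(d_model, d_model)
      w_v = torch.randn(d_model, d_model)

      # "Fusing" here just means one stacked weight matrix, so a single matmul
      # (one kernel launch) computes Q, K and V together.
      w_qkv = torch.cat([w_q, w_k, w_v], dim=1)       # (d_model, 3*d_model)
      q, k, v = (x @ w_qkv).split(d_model, dim=1)     # split the fused output

      # Same result as three separate projections/kernels:
      assert torch.allclose(q, x @ w_q, atol=1e-3)
      ```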

    10. We avoid writing dequantized weights into DRAM by fusing dequantization kernels with the matrix multiplication kernel

      How?

    11. Middle: The generation stage is memory bound and has low arithmetic intensity. W4A16 quantization can effectively improve the arithmetic intensity by 4×. Right: The amount of weight access is orders of magnitude larger than the amount of activation access. Thus, weight-only quantization is more effective for on-device LLMs.

      Great analysis and argument for why weight-only scaling matters!
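
      The argument is easy to reproduce with back-of-the-envelope numbers (mine, not the paper's exact ones): in single-batch decoding every weight is read once per generated token while activation traffic is tiny, so shrinking weights from FP16 to INT4 lifts arithmetic intensity by ~4×.

      ```python
      # Rough arithmetic-intensity estimate for single-batch decoding of a 7B model.
      # Assumed numbers, just to reproduce the shape of the argument.
      params = 7e9                  # weights touched once per generated token
      flops_per_token = 2 * params  # one multiply-accumulate per weight

      for name, bytes_per_weight in [("W16A16", 2), ("W4A16", 0.5)]:
          weight_bytes = params * bytes_per_weight
          intensity = flops_per_token / weight_bytes   # FLOPs per byte of weight traffic
          print(f"{name}: ~{intensity:.0f} FLOPs/byte")
      # W16A16: ~1 FLOPs/byte, W4A16: ~4 FLOPs/byte -> 4x higher arithmetic intensity
      ```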

    12. we take a small calibration

      what if it is not representative of the overall activations? How to choose it?

    13. he

      grammar nitpick

    14. We empirically find that: (1) The expected error from Round(·) (denoted as RoundErr(·)) does not change

      how?
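
      My guess at the "how": the residual after Round(·) is close to uniform on [-0.5, 0.5] whether or not the weights are pre-scaled, so its expected magnitude stays around 0.25. A quick Monte Carlo sanity check of that reading (my toy, not their experiment):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      w = rng.normal(size=1_000_000)           # stand-in for a weight channel

      def round_err(x):
          """Mean |x - Round(x)|: magnitude of the rounding residual."""
          return np.abs(x - np.round(x)).mean()

      for s in (1, 2, 4):                      # scaling the channel up by s
          print(f"s={s}: E[RoundErr] ~ {round_err(w * s):.3f}")
      # All ~0.25 -> scaling the weights up doesn't change the expected rounding error.
      ```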

    15. group/block

      Never defined arghhh

    16. such a mixed-precision data type will make the system implementation difficult.

      Why? Unclear, since fp16 is already emulated through software according to https://stackoverflow.com/questions/70446181/using-half-precision-floating-point-on-x86-cpus

    17. activation magnitude

      defined as?

    18. Quantization maps a floating-point number into lower-bit integers.

      Pretty late intro

    19. group-wise INT4 quantization

      what's that?
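
      Since neither this nor "group/block" ever gets defined: my understanding is that every g consecutive weights (e.g. g = 128) share one floating-point scale and are stored as 4-bit integers relative to it. A minimal numpy sketch of that reading:

      ```python
      import numpy as np

      def groupwise_int4_quant(w, group_size=128):
          """Symmetric group-wise INT4 quantization: one FP scale per group of weights."""
          w = w.reshape(-1, group_size)                     # each row is one group/block
          scale = np.abs(w).max(axis=1, keepdims=True) / 7  # map the group max to +-7
          q = np.clip(np.round(w / scale), -8, 7)           # 4-bit integers
          return q, scale

      def dequant(q, scale):
          return q * scale                                  # reconstruct FP weights

      w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
      q, scale = groupwise_int4_quant(w)
      print("max abs error:", np.abs(dequant(q, scale).reshape(-1) - w).max())
      ```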

    20. We focus on the second setting in this work since it not only reduces the hardware barrier (requiring a smaller memory size) but also speeds up the token generation (remedies memory-bound workload).

      Technically, you could achieve both with (1) too if you quantize to W4A4, no?

    21. is

      grammar nitpick

    22. scaling up the salient channels can reduce their relative quantization error

      How does this avoid hardware-inefficient mixed-precision implementations? Isn't a scale-up still going to result in the same "mixed precision"? (sketch of my understanding below)
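
      A toy numpy version of how I understand the trick (my sketch, not their implementation): the salient input channel's weights are multiplied by s > 1 before quantization and the matching activation is divided by s, so the output is mathematically unchanged, every weight stays INT4 (no mixed precision), and the salient channel just uses more of the quantization grid.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n_in, n_out = 64, 64
      W = rng.normal(size=(n_in, n_out))       # toy linear layer, rows = input channels
      x = rng.normal(size=n_in)
      x[3] *= 20                               # channel 3 has large activations -> "salient"

      def int4_per_column(w):
          """Symmetric INT4 quantization with one scale per output column."""
          scale = np.abs(w).max(axis=0) / 7
          return np.clip(np.round(w / scale), -8, 7) * scale

      def output_error(W_q, x_used):
          return np.abs(x_used @ W_q - x @ W).mean()

      # Baseline: everything quantized the same way.
      print("no scaling    :", output_error(int4_per_column(W), x))

      # Scaling trick: multiply the salient channel's weights by s, divide its
      # activation by s. x @ W is mathematically unchanged, all weights stay INT4,
      # but channel 3 now spans more of the grid -> smaller relative error.
      s = np.ones(n_in); s[3] = 2.0
      print("salient scaled:", output_error(int4_per_column(W * s[:, None]), x / s))
      ```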

    23. activation distribution

      They talk a lot about it without defining it yet

    24. , let alone edge devices

      grammar nitpick

    25. reducing the chance of data breaches

      IoT/non-smartphone edge devices (=embedded devices) are notoriously less secure than cloud providers (source?) --> is this really a claim that can be made?

    26. modalities

      tested different modalities?

    27. edge devices

      Uses edge devices as motivation --> tested on them afterwards?

  2. Nov 2023
    1. For each application,

      Would've been interesting to see if training on one application could've generalized to inferring/prefetching on another one

    2. unordered future block indexes under the same page as the current memory address

      I don't understand this

    3. There are three significant advantages of using the binary vector as input

      I was thinking of doing something similar, but instead of parsing it as binary, representing it as a hex string split into hextets and feeding that through a simple embedding layer (w/ positional encoding); rough sketch below.
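
      Roughly what I had in mind (hypothetical, not from the paper): split the 64-bit address into four 16-bit hextets, embed each with a learned table, and add a positional encoding so the model knows which chunk is which.

      ```python
      import torch
      import torch.nn as nn

      class HextetEmbedding(nn.Module):
          """Embed a 64-bit address as four 16-bit chunks plus positional encoding."""
          def __init__(self, dim=64):
              super().__init__()
              self.embed = nn.Embedding(2**16, dim)                 # one entry per hextet value
              self.pos = nn.Parameter(torch.randn(4, dim) * 0.02)   # learned positions 0..3

          def forward(self, addr):                               # addr: (batch,) int64
              shifts = torch.tensor([48, 32, 16, 0], device=addr.device)
              hextets = (addr.unsqueeze(-1) >> shifts) & 0xFFFF  # (batch, 4)
              return self.embed(hextets) + self.pos              # (batch, 4, dim)

      addrs = torch.tensor([0x7FFF_5000_1A2B_3C4D, 0x0000_5555_0000_1000])
      print(HextetEmbedding()(addrs).shape)                      # torch.Size([2, 4, 64])
      ```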

    1. ure for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence

      Example annotation for you guys to understand

  3. Oct 2023
    1. workloads vary over time

      I wonder what the scale is for CDNs compared to pages

    2. The good decision ratio allows us to run a simulation only once

      I don't understand how this is better than simply counting the number of faults

  4. Sep 2023
    1. Figure 11

      PMU seems like it improves the system, though 30s+PMU and PMU-only would have been great for isolating its impact

    2. However, the impact in hugepage coverage loss to application working sets can be seen in an increase of 7% in TLB misses on 2-tier machines compared to 1-tier machines, corroborating the causal explanation of the 1.5% impact.

      I don't understand this

    3. hot age as number of scan periods during which the page was recently accessed

      Something of a "reverse-CLOCK" mechanism, interesting
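
      A toy version of how I read that definition (the bookkeeping details are my assumption, not the paper's): every scan period, bump a page's hot age if its accessed bit was set.

      ```python
      def update_hot_ages(hot_age, accessed_this_period):
          """One scan period: bump a page's hot age if its accessed bit was set."""
          for page, accessed in accessed_this_period.items():
              if accessed:
                  hot_age[page] = hot_age.get(page, 0) + 1
          return hot_age

      hot_age = {}
      for period in [{0xA: True, 0xB: False}, {0xA: True, 0xB: True}]:
          hot_age = update_hot_ages(hot_age, period)
      print(hot_age)   # {10: 2, 11: 1} -- page 0xA looks hotter than 0xB
      ```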

    4. showing the observed range of STAR and the corresponding impact to application performance

      What should it be compared to? Isn't higher STAR directly correlated to lower performance (since the workload is longer/more expensive)?

  5. Jul 2023
    1. #addresses

      My PARSEC traces are 4 to 10x bigger than theirs...

    2. Figure 8: An overview of key components of our LSTM based prefetcher.

      Wouldn't this be too slow for an L1 cache?

    3. The model can be used to generate predictions right away (e.g. after the first epoch of training), and while more epochs are completed, more accurate estimates can be generated

      In a real system, I guess you'd have to synchronize accessing the model with training it, potentially adding to the memory footprint and/or inference time

    4. 100K accesses as warmup, and then use the next 200K for training and the next 200

      I'd have to check if this holds for the PARSEC benchmarks as well, but for my other benchmarks, the first 7M out of 800M accesses were to a very small set of pages (probably some shared standard library)

    5. caps the vocabulary at the 2K most popular deltas

      Would be interesting to see the performance of an LSTM capped to the 2K most popular deltas
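
      For reference, the capping itself is simple enough to sketch (my toy version, with a made-up OOV id): keep only the most frequent address deltas and map everything else to a single out-of-vocabulary token.

      ```python
      from collections import Counter

      def build_delta_vocab(addresses, max_size=2000):
          """Map the most frequent address deltas to ids; everything else -> OOV (id 0)."""
          deltas = [b - a for a, b in zip(addresses, addresses[1:])]
          most_common = Counter(deltas).most_common(max_size)
          vocab = {delta: i + 1 for i, (delta, _) in enumerate(most_common)}
          return [vocab.get(d, 0) for d in deltas], vocab

      # Toy trace: a regular stride plus one irregular jump.
      trace = [0x1000, 0x1040, 0x1080, 0x9000, 0x9040]
      ids, vocab = build_delta_vocab(trace, max_size=2000)
      print(ids, vocab)
      ```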

    6. autocorrelation coefficients (ACFs)

      Had to google what ACF is; I guess the random variable is the memory access itself?
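
      To answer my own question: I think the series is the address-delta (or access) sequence, and the ACF at lag k is just the normalized autocovariance. A quick numpy version of my assumption:

      ```python
      import numpy as np

      def acf(x, max_lag=10):
          """Autocorrelation coefficients of a 1-D series for lags 1..max_lag."""
          x = np.asarray(x, dtype=float) - np.mean(x)
          var = np.dot(x, x)
          return [np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)]

      # Toy delta sequence with a strong period of 2: ACF should spike at even lags.
      deltas = np.tile([64, 4096], 500)
      print(np.round(acf(deltas, max_lag=4), 2))   # ~[-1., 1., -1., 1.]
      ```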

    1. Last Level Cache (LLC)

      Is extra hardware really needed if only LLC is taken into account?

    2. predefined distance

      The predefined distance acts as a maximum stride as well then? Also, doesn't work if the pattern is not a stride?

  6. Jun 2023
    1. confidence on the current and previous patterns, which is the probability it assigns to the correct prediction

      Is this afterwards correlated with the prefetcher's prediction performance (accuracy/coverage)?

    2. miss history length of 1

      Isn't this very low?

    3. average similar examples, producing single representative cases

      Poses the same challenge as selecting which misses to train on, in my opinion

    4. The LSTM takes >150 μs per inference and >1 ms per example for training

      Could this be decreased if the model "lived" on the GPU (training done there, maybe also inference?) and the CPU accessed it through GPU VRAM directly?

    5. interference

      Is this an ML-specific term, or do they want to say that the DNN is still too costly resource-wise?

    1. If more than two frequent kernels exist, we ignore them entirely for prefetching.

      I wonder what actual patterns the loops with "only two kernels" reflect

    2. Complex

      Complex, as in composition of several patterns?

    3. single thread,

      It could probably be argued for multithreaded programs as well if we look into other threads' execution traces

    4. Intel’s Precise Event-Based Sampling (PEBS)

      Could actually be a great use case for PEBS since we want to focus only on the memory instructions generating the most cache misses

    5. Delta

      Could be affected by loop unrolling

    1. we first cluster the raw address space using k-means

      Similarly to why deltas might not work with dynamic allocation, the clusters' start and end addresses could also differ across runs; rough sketch of the clustering below
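
      For context, the clustering step is roughly this (a sketch with sklearn; the cluster count and address ranges are made up, and as noted the cluster boundaries would move if the allocator places things differently across runs):

      ```python
      import numpy as np
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(0)
      # Toy trace: accesses concentrated in three regions (heap, mmap'd file, stack, say).
      addresses = np.concatenate([
          rng.integers(0x0000_5555_0000_0000, 0x0000_5555_0100_0000, 5000),
          rng.integers(0x0000_7F00_0000_0000, 0x0000_7F00_0200_0000, 5000),
          rng.integers(0x0000_7FFF_F000_0000, 0x0000_7FFF_F010_0000, 5000),
      ])

      # Cluster the raw address space; deltas would then be computed within each cluster.
      labels = KMeans(n_clusters=3, n_init=10).fit_predict(
          addresses.reshape(-1, 1).astype(float))
      for c in range(3):
          lo, hi = addresses[labels == c].min(), addresses[labels == c].max()
          print(f"cluster {c}: {hex(lo)} .. {hex(hi)}")
      ```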