Hypothesis

75 Matching Annotations

Aug 2025
Local file

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

3
1. diyiycw 25 Aug 2025
  
  in Public
  
  further strengthen
  
  Future work could additionally take the coherence of the whole sequence into account.
2. diyiycw 25 Aug 2025
  
  in Public
  
  Response Generation
  
  Training samples are variable-length chunks, which keeps the reward model from focusing too much on the response level or being misled by length.
3. diyiycw 25 Aug 2025
  
  in Public
  
  Choosing the Reward Model
  
  Needs a moderately sized backbone from the same family as the LLM.
Annotators

diyiycw
Local file

Untitled document

7
1. diyiycw 25 Aug 2025
  
  in Public
  
  Coreset (CS)
  
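  (5) Coreset strategy: compute the cosine similarity between the embedding of the new instance and the embeddings of past instances; if even the most similar past instance falls below a similarity threshold, the instance is unfamiliar to the SLM and it should consult the LLM.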
2. diyiycw 25 Aug 2025
  
  in Public
  
  Query by Committee (QBC)
  
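  (4) Committee strategy: form a 4-member committee and measure the disagreement between the committee's and the student's predictions; large disagreement means the SLM is unsure and should consult the LLM.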
3. diyiycw 25 Aug 2025
  
  in Public
  
  Prediction Entropy (PE)
  
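  (3) Prediction entropy strategy: use the entropy formula to compute overall uncertainty; high entropy means the SLM is unsure and should consult the LLM.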
4. diyiycw 25 Aug 2025
  
  in Public
  
  Margin Sampling (MS)
  
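  (2) Margin sampling strategy: take the top-2 predictions and compute the difference between their probabilities; a small difference means high uncertainty (the SLM is unsure), so it should consult the LLM.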
5. diyiycw 25 Aug 2025
  
  in Public
  
  Front-loading (FR)
  
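  (1) Greedy strategy: send everything to the LLM until the budget is exhausted. Because the examples are i.i.d., this is equivalent to random selection and does not show the advantage of active learning.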
6. diyiycw 25 Aug 2025
  
  in Public
  
  Instance Selection Criteria
  
  Several active learning strategies.
7. diyiycw 25 Aug 2025
  
  in Public
  
  retrain the model from scratch
  
  Incremental training is not an option because it would cause catastrophic forgetting.
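The selection criteria annotated above (margin sampling, prediction entropy, coreset) can be sketched as simple threshold rules. This is a minimal illustration, not the paper's implementation: the function names, the thresholds `margin_thr`, `entropy_thr`, `sim_thr`, and the strategy codes are assumptions, and the coreset rule uses the common novelty reading (ask the LLM when no previously annotated instance is similar).

```python
import math

def margin(probs):
    """Gap between the two most probable classes; a small gap signals uncertainty."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def entropy(probs):
    """Shannon entropy (in nats) of the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def max_similarity(embedding, annotated):
    """Similarity to the closest previously annotated instance (-1.0 if none exist)."""
    return max((cosine(embedding, e) for e in annotated), default=-1.0)

def should_query_llm(probs, embedding, annotated, strategy,
                     margin_thr=0.2, entropy_thr=0.6, sim_thr=0.8):
    """Return True when the student (SLM) should ask the teacher (LLM).

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if strategy == "MS":   # margin sampling: top-2 probabilities are close
        return margin(probs) < margin_thr
    if strategy == "PE":   # prediction entropy: distribution is diffuse
        return entropy(probs) > entropy_thr
    if strategy == "CS":   # coreset: nothing similar has been annotated yet
        return max_similarity(embedding, annotated) < sim_thr
    raise ValueError(f"unknown strategy: {strategy}")
```

In practice the thresholds would be tuned against the LLM call budget.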
Annotators

diyiycw
Local file

Untitled document

1
1. diyiycw 24 Aug 2025
  
  in Public
  
  EcoAssistant contains three components.
  
  (1) Code executor (2) LLM cascade (3) Answer retrieval (i.e., a completion cache)
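One way the cascade and the completion cache could fit together is sketched below. Everything here is a hypothetical illustration: `answer_with_cascade`, the `(name, generate)` model pairs, and the `is_success` check (standing in for the code executor's pass/fail signal) are assumed names, and the cache is a plain exact-match dictionary rather than whatever retrieval mechanism the actual system uses.

```python
def answer_with_cascade(query, models, is_success, cache):
    """Serve from the cache if possible, else try models from cheapest to priciest.

    models: list of (name, generate) pairs ordered by increasing cost,
            where generate(query) returns a response string.
    is_success: callable(query, response) -> bool, e.g. "the code ran without error".
    cache: dict mapping queries to previously successful responses.
    Returns (response, source), where source is "cache", a model name, or "failed".
    """
    if query in cache:
        return cache[query], "cache"
    response = None
    for name, generate in models:
        response = generate(query)
        if is_success(query, response):
            cache[query] = response  # remember successful answers for reuse
            return response, name
    return response, "failed"
```

A second call with the same query is then answered from the cache without invoking any model.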
Annotators

diyiycw
arxiv.org

2305.05176.pdf

5
1. diyiycw 22 Aug 2025
  
  in Public
  
  Datasets,
  
  (1) HEADLINES: news headline -> gold price trend (2) OVERRULING: sentence -> whether it overrules a previous case (3) COQA: passage + question -> answer (reading comprehension)
2. diyiycw 22 Aug 2025
  
  in Public
  
  interpolating it within a few samples
  
  (2) Run only a small number of samples; the rest can be estimated by interpolating between them.
3. diyiycw 22 Aug 2025
  
  in Public
  
  with small answer disagreement
  
  (1) Models with small answer disagreement can be pruned.
4. diyiycw 22 Aug 2025
  
  in Public
  
  Strategy 2: LLM approximation.
  
  Strategy 2: LLM approximation. (1) Question-answer cache; (2) model distillation
5. diyiycw 22 Aug 2025
  
  in Public
  
  Strategy 1: Prompt adaptation.
  
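  Strategy 1: Prompt adaptation. (1) Prompt selection: reduce along the prompt dimension; (2) Prompt sharing: reuse one prompt across several queries, along the query dimension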
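The prompt-sharing idea noted above (one prompt reused across several queries) can be illustrated with a small prompt builder. This is only a sketch: the function name `concatenate_queries` and the `Q<i>` / `A<i>` layout are assumptions, not a format taken from the paper.

```python
def concatenate_queries(instruction, examples, queries):
    """Build one prompt that shares the instruction and few-shot examples across queries.

    examples: list of (question, answer) few-shot pairs, sent once instead of per query.
    queries: list of new questions to answer in a single LLM call.
    """
    lines = [instruction, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    for i, query in enumerate(queries, start=1):
        lines.append(f"Q{i}: {query}")
    lines.append("")
    lines.append("Answer each numbered question on its own line as 'A<i>: <answer>'.")
    return "\n".join(lines)
```

The saving comes from paying for the instruction and examples once per batch rather than once per query.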

Annotators

diyiycw

URL

arxiv.org/pdf/2305.05176.pdf
Local file

Untitled document

1
1. diyiycw 20 Aug 2025
  
  in Public
  
  Pr[q(S(x)) ≥ q(L(x))]
  
  The probability of the event "the small model's answer quality >= the large model's answer quality".
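One way to make this probability concrete is an empirical estimate from sampled responses. The sketch below assumes that, for a single query x, several responses from each model have already been scored by some quality function q; the function name and the all-pairs comparison are illustrative assumptions rather than the paper's procedure.

```python
def prob_small_at_least_as_good(small_scores, large_scores):
    """Estimate Pr[q(S(x)) >= q(L(x))] for one query x.

    small_scores: quality scores of sampled small-model responses.
    large_scores: quality scores of sampled large-model responses.
    Compares every (small, large) pair and returns the fraction where
    the small model's score is at least the large model's.
    """
    pairs = [(s, l) for s in small_scores for l in large_scores]
    return sum(s >= l for s, l in pairs) / len(pairs)
```

A value near 1 suggests the query is easy enough to keep on the small model.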
Annotators

diyiycw
Local file

Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing

5
1. diyiycw 18 Aug 2025
  
  in Public
  
  Response Time Metric (RT)
  
  Average end-to-end latency per request.
2. diyiycw 17 Aug 2025
  
  in Public
  
  Response Quality Metric (RQ)
  
  Loss in response quality.
3. diyiycw 17 Aug 2025
  
  in Public
  
  Inference Cost Metric (C)
  
  Price per million tokens.
4. diyiycw 17 Aug 2025
  
  in Public
  
  The routing policy \(\pi : R \to S\) maps each inference request \(r_i \in R\) to a node-LLM pair \((n_j, m_k) \in S\)
  
  Maps each request to a valid (node, model) combination.
5. diyiycw 17 Aug 2025
  
  in Public
  
  the set of I inference requests
  
  Hundreds or thousands of different LLM inference requests are dispatched to different cloud and edge servers.
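A routing policy over (node, model) pairs and the three metrics annotated above can be sketched as a per-request scoring rule. The weighted sum used here is an assumption made for illustration (the paper may use a different multi-objective formulation), and `Option`, `route`, and the weights `w_rt`, `w_cost`, `w_quality` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    """Estimated metrics for serving one request on one (node, model) pair."""
    node: str
    model: str
    response_time: float  # estimated end-to-end latency (RT)
    cost: float           # estimated inference cost (C)
    quality_loss: float   # estimated loss in response quality (RQ), 0 = none

def route(options, w_rt=1.0, w_cost=1.0, w_quality=1.0):
    """Pick the (node, model) pair with the lowest weighted score for one request."""
    def score(o):
        return w_rt * o.response_time + w_cost * o.cost + w_quality * o.quality_loss
    best = min(options, key=score)
    return best.node, best.model
```

Raising `w_quality` pushes requests toward larger models, while raising `w_rt` or `w_cost` pushes them toward nearby or cheaper ones.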
Annotators

diyiycw
Local file

Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

28
1. diyiycw 17 Aug 2025
  
  in Public
  
  trained on synthetic data reflecting outputs from both the SLM and the cloud LLM.
  
  Contrastive learning.
2. diyiycw 17 Aug 2025
  
  in Public
  
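  scores each token
  
  Each time the SLM generates a token, the reward model scores it (how well it aligns with the LLM); when the score is low, the LLM generates everything from that point on.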
3. diyiycw 17 Aug 2025
  
  in Public
  
  evaluates its own outputs using a verifier prompt
  
  A prompt is used to have the SLM evaluate its own answer.
4. diyiycw 17 Aug 2025
  
  in Public
  
  active learning techniques
  
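  The student model actively asks for help: (1) Margin sampling: when the student is torn between its two most likely answers (their probabilities differ only slightly) (2) Prediction entropy: when the student's output is highly uncertain or diffuse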
5. diyiycw 17 Aug 2025
  
  in Public
  
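  trained continuously on responses from a more expensive teacher model (LLM)
  
  The local student model handles the query first; if its confidence is low, the query is sent to the cloud teacher model, and the student learns from the teacher's <query, response> pairs.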
6. diyiycw 17 Aug 2025
  
  in Public
  
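  determined by two methods: user feedback and automated check
  
  Relies on real-world feedback: user feedback + automated checks. (1) User feedback: the user indicates whether they are satisfied, or we observe whether they ask a follow-up (2) Automated check: whether the code runs correctly (or raises an error)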
7. diyiycw 17 Aug 2025
  
  in Public
  
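  Starting with inexpensive models, it offloads the query to more expensive models like GPT-4 only if earlier responses fail a reliability threshold.
  
  Logic: after each model generates a response, a generation scoring function evaluates its quality; if the score is below the threshold, the request is automatically passed on to the next, more expensive model.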
8. diyiycw 17 Aug 2025
  
  in Public
  
  trained on preference data
  
  Trained directly on a preference dataset.
9. diyiycw 17 Aug 2025
  
  in Public
  
  OptLLM
  
  Targets a batch of N queries.
10. diyiycw 17 Aug 2025
  
  in Public
  
  treated as a multi-armed bandit (MAB) problem
  
  Solves the problem as a multi-armed bandit, a setting with no environment.
11. diyiycw 17 Aug 2025
  
  in Public
  
  trained a BERT-style encoder using BARTScore to assess query quality
  
  Assesses the difficulty of the request and, based on that difficulty, decides whether to dispatch it to the SLM or the LLM.
12. diyiycw 17 Aug 2025
  
  in Public
  
  Routoo
  
  Emphasizes that there are thousands of LLMs, and emphasizes scalability.
13. diyiycw 17 Aug 2025
  
  in Public
  
  uses an LLM as a Performance Predictor,
  
  Uses an LLM to predict performance.
14. diyiycw 17 Aug 2025
  
  in Public
  
  performance and cost
  
  Predicts performance + cost.
15. diyiycw 17 Aug 2025
  
  in Public
  
  ZOOTER
  
  Effectively distills a router model from existing reward models.
16. diyiycw 17 Aug 2025
  
  in Public
  
  reward distillation
  
  Rewards of dimension (models × queries); the router is trained on these rewards.
17. diyiycw 17 Aug 2025
  
  in Public
  
  Tryage
  
  In essence, a predictive model predicts a score for each model, and the highest-scoring model is selected.
18. diyiycw 17 Aug 2025
  
  in Public
  
  using a Q-learning-inspired approach
  
  A Q-learning-like approach: user prompt (state) + routing decision (action) = objective function / expected performance (reward)
19. diyiycw 16 Aug 2025
  
  in Public
  
  \(\rho_k \cdot \mathrm{Load}(M_k)\)
  
  Pressure factor × load capacity.
20. diyiycw 16 Aug 2025
  
  in Public
  
  minimize cost while maintaining accuracy
  
  Minimize: accuracy loss + total cost.
21. diyiycw 16 Aug 2025
  
  in Public
  
  transformers rely exclusively on this specialized multi-head attention implementation
  
  Feature 1: relies entirely on the multi-head attention mechanism.
22. diyiycw 16 Aug 2025
  
  in Public
  
  transformers require minimal prior knowledge of problem structure
  
  Feature 2: does not depend on prior knowledge of the problem structure.
23. diyiycw 16 Aug 2025
  
  in Public
  
  This design enables efficient parallelization, making transformers particularly suitable for scaling to high-complexity models and processing large datasets.
  
  High parallelism -> well suited to complex models and large datasets.
24. diyiycw 16 Aug 2025
  
  in Public
  
  routing and HI as optimization strategies for multi-LLM systems deployed under real-world constraints
  
  Core of this paper: (1) focuses on routing and HI (2) under practical engineering constraints (3) multiple LLMs
25. diyiycw 16 Aug 2025
  
  in Public
  
  compression, energy efficiency, and infrastructure optimization
  
  Corresponds to the three preceding paragraphs.
26. diyiycw 16 Aug 2025
  
  in Public
  
  At the system level
  
  Comparable to improvements at the model/system architecture level.
27. diyiycw 16 Aug 2025
  
  in Public
  
  Energy efficiency
  
  Leads into edge-cloud collaborative LLMs.
28. diyiycw 16 Aug 2025
  
  in Public
  
  Efficient inference for LLMs
  
  Reduce the size of the LLM through pruning, quantization, and similar operations.
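The token-level handoff noted in annotation 2 of this list (the SLM generates, a reward model scores each token, and the LLM takes over once a score is low) can be sketched as a decoding loop. This is a minimal sketch under assumptions: `slm_next`, `llm_next`, and `reward` are hypothetical callables standing in for real models, the handoff is permanent once triggered, and the rejected SLM token is replaced by the LLM's token at that position.

```python
def hybrid_generate(prompt_tokens, slm_next, llm_next, reward,
                    threshold, max_new_tokens, eos):
    """Generate with the SLM until the reward drops below threshold, then use the LLM.

    slm_next / llm_next: callable(tokens) -> next token.
    reward: callable(tokens, candidate) -> float score for the SLM's candidate.
    Returns (generated_tokens, sources), where sources[i] is "slm" or "llm".
    """
    tokens = list(prompt_tokens)
    sources = []
    use_llm = False
    for _ in range(max_new_tokens):
        if not use_llm:
            candidate = slm_next(tokens)
            if reward(tokens, candidate) < threshold:
                use_llm = True  # low score: hand the rest of the sequence to the LLM
        if use_llm:
            candidate = llm_next(tokens)
        tokens.append(candidate)
        sources.append("llm" if use_llm else "slm")
        if candidate == eos:
            break
    return tokens[len(prompt_tokens):], sources
```

The `sources` list makes it easy to measure what fraction of tokens needed the cloud model.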
Annotators

diyiycw
Local file

Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts

25
1. diyiycw 15 Aug 2025
  
  in Public
  
  1d
  
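  Amortizes the total extra latency that the new request imposes on \(q_i\) over each output token, because the threshold L in the reward function is defined as a threshold on average per-token latency.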
2. diyiycw 15 Aug 2025
  
  in Public
  
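  \(\sum_{k=1}^{\min(d_i - d_{i,t_j},\, d_j)} (p_j + k)\)
  
  (1) \(d_i-d_{i,t_j}\) is the number of tokens the affected request still has to generate, and \(d_j\) is the number of tokens the new request will generate; their minimum is the stretch of tokens over which \(q_i\) is affected (2) \(p_j+k\) is the number of tokens the new request has added to the KV cache at the k-th jointly generated token, i.e. its impact on decoding \(q_i\) at that step (3) The sum is needed because every token \(q_i\) generates requires a scan of the full, current KV cache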
3. diyiycw 15 Aug 2025
  
  in Public
  
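  the decoding latency for these running requests increases due to the additional load from new requests.
  
  The LLM generates its answer token by token; when a new request joins the batch, the KV cache grows, which increases decode latency for the requests in the running queue.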
4. diyiycw 15 Aug 2025
  
  in Public
  
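  \((p_i + d_{i,t})\)
  
  Decode has to process the entire KV cache (a cache the server maintains for all requests currently being processed). This term is the number of input tokens of a request in the running queue plus the number of tokens it has generated so far, which is the size of the KV cache it produces.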
5. diyiycw 15 Aug 2025
  
  in Public
  
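  the prefill phase of the incoming requests will block the running requests
  
  The LLM reads the whole request in one pass, fully blocking the GPU; every request in the running queue is paused, incurring a delay that grows with the number of input tokens.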
6. diyiycw 15 Aug 2025
  
  in Public
  
  \(k_{2,n}\)
  
  Decode latency = slope × (number of input tokens + number of tokens generated so far by all requests in this expert's running queue)
7. diyiycw 15 Aug 2025
  
  in Public
  
  \(k_{1,n}\)
  
  Prefill latency = slope × number of input tokens
8. diyiycw 15 Aug 2025
  
  in Public
  
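  We then map the arrived request node embedding \(G^{j,(L-1)}_t\) as the input of the DRL agent.
  
  The request vector \(G^{j,(L-1)}_t\) has already absorbed and fused the state of every server in the system, much like a lead summarizing the meeting after the whole department has discussed.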
9. diyiycw 15 Aug 2025
  
  in Public
  
  the graph structure and semantic relation between requests and edge experts will be lost.
  
  Inevitably lost, unless the state is structured as a graph.
10. diyiycw 15 Aug 2025
  
  in Public
  
  we define the raw global state features \(f_t\) as follows
  
  So, global state = (1) the state of the current query (2) the state of each expert. It is in fact a vector of dimension 6 + n * (3 + 6 * (r + w)).
11. diyiycw 15 Aug 2025
  
  in Public
  
  constantly monitor running requests, waiting requests, and their GPU memory utilization.
  
  State of one expert = (1) GPU memory utilization \(e_{n,t}\) (2) running queue length and its state (3) waiting queue length and its state
12. diyiycw 15 Aug 2025
  
  in Public
  
  both the workload on each edge expert and their GPU memory utilization
  
  (1) Workload: how many tasks are queued (2) GPU memory utilization: how much of the server's resources are in use
13. diyiycw 15 Aug 2025
  
  in Public
  
  A key insight is that the running requests can continuously provide GPU memory utilization information, and the waiting requests indicate their increased latency and future contention.
  
  (1) The running queue reflects GPU memory utilization (2) The waiting queue reflects task load
14. diyiycw 15 Aug 2025
  
  in Public
  
  we use a special token < extra token n > to represent edge experts and add this special token before the request input text
  
  <> tags distinguish the different experts.
15. diyiycw 13 Aug 2025
  
  in Public
  
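  Additionally, during the LLM routing decision process, we only need to roughly estimate a general range, such as an approximate generation score interval or output length interval, rather than exact values.
  
  An odd passage, but it does make sense: for the generation score, for example, whether it is 0.8 or 0.9 has little effect on the DRL agent's decision; it only needs to know that the answer is "high quality".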
16. diyiycw 13 Aug 2025
  
  in Public
  
  it is impossible to obtain
  
  (1) The generation score \(s_j\) (2) the total number of output tokens \(d_j\); each needs its own model to predict it.
17. diyiycw 13 Aug 2025
  
  in Public
  
  The expert-related features, on the other hand, focus on the request features related to the edge expert handling the request
  
  For a request, on the other hand: (1) number of input tokens (2) predicted generation score (3) predicted total number of output tokens
18. diyiycw 13 Aug 2025
  
  in Public
  
  The operational features capture the immediate output metrics of request
  
  For a request, on the one hand: (1) current output length (2) current utilization of this GPU (3) current average latency per token
19. diyiycw 12 Aug 2025
  
  in Public
  
  QoS-aware reward based on this estimator
  
  The reward reflects QoS.
20. diyiycw 12 Aug 2025
  
  in Public
  
  action impact estimator to assess the effects of routing decisions on overall QoS
  
  Proposes an estimator to predict the impact of routing decisions on overall QoS.
21. diyiycw 12 Aug 2025
  
  in Public
  
  average latency per token
  
  Average latency per token, reflecting how smoothly the text streams.
22. diyiycw 12 Aug 2025
  
  in Public
  
  the BERTScore metric
  
  Semantic similarity between the generated text and the reference answer.
23. diyiycw 12 Aug 2025
  
  in Public
  
  interference among requests
  
  prefill <-> decode
24. diyiycw 12 Aug 2025
  
  in Public
  
  first directed to an Edge Access Point (eAP)
  
  User -> eAP -> Edge Expert
25. diyiycw 12 Aug 2025
  
  in Public
  
  several critical issues
  
  (1) Cloud-edge transmission latency (2) network fluctuation (3) user privacy
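The latency notes in this list (linear prefill and decode models, the KV-cache term \(p_i + d_{i,t}\), and the sum of \(p_j + k\) over the affected steps) can be tied together in a short sketch. This is a reading of the annotations rather than the paper's exact estimator: the function names are hypothetical, the slopes `k1` and `k2` stand for \(k_{1,n}\) and \(k_{2,n}\), and dividing by `d_i` to amortize over the affected request's output tokens is an assumption about the garbled "1d" factor.

```python
def prefill_latency(k1, p_j):
    """Prefill time grows linearly with the new request's input tokens p_j."""
    return k1 * p_j

def decode_step_latency(k2, running):
    """One decode step scans the whole KV cache.

    running: list of (p_i, d_it) pairs, i.e. input tokens and tokens generated
    so far for each request in the expert's running queue.
    """
    return k2 * sum(p_i + d_it for p_i, d_it in running)

def extra_kv_tokens(d_i, d_i_tj, p_j, d_j):
    """Sum of (p_j + k) over the steps where request i and new request j decode together."""
    shared_steps = min(d_i - d_i_tj, d_j)
    return sum(p_j + k for k in range(1, shared_steps + 1))

def extra_latency_per_token(k1, k2, d_i, d_i_tj, p_j, d_j):
    """Extra delay that new request j imposes on running request i, per output token of i."""
    total = prefill_latency(k1, p_j) + k2 * extra_kv_tokens(d_i, d_i_tj, p_j, d_j)
    return total / d_i
```

For example, with `d_i=10`, `d_i_tj=7`, `p_j=5`, and `d_j=8`, the two requests share three decode steps and the KV-cache sum is 6 + 7 + 8 = 21.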
Annotators

diyiycw
