75 Matching Annotations
  1. Aug 2025

    Annotators

    1. Query by Committee (QBC)

      (4) Committee strategy: assemble a committee of 4 models and measure how much the committee's predictions diverge from the student's; a large divergence means the SLM is unsure and should consult the LLM

    2. Margin Sampling (MS)

      (2) Margin sampling strategy: take the top-2 predictions and compute the gap between their probabilities; a small gap means high uncertainty (the SLM is unsure) and the LLM should be consulted

    3. Front-loading (FR)

      (1) Greedy strategy: route everything to the LLM until the budget is exhausted. Because the examples are i.i.d., this is equivalent to random selection and shows none of the advantages of active learning. (A minimal sketch of all three strategies follows this list.)
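
    A minimal sketch of the three strategies (my own illustration, not the paper's code), assuming `probs` is an (N, C) array of SLM class probabilities, `committee_preds` a (4, N) array of committee labels, and `student_preds` an (N,) array of student labels:

    ```python
    import numpy as np

    def margin_scores(probs: np.ndarray) -> np.ndarray:
        # Margin Sampling: P(top1) - P(top2); a SMALL margin means the SLM is unsure.
        sorted_p = np.sort(probs, axis=1)[:, ::-1]
        return sorted_p[:, 0] - sorted_p[:, 1]

    def qbc_disagreement(committee_preds: np.ndarray, student_preds: np.ndarray) -> np.ndarray:
        # QBC: fraction of the 4 committee members that disagree with the student.
        return (committee_preds != student_preds[None, :]).mean(axis=0)

    def front_loading(n_examples: int, budget: int) -> np.ndarray:
        # FR: route the first `budget` examples to the LLM; with i.i.d. data this
        # behaves like random selection.
        mask = np.zeros(n_examples, dtype=bool)
        mask[:budget] = True
        return mask
    ```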

    Annotators

    1. Datasets,

      (1) HEADLINES: news headline -> direction of gold prices (2) OVERRULING: sentence -> whether it overrules a prior precedent (3) COQA: passage + question -> answer (reading comprehension)

    2. Strategy 1: Prompt adaptation.

      Strategy 1: prompt adaptation. (1) Prompt selection: reduce cost along the prompt dimension; (2) query concatenation: along the query dimension, let multiple queries share one prompt (see the sketch after this list)
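
    A hedged sketch of strategy 1 under assumed helper names (`select_prompt_examples`, `concatenate_queries`, and the template are illustrative, not the paper's API):

    ```python
    def select_prompt_examples(pool: list[str], k: int) -> list[str]:
        # Prompt selection: shrink the prompt dimension by keeping only k
        # in-context examples (a real selector would pick the best k, not the first k).
        return pool[:k]

    def concatenate_queries(examples: list[str], queries: list[str]) -> str:
        # Query concatenation: several queries share one prompt, so the prompt
        # cost is paid once along the query dimension.
        shared_prompt = "\n".join(examples)
        numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(queries))
        return f"{shared_prompt}\n\nAnswer each question:\n{numbered}"
    ```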

    Annotators

    1. The routing policy \(\pi : R \to S\) maps each inference request \(r_i \in R\) to a node-LLM pair \((n_j, m_k) \in S\)

      Maps each request to a valid (node, model) pair
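
    A minimal typing of this policy (names like `Request` and `round_robin_policy` are mine, for illustration):

    ```python
    from dataclasses import dataclass
    from typing import Callable

    @dataclass(frozen=True)
    class Request:          # r_i in R
        request_id: int
        prompt: str

    Node, Model = str, str  # n_j, m_k
    RoutingPolicy = Callable[[Request], tuple[Node, Model]]  # pi: R -> S

    def round_robin_policy(valid_pairs: list[tuple[Node, Model]]) -> RoutingPolicy:
        # Trivial example policy: cycle through the valid (node, model) pairs.
        counter = {"i": 0}
        def pi(r: Request) -> tuple[Node, Model]:
            pair = valid_pairs[counter["i"] % len(valid_pairs)]
            counter["i"] += 1
            return pair
        return pi
    ```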

    Annotators

    1. active learning techniques

      The student model actively asks for help: (1) margin sampling: when the student wavers between its two most likely answers (their probabilities are nearly equal) (2) prediction entropy: when the student's output distribution is highly uncertain and diffuse

    2. trained continuously on responses from a more expensive teacher model (LLM)

      The local student model handles the query first; if its confidence is low, the query is escalated to the cloud teacher model, and the student model learns from the teacher's <query, response> pairs

    3. determined by two methods: user feedback and automated check

      Relies on real-world feedback: user feedback + automated checks (1) user feedback: whether the user indicates satisfaction, or whether they ask follow-up questions (2) automated checks: whether generated code runs correctly (or throws errors)

    4. Starting with inexpensive models, it offloads the query to more expensive models like GPT-4 only if earlier responses fail a reliability threshold.

      Logic: each time a model generates a response, a scoring function evaluates its quality; if the score falls below a threshold, the request is automatically forwarded to the next, more expensive model (see the cascade sketch after this list)

    5. using a Q-learning-inspired approach

      A Q-learning-like method: user prompt (state) + routing decision (action) = objective function / expected performance (reward); see the router sketch after this list

    6. This design enables efficient parallelization, making transformers particularly suitable for scaling to high-complexity models and processing large datasets.

      High parallelism -> well suited to complex models and large datasets

    7. routing and HI as optimization strategies for multi-LLM systems deployed under real-world constraints

      The core of this paper: (1) focuses on routing and HI (2) under real-world engineering constraints (3) with multiple LLMs
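
    A hedged sketch of the cascade in note 4, assuming a placeholder `score_fn`; models are tried cheapest-first and the query escalates whenever the score misses the threshold:

    ```python
    from typing import Callable

    def cascade(query: str,
                models: list[Callable[[str], str]],     # ordered cheap -> expensive (e.g., GPT-4 last)
                score_fn: Callable[[str, str], float],  # estimates response reliability
                threshold: float) -> str:
        response = ""
        for model in models:
            response = model(query)
            if score_fn(query, response) >= threshold:
                return response  # reliable enough: stop early and save cost
        return response          # otherwise return the costliest model's answer
    ```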
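
    And a sketch of the Q-learning-inspired router in note 5, simplified here to a one-step (bandit-style) update since a single routing decision has no successor state; how prompts are featurized into discrete states is assumed:

    ```python
    import random
    from collections import defaultdict

    class QRouter:
        def __init__(self, actions: list[str], alpha: float = 0.1, epsilon: float = 0.1):
            self.q = defaultdict(float)  # Q[(state, action)] -> expected performance
            self.actions, self.alpha, self.epsilon = actions, alpha, epsilon

        def route(self, state: str) -> str:
            if random.random() < self.epsilon:  # explore occasionally
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q[(state, a)])  # exploit

        def update(self, state: str, action: str, reward: float) -> None:
            # Move the estimate toward the observed reward (quality minus cost).
            self.q[(state, action)] += self.alpha * (reward - self.q[(state, action)])
    ```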

    Annotators

    1. \(1/d_i\)

      Amortizes the total extra latency the new request inflicts on \(q_i\) over each of its output tokens, because the threshold \(L\) in the reward function is defined as an average per-token latency threshold (a worked sketch follows this list)

    2. \(\sum_{k=1}^{\min(d_i - d_{i,t_j},\, d_j)} (p_j + k)\)

      (1) \(d_i - d_{i,t_j}\) is how many tokens the affected request still has to generate and \(d_j\) is how many tokens the new request will generate; their minimum is the stretch of tokens over which \(q_i\) is affected (2) \(p_j + k\) is the number of tokens the new request has added to the KV cache at each joint generation step, i.e., its per-step impact on \(q_i\)'s decoding (3) the sum arises because every token \(q_i\) generates requires scanning the full, current KV cache

    3. the decoding latency for these running requests increases due to the additional load from new requests.

      The LLM generates the answer token by token; when a new request joins the batch, the KV cache grows, so the decode latency of the requests in the running queue increases

    4. \((p_i + d_{i,t})\)

      Decoding must process the entire KV cache (a cache the server maintains for all in-flight requests); this term is the input token count of a request in the running queue plus the number of tokens it has generated so far, i.e., the size of the KV cache it occupies

    5. The prefill phase of the incoming requests will block the running requests

      The LLM reads and digests the request in one pass, fully blocking the GPU; all requests in the running queue are paused, adding latency proportional to the number of input tokens

    6. We then map the arrived request node embedding \(G^{j,(L-1)}_t\) as the input of the DRL agent.

      The request embedding \(G^{j,(L-1)}_t\) has already absorbed and fused the state information of every server in the system, like a team lead summarizing after the whole department has met and discussed

    7. we define the raw global state features \(f_t\) as follows

      So the global state = (1) the current query's state (2) each expert's state; in fact a vector of dimension \(6 + n \times (3 + 6(r + w))\) (see the dimension sketch after this list)

    8. constantly monitor running requests, waiting requests, and their GPU memory utilization.

      An expert's state = (1) GPU memory utilization \(e_{n,t}\) (2) the running queue's length and the state of its requests (3) the waiting queue's length and the state of its requests

    9. both the workload on each edge expert and their GPU memory utilization

      (1) Workload: how many tasks are queued (2) GPU memory utilization: how much of the server's resources are occupied

    10. A key insight is that the running requests can continuously provide GPU memory utilization information, and the waiting requests indicate their increased latency and future contention.

      (1) The running queue reflects GPU memory utilization (2) the waiting queue reflects workload

    11. we use a special token <extra token n> to represent edge experts and add this special token before the request input text

      Angle-bracket special tokens distinguish the different experts

    12. Additionally, during the LLM routing decision process, we only need to roughly estimate a general range, such as an approximate generation score interval or output length interval, rather than exact values.

      An odd-sounding passage, but it makes sense: for the generation score, say, whether it is 0.8 or 0.9 makes little difference to the DRL agent's decision; it only needs to know the response is of "high quality"

    13. The expert-related features, on the other hand, focus on the request features related to the edge expert handling the request

      Expert-related features of a request: (1) input token count (2) predicted generation score (3) predicted total output token count

    14. The operational features capture the immediate output metrics of requests

      Operational features of a request: (1) current output length (2) current utilization of this GPU (3) current average per-token latency

    15. action impact estimator to assess the effects of routing decisions on overall QoS

      Proposes an estimator to predict how a routing decision will affect overall QoS
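
    A worked sketch of the latency terms in notes 1-2 above, in the paper's symbols: \(p_j\) and \(d_j\) are the new request's input and output token counts, \(d_i\) is the affected request's total output tokens, and \(d_{i,t_j}\) is how many of them it had generated when \(q_j\) arrived. Treating the scanned-token count as the exact cost is my simplification:

    ```python
    def extra_decode_cost_per_token(p_j: int, d_j: int, d_i: int, d_i_tj: int) -> float:
        # Tokens of q_i generated while q_j is also in the batch.
        overlap = min(d_i - d_i_tj, d_j)
        # At joint step k, q_j has added p_j + k tokens to the KV cache, all of
        # which q_i must scan to produce its next token.
        total_extra = sum(p_j + k for k in range(1, overlap + 1))
        # Amortize over q_i's output tokens (the 1/d_i factor in note 1), since
        # the reward threshold L is an average per-token latency.
        return total_extra / d_i
    ```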
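
    And a sketch of the raw global state layout from note 7: 6 features for the current query plus, for each of the n experts, 3 expert-level features and 6 features for each of its r running and w waiting request slots (the feature counts come from the notes; the packing order is my assumption):

    ```python
    def state_dim(n_experts: int, r: int, w: int) -> int:
        query_feats = 6               # current query's state
        per_expert = 3 + 6 * (r + w)  # expert state + per-request features
        return query_feats + n_experts * per_expert

    assert state_dim(n_experts=4, r=8, w=8) == 402  # 6 + 4 * (3 + 6 * 16)
    ```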

    Annotators