89 Matching Annotations
  1. Last 7 days
    1. we may see a growing divergence between the capabilities we can measure and the capabilities we actually care about.

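      The divergence between "capabilities we can measure" and "capabilities we actually care about" is widening: this is the article's deepest insight. All current benchmarks favor tasks that are "clean, self-contained, and automatically gradable," whereas real work is "messy, cross-system, and dependent on human judgment." As AI extends to longer tasks, this gap between measurement and reality will not shrink; it will only widen faster. That means future debates over whether AI can replace a given kind of work will become increasingly hard to settle with data, because the data itself cannot capture the nature of real work.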

    2. The most famous chart in AI might be obsolete soon.

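      The subtitle is itself a startling claim: the most famous chart of AI progress is about to become obsolete, not because AI has stopped improving but precisely because it is improving too fast. This creates a strange paradox: the rate at which evaluation tools fail is positively correlated with the rate at which the thing being evaluated improves. Our understanding of AI capabilities is iterating more slowly than AI itself is advancing; "evaluation lag" will be a core challenge for AI governance and decision-making in the coming years.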

    3. If we took one task out of our task suite or added another task to our task suite, potentially instead of measuring this Claude Opus 4.6 time horizon of, I think, 14 and a half hours, we'd be measuring it at something like eight or 20 hours.

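      Adding or removing a single task moves the measurement from 8 hours to 20 hours. This means the entire METR time-horizon leaderboard is, in essence, a fragile measurement propped up by a handful of "key tasks." When an evaluation is this sensitive to a single data point, its "precise number" should not be cited as fact; it should be treated as one sample from a noisy distribution. Yet the media and the public are currently using exactly these numbers to make serious decisions.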

    4. METR's confidence interval for Claude Opus 4.6 ranges from 5 hours to 66 hours.

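      A confidence interval running from 5 hours to 66 hours: the span alone is startling. 5 hours and 66 hours differ by a factor of 13, yet both describe the same measurement of the same model. When a figure is widely quoted as "Claude Opus 4.6's time horizon is 12 hours," the truth is that its uncertainty interval spans an order of magnitude. This is the core crisis now facing AI capability evaluation as a whole: we are using extremely imprecise measurements to drive extremely important decisions.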
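
      The arithmetic behind that span can be checked directly. This is only a sketch using the two interval endpoints quoted above; the variable names are illustrative.

```python
import math

# Interval endpoints quoted above, in hours.
ci_low_hours = 5.0
ci_high_hours = 66.0

# How many times larger the upper bound is than the lower bound.
ratio = ci_high_hours / ci_low_hours

# The same spread expressed on a log10 scale (1.0 = one order of magnitude).
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.1f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```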

    1. solving 1000 separate 1-hour math problems isn't a 1000-hour task; we'd consider it a 1-hour task done 1000 times.

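      This definitional distinction reveals the core insight of the time-horizon framework: what truly measures AI autonomy is "the depth of sequential reasoning that cannot be parallelized," not "the throughput of parallel processing." 1000 independent math problems can be solved simultaneously with 1000 API calls; "iteratively debugging a complex system, where each fix depends on the result of the previous attempt," is the kind of task that genuinely tests a time horizon. The framework establishes "continuity of deep reasoning" as the core dimension for measuring autonomous AI capability.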

    2. Our human task duration estimates likely overestimate how long a human expert takes to complete these tasks, as the humans (and AI agents!) have much less context for the task than professionals doing equivalent work in their day-to-day job.

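      METR openly acknowledges that its human baseline times may be overestimated, because the humans in the experiment, like the AI, work in a low-context "novice" state rather than as professionals familiar with the project. This means the human capability corresponding to a "2-hour time horizon" is closer to that of an outsourced worker with no background knowledge than to that of an experienced full-time engineer. The real gap between AI and "professionals with context" is much larger than the time-horizon numbers suggest.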

    3. The task-completion time horizon is the task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability.

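      Surprisingly, the "time horizon" measures not how long the AI took but how long a human needs to complete the equivalent task. This design decision reveals a deeper choice in evaluation philosophy: human labor time, not the AI's actual running time, is the yardstick for task difficulty. A "2-hour time horizon" is therefore a statement about task complexity, not about AI speed. The two are often conflated, and that conflation is one of the roots of public misunderstanding of AI capabilities.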
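
      One way to make the quoted definition concrete is to assume that predicted success follows a logistic curve in the logarithm of human completion time, and then solve for the duration at which the curve crosses the chosen reliability. The sketch below makes that assumption explicit; the coefficients are hypothetical and are not fitted to any real data.

```python
import math

def predicted_success(human_minutes, intercept, slope):
    """Logistic success probability as a function of log2(human task time)."""
    z = intercept + slope * math.log2(human_minutes)
    return 1.0 / (1.0 + math.exp(-z))

def time_horizon(reliability, intercept, slope):
    """Human task duration (minutes) at which predicted success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)

# Hypothetical coefficients; the slope is negative because longer tasks are harder.
INTERCEPT, SLOPE = 4.0, -0.6

h50 = time_horizon(0.5, INTERCEPT, SLOPE)  # 50% reliability horizon
h80 = time_horizon(0.8, INTERCEPT, SLOPE)  # stricter reliability, shorter horizon

print(f"50% horizon: {h50:.1f} human-minutes")
print(f"80% horizon: {h80:.1f} human-minutes")
```

      Note that the result is expressed in human minutes: nothing in the calculation refers to how long the AI itself ran.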

    1. We convert chip computing capabilities into H100 equivalents (H100e) based on their relative FLOP/s specifications, specifically their maximum 8-bit specification.

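      Using "H100 equivalents" as a common currency for compute is a methodological choice worth pondering: it makes the NVIDIA H100 the base unit of compute, much as the US dollar serves as the global reserve currency. Yet Epoch AI itself acknowledges that this conversion is "most accurate for model training"; for inference workloads, the real efficiency of TPUs may be systematically underestimated, which would mean Google's true compute advantage may be larger than the numbers show.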
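
      The conversion described in the quote is a ratio of peak 8-bit FLOP/s figures. A minimal sketch follows; the function name is illustrative and the throughput numbers are placeholders, not real chip specifications.

```python
def h100_equivalents(chip_flops_8bit, h100_flops_8bit, chip_count=1):
    """Convert `chip_count` chips into H100 equivalents by peak 8-bit FLOP/s ratio."""
    if h100_flops_8bit <= 0:
        raise ValueError("reference throughput must be positive")
    return chip_count * chip_flops_8bit / h100_flops_8bit

# Placeholder throughputs (FLOP/s), chosen only to keep the arithmetic readable.
REFERENCE_H100_FLOPS = 2.0e15
OTHER_CHIP_FLOPS = 1.0e15

fleet_h100e = h100_equivalents(OTHER_CHIP_FLOPS, REFERENCE_H100_FLOPS, chip_count=1000)
print(fleet_h100e)  # 1000 chips at half the reference throughput
```

      Because only peak specifications enter the ratio, any difference in achieved utilization between training and inference is invisible to this number.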

  2. Jan 2026
  3. Oct 2025
  4. Sep 2025
    1. You have to multiply observations, in order to eliminate the effect of the Brownian movement of your instrument. This example is, I think, particularly illuminating in our present investigation. For our organs of sense, after all, are a kind of instrument. We can see how useless they would be if they became too sensitive.
  5. Mar 2025
  6. Feb 2025
  7. Dec 2024
    1. there still seems to be a little bit of gap in data that doesn't account for 0.2 degrees Celsius of extra warming that is present; scientists have not been able to comfortably explain, over the past, in fact, several years, why there is this little bit of extra global warming. It is a major, major gap

      for - stats - climate crisis - global mean temperature gap in models vs measurement of - 0.2 Deg C - from The Print - YouTube - Low clouds disappearing over earth, rapidly acceleration heating - 2024, Dec

  8. Oct 2024
  9. Jun 2024
  10. Jan 2024
    1. The classification scheme used here is possibly the clearest one I've had the chance to see, one I wish I had seen in use at every place I worked at. Here are the broad categories: Surrogate variable that is measured: the thing we measure because it is measurable Variable of true or greater interest: the thing we actually want to know about Measurement technique of surrogate variable: whether we have the ability to get the actual direct value, or whether it is rather inferred from other observations Artefactual influences: what are the things that can mess up the data in measuring it Certainty of "Normal Range": how sure are we that the value we read is representative of what we care about?

      What are we really measuring with metrics, and how could things go wrong?

  11. Dec 2023
    1. key measures of isolation and loneliness may miss whether they are reaping the benefits of social connection in other ways, such as feeling adequately supported or having high-quality, close relationships

      Interesting defense of all the fine distinctions in the glossary earlier - though intriguing that "resilience" and "thriving" aren't in the glossary.

  12. Oct 2023
    1. Figure 6 shows a high-impedance 10x probe where the input impedance is 9MΩ, the input impedance inside the oscilloscope is 1MΩ, and the total input impedance is 10MΩ. For 10x probes, the signal has 10x attenuation via impedance matching, and the higher attenuation ratio reduces the signal-to-noise ratio. Because the ripple is a small signal, it is better suited to a 1x probe without impedance matching because the signal is not distorted.

      to measure small signal, it is better to use 1x without impedance matching
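
      The 10x figure follows from treating the probe and the oscilloscope input as a resistive voltage divider. The sketch below assumes that simplified DC model (it ignores probe and input capacitance), and the 20 mV ripple amplitude is a hypothetical value.

```python
def divider_output(v_in, r_probe_ohms, r_scope_ohms):
    """Voltage at the oscilloscope input behind a series probe resistance."""
    return v_in * r_scope_ohms / (r_probe_ohms + r_scope_ohms)

R_PROBE_10X = 9e6  # 9 MΩ probe resistance, from the passage
R_SCOPE = 1e6      # 1 MΩ oscilloscope input resistance, from the passage

ripple_in = 0.020  # hypothetical 20 mV ripple at the test point

ripple_10x = divider_output(ripple_in, R_PROBE_10X, R_SCOPE)
ripple_1x = divider_output(ripple_in, 0.0, R_SCOPE)  # 1x probe: no series resistance
attenuation = ripple_in / ripple_10x

print(f"10x probe: {ripple_10x * 1e3:.1f} mV at the scope (attenuation {attenuation:.0f}x)")
print(f"1x probe:  {ripple_1x * 1e3:.1f} mV at the scope")
```

      With the 10x probe the ripple arrives at a tenth of its amplitude while the oscilloscope's own noise floor stays the same, which is why the signal-to-noise ratio drops.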

  13. Jul 2023
  14. Jun 2023
    1. I want to point out that it’s disappointingly rare for UX studies to assess the quality of the work produced with the tool that’s being studied. Output is, after all, the goal of much computer use, and the quality of this output is an essential element in judging the user interface.

      Agreed, much of user testing focuses on the experience of the user when using the tool and depends on users voluntarily making the quality of output part of their evaluation of the tool.

      A possible counter: UX research professionals are not subject matter experts in the work deliverables generated in using the tool and not equipped to assess the quality of those deliverables. But that doesn't address the need the author calls out. Solving this will depend on the domain in which the tool is used and connecting with SMEs that can evaluate output in that domain.

      In educational technology, we measure the quality of output by measuring the growth in students' mastery against state standards as they use the tool. This is why reporting has become the key feature for EdTech products.

  15. May 2023
  16. Apr 2023
  17. Mar 2023
  18. Feb 2023
    1. Instead of trying to resolve in general this problem of how macroscopic classical physics behavior emerges in a measurement process, one can adopt the following two principles as providing a phenomenological description of what will happen, and these allow one to make precise statistical predictions using quantum theory

      To resolve the measurement problem from quantum mechanics into the classical realm, one can use the observables principle and the Born rule.

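    2. Principle (The Born rule). Given an observable $O$ and two unit-norm states $|\psi_1\rangle$ and $|\psi_2\rangle$ that are eigenvectors of $O$ with distinct eigenvalues $\lambda_1$ and $\lambda_2$, $$O|\psi_1\rangle = \lambda_1|\psi_1\rangle, \quad O|\psi_2\rangle = \lambda_2|\psi_2\rangle,$$ the complex linear combination state $$c_1|\psi_1\rangle + c_2|\psi_2\rangle$$ will not have a well-defined value for the observable $O$. If one attempts to measure this observable, one will get either $\lambda_1$ or $\lambda_2$, with probabilities $$\frac{|c_1|^2}{|c_1|^2 + |c_2|^2} \quad \text{and} \quad \frac{|c_2|^2}{|c_1|^2 + |c_2|^2}$$ respectively.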
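
      As a small numerical illustration of the rule quoted above, the sketch below computes the two outcome probabilities from a pair of complex coefficients. The function name and the sample coefficients are illustrative only.

```python
def born_probabilities(c1, c2):
    """Probabilities of measuring lambda_1 and lambda_2 for the state c1|psi1> + c2|psi2>."""
    w1, w2 = abs(c1) ** 2, abs(c2) ** 2
    total = w1 + w2
    if total == 0:
        raise ValueError("at least one coefficient must be nonzero")
    return w1 / total, w2 / total

# Sample unnormalized coefficients: |1+1j|^2 = 2 and |2|^2 = 4.
p1, p2 = born_probabilities(1 + 1j, 2)
print(p1, p2)
```

      Dividing by the sum of the squared magnitudes means the coefficients do not need to be normalized beforehand, and the two probabilities always add up to 1.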
  19. Nov 2022
  20. Sep 2022
  21. Aug 2022
  22. Feb 2022
    1. We can’t help it: we measure what is possible, not necessarily what matters.

      Margaret Wheatley has a great quote about this, though you say it quite perfectly as well.

      "the measures define what is meaningful rather than letting the greater meaning of the work define the measures" (Wheatley, 2005, p. 162).

      Wheatley, M. (2005). The uses and abuses of measurement. In Finding our way. Leadership in uncertain times (pp. 156–162). San Francisco: Berrett-Koehler.

  23. Jan 2022
    1. Goodhart's law is an adage often stated as "When a measure becomes a target, it ceases to be a good measure".[1] It is named after British economist Charles Goodhart, who advanced the idea in a 1975 article on monetary policy in the United Kingdom:[2][3] Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

      We measure what we find important.

      Measures can, and often do, become self-fulfilling targets. (Read: Rankings and Reactivity by W. Espeland and M. Sauder https://www.stmarys-ca.edu/sites/default/files/attachments/files/rankings-and-reactivity-2007.pdf)

      When a measure becomes a target it ceases to be a good measure.

      So why measure?


      Is observation and measurement part of a larger complex process which isn't finished until the process itself is finished?


      This seems related to the measurement problem in quantum mechanics, Schrödinger's cat, the Heisenberg uncertainty principle, and the observer effect.

  24. Dec 2021
    1. https://www.archaeology.org/issues/339-1905/trenches/7567-trenches-england-folkton-drums-stonehenge-measurement

      The diameters of the Folkton Drums and the Lavant Drum seem to be based on the "long foot" (1.056 ft) discovered by Andrew Chamberlain and Mike Parker Pearson. The drums' ratios to the long foot are 1:7:8:9, respectively (the Lavant Drum last).

      What was the origin of the stone used to manufacture these? Do the designs on the drums have a potential mnemonic use for the builders which may have used them as measuring devices?

      These are held by the British Museum: https://www.britishmuseum.org/collection/object/H_1893-1228-15

      Their round shape may have made it easy to roll out measurements with them. The grooved "tops" may have allowed them to roll on wooden beams of some sort.

      What relationship, if any, does the bone pin that was found with them have to the drums?

      via Alison Fisk in "The Folkton Drums. Three cylinders carved from chalk about 5,000 years ago, during the Neolithic period. Decorated with geometric designs and stylised faces. Discovered, along with a bone pin, in a child’s round barrow (burial) in Yorkshire in 1889. #FindsFriday #Archaeology https://t.co/6IyUTN9bCt" (12/11/2021 09:11:48)

  25. Nov 2021
    1. structure of the measurement and analysis
      1. Findings, solutions, and good practices regarding online learning during the pandemic. Mixed-method approach:
      • collect data by surveying students each week
      • analyze quantitative data using ANOVA
      • compare results with the analysis of the qualitative data
      • triangulation: quantitative data, qualitative data, and literature
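
      The ANOVA step in the outline above can be sketched as a one-way ANOVA F-statistic computed by hand. This is a simplification that treats each week's responses as an independent group (repeated surveys of the same students would normally call for a repeated-measures design), and the scores are made-up illustration data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical survey scores for three weeks.
weekly_scores = [[3, 4, 5], [4, 5, 6], [6, 7, 8]]
f_stat, df_b, df_w = one_way_anova_f(weekly_scores)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```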
  26. Sep 2021
    1. Thus John Harrison, clock-maker and former carpenter of Barton-on-Humber (Lincs.), perfected a marine chronometer, and in 1730 could claim to have brought a Clock to go nearer the truth, than can be well imagin'd, considering the vast Number of seconds of Time there is in a Month, in which space of time it does not vary above one second . . I am sure I can bring it to the nicety of 2 or 3 seconds in a year.

      State of the art of timepieces in 1730.

      Could be interesting to see a graph of accuracy of timekeeping over time.
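
      As a starting point for such a graph, Harrison's two claims can be converted to a common unit. The sketch below expresses them as fractional error in parts per million, assuming a 30-day month and a 365-day year, and taking the upper figure of "2 or 3 seconds".

```python
SECONDS_PER_DAY = 86_400

def fractional_error_ppm(error_seconds, interval_days):
    """Timekeeping error as parts per million of the elapsed interval."""
    return error_seconds / (interval_days * SECONDS_PER_DAY) * 1e6

# "does not vary above one second" in a month (30 days assumed).
per_month_ppm = fractional_error_ppm(1, 30)
# "2 or 3 seconds in a year" (3 seconds over 365 days assumed).
per_year_ppm = fractional_error_ppm(3, 365)

print(f"1 s per month  ~ {per_month_ppm:.3f} ppm")
print(f"3 s per year   ~ {per_year_ppm:.3f} ppm")
```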

  27. Jun 2021
    1. #AnnoConvo: A Conversation about Annotation, Literacy, and Learning

      While watching the end of this presentation I find myself thinking:

      "We measure what we care about" is an interesting truism, but we need to remember to question why we care about particular things.

  28. Apr 2021
    1. A crucial difference between representations of relative error in these equations compared with Equations 6 and 7 for the single-facet designs is that three sources of measurement error variance are separately represented, with $\sigma^2_{pt}/n_t$ equaling specific-factor error, $\sigma^2_{po}/n_o$ equaling transient error, and $\sigma^2_{pto,e}/(n_t n_o)$ equaling random-response error. Effects for tasks, occasions, and their interaction are included in the denominator for the D-coefficient but not the G-coefficient because those effects can change the absolute magnitude of scores but not their relative differences.
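
      The equations the passage refers to are not reproduced in this excerpt. The sketch below assumes the standard forms for a persons x tasks x occasions design in generalizability theory, in which the G-coefficient divides person variance by person variance plus relative error, and the D-coefficient additionally includes the task, occasion, and task-by-occasion effects. All variance components are hypothetical.

```python
def g_and_d_coefficients(var, n_t, n_o):
    """var: variance components keyed 'p', 't', 'o', 'to', 'pt', 'po', 'pto_e'."""
    # Relative error: specific-factor + transient + random-response error.
    relative_error = var["pt"] / n_t + var["po"] / n_o + var["pto_e"] / (n_t * n_o)
    # Absolute error adds the task, occasion, and task-by-occasion main effects.
    absolute_error = (relative_error
                      + var["t"] / n_t + var["o"] / n_o + var["to"] / (n_t * n_o))
    g = var["p"] / (var["p"] + relative_error)
    d = var["p"] / (var["p"] + absolute_error)
    return g, d

# Hypothetical variance components, for illustration only.
components = {"p": 0.50, "t": 0.05, "o": 0.02, "to": 0.01,
              "pt": 0.10, "po": 0.06, "pto_e": 0.20}

g_coef, d_coef = g_and_d_coefficients(components, n_t=4, n_o=2)
print(f"G = {g_coef:.3f}, D = {d_coef:.3f}")
```

      Because the absolute error can never be smaller than the relative error, the D-coefficient is at most equal to the G-coefficient.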

  29. Feb 2021
  30. Jan 2021
    1. Because many early students of psychology were also trained in such “hard” sciences as biology and physiology, it is not surprising that these researchers turned to such physical measures in their attempts to understand mental functioning. Unlike today’s testing efforts, however, individual differences were not the focus of these studies. On the contrary, such differences were generally considered to be the result of imperfect control of experimental conditions, and every effort was made to design studies in which such differences were minimized.
      • hard science
      • origin of species
      • eugenics

      individual difference

  31. Oct 2020
    1. METHODOLOGY DEVELOPMENT IN ADULT LEARNING RESEARCH: COMBINING PHYSIOLOGICAL REACTIONS AND LEARNING EXPERIENCES IN SIMULATION-BASED LEARNING ENVIRONMENTS

      This article details the methods and results of a research experiment done to determine whether and how physiological measurement technologies can be used with educational research methods to investigate subjective learning experiences. It describes the research methods and the data collected. 8/10: a very interesting article and a well-done study, but very specific to this one topic.

    1. When Wojcicki took over, in 2014, YouTube was a third of the way to the goal, she recalled in investor John Doerr’s 2018 book Measure What Matters. “They thought it would break the internet! But it seemed to me that such a clear and measurable objective would energize people, and I cheered them on,” Wojcicki told Doerr. “The billion hours of daily watch time gave our tech people a North Star.” By October, 2016, YouTube hit its goal.

      Obviously they took the easy route. You may need to measure what matters, but reaching that goal by any means necessary, or through indefensible shortcuts, is the fallacy here. They could have had that North Star; it was the means they used to reach it that were wrong.

      This is another great example of tech ignoring basic ethics to get to a monetary goal. (Another good one is Mark Zuckerberg's "connecting people" mantra, when what it should be is "connecting people for good" or "creating positive connections".)

  32. Sep 2020
    1. Yet another disadvantage to poorly chosen metrics is that they can lean into something called Goodhart’s Law, a subset of the moral hazard of gameplay specific to measurement. A good summation of the law comes from British anthropologist Marilyn Strathern, who describes it like this: “When a measure becomes a target, it ceases to be a good measure”

      compare with Heisenberg principle

  33. Aug 2020
    1. Where these systems most notably differ is in their units of volume. A US fluid ounce (fl oz), about 29.6 millilitres (ml), is slightly larger than the imperial fluid ounce (about 28.4 ml). However, as there are 16 US fl oz to a US pint and 20 imp fl oz per imperial pint, the imperial pint is about 20% larger. The same is true of quarts, gallons, etc.; six US gallons are a little less than five imperial gallons.
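
      The comparisons in this passage can be reproduced from the two gallon definitions. The sketch below uses the US liquid gallon (231 cubic inches, 3.785411784 litres) and the imperial gallon (4.54609 litres); the rounded fluid-ounce figures in the quote are not used directly, since rounding is enough to flip the six-versus-five gallon comparison.

```python
US_GALLON_ML = 3785.411784   # US liquid gallon in millilitres
IMP_GALLON_ML = 4546.09      # imperial gallon in millilitres

# Both systems have 8 pints to the gallon, but 16 vs 20 fluid ounces to the pint.
us_fl_oz = US_GALLON_ML / (8 * 16)
imp_fl_oz = IMP_GALLON_ML / (8 * 20)
us_pint = 16 * us_fl_oz
imp_pint = 20 * imp_fl_oz

pint_ratio = imp_pint / us_pint
six_us_gallons = 6 * US_GALLON_ML
five_imp_gallons = 5 * IMP_GALLON_ML

print(f"US fl oz: {us_fl_oz:.1f} ml, imperial fl oz: {imp_fl_oz:.1f} ml")
print(f"imperial pint / US pint: {pint_ratio:.3f}")
print(f"6 US gal: {six_us_gallons:.1f} ml, 5 imp gal: {five_imp_gallons:.1f} ml")
```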
  34. Jul 2020
  35. Jun 2020
  36. May 2020
    1. Grifoni, A., Weiskopf, D., Ramirez, S. I., Mateus, J., Dan, J. M., Moderbacher, C. R., Rawlings, S. A., Sutherland, A., Premkumar, L., Jadi, R. S., Marrama, D., de Silva, A. M., Frazier, A., Carlin, A., Greenbaum, J. A., Peters, B., Krammer, F., Smith, D. M., Crotty, S., & Sette, A. (2020). Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals. Cell, S0092867420306103. https://doi.org/10.1016/j.cell.2020.05.015

  37. Apr 2020
  38. Feb 2020
    1. socially-recognized standards of measure

      See Simon Schaffer, "Ceremonies of Measurement: Rethinking the World History of Science," Annales HSS, 70, no. 2 (April-June 2015): 335-360. [PDF].

      Schaffer: "No doubt all this explains why some historians have identified the advent of European modernity with the rise of the quantitative spirit, and, simultaneously, with the capacity of these Europeans to travel, loot, accumulate, and dominate beyond the limits of their own world and, in principle, everywhere."

  39. Jan 2020
  40. Nov 2019
    1. This brings me to the crucial issue. Unlike the position that exists in the physical sciences, in economics and other disciplines that deal with essentially complex phenomena, the aspects of the events to be accounted for about which we can get quantitative data are necessarily limited and may not include the important ones. While in the physical sciences it is generally assumed, probably with good reason, that any important factor which determines the observed events will itself be directly observable and measurable, in the study of such complex phenomena as the market, which depend on the actions of many individuals, all the circumstances which will determine the outcome of a process, for reasons which I shall explain later, will hardly ever be fully known or measurable. And while in the physical sciences the investigator will be able to measure what, on the basis of a prima facie theory, he thinks important, in the social sciences often that is treated as important which happens to be accessible to measurement. This is sometimes carried to the point where it is demanded that our theories must be formulated in such terms that they refer only to measurable magnitudes.
  41. Oct 2019
    1. Zachary Stein - "On the Ethics of Planetary Scale Measurement Meta-Structures"

      "We measure what we care about and we care about what we measure."

      "You could be handed a measure and start caring about something that you didn't care about before, just because you're measuring it now."

      "... it's the locus of trust. It's part of the common wealth of the community is a trustworthy measure. The only way that a community can work. And then of course they encode power..."

      "Measurements help certain groups understand things."

      "In fact maybe we should pause and think very deeply about the whole domain of things that we shouldn't ever measure.... A meta-modern approach to measurement, respects the immeasurable and partitions off regions that are not to be measured."

      "Once you start to factor all of these things (meaning, ethics, objectivity, efficiency) that's when you realize we should not measure some stuff."

      "People who think about measuring new things should also think about the ontology of absence. ... Every time you measure something, think about what you left out."

      "Measurement simplifies complexity by showing you "this" at the expense of "that". So the meta-modern approach is like I see this but I'm also aware of that"

      "Who is measuring who? for what purpose? who is deciding what well being is?"

      "Put the measurement tools into users hands. Don't come from behind the curtain and say ta-daa this is your score. Let them configure and play with the assessments themselves. This will lead to a better assessment system because a million minds are better than a hundred (or whatever your team-size is). Secondly, it empowers as opposed to just internalize."

      "As much as possible empower the person who is being measured, let them see how they're being measured, and actually be able to augment the measure so it's like if you have a mirror, you want to be able to clean the mirror, adjust angles, magnify etc. you want to have that capability for them (measured ones). And then they feel like they are not a test subject, but participants"

    1. It must also understand why different outsourcing projects succeed or fail. The Institute for Government has previously showed that there are several conditions that make outsourcing more likely to succeed. Above all, these include:
      • the existence of a competitive market of high-quality suppliers
      • the ease of measuring the value added by the provider
      • the service not being so integral to the nature of government as to make outsourcing inappropriate.

      Outsourcing conditions

    1. So, maybe all measurement is just training? Training that is useful until we are able to operate on that granularity of perception without formalized symbols? If so, measurement may only be valid for areas where we as organisms seem to lack skill and declare a willingness to be more skillful? Maybe even in areas like gratitude, multi-perspectival ability or joint decision making. Even though all those areas might feel too tender for deliberate currency design?

      Great perspective on measurement.

  42. Sep 2019
  43. Jul 2019
  44. May 2019
  45. Apr 2019
  46. Feb 2019
    1. In such a future working relationship between human problem-solver and computer 'clerk,' the capability of the computer for executing mathematical processes would be used whenever it was needed. However, the computer has many other capabilities for manipulating and displaying information that can be of significant benefit to the human in nonmathematical processes of planning, organizing, studying, etc.
  47. Dec 2018
  48. Oct 2017
    1. The SNA approach allowed for more nuance in understanding the relationships than other analyses would have.

      Did they look at any outcomes of the different circuits? For example, did Circuit C have more successful prosecutions? Or less revictimization? Any way of measuring whether one network was better than another?

  49. Jul 2017
    1. there are two different measurements for the length of a foot in the United States: the International Foot (also commonly called the foot) and the U.S. Survey Foot. The International Foot (which we were all taught in school) is defined as 0.3048 meters, whereas the U.S. Survey Foot is defined as 0.3048006096 meters. The difference of the two equates to 2 parts per million.

      For example, in a measurement of 10,000 feet, the difference would be 0.02 feet (just less than one-quarter of an inch). In a measurement of 1 million feet, the difference is 2 feet.
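
      The figures above can be verified from the two definitions given in the quote. In this sketch the function name is illustrative, and the discrepancy is expressed in international feet.

```python
INTERNATIONAL_FOOT_M = 0.3048
US_SURVEY_FOOT_M = 0.3048006096  # value as quoted in the passage

def discrepancy_feet(n_feet):
    """Length difference, in international feet, when n_feet is read in survey feet instead."""
    return n_feet * (US_SURVEY_FOOT_M - INTERNATIONAL_FOOT_M) / INTERNATIONAL_FOOT_M

ppm = (US_SURVEY_FOOT_M / INTERNATIONAL_FOOT_M - 1) * 1e6
diff_10k = discrepancy_feet(10_000)
diff_1m = discrepancy_feet(1_000_000)

print(f"relative difference: {ppm:.3f} ppm")
print(f"10,000 ft:    {diff_10k:.4f} ft ({diff_10k * 12:.2f} in)")
print(f"1,000,000 ft: {diff_1m:.3f} ft")
```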

  50. Jan 2017
  51. May 2016
  52. Apr 2016
  53. Aug 2015
  54. afghanag.ucdavis.edu