90 Matching Annotations
  1. Last 7 days
    1. Sakana Fugu models are based on our ICLR 2026 papers (**Trinity** and **Conductor**), and we have substantially further improved the methods to increase the performance and user experience

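      The article says the models are based on ICLR 2026 papers and that the methods and user experience have been substantially improved, but it does not state the size of the improvement or give benchmark data. Without quantitative evidence, the degree of improvement from research prototype to commercial product cannot be assessed.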

    2. Two variants are available: **Sakana Fugu Mini 🐟**, optimized with latency in mind, and **Sakana Fugu Ultra 🐡**, the full orchestration system, optimized for performance for demanding tasks.

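      The article mentions two variants, Mini (latency-optimized) and Ultra (performance-optimized), but gives no specific figures for the difference between them, such as the percentage reduction in latency or the gain in throughput. Without concrete numbers, it is hard to assess how the two variants differ in real-world use.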

    1. So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.

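      Most people assume that a code-generation tool's metrics should reflect actual usage, but the author shows that even when the developer writes 100% of the code by hand, Windsurf still reports a 100% AI contribution. This suggests a fundamental flaw in its metrics system, which completely distorts the actual contribution ratio.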

    2. customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric

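      Most people assume that AI code-generation tools should measure their own contribution objectively and accurately, but the author argues that the reported figures are designed to skew heavily toward a high AI share (85%-95%), because the calculation has serious flaws, such as not counting code the user pastes in and not counting automatically added symbols. These biases cause the AI contribution to be overestimated.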
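
      A hypothetical sketch of how a ratio like PCW can drift toward 100%. The formula, the `EditEvent` type, and the source categories are assumptions made for illustration, not Windsurf's documented computation: AI-accepted bytes are divided by AI bytes plus whichever human bytes the metric chooses to count.

```python
from dataclasses import dataclass


@dataclass
class EditEvent:
    source: str   # hypothetical categories: "ai", "typed", "pasted", "auto"
    n_bytes: int


def pcw(events, counted_human_sources=("typed",)):
    """Percent of counted bytes attributed to AI (illustrative definition only)."""
    ai = sum(e.n_bytes for e in events if e.source == "ai")
    human = sum(e.n_bytes for e in events if e.source in counted_human_sources)
    total = ai + human
    return 100.0 * ai / total if total else 0.0


# Invented session: most human-origin code arrives by paste or auto-insertion.
session = [
    EditEvent("pasted", 900),
    EditEvent("auto", 50),
    EditEvent("typed", 50),
    EditEvent("ai", 450),
]

narrow = pcw(session)                              # only typed bytes count as human
broad = pcw(session, ("typed", "pasted", "auto"))  # all human-origin bytes count
print(f"narrow denominator: {narrow:.1f}%  broad denominator: {broad:.1f}%")
```

      The same session reads as 90% or roughly 31% AI-written depending only on which bytes enter the denominator, which is the kind of sensitivity the note describes.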

    1. We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.

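      The study uses four different AI capability metrics, which increases the reliability of the results. Each metric measures AI capability along a different dimension: general capability (ECI), time efficiency (METR), mathematical ability (Combined Math), and performance in specific environments (WeirdML). The multi-metric approach reduces the risk of bias from any single metric.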

  2. Apr 2026
    1. Codex just hit 3 million weekly active users, our APIs process more than 15 billion tokens per minute, and GPT‑5.4 is driving record engagement across agentic workflows.

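      These striking usage figures show large-scale adoption of AI technology in practice. In particular, the capacity to process 15 billion tokens per minute reflects enterprises' enormous demand for AI processing power, and the tipping point at which AI has moved from the experimental stage into real workflows.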

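    1. Future evaluation systems must consider success rate, cost, and latency together. This is somewhat similar to the assessment criteria for cloud computing rather than for traditional software.

      This view shows that evaluating AI skills requires new dimensions, cost in particular. It reflects a challenge unique to the AI era and hints that future skill markets may develop pricing mechanisms based on resource consumption, which differs fundamentally from the traditional software market.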
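
      A minimal sketch of reporting the three dimensions side by side. The `Run` record, the field names, and the sample numbers are hypothetical; the quoted passage names the dimensions but not how to aggregate them.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    success: bool
    cost_usd: float    # hypothetical: spend attributed to this run
    latency_s: float   # hypothetical: wall-clock seconds for this run


def summarize(runs):
    """Report success rate, cost, and latency together instead of one score."""
    successes = sum(1 for r in runs if r.success)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success_rate": successes / len(runs),
        "mean_cost_usd": total_cost / len(runs),
        # Failed runs still cost money, so divide total spend by successes.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
        "median_latency_s": median([r.latency_s for r in runs]),
    }


# Invented sample data.
runs = [
    Run(True, 0.10, 2.0),
    Run(False, 0.30, 8.0),
    Run(True, 0.20, 3.0),
    Run(True, 0.20, 5.0),
]
report = summarize(runs)
print(report)
```

      Cost per success is one way to make the cloud-style trade-off visible: a system with a higher success rate can still be worse value if its failures are expensive.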

    1. Contemplating mode provides significant capability improvements in challenging tasks, achieving 58% in Humanity's Last Exam and 38% in FrontierScience Research.

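      These specific figures show the striking effect of multi-agent parallel reasoning, a capability gain approaching human level, and suggest that collaborative AI modes, rather than simply scaling up models, may become a key route to solving complex problems.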

    1. The standard AI judges use to define "safe" are measured wrong. They punish action. They ignore inaction.

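      Surprisingly, current AI safety evaluation standards have a fundamental flaw: they penalize only wrong actions and ignore wrong inaction. This style of evaluation leads AI models to be optimized to look safe while potentially becoming genuinely dangerous through excessive caution.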

    1. an X post today receives less than 3% of the views a single tweet delivered seven years ago.

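      Surprisingly, a post on X today gets less than 3% of the views a single tweet got seven years ago. This sharp drop reflects changes in the platform's algorithm and also reveals a fundamental shift in how social media platforms distribute content, along with large changes in user behavior and platform priorities.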

    1. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics and more to make a comprehensive picture of the ecosystem.

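      Surprisingly, the research team uses several measures, including Hugging Face downloads, model derivatives, inference market share, and performance metrics, to assess the open language model ecosystem as a whole. This multi-dimensional analysis reveals the complexity and diversity of the AI ecosystem and is far more complete than a simple performance ranking.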

  3. Jun 2025
  4. Jan 2025
    1. most financial models described in the literature come from the corporate world, putting the need to maximize revenue in tension with higher education’s mission-based goals that are harder to quantify

      interesting: perhaps the overall impacts of education are hard to measure against the brass tacks of commercial endeavors...there are likely studies in the field of economics that could help weigh education's less monetary measures against commercial activities...the following paragraph suggests some avenues for exploration

  5. Nov 2024
  6. Jun 2024
  7. May 2024
  8. Oct 2023
  9. Sep 2023
    1. The progress and impact of the project will be measured and monitored through the collection of quantitative indicators. The different systems of the project partners as well as ORCID Inc. and ROR will be queried. If possible, indicators for all 10 PID use cases should be measured. These include for example the following indicators:

       ● Number of registered DataCite DOIs by scientific institutions in Germany.
       ● Number of registered DataCite DOIs that have a link to further resources via a related-IDentifier relationship.
       ● Number of ROR implementations at scientific institutions in Germany.
       ● Number of GND records that have an ORCID iD or a ROR ID.
       ● Etc.

      PID Use Cases

  10. May 2023
    1. They are efficiency and effectiveness. For memory systems, an effective system is one that gets the right answer every time no matter how long it takes you. And the efficient system is one that uses the least amount of resources like time, associations, dependent systems, etc. but it may not be that good at providing the correct answers.

      Efficiency and effectiveness measures for specific mnemonic systems may vary from person to person, so one should consider them with respect to their own practices. There may not be a single "right" or "correct" practice universally, but there could be one for everyone individually based on their own choices or preferences.

  11. Mar 2023
  12. Jan 2023
  13. Nov 2022
    1. The “Goals, Signals, Metrics” framework is the most important thing I learned from Ciera, while at Google. This is a framework for defining metrics:

       Define the actual goal you’re trying to accomplish. Opt for a descriptive definition, for example, say “here is a description of what we want to actually accomplish with our work.” Not the vaguer, “I want to measure X.”

       Define signals. How would you know if you accomplished that goal, if you had infinite knowledge of everything? These are called “signals.” For example, a signal is “how happy are people with my product?” You cannot directly know how happy people are with your product, unless you’re magically all-knowing. However, if having happy customers is one of your goals, then this is the right signal for that.

       Figure out your metrics. These are proxies for that signal. Metrics are always proxies, there are no perfect metrics.

      Goals, Signals, Metrics framework for defining metrics

    1. my takeaways

      • leading indicators (Work Item Age) are relevant for non-finished work, while lagging indicators (Cycle Time, Throughput) are relevant for finished Items

      • kanban metrics are of use in the Scrum Events

      • in Sprint Planning the key metric is Throughput, complemented with Work Item Aging when planning around work left over from previous sprints.
      • in the Daily Scrum Devs concern themselves with the WiP and Work Item Aging.
      • Sprint Review revolves around Throughput, complemented by WIP and Cycle Time.
      • in Sprint Retrospective we focus on Cycle Time, Throughput & WIP, while taking a look also at Work Item Aging.

      Work in Progress

      = the number of work items started but not finished - start and finish are defined by the Scrum Team's Definition of Workflow - an explicit policy that serves as a constraint to help shape the flow of work - historically visualized through the Cumulative Flow Diagram

      Cycle Time

      = time elapsed between when a work item starts and when it finishes - start is when the work item is pulled into the workflow - CT is a lagging indicator visualised in a Cycle Time Scatterplot from which we can read trends, distributions, and look at the anomalies - enables us to arrive at the Service Level Expectation (= the amount of time within which we expect a work item to be finished)

      Throughput

      = number of work items finished per unit of time - exact count of items, regardless of their size - measured usually at the finish line of the workflow - visualized either at a separate run chart, or as the angle of curves on a Cumulative Flow Diagram - can be read out of the Cycle Time Scatterplot as well → !is not velocity!

      Work Item Age

      = time elapsed between moment when the work item has been pulled into the workflow (=start) and the current time - complemented with Cycle Time it can show us which items are doing well and which are late - Work Item Age is the best metric to look at if you want to determine when an item that has already started is going to finish
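
      The four definitions above can be sketched as small functions. The `WorkItem` type, the dates, and the choice to count elapsed time as a plain date difference are assumptions for illustration (some teams add one day so that an item started and finished on the same day has a Cycle Time of 1).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkItem:
    name: str
    started: date                    # pulled into the workflow
    finished: Optional[date] = None  # None while still in progress


def wip(items, today):
    """Work in Progress: items started but not yet finished as of `today`."""
    return sum(
        1 for i in items
        if i.started <= today and (i.finished is None or i.finished > today)
    )


def cycle_time_days(item):
    """Cycle Time: elapsed days from start to finish (finished items only)."""
    return (item.finished - item.started).days


def throughput(items, period_start, period_end):
    """Throughput: count of items finished in the period, regardless of size."""
    return sum(
        1 for i in items
        if i.finished is not None and period_start <= i.finished <= period_end
    )


def work_item_age_days(item, today):
    """Work Item Age: elapsed days from start until now (unfinished items only)."""
    return (today - item.started).days


# Invented board state.
items = [
    WorkItem("A", date(2024, 1, 1), date(2024, 1, 4)),
    WorkItem("B", date(2024, 1, 2), date(2024, 1, 9)),
    WorkItem("C", date(2024, 1, 3)),
    WorkItem("D", date(2024, 1, 8)),
]
today = date(2024, 1, 10)

current_wip = wip(items, today)
cycle_times = [cycle_time_days(i) for i in items if i.finished]
week1_throughput = throughput(items, date(2024, 1, 1), date(2024, 1, 7))
ages = {i.name: work_item_age_days(i, today) for i in items if i.finished is None}
print(current_wip, cycle_times, week1_throughput, ages)
```

      Cycle Time and Throughput only exist for finished items (lagging), while Work Item Age is computed for items still on the board (leading), which matches the first takeaway.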

  14. Aug 2022
  15. Jul 2022
  16. Apr 2022
    1. K-Anonymity, L-Diversity, and T-Closeness

       In this section, I will introduce three techniques that can be used to reduce the probability that certain attacks can be performed. The simplest of these methods is k-anonymity, followed by l-diversity, and then followed by t-closeness. Other methods have been proposed to form a sort of alphabet soup, but these are the three most commonly utilized. With each of these, the analysis that must be performed on the dataset becomes increasingly complex and undeniably has implications on the statistical validity of the dataset.

      privacy metrics
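
      A minimal sketch of checking the two simplest of these properties on a toy table. The records, column names, and helper names are invented for illustration, and the l-diversity check uses the "distinct values" variant; t-closeness, which compares distributions, is omitted.

```python
from collections import Counter, defaultdict


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any quasi-identifier group."""
    values = defaultdict(set)
    for r in records:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in values.values()) if values else 0


# Invented, already-generalized records.
records = [
    {"zip": "476**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "476**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "479**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "479**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "479**", "age": "30-39", "diagnosis": "cancer"},
]

k = k_anonymity(records, ["zip", "age"])
l = distinct_l_diversity(records, ["zip", "age"], "diagnosis")
print(f"k = {k}, l = {l}")
```

      A table can satisfy a given k and still leak the sensitive value if everyone in a group shares it, which is why the l-diversity check looks inside each group.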

  17. Mar 2022
  18. Nov 2021
    1. [00:12:43–00:15:38] Kevin Anderson, if you can talk more about this issue that both you and George, Asad Rehman, and so many other climate activists are talking about, this issue of wealth. You say per capita is a flawed metric, as most polluting industries have been moved to developing nations, so it's not reflective of the rich nations' emissions. Take all of this on.

       Yeah, I mean, that's a really key issue. If I focus here on the UK, a place I obviously know much better: what we've done in the UK is close down a lot of our industry, and then we import the manufactured goods from elsewhere in the world, and then we turn around to those parts of the world and blame them for the emissions from manufacturing the goods that we are enjoying. That's everything from our electronic goods to parts for our cars to our clothes. So the UK has effectively moved to a bar and banking culture and offshored virtually everything else. So when we are looking at our total amount of emissions, we have to take account of the carbon footprint of our lifestyles, and that does include the emissions associated with the things that we import and export. If you take that into account, you tend to find that most wealthy countries have a much larger carbon footprint than when you just look at the energy they use within their boundaries. And I think it's really key, again, when we think about these issues of equity, that we take what's often referred to as a consumption-based accounting method into account, because it is unfair to be penalizing poor parts of the world for making things that help us have a better quality of life over here. When we do that, the challenges get even more striking in terms of what we have to do, and it also brings out even further the issues of equity, the disparity between the richer parts of the world and the poorer parts of the world.

       But I also think, on the equity point, it's really worth bringing out that it's not as if everyone in the UK is even. There isn't just one public in the UK; there are multiple publics. There are those of us who are the wealthy ones in our own country, who are responsible for the lion's share of emissions within the UK. That will be true too for the US, for Germany, for Japan, for Australia. So within all of our countries there are large swathes of the country who are the average and below-average consumers, and for them the response to climate change is very different from that of those of us who, in our own countries, are responsible for the lion's share of emissions. So I think we have to differentiate not just between countries but even within our countries. My concern there is: who are the people that frame the climate debate? They're the climate scientists and the academics, they're the entrepreneurs, the business leaders, the journalists, the barristers. They're all people in the very high-emitting category. So we frame the debate, and we never, ever frame it with equity at its core. Regardless of our moral position, the maths tell us that if we are to deliver on the commitments, then equity has to be a key part of our responses. But we never talk about that, because we are in that high-emitting group.

      Kevin points out why a CONSUMPTION-BASED METRIC is more accurate than the PER CAPITA metric: the PER CAPITA metric does not include the embodied carbon emissions of the manufactured goods that consumers purchase. The per capita metric implies that the manufacturer, not the consumer, is responsible, which is an inaccurate moral indication.

      We have also noticed that wealthy and poor people exist in ALL countries of the world, so we employ a more nuanced terminology based on a Country-Wealth Sector classification matrix, as described here:

      https://medium.com/@gien_SRG/more-nuanced-terminology-for-post-colonialist-inequality-af2f1609635c

      Using this new terminology, Monbiot and Anderson are referring to the North-North and South-North classes, the elites of the world, as having the highest personal carbon footprints, whilst the North-South and South-South classes are the victims.
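
      The accounting identity behind the consumption-based method can be sketched in a few lines: emissions produced inside the border, plus emissions embodied in imports, minus emissions embodied in exports. The function name and all figures are invented for illustration and are not real data for the UK or any other country.

```python
def consumption_based_emissions(territorial, embodied_in_imports, embodied_in_exports):
    """Consumption-based total = territorial + imported - exported (same units)."""
    return territorial + embodied_in_imports - embodied_in_exports


# Invented figures, in MtCO2 per year, for a hypothetical wealthy importer.
territorial = 400.0
imports = 250.0
exports = 100.0

footprint = consumption_based_emissions(territorial, imports, exports)
uplift_pct = 100.0 * (footprint - territorial) / territorial
print(f"territorial: {territorial} Mt, consumption-based: {footprint} Mt (+{uplift_pct:.1f}%)")
```

      For a net importer of manufactured goods the consumption-based figure is larger than the territorial one, which is the direction of the effect Anderson describes for wealthy countries.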

  19. Oct 2021
    1. Note also: this incentive is in fact far more hard-headed than any metric of hedonic economism—such as GDP, which is measuring the amount of desire satisfied by the productive sector. At best GDP is a revenue metric. A prudent manager will manage an enterprise to maximize capital and profit, not revenue.

      Also agreed; measures how much, not how well

  20. Aug 2021
  21. Mar 2021
    1. The urgent argument for turning any company into a software company is the growing availability of data, both inside and outside the enterprise. Specifically, the implications of so-called “big data”—the aggregation and analysis of massive data sets, especially mobile

      Every company is described by a set of data, financial and other operational metrics, next to message exchange and paper documents. What else we find that contributes to the simulacrum of an economic narrative will undeniably be constrained by the constitutive forces of its source data.

  22. Feb 2021
  23. Oct 2020
  24. Sep 2020
  25. Jun 2020
  26. May 2020
  27. Apr 2020
  28. Nov 2019
  29. Sep 2019
  30. May 2019
    1. Amanda Matos - Metrics & DevOps - Why should you measure in order to succeed?

      Operations and monitoring have a dedicated topic among the LPI requirements for the DevOps certification. Amanda will explain how to implement metrics with open source tools. Keep an eye out!

      705.1 IT Operations and Monitoring (weight: 4)

  31. Dec 2018
    1. For many years, academia has relied on citation count as the main way in which we measure impact or importance of research. As a result, citation count is one of the primary metrics used when evaluating researchers. Citation counts also form the basis for other metrics, most notably Clarivate’s Impact Factor as well as the h-index, which respectively evaluate journal quality/prestige and researcher renown.

      The metrics the Academy uses to measure "impact" are regressive.
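
      For reference, the h-index mentioned in the quote is the largest h such that a researcher has h papers with at least h citations each. A minimal sketch follows; the citation lists are invented.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts for two hypothetical researchers.
one_hit = [100, 1, 1]          # 102 citations in total
steady = [6, 6, 6, 6, 6, 6]    # 36 citations in total

print(h_index(one_hit), h_index(steady))
```

      The two measures can rank the same researchers in opposite orders: raw citation count favors the single highly cited paper, while the h-index favors the steadier record.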

  32. Nov 2018
    1. We used the following keywords: ‘ontology metrics’, ‘ontology evaluation framework’, and ‘ontology evaluation’. From the results, we selected only papers including primarily structural metrics.

      A similar study on metrics and evaluation of classifications, taxonomies, thesauri and other knowledge organization systems would be interesting!

  33. Nov 2017
    1. We calibrate the model for 6 countries at various stages of economic development: 3 low-income countries (Uganda in 2005, Kenya in 2006, and Mozambique in 2006), and 3 emerging market economies (Malaysia in 2007, Philippines in 2008 and Egypt in 2007).

      Data & Calibration

    2. In the model, agents are heterogeneous, distinguished from each other by wealth and talent. Individuals choose in each period whether to become an entrepreneur or to supply labor for a wage. Workers supply labor to entrepreneurs and are paid the equilibrium wage. Entrepreneurs have access to a technology that uses capital and labor for production. In equilibrium, only talented individuals with a certain level of wealth choose to become entrepreneurs.

      In this model, a heterogeneous population is 'created' and differentiated by talent and wealth. Only people with enough of both can become entrepreneurs; otherwise they stay wage earners.

    1. The Solow–Swan model augmented with human capital predicts that the income levels of poor countries will tend to catch up with or converge towards the income levels of rich countries if the poor countries have similar savings rates for both physical capital and human capital as a share of output, a process known as conditional convergence.

      Income convergence between poor and rich countries will happen conditional on their having similar savings rates for physical and human capital. Otherwise, it might not happen.

    1. Maybe not an ELI5, but: Moments are expectations of things. $E(X)$ is often called the "first moment," $E(X^2)$ is the "second moment," etc. They can also be more complicated, like $E(\exp(5x+y))$, or whatever. In econometrics, you're trying to figure out something about the underlying distribution of your y's and your x's (and the errors). Often you don't know the shape of the distribution, but you know some moments of the distribution. This is useful because you can't use maximum likelihood estimation unless you make assumptions on the entire distribution. With ordinary least squares, you assume that $E(ex) = 0$, that is, the errors are uncorrelated with the regressors. You can write $e = y - x\beta$, to get a moment condition $E(x(y - x\beta)) = 0$. If you do GMM with this moment condition, you get the regular OLS estimator. If you have endogeneity of some kind, you don't know that $E(ex) = 0$, but you might have some instruments $Z$, such that $E(ez) = 0$. This gives you the moment condition $E(z(y - x\beta)) = 0$. GMM is nice because it makes relatively weak assumptions compared to other ways of estimating parameters. I hope that helps!

      GMM - I don't get it, at all. What are moments? How are they used? Why are they used? Thanks all!
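
      A minimal numeric sketch of the two moment conditions in the answer, for one regressor with no intercept. In this just-identified case, setting the sample moment to zero has a closed-form solution, so no GMM weighting matrix is needed. The data are constructed by hand (true beta = 2) so that the error is correlated with x but exactly uncorrelated with the instrument z in the sample; the function names are hypothetical.

```python
def sample_moment(beta, w, x, y):
    """Sample analogue of E(w * (y - x*beta)); w is x for OLS, z for IV."""
    n = len(y)
    return sum(wi * (yi - xi * beta) for wi, xi, yi in zip(w, x, y)) / n


def solve_moment(w, x, y):
    """beta that sets the sample moment to zero: sum(w*y) / sum(w*x)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * xi for wi, xi in zip(w, x))


# Hand-built data: x = z + e is endogenous, and y = 2*x + e.
z = [-2, -1, 0, 1, 2]
e = [1, -1, 0, -1, 1]                        # sum(z*e) == 0, sum(x*e) != 0
x = [zi + ei for zi, ei in zip(z, e)]
y = [2 * xi + ei for xi, ei in zip(x, e)]

beta_ols = solve_moment(x, x, y)   # uses the condition E(x*(y - x*beta)) = 0
beta_iv = solve_moment(z, x, y)    # uses the condition E(z*(y - x*beta)) = 0
print(f"OLS: {beta_ols:.4f}  IV: {beta_iv:.4f}")
```

      Each estimator zeroes its own sample moment, but only the instrument-based condition holds for the true errors here, so only the IV estimate recovers 2.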

  34. Sep 2017
  35. Jul 2017
  36. Feb 2017
    1. (Among the recommendations: Greater emphasis on visuals, greater variety of formats and voices. They also announced that the Times would be introducing an alternative metric to pageviews that would “measure an article’s value to attracting and retaining subscribers.”)

      How would they measure this exactly?

  37. Jan 2016
  38. Jul 2015
    1. I think it is possibly too early to tell.

      As Steven Hill of HEFCE recently suggested, it might be far better for UK institutions to reject these tables: "What if all UK institutions made a stand against global rankings, and stopped using them for promotional purposes? The reputation of the UK’s higher education sector would stand firm, and a really strong signal would be sent to the rest of the world. Not drifting, but steering purposely through the metric tide." http://blog.hefce.ac.uk/2015/07/08/the-metrics-dilemma/

  39. Oct 2014
  40. Sep 2013