Smaller awards may be granted for partial wins at our discretion
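Most people assume a security test either succeeds or fails, with no notion of 'partial success', but OpenAI explicitly says it will reward 'partial wins'. This breaks with traditional binary thinking and suggests they value incremental safety improvements rather than only pursuing perfect solutions.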
Sakana Fugu models are based on our ICLR 2026 papers (**Trinity** and **Conductor**), and we have substantially further improved the methods to increase the performance and user experience
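The article says the models are based on ICLR 2026 papers and that the methods and user experience have been substantially improved, but it gives no figures for the size of the improvement and no benchmark data. Without quantitative evidence, there is no way to assess how far the system has moved from research prototype to commercial product.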
Two variants are available: **Sakana Fugu Mini 🐟**, optimized with latency in mind, and **Sakana Fugu Ultra 🐡**, the full orchestration system, optimized for performance for demanding tasks.
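The article describes two variants, Mini (optimized for latency) and Ultra (optimized for performance), but gives no concrete performance differences, such as a percentage reduction in latency or a gain in throughput. Without quantified parameters, it is hard to judge how the two variants differ in real-world use.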
So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.
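Most people assume a code generation tool's metrics reflect actual usage, but the author shows that even when the developer writes 100% of the code by hand, Windsurf still reports a 100% AI contribution. This points to a fundamental flaw in its metrics system, which completely distorts the real share of contribution.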
customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric
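Most people assume AI code generation tools measure their own contribution objectively and accurately, but the author argues that the reported figures are designed to skew heavily toward a high AI share (85% to 95%). The calculation has serious flaws, such as not counting code the user pastes in and not counting automatically added symbols, and these biases lead to the AI contribution being overstated.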
We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.
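The study uses four different AI capability metrics, which makes the results more reliable. Each metric measures AI capability along a different dimension: general capability (ECI), the length of tasks a model can complete (METR), mathematical ability (Combined Math), and performance in a specific setting (WeirdML). A multi-metric approach reduces the risk of bias from any single metric.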
Codex just hit 3 million weekly active users, our APIs process more than 15 billion tokens per minute, and GPT‑5.4 is driving record engagement across agentic workflows.
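These striking usage figures show large-scale adoption of AI in real applications. In particular, processing 15 billion tokens per minute reflects enormous enterprise demand for AI processing capacity and suggests AI has reached the tipping point from experimentation into real workflows.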
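Future evaluation systems must consider success rate, cost, and latency together. This resembles how cloud computing is assessed rather than how traditional software is.
This view shows that evaluating AI skills requires new dimensions, cost in particular. It reflects a challenge specific to the AI era and hints that future skill markets may develop pricing based on resource consumption, which differs fundamentally from the traditional software market.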
Contemplating mode provides significant capability improvements in challenging tasks, achieving 58% in Humanity's Last Exam and 38% in FrontierScience Research.
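These specific numbers show the striking effect of multi-agent parallel reasoning, a capability gain approaching human level, and suggest that collaboration among AI agents, rather than simply scaling up models, may become a key path to solving complex problems.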
The standard AI judges use to define "safe" are measured wrong. They punish action. They ignore inaction.
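Surprisingly, current AI safety evaluation standards have a fundamental flaw: they penalize only wrong actions and ignore wrong inaction. As a result, AI models are optimized to look safe while they may in fact become genuinely dangerous through excessive caution.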
an X post today receives less than 3% of the views a single tweet delivered seven years ago.
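Surprisingly, a post on X today gets less than 3% of the views a single tweet got seven years ago. The steep drop reflects changes to the platform's algorithm and also reveals a fundamental shift in how social media platforms distribute content, along with large changes in user behavior and platform priorities.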
We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics and more to make a comprehensive picture of the ecosystem.
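Surprisingly, the research team uses several measures, including Hugging Face downloads, model derivatives, inference market share, and performance metrics, to assess the open language model ecosystem as a whole. This multi-dimensional analysis reveals the complexity and diversity of the AI ecosystem and is far more comprehensive than a simple performance ranking.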
Really interesting paper pointing out how the standard metrics used in point cloud, and lidar, analysis are not appropriate for all contexts, especially for machine learning.
most financial models described in the literature come from the corporate world, putting the need to maximize revenue in tension with higher education’s mission-based goals that are harder to quantify
interesting: perhaps the overall impacts of education are hard to measure against the brass tacks of commercial endeavors...there are likely studies in the field of economics that could help weigh education's less monetary measures against commercial activities...the following paragraph suggests some avenues for exploration
Implementing relevant data metrics and making these available to users.
TRSP Desirable Characteristics
Number of Books Read has nothing to do with substance learned; enlightenment gained. It is a vanity metric.
PID Start
PID Start Date
It was created in 2001
PID Started
4,463,574
PID Count
2,400,000
PID Count
ORCID was first announced in 2009
Persistence Maximum
The developer and administrator of the DOI system is the International DOI Foundation (IDF), which introduced it in 2000
Persistence Maximum
2,340,117
Number of ArXiv PIDs
To use open infrastructure consistent with your organization’s values.
The progress and impact of the project will be measured and monitored through the collection of quantitative indicators. The different systems of the project partners as well as ORCID Inc. and ROR will be queried. If possible, indicators for all 10 PID use cases should be measured. These include, for example, the following indicators:
● Number of registered DataCite DOIs by scientific institutions in Germany.
● Number of registered DataCite DOIs that have a link to further resources via a related-IDentifier relationship.
● Number of ROR implementations at scientific institutions in Germany.
● Number of GND records that have an ORCID iD or a ROR ID.
● Etc.
PID Use Cases
They are efficiency and effectiveness. For memory systems, an effective system is one that gets the right answer every time no matter how long it takes you. And the efficient system is one that uses the least amount of resources like time, associations, dependent systems, etc. but it may not be that good at providing the correct answers.
Efficiency and effectiveness measures for specific mnemonic systems may vary from person to person, so one should consider them with respect to their own practices. There may not be a single "right" or "correct" practice universally, but there could be one for everyone individually based on their own choices or preferences.
there's one company that i advise called inside tracker
!- Physiological metrics : Inside Tracker - online body metrics - online health metrics
The “Goals, Signals, Metrics” framework is the most important thing I learned from Ciera, while at Google. This is a framework for defining metrics:
Define the actual goal you’re trying to accomplish. Opt for a descriptive definition, for example, say “here is a description of what we want to actually accomplish with our work.” Not the vaguer, “I want to measure X.”
Define signals. How would you know if you accomplished that goal, if you had infinite knowledge of everything? These are called “signals.” For example, a signal is “how happy are people with my product?” You cannot directly know how happy people are with your product, unless you’re magically all-knowing. However, if having happy customers is one of your goals, then this is the right signal for that.
Figure out your metrics. These are proxies for that signal. Metrics are always proxies, there are no perfect metrics.
Goals, Signals, Metrics framework for defining metrics
leading indicators (Work Item Age) are relevant for non-finished work, while lagging indicators (Cycle Time, Throughput) are relevant for finished Items
kanban metrics are of use in the Scrum Events
WIP (Work in Progress) = the number of work items started but not finished - start and finish are defined by the Scrum Team's Definition of Workflow - an explicit policy serves as a constraint that helps shape the flow of work - historically visualized through the Cumulative Flow Diagram
Cycle Time (CT) = time elapsed between when a work item starts and when it finishes - start is when the work item is pulled into the workflow - CT is a lagging indicator visualized in a Cycle Time Scatterplot, from which we can read trends and distributions and spot anomalies - enables us to arrive at the Service Level Expectation (= the amount of time within which we expect a work item to be finished)
Throughput = number of work items finished per unit of time - exact count of items, regardless of their size - usually measured at the finish line of the workflow - visualized either on a separate run chart or as the angle of the curves on a Cumulative Flow Diagram - can be read out of the Cycle Time Scatterplot as well → it is not velocity!
Work Item Age = time elapsed between the moment the work item was pulled into the workflow (= start) and the current time - complemented with Cycle Time, it can show us which items are doing well and which are late - Work Item Age is the best metric to look at if you want to determine when an item that has already started is going to finish
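The four flow metrics described in these notes can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `WorkItem` class and the function names are hypothetical, time is counted in whole calendar days, and "started" and "finished" stand in for whatever points a team's Definition of Workflow actually specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkItem:
    # Hypothetical record; a real board would export richer data.
    name: str
    started: Optional[date] = None   # day the item was pulled into the workflow
    finished: Optional[date] = None  # day the item crossed the finish line


def wip(items):
    """Count items that have started but not finished."""
    return sum(1 for i in items if i.started is not None and i.finished is None)


def cycle_time(item):
    """Elapsed days from start to finish; None for unfinished items."""
    if item.started is None or item.finished is None:
        return None
    return (item.finished - item.started).days


def throughput(items, period_start, period_end):
    """Count items finished within [period_start, period_end], ignoring size."""
    return sum(
        1 for i in items
        if i.finished is not None and period_start <= i.finished <= period_end
    )


def work_item_age(item, today):
    """Elapsed days since start for an item still in progress; None otherwise."""
    if item.started is None or item.finished is not None:
        return None
    return (today - item.started).days
```

Note how the split between leading and lagging indicators shows up in the code: `cycle_time` and `throughput` only return something for finished items, while `wip` and `work_item_age` only look at items still in progress.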
The thought process behind building a high-throughput metrics system
Defects found in peer review are not an acceptable rubric by which to evaluate team members. Reports pulled from peer code reviews should never be used in performance reports. If personal metrics become a basis for compensation or promotion, developers will become hostile toward the process and naturally focus on improving personal metrics rather than writing better overall code.
K-Anonymity, L-Diversity, and T-Closeness
In this section, I will introduce three techniques that can be used to reduce the probability that certain attacks can be performed. The simplest of these methods is k-anonymity, followed by l-diversity, and then followed by t-closeness. Other methods have been proposed to form a sort of alphabet soup, but these are the three most commonly utilized. With each of these, the analysis that must be performed on the dataset becomes increasingly complex and undeniably has implications on the statistical validity of the dataset.
privacy metrics
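To make the first two of these privacy metrics concrete, here is a minimal sketch that measures k-anonymity and the simplest ("distinct") form of l-diversity on a list of records. The function names and the dictionary-per-row layout are assumptions made for illustration; the quoted article does not prescribe an implementation, and t-closeness is left out because it requires comparing value distributions.

```python
from collections import Counter, defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values()) if groups else 0


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any such group."""
    values = defaultdict(set)
    for r in rows:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in values.values()) if values else 0


# Made-up records, already generalized into age bands and truncated ZIP codes.
rows = [
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "130**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["age", "zip"]))
print(l_diversity(rows, ["age", "zip"], "diagnosis"))
```

In this toy table every quasi-identifier group has two rows, yet one group carries a single diagnosis, so anyone known to be in that group has their diagnosis revealed. That gap is what l-diversity is meant to expose.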
The general trend is that students who report improved socioemotional outcomes also show suggestions of increased activity in collaborative tools relative to their peers.
Another positive outcome from students taking courses with collaborative assignments.
In this sample, 1,868 students enrolled in at least one undergraduate class with, and at least one undergraduate class without, some form of collaborative activity (peer review, Piazza, CourseNetworking, etc.), not including discussions.
Interesting: Discussions are excluded from collaborative activities.
Similarly, social annotation tools such as Hypothesis provide opportunities for students and instructors to interactively engage in a shared resource of interest wherein problems, challenges, and insights can be discussed.
Mention of Hypothesis social annotation as a site for teacher/learner engagement.
We estimate that students performed 1.16% points (95% HDI [0.65–1.66]) better in their undergraduate courses with collaborative activities, compared with the same students’ performance in undergraduate courses without collaborative activities.
Interesting positive student success finding for courses with collaborative assignments.
Teaching with social reading doesn’t mean peering over students’ shoulders. Rather, teaching with social reading should help students share and connect.
So often learning technologies emphasize metrics of engagement that easily become metrics of surveillance. Maybe instead of focusing on "time on page" they might flower out to "connections made" to other people, texts, writing, and ideas.
“It’s really hard to help Emmett understand anything qualitative,” one former employee says. “It has to be quantitative.”
Does running a community require qualitative decision making?
Employing an evidence-based methodology, CitiIQ has created a comprehensive, objective measurement of a city.
00:12:43 Kevin Anderson, if you can talk more about this issue, which both you and George, Assad Raymond and so many other climate activists are talking about, this issue of wealth. You say per capita is a flawed metric, as most polluting industries have been moved to developing nations, so it's not reflective of the rich nations' emissions. Take all of this on.
00:13:09 Yeah, I mean, that's a really key issue. If I focus in here on the UK, a place I obviously know much better: what we've done in the UK is close down a lot of our industry, and then we import the manufactured goods from elsewhere in the world, and then we turn around to those parts of the world and blame them for the emissions in manufacturing the goods that we are enjoying. That's everything from our electronic goods to parts for our cars to our clothes. So the UK has effectively moved to a bar and banking culture and offshored virtually everything else. So when we're looking at our total amount of emissions, we have to take account of the carbon footprint of our lifestyles, and that does include the emissions associated with the things that we import and export. When you take that into account, you tend to find that most wealthy countries have a much larger carbon footprint than when you just look at the energy they use within their boundaries. And I think it's really key, again, when we think about these issues of equity, that we take what's often referred to as a consumption-based accounting method into account, because it is unfair to be penalizing poor parts of the world for making things to help us have a better quality of life over here. When we do that, the challenges get even more striking in terms of what we have to do, and it also brings out even further the issues of equity, the disparity between the richer parts of the world and the poorer parts of the world.
But I also think, on the equity point, it's really worth bringing out that it's not as if everyone in the UK is even. There isn't just one public in the UK; there are multiple publics. There are those of us who are the wealthy ones in our own country, who are responsible for the lion's share of emissions within the UK. That will be true for the U.S., for Germany, for Japan, Australia, and so within all of our countries there are large swathes of the country who are the average and below-average consumers, and for them the response to climate change is very different from those of us who, in our own countries, are responsible for the lion's share of emissions. So I think we have to differentiate not just between countries but even within our countries. My concern there is: who are the people that frame the climate debate? They're the climate scientists and the academics, they're the entrepreneurs, the business leaders, the journalists, the barristers. They're all people in the very high emitting category. So we frame the debate, and we never, ever frame it with equity at its core. And regardless of our moral position, the maths tells us that if we are to deliver on the commitments, then equity has to be a key part of our responses. But we never talk about that, because we are in that high emitting group.
Kevin points out why a CONSUMPTION-BASED METRIC is more accurate than a PER CAPITA metric: the PER CAPITA metric does not include the embodied carbon emissions of the manufactured goods that consumers purchase. It treats the manufacturer, not the consumer, as responsible, which is an inaccurate moral indication.
We have also noticed that wealthy and poor people exist in ALL countries of the world, and we employ more nuanced terminology based on a Country-Wealth Sector classification matrix, as described here:
https://medium.com/@gien_SRG/more-nuanced-terminology-for-post-colonialist-inequality-af2f1609635c
Using this new terminology, Monbiot and Anderson are referring to the North-North and South-North classes, all the elites of the world, as having the highest personal carbon footprints, whilst the North-South and South-South classes are the victims.
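The consumption-based accounting Anderson describes amounts to a simple adjustment of a country's territorial emissions: add the emissions embodied in what it imports and subtract those embodied in what it exports. The sketch below assumes that standard definition; the function name is hypothetical and the figures are made up purely to show the arithmetic, not taken from any dataset.

```python
def consumption_based_emissions(territorial, embodied_in_imports, embodied_in_exports):
    """Territorial emissions adjusted for emissions embodied in traded goods."""
    return territorial + embodied_in_imports - embodied_in_exports


# Illustrative numbers only (same unit throughout, e.g. MtCO2 per year).
territorial = 400.0
footprint = consumption_based_emissions(territorial, embodied_in_imports=250.0,
                                        embodied_in_exports=100.0)
print(footprint)                # consumption-based footprint
print(footprint - territorial)  # emissions that territorial accounting leaves abroad
```

A country that offshores manufacturing imports more embodied emissions than it exports, so its consumption-based footprint comes out higher than its territorial figure.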
Note also: this incentive is in fact far more hard-headed than any metric of hedonic economism—such as GDP, which is measuring the amount of desire satisfied by the productive sector. At best GDP is a revenue metric. A prudent manager will manage an enterprise to maximize capital and profit, not revenue.
Also agreed; measures how much, not how well
Lately, metrics related to social usage and online comment have gained momentum — F1000Prime was established in 2002, Mendeley in 2008, and Altmetric.com (supported by Macmillan Science and Education, which owns Nature Publishing Group) in 2011.
See altmetrics.
The urgent argument for turning any company into a software company is the growing availability of data, both inside and outside the enterprise. Specifically, the implications of so-called “big data”—the aggregation and analysis of massive data sets, especially mobile
Every company is described by a set of data, financial and other operational metrics, next to message exchange and paper documents. What else we find that contributes to the simulacrum of an economic narrative will undeniably be constrained by the constitutive forces of its source data.
ReconfigBehSci. (2021, February 10). "Trying to appease both public health demands and the libertarian views of the free market has led not only to astronomical death tolls, such as in the US, UK, and Brazil, but to flailing economies. " 2/3 [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1359427735022694405
Wenham, C. (2021). What went wrong in the global governance of covid-19? BMJ, 372, n303. https://doi.org/10.1136/bmj.n303
OKRs in Product Management
good overview on OKRs and how to craft them
Matrajt, Laura, Julia Eaton, Tiffany Leung, and Elizabeth R. Brown. "Vaccine Optimization for COVID-19: Who to Vaccinate First?" Science Advances 7, no. 6 (February 1, 2020): eabf1374. https://doi.org/10.1126/sciadv.abf1374.
Holcombe, A. (2020, September 30). Conventional journal rankings—Fight them! Medium. https://medium.com/@ceptional/conventional-journal-rankings-fight-them-9c6db600b0dd
Obradovich, N., Özak, Ö., Martín, I., Ortuño-Ortín, I., Awad, E., Cebrián, M., Cuevas, R., Desmet, K., Rahwan, I., & Cuevas, Á. (2020). Expanding the measurement of culture with a sample of two billion humans [Preprint]. SocArXiv. https://doi.org/10.31235/osf.io/qkf42
Mathur, M. B., & VanderWeele, T. J. (2020). New statistical metrics for multisite replication projects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(3), 1145–1166. https://doi.org/10.1111/rssa.12572
Pastor-Escuredo, D., & Tarazona, C. (2020). Characterizing information leaders in Twitter during COVID-19 crisis. ArXiv:2005.07266 [Physics]. http://arxiv.org/abs/2005.07266
Coronavirus (COVID-19) in Australia – Pandemic Health Intelligence Plan [Text]. (2020, May 6). Australian Government Department of Health. https://www.health.gov.au//resources/publications/coronavirus-covid-19-in-australia-pandemic-health-intelligence-plan
Ferres, L. (2020, April 10). COVID19 mobility reports. Leo's Blog. https://leoferres.info/blog/2020/04/10/covid19-mobility-reports/
Killeen, B.D., et al. (2020, April 1). A country-level dataset for informing the United States' response to COVID-19. Cornell University. arXiv:2004.00756.
From my perspective this would at least require a deep change in how we measure value, define labor and organize the financial system.
Yes, and maybe more importantly in what we value - not just how we measure it.
We’ll make sure that decisions that need to be made post-launch are informed by actual facts of user behavior, not just best guesses.
Short-term (i.e., “This year, I will see...”) and long-term (i.e., “In 10 years, I will see...”) indicators are needed.
Metrics must represent both short- and long-term views to capture impact over time.
Amanda Matos - Metrics & DevOps - Why should you measure in order to succeed?
Operations and monitoring have a dedicated topic among the LPI requirements for the DevOps certification. Amanda will explain how to implement metrics with open source tools. Stay tuned!
705.1 IT Operations and Monitoring (weight: 4)
For many years, academia has relied on citation count as the main way in which we measure impact or importance of research. As a result, citation count is one of the primary metrics used when evaluating researchers. Citation counts also form the basis for other metrics, most notably Clarivate’s Impact Factor as well as the h-index, which respectively evaluate journal quality/prestige and researcher renown.
The metrics the Academy uses to measure "impact" are regressive.
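Since the h-index is named here as one of the citation-derived metrics, a short sketch of how it is computed may help: a researcher has index h if h of their papers have at least h citations each. The function name and the sample citation counts are illustrative only.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Made-up citation counts for two hypothetical researchers.
print(h_index([10, 8, 5, 4, 3]))
print(h_index([500, 8, 5, 3, 3]))
```

The second list shows one property of the metric: a single very highly cited paper barely moves the index, because only the count of papers above the threshold matters.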
these are not the only reason behind the trends observed. An examination of the peaks
See http://wikicite.org/statistics.html for similar phases and events better analysed by history instead of calculations.
Wikidata quality assessment
I guess the average values are spoiled by outliers. A look at the distribution would be interesting, instead of average and median values only.
Ontology depth
I'm curious about the depth distribution to compare my findings on other classification systems, including Wikipedia categories: https://arxiv.org/abs/cs/0604036
sed in the present analysis
I'd add number of classes with connected Wikipedia articles because these provide definition and context
We used the following keywords: ‘ontology metrics’, ‘ontology evaluation framework’, and ‘ontology evaluation’. From the results, we selected only papers including primarily structural metrics.
A similar study on metrics and evaluation of classifications, taxonomies, thesauri and other knowledge organization systems would be interesting!
We calibrate the model for 6 countries at various stages of economic development: 3 low-income countries (Uganda in 2005, Kenya in 2006, and Mozambique in 2006), and 3 emerging market economies (Malaysia in 2007, Philippines in 2008 and Egypt in 2007).
Data & Calibration
In the model, agents are heterogeneous, distinguished from each other by wealth and talent. Individuals choose in each period whether to become an entrepreneur or to supply labor for a wage. Workers supply labor to entrepreneurs and are paid the equilibrium wage. Entrepreneurs have access to a technology that uses capital and labor for production. In equilibrium, only talented individuals with a certain level of wealth choose to become entrepreneurs.
In this model, a heterogeneous population is 'created' and differentiated by talent and wealth. Only people with enough of both can become entrepreneurs; otherwise they stay wage earners.
The Solow–Swan model augmented with human capital predicts that the income levels of poor countries will tend to catch up with or converge towards the income levels of rich countries if the poor countries have similar savings rates for both physical capital and human capital as a share of output, a process known as conditional convergence.
Income levels of poor and rich countries will converge only on the condition that they have similar savings rates. Otherwise, convergence might not happen.
and using five-year averages of all variables to smooth out cyclical variations
Need to account for cyclical trends
Maybe not an ELI5, but: Moments are expectations of things. $E(X)$ is often called the "first moment," $E(X^2)$ is the "second moment," etc. They can also be more complicated, like $E(\exp(5x+y))$, or whatever. In econometrics, you're trying to figure out something about the underlying distribution of your y's and your x's (and the errors). Often you don't know the shape of the distribution, but you know some moments of the distribution. This is useful because you can't use maximum likelihood estimation unless you make assumptions on the entire distribution. With ordinary least squares, you assume that $E(ex) = 0$, that is, the errors are uncorrelated with the regressors. You can write $e = y - x\beta$ to get a moment condition $E(x(y - x\beta)) = 0$. If you do GMM with this moment condition, you get the regular OLS estimator. If you have endogeneity of some kind, you don't know that $E(ex) = 0$, but you might have some instruments $z$, such that $E(ez) = 0$. This gives you the moment condition $E(z(y - x\beta)) = 0$. GMM is nice because it makes relatively weak assumptions compared to other ways of estimating parameters. I hope that helps!
GMM - I don't get it, at all. What are moments? How are they used? Why are they used? Thanks all!
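For readers who want the algebra behind the answer, here is a short derivation sketch of how each moment condition turns into an estimator. The sample size $n$, the per-observation subscript $i$, and the stacked matrices $X$, $Z$ and vector $y$ are notation introduced here rather than in the thread, and the instrument case assumes exactly as many instruments as regressors.

```latex
\begin{align*}
E\big[x_i (y_i - x_i'\beta)\big] = 0
  \;&\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} x_i \,(y_i - x_i'\hat\beta) = 0
  \;\Rightarrow\; \hat\beta_{\text{OLS}} = (X'X)^{-1} X'y, \\
E\big[z_i (y_i - x_i'\beta)\big] = 0
  \;&\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} z_i \,(y_i - x_i'\hat\beta) = 0
  \;\Rightarrow\; \hat\beta_{\text{IV}} = (Z'X)^{-1} Z'y.
\end{align*}
```

In both lines the population expectation is replaced by its sample average and the resulting equations are solved for $\hat\beta$. When there are more instruments than regressors, the sample moments generally cannot all be set to zero at once, so GMM instead picks the $\hat\beta$ that minimizes a weighted quadratic form in those sample moments.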
proportion of youth and adults with information and communications technology (ICT) skills, by type of skill’ (SDG 4.4.1)
indicator of digital skills target
Develop appropriate measurement and monitoring strategies
Recommendation 4
it shows the six degrees of separation.
What SNA will help with is not only creating visuals of networks but also providing metrics for understanding the different roles and positions each node plays in the network, as well as for comparing different network structures.
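As an illustration of what such metrics look like, the sketch below computes two of the simplest ones for an undirected network given as a list of edges: degree centrality (a node-level measure of position) and density (a network-level measure that can be compared across structures). The function names and the tiny star-shaped example are assumptions made for illustration; real analyses would typically rely on a dedicated network library.

```python
def degree_centrality(edges):
    """Share of the other nodes each node is directly connected to."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nb) / (n - 1) for node, nb in neighbors.items()}


def density(edges):
    """Fraction of all possible undirected ties that are actually present."""
    nodes = {x for edge in edges for x in edge}
    n = len(nodes)
    if n < 2:
        return 0.0
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique) / (n * (n - 1) / 2)


# Made-up star network: one hub tied to three otherwise unconnected nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(degree_centrality(star))
print(density(star))
```

In the star example the hub reaches every other node directly while each leaf reaches only the hub, which is the kind of role difference these metrics are meant to surface.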
1000 assignments in the assignment bank and 11 thousand submissions
ds106: syndicated: 1K assignments and 11K submissions
The first great experiment was UMW Blogs, our institutional blogging system which debuted in fall of 2007. In those nine years, it has had almost 13,000 users and it now contains 11,000 individual WordPress sites.
UMW blogs: 13K users, 11K blogs
(Among the recommendations: Greater emphasis on visuals, greater variety of formats and voices. They also announced that the Times would be introducing an alternative metric to pageviews that would “measure an article’s value to attracting and retaining subscribers.”)
How would they measure this exactly?
We’ll let you know who made the millionth annotation
Now we know who made annotation #1,000,000!
PLOS Labs is working on establishing structured reviews and we have talked with them about this.
I think it is possibly too early to tell.
As Steven Hill of HEFCE recently suggested, it might be far better for UK institutions to reject these tables: "What if all UK institutions made a stand against global rankings, and stopped using them for promotional purposes? The reputation of the UK’s higher education sector would stand firm, and a really strong signal would be sent to the rest of the world. Not drifting, but steering purposely through the metric tide." http://blog.hefce.ac.uk/2015/07/08/the-metrics-dilemma/
based on a scientific analysis of citation data
JIF is discredited in many reviews. See for example http://dx.doi.org/10.3389/fnhum.2013.00291. A recent independent review of metrics for the Higher Education Funding Council for England also strongly recommended against the use of measures like JIF: http://www.hefce.ac.uk/pubs/rereports/Year/2015/metrictide/
metrics on annotations/comments
We should compare these asks with our own needs for monitoring our public service at Hypothes.is.
Don’t guess at what’s wrong with your infrastructure–graph it.
They also adopted the mantra, “If it moves, graph it.”