564 Matching Annotations
  1. Last 7 days
    1. Currently, the US only fully manufactures about 10 percent of the chips it requires

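      The US can produce only about 10% of the chips it needs on its own, which shows how heavily it depends on imports for semiconductor manufacturing. The figure highlights US vulnerability in AI chip manufacturing and explains why the Trump administration is trying to use tariffs to bring chip production back to the US. A 10% self-sufficiency rate is far below the administration's target, though, and shows how large the challenge is.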

    2. Tech giants collectively plan to spend $750 billion on AI infrastructure this year, with "a significant portion" of that expected to "go towards chips for data centers"

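      Global tech giants plan to spend $750 billion on AI infrastructure this year, a significant share of it on data center chips. NVIDIA's $150 billion investment equals about 20% of that total, which reflects NVIDIA's dominance of the AI chip market. The figure also shows the sheer scale of investment across the AI industry and the central role of data center chips in AI infrastructure.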

    3. Four years ago, five years ago, Nvidia was spending about 10, 15 billion dollars a year in Taiwan. Now we're spending 100, going to 150 billion dollars in Taiwan each year.

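      NVIDIA's annual spending in Taiwan has grown roughly tenfold, from about $10-15 billion four or five years ago to $100 billion now, heading toward $150 billion (the figure used in the headline). Growth this steep reflects Taiwan's increasingly strategic position in the AI supply chain and suggests NVIDIA is shifting the center of gravity of the global AI industry from the US to Taiwan.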

    4. Nvidia will invest $150 billion a year to make Taiwan an AI "epicenter."

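      This is a striking sum, equal to about 3% of NVIDIA's current market capitalization ($5 trillion). It indicates that NVIDIA treats Taiwan as a core strategic base for the AI industry, with spending far beyond what it invests in the US. The scale reflects Taiwan's irreplaceable role in semiconductor manufacturing and NVIDIA's deep dependence on the Taiwanese supply chain.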

    1. The time from business to production workflow drops from months to days.

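      This is a qualitative claim about AI agents shortening deployment time. It gives no concrete numbers, but "months" to "days" is an order-of-magnitude change. The statement implies that agents can sharply cut the cycle from business requirement to deployment and make organizations more agile. No quantitative evidence is given, however, and implementation time will likely vary widely with complexity.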

    2. McKinsey predicts that by 2030, three-quarters of current jobs will require redesign, upskilling, or redeployment

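      McKinsey predicts that by 2030 three-quarters of current jobs will need redesign, upskilling, or redeployment. That is a striking share and suggests AI agents will have a deep effect on the labor market. The forecast underlines the need for organizations to plan workforce strategy early, including training and transition programs, ahead of the coming shift in workforce structure.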

    3. Although 85% of organizations say they want to be agentic within the next three years, 76% say their current operations and infrastructure can't support that change.

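      This quantifies a wide gap between organizational goals and actual capability. 85% of organizations say they want to become agentic within three years, yet 76% admit their current operations and infrastructure cannot support the change. Expectations for AI agents run far ahead of readiness, which could lead to failed projects and wasted investment. The data comes from a Celonis survey and is reasonably credible.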

    1. The time is now to make changes in the way we train, prepare, and support young people who are about to enter the workforce

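      The article gives no specific time frame or quantitative indicator to back the urgency of "the time is now". The argument rests on the data cited earlier but lacks a transition timeline or expected outcomes. More concrete data is needed to assess how urgent the reforms are and what they would achieve.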

    2. the unemployment rate for recent college graduates rose to 5.6%, while the underemployment rate (the share of graduates working in jobs that typically do not require a college degree) reached 42.5%

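      New York Fed data shows that in Q4 2025 the unemployment rate for recent college graduates reached 5.6% and the underemployment rate reached 42.5%, the highest level since the pandemic. The job market for graduates is deteriorating. The 42.5% underemployment rate deserves particular attention: more than two in five graduates work in jobs that do not require a college degree.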

    3. the unemployment rate for recent college graduates rose to 5.6%, while the underemployment rate (the share of graduates working in jobs that typically do not require a college degree) reached 42.5%

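      The 5.6% unemployment rate and 42.5% underemployment rate are key indicators of how recent graduates are faring. The data comes from the Federal Reserve Bank of New York and is highly credible. The 42.5% underemployment rate is the highest since the pandemic and suggests the value of a college degree is being challenged. These figures may be related to AI's effect on entry-level work, but the article also notes that AI cannot be confirmed as the sole cause.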

    4. workers aged 22 to 25 in the most AI-exposed occupations experienced a 16% relative decline in employment after the spread of generative AI

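      The 16% employment decline is the most important data point in the article and indicates that AI has a marked effect on young workers. It comes from a Stanford Digital Economy Lab working paper and has some credibility. Note that it is a relative decline, not an absolute count, and applies only to the most AI-exposed occupations. It contrasts sharply with the stable trend in overall employment, which points to a structurally uneven impact.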

  2. May 2026
    1. Claude Opus 4.7 has been used to patch over 2,100 vulnerabilities

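      In enterprise settings, Claude Opus 4.7 patched more than 2,100 vulnerabilities in three weeks, far faster than open-source software gets patched. When development teams can fix their own code directly, AI-driven security tooling can raise remediation throughput considerably. The data point also reflects the difference between enterprise security tooling and the security challenges of open-source communities.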

    2. on average, a high- or critical-severity bug found by Mythos Preview takes two weeks to patch

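      An average patch time of two weeks for high-severity bugs looks too long now that AI speeds up discovery. AI can find large numbers of vulnerabilities quickly while human patching cannot keep up, so the window of exposure grows. The article notes that some maintainers have even asked for slower disclosure, a sign of the heavy pressure on the current security ecosystem.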

    3. 90.6% (1,587) have proved to be valid true positives, and 62.4% (1,094) were confirmed as either high- or critical-severity

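      Of the vulnerabilities the model reported, 90.6% were confirmed as true positives, a fairly high accuracy. Only 62.4% were confirmed as high or critical severity, however, so about 28.2% of the assessed reports were valid but did not hold up at the high/critical rating the model gave them. The model's severity assessment still has room to improve.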
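      The percentages in this highlight can be cross-checked with a few lines of arithmetic. The sketch below assumes that both percentages share the same denominator (the reports that were assessed), which the excerpt does not state explicitly; the variable names are illustrative only.

```python
# Counts and percentages as quoted in the highlighted passage.
valid_count, valid_pct = 1587, 90.6
severe_count, severe_pct = 1094, 62.4

# Back out the number of assessed reports implied by each count/percentage pair.
# Assumption: both percentages are taken over the same set of assessed reports.
implied_total_from_valid = valid_count / (valid_pct / 100)
implied_total_from_severe = severe_count / (severe_pct / 100)

# Share of assessed reports that were valid but not confirmed high/critical.
valid_not_severe_pct = valid_pct - severe_pct

# Confirmed high/critical findings as a share of the valid findings.
severe_share_of_valid_pct = severe_count / valid_count * 100

print(round(implied_total_from_valid), round(implied_total_from_severe))
print(round(valid_not_severe_pct, 1), round(severe_share_of_valid_pct, 1))
```

      The two implied totals land within a report or two of each other, which is consistent with a shared denominator and rounded percentages.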

    4. Mythos Preview has found what it estimates are 6,202 high- or critical-severity vulnerabilities in these projects (out of 23,019 in total)

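      Across the 1,000-plus open-source projects scanned, the model found 23,019 vulnerabilities in total, of which 6,202 (about 27%) were rated high or critical. This suggests open-source software is more fragile than many people assume, and it demonstrates how capable AI has become at code auditing.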

    5. their rate of bug-finding has increased by more than a factor of ten

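      A more than tenfold increase in the rate of bug-finding is striking and marks a step change in security testing efficiency. Cloudflare, for example, found 2,000 bugs, 400 of them high severity. That pace far exceeds traditional manual testing, but it also creates a new problem for security teams: how to handle such a volume of reports.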

    6. we and our approximately 50 partners have used Claude Mythos Preview to find more than ten thousand high- or critical-severity vulnerabilities

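      This shows how capable AI has become in cybersecurity: about 50 partners found more than 10,000 high-severity vulnerabilities in a short period, roughly 200 per partner on average. The number suggests AI models already outperform traditional methods at vulnerability discovery, and it also reflects how serious the state of software security is.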

    7. Mythos Preview has found what it estimates are 6,202 high- or critical-severity vulnerabilities in these projects (out of 23,019 in total)

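      This gives a concrete view of the model's performance on open-source scanning: 27% of the vulnerabilities were rated high or critical. That is a fairly high share and points to substantial security risk in foundational software. These are the model's own estimates, though, and need human verification. The 90.6% validation rate cited in the article shows the assessments are reasonably accurate, but false positives remain possible.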

    8. we and our approximately 50 partners have used Claude Mythos Preview to find more than ten thousand high- or critical-severity vulnerabilities

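      More than 10,000 high-severity vulnerabilities is a striking figure and shows that AI-driven discovery has reached an unprecedented scale. About 50 partners averaging 200-plus findings each far exceeds the throughput of traditional security methods. The article provides no historical baseline, however, so the number cannot be judged in absolute terms, only as a marked improvement relative to traditional methods.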

    1. In long sessions the bill typically lands at ~1/3 of comparable generic tooling.

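      This claims that in long sessions the cost is typically about one-third that of comparable generic tooling. That is a large saving, but the article does not say which tools were compared, under what conditions, or by what metric. The one-third figure needs more detailed benchmarks and comparison data to support it.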

    1. Perceptual BD-rates are based on human ratings from a large-scale subjective study

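      The evaluation uses a BD-rate metric grounded in human perception, an important method in image compression. The article does not give the size of the study, the number of participants, or the rating procedure, so there is no quantitative basis for judging how rigorous and reliable the evaluation is.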

    2. search over millions of model configurations to jointly optimize over perceptual quality and on-device runtime

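      A search over millions of model configurations indicates large-scale experimentation and optimization, which adds credibility to the results. The article gives no details on the search method, the optimization algorithm, or the compute used, which makes it hard to assess how efficient and rigorous the process was.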

    1. All models included in our analysis have at least two scores in each domain, with an average of 3.2 SWE benchmark results and 3.4 math benchmark results.

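      This describes the study's sample size and benchmark coverage. Each model has on average 3.2 software engineering benchmark results and 3.4 math benchmark results, a fairly small sample that may limit statistical significance. The minimum of two results per domain does ensure basic reliability. The small number of benchmarks may still limit how comprehensive the results are.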

    2. On average Claude models have an SWE-ECI 2.7 points higher than their general ECI, and a Math-ECI 1.8 points lower.

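      This shows how Claude models differ between software engineering and math. An SWE advantage of 2.7 points and a math deficit of 1.8 points indicate that Claude is relatively stronger at software engineering and relatively weaker at math. The gap is not huge, but the direction is clear and consistent with the argument in the article's title. The figures are averages over multiple models, which gives them some statistical weight.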

    1. 13K

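      The post was reposted 13,000 times, the highest of its engagement metrics: about 10 times the likes and 46 times the replies. The high repost rate suggests the message was highly shareable, probably because of the news value of Apple accidentally leaking internal files. It shows the story had viral potential in the tech community.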

    2. 1.3K

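      The post received 1,300 likes against 283 replies, about 4.6 likes per reply. Most users chose a simple signal of approval over in-depth discussion. This reflects a positive attitude toward a possible Claude integration at Apple, but it also hints that the topic did not trigger much technical discussion.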

    3. 283 replies

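      The post has 283 replies. That is a small share of its 2.5 million views (about 0.011%) but still indicates some discussion. It reflects engagement with Apple's internal development process and AI integration. Compared with a typical tech post, this is a moderate interaction rate: the topic has some discussion value, but not an exceptional amount.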

    4. 2.5M Views

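      The post drew 2.5 million views, a considerable number that shows strong interest in this news about an Apple Support app update. For technical content, that view count signals public interest in Apple's internal development process and potential AI integration. It reflects how curious the public is about the inner workings of tech giants.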

    1. reduce the size of Coinbase by ~14%

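      A 14% reduction is significant and indicates that Coinbase is undergoing a major restructuring. Given the volatility of the crypto industry, the share is higher than the roughly 10% cuts common at many tech companies, which shows how seriously the company takes current market conditions and how determined it is to respond.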

    1. By the fourth quarter of 2025, the five largest chip designers had cumulatively shipped roughly 20 million AI chips

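      This indicates that the AI chip market has reached considerable scale, at roughly 20 million chips. At tens of thousands of dollars per chip, the total value is in the hundreds of billions of dollars. The number reflects explosive growth in demand for AI hardware, but it is cumulative rather than annual shipments and may include older chip models.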

  3. Apr 2026
    1. Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training.

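      This highlights the contrast between the difficulty of the problem and the solver's background. Sixty years unsolved speaks to its complexity, while a solution from a 23-year-old amateur with no advanced math training hints that AI may be changing the barriers to, and methods of, mathematical research. The age and background add drama to the story, but more detail on Price's education is needed for a full assessment.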

    1. More than 3,000 forensic engines run in parallel on every submitted sample, covering signal, prosody, articulation, codec, and provenance domains.

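      More than 3,000 forensic engines running in parallel shows how complex deepfake detection is. The number suggests that a detection system must analyze an audio sample along many dimensions to identify synthetic speech accurately. It also reflects how detection techniques are advancing and growing more complex alongside AI.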

    2. The FBI Internet Crime Complaint Center logged 2.3 billion dollars in losses for victims aged 60 and over in calendar year 2026.

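      Victims aged 60 and over lost $2.3 billion in 2026, a striking figure. It suggests older people are the main targets of synthetic voice attacks and may be more easily deceived by urgent impersonation calls. The data underlines the need for security education aimed at specific groups.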

    3. Pindrop reported a 475 percent year-over-year increase in synthetic voice attacks against insurance call centers across 2025.

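      A 475% year-over-year increase means synthetic voice attacks are growing explosively. The number reflects how widespread AI voice technology has become and how quickly attackers adopt it. Insurers are prime targets because claims are handled mostly by phone, which makes voice verification a critical security step.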

    4. The Wall Street Journal reported in February 2026 that high-quality voice cloning now requires roughly fifteen seconds of clean reference audio for tools available off the shelf.

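      Fifteen seconds of clean reference audio is the threshold for high-quality voice cloning, while the leaked Mercor data averages 2-5 minutes of recordings per contractor, far above that threshold. Attackers could use the leaked data to create very convincing voice clones, which greatly raises the risk of misuse.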

    5. According to the leaked sample index, the archive covers more than 40,000 contractors who signed up to label data, record reading passages, and run through verification calls for AI training.

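      Forty thousand affected contractors is a large number. With 2-5 minutes of recordings per contractor, the total could reach 80,000-200,000 minutes, or about 1,333-3,333 hours. A leak of this scale could affect millions of end users of the AI systems involved.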
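      The totals in the note above follow from simple multiplication. A minimal sketch, assuming the 2-5 minutes per contractor given in the annotations and the 15-second cloning threshold from the Wall Street Journal highlight; the constant names are illustrative:

```python
CONTRACTORS = 40_000
MINUTES_PER_CONTRACTOR = (2, 5)   # low and high estimate, per the annotations
CLONING_THRESHOLD_SECONDS = 15    # reference audio reported as sufficient for cloning

# Total recorded audio across all contractors, in minutes and hours.
total_minutes = tuple(CONTRACTORS * m for m in MINUTES_PER_CONTRACTOR)
total_hours = tuple(minutes / 60 for minutes in total_minutes)

# How many times over each contractor's audio exceeds the cloning threshold.
threshold_multiple = tuple(
    m * 60 / CLONING_THRESHOLD_SECONDS for m in MINUTES_PER_CONTRACTOR
)

print(total_minutes, [round(h) for h in total_hours], threshold_multiple)
```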

    6. The dump is reported at roughly four terabytes and bundles a payload that breach analysts have been warning about for two years: voice biometrics paired with the same person's government-issued identity document.

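      Four terabytes marks this as a large-scale breach, equivalent to the audio of roughly one million songs. Pairing voice biometrics with government-issued identity documents is an especially dangerous combination, because attackers get both the material for a voice clone and the credentials for identity verification. The combination greatly increases the chance that the data will be weaponized.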

    1. Our website uses cookies to enhance your browsing experience and analyze site traffic.

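      The site says it uses cookies to analyze traffic but provides no traffic data, session counts, page views, or other key metrics, so no quantitative analysis is possible.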

    1. Founded at the Massachusetts Institute of Technology in 1899

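      Measured against the current date (2026), the publication has been operating for 127 years. That makes it one of the oldest technology media outlets in the US. It has lived through multiple technological shifts, from the age of electricity to the digital age, and has accumulated deep industry insight.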

    1. That momentum is starting to extend beyond engineering. Teams are using Codex to pull together context from different tools, reason through what matters, and turn scattered information into useful work - like briefs, plans, checklists, drafts, and follow-ups.

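      The article says Codex use is spreading from engineering into other areas but gives no usage data or adoption rates. Without quantitative evidence, the actual extent and value of Codex among non-engineering teams cannot be assessed.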

    2. Our professionals are using Codex to move from static requirements to working solutions in hours, not weeks. It's enabling rapid prototyping, real-time workflow redesign, and faster iteration across the development lifecycle.

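      Accenture's chief AI officer claims development time has dropped from "weeks" to "hours", a significant efficiency claim with no specific data behind it. Without quantitative evidence, its accuracy and general applicability cannot be verified.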

    3. Companies are using Codex across the software development lifecycle. Virgin Atlantic is using it to increase test coverage and increase team velocity - reducing technical debt and improving performance.

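      The article names a concrete use of Codex at Virgin Atlantic but provides no quantitative data on the results. Without it, the actual performance gains or reduction in technical debt cannot be assessed.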

    1. 🔹 **DeepSeek-V4-Pro:** 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.

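      This gives concrete parameter counts for DeepSeek-V4-Pro: 1.6 trillion total and 49 billion active. That scale far exceeds most open-source models and approaches the top closed-source ones. The ratio of active to total parameters is about 3%, which indicates sparse activation and may be key to how the model balances performance and efficiency.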

    1. Ubuntu 26.04 LTS provides the strongest foundation for our confidential computing stack. It allows us to deploy a single securely designed image for all our verifiably private AI workloads across Intel, AMD, and NVIDIA hardware, with no platform-specific changes required.

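      A quote from a Tinfoil co-founder that stresses the strengths of Ubuntu 26.04 LTS for confidential computing: a single secure image across Intel, AMD, and NVIDIA hardware. It points to Ubuntu's lead in cross-platform confidential computing, which gives AI workloads a unified security foundation and reduces the need for platform-specific configuration.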

    2. Ubuntu now fully supports RVA23, the baseline standard for RISC-V. This ensures that teams innovating on RISC-V can take full advantage of the platform, including in mixed-architecture environments.

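      The article notes that Ubuntu now fully supports RVA23, the RISC-V baseline standard, which reflects forward-looking support for emerging architectures. RISC-V, an open instruction set architecture, is steadily gaining attention. Ubuntu's support should help the RISC-V ecosystem mature, particularly in mixed-architecture environments.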

    3. TPM-backed full-disk encryption is now generally available in the Ubuntu installer.

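      The article notes that TPM-backed full-disk encryption is now generally available in the Ubuntu installer. The feature binds encryption to the TPM chip of a specific device, which raises the bar considerably for physical-access attacks. By building it into the installer, Ubuntu simplifies secure deployments for enterprises compared with other Linux distributions.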

    4. Ubuntu 26.04 LTS is the first LTS to expand the number of memory safe system components. In practice, this means new kernel drivers and subsystems written in Rust, as well as `sudo-rs` and `uutils` `coreutils` bringing memory-safe reimplementations of foundational system tools such as `sudo`, `ls`, `cp`, and `mv`.

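      The article stresses that Ubuntu 26.04 LTS is the first LTS release to expand its memory-safe system components, including kernel drivers and subsystems written in Rust, along with memory-safe reimplementations of foundational tools such as sudo-rs and uutils coreutils. The move improves system security, reduces the risk of memory-related vulnerabilities, and shows Ubuntu's lead on memory safety.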

    5. Canonical Livepatch now extends its rebootless kernel patching capability to Arm64 for the first time.

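      This is an important milestone for Canonical Livepatch, which now extends to Arm64 for the first time. For Arm64 servers and edge devices running Ubuntu, critical kernel patches can be applied without a reboot, which greatly improves availability. The extension reflects Ubuntu's continued investment in the Arm ecosystem.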

    6. IgH Master driver brings microsecond-level timing precision natively into the OS, removing a significant integration burden for engineers building motion control systems, robotics platforms, or complex factory automation.

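      The article says the EtherCAT driver provides microsecond-level (10^-6 s) timing precision, which is critical for industrial automation. High-precision time synchronization is a key advantage for Ubuntu in industrial settings. Compared with other general-purpose operating systems, its real-time improvements make it better suited to industrial IoT and automation.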

    7. Ubuntu 26.04 LTS is built on Linux 7.0, continuing Canonical's commitment to shipping the latest upstream kernels at the time of release.

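      The article states that Ubuntu 26.04 LTS is built on the Linux 7.0 kernel, which shows Canonical holding to its policy of shipping the latest upstream kernel. Compared with distributions that may use more conservative kernel versions, this policy gives users the latest hardware support and performance improvements.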

    8. With optimized images across AWS, Azure, Google Cloud, IBM Cloud and Oracle Cloud, developers and enterprises can rely on Ubuntu 26.04 LTS for their most demanding public cloud workloads.

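      The article says Ubuntu 26.04 LTS supports five major cloud platforms (AWS, Azure, Google Cloud, IBM Cloud, Oracle Cloud), which reflects broad compatibility in cloud environments. Compared with other Linux distributions, Ubuntu does well on multi-cloud support, which strengthens its position as an enterprise operating system.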

    9. Ubuntu powers millions of PCs and laptops around the world.

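      This is a vague quantity: "millions" gives no specific number, so Ubuntu's exact user base cannot be determined. Compared with distributions such as Red Hat or SUSE, Ubuntu does have a broader desktop user base, but the claim lacks precise market share data to support it.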

    10. The 11th long-term supported release of Ubuntu delivers deep silicon optimization and state-of-the-art security for enterprise workloads.

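      This makes Ubuntu 26.04 the 11th LTS release, which is consistent with Ubuntu's history of one LTS release every two years. As the 11th LTS, it represents Canonical's mature experience with long-term support and offers enterprises and users a stable, reliable option.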

    1. Each cell shows how often a given curve fit is not significantly worse than the fit with the best cross-validation accuracy.

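      The study uses cross-validation to compare curve fits. Each cell shows how often a given fit is not significantly worse than the best fit. This gives a more robust statistical assessment and reduces the risk of overfitting.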

    2. We examine whether AI capabilities are accelerating by fitting statistical models to benchmark performance over time, and comparing their predictive accuracies.

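      The method rests on fitting statistical models and comparing their predictive accuracy, which is a rigorous approach. Comparing the predictive power of different curve fits allows a more objective judgment about whether progress is accelerating than visual inspection alone.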
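      One way to picture "fitting models and comparing predictive accuracy" is leave-one-out cross-validation on two candidate fits: a single global linear trend versus separate linear trends for reasoning and non-reasoning models. The sketch below is only an illustration on made-up data; the scores, dates, noise values, and function names are all hypothetical, and the excerpt does not say which cross-validation scheme the authors actually used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_error(points, split_by_group):
    """Mean squared leave-one-out prediction error.

    points: list of (release_time, score, is_reasoning) tuples.
    split_by_group: if True, fit a separate line per model group.
    """
    total = 0.0
    for i, (x, y, group) in enumerate(points):
        train = [
            p for j, p in enumerate(points)
            if j != i and (not split_by_group or p[2] == group)
        ]
        intercept, slope = fit_line([p[0] for p in train], [p[1] for p in train])
        total += (y - (intercept + slope * x)) ** 2
    return total / len(points)


# Hypothetical data: years since 2023 and a capability score with small fixed noise.
noise = [0.3, -0.2, 0.1, -0.3, 0.2]
non_reasoning = [(t, 100 + 5 * t + e, False)
                 for t, e in zip([0.0, 0.5, 1.0, 1.5, 2.0], noise)]
reasoning = [(t, 112 + 12 * (t - 1.5) + e, True)
             for t, e in zip([1.5, 2.0, 2.5, 3.0], noise)]
points = non_reasoning + reasoning

single_trend_error = loo_error(points, split_by_group=False)
two_trend_error = loo_error(points, split_by_group=True)
print(f"single trend: {single_trend_error:.2f}, two trends: {two_trend_error:.2f}")
```

      On this synthetic data the two-trend model predicts held-out points better because the data was constructed with a jump and a steeper slope for the reasoning group. Real benchmark data would need a significance test on top of the raw error comparison.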

    3. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

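      The article's core finding: 75% of the metrics show AI capabilities accelerating, driven mainly by reasoning models. It is a clear quantitative conclusion, but inferring "acceleration" from only four metrics risks sample bias, especially since the metrics are concentrated in math and programming.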

    4. Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.

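      The one metric in four (25%) that shows no acceleration is an important counterexample. The authors suggest this may be because WeirdML V2 is a resource-constrained setting (the model gets only five code submissions and cannot use external tools), which does not match the current focus of RL training. It suggests that measured AI progress may depend heavily on the test environment and evaluation criteria.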

    5. The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.

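      This model selection result (best on all three metrics, 100%) indicates that splitting models into reasoning and non-reasoning groups gives the best predictive model. It is strong statistical evidence that reasoning ability may be a key driver of accelerating AI progress. The article does not explain in detail how reasoning models are defined, however, which could affect the reliability of the result.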

    6. Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.

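      An important comparison: reasoning models are improving 2-3 times faster than non-reasoning models. That is a significant ratio and hints that the reasoning breakthrough may mark a turning point in AI development. The article gives no specific benchmark data to support the multiple, so it should be treated with caution.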

    7. Three of the four metrics (ECI, log METR 50% time horizon, and a math-focused index we constructed from several math benchmarks) show strong evidence that progress has sped up relative to a global linear trend fit to data from 2023 onward.

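      A key statistic: 75% of the AI capability metrics show acceleration. The article fits a linear trend to data from 2023 onward and finds that three metrics depart from it. The share is high, but the sample is small (n=4), which may affect statistical significance. More metrics are needed to confirm the finding.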

    8. Three of the four metrics (ECI, log METR 50% time horizon, and a math-focused index we constructed from several math benchmarks) show strong evidence that progress has sped up relative to a global linear trend fit to data from 2023 onward.

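      This shows that 75% of the AI capability metrics are accelerating, a fairly high share. The article says the acceleration began in 2023, which coincides with the arrival of reasoning models. The share is notable because it suggests AI progress may be undergoing a qualitative shift and not merely accumulating quantitatively.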

    9. The three metrics where we find acceleration are concentrated in programming and mathematics. These are areas that labs have explicitly targeted for improvement

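      This observation reveals how domain-limited the acceleration is. Programming and math may be accelerating because labs explicitly target them and because correctness is easy to verify. AI progress may therefore be selective and not across the board, which matters for assessing overall progress.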

    10. Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.

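      The one metric in four (25%) without acceleration suggests that accelerating capability may not be universal. The unusual WeirdML V2 setting (resource-constrained, no external tools) may explain the difference, but it also hints that acceleration is concentrated in specific domains, especially those where correctness is easy to verify automatically.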

    11. The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.

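      This finding shows that reasoning and non-reasoning models follow clearly different trajectories. The separate-linear-trends model performs best on all three metrics, beating the alternatives every time, which gives strong statistical support to the acceleration argument.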

    12. Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.

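      A 2-3x difference in speed is significant and shows that reasoning models brought a qualitative leap. Acceleration of this size is far above the typical pace of technological progress and hints that AI development may have entered a new phase. The range is wide, however, and lacks a precise test of statistical significance.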

    13. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

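      A key data point: 75% of the AI capability metrics show acceleration. The share is high and suggests the acceleration may not be a coincidence. The figure rests on four specific metrics, though, and may not represent all AI capability domains. More metrics are needed to test how general the conclusion is.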

    14. We work with the natural logarithm of the time horizon, which puts it on an approximately linear scale.

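      The article applies a natural log transform to the METR time horizon to put it on an approximately linear scale. The transform suggests the raw data grows roughly exponentially and only becomes suitable for linear trend analysis after conversion. This treatment is common when analyzing rates of AI progress because it handles data that spans several orders of magnitude.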
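      The effect of the log transform can be sketched on synthetic data: a quantity that doubles at a fixed interval becomes a straight line once its natural logarithm is taken, and the slope of that line gives the doubling time. The numbers below (a 4-minute starting horizon doubling every half year) are made up for illustration and are not METR results.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical time horizons (minutes) that double every 0.5 years.
years = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
horizon_minutes = [4.0 * 2 ** (t / 0.5) for t in years]

# On the log scale, exponential growth is a straight line in time.
log_horizon = [math.log(h) for h in horizon_minutes]
intercept, slope = fit_line(years, log_horizon)

# slope is the growth rate in log-units per year; ln(2) / slope is the doubling time.
doubling_time_years = math.log(2) / slope
starting_horizon = math.exp(intercept)
print(f"doubling time: {doubling_time_years:.2f} years, start: {starting_horizon:.1f} min")
```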

    15. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

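      This shows that 75% of the AI capability metrics are accelerating, a fairly high share. The article also notes that the fourth metric (WeirdML V2) shows no acceleration, so the acceleration may not extend to every capability domain. The share should be read with caution because it rests on only four metrics, concentrated mainly in math and programming.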

    1. Meta founder and CEO Mark Zuckerberg described superintelligence in a blog post last year

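      The article says Meta's AI strategy includes developing "superintelligence" but gives no investment figure, R&D timeline, or expected outcomes. Without quantitative evidence, the scale, time frame, and potential commercial value of the strategy cannot be assessed. A technical vision like this needs more concrete data before its feasibility can be judged.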

    2. Wedbush Securities analyst Dan Ives said in a report on Thursday.

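      The article reports an analyst's prediction that more layoffs may follow but gives no specific numbers or percentages. Without quantitative evidence, the reliability of the prediction cannot be assessed. Industry analysis of this kind usually needs more specific support, such as expected layoff counts, timelines, or financial impact.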

    3. Meta plans to lay off roughly 8,000 employees, or 10% of its workforce

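      This is a significant but reasonable cut. A 10% layoff reflects a major strategic adjustment in Meta's AI transition. Compared with other tech companies (typically 5-20%), it sits in the middle to upper part of the range and shows Meta restructuring actively to fund AI investment. The data point comes from the company's official statement and is highly credible.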

    1. The median US buyout fund returns 13% to 16% net.

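      The article puts the median net return for US buyout funds at 13-16%. The 17% return OpenAI has promised is above that range, about 1.06 to 1.3 times the industry median. The gap indicates that OpenAI is willing to pay a premium for distribution advantages, and it also implies that the PE partners may be taking on extra risk or that OpenAI's business model requires exceptional growth.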

    1. Claude usage rose by over 40% amid increased attention but remains far behind ChatGPT

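      Surprisingly, Claude usage grew by 40% in just one month yet still trails far behind ChatGPT's 30% usage rate. This suggests a pronounced winner-take-all dynamic in the AI market: even the most successful challenger remains an order of magnitude behind the leader.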

  4. Mar 2026
  5. Feb 2026
    1. The cost of the time that it takes fix "workslop" could add up too, with a $186 monthly cost per employee on average, according to a survey of desk workers by BetterUp in partnership with the Stanford Social Media Lab. Forty percent of the workers surveyed said they received "workslop" in the last month and that it took an average of two hours to resolve each incident.

      $186 per employee per month! Annualized ($186 × 12 × headcount), that comes to:

      - 10 employees = $22,320
      - 25 employees = $55,800
      - 50 employees = $111,600
      - 100 employees = $223,200
      - 250 employees = $558,000
      - 500 employees = $1,116,000
      - 1,000 employees = $2,232,000
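      The headcount figures in the note above are annual totals: the $186 monthly cost per employee, times 12 months, times the number of employees. A minimal sketch of that arithmetic (the function name is illustrative):

```python
MONTHLY_COST_PER_EMPLOYEE = 186  # USD, the survey average quoted in the highlight


def annual_workslop_cost(employees: int) -> int:
    """Annual cost in USD for a given headcount at the quoted monthly average."""
    return MONTHLY_COST_PER_EMPLOYEE * 12 * employees


for headcount in (10, 25, 50, 100, 250, 500, 1000):
    print(f"{headcount:>5} employees: ${annual_workslop_cost(headcount):,}")
```

      This treats the survey average as if it applied uniformly to every employee, which is a simplification.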

    2. 58% said direct reports submitted work that contained factual inaccuracies generated by AI tools, while fewer reported that AI failed to account for critical contextual factors. Other issues cited include low-quality content, poor recommendations and inappropriate messaging.

      Among the managers surveyed, 58% said employees submitted work containing factual inaccuracies generated by AI tools. Fewer reported that AI failed to account for "critical contextual factors", which implies writing that was generic and not directly applicable to its context. Other issues included low-quality content, poor recommendations, and inappropriate messaging.

    3. 59% of managers saying that they had to invest additional time to correct or redo work created by AI. Similarly, 53% said their direct reports had to take on extra work, while 45% said they had to bring in co-workers to help fix the mistake.

      Extra time and money are spent repairing errors made by AI but not caught by the human in the middle. 59% of managers (about 3/5) had to correct or redo work created by AI that no human had audited. 53% said their direct reports had to take on extra work to repair the AI mistakes, and 45% had to bring in a (perhaps more senior) co-worker to help fix them. I can imagine a mistake that hits production code, and the thousands (or more) of downstream errors that would later need to be repaired and rolled back. Very costly.

    4. While 18% of managers said they did not suffer any financial losses from the mistakes, and 20% said those losses were less than $1,000, a significant number reported bigger losses. Twelve percent said those losses were more than $25,000, while 11% said between $10,000 and $24,999. Another 27% placed the value of those losses above $1,000 but below $10,000.

      great stats for the cost of using AI without human auditing.

  6. Jan 2026
  7. Dec 2025
  8. Sep 2025
  9. Jun 2025
  10. Apr 2025
  11. Jan 2025

  12. Nov 2024
    1. Surprisingly, the American author who is quoted most in the OED is not Mark Twain or Emily Dickinson or Edgar Allan Poe, but rather Edward H. Knight, a patent lawyer and expert in mechanics who wrote the American Mechanical Dictionary and The Practical Dictionary of Mechanics. Knight is the seventy-fourth-most cited author in the Dictionary, quoted more frequently than Percy Bysshe Shelley, George Eliot or Ralph Waldo Emerson (who comes in at 116, the next-most quoted American).
  13. Sep 2024
    1. Round half to even, which rounds to the nearest even number. With this method, 1.25 is rounded down to 1.2. If this method applies to 1.35, then it is rounded up to 1.4. This is the method preferred by many scientific disciplines, because, for example, it avoids skewing the average value of a long list of values upwards.
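      The rule in this highlight can be tried with Python's standard `decimal` module, which provides a `ROUND_HALF_EVEN` mode. The sketch below reproduces the two examples from the text and then compares the sum of a few tie values under half-even and half-up rounding. Values are passed as strings so that binary floating-point representation does not interfere; the helper names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_even(value: str, step: str = "0.1") -> Decimal:
    """Round a decimal string to the given step, sending ties to the even digit."""
    return Decimal(value).quantize(Decimal(step), rounding=ROUND_HALF_EVEN)


def round_half_up(value: str, step: str = "0.1") -> Decimal:
    """Round a decimal string to the given step, sending ties away from zero."""
    return Decimal(value).quantize(Decimal(step), rounding=ROUND_HALF_UP)


# The two examples from the highlighted passage.
print(round_half_even("1.25"), round_half_even("1.35"))

# Ties rounded to whole numbers: half-even preserves the sum, half-up inflates it.
ties = ["0.5", "1.5", "2.5", "3.5"]
true_sum = sum(Decimal(t) for t in ties)
half_even_sum = sum(round_half_even(t, "1") for t in ties)
half_up_sum = sum(round_half_up(t, "1") for t in ties)
print(true_sum, half_even_sum, half_up_sum)
```

      Python's built-in `round()` also rounds ties to even, but it operates on binary floats, so a literal such as `1.35` may not be an exact tie; `Decimal` avoids that ambiguity.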
  14. Aug 2024
    1. Why Clinton's claim that Democratic presidents created more jobs than Republicans is slightly misleading by [[Max Zahn]] on 2024-08-22 for ABC News

      While Clinton may have left out additional detail, the root of the statement is not only broadly true, but broadly representative of the fact that Republican administrations have been devastating in general to the economy and Democrats have been handed shit at the start of their terms to clean up.

  15. Jul 2024
    1. we've achieved a level of prosperity for a huge number of people that was not typical of the past i mean most countries have a big middle class as well as an extremely wealthy upper 00:50:00 class but the number of people in abject poverty in the world today living on less than two dollars a day is greater than the entire population of the world 00:50:13 only 100 years ago so that's not progress

      for - statistics - progress trap - comparative levels of poverty

      statistics - progress trap - comparative levels of poverty - modern civilization has - a huge middle class - a small elite class - a huge impoverished class - The absolute number of people living on less than 2 dollars a day is greater than the entire population of humans only 100 years ago

  16. May 2024
  17. Mar 2024
  18. Feb 2024
    1. And yet he desperately needed the help of Subeditors because the task was too massive to do alone. Two years into the job, Murray had estimated that he had sent out 817,625 blank slips to Readers. If they returned them with quotations, and if he spent a minimum of 30 seconds reading each one and allocating it to the correct sense of an entry, it would take him three working years to get through a third of the materials gathered.

      By the second year of his editing work on the OED, James Murray estimated that he had sent out 817,625 slips to readers.

      At the average price of $0.025 for bulk index cards in 2023, this would have cost $20,440, so one must wonder at the cost of having done it. How much would this have been in March 1879 when Murray took over the editorship?

      How many went out in total? Who cut them all? Surely mass manufacture didn't exist at the time for them?

      Sending them out would have helped to ensure a reasonable facsimile of having cards of equal size coming back.
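      The figures in these notes can be checked with a few lines of arithmetic. The sketch below uses the 817,625 slips and 30 seconds per slip from the highlighted passage and the $0.025 per-card price from the note. The reading-time total assumes, purely for illustration, that every slip sent out came back with a quotation, which the passage does not claim.

```python
SLIPS_SENT = 817_625          # blank slips Murray estimated he had sent out
PRICE_PER_CARD_USD = 0.025    # 2023 bulk index card price used in the note
SECONDS_PER_SLIP = 30         # minimum reading time per slip, from the passage

# Cost of the slips at the 2023 bulk price.
card_cost_usd = SLIPS_SENT * PRICE_PER_CARD_USD

# Hours needed to read every slip once (assumes all slips were returned).
reading_hours = SLIPS_SENT * SECONDS_PER_SLIP / 3600

print(f"card cost: ${card_cost_usd:,.2f}, reading time: {reading_hours:,.0f} hours")
```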

    2. Murray received a poignant letter in 1906 from the wife of William Sykes of South Devon who had been a one-time assistant, and faithful Reader and Specialist for twenty-two years, sending in a total of 16,048 slips: ‘My dear husband died last Friday, the day he received your letter, he was able to read it, and wrote your name in one of the books I am going to send you eight hours before he died. It took him an hour to write it, but he made up his mind to do it, and did. The last words he ever wrote were to you.’ A poignant last line from the impoverished widow reads, ‘I shall send the books when the probate duty has been paid.’

      William Sykes: 16,048 slips over 22 years (approximately 2 notes per day)

    3. From the moment in March 1879 when Murray signed the contract with Oxford University Press to be the next Editor of the Dictionary, and he took possession of 2 tons of slips at his house, his family was immediately part of the project (whether they liked it or not) sorting out the slips. Their house was a workplace and the family a workforce.

      Perhaps one of the first sources of counting slips in weight rather than number!

    4. The most prolific Reader in Europe – we might call him a ‘super-contributor’ – was Hartwig Helwich, a professor at the University of Vienna who wrote out the entire Cursor Mundi onto 46,599 slips. His efforts made the medieval poem the second-most-frequently cited work in the Dictionary after the Bible (though in the current OED, it has dropped to eleventh in the top sources).

      This practice of writing out everything onto slips sounds like that used later (double check the timing) by the Thesaurus Linguae Latinae in creating their slip corpus for later work.

  19. Nov 2023
    1. Get it right and we will see a lot less of our precious minerals, metals and resources dumped into landfill

      This line stood out to me because it is hard to hear but very true of the world we live in. We toss things out the moment we no longer see them as valuable, but we don't toss things we consider "precious". For example, we buy a new iPhone and value it highly, but a year later a new iPhone comes out and the old one gets tossed away as if it were worthless. Instead of just tossing things like this, we need to be more proactive about recycling valuable, hard-to-obtain resources that one day we may not have.

    1. the suicide is up by 30% depression rates are skyrocketing 36% of 00:07:57 Americans report feeling lonely frequently 45% of teenagers say they feel despondent and hopeless most of the time the number of people who have no who say they have no close personal friends has gone up by four 00:08:10 times 36% more Americans are not in a romantic relationship uh the number of people Americans who rate themselves in the lowest happiness category has gone up by 50%
      • for: statistics - United States happiness indicators

      • statistics: United States happiness indicators

        • suicide is up by 30%
        • depression is skyrocketing
        • 36% of Americans report feeling lonely frequently
        • 45% of teenagers say they feel despondent and hopeless most of the time
        • the number of people who say they have no close personal friends has gone up by four times
        • 36% more Americans are not in a romantic relationship
        • the number of Americans who rate themselves in the lowest happiness category has gone up by 50%
  20. Oct 2023
  21. Jul 2023
    1. weakly informative approach to Bayesian analysis

      In [[Richard McElreath]]'s [[Statistical Rethinking]], he defines [[weakly informative priors]] (aka [[regularizing priors]]) as

      priors that gently nudge the machine [which] usually improve inference. Such priors are sometimes called regularizing or weakly informative priors. They are so useful that non-Bayesian statistical procedures have adopted a mathematically equivalent approach, [[penalized likelihood]]. (p. 35, 1st ed.)

    1. Science is not described by the falsification standard, as Popper recognized and argued.4 In fact, deductive falsification is impossible in nearly every scientific context. In this section, I review two reasons for this impossibility. (1) Hypotheses are not models. The relations among hypotheses and different kinds of models are complex. Many models correspond to the same hypothesis, and many hypotheses correspond to a single model. This makes strict falsification impossible. (2) Measurement matters. Even when we think the data falsify a model, another observer will debate our methods and measures. They don’t trust the data. Sometimes they are right. For both of these reasons, deductive falsification never works. The scientific method cannot be reduced to a statistical procedure, and so our statistical methods should not pretend.

      Seems consistent with how Popper used the terms [[falsification]] and [[falsifiability]] noted here

    2. So where do priors come from? They are engineering assumptions, chosen to help the machine learn. The flat prior in Figure 2.5 is very common, but it is hardly ever the best prior. You’ll see later in the book that priors that gently nudge the machine usually improve inference. Such priors are sometimes called regularizing or weakly informative priors. They are so useful that non-Bayesian statistical procedures have adopted a mathematically equivalent approach, penalized likelihood. These priors are conservative, in that they tend to guard against inferring strong associations between variables.

      p. 35 where [[Richard McElreath]] defines [[weakly informative priors]] aka [[regularizing priors]] in [[Bayesian statistics]]. Notes that non-Bayesian methods have a mathematically equivalent approach called [[penalized likelihood]].
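      The equivalence between a regularizing prior and penalized likelihood can be seen in a small numerical sketch. The setup is an assumption chosen for illustration and is not taken from the book: data drawn from a Normal with known standard deviation `sigma`, a Normal(0, `tau`) prior on the mean, and an L2-penalized least-squares estimate with penalty `lam = (sigma / tau) ** 2`. The posterior mode found by grid search should match the penalized estimate.

```python
def log_posterior(mu, data, sigma, tau):
    """Unnormalized log posterior: Normal likelihood plus Normal(0, tau) prior."""
    log_likelihood = sum(-0.5 * ((y - mu) / sigma) ** 2 for y in data)
    log_prior = -0.5 * (mu / tau) ** 2
    return log_likelihood + log_prior


def penalized_mean(data, lam):
    """Minimizer of sum((y - mu)**2) + lam * mu**2, in closed form."""
    return sum(data) / (len(data) + lam)


# Hypothetical observations and scales.
data = [2.1, 1.7, 2.6, 2.0]
sigma, tau = 1.0, 0.5

# Posterior mode (MAP estimate) by brute-force grid search over candidate means.
grid = [i / 1000 for i in range(-5000, 5001)]
map_estimate = max(grid, key=lambda mu: log_posterior(mu, data, sigma, tau))

# The same estimate from penalized least squares with lam = sigma^2 / tau^2.
penalized_estimate = penalized_mean(data, lam=(sigma / tau) ** 2)
sample_mean = sum(data) / len(data)

print(map_estimate, penalized_estimate, sample_mean)
```

      Both estimates sit below the plain sample mean: the prior, or equivalently the penalty, pulls the estimate toward zero. This is the conservative behavior the passage describes.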

    3. Andrew Gelman’s

      Per Andrew Gelman's wiki:

      Andrew Eric Gelman (born February 11, 1965) is an American statistician and professor of statistics and political science at Columbia University.

      Gelman received bachelor of science degrees in mathematics and in physics from MIT, where he was a National Merit Scholar, in 1986. He then received a master of science in 1987 and a doctor of philosophy in 1990, both in statistics from Harvard University, under the supervision of Donald Rubin.[1][2][3]

  22. Apr 2023
  23. Mar 2023
    1. Basic statistics regarding the TLL:

      - ancient Latin vocabulary words: ca. 55,000 words
      - 10,000,000 slips
      - ca. 6,500 boxes
      - ca. 1,500 slips per box
      - library: 32,000 volumes
      - contributors: 375 scholars from 20 different countries
      - 12 Indo-European specialists
      - 8 Romance specialists
      - 100 proof-readers
      - ca. 44,000 words published
      - published content: 70% of the entire vocabulary
      - print run: 1,350
      - Publisher: consortium of 35 academies from 27 countries on 5 continents

      Longest remaining words:

      - non / 37 boxes of ca. 55,500 slips
      - qui, quae, quod / 65 boxes of ca. 96,000 slips
      - sum, esse, fui / 54.5 boxes of ca. 81,750 slips
      - ut / 35 boxes of ca. 52,500 slips

      Note that some of these words have individual zettelkasten for themselves approaching the size of some of the largest personal collections we know about!

      [18:51]
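
      As a rough sanity check on the figures above, the slip counts can be re-derived from the box counts. This is a sketch that assumes every box holds the rounded "ca. 1,500" slips given in the source; the variable names are illustrative.

```python
# Figures quoted in the annotation above (all approximate, "ca." in the source).
TOTAL_SLIPS = 10_000_000
TOTAL_BOXES = 6_500
SLIPS_PER_BOX = 1_500  # the rounded per-box figure given in the source

# word -> (boxes, slips as quoted in the source)
remaining = {
    "non": (37, 55_500),
    "qui, quae, quod": (65, 96_000),
    "sum, esse, fui": (54.5, 81_750),
    "ut": (35, 52_500),
}

# Average implied by the two totals, for comparison with the rounded figure.
implied_per_box = TOTAL_SLIPS / TOTAL_BOXES

# Simple estimate: boxes times the rounded slips-per-box figure.
estimates = {word: boxes * SLIPS_PER_BOX for word, (boxes, _) in remaining.items()}

for word, (boxes, quoted) in remaining.items():
    print(f"{word}: {boxes} boxes -> ~{estimates[word]:,.0f} slips (quoted: ca. {quoted:,})")
print(f"Implied average from the totals: {implied_per_box:,.0f} slips per box")
```

      Under this assumption the estimate reproduces the quoted counts for non, sum and ut, and comes out somewhat higher than the quoted figure for qui, quae, quod, which is unsurprising given that all the inputs are rounded.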

    1. Statistics collected in hundreds of cities in the United States show that between a third and a half of the school children fail to progress through the grades at the expected rate; that from 10 to 15 per cent are retarded two years or more; and that from 5 to 8 per cent are retarded at least three years. More than 10 per cent of the $400,000,000 annually expended in the United States for school instruction is devoted to re-teaching children what they have already been taught but have failed to learn.

      I find this information interesting because it tells us that between a third and a half of school children fail to progress through the grades at the expected rate. I think we need to incorporate different learning styles, because an individual may not understand a concept the way it is being taught. People learn in different ways, such as hands-on, auditory, and visual learning. I think one reason more than 10% of the $400,000,000 goes to re-teaching children what they were taught but failed to learn is that there may be something up ahead that they will need to understand. I have been retaught certain things when I moved up to the next grade level, and I think it was to help refresh my memory. Another reason may be that the students didn't understand the concept and need to be retaught so they can use it in the future.

  24. Jan 2023
  25. Dec 2022
    1. Aleatoric music (also aleatory music or chance music; from the Latin word alea, meaning "dice") is music in which some element of the composition is left to chance, and/or some primary element of a composed work's realization is left to the determination of its performer(s). The term is most often associated with procedures in which the chance element involves a relatively limited number of possibilities.

      https://en.wikipedia.org/wiki/Aleatoric_music

  26. Nov 2022
  27. Sep 2022
  28. Aug 2022
    1. ReconfigBehSci. (2021, December 9). a rather worrying development- a (local) newspaper “fact checking” the new German health minister simply by interviewing a virologist who happens to have a different view. There’s simply no established “fact” as to the severity of omicron in children at this point in time [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1469037817481334786

    1. we launched a service that’s now used by over a million people around the world who have made nearly 40 million annotations. In higher education, more than 1,200 colleges and universities use Hypothesis. And we’ve grown from a handful of people into a team of more than 35 passionate web builders.

      Hypothesis in 2022 has over 1 million users, who have made nearly 40 million annotations. Earlier this year it reached 2 million annotated articles/sites (2175298 is the number the API returns today). This sounds like a lot, but on its face it works out to an average of 40 annotations on 2 articles per user. This suggests to me that the mode is 1 annotation on 1 article per user. How many of those 1 million were active last week / month?
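
      The back-of-the-envelope averages above can be written out as follows. The user and annotation totals are the rounded figures from the quote, and the ten-user sample at the end is invented purely to show how a mean of 40 can coexist with a mode of 1; it is not real usage data.

```python
from statistics import mean, mode

# Rounded figures from the quoted text and the annotation above.
users = 1_000_000                # "over a million people"
annotations = 40_000_000         # "nearly 40 million annotations"
annotated_documents = 2_175_298  # count reported in the annotation

annotations_per_user = annotations / users
documents_per_user = annotated_documents / users

print(f"annotations per user: {annotations_per_user:.1f}")
print(f"documents per user:   {documents_per_user:.2f}")

# Hypothetical skewed sample: nine users with a single annotation each
# and one heavy user. Invented for illustration only.
toy_counts = [1] * 9 + [391]
print(f"toy mean: {mean(toy_counts)}, toy mode: {mode(toy_counts)}")
```

      A heavily skewed distribution like the toy one is what the guess about a mode of 1 implies: a small number of heavy users would account for most of the total.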

    1. Otto Karl Wilhelm Neurath (German: [ˈnɔʏʀaːt]; 10 December 1882 – 22 December 1945) was an Austrian-born philosopher of science, sociologist, and political economist. He was also the inventor of the ISOTYPE method of pictorial statistics and an innovator in museum practice. Before he fled his native country in 1934, Neurath was one of the leading figures of the Vienna Circle.

      https://en.wikipedia.org/wiki/Otto_Neurath

  29. Jun 2022
    1. First, the so-called normal distribution of statistics assumes that there are default humans who serve as the standard that the rest of us can be accurately measured against.

      "so-called"?! wow! This is a massively divergent viewpoint.

  30. May 2022
    1. The highlights you made in FreeTime are preserved in My Clippings.txt, but you can’t see them on the Kindle unless you are in FreeTime mode. Progress between FreeTime and regular mode are tracked separately, too. I now pretty much only use my Kindle in FreeTime mode so that my reading statistics are tracked. If you are a data nerd and want to crunch the data on your own, it is stored in a SQLite file on your device under system > freetime > freetime.db.

      FreeTime mode on the Amazon Kindle will provide you with reading statistics. You can find the raw data as an SQLite file under system > freetime > freetime.db.
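
      A first step in crunching that data could be to list the tables in a copy of the file. This is a minimal sketch using Python's standard `sqlite3` module; the schema of freetime.db is not described in the source, so the code only discovers table names rather than assuming any. The `list_tables` helper and the local path are hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro keeps sqlite3 from creating an empty file if the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Hypothetical location of a copy of the file taken from the device.
db_copy = Path("freetime.db")
if db_copy.exists():
    for table in list_tables(db_copy):
        print(table)
```

      Working on a copy and opening it read-only avoids modifying the file the device itself uses.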

    1. According to a Pew study from last year, only 20 percent of K-12 students in America study a foreign language (compared with an average of 92 percent in Europe), and only 10 states and the District of Columbia make foreign-language learning a high school graduation requirement.

      use of statistics

  31. Apr 2022
    1. ReconfigBehSci. (2022, January 24). @STWorg @FraserNelson @GrahamMedley no worse- he took Medley’s comment that Sage model the scenarios the government asks them to consider to mean that they basically set out to find the justification for what the government already wanted to do. Complete failure to distinguish between inputs and outputs of a model [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1485625862645075970

    1. the Institute of Medicine had released a landmark report on patient safety, To Err Is Human. The report found that as many as 98,000 Americans were dying each year as a result of preventable medical errors occurring in hospitals—more people than succumbed to car accidents, workplace injuries, or breast cancer. And some significant portion of these deaths involved mistakes in the dispensing of drugs.

      Some might now look at the 98,000 deaths from preventable medical errors reported by the Institute of Medicine in To Err Is Human (1999) and laugh, given the farcical number of deaths due to coronavirus since 2020, a large proportion of which could have been prevented through better communication and coordination.

      What if a more pragmatic anthropological viewpoint could be given to the current fractured state of American politics? If anthropologists are taught not to make value judgements on the way other cultures have come to live their lives, but simply to appreciate and report on them accurately, then perhaps we should leave those on the far right who believe in top down, patriarchal rule to their devices?

      What if we nudged (forced) them all to actually live by their own rules by enforcing them to the nth degree? Republican politicians can only get away with badmouthing abortion or homophobic viewpoints because their feet are not held to the fire when those issues impinge upon their own families or even themselves. They have the wealth and the power to flout the laws and not face the direct consequences personally. Would their tunes change if forced by their own top down patriarchal perspectives applying to them?

    1. ReconfigBehSci. (2021, February 1). @MaartenvSmeden @richarddmorey 2/2 Having conducted experiments on lay understanding of arguments from ignorance, in my experience, people intuitively understand probabilistic impact of factors, such as quality of search, that moderate strength. Rather than build on that, we work against it with slogan! [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1356228495714746370

    1. ReconfigBehSci. (2021, February 2). @MichaelPaulEdw1 @islaut1 @ToddHorowitz3 @richarddmorey @MaartenvSmeden as I just said to @islaut1 if you want to force the logical contradiction you move away entirely from all of the interesting cases of inference from absence in everyday life, including the interesting statistical cases of, for example, null findings—So I think we now agree? [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1356530759016792064

    1. ReconfigBehSci. (2021, February 1). @islaut1 @richarddmorey I think of strength of inference resting on P(not E|not H) (for coronavirus case). Search determines the conditional probability (and by total probability of course prob of evidence) but it isn’t itself the evidence. So, was siding with R. against what I thought you meant ;-) [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1356216290847944706

    1. ReconfigBehSci. (2021, February 1). @MaartenvSmeden @richarddmorey you absolutely did (and I would have been disappointed if you hadn’t ;-)! It was a general comment prompted by the fact that the title of the article you linked to doesn’t (as is widespread), and I actually genuinely think this is part of the “problem” in pedagogical terms. 1/2 [Tweet]. @SciBeh. https://twitter.com/SciBeh/status/1356227423067664384

    1. Maarten van Smeden. (2021, February 1). Personal top 10 fallacies and paradoxes in statistics 1. Absence of evidence fallacy 2. Ecological fallacy 3. Stein’s paradox 4. Lord’s paradox 5. Simpson’s paradox 6. Berkson’s paradox 7. Prosecutors fallacy 8. Gambler’s fallacy 9. Lindsey’s paradox 10. Low birthweight paradox [Tweet]. @MaartenvSmeden. https://twitter.com/MaartenvSmeden/status/1356147552362639366

    1. Youyang Gu. (2021, May 25). Is containing COVID-19 a requirement for preserving the economy? My analysis suggests: Probably not. In the US, there is no correlation between Covid deaths & changes in unemployment rates. However, blue states are much more likely to have higher increases in unemployment. 🧵 https://t.co/JrikBtawEb [Tweet]. @youyanggu. https://twitter.com/youyanggu/status/1397230156301930497

    1. Denise Dewald, MD 🗽. (2021, August 12). Here are some modeling predictions for the delta variant from COVSIM (group at North Carolina State): PLEASE CHECK THIS OUT - RESOURCES TO SHARE WITH YOUR SCHOOL DISTRICT School-level COVID-19 Modeling Results for North Carolina for #DeltaVariant https://t.co/zU5hB9bKlY [Tweet]. @denise_dewald. https://twitter.com/denise_dewald/status/1425626289399009288

    1. A New York Times article uses the same temperature dataset you have been using to investigate the distribution of temperatures and temperature variability over time. Read through the article, paying close attention to the descriptions of the temperature distributions.

      Unfortunately, like most NYT content, this article is behind a paywall. I'm partly reading this as I plan to develop a set of open education resources myself and the problem of how to manage dead/unavailable links looks like a key stumbling block.