75 Matching Annotations
  1. Last 7 days
    1. TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet.

      The author claims that a coordinator with only 20K parameters can outperform top-tier large models such as GPT-5. This conclusion runs counter to the industry's prevailing understanding of the relationship between model scale and capability, making it a strikingly counterintuitive claim.

    2. In nature, complex problems are rarely solved by a single monolithic entity, but rather by the coordinated efforts of specialized individuals working together.

      The author uses natural ecosystems as an analogy, implying AI development should follow the principle of biological diversity rather than the industry's prevailing pursuit of a single large model. This contrasts sharply with the mainstream direction of AI development and offers a counterintuitive biological perspective.

    1. At 50 million tokens, the design space for AI applications changes fundamentally.

      The article notes that a 50-million-token context will fundamentally change the design space for AI applications. This is a forward-looking data point on the long-term potential of the SubQ technology: although the current product supports only 1 million tokens, the architecture is already designed to support much larger-scale applications.

  2. May 2026
    1. Almost all the evidence points to very fast software progress: each year, the training compute needed to get to the same capability declines several times — possibly even ten times or more.

      Progress much faster than thought

      Most believe AI progress comes primarily from scaling compute, but the author shows software progress could be 10x+ per year, dwarfing compute scaling.
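
A quick compounding calculation makes the point concrete; the 10x figure is the annotation's upper estimate and is used here purely for illustration:

```python
# Back-of-envelope arithmetic under the annotation's assumption of a roughly
# 10x/year decline in the training compute needed for fixed capability.
factor_per_year = 10
years = 3
total_decline = factor_per_year ** years  # compounding across years
print(total_decline)  # 1000: same capability for 1000x less compute after 3 years
```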

    1. The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining. A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.

      Most people assume a model's capability is bounded by its scale and training data, so improving performance requires a larger model or retraining. The author instead proposes that a small model can dynamically extend its capability at inference time by calling itself recursively, reaching answers no single pass could, without any retraining. This challenges the industry consensus that scale equals capability and suggests small models may achieve breakthrough performance through self-reading.

  3. Apr 2026
    1. The architecture scales horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously, a substantial expansion from K2.5's 100 sub-agents and 1,500 steps.

      Most people assume scaling AI systems mainly means increasing a single model's compute and parameter count, not the number of agents. The 300-agent parallel execution pattern described here challenges that view, suggesting future AI development may center on multi-agent collaboration rather than single-model enhancement, which could redefine the architectural principles of AI systems.

    1. memory-driven experience scaling represents a crucial new frontier for agent scaling

      Most people assume agent scaling should be achieved mainly by adding model parameters or compute. The author instead proposes memory-driven experience scaling as a crucial new frontier, challenging the traditional scaling paradigm and suggesting future AI development may focus on exploiting experience effectively rather than simply growing larger.

    2. existing TTS methods often discard the exploration trajectory and treat the final answer as the only useful outcome

      In test-time scaling (TTS), the mainstream view holds that only the final answer has value and the exploration process is merely a means to it. The author argues the discarded exploration trajectories are in fact a rich data source that can accelerate an agent's learning from experience, challenging how traditional TTS methods assign value.

    1. When a Fugu model is allowed to call itself recursively, reading its own prior output as context and deciding whether to revise its coordination strategy, a new form of test-time scaling emerges.

      Most people assume a model's capability is fixed at training time and that inference merely applies learned knowledge. The author proposes that a Fugu model can extend its capability at inference time through recursive self-calls, challenging the assumed limits of the inference stage and suggesting small models may iterate their way beyond their initial capability level.

    2. The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining. A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.

      Most people assume improving model performance requires more parameters or retraining. The author proposes a counterintuitive alternative: by calling itself recursively, a small model can iterate at inference time toward answer quality that a single pass cannot reach, challenging conventional assumptions about the relationship between model scale and capability.
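
The recursive-refinement idea can be sketched minimally. Everything below is illustrative, not the paper's implementation; `model` is a stub standing in for a call to a small coordinator model:

```python
def model(prompt, prior):
    # Stub standing in for a small-model call; a real system would generate
    # or revise an answer here. Each pass appends a token to mimic refinement.
    return (prior or "") + "x"

def recursive_refine(prompt, depth):
    """Run `depth` self-reading passes; depth is the tunable compute axis."""
    answer = None
    for _ in range(depth):
        # The model reads its own prior output and decides how to revise it.
        answer = model(prompt, answer)
    return answer

# Deeper recursion spends more inference-time compute on the same weights,
# with no retraining involved.
shallow = recursive_refine("task", depth=1)
deep = recursive_refine("task", depth=4)
print(len(shallow), len(deep))  # 1 4
```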

    1. Parameters are estimated by unweighted least squares. Time t is measured in years since the first observation in each dataset.

      The study estimates parameters by unweighted least squares, with time measured in years from each dataset's first observation. This is standard statistical practice, but the absence of weighting may understate recent data points, which typically represent more advanced model capabilities. The choice of time unit also affects how intuitively the growth rate can be interpreted.
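
The estimation setup described can be sketched with synthetic numbers; only the method (unweighted least squares, t in years since the first observation) mirrors the quote, and the data below is invented for illustration:

```python
# Unweighted least-squares fit of an exponential efficiency trend.
# The compute figures are synthetic, chosen to show the mechanics only.
import numpy as np

dates = np.array([2020.0, 2021.0, 2022.0, 2023.0, 2024.0])
compute = np.array([100.0, 35.0, 11.0, 3.8, 1.2])  # relative compute for fixed capability

t = dates - dates[0]                                  # years since first observation
slope, intercept = np.polyfit(t, np.log(compute), 1)  # unweighted OLS on the log scale
annual_factor = np.exp(-slope)                        # x-fold compute decline per year

print(round(annual_factor, 1))  # 3.0
```

Weighting recent points more heavily (e.g. via the `w` argument to `np.polyfit`) would be the natural variant the note hints at.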

    1. On a 150-class benchmark, the surrogate fully replaces the teacher

      Most people assume complex classification tasks require large models and that small surrogates can only handle simple ones. The author shows that on a 150-class task a small surrogate model can fully replace the teacher, challenging the 'bigger is better' consensus and demonstrating the potential of efficient routing.

    1. The real unlock is compound scaling—token spend grows linearly while output grows exponentially.

      Most people assume AI output scales in proportion to input. The author argues output can grow exponentially while token spend grows only linearly, which challenges conventional business intuition and suggests AI can deliver outsized returns, though it may also obscure the risk that AI's actual benefits are being overstated.

    1. At 770M parameters, a looped model achieves the downstream quality of a 1.3B fixed-depth Transformer trained on the same data — roughly half the parameters for the same quality.

      This finding is potentially disruptive: it suggests looped models may far exceed conventional Transformers in parameter efficiency. If it holds, the direction of large-model development may need rethinking: rather than endlessly adding parameters, optimize the design of looped architectures. This challenges the prevailing 'bigger is better' view.

    1. They will have to go through our journey above of ingesting data, collecting tribal knowledge, and more – and they will have to do so for each individual customer they work with.

      This point exposes the challenge facing specialized context vendors: they must repeat the complex process of data ingestion and tribal-knowledge collection for every individual customer. It implies the industry may need more standardized ways of building the context layer to reduce implementation cost and complexity.

    1. Because small, cheap, fast models are sufficient for much of the detection work, you don't need to judiciously deploy one expensive model and hope it looks in the right places. You can deploy cheap models broadly, scanning everything, and compensate for lower per-token intelligence with sheer coverage and lower cost-per-token.

      This proposes a new economic model for AI security: deploy many small, cheap models broadly to compensate for the limits of a single large model. This 'cast a wide net' strategy may be more effective than relying on a few expensive models, especially for scanning large codebases, and offers a new path to economically viable AI security.
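
The economics can be made concrete with made-up prices (all figures below are hypothetical, chosen only to illustrate the coverage-vs-intelligence trade-off):

```python
# Illustrative cost comparison: scan everything with a cheap model vs.
# scanning a hand-picked tenth of the codebase with an expensive one.
cheap_cost_per_mtok = 0.10      # hypothetical $/1M tokens
expensive_cost_per_mtok = 10.0  # hypothetical $/1M tokens
codebase_mtok = 500             # 500M tokens to scan

scan_everything_cheap = codebase_mtok * cheap_cost_per_mtok            # full coverage
scan_tenth_expensive = 0.1 * codebase_mtok * expensive_cost_per_mtok   # partial coverage

print(scan_everything_cheap, scan_tenth_expensive)
```

Under these numbers, full coverage with the cheap model costs a tenth of partial coverage with the expensive one, which is the "sheer coverage" argument in the quote.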

    2. The capability rankings reshuffled completely across tasks. There is no stable best model across cybersecurity tasks. The capability frontier is jagged.

      This finding reveals a 'jagged frontier' in AI security capability: different models perform very differently across security tasks. There is no one-size-fits-all best security model; the right model must be chosen per task, which has important implications for designing AI security systems.

    1. we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick.

      A striking efficiency gain: reaching the same capability with an order of magnitude less compute than the previous model suggests Meta has made a breakthrough in architectural optimization, potentially redefining the economics of large-model training.

    2. scaling Muse Spark with multi-agent thinking enables superior performance with comparable latency.

      Surprisingly, by scaling the number of parallel agents rather than extending a single agent's thinking time, Muse Spark achieves better performance at comparable latency. This multi-agent approach to reasoning challenges the paradigm of improving performance by spending more sequential compute, and points to a new route to efficient inference.

    3. we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick.

      Surprisingly, Meta claims its new model Muse Spark achieves a breakthrough in compute efficiency, matching the capability of the previous model Llama 4 Maverick with a tenth of the compute. An order-of-magnitude efficiency gain is extremely rare in AI and may represent a major innovation in training algorithms and architecture design.

    1. We see continued gains from inference scaling on larger projects, suggesting they may be solvable given enough tokens.

      This finding shows a positive relationship between AI performance and inference compute, suggesting larger compute budgets may crack more complex programming tasks. It offers an important clue about the boundary of AI capability and raises deeper questions about the relationship between compute investment and capability gains.

    1. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters.

      Surprisingly, a single H200 GPU with 1.5TB of host memory suffices to train a 120B-parameter model, breaking the assumption that very large models require multi-GPU clusters. This could make training extremely large models far more accessible and economical.

    1. we discover a clear scaling law: as the underlying foundation models improve, the quality of the generated papers increases correspondingly.

      AI Scientist exhibits a 'paper-quality scaling law': the stronger the underlying model, the higher the quality of the generated papers. The implication is chilling: as GPT-5, Claude Opus 4.6, Gemini 3.1, and their successors keep improving, the quality of AI Scientist's papers will rise automatically, with no additional engineering effort. AI accelerates research, and stronger AI in turn accelerates AI's own research: this is the first empirically supported evidence of that positive feedback loop.

    1. The 19th-century economist Jevons observed that as improvements in steam-engine efficiency raised the efficiency of coal consumption, overall consumption actually increased.

      Using the Jevons paradox to explain inference-time scaling is a brilliant framing choice. Efficiency gains lead to greater overall consumption, which is exactly what happened after o1/R1-style reasoning models appeared: each inference is more expensive, yet people willingly spend more compute on harder problems. Sakana uses a 19th-century economic paradox to give a 2026 AI product strategy a persuasive theoretical backdrop; in technical marketing, historical analogy is one of the most effective tools for building credibility.

    1. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time.

      Most AI researchers assume that longer reasoning means more thorough exploration and thus better results. The author challenges this consensus, arguing that errors within the reasoning path grow concurrently with test time, so longer reasoning can actually degrade quality: a genuinely contrarian claim.

    1. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.

      Most people assume larger architectures necessarily perform better. Through data engineering and training-strategy optimization alone, keeping the 1.2B-parameter architecture unchanged, the authors surpass existing models with over 200x more parameters, challenging the 'bigger is better' consensus and demonstrating the importance of data quality.

    1. Sandboxes made for running tens of thousands of agents

      Most people assume running tens of thousands of AI agents on a single system is unrealistic, leading to resource contention and degraded performance. Freestyle makes this an explicit design goal, implying its architecture may redefine the scale boundary for AI agents and challenging mainstream assumptions about AI-system scalability.

  4. Jan 2026
    1. But there's one post-American system that's easy to imagine. The project to rip out all the cloud connected, backdoored, untrustworthy black boxes that power our institutions, our medical implants, our vehicles and our tractors; and replace it with collectively maintained, open, free, trustworthy, auditable code. This project is the only one that benefits from economies of scale, rather than being paralyzed by exponential crises of scale. That's because any open, free tool adopted by any public institution – like the Eurostack services – can be audited, localized, pen-tested, debugged and improved by institutions in every other country.

      The digital transition is possible because it scales through spreading; you don't have to solve exponential scale first.

    1. For these individual successes to scale into a continent-wide shift, however, structural barriers must be addressed. The path to digital sovereignty is not a single, grand gesture but a series of deliberate, often difficult, choices. The examples from Austria, France, and the ICC show that the journey begins with a single, courageous step, often prompted by the mundane reality of a data protection assessment.

      It's not about scaling, it's about spreading. Not the same thing. No scaler involved. Which is said in the second sentence: 'not a single grand gesture but a series'

  5. Dec 2025
    1. * "Digital platforms are used for hybrid campaigns." * "EU can't compete with US tech ON THEIR TERMS." * "Post-reality US is what happens when tech is unregulated." * "Ireland is a Trojan Horse for Big Tech." * "The Digital Omnibus is sabotage."

      Quotes from the [[Defend Democracy o]] event, co-hosted with the DK EU presidency. All convey an aspect of where work is needed. For each, one could define [[Handelen 20040327155224]] as [[SC landscape van EU Dataspace]] interventions and broader.

      The last one pertains to the AI / GDPR omnibus, not the data one, I think.

    1. would take seriously the fact that intelligence is now being scaled and distributed through organizations long before it is unified or fully understood

      There's no other way: understanding comes from using it, and having stuff go wrong. The scandals around algos are important in this. Scale and distribution are different beasts. Distribution does not need scale (though a network effect helps) in order to work. The need for scale in digital is an outcome of the financing structure and chosen business model, and is essentially the power grab. #openvraag: how do you put more focus on distribution as a counterforce against actors' hunger for scale?

    1. That raises fixed costs and favors scale—paradoxically advantaging the largest platforms that can afford regional bifurcation while crushing subscale competitors.

      Not really; it does not favor scale, as compliance is progressive. The next sentence says as much. One platform's bifurcation is the same as having two smaller platforms, which face lower compliance costs and thus won't be crushed. This is already the case: global platforms already cater to different regulatory regimes (and in morally questionable ways at that). The underlying faulty assumption is that global platforms for everyone and everything are the desired outcome at all. SV thinking and funding is the root cause. Other paths exist, just not in their world. Zebras, not unicorns. [[Zebra bedrijven zijn beter 20190907063530]] Cf. the physical economy: the German industrial base, for example, actually consists of many medium-sized organizations that are market leaders in some niche, not the car manufacturers usually cited as such.

    1. This will be helped by increased "re-shoring" efforts of many Western Governments to bring "technological sovereignty" to web infrastructure, reducing our reliance on such a small group of American companies. This results in a whole bunch of small, local web hosting services popping up in each country, offering forum hosting with the servers within the country with technical and legal structures especially tailored for their respective markets.

      Draws a parallel between federation / small multiples and tech sovereignty away from siloed hyperscalers. Cf. my notes on alternatives to Silicon Valley-style scaling as the only game in town.

    1. The new litmus test isn’t “Does it scale?” It’s: “Does it spread? Does it take root? Can it compost and regrow?”

      Very much yes. Scaling is a useless metaphor here; spread and evolution fit much better. Effective behaviour is contagious. Invisible hand of networks / communities [[Of Scaling TV Salons and the Invisible Hand of Networks – Interdependent Thoughts 20250803205329]]

  6. Jul 2025
  7. Apr 2025
    1. What’s more, the so-called scaling laws aren’t universal laws like gravity but rather mere observations that might not hold forever, much like Moore’s law, a trend in computer chip production that held for decades but arguably began to slow a decade ago.

      example scaling

    2. There are serious holes in the scaling argument. To begin with, the measures that have scaled have not captured what we desperately need to improve: genuine comprehension. Insiders have long known that one of the biggest problems in AI research is the tests (“benchmarks”) that we use to evaluate AI systems

      arg scaling

  8. Apr 2024
  9. Nov 2023
    1. Detailed commentary on the 2.4 trillion (i.e. thousand billion; mistranslated in the article) dollars that, according to the 2022 COP27 report, are needed to finance climate mitigation and adaptation in the countries of the Global South (excluding China). The consensus-driven COP process is said to be incapable of taking the necessary decisions. The amount roughly corresponds to current worldwide military spending. https://www.repubblica.it/commenti/2023/11/19/news/cambiamenti_climatici_spesa_annua-420689085/?ref=RHRT-BG-I279994148-P4-S3-T1

  10. Oct 2023
  11. Sep 2023
    1. folgezettel pushes the note maker toward making at least one connection at the time of import.

      There is a difference between the sorts of links one might make when placing an idea into an (analog) zettelkasten. A folgezettel link is more valuable than a simple tag/category link because it places an idea into a more specific neighborhood than any handful of tags. This is one of the benefits of a Luhmann-artig ZK system over a more traditional commonplace one, particularly when the work is done up front instead of being punted to a later time.

      For those with a 1A2B3Z linking system (versus a pure decimal system), it may be more difficult to insert a card before other cards rather than after them because of the potential gymnastics of numbering and the natural tendency to put things into a continuing linear order.

      See also: - https://hypothes.is/a/ToqCPq1bEe2Q0b88j4whwQ - https://hyp.is/WtB2AqmlEe2wvCsB5ZyL5A/docdrop.org/download_annotation_doc/Introduction-to-Luhmanns-Zette---Ludecke-Daniel-h4nh8.pdf

  12. Aug 2023
    1. Last fall, I spent several days in New York City, during which time I visited a home owned by a group of pacifist Christians that lives from a common purse—meaning the members do not have privately held property but share their property and money. Their simple life and shared finances allow their schedules to be more flexible, making for a thicker immediate community and greater generosity to neighbors, as well as a richer life of prayer and private devotion to God, all supported by a deep commitment to their church.This is, admittedly, an extreme example. But this community was thriving not because it found ways to scale down what it asked of its members but because it found a way to scale up what they provided to one another.

      fascinating example of anti toxic capitalism...

  13. Jun 2023
    1. I’ll be speaking with and writing about people working on some of the tools and communities that I think help point ways forward—and with people who’ve built fruitful, immediately useful theories and practices

      Sounds interesting. Add to feeds. Wrt [[Invisible hand of networks 20180616115141]]: scaling comes from moving sideways, repetition, and replication. And that takes gathering and sharing (through the network) of examples. Cf. [[OurData.eu Open Data Voorbeelden 20090720142847]], but for civic tech and social software? What would it look like?

  14. Mar 2023
    1. Fig. 9. Normally, however, a detailed internal fine-sorting of the evidence slips for frequent words was worked out. Naturally, each dimension of the analysis (chronological, grammatical, semantic, graphic) could have formed the basis of its own sort order.

      Alternate sort orders for the slips for the Wb include chronological, grammatical, semantic, and graphic, but for teasing out the meanings the original sort order was sufficient. Certainly other sort orders may reveal additional subtleties.

  15. Feb 2023
    1. Folgezettel

      Do folgezettel in combination with an index help to prevent over-indexing behaviors? Or the scaling problem of categorization in a personal knowledge management space?

      Where do subject headings within a zettelkasten dovetail with the index? Where do they help relieve the idea of heavy indexing or tagging? How are the neighborhoods of ideas involved in keeping a sense of closeness while still allowing density of ideas and information?

      Having digital search views into small portions of neighborhoods like gxabbo suggested can be a fantastic affordance. see: https://hypothes.is/a/W2vqGLYxEe2qredYNyNu1A

      For example, consider an anthropology student who intends to spend a lifetime in the subject and its many sub-areas. If they begin smartly tagging things with anthropology as they start, eventually the value of the category, any tags, or ideas within their index will eventually grow without bound to the point that the meaning or value as a search affordance within their zettelkasten (digital or analog) will be utterly useless. Let's say they fix part of the issue by sub-categorizing pieces into cultural anthropology, biological anthropology, linguistic anthropology, archaeology, etc. This problem is fine while they're in undergraduate or graduate school for a bit, but eventually as they specialize, these areas too will become overwhelming in terms of search and the search results. This problem can continue ad-infinitum for areas and sub areas. So how can one solve it?

      Is a living and concatenating index the solution? The index can have anthropology with sub-areas listed with pointers to the beginnings of threads of thought in these areas which will eventually create neighborhoods of these related ideas.

      The solution is far easier when the ideas are done top-down after-the-fact like in the Dewey Decimal System when the broad areas are preknown and pre-delineated. But in a Luhmann-esque zettelkasten, things grow from the bottom up and thus present different difficulties from a scaling up perspective.

      How do we classify first, second, and third order effects which emerge out of the complexity of a zettelkasten?
      - Sparse indexing can be a useful long term affordance in the second or third order space.
      - Combinatorial creativity and ideas of serendipity emerge out of at least the third order.
      - Using ZK for writing is a second order affordance.
      - Storage is a first order affordance.
      - Memory is a first order affordance (related to storage).
      - Productivity is a second+ order affordance (because solely spending the time to save and store ideas is a drag at the first order and doesn't show value until retrieval at a later date).
      - Poor organization can be a non-affordance or deterrent which results in a scrap heap.
      - Lack of a reason why can be a non-affordance or deterrent as well.
      - Cross reference this list and continue on with other pieces and affordances.

  16. Jan 2023
  17. Dec 2022
    1. Isaac and I uh with another colleague we did a little bit of work trying to look at what would the Swedish policy or the UK policy indeed look like if it was carried out globally and it would look at something like two and a half degrees Centigrade of warming if 00:31:58 not more

      !- key point : Sweden's net zero plan scaled globally - would result in a 2.5 deg C or greater world

  18. Sep 2021
  19. Apr 2021
  20. Mar 2021
  21. Jan 2021
    1. Scale, inevitably leads to power-law distributed outcomes, leading to the inevitable concentration of talent and resources among a few investigators pursuing a few lines of inquiry, and their pale second-rate imitators. Through this mechanism science at scale reinforces (and in fact, under sufficient political capture imposes) consensus, further annihilating the possibility of the necessary revolutionary synthesis of ideas.

      The solution to any sort of global leaderboard thing - like Mendeley most read or most cited - is to break it up into local communities - most read among your friends or most cited within only the outer leaves of the topic tree.

  22. Dec 2020
  23. Aug 2020
  24. Jun 2020
  25. May 2020
    1. Why: General infrastructure simply takes time to build. You have to carefully design interfaces, write documentation and tests, and make sure that your systems will handle load. All of that is rival with experimentation, and not just because it takes time to build: it also makes the system much more rigid. Once you have lots of users with lots of use cases, it’s more difficult to change anything or to pursue radical experiments. You’ve got to make sure you don’t break things for people or else carefully communicate and manage change. Those same varied users simply consume a great deal of time day-to-day: a fault which occurs for 1% of people will present no real problem in a small prototype, but it’ll be high-priority when you have 100k users. Once this playbook becomes the primary goal, your incentives change: your goal will naturally become making the graphs go up, rather than answering fundamental questions about your system.

      The reason the conceptual architecture tends to freeze is because there is a tradeoff between a large user base and the ability to run radical experiments. If you've got a lot of users, there will always be a critical mass of complaints when the experiment blows up.

      Secondly, it takes a lot of time to scale up. This is time that you cannot spend experimenting.

      Andy here is basically advocating remaining in Explore mode a little bit longer than is usually recommended. Doing so will increase your chances of climbing the highest peak during the Exploit mode.

    2. This is obviously a powerful playbook, but it should be deployed with careful timing because it tends to freeze the conceptual architecture of the system.

      Once a prototype gains some traction, conventional Silicon Valley wisdom says to scale it up. This, according to Andy Matuschak, has certain disadvantages. The main drawback is that it tends to freeze the conceptual architecture of the system.

  26. Apr 2020
    1. Scaling is hard if you try do it yourself, so absolutely don’t try do it yourself. Use vendor provided, cloud abstractions like Google App Engine, Azure Web Apps or AWS Lambda with autoscaling support enabled if you can possibly avoid it.

      Scaling should be done via cloud abstractions.

  27. Mar 2020
    1. I would like to make an appeal to core developers: all design decisions involving involuntary session creation MUST be made with a great caution. In case of a high-load project, avoiding to create a session for non-authenticated users is a vital strategy with a critical influence on application performance. It doesn't really make a big difference, whether you use a database backend, or Redis, or whatever else; eventually, your load would be high enough, and scaling further would not help anymore, so that either network access to the session backend or its “INSERT” performance would become a bottleneck. In my case, it's an application with 20-25 ms response time under a 20000-30000 RPM load. Having to create a session for an each session-less request would be critical enough to decide not to upgrade Django, or to fork and rewrite the corresponding components.
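
The pattern the commenter advocates, never touching the session backend for requests that don't write session data, can be sketched in miniature. This is a simplified stand-in, not Django's actual `SessionBase`; `FakeBackend` and `LazySession` are invented names for illustration:

```python
# Simplified illustration of lazy session persistence: the backend is only
# hit when session data was actually modified, so anonymous, session-less
# requests cost no backend I/O at all.

class FakeBackend:
    """Stand-in for Redis/DB; counts writes so the effect is visible."""
    def __init__(self):
        self.writes = 0
    def save(self, data):
        self.writes += 1

class LazySession:
    def __init__(self, backend):
        self._backend = backend
        self._data = {}
        self._modified = False
    def __setitem__(self, key, value):
        self._data[key] = value
        self._modified = True
    def flush_to_backend(self):
        # Called at end of request: skip the backend entirely if untouched.
        if self._modified:
            self._backend.save(self._data)

backend = FakeBackend()

# Anonymous request that never writes: no backend write occurs.
anon = LazySession(backend)
anon.flush_to_backend()

# Authenticated request that writes: exactly one backend write.
user = LazySession(backend)
user["user_id"] = 42
user.flush_to_backend()

print(backend.writes)  # 1
```

At 20,000-30,000 RPM, the difference between zero and one backend round-trip per anonymous request is exactly the bottleneck the comment describes.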
  28. Jul 2019
  29. Jun 2019
  30. Mar 2019
  31. Sep 2018
    1. As student numbers have increased, teaching has regressed for a variety of reasons to a greater focus on information transmission and less focus on questioning, exploration of ideas, presentation of alternative viewpoints, and the development of critical or original thinking. Yet these are the very skills needed by students in a knowledge-based society.

      Related to Vijay Kumar's iron triangle. You can't increase the number of students without sacrificing quality or increasing costs.

  32. Sep 2017
  33. Aug 2017
  34. Feb 2017