224 Matching Annotations
  1. Oct 2025
    1. The system's potential is demonstrated through concrete validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of antimicrobial resistance. For instance, it proposed drug candidates for acute myeloid leukemia that showed tumor inhibition in vitro, showcasing its ability to produce genuinely valuable and original scientific insights. This work frames a new vision for AI: a system that can navigate the high-dimensional space of existing scientific knowledge to discover the unknown.

      align

    2. Proposes a multi-agent system based on Gemini 2.0 to automate hypothesis generation in scientific discovery, using a "generate, debate, and evolve" approach.

      align

    1. A crucial operational maxim is to "be stubborn on the vision and flexible on the details," acknowledging that this flexibility is necessary because the world is changing [1].

      Important

    1. Formally, it minimizes something like log loss or cross-entropy between predicted probabilities and actual observed outcomes — a metric that rewards calibrated truthfulness.

      elaborate
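
      A minimal sketch of the metric in plain Python (standard binary cross-entropy; the numbers are invented for illustration):

      ```python
      import math

      def log_loss(probs, outcomes):
          """Mean negative log-likelihood of observed binary outcomes."""
          nll = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
                 for p, y in zip(probs, outcomes)]
          return sum(nll) / len(nll)

      # A calibrated forecaster scores well; an overconfident one is punished
      # hard the moment a "sure thing" fails to happen.
      print(log_loss([0.9, 0.9, 0.8], [1, 1, 1]))     # ~0.14
      print(log_loss([0.99, 0.99, 0.99], [1, 1, 0]))  # ~1.54
      ```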

    2. In standard RL or reward-maximizing setups, an agent can over-optimize the reward proxy (Goodhart’s law) and exploit loopholes. A Bayesian, inference-only system doesn’t optimize for outcomes—it merely models them. The ensemble of hypotheses provides natural regularization against runaway single-metric optimization.

      elaborate
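
      A toy contrast, with everything below invented for illustration: a single proxy model can be Goodharted by one inflated candidate, while an ensemble of hypotheses damps that same candidate because the models disagree exactly there:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n = 100
      quality = rng.normal(0, 1, size=n)

      # 20 noisy hypotheses about quality; on candidate 17 (a region the data
      # never covered) they disagree wildly, and hypothesis 0 is badly fooled.
      hypotheses = np.stack([quality + rng.normal(0, 0.3, n) for _ in range(20)])
      hypotheses[:, 17] = rng.normal(0, 6, size=20)
      hypotheses[0, 17] = 8.0

      # Optimizing a single proxy chases the loophole:
      print(int(np.argmax(hypotheses[0])))        # -> 17

      # Modeling with the whole ensemble regularizes it away: the mean is
      # small and the disagreement is large exactly at the loophole.
      mean, spread = hypotheses.mean(0), hypotheses.std(0)
      print(int(np.argmax(mean - spread)))        # -> not 17
      ```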

    1. Toronto team’s output is on general AI capabilities (LLMs, generative models, agent learning), rather than domain-specific science tools.

      Important

  2. Sep 2025
    1. the prevailing corporate governance models are fundamentally ill-suited to the unique, long-term safety requirements of AGI development.

      Important

    2. Adapting Geometry for Directedness

      I have a feeling that we can learn the “shape” of the DAG if we investigate it through the lens of Discrete Differential Geometry

    3. The GFlowNet "flow matching" loss condition can be interpreted using this language, relating the flow into a state to the flow out of it.

      elaborate

    4. The production rules of a grammar are the constraints that force the data to a low-dimensional manifold. A molecule's grammar, for instance, prevents the vast majority of random atom-and-bond combinations from ever being formed, concentrating all valid molecules onto a specific, structured surface.

      makes sense

    1. Effect: credit from terminal rewards can reach all ancestors in one step through the balancing equations — no need to wait for step-by-step temporal backups.

      this is so important

    2. So instead of waiting for a single reward at the end of a trajectory, each local edge is trained to satisfy a conservation law consistent with terminal rewards.

      Important

    3. GFlowNets use flow-matching (inflow = outflow) as a local conservation law — mathematically simpler, more stable, better credit assignment.

      elaborate
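
      A runnable miniature of the conservation law on a four-edge DAG. The published GFlowNet losses match flows in log space; this sketch uses plain squared errors for readability, and the graph and reward are invented:

      ```python
      import torch

      # DAG: s0 -> s1 -> s3 and s0 -> s2 -> s3, with terminal reward R(s3) = 4.
      # log_flow holds one learnable log edge flow per edge: 01, 02, 13, 23.
      log_flow = torch.nn.Parameter(torch.zeros(4))
      R = 4.0

      def flow_matching_loss():
          F = log_flow.exp()
          interior = (F[0] - F[2]) ** 2 + (F[1] - F[3]) ** 2  # inflow = outflow
          terminal = (F[2] + F[3] - R) ** 2                   # inflow = reward
          return interior + terminal

      opt = torch.optim.Adam([log_flow], lr=0.1)
      for _ in range(500):
          opt.zero_grad()
          flow_matching_loss().backward()
          opt.step()

      # Each edge flow converges to ~2, so inflow(s3) = 4 = R: credit from the
      # terminal reward has propagated to every ancestor edge via local losses.
      print(log_flow.exp())
      ```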

    1. interpolates/extrapolates reward structure: if many molecules with substructure A had high reward, the policy will bias flow toward other molecules that also contain substructure A — even if it never explicitly saw them.

      important

    1. The above two equations are forced to be consistent (i.e. there is an $F$ that gives rise to both $P_B$ and $P_F$) when $F$ satisfies the flow-matching constraint (the amount of entering flow equals the amount of outgoing flow), which is not necessarily true when $F$ is estimated by a neural network that is being trained (and we have not fully completed training and brought the training loss to 0 everywhere).

      can we force this constraint by design?
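
      One answer from the literature is the trajectory-balance objective (Malkin et al., 2022): skip the flow estimate $F$ and parameterize $\log Z$, $P_F$, and $P_B$ directly, training $Z \prod P_F = R(x) \prod P_B$ per trajectory, so consistency is built into what is learned rather than hoped for. A schematic sketch (the step log-probabilities below are placeholders, not a trained policy):

      ```python
      import torch

      def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
          # (log Z + sum log P_F) should equal (log R(x) + sum log P_B)
          lhs = log_Z + torch.stack(log_pf_steps).sum()
          rhs = log_reward + torch.stack(log_pb_steps).sum()
          return (lhs - rhs) ** 2

      log_Z = torch.nn.Parameter(torch.tensor(0.0))
      log_pf = [torch.log(torch.tensor(p)) for p in (0.5, 0.8)]  # placeholder steps
      log_pb = [torch.log(torch.tensor(p)) for p in (1.0, 0.5)]
      loss = trajectory_balance_loss(log_Z, log_pf, log_pb,
                                     torch.log(torch.tensor(4.0)))
      ```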

    1. The world is combinatorially large (endless possible combinations). Humans generalize well by reusing parts without needing to see every possible combo in training.

      Is this what happens with LLMs, and possibly Stable Diffusion?

    2. A small model might need symbolic rules to parse and generate. A large LLM trained on trillions of tokens can produce fluent sentences directly, implicitly encoding the grammar.

      important. need elaboration

    3. Represented symbolically, compositionally, or in step-by-step logic. Conscious and accessible: you can bring it into working memory, verbalize it, and explain it.

      elaborate

    1. Drug discovery models (like GFlowNets, graph diffusion, VAEs) exploit this structure: they learn to navigate and generate only chemically valid molecules rather than arbitrary graphs.

      important

    1. Instead of treating a design as one huge blob (like a voxel grid), CGID represents objects as a composition of functional parts.

      similar to the compositionality spirit of GFlowNet

    1. 1. Does a GFlowNet need training data points? Not necessarily. If you have a reward function that you can compute for any candidate object, the GFlowNet can train purely by sampling objects (even ones it has never seen before) and scoring them. This is why they’re attractive for drug/material design: you don’t need a huge dataset of known good molecules, just a scoring function (simulator, energy model, ML predictor) to evaluate new candidates.

      important
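
      A complete toy run of exactly this: no dataset, only a reward function evaluated on sampled objects. The domain (3-bit strings), reward, and hyperparameters are invented; training uses trajectory balance, and the backward policy is trivially 1 because each string has a unique build order:

      ```python
      import torch

      L = 3
      def reward(bits):              # stand-in for a simulator / ML scorer
          return 0.1 + sum(bits)     # strings with more 1s score higher

      def prefix_index(bits):
          # Unique table row per prefix: offsets 0, 1, 3 for lengths 0, 1, 2.
          val = 0
          for b in bits:
              val = val * 2 + b
          return (1 << len(bits)) - 1 + val

      logits = torch.nn.Parameter(torch.zeros(7, 2))   # forward policy table
      log_Z = torch.nn.Parameter(torch.tensor(0.0))
      opt = torch.optim.Adam([logits, log_Z], lr=0.05)

      for _ in range(3000):
          bits, log_pf = [], torch.tensor(0.0)
          for _ in range(L):                           # build object step by step
              probs = torch.softmax(logits[prefix_index(bits)], dim=0)
              b = int(torch.multinomial(probs, 1))
              log_pf = log_pf + torch.log(probs[b])
              bits.append(b)
          # Trajectory balance; P_B = 1 since each string has one build order.
          loss = (log_Z + log_pf
                  - torch.log(torch.tensor(float(reward(bits))))) ** 2
          opt.zero_grad()
          loss.backward()
          opt.step()

      # The sampler now draws each string with probability approximately
      # R(x) / Z, e.g. "111" at roughly 3.1 / 12.8, without any dataset.
      ```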

    1. Rather than finetuning the whole diffusion model, it optimizes the conditioning embeddings (e.g., CLIP embeddings) with feedback from simulation performance.

      ??

    1. But if they have an energy landscape (a surface with valleys and peaks corresponding to good vs bad configurations), then you don’t need to brute force. You just need to learn the gradient — the direction downhill toward stable/valid solutions.

      interesting. Need elaboration
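
      A toy version of "follow the gradient instead of brute-forcing" (the landscape is invented, and in practice the gradient would be learned, e.g. by a score model, rather than written down analytically):

      ```python
      import numpy as np

      # Energy landscape with two valleys (valid configurations) at x = -2, +2.
      def grad_energy(x):
          return x * (x**2 - 4.0) / 2.0   # d/dx of (x^2 - 4)^2 / 8

      rng = np.random.default_rng(0)
      x = rng.normal(0.0, 1.0)            # random start, no enumeration of space
      for _ in range(500):
          # Langevin-style update: a downhill step plus a little noise.
          x -= 0.05 * grad_energy(x) - 0.01 * rng.normal()
      print(round(x, 2))                  # settles near a valley, -2 or +2
      ```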

    1. This assumption is empirical: generative models (autoencoders, GANs, diffusion models, LLMs) succeed precisely because such structure exists and is learnable.

      evidence that generative models can capture this structure just by observation, without interacting with the physical world?

    2. In mathematical terms, if the total search space has size $N$, but the effective dimensionality of “plausible” or “stable” solutions is only of size $M \ll N$, then we say the domain has exploitable structure.

      manifold

    1. What’s strong about P

      It makes me think about the general & universal PS system with a core Transformer and different adapters for each task. Can we build a similar universal planning system that shares the same core LLM?

    1. i'm not sure if this is blackpill or whitepill, but there are a heap of new papers, along with my own experiences, showing "best of N is all you need" for most problems as long as:
       - sufficient core knowledge was included in the training data
       - the model is sufficiently large / you use more than 1 model to promote reasonable idea diversity
       - N is sufficiently large for the complexity of the problem at hand
       - you have some reasonable discrimination process at the end to determine / approximate the "best" result
       we really haven't come close to leveraging the full potential of existing models, and the antiquated sampling process / approaches are the single biggest culprit

      what if we include some sampling distribution on the output side?
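
      A bare-bones sketch of the recipe (the generator and scorer are stand-ins for a model call and a discrimination process). As for the note above, one option is a softmax over scores instead of a hard max, shown as soft_best_of_n:

      ```python
      import math, random

      def generate():                  # stand-in for one model sample
          return random.gauss(0.0, 1.0)

      def score(candidate):            # stand-in for the discrimination process
          return candidate

      def best_of_n(n):
          return max((generate() for _ in range(n)), key=score)

      def soft_best_of_n(n, temp=0.5):
          # "Sampling distribution on the output side": softmax over scores
          # instead of a hard argmax, trading peak quality for diversity.
          cands = [generate() for _ in range(n)]
          weights = [math.exp(score(c) / temp) for c in cands]
          return random.choices(cands, weights=weights)[0]

      for n in (1, 4, 16, 64):
          avg = sum(best_of_n(n) for _ in range(2000)) / 2000
          print(n, round(avg, 2))      # winner quality rises with N
      ```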

    1. If they achieve network effects (large user base, ecosystem of integrations, proprietary data flows), they could become the “operating system” for interacting with AI.

      important

    1. Both groups are in a race to the bottom of commoditization, where only brand, UX, or network effects (e.g., Perplexity’s model-switching) can provide some edge

      why?

    1. Even if scaling is more generally efficient than search, search allows for quicker intelligence in narrow domains. Training larger foundation models is slow. With search, you don’t have to wait.

      this is the key lesson

    1. I like research topics that are simple, general, and stand the test of time, and I try to avoid projects that are complicated, task-specific, or short-lived.

      elaborate. I guess general research topics are more fundamental and tend to have a longer lifespan?

    2. Most people (including me) would benefit greatly by spending more time on idea selection, since doing this well is a huge multiplier on research impact. Conversely, working on a narrow topic with little headroom caps the impact of the project, no matter how well it is executed.

      elaborate

    3. A good suggestion from a friend is to either (1) work on a hot topic and do it better than everyone else, or (2) work on something that might become the next hot topic. Strategy 1 is lower risk and requires working very hard. Strategy 2 is higher risk but has potentially very high reward.

      elaborate

    1. They scale with the frontier: as foundation models improve, broad topics grow in importance, while narrow ones fade.

      why do these scale with the frontier?

    1. It is an open question why exactly scaling works, but here are two hand-wavy reasons. One is that small language models can’t memorize as much knowledge in their parameters, whereas large language models can memorize a huge amount of factual information about the world. A second guess is that while small language models are capacity-constrained, they might only learn first-order correlations in data. Large language models on the other hand, can learn complex heuristics in data.

      important

    2. Intuition 3. Tokens can have very different information density, so give language models time to think.

      important. I guess he realized this by examining the behaviors and outputs of LLMs, and that in turn inspired the idea of intermediate reasoning via chains of thought.

    3. The solution to this is to give language models more compute by allowing them to perform natural language reasoning before giving the final answer.

      nice
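
      The mechanic reduces to a prompting choice; a sketch where llm is a placeholder for any text-completion call, not a real API:

      ```python
      def llm(prompt: str) -> str:
          raise NotImplementedError("plug any text-completion model in here")

      question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                  "more than the ball. How much does the ball cost?")

      # Forced to answer immediately: every output token must already be the answer.
      direct_prompt = question + "\nAnswer with just the price."

      # Given room to reason: intermediate tokens buy extra compute before the
      # final answer is committed, which is the whole trick.
      cot_prompt = question + "\nLet's think step by step, then give the price."

      # answer = llm(cot_prompt)   # uncomment once llm is implemented
      ```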

    4. You can imagine that if you’re ChatGPT, and as soon as you see the prompt you have to immediately start typing, it would be pretty hard to get that question right.

      important

    5. This is an interesting example of how a simple objective, when combined with complex data can lead to highly intelligent behavior (assuming you agree that language models are intelligent).

      nice observation

    1. It is important to understand that this does not mean LLMs will be gods producing 100x code, because virtually no domain that software engineering is useful has a perfect oracle. A perfect oracle is a type of feedback where you are given a “correct/incorrect” answer every single time, and they almost only appear in games as real world typically doesn’t have perfect models of correctness. Winning or losing a game is a perfect oracle, as well as creating a program that can pass the judge in a competitive programming contest.

      important and impressive advice

    1. However, in scientific innovation, we are in a totally different realm where we only care about solving a single problem (train=test!) because it’s an unsolved problem and potentially extremely valuable.

      .

    1. Prefer methods that pass a scaling test: their delta is flat or increasing as you go from S→M→L models.

      maybe we can just plug in models of various scales and see how the performance projects?
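
      That is essentially the test, and it is cheap to script. A sketch with a fake benchmark harness (all numbers invented):

      ```python
      SCALES = ("S", "M", "L")

      def scaling_test(evaluate):
          """Keep a method only if its delta over baseline is flat or
          increasing as model scale grows."""
          deltas = [evaluate(s, use_method=True) - evaluate(s, use_method=False)
                    for s in SCALES]
          monotone = all(b >= a - 1e-9 for a, b in zip(deltas, deltas[1:]))
          return monotone, deltas

      # Fake harness: baseline accuracy rises with scale; this method's gain
      # happens to grow too, so it passes.
      acc = {"S": 0.50, "M": 0.62, "L": 0.71}
      gain = {"S": 0.02, "M": 0.03, "L": 0.05}
      def evaluate(scale, use_method):
          return acc[scale] + (gain[scale] if use_method else 0.0)

      print(scaling_test(evaluate))   # (True, [~0.02, ~0.03, ~0.05])
      ```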

    2. Method: anything you wrap around or plug into the core model: data curation, training tricks, inference procedures, retrieval, tools, constraints, rewards, routing, etc.

      maybe each AI startup out there is a different wrapper, specific to each domain?

    3. 🔹 What does “method” mean here?

      So there are 2 types of hammers:
      - architectures that scale well with compute & data
      - methods that scale well with compute & data
      It might be interesting to take these compute-scalable methods and integrate them into architectures of other domains? What are historic examples of these compute-scalable methods? Are they general-purpose?

    1. adjacent signals: lots of startups are already building agent frameworks (LangChain, AutoGPT, CrewAI). Their bottlenecks hint at where research could contribute

      great advice

    1. 2020: you can cast any language task as sequence prediction and learn it via pretrain + finetune
       2021: scaling to GPT-3 size enables doing arbitrary tasks specified via instructions
       2022: scaling to GPT-3.5/PaLM size unlocks reasoning via chain of thought
       2023: LLMs themselves can be a product a lot of people will use
       2024: to push capabilities past GPT-4, scale test-time compute

      there is a dominant trend and scaffolding structure here

    1. Old habit: Just grab a standard benchmark (e.g., GLUE, ImageNet, MMLU) and test your method there. Problem today: Those benchmarks might not stress the thing your method is designed for. You’ll conclude your idea “doesn’t work,” when in fact you just used the wrong test.

      important

    2. Now: Large language models are so capable and multi-task that whether a method works depends a lot on which dataset you test it on.

      elaborate on this

    1. Acknowledgements: This blog post features contributions from Gabriel Ilharco. I would like to thank Hattie Zhou, Nelson Liu, Noah Smith, Gabriel Ilharco, Mitchell Wortsman, Luke Zettlemoyer, Aditya Kusupati, Jungo Kasai, and Ofir Press for their valuable feedback on drafts of this blog post.

      this guy certainly doesn't feel shy about asking people for feedback on his work, no matter whether it's as complicated as doing research or as simple as writing a blog post

    2. He builds hacks, understands the deep relationships of how his hack affects the system, and then extracts this insight in the most minimalistic and well-formulated way possible along with his practical hack.

      This is such good advice

    3. Navigating this uncertainty is best done through fast iterations and balancing multiple projects to maximize the chances of a big success.

      important

  3. Aug 2025
    1. ✅ Correct in spirit: Solving “make X work for the first time” usually looks like jumping from essentially no working method to a viable solution (say 0% → 70%).

      this looks far more impressive than incremental work

    1. Pick a meaningful dimension of stress (tokens, latency, noise, compositional depth, generalization to new domains, etc.).

      how to select a "meaningful dimension" is an important follow-up question in its own right. I have several points to add:
      - the dimension should also involve "feasibility" that fits our research conditions
      - maybe to find those "meaningful dimensions", we can look very top-down from applications. We can ask: if this dimension is extended, which kinds of applications would benefit? With this, we quickly realize that "context length" is a super influential dimension, benefiting various applications.
      - how about borrowing from related domains, like Jinwoo did in TokenGT?

    2. Relevance to the field’s trajectory The capability should connect to active conversations in the community. Example: In 2020–2022, instruction-following was “in the air” because GPT-3 showed emergent abilities, but not controllability. So InstructGPT’s capability (follow human instructions) was both new and natural.

      I guess ChatGPT can help me with identifying the next natural new AI capabilities to work on

    3. Do you want me to break down how to tell, when reading a new paper, whether it’s a “first-time X” paper or a “make X better” paper? That skill will help you classify work quickly.

      so there are many dimensions for “working better”? e.g. more stable, more scalable, ...

    1. Alternatively, armed with the knowledge gained from working on the first idea, you might move on to a different idea aligned with the same goal, with a higher chance of success.

      so knowledge from working on the first idea leads to a higher chance of success with the second idea? So I guess it's all about failing fast and then iterating?

    2. If you are thinking in terms of ideas, you’d be easily frustrated and might give the idea a few more attempts before finally giving up and moving on to another, possibly unrelated idea, repeating the same process

      "unrelated idea" is an important point

    3. The main issue is that ideas have a very short lifespan; an idea is unlikely to work at first, might not be novel enough, might be easily scooped

      elaborate

    4. When you’re starting in a new area without a full understanding of the challenges or limitations, it is very tempting to run after a sole idea that you think will work.

      why?

    5. What Yang was talking about is reading enough papers to cover most of the literature in your area. Needless to say, it is not only about the paper count you read, although the paper count can serve as a good indicator of how well you are engaged with the literature in your research area.

      the final goal is to build full coverage and understanding of your literature

    1. new AI capabilities

      this is too broad. Are there any constraints to narrow down the set of new AI capabilities that we should envision within a 2-year scope?

    2. Goals also make it possible for a team of researchers to work together and attack different aspects of the problem, whereas idea-driven research is most effectively carried out by “teams” of 1-2 people.

      why??

    3. On the other hand, with goal-driven research, your goal will give you a perspective that’s differentiated from the rest of the community. It will lead you to ask questions that no one else is asking,

      why??

    4. To make breakthroughs with idea-driven research, you need to develop an exceptionally deep understanding of your subject, and a perspective that diverges from the rest of the community—some can do it, but it’s difficult.

      why??

    5. you test a variety of existing methods from the literature, and then you develop your own methods that improve on them

      (1) Which existing methods should we test? (2) Why do we have to test them before building our own methods? And what does it mean to "improve on them"?

    6. I’ll take goal-driven research to mean that your goal is more specific than your whole subfield’s goal, and it’s more like make X work for the first time than make X work better.

      what does this mean: "your goal is more specific than your whole subfield's goal"? How does that align with "make X work for the first time" rather than "make X work better"?

    1. This AI-driven auto-scheduling is a massive time-saver, eliminating the tedious manual process of trying to Tetris tasks into your day.

      important

    1. A 40 hour time-blocked work week, I estimate, produces the same amount of output as a 60+ hour work week pursued without structure.

      bringing structure into life always wins

    1. 5. Pareto 80/20
       Why #5: Still powerful — often a few papers/experiments/figures give most of the insight. But research is exploratory, so it’s easy to misjudge which 20% matters most until later. Works best if combined with checkpoints.
       Example: In experiment runs → 2–3 baseline setups cover 80% of insight; no need to sweep every hyperparameter.

      CS 197: Computer Science Research, Step 1 (Performing a literature search): keep track of how much you’re learning about the design axes as you consume additional papers. Typically, you’re learning the most at the very beginning, and the amount per paper starts going down after five papers or so.

    2. 6. Progressive Layering
       Idea: Build the output in layers: skeleton → basic fill → deeper detail → polish.
       How to apply: Do a quick pass that touches everything at a shallow level, then loop back to deepen.
       Good for: Ensuring balanced coverage and avoiding “holes.”
       Limits: Requires resisting the urge to perfect one section before moving on.

      important

    3. 4. Greedy Value-per-Time
       Idea: Always pick the next piece of work that gives the highest value for the time spent.
       How to apply: For each possible action, estimate “How much value will this add?” / “How long will it take?” → Do the one with the best ratio.
       Good for: Gradually building up value in the most efficient order.
       Limits: Estimation can be rough; might overlook long-term gains.

      need elaboration
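
      A few-line version of the heuristic (tasks and estimates invented), which also makes the stated limit concrete: the plan is only as good as the value/time guesses:

      ```python
      tasks = [  # invented estimates: (name, value, hours)
          ("run 2 baseline setups", 8, 4),
          ("polish figure 3",       2, 3),
          ("write related work",    5, 2),
          ("full hyperparam sweep", 3, 8),
      ]

      budget, plan = 8, []
      # Sort by value-per-hour, then take tasks greedily while time remains.
      for name, value, hours in sorted(tasks, key=lambda t: t[1] / t[2],
                                       reverse=True):
          if hours <= budget:
              plan.append(name)
              budget -= hours
      print(plan)   # ['write related work', 'run 2 baseline setups']
      ```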

    4. 3. Time-Boxed Checkpoints
       Idea: Divide your time budget into checkpoints (e.g., 25%, 50%, 75% of T).
       How to apply: At each checkpoint, stop and take stock. Freeze the current delivery so it’s already usable, then decide whether to add or polish.
       Good for: Avoiding the trap of chasing “better” until the very last minute.
       Limits: Requires discipline to actually stop and review.

      need elaboration

    1. Without early reality checks: you drift → big risks discovered late (wasted days). With L0/L1 reality checks: you “fail fast, small, and cheap,” so every later step builds on a working scaffold.

      fail fast

    1. Very sharp connection, Dat 👍 — you’re right: open-ended research tasks suffer from the same two problems as reinforcement learning (RL):

      both RL and research are hard. I guess feedback is the bottleneck for both.

    1. If you ship something usable by the deadline → ✅ pipeline works. If not → ❌ pipeline broke (e.g., stuck in collection or polish).

      important

    1. Closes the loop faster → You don’t wait months for evaluation.
       Prevents drift → Every day/week forces a checkpoint.
       Shapes behavior → You optimize for progress under time instead of perfect completeness.

      this is gold

    2. Each week is a reward event. It shapes your trajectory forward, instead of leaving you wandering until the “final boss” (PhD defense).

      this is gold

    3. 🔹 2. Daily Micro-Outputs as Rewards
       Reward Rule: +1 if by end of day you capture something tangible (a new cluster, a new theme, or one updated paragraph of outline).
       Penalty Rule: 0 if you only consumed/collected without producing.
       👉 This ensures every day has a checkpoint. Even if small, it signals whether the pipeline is alive.

      important

    4. That is feedback about the weak link in your pipeline: maybe your clustering method is too slow, or you’re over-expanding the input set. Without the deadline, you’d never notice the bottleneck — you’d just keep drifting.

      gold

    1. MVO: a plain input box with keyword search that returns results by exact match. → This lets you confirm: Do users even use the search bar? If yes, you iterate.

      one example for delivering fast to get feedback fast

    2. No natural stopping point
       Completeness instinct: A dev team building a new chat feature decides they must include typing indicators, message reactions, file sharing, and push notifications before launch.
       Result: Months pass before users even test the basic messaging experience.
       MVO version: Ship a bare-bones text-only chat. Once people actually use it, you’ll know if features like reactions are worth adding.

      very prone to over-planning

    3. Layered refinement → MVO outputs become scaffolds. You can always enrich them later, but at least you have something concrete.

      elaborate on this

    4. Barely acceptable ≠ low quality. It means the smallest unit of work that is valid enough to be tested, evaluated, or built upon. The spirit is: cross the acceptance line quickly, then refine if it proves worthwhile.

      So the point is that it allows fast delivery of an artifact, thus enabling rapid feedback, iteration, and building on top of it

    1. 3. Reinforcement Learning (Sparse, Delayed Feedback)

      this is so similar to my scenario, so it deserves lots of elaboration. Yes, I can only test whether my developed meta-research approach is valid by using it to develop a research question and implement a project, so the feedback is very delayed. However, if we use MVO for the process of developing the research question and the implementation, these can be done rapidly and give rapid feedback to the meta-research approach.

    2. If you reach the deadline and you have something coherent → ✔ that’s feedback that your process worked. If you reach the deadline and you’re still collecting without synthesis → ❌ that’s feedback that your process got stuck, and you must force closure.

      important

    3. Your first synthesis attempt is the feedback resource. If it produces something coherent, that’s your signal to stop collecting and move forward. If it produces obvious holes, those holes tell you precisely what to collect next.

      gold

    4. Force yourself to draft a 1-page proto-framework:
       Section 1: How I choose problems
       Section 2: How I generate ideas
       Section 3: How I design experiments
       Section 4: How I reflect and adapt
       Even if rough, you’ll quickly feel whether your collected advice is enough to fill these slots.

      super important

    5. If your 1-page proto-style already helps you reason, structure, or discuss, that’s feedback enough. You don’t need 100% collection before synthesis.

      important

    6. 🔹 D. Time-Bound Feedback
       Impose a milestone deadline: “If I can’t cluster 15 pieces of advice into a draft framework by the end of this week, I must ship an MVO draft anyway.”
       Here, the deadline itself is the feedback resource → it forces closure, telling you that completeness is no longer the metric; progress is.

      important

    1. Evidence-First Approach → once you have a minimal version, you can test if it holds water before investing more. Layered Refinement → each MVO creates a scaffold; if time allows, you can enrich it later.

      need elaboration

    1. Balance Depth vs. Breadth
       Early milestones should prioritize breadth (collect many inputs).
       Later milestones should prioritize depth (synthesize into frameworks).

      need elaboration

    2. Set Sprint-Style Boundaries
       Treat milestones as 1–2 week sprints (like in software dev).
       End each sprint with a “demo” artifact (outline, table, draft).

      need elaboration

    3. Work Backwards from Deadline
       Decide: “I want a working draft in 1 month.”
       Break backwards into weekly checkpoints (Week 1 = Collect, Week 2 = Cluster, Week 3 = Synthesize, Week 4 = Draft).

      need elaboration

    4. Planning everything down to the day months in advance is unrealistic in research. But if you don’t plan at all, you drift. The solution: think in layers of stability.

      important