AlphaEvolve
what is it?
The field rewards researchers who can translate expensive experimentation into deep, portable ideas.
nice
Acknowledgements: This blog post features contributions from Gabriel Ilharco. I would like to thank Hattie Zhou, Nelson Liu, Noah Smith, Gabriel Ilharco, Mitchell Wortsman, Luke Zettlemoyer, Aditya Kusupati, Jungo Kasai, and Ofir Press for their valuable feedback on drafts of this blog post.
this guy certainly doesn't feel shy about asking people for feedback on his work, whether it's something as complicated as doing research or as simple as writing a blog
It is important to note that there is neither right or wrong nor good or bad research style.
important
He builds hacks, understands the deep relationships of how his hack affects the system, and then extracts this insight in the most minimalistic and well-formulated way possible along with his practical hack.
This is such good advice
Navigating this uncertainty is best done through fast iterations and balancing multiple projects to maximize the chances of a big success.
important
The learning from one doesn’t always transfer, because the ideas are not anchored to a common goal.
very important
✅ Correct in spirit: Solving “make X work for the first time” usually looks like jumping from essentially no working method to a viable solution (say 0% → 70%).
this looks far more impressive than incremental work
X = real-time object detection on a single GPU.
nice example
If they join a crowded subfield late, the low-hanging fruit is gone.
I read somewhere that great researchers never stay in a crowded place. They move on to the "next big thing". Related posts on low-hanging fruit: https://ai.engin.umich.edu/2023/08/17/eight-lessons-learned-in-two-years-of-ph-d/ https://www.notion.so/From-Michael-Nielsen-Principles-of-effective-research-25199f12ef0d8040be1ceb0d8b16cc67?source=copy_link#25199f12ef0d80649850edcef18f5519
Pick a meaningful dimension of stress (tokens, latency, noise, compositional depth, generalization to new domains, etc.).
how to select a "meaningful dimension" is an important follow-up question in its own right. I have several points to add:
- the dimension should also involve "feasibility" that fits our research conditions
- maybe to find those "meaningful dimensions", we can look top-down from applications. We can ask: if this dimension is extended, which kinds of applications benefit? With this, we quickly realize that "context length" is a super influential dimension, benefiting various applications.
- how about borrowing from related domains, like Jinwoo did in TokenGT?
Relevance to the field’s trajectory The capability should connect to active conversations in the community. Example: In 2020–2022, instruction-following was “in the air” because GPT-3 showed emergent abilities, but not controllability. So InstructGPT’s capability (follow human instructions) was both new and natural.
I guess ChatGPT can help me with identifying the next natural new AI capabilities to work on
Do you want me to break down how to tell, when reading a new paper, whether it’s a “first-time X” paper or a “make X better” paper? That skill will help you classify work quickly.
so there are many dimensions for "working better"? e.g more stable, more scalable, ...
Alternatively, armed with the knowledge gained from working on the first idea, you might move on to a different idea aligned with the same goal, with a higher chance of success.
so knowledge from working on the first idea leads to a higher chance of success in the second idea? So I guess it's all about failing fast and then iterating?
If you are thinking in terms of ideas, you’d be easily frustrated and might give the idea a few more attempts before finally giving up and moving on to another, possibly unrelated idea, repeating the same process
"unrelated idea" is an important point
John Schulman argues that goals have more longevity than ideas
did he?
The main issue is that ideas have a very short lifespan; an idea is unlikely to work at first, might not be novel enough, might be easily scooped
elaborate
When you’re starting in a new area without a full understanding of the challenges or limitations, it is very tempting to run after a sole idea that you think will work.
why?
What Yang was talking about is reading enough papers to cover most of the literature in your area. Needless to say, it is not only about the paper count you read, although the paper count can serve as a good indicator of how well you are engaged with the literature in your research area.
the final goal is to cover understanding of your literature
enabling you to make larger leaps of progress.
why?
new AI capabilities
this is too broad. Are there any constraints to narrow down the set of new AI capabilities we should envision within a 2-year scope?
showing how to do X
so the problem of "making X work for the first time" is already resolved?
choosing a different problem from the rest of the community can lead you to explore different ideas.
how different here exactly?
initial exploration
what is this?
Goals also make it possible for a team of researchers to work together and attack different aspects of the problem, whereas idea-driven research is most effectively carried out by “teams” of 1-2 people.
why??
On the other hand, with goal-driven research, your goal will give you a perspective that’s differentiated from the rest of the community. It will lead you to ask questions that no one else is asking,
why??
To make breakthroughs with idea-driven research, you need to develop an exceptionally deep understanding of your subject, and a perspective that diverges from the rest of the community—some can do it, but it’s difficult.
why??
you test a variety of existing methods from the literature, and then you develop your own methods that improve on them
(1) Which existing methods should we test? (2) Why do we have to test them before building our own methods? (3) What does it mean to "improve on them"?
I’ll take goal-driven research to mean that your goal is more specific than your whole subfield’s goal, and it’s more like make X work for the first time than make X work better.
what does this mean? How does "your goal is more specific than your whole subfield's goal" align with "make X work for the first time" rather than "make X work better"?
solve problems that bring you closer to that goal.
how to identify problems that if we solve, could bring us closer to that goal?
Follow some sector of the literature.
what does this mean exactly?
This AI-driven auto-scheduling is a massive time-saver, eliminating the tedious manual process of trying to Tetris tasks into your day.
important
A 40 hour time-blocked work week, I estimate, produces the same amount of output as a 60+ hour work week pursued without structure.
bring structure into life always wins
Combine with MVO first → secure a skeleton first, then apply Pareto logic for depth.
important
5. Pareto 80/20 Why #5: Still powerful — often a few papers/experiments/figures give most of the insight. But research is exploratory, so it’s easy to misjudge which 20% matters most until later. Works best if combined with checkpoints. Example: In experiment runs → 2–3 baseline setups cover 80% of insight; no need to sweep every hyperparameter.
CS 197: Computer Science Research Step 1: Performing a literature search Keep track of how much you’re learning about the design axes as you consume additional papers. Typically, you’re learning the most at the very beginning, and the amount per paper starts going down after five papers or so.
6. Progressive Layering Idea: Build the output in layers: skeleton → basic fill → deeper detail → polish. How to apply: Do a quick pass that touches everything at a shallow level, then loop back to deepen. Good for: Ensuring balanced coverage and avoiding “holes.” Limits: Requires resisting the urge to perfect one section before moving on.
important
4. Greedy Value-per-Time Idea: Always pick the next piece of work that gives the highest value for the time spent. How to apply: For each possible action, estimate: “How much value will this add?” / “How long will it take?” → Do the one with the best ratio. Good for: Gradually building up value in the most efficient order. Limits: Estimation can be rough; might overlook long-term gains.
need elaboration
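To make the greedy rule concrete, here is a minimal Python sketch; the task names and the value/hour estimates are invented placeholders, not anything from the source.

```python
# Hypothetical illustration of the greedy value-per-time rule above.
# Task names and the value/hour estimates are made up for the example.

tasks = [
    {"name": "rerun strongest baseline", "value": 8, "hours": 4},
    {"name": "sweep all hyperparameters", "value": 9, "hours": 30},
    {"name": "write 1-page results summary", "value": 5, "hours": 2},
    {"name": "polish plots", "value": 2, "hours": 3},
]

remaining_hours = 8
plan = []

# Repeatedly pick the not-yet-chosen task with the best value-per-hour
# ratio that still fits in the remaining time budget.
while True:
    feasible = [t for t in tasks if t not in plan and t["hours"] <= remaining_hours]
    if not feasible:
        break
    best = max(feasible, key=lambda t: t["value"] / t["hours"])
    plan.append(best)
    remaining_hours -= best["hours"]

for t in plan:
    print(f'{t["name"]}: value/hour = {t["value"] / t["hours"]:.1f}')
```

Note that the ordering is only as good as the rough value/time guesses, which is exactly the "Limits" point in the quote above.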
3. Time-Boxed Checkpoints Idea: Divide your time budget into checkpoints (e.g., 25%, 50%, 75% of T). How to apply: At each checkpoint, stop and take stock. Freeze the current delivery so it’s already usable, then decide whether to add or polish. Good for: Avoiding the trap of chasing “better” until the very last minute. Limits: Requires discipline to actually stop and review.
need elaboration
Lesson: the hidden blocker (“taxonomy dimension unclear”) surfaced cheaply at L0.
important
Without early reality checks: you drift → big risks discovered late (wasted days). With L0/L1 reality checks: you “fail fast, small, and cheap,” so every later step builds on a working scaffold.
fail fast
You fix it before scaling up.
important
utility proxies (does this draft/pipeline already work in some small way?)
super important
8) Common pitfalls (and the fix)
elaboration
You’ll increase 1–2 knobs per level, never all.
important
usability-focused.
elaborate on this
Converts one far-off, sparse reward into 5–6 immediate feedback signals.
important
Break your open-ended milestone into layers, each with a feedback proxy:
important
Solution: tie every proxy to usability: “Can this artifact unblock the next step?” If no → proxy doesn’t count.
need elaboration
Very sharp connection, Dat 👍 — you’re right: open-ended research tasks suffer from the same two problems as reinforcement learning (RL):
both RL and research are hard. I guess feedback is the bottleneck for both.
Get to “usable” fast → then refine only if it proves valuable.
gold
👉 Philosophy: The goal of early outputs isn’t to be complete — it’s to learn faster.
important
Philosophy: Closure is manufactured, not discovered. If you don’t impose an end, the task never ends.
gold
If you ship something usable by the deadline → ✅ pipeline works. If not → ❌ pipeline broke (e.g., stuck in collection or polish).
important
This is the essence of layered refinement: start thin, add depth only if needed.
gold
Momentum
important
Closes the loop faster → You don’t wait months for evaluation. Prevents drift → Every day/week forces a checkpoint. Shapes behavior → You optimize for progress under time instead of perfect completeness.
this is gold
Each week is a reward event. It shapes your trajectory forward, instead of leaving you wandering until the “final boss” (PhD defense).
this is gold
🔹 2. Daily Micro-Outputs as Rewards Reward Rule: +1 if by end of day you capture something tangible (a new cluster, a new theme, or one updated paragraph of outline). Penalty Rule: 0 if you only consumed/collected without producing. 👉 This ensures every day has a checkpoint. Even if small, it signals whether the pipeline is alive.
important
That is feedback about the weak link in your pipeline: maybe your clustering method is too slow, or you’re over-expanding the input set. Without the deadline, you’d never notice the bottleneck — you’d just keep drifting.
gold
If you don’t ship → ❌ that’s feedback that your process got stuck in over-collection, over-polishing, or paralysis.
this is gold
Iterate in Layers → Each week is one “thin slice” of output → feedback → refinement.
need elaboration
Manufacture Feedback → Use proxy tests, peer review, or time deadlines instead of waiting for perfect signals.
need elaboration
Force Output → Every cycle must produce something tangible (outline, synthesis, test).
need elaboration
Cap Input → Don’t endlessly collect; pre-decide how much you’ll allow yourself.
important
The goal is closure and forward motion, not exhaustiveness.
what is closure and forward motion?
MVO: a plain input box with keyword search that returns results by exact match. → This lets you confirm: Do users even use the search bar? If yes, you iterate.
one example for delivering fast to get feedback fast
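A tiny sketch of what that MVO could look like in code, assuming "exact match" means a plain substring lookup; the documents and the query are made-up placeholders.

```python
# Minimal sketch of the MVO search described above: a plain keyword
# search returning verbatim (substring) matches, no ranking, no fuzziness.

documents = [
    "How to set up the evaluation harness",
    "Notes on instruction-following baselines",
    "Weekly milestone checklist",
]

def mvo_search(query: str, docs: list[str]) -> list[str]:
    """Return documents containing the query string verbatim."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]

print(mvo_search("baseline", documents))
# -> ['Notes on instruction-following baselines']
```

If users actually type into this box, that is the signal to invest in ranking, typo tolerance, and the rest.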
No natural stopping point Completeness instinct: A dev team building a new chat feature decides they must include typing indicators, message reactions, file sharing, and push notifications before launch. Result: Months pass before users even test the basic messaging experience. MVO version: Ship a bare-bones text-only chat. Once people actually use it, you’ll know if features like reactions are worth adding.
very prone to over-planning
Layered refinement → MVO outputs become scaffolds. You can always enrich them later, but at least you have something concrete.
elaborate on this
Progress > Perfection → momentum compounds over time, while perfectionism stalls.
elaborate
Barely acceptable ≠ low quality. It means the smallest unit of work that is valid enough to be tested, evaluated, or built upon. The spirit is: cross the acceptance line quickly, then refine if it proves worthwhile.
So the point is that it allows fast delivery of an artifact, which enables rapid feedback, iteration, and building on top of it
3. Reinforcement Learning (Sparse, Delayed Feedback)
this is so similar to my scenario, so it deserves lots of elaboration. Yes, I can only test whether my meta-research approach is valid by using it to develop a research question and implement a project, so the feedback is very delayed. However, if we use MVO for the process of developing the research question and the implementation, those can be done rapidly and give rapid feedback on the meta-research approach.
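A toy sketch of the analogy, contrasting one sparse terminal reward with per-milestone proxy rewards; the milestone names are invented for illustration.

```python
# Toy illustration of the RL analogy above: one sparse, delayed reward
# (the project pays off or not, months later) versus a shaped signal
# with a small proxy reward per milestone. Milestones are placeholders.

milestones = [
    "research question drafted",
    "MVO pipeline runs end-to-end",
    "first baseline reproduced",
    "draft written",
]

def sparse_feedback(completed: list[str]) -> int:
    # Only the final outcome counts; everything before it returns 0.
    return 1 if len(completed) == len(milestones) else 0

def shaped_feedback(completed: list[str]) -> int:
    # Every finished milestone is an immediate +1 proxy reward.
    return len(completed)

done = ["research question drafted", "MVO pipeline runs end-to-end"]
print(sparse_feedback(done))   # 0 -> no signal until the very end
print(shaped_feedback(done))   # 2 -> progress is visible right away
```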
If you reach the deadline and you have something coherent → ✔ that’s feedback that your process worked. If you reach the deadline and you’re still collecting without synthesis → ❌ that’s feedback that your process got stuck, and you must force closure.
important
Your first synthesis attempt is the feedback resource. If it produces something coherent, that’s your signal to stop collecting and move forward. If it produces obvious holes, those holes tell you precisely what to collect next.
gold
Instead of waiting for someone else to tell you “you have enough,” you let the process of synthesis tell you.
This is gold
Force yourself to draft a 1-page proto-framework: Section 1: How I choose problems Section 2: How I generate ideas Section 3: How I design experiments Section 4: How I reflect and adapt Even if rough, you’ll quickly feel whether your collected advices are enough to fill these slots.
super important
If your 1-page proto-style already helps you reason, structure, or discuss, that’s feedback enough. You don’t need 100% collection before synthesis.
important
🔹 D. Time-Bound Feedback Impose a milestone deadline: “If I can’t cluster 15 advices into a draft framework by end of this week, I must ship an MVO draft anyway.” Here, the deadline itself is the feedback resource → it forces closure, telling you that completeness is no longer the metric; progress is.
important
It closes the milestone with evidence and prevents “analysis paralysis.”
what is this?
Without this cap, “just one more run” can spiral into weeks of GPU drain.
important
A single config eliminates “rabbit holes” of hyperparameter tuning. One config = just enough to generate a signal.
what is this?
You don’t need the entire dataset to test whether your pipeline works.
Khang Truong used to tell me this
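A minimal sketch combining the two notes above (one config, a small data slice) into a smoke run; the config keys and the load_dataset/train functions are hypothetical placeholders, not any specific library's API.

```python
# Sketch of a single-config smoke run on a tiny data slice.
# load_dataset() and train() are stand-ins for your real pipeline.

config = {
    "model": "small-transformer",
    "learning_rate": 3e-4,   # one sensible value, no sweep
    "batch_size": 8,
    "max_steps": 200,        # just enough to see the loss move
}

def load_dataset():
    # Placeholder: imagine this returns the full training set.
    return list(range(10_000))

def train(cfg, data):
    # Placeholder training loop; here it only reports what it would do.
    print(f"training {cfg['model']} on {len(data)} examples "
          f"for {cfg['max_steps']} steps")

full_data = load_dataset()
smoke_data = full_data[:500]   # ~5% of the data: enough to check the pipeline runs
train(config, smoke_data)
```

If this run completes and the metrics look sane, the pipeline is trustworthy enough to scale up; if it breaks, you found out cheaply.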
Evidence-First Approach → once you have a minimal version, you can test if it holds water before investing more. Layered Refinement → each MVO creates a scaffold; if time allows, you can enrich it later.
need elaboration
Instead of “complete everything,” you aim for the leanest but valid version that lets you move forward.
very important mindset
Balance Depth vs. Breadth Early milestones should prioritize breadth (collect many inputs). Later milestones should prioritize depth (synthesize into frameworks).
need elaboration
Set Sprint-Style Boundaries Treat milestones as 1–2 week sprints (like in software dev). End each sprint with a “demo” artifact (outline, table, draft).
need elaboration
Work Backwards from Deadline Decide: “I want a working draft in 1 month”. Break backwards into weekly checkpoints (Week 1 = Collect, Week 2 = Cluster, Week 3 = Synthesize, Week 4 = Draft).
need elaboration
Planning everything down to the day months in advance is unrealistic in research. But if you don’t plan at all, you drift. The solution: think in layers of stability.
important
Research = uncertainty. Unlike building a house, you don’t know all the materials or obstacles ahead.
Can I apply this framework to my research workflow to accommodate both "structure" and "flexibility"? I guess the Heilmeier Catechism will stand on the "structure" side. Iteration and "fail fast" will stand on the "flexibility" side. Can I formulate this sweet combination of "structure" and "flexibility" in research practice? Has anyone before me adopted this strategy? How to balance planning, execution, and replanning?
The right question frame might be: "How to navigate uncertainty efficiently?"
Visible progress tracker Have a simple Kanban board (Notion/Trello) with columns: Month → Week → Day → Done. This makes milestones not just a plan, but a visual progress bar.
this is like Duolingo's streak
Define Minimum Viable Output (MVO) Ask: What’s the smallest version of this milestone that proves progress? Example: For Collect → don’t wait until you’ve read 50 blog posts. Ten high-quality ones may be enough to proceed.
Pareto 80/20. Needs lots of elaboration since it's important
Without milestones → work naturally keeps inflating.
This is important
A milestone introduces artificial scarcity: “I only have 2 weeks to gather advice → better prioritize the most important sources.”
Okay, so I guess the right question to consider is: now that I only give myself 2 weeks for this task, how do I make the most of it? This is the right problem frame! Now we proceed to tackling it. I believe the Pareto 80/20 law is one of the correct answers. Any others?
“Keep reading until I know enough” → “By end of Week 2, I’ll have 20 summarized advices.” “Keep synthesizing until it feels right” → “By end of Week 4, I’ll have a draft, even if messy.”
"until I know enough" or "until I feel right" are so dangerous; my mind is really bad at time realization. It will push things to infinite.
Now, even though the universe of possible advice is infinite, you’ve forced closure at each step.
this is super important
Creates time pressure → reduces endless wandering. Builds momentum by giving you small wins. Forces trade-offs: if time is almost up, you’ll cut distractions more aggressively.
These are important
derailing
important
Exploration Log → Any recurring themes worth folding into (1) later?
What is this? It might be a very important idea for overcoming the problem of the evolving tree structure!
prioritization
Steve Jobs on prioritization (3 out of 10 things)
Step 4: Tight Execution Loops for (1)
so this milestone is important. It prevents the execution from going on forever
Example: Friday afternoons = serendipity time.
so what should the optimal frequency be? Once per week seems too sparse. And btw, I read somewhere that allocating a "mind wandering" slot every day also improves concentration when we're on the main task
Rapid
How to make it super rapid (less than 10 seconds) so that it doesn't make my mind switch contexts?
Step 1: Define a Sharp Focus Box Write down your main goal in one crisp sentence: “Collect and synthesize research advice on problem-driven research to develop my own meta research style.” Break this into milestones (e.g. “Collect 20 pieces of advice → Cluster key themes → Draft my meta approach”). Keep this Focus Box visible in your workspace (sticky note, Notion top bar). This serves as your anchor whenever FOMO tries to pull you away.
I guess this helps anchor our workflow and also filter out noise
A good “bit to flip” has 3 properties: Universality: shared by many current methods. Constraining: it actually limits progress toward capabilities. Replaceability: you can imagine an alternative (not just removing, but replacing with something better). Bad flips: assumptions that were helpful approximations without being real bottlenecks → flipping them just adds complexity.
need elaboration
3. Build Trust in Your Evaluation Harness If you never rerun baselines, how do you know: your code, logging, and evaluation are sound? improvements aren’t due to hidden differences in implementation? By reproducing others’ results in your own harness, you ensure a fair playing field for comparing your new method. 👉 Otherwise, reviewers (and yourself) can’t be sure whether gains are real or artifacts.
Donggyun used to mention this to me. It's possible that an improvement in performance is just due to noise/artifacts (e.g., randomness in the seed) rather than a real improvement.
Papers tend to report best-case or average numbers, not the messy reality: instability across seeds, catastrophic drops mid-training, reproducibility quirks. By rerunning, you see the variance, fragility, and edge cases that are invisible in benchmark tables. 👉 Schulman’s insight about instability in policy gradients (leading to TRPO) only came from seeing the failures himself, not from reported scores.
this is gold
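A minimal sketch of that check, assuming evaluate() stands in for a full train-and-evaluate run of the baseline: rerun over a handful of seeds and compare the spread against any claimed gain.

```python
# Rerun a baseline over several seeds and look at the spread before
# trusting a single-number improvement. evaluate() is a placeholder
# for your real training + evaluation run with a given seed.
import random
import statistics

def evaluate(seed: int) -> float:
    # Stand-in for "train and evaluate the baseline with this seed".
    random.seed(seed)
    return 70.0 + random.gauss(0, 1.5)

scores = [evaluate(seed) for seed in range(5)]
mean, std = statistics.mean(scores), statistics.stdev(scores)
print(f"baseline: {mean:.1f} +/- {std:.1f}")

# A "gain" smaller than the seed-to-seed spread may just be noise,
# not a real improvement.
```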
Different methods break in different ways. Running several gives you a menu of failure patterns: instability, sample inefficiency, poor generalization, etc. These failures help surface the real subproblems you need to tackle.
Yes, different methods break in different ways, but they might share some failure modes (e.g., sample inefficiency). We want to find a big bit to flip, i.e., one that is shared by as many existing methods as possible.
5. De-Risk Your Research Jumping straight into your own method is high-risk — it may fail for reasons you could’ve predicted by simply running known baselines. Testing others first is a low-risk learning stage that grounds your work in reality before you spend months on a new algorithm. 👉 It’s like scouting the terrain before building your own path.
need elaboration
Running others’ methods forces you to internalize their assumptions, strengths, and weaknesses. Often, you notice small tweaks that make a huge difference, or you realize why a method doesn’t scale to your problem. Schulman explicitly said: testing many methods was a way of discovering what properties mattered most (stability, variance reduction, etc.). 👉 This hands-on process sharpens your taste and informs your own design choices.
need elaboration
Lets you add comments anywhere on a webpage, including ChatGPT.
the third annotation
Highlight text on ChatGPT responses.
it seems important
Add sticky notes/comments
this is important