  1. Last 7 days
    1. Can reverse cross-population comparisons.

      Remember -- we are not focused on cross-population comparisons for this workshop. It's more about 'which interventions yield greater welfare', which would generally involve difference-in-differences, ideally across comparable populations (but not always).

    2. δ = discount factor for future years

      Where did the discount and time factor come from? Where did these definitional equations come from? I didn't think most empirically estimated WELLBY measures considered multi-year collection or impact. And are they really discounting?

    3. what most intervention comparisons need)

      Cut this. I don't think it necessarily holds -- a lot of interventions impact mortality.

      Add to footnote -- the 'incremental' WELLBYs may be captured by observing differences between comparable treated and untreated populations.

    4. UK Government: Official guidance for policy appraisal

      A link to this would be helpful. The "Green Book". (I wonder -- how impactful has this actually been on British policy?)

    5. Neutral point estimation: What is the actual neutral point on the 0-10 scale for different populations? How stable is it across contexts?

      I suspect we don't have any good measures of this? There's the Peasgood paper, but I don't think that was in an LMIC and I'm not sure how much it has been vetted?

    6. Annotate & Comment: Double-click any text to add a Hypothes.is annotation. No account needed to read; quick signup for a free account to post.

      We'd especially like pre-session feedback on

      • Are these ~accurate?
      • Are they useful? At the right level?
      • What is redundant?
      • Which issues should we skip (as less important to intervention choices for LMICs, mostly resolved, or intractable)?
      • What is missing?
      • Is there a better overall structure and framing for these?
      • Where does it go into too much detail? Where is it too opinionated in cases where we should leave things open?
      • Are we failing to attribute any important sources for language, arguments, or claims? *
    7. Predictive validity: SWB predicts consequential outcomes systematically

      This was mentioned above, but does it do so in a scale-sensitive way?

      As I suggested, it's not enough to have it be 'somewhat predictive'

    8. Transformation Sensitivity Demo

      This needs more context and explanation. I've forgotten what g(x) is here, and what's the actual calculation? Also, this doesn't seem to be illustrating the point it's meant to. As I move the slider, population B always seems to be higher, but it also seems like we're getting away from the discussion of the relative impact of different interventions. We don't want to just simply compare populations. If this does pertain to interventions, explain better.

      Explain a bit more (as a footnote) what the 'transformation' means here and why/when it's used.

    9. Magnitude-sensitive cost-effectiveness: Even if signs are stable, cost-effectiveness ratios rely on magnitudes

      Do they? Magnitudes of what? Explain. Give a 1-2 sentence example as a footnote.
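
      For instance (my sketch, with purely illustrative numbers): suppose intervention A raises LS by 0.5 points for 2 years at $200 per person (0.5 × 2 / 200 = 0.005 WELLBYs per dollar), while B raises LS by 1.0 point for 1 year at $300 (≈ 0.0033 WELLBYs per dollar). Both effects are positive and their signs are robust, but the ranking of the cost-effectiveness ratios depends entirely on the point-change magnitudes, so a rescaling that shrinks gains in one region of the 0-10 scale relative to another could flip which intervention looks more cost-effective.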

    10. Incremental WELLBY Estimate

      This is simple and perhaps obvious, but good for illustrating the simple linear WELLBY concept -- though that's already been explained above. I'm not sure whether it should maybe be put at the top; I'm not sure it's useful down here. OK, put this at the top, in a folding box -- it just helps to make sure we're all on the same page about the definition of the WELLBY here.

      Perhaps it would also be helpful to include some sort of adjusted-WELLBY calculator interface -- a more sophisticated concept people might not appreciate, particularly embodying the approach in Benjamin and others.

    11. What "non-identified" means A parameter is "identified" when data + assumptions pin down a unique value. Ordinal responses only tell us which interval a latent value falls into. Many different latent distributions and transformations can generate the same observed category counts, so rankings of means can change across equally admissible representations.

      This explanation is not clear. It could be improved; it's a bit too literal. Why do ordinal responses only tell us which interval a latent value falls into?

      This might also be worth folding
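
      The underlying mechanics I'd want spelled out (my sketch, not the page's notation): in the usual ordered-response model, person i reports category c exactly when their latent wellbeing falls in that category's interval,

      $$\tau_{c-1} < u_i^{*} \le \tau_c,$$

      for unobserved thresholds τ. Applying any strictly increasing transformation g to both the latent values and the thresholds reproduces exactly the same observed category counts, so the data alone cannot distinguish u* from g(u*) -- and group means of u* and of g(u*) can rank the groups differently.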

    1. We may quote specific responses with attribution unless you request otherwise. If you prefer your responses remain anonymous,

      Adjust this -- "If you prefer your response to remain anonymous, please use a pseudonym and try to use the same one consistently if you're providing multiple responses." If you are fine with internal recognition but don't want any public attribution, please let us know and share any other concerns in the field at the bottom.

    2. How likely is it that the simple WELLBY measure (as defined above) is the best or near-best measure—yielding no less than 80% of the value of the best measure—for cross-intervention comparison in the focal context? (State your best calibrated probability.)

      I'm considering adjusting this one to

      "Consider the 'value obtained when using the best feasible measure for cross-intervention comparison in contexts like the focal context'. What share of this value is obtained, in expectation, from using the simple linear WELLBY measure for all interventions? Please give your central belief and 90% credible intervals."

      -- with a slider that goes from zero to one, and two other sliders that allow you to specify the lower and upper bounds of the 90% CI.

    1. emonstrates that small transformations can reverse published findings.

      NotebookLM:

      "they applied their methodology to nine prominent results from the happiness literature—including the Easterlin Paradox, the U-shape of happiness in age, the ranking of countries by happiness, and the effects of marriage and children—and showed that the standard conclusions in all nine areas could be reversed using monotonic (specifically lognormal) scale transformations. They argued that these reversing transformations were "plausible," claiming they were no more skewed than the U.S. wealth distribution

      However, later work questions the plausibility of this. .

    2. WELLBY Reliability The strongest conceptual defense of treating wellbeing scales as cardinal and comparable. Argues that deviations from cardinality are small and not policy-relevant.

      Link this Google doc version (updated 2025) instead https://docs.google.com/document/d/1urmraqXR8QPhH9A-y_hlvfUowXxkBNVByO0YP9Btsv8/edit?tab=t.0#heading=h.zctxvc7apqvw as well as the audio TTS read here: https://www.dropbox.com/scl/fi/p8w622z1ij3nnubqesli6/happy-possibility-cardinality.mp3?rlkey=wyjt56ip11z1919wef7htv54n&dl=0

    1. Note: human means carry their own variance; correlations here are bounded by human inter-rater noise.

      is this ggplotly? Shouldn't it be dynamic? I don't seem to be able to adjust it

    1. Alberto Prati may contribute via pre-recorded video.

      Not 'video', possibly some written content, or we can extract issues from his evaluation to ask Benjamin et al.

    1. leads to least regret?

      The "least regret" is a formal term in information theory, I believe, or from Bayesian updating. Provide a footnote defining and referencing it. #Implement

    2. Annotate & Comment:

      We'd especially like pre-session feedback on

      • Are these ~accurate?
      • Are they useful? At the right level?
      • What is redundant?
      • Which issues should we skip (as less important to intervention choices for LMICs, mostly resolved, or intractable)?
      • What is missing?
      • Is there a better overall structure and framing for these?
      • Where does it go into too much detail? Where is it too opinionated in cases where we should leave things open?
      • Are we failing to attribute any important sources for language, arguments, or claims? *
    1. Most studies measure outcomes at baseline and one or two follow-ups;

      Give a footnote with some examples here. What do the studies involving LMIC interventions do?

  2. Mar 2026
    1. Monotonic transformations can reverse conclusions

      An example here would be very helpful. ... Perhaps even an interactive display.

      Monotonic transformations of what?
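
      A toy example of the kind of thing I mean (my numbers): on a 0-10 scale, suppose half of group A reports 2 and half reports 10 (mean 6), while everyone in group B reports 7 (mean 7), so B looks better on the raw scale. Apply the strictly increasing transformation g(x) = x²: group A's mean becomes (4 + 100)/2 = 52 while group B's becomes 49, so the ranking flips even though g preserves the order of every individual response.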

    2. Bond and Lang (2019) argue that with ordinal response data, comparing "average happiness" between groups is generally not identified without strong assumptions—monotonic transformations can reverse results.[11]

      This should be fleshed out in more detail and rigor, along with some responses to it, and probably belongs earlier on in the discussion.

      ....

      What do you mean, comparing "average happiness between groups is not identified"? What is the thing that is not identified?

    3. Time structure and discounting [Mermaid sequence diagram: Baseline (t=0) → Follow-up (t=1) → Later (t>1), annotated "Persistence, decay, response shift?"]

      This diagram is not fully explained. I don't see how it relates to the rest of the content either.

    4. [Mermaid flowchart: Intervention → Study design → Measured outcomes (LS / DALY / depression) → Translation layer (mapping, calibration) → Common currency (WELLBY / DALY / $) → Decision]

      This flow chart is too small and it's underexplained. I don't understand what each of these is meant to mean and how they fit together.

    5. Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive respondent burden?

      That seems fairly tractable for us to at least share our knowledge about in this conference. Cool.

    6. true mapping

      That's the second question combo which we'll be setting up an explainer on. Once we do, we should link that and also link that PQ here

      But 'true mapping' needs a bit more definition. Maybe put it in scare quotes to note that (or link the tentative formulation in the PQ space).

    7. Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more in a given context?

      Measuring this seems fairly high value to me if it can be done at a low cost.

    8. These questions represent high-value areas for future research that could meaningfully improve the reliability of WELLBY-based comparisons:

      I wouldn't state this so directly and clearly, and give attributions to people making the claims that these represent high value. We want this to be one of the outputs of the workshop, but I'm not sure that all of these are in fact high value. Some of them might be very much intractable.

    9. Within-person designs where each person serves as their own control

      But this can bring its own problematic effects if people feel prompted or motivated to report an improvement to please the experimenters, etc.

    10. Treat WELLBY estimates as one input among several, not the final answer

      That's the sort of milquetoast thing I want to avoid. People will always say, "Do compare multiple things, don't treat something as the gospel truth, etc." It's not a statement with a lot of meaning.

    11. 8. Practical Recommendations

      I don't like having a core 'practical recommendations' section here. The recommendations are meant to come out of the workshop; we shouldn't be pre-establishing them. It's OK if you want to compare the recommendations coming out of the existing reports & literature, though.

    12. DALYs and QALYs: Standardized But Narrower

      How are these measured in the relevant settings and how does it differ from WELLBY? These are based on external measurements?

    13. Years of Life Lost (YLL) + Years Lived with Disability (YLD

      This seems like it must be incorrect/imprecise. Is a year with a disability actually measured here as being as bad as a year of life lost? This needs a better definition... how is it measured?

    14. It does not automatically imply that within-study randomized treatment effects are meaningless It implies you should be explicit about what assumptions let you treat reported changes as welfare units

      this seems a bit babytalk/obvious

    15. OECD (2024) concludes data remain meaningful for policy despite critiques

      Give a link... and what is the basis for this? Meaningful is somewhat of a vague term. It doesn't get at the hard questions about what measures we should use for comparing specific interventions.

    16. Survey response times can help solve identification (Liu & Netzer, AER 2023)

      This is highly counter-intuitive to me. How do survey response times help?

    17. A strong response to skepticism: even if the numbers seem arbitrary, do they behave like a measurement? Kaiser and Oswald show that single numeric feelings responses have strong predictive power—relationships to later "get-me-out-of-here" actions (changing neighborhoods, jobs, partners) tend to be replicable and close to linear in large longitudinal datasets.[10]

      This kind of seems like a weak response unless I'm missing something. Even if they are not arbitrary, even if they have informational value, it doesn't tell me that they provide reliable information in comparing the benefit/cost across multiple interventions which all improve people's lives.

    18. They do not solve cross-study comparability—but demonstrate that in at least one setting, SWB is responsive.

      But this doesn't seem to have been the challenge as posed. I'm not sure this is the most relevant thing to lead with, or maybe it needs to be motivated better

    19. Measurement error attenuates estimated effects (bias toward zero)—small real effects may be undervalued

      How does that affect the relative comparison of interventions?

    20. What breaks: Duration weighting is wrong. Why it might fail: Adaptation effects—people return to baseline. Mitigation: Long-term follow-up data.

      Again, this is too shorthand. I need an explanation, if necessary, in footnotes or a folding box, of what all this means.

    21. ΔLS has ≈ same welfare meaning across people

      'Meaning' should be clarified, perhaps with reference to the gold standards I suggest you add above. Should we state this in terms of an individual's willingness to make "time trade-offs" (e.g., they would be willing to go from 7→6 for one year in exchange for going from 3→4 in another year), or probability trade-offs (they would take a coin flip over the above), or person trade-offs (a third party would be willing to move one person from 7 to 6 if it meant moving someone else from 3 to 4)... [or vice versa in all cases]?
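
      One minimal way to write the within-person time trade-off condition I have in mind (my notation, not the page's):

      $$U(7) - U(6) \;=\; U(4) - U(3),$$

      i.e., the person is indifferent between losing a year at 7 (spending it at 6 instead) and gaining a year moved from 3 to 4. The probability and person trade-off versions substitute a coin flip, or a third party's judgment about two different people, for the within-person comparison.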

    22. ΔU(3→4) = ΔU(7→8)

      Obviously this notation is extremely crude! I wonder if important nuance is lost here.

      E.g., is this 'within person' or 'across people'?

    23. Validity

      "Validity" is vague, needs a better definition. And perhaps something more informative in terms of the metric offering value would help. Naturally, no metric would be perfect, and even if a model's assumption are violated in practice, the assumption might be close enough to holding that the difference doesn't matter much.

      We need a better definition of the 'gold standard here'. What would an 'accurate comparison' tell us? What is the appropriate measure of 'degree of inaccuracy'?

    24. Test

      How do we test this? Define 'log transformation' more clearly here, and what are the assumptions necessary for it to accurately reflect trade-offs?

    25. Ceiling/floor effects: Even with identical reporting functions, bounded scales can cause mechanical differences in responsiveness at high or low baselines.

      But this does not seem consistent. You are saying "when heterogeneity is most dangerous", but this doesn't look like heterogeneity.

    26. Comparing across studies/countries: Different instruments, translations, norms, and populations. If the distribution of stretch factors bi differs, "1 point-year" is not the same welfare unit across the evidence base.

      Can you justify this a bit more, both in equations and in an intuitive explanation of what the problem is?
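
      Roughly the argument I'd want spelled out (my sketch, reusing the page's Δu_i = b_i × ΔLS_i notation with b_i > 0 unobserved): each study's mean reported change E[ΔLS] is a welfare change only after weighting by the stretch factors,

      $$\mathbb{E}[\Delta u \mid A] = \mathbb{E}[\,b_i\,\Delta LS_i \mid A\,], \qquad \mathbb{E}[\Delta u \mid B] = \mathbb{E}[\,b_i\,\Delta LS_i \mid B\,].$$

      If the distribution of b_i differs between evidence bases A and B, then E[ΔLS | A] > E[ΔLS | B] does not imply E[Δu | A] > E[Δu | B]: a "1 point-year" in A and in B are denominated in different (and unknown) welfare units. Intuitively, if population A uses the scale with smaller stretch factors, its reported point gains overstate welfare gains relative to B's.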

    27. interpersonal noncomparability is less of a threat for estimating an average treatment effect

      "less of a threat" is vague, needs clarification. And why? Give a citation and/or a proof and further explanation (perhaps in a footnote)

    28. studies, countries, or populations with different distributions of "stretch factors.

      Adapt this discussion to focus more on comparing different interventions (see the canonical example, but also link real-world relevant comparisons and studies)... where these interventions may take place in nearly identical, similar, or distinct contexts, and affect similar or different outcomes (wealth, health, etc.).

    29. Δui = bi × ΔLSi.

      this needs more explanation. What does 'fail' mean here? What's being compared, and how do the estimates compare with the ground truth?

    30. UA ≈ UB

      Maybe add a footnote explaining what sort of "utility" we are considering here, noting this is a bit of an oversimplification of welfare considerations.

    31. A common overstatement is that

      Who stated this? How is it 'common'? Maybe just change this to "Equal scores mean equal welfare" is stronger than most applications need.

    32. This second form requires a defined zero point (e.g., death = 0)

      Might benefit from some further explanation. How could Level-based be used for comparing interventions -- that's not clear here. How many people are we summing over? How do 'dead people' enter into that? Some explanations can go in footnotes.

    33. Σi Σt δt (LSit(k) − LSit(0))

      Is this really how it's depicted in the literature? It's a bit confusing at first, because it looks like one has to know two things for incremental WELLBYs and only one thing for the level-based measure. Furthermore, the incremental one seems to require knowledge of a counterfactual. However, one might be able to have an estimate of a difference without knowing the levels. Isn't there a better notation/explanation for this?

    34. ΔWELLBY(k) = Σi Σt δt (LSit(k) − LSit(0))

      I'm missing the definition of the indices i and t, as well as the definition of the variable LS -- #adjust #implement
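
      For reference, the reading I'd want made explicit (my interpretation; confirm against the page's definitions):

      $$\Delta \mathrm{WELLBY}(k) \;=\; \sum_{i}\sum_{t}\delta^{t}\,\bigl(LS_{it}(k) - LS_{it}(0)\bigr),$$

      where i indexes affected people, t indexes years since the intervention, LS_it(k) is person i's 0-10 life satisfaction in year t under intervention k, LS_it(0) is the same person-year under the no-intervention counterfactual, and δ is the annual discount factor (presumably δ^t rather than δ × t).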

    35. Benjamin et al. 2023, UK Green Book Wellbeing Guidance, Bond & Lang 2019, Haushofer & Shapiro 2016/2018, Kaiser & Oswald 2022)

      Are these really all the sources? I thought we had more.

    36. AI-Generated Content (March 2025): This page was created through iterative prompting of Claude Code (Opus 4.5) and GPT-5.2 Pro, feeding in workshop discussion content and focal papers for our Pivotal Questions initiative (Benjamin et al. 2023, UK Green Book Wellbeing Guidance, Bond & Lang 2019, Haushofer & Shapiro 2016/2018, Kaiser & Oswald 2022). While grounded in these sources, this content requires further human verification. Specific claims, citations, and numerical details should be checked against the original literature before relying on them.

      Make this a folding box #implement

    1. The measurement-to-decision pipeline [Mermaid flowchart]

      the diagram is too small, and was never explained!

    2. Some influential critiques argue that different monotone transformations can reverse conclusions about "average happiness"

      'influential' -- that's subjective. ///Link to an example

    3. Is "incremental WELLBY" standard terminology? Some literatures talk about WELLBYs as point-years of life satisfaction (UK guidance) and many evaluation contexts are inherently incremental. But "incremental WELLBY" itself is not uniformly a standard term. In this page, we use it as a descriptive label for counterfactual impact calculation, not as established jargon.

      too inside-info for a whole box. -- make this a footnote at most

    4. WELLBY (unit of account): UK Green Book guidance defines a WELLBY as a one-point change in life satisfaction on a 0-10 scale, per person per year.[3]HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.

      Missing the standard framing of the LS question here

    5. The measurement-to-decision pipeline [Mermaid flowchart: Intervention → Study design → Measured outcomes (LS / DALY / depression scale) → Translation layer (mapping, calibration, assumptions) → Common currency (WELLBY / DALY / $) → Decision / deliberation]

      this is too small and also underexplained

    6. Plant, M. (2025). "A Happy Possibility: Rational Behavior and the Cardinality Thesis." Working paper.

      wait -- hallucination -- you renamed the title here!!

    7. f you compare to mortality-preventing interventions

      Adjust this to "if you compare interventions that affect mortality (or, in some accounting, birth rates)"

    1. 📊 View Aggregated Results See beliefs elicitation summaries and Metaculus question forecasts

      I don't think I want to show this here because I don't want people to anchor in stating their beliefs. #todo #adjust #implement

    2. discussion. SEGMENT 1 ~11:00 AM ET 25 min Stakeholder Problem Statement & Pivotal Questions Stakeholders present their WELLBY/DALY challenges (~10 min each), then we introduce key PQs for belief elicitation (~5 min) Speakers: Peter Hickman (Coefficient Giving), Matt Lerner (Founders Pledge) Upcoming

      Here or somewhere early in the workshop, we should have the opportunity for participants to provide feedback about whether the Pivotal questions are clear and useful, and which ones are more important to their work and to the ~welfare of humanity.

    3. SEGMENT 7 ~2:20 PM ET 30 min Practitioner Panel & Open Discussion Brief presentations (~10 min each) on practical implications, followed by open Q&A. A private follow-up discussion among key participants will follow.

      Possibly put a gap before the section to give people more time to do the belief elicitation for us to potentially summarize the results and present it to people.

    4. SEGMENT 7 ~2:20 PM ET 30 min Practitioner Panel & Open Discussion Brief presentations (~10 min each) on practical implications, followed by open Q&A. A private follow-up discussion among key participants will follow. Panelists: Matt Lerner (FP), Peter Hickman (CG)

      Rephrase "key participants" as "the most heavily involved participants" -- we're not judging how 'important' people are.

    5. SEGMENT 6 15 min Beliefs Elicitation Guided form to state priors on operationalized pivotal questions Self-guided form + Metaculus questions Upcoming

      This will be very loose. Reinstein will introduce the context and interfaces, and stick around to answer questions and fix bugs etc.

      Change this to "explains" not "introduces", as it was already introduced briefly

    6. SEGMENT 7 ~2:20 PM ET 30 min Practitioner Panel & Open Discussion Practical implications for funders and researchers Panelists: Matt Lerner (FP), Peter Hickman (CG)

      So this will be public -- another 10 minutes presentation from each. Then we will open it up for discussion questions -- David Reinstein will raise some if others don't. This will be followed by a private invitation-only discussion amongst a few heavily involved participants (to be mentioned but not linked here)

    7. SEGMENT 4 25 min WELLBY Reliability Discussion Is the linear WELLBY reliable enough for cross-intervention comparison? Open discussion

      Reword -- 'reliable enough' is not precise (see PQs) /// also adjust to note "in low-income countries" /// #implement

    8. Scale-use heterogeneity findings, calibration methods, and implications for WELLBY use Speakers: Dan Benjamin (UCLA/NBER), Miles Kimball (CU Boulder)

      Extend this with a presentation on the application of this method in Israel. -- or maybe after the evaluator responses/discussion?

    9. Evaluator Responses & Discussion Evaluation findings, author dialogue, Unjournal process Presenters: Caspar Kaiser, David Reinstein, Valentin Klotzbücher

      This will get into more technical research issues

    10. SEGMENT 7 30 min Practitioner Panel & Open Discussion Practical implications for funders and researchers Speaker: Matt Lerner (Founders Pledge)

      I thought the CG speaker would be on this as well. And maybe someone from HLI?

    11. Evaluator Responses & Discussion Key critiques, suggestions, and author responses

      I suggest Caspar Kaiser, David Reinstein, and maybe Valentin Klotzbucher as presenters (Reinstein/Valentin will introduce Caspar and perhaps a pre-recorded bit from Prati)

    12. SEGMENT 2 25 min Paper Presentation: Benjamin et al. Scale-use heterogeneity findings and calibration methods Speakers: Dan Benjamin (UCLA/NBER), Miles Kimball (CU Boulder)

      "Research presentation" (not "paper presentation")

      Should come with "implications for WELLBY use"

    1. html`<div style="background: #f8f9fa; padding: 1rem 1.25rem; border-left: 4px solid #3498db; margin-bottom: 1.5rem; font-size: 0.95em; line-height: 1.6;"> <strong>What these numbers represent:</strong> Simulated <strong>production cost per kilogram of cultured chicken</strong> (wet weight, unprocessed) in <strong>${targetYear}</strong>, based on ${stats.n.toLocaleString()} Monte Carlo simulations. This is the cost to produce meat in a bioreactor — not retail price, which would include processing, distribution, and margins. <br><br> <strong>Why it matters:</strong> If production costs reach <strong>~$10/kg</strong> (comparable to conventional chicken), cultured meat could compete at scale. If costs remain <strong>>$50/kg</strong>, the technology may remain niche. These thresholds inform whether animal welfare interventions should prioritize supporting this industry. </div>` RuntimeError: targetYear is not definedOJS Runtime Error (line 804, column 163) targetYear is not defined

      How can we fix this 'runtime error'? I think it was working before. The "target year" should be the "projection year" in the sidebar model parameters. The default year was 2036. #implement

    1. Add questions and comments directly to the collaborative notes above, or submit them via the beliefs elicitation form.

      Remove 'or submit them via...' Note that the "beliefs elicitation form" is doing something else. #adjust #implement (adjust this on all pages).

    2. decision problems that funders face when comparing interventions measured in different units (WELLBYs vs DALYs)

      It's not just about 'comparing interventions measured in different units' #adjust #implement

    3. How funders currently navigate WELLBY vs DALY in cost-effectiveness analysis

      Not just 'WELLBY vs DALY' ... --> How funders consider wellbeing and metrics based on self-reports in considering and comparing interventions.

    1. By 2030, will more than 50% of GiveWell's top charities include a WELLBY-based cost-effectiveness analysis alongside or instead of DALY-based analysis?

      Make sure this one is posted on Metaculus. It's great because it has an actual ground truth.

    2. If the effectiveness of some programs have already been measured in terms of WELLBYs, while others are measured in terms of DALYs, what method or what "mapping structure or approach" should we use to compare and convert between them?

      It might be too many questions on conversion here if the workshop's not focusing on conversion. We might want to move some of these more detailed questions to a second outlinked page.

    3. These are some of the key operationalized questions from our Wellbeing Pivotal Questions project. We want to elicit expert and stakeholder beliefs—before, during, and after reviewing the evidence and key arguments—to see how views evolve and where consensus exists. (All questions are optional.)

      These have Metaculus versions. We probably want to link them here, but we also don't want to overwhelm people.

    4. 📋 Full question specifications: For more detail, context, and the complete set of operationalized questions, see the canonical Wellbeing PQ formulations on Coda →

      We link to these, but they might be a bit overwhelming for session participants - perhaps put a disclaimer here.

    5. About You

      We had a box here to indicate whether this is an original submission or an updated edition -- first submission, second submission, etc. Or perhaps this is redundant, as we will see it based on the time it comes in?

    6. One way to think about this: Imagine an ideal research team with unlimited resources, time, and data—perhaps even a kind of omniscience where they could perfectly understand the welfare and psychological states of everyone affected. What probability would you assign that this idealized team would ultimately conclude the statement is true?

      I'd like to link a "calibrate your judgment" tool here -- a very quick exercise. Ideally, this is something we could even embed. I don't want people to have to sign up for things; friction is the enemy.

    1. DALY_01 What is the best numerical conversion factor between WELLBYs and DALYs? If a charity prevents 1 DALY (Disability-Adjusted Life Year), approximately how many WELLBYs does this represent? Current estimates range from 2-15. View on Metaculus

      Have this link to the specific linked metaculus forecast, not just the general community page.

      But we should also embed the more detailed belief elicitation that is already here.

    1. Brief context on the Unjournal evaluation process

      Also mention how this relates to this Pivotal Questions initiative and how we're looking for Pivotal Questions evaluators. #implement

    1. —a “think first, score second” protocol designed to ground numeric ratings in specific textual evidence

      Is there backing in the literature for this approach? Is there any formal way of defining this approach?

    2. We pass the PDF directly to the model’s native multimodal input rather than extracting text, preserving tables, figures, equations, and layout cues that ad-hoc scraping could mangle. A single API call per paper avoids hand-offs and summary loss from m

      I think we've discussed this before. There are trade-offs here. Close up, we could be assessing some less meaningful components of paper processing rather than the actual reasoning frontiers, and how these differ by model, by type of paper, by field, etc. Are you sure this is what we're doing, and should we consider doing it in a different way?

    3. Sample and human reference data.

      I suggest some subsection headers here. There are different aspects to what you might call methods, including:

      • the context of the content included
      • the LLM pipeline and procedure
      • the comparisons of human and LLM identification, including for identification of issues
      • the statistical/info-theoretic analysis

      It might be helpful to divide this up.

    4. development economics, health policy, environmental economics,

      This leaves out some important field/priority areas - dig deeper. We include the economics of innovation and global catastrophic risk as well.

    1. models’ training data may include fragments of these papers or related discussions.

      This should be discussed in more detail, perhaps with a particular section addressing this both conceptually and with some empirical checks. We should perhaps have robustness checks (maybe we already have some of these) with models that explicitly use cut-off dates before the start of our evaluation sample, or that remove papers/evaluations occurring after this date. This should be linked and referred to here.

    2. Qualitative coverage varies widely across papers: on some, the LLM captures nearly all consensus human concerns; on others, it misses key critiques or raises issues absent from the expert consensus

      A link would be helpful here / an example.

    3. respectably

      Let's try to avoid terms like "respectably" without definitions. This is the sort of thing that leads to sloppy thinking. Can we give a quantification in words in some meaningful way? Or perhaps we should depart from the norm of giving broad but imprecise explanations of everything (also with a lot of repetition) in papers like this.

    4. , approaching the ceiling set by inter-rater variability among humans themselves.

      It's noted elsewhere that this isn't really a ceiling -- it doesn't act as a ceiling. In practice it's not mathematically guaranteed to be the case, and I don't see a conceptual reason why it should be.

    1. (likely Founders Pledge)

      update #implement -- CG and Founders Pledge are both likely to speak for about ten minutes, followed by a discussion of how we're mapping this into "pivotal questions". /// Try to keep this aligned with the "live sessions" page.

    2. Confirmed: Monday, March 16, 2026 · 11am–5pm ET / 4pm–10pm UK · Fully online · ~3.5 hours of live sessions (join only the segments you're interested in) + asynchronous

      Make this date/time more prominent here and on all pages. Note that you need to sign up to be given the Zoom link.

    3. Your primary role in this conversation (optional)

      Make a box below this with an optional free response, asking people what their background is and why they're interested.

      Add a caveat/"Note on access" in a folding box. Note that these sessions themselves, as proposed, are "by invitation only". We will share the Zoom link only with a limited set of people to keep things from becoming overwhelming. Please don't be offended if we don't follow up with you; we have limited bandwidth and may have overlooked you. But we aim to bring anyone interested into the conversation in some format, perhaps a future more open event. #implement

    1. Focal Question (DALY_01) If the impact of one program is measured in WELLBYs and another program impact is measured in DALYs, and we have a reported effect size and standard deviation for each, what is the best numerical conversion or mapping between them? Note: This may be treated as a secondary topic depending on time constraints.

      This should link or embed the space where people can state their beliefs

    1. 💬 Questions & Comments Submit questions and comments through the form below. Note: Submissions are publicly visible to all participants.

      There is no form here. How can we enable it? Ideally with an 'upvote' feature to prioritize these?

    1. ogether the paper's authors, the evaluators who assessed it,

      Adjust -- not just 'the paper'; authors of several papers in this area as well as Unjournal evaluators.

  3. Feb 2026
    1. (explaining the slightly different ρ values).

      The difference is between the value in the diagram and in the table. I didn't understand what difference was being referred to at first... so this should say, "explaining the slightly different ρ values between the table and the figure."

    2. evaluator pairs is no tighter than the LLM-human scatter in panel

      This is a vague statement. I mean, it's not obviously tighter, but you can't eyeball it and say that it's no tighter. OK, it's less tight if we use the Spearman measure -- that should be made a bit clearer. I didn't see that you gave us the Spearmans.

      But I still think this gets back to the question of whether it's fair to compare the human-human individual evaluator correlations to the correlation between the LLM and the average of humans. Given both signal and noise, I imagine that the average of two measures tends to be more reliably predicted by a third noisy measure than one individual measure predicts a second.
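
      A quick sketch of that intuition (my toy simulation; not based on our data):

      ```r
      # If two human ratings and an LLM rating are each "true quality" plus
      # independent noise of the same size, the LLM's correlation with the
      # *average* of the two humans exceeds the human-human pairwise
      # correlation, purely because averaging reduces noise.
      set.seed(1)
      n   <- 1e5
      q   <- rnorm(n)            # latent "true quality"
      h1  <- q + rnorm(n)        # human evaluator 1
      h2  <- q + rnorm(n)        # human evaluator 2
      llm <- q + rnorm(n)        # LLM rating, same noise level
      cor(h1, h2)                # ~0.50  (pairwise human-human)
      cor(llm, (h1 + h2) / 2)    # ~0.58  (LLM vs. human mean), higher by construction
      ```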

    3. Compare panels (b) and (c) directly to see whether LLM-human scatter is tighter than human-human scatter.

      Probably put at least one correlation metric in each plot because it's really hard to eyeball this

    4. ) ratings for each paper, revealing inter-rater variability—CIs often span 20–40 points.

      The statement is confusing and not fully explained. What CIs are we talking about here? Note that we ask each rater to explicitly provide 90% credible intervals for each rating. Is that what this is referring to? But that's a different thing than inter-rater variability

    5. In most cases both LLMs fall within the range of human opinions, though several papers show substantial divergence.

      This might have been my own language in question? In any case, we should have some numbers to back this up -- it's not clear to me that this statement is in fact justified by the diagram. I seem to see a lot of cases -- or at least more than a few -- where the LLM ratings fall outside the human range.

    6. Per-paper overview and model comparison. Figure 2.1 presents three complementary views of overall (0–100 percentile) ratings. Panel (a) displays individual human evaluator ratings alongside GPT-5 Pro (orange diamo

      I guess this is ordered from highest to lowest average human rating? Check this and explain it in the diagram or the discussion.

    7. Per-paper overview and model comparison. Figure 2.1 presents three

      Diagrams are too small -- I can barely see them, at least in this version. If these are meant to be printed out, there's no way people will be able to see them. In an online hosting you could let readers zoom in, of course, but for a printable version you'd need to make these a lot bigger. And no one can read the names either.

    8. We evaluate 6 frontier LLMs against human expert reviews from The Unjournal.

      This seems repetitive of what we said in the first section... to the extent it needs to repeat, please take on board the Hypothes.is comments there.

    9. Results

      Putting results before methods might be the norm in computer science, but in economics I think we usually see the methods and discussion come first (although people often mention the results in the introduction).

    10. criterion-level ceiling.

      I don't know why they use the word "ceiling." It's not really a ceiling here. Maybe it's a point of comparison, but there's nothing that statistically or mathematically bounds the others to be below this. And in fact, by this measure, the models sometimes do better at matching humans than humans do at matching each other, at least in the stats that I've seen.

    11. If two human evaluators agree at Spearman ρ = 0.55, an LLM achieving ρ = 0.57 against the human mean is performing within human inter-rater range.

      Not sure I completely understand the claim here and what is meant by "performing within human interrater range."

    12. severity, topic familiarity, interpretation of the scale)

      Perhaps also mention that we're asking them to provide percentiles relative to papers in this area that they read in the last two years, and different evaluators may have read different selections of research. There should be a link here to the actual guidelines that we gave the humans (https://globalimpact.gitbook.io/the-unjournal-project-and-communication-space/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators)

    13. 33 of these were also evaluated by Claude Opus 4.6

      We want to make sure this is either dynamically coded or that the LLM is called "updated" as we should increase this

    1. six frontier LLMs

      These are not all frontier, I would say. Or am I wrong here? Does the term "frontier" include faster but less deeply thinking models?

    2. Funding for The Unjournal has been provided by the Survival and Flourishing Fund, the Long Term Future Fund, and EA Funds.

      do we need to mention Unjournal funding here?

    3. The Unjournal setting is particularly well suited for this comparison. It commissions paid expert evaluations using a structured rubric covering seven percentile criteria with 90% credible intervals plus journal-tier predictions, and publishes the resulting packages openly

      A bit more context on The Unjournal would probably be helpful here, mentioning our prioritization, etc.

      Claude added this comment (on an earlier version?) Claude: Selection bias: Unjournal selects papers from NBER/top working paper series. This is not a random sample of research. LLM performance on pre-screened quality papers may differ from performance on the full distribution (including poor papers). Explicitly note: "Our sample is pre-selected for quality; results may not generalize to evaluating lower-quality submissions."

    4. Our headline finding is that the best-performing model (GPT-5 Pro) matches or exceeds pairwise human inter-rater rank agreement on overall quality,

      I don't want to be seen as cherry-picking here. When we report this we should also report the other important statistics, like Krippendorff's alpha, and at least mention which metrics the LLM performs worse on in terms of matching the humans.

    5. while the journal-tier predictions provide an external reference point2

      By the language here, the predictions are not an external reference point. The publication outcomes and perhaps citation outcomes are an external reference point even though as we say, this is not a precise measure of the "quality" of the paper.

    6. reducing classic gatekeeping motives and increasing reviewer effort.

      Not sure what they mean by "reducing classic gatekeeping motives." We argue that it does lead to a high level of reviewer effort for a few reasons, but this is not fully justified here. The case we make is that we manage it carefully and that the reviews (we call them "evaluations") will be made public, so people may want to set a better standard; some people leave their names (i.e., sign their reviews), so there's a reputation motive. We also offer compensation as well as prizes for the strongest work, so there's a direct financial incentive, although our compensation is fairly modest.

    7. structured measurement schemas (Asirvatham, Mokski, and Shleifer 2026), iterative quality-checking workflows (Zhang and Abernethy 2025), or the kind of prompt-robustness engineering motivated by specification-search concerns (Asher et al. 2026)—should improve further.

      Of course we want to look at these carefully before we praise them. I'm not super familiar with what each of these things are. And I'm not sure that I would state it so strongly that it will necessarily improve on this. There may be countervailing constraints and limitations. ... Taking a look at the AMS abstract, I don't quite see that this is the same sort of thing we're trying to do.

    8. with no iteration, retrieval augmentation, chain-of-thought scaffolding, or multi-step agentic loop.

      Rephrase as "We do not do any iteration..." As a separate sentence otherwise, it's a little confusing what we're saying we are doing versus not doing.

    1. irr::kripp.alpha(M, method = "interval")$value }, error = function(e) NA_real_)

      This is the interval version of Krippendorff's alpha, which penalizes the square of the distances. I guess this means it particularly penalizes cases where the raters are very far apart, while a larger number of small differences won't matter as much. I'm not sure whether this is appropriate -- something to think about. Perhaps we also want to provide the ordinal version for comparison, or something else. I believe we've thought about this but I can't remember what we came up with; we'll have to re-consult the notes.
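
      A minimal sketch of the comparison I'm suggesting (assuming M is the same rater-by-paper ratings matrix used above; the helper name is mine):

      ```r
      # Compare Krippendorff's alpha under interval (squared-distance) vs. ordinal
      # weighting on the same rater-by-item matrix M, to see how much the choice
      # of distance metric drives the reported agreement.
      library(irr)

      alpha_by_method <- function(M, methods = c("interval", "ordinal")) {
        sapply(methods, function(m) {
          tryCatch(irr::kripp.alpha(M, method = m)$value,
                   error = function(e) NA_real_)
        })
      }

      # Usage: alpha_by_method(M) returns a named vector with both coefficients.
      ```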

    2. 0.07

      Wow, this is nearly zero agreement among humans, but I wonder if something is coming up here because of the way we changed the categories/introduced new criteria? I think the 'claims' criterion might've been something we introduced later in the process, at the same point that I coalesced the two things related to global relevance. (I could check this; we have documentation.)

    3. Table A.3: Krippendorff’s αHH

      I suggest we should also include the agreement measures for the journal tiers? They should be comparable to the others, at least if the measure is fairly unit-less.