1,969 Matching Annotations

Feb 2026
daaronr.github.io daaronr.github.io

Just Ask the Model: One-Shot LLM Research Evaluation and Structured Expert Review

8
1. daaronr 26 Feb 2026
  
  in Public
  
  deliberately
  
  I would take out the word "deliberately" here
2. daaronr 26 Feb 2026
  
  in Public
  
  strong case that frontier LLMs can serve as additional expert raters in structured evaluation pipelines,
  
  I think saying "strong case" is probably too strong
3. daaronr 26 Feb 2026
  
  in Public
  
  Our headline finding is that the best-performing model (GPT-5 Pro) matches or exceeds pairwise human inter-rater rank agreement on overall quality,
  
  Just need some clarification. It's meeting or exceeding this if we compare it to the average of the human ratings. Is that a fair comparison? Double-check or compare it to how it would compare with the individual human raters.
4. daaronr 26 Feb 2026
  
  in Public
  
  against expert evaluations for 60 economics and social-science working papers
  
  I don't think this number should be 60 - I thought we only have 57, at least 57 that are publicly released.
5. daaronr 26 Feb 2026
  
  in Public
  
  multi-dimensional
  
  It's not clear what the advantage of our evaluations being "multi-dimensional" here is. At least this paragraph doesn't make it clear. The paper should make it clear that we also ask for overall judgments in comparison to familiar journal tiers. I would say the advantage of the multi-dimensional is it gives us a sense of the aspects of the research that the LLM tools tend to agree or disagree with the humans on... Something like an understanding of tastes and prioritization.
6. daaronr 26 Feb 2026
  
  in Public
  
  These developments make the evidentiary gap salient: funders, editors, and policymakers need to know when AI evaluation outputs are trustworthy enough to use, and when they are unstable, biased, or manipulable. Recent work highlights all three concerns. First, reproducibility can be “jagged”: repeated runs of the same models on the same corpus over time can be highly consistent for some tasks and models, but much less so for others (Thomas, Romasanta, and Pujol Priego 2026); robustness may require separating scientific judgment from computational execution (Xu and Yang 2026); and even without overt adversarial intent, subtle reframings of the same task can induce systematic shifts in outputs—a form of LLM “specification search”—raising concerns about frame-sensitive biases when models serve as measurement instruments (Asher et al. 2026). Second, adversarial manipulation is not hypothetical: invisible-text “prompt injection” can substantially inflate LLM-assigned review scores and acceptance recommendations in simulated peer review (Choi et al. 2026), and prompt-injection vulnerabilities are also documented in other high-stakes advice settings (Lee et al. 2025). Third, even when outputs look fluent and plausible, it remains unclear whether AI models approximate expert judgment: AI-generated reviews tend to cover more surface-level sections while being less thematically diverse and less focused on interpretation, originality, and applicability than human reviews (Rajakumar et al. 2026); LLMs used as manuscript quality checkers identify only a small fraction of confirmed critical errors even with the strongest reasoning models (Zhang and Abernethy 2025); and LLM scoring exhibits systematic range restriction and halo effects that can distort agreement metrics (Wang et al. 2025).
  
  This seems too long. This isn't really coming from us, so we might mention some of these things, but I tend to make this a lot shorter. Perhaps some things can be put in footnotes. Obviously we need to check these carefully to see if we agree with them.
  
  I think I mentioned before I'm not sure our work really speaks to the prompt injection issue. The set of work we're putting the LLMs and humans to evaluate would seem to be rather unlikely to have such prompt injection, so we can't really test that (unless we modified the work being fed in, but I don't think that's in our wheelhouse right now. )
7. daaronr 26 Feb 2026
  
  in Public
  
  Meanwhile, publishers are formalizing policies that treat manuscripts and reviews as confidential and prohibit reviewers from uploading them into general-purpose generative AI tools
  
  I'm just checking some of these references to see if there's hallucination going on - this one seems to check out ... from the cited policy " Reviewers should not upload a submitted manuscript or any part of it into a generative AI tool as this may violate the authors’ confidentiality and proprietary rights and, where the paper contains personally identifiable information, may breach data privacy right".
8. daaronr 26 Feb 2026
  
  in Public
  
  under strain
  
  citation needed
Visit annotations in context

Annotators

daaronr

URL

daaronr.github.io/llm-paper-mirror/
unjournal.github.io unjournal.github.io

Technical Reference – Cultured Chicken Cost Model

18
1. daaronr 24 Feb 2026
  
  in Public
  
  At 20 kTA reference scale
  
  Still needs more explanation. I don't know why you're using this reference scale. I don't know why we're talking about pharma grade, etc.
2. daaronr 24 Feb 2026
  
  in Public
  
  CAPEX scaling
  
  But what equation does it actually appear in? I guess it's the scale exponent r?
3. daaronr 24 Feb 2026
  
  in Public
  
  Scalable GF technology 50% Switches to “cheap” GF prices Pivotal uncertaint
  
  As mentioned elsewhere, this needs a lot more explanation or discussion. What is the major factor switching us between cheap and expensive growth factors here? How much does this affect the outcomes? What are the different price distributions for the cheap versus expensive ones? I'm not actually seeing growth factors or any of these p's in any of the equations you give. At least not in a way that allows me to unpack each element. ... Okay, now it's partially explained above, but I still don't see what the different price distributions are and where they come from for the cheap versus expensive
4. daaronr 24 Feb 2026
  
  in Public
  
  yments:
  
  You should provide an explanation in a folding box for the CRF formula in intuitive terms. I suppose it depends on the interest rate r and n, the number of years, but then explain how that works and why it equals this complicated formula.
5. daaronr 24 Feb 2026
  
  in Public
  
  The slider complements this by letting users explore “what if progress is partial?” scenarios.
  
  That seems to be underexplained and seems to contradict what you just said.
6. daaronr 24 Feb 2026
  
  in Public
  
  If any one of these succeeds at commercial scale, the “cheap” price regime applies
  
  That makes sense, but then what determines how you model the price in the 'cheap' price regime?
7. daaronr 24 Feb 2026
  
  in Public
  
  $0.20 – $1.20
  
  What is the source of these price numbers? Everything should be referenced / linked
8. daaronr 24 Feb 2026
  
  in Public
  
  Consider correlated scenarios via the maturity slider
  
  We probably want to unpack this more. One could imagine some forms of technical development going together and others less so.
9. daaronr 24 Feb 2026
  
  in Public
  
  Use for relative comparisons rather than absolute predictions
  
  What do you mean by relative comparisons?
10. daaronr 24 Feb 2026
  
  in Public
  
  Equipment depreciation period
  
  Which equipment here? Again, I want links directly to the equation.
11. daaronr 24 Feb 2026
  
  in Public
  
  Weighted average cost of capital
  
  That seems rather high - what are references for this? Why should it be so expensive? Here, is this comparable to some benchmarks?
  
  And again, I want to be able to look up each of these elements within an equation somewhere - I don't see where that equation is. Make the links clearer.
12. daaronr 24 Feb 2026
  
  in Public
  
  Breakthrough technologies that could trigger the “cheap” scenario: - Autocrine cell lines (cells produce own FGF2) - Plant molecular farming ($1-10/g target) - Precision fermentation at scale - Polyphenol substitution (reduces GF requirements by 80%)
  
  Okay, you got to my question here that I asked above, although it still seems underexplained. Wouldn't each of these things have independent effects on the cost of growth factors? So why is it just a zero-one switch?
13. daaronr 24 Feb 2026
  
  in Public
  
  30-200 g/L Final biomass at harvest Cycle time 0.5-5 days Time per production batch Media turnover 1-10 ratio 1=batch, >1=perfusion
  
  Interesting, but it should be more clear how this maps into the ultimate cost equation. Everything should be linked back in some way to a total cost formula. I'd like to be able to open and close and unpack the different elements.
14. daaronr 24 Feb 2026
  
  in Public
  
  Distribution
  
  Why were these distributions chosen? What's the justification? Were these used in other related models you could reference ?
15. daaronr 24 Feb 2026
  
  in Public
  
  Why correlate? In “good worlds” for cultured chicken: - Technologies are more likely adopted (higher P) - Custom reactors are more common (lower CAPEX) - Financing is cheaper (lower WACC) This prevents unrealistic scenarios where technology succeeds but financing remains prohibitively expensive.
  
  this is really not well explained. I don't see how the discussion relates to the equations here
16. daaronr 24 Feb 2026
  
  in Public
  
  The model uses a latent maturity factor (0–1) to correlate technology adoption, reactor costs, and financing: Padopted=bound(Pbase+k⋅(m−0.5),0,1) What does “bound” mean? bound(x, 0, 1) ensures the result stays between 0 and 1. Also c
  
  what are k and m here? define and explain
17. daaronr 24 Feb 2026
  
  in Public
  
  The GF progress slider interpolates between current and target prices: PGF=Pcurrent×(0.01)progress At 0% progress: current prices ($5,000–500,000/g) At 100% progress: target prices ($1–100/g for cheap scenario)
  
  The equation doesn't seem to be correct/displayed correctly here. Explain more but also I don't understand what "0.01^(progress)" means
18. daaronr 24 Feb 2026
  
  in Public
  
  Example calculation: - Cell density: 50 g/L → need 1000/50 = 20 L per kg - Media turnover: 3× (perfusion system) → 20 × 3 = 60 L/kg - Media price: $0.50/L (hydrolysates) → 60 × 0.50 = $30/kg
  
  is the 'per liter' meaningful though? Doesn't the density depends strongly on the contents used?
Visit annotations in context

Annotators

daaronr

URL

unjournal.github.io/cm_pq_modeling/docs.html
unjournal-workshop.netlify.app unjournal-workshop.netlify.app

Wellbeing Pivotal Questions | The Unjournal

12
1. daaronr 19 Feb 2026
  
  in Public
  
  uld the resulting adjustments change the cost-effectiven
  
  Adjust this to 'meaningfully change' (as defined in the 'resolution criteria')
2. daaronr 19 Feb 2026
  
  in Public
  
  WELLBY is a reasonably useful measure in this context
  
  Clarify: the WELLBY with linear aggregation as in the definition above
3. daaronr 19 Feb 2026
  
  in Public
  
  Given the available collected data [...], how should [funders] measure the impact on wellbeing? [...] What measures of well-being should charities, NGOs, and RCTs collect for impact analysis?
  
  Let's split up the answer boxes within this question to ask separately about the best use of currently collected data for these cases, and also ask what data should be collected in the future.
4. daaronr 19 Feb 2026
  
  in Public
  
  How reliable is the WELLBY measure [...] relative to other available measures in the 'wellbeing space'? How much insight is lost by using WELLBY and when will it steer us wrong?
  
  signpost more that we are talking about the very simple use of the WELLBY measure
5. daaronr 19 Feb 2026
  
  in Public
  
  More detailed questions on WELLBY reliability
  
  Should be 'on WELLBY reliability and wellbeing measures' ... but also the folding box is still not ideal here -- better for this to link out to another page/subpage (open in new window)
6. daaronr 19 Feb 2026
  
  in Public
  
  "Meaningful change" = at least one intervention currently in the top 5 moves out of the top 5, OR the #1 ranked intervention changes. This assumes future RCTs incorporate these methods and Founders Pledge updates their CEA accordingly.
  
  This one is nice -- is it the same in the PQ table?
7. daaronr 19 Feb 2026
  
  in Public
  
  If you propose a measure other than linear WELLBY in your answer above, how much more would it cost to achieve the same welfare improvement using linear WELLBY instead?
  
  Make it clear that 'speculative' is OK here
8. daaronr 16 Feb 2026
  
  in Public
  
  Currently, standard practice (used by HLI and Founders Pledge) treats SDs on different mental health instruments as interconvertible with WELLBY SDs on a roughly 1:1 basis.
  
  I don't know if that's standard practice -- these are 2 EA-linked or adjacent groups -- moderate
9. daaronr 15 Feb 2026
  
  in Public
  
  calibration questions can partially adjust for it.
  
  "Adjust for it" is too vague, use more precise language
10. daaronr 15 Feb 2026
  
  in Public
  
  Key assumptions
  
  Skip the 'key assumptions' part -- I don't think this gets it correctly anyways. E.g., comparisons of SD units shouldn't require linearity per se, that would be sufficient but not necessary. I'm not sure if interpersonal comparability in levels is necessary either -- if we had linearity and cardinality the measured changes wouldn't depend on the starting points
11. daaronr 15 Feb 2026
  
  in Public
  
  If it's unreliable or systematically misleading, billions of dollars in funding decisions could be poorly directed.
  
  This point seems obvious. Maybe skip it
12. daaronr 15 Feb 2026
  
  in Public
  
  Is the WELLBY (linear, 0–10 life satisfaction) a useful and reliable measure for comparing interventions—particularly those involving mental health, consumption, and life-saving—in the context that organizations like Founders Pledge use it?
  
  Claude rephrased these in simpler but less rigorous and imo less useful ways. This page should use the actual wording of the key questions on the page https://coda.io/d/Unjournal-Public-Pages_ddIEzDONWdb/Wellbeing-PQ_suPg8sEH#_luVrD0mE -- use quotes where possible, ellipse where necessary, and link or fold the details
  
  I find 'useful and reliable' a bit too vague perhaps
Visit annotations in context

Annotators

daaronr

URL

unjournal-workshop.netlify.app/beliefs
unjournal-workshop.netlify.app unjournal-workshop.netlify.app

About This Workshop | The Unjournal

9
1. daaronr 19 Feb 2026
  
  in Public
  
  Benjamin et al. show that calibration questions
  
  "Provide evidence suggesting" -- not "show"
2. daaronr 19 Feb 2026
  
  in Public
  
  effective altruism-aligned
  
  Effective Altruism in caps
3. daaronr 15 Feb 2026
  
  in Public
  
  How the workshop is structured
  
  This is one proposed agenda -- it has not been finalized yet. Make this known.
4. daaronr 15 Feb 2026
  
  in Public
  
  3. Could calibration questions improve things?
  
  I would add something like 'or other more in-depth approaches'
5. daaronr 15 Feb 2026
  
  in Public
  
  "calibration questions"
  
  no need for scare quotes. But note the calibration questions is one of the 2 aspects of this question.
6. daaronr 15 Feb 2026
  
  in Public
  
  If "7 out of 10" means something very different to different people, that's a fundamental challenge for the WELLBY as a tool for comparing interventions.
  
  This is a bit too simple. Note WELLBY, as used in the simplest approaches, mainly requires differences to be comparable-- and even linear -- across individuals. Moving from 1 to 3 is equally valued as moving from 4 to 6 or 8 to 10, and gets twice the value in this measure as moving 2 people from 3 to 4.
7. daaronr 15 Feb 2026
  
  in Public
  
  This is one of the most important recent papers
  
  "Is ... most important" ... seems a bit presumptious ... moderate this
8. daaronr 15 Feb 2026
  
  in Public
  
  Two measures dominate these analyses. The DALY (disability-adjusted life year) comes from health economics and captures years of healthy life lost to disease or disability.
  
  Not sure 'dominate' is accurate. Is DALY used more than QALY? Is WELLBY on par with either of these? Are there other heavily used measures
9. daaronr 15 Feb 2026
  
  in Public
  
  Founders Pledge, GiveWell, and Open Philanthropy
  
  As well as government ODA agencies and mainstream charities and NGOs like Oxfam and UNICEF
Visit annotations in context

Annotators

daaronr

URL

unjournal-workshop.netlify.app/about
woozy-page-39c.notion.site woozy-page-39c.notion.site

Alignment Journal Description (Jan 2026)

17
1. daaronr 11 Feb 2026
  
  in Public
  
  Our highest priority will be to avoid wasting author time. We’re very cognizant from first-hand experience that poor conversion quality, perhaps requiring back-and-forth with the author, is very unpleasant and a huge time suck.
  
  I suspect this is much less of an issue in the days of Claude
2. daaronr 11 Feb 2026
  
  in Public
  
  Following acceptance, authors may pass their manuscript to the journal in any reasonable format (LaTeX or markdown preferred; Word and PDF acceptable).The document will be published in a “web-first” format, such as the Distill version of R Markdown.This allows reflowable text and mobile readability.We currently do not plan to support interactive content, as we do not think the large effort is worth the modest benefit.
  
  You don't have to host -- why not just evaluate and curte?
  
  Or you can have a compromise -- a 'traditional summary' in the journal, linking to the interactive version created by the author, the latter being the canonical one
  
  NB, I think interactive content is high value, but the authors can produce it, especially given Claude code etc
3. daaronr 11 Feb 2026
  
  in Public
  
  The review process will be done using a manuscript in PDF format, which can be generated by the authors using whatever software they prefer (e.g., LaTeX). This avoids wasting the time of authors of papers that are later rejected.
  
  Not sure you even need pdf -- markdown should be acceptable, for example
4. daaronr 11 Feb 2026
  
  in Public
  
  The journal Alignment will be a fast and rigorous venue for theoretical AI alignment—research on agency, understanding, and asymptotic behavior of advanced and potentially self-modifying synthetic agents
  
  Definitely theoretical alignment, not AI governance?
5. daaronr 11 Feb 2026
  
  in Public
  
  Many potential criticisms of papers are “NP” (can be checked easily), so credentials of reviewer should be irrelevant
  
  I see that as a reasonable sterile manning of what PREreview is doing. In contrast The Unjournal we look for legible signs of expertise when we source and commission evaluators, although we do also encourage in a separate "independent evaluation" mode (which has had very little takeup)
6. daaronr 11 Feb 2026
  
  in Public
  
  If confidential: Massive reviewer effort (the report) boiled down to a single bit (!)
  
  Yeah, that's the most obvious limitation of the journal system. That's why we say "publicly evaluate and rate, don't accept/reject"
7. daaronr 11 Feb 2026
  
  in Public
  
  . Our bet is that we can could unify and expand the field of alignment by establishing a legitimate academic journal with an unorthodox review pipeline.
  
  There are some costs here -- maybe you can have your thing ALSO be an overlay journal Publish-review-curate thing, at least for those interested
8. daaronr 11 Feb 2026
  
  in Public
  
  Since they are regarded as informal by institutional academia, time spent on such outputs is dead time, from the perspective of institutional research performances indicators and career progressio
  
  This is big -- and something Unjournal is also hoping to remedy in a sense
9. daaronr 11 Feb 2026
  
  in Public
  
  Our experimental solution to address this problem is to publish each accepted paper with a “reviewer abstract”. Its main goal is to help a potential reader decide — on the paper’s merits — if the paper is worth reading.
  
  I like this idea. We ask for "abstracts" too but I particularly like the way you have phrased it, targeted at a potential reader
10. daaronr 11 Feb 2026
  
  in Public
  
  Reviewer Compensation
  
  We have a promptness bonus and an additional incentive and prize based reward system that follows up later
11. daaronr 11 Feb 2026
  
  in Public
  
  We think it’s very reasonable to spend an average of ~$3k per paper on reviewer payments.
  
  We spend less on the 'evaluators' but something of this magnitude including eval manager time and my own time etc.
12. daaronr 11 Feb 2026
  
  in Public
  
  We intend to experiment with LLM recommendations to surface candidates that might not be salient to the editors.
  
  We have experience with this and I can sugggest some good tools
13. daaronr 11 Feb 2026
  
  in Public
  
  Author identity known to reviewers
  
  This would not work well in situations I'm familiar with. Need to provide the opportunity for single-blind review, especially if there is some meaningful rating or filtering. Otherwise you just get back-slapping
14. daaronr 11 Feb 2026
  
  in Public
  
  If a submission is published in the journal, the AF post is updated to reflect this, and the reviewer abstract is added. The reviewer abstract can be upvoted on AF, with the reviewers with AF accounts who sign the abstract receiving karma as appropriate.
  
  Why filter rather than just rate and sort? ... and let users choose how to filter?
15. daaronr 11 Feb 2026
  
  in Public
  
  Journal Not Conference
  
  OK, your situation is rather different than for Unjournal -- I guess you are trying to build credibility and institutional structure for a new and fledgling tield
16. daaronr 11 Feb 2026
  
  in Public
  
  We are tentatively planning on making the journal archival, meaning that publication there constitutes the “version of record”, in contrast to a workshop publication. (Preprints of course are allowed.)
  
  Bad idea IMO ... although these concerns may vary by field
17. daaronr 11 Feb 2026
  
  in Public
  
  Public review avoids this, but introduces additional problems due to lack of confidentiality: less honest, more combative and defensive conversations between authors and reviewers. Public review also produces an artifact that is poorly suited to a reader because the c
  
  anonymous public review exists
Visit annotations in context

Annotators

daaronr

URL

woozy-page-39c.notion.site/Alignment-Journal-Description-Jan-2026-2e9398af2ab0816ea6cdf847f6447b46
Jan 2026
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - 4 Results: Critiques & Key Issues

1
1. daaronr 07 Jan 2026
  
  in Public
  
  Paper-by-Paper Comparison
  
  Fix margins -- way too narrow to see
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results_critiques
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human peer reviews and ratings of social science research using data from Unjournal.org evaluations

2
1. daaronr 07 Jan 2026
  
  in Public
  
  Full LLM evaluations, human adjudication, new human evaluations
  
  This iss likely to occur over a longer time. There will be overlap between these phases.
2. daaronr 05 Jan 2026
  
  in Public
  
  Research Goals and Questions (overview)
  
  Let's make this some sort of offset box
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/
Dec 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

4
1. daaronr 31 Dec 2025
  
  in Public
  
  Table 3.1: Token usage and estimated cost per model
  
  @Valentin some confusion here. In some cases we have pro on 3 papers. In some cases we have pro on 40 or so papers
2. daaronr 31 Dec 2025
  
  in Public
  
  To assess whether different LLMs produce systematically different evaluations, we collected ratings from multiple providers: OpenAI (GPT-4o-mini), Anthropic (Claude Sonnet 4), and Google (Gemini 2.0 Flash)
  
  We should note that none of these are frontier models other than GPT pro (which version?) @Valentin
3. daaronr 31 Dec 2025
  
  in Public
  
  ?tbl-llm-token-cost-summary
  
  @Valentin missing reference/link
4. daaronr 31 Dec 2025
  
  in Public
  
  We first use the earlier GPT‑5 Pro evaluation run that covered all papers in our Unjournal sample with a simpler JSON‑schema prompt
  
  @Valentin which 'simpler prompt' was this? We should link it
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
unjournal-llm-research-evaluation.netlify.app unjournal-llm-research-evaluation.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

12
1. daaronr 01 Dec 2025
  
  in Public
  
  temporal “data leakage” between the multi‑year regrowth label and contemporaneous predictors;
  
  Did it really show this? OK we can check this in the appendix -- but it might be good to present a bit more side-by-side comparison.
  
  OK I checked the appendix and I couldn't find any mention of the temporal data leakage issue. It mentioned other issues that I interpreted as more about fitting a model on one time period and expecting it to pertain to another period, but that's not 'leakage'.
2. daaronr 01 Dec 2025
  
  in Public
  
  We start by examining selected evaluations in detail. In the next step we will juxtapose these LLM assessments with the human evaluators’ written reports.
  
  Let's put a table of the relative ratings here (human vs AI, for each category etc), especially for this subset
3. daaronr 01 Dec 2025
  
  in Public
  
  sensitivity of return‑on‑investment calculations to assumptions about donor lifetime value and unobserved costs.
  
  iirc this overlaps the human evaluation
4. daaronr 01 Dec 2025
  
  in Public
  
  potential spillovers and spatial correlation across postal codes
  
  this was raised by the authors themselves, and they had an approach to accounting for it
5. daaronr 01 Dec 2025
  
  in Public
  
  Case study: Williams et al. (2024)
  
  Valentin: Did an LLM write this comparison or did you? It's so very detailed that I am wondering how you had time to dig in this much.
6. daaronr 01 Dec 2025
  
  in Public
  
  not fully propagate uncertainty
  
  Did the model really mention this? If so, awesome. (Although I'm a bit concerned about whether our evaluations entered it's corpus because I on't think it did so in the last version)
7. daaronr 01 Dec 2025
  
  in Public
  
  additionality and opportunity costs in the policy framing,
  
  you didn't mention this one for the LLM
8. daaronr 01 Dec 2025
  
  in Public
  
  the treatment of “biophysical potential” versus business‑as‑usual regrowth,
  
  is this the same as something the LLM identified? Maybe better to present the exact language in parallel (for human vs llm)?
9. daaronr 01 Dec 2025
  
  in Public
  
  edictors; incomplete or optimistic treatment of uncertainty around the headline 215 Mha estimate; a broad and permissive definition of land “available for natural regeneration”; limitations of the carbon overlay and permanence assumptions; and only partial openness of code and workflows, which increases barriers to full replication.
  
  I would want to look at this correspondence between human and LLM critiques more closely. (Can also ask LLMs to check that)
10. daaronr 01 Dec 2025
  
  in Public
  
  relatively
  
  Relative to other papers or relative to other ratings categories? I don't think we showed summary statistics or plots for the overall set of evaluations to compare this to know if it is 'relatively high'.
  
  Or maybe you meant 'relative to the human evaluations'?
11. daaronr 01 Dec 2025
  
  in Public
  
  This October 2025 run asked the model only for numeric ratings and journal‑tier scores (no diagnostic summary or reasoning trace);
  
  I thought we asked it for 'reasoning for each rating'?
12. daaronr 01 Dec 2025
  
  in Public
  
  To understand what GPT‑5 Pro is actually responding to, we re‑ran the model on four focal papers (Adena and Hager 2024; Peterman et al. 2024; Williams et al. 2024; Green, Smith, and Mathur 2025) using a refined prompt (as shown in the previous section).
  
  It's not clear to me how the prompt use here is different from the prompt used on the rest of the papers.
Visit annotations in context

Annotators

daaronr

URL

unjournal-llm-research-evaluation.netlify.app/results
unjournal-llm-research-evaluation.netlify.app unjournal-llm-research-evaluation.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Discussion

1
1. daaronr 01 Dec 2025
  
  in Public
  
  temporal leakage from contemporaneous predictors,
  
  I actually didn't see this in the discussion! Looked through it in some detail
Visit annotations in context

Annotators

daaronr

URL

unjournal-llm-research-evaluation.netlify.app/discussion
unjournal-llm-research-evaluation.netlify.app unjournal-llm-research-evaluation.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Appendix A — LLM evaluation summaries and reasoning traces

8
1. daaronr 01 Dec 2025
  
  in Public
  
  ourth, when sufficient information to compute a standardized mean difference (SMD) was lacking and the text reported a “null,” outcomes were set to an “unspecified null” of 0.01. This imputation is transparent but ad hoc; it could bias pooled estimates upward (relative to zero) and may not reflect the true variance of those effects. The manuscript would benefit from sensitivity checks setting these to 0, excluding them, or modeling them with conservative variances.
  
  IIRC this echoes the human evaluation (although one of the evaluators had a particular detailed suggestion for this)
2. daaronr 01 Dec 2025
  
  in Public
  
  Model assessment summary
  
  Paragraph breaks missing here ... makes it hard to read.
3. daaronr 01 Dec 2025
  
  in Public
  
  For “tier_will,” given its status as a WZB discussion paper and the need to disregard actual publication knowledge, I might predict it will land around 3.2 to 4.0.
  
  This suggests/confirms that the model does not have access to the latest 'news' about the publication (in Management Science)
4. daaronr 01 Dec 2025
  
  in Public
  
  clearer pre-analysis plan deviation trackin
  
  OK it gets at this a bit
5. daaronr 01 Dec 2025
  
  in Public
  
  Heterogeneity analyses suggest stronger effects in urban areas and in PLZs with higher employment, more children, and more Catholics, and with higher predicted giving potential. These patterns can guide targeting but also indicate that the ITT estimates average over meaningful heterogeneity.
  
  Seems to miss the issue of MHT, and some very surprising heterogeneity suggests spurious estimates.
  
  Also, divergence from the PAP, although I'm not sure it had access to the PAP
6. daaronr 01 Dec 2025
  
  in Public
  
  Adena and Hager 2024
  
  Maybe make these either folded or one paper per chapter in the Quarto?
7. daaronr 01 Dec 2025
  
  in Public
  
  Data construction choices appear reasonable but introduce some judgment calls. Winsorizing PLZ-day donations at €1,000 reduces variance from heavy tails; the authors show that results are directionally robust, but precision trades off.
  
  Good it noted the Winsorizing -- something Reiley emphasized.
8. daaronr 01 Dec 2025
  
  in Public
  
  The most important methodological limitations concern exposure heterogeneity and spillovers. Treatment is assigned at the PLZ level, but impressions are probabilistic and sparse (roughly one in ten Facebook users in treated PLZs received at least one impression), so the estimates are ITT and likely attenuated relative to the effect of actually seeing the ad; the TOT is not estimated. The allocation strategy partly allows Facebook to endogenously concentrate impressions, creating within-treatment variation in exposure that is not exploited for causal TOT analysis (e.g., using randomized budgets as an instrument in a dose–response framework). Spillovers across PLZs are plausible (algorithmic leakage of geotargeting and social diffusion). The authors document positive “share of treated neighbors” effects and argue the main estimates are lower bounds, but the neighbor-treatment share is not itself randomized, and spatial correlation or common shocks could inflate these coefficients; the spillover analysis should be interpreted cautiously. Robustness to spatial correlation in errors is only partly addressed by robust standard errors and randomization inference; alternative SEs (e.g., spatial HAC or clustering at larger administrative units) and placebo geographies would further strengthen inference.
  
  At a first look (and from my memory) this seems like an extremely useful and plausible report!
Visit annotations in context

Annotators

daaronr

URL

unjournal-llm-research-evaluation.netlify.app/appendix_llm_traces
unjournal-llm-research-evaluation.netlify.app unjournal-llm-research-evaluation.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Data and methods

1
1. daaronr 01 Dec 2025
  
  in Public
  
  claims & evidence, methods, logic & communication, open science, global relevance, and an overall
  
  remove bold font
Visit annotations in context

Annotators

daaronr

URL

unjournal-llm-research-evaluation.netlify.app/methods
unjournal-llm-research-evaluation.netlify.app unjournal-llm-research-evaluation.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org

8
1. daaronr 01 Dec 2025
  
  in Public
  
  For each of 47 such papers,
  
  we have more now -- is this '47' hard-coded?
2. daaronr 01 Dec 2025
  
  in Public
  
  In this project, we test whether current large language models (LLMs) can generate research evaluations that are comparable, in structure and content, to expert human reviews.
  
  This is only a part of the project though
3. daaronr 01 Dec 2025
  
  in Public
  
  Furthermore, a key promise of AI is to directly improve science and research.
  
  rephrase using language from grant application
4. daaronr 01 Dec 2025
  
  in Public
  
  works
  
  "works" seems too simple
5. daaronr 01 Dec 2025
  
  in Public
  
  a high‑stakes, policy‑relevant domain, and as the first step toward a broader benchmark and set of tools for comparing and combining human and AI research evaluations.
  
  Last sentence seems relevant to the grant application language
6. daaronr 01 Dec 2025
  
  in Public
  
  model reliably identifies many of the same methodological and interpretive issues
  
  "reliably identifies" feels too strong ... or at least I haven't seen the evidence yet.
7. daaronr 01 Dec 2025
  
  in Public
  
  and to produce a narrative assessment anchored in the PDF of each paper.
  
  I don't understand 'anchored in the PDF of each paper' -- maybe LLM wrote this?
8. daaronr 01 Dec 2025
  
  in Public
  
  Comparing LLM and human reviews of social science research using data from Unjournal.org
  
  No link to slides anymore?
Visit annotations in context

Annotators

daaronr

URL

unjournal-llm-research-evaluation.netlify.app/
Nov 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

2
1. daaronr 11 Nov 2025
  
  in Public
  
  Strong pantropical mapping, but several methodological and interpretive risks remain. Training data on natural regrowth include substantial omission error in humid biomes; pseudo-absences and misclassification may bias the model. Random forests were trained on class-balanced points and probabilities are treated as calibrated; no prevalence correction or probability calibration is shown, yet expected areas/carbon rely on these values. Validation is not fully spatial; accuracy likely inflated by autocorrelation (declines at greater distances); no formal spatial block cross-validation or predictive uncertainty mapping. Reported “confidence intervals” for total area/carbon are effectively deterministic sums, not uncertainty; overall uncertainty is understated. Predictions at 30 m depend on several coarser predictors (300 m–1 km), so effective map resolution is coarser and may mislead fine-scale planning. Final maps omit socioeconomic predictors (despite similar accuracy), assuming stationarity from 2000–2016 to 2030 and potentially overstating practical feasibility. Carbon estimates exclude permanence/leakage dynamics and use coarse downscaled inputs. Data products are open, but code is only “on request,” limiting full reproducibility.
  
  This new one seems to show some potential to be reflecting the key concerns but I need to check this in more detail as it could just be credible sounding garbage. It still doesn't seem to pick up the key 'data leakage' concern.
  
  But actually I'm a bit puzzled as to what data is being piped in there because if I recall the latest version we had didn't ask for rationales for specific categories. So where is it getting this from?
2. daaronr 11 Nov 2025
  
  in Public
  
  78
  
  Much higher than the humans, but this category is rated lowest or second-lowest for both ... so perhaps close to a monotonic transformation of sorts.
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
Oct 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

2
1. daaronr 05 Oct 2025
  
  in Public
  
  Show cod
  
  @Valentin These forest plots are really hard to read, it's so dense without spacing. Let's work together on some ways of making it more informative.
  
  I'm also puzzled as to why so many papers are shown only one type of rating and not the other. I know that some of our evaluators did not give ratings like this, and in some cases, we didn't even encourage it. But why is it missing for some of the llms? Did it just not have time to finish processing it?
  
  Maybe it's a display issue? It seems that the papers that were rated highest in terms of these tiers by the human raters did not get rated by the LLMs. Or maybe it just didn't show up in the graph?
2. daaronr 05 Oct 2025
  
  in Public
  
  (only partially available)
  
  @Valentin why/how 'partially available'?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
Sep 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

13
1. daaronr 23 Sep 2025
  
  in Public
  
  Inter-rater reliability (Cohen’s κ): We treated the AI as one “reviewer” and the average human as another, and asked: how often do they effectively give similar ratings, beyond what chance alignment would predict? For overall scores, Cohen’s κ (with quadratic weighting for partial agreement) came out around 0.25–0.30. This would be characterized as “fair” agreement at best – a significant gap remains between AI and human judgments. For most specific criteria, the weighted κ was lower, often in the 0.1–0.2 range (and for one category, effectively 0). A κ of 0 would mean no more agreement than random chance (given the distribution of scores), so some categories are bordering on that. In contrast, typical inter-human agreement on these kinds of scoring tasks can also be low, but usually one would hope for κ in the 0.3–0.6 range among trained reviewers on well-defined criteria. Our finding of κ < 0.3 in all cases suggests that the AI’s ratings are not interchangeable with a human reviewer’s – at least not without further calibration.
  
  @Valentin I want to focus on interpreting statistics like these. I think we could make info-theoretic statements about things like "how often would the LLM's relative ranking of 2 randomly chosen papers agree with a human's rannking". Is that kappa? Let's dig in
2. daaronr 23 Sep 2025
  
  in Public
  
  Correlation (Pearson’s r) between the AI’s and human scores across papers: This tells us, for example, if a paper that humans gave a high score also tended to get a high score from AI (regardless of absolute difference). For the Overall scores, Pearson r ≈ 0.30, indicating a weak-to-moderate positive correlation.
  
  @Valentin Important -- let's put these measures in context. How does this correlation compare with what other papers found, and with inter-human ratings, for example.
  
  We might also find a way to introduce information-theoretic measures that say something in an intuitive absolute sense.
3. daaronr 23 Sep 2025
  
  in Public
  
  r ~ 0
  
  @Valentin this should be stated more precisely, I think. (And with soft coding)
4. daaronr 23 Sep 2025
  
  in Public
  
  Table 3.1 show agreement metrics across rating criteria. To quantify the agreements and differences observed, we calculated several statistics comparing LLM scores to the human scores, aggregated by criterion:
  
  @Valentin perhaps we should start with the measures of agreement and then get into unpacking the discrepancies?
5. daaronr 23 Sep 2025
  
  in Public
  
  has a distinct “taste,” elevating some work and devaluing other work differently than human referees.
  
  I don't think we can call this 'taste' yet. It might be random noise or perhaps ~bias (to top institutions, authors, journals, etc.).
6. daaronr 23 Sep 2025
  
  in Public
  
  or example, Aghion et al. 2017 was among the top few for human reviewers, but the LLM overall score put it notably lower relative to others, hence a downward green curve.
  
  @Valentin I don't think that was the greatest discrepancy -- should we identify some with a greater discrepancy here? (Ideally, we even soft-code it, as this is likely to change as we adjust the prompts, anonymize, etc.)
7. daaronr 23 Sep 2025
  
  in Public
  
  high quality due to its robust modeling and accessible methodologies, making it relevant for climate mitigation and biodiversity
  
  This is also not particular coherent. Strange. If we just asked the llm to evaluate the paper it tends to give a much better response.
8. daaronr 23 Sep 2025
  
  in Public
  
  Comparing its midpoint percentiles to a reference group of serious research from the last three years will help inform this.
  
  @Valentin Yikes -- this sentence doesn't really make sense
9. daaronr 16 Sep 2025
  
  in Public
  
  Shared notes
  
  @Valentin Let's replace this link with a link to the Coda doc
10. daaronr 16 Sep 2025
  
  in Public
  
  on the right, the papers are ordered by the AI’s overall score (rank 1 = highest rated by AI)
  
  labeling-wise, would it be better to add the paper authors on the right as well? @valentin.klotzbuecher@posteo.de
11. daaronr 15 Sep 2025
  
  in Public
  
  For example, in the Logic & Communication column, we see many light-orange cells – the AI often thought papers were a bit clearer or better argued (by its judgment) than the human evaluators did.
  
  I wonder if we should normalize this in a few ways, at least as an alternative measure.
  
  I suspect the AI's distribution of ratings may have different than the human distribution of ratings overall and, the "bias" may also differ by category.
  
  Actually, that might be something to do first -- compare the distributions of (middle -- later more sophisticated) ratings for humans and for LLMs in an overall sense.
  
  One possible normalization would be to state these as percentiles relative to the other stated percentiles within that group (humans, LLMs), or even within categories of paper/field/cause area (I suspect there's some major difference between the more applied and niche-EA work and the standard academic work (the latter is also probably concentrated in GH&D and environmental econ). On the other hand, the systematic differences between LLM and human ratings on average might also tell us something interesting. So I wouldn't want to only use normalized measures.
  
  I think a more sophisticated version of this normalization just becomes a statistical (random effects?) model where you allow components of variation along several margins.
  
  It's true the ranks thing gets at this issue to some extent, as I guess Spearman also does? But I don't think it fully captures it.
12. daaronr 15 Sep 2025
  
  in Public
  
  deed, GPT often noted lack of code or data sharing in papers and penalized for it, whereas some human reviewers may have been more forgiving or did not emphasize open-science practices as strongly (especially if they focused more on content quality). As a result, for many papers the AI’s Open Science score is 5–10 points below the human average.
  
  This is interesting. The human evaluators may have had low expectations because they don't expect the open code and data to be provided until the paper has been published in a peer-reviewed journal. Here I would agree more with the LLM. "What should be" sense.
13. daaronr 14 Sep 2025
  
  in Public
  
  Figure 3.2: Relative ranking (overall) by LLM and Human evaluators
  
  A quick impression: the LLMs tend to rank the papers from prominent academic authors particularly high?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Data and methods

12
1. daaronr 23 Sep 2025
  
  in Public
  
  he brief rationales clarify what evidence in the paper drove each score.
  
  I don't think we can be confident that whatever it puts here accurately reflects the reasoning or process that determined the rating. (added discussion in text).
  
  Also, is there some way to extract the 'thinking steps' from the process ... that reasoning thread the models show you (which May or may not also reflect its true reasoning )
  
  @valik
2. daaronr 23 Sep 2025
  
  in Public
  
  - tier_should = where the paper deserves to publish if quality-only decides. - tier_will = realistic prediction given status/noise/connections.
  
  This seems like an LLM abbreviation of our instructinos. @valik can you put the actual instructions back in?
3. daaronr 23 Sep 2025
  
  in Public
  
  Round all scores to the nearest 0.5
  
  @Valentin @valik why would you ask for this rounding?
4. daaronr 23 Sep 2025
  
  in Public
  
  We use a single‑step call with a reasoning model that supports file input. One step avoids hand‑offs and summary loss from a separate “ingestion” stage. The model reads the whole PDF and produces the JSON defined above. We do not retrieve external sources or cross‑paper material for these scores; the evaluation is anchored in the manuscript itself.
  
  We should probably give a citation for this point.
  
  But is this the same point you made above?
5. daaronr 23 Sep 2025
  
  in Public
  
  • Default = arithmetic mean of the other six midpoints (rounded).
  
  @Valentin I'm not sure why this should be the default. Note that in an earlier version of our evaluation framework, we used a weighted scheme, which we dropped
  
  I suspect it would be better for you to ask the question the same way we asking the evaluators here ... Which would simply mean getting rid of this bullet point. I think.
  
  The second bullet point is interesting though. I would be curious to hear how it considered the relevance of each metric in its overall. ... although I expect that asking that question might alter the overall score
  
  @valik Maybe GPT suggested this default averaging? It also seems so optimal because we're asking to get something that we could easily compute ourselves.
6. daaronr 23 Sep 2025
  
  in Public
  
  plus a short rationale
  
  @valik @Valentin
  
  I don't see where in the prompt you explained to it what the rationale was supposed to mean. I only see some discussion on how to use that. It should be stated if they override the simple mean for aggregating the other categories to get the overall.
  
  (And see my other comment on why I think we should remove the request to do simple means.)
7. daaronr 23 Sep 2025
  
  in Public
  
  The credible intervals communicate uncertainty rather than false precision
  
  @Valentin That's the intention, sure, but this paragraph makes it seem like the JSON schema somehow ensure that this is used. I'll try to adjust.
8. daaronr 23 Sep 2025
  
  in Public
  
  # Environment setup
  
  Note to self -- this chunk is set to 'eval=false', it doesn't need re-running every time (and has trouble playing in the Quarto/Rstudio environment)
9. daaronr 23 Sep 2025
  
  in Public
  
  Direct ingestion preserves tables, figures, equations, and sectioning, which ad‑hoc text scraping can mangle. It also avoids silent trimming or segmentation choices that would bias what the model sees.
  
  Useful to add a citation here.
10. daaronr 23 Sep 2025
  
  in Public
  
  he sample includes r dim(research_done)[1] papers
  
  @Valentin the number comes out in my Rstudio but not on render. Wondering what I did wrong here. I think it may be syntax, but also it starts again with every chapter, so we need to carry the data in first for this discussion.
11. daaronr 16 Sep 2025
  
  in Public
  
  49
  
  @Valentin Let's soft-code these numbers in based on the data.
  
  Something like
  
  The sample includes `r dim(research_done)[1]` papers spanning
12. daaronr 16 Sep 2025
  
  in Public
  
  per criterion (and noting the range of individual scores).
  
  @valentin we should probably do something more sophisticated at the next pass ... Either using each evaluation as a separate observation in the analysis or imputing the median taking the statedcredible intervals into account with some Bayesian procedure.
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/methods
squigglehub.org squigglehub.org

Unjournal-modelingPQs/cultured_meat_improved_by_squiggle_improve | Squiggle Hub

1
1. daaronr 05 Sep 2025
  
  in Public
  
  Please turn on the hypothes.is plugin, and view public annotations to see Latex math representations
  
  What if I annotate this in hypothes.is -- is it preserved usefully?$CRF(r,n) = \frac{r(1+r)^n}{(1+r)^n - 1} $
  
  Yes, it seems to stay in the same place in the notebook even if the notebook is edited.
  
  Wait now it's an orphan?
Visit annotations in context

Annotators

daaronr

URL

squigglehub.org/models/Unjournal-modelingPQs/cultured_meat_improved_by_squiggle_improve
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human peer reviews and ratings of social science research using data from Unjournal.org evaluations

1
1. daaronr 02 Sep 2025
  
  in Public
  
  model for now)
  
  DR @ VK: Should we use R inline code for things that will change, like the model number, and centralize these?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/
Aug 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - LLM vs. Human Ratings

24
1. daaronr 27 Aug 2025
  
  in Public
  
  Figure 3.4: Human uncertainty (CI width) vs |LLM − Human|. Spearman correlation reported.
  
  This one is rather intricate, it might need some more analysis and talking through.
2. daaronr 27 Aug 2025
  
  in Public
  
  Table 3.3: CI coverage: does one interval contain the other’s point estimate?
  
  I think DHK also looked at 'human in human' -- Might as well add this as a comparator?
3. daaronr 27 Aug 2025
  
  in Public
  
  coverage
  
  Not sure if we want to use the formal statistical term 'coverage' here either (unless it's in scare quotes)
4. daaronr 27 Aug 2025
  
  in Public
  
  Table 3.1: Agreement metrics by metric (Pearson, Spearman, mean/median bias, κ unweighted/weighted).
  
  I assume this is referring to the correlation of (the average of?) the human ratings and the LLM ratings?
  
  Because we could also look at the agreement between humans (something David HJ was working on ... Might be useful to share this with him for his thoughts if he's interested at all.)
5. daaronr 27 Aug 2025
  
  in Public
  
  metric
  
  Is it worth adding confidence intervals or "significance stars" for these metrics?
6. daaronr 27 Aug 2025
  
  in Public
  
  mean_bias
  
  are the units "percentage points" or something else?
7. daaronr 27 Aug 2025
  
  in Public
  
  metric
  
  maybe 'by rating category' To avoid repeating the word "metric",
8. daaronr 27 Aug 2025
  
  in Public
  
  Calibration
  
  Not actually calibration unless we asked for a specific prediction.
9. daaronr 27 Aug 2025
  
  in Public
  
  calibration
  
  Maybe we should put in a footnote explaining that these aren't really measures of calibration.
10. daaronr 27 Aug 2025
  
  in Public
  
  As another indicator of agreement
  
  Exactly. It's sort of a measure of agreement, although I'm not sure quite how to interpret it. It's not a measure of calibration per se because we were asking them to rate these things, not to predict them.
  
  Although we could ask the LLM to do this as a prediction exercise, with or without training data, that might be interesting, And then, of course, the calibration would be meaningful.
11. daaronr 27 Aug 2025
  
  in Public
  
  monotonic relationship
  
  I'm not sure what the "monotonic relationship" in parentheses means here?
12. daaronr 27 Aug 2025
  
  in Public
  
  he horizontal radius covers the union of all human evidence for that paper — combining individual raters’ point scores and their CIs
  
  It's not fully clear to me what this does. I guess you center it (horizontally) at the midpoints of the human ratings.
  
  Are you using the outer bounds of both raters' CIs?
  
  Probably the best thing to do ultimately would be to impute a distribution over each individual ratings and CI these and then do some sort of belief aggregation. -- Horizontal linear aggregation actually feels the most intuitive to me from what I've read. ... And then give the implied CIs for that
13. daaronr 27 Aug 2025
  
  in Public
  
  For each paper (selected metric), the ellipse is centered at the pair of midpoints (Human, LLM).
  
  This is pretty clear but it took me a second to get this, so maybe mention human = horizontal, LLM = vertical in the text.
14. daaronr 27 Aug 2025
  
  in Public
  
  (unless a point lies outside its own stated CI, which is flagged in a diagnostic check).
  
  Did that ever happen?
15. daaronr 26 Aug 2025
  
  in Public
  
  robust CI ellipses
  
  Robust in what sense? Also might be worth mentioning that we are talking about 90% credible intervals (at least that's what we asked the humans to give us).
16. daaronr 26 Aug 2025
  
  in Public
  
  Uncertainty fields. Where available, we carry lower/upper bounds for both LLM and humans; these are used in optional uncertainty checks but do not affect the mid‑point comparisons below.
  
  I suspect we ultimately should do something more sophisticated with this ... like some Bayesian updating/averaging. It's also not entirely clear what you mean by "we carry".
  
  But of course, the "right" way to do this will depend on what precisely is the question that we're asking. Something we should have some nice chats about.
17. daaronr 26 Aug 2025
  
  in Public
  
  optional
  
  Instead of 'optional' maybe you mean 'additional analyses' or something?
18. daaronr 26 Aug 2025
  
  in Public
  
  we fold these into the single LLM metric global_relevance.
  
  Do you average them?
19. daaronr 26 Aug 2025
  
  in Public
  
  Human criteria are recoded to the LLM schema (e.g., claims → claims_evidence, adv_knowledge → advancing_knowledge, etc.)
  
  This is just about coding the variables, right? Not really about the content?
20. daaronr 26 Aug 2025
  
  in Public
  
  Sources. LLM ratings come from results/metrics_long.csv (rendered in the previous chapter). Human ratings are imported from your hand‑coded spreadsheet and mapped to LLM paper IDs via UJ_map.csv.
  
  Okay, I see these are basically notes to ourselves here.
21. daaronr 26 Aug 2025
  
  in Public
  
  compute paired scores per (paper, metric)
  
  I'm not sure what is meant by "compute paired scores."
22. daaronr 26 Aug 2025
  
  in Public
  
  We (i) harmonize the two sources to a common set of metrics and paper IDs
  
  This is fine for our own notes for now, but it's not something that outsiders need to read, I guess.
23. daaronr 26 Aug 2025
  
  in Public
  
  compares LLM‑generated ratings
  
  Link the ~prompt used so people can see exactly what we asked the LLM
24. daaronr 26 Aug 2025
  
  in Public
  
  human evaluations across the Unjournal criteria.
  
  "across The Unjournal's criteria"
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/compare_ratings
www.roastmypost.org www.roastmypost.org

Untitled document

17
1. daaronr 07 Aug 2025
  
  in Public
  
  Limited International Scope: The focus on US data and markets (acknowledged in footnote 20) may limit generalizability, particularly given different regulatory environments and consumer preferences globally.
  
  Perhaps we should footnote something more about how we're focusing on US data because of its availability and thus focusing the question around the US, but will encourage making reasonable extrapolations to global consumption. #todo? Although we already have a footnote about it
2. daaronr 06 Aug 2025
  
  in Public
  
  Supply Chain Complexity: The analysis doesn't fully account for how restaurant vs. grocery substitution patterns might differ, despite acknowledging this limitation.
  
  This kind of asks for too much detail, but maybe we state what the purpose of the post was as clearly as we should have?
3. daaronr 06 Aug 2025
  
  in Public
  
  Importance:5/10
  
  @ozzie I love the importance ranking. (If it's actually working, of course. Here it gave everything a 5 )
4. daaronr 06 Aug 2025
  
  in Public
  
  The explicit recognition that 'given the range of unknowns involved' makes direct comparison infeasible demonstrates strategic research thinking that EA organizations should emulate.
  
  @ozzie does your tool have explicit background instructions talking about "EA organizations", or is it getting this from the context of the post? Obviously the former would make it hard to apply in a more general range of settings.
5. daaronr 06 Aug 2025
  
  in Public
  
  Score: 78/100
  
  @ozzie this is a critique of the above evaluation of our post. That's a very cool idea! Given that our post is already a bit meta, my brain is struggling to grok this, but I like it.
6. daaronr 06 Aug 2025
  
  in Public
  
  Comment 1
  
  @ozzie what is the quote comments section supposed to mean? It seems a bit repetitive of the above?
7. daaronr 06 Aug 2025
  
  in Public
  
  Temporal Dynamics: The document doesn't adequately address how substitution patterns might change as plant-based products become more mainstream or as consumer familiarity increases.
  
  That's a fair point, and one we might want to incorporate. #todo?
8. daaronr 06 Aug 2025
  
  in Public
  
  Meta-Analysis Limitations: While the document presents cross-price elasticity estimates, it could better address how traditional meta-analysis approaches might be inappropriate given the fundamental methodological concerns raised.
  
  This is a mostly valid comment, it's a pretty reasonable thing to ask for here. We probably don't want to go into this detail here, but it's a natural thing someone might ask.
  
  OK. #todo? (maybe a footnote)
9. daaronr 06 Aug 2025
  
  in Public
  
  the document could better explain how these predictions will be weighted against expert evaluations.
  
  Good point. #todo?
10. daaronr 06 Aug 2025
  
  in Public
  
  This nuanced view prevents oversimplified dismissal of conflicting results.
  
  But does this tool reward valid nuance or just any type of "it's complicated because of" statements?
11. daaronr 06 Aug 2025
  
  in Public
  
  However, the document could strengthen its analysis by exploring potential explanations for these inconsistencies more systematically.
  
  This perhaps misses the point of the post. It's not meant to be an actual research exploration. It's just demonstrating the problem to help motivate and guide the next step in our project.
12. daaronr 06 Aug 2025
  
  in Public
  
  The progression from broad goal-focused questions to specific operationalized questions follows sound research design principles.
  
  I'm wondering how much the tool just being agreeable with our reasoning and how much the tool is challenging it. I wonder whether it just focuses on whether the author seems to be saying things a lot about "why" and thinking/overthinking, and it rewards that.
  
  That said, I do think this is a post that we've been putting a lot of care into, so it should get at least a decent rating here.
13. daaronr 06 Aug 2025
  
  in Public
  
  supermarkets raising PBA prices during economic windfalls effectively illustrates how naive analysis could overstate substitution effects.
  
  This would be a useful insight for a regular layman's post (the current context), but it would be kind of trivial if we were applying this to an economics paper.
  
  But maybe the tool adapts to that.
14. daaronr 06 Aug 2025
  
  in Public
  
  This represents sophisticated epistemic humility—recognizing that documenting limitations transparently provides significant value even when definitive answers remain elusive
  
  Okay, but this also seems like something that might be easy to fake with shibboleths and hollow phrases.
15. daaronr 06 Aug 2025
  
  in Public
  
  Audience Analysis and Calibration
  
  This is great.
16. daaronr 06 Aug 2025
  
  in Public
  
  This document presents The Unjournal's approach to investigating whether plant-based meat alternatives effectively displace animal product consumption—a critical question for animal welfare funding decisions. The analysis demonstrates sophisticated awareness of methodological limitations while proposing a structured approach to synthesize conflicting evidence.
  
  The language is a little too formal and flowery for my taste.
17. daaronr 06 Aug 2025
  
  in Public
  
  This document presents The Unjournal's Pivotal Questions initiative investigating plant-based meat substitution effects on animal product consumption. The analysis reveals significant methodological challenges in the existing research, with conflicting findings across studies using similar data sources. The document proposes a focused operationalized question about Impossible/Beyond Beef price changes and chicken consumption, while acknowledging fundamental limitations in current estimation approaches. Key insights include the recognition of endogeneity problems, the value of transparent uncertainty quantification, and the need for more rigorous experimental designs to inform animal welfare funding decisions.
  
  The summary is accurate, although it emphasizes methodology more than I thought the post was. That said, it might be that the post is actually just doing that more than we had intended.
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/_PEXsQ-yVpOUiLpn/evals/sed_c9fgaNb8bJoV
www.roastmypost.org www.roastmypost.org

Untitled document

4
1. daaronr 06 Aug 2025
  
  in Public
  
  acob Schmiess; we ac
  
  @ozzie it's incorrectly correcting a name spelling here.
2. daaronr 06 Aug 2025
  
  in Public
  
  Extracted forecast: Will Impossible Beef reach price-parity with conventional ground beef by December 31, 2030? Quality scores: Precision: 85 | Verifiable: 90 | Important: 75 | Robustness: 65 Our reasoning: Based on 2 independent analyses, the estimated probability is 72.3%. There is high consensus among the forecasts. Individual model forecasts: Model 1: 72.3% - "Plant-based meat costs have declined significantly due to scale economies and R&D investments, while conventional beef prices face upward pressure from environmental regulations and feed costs, making price parity likely within the 5+ year timeframe."
  
  @ozzie this is indeed directly helpful for us!
3. daaronr 06 Aug 2025
  
  in Public
  
  Based on 2 independent analyses, the estimated probability is 7.5%. There is high consensus among the forecasts.
  
  @ozzie ah, it seems you've set it up to only make predictions for binary things.
4. daaronr 06 Aug 2025
  
  in Public
  
  1. "What will be the price of IB+ on January 1, 2030?"
  
  @ozzie I'm puzzled here by what's going on. These were not our predictions. These were just questions we asked for others to predict. I guess it defaults to 'author predicts 50%' for such questions?
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/_PEXsQ-yVpOUiLpn/reader
www.roastmypost.org www.roastmypost.org

Untitled document

1
1. daaronr 06 Aug 2025
  
  in Public
  
  op 3 Most Overconfident Predictions 1. "Will most cultured meat (by volume) be produced wi..." Reality check: ~32.6% | Gap: 17.4%
  
  @ozzie I'm rather puzzled by what it did here, as we didn't make a prediction; we were just proposing questions for others to predict on.
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/HcrEh2lyKqyS8ezq/evals/g2cI2FxsphEhDvbo
Jul 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Quantitative metrics

1
1. daaronr 07 Jul 2025
  
  in Public
  
  Figure 2.2: Per-paper mid-point scores across all metrics. Darker green → higher percentile. Columns ordered by each paper’s overall average. #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed { width: 100%; display: flex; } #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed details, #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed details summary { position: relative; } Save as SVGSave as PNGView SourceView Compiled VegaOpen in Vega Editor var VEGA_DEBUG = (typeof VEGA_DEBUG == "undefined") ? {} : VEGA_DEBUG; (function(spec, embedOpt){ let outputDiv = document.currentScript.previousElementSibling; if (outputDiv.id !== "altair-viz-41a3d5b933fe425c841b312f24a54cc6") { outputDiv = document.getElementById("altair-viz-41a3d5b933fe425c841b312f24a54cc6"); } const paths = { "vega": "https://cdn.jsdelivr.net/npm/vega@5?noext", "vega-lib": "https://cdn.jsdelivr.net/npm/vega-lib?noext", "vega-lite": "https://cdn.jsdelivr.net/npm/vega-lite@5.20.1?noext", "vega-embed": "https://cdn.jsdelivr.net/npm/vega-embed@6?noext", }; function maybeLoadScript(lib, version) { var key = `${lib.replace("-", "")}_version`; return (VEGA_DEBUG[key] == version) ? Promise.resolve(paths[lib]) : new Promise(function(resolve, reject) { var s = document.createElement('script'); document.getElementsByTagName("head")[0].appendChild(s); s.async = true; s.onload = () => { VEGA_DEBUG[key] = version; return resolve(paths[lib]); }; s.onerror = () => reject(`Error loading script: ${paths[lib]}`); s.src = paths[lib]; }); } function showError(err) { outputDiv.innerHTML = `<div class="error" style="color:red;">${err}</div>`; throw err; } function displayChart(vegaEmbed) { vegaEmbed(outputDiv, spec, embedOpt) .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`)); } if(typeof define === "function" && define.amd) { requirejs.config({paths}); let deps = ["vega-embed"]; require(deps, displayChart, err => showError(`Error loading script: ${err.message}`)); } else { maybeLoadScript("vega", "5") .then(() => maybeLoadScript("vega-lite", "5.20.1")) .then(() => maybeLoadScript("vega-embed", "6")) .catch(showError) .then(() => displayChart(vegaEmbed)); } })({"config": {"view": {"stroke": "transparent"}, "background": "white", "title": {"font": "Source Sans Pro", "fontSize": 16, "color": "#222"}, "axis": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro", "labelColor": "#222", "titleColor": "#222", "gridOpacity": 0.15}, "legend": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro"}, "range": {"heatmap": ["#f6fbf3", "#e2f1d7", "#cfe7ba", "#badc9c", "#a6d27f", "#92c861", "#7dbd43", "#69b325", "#55a807", "#477b13"], "category": ["#99bb66", "#f19e4b", "#6bb0f3", "#d9534f", "#636363", "#ffb400", "#53354a", "#2780e3", "#3fb618", "#8e6c8a"]}}, "data": {"name": "data-b64186ba8bd194e866931748b3fda927"}, "mark": {"type": "rect"}, "encoding": {"color": {"field": "midpoint", "legend": {"title": "Score"}, "scale": {"domain": [0, 100]}, "type": "quantitative"}, "tooltip": [{"field": "short", "type": "nominal"}, {"field": "metric", "type": "nominal"}, {"field": "midpoint", "type": "quantitative"}], "x": {"field": "short", "sort": ["Banerjee (2023)", "Acemoglu (2024)", "Buntaine (2023)", "Kremer (2022)", "Akram (2017)", "Walker (2023)", "Jack (2022)", "Alcott (2024)", "Crawfurd (2023)", "Bhat (2022)", "Liang (2021)", "Haushofer (2020)", "Epperson (2024)", "Arora (2023)", "Fangwa (2023)", "Trammel (2025)", "Alatas (2019)", "Hill (2024)", "Barker (2021)", "Aghion (2017)", "Bahar (2022)", "Carson (2023)", "Clancy (2024)", "Kubo (2023)", "Schuett (2023)", "Barberio (2022)", "Chuard (2022)", "Bettle (2023)", "Kremer (2020)", "Bhattacharya (2020)", "Bruers (2021)"], "title": null, "type": "nominal"}, "y": {"field": "metric", "sort": null, "title": null, "type": "nominal"}}, "height": 280, "width": 434, "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json", "datasets": {"data-b64186ba8bd194e866931748b3fda927": [{"metric": "global_relevance", "paper": "Acemoglu et al. 2024", "midpoint": 92, "short": "Acemoglu (2024)"}, {"metric": "open_science", "paper": "Acemoglu et al. 2024", "midpoint": 75, "short": "Acemoglu (2024)"}, {"metric": "logic_communication", "paper": "Acemoglu et al. 2024", "midpoint": 90, "short": "Acemoglu (2024)"}, {"metric": "advancing_knowledge", "paper": "Acemoglu et al. 2024", "midpoint": 88, "short": "Acemoglu (2024)"}, {"metric": "methods", "paper": "Acemoglu et al. 2024", "midpoint": 83, "short": "Acemoglu (2024)"}, {"metric": "claims_evidence", "paper": "Acemoglu et al. 2024", "midpoint": 85, "short": "Acemoglu (2024)"}, {"metric": "overall", "paper": "Acemoglu et al. 2024", "midpoint": 86, "short": "Acemoglu (2024)"}, {"metric": "global_relevance", "paper": "Aghion et al. 2017", "midpoint": 70, "short": "Aghion (2017)"}, {"metric": "open_science", "paper": "Aghion et al. 2017", "midpoint": 40, "short": "Aghion (2017)"}, {"metric": "logic_communication", "paper": "Aghion et al. 2017", "midpoint": 85, "short": "Aghion (2017)"}, {"metric": "advancing_knowledge", "paper": "Aghion et al. 2017", "midpoint": 80, "short": "Aghion (2017)"}, {"metric": "methods", "paper": "Aghion et al. 2017", "midpoint": 72, "short": "Aghion (2017)"}, {"metric": "claims_evidence", "paper": "Aghion et al. 2017", "midpoint": 65, "short": "Aghion (2017)"}, {"metric": "overall", "paper": "Aghion et al. 2017", "midpoint": 69, "short": "Aghion (2017)"}, {"metric": "global_relevance", "paper": "Akram et al. 2017", "midpoint": 85, "short": "Akram (2017)"}, {"metric": "open_science", "paper": "Akram et al. 2017", "midpoint": 60, "short": "Akram (2017)"}, {"metric": "logic_communication", "paper": "Akram et al. 2017", "midpoint": 83, "short": "Akram (2017)"}, {"metric": "advancing_knowledge", "paper": "Akram et al. 2017", "midpoint": 80, "short": "Akram (2017)"}, {"metric": "methods", "paper": "Akram et al. 2017", "midpoint": 88, "short": "Akram (2017)"}, {"metric": "claims_evidence", "paper": "Akram et al. 2017", "midpoint": 85, "short": "Akram (2017)"}, {"metric": "overall", "paper": "Akram et al. 2017", "midpoint": 83, "short": "Akram (2017)"}, {"metric": "global_relevance", "paper": "Alatas et al. 2019", "midpoint": 70, "short": "Alatas (2019)"}, {"metric": "open_science", "paper": "Alatas et al. 2019", "midpoint": 50, "short": "Alatas (2019)"}, {"metric": "logic_communication", "paper": "Alatas et al. 2019", "midpoint": 80, "short": "Alatas (2019)"}, {"metric": "advancing_knowledge", "paper": "Alatas et al. 2019", "midpoint": 75, "short": "Alatas (2019)"}, {"metric": "methods", "paper": "Alatas et al. 2019", "midpoint": 85, "short": "Alatas (2019)"}, {"metric": "claims_evidence", "paper": "Alatas et al. 2019", "midpoint": 80, "short": "Alatas (2019)"}, {"metric": "overall", "paper": "Alatas et al. 2019", "midpoint": 75, "short": "Alatas (2019)"}, {"metric": "global_relevance", "paper": "Alcott et al. 2024", "midpoint": 65, "short": "Alcott (2024)"}, {"metric": "open_science", "paper": "Alcott et al. 2024", "midpoint": 75, "short": "Alcott (2024)"}, {"metric": "logic_communication", "paper": "Alcott et al. 2024", "midpoint": 90, "short": "Alcott (2024)"}, {"metric": "advancing_knowledge", "paper": "Alcott et al. 2024", "midpoint": 85, "short": "Alcott (2024)"}, {"metric": "methods", "paper": "Alcott et al. 2024", "midpoint": 77, "short": "Alcott (2024)"}, {"metric": "claims_evidence", "paper": "Alcott et al. 2024", "midpoint": 80, "short": "Alcott (2024)"}, {"metric": "overall", "paper": "Alcott et al. 2024", "midpoint": 80, "short": "Alcott (2024)"}, {"metric": "global_relevance", "paper": "Arora et al. 2023", "midpoint": 60, "short": "Arora (2023)"}, {"metric": "open_science", "paper": "Arora et al. 2023", "midpoint": 70, "short": "Arora (2023)"}, {"metric": "logic_communication", "paper": "Arora et al. 2023", "midpoint": 78, "short": "Arora (2023)"}, {"metric": "advancing_knowledge", "paper": "Arora et al. 2023", "midpoint": 85, "short": "Arora (2023)"}, {"metric": "methods", "paper": "Arora et al. 2023", "midpoint": 80, "short": "Arora (2023)"}, {"metric": "claims_evidence", "paper": "Arora et al. 2023", "midpoint": 75, "short": "Arora (2023)"}, {"metric": "overall", "paper": "Arora et al. 2023", "midpoint": 77, "short": "Arora (2023)"}, {"metric": "global_relevance", "paper": "Bahar et al. 2022", "midpoint": 80, "short": "Bahar (2022)"}, {"metric": "open_science", "paper": "Bahar et al. 2022", "midpoint": 40, "short": "Bahar (2022)"}, {"metric": "logic_communication", "paper": "Bahar et al. 2022", "midpoint": 70, "short": "Bahar (2022)"}, {"metric": "advancing_knowledge", "paper": "Bahar et al. 2022", "midpoint": 75, "short": "Bahar (2022)"}, {"metric": "methods", "paper": "Bahar et al. 2022", "midpoint": 72, "short": "Bahar (2022)"}, {"metric": "claims_evidence", "paper": "Bahar et al. 2022", "midpoint": 70, "short": "Bahar (2022)"}, {"metric": "overall", "paper": "Bahar et al. 2022", "midpoint": 68, "short": "Bahar (2022)"}, {"metric": "global_relevance", "paper": "Banerjee et al. 2023", "midpoint": 95, "short": "Banerjee (2023)"}, {"metric": "open_science", "paper": "Banerjee et al. 2023", "midpoint": 80, "short": "Banerjee (2023)"}, {"metric": "logic_communication", "paper": "Banerjee et al. 2023", "midpoint": 85, "short": "Banerjee (2023)"}, {"metric": "advancing_knowledge", "paper": "Banerjee et al. 2023", "midpoint": 95, "short": "Banerjee (2023)"}, {"metric": "methods", "paper": "Banerjee et al. 2023", "midpoint": 90, "short": "Banerjee (2023)"}, {"metric": "claims_evidence", "paper": "Banerjee et al. 2023", "midpoint": 88, "short": "Banerjee (2023)"}, {"metric": "overall", "paper": "Banerjee et al. 2023", "midpoint": 90, "short": "Banerjee (2023)"}, {"metric": "global_relevance", "paper": "Barberio et al. 2022", "midpoint": 85, "short": "Barberio (2022)"}, {"metric": "open_science", "paper": "Barberio et al. 2022", "midpoint": 40, "short": "Barberio (2022)"}, {"metric": "logic_communication", "paper": "Barberio et al. 2022", "midpoint": 80, "short": "Barberio (2022)"}, {"metric": "advancing_knowledge", "paper": "Barberio et al. 2022", "midpoint": 60, "short": "Barberio (2022)"}, {"metric": "methods", "paper": "Barberio et al. 2022", "midpoint": 70, "short": "Barberio (2022)"}, {"metric": "claims_evidence", "paper": "Barberio et al. 2022", "midpoint": 65, "short": "Barberio (2022)"}, {"metric": "overall", "paper": "Barberio et al. 2022", "midpoint": 67, "short": "Barberio (2022)"}, {"metric": "global_relevance", "paper": "Barker et al. 2021", "midpoint": 85, "short": "Barker (2021)"}, {"metric": "open_science", "paper": "Barker et al. 2021", "midpoint": 55, "short": "Barker (2021)"}, {"metric": "logic_communication", "paper": "Barker et al. 2021", "midpoint": 80, "short": "Barker (2021)"}, {"metric": "advancing_knowledge", "paper": "Barker et al. 2021", "midpoint": 75, "short": "Barker (2021)"}, {"metric": "methods", "paper": "Barker et al. 2021", "midpoint": 70, "short": "Barker (2021)"}, {"metric": "claims_evidence", "paper": "Barker et al. 2021", "midpoint": 65, "short": "Barker (2021)"}, {"metric": "overall", "paper": "Barker et al. 2021", "midpoint": 72, "short": "Barker (2021)"}, {"metric": "global_relevance", "paper": "Bettle 2023", "midpoint": 75, "short": "Bettle (2023)"}, {"metric": "open_science", "paper": "Bettle 2023", "midpoint": 40, "short": "Bettle (2023)"}, {"metric": "logic_communication", "paper": "Bettle 2023", "midpoint": 70, "short": "Bettle (2023)"}, {"metric": "advancing_knowledge", "paper": "Bettle 2023", "midpoint": 65, "short": "Bettle (2023)"}, {"metric": "methods", "paper": "Bettle 2023", "midpoint": 60, "short": "Bettle (2023)"}, {"metric": "claims_evidence", "paper": "Bettle 2023", "midpoint": 55, "short": "Bettle (2023)"}, {"metric": "overall", "paper": "Bettle 2023", "midpoint": 61, "short": "Bettle (2023)"}, {"metric": "global_relevance", "paper": "Bhat et al. 2022", "midpoint": 85, "short": "Bhat (2022)"}, {"metric": "open_science", "paper": "Bhat et al. 2022", "midpoint": 60, "short": "Bhat (2022)"}, {"metric": "logic_communication", "paper": "Bhat et al. 2022", "midpoint": 80, "short": "Bhat (2022)"}, {"metric": "advancing_knowledge", "paper": "Bhat et al. 2022", "midpoint": 85, "short": "Bhat (2022)"}, {"metric": "methods", "paper": "Bhat et al. 2022", "midpoint": 80, "short": "Bhat (2022)"}, {"metric": "claims_evidence", "paper": "Bhat et al. 2022", "midpoint": 75, "short": "Bhat (2022)"}, {"metric": "overall", "paper": "Bhat et al. 2022", "midpoint": 78, "short": "Bhat (2022)"}, {"metric": "global_relevance", "paper": "Bhattacharya and Packalen 2020", "midpoint": 55, "short": "Bhattacharya (2020)"}, {"metric": "open_science", "paper": "Bhattacharya and Packalen 2020", "midpoint": 40, "short": "Bhattacharya (2020)"}, {"metric": "logic_communication", "paper": "Bhattacharya and Packalen 2020", "midpoint": 70, "short": "Bhattacharya (2020)"}, {"metric": "advancing_knowledge", "paper": "Bhattacharya and Packalen 2020", "midpoint": 60, "short": "Bhattacharya (2020)"}, {"metric": "methods", "paper": "Bhattacharya and Packalen 2020", "midpoint": 45, "short": "Bhattacharya (2020)"}, {"metric": "claims_evidence", "paper": "Bhattacharya and Packalen 2020", "midpoint": 55, "short": "Bhattacharya (2020)"}, {"metric": "overall", "paper": "Bhattacharya and Packalen 2020", "midpoint": 54, "short": "Bhattacharya (2020)"}, {"metric": "global_relevance", "paper": "Bruers 2021", "midpoint": 55, "short": "Bruers (2021)"}, {"metric": "open_science", "paper": "Bruers 2021", "midpoint": 25, "short": "Bruers (2021)"}, {"metric": "logic_communication", "paper": "Bruers 2021", "midpoint": 60, "short": "Bruers (2021)"}, {"metric": "advancing_knowledge", "paper": "Bruers 2021", "midpoint": 50, "short": "Bruers (2021)"}, {"metric": "methods", "paper": "Bruers 2021", "midpoint": 35, "short": "Bruers (2021)"}, {"metric": "claims_evidence", "paper": "Bruers 2021", "midpoint": 40, "short": "Bruers (2021)"}, {"metric": "overall", "paper": "Bruers 2021", "midpoint": 45, "short": "Bruers (2021)"}, {"metric": "global_relevance", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "open_science", "paper": "Buntaine et al. 2023", "midpoint": 70, "short": "Buntaine (2023)"}, {"metric": "logic_communication", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "advancing_knowledge", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "methods", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "claims_evidence", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "overall", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "global_relevance", "paper": "Carson et al. 2023", "midpoint": 60, "short": "Carson (2023)"}, {"metric": "open_science", "paper": "Carson et al. 2023", "midpoint": 55, "short": "Carson (2023)"}, {"metric": "logic_communication", "paper": "Carson et al. 2023", "midpoint": 80, "short": "Carson (2023)"}, {"metric": "advancing_knowledge", "paper": "Carson et al. 2023", "midpoint": 65, "short": "Carson (2023)"}, {"metric": "methods", "paper": "Carson et al. 2023", "midpoint": 70, "short": "Carson (2023)"}, {"metric": "claims_evidence", "paper": "Carson et al. 2023", "midpoint": 75, "short": "Carson (2023)"}, {"metric": "overall", "paper": "Carson et al. 2023", "midpoint": 68, "short": "Carson (2023)"}, {"metric": "global_relevance", "paper": "Chuard et al. 2022", "midpoint": 65, "short": "Chuard (2022)"}, {"metric": "open_science", "paper": "Chuard et al. 2022", "midpoint": 50, "short": "Chuard (2022)"}, {"metric": "logic_communication", "paper": "Chuard et al. 2022", "midpoint": 75, "short": "Chuard (2022)"}, {"metric": "advancing_knowledge", "paper": "Chuard et al. 2022", "midpoint": 60, "short": "Chuard (2022)"}, {"metric": "methods", "paper": "Chuard et al. 2022", "midpoint": 65, "short": "Chuard (2022)"}, {"metric": "claims_evidence", "paper": "Chuard et al. 2022", "midpoint": 70, "short": "Chuard (2022)"}, {"metric": "overall", "paper": "Chuard et al. 2022", "midpoint": 64, "short": "Chuard (2022)"}, {"metric": "global_relevance", "paper": "Clancy 2024", "midpoint": 80, "short": "Clancy (2024)"}, {"metric": "open_science", "paper": "Clancy 2024", "midpoint": 55, "short": "Clancy (2024)"}, {"metric": "logic_communication", "paper": "Clancy 2024", "midpoint": 70, "short": "Clancy (2024)"}, {"metric": "advancing_knowledge", "paper": "Clancy 2024", "midpoint": 75, "short": "Clancy (2024)"}, {"metric": "methods", "paper": "Clancy 2024", "midpoint": 60, "short": "Clancy (2024)"}, {"metric": "claims_evidence", "paper": "Clancy 2024", "midpoint": 65, "short": "Clancy (2024)"}, {"metric": "overall", "paper": "Clancy 2024", "midpoint": 68, "short": "Clancy (2024)"}, {"metric": "global_relevance", "paper": "Crawfurd et al. 2023", "midpoint": 90, "short": "Crawfurd (2023)"}, {"metric": "open_science", "paper": "Crawfurd et al. 2023", "midpoint": 70, "short": "Crawfurd (2023)"}, {"metric": "logic_communication", "paper": "Crawfurd et al. 2023", "midpoint": 85, "short": "Crawfurd (2023)"}, {"metric": "advancing_knowledge", "paper": "Crawfurd et al. 2023", "midpoint": 80, "short": "Crawfurd (2023)"}, {"metric": "methods", "paper": "Crawfurd et al. 2023", "midpoint": 75, "short": "Crawfurd (2023)"}, {"metric": "claims_evidence", "paper": "Crawfurd et al. 2023", "midpoint": 70, "short": "Crawfurd (2023)"}, {"metric": "overall", "paper": "Crawfurd et al. 2023", "midpoint": 78, "short": "Crawfurd (2023)"}, {"metric": "global_relevance", "paper": "Epperson and Gerster 2024", "midpoint": 60, "short": "Epperson (2024)"}, {"metric": "open_science", "paper": "Epperson and Gerster 2024", "midpoint": 85, "short": "Epperson (2024)"}, {"metric": "logic_communication", "paper": "Epperson and Gerster 2024", "midpoint": 80, "short": "Epperson (2024)"}, {"metric": "advancing_knowledge", "paper": "Epperson and Gerster 2024", "midpoint": 70, "short": "Epperson (2024)"}, {"metric": "methods", "paper": "Epperson and Gerster 2024", "midpoint": 78, "short": "Epperson (2024)"}, {"metric": "claims_evidence", "paper": "Epperson and Gerster 2024", "midpoint": 80, "short": "Epperson (2024)"}, {"metric": "overall", "paper": "Epperson and Gerster 2024", "midpoint": 75, "short": "Epperson (2024)"}, {"metric": "global_relevance", "paper": "Fangwa et al. 2023", "midpoint": 85, "short": "Fangwa (2023)"}, {"metric": "open_science", "paper": "Fangwa et al. 2023", "midpoint": 45, "short": "Fangwa (2023)"}, {"metric": "logic_communication", "paper": "Fangwa et al. 2023", "midpoint": 80, "short": "Fangwa (2023)"}, {"metric": "advancing_knowledge", "paper": "Fangwa et al. 2023", "midpoint": 75, "short": "Fangwa (2023)"}, {"metric": "methods", "paper": "Fangwa et al. 2023", "midpoint": 80, "short": "Fangwa (2023)"}, {"metric": "claims_evidence", "paper": "Fangwa et al. 2023", "midpoint": 78, "short": "Fangwa (2023)"}, {"metric": "overall", "paper": "Fangwa et al. 2023", "midpoint": 75, "short": "Fangwa (2023)"}, {"metric": "global_relevance", "paper": "Haushofer et al. 2020", "midpoint": 75, "short": "Haushofer (2020)"}, {"metric": "open_science", "paper": "Haushofer et al. 2020", "midpoint": 65, "short": "Haushofer (2020)"}, {"metric": "logic_communication", "paper": "Haushofer et al. 2020", "midpoint": 80, "short": "Haushofer (2020)"}, {"metric": "advancing_knowledge", "paper": "Haushofer et al. 2020", "midpoint": 70, "short": "Haushofer (2020)"}, {"metric": "methods", "paper": "Haushofer et al. 2020", "midpoint": 85, "short": "Haushofer (2020)"}, {"metric": "claims_evidence", "paper": "Haushofer et al. 2020", "midpoint": 80, "short": "Haushofer (2020)"}, {"metric": "overall", "paper": "Haushofer et al. 2020", "midpoint": 76, "short": "Haushofer (2020)"}, {"metric": "global_relevance", "paper": "Hill et al. 2024", "midpoint": 65, "short": "Hill (2024)"}, {"metric": "open_science", "paper": "Hill et al. 2024", "midpoint": 60, "short": "Hill (2024)"}, {"metric": "logic_communication", "paper": "Hill et al. 2024", "midpoint": 85, "short": "Hill (2024)"}, {"metric": "advancing_knowledge", "paper": "Hill et al. 2024", "midpoint": 70, "short": "Hill (2024)"}, {"metric": "methods", "paper": "Hill et al. 2024", "midpoint": 78, "short": "Hill (2024)"}, {"metric": "claims_evidence", "paper": "Hill et al. 2024", "midpoint": 75, "short": "Hill (2024)"}, {"metric": "overall", "paper": "Hill et al. 2024", "midpoint": 73, "short": "Hill (2024)"}, {"metric": "global_relevance", "paper": "Jack et al. 2022", "midpoint": 88, "short": "Jack (2022)"}, {"metric": "open_science", "paper": "Jack et al. 2022", "midpoint": 60, "short": "Jack (2022)"}, {"metric": "logic_communication", "paper": "Jack et al. 2022", "midpoint": 85, "short": "Jack (2022)"}, {"metric": "advancing_knowledge", "paper": "Jack et al. 2022", "midpoint": 78, "short": "Jack (2022)"}, {"metric": "methods", "paper": "Jack et al. 2022", "midpoint": 80, "short": "Jack (2022)"}, {"metric": "claims_evidence", "paper": "Jack et al. 2022", "midpoint": 85, "short": "Jack (2022)"}, {"metric": "overall", "paper": "Jack et al. 2022", "midpoint": 79, "short": "Jack (2022)"}, {"metric": "global_relevance", "paper": "Kremer et al. 2020", "midpoint": 85, "short": "Kremer (2020)"}, {"metric": "open_science", "paper": "Kremer et al. 2020", "midpoint": 40, "short": "Kremer (2020)"}, {"metric": "logic_communication", "paper": "Kremer et al. 2020", "midpoint": 80, "short": "Kremer (2020)"}, {"metric": "advancing_knowledge", "paper": "Kremer et al. 2020", "midpoint": 55, "short": "Kremer (2020)"}, {"metric": "methods", "paper": "Kremer et al. 2020", "midpoint": 45, "short": "Kremer (2020)"}, {"metric": "claims_evidence", "paper": "Kremer et al. 2020", "midpoint": 60, "short": "Kremer (2020)"}, {"metric": "overall", "paper": "Kremer et al. 2020", "midpoint": 61, "short": "Kremer (2020)"}, {"metric": "global_relevance", "paper": "Kremer et al. 2022", "midpoint": 95, "short": "Kremer (2022)"}, {"metric": "open_science", "paper": "Kremer et al. 2022", "midpoint": 70, "short": "Kremer (2022)"}, {"metric": "logic_communication", "paper": "Kremer et al. 2022", "midpoint": 85, "short": "Kremer (2022)"}, {"metric": "advancing_knowledge", "paper": "Kremer et al. 2022", "midpoint": 90, "short": "Kremer (2022)"}, {"metric": "methods", "paper": "Kremer et al. 2022", "midpoint": 85, "short": "Kremer (2022)"}, {"metric": "claims_evidence", "paper": "Kremer et al. 2022", "midpoint": 80, "short": "Kremer (2022)"}, {"metric": "overall", "paper": "Kremer et al. 2022", "midpoint": 84, "short": "Kremer (2022)"}, {"metric": "global_relevance", "paper": "Kubo et al. 2023", "midpoint": 60, "short": "Kubo (2023)"}, {"metric": "open_science", "paper": "Kubo et al. 2023", "midpoint": 55, "short": "Kubo (2023)"}, {"metric": "logic_communication", "paper": "Kubo et al. 2023", "midpoint": 78, "short": "Kubo (2023)"}, {"metric": "advancing_knowledge", "paper": "Kubo et al. 2023", "midpoint": 65, "short": "Kubo (2023)"}, {"metric": "methods", "paper": "Kubo et al. 2023", "midpoint": 75, "short": "Kubo (2023)"}, {"metric": "claims_evidence", "paper": "Kubo et al. 2023", "midpoint": 70, "short": "Kubo (2023)"}, {"metric": "overall", "paper": "Kubo et al. 2023", "midpoint": 67, "short": "Kubo (2023)"}, {"metric": "global_relevance", "paper": "Liang et al. 2021", "midpoint": 70, "short": "Liang (2021)"}, {"metric": "open_science", "paper": "Liang et al. 2021", "midpoint": 70, "short": "Liang (2021)"}, {"metric": "logic_communication", "paper": "Liang et al. 2021", "midpoint": 75, "short": "Liang (2021)"}, {"metric": "advancing_knowledge", "paper": "Liang et al. 2021", "midpoint": 80, "short": "Liang (2021)"}, {"metric": "methods", "paper": "Liang et al. 2021", "midpoint": 85, "short": "Liang (2021)"}, {"metric": "claims_evidence", "paper": "Liang et al. 2021", "midpoint": 85, "short": "Liang (2021)"}, {"metric": "overall", "paper": "Liang et al. 2021", "midpoint": 78, "short": "Liang (2021)"}, {"metric": "global_relevance", "paper": "Schuett et al. 2023", "midpoint": 70, "short": "Schuett (2023)"}, {"metric": "open_science", "paper": "Schuett et al. 2023", "midpoint": 80, "short": "Schuett (2023)"}, {"metric": "logic_communication", "paper": "Schuett et al. 2023", "midpoint": 70, "short": "Schuett (2023)"}, {"metric": "advancing_knowledge", "paper": "Schuett et al. 2023", "midpoint": 60, "short": "Schuett (2023)"}, {"metric": "methods", "paper": "Schuett et al. 2023", "midpoint": 55, "short": "Schuett (2023)"}, {"metric": "claims_evidence", "paper": "Schuett et al. 2023", "midpoint": 65, "short": "Schuett (2023)"}, {"metric": "overall", "paper": "Schuett et al. 2023", "midpoint": 67, "short": "Schuett (2023)"}, {"metric": "global_relevance", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 90, "short": "Trammel (2025)"}, {"metric": "open_science", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 40, "short": "Trammel (2025)"}, {"metric": "logic_communication", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 80, "short": "Trammel (2025)"}, {"metric": "advancing_knowledge", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 85, "short": "Trammel (2025)"}, {"metric": "methods", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 75, "short": "Trammel (2025)"}, {"metric": "claims_evidence", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 70, "short": "Trammel (2025)"}, {"metric": "overall", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 75, "short": "Trammel (2025)"}, {"metric": "global_relevance", "paper": "Walker et al. 2023", "midpoint": 90, "short": "Walker (2023)"}, {"metric": "open_science", "paper": "Walker et al. 2023", "midpoint": 60, "short": "Walker (2023)"}, {"metric": "logic_communication", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}, {"metric": "advancing_knowledge", "paper": "Walker et al. 2023", "midpoint": 85, "short": "Walker (2023)"}, {"metric": "methods", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}, {"metric": "claims_evidence", "paper": "Walker et al. 2023", "midpoint": 85, "short": "Walker (2023)"}, {"metric": "overall", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}]}}, {"mode": "vega-lite"});
  
  This is really nice in principle, but it needs some tweaking. something about the colors and spacing make it hard to different here. Love the 'hover titles' though!!
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/numerical_ratings
Jun 2025
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Illustrating LangChain pipeline

2
1. daaronr 23 Jun 2025
  
  in Public
  
  in-memory Chroma vector DB
  
  Do I need to now what this means?
2. daaronr 23 Jun 2025
  
  in Public
  
  (one per page).
  
  small point ... one per page seems a bit random
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/langchain_test.html
ai.nejm.org ai.nejm.org

Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis

1
1. daaronr 07 Jun 2025
  
  in Public
  
  International Conference on Learning Representations (ICLR) machine learning conference
  
  See also this paper
Visit annotations in context

Annotators

daaronr

URL

ai.nejm.org/doi/full/10.1056/AIoa2400196
May 2025
daaronr.github.io daaronr.github.io

5 Robustness and diagnostics, with integrity; Open Science resources | Statistics, econometrics, experiment and survey methods, data science: Notes

1
1. daaronr 23 May 2025
  
  in Public
  
  5.1 (How) can diagnostic tests make sense? Where is the burden of proof?
  
  Some references from ChatGPT o3 making similar points
  
  https://chatgpt.com/share/6830a265-3f14-8007-b8f8-daa0936d8f5f
Visit annotations in context

Annotators

daaronr

URL

daaronr.github.io/metrics_discussion/robust-diag.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Illustrating LangChain pipeline

3
1. daaronr 20 May 2025
  
  in Public
  
  o3,
  
  you mean 'aiming for o3, testing with 4o mini for now'?
2. daaronr 20 May 2025
  
  in Public
  
  including probability judgement and robustness suggestions.
  
  --> 'making a probabilistic judgement about the accuracy of this claim, and making suggestions for robustness checks'
3. daaronr 20 May 2025
  
  in Public
  
  factual
  
  being super fussy, but 'factual as opposed to what'?
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/langchain_test.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - 2 Testing OpenAI API Access in Python

12
1. daaronr 11 May 2025
  
  in Public
  
  → scoring ssrn-4071953.pdf …
  
  what was the run time? -- put in a timer? Something like 5 minutes for 3 papers
  
  NB some papers too long for o3 -- can be chunked and summarized by another model, langchain helps, or we use a larger context model like Gemini
2. daaronr 11 May 2025
  
  in Public
  
  # Install the SDK if it isn’t present
  
  rem to put freeze option here
3. daaronr 11 May 2025
  
  in Public
  
  key_path = pathlib.Path("key/openai_key.txt")
  
  Don't commit this (gitignore), everyone has there own key
4. daaronr 11 May 2025
  
  in Public
  
  Project files
  
  this links your github page? I don't see the particular repo
5. daaronr 11 May 2025
  
  in Public
  
  Table 2.1: Overall assessments (percentiles)
  
  Let's try one out on a paper that our evaluators gave low scores ... or one we think would likely to receive low scores in general, for comparison
6. daaronr 11 May 2025
  
  in Public
  
  # 1. Point to the papers you want to score
  
  Looks like you put the pdfs in the local directory and the git repo? Let's think about ways to mor eautomatically 'feed these in' in bulkl
7. daaronr 11 May 2025
  
  in Public
  
  Following (Garg and Fetzer 2025)
  
  Nice, Garg and Fetzer seems very relevant. I see a lot of fruitful potential overlap.
8. daaronr 10 May 2025
  
  in Public
  
  ::: {#cell-Averaged ratings .cell tbl-cap=‘Mean ± SD of ten model draws’ execution_count=8}
  
  code formatting (minor_
9. daaronr 10 May 2025
  
  in Public
  
  paper_id
  
  these are consistent but maybe not informative at all ... same for all papers??
10. daaronr 10 May 2025
  
  in Public
  
  2.1.0.4 Monte‑Carlo: 10 independent ratings per paper
  
  it looks like 9 runs?
11. daaronr 10 May 2025
  
  in Public
  
  rationale
  
  make this column wider/expandable
12. daaronr 10 May 2025
  
  in Public
  
  We call GPT‑4o on each PDF, asking for a strict JSON percentile rating and a short rationale, then collect the results.
  
  I suspect 4o is not particularly insightful? But we can compare it to the (~manual) results we previously got from o3.
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/main.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research

1
1. daaronr 11 May 2025
  
  in Public
  
  Explore LangChain, OpenAI Evals
  
  Is langchain compatible with this quarto approach? ... probably yet. But we can first do some things more manually/simply.
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/
lakens.github.io lakens.github.io

Improving Your Statistical Inferences - 4 Bayesian statistics

1
1. daaronr 05 May 2025
  
  in Public
  
  r in the following formula: P(H1|D)P(H0|D)= P(D|H1)P(D|H0) × P(H1)P(H0) Posterior Probability= Bayes Factor × Prior Probability A Bayesian analysis of data requires specifying the prior. Here, we will continu
  
  formula should refer to 'odds' not 'probability'
Visit annotations in context

Annotators

daaronr

URL

lakens.github.io/statistical_inferences/04-bayes.html

David Reinstein

Dr. David Reinstein, Senior Economist, Rethink Priorities https://daaronr.github.io/markdown-cv/

Project pages: innovationsinfundraising.org, giveifyouwin.org

Twitter: @givingtools

Annotations: 1,969

Joined: July 17, 2019

Location: Western Massachussets

Link: davidreinstein.org

ORCID: 0000-0002-0470-4991

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL