1,066 Matching Annotations

Oct 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

2
1. daaronr 05 Oct 2025
  
  in Public
  
  Show cod
  
  @Valentin These forest plots are really hard to read, it's so dense without spacing. Let's work together on some ways of making it more informative.
  
  I'm also puzzled as to why so many papers are shown only one type of rating and not the other. I know that some of our evaluators did not give ratings like this, and in some cases, we didn't even encourage it. But why is it missing for some of the llms? Did it just not have time to finish processing it?
  
  Maybe it's a display issue? It seems that the papers that were rated highest in terms of these tiers by the human raters did not get rated by the LLMs. Or maybe it just didn't show up in the graph?
2. daaronr 05 Oct 2025
  
  in Public
  
  (only partially available)
  
  @Valentin why/how 'partially available'?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
Sep 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Results

13
1. daaronr 23 Sep 2025
  
  in Public
  
  Inter-rater reliability (Cohen’s κ): We treated the AI as one “reviewer” and the average human as another, and asked: how often do they effectively give similar ratings, beyond what chance alignment would predict? For overall scores, Cohen’s κ (with quadratic weighting for partial agreement) came out around 0.25–0.30. This would be characterized as “fair” agreement at best – a significant gap remains between AI and human judgments. For most specific criteria, the weighted κ was lower, often in the 0.1–0.2 range (and for one category, effectively 0). A κ of 0 would mean no more agreement than random chance (given the distribution of scores), so some categories are bordering on that. In contrast, typical inter-human agreement on these kinds of scoring tasks can also be low, but usually one would hope for κ in the 0.3–0.6 range among trained reviewers on well-defined criteria. Our finding of κ < 0.3 in all cases suggests that the AI’s ratings are not interchangeable with a human reviewer’s – at least not without further calibration.
  
  @Valentin I want to focus on interpreting statistics like these. I think we could make info-theoretic statements about things like "how often would the LLM's relative ranking of 2 randomly chosen papers agree with a human's rannking". Is that kappa? Let's dig in
2. daaronr 23 Sep 2025
  
  in Public
  
  Correlation (Pearson’s r) between the AI’s and human scores across papers: This tells us, for example, if a paper that humans gave a high score also tended to get a high score from AI (regardless of absolute difference). For the Overall scores, Pearson r ≈ 0.30, indicating a weak-to-moderate positive correlation.
  
  @Valentin Important -- let's put these measures in context. How does this correlation compare with what other papers found, and with inter-human ratings, for example.
  
  We might also find a way to introduce information-theoretic measures that say something in an intuitive absolute sense.
3. daaronr 23 Sep 2025
  
  in Public
  
  r ~ 0
  
  @Valentin this should be stated more precisely, I think. (And with soft coding)
4. daaronr 23 Sep 2025
  
  in Public
  
  Table 3.1 show agreement metrics across rating criteria. To quantify the agreements and differences observed, we calculated several statistics comparing LLM scores to the human scores, aggregated by criterion:
  
  @Valentin perhaps we should start with the measures of agreement and then get into unpacking the discrepancies?
5. daaronr 23 Sep 2025
  
  in Public
  
  has a distinct “taste,” elevating some work and devaluing other work differently than human referees.
  
  I don't think we can call this 'taste' yet. It might be random noise or perhaps ~bias (to top institutions, authors, journals, etc.).
6. daaronr 23 Sep 2025
  
  in Public
  
  or example, Aghion et al. 2017 was among the top few for human reviewers, but the LLM overall score put it notably lower relative to others, hence a downward green curve.
  
  @Valentin I don't think that was the greatest discrepancy -- should we identify some with a greater discrepancy here? (Ideally, we even soft-code it, as this is likely to change as we adjust the prompts, anonymize, etc.)
7. daaronr 23 Sep 2025
  
  in Public
  
  high quality due to its robust modeling and accessible methodologies, making it relevant for climate mitigation and biodiversity
  
  This is also not particular coherent. Strange. If we just asked the llm to evaluate the paper it tends to give a much better response.
8. daaronr 23 Sep 2025
  
  in Public
  
  Comparing its midpoint percentiles to a reference group of serious research from the last three years will help inform this.
  
  @Valentin Yikes -- this sentence doesn't really make sense
9. daaronr 16 Sep 2025
  
  in Public
  
  Shared notes
  
  @Valentin Let's replace this link with a link to the Coda doc
10. daaronr 16 Sep 2025
  
  in Public
  
  on the right, the papers are ordered by the AI’s overall score (rank 1 = highest rated by AI)
  
  labeling-wise, would it be better to add the paper authors on the right as well? @valentin.klotzbuecher@posteo.de
11. daaronr 15 Sep 2025
  
  in Public
  
  For example, in the Logic & Communication column, we see many light-orange cells – the AI often thought papers were a bit clearer or better argued (by its judgment) than the human evaluators did.
  
  I wonder if we should normalize this in a few ways, at least as an alternative measure.
  
  I suspect the AI's distribution of ratings may have different than the human distribution of ratings overall and, the "bias" may also differ by category.
  
  Actually, that might be something to do first -- compare the distributions of (middle -- later more sophisticated) ratings for humans and for LLMs in an overall sense.
  
  One possible normalization would be to state these as percentiles relative to the other stated percentiles within that group (humans, LLMs), or even within categories of paper/field/cause area (I suspect there's some major difference between the more applied and niche-EA work and the standard academic work (the latter is also probably concentrated in GH&D and environmental econ). On the other hand, the systematic differences between LLM and human ratings on average might also tell us something interesting. So I wouldn't want to only use normalized measures.
  
  I think a more sophisticated version of this normalization just becomes a statistical (random effects?) model where you allow components of variation along several margins.
  
  It's true the ranks thing gets at this issue to some extent, as I guess Spearman also does? But I don't think it fully captures it.
12. daaronr 15 Sep 2025
  
  in Public
  
  deed, GPT often noted lack of code or data sharing in papers and penalized for it, whereas some human reviewers may have been more forgiving or did not emphasize open-science practices as strongly (especially if they focused more on content quality). As a result, for many papers the AI’s Open Science score is 5–10 points below the human average.
  
  This is interesting. The human evaluators may have had low expectations because they don't expect the open code and data to be provided until the paper has been published in a peer-reviewed journal. Here I would agree more with the LLM. "What should be" sense.
13. daaronr 14 Sep 2025
  
  in Public
  
  Figure 3.2: Relative ranking (overall) by LLM and Human evaluators
  
  A quick impression: the LLMs tend to rank the papers from prominent academic authors particularly high?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/results
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human reviews of social science research using data from Unjournal.org - Data and methods

12
1. daaronr 23 Sep 2025
  
  in Public
  
  he brief rationales clarify what evidence in the paper drove each score.
  
  I don't think we can be confident that whatever it puts here accurately reflects the reasoning or process that determined the rating. (added discussion in text).
  
  Also, is there some way to extract the 'thinking steps' from the process ... that reasoning thread the models show you (which May or may not also reflect its true reasoning )
  
  @valik
2. daaronr 23 Sep 2025
  
  in Public
  
  - tier_should = where the paper deserves to publish if quality-only decides. - tier_will = realistic prediction given status/noise/connections.
  
  This seems like an LLM abbreviation of our instructinos. @valik can you put the actual instructions back in?
3. daaronr 23 Sep 2025
  
  in Public
  
  Round all scores to the nearest 0.5
  
  @Valentin @valik why would you ask for this rounding?
4. daaronr 23 Sep 2025
  
  in Public
  
  We use a single‑step call with a reasoning model that supports file input. One step avoids hand‑offs and summary loss from a separate “ingestion” stage. The model reads the whole PDF and produces the JSON defined above. We do not retrieve external sources or cross‑paper material for these scores; the evaluation is anchored in the manuscript itself.
  
  We should probably give a citation for this point.
  
  But is this the same point you made above?
5. daaronr 23 Sep 2025
  
  in Public
  
  • Default = arithmetic mean of the other six midpoints (rounded).
  
  @Valentin I'm not sure why this should be the default. Note that in an earlier version of our evaluation framework, we used a weighted scheme, which we dropped
  
  I suspect it would be better for you to ask the question the same way we asking the evaluators here ... Which would simply mean getting rid of this bullet point. I think.
  
  The second bullet point is interesting though. I would be curious to hear how it considered the relevance of each metric in its overall. ... although I expect that asking that question might alter the overall score
  
  @valik Maybe GPT suggested this default averaging? It also seems so optimal because we're asking to get something that we could easily compute ourselves.
6. daaronr 23 Sep 2025
  
  in Public
  
  plus a short rationale
  
  @valik @Valentin
  
  I don't see where in the prompt you explained to it what the rationale was supposed to mean. I only see some discussion on how to use that. It should be stated if they override the simple mean for aggregating the other categories to get the overall.
  
  (And see my other comment on why I think we should remove the request to do simple means.)
7. daaronr 23 Sep 2025
  
  in Public
  
  The credible intervals communicate uncertainty rather than false precision
  
  @Valentin That's the intention, sure, but this paragraph makes it seem like the JSON schema somehow ensure that this is used. I'll try to adjust.
8. daaronr 23 Sep 2025
  
  in Public
  
  # Environment setup
  
  Note to self -- this chunk is set to 'eval=false', it doesn't need re-running every time (and has trouble playing in the Quarto/Rstudio environment)
9. daaronr 23 Sep 2025
  
  in Public
  
  Direct ingestion preserves tables, figures, equations, and sectioning, which ad‑hoc text scraping can mangle. It also avoids silent trimming or segmentation choices that would bias what the model sees.
  
  Useful to add a citation here.
10. daaronr 23 Sep 2025
  
  in Public
  
  he sample includes r dim(research_done)[1] papers
  
  @Valentin the number comes out in my Rstudio but not on render. Wondering what I did wrong here. I think it may be syntax, but also it starts again with every chapter, so we need to carry the data in first for this discussion.
11. daaronr 16 Sep 2025
  
  in Public
  
  49
  
  @Valentin Let's soft-code these numbers in based on the data.
  
  Something like
  
  The sample includes `r dim(research_done)[1]` papers spanning
12. daaronr 16 Sep 2025
  
  in Public
  
  per criterion (and noting the range of individual scores).
  
  @valentin we should probably do something more sophisticated at the next pass ... Either using each evaluation as a separate observation in the analysis or imputing the median taking the statedcredible intervals into account with some Bayesian procedure.
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/methods
squigglehub.org squigglehub.org

Unjournal-modelingPQs/cultured_meat_improved_by_squiggle_improve | Squiggle Hub

1
1. daaronr 05 Sep 2025
  
  in Public
  
  Please turn on the hypothes.is plugin, and view public annotations to see Latex math representations
  
  What if I annotate this in hypothes.is -- is it preserved usefully?$CRF(r,n) = \frac{r(1+r)^n}{(1+r)^n - 1} $
  
  Yes, it seems to stay in the same place in the notebook even if the notebook is edited.
  
  Wait now it's an orphan?
Visit annotations in context

Annotators

daaronr

URL

squigglehub.org/models/Unjournal-modelingPQs/cultured_meat_improved_by_squiggle_improve
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM and human peer reviews and ratings of social science research using data from Unjournal.org evaluations

1
1. daaronr 02 Sep 2025
  
  in Public
  
  model for now)
  
  DR @ VK: Should we use R inline code for things that will change, like the model number, and centralize these?
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/
Aug 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - LLM vs. Human Ratings

24
1. daaronr 27 Aug 2025
  
  in Public
  
  Figure 3.4: Human uncertainty (CI width) vs |LLM − Human|. Spearman correlation reported.
  
  This one is rather intricate, it might need some more analysis and talking through.
2. daaronr 27 Aug 2025
  
  in Public
  
  Table 3.3: CI coverage: does one interval contain the other’s point estimate?
  
  I think DHK also looked at 'human in human' -- Might as well add this as a comparator?
3. daaronr 27 Aug 2025
  
  in Public
  
  coverage
  
  Not sure if we want to use the formal statistical term 'coverage' here either (unless it's in scare quotes)
4. daaronr 27 Aug 2025
  
  in Public
  
  Table 3.1: Agreement metrics by metric (Pearson, Spearman, mean/median bias, κ unweighted/weighted).
  
  I assume this is referring to the correlation of (the average of?) the human ratings and the LLM ratings?
  
  Because we could also look at the agreement between humans (something David HJ was working on ... Might be useful to share this with him for his thoughts if he's interested at all.)
5. daaronr 27 Aug 2025
  
  in Public
  
  metric
  
  Is it worth adding confidence intervals or "significance stars" for these metrics?
6. daaronr 27 Aug 2025
  
  in Public
  
  mean_bias
  
  are the units "percentage points" or something else?
7. daaronr 27 Aug 2025
  
  in Public
  
  metric
  
  maybe 'by rating category' To avoid repeating the word "metric",
8. daaronr 27 Aug 2025
  
  in Public
  
  Calibration
  
  Not actually calibration unless we asked for a specific prediction.
9. daaronr 27 Aug 2025
  
  in Public
  
  calibration
  
  Maybe we should put in a footnote explaining that these aren't really measures of calibration.
10. daaronr 27 Aug 2025
  
  in Public
  
  As another indicator of agreement
  
  Exactly. It's sort of a measure of agreement, although I'm not sure quite how to interpret it. It's not a measure of calibration per se because we were asking them to rate these things, not to predict them.
  
  Although we could ask the LLM to do this as a prediction exercise, with or without training data, that might be interesting, And then, of course, the calibration would be meaningful.
11. daaronr 27 Aug 2025
  
  in Public
  
  monotonic relationship
  
  I'm not sure what the "monotonic relationship" in parentheses means here?
12. daaronr 27 Aug 2025
  
  in Public
  
  he horizontal radius covers the union of all human evidence for that paper — combining individual raters’ point scores and their CIs
  
  It's not fully clear to me what this does. I guess you center it (horizontally) at the midpoints of the human ratings.
  
  Are you using the outer bounds of both raters' CIs?
  
  Probably the best thing to do ultimately would be to impute a distribution over each individual ratings and CI these and then do some sort of belief aggregation. -- Horizontal linear aggregation actually feels the most intuitive to me from what I've read. ... And then give the implied CIs for that
13. daaronr 27 Aug 2025
  
  in Public
  
  For each paper (selected metric), the ellipse is centered at the pair of midpoints (Human, LLM).
  
  This is pretty clear but it took me a second to get this, so maybe mention human = horizontal, LLM = vertical in the text.
14. daaronr 27 Aug 2025
  
  in Public
  
  (unless a point lies outside its own stated CI, which is flagged in a diagnostic check).
  
  Did that ever happen?
15. daaronr 26 Aug 2025
  
  in Public
  
  robust CI ellipses
  
  Robust in what sense? Also might be worth mentioning that we are talking about 90% credible intervals (at least that's what we asked the humans to give us).
16. daaronr 26 Aug 2025
  
  in Public
  
  Uncertainty fields. Where available, we carry lower/upper bounds for both LLM and humans; these are used in optional uncertainty checks but do not affect the mid‑point comparisons below.
  
  I suspect we ultimately should do something more sophisticated with this ... like some Bayesian updating/averaging. It's also not entirely clear what you mean by "we carry".
  
  But of course, the "right" way to do this will depend on what precisely is the question that we're asking. Something we should have some nice chats about.
17. daaronr 26 Aug 2025
  
  in Public
  
  optional
  
  Instead of 'optional' maybe you mean 'additional analyses' or something?
18. daaronr 26 Aug 2025
  
  in Public
  
  we fold these into the single LLM metric global_relevance.
  
  Do you average them?
19. daaronr 26 Aug 2025
  
  in Public
  
  Human criteria are recoded to the LLM schema (e.g., claims → claims_evidence, adv_knowledge → advancing_knowledge, etc.)
  
  This is just about coding the variables, right? Not really about the content?
20. daaronr 26 Aug 2025
  
  in Public
  
  Sources. LLM ratings come from results/metrics_long.csv (rendered in the previous chapter). Human ratings are imported from your hand‑coded spreadsheet and mapped to LLM paper IDs via UJ_map.csv.
  
  Okay, I see these are basically notes to ourselves here.
21. daaronr 26 Aug 2025
  
  in Public
  
  compute paired scores per (paper, metric)
  
  I'm not sure what is meant by "compute paired scores."
22. daaronr 26 Aug 2025
  
  in Public
  
  We (i) harmonize the two sources to a common set of metrics and paper IDs
  
  This is fine for our own notes for now, but it's not something that outsiders need to read, I guess.
23. daaronr 26 Aug 2025
  
  in Public
  
  compares LLM‑generated ratings
  
  Link the ~prompt used so people can see exactly what we asked the LLM
24. daaronr 26 Aug 2025
  
  in Public
  
  human evaluations across the Unjournal criteria.
  
  "across The Unjournal's criteria"
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/compare_ratings
www.roastmypost.org www.roastmypost.org

Untitled document

17
1. daaronr 07 Aug 2025
  
  in Public
  
  Limited International Scope: The focus on US data and markets (acknowledged in footnote 20) may limit generalizability, particularly given different regulatory environments and consumer preferences globally.
  
  Perhaps we should footnote something more about how we're focusing on US data because of its availability and thus focusing the question around the US, but will encourage making reasonable extrapolations to global consumption. #todo? Although we already have a footnote about it
2. daaronr 06 Aug 2025
  
  in Public
  
  Supply Chain Complexity: The analysis doesn't fully account for how restaurant vs. grocery substitution patterns might differ, despite acknowledging this limitation.
  
  This kind of asks for too much detail, but maybe we state what the purpose of the post was as clearly as we should have?
3. daaronr 06 Aug 2025
  
  in Public
  
  Importance:5/10
  
  @ozzie I love the importance ranking. (If it's actually working, of course. Here it gave everything a 5 )
4. daaronr 06 Aug 2025
  
  in Public
  
  The explicit recognition that 'given the range of unknowns involved' makes direct comparison infeasible demonstrates strategic research thinking that EA organizations should emulate.
  
  @ozzie does your tool have explicit background instructions talking about "EA organizations", or is it getting this from the context of the post? Obviously the former would make it hard to apply in a more general range of settings.
5. daaronr 06 Aug 2025
  
  in Public
  
  Score: 78/100
  
  @ozzie this is a critique of the above evaluation of our post. That's a very cool idea! Given that our post is already a bit meta, my brain is struggling to grok this, but I like it.
6. daaronr 06 Aug 2025
  
  in Public
  
  Comment 1
  
  @ozzie what is the quote comments section supposed to mean? It seems a bit repetitive of the above?
7. daaronr 06 Aug 2025
  
  in Public
  
  Temporal Dynamics: The document doesn't adequately address how substitution patterns might change as plant-based products become more mainstream or as consumer familiarity increases.
  
  That's a fair point, and one we might want to incorporate. #todo?
8. daaronr 06 Aug 2025
  
  in Public
  
  Meta-Analysis Limitations: While the document presents cross-price elasticity estimates, it could better address how traditional meta-analysis approaches might be inappropriate given the fundamental methodological concerns raised.
  
  This is a mostly valid comment, it's a pretty reasonable thing to ask for here. We probably don't want to go into this detail here, but it's a natural thing someone might ask.
  
  OK. #todo? (maybe a footnote)
9. daaronr 06 Aug 2025
  
  in Public
  
  the document could better explain how these predictions will be weighted against expert evaluations.
  
  Good point. #todo?
10. daaronr 06 Aug 2025
  
  in Public
  
  This nuanced view prevents oversimplified dismissal of conflicting results.
  
  But does this tool reward valid nuance or just any type of "it's complicated because of" statements?
11. daaronr 06 Aug 2025
  
  in Public
  
  However, the document could strengthen its analysis by exploring potential explanations for these inconsistencies more systematically.
  
  This perhaps misses the point of the post. It's not meant to be an actual research exploration. It's just demonstrating the problem to help motivate and guide the next step in our project.
12. daaronr 06 Aug 2025
  
  in Public
  
  The progression from broad goal-focused questions to specific operationalized questions follows sound research design principles.
  
  I'm wondering how much the tool just being agreeable with our reasoning and how much the tool is challenging it. I wonder whether it just focuses on whether the author seems to be saying things a lot about "why" and thinking/overthinking, and it rewards that.
  
  That said, I do think this is a post that we've been putting a lot of care into, so it should get at least a decent rating here.
13. daaronr 06 Aug 2025
  
  in Public
  
  supermarkets raising PBA prices during economic windfalls effectively illustrates how naive analysis could overstate substitution effects.
  
  This would be a useful insight for a regular layman's post (the current context), but it would be kind of trivial if we were applying this to an economics paper.
  
  But maybe the tool adapts to that.
14. daaronr 06 Aug 2025
  
  in Public
  
  This represents sophisticated epistemic humility—recognizing that documenting limitations transparently provides significant value even when definitive answers remain elusive
  
  Okay, but this also seems like something that might be easy to fake with shibboleths and hollow phrases.
15. daaronr 06 Aug 2025
  
  in Public
  
  Audience Analysis and Calibration
  
  This is great.
16. daaronr 06 Aug 2025
  
  in Public
  
  This document presents The Unjournal's approach to investigating whether plant-based meat alternatives effectively displace animal product consumption—a critical question for animal welfare funding decisions. The analysis demonstrates sophisticated awareness of methodological limitations while proposing a structured approach to synthesize conflicting evidence.
  
  The language is a little too formal and flowery for my taste.
17. daaronr 06 Aug 2025
  
  in Public
  
  This document presents The Unjournal's Pivotal Questions initiative investigating plant-based meat substitution effects on animal product consumption. The analysis reveals significant methodological challenges in the existing research, with conflicting findings across studies using similar data sources. The document proposes a focused operationalized question about Impossible/Beyond Beef price changes and chicken consumption, while acknowledging fundamental limitations in current estimation approaches. Key insights include the recognition of endogeneity problems, the value of transparent uncertainty quantification, and the need for more rigorous experimental designs to inform animal welfare funding decisions.
  
  The summary is accurate, although it emphasizes methodology more than I thought the post was. That said, it might be that the post is actually just doing that more than we had intended.
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/_PEXsQ-yVpOUiLpn/evals/sed_c9fgaNb8bJoV
www.roastmypost.org www.roastmypost.org

Untitled document

4
1. daaronr 06 Aug 2025
  
  in Public
  
  acob Schmiess; we ac
  
  @ozzie it's incorrectly correcting a name spelling here.
2. daaronr 06 Aug 2025
  
  in Public
  
  Extracted forecast: Will Impossible Beef reach price-parity with conventional ground beef by December 31, 2030? Quality scores: Precision: 85 | Verifiable: 90 | Important: 75 | Robustness: 65 Our reasoning: Based on 2 independent analyses, the estimated probability is 72.3%. There is high consensus among the forecasts. Individual model forecasts: Model 1: 72.3% - "Plant-based meat costs have declined significantly due to scale economies and R&D investments, while conventional beef prices face upward pressure from environmental regulations and feed costs, making price parity likely within the 5+ year timeframe."
  
  @ozzie this is indeed directly helpful for us!
3. daaronr 06 Aug 2025
  
  in Public
  
  Based on 2 independent analyses, the estimated probability is 7.5%. There is high consensus among the forecasts.
  
  @ozzie ah, it seems you've set it up to only make predictions for binary things.
4. daaronr 06 Aug 2025
  
  in Public
  
  1. "What will be the price of IB+ on January 1, 2030?"
  
  @ozzie I'm puzzled here by what's going on. These were not our predictions. These were just questions we asked for others to predict. I guess it defaults to 'author predicts 50%' for such questions?
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/_PEXsQ-yVpOUiLpn/reader
www.roastmypost.org www.roastmypost.org

Untitled document

1
1. daaronr 06 Aug 2025
  
  in Public
  
  op 3 Most Overconfident Predictions 1. "Will most cultured meat (by volume) be produced wi..." Reality check: ~32.6% | Gap: 17.4%
  
  @ozzie I'm rather puzzled by what it did here, as we didn't make a prediction; we were just proposing questions for others to predict on.
Visit annotations in context

Annotators

daaronr

URL

roastmypost.org/docs/HcrEh2lyKqyS8ezq/evals/g2cI2FxsphEhDvbo
Jul 2025
llm-uj-research-eval.netlify.app llm-uj-research-eval.netlify.app

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Quantitative metrics

1
1. daaronr 07 Jul 2025
  
  in Public
  
  Figure 2.2: Per-paper mid-point scores across all metrics. Darker green → higher percentile. Columns ordered by each paper’s overall average. #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed { width: 100%; display: flex; } #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed details, #altair-viz-41a3d5b933fe425c841b312f24a54cc6.vega-embed details summary { position: relative; } Save as SVGSave as PNGView SourceView Compiled VegaOpen in Vega Editor var VEGA_DEBUG = (typeof VEGA_DEBUG == "undefined") ? {} : VEGA_DEBUG; (function(spec, embedOpt){ let outputDiv = document.currentScript.previousElementSibling; if (outputDiv.id !== "altair-viz-41a3d5b933fe425c841b312f24a54cc6") { outputDiv = document.getElementById("altair-viz-41a3d5b933fe425c841b312f24a54cc6"); } const paths = { "vega": "https://cdn.jsdelivr.net/npm/vega@5?noext", "vega-lib": "https://cdn.jsdelivr.net/npm/vega-lib?noext", "vega-lite": "https://cdn.jsdelivr.net/npm/vega-lite@5.20.1?noext", "vega-embed": "https://cdn.jsdelivr.net/npm/vega-embed@6?noext", }; function maybeLoadScript(lib, version) { var key = `${lib.replace("-", "")}_version`; return (VEGA_DEBUG[key] == version) ? Promise.resolve(paths[lib]) : new Promise(function(resolve, reject) { var s = document.createElement('script'); document.getElementsByTagName("head")[0].appendChild(s); s.async = true; s.onload = () => { VEGA_DEBUG[key] = version; return resolve(paths[lib]); }; s.onerror = () => reject(`Error loading script: ${paths[lib]}`); s.src = paths[lib]; }); } function showError(err) { outputDiv.innerHTML = `<div class="error" style="color:red;">${err}</div>`; throw err; } function displayChart(vegaEmbed) { vegaEmbed(outputDiv, spec, embedOpt) .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`)); } if(typeof define === "function" && define.amd) { requirejs.config({paths}); let deps = ["vega-embed"]; require(deps, displayChart, err => showError(`Error loading script: ${err.message}`)); } else { maybeLoadScript("vega", "5") .then(() => maybeLoadScript("vega-lite", "5.20.1")) .then(() => maybeLoadScript("vega-embed", "6")) .catch(showError) .then(() => displayChart(vegaEmbed)); } })({"config": {"view": {"stroke": "transparent"}, "background": "white", "title": {"font": "Source Sans Pro", "fontSize": 16, "color": "#222"}, "axis": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro", "labelColor": "#222", "titleColor": "#222", "gridOpacity": 0.15}, "legend": {"labelFont": "Source Sans Pro", "titleFont": "Source Sans Pro"}, "range": {"heatmap": ["#f6fbf3", "#e2f1d7", "#cfe7ba", "#badc9c", "#a6d27f", "#92c861", "#7dbd43", "#69b325", "#55a807", "#477b13"], "category": ["#99bb66", "#f19e4b", "#6bb0f3", "#d9534f", "#636363", "#ffb400", "#53354a", "#2780e3", "#3fb618", "#8e6c8a"]}}, "data": {"name": "data-b64186ba8bd194e866931748b3fda927"}, "mark": {"type": "rect"}, "encoding": {"color": {"field": "midpoint", "legend": {"title": "Score"}, "scale": {"domain": [0, 100]}, "type": "quantitative"}, "tooltip": [{"field": "short", "type": "nominal"}, {"field": "metric", "type": "nominal"}, {"field": "midpoint", "type": "quantitative"}], "x": {"field": "short", "sort": ["Banerjee (2023)", "Acemoglu (2024)", "Buntaine (2023)", "Kremer (2022)", "Akram (2017)", "Walker (2023)", "Jack (2022)", "Alcott (2024)", "Crawfurd (2023)", "Bhat (2022)", "Liang (2021)", "Haushofer (2020)", "Epperson (2024)", "Arora (2023)", "Fangwa (2023)", "Trammel (2025)", "Alatas (2019)", "Hill (2024)", "Barker (2021)", "Aghion (2017)", "Bahar (2022)", "Carson (2023)", "Clancy (2024)", "Kubo (2023)", "Schuett (2023)", "Barberio (2022)", "Chuard (2022)", "Bettle (2023)", "Kremer (2020)", "Bhattacharya (2020)", "Bruers (2021)"], "title": null, "type": "nominal"}, "y": {"field": "metric", "sort": null, "title": null, "type": "nominal"}}, "height": 280, "width": 434, "$schema": "https://vega.github.io/schema/vega-lite/v5.20.1.json", "datasets": {"data-b64186ba8bd194e866931748b3fda927": [{"metric": "global_relevance", "paper": "Acemoglu et al. 2024", "midpoint": 92, "short": "Acemoglu (2024)"}, {"metric": "open_science", "paper": "Acemoglu et al. 2024", "midpoint": 75, "short": "Acemoglu (2024)"}, {"metric": "logic_communication", "paper": "Acemoglu et al. 2024", "midpoint": 90, "short": "Acemoglu (2024)"}, {"metric": "advancing_knowledge", "paper": "Acemoglu et al. 2024", "midpoint": 88, "short": "Acemoglu (2024)"}, {"metric": "methods", "paper": "Acemoglu et al. 2024", "midpoint": 83, "short": "Acemoglu (2024)"}, {"metric": "claims_evidence", "paper": "Acemoglu et al. 2024", "midpoint": 85, "short": "Acemoglu (2024)"}, {"metric": "overall", "paper": "Acemoglu et al. 2024", "midpoint": 86, "short": "Acemoglu (2024)"}, {"metric": "global_relevance", "paper": "Aghion et al. 2017", "midpoint": 70, "short": "Aghion (2017)"}, {"metric": "open_science", "paper": "Aghion et al. 2017", "midpoint": 40, "short": "Aghion (2017)"}, {"metric": "logic_communication", "paper": "Aghion et al. 2017", "midpoint": 85, "short": "Aghion (2017)"}, {"metric": "advancing_knowledge", "paper": "Aghion et al. 2017", "midpoint": 80, "short": "Aghion (2017)"}, {"metric": "methods", "paper": "Aghion et al. 2017", "midpoint": 72, "short": "Aghion (2017)"}, {"metric": "claims_evidence", "paper": "Aghion et al. 2017", "midpoint": 65, "short": "Aghion (2017)"}, {"metric": "overall", "paper": "Aghion et al. 2017", "midpoint": 69, "short": "Aghion (2017)"}, {"metric": "global_relevance", "paper": "Akram et al. 2017", "midpoint": 85, "short": "Akram (2017)"}, {"metric": "open_science", "paper": "Akram et al. 2017", "midpoint": 60, "short": "Akram (2017)"}, {"metric": "logic_communication", "paper": "Akram et al. 2017", "midpoint": 83, "short": "Akram (2017)"}, {"metric": "advancing_knowledge", "paper": "Akram et al. 2017", "midpoint": 80, "short": "Akram (2017)"}, {"metric": "methods", "paper": "Akram et al. 2017", "midpoint": 88, "short": "Akram (2017)"}, {"metric": "claims_evidence", "paper": "Akram et al. 2017", "midpoint": 85, "short": "Akram (2017)"}, {"metric": "overall", "paper": "Akram et al. 2017", "midpoint": 83, "short": "Akram (2017)"}, {"metric": "global_relevance", "paper": "Alatas et al. 2019", "midpoint": 70, "short": "Alatas (2019)"}, {"metric": "open_science", "paper": "Alatas et al. 2019", "midpoint": 50, "short": "Alatas (2019)"}, {"metric": "logic_communication", "paper": "Alatas et al. 2019", "midpoint": 80, "short": "Alatas (2019)"}, {"metric": "advancing_knowledge", "paper": "Alatas et al. 2019", "midpoint": 75, "short": "Alatas (2019)"}, {"metric": "methods", "paper": "Alatas et al. 2019", "midpoint": 85, "short": "Alatas (2019)"}, {"metric": "claims_evidence", "paper": "Alatas et al. 2019", "midpoint": 80, "short": "Alatas (2019)"}, {"metric": "overall", "paper": "Alatas et al. 2019", "midpoint": 75, "short": "Alatas (2019)"}, {"metric": "global_relevance", "paper": "Alcott et al. 2024", "midpoint": 65, "short": "Alcott (2024)"}, {"metric": "open_science", "paper": "Alcott et al. 2024", "midpoint": 75, "short": "Alcott (2024)"}, {"metric": "logic_communication", "paper": "Alcott et al. 2024", "midpoint": 90, "short": "Alcott (2024)"}, {"metric": "advancing_knowledge", "paper": "Alcott et al. 2024", "midpoint": 85, "short": "Alcott (2024)"}, {"metric": "methods", "paper": "Alcott et al. 2024", "midpoint": 77, "short": "Alcott (2024)"}, {"metric": "claims_evidence", "paper": "Alcott et al. 2024", "midpoint": 80, "short": "Alcott (2024)"}, {"metric": "overall", "paper": "Alcott et al. 2024", "midpoint": 80, "short": "Alcott (2024)"}, {"metric": "global_relevance", "paper": "Arora et al. 2023", "midpoint": 60, "short": "Arora (2023)"}, {"metric": "open_science", "paper": "Arora et al. 2023", "midpoint": 70, "short": "Arora (2023)"}, {"metric": "logic_communication", "paper": "Arora et al. 2023", "midpoint": 78, "short": "Arora (2023)"}, {"metric": "advancing_knowledge", "paper": "Arora et al. 2023", "midpoint": 85, "short": "Arora (2023)"}, {"metric": "methods", "paper": "Arora et al. 2023", "midpoint": 80, "short": "Arora (2023)"}, {"metric": "claims_evidence", "paper": "Arora et al. 2023", "midpoint": 75, "short": "Arora (2023)"}, {"metric": "overall", "paper": "Arora et al. 2023", "midpoint": 77, "short": "Arora (2023)"}, {"metric": "global_relevance", "paper": "Bahar et al. 2022", "midpoint": 80, "short": "Bahar (2022)"}, {"metric": "open_science", "paper": "Bahar et al. 2022", "midpoint": 40, "short": "Bahar (2022)"}, {"metric": "logic_communication", "paper": "Bahar et al. 2022", "midpoint": 70, "short": "Bahar (2022)"}, {"metric": "advancing_knowledge", "paper": "Bahar et al. 2022", "midpoint": 75, "short": "Bahar (2022)"}, {"metric": "methods", "paper": "Bahar et al. 2022", "midpoint": 72, "short": "Bahar (2022)"}, {"metric": "claims_evidence", "paper": "Bahar et al. 2022", "midpoint": 70, "short": "Bahar (2022)"}, {"metric": "overall", "paper": "Bahar et al. 2022", "midpoint": 68, "short": "Bahar (2022)"}, {"metric": "global_relevance", "paper": "Banerjee et al. 2023", "midpoint": 95, "short": "Banerjee (2023)"}, {"metric": "open_science", "paper": "Banerjee et al. 2023", "midpoint": 80, "short": "Banerjee (2023)"}, {"metric": "logic_communication", "paper": "Banerjee et al. 2023", "midpoint": 85, "short": "Banerjee (2023)"}, {"metric": "advancing_knowledge", "paper": "Banerjee et al. 2023", "midpoint": 95, "short": "Banerjee (2023)"}, {"metric": "methods", "paper": "Banerjee et al. 2023", "midpoint": 90, "short": "Banerjee (2023)"}, {"metric": "claims_evidence", "paper": "Banerjee et al. 2023", "midpoint": 88, "short": "Banerjee (2023)"}, {"metric": "overall", "paper": "Banerjee et al. 2023", "midpoint": 90, "short": "Banerjee (2023)"}, {"metric": "global_relevance", "paper": "Barberio et al. 2022", "midpoint": 85, "short": "Barberio (2022)"}, {"metric": "open_science", "paper": "Barberio et al. 2022", "midpoint": 40, "short": "Barberio (2022)"}, {"metric": "logic_communication", "paper": "Barberio et al. 2022", "midpoint": 80, "short": "Barberio (2022)"}, {"metric": "advancing_knowledge", "paper": "Barberio et al. 2022", "midpoint": 60, "short": "Barberio (2022)"}, {"metric": "methods", "paper": "Barberio et al. 2022", "midpoint": 70, "short": "Barberio (2022)"}, {"metric": "claims_evidence", "paper": "Barberio et al. 2022", "midpoint": 65, "short": "Barberio (2022)"}, {"metric": "overall", "paper": "Barberio et al. 2022", "midpoint": 67, "short": "Barberio (2022)"}, {"metric": "global_relevance", "paper": "Barker et al. 2021", "midpoint": 85, "short": "Barker (2021)"}, {"metric": "open_science", "paper": "Barker et al. 2021", "midpoint": 55, "short": "Barker (2021)"}, {"metric": "logic_communication", "paper": "Barker et al. 2021", "midpoint": 80, "short": "Barker (2021)"}, {"metric": "advancing_knowledge", "paper": "Barker et al. 2021", "midpoint": 75, "short": "Barker (2021)"}, {"metric": "methods", "paper": "Barker et al. 2021", "midpoint": 70, "short": "Barker (2021)"}, {"metric": "claims_evidence", "paper": "Barker et al. 2021", "midpoint": 65, "short": "Barker (2021)"}, {"metric": "overall", "paper": "Barker et al. 2021", "midpoint": 72, "short": "Barker (2021)"}, {"metric": "global_relevance", "paper": "Bettle 2023", "midpoint": 75, "short": "Bettle (2023)"}, {"metric": "open_science", "paper": "Bettle 2023", "midpoint": 40, "short": "Bettle (2023)"}, {"metric": "logic_communication", "paper": "Bettle 2023", "midpoint": 70, "short": "Bettle (2023)"}, {"metric": "advancing_knowledge", "paper": "Bettle 2023", "midpoint": 65, "short": "Bettle (2023)"}, {"metric": "methods", "paper": "Bettle 2023", "midpoint": 60, "short": "Bettle (2023)"}, {"metric": "claims_evidence", "paper": "Bettle 2023", "midpoint": 55, "short": "Bettle (2023)"}, {"metric": "overall", "paper": "Bettle 2023", "midpoint": 61, "short": "Bettle (2023)"}, {"metric": "global_relevance", "paper": "Bhat et al. 2022", "midpoint": 85, "short": "Bhat (2022)"}, {"metric": "open_science", "paper": "Bhat et al. 2022", "midpoint": 60, "short": "Bhat (2022)"}, {"metric": "logic_communication", "paper": "Bhat et al. 2022", "midpoint": 80, "short": "Bhat (2022)"}, {"metric": "advancing_knowledge", "paper": "Bhat et al. 2022", "midpoint": 85, "short": "Bhat (2022)"}, {"metric": "methods", "paper": "Bhat et al. 2022", "midpoint": 80, "short": "Bhat (2022)"}, {"metric": "claims_evidence", "paper": "Bhat et al. 2022", "midpoint": 75, "short": "Bhat (2022)"}, {"metric": "overall", "paper": "Bhat et al. 2022", "midpoint": 78, "short": "Bhat (2022)"}, {"metric": "global_relevance", "paper": "Bhattacharya and Packalen 2020", "midpoint": 55, "short": "Bhattacharya (2020)"}, {"metric": "open_science", "paper": "Bhattacharya and Packalen 2020", "midpoint": 40, "short": "Bhattacharya (2020)"}, {"metric": "logic_communication", "paper": "Bhattacharya and Packalen 2020", "midpoint": 70, "short": "Bhattacharya (2020)"}, {"metric": "advancing_knowledge", "paper": "Bhattacharya and Packalen 2020", "midpoint": 60, "short": "Bhattacharya (2020)"}, {"metric": "methods", "paper": "Bhattacharya and Packalen 2020", "midpoint": 45, "short": "Bhattacharya (2020)"}, {"metric": "claims_evidence", "paper": "Bhattacharya and Packalen 2020", "midpoint": 55, "short": "Bhattacharya (2020)"}, {"metric": "overall", "paper": "Bhattacharya and Packalen 2020", "midpoint": 54, "short": "Bhattacharya (2020)"}, {"metric": "global_relevance", "paper": "Bruers 2021", "midpoint": 55, "short": "Bruers (2021)"}, {"metric": "open_science", "paper": "Bruers 2021", "midpoint": 25, "short": "Bruers (2021)"}, {"metric": "logic_communication", "paper": "Bruers 2021", "midpoint": 60, "short": "Bruers (2021)"}, {"metric": "advancing_knowledge", "paper": "Bruers 2021", "midpoint": 50, "short": "Bruers (2021)"}, {"metric": "methods", "paper": "Bruers 2021", "midpoint": 35, "short": "Bruers (2021)"}, {"metric": "claims_evidence", "paper": "Bruers 2021", "midpoint": 40, "short": "Bruers (2021)"}, {"metric": "overall", "paper": "Bruers 2021", "midpoint": 45, "short": "Bruers (2021)"}, {"metric": "global_relevance", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "open_science", "paper": "Buntaine et al. 2023", "midpoint": 70, "short": "Buntaine (2023)"}, {"metric": "logic_communication", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "advancing_knowledge", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "methods", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "claims_evidence", "paper": "Buntaine et al. 2023", "midpoint": 90, "short": "Buntaine (2023)"}, {"metric": "overall", "paper": "Buntaine et al. 2023", "midpoint": 85, "short": "Buntaine (2023)"}, {"metric": "global_relevance", "paper": "Carson et al. 2023", "midpoint": 60, "short": "Carson (2023)"}, {"metric": "open_science", "paper": "Carson et al. 2023", "midpoint": 55, "short": "Carson (2023)"}, {"metric": "logic_communication", "paper": "Carson et al. 2023", "midpoint": 80, "short": "Carson (2023)"}, {"metric": "advancing_knowledge", "paper": "Carson et al. 2023", "midpoint": 65, "short": "Carson (2023)"}, {"metric": "methods", "paper": "Carson et al. 2023", "midpoint": 70, "short": "Carson (2023)"}, {"metric": "claims_evidence", "paper": "Carson et al. 2023", "midpoint": 75, "short": "Carson (2023)"}, {"metric": "overall", "paper": "Carson et al. 2023", "midpoint": 68, "short": "Carson (2023)"}, {"metric": "global_relevance", "paper": "Chuard et al. 2022", "midpoint": 65, "short": "Chuard (2022)"}, {"metric": "open_science", "paper": "Chuard et al. 2022", "midpoint": 50, "short": "Chuard (2022)"}, {"metric": "logic_communication", "paper": "Chuard et al. 2022", "midpoint": 75, "short": "Chuard (2022)"}, {"metric": "advancing_knowledge", "paper": "Chuard et al. 2022", "midpoint": 60, "short": "Chuard (2022)"}, {"metric": "methods", "paper": "Chuard et al. 2022", "midpoint": 65, "short": "Chuard (2022)"}, {"metric": "claims_evidence", "paper": "Chuard et al. 2022", "midpoint": 70, "short": "Chuard (2022)"}, {"metric": "overall", "paper": "Chuard et al. 2022", "midpoint": 64, "short": "Chuard (2022)"}, {"metric": "global_relevance", "paper": "Clancy 2024", "midpoint": 80, "short": "Clancy (2024)"}, {"metric": "open_science", "paper": "Clancy 2024", "midpoint": 55, "short": "Clancy (2024)"}, {"metric": "logic_communication", "paper": "Clancy 2024", "midpoint": 70, "short": "Clancy (2024)"}, {"metric": "advancing_knowledge", "paper": "Clancy 2024", "midpoint": 75, "short": "Clancy (2024)"}, {"metric": "methods", "paper": "Clancy 2024", "midpoint": 60, "short": "Clancy (2024)"}, {"metric": "claims_evidence", "paper": "Clancy 2024", "midpoint": 65, "short": "Clancy (2024)"}, {"metric": "overall", "paper": "Clancy 2024", "midpoint": 68, "short": "Clancy (2024)"}, {"metric": "global_relevance", "paper": "Crawfurd et al. 2023", "midpoint": 90, "short": "Crawfurd (2023)"}, {"metric": "open_science", "paper": "Crawfurd et al. 2023", "midpoint": 70, "short": "Crawfurd (2023)"}, {"metric": "logic_communication", "paper": "Crawfurd et al. 2023", "midpoint": 85, "short": "Crawfurd (2023)"}, {"metric": "advancing_knowledge", "paper": "Crawfurd et al. 2023", "midpoint": 80, "short": "Crawfurd (2023)"}, {"metric": "methods", "paper": "Crawfurd et al. 2023", "midpoint": 75, "short": "Crawfurd (2023)"}, {"metric": "claims_evidence", "paper": "Crawfurd et al. 2023", "midpoint": 70, "short": "Crawfurd (2023)"}, {"metric": "overall", "paper": "Crawfurd et al. 2023", "midpoint": 78, "short": "Crawfurd (2023)"}, {"metric": "global_relevance", "paper": "Epperson and Gerster 2024", "midpoint": 60, "short": "Epperson (2024)"}, {"metric": "open_science", "paper": "Epperson and Gerster 2024", "midpoint": 85, "short": "Epperson (2024)"}, {"metric": "logic_communication", "paper": "Epperson and Gerster 2024", "midpoint": 80, "short": "Epperson (2024)"}, {"metric": "advancing_knowledge", "paper": "Epperson and Gerster 2024", "midpoint": 70, "short": "Epperson (2024)"}, {"metric": "methods", "paper": "Epperson and Gerster 2024", "midpoint": 78, "short": "Epperson (2024)"}, {"metric": "claims_evidence", "paper": "Epperson and Gerster 2024", "midpoint": 80, "short": "Epperson (2024)"}, {"metric": "overall", "paper": "Epperson and Gerster 2024", "midpoint": 75, "short": "Epperson (2024)"}, {"metric": "global_relevance", "paper": "Fangwa et al. 2023", "midpoint": 85, "short": "Fangwa (2023)"}, {"metric": "open_science", "paper": "Fangwa et al. 2023", "midpoint": 45, "short": "Fangwa (2023)"}, {"metric": "logic_communication", "paper": "Fangwa et al. 2023", "midpoint": 80, "short": "Fangwa (2023)"}, {"metric": "advancing_knowledge", "paper": "Fangwa et al. 2023", "midpoint": 75, "short": "Fangwa (2023)"}, {"metric": "methods", "paper": "Fangwa et al. 2023", "midpoint": 80, "short": "Fangwa (2023)"}, {"metric": "claims_evidence", "paper": "Fangwa et al. 2023", "midpoint": 78, "short": "Fangwa (2023)"}, {"metric": "overall", "paper": "Fangwa et al. 2023", "midpoint": 75, "short": "Fangwa (2023)"}, {"metric": "global_relevance", "paper": "Haushofer et al. 2020", "midpoint": 75, "short": "Haushofer (2020)"}, {"metric": "open_science", "paper": "Haushofer et al. 2020", "midpoint": 65, "short": "Haushofer (2020)"}, {"metric": "logic_communication", "paper": "Haushofer et al. 2020", "midpoint": 80, "short": "Haushofer (2020)"}, {"metric": "advancing_knowledge", "paper": "Haushofer et al. 2020", "midpoint": 70, "short": "Haushofer (2020)"}, {"metric": "methods", "paper": "Haushofer et al. 2020", "midpoint": 85, "short": "Haushofer (2020)"}, {"metric": "claims_evidence", "paper": "Haushofer et al. 2020", "midpoint": 80, "short": "Haushofer (2020)"}, {"metric": "overall", "paper": "Haushofer et al. 2020", "midpoint": 76, "short": "Haushofer (2020)"}, {"metric": "global_relevance", "paper": "Hill et al. 2024", "midpoint": 65, "short": "Hill (2024)"}, {"metric": "open_science", "paper": "Hill et al. 2024", "midpoint": 60, "short": "Hill (2024)"}, {"metric": "logic_communication", "paper": "Hill et al. 2024", "midpoint": 85, "short": "Hill (2024)"}, {"metric": "advancing_knowledge", "paper": "Hill et al. 2024", "midpoint": 70, "short": "Hill (2024)"}, {"metric": "methods", "paper": "Hill et al. 2024", "midpoint": 78, "short": "Hill (2024)"}, {"metric": "claims_evidence", "paper": "Hill et al. 2024", "midpoint": 75, "short": "Hill (2024)"}, {"metric": "overall", "paper": "Hill et al. 2024", "midpoint": 73, "short": "Hill (2024)"}, {"metric": "global_relevance", "paper": "Jack et al. 2022", "midpoint": 88, "short": "Jack (2022)"}, {"metric": "open_science", "paper": "Jack et al. 2022", "midpoint": 60, "short": "Jack (2022)"}, {"metric": "logic_communication", "paper": "Jack et al. 2022", "midpoint": 85, "short": "Jack (2022)"}, {"metric": "advancing_knowledge", "paper": "Jack et al. 2022", "midpoint": 78, "short": "Jack (2022)"}, {"metric": "methods", "paper": "Jack et al. 2022", "midpoint": 80, "short": "Jack (2022)"}, {"metric": "claims_evidence", "paper": "Jack et al. 2022", "midpoint": 85, "short": "Jack (2022)"}, {"metric": "overall", "paper": "Jack et al. 2022", "midpoint": 79, "short": "Jack (2022)"}, {"metric": "global_relevance", "paper": "Kremer et al. 2020", "midpoint": 85, "short": "Kremer (2020)"}, {"metric": "open_science", "paper": "Kremer et al. 2020", "midpoint": 40, "short": "Kremer (2020)"}, {"metric": "logic_communication", "paper": "Kremer et al. 2020", "midpoint": 80, "short": "Kremer (2020)"}, {"metric": "advancing_knowledge", "paper": "Kremer et al. 2020", "midpoint": 55, "short": "Kremer (2020)"}, {"metric": "methods", "paper": "Kremer et al. 2020", "midpoint": 45, "short": "Kremer (2020)"}, {"metric": "claims_evidence", "paper": "Kremer et al. 2020", "midpoint": 60, "short": "Kremer (2020)"}, {"metric": "overall", "paper": "Kremer et al. 2020", "midpoint": 61, "short": "Kremer (2020)"}, {"metric": "global_relevance", "paper": "Kremer et al. 2022", "midpoint": 95, "short": "Kremer (2022)"}, {"metric": "open_science", "paper": "Kremer et al. 2022", "midpoint": 70, "short": "Kremer (2022)"}, {"metric": "logic_communication", "paper": "Kremer et al. 2022", "midpoint": 85, "short": "Kremer (2022)"}, {"metric": "advancing_knowledge", "paper": "Kremer et al. 2022", "midpoint": 90, "short": "Kremer (2022)"}, {"metric": "methods", "paper": "Kremer et al. 2022", "midpoint": 85, "short": "Kremer (2022)"}, {"metric": "claims_evidence", "paper": "Kremer et al. 2022", "midpoint": 80, "short": "Kremer (2022)"}, {"metric": "overall", "paper": "Kremer et al. 2022", "midpoint": 84, "short": "Kremer (2022)"}, {"metric": "global_relevance", "paper": "Kubo et al. 2023", "midpoint": 60, "short": "Kubo (2023)"}, {"metric": "open_science", "paper": "Kubo et al. 2023", "midpoint": 55, "short": "Kubo (2023)"}, {"metric": "logic_communication", "paper": "Kubo et al. 2023", "midpoint": 78, "short": "Kubo (2023)"}, {"metric": "advancing_knowledge", "paper": "Kubo et al. 2023", "midpoint": 65, "short": "Kubo (2023)"}, {"metric": "methods", "paper": "Kubo et al. 2023", "midpoint": 75, "short": "Kubo (2023)"}, {"metric": "claims_evidence", "paper": "Kubo et al. 2023", "midpoint": 70, "short": "Kubo (2023)"}, {"metric": "overall", "paper": "Kubo et al. 2023", "midpoint": 67, "short": "Kubo (2023)"}, {"metric": "global_relevance", "paper": "Liang et al. 2021", "midpoint": 70, "short": "Liang (2021)"}, {"metric": "open_science", "paper": "Liang et al. 2021", "midpoint": 70, "short": "Liang (2021)"}, {"metric": "logic_communication", "paper": "Liang et al. 2021", "midpoint": 75, "short": "Liang (2021)"}, {"metric": "advancing_knowledge", "paper": "Liang et al. 2021", "midpoint": 80, "short": "Liang (2021)"}, {"metric": "methods", "paper": "Liang et al. 2021", "midpoint": 85, "short": "Liang (2021)"}, {"metric": "claims_evidence", "paper": "Liang et al. 2021", "midpoint": 85, "short": "Liang (2021)"}, {"metric": "overall", "paper": "Liang et al. 2021", "midpoint": 78, "short": "Liang (2021)"}, {"metric": "global_relevance", "paper": "Schuett et al. 2023", "midpoint": 70, "short": "Schuett (2023)"}, {"metric": "open_science", "paper": "Schuett et al. 2023", "midpoint": 80, "short": "Schuett (2023)"}, {"metric": "logic_communication", "paper": "Schuett et al. 2023", "midpoint": 70, "short": "Schuett (2023)"}, {"metric": "advancing_knowledge", "paper": "Schuett et al. 2023", "midpoint": 60, "short": "Schuett (2023)"}, {"metric": "methods", "paper": "Schuett et al. 2023", "midpoint": 55, "short": "Schuett (2023)"}, {"metric": "claims_evidence", "paper": "Schuett et al. 2023", "midpoint": 65, "short": "Schuett (2023)"}, {"metric": "overall", "paper": "Schuett et al. 2023", "midpoint": 67, "short": "Schuett (2023)"}, {"metric": "global_relevance", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 90, "short": "Trammel (2025)"}, {"metric": "open_science", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 40, "short": "Trammel (2025)"}, {"metric": "logic_communication", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 80, "short": "Trammel (2025)"}, {"metric": "advancing_knowledge", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 85, "short": "Trammel (2025)"}, {"metric": "methods", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 75, "short": "Trammel (2025)"}, {"metric": "claims_evidence", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 70, "short": "Trammel (2025)"}, {"metric": "overall", "paper": "Trammel and Aschenbrenner 2025", "midpoint": 75, "short": "Trammel (2025)"}, {"metric": "global_relevance", "paper": "Walker et al. 2023", "midpoint": 90, "short": "Walker (2023)"}, {"metric": "open_science", "paper": "Walker et al. 2023", "midpoint": 60, "short": "Walker (2023)"}, {"metric": "logic_communication", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}, {"metric": "advancing_knowledge", "paper": "Walker et al. 2023", "midpoint": 85, "short": "Walker (2023)"}, {"metric": "methods", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}, {"metric": "claims_evidence", "paper": "Walker et al. 2023", "midpoint": 85, "short": "Walker (2023)"}, {"metric": "overall", "paper": "Walker et al. 2023", "midpoint": 80, "short": "Walker (2023)"}]}}, {"mode": "vega-lite"});
  
  This is really nice in principle, but it needs some tweaking. something about the colors and spacing make it hard to different here. Love the 'hover titles' though!!
Visit annotations in context

Annotators

daaronr

URL

llm-uj-research-eval.netlify.app/numerical_ratings
Jun 2025
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Illustrating LangChain pipeline

2
1. daaronr 23 Jun 2025
  
  in Public
  
  in-memory Chroma vector DB
  
  Do I need to now what this means?
2. daaronr 23 Jun 2025
  
  in Public
  
  (one per page).
  
  small point ... one per page seems a bit random
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/langchain_test.html
ai.nejm.org ai.nejm.org

Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis

1
1. daaronr 07 Jun 2025
  
  in Public
  
  International Conference on Learning Representations (ICLR) machine learning conference
  
  See also this paper
Visit annotations in context

Annotators

daaronr

URL

ai.nejm.org/doi/full/10.1056/AIoa2400196
May 2025
daaronr.github.io daaronr.github.io

5 Robustness and diagnostics, with integrity; Open Science resources | Statistics, econometrics, experiment and survey methods, data science: Notes

1
1. daaronr 23 May 2025
  
  in Public
  
  5.1 (How) can diagnostic tests make sense? Where is the burden of proof?
  
  Some references from ChatGPT o3 making similar points
  
  https://chatgpt.com/share/6830a265-3f14-8007-b8f8-daa0936d8f5f
Visit annotations in context

Annotators

daaronr

URL

daaronr.github.io/metrics_discussion/robust-diag.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - Illustrating LangChain pipeline

3
1. daaronr 20 May 2025
  
  in Public
  
  o3,
  
  you mean 'aiming for o3, testing with 4o mini for now'?
2. daaronr 20 May 2025
  
  in Public
  
  including probability judgement and robustness suggestions.
  
  --> 'making a probabilistic judgement about the accuracy of this claim, and making suggestions for robustness checks'
3. daaronr 20 May 2025
  
  in Public
  
  factual
  
  being super fussy, but 'factual as opposed to what'?
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/langchain_test.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research - 2 Testing OpenAI API Access in Python

12
1. daaronr 11 May 2025
  
  in Public
  
  → scoring ssrn-4071953.pdf …
  
  what was the run time? -- put in a timer? Something like 5 minutes for 3 papers
  
  NB some papers too long for o3 -- can be chunked and summarized by another model, langchain helps, or we use a larger context model like Gemini
2. daaronr 11 May 2025
  
  in Public
  
  # Install the SDK if it isn’t present
  
  rem to put freeze option here
3. daaronr 11 May 2025
  
  in Public
  
  key_path = pathlib.Path("key/openai_key.txt")
  
  Don't commit this (gitignore), everyone has there own key
4. daaronr 11 May 2025
  
  in Public
  
  Project files
  
  this links your github page? I don't see the particular repo
5. daaronr 11 May 2025
  
  in Public
  
  Table 2.1: Overall assessments (percentiles)
  
  Let's try one out on a paper that our evaluators gave low scores ... or one we think would likely to receive low scores in general, for comparison
6. daaronr 11 May 2025
  
  in Public
  
  # 1. Point to the papers you want to score
  
  Looks like you put the pdfs in the local directory and the git repo? Let's think about ways to mor eautomatically 'feed these in' in bulkl
7. daaronr 11 May 2025
  
  in Public
  
  Following (Garg and Fetzer 2025)
  
  Nice, Garg and Fetzer seems very relevant. I see a lot of fruitful potential overlap.
8. daaronr 10 May 2025
  
  in Public
  
  ::: {#cell-Averaged ratings .cell tbl-cap=‘Mean ± SD of ten model draws’ execution_count=8}
  
  code formatting (minor_
9. daaronr 10 May 2025
  
  in Public
  
  paper_id
  
  these are consistent but maybe not informative at all ... same for all papers??
10. daaronr 10 May 2025
  
  in Public
  
  2.1.0.4 Monte‑Carlo: 10 independent ratings per paper
  
  it looks like 9 runs?
11. daaronr 10 May 2025
  
  in Public
  
  rationale
  
  make this column wider/expandable
12. daaronr 10 May 2025
  
  in Public
  
  We call GPT‑4o on each PDF, asking for a strict JSON percentile rating and a short rationale, then collect the results.
  
  I suspect 4o is not particularly insightful? But we can compare it to the (~manual) results we previously got from o3.
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/main.html
valentink.quarto.pub valentink.quarto.pub

Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research

1
1. daaronr 11 May 2025
  
  in Public
  
  Explore LangChain, OpenAI Evals
  
  Is langchain compatible with this quarto approach? ... probably yet. But we can first do some things more manually/simply.
Visit annotations in context

Annotators

daaronr

URL

valentink.quarto.pub/llm_research_evaluation/
lakens.github.io lakens.github.io

Improving Your Statistical Inferences - 4 Bayesian statistics

1
1. daaronr 05 May 2025
  
  in Public
  
  r in the following formula: P(H1|D)P(H0|D)= P(D|H1)P(D|H0) × P(H1)P(H0) Posterior Probability= Bayes Factor × Prior Probability A Bayesian analysis of data requires specifying the prior. Here, we will continu
  
  formula should refer to 'odds' not 'probability'
Visit annotations in context

Annotators

daaronr

URL

lakens.github.io/statistical_inferences/04-bayes.html
Apr 2025
elicit.com elicit.com

Comparative Analysis of DALYs and WELLBYs in Global Welfare Assessment | Elicit

15
1. daaronr 22 Apr 2025
  
  in Public
  
  Be open to newer measures like WELLBYs where they offer clear advantages.
  
  The authors are advocating their own measure here, no?
2. daaronr 22 Apr 2025
  
  in Public
  
  Consider how choice of measure might affect assessment of policies targeting different population groups or levels of well-being.
  
  This seems a bit more actionable, but the qquickly linked text doesn't seem to provide insight
3. daaronr 22 Apr 2025
  
  in Public
  
  These might capture changes not reflected in traditional health-related quality of life measures.
  
  Where does WELLBY fall on this spectrum?
4. daaronr 22 Apr 2025
  
  in Public
  
  Be Cautious with Conversions
  
  This sort of advice doesn't seem to provide a clear path forward ... just 'be careful and uncertain'
5. daaronr 22 Apr 2025
  
  in Public
  
  Measure Selection: Choice between DALYs, QALYs, WELLBYs, or other measures can significantly impact estimated benefits of interventions. Airoldi (2007) demonstrates that DALYs consistently yield smaller values than QALYs for the same intervention, potentially affecting cost-effectiveness rankings and resource allocation decisions.
  
  This seems to argue 'this stuff does actually matter'. Not sure whether this shows it still matters after the SD ~normalization though
6. daaronr 22 Apr 2025
  
  in Public
  
  Mental Health Measurement Integration
  
  It's not clear what elicit is getting at here. This may be a miss.
7. daaronr 22 Apr 2025
  
  in Public
  
  Comparison of mental well-being and health-related quality of life measures: Johnson et al. (2016) compared the EQ-5D-3L with the Warwick-Edinburgh Mental Well-being Scale (WEMWBS). They found that WEMWBS better distinguished between different health levels due to the absence of a ceiling effect. This implies that standard deviation changes in mental well-being measures might capture more variation at higher levels of health or well-being compared to traditional health-related quality of life measures.
  
  Not sure I understand this, but the Ceiling effect issue seems important at a glance. If one measure is hitting ceilings, that can mess up 'like for like SD comparisons'.
8. daaronr 22 Apr 2025
  
  in Public
  
  Sensitivity of subjective well-being measures: Christopher et al. (2014) found that subjective well-being measures were less sensitive to health differences and changes compared to SF-6D values. This suggests that standard deviation changes in subjective well-being measures (which are related to WELLBYs) might not correspond directly to changes in health-related quality of life measures (which are more closely related to DALYs).
  
  Seems 'important if true'
9. daaronr 22 Apr 2025
  
  in Public
  
  We didn’t find direct comparisons of standard deviation changes in WELLBYs versus DALYs in the included studies.
  
  Is that a bad omen for this modeling choice?
10. daaronr 22 Apr 2025
  
  in Public
  
  ound that subjective well-being measures were less sensitive or responsive to health differences and changes compared to SF-6D values
  
  What 'health differences'? Should this be seen as a gold standard comparison?
11. daaronr 22 Apr 2025
  
  in Public
  
  They found that WEMWBS better distinguishes health levels due to the absence of a ceiling effect, which is present in the EQ-5D-3L.
  
  So why are others using 'WELLBYs" instead? Easier to collect?
12. daaronr 22 Apr 2025
  
  in Public
  
  Johnson et al. (2016) compared the Warwick-Edinburgh Mental Well-being Scale (WEMWBS) with the EQ-5D-3L.
  
  Factor analysis. EFA and CFA validation here. (Something I've never fully been able to grok; it seems too many degrees of freedom if the elements can load onto any factor and factors can be correlated ... but this seems to be established practice.)
13. daaronr 22 Apr 2025
  
  in Public
  
  Johnson et al. (2016) compared the Warwick-Edinburgh Mental Well-being Scale (WEMWBS) with the EQ-5D-3L.
  
  Lots of factor analysis to validate this measure. I've not fully been able to understand the logic behind EFA and CFA myself, even after some trying. But it seems to be well-accepted.
14. daaronr 22 Apr 2025
  
  in Public
  
  Characteristics of Included Studies
  
  It can only give us 7, it's throttled. See other tabs -- the 'dots' at top for a wider list
15. daaronr 22 Apr 2025
  
  in Public
  
  First, proposed conversion factors between different well-being measures (such as the 0.3 QALY/DALY ratio) showed significant variation across contexts and populations. Second, while WELLBYs were proposed as an alternative to traditional health-related quality of life measures, the studies indicated limitations in their current validation. Third, the reviewed empirical work found that subjective well-being measures demonstrated lower sensitivity to health status changes compared to established measures like SF-6D, particularly in health intervention contexts.
  
  This strikes me as as particularly worth digging into!
Visit annotations in context

Annotators

daaronr

URL

elicit.com/review/8a17604a-1a08-45c7-b9ab-ac7f3b232415
forum.effectivealtruism.org forum.effectivealtruism.org

Forecasts estimate limited cultured meat production through 2050

1
1. daaronr 16 Apr 2025
  
  in Public
  
  Here we present an initial set of forecasts from a panel of paid forecasters (including Linch and Neil). We plan to expand forecasting on similar questions in a Metaculus tournament[1] so that we can see how forecasts are affected by news of supposedly important breakthroughs.
  
  A small team. This aspect might be seen as informal
Visit annotations in context

Annotators

daaronr

URL

forum.effectivealtruism.org/posts/2b9HCjTiFnWM8jkRM/forecasts-estimate-limited-cultured-meat-production-through
Mar 2025
www.openphilanthropy.org www.openphilanthropy.org

The Center for Election Science — General Support (2019) | Open Philanthropy

1
1. daaronr 13 Mar 2025
  
  in Public
  
  These systems provide no structural advantages or disadvantages to either the Democratic or Republican parties or to any single politician.
  
  This seems like it's probably an overstatement; may be shorthand here
Visit annotations in context

Annotators

daaronr

URL

openphilanthropy.org/grants/the-center-for-election-science-general-support-2019/
Nov 2024
results2021.ref.ac.uk results2021.ref.ac.uk

Impact case study : Results and submissions : REF 2021

1
1. daaronr 21 Nov 2024
  
  in Public
  
  Dr McCulloch has provided expert input, based on his research, to provide a solid evidence base to the Better Deal for Animals campaign coordinated by over 40 leading NGOs. McCulloch’s proposed animal welfare impact assessment has been supported by HMG Official Opposition Labour Party in Parliament and the UK Government has stated in official correspondence that it is considering how to implement post-Brexit sentience policy in response to coordinated lobbying strongly underpinned by McCulloch’s research.
  
  REF impact ... frameworks for The Unjournal?
Visit annotations in context

Annotators

daaronr

URL

results2021.ref.ac.uk/impact/10afeb5d-ac48-4c17-b7fb-7c032112b7a0
Oct 2024
www.cos.io www.cos.io

Lifecycle Journal | Center for Open Science

2
1. daaronr 23 Oct 2024
  
  in Public
  
  empirical evaluations
  
  What's 'empirical evaluations' here?
2. daaronr 23 Oct 2024
  
  in Public
  
  Publish-then-Review; Transparent and Open Peer Review
  
  The Unjournal supports this. I think we have done this in more depth and rigor than other initiatives in economics and social science
Visit annotations in context

Annotators

daaronr

URL

cos.io/lifecyclejournal
Aug 2024
unjournal.github.io unjournal.github.io

A first look at Unjournal’s data – The Unjournal Data Blog

16
1. daaronr 28 Aug 2024
  
  in Public
  
  Table 3 shows loadings for the first three components.
  
  We should discuss and analyze this further. What's a reasonable interpretation? How much of the variation do each of these explain?
2. daaronr 28 Aug 2024
  
  in Public
  
  The table above already looks a bit worrying: typically no more than half of our evaluators’ midpoints fall within their co-evaluator’s 90% credible interval
  
  I'd like to dig into this to find some specific instances. Are these all the 'real CIs' and not the ones we imputed from the earlier likert CI data?
3. daaronr 28 Aug 2024
  
  in Public
  
  Krippendorff's Alpha
  
  Some comparison numbers would also be helpful. I asked chatgpt
  
  give some comparison alpha's from other contexts to help understand this measure
  
  Response:
  
  Krippendorff's alpha is used across a variety of disciplines to measure inter-rater reliability. Below are some examples from different contexts to provide a point of reference for interpreting the values in your data:
  
  Content Analysis (Media Studies):
  
  A typical Krippendorff's alpha for coding reliability in content analysis (e.g., categorizing themes in media) might range from 0.6 to 0.8. Values below 0.67 are often considered insufficient for drawing strong conclusions, though thresholds can vary depending on the field. For example, a study evaluating news articles might report an alpha around 0.7.
  
  Healthcare (Medical Diagnostics):
  
  In medical diagnostics, particularly in assessments involving subjective judgment (e.g., radiology, pathology), alpha values are expected to be relatively high—often above 0.75. An alpha of 0.8 or higher is often considered acceptable for diagnostic consistency. For instance, a study on agreement among radiologists in interpreting mammograms might report an alpha of around 0.81.
  
  Education (Grading & Evaluation):
  
  In educational assessment, inter-rater reliability for grading essays or open-ended responses typically yields alphas in the range of 0.6 to 0.9, depending on the specificity of the rubric and training provided to raters. A study on grading consistency among multiple teachers might report an alpha of 0.65 to 0.75.
  
  Psychology (Behavioral Coding):
  
  In psychology, for tasks like coding behavior or assigning labels based on psychological interviews, Krippendorff’s alpha values of 0.4 to 0.6 are relatively common, but higher values (above 0.7) are preferred for more reliable and replicable findings. For example, coding nonverbal behaviors in therapy sessions might produce an alpha of 0.5.
  
  Social Science (Survey Coding):
  
  When coding open-ended survey responses, social science researchers often see alpha values ranging from 0.6 to 0.7, reflecting moderate agreement. An alpha of 0.65 might be reported in a study on the reliability of coding responses to survey questions about political preferences.
  
  Linguistics (Speech Annotation):
  
  In linguistics, particularly in tasks such as speech transcription or annotation, inter-rater reliability can vary widely depending on the complexity of the task. Alphas around 0.7 are often considered acceptable. For example, a study transcribing dialectical variations in speech might report an alpha of 0.72.
  
  General Guidelines for Interpreting Krippendorff's Alpha:
  
  >0.80: Strong agreement, generally considered acceptable for high-stakes decision-making.
  
  0.67 - 0.80: Moderate to strong agreement, often considered sufficient for exploratory studies or content analysis.
  
  0.50 - 0.67: Moderate agreement, acceptable in some fields, but may require more consistency for high-stakes or formal evaluations.
  
  <0.50: Low agreement, indicates significant variability in ratings; further training or clarification of criteria may be needed.
  
  Comparison to Your Data:
  
  Your highest alpha values (e.g., 0.804 for "gp_relevance" and 0.778 for "merits_journal") indicate strong agreement, similar to the reliability found in high-stakes fields like medical diagnostics or well-defined content analysis tasks.
  
  The lower values (e.g., 0.271 for "adv_knowledge" and 0.134 for "methods") suggest substantial disagreement among evaluators. These values are below what would be considered acceptable in most research contexts, even for exploratory analysis, and might require addressing the clarity or consistency of evaluation criteria.
  
  These comparisons provide a framework for assessing where your data stands in relation to typical expectations in various research areas.
4. daaronr 28 Aug 2024
  
  in Public
  
  pe=1/6. Krippendorff’s alpha is a widely accepted statistic that corrects for pe and also defines “agreement” appropriately for different levels of measurement.
  
  I think you are missing the basic characterization here ...
  
  1 indicates perfect agreement, 0 indicates no agreement beyond chance, and negative values indicate systematic disagreement.
5. daaronr 28 Aug 2024
  
  in Public
  
  Krippendorf's alpha statistics for our quantitative measures. N = 21 papers, 39 evaluations.
  
  Do we have an interpretation of this? Are these high, low, reasonable?
  
  (By the way if you wanted to integrate this into a permanent dashboard you might not want to add a narrative about the actual values, but you could still add a general discussion of 'what is considered a high alpha')
6. daaronr 28 Aug 2024
  
  in Public
  
  There is a single paper with three evaluations; adding this in would give us many missing values in the “third evaluation” column, and we’d have to use more advanced techniques to deal with these.
  
  We should find some way to integrate this in,. There's so little data it's a shame to drop these.
7. daaronr 28 Aug 2024
  
  in Public
  
  we have many evaluators, each contributing only one or two evaluations.
  
  Very rarely 2. I'm guessing it's only about 10% repeat evaluators atm.
8. daaronr 28 Aug 2024
  
  in Public
  
  So we need to adjust for the expected amount of agreement. To do this most measures use the marginal distributions of the ratings: in our example, a 1 in 6 chance of each number from 0 to 5, giving
  
  This seems like a wild oversimplification. This would be if they both gave uniformly distributed random ratings.
  
  A more reasonable baseline for 'meaningless ratings' would be something like "they both draw from give the underlying distribution of all papers, without using any information about the paper being rated itself." Perhaps some of the other statistics you mention get at this?
9. daaronr 28 Aug 2024
  
  in Public
  
  Choosing a reliability statistic
  
  Is there any way to make these boxes foldable in the blog post format?
10. daaronr 28 Aug 2024
  
  in Public
  
  Out of 342 confidence intervals, 0 were degenerate, 0 were uninformative and 7 were misspecified.
  
  We should look at this more closely.
  
  Which were misspecified? The newer interface actually doesn't permit this.
  
  As I noted above, some people simply refused to give CIs at all ... which is essentially giving a 'degenerate' interval
  
  For journal ranking tiers I'd still like to see the results in some way. If evaluators understood how we intended this measure to work, they should basically never give degenerate intervals.
11. daaronr 28 Aug 2024
  
  in Public
  
  We also check if people straightline lower bounds of the credible intervals (0 straightliners) and upper bounds (0 straightliners).
  
  However, IIRC some people didn't give bounds on the CIs at all, or they gave meaningless bounds; maybe we dropped those bounds?
12. daaronr 28 Aug 2024
  
  in Public
  
  Here are some things we might hope to learn from our data.
  
  You left out one thing I had suggested: Do evaluators with different attributes (country, field, anonymity choice) rate the papers differently? Again, this need not imply biases, but it might be suggestive.
13. daaronr 28 Aug 2024
  
  in Public
  
  Do evaluators understand the questions in the same way? Are different evaluators of the same paper answering the “same questions” in their head? What about evaluators of different papers in different fields?
  
  Should we reference 'construal validity' here?
14. daaronr 28 Aug 2024
  
  in Public
  
  About the data
  
  Can we give some links here to data and code? Even a setting to 'turn on the code' or some such? I like to do this for transparency and other reasons we've discussed.
  
  OK we do link the github repo -- maybe add a note about that, and about how to find the code that produces this blog for people who want to see/check?
15. daaronr 16 Aug 2024
  
  in Public
  
  Warning: Duplicate rows found in ratings data.
  
  Do we want this to show here? Maybe some note to the reader is in order?
16. daaronr 16 Aug 2024
  
  in Public
  
  Overall assessment: “Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to knowledge production, and importance to practice.” Advancing our knowledge and practice: “To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?…” Methods: Justification, reasonableness, validity, robustness: “Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are the results and methods likely to be robust to reasonable changes in the underlying assumptions?…
  
  Formatting here: text is a bit bunched up, a bit harder to read
Visit annotations in context

Annotators

daaronr

URL

unjournal.github.io/unjournaldata/posts/uj-data-first-look/
unjournal.github.io unjournal.github.io

The Unjournal data blog – The Unjournal Data Blog

1
1. daaronr 28 Aug 2024
  
  in Public
  
  Blog Posts
  
  I think the 'blog post' might be a bit buried here? Perhaps it should be made more prominent?
Visit annotations in context

Annotators

daaronr

URL

unjournal.github.io/unjournaldata/index.html
unjournal.pubpub.org unjournal.pubpub.org

The Unjournal

2
1. daaronr 12 Aug 2024
  
  in Public
  
  The UnjournalJournal-independent peer review of research that informs global prioritiesUnjournal on PubPubSearch for contentLog in/sign upUJ homepageCompleted UJ evaluation packages
  
  I thing the spacing and fonts might be improved? Too much vertical space perhaps
2. daaronr 12 Aug 2024
  
  in Public
  
  The UnjournalThe Unjournal seeks to make rigorous research more impactful and impactful research more rigorous. We encourage better research by making it easier for researchers to get feedback and credible ratings. We focus on quantitative work that informs global priorities, especially in economics, policy, and social science. We commission public journal-independent evaluation of hosted papers and dynamically presented projects. We publish evaluations, ratings, manager summaries, author responses, and links to evaluated research on this PubPub – to inform practitioners and other researchers.As the name suggests, we are not a journal. We don’t “publish”, i.e., we don’t claim ownership or host research. We don’t charge any fees. Instead, we offer an efficient, informative, useful, and transparent research evaluation system. You can visit Unjournal.org and our knowledge base for more details or contact us at contact@unjournal.org.
  
  Probably worth making this text punchier
Visit annotations in context

Annotators

daaronr

URL

unjournal.pubpub.org/
Jul 2024
80000hours.org 80000hours.org

Research questions that could have a big social impact, organised by discipline

14
1. daaronr 16 Jul 2024
  
  in Public
  
  What’s the best way to measure individual wellbeing? What’s the best way to measure aggregate wellbeing for groups?
  
  Unjournal 'pivotal questions'
  
  By our above standards, this is far too broadly defined; however, there is a wealth of recent (empirical, theoretical, and methodological) work on this; some of which (e.g., ‘WELLBYs’) seems to be influencing the funding, policies and agendas at high-impact orgs. Elicit.org (free version) summarizes the findings from the ‘top-4’ recent papers:
  
  Recent research highlights the complexity of measuring wellbeing, emphasizing its multidimensional nature. Studies have identified inconsistencies in defining and measuring wellbeing among university students in the UK, with a focus on subjective experiences and mental health (Dodd et al., 2021). Global trends analysis reveals distinct patterns in flourishing across geography, time, and age, suggesting the need for comprehensive measures beyond single-item assessments (Shiba et al., 2022). Scholars argue for well-being indicators that capture physical, social, and mental conditions, as well as access to opportunities and resources (Lijadi, 2023). The concept of "Well-Being Capital" proposes integrating multidimensional wellbeing indicators with economic measures to better reflect a country's performance and inform public policy (Bayraktar, 2022). These studies collectively emphasize the importance of considering subjective experiences, cultural factors, and ecological embeddedness when measuring individual and aggregate wellbeing, moving beyond traditional economic indicators like GDP.
  
  This suggests a large set of more targeted questions, including conceptual questions, psychometric issues, normative economic theory, and empirical questions. But I also suspect that 80k and their funders would want to reframe this question in a more targetted way. They may be particularly interested in comparing a specific set of measures that they could actually source and use for making their decisions. They may be more focused on well-being measures that have been calibrated for individuals in extreme poverty, or suffering from painful diseases. They may only be interested in a small subset of theoretical concerns; perhaps only those that could be adapted to a cost-benefit framework.
  
  ^[Asking it to “...please focus on recent work in economics and decision theory, and measurements that have been used for low and middle-income countries.” yields [excerpted]
  
  Recent research … aim[s] to move beyond traditional economic indicators like GDP. The Global Index of Wellbeing (GLOWING) proposes a simple, meaningful measure using secondary ecological data (Elliott et al., 2017). The Multidimensional Wellbeing Index for Peru (MWI-P) incorporates 12 dimensions based on what people value, revealing disparities among subgroups (Clausen & Barrantes, 2022). Another approach suggests using the Condorcet median to aggregate non-comparable wellbeing facets into a robust ranking (Boccard, 2017). New methods for measuring welfare include stated preferences over aspects of wellbeing, life-satisfaction scales, and the WELLBY approach, as well as comprehensive frameworks like Bhutan's Gross National Happiness Index (Cooper et al., 2023).]
  
  This seems promising as the basis for a ‘research prioritization stream’. We would want to build a specific set of representative questions and applications as well as some counterexamples (‘questions we are less interested in’), and then we could make a specific drive to source and evaluate work in this area.
2. daaronr 15 Jul 2024
  
  in Public
  
  When estimating the chance that now (or any given time) is a particularly pivotal moment in history, what is the best uninformative prior to update from? For example, see our podcast with Will MacAskill and this thread between Will MacAskill and Toby Ord for a discussion of the relative merits of using a uniform prior v. a Jeffreys prior.
  
  Unjournal 'pivotal questions' -- not quite operationalized but seems close to workable 5/10
3. daaronr 15 Jul 2024
  
  in Public
  
  How frequent are supervolcanic eruptions and what size of eruption could cause a volcanic winter scenario? (Adapted from Toby Ord, The Precipice, Appendix F)
  
  Unjournal 'pivotal questions' -- operationalizable but out of our current scope.
4. daaronr 15 Jul 2024
  
  in Public
  
  What’s the average lifespan of the most common species of wild animals? What percent die via various means
  
  Unjournal 'pivotal questions' -- operationalizable but out of our current scope.
5. daaronr 15 Jul 2024
  
  in Public
  
  What’s the minimum viable human population (from the perspective of genetic diversity)? (Michael Aird, Crucial questions for longtermists)
  
  Unjournal 'pivotal questions' -- operationalizable but out of our current scope.
6. daaronr 15 Jul 2024
  
  in Public
  
  What are the best existing methods for estimating the long-term benefit of past investments in scientific research, and what have they found? What new estimates should be conducted? (Adapted from Luke Muehlhauser (writing for Open Philanthropy), Technical and Philosophical Questions That Might Affect our Grantmaking)
  
  Unjournal 'pivotal questions'
  
  Reframe as ‘what has been the long term benefit {defined in terms of some specific measure} of investment in scientific research’ – 6/10 as reframed
7. daaronr 15 Jul 2024
  
  in Public
  
  E.g. does creatine actually increase IQ in vegetarians?
  
  [The “e.g.”, is 9/10 although it’s on the borderline of our scope]
8. daaronr 15 Jul 2024
  
  in Public
  
  E.g. does creatine actually increase IQ in vegetarians?
  
  Unjournal 'pivotal questions'
  
  The “e.g.”, is 9/10 although it’s on the borderline of our scope
9. daaronr 15 Jul 2024
  
  in Public
  
  How well does good forecasting ability transfer across domains?
  
  Unjournal 'pivotal questions'
  
  Could be operationalized ... what measure of ability, which domains (or how to define this?)
  
  4/10
10. daaronr 15 Jul 2024
  
  in Public
  
  Will AI come to be seen as the one of the most strategically important parts of the modern economy, warranting massive state support and intervention?
  
  Unjournal 'pivotal questions'
  
  Could this be operationalized? 3/10
11. daaronr 15 Jul 2024
  
  in Public
  
  Could advances in AI lead to risks of very bad outcomes, like suffering on a massive scale? Is it the most likely source of such risks? (Adapted from Michael Aird, Crucial questions for longtermists)
  
  Unjournal 'pivotal questions' ‘Could’ is vague. ‘Very bad outcomes’ needs a precise measure. Reframe as ~‘what is the increase in the risk of {specific VBO} as a function of the level of AI progress {specifically defined}.
  
  3/10
12. daaronr 15 Jul 2024
  
  in Public
  
  How much do global issues differ in how cost-effective the most cost-effective interventions within them are?
  
  Unjournal 'pivotal questions'
  
  ‘Global issues’ is vague; we need more specific categorizations. Cost-effectiveness needs a metric –
  
  5/10
13. daaronr 15 Jul 2024
  
  in Public
  
  long-term rate of expropriation of financial investments? How does this vary as investments grow larger? (Michael Aird, Crucial questions for longtermists)
  
  Unjournal 'pivotal questions'
  
  This is fairly definite, although it’s not clear what the precise motivation is here.
  
  8/10
14. daaronr 15 Jul 2024
  
  in Public
  
  What is the effect of economic growth on existential risk? How desirable is economic growth after accounting for this and any other side effects that might be important from a longtermist perspective? (See a recent paper by Leopold Aschenbrenner for some initial work on this question.)
  
  Unjournal 'pivotal questions' -- Needs some refinement – what measure of growth, what units of x-risk, etc.
  
  6/10
Visit annotations in context

Annotators

daaronr

URL

80000hours.org/articles/research-questions-by-discipline/
May 2024
www.givewell.org www.givewell.org

New Incentives (Conditional Cash Transfers to Increase Infant Vaccination)

5
1. daaronr 07 May 2024
  
  in Public
  
  Note: The figures in this report are from our February 2024 cost-effectiveness analysis. Our estimates change over time as we gather new information and update our analysis, and so the numbers in this report may not exactly match those in our most recent cost-effectiveness analysis (available here).
  
  could they make these reports dynamic ?
2. daaronr 07 May 2024
  
  in Public
  
  We adjust for this risk in our analysis (reducing our estimate of vaccine efficacy by around 20%).
  
  These ad-hoc adjustments seem particular sketchy and this ripe for critique and a more systematic approach
3. daaronr 07 May 2024
  
  in Public
  
  Proportion of children enrolled who would be vaccinated in the absence of New Incentives’ program (more)
  
  The hover content is helpful
4. daaronr 07 May 2024
  
  in Public
  
  We use a cost-effectiveness analysis to quantify our reasoning. Here is a summary of our analysis, using one state, Bauchi, as an example.
  
  The linked Google sheet is hard to parse and hard to read. This makes it less than fully transparent. E.g., the columns are frozen in a way that you can barely navigate the by-region columns.
  
  Linking something people can't use doesn't add transparency, it just wastes people's attention. If you feel the need put these links at the bottom, in a 'data section' or something. Anyone who wants to dig into it will need to do so as part of a separate and intensive exercise -- not just a glance while reading this. At least that's my impression.
  
  But also note that a code notebook based platform can be far more manageable for the reader.
5. daaronr 07 May 2024
  
  in Public
  
  New Incentives (Conditional Cash Transfers to Increase Infant Vaccination)
  
  I think all reports should have an 'update date' prominently listed. OK it says 'April' below
Visit annotations in context

Annotators

daaronr

URL

givewell.org/international/technical/programs/new-incentives
Apr 2024
www.cambridge.org www.cambridge.org

The Political Consequences of Green Policies: Evidence from Italy

28
1. daaronr 18 Apr 2024
  
  in Public
  
  In both cases the estimated effects are very small and not statistically distinguishable from zero
  
  But can they actively rule out a substantial effect ... equivalence test. I see some positive effects. Maybe it is just underpowered.
2. daaronr 18 Apr 2024
  
  in Public
  
  we focus on pre-trends
  
  Is this the usual pre-trend diagnostic for DiD?
3. daaronr 18 Apr 2024
  
  in Public
  
  Vote Switching Before Area B
  
  I want to see the voting levels by group
4. daaronr 18 Apr 2024
  
  in Public
  
  Hence, in this case, Diesel-Euro5 car owners constitute a “fake” treatment group
  
  robustness tests -- what are some other 'fake' control groups?
5. daaronr 18 Apr 2024
  
  in Public
  
  The null result on the Democratic Party is particularly interesting, as it suggests that voters did not penalize the party of the incumbent mayor, who was directly accountable for the introduction of the traffic ban.
  
  But I suspect they might have reported this result as supporting their hypothesis. Thus a multiple comparison test seems appropriate.
6. daaronr 18 Apr 2024
  
  in Public
  
  27 Our findings are robust to restricting the control group to Diesel-Euro5 owners only, and to controlling for the number of kilometers driven per year, as well as for the frequency of car use (see Supplementary Tables SI-8–SI-10, respectively).
  
  they did some reasonable robustness checks. What about restricting it to all Diesel cars?
7. daaronr 18 Apr 2024
  
  in Public
  
  substantively sizable shift considering that the baseline rate of support for Lega in the sample was 24.4%. Put differently, owning a car affected by the vehicle ban raised the probability of voting for Lega in the subsequent elections by 55% ab
  
  I'm surprised it's so large ... hmmm
8. daaronr 18 Apr 2024
  
  in Public
  
  Table 2. Voting for Lega in EU Elections of 2019
  
  a key first table of results. Looks like a linear probability model, here.
9. daaronr 18 Apr 2024
  
  in Public
  
  Our main dependent variable is an indicator that takes the value 1 if the respondent reports voting for Lega, and 0 otherwise. We also investigate potential treatment effects on support for other parties. In particular, to assess potential anti-incumbent effects, we examine support for the Democratic Party, the party of the city’s mayor.
  
  That would be other plausible outcomes?
  
  Any good way to correct for multiple comparisons here?
10. daaronr 18 Apr 2024
  
  in Public
  
  we do not know the exact timing of the disbursement, that is, whether the compensation was actually received by the time the EU elections took place.Footnote 25 In other words, our indicator variable equals 1 if the respondent has received compensation at any point in time prior to the survey (January 2021), and zero otherwise
  
  so it should understate the effecgt
11. daaronr 18 Apr 2024
  
  in Public
  
  eassuringly, earlier studies (e.g., Colantone and Stanig Reference Colantone and Stanig2018) show that individuals with these characteristics tend to be less likely to support a radical-right party such as Lega. Hence, the composition of the treatment group should in fact work against finding a pro-Lega effect of the policy.Footnote
  
  Hmm... Note that the treated group is also more male. But I don't think any of this matters after you control for these demographics, so their comment is misguided.
  
  OK this is probably something we can ignore.
12. daaronr 18 Apr 2024
  
  in Public
  
  Table 1 compares the characteristics of the different groups of car owners in terms of their age, gender, education, and income.
  
  to do: I would want to see a table including vote shares here
13. daaronr 18 Apr 2024
  
  in Public
  
  specifications of the following form:(1)
  
  Leading robustness check suggestion: The key outcome variable is a percentage of the vote total, I guess. But the treated and groups may have started from a different baseline level of voting. If the true model is nonlinear, the results here may be misleading. E.g., suppose the true effect on voting was the same for both groups as a percentage of the initial vote. Or suppose the impacts have diminishing returns, a sort of ceiling effect.
  
  Other possible robustness checks ... what are some other plausible forms? Car type random effects? Limit analysis to diesel only? Euro-5 only?
  
  Use demographic controls differently (?interact with a random effect or something?) Note, this DiD does not involve any time difference.
14. daaronr 18 Apr 2024
  
  in Public
  
  As another type of control group, we also interviewed 303 owners of new cars in the Euro6 category (both Diesel and Petrol). These car owners serve as a useful placebo test, for reasons we detail below.
  
  the 'placebo test' -- Possible robustness test: what if we made them part of the control group?
15. daaronr 18 Apr 2024
  
  in Public
  
  Starting in April 2019, city residents affected by the ban could apply for compensation from the Municipality of Milan.Footnote 12 The initial 2019 call for compensation was open only to low-income car owners (i.e., with an adjusted household income below €25,000 per year, or €28,000 if aged 65+). In the next year, the income criterion was dropped, and hence the call was effectively open to all residents
  
  compensation scheme
16. daaronr 18 Apr 2024
  
  in Public
  
  aim is to compare owners of affected cars to owners of relatively similar-yet-unaffected cars.Footnote 10 Specifically, our treatment group will consist of owners of Diesel-Euro4 cars, while the control group will consist of Petrol-Euro4, Diesel-Euro5, and Petrol-Euro5 car owners.Footnote
  
  Possible robustness test: compare other plausible vehicle treatment and control groups
17. daaronr 18 Apr 2024
  
  in Public
  
  The policy identifies the most polluting categories of vehicles and bans them from accessing and circulating within the area.
  
  the policy
18. daaronr 18 Apr 2024
  
  in Public
  
  seem to
  
  "seem to" is vague
19. daaronr 18 Apr 2024
  
  in Public
  
  If anything, affected car owners exhibited slightly more environment-friendly attitudes.
  
  I guess they mean 'after the policy these people became relatively more environmentally friendly'
20. daaronr 18 Apr 2024
  
  in Public
  
  was even larger
  
  Check: significantly larger?
21. daaronr 18 Apr 2024
  
  in Public
  
  owners of banned cars were 13.5 percentage points more likely to vote for Lega in the European Parliament elections of 2019
  
  main claim, quantified
22. daaronr 18 Apr 2024
  
  in Public
  
  close to zero and well below statistical significance.
  
  But how big a violation is 'substantial enough to matter' here?
23. daaronr 18 Apr 2024
  
  in Public
  
  owners of banned vehicles—who incurred a median loss of €3,750—were significantly more likely to vote for Lega in the subsequent elections
  
  In the paper, they are basically making the stronger claim that "this policy CAUSED these people to vote for Lega". That's the argument behind their discontinuity DiD approach .
24. daaronr 18 Apr 2024
  
  in Public
  
  we exploit arbitrary discontinuities in the rules dictating the car models that would be covered by the ban and employ a difference-in-differences estimation to identify the policy’s effect on voting behavior.
  
  The DiD approach
25. daaronr 18 Apr 2024
  
  in Public
  
  the electoral impact of the introduction of the Area B ban.
  
  the thing they want to measure
26. daaronr 18 Apr 2024
  
  in Public
  
  original survey with a targeted sampling design that we conducted among residents of Milan. The survey collected detailed information about respondents’ car ownership, environmental views, and political behavior.
  
  the data
27. daaronr 18 Apr 2024
  
  in Public
  
  In line with this pattern, recipients of compensation from the local government were not more likely to switch to Lega.
  
  A claim of a 'null finding' (bounded or just underpowered?) And is the difference significant?
28. daaronr 18 Apr 2024
  
  in Public
  
  indicates that this electoral change did not stem from a broader shift against environmentalism, but rather from disaffection with the policy’s uneven pocketbook implications.
  
  A secondary claim
Visit annotations in context

Annotators

daaronr

URL

cambridge.org/core/journals/american-political-science-review/article/political-consequences-of-green-policies-evidence-from-italy/4D76FEDA813739711DCB40EC102744AF
Mar 2024
static1.squarespace.com static1.squarespace.com

XPT.pdf

16
1. daaronr 07 Mar 2024
  
  in Public
  
  AI-concerned think the risk that a genetically engineered pathogen will killmore than 1% of people within a 5-year period before 2100 is 12.38%, while the AIskeptics forecast a 2% chance of that event, with 96% of the AI-concerned abovethe AI skeptics’ median forecast
  
  this seems like a sort of ad-hoc way of breaking up the data. What exactly is the question here, and why is this the best way to answer it?
2. daaronr 07 Mar 2024
  
  in Public
  
  hose who did best on reciprocal scoring had lower forecasts ofextinction risk.72 We separately compare each forecaster’s forecast of others’ forecasts on ten key questions, for both expertsand superforecasters. We rank each forecaster’s accuracy on those 20 quantities relative to other participants,and then we compute each forecaster’s average rank to calculate an overall measure of intersubjective accuracy.73 This may be because superforecasters are a more homogenous group, who regularly interact with eachother outside of forecasting tournaments like this.74 Pavel Atanasov et al., “Full Accuracy Scoring Accelerates the Discovery of Skilled Forecasters,” SSRN WorkingPaper, (February 14, 2023), http://dx.doi.org/10.2139/ssrn.4357367.
  
  This seems visually the case, but I don't see metrics or statistical inference here.
3. daaronr 07 Mar 2024
  
  in Public
  
  ithin both groups—experts and superforecasters—more accurate reciprocalscores were correlated with lower estimates of catastrophic and extinction risk. Inother words, the better experts were at discerning what other people would predict,the less concerned they were about extinction
  
  But couldn't this just be because people who think there is high Xrisk think others are likely to think like themselves? Is it more finely grained 'better reciprocal accuracy' than that?
4. daaronr 07 Mar 2024
  
  in Public
  
  otal Catastrophic Risk
  
  The differences in the total x-risk are not quite so striking-- about 2:1 vs 6:1 What accounts for this? Hmm, this look different from the 'Total Extinction risk' in table 4. Here a notebook would be helpful. Ahh, it's because this is for catastrophic risk, not extinction risk.
5. daaronr 07 Mar 2024
  
  in Public
  
  First, we can rule out the possibility that experts can’t persuade others of the severityof existential risks simply because of a complete lack of sophistication, motivation,or intelligence on the part of their audience. The superforecasters have all thosecharacteristics, and they continue to assign much lower chances than do experts.
  
  This paragraph seems a bit loosely argued.
6. daaronr 06 Mar 2024
  
  in Public
  
  Question and resolution details
  
  They seem to have displayed the questions along with particular “Prior Forecasts” — is that appropriate? Could that be driving the persistent difference between the superforecasters and experts?
7. daaronr 06 Mar 2024
  
  in Public
  
  general x-riskexperts
  
  What are 'general x-risk experts'? Give some examples.
8. daaronr 06 Mar 2024
  
  in Public
  
  The median participant who completedthe tournament earned $2,500 in incentives, but this figure is expected to rise asquestions resolve in the coming years.
  
  fairly substantial incentives ... but it may have been time consuming; how many hours did it take?... and how much variation was there in the incentive pay/how sensitive was it to the predictions?
9. daaronr 06 Mar 2024
  
  in Public
  
  with 111completing all stages of the tournament
  
  Would this attrition matter?
10. daaronr 06 Mar 2024
  
  in Public
  
  Participants made individual forecasts2. Teams comprised entirely of either superforecasters or experts deliberated andupdated their forecasts3. Blended teams from the second stage, consisting of one superforecaster team andone expert team, deliberated and updated their forecasts4. Each team saw one wiki summarizing the thinking of another team and againupdated their forecasts
  
  with incentives for accuracy (or 'intersubjective' accuracy) at each stage, or only at the very end? Aldo incentives for making strong comments and (?) convincing others/
11. daaronr 06 Mar 2024
  
  in Public
  
  We also advertised broadly, reaching participants withrelevant experience via blogs and Twitter. We received hundreds of expressions ofinterest in participating in the tournament, and we screened these respondents forexpertise, offering slots to respondents with the most expertise after a review of theirbackgrounds.1
  
  Recruitment of experts.
12. daaronr 06 Mar 2024
  
  in Public
  
  We explained that after the tournament we would show the highest-qualityanonymized rationales (curated by independent readers) to panels of online surveyparticipants who would make forecasts before and after reading the rationale. Prizesgo to those whose rationales helped citizens update their forecasts toward greateraccuracy, using both proper scoring rules for resolvable questions and intersubjectiveaccuracy for unresolvable questions.21
  
  Is this approach valid? Would it give powerful incentives to be persuasive? What is are these rationales used for? Note that 'intersubjective accuracy' is not a ground truth for the latter questions.
13. daaronr 06 Mar 2024
  
  in Public
  
  One common challenge in forecasting tournaments is to uncover the reasoningbehind predictions.
  
  How does this 'uncover the reasoning behind predictions'?
14. daaronr 06 Mar 2024
  
  in Public
  
  scoring ground rules: questions resolving by 2030were scored using traditional forecasting metrics where the goal was to minimize thegap between probability judgments and reality (coded as zero or one as a function ofthe outcome). However, for the longer-run questions, participants learned that theywould be scored based on the accuracy of their reciprocal forecasts: the better theypredicted what experts and superforecasters would predict for each question, thebetter their score.
  
  Is the 'reciprocal scoring' rule likely to motivate honest (incentive-compatible) predictions? Is it likely to generate useful information in this context?
15. daaronr 06 Mar 2024
  
  in Public
  
  When we report probabilities of long-run catastrophic andexistential risk in this report, we report forecasters’ own (unincentivized) beliefs. But,we rely on the incentivized forecasts to calculate measures of intersubjective accuracy
  
  This is a bit confusing. The language needs clarification. What exactly is 'intersubjective accuracy'?
16. daaronr 06 Mar 2024
  
  in Public
  
  the XPT:• What will be the global surface temperature change as compared to 1850–1900, indegrees Celsius? (By 2030, 2050, 2100)• By what year will fusion reactors deliver 1% of all utility-scale power consumed inthe U.S.?• How much will be spent on compute [computational resources] in the largest AIexperiment? (By 2024, 2030, 2050)• What is the probability that artificial intelligence will be the cause of death, within a5-year period, for more than 10% of humans alive at the beginning of that period?(By 2030, 2050, 2100)• What is the overall probability of human extinction or a reduction in the globalpopulation below 5,000? (By 2030, 2050, 2100)18 Participants also consented to participate in this study, via the University of Pennsylvania’s InstitutionalReview Board. The consent form detailed the format of the study.19 We define a catastrophic event as one causing the death of at least 10% of humans alive at the beginning ofa five-year period. We define extinction as reduction of the global population to less than 5,000.
  
  I appreciate these links to the full question content.
Visit annotations in context

Annotators

daaronr

URL

static1.squarespace.com/static/635693acf15a3e2a14a56a4a/t/64f0a7838ccbf43b6b5ee40c/1693493128111/XPT.pdf
Dec 2023
www.nber.org www.nber.org

Subjecting the ‘Average Joe’ to War Theatre Triggers Intimate Partner Violence

1
1. daaronr 27 Dec 2023
  
  in Public
  
  feed these motivations through several potential mechanism
  
  I don't see the connection to household bargaining here
Visit annotations in context

Annotators

daaronr

URL

nber.org/system/files/working_papers/w31227/w31227.pdf

David Reinstein

Dr. David Reinstein, Senior Economist, Rethink Priorities https://daaronr.github.io/markdown-cv/

Project pages: innovationsinfundraising.org, giveifyouwin.org

Twitter: @givingtools

Annotations: 1,066

Joined: July 17, 2019

Location: Western Massachussets

Link: daaronr.github.io/markdown-cv/

ORCID: 0000-0002-0470-4991

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

General Guidelines for Interpreting Krippendorff's Alpha:

Comparison to Your Data:

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL

Annotators

URL