122 Matching Annotations
  1. Last 7 days
    1. e 6 shows that Claude alone underperforms PaperLens-Vision, but adding the calibrated prior improves overall accuracy from 63.0% to 70.6%, surpassing both PaperLens-Vision and Claude individually, and improves alignment with human reviews. This matches the factorization above: PaperLens-Vision anchors the decision while the reviewer supplies the critique. The gain comes mainly from a 13.6% subset of papers flipping Accept→Reject at 77.9% accuracy; Reject→Accept flips are rarer and much less reliable, so the prior mostly corrects false acceptances

      Something we aren't talking about is that we measure gains in rating alignment as well. Note that after the model's predictions flip from A -> R, it starts producing weaknesses and novelty critiques at a higher rate, agreeing with human critique.

    2. his is consistent with psychology and peer-review studies of evaluative judgment: reviewers may form an early global impression of a manuscript and then selectively construct or emphasize criticisms that make that impression appear analytically grounded (Kunda, 1990; Nisbett and Wilson, 1977; Slovic et al., 2007; Haidt, 2001; Lord et al., 1979), with peer-review evidence showing that reviewer behavior is sensitive to identity cues, network ties, and halo effects rather than to manuscript content alone (Tomkins et al., 2017; Blank, 1991; Peters and Ceci, 1982; Teplitskiy et al., 2018; De Sordi et al., 2020)

      This sentence is way too long; you need to cut it down. Also, no one knows what network ties and halo effects are, so please fix that.

    3. Model scale. 3B/14B differences to 7B, ma

      Please center the numbers in the columns, and put delta Acc on a new line. Can we decrease the width of this table so that figure B can be larger?

    4. Memory probes confirm these cutoffs: frontier models fail abstract completion, multiple-choice decision recall, and venue identification on this corpus

      Please read LLamaFactoryAutoReviewer/results/frontier_memory_probe/breakdown.md, add the results to the appendix section that talks about frontier models, and then reference that here. Show a summarized table view here that clearly shows that the frontier models consistently fail at the different recall tests. highest on 24. Add a short paragraph here.

    5. ; even strongly critical prompts only reach a roughly 50% accept rate, staying near the majority-reject baseline and inaccurate (Appendix F.2)

      delete

    6. decisions

      because they are inherently calibrated to the test distribution, since our training data has the test years (or something like that).

    7. Full feature definitions are in Table C.1.

      For the Figure 4 feature prob audit, you state that full feature definitions are in Table C.1, but really it should be Appendix C, like in the main text. Also, you were supposed to raise the legend [accept/reject] and change the names to OR-ICLR and arXiv. Please fix this (in the right subfigure); you corrected the table.

    8. Reviews, ratings, and meta-reviews are unobserved; we focus on Z rather than modeling them (Gao et al., 2024; Zhu et al., 2025b)

      In contrast with prior work (cite), reviews, ratings, and meta-reviews are not regressed on; they are only used for measuring rating correlation and review alignment.

    Annotators

    1. our learned paper signal as the observable relaxation p(Z | Xreb) of the ideal p(Z | X)

      I think we want more setup. I think we want to state: "As mentioned in the problem formulation, papers are transformed into reviews before reaching the decision. Given a paper (X), this defines a joint distribution over decisions (D) and reviews (R), i.e., p(D, R | X). While the flow indicates that reviews precede the decision, now that we have built a shortcut that predicts the decision without the review, we ask whether we can generate more \italics{decision-calibrated} reviews. This is valid by simply rearranging the joint distribution like so:"

      then the formula.
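
      A sketch of the formula this note asks for, using the note's own symbols (paper X, decision D, reviews R). The paper's notation may differ (it writes the decision as Z elsewhere), so treat the symbols as an assumption. Both lines follow from the chain rule of probability:

```latex
\begin{align}
p(D, R \mid X) &= p(R \mid X)\, p(D \mid R, X) && \text{(review first, then decision)} \\
               &= p(D \mid X)\, p(R \mid D, X) && \text{(decision first, then review)}
\end{align}
```

      The second line is the one that licenses conditioning the review on a predicted decision.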

      Then state that both factorizations are valid. Then we can give a more conceptual understanding: if the conjecture is correct that much of the decision can be determined by overall presentation quality, then perhaps reviews are simply a means to an end to justify the reviewer's initial reaction.

      Here's more context about sources that we can cite for this behavior. The strongest terms to search are:

      motivated reasoning, post hoc rationalization, biased assimilation, affect heuristic, halo effect, and in peer review specifically prestige bias, reviewer bias, or cognitive bias in peer review.

      Here are the most relevant scientific articles.

      1. Kunda: “The Case for Motivated Reasoning”

      Ziva Kunda, 1990, Psychological Bulletin

      This is probably the canonical article. Kunda argues that motivation can bias the cognitive processes people use to access, construct, and evaluate beliefs. In your case: a reviewer’s initial evaluation of a paper can shape which criticisms feel salient or persuasive.

      Best use: cite this for the general mechanism: people reason toward a conclusion they are already inclined to reach.

      2. Nisbett & Wilson: “Telling More Than We Can Know”

      Richard Nisbett & Timothy Wilson, 1977, Psychological Review

      This is highly relevant to the “reverse reasoning” part. They reviewed evidence that people often lack direct introspective access to the mental processes that caused their judgments, but still produce explanations for those judgments.

      Best use: cite this for the idea that reviewers may sincerely report reasons for a judgment without accurately knowing what actually caused the judgment.

      3. Haidt: “The Emotional Dog and Its Rational Tail”

      Jonathan Haidt, 2001, Psychological Review

      Haidt’s social intuitionist model is about moral judgment, not peer review, but the mechanism maps well: quick intuitive judgment is followed by slower post hoc reasoning. The article explicitly frames reasoning as often coming after intuition rather than causing judgment.

      Best use: cite this when you want the phrase “intuition first, reasoning second.”

      4. Lord, Ross & Lepper: “Biased Assimilation and Attitude Polarization”

      Charles Lord, Lee Ross & Mark Lepper, 1979, Journal of Personality and Social Psychology

      This is a classic study on how people evaluate evidence differently depending on their prior beliefs. People tend to accept congenial evidence more readily and scrutinize uncongenial evidence more harshly.

      Best use: cite this for the reviewer behavior where evidence supporting the reviewer’s initial take gets treated as decisive, while contrary evidence gets nitpicked.

      5. Slovic et al.: “The Affect Heuristic”

      Paul Slovic, Melissa Finucane, Ellen Peters & Donald MacGregor, 2007, European Journal of Operational Research

      This article argues that feelings of “goodness” or “badness” can rapidly guide judgments and decisions. That is very close to what people informally mean by a “paper gestalt”: an overall positive or negative feeling that then colors assessment of specific features.

      Best use: cite this for the gut reaction / affective global impression component.

      6. Tomkins, Zhang & Heavlin: “Reviewer Bias in Single- versus Double-Blind Peer Review”

      Andrew Tomkins, Min Zhang & William D. Heavlin, 2017, PNAS

      This is peer-review-specific. In a large conference-review setting, submissions were reviewed under single-blind and double-blind conditions. The study found that single-blind reviewing advantaged papers by famous authors and high-prestige institutions.

      Best use: cite this to show that manuscript judgments are not purely about the paper’s intrinsic content; contextual cues can bias reviews.

      7. Blank: “The Effects of Double-Blind versus Single-Blind Reviewing”

      Rebecca Blank, 1991, American Economic Review

      This was a randomized experiment at the American Economic Review. It found that reviewers were more critical when author identity was blinded, and acceptance rates were lower under double-blind review.

      Best use: cite this as older experimental evidence that review outcomes shift when identity/prestige cues are removed.

      8. Peters & Ceci: “Peer-Review Practices of Psychological Journals”

      Douglas Peters & Stephen Ceci, 1982, Behavioral and Brain Sciences

      Classic and brutal study: previously published psychology articles were resubmitted to the same journals with altered author/institution information. Most were not recognized, and many were rejected.

      Best use: cite this for the unreliability/context-sensitivity of peer review.

      9. Teplitskiy et al.: “The Social Structure of Consensus in Scientific Review”

      Misha Teplitskiy et al., 2018

      This study analyzed reviews of 7,981 neuroscience manuscripts submitted to PLOS ONE and found that reviewers favored authors closer to them in the co-authorship network. The authors interpret this not just as simple nepotism but as partly reflecting “schools of thought” and substantive evaluative differences.

      Best use: cite this if your point is that reviewers’ intellectual/social position can shape what they see as valid or flawed.

      10. De Sordi et al.: “Halo Effect in Peer Review”

      José Osvaldo De Sordi et al., 2020

      This one is directly about the halo effect in peer review. It explores whether belonging to the same professional field or group can bias article evaluation during review.

      Best use: cite this for the closest peer-review-specific phrase to “paper gestalt”: halo effect in peer review.

      A good synthesis sentence would be:

      The phenomenon can be described as an interaction of affective first impressions, motivated reasoning, and post hoc rationalization: reviewers may form an early global evaluation of a manuscript and then selectively construct or emphasize criticisms that make that evaluation appear analytically grounded.

      For a paper or lit review, I’d probably cite Kunda 1990 + Nisbett & Wilson 1977 + Slovic et al. 2007 for the psychology mechanism, then Tomkins et al. 2017 + Peters & Ceci 1982 + Teplitskiy et al. 2018 for peer-review-specific evidence.

      The key is that, based on this realization, we sought to see whether conditioning frontier coding agents for reviews can 1) improve their decision prediction and 2) improve their generated reviews' alignment with human reviews.

    2. We score the seven venues with enough test papers and drop four (ECCV, COLM, AISTATS, CoRL) whose concentration in the test set is too low to score reliably

      i think we should add these back into the plot, but not state their sample contributions.

    3. Holm–Bonferroni correction

      We should add a cite here:

      @article{abdi2010holm,
        title={Holm’s sequential Bonferroni procedure},
        author={Abdi, Herv{\'e}},
        journal={Encyclopedia of research design},
        volume={1},
        number={8},
        pages={1--8},
        year={2010},
        publisher={Thousand Oaks, California}
      }
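
      For reference, a minimal sketch of the Holm step-down procedure that this citation covers, in plain Python. The function name `holm_bonferroni` and the example p-values are hypothetical illustrations, not taken from the paper's analysis code:

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return reject flags (in input order) under Holm's step-down procedure."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # the first failure retains this and every larger p-value
    return reject


# Hypothetical p-values for four comparisons (toy numbers).
flags = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

      With these toy numbers the two smallest p-values are rejected and the procedure stops at 0.03, which exceeds 0.05 / 2.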

    4. o in Figure 6b the largest arXiv Acc gain comes from data scale rather than parameter scale.

      This is insufficient. Let's make one more paragraph that talks about data scale and how we varied OpenReview-ICLR from 50% size to 75% size, but for arXiv, since we have the arXiv-m and arXiv sets, we plot a 3.3x multiple. We see that more data helps for both sets. Expand on this a bit more, and talk about Figure 6b.

    5. The 3B text checkpoints show the tradeoff: OpenReview-ICLR-trained text gains ICLR Acc but loses AUC and rating alignment, while the arXiv-medium text model trades a small Acc drop for slightly stronger ranking.

      Let's remove this and the delta AUC and delta rating columns; they aren't very important here.

    6. Gemini-3.1-Pro (Vision)

      The legend needs to be fixed: its box isn't spanning all the items, and there is too much whitespace between the legend and the plots. Also add more hspace between plots.

    7. additionally

      I think we want one final short paragraph stating that, for fairness, we run validation tuning of the decision threshold for all reported baselines. Some models are biased out of the box; for example, GPT 5.4 is reject-biased, while Gemini is accept-biased. And yet they still have relatively high AUC. It's possible that the predicted rating is well-ordered but the model's chosen decision threshold is biased. Therefore, for all frontier models (and coding agents) and DeepReview, we calibrate the decision threshold on validation data before reporting test results (ignoring the model's initial decision). This consistently helps baseline numbers, in agreement with the findings of \cite{AI Scientist v1}.
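
      A minimal sketch of the validation-tuned threshold described in this note, assuming each baseline emits a scalar predicted rating per paper and that labels are 1 for accept and 0 for reject. The helper names, the use of balanced accuracy as the tuning objective, and the toy scores are assumptions for illustration, not the paper's pipeline:

```python
from typing import List, Sequence


def balanced_accuracy(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Average of accept recall and reject recall (1 = accept, 0 = reject)."""
    n_accept = sum(1 for y in labels if y == 1)
    n_reject = sum(1 for y in labels if y == 0)
    if n_accept == 0 or n_reject == 0:
        raise ValueError("need at least one accept and one reject")
    accept_hits = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    reject_hits = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return 0.5 * (accept_hits / n_accept + reject_hits / n_reject)


def apply_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Predict accept (1) when the predicted rating is at or above the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def tune_threshold(val_scores: Sequence[float], val_labels: Sequence[int]) -> float:
    """Pick the validation score cutoff with the highest balanced accuracy."""
    best_threshold, best_acc = None, -1.0
    for candidate in sorted(set(val_scores)):
        acc = balanced_accuracy(val_labels, apply_threshold(val_scores, candidate))
        if acc > best_acc:  # on ties, the lowest candidate is kept
            best_threshold, best_acc = candidate, acc
    if best_threshold is None:
        raise ValueError("val_scores is empty")
    return best_threshold


# Hypothetical validation ratings (toy numbers).
val_scores = [3.0, 4.0, 6.0, 7.0]
val_labels = [0, 0, 1, 1]
threshold = tune_threshold(val_scores, val_labels)

# The model's own accept/reject decision is ignored; only its rating is thresholded.
test_preds = apply_threshold([5.0, 6.5], threshold)
```

      Because the cutoff is chosen only from validation data, a well-ordered but biased rating scale can still yield sensible test decisions.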

    8. Table 3. Main results. For each baseline row × test set (frontier APIs, agents, DeepReview, and the Qwen base models), we run 10 random 50-sample tuning trials, average the resulting metrics, and adopt the tuned threshold only when it improves balanced accuracy by ≥ 2pp; otherwise the native decision is kept. PaperLens SFT and RL rows use their native binary decisions throughout. AUC and ρR are threshold-free.

      revert the table caption to what we had before. i just wanted you to change the baseline results! also the shading on PaperLens needs to be reverted to only shade the top results in OpenReview-ICLR and arXiv!

    9. Ordinary reviewing prompts were consistently accept-biased, so the main table reports base-model performance on the exact answer-only and reasoning prompts used for SFT and RL. Workspace layouts, prompt ablations, and launch settings are in Appendix F.1 and Appendix F.2 for reproducibility and comparison

      Let's change this. We want to state that: Vanilla prompts were consistently accept-biased, so to correct this we tested using "critical" language; however, while the acceptance rate drew closer to 50%, the models were still wildly inaccurate at 50% accuracy. Therefore, in our main results, we report the model behavior on the base prompts, the same ones we start with for SFT/RL training. These base prompts, as well as the critical variants, are defined in Appendix....

    10. anuary 2025-era cutoffs.

      please check: /scratch/gpfs/ZHUANGL/sk7524/LLaMA-Factory-AutoReviewer/results/frontier_memory_probe/breakdown.md

      We actually did some analysis on the behavior of frontier models and showed that they can't recall these papers, even when ablated across many probes like abstract completion, multiple-choice guessing of whether the paper was accepted or rejected, recalling the paper's venue, etc.

    11. The full feature inventory is listed in Table C.1, and Appendix C gives grouped distribution diagnostics.

      Remove the grouped diagnostics; just state at the end of the last sentence that (the full feature inventory is captured in Appendix C).

    12. ICLR is balanced within OpenReview year

      OpenReview-ICLR starts with 40K ICLR submissions and subsamples the reject population while maintaining the annual distribution. For arXiv, however, submission year is more nebulous because implied rejects have upload dates, while proceedings-matched accepts have conference years and post-review uploads. Therefore, we choose not to year-balance, and instead subsample 101,659 filtered papers to form an 83,556 balanced set, maintaining the starting venue distribution. This is further subsampled to achieve a 25K set, equivalent in size to OpenReview-ICLR, for scale-invariant training comparison.

    13. OpenReview-ICLR is ICLR submissions from OpenReview: 25K examples split 85/5/10 across train, validation, and test, with the held-out test focused on ’25–’26. arXiv is source-derived proceedings papers, with a held-out test covering ’24–’26. Training uses two pools: a 25K proportional subset (arXiv-m) and the full 83K set (arXiv).

      Both datasets are 85/5/10. Let's state the following: OpenReview-ICLR consists of 25K ICLR submissions from OpenReview from '20-'26, whose test set spans '25-'26. arXiv consists of LaTeX source-derived proceedings papers, with a held-out test set covering '24-'26 (to include evaluation of biennial conferences like ECCV). For arXiv, training utilizes two pools: a large 83K set (arXiv) and a smaller set proportional to OpenReview-ICLR (arXiv-m). All datasets contain equal numbers of accepts and rejects, and are split 85/5/10 into train, validation, and test sets.

      Because arXiv spans many venues, it also contains an ICLR subset (we call this \italics{arXiv-ICLR}), which is distinct from the OpenReview-sourced OpenReview-ICLR set. Figure 2 shows paired examples from both datasets.

    14. Because our target metric is Acc, which averages accept and reject recall, a balanced test set directly measures the behavior we care abou

      I think we want to say: "Therefore, we use a balanced test set, whereby accuracy equally weights accept and reject recall."

    15. .Year-by-year coverage is documented in Appendix B.1;

      Just say (Appendix B.1) after the artifact itself; don't say this year-by-year coverage is documented.

    16. source-derived papers linked to proceedings after processing

      I think we want to say something like: arXiv examples are derived from LaTeX source, recompiled with venue-specific anonymization macros (or something like this, much shorter).

    17. Because public anonymous artifacts are most consistently available after rebutta

      i think we need a reference to the appendix section which talks about this availability over years.
