Hypothesis

271 Matching Annotations

Jun 2026
Local file Local file

Untitled document

38
1. skones 07 Jun 2026
  
  in Public
  
  and it comes with no explanation of why
  
  though it does have correlations to rating and citation.
2. skones 07 Jun 2026
  
  in Public
  
  prove
  
  clearly prove
3. skones 07 Jun 2026
  
  in Public
  
  We define the edit categories used to label section-level perturbations.
  
  use footnote sizing
4. skones 07 Jun 2026
  
  in Public
  
  Table J.1. Section attention.
  
  you can make the width smaller, this doesn't look proportional, and also the caption should be centered under the table.
5. skones 07 Jun 2026
  
  in Public
  
  PaperLens-Vision
  
  you are using old formatting, please remove and use the new macro
6. skones 07 Jun 2026
  
  in Public
  
  The only comparison that loses significance under correction is DeepReviewer-7B on arXiv
  
  i am confused, we never mention deepreviewer in the table. what are you referring to?
7. skones 07 Jun 2026
  
  in Public
  
  Intervals
  
  also the columns should be PaperLens-Vision and Best Baseline btw.
8. skones 07 Jun 2026
  
  in Public
  
  Intervals are 95% bootstrap intervals for the vision-minus-text Acc
  
  just say PaperLens (no need for 7B specifically)
9. skones 07 Jun 2026
  
  in Public
  
  Table H.1. Paired bootstrap significance for headline vision comparisons.
  
  again you should use OR-ICLR in the table
10. skones 07 Jun 2026
  
  in Public
  
  Baseline
  
  Best Baseline
11. skones 07 Jun 2026
  
  in Public
  
  zero
  
  hanging line
12. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
13. skones 07 Jun 2026
  
  in Public
  
  predictions
  
  hanging line
14. skones 07 Jun 2026
  
  in Public
  
  Text
  
  use the text and vision macros defined in the main table
15. skones 07 Jun 2026
  
  in Public
  
  G SFT and RL Method Details
  
  i just realized at the very end we want a short hardware details subsection, where we mention that all of our SFT/RL experiments use 4 H200. we used liger-kernel @article{hsu2024liger, title={Liger kernel: Efficient triton kernels for llm training}, author={Hsu, Pin-Lun and Dai, Yun and Kothapalli, Vignesh and Song, Qingquan and Tang, Shao and Zhu, Siyu and Shimizu, Steven and Sahni, Shivam and Ning, Haowen and Chen, Yanning}, journal={arXiv preprint arXiv:2410.10989}, year={2024} } to reduce memory usage, allowing per device batch to be 4 (and grad accum is set appropriately, i.e., 32 batch, 4 per device batch, 2 grad accum).
16. skones 07 Jun 2026
  
  in Public
  
  Vision RA moves from 0.14 to 0.87 while RR crashes from 0.86 to 0.21. Text RL is steadier, with RA rising +25 pp while RR loses only ∼7 pp, but neither policy surpasses the strongest SFT model in the main text.
  
  i don't think we need the exact numbers. just qualitatively explain the trend that even though reward does go up as the model learns to format better, it's not actually that much more accurate, and it retains the initial accept bias.
17. skones 07 Jun 2026
  
  in Public
  
  Vision RL’s apparent peak at step 100 is a 0.14 → 0.87 swing on RA paid for by an 0.86 → 0.21 crash on RR
  
  delete this. simply state that though the models improve their reward, they retain their accept bias, and it even gets worse with RL training.
18. skones 07 Jun 2026
  
  in Public
  
  use the checkpoint policy in Table 3.
  
  there is no checkpoint policy. just state that for 7b/14b training we use 2 epochs, for 3b we use 4 epochs.
19. skones 07 Jun 2026
  
  in Public
  
  500-paper OpenReview-ICLR subset
  
  you need to also put the conclusion lol. also be consistent with OR-ICLR macro
20. skones 07 Jun 2026
  
  in Public
  
  rollout)
  
  hanging line
21. skones 07 Jun 2026
  
  in Public
  
  text/vision evaluation cells.
  
  we need more content here. Also ICLR should be OR-ICLR (italics) with macro. I think we want to say that the models spend anywhere between ~2k and 4.5k thinking, with gemini ceiling noticeably higher than gpt.
22. skones 07 Jun 2026
  
  in Public
  
  both
  
  hanging line
23. skones 07 Jun 2026
  
  in Public
  
  Frontier Gemini and GPT
  
  same here add system and user blocks (even if empty) also for all prompts, please fix the line wrapping
24. skones 07 Jun 2026
  
  in Public
  
  Reasoning prompts move
  
  please add the system and user blocks here as well.
25. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
26. skones 07 Jun 2026
  
  in Public
  
  We reverse text sections or page order during training/evaluation and find only small accuracy changes, especially for vision
  
  i think we want better labeling for the table here. 1) we don't want to mention that we are testing on validation vs test, so just remove that. 2) remove the n= column. 3) we want a column that says forward-trained/reverse-trained, then another column that's text or vision, and then another column that is test-set ordering. got me?
27. skones 07 Jun 2026
  
  in Public
  
  We report fixed-cost
  
  can you add vspace between the tables and subcaptions? they are a bit too close rn.
28. skones 07 Jun 2026
  
  in Public
  
  We therefore use the simplest paper-only pool, ’20–’23,’25–’26∗
  
  this is hanging
29. skones 07 Jun 2026
  
  in Public
  
  accuracy
  
  hanging line
30. skones 07 Jun 2026
  
  in Public
  
  slices.
  
  after this i think we want to say, despite the larger claim rate, the accuracy of the claim is less than 1% across all cues, showing that while llms might recall, they cant do it accurately.
31. skones 07 Jun 2026
  
  in Public
  
  OR-ICLR
  
  isnt this italicised? or supposed to be?
32. skones 07 Jun 2026
  
  in Public
  
  anonymizes
  
  we should specify the anonymization. Here we similarly strip all the identity-bearing spans directly from the latex source, and we also recompile the pdf under the [review] latex macro.
33. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR (needs italics)
34. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  put in parentheses (OR-ICLR), then refer to this throughout the rest of the text please. (even in the tables)
35. skones 07 Jun 2026
  
  in Public
  
  arXiv source tarball.
  
  hanging line
36. skones 07 Jun 2026
  
  in Public
  
  other ML venues.
  
  hanging line
37. skones 07 Jun 2026
  
  in Public
  
  The dominant drop is ve
  
  i want the arxiv-m to be a right arrow, not down arrow.
38. skones 07 Jun 2026
  
  in Public
  
  retained LATEX sourc
  
  if we use the latex symbol we should be consistent. id just use the latex (plain text) throughout
Annotators

skones
Local file Local file

Untitled document

1
1. skones 07 Jun 2026
  
  in Public
  
  LATEX
  
  we should be consistent with this symbol. i think we should just always use plain text for it
Local file Local file

Untitled document

54
1. skones 07 Jun 2026
  
  in Public
  
  se search_arxiv to verify novelty claims and find missing references
  
  again delete this
2. skones 07 Jun 2026
  
  in Public
  
  CALIBRATION (from 400 training papers):
  
  again delete this
3. skones 07 Jun 2026
  
  in Public
  
  Use search_arxiv to verify novelty claims and find missing references
  
  delete this step
4. skones 07 Jun 2026
  
  in Public
  
  CALIBRATION (from training data): - Accept papers: mean rating ~5.0 (ratings 4-6), soundness ~2.75, presentation ~3.0, contribution ~2.5 - Reject papers: mean rating ~3.0 (ratings 2-4), soundness ~2.75, presentation ~2.25, contribution ~1.5 - Strengths: ~3 bullets, Weaknesses: ~4-5 bullet
  
  remove this
5. skones 07 Jun 2026
  
  in Public
  
  Last-layer attention is on later sections.
  
  this could/should be a wrapfig, please add it in where mentioned
6. skones 07 Jun 2026
  
  in Public
  
  Table H.2. Dataset-stratified modality comparisons.
  
  same with this
7. skones 07 Jun 2026
  
  in Public
  
  able H.1. Paired bootstrap significance for headline vision comparisons. Intervals are 95% percentile intervals from 10,000 paired bootstrap resamples over paper IDs.
  
  these need to be weaved into the text
8. skones 07 Jun 2026
  
  in Public
  
  or both models
  
  for OR-ICLR
9. skones 07 Jun 2026
  
  in Public
  
  Fitted parameters: text a = 0.845, b = 0.147 (slight compression with a small upward shift); vision a = 0.636, b = 0.062 (strong compression, equivalent to T ≈ 1.57, with minimal shift)
  
  can these be updated, and can you state the numbers for arxiv text/vision and OR-ICLR?
10. skones 07 Jun 2026
  
  in Public
  
  Both runs use VERL with GRPO, KL coefficient 0.001, the k3 KL estimator, PPO clip 0.2 / 0.2, standard-deviation normalization, and token-mean loss reduction
  
  remove this
11. skones 07 Jun 2026
  
  in Public
  
  ,667-paper held-out ’25/’26 evaluation set.
  
  the OR-ICLR test set
12. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
13. skones 07 Jun 2026
  
  in Public
  
  well below the SFT models reported in the main text.
  
  let's make sure we use OR-ICLR unless it's necessary and we are talking about the dataset curation details, where i think it's appropriate to use OpenReview-ICLR
14. skones 07 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  this is the pipeline for OpenReview-ICLR AND arXiv, please fix. also where is the main boxy text?
15. skones 07 Jun 2026
  
  in Public
  
  ICLR
  
  OpenReview-ICLR
16. skones 07 Jun 2026
  
  in Public
  
  r Freeform Boxed Standard on a 500-paper OpenReview-ICLR subset
  
  hmm, i think we should add in these prompt variants, otherwise it's hard to know what structured json or PDR with advocate.critic, etc means
17. skones 07 Jun 2026
  
  in Public
  
  o check whether scaling helps base prompting
  
  to check whether, with a strong base model (the qwen3.5 model is more likely to be strong than the qwen2.5 base), will scale help prompting?
18. skones 07 Jun 2026
  
  in Public
  
  Structured JSON
  
  (a full review with strengths/weaknesses/etc.)
19. skones 07 Jun 2026
  
  in Public
  
  .9to 4.3
  
  say 4.3%
20. skones 07 Jun 2026
  
  in Public
  
  Table F.3. Prompt-ablation averages across ICLR ’25 and ’26.
  
  these need to be mentioned before, also do you think we could put a and b next to each other and c underneath to save space?
21. skones 07 Jun 2026
  
  in Public
  
  summarize
  
  this feels somewhat out of place. where is it referenced? i feel this should be placed earlier, before prompting. we should 1) have a preamble in the prompting and extended experiments. 2) we should mention this summary table inline in the preamble, talking about how all settings are summarized here, notably the frontier models use high reasoning and temperature=1, and we also replicate this in our RL prompt (need to add temperature=0.7 there). then talk about how our sft, after training, is greedily decoded? maybe then introduce the prompting we tested and how we did some experiments to correct open-source model bias
22. skones 06 Jun 2026
  
  in Public
  
  main.py  # calls the schema validator
  `-- results/
      |-- codex/
      |   |-- PREDICTIONS.json
      |   `-- session.jsonl
      `-- claude/
          |-- PREDICTIONS.json
          `-- session.jsonl
  
  we didn't explain what the schema validator is, and we didn't explain the PREDICTIONS.json or session.jsonl. please explain in the main text: the schema validator is a utility given to the agents to ensure their predictions (stored in the PREDICTIONS.json) are of the correct format. we record the entire agent trajectory in a session.jsonl file.
23. skones 06 Jun 2026
  
  in Public
  
  AGENTS.override.md
  
  is this the same as the frontier reasoning prompt? if so we should indicate that
24. skones 06 Jun 2026
  
  in Public
  
  the hard label
  
  However, in our main results, we calibrated the model's decision using its rating and accuracy performance on the held-out validation set. This generally improved performance.
25. skones 06 Jun 2026
  
  in Public
  
  This appendix records the prompts and decoding settings used for prompting, SFT, and RL. We omit the paper body from the snippets and normalize line wrapping for typesetting
  
  simply state: In this section we denote the prompts and decoding settings used during prompting, sft, and RL. we omit the paper body from the prompts for brevity. also please don't line wrap until you have hit the full width of the prompt width. it seems like line wrapping is used too early rn
26. skones 06 Jun 2026
  
  in Public
  
  You are an expert academic reviewer for the ICLR conference, predicting whether a paper will be accepted or rejected. ICLR generally has a ~30% acceptance rate.
  
  i like what was done in the base open-source prompting, please adopt that here by denoting what's in system, what's in user, and then the difference between iclr and arxiv.
27. skones 06 Jun 2026
  
  in Public
  
  Frontier Model Prompting
  
  same for this
28. skones 06 Jun 2026
  
  in Public
  
  Base Open-Source Prompting
  
  i was thinking this should be its own subsection
29. skones 06 Jun 2026
  
  in Public
  
  ixed: text examples
  
  please use something other than colon here
30. skones 06 Jun 2026
  
  in Public
  
  e hypothesize that borderline papers inject label noise, which, over the long generations learned during RL, can cause the model to reinforce noise rather than learn a stable boundar
  
  this sentence is strange and idt this is meant to be here. remove it. i think the key is that we should state that none of these design choices consistently improve performance, so we choose to simply use paper -> decision prediction
31. skones 06 Jun 2026
  
  in Public
  
  We report fixed-cost data
  
  can we increase the size of these tables to \footnotesize
32. skones 06 Jun 2026
  
  in Public
  
  We therefore study
  
  i think stating this causally as: "we therefore study ..." is weird. i think we should say: "to improve the quality of the corpus, we tried to filter out borderline cases ..."
33. skones 06 Jun 2026
  
  in Public
  
  curated domain distribution: the accept
  
  choose something else other than colon here
34. skones 06 Jun 2026
  
  in Public
  
  the threshold used in prior memorization audits
  
  delete this
35. skones 06 Jun 2026
  
  in Public
  
  For matched I
  
  please don't abbreviate the columns, we have more than enough space
36. skones 06 Jun 2026
  
  in Public
  
  utcome accurac
  
  we report acc in the main table as a %, id do the same here. Make sure this is clear by stating outcome accuracy when claimed (%)
37. skones 06 Jun 2026
  
  in Public
  
  is not to replace the learned PaperLens score, but
  
  delete
38. skones 06 Jun 2026
  
  in Public
  
  contribut
  
  "contrib"
39. skones 06 Jun 2026
  
  in Public
  
  avail
  
  probably need "avail"
40. skones 06 Jun 2026
  
  in Public
  
  The abstract span starts at a line matching the Abstract keyword (optionally preceded by Markdown hashes), then continues until the next Markdown heading or until a blank line followed by an uppercase letter or digit. Split that span on periods, exclamation marks, and question marks, and count the resulting nonempty sentences
  
  i think this could be shortened, as well as the previous. all we need to say is that we find the abstract content and then either count the number of sentences or the (average) length of each sentence
41. skones 06 Jun 2026
  
  in Public
  
  estimated body page
  
  we also need to state that the w_iclr, w_arxiv are derived from logistic regression models trained on all these features
42. skones 06 Jun 2026
  
  in Public
  
  data pipelines.
  
  these need to appear closer to where they are mentioned in the appendix
43. skones 06 Jun 2026
  
  in Public
  
  Decisions are already published, so the funnel is short.
  
  State: Decisions for both accepts and rejects are fully available, so the pipeline is simple: download openreview PDFS and metadata, extract markdown content/apply normalization, and subsample.
44. skones 06 Jun 2026
  
  in Public
  
  9,442 balanced PROD papers (9,770 accept /19,672 reject) end up in training.
  
  how is this balanced?? something is wrong with your metrics, this should be 50/50 accepts and rejects? with our 40k we do a stratified year balance, please recheck
45. skones 06 Jun 2026
  
  in Public
  
  CLR is short because OpenReview
  
  lets remove the red text between the boxes. also lets increase the spacing between boxes
46. skones 06 Jun 2026
  
  in Public
  
  able B.1. Artifact shortcut audit across public paper datasets.
  
  can we move this to previous page (before shortcuts in prior datasets?)
47. skones 06 Jun 2026
  
  in Public
  
  Vision picks this up: even after normalization, the model remains unusually strong on ’24, indicating that residual layout cues survive whiteout. MinerU confirms the source of the problem: across years the mean y-position of the introduction start is roughly matched between accepts and rejects, except in ’24 where the gap is notable.
  
  i dislike overuse of colon/semicolon, please clean this up,
48. skones 06 Jun 2026
  
  in Public
  
  We strip identity-bearing s
  
  Need setup. We attempted to recover an anonymized version of ICLR '24 by performing manual anonymization on the camera-ready PDF.
49. skones 06 Jun 2026
  
  in Public
  
  so we exclude them from the OpenReview-ICLR pool
  
  so we only source ICLR submissions from OpenReview.
50. skones 06 Jun 2026
  
  in Public
  
  shortcuts to discriminate
  
  we should add one last sentence stating: Our audit shows that using papers sourced from the stage of review (latest rebuttal) doesn't suffer from any of these shortcuts.
51. skones 06 Jun 2026
  
  in Public
  
  These signals are markdown/LaTeX source presence (LaTeX is only available for arXiv submissions, not all OpenReview rejects)
  
  this sentence seems a bit listy and not so coherent. State: "We tested out several signals denoted in Table ..." (add a table that adds a definition to each signal). Markdown/LaTeX source is mainly for deepreview, but basically we identify this by looking for \begin{equation} or \cite tags in the text, rather than rendered. author names visible is looking for names before abstract, same with affiliation before abstract. has ack is if an acknowledgements section header is present. and github in abstract searches for a github url in the abstract.
52. skones 06 Jun 2026
  
  in Public
  
  Existing accept/reject datasets often mix artifacts in ways that make labels recoverable from source, venue,or revision artifacts rather than paper quality.
  
  i think we should say: "For instance in Deep Paper Gestalt \cite, they used accepted CVPR papers and used workshop papers as a reject pool. However, from formatting alone, if one can tell if something is workshop format (fewer pages, specific headers, etc.), they can easily shortcut the core prediction problem."
53. skones 06 Jun 2026
  
  in Public
  
  pi ∼ F
  
  what does this mean?
54. skones 06 Jun 2026
  
  in Public
  
  NeurIPS
  
  '14/'21
Local file Local file

Untitled document

41
1. skones 06 Jun 2026
  
  in Public
  
  Prompt Ablations
  
  this should be called: "Correcting Model Biases with Prompting" and this should have a short preamble talking about how our base open-source models are heavily accept biased (accepting all), so we tried various prompt corrections and tested the effects over various open-source models and a cheap frontier model (Gemini-2.5-Flash) on OpenReview-ICLR. then table F.3, F.4, and F.5 should be woven in here. i am thinking we have one paragraph for critical variant, one paragraph for multi-step peer review emulation. the experimental notes should be in the preamble. be results-driven, right now this reads way too much as describing setup.
2. skones 06 Jun 2026
  
  in Public
  
  able F.2. Frontier reasoning-token usage. We summarize median, mean, and P95 reasoning-token counts for the stratified high-reasoning Gemini and GPT evaluation runs.
  
  this should be its own sub-section titled: "Frontier Model Reasoning Usage"
3. skones 06 Jun 2026
  
  in Public
  
  Text-Only prompt variants. nonconcise
  
  we can fully remove this and the prompt search, idt we are mentioning this anymore
4. skones 06 Jun 2026
  
  in Public
  
  cosine decay, 5 warmup steps, context 24,480, greedy evaluation. Full optimizer settings are in Appendix E.1.
  
  the most important thing to say is temperature=0 (greedy evaluation). we don't talk about training here. Context length is fine too
5. skones 06 Jun 2026
  
  in Public
  
  Qwen direct prompting
  
  Qwen Base Prompting
6. skones 06 Jun 2026
  
  in Public
  
  Large Frontier Gemini/GPT
  
  merge with above: Frontier Models (Gemini/GPT)
7. skones 06 Jun 2026
  
  in Public
  
  Prompts
  
  let's reorganize this into a subsection: "Base Open-Source Prompting", and then we have one paragraph for No Reasoning Prompt, then one for Reasoning Prompt. for the reasoning prompt i'd like you to be a bit more clear by showing the prompt more contiguously rather than split as it is now. then we have a subsection: "Frontier Model Prompting" and one para: "Frontier Reasoning Prompt", then one para: "Frontier Reasoning Agent Workspace" (no need for deepreview).
8. skones 06 Jun 2026
  
  in Public
  
  The search-augmented RL prompt extends this schema with the arXiv search interface described in Appendix E.2
  
  remove this because we aren't mentioning arXiv search for RL anymore
9. skones 06 Jun 2026
  
  in Public
  
  label
  
  fix the hanging line
10. skones 06 Jun 2026
  
  in Public
  
  Prompting and Extended Experiments
  
  this needs to be BEFORE SFT and RL Method Details
11. skones 06 Jun 2026
  
  in Public
  
  E.2 RL
  
  ok so we are moving the RL section from the previous to here. i think that's actually all we need, we don't need to put any other details, because they aren't mentioned in the main text. let's also remove the 4-step text+arXiv, and support the main text only, which is that we trained text/vision RL and just use the single-turn settings.
12. skones 06 Jun 2026
  
  in Public
  
  SFT.
  
  remove this paragraph header, just do subsection
13. skones 06 Jun 2026
  
  in Public
  
  RL. The single-turn text RL policy uses a learning rate of 1.2 × 10⁻⁶, constant warmup over 4 steps, max gradient norm 0.5,
  
  this should be moved to E.2
14. skones 06 Jun 2026
  
  in Public
  
  ROUGE-L
  
  add citation for rouge-l here actually
15. skones 06 Jun 2026
  
  in Public
  
  ells follow the same format as Table H.1.
  
  you need to qualitatively describe the result. the tldr for both tables should be that frontier models can't adequately memorize or recall decision results, though it's highest for the earlier years and for arXiv.
16. skones 06 Jun 2026
  
  in Public
  
  Table H.3. Frontier memory probe on OR-ICLR, by mean reviewer-rating bucket. Cells follow the same format as Table H.1.
  
  delete
17. skones 06 Jun 2026
  
  in Public
  
  Label-recall c
  
  also you should denote the metric (Label-Recall (%)) above the years, so the reader knows what you are referring to. also the table caption should qualitatively give us the result.
18. skones 06 Jun 2026
  
  in Public
  
  Frontier memory probe
  
  don't say memory probe: "Frontier memorization evaluation on ..."
19. skones 06 Jun 2026
  
  in Public
  
  non-trivial only on ’24 arXiv (peaking at 40% mc recall for GPT-5.4, 30% for Gemini-3.1-Pro),
  
  dont report max, report median
20. skones 06 Jun 2026
  
  in Public
  
  and OR-ICLR stratified by mean reviewer rating bucket (< 5: n=14, 5–5.99: n=18, 6–6.99: n=14, ≥ 7: n=4)
  
  don't include this, it's useless, remove from table
21. skones 06 Jun 2026
  
  in Public
  
  ROUGE-L
  
  needs citation, add to bibtex: @inproceedings{lin-2004-rouge, title = "{ROUGE}: A Package for Automatic Evaluation of Summaries", author = "Lin, Chin-Yew", booktitle = "Text Summarization Branches Out", month = jul, year = "2004", address = "Barcelona, Spain", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/W04-1013/", pages = "74--81" }
22. skones 06 Jun 2026
  
  in Public
  
  recall rate (the fraction of papers where the model claims to remember the decision) and, when the model does recall, the outcome accuracy
  
  we should rename the recall rate, i think this seems like it is "correctly recalling", we want a better definition.
23. skones 06 Jun 2026
  
  in Public
  
  continuation (a sentence prefix from the paper),
  
  this is confusing, what sentence prefix? like the first sentence of the intro? are the title + abstract included?
24. skones 06 Jun 2026
  
  in Public
  
  mc (multiple-choice decision among accept, reject, and a distractor)
  
  this is confusing because whats the input? title + abstract?
25. skones 06 Jun 2026
  
  in Public
  
  gpt-5.4 and gemini-3.1-pro-preview
  
  stop using texttt, use the main text convention of calling this Gemini-3.1-Pro and GPT-5.4
26. skones 06 Jun 2026
  
  in Public
  
  rontier Memory Probes
  
  lets 1) move this up before section D Training Variations. 2) i want this to be named: "Do Frontier Models Memorize the Training Data?"
27. skones 06 Jun 2026
  
  in Public
  
  Data Mixture and Input Representation
  
  i think we should call this "Ablating Training Distribution and Paper Representation", then we should set up that we tried various techniques on OpenReview-ICLR to improve accuracy. first we started with filtering high-quality accepts/rejects (ignoring borderline), then we looked at various suffix options to help the decision objective.
28. skones 06 Jun 2026
  
  in Public
  
  remain problematic
  
  i think it would be highly prudent to add a new subsection titled "Our Data Pipeline" with two paragraphs for ICLR and arXiv. for arXiv, you can find the pipeline here: /scratch/gpfs/ZHUANGL/sk7524/AutoReviewer/plots/funnel(_openreview).html <- the first contains the pipeline for arXiv, the other for OpenReview. i think it's important to maybe have a table that summarizes what we start with, how many sty we can recognize, how many are matched, how many are filtered out, and then the implied reject pool, then the filtering. for openreview we can talk about the start and the filtering? or perhaps a diagram would be nice? you decide, don't just verbatim copy the htmls, they are too dense. look at how we represented information throughout the paper, be wise, clear, succinct.
29. skones 06 Jun 2026
  
  in Public
  
  Table D.2. Data mixture breakdown.
  
  can you put subcaptions BELOW the tables please!
30. skones 06 Jun 2026
  
  in Public
  
  All headline results use native decisions unless stated otherwise, and the generated TSVs retain the exact class recalls, thresholds, priors, and supports behind the appendix and main-text tables
  
  all headline PaperLens results use native decisions, unless stated otherwise (like for calibration). you shouldn't mention generated TSVs
31. skones 06 Jun 2026
  
  in Public
  
  The main text reports the top five accept- and reject-associated coefficients from separate ICLR and arXiv feature probes. Here we show grouped distributions for the pooled top accept- and reject-associated features across ICLR years and arXiv venues. The probe uses standardized paper-derived features and excludes source identifiers, direct venue labels, ratings, citations, author names, and non-comparable body-page counts. Its purpose is not to replace the learned PaperLens score, but to check whether simple surface statistics reveal broad content differences or narrow artifacts
  
  i think instead of grouped bar charts, we could instead, on the table show the feature weight for logistic regression (as a column?), then the paragraph just discusses the table. no need for all these grouped bar charts.
32. skones 06 Jun 2026
  
  in Public
  
  ABSTRACT or Abstract,
  
  just say "Abstract keyword", fix this throughout the table
33. skones 06 Jun 2026
  
  in Public
  
  n feature names, per pg. means per estimated body page: metadata body_pages, then num_pages_noreferences, then num_pages, otherwise word count divided by 900 and lower-bounded at 1.
  
  i dont think we need this detail, stop before colon. also dont use texttt here, no need, just use english for body pages
34. skones 06 Jun 2026
  
  in Public
  
  Existing accept/reject datasets
  
  also feel free to split this into two 5-8 paragraphs, that are full width no hanging lines
35. skones 06 Jun 2026
  
  in Public
  
  B.2 Shortcuts in Prior Datasets
  
  i almost feel this should be BEFORE the B.1 section, we reference this first in the related works, so it makes sense to put it first.
36. skones 06 Jun 2026
  
  in Public
  
  0 (497 vs. 498), ’21 (506 vs. 508), ’22 (515 vs. 520), ’23 (532 vs. 524), and only mildly shifted in ’25 (554 vs. 543), but ’24 shows the largest gap (563 vs. 529
  
  i don't think we need to state the numbers, we merely need to state the effect: '24 has a notable gap
37. skones 06 Jun 2026
  
  in Public
  
  Why normalized ICLR ’24 remains shortcut-prone for vision.
  
  can we ensure that this plot maintains the existing plotting conventions? i think the colors are fine, and i mainly want to make sure font sizing on the labels/axes is consistent
38. skones 06 Jun 2026
  
  in Public
  
  but are used as a cross-venue audit for the shortcut analysis below
  
  delete this, this was legacy. we should state that nips, icml, colm from openreview only release accepts and no rejects, so we ignore it for our openreview set.
39. skones 06 Jun 2026
  
  in Public
  
  OpenReview artifact availability.
  
  good but we are missing '23 notation here we just say 20-23. please fix that in the table
40. skones 06 Jun 2026
  
  in Public
  
  ayesian Optimal Accuracy Derivation
  
  good section, no change needed
41. skones 06 Jun 2026
  
  in Public
  
  Section B.2
  
  this should be Appendix B.2
Local file Local file

Untitled document

3
1. skones 06 Jun 2026
  
  in Public
  
  The
  
  i think this is part of the previous, it shouldnt be made its own paragraph.
2. skones 06 Jun 2026
  
  in Public
  
  rating extremes
  
  i think we mean the lower-end of rating not just the extremes
3. skones 06 Jun 2026
  
  in Public
  
  confidence interval (CI)
  
  i think we measure confidence intervals earlier, so we should define the acronym there and use it here
Local file Local file

Untitled document

12
1. skones 04 Jun 2026
  
  in Public
  
  The training prior is a separate choice. Natural-rate training exposes the model to the conference base rate, but it also makes the native decision rule inherit the majority-reject bias.
  
  the table has messed up its formatting, i think something happened with the numbers?
2. skones 04 Jun 2026
  
  in Public
  
  text budge
  
  max context length
3. skones 04 Jun 2026
  
  in Public
  
  export: we
  
  export. We
4. skones 04 Jun 2026
  
  in Public
  
  s —
  
  , ensuring uniformity
5. skones 04 Jun 2026
  
  in Public
  
  ; for
  
  . For
6. skones 04 Jun 2026
  
  in Public
  
  paper lists
  
  delete
7. skones 04 Jun 2026
  
  in Public
  
  where accepts are exposed only as camera-ready papers after review
  
  where accepts are only camera-ready.
8. skones 04 Jun 2026
  
  in Public
  
  and additional experiments
  
  delete
9. skones 04 Jun 2026
  
  in Public
  
  Because arXiv spans many venues, it contains an ICLR subset (arXiv-ICLR) distinct from the OpenReview-sourced OR-ICLR set.
  
  delete this, and just say for the last sentence: All datasets are balanced 50/50, and Figure 2 shows examples from both.
10. skones 04 Jun 2026
  
  in Public
  
  text and vision prediction interface
  
  remove this
11. skones 04 Jun 2026
  
  in Public
  
  they are only used for measuring rating correlation and review alignment
  
  remove this
12. skones 04 Jun 2026
  
  in Public
  
  ssistance
  
  this should just be section 7 like the above reference
Annotators

skones
Local file Local file

Untitled document

33
1. skones 03 Jun 2026
  
  in Public
  
  e 6 shows that Claude alone underperforms PaperLens-Vision, but adding the calibrated prior improves overall accuracy from 63.0% to 70.6%, surpassing both PaperLens-Vision and Claude individually, and improves alignment with human reviews. This matches the factorization above: PaperLens-Vision anchors the decision while the reviewer supplies the critique. The gain comes mainly from a 13.6% subset of papers flipping Accept→Reject at 77.9% accuracy; Reject→Accept flips are rarer and much less reliable, so the prior mostly corrects false acceptances
  
  something we aren't talking about is the fact that we measure gains in rating alignment as well. note that after the model's predictions flip from A -> R it starts spitting out weaknesses and novelty critiques at a higher rate, agreeing with human critique.
2. skones 03 Jun 2026
  
  in Public
  
  his is consistent with psychology and peer-review studies of evaluative judgment: reviewers may form an early global impression of a manuscript and then selectively construct or emphasize criticisms that make that impression appear analytically grounded (Kunda, 1990; Nisbett and Wilson, 1977; Slovic et al., 2007; Haidt, 2001; Lord et al., 1979), with peer-review evidence showing that reviewer behavior is sensitive to identity cues, network ties, and halo effects rather than to manuscript content alone (Tomkins et al., 2017; Blank, 1991; Peters and Ceci, 1982; Teplitskiy et al., 2018; De Sordi et al., 2020)
  
  this sentence is way too long, you need to cut it down. also no one knows what network ties and halo effects are, please fix that.
3. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
4. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR!
5. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
6. skones 03 Jun 2026
  
  in Public
  
  alism of the artifact
  
  hanging line
7. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
8. skones 03 Jun 2026
  
  in Public
  
  t adds
  
  hanging line
9. skones 03 Jun 2026
  
  in Public
  
  Model scale. 3B/14B differences to 7B, ma
  
  please center the numbers in the columns also make delta Acc on a new line, can we decrease the width of this table, so that the figure B can be larger?
10. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
11. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
12. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR here too
13. skones 03 Jun 2026
  
  in Public
  
  Memory probes confirm these cutoffs: frontier models fail abstract completion, multiple-choice decision recall, and venue identification on this corpus
  
  please read LLamaFactoryAutoReviewer/results/frontier_memory_probe/breakdown.md, please add the results to the appendix section that talks about frontier models and then reference that here. show a summarized table view here that clearly shows that the frontier models consistently fail at the different recall tests. highest on 24. add a short paragraph here.
14. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  this should be OR-ICLR
15. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
16. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR
17. skones 03 Jun 2026
  
  in Public
  
  Vanilla prompts
  
  let's create a new paragraph header called: "Baseline Correction Measures."
18. skones 03 Jun 2026
  
  in Public
  
  Prompting.
  
  let's name this Baseline Prompting.
19. skones 03 Jun 2026
  
  in Public
  
  pattern:
  
  use period here
20. skones 03 Jun 2026
  
  in Public
  
  kewed:
  
  delete colon and replace with period
21. skones 03 Jun 2026
  
  in Public
  
  ; even strongly critical prompts only reach a roughly 50% accept rate, staying near the majority-reject baseline and inaccurate (Appendix F.2)
  
  delete
22. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  here i think we can say OR-ICLR
23. skones 03 Jun 2026
  
  in Public
  
  decisions
  
  because they are inherently calibrated to the test distribution, since our training data has the test years (or something like that).
24. skones 03 Jun 2026
  
  in Public
  
  probes
  
  instead of probes let's call these "models". also let's name the paragraph: "Tabular feature analysis."
25. skones 03 Jun 2026
  
  in Public
  
  Full feature definitions are in Table C.1.
  
  for the figure 4 feature probe audit, you state full feature definitions are in table c.1, but really it should be appendix c like in the main text. also you were supposed to raise the legend [accept/reject] and change the names to OR-ICLR and arXiv. please fix (in the right subfigure) you corrected the table
26. skones 03 Jun 2026
  
  in Public
  
  Table 2. Training-prior bias
  
  the table should state OR-ICLR and arXiv in italics, and then center the columns for T/V
27. skones 03 Jun 2026
  
  in Public
  
  scale-invariant training comparison
  
  hanging line
28. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  OR-ICLR in the dataset coverage image
29. skones 03 Jun 2026
  
  in Public
  
  sources
  
  datasets
30. skones 03 Jun 2026
  
  in Public
  
  OpenReview-ICLR
  
  add (OR-ICLR) here
31. skones 03 Jun 2026
  
  in Public
  
  rendered arXiv papers do not.
  
  hanging line please fix
32. skones 03 Jun 2026
  
  in Public
  
  Reviews, ratings, and meta-reviews are unobserved; we focus on Z rather than modeling them (Gao et al., 2024; Zhu et al., 2025b)
  
  In contrast with prior work (cite), reviews, ratings, and meta-reviews are not regressed on; they are only used for measuring rating correlation and review alignment.
33. skones 03 Jun 2026
  
  in Public
  
  eview assistance
  
  add section reference here
Annotators

skones
Local file Local file

Untitled document

18
1. skones 03 Jun 2026
  
  in Public
  
  our learned paper signal as the observable relaxation p(Z | Xreb) of the ideal p(Z | X)
  
  i think we want more setup. I think we want to state: "As mentioned in the problem formulation, papers are transformed into reviews before reaching the decision. Given a paper (X), over decisions (D) and reviews (R) (aka p(D,R|X)). While the flow indicates that reviews precede the decision, now that we have built a shortcut that predicts the decision without the review, we question whether we can generate more \italics{decision-calibrated} reviews. This is valid by simply rearranging the joint distribution like so:
  
  then the formula.
  
  then state that both factorizations are valid. then we can give a more conceptual understanding: if the conjecture is correct, that much of the decision can be determined by overall presentation quality, then perhaps reviews are simply a means-to-an-end to justify the reviewer's initial reaction.
  
  here's more context about sources that we can cite for this behavior: The strongest terms to search are:
  
  motivated reasoning, post hoc rationalization, biased assimilation, affect heuristic, halo effect, and in peer review specifically prestige bias, reviewer bias, or cognitive bias in peer review.
  
  Here are the most relevant scientific articles.
  
  Kunda — “The Case for Motivated Reasoning”
  
  Ziva Kunda, 1990, Psychological Bulletin
  
  This is probably the canonical article. Kunda argues that motivation can bias the cognitive processes people use to access, construct, and evaluate beliefs. In your case: a reviewer’s initial evaluation of a paper can shape which criticisms feel salient or persuasive.
  
  Best use: cite this for the general mechanism: people reason toward a conclusion they are already inclined to reach.
  
  Nisbett & Wilson — “Telling More Than We Can Know”
  
  Richard Nisbett & Timothy Wilson, 1977, Psychological Review
  
  This is highly relevant to the “reverse reasoning” part. They reviewed evidence that people often lack direct introspective access to the mental processes that caused their judgments, but still produce explanations for those judgments.
  
  Best use: cite this for the idea that reviewers may sincerely report reasons for a judgment without accurately knowing what actually caused the judgment.
  
  Haidt — “The Emotional Dog and Its Rational Tail”
  
  Jonathan Haidt, 2001, Psychological Review
  
  Haidt’s social intuitionist model is about moral judgment, not peer review, but the mechanism maps well: quick intuitive judgment is followed by slower post hoc reasoning. The article explicitly frames reasoning as often coming after intuition rather than causing judgment.
  
  Best use: cite this when you want the phrase “intuition first, reasoning second.”
  
  Lord, Ross & Lepper — “Biased Assimilation and Attitude Polarization”
  
  Charles Lord, Lee Ross & Mark Lepper, 1979, Journal of Personality and Social Psychology
  
  This is a classic study on how people evaluate evidence differently depending on their prior beliefs. People tend to accept congenial evidence more readily and scrutinize uncongenial evidence more harshly.
  
  Best use: cite this for the reviewer behavior where evidence supporting the reviewer’s initial take gets treated as decisive, while contrary evidence gets nitpicked.
  
  Slovic et al. — “The Affect Heuristic”
  
  Paul Slovic, Melissa Finucane, Ellen Peters & Donald MacGregor, 2007, European Journal of Operational Research
  
  This article argues that feelings of “goodness” or “badness” can rapidly guide judgments and decisions. That is very close to what people informally mean by a “paper gestalt”: an overall positive or negative feeling that then colors assessment of specific features.
  
  Best use: cite this for the gut reaction / affective global impression component.
  
  Tomkins, Zhang & Heavlin — “Reviewer Bias in Single- versus Double-Blind Peer Review”
  
  Andrew Tomkins, Min Zhang & William D. Heavlin, 2017, PNAS
  
  This is peer-review-specific. In a large conference-review setting, submissions were reviewed under single-blind and double-blind conditions. The study found that single-blind reviewing advantaged papers by famous authors and high-prestige institutions.
  
  Best use: cite this to show that manuscript judgments are not purely about the paper’s intrinsic content; contextual cues can bias reviews.
  
  Blank — “The Effects of Double-Blind versus Single-Blind Reviewing”
  
  Rebecca Blank, 1991, American Economic Review
  
  This was a randomized experiment at the American Economic Review. It found that reviewers were more critical when author identity was blinded, and acceptance rates were lower under double-blind review.
  
  Best use: cite this as older experimental evidence that review outcomes shift when identity/prestige cues are removed.
  
  Peters & Ceci — “Peer-Review Practices of Psychological Journals”
  
  Douglas Peters & Stephen Ceci, 1982, Behavioral and Brain Sciences
  
  Classic and brutal study: previously published psychology articles were resubmitted to the same journals with altered author/institution information. Most were not recognized, and many were rejected.
  
  Best use: cite this for the unreliability/context-sensitivity of peer review.
  
  Teplitskiy et al. — “The Social Structure of Consensus in Scientific Review”
  
  Misha Teplitskiy et al., 2018
  
  This study analyzed reviews of 7,981 neuroscience manuscripts submitted to PLOS ONE and found that reviewers favored authors closer to them in the co-authorship network. The authors interpret this not just as simple nepotism but as partly reflecting “schools of thought” and substantive evaluative differences.
  
  Best use: cite this if your point is that reviewers’ intellectual/social position can shape what they see as valid or flawed.
  
  De Sordi et al. — “Halo Effect in Peer Review”
  
  José Osvaldo De Sordi et al., 2020
  
  This one is directly about the halo effect in peer review. It explores whether belonging to the same professional field or group can bias article evaluation during review.
  
  Best use: cite this for the closest peer-review-specific phrase to “paper gestalt”: halo effect in peer review.
  
  A good synthesis sentence would be:
  
  The phenomenon can be described as an interaction of affective first impressions, motivated reasoning, and post hoc rationalization: reviewers may form an early global evaluation of a manuscript and then selectively construct or emphasize criticisms that make that evaluation appear analytically grounded.
  
  For a paper or lit review, I’d probably cite Kunda 1990 + Nisbett & Wilson 1977 + Slovic et al. 2007 for the psychology mechanism, then Tomkins et al. 2017 + Peters & Ceci 1982 + Teplitskiy et al. 2018 for peer-review-specific evidence.
  
  the key is that based off this realization, we sought to see whether conditioning frontier coding agents for reviews can 1) improve their decision-prediction and 2) improve their generated reviews' alignment to human reviews.
2. skones 03 Jun 2026
  
  in Public
  
  review
  
  add a specific (review prompt in Appendix ...)
3. skones 03 Jun 2026
  
  in Public
  
  decision
  
  add specific appendix reference here (Appendix ...)
4. skones 03 Jun 2026
  
  in Public
  
  ’25–’26
  
  remove
5. skones 03 Jun 2026
  
  in Public
  
  decisions implicit in the tone of the feedback
  
  predicted decisions poorly track ground-truth outcomes.
6. skones 03 Jun 2026
  
  in Public
  
  Sec.
  
  Section
7. skones 03 Jun 2026
  
  in Public
  
  per-venue
  
  delete this
8. skones 03 Jun 2026
  
  in Public
  
  is provably usefu
  
  empirically proven:
9. skones 03 Jun 2026
  
  in Public
  
  We score the seven venues with enough test papers and drop four (ECCV, COLM, AISTATS, CoRL) whose concentration in the test set is too low to score reliably
  
  i think we should add these back into the plot, but not state their sample contributions.
10. skones 03 Jun 2026
  
  in Public
  
  , and only loosely with how much training data each venue contributes
  
  delete
11. skones 03 Jun 2026
  
  in Public
  
  papers submitted to ICLR
  
  arXiv-ICLR
12. skones 03 Jun 2026
  
  in Public
  
  accept and reject recal
  
  just say R_A and R_R
13. skones 03 Jun 2026
  
  in Public
  
  ICLR
  
  OpenReview-ICLR
14. skones 03 Jun 2026
  
  in Public
  
  ICLR
  
  OpenReview-ICLR
15. skones 03 Jun 2026
  
  in Public
  
  ICLR
  
  OpenReview-ICLR -> OpenReview-ICLR
16. skones 03 Jun 2026
  
  in Public
  
  rates every paper
  
  has human ground-truth ratings for each paper.
17. skones 03 Jun 2026
  
  in Public
  
  right
  
  correct
18. skones 03 Jun 2026
  
  in Public
  
  costs only about two points
  
  only drops performance on OpenReview-ICLR by 2%
Annotators

skones
