Hypothesis

237 Matching Annotations

Nov 2025
Local file Local file

The Evaluation of Technological Activities and Products. An Overlooked Concern in Research Assessment Reform

15
1. metaror 06 Nov 2025
  
  in Public
  
  None.
2. metaror 06 Nov 2025
  
  in Public
  
  The submission presents a rigorous, clearly written study of how technological activities and products (TAP) are evaluated in the Argentine CONICET system. Reviewer 1 highlights the strong methodological design, clear writing, and valuable conceptual contribution in distinguishing TAPs from broader social impact. According to Reviewer 1, the empirical analysis is well executed and effectively demonstrates that TAPs remain marginal in peer review practice, with evaluators facing difficulties due to limited documentation, unclear criteria, and tensions between technological work and conventional academic expectations. Reviewer 2 similarly notes the adopted critical theoretical framework. Results are considered interesting and thought-provoking, and the article is a relevant contribution to the discussions on impact evaluation beyond the Global North.
  
  Both reviewers offer constructive feedback. Reviewer 1 emphasises that TAPs are insufficiently prioritised and documentation is often inconsistent, underscoring the need for clearer evaluative expectations to minimise subjectivity. Reviewer 2 calls for clarification regarding ethical approval, more detailed reporting of qualitative methods, and the use of frameworks such as COREQ to improve transparency. Reviewer 2 also suggests strengthening the integration between the quantitative and qualitative components, reorganising the discussion section to begin with a summary of findings before interpretation, and explicitly acknowledging study limitations. Overall, both reviewers regard the manuscript as a timely and valuable contribution to responsible research assessment, while recommending methodological and structural refinements to further enhance its clarity, transparency, and analytical coherence.
3. metaror 06 Nov 2025
  
  in Public
  
  No
4. metaror 06 Nov 2025
  
  in Public
  
  The article “The Evaluation of Technological Activities and Products. An Overlooked Concern in Research Assessment Reform” examines how technological activities and products (TAPs) are assessed within CONICET’s research career system in Argentina. The study uses 421 peer review reports from applied fields to analyze both the presence of TAPs in evaluation discourse and the challenges they raise in peer review practices.
  
  The text is pleasant to read, and its writing is clear and concise. It is a well-crafted article with a good writing style.
  
  A major strength of the article lies in its solid methodological design. By combining lexicometric analysis with qualitative content interpretation, the authors successfully identify recurring patterns in evaluation language and clarify the difficulties reviewers encounter when addressing TAPs. The findings are presented clearly and supported with meaningful examples from the dataset, which reinforces the empirical credibility of the study
  
  The article also makes a relevant conceptual contribution by arguing that TAPs should be considered analytically distinct from social impact, and therefore require specific attention and evaluative criteria. This argument is well situated within broader international debates on responsible research assessment and the diversification of research outputs beyond publications. The contextualization of the Argentine case within global reform trends further enhances the relevance of the study to diverse audiences
  
  However, certain limitations are acknowledged by the authors and deserve emphasis. Despite the potential value of TAPs for applied research careers, the study shows that reviewers rarely prioritize these activities and often encounter insufficient or inconsistent documentation to support evaluation. These constraints sometimes lead to subjective judgments about originality, relevance, or adequate links to the researcher’s main area of expertise. As the authors note, this results in a lack of shared criteria and introduces variability that undermines the fairness and effectiveness of peer assessment in this domain
  
  Overall, the article contributes valuable evidence to understanding why academic evaluation systems still struggle to recognize diverse forms of applied and technological work. It encourages fur0ther discussion on how to develop clearer procedures, shared expectations, and more robust evaluative frameworks for TAPs—particularly in Global South research contexts where resource constraints and institutional cultures add unique challenges. The work is therefore a pertinent and timely input to ongoing discussions on research assessment reform.
5. metaror 06 Nov 2025
  
  in Public
  
  No competing interests.
6. metaror 06 Nov 2025
  
  in Public
  
  This manuscript focuses on “how Technological Activities and Products (TAPs) are assessed in the individual evaluation of researchers’ trajectories in Argentina.” The authors present a rigorous, relevant, and critical theoretical framework addressing social impact, scientific careers, and the case of Argentina. The method involves an analysis of a database of external reviewer reports corresponding to promotion applications submitted in 2017 and 2018, encompassing 421 reports from three areas. Textual quantitative analysis using software and qualitative content analysis of Section 2 of the experts’ evaluation form were the applied methods. Some issues should be clarified by the authors:
  
  Was there any institutional review board (IRB) approval for the data collection process?
  
  The description of the qualitative methods is insufficient. I strongly recommend using a checklist for qualitative research (e.g., COnsolidated criteria for REporting Qualitative research — COREQ, available at https://onlinelibrary.wiley.com/pb-assets/assets/17416612/COREQ_Checklist-1556513515737.pdf) to ensure transparency, rigor, and reproducibility of the results.
  
  Although the quantitative and qualitative results are interesting and thought-provoking, a stronger connection between these two methodological approaches is recommended. Employing a mixed-method strategy could be beneficial.
  
  4. The Discussion section needs reorganization. The first paragraph should summarize the quantitative and qualitative findings, while subsequent paragraphs should interpret and explain each result using a theoretical framework. Considering that this is an empirical study, the section should not include a discussion of the literature revised, as currently presented in the first paragraph. Finally, the limitations of the study must be acknowledged.
7. metaror 06 Nov 2025
  
  in Public
  
  We thank the editor and reviewers for their thoughtful comments. We believe they will substantially strengthen the manuscript and clarify our arguments.
  
  We appreciate R1’s positive assessment of the quality and relevance of our work. Regarding the closing remark on inadequate documentation and guidance, we agree that this has been a critical issue. Fortunately, CONICET has recently issued additional guidance (available here in Spanish), which we view as a positive development; we look forward to observing its effects on evaluation practices.
  
  R2 raises important and thought-provoking points. On IRB approval: in Argentina, document-based research in the social sciences does not require IRB review. Nevertheless, before accessing the materials we discussed the ethical implications of the study with CONICET authorities and signed an agreement specifying how the data would be used and committing to anonymize any excerpts made public. We also limited our analysis to closed promotion cases (final decisions, no further appeals possible) to ensure our research would not affect ongoing processes.
  
  We thank R2 for directing us to the COREQ guideline. We understand it is intended for reporting focus groups and interviews; we will review it and consider which elements can be adapted to improve the reporting of our document-based study.
  
  We also agree that a more explicit articulation between the qualitative and quantitative components will help present the results in a more integrated way, and we will work toward that aim. Concerning the structure of the discussion, our intention is not to introduce new concepts but to connect our findings to prior scholarship already cited in previous sections and to situate them within the broader global conversation. Finally, we concur that the study’s limitations should be stated more clearly, given that our analysis is restricted to a single career system and three disciplinary fields.
8. metaror 06 Nov 2025
  
  in Public
  
  Vasen, F., Sarthou, N. F., & Romano, S. A. (2025, July 8). The Evaluation of Technological Activities and Products. An Overlooked Concern in Research Assessment Reform. https://doi.org/10.31235/osf.io/6j3nh_v1
9. metaror 06 Nov 2025
  
  in Public
  
  July 8, 2025
10. metaror 06 Nov 2025
  
  in Public
  
  October 29, 2025
11. metaror 06 Nov 2025
  
  in Public
  
  November 2, 2025
12. metaror 06 Nov 2025
  
  in Public
  
  Authors:
  
  Federico Vasen fvasen@rec.uba.ar
  
  Nerina Sarthou nsarthou@fch.unicen.edu.ar
  
  Silvina Romano sromano@untdf.edu.ar
13. metaror 06 Nov 2025
  
  in Public
  
  https://osf.io/preprints/socarxiv/6j3nh_v1
14. metaror 06 Nov 2025
  
  in Public
  
  10.31235/osf.io/6j3nh_v1
15. metaror 06 Nov 2025
  
  in Public
  
  The Evaluation of Technological Activities and Products. An Overlooked Concern in Research Assessment Reform
Annotators

metaror
Oct 2025
osf.io osf.io

Establishing trust in automated reasoning

17
1. metaror 23 Oct 2025
  
  in Public
  
  As a statistician, I am in strong agreement on the widespread inappropriate use of statistical inference (page 2) and the importance of software. I also strongly agree that “independent critical inspection [is] particularly challenging” (page 3). I also strongly agree that “The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission”, as this is everyone’s problem and nobody’s problem.
  
  I also agree that automation has encouraged standardisation and I have personally supported standardisation because some practices are so bad that many authors need to be “standardised”. However, I’ve also felt frustration at the sometimes fussy requirements when uploading R packages to CRAN (https://cran.r-project.org/). Similarly, some blanket changes from CRAN seem pedantic. There’s likely a balance between reducing poor practice and becoming too prescriptive.
  
  In terms of transparency (section 2.4) I did think about the “Verbose=TRUE” option that I sometimes see in R. I tend to turn this on, as it’s good to see more of the workings, but perhaps the default is off? I did look at some packages using the google search: “verbose site:cran.r-project.org/web/packages”. I was also reminded of the difference between Bayesian and frequentist statistical modelling. Frequentist modelling often uses maximum likelihood to create parameter estimates, which usually runs quickly to create the estimates. In contrast, Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model, whereas the maximum likelihood simply finds any peak. Frustratingly, I often get more push back from reviewers when using Bayesian methods, whereas in my opinion it should be the other way around as the Bayesian estimates have shown far more of the inner workings.
  
  Some reflection on the growing use of AI to write software may be worthwhile. Presumably this could be more standardised, but there are other concerns. Using automation to check code could also be worthwhile.
  
  For section 3, I thought that more sharing of code would mean “more eyeballs”, but the sharing needs to be done in FAIR way.
  
  I wondered if highly-used software should get more scrutiny. Peer review is a scarce resource, so is likely better directed towards high use software. Andrew Gelman recently put forward a similar argument for checking published papers when they reach 250 citations: https://statmodeling.stat.columbia.edu/2025/02/26/pp/.
  
  I agreed with the need for effort (page 19) and wondered if this paper could call for more effort.
  
  Minor comments:
  
  typo “asses” on page 7.
  
  “supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.
  
  I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)
  
  Can badges on github about downloads and maturity help (page 7)? Although, far from all software is on github.
2. metaror 23 Oct 2025
  
  in Public
  
  none
3. metaror 23 Oct 2025
  
  in Public
  
  This summary article does not present new data or experiments but instead takes a broad look at automated reasoning and software. Reviewer #1 thought the article needed much more detail, including citations, examples, screenshots and figures. They were concerned about strong generalisations that were lacking evidence and have provided places where they wanted these details. Reviewer #2 considers the differences between reviewability and the practicalities of reviewing everything, and how being easily able to build-on other software acts as a kind of reproducibility. In my own editorial review, I generally enjoyed reading the paper and it prompted some interesting thoughts on trade-offs with standardisation and the level of detail shown to users for statistical code.
4. metaror 23 Oct 2025
  
  in Public
  
  none
5. metaror 23 Oct 2025
  
  in Public
  
  Thank you for submitting this paper. I think the paper requires substantial, major revisions to be published. Throughout the paper I noted many instances where references or examples would help make the intent clear. I also think the message of the paper would benefit from several figures to demonstrate workflows or ideas. The figures presented are essentially tables, and I think the message could be made clearer for the reader if they were presented as flow charts or at least with clear numbering to hook the ideas to the reader - e.g., Figures 1 & 2 would benefit from having numbers on the key ideas.
  
  The paper is lacking many instances of citation, and at times reads as though it is an essay delivering an opinion. I'm not sure if this is the type of article that the journal would like, but two examples of sentences missing citations are:
  
  "Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection." (Introduction, page 2)
  
  "A large number of examples cited in this context involves faulty software or inappropriate use of software" (Introduction, page 3)
  
  Two examples of sentences missing examples are:
  
  Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete (in Mature vs. experimental software, page 7). Could the author provide more examples of what "experimental software" is? There is also consistent use of universal terms like "...is rarely up to date or complete", which would be better phrased as "is often not up to date or complete"
  
  There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification.
  
  Overall the paper introduces many new concepts, and I think it would greatly benefit from being made shorter and more concise, with adding some key figures for the reader to refer back to to understand these new ideas. The paper is well written, and it is clear the author is a great writer, and has put a lot of thought into the ideas. However it is my opinion that because these ideas are so big and require so much unpacking, they are also harder to understand. The reader would benefit from having more guidance to come back to understand these ideas.
  
  I hope this review is helpful to the author.
  
  Review comments
  
  Introduction
  
  Highlight [page 2]: Ever since the beginnings of organized science in the 17th century, researchers are expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.
  
  and Note [page 2]: Both of these statements feel like they should have some peer review, or reference on them, I believe. What was the beginnings of organised science in the 1600s? Why since the 1950s? Why not sooner? What happened then?
  
  Highlight [page 2]: Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.
  
  and Note [page 2]: I would expect at least a couple of citations here, e.g., Stodden et al., https://www.pnas.org/doi/abs/10.1073/pnas.1708290115
  
  Highlight [page 2]: In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation - and Note [page 2]: Again, it does feel like it would be great to acknowledge research in this space.
  
  Highlight [page 2]: But then, scientists mostly abandoned doubting.
  
  and Note [page 2]: This feels like an essay, where show me the evidence for where you can say something like this?
  
  Highlight [page 2]: Automation bias
  
  and Note [page 2]: What is automation bias?
  
  Highlight [page 3]: A large number of examples cited in this context involves faulty software or inappropriate use of software
  
  and Note [page 3]: Can you provide some examples of the examples cited that you are referring to here?
  
  Highlight [page 3]: A particularly frequent issue is the inappropriate use of statistical inference techniques.
  
  and Note [page 3]: Please provide citations to these frequent issues.
  
  Highlight [page 3]: The Open Science movement has made a first step towards dealing with automated reasoning in insisting on the necessity to publish scientific software, and ideally making the full development process transparent by the adoption of Open Source practices - and Note [page 3]: Could you provide an example of one of these Open Science movements?
  
  Highlight [page 3]: Almost no scientific software is subjected to independent review today.
  
  and Note [page 3]: How can you justify this claim?
  
  Highlight [page 3]: In fact, we do not even have established processes for performing such reviews
  
  and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical-software-review-book, and also maintain a very clear process of software review: https://ropensci.org/software-review/
  
  Highlight [page 3]: as I will show
  
  and Note [page 3]: How will you show this?
  
  Highlight [page 3]: is as much a source of mistakes as defects in the software itself
  
  and Note [page 3]: Again, this feels like a statement of fact without evidence or citation.
  
  Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.
  
  and Note [page 3]: The same can be said of assumptions for equations and mathematics - the problem here is dealing with abstraction of complexity and the potential unintended consequences.
  
  Highlight [page 4]: the preservation of epistemic diversity
  
  and Note [page 4]: Please define epistemic diversity
  
  Reviewability of automated reasoning systems
  
  Highlight [page 5]: The five dimensions of scientific software that influence its reviewability.
  
  and Note [page 5]: It might be clearer to number these in the figure, and also I might suggest changing the “convivial” - it’s a pretty unusual word?
  
  Wide-spectrum vs. situated software
  
  Highlight [page 6]: In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation
  
  and Note [page 6]: I’m not very clear on this example - can you provide an example of a “domain library” or “domain tool” ?
  
  Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.
  
  and Note [page 6]: I’m not sure I agree it is always smaller and simpler - the custom code for a new method could be incredibly complicated.
  
  Highlight [page 6]: Domain tools and libraries
  
  and Note [page 6]: Can you give an example of this?
  
  Mature vs. experimental software
  
  Highlight [page 7]: Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete
  
  and Note [page 7]: Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”
  
  Highlight [page 7]: An extreme case of experimental software is machine learning models that are constantly updated with new training data.
  
  and Note [page 7]: Such as…
  
  Highlight [page 7]: interlocutor
  
  and Note [page 7]: suggest “middle man” or “mediator”, ‘interlocutor’ isn’t a very common word
  
  Highlight [page 7]: A grey zone
  
  and Note [page 7]: I think it would be helpful to discuss black and white zones before this.
  
  Highlight [page 7]: The libraries of the scientific Python ecosystem
  
  and Note [page 7]: Do you mean SciPy? https://scipy.org/. Can you provide an example of the frequent changes that break backward compatibility?
  
  Highlight [page 7]: too late that some of their critical dependencies are not as mature as they seemed to be
  
  and Note [page 7]: Again, can you provide some evidence for this?
  
  Highlight [page 7]: The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.
  
  and Note [page 7]: Again this feels like an assertion without evidence. Is this an essay, or a research paper?
  
  Convivial vs. proprietary software
  
  Highlight [page 8]: Convivial software [Kell 2020], named in reference to Ivan Illich’s book “Tools for conviviality” [Illich 1973], is software that aims at augmenting its users’ agency over their computation
  
  and Note [page 8]: It would be really helpful if the author would define the word, “convivial” here. It would also be very useful if they went on to give an example of what they meant by: “…software that aims at augmenting its users’ agency over their computation.” How does it augment the users agency?
  
  Highlight [page 8]: Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]
  
  and Note [page 8]: Could you provide an example of what makes “vernacular developers” different, or just what they mean by this term?
  
  Highlight [page 8]: which Illich has described in detail
  
  and Note [page 8]: Should this have a citation to Illich then in this sentence?
  
  Highlight [page 8]: what has happened with computing technology for the general public
  
  and Note [page 8]: Can you give an example of this. Do you mean the rise of Apple and Windows? MS Word? Facebook? A couple of examples would be really useful to make this point clear.
  
  Highlight [page 8]: tech corporations
  
  and Note [page 8]: Suggest “tech corporations” be “technology corporations”.
  
  Highlight [page 8]: Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.
  
  and Note [page 8]: Can you provide an example of the alternative here, what would be the way to avoid this trap - use software such as Octave, or?
  
  Highlight [page 8]: Historically, the Free Software movement was born in a universe of convivial technology.
  
  and Note [page 8]: If it is historic, can you please provide a reference to this?
  
  Highlight [page 8]: most of the software they produced and used was placed in the public domain
  
  and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.
  
  Highlight [page 8]: as they saw legal constraints as the main obstacle to preserving conviviality
  
  and Note [page 8]: Again, these are conjectures that are lacking a reference or example, can you provide some examples of references of this?
  
  Highlight [page 9]: Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021].
  
  and Note [page 9]: Can you summarise what makes it diﬀicult to install Open Source Software? Again, this statement feels like it is making a strong generalisation without clear evidence to support this. The article by Courtès (https://hpc.guix.info/blog/2021/09/whats-in-a-package/), actually notes that it’s straightforward to install PyTorch via pip, but using an alternative package manager causes diﬀiculty. The point you are making here seems to be that building and installing most open source software is almost prohibitive, but I think you’ve given strong evidence for this claim, and I don’t understand how this builds into your overall argument.
  
  Highlight [page 9]: It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.
  
  and Note [page 9]: Can you give an example of how it survives in these communities?
  
  Highlight [page 9]: FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement
  
  and Note [page 9]: Can you provide some evidence to back this statement up?
  
  Highlight [page 9]: the traditional values of scientific research.
  
  and Note [page 9]: Can you state what you mean by “traditional values of scientific research”
  
  Highlight [page 9]: always been convivial
  
  and Note [page 9]: Can you provide a further explanation of what makes them convivial?
  
  Transparent vs. opaque software
  
  Highlight [page 9]: Transparent software
  
  and Note [page 9]: It might be useful to explain a distinction between transparent and open software - or to perhaps open with a statement for why we are talking about transparent and opaque software.
  
  Highlight [page 9]: Large language models are an extreme example.
  
  and Note [page 9]: Based on your definition of transparent software - every action produces a visible result. If I type something into an LLM and get an immediate and visible result, how is this different? It is possible you are stating that the behaviour is able to be easily interpreted, or perhaps the behaviour is easy to understand?
  
  Highlight [page 10]: Even highly interactive software, for example in data analysis, performs nonobvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness.
  
  and Note [page 10]: Could you give a small example of this?
  
  Highlight [page 10]: It is much easier to develop trust in transparent than in opaque software.
  
  and Note [page 10]: Can you state why it is easier to develop this trust?
  
  Highlight [page 10]: but also less important
  
  and Note [page 10]: Can you state why it is less important?
  
  Highlight [page 10]: even a very weak trustworthiness indicator such as popularity becomes suﬀicient
  
  and Note [page 10]: becomes suﬀicient for what? Reviewing? Why does it become suﬀicient?
  
  Highlight [page 10]: This is currently a much discussed issue with machine learning models,
  
  and Note [page 10]: Given it is currently much discussed, could you link to at least 2 research articles discussing this point?
  
  Highlight [page 10]: treated extensively in the philosophy of science.
  
  and Note [page 10]: Given that is has been treated extensively, can you please provide some key references after this statement? You do go on to cite one paper, but it would be helpful to mention at least a few key articles.
  
  Size of the minimal execution environment
  
  Highlight [page 11]: The importance of this execution environment is not suﬀiciently appreciated by most researchers today, who tend to consider it a technical detail
  
  and Note [page 11]: This statement is a bit of a sweeping generalisation - why is it not suﬀiciently appreciated? What evidence do you have of this?
  
  Highlight [page 11]: Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond
  
  and Note [page 11]: Where have they been only recently recognised?
  
  Highlight [page 11]: However, they have not yet found their way into mainstream computational science.
  
  and Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
  
  Analogies in experimental and theoretical science
  
  Highlight [page 12]: Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost
  
  and Note [page 12]: Can you provide an example of this?
  
  Highlight [page 12]: cables
  
  and Note [page 12]: What do you mean by a cable? As in a computer cable? An electricity cable?
  
  Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.
  
  and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experiences microscopists not identifying defects? Similarly, why can’t there be examples of domain expert or professional programmer/computer scientist identifying errors. Don’t unit tests help protect us against some of our errors? Granted, they aren’t bullet proof, and perhaps act more like guard rails.
  
  Highlight [page 13]: where “traditional” means not relying on any form of automated reasoning.
  
  and Note [page 13]: Can you give an example of what a “traditional” scientific model or theory
  
  Improving the reviewability of automated reasoning systems
  
  Highlight [page 14]: Figure 2: Four measures that can be taken to make scientific software more trustworthy.
  
  and Note [page 14]: Could the author perhaps instead call these “four measures” or perhaps give them a better name, and number them?
  
  Review the reviewable
  
  Highlight [page 14]: mature wide-spectrum software
  
  and Note [page 14]: Can you give an example of what “mature wide-spectrum software” is?
  
  Highlight [page 15]: The main diﬀiculty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.
  
  and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf
  
  Science vs. the software industry
  
  Highlight [page 15]: Many computers, operating systems, and compilers were designed specifically for the needs of scientists.
  
  and Note [page 15]: Could you give an example of this? E.g., FORTRAN? COBAL?
  
  Highlight [page 15]: Today, scientists use mostly commodity hardware
  
  and Note [page 15]: Can you explain what you mean by “commodity hardware”, and give an example.
  
  Highlight [page 15]: even considered advantageous if it also creates a barrier to reverse- engineering of the software by competitors
  
  and Note [page 15]: Can you give an example of this?
  
  Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for
  
  and Note [page 15]: What about software like SPSS/STATA/SAS - surely many many industries, and also researchers will pay for software like this that is considered mature?
  
  Emphasize situated and convivial software
  
  Highlight [page 16]: a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.
  
  and Note [page 16]: Could you give an example of what this might look like practically? Are you saying things like SciPy would be restructured into many separate modules, or?
  
  Highlight [page 16]: In terms of FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules.
  
  and Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?
  
  Highlight [page 17]: a universe is very
  
  and Note [page 17]: Perhaps this could be “would be very different” - since this doesn’t yet exist, right?
  
  Highlight [page 17]: Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.
  
  and Note [page 17]: I think some kind of schematic or workflow to compare existing practices to this new practice would be really useful to articulate these points. I also think this new method of development you are proposing should have a concrete name.
  
  Highlight [page 17]: Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team.
  
  and Note [page 17]: Perhaps an example of screenshot of a diff could be used to demonstrate that we can make these changes between two branches/commits, but comparing multiple is challenging?
  
  Make scientific software explainable
  
  Highlight [page 18]: An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing
  
  and Note [page 18]: Is this similar to concepts such as “X-AI” or “X-ML” - that is, “Explainable” Artificial Intelligence or Machine Learning?
  
  Highlight [page 18]: Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023],
  
  and Note [page 18]: It appears that you have introduced “Glamorous Toolkit” as an example of these three principles? It feels like it should be introduced earlier in this paragraph?
  
  Highlight [page 18]: In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks
  
  and Note [page 18]: I think it would be very beneficial to show screenshots of what the author means - while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would good to see exactly what the author means when they discuss these examples.
  
  Use Digital Scientific Notations
  
  Highlight [page 18]: There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification
  
  and Note [page 18]: Can you give an example of these techniques?
  
  Highlight [page 18]: The use of these tools is, for now, reserved to software that is critical for safety or security,
  
  and Note [page 18]: Again, could you give an example of this point? Which tools, and which software is critical for safety or security?
  
  Highlight [page 19]: formal specifications
  
  and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.
  
  Highlight [page 19]: All of them are much more elaborate than the specification of the result they produce. They are also rather opaque.
  
  and Note [page 19]: It isn’t clear to me how these are opaque - if the algorithm is defined, it can be understood, how is it opaque?
  
  Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]
  
  and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.
  
  Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.
  
  and Note [page 19]: Is an example of this test drive development?
  
  Highlight [page 19]: A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction.
  
  and Note [page 19]: Again, I think a good figure here would be very helpful in articulating this clearly.
  
  Highlight [page 19]: My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features such as scopes that do not exist in natural language, nor in mathematical notation.
  
  and Note [page 19]: Could we see an example of what this might look like?
  
  Conclusion
  
  Highlight [page 20]: Situated software is easy to recognize.
  
  and Note [page 20]: Could you provide some examples?
  
  Highlight [page 20]: Examples from the reproducibility crisis support this view
  
  and Note [page 20]: Can you provide some example papers that you mention here?
  
  Highlight [page 21]: The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.
  
  and Note [page 21]: Again, I think it would be very insightful for the reader to have a clear figure to rest these ideas upon.
  
  Highlight [page 21]: has been neglected by research institutions all around the world
  
  and Note [page 21]: I do not think this is true - could you instead say “neglected my most/many” perhaps?
6. metaror 23 Oct 2025
  
  in Public
  
  none
7. metaror 23 Oct 2025
  
  in Public
  
  In his article Establishing trust in automated reasoning (Hinsen, 2023) Hinsen argues that much of current scientific software lacks reviewability. Because scientific software has become such a central part of many scientific endeavors he worries that unreviewed software might contain mistakes which will never be spotted and consequently taint the scientific record. To illustrate this worry he cites issues with reproductions in different fields of science, which are often subsumed under the umbrella term of reproducibility crises. These crises, though not uncontested, have varied sources. In the field of social psychology reproducibility issues can for example often be traced to errors in statistical analyses, while shifting baselines and data leakage lead to problems in ML. Hinsen is only concerned with errors in scientific software. He suggests that potential errors could be spotted more easily if scientific software would be more reviewable. Thus he proposes five criteria against which reviewability could be judged. I will not discuss them in detail in this commentary and refer the interested reader to Hinsen (2023, section 2) for an extensive discussion. I note though, that the five criteria are meant to ensure an ideal type of reproducibility which Hinsen defines as follows: “Ideally, each piece of software should perform a well-defined computation that is documented in sufficient detail for its users and verifiable by independent reviewers.” (Hinsen, 2023, p.2). I take the upshot of these criteria to be that one could assert the reviewability of a piece of software before actually doing the review. They could thus function, perhaps contrary to Hinsen’s open science convictions, as a gatekeeping device in a peer review process for software. An editor could ”desk reject” software for not fulfilling the criteria before even sending it out to potential reviewers. If I am correct in this interpretation then we should entertain the same caution with them as we do with preregistration.
  
  To be fair, Hinsen envisions a software review process which differs from current peer review with its acknowledged defects in several ways. He says, ”Developing suitable intermediate processes and institutions for reviewing such software is perhaps possible, but I consider it scientifically more appropriate to restructure such software into a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.” (Hinsen, 2023, p.16).
  
  Convivial software in turn is supposed to augment ”its users’ agency over their computation.” (Hinsen, 2023, p.16). This gives us a hint about the kind of user Hinsen has in mind – it is the software developer as a user. His concept of reviewability aims to make software transparent only to this kind of user (see Hinsen, 2023, p.20). In one of his many comparisons of scientific software to science, he notes that ”[. . . ] the main intellectual artifacts of science, i.e. theories and models, have always been convivial.” (Hinsen, 2023, p.9) and we can guess that he wants this to be the case for software too. But, if at all, scientific theories and models only have ever been convivial for scientists. The comparison also works the other way around, science as much as software is heavily fragmented into modules (disciplines). Scientists have always relied on the results of other scientists – they often have done and still do so without reviewing them. Has this hindered progress? I think one would be hard pressed to answer such a question in general for science, and perhaps it is the same for scientific software.
  
  As Hinsen admits formal peer review is a quite novel addition to scientific methodology, being enforced on a larger scale only since the past fifty years or so. Science has progressed many years without, so we could ask why scientific software should not do likewise. Hinsen’s answer of course has to do with how he grades such software with respect to his reviewability criteria – obviously, most of it scores badly. Most scientific software is neither reviewed nor reviewable, Hinsen claims. This he considers a defect, because only reviewable software has to potential of being reviewed. Many practical considerations he discusses actually speak against the hope that most reviewable software will actually be reviewed. Still, without reviewability, it is hard, if not impossible, to spot mistakes. A case that was recently brought to my attention emphasizes this point. In Beheim et al. (2021) it is pointed out that a statistical analysis imputed missing values in an archaeo-historical database with the number 0. But for the statistical model (and software!) in use 0 had a different meaning than not available. This casts doubt on the conclusion that was drawn from the model. Beheim et al. were only able to spot this assumption because the code and data were available for review1. Cases like this abound and are examples for invisible programming values that philosopher James Moor discussed in the context of computer ethics (see Moor, 1985, The invisibility factor). Hinsen calls such values “tacit assumptions made by software developers” (Hinsen, 2023, p.3). We might speculate though, what would have happened if this questionable result had been incorporated into the scientific canon. Would later scientists really have continued building on it without ever realizing their shaky foundations? Or would the whole edifice have had to face the tribunal of experience at some point and crumbled? Perhaps the originating problem would never have been found and a whole research program would have been abandoned, perhaps a completely different part would have been blamed and excised – hard to say!
  
  But maybe reviewability can also serve a different aim than establishing trust in the results of certain pieces of scientific software. Perhaps, it facilitates building on and incorporating pieces of such software in other projects. Its purpose could be more instrumental than epistemic. Although Hinsen seems to worry more about the epistemic problems coming with lack of reviewability, many points he makes implicitly deal with practical problems of software engineering. Whoever has fought against jupyter notebooks with legacy python requirements can immediately relate to his wish for keeping the execution environment as small as possible. For Hinsen software is actually defined by its execution environment (Hinsen, 2023, p.11), thus the complete environment must be available for its reviewability2. Software cannot be really seen as a separate entity and a review always reviews the whole environment. Analogously to Quine-Duhem we could call this situation review holism. But review holism might be less problematic than its scientific cousin suggests. We might not actually need to explicitly review the whole system. Perhaps it is sufficient if we achieve frictionless reproducibility (see Donoho, 2024), that is, other people can more or less easily incorporate and built on the software in question. Firstly if other software which incorporates the software in question works, it already is a type of successful reproduction. Secondly, the process of how software evolves might weed out any major errors, whatever errors remain are perhaps just irrelevant. In all fairness it has to be said that Hinsen does not think this is the case with current software. He argues that ”Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.” (Hinsen, 2023, p.13). But if that is the case then Hinsen’s later recourse to reliabilist-style justifications for software correctness is blocked too. We are in a situation for which the late Humphreys coined the term strange error (Rathkopf & Heinrichs, 2023, p.5). Strange errors are a challenge for any reliabilist account of justification because their magnitude can easily overwhelm arduously collected reliability assurances. If computational reliabilism was just reliabilism, and Hinsen seems to take it as such3, it would suffer from this problem too. But computational reliabilism has an additional internalist component, which explicitly allows for the whole toolbox of ”rationalist” software verification methods. If possible we should learn something about our tools other than their mere reliability. As Hacking said, ”[To understand] whether one sees through a microscope, one needs to know quite a lot about the tools.” (Hacking, 1981, p.135).
  
  I would go so far and say that, if available, internalist justificiations are preferable to reliabilistic guarantees. It is only the case that often they are not and then we might content ourselves with the guarantees reliabilism provides. I said might content here, because such guarantees are unlikely to satisfy the skeptic. Obviously strange errors are always a possibility and no finite observation of correct software behaviour can completely rule them out. But in practice such concerns tend to fade over time, although they provide opportunity for unchecked philosophically skepticism. Many discussions about software opacity feed from such skepticism and this is what I tried to balance with computational reliabilism. In this spirit computational reliabilism was an attempt to temper theoretical skeptics in philosophy, not to give normative guidance to software engineering practice. My view was always that practice has the last say over philosophical concerns. If the emerging view in software engineering practice now is that more skepticism is appropriate, I will happily concur. But I should like to remind the practitioner that evidence for such skepticism has to be given in practice too, mere theoretical possibilities are not sufficient to establish it.
  
  Reviewability does not mean reviewed. And only reviews can give us trust - or so we might think. As Hinsen acknowledges we should not expect that a majority of scientific software will ever be reviewed. Does this mean we cannot trust the results from such software? Above I tried to sketch a way out of this conundrum: We can view reviewability as advocated by Hinsen as a way to enable frictionless reproducibility, which in turn lets us built upon software, incorporate it in our own projects and use its results. As long as it works in a practically fulfilling way, this might be all the reviewing we need.
  
  Notes
  
  1A statistician once told me, that one glance at the raw data of this example immediately made clear to him that whatever problem there was with imputation, the data would never have supported the desired conclusions in any way. One man’s glance is another’s review.
  
  2Hinsen’s definition of software closely parallels that of Moor, who argued that computer programs are a relation between a computer, a set of instructions and an activity (Moor, 1978, p.214).
  
  3Hinsen characterizes computational reliabilism as follows, ”As an alternative source of trust, they propose computational reliabilism, which is trust derived from the experience that a computational procedure has produced mostly good results in a large number of applications.” (Hinsen, 2023, p.10)
  
  References
  
  Beheim, B., Atkinson, Q. D., Bulbulia, J., Gervais, W., Gray, R. D., Henrich, J., Lang, M., Monroe, M. W., Muthukrishna, M., Norenzayan, A., Purzy- cki, B. G., Shariff, A., Slingerland, E., Spicer, R., & Willard, A. K. (2021). Treatment of missing data determined conclusions regarding moralizing gods. Nature, 595 (7866), E29–E34. https://doi.org/10.1038/s41586-021-03655-4
  
  Donoho, D. (2024). Data Science at the Singularity. Harvard Data Science Re- view, 6 (1). https://doi.org/10.1162/99608f92.b91339ef
  
  Hacking, I. (1981). Do We See Through a Microscope? Pacific Philosophical Quarterly, 62 (4), 305–322. https://doi.org/10.1111/j.1468-0114.1981.tb00070.x
  
  Hinsen, K. (2023, July). Establishing trust in automated reasoning. https:// doi.org/10.31222/osf.io/nt96q
  
  Moor, J. H. (1978). Three Myths of Computer Science. The British Journal for the Philosophy of Science, 29 (3), 213–222. https://doi.org/10.1093/bjps/29.3.213
  
  Moor, J. H. (1985). What is computer ethics? Metaphilosophy, 16 (4), 266–275. https://doi.org/10.1111/j.1467-9973.1985.tb00173.x
  
  Rathkopf, C., & Heinrichs, B. (2023). Learning to Live with Strange Error: Be- yond Trustworthiness in Artificial Intelligence Ethics. Cambridge Quarterly of Healthcare Ethics, 1–13. https://doi.org/10.1017/S0963180122000688
8. metaror 23 Oct 2025
  
  in Public
  
  Dear editors and reviewers, Thank you for your careful reading of my manuscript and the detailed and insightful feedback. It has contributed significantly to the improvements in the revised version. Please find my detailed responses below.
  
  1 Reviewer 1
  
  Thank you for this helpful review, and in particular for pointing out the need for more references, illustrations, and examples in various places of my manuscript. In the case of the section on experimental software, the search for examples made clear to me that the label was in fact badly chosen. I have relabeled the dimension as “stable vs. evolving software”, and rewritten the section almost entirely. Another major change motivated by your feedback is the addition of a figure showing the structure of a typical scientific software stack (Fig. 2), and of three case studies (section 2.7) in which I evaluate scientific software packages according to my five dimensions of reviewability. The discussion of conviviality (section 2.4), a concept that is indeed not widely known yet, has been much expanded. I have followed the advice to add references in many places. I have been more hesitant to follow the requests for additional examples and illustrations, because of the inevitable conflict with the equally understandable request to make the paper more compact. In many cases, I have preferred to refer to examples discussed in the literature. A few comments deserve a more detailed reply:
  
  Introduction
  
  Highlight [page 3]: In fact, we do not even have established processes for performing such reviews
  
  and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical software-review-book, and also maintain a very clear process of software review: https://ropensci.org/software-review/
  
  As I say in the section “Review the reviewable”, these reviews are not independent critical examination of the software as I define it. Reviewers are not asked to evaluate the software’s correctness or appropriateness for any specific purpose. They are expected to comment only on formal characteristics of the software publication process (e.g. “is there a license?”), and on a few software engineering quality indicators (“is there a test suite?”).
  
  Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.
  
  and Note [page 3]: The same can be said of assumptions for equations and mathematics- the problem here is dealing with abstraction of complexity and the potential unintended consequences.
  
  Indeed. That’s why we need someone other than the authors to go through mathematical reasoning and verify it. Which we do.
  
  Reviewability of automated reasoning systems
  
  Wide-spectrum vs. situated software
  
  Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.
  
  and Note [page 6]: I’m not sure I agree it is always smaller and simpler- the custom code for a new method could be incredibly complicated.
  
  The comparison is between situated software and more generic software performing the same operation. For example, a script reading one specific CSV file compared to a subroutine reading arbitrary CSV files. I have yet to see a case in which abstraction from a concrete to a generic function makes code smaller or simpler.
  
  Convivial vs. proprietary software
  
  Highlight [page 8]: most of the software they produced and used was placed in the public domain
  
  and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.
  
  Software distribution in science was well organized long before the Internet, it was just slower and more expensive. Both decks of punched cards and magnetic tapes were routinely sent by mail. The earliest organized software distribution for science I am aware of was the DECUS Software Library in the early 1960s.
  
  Size of the minimal execution environment
  
  Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
  
  I have looked for quantitative studies on software use in science that would allow to give a precise meaning to “mainstream”, but I have not been able to find any. Based on my personal experience, mostly with teaching MOOCs on computational science in which students are asked about the software they use, the most widely used platform is Microsoft Windows. Linux is already a minority platform (though overrepresented in computer science), and Nix users are again a small minority among Linux users.
  
  Analogies in experimental and theoretical science
  
  Highlight [page 13]: which an experienced microscopist will recognize. Soft ware with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diag- nose easily.
  
  and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experiences microscopists not identifying defects? Similarly, why can’t there be examples of domain expert or professional program mer/computer scientist identifying errors. Don’t unit tests help protect us against some of our errors? Granted, they aren’t bullet proof, and perhaps act more like guard rails.
  
  There are probably cases of microscopists not noticing defects, but my point is that if you ask them to look for defects, they know what to do (and I have made this clearer in my text). For contrast, take GROMACS (one of my case studies in the revised manuscript) and ask either an expert programmer or an experienced computational biophysicist if it correctly implements, say, the AMBER force field. They wouldn’t know what to do to answer that question, both because it is ill-defined (there is no precise definition of the AMBER force field) and because the number of possible mistakes and symptoms of mistakes is enormous. I have seen a protein simulation program fail for proteins whose number of atoms was in a narrow interval, defined by the size that a compiler attributed to a specific data structure. I was able to catch and track down this failure only because a result was obviously wrong for my use case. I have never heard of similar issues with microscopes.
  
  Improving the reviewability of automated reasoning systems
  
  Review the reviewable
  
  Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.
  
  and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf
  
  This example is about superficial reviews in the context of career evaluation. Other institutions have similar processes. As far as I know, none of them ask reviewers to look at the actual code and comment on its correctness or its suitability for some specific purpose.
  
  Science vs. the software industry
  
  Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for
  
  and Note [page 15]: What about software like SPSS/STATA/SAS- surely many many industries, and also researchers will pay for software like this that is considered mature?
  
  I could indeed extend the list of examples to include various industries. Compared to the huge number of individuals using PCs and smartphones, that’s still few customers.
  
  Emphasize situated and convivial software
  
  Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?
  
  I have decided the contrary: I have significantly shortened this section, removing all speculation about how the ideas could be turned into concrete technology. The reason is that I have been working on this topic since I wrote the reviewed version of this manuscript, and I have a lot more to say about it than would be reasonable to include in this work. This will become a separate article.
  
  Make scientific software explainable
  
  Note [page 18]: I think it would be very beneficial to show screenshots of what the author means- while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would good to see exactly what the author means when they discuss these examples.
  
  Unfortunately, static screenshots can only convey a limited impression of Glamorous Toolkit, but I agree that they have are a more stable support than the software itself. Rather than adding my own screenshots, I refer to a recent paper by the authors of Glamorous Toolkit that includes many screenshots for illustration.
  
  Use Digital Scientific Notations
  
  Highlight [page 19]: formal specifications and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.
  
  Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]
  
  and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.
  
  I do give an example: sorting a list. To write down an actual formalized version, I’d have to introduce a formal specification language and explain it, which I think goes well beyond the scope of this article. Illustrating modularity requires an even larger example. This is, however, an interesting challenge which I’d be happy to take up in a future article.
  
  Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.
  
  and Note [page 19]: Is an example of this test drive development?
  
  Not exactly, though the underlying idea is similar: provide a condition that a result must satisfy as evidence for being correct. With testing, the condition is spelt out for one specific input. In a formal specification, the condition is written down for all possible inputs.
  
  2 Reviewer 2
  
  First of all, I would like to thank the reviewer for this thoughtful review. It addresses many points that require clarifications in the my article, which I hope to have done adequately in the revised version.
  
  One such point is the role and form of reviewing processes for software. I have made it clearer that I take “review” to mean “critical independent inspection”. It could be performed by the user of a piece of software, but the standard case should be a review performed by experts at the request of some institution that then publishes the reviewer’s findings. There is no notion of gatekeeping attached to such reviews. Users are free to ignore them. Given that today, we publish and use scientific software without any review at all, the risk of shifting to the opposite extreme of having reviewers become gatekeepers seems unlikely to me.
  
  Your comment on users being software developers addresses another important point that I had failed to make clear: conviviality is all about diminishing the distinction between developers and users. Users gain agency over their computations at the price of taking on more of a developer role. This is now stated explicitly in the revised article. Your hypothesis that I want scientific software to be convivial is only partially true. I want convivially structured software to be an option for scientists, with adequate infrastructure and tooling support, but I do not consider it to be the best approach for all scientific software.
  
  The paragraph on the relevance and importance of reviewing in your comment is a valid point of view but, unsurprisingly, not mine. In the grand scheme of science, no specific quality assurance measure is strictly necessary. There is always another layer above that will catch mistakes that weren’t detected in the layer below. It is thus unlikely that unreliable software will cause all of science to crumble. But from many perspectives, including overall efficiency, personal satisfaction of practitioners, and insight derived from the process, it is preferable to catch mistakes as closely as possible to their source. Pre-digital theoreticians have always double-checked their manual calculations before submitting their papers, rather than sending off unchecked results and count on confrontation with experiment for finding mistakes. I believe that we should follow this same approach with software. The cost of mistakes can be quite high. Consider the story of the five retracted protein structures that I cite in my article (Miller, 2006, 10.1126/science.314.5807.1856). The five publications that were retracted involved years of work by researchers, reviewers, and editors. In between their publication and their retraction, other protein crystallographers saw their work rejected because it was in contradiction with the high-profile articles that later turned out to be wrong. The whole story has probably involved a few ruined careers in addition to its monetary cost. In contrast, independent critical examination of the software and the research processes in which it was used would likely have spotted the problem rather quickly (Matthews, 2007).
  
  You point out that reviewability is also a criterion in choosing software to build on, and I agree. Building on other people’s software requires trusting it. Incorporating it into one’s own work (the core principle of convivial software) requires understanding it. This is in fact what motivated my reflections on this topic. I am not much interested in neatly separating epistemic and practical issues. I am a practitioner, my interest in epistemology comes from a desire for improving practices.
  
  Review holism is something I have not thought about before. I consider it both impossible to apply in practice and of little practical value. What I am suggesting, and I hope to have made this clearer in my revision, is that reviewing must take into account the dependency graph. Reviewing software X requires a prior review of its dependencies (possibly already done by someone else), and a consideration of how each dependency influences the software under consideration. However, I do not consider Donoho’s “frictionless reproducibility” a sufficient basis for trust. It has the same problem as the widespread practice of tacitly assuming a piece of software to be correct because it is widely used. This reasoning is valid only if mistakes have a high chance of being noticed, and that’s in my experience not true for many kinds of research software. “It works”, when pronounced by a computational scientist, really means “There is no evidence that it doesn’t work”.
  
  This is also why I point out the chaotic nature of computation. It is not about Humphreys’ “strange errors”, for which I have no solution to offer. It is about the fact that looking for mistakes requires some prior idea of what the symptoms of a mistake might be. Experienced researchers do have such prior ideas for scientific instruments, and also e.g. for numerical algorithms. They come from an understanding of the instruments and their use, including in particular a knowledge of how they can go wrong. But once your substrate is a Turing-complete language, no such understanding is possible any more. Every programmer has made the experience of chasing down some bug that at first sight seems impossible. My long-term hope is that scientific computing will move towards domain-specific languages that are explicitly not Turing-complete, and offer useful guarantees in exchange. Unfortunately, I am not aware of any research in this space.
  
  I fully agree with you that internalist justifications are preferable to reliabilistic ones. But being fundamentally a pragmatist, I don’t care much about that distinction. Indisputable justification doesn’t really exist anywhere in science. I am fine with trust that has a solid basis, even if there remains a chance of failure. I’d already be happy if every researcher could answer the question “why do you trust your computational results?” in a way that shows signs of critical reflection.
  
  What I care about ultimately is improving practices in computational science. Over the last 30 years, I have seen numerous mistakes being discovered by chance, often leading to abandoned research projects. Some of these mistakes were due to software bugs, but the most common cause was an incorrect mental model of what the software does. I believe that the best technique we have found so far to spot mistakes in science is critical independent inspection. That’s why I am hoping to see it applied more widely to computation.
  
  2.1 References
  
  Miller, G. (2006) A Scientist’s Nightmare: Software Problem Leads to Five Retractions. Science 314, 1856. https://doi.org/10.1126/science.314.5807.1856
  
  Matthews, B.W. (2007) Five retracted structure reports: Inverted or incorrect? Protein Science 16, 1013. https://doi.org/10.1110/ps.072888607
  
  3 Editor
  
  Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model...
  
  That is an interesting observation I haven’t seen mentioned bedore. I agree that Bayesian inference is particularly amenable to inspection. One more reason to normalize inspection and inspectability in computational science.
  
  Some reflection on the growing use of AI to write software may be worthwhile.
  
  The use of AI in writing and reviewing software is a topic I have considered for this review, since the technology has evolved enormously since I wrote the current version of the manuscript. However, in view of reviewer 1’s constant admonition to back up statements with citations, I refrained from delving into this topic. We all know it’s happening, but it’s too early to observe a clear impact on research software. I have therefore limited myself to a short comment in the Conclusion section.
  
  I wondered if highly-used software should get more scrutiny.
  
  This is an interesting suggestion. If and when we get serious about reviewing code, resource allocation will become an important topic. For getting started, it’s probably more productive to review newly published code than heavily used code, because there is a better chance that authors actually act on the feedback and improve their code before it has many users. That in turn will help improve the reviewing process, which is what matters most right now, in my opinion.
  
  “supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.
  
  If you have easy access to supercomputer, you should indeed consider yourself privileged. But did you ever use supercomputer time for reviewing someone else’s work? I have relatively easy access to supercomputers as well, but I do have to make a re quest and promise to do innovative research with the allocated resources.
  
  I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)
  
  I hadn’t seen “testthat” before, not being much of a user of R. It looks interesting, and reminds me of similar test support features in Smalltalk which I found very helpful. Improving testing culture is definitely a valuable contribution to improving computational practices.
  
  Can badges on github about downloads and maturity help (page 7)?
  
  Badges can help, on GitHub or elsewhere, e.g. in scientific software catalogs. I see them as a coarse-grained output of reviewing. The right balance to find is between the visibility of a badge and the precision of a carefully written review report. One risk with badges is the temptation to automate the evaluation that leads to it. This is fine for quantitative measures such as test coverage, but what we mostly lack today is human expert judgement on software.
9. metaror 23 Oct 2025
  
  in Public
  
  Hinsen, K. (2023, July 6). Establishing trust in automated reasoning. https://doi.org/10.31222/osf.io/nt96q
10. metaror 23 Oct 2025
  
  in Public
  
  January 3, 2025
11. metaror 23 Oct 2025
  
  in Public
  
  August 14, 2025
12. metaror 23 Oct 2025
  
  in Public
  
  September 3, 2025
13. metaror 23 Oct 2025
  
  in Public
  
  Authors:
  
  Konrad Hinsen konrad.hinsen@cnrs.fr
14. metaror 23 Oct 2025
  
  in Public
  
  3
15. metaror 23 Oct 2025
  
  in Public
  
  https://osf.io/preprints/metaarxiv/nt96q
16. metaror 23 Oct 2025
  
  in Public
  
  10.31222/osf.io/nt96q
17. metaror 23 Oct 2025
  
  in Public
  
  Establishing trust in automated reasoning
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/metaarxiv/nt96q
osf.io osf.io

Evolution of Peer Review in Scientific Communication

21
1. metaror 15 Oct 2025
  
  in Public
  
  I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field.
  
  In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.
  
  I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader.
  
  Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.
  
  Some more minor points are:
  
  I agree with Reviewer 3 that the ‘we’ perspective is distracting.
  
  The paragraph starting with ‘Nevertheless’ on page 2 is very long.
  
  There are many points where language could be shortened for readability, for example:
  
  Page 3: ‘decision on publication’ could be ‘publication decision’.
  
  Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.
  
  Page 7: ‘It should be noted…’ could be ‘Note that…’.
  
  Page 7: ‘It should be noted that..’ – this needs a reference.
  
  I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.
  
  I agree that modular publishing sits uneasily as its own chapter.
  
  Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).
2. metaror 15 Oct 2025
  
  in Public
  
  As an important corollary, and in the interest of transparency, I declare that I am a founding managing editor of MetaROR, which is a PRC platform. It may be advisable for the author to make a similar declaration because I understand that they are affiliated with one of the universities involved in the founding of MetaROR.
3. metaror 15 Oct 2025
  
  in Public
  
  This article provides a brief history and review of peer review. It evaluates peer review models against the goals of scientific communication, expressing a preference for publish, review, curate (PRC) models. The review and history are useful. However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.
4. metaror 15 Oct 2025
  
  in Public
  
  I have no competing interests in the compilation of this review, although I do have specific interests as noted above.
5. metaror 15 Oct 2025
  
  in Public
  
  In "Evolution of Peer Review in Scientific Communication", Kochetkov provides a point-of-view discussion of the current state of play of peer review for scientific literature, focussing on the major models in contemporary use and recent innovations in reform. In particular, they present a typology of three main forms of peer review: traditional pre-publication review; registered reports; and post-publication review, their preferred model. The main contribution it could make would be to help consolidate typologies and terminologies, to consolidate major lines of argument and to present some useful visualisations of these. On the other hand, the overall discussion is not strongly original in character.
  
  The major strength of this article is that the discussion is well-informed by contemporary developments in peer-review reform. The typology presented is modest and, for that, readily comprehensible and intuitive. This is to some extent a weakness as well as a strength; a typology that is too straightforward may not be useful enough. As suggested at the end it might be worth considering how to complexify the typology at least at subordinate levels without sacrificing this strength. The diagrams of workflows are particularly clear.
  
  The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach. A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.
  
  This primary weakness is manifested in several ways. The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate. Instances of the latter are the claim that "The most well-known initiatives at the moment are ResearchEquals and Octopus" for which no evidence is provided, the claim that "we believe that journal-independent peer review is a special case of Model 3" for which no further argument is provided, and the claim that "the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review" for which neither is provided.
  
  A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.
  
  Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best -evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.
  
  Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review." Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.
  
  One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.
  
  A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.
  
  Beyond the specific suggestions in the preceding paragraphs, my suggestions to improve this article are as follows:
  
  Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.
  
  Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).
  
  Consider ways in which the typology might be expanded, even if at subordinate level.
6. metaror 15 Oct 2025
  
  in Public
  
  none
7. metaror 15 Oct 2025
  
  in Public
  
  The work ‘Evolution of Peer Review in Scientific Communication’ provides a concise and readable summary of the historical role of peer review in modern science. The paper categorises the peer review practices into three models: (1) traditional pre-publication peer review; (2) registered reports; (3) post-publication peer review. The author compares the three models and draws the conclusion that the “third model offers the best way to implement the main function of scientific communication”.
  
  I would contest this conclusion. In my eyes the three models serve different aims - with more or less drawbacks. For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.
  
  Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the care of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that “you should have done it differently then it would make sense”.
  
  Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.
  
  In conclusion, I would not make verdict over the models, instead emphasise the different functions they can play in scientific communication.
  
  A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as “There is a point of view that peer review is included in the implicit contract of the researcher.”
8. metaror 15 Oct 2025
  
  in Public
  
  none
9. metaror 15 Oct 2025
  
  in Public
  
  In this manuscript, the author provides a historical review of the place of peer review in the scientific ecosystem, including a discussion of the so-called current crisis and a presentation of three important peer review models. I believe this is a non-comprehensive yet useful overview. My main contention is that the structure of the paper could be improved. More specifically, the author could expand on the different goals of peer review and discuss these goals earlier in the paper. This would allow readers to better interpret the different issues plaguing peer review and helps put the costs and benefits of the three models into context. Other than that, I found some claims made in the paper a little too strong. Presenting some empirical evidence or downplaying these claims would improve the manuscript in my opinion. Below, you can find my comments:
  
  In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.
  
  I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.
  
  Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome. When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?
  
  “Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.” (p. 5)  There is some literature on this, although not recent. See work by Sara Schroter for example, Schroter et al., 2004; Schroter et al., 2008)
  
  “It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.” (p. 7)  This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.
  
  I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.
  
  One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.
  
  I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017). More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like “However, we believe that journal-independent peer review is a special case of Model 3 (“Publish-Review-Curate”).” are made without substantiation.
  
  Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.
  
  The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.
  
  Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.
  
  I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: “The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.” What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefits ratio of registered reports would still be welcome in general.
  
  What is the underlaying source for the claim that openness requires three conditions?
  
  “If we do not change our approach, science will either stagnate or transition into other forms of communication.” (p. 2)  I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.
  
  On some occasions, the author uses “we” while the study is single authored.
  
  Figure 1: The top-left arrow from revision to (re-)submission is hidden
  
  “The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).” (p. 4)  I assume the author means the low quality of peer review.
  
  “Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.” (p. 4)  This is also a big claim that is not substantiated
  
  “Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5)  The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)
  
  There is a typo in last box of Figure 1 (“decicion” instead of “decision”). I also found typos in the second box of Figure 2, where “screns” should be “screens”, and the author decision box where “desicion” should be “decision”
  
  Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).
  
  Is “Not considered for peer review” in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript not the reviews will be posted on the platform.
  
  “In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.” (p. 11)  For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.
  
  “However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.” (p.11)  I would opt for the neutral “their” here instead of “his”, especially given that this is a paragraph about equity and inclusion.
  
  “Thus, “closeness” is not a good way to address biases.” (p. 11)  This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.
  
  I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.
  
  It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.
  
  Table 2: “Decision making” for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.
  
  Table 2: “Aim of review” – I believe the aim of peer review differs also within these models (see the “schools of thought” the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.
  
  Table 2: One could argue that the object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.
  
  Good luck with any revision!
  
  Olmo van den Akker (ovdakker@gmail.com)
  
  References
  
  Grand, J. A., Rogelberg, S. G., Banks, G. C., Landis, R. S., & Tonidandel, S. (2018). From outcome to process focus: Fostering a more robust psychological science through registered reports and results-blind reviewing. Perspectives on Psychological Science, 13(4), 448-456.
  
  Ross-Hellauer, T. (2017). What is open peer review? A systematic review. F1000Research, 6.
  
  Sarafoglou, A., Kovacs, M., Bakos, B., Wagenmakers, E. J., & Aczel, B. (2022). A survey on how preregistration affects the research workflow: Better science but more work. Royal Society Open Science, 9(7), 211997.
  
  Scheel, A. M., Schijen, M. R., & Lakens, D. (2021). An excess of positive results: Comparing the standard psychology literature with registered reports. Advances in Methods and Practices in Psychological Science, 4(2), 25152459211007467.
  
  Schroter, S., Black, N., Evans, S., Carpenter, J., Godlee, F., & Smith, R. (2004). Effects of training on quality of peer review: randomised controlled trial. Bmj, 328(7441), 673.
  
  Schroter, S., Black, N., Evans, S., Godlee, F., Osorio, L., & Smith, R. (2008). What errors do peer reviewers detect, and does training improve their ability to detect them?. Journal of the Royal Society of Medicine, 101(10), 507-514.
  
  Soderberg, C. K., Errington, T. M., Schiavone, S. R., Bottesini, J., Thorn, F. S., Vazire, S., ... & Nosek, B. A. (2021). Initial evidence of research quality of registered reports compared with the standard publishing model. Nature Human Behaviour, 5(8), 990-997.
  
  Syed, M. (2023). Some data indicating that editors and reviewers do not check preregistrations during the review process. PsyArXiv Preprints.
  
  Locascio, J. J. (2017). Results blind science publishing. Basic and applied social psychology, 39(5), 239-246.
  
  Woznyj, H. M., Grenier, K., Ross, R., Banks, G. C., & Rogelberg, S. G. (2018). Results-blind review: A masked crusader for science. European Journal of Work and Organizational Psychology, 27(5), 561-576.
10. metaror 15 Oct 2025
  
  in Public
  
  none
11. metaror 15 Oct 2025
  
  in Public
  
  Overall thoughts: This is an interesting history piece regarding peer review and the development of review over time. Given the author’s conflict of interest and association with the Centre developing MetaROR, I think that this paper might be a better fit for an information page or introduction to the journal and rationale for the creation of MetaROR, rather than being billed as an independent article. Alternatively, more thorough information about advantages to pre-publication review or more downsides/challenges to post-publication review might make the article seem less affiliated. I appreciate seeing the history and current efforts to change peer review, though I am not comfortable broadly encouraging use of these new approaches based on this article alone.
  
  Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear.
  
  Consider discussing the benefits of the traditional model of peer review.
  
  Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.
  
  3.2: Considering mentioning your conflict of interest here where MetaROR is mentioned.
  
  With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?
  
  There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.
  
  Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.
12. metaror 15 Oct 2025
  
  in Public
  
  Response to the Editors and the Reviewers
  
  I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.
  
  Handling editor
  
  1. “However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.”
  
  The structure of the paper (and discussion) has changed significantly to address the feedback.
  
  2. “I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.”
  
  I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”
  
  3. “I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.”
  
  I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.
  
  4. “I agree with Reviewer 3 that the ‘we’ perspective is distracting.”
  
  This has been fixed.
  
  5. “The paragraph starting with ‘Nevertheless’ on page 2 is very long.”
  
  The text was restructured.
  
  6. “There are many points where language could be shortened for readability, for example:
  
  Page 3: ‘decision on publication’ could be ‘publication decision’.
  
  Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.
  
  Page 7: ‘It should be noted…’ could be ‘Note that…’.”
  
  I have proofread the text.
  
  7. “Page 7: ‘It should be noted that..’ – this needs a reference.”
  
  This statement has been moved to the Discussion section, paraphrased, and reference added
  
  “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs (Kaltenbrunner et al., 2022).”
  
  8. “I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.”
  
  I have added this clarification.
  
  9. “I agree that modular publishing sits uneasily as its own chapter.”
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.
  
  10. “Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).”
  
  This part of the text has been rewritten.
  
  Reviewer 1
  
  11. “For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.”
  
  I think that generated text is better detected by software tools. At the same time, I tried and described the pros and cons of different models in a more balanced way in the concluding section.
  
  12. “Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the care of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that ‘you should have done it differently then it would make sense’.”
  
  Thank you very much for this valuable contribution, I have added this statement at P. 11.
  
  13. “Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.”
  
  I think that the model, providing peer reviews to all the submissions, ensures maximum transparency. However, I have made effort to make the wording more balanced and distinguish my personal perspective from the literature.
  
  14. “In conclusion, I would not make verdict over the models, instead emphasize the different functions they can play in scientific communication.”
  
  This idea has been reflected now in the concluding section.
  
  15. “A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as ‘There is a point of view that peer review is included in the implicit contract of the researcher.’”
  
  Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.
  
  Reviewer 2
  
  16. “The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would
  
  otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.”
  
  I have incorporated the conceptual framework and description of the research methodology. However, the Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.
  
  17. “A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.”
  
  The structure of the paper (and discussion) has changed significantly.
  
  18. “The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.”
  
  I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.
  
  19. “Instances of the latter are the claim that ‘The most well-known initiatives at the moment are ResearchEquals and Octopus’ for which no evidence is provided, the claim that ‘we believe that journal-independent peer review is a special case of Model 3’ for which no further argument is provided, and the claim that ‘the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review’ for which neither is provided.
  
  Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.
  
  20. “A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the experience of a particular moment in time, that coincides with the start of the metascience reform movement.”
  
  While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.
  
  “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.
  
  21. “Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best - evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.”
  
  This section (now Section 4) has been extended, see also previous comment.
  
  22. “Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review." Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.”
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.
  
  23. “One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.”
  
  Thank you for your remark. As a non- native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.
  
  24. “A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.”
  
  Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews remains a task for future research.
  
  25. “Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.”
  
  I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.
  
  26. “Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).”
  
  I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.
  
  “Peer Review at the Crossroads”
  
  27. “Consider ways in which the typology might be expanded, even if at subordinate level.”
  
  I have updated the typology and introduced the third tier, where it is applicable (see Fig.2).
  
  Reviewer 3
  
  28. “In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.
  
  I have elaborated on this issue in the footnote.
  
  29. “I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.”
  
  The functions of peer review are summarized in the first paragraph of Introduction.
  
  30. “Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome. When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-transform-scientific-research/)?”
  
  Collaborative peer review, namely, Prereview was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.
  
  31“‘Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.’ (p. 5) There is some literature on this, although not recent. See work by Sara Schroter for example, Schroter et al., 2004; Schroter et al., 2008)”
  
  Thank you very much, I have added these studies and a few more recent ones.
  
  32. “‘It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.’ (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.”
  
  I have moved this part to the Discussion section.
  
  33. “I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.”
  
  Thank you very much. Stage 1 and Stage 2 manuscripts look like suitable labelling solution.
  
  34. “One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish-review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of
  
  3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.”
  
  All the four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now, Model 4).
  
  35. “I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).”
  
  This part has been extended.
  
  36. “More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like ‘However, we believe that journal-independent peer review is a special case of Model 3 (‘Publish-Review-Curate’).’ are made without substantiation.”
  
  The study included four generalized models of peer review that involved some level of abstraction.
  
  37. “Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.”
  
  The Introduction has been revised including the goal and objectives.
  
  38. “The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.”
  
  Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.
  
  39. “Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.”
  
  The label was changed into Publish- Review-Curate model.
  
  40. “I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: ‘The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.’ What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost-benefits ratio of registered reports would still be welcome in general.”
  
  The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.
  
  41. “What is the underlaying source for the claim that openness requires three conditions?”
  
  I have made effort to clarify within the text that this reflects my personal stance.
  
  42. “‘If we do not change our approach, science will either stagnate or transition into other forms of communication.’ (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.”
  
  The sentence has been rephrased.
  
  43. “On some occasions, the author uses ‘we’ while the study is single authored.”
  
  This has been fixed.
  
  44. “Figure 1: The top-left arrow from revision to (re-)submission is hidden”
  
  I have updated Figure 1.
  
  45. “‘The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).’ (p. 4) I assume the author means the low quality of peer review.”
  
  This has been fixed.
  
  46. “‘Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.’ (p. 4) This is also a big claim that is not substantiated”
  
  I have paraphrased this sentence as “While multiple factors drive this crisis, deficiencies in the peer review process remain a significant contributor.” and added a footnote.
  
  47. “‘Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)”
  
  Thank you very much, I have added this information.
  
  48. “There is a typo in last box of Figure 1 (‘decicion’ instead of ‘decision’). I also found typos in the second box of Figure 2, where ‘screns’ should be ‘screens’, and the author decision box where ‘desicion’ should be ‘decision’”
  
  This has been fixed.
  
  49. “Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).”
  
  Thanks, I have added this (now section 5.2)
  
  50. “Is ‘Not considered for peer review’ in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript not the reviews will be posted on the platform.”
  
  Changed into “Rejected”
  
  51. “‘In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.’ (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.”
  
  I have extended this passage.
  
  52. “‘However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.’ (p.11) I would opt for the neutral ‘their’ here instead of ‘his’, especially given that this is a paragraph about equity and inclusion.”
  
  This has been fixed.
  
  53. “‘Thus, “closeness” is not a good way to address biases.’ (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be
  
  omitted entirely.
  
  I have omitted the sentence.
  
  54. “I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.”
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of models, now in Section 5, general definition added.
  
  55. “It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.”
  
  All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.
  
  56. “Table 2: ‘Decision making’ for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.”
  
  Changed into “Making accept/reject decisions”
  
  57. “Table 2: ‘Aim of review’ – I believe the aim of peer review differs also within these models (see the ‘schools of thought’ the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.”
  
  Changed into “What does peer review entail?”
  
  58. “Table 2: One could argue that the object of the review’ in Registered Reports is also the manuscript as a whole, just in different stages. As such, I would phrase this differently.
  
  Current wording fits your remark: “Manuscript in terms of study design and execution”
  
  Reviewer 4
  
  59. “Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.”
  
  This section has been extended.
  
  60. “Table 1 – Most of these are self-explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.”
  
  Table 1 has been replaced by Figure 2. I have also extended text descriptions, added definitions.
  
  61. “With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?”
  
  Some of these platforms, e.g., F1000, Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers. An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins. Other models of review can offer critiques such as "the study should have been conducted differently" without actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.
  
  62. “There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them. Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.”
  
  I have omitted these conditions and employed the Moore’s Technology Adoption Life Cycle. Thank you very much for your comment!
  
  63. Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.
  
  The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.
13. metaror 15 Oct 2025
  
  in Public
  
  9, 137, 182
14. metaror 15 Oct 2025
  
  in Public
  
  Kochetkov, D. (2024, March 21). Evolution of Peer Review in Scientific Communication. https://doi.org/10.31235/osf.io/b2ra3
15. metaror 15 Oct 2025
  
  in Public
  
  Jul 26, 2024
16. metaror 15 Oct 2025
  
  in Public
  
  Nov 20, 2024
17. metaror 15 Oct 2025
  
  in Public
  
  Nov 20, 2024
18. metaror 15 Oct 2025
  
  in Public
  
  Authors:
  
  Dmitry Kochetkov (Leiden University ) d.kochetkov@cwts.leidenuniv.nl
19. metaror 15 Oct 2025
  
  in Public
  
  5
20. metaror 15 Oct 2025
  
  in Public
  
  10.31235/osf.io/b2ra3
21. metaror 15 Oct 2025
  
  in Public
  
  Evolution of Peer Review in Scientific Communication
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/socarxiv/b2ra3
Sep 2025
doi.org doi.org

Tracking transformative agreements through open metadata: method and validation using Dutch Research Council NWO funded papers

15
1. metaror 23 Sep 2025
  
  in Public
  
  none
2. metaror 23 Sep 2025
  
  in Public
  
  In this preprint, the authors describe and test a method for calculating which articles were covered by transformative agreements. Using a test case of publications likely covered by transformative agreements in the Netherlands, the authors find that their method works well, correctly identifying 89% of the articles in the sample. The preprint was reviewed by two metaresearchers deeply involved in studying transformative agreements. Both commend the paper for its clarity, methodological transparency, and timely contribution to ongoing discussions about transformative agreements. They highlight the paper’s practical value, particularly the Dutch case study’s validation using national research information, and its epistemically modest discussion of the method’s capabilities and limitations. Key suggestions for improvement include clarifying aspects of the methodology, presenting key results earlier in the text, and enhancing the clarity and interpretability of the figures.
3. metaror 23 Sep 2025
  
  in Public
  
  none
4. metaror 23 Sep 2025
  
  in Public
  
  This article presents an open method for tracking journal articles under transformative agreements using open metadata. The authors apply their approach to the Dutch context as a case study, including validation against national research information. The well-written paper usefully highlights, in an accessible and easy-to-understand manner, how both researchers and practitioners evaluating this open access licensing model can navigate data gaps. By demonstrating that estimating publications under transformative agreements requires combining multiple data sources, the authors offer practical methodological insights for those interested in this prevalent licensing model but uncertain about data sources and their limitations. They also highlight the progress made in increasing the transparency of transformative agreements, which has often been lacking in previous subscription agreements between libraries and publishers.
  
  In my view, the key contribution of this study is the validation using Dutch research information. This allows to show that while many articles could be matched, there are shortcomings that do not reflect weaknesses in the open method itself, but rather the current state of the data infrastructure for transparency around transformative agreements and open access. Particularly noteworthy is the finding that there are challenges in using corresponding author information as a proxy to delineate open access funding. While it is already known that open metadata (here OpenAlex) on corresponding authors is not as complete as proprietary databases, the validation also reveals that even when corresponding author data are available, issues can arise, particularly with multiple affiliations. The availability of funding information faces similar limitations.
  
  This leads me to wonder whether the complex structure of transformative agreements on monitoring should warrant a broader discussion based on the findings of this work. Of course, full disclosure of open access invoicing through a community-owned open data service would help assessment but nevertheless makes comparisons between publishers and countries difficult. Examples that can hardly be controlled by open metadata about publication but would require a thorough analysis of the contracts themselves, have been extensively explored by the authors in their qualitative analysis: Authors can decide whether or not to publish open access, agreements can be capped, not all article types are eligible, time lag between submission and publication, and institutions involved. Funding contexts add to this complexity.
  
  Given this complexity, I wonder whether a focus on ESAC to disclose articles enabled by transformative agreements, which is a community effort only run by the Max Planck Digital Library, is sufficient. Perhaps the authors can speculate on the role of existing national infrastructures and workflows around subscription-based publishing in libraries (serial cataloguing and license management)? Can they be transformed to increase the transparency and thus the accountability of this licensing model, or are these infrastructure services no longer be needed in favour of international open metadata initiatives that have been set up together with transformative agreements? Another consideration might be the role of discovery services such as Unpaywall/OpenAlex or OpenAIRE. I think the paper provides a very good overview of these different actors from a data perspective, but the case study would benefit from a discussion of how the different actors involved, particularly in the Dutch context, could work together to achieve more streamlined monitoring through the combination of data services and standardised agreements, as much data seems to already exist internally.
  
  Apart, I have two other considerations:
  
  I suggest that the results section could benefit from earlier mention of the number and proportion of articles that could not be matched. Although these are effectively summarised in the conclusion (last paragraph, page 12), incorporating this information earlier would improve the presentation of findings. I consider the identification of publications missed by the open method due to limitations in the availability of corresponding author data and funding information to be an essential outcome of this research.
  
  Regarding methodology, I had some difficulty understanding where the disambiguation of ISSN variants took place. The text indicates that this information was obtained from the JCT ("The data from the Journal Checker tool is exposed through a publicly available API. It used ISSN (more precisely ISSN-L) to identify journals and RoR-IDs to identify institutions",page 6). However, to my knowledge, ISSN-L retrieval is not supported by the JCT API? Upon examination of the code, it appears that ISSN linking to ISSN-L may have been established using Unpaywall data, while Figure 2 refers to Crossref in this context.
  
  In summary, I would like to congratulate the authors on this important contribution and recommend that all those concerned with open access business models, and those involved in improving the evidence base for transformative agreements, read this important work and adopt the open method presented.
5. metaror 23 Sep 2025
  
  in Public
  
  In line with the reviewer policy of MetaROR I want to disclose that I have in the past co-authored reports together with Bianca Kramer as part a larger authorship groups (in 2019 and in 2023).
6. metaror 23 Sep 2025
  
  in Public
  
  Through a case study of analyzing research publications supported by funding of the Dutch Research Council NOW this manuscript provides a concise description and assessment of a methodology based on utilizing open data sources for identifying which funded publications have been made available within so-called transformative open access agreements. The research gap that is addressed is a relevant and interesting one, as exact and measurable aspects of transformative agreements are still scarce despite the massive financial investments made into them and the breadth of research outputs that are impacted by them.
  
  The introduction and literature review are good and appropriate at framing the context of the study and provide a thorough positioning of this study in relation to previous work in this area. I especially appreciate that the authors have included and given credit to the recently published study by Jahn (2025), and provide a clear argument for how the two studies are within the same topic area but come with different contributions.
  
  The methods section provides a transparent description of the workflow utilized for the study, the data collection and analysis requires a few different steps and working with data on different levels (agreement, journal, article) but the provided narrative provides sufficient detail for the reader to follow the process.
  
  The results section is the only area where I see some room for improvement in terms of presentation, the results themselves are valid and in my view interpreted correctly. Figure 3 is a central visualization of the results of the project but it is hard to interpret independently from reading the text, and even by reading the text some aspects remain unclear. I would suggest to consider the following changes 1) make the venn-areas proportional, now the areas are not sized in relation to the data they represent (use e.g. https://www.deepvenn.com/, https://www.meta-chart.com/venn or a similar tool), 2) insert data labels and legends so that one can get a grip of what the different areas and colors of the visualization mean without reading the text, 3) revise the text that refers to the figure so that each part of the figure would be mention in consecutive order following a predictable structure. Currently only some of the lettered areas are mentioned and those which are not mentioned require quite a lot of effort from the reader to figure out.
  
  The discussion and conclusions are good and I like that the authors are not over-selling the contribution in any way, rather being very realistic in what this method can and cannot achieve.
  
  Overall I think this is a strong paper that provides a valuable and focused contribution to intersecting area of bibliometric research and science policy that still contains many unanswered questions due to the lack of comprehensive data, to which this study provides one more piece to the puzzle.
7. metaror 23 Sep 2025
  
  in Public
  
  We would like to thank Najko Jahn and Mikael Laakso for their very positive and thoughtful reviews, which significantly improved our article. In response to the reviewers' specific comments, we have corrected all identified errors and made a number of improvements. In particular, regarding the presentation of results (Figure 3) and a more comprehensive speculation on the role of national and international infrastructure providers for unlocking article level metadata on transformative agreements. These are the changes we made in response to the reviewer’s comments:
  
  Response to Reviewer #1: Najko Jahn
  
  We agree that a broader discussion of transformative agreement complexity would be valuable, and this paper provides important insights—particularly regarding author opt-out possibilities, contract caps, and exclusion of non-research articles. However, we believe such discussion extends beyond the scope of this paper which we have meant primarily as a study to validate a method for tracking / analysing transformative agreements using open data.
  
  Following the reviewer's advice, we have updated the concluding section to include discussion of how various national and international infrastructures could facilitate more open availability of article-level transformative agreement data.
  
  We have implemented the reviewer's suggestion to present main results earlier in the paper. The key findings regarding matched and unmatched articles are now introduced on pages 3-4.
  
  The reviewer correctly identified our imprecise description of journal identification methodology. We have corrected both the figure and text to accurately reflect our use of ISSN (obtained from Crossref) rather than ISSN-L.
  
  Response to Reviewer #2: Mikael Laakso
  
  We thank the reviewer for their insightful feedback. We have implemented all suggested improvements to our results presentation, enhancing the Venn diagram in Figure 3 by: 1) making areas proportional to data, 2) improving label clarity, 3) providing clearer caption explanations, and 4) revising the accompanying text to follow the figure's label sequence.
  
  The revised version of the article is available here.
8. metaror 23 Sep 2025
  
  in Public
  
  de Jonge, H., Kramer, B., & Sondervan, J. (2025, March 10). Tracking transformative agreements through open metadata: method and validation using Dutch Research Council NWO funded papers. https://doi.org/10.31222/osf.io/tz6be_v2
9. metaror 23 Sep 2025
  
  in Public
  
  February 10, 2025
10. metaror 23 Sep 2025
  
  in Public
  
  May 8, 2025
11. metaror 23 Sep 2025
  
  in Public
  
  May 13, 2025
12. metaror 23 Sep 2025
  
  in Public
  
  Authors:
  
  Hans de Jonge h.dejonge@nwo.nl
  
  Bianca Kramer bianca@sesameopenscience.org
  
  Jeroen Sondervan j.sondervan@nwo.nl
13. metaror 23 Sep 2025
  
  in Public
  
  2
14. metaror 23 Sep 2025
  
  in Public
  
  10.31222/osf.io/tz6be_v2
15. metaror 23 Sep 2025
  
  in Public
  
  Tracking transformative agreements through open metadata: method and validation using Dutch Research Council NWO funded papers
Visit annotations in context

Annotators

metaror

URL

doi.org/10.31222/osf.io/tz6be_v2
osf.io osf.io

Preprint review services: Disrupting the scholarly communication landscape?

14
1. metaror 23 Sep 2025
  
  in Public
  
  Kathryn Zeiler is co-Editor-in-Chief of MetaROR working with Ludo Waltman, a co-author of the article and co-Editor-in-Chief of MetaROR
2. metaror 23 Sep 2025
  
  in Public
  
  The authors present a descriptive analysis of preprint review services. The analysis focuses on the services’ relative characteristics and differences in preprint review management. The authors conclude that such services have the potential to improve the traditional peer review process. Two metaresearchers reviewed the article. They note that the background section and literature review are current and appropriate, the methods used to search for preprint servers are generally sound and sufficiently detailed to allow for reproduction, and the discussion related to anonymizing articles and reviews during the review process is useful. The reviewers also offered suggestions for improvement. They point to terminology that could be clarified. They suggest adding URLs for each of the 23 services included in the study. Other suggestions include explaining why overlay journals were excluded, clarifying the limitation related to including only English-language platforms, archiving rawer input data to improve reproducibility, adding details related to the qualitative text analysis, discussing any existing empirical evidence about misconduct as it relates to different models of peer review, and improving field inclusiveness by avoiding conflation of “research” and “scientific research.”
  
  The reviewers and I agree that the article is a valuable contribution to the metaresearch literature related to peer review processes.
3. metaror 23 Sep 2025
  
  in Public
  
  none
4. metaror 23 Sep 2025
  
  in Public
  
  This manuscript examines preprint review services and their role in the scholarly communications ecosystem. It seems quite thorough to me. In Table 1 they list many peer-review services that I was unaware of e.g. SciRate and Sinai Immunology Review Project.
  
  To help elicit critical & confirmatory responses for this peer review report I am trialling Elsevier’s suggested “structured peer review” core questions, and treating this manuscript as a research article.
  
  Introduction
  
  Is the background and literature section up to date and appropriate for the topic?
  
  Yes.
  
  Are the primary (and secondary) objectives clearly stated at the end of the introduction?
  
  No. Instead the authors have chosen to put the two research questions on page 6 in the methods section. I wonder if they ought to be moved into the introduction – the research questions are not methods in themselves. Might it be better to state the research questions first and then detail the methods one uses to address those questions afterwards? [as Elsevier’s structured template seems implicitly to prefer.
  
  Methods
  
  Are the study methods (including theory/applicability/modelling) reported in sufficient detail to allow for their replicability or reproducibility?
  
  I note with approval that the version number of the software they used (ATLAS.ti) was given.
  
  I note with approval that the underlying data is publicly archived under CC BY at figshare.
  
  The Atlas.ti report data spreadsheet could do with some small improvement – the column headers are little cryptic e.g. “Nº ST “ and “ST” which I eventually deduced was Number of Schools of Thought and Schools of Thought (?)
  
  Is there a rawer form of the data that could be deposited with which to evidence the work done? The Atlas.ti report spreadsheet seemed like it was downstream output data from Atlas.ti. What was the rawer input data entered into Atlas.ti? Can this be archived somewhere in case researchers want to reanalyse it using other tools and methods.
  
  I note with disapproval that Atlas.ti is proprietary software which may hinder the reproducibility of this work. Nonetheless I acknowledge that Atlas.ti usage is somewhat ‘accepted’ in social sciences despite this issue.
  
  I think the qualitative text analysis is a little vague and/or under-described: “Using ATLAS.ti Windows (version 23.0.8.0), we carried out a qualitative analysis of text from the relevant sites, assigning codes covering what they do and why they have chosen to do it that way.” That’s not enough detail. Perhaps an example or two could be given? Was inter-rater reliability performed when ‘assigning codes’ ? How do we know the ‘codes’ were assigned accurately?
  
  Are statistical analyses, controls, sampling mechanism, and statistical reporting (e.g., P-values, CIs, effect sizes) appropriate and well described?
  
  This is a descriptive study (and that’s fine) so there aren’t really any statistics on show here other than simple ‘counts’ (of Schools of Thought) in this manuscript. There are probably some statistical processes going on within the proprietary qualitative analysis of text done in ATLAS.ti but it is under described and so hard for me to evaluate.
  
  Results
  
  Is the results presentation, including the number of tables and figures, appropriate to best present the study findings?
  
  Yes. However, I think a canonical URL to each service should be given. A URL is very useful for disambiguation, to confirm e.g. that the authors mean this Hypothesis (www.hypothes.is) and NOT this Hypothesis (www.hyp.io). I know exactly which Hypothesis is the one the authors are referring to but we cannot assume all readers are experts 😊
  
  Optional suggestion: I wonder if the authors couldn’t present the table data in a slightly more visual and/or compact way? It’s not very visually appealing in its current state. Purely as an optional suggestion, to make the table more compact one could recode the answers given in one or more of the columns 2, 3 and 4 in the table e.g. "all disciplines = ⬤ , biomedical and life sciences = ▲, social sciences =  ‡ , engineering and technology = † ". I note this would give more space in the table to print the URLs for each service that both reviewers have requested.
  
  ———————————————————————————————
  
  | Service name | Developed by | Scientific disciplines | Types of outputs |
  
  | Episciences | Other | ⬤ | blah blah blah. |
  
  | Faculty Opinions | Individual researcher | ▲ | blah blah blah. |
  
  | Red Team Market | Individual researcher | ‡ | blah blah blah. |
  
  ———————————————————————————————
  
  The "Types of outputs" column might even lend themselves to mini-colour-pictograms (?) which could be more concise and more visually appealing? A table just of text, might be scientifically 'correct' but it is incredibly dull for readers, in my opinion.
  
  Are additional sub-analyses or statistical measures needed (e.g., reporting of CIs, effect sizes, sensitivity analyses)?
  
  No / Not applicable.
  
  Discussion
  
  Is the interpretation of results and study conclusions supported by the data and the study design?
  
  Yes.
  
  Have the authors clearly emphasized the limitations of their study/theory/methods/argument?
  
  No. Perhaps a discussion of the linguistic/comprehension bias of the authors might be appropriate for this manuscript. What if there are ‘local’ or regional Chinese, Japanese, Indonesian or Arabic language preprint review services out there? Would this authorship team really be able to find them?
  
  Additional points:
  
  Perhaps the points made in this manuscript about financial sustainability (p24) are a little too pessimistic. I get it, there is merit to this argument, but there is also some significant investment going on there if you know where to look. Perhaps it might be worth citing some recent investments e.g. Gates -> PREreview (2024) https://content.prereview.org/prereview-welcomes-funding/ and Arcadia’s $4 million USD to COAR for the Notify Project which supports a range of preprint review communities including Peer Community In, Episciences, PREreview and Harvard Library. (source: https://coar-repositories.org/news-updates/coar-welcomes-significant-funding-for-the-notify-project/ )
  
  Although I note they are mentioned, I think more needs to be written about the similarity and overlap between ‘overlay journals’ and preprint review services. Are these arguably not just two different terms for kinda the same thing? If you have Peer Community In which has it’s overlay component in the form of the Peer Community Journal, why not mention other overlay journals like Discrete Analysis and The Open Journal of Astrophysics.   I think Peer Community In (& it’s PCJ) is the go-to example of the thin-ness of the line the separates (or doesn’t!) overlay journals and preprint review services. Some more exposition on this would be useful.
5. metaror 23 Sep 2025
  
  in Public
  
  No competing interests are declared by me as reviewer.
6. metaror 23 Sep 2025
  
  in Public
  
  Thank you very much for the opportunity to review the preprint titled “Preprint review services: Disrupting the scholarly communication landscape?” (https://doi.org/10.31235/osf.io/8c6xm) The authors review services that facilitate peer review of preprints, primarily in the STEM (science, technology, engineering, and math) disciplines. They examine how these services operate and their role within the scholarly publishing ecosystem. Additionally, the authors discuss the potential benefits of these preprint peer review services, placing them in the context of tensions in the broader peer review reform movement. The discussions are organized according to four “schools of thought” in peer review reform, as outlined by Waltman et al. (2023), which provides a useful framework for analyzing the services. In terms of methodology, I believe the authors were thorough in their search for preprint review services, especially given that a systematic search might be impractical.
  
  As I see it, the adoption of preprints and reforming peer review are key components of the move towards improving scholarly communication and open research. This article is a useful step along that journey, taking stock of current progress, with a discussion that illuminates possible paths forward. It is also well-structured and easy for me to follow. I believe it is a valuable contribution to the metaresearch literature.
  
  On a high level, I believe the authors have made a reasonable case that preprint review services might make peer review more transparent and rewarding for all involved. Looking forward, I would like to see metaresearch which gathers further evidence that these benefits are truly being realised.
  
  In this review, I will present some general points which merit further discussion or clarification to aid an uninitiated reader. Additionally, I raise one issue regarding how the authors framed the article and categorised preprint review services and the disciplines they serve. In my view, this problem does not fundamentally undermine the robust search, analyses, and discussion in this paper, but it risks putting off some researchers and constrains how broadly one should derive conclusions.
  
  General comments
  
  Some metaresearchers may be aware of preprints, but not all readers will be familiar with them. I suggest briefly defining what they are, how they work, and which types of research have benefited from preprints, similar to how “preprint review service” is clearly defined in the introduction.
  
  Regarding Waltman et al.’s (2023) “Equity & Inclusion” school of thought, does it specifically aim for “balanced” representation by different groups as stated in this article? There is an important difference between “balanced” versus “equitable” representation, and I would like to see it addressed in this text.
  
  Another analysis I would like to see is whether any of the 23 services reviewed present any evidence that their approach has improved research quality. For instance, the discussion on peer review efficiency and incentives states that there is currently “no hard evidence” that journals want to utilise reviews by Rapid Reviews: COVID-19, and that “not all journals are receptive” to partnerships. Are journals skeptical of whether preprint review services could improve research quality? Or might another dynamic be at work?
  
  The authors cite Nguyen et al. (2015) and Okuzaki et al. (2019), stating that peer review is often “overloaded”. I would like to see a clearer explanation by what “overloaded” means in this context so that a reader does not have to read the two cited papers.
  
  To the best of my understanding, one of the major sticking points in peer review reform is whether to anonymise reviewers and/or authors. Consequently, I appreciate the comprehensive discussion about this issue by the authors.
  
  However, I am only partially convinced by the statement that double anonymity is “essentially incompatible” with preprint review. For example, there may be, as yet not fully explored, ways to publish anonymous preprints with (a) a notice that it has been submitted to, or is undergoing, peer review; and (b) that the authors will be revealed once peer review has been performed (e.g. at least one review has been published). This would avoid the issue of publishing only after review is concluded as is the case for Hypothesis and Peer Community In.
  
  Additionally, the authors describe 13 services which aim to “balance transparency and protect reviewers’ interests”. This is a laudable goal, but I am concerned that framing this as a “balance” implies a binary choice, and that to have more of one, we must lose an equal amount of the other. Thinking only in terms of “balance” prevents creative, win-win solutions. Could a case be made for non-anonymity to be complemented by a reputation system for authors and reviewers? For example, major misconduct (e.g. retribution against a critical review) would be recorded in that system and dissuade bad actors. Something similar can already be seen in the reviewer evaluation system of CrowdPeer, which could plausibly be extended or modified to highlight misconduct.
  
  I also note that misconduct and abusive behaviour already occur even in fully or partially anonymised peer review, and they are not limited to the review or preprints. While I am not aware of existing literature on this topic, academics’ fears seem reasonable. For example, there is at least anecdotal testimonies that a reviewer would deliberately reject a paper to retard the progress of a rival research group, while taking the ideas of that paper and beating their competitors to winning a grant. Or, a junior researcher might refrain from giving a negative review out of fear that the senior researcher whose work they are reviewing might retaliate. These fears, real or not, seem to play a part in the debates about if and how peer review should (or should not) be anonymised. I would like to see an exploration of whether de-anonimisation will improve or worsen this behaviour and in what contexts. And if such studies exist, it would be good to discuss them in this paper.
  
  I found it interesting that almost all preprint review services claim to be complementary to, and not compete with, traditional journal-based peer review. The methodology described in this article cannot definitely explain what is going on, but I suspect there may be a connection between this aversion to compete with traditional journals, and (a) the skepticism of journals towards partnering with preprint review services and (b) the dearth of publisher-run options. I hypothesise that there is a power dynamic at play, where traditional publishers have a vested interest in maintaining the power they hold over scholarly communication, and that preprint review services stress their complementarity (instead of competitiveness) as a survival mechanism. This may be an avenue for further metaresearch.
  
  To understand preprints from which fields of research are actually present on the services categorised under “all disciplines,” I used the Random Integer Set Generator by the Random.org true random number service (https://www.random.org/integer-sets/) to select five services for closer examination: Hypothesis, Peeriodicals, PubPeer, Qeios, and Researchers One. Of those, I observed that Hypothesis is an open source web annotation service that allows commenting on and discussion of any web page on the Internet regardless of whether it is research or preprints. Hypothesis has a sub-project named TRiP (Transparent Review in Preprints), which is their preprint review service in collaboration with Cold Spring Harbor Laboratory. It is unclear to me why the authors listed Hypothesis as the service name in Table 1 (and elsewhere) instead of TRiP (or other similar sub-projects). In addition, Hypothesis seems to be framed as a generic web annotation service that is used by some as a preprint review tool. This seems fundamentally different from others who are explicitly set up as preprint review services. This difference seems noteworthy to me.
  
  To aid readers, I also suggest including hyperlinks to the 23 services reviewed in this paper. My comments on disciplinary representation in these services are elaborated further below.
  
  One minor point of curiosity is that several services use an “automated tool” to select reviewers. It would be helpful to describe in this paper exactly what those tools are and how they work, or report situations where services do not explain it.
  
  Lastly, what did the authors mean by “software heritage” in section 6? Are they referring to the organisation named Software Heritage (https://www.softwareheritage.org/) or something else? It is not clear to me how preprint reviews would be deposited in this context.
  
  Respecting disciplinary and epistemic diversity
  
  In the abstract and elsewhere in the article, the authors acknowledge that preprints are gaining momentum “in some fields” as a way to share “scientific” findings. After reading this article, I agree that preprint review services may disrupt publishing for research communities where preprints are in the process of being adopted or already normalised. However, I am less convinced that such disruption is occurring, or could occur, for scholarly publishing more generally.
  
  I am particularly concerned about the casual conflation of “research” and “scientific research” in this article. Right from the start, it mentions how preprints allow sharing “new scientific findings” in the abstract, stating they “make scientific work available rapidly.” It also notes that preprints enable “scientific work to be accessed in a timely way not only by scientists, but also…” This framing implies that all “scholarly communication,” as mentioned in the title, is synonymous with “scientific communication.” Such language excludes researchers who do not typically identify their work as “scientific” research. Another example of this conflation appears in the caption for Figure 1, which outlines potential benefits of preprint review services. Here, “users” are defined as “scientists, policymakers, journalists, and citizens in general.” But what about researchers and scholars who do not see themselves as “scientists”?
  
  Similarly, the authors describe the 23 preprint review services using six categories, one of which is “scientific discipline”. One of those disciplines is called “humanities” in the text, and Table 1 lists it as a discipline for Science Open Reviewed. Do the authors consider “humanities” to be a “scientific” discipline? If so, I think that needs to be justified with very strong evidence.
  
  Additionally, Waltman et al.’s four schools of thought for peer review reform works well with the 23 services analysed. However, at least three out of the four are explicitly described as improving “scientific” research.
  
  Related to the above are how the five “scientific disciplines” are described as the “usual organisation” of the scholarly communication landscape. On what basis should they be considered “usual”? In this formulation, research in literature, history, music, philosophy, and many other subjects would all be lumped together into the “humanities”, which sit at the same hierarchical level as “biomedical and life sciences”, arguably a much more specific discipline. My point is not to argue for a specific organisation of research disciplines, but to highlight a key epistemic assumption underlying the whole paper that comes across as very STEM-centric (science, technology, engineering, and math).
  
  How might this part of the methodology affect the categories presented in Table 1? “Biomedical and life sciences” appear to be overrepresented compared to other “disciplines”. I’d like to see a discussion that examines this pattern, and considers why preprint review services (or maybe even preprints more generally) appear to cover mostly the biomedical or physical sciences.
  
  In addition, there are 12 services described as serving “all disciplines”. I believe this paper can be improved by at least a qualitative assessment of the diversity of disciplines actually represented on those services. Because it is reported that many of these service stress improving the “reproducibility” of research, I suspect most of them serve disciplines which rely on experimental science.
  
  I randomly selected five services for closer examination, as mentioned above. Of those, only Qeios has demonstrated an attempt to at least split “arts and humanities” into subfields. The others either don’t have such categories altogether, or have a clear focus on a few disciplines (e.g. life sciences for Hypothesis/TRiP). In all cases I studied, there is a heavy focus on STEM subjects, especially biology or medical research. However, they are all categorised by the authors as serving “all disciplines”.
  
  If preprint review services originate from, or mostly serve, a narrow range of STEM disciplines (especially experiment-based ones), it would be worth examining why that is the case, and whether preprints and reviews of them could (or could not) serve other disciplines and epistemologies.
  
  It is postulated that preprint review services might “disrupt the scholarly communication landscape in a more radical way”. Considering the problematic language I observed, what about fields of research where peer-reviewed journal publications are not the primary form of communication? Would preprint review services disrupt their scholarly communications?
  
  To be clear, my concern is not just the conflation of language in a linguistic sense but rather inequitable epistemic power. I worry that this conflation would (a) exclude, minoritise, and alienate researchers of diverse disciplines from engaging with metaresearch; and (b) blind us from a clear pattern in these 23 services, that is their strong focus on the life sciences and medical research and a discussion of why that might be the case. Critically, what message are we sending to, for example, a researcher of 18th century French poetry with the language and framing of this paper? I believe the way “disciplines” are currently presented here poses a real risk of devaluing and minoritising certain subject areas and ways of knowing. In its current form, I believe that while this paper is a very valuable contribution, one should not derive from it any conclusions which apply to scholarly publishing as a whole.
  
  The authors have demonstrated inclusive language elsewhere. For example, they have consciously avoided “peer” when discussing preprint review services, clearly contrasting them to “journal-based peer review”. Therefore, I respectfully suggest that similar sensitivity be adopted to avoid treating “scientific research” and “research” as the same thing. A discussion, or reference to existing works, on the disciplinary skew of preprints (and reviews of them) would also add to the intellectual rigour of this already excellent piece.
  
  Overall, I believe this paper is a valuable reflection on the state of preprints and services which review them. Addressing the points I raised, especially the use of more inclusive language with regards to disciplinary diversity, would further elevate its usefulness in the metaresearch discourse. Thank you again for the chance to review.
  
  Signed:
  
  Dr Pen-Yuan Hsing (ORCID ID: 0000-0002-5394-879X)
  
  University of Bristol, United Kingdom
  
  Data availability
  
  I have checked the associated dataset, but still suggest including hyperlinks to the 23 services analysed in the main text of this paper.
7. metaror 23 Sep 2025
  
  in Public
  
  Henriques, S. O., Rzayeva, N., Pinfield, S., & Waltman, L. (2023, October 13). Preprint review services: Disrupting the scholarly communication landscape?. https://doi.org/10.31235/osf.io/8c6xm
8. metaror 23 Sep 2025
  
  in Public
  
  Aug 11, 2024
9. metaror 23 Sep 2025
  
  in Public
  
  Nov 20, 2024
10. metaror 23 Sep 2025
  
  in Public
  
  Nov 20, 2024
11. metaror 23 Sep 2025
  
  in Public
  
  Authors:
  
  Susana Henriques (Research on Research Institute (RoRI) Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, the Netherlands Scientific Research Department, Azerbaijan University of Architecture and Construction, Baku, Azerbaijan) s.oliveira@cwts.leidenuniv.nl
  
  Narmin Rzayeva (Research on Research Institute (RoRI) Information School, University of Sheffield, Sheffield, UK) n.rzayeva@cwts.leidenuniv.nl
  
  Stephen Pinfield (Research on Research Institute (RoRI) Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, the Netherlands) s.pinfield@sheffield.ac.uk
  
  Ludo Waltman waltmanlr@cwts.leidenuniv.nl
12. metaror 23 Sep 2025
  
  in Public
  
  1
13. metaror 23 Sep 2025
  
  in Public
  
  10.31235/osf.io/8c6xm
14. metaror 23 Sep 2025
  
  in Public
  
  Preprint review services: Disrupting the scholarly communication landscape
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/socarxiv/8c6xm/
osf.io osf.io

Researchers are willing to trade their results for journal prestige: results from a discrete choice experiment

16
1. metaror 23 Sep 2025
  
  in Public
  
  Ludo Waltman is Editor-in-Chief of MetaROR working with Adrian Barnett, a co-author of the article and a member of the editorial team of MetaROR.
2. metaror 23 Sep 2025
  
  in Public
  
  In this article the authors use a discrete choice experiment to study how health and medical researchers decide where to publish their research, showing the importance of impact factors in these decisions. The article has been reviewed by two reviewers. The reviewers consider the work to be robust, interesting, and clearly written. The reviewers have some suggestions for improvements. One suggestion is to emphasize more strongly that the study focuses on the health and medical sciences and to reflect on the extent to which the results may generalize to other fields. Another suggestion is to strengthen the embedding of the article in the literature. Reviewer 2 also suggests to extend the discussion of the sample selection and to address in more detail the question of why impact factors still persist.
3. metaror 23 Sep 2025
  
  in Public
  
  none
4. metaror 23 Sep 2025
  
  in Public
  
  In "Researchers Are Willing to Trade Their Results for Journal Prestige: Results from a Discrete Choice Experiment", the authors investigate researchers’ publication preferences using a discrete choice experiment in a cross-sectional survey of international health and medical researchers. The study investigates publishing decisions in relation to negotiation of trade-offs amongst various factors like journal impact factor, review helpfulness, formatting requirements, and usefulness for promotion in their decisions on where to publish. The research is timely; as the authors point out, reform of research assessment is currently a very active topic. The design and methods of the study are suitable and robust. The use of focus groups and interviews in developing the attributes for study shows care in the design. The survey instrument itself is generally very well-designed, with important tests of survey fatigue, understanding (dominant choice task) and respondent choice consistency (repeat choice task) included. Respondent performance was good or excellent across all these checks. Analysis methods (pMMNL and latent class analysis) are well-suited to the task. Pre-registration and sharing of data and code show commitment to transparency. Limitations are generally well-described.
  
  In the below, I give suggestions for clarification/improvement. Except for some clarifications on limitations and one narrower point (reporting of qualitative data analysis methods), my suggestions are only that – the preprint could otherwise stand, as is, as a very robust and interesting piece of scientific work.
  
  Respondents come from a broad range of countries (63), with 47 of those countries represented by fewer than 10 respondents. Institutional cultures of evaluation can differ greatly across nations. And we can expect variability in exposure to the messages of DORA (seen, for example, in level of permeation of DORA as measured by signatories in each country, https://sfdora.org/signers/)..%3B!!NVzLfOphnbDXSw!HdeyeHHei6yWQHFjhN3deSSfp82ur9i9JNOLEVOYZN0BvyslUO2S8DlvjBbautmafJEvlUsxQZbT0JLQX7lO8EcOYtZsJkA%24&data=05%7C02%7Ca.l.brasil.varandas.pinto%40cwts.leidenuniv.nl%7C9f47a111adec49d04bb608dd0614ae94%7Cca2a7f76dbd74ec091086b3d524fb7c8%7C0%7C0%7C638673408085242099%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=by5mhPfSM0MFFG9LE2iiYjdtSs5IhvpuukqVv%2FLak2s%3D&reserved=0 "https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.com%2Fv3%2F__https%3A%2F%2Fsfdora.org%2Fsigners%2F).%3B!!NVzLfOphnbDXSw!HdeyeHHei6yWQHFjhN3deSSfp82ur9i9JNOLEVOYZN0BvyslUO2S8DlvjBbautmafJEvlUsxQZbT0JLQX7lO8EcOYtZsJkA%24&data=05%7C02%7Ca.l.brasil.varandas.pinto%40cwts.leidenuniv.nl%7C9f47a111adec49d04bb608dd0614ae94%7Cca2a7f76dbd74ec091086b3d524fb7c8%7C0%7C0%7C638673408085242099%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=by5mhPfSM0MFFG9LE2iiYjdtSs5IhvpuukqVv%2FLak2s%3D&reserved=0") In addition, some contexts may mandate or incentivise publication in some venues using measures including IF, but also requiring journals to be in certain databases like WoS or Scopus, or having preferred journal lists). I would suggest the authors should include in the Sampling section a rationale for taking this international approach, including any potentially confounding factors it may introduce, and then adding the latter also in the limitations.
  
  Reporting of qualitative results: In the introduction and methods, the role of the focus groups and interviews seems to have been just to inform the design of the experiment. But then, results from that qualitative work then appear as direct quotes within the discussion to contextualise or explain results. In this sense though, the qualitative results are being used as new data. Given this, I feel that the methods section should include description of the methods and tools used for qualitative data analysis (currently it does not). But in addition, to my understanding (and this may be a question of disciplinary norms – I’m not a health/medicine researcher), generally new data should not be introduced in the discussion section of a research paper. Rather the discussion is meant to interpret, analyse, and provide context for the results that have already been presented. I personally hence feel that the paper would benefit from the qualitative results being reported separately within the results section.
  
  Impact factors – Discussion section: While there is interesting new information on the relative trade-offs amongst other factors, the most emphasised finding, that impact factors still play a prominent role in publication venue decisions, is hardly surprising. More could perhaps be done to compare how the levels of importance reported here differ with previous results from other disciplines or over time (I know a like-for-like comparison is difficult but other studies have investigated these themes, e.g., https://doi.org/10.1177/01655515209585). In addition, beyond the question of whether impact factors are important, a more interesting question in my view is why they still persist. What are they used for and why are they still such important “driver[s] of researchers’ behaviour”? This was not the authors’ question, and they do provide some contextualisation by quoting their participants, but still I think they could do more to contextualise what is known from the literature on that to draw out the implications here. The attribute label in the methods for IF is “ranking”, but ranking according of what and for what? Not just average per-article citations in a journal over a given time frame. Rather, impact factors are used as a proxy indicators of less-tangible desirable qualities – certainly prestige (as the title of this article suggests), but also quality, trust (as reported by one quoted focus group member “I would never select a journal without an impact factor as I always publish in journals that I know and can trust that are not predatory”, p.6), journal visibility, importance to the field, or improved chances of downstream citations or uptake in news media/policy/industry etc. Picking apart the interactions of these various factors in researchers’ choices to make use of IFs (which is not in all cases bogus or unjustified) could add valuable context. I’d especially recommend engaging at least briefly with more work from Science and Technology Studies - especially Müller and de Rijcke’s excellent Thinking with Indicators study (doi: 10.1093/reseval/rvx023), but also those authors other work, as well as work from Ulrike Felt, Alex Rushforth (esp https://doi.org/10.1007/s11024-015-9274-5), Björn Hammerfelt and others.
  
  Disciplinary coverage: (1) A lot of the STS work I talk about above emphasises epistemic diversity and the ways cultures of indicator use differ across disciplinary traditions. For this reason, I think it should be pointed out in the limitations that this is research in Health/Med only, with questions on generalisability to other fields. (2) Also, although the abstract and body of the article do make clear the disciplinary focus, the title does not. Hence, I believe the title should be slightly amended (e.g., “Health and Medical Researchers Are Willing to Trade …”)
5. metaror 23 Sep 2025
  
  in Public
  
  none
6. metaror 23 Sep 2025
  
  in Public
  
  This manuscript reports the results of an interesting discrete choice experiment designed to probe the values and interests that inform researchers’ decisions on where to publish their work.
  
  Although I am not an expert in the design of discrete choice experiments, the methodology is well explained and the design of the study comes across as well considered, having been developed in a staged way to identify the most appropriate pairings of journal attributes to include.
  
  The principal findings to my mind, well described in the abstract, include the observations that (1) researchers’ strongest preference was for journal impact factor and (2) that they were prepared to remove results from their papers if that would allow publication in a higher impact factor journal. The first of these is hardly surprising – and is consistent with a wide array of literature (and ongoing activism, e.g. through DORA, CoARA). The second is much more striking – and concerning for the research community (and its funders). This is the first time I have seen evidence for such a trade-off.
  
  Overall, the manuscript is very clearly written. I have no major issues with the methods or results. However, I think but some minor revisions would enhance the clarity and utility of the paper.
  
  First, although it is made clear in Table 1 that the researchers included in the study are all from the medical and clinical sciences, this is not apparent from the title or the abstract. I think both should be modified to reflect the nature of the sample. In my experience researchers in these fields are among those who feel most intensely the pressure to publish in high IF journals. The authors may want also to reflect in a revised manuscript how well their findings may transfer to other disciplines.
  
  Second, in several places I felt the discussion of the results could be enriched by reference to papers in the recent literature that are missing from the bibliography. These include (1) Muller and De Rijcke’s 2017 paper on Thinking with Indicators, which discusses how the pressure of metrics impacts the conduct of research (https://doi.org/10.1093/reseval/rvx023); (2) Bjorn Brembs’ analysis of the reliability of research published in prestige science journals (https://www.frontiersin.org/journals/human-neuroscience/articles/10.3389/fnhum.2018.00376/full; and (3) McKiernan’s et al.’s examination of the use of the Journal Impact Factor in academic review, promotion, and tenure evaluations (https://pubmed.ncbi.nlm.nih.gov/31364991/).
  
  Third, although the text and figures are nicely laid out, I would recommend using a smaller or different font for the figure legends to more easily distinguish them from body text.
7. metaror 23 Sep 2025
  
  in Public
  
  12, 54
8. metaror 23 Sep 2025
  
  in Public
  
  Bohorquez, N. G., Weerasuriya, S., Brain, D., Senanayake, S., Kularatna, S., & Barnett, A. (2024, July 31). Researchers are willing to trade their results for journal prestige: results from a discrete choice experiment. https://doi.org/10.31219/osf.io/uwt3b
9. metaror 23 Sep 2025
  
  in Public
  
  Aug 03, 2024
10. metaror 23 Sep 2025
  
  in Public
  
  Nov 20, 2024
11. metaror 23 Sep 2025
  
  in Public
  
  Nov 20, 2024
12. metaror 23 Sep 2025
  
  in Public
  
  Authors:
  
  Natalia Gonzalez Bohorquez (Queensland University of Technology) natalia.gonzalezbohorquez@hdr.qut.edu.au
  
  Sucharitha Weerasuriya (Queensland University of Technology) sucharitha.weerasuriya@qut.edu.au
  
  David Brain (Queensland University of Technology) david.brain@qut.edu.au
  
  Sameera Senanayake (Duke-NUS Medical School) sameera.senanayake@duke-nus.edu.sg
  
  Sanjeewa Kularatna (Duke-NUS Medical School) sanjeewa.kularatna@duke-nus.edu.sg
  
  Adrian Barnett a.barnett@qut.edu.au
13. metaror 23 Sep 2025
  
  in Public
  
  1
14. metaror 23 Sep 2025
  
  in Public
  
  https://doi.org/10.31219/osf.io/uwt3b
15. metaror 23 Sep 2025
  
  in Public
  
  10.31219/osf.io/uwt3b
16. metaror 23 Sep 2025
  
  in Public
  
  Researchers are willing to trade their results for journal prestige: results from a discrete choice experiment [Version 1]
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/osf/uwt3b
arxiv.org arxiv.org

Use of diverse data sources to control which topics emerge in a science map

16
1. metaror 08 Sep 2025
  
  in Public
  
  none
2. metaror 08 Sep 2025
  
  in Public
  
  The reviewers agree that the paper constitutes a valuable contribution to the literature, illuminating the use of document data networks to control the topic clustering of a science map, using a rigorous methodology and with careful presentation of results. For this purpose, one reviewer suggests that “the key result may be that BERT-based methods offer a more faithful reproduction of the existing map than bibliometric approaches using the alternative data sources”, but as this is a somewhat exploratory study, my impression is that the authors refrained from strong conclusions. The document data networks used as examples were datasets from Facebook, the former Twitter, documents cited in patents, and policy documents. The reviewers provide comments about several details that could be improved, including specifying more details of the method and datasets, availability of their data, and framing of the introduction and scope, including the addition of some important references.
3. metaror 08 Sep 2025
  
  in Public
  
  I am affiliated with Digital Science, the owner and operator of both Altmetric and Dimensions—two of the data sources analysed in the study.
4. metaror 08 Sep 2025
  
  in Public
  
  I read with interest the article by Bascur, Costas, and Verberne, which examines the use of diverse data sources to influence topic emergence in science maps.
  
  At the outset, I note that I am affiliated with Digital Science, the owner and operator of both Altmetric and Dimensions—two of the data sources analysed in the study.
  
  The article focuses on the mapping of articles onto science maps characterised by clusters of topical areas. These are typically visualised in two dimensions, where the relative positions of topics are determined by a selected distance metric. This area of study has seen considerable development in recent years, and science maps continue to serve as a compelling tool in various analytical and strategic contexts.
  
  While the paper’s focus on mapping research outputs onto science maps is timely and relevant, I was disappointed to see that key foundational works in this field were not cited. In particular, I believe the following references are highly relevant, especially the last, which explores the use of Wikipedia in a manner closely related to the current paper:
  
  Börner, K., Chen, C., & Boyack, K. W. (2005). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37(1), 179. https://doi.org/10.1002/aris.1440370106
  
  Boyack, K., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64, 351–374. https://doi.org/10.1007/s11192-005-0255-6
  
  Holloway, T., Bozicevic, M., & Börner, K. (2007). Analyzing and visualizing the semantic coverage of Wikipedia and its authors. Complexity, 12, 3–85. https://doi.org/10.1002/cplx.20164
  
  The paper is framed as an exploration of how networks emerge, which is an important and intriguing subject. Visual representations play a crucial role in knowledge communication and decision-making, and I consider this work both significant and valuable.
  
  That said, I initially interpreted the paper as an exploration of science map construction via bibliometric coupling informed by different data sources. However, the paper does not appear to explore alternative embedding metrics, topological variations, or the graphs that might result from different coupling strategies. Instead, it primarily assesses how faithfully papers can be mapped onto an existing topic structure using alternative data sources. I use the term "faithful" here in its group-theoretic sense—capturing both purity and effectiveness—as it conveys the intended meaning more precisely, in my opinion.
  
  This focus differs somewhat from the broader ambitions implied in the title and abstract. I recommend the authors reassess whether the current framing accurately reflects the content, or alternatively, provide a more explicit explanation of the embedding method used and how it relates to the structural similarity being evaluated. Even if the scope is narrower than anticipated, the findings remain rigorous, well-articulated, and represent a valuable contribution.
  
  If the paper is best understood as a study of the faithfulness of mapping unclassified papers to an existing clustering structure using different linking mechanisms (e.g., social data vs. textual or citation-based), then its key result appears to be that BERT-based methods offer a more faithful reproduction of the existing map than bibliometric approaches using the alternative data sources.
  
  Given the clustering methodology behind the target map, this result is sensible. The use of social media data, as discussed in the paper, is more likely to yield alternative representations rather than a faithful reproduction. By contrast, BERT embeddings naturally align more closely with textual structures already reflected in the map. This outcome is consistent with the analytic approach adopted.
  
  As an aside, the paper does not appear to reference the work of Evans and Lambiotte (https://doi.org/10.1103/PhysRevE.80.016105), which investigates the use of bipartite graphs and their duals for community detection. Their work is directly relevant to the paper’s discussion of faithfulness, particularly in terms of minimising overlap in cluster assignments.
  
  Finally, I believe the paper would benefit from a stronger articulation of the contexts in which science mapping is applied. I have previously explored this in (https://doi.org/10.1162/qss_a_00244), and I believe the current work holds particular promise in evaluative settings. Preserving classification consistency over time is often vital for longitudinal comparisons, and the paper’s approach could be valuable in assessing alternative coupling strategies against a stable reference framework.
  
  In summary, this paper presents a thoughtful and well-executed study. I find the technical development in the methodology around the refinement of purity to be helpful and something that those in the field will want to explore further. I recommend the authors consider refining the framing and expanding the contextualisation to strengthen the contribution and clarify its position within the existing literature.
5. metaror 08 Sep 2025
  
  in Public
  
  none
6. metaror 08 Sep 2025
  
  in Public
  
  The article “Use of diverse data sources to control which topics emerge in a science map” aims to analyze the effects of different data sources on topic clustering bias in science maps. For this purpose, the clustering effectiveness of different topic categories is analyzed based on different traditional and non-traditional data sources.
  
  (1) contribution to existing literature
  
  The present research is well embedded in the existing body of literature and builds on the study “Which topics are best represented by science maps? An analysis of clustering effectiveness for citation and text similarity networks” by Bascur, Verberne, van Eck, and Waltman (2024). That study explored the extent to which science maps can successfully cluster documents that address the same topic - a concept referred to as clustering effectiveness. This metric serves as an indicator of the thematic precision of clustering approaches. Bascur et al. (2024) found that clustering effectiveness varies depending on the topic domain: documents related to certain topics, such as diseases, were more accurately clustered than those related to others, such as geography. Building on these findings, the present study investigates whether the clustering effectiveness for documents on the same topic is influenced by the choice of data source, and whether this effectiveness can be systematically adjusted or optimized through the selection of that source.
  
  (2) major strengths and weaknesses
  
  The article’s ideas and arguments are presented with clarity and precision. Its structure follows a classic and well-established format - introduction, background, methods, results, discussion, and conclusion - which makes it easy to follow. As a reader, I never lost the thread; the narrative remains coherent and accessible throughout. The current state of research is conveyed in a thorough, well-reasoned, and nuanced manner. Particularly noteworthy is the detailed introduction to the topics of science maps based on diverse sources and comparing clustering solutions of different networks. This contextualization is both comprehensive and essential for understanding the research that follows.
  
  The document selection is highly extensive (4,142,511 documents) and well-justified. The rationale for which documents are included in the study is clearly and convincingly presented. All selection criteria are explained in detail in Section 3.1.
  
  The introduction clearly explains the rationale for using non-traditional data sources alongside traditional data sources, and the justification is both logical and easy to follow. The external data sources are introduced and described in Section 3.2. The procedures for building the different networks (Sections 3.3 and 3.4), as well as the clustering approaches (Section 3.5), are also thoroughly explained. The topics and topic categories analyzed in the study are presented and justified in detail in Section 3.6. To evaluate how well different topics are represented within the clustering networks, the study employs the concept of clustering effectiveness. The relevant calculations are described in Section 3.7.
  
  The article presents its complex results in a well-structured and sensible tabular format. Figure 2 provides an example to illustrate how the results are displayed. Table 3 reports all detailed results, while Table 4 offers a summary, and Table 5 draws conclusions on which network performed best for each topic. The tables and their captions are extensive and may seem overwhelming at first glance. However, the article makes it clear that this level of detail is both intentional and necessary. The thorough descriptions guide the reader through the results and enhance comprehension. Rather than being a weakness, the comprehensive presentation reflects the authors’ careful and rigorous approach.
  
  Regarding additional strengths of the article, I would like to highlight and support those identified by the authors themselves. This study represents a clear advancement over the 2024 publication. By focusing on a single metric—purity, rather than also including inverse cluster number—the evaluation and interpretation of results have been significantly simplified, and comparability has improved. Whereas the earlier study only allowed comparisons between cluster solutions based on identical document sets and similar cluster sizes, the current study enables comparisons across different networks, even when they involve varying documents and cluster structures. A notable innovation in this article is the introduction of purity profiles, which effectively illustrate how clearly topic clusters would be perceived by users navigating the science map.
  
  In addition to highlighting the strengths of their work, the authors also acknowledge three key limitations. These include the absence of a specified minimum cluster size, the combination of bipartite and non-bipartite networks, and the potential inaccessibility of certain data sources for other researchers (e.g., due to paywalls such as those associated with the Twitter API). Each of these limitations is clearly presented and discussed in the article. The authors provide thoughtful reasoning on the impact of these constraints and explain how they have addressed them within the scope of their study.
  
  (3) suggestions for improvements
  
  I have no suggestions for improving the article.
  
  (4) data and code availability/ research ethics/ MetaROR policies
  
  The research data is available on Zenodo in accordance with the principle as open as possible, as closed as necessary. Due to legal restrictions, the raw data used in the experiments cannot be shared. However, the code used to conduct the experiments and generate the results is provided, along with a summary of the data utilized.<br /> This ensures transparency and allows others to understand the methodology and replicate the results, even in the absence of the original raw data.
7. metaror 08 Sep 2025
  
  in Public
  
  none
8. metaror 08 Sep 2025
  
  in Public
  
  In this article, the authors present a study using different networks from various data sources to measure differences in gathering scholarly document topics and to show which networks provide the best information to represent the scientific topics considered appropriately. The work is built on a previous contribution and analyses networks obtained from six sources: scholarly document authors, Facebook users, Twitter users and conversations, patents, and policy documents. These networks are also accompanied by other networks, i.e. the text similarity network and the citation network, that are mainly used for comparison purposes.
  
  The work particularly interests the scholarly community, aiming to work with science map generation. However, some passages need further explanation to be clear to the reader.
  
  In the abstract, there is a mention of traditional and non-traditional data sources. While in the text of the article there are, indeed, some clarifications, it would be ideal to briefly explain in the abstract what the authors refer to these terms, since it is not immediately clear what is a traditional data source in the context of topic identification.
  
  In the introduction, the authors anticipate the outcomes of a previous work they have conducted on a similar topic. They claim that some topics are well-represented in maps based on citation links and text similarity, while others are not. However, it is not clear which sources they have used to get to this claim, and it is also not evident what the main difference is that characterises the current work compared to the previous one.
  
  In section 3, the authors introduce all the methods and materials used for their analysis. Despite the fact that some of the material cannot be shared since it is behind a paywall (e.g. the Web of Science data), by reading the section, it is not clear that all the code developed and the data obtained from the analysis have been published on Zenodo. While it is okay to address this aspect in the appropriate section at the end of the article, I would suggest to anticipate this information at the beginning of section 3, citing the Zenodo record appropriately and clarifying which of material is not included in that record, thus explaining that the full reproducibility of the experiment cannot be conducted.
  
  Considering all the external sources of networks, it is not clear what the datetime window of each source is - are all these sources containing information from the year of publication of the oldest article in the document set considered to 2024?
  
  As far as I understood from the formula in section 3.7.1, the Purity is always calculated against a particular topic M. Thus, why not refer to such "M" in the formula definition, defining it in a function-like way Purity(N, M)? In addition, still in this section, it is not clear how the N clusters considered are selected. A running example of Purity calculation would probably help the reader here.
  
  In section 3.7.2, the denominator of the formula is set to 5. However, it is unclear why such a number is sensitive for the calculation presented. Why not 6 or 7? Why not 3? I think the authors should clearly justify the choice of such a denominator by bringing in explicit evidence.
  
  In section 3.7.3, it is not entirely clear what the difference is between topics and topic categories.
  
  In the discussions, it would be good to extend a bit on the work's limitation and envision possible paths for future works in the area. A few points that I would love to see discussed in detail:
  
  The analysis has been done by using sources that may have changed drastically in the past months/years - e.g. Twitter that, after becoming X, has seen a series of abandons from the academics towards more open (in a broad sense) platforms and networks (e.g. Mastodon and, more recently, BlueSky). Would it be possible to gather the necessary data from these platforms to run the study again? If yes, would it be possible to download them? If not, should we consider these sources unreliable for scientific purposes and, if so, what preconditions should be in place for their reliability? Considering the present situation, what is the relevance of the results obtained with the data gathered from Twitter (now X)?
  
  The authors transparently claim that some of the data used (e.g. Web of Science data) are not freely available to the reader, thus preventing the full replication of the study. Is it possible to substitute these closed sources with others offering open research information? For instance, OpenCitations for gathering the citation network (full disclosure: I'm director of OpenCitations), PubMed and PubMed Central for gathering titles and abstracts of the article considered, etc.?
  
  The core set of scholarly documents considered are primarily from the biomedical domain since the authors considered only those with a PubMed identifier specified. While the results shown are sensitive for this domain, how much does the approach the authors presented scale also in other scholarly areas, e.g. Social Science and Humanities? Is it possible to speculate that the approach presented is discipline-agnostic? Is there any evidence for such a claim?
  
  Some final remarks:
  
  A. The figures should be closer (i.e. maximum on the next page) to the place they are mentioned the very first time.
  
  B. The research question introduced in the article is introduced in section 1, and then it is not explicitly mentioned anymore in the text. It would be ideal to add an explicit reference to that question when the authors present appropriate evidence to answer it (e.g. in section 4) and to recall the answer to that question in the conclusion of the paper.
9. metaror 08 Sep 2025
  
  in Public
  
  Bascur, J. P., Costas, R., & Verberne, S. (2024). Use of diverse data sources to control which topics emerge in a science map. arXiv preprint arXiv:2412.07550.
10. metaror 08 Sep 2025
  
  in Public
  
  December 10, 2024
11. metaror 08 Sep 2025
  
  in Public
  
  June 11, 2025
12. metaror 08 Sep 2025
  
  in Public
  
  June 30, 2025
13. metaror 08 Sep 2025
  
  in Public
  
  Authors:
  
  Juan Bascur j.p.bascur.cifuentes@cwts.leidenuniv.nl
  
  Rodrigo Costas
  
  Suzan Verberne
14. metaror 08 Sep 2025
  
  in Public
  
  https://arxiv.org/abs/2412.07550
15. metaror 08 Sep 2025
  
  in Public
  
  https://doi.org/10.48550/arXiv.2412.07550
16. metaror 08 Sep 2025
  
  in Public
  
  Use of diverse data sources to control which topics emerge in a science map
Visit annotations in context

Annotators

metaror

URL

arxiv.org/abs/2412.07550
Aug 2025
osf.io osf.io

Approximately 1 in 7 Scientific Papers Are Fake

19
1. metaror 13 Aug 2025
  
  in Public
  
  The value of the article lies not in its too-roughly calculated estimate but in its attempt to highlight both an important yet unanswered question and the difficulties that hinder our ability to reliably answer it. The article also provides a useful summary of the bourgeoning literature and the challenges of drawing broad inferences from it. The author should consider highlighting these points rather than the rough estimate of the rate of falsification and fabrication. The author should change the article’s title to reflect the skepticism about the estimate that runs throughout the article so as not to confuse readers about what we can reliably take away from the article.
  
  The following are specific suggestions:
  
  Adding reference or links to Table 1 would help readers find details related to the listed items.
  
  p. 6 (“The following (Table 1) is a selection of events which took place after the figure above was established.”): Clarify why 2005 is the first year of interest in the events table (e.g., change sentence to “The following (Table 1) is a selection of events that took place during or after 2005, the final year of publication of the studies Fanelli used to compute the 2% figure.”).
  
  p. 7 (“Significantly, all of the above happened after the figure of 2% was collected.”): change to “… after publication of the studies on which the 2% figure is based.”
  
  The link in footnote 5 no longer works. The report can be found at https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1262&context=scholcom. Consider changing all links to permalinks.
  
  p. 16: list the 29 observations of data sleuth estimates in a footnote.
  
  Footnote 6: the link pulls up the Retraction Watch Database for Nature Publishing Group. I accessed the link on Jan 15, 2025, and the database found 1,610 items. It’s not clear how you computed 667 (12,000 / 667 = 18). If “Retracted Article” is chosen for “Article Type(s),” the count is 1.
  
  p. 18 (“the presumably higher number of papers containing questionable research practices (which are far more commonly admitted to) is presumably higher still.”): Consider citing to published estimates, which are mostly produced using surveys, and note that this literature suffers from the same problems as the literature that estimates FF.
  
  p. 18: add citations to articles that address each of the harms caused by false results.
  
  Bottom of p. 19 (“In particular, it seems likely that FF rates change by individual field – in doing so, they may present specific rather than general threats to human health and scientific progress.”): Providing some explanation for field-specific rates might help the reader assess the claim. For example, it’s possible that rates are similar across fields because those willing to commit fraud or to fabricate data likely randomly distribute themselves across fields, and journal editors and referees are roughly equally likely to fail to detect falsification and fabrication. Does any evidence call these possibilities into question?
2. metaror 13 Aug 2025
  
  in Public
  
  none
3. metaror 13 Aug 2025
  
  in Public
  
  The author uses 12 previously reported estimates from studies that focus on different research quality characteristics and construct samples from different literatures to estimate that approximately 1 in 7 scientific papers are “fake.” All three reviewers, however, call into question the estimate’s accuracy, and the article itself notes reasons to be skeptical. Even setting aside the intrinsic difficulties given the available evidence, the article does not use a systematic or rigorous method to compute the reported estimate. Thus, the reported estimate could be overstated or understated. The author also argues that the proportion of scientific outputs that are fake is a more relevant statistic than the oft-cited percentage of scientists who admit to faking or plagiarizing (Fanelli, 2009). The author calls for better recognition of the problem and better funding so that metaresearchers can conduct large-scale studies capable of producing more reliable overall estimates. The reviewers noted some strengths. For example, two reviewers noted that the research question is important and that updated estimates are needed. One reviewer noted the importance of understanding the increase in the percentage of fake scientific outputs given changes in available technology helpful in committing fraud and found the estimate of 1 in 7 urgently concerning despite its roughness. The reviewers also point to weaknesses. Reviewer 1 worries that no published estimate tells us much about the overall proportion of fake studies. This reviewer proposes that the author take a different approach by determining which data are needed to accurately estimate the proportion, collecting that data, and using it to compute a reliable estimate. This reviewer also suggests adding references to support claims made throughout the article. The second co-authored review report notes three concerns. First, the co-reviewers emphasize that the author calls his own claims into question. Second, they argue that the author is incorrect in claiming that his article is “in opposition” to Fanelli (2009) because both articles fail to provide a reliable estimate of the amount of scientific output that is fake. Finally, the co-reviewers draw inferences from a dataset they constructed to argue that the author is incorrect in his characterizations of how others have interpretated Fanelli (2009). Reviewer 3 notes that the author’s focus on articles rather than scientists deemphasizes the important human dimension of fakery. This reviewer suggests emphasizing reputational harm caused by false positives. In sum, all three reviewers are unpersuaded by the author’s claim that approximately 1 in 7 scientific papers are fake.
4. metaror 13 Aug 2025
  
  in Public
  
  Jennifer Byrne receives NHMRC grant funding to study the integrity of molecular cancer research publications.
5. metaror 13 Aug 2025
  
  in Public
  
  This manuscript attempts to provide an answer to the proportion of scientific papers that are fake. The presence of fake scientific papers in the literature is a serious problem, as the author outlines. Papers of variable quality and significance will inevitably be published, but most researchers assess manuscripts and papers based on the assumption that the described research took place. Papers that disguise their identities as fake papers can therefore be highly damaging to research efforts, by preventing accurate assessments of research quality and significance, and by encouraging future research that could consume time and other resources. As the manuscript describes, fake papers are also damaging to science by eroding trust in the scientific method and communities of scientists.
  
  It is therefore clear that knowing the proportion of fake scientific papers is important, that the author is concerned about the problem, and that the author wants to arrive at an answer. However, as the manuscript partly recognises, the question of the overall proportion of fake scientific papers is currently difficult to answer.
  
  The overall proportion of fake papers in science will represent the individual proportions of fake papers in different scientific disciplines. In turn, the proportions of fake papers in any single discipline will reflect many factors, including (i) researcher incentives to produce fake papers, (ii) the ease with which fake papers can be produced and (iii) published, (iv) the ease or likelihood of fake papers being detected, before or (v) after publication, and (vi) the consequences for authors if they are found to have published fake papers. Some of these factors are likely to vary between different disciplines and in different research settings. For example, it has been suggested that it is similarly difficult to invent some research results as it is to produce genuine data. However, in other fields, it is easier to invent data than to generate data through experiments that remain difficult, expensive and/or slow. It is also likely that factors such as the capacity to invent fake papers, detect fake papers, as well as incentives and consequences for researchers could vary over time, particularly in response to generative AI.
  
  As someone who studies errors in scientific papers, I don’t believe that we currently have a good understanding of the proportions of fake papers in any individual scientific field, at any time. There are some fields where we have estimates of individual error types, but these error types are likely to wrongly estimate the overall proportions of fake papers. Rather than attempting to answer the question of the overall proportion of fake scientific papers in the absence of the necessary data, it seems preferable to describe how we could obtain the data that we need to answer this question. While the overall proportion of fake scientific papers is an important statistic, most scientists will also be more concerned about how many fake papers exist in their own fields. We could therefore start by trying to obtain reliable estimates of fake papers in individual fields, working out how we need to do this, and then carrying out the necessary research. In the absence of reliable data, it’s perhaps most important that researchers are aware that fake papers could exist in their fields, so that all researchers can assess papers more carefully.
  
  Beyond these broad considerations, the following manuscript elements could be reconsidered.
  
  Fake science is defined as fabricated or falsified, yet this definition is sometimes expanded to include plagiarism (page 8, Table 2). However, plagiarism doesn’t equate with faking or falsifying data, and some plagiarised articles could describe sound data. Including plagiarised articles as fake articles will inevitably inflate estimates of fake papers, particularly in fields with higher rates of plagiarism.
  
  Table 1 was stated to represent “a selection of events that took place after the figure above (ie the figure published by Fanelli (2009)) was established”, yet some listed references/ events were published/ occurred between 2005 and 2008.
  
  It is reasonable to expect that increased capacity to autogenerate text and images will increase the numbers of fake papers, but I’m not aware of any evidence to support this. No reference is cited.
  
  Table 2; “similar survey results”: it’s not clear how the listed studies are similar.
  
  There are many unreferenced statements, eg page 9, “most rejected papers are published, just elsewhere”, page 19.
  
  Some estimates of fake papers arise from small sample sizes (eg page 13).
  
  The statement “The accumulation of papers assembled here is, frankly, haphazard” doesn’t inspire confidence in the resulting estimate.
  
  “…it would be prudent to immediately reproduce the result presented here as a formal systematic review”- any systematic review seems premature without reliable estimates.
  
  “The false positive rate (FPR) of detecting fake science is almost certainly quite low”- this seems unlikely to be correct. False positive rates depend on the methods used. Different methods will be required to detect fake papers in different disciplines, and these different methods could have very different false positive rates, particularly when comparing the application of manual versus automated methods that are applied without manual checking.
  
  Page 2: I could not see the n=12 studies summarised in a single Table.
  
  Page 10: “All relevant studies were included”…. “The list below is comprehensive but not necessarily exhaustive”- these statements contradict each other.
6. metaror 13 Aug 2025
  
  in Public
  
  The author has no conflicts of interest to disclose.
7. metaror 13 Aug 2025
  
  in Public
  
  The provocative essay written by James Heathers is a genuine attempt to quantify the current prevalence of two growing research malpractices, namely fabrication and falsification (FF for short), which are universally recognized as gross misconducts. The matter is of interest not only to researchers themselves (including meta-scientists), but also to general audiences, since taxpayers have a natural right to oversee the rewards of Science for the society at large. The underlying assumption of the author is that the generally accepted figure of 2% of researchers involved at least once in FF should now be considered as a lower bound. This 2% rate appeared in an article authored by Daniele Fanelli in 2009, and made an impact in the scholarly community. However, a lot of water has flowed under the bridge since then, and new actors showed up: papermills, sophisticated digital tools (intended for both data fabrication and FF tracking), whistleblowers communicating via social networks, generative artificial intelligence, etc. The update proposed by James Heathers is thus certainly welcome.
  
  The other premise of the author is that the assessment of the proportion of faking scientists is not a suitable proxy. Instead, he preferred to address a tangential issue: the estimation of the rate of scholarly papers including fabricated or falsified data. According to the author, such an approach has more benefits than drawbacks, and could be, from an idealistic point of view, fully automated. One could agree, although the fear of seeing the building of an Orwellian machinery is never far away. At the end of the process, offending papers are retracted (assuming, again, an ideal world), while the authors of the flagged papers are jailed (metaphorically or not).
  
  A survey of more recent studies was thus carried out. Although the author acknowledges that the small sample size for his study (N = 12), as well as the large dispersion of FF estimates retrieved from this corpus, do not allow a proper meta-analysis, an alarming figure of 14.3% for the updated FF rate emerges. Moreover, this figure is consistent with independent data reported by other sleuths engaged in the fight against questionable research practices, which are mentioned in the “discussion” section of the paper. Even if estimated in a rough way, the increase of FF in less than 15 years, if confirmed by other studies, is a real threat to Science, and should be addressed urgently.
  
  The main value of this essay is thus to raise concerns about the fast growth of FF, rather than to provide an up-to-date FF rate, which is anyway probably impossible to obtain in a reliable manner. On the other hand, an obvious weakness of the study is the chosen target: by focusing his attention on papers, James Heathers is missing the human dimension of the academic endeavour. Indeed, authors and papers are entangled bodies, and like entangled particles, they are described by a single state involving both entities: a paper does not exist without authors, and authors are invisible if they do not publish on a regular basis.
  
  Nowadays, scientific papers are extremely complex, and almost always impenetrable to researchers outside of the involved field. However, Homo academicus (as coined by Pierre Bourdieu) is also a very complex being. This is why, despite there is an unambiguous definition for FF, the false positive and negative rates of detecting FF are unknown, as recognized by James Heather. In particular, false positive detections can be detrimental to authors. This point is mentioned en passant in the essay, but should be emphasized: it is more than just a drawback of the used methodology, since it is related to the very human dimension of the scholarly enterprise.
  
  Perhaps a complementary perspective of the work carried out by James Heathers could be based on the following example: James Ibers (1930-2021), an old-school chemist and influential crystallographer, wrote a memoir published by the American Crystallographic Association, shortly before his death.1 He describes how, as a freshman at Caltech, he attended a mandatory one-week orientation workshop. In his own words: “The most important message I took away was the Caltech Honor Code for all undergraduates. In its simplest terms: You can’t cheat in Science because you will eventually be found out. I have adhered to that Code as a husband, a father, a scientist, a teacher, a research director, and all others I have dealt with”. How many of us can ensure, without hesitation, that they stand next to Ibers? What is the tolerable threshold of cheaters in Science? 2%? 14.3%? More?
  
  James Heathers ends his article with a worrying sentence: “Priorities must change, or science will start to die”. Perhaps, however, Science is already as dead as a dodo.
  
  1 https://chemistry.northwestern.edu/documents/people/james_ibers.aca.memoir.2020.pdf
8. metaror 13 Aug 2025
  
  in Public
  
  Additional reviewers:
  
  Maha Said
  
  Frederique Bordignon
9. metaror 13 Aug 2025
  
  in Public
  
  The reviewers have no competing interests to declare.
10. metaror 13 Aug 2025
  
  in Public
  
  The title of the article makes a simple striking claim about the state of the scientific literature with a numerical estimate of the proportion of “fake” articles. Yet, by contrast to this title, in the text of the article, Heathers is highly critical of his own work.
  
  James’ peer review of Heathers’ article
  
  James Heathers often mentions the limitations of his research thus “peer-reviewing” his own article to the extent that he admits that this work is “incomplete”, “unsystematic” and “far flung”.
  
  “This work is too incomplete to support responsible meta-analysis, and research that could more accurately define this figure does not exist yet. ~1 in 7 papers being fake represents an existential threat to the scientific enterprise.”
  
  “While this is highly unsystematic, it produced a substantially higher figure. Correspondents reliably estimated 1-5% of all papers contain fabricated data, and 2-10% contain falsified results.”
  
  “These values are too disparate to meta-analyze responsibly, and support only the briefest form of numerical summary: n=12 papers return n=16 individual estimates; these have a median of 13.95%, and 9 out of 16 of these estimates are between 13.4% and 16.9%. Given this, a rough approximation is that for any given corpus of papers, 1 in 7 (i.e. 14.3%) contain errors consistent with faking in at least one identifiable element.”
  
  “The accumulation of papers collected here is, frankly, haphazard. It does not represent a mature body of literature. The papers use different methods of analyzing figures, data, or other features of scientific publications. They do not distinguish well between papers that have small problematic elements which are fake, or fake in their entirety. They analyze both small and large corpora of papers, which are in different areas of study and in journals of different scientific quality – and this greatly changes base rates;…”
  
  “As a consequence, it would be prudent to immediately reproduce the result presented here as a formal systematic review. It is possible further figures are available after an exhaustive search, and also that pre registered analytical assumptions would modify the estimations presented.”
  
  Heathers has also in an interview published in Retraction Watch (Chawla 2024) acknowledged pitfalls in this article such as:
  
  “Heathers said he decided to conduct his study as a meta-analysis because his figures are “far flung.””
  
  “They are a little bit from everywhere; it’s wildly nonsystematic as a piece of work,” he said.”
  
  “Heathers acknowledged those limitations but argued that he had to conduct the analysis with the data that exist. “If we waited for the resources necessary to be able to do really big systematic treatments of a problem like this within a specific area, I think we’d be waiting far too long,” he said. “This is crucially underfunded.”
  
  Built in opposition to Fanelli 2009, but it’s illogical
  
  Heathers states in the abstract that his article is “in opposition” to Fanelli’s 2009 PloS One article (Fanelli 2009), yet that opposition is illogical and artificially constructed since there is no contradiction between 2% of scientists self-reporting having taking part in fabrication or falsification and an eventual much higher proportion of “fake scientific outputs”. Like most of what is wrong with Heather’s article, this is in fact acknowledged by the author who notes that the 2% figure “leaves us with no estimate of how much scientific output is fake” (bias in self-reporting, possibility of prolific authors, etc).
  
  Fanelli 2009 is not cited in the way JH says it is cited
  
  Whilst the opposition discussed above is illogical, it could be that the 2% figure is mis-cited by others as representing an estimate of fake scientific outputs thus probably underestimating the extent of fraud. Heathers suggests that this may indeed be the case, but also contradicts himself about how (Fanelli 2009), or the 2% figure coming from that publication, is typically used.
  
  In one sentence, he writes that “the figure is overwhelmingly the salient cited fact in its 1513 citations” and that “this generally appears as some variant of “about 2% of scientists admitted to have fabricated, falsified or modified data or results at least once” (Frank et al. 2023)
  
  whilst and in another sentence, he writes that “the typical phraseology used to express it – e.g. “the most serious types of misconduct, fabrication and falsification (i.e., data fraud), are relatively rare” (George 2016).
  
  Those two sentences cited by Heathers are fundamentally different, the first one accurately reports that the 2% figure relates to individuals self-reporting, whilst the second one appears to relate to the prevalence of misconducts in the literature itself. How Fanelli 2009 is cited in the literature is an empirical question that can be studied by looking at citation contexts beyond the two examples given by Heathers. Given that a central justification for Heathers’ piece appears to be the misuse of this 2% figure, we sought to test whether this was the case.
  
  A first surprise was that whilst the sentence attributed to (George 2016) can indeed be found in that publication (in the abstract), first it is not in a sentence citing (Fanelli 2009) nor the 2% figure, and, second, it is quoted selectively omitting a part of the sentence that nuances it considerably: “The evidence on prevalence is unreliable and fraught with definitional problems and with study design issues. Nevertheless, the evidence taken as a whole seems to suggest that cases of the most serious types of misconduct, fabrication and falsification (i.e., data fraud), are relatively rare but that other types of questionable research practices are quite common.” (Fanelli 2009) is discussed extensively by (George 2016), and some of the caveats, e.g. on self-reporting, are highlighted.
  
  To go beyond those two examples, we constructed a comprehensive corpus of citation contexts, defined as the textual environment surrounding a paper's citation, including several words or sentences before and after the citation (see Methods section below). 737 citation contexts could be analysed. Out of those, the vast majority (533, or 72%) did not cite the 2% figure. Instead, they often referred to this article as a general reference together with other articles to make a broad point, or, focused on other numbers in particular those related to questionable research practices (Bordignon, Said, and Levy 2024). The 28% (204) citation contexts that did mention the 2% figure did so accurately in the majority of cases: 83% (170) of those did mention that it was self-reporting by scientists whilst 17% (34) of those, or 5% of the total citation contexts analysed were either ambiguous or misleading in that they suggested or claimed that the 2% figure related to scientific outputs.
  
  Although the analysis above does not include all citation contexts, it is possible to conclude unambiguously that the 2% figure is not overwhelmingly the salient cited fact in relation to Fanelli 2009, and that when it is cited it is often accurately, i.e. as representing self-reporting by scientists. Whilst an exhaustive analysis is beyond the scope of this peer review, it is not uncommon to find in this corpus citations contexts that have an alarming tone about the seriousness of the problem of FFPs, e.g. “…a meta-analysis (Fanelli 2009) suggest that the few cases that do surface represent only the tip of a large iceberg." [DOI: 10.1177/0022034510384627]
  
  Thus, the rationale for Heathers’ study appears to be misguided. The supposed lack of attention for the very serious problem of FFPs is not due to a minimisation of the situation fueled by a misinterpretation of Fanelli 2009. Importantly, even if that was the case, an attempt to draw attention by claiming that 1 in 7 papers are fake, a claim which according to the author himself is not grounded in solid facts, is not how the scientific literature should be used.
  
  Methods for the construction of the corpus of citation contexts
  
  We used Semantic Scholar, an academic database encompassing over 200 million scholarly documents from diverse sources including publishers, data providers, and web crawlers. Using the specific paper identifier for Fanelli's 2009 publication (d9db67acc223c9bd9b8c1d4969dc105409c6dfef), we queried the Semantic Scholar API to retrieve available citation contexts. Citation contexts were extracted from the "contexts" field within the JSON response pages, (see technical specifications).
  
  The query looks like this: semanticscholar.org
  
  The broad coverage of Semantic Scholar does not imply that citation contexts are always retrieved. The Semantic Scholar API provided citation contexts for only 48% of the 1452 documents citing the paper. To get more, we identified open access papers among the remaining 52% citing papers, retrieved their PDF location and downloaded the files. We used Unpaywall API, which is a database to be queried with a DOI in order to get open access information about a document. The query looks like this.
  
  We downloaded 266 PDF files and converted them to text format using an online bulk PDF-to-text converter. These files were then processed using TXM, a specialized textual analysis tool. We used its concordancer function to identify the term "Fanelli" as a pivot term and check the reference being the good one (the 2009 paper in PlosOne). We did manual cleaning and appended the citation contexts to the previous corpus.
  
  Through this comprehensive methodology, we ultimately identified 824 citation contexts, representing 54% (784) of all documents citing Fanelli's 2009 paper. This corpus comprised 48% of contexts retrieved from Semantic Scholar and an additional 6% obtained through semi-manual extraction from open access documents. 87 of those contexts were excluded from the analysis for a range of reasons including: context too short to conclude, language neither English nor French (shared languages of the authors of this review), duplicate documents (e.g. preprints), etc, leaving us with 737 contexts. They were first classified manually in two categories, those mentioning the 2% figure and those which did not. Then, for the first category, they were further classified manually in two categories depending on whether the figure was appropriately assigned to self-reporting of researchers or rather misleadingly suggesting that the 2% applied to research outputs.
  
  Contributions
  
  Investigation: FB collected the citation contexts.<br /> Data curation and formal analysis: RL and MS<br /> Writing – review & editing: RL, MS and FB
  
  References
  
  Bordignon, Frederique, Maha Said, and Raphael Levy. 2024. “Citation Contexts of [How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data, DOI: 10.1371/Journal.Pone.0005738].” Zenodo. https://doi.org/10.5281/zenodo.14417422.
  
  Chawla, Dalmeet Singh. 2024. “1 in 7 Scientific Papers Is Fake, Suggests Study That Author Calls ‘Wildly Nonsystematic.’” Retraction Watch (blog). September 24, 2024. https://retractionwatch.com/2024/09/24/1-in-7-scientific-papers-is-fake-suggests-study-that-author-calls-wildly-nonsystematic/.
  
  Fanelli, Daniele. 2009. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PLOS ONE 4 (5): e5738. https://doi.org/10.1371/journal.pone.0005738.
  
  Frank, Fabrice, Nans Florens, Gideon Meyerowitz-Katz, Jérôme Barriere, Éric Billy, Véronique Saada, Alexander Samuel, Jacques Robert, and Lonni Besançon. 2023. “Raising Concerns on Questionable Ethics Approvals - a Case Study of 456 Trials from the Institut Hospitalo-Universitaire Méditerranée Infection.” Research Integrity and Peer Review 8 (1): 9. https://doi.org/10.1186/s41073-023-00134-4.
  
  George, Stephen L. 2016. “Research Misconduct and Data Fraud in Clinical Trials: Prevalence and Causal Factors.” International Journal of Clinical Oncology 21 (1): 15–21. https://doi.org/10.1007/s10147-015-0887-3.
11. metaror 13 Aug 2025
  
  in Public
  
  Heathers, J. (2024) How much science is fake? https://doi.org/10.17605/OSF.IO/5RF2M
12. metaror 13 Aug 2025
  
  in Public
  
  September 22, 2024
13. metaror 13 Aug 2025
  
  in Public
  
  February 14, 2025
14. metaror 13 Aug 2025
  
  in Public
  
  March 13, 2025
15. metaror 13 Aug 2025
  
  in Public
  
  Authors:
  
  James Heathers jamesheathers@gmail.com
16. metaror 13 Aug 2025
  
  in Public
  
  2
17. metaror 13 Aug 2025
  
  in Public
  
  https://osf.io/s4gce
18. metaror 13 Aug 2025
  
  in Public
  
  10.17605/OSF.IO/5RF2M
19. metaror 13 Aug 2025
  
  in Public
  
  Approximately 1 in 7 Scientific Papers Are Fake
Visit annotations in context

Annotators

metaror

URL

osf.io/s4gce
Jul 2025
upstream.force11.org upstream.force11.org

Drinking from the Firehose? Write More and Publish Less

18
1. metaror 28 Jul 2025
  
  in Public
  
  none
2. metaror 28 Jul 2025
  
  in Public
  
  In this blog post the author argues that problematic incentive structures have led to a rapid increase in the publication of low-quality research articles and that stakeholders need to work together to reform incentive structures. The blog post has been reviewed by three reviewers. Reviewer 3 considers the blog post to be a ‘great piece’, and Reviewer 1 finds it ‘compellingly written and thought provoking’. According to Reviewer 2, the blog post does not offer significant new insights for readers already familiar with the topic. All three reviewers provide recommendations for clarifications. Reviewers 2 and 3 also suggest the blog post could be more critical toward publishers. Reviewers 1 and 3 suggest taking a broader perspective on incentives, for instance by also considering incentives related to teaching and admin or incentives for funders, libraries, and other organizations.
3. metaror 28 Jul 2025
  
  in Public
  
  none
4. metaror 28 Jul 2025
  
  in Public
  
  This op-ed addresses the issue with the exponential increase in publications and how this is leading to a lower quality of peer review which, in turn, is resulting in more bad science being published. It is a well-written article that tackles a seemingly eternal topic. This piece focussed more on the positives and potential actions which is nice to see as this is a topic that can become stuck in the problems. There are places throughout that would benefit from more clarity and at times there appears to be a bias towards publishers, almost placing blame on researchers. Very simple word changes or headings could immediately resolve any doubt here as I don't believe this is the intention of the article at all.
  
  Additionally, this article is very focussed on peer review (a positive) but I think that it would benefit from small additions throughout that zoom out from this and place the discussion in the context of the wider issues - for example you cannot change peer review incentives without changing the entire incentives around "service" activities including teaching, admin etc. This occurs to a degree with the discussion on other outputs, including preprints and data. Moreover, when discussing service type activities, there is data that reveals certain demographics deliberately avoid this work. Adding this element into the article would provide a much stronger argument for change (and do some good in the new current political climate).
  
  Overall, I thought this was a great piece when it was first posted online and does exactly what a good op-ed should - provoke thought and discussion. Below are some specific comments, in reading order. I do not believe that there are any substantial or essential changes required, particularly given that this is an op-ed article.
  
  -----
  
  Quote: "Academia is undergoing a rapid transformation characterized by exponential growth of scholarly outputs."
  
  Comment: There's an excellent paper providing evidence to this: https://direct.mit.edu/qss/article/5/4/823/124269/The-strain-on-scientific-publishing which would be a very positive addition
  
  Quote: "it’s challenging to keep up with the volume at which research publications are produced"
  
  Comment: Might be nice to add that this was a complaint dating back since almost the beginning of sharing research via print media, just to reinforce that this is a very old point.
  
  Quote: "submissions of poor-quality manuscripts"
  
  Comment: The use of "poor quality" here is unnecessary. Just because a submission is not accepted, it has no reflection on "quality". As such this does seem to needlessly diminish work rejected by one journal
  
  Quote: "Maybe there are too many poor quality journals too - responding to an underlying demand to publish low quality papers."
  
  Comment: This misses the flip side - poor quality journals encourage and actively drive low quality & outright fraudulent submissions due to the publisher dominance in the assessment of research and academics.
  
  Quote: "even after accounting for quality,"
  
  Comment: Quality is mentioned here but has yet to be clearly defined. What is "quality"? - how many articles a journal publishes? The "prestige" of a journal? How many people are citing the articles?
  
  Quote: "Researchers can – and do – respond to the availability by slicing up their work (and their data) into minimally publishable units"
  
  Comment: I fully agree that some researchers do exactly this. However, again, this seems to be blaming researchers for creating this firehose problem. I think this point could be reworded to not place so much blame or be substantiated with evidence that this is a widespread practice - my experience has been very mixed in that I've worked for people who do this almost to the extreme (and have very high self-citations) and also worked for people who focus on the science and making it as high quality and robust as possible. I agree many respond to the explosion of journals and varied quality in a negative manner but the journals, not researchers are the drivers here.
  
  Quote: "least important aspect of the expected contributions of scholars."
  
  Comment: I think it may be worth highlighting here that sometimes specific demographics (white males) actively avoid these kinds of service activities - there's a good study on this providing data in support of this. It adds an extra dimension into the argument for appropriate incentives and the importance & challenges of addressing this.
  
  Quote: "high quality peer review"
  
  Comment: Just another comment on the use of "quality'. This is not defined and I think when discussing these topics it is vital to be clear what one means by "high quality". For example, a high quality peer review that is designed as quality control would be detecting gross defects and fraud, preventing such work from being published (peer review does not reliably achieve this). In contrast, a high quality peer review designed to help authors improve their work and avoid hyperbole would be very detailed and collegial, not requesting large numbers of additional experiments.
  
  Quote: "conferring public trust in the oversight of science"
  
  Comment: I'm not convinced of this. Conveying peer review as a stamp of approval or QC leads to reduced trust when regular examples emerge with peer review failures - just look at Hydroxychloroquine and how peer review was used to justify that during COVID or the MMR/autism issues that are still on-going even after the work was retracted. I think this should be much more carefully worded, removed or expanded on to provide this perspective - this occurs slightly in the following sentence but it is very important to be clear on this point.
  
  Quote: "Researchers hold an incredible amount of market power in scholarly publishing"
  
  Comment: I like the next few paragraphs but, again, this seems to be blaming researchers when they in fact hold no/little power. I agree that researchers *could* use market pressure but this is entirely unrealistic when their careers depend on publishing X papers in X journal. An argument as to why science feels increasingly non-collaborative perhaps. Funders can have immediate and significant changes. Institutions adopting reward structures, such as teaching for example, would have significant impacts on researcher behaviour. Researchers are adapting to the demands the publication system creates - more journals, greater quantity and reduced quality whilst maintaining control over the assessment - eLife being removed from Wos/Scopus is a prime example of publishers (via their parent companies) preventing innovation or even rather basic improvements.
  
  Quote: "With preprint review, authors participate in a system that views peer review not as a gatekeeping hurdle to overcome to reach publication but as a participatory exercise to improve scholarship."
  
  Comment: This is framing that I really like; improving scholarship, not quality control.
  
  Quote: "buy"
  
  Comment: typo
  
  Quote: "adoption of preprint review can shift the inaccurate belief that all preprints lack review"
  
  Comment: Is this the right direction for preprints though? If we force all preprints to be reviewed and only value reviewed-preprints, then we effectively dismantle the benefits of preprints and their potential that we've been working so hard to build. A recent op-ed by Alice Fleerackers et al provided an excellent argument to this effect. More a question than a suggestion for anything to change.
  
  Quote: "between all of those stakeholders to work together without polarization"
  
  Comment: I disagree here - publishers have repeatedly shown that their only real interest is money. Working with them risks undermining all of the effort (financial, careers, reputation, time) that advocates for change put in. The OA movement should also highlight perfectly why this is such a bad route to go down (again). Publishers grip on preprint servers is a great example - those servers are hard to use as a reader, lack APIs and access to data, are not innovative or interacting with independent services. The community should make the rules and then publishers abide by and within them. Currently the publishers make all of the rules and dominate. Indeed, this is possibly the biggest ommision from this article - the total dominance of publishers across the entire ecosystem. You can't talk about change without highlighting that the publishers don't just own journals but the reference managers, the assessment systems, the databases etc. I may be an outlier on this point but for all of the people I interact with (often those at the bottom of the ladder) this is a strong feeling. Again, not a suggestion for anything to change and indeed the point of an op-ed is to stimulate thought and discussion so dissent is positive.
  
  Note that these annotations were made in hypothes.is and are available here, linked in-text for ease - comments are duplicated in this review.
5. metaror 28 Jul 2025
  
  in Public
  
  none
6. metaror 28 Jul 2025
  
  in Public
  
  Summary of the essay
  
  In this essay, the author seeks to explain the ‘firehose’ problem in academic research, namely the rapid growth in the number of articles but also the seemingly concurrent decline in quality. The explanation, he concludes, lies in the ‘superstructure’ of misaligned incentives and feedback loops that primarily drive publisher and researcher behaviour, with the current publish or perish evaluation system at the core. On the publisher side, these include commercial incentives driving both higher acceptance rates in existing journals and the launch of new journals with higher acceptance rates. At the same time, publishers seek to retain reputational currency by maintaining consistency and therefore brand power of scarcer, legacy-prestige journals. The emergence of journal cascades (automatic referrals from one journal to another journal within the same publisher) and the introduction of APCs (especially for special issues) also contribute to commercial incentives driving article growth. On the researcher side, he argues that there is an apparent demand from researchers for more publishing outlets and simultaneous salami slicing by researchers because authors feel they have to distribute relatively more publications among journals that are perceived to be of lower quality (higher acceptance rates) in order to gain equivalent prestige to that of a higher impact paper. The state of peer review also impacts the firehose. The drain of PhD qualified scientists out of academia, compounded by a lack of recognition for peer review, further contributes to the firehose problem because there are insufficient reviewers in the system, especially for legitimate journals. Moreover, what peer review is done is no guarantee of quality (in highly selective journals as well as ‘predatory’). One of his conclusions is that there is not just a crisis in scholarly publishing but in peer review specifically and it is this crisis that will undermine science the most. Add AI into the mix of this publish or perish culture, and he predicts the firehose will burst.
  
  He suggests that the solution lies in researchers taking back power themselves by writing more but ‘publishing’ less. By writing more he means outputs beyond traditional journal publications such as policy briefs, blogs, preprints, data, code and so on, and that these should count as much as peer-reviewed publications. He places special emphasis on the potential role of preprints and on open and more collegiate preprint review acting as a filter upstream of the publishing firehouse. He ends with a call for more collegiality across all stakeholders to align the incentives and thus alleviate the pressure causing the firehose in the first place.
  
  General Comment
  
  I enjoyed reading the essay and think the author does a good job of exposing multiple incentives and competing interests in the system. Although discussion of perverse incentives has been raised in many articles and blog posts, the author specifically focuses on some of the key commercial drivers impacting publishing and the responses of researchers to those drivers. I found the essay compellingly written and thought provoking although it took me a while to work through the various layers of incentives. In general, I agree with the incentives and drivers he has identified and especially his call for stakeholders to avoid polarization and work together to repair the system. Although I appreciate the need to have a focused argument I did miss a more in-depth discussion about the equally complex layers of incentives for institutions, funders and other organisations (such as Clarivate) that also feed the firehose.
  
  I note that my perspective comes from a position of being deeply embedded in publishing for most of my career. This will have also impacted what I took away from the essay and the focus of my comments below.
  
  Main comments
  
  I especially liked the idea of a ‘superstructure’ of incentives as I think that gives a sense of the size and complexity of the problem. At the same time, by focusing on publisher incentives and researchers’ response to them he has missed out important parts of the superstructure contributing to the firehose, namely the role of institutions and funders in the system. Although this is implicit, I think it would have been worth noting more, in particular:
  
  He mentions institutions and the role of tenure and promotion towards the end but not the extent of the immense and immobilizing power this wields across the system (despite initiatives such as DORA and CoARA).
  
  Most review panels (researchers) assessing grants for funders are also still using journal publications as a proxy for quality, even if the funder policy states journal name and rank should not be used
  
  Many Institutions/Universities still rely on number and venue of publications. Although some notable institutions are moving away from this, the impact factor/journal rank is still largely relied on. This seems especially the case in China and India for example, which has shown a huge growth in research output. Although the author discusses the firehose, it would have been interesting to see a regional breakdown of this.
  
  Libraries also often negotiate with publishers based on volume of articles – i.e they want evidence that they are getting more articles as they renegotiate a specific contract (e.g. Transformative agreements), rather than e.g. also considering the quality of service.
  
  Institutions are also driven by rankings in a parallel way to researchers being assessed based on journal rank (or impact factor). How University Rankings are calculated is also often opaque (apart from the Leiden rankings) but publications form a core part. This further incentivises institutions to select researchers/faculty based on the number and venue of their publications in order to promote their own position in the rankings (and obtain funding)
  
  The essay is also about power dynamics and where power in the system lies. The implication in the essay is that power lies with the publishers and this can be taken back by researchers. Publishers do have power, especially those in possession of high prestige journals and yet publishers are also subject to the power of other parts of the system, such as funder and institutional evaluation policies. Crucially, other infrastructure organisations, such as Clarivate, that provide indexing services and citation metrics also exert a strong controlling force on the system, for example:
  
  Only a subset of journals are ever indexed by Clarivate. And funders and Institutions also use the indexing status of a journal as a proxy of quality. A huge number of journals are thus excluded from the evaluation system (primarily in the arts and humanities but also many scholar-led journals from low and middle income countries and also new journals). This further exacerbates the firehose problem because researchers often target only indexed journals. I’d be interested to see if the firehose problem also exists in journals that are not traditionally indexed (although appreciate this is also likely to be skewed by discipline)
  
  Indexers also take on the role of arbiters of journal quality and can choose to delist or list journals accordingly. Listing or delisting has a huge impact on the submission rates to journals that can be worth millions of dollars to a publisher, but it is often unclear how quality is assessed and there seems to be a large variance in who gets listed or not.
  
  Clarivate are also paid large fees by publishers to use their products, which creates a potential conflict of interest for the indexer as delisting journals from major publishers could potentially cause a substantial loss of revenue if they withdraw their fees. Also Clarivate relies on publishers to create the journals on which their products are based which may also create a conflict if Clarivate wishes to retain the in-principle support of those publishers.
  
  The delisting of elife recently, even though it is an innovator and of established quality, shows the precariousness of journal indexing.
  
  All the stakeholders in the system seem to be essentially ‘following the money’ in one way or another – it’s just that the currency for researchers, institutions, publishers and others varies. Publishers – both commercial and indeed most not-for profit - follow the requirements of the majority of their ‘customers’ (and that’s what authors, institutions, subscribers etc are in this system) in order to ensure both sustainability and revenue growth. This may be a legacy of the commercialisation of research in the 20th Century but we should not be surprised that growth is a key objective for any company. It is likely that commercial players will continue to play an important role in science and science communication; what needs to be changed are the requirements of the customers.
  
  The root of the problem, as the author notes, is what is valued in the system, which is still largely journal publications. The author’s solution is for researchers to write more – and for value to be placed on this greater range of outputs by all stakeholders. I agree with this sentiment – I am an ardent advocate for Open Science. And yet, I also think the focus on outputs per se and not practice or services is always going to lead to the system being gamed in some way in order to increase the net worth of a specific actor in the system. Preprints and preprint review itself could be subject to such gaming if value is placed on e.g. the preprint server or the preprint-review platform as a proxy of preprint and then researcher quality.
  
  I think the only way to start to change the system is to start placing much more value on both the practices of researchers (as well as outputs) and on the services provided by publishers. Of course saying this is much easier than implementing it.
  
  Other comments
  
  A key argument is that higher acceptance rates actually create a perverse incentive for researchers to submit as many manuscripts as possible because they are more likely to get accepted in journals with higher acceptance rates. I disagree that higher acceptance rates per se are the main incentive for researchers to publish more. More powerful is the fact that those responsible for grants and promotion continue to use quantity of journal articles as a proxy for research quality.
  
  Higher acceptance rates are not necessarily an indicator of low quality or a bad thing if it means that null, negative and inconclusive results are also published
  
  The author states that Journal Impact Factors might have been an effective measure of quality in the past. I take issue with this because the JIF has, as far as I know, always been driven by relatively few outliers (papers with very high citations) and I don’t know of evidence to show that this wasn’t also true in the past. It also makes the assumption that citations = quality.
  
  The author asks at one point “Why would field specialization need a lower threshold for publication if the merits of peer review are constant? ” I can see a case for lower thresholds, however, when the purpose of peer review is primarily to select for high impact, rather than rigour, of the science conducted. A similar case might be made for multidisciplinary research, where peer reviewers tend to assess an article from their discipline’s perspective and reject it because the part that is relevant to them is not interesting enough… Of course, this all points to the inherent problems with peer review (with which I agree with the author)
  
  The author puts his essay in appropriate context, drawing on a range of sources to support his argument. I particularly like that he tried to find source material that was openly available.
  
  He cites 2 papers by Bjoern Brembs to substantiate the claim that there is potentially poorer review in higher prestige journals than in lower ranked journals. These papers were published in 2013 and 2018 and the conclusions relied, in part, on the fact that higher ranked journals had more retractions. Apart from a potential reporting bias, given the flood of retractions across multiple journals in more recent years, I doubt this correlation now exists?
  
  The author works out submission rates from the published acceptance rates of journals. The author acknowledges this is only approximate and discusses several factors that could inflate or deflate it. I can add a few more variables that could impact the estimate, including: 1) the number of articles a publisher/journal rejects before articles are assigned to any editor (e.g. because of plagiarism, reporting issues or other research integrity issues), 2) the extent to which articles are triaged and rejected by editors before peer review (e.g. because it is out of scope or not sufficiently interesting to peer review); the number of articles rejected after peer review; and 4) the extent to which authors independently withdraw an article at any stage of the process. When publishers publish acceptance rates, they don’t make it clear what goes into the numerator or the denominator and there are no community standards around this. The author rightly notes this process is too opaque.
  
  Catriona J. MacCallum
  
  As is my practice, I do not wish to remain anonymous. Please also note that I work for a large commercial publisher and am writing this review in an independent capacity such that this review reflects my own opinion, which are not necessarily those of my employer.
7. metaror 28 Jul 2025
  
  in Public
  
  none
8. metaror 28 Jul 2025
  
  in Public
  
  This is a well written and clear enough piece that may be helpful for a reader new to the topic. To people familiar with the field there is not so much which is new here. The final recommendation is not well expressed. As currently put it is, I think, wrong. But it is a provocative idea. I comment section by section below.
  
  The first paragraphs repeat well established facts that there are too many papers. Seppelt et al’s contribution is missing here. It also reproduces the disengenuous claim, by a publisher’s employee, that publishers ‘only’ respond to demand. I do not think that is true. They create demand. They encourage authors to write and submit papers, as anyone who has been emailed by MDPI recently can testify. Why repeat something which is so inaccuate?
  
  The section on ‘upstream of the nozzle’ is rather confusing. I think the author is trying to establish if more work is being submitted. But this cannot be deduced from the data presented. No trends are given. Rejection rates will be a poor guide if the same paper is being rejected by several journals. I was also confused by the sources used to track growth in papers – why not just use Dimensions data? The final paragraph again repeats well known facts about the proliferation of outlets and salami slicing. Thus far the article has not introduced new arguments.
  
  Minor points in this section:
  
  there are some unsupported claims. Eg ‘This is a practice that is often couched within the seemingly innocuous guise of field specialty journals.’
  
  I also do not understand the logic of this rather long sentence: ‘The expansion of journals with higher acceptance rates alters the rational calculus for researchers - all things being equal higher acceptance rates create a perverse incentive to submit as many manuscripts as possible since the underlying probability of acceptance is simply higher than if those same publications were submitted to a journal with a lower acceptance rate, and hence higher prestige.’ I suggest it be rephrased
  
  The section on peer review (Who’s testing the water) is mostly a useful review of the issues. But there are some problems which need addressing. Bizarrely, when discussing whether there enough scientists, it fails to mention Hanson et al’s global study, despite linking to it’s preprint in the opening lines. Instead the author adopts a parochial North American approach and refers only to PhDs coming from the US. It is not adequate to take trends in one country to cannot explain an international publishing scene. These are not the ‘good data’ the author claims. Likewise the value of data on doctorates not going onto a post-doc hinges on how many post-docs there are. That trend is not supplied. This statement ‘Almost everyone getting a doctorate goes into a non-university position after graduation’ may be true, but no supporting data are supplied to justify it. Nor do we know what country, or countries, the author is referring to.
  
  The section ‘A Sip from the Spring’ makes the mistaken claim that researchers hold market power. This is not true. Researchers institutions, their libraries and governments are the main source of publisher income. It is here that the key proposal for improvement is made: researcher can write more and publish less. But if the problem is that there is too much poorly reviewed literature then this cannot be the solution. Removing all peer review, would mean there is even more material to read whose appearance is not slowed up by peer review at all. If peer review is becoming inadequate, evading it entirely is hardly a solution.
  
  This does not mean we should not release pre-prints. The author is right to advocate them, but the author is mistaken to think that this will reduce publishing pressures. The clue is in their name ‘pre-print’. Publication is intended.
  
  Missing from the author’s argument is recognition of the important role that communities of researchers form, and the roles that journals play in providing venues for conversation, disagreement and disucssion. They provide a filter. Yes researchers produce other material than publications as the author states: ‘grant proposals, editorials, policy briefs, blog posts, teaching curricula and lectures, software code and documentation, dataset curation, and labnotes and codebooks.’ I would add email and whatsapp messages to that list. But adding all that to our reading lists will not reduce the volume of things to be read. It must increase it. And it would make it harder to marshall and search all those words.
  
  But the idea is provocative nonetheless. Running through this paper, and occasionally made explicit, is the fact that publishers earn billions from their ‘service’ to academia. They have a strong commercial interest in our publishing more, and in competing with each other to produce a larger share of the market. If writing more, and publishing less, means we need to find ways of directing our thoughts so that they earn less money for publishers, then that could bring real change to the system.
  
  A minor point: the fire hose analogy is fully exploited and rather laboured in this paper. But it is a North American term and image, that does not travel so easily.
9. metaror 28 Jul 2025
  
  in Public
  
  A few months back, Upstream editor Martin Fenner suggested that I submit my Upstream blog post titled, Drinking from the Firehose? Write More and Publish Less, for peer-review as a sort of experiment for Upstream through MetaROR. MetaROR, a relative newcomer to the scholarly communication community, provides the review and curate steps in the "publish-review-curate" model for meta-research.
  
  While I do not consider myself a meta-researcher (scholars who conduct research on research) many of my positions on science policy have implications on the field (especially, those on transparency, openness, and reproducibility). I think the main call in my blog post for reform in scholarly communication – namely, to stop publishing in traditional journals as much and start rewarding a broader swath of scholarly activities like data sharing – is particularly appealing to meta-researchers who rely on non-publication outputs for their work. So, I submitted. The article was openly reviewed, and MetaROR provided an editorial assessment. Here, I reply to the reviewers and contribute to the curation of the original post.
  
  The reviews are very high-quality - in fact, they are some of the most well-reasoned reviews I've received in the 20 years I've been a scholar. If MetaROR represents the future of peer-review through the publish-review-curate model, scholarly communication is about to get a whole lot better. You can read the open reviews of my blog post here. The revised version of the editorial is here.
  
  Like traditional peer-review, each individual reviewer provided their feedback independently of the others and the handling editor did not curate the reviews. I prefer when editors do such curation since it helps to organize the response in a way that reduces redundancy. This is one of the main benefits of the group-based peer review systems - such as PREreview's Live Review. Also, there was no easy way (or at least not an obvious one) to export the reviews in plaintext from MetaROR so I could respond point-by-point in software of my choice. Below is an attempt to organize my response roughly around the major criticisms and suggestions in the review. Because this was an opinion piece and not research, I'm not going to respond to every point anyway – nearly all of which I would accept and revise accordingly had this been a research article.
  
  Too Easy on the Publishers, Too Hard on Researchers
  
  All three reviewers expressed some dismay over how light my criticism of the publishers was in my blog piece. I do not disagree. The reviewers rightfully point out that the publishers play outsized role in the inequity created in the scholarly communication space. However, I am choosing not to revise here much as the essay was already too long - it would have taken a tome to articulate my criticism of the publishers. That's out of scope. However, I revised the first paragraph in the conclusion to state:
  
  The publishers are incentivized to avoid any other form of reform - this is the rational option that publishers choose in response to the apparent demand from researchers - as Ciavarella rightly pointed out.
  
  Two of the reviewers also thought I was too harsh on researchers. I don't think that I was overly harsh. All three agree with me that researchers have some market role here but disagree with the extent to which they can exert influence. One reviewer claims researchers have no market power (to which I respectfully disagree). I've clarified in the paper that: 'the power any individual researcher has here is small. Collective action is needed.' I reject that researchers are blameless for the status quo - complacency empowers the publishers. Unfortunately, it's also baked into the superstructure of the reward system that is perpetuated by publisher-controlled market forces. I also added the following sentiment along these lines when discussing market-power of researchers:
  
  It's free to share and read research without the need for costly, anticompetitive gatekeeping. Leveraging that freedom is an untapped source of market power.
  
  Focus More on Institutions and Funders and Communities
  
  Two of the three reviewers thought I needed to draw more attention to the roles, demands, and influence that academic institutions, publisher consortia, libraries, indexing services, scholarly societies, and grassroots research organizations have in this ecosystem. I agree with all these points - and had Clarivate's irresponsible delisting of eLife in the Web of Science happened before I wrote the original piece, I would have highlighted that as one reviewer suggested.
  
  No New Arguments or Analysis
  
  The reviewers felt that, while well-articulated, the arguments I was espousing are not novel. First, I think it is worthy to renew the idea that we should be more selective in choosing what to publish in journals. Focusing on quality over quantity and valuing activities beyond journal publications should be repeated often until it's common practice.
  
  One comment called for more data and analysis, and another wanted some additional research cited. I think that's a great idea and I hope the reviewers can do that work or perhaps the open review will inspire others to do so.
  
  In response to the criticism that preprints themselves both presuppose an eventual traditional publication and that they could be gamed, I revised that section accordingly:
  
  There is risk of gaming preprints and preprint review just as there is in traditional publishing, such as by placing value on a paper for where it appears or how it was reviewed without considering its quality or contribution to science.
  
  One reviewer misunderstood my point about preprints altogether:
  
  Removing all peer review, would mean there is even more material to read whose appearance is not slowed up by peer review at all. If peer review is becoming inadequate, evading it entirely is hardly a solution. This does not mean we should not release pre-prints. The author is right to advocate them, but the author is mistaken to think that this will reduce publishing pressures. The clue is in their name ‘pre-print’. Publication is intended.
  
  I am absolutely not arguing for tossing out peer review. I strongly believe peer review is valuable but currently broken. Moreover, I reject that peer review needs to happen behind the gatekeeping of publishers. I revised to clarify here and added a footnote based on this reviewer's latter observation.
  
  Peer-review remains a critical check for pollutants in the waters - but the prevailing model needs significant reform. The traditional opaque, uncompensated system has eroded the quality, transparency, timeliness, and appropriateness of peer review due to competing priorities and a lack of appropriately aligned incentive structures. Novel models of peer review including, publish-review-curate and preprint review, and compensated review - ideally all done transparently and with conflicts of interest declared out in the open. At the same time, not all manuscripts need review to have value and most preprints with value (even those with reviews) should not be published in journals.
  
  New footnote: The term 'preprint' is evolving - what was once a moniker for non-peer reviewed manuscript intended to eventually become reviewed and published (or more likely, rejected) now scopes-in other forms including publish-curate-review and manuscripts with preprint reviews. A new labeling and metadata system is desperately needed to highlight the state of review of a particular manuscript in a record of versions. Version control systems and badging are ubiquitous in the open-source software community and could be easily adopted here.
  
  Volume is Volume is Volume
  
  Probably the most important critique among the set of reviews points out an apparent recursion in the logic of the thesis that I need to clarify: you can't solve the firehose problem by writing more, as that just adds more volume to the flow. My revision to the conclusion clarifies my intent: what I'm proposing is to stop sending so many papers to journals for publication and to choose preprints more often for reading, reviewing, and writing. At the same time the system should, maintain or increase non-publication scholarly outputs and reward those too.
  
  "Write-More" here is a placeholder for all the non-publication writing scholars do and should get credit for from their institutions and fields. Again, I happen to focus on writing because that's what I care about in this editorial and it would take volumes to pontificate on all the other services and activities that happen within the academy that are not properly rewarded.
  
  Summary
  
  Having my blog post peer-reviewed through MetaROR was a positive experience and I recommend the service. However, my post was still just an editorial – my opinions and thoughts – not research. Had this been a research article, however, the reviews as presented would have been a very good roadmap to improving the paper. For MetaROR, I have two suggestions: 1) the editorial assessment could be improved by organizing the key points and 2) create a way to have all reviews downloadable in plaintext for ease of importing into an editor.
  
  Acknowledgments
  
  Special thanks are owed to the reviewers, Catriona MacCallum, Dan Brockington, and Jonny Coates, the MetaROR handling editor Ludo Waltman, and to Upstream Editor and Front Matter founder Martin Fenner for the crazy idea to peer-review a blog post.
  
  Disclosure
  
  The opinions expressed here are my own and may not represent those of my employer, my associates, or the reviewers. I have no conflicts of interest to disclose.
  
  This author response was previously published on Upstream.
10. metaror 28 Jul 2025
  
  in Public
  
  Marcum, C.S. (2024). Drinking from the Firehose? Write More and Publish Less. Upstream. https://doi.org/10.54900/r8zwg-62003
11. metaror 28 Jul 2025
  
  in Public
  
  August 27, 2024
12. metaror 28 Jul 2025
  
  in Public
  
  February 27, 2025
13. metaror 28 Jul 2025
  
  in Public
  
  March 13, 2025
14. metaror 28 Jul 2025
  
  in Public
  
  Authors:
  
  Christopher Marcum christopher.steven.marcum@gmail.com
15. metaror 28 Jul 2025
  
  in Public
  
  1
16. metaror 28 Jul 2025
  
  in Public
  
  https://upstream.force11.org/drinking-from-the-firehose/
17. metaror 28 Jul 2025
  
  in Public
  
  10.54900/r8zwg-62003
18. metaror 28 Jul 2025
  
  in Public
  
  Drinking from the Firehose? Write More and Publish Less
Visit annotations in context

Annotators

metaror

URL

upstream.force11.org/drinking-from-the-firehose/
osf.io osf.io

Evolution of Peer Review in Scientific Communication

2
1. metaror 17 Jul 2025
  
  in Public
  
  Testttttt
2. metaror 17 Jul 2025
  
  in Public
  
  I am sincerely grateful to the editors and peer reviewers at MetaROR for their detailed feedback and valuable comments and suggestions. I have addressed each point below.
  
  Handling Editor
  
  1. However, the article’s progression and arguments, along with what it seeks to contribute to the literature need refinement and clarification. The argument for PRC is under-developed due to a lack of clarity about what the article means by scientific
  
  communication. Clarity here might make the endorsement of PRC seem like less of a foregone conclusion.
  
  The structure of the paper (and discussion) has changed significantly to address the feedback.
  
  2. I strongly endorse the main theme of most of the reviews, which is that the progression and underlying justifications for this article’s arguments needs a great deal of work. In my view, this article’s main contribution seems to be the evaluation of the three peer review models against the functions of scientific communication. I say ‘seems to be’ because the article is not very clear on that and I hope you will consider clarifying what your manuscript seeks to add to the existing work in this field. In any case, if that assessment of the three models is your main contribution, that part is somewhat underdeveloped. Moreover, I never got the sense that there is clear agreement in the literature about what the tenets of scientific communication are. Note that scientific communication is a field in its own right.
  
  I have implemented a more rigorous approach to argumentation in response. “Scientific communication” was replaced by “scholarly communication.”
  
  3. I also agree that paper is too strongly worded at times, with limitations and assumptions in the analysis minimised or not stated. For example, all of the typologies and categories drawn could easily be reorganised and there is a high degree of subjectivity in this entire exercise. Subjective choices should be highlighted and made salient for the reader. Note that greater clarity, rigour, and humility may also help with any alleged or actual bias.
  
  I have incorporated the conceptual framework and description of the research methodology. However, the
  
  Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.
  
  4. I agree with Reviewer 3 that the ‘we’ perspective is distracting.
  
  This has been fixed.
  
  5. The paragraph starting with ‘Nevertheless’ on page 2 is very long.
  
  The text was restructured.
  
  6. There are many points where language could be shortened for readability, for example:
  
  Page 3: ‘decision on publication’ could be ‘publication decision’.
  
  Page 5: ‘efficiency of its utilization’ could be ‘its efficiency’.
  
  Page 7: ‘It should be noted…’ could be ‘Note that…’.
  
  I have proofread the text.
  
  7. Page 7: ‘It should be noted that..’ – this needs a reference.
  
  This statement has been moved to the Discussion section, paraphrased, and reference added.
  
  “It should be also noted that peer review innovations pull in opposing directions, with some aiming to increase efficiency and reduce costs, while others aim to promote rigor and increase costs
  
  (Kaltenbrunner et al., 2022).”
  
  8. I’m not sure that registered reports reflect a hypothetico-deductive approach (page 6). For instance, systematic reviews (even non-quantitative ones) are often published as registered reports and Cochrane has required this even before the move towards registered reports in quantitative psychology.
  
  I have added this clarification.
  
  9. I agree that modular publishing sits uneasily as its own chapter.
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of
  
  models, now Section 5.1.
  
  10. Page 14: ‘The "Publish-Review-Curate" model is universal that we expect to be the future of scientific publishing. The transition will not happen today or tomorrow, but in the next 5-10 years, the number of projects such as eLife, F1000Research, Peer Community in, or MetaROR will rapidly increase’. This seems overly strong (an example of my larger critique and that of the reviewers).
  
  This part of the text has been rewritten.
  
  Reviewer 1
  
  11. For example, although Model 3 is less chance to insert bias to the readers, it also weakens the filtering function of the review system. Let’s just think about the dangers of machine-generated articles, paper-mills, p-hacked research reports and so on. Although the editors do some pre-screening for the submissions, in a world with only Model 3 peer review the literature could easily get loaded with even more ‘garbage’ than in a model where additional peers help the screening.
  
  I think that generated text is better detected by software tools. At the same time, I tried and described the pros and cons of different models in a more balanced way in the concluding section.
  
  12. Compared to registered reports other aspects can come to focus that Model 3 cannot cover. It’s the efficiency of researchers’ work. In the care of registered reports, Stage 1 review can still help researchers to modify or improve their research design or data collection method. Empirical work can be costly and time-consuming and post-publication review can only say that “you should have done it differently then it
  
  would make sense”.
  
  Thank you very much for this valuable contribution, I have added this statement at P. 11.
  
  13. Finally, the author puts openness as a strength of Model 3. In my eyes, openness is a separate question. All models can work very openly and transparently in the right circumstances. This dimension is not an inherent part of the models.
  
  I think that the model, providing peer reviews to all the submissions, ensures maximum transparency. However, I have made effort to make the wording more balanced and distinguish my personal perspective from the literature.
  
  14. In conclusion, I would not make verdict over the models, instead emphasize the different functions they can play in scientific communication.
  
  This idea has been reflected now in the concluding section.
  
  15. A minor comment: I found that a number of statements lack references in the Introduction. I would have found them useful for statements such as “There is a point of view that peer review is included in the implicit contract of the researcher.”
  
  Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.
  
  Reviewer 2
  
  16. The primary weakness of this article is that it presents itself as an 'analysis' from which they 'conclude' certain results such as their typology, when this appears clearly to be an opinion piece. In my view, this results in a false claim of objectivity which detracts from what would otherwise be an interesting and informative, albeit subjective, discussion, and thus fails to discuss the limitations of this approach.
  
  I have incorporated the conceptual framework and description of the research methodology. However, the
  
  Discussion section reflects my personal perspective in some points, which I have explicitly highlighted to ensure clarity.
  
  17. A secondary weakness is that the discussion is not well structured and there are some imprecisions of expression that have the potential to confuse, at least at first.
  
  The structure of the paper (and discussion) has changed significantly.
  
  18. The evidence and reasoning for claims made is patchy or absent. One instance of the former is the discussion of bias in peer review. There are a multitude of studies of such bias and indeed quite a few meta-analyses of these studies. A systematic search could have been done here but there is no attempt to discuss the totality of this literature. Instead, only a few specific studies are cited. Why are these ones chosen? We have no idea. To this extent I am not convinced that the references used here are the most appropriate.
  
  I have reviewed the existing references and incorporated additional sources. However, the study does not claim to conduct a systematic literature review; rather, it adopts an interpretative approach to literature analysis.
  
  19. Instances of the latter are the claim that "The most well-known initiatives at the moment are ResearchEquals and Octopus" for which no evidence is provided, the claim that "we believe that journal-independent peer review is a special case of Model 3" for which no further argument is provided, and the claim that "the function of being the "supreme judge" in deciding what is "good" and "bad" science is taken on by peer review" for which neither is provided.
  
  Thank you for your feedback. I have implemented a more rigorous approach to argumentation in response.
  
  20. A particular example of this weakness, which is perhaps of marginal importance to the overall paper but of strong interest to this reviewer is the rather odd engagement with history within the paper. It is titled "Evolution of Peer Review" but is really focussed on the contemporary state-of-play. Section 2 starts with a short history of peer review in scientific publishing, but that seems intended only to establish what is
  
  described as the 'traditional' model of peer review. Given that that short history had just shown how peer review had been continually changing in character over centuries - and indeed Kochetkov goes on to describe further changes - it is a little difficult to work out what 'traditional' might mean here; what was 'traditional' in 2010 was not the same as what was 'traditional' in 1970. It is not clear how seriously this history is being taken. Kochetkov has earlier written that "as early as the beginning of the 21st century, it was argued that the system of peer review is 'broken'" but of course criticisms - including fundamental criticisms - of peer review are much older than this. Overall, this use of history seems designed to privilege the
  
  experience of a particular moment in time, that coincides with the start of the metascience reform movement.
  
  While the paper addresses some aspects of peer review history, it does not provide a comprehensive examination of this topic. A clarifying statement to this effect has been included in the methodology section.
  
  “… this section incorporates elements of historical analysis, it does not fully qualify as such because primary sources were not directly utilized. Instead, it functions as an interpretative literature review, and one that is intentionally concise, as a comprehensive history of peer review falls outside the scope of this research”.
  
  21. Section 2 also demonstrates some of the second weakness described, a rather loose structure. Having moved from a discussion of the history of peer review to detail the first model, 'traditional' peer review, it then also goes on to describe the problems of this model. This part of the paper is one of the best - and best - evidenced. Given the importance of it to the main thrust of the discussion it should probably have been given more space as a Section all on its own.
  
  This section (now Section 4) has been extended, see also previous comment.
  
  22. Another example is Section 4 on Modular Publishing, in which Kochetkov notes "Strictly speaking, modular publishing is primarily an innovative approach for the publishing workflow in general rather than specifically for peer review."
  
  Kochetkov says "This is why we have placed this innovation in a separate category" but if it is not an innovation in peer review, the bigger question is 'Why was it included in this article at all?'.
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of models, now Section 5.1.
  
  23. One example of the imprecisions of language is as follows. The author also shifts between the terms 'scientific communication' and 'science communication' but, at least in many contexts familiar to this reviewer, these are not the same things, the former denoting science-internal dissemination of results through publication (which the author considers), conferences and the like (which the author specifically excludes) while the latter denotes the science-external public dissemination of scientific findings to non-technical audiences, which is entirely out of scope for this article.
  
  Thank you for your remark. As a non- native speaker, I initially did not grasp the distinction between the terms. However, I believe the phrase ‘scholarly communication’ is the most universally applicable term. This adjustment has now been incorporated into the text.
  
  24. A final note is that Section 3, while an interesting discussion, seems largely derivative from a typology of Waltman, with the addition of a consideration of whether a reform is 'radical' or 'incremental', based on how 'disruptive' the reform is. Given that this is inherently a subjective decision, I wonder if it might not have been more informative to consider 'disruptiveness' on a scale and plot it accordingly. This would allow for some range to be imagined for each reform as well; surely reforms might be more or less disruptive depending on how they are implemented. Given that each reform is considered against each model, it is somewhat surprising that this is not presented in a tabular or graphical form.
  
  Ultimately, I excluded this metric due to its current reliance on purely subjective judgment. Measuring 'disruptiveness', e.g., through surveys or interviews remains a task for future research.
  
  25. Reconceptualize this as an opinion piece. Where systematic evidence can be drawn upon to make points, use that, but don't be afraid to just present a discussion from what is clearly a well-informed author.
  
  I cannot definitively classify this work as an opinion piece. In fact, this manuscript synthesizes elements of a literature review, research article, and opinion essay. My idea was to integrate the strengths of all three genres.
  
  26. Reconsider the focus on history and 'evolution' if the point is about the current state of play and evaluation of reforms (much as I would always want to see more studies on the history and evolution of peer review).
  
  I have revised the title to better reflect the study’s scope and explicitly emphasize its focus on contemporary developments in the field.
  
  “Peer Review at the Crossroads”
  
  27. Consider ways in which the typology might be expanded, even if at subordinate level.
  
  I have updated the typology and introduced the third tier, where it is applicable (see Fig.2).
  
  Reviewer 3
  
  28. In my view, the biggest issue with the current peer review system is the low quality of reviews, but the manuscript only mentions this fleetingly. The current system facilitates publication bias, confirmation bias, and is generally very inconsistent. I think this is partly due to reviewers’ lack of accountability in such a closed peer review system, but I would be curious to hear the author’s ideas about this, more elaborately than they provide them as part of issue 2.
  
  I have elaborated on this issue in the footnote.
  
  29. I’m missing a section in the introduction on what the goals of peer review are or should be. You mention issues with peer review, and these are mostly fair, but their importance is only made salient if you link them to the goals of peer review. The author does mention some functions of peer review later in the paper, but I think it would be good to expand that discussion and move it to a place earlier in the manuscript.
  
  The functions of peer review are summarized in the first paragraph of Introduction.
  
  30. Table 1 is intuitive but some background on how the author arrived at these categorizations would be welcome.
  
  When is something incremental and when is something radical? Why are some innovations included but not others (e.g., collaborative peer review, see https://content.prereview.org/how-collaborative-peer-review-can-
  
  transform-scientific-research/)?
  
  Collaborative peer review, namely, Prereview was mentioned in the context of Model 3 (Publish-Review-Curate). However, I have extended this part of the paper.
  
  31. “Training of reviewers through seminars and online courses is part of the strategies of many publishers. At the same time, we have not been able to find statistical data or research to assess the effectiveness of such training.” (p. 5) There is some literature on this, although not recent. See work by Sara Schroter for example, Schroter et al., 2004; Schroter et al., 2008)
  
  Thank you very much, I have added these studies and a few more recent ones.
  
  32. “It should be noted that most initiatives aimed at improving the quality of peer review simultaneously increase the costs.” (p. 7) This claim needs some support. Please explicate why this typically is the case and how it should impact our evaluations of these initiatives.
  
  I have moved this part to the Discussion section.
  
  33. I would rephrase “Idea of the study” in Figure 2 since the other models start with a tangible output (the manuscript). This is the same for registered reports where they submit a tangible report including hypotheses, study design, and analysis plan. In the same vein, I think study design in the rest of the figure might also not be the best phrasing. Maybe the author could use the terminology used by COS (Stage 1 manuscript, and Stage 2 manuscript, see Details & Workflow tab of https://www.cos.io/initiatives/registered-reports). Relatedly, “Author submits the first version of the manuscript” in the first box after the ‘Manuscript (report)’ node maybe a confusing phrase because I think many researchers see the first version of the manuscript as the stage 1 report sent out for stage 1 review.
  
  Thank you very much. Stage 1 and Stage 2 manuscripts look like suitable labelling solution.
  
  34. One pathway that is not included in Figure 2 is that authors can decide to not conduct the study when improvements are required. Relatedly, in the publish- review-curate model, is revising the manuscripts based on the reviews not optional as well? Especially in the case of 3a, authors can hardly be forced to make changes even though the reviews are posted on the platform.
  
  All the four models imply a certain level of generalization; thus, I tried to avoid redundant details. However, I have added this choice to the PRC model (now, Model 4).
  
  35. I think the author should discuss the importance of ‘open identities’ more. This factor is now not explicitly included in any of the models, while it has been found to be one of the main characteristics of peer review systems (Ross-Hellauer, 2017).
  
  This part has been extended.
  
  36. More generally, I was wondering why the author chose these three models and not others. What were the inclusion criteria for inclusion in the manuscript? Some information on the underlying process would be welcome, especially when claims like “However, we believe that journal-independent peer review is a special case of Model 3 (“Publish-Review-Curate”).” are made without substantiation.
  
  The study included four generalized models of peer review that involved some level of abstraction.
  
  37. Maybe it helps to outline the goals of the paper a bit more clearly in the introduction. This helps the reader to know what to expect.
  
  The Introduction has been revised including the goal and objectives.
  
  38. The Modular Publishing section is not inherently related to peer review models, as you mention in the first sentence of that paragraph. As such, I think it would be best to omit this section entirely to maintain the flow of the paper. Alternatively, you could shortly discuss it in the discussion section but a separate paragraph seems too much from my point of view.
  
  Modular publishing has been combined with registered reports into the fragmented publishing group of models, now in Section 5.
  
  39. Labeling model 3 as post-publication review might be confusing to some readers. I believe many researchers see post-publication review as researchers making comments on preprints, or submitting commentaries to journals. Those activities are substantially different from the publish-review-curate model so I think it is important to distinguish between these types.
  
  The label was changed into Publish-Review-Curate model.
  
  40. I do not think the conclusions drawn below Table 3 logically follow from the earlier text. For example, why are “all functions of scientific communication implemented most quickly and transparently in Model 3”? It could be that the entire process takes longer in Model 3 (e.g. because reviewers need more time), so that Model 1 and Model 2 lead to outputs quicker. The same holds for the following claim: “The additional costs arising from the independent assessment of information based on open reviews are more than compensated by the emerging opportunities for scientific pluralism.” What is the empirical evidence for this? While I personally do think that Model 3 improves on Model 1, emphatic statements like this require empirical evidence. Maybe the author could provide some suggestions on how we can attain this evidence. Model 2 does have some empirical evidence underpinning its validity (see Scheel, Schijen, Lakens, 2021; Soderberg et al., 2021; Sarafoglou et al. 2022) but more meta-research inquiries into the effectiveness and cost- benefits ratio of registered reports would still be welcome in general.
  
  The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap. I am grateful for the suggested literature on RRs, which I have now integrated into the relevant subsection.
  
  41. What is the underlaying source for the claim that openness requires three conditions?
  
  I have made effort to clarify within the text that this reflects my personal stance.
  
  42. “If we do not change our approach, science will either stagnate or transition into other forms of communication.” (p. 2) I don’t think this claim is supported sufficiently strongly. While I agree there are important problems in peer review, I think would need to be a more in-depth and evidence-based analysis before claims like this can be made.
  
  The sentence has been rephrased.
  
  43. On some occasions, the author uses “we” while the study is single authored.
  
  This has been fixed.
  
  44. Figure 1: The top-left arrow from revision to (re-)submission is hidden
  
  I have updated Figure 1.
  
  45. “The low level of peer review also contributes to the crisis of reproducibility in scientific research (Stoddart, 2016).” (p. 4) I assume the author means the low quality of peer review.
  
  This has been fixed.
  
  46. “Although this crisis is due to a multitude of factors, the peer review system bears a significant responsibility for it.” (p. 4)
  
  This is also a big claim that is not substantiated
  
  I have paraphrased this sentence as
  
  “While multiple factors drive this crisis, deficiencies in the peer review process
  
  remain a significant contributor.” and added a footnote.
  
  47. “Software for automatic evaluation of scientific papers based on artificial intelligence (AI) has emerged relatively recently” (p. 5) The author could add RegCheck (https://regcheck.app/) here, even though it is still in development. This tool is especially salient in light of the finding that preregistration-paper checks are rarely done as part of reviews (see Syed, 2023)
  
  Thank you very much, I have added this information.
  
  48. There is a typo in last box of Figure 1 (“decicion” instead of “decision”). I also found typos in the second box of Figure 2, where “screns” should be “screens”, and the author decision box where “desicion” should be “decision”
  
  This has been fixed.
  
  49. Maybe it would be good to mention results blinded review in the first paragraph of 3.2. This is a form of peer review where the study is already carried out but reviewers are blinded to the results. See work by Locascio (2017), Grand et al. (2018), and Woznyj et al. (2018).
  
  Thanks, I have added this (now section 5.2)
  
  50. Is “Not considered for peer review” in figure 3b not the same as rejected? I feel that it is rejected in the sense that neither the manuscript not the reviews will be posted on the platform.
  
  Changed into “Rejected”
  
  51. “In addition to the projects mentioned, there are other platforms, for example, PREreview12, which departs even more radically from the traditional review format due to the decentralized structure of work.” (p. 11) For completeness, I think it would be helpful to add some more information here, for example why exactly decentralization is a radical departure from the traditional model.
  
  I have extended this passage.
  
  52. “However, anonymity is very conditional - there are still many “keys” left in the manuscript, by which one can determine, if not the identity of the author, then his country, research group, or affiliated organization.” (p.11) I would opt for the neutral “their” here instead of “his”, especially given that this is a paragraph about equity and inclusion.
  
  This has been fixed.
  
  53. “Thus, “closeness” is not a good way to address biases.” (p. 11) This might be a straw man argument because I don’t believe researchers have argued that it is a good method to combat biases. If they did, it would be good to cite them here. Alternatively, the sentence could be omitted entirely.
  
  I have omitted the sentence.
  
  54. I would start the Modular Publishing section with the definition as that allows readers to interpret the other statements better.
  
  Modular publishing has been combined with registered reports into the deconstructed publication group of
  
  models, now in Section 5, general definition added.
  
  55. It would be helpful if the Models were labeled (instead of using Model 1, Model 2, and Model 3) so that readers don’t have to think back what each model involved.
  
  All the models represent a kind of generalization, which is why non-detailed labels are used. The text labels may vary depending on the context.
  
  56. Table 2: “Decision making” for the editor’s role is quite broad, I recommend to specify and include what kind of decisions need to be made.
  
  Changed into “Making accept/reject decisions”
  
  57. Table 2: “Aim of review” – I believe the aim of peer review differs also within these models (see the “schools of thought” the author mentions earlier), so maybe a statement on what the review entails would be a better way to phrase this.
  
  Changed into “What does peer review entail?”
  
  58. Table 2: One could argue that the object of the review’ in Registered Reports is
  
  also the manuscript as a whole, just in different stages. As such, I would phrase this differently.
  
  Current wording fits your remark
  
  “Manuscript in terms of study design and execution”
  
  Reviewer 4
  
  59. Page 3: It’s hard to get a feel for the timeline given the dates that are described. We have peer review becoming standard after WWII (after 1945), definitively established by the second half of the century, an example of obligatory peer review starting in 1976, and in crisis by the end of the 20th century. I would consider adding
  
  examples that better support this timeline – did it become more common in specific journals before 1976? Was the crisis by the end of the 20th century something that happened over time or something that was already intrinsic to the institution? It doesn’t seem like enough time to get established and then enter crisis, but more details/examples could help make the timeline clear. Consider discussing the benefits of the traditional model of peer review.
  
  This section has been extended.
  
  60. Table 1 – Most of these are self- explanatory to me as a reader, but not all. I don’t know what a registered report refers to, and it stands to reason that not all of these innovations are familiar to all readers. You do go through each of these sections, but that’s not clear when I initially look at the table. Consider having a more informative caption. Additionally, the left column is “Course of changes” here but “Directions” in text. I’d pick one and go with it for consistency.
  
  Table 1 has been replaced by Figure 2. I have also extended text descriptions, added definitions.
  
  61. With some of these methods, there’s the ability to also submit to a regular journal. Going to a regular journal presumably would instigate a whole new round of review, which may or may not contradict the previous round of post-publication review and would increase the length of time to publication by going through both types. If someone has a goal to publish in a journal, what benefit would they get by going through the post-publication review first, given this extra time?
  
  Some of these platforms, e.g., F1000, Lifecycle Journal, replace conventional journal publishing. Modular publishing allows for step-by-step feedback from peers.
  
  An important advantage of RRs over other peer review models lies in their capacity to enhance research efficiency. By conducting peer review at Stage 1, researchers gain the opportunity to refine their study design or data collection protocols before empirical work begins.
  
  Other models of review can offer critiques such as "the study should have been conducted differently" without
  
  actionable opportunity for improvement. The key motivation for having my paper reviewed in MetaROR is the quality of peer review – I have never received so many comments, frankly! Moreover, platforms such as MetaROR usually have partnering journals.
  
  62. There’s a section talking about institutional change (page 14). It mentions that openness requires three conditions – people taking responsibility for scientific communication, authors and reviewers, and infrastructure. I would consider adding some discussion of readers and evaluators. Readers have to be willing to accept these papers as reliable, trustworthy, and respectable to read and use the information in them.
  
  Evaluators such as tenure committees and potential employers would need to consider papers submitted through these approaches as evidence of scientific scholarship for the effort to be worthwhile for scientists.
  
  I have omitted these conditions and employed the Moore’s Technology Adoption Life Cycle. Thank you very much for your comment!
  
  63. Based on this overview, which seems somewhat skewed towards the merits of these methods (conflict of interest, limited perspective on downsides to new methods/upsides to old methods), I am not quite ready to accept this effort as equivalent of a regular journal and pre-publication peer review process. I look forward to learning more about the approach and seeing this review method in action and as it develops.
  
  The Discussion section has been substantially revised to address this point. While I acknowledge the current scarcity of empirical studies on innovative peer review models, I have incorporated a critical discussion of this methodological gap.
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/socarxiv/b2ra3
arxiv.org arxiv.org

GREI Data Repository AI Taxonomy (#47)

15
1. metaror 14 Jul 2025
  
  in Public
  
  none
2. metaror 14 Jul 2025
  
  in Public
  
  The authors present a seven-part taxonomy aimed at identifying how AI can be integrated into workflows in data repositories. The taxonomy focuses on functions such as ‘acquire’, ‘validate’, ‘enhance’, and ‘organize’. The authors suggest that the piece is meant for the community of librarians and Open Science practitioners working with data repositories. The piece was reviewed by two metaresearchers. Both emphasized the potential value of the proposed taxonomy, but also noted a number of areas in which the paper needs substantial improvement. The reviewers admitted to be struggling with what they saw as a lack of clarity and practical applicability of the taxonomy. They emphasized the need for more detailed explanations of the various functions, such as case studies or examples of how AI is currently used in repositories. A second main point mentioned by both reviewers was the lack of a section explaining how exactly the taxonomy was developed. This makes it hard to see how and why the authors arrived at the seven functions and what knowledge/insights they are based on – is it previous literature or based on practical work experience of the authors? This also ties in with a third point raised by the reviewers, namely the lack of references. Finally, especially reviewer 2 felt that the concluding section on balancing human and AI expertise is interesting, but also noted that it currently feels a bit disconnected from the main part of the text.
  
  The reviewers and I agree that the paper can potentially become a valuable contribution to the metaresearch literature related to data repositories, provided that the authors address the various criticisms raised by the reviewers.
3. metaror 14 Jul 2025
  
  in Public
  
  none
4. metaror 14 Jul 2025
  
  in Public
  
  General review
  
  The article describes a taxonomy which can be used to help facilitate the use of AI in data repositories. The taxonomy is similar to other descriptions of the data or research life cycle and is not particularly AI specific although brief descriptions of how AI could be used in each stage are included. The article is not currently written as a journal article, and does not engage with the current literature or research within the field, which would help to explain the importance or value of this specific taxonomy.
  
  A literature review should be included and references should also be included throughout using an inline referencing style rather than hyperlinks. A references section should be included at the end of the article.
  
  It is not clear how the 7 areas for AI in data repositories were decided. Is this opinion, based on currently literature, based on user experience…? Without knowing this it is hard for the reader to understand how justified the categories are. Towards the end of the paper 5 categories of AI involvement are discussed, but these don’t appear to be a part of the framework although they seem to be the most AI specific part. I’d suggest that the taxonomy could benefit from including more AI specific terms.
  
  The article could be improved by more detail about how this framework could be used in practice (or is currently being used). It is unclear whether this is just a suggestion of shared language or something more formal like CReDIT or MeSH?
  
  The Balancing AI and Human Expertise in Data Repositories section gives a nice overview of some of the concerns, however this section needs to be referenced as there have been lots of discussions in the literature about the important of ‘humans in the loop’.
  
  The ‘three suggestions to promote trust and transparency’ at the end are interesting but come rather out of nowhere. These should be tied into the body of the paper. How are they related to the taxonomy?
  
  Specific suggestions
  
  Engagement with the literature would help to show why this taxonomy is important or relevant. Some areas which the literature review could cover are:
  
  How is AI being used with repositories at the moment? (or not)
  
  What is the importance of standard taxonomies?
  
  Are there other examples of AI taxonomies?
  
  Are there other ways of describing AI roles which are currently being used and why are they not sufficient?
  
  Each of the taxonomy descriptions include examples of how AI could benefit this role, but they would be greatly enhance by giving more detail of projects where this is already happening or explaining where (and why) it isn’t happening. A specific example of this is within the Organize section where a useful project to discuss might be the Library of Congress metadata labelling project.
  
  A more comprehensive introduction to the project and what you are trying to do would be helpful. Readers may not be specialists in either AI or digital repositories.
  
  The introduction should also explain the current situation and problem that this taxonomy is trying to solve. Future possible benefits are described, but it does not discuss what else is happening in this space or the current situation.
  
  “Just as AI can revolutionize other forms of scholarly communications like peer-reviewed publications” – the reference justifying this is an editorial, a research paper would be more authoritative. There have been a lot of articles arguing the positives and negatives of AI within the field of scholarly communication and this should be discussed more thoroughly.
  
  “it can bring significant improvements to data repositories” What are these improvements? Has anyone done this yet or is it only theoretical?
  
  “As AI becomes more integrated into data repository workflows” Is it becoming more integrated into these workflows? Or do you mean “if”?
5. metaror 14 Jul 2025
  
  in Public
  
  none
6. metaror 14 Jul 2025
  
  in Public
  
  This commentary (or perhaps it is more similar to a blogpost) proposes a taxonomy for tasks for how AIs could be used within workflows in data repositories. Although it is not the stated aim, it also reflects on the relationship between AIs and humans in these workflows and how to develop trust and transparency in a repository while using AIs.
  
  The idea of having a taxonomy which can be used to spur discussions or to classify the roles of AI in repositories is useful, and the taxonomy roles themselves are sensible, although I am not sure how feasible they actually are at the moment. This is perhaps related to a lack of clarity about if these capabilities already exist for AIs to be used in these ways, or if this is the imagined future. It is also not clear which AIs the authors are referring to; is this targeted to LLMs specifically? To other AIs? I imagine this is important in identifying the potential tasks which AI could perform.
  
  Although the authors provide examples of possible tasks for each part of the taxonomy, I still find it a bit abstract and vague. Maybe it would help to include a case study or example of a repository to carry through the steps and demonstrate how this looks/could look in practice? Some of the statements introducing each section of the taxonomy could be a bit more clear also, especially in terms of the target audience. Overall it is unclear to me who the exact target audience is – is this aimed for all repositories or just generalist repositories, given that it was developed within GREI? The latter part of the article refers to generalist repositories, but the earlier part (the taxonomy) does not. This raises the question of whether these steps would apply to all types and sizes of data repositories, or just to larger “generalist” ones. Some of the names for the taxonomy tasks may also be confusing, e.g. “share” – which usually tends to refer to the act of researchers sharing/depositing data in repositories rather than repositories making data available and facilitating reuse,.
  
  One of the biggest points for improvement is that it is not very clear how the taxonomy was developed. The authors mention that it is based on other taxonomies (but they do not provide links or references) and their “coopetition efforts” within the GREI consortium. What did these activities entail exactly? There is a link in the acknowledgements section to authors of a very similar taxonomy (actually a very similar piece overall) that was developed for publishing workflows. Why is this not referenced in the article earlier, or the relationship between the two pieces described? Overall, the referencing should be improved and standardized (although again how this is done depends a bit on how the authors envision the future of this work) – there is currently no reference list at all.
  
  It also seems as if the taxonomy part of the article is disconnected from the latter sections (aside from the conclusion). I actually find the section on balancing human and AI expertise to be quite valuable as its own contribution, as it makes it clear that using AIs is not an all-or-nothing proposition in repositories’ workflows. I think it would help if the authors could let readers know (at least) that this section, and the one on trust, are a part of the article in the introduction. Or perhaps there is room for restructuring the argument here, and somehow foregrounding the human/AI section.
  
  Other reflections
  
  It is a bit difficult to review this piece, not knowing whether it is intended to be a commentary, blogpost, or other type of article.
7. metaror 14 Jul 2025
  
  in Public
  
  Chodacki, J., Hanhel, M., Iacus, S., Scherle, R., Olson, E., Pfeiffer, N., ... & Hosseini, M. (2024). GREI Data Repository AI Taxonomy. arXiv preprint arXiv:2411.08054.
8. metaror 14 Jul 2025
  
  in Public
  
  November 7, 2024
9. metaror 14 Jul 2025
  
  in Public
  
  March 23, 2025
10. metaror 14 Jul 2025
  
  in Public
  
  April 4, 2025
11. metaror 14 Jul 2025
  
  in Public
  
  Authors:
  
  Mark Hanhel m.hahnel@digital-science.com
  
  Stefano Iacus siacus@g.harvard.edu
  
  Ryan Scherle ryan@datadryad.org
  
  Eric Olson eric@cos.io
  
  Nici Pfeiffer nici@cos.io
  
  Kristi Holmes kristi.holmes@northwestern.edu
  
  Mohammad Hosseini mohammad.hosseini@northwestern.edu
12. metaror 14 Jul 2025
  
  in Public
  
  1
13. metaror 14 Jul 2025
  
  in Public
  
  https://arxiv.org/abs/2411.08054
14. metaror 14 Jul 2025
  
  in Public
  
  10.48550/arXiv.2411.08054
15. metaror 14 Jul 2025
  
  in Public
  
  GREI Data Repository AI Taxonomy
Visit annotations in context

Annotators

metaror

URL

arxiv.org/abs/2411.08054
osf.io osf.io

A framework for values-based assessment in promotion, tenure, and other academic evaluations

15
1. metaror 08 Jul 2025
  
  in Public
  
  none
2. metaror 08 Jul 2025
  
  in Public
  
  The authors present a novel method—developed through a series of workshops—for assessing academics. The proposed reform aims to move evaluation away from the traditional focus on scholarship, teaching, and service and toward a more nuanced and flexible set of values. Two scholars reviewed the article. They note that the authors' summary of their findings is comprehensive and that adding concreteness to the widely accepted but abstract reform idea is useful. The reviewers also offered suggestions for improvement. Both reviewers suggested adding details related to the methodology of moving from the workshop data to the proposal. The reviewers also agreed with the authors that implementing the framework might seem daunting and suggested ways to help potential adopters handle the complexities. The reviewers and I agree that the article is a valuable contribution to the literature related to assessing academics.
3. metaror 08 Jul 2025
  
  in Public
  
  The reviewer declares that there are no competing interests in relation to the review of this article.
4. metaror 08 Jul 2025
  
  in Public
  
  The argument presented by McKiernan et al. reminds us that while research assessment reform is increasingly the subject of discussion and proposals for the need to change, concreteness around how to convert this widespread agreement into action remains somewhat lacking. Developed with the input of a series of workshops, the authors present a values-based framework with the hope that this will help institutions and individuals across diverse disciplines take ground and move the needle to achieve actionable change.
  
  This paper offers a valuable perspective and well-resourced recommendations that pull from both published works and on-the-ground insights to address a recognized challenge; as such, the suggestions below are largely focused on potential challenges to adoption, in the interest of driving uptake and to increase the chances the framework can generate individual and institutional benefit.
  
  Purely from a usability standpoint, 14 values is a lot to process (as the authors themselves recognize). While the article makes it clear that there is no expectation to use all 14, with encouragement for “specific departments, disciplines, and institutions to work through whether these values are the ones that resonate most with them,” the sheer number does risk overwhelming potential users from the jump or inadvertently scaring away folks who may feel paralyzed by the need to winnow them down. This, in combination with facing a wholly new system that is intentionally designed to reduce the security blanket of research/teaching/service, may feel like too many changes at once, which subsequently risks reducing potential uptake. Given that one common refrain from those starting on the research assessment reform journey is that simply figuring out how to start can be a challenge, there may therefore be a strong benefit to providing conceptual on-ramps to make this framework as approachable as possible This could take a range of forms—e.g., via clustering values into higher-level categories; employing structures or prompts to assist with processing options or selecting a place to start; proposing potential prioritization strategies—that would neither limit the content nor enforce a strict regimen, but which might make initial entry less daunting. Note that this is not a suggestion to reduce the set, as all seem useful and relevant, but simply to provide some strategies to make the task feel more initially manageable in order to overcome early potential barriers to adoption.
  
  One known and recurrent challenge with more qualitative assessments can be that they are often seen as more time-consuming than using quantitative measures, which tend to be much easier to scan, digest, and compare (see e.g., Ma, L. (2021) ‘Metrics as Time-Saving Devices’, in: F. Vostal (ed.) Inquiring into Academic Timescapes. pp:123-133. Emerald Publishing Limited. 10.1108/978-1-78973-911-420211011 and Rushforth, A. & De Rijcke, S. (2024) Practicing responsible research assessment: Qualitative study of faculty hiring, promotion, and tenure assessments in the United States, Research Evaluation 33 rvae007, https://doi.org/10.1093/reseval/rvae007. A second—perhaps less articulated, but equally critical—issue is that reviewers may not always feel equipped or trained to assess more qualitative outputs. This makes the inclusion of example activities and indicators extremely valuable in their ability to provide a useful on-ramp to assessment activities. At the same time, the authors note that values can be interpreted broadly; this is both a blessing (in that they can accommodate a wide potential range of instances) and a curse (in that it may more difficult for new users to feel confident in how they are being applied), especially as the indicators essentially read as slightly more detailed versions of the activities or behaviors. While I recognize this is intentional, to support a wider variety of potential use cases, it may be useful to explicitly prompt potential users to consider how moving beyond the general can supply an extra layer of specificity appropriate to the case at hand, which in turn can help concretize what ‘good’ looks like in their specific instances. This may be especially useful or important in fields where value may be more qualitative or more difficult to capture and in situations like tenure processes, which rely on communicating or translating accomplishments across disciplines.
  
  The authors make it clear that this approach should not be seen just as an alternative set of measures for what has been used historically, but as an opportunity to interrogate the notion that there is a single way to signal or demonstrate scholarly success, and further to reinforce that there is not one correct pathway or model for building an academic career. This strikes me as an important point, and one that perhaps deserves more attention. It is well recognized, for example, that the traditional hierarchy often used in academic assessment (i.e., research > teaching > service) creates a perverse set of incentives, where some activities are rewarded and legitimized more than others. Despite the fact that the paper mentions that responses during the working sessions were open-minded about contextualizing these values across different scholarly activities, the longevity of that mental model may make it hard to dislodge, and may result in individuals using the proposed framework in such a way that values seen as being more research-aligned are prioritized, or used in ways that continue to promote research above teaching (e.g., mentoring), and so on. This suggests the value of reiterating how the framework’s values can show up across the traditional triad of research/teaching/service for those who might be new to the idea.
  
  Secondly, the valuable insight that different disciplines will likely reflect these values in different ways suggests that the framework might benefit from further consideration regarding how this proposed approach might play across different career stages (e.g., early career vs. advanced professionals) or career paths (e.g., alternatives to the traditional arc that presumes moving in a linear, unbroken progression from undergrad to graduate to post-graduate academic positions). With regard to career stages, while many of the values included in this model are surely important at any point along a career trajectory, the ways in which they manifest may vary quite a bit (most specifically, for example, in cases like leadership and mentorship that may naturally take on a different tenor as one advances; qualities such as “collaboration and partnership” or criteria for “advancing knowledge” may also look quite different with increased seniority). This suggests there is potential value in proposing that institutions consider how each dimension might take on natural progressions in behaviors and indicators.
  
  Further, it might be interesting to consider how the values themselves might provide an inspirational frame or structure that helps academics—perhaps especially early in their career—see how focusing on constellations of values can help them envision and carve out a scholarly identity. This suggestion is prompted in part by the notion of trying on different ‘shapes’ of academic identities that appears in Building Blocks for Impact (https://doi.org/10.5281/zenodo.7249187), which took a different tack toward expanding what matters with regard to scholarship but was equally interested in helping scholars envision centers of gravity and trajectories that were not solely grounded in traditional milestones such as tenure. [In full disclosure, I worked on this model as a part of a DORA-led, grant-funded effort called Project TARA; this point is offered not as a bid for a citation or as self-promotion but because that model grew from similar motivations as the ideas presented in this paper].
  
  Recognizing movement from or into careers outside of academia or developed in practice-based settings may also help the framework encompass the reality of non-traditional career arcs (e.g., moving to or from industry). Given that the framework is already quite substantial, this is not as a suggestion to add more layers of content; rather—as with many of the suggestions above—this might entail supplementing the framework with guidance to prompt institutions or committees to consider a different kind of scholarly diversity. Along similar lines, there might be a benefit in recognizing how career variability does not change the values themselves, but that it might impact how those values play out. For example, moving back and forth from industry (or other non-academic arenas) to academia might offer new types of relationships and opportunities but might also limit open sources sharing to some degree if one is constrained by institutional requirements such as non-disclosure agreements (NDAs).
  
  Finally, it might be worth reflecting on potential ways to capture feedback about the framework’s use or cases that describe how the framework was employed or implemented. While that data collection effort obviously expands beyond the framework itself, the authors’ position that the model is a starting point suggests that proposing ways to learn from collective use will not only help provide guidance for others but also ensure that the framework and overall approach gets better and more robust over time.
5. metaror 08 Jul 2025
  
  in Public
  
  The reviewer declares that there are no competing interests in relation to the review of this article.
6. metaror 08 Jul 2025
  
  in Public
  
  I enjoyed reading this paper and seeing the further development of the framework for values-based assessment in academia. Overall, I feel that the authors provide a comprehensive overview of their workshop approach and findings. Minor suggestions would be including a little more information about the workshop structure and strategies within the text and not relying on reference 31 entirely. For example, instead of asking the reader to locate details regarding the methodology in a different paper I would suggest either adding an appendix with more detailed methodology or adding some broad demographic information and a bit more about recruitment of participants. I would also be interested in how differences of opinions and consensus building were navigated within the workshops. It is also possible that some kind of visual about the workshop structure might be useful for readers, including a bit more detail as to the topics discussed and how the audience was engaged.
  
  In the suggestions for further use of the framework, it would be useful to have some reflection on what is needed for institutions and departments to make use of it. As the authors point out, “our framework could seem overwhelming, with 14 values, multiple behaviors under each, and the potential to add more”. Providing more reflection on this would be helpful, particularly around instigating value-focused discussions and mediating consensus building.
  
  In relation to the future work, I would be very interested in whether the authors were expanding their work to include data stewards and other key support staff communities. In the Netherlands there is a strong push to decrease the distinction between academic and academic support staff. In particular, this is in recognition of their expertise and the key roles that they play in successful research - and Open Science. I think that it would be useful to include perspectives from these communities as well in further iterations of the framework.
7. metaror 08 Jul 2025
  
  in Public
  
  McKiernan, E., Carter, C., Dougherty, M. R., & Tananbaum, G. (2024, July 31). A framework for values-based assessment in promotion, tenure, and other academic evaluations. https://doi.org/10.31219/osf.io/s4vc5
8. metaror 08 Jul 2025
  
  in Public
  
  August 27, 2024
9. metaror 08 Jul 2025
  
  in Public
  
  June 4, 2025
10. metaror 08 Jul 2025
  
  in Public
  
  June 19, 2025
11. metaror 08 Jul 2025
  
  in Public
  
  Authors:
  
  Erin McKiernan emckiernan@ciencias.unam.mx
  
  Caitlin Carter caitlinacarter@gmail.com
  
  Michael Dougherty mdougher@umd.edu
  
  Greg Tananbaum greg@orcaopen.org
12. metaror 08 Jul 2025
  
  in Public
  
  2
13. metaror 08 Jul 2025
  
  in Public
  
  https://osf.io/preprints/osf/s4vc5
14. metaror 08 Jul 2025
  
  in Public
  
  10.31219/osf.io/s4vc5
15. metaror 08 Jul 2025
  
  in Public
  
  A framework for values-based assessment in promotion, tenure, and other academic evaluations
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/osf/s4vc5
osf.io osf.io

Systematic review: The reliability of indicators that may differentiate between suicidal, homicidal, and accidental sharp force wounds

17
1. metaror 08 Jul 2025
  
  in Public
  
  Jennifer Byrne is a member of the editorial team of MetaROR working with Jason Chin, a co-author of the protocol and also a member of the editorial team of MetaROR.
2. metaror 08 Jul 2025
  
  in Public
  
  This protocol aims to address two questions: (1) What do we know about the science underlying impactful legal decisions? (2) How can we assess this evidence efficiently and accurately, such that it is usable for courts? The protocol has been reviewed by three reviewers (reviewer 2 in fact represents a team of three individuals). The reviewers mention various strengths of the protocol. Reviewer 1 emphasises the importance and timeliness of the research questions and praises the interdisciplinary nature of the research team. Reviewer 3 considers the protocol to be thoughtful and detailed, and reviewer 2 notes that the protocol presages an important effort. The reviewers do not see any major shortcomings in the protocol, but they do highlight opportunities to strengthen the protocol, such as considering studies published in languages other than English and adding more detail on how team disagreements will be resolved.
3. metaror 08 Jul 2025
  
  in Public
  
  I declare that this review has been written in the absence of any competing interest, including any role, relationship (including commercial or financial) or commitment that poses an actual or perceived threat to the integrity or independence of my review and that could be construed as a potential conflict of interest.
4. metaror 08 Jul 2025
  
  in Public
  
  Summary
  
  This protocol describes the plan for a systematic review of the literature on stab wounds. The focus is on the types of observations made in such cases, and whether there are any (types of) observations that can be considered “indicators” of the manner of death, to help distinguish between cases of self-inflicted injury and those inflicted by others.
  
  Strong points of this research plan
  
  The authors present compelling arguments for the need for the proposed meta-research project; they refer to a recent case in the High Court of Australia (Lang v the Queen). The arguments highlight the importance and timeliness of the research questions. More generally, the field of forensic pathology and its perception and use by the legal community seems to be an area with great research potential: see for example the problematic cases involving the testimony of Colin Manock in South Australia (e.g. the Keogh case, where the examination of bruises was an issue).
  
  Overall, the research plan is well informed: the authors have conducted a preliminary review of existing relevant studies and reviews. They use the findings from this preliminary review to critically inform the design of their study.
  
  The research team is interdisciplinary, with members from law, psychology and pathology, and appears to be suitably qualified to carry out the proposed research.
  
  The research plan is sufficiently detailed and transparent in terms of search procedures, eligibility criteria, outcome variables, data management and open access policy, which should make the research results widely accessible and reproducible.
  
  Comments, suggestions, critiques
  
  The title includes the term “reliability”, but it is never defined in the text. While this term can be taken in its common sense interpretation, this may not be sufficient for a scientific study. Do the authors mean “reliability” as used, for example, by the US FRE? Or do they understand the term to be similar to the PCAST’s use of the term “validity”?
  
  The plan is not clear (enough) about how – conceptually – to characterise the potential of an observation (made by a pathologist) to provide information about a selected question of interest (e.g., manner of death, the way in which an injury was inflicted, etc.). Formally, the diagnosticity of an observation (or type of observation) is defined in terms of a likelihood ratio. In other words, for an observation to have diagnostic value with respect to a given proposition (hypothesis), the probability of the observation of interest given the proposition of interest must be higher than given an alternative proposition. Thus, whatever this study will reveal about medico-legal observations (in stab wound cases), an inferential framework is needed to assess diagnosticity and, more broadly, reliability. The research plan is silent on this aspect. Instead, most of the effort is spent on descriptive statistics. There is nothing wrong with descriptive statistics, but it will not help to address the main question posed in the title of the proposed research. As an aside, the reference to “confidence intervals” (p. 15 and 19) is unfortunate in the sense that frequentist statistics, although (still) ubiquitous, are problematic for a variety of reasons.
  
  To some extent, the research proposal is too uncritical and passive with respect to terminology that appears to be standard in the field in which the literature review is to be conducted. Consider, for example, the terms “defense injuries” and “tentative injuries” (p. 7). These terms are problematic because they mix observations (e.g., cuts) with ground truth (i.e., self-inflicted or third-party inflicted). Since the ground truth cannot be known in actual cases, “defense injury” cannot meaningfully serve as a descriptor. Moreover, the use of such terms is problematic: suppose an examiner talks about “tentative injuries”. This could suggest to the recipient of expert information that the observed injury is necessarily self-inflicted. Of course, the authors’ intention might be to determine how diagnostic the expert’s utterance of “tentative injury” is with respect to the proposition of self-inflicted injury (without assuming that the utterance of “tentative injury” necessarily implies self-inflicted injury). Nevertheless, this doesn’t solve the problem of confusing terminology. Therefore, this research project could be strengthened by not limiting itself to the descriptive_adoption of standard terminology, but by including a critical analysis and discussion of terminology. In fact, the problem of testimony in this field is not limited to the (currently unknown) diagnosticity of observations made during pathological examinations. It also depends on the coherence of foundational terminology (i.e., its logic) used in this field, as well as on the soundness of the reasoning methods used (e.g., the crucial distinction between findings/observations and _unobservable ground truth states).
  
  On p. 15, the research plan states: “We will attempt to quantitively synthesise cases by first separating them into four groups: those classified by study authors as suicides, homicides, accidents or inconclusives. Then, we will list the frequency with which the case variables listed above appear in each group.” Treating the data in this way will lead to useful statistics: i.e., the probability of different observations given different case types (suicides, homicides, etc.). Such statistics characterise the diagnosticity of the various observations (“case variables”). However, a major problem arises here: how – if at all – it can one known that the reported classification of cases into suicides, homicides etc. was correct? For obvious reasons, none of the case reports in the literature involve experiments under controlled conditions. However, there may be other information or evidence in a case (e.g., video surveillance) that supports particular classifications. Will the project control for this complication, and if so, how?
  
  It would be valuable for this research to include normative considerations, as opposed to a purely descriptive perspective, of what it means for an observation – be it in pathology or any other forensic field – to be “indicative” or discriminative with respect to selected (disputed) propositions. This relates to the notion of inferential framework_mentioned above, which is largely established in the philosophy of science (see e.g. Howson/Urbach, _Scientific Reasoning, 2005), and which could serve as an additional reference point against which to evaluate the current literature. It remains unclear to the reader why this research project refrains from taking a firmer position on the logic of evaluative thinking, which has now become inseparable from sound evaluation procedures in forensic science. Reviewing and synthesising existing literature is one thing, challenging the current state of the art is another. Combining the two is a valuable opportunity that this project could seize.
5. metaror 08 Jul 2025
  
  in Public
  
  none
6. metaror 08 Jul 2025
  
  in Public
  
  Thank you for the opportunity to review this protocol. My expertise is in systematic review methods, generally relating to health interventions, and as such I should note that I do not have expertise in forensic pathology or medico-legal issues.
  
  This paper outlines the protocol for a systematic review of characteristics which allow forensic experts to distinguish between suicide and homicide relating to sharp force wounds, in the context of contributing to criminal prosecution. Interestingly, the protocol outlines the development of preliminary approaches to novel methodology adapted for use in this field, including novel approaches to assessing risk of bias and certainty in the evidence, which have primarily been developed to assess intervention research.
  
  I commend the authors for a thoughtful and detailed protocol. In my view, this is a strong piece of work and will contribute findings of interest to the field, as well as contributing to the exploration of methods for the assessment of a category of research for which such methods are currently lacking. I have made a few suggestions below for consideration by the authors that may strengthen the protocol.
  
  Rationale
  
  It may be helpful to international readers to clarify in the text of the Rationale that R v Lang is a case in the High Court of Australia, and to spell HCA out in full in the footnote. With regard to readers looking for details on this case, are these published on a website for which a URL can be provided?
  
  It would be helpful for readers without a background in legal proceedings to discuss the extent to which research evidence and systematic reviews are or are not commonly presented in legal proceedings, in contrast to expert opinion.
  
  Where you discuss the debate about the role of cause of death findings, it would be helpful to explicitly state in which jurisdictions these discussions have been occurred, so that readers can understand whether and how this topic relates to their own jurisdiction or where there may be differences. It may further be helpful to elaborate briefly on why cause of death determinations may be considered unreliable.
  
  Methods
  
  It is a limitation to the review to only include studies published in English. The proficiency of automated translation is currently such that screening of potentially relevant studies in multiple languages is often possible, with assistance from multilingual colleagues or communities such as Cochrane Engage can enable the inclusion of studies in additional languages.
  
  Regarding grey literature, both of the listed organisations appear to be based in the USA (although this is not stated for the OSCAC) – could you provide a rationale for only using US institutions to identify relevant data? For example, there may be organisations in Australia (which is the jurisdiction of interest for the legal aspects of this review) or in countries with comparable criminal legal systems (such as the UK, Europe or elsewhere).
  
  Will a software tool be used to support study selection, such as Covidence or similar? This may contribute to your analysis of time and process.
  
  Injury severity score – will injury severity be captured if other measures of severity are used, or not at all? There are methods available to consider results across different measures of similar outcomes, if these would be considered valid alternatives.
  
  · In the rationale and the methods relating to risk of bias, you note that it may be relevant to capture (if available) information such as whether witness, video evidence or a confession was available to support the conclusion of cause of death. Should this kind of characteristic be added to the data collected?
  
  The methods provided for data synthesis, risk of bias assessment and the certainty/quality of the evidence (based on GRADE) all currently read as if all your included studies will be case series or case studies. As your included studies also include observational studies that may give effect estimates such as odds ratios rather than individual counts of characteristics, methods should be provided for handling and perhaps quantitatively synthesising this kind of data, where appropriate. Risk of bias methods and GRADE methods may more closely correspond to the existing methods for this kind of study, and require less adaptation.
  
  GRADE methodology generally refers to “certainty in the evidence” rather than confidence, to avoid confusion with risk of bias assessment.
  
  You note in the rationale that you plan to collect data on the review process, such as time taken to complete different tasks. I’d suggest putting this detail in the methods section.
  
  I would recommend giving some further thought to how you will draw conclusions from the data you find int his review. Assuming that sufficient data can be found, and that you have a set of either percentages from case studies/series or effect estimates from observational studies, it is likely that you will wish to discuss which factors appear to be associated with different causes of death, or which are most effective at discriminating between causes. I would strongly recommend considering what thresholds for associations or differences between causes of death would underpin such conclusions, and specify these in advance. I’d recommend speaking to a statistician to draft these methods appropriately and avoid errors in interpreting the estimates found.
7. metaror 08 Jul 2025
  
  in Public
  
  none
8. metaror 08 Jul 2025
  
  in Public
  
  On behalf of the Center for Integrity in Forensic Sciences and its Executive Director, Katherine H. Judson, as well as its co-founder, Professor Emeritus Keith A. Findley, I am pleased to submit these comments on the above-cited draft work of Jason Chin, Stephanie Clayton, and their colleagues. Thank you for soliciting our views. You may learn more about the Center for Integrity in Forensic Sciences at www.cifsjustice.org
  
  The authors’ explanation of their planned systematic review is helpful and presages an important effort. We commend the authors for their thoughtful study design, their transparency, and their initial research into source materials listed in Appendix A.
  
  Two minor methodological concerns appear to us initially. One, we do not fully understand the intention, described in four places (pages 11, 18, and 19), to use two independent reviewers of data and to resolve disagreements “by discussion.” It is not clear whether that discussion is to occur between the pair of reviewers only, or whether others will join the adjudicative discussion. In either event, it may be useful to consider an odd number of adjudicators for purposes of breaking a deadlock, if necessary. Two, the intended systematic review excludes studies not published in English (see page 10). While the lack of proficiency in other languages among the research team is understandable (and rightly acknowledged), the availability of reliable translations today should allow inclusion of studies published in other languages, we suspect.
  
  Our two principal substantive concerns are broader, though. First, this systematic review appears to overlook risks of availability bias and confirmation bias in information gathering by pathologists, who often rely on information passed along by law enforcement officers and others invested in a particular outcome or conclusion. Relatedly, forensic pathologists themselves often are closely aligned professionally and attitudinally with law enforcement personnel. Indeed, the pathologists may be employed by prosecutive and investigative agencies of the government, and therefore professionally and financially dependent on their sources of information. We predict that the research team will encounter frequently—perhaps almost uniformly—the absence of pre-existing protocols that Cochrane raises as a concern and that the authors rightly note at page 18 of this draft. That common absence of a known protocol, established in advance and subject to compliance assessment later, may be both caused in part by and an effect of the availability and confirmation (or tunnel vision) biases we discuss here.
  
  Second, the systematic review does not seem designed to consider the normative question of which systemic actor or actors are best equipped and most appropriate to make manner of death determinations for judicial, as opposed to statistical, purposes. We hope that the researchers will recommend that such determinations by pathologists or other biomedical experts should be limited to statistical purposes, for use in allocating public
  
  resources. In the end, regardless how reliable their opinions, pathologists and biomedical practitioners are no better positioned than jurors or judges to make adjudicative determinations of suicide or homicide, as the factfinders in a judicial system should have access to all information—presented to them in a more transparent, testable form in court—that the pathologist has in drawing conclusions. And as a normative matter, those adjudicative conclusions are assigned to jurors and judges, not to pathologists or other biomedical experts.
  
  With these caveats, we again welcome this initial work and description of the metaanalysis to come. Especially if confined to assessing and advancing the reliability of manner of death determinations in cases of sharp force wounds for statistical purposes, and thus as an aid in allocating public resources outside the judicial system, the eventual systematic review may be quite valuable.
  
  Finally, for a pertinent and longer discussion of related issues, see Keith A. Findley & Dean A. Strang, Ending Manner of Death Testimony and Other Opinion Determinations of Crime, 60 Duquesne Law Review 302 (2022). The authors themselves cite this article at footnotes 5 and 7 of their draft. Again, thank you for the opportunity to offer these comments.
9. metaror 08 Jul 2025
  
  in Public
  
  Chin, J., Clayton, S., Cordner, S., Edmond, G., Growns, B., Hunter, K., … Summersby, S. (2025, April 30). Systematic review: The reliability of indicators that may differentiate between suicidal, homicidal, and accidental sharp force wounds [peer reviewed]. Retrieved from https://osf.io/preprints/metaarxiv/atu56_v2
10. metaror 08 Jul 2025
  
  in Public
  
  October 2, 2004
11. metaror 08 Jul 2025
  
  in Public
  
  February 6, 2025
12. metaror 08 Jul 2025
  
  in Public
  
  April 1, 2025
13. metaror 08 Jul 2025
  
  in Public
  
  Authors:
  
  Jason Chin jason.chin@anu.edu.au
  
  Stephanie Clayton stephanie.Clayton1@health.nsw.gov.au
  
  Stephen Cordner stephen.cordner@vifm.org
  
  Gary Edmond g.edmond@unsw.edu.au
  
  Bethany Growns Bethany.growns@canterbury.ac.nz
  
  Kylie Hunter kylie.hunter@sydney.edu.au
  
  Bernard I'Ons bernard.ions@health.nsw.gov.au
  
  Kristy Martire k.martire@unsw.edu.au
  
  Gianni Ribeiro gianni.riberio@unisq.edu.au
  
  Stephanie Summersby stephanie.summersby@police.vic.gov.au
14. metaror 08 Jul 2025
  
  in Public
  
  2
15. metaror 08 Jul 2025
  
  in Public
  
  https://osf.io/preprints/metaarxiv/atu56
16. metaror 08 Jul 2025
  
  in Public
  
  10.31222/osf.io/atu56
17. metaror 08 Jul 2025
  
  in Public
  
  Systematic review: The reliability of indicators that may differentiate between suicidal, homicidal, and accidental sharp force wounds
Visit annotations in context

Annotators

metaror

URL

osf.io/preprints/metaarxiv/atu56

Annotators

Review comments

Introduction

Reviewability of automated reasoning systems

Wide-spectrum vs. situated software

Mature vs. experimental software

Convivial vs. proprietary software

Transparent vs. opaque software

Size of the minimal execution environment

Analogies in experimental and theoretical science

Improving the reviewability of automated reasoning systems

Review the reviewable

Science vs. the software industry

Emphasize situated and convivial software

Make scientific software explainable

Use Digital Scientific Notations

Conclusion

Notes

References

1 Reviewer 1

Introduction

Reviewability of automated reasoning systems

Improving the reviewability of automated reasoning systems

2 Reviewer 2

2.1 References

3 Editor

Annotators

URL

References

Response to the Editors and the Reviewers

Handling editor

Reviewer 1

Reviewer 2

Reviewer 3

Reviewer 4

Annotators

URL

Response to Reviewer #1: Najko Jahn

Response to Reviewer #2: Mikael Laakso

Annotators

URL

Introduction

Methods

Results

Discussion

Additional points:

General comments

Respecting disciplinary and epistemic diversity

Signed:

Data availability

Annotators

URL

Annotators

URL

Annotators

URL

James’ peer review of Heathers’ article

Fanelli 2009 is not cited in the way JH says it is cited

Methods for the construction of the corpus of citation contexts

Contributions

References

Annotators

URL

Summary of the essay

General Comment

Main comments

Other comments

Catriona J. MacCallum

Too Easy on the Publishers, Too Hard on Researchers

Focus More on Institutions and Funders and Communities

No New Arguments or Analysis

Volume is Volume is Volume

Summary

Acknowledgments

Disclosure

Annotators

URL

Handling Editor

Reviewer 1

Reviewer 2