- Jul 2018
-
europepmc.org
-
On 2016 Nov 09, NLM Trainees’ Data Science Journal Club commented:
This is a preliminary study examining the discoverability of and access to biomedical datasets generated by research funded by the U.S. National Institutes of Health (NIH). More specifically, it focuses on datasets not deposited in a known repository and therefore considered “invisible.” An analysis of NIH-funded journal articles shows that only 12% explicitly mention depositing datasets in known repositories, leaving a surprising 88% whose datasets are invisible. The authors suggest that approximately 200,000 to 235,000 invisible datasets are generated from NIH-funded research published in one year alone. The study identifies issues that need to be addressed to improve discoverability of and access to research data: defining what a dataset is, determining which data should be preserved and archived, and developing better methods for calculating the number of datasets of interest.
Our group had the honor of having two of the authors, Betsy Humphreys and Lou Knecht, in attendance to provide personal insights into the study. Betsy pointed out one of the article’s strengths: the opportunity to share a surprising discovery that has potential practical benefits for the research community. The study has received a fair amount of positive feedback, and Betsy mentioned that Clifford Lynch from the Coalition for Networked Information (CNI) has distributed the paper and is calling for more studies on the subject. She also acknowledged the study’s weaknesses: it does not hold up as a model of research methodology, and the annotators lacked consensus on the definition of a dataset.
There was a lively discussion in the group about what constitutes an “invisible” dataset: whether links to scientists’ or institutional websites should be considered invisible, and how visibility could be detected with better JATS tagging. For clinical researchers, self-archiving on a personal domain is the norm, and such datasets are considered invisible. Everyone agreed that defining “visible” and “invisible” should be a priority for future research. Another interesting part of the discussion focused on datasets being “in context,” i.e., datasets are meaningful only when considered along with the published paper. It was surprising to learn that a large share of the formal repository names mentioned in the full text turned out not to be in the context of an actual deposit. We discussed that it would be helpful to have guidelines and a consistent way to label the information (database name, date of deposit, accession number) in the acknowledgement for easier discoverability and preservation. We were pleased to see this preliminary study published because it brought to light the large problem of invisible datasets, and we look forward to seeing more research in this area.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-