- Jul 2018
-
europepmc.org
-
On 2017 Mar 19, NLM Trainees’ Data Science Journal Club commented:
The NLM Trainees' Data Science Journal Club for this discussion consisted of several NLM staff members, including one of the paper's authors. In brief, the paper examines the feasibility of using MetaMap to compare clinical associations found in clinical billing records with clinical associations found in the medical literature. A primary goal of the study was to determine whether using the medical literature as a corpus of "known" associations would allow discovery of "novel" clinical associations, that is, associations present in clinical records but not in the literature. Previous studies had used similar methods to examine associations within specific clinical topics, but this study represents an "all-to-all" comparison, analyzing each corpus irrespective of clinical topic.
One group member asked why 600, specifically, was chosen as the cutoff for the more inclusive MetaMap600 scores. The author in our group noted that the authors selected 600 as a reasonable threshold for broadening match recall, based on Lister Hill's recommended starting point for MetaMap configuration.
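To make the cutoff concrete, here is a minimal sketch of how a MetaMap600-style threshold broadens the set of retained concept mappings. MetaMap scores candidate concepts on a 0–1000 scale (1000 being a perfect match); the candidate CUIs and scores below are invented for illustration, not taken from the paper.

```python
# Hypothetical candidate mappings: (UMLS CUI, preferred name, MetaMap score).
# All tuples are invented for illustration.
candidates = [
    ("C0020538", "Hypertensive disease", 1000),
    ("C0004238", "Atrial Fibrillation", 861),
    ("C0011849", "Diabetes Mellitus", 694),
    ("C0018799", "Heart Diseases", 588),
]

THRESHOLD = 600  # the MetaMap600 cutoff discussed above

# Keep any candidate at or above the threshold; a lower cutoff
# retains weaker matches and thus broadens recall.
kept = [(cui, name) for cui, name, score in candidates if score >= THRESHOLD]

for cui, name in kept:
    print(cui, name)
```

With a stricter cutoff (say, 900), only the top two candidates would survive; at 600 the third is also kept, which is the recall-broadening effect the group discussed.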
Much of the group discussion explored factors that may have influenced the strength of the association results. For example, one person noted differences between billing records and research articles, namely that billing notes are far more precise than research articles. This difference in scope and intent may have contributed to the lower-than-expected association strength. Relatedly, the author reiterated that the medical literature does not necessarily reflect the full scope of known clinical associations, as the literature often focuses on new, emerging, or rare phenomena rather than those that are well known in the community.
One group member noted that this procedure may be more powerful using clinical notes, and clinical terminologies like SNOMED CT, rather than billing records. For example, in ICD-9, some rare diseases are coded using more general codes because no specific code for the rare disease is available. In cases like this, less specific data could cause a false-negative association.
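The coding-granularity point can be illustrated with a toy calculation (all counts below are invented, not from the study): folding a rare disease into a more general ICD-9 code dilutes its co-occurrence signal with another condition, which can push a true association below detection.

```python
# Invented counts illustrating signal dilution from coarse coding.

def rate(hits, total):
    """Fraction of records in a group that also carry condition X."""
    return hits / total

# Background: 10,000 records, 500 of which have condition X.
background_with_x, background_total = 500, 10_000

# Specific coding: 40 records carry the rare-disease code; 30 also have X.
rare_with_x, rare_total = 30, 40
specific_lift = rate(rare_with_x, rare_total) / rate(background_with_x, background_total)

# General coding: those 40 records are folded into a general code
# shared by 2,000 records, of which 130 have X.
general_with_x, general_total = 130, 2_000
general_lift = rate(general_with_x, general_total) / rate(background_with_x, background_total)

print(f"lift with specific code: {specific_lift:.1f}")
print(f"lift with general code:  {general_lift:.1f}")
```

Under these invented numbers the specific code shows a strong association (lift 15.0) while the general code shows almost none (lift 1.3), which is exactly the false-negative risk the group raised.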
While this study was an all-to-all comparison, the group brainstormed about the potential for sampling de-identified full-text clinical notes, and for limiting MEDLINE content to either articles published in the last five years or clinically focused journals, to reduce some of the data "noise" introduced into the study by dated and/or non-clinical text. Because novel associations represent the vast minority of associations discovered between the two datasets, this study faces a "sparse data" problem. Using more focused and comparable sample datasets for comparison could make novel associations easier to identify.
As an aside, the journal club appreciated that the authors specified the statistical package (tool and version) used for their study to aid in reproducibility and informing future studies.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-