- Jul 2018
-
journals.plos.org
-
On 2017 Nov 30, Chris Mebane commented:
Kudos to the authors for researching and publishing this article. One quibble. In the Table 2 criteria for Data Reusability, authors were marked down if they archived data "in a format that is designed to be machine-readable with proprietary software (e.g., Excel)."
That is not a correct criticism. "Excel" is proprietary software, but the "xlsx" Office Open XML format is not proprietary; it is a standardized format. Between 2006 and 2009, the Office Open XML format was standardized by the international standards organizations Ecma and ISO [1]. The U.S. Library of Congress has a wealth of information on Sustainability of Digital Formats in archiving, including xlsx [2]. On the Library's own format preferences, its staff note that "For works acquired for its collections, the list of Library of Congress Recommended Formats Statement for Datasets/Databases, as of June 2016, includes XLSX (.xlsx) as a preferred format for datasets." [2]
The misnomer in Table 2 is repeated in Table 3, "Key recommendations to improve public data archiving ... Use standard formats: Use file formats that are compatible with many different kinds of software (e.g., csv rather than excel files)." Office Open XML files are not "Excel" files; rather, Office Open XML is the default format used by Excel. As the Library of Congress staff noted, XLSX files can be edited directly, for example in Google Sheets, without conversion.
If the expectation is that reuse would be via Big Data automated data-mining, without the need for a human to read the associated paper, then by all means csv flat files and metadata in standard data dictionaries are the way to go. However, for smaller datasets, or for studies in which the context of the data matters, reuse would entail another researcher looking at the article and study. When the semantic structure of formulas and their relationship to cells with values also matters, the Office Open XML format or the similar Open Document Format (odf) is fully appropriate.
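For the machine-reuse case, a csv flat file paired with a data dictionary could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the column names, the values, and the layout of the dictionary are hypothetical placeholders, not taken from the article or from any particular metadata standard.

```python
import csv
import io
import json

# Hypothetical example rows; column names and values are placeholders.
rows = [
    {"site": "A", "sample_date": "2017-06-01", "conc_ug_per_l": 12.5},
    {"site": "B", "sample_date": "2017-06-02", "conc_ug_per_l": 8.1},
]

# Hypothetical data dictionary: one entry per column, giving type, unit,
# and a plain-language description. The layout is an assumption.
data_dictionary = {
    "site": {"type": "string", "description": "Sampling site code"},
    "sample_date": {
        "type": "date",
        "format": "YYYY-MM-DD",
        "description": "Date the sample was collected",
    },
    "conc_ug_per_l": {
        "type": "number",
        "unit": "ug/L",
        "description": "Measured concentration",
    },
}


def to_csv_text(records, columns):
    """Serialize the records as csv text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def undocumented_columns(records, dictionary):
    """Return columns that appear in the data but not in the dictionary."""
    seen = {key for record in records for key in record}
    return sorted(seen - set(dictionary))


# The dictionary's key order defines the csv column order.
csv_text = to_csv_text(rows, list(data_dictionary))
dictionary_text = json.dumps(data_dictionary, indent=2)

# Refuse to archive data that has columns nobody has described.
missing = undocumented_columns(rows, data_dictionary)
if missing:
    raise ValueError(f"Columns lacking a dictionary entry: {missing}")
```

In practice the two strings would be written to files deposited side by side (say, a .csv file and a .json file), so that software can interpret each column without anyone reading the paper.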
It may be prudent for authors to hedge their bets and publish their data in more than one format. I did that, for example, in a Dryad data release [3], and it only took a few minutes longer. All the work was in the curation, structuring, and labeling (that is, beyond the work of generating the data in the first place). The fundamental point of this comment is not to recommend a specific format, only that the format be fit for purpose, and that since the 2006-2009 approval of the xlsx Office Open XML format by international standardization organizations, "xlsx" is in fact a standard, non-proprietary format.
[1] https://en.wikipedia.org/wiki/Office_Open_XML
[2] https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml
[3] http://datadryad.org/resource/doi:10.5061/dryad.67n20
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-