1,966 Matching Annotations
  1. Mar 2022
    1. narid: https://www.oxfordjournals.org/nar/database/summary/1058

      narCategory: Genomics Databases (non-vertebrate), Nucleotide Sequence Databases

      Subcategory: Prokaryotic genome databases, Transcriptional regulator sites and transcription factors

      wd: https://www.wikidata.org/wiki/Q111382150

  2. Feb 2022
    1. Daniel Vila Suero @dvilasuero · Feb 9: BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision. Noisy labels using Wikidata and gazetteers (distant labels); fine-tune RoBERTa for NER with distant labels; self-training. https://github.com/cliang1453/BOND https://arxiv.org/abs/2006.15509 #python #NLProc

    1. Deprecation-based indicator: Wikidata has a 'soft' alternative to deletions: deprecating

      test

    1. COVID-19 pandemic

      Reviewer 3. Daniel Mietchen

      This review includes supplemental files, videos and hypothes.is annotations of the preprint: https://zenodo.org/record/4909923

      The videos of the review process are also available on YouTube:

      Part 1 (Screen Recording 2021-06-05 at 10.02.02.mov): https://youtu.be/_UnDdE3Oi-4
      Part 2 (Screen Recording 2021-06-05 at 10.52.51.mov): https://youtu.be/z5xRK0lg3b4
      Part 3 (Screen Recording 2021-06-05 at 11.27.01.mov): https://youtu.be/VnztlEqFW2A
      Part 4 (Screen Recording 2021-06-07 at 02.51.59.mov): https://youtu.be/IYtLfMcLTvA
      Part 5 (Screen Recording 2021-06-07 at 06.11.52.mov): https://youtu.be/Jv_AUHCASQw
      Part 6 (Screen Recording 2021-06-07 at 18.07.45.mov): https://youtu.be/6Y-yA9oahzM
      Part 7 (Screen Recording 2021-06-07 at 19.07.02.mov): https://youtu.be/LV5whFhfmEU

      First round of review:

      Summary: The present manuscript provides an overview of how the English Wikipedia incorporated COVID-19-related information during the first months of the ongoing COVID-19 pandemic.

      It focuses on information supported by academic sources and considers how specific properties of the sources (namely their status with respect to open access and preprints) correlate with their incorporation into Wikipedia, as well as the role of existing content and policies in mediating that incorporation.

      No aspect of the manuscript would justify a rejection, but there are many opportunities for improvement, so "Major revision" appears to be the most appropriate recommendation at this point.

      General comments: The main points that need to be addressed better: (1) documentation of the computational workflows; (2) adaptability of the Wikipedia approach to other contexts; (3) descriptions of or references to Wikipedia workflows; (4) linguistic presentation.

      Ad 1: while the code used for the analyses and for the visualizations seems to be shared rather comprehensively, it lacks sufficient documentation as to what was done in what order and what manual steps were involved. This makes it hard to replicate the findings presented here or to extend the analysis beyond the time frame considered by the authors.

      Ad 2: The authors allude to how pre-existing Wikipedia content and policies - which they nicely frame as Wikipedia's "scientific infrastructure" or "scientific backbone" - "may provide insight into how its unique model may be deployed in other contexts", but that potentially most transferable part of the manuscript - which would presumably be of interest to many of its readers - is not very well developed, even though that backbone is well described for Wikipedia itself.

      Ad 3: there is a good number of cases where the Wikipedia workflows are misrepresented (sometimes ever so slightly), and while many of these do not affect the conclusions, some actually do, and overall comprehension is hampered. I highlighted some of these cases, and others have been pointed out in community discussions, notably at https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_COVID-19&oldid=1028476999#Review_of_Wikipedia's_coverage_of_COVID and http://bluerasberry.com/2021/06/review-of-paper-on-wikipedia-and-covid/ . Some resources particularly relevant to these parts of the manuscript have not been mentioned, be it scholarly ones like https://arxiv.org/abs/2006.08899 and https://doi.org/10.1371/journal.pone.0228786 or Wikimedia ones like https://en.wikipedia.org/wiki/Wikipedia_coverage_of_the_COVID-19_pandemic and https://commons.wikimedia.org/wiki/File:Wikimedia_Policy_Brief_-_COVID-19_-_How_Wikipedia_helps_us_through_uncertain_times.pdf . Likewise essentially missing - although this is a common feature in academic articles about Wikipedia - is a discussion of how valid the observations made for the English Wikipedia are in the context of other language versions (e.g. Hebrew). On that basis, it is understandable that no attempt is made to look beyond Wikipedia to see how coverage of the pandemic was handled in other parts of the Wikimedia ecosystem (e.g. Wikinews, Wikisource, Wikivoyage, Wikimedia Commons and Wikidata), but doing so might actually strengthen the above case for deployability of the Wikipedia approach in other contexts. Disclosure: I am closely involved with WikiProject COVID-19 on Wikidata too, e.g. as per https://doi.org/10.5281/zenodo.4028482 .

      Ad 4: The relatively high number of linguistic errors - e.g. typos, grammar, phrasing and also things like internal references or figure legends - needlessly distracts from the value of the paper. The inclusion of figures - both via the text body and via the supplement - into the narrative is also sometimes confusing and would benefit from streamlining.

      While GigaScience has technically asked me to review version 3 of the preprint (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v3 and also via GigaScience's editorial system), that version was licensed incompatibly with publication in GigaScience, so I pinged the authors on this (via https://twitter.com/EvoMRI/status/1393114202349391872 ), which resulted (with some small additional changes) in the creation of version 4 (available via https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4 ), which I concentrated on in my review.

      Production of that version 4 - of which I eventually used both the PDF and the HTML, which became available to me at different times - took a while, during which I had a first full read of the manuscript in version 3.

      In an effort to explore how to make the peer review process more transparent than simply sharing the correspondence, I recorded myself while reading the manuscript for the second time, commenting on it live. These recordings are available via https://doi.org/10.5281/zenodo.4909923 .

      In terms of specific comments, I annotated version 4 directly using Hypothes.is, and these annotations are available via https://via.hypothes.is/https://www.biorxiv.org/content/10.1101/2021.03.01.433379v4.full .

      Re-review: I welcome the changes the authors have made - both to the manuscript itself (of which I read the bioRxiv version 5) and to the WikiCitationHistoRy repo - in response to the reviewer comments. I also noticed comments they chose not to address, but as stated before, none of these would be grounds for rejection. What I am irritated about is whether the proofreading has actually happened before the current version 5 was posted. For instance, reference 44 seems missing (some others are missing in the bioRxiv version, but I suspect that's not the authors' fault), while lots of linguistic issues in phrases like "to provide a comprehensive bibliometric analyses of english Wikipedia's COVID-19 articles" would still benefit from being addressed. At this point, I thus recommend that the authors (a) update the existing Zenodo repository such that there is some more structure in the way the files are shared and (b) archive a release of WikiCitationHistoRy on Zenodo.

  3. Jan 2022
  4. takingnotenow.blogspot.com
    1. while also recognising another important subject you also want to write about later.

      Interesting: they are essentially saying the point of a wikilink is to bookmark something you want to write about later; it's essentially entity-based annotation, very much in the same vein as [[wikidata]]

    1. Author Response

      Reviewer #1 (Public Review):

      This manuscript addresses a major issue facing consumers of structure-organism pair data: the landscape of databases is very difficult to navigate due to the way data is made available (many resources do not have structured data dumps) and the way data is standardized (many resources' structured data dumps do not standardize their nomenclature or use stable entity identifiers). The solution presented is a carefully constructed pipeline (see Figure 1) for importing data, harmonizing/cleaning it, automating decisions about exclusions, and reducing redundancy. The results are disseminated through Wikidata to enable downstream consumption via SPARQL and other standard access methods as well as through a bespoke website constructed to address the needs of the natural products community. The supplemental section of the manuscript provides a library of excellent example queries for potential users. The authors suggest that users may be motivated to make improvements through manual curations on Wikidata, through semi-automated and automated interaction with Wikidata mediated by bots, or by addition of importer modules to the LOTUS codebase itself.

      Despite the potential impact of the paper and excellent summary of the current landscape of related tools, it suffers from a few omissions and tangents:

      1. It does not cite specific examples of downstream usages of structure-organism pairs, such as an illustration on how this information in both higher quantity and quality is useful for drug discovery, agriculture, artificial intelligence, etc. These would provide a much more satisfying bookend to both the introduction and conclusion.

      Thank you for this remark. We deliberately decided not to insist too heavily on application examples of the LOTUS outputs. Indeed, we are somewhat biased by our main field of investigation, natural products chemistry, and expect that the dissemination of specialized metabolite occurrences will benefit a wide range of scientific disciplines (ecology, drug discovery, chemical ecology, ethnopharmacology, etc.).

      However, Figure 5 was established to illustrate how the information available through LOTUS is quantitatively (size) and qualitatively (color classes) superior to what is available through single natural products resources.

      As added in the introduction, one of the downstream usages of those pairs is, for example, to perform taxonomically informed scoring as described in https://doi.org/10.3389/fpls.2019.01329. Obtaining an open database of natural products' occurrences to fuel such taxonomically informed metabolite annotation tools was the initial impulse for us to build LOTUS. These metabolite annotation strategies, tailored for specialized metabolites, have been shown to offer appreciable performance improvements for current state-of-the-art computational metabolite annotation tools. Since metabolite annotation has regularly been cited as "the major bottleneck" in metabolomics over the last 15 years (https://europepmc.org/article/med/15663322, https://doi.org/10.1021/acs.analchem.1c00238), any tangible improvement in this field is welcome. With LOTUS we offer a reliable and reusable structure-organism data source that can be exploited by the community to tackle such important issues.

      Other possible usages are suggested in the conclusion, but benchmarking or even exemplifying such uses is clearly out of the scope of this paper, each one of them being an article per se.

      The additional queries are given in our first answer (see "essential revisions") and demonstrate the impact of LOTUS on accelerating the initial bibliographic survey of chemical structure occurrences across the tree of life.

      This query (https://w.wiki/4VGC) can be compared to a literature review, such as https://doi.org/10.1016/j.micres.2021.126708. In seconds, it retrieves a table listing compounds reported in given taxa and allows limiting the search by year.
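      To make the kind of query referred to above more concrete, here is a minimal Python sketch (not the exact query behind https://w.wiki/4VGC) against the public Wikidata SPARQL endpoint. It assumes Wikidata's "found in taxon" property (P703) and "stated in" (P248) for references; the taxon QID is a placeholder to be replaced with the taxon of interest.

      ```python
      import requests

      # Minimal sketch: list compounds reported as "found in taxon" (P703) for a given
      # taxon, together with the reference each statement is "stated in" (P248).
      ENDPOINT = "https://query.wikidata.org/sparql"
      TAXON_QID = "Q12345"  # placeholder: replace with the QID of the taxon of interest

      query = f"""
      SELECT ?compound ?compoundLabel ?reference ?referenceLabel WHERE {{
        ?compound p:P703 ?statement .        # compound has a "found in taxon" statement
        ?statement ps:P703 wd:{TAXON_QID} .  # ... whose value is the chosen taxon
        OPTIONAL {{ ?statement prov:wasDerivedFrom/pr:P248 ?reference . }}
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
      }}
      LIMIT 100
      """

      r = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                       headers={"User-Agent": "lotus-example/0.1"})
      for row in r.json()["results"]["bindings"]:
          print(row["compoundLabel"]["value"], "-",
                row.get("referenceLabel", {}).get("value", "no reference"))
      ```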

      2. The mentions of recently popular buzzwords FAIR and TRUST should be better qualified and be positioned as a motivation for the work, rather than a box to be checked in the modern publishing climate.

      It is true that the modern publishing system certainly suffers from some drawbacks (also critically mentioned within the paper). However, after consulting all authors, we believe that because LOTUS checks both boxes of FAIR and TRUST, we would rather stick to these two terms. In our view, rules 1 (don't reinvent the wheel) and 5 (put yourself in your user's shoes) of https://doi.org/10.1371/journal.pcbi.1005128 apply here. Both terms are indeed commonly (mis-)used, but we felt that redefining other complicated terms would not help the reader/user.

      3. The current database landscape really is bad, and the authors should feel emboldened to emphasize this in order to accentuate the value of the work, with more specific examples on some of the unmaintained databases.

      We perfectly agree with this statement and it is the central motivation of the LOTUS initiative to improve this landscape. It was a deliberate choice not to emphasize how bad the actual landscape is, but rather to focus on better habits for the future. We do not want to start devaluing other resources and elevate our initiative at the cost of others. We also believe that an attentive look at the complexity of the LOTUS gathering, harmonization, and curation speaks for itself and describes the huge efforts required to access properly formatted natural products occurrence data.

      If the reviewer and editors insist, although not in our scope, we are happy to list a series of specific (but anonymized) examples of badly formatted entries, wrong structure-organism associations, or poorly accessible resources.

      4. While the introduction and supplemental tables provide a thorough review of the existing databases, it eschews an important, more general discussion about data stewardship and maintenance. Many databases in this list have been abandoned immediately following publication, have been discontinued after a single or limited number of updates, or have been decommissioned/taken down. This happens for a variety of reasons, from the maintainer leaving the original institution, from funding ending, from original plans to just publish then move on, etc. The authors should reflect on this and give more context for why this domain is in this situation, and if it is different from others.

      We do agree with the reviewer and have added a "status" column to the table (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). We chose four possible statuses:

      • Maintained: self-explanatory.
      • Unmaintained: the database did not see any update in the last year.
      • Retired: the authors stated they will not maintain the database anymore.
      • Defunct: the database is not accessible anymore.

      As for question 3 above, we decided not to focus too heavily on the negative points and to summarize the current situation in the table above. The reasons why databases end up in this situation are multiple, and we think they are well summarized in https://doi.org/10.1371/journal.pcbi.1005128 (Rule 10: Maintain, update, or retire), already cited in the manuscript introduction.

      5. Related to data stewardship: the LOTUS Initiative has ingested several databases that are no longer maintained as well as several databases with either no license or a more restrictive license than the CC0 under which LOTUS and Wikidata are distributed. These facts are misrepresented in Supplementary Table 1 (Data Sources List), which links to notes in one of the version-controlled LOTUS repositories that actually describe the license. For example, https://gitlab.com/lotus7/lotus-processor/-/blob/8b60015210ea476350b36a6e734ad6b66f2948bc/docs/licenses/biofacquim.md states that the dataset has no license information. First, the links should state exactly what the licenses are, if available, and explicitly state if no license is available. There should be a meaningful and transparent reflection in the manuscript on whether this is legally and/or scientifically okay to do - especially given that many of these resources are obviously abandoned.

      This point is a very important one. We did our best to be as transparent as possible in our initial table. Following the reviewer's suggestion, we updated it to better reflect the licensing status of each resource (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). Therefore, we removed the generic "license" header, which could indeed be misleading, and replaced it with "licensing status", filled with the attributed license type and a hyperlink to its content. It remains challenging since some resources changed their copyright in the meantime. We remain at the editor's and reviewers' disposal for any further improvement.

      Moreover, as stated in the manuscript, we took care of collecting all licenses and contacted the authors of resources whose license was not perfectly explicit to us, therefore accomplishing our due diligence. Additionally, we contacted the legal offices of our university and explained our situation. We did everything we were advised to do.

      1) To the best of our knowledge, the dissemination of the LOTUS initiative data falls under the right to quote for scientific articles, as we do not share the whole information, but only a very small part of it.

      2) We do not redistribute original content. What comes out of LOTUS has undergone several curation and validation steps, adding value to the original data. The 500 random test entries, provided in their original form for the sake of reproducibility and testing, are the only exception.

      Many scientific authors forget about the importance of proper licensing. While restricting use may sometimes be deliberate, an inappropriate license choice (or omission) is too often due to a lack of information on its implications.

      All authors of the utilized resources can freely benefit from our curation. We are sharing with the community the results of our work, while always citing the original reference.

      Concerning the possible evolution of licensing, it remains a real challenge. While we tried to "freeze" the license status at the time we accessed the data, some resources have updated their licensing since then. This can be tracked in the git history of the table (https://github.com/lotusnprod/lotus-processor/blob/main/docs/dataset.csv). Discrepancies between our frozen licensing (at the time of gathering) and the actual license can therefore occur. Initiatives such as https://archive.org/web could help solve this issue, although they come with other legal challenges.

      6. The order of sections of the manuscript results in several duplicated, but not further substantiated explanations. Most importantly, the methods should be much more specific throughout and the results/discussion should more heavily cross-link to it, as a reader who examines the paper from top to bottom will be left with large holes of misunderstanding throughout.

      As our paper focuses heavily on the methods, the boundary between results and methods becomes thinner. We took the reviewers' suggestions into account and added additional cross-links so that the reader can quickly access the related methods.

      7. The work presented was done in a variety of programming languages across a variety of repositories (and even version control systems), making it difficult to give a proper code review. It could be argued that the most popular language in computational science at the moment is Python, with languages like R, Bash, and in some domains, still, Java maintaining relevance. The usage of more esoteric languages (again, with respect to the domain) such as Kotlin hampers the ability for others to deeply understand the work presented. Further, as the authors suggest additional importers may be implemented in the future, this restricts what external authors may be able to contribute.

      Scientific software has indeed always been written in multiple languages. To this day, scientists have used all kinds of languages adapted both to their needs and their knowledge. NumPy uses Fortran libraries, and many projects published in biology and chemistry recently are in Java, R, Python, C#, PHP, Groovy, Scala… We understand that some authors are more comfortable with one language or another. But R syntax is, for example, much more distant from Python's syntax than Kotlin's is. We needed a highly performant language for some parts of the pipeline, and R, Bash, or Python were not sufficient. We decided to use Kotlin as it provides an easier syntax than Java while staying 100% compatible with it.

      The advantage of the way LOTUS is designed is that importers are language-agnostic. As long as the program can produce a file or write to the DB in the accepted format, it can be integrated into the pipeline. This was our goal from the beginning, to have a pipeline that can have its various parts replaced without breaking any of the processes.

      8. As a follow up to the woes of points 4, 5, and 7, the manuscript fails to reflect on the longevity of the LOTUS Initiative. Like many, will the project effectively end upon publication? If not, what institutions will be maintaining it for how long, how actively, and with what funding source? If these things are not clear, it only seems fair to inform the reader and potential user.

      LOTUS is an initiative that aims to improve knowledge management and sharing in natural products research. Our first project, which is the object of the current manuscript, is to provide a free and open resource of natural products occurrences for the scientific community. Its purpose is not to be a database by itself, but instead to provide through Wikidata and associated tools a way to access natural products knowledge. The objective was not to create yet another database (https://doi.org/10.1371/journal.pcbi.1005128), but instead to remove this need and give our community the tools and the power to act on its knowledge. This way, as everything is on Wikidata, the initiative is not “like many”. This also means that this project should not be considered and evaluated exactly like a classical DB. Once the initial curation, harmonization, and dissemination jobs have been done, they should ideally not be run again. The community should switch to Wikidata as a point of access, curation, and addition of data. If viewed with such arguments in mind, yes, LOTUS can live long!

      The Wikimedia Foundation is a public not-for-profit organization whose financial development appears to indicate solid health (https://en.wikipedia.org/wiki/Wikimedia_Foundation#Finances).

      In terms of funding sources, we would like to refer to https://elifesciences.org/articles/52614#sa2 , which stated the following in response to a similar question: "Wikidata is sustained by funding streams that are different from the vast majority of biomedical resources (which are mostly funded by the NIH). Insulation from the 4-5 year funding cycles that are typical of NIH-funded biomedical resources does make Wikidata quite unique." The core of the Wikidata funding streams are donations to the Wikipedia ecosystem. These donations - with a contributor base of millions of donors from almost any country in the world, chipping in at an average order of magnitude of around 10 dollars - are likely to continue as long as that ecosystem is useful to the community of its users. See https://wikimediafoundation.org/about/financial-reports for details.

      9. Overall, there were many opportunities for introspection on the shortcomings of the work (e.g., the stringent validation pipeline could use improvement). Because this work is already quite impactful, I don't think the authors will be opening themselves to unfair criticism by including more thoughtful introspection, at minimum, in the conclusions section.

      We agree with the reviewer and therefore list again the major limitations of our processing pipeline:

      First, our processing pipeline is heavy. It includes many dependencies and requires a lot of time to understand. We are aware of this issue and tried to simplify it as much as possible while keeping what we considered necessary to ensure high data quality. Second, it can sometimes induce errors. Those errors, ranging from unnecessarily discarded correct entries to more problematic ones, can be attributed to various parameters, reflecting the variety of our input. We will therefore try to list them, keeping in mind that the list won't be exhaustive. For each detected issue, we tried to fix it as best we could, knowing this will not lead to an ideal result, but will hopefully increase data quality gradually.

      ● Compounds

      ○ Sanitization (the three steps below are performed automatically, since we observed a higher ratio of incorrect salts, charged or dimerized compounds; however, this also means that true salts, charged or dimeric compounds were erroneously "sanitized".)

      ■ Salt removals

      ■ Charged molecules

      ■ Dimers

      ○ Translation (both processes below are pretty error-prone)

      ■ Name to structure

      ■ Structure to name

      ● Biological organisms

      ○ Synonymy

      ■ Lotus (https://www.wikidata.org/wiki/Q3645698, https://www.wikidata.org/wiki/Q16528).

      This is also one of the reasons why we decided to call the resource Lotus, as it illustrates part of the problem.

      ■ Iris (https://www.wikidata.org/wiki/Q156901, https://www.wikidata.org/wiki/Q2260419)

      ■ Ficus variegata (https://www.wikidata.org/wiki/Q502030, https://www.wikidata.org/wiki/Q5446649)

      ○ External and internal dictionaries are not exhaustive, impacting translation

      ○ Some botanical names we use might not be the accepted ones anymore, because of the tools we use and the pace at which taxonomy is renaming taxa.

      ● References

      ○ The tool we favored, Crossref, returns a hit whatever the input. This generates noise and incorrect translations, which is why our filtering rules focus on reference types (a rough illustration of such type-based filtering is sketched after this list).

      ● Filtering rules:

      ○ Limited validation set, requires manual validation

      ○ Validates some incorrect entries (False positives)

      ○ Does not validate some correct entries (False negatives)

      Again, our processing pipeline removes entries we do not yet know how to process properly.

      Our restrictive filters, together with a substantial upload of structure-organism pairs to Wikidata, should hopefully incentivize the community to contribute by further adding its human-validated data.
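      As an illustration of the reference-type filtering idea mentioned above (a sketch only, not the actual LOTUS implementation), here is a minimal Python example against the public Crossref REST API, which tends to return a best match for almost any free-text input together with a work type and a relevance score; the accepted types and the score threshold below are illustrative assumptions.

      ```python
      import requests

      # Sketch: resolve a free-text reference via Crossref and keep the top hit
      # only if it looks like a citable work type with a reasonable relevance score.
      ACCEPTED_TYPES = {"journal-article", "book-chapter", "proceedings-article"}
      MIN_SCORE = 60  # illustrative threshold, not the LOTUS value

      def resolve_reference(free_text):
          r = requests.get(
              "https://api.crossref.org/works",
              params={"query.bibliographic": free_text, "rows": 1},
              headers={"User-Agent": "lotus-example/0.1 (mailto:someone@example.org)"},
              timeout=30,
          )
          items = r.json()["message"]["items"]
          if not items:
              return None
          hit = items[0]
          # Crossref returns *something* for almost any input, hence the type/score filter.
          if hit.get("type") in ACCEPTED_TYPES and hit.get("score", 0) >= MIN_SCORE:
              return hit.get("DOI")
          return None

      print(resolve_reference("Phytochemical study of flavonoids in Lotus corniculatus"))
      ```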

      We updated the conclusion part of the manuscript accordingly. See https://github.com/lotusnprod/lotus-manuscript/commit/a866a01bad10dfd8b3af90e2f30bb3ae51dd7b9e.

      Reviewer #2 (Public Review):

      Rutz et al. introduce a new open-source database that links natural products structures with the organisms they are present in (structure-organism pairs). LOTUS contains over 700,000 referenced structure-organism pairs, and their web portal (https://lotus.naturalproducts.net/) provides a powerful platform for mining literature for published data on structure-organism pairs. Lotus is built within the computer-readable Wikidata framework, which allows researchers to easily contribute, edit and reuse data within a clear and open CC0 license. In addition to depositing the database into Wikidata, the authors provide many domain-specific resources, including structure-based database searches and taxon-oriented searches.

      Strengths:

      The Lotus database presented in this study represents a cutting-edge resource that has a lot of potential to benefit the scientific community. Lotus contains more data than previous databases and combines multiple resources into a single resource.

      Moreover, they provide many useful tools for mining the data and visualizing it. The authors were thoughtful in thinking about the ways that researchers could/would use this resource and in generating tools to make it easy to use. For example, their inclusion of structure-based searches and multiple taxonomy classification schemes is very useful.

      Overall the authors seem conscientious in designing a resource that is updatable and that can grow as more data become available.

      Weaknesses/Questions:

      1) Overall, I would like to know to what degree LOTUS represents a comprehensive database. LOTUS is clearly the best database to date, but has it reached a point where it is truly comprehensive, and can thus be used for a meta-analysis or as a data source for research questions? Can it truly replace doing a manual literature search/review?

      As highlighted by the reviewer, even if LOTUS might be the most comprehensive natural products occurrence resource at the moment, true or full comprehensiveness of such a resource will always be limited by the data available in the literature, and the community is far from fully describing the metabolome of living beings. We however hope that the LOTUS infrastructure will offer a good place to start this ambitious and systematic description process.

      1) Yes, it can serve as a data source for research questions, as exemplified in the query table.

      2) No, it cannot and must not replace manual literature search. Manual literature search is the best option, but at an enormous cost. If the outcome of such a search can be made available to the whole community (e.g. via Wikidata), its value would be even bigger. However, LOTUS can expedite a decent part of a manual literature search and free up time to complement this search. See our comment to the editors: "To further showcase the possibilities opened by LOTUS, and also answer the remark on the comprehensiveness of our resource, we established an additional query (https://w.wiki/4VGC). This query is comparable to a literature review work, such as https://doi.org/10.1016/j.micres.2021.126708. In seconds, it allows retrieving a table listing compounds reported in given taxa and limits the search by years."

      We added these examples in the manuscript (see https://github.com/lotusnprod/lotus-manuscript/commit/a6ee135b83e56e8e2041d09d7ce2d5b913c1029d)

      2) Data Cleaning & Validation. The manuscript could be improved by adding more details about how and why data were excluded or included in the final upload. Why did only 30% of the initial 2.5 million get uploaded? Was it mostly due to redundant data or does the data mining approach result in lots of missed data?

      The reason for this "low" yield is that we highly favored quality over quantity (as reflected in the F-score we used, with β equal to 0.5, so that more importance is given to precision than to recall). Of course there is redundancy, but the rejected entries are mostly due to a confidence level that is too low according to the rules we developed. The data is not fully discarded, as we keep it for further curation (ideally including the community) before uploading it to Wikidata. We adapted the text accordingly.
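      For reference, the F-score mentioned here is the standard F_β measure, which for β = 0.5 weights precision more heavily than recall (recall counts half as much as precision):

      ```latex
      F_{\beta} = (1 + \beta^{2}) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}},
      \qquad
      F_{0.5} = \frac{1.25 \cdot \mathrm{precision} \cdot \mathrm{recall}}{0.25 \cdot \mathrm{precision} + \mathrm{recall}}
      ```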

      3) Similarly, more information about the accuracy of the data mining is needed. The authors report that the test dataset (420 referenced structure-organism pairs) resulted in 97% true positives; what about false negatives? Also, how do we know that 420 references are sufficiently large to build a model for 2.5M datapoints? Is the training data set sufficiently large to accurately capture the complexities of such a large dataset?

      False negatives are 3%, which is, in our opinion, a fair amount of "loss" given the quality of the data. We actually manually checked 500+ documented pairs, which is more or less the equivalent of a literature review. We were careful in sampling the entries in the right proportions, but we cannot (and did not) state that they are enough. We cannot model it either, since the 2.5M+ points have absolutely different distributions, in terms of databases, quality, etc. The only "hint" is the similar behaviour among all subsets: the 420 + 100 entries were divided between three authors, who obtained similar results.

      4) Data Addition and Evolution: The authors have outlined several mechanisms for how the LOTUS database will evolve in the future. I would like to know if/how their scripts for data mining will be maintained if they will continue to acquire new data for the database. To what extent does the future of LOTUS depend on the larger natural products community being aware of the resource and voluntarily uploading to it? Are there mechanisms in place such as those associated with sequencing data and NCBI?

      The programs have not only been maintained but also updated with new possibilities (for example, the addition of a "manual mode" allowing users to run the LOTUS processing pipeline on a set of their own entries and make them Wikidata-ready; https://github.com/lotusnprod/lotus-processor/commit/f49e4e2b3814766d5497f9380bfe141692f13f23). We will of course do our best to keep maintaining it, but no one in academia can state that he/she will maintain programs forever. However, the LOTUS initiative hopefully embraces a new way of considering database dynamics: if the repository and website of the LOTUS initiative shut down tomorrow, all the work done would still be available to anyone on Wikidata. Of course, future data addition strongly relies on community involvement. We have already started to advocate for the community to take part in it, ideally in the form of direct uploads to Wikidata. At this time, there are no mechanisms in place to push publishing of the pairs on Wikidata (as there are for sequencing or mass spectrometry data), but we will be engaged in pushing this direction forward. The initiative needs stronger involvement of the publishing sector (including reviewers) to help change those habits.

      5) Quality of chemical structure accuracy in the database. I would imagine that one of the largest sources of error in the LOTUS database would be due to variation in the quality of chemical structures available. Are all structure-organism pairs based on fully resolved NMR-based structures, or are they based on mass spectral data with no confirmational information? At what point is a structural annotation accurate enough to be included in the database? More and more metabolomics studies are coming out and many of these contain compound annotations that could be included in the database, but what level (in silico, exact mass database search, or relative to a known standard) is required?

      This is a very interesting point, and some databases have this "tag" (NMR, crystal, etc.). We basically rely on the original published articles included in specialized databases. If poorly reported structures have been accepted for publication, labelled as "identified" (and not "annotated"), and the authors publishing the specialized databases overlooked it, we might end up with such structures.

      Here, the Evidence Ontology (http://obofoundry.org/ontology/eco.html) might be a good direction to look at to further characterize the occurrence links in the LOTUS dataset.

      Reviewer #3 (Public Review):

      Due to missing or incomplete documentation of the LOTUS processes and software, a full review could not be completed.

      Some parts of LOTUS were indeed not sufficiently described and we improved both our documentation and accessibility to external users a lot. We thank the reviewer for insisting on this point as it will surely improve the adoption of our tool by the community.

    2. Evaluation Summary:

      Rutz et al. outline LOTUS, a new open-source database that links natural product structures with the organisms they are present in. It contains over 700,000 referenced structure-organism pairs and search tools that make mining the database intuitive and efficient. The LOTUS Initiative represents an important data harmonization/integration effort over previous databases. The results are distributed to the public through Wikidata, which additionally supports future curation. This new resource is likely to be of great interest to natural product researchers as well as across fields of biology including ecology, evolution, and biochemistry.

      (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

    3. Reviewer #1 (Public Review):

      This manuscript addresses a major issue facing consumers of structure-organism pair data: the landscape of databases is very difficult to navigate due to the way data is made available (many resources do not have structured data dumps) and the way data is standardized (many resources' structured data dumps do not standardize their nomenclature or use stable entity identifiers). The solution presented is a carefully constructed pipeline (see Figure 1) for importing data, harmonizing/cleaning it, automating decisions about exclusions, and reducing redundancy. The results are disseminated through Wikidata to enable downstream consumption via SPARQL and other standard access methods as well as through a bespoke website constructed to address the needs of the natural products community. The supplemental section of the manuscript provides a library of excellent example queries for potential users. The authors suggest that users may be motivated to make improvements through manual curations on Wikidata, through semi-automated and automated interaction with Wikidata mediated by bots, or by addition of importer modules to the LOTUS codebase itself.

      Despite the potential impact of the paper and excellent summary of the current landscape of related tools, it suffers from a few omissions and tangents:

      1. It does not cite specific examples of downstream usages of structure-organism pairs, such as an illustration on how this information in both higher quantity and quality is useful for drug discovery, agriculture, artificial intelligence, etc. These would provide a much more satisfying bookend to both the introduction and conclusion.

      2. The mentions of recently popular buzzwords FAIR and TRUST should be better qualified and be positioned as a motivation for the work, rather than a box to be checked in the modern publishing climate.

      3. The current database landscape really is bad, and the authors should feel emboldened to emphasize this in order to accentuate the value of the work, with more specific examples on some of the unmaintained databases.

      4. While the introduction and supplemental tables provide a thorough review of the existing databases, it eschews an important, more general discussion about data stewardship and maintenance. Many databases in this list have been abandoned immediately following publication, have been discontinued after a single or limited number of updates, or have been decommissioned/taken down. This happens for a variety of reasons, from the maintainer leaving the original institution, from funding ending, from original plans to just publish then move on, etc. The authors should reflect on this and give more context for why this domain is in this situation, and if it is different from others.

      5. Related to data stewardship: the LOTUS Initiative has ingested several databases that are no longer maintained as well as several databases with either no license or a more restrictive license than the CC0 under which LOTUS and Wikidata are distributed. These facts are misrepresented in Supplementary Table 1 (Data Sources List), which links to notes in one of the version-controlled LOTUS repositories that actually describe the license. For example, https://gitlab.com/lotus7/lotus-processor/-/blob/8b60015210ea476350b36a6e734ad6b66f2948bc/docs/licenses/biofacquim.md states that the dataset has no license information. First, the links should state exactly what the licenses are, if available, and explicitly state if no license is available. There should be a meaningful and transparent reflection in the manuscript on whether this is legally and/or scientifically okay to do - especially given that many of these resources are obviously abandoned.

      6. The order of sections of the manuscript results in several duplicated, but not further substantiated explanations. Most importantly, the methods should be much more specific throughout and the results/discussion should more heavily cross-link to it, as a reader who examines the paper from top to bottom will be left with large holes of misunderstanding throughout.

      7. The work presented was done in a variety of programming languages across a variety of repositories (and even version control systems), making it difficult to give a proper code review. It could be argued that the most popular language in computational science at the moment is Python, with languages like R, Bash, and in some domains, still, Java maintaining relevance. The usage of more esoteric languages (again, with respect to the domain) such as Kotlin hampers the ability for others to deeply understand the work presented. Further, as the authors suggest additional importers may be implemented in the future, this restricts what external authors may be able to contribute.

      8. As a follow up to the woes of point 4., 5., and 7., the manuscript fails to reflect on the longevity of the LOTUS Initiative. Like many, will the project *effectively* end upon publication? If not, what institutions will be maintaining it for how long, how actively, and with what funding source? If these things are not clear, it only seems fair to inform the reader and potential user.

      9. Overall, there were many opportunities for introspection on the shortcomings of the work (e.g., the stringent validation pipeline could use improvement). Because this work is already quite impactful, I don't think the authors will be opening themselves to unfair criticism by including more thoughtful introspection, at minimum, in the conclusions section.

      10. Given the competitive nature of building databases and scientific publishing, it remains to be seen whether new database builders will contribute directly to the LOTUS Initiative, but the system the authors described seems to be prepared to support its maintainers to continue to import new databases as long as they are actively working on the project.

      Overall, this manuscript served as an excellent survey of the landscape of the structure-organism databases, the deep ties to natural product databases, and presents an obviously useful resource that will greatly simplify and improve the lives of other scientists who want to use this kind of data. It had a good focus and met the goals that it set in its abstract and introduction, and described the journey quite elegantly.

    4. Reviewer #2 (Public Review):

      Rutz et al. introduce a new open-source database that links natural products structures with the organisms they are present in (structure-organism pairs). LOTUS contains over 700,000 referenced structure-organism pairs, and their web portal (https://lotus.naturalproducts.net/) provides a powerful platform for mining literature for published data on structure-organism pairs. Lotus is built within the computer-readable Wikidata framework, which allows researchers to easily contribute, edit and reuse data within a clear and open CC0 license. In addition to depositing the database into Wikidata, the authors provide many domain-specific resources, including structure-based database searches and taxon-oriented searches.

      Strengths:

      The Lotus database presented in this study represents a cutting-edge resource that has a lot of potential to benefit the scientific community. Lotus contains more data than previous databases and combines multiple resources into a single resource.

      Moreover, they provide many useful tools for mining the data and visualizing it. The authors were thoughtful in thinking about the ways that researchers could/would use this resource and in generating tools to make it easy to use. For example, their inclusion of structure-based searches and multiple taxonomy classification schemes is very useful.

      Overall the authors seem conscientious in designing a resource that is updatable and that can grow as more data become available.

      Weaknesses/Questions:

      1) Overall, I would like to know to what degree LOTUS represents a comprehensive database. LOTUS is clearly the best database to date, but has it reached a point where it is truly comprehensive, and can thus be used for a meta-analysis or as a data source for research questions? Can it truly replace doing a manual literature search/review?

      2) Data Cleaning & Validation. The manuscript could be improved by adding more details about how and why data were excluded or included in the final upload. Why did only 30% of the initial 2.5 million get uploaded? Was it mostly due to redundant data or does the data mining approach result in lots of missed data?

      3) Similarly, more information about the accuracy of the data mining is needed. The authors report that the test dataset (420 referenced structure-organism pairs) resulted in 97% true positives; what about false negatives? Also, how do we know that 420 references are sufficiently large to build a model for 2.5M datapoints? Is the training data set sufficiently large to accurately capture the complexities of such a large dataset?

      4) Data Addition and Evolution: The authors have outlined several mechanisms for how the LOTUS database will evolve in the future. I would like to know if/how their scripts for data mining will be maintained if they will continue to acquire new data for the database. To what extent does the future of LOTUS depend on the larger natural products community being aware of the resource and voluntarily uploading to it? Are there mechanisms in place such as those associated with sequencing data and NCBI?

      5) Quality of chemical structure accuracy in the database. I would imagine that one of the largest sources of error in the LOTUS database would be due to variation in the quality of chemical structures available. Are all structure-organism pairs based on fully resolved NMR-based structures, or are they based on mass spectral data with no confirmational information? At what point is a structural annotation accurate enough to be included in the database? More and more metabolomics studies are coming out and many of these contain compound annotations that could be included in the database, but what level (in silico, exact mass database search, or relative to a known standard) is required?

  5. Dec 2021
    1. The semantic web is, of course, another idea that’s been kicking around forever. In that imagined version of the web, documents encode data structures governed by shared schemas. And those islands of data are linked to form archipelagos that can be traversed not only by people but also by machines. That mostly hasn’t happened because we don’t yet know what those schemas need to be, nor how to create writing tools that enable people to easily express schematized information
      • Semantic Web: a utopia???
      • I have been waiting for it for 20 years, and counting...
      • Instead of "plain text": "triples"; properties and Wikidata Q-identifiers
    1. v0.1.0 Pre-release. The plugin hasn't been thoroughly tested, so please back up your library before trying it (see Readme), just in case.

      • Support making changes to Wikidata
      • Support adding new "cites work" claims to existing entities
      • Support creating new entities for items without a QID
      • Support deleting "cites work" claims when a synced local citation is removed
      • Support logged-in (regular and bot password) and anonymous edits
      • Use an item's title (in addition to DOI and ISBN) to fetch its QID, using Wikidata's reconciliation API. This helps ensure no corresponding entity exists before offering to create a new one.
      • Support sorting citations by authors, date and title, in addition to by index.

      ok

    1. v0.2.2 Support automatic creation of new Wikidata entities (without going through QuickStatements)

      How, then?

    1. Citation metadata support This module adds citation metadata support to Zotero. It provides an additional Citations tab where the user can: add, edit or remove individual citations; run item-wide and citation-specific actions, such as syncing to WikiData, parsing citations from attachments, etc.; edit the source item's UUIDs, such as DOI, WikiData's QID, and OpenCitations Corpus ID.

      Citation module; parsing citations from attachments!

    2. adding citation metadata support, with back and forth communication to WikiData, citation extraction from file attachments, and local citation network visualization

      Wikicite->Cita

    1. diegodlh, January 6, 2021: My proposal to develop a WikiCite plugin for Zotero, adding citation metadata support, back-and-forth syncing to Wikidata, and local citation network visualization, was approved and I started working on it some time ago.

      WikiCite! 2021-01-06

  6. Nov 2021
    1. Zotero can also export in (English) Wikipedia citation template format.

      Usage:
      • Open Zotero's preferences dialogue: on macOS, choose Zotero > Preferences…; on Windows, it's at the bottom of the Edit menu.
      • Select the "Export" tab.
      • Either change the "Default Format" to "Wikipedia Citation Templates"

  7. Oct 2021
    1. Wikidata has quickly become one of the largest and most active collaborative knowledge bases in the world. It is used by Wikipedia, Apple, Google, Library of Congress, and many other organizations. Data from Wikidata is viewed by more than a billion people every month. Knowledge Graphs have developed to become important resources to represent knowledge within an organization. We will present Wikidata, how it is used by different organizations, statistics about its growth and an outlook at the future of Wikidata and knowledge sharing - and how you can use Wikidata in your organization to support your knowledge use cases.

    1. Maintaining software through intentional source-code views

      intentional software

      source code views

    1. extracting entities and relationships from a researcher’s existing documents, or tracking the electronic resources that come into one’s workspace (Dumais et al., 2003) may provide usable fodder for a knowledge-sharing environment

      extracting entities from electronic resources is useful fodder

      Hypothesis + WikiData anchoring connected to one's own(ed) MindGraph

    2. (c) A social network built around users of the concept.

      Beyond-Ontologies-inspired analogues in @TrailMarks: a) Intentional concept structure (not intensional) b) user-defined trails: narrative, propositional, operational, you name it and define them, literally c) A decent(ralized) InterPersonal, InterPlanetary, InterOperable social network of networks, where each individual is a hub to self, built around users of any concept d) Data resources and Notable Entities as in WikiData

  8. Sep 2021
  9. Aug 2021
    1. the term knowledge graph is not suitable for describing all of these applications and should be used more carefully. Several applications need not be called knowledge graphs, because the terms knowledge base and ontology describe them sufficiently and more accurately

      How wrong can you get!

      Clearly, any system that captures the intertwingularity of knowledge, or attempts to impose some ways to make sense of it, is rightly called a knowledge graph.

      Wikis, Semantic Wikis, WikiData, and networked thinking tools with bi-directional links and ways to articulate the meaning of those links are also knowledge graphs: Roam Research, Obsidian, LogSeq, Anagora, etc., not just those, like Kanopy, that use RDF. By today the entire Semantic Web is understood to be a distributed giant global Knowledge Graph. Linked Data and InterPlanetary Linked Data are not necessarily KGs, although you can build KGs using them, and of course in many other ways.

  10. Jul 2021
    1. Wikidata: A large-scale collaborative ontological medical database

      Select some text to add a comment / ask a question =)

  11. Jun 2021
  12. May 2021
  13. Apr 2021
  14. Mar 2021
    1. fluorescent components of the scorpions Centuroides vittatus and Pandinus imperator as beta-carboline

      This pigment has been registered in Wikidata

      https://www.wikidata.org/wiki/Q23719025

      What is this? https://github.com/lmichan/biocolores

    1. supply a fake column of names (for instance using a random value that yields no match when searching for it in Wikidata).

      Why not just use an empty string?

    2. If a unique identifier is supplied in an additional property, then reconciliation candidates will first be searched for using the unique identifier values. If no item matches the unique identifier supplied, then the reconciliation service falls back on regular search.
    3. a unique identifier

      I understand it knows it is a unique identifier, because the Wikidata property says so.
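      A minimal Python sketch of what such a reconciliation request could look like, assuming the public Wikidata reconciliation endpoint at https://wikidata.reconci.link/en/api and the DOI property (P356) as the unique identifier; the payload format follows the OpenRefine reconciliation API, and the DOI value below is a placeholder.

      ```python
      import json
      import requests

      # Sketch (assumed endpoint and payload): reconcile an item by name while also
      # supplying a DOI (P356), so identifier matching is tried before text search.
      ENDPOINT = "https://wikidata.reconci.link/en/api"  # assumed Wikidata reconciliation service

      queries = {
          "q0": {
              "query": "Some example article title",
              "properties": [
                  {"pid": "P356", "v": "10.1234/example-doi"}  # placeholder DOI
              ],
          }
      }

      r = requests.post(ENDPOINT, data={"queries": json.dumps(queries)}, timeout=30)
      for candidate in r.json()["q0"]["result"]:
          print(candidate["id"], candidate["name"], candidate.get("score"))
      ```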

  15. Feb 2021
    1. search and filter on HyLights and full document t
      • contrast:

      everything that is annotated and all comments and dots are indexed internally, but any other indexing/search mechanism can be plugged in too

      A case in point is the WikiData Explorer for anchoring conversations in notable entities, etc.

  16. Jan 2021
    1. upload the citation information

      Sync items' citations with Wikidata. No citations would be downloaded because the entities will have been created just recently.

    2. create a new entry on WikiData

      That is:

      1. fetch QIDs for items selected,
      2. create new entity for items not found: https://github.com/diegodlh/zotero-wikicite/issues/33

      A PID (DOI or ISBN) may be required to make sure no duplicates are created.
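      One way such a duplicate check could work (a sketch only, not necessarily what the plugin does) is a simple SPARQL lookup of the DOI (property P356) on Wikidata before creating a new entity:

      ```python
      import requests

      # Sketch: check whether an item with the given DOI (P356) already exists on
      # Wikidata; if so, reuse its QID instead of creating a duplicate entity.
      def qid_for_doi(doi):
          # DOIs on Wikidata are conventionally stored in upper case
          query = f'SELECT ?item WHERE {{ ?item wdt:P356 "{doi.upper()}" . }} LIMIT 1'
          r = requests.get("https://query.wikidata.org/sparql",
                           params={"query": query, "format": "json"},
                           headers={"User-Agent": "wikicite-example/0.1"})
          bindings = r.json()["results"]["bindings"]
          if bindings:
              return bindings[0]["item"]["value"].rsplit("/", 1)[-1]  # e.g. "Q12345"
          return None  # no existing item found: safe to create a new one

      print(qid_for_doi("10.1234/example-doi"))
      ```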

    1. Another prominent effort launched in 2012 is Wikidata [39], which started as a project at Wikimedia Deutschland funded among others by Google, Yandex, and the Allen Institute for AI. Wikidata is based on a similar idea as Wikipedia, namely, to crowdsource information. However, while Wikipedia is providing encyclopedia-style texts (with human readers as the main consumers), Wikidata is about creating structured data that can be used by programs or in other projects.

      6

    1. Wikidata is expected to not use automatically scraped data.

      But this has changed, hasn't it?

  17. Dec 2020
    1. This is a short demo of a prototype WikiData Explorer and harvester of WikiData.

    1. Wikidata as a FAIR knowledge graph for the life sciences

      Wikidata FAIR knowledge graph life sciences

  18. Nov 2020
    1. Blazegraph

      Blazegraph is a triplestore and graph database, which is used in the Wikidata SPARQL endpoint.

    1. WikiCite 2020: The state of WikiCite

      Live-streamed 26 Oct 2020 on the Wikipedia Weekly channel

      Part of the WikiCite 2020 Virtual Conference: https://meta.wikimedia.org/wiki/WikiC...

      02:09 - Presentation by Daniel Mietchen

      Summary: WikiCite is an initiative to collect bibliographic and citation information, particularly of references cited from Wikimedia projects like Wikipedia, Wikisource or Wikidata. It provides an umbrella for a broad range of activities at the intersection between Wikimedia, libraries and other organizations engaged in scholarly communication or cultural heritage. Over the past few years, many of these activities have involved in-person events - including participation in the Workshop on Open Citations in 2018 - but the ongoing COVID-19 pandemic has changed that, and the WikiCite community is adapting. In this talk, I will provide an overview of WikiCite activities during the last 12-18 months as well as of what is ongoing and planned.

      Bio: Daniel Mietchen is a researcher at the School of Data Science of the University of Virginia. He is interested in integrating open and collaborative research and education workflows with the web, particularly through Wikimedia platforms, to which he actively contributes. Trained as a biophysicist, his research topics range from the subcellular to the organismic level, from biochemical and embryological to geological time scales, from specimens to biodiversity informatics, from data points to data science and to the role of research and education in sustainable development.

      Transcript (auto-generated, captured with the annotation; the recording is linked above): In the talk, Daniel Mietchen first reviews earlier versions of the "state of WikiCite" - the 2019/20 annual report and a short talk at the Workshop on Open Citations and Open Scholarly Metadata in September 2020 - and then gives an overview of WikiCite as a community, a socio-technical platform, a collection of datasets and a series of events, recalling its vision ("the sum of all human citations as linked open data"), mission, 2016 goals and largely unused roadmap. He walks through the main Wikidata curation workflows - topic tagging, author disambiguation, ontology development and subject affiliation - using the Zika virus corpus as a worked example of topic tagging, co-occurring topics and geolocation maps (a query sketch of the co-occurring-topics idea follows below), surveys the tool ecosystem (Scholia, Author Disambiguator, QuickStatements, Mix'n'match, Recoin, the source metadata tool) and the bots active in this space (VanderBot, a large-dataset bot, RefBot, plus a pending proposal for an OpenCitations bot), notes that the number of citation statements in Wikidata has plateaued, and summarizes the 2020 grants and e-scholarships that replaced the usual hackathon, the programme of the virtual conference, and the call for new steering-committee members.

      The Q&A covers whether WikiCite and Wikidata could provide an alternative to commercial scholarly search engines, where topic tags come from (e.g. MeSH and the lack of a cross-disciplinary tagging system), the Author Disambiguator and OpenRefine, the use of WikiCite data for references in Wikipedia articles, the handling of personal data about living people, and coverage of books, newspapers, grey literature and other output types such as Jupyter notebooks.
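
      A hedged sketch of the co-occurring-topics query described in the talk, in Python against the public query service (the label-based lookup of the Zika virus item and the endpoint URL are assumptions; P921 is the "main subject" property):

      import requests

      query = """
      SELECT ?topic ?topicLabel (COUNT(?work) AS ?works) WHERE {
        ?zika rdfs:label "Zika virus"@en .
        ?work wdt:P921 ?zika, ?topic .  # works tagged with Zika virus and another topic
        FILTER(?topic != ?zika)
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      GROUP BY ?topic ?topicLabel
      ORDER BY DESC(?works)
      LIMIT 20
      """

      r = requests.get("https://query.wikidata.org/sparql",
                       params={"query": query, "format": "json"},
                       headers={"User-Agent": "wikicite-cooccurrence-sketch/0.1"})
      r.raise_for_status()
      for row in r.json()["results"]["bindings"]:
          print(row["works"]["value"], row["topicLabel"]["value"])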

    1. Wikidata currently contains 90,660,947 items. 1,303,412,350 edits have been made since the project launch. You can find additional details at wikidata-todo/stats.php and stats.wikimedia.org/v2/#/wikidata.org.
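
      These figures change constantly; a minimal sketch (assuming the standard MediaWiki API and the `requests` library) for fetching the current numbers:

      import requests

      r = requests.get("https://www.wikidata.org/w/api.php",
                       params={"action": "query", "meta": "siteinfo",
                               "siprop": "statistics", "format": "json"},
                       headers={"User-Agent": "wikidata-stats-sketch/0.1"})
      r.raise_for_status()
      stats = r.json()["query"]["statistics"]
      # "articles" roughly corresponds to the number of content pages (items etc.),
      # "edits" to the total number of edits since the project launch.
      print(stats["articles"], stats["edits"])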

  19. Oct 2020
    1. https://etherpad.wikimedia.org/p/WikiCite-2020-Research-output-items

      This pad assists with collaborative note-taking for the "Research output items" session at WikiCite 2020, as per https://meta.wikimedia.org/wiki/WikiCite/2020_Virtual_conference#Research_output_items .

      YouTube link: https://www.youtube.com/watch?v=us_yYR6tAUY

      It is public and does not require a login - just start typing below to note down your observations, questions or comments regarding any of the contributions during the session.


      Daniel Mietchen

      • A quote that struck me -> We are on the verge of moving from "the age of information" to "the age of verification". #wikipedia #wikicite #wikicite2020

      • Diversity

        • of WikiCite 2020 speakers
        • of WikiCite 2020 session formats
        • of WikiCite contributors
        • of languages covered by WikiCite
        • of topics covered by WikiCite
        • ...
      • Research outputs: anyone interested in starting some data modeling for early research outcomes?

        • For instance, I have looked into Jupyter notebooks and how they are used across the Wikimedia ecosystem http://doi.org/10.5281/zenodo.4031806
        • In one of my non-wiki volunteering roles, I am editing a journal that promotes the sharing of research across all stages of the research cycle, e.g. research ideas, grant proposals (funded or not), data management plans, posters, policy briefs etc., plus the usual research and review articles
      • Disaster response & mitigation

        • what can WikiCite do to better prepare for ongoing and future disasters and their information needs?
        • recent workshop: Wikimedia in disaster contexts. http://doi.org/10.5281/zenodo.4105179
      • Scholia as a role model for exposing / exploring / learning & teaching Linked Open Data?
      • WikiCite roadmap & roadblocks

      I wonder what relations there are between the Research Ideas and Outcomes journal https://riojournal.com/ and WikiCite. If there are none, maybe it is an interface to increase the diversity of representation of different research outputs... - I think we are friendly fellow travelers :) but there is no direct overlap I know of.

      With organisation and government publications, they are not hosted in a stable way - often just published online. How does the verifiability issue work if we have lots of broken links? Any thoughts on how to deal with this? - The Internet Archive is auto-archiving links that are put in English Wikipedia via a bot... perhaps this bot's scope could be expanded (but this also works for web publications). - If Australian: the Analysis & Policy Observatory hosts copies of a lot of this grey literature, and we have Wikidata properties to link them directly as references: https://www.wikidata.org/wiki/Property:P7869, https://www.wikidata.org/wiki/Property:P7870 (a query sketch for checking how these are used follows below). - Are APO records in Wikidata now? - I'm not sure what this question means. They have millions of documents. We do not have an index of them, but you could make some items about individual reports if you consider them important. I haven't investigated a bulk upload. Neither property is well populated so far.
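
      A small sketch for checking how widely the two APO properties mentioned above are currently used (the property IDs are taken from the note; the endpoint URL and `requests` dependency are assumptions):

      import requests

      query = """
      SELECT ?prop (COUNT(*) AS ?uses) WHERE {
        VALUES ?prop { wdt:P7869 wdt:P7870 }   # the two APO link properties
        ?item ?prop ?value .
      }
      GROUP BY ?prop
      """

      r = requests.get("https://query.wikidata.org/sparql",
                       params={"query": query, "format": "json"},
                       headers={"User-Agent": "apo-property-usage-sketch/0.1"})
      r.raise_for_status()
      for row in r.json()["results"]["bindings"]:
          print(row["prop"]["value"], row["uses"]["value"])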

      I know that you have an association with a PLOS journal that publishes Wikipedia-style articles that can be copied over to WP. How did you set that up? Did you contact them or did they contact you?

      Tools - I keep getting recommended to use SourceMD but the new version doesn't work and the documentation is - well, I haven't found it! Not to whinge, but how do we make robust workflows that we can share with others when we can't be sure that tools will still be working in the future? (User:DrThneed) (Thank you, it helps to understand the history. DrThneed)


      Margaret Donald

      Margaret's presentation - does Margaret attempt to disambiguate the authors while she works on these citations? OpenRefine seems to be a worthy tool for this; is there a reason she doesn't use this before upload? (User:Ambrosia10 / Siobhan Leachman) - As Margaret seems to, I find the Author Disambiguator even better than OpenRefine for this (especially when I know their research well). But if I had a lot of corroborating columns about the author, then I would try OpenRefine. (99of9) - I've been wondering myself at what point it is worth reconciling authors in OpenRefine rather than adding them as author strings and then having to individually disambiguate with the Author Disambiguator. I suspect that reconciling in OpenRefine might work well when you are dealing with works with only a couple of authors and you are working on a research group or institution where you are familiar with most of the authors. (User:DrThneed)

      Where is Liam writing? I can't see his questions.

      • Me either.
      • Ah, it was a comment he sent in the private chat inside streamyard

      Amanda Lawrence

      Q: What's wrong with the Wikidata definition of "discussion paper" (currently "pre-review academic work published for community comments during open peer review")? - Sorry, I didn't see you'd added the definition. My problem is that a discussion paper is not necessarily pre-review, or academic work, or subject to peer review. Q: Are all of these vocabularies set up as properties in WD? Q: Has the APO term taxonomy changed? The link we had to "climate change" now returns a 404: https://apo.org.au/taxonomy/term/20177 http://web.archive.org/web/20191031173118/https://apo.org.au/taxonomy/term/20177 https://apo.org.au/search-apo/%22climate%20change%22?apo-facets%5B0%5D=subject%3A52376

      Wow! Appalling that the Finch Report is so hard to find. Can Wikimedia cache objects like this? (User:Petermr) - Probably not, because copyright; I assume it's Crown copyright - can't remember what that allows. ("The default licence for most Crown copyright and Crown database right information is the Open Government Licence.") If that applies, a PDF could go on Commons, and it could be transcribed on Wikisource. - Apparently it was a working group of a consortium https://pubmed.ncbi.nlm.nih.gov/24400530/ ; copyright is too hard as there is no clear statement, and the original link to the full report has rotted: http://www.researchinfonet.org/wp-content/uploads/2012/06/Finch-Group-report-FINAL-VERSION.pdf . The executive summary has an "Open Access" graphic, but no licence statement, sigh. Here is the IA version: https://web.archive.org/web/20120725031811/http://www.researchinfonet.org/wp-content/uploads/2012/06/Finch-Group-report-FINAL-VERSION.pdf - it has CC-BY on it but no version. I would upload that to Commons, but expect a question; it was early in OGL days. If it gets deleted, they will upload a local copy at Wikisource. - This has been on Commons since January 2014! https://commons.wikimedia.org/wiki/File:Finch_Group_report.pdf - Troubling that it didn't show up in any of our searches. Now linked to the Wikidata item: https://www.wikidata.org/wiki/Q19028392 And on Wikisource! https://en.wikisource.org/wiki/Accessibility,_sustainability,_excellence:_how_to_expand_access_to_research_publications - Very nice; it needs some linking at the TOC level rather than the whole text in one, and a link to Wikidata. - Yes, and a header template, in which the search-friendly phrase "Finch Report" can be included. It's late here in the UK; I'll do it tomorrow if none of the AU/NZ crowd beat me to it ;-) - How would we go about setting up a working group to work through the issues with grey literature modelling and importing raised by Amanda?

      Set up a sub-page of the WikiCite related project at https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData - then announce it on the main project talk page, social media with #WikiCite hashtag, and on the WikiCite mailing list, etc.

      Thomas Shafee

      WikiJournal links

      https://en.wikiversity.org/wiki/WikiJournal_of_Science/Potential_upcoming_articles https://en.wikiversity.org/wiki/WikiJournal_of_Science/Volume_3_Issue_1 Example: https://doi.org/10.15347/WJM/2020.002 https://www.wikidata.org/wiki/Q96317242

      STARDIT links

      Mapping: https://wikispore.wmflabs.org/wiki/STARDIT/form_mapping

      Example1: https://doi.org/10.21203/rs.3.rs-54058/v1 https://wikispore.wmflabs.org/wiki/STARDIT/Involving_ASPREE-XT_participants_in_co-_design_of_a_future_multi-generational_cohort_study http://www.wikidata.org/wiki/Q98539361

      Example2: https://doi.org/10.21203/rs.3.rs-62242/v1 https://wikispore.wmflabs.org/wiki/STARDIT/Involving_People_Affected_by_a_Rare_Condition_in_Shaping_Future_Genomic_Research https://www.wikidata.org/wiki/Q100403236

      Open Qs:

      For overall structure when is it best to:

      • Enrich the Wikidata item for the activity (e.g. Q98539361)?
      • Create a separate Wikidata item for the report (e.g. Q98539361)?

      For every aspect:

      • How much of the free text can be made structured?
      • What’s the best way to structure each data type?
      • Can some freetext be stored in Wikidata (similar to P1683 quotation)?

      What's the best way to input data for users with no Wikidata experience? PageForms -> local Wikibase in Wikispore -> Wikidata -> back to Wikispore?

      Dark-side links:

      https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Source_MetaData#Predatory_publishers https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Source_MetaData#Retracted_articles

      Example: https://www.wikidata.org/wiki/Q61957492 https://www.wikidata.org/wiki/Q29030043 https://www.wikidata.org/wiki/Q58419365

      Q: Are reviewers ever shy about showing ignorance in their questioning if they know their comments are open? - Some studies suggest it can be harder to get a reviewer to agree, but the reviews themselves appear to be as thorough or more thorough.

      Exciting that you see methodology as metadata. I agree. This is something where instruments, software, reagents are all very well identifiable in running text. Some have RRIDs. Those could be added as properties (User:petermr)

      Thomas, do you think people will put the time into creating the extra data about their work? - Some authors are already keen, but for most, it would have to be a journal requirement. Currently, many require a declaration of author roles (but I've never seen that info structured, just in the acknowledgements section). There's also precedent for methodology from www.cell.com/star-methods, which some chemistry journals require, and which are quite structured and incredibly thorough. Additional contributors are often included in the acknowledgements as free text, so it might not be too difficult to shift people's habits towards filling in a simple dropdown form.


      Toby Hudson

      Wikidata:Entity Explosion https://www.wikidata.org/wiki/Wikidata:Entity_Explosion

There is an arachnophobe!


      Open discussion

      The wishlist/major gaps in WikiCite activity

• WikiCite as a dataset for citations in Wikipedia -- as long as they are formatted for human reading and linking from Wikipedia, and not sending readers around the traps

      Wishing journals would pay attention to their data in Wikidata in the way they do to where they are indexed.

• Is there an option in Wikidata for indicating which indexing services a journal is indexed by?
1. WikiCite 2020: Knowledge organization and bibliographic data

      https://etherpad.wikimedia.org/p/Wissensorganisation_und_Bibliographische_Daten

      Short link to this etherpad: https://w.wiki/iyh


      Scope: Knowledge Organization, identifier mappings, bibliographic data, Zotero Session hosts: Eva Seidlmayer & Jakob Voß

Watch on YouTube: https://www.youtube.com/watch?v=r3LtGMqHrZU Watch on Twitter: https://twitter.com/Wikicite/status/1321028896205066240?s=20

      💡10:00 UTC =11:00 CET Floating classifications - Knowledge Organization Systems in past, present and future Andrea Scharnhorst, Royal Netherlands Academy of Arts and Sciences Session summary: https://meta.wikimedia.org/wiki/WikiCite/2020_Virtual_conference/Floating_classifications_-_Knowledge_Organization_Systems_in_past,_present_and_future https://www.slideshare.net/AndreaScharnhorst/floating-classifications-knowledge-organization-systems-in-past-present-and-future

      MAP https://www.researchgate.net/figure/Design-vs-Emergence-Visualisation-of-Knowledge-Orders-Akdag-et-al-2011b_fig4_273439193

Information is like the "stone in the stone age" - so information is power. What is more powerful: shaping the order/classifications of information, or "owning" information? Or do they go together / are they the same? What would be a democratic way of handling information-power?

Decimal classification: "order all the knowledge of the world" - this is very ambitious ... or arrogant. -> Can any classification claim to last forever? The order of knowledge must always be in evolution! Could an "evolving classification" like Wikipedia last forever? When is a concept completely outdated? Or is everything basically the same?

Question Magnus Sälgö: During yesterday's session it was shown that parliament documents from Sweden are uploaded, and they are starting to look into adding main topics etc. The vision I see is to find all parliament decisions in Europe about AI. My question is: how should we do this? Any standards? We have seen EuroVoc but don't know if it's the way forward. Any good further reading in this area?
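One possible starting point with existing Wikidata properties (a sketch, not an established standard): query for documents whose main subject (P921) is artificial intelligence (Q11660) via the public Wikidata Query Service, and narrow by document type once parliamentary decisions are modelled consistently; the type filter below is left as a commented placeholder.

# Hedged sketch: items whose main subject (P921) is artificial intelligence (Q11660),
# queried from the public Wikidata SPARQL endpoint. The document-type filter is a placeholder.
import requests

QUERY = """
SELECT ?doc ?docLabel WHERE {
  ?doc wdt:P921 wd:Q11660 .        # main subject: artificial intelligence
  # ?doc wdt:P31 wd:QXXXXX .       # placeholder: instance of <parliamentary decision>
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikicite-notes-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["doc"]["value"], row["docLabel"]["value"])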

What about information orders / approaches from other parts of the world? What can we learn about organizing information from Asia or Africa, for example?

💡11:00 UTC =12:00 CET Linking Wikidata with authority data using Cocoda Jakob Voß, Verbundzentrale des GBV

Question: is there a forum where I can ask questions about Cocoda? I am looking into adding occupation codes to Wikidata and maybe this is a good tool https://phabricator.wikimedia.org/T262906

Question: is a presentation like this available in English?

      • We are planning to create a screencast in English specifically about Wikidata mapping.

Question: Who is the main envisioned user/collaborator for Cocoda? Library classification mapping

Question: on what basis did you select which classifications to use? Why not UDC? The summary is open, and we are working on an LD version; I'll send you the details.

• Licence/conversion

Question: once the mappings are in place, what will we be able to do better in Wikidata? In other words, what is your vision? For retrieval, also linking to library content automatically?

Question: Where would indirect mappings best be resolved? E.g. when there is a mapping from BK to RVK and from there to Wikidata, but the direct mapping from BK to Wikidata is missing.

What are the plans for the future?

• We have many ideas for improving/extending our tools, but for now the plan is for our tools and infrastructure to be actively used in the German library system, for example by integrating our mappings into the K10plus catalogue.
• As far as features are concerned, we gladly accept help - everything is openly available on GitHub!

Question: how do you deal with versions? The classifications all change over time - do you record the version anywhere? For the UDC this is a big problem.

• Short answer: we have postponed that problem until later.

💡11:20 UTC=12:20 CET Cataloguing into Wikidata with Zotero (Zotkat) Philipp Zumstein, Universitätsbibliothek Mannheim

• How are duplicate entries in Wikidata avoided? +1 Resolved - this was explained in the talk.
• P1433 - published in - probably needs better data in Zotero as far as the journal itself is concerned? +1 All clear, resolved!
• Information on open access, or a direct link to the document, would be practical. +1 Full text available does not yet mean open access, though - so perhaps also add statements for the copyright, if e.g. rights info is given in Zotero?
• Couldn't the data also be saved directly from Zotero into Wikidata, instead of having to invoke QuickStatements separately?
• Screenshots are often stored in Zotero as well - could those be uploaded along with the record? Or one's own notes? Sure, it is questionable who would be interested in one's own notes, but one could imagine a form/application for that (Q-ID "comments"?)..
• Will the "code snippets" be built into Zotkat so that they are easier to apply?
• That wasn't really visible: are the authors created as author strings, or are the authors linked directly to the authors' Q-IDs? (currently: strings; see the sketch after this list)
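To illustrate the kind of output such a Zotero-to-Wikidata workflow hands to QuickStatements (an illustrative sketch, not Zotkat's actual translator), here is a small Python example that turns a minimal Zotero-style CSL-JSON record into QuickStatements V1 commands, keeping authors as author name strings (P2093) as noted in the last point. All values in the record are made up for illustration.

# Hedged sketch: convert a minimal Zotero-style CSL-JSON record into QuickStatements V1
# commands (tab-separated lines) for review before import. Values are illustrative.
record = {
    "title": "An example scholarly article",
    "DOI": "10.1234/example.5678",              # made-up DOI for illustration
    "issued": {"date-parts": [[2020, 5, 1]]},
    "author": [{"family": "Doe", "given": "Jane"}],
}

def to_quickstatements(rec):
    """Build QuickStatements V1 lines that create a new item for the record."""
    lines = ["CREATE"]
    def add(prop, value):
        lines.append("\t".join(["LAST", prop, value]))
    add("P31", "Q13442814")                                   # instance of: scholarly article
    add("P1476", 'en:"%s"' % rec["title"])                    # title (monolingual text)
    add("P356", '"%s"' % rec["DOI"])                          # DOI (string)
    y, m, d = rec["issued"]["date-parts"][0]
    add("P577", "+%04d-%02d-%02dT00:00:00Z/11" % (y, m, d))   # publication date, day precision
    for a in rec["author"]:
        add("P2093", '"%s %s"' % (a["given"], a["family"]))   # author name string
    return lines

print("\n".join(to_quickstatements(record)))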
1. In this session we will give a practical introduction to Wikidata and to the creation, curation and extraction of structured data about academic works, authors and institutions (part of the WikiCite initiative). You will learn about Wikidata's data model and make your first edits. You will learn how to populate a Wikidata entry from a DOI or ORCID, and how to retrieve and visualize data in SPARQL via the Wikidata Query Service. No prior knowledge of Wikidata or Wikimedia projects is required.
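For the retrieval part, a minimal sketch of the kind of query the Wikidata Query Service answers (the ORCID below is only a placeholder): list works whose author item carries a given ORCID iD (P496), fetched from the public SPARQL endpoint with Python.

# Hedged sketch: works by an author identified via ORCID iD (P496) on Wikidata.
import requests

ORCID = "0000-0002-1825-0097"  # placeholder ORCID for illustration

QUERY = """
SELECT ?work ?workLabel WHERE {
  ?author wdt:P496 "%s" .          # author item carrying this ORCID iD
  ?work wdt:P50 ?author .          # works listing that item as author (P50)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
""" % ORCID

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikicite-notes-example/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["work"]["value"], "-", row["workLabel"]["value"])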

1. Relationship between data and knowledge.

      Waagmeester, A., Stupp, G., Burgstaller-Muehlbacher, S., Good, B. M., Griffith, M., Griffith, O. L., Hanspers, K., Hermjakob, H., Hudson, T. S., Hybiske, K., Keating, S. M., Manske, M., Mayers, M., Mietchen, D., Mitraka, E., Pico, A. R., Putman, T., Riutta, A., Queralt-Rosinach, N., Schriml, L. M., … Su, A. I. (2020). Wikidata as a knowledge graph for the life sciences. eLife, 9, e52614. https://doi.org/10.7554/eLife.52614

    2. Waagmeester, A., Stupp, G., Burgstaller-Muehlbacher, S., Good, B. M., Griffith, M., Griffith, O. L., Hanspers, K., Hermjakob, H., Hudson, T. S., Hybiske, K., Keating, S. M., Manske, M., Mayers, M., Mietchen, D., Mitraka, E., Pico, A. R., Putman, T., Riutta, A., Queralt-Rosinach, N., Schriml, L. M., … Su, A. I. (2020). Wikidata as a knowledge graph for the life sciences. eLife, 9, e52614. https://doi.org/10.7554/eLife.52614

This article shows the importance Wikidata has had in organizing information relevant to genomics, proteomics, chemical compounds and diseases.

    1. Pfundner, A., Schönberg, T., Horn, J., Boyce, R. D., & Samwald, M. (2015). Utilizing the Wikidata system to improve the quality of medical content in Wikipedia in diverse languages: a pilot study. Journal of medical Internet research, 17(5), e110. https://doi.org/10.2196/jmir.4163

- Interesting article on the use of Wikidata to improve medical information in Wikipedia: representing structured data and automating workflows could lead to a significant improvement in the quality of medical information in one of the world's most popular web resources.

  20. Sep 2020
  21. scholia.toolforge.org
1. Scholia is a service that creates visual scholarly profiles for topics, people, organizations, species, chemicals, etc., using bibliographic and other information in Wikidata.

    1. Many organizations assert copyright for any media which they touch, without any consideration of whether the media is eligible for copyright or whether they own the copyright.

      Shouldn't cases like these be taken to trial? Imagine someone forbidding access to a public square under allegation that it belongs to them. Afraid of being prosecuted, people start paying this person to enter the public square. One day someone decides to take the case to court. The court can't simply rule that the person can't continue asking for money to use the square. The person should be punished for having deterred people from freely using the square for so long.

1. Wikidata could be seen as a tertiary (or quaternary) reference, but one can't add it to Wikidata itself.

This should block the possibility of creating a loop where Wikidata points to a source which points back to Wikidata:

Any reference pointing to Wikidata would be at least a quaternary reference (since Wikidata is a tertiary reference). This would make it automatically ineligible as a source for Wikidata, because Wikidata only allows up to tertiary sources.
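A toy model of that argument (the numeric levels and the "at most tertiary" cutoff are the premises stated above, not an official Wikidata rule):

# Toy sketch of the loop-blocking argument: 1 = primary, 2 = secondary, 3 = tertiary.
WIKIDATA_LEVEL = 3          # premise: Wikidata is a tertiary reference
MAX_CITABLE_LEVEL = 3       # premise: only sources up to tertiary may be cited

def level_of(cites_wikidata, own_level=2):
    """A source that cites Wikidata sits at least one level above Wikidata itself."""
    return max(own_level, WIKIDATA_LEVEL + 1) if cites_wikidata else own_level

def citable_in_wikidata(cites_wikidata, own_level=2):
    return level_of(cites_wikidata, own_level) <= MAX_CITABLE_LEVEL

print(citable_in_wikidata(False))  # True: an ordinary secondary source is acceptable
print(citable_in_wikidata(True))   # False: anything citing Wikidata is at least quaternary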

    1. “Information” (a valid value for <stated in>).

      What are other values for it?

    1. Figure 1

It is worth highlighting the difference between the value of the ctp:P6 property and the "Den" (description) of an item, since both Den and ctp:P6 can fulfil the same function - or can they?

Edit: In this case, Den can also be a "standard" ctp:P6.

Edit 2: Worth saying that I have no experience with the semantic web outside of Wikidata, so I don't know whether a description value is something common to all RDF databases.

  22. Jul 2020
1. → Start Cocoda instance optimized for RVK → Start Cocoda instance optimized for Wikidata

A separate sub-item "Instances"

    1. Wikidata as a knowledge graph for the life sciences.

      Wikidata as a knowledge graph for the life sciences.

  23. Jun 2020
1. benefits from the wiki principle: anyone can contribute additions and corrections there.

Perhaps also point to the query here: Saxonica in Wikidata without PPN-ID/K10+_ID: https://w.wiki/UBC

  24. May 2020
1. To create a proper name you have to go to Wikidata

I think this was a particular case in your learning that will be well remembered, because everyone gets a different journey and yours was quite unusual.

  25. Apr 2020
1. The world faces (and will continue to face) viral epidemics which arise suddenly and where scientific/medical knowledge is a critical resource. Despite over 100 billion USD spent on medical research worldwide, much knowledge is behind publisher paywalls and only available to rich universities. Moreover, it is usually badly published and dispersed, without coherent knowledge tools. It particularly disadvantages the Global South. This project aims to use modern tools, especially Wikidata (and Wikipedia), R, Java, and text mining, with semantic tools, to create a modern integrated resource of all currently published information on viruses and their epidemics. It relies on collaboration and gifts of labour and knowledge.
1. The world faces (and will continue to face) viral epidemics which arise suddenly and where scientific/medical knowledge is a critical resource. Despite over 100 billion USD spent on medical research worldwide, much knowledge is behind publisher paywalls and only available to rich universities. Moreover, it is usually badly published and dispersed, without coherent knowledge tools. It particularly disadvantages the Global South. This project aims to use modern tools, especially Wikidata (and Wikipedia), R, Java, and text mining, with semantic tools, to create a modern integrated resource of all currently published information on viruses and their epidemics. It relies on collaboration and gifts of labour and knowledge.
    2. Petermr/openVirus. (n.d.). GitHub. Retrieved April 8, 2020, from https://github.com/petermr/openVirus

  26. Mar 2020
  27. Feb 2020
  28. Jan 2020
  29. Sep 2019
    1. Indexed the proceedings for Hypertext '87, '91, '93, and ECHT '94 conferences. Currently am creating a global index and hypertext for all the SigWeb (formerly SigLink) hypertext conferences since 1987

      We could index it in Wikidata as part of WikiCite.

1. Ludvig Holberg. Portrait painted by Jørgen Roed in 1847 after an original from c. 1752 by Johan Roselius. [Flattened infobox from the Danish Wikipedia article on Ludvig Holberg: personal information, education and career, known works, etc.; information marked with the symbol is drawn from Wikidata, where source references are available.]

Holberg picture

  30. Jul 2019
Notes: [Quoted passage from the Colciencias metadata guidelines (Read the Docs page) defining the Contributor field (dc.contributor / datacite:contributor): sub-properties such as contributorName, nameType, givenName, familyName, affiliation and nameIdentifier; a controlled vocabulary of name identifier schemes (EMAIL, ORCID, ISNI, PUBLONS, ResearcherID, Scopus, IraLIS, VIAF, LCNAF, OCLC FAST, WIKIDATA, others); relations to other fields and metadata models (dc, dcterms, lom, marcxml); XML examples for the oai_dc, DataCite, dim and xoai schemas; and the recommended DSPACE contributor-role vocabulary (advisor, editor, translator, supervisor, ... other).]

I wish the recommendation were to use a specific role and not "other" (to avoid data loss).

  31. Jun 2019
1. Wikidata, described by Vrandečić and Krötzsch in [4], as a means to integrate data from different sources

Wikidata integrates data from different sources


  32. May 2019