- Nov 2020
WikiCite 2020: The state of WikiCite. 403 views • Streamed live on 26 Oct 2020
Wikipedia Weekly, 369 subscribers
Part of the WikiCite 2020 Virtual Conference: https://meta.wikimedia.org/wiki/WikiC...
02:09 - Presentation by Daniel Mietchen
Summary: WikiCite is an initiative to collect bibliographic and citation information, particularly of references cited from Wikimedia projects like Wikipedia, Wikisource or Wikidata. It provides an umbrella for a broad range of activities at the intersection between Wikimedia, libraries and other organizations engaged in scholarly communication or cultural heritage. Over the past few years, many of these activities have involved in-person events - including participation in the Workshop on Open Citations in 2018 - but the ongoing COVID-19 pandemic has changed that, and the WikiCite community is adapting. In this talk, I will provide an overview of WikiCite activities during the last 12-18 months as well as of what is ongoing and planned.
Bio: Daniel Mietchen is a researcher at the School of Data Science of the University of Virginia. He is interested in integrating open and collaborative research and education workflows with the web, particularly through Wikimedia platforms, to which he actively contributes. Trained as a biophysicist, his research topics range from the subcellular to the organismic level, from biochemical and embryological to geological time scales, from specimens to biodiversity informatics, from data points to data science and to the role of research and education in sustainable development.
Transcript
[Music]

Welcome, welcome, and thank you for joining us. This is the beginning of the WikiCite 2020 Virtual Conference. This is the first time we run a virtual edition of this almost annual conference, and it is the fourth WikiCite conference. Today we have a series of sessions by a variety of people on various topics, but we are starting with, let's call it the keynote: the state of WikiCite in 2020. My name is Liam Wyatt and I am the host of the conference in general, but to introduce Daniel and this particular session, I'd like to introduce Jacob and Eva. Welcome, and thank you for hosting this session today.

Yeah, hello and welcome from both of us, the hosts for this session. We are happy to present Daniel Mietchen. To introduce Daniel: he is virtually everywhere in the Wikidata world. Each time I discover a new interesting Wikidata-related project, I am sure he has already seen it or has an opinion on it, and it is a strong, well-founded opinion. That is why it is very interesting to listen to what he is going to talk about. So we welcome Daniel Mietchen. Are you ready? Then the screen is yours.

Yeah, I'm ready.

Okay, we will leave.

So, thank you everyone. I'll just start by noting that the slides are available on Zenodo; I also just tweeted them, so you can follow at your own pace. There is a back channel for comments, so if there is anything you would like to comment on, or if you have a question during the talk, please put it in this Etherpad at w.wiki/iGK, with the G and K in capitals. From the Etherpad we will then also take the Q&A section. I'll try to talk for 45 minutes, and then we have about 15 minutes for questions, but this is flexible and I can be interrupted; that's fine.

So, the structure of the talk: I'll say hello, then we will look at some previous versions of "The state of WikiCite 2020", because you might have seen that the title says version 3.
I'll give an overview of WikiCite in general and then of WikiCite developments since the last official WikiCite meeting, in 2018 in Berkeley. In this overview I'll look at the content, at some of the infrastructure, at grants and fellowships, and also at events, including further events that are part of this WikiCite conference. There will be mechanisms for you to get involved, and then there will be Q&A.

Okay: hello, WikiCite. I was introduced already; this slide is more or less just for people who stumble upon this randomly on the internet when they watch the YouTube recording.

So now let's look at WikiCite 2020 and the previous versions. One important previous version is the annual report. As Liam already pointed out, we have roughly annual events, which also means we have roughly annual reports. The latest one is for 2019-20. I'm not going to talk about it too much, but I encourage you to take a look at it. In any case, here is a brief summary. The WikiCite annual report 2019-2020 details some events that were planned and some that were actually held. For this period we were planning to go decentralized: we wanted to have a more multilingual and more diverse set of events taking place, and that involved thoughts about going virtual from the beginning. We had two events actually taking place, one in Australia and one in Czechia, but then the COVID-19 pandemic changed many of those plans. The event in Czechia had to be cut short by a day, events planned in Ireland and Finland had to be modified, and events in Ghana, Germany and India had to be cancelled. I was actually meant to give the talk that I'm giving now at the event in Germany. Events in Haiti and Ghana had to be postponed; I'm glad to report that by now the one in Haiti has actually taken place. The pandemic has also triggered, for instance, WikiProject COVID-19, which has a strong WikiCite component. Also, the ecosystem around WikiCite is evolving in terms of the
Wikimedia strategy and the Wikidata and Wikibase strategy. Everything that's underlined here in my slides is linked, so I encourage you to explore those links.

The second version of "The state of WikiCite 2020" was a talk that I gave at the Workshop on Open Citations and Open Scholarly Metadata 2020 at the beginning of September. Back then I didn't know that I would be giving this talk here, and I wasn't sure whether there would be any kind of WikiCite overview talk this year, which is why I gave it that title back in September. On the other hand, I had only very little time, so I couldn't go into details. I'm reusing some of the slides from there, but the structure of this talk is different. The key takeaway from that talk is: despite the pandemic, WikiCite is moving forward, but we're actually not so sure about the citation part. That talk is online as well, including a video recording.

So now we're on to version three. Let's start with an overview of WikiCite. I assume that some people here don't really know what it is about, so I hope to add a little bit of clarity and reduce a bit of the confusion. WikiCite is a community. It's a socio-technical platform that connects this community with an ecosystem of projects, tools and other activities. It can also be seen as a collection of data sets, a series of events and a number of other things. Let's zoom into some of those.

The community: if I go by the 2016 conference, it outlined the vision to create the sum of all human citations as linked open data. This is inspired by the Wikimedia vision to create the sum of all human knowledge. From that we derived the mission, which is still up on the WikiCite page on Meta-Wiki: to develop open citations and linked bibliographic data to serve free knowledge using Wikimedia platforms, and especially Wikidata. That vision and mission have led us to consider a number of goals. I'm again quoting the 2016 version of them: lay
the foundations for building a repository of all Wikimedia references as structured data in Wikidata; that's one goal. The other one is to design data models and technology to improve the coverage, quality, standards compliance and machine readability of citations across Wikimedia projects. That's what the plan was in 2016, and somewhere between these two we are actually still active. It's just that by now we've zoomed in on those things in so many different directions, some of which I'll try to outline.

The next aspect of that community is that it has a roadmap that offers essentially four roads to travel, but that roadmap is not really used much. Everybody travels on their own and finds their own way. The decisions that the roadmap is about are still looming: at some point they need to be taken, or they will be taken by someone or something. So it's probably worth it for the community to keep this in mind and maybe have some more active discussions about it.

Next, I mentioned it's also a technological platform. It has a home base on Meta-Wiki under the "WikiCite" handle, and from there you can find the program for this week's conference but also lots of other documentation about WikiCite in general. Its home base on Wikidata is WikiProject Source MetaData, which was actually the nucleus from which WikiCite has grown. "WikiProject Source MetaData" is a bit of a mouthful, and it was specific to Wikidata, so by now we've come to use the term WikiCite to signal that it's not just for Wikidata; it's really on Meta because it concerns all Wikimedia projects. But the origin of this is that we wanted to care for source metadata, and that included citations. It's important to keep in mind that Wikidata is about one-third just citations and scholarly publications, so about one-third of the
content of Wikidata is created by the WikiCite community. So the problems of Wikidata reflect on WikiCite and vice versa. And then, finally, yes, WikiCite is a series of events: we were in Berlin in 2016, in Vienna in 2017 and in Berkeley in 2018, and now we're trying to be a bit more global this year.

I mentioned Wikidata a number of times. Most of you will know what it is, but I suspect some will not, so here is an attempt: like Wikipedia, which is the free encyclopedia that anyone can edit, Wikidata is the free knowledge base that anyone can edit and use. Of course, if that doesn't clarify it for you, then I recommend looking at the seven-minute intro video on YouTube that I've linked.

A few more numbers about Wikidata from the perspective of WikiCite. Wikidata as a whole has 11 billion statements that come in triple form: subject, predicate, object. The most interesting ones for us are, for instance: citations, 170 million; scholarly articles, 36 million; items that have a digital object identifier (publications or data sets), 26 million; ORCIDs, which are identifiers for researchers, one and a half million. And then there is the curation work that's going on: links from the publications to the people who wrote them, roughly 20 million; links from people to the institutions where they were educated, one and a half million; and from people to where they're working, about a million. Since we're talking linked data and open data, all of these data can be mixed and matched in different fashions, reused, and can enrich each other.

Now let's talk a little bit about the curation workflows employed by the WikiCite community. I mention four of them on the slide, but due to time I'll go into only three of them. One is topic tagging, which allows us to address questions like "give me all the papers about SARS-CoV-2" or any
other specific topic. Another one is author disambiguation, which allows us to address questions like "give me all the papers by a given person". Another one is ontology development: for instance, if you want to get all papers about mathematics education and you want to include different aspects of mathematics education, then you need to map those aspects, like calculus, algebra or geometry education, to math education. That's called ontology development: basically, linking topics and subtopics. And the last one is subject affiliation. If you want to get an overview of the publications from a given institution, and many institutions want that overview, then you need to record the link between the person and the institution and the link between the person and the publications. Through linked data we can then harvest that and make the link from the publications to the institutions.

Okay, an example of topic tagging. Here I chose a query that gives me terms that have "Bologna" in them, from the titles of publications. From that I can see that there are different kinds of Bologna being talked about: Bologna in general; Bologna in Italy, where there is the province of Bologna, the city of Bologna and the University of Bologna; "Bologna experience", which refers to a number of things; then there are some clinics in Bologna, Bologna sausage and so on. So disambiguation is very important. Once we have disambiguated things, because we have rather precise topics in Wikidata that all have their own identifier, we can clearly distinguish the literature about the sausage from the literature about the university, the clinic, the province and the city.

The most popular tags for topics of publications as of August 2020 were these ones here. They are mostly medical, which just shows some bias that we have. That's partly because
in medicine and biomedicine we have databases that are more easily harvestable and can be more easily integrated into Wikidata than in other fields. But it's not only biomedicine: for instance, we see India, nanoparticle and statistics popping up here, so it's a bit more complex than that.

Here I'm bringing up the Zika virus as an example because it has served as an example since 2016. It's kind of the guinea pig, the testing ground for WikiCite workflows. This is in part because WikiCite developed roughly in parallel to the Zika virus epidemic in 2015-16, but also because this Zika virus corpus or data set has been used in a number of contexts, and it also helped address the ongoing COVID-19 pandemic. It's a model for disaster response in general, not just for outbreaks. It's a relatively complete data set for a given pathogen. It has a data model that is enriched as the epidemic progresses: as the Zika virus came around and interest increased, the data model on Wikidata was refined, and we could then express things at a more granular level. It has also triggered a number of tool developments, and it's used in technical tests and in education. Here are a few more links on this. I'll get back to COVID-19 later on.

Once we have topic tags, we can do things like get the list of the most recently published works on the topic. Of course these dates are not very relevant anymore; here we're talking about the principle. We can also look at which topics co-occur with a certain topic. If the topic is Zika virus, then we see here, for instance, that the topics of congenital Zika virus infection, Zika fever, pregnancy and microcephaly are very prominent. These are all topics that co-occur, that is, that are discussed together with the topic of the Zika virus. That is a way to browse things: if you don't know much about the topic, you just go there, you start from the Zika virus
and then you can explore all those other things that are linked to it.

What we can also do is link the discussions of the topics with certain locations, because some of those topics will have a geolocation. Among those most popular topics we had India as one example, and India has a geolocation. Some of those co-occurring topics will be about things like the Zika virus in India, in French Polynesia, in Brazil or something like this. We can pull this out of the data by mapping the papers to the topics and then the topics to their geolocation, if they are entities that have a geolocation.

I want to contrast this with this map, which shows the locations of the authors of papers on the topic. In the previous one we had the locations of topics that co-occur with the Zika virus, and here we have the locations of the authors of papers about the Zika virus. These are different things. We can plot them because we can pull that out of the data that is curated in Wikidata through the WikiCite community and tool chains. That allows us to explore certain gaps. For instance, here in Africa there's not much happening according to this map. That might mean that there is not much research on the Zika virus happening in Africa or in much of northern Asia, or simply that there is not a lot of population there anyway, or that we just haven't curated the data about the publications, the authors, their affiliations or their geolocations very comprehensively. These are the kinds of things that the community then has to drill down on in order to find out and to improve the data set.

Now on to author disambiguation. The basic strategy is: if there is an ORCID, which is meant to help with author disambiguation, and the ORCID has public data that is actually usable for Wikidata, then we try to use it. We have some tooling for this. There were some problems with that, so it doesn't really work well at the moment.
That's also a discussion we need to have. Also, the ORCID data is not always of the best quality, but at least it is there. It's in principle a standard that the research community has agreed to support, so if you want to map research, it's something we should take into account. The disambiguation on Wikidata can be done by anyone, just like any other kind of activity on the platform, and just like on Wikipedia we have a number of mechanisms for quality control. We have some tools that are dedicated to author disambiguation, and we encourage institutions to disambiguate the people that are affiliated with them, because they know best, or the people themselves, who of course also know. An ordinary Wikidata volunteer might not really know the start time and end time of when some professor was affiliated with a certain research institution. If we then consider how many publications there are, many of which have many authors, this becomes a humongous challenge. But thinking about this shouldn't occupy our thoughts too much, as long as we find some mechanisms to prioritize properly.

Here is one of the tools in action. It's called the Author Disambiguator, and it can be used to convert author name strings into identified authors. The author here, Rosa Prato, already has an identifier on Wikidata that was associated with nine items at the moment this screenshot was taken. Then here we have 83 publications that list the name Rosa Prato as an author name string. The question is to what extent the publications that show this author name string actually correspond to publications written by the person identified here. The tool basically asks this question, helps with the grouping and makes suggestions. That's one of the mechanisms by which author disambiguation can happen.

Subject affiliation: in order to find out
the affiliation for a given person, we can search in Wikidata, or we can search elsewhere by university or by organization. We can get information such as the faculty and their publications, the research sites and clinical trials these people have been involved in, or events such as conferences and locations where these people have interacted or shown up. If we do affiliation tagging, then we can do things similar to what I showed for the Zika map before. That allows us, for instance, to look into areas where a certain topic is well represented. The data here is for Italy: you see that there is a certain gradient in terms of publications that have Italy as a subject. In Italy there are a lot of papers about Italy, and the further away you go, the fewer papers there are. The mapping here also serves as one mechanism by which we can do quality control.

One example I would like to point out is Vanderbilt University in the United States, which does this curation of affiliations at a very systematic level. They have figured out who is currently affiliated with the university, they've gotten all their ORCID identifiers, and they have fed that information into Wikidata. That allows them, and everybody else, to browse the information that is available about publications by Vanderbilt University people: mostly staff and faculty, but occasionally also students. What they are still working on is linking these people to the publications, and that is also part of the community workflows, so they really try to integrate their efforts with the community workflows. I put in some links here: first, a news item about this, then the bots that made all this possible, and similar projects at Indiana University and at a university in the Canary Islands. There's also a similar countrywide effort for the Netherlands.

Some other things that
we haven't talked about too much in previous WikiCite events are clinical trials. These are also things that you can cite. They produce citable things, they are about certain topics, and they have authors, principal investigators and these kinds of things. So these are entities that are part of the curation workflow. We're now relatively complete with respect to clinical trials that are registered in the United States, and we're trying to increase the coverage of clinical trials that are registered in other places.

Here is an attempt at representing the Wikidata ecosystem, which I described as a socio-technical system in one of my opening slides. We have a layer of users that use a number of tools to interact with the basic infrastructure, Wikidata and Wikibase. Some of those users are automated, and most of them are humans. Some of the tools are used more or less for reading, others for writing and others for mixed purposes. Histropedia allows you to visualize timelines. Scholia allows you to visualize connections that are roughly related to scholarly literature or to research contexts in general. QuickStatements is a tool to edit Wikidata. Mix'n'match is a tool to edit Wikidata as well and to map things from Wikidata to external databases. Recoin helps with quality control.

Another way to look at these tools is that they facilitate browsing, like this Wikidata front end that is specific to software. They allow you to edit Wikidata in various ways: here, if you have an identifier for a publication, you can put it into the Source MetaData tool, which will check whether that publication is already indexed and, if not, help you set up the corresponding item. And a large set of tools is there to check consistency, data quality and these kinds of things. For instance, for statements about symptoms we have a
requirement that whatever is stated as a value should be supported by a reference, and if that is not the case, a warning is displayed. There's an overview of all the tools on this page. For the Scholia tool specifically we also have a category on Commons, and there are similar categories for some of the other tools. I don't have time for that now; there will be a dedicated session on Scholia later today.

Now back to the citation questions. Here is a graph that is available on this website, which is operated by one of the organizers of this session. It shows two different developments until somewhere in the middle of this year. One of them is the number of publication items, which goes up more or less in jumps, but roughly steadily. The few occasions where it goes down are actually acts of active curation, where someone noticed, for instance, that we already have this publication: one item was started based on the digital object identifier and another based on the PubMed identifier, and they are actually about the same thing, so we can connect them. These acts of curation are sometimes visible in those stats.

What we also have here is the number of citations. I mentioned the number of 170-something million at the beginning, and here we see that it has basically plateaued for about a year and then went down; this again is an act of curation. The problem, or at least something to discuss, is the future trajectory here. We don't currently have that data, but it's clear that it's not going up very strongly, and the question is whether it should, or whether that is something that should take place somewhere else.

There are a number of bots that are active in this space. VanderBot underlies the Vanderbilt project that I outlined before. LargeDatasetBot essentially creates publication items: it checks whether items for
certain publications are already indexed in Wikidata, and if not, it creates those items. RefBot is perhaps closest to the original goal outlined in 2016, that we want to support every statement in Wikimedia projects with suitable references. What RefBot does is pick certain statements in Wikidata that do not have a reference yet, search the literature for places where this statement might have been made, and then add relevant literature as a reference to support the statement. For the statement that local anesthesia is a subclass of anesthesia, we now have two sources that were added by RefBot. There is also a proposal for an OpenCitations bot, but this is pending. Technically it's doable but still not done, and socially we don't really know yet whether we actually want to do it in Wikidata or not. That's part of the discussion that events like this want to facilitate.

It's also important to think a little bit about the event history, because we're beginning another event right now, and I want us to come into the mindset for this kind of event. It's different now because it's all virtual, but let's think about it nonetheless. In the previous three conferences we typically had a three-day design, where we tried to combine monologues, like the one I'm doing right now, with more dialogue-oriented sessions and then with a hackathon. We have elements of all three of these in the current program and in the other WikiCite activities; I'll outline some of those. For this week's virtual conference, I would have liked to include a screenshot of the Scholia page, but it wasn't detailed enough, so I'm just encouraging you to take a look and help curate that part. I also noticed that for WikiCite 2016 we don't have the information very well curated yet.

So let's focus on the WikiCite 2020 events. We're in this current session from 10 to 11 UTC, and you see a number of other sessions coming up today. I guess if you are here in this talk you already know roughly about this, but I really encourage you to use some of the breaks to go through the entire schedule, because there are so many different things. They differ not just by location but also by language and by topic focus: here we have Swedish parliamentary documents, and in Indonesia, for instance, they're discussing palm leaf documents. They also differ in terms of technical aspects: here we have a hands-on introduction to how you can actually edit Wikidata, here is curation of author items, here is the front end of WikiCite, here are things that are specific to genetics, and so on. There are lots of such sessions, and the aim of doing them is to foster collaboration. Since all of this is open, everybody can interact with all of these projects, and that's one of the mechanisms by which this WikiCite community thrives. We normally had pushes in activity right after the WikiCite events, and hopefully this will happen for this virtual one as well.

In terms of other activities, I would also like to mention that we have set up a mechanism to award some grants and e-scholarships related to WikiCite events. They are somewhat of an adaptation to the COVID times. As I mentioned, we would normally have had a hackathon in the three-day conference setting, but the normal way of doing a hackathon means bringing multiple people into the same room, and that's not possible these days. So we thought about a mechanism by which we could give people some time to work on things, and that's the e-scholarships approach. The things that they could have done during a three- or four-day hackathon are now part of those
e-scholarships. We also have a number of grants for somewhat larger-scale projects that are still doable in a matter of a few weeks to months, and here is a blog post that details all of them. Again, one of them is about the palm leaves from Indonesia, but there are 23 of them, and they're much more diverse than the projects that we discussed at previous WikiCite meetings.

In terms of coordination: WikiCite is driven by volunteers in all of its aspects, including the organization of this session and this conference and all of the content work and tool development. But some coordination needs to happen, and in typical wiki style much of it happens in a variety of channels. Still, we have an organizing committee, or steering committee, and this is basically the same team that has been active over the last few years. We had a few members change, but we're constantly looking for new members. We want to facilitate WikiCite activities in different locations, in different languages, on different topics and in different technical contexts, so if any of that resonates with you, please get in touch and consider joining the steering committee.

That's now the first thank-you slide. I think it's important to say thank you. One of the things I like best about the Wikimedia ecosystem is that we have a thank-you button on all the different platforms, or on most of them, and that is a very interesting aspect of forming a community. It's not visible to anyone other than the person being thanked, at least by default, so it is not normally gamed, and it's a very nice thing. I just want to thank everyone. I don't like that phrase; still I use it, and so I'm also spelling it out. I want to thank the providers of open infrastructure, not just the Wikimedia open infrastructure but open infrastructure in general, including the one that we interact with; the providers of this presentation
template from SlidesCarnival; and the Wikimedia, Wikidata and WikiCite communities. I would especially like to thank the Scholia team, because that's the team I'm most closely interacting with on a daily basis. I would like to thank the Alfred P. Sloan Foundation, which sponsors the WikiCite events, and I would like to thank the organizers of WikiCite events: if you're organizing any session here at this conference or somewhere else, if you did that in the past, or if you plan to in the future, I want to thank you for thinking about it and especially, if you've done it, for doing it. And if you've made it until here, I want to thank you for paying some attention to these topics. Now I'm looking forward to the discussion; let's see what the Etherpad says. If you want to contact me, here are some contact details. And now I'm trying to hop into the pad.

Yeah, thank you very much, Daniel. You are well in time, so we have some questions; let's look at the Etherpad. The first is: could WikiCite provide an alternative to commercial bibliographic search engines? What do you think?

Okay, it would be helpful if, when you mention a question, you also tell me the line number. Peter: eventually, yes, I hope so. But currently it's an alternative only for very specific things, like if you want to know something about the Zika virus. I would guess that for certain things the Zika corpus on Wikidata is annotated in a better or more detailed fashion than the standard that you get from the commercial search engines. But in general they are still better for many purposes, in part because they have money to put into this, while WikiCite is largely driven by volunteers who do it in their spare time. What we do have is a community; they don't have that. The question then is what the niche is for each of them. Since we're entirely open, anyone is entitled to benefit from our curation efforts. That
includes the commercial providers of this, and so one of the main purposes of WikiCite is actually to increase the quality of the information that is available through such search engines. If anyone has access to the Wikidata version of it, or the WikiCite version of it, then whatever those commercial offerings are, they should be better. So that's a short answer, but the question is deep and we could have an entire discussion around it.

So it in any case depends on the topic.

Yes, it depends on the topic. For a number of topics, especially in the humanities, Wikidata doesn't have a lot of information at the moment. This is due to several structural and community biases, and we're working on them. We're aware of some of those biases, but still they are there, and of course the search engines have their own biases as well. I hope that, like, half a year from now the palm leaves collection from Indonesia, for instance, will be in Wikidata, and I don't see them showing up in any of the commercial search engines anytime soon. But maybe they pick it up from Wikidata; that would also be progress, and that would make the information more widely available.

Okay, so next question, a question of mine about topic tagging: where do these topic tags come from? It's not as easy as with the author or the date of publication, because topics are a bit less hard facts. What do you think about this?

Yeah, that's actually something I'm thinking about quite often. I think the current workflows that we have are not very mature. There are lots of workflows, typically in library contexts, where topic tagging is already happening. For instance, the database PubMed for biomedicine has an entire system called the MeSH terms, Medical Subject Headings, which uses basically human readers: humans reading a paper then decide which terms to associate with that paper, and they choose those terms from a defined controlled vocabulary
called the MeSH terms. That system exists, but we haven't found mechanisms to leverage that for Wikidata, in part because Wikidata is cross-disciplinary, and so even if we had this working for PubMed, this wouldn't help us much in terms of covering publications in history or astronomy or elsewhere. So there is no system that is cross-disciplinary, which means that there is actually an opportunity for Wikidata to become the first cross-disciplinary platform to have a consistent mechanism of tagging. Right now we don't have it. What we do have is often just based on inferences from the title, which might run into problems when the tagging is based on ambiguous terms, or when the tagging is performed by someone who doesn't know the subject very well, or something like this. All of this happens; all of this also happens to me. But the good thing about this is, since we're curating into an open platform, everyone can check what has been done, everyone can point out problems with that, some of those checks are even automatic, and then we can work together to reduce the number of false positives, false tags. And we can also think about what the best granularity of the tags is. If we for instance go back to the example of the MeSH terms, they come with their own hierarchy, and some of that hierarchy is already reflected in Wikidata. So if a paper is associated with a number of MeSH terms that are in a hierarchy, then the question is: should we take all of them, or should we just take the most specific ones? That's the kind of question, and for other topic tagging contexts, let's say, the workflows might differ again.

Okay, thanks. Just a short interruption, as I'm here for the technical side: can you please close those pop-up tabs? Yes, thank you.

Yeah, I was going to do it when I had a minute, but since I was talking all the time it took me a while.

Okay, yeah, next one. Shall we just go by the order, or do you have a selection? We have time. Yeah. Could
help the authorities in big... So how does the authority...

Yeah, so the Author Disambiguator is a tool that was specifically designed for assisting with the conversion of the strings that we get in terms of authorship from various databases, so the conversion from those strings into Wikidata identifiers. A typical example is Jane Smith or something like this, there might be many of them, or Xin Huang or something like this: many people have this name, or are using this name or this string in publications, and then in Wikidata we have to figure out which one is which, which person is the author of that paper, and so on. This tool, which I briefly demoed, helps with that. In principle it could be adapted to do all sorts of string disambiguation, but that hasn't happened, and maybe it will not happen in the framework of this tool, because this tool has this particular purpose. But it's open source: everybody is welcome to contribute, everybody is welcome to fork it and to develop it in another direction. This tool is actually itself a fork of an earlier version of the tool. You could imagine lots of other string disambiguation being useful. I think the ticket that I have in mind here is to use it on things like policy documents. There are lots of options, but if you want it to work well, and for authors it works really well at the moment, then it needs some development, so it's not just adding a line of code in order to make it, let's say, cover other kinds of disambiguation as well.

Yeah, also, oh, I see Arthur has already replied to that. That's nice.

Yeah, I can only read so much. That's one of the purposes of having this Etherpad. It's very nice that someone posts a question and someone responds while I'm talking; that's fine, that makes the discussion more efficient. Maybe then you can guide me a little bit on the things I should talk about.

I see, yeah, I think there was a question
about OpenRefine, but we will hear about this later in the event, so, further down.

Yeah, OpenRefine. I wanted to mention it at several points in the slides, but for some reason I haven't done this. This is certainly an omission; this was not intended. OpenRefine is one of those tools, and it can, yeah, basically refine: it helps reduce the noise in the data if you want to map strings to items, and it's very powerful. For certain workflows it has become more or less the de facto WikiCite tool, and for other contexts it has not been explored too much. But there are dedicated sessions on this later.

Okay, so that was another good question. So, where on Wikipedia is the Wikidata, WikiCite data used? So the idea is, we have all the bibliographic data in Wikidata and it's automatically shown in Wikipedia. Does it happen?

I think there have been tests, demos and explorations in various places, I think in Russian and maybe even in French and Catalan, Basque, something like this, but I'm not aware of anything doing this systematically, for a number of reasons. Briefly sketching them out: first, not necessarily all the references cited in any Wikipedia are already indexed in Wikidata. That's one thing. Maybe we're coming close, or we have a certain degree of completeness, for scholarly references that have a digital object identifier, but not for anything else, newspaper articles for instance, and things like that. So if you want to use Wikidata for running these, basically, citation sections in your Wikipedia articles, then it would be nice to have higher coverage, which is one of the original motivations for doing WikiCite. But that would require WikiCite slash Wikidata to develop better or more robust data models and tooling around those other kinds of things, for instance around newspapers. There are efforts in this direction, but they're not
necessarily as mature, so that a Wikipedia could use them at scale as a default. I think Wikidata is ready to be used for experimentation, let's say for a certain topic, anything related to Zika for instance. In those areas I would actually love to see those experiments. Let's try to replace all the references in Zika-related articles. Most Wikipedias have just one or a few of those articles, so let's start with those articles and then see what the problems are. Some books will pop up, and policy documents will pop up, and then we solve those problems, and then we can think about scaling this up.

So go on, Wikipedians, it's a wiki, try it out.

Yeah. Above there was a question about living persons: is there some procedure to prevent personal information from being added about people who do not want it?

Can you tell me the line number?

30... well, 35.

35, okay. Okay, so someone already posted the comment. I kind of jumped over this question, as I had seen that Arthur had responded to it, and someone said this is not really answered yet. Okay, so I'll briefly read it. So yeah, some people do not want their data to be exposed in Wikidata, and in certain jurisdictions there are legal protections for this kind of personal data, and Wikimedia projects in general have some policy around how to handle personal data. In general the situation is such that if someone does not want their information to appear in Wikidata or Wikipedia, then that's a strong incentive to actually remove it, unless there is a public interest in keeping it public. If there is such an interest in the particular fact that the person wants removed, then it will likely stay, but otherwise it will likely be removed, and if that's not the case, then there are mechanisms to [Music] address that. But once the information was in the system, it cannot easily be deleted; by default it can just be removed from the current version. Really removing it from the
system is a bit more effort and involves administrators, but that can also be done and has been done on occasion.

Yeah, so it's complicated.

Yeah, it's complicated.

Yeah, so last chance for questions, please. So we have another question here in the Etherpad, line 49, about newspapers, information about newspapers. I would like to extend the question to information about books, because you mainly talked about scientific articles, and there are other publications too. What about them? I know that in your statistics you are compiling a list of, like, hundreds of different types of publications, including patents and poems and things like that.

Yeah, the general answer to this is that there have been relatively many efforts in the journal article space, not even the journals, just the journal articles. There have also been lots of efforts in the book space, but they haven't translated into tools too much. We have one good tool for books, which is Inventaire, which I also didn't mention, but that's not because I don't like it, it's just because I had to kind of prioritize here. And also that tool, Inventaire, already has to make some compromises in terms of the data model, so it doesn't distinguish too much between an edition of a work and the work in general. So we have the data model roughly figured out for books, but we don't have the workflows to pull that information from elsewhere and put it into Wikidata, for instance. That's in part because the resources that would have the data are not public domain, so we can't feed on them directly, in part because nobody has written those tools, and in part because the data model isn't really fully adapted to the particular use case you might be interested in. I'm aware of a number of activities in the newspaper space; still, nothing that is large scale or transferable between languages or countries, for instance, or something like this. So
yeah, I see there is an Australian newspaper project. I was somewhat involved in a US newspaper project, for instance. But still, it's not as homogeneous as we have it for scholarly articles, because that is based on the digital object identifier, for which there is a standard, and for newspapers there is just no such standard that I'm aware of. And for poems and other things it's getting even more complex. Something I haven't mentioned yet that I would also like to see is, for instance, Jupyter notebooks. For anything that is about an algorithm or a programming language, it would be nice to have some sort of demo of that aspect of the algorithm or of the programming language, to demo it in a Jupyter notebook, and that Jupyter notebook should then also be cited in some fashion. But we don't have the mechanisms for that either.

The book tool, the Jupyter notebook tool, is that [unclear] here the thing that you're talking about?

Yeah, I'll just... you just added the link to the tool you mentioned, yeah. I hope that's... I was speaking about books in general as well, so I'm not sure whether that's the one that I wanted to know about.

Yeah, there's also Zotero [unclear]; it will be presented tomorrow in the session at 10 UTC. So, a lot of tools.

Yeah, okay, so I will go through the Etherpad once I'm disconnected here and try to answer whatever remains of the questions, and otherwise I guess you all have my contact info. I'm happy to address further questions this way, and I also plan to attend some of the other sessions, so I hope that everyone will enjoy the rest of the conference. Thank you.

We are running out of time, I have to say from the technical side. Do you want to extend for 10 more minutes, or how do you want to proceed?

I'm fine. I think if there are further questions I'm happy to continue right now. All the questions that I am aware of I've at least briefly touched upon. Okay, yeah, so I'm being corrected on what I said about
Inventaire. Yeah, so it's always good that the experts are around. That's the beauty of an open system, where everything can be verified. That's the nugget, the core of WikiCite: everything is verifiable. And so, yeah, I guess Max is putting the details in here.

Okay, so please keep on asking here in the next sessions. We are happy to have opened this virtual conference. The next session will start in two hours, and, yeah, have a look at the program. You will see Eva and me, both of us, in the next session tomorrow at 10, am I right, Eva? Okay, so again thank you all, Daniel. Yeah, and see you. Thank you. Bye bye.