WikiCite 2020: The state of WikiCite
403 views • Streamed live on Oct 26, 2020
Wikipedia Weekly
369 subscribers
Part of the WikiCite 2020 Virtual Conference:
https://meta.wikimedia.org/wiki/WikiC...
02:09 - Presentation by Daniel Mietchen
Summary:
WikiCite is an initiative to collect bibliographic and citation information, particularly of references cited from Wikimedia projects like Wikipedia, Wikisource or Wikidata. It provides an umbrella for a broad range of activities at the intersection between Wikimedia, libraries and other organizations engaged in scholarly communication or cultural heritage. Over the past few years, many of these activities have involved in-person events - including participation in the Workshop on Open Citations in 2018 - but the ongoing COVID-19 pandemic has changed that, and the WikiCite community is adapting. In this talk, I will provide an overview of WikiCite activities during the last 12-18 months as well as of what is ongoing and planned.
Bio:
Daniel Mietchen is a researcher at the School of Data Science of the University of Virginia. He is interested in integrating open and collaborative research and education workflows with the web, particularly through Wikimedia platforms, to which he actively contributes. Trained as a biophysicist, his research topics range from the subcellular to the organismic level, from biochemical and embryological to geological time scales, from specimens to biodiversity informatics, from data points to data science and to the role of research and education in sustainable development.
Transcript
[Music]
Welcome! Thank you for joining us. This is the beginning of the WikiCite 2020 virtual conference. This is the first time we run a virtual edition of this almost-annual conference, and it is the fourth WikiCite annual conference. Today we have a series of sessions by a variety of people on various topics, but we are starting with, let's call it, the keynote: the state of WikiCite in 2020. My name is Liam Wyatt, and I'm the host of the conference in general, but to introduce Daniel and this particular session, I'd like to introduce Jakob and Eva. Welcome, and thank you for hosting this session today.
Yeah, hello and welcome from both of us, the hosts for this session. We are happy to present Daniel Mietchen. To introduce Daniel: he's virtually everywhere in the Wikidata world. Each time I discover a new interesting Wikidata-related project, I'm sure he has already seen it or has an opinion on it, and it is a strong, well-founded opinion. That's why it's very interesting to listen to what he's going to talk about. So we welcome Daniel Mietchen. Are you ready? Then the screen is yours.

Yeah, I'm ready.

Okay, we will leave you to it.
So, thank you everyone. I'll just start by noting that the slides are available on Zenodo (I also just tweeted them), so you can follow at your own pace. There is a back channel for comments, so if there is anything that you would like to comment on, or if you have a question during the talk, please put it in the Etherpad at w.wiki/GK (with the GK in capitals). From the Etherpad we will then also take the Q&A section. I'll try to talk for about 45 minutes, and then we have about 15 minutes for questions, but this is flexible, and I can be interrupted; that's fine.
So, the structure of the talk: I'll say hello, then we will look at some previous versions of "the state of WikiCite 2020", because you might have seen that the title says version 3. I'll give an overview of WikiCite in general, and then of the WikiCite developments since the last official WikiCite meeting, in 2018 in Berkeley. In this overview I'll look at the content, at some of the infrastructure, at grants and fellowships, and also at events, including further events that are part of this WikiCite conference. There will be mechanisms for you to get involved, and then there will be Q&A.
Okay, hello WikiCite. I was introduced already; this slide is more or less just for people who are stumbling upon this randomly on the internet when they watch the YouTube recordings.
So now let's look at WikiCite 2020 and the previous versions. One important previous version is the annual report. As Liam already pointed out, we have roughly annual events, which also means we have roughly annual reports; the latest one is for 2019/20. I'm not going to talk about it too much, but I encourage you to take a look. In any case, here is a brief summary.
The WikiCite annual report 2019/2020 detailed some events that were planned and actually held. For this period we were actually planning to go decentralized; we wanted to have a more multilingual and more diverse set of events taking place, and that involved thoughts about going virtual from the beginning. We had two events actually take place, one in Australia and one in Czechia, but then the COVID-19 pandemic changed many of those plans. The event in Czechia had to be cut short by a day; events planned in Ireland and Finland had to be modified; events in Ghana, Germany and India had to be cancelled. I was actually meant to speak at that event in Germany, giving the talk that I'm giving now. Events in Haiti and Ghana had to be postponed; I'm glad to report that by now the one in Haiti has actually taken place. The pandemic has also triggered, for instance, WikiProject COVID-19, which had a strong WikiCite component. The ecosystem around WikiCite is also evolving, in terms of the Wikimedia strategy and the Wikidata/Wikibase strategy. Everything that's underlined here in my slides is linked, so I encourage you to explore those links.
The second version of "the state of WikiCite 2020" was a talk that I gave at the Workshop on Open Citations and Open Scholarly Metadata 2020, at the beginning of September. Back then I didn't know that I would be giving this talk here, and I wasn't sure whether there would be any kind of overview-of-WikiCite talk this year, which is why I gave it that title back in September. On the other hand, I had only very little time, so I couldn't go into details. I'm reusing some of the slides from there, but the structure of the talk is different. The key takeaway from that talk is: despite the pandemic, WikiCite is moving forward, but we're actually not so sure about the citation part. That talk is online as well, including some video recording.
So now we're on to version three. Let's start with an overview of WikiCite; I assume some people are here who don't really know what it is about, and so I hope to add a little bit of clarity and reduce a bit of the confusion. WikiCite is a community; it's a socio-technical platform that connects this community with an ecosystem of projects, tools and other activities. It is also, or can be seen as, a collection of datasets, a series of events, and a number of other things.
Let's zoom into some of those. The community: if I go by the 2016 conference, it outlined the vision to create the sum of all human citations as linked open data. This is inspired by the Wikimedia vision to create the sum of all human knowledge, and from that we derived the mission, which is still up on the WikiCite page on the Meta wiki: to develop open citations and linked bibliographic data to serve free knowledge, using Wikimedia platforms and especially Wikidata.
That vision and mission have led us to consider a number of goals. I'm again quoting the 2016 version of them: lay the foundations for building a repository of all Wikimedia references as structured data in Wikidata, that's one goal; the other one is to design data models and technology to improve the coverage, quality, standards compliance and machine readability of citations across Wikimedia projects. That's what the plan was in 2016, and somewhere between these two we are actually still active; it's just that by now we've zoomed in on those things in so many different directions, some of which I'll try to outline.
The next aspect of that community is that it has a roadmap that offers essentially four roads to travel. But that roadmap is not really used much; everybody travels on their own, everybody finds their way on their own. Still, the decisions that the roadmap is about are kind of looming: at some point they need to be taken, or they will be taken by someone or something. So it's probably worth it for the community to keep it in mind and maybe have some more active discussions about it.
Next: I mentioned that it's also a technological platform. It has a home base on the Meta wiki under the WikiCite handle, and from there you can find the program for this week's conference, plus lots of other documentation about WikiCite in general. Its home base on Wikidata is WikiProject Source MetaData, which was actually the nucleus from which WikiCite has grown. "WikiProject Source MetaData" is a bit of a mouthful, and it was also specific to Wikidata, so by now we have come to use the term WikiCite, to signal that it's not just for Wikidata; it's really on Meta because it concerns all Wikimedia projects. But the nucleus, the origin of this, is that we wanted to care for source metadata, and that included citations. It's important to keep in mind that Wikidata is about one third just citations and scholarly publications; about one third of the content of Wikidata is created by the WikiCite community. So the problems of Wikidata reflect on WikiCite, and vice versa. And then, finally, yes, WikiCite is a series of events: we were in Berlin in 2016, in Vienna in 2017, and in Berkeley in 2018, and now we're trying to be a bit more global this year.
So, I mentioned Wikidata a number of times. Most of you will know what it is, but I suspect some will not, so here is an attempt: like Wikipedia, which is the free encyclopedia that anyone can edit, Wikidata is the free knowledge base that anyone can edit, and use, of course. If that doesn't clarify it for you, I recommend the seven-minute intro video on YouTube that I've linked.

A few more numbers about Wikidata, from the perspective of WikiCite. Wikidata as a whole has about 1.1 billion statements, which come in triple form: subject, predicate, object. The most interesting for us are, for instance: citations, 170 million; scholarly articles, 36 million; items that have a digital object identifier (publications or datasets), 26 million; ORCIDs, an identifier for researchers, one and a half million. And then there is the curation work that's going on: links from the publications to their topics, roughly 20 million; links from people to the institutions where they were educated, one and a half million; and from people to where they work, about a million. All of these data, since we're talking linked data and open data, can be mixed and matched in different fashions, reused, and they enrich each other.
So now let's talk a little bit about the curation workflows that are employed by the WikiCite community. I'll single out three of them here; I mention four of them on the slide, but due to time I'll go into only three. One is topic tagging, which allows us to address questions like: give me all the papers about SARS-CoV-2, or any other specific topic.
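A minimal sketch of such a topic-tagging query, runnable at https://query.wikidata.org/ (the wd:, wdt: and other prefixes are predefined there); the item ID used for SARS-CoV-2, Q82069695, is an assumption:

# Papers tagged with a given topic via "main subject" (P921)
SELECT ?paper ?paperLabel WHERE {
  ?paper wdt:P921 wd:Q82069695 .   # main subject: SARS-CoV-2 (item ID assumed)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100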
The second is author disambiguation, which allows us to address questions like: give me all the papers by a given person.
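The author case looks the same, just with the "author" property; a sketch, where Q20895785 is assumed to be the item for an already-disambiguated author:

# Papers by a given (disambiguated) person
SELECT ?paper ?paperLabel WHERE {
  ?paper wdt:P50 wd:Q20895785 .    # author: a person item (example ID assumed)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100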
The third is ontology development. For instance, if you want to get all papers about mathematics education, and you want to include different aspects of mathematics education, then you need to map those aspects, like calculus education, or algebra or geometry education, to math education. That's called ontology development: basically linking topics and subtopics.
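A sketch of how such a topic hierarchy can be used in a query; since the exact item for mathematics education is not given in the talk, the well-known item Q395 (mathematics) serves as the root here:

# Papers whose main subject is mathematics or any transitive subtopic of it
SELECT ?paper ?paperLabel ?topicLabel WHERE {
  ?topic wdt:P279* wd:Q395 .       # subclass-of path up to Q395 (mathematics)
  ?paper wdt:P921 ?topic .         # main subject
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100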
And then the last one is subject affiliation. If you want to get an overview of the publications from a given institution, and many institutions want to get that overview, then somehow you need to record that there is a link between the person and the institution, and then between the person and the publications; through linked data we can then harvest that and make the link from the publications to the institutions.
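That person-institution-publication chain can be sketched as a query like this; Q13371 (Harvard University) merely stands in for "a given institution":

# Publications by people employed at a given institution
SELECT ?person ?personLabel ?paper ?paperLabel WHERE {
  ?person wdt:P108 wd:Q13371 .     # employer: Harvard University (substitute any institution)
  ?paper  wdt:P50  ?person .       # the person appears as a linked author
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100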
Okay, an example of topic tagging. Here I chose a query that gives me terms that have "bologna" in them, in the titles of publications. From that I can see that there are different kinds of Bologna being talked about: Bologna in general; Bologna, Italy; the province of Bologna; the city of Bologna; the University of Bologna; the "Bologna experience", which refers to a number of things; then there are some clinics in Bologna; Bologna sausage; and so on. So disambiguation is very important. But once we have disambiguated things, because we have these rather precise topics in Wikidata, which all have their own identifiers, we can then clearly distinguish the literature about the sausage from the literature about the university, the clinic, the province and the city.
The most popular tags for topics of publications, as of August 2020, were these ones here: mostly medical. That just shows some bias that we have. That's partly because in medicine and biomedicine we have databases that are more easily harvestable and can be more easily integrated into Wikidata than in other fields. But it's not only biomedicine; for instance, we see India popping up here, and nanoparticle, and statistics, so it's a bit more complex than that.
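A ranking of this kind boils down to an aggregation like the sketch below; note that over tens of millions of articles such a query can time out on the public endpoint, so samples or dumps are often used instead:

# Most frequent "main subject" tags on scholarly articles
SELECT ?topic ?topicLabel (COUNT(?paper) AS ?papers) WHERE {
  ?paper wdt:P31  wd:Q13442814 ;   # instance of: scholarly article
         wdt:P921 ?topic .         # main subject
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?topic ?topicLabel
ORDER BY DESC(?papers)
LIMIT 25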
Here I'm bringing up the Zika virus as an example, because it has served as an example already since 2016. It's kind of the guinea pig, the testing ground, for WikiCite workflows. This is in part because WikiCite developed roughly in parallel to the Zika virus epidemic in 2015-16, but also because this Zika virus corpus, or dataset, has been used in a number of contexts, and it also helped address the ongoing COVID-19 pandemic. What it does: it's a model for disaster response in general, not just for outbreaks. It's a relatively complete dataset for a given pathogen. It has a data model that was enriched as the epidemic progressed: initially, when the Zika virus came around, coverage increased; then the data model on Wikidata was refined, and we could express things at a more granular level. It has also triggered a number of tool developments, and it's used in technical tests and in education. Here are a few more links on this. I'll get back to COVID-19 later on.
So once we have topic tags, we can do things like getting the list of most recently published works on a topic; of course these dates are not very relevant anymore, so here we're talking about the principle. We can also look at which topics co-occur with a certain topic. If the topic is Zika virus, then we see here, for instance, that the topic of congenital Zika virus infection is very prominent, along with Zika fever, pregnancy and microcephaly. These are all topics that are discussed together with the topic of the Zika virus, and that is a way to browse things: if you don't know much about the topic, you just go there, you start from the Zika virus, and then you can explore all those other things that are linked to it.
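A sketch of such a co-occurrence query; Q202864 is assumed to be the item for the Zika virus:

# Topics that co-occur with Zika virus on the same papers
SELECT ?cotopic ?cotopicLabel (COUNT(?paper) AS ?papers) WHERE {
  ?paper wdt:P921 wd:Q202864 ;     # main subject: Zika virus (item ID assumed)
         wdt:P921 ?cotopic .       # any other main subject of the same paper
  FILTER(?cotopic != wd:Q202864)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?cotopic ?cotopicLabel
ORDER BY DESC(?papers)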
Then what we can also do is link the discussions of the topics with certain locations, because some of those topics will have a geolocation. Take India: among those most popular topics we had India as one example, and India has a geolocation. Some of those co-occurring topics will discuss things like the Zika virus in India, or in French Polynesia, or in Brazil, or something like this, and we can pull this out of the data by mapping the papers to the topics, and then the topics to their geolocation, if they are entities that have a geolocation.
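The mapping step adds one triple pattern for the coordinates; the leading #defaultView:Map comment makes the Wikidata Query Service render the result as a map (the Zika item ID is again an assumption):

#defaultView:Map
# Geolocated topics that co-occur with Zika virus
SELECT ?topic ?topicLabel ?coord WHERE {
  ?paper wdt:P921 wd:Q202864 ;     # main subject: Zika virus (item ID assumed)
         wdt:P921 ?topic .
  ?topic wdt:P625 ?coord .         # coordinate location of the co-occurring topic
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}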
I want to contrast this with this map, which shows the locations of the authors of papers on the topic. So in the previous one we had the locations of topics that co-occur with the Zika virus, and here we have the locations of the authors of papers about the Zika virus. These are different things; we can plot them, we can pull them out of the data that is created in Wikidata through the WikiCite community and tool chains. And that allows us to explore certain gaps. For instance, here in Africa there's not much happening according to this map, which might mean, first, that there is not much research on the Zika virus happening in Africa, or in much of northern Asia; or simply that there is not a lot of population there anyway; or that we just haven't curated the data about the publications, or about the authors, or their affiliations, or their geolocations, very comprehensively. These are the kinds of things that the community then has to drill down on, in order to find out and to improve the dataset.
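The author map follows the same pattern, but routes through the authors and their employers instead of through the topics; a sketch, with the same assumed Zika item:

#defaultView:Map
# Locations of the employers of authors of Zika papers
SELECT ?org ?orgLabel ?coord WHERE {
  ?paper  wdt:P921 wd:Q202864 ;    # main subject: Zika virus (item ID assumed)
          wdt:P50  ?author .       # disambiguated author
  ?author wdt:P108 ?org .          # employer
  ?org    wdt:P625 ?coord .        # employer's coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}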
Now we're on to author disambiguation. The basic strategy is: if there is an ORCID, which is meant to help with author disambiguation, and the ORCID record has public data that is actually usable for Wikidata, then we try to use it; we have some tooling for this. There were some problems with that, so it doesn't really work well for the moment; that's also a discussion we need to have. Also, the ORCID data is not always of the best quality, but at least it is there; it is in principle the standard that the research community has agreed to support, and so if you want to map research, then that's something we should take into account.
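Checking whether an ORCID is already linked to a Wikidata item is a one-line lookup; the iD below is ORCID's well-known example iD, used here as a placeholder:

# Is this ORCID iD already represented in Wikidata?
SELECT ?person ?personLabel WHERE {
  ?person wdt:P496 "0000-0002-1825-0097" .  # ORCID iD (ORCID's example iD, placeholder)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}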
Author disambiguation on Wikidata can be done by anyone, just like any other kind of activity on the platform, and just like on Wikipedia, we have a number of mechanisms for quality control; we have some tools that are actually dedicated to author disambiguation. We encourage institutions to disambiguate the people that are affiliated with them, because they know best; or the people themselves, of course, also know. An ordinary Wikidata volunteer might not really know the start time and end time when some professor was affiliated with a certain research institution. And if we consider that there are tens of millions of publication items, many of which have many authors, then this becomes a humongous challenge. But thinking about this shouldn't occupy our thoughts too much, as long as we find some mechanisms to prioritize properly.
Here is one of the tools in action. It's called the Author Disambiguator, and it can be used to basically convert author name strings into identified authors. The author here, Rosa Prato, already has an identifier on Wikidata, which was associated with nine items at the moment when this screenshot was taken. And then here we have 83 publications that list the name "Rosa Prato" as an author name string, and the question is: to what extent do the publications that show this author name string actually correspond to publications that have been written by this identified person? The tool basically asks this question, helps with the grouping, and makes suggestions. That's one of the mechanisms by which author disambiguation can happen.
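The data pattern underneath the tool can be inspected directly; this sketch fetches the papers that still carry the plain name string from the screenshot, rather than a link to the person item:

# Papers that list a given author only as an unresolved name string
SELECT ?paper ?paperLabel WHERE {
  ?paper wdt:P2093 "Rosa Prato" .  # author name string, as in the demo
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}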
Subject affiliation: in order to find out the affiliation for a given person, we can search in Wikidata, or we can search elsewhere, by university or by organization, and we can get information such as the faculty and their publications; or research sites and clinical trials that these people have been involved in; or events, such as conferences, and locations where these people have interacted or have shown up. And if we do affiliation tagging, then we can basically do similar things to what I showed for the Zika map before. That allows us, for instance, to look into areas where a certain topic is well represented. The data here is for Italy: you see that there is a certain gradient in terms of publications that have Italy as a subject. In Italy there are a lot of papers about Italy, and the further you go away, the fewer papers there are. The mapping here also serves as one mechanism by which we can do quality control.
One example I would like to point out: there is Vanderbilt University in the United States, which actually does this curation of affiliations at a very systematic level. They have figured out who is currently affiliated with the university, they've gotten all their ORCID identifiers, and they have fed that information into Wikidata. That allows them, and everybody else now, to basically browse the information that is available about publications by Vanderbilt University people: staff, mostly, and faculty, but also occasionally students. The work that they are still working on is linking these people to the publications, and that's also something that is part of the community workflows; they really try to integrate their efforts with the community workflows. I put in some links here: first a news item about this, then the bots that made all this possible, and similar projects at Indiana University and the University of the Canary Islands; there's also a country-wide similar effort for the Netherlands.
Some other things that we haven't talked about too much in previous WikiCite events are clinical trials. These are also things that you can cite, and they produce citable things; they are about certain topics; they have authors, they have principal investigators and these kinds of things. So these are entities that are part of the curation workflow, and we're now relatively complete with respect to clinical trials that are registered in the United States; we're trying to increase the coverage of clinical trials that are registered in other places.
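One way to sketch such a coverage check is to count the items that carry the registry identifier; P3098 is the ClinicalTrials.gov ID:

# How many items carry a ClinicalTrials.gov registry ID?
SELECT (COUNT(?trial) AS ?trials) WHERE {
  ?trial wdt:P3098 ?nctId .        # ClinicalTrials.gov ID
}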
So here is an attempt at representing the WikiCite ecosystem, which I described as a socio-technical system in one of my opening slides. Here we have a layer of users that use a number of tools to interact with the basic infrastructure, Wikidata and Wikibase. Some of those users are automated, and most of them are humans. Some of the tools are used more or less for reading, others for writing, and others for mixed matters. So: Histropedia allows you to visualize timelines. Scholia allows you to visualize connections that are roughly related to scholarly literature, or to research contexts in general. QuickStatements is a tool to edit Wikidata. Mix'n'match is a tool to also edit Wikidata and to map things from Wikidata to external databases. Recoin helps with quality control.
Another way to look at these tools is by what they facilitate. Some facilitate browsing, like this Wikidata front end that is specific to software. Others allow you to edit Wikidata in various ways: for instance, if you have an identifier for a publication, you can put it into the source metadata tool, and it will check whether that publication is already indexed; if not, it will help you set up the corresponding item.
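The lookup such a tool performs amounts to a query like this sketch; the DOI is a hypothetical placeholder, and note that Wikidata stores DOIs in upper case:

# Is a publication with this DOI already indexed?
SELECT ?item WHERE {
  ?item wdt:P356 "10.1234/EXAMPLE.DOI" .  # DOI (hypothetical value; Wikidata upper-cases DOIs)
}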
And a large set of tools is actually there to check consistency, data quality and these kinds of things. For instance, for statements about symptoms, we have a requirement that whatever is stated as a value should be supported by a reference, and if that is not the case, a warning is displayed.
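A sketch of how such unreferenced statements can be found, using the statement-level p:/ps:/prov: vocabulary; P780 is the symptoms property:

# Symptom statements that lack any reference
SELECT ?disease ?diseaseLabel ?symptom ?symptomLabel WHERE {
  ?disease p:P780 ?stmt .          # full statement node for "symptoms"
  ?stmt ps:P780 ?symptom .         # the stated symptom value
  FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref }  # no reference attached
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100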
There's an overview of all the tools on this page. For the Scholia tool specifically, we also have a category on Commons, and there are similar categories for some of the other tools. I don't have time for that here; there will be a dedicated session on Scholia later today.

Now, back to the citation questions.
Here is a graph that is available on a website operated by one of the organizers of this session. What it shows is two different developments, up until somewhere in the middle of this year. One of them is the number of publication items, which goes up more or less in jumps, but fairly steadily. The few occasions where it goes down are actually acts of active curation, where someone noticed, for instance: oh, we already have this publication; it was started based on the digital object identifier here, and based on the PubMed identifier there, and they're actually about the same thing, so we can merge them. These acts of curation are sometimes visible in those stats. What we also have here is the number of citations. I mentioned the number of 170-something million at the beginning, and here we see that it has basically plateaued for about a year or so; then it went down, which again is an act of curation. The problem, or at least something to discuss, is: what's the future trajectory here? We don't really have that data currently, but it's clear it's not going up very strongly, and the question is whether it should, or whether that is something that should take place somewhere else.
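The citation figure corresponds to counting "cites work" (P2860) statements; conceptually it is the query below, though at 170-plus million links the public endpoint may not complete it:

# Total number of citation links between publication items
SELECT (COUNT(*) AS ?citations) WHERE {
  ?citing wdt:P2860 ?cited .       # cites work
}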
There are a number of bots that are active in this space. VanderBot basically underlies the Vanderbilt project that I outlined before. LargeDatasetBot essentially creates publication items: it checks whether items for certain publications are already indexed in Wikidata, and if not, it creates them. RefBot is perhaps closest to the original goal outlined in 2016: that we want to support the references, or support every statement in Wikimedia projects, with suitable references. What RefBot does is pick certain statements in Wikidata that do not have a reference yet, search the literature for places where this statement might have been made, and then add relevant literature as a reference to support the statement. So for the statement that local anesthesia is a subclass of anesthesia, we now have two sources that were added by RefBot. There is also a proposal for an OpenCitations bot, but this is pending: technically it's kind of doable, but still not done, and socially we don't really know yet whether we actually want to do it in Wikidata or not. That's part of the discussion that events like this want to facilitate.
It's also important to think about the event history a little bit, because we're beginning another event right now, and I want us to come into the mindset for this kind of event. It's different now because it's all virtual, but let's think about it nonetheless. In the previous three conferences we typically had a three-day design, where we tried to combine monologues, like the one I'm doing right now, with more dialogue-oriented sessions, and then with a hackathon. We have elements of all three of these in the current program and in the other WikiCite activities; I'll outline some of those.

For this week's virtual conference, I would actually have liked to include a screenshot of its Scholia page, but it wasn't detailed enough, so I'm just encouraging you to take a look and help curate that part. I also noticed that for WikiCite 2016 we don't have the information very well curated yet. So let's focus on the WikiCite 2020 events.
We're in this current session from 10 to 11 UTC, and you see there a number of other sessions coming up today. I guess if you are here in this talk, you already know roughly about this, but I really encourage you to use some of the breaks to go through the entire schedule, because there are so many different things. They differ not just by location: they differ by language, they differ by the kind of topic focus. Here we have Swedish parliamentary documents; here, for Indonesia, they're discussing palm-leaf documents. They differ in terms of the technical aspects: here we have a hands-on introduction to how you can actually edit Wikidata; then here is curation of author items; here is the front end of WikiCite; here are things that are specific to genetics; and so on. There are lots of such sessions, and the aim of doing them is to foster collaboration. Since all of this is open, everybody can interact with all of these projects, and that's one of the mechanisms by which the WikiCite community thrives. We normally had pushes in activity right after the WikiCite events, and hopefully this will happen for this virtual one as well.
In terms of other activities, I would also like to mention that we have set up a mechanism to award some grants and e-scholarships related to WikiCite events. They are somewhat of an adaptation to the COVID times. Normally, as I mentioned, we would have had a hackathon in this three-day conference setting, but since the normal way of doing a hackathon means bringing multiple people into the same room, and that's not possible these days, we basically thought about a mechanism by which we could give people some time to work on things; that's the e-scholarships approach. So the things that they could have done during a three-to-four-day hackathon are now part of those e-scholarships. We also have a number of grants that are somewhat larger-scale projects, but still doable in a matter of a few weeks to months, and here is a blog post that details all of them. Again, one of them is about the palm leaves from Indonesia, but there are 23 of them, and they're much more diverse than the projects that we have discussed at the previous WikiCite meetings.

In terms of coordination: WikiCite is driven by volunteers in all of its aspects, including the organization of this session and this conference, and all of the content work and tool development. But some coordination needs to happen, and in typical wiki style, much of this coordination happens in a variety of channels. Still, we have an organizing committee, or steering committee, and this is basically the same team that has been active over the last few years; we had a few members change, but we're constantly looking for new members. We want to facilitate WikiCite activities in different locations, in different languages, on different topics, in different technical contexts, and so if any of that resonates with you, please get in touch and consider joining the steering committee.
So that's now the first thank-you slide; I think it's important to say thank you. One of the things I like best about the Wikimedia ecosystem, let's say, is actually that we have a thank-you button on all the different platforms, or on most of them. That is a very interesting aspect of forming a community: it's not visible, other than to the person being thanked, at least by default, and so it is not normally gamed; it's a very nice thing. I just want to thank everyone. I don't like that phrase, but still I use it, and so I'm also spelling it out: I want to thank the providers of open infrastructure, not just the Wikimedia open infrastructure, but open infrastructure in general, including the infrastructure that we interact with; the providers of this presentation template, from SlidesCarnival; and the Wikimedia, Wikidata and WikiCite communities. I would especially like to thank the Scholia team, because that's the team I'm most closely interacting with on a daily basis. I would like to thank the Alfred P. Sloan Foundation, which sponsors the WikiCite events, and I would like to thank the organizers of WikiCite events: if you're organizing any session here at this conference, or somewhere else, if you did that in the past or plan to in the future, I want to thank you for just thinking about it, and especially, if you've done it, for doing it. And if you've made it until here, I want to thank you for paying some attention to these topics. Now I'm looking forward to the discussion; let's see what the Etherpad says. If you want to contact me, here are some contact details, and now I'm trying to hop into the pad.
Yeah, thank you very much, Daniel; you are well within time. So, we have some questions; very well, we'll look at them up here on the Etherpad. The first is: could WikiCite provide an alternative to commercial bibliographic search engines? What do you think?
Okay. It would be helpful, when you mention a question, if you also tell me the line number, Peter. Eventually, yes, I hope so, but currently it's an alternative only for very specific things. If you want to know something about the Zika virus, for instance, I would guess that for certain things the Zika corpus on Wikidata is annotated in a better or more detailed fashion than the standard that you get from the commercial search engines. But in general, they are for many purposes still better, in part because they have money to put at this, and WikiCite is largely driven by volunteers who do it in their spare time. What we do have is a community; they don't have that. The question then is what the niche for each of those is. Since we're entirely open, anyone is entitled to benefit from our curation efforts, and that includes the commercial providers; so one of the main purposes of WikiCite is actually to increase the quality of the information that is available through such search engines. If anyone has access to the Wikidata version, or the WikiCite version, of it, then whatever those commercial offerings are, they should be better. That's a short answer, but the question is deep, and we could have an entire discussion around it.
So it in any case depends on the topic?

Yes, it depends on the topic. For a number of topics, especially in the humanities, Wikidata doesn't have a lot of information at the moment. This is due to several structural and community biases; we're working on them, we're aware of some of those biases, but still, they are there. And of course the search engines have their own biases as well. For particular collections: I hope that in half a year from now the palm-leaves collection from Indonesia, for instance, will be in Wikidata, and I don't see them showing up in any of the commercial search engines anytime soon. But maybe they pick it up from Wikidata; that would also be progress, and that would make the information more widely available.
Okay, so the next question, a question of mine, about topic tagging: where do these topic tags come from? It's not as easy as with the author or the date of publication, because topics are less of a hard fact. What do you think about this?
Yeah, that's actually something I'm thinking about quite often. I think the current workflows that we have are not very mature. There are lots of workflows, typically in library contexts, where topic tagging is already happening. For instance, the database PubMed, for biomedicine, has an entire system called MeSH terms, Medical Subject Headings, which uses basically human readers: humans read a paper, then decide which terms to associate with that paper, choosing those terms from a defined, controlled vocabulary, the MeSH terms. That system exists, but we haven't found mechanisms to leverage it for Wikidata, in part because Wikidata is cross-disciplinary, and so even if we had this working for PubMed, it wouldn't help us much in covering publications in history or astronomy or elsewhere. So there is no system that is cross-disciplinary, which means there is actually an opportunity for Wikidata to become the first cross-disciplinary platform with a consistent mechanism of tagging. Right now we don't have it; what we do have is often just based on inferences from the title, which might run into problems when the tagging is based on ambiguous terms, or when the tagging is performed by someone who doesn't know the subject very well, or something like this. All of this happens; all of this also happens to me. But the good thing is: since we're curating into an open platform, everyone can check what has been done, everyone can point out problems with it, and some of those checks are even automatic. Then we can work together to reduce the number of false positives, false tags, and we can also think about what the best granularity of the tags is. If we go back to the example of the MeSH terms: they come with their own hierarchy, and some of that hierarchy is already reflected in Wikidata. So if a paper is associated with a number of MeSH terms that are in a hierarchy, then the question is: should we take all of them, or just the most specific ones? Those are the kinds of questions, and for other topic-tagging contexts, the workflows might differ again.

Okay, thanks.
Just a short interruption, as I'm here for the technical side: can you please close those pop-up tabs? Yes, thank you.

Yeah, I was going to do it when I had a minute, but since I was talking all the time, it took me a while.

Okay. Next one: shall we just go by the order, or do you have a selection? We have time.
Could the Author Disambiguator help with other kinds of ambiguity? And how does the Author Disambiguator work?
Yeah, the Author Disambiguator is a tool that was specifically designed to assist with the conversion of the strings that we get, in terms of authorship, from various databases: the conversion from those strings into Wikidata identifiers. A typical example is "Jane Smith", or something like this; there might be many of them. Or "Xin Huang", or something like this: many people have this name, or use this name, or this string, in publications, and then in Wikidata we have to figure out which one is which, which person is the author of which paper, and so on. This tool that I briefly demoed helps with that. In principle it could be adapted to do all sorts of string disambiguation, but that hasn't happened, and maybe it will not happen in the framework of this tool, because this tool has this particular purpose. But it's open source; everybody is welcome to contribute, everybody is welcome to fork it and to develop it in another direction. This tool is actually itself a fork of an earlier version of the tool. You could imagine lots of other string disambiguations being useful; the ticket that I have in mind here is to use it on things like policy documents. There are lots of options, but if you want this to work well (and for authors it works really well at the moment), then it needs some development; it's not just adding a line of code to make it cover other kinds of disambiguation as well.
Ah, I see our author has already replied to that; that's nice. I can only read so much. That's one of the purposes of having this Etherpad; it's very nice that someone posts a question and someone responds while I'm talking. That's fine; it makes the discussion more efficient. Maybe you can then guide me a little bit as to the things I should talk about.

I see. Yeah, I think there was a question about OpenRefine, but we will hear about this later in the event. So, further down.
Yeah, OpenRefine. I wanted to mention it at several points in the slides, but for some reason I haven't done so; that is certainly an omission, it was not intended. OpenRefine is one of those tools, and it can basically refine: it helps reduce the noise in the data if you want to map strings to items, and it's very powerful. For certain workflows it has become more or less the de facto WikiCite tool, and for other contexts it has not been explored too much. But there are dedicated sessions on this later.
Okay, so that was another good question: where on Wikipedia is the Wikidata/WikiCite data used? The idea being: we have all the bibliographic data in Wikidata, and it is automatically shown in Wikipedia. Does that happen?

I think there have been tests, demos and explorations in various places, I think in Russian, and maybe even in French, and Catalan, Basque, something like this, but I'm not aware of anything doing this systematically, for a number of reasons. Briefly sketching them out: first, not all the references cited in any given Wikipedia are already indexed in Wikidata. Maybe we're coming close, or have a certain degree of completeness, for scholarly references that have a digital object identifier, but not for other things, newspaper articles for instance. So if you want to use Wikidata for running the citation sections in your Wikipedia articles, then it would be nice to have higher coverage, which is one of the original motivations for doing WikiCite. But that would require WikiCite and Wikidata to develop better or more robust data models and tooling around those other kinds of things, for instance around newspapers, and there are efforts in this direction, but they're not necessarily mature enough for a Wikipedia to use them at scale as a default. I think Wikidata is ready to be used for experimentation, let's say for a certain topic, anything related to Zika for instance. In those areas I would actually love to see those experiments. Let's say: let's try to replace all the references in Zika-related articles. Most Wikipedias have just one or a few of those articles, so let's try to start with those articles and then see what the problems are; some books will pop up, and policy documents will pop up, and then we solve those problems, and then we can think about scaling this up.

So, go on, Wikipedians: it's a wiki, try it out! Yeah.
Above, there was a question about living persons: is there some procedure to prevent personal information from being added despite the person not wanting it?

Can you tell me the line number?

Uh, 30... well, 35.

35, okay. So, someone had already posted a comment, so I kind of jumped over this question; I had seen that Arthur had responded to it, and someone said it is not really answered yet. Okay, I'll briefly read it.

Okay. So, yes, some people do not want their data to be exposed in Wikidata, and in certain jurisdictions there are legal protections for these kinds of personal data, and Wikimedia projects in general have some policy around how to handle personal data. In general, the situation is such that if someone does not want their information to appear in Wikidata or Wikipedia, then that's a strong incentive to actually remove it, unless there is a public interest in keeping it public. If there is a public interest in this particular fact that the person wants removed, then it will likely stay; otherwise it will likely be removed, and if that's not the case, then there are mechanisms to address it. But once information was in the system, it cannot easily be deleted; by default it can just be removed from the current version. Really removing it from the system takes a bit more effort and involves administrators, but that can also be done, and has been done on occasion.

Yeah, so it's complicated.
Yeah, it's complicated.

So, last chances for questions, please. We have another question here in the Etherpad, line 49, about newspapers: information about newspapers. I would like to extend the question to information about books, because you mainly talked about scientific articles; there are other publications too. What about those? I know that in your statistics you are compiling a list of hundreds of different types of publications, including patents and poems and things like that.
Yeah, the general answer to this is that there have been relatively many efforts in the journal article space (not even the journals, just the journal articles). There have also been lots of efforts in the book space, but they haven't translated into tools too much. We have one good tool for books, which is Inventaire, which I also didn't mention; that's not because I don't like it, it's just that I had to prioritize here. Also, that tool, Inventaire, has to make some compromises in terms of the data model, so it doesn't distinguish too much between an edition of a work and the work in general. So we have the data model roughly figured out for books, but we don't have the workflows to pull that information from elsewhere and put it into Wikidata, for instance. That's in part because the resources that would have the data are not public domain, so we can't feed on them directly; in part because nobody has written those tools; and in part because the data model isn't really fully adapted to the particular use case you might be interested in. I'm aware of a number of activities in the newspaper space, but still nothing that is large-scale or transferable between languages or countries, or something like this. I see there is an Australian newspaper project; I was somewhat involved in a US newspaper project, for instance; but it's still not as homogeneous as what we have for scholarly articles, because that is based on the digital object identifier, for which there is a standard, and for newspapers there is just no such standard that I'm aware of. For poems and other things it's getting even more complex. Something I haven't mentioned yet that I would also like to see is, for instance, Jupyter notebooks: for anything that is about an algorithm or a programming language, it would be nice to have some sort of demo of that aspect of the algorithm or of the programming language, demo it in a Jupyter notebook, and that Jupyter notebook should then also be cited in some fashion. But we don't have the mechanisms for that either.
The book tool, or the Jupyter notebook tool: is that the thing that you're talking about? I'll just ask you to add the link to the tool you mentioned.

Yeah, I hope so; I was speaking about books in general as well, so I'm not sure whether that's the one that I wanted to know about.

Yeah, there's also Zotero; it will be presented tomorrow, in the session at 10 UTC. So, a lot of tools.
Yeah, okay. So I will go through the Etherpad once I'm disconnected here, and try to answer whatever remains of the questions; otherwise, I guess you all have my contact info, and I'm happy to address further questions that way. I also plan to attend some of the other sessions, and I hope that everyone will enjoy the rest of the conference.
Thank you. We are running out of time, I have to say, from the technical side. Do you want to extend for 10 more minutes, or how do you want to proceed?

I'm fine; I think if there are further questions, I'm happy to continue. Right now I've at least briefly touched upon all the questions that I am aware of.

Okay.

Yeah, so I'm being corrected on what I said about Inventaire. It's always good that the experts are around; that's the beauty of an open system, that everything can be verified. That's the core of WikiCite: everything is verifiable. And I guess Max is putting the details in here.
Okay, so please keep on asking, here and in the next sessions. We are happy to have opened this virtual conference. The next session will start in two hours; have a look at the program. You will see Eva and me, both of us, in the next session tomorrow at 10 am. Am I right, Eva? Okay. So, again, thank you all, and thank you, Daniel. See you, thank you, bye.

Bye.