Cross Format Annotation

By Ed Summers | 31 May, 2013

In 1960 I had a vision of a world-wide system of electronic publishing, anarchic and populist, where anyone could publish anything and anyone could read it. (So far, sounds like the web.) But my approach is about literary depth– including side-by-side intercomparison, annotation, and a unique copyright proposal. I now call this “deep electronic literature” instead of “hypertext,” since people now think hypertext means the web. Ted Nelson

As soon as there was a World Wide Web, we wanted to annotate it. Twenty years ago today, just as the Web was being born, Marc Andreessen humbly announced annotation support in Mosaic (one of the first Web browsers) on the www-talk mailing list. It is instructive to look again at how this functionality worked:

… every time you access a document in Mosaic, the group annotation server [if you’re using one] is queried with the URL of the document you’re viewing; if any group annotations exist for that document, the group annotation server returns to Mosaic corresponding hyperlinks which are inlined into the document just like personal annotations.

Ultimately this annotation support was dropped from Mosaic, since it required a server side component that needed to scale to an increasing number of Mosaic users, and it didn’t receive adequate funding at the time. But twenty years on we can see a vast proliferation of annotation services that use its key insight of grouping annotations by URL. The simple act of copy/pasting a URL into your Facebook status box, entering some text, and hitting post creates an annotation by you, about the Web resource with that URL. The same applies to URLs you use in Twitter or App.net status updates, or when you save a bookmark with some tags or a brief note in Pinboard, Digg, Delicious or Connotea.

Of course, the Web would be a better place if all these services shared annotations in a compatible way, with agreed upon standards. But it is remarkable that Web architecture has allowed this ecosystem of annotation to grow organically, with little coordination, using this very simple concept of annotation by URL.

Scholarly Annotation

And yet, as evoked in the quote from Ted Nelson above, annotation isn’t just a consumer tool for the Web masses. Annotation is an activity that is deeply intertwingled with the nature of scholarship itself. It is embedded in how our cultures document and define themselves. Annotation’s roots extend through all varieties of media, and back a thousand years to when commentary on the Talmud began. It is completely understandable why commercial services haven’t traditionally focused on the needs of scholars. But scholars are the power users of annotation. What’s more, recent developments like MOOCs, Open Access and citizen science are transforming our ideas of what scholarship means, and the people who are doing it aren’t always what we would call traditional scholars anymore.

In his recent talk about the Open Knowledge Foundation’s Annotator project at I Annotate, Nick Stenning identified 4 Hilbert Problems for annotation on the Web today. The second problem that Nick talked about was the need for annotating documents rather than formats. To concretely understand this problem consider the simple case where an annotation is made on a PeerJ preprint at:

https://peerj.com/preprints/1/

When you visit this URL in your browser you are presented with an HTML representation of the pre-print. But you may notice that you can also view the article as a PDF by following a link on the upper left to:

https://peerj.com/preprints/1.pdf

If you were to make an annotation of the HTML, wouldn’t you want to see that annotation when viewing the PDF, and vice-versa? Unfortunately, this won’t work if you are following the tried and true pattern of querying for annotations using only the URL that your browser happens to be viewing.

In fact the problem is a bit more nuanced, in that a document can exist in many places on the Web. Consider Vannevar Bush’s seminal 1945 article As We May Think, now published on the Web by The Atlantic. If you scroll to the bottom you’ll see that the article is segmented into four different pages, as well as a single page view, each of which has its own URL.

If you happen to annotate page three, and then weeks or years later visit the single page view, wouldn’t you want to see the annotation you made? If the tool you are using queries for annotations using only the URL of the document you are viewing, you won’t see it.

As a final example, imagine you are an astronomer, and you use the Astrophysics Data System (ADS) to keep up with literature in the field. One day you notice a new article you are interested in, and make an annotation of the abstract at:

http://adsabs.harvard.edu/abs/2006Natur.444..461V

Later, when the article is published by Nature, you bring it up in your browser at this URL:

http://www.nature.com/nature/journal/v444/n7118/full/nature05240.html

If your annotation tool is only using the Nature URL to query for annotations then you will miss the annotation you made of the article’s abstract at ADS.

This sounds like a thorny problem, and perfect solutions don’t exist yet, if they ever will. But fortunately today’s Web is a much richer information space than the Web of twenty years ago, and it presents some opportunities for improving the situation. At Hypothes.is we’ve been experimenting with some simple, pragmatic heuristics to enable cross-format-annotation that we’d like to share here, in the hope that they may be of interest to builders of other annotation tools, and perhaps to Web publishers generally.

On Sameness

At the heart of the cross-format-annotation problem is the notion of the sameness of Web resources. So, in pseudo logic, given that

R1 and R2 are Web resources
A1 is an annotation of R1 and
A2 is an annotation of R2

If you know that:

R1 is the same as R2

You can infer that:

A1 is an annotation of R2 and
A2 is an annotation of R1
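The inference above can be sketched in a few lines of Python. This is only an illustration of the logic, not code from the Annotator: the function name propagate_annotations and the input shapes (a dict of annotations per resource, and a list of "same as" pairs) are assumptions made for the example. Resources declared the same are grouped with a small union-find, and every resource in a group then sees all of the group's annotations.

```python
from collections import defaultdict


def propagate_annotations(annotations, same_as):
    """Share annotations between resources that are declared the same.

    annotations: dict mapping a resource id to a list of annotation ids
    same_as: iterable of (resource, resource) pairs (hypothetical input shape)
    """
    parent = {}

    def find(resource):
        # Follow parent pointers to the group's representative,
        # shortening the path as we go.
        parent.setdefault(resource, resource)
        while parent[resource] != resource:
            parent[resource] = parent[parent[resource]]
            resource = parent[resource]
        return resource

    for left, right in same_as:
        parent[find(left)] = find(right)

    # Pool the annotations of every resource under its representative.
    grouped = defaultdict(set)
    for resource, notes in annotations.items():
        grouped[find(resource)].update(notes)

    resources = set(parent) | set(annotations)
    return {resource: sorted(grouped[find(resource)]) for resource in resources}
```

Called with {"R1": ["A1"], "R2": ["A2"]} and the single pair ("R1", "R2"), both R1 and R2 end up with A1 and A2. Sameness here is treated as transitive, which is a simplification that may not suit every annotation scenario.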

Like cross-format-annotation, the sameness of Web resources is another tough problem, one that has attracted much attention in the Semantic Web community. Luckily at Hypothes.is, we aren’t so much interested in the sameness of resources in the generalized ontological sense as we are in the sameness of resources in the domain of annotation, and specifically in the context of the user stories described above. So we’re at a distinct advantage compared to the Semantic Web researchers who are concerned with the ambiguities and entailments of owl:sameAs in an open world. We’re also not keen to boil the ocean by introducing a new standard for Web publishers to implement. Hypothes.is needs to work with the Web we have today, and incentivize small, iterative improvements where we can.

The Annotator

As has been described in previous posts here, we’ve been building our annotation service, dubbed h, on top of the Open Knowledge Foundation’s Annotator project. The Annotator is actually two separate software components: annotator (a JavaScript client) and annotator-store (a RESTful annotation storage Web service implemented in Python).

The annotator-store has a REST API which (up until now) has worked very similarly to the annotation service that Mosaic used twenty years ago. The annotator client uses the current location of the browser to query for annotations by URL. So in order to search for annotations of the As We May Think article above, the annotator client issues the following API call to the annotator-store to retrieve relevant annotations:

https://test.hypothes.is/api/search?&uri=http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/
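As a rough sketch of that lookup, the query URL can be assembled from the page location like this. The endpoint and the uri parameter are taken from the example call above; the function name search_url is an assumption for illustration, and the real client is JavaScript rather than Python. Note that urlencode percent-encodes the page URL, so the result is the encoded form of the call shown above rather than a character-for-character match.

```python
from urllib.parse import urlencode

# Endpoint taken from the example API call in the text.
SEARCH_ENDPOINT = "https://test.hypothes.is/api/search"


def search_url(page_uri, endpoint=SEARCH_ENDPOINT):
    """Build a search URL that asks the store for annotations of page_uri."""
    return endpoint + "?" + urlencode({"uri": page_uri})
```

Passing the As We May Think URL to search_url yields a request for annotations keyed on exactly that one URL, which is the limitation the rest of this post is about.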

Very recently this interaction between the client and server was updated to take document metadata into consideration. Just this week, our changes to the annotator and the annotator-store projects were pushed upstream to the Open Knowledge Foundation’s GitHub repositories where they can be used by the wider Annotator community. Together with our previous work on Fuzzy Anchoring they represent the core of our approach to cross-format-annotation. You might be wondering what we mean by document metadata, and how it changes the interaction between the annotator and the annotator-store, so let’s dive in.

Canonical Links

To simplify the job of Web crawlers, and to give publishers more control over how their content shows up in search results, the major search engines support canonical links. Google introduced the canonical link, it was later specified in RFC 6596, and it is now used widely on the Web to indicate the preferred version of a Web resource. This can be handy for Search Engine Optimization when the number of links to a page significantly changes its relevancy rank in search results, and where there is a preferred URL for a resource that you would like search results to link to.

By happy coincidence, canonical links are also very useful in annotation. Consider the case of the As We May Think article, where there are multiple URLs for different views of the article, each of which can be annotated. If you view source on page two you should see:

<link rel="canonical" href="http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/" />

The same <link> is present in each of the other article views as well. We made a significant change to the annotator so that it will now introspect on the page and also request annotations using the canonical URL, as well as the URL for the current location of the browser. In addition, when creating an annotation the annotator will include the canonical URL as part of the document metadata that is POSTed to the annotator-store. This document metadata is then persisted by the annotator-store so that annotations can be looked up using either URL.
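To make the introspection step concrete, here is a minimal sketch using only Python's standard library. It is not the annotator's own code (the client is JavaScript and works on the live DOM); the names CanonicalLinkParser and uris_to_query are assumptions for the example. It finds the first canonical link in a page and returns the list of URLs that would be used to query for annotations.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values.
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.canonical = attributes["href"]


def uris_to_query(location, html):
    """Return the browser location plus the canonical URL, if it differs."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    uris = [location]
    if parser.canonical:
        # Resolve relative hrefs against the page location.
        canonical = urljoin(location, parser.canonical)
        if canonical != location:
            uris.append(canonical)
    return uris
```

For a paginated view whose source contains the canonical link shown above, uris_to_query returns both the page's own URL and the canonical URL, so annotations made on any view of the article can be found from any other view.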

Alternate Links

Alternate links are defined as part of HTML, and allow you to indicate when a Web resource is available at another URL, in another representation. Alternate links have been used for years by the blogging community for auto-discovery of syndicated feeds (RSS or Atom) associated with a Website. They are also useful in situations where you would like to let Web clients know, for example, that a PDF view of an HTML page is also available.

In the PeerJ example above, when there are HTML and PDF representations of a given pre-print, each of which has a different URL, the annotator now introspects on the HTML looking for alternate links, for example:

<link rel="alternate" type="application/pdf" href="https://peerj.com/preprints/1.pdf">

Alternate links are used in a similar fashion to the canonical links above: each alternate link is persisted as document metadata along with each annotation, so that the annotation can be retrieved independent of whether the user is looking at the HTML or the PDF.
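A sketch of how alternate links might be gathered into document metadata follows. The shape of the record (a "link" list of dicts with href, rel and type keys) and the names AlternateLinkParser and document_metadata are assumptions for illustration; they are not claimed to match what the annotator actually sends to the annotator-store.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkParser(HTMLParser):
    """Collect every <link rel="alternate"> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "alternate" in rels and attributes.get("href"):
            self.links.append({
                "rel": "alternate",
                # Resolve relative hrefs against the page location.
                "href": urljoin(self.base_url, attributes["href"]),
                "type": attributes.get("type"),
            })


def document_metadata(location, html):
    """Build a hypothetical metadata record to store with an annotation."""
    parser = AlternateLinkParser(location)
    parser.feed(html)
    return {"link": [{"href": location}] + parser.links}
```

Run against the PeerJ HTML page with the link element shown above, the record lists both https://peerj.com/preprints/1/ and https://peerj.com/preprints/1.pdf. This sketch keeps every alternate link it finds, including feeds; a real tool would probably want to filter by type.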

Structured Metadata

As a result of the rapid expansion of the Web, there are many ways to share structured metadata in Web pages. Patterns for structured data often reflect the requirements of specific applications, the needs of particular communities, and sometimes the generalized nature of the Web at large. When you annotate a page, Hypothes.is looks for some of this metadata to help identify the resource that you are annotating. Because of our increased interest in scholarly annotation, our initial support for structured metadata is modeled on formats supported by Google Scholar and Mendeley:

Highwire Press
Eprints
PRISM
Dublin Core

These metadata formats are often part of turnkey institutional repository software platforms such as DSpace and Eprints, and are also used widely by publishers who want their content to appear in platforms like Google Scholar. Because of their increased use we also look for:

Facebook’s OpenGraph Protocol
Twitter Cards

This structured metadata is generally useful for obtaining accurate title, thumbnail, and author information for annotated documents. But it also provides another avenue for discovering alternate URLs for a document, for example when a PDF URL is given using Highwire Press tags, as in the case of this PLOS One article:

<meta name="citation_pdf_url" content="http://dx.plos.org/10.1371/journal.pone.0000001.pdf" />

In addition, this structured metadata can be a way to discover alternate identifiers such as Digital Object Identifiers (DOI), as in the user scenario above involving an article in the Astrophysics Data System and at Nature. If you look at the HTML for the article at ADS and the article at Nature you can see that both have a citation_doi tag:

<meta name="citation_doi" content="doi:10.1038/nature05240" />

The fact that both these pages share the same DOI is a strong indicator to the annotator that it should query the annotator-store for annotations of that DOI in addition to the URL that the browser is currently viewing. And just like the canonical and alternate links, the DOI is persisted as part of an annotation as document metadata, so that it can be used to fan out a query by URL in the annotator-store.
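The fan-out can be illustrated with a short sketch that reads Highwire Press style meta tags. The tag names citation_doi and citation_pdf_url come from the examples above; the class and function names, and the choice to normalize DOIs to a lowercase "doi:" form, are assumptions made for this example rather than a description of the annotator's behavior.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("citation_") and content:
            self.meta.setdefault(name, []).append(content)


def normalize_doi(value):
    """Reduce a DOI to one spelling, e.g. 'doi:10.1038/nature05240'."""
    value = value.strip()
    if value.lower().startswith("doi:"):
        value = value[4:]
    return "doi:" + value.strip().lower()


def identifiers_for(location, html):
    """Return the identifiers a client could use to look up annotations."""
    parser = CitationMetaParser()
    parser.feed(html)
    identifiers = [location]
    identifiers += [normalize_doi(d) for d in parser.meta.get("citation_doi", [])]
    identifiers += parser.meta.get("citation_pdf_url", [])
    return identifiers
```

If two pages at different URLs both carry the citation_doi tag shown above, identifiers_for returns a list for each page that includes doi:10.1038/nature05240, and that shared identifier is what lets an annotation made on one page be found from the other.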

Future Work

The document metadata facilities that we’ve added to annotator and annotator-store are just a beginning. One of the reasons why extracting document metadata was of interest to us is that it put us on the road to learning and recording more about the documents that are being annotated, and we think this data presents lots of opportunities for improved service offerings. One area of interest is displaying summary information for the annotated document, such as titles and thumbnails, when annotations are presented outside of the context of the document itself.

Another more ambitious project is to build a reconciliation service that examines the persisted document metadata in the annotator-store and attempts to merge documents that share a certain number of attributes (title, author, publication date, etc.). For example, wouldn’t it be great if an annotation you made of a book in Project Gutenberg also showed up when you were viewing an epub of the same book in your browser with epub.js?
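Since this service doesn't exist yet, the following is purely a thought experiment about what "sharing a certain number of attributes" could mean. The attribute names, the normalization, and the threshold of two matching attributes are all hypothetical choices, not a design.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(str(value).lower().split())


def shared_attributes(doc_a, doc_b, keys=("title", "author", "date")):
    """Count the attributes that are present in both records and equal."""
    return sum(
        1
        for key in keys
        if doc_a.get(key)
        and doc_b.get(key)
        and normalize(doc_a[key]) == normalize(doc_b[key])
    )


def should_merge(doc_a, doc_b, threshold=2):
    """Hypothetical rule: merge when enough attributes agree."""
    return shared_attributes(doc_a, doc_b) >= threshold
```

Exact matching after normalization is the simplest possible rule. Real metadata is messy enough that a working service would likely need fuzzier comparison and some way to undo a bad merge.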

Speaking of JavaScript, we’re also working on leveraging the fine work that Mozilla has been doing on pdf.js which makes PDFs renderable in the browser without the need for a plugin from Adobe. This is important because the annotator needs a DOM in order to generate and record annotations. As of February 19th Firefox now ships with pdf.js, so if you are using Hypothes.is with Firefox you should be able to annotate PDFs today. For users of other browsers like Chrome we are interested in examining whether we can bundle pdf.js along with a browser extension, to bring annotation of PDFs into other environments.

So it’s exciting times here at Hypothes.is! We’ll be writing up another version of this blog post as simplified instructions for publishers who want to make sure that their Web content is friendly for Hypothes.is. We think people will do this because we’re really just reinforcing existing best practices on the Web, and not introducing new standards or patterns that need to be evangelized. We would be interested in any feedback you might have about our recent work on cross-format-annotation, and would love to hear your ideas for making it better.