
Fund: On-Demand Web Archiving of Annotated Pages

By bigbluehat | 8 May, 2015

We’re excited to announce the second project funded through the Open Annotation Fund: $3,000 for Ilya Kreymer to develop an API for on-demand web archiving of annotated pages.

The funded proposal is included below:

Summary

Whenever a web page changes or disappears, annotations on the page may no longer be viewable, unless the original content is preserved. The purpose of this project is to ensure that an archival recording is made of the annotated page.

The proposal is to build a simple service that is triggered when an annotation is made and archives the full page by loading it in a headless browser through an existing web archiving tool.

Initially, this service will be used with the Internet Archive Save Page Now feature, which can archive any page on demand (except those excluded by robots.txt) and add it to the IA Wayback Machine.

This service can also be used with webrecorder.io or any other web archiving service to create an on-demand web archive.
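For illustration only (this snippet is not part of the funded proposal), triggering Save Page Now amounts to a plain HTTP GET against the /save/ endpoint; a minimal Python version might look like this:

    import requests

    def save_page_now(url):
        """Ask the Internet Archive's Save Page Now feature to archive `url`.

        The /save/<url> endpoint is hit with a simple GET; here we only report
        whether that request succeeded, without assuming anything about the
        response body or headers.
        """
        resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
        return resp.status_code

    if __name__ == "__main__":
        print(save_page_now("https://example.com/"))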

Requirements

Whenever an annotation is made, a request is made to a server-side API (to avoid requiring the user to reload the page in a special "archiving enabled" mode).

This would entail creating a standalone API which (a rough sketch is included after the list below):

  • Accepts a URL to be archived, e.g. via an /archive/<url>/ endpoint
  • Returns the result of the archiving operation, perhaps as a list of the resources that archiving was attempted for, each with a success/failure result
  • Can be triggered directly from the annotator client after the annotation is made, or wrapped as needed. For example, maybe only logged-in users could trigger the API, and maybe only a simple status is returned to the user. Users may have an option to opt in to or out of this feature.
  • It would also be possible to use the API to archive all existing pages with annotations.
  • Source code and documentation for this service will be hosted on GitHub.
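As a rough sketch of such a standalone API (assuming a Flask-based service; the endpoint path and response fields are illustrative, not prescribed by the proposal):

    from flask import Flask, jsonify
    import requests

    app = Flask(__name__)

    # The archiving backend to forward requests to; webrecorder.io/record/
    # or another service could be substituted here.
    ARCHIVE_PREFIX = "https://web.archive.org/save/"

    @app.route("/archive/<path:url>")
    def archive(url):
        """Accept a URL to archive and return the result of the operation.

        In practice the target URL would likely be passed percent-encoded or
        as a query parameter; the path converter here is just for illustration.
        """
        try:
            resp = requests.get(ARCHIVE_PREFIX + url, timeout=120)
            results = [{"url": url, "status": resp.status_code}]
        except requests.RequestException as exc:
            results = [{"url": url, "status": "failed", "error": str(exc)}]
        return jsonify({"archived": results})

    if __name__ == "__main__":
        app.run(port=8080)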

Implementation

The implementation will consist of the following:

  1. A web-accessible API endpoint will accept requests for URLs to archive (via a simple HTTP GET).
  2. The server-side handler receiving the API request will use a headless browser (such as PhantomJS) to automatically make a synchronous request to the web archiving service (e.g. web.archive.org/save/<url> or webrecorder.io/record/<url>) to archive the base page URL and any embedded URLs, including dynamic content created by JavaScript (a rough sketch follows this list).
  3. Results from the headless browser will be aggregated and returned to the client as a list of archived URLs and status codes, and may also be stored in a local log.
  4. (Optional) If successful, the result of the archiving operation could be cached for some period of time to avoid duplicate saving within a short window (e.g. if a user makes multiple annotations within a few seconds). This is optional, as the IA /save/ feature already handles this quite well. A content hash could also be checked to avoid saving duplicate content.
  5. (Optional) A script could be written to run all existing URLs for current annotations through this system as a one-off operation.
  6. The service would be configurable to support other endpoints besides IA Save Page Now and webrecorder.io. It could ‘scale horizontally’ as needed by adding more handlers, and could be expanded to work asynchronously through a worker queue if needed.
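As a rough illustration of steps 2–4 (not the funded implementation itself), the sketch below assumes a hypothetical PhantomJS script, render.js, that loads the archiving URL, waits for the page to finish, and prints each requested resource URL and its status, one per line; the in-memory cache stands in for whatever duplicate-suppression store the real service might use.

    import hashlib
    import subprocess
    import time

    # In-memory cache used to skip re-archiving the same URL within a short
    # window (step 4); a shared store could replace this in a real deployment.
    _recently_archived = {}
    CACHE_TTL = 60  # seconds

    def archive_with_headless_browser(url, save_prefix="https://web.archive.org/save/"):
        """Load the archiving endpoint for `url` in a headless browser so that
        embedded resources and JavaScript-generated content are captured too.

        `render.js` is a hypothetical PhantomJS script that opens the given
        page, waits for it to finish loading, and prints "<resource-url> <status>"
        for every request it observed, one per line.
        """
        now = time.time()
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        cached = _recently_archived.get(key)
        if cached and now - cached["time"] < CACHE_TTL:
            return cached["results"]

        proc = subprocess.run(
            ["phantomjs", "render.js", save_prefix + url],
            capture_output=True, text=True, timeout=300,
        )

        # Aggregate the per-resource results reported by the browser script.
        results = []
        for line in proc.stdout.splitlines():
            resource_url, _, status = line.partition(" ")
            if resource_url:
                results.append({"url": resource_url, "status": status or "unknown"})

        _recently_archived[key] = {"time": now, "results": results}
        return results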

Fund Details

Developer: Ilya Kreymer
Estimated Time: 3 weeks
Funding Requested: $3,000
