
Fund: On-Demand Web Archiving of Annotated Pages

By bigbluehat | 8 May, 2015

We’re excited to announce the second project funded through the Open Annotation Fund: $3,000 for Ilya Kreymer to develop an API for on-demand web archiving of annotated pages.

The funded proposal is included below:

Summary

Whenever a web page changes or disappears, annotations on the page may no longer be viewable, unless the original content is preserved. The purpose of this project is to ensure that an archival recording is made of the annotated page.

The proposal is to build a simple service that is triggered when an annotation is made and archives the full page by loading it in a headless browser through an existing web archiving tool.

Initially, this service will be used with the Internet Archive Save Page Now feature, which can archive any page on demand (except those excluded by robots.txt) and add it to the IA Wayback Machine.

This service can also be used with webrecorder.io or any other web archiving service to create an on-demand web archive.
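For illustration only (this snippet is not part of the funded proposal), triggering Save Page Now amounts to a plain HTTP GET against the /save/ endpoint; a minimal Python version might look like this:

    import requests

    def save_page_now(url):
        """Ask the Internet Archive's Save Page Now feature to archive `url`.

        The /save/<url> endpoint is hit with a simple GET; here we only report
        whether that request succeeded, without assuming anything about the
        response body or headers.
        """
        resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
        return resp.status_code

    if __name__ == "__main__":
        print(save_page_now("https://example.com/"))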

Requirements

Whenever an annotation is made, a request is made to a server-side API (to avoid requiring the user to reload the page in a special "archiving enabled" mode).

This would entail creating a standalone API which (a rough sketch is included after the list below):

  • Accepts a URL to be archived, e.g. via an /archive/<url>/ endpoint
  • Returns the result of the archiving operation, perhaps as a list of the resources that archiving was attempted for, each with a success/failure result
  • Can be triggered directly from the annotator client after the annotation is made, or wrapped as needed. For example, maybe only logged-in users could trigger the API, and maybe only a simple status is returned to the user. Users may have an option to opt in to or out of this feature.
  • It would also be possible to use the API to archive all existing pages with annotations.
  • Source code and documentation for this service will be hosted on GitHub.
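As a rough sketch of such a standalone API (assuming a Flask-based service; the endpoint path and response fields are illustrative, not prescribed by the proposal):

    from flask import Flask, jsonify
    import requests

    app = Flask(__name__)

    # The archiving backend to forward requests to; webrecorder.io/record/
    # or another service could be substituted here.
    ARCHIVE_PREFIX = "https://web.archive.org/save/"

    @app.route("/archive/<path:url>")
    def archive(url):
        """Accept a URL to archive and return the result of the operation.

        In practice the target URL would likely be passed percent-encoded or
        as a query parameter; the path converter here is just for illustration.
        """
        try:
            resp = requests.get(ARCHIVE_PREFIX + url, timeout=120)
            results = [{"url": url, "status": resp.status_code}]
        except requests.RequestException as exc:
            results = [{"url": url, "status": "failed", "error": str(exc)}]
        return jsonify({"archived": results})

    if __name__ == "__main__":
        app.run(port=8080)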

Implementation

The implementation will consist of the following:

  1. A web-accessible API endpoint will accept requests for URLs to archive (via a simple HTTP GET).
  2. The server-side handler receiving the API request will use a headless browser (such as PhantomJS) to automatically make a synchronous request to the web archiving service (e.g. web.archive.org/save/<url> or webrecorder.io/record/<url>) to archive the base page URL and any embedded URLs, including dynamic content created by JavaScript (a rough sketch follows this list).
  3. Results from the headless browser will be aggregated and returned to the client as a list of archived URLs and status codes, and may also be stored in a local log.
  4. (Optional) If successful, the result of the archiving operation could be cached for some period of time to avoid duplicate saving within a short window (e.g. if a user makes multiple annotations within a few seconds). This is optional, as the IA /save/ feature already handles this quite well. A content hash could also be checked to avoid saving duplicate content.
  5. (Optional) A script could be written to run all existing URLs for current annotations through this system as a one-off operation.
  6. The service would be configurable to support other endpoints besides IA Save Page Now and webrecorder.io. It could ‘scale horizontally’ as needed by adding more handlers, and could be expanded to work asynchronously through a worker queue if needed.
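As a rough illustration of steps 2–4 (not the funded implementation itself), the sketch below assumes a hypothetical PhantomJS script, render.js, that loads the archiving URL, waits for the page to finish, and prints each requested resource URL and its status, one per line; the in-memory cache stands in for whatever duplicate-suppression store the real service might use.

    import hashlib
    import subprocess
    import time

    # In-memory cache used to skip re-archiving the same URL within a short
    # window (step 4); a shared store could replace this in a real deployment.
    _recently_archived = {}
    CACHE_TTL = 60  # seconds

    def archive_with_headless_browser(url, save_prefix="https://web.archive.org/save/"):
        """Load the archiving endpoint for `url` in a headless browser so that
        embedded resources and JavaScript-generated content are captured too.

        `render.js` is a hypothetical PhantomJS script that opens the given
        page, waits for it to finish loading, and prints "<resource-url> <status>"
        for every request it observed, one per line.
        """
        now = time.time()
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        cached = _recently_archived.get(key)
        if cached and now - cached["time"] < CACHE_TTL:
            return cached["results"]

        proc = subprocess.run(
            ["phantomjs", "render.js", save_prefix + url],
            capture_output=True, text=True, timeout=300,
        )

        # Aggregate the per-resource results reported by the browser script.
        results = []
        for line in proc.stdout.splitlines():
            resource_url, _, status = line.partition(" ")
            if resource_url:
                results.append({"url": resource_url, "status": status or "unknown"})

        _recently_archived[key] = {"time": now, "results": results}
        return results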

Fund Details

Developer: Ilya Kreymer
Estimated Time: 3 weeks
Funding Requested: $3,000
