LOCKSS Technical Manual

From Plnwiki

Anthony's list will go here

LOCKSS applications

LOCKSS is a versatile technology which can be used for many applications.

Originally, the open-source LOCKSS software was developed for the Global LOCKSS Network (GLN), a solution for the post-cancellation and perpetual access of academic publications.

A Private LOCKSS Network (PLN) uses the same technology to preserve collections which are specific to a community of institutions.

Type of usage

  • preserving access to content (e.g. the LOCKSSdm “proxy”)
  • backing up content (simple storage, no curation, every modification goes in)
  • preserving content over the long term (which implies curation)

Type of content

  • relatively static content (journals, ...)
  • dynamic content (institutional repositories, websites)

Type of access

  • public - light archive (e.g. GLN)
  • private - dark archive

The differences between these applications lie mainly in the configuration parameters of the LOCKSS software.

This manual aims to present the most important parameters and to describe concretely how to set up a Private LOCKSS Network.

Basic Private LOCKSS Network infrastructure

Basic PLN infrastructure

A PLN is a generic framework. For the sake of illustration, the figure represents one specific implementation of a PLN (MetaArchive).

  • LOCKSS caches: The LOCKSS caches are the nodes of the network in which digital objects are preserved and monitored as archival units (AUs). One archival unit is typically a one-year collection (size: 1GB to 20GB). The size of an AU is a trade-off: large AUs carry a higher processing overhead, while a multitude of small AUs makes it harder to guarantee that the integrity of every AU is checked regularly. The daemon is a Java application which collects digital objects through HTTP requests from the original website, stores them inside the cache as an archival unit, computes a SHA-1 checksum and regularly monitors their integrity by comparing the preserved content with that of the other caches in the network, using a specific voting protocol (LCAP). The different nodes in the network collect the AUs at different moments in time to reduce the risk of communication issues. The content is regularly recrawled from the original website if it is still available. If a new version of the AU is available, the previous version is kept, but only the most recent version is checked for fixity. A LOCKSS cache can also be configured as a proxy to deliver the preserved content. The local configuration of a LOCKSS cache can be managed through a web-based user interface. This interface allows local administrators to add or remove AUs from the cache, to view the daemon status and to modify the local cache configuration.
LOCKSS cache web-based user interface
  • The administrative server: The administrative server is a simple web server that delivers the global configuration (lockss.xml) to the caches composing the PLN. Local cache settings can, however, override the global configuration. The server generally also hosts the plugin repository and the title databases. In the specific case of MetaArchive, a conspectus tool manages the creation of title databases. The administrative server must be accessible to all PLN nodes so that they can retrieve their global configuration.
  • Conspectus database (specific to the MetaArchive configuration): This contains the functional metadata needed to maintain the LOCKSS title database and the non-functional metadata required by the maintainers of the preservation network: collection title, description, coverage, format, language, ownership, related collections, base URL, crawling information (plugin identifier, plugin parameters, manifest page, OAI provider). This database automatically generates and updates the title databases.
  • Title database: One or more XML files describing the content of the archival units: where to find the corresponding plugin to collect them, and the archival unit base URL.
  • Plugin repository: The repository is typically a versioning server that stores the plugins. A plugin tells the LOCKSS box how to crawl (filtering rules) and audit the content, and where to find the manifest page. Only signed plugins are accepted (based on a keystore).
  • Website manifest pages: These provide permission for LOCKSS caches to crawl and harvest an archival unit from the web server.
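
As an illustration of the integrity monitoring described for the LOCKSS caches above, here is a minimal Python sketch. It is not the LCAP protocol: the function names and the simple majority rule are assumptions made for this example, and real polls apply quorum rules that are not modelled here.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 digest of a preserved object as a hex string."""
    return hashlib.sha1(content).hexdigest()


def find_disagreeing_caches(copies: Dict[str, bytes]) -> List[str]:
    """Return the names of caches whose copy differs from the majority hash.

    Hypothetical helper: a toy majority count, not the real LCAP poll.
    """
    digests = {cache: sha1_hex(data) for cache, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(c for c, d in digests.items() if d != majority_digest)


# Three caches hold the same object; one copy has been damaged.
copies = {
    "cache-a": b"volume 12, issue 3",
    "cache-b": b"volume 12, issue 3",
    "cache-c": b"volume 12, issue 3 (damaged)",
}
damaged = find_disagreeing_caches(copies)
print(damaged)
```

In this toy run only cache-c is reported, since its digest differs from the one shared by the other two caches.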

How to set up a PLN admin server properly (what are the important parameters?)

A complete list of parameters (and their descriptions) is available, but it would be useful to have a list of the most important parameters (~100 parameters according to Tom) with a more detailed description of each (typical values, range of acceptable values).

How to install and configure a LOCKSS daemon on a PLN node?

Prerequisite: CentOS 6.4 basic server install with the following open TCP ports: 8080 (proxy), 8081 (webadmin), 9729 (V3 poll) and 22 (ssh)

  1. Log in as root to your lockss cache
    > ssh root@[your_lockss_box_ip_address]
  2. Get lockss-daemon rpm from Github (https://github.com/lockss/lockss-daemon/releases)
  3. Install rpm
    > rpm -i lockss-daemon-1.70.-1.noarch.rpm [or whichever version you downloaded]
  4. Create the local cache directories
    > mkdir /caches
    > mkdir /caches/cache0
    > mkdir /caches/cache1
  5. Run the LOCKSS daemon configuration script
    > /etc/lockss/hostconfig
  6. Answer local configuration questions as follows
    Fully qualified hostname (FQDN) of this machine: [your_lockss_box_hostname]
    IP address of this machine: [your_lockss_box_ip_address]
    Is this machine behind NAT?: [N]
    Initial subnet for admin UI access: []
    LCAP V3 protocol port: [9729]
    PROXY port: [8080]
    Admin UI port: [8081]
    Mail relay for this machine: [localhost] [your smtp server]
    Does mail relay [your smtp server] need user & password: [N]
    E-mail address for administrator: []
    Path to java: [/usr/bin/java]
    Java switches: []
    Configuration URL: [1] #http://[your_admin_server_ip]/lockss.xml
    Configuration proxy (host:port): [NONE]
    Preservation group(s): [prod] testpln
    Content storage directories: [] /caches/cache0; /caches/cache1
    Temporary storage directory: [/caches/cache0/tmp]
    User name for web UI administration: [] lockss
    Password for web UI administration user lockss: []
    Password for web UI administration (again): []
    OK to store this configuration: [Y] y
  7. Start the daemon
    > /etc/init.d/lockss start
    There is no need to run chkconfig on; this is done automatically.
    If any error appears, check /var/log/lockss/stdout for details.
  8. Connect to the web user interface
    With your favorite browser, go to:
    http://[your_lockss_box_ip_address]:8081
    user/pwd: the credentials defined previously
  9. Update your administrative server configuration file (lockss.xml):
    Insert the new cache IP address [your_lockss_box_ip_address] in <property name="id.initialV3PeerList">
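
Once the daemon is running, it can be useful to confirm from another machine that the TCP ports listed in the prerequisite are reachable. The following Python sketch is one way to do this. The helper names are hypothetical, and a successful connection only shows that something is listening on the port, not that the LOCKSS daemon is healthy.

```python
import socket
from typing import Dict

# Ports named in the prerequisite above; the labels are informal.
LOCKSS_PORTS = {8080: "proxy", 8081: "webadmin", 9729: "V3 poll", 22: "ssh"}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def report(host: str) -> Dict[int, bool]:
    """Map each expected port to True (reachable) or False (not reachable)."""
    return {port: port_is_open(host, port) for port in LOCKSS_PORTS}
```

Calling report() with the address of your LOCKSS box returns one entry per port; any False value points to a firewall rule or a service that has not started.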

How to create a title database?

  • Placeholder for MetaArchive title db configuration (forthcoming)

How do you get your content into a LOCKSS box?

This can vary, but typically content is 'staged' on a web server so that it can be harvested by the LOCKSS boxes in a PLN. An example of content staged for this purpose from the Simon Fraser University Library's Editorial Cartoon Collection is available.

LOCKSS content crawlers

LOCKSS preserves collections of URLs. An archival unit in LOCKSS terminology is really a set of URLs (UrlSet) with captured response data and HTTP headers. LOCKSS stores the URL payload bytes (current) and the HTTP headers (current.props) in discrete files. A vote is composed of a SHA-1 hash of the current file and the access URL (see VoteBlock). This type-agnostic method of storing and peer-polling content makes LOCKSS suitable for preserving practically every type of static file accessible by HTTP. Since the poller compares a vote's access URL before the content hash, potential SHA-1 collisions are effectively mitigated. Examine the LOCKSS file system to understand the relationship between the crawl URL and the file system organization.
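
The comparison order described above (access URL first, then content hash) can be sketched in Python as follows. The Vote type and the helper functions are hypothetical stand-ins for illustration; they do not reproduce the daemon's VoteBlock class.

```python
import hashlib
from typing import NamedTuple


class Vote(NamedTuple):
    """Hypothetical stand-in for one vote entry: access URL plus content hash."""
    url: str
    sha1: str


def make_vote(url: str, payload: bytes) -> Vote:
    """Hash the payload bytes only; the file type plays no role."""
    return Vote(url=url, sha1=hashlib.sha1(payload).hexdigest())


def votes_agree(mine: Vote, theirs: Vote) -> bool:
    """Compare the access URL first, and only then the content hash."""
    if mine.url != theirs.url:
        return False
    return mine.sha1 == theirs.sha1


here = make_vote("http://www.example.org/coll/item1.pdf", b"%PDF-1.4 sample bytes")
peer = make_vote("http://www.example.org/coll/item1.pdf", b"%PDF-1.4 sample bytes")
damaged = make_vote("http://www.example.org/coll/item1.pdf", b"%PDF-1.4 sampl\x00 bytes")
moved = make_vote("http://www.example.org/other/item1.pdf", b"%PDF-1.4 sample bytes")

print(votes_agree(here, peer))     # same URL, same bytes
print(votes_agree(here, damaged))  # same URL, different bytes
print(votes_agree(here, moved))    # same bytes, different URL
```

Because the URL is checked before the hash, two files with identical bytes at different URLs never count as agreement.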

There are two basic types of content crawler in LOCKSS: the new content crawler and the repair crawler. The new content crawler is given a start URL and crawl scope rules (e.g. descend into URL only). The new content crawler then follows links accordingly and discovers new content. The new content crawler, in typical usage, only looks at the publisher's 'content staging area'. Both the new content crawler and the OAI crawler extend FollowLinkCrawler. Modified files captured by the new content crawler are saved as revisions.
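
To make the behaviour of the new content crawler concrete, here is a small Python sketch of link following restricted by a crawl scope rule. It runs against an in-memory dictionary instead of a real web server. The function and variable names are assumptions for this example, and it does not reproduce FollowLinkCrawler.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl(start_url: str,
          fetch_links: Callable[[str], List[str]],
          in_scope: Callable[[str], bool]) -> List[str]:
    """Follow links breadth-first from start_url, keeping only in-scope URLs."""
    seen = {start_url}
    queue = deque([start_url])
    collected = []
    while queue:
        url = queue.popleft()
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append(link)
    return collected


# Toy content staging area: each URL maps to the links found on that page.
SITE: Dict[str, List[str]] = {
    "http://staging.example.org/cartoons/": [
        "http://staging.example.org/cartoons/1990/",
        "http://staging.example.org/about.html",   # outside the start URL
    ],
    "http://staging.example.org/cartoons/1990/": [
        "http://staging.example.org/cartoons/1990/a.jpg",
        "http://staging.example.org/cartoons/",    # back-link, already seen
    ],
}

start = "http://staging.example.org/cartoons/"
urls = crawl(start,
             fetch_links=lambda u: SITE.get(u, []),
             in_scope=lambda u: u.startswith(start))  # "descend into URL only"
print(urls)
```

The about.html page is never collected because it falls outside the start URL, which is the effect of a "descend into URL only" rule.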

The repair crawler is given specific URLs. It does not follow links or discover new content. It can either crawl the publisher's content staging area or request repair data from PLN peers with V3 LCAP messaging (see requestRepairFromPeer in V3Poller.java). Received repairs are not saved as revisions.

Okay, so what?

A PLN is not required to maintain a content staging area with a copy of all content. However, the only method to populate a node is with the new content crawler, so a PLN should be aware of the necessity of a content staging area. Any sort of data recovery that bypasses the content staging area requires a populated LOCKSS node. Also, LOCKSS stores the URL response headers and bytes in a file system directory tree analogous to the URL segments of the access URL, so any sort of ephemeral URL scheme used to provide initial access may be inappropriate for LOCKSS in the long term. Revisions and duplication control in the LOCKSS cache depend on two factors, the file and the URL. A modification of the access URL that does not violate the crawl spec will result in a new file in a new directory tree in the LOCKSS cache. That being said, once the nodes of a PLN have cached the URL and established something resembling a quorum, the content staging area is not a part of the routine polling and voting methods (unless it is configured as a repair source for the repair crawler).
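
The following Python sketch illustrates why a changed access URL produces a new directory tree. The mapping used here (host, then scheme, then one directory per URL segment) is a hypothetical layout chosen for illustration; the real LOCKSS repository adds its own directories and escaping rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_cache_path(root: str, url: str) -> PurePosixPath:
    """Map an access URL to a directory path, one directory per URL segment.

    Hypothetical layout for illustration only.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return PurePosixPath(root, parts.netloc, parts.scheme, *segments)


old = url_to_cache_path("/caches/cache0",
                        "http://www.example.org/coll/2013/item1.pdf")
new = url_to_cache_path("/caches/cache0",
                        "http://www.example.org/collection/2013/item1.pdf")
print(old)
print(new)
```

The same file served under a renamed path segment lands in a different tree, so it is stored as new content and not as a revision of the old file.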

How to create, sign and publish a plug-in?

Plugins define what content gets harvested into the LOCKSS network.

How to test a plugin (or test other aspects of a PLN)

How to ingest content with the UI?

How to monitor your AUs status?

How to replace, upgrade or insert a new node in a PLN already in production?

When a new cache is inserted in the PLN (because a new institution contributes to the PLN, a cache is repaired or a cache disk is replaced on the regular 3-year cycle), and if this new cache shows a valid security certificate, it should collect AUs not from the original source (the AU publisher) but directly from another cache in the PLN configured as a proxy. The idea is that the PLN should outlive the institutions, which implies that the safe source of information is the PLN itself and not the original source, which is assumed to be more prone to attacks and less safe than our PLN. This is a different point of view from "a LOCKSS cache can be used as a proxy for the original server at any time", which assumes that the original source is always the reference and which, in my understanding, should be used only for the GLN and not for PLNs. If I understand correctly, this behavior should be easy to configure by setting the AU parameter org.lockss.crawler.fetch_from_other_caches_only to true, forcing the cache to collect the AU from other caches.

Need Clarification

Can be deleted

The parameter org.lockss.crawler.fetch_from_other_caches_only is only used in the RepairCrawler. While repair data can be sent via the V3 LCAP messaging protocol (peer to peer; search particularly for requestRepairFromPeer() in V3Poller.java), it appears that the enumerable list of URLs needing repair, which the repair crawler requires, is only produced by the V3 poll vote tallying methods. I don't think a new node with no AU and a vacated publisher staging area can participate in a poll. See Repair Crawler for more.

In terms of the GLN and access rights, I can see the applicability of this logic; otherwise a random network node could populate its own cache without proof of possession or rights to possess the data.

Data transfer options

1. Populate the original content staging area.

The content staging area should maintain the original URL access scheme to provide seamless integration of a new node into an existing network.

The data on a publisher's content staging area is authoritative. However, data on the content staging area is not likely to be monitored for corruption, degradation or accidental change. While corrupted data will likely be repaired in a node after a poll (if sufficient nodes agree over quorum and the too_close limit is met), modifications that change a file's last write time will result in all nodes ingesting the content and pushing the original unmodified data out of polling scope. (Optional) input validation in the daemon would alleviate some of these concerns. This is a wish list item for PLNs that use the standard RPM release.

2. Modify AU title list parameters to specify a LOCKSS cache as a crawl proxy (see Crawl Proxy).

This method proxies an AU start URL through the crawl_proxy parameter and enables FollowLinkCrawler to gather data from a peer node. Title list parameter definition:

<property name="param.98">
  <property name="key" value="crawl_proxy" />
  <property name="value" value="a_lockss_node.domain.tld:8082" />
</property>

This approach establishes a per-AU proxy host and is different than setting the global parameters org.lockss.crawler.proxy.enabled, org.lockss.crawler.proxy.host and org.lockss.crawler.proxy.port in the Expert Config.
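
When many AUs need the same per-AU proxy, the title list fragment shown above could be generated with a script. The Python sketch below builds that fragment with the standard library. The function name is hypothetical, and the parameter index (98) and host value are simply the ones from the example above.

```python
import xml.etree.ElementTree as ET


def crawl_proxy_param(index: int, proxy_host_port: str) -> ET.Element:
    """Build a <property name="param.N"> element carrying a crawl_proxy value."""
    param = ET.Element("property", {"name": f"param.{index}"})
    ET.SubElement(param, "property", {"name": "key", "value": "crawl_proxy"})
    ET.SubElement(param, "property", {"name": "value", "value": proxy_host_port})
    return param


element = crawl_proxy_param(98, "a_lockss_node.domain.tld:8082")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

The output is the same structure as the hand-written definition, serialized on a single line.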

3. File system copy (see Copying AUs, Different Nodes).

A few files must be expunged from the transfer, and others kept, in order to convince the repository manager that a new content crawl has been completed. A new content crawl must be completed before the repository manager engages in polling for an archival unit.

4. Enabling ICP on a server as an option? (see ICP Server, untested)

This goes back to the problem that the node has no knowledge of the AU; I don't see how this would work.

While this method probably won't work for populating a node with content, coupling the ICP server with a Squid proxy instance to establish a single, unified reference point for title awareness of partitioned cache data across uneven nodes would be an interesting exercise (see Partitioned Caches and Title Awareness).

Advanced configuration, fine-tuning and more

Securing your PLN

For the web interfaces, IP filtering is turned on by default, and can be configured through the UI (Admin Access Control, Content Access Control). Stronger security is available (SSL, form authentication with inactivity timeouts, password rotation and strength requirements, etc.) for all but the proxy. The easiest way to enable this is to choose one of the pre-defined policies by setting org.lockss.accounts.policy to one of {BASIC, FORM, SSL, LC}. BASIC just enables the user accounting subsystem and allows you to create additional user accounts; FORM replaces the basic auth method (browser popup) with an HTML form; SSL replaces HTTP with HTTPS; and LC enables the strict Library of Congress security policies (strong passwords, password rotation, session timeouts, auditing of user-initiated events, etc.). When you enable SSL you can either give it a server certificate or let it generate one itself (in which case browsers will produce a scary warning about a self-signed certificate, which users will have to accept). I recommend you use the defaults at first and turn on additional security once you have things working.
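
As a small safeguard when editing the configuration by hand, the policy value can be validated before it is written. The Python sketch below checks a value against the four policies named above. The helper name is hypothetical, and the XML form it returns is an assumption that mirrors the property syntax used elsewhere in this manual.

```python
VALID_POLICIES = ("BASIC", "FORM", "SSL", "LC")


def accounts_policy_property(policy: str) -> str:
    """Return a property line for org.lockss.accounts.policy, or raise ValueError.

    The XML shape is assumed for illustration; check it against your lockss.xml.
    """
    normalized = policy.strip().upper()
    if normalized not in VALID_POLICIES:
        raise ValueError(
            f"unknown policy {policy!r}; expected one of {', '.join(VALID_POLICIES)}"
        )
    return f'<property name="org.lockss.accounts.policy" value="{normalized}" />'


line = accounts_policy_property("ssl")
print(line)
```

Rejecting unknown values early avoids publishing a misspelled policy name to every cache in the network.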

LCAP communication is unencrypted by default. If you put the appropriate keystores in /etc/lockss/keys, the startup scripts will automatically enable v3overSSL and sslClientAuth, which will cause all communication to be encrypted and all connections to be verified as coming from a node with a valid key. It's best to leave this until all or most of the nodes are up and running as generating and distributing the key files adds a nuisance factor.

In-depth Guide to LOCKSS

Tobin Cataldo at ADPN has created an in-depth guide to the LOCKSS software.

Software that works with LOCKSS

There's a list.

Basic documentation describing how to set-up a simple testPLN environment

Media:PLN_kick_off_meeting_290413.pdf

APIs

Some basic documentation and examples are available.

Web Services documentation.

Wish List

LOCKSS (Daemon / UI / File System) Features Unofficial Wish List.

LOCKSS Program Documents

TRAC certification documents for the CLOCKSS Archive are at DSHR's blog and the CLOCKSS Wiki.