LOCKSS Technical Manual


Anthony's list will go here

LOCKSS applications

LOCKSS is a versatile technology that can be used for many applications.

Type of usage

  • preserving access to content (e.g. the LOCKSSdm “proxy”)
  • backing up content (simple storage, no curation, every modification goes in)
  • preserving content for the long term (which implies curation)

Type of content

  • mostly static content (journals, ...)
  • dynamic content (institutional repositories, websites)

Type of access

  • public (e.g. GLN)
  • private (SSL, over VPN)

The difference between these applications lies mainly in the lockss.xml configuration (some 767 parameters, many of them lacking detailed documentation and many of them unused).

Basic Private LOCKSS Network infrastructure

Figure: Basic PLN infrastructure (first draft, to be improved later)

A PLN is a generic framework. For the sake of illustration, the figure represents one specific implementation of a PLN composed of:

  • LOCKSS caches: the LOCKSS caches are the nodes of the network in which digital objects are preserved and monitored as archival units (AUs). An archival unit is typically a one-year collection (size: 1 GB to 20 GB). The daemon is a Java application which fetches and monitors the data inside the cache. The local configuration of a LOCKSS cache can be set through a web-based user interface. The size of an AU results from a trade-off between the processing overhead required by large AUs and the guarantee that the integrity of all AUs can be checked regularly when there is a multitude of small AUs.
  • The administrative server: the admin server mainly contains the global configuration of the PLN nodes (lockss.xml, ...), which needs to be accessible to all PLN nodes; local node configuration, however, stays on each node. Its purpose is to maintain the whole PLN.
  • Conspectus database (specific to the MetaArchive configuration): contains functional metadata used to maintain the LOCKSS title database, and non-functional metadata required by the maintainers of the preservation network: collection title, description, coverage, format, language, ownership, related collections, base URL, and harvesting information (plugin identifier, plugin parameters, manifest page, OAI provider). The conspectus database generates and updates the title databases.
  • Title database: one or more XML files describing the archival units: which plugin to use (and where to find it), the archival unit base URL, and the IP addresses of all the caches in the network (see the sketch after this list).
  • Plugins repository: a versioning server that stores the plugins. A plugin tells the LOCKSS box how to crawl (filtering rules) and audit the content, and where to find the manifest page. Only signed plugins are accepted (keystore), and plugin registries are re-read every 6 hours.
  • Manifest page: grants LOCKSS permission to crawl and harvest an archival unit from the web server.
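
As a rough illustration of the title database format, here is a minimal sketch. It assumes the nested property syntax shown in the crawl_proxy example further down this page; the collection name, plugin identifier and URLs are placeholders rather than a real collection, and the org.lockss.title prefix is given from memory, not from this page.

<lockss-config>
  <property name="org.lockss.title">
    <!-- One illustrative archival unit of a fictional collection -->
    <property name="ExampleCollection2013">
      <property name="title" value="Example Collection 2013" />
      <property name="plugin" value="org.example.plugin.ExampleCollectionPlugin" />
      <property name="param.1">
        <property name="key" value="base_url" />
        <property name="value" value="http://staging.example.org/" />
      </property>
      <property name="param.2">
        <property name="key" value="year" />
        <property name="value" value="2013" />
      </property>
    </property>
  </property>
</lockss-config>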

How to set up a PLN admin server properly (what are the important parameters?)

A complete list of parameters (with their descriptions) is available, but it would be useful to have a list of the most important parameters (~100 parameters according to Tom) together with a more detailed description of each (typical values, range of acceptable values).
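
Pending such a list, the sketch below shows the overall shape of a PLN lockss.xml on the admin server. It is illustrative only: the parameter names org.lockss.titleDbs and org.lockss.plugin.registries are given from memory, and the URLs are placeholders pointing at the admin server address used as an example later on this page.

<lockss-config>
  <!-- Illustrative skeleton only; parameter names from memory, values are placeholders -->
  <property name="org.lockss.titleDbs">
    <list>
      <value>http://164.15.1.1/titledb.xml</value>
    </list>
  </property>
  <property name="org.lockss.plugin.registries">
    <list>
      <value>http://164.15.1.1/plugins/</value>
    </list>
  </property>
</lockss-config>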

How to install and configure a LOCKSS daemon on a PLN node?

  • Prerequisite: CentOS 6.4 basic server install with the following open TCP ports: 8080 (proxy), 8081 (webadmin), 9729 (V3 poll) and 22 (ssh)
  1. Log in as root
    > ssh root@[your_lockss_box_ip_address]
  2. Get lockss-daemon rpm from sourceforge
    > cd ~
    > wget http://downloads.sourceforge.net/project/lockss/Daemon/1.60.3/lockss-daemon-1.60.3-1.noarch.rpm
  3. Install rpm
    > rpm -i lockss-daemon-1.60.3-1.noarch.rpm
  4. Create local caches folders
    > mkdir /caches
    > mkdir /caches/cache0
    > mkdir /caches/cache1
  5. Run the LOCKSS daemon configuration script
    > /etc/lockss/hostconfig
  6. Answer local configuration questions as follows
    Fully qualified hostname (FQDN) of this machine: [lockssbox1.ulb.ac.be]
    IP address of this machine: [164.15.4.112]
    Is this machine behind NAT?: [N]
    Initial subnet for admin UI access: [164.15.4.0/24]
    LCAP V3 protocol port: [9729]
    PROXY port: [8080]
    Admin UI port: [8081]
    Mail relay for this machine: [localhost] smtp.ulb.ac.be
    Does mail relay smtp.ulb.ac.be need user & password: [N] y
    User for smtp_server: [] lockssadmin@ulb.ac.be
    Password for lockssadmin@ulb.ac.be @ smtp.ulb.ac.be: []
    Again: []
    E-mail address for administrator: [] lockssadmin@ulb.ac.be
    Path to java: [/usr/bin/java]
    Java switches: []
    Configuration URL: [1] http://164.15.1.1/lockss.xml
    Configuration proxy (host:port): [NONE]
    Preservation group(s): [prod] testpln
    Content storage directories: [] /caches/cache0; /caches/cache1
    Temporary storage directory: [/caches/cache0/tmp]
    User name for web UI administration: [] lockss
    Password for web UI administration user lockss: []
    Password for web UI administration (again): []
    OK to store this configuration: [Y] y
  7. Start the daemon
    > /etc/init.d/lockss start
  8. Connect to the web interface
    With your favorite browser, go to http://your_ip_address:8081 and log in
    with the previously defined credentials.
    There is no need to run chkconfig on; this is done automatically.
    If any error appears, check /var/log/lockss/stdout for more information.
  9. Ask the PLN admin to update
  • the lockss.xml file with the new cache's IP address (a sketch of this addition follows these steps)
  • the Nagios monitoring server configuration
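
A minimal sketch of what the PLN admin would add to lockss.xml for the new cache, assuming the network lists its peers in org.lockss.id.initialV3PeerList (a parameter name given from memory) and using the example IP address and LCAP V3 port from step 6:

<lockss-config>
  <property name="org.lockss.id.initialV3PeerList">
    <list>
      <!-- existing peers ... -->
      <value>TCP:[164.15.4.112]:9729</value> <!-- the newly installed node -->
    </list>
  </property>
</lockss-config>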

How to create a title database?

  • Placeholder for MetaArchive title db configuration (forthcoming)

How to expose your content to LOCKSS?

Or how do you get your content into a LOCKSS box?

This can vary, but typically content is 'staged' on a web server so that it can be harvested by the LOCKSS boxes in a PLN. An example of content staged for this purpose from the Simon Fraser University Library's Editorial Cartoon Collection is available.
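
The staging area must also expose a manifest page granting permission to crawl (see the infrastructure list above). A minimal sketch, assuming the standard LOCKSS permission statement; the title and link below are purely illustrative:

<!-- Illustrative manifest page placed in the staged content -->
<html>
  <head>
    <title>Example Collection 2013 - LOCKSS Manifest Page</title>
  </head>
  <body>
    <p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit.</p>
    <!-- Links from here let the new content crawler discover the AU's files -->
    <a href="2013/">Example Collection, 2013 content</a>
  </body>
</html>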

LOCKSS Content Crawlers

LOCKSS preserves collections of URLs. An archival unit in LOCKSS terminology is really a set of URLs (UrlSet) with captured response data and HTTP headers. This type-agnostic method of storing and peer-polling content makes LOCKSS suitable for the preservation of practically every type of static document accessible over HTTP. Examine the LOCKSS file system to understand the relationship between crawl URLs and the file system organization.

There are two basic types of content crawler in LOCKSS: the new content crawler and the repair crawler. The new content crawler is given a start URL and crawl scope rules (e.g. descend into URL only). The new content crawler then follows links accordingly and discovers new content. The new content crawler, in typical usage, only looks at the publisher's 'content staging area'. Both the new content crawler and the OAI crawler extend FollowLinkCrawler. Modified files captured by the new content crawler are saved as revisions.

The repair crawler is given specific URLs. The repair crawler does not follow links or discover new content. It can either crawl the publisher's content staging area or request repair data from PLN peers via V3 LCAP messaging (see requestRepairFromPeer in V3Poller.java). Received repairs are not saved as revisions.

How to create, sign and publish a plug-in?

Plugins define what content gets harvested into the LOCKSS network.
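
A heavily abbreviated sketch of a plugin definition follows, assuming the XML map format used for definable LOCKSS plugins. The identifier, names, start URL and crawl rules are placeholders, the parameter declarations (plugin_config_props) are omitted for brevity, and the crawl-rule action codes (1 = include on match, 4 = exclude if no match) are given from memory.

<!-- Illustrative only, not a real plugin -->
<map>
  <entry>
    <string>plugin_identifier</string>
    <string>org.example.plugin.ExampleCollectionPlugin</string>
  </entry>
  <entry>
    <string>plugin_name</string>
    <string>Example Collection Plugin</string>
  </entry>
  <entry>
    <string>plugin_version</string>
    <string>1</string>
  </entry>
  <entry>
    <!-- Where the crawl begins: the AU's manifest page -->
    <string>au_start_url</string>
    <string>"%slockss-manifest-%d.html", base_url, year</string>
  </entry>
  <entry>
    <string>au_name</string>
    <string>"Example Collection Plugin, Base URL %s, Year %d", base_url, year</string>
  </entry>
  <entry>
    <!-- Crawl scope rules: stay under base_url, include everything that matches -->
    <string>au_crawlrules</string>
    <list>
      <string>4,"^%s", base_url</string>
      <string>1,"^%s", base_url</string>
    </list>
  </entry>
</map>

In a typical workflow the definition is packaged into a JAR, signed with a key trusted by the network keystore, and published to the plugins repository; as noted in the infrastructure list above, only signed plugins are accepted and plugin registries are re-read every six hours.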

How to ingest content with the UI?

How to monitor your AUs status?

How to replace, upgrade or insert a new node in a PLN already in production?

When a new cache is inserted in the PLN (because a new institution joins the PLN, a cache is repaired, or a disk reaches its regular three-year replacement) and this new cache presents a valid security certificate, it should collect AUs not from the original source (the AU publisher) but from another cache in the PLN configured as a proxy. The idea is that the PLN should outlive the institutions, which implies that the safe source of information is the PLN itself and not the original source, which is more exposed to attacks and less safe than our PLN. This is a different point of view from "a LOCKSS cache can be used as a proxy for the original server at any time", which assumes that the original source is always the reference and which, in my understanding, should only be used for the GLN and not for PLNs. If I understand correctly, this behaviour should be easy to configure by setting the AU parameter org.lockss.crawler.fetch_from_other_caches_only to true, forcing the cache to collect the AU from other caches.

Need Clarification

Can be deleted

The parameter org.lockss.crawler.fetch_from_other_caches_only is only used in the RepairCrawler. While repair data can be sent via the V3 LCAP messaging protocol (peer to peer; search in particular for requestRepairFromPeer() in V3Poller.java), it appears that the requisite enumerable list of URLs needing repair, as required by the repair crawler, is only produced by the V3 poll vote-tallying methods. I don't think a new node, with no AU content and with the publisher staging area vacated, can participate in a poll. See Repair Crawler for more.

In terms of the GLN and access rights, I can see the applicability of this logic; otherwise a random network node could populate its own cache without proof of possession of, or of the right to possess, the data.

Data Transfer Options

1. Modifying AU title list parameters to specify a LOCKSS cache as a crawl proxy. Crawl Proxy

This method proxies an AU's start URL through the crawl_proxy parameter and enables FollowLinkCrawler to gather data from a peer node. Title list parameter definition:

<property name="param.98">
  <property name="key" value="crawl_proxy" />
  <property name="value" value="a_lockss_node.domain.tld:8082" />
</property>

This approach establishes a per-AU proxy host and is different from setting the global parameters org.lockss.crawler.proxy.enabled, org.lockss.crawler.proxy.host and org.lockss.crawler.proxy.port in the Expert Config.
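
For contrast, a sketch of the global form, reusing the host and port from the title list example above; whether these parameters are entered in each node's Expert Config or pushed network-wide through lockss.xml is a deployment choice, and the values shown are placeholders:

<!-- Illustrative only: network-wide crawl proxy instead of a per-AU crawl_proxy -->
<property name="org.lockss.crawler.proxy.enabled" value="true" />
<property name="org.lockss.crawler.proxy.host" value="a_lockss_node.domain.tld" />
<property name="org.lockss.crawler.proxy.port" value="8082" />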

2. File system copy. Copying AUs, Different Nodes

There are a few files to expunge from the transfer, and a few to keep, in order to convince the repository manager that a new content crawl has been completed.

3. Enabling ICP on a server as an option? ICP Server (untested)

This goes back to the fact that the node has no knowledge of the AU; I don't see how this would work.

Advanced configuration, fine-tuning and more

Securing your PLN

For the web interfaces, IP filtering is turned on by default and can be configured through the UI (Admin Access Control, Content Access Control). Stronger security is available (SSL, form authentication with inactivity timeouts, password rotation and strength requirements, etc.) for everything but the proxy. The easiest way to enable it is to choose one of the pre-defined policies by setting org.lockss.accounts.policy to one of {BASIC, FORM, SSL, LC}. BASIC just enables the user accounting subsystem and allows you to create additional user accounts; FORM replaces the basic auth method (browser popup) with an HTML form; SSL replaces HTTP with HTTPS; and LC enables the strict Library of Congress security policies (strong passwords, password rotation, session timeouts, auditing of user-initiated events, etc.). When you enable SSL you can either give it a server certificate or let it generate one itself (in which case browsers will produce a scary warning about a self-signed certificate, which users will have to accept). I recommend you use the defaults at first and turn on additional security once you have things working.
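
As an illustration, selecting the strictest pre-defined policy takes a single parameter, shown here in the property syntax used elsewhere on this page (it could equally be entered as org.lockss.accounts.policy=LC in the Expert Config):

<!-- Choose one of BASIC, FORM, SSL or LC -->
<property name="org.lockss.accounts.policy" value="LC" />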

LCAP communication is unencrypted by default. If you put the appropriate keystores in /etc/lockss/keys, the startup scripts will automatically enable v3overSSL and sslClientAuth, which will cause all communication to be encrypted and all connections to be verified as coming from a node with a valid key. It's best to leave this until all or most of the nodes are up and running as generating and distributing the key files adds a nuisance factor.

In-depth Guide to LOCKSS

Tobin Cataldo at ADPN has created an in-depth guide to the LOCKSS software.

Software that works with LOCKSS

There's a list.

Basic documentation describing how to set up a simple testPLN environment

Media:PLN_kick_off_meeting_290413.pdf

APIs

Some basic documentation and examples are available.