Talk:Plugins/Plugin XML Format

Overview

A LOCKSS plugin is a set of key-value pairs, possibly supplemented by Java classes. Keys can be simple or compound. Values can be any Java datatype such as string, integer, list, map, or an instance of a LOCKSS class. The syntax is compatible with XStream; the overall structure is a map:

 <map>
   <entry>
     <string>key1</string>
     <string>string_value</string>
   </entry>
   <entry>
     <string>key2</string>
     <type>value_of_type</type>
   </entry>
   ...
 </map>

Compound Keys

Several elements of plugins are inherently map-like, e.g., MIME-type dependent link extractors or link rewriters. For historical reasons (the original plugin syntax didn't support maps as entry values), some of the values that would naturally be maps are instead represented as a set of compound keys with a common suffix, with what would have been the "map" key as the prefix. For example, link extractors are specified with keys of the form MIME-type_link_extractor_factory, such as text/html_link_extractor_factory or application/pdf_link_extractor_factory.

 <entry>
   <string>text/html_link_extractor_factory</string>
   <string>org.lockss.plugin.pkg.HtmlLinkExtractorFactory</string>
 </entry>
 <entry>
   <string>text/xml_link_extractor_factory</string>
   <string>org.lockss.plugin.pkg.XmlLinkExtractorFactory</string>
 </entry>

Value Datatypes

Values in plugin .xml files are typed using XML tags, either basic datatypes such as <string>, <int>, <list>, <map>, or class names such as <org.lockss.daemon.ConfigParamDescr>. Common types are denoted below by:

  • S - string (<string>value</string>)
  • I - integer (<int>NNN</int>)
  • L - long integer (<long>NNN</long>)
  • B - boolean (<boolean>true</boolean> or <boolean>false</boolean>)
  • P - printf format string and arguments (see below)

X+ means either a single X or one or more Xs enclosed in <list> ... </list>. E.g., S+ means either a single string or a list of strings:

 <entry>
   <string>key</string>
   <string>value</string>
 </entry>
 
 <entry>
   <string>key</string>
   <list>
     <string>value1</string>
     <string>value2</string>
     ...
   </list>
 </entry>

Printf Template Strings

Printf format templates are used in many places to parameterize strings with AU configuration parameter values (see plugin_config_props). They consist of a template string enclosed in quotes, followed by a comma-separated list of arguments: <string>"template", arg1, arg2, ...</string>. (Note that while the plugintool writes the quotes as XML entities (&quot;), literal double quotes are legal and easier to read when you're editing the file by hand.)

   <string>"base is %s, year is %d", base_url, year</string>
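For comparison, a sketch of the same template in the entity-quoted form that plugintool is described as writing. The two forms are assumed to be equivalent once the XML is parsed; base_url and year are the parameter names from the example above.

```xml
<!-- Same template as above, with the quotes written as XML entities -->
<string>&quot;base is %s, year is %d&quot;, base_url, year</string>
```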

The arguments must be the names of AU configuration parameters or single-argument functions thereof. The following functions are builtin; plugins may define additional functions using au_param_functor (below).

  • url_host() : Returns the host part of a URL.
  • url_path() : Returns the path part of a URL.
  • add_www() : Adds leading www subdomain to host, if not already present.
  • del_www() : Removes leading www subdomain from host.
  • to_https() : Replaces http: with https: at start of URL.
  • to_http() : Replaces https: with http: at start of URL.
  • url_encode() : URL encodes argument.
  • url_decode() : URL decodes argument.
   <string>"%sjournals/%s/v%03d/", to_https(base_url), journal_dir, volume</string>

In many cases (e.g., in crawl rules and substance patterns) the result of the template expansion is a regular expression.

   <string>"^%s%d/%d-[^./?&amp;]+\.html$", base_url, year, year</string>

Plugin elements

The following elements are sufficient for a basic plugin, suitable for most straightforward sites with static content.

  • plugin_identifier
  • plugin_name
  • plugin_config_props
  • au_name
  • au_start_url
  • au_crawlrules

These elements are commonly used.

  • au_refetch_depth (formerly au_crawl_depth)
  • au_def_pause_time


plugin_identifier (S) 
Package-qualified identifier of the plugin. Should agree with the identifier by which the plugin can be found on the classpath (e.g., org.lockss.plugin.platform.PlatformPlugin).
plugin_name (S) 
Human readable name of the plugin.
plugin_version (S) 
Version number. Informs daemon when a newly collected plugin jar holds a newer version of the plugin. (Default 1).
required_daemon_version (S) 
Minimum daemon version required for this plugin to operate.
plugin_parent (S) 
Identifier of another plugin from which this plugin will inherit attribute values.
plugin_parent_version (S) 
Expected version number of parent plugin.
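A sketch of how a child plugin might declare its parent. The parent identifier and version number here are hypothetical.

```xml
<entry>
  <string>plugin_parent</string>
  <!-- hypothetical parent plugin identifier -->
  <string>org.example.plugin.BasePlatformPlugin</string>
</entry>
<entry>
  <string>plugin_parent_version</string>
  <!-- hypothetical version of the parent this plugin was written against -->
  <string>12</string>
</entry>
```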
plugin_notes (S) 
Commentary displayed along with plugin definition in DaemonStatus.
plugin_bulk_content (B) 
Should be true for plugins whose AUs span multiple publishers.
plugin_publishing_platform (S) 
Name of the publishing platform. Used in displays and reports, useful when more than one plugin is used with the same platform.
plugin_config_props 
List of ConfigParamDescr objects declaring the parameters used to define AUs with this plugin.
 <entry>
   <string>plugin_config_props</string>
   <list>
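     <org.lockss.daemon.ConfigParamDescr>...</org.lockss.daemon.ConfigParamDescr>
     <org.lockss.daemon.ConfigParamDescr>...</org.lockss.daemon.ConfigParamDescr>
     <org.lockss.daemon.ConfigParamDescr>...</org.lockss.daemon.ConfigParamDescr>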
   </list>
 </entry>
Each org.lockss.daemon.ConfigParamDescr describes a parameter accepted by the plugin. For example:
   <org.lockss.daemon.ConfigParamDescr>
     <key>base_url</key>
     <displayName>Base URL</displayName>
     <description>Usually of the form http://&lt;journal-name&gt;.com/</description>
     <type>3</type>
     <size>40</size>
     <definitional>true</definitional>
     <defaultOnly>false</defaultOnly>
   </org.lockss.daemon.ConfigParamDescr>
  • key: Parameter name used in AU definitions
  • displayName: Human-readable name for the parameter
  • description: Human-readable description of the parameter, generally used to help AU creators
  • type: The type of data value allowed for the parameter, one of the following:
    • 1: String
    • 2: Integer
    • 3: URL
    • 4: Year
    • 5: Boolean
    • 6: Positive Integer
    • 7: String Range (e.g., aa-zz)
    • 8: Numeric Range (e.g., 1-14)
    • 9: Set (comma-separated strings, e.g., Jan,Feb,Mar,...,Dec)
    • 10: User:passwd
    • 11: Long Integer
  • size: Width of the entry field in the form used to edit the AU definition
  • definitional: "true" if this parameter is integral to the identity of an AU. "false" otherwise
  • defaultOnly: The value of this parameter should be taken from the currently loaded title db, not stored with the local AU configuration
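A second descriptor, sketched with the Year type (4) from the list above. The key, display name, description and field width are illustrative assumptions.

```xml
<org.lockss.daemon.ConfigParamDescr>
  <key>year</key>
  <displayName>Year</displayName>
  <description>Four digit year (e.g., 2004)</description>
  <!-- 4 = Year -->
  <type>4</type>
  <size>4</size>
  <definitional>true</definitional>
  <defaultOnly>false</defaultOnly>
</org.lockss.daemon.ConfigParamDescr>
```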
plugin_au_config_user_msg (S) 
String which is displayed to the user when one or more AUs of this plugin are added to a LOCKSS box. Some publishers require actions on the part of site admins, such as registering the LOCKSS box as an allowed crawler. This field may be used to inform site admins of such requirements.
plugin_feature_version_map 
Map of feature name (Poll, Metadata, Substance) to version string.

The Poll version is used to determine polling compatibility between AUs on different boxes; if the AUs' plugins have the same Poll feature version the AUs may participate in the same poll, even if the plugins themselves have different versions. The Poll version should be changed whenever some aspect of the plugin that affects polling behavior (such as a hash filter) is changed.

The Metadata and Substance versions inform the daemon when it is necessary to re-extract metadata or recompute the substance state of an AU. The Metadata feature version should be changed whenever the article iterator or metadata extractor(s) are changed in a way that affects the extracted metadata. The Substance feature version should be changed whenever the substance patterns are changed.

 <entry>
   <string>plugin_feature_version_map</string>
   <map>
     <entry>
       <string>Poll</string>
       <string>3</string>
     </entry>
     <entry>
       <string>Substance</string>
       <string>1</string>
     </entry>
     <entry>
       <string>Metadata</string>
       <string>2</string>
     </entry>
   </map>
 </entry>
key_override 
Allows rudimentary conditionalization of the plugin. The value should be a map analogous to the top-level plugin map. If the configuration parameter org.lockss.daemon.testingMode is set to <key>, entries from the <key>_override map will replace those in the plugin. Normally used to, e.g., shorten fetch delays when testing. In the following example, <key> is testing:
 <entry>
   <string>testing_override</string>
   <map>
     <entry>
       <string>au_def_pause_time</string>
       <long>100</long>
     </entry>
   </map>
 </entry>
au_name (P) 
Printf template for the display name given to AUs defined by this plugin. Used when the AU's title db entry does not contain a title property.
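A sketch of an au_name entry, assuming the plugin declares base_url and year parameters. The wording of the name is hypothetical.

```xml
<entry>
  <string>au_name</string>
  <string>"Example Journal Plugin, Base URL %s, Year %d", base_url, year</string>
</entry>
```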
au_param_functor 
Name of a class that implements org.lockss.plugin.AuParamFunctor. Allows plugin to define additional functions which may be used in printf template arguments.
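A sketch of the entry, assuming the value is given as a string like the other class-name elements. The class name is hypothetical.

```xml
<entry>
  <string>au_param_functor</string>
  <!-- hypothetical class implementing org.lockss.plugin.AuParamFunctor -->
  <string>org.example.plugin.ExampleAuParamFunctor</string>
</entry>
```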

Crawl related

au_start_url (P+) 
URL(s) at which crawl starts. Often referred to as manifest page(s). The daemon will look for permission statement(s) here unless permission URLs are specified separately with au_permission_url. (Not required if plugin_crawl_seed_factory is defined.)
au_permission_url (P+) 
If included, daemon will look for permission statement(s) at these URLs instead of au_start_url.
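A sketch showing the list form of au_start_url together with a separate permission URL. All paths are hypothetical; base_url and year are assumed to be declared in plugin_config_props.

```xml
<entry>
  <string>au_start_url</string>
  <list>
    <string>"%sarchive/%d/index.html", base_url, year</string>
    <string>"%sarchive/%d/supplements.html", base_url, year</string>
  </list>
</entry>
<entry>
  <string>au_permission_url</string>
  <!-- permission statement is looked for here instead of at the start URLs -->
  <string>"%slockss-permission.html", base_url</string>
</entry>
```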
plugin_per_host_permission_path (S) 
Absolute path where permission statement may be found on hosts not listed in au_start_url or au_permission_url. Useful for sites such as Internet Archive that have banks of similar hosts with unpredictable names.
au_crawlrules (S+) 
Controls the URLs to which links should be followed by the crawler. (Crawl rules make a decision based on the URL only; to decide based on context, see <mime-type>_crawl_filter_factory.)

Strings of the form N,<printf_template>, where the printf template expands to a regular expression, and N specifies the action to take if the regex either does or doesn't match a URL:

  • 1: Include the URL if it matches the pattern, else ignore this rule
  • 2: Exclude the URL if it matches the pattern, else ignore this rule
  • 3: Include the URL if it does not match the pattern, else ignore this rule
  • 4: Exclude the URL if it does not match the pattern, else ignore this rule
  • 5: Include the URL if it matches the pattern, else exclude the URL
  • 6: Exclude the URL if it matches the pattern, else include the URL
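A sketch of a small rule set using three of the actions above. It assumes that rules are tried in the order listed until one makes a decision, which the text above implies ("else ignore this rule") but does not state outright. The URL patterns are hypothetical.

```xml
<entry>
  <string>au_crawlrules</string>
  <list>
    <!-- 4: exclude anything that does not start with base_url -->
    <string>4,"^%s", base_url</string>
    <!-- 2: exclude printer-friendly duplicates (hypothetical query arg) -->
    <string>2,"^%s.*[?&amp;]print=1", base_url</string>
    <!-- 1: include pages under the configured year -->
    <string>1,"^%s%d/", base_url, year</string>
  </list>
</entry>
```

Note that & must be written as &amp; inside the XML file.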
au_crawlrules_ignore_case (B) 
If true, patterns in crawl rules match case-insensitively. The global default is set with the daemon config param org.lockss.plugin.crawlRulesIgnoreCase, which defaults to true.
mime-type_crawl_filter_factory (S) 
Name of a class that implements org.lockss.plugin.FilterFactory, to filter out parts of the content that should not be processed by the link extractor during a crawl. E.g.,
 <entry>
   <string>text/html_crawl_filter_factory</string>
   <string>org.lockss.plugin.apub.APublisherCrawlFilterFactory</string>
 </entry>
au_def_new_content_crawl (L) 
Interval (ms) at which recrawl should occur after a successful crawl.
au_def_pause_time (L) 
Minimum interval (ms) between start of successive fetch operations during a crawl.
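A sketch of both interval settings, written as long values in milliseconds. The specific intervals are illustrative, not recommended defaults.

```xml
<entry>
  <string>au_def_new_content_crawl</string>
  <!-- 14 days = 14 * 24 * 60 * 60 * 1000 ms -->
  <long>1209600000</long>
</entry>
<entry>
  <string>au_def_pause_time</string>
  <!-- 3 seconds between the starts of successive fetches -->
  <long>3000</long>
</entry>
```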
au_mime_rate_limiter_map 
Map of MIME-type to fetch rate string. Allows variable fetch rate based on MIME-type of last file fetched. Keys are comma-separated lists of MIME-types (type/subtype). Wildcard (*) may be used for subtype or whole MIME-type.
 <entry>
   <string>au_mime_rate_limiter_map</string>
   <map>
     <entry>
       <string>text/html,text/x-html,application/pdf</string>
       <string>10/1m</string>
     </entry>
     <entry>
       <string>image/*</string>
       <string>5/1s</string>
     </entry>
   </map>
 </entry>
au_url_rate_limiter_map 
Map of regex to fetch rate string. Allows variable fetch rate based on URL of last file fetched. Keys are regular expressions; first match is used.
 <entry>
   <string>au_url_rate_limiter_map</string>
   <map>
     <entry>
       <string>(\.html)|(\.pdf)</string>
       <string>10/1m</string>
     </entry>
     <entry>
       <string>(\.gif)|(\.jpeg)|(\.png)</string>
       <string>5/1s</string>
     </entry>
   </map>
 </entry>
au_rate_limiter_info 
Allows fine-grained control over fetch rate, by day of week, time of day, MIME type or URL pattern. Value is an instance of org.lockss.plugin.RateLimiterInfo.
 <entry>
   <string>au_rate_limiter_info</string>
   <org.lockss.plugin.RateLimiterInfo>
     <rate>1/3500</rate>    
     <cond>
       <entry>
         <org.lockss.daemon.CrawlWindows-Daily>
           <from>24:00</from>
           <to>12:00</to>
           <timeZoneId>America/Los_Angeles</timeZoneId>
           <daysOfWeek>2;3;4;5;6</daysOfWeek>
         </org.lockss.daemon.CrawlWindows-Daily>
         <org.lockss.plugin.RateLimiterInfo>
           <rate>1/3500</rate>
         </org.lockss.plugin.RateLimiterInfo>
       </entry>
       <entry>
         <org.lockss.daemon.CrawlWindows-Always />
         <org.lockss.plugin.RateLimiterInfo>
           <rate>1/2s</rate>
         </org.lockss.plugin.RateLimiterInfo>
       </entry>
     </cond>
   </org.lockss.plugin.RateLimiterInfo>
 </entry>
plugin_fetch_rate_limiter_source (S) 
Controls which AUs may crawl simultaneously. (A better name would be au_crawl_pool.) Each AU belongs to a single, named crawl pool. Normally only one AU from each pool may crawl at a time; this may be changed with org.lockss.crawler.concurrentCrawlLimitMap. The only limit on the number of simultaneous crawls of AUs from different pools is org.lockss.crawler.threadPool.max.

The value should be one of:

  • au - Each AU belonging to this plugin will be in its own crawl pool, thus all may crawl simultaneously.
  • plugin - All AUs belonging to this plugin will belong to a pool whose name is the plugin id.
  • key:string - All AUs belonging to this plugin will belong to a pool with the specified name.
  • host:param_name - Each AU will be in a crawl pool whose name is host:host_name, where host_name is the host name of the URL that is the value of param_name in the AU's configuration.
  • title_attribute:attr_name[:default] - Each AU will belong to a crawl pool named attr_name:V, where V is the value of the specified attribute in the AU's tdb entry. If there is no such parameter or the parameter is not of type URL, the crawl pool will be determined by default if present, else by org.lockss.baseau.defaultFetchRateLimiterSource.
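A sketch of the host: form, assuming the plugin has a URL-typed parameter named base_url. AUs whose base_url values share a host name would then share one crawl pool.

```xml
<entry>
  <string>plugin_fetch_rate_limiter_source</string>
  <!-- pool name becomes host:<host name of base_url> -->
  <string>host:base_url</string>
</entry>
```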
au_refetch_depth (I) 
Link depth from start page(s) within which already-collected files will be refetched on each crawl to check for changes. When possible the (re)fetch will use If-Modified-Since. Can be raised for an individual crawl by performing a "deep recrawl" from the AuControlService web service or DebugPanel.
au_url_normalizer (S) 
Name of a class that implements org.lockss.plugin.UrlNormalizer. For plugins that provide it, this class is applied to all URLs extracted from collected pages during crawls, and URLs requested by users via the proxy and ServeContent. Typical use is to remove session-specific parts of the URL, which don't affect the content that's collected, but which are different each time the site is crawled.
au_crawlwindow (S) 
Name of a class that implements org.lockss.plugin.definable.DefinableArchivalUnit.ConfigurableCrawlWindow whose makeCrawlWindow() method returns a CrawlWindow that determines when AUs belonging to this plugin are allowed to crawl.
au_crawlwindow_ser 
A serialized instance of org.lockss.daemon.CrawlWindow.
au_redirect_to_login_url_pattern (P) 
Printf template that expands to a regex that matches the URL(s) of login pages. If a fetch is redirected to a URL that matches this pattern the crawler will abort with "No permission".
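A sketch of a login-page pattern. The path account/login is a hypothetical example of where a site might redirect unauthenticated requests.

```xml
<entry>
  <string>au_redirect_to_login_url_pattern</string>
  <string>"^%saccount/login", base_url</string>
</entry>
```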
au_login_page_checker (S) 
Name of a class that implements org.lockss.daemon.LoginPageChecker. Allows plugins to detect when a site has served a page prompting for login info instead of the requested content.
au_crawl_cookie_policy (S) 
Specifies cookie handling. Defaults to org.lockss.urlconn.cookiePolicy, which defaults to compatibility.
  • ignore: cookies will be ignored.
  • netscape: (old) Netscape draft policy.
  • compatibility: Common browser compatibility.
  • rfc2109: Strict RFC 2109.
au_http_cookie (S+) 
Strings of form "name=val" specifying cookies to be added to crawler's GET requests.
au_http_request_header (S+) 
Strings of the form "name:val" specifying additional headers to be added to crawler's GET requests, or of the form "name;" specifying headers to remove from requests.
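A sketch of both elements, one in single-value form and one in list form. The cookie and header names are hypothetical, and the strings are written here without surrounding quotes, which is an assumption about the expected syntax.

```xml
<entry>
  <string>au_http_cookie</string>
  <!-- hypothetical cookie sent with each GET request -->
  <string>accept_terms=yes</string>
</entry>
<entry>
  <string>au_http_request_header</string>
  <list>
    <!-- name:val adds a header -->
    <string>X-Example-Crawler:lockss</string>
    <!-- name; removes a header -->
    <string>Accept-Language;</string>
  </list>
</entry>
```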
au_permission_checker_factory (S) 
Name of a class that implements org.lockss.daemon.PermissionCheckerFactory, which will be called to create plugin-dependent permission checkers.
au_permitted_host_pattern (P+) 
Templates that expand to regex(s) that match the host name of CDN hosts, from which collection will be permitted without an explicit permission statement appearing on the host. Such hosts must also match one of the patterns in org.lockss.crawler.allowedPluginPermittedHosts.
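A sketch of a permitted-host pattern built from the builtin url_host() function. The cdn prefix is a hypothetical naming scheme; real CDN host names vary by site.

```xml
<entry>
  <string>au_permitted_host_pattern</string>
  <!-- matches e.g. cdn1.<host of base_url>, cdn2.<host of base_url> -->
  <string>"cdn[0-9]*\.%s", url_host(base_url)</string>
</entry>
```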
plugin_url_fetcher_factory (S) 
Name of a class that implements org.lockss.plugin.UrlFetcherFactory. Allows plugin to customize the manner in which content is fetched, e.g., when a more elaborate interaction than a single GET is required.
plugin_url_consumer_factory (S) 
Name of a class that implements org.lockss.plugin.UrlConsumerFactory. Allows plugin to customize the manner in which content is stored after being fetched. E.g., this can be used with sites that redirect permanent URLs to one-time URLs, to store the content at the permanent URL, or to adapt to sites undergoing HTTP to HTTPS transitions.
plugin_cache_result_list (S+) 
Strings of form http_status_code=action or Exception=action. Specifies how various http responses and fetch errors should be handled. action can be the name of either a subclass of org.lockss.util.urlconn.CacheException, or a class that implements org.lockss.util.urlconn.CacheResultHandler. Subclasses of CacheException have attributes that control severity (ignore, warn, error, abort), and retry count and frequency. CacheResultHandlers determine at runtime which CacheException to return. E.g., the following causes 300 responses to be treated as meaning "you don't have permission to crawl", 500 responses to be retried 5 times at 60 second intervals, and IOExceptions (generally socket errors such as timeouts) to be retried 5 times at 30 second intervals.
 <entry>
   <string>plugin_cache_result_list</string>
   <list>
     <string>300=org.lockss.util.urlconn.CacheException$PermissionException</string>
     <string>500=org.lockss.util.urlconn.CacheException$RetryableNetworkException_5_60S</string>
     <string>java.io.IOException=org.lockss.util.urlconn.CacheException$RetryableNetworkException_5_30S</string>
   </list>
 </entry>
plugin_crawl_seed_factory (S) 
Name of a class that implements org.lockss.crawler.CrawlSeedFactory. Allows start URLs to be computed at crawl time (e.g., by querying metadata for new/changed URLs).
plugin_crawl_url_comparator_factory (S) 
Name of a class that implements org.lockss.plugin.CrawlUrlComparatorFactory. Controls the order of URLs on the crawler's fetch queue.
mime-type_link_extractor_factory (S) 
Name of a class that implements org.lockss.extractor.LinkExtractor$Factory, to either override the default link extractor for the specified mime type, or provide a link extractor for a mime type not normally handled. The crawler invokes link extractors to find links to follow during the crawl. By default there are link extractors for text/html, application/xhtml+xml, text/css, and text/xml and application/xml (for stylesheets in the XML prolog). E.g.,
 <entry>
   <string>text/xml_link_extractor_factory</string>
   <string>org.lockss.extractor.DublinCoreLinkExtractor$Factory</string>
 </entry>
au_exploder_pattern (S) 
Regex matching URLs of archive files that should be unpacked when collected by the crawler. If this is defined, files with matching URLs will be unpacked by ExplodingUrlConsumer, using the plugin's ExploderHelper to provide info for each member, such as the AU the member should be stored in, the URL it is to be assigned, and the header fields.
au_exploder_helper (S) 
Name of a class that implements org.lockss.plugin.ExploderHelper.

Poll related

au_dont_poll (B) 
If true, polls will not be called on this AU.
mime-type_filter_factory (S) 
Name of a class that implements org.lockss.plugin.FilterFactory, to filter out variable content (content that isn't expected to be the same across machines) during hashing for polls. E.g.,
 <entry>
   <string>text/html_filter_factory</string>
   <string>org.nypl.plugin.WordsWithoutBordersHashHtmlFilterFactory</string>
 </entry>
plugin_delete_extra_files (B) 
If true, files that exist on the poller but not on a majority of voters will be deleted. Because polls may be called at a time when only some boxes have collected a file, this is usually a bad idea. Defaults to the value of org.lockss.poll.v3.deleteExtraFiles, which defaults to false.
plugin_repair_from_publisher_when_too_close (P+) 
Sites that version files rapidly, then settle down, may have trouble achieving agreement because no one version was collected by a majority of peers. When the tally for a URL in a poll is "too close" (no clear majority) the current version of the file will be fetched from the publisher if the URL matches one of these patterns.
au_exclude_urls_from_polls_pattern (P+) 
URLs that match one of these patterns will not be included in polls.
au_repair_from_peer_if_missing_url_pattern (P+) 
Sites that use rapidly versioned auxiliary files (css, js, etc.), with versioned URLs, can result in each peer collecting a different subset of these files. One strategy to achieve agreement in these cases is to ensure that all files are distributed among all peers. URLs that match these patterns will be fetched from a peer if missing from the poller, even if a majority of other peers don't have the file.
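A sketch of such a pattern for versioned js and css files. The static/ directory layout is hypothetical.

```xml
<entry>
  <string>au_repair_from_peer_if_missing_url_pattern</string>
  <list>
    <!-- versioned auxiliary files under a hypothetical static/<version>/ path -->
    <string>"^%sstatic/[^/]+/.+[.](js|css)$", base_url</string>
  </list>
</entry>
```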
au_url_poll_result_weight 
List of comma-separated pairs of printf template that expands to a regex, and a float between 0.0 and 1.0. URLs matching a regex will be assigned the corresponding weight in poll results (instead of the default 1.0).
 <entry>
   <string>au_url_poll_result_weight</string>
   <list>
     <string>"^%sstatic/[^/]+/.+[.](png|gif)$", base_url, 0</string>
     <string>"^%sstatic/[^/]+/.+[.](js|css)$", base_url, 0.5</string>
   </list>
 </entry>

Serving Content

mime-type_link_rewriter_factory (S) 
Name of a class that implements org.lockss.rewriter.LinkRewriterFactory, to rewrite links in pages of the specified mime-type. Rewriters for HTML and CSS are built in; plugins may supply additional rewriters or replace the built-in ones.
 <entry>
   <string>text/javascript_link_rewriter_factory</string>
   <string>org.lockss.plugin.highwire.HighWireJavaScriptLinkRewriterFactory</string>
 </entry>
plugin_rewrite_html_meta_urls (S+) 
List of names of HTML meta tags whose values should be rewritten by the HTML link rewriter. The set of links in meta tags that should be rewritten depends on the context in which the content is being served. It is usually desirable for citation tags to point to the original site if it's still active, but if the site no longer exists (such as in CLOCKSS triggered content) the citations need to be rewritten to the triggered content site.
 <entry>
   <string>plugin_rewrite_html_meta_urls</string>
   <list>
     <string>citation_abstract_html_url</string>
     <string>citation_pdf_url</string>
   </list>
 </entry>
au_feature_urls 
Provides information to allow the OpenURL resolver to locate articles, issue ToCs, etc. Value is a map whose keys are a feature name (au_title, au_volume, au_issue, au_article) and values are a printf template that expands to a URL, or a list of such templates, in which case the OpenURL resolver returns the first URL that exists within the AU.

Or, the value can be a map keyed by the value of the AU's au_feature_key title attribute, with values being a URL template or list of templates. The entry with key "*" is used if the value of the title attribute isn't found in the map, or if it isn't set. This allows for different feature URLs for different titles within the same plugin.

 <entry>
   <string>au_feature_urls</string>
   <map>
     <entry>
       <string>au_issue</string>
       <list>
           <string>"%sarchives/issue_%d", base_url, issue</string>
           <string>"%sarchives/%s", base_url, issue</string>
       </list>
     </entry>
   </map>
 </entry>
   
 <entry>
   <string>au_feature_urls</string>
   <map>
     <entry>
       <string>au_title</string>
       <string>"%sloi/%s", base_url, journal_id</string>
     </entry>
     <entry>
       <string>au_volume</string>
       <map>
         <entry>
           <string>coas</string>
           <string>"%sloi/%s", base_url, journal_id</string>
         </entry>
         <entry>
           <string>*</string>
           <string>"%slockss/%s/%s/index.html", base_url, journal_id, volume_name</string>
         </entry>
       </map>
     </entry>
     <entry>
       <string>au_issue</string>
       <map>
         <entry>
           <string>coas</string>
           <string>"%sloi/%s", base_url, journal_id</string>
         </entry>
         <entry>
           <string>*</string>
           <string>"%stoc/%s/%s/%s", base_url, journal_id, volume_name, issue</string>
         </entry>
       </map>
     </entry>
   </map>
 </entry>
au_additional_url_stems (P+) 
In order to serve content the daemon must know the set of URL stems (scheme, host and port) represented in each AU. Normally this is obtained from the permission URLs, AU config params of type URL, and any collected CDN URLs. Plugins that collect from different schemes (e.g., http: vs. https:) or ports on these same hosts must inform the daemon of the additional stems.
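A sketch that declares both the http: and https: stems of base_url, using the builtin to_http() and to_https() functions described under Printf Template Strings. It assumes base_url is a declared URL parameter.

```xml
<entry>
  <string>au_additional_url_stems</string>
  <list>
    <string>"%s", to_http(base_url)</string>
    <string>"%s", to_https(base_url)</string>
  </list>
</entry>
```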

Substance Detection

Site redesign sometimes results in crawls that are apparently successful but collect no real content. E.g., the start page is left at the original URL but all the links on it have changed to URLs that don't match the crawl rules. Substance patterns can be used to detect this situation.

au_substance_url_pattern (P+) 
If these patterns exist and no URLs in the AU match any of them the AU will be marked as not having substance.
au_non_substance_url_pattern (P+) 
If these patterns exist and all URLs in the AU match one of them the AU will be marked as not having substance.

Either of the above may also be a map keyed by the value of the AU's au_coverage_depth attribute (usually one of abstracts, tablesofcontents, fulltext), in order to tailor the substance patterns for AUs that are expected to contain only abstracts or tables of contents.
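A sketch of the map form, keyed by coverage depth. The URL layouts are hypothetical, and it is an assumption that each map value takes the same P+ form as the non-map value.

```xml
<entry>
  <string>au_substance_url_pattern</string>
  <map>
    <entry>
      <!-- AUs expected to hold only abstracts -->
      <string>abstracts</string>
      <string>"^%s%d/abstract/", base_url, year</string>
    </entry>
    <entry>
      <!-- AUs expected to hold full text -->
      <string>fulltext</string>
      <string>"^%s%d/[^/]+\.pdf$", base_url, year</string>
    </entry>
  </map>
</entry>
```
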

plugin_substance_predicate_factory 
Name of a class that implements org.lockss.plugin.SubstancePredicateFactory to implement runtime checks for substance (e.g., to examine the content).
plugin_archive_file_types 
Either the string "standard" or an instance of org.lockss.plugin.ArchiveFileTypes, specifying the types of archive files (zip, tar, etc.) in this plugin's AUs whose members will be individually accessible, using the syntax <archive-file-url>!/<member-name>. Article iterators (see Metadata Extraction) will also include members in the iteration.

The value "standard" specifies that all known archive file types should be handled: zip, tar, tgz.

 <entry>
   <string>plugin_archive_file_types</string>
   <string>standard</string>
 </entry>
Otherwise the value should specify the extensions and MIME types that should be handled. E.g.,
 <entry>
   <string>plugin_archive_file_types</string>
   <org.lockss.plugin.ArchiveFileTypes>
     <extMimeMap>
       <entry>
         <string>.zip</string>
         <string>.zip</string>
       </entry>
       <entry>
         <string>application/zip</string>
         <string>.zip</string>
       </entry>
     </extMimeMap>
   </org.lockss.plugin.ArchiveFileTypes>
 </entry>

Metadata Extraction

Metadata extractors locate and extract bibliographic metadata from articles preserved within an AU. The metadata is normally stored in a metadata database and can be used for a number of purposes, such as searching for an article based on a bibliographic citation.

Plugins must define an Article Iterator, which locates the files within the AU where metadata is located, and one or more Metadata Extractors. General purpose base classes are provided so plugins generally need to supply only plugin-specific behavior.

plugin_article_iterator_factory 
Name of a class that implements org.lockss.plugin.ArticleIteratorFactory. The class should return an Iterator<ArticleFiles> pointing to the file(s) making up each "article" in the AU.
plugin_article_metadata_extractor_factory 
Name of a class that implements org.lockss.extractor.ArticleMetadataExtractorFactory. The class will be invoked on each ArticleFiles returned by the iterator, and should invoke a MIME type-specific metadata extractor on each relevant file.
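A sketch of the two factory entries, assuming the values are given as strings like the other class-name elements. Both class names are hypothetical.

```xml
<entry>
  <string>plugin_article_iterator_factory</string>
  <!-- hypothetical class that locates the files making up each article -->
  <string>org.example.plugin.ExampleArticleIteratorFactory</string>
</entry>
<entry>
  <string>plugin_article_metadata_extractor_factory</string>
  <!-- hypothetical class invoked on each ArticleFiles -->
  <string>org.example.plugin.ExampleArticleMetadataExtractorFactory</string>
</entry>
```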
plugin_default_article_mime_type 
MIME type of article files (e.g., application/pdf). Used only when the plugin's article iterator doesn't make this decision itself. [Not sure this is right.]
mime-type_metadata_extractor_factory_map 
A map of metadata type to the name of a class that implements org.lockss.extractor.FileMetadataExtractorFactory, which knows how to extract the specified type of metadata from files of the specified MIME type.
 <entry>
   <string>text/html_metadata_extractor_factory_map</string>
   <map>
     <entry>
       <string>*;DublinCore</string>
       <string>org.lockss.plugin.pion.PionHtmlMetadataExtractorFactory</string>
     </entry>
   </map>
 </entry>
 <entry>
   <string>application/ris_metadata_extractor_factory_map</string>
   <map>
     <entry>
       <string>*;RIS</string>
       <string>org.lockss.plugin.pion.PionRisMetadataExtractorFactory</string>
     </entry>
   </map>
 </entry>

Obsolescent

au_crawl_depth (I) 
Obsolescent name of au_refetch_depth.
au_manifest (S) 
Obsolescent name of au_permission_url.
<mime-type>_parser 
<mime-type>_filter 
Obsolescent link extractor & hash filter.