About this documentation

This documentation gives a brief overview of the architecture of KIT Data Manager, its basic installation, and first steps towards using and extending KIT Data Manager according to the needs of your community. Due to the flexibility of KIT Data Manager it is impossible to cover all scenarios of installing and using KIT Data Manager in this documentation. However, this documentation gives a first impression and will be extended over time. If you miss a particular topic or a specific example, please do not hesitate to contact the development team directly.

Architecture Overview

In the following chapters the basic components and concepts of KIT Data Manager’s architecture are described briefly. Basically, the presented components are responsible for abstracting from the underlying infrastructure and for providing a common service layer for integrating community-specific repository solutions.

Architecture
Figure 1. Basic architecture of KIT Data Manager.

The figure above shows a quite general view of the different layers and services provided by KIT Data Manager. The following table summarizes the responsibilities of each service and in which chapter(s) the services are explained in detail.

Service              Covered in chapter                      Short Description
Data Sharing         Authorization                           Authorizing access to data and functionality based on users, groups and roles.
Metadata Management  Metadata Management                     Managing and accessing different kinds of metadata.
Search               Metadata Management, Data Organization  Search for metadata.
Staging              Staging, Data Organization              Accessing and managing (file-based) data.
Data Processing      Data Workflow                           Processing of data stored in a KIT Data Manager repository.
Audit                Audit                                   Capturing of audit information for repository resources.

Before describing each service let’s take a short look at a core concept of KIT Data Manager: the Digital Object. To be able to cope with the tasks of a repository system (e.g. content preservation, bit preservation, curation) there is the need to think in structured, Digital Object-based rather than in unstructured, file-based dimensions. Of course, in many cases there are files managed by KIT Data Manager on the lowest layer (see Data Organization), but they are only one possible representation of the content of a Digital Object. The Digital Object and everything related to it is described by a large amount of metadata (see Metadata Management) in order to enable software systems to retrieve and interpret Digital Objects, their content and their provenance. However, while reading the following chapters always bear in mind that the core element of KIT Data Manager is the Digital Object, consisting of metadata and data.

Authorization

One of the core services, which concerns almost every part of KIT Data Manager, is the Authorization. It is based on users, groups and an effective role and determines whether access to a secured resource (e.g. a Digital Object) or operation (e.g. adding users to a group) is granted or not. Authorization decisions in KIT Data Manager are always based on a specific context used to issue the request. This context consists of:

UserId

An internal user identifier. This identifier is unique for each user and is assigned during registration, e.g. via a community-specific Web frontend. For real world applications, the user identifier should be obtained from another, central source like an LDAP database in order to provide a common source of authentication for data and metadata access.

GroupId

An internal group id. This identifier is unique for each user group of a KIT Data Manager instance. Available groups can be managed by higher level services. By default, there is only one group with the id USERS which contains all registered users.

The final part needed for authorization decisions is the Role. A role defines, together with UserId and GroupId, the AuthorizationContext that is used to access resources or functionalities. Currently, there are the following roles available:

Role Description

NO_ACCESS

An AuthorizationContext with this role has no access to any operation or resource.

MEMBERSHIP_REQUESTED

This is an intermediate role used in special cases. Users with this role have requested membership, e.g. in a group, but are not activated yet.

GUEST

Using an AuthorizationContext with the role GUEST grants read access to publicly accessible resources. Modifications of resources are not allowed.

MEMBER

An AuthorizationContext with this role can be used to read and create resources. This role should be used in most cases.

MANAGER

An AuthorizationContext with the role MANAGER can be used for operations that require special access to resources and functionalities on a group level, e.g. assign new memberships to a group.

CURATOR

The CURATOR role is intended to be used for curation tasks, e.g. updating metadata, data formats or removing entities. Currently, the CURATOR role is not used explicitly and updates can be performed by everybody possessing at least the role MEMBER. In future versions such operations will be reserved for the CURATOR role in order to avoid uncontrolled modification of repository content.

ADMINISTRATOR

The role ADMINISTRATOR is used for operations that require special access on a global level. Typically, the administrator role should be used sparingly and only for a small number of administrator users.

From these roles, each user has a maximum role MAX_ROLE, which is defined globally and cannot be bypassed.

The MAX_ROLE defines the highest role a user may possess. This role might be further restricted on group- or resource-level, so that a maximum role of MANAGER may result in a group role MEMBER. On the other hand it is not possible to gain a higher role than MAX_ROLE, which means, if the maximum role of a user is set to NO_ACCESS, the user is not allowed to do anything and won’t be able to gain more permissions on group- or resource-level.

To determine the actual role within a specific context, a user has a group-specific role and the MAX_ROLE. The effective role is the group role unless MAX_ROLE is lower, in which case MAX_ROLE becomes the effective role.

Additionally, there can be resource-specific roles issued separately. By default, each resource with restricted access can be accessed (read and modified) by all members of the group the resource has been created in. For sharing purposes it is also possible to issue additional permissions to other groups (called ResourceReferences) or to single users (called Grants). This allows a user who is not a member of the creating group to access a single resource with an assigned role.

Apart from access to resources, access to operations can also be restricted. For basic services and components of KIT Data Manager the restrictions for operations roughly follow the role definitions presented above. This means that the role GUEST is required for read operations, MEMBER is required for basic write operations and the MANAGER role entitles the caller to remove data (to a very limited extent).

Metadata Management

The core of the metadata management of KIT Data Manager is the metadata model, which is presented in the figure below.

Metadata model
Figure 2. Metadata model of KIT Data Manager. An important aspect of this model is covered by the Digital Object which provides the Digital Object ID (OID). This OID is used to refer to the according Digital Object by other linked entities, e.g. Staging or Data Organization Metadata.

The metadata model consists of three parts which can be described as follows:

Administrative Metadata

This category contains metadata elements that are mostly used internally by KIT Data Manager and its services. These elements have a fixed schema and are typically stored in a relational database. One important part is the Base Metadata defining a guaranteed core set of metadata elements that are expected to be available for each and every Digital Object managed by KIT Data Manager. Parts of the Base Metadata are adopted from the Core Scientific Metadata Model (CSMD) 2.0; other parts of this model were skipped to reduce complexity and to allow more flexibility. An overview of all Base Metadata entities and how they relate to each other is depicted in figure BaseMetadata. One can see the main entities Study, Investigation and Digital Object, the latter also containing the (Digital) Object Identifier referenced as OID in this document. The OID identifies each Digital Object and can be used to link additional metadata entities to a Digital Object, e.g. Staging Metadata or Sharing Metadata as shown in figure MetadataModel. All administrative metadata elements may or may not be exposed to the user by corresponding service interfaces.

Data Organization Metadata

The Data Organization Metadata contains information on how data managed by KIT Data Manager is organized/structured and where it is located. Currently, only file-based Data Organization is supported and all Data Organization information is stored in relational databases. For more information please refer to chapter Data Organization.

Content Metadata

The third part of the metadata model is the content metadata. Content metadata covers for example community-specific metadata providing detailed information about the content of a Digital Object. For the sake of flexibility all content metadata related implementations are moved to a separate module called Enhanced Metadata Module (see Extension: Enhanced Metadata Module) that can optionally be installed in addition to the basic KIT Data Manager distribution.

BaseMetadata
Figure 3. The most important Base Metadata entities and their relations. The core entities, namely Study, Investigation and Digital Object, are highlighted. Furthermore, some of the relationships are simplified for reasons of clarity.

Administrative and Data Organization Metadata of KIT Data Manager are stored in relational databases using well defined schemas and can be accessed using Java or REST service APIs. For content metadata there is no fixed API or schema available as content metadata strongly differs depending on the community and supported use cases. Instead, there are basic workflows available that can be implemented in order to extract, register and retrieve content metadata. How access to this metadata is realized depends on the system(s) in which the metadata are registered. More details on enhanced metadata handling can be found in the according chapter.

For harvesting metadata directly, KIT Data Manager provides an OAI-PMH interface available at http://localhost:8080/KITDM/oaipmh by default. Harvestable are all Digital Objects readable by the user with the userid OAI-PMH and all Digital Objects readable by the user group with the groupid WORLD.
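As a quick check, the interface can be queried with any OAI-PMH harvester or with plain HTTP tooling. The following sketch uses the default host and port from above and the standard OAI-PMH verbs; the oai_dc metadata prefix is the format mandated by the OAI-PMH specification and is assumed to be offered here as well:

#Identify the repository using the standard OAI-PMH verb.
curl "http://localhost:8080/KITDM/oaipmh?verb=Identify"
#List harvestable records in Dublin Core format.
curl "http://localhost:8080/KITDM/oaipmh?verb=ListRecords&metadataPrefix=oai_dc"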

Furthermore, it is possible to publish a Digital Object by granting access to the userid WORLD or the groupid WORLD. By doing so, everybody can access the Digital Object’s landing page via http://localhost:8080/KITDM?landing&oid=<OBJECT_ID>. A sample landing page for object 3b1243b2-df09-4a98-ad87-21b7cda74be9 may look as follows:

LandingPage
Figure 4. The landing page for a public Digital Object. It allows to read/download metadata in different formats and to download public data associated with the Digital Object.

If a Digital Object is not published, the user is asked to log in in order to be able to see the landing page, which is of course only possible if the user is authorized to access the Digital Object.

It is recommended to create a view named public for storing the published data of a Digital Object. If this view does not exist, publishing a Digital Object will result in full access to the default Data Organization view, which might not be wanted in every case.

Staging

The term Staging basically stands for the process of transferring data into or out of KIT Data Manager, either to access data manually or automatically for computing purposes. As KIT Data Manager aims to be able to build up repository systems for large-scale scientific experiment data, the data transfer needs a special focus. In contrast to traditional repository systems, KIT Data Manager provides reasonable throughput in order to be able to cope with the huge amounts of data delivered by scientists. In addition, scientific data can be very diverse, e.g. hundreds of thousands of small files vs. a handful of files in the order of terabytes. Therefore, the process of Staging in case of an ingest (for downloads this process is carried out the other way around) is divided into two parts:

Caching

Caching is the plain data transfer to a temporary storage system which is accessible in an optimal way depending on the requirement, e.g. throughput, security or geographical location. To achieve best results transfers to the cache are carried out using native clients or custom, optimized transfer tools. The location where the cached data can be stored is provided and set up by KIT Data Manager using pre-configured StagingAccessPoints. A StagingAccessPoint defines the protocol as well as the local and remote base path/URL accessible by the repository system and the user. Details about StagingAccessPoints can be found in the Programming KIT Data Manager or Administration UI chapters.

Archiving

During archiving the data from the cache is validated as far as possible, metadata might be extracted and transfer post-processing may take place. Afterwards, the data is copied from the cache to a managed storage area where the repository system is taking care of the data. As soon as the data is in the managed storage, it can be expected to be safe. Local copies and cached data can be removed and repository workflows start taking care of the data.

Authentication and authorization for data transfers to and from the cache is not covered by KIT Data Manager. This offers a high level of flexibility and allows customizing the data transfer to particular needs. However, it is still possible to use the same user database that is used to obtain KIT Data Manager UserIDs, e.g. an LDAP server. The only thing that has to be ensured is that for data ingests the written data has to be accessible by KIT Data Manager and for downloads the data, written by KIT Data Manager, must be readable by the user who wants to access the data. Typically, this can be achieved by running KIT Data Manager as a privileged user or by handling access permissions on a group level.

As mentioned before, the transfer into the archive storage is much more than a simple copy operation. The process of initially providing data associated with a Digital Object is called Ingest. During Ingest, different steps like metadata extraction, checksumming or even processing steps might be performed. For download operations, requested data is copied first from the archive to the cache and can then be downloaded by the user using a corresponding StagingAccessPoint. Furthermore, this workflow can also be used to copy the data to an external location, e.g. for data processing. In order to be able to provide supplementary data (e.g. files containing metadata), a special folder structure was defined for each staging location:

Staging Tree Structure
31/12/69  23:59         <DIR>    data       (1)
31/12/69  23:59         <DIR>    generated  (2)
31/12/69  23:59         <DIR>    settings   (3)
1 Contains the user data, e.g. uploaded files or files for download.
2 Contains generated data, e.g. extracted metadata or processing results.
3 Contains KIT Data Manager-related settings, e.g. credentials for third-party transfers.

The graphic below shows the general Staging workflow. At the beginning, the Staging Service is accessed using an appropriate client, which might be a commandline client, a Web portal or something comparable. In case of downloading data, the Staging operation is scheduled and the data will be made available asynchronously by the Staging Service as this preparation may take a while (e.g. when restoring data from tape). In case of an ingest, the cache is prepared immediately. As soon as the preparation of the data transfer operation is finished, the data is accessible from the cache by a data consumer or can be transferred to the cache by a data producer. Both cache and archive storage must be accessible by KIT Data Manager; at least one of them must also be accessible in a POSIX-like way.

Staging
Figure 5. Staging workflow for ingest operations with KIT Data Manager. After selecting the data (1) a Digital Object is registered (2) and a new ingest is scheduled. As soon as the transfer is prepared, the data can be transferred (3). Finally, the ingest is marked for finalization (4). During finalization the cached data is copied to the archive (5), the Data Organization is obtained and content metadata might be extracted automatically. Finally, extracted content metadata is made accessible e.g. by a search index (6).

Data Organization

The Data Organization is closely coupled with the Staging and holds information on how the data belonging to a Digital Object is organized and where it is located. In the simplest case, after ingesting a file tree the Data Organization reflects exactly the ingested tree, including CollectionNodes representing folders and FileNodes representing the files. This allows the file tree to be restored in case of a download in the representation the user expects. In addition, there might be attributes linked to Data Organization nodes, e.g. size or mime type of a node.

In more sophisticated scenarios the Data Organization might be customized according to user needs. These customizations are called Views. Views can be useful, e.g. to group all files with the same type belonging to one Digital Object or to transform a Digital Object’s data into another format but keep the Data Organization linked to the particular Digital Object.

DataOrganization
Figure 6. The graphic shows exemplarily different views for the Data Organization of a Digital Object. On the left hand side the default view with all contained files is shown. The second view contains only the data files, the third view contains a compressed version of the default view. Finally, the view on the right hand side contains generated files with a preview of the images contained in the default view that can be mapped by their names to each other.

Apart from the basic file-based use case it is also possible to provide a custom Data Organization tree during ingest. This allows registering Data Organization trees where single nodes may refer to data located elsewhere, e.g. in another repository or on a web server. For standard workflows, e.g. for ingest and download, this enhanced scenario implies some differences:

Ingest

Instead of ingesting a file tree, one or more JSON file(s) containing a defined structure is/are ingested. A corresponding StagingProcessor implemented by the class edu.kit.dama.staging.processor.impl.ReferenceTreeIngestProcessor has to be configured properly beforehand. This StagingProcessor will parse and register the Data Organization information provided by the JSON file(s). For more details please refer to the Samples module where the example CustomDataOrganizationIngest shows how to ingest a custom Data Organization and describes the requirements and rules.

Download

For download there are also multiple options. The easiest way for supporting transparent access to remotely stored data is to refer to data openly accessible via HTTP. In that case, the LFNs available in the Data Organization nodes can be directly accessed by the user to obtain the data. Furthermore, the REST-based Data Download described in the following section fully supports streaming of data from HTTP LFNs. Finally, the Staging Service also allows staging data accessible via HTTP the same way as it is done for locally available data. However, this requires a bit more configuration effort.

For more information on how to enable the support for ingesting custom Data Organization trees please refer to the section Support for Ingest of Custom Data Organization Trees.

REST-based Data Download

For obtaining data represented in the Data Organization, typically the Staging Service is used to transfer the data to a location accessible using one of the configured StagingAccessPoints. As an alternative for smaller downloads and for direct access avoiding the asynchronous staging process, the Data Organization REST service offers direct access to the content of single nodes. Typically, this feature is used the same way as any other REST endpoint and the access is authorized via OAuth or any other configured authentication mechanism. The URL for downloading the content of a Data Organization node is built as follows:

http://kitdm-host:8080/KITDM/rest/dataorganization/organization/download/{objectId}/{path}?groupId=USERS&viewName=default

The first part http://kitdm-host:8080/KITDM/rest/dataorganization/organization/download/ is the base URL pointing to the download endpoint. Typically, only kitdm-host has to be changed according to the accessed KIT DM instance. The second part of the URL defines the accessed Digital Object by its numeric baseId followed by the Data Organization path. The value of path can be one of the following:

Value of {path} Delivered Content

nothing

The entire content of the provided view zipped in a file named <DIGITAL_OBJECT_ID>.zip

MyCollection/

Assuming that MyCollection refers to a collection node, the zipped content of the node and all children is returned.

MyCollection/MyFile.txt

Assuming that MyFile.txt refers to a file node, the file content is returned using the automatically determined mime type of the file.

100

The content of the node with the provided nodeId. Depending on the node type the call is handled similar to one of the previous cases, e.g. 100 typically refers to the root node. Hence, providing 100 behaves identically to the example in table row 1.

The arguments provided in the URL are optional and define the group used to authorize the access to the associated Digital Object and the Data Organization view that will be accessed. By default these values are USERS and default.
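For illustration, downloading the complete default view and a single file node could look as follows. The object baseId 4711 and the file path are placeholders, and credentials (e.g. an OAuth access token) have to be supplied according to the configured authentication mechanism:

#Download the complete default view of the (hypothetical) Digital Object with baseId 4711 as ZIP file.
curl -o 4711.zip "http://kitdm-host:8080/KITDM/rest/dataorganization/organization/download/4711?groupId=USERS&viewName=default"
#Download a single file node of the same object.
curl -o MyFile.txt "http://kitdm-host:8080/KITDM/rest/dataorganization/organization/download/4711/MyCollection/MyFile.txt?groupId=USERS&viewName=default"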

As mentioned before, access to the REST-based data download is authorized e.g. via OAuth. However, there may be scenarios where one wants to provide open access to repository content, e.g. for showing an image, stored in the repository, on a web page. For this purpose the Data Organization offers so-called authorization-less access. The concept of this feature is to provide open access to a particular part of the Data Organization of a Digital Object without any user- or group-based authentication and authorization. Of course, this feature should be used with care in order to avoid opening data that should not be accessible by everybody.

Currently, there are four different kinds of authorization-less access:

Type Description

Data Organization View

Allows providing one or more Data Organization views that are publicly accessible. Attention: You should never make the default view publicly available using this feature as this would grant access to the data of all Digital Objects.

Data Organization Attribute

Allows defining an attribute that, if assigned to a Data Organization node, allows public access. The advantage of this approach is its fine-grained applicability to single collection nodes or file nodes, also allowing to publish single files of a Digital Object.

Collection Node

For testing only! This option grants access to all collection nodes in all Digital Objects. It should never be enabled in production environments.

File Node Filter

Allows to provide a regular expression granting public access to all file nodes in all Digital Objects and all Data Organization views for which the node name fulfills the regular expression.

For setting up authorization-less access for the Data Organization service please refer to chapter Authorization-less Access to Data Organization.

Data Workflows

A very special feature that distinguishes KIT Data Manager from other research data repository systems is the ability to trigger data processing workflows. This allows executing data workflows seamlessly integrated into repository system workflows. The repository system takes care of transferring the data to the processing environment, monitoring the execution, ingesting the results back into the repository system and linking them to the input object(s). Furthermore, single processing tasks can be chained to construct complex data processing workflows. The Data Workflow module is based on three major entities:

Execution Environment

The Execution Environment is the physical environment where a single data workflow task is executed. It might be the local machine or a compute cluster. The access to an execution environment is implemented using an appropriate ExecutionEnvironmentHandler taking care of the preparation of the application and the input data, the execution of the application and its monitoring as well as the ingest of the results, the creation of links to the input Digital Objects and the cleanup. However, this process can be abstracted quite well so that different handler implementations only have to take care of the actual execution/submission of the application and the monitoring of its status. The data transfer can be fully covered in a generic way using the Staging Service of KIT Data Manager by assigning a StagingAccessPoint to each execution environment.

Data Workflow Task Configuration

A Data Workflow Task Configuration describes a single task that can be executed alone or after a predecessor task to chain multiple tasks. An actual instance of a Data Workflow Task Configuration is simply called Data Workflow task. A task configuration consists of basic metadata, e.g. name, version, description and keywords, a package URL, which points to a ZIP file containing everything needed to execute the task’s application, e.g. libraries, executables and a wrapper script named run.sh, and fixed application arguments. For all tasks registered in a KIT Data Manager instance the combination of name, version, application package URL and application arguments is unique. If any of these fields changes, a new task version or an entirely new task must be created. This is mandatory in order to be able to collect reliable provenance information later on. As all tasks are executed by a corresponding ExecutionEnvironmentHandler fully automatically, the successful execution of the task in the targeted environment should be tested before registering the task in the repository system. This rules out basic errors, e.g. missing dependencies. Due to this required effort (packaging and testing the application) Data Workflow Tasks are mainly interesting for tasks executed many times.

Environment Property

Finally, there are Environment Properties allowing to describe capabilities of an execution environment, e.g. the platform or existing libraries, and the requirements of a Data Workflow Task. In both cases, environment properties can be chosen from a common pool of properties, and before registering a task execution in an execution environment, capabilities and requirements are matched to determine whether an execution environment can handle a task in principle. However, there is currently no mechanism to ensure that an execution environment is really providing a specific environment property, e.g. the defined platform or a software package.

After describing the major entities the question is how they work together. The first point here is the actual application that will be executed as a Data Workflow Task. Such an application must be packed in a ZIP archive with the following structure:

Application Package Structure
31/12/69  23:59         <DIR>    app        (1)
31/12/69  23:59         123      run.sh     (2)
1 Contains the user application, e.g. libraries, binaries and static configuration files.
2 The execution wrapper script for setting up and calling the user application.
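For reference, such an archive can be created with standard tooling; the archive name mytask-1.0.zip is only an example:

#Package the wrapper script and the application directory into the task archive.
#run.sh and the app directory must end up at the top level of the ZIP file.
zip -r mytask-1.0.zip run.sh app/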

The actual user application does not have to be located in a directory called app, but this is recommended to achieve a clean separation between data workflow and user application. The execution wrapper consists of two parts: a common part that is recommended for all wrapper scripts and a specific part for setting up and calling the actual user application. The base script looks as follows:

#!/bin/sh

#Variable definition for accessing/storing data. The placeholder variables, e.g. ${data.input.dir}, are replaced by the workflow service using the
#appropriate values for the according task execution and execution environment.
export INPUT_DIR=${data.input.dir}
export OUTPUT_DIR=${data.output.dir}
export WORKING_DIR=${working.dir}
export TEMP_DIR=${temp.dir}

#Place environment checks here, if necessary, to allow debugging. However, if something is missing at this point, the process will fail
#either way.

#Now, the execution of the user application starts. The variables above should be provided to the process in a proper way depending on
#the kind of the process, e.g. for Java processes via -DINPUT_DIR=$INPUT_DIR

#At this point, the user application is executed, e.g. via './app/MainBinary $@'. The argument $@ is recommended to forward all fixed command line arguments
#provided by the task configuration and the dynamic arguments optionally provided for each specific task instance. Each user application should return
#a proper exit code (0 for success, any other value for error).

#Obtain the exit code of the process, print out a logging message for debugging, and exit the wrapper script using the exit code of the internal
#process to allow a proper handling by the workflow service. If the exit code is 0 the data ingest of all data stored in $OUTPUT_DIR is triggered.
#Otherwise, the task will remain in an error state and needs user interaction.
EXIT=$?
echo "Execution finished."
exit $EXIT
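As an illustration of the application-specific part, the call of a hypothetical Java-based task application could look like the following lines, placed right before obtaining the exit code:

#Execute the (hypothetical) user application and forward the directory variables and all task arguments.
java -DINPUT_DIR=$INPUT_DIR -DOUTPUT_DIR=$OUTPUT_DIR -DWORKING_DIR=$WORKING_DIR -DTEMP_DIR=$TEMP_DIR -jar ./app/MyTask.jar $@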

It is recommended to use this wrapper script as it allows obtaining the proper directories from the repository and provides an exit code for a proper callback. The sample script above shows one feature allowing the repository to provide information for the user application: variable substitution. All variables point to an absolute path within the execution environment. Available variables are:

Variable Content

${data.input.dir}

The directory containing all staged input data. If multiple Digital Objects are input for one task, the data of all objects will be located in this directory.

${data.output.dir}

The directory that can be used to store output data. All data located in this directory will be ingested as a new Digital Object as soon as the processing has succeeded.

${working.dir}

The working directory where the application archive is extracted to. Furthermore, the execution settings, which can be provided for each task execution, are stored in a file dataworkflow.properties, which is also located in the working directory.

${temp.dir}

A temporary directory where the user application can store intermediate data. The content of this directory will be removed during task cleanup.

By default, the Data Workflow Service replaces these variables only in the file run.sh. In order to enforce variable substitution in custom files, e.g. settings or other application-specific files, an empty file named dataworkflow_substitution has to be placed in each directory where substitution should take place. Variable substitution will then be applied to all files in this directory with a size smaller than 10 MB.
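A minimal sketch, assuming a hypothetical configuration directory app/config inside the application package that contains files with placeholder variables:

#Mark the directory for variable substitution by placing an empty marker file in it.
touch app/config/dataworkflow_substitution
#Re-create the application package so that the marker file becomes part of the task archive.
zip -r mytask-1.0.zip run.sh app/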

The actual execution of a workflow task is covered by an associated execution environment handler. This handler executes each task in multiple phases which are the following:

Phase Name    Actions                                                                                                          Next Phase
SCHEDULED     Initial phase after creation.                                                                                    PREPARING
PREPARING     Creation of task directories, obtaining and extracting application package and performing variable substitution. STAGING
STAGING       Provide the data of all input Digital Objects in the input directory of the task.                                PROCESSING
PROCESSING    Use the concrete handler implementation to execute/submit and monitor the application execution.                 INGEST
INGEST        Ingest the data located in the task output directory as a new Digital Object.                                    CLEANUP
CLEANUP       Remove all task directories and their contents.                                                                  -

Each phase has a PHASE_SUCCESSFUL and a PHASE_FAILED state. If a phase has been completed successfully during the last handler execution cycle, the task will enter the next phase in the next cycle. If the phase execution has failed, the task has to be reset to the last phase’s SUCCESSFUL state manually in order to reattempt the phase execution. The actual execution of tasks is triggered either via a Cron job executing the DataWorkflowTrigger script or by creating an appropriate job schedule using the AdminUI.

For each handled workflow task only one (or no) state transition will occur during a single Cron/job schedule execution. Possible transitions are depicted in figure TaskStatus. In case of data transfer phases, e.g. STAGING and INGEST, the phase might be entered multiple times as long as the data transfer operation has not finished yet. All other phases are implemented in a synchronous way, so the according Cron/job schedule execution won’t finish until the phase is either in its SUCCESSFUL or FAILED state.
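A sketch of a corresponding Cron entry; the location of the DataWorkflowTrigger script, the executing user and the interval are assumptions and have to be adapted to your installation:

#Hypothetical entry in /etc/cron.d/kitdm-dataworkflow: run the DataWorkflowTrigger every five minutes as the Tomcat user.
*/5 * * * * tomcat7 /path/to/DataWorkflowTrigger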
TaskStatus
Figure 7. Task state transitions of data workflow tasks.

Summarizing, the Data Workflow Service offers great potential for processing Digital Objects in an automated way, including the tracking of provenance information for better reproducibility. For more details on how to set up execution environments and workflow tasks and how to trigger them, please refer to the chapters Installation of KIT Data Manager and Settings of the Administration UI.

Audit

The Audit Service provides functionality to capture audit information about changes of resources stored in the repository system. One example for using audit information is the documentation of the lifecycle of a Digital Object, starting with the creation and ingestion of content, followed by the modification of metadata, the validation of content and/or metadata and the migration or replication. Finally, the audit information may also contain the deaccession and deletion of the resource.

In KIT Data Manager there are two ways audit events are triggered: internally and externally. Internal audit events are statically integrated into KIT Data Manager workflows in order to ensure that they are triggered as soon as they occur, e.g. the successful creation of a Digital Object via a REST endpoint. For performance reasons internal events are only triggered in REST endpoints. If a Digital Object is created directly using the MetadataManagement Java API, a corresponding audit event has to be triggered manually. Currently, internal audit events are generated if a Study, an Investigation or a Digital Object is created or modified. Other audit events might be added in the future. For more information on how to do this please refer to the according sample in the Samples module of KIT Data Manager.

The setup of the Audit component is shown in figure AuditWorkflow.

AuditWorkflow
Figure 8. Workflow of publishing and consuming audit events. An event is produced by an EventProducer, e.g. by a REST service during the creation of a digital object. The event is published using the configured EventPublisher. Currently, the default EventPublisher is realized by RabbitMQ. For the configured EventPublisher there should be an according EventReceiver, in our case there is a RabbitMQ receiver configured as servlet running in the local Tomcat container. The EventReceiver is then responsible to distribute received events to configured consumers, in the example there is a consumer writing all events to a logfile and another consumer writing the same events to a database.

The current Audit workflow is implemented using RabbitMQ because RabbitMQ allows asynchronous publishing of messages from different sources and takes care of persisting undelivered messages on its own. Therefore, RabbitMQ is expected to scale very well and the impact of publishing audit events should be very low.

By default, audit events are distributed via a locally running RabbitMQ server. If no server is installed or the server is not properly configured, audit events are logged via logback to the configured logfile (e.g. $CATALINA_HOME/temp/datamanager.log).
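To verify that the local RabbitMQ server used for audit event distribution is up and running, its status can be queried on the command line:

user@localhost:/home/user$ sudo rabbitmqctl status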

Installing KIT Data Manager

The following sections describe the basic installation steps of KIT Data Manager. After following the installation instructions you’ll have all available KIT Data Manager services deployed in a Tomcat environment accessible via REST interfaces. Everything beyond, e.g. graphical user interfaces or advanced deployment scenarios, is not part of this documentation.

Prerequisites

In order to be able to provide KIT Data Manager services you’ll need at least the following software components installed:

  • Java SE Development Kit 8

  • Apache Tomcat 7 (Tomcat 8 is currently not supported)

  • PostgreSQL 9.1+

  • RabbitMQ 3.6.5+

  • Elasticsearch 5.1+

In principle it is possible to install KIT Data Manager on every operating system. For simplicity, this documentation only covers the installation on a Unix-based system (namely Ubuntu 16.04 LTS) in detail.

Software Installation

At first, please install the required software packages. For this, you can either use the package manager of your system or download the packages manually. The following steps are based on an installation using Apt:

user@localhost:/home/user$ sudo apt-get install postgresql
Reading package lists... Done
Building dependency tree
Reading state information... Done
postgresql is already the newest version.
user@localhost:/home/user$ sudo apt-get install openjdk-8-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
openjdk-8-jdk is already the newest version.
user@localhost:/home/user$ sudo apt-get install tomcat7
Reading package lists... Done
Building dependency tree
Reading state information... Done
tomcat7 is already the newest version.
user@localhost:/home/user$ sudo apt-get install rabbitmq-server
Reading package lists...
Building dependency tree...
Reading state information...
rabbitmq-server is already the newest version.
user@localhost:/home/user$

As Elasticsearch is not available via a software repository, it has to be downloaded and configured first.

Elasticsearch

If an older version of Elasticsearch is already installed, an upgrade should be made to preserve existing indices (see the links below).

The cluster name and hostname need to be known during further installation steps.

Step 1: Download the newest version (5.x) of Elasticsearch (tested with version 5.1.1).

user@localhost:/home/user$ wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.deb

Step 2: Install Elasticsearch

user@localhost:/home/user$ sudo dpkg -i elasticsearch-5.1.1.deb

Step 3: Configure Elasticsearch. The Elasticsearch configuration file (elasticsearch.yml) in the /etc/elasticsearch directory has to be modified. To avoid conflicts with other installations a unique cluster name has to be chosen. Look for the corresponding line, remove the leading # and change the value:

cluster.name: KITDataManager@hostname

Set hostname to the hostname of the host running Elasticsearch or to another unique value and keep the value of cluster.name for later use.

Please also check the System Configuration section of the Elasticsearch documentation, especially for increasing the number of open file descriptors.
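One common way to raise the open file descriptor limit for an Elasticsearch service installed from the Debian package is a systemd override; the limit value below is only an example:

user@localhost:/home/user$ sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
user@localhost:/home/user$ printf '[Service]\nLimitNOFILE=65536\n' | sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf
user@localhost:/home/user$ sudo /bin/systemctl daemon-reload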

Step 4: Configure Elasticsearch as a daemon and start the service

user@localhost:/home/user$ sudo /bin/systemctl daemon-reload
user@localhost:/home/user$ sudo /bin/systemctl enable elasticsearch.service
user@localhost:/home/user$ sudo /bin/systemctl start elasticsearch.service

That’s it. Now continue setting up KIT Data Manager to index metadata to Elasticsearch.

Since Elasticsearch version 5.0.0 the ports (9200 and 9300) for accessing Elasticsearch are only accessible from localhost. However, everyone who is able to access Elasticsearch has all privileges, e.g. also deleting the index. For production environments further security measures are recommended.

For more detailed information on how to setup Elasticsearch on Ubuntu please refer to the following links.

Setup KIT Data Manager

Now you have installed and configured all software packages you need to run KIT Data Manager. At this point you should extract your KIT Data Manager binary distribution archive to a preferred location which will be referred to below as $KITDM_LOCATION. Afterwards change the ownership to tomcat7:tomcat7.

user@localhost:/home/user$ cd $KITDM_LOCATION
user@localhost:/home/user$ unzip KITDM-<Version>.zip
user@localhost:/home/user$ sudo chown -R tomcat7:tomcat7 KITDM
user@localhost:/home/user$

There should now be the application itself and a number of configuration files and scripts, which are covered later, available at $KITDM_LOCATION. In the following steps the relational database used by KIT Data Manager is prepared. The first step is setting a password for the database user that was created during installation.

user@localhost:/home/user$ sudo -u postgres psql postgres
psql (9.1.13)
Type "help" for help.

postgres=#\password postgres

Set a password for the database user postgres and keep it in mind for later configuration steps. Afterwards, the KIT Data Manager database can be created.

postgres=#create database datamanager;
CREATE DATABASE
postgres=#\q

Alternatively, this can also be done from the command line.

user@localhost:/home/user$ sudo -u postgres createdb datamanager
user@localhost:/home/user$

A database named datamanager has been created and can be set up. For this purpose, a file named $KITDM_LOCATION/sql/schema.sql is provided as part of your KIT Data Manager distribution. Apply the SQL statements in this file to the database with the following command:

user@localhost:/home/user$ sudo -u postgres psql -U postgres -d datamanager -f $KITDM_LOCATION/sql/schema.sql
CREATE TABLE
CREATE SEQUENCE
ALTER SEQUENCE
[...]
CREATE INDEX
user@localhost:/home/user$
Adding sample data is no longer necessary since KIT DM version 1.5. For security reasons, a first start wizard has been implemented for creating the administrator user and performing the basic configuration.

In the next step you have to deploy the KIT Data Manager and WebDav Web applications on the local Tomcat server. To do so, copy the files $KITDM_LOCATION/webapp/KITDM.xml and $KITDM_LOCATION/webapp/webdav.xml to $CATALINA_HOME/conf/Catalina/localhost/. In our case this would be /var/lib/tomcat7/conf/Catalina/localhost/.

user@localhost:/home/user$ sudo cp $KITDM_LOCATION/webapp/*.xml /var/lib/tomcat7/conf/Catalina/localhost/
user@localhost:/home/user$

Now, the custom WebDav implementation must be deployed to Tomcat. To do so, execute the following commands:

user@localhost:/home/user$ sudo cp $KITDM_LOCATION/tomcat-ext/*.jar /usr/share/tomcat7/lib/
user@localhost:/home/user$ sudo chown tomcat7:tomcat7 /usr/share/tomcat7/lib/*.jar
user@localhost:/home/user$

This will copy a couple of libraries to your Tomcat installation, which should be:

  • commons-codec-1.7.jar

  • logback-classic-1.0.11.jar

  • logback-core-1.0.11.jar

  • postgresql-9.1-901.jdbc4.jar

  • slf4j-api-1.7.5.jar

  • tomcat-ext-1.1.1.jar

The library versions as well as the JDBC driver (postgresql-9.1-901.jdbc4.jar) may differ, depending on the KIT DM version and your local installation.

Finally, you have to modify /var/lib/tomcat7/conf/Catalina/localhost/KITDM.xml and /var/lib/tomcat7/conf/Catalina/localhost/webdav.xml in order to make KITDM_LOCATION point to the absolute value of $KITDM_LOCATION. Inside /var/lib/tomcat7/conf/Catalina/localhost/webdav.xml also set the correct values for DB_HOST, DB_PORT, DB_NAME, DB_USER and DB_PASSWORD to be able to access the user database. The default values according to the above configuration options are DB_HOST=localhost and DB_NAME=datamanager. Typically, DB_PORT should be 5432 for PostgreSQL, but might be different if this port was already in use during installation. DB_USER and DB_PASSWORD must be set to the previously created/chosen values.

Now, your KIT Data Manager services are ready for further configuration.

Basic Configuration

All configuration files of KIT Data Manager itself are located inside the web application folder, by default at $KITDM_LOCATION/KITDM/WEB-INF/classes/. Relevant configuration files are:

META-INF/persistence.xml

Contains all JPA persistence units for database access. Values that must be changed are DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD. The default values according to the above configuration options are DB_HOST=localhost and DB_NAME=datamanager. Typically, DB_PORT should be 5432 for PostgreSQL, but might be different if this port was already in use during installation. DB_USER and DB_PASSWORD must be set to the previously created/chosen values.

datamanager.xml

Contains all KIT Data Manager settings. Typically, most of them can remain unchanged, except settings containing placeholder variables (e.g. KITDM_LOCATION, HOSTNAME, and ARCHIVE_STORAGE). Database settings, e.g. for setting up the scheduler persistence backend, are typically the same as in the previous configuration file. The Elasticsearch settings also have to be adapted. For more information, please refer to the comments inside datamanager.xml.

logback.xml

The configuration of the logging framework. By default, all warnings and error messages are logged into the Tomcat temp folder located e.g. at /tmp/tomcat7-tomcat7-tmp/ or $CATALINA_HOME/temp/, but you can also set another value instead of ${java.io.tmpdir}/datamanager.log as log file location at any time.

Please go through all files at least during your first installation to check which settings are in there and which you have to change in order to get your KIT Data Manager instance running.

In order to be able to use KIT Data Manager as a repository system, at least one access point must be configured to allow data ingest and download. Since version 1.4 the default (and recommended) way to transfer data to and from KIT Data Manager is via the provided WebDav servlet as it offers several advantages:

  1. The configuration effort, e.g. installing additional software packages, opening ports and permission management, is reduced to a minimum.

  2. The KIT Data Manager user management is seamlessly integrated with the WebDav servlet.

  3. The WebDav servlet offers built-in authorization of resource access for KIT Data Manager users and groups.

The first step in order to enable WebDav access was already done in the last section by copying and modifying the file $KITDM_LOCATION/webapp/webdav.xml and the libraries located in tomcat-ext. In the next step please check $KITDM_LOCATION/webdav/WEB-INF/web.xml. Typically, no changes are necessary here but the inline comments may help to understand which configuration options there are if customization becomes necessary. The most relevant settings are in the section <login-config>. In this section the realm-name, which is set to kitdm, and the auth-method, which is set to DIGEST, are defined. Both values should remain unchanged unless there is a necessity to change them, because if they change they must also be changed in the according section of datamanager.xml.

There you’ll find the configuration section:

<authenticator class="edu.kit.dama.rest.util.auth.impl.HTTPAuthenticator">
     <authenticatorId>webdav</authenticatorId>
     <!--The HTTP realm needed if type is 'DIGEST'.-->
     <realm>kitdm</realm>
     <!--The type that must be either BASIC or DIGEST.-->
     <type>DIGEST</type>
 </authenticator>

The values of realm and type must match the configuration in the web.xml as the HTTPAuthenticator is used to generate the WebDav credentials for the KIT Data Manager users that are later used to authenticate. If there is a configuration mismatch, authentication will fail.

NOW start/restart your Tomcat container in order to apply the configuration changes.

The final step is to register a new staging access point using WebDav access via the administration backend of KIT Data Manager. For this purpose please browse to http://localhost:8080/KITDM on your KIT Data Manager machine. As this should be the first time accessing the installation, a wizard will appear guiding you through the initial setup.

Please follow the wizard until you reach the third step. At this point you have to input Base Url and Base Path. The Base Url looks like http://myhost:8080/webdav depending on your local hostname, the Base Path should be the absolute path to $KITDM_LOCATION/webdav. At the end of the wizard the access point is committed to the database together with all other settings.

Enhanced Configuration

ORCiD Login and Registration

Since KIT Data Manager 1.4 user registration and login via ORCiD is supported. In order to enable ORCiD service access a couple of working steps are required. First, please go to ORCiD.org and enable public API access for your account. You should be able to find it under 'For Researchers' → 'Developer Tools'. Provide an application name, e.g. MyRepository, your website URL, e.g. http://myinstitution.org, a short description of your application and at least one redirect URI, which is the base URL of your KIT Data Manager instance, e.g. https://myhost:8080/KITDM. You may also provide multiple redirect URIs later in order to support other applications, too.

After providing all information, you’ll find Client ID and Client secret on the ORCiD page. Now, add these two tokens to datamanager.xml into the according fields (login.orcid.clientid and login.orcid.clientsecret) of the authorization section. After (re-)starting your Tomcat instance you should see the optional ORCiD login when entering https://myhost:8080/KITDM.

From the configuration and implementation perspective, B2ACCESS login is also possible in the current version. However, as B2ACCESS is by default configured not to send any information helping to identify the user, this feature has been disabled for the time being.

Authorization-less Access to Data Organization

Section REST-based Data Download describes a feature allowing to download data directly via the Data Organization REST endpoint. By default, this feature is disabled in order to avoid open access to repository content by accident. In order to enable the authorization-less access to defined nodes via the Data Organization REST endpoint you have to modify $KITDM_LOCATION/KITDM/WEB-INF/web.xml and uncomment/change one or more of the following init-param nodes.

There are four different parameter blocks, each for one authorization-less access type. You should decide which of them you really need as they require you to organize your Digital Objects accordingly. The following table shows all options, sample values and a short description.

public.view.names (example value: public;public2)

List of Data Organization views that are publicly accessible. In the example, all elements stored in views named public and public2 of all Digital Objects in the repository system are publicly accessible. Attention: You should never provide the default view here as this would open the entire data of a Digital Object.

public.attribute.key (example value: isOpen)

A single attribute which makes a node (either a collection or a file node) publicly accessible if an attribute with the provided name has been assigned to the node. In the example, all Data Organization nodes having the attribute isOpen assigned are publicly accessible. The attribute value is currently not evaluated.

public.collection.node.access.allowed (example value: false)

Either true or false. Mainly for testing purposes, to allow authorization-less access to entire collection nodes in all Digital Objects and all Data Organization views. It is recommended NOT to enable this option.

public.file.node.filter (example value: (.*)\.jpg$)

Regular expression granting public access to all file nodes in all Digital Objects and all Data Organization views for which the node name fulfills the regular expression. In the example, all nodes with the extension jpg are publicly available.

Please refer to the inline documentation or the according section in the architecture description for details.

The following snippet shows the according configuration section in the KIT Data Manager’s web.xml.

<!--[...]-->
<!--DataOrganization REST interface-->
    <servlet>
        <servlet-name>DataOrganizationServiceAdapter</servlet-name>
        <servlet-class>com.sun.jersey.spi.container.servlet.ServletContainer</servlet-class>
        <init-param>
            <param-name>com.sun.jersey.config.property.packages</param-name>
            <param-value>edu.kit.dama.rest.dataorganization.services.impl</param-value>
        </init-param>
        <init-param>
            <param-name>com.sun.jersey.api.json.POJOMappingFeature</param-name>
            <param-value>true</param-value>
        </init-param>
        <!--Access to all nodes in the listed view(s), multiple views are separated by ';', is allowed without any authorization. (Only recommended for special views) -->
        <!--init-param>
            <param-name>public.view.names</param-name>
            <param-value>public</param-value>
        </init-param-->
        <!--Access to all nodes having an attribute with the mentioned key assigned is allowed without any authorization. (Can be used for fine-grained selection) -->
        <!--init-param>
            <param-name>public.attribute.key</param-name>
            <param-value>public</param-value>
        </init-param-->
        <!--Access collection nodes (download zipped version of all children) is allowed (Typically not recommended) -->
        <!--init-param>
            <param-name>public.collection.node.access.allowed</param-name>
            <param-value>false</param-value>
        </init-param-->
        <!--Access file nodes is allowed if node name matches the provided pattern, e.g. (.*)\.jpg$ for all nodes ending with .jpg
        Note that special characters are only escaped with one slash, not with two as if the pattern is provided as string!-->
        <!--init-param>
            <param-name>public.file.node.filter</param-name>
            <param-value>(.*)\.jpg$</param-value>
        </init-param-->
        <load-on-startup>1</load-on-startup>
    </servlet>
    <servlet-mapping>
        <servlet-name>DataOrganizationServiceAdapter</servlet-name>
        <url-pattern>/rest/dataorganization/*</url-pattern>
    </servlet-mapping>
<!--[...]-->
Keep in mind that enabling authorization-less access affects ALL Digital Objects stored in the repository. Therefore, you should configure this feature in a way that minimizes the risk of opening data by accident, e.g. by using randomly generated view/attribute names or by using a very specific name filter for accessing file nodes, e.g. the exact name instead of a too broad filter pattern.

Support for Ingest of Custom Data Organization Trees

Basically, the ingest of custom Data Organization trees makes use of the default workflows already available with a basic KIT Data Manager installation. However, enabling this feature requires two configuration steps:

Add a StagingProcessor

To do so, you can use the AdminUI as described in this section. Add a new StagingProcessor using the implementation class edu.kit.dama.staging.processor.impl.ReferenceTreeIngestProcessor. After adding the new processor select the checkboxes Default and INGEST SUPPORTED. Finish the configuration by clicking Commit. Now, this processor will check for each ingest whether a custom Data Organization description is provided.

Add a custom AdalapiProtocolConfiguration

This step is necessary if you plan to use the Staging Service to access remotely (via HTTP) accessible data provided as part of a custom Data Organization tree. By default, KIT Data Manager assumes that all HTTP URLs are accessible via the WebDav protocol with interactive authentication performed by the user. For automated staging operations this is not useful. Therefore, in order to provide basic read access to HTTP URLs by the Staging Service, the access protocol as well as the authentication has to be selected based on the accessed URL. To do so, the protocol, hostname and port (if available) are used to create the unique identifier of a so-called AdalapiProtocolConfiguration, which is stored in a table of the same name. A sample row of this table looks as follows:

id: 1
authenticatorclass: edu.kit.dama.staging.adalapi.authenticator.KITDMAuthenticator
customproperties: {"repository.context":"empty"}
identifier: http@remotehost
protocolclass: edu.kit.dama.staging.adalapi.protocol.SimpleHttp

For the time being, the only two ways to add a new AdalapiProtocolConfiguration are manually adding a table row via an SQL statement or programmatically adding an entry via the Java APIs. Furthermore, the exact values shown above must be used. The only fields that change are the id, which has to be unique, and the identifier, which also has to be unique and must be generated based on the URL that should be accessed. For manually adding a configuration via an SQL statement, the following table shows some examples of identifiers for specific URLs (see also the sketch after the table):

Remote URL Identifier

http://www.google.com/[...]

http@www.google.com

https://myDomain:8443/[...]

https@myDomain:8443

http://localhost:8080/webdav/[...]

http@localhost:8080
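
As an illustration of this identifier scheme, the following Java sketch derives the identifier for a given URL according to the <protocol>@<host>[:<port>] pattern shown above. The helper method identifierFor() is hypothetical and only meant as an aid for preparing manual SQL statements; when using the Java API shown below, the identifier is derived from the provided sample URL automatically:

//Illustrative sketch: derive the AdalapiProtocolConfiguration identifier for a URL
//following the <protocol>@<host>[:<port>] scheme shown in the table above.
public String identifierFor(URL url) {
  String identifier = url.getProtocol() + "@" + url.getHost();
  if (url.getPort() > 0) {
    //Only append the port if it is explicitly contained in the URL.
    identifier += ":" + url.getPort();
  }
  return identifier;
}

//identifierFor(new URL("http://www.google.com/someQuery")) returns 'http@www.google.com'
//identifierFor(new URL("https://myDomain:8443/somePath")) returns 'https@myDomain:8443'
//identifierFor(new URL("http://localhost:8080/webdav/someFile.dat")) returns 'http@localhost:8080'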

Adding a configuration via Java APIs can be achieved as follows:

//Create properties map containing properties that might be needed by the protocol and the authenticator implementation
Map<String, Object> properties = new HashMap<>();
properties.put("repository.context", "empty");
//Instantiate a new AdalapiProtocolConfiguration. The configuration identifier is obtained using the provided sample URL.
AdalapiProtocolConfiguration config = AdalapiProtocolConfiguration.factoryConfiguration(new URL("http://remotehost/"), edu.kit.dama.staging.adalapi.protocol.SimpleHttp.class.getCanonicalName(), edu.kit.dama.staging.adalapi.authenticator.KITDMAuthenticator.class.getCanonicalName(), properties);

//Persist the configuration using the MetadataManagement of KIT Data Manager
IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager();
mdm.setAuthorizationContext(AuthorizationContext.factorySystemContext());
try {
   mdm.save(config);
} finally {
   mdm.close();
}

Now, while processing a download operation, the Staging Service can check for each LFN obtained from a Data Organization node whether its protocol-host-port combination matches one of the registered identifiers. If this is the case, the URL is accessed using the configured protocol class and authenticated using the provided authenticator class. This also means that for each distinct protocol-host-port combination one table entry is needed. For the sample row in the table above, all URLs starting with http://remotehost/ will be accessed by the Staging Service using the protocol class edu.kit.dama.staging.adalapi.protocol.SimpleHttp and the authenticator class edu.kit.dama.staging.adalapi.authenticator.KITDMAuthenticator while preparing a data download.
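
The following snippet illustrates this lookup conceptually; it is not the actual Staging Service code. It re-uses the hypothetical identifierFor() sketch from above and assumes that AdalapiProtocolConfiguration exposes a getIdentifier() getter for the identifier column; please check the JavaDoc for the exact accessor:

//Conceptual check whether a custom protocol configuration is registered for a given LFN URL.
URL lfnUrl = new URL("http://remotehost/data/someFile.dat");
//Derive the identifier as sketched above, in this case 'http@remotehost'.
String identifier = identifierFor(lfnUrl);

IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager();
mdm.setAuthorizationContext(AuthorizationContext.factorySystemContext());
try {
  boolean configured = false;
  for (AdalapiProtocolConfiguration existing : mdm.find(AdalapiProtocolConfiguration.class)) {
    //getIdentifier() is assumed to return the value of the 'identifier' column shown above.
    if (identifier.equals(existing.getIdentifier())) {
      configured = true;
      break;
    }
  }
  //If a matching configuration exists, the configured protocol and authenticator classes are used for this URL.
  System.out.println("Custom protocol configuration registered: " + configured);
} finally {
  mdm.close();
}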

Updating KIT Data Manager

The effort for updating from one KIT Data Manager version to another mainly depends on how KIT Data Manager is used and customized. A typical update process of a standard installation described earlier in this documentation includes the following steps:

  1. Stop the Tomcat container in which KIT Data Manager is running

  2. Create a backup of $KITDM_LOCATION and the database

  3. Update the KIT Data Manager libraries

  4. Update database schema and/or settings if required

  5. Rebuild and redeploy all additionally added, custom libraries (*)

  6. Restart the Tomcat container

(*) Step 5 is optional and will be described in section Update Custom Libraries.

The following detailed description for updating a KIT Data Manager installation is based on Ubuntu. If you are using another distribution some commands and service names might be slightly different. Also the username and database name used by the pg_dump command might be different depending on how your KIT Data Manager instance is configured.

user@localhost:/home/user$ sudo service tomcat7 stop
* Stopping Tomcat servlet engine tomcat7
user@localhost:/home/user$ cd $KITDM_LOCATION
user@localhost:/home/user$ mkdir ../backup_1.0
user@localhost:/home/user$ cp * ../backup_1.0 -R
user@localhost:/home/user$ sudo -u postgres pg_dump -U postgres -h localhost -d datamanager -W > ../backup_1.0/database_dump.sql
user@localhost:/home/user$

In the next step, the libraries of your KIT Data Manager installation have to be updated. Please download the update package KITDM-<VERSION>_Update.zip to $KITDM_LOCATION and continue as follows:

user@localhost:/home/user$ rm KITDM/WEB-INF/lib/*.jar
user@localhost:/home/user$ rm KITDM/WEB-INF/classes/edu/kit/dama/ -R
user@localhost:/home/user$ unzip -u KITDM-<VERSION>_Update.zip
user@localhost:/home/user$

In the first two steps, all libraries and classes of the old version are deleted. This is necessary because a library named Authorization-1.0.jar would be favored by the classloader even if there is a newer version named Authorization-1.1.jar. Afterwards, extracting KITDM-<VERSION>_Update.zip places all libraries of version <VERSION> directly at $KITDM_LOCATION/KITDM/WEB-INF/lib/.

Since KIT Data Manager 1.2 there is an update tool supporting the process of updating libraries from one version to another. You’ll find the tool at $KITDM_LOCATION/scripts/LibraryCompare.sh. The tool allows you to compare the contents of the library folder of an old installation with the contents of the library folder of a new installation. The differences can be printed to stdout or written into a script that can be used to merge both library folders. However, it is highly recommended to back up the old library folder before applying the update script. Please refer to the command line help of the script for further information.

After updating all core libraries there might be additional steps necessary, e.g. applying database schema or configuration file changes. Necessary steps are described for each new version in section Additional Update Steps. If you skip one or more versions you typically have to apply all intermediate steps unless the documentation states something different.

Update Custom Libraries

As soon as you have started integrating community-specific features into the basic repository system, the question arises how these customizations are carried over from one version to the other. Basically, this is quite simple. First, it is highly recommended to rebuild all customizations, e.g. Staging Processors or custom metadata entities, against the current version of KIT Data Manager in order to detect the use of deprecated interfaces or updated imports. Afterwards, all custom libraries can be placed at $KITDM_LOCATION/KITDM/WEB-INF/lib/ again and should work as before. If this cannot be expected, e.g. due to internal changes, a corresponding remark is given in the following chapter.

Additional Update Steps

The following section contains the additional update steps necessary to get from one version of KIT Data Manager to the next. There are different types of updates:

DatabaseUpdate Database schema changes

ConfigurationUpdate Configuration file changes

RESTInterfaceChanges REST interface changes

For database changes (DatabaseUpdate) typically an SQL update script is provided that has to be executed using the following command:

user@localhost:/home/user$ sudo -u postgres psql -U postgres -W -h localhost -d datamanager -f $KITDM_LOCATION/sql/update.1.0-1.1.sql
user@localhost:/home/user$

Of course, you should use the username and database name that fit your KIT Data Manager installation. The name of the update script also changes depending on the affected versions.

Please refer to the following sections for all changes, including whether applying them is optional and which side effects they have:

Update 1.0 → 1.1

Type Optional Todo Side Effects

DatabaseUpdate

No

Apply script sql/update.1.0-1.1.sql

All ingest and download information entries are deleted. Therefore, it is recommended to finish all data transfers before updating.

ConfigurationUpdate

Yes

Add nodes for maxIngestLifetime and maxDownloadLifetime to the staging section in $KITDM_LOCATION/KITDM/WEB-INF/classes/datamanager.xml. The default value is 604800 seconds (one week).

None.

Update 1.1 → 1.2

Type Optional Todo Side Effects

DatabaseUpdate

No

Apply script sql/update.1.1-1.2.sql

This script adds the Quartz scheduler related tables to the database. Furthermore, a typo was fixed and an additional column was added to the table ExecutionEnvironmentConfiguration. As the workflow service was not publicly available before version 1.2, this table should not exist yet. Therefore, errors related to changing this table can be safely ignored.

ConfigurationUpdate

No

Add a new section scheduler containing the settings for the internal scheduler feature.

None.

ConfigurationUpdate

No

Update the entity list in persistence.xml for persistence unit MDM-Core according to the list below.

None.

The following section has to be added to datamanager.xml in order to configure the Quartz scheduler. The values of the job store have to be changed according to your local configuration:

<config>
   <!--[...]-->
   </staging>
   <scheduler>
     <!--Connection information for the JobStore used to hold information about scheduled jobs. Typically, the same information also used
         in the persistence.xml can be applied here in order to keep everything together in one place.-->
     <jobStoreConnectionDriver>org.postgresql.Driver</jobStoreConnectionDriver>
     <jobStoreConnectionString>jdbc:postgresql://DB_HOST:DB_PORT/DB_NAME</jobStoreConnectionString>
     <jobStoreUser>DB_USER</jobStoreUser>
     <jobStorePassword>DB_PASSWORD</jobStorePassword>
     <!--Wait for running tasks if the scheduler is shutting down, e.g. due to shutting down the application server. default: true-->
     <waitOnShutdown>true</waitOnShutdown>
     <!--Delay in seconds before the scheduler starts the execution of tasks. This delay is useful as services needed to perform tasks
         may not be running, yet, when the scheduler starts. The default value is 5 seconds.-->
     <startDelaySeconds>5</startDelaySeconds>
      <!--Add default schedules during the first startup of the scheduler. These schedules are executing transfer finalization
          (ingest/download) every 60/30 seconds. The default value is true.-->
     <addDefaultSchedules>true</addDefaultSchedules>
   </scheduler>
</config>

Current list of entities (KIT DM 1.2) that have to be registered in persistence.xml for persistence unit MDM-Core:

 <!-- ********************************************************************
    ***                           MDM-Core                                ***
    *************************************************************************-->
<persistence-unit name="MDM-Core" transaction-type="RESOURCE_LOCAL">
<!--[...]-->
    <class>edu.kit.dama.authorization.entities.impl.Group</class>
    <class>edu.kit.dama.authorization.entities.impl.User</class>
    <class>edu.kit.dama.authorization.entities.impl.Membership</class>
    <class>edu.kit.dama.authorization.entities.impl.GrantImpl</class>
    <class>edu.kit.dama.authorization.entities.impl.GrantSet</class>
    <class>edu.kit.dama.authorization.entities.impl.ResourceReference</class>
    <class>edu.kit.dama.authorization.entities.impl.Grant</class>
    <class>edu.kit.dama.authorization.entities.impl.FilterHelper</class>
    <class>edu.kit.dama.authorization.entities.impl.SecurableResource</class>
    <class>edu.kit.dama.mdm.base.OrganizationUnit</class>
    <class>edu.kit.dama.mdm.base.Study</class>
    <class>edu.kit.dama.mdm.base.Investigation</class>
    <class>edu.kit.dama.mdm.base.DigitalObject</class>
    <class>edu.kit.dama.mdm.base.Participant</class>
    <class>edu.kit.dama.mdm.base.Relation</class>
    <class>edu.kit.dama.mdm.base.Task</class>
    <class>edu.kit.dama.mdm.base.UserData</class>
    <class>edu.kit.dama.mdm.base.MetaDataSchema</class>
    <class>edu.kit.dama.mdm.base.DigitalObjectType</class>
    <class>edu.kit.dama.mdm.base.ObjectTypeMapping</class>
    <class>edu.kit.dama.mdm.base.DigitalObjectTransition</class>
    <class>edu.kit.dama.mdm.base.ObjectViewMapping</class>
    <class>edu.kit.dama.mdm.dataworkflow.ExecutionEnvironmentConfiguration</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.ExecutionEnvironmentProperty</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTask</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTaskConfiguration</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTransition</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.StringValueProperty</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.LinuxSoftwareMapProperty</class>
    <class>edu.kit.dama.mdm.admin.ServiceAccessToken</class>
    <class>edu.kit.dama.mdm.admin.UserGroup</class>
    <class>edu.kit.dama.mdm.admin.UserProperty</class>
    <class>edu.kit.dama.mdm.admin.UserPropertyCollection</class>
    <!--class>edu.kit.dama.mdm.content.MetadataIndexingTask</class-->
<!--[...]-->
</persistence-unit>

Update 1.2 → 1.3

Type Optional Todo Side Effects

DatabaseUpdate

No

Apply script sql/update.1.2-1.3.sql

This script adds two new columns to the StagingProcessor table. Furthermore, the foreign key constraint of the Attribute table has been modified to allow cascading updates. Also, the fully qualified class name of LFNImpl has been changed (already in version 1.2); the corresponding database update is now part of this script. Finally, a typo has been fixed.

ConfigurationUpdate

No

Add a new section metaDataManagement containing metadata-related settings, e.g. for the new feature of object transition handlers. For more details see below.

None.

ConfigurationUpdate

No

Update the entity list in persistence.xml for persistence unit MDM-Core according to the list below.

None.

The following section has to be added to datamanager.xml in order to configure the transition type handlers of the metadata management:

<!--[...]-->
<metaDataManagement>
 <transitionTypes>
  <NONE>
   <handlerClass>edu.kit.dama.mdm.tools.NullTransitionTypeHandler</handlerClass>
  </NONE>
 <DATAWORKFLOW>
   <handlerClass>edu.kit.dama.mdm.dataworkflow.tools.DataWorkflowTransitionTypeHandler</handlerClass>
  </DATAWORKFLOW>
  <ELASTICSEARCH>
   <handlerClass>edu.kit.dama.mdm.content.util.ElasticsearchTransitionTypeHandler</handlerClass>
  </ELASTICSEARCH>
 </transitionTypes>
</metaDataManagement>
<!--[...]-->

Current list of entities (KIT DM 1.3) that have to be registered in persistence.xml for persistence unit MDM-Core:

 <!-- ********************************************************************
    ***                           MDM-Core                                ***
    *************************************************************************-->
<persistence-unit name="MDM-Core" transaction-type="RESOURCE_LOCAL">
<!--[...]-->
    <class>edu.kit.dama.authorization.entities.impl.Group</class>
    <class>edu.kit.dama.authorization.entities.impl.User</class>
    <class>edu.kit.dama.authorization.entities.impl.Membership</class>
    <class>edu.kit.dama.authorization.entities.impl.GrantImpl</class>
    <class>edu.kit.dama.authorization.entities.impl.GrantSet</class>
    <class>edu.kit.dama.authorization.entities.impl.ResourceReference</class>
    <class>edu.kit.dama.authorization.entities.impl.Grant</class>
    <class>edu.kit.dama.authorization.entities.impl.FilterHelper</class>
    <class>edu.kit.dama.authorization.entities.impl.SecurableResource</class>
    <class>edu.kit.dama.mdm.base.OrganizationUnit</class>
    <class>edu.kit.dama.mdm.base.Study</class>
    <class>edu.kit.dama.mdm.base.Investigation</class>
    <class>edu.kit.dama.mdm.base.DigitalObject</class>
    <class>edu.kit.dama.mdm.base.Participant</class>
    <class>edu.kit.dama.mdm.base.Relation</class>
    <class>edu.kit.dama.mdm.base.Task</class>
    <class>edu.kit.dama.mdm.base.UserData</class>
    <class>edu.kit.dama.mdm.base.MetaDataSchema</class>
    <class>edu.kit.dama.mdm.base.DigitalObjectType</class>
    <class>edu.kit.dama.mdm.base.ObjectTypeMapping</class>
    <class>edu.kit.dama.mdm.base.DigitalObjectTransition</class>
    <class>edu.kit.dama.mdm.base.ObjectViewMapping</class>
    <class>edu.kit.dama.mdm.dataworkflow.ExecutionEnvironmentConfiguration</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.ExecutionEnvironmentProperty</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTask</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTaskConfiguration</class>
    <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTransition</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.StringValueProperty</class>
    <class>edu.kit.dama.mdm.dataworkflow.properties.LinuxSoftwareMapProperty</class>
    <class>edu.kit.dama.mdm.admin.ServiceAccessToken</class>
    <class>edu.kit.dama.mdm.admin.UserGroup</class>
    <class>edu.kit.dama.mdm.admin.UserProperty</class>
    <class>edu.kit.dama.mdm.admin.UserPropertyCollection</class>
    <class>edu.kit.dama.mdm.content.ElasticsearchTransition</class>
    <class>edu.kit.dama.staging.entities.AdalapiProtocolConfiguration</class>
    <!--class>edu.kit.dama.mdm.content.MetadataIndexingTask</class-->
<!--[...]-->
</persistence-unit>

Update 1.3 → 1.4

Type Optional Todo Side Effects

DatabaseUpdate

No

Apply script sql/update.1.3-1.4.sql

This script adds new columns to the tables MetaDataSchema, DigitalObject and StagingProcessor. Furthermore, it fixes a typo in the table DataWorkflowTask. Finally, the script adds the system users and groups introduced in version 1.4.

ConfigurationUpdate

No

New sections authorization and audit added to datamanager.xml, major cleanup of file structure.

None.

ConfigurationUpdate

No

Update the entity list in persistence.xml for persistence unit MDM-Core according to the list below.

None.

Due to the large number of changes in the structure of datamanager.xml, it is recommended to use the sample file below and to manually merge all settings from the old configuration (database connections, elasticsearch settings, staging settings):

<!--KIT Data Manager configuration file. This file contains all general properties used to configure your KIT Data Manager instance.
-->
<config>
<general>
	<repositoryName>KIT Data Manager</repositoryName>
	<repositoryLogoUrl>http://datamanager.kit.edu/dama/logo_default.png</repositoryLogoUrl>
	<!--Can be accessed e.g. by GUIs to send system mail. Please replace $HOSTNAME by the local hostname.-->
	<systemMailAddress>${general.mail.sender}</systemMailAddress>
	<mailServer>${general.mail.server}</mailServer>
	<globalSecret>qr2I9Hyp0CBhUUXj</globalSecret>
	<!--The base URL of your application server, e.g. http://localhost:8080. Please replace $HOSTNAME by the local hostname. -->
	<baseUrl>${general.base.url}</baseUrl>
	<!--Enable/Disable production mode to show/hide additional logging output.-->
	<productionMode>true</productionMode>
</general>

<!--
SimpleMonitoring-related settings.
-->
<simon>
	<!--The path where the configuration files for the SimpleMonitoring are located. Please replace $KITDM_LOCATION by the absolute path of your KIT Data Manager installation.-->
	<configLocation>${simon.config.location}</configLocation>
</simon>
<!--
Elasticsearch-related settings.
-->
<elasticsearch>
	<!--The cluster name used by KIT Data Manager to publish metadata. (default: KITDataManager)-->
	<cluster>${elasticsearch.cluster}</cluster>
	<!--The hostname of the node where metadata should be published to. (default: localhost)-->
	<host>${elasticsearch.host}</host>
	<!--The port of the Elasticsearch instance. (default: 9300)-->
	<port>${elasticsearch.port}</port>
	<!--The default index that is accessed for metadata publishing/querying.
	The index to which metadata is published depends on the published metadata schema. (default: dc)
	-->
	<index>${elasticsearch.default.index}</index>
	<!--The elasticsearch document key which contains the fulltext representation of an entire document.
	The availability of this key depends on the metadata stored in the document.
	The default value is 'es.fulltext'; this property should not be changed.
	-->
	<!--fulltextKey>es.fulltext</fulltextKey-->
</elasticsearch>
<!--
MetaDataManagement-related settings.
-->
<metaDataManagement>
    <persistenceImplementations>
      <persistenceImplementation>
        <!--Name of the persistence implementation-->
        <name>JPA</name>
        <!--Implementation class of the persistence implementation-->
        <class>edu.kit.dama.mdm.core.jpa.PersistenceFactoryJpa</class>
        <persistenceUnits>
          <!-- A list of persistence units (configured endpoints) to store metadata.
          In case of the default JPA implementation these persistence units are
          actual persistence units configured in a persistence.xml file using the
		  MetaDataManagement implementation defined above. JPA persistence units not using
		  this implementation are not listed here. For other implementations of the
		  MetaDataManagement, these persistence units are probably mapped to something different.

          Attention:

		  PersistenceUnit labels should be the same for all implementations
          in order to be able to switch implementations.

                    The default persistence unit can be marked by an attribute 'default=true',
                    otherwise the first entry is interpreted as default persistence unit used by the
                    implementation if no persistence unit is specified.
                    -->
                    <persistenceUnit authorization="true">${persistence.authorizationPU}</persistenceUnit>
                    <persistenceUnit>DataOrganizationPU</persistenceUnit>
                    <!--Default persistence unit if the used persistence unit is not explicitly named.-->
                    <persistenceUnit default="true">MDM-Core</persistenceUnit>
                    <persistenceUnit staging="true">${persistence.stagingPU}</persistenceUnit>
                </persistenceUnits>
            </persistenceImplementation>
        </persistenceImplementations>

        <!--Transition type definitions and their handler implementations used by the base metadata REST
        endpoint to handle transition information provided as JSON structure.-->
        <transitionTypes>
            <NONE>
                <handlerClass>edu.kit.dama.mdm.tools.NullTransitionTypeHandler</handlerClass>
            </NONE>
            <DATAWORKFLOW>
                <handlerClass>edu.kit.dama.mdm.dataworkflow.tools.DataWorkflowTransitionTypeHandler</handlerClass>
            </DATAWORKFLOW>
            <ELASTICSEARCH>
                <handlerClass>edu.kit.dama.mdm.content.util.ElasticsearchTransitionTypeHandler</handlerClass>
            </ELASTICSEARCH>
        </transitionTypes>
    </metaDataManagement>
    <!--
    Staging-related settings.
    -->
    <staging>
        <adapters>
            <dataOrganizationAdapter class="edu.kit.dama.staging.adapters.DefaultDataOrganizationServiceAdapter" target="LOCAL"/>
            <ingestInformationServiceAdapter class="edu.kit.dama.staging.adapters.DefaultIngestInformationServiceAdapter" target="LOCAL"/>
            <downloadInformationServiceAdapter class="edu.kit.dama.staging.adapters.DefaultDownloadInformationServiceAdapter" target="LOCAL"/>
            <storageVirtualizationAdapter class="edu.kit.dama.staging.adapters.DefaultStorageVirtualizationAdapter" target="LOCAL">
                <!--The Url where the managed repository storage (archive) is located. All data ingested into the repository system will be located here.
                Currently, the DefaultStorageVirtualizationAdapter only supports locally accessible Urls. However, this can be remote storages mounted
                into the local filesystem. Please replace $ARCHIVE_STORAGE by the absolute path of your archive location, e.g. file:///mnt/archive/
                Attention: Please pay attention to provide three (!) slashes. Otherwise, all data transfer services of KIT Data Manager won't work.
                -->
                <archiveUrl>${staging.archive.url}</archiveUrl>
                <!--Pattern that is used to structure the data at 'archiveUrl'. Valid variables are:
                     $year: The current year, e.g. 2015
                     $month: The current month, e.g. 9
                     $day: The day of the month, e.g. 1
                     $owner: The userId of the user who has ingested the data, e.g. admin
                     $group: The groupId of the group on whose behalf the user has ingested the data, e.g. USERS
                -->
                <pathPattern>${staging.archive.path.pattern}</pathPattern>
            </storageVirtualizationAdapter>
        </adapters>
        <!--Possible overwrite for persistence unit defined in persistence section.-->
        <!--persistenceUnit>${persistence.stagingPU}</persistenceUnit-->
        <remoteAccess>
            <!--The remote access Url of the staging service (currently not used). Please replace $HOSTNAME by the local hostname.-->
            <restUrl>${staging.rest.url}</restUrl>
        </remoteAccess>
        <!--The max. number of single files that is transferred in parallel to/from the archive location to access point locations.
    This number refers to one single staging operation (ingest/download). If there are two staging operations running in parallel,
        two times 'maxParallelTransfers' are used.-->
        <maxParallelTransfers>10</maxParallelTransfers>
        <!--The max. number of simultaneous ingest/download operations. This setting is used by the TransferFinalizer tool. The tool itself
        handles one ingest/download per execution. However, by running the TransferFinalizer as Cron job multiple instances may run in
        parallel. As soon as maxParallelIngests/maxParallelDownloads is reached TransferFinalizer will return without doing anything.-->
        <maxParallelIngests>2</maxParallelIngests>
        <maxParallelDownloads>2</maxParallelDownloads>
        <!--The max. lifetime in seconds before completed/failed ingests/downloads are removed from the database by the TransferFinalizer.
        The default value is one week.-->
        <maxIngestLifetime>604800</maxIngestLifetime>
        <maxDownloadLifetime>604800</maxDownloadLifetime>
    </staging>

    <scheduler>
        <!--Connection information for the JobStore used to hold information about scheduled jobs. Typically, the same information also used
        in the persistence.xml can be applied here in order to keep everything together in one place.-->
        <jobStoreConnectionDriver>${persistence.connection.driver}</jobStoreConnectionDriver>
        <jobStoreConnectionString>${persistence.connection.string}</jobStoreConnectionString>
        <jobStoreUser>${persistence.database.user}</jobStoreUser>
        <jobStorePassword>${persistence.database.user.password}</jobStorePassword>
        <!--Wait for running tasks if the scheduler is shutting down, e.g. due to shutting down the application server. default: true-->
        <waitOnShutdown>true</waitOnShutdown>
        <!--Delay in seconds before the scheduler starts the execution of tasks. This delay is useful as services needed to perform tasks
        may not be running, yet, when the scheduler starts. The default value is 5 seconds.-->
        <startDelaySeconds>5</startDelaySeconds>
        <!--Add default schedules during the first startup of the scheduler. These schedules are executing transfer finalization
        (ingest/download) every 60/30 seconds. The default value is true.-->
        <addDefaultSchedules>true</addDefaultSchedules>
    </scheduler>

    <authorization>
        <login>
            <orcid>
                <!--Configuration for ORCiD login. The ORCiD login is only enabled if id and secret are provided.
                Furthermore, the base Url of the repository instance, e.g. http://localhost:8080/KITDM,  has to be registered as
                valid redirection of the ORCiD OAuth2 login. -->
                <clientid>ORCID_CLIENT_ID</clientid>
                <clientsecret>ORCID_CLIENT_SECRET</clientsecret>
            </orcid>
            <b2access>
                <!--Configuration for B2Access login. The B2Access login is only enabled if id and secret are provided.
                Furthermore, the base Url of the repository instance, e.g. http://localhost:8080/KITDM,  has to be registered as
                valid redirection of the B2Access OAuth2 login. -->
                <clientid>NOT_SUPPORTED_YET</clientid>
                <clientsecret>NOT_SUPPORTED_YET</clientsecret>
            </b2access>
        </login>
        <rest>
            <!--Configuration of available authenticators. An authenticator allows to secure the access to
            KITDM RESTful web services. By default, the access is secured via OAuth using fixed consumer key and secret.
            The user credentials are stored as ServiceAccessToken entities with the default serviceId 'restServiceAccess'.
            -->
            <authenticators>
                <!--The authenticator element and its implementation class-->
                <authenticator class="edu.kit.dama.rest.util.auth.impl.OAuthAuthenticator">
                    <!--The id used as serviceId in associated ServiceAccessToken entities.-->
                    <authenticatorId>restServiceAccess</authenticatorId>
                    <!--Regular expression allowing to enable this authenticator for specific services. The value below
                    enables the authenticator for all services, but it is also imaginable to enable an authenticator
                    only for one specific service.
                    The expression is applied to the base URL of the request and does not include the resource portion.-->
                    <enableFor>(.*)</enableFor>
                    <!--enableFor>(.*)(basemetadata|sharing|dataorganization|staging|usergroup|dataworkflow|scheduler)(.*)</enableFor-->
                    <!--Authenticator-specific properties, in this case these are OAuth consumer key and secret. -->
                    <defaultConsumerKey>key</defaultConsumerKey>
                    <defaultConsumerSecret>secret</defaultConsumerSecret>
                </authenticator>
                <!--HTTP Authenticator for WebDav access. Please keep in mind that the settings here (realm, type) must match the settings in the web.xml of the WebDav servlet.-->
                <authenticator class="edu.kit.dama.rest.util.auth.impl.HTTPAuthenticator">
                    <authenticatorId>webdav</authenticatorId>
                    <!--The HTTP realm needed if type is 'DIGEST'.-->
                    <realm>kitdm</realm>
                    <!--The type that must be either BASIC or DIGEST.-->
                    <type>DIGEST</type>
                </authenticator>
                <!--Helper authenticator to support ORCID login.-->
                <authenticator class="edu.kit.dama.rest.util.auth.impl.BearerTokenAuthenticator">
                    <authenticatorId>ORCID</authenticatorId>
                </authenticator>
                <!--Helper authenticator to support B2ACCESS login.-->
                <!--B2Access is NOT officially supported, yet. Thus, this setting has no effect.
                <authenticator class="edu.kit.dama.rest.util.auth.impl.BearerTokenAuthenticator">
                    <authenticatorId>B2ACCESS</authenticatorId>
                </authenticator>
                -->
            </authenticators>
        </rest>
        <!--The default persistence unit for KIT Data Manager Authorization services.
        Due to its complexity, the generic nature of KIT Data Manager MetaDataManagement is not feasible for Authorization services.
        Therefore, they will be configured separately also in future releases.
        -->
        <defaultPU>AuthorizationPU</defaultPU>
    </authorization>

    <audit>
        <!--Audit message publisher implementation. This publisher is contacted by the audit component as soon as an audit message occurs.
        It is the responsibility of the publisher to distribute the messages to connected consumers. By default, KITDM used a RabbitMQ based
        publisher in order to allow asynchronous, reliable publishing of audit messages. The according receiver is implemented as ServletContextListener
        publishing all received events to connected message consumers.-->
        <publisher class="edu.kit.dama.mdm.audit.impl.RabbitMQPublisher">
            <!--Each publisher might have custom properties, in this case they are the RabbitMQ server hostname and the RabbitMQ exchange used
            to publish audit messages.-->
            <hostname>${rabbitmq.host}</hostname>
            <exchange>audit</exchange>
        </publisher>

        <!--Configuration of connected audit message consumers. Received audit messages are forwarded to the consumer which is responsible for
        handling the message according to its implementation.-->
        <consumers>
            <consumer class="edu.kit.dama.mdm.audit.impl.ConsoleConsumer"/>
            <consumer class="edu.kit.dama.mdm.audit.impl.DatabaseConsumer"/>
        </consumers>
    </audit>
</config>

Current list of entities (KIT DM 1.4) that have to be registered in persistence.xml for persistence unit MDM-Core:

 <!-- ********************************************************************
    ***                           MDM-Core                                ***
    *************************************************************************-->
<persistence-unit name="MDM-Core" transaction-type="RESOURCE_LOCAL">
<!--[...]-->
     <class>edu.kit.dama.authorization.entities.impl.Group</class>
        <class>edu.kit.dama.authorization.entities.impl.User</class>
        <class>edu.kit.dama.authorization.entities.impl.Membership</class>
        <class>edu.kit.dama.authorization.entities.impl.GrantImpl</class>
        <class>edu.kit.dama.authorization.entities.impl.GrantSet</class>
        <class>edu.kit.dama.authorization.entities.impl.ResourceReference</class>
        <class>edu.kit.dama.authorization.entities.impl.Grant</class>
        <class>edu.kit.dama.mdm.base.OrganizationUnit</class>
        <class>edu.kit.dama.mdm.base.Study</class>
        <class>edu.kit.dama.mdm.base.Investigation</class>
        <class>edu.kit.dama.mdm.base.DigitalObject</class>
        <class>edu.kit.dama.mdm.base.Participant</class>
        <class>edu.kit.dama.mdm.base.Relation</class>
        <class>edu.kit.dama.mdm.base.Task</class>
        <class>edu.kit.dama.mdm.base.UserData</class>
        <class>edu.kit.dama.mdm.base.MetaDataSchema</class>
        <class>edu.kit.dama.mdm.base.DigitalObjectType</class>
        <class>edu.kit.dama.mdm.base.ObjectTypeMapping</class>
        <class>edu.kit.dama.mdm.base.DigitalObjectTransition</class>
        <class>edu.kit.dama.mdm.base.ObjectViewMapping</class>
        <class>edu.kit.dama.authorization.entities.impl.FilterHelper</class>
        <class>edu.kit.dama.authorization.entities.impl.SecurableResource</class>
        <class>edu.kit.dama.mdm.dataworkflow.ExecutionEnvironmentConfiguration</class>
        <class>edu.kit.dama.mdm.dataworkflow.properties.ExecutionEnvironmentProperty</class>
        <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTask</class>
        <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTaskConfiguration</class>
        <class>edu.kit.dama.mdm.dataworkflow.DataWorkflowTransition</class>
        <class>edu.kit.dama.mdm.dataworkflow.properties.StringValueProperty</class>
        <class>edu.kit.dama.mdm.dataworkflow.properties.LinuxSoftwareMapProperty</class>
        <class>edu.kit.dama.mdm.admin.ServiceAccessToken</class>
        <class>edu.kit.dama.mdm.admin.UserGroup</class>
        <class>edu.kit.dama.mdm.admin.UserProperty</class>
        <class>edu.kit.dama.mdm.admin.UserPropertyCollection</class>
        <class>edu.kit.dama.mdm.content.MetadataIndexingTask</class>
        <class>edu.kit.dama.mdm.content.ElasticsearchTransition</class>
        <class>edu.kit.dama.staging.entities.AdalapiProtocolConfiguration</class>
        <class>edu.kit.dama.mdm.audit.types.AuditEvent</class>
<!--[...]-->
</persistence-unit>
Also consider updating to the custom WebDav implementation provided with KIT DM 1.4. To do so, please follow the instructions beginning with deploying tomcat-ext, followed by copying and configuring the provided webdav.xml. Finally, you should compare your configuration with the configuration described in the installation chapter. Afterwards, you should be able to add WebDav credentials in the profile tab of the AdminUI.

Update 1.4 → 1.5

Type Optional Todo Side Effects

DatabaseUpdate

No

Apply script sql/update.1.4-1.5.sql

This script updates constraints in the data organization tables in order to avoid conflicts during updates.

ConfigurationUpdate

Yes

Due to a change in the behavior of the transfer finalization jobs, it is recommended to remove the existing triggers from the ingest and download finalizer jobs and to set their trigger rate to one execution every 2 or 3 seconds. Please refer to the AdminUI documentation on how to do this.

None.

RESTInterfaceChanges

No

PUT and POST no longer accept query parameters. Instead, the previous query parameters must be provided as form parameters. This affects only the groupId parameter used in almost every endpoint.

Parameters provided as query parameters are ignored. In the case of groupId, the default groupId USERS would be used.

RESTInterfaceChanges

No

DELETE rest/sharing/resources/references now takes an additional parameter referenceGroupId to provide the groupId of the resource reference. Previously, groupId was used for this purpose; it is now used for authorization only.

Not providing referenceGroupId will result in HTTP BAD_REQUEST.

The authentication extension located in library tomcat-ext-<version>.jar has been updated to version 1.1.2. Please remember to replace the library /usr/share/tomcat7/lib/tomcat-ext-1.1.1.jar with /usr/share/tomcat7/lib/tomcat-ext-1.1.2.jar.

Programming KIT Data Manager

Without any additional effort, KIT Data Manager offers basic functionality for building and running repository systems. There is a set of basic services that can be used to register Base Metadata and schedule file transfers, but the whole spectrum of repository features can only be covered by using the public interfaces, connecting them to workflows and extending the system with custom functionality. The following chapters describe common interfaces, the access to High Level Services and the extension points that can be used to integrate community-specific functionality.

REST Access to High Level Services

For basic (remote) use cases, KIT Data Manager services are typically accessed via the available REST interfaces. Currently, there are seven different REST services:

Service Description

User-Group Management Service

Service for adding groups, assigning users to groups and getting information about existing users and groups.

BaseMetaData Service

Access to Base Metadata (Study, Investigation and DigitalObject) for registering and reading new metadata entities.

Staging Service

Service for initiating data transfer tasks to and from KIT Data Manager. This service does not take care of the actual transfer!

DataOrganization Service

Access to the DataOrganization of Digital Objects as soon as it has been extracted during ingest.

Sharing Service

Obtaining and changing access permissions for secured resources (e.g. Study, Investigation, Digital Object).

Data Workflow Service

Apply processing tasks or workflows to Digital Objects in the repository system.

Audit Service

Query and create audit events.

You’ll find the documentation of all of these services in your KIT Data Manager distribution. All these REST interfaces are secured via OAuth 1.0. The OAuth implementation used for KIT Data Manager is kept rather simple for the time being. There is no management of different consumers, thus the consumer key and secret have the fixed values mentioned below. In order to authorize the access to a service, you have to provide an OAuth header with each REST call using the following OAuth parameters:

Parameter Value

Consumer Key

key

Consumer Secret

secret

Access Token

Your Access Token accessible via the administration backend at http://kitdm-host:8080/KITDM (default: admin)

Access Key

Your Access Key accessible via the administration backend at http://kitdm-host:8080/KITDM (default: dama14)

For Java programmers there are client implementations available for all services covering authorization and access to the services. The implementation classes and default service base URLs are shown in the following table:

Service Java Client Implementation/Service Base URL

User-Group Management Service

edu.kit.dama.rest.admin.client.impl.UserGroupRestClient

http://kitdm-host:8080/KITDM/rest/usergroup/

BaseMetaData Service

edu.kit.dama.rest.metadata.client.impl.BaseMetaDataRestClient

http://kitdm-host:8080/KITDM/rest/basemetadata/

Staging Service

edu.kit.dama.rest.staging.client.impl.StagingRestClient

http://kitdm-host:8080/KITDM/rest/staging/

DataOrganization Service

edu.kit.dama.rest.dataorganization.client.impl.DataOrganizationRestClient

http://kitdm-host:8080/KITDM/rest/dataorganization/

Sharing Service

edu.kit.dama.rest.sharing.client.impl.SharingRestClient

http://kitdm-host:8080/KITDM/rest/sharing/

Data Workflow Service

edu.kit.dama.rest.dataworkflow.client.impl.DataWorkflowRestClient

http://kitdm-host:8080/KITDM/rest/dataworkflow/

Audit Service

edu.kit.dama.rest.audit.client.impl.AuditRestClient

http://kitdm-host:8080/KITDM/rest/audit/

You’ll find some of the following and other examples within the code base of KIT Data Manager in the Samples module.

In Java, the access to the different services looks quite similar. The following example shows how to create the Base Metadata structure for a new Digital Object:

String defaultGroup = edu.kit.dama.util.Constants.USERS_GROUP_ID;
String accessKey = "putYourKeyHere";
String accessSecret = "putYourSecretHere";
String restBaseUrl = "http://kitdm-host:8080/KITDM";
SimpleRESTContext context = new SimpleRESTContext(accessKey, accessSecret);

//First, create temporary objects which contain the attributes you want to provide for the metadata entities.
//It is recommended to collect as many attributes as possible in order to be able to distinguish registered metadata entities as well as possible.
//At least topic/label should be provided for studies, investigations and Digital Objects.

//Collect all study attributes and return it as a temporary object.
Study newStudy = getStudy();
//Collect all investigation attributes and return it as a temporary object.
Investigation newInvestigation = getInvestigation();
//Collect all Digital Object attributes and return it as a temporary object.
DigitalObject newDigitalObject = getDigitalObject();

//Instantiate BaseMetaDataRestClient using the base URL and the security context, both defined above
BaseMetaDataRestClient client = new BaseMetaDataRestClient(restBaseUrl + "/rest/basemetadata/", context);

//Create a new study. The study will be assigned to the default group whose ID we've obtained above.
StudyWrapper studyWrapper = client.addStudy(newStudy, defaultGroup);
//Assign returned study to 'newStudy' as the created entity now contains a valid studyId.
newStudy = studyWrapper.getEntities().get(0);

//Use the studyId to add a new investigation to the study we've just created.
InvestigationWrapper investigationWrapper = client.addInvestigationToStudy(newStudy.getStudyId(), newInvestigation, defaultGroup);
//Assign returned investigation to 'newInvestigation' as the created entity now contains a valid investigationId.
newInvestigation = investigationWrapper.getEntities().get(0);

//Use the investigationId to add a new Digital Object to the investigation just created.
DigitalObjectWrapper digitalObjectWrapper = client.addDigitalObjectToInvestigation(newInvestigation.getInvestigationId(), newDigitalObject, defaultGroup);
//Assign returned digitalObject to 'newDigitalObject' as the created entity now contains a valid objectId.
newDigitalObject = digitalObjectWrapper.getEntities().get(0);

Now that you have defined the Base Metadata structure, you may want to schedule a data ingest, which can be done as follows:

//Instantiate the staging client re-using the credentials of the last call and the same base URL as defined above.
StagingServiceRESTClient stagingClient = new StagingServiceRESTClient(restBaseUrl + "/rest/staging/", context);

//At first we have to obtain the StagingAccessPoint in order to be able to schedule an ingest.
//For convenience we expect to have exactly one AccessPoint as set up during the default installation.
//At first we obtain the id, followed by a query for detailed information.
long accessPointId = stagingClient.getAllAccessPoints(defaultGroup, context).getEntities().get(0).getId();
String uniqueAPIdentifier = stagingClient.getAccessPointById(accessPointId, context).getEntities().get(0).getUniqueIdentifier();

//Now, we schedule the ingest for the DigitalObject we have just created.
//To identify the object you have to use the DigitalObjectId which is NOT identical with the objectId we were talking about before.
String digitalObjectId = newDigitalObject.getDigitalObjectId().getStringRepresentation();
IngestInformationWrapper ingest = stagingClient.createIngest(digitalObjectId, uniqueAPIdentifier);
//Note: As of KIT Data Manager 1.1 the ingest can also be created using the numeric ids of Digital Object and access point.
//Therefore, the following call would lead to the same result as the call above:
//IngestInformationWrapper ingest = stagingClient.createIngest(Long.toString(newDigitalObject.getBaseId()), Long.toString(accessPointId));

//Now, you can obtain the id of the ingest, which will be used further on.
long ingestId = ingest.getEntities().get(0).getId();

//As the ingest preparation takes place synchronously, the ingest is usable immediately.
//To be sure the current status can be polled by calling:
IngestInformationWrapper wrapper = stagingClient.getIngestById(ingestId);
int status = wrapper.getEntities().get(0).getStatus();

//If the status is 4 (INGEST_STATUS.PRE_INGEST_SCHEDULED), data can be uploaded to the URL obtainable by:
wrapper = stagingClient.getIngestById(ingestId);
URL dataUploadFolder = wrapper.getEntities().get(0).getDataFolderUrl();

//If the upload has finished, the status has to be set to 16 (INGEST_STATUS.PRE_INGEST_FINISHED) in order to trigger archiving of the data.
//updateIngest() returns a ClientResponse object that can be checked for success via response.getStatus() == 200
stagingClient.updateIngest(ingestId, null, INGEST_STATUS.PRE_INGEST_FINISHED.getId());

//If further monitoring is required, you can wait for a status change to status 128 (INGEST_STATUS.INGEST_FINISHED), which means that archiving has finished.
wrapper = stagingClient.getIngestById(ingestId);
boolean ingestFinished = wrapper.getEntities().get(0).getStatus() == INGEST_STATUS.INGEST_FINISHED.getId();
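
If you want to block until archiving has finished, a simple polling loop based on the calls shown above might look like the following sketch. The polling interval and timeout are arbitrary values chosen for illustration; a production version should also check for failure states (see INGEST_STATUS):

//Poll the ingest status until archiving has finished or a timeout of ten minutes is reached.
//This sketch only re-uses 'stagingClient' and 'ingestId' from the example above.
long pollingTimeout = System.currentTimeMillis() + (10 * 60 * 1000);
boolean archivingFinished = false;
while (!archivingFinished && System.currentTimeMillis() < pollingTimeout) {
  int currentStatus = stagingClient.getIngestById(ingestId).getEntities().get(0).getStatus();
  archivingFinished = (currentStatus == INGEST_STATUS.INGEST_FINISHED.getId());
  if (!archivingFinished) {
    try {
      //Wait ten seconds before polling again.
      Thread.sleep(10000);
    } catch (InterruptedException ex) {
      //Restore the interrupted state and stop polling.
      Thread.currentThread().interrupt();
      break;
    }
  }
}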

For more available status codes take a look at edu.kit.dama.staging.entities.ingest.INGEST_STATUS. The behavior and usage of all other REST services offered by KIT Data Manager are identical to the presented example. Please refer to the documentation on how to use them and which results can be expected.

If you want to use other programming languages than Java, please refer to the documentation on how to access REST services in your preferred language.

Staging Processors

As mentioned in the chapter describing the architecture of KIT Data Manager, the data ingest is often more than a simple upload of files. Depending on the community and on special use cases, there may be additional steps that have to be performed before and/or after the actual data transfer in order to get a successful ingest. In KIT Data Manager such operations are covered by StagingProcessors. Each StagingProcessor must extend the abstract class edu.kit.dama.staging.processor.AbstractStagingProcessor. Please refer to the JavaDoc of this class in order to get familiar with the functionality it offers. The following example explains the implementation of a StagingProcessor step by step:

First, we implement the constructor, which may contain custom initialization, and the method getName(), which returns a human-readable name of the processor.

public class MyStagingProcessor extends AbstractStagingProcessor{


  public MyStagingProcessor(String pUniqueIdentifier) {
    super(pUniqueIdentifier);
  }

  public String getName(){
    return "MyStagingProcessor";
  }

The second part covers custom properties. For each StagingProcessor such properties can be applied as key-value-pairs. They are used to configure the processor or to provide optional customization. getPropertyKeys() returns all keys that are available for the processor, getPropertyDescription(String pKey) returns a human readable description for each key. Both methods can be used by user interfaces (e.g. the AdminUI) to request the actual property value as user input. validateProperties(Properties pProperties) and configure(Properties pProperties) are almost identical. The only difference is that validateProperties(Properties pProperties) is used to check if all necessary properties are there and valid, whereas configure(Properties pProperties) is used as soon as the processor should be executed in order to obtain a usable StagingProcessor instance. However, if the validation succeeds the configuration must also succeed. Therefore, configure(Properties pProperties) is not expected to throw any exception.

  public String[] getPropertyKeys(){
    return new String[]{"metadataFilename"};
  }

  public String getPropertyDescription(String pKey){
    if("metadataFilename".equals(pKey)){
      return "This is the name of the metadata file checked by MyStagingProcessor."
    }
    return "Invalid property key '" + pKey + "'";
  }

  public void validateProperties(Properties pProperties) throws PropertyValidationException{
    if(pProperties.get("metadataFilename") == null){
       throw new PropertyValidationException("Mandatory property 'metadataFilename' is missing.");
    }
    //Perform additional validation steps if needed
  }

  public void configure(Properties pProperties){
    String metadataFilenameValue = (String)pProperties.get("metadataFilename");
    //do something with the property value, e.g. set it as member variable and use it later
    metadataFilenameMember = metadataFilenameValue;
  }

Finally, there are four methods for the actual execution. Two of them are for pre-transfer operations, which are executed in case of an ingest before the transfer from the upload cache to the repository storage starts. The two other methods are for post-transfer operations, which are executed after the ingest into the repository storage has finished (see Staging). The main difference between pre- and post-transfer processing is that pre-transfer processing is only available for ingests and is executed directly before the Digital Object is ingested into the repository system. Therefore, no data organization information is available yet. However, during pre-transfer processing, validation of the uploaded data and metadata extraction may take place, as pre-transfer processing is allowed to fail. In contrast, post-transfer processing is executed after the Digital Object has been ingested into the repository system. Therefore, post-transfer processing should not fail, as the content is already ingested.

The first method, perform<Pre|Post>TransferProcessing(TransferTaskContainer pContainer), is responsible for the actual execution of the processor in the according phase. The provided TransferTaskContainer contains all information available for the transfer itself as well as information about the uploaded data. The data belonging to a single transfer is organized in a well-defined tree structure described in chapter Staging. All data generated during the execution of a staging processor must be stored in the generated folder. The URL of this folder can be obtained from the TransferTaskContainer as shown in the example. Finally, the generated file must be added to the container in order to ingest it together with the data and to be able to access it later. The second method, finalize<Pre|Post>TransferProcessing(TransferTaskContainer pContainer), is meant to be used for cleanup purposes in case of a successful execution, as this method is only called if the processor was executed without errors. In our example, no special cleanup is needed.

  public void performPreTransferProcessing(TransferTaskContainer pContainer)
  throws StagingProcessorException{
    //Example 1: Read an uploaded file.
    //Obtain the root node of the file tree uploaded by the user
    ICollectionNode root = pContainer.getFileTree().getRootNode();
    //Use helper class edu.​kit.​dama.​mdm.​dataorganization.​impl.​util.Util to get the subtree containing the uploaded data...
    IDataOrganizationNode dataSubTree = Util.getNodeByName(root, Constants.STAGING_DATA_FOLDER_NAME);
    //... and search for a node with the name specified as property 'metadataFilename'
    IFileNode metadataFile = (IFileNode) Util.getNodeByName((ICollectionNode) dataSubTree, metadataFilenameMember);
    if (metadataFile == null) {
       throw new StagingProcessorException("No metadata file named " + metadataFilenameMember + " found.");
    } else {
      try {
        //Obtain the logical filename (should be accessible locally as we are inside the staging service)
        File localMetadataFile = new File(new URL(metadataFile.getLogicalFileName().asString()).toURI());
        //Read file, validate content etc.
      } catch (MalformedURLException | URISyntaxException ex) {
        throw new StagingProcessorException("Failed to obtain metadata file from URL " + metadataFile.getLogicalFileName().asString() + ".", ex);
      } catch (IOException ex) {
        throw new StagingProcessorException("Error reading file from URL " + metadataFile.getLogicalFileName().asString() + ".", ex);
      }
    }
    //Example 2: Generate a file.
    //Generating additional files must happen in the pre-transfer phase to allow
    //them to be ingested as generated content. Adding them in the post-transfer phase
    //to the container won't be possible anymore.
    try{
      //Generated files should go into the 'generated' folder of the according ingest.
      //This folder can be obtained from the TransferTaskContainer:
      File processorOutput = new File(new File(pContainer.getGeneratedUrl().toURI()), "myProcessor.log");
      //Create an output stream and write some dummy data.
      FileOutputStream fout = new FileOutputStream(processorOutput);
      fout.write(("Processor '" + getName() + "' was successfully executed.\n").getBytes());
      fout.write(("Custom metadata was successfully detected in file " + metadataFilenameMember).getBytes());
      fout.flush();
      fout.close();
      //Add the generated file to the TransferTaskContainer in order to archive it later on.
      pContainer.addGeneratedFile(processorOutput);
    }catch(Exception e){
      throw new StagingProcessorException("StagingProcessor " + getName() + " has failed.", e);
    }
  }

  public void finalizePreTransferProcessing(TransferTaskContainer pContainer)
  throws StagingProcessorException{
     //not used here
  }

  public void performPostTransferProcessing(TransferTaskContainer pContainer)
  throws StagingProcessorException{
    //not used here
  }

  public void finalizePostTransferProcessing(TransferTaskContainer pContainer)
  throws StagingProcessorException{
     //not used here
  }
}

Finally, the processor has to be deployed and registered in your KIT Data Manager instance. To do so, the JAR file(s) containing the StagingProcessor class(es) and necessary dependencies have to be placed in the Web Application path of your KIT Data Manager deployment, typically $CATALINA_HOME/webapps/KITDM/WEB-INF/lib/. Afterwards, the Web Application server (typically Apache Tomcat) or at least the KIT Data Manager Web Application has to be restarted. After a successful restart, the deployed StagingProcessor can be registered and configured using the administration user interface.

Server-sided Development

The final part of this chapter covers development on the server side. For local, high-performance access to KIT Data Manager services, e.g. for implementing modern Web applications for repository access, there are plenty of Java APIs providing much more flexibility than the general REST interfaces. For each high level service there is a corresponding local implementation, as listed in the following table:

Service Local Implementation(s)

User-Group Management Service

edu.kit.dama.authorization.services.administration.[User|Group]ServiceLocal

BaseMetaData Service

edu.kit.dama.mdm.core.MetaDataManagement

Staging Service

edu.kit.dama.staging.services.impl.StagingService

DataOrganization Service

edu.kit.dama.mdm.dataorganization.service.core.DataOrganizationServiceLocal

Sharing Service

edu.kit.dama.authorization.services.administration.ResourceServiceLocal

Data Workflow Service

edu.kit.dama.dataworkflow.services.impl.DataWorkflowServiceLocal

All these service implementations are realized as singletons; their use is described in the following chapters.
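
As a first impression of the common access pattern, the following minimal sketch obtains two of these singletons. It only uses calls that also appear in the examples later in this chapter and assumes valid userId and groupId values as well as appropriate exception handling:

//Obtain the local group service singleton and determine the effective role
//of a user in a group (see the authorization examples later in this chapter).
Role effectiveRole = (Role) GroupServiceLocal.getSingleton().getMaximumRole(groupId, userId, AuthorizationContext.factorySystemContext());
//Obtain a MetaDataManager for the default persistence unit via the
//MetaDataManagement singleton (see the following section for details).
IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager("MDM-Core");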

Querying for Metadata

As described in chapter Metadata Management, there are three different kinds of metadata which can be accessed in different ways. The access to Content Metadata is done via special interfaces according to the system the Content Metadata is published to, which won’t be described here. The other two kinds of metadata are accessible via the integrated MetaDataManagement Service of KIT Data Manager. Basically, these functionalities are roughly oriented towards the Java Persistence API (JPA) standard and are defined in the interface edu.kit.dama.mdm.IMetaDataManager. Please refer to the JavaDoc of this interface for a detailed description of each method. The following examples refer to the JPA-based implementation of the interface, which is the one available in the default KIT Data Manager distribution. A typical example for accessing persisted entities is shown in the following snippet:

public List<DigitalObject> findAllDigitalObjects(IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
  //Default persistence unit for Base Metadata
  String persistenceUnit = "MDM-Core";
  //Create IMetaDataManager instance
  IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager(persistenceUnit);
  //Set the authorization context used to authorize all subsequent operations
  mdm.setAuthorizationContext(authorizationContext);
  try{
      return mdm.find(DigitalObject.class);
   }finally{
      //!Important! Otherwise you may run out of database connections.
      mdm.close();
   }
}

The only thing you have to do is create a new IMetaDataManager instance, provide the authorizationContext used to authorize the access and perform a query, e.g. for all entities of a specific class, in our example DigitalObject.class. This call returns a list of all Digital Objects accessible by the provided authorizationContext. Please make sure to close the MetaDataManager instance as soon as you no longer need it. Otherwise, you may run out of database connections sooner or later. Another, more specific query would be:

public DigitalObject getDigitalObjectByPrimaryKey(Long primaryKey, IAuthorizationContext authorizationContext)
throws UnauthorizedAccessAttemptException{
  //Default persistence unit for Base Metadata
  String persistenceUnit = "MDM-Core";
  //Create IMetaDataManager instance
  IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager(persistenceUnit);
  //Set the authorization context used to authorize all subsequent operations
  mdm.setAuthorizationContext(authorizationContext);
  try{
    return mdm.find(DigitalObject.class, primaryKey);
  }finally{
    mdm.close();
  }
}

In this case, the DigitalObject with the provided primary key is returned, or an UnauthorizedAccessAttemptException is thrown if this object is not accessible using the provided authorizationContext. Another example shows a more enhanced way to query for results:

public List<DigitalObject> getDigitalObjectInRange(Long first, Long last, IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
   IMetaDataManager mgr = SecureMetaDataManager.factorySecureMetaDataManager(authorizationContext);

   //define the left sample and set the base id (primary key), e.g. to 1
   DigitalObject sample_left = new DigitalObject();
   sample_left.setBaseId(first);
   //define the right sample and set the base id (primary key), e.g. to 10
   DigitalObject sample_right = new DigitalObject();
   sample_right.setBaseId(last);
   //perform a query for entities of type DigitalObject and baseId between
   //sample_left.baseId and sample_right.baseId, in our example between 1 and 10
   try{
      return mgr.find(sample_left, sample_right);
    }finally{
      mgr.close();
    }
}

The snippet above shows a way to perform enhanced queries based on samples. For the provided samples, all basic attributes (Attribute.PersistentAttributeType.BASIC) that are set are used to build the resulting query, applying the following mappings:

Attribute value provided in …​ Resulting Query

Left Sample

… WHERE attribute >= LeftSample.attributeValue …

Right Sample

… WHERE attribute <= RightSample.attributeValue …

Both Samples

… WHERE attribute BETWEEN LeftSample.attributeValue AND RightSample.attributeValue …

If multiple attributes are set, all resulting conditions are concatenated, which may result in complex queries that are hard to handle. Therefore, it is recommended to set only a single attribute per query. For more control over queries there is another way to perform them:

IMetaDataManager mgr = SecureMetaDataManager.factorySecureMetaDataManager(authorizationContext);

//perform a custom query expecting exactly one or no result
String singleResultQuery = "SELECT o FROM DigitalObject o WHERE o.digitalObjectIdentifier='abcd-efgh-1234-5678-9'";
DigitalObject object = mgr.findSingleResult(singleResultQuery, DigitalObject.class);

//perform a custom query expecting multiple or no results
String wildcardQuery = "SELECT o FROM DigitalObject o WHERE o.digitalObjectIdentifier LIKE '%efgh%'";
List<DigitalObject> objects =  mgr.findResultList(wildcardQuery, DigitalObject.class);

The last snippet shows two different queries: The first is a query for exactly one single result. This method will throw an exception if more than one result matches the query. In our case it will return the DigitalObject with the DigitalObjectIdentifier abcd-efgh-1234-5678-9, no result if no object is found for this identifier, or an UnauthorizedAccessAttemptException is thrown if the caller is not allowed to access the query method or to read the object. The second query will return a list as long as no UnauthorizedAccessAttemptException is produced. If no result was found, the list is empty. Otherwise, the list contains all accessible DigitalObjects whose DigitalObjectIdentifier contains efgh.

Querying for entities using the aforementioned methods is easy, but it is limited and will load the entire object tree! There are several ways to reduce the amount of returned data and to speed up queries. Some of them are presented next:

//Return only the first 10 Digital Objects for pagination.
List<DigitalObject> objects = mgr.findResultList("SELECT o FROM DigitalObject o", DigitalObject.class, 0, 10);
//Return all Digital Objects from investigation with id 5.
List<DigitalObject> objects = mgr.findResultList("SELECT o FROM DigitalObject o WHERE o.investigation.investigationId=5", DigitalObject.class);
//Return all Digital Objects in any investigation of the study with id 3.
List<DigitalObject> objects = mgr.findResultList("SELECT o FROM DigitalObject o WHERE o.investigation.study.studyId=3", DigitalObject.class);
//Return only the labels of all Digital Objects.
List<String> labels = mgr.findResultList("SELECT o.label FROM DigitalObject o", String.class);

//When providing query parameters, proper escaping, e.g. of strings, is desirable. Therefore, query arguments can be provided.
//In the following call all Digital Objects containing 'test' in their note field are returned. The argument array contains the value
//on the first position and can therefore be referenced by ?1 in the query.
List<DigitalObject> objects = mgr.findResultList("SELECT o FROM DigitalObject o WHERE o.note LIKE ?1", new Object[]{"%test%"}, DigitalObject.class);

//Finally, parameters may contain collections to query for different values of a single field.
//The following list contains three base ids for which the according Digital Object should be returned.
List<Long> relevantIds = Arrays.asList(new Long[]{1l, 2l, 10l});
//JPA will take care of putting the collection elements into the query including proper escaping.
List<DigitalObject> objects = mgr.findResultList("SELECT o FROM DigitalObject o WHERE o.baseId IN ?1", new Object[]{relevantIds}, DigitalObject.class);

With such queries, very good read performance for Base Metadata can be achieved. The same applies to updates. If, e.g., a Digital Object should be updated, this can be done in two different ways:

//Obtain the Digital Object with base id 1.
DigitalObject object = mgr.find(DigitalObject.class, 1l);
object.setNote("Updated note");
//Update the object in the database.
object = mgr.update(object);

//If the object tree of 'object' is really huge (> 5K entries) the previous call might be slow as it implies many checks and queries.
//Alternatively, direct updates via JPQL are possible as follows:
//int affectedEntities = mgr.performUpdate("UPDATE DigitalObject o SET o.note='Updated note' WHERE o.baseId=1");

For more details about the syntax of custom queries please refer to the JPQL Standard.
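
Putting these building blocks together, a complete helper method might look like the following minimal sketch. It only combines calls already shown above (the parameterized findResultList() variant, the SecureMetaDataManager factory and close()); the method name and parameters are chosen for illustration:

public List<DigitalObject> findDigitalObjectsByNote(String noteFragment, IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
  IMetaDataManager mdm = SecureMetaDataManager.factorySecureMetaDataManager(authorizationContext);
  try{
    //The argument array is referenced by ?1 in the query, JPA takes care of proper escaping.
    return mdm.findResultList("SELECT o FROM DigitalObject o WHERE o.note LIKE ?1", new Object[]{"%" + noteFragment + "%"}, DigitalObject.class);
  }finally{
    //Close the MetaDataManager as soon as it is no longer needed.
    mdm.close();
  }
}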

Adding Authorization on Method-Level

In chapter Authorization the flexible authorization mechanisms of KIT Data Manager were briefly described. Integrating this functionality for authorizing access to single methods is quite easy. But first, let’s have a closer look at the following example:

@SecuredMethod(roleRequired = Role.MEMBER)
public void mySecuredMethod(@Context IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
   //perform the actual functionality
}

The code snippet shows a very basic scenario where a method may only be called with an AuthorizationContext containing the MEMBER role. The snippet contains two annotations: @SecuredMethod(roleRequired = Role.MEMBER) marks the method itself as secured by the authorization framework and accessible with at least the role MEMBER. The second annotation @Context marks the argument which defines user, group and role (or short, the AuthorizationContext) used to request access to the method. Finally, the method has to throw an UnauthorizedAccessAttemptException in case authorizationContext does not entitle the caller to access mySecuredMethod(). The actual authorization code is woven into the program code at compile time using AspectJ. To trigger this step you’ll have to call a specific Maven plugin during your build. Therefore, just add the following lines to the pom.xml of your project:

<build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>aspectj-maven-plugin</artifactId>
        <version>1.6</version>
        <configuration>
          <complianceLevel>1.6</complianceLevel>
          <aspectLibraries>
            <aspectLibrary>
              <groupId>edu.kit.dama</groupId>
              <artifactId>Authorization</artifactId>
            </aspectLibrary>
          </aspectLibraries>
        </configuration>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>test-compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!--More plugins coming here-->
   </plugins>
</build>

With these few steps the method mySecuredMethod() can only be accessed using an AuthorizationContext with the role MEMBER. Another question relevant for this scenario is how to obtain a valid AuthorizationContext. This is shown in the following snippet:

public static IAuthorizationContext getAuthorizationContext(UserId pUserId, GroupId pGroupId) throws AuthorizationException {
 try {
      //obtain the effective role for pUserId in pGroupId using the system context
      Role effectiveRole = (Role) GroupServiceLocal.getSingleton().getMaximumRole(pGroupId, pUserId, AuthorizationContext.factorySystemContext());
      //create a new AuthorizationContext using the provided/obtained information
      return new AuthorizationContext(pUserId, pGroupId, effectiveRole);
    } catch (UnauthorizedAccessAttemptException ex) {
      //fatal error
      throw new AuthorizationException("Failed to get maximum role using system context.", ex);
    }
}

The method takes UserId and GroupId obtained from some external source (e.g. the user login) and obtains the effective role using the method getMaximumRole(GroupId, UserId, IAuthorizationContext) of GroupServiceLocal. As this is an internal call we can use AuthorizationContext.factorySystemContext() to authorize the execution.

The AuthorizationContext returned by AuthorizationContext.factorySystemContext() is for internal use only. It overrides ALL security mechanisms and should be used only if no other context is available or if you know exactly what you are doing.

The effective role returned by the call already combines the group role and the global MAX_ROLE described in chapter Authorization. Finally, the AuthorizationContext containing the provided/obtained information is returned and can be used for authorization decisions. Another possibility, securing more than just the method call itself, is shown in the following code snippet:

@SecuredMethod(roleRequired = Role.MEMBER)
public void mySecuredMethod(@SecuredArgument ResourceClass securedArgument, @Context IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
   //perform the actual functionality
}

In the code you can find another annotation, @SecuredArgument. This annotation marks one or more arguments which are accessed inside the method but may have access restrictions that should be checked beforehand. During authorization KIT Data Manager will check if the provided context authorizationContext is allowed to access the resource provided by securedArgument, which means that the caller must be able to access the resource with at least the GUEST role. In order to make securedArgument applicable for the @SecuredArgument annotation it has to implement the interface ISecurableResource as follows:

package edu.kit.dama;

@Entity
public class ResourceClass implements ISecurableResource{

  @SecurableResourceIdField(domainName = "edu.kit.dama.ResourceClass")
  @Column(nullable = false, unique = true)
  private String uniqueIdentifier;

  /**Default constructor.*/
  public ResourceClass(){
    //initial creation of the unique identifier
    uniqueIdentifier = UUID.randomUUID().toString();
  }

  //your class code

  @Override
  public SecurableResourceId getSecurableResourceId() {
    return new SecurableResourceId("edu.kit.dama.ResourceClass", uniqueIdentifier);
  }
}

At the beginning you see the annotation @Entity. This is the standard annotation to mark an entity stored in a database via the Java Persistence API (JPA), which is used by KIT Data Manager. Another JPA-specific annotation is provided next to the field uniqueIdentifier and has the content @Column(nullable = false, unique = true). This tells JPA that the table column holding the annotated field in the database backend contains values that must not be null and must be unique. However, both annotations are optional and not relevant for the authorization aspect of this example, but we’ll come back to this in a moment.

The first line relevant for the authorization is the annotation @SecurableResourceIdField(domainName = "edu.kit.dama.ResourceClass") which defines that the annotated field contains the unique identifier for this SecurableResource. It takes a domainName as argument. The domainName defines the scope in which uniqueIdentifier has to be unique. Typically, the class name is fine for this purpose. However, this annotation imposes some requirements on the annotated field:

Applicable to fields of type java.lang.String

If the annotation is applied to a field of another type, the development environment will raise an error and compilation will fail.

Once per class

If the annotation is used twice in one class, a warning is shown in the development environment and the field annotated first will be used as uniqueIdentifier.

Unique Identifiers in JPA Entities should be marked as not nullable and unique

If the resource class is annotated with @Entity, the field annotated with @SecurableResourceIdField should also be annotated with @Column(nullable = false, unique = true). Otherwise, a warning is shown in the development environment.

The warnings and errors mentioned before are not raised by the IDE by default. To enable this kind of pre-compile-time validation for your project, please add the following plugin configuration to your pom.xml:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>2.5.1</version>
    <configuration>
    <source>1.7</source>
    <target>1.7</target>
    <!--Custom annotation processor for parsing classes for securable resource annotations.-->
    <annotationProcessors>
      <annotationProcessor>
          edu.kit.dama.authorization.annotations.SecurableResourceIdFieldProcessor
      </annotationProcessor>
      <annotationProcessor>
          edu.kit.dama.authorization.annotations.FilterOutputValidationProcessor
      </annotationProcessor>
    </annotationProcessors>
    </configuration>
</plugin>

The two annotation processors will check the annotations in your project based on the criteria mentioned before and can help you to detect misconfigurations easily. The final part necessary for implementing ISecurableResource is the implementation of the method getSecurableResourceId(). This method has to return an instance of SecurableResourceId containing the domainName (the one we’ve used in the annotation above) as well as the content of the field uniqueIdentifier. The value of this field should be set during the initial creation of the object and should never change.

A final scenario of coding KIT Data Manager from the authorization perspective covers the case when a method returns a list of securable resources.

@SecuredMethod(roleRequired = Role.MEMBER)
@FilterOutput(roleRequired = Role.MEMBER)
public  <T extends ISecurableResource> List<T>  mySecuredMethod(@Context IAuthorizationContext authorizationContext) throws UnauthorizedAccessAttemptException{
   List<T> unfilteredList = obtainUnfilteredList();
   //just return the unfiltered list
   return unfilteredList;
}

As you see, the implementation is quite similar to the other cases. An additional annotation @FilterOutput(roleRequired = Role.MEMBER) tells the authorization framework to filter the output and remove entities which need higher permissions than MEMBER from the initially returned list unfilteredList. This works as long as the entries in the returned list implement ISecurableResource as described in the last example. In the worst case, an empty list is returned because all elements were removed.

Administration UI

The Administration UI, or AdminUI for short, is a straightforward Web application for administrating the users and user groups registered in KIT Data Manager. It further allows an effortless configuration of all required staging components.

Login

In order to access the KIT Data Manager AdminUI you first have to register. After browsing to the AdminUI, e.g. at http://localhost:8080/KITDM, you will find the login page. Here you can select between different login methods, e.g. ORCiD or Username/Password, which also can be used to register a new account.

From the KIT Data Manager perspective it makes no difference whether an account is created via ORCiD or directly by providing a username and a password, but during the registration process the ORCiD service might be used to obtain user details like email or name, which are then filled into the registration form automatically. Furthermore, the login via ORCiD is much more convenient for the user. However, in every case a password login credential is also created in case an external authentication service is not available.

UI components tagged with Tag are mandatory and/or expect a value with a certain character length.

Views

The KIT Data Manager AdminUI provides a couple of views responsible for different tasks. After login you see the information view providing some basic information about the repository content. Other views and functionalities, which can be selected in the upper left area of the AdminUI, are namely:

Information_Button Information

Profile_Button Profile

SiMon_Button Simple Monitoring SiMon

Settings_Button System Settings

Logout_Button Logout

Information

The information-view, displayed by clicking the menu item Information_Button, offers you a basic overview of the repository, e.g. the number of groups, digital objects and occupied disc space.

Information

Profile

The profile-view, displayed by clicking the menu item Profile_Button, gives an overview of your personal information and allows you to manage your credentials. Besides the possibility to change your login password for the AdminUI, it also offers the (re-)generation of your OAuth credentials, e.g. if they were compromised, and the option to link an ORCiD Id to your account in order to enable ORCiD login.

Profile

In the lower part of the screen you can find the credentials table. Next to the table there are a couple of buttons for adding, modifying, removing and reloading credentials. If you want to create a new credential, click the button Add. In the resulting dialog you can select the credential type you want to create. Depending on the type, credential key and secret must be provided.

Please be aware that for each user only one credential of each type can be provided. Furthermore, depending on the credential type, it is not possible to see the plain credential secret. In that case, if you forgot the secret, you have to create a new secret.

SiMon - SimpleMonitoring

SiMon

SiMon offers a simple and configurable service monitoring. For this purpose different so called Probes can be defined. By default, there are four different probe types:

edu.kit.dama.ui.simon.impl.MountProbe

This probe can be used to check the availability of mount points. It checks whether the configured location is readable and writable. See ‘$KITDM_LOCATION/simon/MOUNT.properties’ for an example.

edu.kit.dama.ui.simon.impl.RestServiceProbe

This probe checks the availability of a RESTful service. For this, the service URL and an unsecured service method must be provided. If the method returns HTTP 200 the probe succeeds. See ‘$KITDM_LOCATION/simon/REST.properties’ for an example.

edu.kit.dama.ui.simon.impl.WebServerProbe

This probe simply tries to connect to a Web Server behind a specified URL. If the call succeeds within the provided timeout, the probe also succeeds. See ‘$KITDM_LOCATION/simon/WEBSERVER.properties’ for an example.

edu.kit.dama.ui.simon.impl.ShellScriptProbe

This probe should cover all other scenarios as it allows executing an arbitrary shell script. If the shell script finishes with the exit code 0 the probe succeeds. See ‘$KITDM_LOCATION/simon/SCRIPT.properties’ for an example.

You can also create your own probes by extending edu.kit.dama.ui.simon.impl.AbstractProbe and placing the JAR file containing your probe class at $KITDM_LOCATION/KITDM/WEB-INF/lib/. Please refer to the JavaDoc for detailed information.

The SiMon view in the AdminUI shows all configured probes in the Overview tab. Furthermore, there are tabs for each probe category only showing probes in the according category. Depending on the probe status the probe representation in the overview tab is one of the following:

Status Probe Style

Success

Green background and solid black border

Failed

Red background and solid black border

Updating

Grey background and solid black border

Unknown

Grey background and dashed black border

Unavailable

Grey background and solid red border

Settings

The settings-view, displayed by clicking the menu item Settings_Button, offers access to all system settings of your KIT Data Manager instance. Besides the administration of users and user groups, it allows the configuration of the following elements:

The menu item Settings_Button is visible only if your current role is either MANAGER or ADMINISTRATOR. The role is defined per group. By default, after logging in at the AdminUI, your active group is USERS. To change the active group use the combobox in the upper right corner of the AdminUI. Next to the selection box you can also find your current role.

User Administration

The set of functionalities offered by the user-administration-tab depends on your current role. There are the following possibilities:

MANAGER

If you are a manager, you are allowed to see user information. You are also allowed to change your own basic information, e.g. first and last name and validation period, and to see your own and other group memberships. Furthermore, you are allowed to change group memberships for users in groups in which you are the group manager.

ADMINISTRATOR

If you are logged in as ADMINISTRATOR you have access to all information. Only your maximum role cannot be changed by yourself.

UserAdministration

Memberships-View

For administrating group memberships of a user, select the corresponding user from the user-table embedded in the user-administration-tab and view all memberships by clicking the button Show Memberships.

In the window that opens you can find all groups the selected user is a member of, together with the according role. If the role is NO_ACCESS the user is either not a member of the group or the group membership was withdrawn.

MembershipsView

In order to change the membership in a specific group, select the group in the table, choose the new role in the combobox next to the table and click Apply New Role. If the update succeeds, the role is updated in the table and committed to the database. In case of an error, e.g. if you have insufficient permissions, an exclamation mark will appear in the info column. For details please hover over the exclamation mark.

Filter Setter

The filter-setter-panel that opens by clicking the icon-button AddFilter_Button placed at the upper right corner of each table allows you to define and set a table filter.

Filter
TABLE_COLUMN

Combo box listing all filterable table columns

FILTER_EXPRESSION

Text field expecting the expression by which the previously selected table column shall be filtered

FILTER_OPTION

Collection of filter options allowing only a single-selection

Contains

Option for filtering the specified table column by the cell content containing the given filter expression

Starts_With

Option for filtering the specified table column by the cell content starting with the given filter expression

Ends_With

Option for filtering the specified table column by the cell content ending with the given filter expression

Filter

Button for setting the customized table filter

Group Administration

The group-administration-tab allows you to administrate the groups in which you have at least the MANAGER role. This tab allows you to see and modify group information, e.g. the name or the description.

GroupAdministration

You can also create new groups and show members of the selected group.

Members-View

For administrating the members of a group, select the corresponding group from the group-table embedded in the group-administration-tab and view all its members by clicking the button Show Members. In the opened window you will find a list of all users who are member of the group together with their role within the group.

MembersView

On the right-hand side you can find a list of free users who are either not a member of the group or who have the role NO_ACCESS, which represents a withdrawn membership. If you want to add a user to the group, select one or more users in the USERS list and click Add Member(s). If a user is not yet in the members table, it will appear there with the role MEMBER. If the user was already in the table, only the role will change to MEMBER.

To withdraw a group membership select one or more users from the memberships table and click Exclude Member(s). The selected users will be added to the free users list and the role in the memberships table will change to NO_ACCESS.

The members view only allows switching between the roles MEMBER and NO_ACCESS. For fine-grained role assignment please use the membership administration in the user management tab.

Staging Access Point Configuration

The staging access point is one of three components relevant for staging. A staging access point defines a (remotely and locally accessible) storage location that can be accessed by the user for data up- or download. For this purpose each access point has a Remote Base URL defining the access protocol, hostname, port and a common path component, and a Local Base Path, which is the local directory corresponding to the Remote Base URL. The Local Base Path must be writable by KIT Data Manager in order to prepare data for up- and download.

The definition of a new access point and the modification of an already existing one are both provided by the staging-access-point-configuration-tab selectable from a tab collection in the settings-view:

Create New Access Point

For creating a new access point, select the item NEW from the left-hand container AVAILABLE ELEMENTS. The selection enables both the upper text field ACCESS POINT IMPLEMENTATION CLASS and the button ImplementationClass_Button; at the same time the lower part of the form, designed for declaring the properties of an access point, is disabled. To proceed, enter the fully qualified class name of the implementation class that shall be used for creating the new access point, and finalize the access point creation by clicking ImplementationClass_Button.

In case of a successfully created access point, the remaining form is enabled, permitting the further configuration of the newly created access point. Now declare all remaining properties and confirm them by clicking the button Commit Changes.


Newly created access points are automatically disabled for security reasons. Enable them by unselecting the check box Disabled.

View/Modify Existing Access Point

For viewing or modifying the configuration of an already declared access point, select the corresponding access point from the left-hand container AVAILABLE ELEMENTS. A valid selection enables all components of changeable access-point-properties. Commit your changes by clicking the button Commit Changes.

StagingAccessPointConfiguration
AVAILABLE_ELEMENTS

Container listing all defined access points.

ACCESS_POINT_IMPLEMENTATION_CLASS

Text field declaring the implementation class on which the selected access point [1] is based

ImplementationClass_Button

Button for creating an access point with the implementation class declared in the text field ACCESS_POINT_IMPLEMENTATION_CLASS

ACCESS_POINT_NAME

Text field displaying the name of the selected access point [1].

REMOTE_BASE_URL

Text field declaring the remotely accessible base URL for this access point. Once declared, it is highly recommended NOT to change the REMOTE_BASE_URL without any valid reason. Otherwise, already prepared ingests/downloads get invalidated.

LOCAL_BASE_PATH

Text field declaring the locally accessible path corresponding to REMOTE_BASE_URL. Once declared, it is highly recommended NOT to change the LOCAL_BASE_PATH. Otherwise, already prepared ingests/downloads get invalidated.

Select_Path

Button opening a subwindow for selecting a directory from the local file system as LOCAL BASE PATH.

ACCESSIBLE_BY

Combo box for defining the user group who is allowed to use the selected access point. By default, an access point is linked to group USERS, which contains all registered users. In some cases, a special access point for a specific user group might be required. In that case, the according group can be selected at this combo box.

Default

Check box for setting the selected access point [1] as default staging access point for the user group this access point is associated with. For each user group there might be only one default access point. The default access point is used if no specific access point id is provided while requesting an ingest/download.

Disabled

Check box for disabling the selected access point [1].

DESCRIPTION

Text area displaying additional information with respect to the selected access point [1].

Commit Changes

Button for committing the changed configuration of the selected access point [1].

Staging Processor Configuration

A staging processor defines an operation that is performed before or after staging operations, e.g. data ingests or download. This may cover data validation, checksumming or metadata extraction. For more details please refer to chapter Coding: Staging Processor.

The definition of a new staging processor and the modification of an already existing one are both provided by the staging-processor-configuration-tab selectable from a tab collection in the settings-view:

Create New Staging Processor

For creating a new staging processor, select the item NEW from the left-hand container AVAILABLE ELEMENTS. The selection enables both the upper text field STAGING PROCESSOR IMPLEMENTATION CLASS and the button ImplementationClass_Button; at the same time the lower part of the form, designed for declaring the properties of a staging processor, is disabled. To proceed, enter the fully qualified class name of the implementation class that shall be used for creating the new staging processor, and finalize the staging processor creation by clicking ImplementationClass_Button.

In case of a successfully created staging processor, the remaining form is enabled, permitting the further configuration of the newly created staging processor. Now declare all remaining properties and confirm them by clicking the button Commit Changes.


Newly created staging processors are automatically disabled for security reasons. Enable them by unselecting the check box Disabled.

View/Modify Existing Staging Processor

For viewing or modifying the configuration of an already declared staging processor, select the corresponding staging processor from the left-hand container AVAILABLE ELEMENTS. A valid selection enables all components of changeable processor-properties. Finalize your changes by clicking the button Commit Changes.

StagingProcessorConfiguration
AVAILABLE_ELEMENTS

Container listing all defined staging processors.

PROCESSOR_IMPLEMENTATION_CLASS

Text field declaring the implementation class on which the selected staging processor [2] is based

ImplementationClass_Button

Button for creating a staging processor with the implementation class declared in the text field PROCESSOR_IMPLEMENTATION_CLASS

PROCESSOR_NAME

Text field displaying the name of the selected staging processor [2]

ACCESSIBLE_BY

Combo box listing all registered user groups. By selecting a group, a staging processor can be associated with a specific group and might be selected only for staging operations of this group. This is for example relevant for metadata extractors extracting metadata in a group-specific metadata model.

PRIORITY

Slider for selecting the processor priority. A higher priority leads to an earlier execution of the according processor whereas processors with a priority of 0 are executed at the end.

Default

Check box for setting the selected staging processor [2] as default staging processor.

Disabled

Check box for disabling the selected staging processor [2].

Ingest Supported

Check box to define whether the processor can be used for data ingest operations or not.

Download Supported

Check box to define whether the processor can be used for data download operations or not.

DESCRIPTION

Text area displaying additional information with respect to the selected staging processor [2].

Extended Properties

Button for switching to the extended properties tab. Depending on the Staging Processor there may or may not be extended properties available. If there are mandatory properties, you’ll be notified about missing properties on commit.

Commit Changes

Button for committing the changed configuration of the selected staging processor [2].

Execution Environments

In the execution environment tab of the settings-view, so-called execution environments for data workflow tasks can be defined and configured. Basically, an execution environment is defined by a handler class taking care of task execution and monitoring. To allow providing data for the task execution, each execution environment is linked to an access point that can be used by the repository system to stage data into and out of the computing environment. Therefore, the repository system and the computing environment must share a storage resource which they both can access locally, e.g. via a mount point.

Creating a new execution environment is done using the execution-environment-tab selectable in the settings-view:

Create New Execution Environment

For creating a new execution environment, select the item NEW from the left-hand container AVAILABLE ELEMENTS. The selection enables both the upper text field ENVIRONMENT HANDLER IMPLEMENTATION CLASS and the button ImplementationClass_Button; at the same time the lower part of the form, designed for declaring the properties of an execution environment, is disabled. To proceed, enter the fully qualified class name of the implementation class that shall be used for creating the new execution environment before clicking ImplementationClass_Button.

In case of a successfully created execution environment, the remaining form is enabled, permitting the further configuration of the newly created execution environment. Now, declare all remaining properties and commit the changes to the execution environment by clicking the button Commit Changes. In addition, ENVIRONMENT PROPERTIES can be defined allowing a more detailed description of the execution environment. These properties can be used later on to decide whether a workflow task, which requires a set of ENVIRONMENT PROPERTIES from a common pool of properties, can be executed using a particular execution environment or not. However, there is currently no mechanism to validate whether these properties contain any meaningful value or if e.g. a Software-Map-Property describes a software which is physically available in the execution environment the property is assigned to.

View/Modify Existing Execution Environment

For viewing or modifying the configuration of an already declared execution environment, select the corresponding execution environment from the left-hand container AVAILABLE ELEMENTS. A valid selection enables all components of changeable execution-environment-properties. Finalize your changes by clicking the button Commit Changes.

ExecutionEnvironmentConfiguration
AVAILABLE_ELEMENTS

Container listing all defined execution environments.

ENVIRONMENT_HANDLER_IMPLEMENTATION_CLASS

Text field declaring the implementation class on which the selected execution environment is based.

ImplementationClass_Button

Button for creating an execution environment with the implementation class declared in the text field ENVIRONMENT_HANDLER_IMPLEMENTATION_CLASS

ACCESS_POINT

Combobox allowing to select the access point that can be used to stage data in and out the execution environment.

ACCESS_POINT_BASE_PATH

Text field displaying the base path at which the access point is available within the execution environment.

MAX_PARALLEL_TASKS

Text field displaying the max. number of parallel workflow tasks that should be handled by the environment.

DESCRIPTION

Text area displaying additional information with respect to the selected execution environment.

Default

Check box for setting the selected execution environment as default environment.

Disabled

Check box for disabling the selected execution environment.

ENVIRONMENT_PROPERTIES

A twin columns list allowing to select environment properties provided by the execution environment. A button next to the list allows to add new properties.

Extended Properties

Button for switching to the extended properties tab. Depending on the Execution Environment there may or may not be extended properties available. If there are mandatory properties, you’ll be notified about missing properties on commit.

Commit Changes

Button for committing the changed configuration of the selected execution environment.

The properties panel offers two different views between which you can switch by clicking the icon-button RightNavigation_Button. Depending on the execution environment implementation the extended properties view may contain mandatory properties.

Data Workflow Tasks

A data workflow task describes a single task that can be executed alone or in a chain of multiple tasks. In contrast to other elements configured in the settings-view, a basic versioning is applied to data workflow tasks. Changing the TASK NAME, the PACKAGE URL or the ARGUMENTS will result in a new workflow task (in case of changing the name) or in a new task version (in case of changing the package URL or arguments). All other fields can be changed without any effect on the task version.

Creating a data workflow task is done using the data-workflow-tasks-tab selectable in the settings-view:

Create New Data Workflow Task

For creating a new data workflow task, select the item NEW from the left-hand container AVAILABLE ELEMENTS. Fill all fields according to your task properties and requirements. PACKAGE_URL should be in the form file://<ZIP_LOCATION_ON_LOCAL_FILESYSTEM>. Optionally, you can provide ENVIRONMENT PROPERTIES that have to be provided by an execution environment in order to execute the task. Please refer to the according section for more details. Finally, after committing the changes, the workflow task is registered and can be linked to Digital Objects using the according REST service or Java API.

View/Modify Existing Data Workflow Tasks

For viewing or modifying the configuration of an already declared data workflow task, select the corresponding data workflow task from the left-hand container AVAILABLE ELEMENTS. A valid selection enables all components of changeable data-workflow-task-properties. Finalize your changes by clicking the button Commit Changes.

DataWorkflowTaskConfiguration
AVAILABLE_ELEMENTS

Container listing all defined data workflow tasks.

TASK_NAME

Text field to display the unique task name. Changing the name of an existing task will result in creating a new task in version 1 with the same properties as the existing task.

VERSION

Text field to display the version number of the task. The version number increases if any of the unchangeable fields of a task, e.g. application URL or arguments, is changed. It is not allowed to modify the version number manually.

CONTACT_USERID

Text field displaying the userId of the contact user of the workflow task, e.g. for application support. This userId must be a valid userId registered in the repository system.

PACKAGE_URL

Text field displaying the package URL where the application zip archive is located. This URL must be in the format file://<LOCAL_PATH>, e.g. file:///usr/share/apps/MyApp-1.0.zip.

ARGUMENTS

Text field displaying the application arguments which are appended to each application execution. Beware that these arguments cannot be changed for a workflow task. Changing the arguments results in creating a new version of the task.

KEYWORDS

Text area displaying a set of space separated keywords that might be used to search for a particular workflow task.

DESCRIPTION

Text area displaying additional information with respect to the selected workflow task.

Default

Check box for setting the selected workflow task as default task, e.g. to be applied by default to ingested Digital Objects. Currently, this flag is not actively used.

Disabled

Check box for disabling the selected workflow task.

ENVIRONMENT_PROPERTIES

A twin columns list allowing to select environment properties required by the workflow task. A button next to the list allows to add new properties.

Commit Changes

Button for committing the changed configuration of the selected workflow task or to create a new task/a new version of the task if unchangeable fields have changed.

In order to make KIT Data Manager execute scheduled data workflow tasks, a job has to be scheduled via the AdminUI. Therefore, change to the Job Scheduling tab and add a new job with the implementation class edu.kit.dama.dataworkflow.scheduler.jobs.DataWorkflowExecutorJob.

Job Scheduling

Job scheduling is a new feature introduced in KIT Data Manager 1.2 that allows configuring and executing recurring jobs in a seamlessly integrated, platform-independent way. Basically, it replaces setting up Cron jobs for tasks like data staging, metadata indexing or workflow execution.

Creating new jobs and scheduling their execution is provided by the job-scheduling-tab selectable in the settings-view:

Create New Job Schedule

For creating a new job schedule, select the item NEW from the left-hand container AVAILABLE ELEMENTS. The selection enables both the upper text field JOB IMPLEMENTATION CLASS and the button ImplementationClass_Button; at the same time the lower part of the form, designed for declaring the properties of a job schedule, is disabled. To proceed, enter the fully qualified class name of the implementation class that shall be used for creating the new job schedule before clicking ImplementationClass_Button.

In case of a successfully created job schedule, the remaining form is enabled, permitting the further configuration of the newly created job schedule. Declare all remaining properties (except TRIGGERS) and persist the job schedule by clicking the button Commit Changes. Finally, in order to trigger the job execution, one or more triggers have to be assigned to the job. Click the add button next to the trigger table. In the popup which opens there are four different triggers available:

Trigger Description

Now Trigger

Triggers an immediate, single-time job execution. After the execution, the trigger is removed again.

At Trigger

Triggers a single-time job execution at a specific time. After the execution, the trigger is removed again.

Expression Trigger

Triggers a repeated job execution specified in a Cron-like syntax.

Interval Trigger

Triggers a repeated job execution at a specific interval in seconds, optionally limited to a number of executions.

For typical use cases it is recommended to use either Expression or Interval triggers as they typically run as long as they are not removed manually. Now and At triggers are bound to a specific time, disappear afterwards, and the according job won’t be triggered any longer.


Newly created job schedules will have the job id UNSCHEDULED. Such jobs are not yet persisted and will disappear if the settings tab is changed. After completing all mandatory fields, job schedules have to be persisted using the Commit button before triggers can be assigned.

View/Modify Existing Job Schedules

For viewing or modifying the configuration of an already declared job schedule, select the corresponding job schedule from the left-hand container AVAILABLE ELEMENTS. A valid selection enables all components of changeable job-schedule-properties. Finalize your changes by clicking the button Commit Changes.

JobScheduleConfiguration
AVAILABLE_ELEMENTS

Container listing all defined job schedules.

JOB_IMPLEMENTATION_CLASS

Text field declaring the implementation class on which the selected job schedule is based.

ImplementationClass_Button

Button for creating a job schedule with the implementation class declared in the text field JOB_IMPLEMENTATION_CLASS.

JOB_ID

Text field displaying the unique id of the selected job schedule.

JOB_GROUP

Text field displaying the group name of the selected job schedule.

JOB_NAME

Text field displaying the name of the selected job schedule.

DESCRIPTION

Text area displaying additional information with respect to the selected job schedule.

TRIGGERS

A table showing all triggers assigned to the selected job. Triggers can be added, refreshed or removed using the according buttons next to the table. The refresh button is provided to allow a manual update as triggers might be removed automatically after a single execution.

Extended Properties

Button for switching to the extended properties tab. Depending on the job there may or may not be extended properties available. If there are mandatory properties, you’ll be notified about missing properties on commit.

Commit Changes

Button for committing the changed configuration of the selected job schedule.

Logout

The logout from AdminUI happens through the menu item Logout_Button. A successful logout leads you back to the AdminUI’s login-panel.

Enhanced Metadata Handling

As KIT Data Manager aims to be applicable for many heterogeneous communities, it potentially has to support a huge number of community-specific metadata schemas, in our case summarized as content metadata. As this goes hand in hand with a lot of domain-specific knowledge, development effort and additional dependencies we have decided to outsource content metadata handling to a separate module providing a collection of examples and tools to extract and publish content metadata. The basic workflow for handling content metadata in KIT Data Manager is relatively straightforward:

  • Register a metadata schema in KIT Data Manager

  • Implement and deploy a Staging Processor for extracting metadata of the registered schema during data ingest

  • Associate the Staging Processor with an ingest operation to enable the metadata extraction

  • Publish/harvest the extracted metadata in/by an external system

At some points, concrete implementations of content metadata handling may implement this workflow differently. However, all examples and tools of the Enhanced Metadata Module try to follow this basic workflow. But first, let’s look at the workflow steps in detail.

Register a Metadata Schema

Metadata schema definitions are part of KIT Data Manager’s administrative metadata. A schema definition consists of a (unique) schema identifier and a schema URL. At the moment a metadata schema and its identifier are used to distinguish between different schemas, e.g. to publish different schemas in different ways. Currently, there is no mechanism implemented for validating the provided schema URL, but such a feature might be available in future versions. Registering a new metadata schema can be done easily using the Base Metadata REST service or the Java APIs of the KIT Data Manager Metadata Management and Base Metadata.
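
As a rough sketch of the Java way (the entity class MetaDataSchema, its setter names and the save() method of IMetaDataManager are assumptions made for illustration; please check the Base Metadata JavaDoc for the exact names), registering a schema could look like this:

IMetaDataManager mdm = MetaDataManagement.getMetaDataManagement().getMetaDataManager("MDM-Core");
mdm.setAuthorizationContext(authorizationContext);
try{
  //Create the schema entity with a unique identifier and the schema URL
  //(class and setter names are assumptions for illustration).
  MetaDataSchema dcSchema = new MetaDataSchema();
  dcSchema.setSchemaIdentifier("oai_dc");
  dcSchema.setMetaDataSchemaUrl("http://www.openarchives.org/OAI/2.0/oai_dc/");
  //Persist the new schema (save() is assumed to exist, see the JavaDoc of IMetaDataManager).
  mdm.save(dcSchema);
}finally{
  mdm.close();
}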

Implement a Staging Processor for Extracting Metadata

Extracting metadata is implemented as part of the KIT Data Manager ingest process. For adding custom steps to the basic ingest process, Staging Processors are used. For metadata extraction a new Staging Processor extending edu.kit.dama.mdm.content.impl.AbstractMetadataExtractor must be implemented and deployed. The implementation of AbstractMetadataExtractor allows linking the processor to a metadata schema, defines where extracted metadata is stored and also defines the different phases of the metadata extraction process. Typically, after executing a Staging Processor for metadata extraction, an XML file containing the metadata following the linked schema will be created and stored next to the ingested data. Details on how the content metadata is extracted from which sources (e.g. from the ingested files or from some external source) and how to perform error handling (e.g. ignore errors or mark the associated ingest as failed) are defined by the implementation. For examples and details on how to implement such a Staging Processor please refer to the according chapter.

Associate the Staging Processor with an Ingest or a Download

Depending on the configuration of a Staging Processor it might be enabled for each ingest or download by default. If not, e.g. if different metadata extractors are used for different ingests, it can be enabled while requesting a new ingest or download via the Java APIs or using the REST interface of the staging service.

Publish Extracted Metadata

This final workflow step defines how and where extracted metadata is published. For this, the XML file written in the previous step has to be read, transformed if required and registered in an external metadata store, index or any other location. In case of the reference implementation of the Enhanced Metadata Module the XML file is read, transformed to a JSON structure and then sent to a local Elasticsearch index. Now, the only open question is how the link to the Digital Object stored in KIT Data Manager is achieved. As mentioned before content metadata is linked via the OID of a Digital Object. For our reference implementation this means that the document id in Elasticsearch is equal to the OID in the repository system, which makes the mapping between both systems simple. In other cases it is imaginable that the OID is stored as part of the metadata, if there is an appropriate field available, or the mapping has to be realized by some custom service or tool.

Summarizing, the concept of handling content metadata in KIT Data Manager offers a lot of flexibility. The following chapter describes how the default metadata extraction workflow based on METS documents is realized and how it can be customized.

Metadata Extraction (METS)

As described before, the basic content metadata workflow is divided into extraction and publishing. The metadata extraction process is accomplished by Staging Processors. For details about Staging Processors and their anatomy please refer to this section. Compared to standard Staging Processors, the ones responsible for metadata extraction extend MetsMetadataExtractor instead of AbstractStagingProcessor. There are some additional configuration fields:

CommunityMetadataDmdId

Id of the descriptive metadata section of the mets document containing the community metadata. This id should be defined in the mets profile of the according community.

CommunityMDType

Id of an endorsed metadata type defined by the mets standard. Possible types are: [MARC, MODS, EAD, DC, NISOIMG, LC-AV, VRA, TEIHDR, DDI, FGDC, LOM, PREMIS, PREMIS:OBJECT, PREMIS:AGENT, PREMIS:RIGHTS, PREMIS:EVENT, TEXTMD, METSRIGHTS, ISO 19115-2003 NAP, EAC-CPF, LIDO, OTHER]. If no MDType is defined, an external metadata schema identified via the property communityMetadataSchemaId might be used instead.

CommunityMetadataSchemaId

Id of the metadata schema registered at the repository. The schema url is used as OTHERMDTYPE attribute in the according descriptive metadata section. The value has to be the unique identifier of a metadata schema previously defined (for more information, please refer to Register a Metadata Schema in section Enhanced Metadata Module).

Add indexer plugins (optional)

There may be an arbitrary number of indexers used for indexing different types of metadata for the search engine. Right now (11/2016) the following indexers are available:

Plugin:bmd

Extractor of the base metadata, which is available for every digital object.

Plugin:oai_dc

Extractor of the Dublin Core metadata.

Implement Interface

To add your own indexer based on the METS document, the interface IMetsTypeExtractor has to be implemented. Create a new project with a dependency on the Maven project MDM-Content.

Register new plugin

To register a new plugin, the pom.xml has to be prepared as follows:

      <plugin>
        <groupId>eu.somatik.serviceloader-maven-plugin</groupId>
        <artifactId>serviceloader-maven-plugin</artifactId>
        <version>1.0.7</version>
        <configuration>
          <services>
            <param>edu.kit.dama.content.mets.plugin.IMetsTypeExtractor</param>
          </services>
        </configuration>
        <executions>
          <execution>
            <goals>
              <goal>generate</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
Register new metadata schema

To enable the implemented plugin, the linked metadata schema has to be registered with the following values:

schema identifier

name of the plugin

schema URL

namespace of the generated XML document.

Add new plugin to KIT Data Manager

That’s it. Now you can build the JAR file, add it to the lib directory and restart KIT Data Manager.

Implement Metadata Extraction (METS)

Link to Metadata Schema

Each implemented AbstractMetadataExtractor must be associated with a metadata schema using an internal property with the key METADATA_SCHEMA_IDENTIFIER. The value will be the unique identifier of a metadata schema previously defined (for more information, please refer to Register a Metadata Schema in section Enhanced Metadata Module).

Custom Configuration

As the internal properties of the Staging Processor base class are used to associate a metadata schema with an implemented AbstractMetadataExtractor, metadata extractors use an alternative way of providing custom properties. For this purpose the methods getExtractorPropertyKeys(), getExtractorPropertyDescription(), validateExtractorProperties(Properties pProperties) and configureExtractor(Properties pProperties) are the equivalents of the corresponding methods having the term Internal instead of Extractor in their method signature. At runtime, the AbstractMetadataExtractor base implementation takes care of merging both property lists.
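
The following standalone sketch illustrates how these four hooks are typically filled. The method names are taken from the description above, while parameter lists, return types and thrown exceptions are assumptions for illustration and have to be aligned with the actual AbstractMetadataExtractor signatures.

import java.util.Properties;

/**
 * Standalone sketch of the extractor property hooks described above. A real
 * extractor overrides these methods in a subclass of AbstractMetadataExtractor.
 */
public class ExtractorPropertySketch {

  private String communityMetadataDmdId;

  // Announce which custom properties the extractor understands.
  public String[] getExtractorPropertyKeys() {
    return new String[]{"CommunityMetadataDmdId"};
  }

  // Human-readable description, e.g. shown in the administration UI.
  public String getExtractorPropertyDescription(String pKey) {
    if ("CommunityMetadataDmdId".equals(pKey)) {
      return "Id of the descriptive metadata section holding the community metadata.";
    }
    return null;
  }

  // Reject obviously invalid configurations before they are stored.
  public void validateExtractorProperties(Properties pProperties) {
    if (pProperties.getProperty("CommunityMetadataDmdId", "").trim().isEmpty()) {
      throw new IllegalArgumentException("CommunityMetadataDmdId must not be empty.");
    }
  }

  // Apply the validated configuration to the extractor instance.
  public void configureExtractor(Properties pProperties) {
    communityMetadataDmdId = pProperties.getProperty("CommunityMetadataDmdId");
  }
}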

Metadata Extraction Workflow

The base class MetsMetadataExtractor realizes a basic workflow for metadata extraction. The community metadata may be extracted by a community-specific implementation of the method createCommunitySpecificDocument(TransferTaskContainer pContainer), which delivers a document containing all community-specific metadata.
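
A minimal, standalone sketch of the kind of document such an implementation might assemble is shown below, using plain JAXP. The namespace and element names correspond to the EXAMPLE section of the sample METS document below; in a real extractor the passed TransferTaskContainer would be used to derive the actual values.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CommunityDocumentSketch {

  /** Build a small community-specific metadata document (illustrative only). */
  public static Document createCommunitySpecificDocument() throws ParserConfigurationException {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    // Example community namespace; it should match the metadata schema
    // registered at the repository.
    Element root = doc.createElementNS("http://www.example.org/1.0/", "example:metadata");
    Element title = doc.createElementNS("http://www.example.org/1.0/", "example:title");
    title.setTextContent("Any title");
    root.appendChild(title);
    doc.appendChild(root);
    return doc;
  }
}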

By default, the resulting METS document conforms to a Metadata for Applied Sciences (MASI) profile, which looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<mets xmlns="http://www.loc.gov/METS/" OBJID="f5b984c8-42e9-4a39-95fa-542589108201" PROFILE="http://datamanager.kit.edu/dama/metadata/2016-08/Metadata4AppliedSciences-METS-profile.xml" TYPE="edu.kit.dama.mdm.base.DigitalObject" xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr CREATEDATE="2016-11-09T13:40:30Z" LASTMODDATE="2016-11-09T13:40:30Z">
    <agent ROLE="CREATOR" TYPE="OTHER">
      <name>last name, first name</name>
    </agent>
  </metsHdr>
  <dmdSec ID="DUBLIN-CORE">
    <mdWrap MDTYPE="OTHER" MIMETYPE="text/xml" OTHERMDTYPE="OAI-DUBLIN-CORE">
      <xmlData>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>DigitalObject_2016_05_27T09_57</dc:title>
          ...
          <dc:format>application/octet-stream</dc:format>
          <dc:type>Dataset</dc:type>
          <dc:identifier>f5b984c8-42e9-4a39-95fa-542589108201</dc:identifier>
        </oai_dc:dc>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <amdSec ID="KIT-DM-AMD">
    <sourceMD ID="KIT-DM-BASEMETADATA">
      <mdWrap MDTYPE="OTHER" MIMETYPE="text/xml" OTHERMDTYPE="KIT-DM-BASEMETADATA">
        <xmlData>
          <basemetadata xmlns="http://datamanager.kit.edu/dama/basemetadata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://datamanager.kit.edu/dama/basemetadata http://datamanager.kit.edu/dama/basemetadata/2015-08/basemetadata.xsd">
            <digitalObject>
              <baseId>22</baseId>
              <digitalObjectIdentifier>f5b984c8-42e9-4a39-95fa-542589108201</digitalObjectIdentifier>
              <endDate>2016-11-08T11:45:48+01:00</endDate>
              <label>DigitalObject_2016_05_27T09_57</label>
              <note>Any note about digital object.</note>
              ...
            </digitalObject>
          </basemetadata>
        </xmlData>
      </mdWrap>
    </sourceMD>
    <sourceMD ID="KIT-DM-DATAORGANIZATION">
      <mdWrap MDTYPE="OTHER" MIMETYPE="text/xml" OTHERMDTYPE="KIT-DM-DATAORGANIZATION">
        <xmlData>
          <dataOrganization xmlns="http://datamanager.kit.edu/dama/dataorganization" xs:schemaLocation="http://datamanager.kit.edu/dama/dataorganization http://datamanager.kit.edu/dama/dataorganization/2015-08/dataorganization.xsd">
            <digitalObjectId>855930f7-b284-4e14-88ad-46be588ff91b</digitalObjectId>
            <view xmlns:NS1="http://datamanager.kit.edu/dama/dataorganization" NS1:name="default">
              <root>
                <name/>
                <logicalFileName>http://hostname:8080/KITDM/rest/dataorganization/organization/download/id/</logicalFileName>
                <children>
                  <child>
                    <name>metadata</name>
                    <logicalFileName>http://hostname:8080/KITDM/rest/dataorganization/organization/download/id/metadata</logicalFileName>
                    <attributes>
                      <attribute>
                        <key>children</key>
                        <value>1</value>
                      </attribute>
                      <attribute>
                        <key>size</key>
                        <value>22</value>
                      </attribute>
                      <attribute>
                        <key>lastModified</key>
                        <value>1479120151000</value>
                      </attribute>
                      <attribute>
                        <key>directory</key>
                        <value>true</value>
                      </attribute>
                    </attributes>
                    <children>
                      <child>
                        <name>anyFile.ext</name>
                        <logicalFileName>http://hostname:8080/KITDM/rest/dataorganization/organization/download/id/anyFile.ext</logicalFileName>
                        <attributes>
                          <attribute>
                            <key>lastModified</key>
                            <value>1479120111000</value>
                          </attribute>
                          <attribute>
                            <key>directory</key>
                            <value>false</value>
                          </attribute>
                          <attribute>
                            <key>size</key>
                            <value>22</value>
                          </attribute>
                        </attributes>
                      </child>
                    </children>
                  </child>
                </children>
              </root>
            </view>
          </dataOrganization>
        </xmlData>
      </mdWrap>
    </sourceMD>
  </amdSec>
  <fileSec>
    <fileGrp ID="KIT-DM-FILE-GROUP"/>
  </fileSec>
  <dmdSec ID="EXAMPLE">
    <mdWrap MDTYPE="OTHER" MIMETYPE="text/xml">
      <xmlData>
        <example:metadata xmlns:example="http://www.example.org/1.0/">
          <example:title>Any title</example:title>
          ...
        </example:metadata>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <fileSec>
    <fileGrp ID="KIT-DM-FILE-GROUP">
      <file ID="FILE-0">
        <FLocat LOCTYPE="URL" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://hostname:8080/KITDM/rest/dataorganization/organization/download/id/anyFile.ext"/>
      </file>
    </fileGrp>
  </fileSec>
  <dmdSec ID="XML">
    <mdWrap MDTYPE="OTHER" MIMETYPE="text/xml">
      <xmlData>
        <dc:nometadata xmlns:dc="http://www.openarchives.org/OAI/2.0/oai_dc.xsd"/>
      </xmlData>
    </mdWrap>
  </dmdSec>
  <structMap ID="KIT-DM-FILE-VIEW" LABEL="default">
    <div LABEL="root" TYPE="folder">
      <fptr FILEID="FILE-0"/>
      <div LABEL="metadata" TYPE="folder">
        <fptr FILEID="FILE-0"/>
      </div>
    </div>
  </structMap>
</mets>
Even though putting all Base Metadata into the resulting XML document is part of the standard workflow, this is not mandatory. Overwriting createMetadataDocument(TransferTaskContainer pContainer) is also a legitimate way to provide the content metadata XML document in a custom way.

The resulting document is written to a file named <SID>_<OID>.xml, where <SID> is the unique identifier of the associated metadata schema and <OID> is the Digital Object identifier. The file is stored in the generated folder of the associated ingest. Finally, a new metadata indexing task for indexing the previously created XML file is scheduled by creating a new indexing task entry in the database.

In the publishing phase the previously configured scheduler for metadata indexing performs the following workflow:

  • Check the database for new metadata indexing task entries

  • Take the next unprocessed entry and read the associated XML files

  • Convert the XML file into a JSON structure (a sketch of this step follows after this list)

  • Publish the JSON structure to an Elasticsearch node running on localhost:9300 (default Elasticsearch settings) belonging to the cluster and index stated in the parameters (default cluster is KITDataManager, default index is dc)

  • Update the indexing entry in the database to a success/error state
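
The XML to JSON conversion is performed internally by KIT Data Manager. Purely as an illustration of this step, a comparable conversion with the org.json library could look as follows (the library choice and the file name are assumptions, not the actual implementation):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.json.JSONObject;
import org.json.XML;

public class XmlToJsonDemo {

  public static void main(String[] args) throws Exception {
    // Read one of the extracted metadata documents (hypothetical file name).
    String xml = new String(
        Files.readAllBytes(Paths.get("oai_dc_f5b984c8-42e9-4a39-95fa-542589108201.xml")),
        StandardCharsets.UTF_8);
    // Convert the XML structure into a JSON object suitable for indexing.
    JSONObject json = XML.toJSONObject(xml);
    System.out.println(json.toString(2));
  }
}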

Querying the Elasticsearch index can then be done using the Elasticsearch APIs that can be found at https://www.elastic.co/.
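
For a quick test, the index can also be queried via Elasticsearch's HTTP interface using plain Java. The sketch below assumes the default REST port 9200 and the default index dc mentioned above; the actual field names depend on the structure of the converted JSON documents.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SearchIndexDemo {

  public static void main(String[] args) throws Exception {
    // Query-string search across all fields of the 'dc' index
    // (default REST port 9200 assumed).
    URL url = new URL("http://localhost:9200/dc/_search?q=DigitalObject*");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}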

For the described metadata extraction approach, a basic implementation is available in the class edu.kit.dama.mdm.content.mets.BasicMetsExtractor. The following sections describe the basic steps of setting up this extractor, followed by the configuration of metadata indexing.

Configuring Metadata Extraction

A new metadata extractor can be registered as staging processor using the administration backend of KIT Data Manager. For this purpose, please browse to http://localhost:8080/KITDM in the browser of your KIT Data Manager machine and log in. By default, the administrator email is dama@kit.edu and the password is dama14. Open the settings page using the button with the gears Settings_Button and select the tab Staging Processors. At first, insert edu.kit.dama.mdm.content.mets.BasicMetsExtractor as Implementation Class. Clicking the button next to the input fields will create a new staging processor. Now, please insert all values as shown in the following screenshots in order to configure the staging processor properly.

METS1
Figure 9. Basic configuration of the Mets extractor. The most important settings are in the General Options where the extractor is enabled and made assignable to ingest operations. All other fields, e.g. the name, can be customized as required.
METS2
Figure 10. Extended properties of the Mets extractor. Properties in this view may change depending on the local installation and the supported community. The screenshot shows a default configuration as it should work for every installation.

Finally, commit all changes using the Commit button on the lower right. As a result, different metadata documents will be extracted during each ingest. These documents are then located in the generated folder, which is also ingested and available in an according Data Organization view afterwards:

  • metadata/bmd_<OID>.xml

  • metadata/mets_<OID>.xml

  • metadata/oai_dc_<OID>.xml

Access to these documents is now possible in the same way as accessing data, just by addressing the generated view instead of the default view. If the metadata should be made searchable, registering the generated metadata in a search index is required. How to achieve this is described in the following chapter.

Configuring Metadata Indexing

During metadata extraction, a MetadataIndexingTask is also registered for each created document. This allows the metadata documents to be indexed in a search index by a separate process. For default installations the internal job scheduler should be used for indexing metadata, which can be configured using the administration backend of KIT Data Manager. Please refer to the according section for details.

In the following table, the recommended settings for the MetadataIndexer job are listed:

Attributes Value

JOB_IMPLEMENTATION_CLASS

edu.kit.dama.mdm.content.scheduler.jobs.MetadataIndexerJob

JOB_GROUP

Metadata

JOB_NAME

MetadataIndexer

DESCRIPTION

Indexing metadata to an Elasticsearch instance.

TRIGGERS

(Should be added after changes are committed)

In the Extended Properties section the following settings should be applied. Most of them depend on your local configuration inside $KITDM_LOCATION/KITDM/WEB-INF/classes/datamanager.xml:

Attributes Value

groupid

e.g.: USERS

hostname

Hostname of the Elasticsearch instance configured at elasticsearch.host, e.g. localhost

cluster

Cluster name of the Elasticsearch instance configured at elasticsearch.cluster, e.g. KITDataManager@localhost

index

Default index of the Elasticsearch instance configured at elasticsearch.index, e.g. kitdatamanager

Finally, after creating the scheduler job via Commit Changes, add a trigger to the created job. It is recommended to add an Interval trigger executing the indexing task for example every 30 seconds.

For more examples of how to use the Enhanced Metadata Handling, please refer to the source code repository of the corresponding sub-module of KIT Data Manager at https://github.com/kit-data-manager/base/tree/master/MetaDataManagement/MDM-Content


1. access point selected from the left-hand container of the staging-access-point-configuration-tab
2. staging processor selected from the left-hand container of the staging-processor-configuration-tab