Developer Information

Application Architecture

Introduction

The DataMeta application stack comprises three main applications:

Relational data model

[Figure: entity-relationship diagram (erdiag)]

The relational database model comprises the following entities:

  • User

    A database user, uniquely identifiable by their email address. The attributes are largely self-explanatory. A user can be disabled but should not be deleted, so that submitted records remain traceable.

  • Group

    Every user is associated with exactly one group.

  • File

    A file corresponds to a single file submitted by the user. It holds the following attributes:

    • name - The original name of the file as submitted by the user

    • name_storage - The name under which the file is stored in the storage backend. The webapp uses a fixed naming scheme for this purpose which ensures that the filename is unique on the storage backend: {file_id.rjust(10,'0')}_{user_id}_{group_id}_{file_size}_{md5_checksum}

    • checksum - The MD5 checksum, computed immediately after upload

    • filesize - The size of the file as uploaded, i.e. before encryption

    • checksum_crypt - The checksum of the file after encryption

    • filesize_crypt - The size of the file after encryption

    • user_id - The UID of the user who uploaded the file

    • group_id - The GID of the user who uploaded the file (see User and Group)

  • MetaDatum

    A metadatum describes a datum conceptually; it does not hold any data. It thus corresponds to a column in the submitted sample sheet, but not to the data held in that column. The corresponding table in the database is filled with values when the DataMeta instance is configured and typically remains static throughout the lifetime of the instance. However, there are scenarios in which one may want to add a new column at a later stage, modify the linting constraints defined for a column, etc. Supporting such changes will require migration concepts, e.g. because old data may no longer pass linting.

    A metadatum has the following attributes:

    • name - The name of the metadatum. This defines the text required in the header of the corresponding sample sheet column.

    • regexp - A regular expression. If specified, it is applied to the field value during linting.

    • short_description - A message to display in the interface when the verification based on regexp fails, e.g. “Only three digits are allowed in this field”.

    • datetimefmt - A C standard date and time format code string (strftime-style, e.g. %Y-%m-%d). If specified, the application assumes that this field holds a datetime value. When reading sample sheets from plain text formats (CSV, TSV), or if the corresponding column in a submitted Excel sheet is formatted as plain text, this format string is applied to parse the column values. Additionally, this format string is applied to display the values of this column in the user interface. Note that date, time and datetime values are always stored internally as plain text datetime values in ISO 8601 format. A missing component falls back to a default: the time falls back to 00:00 for date-only values, and the date falls back to 1900-01-01 for time-only values.

    • datetimemode - An enum holding DATE, TIME or DATETIME. If set, denotes that the field should be treated as the corresponding type internally.

    • mandatory - A flag indicating that this field must not be empty in sample sheets

    • order - An integer. When displaying the sample sheet in the user interface, the columns are shown in the order of their order values.

    • isfile - A flag indicating whether this column corresponds to a file.

  • MetaDatumRecord

    Corresponds to one captured value of one metadatum. Every filled field in the sample sheet (except the header) corresponds to one MetaDatumRecord. Attributes:

    • metadatum_id - The ID of the metadatum this record corresponds to

    • metadataset_id - The MetaDataSet (see below) this record is part of.

    • file_id - The ID of the associated file. Only set if this metadatum record corresponds to a file and the record has been submitted (see Conceptual Notes)

    • value - The plain text value of this metadatum

  • MetaDataSet

    A metadataset corresponds to one row in the sample sheet. It groups all metadatum records that correspond to this row and additionally holds information about the owner of the metadata and whether it has been submitted (aka committed) or not. Attributes are:

    • user_id - The owner’s UID

    • group_id - The owner’s GID at the time of creation

    • submission_id - The submission ID once this metadata has been committed (see below).

  • Submission

    A submission corresponds to the event of a user clicking the COMMIT button on the /submit view. It holds the time at which the submission took place.

  • AppSettings

    A table holding application settings configured by the administrator running the DataMeta instance. This is currently not used.
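
The storage naming scheme of File and the datetime handling of MetaDatum described above can be illustrated with a short Python sketch. This is not the DataMeta implementation: the function names storage_name and normalize_value are hypothetical, the use of isoformat() for the ISO 8601 text is an assumption, and so is matching the whole value against regexp (re.fullmatch), since the description above does not say how the expression is anchored.

```python
import hashlib
import re
from datetime import datetime
from typing import Optional


def storage_name(file_id: int, user_id: int, group_id: int, content: bytes) -> str:
    """Build a storage name following the scheme given for File.name_storage."""
    md5_checksum = hashlib.md5(content).hexdigest()
    file_size = len(content)
    # The scheme left-pads the file ID with zeros to 10 characters.
    # str() is required because rjust() is a string method.
    padded_id = str(file_id).rjust(10, "0")
    return f"{padded_id}_{user_id}_{group_id}_{file_size}_{md5_checksum}"


def normalize_value(
    raw: str,
    regexp: Optional[str] = None,
    datetimefmt: Optional[str] = None,
    mandatory: bool = False,
) -> str:
    """Lint one raw cell value and return the plain text that would be stored.

    Raises ValueError if the value does not pass linting.
    """
    if raw == "":
        if mandatory:
            raise ValueError("mandatory field is empty")
        return raw
    # Assumption: the regular expression has to match the entire value.
    if regexp is not None and re.fullmatch(regexp, raw) is None:
        raise ValueError(f"value {raw!r} does not match {regexp!r}")
    if datetimefmt is not None:
        # strptime raises ValueError itself if the value does not fit the format.
        # Components missing from the format default to 1900-01-01 and 00:00.
        return datetime.strptime(raw, datetimefmt).isoformat()
    return raw
```

With this sketch, normalize_value("2021-03-05", datetimefmt="%Y-%m-%d") returns 2021-03-05T00:00:00 and normalize_value("14:30", datetimefmt="%H:%M") returns 1900-01-01T14:30:00, which matches the fallbacks described for datetimefmt.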

Conceptual Notes

  1. All captured metadata is treated as text internally

    The generic application design, i.e. that the metadata fields can be defined dynamically at runtime, implies that the sample sheet cannot be mapped 1:1 onto a corresponding relation / table in the relational data model, unless one wants to go into the realm of using CREATE / ALTER TABLE at runtime. The current data model design stores all captured values in one attribute of one relation (metadatumrecord.value), which makes type reflection on the data model level impossible. At the same time, plain text formats such as CSV or TSV are accepted as input, so all values that are to be held must be serializable anyway. The value field of the MetaDatumRecord entity is therefore TEXT / VARCHAR.

  2. Files and MetaDatumRecords are detached until submission

    Until submission (aka commit), the data model does not link files and metadatumrecords. The integration of the file names and file uploads into the sample sheet (Pending annotated submissions) is purely visual on the client side through name-based matching. The absence of this link is also used internally to distinguish pending files from files that have been submitted, as files do not have a direct relation to a submission themselves. Only when a data record is submitted do files get linked to metadatumrecords, and those get linked to a submission (via metadataset).

  3. Both files and metadatasets have owners

    The previous point, i.e. pending files and metadatasets not being connected, requires that ownership is documented both on the file and on the metadataset level.

  4. Who has access to what data

    What is shown on the /submit view is private to a UID/GID combination. Other members of the same group cannot access the pending submission, and neither can the users themselves once they change their group. After submission, in the /view view, users can see all submitted data from their group (not yet implemented).
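
Notes 1 and 2 can be illustrated with a minimal Python sketch. It only models the idea and is not the DataMeta code: the class and function names are hypothetical, the metadatum is referenced by name instead of metadatum_id for brevity, and pending files are represented as a plain mapping from original file name to file ID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class MetaDatumRecord:
    metadatum_name: str            # stands in for metadatum_id in this sketch
    value: str                     # every captured value is held as plain text
    file_id: Optional[int] = None  # stays unset until submission


def row_to_records(row: Dict[str, object]) -> List[MetaDatumRecord]:
    """Turn one sample sheet row into one text record per filled field."""
    return [
        MetaDatumRecord(name, str(value))
        for name, value in row.items()
        if value is not None and str(value) != ""
    ]


def link_files_on_submit(
    records: List[MetaDatumRecord],
    file_columns: Set[str],
    pending_files: Dict[str, int],
) -> None:
    """Link file records to uploaded files by name; refuse if a file is missing."""
    file_records = [r for r in records if r.metadatum_name in file_columns]
    missing = [r.value for r in file_records if r.value not in pending_files]
    if missing:
        raise ValueError(f"files not uploaded yet: {missing}")
    for record in file_records:
        record.file_id = pending_files[record.value]


# One row of a sample sheet; the column names are made up for illustration.
row = {"sample_id": 17, "comment": "", "fastq_r1": "run1_R1.fastq.gz"}
records = row_to_records(row)

# Before submission, no record carries a file ID.
pending = {"run1_R1.fastq.gz": 5}
link_files_on_submit(records, {"fastq_r1"}, pending)
```

Before link_files_on_submit is called, the only connection between the row and the upload is the matching file name, which is why ownership has to be recorded on both sides, as note 3 explains.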

Visual Studio Code Setup

Installation instructions for using the remote container feature of Visual Studio Code (vscode)

The remote container feature allows you to run the editor’s backend inside a Docker container that is readily set up for development.

Quickstart

Clone this repository:

git clone https://github.com/ghga-de/datameta.git

Then open the created directory in vscode, for instance like this:

code ./datameta

Install the remote development extension:

  • click on the extensions symbol in the side bar

  • search for Remote Development and install it

To reopen vscode inside the dev container:

  • select View > Command Palette in the dropdown menu

  • then select (or type): Remote-Containers: Reopen in Container

If you are executing this for the first time, the containers will be set up via docker-compose. This might take some time.

Using the container

Once the build is successful, you will be able to use vscode as usual. The workspace will be mounted at /workspace.

However, before you start, you have to first install datameta in edit mode. Just type in the terminal:

dev_install

(This will execute the script at /workspace/docker/dev_install.) You only have to run this once, unless you rebuild the container or want to reinstall datameta.

Every time you would like to deploy datameta, just type:

dev_launcher

(This will execute the script at /workspace/docker/dev_launcher.)

The frontend should be available at http://localhost:8080/ in your browser.

Configuration

Any configuration regarding the dev container environment can be found at /.devcontainer.

The environment includes a few useful vscode extensions out of the box. If you find an extension that might be of use to everybody, feel free to add it to /.devcontainer/devcontainer.json.

For general information on this vscode feature, please refer to the official Visual Studio Code documentation on developing inside a container.