About the CoGDat Portal
The CoGDat portal is driven by a DataMeta submission server. DataMeta is submission portal software for collecting experimental raw data together with associated metadata annotations. It is designed to:
Make data submission easy. Raw data files and metadata can be submitted interactively through a user-friendly web UI in the browser, or in an automated fashion using the DataMeta REST API, Python client library or command line client.
Be scalable. Due to its architecture, DataMeta can serve a large number of clients and be deployed to various types of server environments.
Architecture
The main DataMeta components are:
The DataMeta Back End
The back end provides a RESTful API via HTTP. It keeps track of all metadata, raw files and their associations in a database. When a client wants to upload or retrieve a file, the back end brokers the file exchange by providing instructions to the client on how to perform it via HTTP.
File Storage
The file storage itself is accessed through non-RESTful HTTP POST and GET requests. It can be provided either directly by the DataMeta back end or by S3 storage that is managed by the DataMeta back end.
Client
The DataMeta back end and the file storage can be accessed through various clients. In addition to the API endpoints, a DataMeta installation also provides a Web UI, which can be used as an interactive client to access the back end through the browser. Additionally, a Python package providing programmatic interfaces and a command line client are available. Finally, the REST API and the file storage API are open and documented, allowing the creation of custom clients.
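The brokered file exchange between client, back end, and file storage can be pictured with a small in-memory sketch. Everything in it is hypothetical (the class names, the instruction fields method and url, and the example.org URL); it is not the DataMeta API, only an illustration of a client asking the back end for instructions and then following them against the file storage.

```python
from dataclasses import dataclass, field


@dataclass
class FileStorage:
    """Hypothetical stand-in for the file storage (local or S3)."""
    blobs: dict = field(default_factory=dict)

    def handle(self, method: str, url: str, body: bytes = b"") -> bytes:
        # The storage only understands plain POST (store) and GET (fetch).
        if method == "POST":
            self.blobs[url] = body
            return b""
        if method == "GET":
            return self.blobs[url]
        raise ValueError(f"unsupported method: {method}")


@dataclass
class BackEnd:
    """Hypothetical stand-in for the back end, which only brokers transfers."""
    known_files: dict = field(default_factory=dict)

    def announce_upload(self, filename: str) -> dict:
        # Record the file and tell the client how to upload it.
        url = f"https://storage.example.org/{filename}"  # assumed URL scheme
        self.known_files[filename] = url
        return {"method": "POST", "url": url}

    def request_download(self, filename: str) -> dict:
        # Tell the client where and how to fetch a known file.
        return {"method": "GET", "url": self.known_files[filename]}


def upload(back_end: BackEnd, storage: FileStorage, filename: str, content: bytes) -> None:
    # Step 1: ask the back end; step 2: follow its instructions.
    instructions = back_end.announce_upload(filename)
    storage.handle(instructions["method"], instructions["url"], content)


def download(back_end: BackEnd, storage: FileStorage, filename: str) -> bytes:
    instructions = back_end.request_download(filename)
    return storage.handle(instructions["method"], instructions["url"])
```

In this sketch the back end never touches the file content itself; it only keeps track of which files exist and hands out instructions, which mirrors the division of responsibilities described in this section.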
General Concepts
DataMeta manages metadata and files. Which metadata has to be provided differs for every installation of DataMeta; the metadata for the CoGDat portal is documented here. We use the following terminology:
A metadataset is a collection of metadata, i.e. key-value pairs corresponding to the same entity. It typically corresponds to a row in a sample sheet. An example of a metadataset is shown above.
A field is a single key of a metadataset, for example ZIPCode. A field typically corresponds to a column name in a sample sheet.
A field may correspond to a file; in that case, the value of the field is the name of that file. When uploading data, files and metadatasets are transmitted separately and are only linked once a submission is created, which finalizes the transfer.
Uploading content
For the example metadataset shown above, the submission would include the following steps:
Upload of the metadataset holding
{ ID : "ABC123", Date : "2020-10-30", ZIPCode : "123", RawFQ1 : "ABC123_R1.fq.gz", RawFQ2 : "ABC123_R2.fq.gz"}
Upload of the file
ABC123_R1.fq.gz
Upload of the file
ABC123_R2.fq.gz
Creation of a submission which links the metadataset with the two corresponding files.
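The steps above can be sketched as a small in-memory model. This is an illustration only, not the DataMeta client library: the class PendingUploads, its method names, and the choice to reject a submission whose files are not yet uploaded are all assumptions made for this example.

```python
# Assumption: these are the fields of the example metadataset that refer to files.
FILE_FIELDS = ("RawFQ1", "RawFQ2")


class PendingUploads:
    """Hypothetical holding area for data that is uploaded but not yet submitted."""

    def __init__(self):
        self.metadatasets = []
        self.files = set()

    def upload_metadataset(self, record: dict) -> int:
        # Metadatasets are transmitted on their own; return a handle to the copy.
        self.metadatasets.append(dict(record))
        return len(self.metadatasets) - 1

    def upload_file(self, filename: str) -> None:
        # Files are transmitted separately from the metadataset.
        self.files.add(filename)

    def create_submission(self, metadataset_index: int) -> dict:
        # Link the metadataset with its files; only now is the transfer final.
        record = self.metadatasets[metadataset_index]
        needed = [record[name] for name in FILE_FIELDS]
        missing = [name for name in needed if name not in self.files]
        if missing:
            raise ValueError(f"files not uploaded yet: {missing}")
        return {"metadataset": record, "files": needed}


pending = PendingUploads()
index = pending.upload_metadataset({
    "ID": "ABC123",
    "Date": "2020-10-30",
    "ZIPCode": "123",
    "RawFQ1": "ABC123_R1.fq.gz",
    "RawFQ2": "ABC123_R2.fq.gz",
})
pending.upload_file("ABC123_R1.fq.gz")
pending.upload_file("ABC123_R2.fq.gz")
submission = pending.create_submission(index)
```

The point of the sketch is the ordering: the metadataset and the files can arrive in any order, and nothing links them until the submission is created.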
Field Constraints
Not all fields allow arbitrary values. For example, the field ZIPCode is constrained to hold exactly three digits, and the field Date must contain a valid date specification. If a metadataset violates any field constraint, DataMeta rejects the metadataset and informs the user about the violated constraints.
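As a rough illustration of what such checks amount to, the two constraints mentioned here could be expressed as follows. This is a sketch, not DataMeta's validation code: the function names and the error messages are made up, and the YYYY-MM-DD date format is an assumption based on the example value "2020-10-30".

```python
import re
from datetime import date


def check_zip_code(value: str) -> bool:
    # Exactly three ASCII digits, nothing before or after.
    return re.fullmatch(r"[0-9]{3}", value) is not None


def check_date(value: str) -> bool:
    # Assumption: dates are given in ISO format, e.g. "2020-10-30".
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Hypothetical constraint table: field name -> (check function, message).
CONSTRAINTS = {
    "ZIPCode": (check_zip_code, "must be exactly three digits"),
    "Date": (check_date, "must be a valid date"),
}


def violated_constraints(record: dict) -> dict:
    """Return a mapping of field name to message for every violated constraint."""
    errors = {}
    for field_name, (check, message) in CONSTRAINTS.items():
        value = record.get(field_name)
        if not isinstance(value, str) or not check(value):
            errors[field_name] = message
    return errors
```

A metadataset would be accepted only if violated_constraints returns an empty mapping; otherwise the returned messages are what the user would be shown.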
Services
DataMeta distinguishes between two types of metadata: regular metadata and service metadata. Regular metadata is the metadata that can be provided upon the original submission of a metadataset. In contrast, service metadata can only be provided after submission, and only by user accounts that are designated to manage the particular service.
A service typically reflects a particular analysis or computer program. For example, within CoGDat all samples are run through a viral genome assembly pipeline after submission. The “ViralGenomeAssembly” service provides two additional metadata for each metadataset: an additional genome assembly file (file metadatum) and a boolean metadatum indicating whether or not human contamination was identified during the assembly preprocessing.
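The distinction between regular and service metadata can be sketched as a simple permission check. The data classes, the field names AssemblyFile and HumanContamination, and the account name used in the usage note are hypothetical; only the service name "ViralGenomeAssembly" and the rule that designated accounts alone may add service metadata come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    fields: tuple   # service metadata fields this service may fill in
    managers: set   # user accounts designated to manage this service


@dataclass
class SubmittedMetadataset:
    regular: dict                                 # provided at submission time
    service: dict = field(default_factory=dict)   # filled in afterwards


def add_service_metadata(record, service, user, values):
    """Attach service metadata to an already submitted metadataset."""
    if user not in service.managers:
        raise PermissionError(f"{user} does not manage service {service.name}")
    unknown = set(values) - set(service.fields)
    if unknown:
        raise KeyError(f"fields not provided by {service.name}: {sorted(unknown)}")
    record.service[service.name] = dict(values)
```

For example, with a service Service("ViralGenomeAssembly", ("AssemblyFile", "HumanContamination"), {"pipeline-bot"}), only the account pipeline-bot could attach the assembly file name and the contamination flag to a submitted metadataset; any other account would get a PermissionError.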
Client Software
The aforementioned steps to upload a metadataset and the corresponding files do not have to be executed manually under normal circumstances. Several client software options are available, offering different degrees of automation. For example, the DataMeta web interface allows users to simply drag and drop a sample sheet holding multiple metadatasets, together with the corresponding files, into the browser. For details on how to use the various client options, please consult the corresponding sections in this documentation.