# About the CoGDat Portal

The CoGDat portal is driven by a DataMeta submission server. DataMeta is submission portal software for collecting experimental raw data with associated metadata annotations. It is designed to

* *Make data submission easy*. Raw data files and metadata can be submitted interactively through a user-friendly web UI in the browser, or in an automated fashion using the DataMeta REST API, Python client library or command line client.
* *Be scalable*. Due to its architecture, DataMeta can serve a large number of clients and be deployed to various types of server environments.

## Architecture

![DataMeta Components](img/components.svg)

The main DataMeta components are

* **The DataMeta Back End** The back end provides a RESTful API via HTTP. It keeps track of all metadata, raw files and their associations in a database. When a client wants to upload or retrieve a file, the back end brokers the file exchange by instructing the client on how to perform it via HTTP.
* **File Storage** The file storage itself is accessed through non-RESTful HTTP POST and GET requests. It can be provided either directly by the DataMeta back end or by an S3 storage that is managed by the DataMeta back end.
* **Client** The DataMeta back end and the file storage can be accessed through various clients. In addition to the API endpoints, a DataMeta installation also provides a web UI which can be used as an interactive client to access the back end through the browser. Additionally, a Python package providing programmatic interfaces and a command line client are available. Finally, the REST API and file storage API are open and documented, allowing the creation of custom clients.

## General Concepts

![Metadatasets and Files - Excel](img/example_excel.png)

![Metadatasets and Files - Diagram](img/files_metadata.svg)

DataMeta manages metadata and files.
Which metadata has to be provided differs for every installation of DataMeta; the metadata for the CoGDat portal is documented [here](metadata.md). We use the following terminology:

* A *metadataset* is a collection of metadata, i.e. key-value pairs corresponding to the same entity. It typically corresponds to a row in a sample sheet. An example of a metadataset is shown above.
* A *field* is a single key of a metadataset, for example *ZIPCode*. A field typically corresponds to a column name in a sample sheet. A field may correspond to a file; in that case, the value of the field is the name of that file.

When uploading data, files and metadatasets are transmitted separately and only linked once a *submission* is created, which finalizes the transfer.

### Uploading Content

For the example metadataset shown above, the submission would include the following steps:

1. Upload of the *metadataset* holding

   ```
   { ID : "ABC123", Date : "2020-10-30", ZIPCode : "123", RawFQ1 : "ABC123_R1.fq.gz", RawFQ2 : "ABC123_R2.fq.gz" }
   ```

1. Upload of the file `ABC123_R1.fq.gz`
1. Upload of the file `ABC123_R2.fq.gz`
1. Creation of a *submission* which links the metadataset with the two corresponding files.

### Field Constraints

Not all fields allow arbitrary values. For example, the field *ZIPCode* is constrained to hold exactly three digits, and the field *Date* must contain a valid date specification. If a metadataset violates any field constraint, DataMeta rejects the metadataset and informs the user about the violated constraints.

### Services

DataMeta distinguishes between two types of metadata: *regular metadata* and *service metadata*. Regular metadata is the metadata that can be provided upon the original submission of a metadataset. In contrast, service metadata can only be provided subsequently and only by user accounts that are designated to manage a particular service. A *service* typically reflects a particular analysis or computer program.
For example, within CoGDat all samples are run through a viral genome assembly pipeline after submission. The "ViralGenomeAssembly" service provides two additional metadata items for each metadataset: a genome assembly file (file metadatum) and a boolean metadatum indicating whether human contamination was identified during the assembly preprocessing.

## Client Software

Under normal circumstances, the aforementioned steps to upload a metadataset and the corresponding files do not have to be executed manually. Several client options are available, offering different degrees of automation. For example, the DataMeta web interface allows users to simply drag and drop a sample sheet holding multiple metadatasets and the corresponding files into the browser.

For details on how to use the various client options, please consult the corresponding sections in this documentation.
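
To make the sequence that the clients automate more concrete, the following is a minimal in-memory sketch of the upload steps and field constraints described under General Concepts. It does not use or reproduce the DataMeta API or the Python client library: the class `InMemoryPortal`, the helper `violated_constraints`, and the constraint table are hypothetical names introduced for illustration only. The sketch also assumes that a "valid date specification" means an ISO date such as the one in the example metadataset.

```python
import re
from datetime import date


def _is_iso_date(value):
    """Assumption: a valid date is an ISO date such as '2020-10-30'."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Hypothetical constraint table. Only the two rules themselves
# (three-digit ZIPCode, valid Date) come from the documentation.
FIELD_CONSTRAINTS = {
    "ZIPCode": lambda value: re.fullmatch(r"[0-9]{3}", value) is not None,
    "Date": _is_iso_date,
}

# Fields whose values are file names (taken from the example metadataset).
FILE_FIELDS = ("RawFQ1", "RawFQ2")


def violated_constraints(metadataset):
    """Return the names of all fields whose values break a constraint."""
    return [
        field
        for field, is_valid in FIELD_CONSTRAINTS.items()
        if field in metadataset and not is_valid(metadataset[field])
    ]


class InMemoryPortal:
    """Toy stand-in for a submission server; not the DataMeta back end."""

    def __init__(self):
        self.metadatasets = []
        self.files = set()

    def upload_metadataset(self, metadataset):
        """Step 1: store a metadataset, rejecting constraint violations."""
        violations = violated_constraints(metadataset)
        if violations:
            raise ValueError(f"violated field constraints: {violations}")
        self.metadatasets.append(metadataset)
        return len(self.metadatasets) - 1

    def upload_file(self, filename):
        """Steps 2 and 3: record that a file has been uploaded."""
        self.files.add(filename)

    def create_submission(self, index):
        """Step 4: link a metadataset with its uploaded files."""
        metadataset = self.metadatasets[index]
        needed = [metadataset[field] for field in FILE_FIELDS]
        missing = [name for name in needed if name not in self.files]
        if missing:
            raise ValueError(f"files not uploaded yet: {missing}")
        return {"metadataset": metadataset, "files": needed}


portal = InMemoryPortal()
index = portal.upload_metadataset({
    "ID": "ABC123",
    "Date": "2020-10-30",
    "ZIPCode": "123",
    "RawFQ1": "ABC123_R1.fq.gz",
    "RawFQ2": "ABC123_R2.fq.gz",
})
portal.upload_file("ABC123_R1.fq.gz")
portal.upload_file("ABC123_R2.fq.gz")
submission = portal.create_submission(index)
print(submission["files"])
```

In this sketch the metadataset and the files are stored independently and only become associated in `create_submission`, mirroring the separation between upload and submission described above.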