Working with HDC Datasets

Last modified by Dennis Segebarth on 2024/10/02 18:25

Table of Contents

How it Works
Prerequisites
Data Stewardship
Creating a Dataset
Adding Files to a Dataset
Organizing and Validating Datasets
- Organizing Datasets
- Validating a Dataset Structure
Annotating Datasets with Metadata
Version Control and Dataset Sharing

Datasets extend Projects with comprehensive data management functionalities. Importantly, each Project can have multiple Datasets associated with it, but each Dataset can only be populated with data from a single Project. Each Dataset can be viewed, accessed, and interacted with by the user who created it, as well as all users with the Project Administrator role in the Project the Dataset is associated with.

In brief, Datasets are collections of related Project files, folders, and detailed metadata. The Dataset feature allows users to organize Project files, enrich them with annotations, validate data structures, and publish controlled versions for sharing.

How it Works

After creating a Dataset, you can select and copy files into it from the Project Core. Within the Dataset you can reorganize files, validate the file structure against standardized models (currently, the BIDS - Brain Imaging Data Structure - is supported), enrich the Dataset with metadata annotations, and create controlled versions for sharing.

Prerequisites

Project Collaborator role or higher.

Data Stewardship

Users are reminded to abide by the Platform Terms of Use and any Project-specific restrictions when using Datasets to manage and share data and code.

Creating a Dataset

New Datasets are empty containers ready for you to add data files and/or folders from your Project Core. Each Dataset can be viewed, accessed, and interacted with by the user who created it, as well as all users with the Project Administrator role in the Project the Dataset is associated with.

Datasets are created from the Main Menu. To create a new dataset,

Click Datasets from the Main Menu to open the Datasets landing page. Datasets you previously created are displayed on this page in the “My Datasets” tab, and you can continue working on them as described in the following sections of this article, e.g. see Adding Files to a Dataset.
Click + Create New in the upper right-hand corner. You will be prompted to provide some essential metadata information about your dataset in two sections (“Define Dataset” and “Description”, see the following list for more details):
Note: this information is stored in the Dataset’s Essential schema and is visible in the Dataset Home tab, and can be changed later (with the exception of Dataset Code) - see Reviewing and Editing Metadata Annotations. Mandatory fields are denoted by a red asterisk ( * ).
1. Define Dataset
  - Title*: A short title for the Dataset (max. 100 characters). This entry can be changed later.
  - Dataset Code*: A distinct, immutable code defined by you to uniquely identify your Dataset. This entry cannot be changed later.
  - Authors*: One or more authors. Hit Enter after each entry. These entries can be changed later.
  - Dataset Type: Use the default GENERAL or, if your Dataset uses the Brain Imaging Data Structure (BIDS) standard and you wish to use the built-in BIDS validation tool, select BIDS from the dropdown menu. This entry can be changed later.
2. Description:
  - Dataset Description*: A longer description of the Dataset. Please remember to comply with the Platform Terms of Use and Privacy Policy, and do not enter sensitive personal information in this field.
  - Modality: Based on the Human Brain Project openMINDS standard.
  - Collection Method: Based on the Human Brain Project openMINDS standard.
  - License: License under which you want to share your Dataset (e.g., Creative Commons). A License is not required, yet highly recommended if you want to share your Dataset in compliance with the FAIR criteria.
  - Tags: Custom keywords you create for your Dataset. Tags are displayed on the Datasets landing page and information bar.
Once you have provided all required information, click Create in the upper right-hand corner to complete the Dataset creation.

Your Dataset is now saved and is visible on the Datasets landing page in the “My Datasets” tab. Next, you can go to a Project and begin copying files into the Dataset, as described below in the section Adding Files to a Dataset.
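The field rules listed above can be summarized in a short Python sketch. This is only an illustration for preparing entries before typing them into the form. The class name, attribute names, and attribute types are hypothetical and do not reflect how the Portal stores the Essential schema. Only the mandatory fields, the 100-character Title limit, and the two Dataset Type options are taken from this article.

```python
from dataclasses import dataclass, field

TITLE_MAX_CHARS = 100                 # limit stated for the Title field
DATASET_TYPES = ("GENERAL", "BIDS")   # the two Dataset Type options


@dataclass
class EssentialMetadata:
    """Hypothetical local model of the Dataset creation form."""
    title: str
    dataset_code: str
    authors: list[str]
    description: str
    dataset_type: str = "GENERAL"     # GENERAL is the default type
    modality: str = ""                # optional
    collection_method: str = ""       # optional
    license: str = ""                 # optional, but recommended for sharing
    tags: list[str] = field(default_factory=list)


def check_essential(meta: EssentialMetadata) -> list[str]:
    """Return a list of problems; an empty list means the form looks complete."""
    problems = []
    if not meta.title.strip():
        problems.append("Title is required")
    elif len(meta.title) > TITLE_MAX_CHARS:
        problems.append(f"Title exceeds {TITLE_MAX_CHARS} characters")
    if not meta.dataset_code.strip():
        problems.append("Dataset Code is required")
    if not any(author.strip() for author in meta.authors):
        problems.append("At least one author is required")
    if not meta.description.strip():
        problems.append("Dataset Description is required")
    if meta.dataset_type not in DATASET_TYPES:
        problems.append("Dataset Type must be GENERAL or BIDS")
    return problems
```

Calling check_essential on a filled-in EssentialMetadata object returns an empty list when all mandatory fields are present and the Title fits the limit.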

Adding Files to a Dataset

As mentioned above, each Dataset is associated with a single Project. Therefore, files can be added to a Dataset only from the Core of the respective Project. Adding a file creates a copy, so the original file in the Core is unaffected by subsequent manipulations inside the Dataset. The Dataset must already exist before files can be added (see Creating a Dataset). Once this association between the Dataset and a Project has been established, the Dataset automatically becomes visible and accessible to all members of the Project who have the Project Administrator role.

To add files to a Dataset,

Navigate to your Project’s File Explorer and click Core.
Select one or more files and/or folders by clicking the checkbox beside each one, then click Add to Datasets in the File Explorer menu.
In the popup window, use the dropdown menu to select an existing Dataset. If no Datasets appear, close the window and follow the instructions in Creating a Dataset, then return and try again.
Click Add to Dataset. The selected file(s) are added to the Dataset. If a file with the same name already exists in the Dataset, it is skipped and won’t be added to the Dataset.
You can now open the Dataset from the Main Menu and view the added files in the Explorer.

Note: A Dataset may only contain data from a single Project. This prevents unintentional linkage of data from different Projects in violation of data privacy principles. Once you have added a Project file into your Dataset, all subsequent data added to the Dataset must originate from the same Project. If you attempt to upload data from different Projects into the same Dataset, the Portal will return a warning and prevent the action.
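The three rules described in this section (files are copied, same-name files are skipped, and all files must come from one Project) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the Portal's implementation: the class, method, and exception names are hypothetical, files are represented as a plain name-to-content mapping, and the Portal's warning is represented here as a raised exception.

```python
from dataclasses import dataclass, field
from typing import Optional


class ProjectMismatchError(Exception):
    """Stands in for the Portal's warning when a second Project is used."""


@dataclass
class Dataset:
    code: str
    project: Optional[str] = None     # fixed by the first successful add
    files: dict[str, bytes] = field(default_factory=dict)

    def add_files(self, project: str, selection: dict[str, bytes]) -> list[str]:
        """Copy the selection in; return names skipped because they already exist."""
        if self.project is not None and self.project != project:
            raise ProjectMismatchError(
                f"Dataset {self.code} already holds data from {self.project}"
            )
        self.project = project
        skipped = []
        for name, content in selection.items():
            if name in self.files:
                skipped.append(name)  # existing file is kept, new one is not added
                continue
            self.files[name] = content  # the Dataset keeps its own entry
        return skipped
```

In this model, removing an entry from Dataset.files leaves the mapping that represents the Project Core untouched, which mirrors the copy behavior described above.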

Organizing and Validating Datasets

After copying Project files into a Dataset, you can perform a variety of tasks including moving, previewing, renaming or deleting files and folders, and validating the file structure against a supported external specification (currently, the Brain Imaging Data Structure (BIDS) standard is supported). You can also add metadata (see Annotating Datasets with Metadata) and share these with members of your Project by first creating a Space in the EBRAINS Knowledge Graph, and then uploading the metadata to it (see Interacting with the EBRAINS Knowledge Graph).

Organizing Datasets

Use the Explorer tab to organize and validate your Dataset.

Open the Dataset from the Datasets landing page, then select the Explorer tab.
Use + New Folder or Move to to rearrange files and/or folders.
Click a filename to display additional interactive icons:
- Preview (eye icon): The panel on the right side displays a content preview for supported structured file types (e.g., .txt, .csv, .json).
- Download: Download and save a copy to your local device.
- Edit (pencil icon): Rename the file or folder.
- Trash: Remove the file or folder from your Dataset. This does not delete the Project file from the Core.

Validating a Dataset Structure

Many scientific communities and disciplines have established data standards for describing and organizing datasets to promote interoperability and reusability, and offer automated validation tools to help check conformance with those standards. The HDC Portal currently offers a validation tool for the BIDS structure.

BIDS Validation

The Validate BIDS tool allows you to check your Dataset for conformance with the Brain Imaging Data Structure (BIDS) standard. The tool uses the open-source bids-validator 1.8.4 Python package.

Before running the Validate BIDS tool, ensure that the Dataset Type is set to BIDS (defined at Dataset creation or modified in the Metadata tab under Schemas > Essential).

Open the Dataset from the Datasets landing page and select the Explorer tab.
Click Validate BIDS. If the Validate BIDS button isn’t visible, make sure your Dataset Type is set to BIDS (see above).
The validation process starts. It may take some time to run, especially for large files.
When the validation is complete, the results are indicated by icons adjacent to the Validate BIDS tool:
- Green - Validated, no warnings: A valid BIDS Dataset, no further modifications needed.
- Green - Validated + orange warning icon: A valid BIDS Dataset with optional recommendations to make the Dataset more BIDS compliant. Click the warning icon to view the explanations. If desired, make the recommended corrections to the Dataset and re-run the validation.
- Red - Not Validated: an invalid BIDS Dataset. Click the warning icon to view the errors. Make the necessary corrections to the folder structure and re-run the validation.
After successful validation, the status “BIDS compliant” is displayed in the Explorer until the next validation.
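The three possible outcomes can be read as a simple decision rule over the validator's findings. The sketch below assumes that a validation run can be reduced to a count of errors and a count of warnings. The function name and the returned labels are hypothetical and are not taken from the Portal.

```python
def bids_status(errors: int, warnings: int) -> str:
    """Map validator findings to the three outcomes described above."""
    if errors < 0 or warnings < 0:
        raise ValueError("counts cannot be negative")
    if errors > 0:
        return "Not Validated"            # red: fix the errors and re-run
    if warnings > 0:
        return "Validated with warnings"  # green plus orange warning icon
    return "Validated"                    # green: nothing left to change
```

Note that any error makes the Dataset invalid regardless of how many warnings were reported, while warnings alone never do.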

Annotating Datasets with Metadata

Metadata annotation schemas are pieces of code defined by a Dataset author to describe the Dataset in a machine-readable structure that can be used by search and query engines to facilitate future discovery and reuse by the research community. Annotating your Dataset with metadata annotation schemas promotes findability and interoperability. Adding metadata annotation schemas does not change the actual content of the data itself.

Several options are available to annotate your Dataset.

Default Schema: Standard annotation categories that apply to a wide range of research domains and make your Dataset interoperable with common models like DAta Tag Suite (DATS).
Custom Schemas: Flexible schemas where the metadata elements are defined entirely by you.
Supported external schemas: Predefined metadata schemas offered by the research community. JSON files containing predefined annotations can be uploaded and attached to a Dataset. Currently, the EBRAINS openMINDS schema is supported (see Annotating with OpenMINDS Schema).

Changes to your Dataset's metadata are tracked in the Activity tab. When new versions of your Dataset are released, the metadata definition at the time of release is stored and will be available to download as part of that version.

Adding New Annotations

Use the Metadata tab to annotate your Dataset.

These steps describe how to annotate your dataset using the Default Schema. If you have created a Custom Schema (see the section titled Creating a Custom Schema in this article), you can follow the same steps described here, selecting your Custom Schema name instead of Default Schema in Step 2.

Open the Dataset from the Datasets landing page and select the Metadata tab.
Under Existing Schema, select Default Schemas. The schemas that have already been completed in part or full are listed in the left panel, and their details are displayed in the right panel. This includes the Essential schema template with the information you entered at the time the Dataset was created.
In the Schemas section on the right panel, click the Select schema to complete dropdown menu and select one of the Default schema templates or a Custom Schema created by you (see the section Creating a Custom Schema in this article).
Default schema categories include the following:
- Contributors: Information about the persons or organizations who contributed to the Dataset.
- Disease: Information about the disease condition.
- Distribution: Information about the Dataset's distribution properties (format, web URL, authorization).
- Essential: Basic information about the Dataset including the information collected when the Dataset was created and additional optional fields not collected at the time of Dataset creation.
- Grant: Information about the grant that supported the work reported by the Dataset.
- Subjects: Information about each data subject in the Dataset.
Enter the requested fields in the selected schema. If a field accepts multiple nested entries, click +Add Item to add more.
Optional step: Click Save as Draft to exit and return to complete the entries later before saving them to your Dataset, or X Reset to clear all entries.
When you’re ready to save the annotations to your Dataset, click Submit.

The schema is saved and is displayed in the Existing Schema panel.

Reviewing and Editing Metadata Annotations

If you wish to review and optionally make changes to the annotations already saved to a Dataset,

Navigate to Existing Schema (left panel of the Metadata tab) and select the Default Schemas category.
Click the schema name, then click the eye icon to open the Schemas view (right side). The existing entries are displayed in the Schemas panel at the right side.
To edit the entries, click Edit, make the changes, and then click one of the following:
- X Reset to return all entries to their original values.
- Cancel to exit without saving changes.
- Update to save the changes and return to the schema view.

Any new changes are reflected in the Schemas view.

Creating a Custom Schema template

To enrich your Dataset annotations beyond the provided Default Schemas, you can define a Custom Schema template containing your own unique fields (keys), designate each field as required or optional, and then save the template and make it available for adding annotation values.

To create a Custom Schema,

Open the Dataset from the Datasets landing page and select the Metadata tab. Under Existing Schema (left panel), select Default Schemas.
In the Schemas panel (right side), click the Select schema to complete dropdown menu and select + Create Custom Schema to open the Custom Schema Template creation window.
Enter a Template Name (mandatory).
Click + Add field to create a new field and configure the key parameters:
- Type: The type of information to be collected (text, multiple choice, numeric, date)
- Title: The name of the field (max. 20 characters)
- Value: If you selected multiple choice as type, you can define the available value options (maximum 20 characters each), pressing Enter after each entry.
- Optional: Specify whether the field is optional or mandatory (mandatory by default). Check the box if the Field is optional.
Click the checkmark to save the new field or X to clear the field.
To add more fields, repeat the two previous steps (add a field, then save it with the checkmark) until all fields have been added.
Review the newly created fields for accuracy and completeness. Be sure to click the checkmark beside each field to save the entry, or X to discard the field. Use the Edit or Delete icons to make any changes to saved fields.
Note: After a Custom Schema template has been saved, new fields can be added to the template but existing fields cannot be edited or removed.
Click Save.

Your new Custom Schema template is saved. The panel switches to Annotations mode where you can enter annotations in your Custom Schema now or later, as described in the section Adding New Annotations in this article.
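It can help to draft a Custom Schema template on paper or in code before entering it in the Portal. The following Python sketch models the field parameters described above. The class and attribute names are hypothetical and do not correspond to a Portal file format. The field types, the 20-character limits, and the mandatory-by-default behavior are taken from this section.

```python
from dataclasses import dataclass, field

FIELD_TYPES = ("text", "multiple choice", "numeric", "date")
MAX_CHARS = 20  # limit stated for field titles and for each choice value


@dataclass
class TemplateField:
    title: str
    kind: str = "text"
    values: list[str] = field(default_factory=list)  # used by multiple choice only
    optional: bool = False                            # fields are mandatory by default

    def problems(self) -> list[str]:
        found = []
        if self.kind not in FIELD_TYPES:
            found.append(f"unknown type: {self.kind}")
        if not self.title or len(self.title) > MAX_CHARS:
            found.append(f"title must be 1 to {MAX_CHARS} characters")
        if self.kind == "multiple choice":
            for value in self.values:
                if len(value) > MAX_CHARS:
                    found.append(f"value too long: {value}")
        return found


@dataclass
class CustomSchemaTemplate:
    name: str
    fields: list[TemplateField] = field(default_factory=list)

    def problems(self) -> list[str]:
        found = [] if self.name.strip() else ["Template Name is required"]
        for item in self.fields:
            found += [f"{item.title}: {issue}" for issue in item.problems()]
        return found
```

A template whose problems() method returns an empty list should fit within the limits listed above. Reviewing a draft this way is useful because saved fields are harder to change later.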

Editing a Custom Schema template

Editing a Custom Schema after annotations have been saved

After annotations have been saved to a Custom Schema, you can make limited changes to the template fields and save the template with a new name. Any annotations already saved in the template are preserved and any future annotations will follow the new schema definition.

To manage a Custom Schema template that contains saved annotations,

Navigate to Existing Schema (left panel of the Metadata tab) and select the Default Schemas category.
Click the Custom schema name, then click the eye icon to open the Schemas view (right side). The existing entries are displayed.
Click Manage Template.
Make your desired edits. The following changes are possible:
- Add new fields
- Delete existing fields
- Change an existing field from required to optional
Give the updated Schema template a new name, e.g., {name} version 2.
Click + Create New Template.

From this point forward, the custom schema is listed in the Existing Schema panel under the new name. All previous annotations are saved and the newly defined annotations can be added.

Annotating with OpenMINDS

The open Metadata Initiative for Neuroscience Data Structures (openMINDS) is an open-source, community-driven research infrastructure initiative powered by EBRAINS and the Human Brain Project. The openMINDS schema gathers a set of metadata models that can be used for describing heterogeneous neuroscience data. The data can originate from human, animal or simulated studies, computational models, and software tools, as well as metadata or data models.

Metadata stored in the openMINDS configuration in JSON format can be uploaded directly to your Dataset.

Community tools are offered to help generate openMINDS JSON schemas, including a wizard and a Python package. For information or support, please contact the openMINDS team.

To upload an openMINDS JSON schema file,

Open the Dataset from the Datasets landing page and select the Metadata tab.
Under Existing Schema, select openMINDS Schemas and click Upload Schemas.
Click Select Schema and select the JSON file(s) from your local computer that contain your metadata in the openMINDS format.
Click Upload.
After successful upload, the schemas appear in the Existing Schemas panel. Click the eye icon to view your schema. Click the trash icon to delete a schema.
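Before uploading, it can be worth confirming locally that each file is well-formed JSON. The Python sketch below uses only the standard library. It checks JSON syntax only and does not validate the content against the openMINDS models. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def check_json_text(text: str) -> Optional[str]:
    """Return None if the text parses as JSON, otherwise a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


def check_json_files(paths) -> dict[str, str]:
    """Return {path: problem} for every file that cannot be read or parsed."""
    problems = {}
    for path in map(Path, paths):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            problems[str(path)] = f"cannot read file: {type(exc).__name__}"
            continue
        issue = check_json_text(text)
        if issue is not None:
            problems[str(path)] = issue
    return problems
```

An empty result from check_json_files means every file could be read and parsed. Whether the Portal accepts the content as openMINDS metadata is still decided at upload time.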

Version Control and Dataset Sharing

Datasets can be versioned and shared in a variety of ways.

Users are reminded to abide by the Platform Terms of Use and any Project-specific restrictions when using the Datasets feature to download data.

Viewing Dataset Activity

Changes made to a Dataset are tracked in real time and can be viewed at any time in the Activity tab. This lineage record displays all the historical actions in a Dataset starting from the time of Dataset creation, including who performed each action and when it was performed.

To view Activity, open the Dataset from the Datasets landing page and select the Activity tab. Use the date filters to narrow the range of activities displayed.

Anytime Dataset Download

You can download the Dataset in its current state, along with all the associated metadata, at any time using the Download icon located beneath the Dataset title in the Dataset information bar.

The files and metadata are packaged in a zip archive identified with a unique name and date-time hash. Metadata files, including any custom metadata annotations, are included as JSON files.
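After downloading, the JSON files in the archive can be read with the Python standard library. The sketch below is a minimal example. The archive layout and the member names used in the demonstration helper are made up for illustration, since this article does not specify them. Because it selects members by the .json extension, any JSON data files in the Dataset are returned along with the metadata files.

```python
import io
import json
import zipfile


def read_json_members(archive) -> dict:
    """Map each .json member of a Dataset zip to its parsed content.

    `archive` may be a file path or a binary file-like object.
    """
    parsed = {}
    with zipfile.ZipFile(archive) as bundle:
        for name in bundle.namelist():
            if name.lower().endswith(".json"):
                parsed[name] = json.loads(bundle.read(name).decode("utf-8"))
    return parsed


def build_example_archive() -> io.BytesIO:
    """Build a tiny in-memory zip with made-up member names, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("data/readme.txt", "example file")
        bundle.writestr("essential.json", json.dumps({"title": "Example"}))
    buffer.seek(0)
    return buffer
```

For a real download, pass the path of the zip file to read_json_members instead of the in-memory example.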

Releasing Dataset Versions

From time to time you may wish to create a snapshot version of your Dataset as a saved package that can be accessed and downloaded now or later. The Dataset Version Release tool bundles your data and metadata, assigns a version number, captures release notes, and creates a Dataset release.

To release a Dataset version,

Open the Dataset from the Datasets landing page and ensure you have made all the desired updates to the files, folder structure and metadata.
Click Release new version (upper right corner) to open a popup window and enter the release details.
Select a Release type: Minor or Major. A Version number is assigned automatically, continuing from the last Version you published (Major: Version 1.0, 2.0, etc.; Minor: Version 1.1, 1.2, etc.).
Note: The distinction of Minor vs. Major is a convenience feature and the definition is entirely at your discretion. Typically, a major release could represent a significant change from the previous version such as a large amount of new data or significant restructuring of the data, whereas a minor release could signify a small change.
Enter a brief description in the Version Notes box (max. 250 characters).
Click Submit to save the snapshot or Cancel to stop creating the Release.

Your dataset files and all metadata JSON files are packaged in a zip file and saved for future access. Dataset version history is also captured in the Dataset Activity tab.

To view all released Dataset versions, click Versions in the Dataset information bar located next to the Dataset title. This opens the Versions sidebar where you can browse and download the Versions you have created.
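The automatic numbering described above (Major: 1.0, 2.0; Minor: 1.1, 1.2) can be sketched as a small function. This is an assumption about the scheme, based only on the examples in this article, and is not the Portal's code. In particular, the function name is hypothetical, and the behavior for a Dataset with no previous release, and the reset of the minor number on a Major release, are guesses.

```python
from typing import Optional


def next_version(last: Optional[str], release_type: str) -> str:
    """Return the version that would follow `last` for a Major or Minor release."""
    # Assumption: a Dataset with no release yet starts counting from 0.0.
    major, minor = (0, 0) if last is None else map(int, last.split("."))
    kind = release_type.lower()
    if kind == "major":
        return f"{major + 1}.0"        # assumption: minor number resets to 0
    if kind == "minor":
        return f"{major}.{minor + 1}"
    raise ValueError("release_type must be 'Major' or 'Minor'")
```

Under these assumptions, a Minor release after Version 1.1 yields 1.2, and a Major release after Version 1.2 yields 2.0.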

HealthDataCloud is powered by Pilot technology, a product of Indoc Systems.