1. Background

Informed decisions on whether and how to (re)use particular digital datasets rely on knowledge about aspects of data and metadata quality, including their completeness, accuracy, provenance and timeliness (; ). Quality assessments also improve the reliability and usability of both data and metadata () and are crucial for supporting open-source science and data-driven policy-making processes (; ).

A dataset in this article refers to a collection of data that is identifiable (), and has the potential to be curated or published by a single actor (). A particular dataset can digitally represent a group of observations, a data product from a specific version of a processing algorithm based on observations, output of numerical model(s), or outcomes of laboratory experiments.

Dataset quality information embodies information about the quality or state of data (input, output, and ancillary), metadata, documentation, software, procedures, processes, workflows, and infrastructure that were created or utilized during the entire lifecycle of a dataset (). Therefore, the focus of this article is on dataset quality – not just data quality.

To be effectively shared and utilized, quality information needs to be consistently curated, preferably traceable, and appropriately documented (). The granularity of this quality documentation may vary – it can be very fine (e.g., per observation in the case of volunteered observations) – but the critical common resolution required to support FAIR data publishing is the individual dataset level. Quality assessment results also need to be represented consistently, updated regularly, and integrable across systems, services, and tools to enable improved data sharing (; ; ).

While the need to assess the quality of data and related information for a particular dataset is well recognized, a framework for evaluating and presenting such quality information to data users (e.g., ) has not been sufficiently developed or addressed for disciplinary or interdisciplinary use. In response, a group of global Earth science and interdisciplinary domain experts convened an international workshop, held virtually on 13 July 2020, to examine the needs and challenges of preparing and documenting dataset quality information consistently across the complete dataset lifecycle. A number of challenges were identified in Peng et al. (), and three are highlighted below.

First, the selection of relevant quality attribute(s) (e.g., accuracy, completeness, relevancy, timeliness, etc.) is largely dependent upon context and can yield multiple quality categories and practical dimensions (; ; ; ). This multi-dimensionality makes the assessment of dataset quality a complex endeavor. For example, the quality attribute of completeness can refer to the completeness of data values in both spatial and temporal spaces, or the completeness of metadata elements or content. The multi-dimensionality of dataset quality has been discussed in detail by Peng et al. ().

An example of grouping dataset quality into four aspects (i.e., science, product, stewardship, and service) through the entire dataset lifecycle is shown in Figure 1. For each aspect, three important stages are listed along with selected quality attributes; neither list is exhaustive. These dataset lifecycle stages do not cover all activities, do not necessarily happen sequentially, and may occur in more than one quality aspect. For example, the ‘Evaluate’ part of the lifecycle in the ‘Product’ quadrant may overlap with the ‘Science’ quadrant by influencing its ‘Validate’ part. Generally speaking, however, activities in the dataset lifecycle identified in the ‘Science’ quadrant occur before those in the ‘Product’ quadrant, as indicated by the direction of the arrows in Figure 1. Note that the term ‘Develop’ used in the ‘Science’ quadrant also includes data observation/acquisition. The feedback and improvement cycle can occur in any one of the stages.

Figure 1 

Brief description of four quality aspects (i.e., science, product, stewardship and service) throughout a dataset lifecycle, three key stages and a few quality attributes associated with each quality aspect (e.g., define, develop, and validate stages for the science quality aspect). The quality aspects and associated stages are based on Ramapriyan et al. () with the following changes, based on feedback from the ESIP community and the International FAIR Dataset Quality Information (DQI) Community Guidelines Working Group: i) ‘Assess’ replaced by ‘Evaluate’ in the Product aspect; ii) ‘Deliver’ replaced by ‘Release’ in the Product aspect; and iii) ‘Maintain’ replaced by ‘Document’ in the Stewardship aspect. Additionally, completeness of metadata is moved from the Product to Stewardship aspect. Creator: Ge Peng; Contributors to conceptualization: Lesley Wyborn and Robert R. Downs.

Second, quality attributes are often not defined, measured, or captured consistently, even within one discipline. Moroni and colleagues recently observed such complexity as it pertains to the uncertainty of Earth science data (). Consistency in defining quality attributes and converging to standardized assessment models may be optimal for sharing, but more progress needs to be made, and whether such consistency is achievable remains to be seen. A step towards cross-domain interoperability, however, may be achieved by thorough documentation of domain-specific quality assessment techniques and metrics and the full provenance of the quality assessment. This allows transformations to be applied to dataset quality scores when this is possible and appropriate, e.g., computation of an exceedance value or quantile from a mean and standard deviation ().
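
For instance, if a documented quality score is reported as a mean and standard deviation and the recorded provenance confirms that a normal distribution is a reasonable description, an exceedance probability or quantile can be derived directly. The following minimal sketch (the variable values are purely illustrative) uses only the Python standard library:

```python
from statistics import NormalDist

# Reported summary statistics for a hypothetical quality score,
# e.g., a bias estimate documented in a dataset quality report.
mean_score = 0.42   # reported mean (illustrative value)
std_dev = 0.08      # reported standard deviation (illustrative value)

score = NormalDist(mu=mean_score, sigma=std_dev)

# Exceedance probability: chance that the score exceeds a threshold,
# valid only under the documented assumption of normality.
threshold = 0.5
exceedance = 1.0 - score.cdf(threshold)

# Quantile: value below which 95% of scores are expected to fall.
q95 = score.inv_cdf(0.95)

print(f"P(score > {threshold}) = {exceedance:.3f}")
print(f"95th percentile       = {q95:.3f}")
```

Such a transformation is defensible only when the documented assessment provenance supports the distributional assumption; otherwise the original assessment results should be conveyed unchanged.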

The third challenge is associated with the paradigm shift in the designated community of scientific data: from domain-literate users familiar with the scientific context and intended use of data products, to potential users representing diverse fields of inquiry (), with increasing demand for machine interoperability. The existence of a wide range of stakeholders and data users, including those with little or no science background, should therefore be considered so that research data and related information can be analyzed, interpreted, understood, and in some cases acted upon ().

Any effort to maximize the sharing of quality information requires collaboration among members of the entire community across science, data management, and technology domains. Recognizing that, 32 workshop participants – all international domain experts – issued an open ‘call-to-action for global access to and harmonization of quality information of individual Earth science datasets’ (). In response to that action call and further motivated by the needs of and interest from the global Earth science community, the International FAIR Dataset Quality Information (FAIR-DQI) Community Guidelines Working Group was formed.

Working group members comprise international domain experts, such as data producers and contributors, data managers and curators from scientific institutes and data centers, and data consumers and publishers. Given their common interest in dataset quality information, this group of people can be regarded as a ‘Community of Practice (CoP)’ (). Together, the members of this group possess valuable first-hand knowledge and expertise in dealing with the challenges of developing, managing, disseminating, and using a variety of Earth science data products and services, such as data products obtained from surface, airborne, and satellite observations as well as output from numerical models.

Since September 2020, the members of this working group have been working collaboratively to develop practical guidelines for data managers and repositories to follow when preparing, representing, and reporting on the quality of individual datasets. These guidelines build on the success of the FAIR Guiding Principles for data sharing () and on the extensive expert knowledge and practical experiences of working group members, while leveraging community practices. This article describes the development principles and processes, captures the outcomes of this international community effort, and presents a path forward toward enhancing the coverage of disciplines beyond Earth sciences.

This article is organized as follows. A background has been provided in this section. The principles, scope, goals, and intended audience for the development of the guidelines are provided in Section 2, while the development process is described in Section 3. The guidelines developed are presented in Section 4, with a workflow for initiating and carrying out quality assessment, as well as a description of crosswalks to elements of the FAIR Principles. Potential impact of the guidelines, benefits of CoP, and path forward are discussed in Section 5, with a conclusion in Section 6.

2. Development principles, scope, goals, and intended audience

2a. Development Principles

The following principles, informed by feedback from the Earth science community, guided the development of the guidelines:

  1. A holistic dataset life-cycle approach should be adopted for developing guidelines.
  2. Guidelines should be produced in an iterative manner with continuous community engagement for feedback.
  3. Guidelines should be independent of specific quality attributes, assessment types, and context of applications.
  4. Any methodology that is utilized to evaluate certain dataset quality attribute(s) should be findable and accessible, and preferably be interoperable and reusable for both human users and machine users.
  5. The assessment results should be openly available and findable, accessible, interoperable, and reusable to both human users and machine users.
  6. Transparent and quantifiable quality assessments should be a part of a dataset quality management framework.
  7. Guidelines should be regularly updated and version controlled.

2b. Scope

Given the complexity of dataset quality attributes and the different contexts of their fitness for use, the guidelines focus on capturing and representing dataset quality information consistently, adapting the FAIR Guiding Principles. Such guidance fosters data use by providing users with consistent, timely, and accessible information that enables them to make educated data (re)use decisions for their unique application requirements. The guidelines do not prescribe what quality attributes, aspects, or dimensions to assess; what assessment models to use; or how to assess dataset quality. However, a basic workflow has been developed, and practical examples are provided as references to help organizations and data stewards get started.

A dataset lifecycle in the context of this article starts at the planning and designing stage of developing a data product (Figure 1). It does not touch on sensor algorithms or model development and deployment. Nevertheless, it is also important to capture and describe quality information such as algorithm or model parameters (e.g., accuracy, precision, uncertainty) during those development and deployment stages, because the quality information from these stages is critical for identifying error sources; estimating data product uncertainty (); and examining error propagation to downstream applications (e.g., ).

2c. Goals

This international community effort has been undertaken to develop guidelines for the Earth science community, in collaboration with international domain experts on data and information quality. The primary objective of the guidelines has been to offer the Earth science community actionable recommendations that can be adopted by a variety of stakeholders to consistently capture, represent, and integrate dataset quality information. Treating dataset quality information as a digital object, consistent with the FAIR Guiding Principles, improves its potential for sharing and reuse. Care was taken so that the guidelines would be general enough to be readily adopted or adapted by other research communities. The ultimate goal is to foster global access to and harmonization of quality information of datasets as a critical step towards facilitating open-source science in both machine- and human-friendly environments, as called for by Peng et al. ().

2d. Intended Audience

All data stakeholders may benefit from the community guidelines:

  • Data producers will find these useful to ensure at the point of acquisition that critical attributes are captured. Such attributes will later be used to ascertain the quality of the data they are capturing (e.g., uncertainty of location/measurements, instrument parameters, metadata attributes on the instrument used to acquire the data).
  • Data publishers and data curators may find the community guidelines valuable for improving the quality information associated with the data that they publish and manage.
  • Sponsors and funders may find the guidelines helpful when reviewing data management plans in proposals for the support of projects and programs that will be creating, curating, disseminating, and supporting the use of data. They will also find them useful during the project closure phase when assessing the quality of the data products generated against the initial project goals and data management plans.
  • Data users may find that the guidelines improve their understanding of quality issues when determining whether a particular data product or service is appropriate for their intended use and what the limitations may be for using the data. This could support the application of ‘confidence levels’ to certain information derived from the data.

3. Development process: timelines and workflow

This section provides a detailed description of the process of developing a framework through an international collaboration with the expectation that it will be useful for other groups or communities that may be considering similar endeavors.

The idea of potentially developing a framework for consistently capturing quality information for enabling the use of Earth science datasets was initiated in September 2019 (Figure 2). Follow-on discussions on community needs and the prospect of developing community guidelines for documenting and reporting dataset quality information as described in Peng et al. (), were carried out among several groups across the globe. These groups include the Earth Science Information Partners (ESIP) Information Quality Cluster (IQC), the Barcelona Supercomputing Center (BSC) Evaluation and Quality Control (EQC) team, and the Australia/New Zealand Data Quality Interest Group (AU/NZ DQIG).

Figure 2 

Schematic diagram of timelines of the initiation, planning, development, community review, and first baseline of the guidelines document. The guidelines document will be updated in the future to improve its coverage in diverse disciplines. ESIP IQC: Information Quality Cluster of the Earth Science Information Partners. BSC EQC: Barcelona Supercomputing Center (BSC) Evaluation and Quality Control (EQC) team.

ESIP, which was founded in 1998, is primarily supported by United States Earth science governmental agencies, including the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), and the United States Geological Survey (USGS). ESIP members include over 150 national and international partner organizations. The ESIP IQC fosters cross-disciplinary collaborations to evaluate various facets of Earth science data and information quality and produces recommended practices for the community. The BSC EQC team supports the EQC function of the Copernicus Climate Change Service (C3S) Climate Data Store, one of six services of the European Union’s Earth observation programme. The AU/NZ DQIG is a forum for AU/NZ data providers, repository operators, and data consumers, facilitated by the Australian Research Data Commons (ARDC). It was founded in late 2019 by ARDC, Curtin University, and the Australian National University (ANU).

Support from the ESIP leadership was committed in early 2020 to sponsor a whole-day, in-person, international workshop prior to the ESIP 2020 summer meeting (SM20) with an additional report-out session during the SM20. The goal of the pre-ESIP workshop was to convene international domain experts to kick off the development of the guidelines by exploring the needs, challenges and current state of documenting and reporting dataset quality information. Invitations for participation were sent to prospective collaborators.

In the wake of the COVID-19 pandemic, the in-person workshop was changed to a virtual event, allowing it to be extended to a wider audience. A case statement was drafted and published to help set the stage and communicate the effort (). The workshop website was established to host the workshop materials and additional resources (Figure 3).

Figure 3 

Flowchart outlining different phases of the guidelines development process, including the initiation, planning, development, community review and engagement, and baseline of the guidelines.

About 80 ESIP and invited international domain experts, affiliated with over 40 private, academic, and governmental institutions from nine countries across North America, Oceania, and Europe, registered for the workshop (). Two live 90-minute virtual workshop sessions were held on July 13, 2020, to accommodate attendees from different time zones. More than 45 workshop registrants attended the first live session, while approximately 25 attended the second. About 45 ESIP SM20 registrants attended the subsequent report-out session. Prior to this workshop, a mini-workshop had been held by the AU/NZ DQIG on July 6, 2020, for which 57 people registered and 27 participated actively.

Eleven invited speakers presented during the two live virtual workshop sessions, and an additional three presented at the 90-minute report-out session during SM20. Invited speakers represented diverse international organizations, including major international space agencies and satellite programs, such as the NOAA Joint Polar Satellite System (JPSS) program (), the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) (), and the European Space Agency (ESA) (). Presentations described data stewardship activities at global organizations, such as the Group on Earth Observations () and the World Meteorological Organization (), as well as major national Earth science data and service centers, including those for NASA (), NOAA (), USGS (), and the Copernicus Marine Environment Monitoring Service (CMEMS) (). (See Table 1 of , for the full list of presentation titles, affiliated organizations, and citations).

The speakers shared their knowledge to help participants ascertain the complexity and multi-dimensionality of curating dataset quality information. This knowledge exchange allowed participants to understand why Earth science organizations need to prepare and describe data quality information throughout the entire dataset lifecycle – covering stages from data product design and production, through data and metadata curation for preservation and access, to data use by servicing data to consumers. It also helped attendees appreciate the challenges those organizations face and learn about the different approaches taken. These informative presentations provided perspective for productive discussions among participants during the live sessions. Notes were recorded online in a collaborative Google Doc and offline discussions continued following the workshop during the two weeks of the virtual SM20. For many of the over twenty non-US pre-ESIP workshop attendees, this was their first time engaging with the ESIP community ().

The strong need for practical guidelines was recognized as an opportunity to provide the community with guidance to improve data sharing by consistently preparing and representing information about the quality of datasets. The absence and limitations of currently available guidance were also recognized (). Participants of both the pre-ESIP workshop and the subsequent SM20 session stressed the need for such guidelines to be created by the community and for the community through an iterative process with community feedback ().

Several community calls for voluntary participation in an international working group were announced during the pre-ESIP workshop and the subsequent SM20 session, along with messages to relevant Earth science email lists, including the ESIP community list. Since September 2020, over twenty international domain experts have joined the working group, which began developing the guidelines by first consolidating community recommendations (Figures 2 and 3). A white paper on the guidelines was published for community review in April 2021 (). Extensive outreach was conducted by working group members to share the initial draft of the guidelines document with the Earth science and geospatial data community (e.g., ; ; ; , , , , , , ; ). The guidelines document, partially reproduced below, has since been revised to release the first baseline version, which reflects community comments and suggestions ().

4. FAIR Dataset Quality Information Guidelines

In this section, we first define a basic workflow with relevant elements to consider when setting out to assess dataset quality and curate quality information. The set of guidelines developed by the International FAIR-DQI Community Guidelines Working Group is then presented, followed by crosswalks from the guidelines to the FAIR Guiding Principles.

4a. Basic Workflow for Curating Dataset Quality Information

While assessing dataset quality is multi-dimensional (), there are common aspects. Knowledge about these common aspects may help to set the direction for the right approach in each specific case of assessing quality and reporting assessment results.

To help organizations and data stewards address the challenge of where to start when curating and reporting dataset quality information, we have developed a typical workflow (Figure 4). This approach is inspired by the quality evaluation procedures defined in ISO 19157 () and Six Sigma (e.g., ), and follows the steps outlined below to define, measure, analyze, and improve, as presented in Lee et al. () for organizing data quality management.

Figure 4 

A schematic diagram of a basic workflow with relevant elements for curating and disseminating dataset quality information. Creator: Carlo Lacagnina. Contributor: Ge Peng.

The workflow highlights some of the basic ingredients and elements to be considered at each step when curating dataset quality information. We add the dissemination, a.k.a. ‘reporting’ in ISO 19157 (), of dataset quality information, which is becoming an increasingly important task for building trust between data providers and end-users and for improving data usability.

As shown in Figure 4, the following two steps are needed prior to carrying out any assessment activity.

  • Step 1: Quality specification – Curating dataset quality information should start with defining the quality attribute(s), aspect, or dimension that will be assessed, determining the level of granularity (variable, ensemble member, model or algorithm), and identifying which data and quality attributes should be prioritized. This step will need some profiling, that is, an initial analysis of the available data to understand the challenges and the most critical issues, set priorities, and determine the appropriate strategy to deploy (e.g., ; ).
  • Step 2: Evaluation specification – The next step involves identifying or developing an approach (or method) to evaluate the identified quality attribute(s) or assess its maturity. Example approaches could include a statistical analysis approach () or a scientific maturity matrix (). In this step, the framework for the evaluation is defined. It is important to describe the identified quality attribute or dimension, the evaluation method used, and the protocols, standards and workflows applied (e.g., ; ; ; ; ; ). A well-documented quality evaluation helps to increase transparency, verifiability, reproducibility, and resilience of the quality evaluation process.

The next two steps are important to capture and convey the resultant quality information.

  • Step 3: Evaluation execution – During this stage, the actual assessments are performed based on the tools, approaches, and priorities defined in the previous steps. While doing this, the assessments should be captured in structured, human- and machine-readable, and standard-based formats (e.g., ; ); a minimal sketch of such a record is given after Step 5 below.
  • Step 4: Quality dissemination – The results of the assessments represent the core of the dataset quality information and need to be disseminated with the data for the benefit of end-users. Feedback from users on data quality is beneficial to data producers to initiate data improvement processes. For reproducibility purposes, it is recommended that the operations performed to produce the quality information also be published (e.g., ). In this step, the mechanism for quality information dissemination (e.g., metadata, web page, API) is implemented and put into practice.

Finally, feedback from users on dataset quality information should be sought and evaluated to improve the quality information provided along with how the information is disseminated.

  • Step 5: Monitoring and improvement – The feedback collected in the previous step and the experience gained during the assessments are used to improve the protocols, tools, and approaches and to redefine priorities in the assessment process (e.g., ; Wu & Gourcuff, 2021). This step continues throughout the assessment and dissemination steps, as it helps to improve the curation of quality information.
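
As referenced in Step 3, the minimal sketch below serializes a single assessment result into a structured, human- and machine-readable record. The field names, identifiers, and values are illustrative assumptions rather than a community schema; in practice the record should follow the standards referenced in Guideline 3.

```python
import json
from datetime import date

# Hypothetical assessment record for a single quality attribute of a dataset.
assessment = {
    "dataset_id": "doi:10.1234/example-dataset",        # illustrative PID
    "quality_attribute": "completeness",                 # attribute defined in Step 1
    "assessment_model": {
        "name": "Example Completeness Metric",           # model chosen in Step 2
        "version": "1.2.0",
        "reference": "doi:10.5281/zenodo.0000000",        # illustrative model PID
    },
    "result": {"score": 0.97, "scale": "fraction of non-missing values"},
    "assessed_on": date.today().isoformat(),
    "assessor": "Example Data Stewardship Team",
}

# Structured, human- and machine-readable output (Step 3),
# ready to be disseminated alongside the dataset (Step 4).
print(json.dumps(assessment, indent=2))
```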

4b. Guidelines for Enabling FAIR Dataset Quality Information

The following five guidelines were developed by the International FAIR-DQI Community Guidelines Working Group to enable curated dataset quality information to be FAIR (i.e., findable, accessible, interoperable, and reusable) for both human users and machines. A description of crosswalks to relevant elements of the FAIR Principles, which are denoted as F1-F4 for Findable, A1-A2 for Accessible, I1-I3 for Interoperable, and R1 for Reusable, is provided (see for the definitions of the FAIR Principles).

The current state of dataset compliance with these guidelines varies. Most, if not all, datasets do not yet fully satisfy these guidelines. While it is difficult to find examples of datasets that comply with all the guidelines, it is still useful to provide examples that illustrate how each individual guideline is being met. This is the approach followed below. Additional examples can be found in Peng et al. ().

Guideline 1: Describe the dataset (title, persistent identifier [PID] with a comprehensive landing page, e.g., digital object identifier [DOI] or product uniform resource identifier [URI], version, data producer, publication/update date, publisher, date accessed, usage license, e.g., CC-BY 4.0 or CC0).

This guideline aims to ensure that the underlying dataset is findable, comprehensively described, and potentially reusable by cross-walking to all the F1-F4 principles of Findability, and the R1 (rich metadata with a plurality of relevant attributes) and R1.1 principles (data usage license) of Reusability, either directly or indirectly, denoted by solid and dashed lines in Figure 5, respectively.

Figure 5 

Diagram mapping the guidelines to the FAIR Guiding Principles as defined in Wilkinson et al. (). Solid lines represent direct mappings while dashed lines represent indirect or weak mappings that are either inferred or may not always hold. {F, A, I, R}n denotes the nth element of the findable, accessible, interoperable, and reusable principles, respectively. Based on Table 1 in Peng et al. (), with additional weak mappings represented by the dashed lines. Creator: Ge Peng. Contributor: Anette Ganske.

Specifically, having a dataset PID leads to satisfying F1 (data are assigned a unique and persistent identifier). The nature of the PID and the required landing page ensure that the (meta)data are indexed and resolvable (F4). To have a comprehensive landing page for a dataset, both data and metadata need to be described with numerous pertinent attributes, which leads to satisfying the F2 (data are described with rich metadata) and R1 principles, respectively. Including a usage license supports the R1.1 principle.

The current common practice is to include the data PID in the metadata (F3) as part of the process of assigning and minting that PID. If the data PID is minted by a service provider such as DataCite, the metadata should continue to be accessible even beyond the availability of the data (A2). However, since this is largely up to the practices implemented by individual organizations, it yields only an indirect crosswalk from Guideline 1 to these two FAIR principles (F3, A2).

There are many examples of published datasets that meet this guideline by following community data citation standards. Two of them are shown below:

Neumann, D, Matthias, V, Bieser, J and Aulinger, A (2017). Concentrations of gaseous pollutants and particulate compounds over northwestern Europe and nitrogen deposition into the north and Baltic Sea in 2008. World Data Center for Climate (WDCC) at DKRZ. License: CC BY 4.0. Created: 2017–06–08. https://doi.org/10.1594/WDCC/CMAQ_CCLM_HZG_2008.

Maggi, F, Tang, F H M, la Cecilia, D and McBratney, A (2020). Global Pesticide Grids (PEST-CHEMGRIDS), Version 1.01. Created: September 2020. License: CC-BY 4.0 International. Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/weq9-pv30
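
As one possible machine-actionable rendering of Guideline 1, the sketch below expresses the second example above as schema.org Dataset metadata in JSON-LD; the mapping of fields is our illustration and is not taken from the data center's published record.

```python
import json

# Illustrative schema.org "Dataset" description covering the elements listed
# in Guideline 1 (title, PID, version, producer, dates, publisher, license).
landing_page_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Global Pesticide Grids (PEST-CHEMGRIDS), Version 1.01",
    "identifier": "https://doi.org/10.7927/weq9-pv30",
    "version": "1.01",
    "creator": [
        {"@type": "Person", "name": "F. Maggi"},
        {"@type": "Person", "name": "F. H. M. Tang"},
        {"@type": "Person", "name": "D. la Cecilia"},
        {"@type": "Person", "name": "A. McBratney"},
    ],
    "publisher": {
        "@type": "Organization",
        "name": "NASA Socioeconomic Data and Applications Center (SEDAC)",
    },
    "datePublished": "2020-09",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(landing_page_metadata, indent=2))
```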

Guideline 2: Utilize a one- (or more) dimensional, structured quality assessment metric that is:

  • 2.1. versioned and publicly available with a globally unique, persistent and resolvable identifier (PID) such as digital object identifier (DOI) and universally unique identifier (UUID);
  • 2.2. registered or indexed in a searchable resource that supports authentication and authorization, such as Figshare, Zenodo, GitHub, and Dryad; and
  • 2.3. retrievable by their identifier using an open, free, standardized and universally implementable communications protocol such as Hypertext Transfer Protocol Secure (HTTPS) or Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH).

This guideline aims to ensure that the assessment model is searchable and retrievable (Figure 5). Requirement 2.1 leads to satisfying F1 (assignment of a PID), while Requirements 2.2 and 2.3 ensure that F4 and A1 (registered (meta)data and their retrievability) are satisfied, respectively. The authentication and authorization requirements in 2.2 meet A1.2, while the protocol requirements in 2.3 lead to satisfying A1.1. Versioning alone falls far short of the information required to assess the provenance of the model; however, it helps support provenance (R1.2). Therefore, an indirect crosswalk to R1.2 is indicated in Figure 5.
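
Requirement 2.3 can be exercised programmatically. The minimal sketch below resolves an assessment model's identifier through the standard DOI resolver over HTTPS; the DOI shown is a placeholder rather than a real assessment model record.

```python
import requests

# Placeholder DOI for an assessment model; substitute a real, minted DOI.
model_doi = "10.5281/zenodo.0000000"

# Resolve the identifier over HTTPS (an open, free, standardized protocol),
# following redirects from the DOI resolver to the hosting repository.
response = requests.get(f"https://doi.org/{model_doi}",
                        allow_redirects=True, timeout=30)
response.raise_for_status()

print("Resolved to:", response.url)
print("Retrieved", len(response.content), "bytes describing the assessment model")
```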

Examples of existing dataset quality assessment models and their compliance with Guideline 2 are provided in Table 1. Additional assessment model examples can be found in Peng et al. ().

Table 1

Examples of dataset quality assessment models and their compliance with Guideline 2.


ASSESSMENT MODEL | SCIENTIFIC DATA STEWARDSHIP MATURITY MATRIX () | STEWARDSHIP MATURITY MATRIX FOR CLIMATE DATA () | FAIR DATA MATURITY MODEL () | METADATA QUALITY FRAMEWORK () | DATA QUALITY ANALYSES AND QUALITY CONTROL FRAMEWORK ()
Quality Entity (i.e., attribute, aspect, or dimension) | Stewardship | Stewardship | FAIRness | Metadata | Data
2.1 – Publicly Available | Yes | Yes | Yes | Yes | Yes
2.1 – PID | DOI | DOI | DOI | DOI | DOI
2.2 – Indexed | Data Science Journal | Figshare | Zenodo | Data Science Journal | Integrated Marine Observing System Catalog
2.3 – Retrievable Using Free, Open, Standard-Based Protocol | Yes | Yes | Yes | Yes | Yes

If no suitable assessment model is available, one may need to develop a new one. In this case, the above requirements 2.1–2.3 should be satisfied to make the assessment model findable and accessible. Individual researchers can also use the Registry of Research Data Repositories (re3data) at https://doi.org/10.17616/R3D to search for appropriate repositories based on their own requirements. A CoreTrustSeal-certified repository demonstrates more mature organizational processes and capabilities in managing its holdings of digital objects ().

Minimally, a published paper (with a DOI) that describes a quality assessment model is necessary to provide access to the model. We highly recommend publishing the assessment model itself (with a DOI), for example, in one of the aforementioned repositories. A project website is currently a common place to publish such models, but it is often not sustainable or persistent owing to the limited lifespan of projects. For example, a broken link resulting from an organizational system migration will make the assessment model inaccessible.

Guideline 3: Capture the quality attribute(s)/aspect(s)/dimension(s), assessment method and results in a dataset-level metadata record using a consistent framework/schema that:

  • 3.1. is semantically and structurally consistent and follows community standards – preferably compliant with national or international metadata standards that satisfy the conditions of Guideline 2 (i.e., 2.1–2.3),
  • 3.2. includes a description of the quality attribute(s), aspect(s), or dimension(s) to be assessed,
  • 3.3. includes a description of the assessment method and assessment model structure and version, and access date if applicable,
  • 3.4. includes a description of the assessment results, and
  • 3.5. includes versioning and the history of the assessments.

This guideline aims to ensure that the quality information is captured or referenced in the dataset metadata and that it is findable, accessible, interoperable, and reusable by machine end-users (Figure 5).

Utilizing a metadata framework/schema that satisfies the conditions 2.1–2.3 of Guideline 2 ensures that it is findable and accessible.

The requirements of capturing the quality entity (i.e., attribute, aspect, or dimension), assessment method, and results, together with requirement 3.1, help ensure that the dataset-level metadata is richly described (R1), follows metadata standards (R1.3), and is machine interoperable (I1). Capturing the assessment method is often accomplished by referencing it in the metadata record, which satisfies I3, as is capturing assessment results in the form of a published report.

Specifically, including a description of the information related to assessments, that is, the quality entity, method, and results as required in 3.2–3.5, leads to rich metadata with a plurality of relevant attributes (R1). A semantically and structurally consistent metadata record that is compliant with standards (3.1) crosswalks to I1 and R1.3. It may also meet the requirements of I2 (FAIR-compliant vocabularies) in a best-case scenario but falls short in most cases, so only a weak mapping is denoted by the dashed line (Figure 5). The requirements in 3.5 support the provenance of the assessment results (R1.2).
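
To illustrate requirements 3.2–3.5, a dataset-level metadata record might carry a quality block such as the one sketched below. The element names and identifiers are invented for this illustration and do not constitute a registered schema; a real record should follow a community standard satisfying 3.1.

```python
import json

# Hypothetical quality block embedded in a dataset-level metadata record.
# Element names are illustrative only; a real record should follow a
# community standard that satisfies Guideline 3.1.
quality_metadata = {
    "qualityInformation": [
        {
            "qualityEntity": "stewardship maturity",                       # 3.2
            "assessmentModel": {                                            # 3.3
                "reference": "https://doi.org/10.5281/zenodo.0000000",      # placeholder PID
                "version": "2.0",
                "accessDate": "2021-04-15",
            },
            "assessmentResult": {                                           # 3.4
                "summary": "4 of 5 on the access dimension",
                "report": "https://doi.org/10.0000/example-report",         # placeholder PID
            },
            "assessmentHistory": [                                          # 3.5
                {"version": "1.0", "date": "2020-10-01"},
                {"version": "2.0", "date": "2021-04-15"},
            ],
        }
    ]
}

print(json.dumps(quality_metadata, indent=2))
```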

Examples of existing approaches to representing quality entities, assessment models, and assessment results in machine-readable quality metadata, and their compliance with Guideline 3, are provided in Table 2. Additional examples can be found in Peng et al. ().

Table 2

Examples of representing quality entities, assessment models and assessment results in machine-readable quality metadata and their compliance with Guideline 3.


QUALITY METADATA FRAMEWORK | NOAA ONESTOP DSMM QUALITY METADATA () | ATMODAT MATURITY INDICATOR () | METADATAFROMGEODATA ()
Quality Entity | Stewardship | Any Quality Entity | Data and Metadata
3.1 – Semantically and Structurally Consistent | Yes | Yes | Yes
3.1 – Metadata Framework/Schema | International | Domain | Domain
3.2 – Quality Entity Description | Yes | Yes | Yes
3.3 – Assessment Method/Structure Description | Yes | Yes | Partly (contains evaluation of quality description and not description of quality assessment)
3.4 – Assessment Results Description | Yes | Yes | Yes
3.5 – Versioning and the History of the Assessments | Yes | Versioning | Creation & Last Update Dates

Adopting or adapting (including information about the adaptation) existing quality metadata frameworks also is recommended. If that is not possible, a new quality metadata framework or schema may be developed. In this case, the framework should have the capability to allow for requirements in 3.1–3.5 to be satisfied.

Using a consistent metadata tag and including it in a schema is recommended, if applicable. For example, Peng et al. () uses MM-Stew as a metadata tag to denote stewardship maturity assessment. Once the new schema is stable, registering it with schema.org or other relevant metadata schema host entities, such as DataCite, is recommended.

Guideline 4: Describe comprehensively the assessment method, workflow, and results in at least a human-readable quality report that:

  • 4.1. preferably follows a template that is published and satisfies the conditions of Guideline 2 (i.e., 2.1–2.3),
  • 4.2. is published with an explicit open license and history of the report, satisfying the conditions of Guideline 2, and
  • 4.3. links the report PID to the dataset-level metadata record.

This guideline aims, at a minimum, to ensure the quality information is findable, accessible, citable, reusable, and understandable to human end-users (Figure 5). However, we strongly encourage quality reports to also be machine readable.

Comprehensively describing the relevant information yields human-readable metadata with multiple attributes (R1: richly described metadata). Publishing the assessment report following criteria 2.1–2.3 with an explicit open license (4.2) leads to F1 (PID), F4 ((meta)data registered in a searchable resource), A1 ((meta)data retrievable via a standardized protocol), and R1.1 (clear data usage license). The inclusion of the report history (4.2) supports R1.2. Linking the report PID to the dataset-level metadata record (4.3) satisfies the F3 (PID in metadata) and I3 (references to other metadata) principles.
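
One way to satisfy requirement 4.3 is to register the quality report's PID as a related identifier in the dataset's citation metadata. The sketch below uses DataCite-style related-identifier fields; the identifiers themselves are placeholders.

```python
import json

# Linking a human-readable quality report to its dataset (Guideline 4.3)
# using DataCite-style related-identifier metadata. DOIs are placeholders.
dataset_metadata_fragment = {
    "doi": "10.0000/example-dataset",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.0000/example-quality-report",
            "relatedIdentifierType": "DOI",
            "relationType": "IsDocumentedBy",   # the dataset is documented by the report
        }
    ],
}

print(json.dumps(dataset_metadata_fragment, indent=2))
```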

Examples of existing approaches to representing quality entities, assessment models, and assessment results in human-readable quality reports, and their compliance with Guideline 4, are provided in Table 3. Additional examples can be found in Peng et al. ().

Table 3

Examples of human-readable dataset quality assessment reports and their compliance with Guideline 4.


QUALITY REPORT | LEMIEUX ET AL. () | HÖCK ET AL. () | COWLEY ()
Quality Entity | Stewardship | Data | Data
4.1 – Follow Template | Yes | Yes | Yes
4 – Quality Entity Description | Yes | Yes | Yes
4 – Assessment Method Description | Yes | Yes | Yes
4 – Assessment Results Description | Yes | Yes | Yes
4.2 – License | Yes | Yes | Yes
4.2 – Assessment History | Yes | Yes | Yes
4.3 – Linked Report PID | Yes | No | Yes

Guideline 5: Report/disseminate the dataset quality information in an organized way via a web interface with a comprehensive description of:

  • 5.1. the dataset according to the Guideline 1,
  • 5.2. assessed quality attribute(s)/aspect(s)/dimension(s),
  • 5.3. the evaluation method and process including the review process, if applicable, and
  • 5.4. how to understand and use the information.

This guideline aims to ensure that the quality information is online and comprehensively described, findable, and easily understood and trusted by providing the assessment provenance (Figure 5).

A comprehensive description of the dataset (requirement 5.1), the assessed quality attribute/aspect (5.2), the evaluation method (5.3), and how to understand and use the quality information (5.4) leads to rich metadata with a plurality of relevant attributes (F2 and R1). The nature of reporting or disseminating and being online indicates it is retrievable via a standardized communication protocol (A1).

Examples of existing approaches to representing assessment results online, and their compliance with Guideline 5, are provided in Table 4. Additional examples can be found in Peng et al. ().

Table 4

Examples of disseminating assessment results online and their compliance with Guideline 5.


ONLINE PORTAL | JPSS DATA PRODUCT ALGORITHM MATURITY PORTAL | C3S CLIMATE DATA STORE DATASET QUALITY ASSESSMENT PORTAL | ROLLING DECK TO REPOSITORY (R2R) QA DASHBOARD
Quality Entity | Algorithm | Technical and Scientific Quality | Sensor
5 – Report Information in an Organized Way | Yes | Yes | Yes
5.1 – Dataset Description | Minimal | Yes | Minimal
5.2 – Assessed Quality Entity Description | Yes | Yes | Yes
5.3 – Evaluation Method and Review Process Description | Yes | Yes | Yes
5.4 – How to Understand and Use Description | Some | Some | Minimal

There is large diversity in current approaches to disseminating data and metadata quality information because those approaches depend on the knowledge base of the designated community for the data. Data users should provide feedback on which disseminated quality information is most relevant and how it can be improved. Therefore, user engagement activities are quite relevant at this stage, including prompt responses to questions and suggestions received from users.

Likewise, it is also recommended to convey dataset quality information in a manner that is easily understood and usable by data users and provide a mechanism for user feedback.

5. Discussion

This section provides a brief discussion of the potential impact of the guidelines provided above, benefits of CoP, and the path forward to increasing community awareness of the guidelines and promoting their adoption.

5a. Potential impact of the guidelines

Improving practices for documenting, sharing, and reusing information about the quality of datasets will help advance scientific progress and contribute to societal benefits through open-source science. When dataset quality information enables potential users to discover a dataset and determine whether it is appropriate for an intended use, FAIR data quality information also helps to achieve FAIR data (). Likewise, when information describing the quality of a dataset fosters its interoperability and reusability, the guidelines further help to make the data FAIR. Those elements of the guidelines which focus on documentation of quality assessment strategies have the additional potential to make FAIR not just the data, but also those evaluation processes. This articulation and communication of domain-specific models, protocols and assumptions can support robust interdisciplinary re-use of data.

In addition, adoption of the guidelines for dataset quality information by the Earth science community, as well as by other disciplinary communities, offers an opportunity to improve the trust that potential users have in the underlying datasets. From a user’s perspective, finding relevant, trusted data is critical to driving decisions. By improving practices for documenting, sharing, and reusing information about the quality of datasets, data providers and users will have increased confidence and improved consistency when disparate datasets are accessed, overlaid, and shared to drive impact-based decisions. These guidelines can assist in establishing trusted approaches that enable diverse in-situ observing platforms to be used with confidence when assessing, for example, water quality in estuaries, rivers, bays, and oceans, even when the sensors have been installed and funded by different state and federal agencies.

Furthermore, providing sufficient information, including quality information, for using datasets within data collections has the potential to improve trust in the data repositories that are responsible for curating and sharing data (). Clearly, community guidelines for dataset quality information would also benefit disciplines beyond the Earth sciences and efforts are underway to increase their discipline diversity.

5b. Benefits of a Community of Practice (CoP)

With a common interest in and passion for sharing quality information, the members of the International FAIR-DQI Community Guidelines Working Group have come together to form, in essence, a loosely organized CoP. The development of the guidelines benefited from the common advantages of a CoP, including knowledge sharing on needs, challenges, and practices in curating and representing quality information from diverse Earth science domains. There are also added benefits of participating in a CoP throughout the development process. Two are highlighted below.

One is that we are all learning together. Learning about other perspectives broadens the point of view formed by our own experiences. A large part of developing knowledge is developing consensus through learning from each other.

Another is that we bring what we have learned back to our jobs, organizations, and communities. Change is a long process of learning, accepting, and adapting – and the first and hardest part is culture change. The subtle changes we make through the knowledge we have gained can become the seeds of the much-needed culture change in our organizations and communities towards sharing quality information at large.

5c. Path forward

The guidelines should help organizations and data stewards get started on providing dataset quality information to data consumers – an important step to close the chasm between data producers and users. However, adoption often requires culture change, which demands continued engagement with the Earth science community (e.g., ).

The effective sharing and (re)use of dataset quality information needs cross-disciplinary integration. Efforts are underway to engage and collaborate with other communities and disciplines beyond Earth science, such as:

  • Open Geospatial Consortium (OGC; – OGC Data Quality Workshop – citizen science, Earth science, geospatial science, machine learning, social science, urban planning),
  • World Data System ( – SciDataCon session – astronomy, citizen science, Earth science, social science), and
  • Research Data Alliance (RDA) ( – RDA 18th Plenary session – astronomy, Earth science, genomics, social science). Activities are underway towards forming an RDA working group on making dataset quality information FAIR for the RDA community.

During our ongoing engagement, the community has pointed out that it would be beneficial to develop and provide use cases for data quality and for implementation of the guidelines. The OGC Data Quality Domain Working Group (OGC DQ DWG) is currently working towards the development of a catalog of data quality use cases, and we will be contributing to that effort.

6. Conclusion

The FAIR Guiding Principles described by Wilkinson et al. () provide a succinct and measurable set of concepts to be used as a guideline for improving the access to and reusability of data for human users and machines. Although the FAIR Principles have provided an effective way to enable data sharing, they do not explicitly describe how dataset quality information should be curated and shared.

Inspired by the FAIR Guiding Principles, a set of guidelines for curating and reporting dataset quality information were developed for both human users and machines, as a global community effort. The guidelines development effort was carried out by a Community of Practice through an iterative process guided by community feedback. The process of developing the guidelines has been described, which may be of use to inspire similar activities requiring large community consensus and uptake.

The guidelines aim to improve the availability and usability of quality information at the individual digital dataset level. Utilizing a structured quality assessment model helps to ensure the consistency of evaluation methods and results, which in turn will make it easier to capture them consistently. Capturing the assessment results in the dataset-level metadata using a consistent framework improves machine interoperability and supports integration across systems and tools. Disseminating the dataset quality information in a transparent and user-friendly way will help end users to understand and effectively use or integrate the information.

Community guidelines developed as a result of this effort bring the Earth science community one step closer to standardizing the curation and representation of dataset quality information. The guidelines described in this article offer opportunities to enable or improve the transparency and interoperability of dataset quality information. Adopting all or part of the guidelines can contribute to the ecosystem that supports open-source science. An excellent byproduct of streamlining the curation and representation of dataset quality information is the improved likelihood of automating the curation and reporting process, leading to international access to and usability of information about the quality of individual digital datasets ().

Utilizing the guidelines also helps improve the overall FAIRness of a dataset by providing community-standard-based rich metadata with a plurality of relevant quality attributes and qualified references. It helps establish the trustworthiness of the data and ultimately improves the maturity of a dataset in multiple quality dimensions or aspects, including product, stewardship, and services, by improving the completeness and usability of metadata and documentation.

The international FAIR-DQI community guidelines document () is a living document and is expected to evolve over time to accommodate user feedback and emerging community best practices. As indicated in Section 5c, use cases will be developed, in collaboration with OGC DQ DWG, to further improve the maturity and comprehensiveness of the guidelines and provide implementation examples for the global Earth science and geospatial community. Furthermore, in collaboration with the RDA community, an effort is underway to improve the discipline diversity of the guidelines.