Supporting_Scientific_Data

Glossary

3-2-1 Rule - A good rule of thumb to ensure that data is not lost. Stipulates that 3 copies of data should be maintained (1 working copy, 2 backups) and that at least one of the backups should be saved in a different location.
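
As an illustration only, the rule as worded above can be expressed as a small check. This is a minimal sketch: the DataCopy record, its field names, and the location labels are hypothetical, and "different location" is interpreted here as different from where the working copy is kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset (hypothetical record used only for this sketch)."""
    label: str
    location: str   # free-text site name, e.g. "lab-workstation" or "cloud"
    is_backup: bool


def satisfies_3_2_1(copies):
    """Return True if there are 3+ copies, 2+ backups, and at least one
    backup stored somewhere other than the working copy's location."""
    working = [c for c in copies if not c.is_backup]
    backups = [c for c in copies if c.is_backup]
    if len(copies) < 3 or len(backups) < 2 or not working:
        return False
    working_locations = {c.location for c in working}
    return any(b.location not in working_locations for b in backups)


copies = [
    DataCopy("working copy", "lab-workstation", is_backup=False),
    DataCopy("external drive", "lab-workstation", is_backup=True),
    DataCopy("cloud backup", "cloud", is_backup=True),
]
print(satisfies_3_2_1(copies))
```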

Archival Storage - Refers to methods and media used to store data that is not needed day-to-day. The goal of archival storage is long-term preservation.

Audit (Data Management) - An exercise designed to ensure that data management practices are being implemented properly. Is typically very involved and may not involve members of the research team.

Backup Storage - Refers to methods and media used to store copies of data that can be used to restore the original if (or when) data loss occurs. The goal of backup data storage is redundancy.

Bit Rot - Broadly refers to the gradual degradation of the integrity of stored data. This is largely caused by wear and tear on storage media.

Cloud Storage - A system in which digital data is stored remotely rather than on a local storage medium.

Data Availability Statement - A short section of a research paper that outlines if, when, and how any related data can be accessed.

Data Dictionary - A document that describes the contents and structure of a dataset.
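
One possible way to keep a data dictionary is as a small table with one row per variable. The sketch below is illustrative only: the variable names, the four descriptive fields, and both helper functions are assumptions rather than a required format.

```python
import csv
import io

# Hypothetical data dictionary: one entry per column in the dataset.
DATA_DICTIONARY = [
    {"variable": "participant_id", "type": "string", "units": "",
     "description": "Anonymized participant identifier"},
    {"variable": "age_years", "type": "integer", "units": "years",
     "description": "Age at enrollment"},
    {"variable": "body_temp", "type": "float", "units": "degrees Celsius",
     "description": "Oral temperature at first visit"},
]


def dictionary_to_csv(dictionary):
    """Serialize the data dictionary as CSV text so it can travel with the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["variable", "type", "units", "description"]
    )
    writer.writeheader()
    writer.writerows(dictionary)
    return buffer.getvalue()


def undocumented_columns(header, dictionary):
    """List dataset columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in dictionary}
    return [column for column in header if column not in documented]


print(undocumented_columns(["participant_id", "age_years", "notes"], DATA_DICTIONARY))
```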

Data Management - Strategies and practices related to the storage, organization, and description of research data. The focus of research data management is ensuring accessibility, reliability, and quality. As a shorthand, it can be useful to think about data management as processes related to ensuring that data is usable.

Data Management Plan - Documentation that describes data management and sharing-related practices and strategies. These include both data management and sharing plans written for grant and project proposals as well as more thorough documentation that describes standardized practices to be implemented during the day-to-day course of a research effort. This latter category may include standalone documents or be included in broader materials (e.g. lab manuals).

Data Management and Sharing Plan - A form of data management plan developed and submitted as part of a grant proposal. Typically data management and sharing plans are relatively short documents (two or so pages) with required elements, including a description of the research data that will be acquired or generated by the proposed work as well as how that data will be preserved and made available to others.

Data Provenance (also sometimes referred to as “data lineage”) - The documented trail that describes the origin of a piece of data, how it has been changed or transformed, where it has been moved to, and where it is presently.

Data Quality - A broad concept that refers to the degree to which a set of data is fit for its intended purpose. Highly related to data management, but also includes issues more related to methodological rigor.

Data Repository - A platform that facilitates the preservation, organization, and discovery of research data.

Data Storage - The recording of information into a medium of some kind.

Data Sharing - The release of data for use by others. This includes forms of sharing that are relatively restricted (who data is shared with, what users are able to do with shared data, etc.) and more open forms of sharing (e.g. sharing of data through a publicly accessible repository).

Data Sovereignty - A group’s (or individual’s) right to control and maintain their own data, including its collection, storage, and interpretation.

Data Standards - Agreed upon ways of organizing, structuring, and/or describing a particular form of research data.

Data Use Agreement (DUA) - A contractual agreement that establishes who is permitted to use a dataset, the permitted uses of the dataset, and the responsibilities of the users of the dataset. The most typical consideration of a DUA is the protection of protected health data, but such agreements can be used in a variety of situations where the exchange of data is necessary.

Directory (File Folder) - A structure that contains computer files and possibly other directories.

Documentation - Recorded information that is used to describe or explain something.

Experimental Protocols - A document, analogous to an SOP, that provides the precise steps needed to complete a research-related procedure.

FAIR Guiding Principles - A set of guiding principles initially developed to describe the desired characteristics of data-related infrastructure to facilitate the discovery and re-use of data assets by computational tools (i.e. machine readability). The term has now come to be used generally to refer to datasets that are not only available, but available in a usable form.

File Naming Convention - A consistent way of naming files in a way that provides information about the contents of the file and how it relates to other files.
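
A convention like this can be enforced with a small helper so that every file name is built the same way. The sketch below assumes one possible pattern (date, project, description, version); the pattern, the function name, and the example values are all hypothetical.

```python
import re
from datetime import date


def build_filename(project, description, when, version, extension):
    """Build a name of the form YYYY-MM-DD_project_description_vNN.ext
    (an assumed pattern, not a prescribed standard)."""
    def slug(text):
        # Lowercase and replace runs of non-alphanumeric characters with "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{when.isoformat()}_{slug(project)}_{slug(description)}"
        f"_v{version:02d}.{extension.lstrip('.')}"
    )


print(build_filename("Sleep Study", "Raw EEG", date(2024, 3, 5), 2, "csv"))
# → 2024-03-05_sleep-study_raw-eeg_v02.csv
```

Putting the date first in ISO 8601 form means that sorting file names alphabetically also sorts them chronologically, and the zero-padded version number keeps v02 ahead of v10.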

File Format - A standardized way in which information is encoded to be stored in a computer file. Think .XLSX, .CSV, etc.

File Fixity - The process of ensuring that a digital file in an archive has remained unchanged at the bit level.

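
Fixity is commonly checked by recording a checksum when a file is deposited and recomputing it later. The following is a minimal sketch using SHA-256 from the Python standard library; the function names and the chunk size are illustrative choices, and other hash algorithms could be used instead.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks so that
    large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file's current checksum matches the recorded one."""
    return sha256_of(path) == recorded_checksum
```

In practice, the checksum returned by sha256_of would be stored alongside the file when it is archived, and is_unchanged would be run on a schedule to detect silent changes.
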
Good Data Management Practice - A way of thinking about the support of research data in the context of twenty-first century science. Includes ten principles related to how data should be defined and how data management should be incorporated into research processes and workflows.

Good Documentation Practice - A set of guidelines drawn from the pharmaceutical and manufacturing industries outlining how to maintain effective documentation.

Lab Notebook - A formal record of the research process. In the context of data management, lab notebooks are a form of contemporaneous documentation.

License Stacking - A complex situation in which the reuse of a dataset (e.g. combining one dataset with another) can lead to more and more restrictions on future use.

Lossiness - The degree to which information is lost when information is encoded into a particular file type. “Lossy” compression involves the removal of information (and thus decreases the fidelity of the information) while “lossless” compression does not.
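
The distinction can be shown with a short sketch. Here zlib from the Python standard library stands in for lossless compression, and rounding numeric values stands in for a lossy encoding; the function names and sample values are hypothetical, and real lossy formats (e.g. JPEG, MP3) discard information in more sophisticated ways.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib; the result is identical to the input."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(values, decimals=1):
    """Store values at reduced precision; the discarded digits cannot be recovered."""
    return [round(value, decimals) for value in values]


sequence = b"ACGT" * 100
print(lossless_round_trip(sequence) == sequence)   # original is fully restored

readings = [36.6412, 37.0589]
print(lossy_round_trip(readings))                  # fidelity is reduced
```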

Metadata - Refers to information that facilitates the interpretation and/or use of research data. Can refer to formal metadata schemas (e.g. standardized ontologies) or to related documentation (data dictionaries, codebooks, protocols, etc).

Metadata Schema - A set of rules that are used to structure and describe metadata. A metadata schema defines metadata elements, what they mean, how they relate, and how they should be used.
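
To make the idea concrete, a schema can be thought of as a list of permitted elements plus rules for each one. The sketch below uses an invented four-element schema and a hand-written validator; it does not correspond to any published metadata standard.

```python
# Hypothetical schema: element name -> (expected type, required?)
SCHEMA = {
    "title": (str, True),
    "creator": (str, True),
    "date_collected": (str, True),
    "keywords": (list, False),
}


def validate_metadata(record, schema=SCHEMA):
    """Return a list of problems found when checking a record against the schema."""
    problems = []
    for element, (expected_type, required) in schema.items():
        if element not in record:
            if required:
                problems.append(f"missing required element: {element}")
            continue
        if not isinstance(record[element], expected_type):
            problems.append(f"{element} should be of type {expected_type.__name__}")
    for element in record:
        if element not in schema:
            problems.append(f"unknown element: {element}")
    return problems


record = {"title": "Sleep study EEG recordings", "creator": "Example Lab",
          "date_collected": "2024-03-05"}
print(validate_metadata(record))   # an empty list means no problems were found
```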

Monitoring (Data Management) - A routine and continuous process designed to check that data management practices are being implemented according to standard operating procedures. Is typically less involved than a full audit and involves members of the research team.

Obsolescence - Refers to an inability to access digital data because the needed hardware or software is no longer available.

Openness - The degree to which the way a file type encodes information is free from secrecy or restriction. Proprietary file formats typically can only be opened and used with specific software tools. In contrast, open (non-proprietary) file formats are unrestricted and free to use.

Open Science - An umbrella term for a variety of efforts aimed at making scientific research more transparent and accessible. In this guide, we are mostly focused on activities related to the outputs of the research process (datasets, code, etc), but the term also encompasses efforts to ensure that the scientific enterprise is inclusive and equitable.

Persistent Identifier - A digital identifier that permanently and unambiguously identifies a digital object or an individual.

Public Domain - Describes circumstances when works are not protected by copyright and can be freely used, shared, or adapted by anyone.

Quality Assurance/Quality Control (QA/QC) - Prospective efforts (quality assurance) and retrospective efforts (quality control) to assure data quality. The terms are often used interchangeably.

ReadMe - A simple text document that lays out how a user can find and use the file they are looking for.

Reproducibility - Broadly refers to efforts related to ensuring reliability, validity, and credibility of scientific research. Somewhat ironically, the meanings of key terms related to “research reproducibility” are not standardized across the scientific research enterprise. Throughout this guide, we generally use the term to refer to methods reproducibility (the provision of sufficient detail about study procedures so that they can be - theoretically or actually - exactly repeated).

Research Data - Broadly refers to the inputs or outputs required to evaluate, reproduce, or build upon the analyses or conclusions of a given research project. Throughout a research workflow, data may be categorized as “raw”, “intermediate”, or “final” products.

Research Workflow - The series of programmatic steps or practical ‘ways of doing things’ as data is collected, processed, and analyzed. Basically, where the research results came from.

Study Protocol - A formal document that describes every aspect of a research project, including motivations, SOPs, data management, and planned statistical analyses.

Usability - In the context of data management and sharing, this refers to an ability to open, understand, make use of, and build upon a set of data. In the context of biomedical science, “re-use” encompasses a large number of potential activities, including using a dataset for education and training (of both human researchers and algorithms), testing new hypotheses (which can involve combining multiple extant datasets), and more.

Version Control - Systems that are responsible for managing changes in software and code, documents, and other sets of information. Can refer to tools like Git but also to more manual forms of keeping track of versions, such as using dates or version numbers in file names.

Working Data Storage - Refers to the methods and media used to store data that is currently being actively worked on - data that is currently in the process of being transformed, analyzed, and evaluated. The goal of working data storage is immediate access and use.