Data Lifecycle - DataOne - Data - Research Guides at Towson University

Data Lifecycle - DataOne

The data life cycle has eight components:

Plan: description of the data that will be compiled, and how the data will be managed and made accessible throughout its lifetime
Collect: observations are made either by hand or with sensors or other instruments and the data are placed into digital form
Assure: the quality of the data is assured through checks and inspections
Describe: data are accurately and thoroughly described using the appropriate metadata standards
Preserve: data are submitted to an appropriate long-term archive (i.e. data center)
Discover: potentially useful data are located and obtained, along with the relevant information about the data (metadata)
Integrate: data from disparate sources are combined to form one homogeneous set of data that can be readily analyzed
Analyze: data are analyzed

The data life cycle from the perspective of a researcher. The ‘plan’ component describes the entire life cycle.

Primer on Data Management: What you always wanted to know
The goal of data management is to produce self-describing data sets. If you give your data to a scientist or colleague who has not been involved with your project, will they be able to make sense of it? Will they be able to use it effectively and properly? This primer describes a few fundamental data management practices that will enable you to develop a data management plan, as well as how to effectively create, organize, manage, describe, preserve and share data.

Plan for data management as your research proposal (for funding agency, dissertation committee, etc.) is being developed. Revisit your data management plan frequently during the project and make changes as necessary. Consider the following:

Collecting your data: Based on the hypotheses and sampling plan, what data will be generated? How will the samples be collected and analyzed? Provide descriptive documentation of collection rationale and methods, analysis methods, and any relevant contextual information. What instruments will be used? What cyberinfrastructure will be required to support your research?
Decide on a repository: Select a data repository (i.e. data center) that is most appropriate for the data you will generate and for the community that will make use of the data. Talk with colleagues and research sponsors about the best repository for your discipline and your type of data. Check with the repository about requirements for submission, including required data documentation, metadata standards, and any possible restrictions on use (e.g. intellectual property rights). Repository specifications will help guide decisions about the remainder of the practices below.
Organizing your data: Decide on how data will be organized within a file, what file formats will be used, and the types of data products you will generate. Does your community have standard formats (file formats, units, parameter names)? Consider whether a relational database or other data organization strategy might be most appropriate for your research.
Managing your data: Who is in charge of managing the data? How will version control be handled? How will data be backed up, and how often?
Describing your data: How will you produce a metadata record? Using what metadata standard? Using what tool? Will you create a record at the project's inception and update it as you progress with your research? Where will you deposit the metadata? Consider your community's standards when deciding on the metadata standard and data center.
Sharing your data: Develop a plan for sharing data with the project team, with other collaborators, and with the broader science community. Under what conditions will data be released to each of these groups? What are the target dates for release to these groups? How will the data be released?
Preserving your data: As files are created, implement a short-term data preservation plan that ensures that data can be recovered in the event of file loss (e.g., storing data routinely in different locations).
Consider your budget: What types of personnel will be required to carry out your data management plan? What types of hardware, software, or other computational resources will be needed? What other expenses might be necessary, such as data center donation or payment? The budget prepared for your research project should include estimated costs for data management.
Explore your institutional resources: Some institutions have data management plan templates, suggested institutional data centers, budget suggestions, and useful tools for planning your project.
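
The short-term preservation item above (storing data routinely in different locations) can be scripted so that every copy is checked against the original. The following is a minimal sketch, not a complete backup system: the function names `sha256_of` and `back_up` and the directory layout are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy `source` into each backup directory and verify each copy.

    Returns a dict mapping each backup path to its checksum.
    Raises IOError if a copy does not match the original.
    """
    original = sha256_of(source)
    verified = {}
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        # shutil.copy2 also preserves file timestamps where possible.
        copy_path = Path(shutil.copy2(source, directory))
        copy_sum = sha256_of(copy_path)
        if copy_sum != original:
            raise IOError(f"Checksum mismatch for {copy_path}")
        verified[str(copy_path)] = copy_sum
    return verified
```

In practice the backup directories would sit on different devices or services (for example an external drive and an institutional share), and the returned checksums could be kept alongside the data documentation.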

Collect data in a way that ensures they remain usable later. This requires careful consideration of methods and documentation before collection begins.

Consider creating a template for use during data collection. This will ensure that any relevant contextual data are collected, especially if there are multiple data collectors.
Describe the contents of your data files: Define each parameter, including its format, the units used, and codes for missing values. Provide examples of formats for common parameters. Data descriptions should accompany the data files as a “readme.txt” file, a metadata file using an accepted metadata standard, or both.
Use consistent data organization: We recommend that you organize the data within a file in one of the two ways described below. Whichever style you use, be sure to place each observation on a separate line (row).

Each row in a file represents a complete record and the columns represent all the parameters that make up the record (a spreadsheet format).
One column is used to define the parameter and another column is used for the value of the parameter (a database format). Other columns may be used for site, date, treatment, units of measure, etc.
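
The two layouts hold the same information, and one can be derived from the other. As a rough illustration, the sketch below converts a small spreadsheet-style table into the parameter/value layout using Python's standard `csv` module. The sample data, the column names (`site`, `date`, `temp_C`, `ph`), and the function name `wide_to_long` are hypothetical.

```python
import csv
import io

# Spreadsheet format: one complete record per row (hypothetical sample data).
WIDE = """site,date,temp_C,ph
A,2020-01-01,4.5,7.1
B,2020-01-01,5.0,6.9
"""


def wide_to_long(wide_text, id_columns):
    """Convert spreadsheet-style rows into parameter/value rows.

    Columns named in `id_columns` (e.g. site, date) are repeated on every
    output row; every other column becomes one parameter/value pair.
    """
    reader = csv.DictReader(io.StringIO(wide_text))
    long_rows = []
    for record in reader:
        for name, value in record.items():
            if name in id_columns:
                continue
            row = {key: record[key] for key in id_columns}
            row["parameter"] = name
            row["value"] = value
            long_rows.append(row)
    return long_rows


rows = wide_to_long(WIDE, ["site", "date"])
for row in rows:
    print(row)
```

Each of the two input records yields two output rows, one for `temp_C` and one for `ph`, so each observation still sits on its own line.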

Use the same format throughout the file; for instance, do not rearrange columns or rows within the file. At the top of the file, include one or more header rows that identify the parameter and the units for each column. “Atomize” data: make sure there is only one piece of data in each entry.
Use plain text ASCII characters for variable names, file names, and data: this will ensure that your data file is readable by the maximum number of software programs.
Use stable, non-proprietary software and hardware: File formats should ideally be nonproprietary (e.g. .txt or .csv files rather than .xls), so that they are stable and can be read well into the future. Consider the longevity of hardware when backing up data.
Assign descriptive file names: File names ideally describe the project, file contents, location, and date, and should be unique enough to stand alone as file descriptions. File names do not replace complete metadata records.
Keep your raw data raw: Preserve the raw data, with all of its imperfections. Use a scripted program to “clean” the data so that all steps are documented.
Create a parameter table: Describe the codes and abbreviations used for a parameter, the units, maximum and minimum values, the type of data (e.g. text, numerical), and a description.
Create a site table: Describe the sites where data were collected, including latitude, longitude, dates visited, and any contextual details (e.g. ecosystem type, land cover or use, weather conditions, etc.) that might affect the data collected.
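
To illustrate the advice to keep raw data raw, the sketch below cleans a copy of the data in code and records each change it makes, leaving the raw input untouched. The missing-value code `-9999`, the set of ad hoc markers, and the function name `clean` are assumptions for this example; use whatever codes your own data documentation defines.

```python
import csv
import io

MISSING_CODE = "-9999"                    # assumed code, documented in the readme
MISSING_MARKERS = {"", "NA", "n/a", "-"}  # assumed ad hoc markers in the raw file


def clean(raw_text):
    """Return (cleaned_csv_text, log) without modifying the raw input.

    Strips stray whitespace from every cell and replaces ad hoc missing
    markers in data rows with a single documented code. Each replacement
    is written to the log so the step can be reviewed later.
    """
    reader = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    log = []
    for line_no, row in enumerate(reader, start=1):
        cleaned = []
        for cell in row:
            value = cell.strip()
            # Line 1 is the header row, so only data rows are checked.
            if line_no > 1 and value in MISSING_MARKERS:
                log.append(f"line {line_no}: replaced {cell!r} with {MISSING_CODE}")
                value = MISSING_CODE
            cleaned.append(value)
        writer.writerow(cleaned)
    return out.getvalue(), log
```

A typical use would read the raw file, call `clean`, and write the result to a new file with a different name, so the raw file is never overwritten and the log documents every change.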

Perform basic quality assurance and quality control on your data during data collection, entry, and analysis. Describe any conditions during collection that might affect the quality of the data. Identify values that are estimated, double-check data that are entered by hand (preferably entered by more than one person), and use quality level flags to indicate potential problems. Check the format of the data to be sure it is consistent across the data set. Perform statistical and graphical summaries (e.g. max/min, average, range) to check for questionable or impossible values and to identify outliers.
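
The statistical summaries and quality flags described above can be produced with a few lines of code. This is a minimal sketch using only the Python standard library. The sample values, the plausible range of -2.0 to 40.0, and the flag labels `ok` and `suspect` are assumptions chosen for illustration, not a standard flagging scheme.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
    }


def flag_out_of_range(values, lower, upper):
    """Return one quality flag per value: 'ok' or 'suspect'."""
    return ["ok" if lower <= v <= upper else "suspect" for v in values]


# Hypothetical water temperatures in degrees Celsius; 48.0 is a likely entry error.
water_temp_c = [4.5, 5.0, 4.8, 48.0, 5.1]

summary = summarize(water_temp_c)
flags = flag_out_of_range(water_temp_c, lower=-2.0, upper=40.0)

print(summary["min"], summary["max"])  # 4.5 48.0
print(flags)  # ['ok', 'ok', 'ok', 'suspect', 'ok']
```

The maximum alone is enough to reveal the questionable value here. The flag list can be stored as an extra column so that the original value is kept rather than deleted.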

Communicate data quality using either coding within the data set that indicates quality, or in the metadata or data documentation. Identify missing values. Check data using similar data sets to identify potential problems. Additional problems with the data may also be identified during analysis and interpretation of the data prior to manuscript preparation.

Comprehensive data documentation (i.e. metadata) is the key to future understanding of data. Without a thorough description of the context of the data file, the context in which the data were collected, the measurements that were made, and the quality of the data, it is unlikely that the data can be easily discovered, understood, or effectively used. Consider the following when documenting your data:

Describe the digital context
- Name of the data set
- The name(s) of the data file(s) in the data set
- Date the data set was last modified
- Example records from each type of data file
- Pertinent companion files
- List of related or ancillary data sets
- Software (including version number) used to prepare/read the data set
- Data processing that was performed
Describe the personnel and stakeholders
- Who collected the data
- Who should be contacted with questions
- Sponsors
Describe the scientific context
- Scientific reason why the data were collected
- What data were collected
- What instruments (including model and serial number) were used
- Environmental conditions during collection
- Where collected and spatial resolution
- When collected and temporal resolution
- Standards or calibrations used
Information about parameters
- How each was measured or produced
- Units of measure
- Format used in the data set
- Precision, accuracy, and uncertainty
Information about data
- Taxonomic details
- Definitions of codes used
- Quality assurance activities
- Known problems that limit data use (e.g. uncertainty, sampling problems)
- How to cite the data set

Metadata should be generated in a metadata format commonly used by the most relevant science community. Use metadata editing tools to generate comprehensive descriptions of the data. Comprehensive metadata enables others to discover, understand, and use your data.
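
Before a formal metadata record is prepared, it can help to draft the documentation in a structured form and check it for gaps. The sketch below does this with a plain Python dictionary. The field names, the `REQUIRED_FIELDS` list, and all sample values are hypothetical and do not follow any particular metadata standard; your community's standard and your repository's requirements take precedence.

```python
import json

# Hypothetical minimum set of fields, loosely based on the checklist above.
REQUIRED_FIELDS = [
    "title", "files", "last_modified", "collectors", "contact",
    "purpose", "parameters", "citation",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


record = {
    "title": "Example stream temperature data set",
    "files": ["stream_temp_2020.csv"],
    "last_modified": "2020-06-30",
    "collectors": ["A. Researcher"],
    "contact": "a.researcher@example.org",
    "purpose": "Monitor seasonal temperature change",
    "parameters": [
        {
            "name": "temp_C",
            "units": "degrees Celsius",
            "missing_code": "-9999",
            "method": "handheld probe",
        },
    ],
    "citation": "",  # left blank on purpose to show the completeness check
}

print(missing_fields(record))  # ['citation']

# The draft can be saved next to the data files as a readable text file.
readme_text = json.dumps(record, indent=2)
```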

Work with a data center or archiving service that is familiar with your area of research. They can provide guidance as to how to prepare formal metadata, how to preserve the data, what file formats to use, and how to provide additional services to future users of your data. Data centers can provide tools that support data discovery, access, and dissemination of data in response to users' needs.

Identify data with long-term value: It is not necessary to archive all of the data products generated from your research. Consider the size of your files, which data will be most useful for future data users (typically raw data), and which data versions would be most difficult to reproduce.
Store data using appropriate precision (e.g. significant digits)
Use standard terminology: To enable others to find and use your data, carefully select terminology to describe your data. Use common keywords and consult ontologies for your discipline, if they are available.
Consider legal and other policies: All research requires the sharing of information and data. Researchers should be aware of legal and policy considerations that affect the use and reuse of their data. It is important to provide the most comprehensive access possible with the fewest barriers or restrictions. There are three primary areas that need to be addressed when producing sharable data:
1. Check your institution's policies on privacy and confidentiality.
2. Data are not copyrightable. Ensure that you have the appropriate permissions when using data that have multiple owners or copyright layers. Information documenting the context of data collection may be under copyright.
3. Data can be licensed. The way you license your data can determine whether other scholars are able to use it. For example, the Creative Commons Zero License [14] provides for very broad access.

If your data fall into any of the following categories, there are additional considerations regarding sharing: rare, threatened, or endangered species; cultural items returned to their country of origin; Native American and Native Hawaiian human remains and objects; and any research involving human subjects. If you use data from other sources, you should review your rights to use the data and be sure you have the appropriate licenses and permissions.

Attribution and provenance: The following information should be included in the data documentation or the companion metadata file:
- The personnel responsible for the data set throughout the lifetime of the data set
- The context of the data set with respect to a larger project or study (including links and related documentation), if applicable
- Revision history, including additions of new data and error corrections
- Links to source data, if the data were derived from another data set
- Project support (e.g., funding agencies, collaborators, computational support)
- How to properly cite the data set
- Intellectual property rights and other licensing considerations

Best Practice: Discover
Identify complementary data sets that can add value to project data. Strategies to help ensure the data have maximum impact include registering the project on a project directory site, depositing data in an open repository, and adding data descriptions to metadata clearinghouses.

Best Practice: Integrate
Data from multiple sources are combined into a form that can be readily analyzed. For example, you could combine citizen science project data with other sources of data to enable new analyses and investigations. Successful data integration depends on documentation of the integration process, clearly citing and making accessible the data you are using, and employing good data management practices throughout the Data Life Cycle.
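
As a small illustration of integration, the sketch below combines two hypothetical sources that use different column names and units into one homogeneous table, and keeps a `source` column so each value can be traced back to where it came from. The sample rows, field names, and source labels are all assumptions made for this example.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Two hypothetical sources with different column names and units.
sensor_rows = [
    {"site": "A", "date": "2020-01-01", "temp_C": 4.5},
]
volunteer_rows = [
    {"location": "A", "day": "2020-01-02", "temp_F": 41.0},
]


def integrate(sensor, volunteer):
    """Combine both sources into one list with common names and units."""
    combined = []
    for row in sensor:
        combined.append({
            "site": row["site"],
            "date": row["date"],
            "temp_C": row["temp_C"],
            "source": "sensor_network",
        })
    for row in volunteer:
        combined.append({
            "site": row["location"],
            "date": row["day"],
            # Convert to the common unit and round to one decimal place.
            "temp_C": round(fahrenheit_to_celsius(row["temp_F"]), 1),
            "source": "citizen_science",
        })
    return sorted(combined, key=lambda r: (r["site"], r["date"]))


combined = integrate(sensor_rows, volunteer_rows)
for row in combined:
    print(row)
```

The renaming and unit conversion are part of the integration process and belong in its documentation, which is easier when they live in a script rather than in manual edits.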

Best Practice: Analyze
Create analyses and visualizations to identify patterns, test hypotheses, and illustrate findings. During this process record your methods, document data processing steps, and ensure your data are reproducible.
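
One lightweight way to document processing steps during analysis is to have the analysis script write its own record as it runs. The sketch below is one possible approach using only the Python standard library. The class name `ProcessingLog` and the example step descriptions are hypothetical.

```python
import json
import platform
from datetime import datetime, timezone


class ProcessingLog:
    """Collect a record of each processing step applied to a data set."""

    def __init__(self):
        self.steps = []

    def record(self, description, **parameters):
        """Append one step with its parameters and a UTC timestamp."""
        self.steps.append({
            "step": len(self.steps) + 1,
            "description": description,
            "parameters": parameters,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        """Return the log as JSON, including the Python version used."""
        return json.dumps(
            {"python_version": platform.python_version(), "steps": self.steps},
            indent=2,
        )


log = ProcessingLog()
log.record("Flagged temperatures outside plausible range", lower=-2.0, upper=40.0)
log.record("Computed monthly means", statistic="mean")
print(log.to_json())
```

Saving this output with the results gives later users the sequence of steps, the parameters used, and the software version, which supports the documentation items listed earlier in this guide.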
