Data Lifecycle - USGS

Faundeen, J., Burley, T. E., Carlino, J. A., Govoni, D. L., Henkel, H. S., Holl, S. L., Hutchison, V. B., Martín, E., Montgomery, E. T., Ladino, C., Tessler, S., & Zolly, L. S. (2014). The United States Geological Survey Science Data Lifecycle Model (Report No. 2013–1265; Open-File Report, p. 12). USGS Publications Warehouse. https://doi.org/10.3133/ofr20131265


 

 

The USGS developed this Science Data Lifecycle Model as a high-level view of data, from conception through preservation and sharing, to illustrate how data management activities relate to project workflows and to clarify the expectations of proper data management.


Plan

Prior to starting a project, plan how data will be managed throughout the lifecycle by creating a data management plan and identifying key planning activities.

Acquire

Acquiring data for a project involves collecting or generating new data or obtaining existing data.

Process

Processing data involves various activities associated with the preparation of new or previously collected data inputs, including:

  • Validating Data
  • Summarizing Data
  • Transforming Data
  • Integrating Data
  • Subsetting Data
  • Deriving Data

Analyze

Data analysis involves various activities associated with exploring and interpreting processed data. Analysis activities covered on the Analyze page include:

  • Statistical Analysis
  • Visualization
  • Spatial Analysis
  • Image Analysis
  • Modeling
  • Interpretation

Preserve

Preservation involves actions and procedures used to ensure long-term viability and accessibility of data.

Publish/Share

Publishing and sharing data is an important and required stage in the research process, just like publishing traditional peer-reviewed journal articles. 

 

Source: Data Lifecycle | U.S. Geological Survey, n.d.

A Data Management Plan is a documented sequence of intended actions to identify and secure resources and to gather, maintain, secure, and use data holdings. It also includes procuring funding and identifying the technical and staff resources needed for full lifecycle data management. Once the data needs are determined, a system to manage the data can be developed.


 

Why is Data Stewardship Important?

The data collected and analyzed by the USGS are a national resource. They are paid for by taxpayers and are used to make all types of management decisions, many of which have substantial economic, health, and safety consequences.

What is a Data Steward?

A data steward, or data manager, is a person responsible for overseeing the lifecycle activities of a set of data products for the USGS.


 

The Purpose of Data Sharing Agreements

Data sharing agreements protect against data misuse and promote early communication among agencies about questions of data handling and use.


 

When to Consider Access Controls and Copyrights

It is important to consider issues of access controls and copyrights throughout a project: when acquiring or collecting sensitive data, when creating metadata, and when sharing data with others.


 

Overview

It is critical that our information be protected from unauthorized disclosure or intentional corruption, and that our systems be secured against external attack to the maximum extent possible.

Who is Responsible?

IT security must be incorporated into all phases of program planning and execution, from budgeting to close-out. The cognizant Program Manager or IT System Owner has primary responsibility to assure that contractors comply with DOI security mandates.

Data Processing covers any set of structured activities resulting in the alteration or integration of data. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.
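
The documentation point above is easiest to satisfy when each processing step lives in a script that can be rerun. The following is a minimal sketch, assuming a hypothetical CSV of streamflow measurements with date and discharge_cfs columns; the file names, columns, and unit conversion are illustrative, not part of the USGS material.

    # Minimal sketch of a documented processing step (hypothetical file and
    # column names): validate, subset, and derive data, then write the result.
    import pandas as pd

    def process_streamflow(raw_csv: str, out_csv: str) -> pd.DataFrame:
        """Validate, subset, and derive fields from a raw streamflow export."""
        df = pd.read_csv(raw_csv, parse_dates=["date"])

        # Validate: drop rows with missing or negative discharge values.
        df = df[df["discharge_cfs"].notna() & (df["discharge_cfs"] >= 0)]

        # Subset: keep only the date range needed for the analysis.
        df = df[(df["date"] >= "2020-01-01") & (df["date"] < "2021-01-01")]

        # Derive: convert cubic feet per second to cubic meters per second.
        df["discharge_cms"] = df["discharge_cfs"] * 0.0283168

        # The script itself documents the steps; the output feeds the Analyze stage.
        df.to_csv(out_csv, index=False)
        return df

    if __name__ == "__main__":
        process_streamflow("raw_streamflow.csv", "processed_streamflow.csv")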


Diagram of an example workflow to predict biomass properties from genotype

The Analyze stage of the Science Data Lifecycle represents activities associated with the exploration and assessment of data, where hypotheses are tested, discoveries are made, and conclusions are drawn.


Data analysis may be required to better understand data content, context, and quality. In this stage of the Lifecycle, conclusions or new datasets are generated and methods are documented. Analytical activities include statistical analysis, visualization, spatial analysis, image analysis, modeling, and interpretation.
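
As an illustration of two of these activities, the sketch below continues the hypothetical streamflow example from the Process section: it computes summary statistics and saves a time-series plot. Column and file names are assumptions, not part of the USGS material.

    # Minimal sketch of two Analyze activities: statistical summary and
    # visualization of a processed dataset (hypothetical file and columns).
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("processed_streamflow.csv", parse_dates=["date"])

    # Statistical analysis: summarize the derived discharge values.
    print(df["discharge_cms"].describe())

    # Visualization: plot the time series and save the figure for documentation.
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(df["date"], df["discharge_cms"], linewidth=0.8)
    ax.set_xlabel("Date")
    ax.set_ylabel("Discharge (m3/s)")
    ax.set_title("Daily discharge, 2020")
    fig.tight_layout()
    fig.savefig("discharge_2020.png", dpi=150)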


Any method or code developed for a workflow should be clearly written, well documented, modular, and accessible to facilitate research reproducibility. The use of open source solutions and repositories is recommended.

Here are some things to consider when developing methods and workflows:

 

Data Quality

  • Have a plan for data quality management throughout the workflow

  • Maintain documentation on data quality and provenance (a brief provenance sketch follows this list)
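
One way to keep this documentation current is to write it next to the data it describes. Below is a minimal sketch, with hypothetical file names and steps, that records provenance as a JSON sidecar file.

    # Minimal sketch of provenance documentation kept alongside the data it
    # describes, written as a JSON sidecar file (all values are hypothetical).
    import json
    from datetime import date

    provenance = {
        "dataset": "processed_streamflow.csv",
        "source": "raw_streamflow.csv",
        "retrieved": "2020-02-01",  # when the raw data were obtained
        "processing_steps": [
            "dropped rows with missing or negative discharge",
            "subset to calendar year 2020",
            "derived discharge_cms from discharge_cfs",
        ],
        "quality_checks": ["range check on discharge", "duplicate-date check"],
        "processed_on": date.today().isoformat(),
    }

    with open("processed_streamflow.provenance.json", "w") as f:
        json.dump(provenance, f, indent=2)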

 

Efficiency

  • Use a scripting language to automate data processing and simplify documentation

  • Use standardized methods and protocols appropriate to your data, when available

  • When possible, support the research by building software or code modules that automatically acquire external data and execute processing and analysis code

  • Embrace modular workflows, processes, and code, where component parts are reusable (see the sketch after this list)
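
The sketch below illustrates these points with assumed names and a placeholder URL: small, reusable functions for acquisition, processing, and analysis, chained by a single driver so the whole workflow can be rerun automatically.

    # Minimal sketch of a modular workflow: reusable acquire/process/analyze
    # functions chained by one driver script (URL and file names are placeholders).
    from urllib.request import urlretrieve
    import pandas as pd

    def acquire(url: str, dest: str) -> str:
        """Download external data so the workflow can be rerun end to end."""
        urlretrieve(url, dest)
        return dest

    def process(raw_csv: str) -> pd.DataFrame:
        """Validate the raw data; reusable in other projects."""
        return pd.read_csv(raw_csv).dropna()

    def analyze(df: pd.DataFrame) -> pd.DataFrame:
        """Summarize the processed data."""
        return df.describe()

    if __name__ == "__main__":
        raw = acquire("https://example.org/data.csv", "raw.csv")
        print(analyze(process(raw)))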

 

Transparency

  • Open source software development is encouraged

  • Readability - code and documentation should be concise and understandable

  • Use published or citable methods, or publish new methods as necessary

 

Reproducibility

  • Documentation is only one component of this goal

  • Enable anyone (including yourself) to rerun your analyses

    • This is accomplished when an independent reviewer can read your documentation, acquire the requisite data, and execute the processing and analyses using code or manual actions

  • Use a version control system for your code

  • Use software that installs code packages with all dependencies (a brief version-recording sketch follows this list)

  • Reference your data sources as specifically as possible
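
Capturing the computational environment supports several of these points. The sketch below, assuming the analysis uses pandas and matplotlib, records the interpreter and package versions so an independent reviewer can recreate the environment with a package manager of their choice.

    # Minimal sketch of recording the Python and package versions used for an
    # analysis (the package list is an assumption about this hypothetical project).
    import sys
    from importlib.metadata import version

    packages = ["pandas", "matplotlib"]

    with open("environment_versions.txt", "w") as f:
        f.write(f"python {sys.version.split()[0]}\n")
        for name in packages:
            f.write(f"{name} {version(name)}\n")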

 

Accessibility

  • Use open source development environments when possible

  • Make code available through public repositories

  • Ensure that all data used in research are available

 

Documentation

  • Maintain documentation on data processing and analysis activities as they happen; reconstructing research activities retrospectively is less efficient and less accurate (a minimal logging sketch follows this list)

  • For release-stage products, include diagrams and other supplemental material, in addition to standard metadata, to assist with understanding or reproducing a process or analysis
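
One lightweight way to capture activities as they happen is to log them from the processing code itself. The sketch below uses Python's standard logging module; the file name and messages are hypothetical.

    # Minimal sketch of logging processing activities as they happen, so the
    # record is built during the work rather than reconstructed afterward.
    import logging

    logging.basicConfig(
        filename="processing_log.txt",
        level=logging.INFO,
        format="%(asctime)s %(message)s",
    )

    n_dropped = 42  # hypothetical count from a validation step
    logging.info("Loaded raw_streamflow.csv")
    logging.info("Dropped %d rows failing the discharge range check", n_dropped)
    logging.info("Wrote processed_streamflow.csv")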

Preservation involves actions and procedures to keep data for some period of time and/or to set data aside for future use, and includes data archiving and/or data submission to a data repository.

The ability to prepare, release, and share, or disseminate, quality data to the public and to other agencies is an important part of the lifecycle process.

Publication of scientific data as stand-alone products or in conjunction with the scholarly articles they support is integral to the open data movement. The USGS has developed a path for formally releasing or publishing USGS scientific data called a "data release."

Diagram of the elements of a USGS data release: data, metadata, digital object identifier, IPDS, USGS dataset repository, SDC

 

Data citation is the practice of referencing data used in research. A data citation includes key descriptive information about the data, such as the title, source, and responsible parties.
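
As an illustration only, the sketch below assembles those elements into a formatted citation string; every value is a placeholder, not a real dataset or DOI.

    # Minimal sketch of the key elements of a data citation, assembled into a
    # single formatted string (all values are hypothetical placeholders).
    citation = {
        "authors": "Doe, J., and Roe, A.",
        "year": "2024",
        "title": "Example streamflow dataset",
        "publisher": "Example data repository",
        "doi": "https://doi.org/10.xxxx/example",
    }

    print("{authors}, {year}, {title}: {publisher}, {doi}.".format(**citation))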