Faundeen, J., Burley, T. E., Carlino, J. A., Govoni, D. L., Henkel, H. S., Holl, S. L., Hutchison, V. B., Martín, E., Montgomery, E. T., Ladino, C., Tessler, S., & Zolly, L. S. (2014). The United States Geological Survey Science Data Lifecycle Model (Report No. 2013–1265; Open-File Report, p. 12). USGS Publications Warehouse. https://doi.org/10.3133/ofr20131265
The USGS developed this Data Lifecycle Model as a high-level view of data, from conception through preservation and sharing, to illustrate how data management activities relate to project workflows and to clarify the expectations of proper data management.
Plan
Before a project begins, it is important to plan how data will be managed throughout the lifecycle, including creating a data management plan.
Acquire
Acquiring data for a project involves collecting or generating new data or obtaining existing data.
Process
Processing data involves various activities associated with the preparation of new or previously collected data inputs for analysis.
Analyze
Data analysis involves various activities associated with exploring and interpreting processed data.
Preserve
Preservation involves actions and procedures used to ensure long-term viability and accessibility of data.
Publish/Share
Publishing and sharing data is an important and required stage in the research process, just like publishing traditional peer-reviewed journal articles.
Source: Data Lifecycle | U.S. Geological Survey, n.d.
A Data Management Plan is a documented sequence of intended actions to identify and secure resources and to gather, maintain, secure, and use data holdings. It also includes the procurement of funding and the identification of technical and staff resources for full-lifecycle data management. Once the data needs are determined, a system to manage the data can be developed.
Why is Data Stewardship Important?
The data collected and analyzed by the USGS are a national resource. They are paid for by taxpayers and are used to make all types of management decisions, many of which have substantial economic, health, and safety consequences.
What is a Data Steward?
A data steward, or data manager, is a person responsible for overseeing the lifecycle activities of a set of data products for the USGS.
The Purpose of Data Sharing Agreements
Data sharing agreements protect against data misuse and promote early communication among agencies about questions of data handling and use.
When to Consider Access Controls and Copyrights
It is important to consider issues of access controls and copyrights throughout a project: when acquiring or collecting sensitive data, when creating metadata, and when sharing data with others.
Overview
It is critical that our information be protected from unintended disclosure or intentional corruption, and that our systems are secured against external attack to the maximum extent possible.
Who is Responsible?
IT security must be incorporated into all phases of program planning and execution, from budgeting to close-out. The cognizant Program Manager or IT System Owner has primary responsibility to assure that contractors comply with DOI security mandates.
Data Processing covers any set of structured activities resulting in the alteration or integration of data. Data processing can result in data ready for analysis, or generate output such as graphs and summary reports. Documenting the steps for how data are processed is essential for reproducibility and improves transparency.
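As a minimal sketch of documenting processing steps as they happen, assuming a simple Python workflow (the `log_step` helper, the no-data flag value, and the readings are all hypothetical, not a USGS tool):

```python
from datetime import datetime, timezone

def log_step(log, description, **params):
    """Append a timestamped record of one processing step (hypothetical helper)."""
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "parameters": params,
    })

# Illustrative stream-gage readings; -999.0 stands in for a no-data flag
readings = [12.1, 11.8, -999.0, 12.4, 13.0]
provenance = []

cleaned = [r for r in readings if r != -999.0]
log_step(provenance, "removed no-data values", flag=-999.0,
         removed=len(readings) - len(cleaned))

print(cleaned)                 # [12.1, 11.8, 12.4, 13.0]
print(provenance[0]["step"])   # removed no-data values
```

Because each alteration of the data is logged with its parameters at the moment it runs, the provenance record doubles as the documentation needed to reproduce the processing later.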
The Analyze stage of the Science Data Lifecycle represents activities associated with the exploration and assessment of data, where hypotheses are tested, discoveries are made, and conclusions are drawn.
Data analysis may be required to better understand data content, context, and quality. In this stage of the Lifecycle, conclusions or new datasets are generated and methods are documented.
Any method or code developed for a workflow should be clearly written, well documented, modular, and accessible to facilitate research reproducibility. The use of open source solutions and repositories is recommended.
Here are some things to consider when developing methods and workflows:
Data Quality
Have a plan for data quality management throughout the workflow
Maintain documentation on data quality and provenance
Efficiency
Use a scripting language to automate data processing and simplify documentation
Use standardized methods and protocols appropriate to your data, when available
When possible, support the research by building software or code modules that automatically acquire external data and execute processing and analysis code
Embrace modular workflows, processes, and code, where component parts are reusable
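The efficiency points above can be sketched as a small modular pipeline in Python; the function names and values are hypothetical stand-ins for real acquisition, processing, and analysis code:

```python
def acquire():
    """Stand-in for a module that automatically acquires external data."""
    return [3.0, 4.0, 5.0]

def process(values, scale=1.0):
    """Reusable processing module: apply an illustrative calibration factor."""
    return [v * scale for v in values]

def analyze(values):
    """Reusable analysis module: compute a simple summary statistic."""
    return sum(values) / len(values)

def run_pipeline(scale=2.0):
    """Compose the modules; each component can be swapped or reused independently."""
    return analyze(process(acquire(), scale=scale))

print(run_pipeline())  # 8.0
```

Because each stage is a separate function with explicit inputs and outputs, the script itself documents the workflow, and any component can be replaced without rewriting the others.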
Transparency
Open source software development is encouraged
Readability - code and documentation should be concise and understandable
Use published or citable methods, or publish new methods as necessary
Reproducibility
Documentation is only one component of this goal
Enable anyone (including yourself) to rerun your analyses
This is accomplished when an independent reviewer can read your documentation, acquire the requisite data, and execute the processing and analyses using code or manual actions
Use a version control system for your code
Use software that installs code packages with all dependencies
Reference your data sources as specifically as possible
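One way to reference data sources as specifically as possible is to record a cryptographic checksum of each input file, so an independent reviewer can confirm they are rerunning the analysis on the exact same data. A minimal sketch (the file contents and site number are illustrative):

```python
import hashlib
import os
import tempfile

def file_sha256(path):
    """Return the SHA-256 digest of a file so the exact data version can be cited."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: digest of a small illustrative data file (contents are made up)
data = b"site,stage\n01646500,3.42\n"
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

digest = file_sha256(path)
os.remove(path)
print(len(digest))  # 64 hex characters
```

Recording the digest alongside the data citation, together with a version-controlled copy of the code and pinned package dependencies, covers the reproducibility points listed above.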
Accessibility
Use open source development environments when possible
Make code available through public repositories
Ensure that all data used in research are available
Documentation
Maintain documentation on data processing and analysis activities as they happen; reconstructing research activities retrospectively is less efficient and less accurate
For release-stage products, include diagrams and other supplemental material, in addition to standard metadata, to assist with understanding or reproducing a process or analysis
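A lightweight way to keep supplemental documentation alongside a product is a machine-readable sidecar file; a minimal sketch in Python (the field names are illustrative, not a formal metadata standard):

```python
import json

# Illustrative sidecar record describing a processed dataset
record = {
    "title": "Daily mean discharge, example gage",
    "processing_steps": [
        "removed no-data values",
        "converted units to cubic meters per second",
    ],
    "created": "2024-01-15",
}

# Serialize to a human- and machine-readable form for release alongside the data
sidecar = json.dumps(record, indent=2)
print(sidecar)
```

A sidecar like this complements, rather than replaces, the standard metadata and diagrams expected for release-stage products.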
Preservation involves actions and procedures to keep data for some period of time and/or to set data aside for future use, and includes data archiving and/or data submission to a data repository.
The ability to prepare, release, and share (or disseminate) quality data to the public and to other agencies is an important part of the lifecycle process.
Publication of scientific data as stand-alone products or in conjunction with the scholarly articles they support is integral to the open data movement. The USGS has developed a path for formally releasing or publishing USGS scientific data called a "data release."
Data citation is the practice of referencing data used in research. A data citation includes key descriptive information about the data, such as the title, source, and responsible parties.
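As an illustration of how those key descriptive parts come together, a citation string can be assembled programmatically; every value below is a placeholder, not a real dataset, publisher style, or DOI:

```python
# Placeholder components of a data citation (not a real dataset or DOI)
parts = {
    "authors": "Smith, J.Q., and Jones, A.B.",
    "year": "2020",
    "title": "Example streamflow dataset",
    "publisher": "U.S. Geological Survey data release",
    "doi": "https://doi.org/10.XXXX/example",
}

# Combine title, source, and responsible parties into one citation string
citation = "{authors}, {year}, {title}: {publisher}, {doi}.".format(**parts)
print(citation)
```

Keeping the components in a structured form like this makes it easy to reformat the citation for different outlets without retyping it.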