SDSC Summer Institute 2023 - Session 3.2 Data Management
Date: Tuesday, August 8th, 2023
Time: 9:00 AM – 10:00 AM PT
Instructor: Marty Kandes, Computational & Data Science Research Specialist, High Performance Computing User Services Group, San Diego Supercomputer Center, University of California, San Diego
Summary
Proper data management is essential to making effective use of high-performance computing (HPC) systems and other advanced cyberinfrastructure (CI) resources. This session gives an overview of filesystems, data compression, archives (tar files), checksums and MD5 digests, downloading data with wget and curl, data transfer, and long-term storage solutions.
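As a rough illustration of how two of these topics fit together (archives and checksums), the sketch below bundles a directory into a gzip-compressed tar file and computes its MD5 digest, so that the digest can be compared before and after a transfer. It is not taken from the session material: the helper names `md5_digest` and `make_archive`, the `results` directory, and the file `run1.txt` are hypothetical and chosen only for this example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def md5_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that large files never have to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_archive(src_dir, archive_path):
    """Bundle src_dir into a gzip-compressed tar archive (hypothetical helper).
    Many small files become one large file, which is usually friendlier
    to parallel filesystems and to transfer tools."""
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full absolute path
        tar.add(src_dir, arcname=src_dir.name)
    return Path(archive_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical example data: one small file in a "results" directory
        data_dir = Path(tmp) / "results"
        data_dir.mkdir()
        (data_dir / "run1.txt").write_text("hello\n")

        archive = make_archive(data_dir, Path(tmp) / "results.tar.gz")

        # Record the digest before a transfer and recompute it afterwards;
        # matching digests indicate the file was not corrupted in transit.
        before = md5_digest(archive)
        after = md5_digest(archive)
        print("digests match:", before == after)
```

MD5 is adequate for detecting accidental corruption during a transfer, but it is not collision-resistant, so a stronger hash such as SHA-256 (`hashlib.sha256`) is the safer choice when tampering is a concern.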
Reading and Presentations
Data has a lifecycle. Data management is a lifestyle.

Image Credit: Harvard Biomedical Data Management
Lecture material
- CIFAR through the tubes: Downloading data from the internet
- More files, more problems: Advantages and limitations of different filesystems
- Going parallel: Lustre basics
- Back that data up: Data transfer tools
Source Code/Examples
Additional References
- Implementing Research Data Management for Labs & Grants (2021)
- Data Management & File Systems on Expanse (2021)
- Data Management & Job Submission (2022)
- Data Management & File Systems (2023)
Next - CIFAR through the tubes: Downloading data from the internet