Source repo: sdsc-summer-institute-2022 | Branch: main | Last synced: 2026-04-24 10:27:17.425 UTC
Data Management: Or how (not) to handle your data in an HPC environment
Proper data management is essential for making effective use of high-performance computing (HPC) systems and other advanced cyberinfrastructure (CI) resources. This session gives an overview of filesystems, data compression, archives (tar files), checksums and MD5 digests, downloading data with wget and curl, data transfer, and long-term storage solutions.
- Before we begin: A few disclaimers
- Easy access: Setting up SSH keys
- CIFAR through the tubes: Downloading data from the internet
- More files, more problems: Advantages and limitations of different filesystems
- Going parallel: Lustre basics
- Back that data up: Data transfer tools
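As a small illustration of two topics from the overview (tar archives and MD5 digests), here is a minimal Python sketch. It is not taken from the session materials: the helper names `md5_digest` and `make_tar_gz` and the `dataset` directory are hypothetical, chosen only for this example. It bundles many small files into one compressed archive, which tends to be kinder to shared HPC filesystems than thousands of tiny files, and then computes a digest that can be compared before and after a transfer.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def md5_digest(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large files never have to fit in memory at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def make_tar_gz(source_dir, archive_path):
    """Bundle a directory into a single gzip-compressed tar archive."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(source_dir, arcname=source_dir.name)
    return Path(archive_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical dataset: a few small text files in one directory.
        data_dir = Path(tmp) / "dataset"
        data_dir.mkdir()
        for i in range(3):
            (data_dir / f"sample_{i}.txt").write_text(f"record {i}\n")

        archive = make_tar_gz(data_dir, Path(tmp) / "dataset.tar.gz")
        # Record this digest and recompute it after copying the archive
        # elsewhere; matching digests indicate the copy is intact.
        print(archive.name, md5_digest(archive))
```

MD5 is used here only as an integrity check against accidental corruption; it should not be relied on for security purposes.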
Additional references:
- https://education.sdsc.edu/training/interactive/202110_data_management_and_file_systems/index.html
- https://github.com/sdsc-hpc-training-org/hpc-training-2022/blob/main/week03_jobsub_datamgmt/DataManagement_HPCTraining_2022.pdf