Session 6.2 Scaling up Interactive Data Analysis in Jupyter Lab: From Laptop to HPC
Friday, August 5, 2022
In this session we will demonstrate scaling up data analysis to larger-than-memory (out-of-core) datasets and processing them in parallel on CPU and GPU nodes. In the hands-on exercise we will compare the Pandas, Dask, Spark, cuDF, and Dask-cuDF dataframe libraries for handling large datasets. We also cover setting up reproducible and transferable software environments for data analysis.
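The gap between in-memory and out-of-core processing is easiest to see side by side. Below is a minimal sketch (not taken from the session materials; the file and column names are hypothetical placeholders) of the same aggregation in Pandas and Dask:

```python
# A minimal sketch; "measurements.csv" and the column names are
# hypothetical placeholders, not the session's actual dataset.
import pandas as pd
import dask.dataframe as dd

# Pandas loads the entire file into memory at once.
pdf = pd.read_csv("measurements.csv")
print(pdf.groupby("station")["temperature"].mean())

# Dask splits the file into partitions and evaluates lazily, so
# larger-than-memory datasets can be processed in parallel on CPU cores.
ddf = dd.read_csv("measurements.csv")
print(ddf.groupby("station")["temperature"].mean().compute())
```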
Resources:
- Git repository `df-parallel`: Comparison of dataframe libraries for parallel processing of large tabular files
- Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks
TASK 1: Launch Jupyter Lab on Expanse using a CONDA environment
- Open a terminal window ("Expanse Shell Access") through the Expanse Portal (use your trainxx login credentials).
- Note: The portal's Jupyter App form does not currently support all of the latest galyleo command-line options, so you will learn to launch a notebook on Expanse using the galyleo script instead.
- Clone the `df-parallel` Git repository into your home directory:

```bash
git clone https://github.com/sbl-sdsc/df-parallel.git
```
- Launch Jupyter Lab using the galyleo script. The script will generate a URL for your Jupyter Lab session:

```bash
galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 01:00:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
```

This command requests 10 CPU cores, 92 GB of memory, and one GPU on the gpu-shared partition for one hour, and builds the `df-parallel-gpu` conda environment from the repository's `environment-gpu.yml` file using mamba. Replace `<account_number>` with your allocation's account number.
- Open a new tab in your web browser and paste the Jupyter Lab URL. You should see the Satellite reverse proxy service page launch in your browser.
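As an optional sanity check (a sketch, not part of the session materials), you can run a cell like the following in the new session to confirm that the environment and the allocated GPU are visible to the kernel; it assumes cuDF is installed by `environment-gpu.yml` and that `nvidia-smi` is available on the GPU node:

```python
import subprocess

# cuDF should import cleanly if the df-parallel-gpu environment is active
# (assumption: environment-gpu.yml installs cudf for the GPU notebooks).
import cudf
print("cuDF version:", cudf.__version__)

# List the GPU(s) allocated to this job.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
```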
- In your Zoom session, select "Yes" under Reactions after you complete these steps.
TASK 2: Run Notebooks in Jupyter Lab
For this task you will compare the runtimes of a simple data analysis across five dataframe libraries.
- Go to the Jupyter Lab session launched in TASK 1 and navigate to the `df-parallel/notebooks` directory.
- Copy data files: run the `1-DownloadData.ipynb` and `1a-Csv2Parquet.ipynb` notebooks to copy the datasets to the local scratch disk on the GPU node.
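For reference, a CSV-to-Parquet conversion comes down to a few lines. This is only a sketch of the idea (the actual `1a-Csv2Parquet.ipynb` may differ, and the file names are hypothetical placeholders):

```python
import pandas as pd

# Read the CSV, then write it back out as Parquet, a compressed
# columnar format (pandas uses pyarrow as its Parquet engine here).
df = pd.read_csv("dataset.csv")
df.to_parquet("dataset.parquet", engine="pyarrow")
```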
- Run the Dataframe notebooks with a CSV input file: run each of the following notebooks and write down the runtime shown at the bottom of the notebook. To get exact timings, run each notebook with the `>>` (Run All) button!
  - `2-PandasDataframe.ipynb`
  - `3-DaskDataframe.ipynb`
  - `4-SparkDataframe.ipynb`
  - `5-CudaDataframe.ipynb`
  - `6-DaskCudaDataframe.ipynb`
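All five notebooks run the same kind of analysis through different libraries. As a rough sketch of where the libraries diverge (the notebooks' actual code may differ; `dataset.csv` is a hypothetical placeholder), the entry point into each looks like this:

```python
import pandas as pd
import dask.dataframe as dd
from pyspark.sql import SparkSession
import cudf
import dask_cudf

pdf = pd.read_csv("dataset.csv")                  # Pandas: single-threaded, fully in-memory
ddf = dd.read_csv("dataset.csv")                  # Dask: lazy, partitioned, parallel on CPU
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.csv("dataset.csv", header=True)  # Spark: distributed JVM engine
gdf = cudf.read_csv("dataset.csv")                # cuDF: single GPU
dgdf = dask_cudf.read_csv("dataset.csv")          # Dask-cuDF: partitioned across GPU memory
```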
- Run the CUDA Dataframe notebook with a Parquet input file: in the `5-CudaDataframe.ipynb` notebook, change the file format to Parquet:

```python
file_format = "parquet"
```

To get exact timings, run the notebook with the `>>` (Run All) button! Write down the runtime for cuDF with the Parquet format.
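Parquet is a compressed, columnar format, so a reader can load only the columns the analysis needs instead of parsing every row of text; that is why the Parquet runtime is typically much lower than the CSV runtime. A sketch of what the format switch amounts to (paths are hypothetical placeholders):

```python
import cudf

file_format = "parquet"  # was "csv" in the earlier run

if file_format == "parquet":
    df = cudf.read_parquet("dataset.parquet")  # columnar: reads only what is needed
else:
    df = cudf.read_csv("dataset.csv")          # text: parses every row and column
```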
- Shutdown Jupyter Lab: select `File -> Shutdown` to terminate the process.
- In your Zoom session, select "Yes" under "Reactions" after you complete these steps.