Source Information¶


Created by:

Updated by: October 01, 2024

Resources:

  • https://pandas.pydata.org/
  • https://docs.cupy.dev/en/stable/user_guide/basic.html

Goal¶

This notebook introduces CuPy, a Python library for GPU-accelerated numerical computations. We will explore how to use CuPy arrays (1D and 2D data structures) for data analysis, similar to NumPy arrays but with the performance benefits of GPU processing.

CuPy - a quick introduction¶

CuPy is an open-source library designed for GPU-accelerated array computations in Python. Its interface is highly compatible with both NumPy and SciPy, making it a seamless drop-in replacement in most cases. For more information and to explore its capabilities, visit the official CuPy documentation

Required Modules for the Jupyter Notebook¶

Before running the notebook, we need the following modules.

Module:cupy,cudf

In [2]:
import cupy as cp
import cudf
import pandas as pd
import numpy as np

Since I used a Conda environment to launch this notebook, I will install openpyxl to enable reading and writing Excel files within the notebook.

In [ ]:
conda install openpyxl -q

Load CSV Data Set¶

Let's dive into an Olympic Medal dataset available from Wikipedia https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table. I've already done a little bit of cleanup so that we can quickly get to the important features.

Import Data using cuDF¶

Our file is in csv format, so we'll use the read_csv method. We know that the first two rows contain comments and other data that we won't want. We can use the skiprows argument to skip over these rows. We'll set index_col to zero so that the first column serves as the index. After loading the DataFrame, we'll look at the first and last few rows using the head and tail methods

Normally, when using the CPU, we use the pandas package to load and manipulate data. However, to leverage the GPU for data processing, we use cuDF instead of pandas. cuDF is a GPU-accelerated library that has a similar API to pandas but runs computations on the GPU, enabling faster processing for large datasets.

In [8]:
# instead of using pandas, we will us cudf
df = cudf.read_csv('olympics.csv',index_col=0, skiprows=2)
In [9]:
df.head(3)
Out[9]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
0
Afghanistan (AFG) 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
Algeria (ALG) 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
Argentina (ARG) 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
In [10]:
df.tail(3)
Out[10]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
0
Zimbabwe (ZIM) [ZIM] 12 3 4 1 8 1 0 0 0 0 13 3 4 1 8
Mixed team (ZZX) [ZZX] 3 8 5 4 17 0 0 0 0 0 3 8 5 4 17
Totals 27 4809 4775 5130 14714 22 959 958 948 2865 49 5768 5733 6078 17579

Note that the last row of our dataframe contains totals for the number of games and medals won. Let's get rid of that using the drop method.

In [11]:
df = df.drop('Totals')
df.tail(3)
Out[11]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Zambia (ZAM) [ZAM] 12 0 1 1 2 0 0 0 0 0 12 0 1 1 2
Zimbabwe (ZIM) [ZIM] 12 3 4 1 8 1 0 0 0 0 13 3 4 1 8
Mixed team (ZZX) [ZZX] 3 8 5 4 17 0 0 0 0 0 3 8 5 4 17

We can use the shape and size attributes of the dataframe to determine the dimensions and number of cells. Note that the dimensions do not include the column headers and index column.

In [13]:
df.shape
Out[13]:
(146, 15)
In [14]:
df.size
Out[14]:
2190

The columns attribute returns the column names

In [15]:
df.columns
Out[15]:
Index(['Summer games', 'Summer gold', 'Summer silver', 'Summer bronze',
       'Summer total', 'Winter games', 'Winter gold', 'Winter silver',
       'Winter bronze', 'Winter total', 'Combined games', 'Combined gold',
       'Combined silver', 'Combined bronze', 'Combined total'],
      dtype='object')

The row labels are returned using the index attribute

In [16]:
df.index
Out[16]:
StringIndex(['Afghanistan\xa0(AFG)' 'Algeria\xa0(ALG)' 'Argentina\xa0(ARG)'
 'Armenia\xa0(ARM)' 'Australasia\xa0(ANZ) [ANZ]'
 'Australia\xa0(AUS) [AUS] [Z]' 'Austria\xa0(AUT)' 'Azerbaijan\xa0(AZE)'
 'Bahamas\xa0(BAH)' 'Bahrain\xa0(BRN)' 'Barbados\xa0(BAR) [BAR]'
 'Belarus\xa0(BLR)' 'Belgium\xa0(BEL)' 'Bermuda\xa0(BER)'
 'Bohemia\xa0(BOH) [BOH] [Z]' 'Botswana\xa0(BOT)' 'Brazil\xa0(BRA)'
 'British West Indies\xa0(BWI) [BWI]' 'Bulgaria\xa0(BUL) [H]'
 'Burundi\xa0(BDI)' 'Cameroon\xa0(CMR)' 'Canada\xa0(CAN)'
 'Chile\xa0(CHI) [I]' 'China\xa0(CHN) [CHN]' 'Colombia\xa0(COL)'
 'Costa Rica\xa0(CRC)' 'Ivory Coast\xa0(CIV) [CIV]' 'Croatia\xa0(CRO)'
 'Cuba\xa0(CUB) [Z]' 'Cyprus\xa0(CYP)' 'Czech Republic\xa0(CZE) [CZE]'
 'Czechoslovakia\xa0(TCH) [TCH]' 'Denmark\xa0(DEN) [Z]'
 'Djibouti\xa0(DJI) [B]' 'Dominican Republic\xa0(DOM)' 'Ecuador\xa0(ECU)'
 'Egypt\xa0(EGY) [EGY] [Z]' 'Eritrea\xa0(ERI)' 'Estonia\xa0(EST)'
 'Ethiopia\xa0(ETH)' 'Finland\xa0(FIN)' 'France\xa0(FRA) [O] [P] [Z]'
 'Gabon\xa0(GAB)' 'Georgia\xa0(GEO)' 'Germany\xa0(GER) [GER] [Z]'
 'United Team of Germany\xa0(EUA) [EUA]' 'East Germany\xa0(GDR) [GDR]'
 'West Germany\xa0(FRG) [FRG]' 'Ghana\xa0(GHA) [GHA]'
 'Great Britain\xa0(GBR) [GBR] [Z]' 'Greece\xa0(GRE) [Z]'
 'Grenada\xa0(GRN)' 'Guatemala\xa0(GUA)' 'Guyana\xa0(GUY) [GUY]'
 'Haiti\xa0(HAI) [J]' 'Hong Kong\xa0(HKG) [HKG]' 'Hungary\xa0(HUN)'
 'Iceland\xa0(ISL)' 'India\xa0(IND) [F]' 'Indonesia\xa0(INA)'
 'Iran\xa0(IRI) [K]' 'Iraq\xa0(IRQ)' 'Ireland\xa0(IRL)' 'Israel\xa0(ISR)'
 'Italy\xa0(ITA) [M] [S]' 'Jamaica\xa0(JAM) [JAM]' 'Japan\xa0(JPN)'
 'Kazakhstan\xa0(KAZ)' 'Kenya\xa0(KEN)' 'North Korea\xa0(PRK)'
 'South Korea\xa0(KOR)' 'Kuwait\xa0(KUW)' 'Kyrgyzstan\xa0(KGZ)'
 'Latvia\xa0(LAT)' 'Lebanon\xa0(LIB)' 'Liechtenstein\xa0(LIE)'
 'Lithuania\xa0(LTU)' 'Luxembourg\xa0(LUX) [O]' 'Macedonia\xa0(MKD)'
 'Malaysia\xa0(MAS) [MAS]' 'Mauritius\xa0(MRI)' 'Mexico\xa0(MEX)'
 'Moldova\xa0(MDA)' 'Mongolia\xa0(MGL)' 'Montenegro\xa0(MNE)'
 'Morocco\xa0(MAR)' 'Mozambique\xa0(MOZ)' 'Namibia\xa0(NAM)'
 'Netherlands\xa0(NED) [Z]' 'Netherlands Antilles\xa0(AHO) [AHO] [I]'
 'New Zealand\xa0(NZL) [NZL]' 'Niger\xa0(NIG)' 'Nigeria\xa0(NGR)'
 'Norway\xa0(NOR) [Q]' 'Pakistan\xa0(PAK)' 'Panama\xa0(PAN)'
 'Paraguay\xa0(PAR)' 'Peru\xa0(PER) [L]' 'Philippines\xa0(PHI)'
 'Poland\xa0(POL)' 'Portugal\xa0(POR)' 'Puerto Rico\xa0(PUR)'
 'Qatar\xa0(QAT)' 'Romania\xa0(ROU)' 'Russia\xa0(RUS) [RUS]'
 'Russian Empire\xa0(RU1) [RU1]' 'Soviet Union\xa0(URS) [URS]'
 'Unified Team\xa0(EUN) [EUN]' 'Saudi Arabia\xa0(KSA)' 'Senegal\xa0(SEN)'
 'Serbia\xa0(SRB) [SRB]' 'Serbia and Montenegro\xa0(SCG) [SCG]'
 'Singapore\xa0(SIN)' 'Slovakia\xa0(SVK) [SVK]' 'Slovenia\xa0(SLO)'
 'South Africa\xa0(RSA)' 'Spain\xa0(ESP) [Z]' 'Sri Lanka\xa0(SRI) [SRI]'
 'Sudan\xa0(SUD)' 'Suriname\xa0(SUR) [E]' 'Sweden\xa0(SWE) [Z]'
 'Switzerland\xa0(SUI)' 'Syria\xa0(SYR)'
 'Chinese Taipei\xa0(TPE) [TPE] [TPE2]' 'Tajikistan\xa0(TJK)'
 'Tanzania\xa0(TAN) [TAN]' 'Thailand\xa0(THA)' 'Togo\xa0(TOG)'
 'Tonga\xa0(TGA)' 'Trinidad and Tobago\xa0(TRI) [TRI]' 'Tunisia\xa0(TUN)'
 'Turkey\xa0(TUR)' 'Uganda\xa0(UGA)' 'Ukraine\xa0(UKR)'
 'United Arab Emirates\xa0(UAE)' 'United States\xa0(USA) [P] [Q] [R] [Z]'
 'Uruguay\xa0(URU)' 'Uzbekistan\xa0(UZB)' 'Venezuela\xa0(VEN)'
 'Vietnam\xa0(VIE)' 'Virgin Islands\xa0(ISV)' 'Yugoslavia\xa0(YUG) [YUG]'
 'Independent Olympic Participants\xa0(IOP) [IOP]' 'Zambia\xa0(ZAM) [ZAM]'
 'Zimbabwe\xa0(ZIM) [ZIM]' 'Mixed team\xa0(ZZX) [ZZX]'], dtype='object')

We can manipulate the data in our data frame. For example, let's cleanup the country names to get rid of everything from the country abbreviation to the end of the string. For example "Australia (AUS) [AUS] [Z]" becomes simply "Australia".

We'll start by calling the str.split method to split the country names on the white space and opening parenthesis "(" and then reassign the first element of the resulting list to the index. We then display the head of the dataframe to confirm that the renaming of the countries worked as expected.

As CuDF does support some string operations, string manipulation capabilities are more limited than Pandas when it comes to complex string processing, such as split.

In [19]:
df.index = df.index.str.replace(r'\s*\(.*\)|\s*\[.*\]', '', regex=True) 
df.tail(3)
Out[19]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Zambia 12 0 1 1 2 0 0 0 0 0 12 0 1 1 2
Zimbabwe 12 3 4 1 8 1 0 0 0 0 13 3 4 1 8
Mixed team 3 8 5 4 17 0 0 0 0 0 3 8 5 4 17

Single and double square brackets / accessing rows and columns¶

To work with DataFrames, we just need to master a few key concepts

  • Single and double brackets

    • Single brackets [ ] with a single value return a new Series
    • Single brackets [ ] with multiple values return a new DataFrame
    • Double brackets [[ ]] return a new DataFrame
  • Selecting data by row

    • Rows are accessed by position using iloc
    • Rows are accessed by row name or label using loc
  • Selecting data by column

    • Columns are accessed by column name

Selecting rows¶

In [20]:
# Return a Series selecting row 2 (iloc and single brackets)
df.iloc[2]
Out[20]:
Summer games       23
Summer gold        18
Summer silver      24
Summer bronze      28
Summer total       70
Winter games       18
Winter gold         0
Winter silver       0
Winter bronze       0
Winter total        0
Combined games     41
Combined gold      18
Combined silver    24
Combined bronze    28
Combined total     70
Name: Argentina, dtype: int64
In [21]:
# Return a DataFrame selecting row 2-6 (iloc and single brackets)
df.iloc[2:5]
Out[21]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
In [22]:
# Return a single-row Data Frame selecting row 2 (iloc and double brackets)
df.iloc[[2]]
Out[22]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70

We can also access a row by the label or value of the index using the loc method. For example, if we wanted to get the data for France, it's more convenient to use the label than figure out the row number.

In [23]:
# Return a Series for row labeled France (loc and single brackets)
df.loc['France']
Out[23]:
Summer games        27
Summer gold        202
Summer silver      223
Summer bronze      246
Summer total       671
Winter games        22
Winter gold         31
Winter silver       31
Winter bronze       47
Winter total       109
Combined games      49
Combined gold      233
Combined silver    254
Combined bronze    293
Combined total     780
Name: France, dtype: int64
In [24]:
# Return a DataFrame for rows labeled France and Germany (loc, single brackets, list argument)
countries = ['France', 'Germany']
df.loc[countries]
Out[24]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
France 27 202 223 246 671 22 31 31 47 109 49 233 254 293 780
Germany 15 174 182 217 573 11 78 78 53 209 26 252 260 270 782
In [25]:
# Return a single-row DataFrame for row labeled France
df.loc[['France']]
Out[25]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
France 27 202 223 246 671 22 31 31 47 109 49 233 254 293 780

Selecting columns¶

In [26]:
# Return column as a Series (column name and single brackets)
df['Summer silver'].head()
Out[26]:
Afghanistan     0
Algeria         2
Argentina      24
Armenia         2
Australasia     4
Name: Summer silver, dtype: int64
In [27]:
# Return column as a DataFrame (column name and double brackets)
df[['Summer silver']].head()
Out[27]:
Summer silver
Afghanistan 0
Algeria 2
Argentina 24
Armenia 2
Australasia 4

We can select multiple columns from a dataframe by passing a list of column names rather than a single name. (A data frame is returned when using the double bracket operator)

In [28]:
# Return multiple columns as DataFrame (list of column names and single brackets)
cnames = ['Summer gold', 'Summer silver', 'Summer bronze']
df[cnames].head()
Out[28]:
Summer gold Summer silver Summer bronze
Afghanistan 0 0 2
Algeria 5 2 8
Argentina 18 24 28
Armenia 1 2 9
Australasia 3 4 5

Adding and deleting columns¶

A new column is created simply by assigning data to a column that does not already exist. In the example below, we create a new column that is a weighted sum of all gold, silver and bronze medals. Note that we're using single brackets and operating on series

In [29]:
df['Combined weighted'] = df['Combined gold']*3 + df['Combined silver']*2 + df['Combined bronze']
df.head()
Out[29]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total Combined weighted
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 2
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 27
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 130
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 16
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 22

To delete a column, use the del operator or use the drop method with axis 1

In [30]:
# This will also work: "df.drop('Combined weighted', 1)"

del df['Combined weighted']
df.head()
Out[30]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12

Putting it all together¶

In this section we show how to find min and max values of columns and the indexes of the corresponding rows. We also show how to filter by column values and combine slicing by row and column. We start by finding the country that won the most winter gold medals and the number they won.

In [31]:
df['Winter gold'].max()
Out[31]:
118

Unlike Pandas, cudf does not support idxmax() which it returns the index or label of the maximum value. Instead, to retrieve the index of the maximum value, you can use a boolean mask to filter the DataFrame and then access the index directly, as demonstrated.

In [40]:
max_indices= df[df['Winter gold'] == 118].index
In [42]:
max_indices[0]
Out[42]:
'Norway'

We can select rows from the dataframe based on the values in a column. In the example below, we filter on countries that have won at least 50 gold medals in the winter Olympics.

In [44]:
df.loc[ df['Winter gold']>50 ]
Out[44]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Austria 26 18 33 35 86 22 59 78 81 218 48 77 111 116 304
Canada 25 59 99 121 279 22 62 56 52 170 47 121 155 173 449
Germany 15 174 182 217 573 11 78 78 53 209 26 252 260 270 782
Norway 24 56 49 43 148 22 118 111 100 329 46 174 160 143 477
Soviet Union 9 395 319 296 1010 9 78 57 59 194 18 473 376 355 1204
United States 26 976 757 666 2399 22 96 102 84 282 48 1072 859 750 2681

We can filter on multiple columns. In the example below, we limit the output to countries that also won more than 50 summer gold medals and assign the results to a new dataframe

In [45]:
df2 = df.loc[ (df['Winter gold']>50) & (df['Summer gold']>50) ]
df2
Out[45]:
Summer games Summer gold Summer silver Summer bronze Summer total Winter games Winter gold Winter silver Winter bronze Winter total Combined games Combined gold Combined silver Combined bronze Combined total
Canada 25 59 99 121 279 22 62 56 52 170 47 121 155 173 449
Germany 15 174 182 217 573 11 78 78 53 209 26 252 260 270 782
Norway 24 56 49 43 148 22 118 111 100 329 46 174 160 143 477
Soviet Union 9 395 319 296 1010 9 78 57 59 194 18 473 376 355 1204
United States 26 976 757 666 2399 22 96 102 84 282 48 1072 859 750 2681

Let's create a simpler dataframe that is limited to the gold medal results and then use the sum method to sum the values in a column.

In [46]:
df2 = df2[['Summer gold', 'Winter gold']]
df2
Out[46]:
Summer gold Winter gold
Canada 59 62
Germany 174 78
Norway 56 118
Soviet Union 395 78
United States 976 96
In [47]:
df2['Winter gold'].sum()
Out[47]:
432

We can also select by row and column simultaneously to create a new data frame

In [48]:
df[["Winter gold", "Winter silver", "Winter bronze"]].iloc[10:15]
Out[48]:
Winter gold Winter silver Winter bronze
Barbados 0 0 0
Belarus 6 4 5
Belgium 1 1 3
Bermuda 0 0 0
Bohemia 0 0 0

Since slicing by row is so common, pandas lets you skip using iloc and simply provide a range of indices

In [49]:
df[["Winter gold", "Winter silver", "Winter bronze"]][10:15]
Out[49]:
Winter gold Winter silver Winter bronze
Barbados 0 0 0
Belarus 6 4 5
Belgium 1 1 3
Bermuda 0 0 0
Bohemia 0 0 0

Selection can also be done by row labels using loc with a single label or list of labels

In [50]:
df[["Winter gold", "Winter silver", "Winter bronze"]].loc[['Barbados', 'Belarus', 'Belgium']]
Out[50]:
Winter gold Winter silver Winter bronze
Barbados 0 0 0
Belarus 6 4 5
Belgium 1 1 3

Reading from other file formats¶

Until now, we've been working with csv files, but pandas can handle many other formats including json, html, excel and HDF. This is extremely useful since we don't need to create csv files from richer data formats. We show an example below where we read two sheets from an Excel file

First, we will load Pandas to read the Excel file. After that, we will convert the DataFrame to a cuDF DataFrame.

In [83]:
df4 = pd.read_excel('city temps spreadsheet.xlsx', index_col=0, skiprows=0, sheet_name='set1')
df_cudf_4 = cudf.DataFrame.from_records(df4.to_records(index=False))
df_cudf_4
Out[83]:
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
0 71 NaN 75.0 78.0 80.0 81 79
1 58 56.0 NaN 54.0 50.0 61 63
2 92 91.0 90.0 NaN NaN 85 82
3 72 72.0 72.0 70.0 NaN 71 68
4 61 63.0 61.0 NaN 60.0 61 68
In [84]:
df5 = pd.read_excel('city temps spreadsheet.xlsx', index_col=0, skiprows=0, sheet_name='set2')
df_cudf_5 = cudf.DataFrame.from_records(df5.to_records(index=False))
df_cudf_5
Out[84]:
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
0 71 NaN 75.0 78.0 80.0 81 79
1 58 56.0 NaN 54.0 50.0 61 63
2 92 91.0 90.0 NaN NaN 85 82
3 72 72.0 72.0 70.0 NaN 71 68
4 61 63.0 61.0 NaN 60.0 61 68

From numpy array to data frame¶

First, we will create a NumPy array, and then we will convert it to a CuPy DataFrame for GPU-accelerated computations.

In [92]:
a = np.random.rand(10, 3)
a_cupy = cp.asarray(a)
In [93]:
df6 = pd.DataFrame(a, columns=['feature 1', 'feature 2', 'feature 3'])
df6
Out[93]:
feature 1 feature 2 feature 3
0 0.742098 0.017849 0.452410
1 0.323730 0.553193 0.274448
2 0.144275 0.944165 0.308382
3 0.607335 0.273645 0.808574
4 0.018208 0.587484 0.950142
5 0.535471 0.701922 0.425869
6 0.727945 0.662824 0.571413
7 0.975310 0.122992 0.356427
8 0.135503 0.190317 0.056039
9 0.340796 0.744023 0.526396

Submit Ticket¶

If you find anything that needs to be changed, edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:

Email: consult@sdsc.edu

We appreciate your input and will review your suggestions promptly!

In [ ]: