Source Information¶
Created by:
Updated by: October 25, 2024 updated by Gloria Seo
Resources: https://pandas.pydata.org/
Goal¶
This notebook introduces how to use covers pandas, a useful Python data analysis toolkit. We will look at two pandas objects: Series and DataFrame (1D and 2D data structures).
Pandas - a quick introduction¶
In this tutorial, we'll explore pandas, the Python data analysis toolkit. Our emphasis will be on working with the two primary pandas objects: Series and DataFrame. Although we don't have time to study all the features of pandas, extensive documentation can be found here https://pandas.pydata.org/pandas-docs/stable/dsintro.html
- Series: one-dimensional labeled array capable of holding any data type
- DataFrame: two-dimensional labeled data structure with columns of potentially different types.
Required Modules for the Jupyter Notebook¶
Before running the notebook, we need the following modules.
Module:numpy, pandas
import pandas as pd
import numpy as np
Load CSV Data Set¶
Let's dive into an Olympic Medal dataset available from Wikipedia https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table. I've already done a little bit of cleanup so that we can quickly get to the important features.
Our file is in csv format, so we'll use the read_csv method. We know that the first two rows contain comments and other data that we won't want. We can use the skiprows argument to skip over these rows. We'll set index_col to zero so that the first column serves as the index. After loading the DataFrame, we'll look at the first and last few rows using the head and tail methods
df = pd.read_csv('olympics.csv', index_col=0, skiprows=2)
df.head(3)
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
df.tail(3)
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zimbabwe (ZIM) [ZIM] | 12 | 3 | 4 | 1 | 8 | 1 | 0 | 0 | 0 | 0 | 13 | 3 | 4 | 1 | 8 |
| Mixed team (ZZX) [ZZX] | 3 | 8 | 5 | 4 | 17 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 5 | 4 | 17 |
| Totals | 27 | 4809 | 4775 | 5130 | 14714 | 22 | 959 | 958 | 948 | 2865 | 49 | 5768 | 5733 | 6078 | 17579 |
Note that the last row of our dataframe contains totals for the number of games and medals won. Let's get rid of that using the drop method.
df = df.drop('Totals')
df.tail(3)
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zambia (ZAM) [ZAM] | 12 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 1 | 1 | 2 |
| Zimbabwe (ZIM) [ZIM] | 12 | 3 | 4 | 1 | 8 | 1 | 0 | 0 | 0 | 0 | 13 | 3 | 4 | 1 | 8 |
| Mixed team (ZZX) [ZZX] | 3 | 8 | 5 | 4 | 17 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 5 | 4 | 17 |
We can use the shape and size attributes of the dataframe to determine the dimensions and number of cells. Note that the dimensions do not include the column headers and index column.
df.shape
(146, 15)
df.size
2190
The columns attribute returns the column names
df.columns
Index(['Summer games', 'Summer gold', 'Summer silver', 'Summer bronze',
'Summer total', 'Winter games', 'Winter gold', 'Winter silver',
'Winter bronze', 'Winter total', 'Combined games', 'Combined gold',
'Combined silver', 'Combined bronze', 'Combined total'],
dtype='object')
The row labels are returned using the index attribute
df.index
Index(['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)',
'Armenia (ARM)', 'Australasia (ANZ) [ANZ]', 'Australia (AUS) [AUS] [Z]',
'Austria (AUT)', 'Azerbaijan (AZE)', 'Bahamas (BAH)', 'Bahrain (BRN)',
...
'Uruguay (URU)', 'Uzbekistan (UZB)', 'Venezuela (VEN)', 'Vietnam (VIE)',
'Virgin Islands (ISV)', 'Yugoslavia (YUG) [YUG]',
'Independent Olympic Participants (IOP) [IOP]', 'Zambia (ZAM) [ZAM]',
'Zimbabwe (ZIM) [ZIM]', 'Mixed team (ZZX) [ZZX]'],
dtype='object', length=146)
We can manipulate the data in our data frame. For example, let's cleanup the country names to get rid of everything from the country abbreviation to the end of the string. For example "Australia (AUS) [AUS] [Z]" becomes simply "Australia".
We'll start by calling the str.split method to split the country names on the white space and opening parenthesis "(" and then reassign the first element of the resulting list to the index. We then display the head of the dataframe to confirm that the renaming of the countries worked as expected.
names_ids = df.index.str.split('\s\(')
df.index = names_ids.str[0]
df.head(3)
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Single and double square brackets / accessing rows and columns¶
To work with DataFrames, we just need to master a few key concepts
Single and double brackets
- Single brackets [ ] with a single value return a new Series
- Single brackets [ ] with multiple values return a new DataFrame
- Double brackets [[ ]] return a new DataFrame
Selecting data by row
- Rows are accessed by position using iloc
- Rows are accessed by row name or label using loc
Selecting data by column
- Columns are accessed by column name
Selecting rows¶
# Return a Series selecting row 2 (iloc and single brackets)
df.iloc[2]
Summer games 23 Summer gold 18 Summer silver 24 Summer bronze 28 Summer total 70 Winter games 18 Winter gold 0 Winter silver 0 Winter bronze 0 Winter total 0 Combined games 41 Combined gold 18 Combined silver 24 Combined bronze 28 Combined total 70 Name: Argentina, dtype: int64
# Return a DataFrame selecting row 2-6 (iloc and single brackets)
df.iloc[2:5]
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
# Return a single-row Data Frame selecting row 2 (iloc and double brackets)
df.iloc[[2]]
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
We can also access a row by the label or value of the index using the loc method. For example, if we wanted to get the data for France, it's more convenient to use the label than figure out the row number.
# Return a Series for row labeled France (loc and single brackets)
df.loc['France']
Summer games 27 Summer gold 202 Summer silver 223 Summer bronze 246 Summer total 671 Winter games 22 Winter gold 31 Winter silver 31 Winter bronze 47 Winter total 109 Combined games 49 Combined gold 233 Combined silver 254 Combined bronze 293 Combined total 780 Name: France, dtype: int64
# Return a DataFrame for rows labeled France and Germany (loc, single brackets, list argument)
countries = ['France', 'Germany']
df.loc[countries]
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| France | 27 | 202 | 223 | 246 | 671 | 22 | 31 | 31 | 47 | 109 | 49 | 233 | 254 | 293 | 780 |
| Germany | 15 | 174 | 182 | 217 | 573 | 11 | 78 | 78 | 53 | 209 | 26 | 252 | 260 | 270 | 782 |
# Return a single-row DataFrame for row labeled France
df.loc[['France']]
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| France | 27 | 202 | 223 | 246 | 671 | 22 | 31 | 31 | 47 | 109 | 49 | 233 | 254 | 293 | 780 |
Selecting columns¶
# Return column as a Series (column name and single brackets)
df['Summer silver'].head()
Afghanistan 0 Algeria 2 Argentina 24 Armenia 2 Australasia 4 Name: Summer silver, dtype: int64
# Return column as a DataFrame (column name and double brackets)
df[['Summer silver']].head()
| Summer silver | |
|---|---|
| Afghanistan | 0 |
| Algeria | 2 |
| Argentina | 24 |
| Armenia | 2 |
| Australasia | 4 |
We can select multiple columns from a dataframe by passing a list of column names rather than a single name. (A data frame is returned when using the double bracket operator)
# Return multiple columns as DataFrame (list of column names and single brackets)
cnames = ['Summer gold', 'Summer silver', 'Summer bronze']
df[cnames].head()
| Summer gold | Summer silver | Summer bronze | |
|---|---|---|---|
| Afghanistan | 0 | 0 | 2 |
| Algeria | 5 | 2 | 8 |
| Argentina | 18 | 24 | 28 |
| Armenia | 1 | 2 | 9 |
| Australasia | 3 | 4 | 5 |
Adding and deleting columns¶
A new column is created simply by assigning data to a column that does not already exist. In the example below, we create a new column that is a weighted sum of all gold, silver and bronze medals. Note that we're using single brackets and operating on series
df['Combined weighted'] = df['Combined gold']*3 + df['Combined silver']*2 + df['Combined bronze']
df.head()
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | Combined weighted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | 2 |
| Algeria | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | 27 |
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | 130 |
| Armenia | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | 16 |
| Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | 22 |
To delete a column, use the del operator or use the drop method with axis 1
# This will also work: "df.drop('Combined weighted', 1)"
del df['Combined weighted']
df.head()
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Putting it all together¶
In this section we show how to find min and max values of columns and the indexes of the corresponding rows. We also show how to filter by column values and combine slicing by row and column. We start by finding the country that won the most winter gold medals and the number they won.
df['Winter gold'].max()
118
df['Winter gold'].idxmax()
'Norway'
We can select rows from the dataframe based on the values in a column. In the example below, we filter on countries that have won at least 50 gold medals in the winter Olympics.
df.loc[ df['Winter gold']>50 ]
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austria | 26 | 18 | 33 | 35 | 86 | 22 | 59 | 78 | 81 | 218 | 48 | 77 | 111 | 116 | 304 |
| Canada | 25 | 59 | 99 | 121 | 279 | 22 | 62 | 56 | 52 | 170 | 47 | 121 | 155 | 173 | 449 |
| Germany | 15 | 174 | 182 | 217 | 573 | 11 | 78 | 78 | 53 | 209 | 26 | 252 | 260 | 270 | 782 |
| Norway | 24 | 56 | 49 | 43 | 148 | 22 | 118 | 111 | 100 | 329 | 46 | 174 | 160 | 143 | 477 |
| Soviet Union | 9 | 395 | 319 | 296 | 1010 | 9 | 78 | 57 | 59 | 194 | 18 | 473 | 376 | 355 | 1204 |
| United States | 26 | 976 | 757 | 666 | 2399 | 22 | 96 | 102 | 84 | 282 | 48 | 1072 | 859 | 750 | 2681 |
We can filter on multiple columns. In the example below, we limit the output to countries that also won more than 50 summer gold medals and assign the results to a new dataframe
df2 = df.loc[ (df['Winter gold']>50) & (df['Summer gold']>50) ]
df2
| Summer games | Summer gold | Summer silver | Summer bronze | Summer total | Winter games | Winter gold | Winter silver | Winter bronze | Winter total | Combined games | Combined gold | Combined silver | Combined bronze | Combined total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Canada | 25 | 59 | 99 | 121 | 279 | 22 | 62 | 56 | 52 | 170 | 47 | 121 | 155 | 173 | 449 |
| Germany | 15 | 174 | 182 | 217 | 573 | 11 | 78 | 78 | 53 | 209 | 26 | 252 | 260 | 270 | 782 |
| Norway | 24 | 56 | 49 | 43 | 148 | 22 | 118 | 111 | 100 | 329 | 46 | 174 | 160 | 143 | 477 |
| Soviet Union | 9 | 395 | 319 | 296 | 1010 | 9 | 78 | 57 | 59 | 194 | 18 | 473 | 376 | 355 | 1204 |
| United States | 26 | 976 | 757 | 666 | 2399 | 22 | 96 | 102 | 84 | 282 | 48 | 1072 | 859 | 750 | 2681 |
Let's create a simpler dataframe that is limited to the gold medal results and then use the sum method to sum the values in a column.
df2 = df2[['Summer gold', 'Winter gold']]
df2
| Summer gold | Winter gold | |
|---|---|---|
| Canada | 59 | 62 |
| Germany | 174 | 78 |
| Norway | 56 | 118 |
| Soviet Union | 395 | 78 |
| United States | 976 | 96 |
df2['Winter gold'].sum()
432
We can also select by row and column simultaneously to create a new data frame
df[["Winter gold", "Winter silver", "Winter bronze"]].iloc[10:15]
| Winter gold | Winter silver | Winter bronze | |
|---|---|---|---|
| Barbados | 0 | 0 | 0 |
| Belarus | 6 | 4 | 5 |
| Belgium | 1 | 1 | 3 |
| Bermuda | 0 | 0 | 0 |
| Bohemia | 0 | 0 | 0 |
Since slicing by row is so common, pandas lets you skip using iloc and simply provide a range of indices
df[["Winter gold", "Winter silver", "Winter bronze"]][10:15]
| Winter gold | Winter silver | Winter bronze | |
|---|---|---|---|
| Barbados | 0 | 0 | 0 |
| Belarus | 6 | 4 | 5 |
| Belgium | 1 | 1 | 3 |
| Bermuda | 0 | 0 | 0 |
| Bohemia | 0 | 0 | 0 |
Selection can also be done by row labels using loc with a single label or list of labels
df[["Winter gold", "Winter silver", "Winter bronze"]].loc[['Barbados', 'Belarus', 'Belgium']]
| Winter gold | Winter silver | Winter bronze | |
|---|---|---|---|
| Barbados | 0 | 0 | 0 |
| Belarus | 6 | 4 | 5 |
| Belgium | 1 | 1 | 3 |
Missing values¶
Observation data often has missing values, which can lead to problems with their analysis. Fortunately, pandas provides fillna and interpolate methods for filling in these missing values (data imputation)
Let's explore this using a small data set containing temperatures in US cities over the course of a week.
# Missing values are dislayed as NaNs (Not a Number)
df3 = pd.read_csv('city temps.csv', index_col=0, skiprows=0)
df3
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71.0 | NaN | 75.0 | 78.0 | 80.0 | 81.0 | 79.0 |
| Boston | 58.0 | 56.0 | NaN | 54.0 | 50.0 | 61.0 | 63.0 |
| Dallas | 92.0 | 91.0 | 90.0 | NaN | NaN | 85.0 | 82.0 |
| San Diego | 72.0 | 72.0 | 72.0 | 70.0 | NaN | 71.0 | 68.0 |
| Seattle | 61.0 | 63.0 | 61.0 | NaN | 60.0 | 61.0 | 68.0 |
# The simplest method is to replace missing values with a fixed value
df3.fillna(value=70)
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71.0 | 70.0 | 75.0 | 78.0 | 80.0 | 81.0 | 79.0 |
| Boston | 58.0 | 56.0 | 70.0 | 54.0 | 50.0 | 61.0 | 63.0 |
| Dallas | 92.0 | 91.0 | 90.0 | 70.0 | 70.0 | 85.0 | 82.0 |
| San Diego | 72.0 | 72.0 | 72.0 | 70.0 | 70.0 | 71.0 | 68.0 |
| Seattle | 61.0 | 63.0 | 61.0 | 70.0 | 60.0 | 61.0 | 68.0 |
# Forward filling using the last valid value to fill missing value
# Note that in this case we used axis=1 so that we propogate across
# the rows rather than the values
df3.fillna(method="ffill", axis=1)
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71.0 | 71.0 | 75.0 | 78.0 | 80.0 | 81.0 | 79.0 |
| Boston | 58.0 | 56.0 | 56.0 | 54.0 | 50.0 | 61.0 | 63.0 |
| Dallas | 92.0 | 91.0 | 90.0 | 90.0 | 90.0 | 85.0 | 82.0 |
| San Diego | 72.0 | 72.0 | 72.0 | 70.0 | 70.0 | 71.0 | 68.0 |
| Seattle | 61.0 | 63.0 | 61.0 | 61.0 | 60.0 | 61.0 | 68.0 |
# Back filling using the NEXT valid value to fill missing value
# Note that in this case we used axis=1 so that we propogate across
# the rows rather than the values
df3.fillna(method="bfill", axis=1)
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71.0 | 75.0 | 75.0 | 78.0 | 80.0 | 81.0 | 79.0 |
| Boston | 58.0 | 56.0 | 54.0 | 54.0 | 50.0 | 61.0 | 63.0 |
| Dallas | 92.0 | 91.0 | 90.0 | 85.0 | 85.0 | 85.0 | 82.0 |
| San Diego | 72.0 | 72.0 | 72.0 | 70.0 | 71.0 | 71.0 | 68.0 |
| Seattle | 61.0 | 63.0 | 61.0 | 60.0 | 60.0 | 61.0 | 68.0 |
# As of version 0.17.0, pandas provides an interpolate function that fills missing values
# By default, we get a linear interpolation, but a number of other options are available
# (quadratic, cubic, polynomial, etc.)
df3.interpolate(method="linear", axis=1)
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71.0 | 73.0 | 75.0 | 78.000000 | 80.000000 | 81.0 | 79.0 |
| Boston | 58.0 | 56.0 | 55.0 | 54.000000 | 50.000000 | 61.0 | 63.0 |
| Dallas | 92.0 | 91.0 | 90.0 | 88.333333 | 86.666667 | 85.0 | 82.0 |
| San Diego | 72.0 | 72.0 | 72.0 | 70.000000 | 70.500000 | 71.0 | 68.0 |
| Seattle | 61.0 | 63.0 | 61.0 | 60.500000 | 60.000000 | 61.0 | 68.0 |
Reading from other file formats¶
Until now, we've been working with csv files, but pandas can handle many other formats including json, html, excel and HDF. This is extremely useful since we don't need to create csv files from richer data formats. We show an example below where we read two sheets from an Excel file
df4 = pd.read_excel('city temps spreadsheet.xlsx', index_col=0, skiprows=0, sheet_name='set1')
df4
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| New York | 71 | NaN | 75.0 | 78.0 | 80.0 | 81 | 79 |
| Boston | 58 | 56.0 | NaN | 54.0 | 50.0 | 61 | 63 |
| Dallas | 92 | 91.0 | 90.0 | NaN | NaN | 85 | 82 |
| San Diego | 72 | 72.0 | 72.0 | 70.0 | NaN | 71 | 68 |
| Seattle | 61 | 63.0 | 61.0 | NaN | 60.0 | 61 | 68 |
df5 = pd.read_excel('city temps spreadsheet.xlsx', index_col=0, skiprows=0, sheet_name='set2')
df5
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | |
|---|---|---|---|---|---|---|---|
| Atlanta | 71 | NaN | 75.0 | 78.0 | 80.0 | 81 | 79 |
| Portland | 58 | 56.0 | NaN | 54.0 | 50.0 | 61 | 63 |
| Phoenix | 92 | 91.0 | 90.0 | NaN | NaN | 85 | 82 |
| Las Vegas | 72 | 72.0 | 72.0 | 70.0 | NaN | 71 | 68 |
| Chicago | 61 | 63.0 | 61.0 | NaN | 60.0 | 61 | 68 |
From numpy array to data frame¶
Many Python packages can be used with numpy arrays or pandas data frames, but it's often easier to work with the latter. Pandas provides a simple way to convert an array to a data frame.
a = np.random.rand(10,3)
a
array([[0.93767545, 0.52905276, 0.37840219],
[0.69301731, 0.65304621, 0.3063061 ],
[0.69479708, 0.92496885, 0.22242732],
[0.03814035, 0.27154036, 0.59482261],
[0.41736065, 0.15261061, 0.7636656 ],
[0.01100607, 0.64267618, 0.28872264],
[0.61810238, 0.30143727, 0.09309254],
[0.64356449, 0.97165521, 0.11659938],
[0.151108 , 0.658962 , 0.05964654],
[0.90910788, 0.85471942, 0.51000376]])
df6 = pd.DataFrame(a, columns=['feature 1', 'feature 2', 'feature 3'])
df6
| feature 1 | feature 2 | feature 3 | |
|---|---|---|---|
| 0 | 0.937675 | 0.529053 | 0.378402 |
| 1 | 0.693017 | 0.653046 | 0.306306 |
| 2 | 0.694797 | 0.924969 | 0.222427 |
| 3 | 0.038140 | 0.271540 | 0.594823 |
| 4 | 0.417361 | 0.152611 | 0.763666 |
| 5 | 0.011006 | 0.642676 | 0.288723 |
| 6 | 0.618102 | 0.301437 | 0.093093 |
| 7 | 0.643564 | 0.971655 | 0.116599 |
| 8 | 0.151108 | 0.658962 | 0.059647 |
| 9 | 0.909108 | 0.854719 | 0.510004 |
Submit Ticket¶
If you find anything that needs to be changed, edited, or if you would like to provide feedback or contribute to the notebook, please submit a ticket by contacting us at:
Email: consult@sdsc.edu
We appreciate your input and will review your suggestions promptly!