Pandas
This document is an extract of the Pandas notebook from Level 6. Although this is not an interactive webpage, it is written as if it were. This document assumes that you have already completed the Pandas notebook and are only using this as a quick reference.
Introduction
When processing data, you will most often work with structured data. This is the kind of data where you have a table with some number of named columns, each corresponding to a different feature, and a sequence of rows, each corresponding to a different sample. Examples of structured data can be found everywhere and occur any time you make a series of observations and write down some properties for each observation.
Some examples of structured data are:
- Weather reports, where you have columns such as 'Date', 'Temperature' and 'Humidity', and each row shows the temperature and humidity on some date.
- Company records: companies use structured data everywhere, from user accounts to product information and from purchases to reviews.
- A grocery list, which is a very simple type of structured data, where you have a 'Product' and 'Amount' column indicating how many of which product you intend to buy.
Note on notation: There are many different ways to refer to the columns and rows in a table. This notebook will refer to the vertically orientated axis as ‘columns’ (but elsewhere this might be called a ‘field’, ‘feature’ or ‘property’) and the horizontal axis as ‘rows’ (also called ‘entries’, ‘samples’ or ‘observations’).
Pandas (often imported as pd) is the most popular library to use when dealing with structured data in Python. In this exercise notebook, we will explore some of the basics of working with Pandas. This notebook will not be able to cover everything you will ever need, but will hopefully give you an idea of what Pandas can do for you.
Some sections will end with a note on Further Reading; this will not be necessary for completing the exercises, but references relevant information that might help you in the future.
DataFrames
Structured data is nothing more than a table consisting of rows and columns. In Pandas, such a table is called a DataFrame. To create a DataFrame called df, you can use df = pd.DataFrame(data), where data is a list of rows, and each row is also a list containing the values for that row. Below you can see a simple example of creating a table with 2 rows and 3 columns.

You can have any number of rows, but it is important that each row has the same number of elements (i.e. columns), otherwise you will get an error. The example below uses Jupyter's display function to get a nice-looking output (although print also works).
import pandas as pd
data = [[1, 2, 3], # The first row has values 1, 2, 3
[4, 5, 6]] # The second row has values 4, 5, 6
df = pd.DataFrame(data)
display(df)
 | 0 | 1 | 2 |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
Now this is not a very descriptive table, since the columns are just called 0, 1 and 2. To change this, you can use the optional columns argument when creating the DataFrame, where you can specify the names of each of the columns using a list. Make sure to provide as many column names as there are columns in your data (the number of values in a single row), otherwise you will get an error.

Let's create another dataframe: the first column has values 'a' and 'b' and we call it 'Some letter', another column with values 2 and 5 we call 'Int', and finally a column with 3.5 and 6.2 we call 'A float'.
data = [['a', 2, 3.5],
['b', 5, 6.2]]
df = pd.DataFrame(data, columns=['Some letter', 'Int', 'A float'])
display(df)
 | Some letter | Int | A float |
---|---|---|---|
0 | a | 2 | 3.5 |
1 | b | 5 | 6.2 |
You might have already noticed that each of the rows also gets its own index on the left (by default this is 0 and 1). This is called the index of the dataframe (which you can inspect using df.index), and it can be set using the optional index argument, which takes a list of row names when creating the DataFrame.

Again, make sure you provide as many names in the index argument as there are rows (in this case 2 rows, and therefore 2 names); otherwise you will get an error.
data = [['a', 2, 3.5],
['b', 5, 6.2]]
df = pd.DataFrame(data,
index=['First Row', 'Row 2'],
columns=['Some letter', 'Int', 'A float']
)
display(df)
 | Some letter | Int | A float |
---|---|---|---|
First Row | a | 2 | 3.5 |
Row 2 | b | 5 | 6.2 |
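Since we just gave this DataFrame a custom index, df.index now holds those row names instead of the default 0, 1, ... range. Below is a quick illustrative sketch (assuming the df from the cell above is still defined; the exact output format may differ slightly between Pandas versions):

# Inspect the row labels of the DataFrame created above
print(df.index)
# Expected output, roughly: Index(['First Row', 'Row 2'], dtype='object')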
Now that we know how to create a DataFrame, let’s see what we can do with them!
This notebook will follow a narrative where you are hired as the new head of (and only member of) the Data Processing department of your local grocery store. This grocery store has kept a record of its products, and it is now up to you to process this data. The record lists each product, the category it belongs to, its price, the size (i.e. 'unit') of the product and the current stock.
We have provided you with a function get_df(), which gets you the data as a DataFrame for you to use. As you can see, it is indexed on the 'Product' and has the columns 'Category', 'Price', 'Unit' and 'Stock'. The example DataFrame is displayed below.
df = get_df()
display(df)
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Rice | International | 2.59 | 1 kg | 7 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Getting data from a DataFrame
A table consists of columns with values for each row; similarly, a DataFrame consists of Series with values for each index. Therefore, if you want to look at a single column of your data, you'll need to retrieve that Series from your DataFrame!

To see what columns are in your DataFrame, you can use the df.columns property:
print("The columns of `df`:")
display(df.columns)
# Indexing works the same way as a list:
print("\nThe second column is:", df.columns[1])
The columns of `df`:
Index(['Category', 'Price', 'Unit', 'Stock'], dtype='object')
The second column is: Price
To get a column from your DataFrame, you can index it in the same way you do with a dictionary: by using df['ColumnName'], where df is your DataFrame and ColumnName is the name of your column. The code below retrieves the 'Price' column from our DataFrame:
# Retrieve the 'Price' column from the DataFrame as a Series
prices = df['Price']
display(prices)
Apple 2.39
Banana 1.49
Broccoli 1.29
Carrot 0.45
Coffee 7.49
Eggplant 0.72
Juice 2.79
Lettuce 0.82
Penne 0.99
Rice 2.59
Spaghetti 0.99
Strawberry 3.49
Tagliatelle 1.89
Name: Price, dtype: float64
# Here you can see that this is indeed a 'Series' object:
print("The variable 'prices' is of type:", type(prices))
# You can always check the dtype of your Series by using .dtype:
print("This Series has dtype:", prices.dtype)
The variable 'prices' is of type: <class 'pandas.core.series.Series'>
This Series has dtype: float64
In the example above you can see that this Series doesn't just contain the values of the 'Price' column (displayed on the right), but also the index, which lists the product names from the original DataFrame (displayed on the left). This way, it is a lot easier to see which price belongs to what product! Lastly, you can see at the bottom what the dtype of this column is (in this case float64, since these are floating point numbers).

Accessing the data in that Series then also works in the same way as for dictionaries: you can just use the index as the key. For example, the price of Broccoli can be found by first retrieving the 'Price' column from the DataFrame, and then indexing the resulting Series on 'Broccoli':
# Retrieve the 'Price' column from the DataFrame as a Series
prices = df['Price']
# Get the 'Broccoli' value from that Series
price_broccoli = prices['Broccoli']
print("The price of Broccoli is:", price_broccoli)
The price of Broccoli is: 1.29
Series
A Series is a data structure specific to Pandas that has aspects of both lists and dictionaries. It is like a list in that it is an ordered sequence of data, meaning it just stores different values together in some order. However, the indices used for a Series don't need to be the integers $0, 1, \dots$, like with a list, but can be of any type, including strings, like the keys of a dictionary.

Note that all the values in a Series must be of the same type, like with Numpy ndarrays. In the example above, the Series has a dtype of float, meaning it can only contain floating point values. Lastly, a Series has an optional name, which here is automatically set to the name of the column from the DataFrame it came from.

If we want to create a Series, we can just define it using a list of indices and a list of values. Note that order matters here; position 0 in the index list will become the index for position 0 in the value list, and this will become the first element of the Series. Below is an example that creates a pd.Series called discount with products as its indices and discounted prices as its values.
discount_products = ['Strawberry', 'Apple', 'Coffee', 'Juice', 'Broccoli']
discount_prices = [2.99, 2.15, 4.99, 1.89, 0.99]
discount = pd.Series(
data=discount_prices, # Use discount_prices as the values
index=discount_products, # Use discount_products as the index
dtype=float, # Set the type for the `data` argument to floating point numbers
name="Discount", # Set the name for the Series to Discount
)
display(discount)
Strawberry 2.99
Apple 2.15
Coffee 4.99
Juice 1.89
Broccoli 0.99
Name: Discount, dtype: float64
The most basic version of a Series is just a sequence of data of the same type, meaning you actually don't need to provide an index for your Series. If you do not provide any index, Pandas will just use $0, 1, \dots$, like the indices of a list.

However, much of the power of Pandas comes from using this index to easily retrieve rows from your DataFrame (more on this in the section on Indexing, Selection and Masking). It is therefore always recommended to use some sort of index for your Series, if you can!

Below is an example of the most basic way to create a Series, by just providing a list of values:
basicSeries = pd.Series([6.4, 2.3, 1.0])
display(basicSeries)
0 6.4
1 2.3
2 1.0
dtype: float64
You can create a Series with any kind of data. Below are some more examples with different data types; pay attention to the resulting dtype for each!

Pandas tries its best to find the right type for your Series, and falls back on the object type if it cannot find a type that works. A Series must always be consistent in its type for all values in the Series, and the standard types for this are the same as for Numpy: integers, floats and booleans. Note that strings are not in this list, so strings also get the generic object type.

What you can do with a column largely depends on the dtype of that Series, so always try to pick the type that works for your data!
print("A Series with strings gets dtype 'object':")
display(pd.Series(['a', 'b', 'c']))
print("\nIf you provide a mix of integers and floats, it will convert all elements to floats:")
display(pd.Series([1, 2, 3.0]))
print("\nBut you can also force it to be a specific type:")
display(pd.Series([1, 2, 3.0], dtype=int))
print("\nIf you provide a mix of strings and other types, then everything becomes 'object's:")
display(pd.Series(['hello', 2, 3.0]))
print("\nAs a very useful extra, Pandas can even convert your date to a consistent type:")
display(pd.Series([ "2023 aug 02", "2021 8 23", "1970/01/01"], dtype="datetime64[ns]"))
A Series with strings gets dtype 'object':
0 a
1 b
2 c
dtype: object
If you provide a mix of integers and floats, it will convert all elements to floats:
0 1.0
1 2.0
2 3.0
dtype: float64
But you can also force it to be a specific type:
0 1
1 2
2 3
dtype: int64
If you provide a mix of strings and other types, then everything becomes 'object's:
0 hello
1 2
2 3.0
dtype: object
As a very useful extra, Pandas can even convert your date to a consistent type:
0 2023-08-02
1 2021-08-23
2 1970-01-01
dtype: datetime64[ns]
Adding and modifying data
You can add a column to your DataFrame in the same way we retrieved a column from the DataFrame before. If you have a Series called mySeries, you can add it to your DataFrame myDataFrame as a column named 'myColumn' using myDataFrame['myColumn'] = mySeries. Pandas will even match the index of your Series and DataFrame, so all values end up in the right place!

Let's use the discount Series from before and add it to the DataFrame. This will match all the indices in the Series with the DataFrame indices, and add the values in the right place. When there is no discount for a specific product, the value automatically gets set to NaN. We will come back to the different ways you can deal with the NaNs later, in the section on Handling missing values.
df = get_df()
print("The discount Series")
discount = pd.Series(discount_prices, index=discount_products)
display(discount)
print("Added to the DataFrame:")
df['Discount'] = discount
display(df)
The discount Series
Strawberry 2.99
Apple 2.15
Coffee 4.99
Juice 1.89
Broccoli 0.99
dtype: float64
Added to the DataFrame:
 | Category | Price | Unit | Stock | Discount |
---|---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 | 2.15 |
Banana | Fruits | 1.49 | 500 gr | 2 | NaN |
Broccoli | Vegetables | 1.29 | 500 gr | 3 | 0.99 |
Carrot | Vegetables | 0.45 | 1 pc | 22 | NaN |
Coffee | Breakfast | 7.49 | 1 kg | 10 | 4.99 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 | NaN |
Juice | Breakfast | 2.79 | 2 L | 3 | 1.89 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 | NaN |
Penne | Pastas | 0.99 | 500 gr | 3 | NaN |
Rice | International | 2.59 | 1 kg | 7 | NaN |
Spaghetti | Pastas | 0.99 | 500 gr | 0 | NaN |
Strawberry | Fruits | 3.49 | 400 gr | 10 | 2.99 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 | NaN |
We can also easily modify values in a Series using basic arithmetic operations. All arithmetic operations are element-wise by default, and broadcast when necessary, exactly like in Numpy.

Below is a simple example where we get a new inventory shipment with 10 items for each product, and need to update the DataFrame accordingly. Note that this increase is broadcast to every value in the Series here.

It is important to remember that any arithmetic operation always creates a new Series, and so doesn't modify the existing DataFrame! If we want to update the values in the DataFrame, we need to re-assign the new Series to the corresponding column.
print("Adding 10 to the stock of every product:")
increased_stock = df['Stock'] + 10
display(increased_stock)
print("\nReplacing the stock column in the DataFrame:")
df['Stock'] = increased_stock
display(df)
Adding 10 to the stock of every product:
Apple 16
Banana 12
Broccoli 13
Carrot 32
Coffee 20
Eggplant 11
Juice 13
Lettuce 15
Penne 13
Rice 17
Spaghetti 10
Strawberry 20
Tagliatelle 12
Name: Stock, dtype: int64
Replacing the stock column in the DataFrame:
 | Category | Price | Unit | Stock | Discount |
---|---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 16 | 2.15 |
Banana | Fruits | 1.49 | 500 gr | 12 | NaN |
Broccoli | Vegetables | 1.29 | 500 gr | 13 | 0.99 |
Carrot | Vegetables | 0.45 | 1 pc | 32 | NaN |
Coffee | Breakfast | 7.49 | 1 kg | 20 | 4.99 |
Eggplant | Vegetables | 0.72 | 1 pc | 11 | NaN |
Juice | Breakfast | 2.79 | 2 L | 13 | 1.89 |
Lettuce | Vegetables | 0.82 | 1 pc | 15 | NaN |
Penne | Pastas | 0.99 | 500 gr | 13 | NaN |
Rice | International | 2.59 | 1 kg | 17 | NaN |
Spaghetti | Pastas | 0.99 | 500 gr | 10 | NaN |
Strawberry | Fruits | 3.49 | 400 gr | 20 | 2.99 |
Tagliatelle | Pastas | 1.89 | 500 gr | 12 | NaN |
An important difference with Numpy is that element-wise operations are always matched by their index, so the order of the elements in the Series does not actually matter, only what their index is. We'll come back to this in more detail in the section Operating on DataFrames, but for now we'll just give another example, so you have a bit of an idea of the possibilities.

Let's say you now want to compute the discount percentage for each of the discounted prices, so you can use those to advertise your sale. First we'll need to compute the ratio of the discounted price compared to the original price and store the result in a new Series. We can do this by dividing the Discount column by the Price column:
discount_ratio = df['Discount'] / df['Price']
print("The ratio of the discounted price compared to the original price:")
display(discount_ratio)
The ratio of the discounted price compared to the original price:
Apple 0.90
Banana NaN
Broccoli 0.77
Carrot NaN
Coffee 0.67
Eggplant NaN
Juice 0.68
Lettuce NaN
Penne NaN
Rice NaN
Spaghetti NaN
Strawberry 0.86
Tagliatelle NaN
dtype: float64
Note that this division was indeed performed element by element and matched on the index, so the discounted price of an apple was divided by the original price of an apple, and so on. The products that had a NaN value for the discount then also automatically become NaN for the ratio.

Next, we can just multiply this by $100$ to get a percentage, and subtract that value from $100$ to get the percentage that was discounted. Lastly, we can then add this new Series back into the DataFrame as a new column:
discount_percentage = 100 - discount_ratio * 100
df['Percentage'] = discount_percentage
print('Discount percentage added to the DataFrame:')
display(df)
Discount percentage added to the DataFrame:
 | Category | Price | Unit | Stock | Discount | Percentage |
---|---|---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 16 | 2.15 | 10.04 |
Banana | Fruits | 1.49 | 500 gr | 12 | NaN | NaN |
Broccoli | Vegetables | 1.29 | 500 gr | 13 | 0.99 | 23.26 |
Carrot | Vegetables | 0.45 | 1 pc | 32 | NaN | NaN |
Coffee | Breakfast | 7.49 | 1 kg | 20 | 4.99 | 33.38 |
Eggplant | Vegetables | 0.72 | 1 pc | 11 | NaN | NaN |
Juice | Breakfast | 2.79 | 2 L | 13 | 1.89 | 32.26 |
Lettuce | Vegetables | 0.82 | 1 pc | 15 | NaN | NaN |
Penne | Pastas | 0.99 | 500 gr | 13 | NaN | NaN |
Rice | International | 2.59 | 1 kg | 17 | NaN | NaN |
Spaghetti | Pastas | 0.99 | 500 gr | 10 | NaN | NaN |
Strawberry | Fruits | 3.49 | 400 gr | 20 | 2.99 | 14.33 |
Tagliatelle | Pastas | 1.89 | 500 gr | 12 | NaN | NaN |
In the DataFrame we can now see the apples have a $10\%$ discount when sold for $2.15$ instead of $2.39$, but we might want to advertise with the $33\%$ discount on coffee instead!

With a couple of discounts this is easy to spot, but for a larger DataFrame you would use built-in aggregate functions instead. Aggregate functions combine the data in a column in some way, for example using .sum() to compute the total for a column. Here we'll need the .max() function to find the largest discount percentage and .idxmax() to find the product that largest discount belongs to. Note that NaN values in the Percentage column will just be ignored here, as this is the default behaviour for all aggregation functions.
largest_discount_percentage = df['Percentage'].max()
largest_discount_product = df['Percentage'].idxmax()
original_price = df['Price'][largest_discount_product]
discount_price = df['Discount'][largest_discount_product]
product_unit = df['Unit'][largest_discount_product]
print(f'Our best deal is a {int(largest_discount_percentage)}% discount on {largest_discount_product}:')
print(f'Usually {original_price} per {product_unit}, but now on sale for {discount_price}!')
Our best deal is a 33% discount on Coffee:
Usually 7.49 per 1 kg, but now on sale for 4.99!
Indexing, Selection and Masking
In the previous section we covered the most basic ways to retrieve data from a DataFrame, but there are many more ways to get data from a DataFrame in Pandas. As we've already covered how to retrieve columns, next we'll take a look at how to get specific rows from a DataFrame.

The first thing we need to answer is: what is a 'row' in Pandas? It turns out it's also a Series! That's great, because we just learned how to work with those. Let's start by taking another look at our example DataFrame:
df = get_df()
display(df)
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Rice | International | 2.59 | 1 kg | 7 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
There are several ways to access the rows in a DataFrame, but the most common ones are df.loc[...] and df.iloc[...]. These two methods look and work very similarly, but there are some important differences.

Using df.loc[...]

The .loc property retrieves rows by making use of the index we defined for our dataframe. This works similarly to a dictionary, where you can use the key (i.e. index) to get the row you want: df.loc['Banana'] will get you the row where the index is equal to 'Banana'.
df = get_df()
row = df.loc['Banana']
print("The row with index 'Banana' is:")
display(row)
print("\nThis row is of the type:", type(row))
The row with index 'Banana' is:
Category Fruits
Price 1.49
Unit 500 gr
Stock 2
Name: Banana, dtype: object
This row is of the type: <class 'pandas.core.series.Series'>
As you can see, the resulting row Series automatically becomes indexed by the columns from the original DataFrame. This means that, for example, you could now retrieve the 'Price' for a specific row using row['Price']. This works the same way for any row, regardless of whether you used .loc or .iloc to retrieve it.

Using df.iloc[...]

The .iloc property retrieves rows by the order in which the rows are stored in the DataFrame. It stands for "integer location" and works the same way as list indices: df.iloc[0] is the first row, df.iloc[1] is the second, et cetera.

It is important to note that if you re-order your dataframe (such as with sorting), the order of the rows will change and therefore also their "integer location"! After some operations, using df.iloc[0] might give a different result than before. This is different from df.loc, where you just get the row matching the index, regardless of its exact position in the DataFrame.
df = get_df()
row = df.iloc[0]
print("The first row of the DataFrame is:")
display(row)
print("\nThis row is of the type:", type(row))
The first row of the DataFrame is:
Category Fruits
Price 2.39
Unit 1 kg
Stock 6
Name: Apple, dtype: object
This row is of the type: <class 'pandas.core.series.Series'>
Rows vs. columns
Retrieving specific rows from a DataFrame can obviously be very useful, but there is an important caveat to keep in mind when working with rows.

The problem stems from the fact that a Series can only ever have one type. This works great for columns, as you expect a column of data to all be of the same type (e.g. a price column should always contain floats for each row). This is probably also the best way to look at what a DataFrame actually is: a collection of different columns of data, each with their own data type, that all share the same row index.

However, a single row of data will contain information from all of the different columns, which might have different types. For mixed types, Pandas will again try its best to convert the types to a consistent option, but if you have just one column with strings, then the whole row Series will become type 'object'. You can actually already see this happening in both the examples above!

Converting everything to 'object' type means you lose the type information from the columns. In addition, these rows are actually new Series that have to be built by retrieving the values from each column for an index and then combining them together into that Series (which is when the type conversion happens). This is much slower than getting a column, where you can simply retrieve the stored Series from the DataFrame.

As a result, the best option is usually to work with columns directly. Looping over the rows in a DataFrame might seem like a natural way to process data, as you're going through the data line by line, but this is actually very inefficient. Instead you should try to use element-wise operations to modify a whole column in a DataFrame, like the examples you've already seen in the section Adding and modifying data.

Note that it is still possible to loop through the rows of a DataFrame using functions like .iterrows(). This function will create a new Series for each row and return them one by one, which can sometimes be very useful. However, if you do find yourself using this, first ask yourself if there isn't a better way to solve your problem, as this type of approach should always be one of the last options you try.
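As a small illustrative sketch (not part of the original notebook), the code below shows what an .iterrows() loop looks like, next to the element-wise alternative that is usually preferred; it assumes the grocery df from get_df() is available:

df = get_df()
# Row-by-row loop: .iterrows() yields an (index, row Series) pair for every row
doubled_prices = {}
for product, row in df.iterrows():
    doubled_prices[product] = row['Price'] * 2
display(pd.Series(doubled_prices))

# The element-wise (vectorized) equivalent is shorter and much faster
display(df['Price'] * 2)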
Selecting multiple rows or columns
In addition to using .loc['Banana'] or .iloc[8] to obtain a single row, you can also pass a list of values to obtain multiple rows from your DataFrame in one go! This also works for columns, by passing a list of column names when indexing the DataFrame. Note that all these operations will always return a DataFrame, and not a Series like before - even if it is a DataFrame with only a single row or column!

Below are some examples using lists to index a DataFrame:
df = get_df()
print("The full dataframe:")
display(df)
rows1 = df.loc[ ['Carrot', 'Juice', 'Rice'] ]
print("\nThe rows with index 'Carrot', 'Juice' and 'Rice':")
display(rows1)
rows2 = df.loc[ ['Carrot'] ]
print("\nThe row with index 'Carrot', as a DataFrame:")
display(rows2)
rows3 = df.iloc[ [0, 2, 4] ]
print("\nFirst, third and fifth row, using a list of indices for iloc:")
display(rows3)
rows4 = df[ ['Category', 'Price'] ]
print("\nSelecting only the Category and Price columns")
display(rows4)
The full dataframe:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Rice | International | 2.59 | 1 kg | 7 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
The rows with index 'Carrot', 'Juice' and 'Rice':
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Rice | International | 2.59 | 1 kg | 7 |
The row with index 'Carrot', as a DataFrame:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Carrot | Vegetables | 0.45 | 1 pc | 22 |
First, third and fifth row, using a list of indices for iloc:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Selecting only the Category and Price columns
 | Category | Price |
---|---|---|
Apple | Fruits | 2.39 |
Banana | Fruits | 1.49 |
Broccoli | Vegetables | 1.29 |
Carrot | Vegetables | 0.45 |
Coffee | Breakfast | 7.49 |
Eggplant | Vegetables | 0.72 |
Juice | Breakfast | 2.79 |
Lettuce | Vegetables | 0.82 |
Penne | Pastas | 0.99 |
Rice | International | 2.59 |
Spaghetti | Pastas | 0.99 |
Strawberry | Fruits | 3.49 |
Tagliatelle | Pastas | 1.89 |
The df.iloc property works with the position index, and also supports 'slices':
- df.iloc[a:] returns everything starting from the a-th row.
- df.iloc[:b] returns everything up to (but not including) the b-th row.
- df.iloc[::c] returns every c-th row, starting with the first. You can also think of c as the 'step size' with which to go through the DataFrame. By default this is 1, which is just every row; a step size of 2 means you skip every second row, et cetera.

These can also be combined as df.iloc[a:b:c], which says: from the a-th to the b-th row, give every c-th row, starting with the a-th.
Further Reading: You can find more examples on slices here.
Below are some examples. Feel free to try out some different values in a new cell, and predict what you’ll see before running the code.
df = get_df()
print("Full DataFrame, with a new default range as index:")
# If you reset the index, it will use indices 0,1,... instead
# This will help see what integer indexes are for each row!
display(df.reset_index())
print("\n\nStarting at the tenth row:")
display(df.iloc[10:])
print("\n\nFirst five rows:")
display(df.iloc[:5])
# A negative index means: count backwards from the end
print("\n\nStarting at the second-to-last row:")
display(df.iloc[-2:])
# A bit of an advanced example:
# 1:10 says "Take the second to tenth rows (indices 1,2,3,4,5,6,7,8,9)"
# :2 says "And then skip every other row (indices 1,3,5,7,9)"
print("\n\nEvery second row up to the tenth row:")
display(df.iloc[1:10:2])
Full DataFrame, with a new default range as index:
 | index | Category | Price | Unit | Stock |
---|---|---|---|---|---|
0 | Apple | Fruits | 2.39 | 1 kg | 6 |
1 | Banana | Fruits | 1.49 | 500 gr | 2 |
2 | Broccoli | Vegetables | 1.29 | 500 gr | 3 |
3 | Carrot | Vegetables | 0.45 | 1 pc | 22 |
4 | Coffee | Breakfast | 7.49 | 1 kg | 10 |
5 | Eggplant | Vegetables | 0.72 | 1 pc | 1 |
6 | Juice | Breakfast | 2.79 | 2 L | 3 |
7 | Lettuce | Vegetables | 0.82 | 1 pc | 5 |
8 | Penne | Pastas | 0.99 | 500 gr | 3 |
9 | Rice | International | 2.59 | 1 kg | 7 |
10 | Spaghetti | Pastas | 0.99 | 500 gr | 0 |
11 | Strawberry | Fruits | 3.49 | 400 gr | 10 |
12 | Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Starting at the tenth row:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
First five rows:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Starting at the second-to-last row:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Every second row up to the tenth row:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Banana | Fruits | 1.49 | 500 gr | 2 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Rice | International | 2.59 | 1 kg | 7 |
Combining row and column selection
If you want to select specific rows and columns from a DataFrame, there are many ways you can do this in Pandas. First off, you can just combine the methods for selecting rows and columns you saw earlier, as these always return a DataFrame which you can then index again.
Some examples:
df = get_df()
print("\nSelecting columns first, then indexing the resulting rows using loc:")
cols = df[['Category', 'Price']]
res1 = cols.loc[['Carrot', 'Juice', 'Rice']]
display(res1)
print("\nIndexing the rows using their position with iloc, then selecting the columns:")
rows = df.iloc[[3, 6, 9]]
res2 = rows[['Category', 'Price']]
display(res2)
print("\nSelecting columns first, then slices on iloc, executed on the same line:")
res3 = df[['Category', 'Price']].iloc[3:10:3]
display(res3)
Selecting columns first, then indexing the resulting rows using loc:
 | Category | Price |
---|---|---|
Carrot | Vegetables | 0.45 |
Juice | Breakfast | 2.79 |
Rice | International | 2.59 |
Indexing the rows using their position with iloc, then selecting the columns:
 | Category | Price |
---|---|---|
Carrot | Vegetables | 0.45 |
Juice | Breakfast | 2.79 |
Rice | International | 2.59 |
Selecting columns first, then slices on iloc, executed on the same line:
 | Category | Price |
---|---|---|
Carrot | Vegetables | 0.45 |
Juice | Breakfast | 2.79 |
Rice | International | 2.59 |
Note that all these different methods end up with the same DataFrame! In fact, any combination of the row and column indexing methods you've seen before would work for this.

In addition, loc and iloc also support indexing rows and columns together in the same operation, which is a little more efficient than first creating new Series for the full rows and then selecting from them. With these methods the row indices must always come first (as loc and iloc are row-based operations), and then you can optionally add a column index too, separated by a comma. This combined indexing also works for single indices, where you give just one index for the rows or columns position instead of a list. Note that in those cases the row or column dimension is reduced, and the result will no longer be a DataFrame. A few examples:
df = get_df()
print("\nSelecting the rows with index Apple and Broccoli, and the columns Category and Stock")
display(df.loc[ ['Apple', 'Broccoli'], ['Category', 'Stock'] ])
print("\nSelecting the first and second row, and the second and fourth column:")
display(df.iloc[ [0, 1], [1, 3] ])
print("\nSelecting the Penne row and the Price column (which is just an element):")
display(df.loc['Penne', 'Price'])
print("\nSelecting only the first row, and the first and fourth column (which is a Series):")
display(df.iloc[0, [0, 3]])
Selecting the rows with index Apple and Broccoli, and the columns Category and Stock
 | Category | Stock |
---|---|---|
Apple | Fruits | 6 |
Broccoli | Vegetables | 3 |
Selecting the first and second row, and the second and fourth column:
 | Price | Stock |
---|---|---|
Apple | 2.39 | 6 |
Banana | 1.49 | 2 |
Selecting the Penne row and the Price column (which is just an element):
0.99
Selecting only the first row, and the first and fourth column (which is a Series):
Category Fruits
Stock 6
Name: Apple, dtype: object
All these different indexing methods can make Pandas especially confusing to newcomers, as they are all used interchangeably in explanations and examples you might find online. One of the goals of this notebook is to introduce all of these properly, so you can make a bit more sense of the different options you might encounter.

Below are 3 examples all using loc that look very similar, but do completely different things. Read the code of the examples and try to predict what each of them will do, before actually running the cell:
example1 = df.loc['Apple', 'Price']
print("\nExample 1:")
display(example1)
example2 = df.loc[['Apple', 'Broccoli']]
print("\nExample 2:")
display(example2)
example3 = df.loc[['Apple'], ['Price']]
print("\nExample 3:")
display(example3)
Example 1:
2.39
Example 2:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Example 3:
 | Price |
---|---|
Apple | 2.39 |
Masking
The last, and very useful, method we'll cover to select data from your DataFrame is masking. Masking is another concept borrowed directly from Numpy, and so works almost exactly the same way in Pandas.

Masks are used to select the parts of a DataFrame where some condition is True. You can create a mask by using boolean operators (like >, <= or ==) directly on a DataFrame or Series to define the condition where the mask should be True. This results in a broadcast comparison of the elements of the DataFrame/Series, and so creates a new DataFrame/Series containing only Trues and Falses. Applying a mask just takes the part(s) of your DataFrame where the mask is True, and leaves out the parts where it is False.

The code below uses the Stock column to create a mask called mask where the Stock column is less than four, and then uses it to show which items in the DataFrame are low in stock. Note that the dtype of the mask becomes boolean and it indeed contains True only for the rows that have a Stock less than 4.
df = get_df()
mask = df['Stock'] < 4
print("\nMask where Stock is less than 4:")
display(mask)
print("\nUsing the mask on the DataFrame gives:")
display(df[mask])
Mask where Stock is less than 4:
Apple False
Banana True
Broccoli True
Carrot False
Coffee False
Eggplant True
Juice True
Lettuce False
Penne True
Rice False
Spaghetti True
Strawberry False
Tagliatelle True
Name: Stock, dtype: bool
Using the mask on the DataFrame gives:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Pandas also has built-in functions to create masks. One such example is the pd.Series.isin(values) method. This mask is True whenever an element in the Series is equal to one of the elements you provided in the list of values.

As an example, the code below creates a mask where the values in the 'Category' column are either 'Vegetables' or 'Fruits'.
df = get_df()
mask = df['Category'].isin(['Vegetables', 'Fruits'])
print("\nThe mask where the category is either Fruits or Vegetables:")
display(mask)
print("\n\nThe result of applying this mask on the DataFrame:")
display(df[mask])
The mask where the category is either Fruits or Vegetables:
Apple True
Banana True
Broccoli True
Carrot True
Coffee False
Eggplant True
Juice False
Lettuce True
Penne False
Rice False
Spaghetti False
Strawberry True
Tagliatelle False
Name: Category, dtype: bool
The result of applying this mask on the DataFrame:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
You can also use logical operations, like the logical or (|) and the logical and (&), to combine two masks element-wise. The first creates a new mask where each element is True if at least one of the masks was True for that element, and the second creates a new mask where each element is True if both masks were True for that element.
expensive = df['Price'] > 1
low_stock = df['Stock'] < 4
combined_mask = expensive & low_stock
print("\nThe combined mask for items that are both expensive and low in stock:")
display(combined_mask)
print("\nSelecting out the rows where the combined mask is True:")
display(df[combined_mask])
The combined mask for items that are both expensive and low in stock:
Apple False
Banana True
Broccoli True
Carrot False
Coffee False
Eggplant False
Juice True
Lettuce False
Penne False
Rice False
Spaghetti False
Strawberry False
Tagliatelle True
dtype: bool
Selecting out the rows where the combined mask is True:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Further Reading: The Python Data Science Handbook has a section on data indexing and selection and using masks. The documentation on .iloc and .loc also contains more examples.
Operating on DataFrames
In the section on Adding and modifying data you already saw that basic arithmetic operations are actually done element-wise by default, which is extremely useful. The general term for operations that are applied element-wise is "Universal Functions" (or Ufuncs), and they make libraries such as NumPy and Pandas incredibly powerful, convenient and computationally efficient. These Ufuncs do not only work for basic arithmetic (+, -, *, /, etc.) but also for more sophisticated operations (exponential, trigonometric, logarithmic functions and many more).

With all of these operations, it is important to keep track of the index of the Series you are using. For example, if you have a Series a with index 0,1,2 and a Series b with index 1,2,3, the operation c = a + b will have index 0,1,2,3 (the union of the indices of a and b). Only on indices 1 and 2 will it be able to compute a result; on indices 0 and 3 it will automatically fill in NaN ("Not a Number").
a = pd.Series(
data=[5, 2, 3],
index=[0, 1, 2]
)
b = pd.Series(
data=[7, 3, 1],
index=[1, 2, 3]
)
print("Series a:")
display(a)
print("\n\nSeries b:")
display(b)
print("\n\na+b gives:")
display(a + b)
Series a:
0 5
1 2
2 3
dtype: int64
Series b:
1 7
2 3
3 1
dtype: int64
a+b gives:
0 NaN
1 9.0
2 6.0
3 NaN
dtype: float64
As you can see, the values at indices 1 and 2 have been added element-wise, as they occur in both Series, while for indices 0 and 3 no valid result could be computed, so they become NaN. If you want a different result for the indices that only occur in one of the Series, you'll need to specify how to handle those cases with the pd.Series.add(...) method. Using a.add(b) is completely equivalent to a + b, but using .add allows you to specify a fill_value, which will be used whenever one of the Series is missing a value for some index.
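As a minimal sketch (reusing the a and b Series defined in the previous cell), fill_value=0 treats the missing entries as zero, so no NaN appears in the result:

# Indices 0 and 3 occur in only one of the two Series; the missing value is taken to be 0
display(a.add(b, fill_value=0))
# Expected result, roughly: 0 -> 5.0, 1 -> 9.0, 2 -> 6.0, 3 -> 1.0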
For example, we can set up a Series called sales with the daily sales for some products. If we want to subtract these from the 'Stock' Series in df, we need to specify what to do with the missing indices. In this case, we can fill these with zeros using the sub function and fill_value=0.
df = get_df()
# Extract the 'Stock' column
stock = df['Stock']
print("Stock before sales:")
display(stock)
# Set up a 'sales' Series
sales = pd.Series(
data=[1, 3, 1, 2, 1, 2],
index=['Lettuce', 'Carrot', 'Tagliatelle', 'Coffee', 'Apple', 'Strawberry'],
dtype=int,
name='Sales'
)
print("\n\nSales:")
display(sales)
print("\n\nStock after sales:")
display( stock.sub(sales, fill_value=0) )
# See what happens if you do not specify a `fill_value` by uncommenting the following line:
# display( stock.sub(sales) )
Stock before sales:
Apple 6
Banana 2
Broccoli 3
Carrot 22
Coffee 10
Eggplant 1
Juice 3
Lettuce 5
Penne 3
Rice 7
Spaghetti 0
Strawberry 10
Tagliatelle 2
Name: Stock, dtype: int64
Sales:
Lettuce 1
Carrot 3
Tagliatelle 1
Coffee 2
Apple 1
Strawberry 2
Name: Sales, dtype: int64
Stock after sales:
Apple 5.0
Banana 2.0
Broccoli 3.0
Carrot 19.0
Coffee 8.0
Eggplant 1.0
Juice 3.0
Lettuce 4.0
Penne 3.0
Rice 7.0
Spaghetti 0.0
Strawberry 8.0
Tagliatelle 1.0
dtype: float64
Further Reading: The Python Data Science Handbook on operations in Pandas.
Map
Next to these built-in operations, you can also apply your own function to each value in a Series. Below is a very simple example function that just returns the length of a string. Say you want to apply myFunc to each value in series_in and save the results in series_out; then you might write something like this:
# A very simple example function
def myFunc(value):
    newValue = len(value)
    return newValue
# Get some Series
series_in = pd.Series(["Hello", "world!"])
print("Input series:")
display(series_in)
# Save the new results in a list for now
series_out = []
# Go through each value in the input Series
for value in series_in:
    # Apply your function to the value, save in output list
    series_out.append(myFunc(value))
# Create a new series with new values
series_out = pd.Series(series_out)
print("\n\nOutput Series:")
display(series_out)
Input series:
0 Hello
1 world!
dtype: object
Output Series:
0 5
1 6
dtype: int64
But looping through values, appending them and then creating a new Series is quite inefficient. All of this can be replaced by using the pd.Series.map method. You provide the name of your function (myFunc in this case), and Pandas will map each value in your Series using your function:
# A very simple example function
def myFunc(value):
    # Do some operations here...
    newValue = len(value)
    return newValue
# Get some Series
series_in = pd.Series(["Hello", "world!"])
print("Input series:")
display(series_in)
# Apply myFunc to each value in series_in, save the result in series_out
series_out = series_in.map(myFunc)
print("\n\nOutput Series:")
display(series_out)
Input series:
0 Hello
1 world!
dtype: object
Output Series:
0 5
1 6
dtype: int64
Lambda functions
One thing you'll very commonly see in combination with map are lambda functions. Lambda functions are built into Python (so not a part of Pandas specifically), and can be used to quickly define functions you only want to use once, which is exactly what you want to do with a map most of the time.

For example, let's say you want to convert all the Categories to lowercase. Defining a function for this and using map would be a good solution:
def lowercase(x):
    return x.lower()
df = get_df()
df['Category'] = df['Category'].map(lowercase)
print("\nThe Category column replaced by the lowercase version:")
display(df)
The Category column replaced by the lowercase version:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | fruits | 2.39 | 1 kg | 6 |
Banana | fruits | 1.49 | 500 gr | 2 |
Broccoli | vegetables | 1.29 | 500 gr | 3 |
Carrot | vegetables | 0.45 | 1 pc | 22 |
Coffee | breakfast | 7.49 | 1 kg | 10 |
Eggplant | vegetables | 0.72 | 1 pc | 1 |
Juice | breakfast | 2.79 | 2 L | 3 |
Lettuce | vegetables | 0.82 | 1 pc | 5 |
Penne | pastas | 0.99 | 500 gr | 3 |
Rice | international | 2.59 | 1 kg | 7 |
Spaghetti | pastas | 0.99 | 500 gr | 0 |
Strawberry | fruits | 3.49 | 400 gr | 10 |
Tagliatelle | pastas | 1.89 | 500 gr | 2 |
An even more compact solution would be to use a lambda function here. A lambda function starts with the word lambda, followed by a space, the argument of your function (the example below uses x) and a colon (:). After that you write a single expression for the function, which defines what the function returns.

This is just a shorter way to write a function, where you don't need to give it a name, don't need to write return and can define it on the same line. The example above can be replaced by the following using a lambda function:
df = get_df()
df['Category'] = df['Category'].map(lambda x: x.lower())
print("\nThe Category column replaced by the lowercase version:")
display(df)
The Category column replaced by the lowercase version:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | fruits | 2.39 | 1 kg | 6 |
Banana | fruits | 1.49 | 500 gr | 2 |
Broccoli | vegetables | 1.29 | 500 gr | 3 |
Carrot | vegetables | 0.45 | 1 pc | 22 |
Coffee | breakfast | 7.49 | 1 kg | 10 |
Eggplant | vegetables | 0.72 | 1 pc | 1 |
Juice | breakfast | 2.79 | 2 L | 3 |
Lettuce | vegetables | 0.82 | 1 pc | 5 |
Penne | pastas | 0.99 | 500 gr | 3 |
Rice | international | 2.59 | 1 kg | 7 |
Spaghetti | pastas | 0.99 | 500 gr | 0 |
Strawberry | fruits | 3.49 | 400 gr | 10 |
Tagliatelle | pastas | 1.89 | 500 gr | 2 |
Further Reading: More about lambda statements, including examples with and without the use of Pandas.
Sorting values
Sorting your DataFrame or Series is also a useful and common operation. This is as easy as providing a column to sort on, and Pandas will do the rest. By default this will sort the rows of the DataFrame in ascending order (lowest to highest), but you can change this by setting the optional ascending parameter to False. Note that strings are automatically sorted in alphabetical order too.

The code below will sort the DataFrame by the 'Category' column in ascending order:
The code below will sort the DataFrame by the ‘Category’ column in ascending order:
df = get_df()
sorted_categories = df.sort_values(by='Category')
display(sorted_categories)
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
Rice | International | 2.59 | 1 kg | 7 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Another useful option is sorting by multiple columns: simply replace the by=... argument with a list of column names (by=['col1', 'col2', ...]). Similarly, you can replace the ascending=... argument with a list of True or False values of the same length (ascending=[True, False, ...]), which indicates for each of the corresponding columns whether it should be sorted in ascending order or not.

Note that the order of the columns in the by list matters, as the DataFrame will be sorted by the first column first, and so on. The code below first sorts by the 'Category' column (in ascending order). If multiple rows have the same category, it will then sort these by the 'Stock' column (in descending order):
df = get_df()
display(df.sort_values(by=['Category', 'Stock'], ascending=[True, False]))
Handling missing values
In the section on operations we already encountered some NaN values. Knowing how to deal with missing values is a great skill when working with messy datasets (which most datasets are, in practice).

For this exercise we will use a different dataset: for the most popular products in the store (Carrots, Apples, Coffee and Lettuce), your boss has kept a registry of the stock over time. In this case, the products are the columns, and each row is indexed by a date. The values you see are the stock at the start of each week (the market gets restocked on the first of each month).

Unfortunately, some values were illegible and have been replaced with NaN values. In this section you will see some tools that can help you deal with this.
stock_df = get_stocks()
display(stock_df)
Product | Carrot | Apple | Coffee | Lettuce |
---|---|---|---|---|
Date | ||||
2023-08-07 | 7.0 | 17.0 | 13.0 | 22.0 |
2023-08-14 | NaN | 12.0 | NaN | 17.0 |
2023-08-21 | NaN | 9.0 | 7.0 | 9.0 |
2023-08-28 | NaN | 5.0 | 3.0 | 5.0 |
2023-09-04 | 38.0 | 20.0 | 17.0 | 22.0 |
2023-09-11 | 22.0 | 12.0 | 14.0 | 16.0 |
2023-09-18 | 11.0 | NaN | 12.0 | 10.0 |
2023-09-25 | 3.0 | NaN | 9.0 | 3.0 |
2023-10-02 | 44.0 | 23.0 | 15.0 | 14.0 |
2023-10-09 | 28.0 | 10.0 | 11.0 | 8.0 |
The simplest method of dealing with NaN values is removing them. This can be done using the pd.DataFrame.dropna(...) method. By default, it will remove any row that contains a NaN. You can change this to drop any column that has at least one NaN value by passing axis=1. If you want to drop a row/column only if it is completely filled with NaN values, you can pass how='all'.

Below is a code example that uses dropna with axis=0 and how='any' (which are also the default options) to remove the rows containing NaN. You can make a new cell where you copy this example and play around with different combinations of arguments. For each combination of arguments you test, try to predict how many rows and columns you expect to see before running the cell.
stock_df = get_stocks()
# Drop rows/columns that contain NaN values
# axis=0 -> drop rows (default)
# 1 -> drop columns
# how='any' -> if that row/column contains at least one NaN value (default)
# 'all' -> if that row/column contains only NaN values
cleaned_df = stock_df.dropna(axis=0, how='any')
display(cleaned_df)
Product | Carrot | Apple | Coffee | Lettuce |
---|---|---|---|---|
Date | ||||
2023-08-07 | 7.0 | 17.0 | 13.0 | 22.0 |
2023-09-04 | 38.0 | 20.0 | 17.0 | 22.0 |
2023-09-11 | 22.0 | 12.0 | 14.0 | 16.0 |
2023-10-02 | 44.0 | 23.0 | 15.0 | 14.0 |
2023-10-09 | 28.0 | 10.0 | 11.0 | 8.0 |
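As a small sketch of the other argument combinations described above (not part of the original notebook; the outputs are left out so you can still make your own predictions):

stock_df = get_stocks()

# Drop every column that contains at least one NaN value
only_complete_columns = stock_df.dropna(axis=1, how='any')

# Drop a row only if it consists entirely of NaN values
# (no row in this dataset is completely empty, so nothing is dropped here)
no_empty_rows = stock_df.dropna(axis=0, how='all')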
These examples show that, no matter how you discard NaN values, you throw away some valuable data in the process.

Instead, you can try to replace these missing values with some approximation or guess of what they could have been. Doing this in a sophisticated way is its own field of research (called regression, which is part of Machine Learning), but for now we will just replace each NaN with a constant value (i.e. the same value everywhere).
Filling values can be done using the pd.DataFrame.fillna(...) method. For example, you can specify some value you want to fill with:
stock_df = get_stocks()
display(stock_df.fillna(value=0))
Product | Carrot | Apple | Coffee | Lettuce |
---|---|---|---|---|
Date | ||||
2023-08-07 | 7.0 | 17.0 | 13.0 | 22.0 |
2023-08-14 | 0.0 | 12.0 | 0.0 | 17.0 |
2023-08-21 | 0.0 | 9.0 | 7.0 | 9.0 |
2023-08-28 | 0.0 | 5.0 | 3.0 | 5.0 |
2023-09-04 | 38.0 | 20.0 | 17.0 | 22.0 |
2023-09-11 | 22.0 | 12.0 | 14.0 | 16.0 |
2023-09-18 | 11.0 | 0.0 | 12.0 | 10.0 |
2023-09-25 | 3.0 | 0.0 | 9.0 | 3.0 |
2023-10-02 | 44.0 | 23.0 | 15.0 | 14.0 |
2023-10-09 | 28.0 | 10.0 | 11.0 | 8.0 |
Using the same constant value for each column might not be the best approach: there seem to be many more Carrots in stock than Coffee. It would be better to use one constant value for one column, and another value for another column.

You can do this by passing a 'mapping' as the value argument, which gives a value to use for each column in the dataframe. A dictionary is a type of mapping, which maps the keys of that dictionary to the corresponding values. The example below uses Carrot, Apple and Coffee as keys, each with their own fill value.
stock_df = get_stocks()
# Create a dictionary (a type of mapping)
mapping = {
'Carrot': 15,
'Apple': 10,
'Coffee': 2
}
display(stock_df.fillna(value=mapping))
Product | Carrot | Apple | Coffee | Lettuce |
---|---|---|---|---|
Date | ||||
2023-08-07 | 7.0 | 17.0 | 13.0 | 22.0 |
2023-08-14 | 15.0 | 12.0 | 2.0 | 17.0 |
2023-08-21 | 15.0 | 9.0 | 7.0 | 9.0 |
2023-08-28 | 15.0 | 5.0 | 3.0 | 5.0 |
2023-09-04 | 38.0 | 20.0 | 17.0 | 22.0 |
2023-09-11 | 22.0 | 12.0 | 14.0 | 16.0 |
2023-09-18 | 11.0 | 10.0 | 12.0 | 10.0 |
2023-09-25 | 3.0 | 10.0 | 9.0 | 3.0 |
2023-10-02 | 44.0 | 23.0 | 15.0 | 14.0 |
2023-10-09 | 28.0 | 10.0 | 11.0 | 8.0 |
Further Reading: the Python Data Science Handbook on handling missing data.
Grouping
Categorical variables (such as the ‘Category’ column in our example dataset) can also be used to group data together. In doing this, you combine one or more rows from your original DataFrame into a row in the grouped DataFrame. You will then need to specify how these rows get combined (also called ‘aggregating’), as multiple rows will need to be reduced to one row per Category.
First, you can group a DataFrame using the pd.DataFrame.groupby method and specifying which column you want to group on. This gives something called a DataFrameGroupBy, which is a special intermediate groupby object. The groupby can then be completed by using an aggregate function on the DataFrameGroupBy. This will return a regular DataFrame again, with its rows reduced as specified by the aggregate function.
You've already seen some of the possible aggregate functions for numerical values throughout this notebook. These are functions like .sum(), .max(), .min(), .mean() and .median(), which combine values by computing, for example, the maximum value in each column. With a groupby, these functions are applied to all the remaining columns, i.e. the columns that were not selected as the column to group on. Applying an aggregate function to a DataFrameGroupBy results in a new DataFrame with just one row for each separate group, containing the combined values of that group. Note that it is not possible to apply these functions to non-numerical columns (like the Unit column), so trying this will result in an error. You can pass the optional numeric_only=True argument to the aggregate functions to tell Pandas explicitly to just drop those columns.
df = get_df()
grouped_df = df.groupby('Category')
print("\nIntermediate groupby object, which isn't a real DataFrame:")
display(grouped_df)
print("\nHowever, we can access each group separately, which are DataFrames:")
for (group_name, group_df) in grouped_df:
    print('\n' + group_name + ':')
    display(group_df)
category_means = grouped_df.mean(numeric_only=True)
print("\n\n\nComputing the mean per group for each column:")
display(category_means)
# Note that this is usually how groupby is used, with both function calls on the same line,
# so without storing the intermediate groupby object separately
category_max = df.groupby('Category').max(numeric_only=True)
print("\nComputing the maximum per group for each column:")
display(category_max)
Intermediate groupby object, which isn't a real DataFrame:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x15a4bf350>
However, we can access each group separately, which are DataFrames:
Breakfast:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Fruits:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Strawberry | Fruits | 3.49 | 400 gr | 10 |
International:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Rice | International | 2.59 | 1 kg | 7 |
Pastas:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Penne | Pastas | 0.99 | 500 gr | 3 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Tagliatelle | Pastas | 1.89 | 500 gr | 2 |
Vegetables:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Eggplant | Vegetables | 0.72 | 1 pc | 1 |
Lettuce | Vegetables | 0.82 | 1 pc | 5 |
Computing the mean per group for each column:
 | Price | Stock |
---|---|---|
Category | ||
Breakfast | 5.14 | 6.50 |
Fruits | 2.46 | 6.00 |
International | 2.59 | 7.00 |
Pastas | 1.29 | 1.67 |
Vegetables | 0.82 | 7.75 |
Computing the maximum per group for each column:
 | Price | Stock |
---|---|---|
Category | ||
Breakfast | 7.49 | 10 |
Fruits | 3.49 | 10 |
International | 2.59 | 7 |
Pastas | 1.89 | 3 |
Vegetables | 1.29 | 22 |
As you can see, the result of these groupby and aggregate operations is a new DataFrame containing the combined data. In the mean DataFrame you can, for example, see that the average Stock of all the Breakfast items is $6.5$. And in the max DataFrame you can see that the Fruit with the highest Price costs $3.49$.
There are also some aggregate functions that work on both numerical and categorical columns:
- grouped_df.count() counts the number of rows in each group, excluding NaNs.
- grouped_df.nunique() counts the number of unique values in each group, for each column.
- grouped_df.first() gives the elements of the first of the grouped rows. Similarly, .last() gives the last.
- grouped_df.head(n) will include (up to) the first n rows in each group. Similarly, .tail(n) gives the last n rows.
df = get_df()
print("\nThe number of unique values in each group:")
display(df.groupby('Category').nunique())
print("\nThe first value in each group:")
display(df.groupby('Category').first())
print("\nKeeping (up to) the first 2 rows per group:")
display(df.groupby('Category').head(2))
The number of unique values in each group:
 | Price | Unit | Stock |
---|---|---|---|
Category | |||
Breakfast | 2 | 2 | 2 |
Fruits | 3 | 3 | 3 |
International | 1 | 1 | 1 |
Pastas | 2 | 1 | 3 |
Vegetables | 4 | 2 | 4 |
The first value in each group:
 | Price | Unit | Stock |
---|---|---|---|
Category | |||
Breakfast | 7.49 | 1 kg | 10 |
Fruits | 2.39 | 1 kg | 6 |
International | 2.59 | 1 kg | 7 |
Pastas | 0.99 | 500 gr | 3 |
Vegetables | 1.29 | 500 gr | 3 |
Keeping (up to) the first 2 rows per group:
 | Category | Price | Unit | Stock |
---|---|---|---|---|
Apple | Fruits | 2.39 | 1 kg | 6 |
Banana | Fruits | 1.49 | 500 gr | 2 |
Broccoli | Vegetables | 1.29 | 500 gr | 3 |
Carrot | Vegetables | 0.45 | 1 pc | 22 |
Coffee | Breakfast | 7.49 | 1 kg | 10 |
Juice | Breakfast | 2.79 | 2 L | 3 |
Penne | Pastas | 0.99 | 500 gr | 3 |
Rice | International | 2.59 | 1 kg | 7 |
Spaghetti | Pastas | 0.99 | 500 gr | 0 |
Like with regular DataFrames, you can select one or more columns from a DataFrameGroupBy using the column names. Remember that, just as when indexing columns, a single column will always return a Series, and a list of columns (even a list of 1 column) returns a DataFrame!

The examples below compute the minimum Stock per Category, and the number of unique values in Price and Unit per Category.
df = get_df()
grouped_df = df.groupby('Category')
minimum_stock = grouped_df['Stock'].min()
print("\nThe minimum values in the Stock column, per Category:")
# Passing just 'Stock' gives a Series, from which we compute the min
display(minimum_stock)
unique_price_unit = grouped_df[['Price', 'Unit']].nunique()
print("\nThe number of unique values in the 'Price' and 'Unit' column, per category:")
# Passing multiple columns gives a DataFrame, from which we compute the number of unique values
display(unique_price_unit)
The minimum values in the Stock column, per Category:
Category
Breakfast 3
Fruits 2
International 7
Pastas 0
Vegetables 1
Name: Stock, dtype: int64
The number of unique values in the 'Price' and 'Unit' column, per category:
 | Price | Unit |
---|---|---|
Category | ||
Breakfast | 2 | 2 |
Fruits | 3 | 3 |
International | 1 | 1 |
Pastas | 2 | 1 |
Vegetables | 4 | 2 |
Pandas supports many more methods of aggregating data, which you can use through the grouped_df.agg() method. Rather than 'just' computing the mean, you can give a list of methods to aggregate the groups in your DataFrame:
df = get_df()
grouped_df = df.groupby('Category')
# For each column, compute the number of unique values and the number of rows per category
display(
grouped_df.agg(['nunique', 'count'])
)
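As a further illustrative sketch (not shown in the original notebook), .agg also accepts a dictionary mapping column names to the aggregation(s) to apply, so different columns can be combined in different ways:

df = get_df()
# Compute the mean Price and the total Stock for each Category
summary = df.groupby('Category').agg({'Price': 'mean', 'Stock': 'sum'})
display(summary)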
Further Reading: On grouping data.