Pandas

This document is an extract of the Pandas notebook from Level 6. Although this is not an interactive webpage, it is written as if it were. This document assumes that you have already completed the Pandas notebook and are only using this as a quick reference.

Introduction

When processing data, you will most often work with structured data. This is the kind of data where you have a table with some number of named columns, each corresponding to a different feature, and a sequence of rows, each corresponding to a different sample. Examples of structured data can be found everywhere and occur any time you make a series of observations and write down some properties for each one.

Some examples of structured data are:

- Weather reports, where you have columns such as ‘Date’, ‘Temperature’ and ‘Humidity’, and each row shows the temperature and humidity on some date.
- Companies use structured data everywhere, from user accounts to product information and from purchases to reviews.
- A grocery list is a very simple type of structured data, where you have a ‘Product’ and ‘Amount’ column indicating how many of what product you intend to buy.

Note on notation: There are many different ways to refer to the columns and rows in a table. This notebook will refer to the vertically orientated axis as ‘columns’ (but elsewhere this might be called a ‘field’, ‘feature’ or ‘property’) and the horizontal axis as ‘rows’ (also called ‘entries’, ‘samples’ or ‘observations’).

Pandas (often imported as pd) is the most popular library to use when dealing with structured data in Python. In this exercise notebook, we will explore some of the basics of working with Pandas. This notebook will not be able to cover everything you will ever need, but will hopefully give you an idea of what it can do for you.

Some sections will end with a note on Further Reading; this will not be necessary for completing the exercises, but references relevant information that might help you in the future.

DataFrames

Structured data is nothing more than a table consisting of rows and columns. In Pandas, such a table is called a DataFrame. To create a DataFrame called df, you can use df = pd.DataFrame(data) where data is a list of rows, and each row is also a list, containing the values for that row. Below you can see a simple example of creating a table with 2 rows and 3 columns.

You can have any number of rows, but it is important that each row has the same number of elements (i.e. columns), otherwise you will get an error. The example below uses Jupyter’s display function to get a nice looking output (although print also works).

import pandas as pd
data = [[1, 2, 3], # The first row has values 1, 2, 3
        [4, 5, 6]] # The second row has values 4, 5, 6

df = pd.DataFrame(data)

display(df)

0 1 2
0 1 2 3
1 4 5 6

Now this is not a very descriptive table, since the columns are just called 0, 1 and 2. To change this, you can use the optional columns argument when creating the DataFrame, where you can specify the names of each of the columns using a list. Make sure to provide as many column names as there are columns in your data (the number of values in a single row), otherwise you will get an error.

Let’s create another dataframe; the first column has values 'a' and 'b' and we call it 'Some letter', another column with values 2 and 5 we call 'Int', and finally a column with 3.5 and 6.2 we call 'A float'.

data = [['a', 2, 3.5],
        ['b', 5, 6.2]]

df = pd.DataFrame(data, columns=['Some letter', 'Int', 'A float'])

display(df)

Some letter Int A float
0 a 2 3.5
1 b 5 6.2

You might have already noticed that each of the rows also gets its own index on the left (in this case 0 and 1, which is the default). This is called the index of the DataFrame (which you can see by using df.index), and it can be set using the optional index argument to provide a list of row names when creating the DataFrame.

Again, make sure you provide as many names in the index argument as there are rows (in this case 2 rows, and therefore 2 names); otherwise you will get an error.

data = [['a', 2, 3.5],
        ['b', 5, 6.2]]

df = pd.DataFrame(data,
    index=['First Row', 'Row 2'],
    columns=['Some letter', 'Int', 'A float']
)

display(df)

Some letter Int A float
First Row a 2 3.5
Row 2 b 5 6.2
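
As a quick check, you can also inspect the index of this DataFrame directly with the df.index property mentioned above. A minimal sketch, assuming the df from the example above:

display(df.index)

Index(['First Row', 'Row 2'], dtype='object')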

Now that we know how to create a DataFrame, let’s see what we can do with them!

This notebook will follow a narrative where you are hired as the new head of (and only member of) the Data Processing department of your local grocery store. This grocery store has kept a record of their products, and it is now up to you to process this data. This record consists of products, the category they belong to, their price, the size (i.e. ‘unit’) of the product and the current stock.

We have provided you with a function get_df() which gets you the data as a DataFrame for you to use. As you can see, it is indexed on the ‘Product’ and has the columns ‘Category’, ‘Price’, ‘Unit’ and ‘Stock’. The example DataFrame is displayed below.

df = get_df()
display(df)

Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Coffee Breakfast 7.49 1 kg 10
Eggplant Vegetables 0.72 1 pc 1
Juice Breakfast 2.79 2 L 3
Lettuce Vegetables 0.82 1 pc 5
Penne Pastas 0.99 500 gr 3
Rice International 2.59 1 kg 7
Spaghetti Pastas 0.99 500 gr 0
Strawberry Fruits 3.49 400 gr 10
Tagliatelle Pastas 1.89 500 gr 2

Getting data from a DataFrame

A table consists of columns with values for each row; and similarly, a DataFrame consists of Series with values for each index. Therefore, if you want to look at a single column of your data, you’ll need to retrieve that Series from your DataFrame!

To see what columns are in your DataFrame, you can use the df.columns property:

print("The columns of `df`:")
display(df.columns)

# Indexing works the same way as a list:
print("\nThe second column is:", df.columns[1])

The columns of `df`:
Index(['Category', 'Price', 'Unit', 'Stock'], dtype='object')

The second column is: Price

To get a column from your DataFrame, you can index it in the same way you do with a dictionary: by using df['ColumnName'], where df is your DataFrame and ColumnName is the name of your column. The code below retrieves the ‘Price’ column from our DataFrame:

# Retrieve the 'Price' column from the DataFrame as a Series
prices = df['Price']

display(prices)

Apple          2.39
Banana         1.49
Broccoli       1.29
Carrot         0.45
Coffee         7.49
Eggplant       0.72
Juice          2.79
Lettuce        0.82
Penne          0.99
Rice           2.59
Spaghetti      0.99
Strawberry     3.49
Tagliatelle    1.89
Name: Price, dtype: float64

# Here you can see that this is indeed a 'Series' object:
print("The variable 'prices' is of type:", type(prices))

# You can always check the dtype of your Series by using .dtype:
print("This Series has dtype:", prices.dtype)

The variable 'prices' is of type: <class 'pandas.core.series.Series'>
This Series has dtype: float64

In the example above you can see that this Series doesn’t just contain the values of the ‘Price’ column (displayed on the right), but also the index, which lists the product names from the original DataFrame (displayed on the left). This way, it is a lot easier to see which price belongs to what product! Lastly, you can see at the bottom what the dtype of this column is (in this case float64, since these are floating point numbers).

Accessing the data in that Series then also works in the same way as for dictionaries: you can just use the index as the key. For example, the price of Broccoli can be found by first retrieving the ‘Price’ column from the DataFrame, and then indexing the resulting Series on ‘Broccoli’:

# Retrieve the 'Price' column from the DataFrame as a Series
prices = df['Price']

# Get the 'Broccoli' value from that Series
price_broccoli = prices['Broccoli']

print("The price of Broccoli is:", price_broccoli)

The price of Broccoli is: 1.29

Series

A Series is a data structure specific to Pandas that has aspects of both lists and dictionaries. It is like a list in that it is an ordered sequence of data, meaning it just stores different values together in some order. However, the indices used for a Series don’t need to be the integers $0, 1, \dots$, like with a list, but can be of any type, including strings, like the keys of a dictionary.

Note that all the values in a Series must be of the same type, like with Numpy ndarrays. In the example above, the Series has a dtype of float, meaning it can only contain floating point values. Lastly, a Series has an optional name, which here is automatically set to the name of the column in the DataFrame it came from.

If we want to create a Series, we can just define it using a list of indices and list of values. Note that order here matters; position 0 in the index list will become the index for position 0 in the value list, and this will become the first element in the Series. Below is an example to create a pd.Series called discount with products as its indices and a discounted price as its values.

discount_products = ['Strawberry', 'Apple', 'Coffee', 'Juice', 'Broccoli']
discount_prices = [2.99, 2.15, 4.99, 1.89, 0.99]

discount = pd.Series(
    data=discount_prices,      # Use discount_prices as the values
    index=discount_products,   # Use discount_products as the index
    dtype=float,               # Set the type for the `data` argument to floating point numbers
    name="Discount",           # Set the name for the Series to Discount
)

display(discount)

Strawberry    2.99
Apple         2.15
Coffee        4.99
Juice         1.89
Broccoli      0.99
Name: Discount, dtype: float64

The most basic version of a Series is just a sequence of data of the same type, meaning you actually don’t need to provide an index for your Series. If you do not provide any index, Pandas will just use $0, 1, \dots$, like the indices of a list.

However, much of the power of Pandas is the use of this index to easily retrieve rows from your DataFrame (more on this in the section on Indexing, Selection and Masking). It is therefore always recommended to use some sort of index for your Series, if you can!

Below is an example of the basic way you can create a Series, by just providing a list of values:

basicSeries = pd.Series([6.4, 2.3, 1.0])

display(basicSeries)

0    6.4
1    2.3
2    1.0
dtype: float64

You can create a Series with any kind of data. Below are some more examples with different data types: pay attention to the resulting dtype for each!

Pandas tries its best to find the right type for your Series, and falls back on an object type if it cannot find a type that works. A Series must always be consistent in its type for all values in the Series, and the standard types for this are the same as for Numpy: integers, floats and booleans. Note that strings are not in this list, so strings also get the generic object type.

What you can do with a column largely depends on the dtype of that Series, so always try to pick the type that works for your data!

print("A Series with strings gets dtype 'object':")
display(pd.Series(['a', 'b', 'c']))

print("\nIf you provide a mix of integers and floats, it will convert all elements to floats:")
display(pd.Series([1, 2, 3.0]))

print("\nBut you can also force it to be a specific type:")
display(pd.Series([1, 2, 3.0], dtype=int))

print("\nIf you provide a mix of strings and other types, then everything becomes 'object's:")
display(pd.Series(['hello', 2, 3.0]))

print("\nAs a very useful extra, Pandas can even convert your date to a consistent type:")
display(pd.Series([ "2023 aug 02", "2021 8 23", "1970/01/01"], dtype="datetime64[ns]"))

A Series with strings gets dtype 'object':

0    a
1    b
2    c
dtype: object

If you provide a mix of integers and floats, it will convert all elements to floats:

0    1.0
1    2.0
2    3.0
dtype: float64

But you can also force it to be a specific type:

0    1
1    2
2    3
dtype: int64

If you provide a mix of strings and other types, then everything becomes 'object's:

0    hello
1        2
2      3.0
dtype: object

As a very useful extra, Pandas can even convert your date to a consistent type:

0   2023-08-02
1   2021-08-23
2   1970-01-01
dtype: datetime64[ns]

Adding and modifying data

You can add a column to your DataFrame in the same way we retrieved a column from the DataFrame before. If you have a Series called mySeries, you can add it to your DataFrame myDataFrame as a column named 'myColumn' using myDataFrame['myColumn'] = mySeries. Pandas will even match the index of your Series and DataFrame, so all values end up in the right place!

Let’s use the discount Series from before and add it to the DataFrame. This will match all the indices in the Series with the DataFrame indices, and add the values in the right place. When there is no discount for a specific product, the value automatically gets set to NaN. We will come back to the different ways you can deal with NaNs later, in the section on Handling missing values.

df = get_df()

print("The discount Series")
discount = pd.Series(discount_prices, index=discount_products)
display(discount)

print("Added to the DataFrame:")
df['Discount'] = discount
display(df)

The discount Series

Strawberry    2.99
Apple         2.15
Coffee        4.99
Juice         1.89
Broccoli      0.99
dtype: float64

Added to the DataFrame:
Category Price Unit Stock Discount
Apple Fruits 2.39 1 kg 6 2.15
Banana Fruits 1.49 500 gr 2 NaN
Broccoli Vegetables 1.29 500 gr 3 0.99
Carrot Vegetables 0.45 1 pc 22 NaN
Coffee Breakfast 7.49 1 kg 10 4.99
Eggplant Vegetables 0.72 1 pc 1 NaN
Juice Breakfast 2.79 2 L 3 1.89
Lettuce Vegetables 0.82 1 pc 5 NaN
Penne Pastas 0.99 500 gr 3 NaN
Rice International 2.59 1 kg 7 NaN
Spaghetti Pastas 0.99 500 gr 0 NaN
Strawberry Fruits 3.49 400 gr 10 2.99
Tagliatelle Pastas 1.89 500 gr 2 NaN

We can also easily modify values in Series using basic arithmetic operations. All arithmetic operations are element-wise by default, and broadcasted when necessary, exactly like in Numpy.

Below is a simple example where we get a new inventory shipment with 10 items for each product, and need to update the DataFrame accordingly. Note that this increase is getting broadcasted to every value in the Series here.

It is important to remember that any arithmetic operation always creates a new Series, and so doesn’t modify the existing DataFrame! If we want to update the values in the DataFrame, we need to re-assign the new Series to the corresponding column.

print("Adding 10 to the stock of every product:")
increased_stock = df['Stock'] + 10
display(increased_stock)

print("\nReplacing the stock column in the DataFrame:")
df['Stock'] = increased_stock
display(df)

Adding 10 to the stock of every product:

Apple          16
Banana         12
Broccoli       13
Carrot         32
Coffee         20
Eggplant       11
Juice          13
Lettuce        15
Penne          13
Rice           17
Spaghetti      10
Strawberry     20
Tagliatelle    12
Name: Stock, dtype: int64

Replacing the stock column in the DataFrame:
Category Price Unit Stock Discount
Apple Fruits 2.39 1 kg 16 2.15
Banana Fruits 1.49 500 gr 12 NaN
Broccoli Vegetables 1.29 500 gr 13 0.99
Carrot Vegetables 0.45 1 pc 32 NaN
Coffee Breakfast 7.49 1 kg 20 4.99
Eggplant Vegetables 0.72 1 pc 11 NaN
Juice Breakfast 2.79 2 L 13 1.89
Lettuce Vegetables 0.82 1 pc 15 NaN
Penne Pastas 0.99 500 gr 13 NaN
Rice International 2.59 1 kg 17 NaN
Spaghetti Pastas 0.99 500 gr 10 NaN
Strawberry Fruits 3.49 400 gr 20 2.99
Tagliatelle Pastas 1.89 500 gr 12 NaN

An important difference with Numpy is that element-wise operations are always matched by their index, so the order of the elements in the Series actually does not matter, only what their index is. We’ll come back to this in more detail in the section Operating on DataFrames, but for now we’ll just give another example, so you have a bit of an idea of the possibilities:

Let’s say you now want to compute the discount percentage for each of the discounted prices, so you can use those to advertise your sale. First we’ll need to compute the ratio of the discounted price compared to the original price and store the result in a new Series. We can just do this by dividing the Discount column by the Price column:

discount_ratio = df['Discount'] / df['Price']

print("The ratio of the discounted price compared to the original price:")
display(discount_ratio)


The ratio of the discounted price compared to the original price:

Apple          0.90
Banana          NaN
Broccoli       0.77
Carrot          NaN
Coffee         0.67
Eggplant        NaN
Juice          0.68
Lettuce         NaN
Penne           NaN
Rice            NaN
Spaghetti       NaN
Strawberry     0.86
Tagliatelle     NaN
dtype: float64

Note that this division was indeed performed element by element and matched on the index, so the discounted price of an apple was divided by the original price of an apple, and so on. The products that had a NaN value for the discount then also automatically become NaN for the ratio.

Next, we can just multiply this by $100$ to get a percentage, and subtract that value from $100$ to get the percentage that was discounted. Lastly, we can then add this new Series back into the DataFrame as a new column:

discount_percentage = 100 - discount_ratio * 100
df['Percentage'] = discount_percentage

print('Discount percentage added to the DataFrame:')
display(df)

Discount percentage added to the DataFrame:
Category Price Unit Stock Discount Percentage
Apple Fruits 2.39 1 kg 16 2.15 10.04
Banana Fruits 1.49 500 gr 12 NaN NaN
Broccoli Vegetables 1.29 500 gr 13 0.99 23.26
Carrot Vegetables 0.45 1 pc 32 NaN NaN
Coffee Breakfast 7.49 1 kg 20 4.99 33.38
Eggplant Vegetables 0.72 1 pc 11 NaN NaN
Juice Breakfast 2.79 2 L 13 1.89 32.26
Lettuce Vegetables 0.82 1 pc 15 NaN NaN
Penne Pastas 0.99 500 gr 13 NaN NaN
Rice International 2.59 1 kg 17 NaN NaN
Spaghetti Pastas 0.99 500 gr 10 NaN NaN
Strawberry Fruits 3.49 400 gr 20 2.99 14.33
Tagliatelle Pastas 1.89 500 gr 12 NaN NaN

In the DataFrame we can now see the apples have $10\%$ discount when sold for $2.15$ instead of $2.39$, but we might want to advertise with the $33\%$ discount on coffee instead!

With a couple of discounts this is easy to spot, but for a larger DataFrame you would use built-in aggregate functions instead. Aggregate functions combine the data in a column in some way, for example using .sum() to compute the total of a column. Here we’ll need to use the .max() function to find the largest discount percentage and .idxmax() to find the product that the largest discount belongs to. Note that NaN values in the Percentage column will just be ignored here, as this is the default behaviour for all aggregation functions.

largest_discount_percentage = df['Percentage'].max()
largest_discount_product = df['Percentage'].idxmax()

original_price = df['Price'][largest_discount_product]
discount_price = df['Discount'][largest_discount_product]
product_unit =  df['Unit'][largest_discount_product]

print(f'Our best deal is a {int(largest_discount_percentage)}% discount on {largest_discount_product}:')
print(f'Usually {original_price} per {product_unit}, but now on sale for {discount_price}!')

Our best deal is a 33% discount on Coffee:
Usually 7.49 per 1 kg, but now on sale for 4.99!

Indexing, Selection and Masking

In the previous section we covered the most basic ways to retrieve data from a DataFrame, but there are many more ways to get data from a DataFrame in Pandas. As we’ve already covered how to retrieve columns, next we’ll take a look at how to get specific rows from a DataFrame.

The first thing we need to answer is: What is a ‘row’ in Pandas? Turns out: it’s also a Series! That’s great, because we just learned how to work with those. Let’s start by taking another look at our example DataFrame:

df = get_df()
display(df)

Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Coffee Breakfast 7.49 1 kg 10
Eggplant Vegetables 0.72 1 pc 1
Juice Breakfast 2.79 2 L 3
Lettuce Vegetables 0.82 1 pc 5
Penne Pastas 0.99 500 gr 3
Rice International 2.59 1 kg 7
Spaghetti Pastas 0.99 500 gr 0
Strawberry Fruits 3.49 400 gr 10
Tagliatelle Pastas 1.89 500 gr 2

There are several ways to access the rows in a DataFrame, but the most common ones are df.loc[...] and df.iloc[...]. These two methods look and work very similarly, but there are some important differences:

Using df.loc[...]

The .loc property retrieves rows by making use of the index we defined for our dataframe. This works similarly to a dictionary, where you can use the key (i.e. index) to get the row you want: df.loc['Banana'] will get you the row where the index is equal to ‘Banana’.

df = get_df()
row = df.loc['Banana']

print("The row with index 'Banana' is:")
display(row)
print("\nThis row is of the type:", type(row))

The row with index 'Banana' is:

Category    Fruits
Price         1.49
Unit        500 gr
Stock            2
Name: Banana, dtype: object

This row is of the type: <class 'pandas.core.series.Series'>

As you can see, the resulting row Series automatically becomes indexed by the columns of the original DataFrame. This means that, for example, you could now retrieve the ‘Price’ for a specific row using row['Price']. This works the same way for any row, regardless of whether you used .loc or .iloc to retrieve it.
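
For example, continuing with the ‘Banana’ row we just retrieved, a short sketch:

row = df.loc['Banana']

# Index the row Series on the column name to get a single value (this should print 1.49)
print("The price of a Banana is:", row['Price'])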

Using df.iloc[...]

The .iloc property retrieves rows by the order in which the rows are stored in the DataFrame. It stands for “integer location” and works the same way as list indices: df.iloc[0] is the first row, df.iloc[1] is the second, et cetera.

It is important to note that if you re-order your DataFrame (such as with sorting), the order of the rows will change and therefore so will their “integer location”! After such an operation, df.iloc[0] might give a different result than before. This is different from df.loc, where you just get the row matching the index, regardless of its exact position in the DataFrame.

df = get_df()
row = df.iloc[0]

print("The first row of the DataFrame is:")
display(row)
print("\nThis row is of the type:", type(row))

The first row of the DataFrame is:

Category    Fruits
Price         2.39
Unit          1 kg
Stock            6
Name: Apple, dtype: object

This row is of the type: <class 'pandas.core.series.Series'>
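
To illustrate the re-ordering caveat, here is a minimal sketch. It uses sort_values (covered later in the section Sorting values) to re-order the rows by price:

df = get_df()
sorted_df = df.sort_values(by='Price')

# In the original DataFrame the first row is 'Apple',
# but after sorting by price the first row is 'Carrot'
print("First row of the original DataFrame:", df.iloc[0].name)
print("First row of the sorted DataFrame:", sorted_df.iloc[0].name)

# Using loc is unaffected by the re-ordering: this still retrieves the 'Apple' row
display(sorted_df.loc['Apple'])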

Rows vs. columns

Retrieving specific rows from a DataFrame can obviously be very useful, but there is an important caveat to keep in mind when working with rows: rows can mix values of different types, and retrieving them is much slower than retrieving columns.

The problem here stems from the fact that a Series can only ever have one type. This works great for columns, as you expect a column of data to all be of the same type (e.g. a price column should always contain floats for each row). This is probably also the best way to look at what a DataFrame actually is: A collection of different columns of data, each with their own data type, that also each share the same row index.

However, a single row of data will contain information from all of the different columns, which might have different types. For mixed types, Pandas will again try its best to convert the types to a consistent option, but if you have just one column with strings, then the whole row Series will become type ‘object’. You can actually already see this happening in both the examples above!

Converting everything to ‘object’ type means you lose the type information from the columns. In addition, these rows are actually new Series that have to be built by retrieving the value from each column for that index and then combining them into a single Series (which is when the type conversion happens). This is much slower than getting a column, where you can simply retrieve the stored Series from the DataFrame.

As a result, the best option is usually to work with columns directly. Looping over the rows in a DataFrame might seem like a natural way to process data, as you’re going through the data line by line, but this is actually very inefficient. Instead you should try and use element-wise operations to modify a whole column in a DataFrame, like the examples you’ve already seen in the section Adding and modifying data.

Note that it is still possible to loop through the rows of a DataFrame using functions like .iterrows(). This function will create a new Series for each row and return them one by one, which can sometimes be very useful. However, if you do find yourself using this, first ask yourself whether there isn’t a better way to solve your problem, as this type of approach should always be one of the last options you try. A comparison of both approaches is sketched below.
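
As an illustration (a sketch, not a recommended pattern): the loop below computes the total value of everything in stock with .iterrows(), while the version after it does the same with a single element-wise multiplication and .sum(). Both give the same result, but the column-based version is shorter and much faster on large DataFrames.

df = get_df()

# Row-by-row: build each row as a Series and accumulate by hand
total = 0
for product, row in df.iterrows():
    total += row['Price'] * row['Stock']
print("Total stock value (loop):", total)

# Column-based: one element-wise multiplication and one aggregate
total = (df['Price'] * df['Stock']).sum()
print("Total stock value (element-wise):", total)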

Selecting multiple rows or columns

In addition to using .loc['Banana'] or .iloc[8] to obtain a single row, you can also pass a list of values to obtain multiple rows from your DataFrame in one go! This actually also works for columns, by passing a list of column names when indexing the DataFrame. Note that all these operations will always return a DataFrame, and not a Series like before - even if it is a DataFrame with only a single row or column!

Below are some examples using lists to index a DataFrame:

df = get_df()
print("The full dataframe:")
display(df)

rows1 = df.loc[ ['Carrot', 'Juice', 'Rice'] ]
print("\nThe rows with index 'Carrot', 'Juice' and 'Rice':")
display(rows1)

rows2 = df.loc[ ['Carrot'] ]
print("\nThe row with index 'Carrot', as a DataFrame:")
display(rows2)

rows3 = df.iloc[ [0, 2, 4] ]
print("\nFirst, third and fifth row, using a list of indices for iloc:")
display(rows3)

rows4 = df[ ['Category', 'Price'] ]
print("\nSelecting only the Category and Price columns")
display(rows4)

The full dataframe:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Coffee Breakfast 7.49 1 kg 10
Eggplant Vegetables 0.72 1 pc 1
Juice Breakfast 2.79 2 L 3
Lettuce Vegetables 0.82 1 pc 5
Penne Pastas 0.99 500 gr 3
Rice International 2.59 1 kg 7
Spaghetti Pastas 0.99 500 gr 0
Strawberry Fruits 3.49 400 gr 10
Tagliatelle Pastas 1.89 500 gr 2
The rows with index 'Carrot', 'Juice' and 'Rice':
Category Price Unit Stock
Carrot Vegetables 0.45 1 pc 22
Juice Breakfast 2.79 2 L 3
Rice International 2.59 1 kg 7
The row with index 'Carrot', as a DataFrame:
Category Price Unit Stock
Carrot Vegetables 0.45 1 pc 22
First, third and fifth row, using a list of indices for iloc:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Broccoli Vegetables 1.29 500 gr 3
Coffee Breakfast 7.49 1 kg 10
Selecting only the Category and Price columns
Category Price
Apple Fruits 2.39
Banana Fruits 1.49
Broccoli Vegetables 1.29
Carrot Vegetables 0.45
Coffee Breakfast 7.49
Eggplant Vegetables 0.72
Juice Breakfast 2.79
Lettuce Vegetables 0.82
Penne Pastas 0.99
Rice International 2.59
Spaghetti Pastas 0.99
Strawberry Fruits 3.49
Tagliatelle Pastas 1.89

The df.iloc property works with the position index, and also supports ‘slices’:

- df.iloc[a:] returns everything starting from the a-th row.
- df.iloc[:b] returns everything up to (but not including) the b-th row.
- df.iloc[::c] returns every c-th row, starting with the first. You can also think of this as the ‘step size’ with which to go through the DataFrame. By default this is 1, which is just every row. A step size of 2 means you skip every second row, et cetera.

These can also be combined as df.iloc[a:b:c], which says: from the a-th to the b-th row, give every c-th row, starting with the a-th.

Further Reading: You can find more examples on slices here.

Below are some examples. Feel free to try out some different values in a new cell, and predict what you’ll see before running the code.

df = get_df()
print("Full DataFrame, with a new default range as index:")
# If you reset the index, it will use indices 0,1,... instead
# This will help see what integer indexes are for each row!
display(df.reset_index())

print("\n\nStarting at the tenth row:")
display(df.iloc[10:])

print("\n\nFirst five rows:")
display(df.iloc[:5])

# A negative index means: count backwards from the end
print("\n\nStarting at the second-to-last row:")
display(df.iloc[-2:])

# A bit of an advanced example:
# 1:10 says "Take the second to tenth rows (indices 1,2,3,4,5,6,7,8,9)"
#   :2 says "And then skip every other row (indices 1,3,5,7,9)"
print("\n\nEvery second row up to the tenth row:")
display(df.iloc[1:10:2])

Full DataFrame, with a new default range as index:
index Category Price Unit Stock
0 Apple Fruits 2.39 1 kg 6
1 Banana Fruits 1.49 500 gr 2
2 Broccoli Vegetables 1.29 500 gr 3
3 Carrot Vegetables 0.45 1 pc 22
4 Coffee Breakfast 7.49 1 kg 10
5 Eggplant Vegetables 0.72 1 pc 1
6 Juice Breakfast 2.79 2 L 3
7 Lettuce Vegetables 0.82 1 pc 5
8 Penne Pastas 0.99 500 gr 3
9 Rice International 2.59 1 kg 7
10 Spaghetti Pastas 0.99 500 gr 0
11 Strawberry Fruits 3.49 400 gr 10
12 Tagliatelle Pastas 1.89 500 gr 2
Starting at the tenth row:
Category Price Unit Stock
Spaghetti Pastas 0.99 500 gr 0
Strawberry Fruits 3.49 400 gr 10
Tagliatelle Pastas 1.89 500 gr 2
First five rows:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Coffee Breakfast 7.49 1 kg 10
Starting at the second-to-last row:
Category Price Unit Stock
Strawberry Fruits 3.49 400 gr 10
Tagliatelle Pastas 1.89 500 gr 2
Every second row up to the tenth row:
Category Price Unit Stock
Banana Fruits 1.49 500 gr 2
Carrot Vegetables 0.45 1 pc 22
Eggplant Vegetables 0.72 1 pc 1
Lettuce Vegetables 0.82 1 pc 5
Rice International 2.59 1 kg 7

Combining row and column selection

If you want to select specific rows and columns from a DataFrame, there are many ways you can do this in Pandas. First off, you can just combine the methods for selecting rows and columns you saw earlier, as these always return a DataFrame which you can just index again.

Some examples:

df = get_df()

print("\nSelecting columns first, then indexing the resulting rows using loc:")
cols = df[['Category', 'Price']]
res1 = cols.loc[['Carrot', 'Juice', 'Rice']]
display(res1)


print("\nIndexing the rows using their position with iloc, then selecting the columns:")
rows = df.iloc[[3, 6, 9]]
res2 = rows[['Category', 'Price']]
display(res2)


print("\nSelecting columns first, then slices on iloc, executed on the same line:")
res3 = df[['Category', 'Price']].iloc[3:10:3]
display(res3)

Selecting columns first, then indexing the resulting rows using loc:
Category Price
Carrot Vegetables 0.45
Juice Breakfast 2.79
Rice International 2.59
Indexing the rows using their position with iloc, then selecting the columns:
Category Price
Carrot Vegetables 0.45
Juice Breakfast 2.79
Rice International 2.59
Selecting columns first, then slices on iloc, executed on the same line:
Category Price
Carrot Vegetables 0.45
Juice Breakfast 2.79
Rice International 2.59

Note that all these different methods end up with the same DataFrame! In fact, any combination of the row and column indexing methods you’ve seen before would work here, of course.

In addition, loc and iloc also support indexing rows and columns together in the same operation, which is a little more efficient than first creating new Series for the full rows and then selecting from them. With these methods the row indices must always come first (as loc and iloc are row-based operations), and then you can optionally add a column index too, separated by a comma. This combined indexing also works for single indices, where you give just one index for the rows or columns position instead of a list. Note that in those cases that dimension is reduced, and the result will no longer be a DataFrame. A few examples:

df = get_df()

print("\nSelecting the rows with index Apple and Broccoli, and the columns Category and Stock")
display(df.loc[ ['Apple', 'Broccoli'], ['Category', 'Stock'] ])

print("\nSelecting the first and second row, and the second and fourth column:")
display(df.iloc[ [0, 1], [1, 3] ])

print("\nSelecting the Penne row and the Price column (which is just an element):")
display(df.loc['Penne', 'Price'])

print("\nSelecting only the first row, and the first and fourth column (which is a Series):")
display(df.iloc[0, [0, 3]])

Selecting the rows with index Apple and Broccoli, and the columns Category and Stock
Category Stock
Apple Fruits 6
Broccoli Vegetables 3
Selecting the first and second row, and the second and fourth column:
Price Stock
Apple 2.39 6
Banana 1.49 2
Selecting the Penne row and the Price column (which is just an element):

0.99

Selecting only the first row, and the first and fourth column (which is a Series):

Category    Fruits
Stock            6
Name: Apple, dtype: object

All these different indexing methods can make Pandas especially confusing to newcomers, as they are all used interchangeably in explanations and examples you might find online. One of the goals of this notebook is to introduce all of these properly, so you can make a bit more sense of the different options you might encounter.

Below are 3 examples all using loc that look very similar, but do completely different things. Read the code of the examples and try to predict what each of them will do, before actually running the cell:

example1 = df.loc['Apple', 'Price']

print("\nExample 1:")
display(example1)

example2 = df.loc[['Apple', 'Broccoli']]

print("\nExample 2:")
display(example2)

example3 = df.loc[['Apple'], ['Price']]

print("\nExample 3:")
display(example3)

Example 1:

2.39

Example 2:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Broccoli Vegetables 1.29 500 gr 3
Example 3:
Price
Apple 2.39

Masking

The last, and very useful, method we’ll cover to select data from your DataFrame is masking. Masking is another concept borrowed directly from Numpy, and so works almost exactly the same way in Pandas.

Masks are used to select the parts of a DataFrame where some condition is True. You can create a mask by using boolean operators (like >, <= or ==) directly on a DataFrame or Series to define the condition where the mask should be True. This results in a broadcasted element-wise comparison on the DataFrame/Series, and so creates a new DataFrame/Series containing only Trues and Falses. Applying a mask just takes the part(s) of your DataFrame where the mask is True, and leaves out the parts where it is False.

The code below uses the Stock column to create a mask called mask that is True where the Stock column is less than four, and then uses it to show which items in the DataFrame are low in stock. Note that the dtype of the mask becomes boolean, and it indeed contains True only for the rows that have a Stock of less than 4.

df = get_df()

mask = df['Stock'] < 4

print("\nMask where Stock is less than 4:")
display(mask)

print("\nUsing the mask on the DataFrame gives:")
display(df[mask])

Mask where Stock is less than 4:

Apple          False
Banana          True
Broccoli        True
Carrot         False
Coffee         False
Eggplant        True
Juice           True
Lettuce        False
Penne           True
Rice           False
Spaghetti       True
Strawberry     False
Tagliatelle     True
Name: Stock, dtype: bool

Using the mask on the DataFrame gives:
Category Price Unit Stock
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Eggplant Vegetables 0.72 1 pc 1
Juice Breakfast 2.79 2 L 3
Penne Pastas 0.99 500 gr 3
Spaghetti Pastas 0.99 500 gr 0
Tagliatelle Pastas 1.89 500 gr 2

Pandas also has built-in functions to create masks. One such example is the pd.Series.isin(values) method. This mask is True whenever an element in the Series is equal to one of the elements you provided in the list of values.

As an example, the code below creates a mask where the values in the ‘Category’ column are either ‘Vegetables’ or ‘Fruits’.

df = get_df()

mask = df['Category'].isin(['Vegetables', 'Fruits'])

print("\nThe mask where the category is either Fruits or Vegetables:")
display(mask)

print("\n\nThe result of applying this mask on the DataFrame:")
display(df[mask])

The mask where the category is either Fruits or Vegetables:

Apple           True
Banana          True
Broccoli        True
Carrot          True
Coffee         False
Eggplant        True
Juice          False
Lettuce         True
Penne          False
Rice           False
Spaghetti      False
Strawberry      True
Tagliatelle    False
Name: Category, dtype: bool

The result of applying this mask on the DataFrame:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Eggplant Vegetables 0.72 1 pc 1
Lettuce Vegetables 0.82 1 pc 5
Strawberry Fruits 3.49 400 gr 10

You can also use logical operations, like the logical or (|) and the logical and (&), to combine two masks element-wise. The first creates a new mask where each element is True if at least one of the masks was True for that element, and the second creates a new mask where each element is True only if both masks were True for that element.

expensive = df['Price'] > 1
low_stock = df['Stock'] < 4

combined_mask = expensive & low_stock

print("\nThe combined mask for items that are both expensive and low in stock:")
display(combined_mask)

print("\nSelecting out the rows where the combined mask is True:")
display(df[combined_mask])

The combined mask for items that are both expensive and low in stock:

Apple          False
Banana          True
Broccoli        True
Carrot         False
Coffee         False
Eggplant       False
Juice           True
Lettuce        False
Penne          False
Rice           False
Spaghetti      False
Strawberry     False
Tagliatelle     True
dtype: bool

Selecting out the rows where the combined mask is True:
Category Price Unit Stock
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Juice Breakfast 2.79 2 L 3
Tagliatelle Pastas 1.89 500 gr 2
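
The logical or (|) works in exactly the same way. One extra caveat, shown in the sketch below: when you write the comparisons inline, wrap each comparison in parentheses, because & and | bind more tightly than < and > in Python.

# Parentheses are required around each comparison when combining them inline
cheap_or_plentiful = (df['Price'] < 1) | (df['Stock'] > 10)

print("\nItems that are either cheap or have plenty of stock:")
display(df[cheap_or_plentiful])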

Further Reading: The Python Data Science Handbook has a section on data indexing and selection and using masks. The documentation on .iloc and .loc also contains more examples.

Operating on DataFrames

In the section on Adding and modifying data you already saw that basic arithmetic operations are applied element-wise by default, which is extremely useful. The general term for operations that are applied element-wise is “Universal Functions” (or Ufuncs), and they make libraries such as NumPy and Pandas incredibly powerful, convenient and computationally efficient. These Ufuncs do not only cover basic arithmetic (+, -, *, /, etc.) but also more sophisticated operations (exponential, trigonometric, logarithmic functions and many more).
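
As a small illustration, NumPy’s Ufuncs can be applied directly to a Pandas Series and return a new Series with the same index. A minimal sketch using np.log10:

import numpy as np

df = get_df()

# Apply the base-10 logarithm element-wise to the whole Price column;
# the result is again a Series, with the same product index
log_prices = np.log10(df['Price'])

display(log_prices)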

With all of these operations, it is important to keep track of the index of the Series you are using. For example, if you have a Series a with index 0,1,2 and a Series b with index 1,2,3, the operation c = a + b will have index 0,1,2,3 (the union of the index of a and b). Only on indices 1 and 2 will it be able to compute a result, and on indices 0 and 3 it will automatically fill in NaN (“Not a Number”).

a = pd.Series(
    data=[5, 2, 3],
    index=[0, 1, 2]
)
b = pd.Series(
    data=[7, 3, 1],
    index=[1, 2, 3]
)

print("Series a:")
display(a)

print("\n\nSeries b:")
display(b)

print("\n\na+b gives:")
display(a + b)

Series a:

0    5
1    2
2    3
dtype: int64

Series b:

1    7
2    3
3    1
dtype: int64

a+b gives:

0    NaN
1    9.0
2    6.0
3    NaN
dtype: float64

As you can see, the indices 1 and 2 have been added element-wise, as they occur in both Series, while for indices 0 and 3 no valid result could be computed, so they become NaN. If you want a different result for the indices that occur in only one of the Series, you’ll need to specify how to handle those cases with the pd.Series.add(...) method. Using a.add(b) is completely equivalent to a + b, but .add allows you to specify a fill_value, which will be used whenever one of the Series is missing a value for some index.
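
For instance, re-using the a and b Series from above and filling missing values with 0 (a short sketch):

# Index 0 now gives 5 + 0 = 5 and index 3 gives 0 + 1 = 1, instead of NaN
display( a.add(b, fill_value=0) )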

For example, we can set up a Series called sales with the daily sales for some products. If you want to subtract these from the ‘Stock’ Series in df, you need to specify what to do with the missing indices. In this case, we can fill these with zeros using the sub function and fill_value=0.

df = get_df()

# Extract the 'Stock' column
stock = df['Stock']

print("Stock before sales:")
display(stock)

# Set up a 'sales' Series
sales = pd.Series(
    data=[1, 3, 1, 2, 1, 2],
    index=['Lettuce', 'Carrot', 'Tagliatelle', 'Coffee', 'Apple', 'Strawberry'],
    dtype=int,
    name='Sales'
)

print("\n\nSales:")
display(sales)

print("\n\nStock after sales:")
display( stock.sub(sales, fill_value=0) )

# See what happens if you do not specify a `fill_value` by uncommenting the following line:
# display( stock.sub(sales) )

Stock before sales:

Apple           6
Banana          2
Broccoli        3
Carrot         22
Coffee         10
Eggplant        1
Juice           3
Lettuce         5
Penne           3
Rice            7
Spaghetti       0
Strawberry     10
Tagliatelle     2
Name: Stock, dtype: int64

Sales:

Lettuce        1
Carrot         3
Tagliatelle    1
Coffee         2
Apple          1
Strawberry     2
Name: Sales, dtype: int64

Stock after sales:

Apple           5.0
Banana          2.0
Broccoli        3.0
Carrot         19.0
Coffee          8.0
Eggplant        1.0
Juice           3.0
Lettuce         4.0
Penne           3.0
Rice            7.0
Spaghetti       0.0
Strawberry      8.0
Tagliatelle     1.0
dtype: float64

Further Reading: The Python Data Science Handbook on operations in Pandas.

Map

In addition to these built-in operations, you can also apply your own function to each value in a Series. Below is a very simple example function that just returns the length of a string. Say you want to apply myFunc to each value in series_in and save the results in series_out; you might write something like this:

# A very simple example function
def myFunc(value):
    newValue = len(value)
    return newValue


# Get some Series
series_in = pd.Series(["Hello", "world!"])

print("Input series:")
display(series_in)


# Save the new results in a list for now
series_out = []

# Go through each value in the input Series
for value in series_in:
    # Apply your function to the value, save in output list
    series_out.append( myFunc(value) )

# Create a new series with new values
series_out = pd.Series(series_out)


print("\n\nOutput Series:")
display(series_out)

Input series:

0     Hello
1    world!
dtype: object

Output Series:

0    5
1    6
dtype: int64

But looping through values, appending them and then creating a new Series is quite inefficient. All of this can be replaced by using the pd.Series.map method. You provide the name of your function (myFunc in this case), and Pandas will map each value in your Series using your function:

# A very simple example function
def myFunc(value):
    # Do some operations here...
    newValue = len(value)
    return newValue


# Get some Series
series_in = pd.Series(["Hello", "world!"])

print("Input series:")
display(series_in)


# Apply myFunc to each value in series_in, save the result in series_out
series_out = series_in.map(myFunc)


print("\n\nOutput Series:")
display(series_out)

Input series:

0     Hello
1    world!
dtype: object

Output Series:

0    5
1    6
dtype: int64

Lambda functions

One thing you’ll see very commonly in combination with map are lambda functions. Lambda functions are built into Python (so not a part of Pandas specifically), and can be used to quickly define functions you only want to use once, which is exactly what you want to do with a map most of the time.

For example, let’s say you want to convert all the Categories to lowercase. Defining a function for this and using map would be a good solution:

def lowercase(x):
    return x.lower()


df = get_df()

df['Category'] = df['Category'].map(lowercase)

print("\nThe Category column replaced by the lowercase version:")
display(df)

The Category column replaced by the lowercase version:
Category Price Unit Stock
Apple fruits 2.39 1 kg 6
Banana fruits 1.49 500 gr 2
Broccoli vegetables 1.29 500 gr 3
Carrot vegetables 0.45 1 pc 22
Coffee breakfast 7.49 1 kg 10
Eggplant vegetables 0.72 1 pc 1
Juice breakfast 2.79 2 L 3
Lettuce vegetables 0.82 1 pc 5
Penne pastas 0.99 500 gr 3
Rice international 2.59 1 kg 7
Spaghetti pastas 0.99 500 gr 0
Strawberry fruits 3.49 400 gr 10
Tagliatelle pastas 1.89 500 gr 2

An even more compact solution would be to use a lambda function here. A lambda function starts with the word lambda, followed by a space, the argument of your function (the example below uses x) and a colon (:). After that you can write a single line of code for the function, which will define what the function returns.

This is just a shorter way to write a function, where you don’t need to give it a name, don’t need to write return, and can define it on the same line. The example above can be replaced by the following, using a lambda function:

df = get_df()

df['Category'] = df['Category'].map(lambda x: x.lower())

print("\nThe Category column replaced by the lowercase version:")
display(df)

The Category column replaced by the lowercase version:
Category Price Unit Stock
Apple fruits 2.39 1 kg 6
Banana fruits 1.49 500 gr 2
Broccoli vegetables 1.29 500 gr 3
Carrot vegetables 0.45 1 pc 22
Coffee breakfast 7.49 1 kg 10
Eggplant vegetables 0.72 1 pc 1
Juice breakfast 2.79 2 L 3
Lettuce vegetables 0.82 1 pc 5
Penne pastas 0.99 500 gr 3
Rice international 2.59 1 kg 7
Spaghetti pastas 0.99 500 gr 0
Strawberry fruits 3.49 400 gr 10
Tagliatelle pastas 1.89 500 gr 2

Further Reading: More about lambda statements, including examples with and without the use of Pandas.

Sorting values

Sorting your DataFrame or Series is also a useful and common operation. This is as easy as providing a column to sort on, and Pandas will do the rest. By default this will sort the rows of the DataFrame in ascending order (lowest to highest), but you can change this by setting the optional ascending parameter to False. Note that strings are automatically sorted in alphabetical order too.

The code below will sort the DataFrame by the ‘Category’ column in ascending order:

df = get_df()

sorted_categories = df.sort_values(by='Category')

display(sorted_categories)

Category Price Unit Stock
Coffee Breakfast 7.49 1 kg 10
Juice Breakfast 2.79 2 L 3
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Strawberry Fruits 3.49 400 gr 10
Rice International 2.59 1 kg 7
Penne Pastas 0.99 500 gr 3
Spaghetti Pastas 0.99 500 gr 0
Tagliatelle Pastas 1.89 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Eggplant Vegetables 0.72 1 pc 1
Lettuce Vegetables 0.82 1 pc 5

Another useful method is sorting by multiple columns: simply replace the by=... argument with a list of column names (by=['col1', 'col2', ...]). Similarly, you can replace the ascending=... argument with a list of True or False values (ascending=[True, False, ...]) of the same length, which indicates for each of the corresponding columns whether it should be sorted in ascending order or not.

Note that the order of the columns in the by list matters, as the DataFrame will be sorted by the first column first, and so on. The code below first sorts by the ‘Category’ column (in ascending order). If multiple rows have the same category, it then sorts these by the ‘Stock’ column (in descending order):

df = get_df()

display(df.sort_values(by=['Category', 'Stock'], ascending=[True, False]))

Handling missing values

In the section on operations we already encountered some NaN values. Knowing how to deal with missing values is a great skill when working with messy datasets (which most datasets are, in practice).

For this exercise we will use a different dataset: for the most popular products in the store (Carrots, Apples, Coffee and Lettuce), your boss has kept a registry of the stock over time. In this case, the products are the columns, and each row is indexed by a date. The values you see are the stock at the start of each week (the market gets restocked on the first of each month).

Unfortunately, some values were illegible and are replaced with NaN values. In this section you will see some tools that can help you deal with this.

stock_df = get_stocks()
display(stock_df)

Product Carrot Apple Coffee Lettuce
Date
2023-08-07 7.0 17.0 13.0 22.0
2023-08-14 NaN 12.0 NaN 17.0
2023-08-21 NaN 9.0 7.0 9.0
2023-08-28 NaN 5.0 3.0 5.0
2023-09-04 38.0 20.0 17.0 22.0
2023-09-11 22.0 12.0 14.0 16.0
2023-09-18 11.0 NaN 12.0 10.0
2023-09-25 3.0 NaN 9.0 3.0
2023-10-02 44.0 23.0 15.0 14.0
2023-10-09 28.0 10.0 11.0 8.0

The simplest method of dealing with NaN values is removing them. This can be done using the pd.DataFrame.dropna(...) method. By default, it will remove any row that has any NaN in it. You can change this to drop any column that has at least one NaN value by passing axis=1. If you want to drop a row/column only if it is completely filled with NaN values, you can pass how='all'.

Below is a code example that uses dropna with axis=0 and how='any' (which are also the default options) to remove the rows containing NaN. You can make a new cell where you copy this example and play around with different combinations of arguments. With each combination of arguments you test, try to predict how many rows and columns you expect to see before running the cell.

stock_df = get_stocks()

# Drop rows/columns that contain NaN values
# axis=0 -> drop rows (default)
#      1 -> drop columns
# how='any' -> if that row/column contains at least one NaN value (default)
#     'all' -> if that row/column contains only NaN values
cleaned_df = stock_df.dropna(axis=0, how='any')

display(cleaned_df)

Product Carrot Apple Coffee Lettuce
Date
2023-08-07 7.0 17.0 13.0 22.0
2023-09-04 38.0 20.0 17.0 22.0
2023-09-11 22.0 12.0 14.0 16.0
2023-10-02 44.0 23.0 15.0 14.0
2023-10-09 28.0 10.0 11.0 8.0
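
For comparison, here is a sketch that drops columns instead of rows: with axis=1 every column containing at least one NaN value is removed, which for this dataset leaves only the ‘Lettuce’ column.

stock_df = get_stocks()

# Drop every column that contains at least one NaN value
display(stock_df.dropna(axis=1, how='any'))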

The examples above show that, no matter how you deal with discarding NaN values, you throw away some valuable data in the process.

Instead, you can try to replace these missing values with some approximation or guess of what these values could have been. Doing this in a sophisticated way is its own field of research (called Regression, which is part of Machine Learning), but for now we will just replace each NaN with a constant value (i.e. the same value everywhere).

Filling values can be done using the pd.DataFrame.fillna(...) method. For example, you can specify some value you want to fill with:

stock_df = get_stocks()
display(stock_df.fillna(value=0))

Product Carrot Apple Coffee Lettuce
Date
2023-08-07 7.0 17.0 13.0 22.0
2023-08-14 0.0 12.0 0.0 17.0
2023-08-21 0.0 9.0 7.0 9.0
2023-08-28 0.0 5.0 3.0 5.0
2023-09-04 38.0 20.0 17.0 22.0
2023-09-11 22.0 12.0 14.0 16.0
2023-09-18 11.0 0.0 12.0 10.0
2023-09-25 3.0 0.0 9.0 3.0
2023-10-02 44.0 23.0 15.0 14.0
2023-10-09 28.0 10.0 11.0 8.0

Using the same constant value for each column might not be the best approach: there seem to be many more Carrots in stock than Coffee. It would be better if you use some constant value for one column, and another value for another column.

You can do this by passing a ‘mapping’ as the value argument, which gives a value to use for each column in the DataFrame. A dictionary is a type of mapping, which maps the keys of that dictionary to the corresponding values. The example below uses Carrot, Apple and Coffee as keys, each with their own value.

stock_df = get_stocks()

# Create a dictionary (a type of mapping)
mapping = {
    'Carrot': 15,
    'Apple': 10,
    'Coffee': 2
}

display(stock_df.fillna(value=mapping))

Product Carrot Apple Coffee Lettuce
Date
2023-08-07 7.0 17.0 13.0 22.0
2023-08-14 15.0 12.0 2.0 17.0
2023-08-21 15.0 9.0 7.0 9.0
2023-08-28 15.0 5.0 3.0 5.0
2023-09-04 38.0 20.0 17.0 22.0
2023-09-11 22.0 12.0 14.0 16.0
2023-09-18 11.0 10.0 12.0 10.0
2023-09-25 3.0 10.0 9.0 3.0
2023-10-02 44.0 23.0 15.0 14.0
2023-10-09 28.0 10.0 11.0 8.0
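
Instead of picking these values by hand, you could also compute such a mapping from the data itself. The sketch below fills each column with that column's mean; stock_df.mean() returns a Series indexed by the column names, which fillna accepts as a mapping:

stock_df = get_stocks()

# .mean() skips NaN values by default and gives one mean per column
column_means = stock_df.mean()

# Fill the NaNs in each column with that column's mean
display(stock_df.fillna(value=column_means))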

Further Reading: The Python Data Science Handbook on handling missing data.

Grouping

Categorical variables (such as the ‘Category’ column in our example dataset) can also be used to group data together. In doing this, you combine one or more rows from your original DataFrame into a row in the grouped DataFrame. You will then need to specify how these rows get combined (also called ‘aggregating’), as multiple rows will need to be reduced to one row per Category.

First, you can group a DataFrame using the pd.DataFrame.groupby method and specifying which column you want to group on. This gives something called a DataFrameGroupBy, which is a special intermediate groupby object. The groupby can then be completed by using an aggregate function on the DataFrameGroupBy. This will return a regular DataFrame again, with its rows reduced as specified by the aggregate function.

You’ve already seen some of the possible aggregate functions for numerical values throughout this notebook. These are functions like .sum(), .max(), .min(), .mean() and .median(), which combine values by computing, for example, the maximum value in each column. With a groupby, these functions are applied to all the remaining columns, i.e. the columns that were not selected as the column to group on. Applying an aggregate function to a DataFrameGroupBy results in a new DataFrame with just one row for each separate group, containing all the combined values of that group. Note that it is not possible to apply these functions to non-numerical columns (like the Unit column), so trying this will result in an error. You can pass the optional numeric_only=True argument to the aggregate functions to tell Pandas explicitly to just drop those columns.

df = get_df()

grouped_df = df.groupby('Category')

print("\nIntermediate groupby object, which isn't a real DataFrame:")
display(grouped_df)

print("\nHowever, we can access each group separately, which are DataFrames:")
for (group_name, group_df) in grouped_df:
    print('\n'+ group_name +':')
    display(group_df)


category_means = grouped_df.mean(numeric_only=True)

print("\n\n\nComputing the mean per group for each column:")
display(category_means)

# Note this is usually how this is used, with both function calls on the same line
# so without storing the intermediate groupby object separately
category_max = df.groupby('Category').max(numeric_only=True)

print("\nComputing the maximum per group for each column:")
display(category_max)

Intermediate groupby object, which isn't a real DataFrame:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x15a4bf350>

However, we can access each group separately, which are DataFrames:

Breakfast:
Category Price Unit Stock
Coffee Breakfast 7.49 1 kg 10
Juice Breakfast 2.79 2 L 3
Fruits:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Strawberry Fruits 3.49 400 gr 10
International:
Category Price Unit Stock
Rice International 2.59 1 kg 7
Pastas:
Category Price Unit Stock
Penne Pastas 0.99 500 gr 3
Spaghetti Pastas 0.99 500 gr 0
Tagliatelle Pastas 1.89 500 gr 2
Vegetables:
Category Price Unit Stock
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Eggplant Vegetables 0.72 1 pc 1
Lettuce Vegetables 0.82 1 pc 5
Computing the mean per group for each column:
Price Stock
Category
Breakfast 5.14 6.50
Fruits 2.46 6.00
International 2.59 7.00
Pastas 1.29 1.67
Vegetables 0.82 7.75
Computing the maximum per group for each column:
Price Stock
Category
Breakfast 7.49 10
Fruits 3.49 10
International 2.59 7
Pastas 1.89 3
Vegetables 1.29 22

As you can see, the result of these groupby and aggregate operations is a new DataFrame containing the combined data. In the mean DataFrame you can, for example, see that the average Stock of all the Breakfast items is $6.5$. And in the max DataFrame you can see that the Fruit with the highest Price costs $3.49$.

There are also some aggregate functions that work on both numerical and categorical columns:


df = get_df()

print("\nThe number of unique values in each group:")
display(df.groupby('Category').nunique())

print("\nThe first value in each group:")
display(df.groupby('Category').first())

print("\nKeeping (up to) the first 2 rows per group:")
display(df.groupby('Category').head(2))

The number of unique values in each group:
Price Unit Stock
Category
Breakfast 2 2 2
Fruits 3 3 3
International 1 1 1
Pastas 2 1 3
Vegetables 4 2 4
The first value in each group:
Price Unit Stock
Category
Breakfast 7.49 1 kg 10
Fruits 2.39 1 kg 6
International 2.59 1 kg 7
Pastas 0.99 500 gr 3
Vegetables 1.29 500 gr 3
Keeping (up to) the first 2 rows per group:
Category Price Unit Stock
Apple Fruits 2.39 1 kg 6
Banana Fruits 1.49 500 gr 2
Broccoli Vegetables 1.29 500 gr 3
Carrot Vegetables 0.45 1 pc 22
Coffee Breakfast 7.49 1 kg 10
Juice Breakfast 2.79 2 L 3
Penne Pastas 0.99 500 gr 3
Rice International 2.59 1 kg 7
Spaghetti Pastas 0.99 500 gr 0

Like regular DataFrames, you can select one or more columns from a DataFrameGroupBy using the column names. Remember that, as when indexing columns, a single column will always return a Series, and a list of columns (even a list of 1 column) returns a DataFrame!

The examples below compute the minimum Stock per Category, and number of unique values in Price and Unit per Category.

df = get_df()

grouped_df = df.groupby('Category')

minimum_stock = grouped_df['Stock'].min()

print("\nThe minimum values in the Stock column, per Category:")
# Passing just 'Stock' gives a Series, from which we compute the min
display(minimum_stock)

unique_price_unit = grouped_df[['Price', 'Unit']].nunique()

print("\nThe number of unique values in the 'Price' and 'Unit' column, per category:")
# Passing multiple columns gives a DataFrame; from which we compute the n-unique
display(unique_price_unit)

The minimum values in the Stock column, per Category:

Category
Breakfast        3
Fruits           2
International    7
Pastas           0
Vegetables       1
Name: Stock, dtype: int64

The number of unique values in the 'Price' and 'Unit' column, per category:
Price Unit
Category
Breakfast 2 2
Fruits 3 3
International 1 1
Pastas 2 1
Vegetables 4 2

Pandas supports many more methods of aggregating data, which you can use through the grouped_df.agg() method. Rather than ‘just’ computing the mean, you can give a list of methods to aggregate the groups in your DataFrame:

df = get_df()
grouped_df = df.groupby('Category')

# For each column, compute the number of unique values and the number of rows per category
display(
    grouped_df.agg(['nunique', 'count'])
)
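
You can also pass a dictionary to .agg to apply a different aggregation to each column. A short sketch (the choice of functions here is just an example):

df = get_df()

# Average price and total stock, per category
display(
    df.groupby('Category').agg({'Price': 'mean', 'Stock': 'sum'})
)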

Further Reading: On grouping data.