Images from Unsplash

Disclaimer: This article is my learning note from the courses I took from Kaggle.

In this course, we will explore on the Python pandas module which is a popular library for data analysis. With pandas, we can use it to create data and also work or manipulate the existing data.

1. Introduction

pandas has two core objects known as DataFrame and Series. A DataFrame is a table where it consists of individual entries in the form of an array. Each entry corresponds to a row and a column. For example:

Let us define a DataFrame, we can also write string in the DataFrame:

import pandas as pd
pd.DataFrame({
    'x': [1,2,3],
    'y': [3,2,1],
})
| x | y |
|---|---|
| 1 | 3 |
| 2 | 2 |
| 3 | 1 |

In the above code we are defining x and y as the column names, and we assigned values to the column. Now, let’s try to put an index or row labels to the DataFrame:

pd.DataFrame({
    'x': [1,2,3],
    'y': [3,2,1]},
    index = ['Row 1', 'Row 2', 'Row 3']    
)
|     | x | y |
|-----|---|---|
|Row 1| 1 | 3 |
|Row 2| 2 | 2 |
|Row 3| 3 | 1 |

On the other hand, for Series, it represents a sequence of data values. Just now we have shown that DataFrame is in the form of table, but Series is in the form of a list. Let’s try defining one:

pd.Series([1,2,3,4])

We can think of a series as only a single column of a DataFrane. In fact, we can add row label or index to Series too:

pd.Series([1,2,3,4], index = ['Row 1', 'Row 2', 'Row 3', 'Row 4'], name = "My Series")
|Row 1| 1 |
|Row 2| 2 |
|Row 3| 3 |
Name: Product A

1.1 Reading Data Files

We can use pandas to read .csv file as well, by typing df = pd.read_csv('file.csv').

To check how large the DataFrame is:

df.shape()

Here’s how we can take a preview of the first five rows of the DataFrame:

df.head()
# try df.head(10) for first 10 rows

You might be wondering, can we display only the last five rows? Of course:

df.tail()

2. Indexing, Selecting & Assigning

We have learned how to create a Series and a DataFrame. Now let’s look at how do we select specific value in the DataFrame in a quick and effective manner.

Let’s say we have a DataFrame called df with the following column names: country, points, price, province, state and so on. If we would like to only see the country column, we could access the column by doing this:

df.country

# or alternatively
df['country']

Notice two approaches are provided above. Neither of them is more or less syntactically valid than the other. But let’s say we have a column name My House, we just could not write df.My House, it wouldn’t work.

If we want to know the first item in the country column, we can use the indexing operator to drill down to a single specific value:

df['country'][0]

2.1 Indexing in Pandas

In pandas, indexing works in one of two paradigms. We will start with the first one, the index-based selection where data is selected based on its numerical position in the data. To select the first row of a data:

df.iloc[0]

For the loc and iloc function, row will come first then column.

df.iloc[:,0] # display all row of first column

The : operator means everything. We can use it to indicate the range of values that we are interested in:

df.iloc[:3, 0] # first three rows of the first column (index 3 or fourth row excluded)
df.iloc[1:3, 0] # second and third rows of the first column

# alternatively
df.iloc[[1,2], 0]

Interestingly, we can also use negative number to select data items. For negative number, it will start counting from the end of the values:

df.iloc[-5:] # get the last five rows for all columns

Furthermore, a second paradigm for attribute selection is done by using the loc operator, called as label-based selection. For this approach, it is the data index value, rather than the position that matters.

df.loc[0, 'country'] # first item in the country column

When we compare iloc and loc, iloc seems to be conceptually simpler than loc. This is because we ignore the dataset’s indices. Using iloc, dataset is treated like a big matrix, that we index into by position. For loc, we make use of the information in the indices to do its work. Of course, dataset usually comes with indices and this makes loc to be easier to work with.

df.loc[:, ['country', 'regions', 'states']] # display all rows for the column of country, regions and states

So, how do we choose between iloc and loc? For iloc, it uses the Python standard library indexing scheme, where the first element in the range is included, and the last one is excluded. But for loc, it will include both the first and last element. Of course, using iloc can be confusing sometimes. Here’s a comparison:

Using loc: 0:10
Output: 0,1,...,10

Using iloc 0:10
Output: 0,1,...,9

2.2 Manipulation Index

When using label-based selection, the power of it comes from the labels in the index. We can manipulate index in any way we see fit as well by using set_index()

df.set_index("country")

2.3 Conditional Selection

In order to do some interesting things with our dataset, we often need to ask questions based on certain conditions. Suppose that we want to know the better-than-average wine produced in Sweden, we can start by checking whether the wine is from Sweden or not:

wine.country == 'Sweden'

# alternatively
wine.loc[wine.country == 'Sweden']

Now we would like to add an extra condition. We want the wine review to also be at least 90 points:

wine.loc[(wine.country == 'Sweden') & (wine.points >= 90)]

If we would like to know whether a wine is from Sweden or a wine is rated above average:

wine.loc[(wine.country == 'Sweden') | (wine.points >= 90)]

What if we want to know whether a wine is produced in Sweden or Italy?

wine.loc[wine.country.isin(['Sweden', 'Italy'])]

We notice that for the wine data produced in Sweden or Italy, there are some rows where the points given is empty, we would like to filter them out:

wine.loc[wine.points.notnull()]

# to check for the null rows
wine.loc[wine.points.isnull()]

2.4 Assigning Data

Now let’s look at the method to assign data to a DataFrame:

wine['critic'] = everyone # assingning a constant to the column named critic
wine['index_reverse'] = range(len('country'),0,-1) 

3. Summary Functions & Maps

In the previous section, we learned about selecting relevant data from a DataFrame and a Series. However, data does not always come out in the format that we want, and often we would want to do some more works to reformat the data at hand. For starter, we could use the describe function to generate a high-level summary of the attributes of the given column:

wine.describe()

We can select the column that we want to know the summary:

winde.critic.describe()

Notice that for numerical column, the describe function is telling use about the name, the maximum and the minimum value and a few other information. If we only want to know, says, the mean or the unique labels in a column, we could:

wine.points.mean()
wine.critic.unique()

Let’s say we want to know for a list of unique value, how frequent they occur in the dataset:

wine.critic.value_counts()

3.1 Maps

Map is a mathematical term where it describes that a function takes one set of values and maps them to another set of value. Oftentimes, we have to create new representations from existing data or perform transformation to the data. Maps are extremely important to help us to achieve our works.

An example where we remean the scores the wines received to 0:

wine_points_mean = wine.points.mean()
wine_points_mean.map(lambda p : p - wine_points_mean)

# alternative
wine_points_mean = wine.points.mean()
wine.points - wine_points_mean

In the function above, map() expects a single value from the Series and return a transformed version of that value. A new series with all the transformed values will be returned by map().

On the other hand, we can use apply() which is also an equivalent method. Now we want to transform the whole DataFrame by calling a function on each row:

def remean(row):
    row.points = row.points - wine_points_mean
    return row

wine.apply(remean, axis = 'columns')

If we set the axis = index, then we would need to give a function to transform each column rather than each row.

map and apply do not modify the original data they’re call on. They return new, transformed series and DataFrame respectively.

Pandas understand what we do when we perform operations between Series of equal length. These operations are faster than using map or apply. Of course, we can use these functions with more advanced case like conditional logic.

wine.country + " - " + wine_region

4. Grouping & Sorting

4.1 Groupwise Analysis

In previous section, we learned about value_counts() to count for the unique labels’ occurrences. Here’s an alternative:

wine.groupby('points').points.count()

This groupby function creates a group of reviews which allotted the same point values to the given wine. Then from these groups, we select the points column and count how many times they appear. For example, we want to get the cheapest wine in each point value category:

wine.groupby('points').price.min()

We can think of each group as a slice of our DataFrame with only data with values that match. This DataFrane is accessible to use directly using apply, and we can manipulate the data. If we want to select the name of the first wine reviewed from each winery:

wine.groupby('winery').apply(lambda df : df.title.iloc[0])

We can also group more than one column. Let’s say we want to pick out the best wine by country and province:

wine.groupby(['country', 'province']).apply(lambda x : x.loc[x.points.idmax()])

Furthermore, if we want to run a bunch of different functions on the DataFrame at once, we can generate a simple statistical summary of the dataset with agg:

wine.groupby(['country']).price.agg([len, min, max])

4.2 Multi-indexes

A multi-index example:

countries_reviewed = wine.groupby(['country', 'province']).description.agg([len])
countries_reviewed
| country |     province     | len |
|---------|------------------|-----|
|Argentina| Mendoza Province | 3264|
|         | Other            | 536 |
|...      | ...              |     |

If we check the type:

mi = countries_reviewed.index
type(mi)

# output
# pandas.core.indexes.multi.MultiIndex

There are a few ways to deal with this tiered structure which are absent for single-level indices. To retrieve a value, we need to provide two levels of labels. For the first method, we can convert the index back to the regular index:

countries_reviewed.reset_index()
| | country |     province     | len |
|-|---------|------------------|-----|
|0|Argentina| Mendoza Province | 3264|
|1|Argentina| Other            | 536 |
|2|...      | ...              | ... |

4.3. Sorting

Previously, we learned about the grouping function to group data in index order. But what about in value order? Can we order the rows from the grouping result based on values in the data rather than index?

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by = 'len')

By default, the sorting will be in ascending order, we can change that by changing the ascending parameter:

countries_reviewed.sort_values(by = 'len', ascending = False)

Moreover, if we just want to sort by index values, simply leave the by parameter blank:

countries_reviewed.sort_values()

Of course, we could sort more than one column at a time:

countries_reviewed.sort_values(by = ['country', 'len'])

5. Data Types & Missing Values

In this section, we will explore how to investigate data types within a DataFrame and how to replace entries in it.

wine.price.dtype() # for the specified column

wine.dtype() # for all columns

dtype function tells us how pandas store the data internally such as float64 or int64. We can convert a column from one type to the other too:

wine.points.astype('float64')

5.1 Missing Value

In a dataset, we might see values being given as NaN or “not a number”. Using Pandas, we can check for the missing data in a DataFrame:

wine[pd.isnull(wine.country)]

To replace the missing value in data, we can use the fillna() function:

wine.region.fllna("Unknown")

If we want to replace value, here’s what we can do:

wine.twitter_handle.replace("@x", "@abc")

6. Renaming & Combining

rename is a function to let you change index names or column names:

wine.rename(column = {'point': 'score'})

This function allows us to rename index or column by specifying an index or a column keyword parameter:

wine.rename(index = {0: 'first', 1: 'second'})

Of course, we could give name to a row index and the column index as well:

wine.rename_axis('wines', axis = 'rows').rename_axis('fields', axis = 'columns')

6.1 Combining

The simplest combination method in Pandas is concat. This function is useful when we have data in different DataFrame but having the same columns:

pd.concat([can_yt, bri_yt])

Moreover, for join, it will combine different DataFrame which have an index in common:

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

For the above example, the lsuffix and rsuffix parameter is necessary because the data has the same columns names in both of the datasets. If it is not, then we could just ignore that.