Disclaimer: This article is my learning note from the courses I took from Kaggle.
In this course, we will explore the Python pandas module, which is a popular library for data analysis. With pandas, we can create data and also work with or manipulate existing data.
1. Introduction
pandas has two core objects known as DataFrame and Series. A DataFrame is a table that contains an array of individual entries, where each entry corresponds to a row and a column. For example, let us define a DataFrame (we can also store strings in a DataFrame):
import pandas as pd
pd.DataFrame({
    'x': [1, 2, 3],
    'y': [3, 2, 1],
})
|   | x | y |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 2 |
| 2 | 3 | 1 |
In the above code we define x and y as the column names and assign values to each column. Now, let’s try to add an index, or row labels, to the DataFrame:
pd.DataFrame({
    'x': [1, 2, 3],
    'y': [3, 2, 1]},
    index = ['Row 1', 'Row 2', 'Row 3']
)
| | x | y |
|-----|---|---|
|Row 1| 1 | 3 |
|Row 2| 2 | 2 |
|Row 3| 3 | 1 |
On the other hand, a Series represents a sequence of data values. We have just shown that a DataFrame is in the form of a table, whereas a Series is in the form of a list. Let’s try defining one:
pd.Series([1,2,3,4])
We can think of a Series as a single column of a DataFrame. In fact, we can add row labels, or an index, to a Series too:
pd.Series([1,2,3,4], index = ['Row 1', 'Row 2', 'Row 3', 'Row 4'], name = "My Series")
|Row 1| 1 |
|Row 2| 2 |
|Row 3| 3 |
|Row 4| 4 |
Name: My Series, dtype: int64
1.1 Reading Data Files
We can use pandas to read a .csv file as well, by typing df = pd.read_csv('file.csv').
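As a quick sketch, we can also tell read_csv which column of the file to use as the row labels (the wine.csv filename here is just a hypothetical example):
df = pd.read_csv('wine.csv', index_col=0) # use the first column of the file as the index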
To check how large the DataFrame is:
df.shape # shape is an attribute, not a method; it returns (number of rows, number of columns)
Here’s how we can take a preview of the first five rows of the DataFrame:
df.head()
# try df.head(10) for first 10 rows
You might be wondering, can we display only the last five rows? Of course:
df.tail()
2. Indexing, Selecting & Assigning
We have learned how to create a Series and a DataFrame. Now let’s look at how to select specific values in a DataFrame in a quick and effective manner.
Let’s say we have a DataFrame called df with the following column names: country, points, price, province, state and so on. If we would like to see only the country column, we can access it like this:
df.country
# or alternatively
df['country']
Notice that two approaches are provided above. Neither of them is more or less syntactically valid than the other. But if a column is named My House, we simply could not write df.My House; it wouldn’t work.
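In that case, the bracket notation still works; a quick sketch with this hypothetical My House column:
df['My House'] # bracket access handles column names that contain spaces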
If we want to know the first item in the country column, we can use the indexing operator to drill down to a single specific value:
df['country'][0]
2.1 Indexing in Pandas
In pandas, indexing works in one of two paradigms. We will start with the first one, index-based selection, where data is selected based on its numerical position in the data. To select the first row of the data:
df.iloc[0]
For the loc and iloc operators, the row selector comes first, then the column selector.
df.iloc[:, 0] # display all rows of the first column
The : operator means everything. We can use it to indicate the range of values that we are interested in:
df.iloc[:3, 0] # first three rows of the first column (index 3 or fourth row excluded)
df.iloc[1:3, 0] # second and third rows of the first column
# alternatively
df.iloc[[1,2], 0]
Interestingly, we can also use negative numbers to select data items. A negative number starts counting from the end of the values:
df.iloc[-5:] # get the last five rows for all columns
Furthermore, the second paradigm for attribute selection uses the loc operator and is called label-based selection. In this approach, it is the data’s index value, rather than its position, that matters.
df.loc[0, 'country'] # first item in the country column
When we compare iloc and loc, iloc seems conceptually simpler than loc because it ignores the dataset’s indices. Using iloc, the dataset is treated like a big matrix that we index into by position. With loc, we make use of the information in the indices to do the work. Of course, a dataset usually comes with meaningful indices, and this makes loc easier to work with.
df.loc[:, ['country', 'regions', 'states']] # display all rows of the country, regions and states columns
So, how do we choose between iloc and loc? iloc uses the Python standard library indexing scheme, where the first element of the range is included and the last one is excluded. loc, on the other hand, includes both the first and the last element. This difference can be confusing sometimes. Here’s a comparison:
Using loc with 0:10 returns rows labelled 0, 1, ..., 10.
Using iloc with 0:10 returns rows at positions 0, 1, ..., 9.
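To make the difference concrete, here is a small sketch (assuming df uses the default numerical index):
df.iloc[0:10] # 10 rows: positions 0 through 9
df.loc[0:10] # 11 rows: labels 0 through 10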
2.2 Manipulating the Index
When using label-based selection, its power comes from the labels in the index. We can also manipulate the index in any way we see fit by using set_index():
df.set_index("country")
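One thing worth remembering: set_index() returns a new DataFrame rather than changing df in place, so to keep the new index we assign the result back (a minimal sketch):
df = df.set_index("country") # keep the country-indexed DataFrame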
2.3 Conditional Selection
In order to do some interesting things with our dataset, we often need to ask questions based on certain conditions. Suppose that we want to find the better-than-average wines produced in Sweden; we can start by checking whether each wine is from Sweden or not:
wine.country == 'Sweden' # returns a Series of True/False booleans
# use that condition to select the matching rows
wine.loc[wine.country == 'Sweden']
Now we would like to add an extra condition. We want the wine review to also be at least 90 points:
wine.loc[(wine.country == 'Sweden') & (wine.points >= 90)]
If we would like to know whether a wine is from Sweden or a wine is rated above average:
wine.loc[(wine.country == 'Sweden') | (wine.points >= 90)]
What if we want to know whether a wine is produced in Sweden or Italy?
wine.loc[wine.country.isin(['Sweden', 'Italy'])]
We notice that for the wine produced in Sweden or Italy, there are some rows where the points value is empty; we would like to filter those out:
wine.loc[wine.points.notnull()]
# to check for the null rows
wine.loc[wine.points.isnull()]
2.4 Assigning Data
Now let’s look at how to assign data to a DataFrame:
wine['critic'] = 'everyone' # assigning a constant value to the column named critic
wine['index_reverse'] = range(len(wine), 0, -1) # assigning an iterable of descending numbers
3. Summary Functions & Maps
In the previous section, we learned about selecting relevant data from a DataFrame and a Series. However, data does not always come in the format that we want, and often we have to do some more work to reformat the data at hand. For starters, we can use the describe function to generate a high-level summary of the attributes of the data:
wine.describe()
We can also select the specific column whose summary we want to see:
wine.critic.describe()
Notice that for a numerical column, the describe function tells us the count, the mean, the maximum and the minimum values, and a few other statistics. If we only want to know, say, the mean or the unique labels in a column, we could:
wine.points.mean()
wine.critic.unique()
Let’s say we want to know, for each unique value, how frequently it occurs in the dataset:
wine.critic.value_counts()
3.1 Maps
Map is a mathematical term describing a function that takes one set of values and maps them to another set of values. Oftentimes, we have to create new representations from existing data or perform transformations on the data. Maps are extremely important in helping us achieve this work.
An example where we remean the scores the wines received to 0:
wine_points_mean = wine.points.mean()
wine.points.map(lambda p : p - wine_points_mean)
# alternative
wine_points_mean = wine.points.mean()
wine.points - wine_points_mean
In the function above, map() expects a single value from the Series and returns a transformed version of that value. A new Series with all the transformed values is then returned by map().
On the other hand, we can use apply(), which is an equivalent method. Now we want to transform the whole DataFrame by calling a function on each row:
def remean(row):
    row.points = row.points - wine_points_mean
    return row

wine.apply(remean, axis = 'columns')
If we set axis = 'index' instead, then we would need to give a function that transforms each column rather than each row.
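Here is a rough sketch of that variant (the remean_column helper is just an illustration, not from the original course):
def remean_column(col):
    # col is an entire column passed in as a Series
    if col.name == 'points':
        return col - wine_points_mean
    return col

wine.apply(remean_column, axis = 'index')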
map and apply do not modify the original data they are called on. They return a new, transformed Series and DataFrame respectively.
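So, to keep the transformed values, we assign the result back ourselves; a minimal sketch:
wine['points'] = wine.points.map(lambda p : p - wine_points_mean) # store the remeaned column back into wine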
pandas understands what to do when we perform operations between Series of equal length. These operations are faster than using map or apply. Of course, map and apply are still needed for more advanced cases like conditional logic.
wine.country + " - " + wine.region
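As a small sketch of the conditional-logic case mentioned above (the rate helper and its 'good'/'average' labels are just for illustration):
def rate(row):
    # label each wine based on its points score
    if row.points >= 90:
        return 'good'
    return 'average'

wine.apply(rate, axis = 'columns')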
4. Grouping & Sorting
4.1 Groupwise Analysis
In the previous section, we learned about value_counts(), which counts the occurrences of the unique labels. Here’s an alternative:
wine.groupby('points').points.count()
The groupby function creates a group of reviews which allotted the same point value to the given wines. Then, from each of these groups, we select the points column and count how many times it appears. As another example, if we want to get the cheapest wine in each point-value category:
wine.groupby('points').price.min()
We can think of each group as a slice of our DataFrame containing only the data whose values match. This DataFrame is accessible to us directly using apply, and we can then manipulate the data. If we want to select the name of the first wine reviewed from each winery:
wine.groupby('winery').apply(lambda df : df.title.iloc[0])
We can also group by more than one column. Let’s say we want to pick out the best wine by country and province:
wine.groupby(['country', 'province']).apply(lambda x : x.loc[x.points.idxmax()])
Furthermore, if we want to run a bunch of different functions on the DataFrame at once, we can generate a simple statistical summary of the dataset with agg:
wine.groupby(['country']).price.agg([len, min, max])
4.2 Multi-indexes
A multi-index example:
countries_reviewed = wine.groupby(['country', 'province']).description.agg([len])
countries_reviewed
| country | province | len |
|---------|------------------|-----|
|Argentina| Mendoza Province | 3264|
| | Other | 536 |
|...      | ...              | ... |
If we check the type:
mi = countries_reviewed.index
type(mi)
# output
# pandas.core.indexes.multi.MultiIndex
There are a few ways to deal with this tiered structure, which are absent for single-level indices. To retrieve a value, we need to provide two levels of labels. The first method of simplifying things is to convert the multi-index back to a regular index:
countries_reviewed.reset_index()
| | country | province | len |
|-|---------|------------------|-----|
|0|Argentina| Mendoza Province | 3264|
|1|Argentina| Other | 536 |
|2|... | ... | ... |
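As mentioned above, the other way is to supply both levels of labels when retrieving a value; a quick sketch using the country and province labels shown in the table, before the index is reset:
countries_reviewed.loc[('Argentina', 'Mendoza Province')] # retrieve one row of the multi-indexed DataFrame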
4.3 Sorting
Previously, we learned about the grouping functions, which return data in index order. But what about value order? Can we order the rows of the grouping result based on values in the data rather than on the index?
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by = 'len')
By default, the sorting will be in ascending order; we can change that with the ascending parameter:
countries_reviewed.sort_values(by = 'len', ascending = False)
Moreover, if we just want to sort by index values, we use the companion sort_index() method instead:
countries_reviewed.sort_index()
Of course, we could sort by more than one column at a time:
countries_reviewed.sort_values(by = ['country', 'len'])
5. Data Types & Missing Values
In this section, we will explore how to investigate data types within a DataFrame and how to replace entries in it.
wine.price.dtype # the data type of the specified column (dtype is an attribute, not a method)
wine.dtypes # the data types of all columns
The dtype attribute tells us how pandas stores the data internally, such as float64 or int64. We can convert a column from one type to another too:
wine.points.astype('float64')
5.1 Missing Values
In a dataset, we might see values given as NaN, which stands for “not a number”. Using pandas, we can check for the missing data in a DataFrame:
wine[pd.isnull(wine.country)]
To replace the missing values in the data, we can use the fillna() function:
wine.region.fillna("Unknown")
If we want to replace an existing value, here’s what we can do:
wine.twitter_handle.replace("@x", "@abc")
6. Renaming & Combining
rename is a function that lets you change index names or column names:
wine.rename(columns = {'points': 'score'})
This function allows us to rename index or column by specifying an index or a column keyword parameter:
wine.rename(index = {0: 'first', 1: 'second'})
Of course, we could give a name to the row index and the column index as well:
wine.rename_axis('wines', axis = 'rows').rename_axis('fields', axis = 'columns')
6.1 Combining
The simplest combination method in pandas is concat. This function is useful when we have data in different DataFrames that share the same columns:
pd.concat([canadian_youtube, british_youtube])
Moreover, join combines different DataFrames which have an index in common:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
In the above example, the lsuffix and rsuffix parameters are necessary because both datasets have the same column names. If they did not, we could omit them.