Robin Linderborg

@robinlinderborg

Reshaping Data in Python

I really enjoyed Jean-Nicholas Hould’s article on Tidy Data in Python, which in turn is based on this paper on Tidy Data by Hadley Wickham. In a sense, the conclusions presented are intuitive and obvious when you think about them. But data analysis can be abstract. Finding the right vocabulary for what you’re doing isn’t always easy.

In this post, I want to focus exclusively on the process of reshaping data, i.e. converting or transforming data from one format to another. There will be some repetition from Hould’s article, but the goal is to outline the various data formats that we frequently encounter, name them and name the operations we use to transform the data.

The long format

Let’s begin with looking at a table where the data is tidy. We will be referring to this as long format data (although other naming conventions exist, see below). Borrowing Wickham’s definition, in this format a) each variable forms a column, b) each observation forms a row, and c) each type of observational unit forms a table.

An example of long format data is this made-up table of three individual’s cash balance on certain dates.

In  : df
Out :
date person dollars
0 2000-01-03 Michael 200
1 2000-01-03 George 500
2 2000-01-03 Lisa 450
3 2000-01-04 Michael 180.5
4 2000-01-04 George 450
5 2000-01-04 Lisa 448
6 2000-01-05 Michael 177
7 2000-01-05 George 420
8 2000-01-05 Lisa 447
9 2000-01-06 Michael 150
10 2000-01-06 George 300
11 2000-01-06 Lisa 344.6

The format of this table can be referred to as the:

  • stacked format, because the individual observations are stacked on top of each other.
  • record format, because each row is a single record, i.e. a single observation.
  • long format, because this format will be long in the vertical direction as opposed to wide in the horizontal direction.

Pivoting data

If this table is already tidy, why would we want to reshape it to another format? Well, we might be interested in visually comparing the balance of the individuals by date. Or we might be interested in plotting the data as a time series, where each horizontal line represents one individual.

In pandas, we can accomplish just that by using the pivot method of the dataframe. This produces a “pivot table”, which will be familiar to Excel users.

In  : df.pivot(index='date', columns='name', values='dollars')
Out :
name George Lisa Michael
date
2000-01-03 500.0 450.0 200.0
2000-01-04 450.0 448.0 180.5
2000-01-05 420.0 447.0 177.0
2000-01-06 300.0 344.6 150.0

Whatever column you specify as the columns argument will be used to create new columns (each unique entry will form a new column). The column you specify as the values argument will form the values of those columns, and the index will be made up of… you guessed it, the column you specify as the index argument.

The format of this table can be referred to as:

  • wide format, because the table is now wider rather than longer.
  • unstacked format, because the individual observations (one person/one date) are no longer stacked on top of each other.

A bit confusingly, pandas dataframes also come with a pivot_table method, which is a generalization of the pivot method. Whenever you have duplicate values for one index/column pair, you need to use the pivot_table. Let’s look at one example.

Let’s say we have data of the number of cookies that George, Lisa, and Michael have sold. Each row in our table represents one sale occasion, which means that there could be multiple rows with the same seller for a given date.

In  : df_cookies
Out :
cookies_sold date name
0 1 2000–01–01 George
1 3 2000–01–01 Michael
2 3 2000–01–01 Lisa
3 2 2000–01–01 George
4 4 2000–01–01 Lisa

If we try to pivot this dataframe, we get a ValueError.

In  : df_cookies.pivot(index='date',
columns='name',
values='cookies_sold')
Out :
...
ValueError: Index contains duplicate entries, cannot reshape

Unlike our previous balance dataframe, the values in the cookies dataframe must be aggregated in order to be pivoted since George and Lisa sold cookies on multiple occasions in one single day. In other words, when we ask pandas to pivot our data, it can’t find a single value to return for duplicate pairs of dates/names. Pandas can, however, give us the sum, or the mean, or any other aggregated value for each date/name pair.

In  : df_cookies.pivot_table(index='date',
columns='name',
values='cookies_sold')
Out :
name George Lisa Michael
date
2000-01-01 1.5 3.5 3.0

The default aggregation function that pandas uses is the mean, but we can easily change that using the aggfunc argument.

In  : df_cookies.pivot_table(index='date',
columns='name',
values='cookies_sold'
aggfunc='sum')
Out :
name George Lisa Michael
date
2000-01-01 3 7 3

Stacking and unstacking data

In addition to the pivoting methods, pandas also has the two related concepts of stacking and unstacking data. These are primarily designed to operate on multi-indexed dataframes.

Let’s create a multi-indexed dataframe of our original balance dataframe.

In  : df_multi = df.set_index(['date', 'name'])
In : df_multi
Out :
dollars
date name
2000–01–03 Michael 200.0
George 500.0
Lisa 450.0
2000–01–04 Michael 180.5
George 450.0
Lisa 448.0
2000–01–05 Michael 177.0
George 420.0
Lisa 447.0
2000–01–06 Michael 150.0
George 300.0
Lisa 344.6

Remember, this is stacked data. Each row corresponds to one row. With DataFrame.stack and DataFrame.unstack, we can toggle between hierarchical indices and hierarchical columns. In this case, we have a hierarchical index, so let’s see what unstack does.

In  : df_multi.unstack()
Out :
dollars
name George Lisa Michael
date
2000-01-03 500.0 450.0 200.0
2000-01-04 450.0 448.0 180.5
2000-01-05 420.0 447.0 177.0
2000-01-06 300.0 344.6 150.0

As you can see, the operation moved one level of our hierarchical index to form a new level of columns in the dataframe. To move back to a stacked format, we simple use stack.

In  : df_multi.unstack().stack()
Out :
dollars
date name
2000–01–03 Michael 200.0
George 500.0
Lisa 450.0
2000–01–04 Michael 180.5
George 450.0
Lisa 448.0
2000–01–05 Michael 177.0
George 420.0
Lisa 447.0
2000–01–06 Michael 150.0
George 300.0
Lisa 344.6

Transposing data

Pandas has an easy way of rotating dataframes, i.e. switching the locations of columns and indices. Simply use the T attribute to rotate the data.

In  : df = pd.DataFrame({'a': [1, 2], 
'b': [3, 4]}, index=[‘one’, ‘two’])
In : df
Out :
a b
one 1 3
two 2 4
In  : df.T
Out :
one two
a 1 2
b 3 4

Unpivoting data with melt

Discovering pandas’ melt function was a game-changer for me. It basically allows you to unpivot data however you want. In other words, we use melt to transform wide data to long data.

Say we have a wide format dataframe like this one.

  country 2010  2011  2012
0 Canada 55 55 86
1 Iraq 56 32 22
2 Italy 3 56 11

To reshape this data to a long format, where each row represents one country/year pair, we use melt (which is not a dataframe method, but a top-level import from pandas).

In  : pd.melt(df, id_vars='country', value_vars=[2010, 2011, 2012])
Out :
country year value
0 Canada 2010 55
1 Iraq 2010 56
2 Italy 2010 3
3 Canada 2011 55
4 Iraq 2011 32
5 Italy 2011 56
6 Canada 2012 86
7 Iraq 2012 22
8 Italy 2012 11

Hopefully this overview makes the pandas tools for reshaping data a bit clearer. Happy data wrangling!

Hacker Noon is how hackers start their afternoons. We’re a part of the @AMIfamily. We are now accepting submissions and happy to discuss advertising &sponsorship opportunities.
To learn more, read our about page, like/message us on Facebook, or simply, tweet/DM @HackerNoon.
If you enjoyed this story, we recommend reading our latest tech stories and trending tech stories. Until next time, don’t take the realities of the world for granted!
Topics of interest

More Related Stories