I really enjoyed Jean-Nicholas Hould’s article on , which in turn is based on by Hadley Wickham. In a sense, the conclusions presented are intuitive and obvious when you think about them. But data analysis can be abstract. Finding the right vocabulary for what you’re doing isn’t always easy. Tidy Data in Python this paper on Tidy Data In this post, I want to focus exclusively on the process of reshaping data, i.e. converting or transforming data from one format to another. There will be some repetition from Hould’s article, but the goal is to outline the various data formats that we frequently encounter, name them and name the operations we use to transform the data. The long format Let’s begin with looking at a table where the data is tidy. We will be referring to this as format data (although other naming conventions exist, see below). Borrowing Wickham’s definition, in this format a) each variable forms a column, b) each observation forms a row, and c) each type of observational unit forms a table. long An example of long format data is this made-up table of three individual’s cash balance on certain dates. In  : dfOut :**   date        person   dollars** 2000-01-03  Michael  200 2000-01-03  George   500 2000-01-03  Lisa     450 2000-01-04  Michael  180.5 2000-01-04  George   450 2000-01-04  Lisa     448 2000-01-05  Michael  177 2000-01-05  George   420 2000-01-05  Lisa     447 2000-01-06  Michael  150 2000-01-06  George   300 2000-01-06  Lisa     344.6 0 1 2 3 4 5 6 7 8 9 10 11 The format of this table can be referred to as the: , because the individual observations are stacked on top of each other. stacked format , because each row is a single record, i.e. a single observation. record format , because this format will be long in the vertical direction as opposed to wide in the horizontal direction. long format Pivoting data If this table is already tidy, why would we want to reshape it to another format? Well, we might be interested in visually comparing the balance of the individuals by date. Or we might be interested in plotting the data as a time series, where each horizontal line represents one individual. In pandas, we can accomplish just that by using the method of the dataframe. This produces a “pivot table”, which will be familiar to Excel users. pivot In  : df.pivot(index='date', columns='name', values='dollars')Out : 500.0   450.0  200.0 450.0   448.0  180.5 420.0   447.0  177.0 300.0   344.6  150.0 name        George  Lisa   Michael date 2000-01-03 2000-01-04 2000-01-05 2000-01-06 Whatever column you specify as the argument will be used to create new columns (each unique entry will form a new column). The column you specify as the argument will form the values of those columns, and the index will be made up of… you guessed it, the column you specify as the argument. columns values index The format of this table can be referred to as: , because the table is now wider rather than longer. wide format , because the individual observations (one person/one date) are no longer stacked on top of each other. unstacked format A bit confusingly, pandas dataframes also come with a method, which is a generalization of the method. Whenever you have duplicate values for one index/column pair, you need to use the . Let’s look at one example. pivot_table pivot pivot_table Let’s say we have data of the number of cookies that George, Lisa, and Michael have sold. Each row in our table represents one sale occasion, which means that there could be multiple rows with the same seller for a given date. In  : df_cookiesOut : 1             2000–01–01  George 3             2000–01–01  Michael 3             2000–01–01  Lisa 2             2000–01–01  George 4             2000–01–01  Lisa cookies_sold  date        name 0 1 2 3 4 If we try to this dataframe, we get a ValueError. pivot In  : df_cookies.pivot(index='date',columns='name',values='cookies_sold')Out :...ValueError: Index contains duplicate entries, cannot reshape Unlike our previous balance dataframe, the values in the cookies dataframe must be aggregated in order to be pivoted since George and Lisa sold cookies on multiple occasions in one single day. In other words, when we ask pandas to pivot our data, it can’t find a single value to return for duplicate pairs of dates/names. Pandas can, however, give us the sum, or the mean, or any other aggregated value for each date/name pair. In  : df_cookies.pivot_table(index='date',columns='name',values='cookies_sold')Out : 1.5     3.5   3.0 name        George  Lisa  Michael date 2000-01-01 The default aggregation function that pandas uses is the mean, but we can easily change that using the argument. aggfunc In  : df_cookies.pivot_table(index='date',columns='name',values='cookies_sold'aggfunc='sum')Out : 3       7     3 name        George  Lisa  Michael date 2000-01-01 Stacking and unstacking data In addition to the pivoting methods, pandas also has the two related concepts of stacking and unstacking data. These are primarily designed to operate on multi-indexed dataframes. Let’s create a multi-indexed dataframe of our original balance dataframe. In  : df_multi = df.set_index(['date', 'name'])In  : df_multiOut : 200.0**  George**   500.0**  Lisa**     450.0 180.5**  George**   450.0**  Lisa**     448.0 177.0**  George**   420.0** Lisa**     447.0 150.0** George**   300.0** Lisa**     344.6 dollars date        name 2000–01–03  Michael 2000–01–04  Michael 2000–01–05  Michael 2000–01–06  Michael Remember, this is stacked data. Each row corresponds to one row. With and , we can toggle between hierarchical indices and hierarchical columns. In this case, we have a hierarchical index, so let’s see what does. DataFrame.stack DataFrame.unstack unstack In  : df_multi.unstack()Out : 500.0   450.0  200.0 450.0   448.0  180.5 420.0   447.0  177.0 300.0   344.6  150.0 dollars name George  Lisa   Michael date 2000-01-03 2000-01-04 2000-01-05 2000-01-06 As you can see, the operation moved one level of our hierarchical index to form a new level of columns in the dataframe. To move back to a stacked format, we simple use . stack In  : df_multi.unstack().stack()Out : 200.0**  George**   500.0**  Lisa**     450.0 180.5**  George**   450.0** Lisa**     448.0 177.0** George**   420.0** Lisa**     447.0 150.0** George**   300.0** Lisa**     344.6 dollars date        name 2000–01–03  Michael 2000–01–04  Michael 2000–01–05  Michael 2000–01–06  Michael Transposing data Pandas has an easy way of rotating dataframes, i.e. switching the locations of columns and indices. Simply use the attribute to rotate the data. T In  : df = pd.DataFrame({'a': [1, 2],'b': [3, 4]}, index=[‘one’, ‘two’])In  : dfOut : 1 3 2 4 a b one two In  : df.TOut : 1   2 3   4 one two a b Unpivoting data with melt Discovering pandas’ function was a game-changer for me. It basically allows you to unpivot data however you want. In other words, we use to transform wide data to long data. melt melt Say we have a wide format dataframe like this one. Canada  55    55    86 Iraq    56    32    22 Italy   3     56    11 country 2010  2011  2012 0 1 2 To reshape this data to a long format, where each row represents one country/year pair, we use (which is not a dataframe method, but a top-level import from pandas). melt In  : pd.melt(df, id_vars='country', value_vars=[2010, 2011, 2012])Out : Canada   2010  55 Iraq     2010  56 Italy    2010  3 Canada   2011  55 Iraq     2011  32 Italy    2011  56 Canada   2012  86 Iraq     2012  22 Italy    2012  11 country  year  value 0 1 2 3 4 5 6 7 8 Hopefully this overview makes the pandas tools for reshaping data a bit clearer. Happy data wrangling! is how hackers start their afternoons. We’re a part of the family. We are now and happy to opportunities. Hacker Noon @AMI accepting submissions discuss advertising &sponsorship To learn more, , , or simply, read our about page like/message us on Facebook tweet/DM @HackerNoon. If you enjoyed this story, we recommend reading our and . Until next time, don’t take the realities of the world for granted! latest tech stories trending tech stories

Reshaping Data in Python

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

10 Patterns of Centralized Crypto Exchanges Explained Using Machine Learning and Data Visualizations

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

10 Patterns of Centralized Crypto Exchanges Explained Using Machine Learning and Data Visualizations

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps