Once upon a time there was a student who was eager to learn data science.
He asked people what he should do, and they told him to learn pandas. He scoured the web for data courses, and they all featured pandas. So the student learned pandas, and then got a job in academia, where everyone was working with pandas.
So he toiled with pandas for many moons, until he could slice dataframes in his sleep. When he was done, he joined a data bootcamp: lo and behold, they were teaching pandas. When he was finished, the bootcamp hired him—to teach pandas.
Then came the time when the student entered his first company. His master wanted him to process some data.
“I shall use pandas,” said the student.
“The hell you will!” said the master. “We use S—Q—L here,” he retorted, emphasizing each letter with a blow of his stick.
“But... But... Verbosity... Ugliness... Lack of functions... Endless nesting... And the joins, the joins are horrible!...“
“If you cannot see the forest, then you should not touch the trees” said the master, hitting him on the head.
The student was enlightened.
Or so he thought; in fact the master’s blow had stunned him so badly that his judgement was temporarily impaired.
Many moons later, after a painful withdrawal, the student understood SQL. From then on he didn’t feel the need to use pandas again, and the master never landed another blow on him.
The above koan is autobiographical, although I should remark that none of my supervisors ever hit me (even when they should have).
Not much has changed since I got started. Most data bootcamp curricula feature a heavy investment in pandas (alongside Jupyter notebooks), while SQL is at best an afterthought.
Search “data science” on Udemy, and you’ll see the top courses mentioning Python (which inevitably includes pandas) and R, sometimes even Scala. Very few of them mention SQL.
I think this is bad, both from the value standpoint and from the pedagogical standpoint.
If you’re doing standard analytics, SQL is simply a better tool than pandas. It’s simpler, clearer, more powerful and efficient. It’s also easier to understand, share and reproduce. There’s a reason why it’s been the lingua franca of data analytics for decades.
If you’re focusing on Pandas at the expense of SQL, you’re missing out on being a better, more effective data practitioner.
Of course, there are things that pandas does better, and I explore them briefly at the end of the article. But overall, when it comes to pure analytics, SQL is hard to beat.
People often fail to notice that pandas introduces significant overhead in terms of complexity, inefficiency, idiosyncrasy, and opportunities for confusion.
I suspect that the excessive emphasis on pandas hurts data students. In my experience, there is a point where doing more pandas results in anti-learning, i.e. the person simply becomes more confused over time.
New students, in particular, experience weird issues that confuse them and for which they compensate with blind memorization.
They also acquire bad habits that are hard to give up later, like looping over rows when they could use table operations; creating columns with mixed types; saving data in CSV; modifying data in place; clogging the memory with multiple copies and slices of the same dataframe... I could go on.
Part of this comes from Python being an extremely tolerant language, which puts the burden on the user not to do bad stuff (like building a pandas Series with mixed types). Part of it comes from the fact that pandas accepts (though doesn’t necessarily encourage) an imperative approach.
For example, if a student wants to combine data from two tables, there is nothing that stops her from using this algorithm:
Loop over table1
For each table1 row, scan all table2 for a lookup
Update the row in table1 from the data in table2
Yes, this is obviously very bad. But beginners don’t know any better.
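To make this concrete, here is a minimal sketch (with made-up tables and column names) of what that looping approach looks like in pandas, next to the single table operation that replaces it:

import pandas as pd

table1 = pd.DataFrame({"id": [1, 2], "dept_id": [10, 20]})
table2 = pd.DataFrame({"dept_id": [10, 20], "dept_name": ["Data", "Sales"]})

# The beginner's algorithm: a nested loop over both tables
for i, row1 in table1.iterrows():
    for _, row2 in table2.iterrows():
        if row1["dept_id"] == row2["dept_id"]:
            table1.loc[i, "dept_name"] = row2["dept_name"]

# The table operation that achieves the same result
combined = table1.merge(table2, on="dept_id", how="left")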
In contrast, SQL’s declarative approach makes it harder to do bad stuff.
Declarative programming forces the user to think in terms of the result they want to see, rather than how to produce it. This gives them the space they need to focus on the logic behind their analysis, rather than being constantly bogged down by issues and weird bugs.
SQL also forces students to think in tables (i.e. relational model) and column-level operations, which is highly beneficial when their first instinct is to loop over everything.
Finally, learning SQL produces a greater return on investment because of its universality and portability.
I don’t hate pandas. It was my basic analytics tool for two years, and I still use it for my personal projects. I’m happy to see people learning it.
But I’m trying to see the bigger picture. I think that over-emphasizing pandas at the expense of SQL does more harm than good. Especially when it comes to beginners, who are learning about the pandas MultiIndex before they can do a proper GROUP BY.
In the next section, I analyse some of the weirdest aspects of pandas and compare them directly with SQL.
The goal, once again, is not to put pandas down, but to expand the reader’s perspective on the tools at their disposal.
Let’s dive in.
SELECT *
FROM opinionated_comparison
WHERE sql > pandas
In a single statement, SQL’s SELECT allows you to:
Pick the columns you want and the order in which they should be returned.
Create new columns based on combinations and transformations of existing columns.
Rename columns.
This example selects all columns except one, then creates a new column based on a linear combination of two other columns:
SELECT * EXCEPT (first_name),
equity_comp / total_comp * 100 AS percent_equity
FROM salaries
Column selection in pandas only allows choosing and ordering columns. If you want to rename or transform some, you need multiple statements, and many people fall into the trap of transforming data in place (see Immutability below).
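For comparison, here is a rough pandas equivalent of the SQL example above (a sketch, assuming the same salaries dataframe); note that it takes two separate steps:

result = salaries.drop(columns=["first_name"])  # everything except first_name
result["percent_equity"] = result["equity_comp"] / result["total_comp"] * 100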
Beginners get confused because selecting a single column requires one set of brackets (df["first_name"]) while selecting multiple columns requires a double set of brackets (df[["first_name", "last_name"]]).
The biggest problem I have with pandas here is the dot notation: the fact that you can select columns like this: df.first_name.
This is so much easier than using brackets and quotes, so people end up preferring it out of sheer laziness. That’s what happened to me at least: I still use dot notation automatically, even though I know it’s bad.
The problems appear when you have a column called count or shape or diff or any other of the many standard attributes of a dataframe (you can see them all with dir(df)).
When you try to access them with dot notation, you will get the attribute instead of the column and your code will break.
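A minimal illustration of the clash, using a toy dataframe with a column that happens to be called count:

import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3]})
df["count"]  # the column, as expected
df.count     # the DataFrame.count method, not the column; code that relies on this breaks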
So pandas has three ways of selecting columns: two for getting a single column (of which one is bad yet more attractive!) and a third one for selecting multiple columns.
In SQL, to select specific rows, you just use the WHERE clause (see Filtering below).
Selecting rows in pandas is... complex. To see how complex, check the official user guide. Or dive into a typical 30-minute tutorial.
I will limit myself to a single example. Every DataFrame has an Index. The default index is a sequence of integers: [0,1,2,3,4,5...].
Naturally, most people assume that the index represents the position of the rows. In fact, the numbers are just categorical labels! They might as well be random letters like ['x', 'z', 'a']. No position is implied.
To get a row by its index label, you use df.loc. But to select a row by its position, you use df.iloc. With the default index, these two methods give the same result.
Which only adds to the confusion, because at any point your index may change to something completely random like [7, 2, 2, 'c', True, None]. Yes, all of this is allowed! And there are no constraints to prevent it (see Constraints below).
Imagine that you wrote your code assuming that the index represents row position. Now:
df.loc[7] will return the first row
df.loc[2] will return a dataframe slice instead of a row (because the label 2 occurs more than once in the index)
df.loc[None] will return an actual row!
I’m not crying, you’re crying.
And yes: the same method can return a scalar value, a row or a dataframe slice depending on how the index is composed. The pandas docs acknowledge this madness:
Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with [index] duplicates, this isn’t the case.
And remember, there is no constraint to prevent the index from containing duplicates. I cannot begin to tell you how many headaches this has caused me.
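Here is a small demonstration of the first two surprises, using a toy dataframe with that exact index (I leave out the None case, since I’m less sure of its behavior across pandas versions):

import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]},
                  index=[7, 2, 2, "c", True, None])

df.loc[7]  # a Series: the first row, whose label just happens to be 7
df.loc[2]  # a two-row DataFrame, because the label 2 is duplicated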
(Besides all the methods for selecting that we’ve mentioned, pandas also has df.at and df.iat for single values. Another thing to remember and to avoid confusing.)
In SQL, filtering is straightforward. Write WHERE, insert as many conditions as you want, and chain them together with logical operators. Brackets give you more control over structuring expressions.
For example, the following query filters for people who are both over 30 and satisfy at least one of two conditions: more than 5 years of tenure or equity compensation less than 50:
SELECT *
FROM salaries
WHERE age > 30 AND (tenure > 5 OR equity_comp < 50)
How does this look in Pandas?
new_df = df[(df["age"] > 30) & ((df["tenure"] > 5) | (df["equity_comp"] < 50))]
Ugh. Have fun explaining this to beginners.
Granted, you probably wouldn’t write it like this because it’s too ugly. You would execute the filtering in multiple statements, which means more lines of code, more variables and more repetition.
Pandas filters are based on a method called boolean indexing. Every filtering operation happens in two steps:
You take a Series (that’s the column object) and run each element through a boolean test, transforming it into a new Series made of boolean values (True or False).
You index the dataframe with this boolean Series, which excludes the rows where the Series contains a False value.
Notice the hidden assumption here? The Series used to filter and the dataframe being filtered need to share the same index, in the same order. This is not always guaranteed.
In practice, boolean indexing means that when you filter, you always have to repeat the dataframe variable, e.g. salaries[salaries["cash_comp"] > 20]. This is very annoying when you’re writing lots of code! See the example above: the dataframe variable is referenced four times.
I can also say from experience that boolean indexing is not easy to understand for beginners. Some people never get the underlying mechanism at all. They just memorize the coding pattern.
(The df.query() method seems to provide a better way of filtering.)
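For what it’s worth, the same filter written with query() reads much closer to the SQL version (a sketch, assuming the same df as above):

new_df = df.query("age > 30 and (tenure > 5 or equity_comp < 50)")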
Grouping is one area where I have no major complaints. But SQL is definitely closer to English. These two are equivalent:
SELECT AVG(cash_comp), SUM(tenure)
FROM salaries
GROUP BY department
grouped_df = df.groupby("department").agg({"cash_comp": "mean", "tenure": "sum"})
SQL has one type of join. It’s called JOIN. Sure, it can be left/right and inner/outer, but the usage is pretty straightforward.
Pandas has two methods: join and merge. I never understood why we need two. join is supposed to work on indexes, merge is supposed to work on any column.
But if you look at the docs [1][2], each seems to support both indexes and columns. I am confusion. (If you’re in doubt, I would suggest always choosing merge, since join is more of a legacy method.)
SQL makes it really easy to JOIN based on a chain of logical conditions, such as: join by role, but only if the London salary is greater than the one in Washington, or the person has a longer tenure.
SELECT *
FROM london_hq lhq
JOIN washington_hq whq
ON lhq.role = whq.role
AND (lhq.salary > whq.salary
OR lhq.tenure > whq.tenure)
As far as I know, this is not possible with pandas. When joining (or merging), you can only use equality conditions.
So you would have to first execute the JOIN on role, keep track of each column’s origin, and then filter on the result.
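A rough sketch of that workaround, assuming two dataframes london_hq and washington_hq with the same columns as the SQL example (the suffixes keep track of each column’s origin):

merged = london_hq.merge(washington_hq, on="role", suffixes=("_lon", "_was"))
result = merged[
    (merged["salary_lon"] > merged["salary_was"])
    | (merged["tenure_lon"] > merged["tenure_was"])
]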
However, I would argue that those conditions rightfully belong in the JOIN and aren’t any less relevant for not using the equality comparison.
One of the major advantages of SQL is that every column has a well-defined type. Furthermore, columns are not allowed to have mixed types. This saves a lot of bugs and headaches in the long run.
When you load data into pandas, most columns are automatically typed as object. This can mean one of three things:
The column contains only strings
The column contains Python objects that are not a primitive data type, e.g. list or dictionary
The column contains mixed types
When you see the object data type, you never know which is the case. I find this very annoying.
Unlike with SQL, you can load data with mixed types in pandas: they will simply be typed as object.
Pandas does not force you to specify a schema and stick with it. This gives you a speed premium when you’re getting started, but you often pay dearly for it in future bugs and confusion.
This is especially problematic for beginners who are not alert to common pitfalls. For example, when I was working with pandas, I would often attempt a datetime operation, only to realize that the datetime column was made of strings (hence classified as object). I would naively do this:
df['Date'] = df['Date'].astype('datetime64[ns]')
And move on, only to find out much later that Pandas’ date parser had misread my strings and the dates made no sense.
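A safer habit (a minimal sketch; the format string is an assumption about what the dates look like) is to parse dates explicitly and let pandas fail loudly when something doesn’t match:

# errors="raise" surfaces unparseable strings instead of letting bad dates slip through
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="raise")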
Let’s be honest: most people store their dataframes as CSV. Pandas students are welcomed, nay encouraged, to do this. This is a bad idea!
Sure, CSVs are human-readable and... their advantages end there. Their disadvantages are:
When converting to CSV, you lose all information on schema and column types. Everything reverts to text.
CSVs are prone to bad formatting, corruption and parsing errors.
CSVs are bulky. They have no built-in compression or efficient encoding, which increases storage costs.
The CSV format is underspecified, meaning that different programs create CSVs in different ways, and the burden is with the user to figure out how to parse them. This can quickly turn into a hellish experience, as any data professional will attest.
The loss of schema is typically the biggest issue for people who work in pandas. This is a common situation:
Your work starts with a CSV. Provided you have figured out the right formatting, encoding, quote-char specification, and the rest of the many arguments to pandas’ read_csv, you will load it into a dataframe. Now you need to spend time exploring the columns, casting each one to the correct type, taking care of any bugs that appear in the process, and verifying that the end result makes sense. (Or you could start working immediately and face lots of bugs later.)
Once your work is done, you have a new dataframe. What are you going to do with it? Why, save it as a CSV. Now all your previous work on schema definition is gone, since the dataframe is dumped to text.
You need to load the new dataframe for another workflow. That means loading the CSV you just dumped. I hope you wrote functions that can successfully restore the schema, or you will have to do the work all over again (provided that you remember what each column was supposed to do).
Want to share the CSV with a friend or post it on GitHub? You better share the code that can re-impute the schema, and hope they are willing and able to run it. Or they will be left with a blob of text and will have to repeat all the schema imputation work from scratch.
Sounds absurd? I have seen this done countless times. I have done this myself! But now I wonder: why do we teach people to work like this? What justifies such madness and cruelty?
There are two solutions here.
If you really need to work in pandas, export your dataframes to Parquet.
Or you could work in SQL and save yourself all the trouble. After all, a database is the best place to store data.
Ask yourself: Why do I need a file layer? If you’re simply reading some data, processing it and then storing the results, you probably don’t. Load from the database, work in the database, save in the database. It’s that simple. Need to share data externally? Export in Parquet.
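If you do need a file, a Parquet round trip keeps the schema you worked for (a minimal sketch with a hypothetical filename; requires the pyarrow or fastparquet package):

df.to_parquet("salaries.parquet")         # column types travel with the data
df = pd.read_parquet("salaries.parquet")  # and come back intact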
The world does not need more CSVs.
Note: some people try to solve the schema problem by pickling their dataframes. This is a terrible idea.
Pickling is inefficient and insecure (never open a pickle that you don’t trust!). A pickled dataframe can only be opened inside Python, and it has to happen within the same library environment (about which the user might know absolutely nothing). If the pandas that reads the pickle is a different version than the pandas that wrote it, the file might be unreadable!
SQL uses NULL values to indicate missing data. You can easily filter out nulls.
SELECT *
FROM salaries
WHERE equity_comp IS NOT NULL
AND cash_comp IS NOT NULL
In pandas, missing values can be any of these:
Python’s native None (which pandas displays as None but treats like nan)
numpy.nan
pandas.NA
pandas.NaT (for datetimes)
Let’s focus on numpy.nan, which is the most common:
The type of this object is float, so forget about detecting it with type checks.
It’s truthy, so forget about boolean tests: bool(np.nan) is True.
It fails the equality test, since numpy.nan == numpy.nan is False. nan is not equal to itself!
Using nan in an operation doesn’t throw an exception; it just means the result is nan.
Isn’t this fun?
The only way to detect a nan is to use pandas.isna(). That’s fine, once you read the docs and forget all of your pythonic instincts. Still, this behavior is extremely confusing to beginners.
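A quick demonstration of all of the above:

import numpy as np
import pandas as pd

type(np.nan)      # float
bool(np.nan)      # True
np.nan == np.nan  # False
pd.isna(np.nan)   # True; the only reliable check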
Here’s how you can replicate the above query in Pandas:
new_df = df.dropna(subset=["equity_comp", "cash_comp"])
Constraints are important in SQL. They allow you to specify rules that keep your data safe and consistent. For example, the primary key, which serves as a unique identifier for each row, must be unique and not null.
Pandas does not have anything like this.
The closest thing to a primary key in pandas is the index. Unfortunately, the index value can be both repeated and null (yes, you can have an index with None values).
Users often work with the implicit assumption that the index is a primary key, an idea reinforced by the fact that the default index is made of integers: [0,1,2,3,4...]. Hence people tend to use the index to reference specific rows, e.g. df.loc[2].
‘Tis a foolish act of faith. This becomes evident when concatenating or merging different dataframes. It often happens that similar indexes are mixed up and you get an index that looks like this: [0,1,2,2,2,3...].
Pandas does not throw any warning about this, so you don’t realize it at first. But the next time you use df.loc[2], your code will break because instead of a single row it will now return a dataframe with three rows.
Many tears will flow before you figure out that you need to run reset_index() on the merged dataframe so that each row gets a unique value again.
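A minimal sketch of how this plays out (made-up data):

import pandas as pd

df1 = pd.DataFrame({"name": ["Ann", "Bob", "Carl"]})
df2 = pd.DataFrame({"name": ["Dina", "Emil", "Fred"]})

combined = pd.concat([df1, df2])  # index is now [0, 1, 2, 0, 1, 2]
combined.loc[2]                   # two rows, not one

combined = combined.reset_index(drop=True)  # unique index again: [0, 1, 2, 3, 4, 5]
combined.loc[2]                   # a single row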
Furthermore, SQL constraints allow you to run checks at write time. If you try to insert a null value into a column that carries the NOT NULL constraint, you get an exception and the bad write does not happen. Pandas only allows running checks on read. That is, if you remember to do it.
This is mainly a pedagogical point. Pandas, as is well known, allows and even encourages vectorized operations (where an operation is applied to a whole series at once by optimized, compiled code, rather than element by element).
But many people who work in Python do not automatically think like this. They started out by learning loops and now, by Guido, they want to use those loops.
When they get started with pandas, they soon discover the iterrows and itertuples methods, which allow them to loop over the dataframe by rows.
Almost inevitably, they fall into looping patterns again, because nothing forces them to think in vectors. This causes them to write terrible code that runs very slowly.
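To illustrate, here is the row-looping version next to the vectorized one (a sketch, reusing the salaries columns from earlier):

# Looping over rows: verbose and slow
totals = []
for row in df.itertuples():
    totals.append(row.cash_comp + row.equity_comp)
df["total_comp"] = totals

# Vectorized: one column-level operation
df["total_comp"] = df["cash_comp"] + df["equity_comp"]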
I started focusing on SQL after a long experience with pandas. Every time I was faced with a SQL problem, my instinct was to come up with a looped solution. Frustratingly, SQL did not allow me to do that.
Its declarative syntax forced me to think in terms of column operations, JOINs and window functions. Gradually, I built a new set of mental models that made me a better analyst.
I think learners should build confidence manipulating data in SQL before starting with pandas. They will be better equipped to understand when looping by row is inevitable, which is rare.
You load a dataframe in memory. You need to modify that dataframe. Do you change it in place, or do you create a copy? Should you update existing columns or create new ones?
What if you need to create multiple dataframe slices, then do some work on each slice? Should you store each slice in a separate variable or use the same variable to hold different slices in turn?
When people work in pandas, they tend to do all these things at the same time. It soon becomes difficult to keep track of all the variables that contain dataframes, slices of dataframes, and slices of slices, and to know how data was added or modified.
(I don’t always write pandas, but when I do, I get the SettingWithCopyWarning.)
And since most people use pandas with notebooks, these problems compound with the typical notebook pitfalls and end up causing huge headaches.
This is one of the reasons why I think that pandas is not the best option for data analysis.
When processing data in SQL, you don’t change the original data. The UPDATE statement is not used in analytics. Instead, you create pipelines of tables and views that represent different selections.
When you need to save your results, you create new tables (or add rows to existing target tables, but without modifying or removing previous rows). This respects the tenet of immutability: never modify or delete source data. It makes your process safe, transparent and easy to replicate.
Yes, you can respect immutability in pandas, but it is not obvious that you should do so, and many people never learn to do it. What you typically see is immutability at the level of files: people usually load a CSV and output a new CSV. But for the work that happens in between? Anything goes.
(Most pandas methods are theoretically “pure” because they return a new dataframe instead of modifying the previous one. But many of them provide an inplace option that changes the dataframe in place. And people use it with gusto.)
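The difference in a nutshell (a minimal sketch):

clean = df.dropna()      # pure: returns a new dataframe, df is untouched
df.dropna(inplace=True)  # mutates df; the dropped rows are gone from memory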
If you do serious work with pandas, you will eventually hit a performance wall. The data you’re analyzing is just too big, or your processing needs are too high.
I saw this often when I was doing research with pandas. When it happened, my colleagues and I would google “make pandas faster” and find countless articles on this hot subject, which in turn proposed countless hacks, optimizations and PyPI libraries that promised to do the trick.
If you’re in this situation, by all means check out the available resources. Especially those that explain how to use pandas better. But don’t get your hopes too high. There are hard limits to what pandas can do. That’s okay: it wasn’t designed to be the be-all and end-all of data analytics.
The two best alternatives when you need to scale your data workloads are:
PySpark. Spark is an engine for large-scale analytics and parallelized data processing. PySpark allows you to leverage it with Python, and uses a syntax reminiscent of pandas. It even has a pandas API (see the sketch after this list).
Data warehouse. A system for storing and analyzing data at very large scale (think terabytes and petabytes). Modern data warehouses run in the cloud so you can leverage the power of distributed systems without managing any servers. BigQuery, Google Cloud’s data warehouse solution, can process 100 billion rows or 7 terabytes of data in 30 seconds. Data warehouses typically work with SQL. (If you want to try BigQuery for free, I wrote about it here.)
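As a taste of the first option, PySpark’s pandas API lets you keep a familiar syntax while Spark handles the scale (a sketch with a hypothetical file; requires pyspark 3.2 or later and a running Spark session):

import pyspark.pandas as ps

psdf = ps.read_parquet("salaries.parquet")      # a distributed dataframe
psdf.groupby("department")["cash_comp"].mean()  # executed by the Spark engine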
I don’t want you to shun pandas. It’s an awesome tool which is definitely worth learning.
There are cases where pandas is a better choice than SQL. I won’t go into detail here, but here’s a quick list:
When integrating with other Python workflows, e.g. you’re doing machine learning or you want to use a Python visualization library.
When you need some quick statistics. The describe() method is very useful.
When you need to slap together a quick analysis, without worrying about scale or reproducibility. Although Excel or Google Sheets might work just as well.
If you’re building Python applications. Pandas might be the quickest way to go from an arbitrary data structure to a table, and vice versa.
When you actually need imperative workflows and loops. E.g. building Markov simulations.
When you need to write or reuse unusual functions. Pandas is great at applying arbitrary Python functions.
If you have a highly dynamic and parametrized workflow.
I hope this article has stimulated you to think more deeply about SQL and pandas and their relative strengths and weaknesses.
I believe the current trend in data has swung too far in favor of pandas, and that you ignore SQL at your own peril.
Here’s my advice:
If you’re a learner: study SQL and learn to use it for your analytics. You won’t regret it.
If you’re a curriculum designer: drown your students in SQL until they dream in tables and speak in all caps. It’s tough love and they will hate you for it, but one day they’ll understand. (Don’t hit them on the head though.)
If you’re a mentor: try to gradually wean your pupils off pandas and encourage them to attempt problems in SQL.
I’d love to get a conversation going. Feel free to drop a comment, write me an email or add me on LinkedIn.