Why Are We Teaching Pandas Instead of SQL?

Photo by on Pedro Gonzalez Unsplash Once upon a time there was a student who was eager to learn data science. He asked people what he should do, and they told him to learn pandas. He scoured the web for data courses, and they all featured pandas. So the student learned pandas, and then got a job in academia, where everyone was working with pandas. So he toiled with pandas for many moons, until he could slice dataframes in his sleep. When he was done, he joined a data bootcamp: lo and behold, they were teaching pandas. When he was finished, the bootcamp hired him—to teach pandas. Then came the time when the student entered his first company. His master wanted him to process some data. “I shall use pandas” said the student. “The hell you will!” said the master. “We use S—Q—L here” he retorted, emphasizing each letter with a blow of his stick. “But... But... Verbosity... Ugliness... Lack of functions... Endless nesting... And the joins, the joins are horrible!...“ “If you cannot see the forest, then you should not touch the trees” said the master, hitting him on the head. The student was enlightened. Or so he thought; in fact the master’s blow had stunned him so badly that his judgement was temporarily impaired. Many moons later, after a painful withdrawal, the student understood SQL. From then on he didn’t feel the need to use pandas again, and the master never landed another blow on him. SQL >> Pandas The above koan is autobiographical, although I should remark that none of my supervisors ever hit me (even when they should have). Not much has changed since I got started. Most data bootcamps’ curriculums feature a heavy investment in pandas (alongside Jupyter notebooks) while SQL is at best an after-thought. Search “data science” on Udemy, and you’ll see the top courses mentioning Python (which inevitably includes pandas) and R, sometimes even Scala. Very few of them mention SQL. I think this is bad, both from the value standpoint and from the pedagogical standpoint. The Value Problem If you’re doing standard analytics, SQL is simply a better tool than pandas. It’s simpler, clearer, more powerful and efficient. It’s also easier to understand, share and reproduce. There’s a reason why it’s been the of data analytics for decades. lingua franca If you’re focusing on Pandas at the expense of SQL, you’re missing out on being a better, more effective data practitioner. Of course, there are things that pandas does better, and I explore them briefly at the end of the article. But overall, when it comes to pure analytics, SQL is hard to beat. People often fail to notice that pandas introduces significant overhead in terms of complexity, inefficiency, idiosyncrasy, and opportunities for confusion. The Learning Problem I suspect that the excessive emphasis on pandas hurts data students. In my experience, there is a point where , i.e. the person simply becomes more confused over time. doing more pandas results in anti-learning New students, in particular, experience weird issues that confuse them and for which they compensate with blind memorization. They also acquire bad habits that are hard to give up later, like looping over rows when they could use table operations; creating columns with mixed types; saving data in CSV; modifying data in place; clogging the memory with multiple copies and slices of the same dataframe... I could go on. Part of this comes Python being an extremely tolerant language, which puts the burden on the user not to do bad stuff (like pandas with mixed types). Part of it comes from the fact that pandas accepts (though doesn’t necessarily encourage) an imperative approach. Series For example, if a student wants to combine data from two tables, there is nothing that stops her from using this algorithm: Loop over table1 For each row, scan all for a lookup table1 table2 Update the row in from the data in table1 table2 Yes, this is obviously very bad. But beginners don’t know any better. In contrast, SQL’s declarative approach makes it harder to do bad stuff. . This gives them the space they need to focus on the logic behind their analysis, rather than being constantly bogged down by issues and weird bugs. Declarative programming forces the user to think in terms of the result they want to see, rather than how to produce it SQL also forces students to (i.e. relational model) and column-level operations, which is highly beneficial when their first instinct is to loop over everything. think in tables Finally, learning SQL produces a greater return on investment because of its universality and portability. Disclaimer I don’t hate pandas. It was my basic analytics tool for two years, and I still use it for my personal projects. I’m happy to see people learning it. But I’m trying to see the bigger picture. I think that . Especially when it comes to beginners, who are learning about the pandas before they can do a proper . over-emphasizing pandas at the expense of SQL does more harm than good MultiIndex GROUP BY Prospect In the next section, I analyse some of the weirdest aspects of pandas and compare them directly with SQL. The goal, once again, is not to put pandas down, but to expand the reader’s perspective on the tools at their disposal. Let’s delve in. SELECT * FROM opinionated_comparison WHERE sql > pandas Comparison Selecting Columns In a single statement, SQL’s allows you to: SELECT Pick the columns you want and the order in which they should be returned. Create new columns based on combinations and transformations of existing columns. Rename columns. This example selects all columns except one, then creates a new column based on a linear combination of two other columns: SELECT * EXCEPT (first_name), equity_comp / total_comp * 100 AS percent_equity FROM salaries The pandas column selection only allows choosing and ordering columns. If you want to rename or transform some, you need multiple statements, and many people fall for the error of transforming data in place (see immutability below). Beginners get confused because selecting a single column requires one set of brackets ( ) while selecting multiple columns requires a double set of brackets ( ). df[”first_name”] df[[”first_name”, "last_name"]] The biggest problem I have with pandas here is the dot notation: the fact that you can select columns like this: . df.first_name This is so much easier than using brackets and quotes, so people end up preferring it out of sheer laziness. That’s what happened to me at least: I still use dot notation automatically, even though I know it’s bad. The problems appear when you have a column called or or or any other of the many standard attributes of a dataframe (you can see them all with ). count shape diff dir(df) When you try to access them with dot notation, you will get the attribute instead of the column and your code will break. So pandas has three ways of selecting columns: two for getting a single column (of which one is bad yet more attractive!) and a third one for selecting multiple columns. Selecting Rows In SQL, to select specific rows, you just use the statement (see Filtering below). WHERE Selecting rows in Pandas is... complex. To see how complex, check the . Or delve into a typical . user starting guide 30 minutes tutorial I will limit myself to a single example. Every DataFrame has an . The default index is a sequence of integers: . Index [0,1,2,3,4,5...] Naturally, most people assume that the index represents the cardinality of the rows. the numbers are just categorical labels! They might as well be random letters like . There is no cardinality implied. Actually, ['x', 'z', 'a'] To get a row by its index, you use . But to select by the row’s cardinality, you use . df.loc df.iloc With the default index, these two methods give the same result. Which only adds to the confusion, because at any point your index may change to something completely random like . Yes, all of this is allowed! And there are no constraints to prevent it (see Constraints below). [7, 2, 2, 'c', True, None] Imagine that you wrote your code assuming that the index represented row cardinality. Now: will return the first row df.loc[7] will return a dataframe slice instead of a row (because it occurs more than once in the index) df.loc[2] will return an actual row! df.loc[None] I’m not crying you’re crying. And yes: the can return a scalar value, a row or a dataframe slice depending on how the index is composed. The acknowledge this madness: same method pandas docs Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will . Slicing a with a scalar will return a . Slicing a with a scalar will return a scalar. But with [index] duplicates, this isn’t the case. reduce dimensionality DataFrame Series Series And remember, there is no constraint to prevent the index from containing duplicates. I cannot begin to tell you how many headaches this has caused me. (Besides all the methods for selecting that we’ve mentioned, pandas also has and for single values. Another thing to remember and to avoid confusing.) df.at df.iat Filtering In SQL, filtering is straightforward. Write , insert as many statements as you want, and chain them together with logical operators. Brackets give you more control over structuring expressions. WHERE For example, the following query filters for people who are both over 30 and satisfy at least one of two conditions: more than 5 years of tenure or equity compensation less than 50: SELECT * from salaries WHERE age > 30 AND (tenure > 5 OR equity_comp 30) & ((df["tenure"] > 5) | (df["equity_comp"] < 50))] Ugh. Have fun explaining this to beginners. Granted, you probably wouldn’t write it like this because it’s too ugly. You would execute the filtering in multiple statements: which means more lines of code, variables and repetition. Pandas filters are based on a method called . Every filtering operation happens in two steps: boolean indexing You take a (that’s the column object) and run each element through a boolean test. Thus transforming it into a new made of boolean values (true or false). Series Series You select the dataframe with this column, which ends up excluding the rows where the boolean contains a false value. Series Notice the hidden assumption here? The used to filter and the dataframe that’s being filtered need to share the same index, in the same order. This is not always guaranteed. Series In practice, boolean indexing means that when you filter, you always have to repeat the dataframe variable, e.g. . This is very annoying when you’re writing lots of code! See the example above: the dataframe variable is referenced 4 times. salaries[salaries["cash_comp"] > 20] I can also say from experience that boolean indexing is not easy to understand for beginners. Some people never get the underlying mechanism at all. They just memorize the coding pattern. (The method seems to provide .) df.query() a better method for filtering Grouping No major complaints here. But SQL is definitely closer to English. These two are equivalent: SELECT AVG(cash_comp), SUM(tenure) FROM salaries GROUP BY department grouped_df = df.groupby('department').agg({"cash_comp": np.mean, "tenure": np.sum}) Joins SQL has one type of join. It’s called . Sure, it can be left/right and inner/outer, but the use is pretty straightforward. JOIN Pandas has two methods: and . I never understood why we need two. is supposed to work on indexes, is supposed to work on any column. join merge join merge But if you look at the docs [ ][ ] they each seem to support both indexes and columns. I am confusion. (If you’re in doubt, I would suggest always choosing since is more of a legacy method.) 1 2 merge join SQL makes it really easy to JOIN based on a chain of logical conditions, such as: join by role, but only if the London salary is great than the one in Washington, or the person has a longer tenure. SELECT * FROM london_hq lhq JOIN washington_hq whq ON lhq.role = whq.role AND (lhq.salary > whq.salary OR lhq.tenure > whq.tenure) As far as I know, this is not possible with pandas. When joining (or merging) you can only use the equality condition. So you would have to first execute the JOIN on , keep track of each column’s origin and then filter on the result. role However, I would argue that those conditions rightfully belong in the JOIN and aren’t any less relevant for not using the equality comparison. Typing One of the major advantages of SQL is that every column has a well-defined type. Furthermore, columns are not allowed to have mixed types. This saves a lot of bugs and headaches in the long run. When you load data into pandas, most columns are automatically typed as . This can mean one of three things: object The column contains only strings The column contains Python objects that are not a primitive data type, e.g. list or dictionary The column contains mixed types When you see the data type, you never know which is the case. I find this very annoying. object Unlike with SQL, you can load data with mixed types in pandas: they will simply be typed as . object Pandas does not force you to specify a schema and stick with it. This gives you a speed premium when you’re getting started, but you often pay dearly for it in future bugs and confusion. This is especially problematic for beginners who are not alert to common pitfalls. For example, when I was working with Pandas, I would often attempt a datetime operation, only to realize that the datetime column was made of strings (hence classified as ). I would naively do this: object df['Date'] = df['Date'].astype('datetime64[ns]') And move on, only to find out much later that Pandas’ date parser had misread my strings and the dates made no sense. Files and CSVs Let’s be honest: most people store their dataframes as CSV. Pandas students are welcomed, nay encouraged, to do this. This is a bad idea! Sure, CSVs are human readable and... their advantages end here. Their disadvantages are: When converting to CSV, you lose all information on schema and column types. Everything reverts to text. CSVs are prone to bad formatting, corruption and parsing errors. CSVs are hard to compress, which increases storage costs. , meaning that different programs create CSVs in different ways, and the burden is with the user to figure out how to parse them. This can quickly turn into a hellish experience, as any data professional will attest. The CSV format is underspecified The loss of schema is typically the biggest issue for people who work in pandas. This is a common situation: Your work starts with a CSV. Provided you have figured out the right formatting, encoding, quote char specification, and the rest of the to pandas’ , you will load it in a dataframe. Now you need to spend time exploring the columns, casting each column to the correct type, taking care of any bugs that appear in the process and verifying that the end result makes sense. (Or you could start working immediately and face lots of bugs later). many arguments read_csv Once your work is done, you have a new dataframe. What are you going to do with it? Why, save it as a CSV. Now all your previous work on schema definition is gone, since the dataframe is dumped to text. You need to load the new dataframe for another workflow. That means loading the CSV you just dumped. I hope you wrote functions that can successfully restore the schema, or you will have to do the work all over again (provided that you remember what each column was supposed to do). Want to share the CSV with a friend or post it on GitHub? You better share the code that can re-impute the schema, and hope they are willing and able to run it. Or they will be left with a blob of text and will have to repeat all the schema imputation work from scratch. Sounds absurd? I have seen this done . I have done this myself! But now I wonder: why do we teach people to work like this? What justifies such madness and cruelty? countless times There are two solutions here. If you really need to work in pandas, export your dataframes in Parquet. Or you could work in SQL and save yourself all the trouble. After all, a database is the best place to store data. Ask yourself: If you’re simply reading some data, processing it and then storing the results, you probably don’t. Load from the database, work in the database, save in the database. It’s that simple. Need to share data externally? Export in Parquet. Why do I need a file layer? The world does not need more CSVs. Note: some people try to solve the schema problem by their dataframes. . pickling This is a terrible idea Pickling is inefficient and insecure (never open a pickle that you don’t trust!). A pickled dataframe can only be opened inside Python, and it has to happen within the same library environment (about which the user might know absolutely nothing). If the pandas that reads the pickle is a different version than the pandas that wrote it, the file might be unreadable! Nulls SQL uses NULL values to indicate missing data. You can easily filter out nulls. SELECT * FROM salaries WHERE equity_comp IS NOT NULL AND cash_comp IS NOT NULL In Pandas, missing values can be any of these: Python’s native (which Pandas displays as but treats like ) None None nan numpy.nan pandas.NA (for datetimes) pandas.NaT Let’s focus on which is the most common: numpy.nan The type of this object is , so forget about detecting it with type checks. float It’s truthy, so forget about boolean tests. is . bool(np.nan) True It fails the equality test, since is false. is not equal to itself! numpy.nan == numpy.nan nan Using in an operation doesn’t throw an exception, it just means the result is . nan nan Isn’t this fun? The only way to detect a is to use . That’s fine, once you read the docs and forget all of your pythonic instincts. Still, this behavior is extremely confusing to beginners. nan pandas.isna() Here’s how you can replicate the above query in Pandas: new_df = df.dropna(subset=["equity_comp", "cash_comp"]) Constraints are important in SQL. They allow you to specify rules that keep your data safe and consistent. For example, the primary key, which serves as unique identifier for each row, must be unique and not null. Constraints Pandas does not have anything like this. The closest thing to a primary key in pandas is the index. Unfortunately the index value can be both repeated and null (yes, you can have an index with values). None Users often work with the implicit assumption that the index is a primary key, an idea enforced by the fact that the default index is made of integers: . Hence people tend to use the index to reference specific rows, e.g. . [0,1,2,3,4...] df.loc[2] ‘Tis a foolish act of faith. This becomes evident when concatenating or merging different dataframes. It often happens that similar indexes are mixed up and you get an index that looks like this: . [0,1,2,2,2,3...] Pandas does not throw any warnings about this, so you don’t realize it at first. But next time you use your code will break because instead of a single row it will now return a dataframe with three rows. df.loc[2] Many tears will flow before you figure out that you need to run on the merged dataframe so that each row gets a unique value again. reset_index() Furthermore, SQL contraints allow you to run checks at writing time. If you try to insert a null value into a column which carries the constraint, you get an exception and the bad write does not happen. Pandas only allows running checks on read. That is, if you remember to do it. NOT NULL Vectorized Operations This is mainly a pedagogical point. Pandas, as is well known, allows and even encourages vectorized operations (where all elements of a series are accessed in parallel). But many people who work in Python do not automatically think like this. They started out by learning loops and now, , they want to use those loops. by Guido When they get started with pandas, they soon discover the and methods, which allow them to loop the dataframe by rows. iterrows itertuples Almost inevitably, they fall into looping patterns again, because nothing forces them to think in vectors. This causes them to write terrible code that runs very slow. I started focusing on SQL after a long experience with pandas. Every time I was faced with a SQL problem, my instinct was to come up with a looped solution. Frustratingly, SQL did not allow me to do that. Its declarative syntax forced me to think in terms of column operations, JOINs and window functions. Gradually, I built a new set of mental models that made me a better analyst. I think learners should build confidence manipulating data in SQL before starting with pandas. They will be better equipped to understand when looping by row is inevitable, which is rare. Immutability You load a dataframe in memory. You need to modify that dataframe. Do you change it in place, or do you create a copy? Should you update existing columns or create new ones? What if you need to create multiple dataframe slices, then do some work on each slice? Should you store each slice in a separate variable or use the same variable to hold different slices in turn? When people work in pandas, they tend to do all these things at the same time. It soon becomes difficult to keep track of all the variables that contain dataframes, slices of dataframes, and slices of slices, and to know how data was added or modified. (I don’t always write pandas, but when I do, I get the .) setting with copy warning And since most people use pandas with notebooks, these problems compound with the typical and end up causing huge headaches. notebook pitfalls This is one of the reasons why I think that pandas is not the best option for data analysis. When processing data in SQL, you don’t change the original data. The statement is not used in analytics. Instead, you create pipelines of tables and views that represent different selections. UPDATE When you need to save your results, you create new tables (or add rows to existing target tables, but without modifying or removing previous rows). This respects the tenet of : never modify or delete source data. It makes your process safe, transparent and easy to replicate. immutability Yes, you respect immutability in pandas, but it is not obvious that you should do so, and many people never learn to do it. What you typically see is immutability at the level of : people usually load a CSV and output a new CSV. But for the work that happens in between? Anything goes. can files (Most pandas methods are theoretically “pure” because they return a new dataframe instead of modifying the previous one. But they all provide the option which changes a dataframe in place. And people use it with gusto.) inplace When Pandas is Not Enough If you do serious work with pandas, you will eventually hit a performance wall. The data you’re analyzing is just too big, or your processing needs are too high. I saw this often when I was doing research with pandas. When it happened, my colleagues and I would google “make pandas faster” and find countless articles on this hot subject, who in turn proposed countless hacks, optimizations and PyPI libraries that promised to do the trick. If you’re in this situation, by all means check out the available resources. Especially those that explain how to . But don’t get your hopes too high. There are hard limits to what pandas can do. That’s okay: it wasn’t designed to be the end all be all of data analytics. use pandas better The two best alternatives when you need to scale your data workloads are: . Spark is an engine for large scale analytics and parallelized data processing. PySpark allows you to leverage it with Python, and uses a syntax reminiscent of pandas. It even has pandas API. PySpark . A system for storing and analyzing data at very large scale (think terabytes and petabytes). Modern data warehouses run in the cloud so you can leverage the power of distributed systems without managing any servers. BigQuery, Google Cloud’s data warehouse solution, . Data warehouses typically work with SQL. (If you want to try BigQuery for free, ) Data warehouse can process 100 billion rows or 7 terabytes of data in 30 seconds I wrote about it here. When is Pandas Better? I don’t want you to shun pandas. It’s an awesome tool which is definitely worth learning. There cases where pandas is a better choice than SQL. I won’t go into detail here, but here’s a quick list: are When integrating with other Python workflows, e.g. you’re doing machine learning or you want to use a Python visualization library. When you need some quick statistics. The method is very useful. describe() When you need to slap together a quick analysis, without worrying about scale or reproducibility. Although Excel or Google Sheets might work just as well. If you’re building Python applications. Pandas might be the quickest way to go from an arbitrary data structure to a table, and vice versa. When you need imperative workflows and loops. E.g. building Markov simulations. actually When you need to write or reuse unusual functions. Pandas is great at applying arbitrary Python functions. If you have a highly dynamic and parametrized workflow. Conclusion I hope this article has stimulated you to think more deeply about SQL and pandas and their relative strengths and weaknesses. I believe the current trend in data has swung too far in the favor of pandas, and that you ignore SQL at your own peril. Here’s my advice: : study SQL and learn to use it for your analytics. You won’t regret it. If you’re a learner : drown your students in SQL until they dream in tables and speak in all caps. It’s tough love and they will hate you for it, but when day they’ll understand. (Don’t hit them on the head though.) If you’re a curriculum designer : try to gradually wean your pupils off of Pandas and encourage them to attempt problems in SQL. If you’re a mentor I’d love to get a conversation going. Feel free to drop a comment, or add me on . write me an email LinkedIn