Testing is a crucial step in software development. Every programming language has a myriad of tools to make our lives easier when testing our code, and Python is no different. However, when using the Pandas library, I've noticed that testing isn’t as common. This could be because it’s harder than testing regular Python code, or because Pandas is usually used by professionals who aren’t necessarily programmers. Either way, testing plays a crucial role and should not be overlooked.

In this post, I will share some of my knowledge about the topic, including what tools to use, general strategies for testing, and how to test code that uses Pandas.

Using Pytest

There are other libraries out there, but I’ve found Pytest to be the simplest and yet most robust of them. Other testing tools usually require defining classes with methods for every test, which makes writing tests a bit more cumbersome. Pytest, on the other hand, lets us define simple and compact functions instead.

To use it, we must first install it:

    pip install pytest

As an example, say we have the following function that we want to test, inside a file called “operations.py”:

    # operations.py
    def sum_a_and_b(a, b):
        return a + b

To test it with Pytest, we create a file whose name either starts with “test_” or ends with “_test” (so “test_operations.py” or “operations_test.py”). Inside this file, we define a function to test our sum_a_and_b function with an assertion:

    # test_operations.py
    from operations import sum_a_and_b

    def test_sum():
        result = sum_a_and_b(1, 2)
        assert result == 3

We run the tests by typing pytest in the terminal:

    (venv) pandas-testing % pytest
    ========================== test session starts ==========================
    platform darwin -- Python 3.9.6, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
    rootdir: /Users/eduardo/development/tutorials/pandas-testing
    collected 1 item

    test_operations.py .                                               [100%]

    =========================== 1 passed in 0.01s ===========================

Writing Good Tests

Beyond the Happy Path

Of course, the simple test above doesn’t ensure our function works correctly 100% of the time. For instance, it doesn’t test the behavior of the function when we pass strings as arguments. To write good tests, we should start by testing the “happy paths” (i.e., the normal expected behavior), but we should always remember to also test the edge cases.

Testing Principles

For unit tests, it’s a good rule of thumb to follow the common “arrange/act/assert” structure. That is to say, we should:

- arrange everything that is needed for the test, like creating any necessary data or special settings, preparing an in-memory database, or mocking API calls;
- act on the function or method to be tested by calling it;
- assert the expected outcome.

If we are testing something that relies on user interaction, we can also think of it in terms of “given / when / then”, i.e.:

- given what is needed to run the test;
- when some interaction happens;
- then we should expect a certain outcome.

Take Your Time Naming Everything Correctly

Just like when we’re writing any other code, naming things correctly when testing is highly recommended. Our test functions should be descriptive of what is being tested. The same goes for any variables used inside them. For instance, our test_sum example above could have been written like this:

    from operations import s

    def test1():
        x = s(1, 2)
        assert x == 3

It’s the same test, and it achieves the same result. But even for this simple example, it’s harder to understand what is actually being tested. For more complex test cases, you can imagine it would quickly become unintelligible.

Testing Pandas

So, how do we apply this to Pandas? If we have functions or methods that output DataFrames, we will want to ensure that the results are as expected.

For example, say we have a function double_dataframe that multiplies every value of a DataFrame by two:

    # pandas_example.py
    import pandas as pd

    def double_dataframe(df):
        return df * 2

To test it, we may want to use an assertion as we did in our previous example. We define an expected result and compare it with the actual result using an equality operator (==):

    import pandas as pd
    from pandas_example import double_dataframe

    def test_double_dataframe():
        # arrange
        input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
        # act
        result_df = double_dataframe(input_df)
        # assert
        expected_df = pd.DataFrame({'a': [2, 4, 6], 'b': [8, 10, 12]})
        assert result_df == expected_df

But that does not work in Pandas. Comparing two DataFrames with == happens element by element and produces another DataFrame of boolean values rather than a single True or False, so the assert statement cannot decide whether the comparison passed. Pandas raises an error: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Using the Pandas Test Functions

Instead, we can use the Pandas test functions: assert_frame_equal, assert_series_equal, and assert_index_equal. (In many cases, plain assert statements will still be useful and necessary, so we should still use them when appropriate.)

assert_frame_equal

We can rewrite the example above to correctly compare the result and expected DataFrames with assert_frame_equal:

    import pandas as pd
    from pandas.testing import assert_frame_equal
    from pandas_example import double_dataframe

    def test_double_dataframe():
        input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
        result_df = double_dataframe(input_df)
        expected_df = pd.DataFrame({'a': [2, 4, 6], 'b': [8, 10, 12]})
        assert_frame_equal(result_df, expected_df)

And this time, the test will succeed. assert_frame_equal compares two DataFrames and reports any differences.

We can vary the strictness of the equality checks by using additional parameters like check_dtype, check_index_type, check_exact, and many more that you can find in the Pandas documentation.

For instance, if we want to compare values but we don’t care if they’re float or int, we can set check_dtype=False:

    import pandas as pd
    from pandas.testing import assert_frame_equal
    from pandas_example import double_dataframe

    def test_double_dataframe():
        # column 'a' holds floats here, while the expected column 'a' holds ints
        input_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})
        result_df = double_dataframe(input_df)
        expected_df = pd.DataFrame({'a': [2, 4, 6], 'b': [8, 10, 12]})
        assert_frame_equal(result_df, expected_df, check_dtype=False)

assert_series_equal

assert_series_equal works in a similar way, but for Series. Say we have the following function that doubles the values of a single column in a DataFrame, and returns it as a Series:

    import pandas as pd

    def double_column(df, col_name):
        return df.loc[:, col_name] * 2

We can test it with assert_series_equal:

    import pandas as pd
    from pandas.testing import assert_series_equal
    # assuming double_column lives in pandas_example.py like the other examples
    from pandas_example import double_column

    def test_double_column():
        input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
        result_series = double_column(input_df, 'a')
        expected_series = pd.Series([2, 4, 6], name='a')
        assert_series_equal(result_series, expected_series)

assert_index_equal

assert_index_equal can be used when we need to make sure the indexes of two DataFrames are the same. For example, say we have a function get_top_n_countries that takes a DataFrame containing information on countries and returns the top n countries based on a specified column. The resulting DataFrame uses the country names as the index:

    import pandas as pd

    def get_top_n_countries(df, column_name, n):
        sorted_df = df.sort_values(column_name, ascending=False)
        top_n_countries = sorted_df.head(n)
        # set_index returns a new DataFrame, so nothing is modified in place
        return top_n_countries.set_index('country')

We can test it with assert_index_equal:

    import pandas as pd
    from pandas.testing import assert_index_equal
    from pandas_example import get_top_n_countries

    def test_get_top_n_countries():
        data = {'country': ['USA', 'China', 'Japan', 'Germany', 'UK'],
                'population': [328, 1393, 126, 83, 66]}
        df = pd.DataFrame(data)
        top_3_countries = get_top_n_countries(df, 'population', 3)
        expected_index = pd.Index(['China', 'USA', 'Japan'], name='country')
        assert_index_equal(top_3_countries.index, expected_index)

Don’t Forget About the Edge Cases

In the examples above, we’re testing only the happy path, the expected behavior when everything goes according to plan. However, it’s essential to consider situations where something may not go as expected. These are called edge cases, and we need to define the desired behavior for them. For instance, we may want to allow negative values or not, depending on the kind of data we’re working with.

To illustrate that, let’s expand on the last example, where we tested the get_top_n_countries function. We can think of a few situations where we might have results that deviate from the happy path.

The DataFrame Is Empty

If the DataFrame is empty, we may want to ensure that the function runs and no errors are raised:

    def test_get_top_n_countries_empty_dataframe():
        data = {'country': [], 'population': []}
        df = pd.DataFrame(data)
        top_3_countries = get_top_n_countries(df, 'population', 3)
        expected_index = pd.Index([], name='country')
        assert_index_equal(top_3_countries.index, expected_index, exact=False)

The DataFrame Contains Missing Values (NaN)

If the DataFrame contains missing values, we may want to ensure that they are being handled correctly and no rows are being dropped:

    def test_get_top_n_countries_missing_values():
        data = {'country': ['USA', 'China', 'Japan'],
                'population': [328, 1393, float('nan')]}
        df = pd.DataFrame(data)
        top_3_countries = get_top_n_countries(df, 'population', 3)
        # the row with the missing population is kept and sorted last
        expected_index = pd.Index(['China', 'USA', 'Japan'], name='country')
        assert_index_equal(top_3_countries.index, expected_index)

The DataFrame Contains Non-Numeric Values

If the DataFrame contains values that are not numbers, we may want to raise an exception if they are in the column being sorted:

    import pytest  # needed for pytest.raises

    def test_get_top_n_countries_non_numeric_values():
        # arrange
        data = {'country': ['USA', 'China', 'Japan'],
                'population': [328, 1393, 'One hundred twenty six']}
        df = pd.DataFrame(data)
        # act and assert
        with pytest.raises(TypeError):
            get_top_n_countries(df, 'population', 3)

The DataFrame Contains Negative Values

If the DataFrame contains negative values, we may want to raise an exception when we’re working with population data, as a negative population value wouldn’t make sense.
To do that, we would first need to change the function that is being tested:

    def get_top_n_countries(df, column_name, n):
        if column_name == 'population' and (df[column_name] < 0).any():
            raise ValueError('population values must not be negative')
        sorted_df = df.sort_values(column_name, ascending=False)
        top_n_countries = sorted_df.head(n)
        # set_index returns a new DataFrame, so nothing is modified in place
        return top_n_countries.set_index('country')

Then we can implement the test:

    def test_get_top_n_countries_negative_values():
        data = {'country': ['USA', 'China', 'Japan'],
                'population': [328, 1393, -100]}
        df = pd.DataFrame(data)
        with pytest.raises(ValueError):
            get_top_n_countries(df, 'population', 3)

By testing these edge cases, we can ensure that our function works correctly and handles unexpected situations appropriately.

Final Words

Testing is a crucial step in software development, and Pandas is no exception. In conclusion:

- Although it may be harder to test Pandas code, and it’s not as common as testing regular Python code, it is still an essential part of the development process that should not be overlooked.
- When it comes to testing, Pytest is a simple and robust tool that lets us define compact functions for testing.
- To write good tests, we should always remember to test both the “happy paths” and the edge cases, and follow the “arrange/act/assert” structure for unit tests.
- Proper naming of everything when testing is highly recommended to avoid confusion.
- When testing Pandas, we should use Pandas test functions such as assert_frame_equal, assert_series_equal, and assert_index_equal.

By following these tips and best practices, we can ensure that our Pandas code is reliable, robust, and performs as expected.
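
As a closing illustration, here is a minimal sketch that puts those points together in one small test file for the double_dataframe function from earlier in the post. The empty-DataFrame and missing-value cases, along with the test names, are my own additions for illustration and are not taken from the examples above. The function is redefined inside the block only so that the file runs on its own. In a real project it would be imported from its module.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def double_dataframe(df):
    # Same function as in the post: multiply every value by two.
    return df * 2


def test_double_dataframe_doubles_every_value():
    # arrange
    input_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    expected_df = pd.DataFrame({'a': [2, 4, 6], 'b': [8, 10, 12]})
    # act
    result_df = double_dataframe(input_df)
    # assert
    assert_frame_equal(result_df, expected_df)


def test_double_dataframe_keeps_empty_dataframe_empty():
    # arrange: explicit dtypes, so the test does not depend on how
    # pandas infers the dtype of an empty column
    input_df = pd.DataFrame({'a': pd.Series([], dtype='int64'),
                             'b': pd.Series([], dtype='int64')})
    # act
    result_df = double_dataframe(input_df)
    # assert: same columns, same dtypes, still no rows
    assert_frame_equal(result_df, input_df)


def test_double_dataframe_leaves_missing_values_missing():
    # arrange
    input_df = pd.DataFrame({'a': [1.0, float('nan'), 3.0]})
    expected_df = pd.DataFrame({'a': [2.0, float('nan'), 6.0]})
    # act
    result_df = double_dataframe(input_df)
    # assert: NaN values in the same position are treated as equal
    assert_frame_equal(result_df, expected_df)
```

Saved under a name that Pytest discovers, such as test_double_dataframe.py (a hypothetical file name), these three functions should be collected and run by the pytest command. Each test name states the behavior being checked, and each body follows the arrange/act/assert order.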