As the clock struck midnight, five days after ChatGPT launched and hit 1 million users, I decided to play devil’s advocate and feed it programming questions that I have long known to be at the crux of the disconnect between Juniors and Seniors in the programming/coding world.
That disconnect can sometimes be caused by Juniors’ inability to dig into the “why”: they frequently choose the wrong library, which leads to poor algorithmic complexity, and it is compounded by a lack of experience, and so on. Seniors, too, are susceptible to logical myopia and may make incorrect judgments in the pursuit of “tradeoffs” that do not make sense.
I used the Python ecosystem to reveal a hallucination (i.e. a seemingly real perception of something not actually present) from ChatGPT, one that may not serve Juniors well in their decision-making and may prevent them from making an informed decision by understanding the “why.”
This post is not meant to disparage ChatGPT but to exhort Juniors to pay attention to what they set out to study and encourage them to not cut corners by using ChatGPT.
Let’s take it a bit further and get into the spiritual game of Python. We’ll ask ChatGPT about a few pandas DataFrame functions used to investigate the number of rows and columns. We will footprint our ordeal on df.count(), df.shape[0], len(df), len(df.index), df.info(), and df.value_counts(), asking which of these functions has the fastest algorithmic complexity.
It’s necessary to define “Algorithmic Complexity,” along with a brief explanation of how we describe the efficiency of an algorithm in computer science using “Big O Notation,” before we proceed to question ChatGPT:
Algorithmic Complexity:
Algorithmic Complexity measures how long an algorithm/function/piece of code takes to complete given an input of size n, where n is the number of data elements. In other words, complexity measures how the processing time fluctuates as the problem grows in size.
Big O Notation:
Big O Notation is a concept borrowed from the field of mathematics. It is used to measure the efficiency of an algorithm. When a function is said to be O(1), it means the function runs in “constant time,” regardless of the input size. When a function is said to be O(N), the function runs in “linear time”: the number of steps grows with the input.
As a rule of thumb, prefer O(1) to O(N), because O(N) takes more and more steps as the input grows and drags down the efficiency of a program. You would rather a function/method/algorithm take either one (1) input or seventy (70) million inputs and process it in a single step within nanoseconds, than have it grind through seventy (70) million inputs for an hour because it needs seventy million steps to process them. When a solution performs badly simply because the number of elements (n) is large, it’s considered a bad solution.
It is also necessary to note that the algorithmic complexity differences may be more or less significant depending on the specific hardware and software environment you are using.
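As a minimal sketch of that difference (the function names and sizes here are illustrative, not from the original post), compare a single-step read against a whole-list walk:

import time

data = list(range(1_000_000))  # one million elements

def constant_time(seq):
    # O(1): indexing a list is a single step, no matter how long the list is
    return seq[0]

def linear_time(seq):
    # O(N): summing walks every element, so the work grows with the input size
    return sum(seq)

start = time.perf_counter()
constant_time(data)
print(f"O(1) read: {time.perf_counter() - start:.6f} s")

start = time.perf_counter()
linear_time(data)
print(f"O(N) sum : {time.perf_counter() - start:.6f} s")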
So, let’s have some level of romance with ChatGPT and question the efficiency of df.count(), df.shape[0], len(df), len(df.index), df.info(), and df.value_counts():
This is an absolute hallucination, and if you take it at its word, you will end up underperforming in your career path. It is wrong! We will investigate this verdict. But before that, it’s necessary to mention that getting the most out of ChatGPT means being knowledgeable enough to steer it yourself and to push back when it makes a wrong call. ChatGPT is versatile in some responses but poor at questions like “which of these is better,” solely because it wasn’t trained to dig into the standard libraries to make the best decision.
Investigation:
df.count()
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/frame.py#L10620-L10721
This function takes multiple steps to return a value: it counts the non-NA cells for each column, which means touching the data itself. As far as I can see, this is O(N), and ChatGPT is incorrect in its response.
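A quick illustration of why it has to touch the data (a toy frame of my own, not from the post):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

# count() reports the non-NA cells per column, so it must inspect every cell
print(df.count())
# a    2
# b    2
# dtype: int64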
df.info()
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/frame.py#L3433-L3461.
This is also in “linear time,” O(N). Regardless of the input, it takes more than a single step. The method does take cognizance of memory_usage (a look at Line 3463 in the linked source shows that): it optionally defaults index to True and can introspect elements of `object` dtype, but that can’t make it faster than the other methods. With index=True, the memory usage of the index is the first item in the output, which gives it some edge; however, that still doesn’t make it the method with the most efficient algorithmic complexity.
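A small sketch of what info() actually produces, on toy data of my own; the memory_usage="deep" variant is only there to show the extra introspection it can be asked to do:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", None]})

# info() prints dtypes, per-column non-null counts, and a memory estimate,
# so it does at least a count()'s worth of work plus the summary itself
df.info()

# memory_usage="deep" additionally introspects every object-dtype value,
# which costs even more
df.info(memory_usage="deep")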
df.value_counts()
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/frame.py#L7104-L7228
As far as I can see, this method is O(N). It counts the unique rows in the frame, and it does extra steps on top of that; one of those steps is that it forces a MultiIndex even for a single column.
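A hedged sketch of that behaviour on a toy frame (the outputs shown are from pandas 1.5.x):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# value_counts() counts unique rows, so it has to group over the whole frame
print(df.value_counts())
# a  b
# 1  x    2
# 2  y    1
# dtype: int64

# even a single column comes back wrapped in a MultiIndex on pandas 1.5.x
print(df[["a"]].value_counts().index)
# MultiIndex([(1,), (2,)], names=['a'])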
df.shape[0]
https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/core/frame.py#L873-L893
This “could have been” what I would consider O(1), but unfortunately, by definition, it takes a bit longer than len(df.index). By definition, it returns (len(self.index), len(self.columns)), so it can’t be faster than the calls it is built from, and it can’t be the one with the fastest algorithmic complexity.
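You can check that relationship directly (a tiny sketch on a throwaway frame):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 3)))

# shape is literally the pair (len(df.index), len(df.columns))
print(df.shape)                                      # (5, 3)
print(df.shape == (len(df.index), len(df.columns)))  # True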
len(df)
https://docs.python.org/3/library/functions.html?highlight=len#len
len() is a Python built-in function that objects from third-party libraries such as NumPy and pandas (a DataFrame included) can also support, hence my decision to pass a DataFrame into it as an argument. What it does is simply return the length of an object.
To get some understanding of len(), run help(len) from your IDLE, as seen below:
>>> help(len)
Help on built-in function len in module builtins:
len(obj, /)
Return the number of items in a container.
>>>
The built-in definition of len() in CPython (i.e. https://github.com/python/cpython/blob/main/Python/bltinmodule.c#L1683) is itself not a bare single step. In fact, Line 1679 calls PyObject_Size(obj); on https://github.com/python/cpython/blob/main/Objects/abstract.c#L52, which is another journey altogether: a dispatch into whatever the object itself does to report its size.
So, len() is only ever as fast as the object it is handed, plus that extra hop of indirection.
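To make that dispatch concrete, here is a toy container (purely illustrative, not from the post) showing that len() hands the work over to the object’s own __len__:

class Bag:
    # a toy container, just to show where len() sends its work
    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        # len(bag) lands here: PyObject_Size dispatches to the object's own
        # length slot, so the cost of len() is whatever this method costs
        return len(self._items)

bag = Bag(["lagos", "london", "ottawa"])
print(len(bag))  # 3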
We will remain on this built-in function for the next check, but this time we will pass in pandas.DataFrame.index, which is expected to be faster, given that the index’s underlying machinery is written in the C programming language, which is in turn faster than Python.
len(df.index)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html
df.index is the index (row labels) of the DataFrame. So we are still discussing len(); the difference is that we now hand the Index object to it directly instead of going through the DataFrame first. The DataFrame’s own __len__ simply returns len(self.index), so len(df.index) skips that extra hop, which is why it is the leanest of these spellings.
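A small sketch of the relationship between the two spellings, on a throwaway frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 3)))

# len(df) goes through DataFrame.__len__, which itself returns len(self.index);
# len(df.index) asks the Index object directly and skips that extra hop
print(len(df), len(df.index))    # 1000 1000
print(len(df) == len(df.index))  # True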
Indexes are generally fast for computer systems to read. We will discuss this further when we check the performance of these functions in the Benchmark section.
So far, we have established the incorrect verdicts of ChatGPT and demonstrated that you must have prior knowledge to help train ChatGPT. At this point, some will ask which one is better or faster to use. Stick around!
It bears repeating that this is “not” to disparage ChatGPT but to exhort Juniors to pay attention to what they set out to study and not to cut corners by using ChatGPT.
Let’s ask ChatGPT if it is sure.
Hmmm? May Guido van Rossum, Wes McKinney, and the rest of the contributors to Python & pandas be praised! ChatGPT asserts that its response is sensible. It held firm on its argument, thereby unjustly swaying the software folks who take its judgment at face value.
It is sensible, at least, that it suggested we check performance benchmarks. Let’s write some simple code to test these functions.
Benchmark:
import pandas as pd
import numpy as np
# DataFrame with 10,000,000 rows and 30 columns
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 30)))
# You may include this just to drop some rows so the index is no longer a plain RangeIndex.
# Either way, the most performant option is still the most performant.
df = df.drop([12, 14, 16, 18], axis='rows')
# Test Benchmark
%timeit df.count()
%timeit df.shape[0]
%timeit df.info()
%timeit len(df)
%timeit len(df.index)
The output is impressive: len(df.index) and df.shape[0] came out as the two most performant functions, followed by len(df). These three (3) functions reported their timings in “nanoseconds,” while the rest were all in “milliseconds.” You may copy the code and try it yourself. Please see the attached image below; the benchmark test was conducted using Python 3.10.9 on an i5 8th-gen mobile processor.
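If you are not inside IPython or Jupyter (where the %timeit magic lives), a plain-script version of the same benchmark might look like this; it is only a sketch, the buf=io.StringIO() bit just silences info()’s printed summary, and your absolute numbers will differ:

import io
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 30)))
df = df.drop([12, 14, 16, 18], axis='rows')

candidates = {
    "df.count()": lambda: df.count(),
    "df.shape[0]": lambda: df.shape[0],
    # send info()'s printed summary into a throwaway buffer to keep output readable
    "df.info()": lambda: df.info(buf=io.StringIO()),
    "len(df)": lambda: len(df),
    "len(df.index)": lambda: len(df.index),
}

for name, fn in candidates.items():
    # average over a fixed number of calls; absolute numbers vary by machine
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{name:15} {seconds:.9f} s per call")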
The most performant is len(df.index), and it is understandable why: indexes are cheap for the machine to read from memory. This will be explained as we go.
DataFrame objects naturally have an explicit index that you may use to refer to and modify the data. In the DataFrame we used to test ChatGPT’s hallucination, even the columns are themselves an Index object holding the column labels. This Index object is an intriguing structure in and of itself, as it can be seen as either an immutable array or an ordered set (technically a multi-set, as Index objects may contain repeated values).
The scaled-down, streamlined version of the initial code below will do more justice to understanding that we are playing with arrays and indexes under the hood.
import numpy as np
import pandas as pd

# (the printed values below are from one sample run; your random numbers will differ)
dframe = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)))
df = dframe.drop([2,4,6,8], axis='rows')
print(df)
# 0 1 2
# 0 6 3 7
# 1 4 1 7
# 3 9 9 2
# 5 1 1 5
# 7 0 4 4
# 9 9 6 1
print(df.index)
# Int64Index([0, 1, 3, 5, 7, 9], dtype='int64')
print(df.columns)
# RangeIndex(start=0, stop=3, step=1)
print(df.values)
# [[6 3 7]
# [4 1 7]
# [9 9 2]
# [1 1 5]
# [0 4 4]
# [9 6 1]]
print(df.values.__class__)
# <class 'numpy.ndarray'>
Therefore, a DataFrame may be conceived of as a two-dimensional NumPy array, known as a matrix (a table of rows and columns), with row and column labels attached.
When an array is declared, the computer system places it in unoccupied memory locations/addresses sequentially (following a logical order), so no matter how large the array is, each element sits at a predictable address and any value can be picked out easily by its index.
A pictorial representation of the reality of an array, i.e. an_array = [“lagos”, “london”, “ottawa”, “alberta”], is as below:
|                | "lagos" | "london" | "ottawa" | "alberta" |
|----------------|---------|----------|----------|-----------|
| Memory Address | 3010    | 3011     | 3012     | 3013      |
| Index          | 0       | 1        | 2        | 3         |
"Read" is one of an array's four (4) data operations, and it is naturally thought to be efficient, given what I explained earlier. The others are Search, Insert, and Delete. Computers are generally capable of reading arrays effectively because of this characteristic.
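A tiny sketch of that Read operation, using the same illustrative array as above:

import numpy as np

an_array = np.array(["lagos", "london", "ottawa", "alberta"])

# a read is one step: the element sits at base_address + index * itemsize
print(an_array[2])           # ottawa
print(an_array.itemsize)     # bytes reserved per element in the contiguous block
print(an_array.ctypes.data)  # base memory address of that block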
So far, we have established ChatGPT’s bold-faced hallucination, and it is even expected of you and me to help train ChatGPT by voting such answers down and correcting them.
In summary, deciding which function to use as a software person always depends on your use case. In most cases, you cannot reach for a single function with the mindset that it has the best algorithmic complexity without digging deep to confirm it. It’s never one-size-fits-all: there will be tradeoffs that sometimes entail using something slower than what you figured out to be faster.
Libraries grow. Bugs get fixed. In fact, you will see me investigate a reported bug in pandas involving DataFrameGroupBy.value_counts raising an error when used with a TimeGrouper (issue 50486) here. What I figured out while investigating it is that value_counts() would not accept the grouper object, which it should, whereas the same thing worked well for .count() and .size(). A fix was submitted by the contributor who raised it, and the issue was closed. The pandas v2.0.0 release comes with a complete fix for that problem/issue. You will do yourself a disservice by just taking verdicts blindly from ChatGPT. Therefore, your mastery and learning processes are important. Trust the process and learn what it takes as a Dev.
“Bear in mind ChatGPT or any other AI out there will not replace Software Devs”
For advice on how to get better responses out of ChatGPT, I can point you to Rob Lennon’s tweet. Some excerpts from it are below, and I quote:
Simulate an expert:
Ask ChatGPT to play the part of a customer, co-host, or talented expert. Have a conversation with it, or ask it to generate content as if it were that specific persona.
Challenge the conventional narrative:
Ask for examples of what contradicts the dominant narrative. Generate content that challenges readers' assumptions. Seek out provocative angles that defy expectations and break the mold.
Use unconventional prompts:
Try using prompts that are more open-ended or abstract. This way you’ll get unique and creative responses nobody else is. By getting weird, you can unlock ChatGPT's creative potential in finding vivid language and unexpected topics.
Be Ultra-Brainstormer:
It’s easy to have ChatGPT generate a list of potential topic ideas for your next project. But often they're generic and expected. Instead, ask it to come up with new angles or approaches to cover a familiar topic.
Capture your writing style:
Feed ChatGPT your writing. Ask it to help you create a style guide for future outputs. It’ll give you the exact words to describe your voice and tone in a way that AIs understand.
Have ChatGPT write from different perspectives:
Ask it to write from the perspective of a group of characters with different backgrounds or viewpoints. Explore new ideas and perspectives, and add depth to your writing.
Generate content with a specific purpose or goal in mind:
Tell ChatGPT who your audience is and what you want to achieve with your content. Remember, it has no context about who you are or what you want unless you give it some. So give it context.