In early February I saw that a “March Madness” competition was being hosted on Kaggle. I thought this might be something I could do since I am somewhat interested in sports, but when I looked at the competition I was pleasantly surprised . . . there was an analytics portion. The Google Cloud & March Madness Competition had three portions this year: a competition predicting the Men’s Tournament, one predicting the Women’s Tournament, and a separate analytics competition. I had been planning on working on my EDA and visualization skills anyway, so I thought this would be the perfect training ground.
The competition was originally scheduled to run from February 13 to April 6. The challenge was “to tell a data story about college basketball through a combination of both narrative text and data exploration.” The rules for the notebook were relatively broad, so the topics you could explore were endless as long as they related to March Madness.
I began exploring the data and saw that the opportunities were basically endless. There was data about both the Men’s and Women’s Tournament and Regular Season going back over 20 years. Fearful of drowning in a sea of data, I began brainstorming topics that would allow me to be creative and produce good visualizations, as well as have a chance at finishing this notebook on time.
After exploring some data on points per game (PPG) and the track records of different teams, an idea popped into my head. It was just a passing thought at first, but then, after a few hours, I could not stop thinking about it. I had begun to deeply wonder about a connection between the men’s and women’s data.
The more I thought, the more the idea took shape: schools that have two good teams (men’s and women’s) could be more likely to win games in the NCAA Tournament.
It was a crazy thought, I know. I didn’t even have any evidence or experience to back up the idea. But on some level, it just clicked with me. So, from there I began my exploration into what I have dubbed the “Two-Team Factor.”
I first needed to establish what exactly this factor was, and how I could explain it in my notebook. Basically, the Two-Team Factor is the idea that when programs/schools (say, Duke) have both their Men’s and Women’s teams competing in the NCAA Tournaments in the same season, their teams are more likely to win games and advance in the tournament than programs that only have one team in the tournament.
I know that is a mouthful, so I condensed it to Two-Team programs and Single Team programs. Two-Team programs are special. They are an anomaly: schools that are lucky enough to be able to watch both their Men’s and Women’s teams play in March Madness in the same year.
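One way to make this definition concrete is an inner join of the men's and women's tournament fields on season and school. The sketch below uses a tiny made-up table; the column names `Season` and `School`, the helper `label_program_type`, and the rows themselves are assumptions for illustration, not the competition's actual schema or data.

```python
import numpy as np
import pandas as pd

# Hypothetical tournament fields: one row per (season, school) in the bracket.
mens_field = pd.DataFrame({
    "Season": [2018, 2018, 2019, 2019],
    "School": ["Duke", "Kansas", "Duke", "Gonzaga"],
})
womens_field = pd.DataFrame({
    "Season": [2018, 2018, 2019, 2019],
    "School": ["Duke", "Baylor", "Baylor", "Gonzaga"],
})

# A Two-Team program appears in BOTH fields in the same season.
two_team = mens_field.merge(womens_field, on=["Season", "School"], how="inner")


def label_program_type(field: pd.DataFrame, two_team: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `field` with a ProgramType column added."""
    flagged = field.merge(
        two_team, on=["Season", "School"], how="left", indicator=True
    )
    # "both" means the (season, school) pair was found in the Two-Team table.
    flagged["ProgramType"] = np.where(
        flagged["_merge"] == "both", "Two-Team", "Single Team"
    )
    return flagged.drop(columns="_merge")


mens_labeled = label_program_type(mens_field, two_team)
print(mens_labeled)
```

Joining on the season as well as the school matters here: a school only counts as a Two-Team program in the years when both of its teams made the tournament.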
As this idea began to take shape and my fantasies about maybe even winning my first competition grew, the world suddenly changed. COVID-19 (Coronavirus) had spread to the U.S., and in early March the NCAA canceled the remainder of the college basketball season, including March Madness. This was a shock, and soon the prediction competitions for both the Men’s and Women’s Tournaments were canceled too.
As I worried about whether my hard work had been wasted, Kaggle announced that the analytics competition for Google Cloud & March Madness would continue and the deadline would be extended to April 30. This was great news, which also allowed many contestants from the prediction competitions to move into the analytics competition. Surprisingly, this influx of competitors only motivated me to work harder on my idea. Now that I had extra time, I could really flesh it out and make some great visualizations.
This competition forced me to learn more about pandas and visualizations. For days I dove into seaborn, Matplotlib, and Plotly. Each library offered something different that I could use.
Matplotlib is the classic, the one that I, and everyone else, start out using. It is really great for being able to describe how each and every detail of your visualization will look, but this also means the code can become a very long and complex block. Seaborn is a visualization library that is built on top of Matplotlib. It allowed me to use shorter code to make my plots. However, Plotly was the library that I had never used before, and it became a large part of my arsenal. Plotly is a web-based library that allowed me to build interactive plots. This means I could build plots that allow users to hover over the columns and see the different values. It also allowed me to make bar charts that compare data from two different variables.
As I progressed, I learned that a notebook can get out of hand very quickly. I had done this entire competition so far in one notebook. It consisted of many of my experiments and many failed attempts. As the notebook got longer and longer it became more difficult to keep up with the variables I created and the different datasets.
This alerted me to the fact that I needed to make a plan and an outline earlier in a competition to keep up with everything, and maybe even use multiple notebooks like I do in normal competitions. So with this realization, I started to prepare another notebook that would be my final draft.
My original notebook had served its purpose. I was able to explore the data and reach the conclusion that the Two-Team Factor had merit. I discovered that in each round the “Two-Team” programs, as a percentage of the teams in the round, grew from under 30% in the first round to over 60% in the final rounds. This held true for both the men’s and women’s tournaments, and showed that these few teams must be winning most of their matchups.
I was able to then provide visualizations that showed the winning percentages of each type of program (Two-Team & Single Team) in each round of the tournament. As I had expected, the Two-Team programs had a higher winning percentage (60% or higher) in every round except one. With this information I created the basis of my argument and decided to continue exploring this factor in relation to other common factors associated with team performance.
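The two measurements described above (the share of Two-Team programs among the teams in each round, and the winning percentage of each program type per round) can be sketched with a pandas `groupby`. The table below is a toy example with one row per team per game; the column names `Round`, `ProgramType`, and `Won` and every value in it are assumptions for illustration, not the actual tournament data.

```python
import pandas as pd

# Toy data: one row per team per game. Won is 1 for a win, 0 for a loss.
games = pd.DataFrame({
    "Round": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "ProgramType": [
        "Two-Team", "Single Team", "Two-Team", "Single Team",
        "Single Team", "Single Team", "Single Team", "Single Team",
        "Two-Team", "Single Team", "Single Team", "Two-Team",
    ],
    "Won": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Share of the teams in each round that are Two-Team programs.
two_team_share = (
    games.assign(IsTwoTeam=games["ProgramType"].eq("Two-Team"))
    .groupby("Round")["IsTwoTeam"]
    .mean()
)

# Winning percentage of each program type in each round (as a fraction).
win_pct = (
    games.groupby(["Round", "ProgramType"])["Won"]
    .mean()
    .unstack("ProgramType")
)

print(two_team_share)
print(win_pct)
```

In this toy table the Two-Team share rises from one round to the next only because those teams won their first-round games, which is the same reasoning the real numbers supported.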
During this entire experimentation period, the submission date was growing nearer. It caused a certain amount of stress, but just enough to motivate me. Fortunately, I had a lot of extra time due to the unfortunate event of the Coronavirus. This allowed me to really put in the extra hours in the last couple weeks. I eventually came to a point where there was no more time for exploration, and it was now time to create a beautiful notebook.
My personal requirements for a beautiful notebook were: (1) clear and concise writing, (2) great visualizations, and (3) understandable, concise code.
The writing was not very difficult because I had a perfect idea of what I wanted to convey. It just took time and many revisions. The visualizations, however, were something I needed to spend a lot of time on. I discovered that I could change the style of each chart with a single line of code. This was really a game-changer for me because I think the chart style adds a lot to a visualization. Some charts need the “fivethirtyeight” style, and others need a simple “seaborn-whitegrid.”
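The single line in question is Matplotlib's `plt.style.use(...)`. The sketch below shows it alongside `plt.style.context(...)`, which applies a style to one chart only. One caveat: depending on the Matplotlib version, the seaborn styles may be listed under names prefixed with `seaborn-v0_8-`, so the hypothetical helper `pick_style` checks which name is available. The plotted values are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs outside a notebook
import matplotlib.pyplot as plt


def pick_style(*candidates: str) -> str:
    """Return the first style name this Matplotlib install knows about."""
    for name in candidates:
        if name in plt.style.available:
            return name
    return "default"


# Newer releases use the "seaborn-v0_8-" prefix; older ones use "seaborn-".
whitegrid = pick_style("seaborn-v0_8-whitegrid", "seaborn-whitegrid")

# Global switch: every chart created afterwards uses this style.
plt.style.use("fivethirtyeight")
fig_a, ax_a = plt.subplots()
ax_a.bar(["A", "B"], [1, 2])  # placeholder values
ax_a.set_title("fivethirtyeight")

# Scoped switch: only the chart created inside the block uses this style.
with plt.style.context(whitegrid):
    fig_b, ax_b = plt.subplots()
    ax_b.plot([0, 1, 2], [0, 1, 4])  # placeholder values
    ax_b.set_title(whitegrid)

plt.close("all")
```

Printing `plt.style.available` lists every style name that can be passed to either call.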
Finding the perfect fit took some tinkering, but I think it improves the value of a visualization by a lot. Finally, I came to the code. I had decided I would use the hide feature on most of it in the final notebook for readability. But I still wanted my code to be clear, efficient, and understandable. The only problem was I had written so much unstructured code in my original notebook that it was hard to piece it all together in the final one.
My actual code needed good variable names and good documentation that explained what the code was doing so it is reproducible and understandable. This is something that is easy to forget when using a Jupyter Notebook. I’ve heard for months now that Jupyter notebooks can be dangerous to developers, but data scientists love them. I love them! They are easy to use and great for data exploration, but as soon as you start running cells out of order things can get out of hand fast.
To write more readable and useful code I realized I needed to actually use my (very weak) Python programming skills. I needed to define objects clearly, write notes, and use classes and functions. These can be very useful for shortening code and creating repeatable processes. They also allow me to understand and run the code in order.
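As a small example of what that looks like in practice, here is a sketch of a repeated calculation wrapped in a documented function. The function name `win_pct_by_group` and its input format are hypothetical, not taken from the notebook. Because it depends only on its argument and not on variables defined in earlier cells, it gives the same answer no matter what order the cells were run in.

```python
from collections import defaultdict
from typing import Iterable


def win_pct_by_group(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the winning percentage (0 to 100) for each group label.

    Each record is a (label, won) pair, for example ("Two-Team", True).
    """
    wins: dict[str, int] = defaultdict(int)
    played: dict[str, int] = defaultdict(int)
    for label, won in records:
        played[label] += 1
        wins[label] += int(won)
    return {label: 100.0 * wins[label] / played[label] for label in played}


# Placeholder records, for illustration only.
sample = [
    ("Two-Team", True),
    ("Two-Team", False),
    ("Single Team", False),
]
print(win_pct_by_group(sample))
```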
This was a big hurdle, but I now believe developer skills are much more valuable to a data scientist than I originally thought. Even though it is a Jupyter notebook, we need reproducible code that has structure. This is something I want to focus on in the near future as I develop these programming skills further. The deadline was now only days away and I scrambled to make sure my writing was clear and concise, and all of my visualizations had labels and made sense. On April 30, 2020, my first live Kaggle competition ended and I turned in the notebook, “The Two-Team Factor: Gender in March Madness.”
There were many great notebooks turned in to this competition. On May 18, 2020, the competition creators chose five winners in the March Madness Analytics Competition. I, not surprisingly, was not chosen. The winners, whose notebooks can be viewed here, had very impressive and creative notebooks. For my first competition, I was very proud of my submission. It was a semi-clear, creative notebook with what I thought were good visualizations.
I could not have asked for a better first live competition. The other competitors were great and I was able to really dive deep into some of the skills I wanted to work on. I think this competition has really improved my dataset manipulation and visualization skills. This is something that I hope will be extremely helpful in creating EDAs in future competitions, as well as creating great notebooks to share.