Data analysis and visualisation of The Simpsons from beginning to end

by Ana (@eyeofdata), February 18th, 2018

So far, I have been posting only visualisations I had already created, most of them at work. This time, I created something specifically for the blog, as I wanted to see if I could make a video of the whole process: analysing the data, building a basic plot, and producing the final, edited plot. A warning to readers: video recording and editing are completely experimental for me and well outside my usual skill set. Any feedback is appreciated, and I will do my best to improve as I keep writing these posts.

While thinking about recording this video, I browsed kaggle.com in search of a fun dataset to work with. I came across a couple of datasets from The Simpsons and ended up choosing one named “The Simpsons by the Data”. It contained 4 .csv files: simpsons_characters.csv, simpsons_episodes.csv, simpsons_locations.csv and simpsons_script_lines.csv.

For this post, I analysed only the file simpsons_episodes.csv, which contained the following variables: id, title, original_air_date, production_code, season, number_in_season, number_in_series, us_viewers_in_millions, views, imdb_rating, imdb_votes, image_url and video_url.

Part of simpsons_episodes.csv in an R data frame

One thing that is missing from the video is that the first time I loaded the file into R, the data frame had 298 rows instead of the 600 it was supposed to have. Looking at the data frame in RStudio, I noticed that one line had a quotation mark that was never closed, which prevented the data from loading correctly. I removed this quotation mark by hand because there was only the one, but normally data cleanup should be done programmatically so that it is easily repeatable.
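
As a rough illustration of what such a programmatic check could look like, here is a minimal sketch. It is written in Python rather than the R used in the video, it is not the code from the video, and the sample rows and the function name `find_unbalanced_quote_lines` are made up for the example. It flags every line that contains an odd number of double quotes. This is only a heuristic: a quoted field that legitimately spans several lines would be flagged too.

```python
def find_unbalanced_quote_lines(text, quotechar='"'):
    """Return the 1-based numbers of lines with an odd count of quote characters."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Hypothetical sample, NOT taken from the real dataset.
# The third line opens a quotation mark that is never closed.
sample = (
    'id,title,season\n'
    '1,"Episode A",1\n'
    '2,"Episode B,1\n'
    '3,"Episode C",1\n'
)

bad_lines = find_unbalanced_quote_lines(sample)
print(bad_lines)  # line numbers to inspect before loading the file
```

Running a check like this before loading the file points straight at the suspicious line, and the same script can be rerun whenever the dataset is updated.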

The video shows a very basic analysis in R of some of the variables, the construction of a plot, and then the enhancement of that plot in Adobe Illustrator. In the video I also explain my train of thought and why I chose to do certain things. Check it out!

Here is the plot, created in R and edited in Adobe Illustrator (before and after).

IMDB rating of The Simpsons seasons (before and after editing)
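
For readers who want to try a similar summary themselves, here is a minimal sketch of one plausible first step: averaging `imdb_rating` per `season`. It assumes the plot summarises ratings by season, which may not be exactly how the video does it. It is written in Python rather than R, the sample rows are invented, and `mean_rating_by_season` is a hypothetical helper name. Only the two column names come from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_rating_by_season(csv_text):
    """Average the imdb_rating column per season, skipping missing ratings."""
    ratings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["imdb_rating"]:  # an empty string means the rating is missing
            ratings[int(row["season"])].append(float(row["imdb_rating"]))
    # Sort by season so the result is ready to plot in order.
    return {season: mean(values) for season, values in sorted(ratings.items())}


# Hypothetical sample, NOT real ratings from the dataset.
sample = (
    "season,imdb_rating\n"
    "1,8.0\n"
    "1,7.0\n"
    "2,9.0\n"
    "2,\n"
)

averages = mean_rating_by_season(sample)
print(averages)
```

From a table like this, a bar or line chart of the average rating per season is a short step in any plotting tool.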

As mentioned at the beginning, this is something completely new for me. Let me know if you enjoyed it and what topics you'd like to see me write about or make videos on, and I will try to make more content!

Edit: the code used in the video is available at https://gist.github.com/anabarbosa/3dccb842c1d0a3cfd547ee90b508a19a