I’m not sure how I came up with this idea, but some time ago, I wanted to scrape a bunch of websites to see what the most popular colors and color themes were. In this techsploration, I generated screenshots of the top ~400 most trafficked websites and analyzed which colors they used most often. It turned out to be an interesting data scraping project, and I used it as an opportunity to learn about multithreading, color theory, and matplotlib. I’ll share everything I learned, along with the code I wrote, so you can try this on your own if you want.
I downloaded a list of the top 500 domains from moz.com’s top 500 list. The first thing I had to figure out was how to programmatically capture a screenshot of the webpage for each URL. My first idea was to use pyautogui to automate screenshotting a website and saving it, but that seemed way too contrived and would have taken too long.
After some research, I found a really cool library called python-webkit2png that would allow me to screenshot a webpage using webkit. After installation, the webkit2png script is available from the command line. In addition to being well documented, the repository is awesome, and even has a cool dancing Batman GIF.
Awesomeness. Source
Unfortunately, since webkit2png was written for Python 2.x, I had to hack around it a bit to get it to work with Python 3. (Yes I know, I could’ve just used Python 2.x).
The command line tool passes the raw data of the screenshot PNG to stdout, so I used subprocess.Popen to invoke it from within my Python script. The great thing about this is that I didn’t have to store the data in a temporary PNG file since I could pass the PNG data straight from stdout to a Python PIL Image object.
A rough sketch of that approach is below. The get_image() function returns a PIL Image object containing a screenshot of the given URL. You’ll notice there’s an additional logfile parameter: since the webkit2png script writes its errors to a log file, I deleted the log file before attempting a screenshot and checked it afterwards to see whether the capture succeeded.
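Here’s a minimal sketch of what get_image() might look like. It assumes webkit2png is on your PATH and writes the raw PNG to stdout (as described above); the exact command-line flags and the log file location depend on your webkit2png setup.

```python
import os
import subprocess
from io import BytesIO
from PIL import Image

def get_image(url, logfile):
    """Capture a screenshot of `url` via webkit2png and return it as a PIL Image."""
    # Remove any stale log file so leftover errors from a previous run
    # don't get mistaken for a failure of this capture.
    if os.path.exists(logfile):
        os.remove(logfile)
    # webkit2png writes the raw PNG to stdout; the exact flags (and how the
    # log file is configured) depend on your webkit2png installation.
    proc = subprocess.Popen(["webkit2png", url], stdout=subprocess.PIPE)
    png_data, _ = proc.communicate()
    # If webkit2png logged errors, treat the capture as failed.
    if os.path.exists(logfile) and os.path.getsize(logfile) > 0:
        return None
    return Image.open(BytesIO(png_data))
```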
Awesome, now that I have a way to get a screenshot of a webpage, I can run it against the CSV of top websites. It became apparent after five minutes that the script would take forever to run. I had just brushed up on multithreading, so it was kind of cool to use it to speed this process up.
I split the list of URLs into chunks and passed each chunk to its own thread, as sketched below. A write_image_data() function uses get_image() from before and appends each fetched image’s data to a file, and a chunk() helper splits a list into n roughly equal-sized chunks. Combining the two parts, we can run this on every URL.
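Here’s a hedged sketch of that setup, using get_image() from the sketch above. The tab-separated output format, the thread count, the per-thread log file names, and the CSV parsing are all assumptions for illustration; the lock keeps appends from different threads from interleaving.

```python
import threading
from collections import Counter

def chunk(lst, n):
    """Split lst into n roughly equal-sized chunks."""
    size = (len(lst) + n - 1) // n
    return [lst[i:i + size] for i in range(0, len(lst), size)]

def write_image_data(urls, outfile, logfile, lock):
    """Screenshot each URL and append its per-color pixel counts to outfile."""
    for url in urls:
        image = get_image(url, logfile)
        if image is None:
            continue
        counts = Counter(image.convert("RGB").getdata())
        lines = ["%s\t%d,%d,%d\t%d\n" % (url, r, g, b, n)
                 for (r, g, b), n in counts.items()]
        with lock:  # serialize appends so lines from different threads don't interleave
            with open(outfile, "a") as f:
                f.writelines(lines)

# url_list is assumed to come from the top-500 CSV -- adjust the column index
# and header handling for the actual file you downloaded.
with open("top500.csv") as f:
    url_list = [line.strip().split(",")[-1] for line in f if line.strip()]

lock = threading.Lock()
threads = []
for i, urls in enumerate(chunk(url_list, 8)):
    t = threading.Thread(target=write_image_data,
                         args=(urls, "image_data.tsv", "webkit2png_%d.log" % i, lock))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```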
Okay, on to more interesting stuff. Looking at code is all well and good, but it’s much more interesting to see what this data yields. I wrote a quick script to count the frequency of each color, sort the data, and dump it into another data file.
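A quick sketch of that aggregation step, assuming the tab-separated per-URL counts produced by the sketch above:

```python
from collections import Counter

# Sum the per-URL counts into one overall tally of how often each color appears.
totals = Counter()
with open("image_data.tsv") as f:
    for line in f:
        url, color, count = line.rstrip("\n").split("\t")
        r, g, b = map(int, color.split(","))
        totals[(r, g, b)] += int(count)

# Dump the colors sorted by frequency, most common first.
with open("color_frequencies.tsv", "w") as f:
    for (r, g, b), count in totals.most_common():
        f.write("%d,%d,%d\t%d\n" % (r, g, b, count))
```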
Outputting the first and last few lines of the sorted color frequencies
It’s pretty clear that pure white (255, 255, 255) is the most frequently occurring color among all the webpages scraped. This format isn’t a very intuitive or visual way to view the data, though, so I passed the data to a Jinja template and rendered it to an HTML file.
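Roughly what that rendering step could look like, assuming the totals counter from the earlier sketch (the template markup here is just an illustration, not the post’s actual template):

```python
from jinja2 import Template

# One table row per color: a swatch, the RGB triple, and its frequency.
TEMPLATE = Template("""
<table>
  {% for color, count in colors %}
  <tr>
    <td style="background: rgb({{ color|join(',') }}); width: 80px;">&nbsp;</td>
    <td>rgb({{ color|join(',') }})</td>
    <td>{{ count }}</td>
  </tr>
  {% endfor %}
</table>
""")

with open("colors.html", "w") as f:
    f.write(TEMPLATE.render(colors=totals.most_common(200)))
```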
The most frequently occurring colors (unfiltered)
Displayed above is a small section of the most frequently occurring colors, from top to bottom. This still isn’t very informative, since there are lots of near-identical colors with high frequencies. So I calculated the squared Euclidean distance between colors to measure how similar they were, and removed colors that were adjacent to similar colors in the ranking.
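One way to do that filtering is sketched below; the squared-distance threshold is arbitrary and just needs tuning by eye:

```python
def squared_distance(c1, c2):
    """Squared Euclidean distance between two RGB colors."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))

def filter_similar(ranked_colors, threshold=2000):
    """Walk the frequency-sorted colors and drop any color that is too close
    to a color we've already kept."""
    kept = []
    for color in ranked_colors:
        if all(squared_distance(color, k) > threshold for k in kept):
            kept.append(color)
    return kept

# A visually distinct top-of-the-list, assuming `totals` from the earlier sketch.
distinct = filter_similar([color for color, _ in totals.most_common()])[:20]
```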
The most frequently occurring colors (filtered)
This slice of the most frequently occurring colors is a bit more informative.
Note that removing similar colors isn’t a very good way to do this because it completely ignores those colors in the frequency count. It would be much better to group them together and count them as one color in order to accurately rank them.
What websites use these colors? Let’s check. I wrote a small module to look up URLs by color.
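A rough sketch of such a lookup, built on the per-URL counts file from the earlier sketches (not the post’s actual module):

```python
from collections import defaultdict

# Index color -> [(count, url)] so we can ask which sites use a color the most.
by_color = defaultdict(list)
with open("image_data.tsv") as f:
    for line in f:
        url, color, count = line.rstrip("\n").split("\t")
        r, g, b = map(int, color.split(","))
        by_color[(r, g, b)].append((int(count), url))

def top_sites_for(color, n=5):
    """Return the n URLs that use `color` most heavily."""
    return [url for count, url in sorted(by_color.get(color, []), reverse=True)[:n]]

print(top_sites_for((255, 255, 255)))
```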
Websites that use the most frequently occurring colors
Note that this visual is still misleading. It shows the most frequently occurring colors, along with the websites that use each color the most, but it doesn’t weight by website traffic or any other factor. These visuals only show a slice of the whole picture.
Let’s plot the colors in 3D to draw some more concrete conclusions.
3D visualization of website color frequencies
Above is a GIF of the top 10% of the most frequently occurring colors, plotted with the x, y, and z axes as red, green, and blue respectively. Each dot is colored accordingly, and its size represents the color’s relative frequency. The only dots that aren’t sized to scale are the ones near white, because those colors occur so frequently that they would take up more than an eighth of the visualization if drawn to scale.
This makes sense, since whites and light grays are very frequently used for large swathes of background on websites. Note the rough line along the diagonal between (0, 0, 0) and (255, 255, 255): those are light and dark shades of gray, which are very commonly used as text colors and secondary content colors.
There are very few colors near the extreme edges and corners; it’s much more common to see mellower reds and blues, since those tend to be popular for website color themes.
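For reference, a static version of that scatter plot can be sketched with matplotlib roughly like this (the GIF rotates the same axes; the special-case scaling of the near-white dots is left out):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3d projection

# Top 10% most frequent colors; axes are R, G, B, and dot size tracks frequency.
top = totals.most_common(len(totals) // 10)
max_count = top[0][1]

rs = [c[0] for c, _ in top]
gs = [c[1] for c, _ in top]
bs = [c[2] for c, _ in top]
point_colors = [(r / 255, g / 255, b / 255) for (r, g, b), _ in top]
sizes = [10 + 400 * (n / max_count) for _, n in top]

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(rs, gs, bs, c=point_colors, s=sizes)
ax.set_xlabel("Red")
ax.set_ylabel("Green")
ax.set_zlabel("Blue")
plt.show()
```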
Another way we can visualize the color frequencies is by clustering similar colors together. The k-means clustering algorithm is a simple machine learning algorithm that lets us group similar colors into clusters around centroids.
The k-means clustering algorithm works by selecting k centroids and shifting them until they reach an optimal “central” position within their clusters of points. For a more detailed explanation of the algorithm, refer to this page. For learning purposes, I implemented a weighted k-means algorithm first in pure Python, and then rewrote it using NumPy for speed. I’ll leave my code for the algorithm out of this post, but you can view it here.
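For flavor, here’s a rough NumPy sketch of a weighted k-means. It is not the post’s actual implementation, and initialization and convergence checks are simplified:

```python
import numpy as np

def weighted_kmeans(points, weights, k, iterations=50, seed=0):
    """Cluster n x 3 RGB `points` into k clusters, weighting each point by
    `weights` (its frequency) when recomputing centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distances).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the weighted mean of its assigned points.
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = weights[mask]
                centroids[j] = (points[mask] * w[:, None]).sum(axis=0) / w.sum()
    return centroids, labels

# e.g. 8 clusters over the scraped colors, weighted by frequency:
# centroids, labels = weighted_kmeans(list(totals.keys()), list(totals.values()), k=8)
```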
3D visualization of clustered website colors.
In the visualization above, I used weighted k-means to find 8 clusters, using the colors as points and their frequencies as weights. The centroid of each cluster is represented by the largest dot in that cluster, and every cluster is colored according to its centroid. This visualization is a good measure of the color distribution and shows us which colors tend to appear the most.
So what color is the Internet? Mostly black, white, and shades of gray.
I’m not entirely sure where I was going with this project, but I learned a lot about multithreading and NumPy. I’ll probably revisit this data in the future to see what else I can extract from it. If you’d like to mess with the code yourself, here is a link to the repository. Thanks for reading!
Follow me on Twitter: @omgimanerd