Domain Classification: Analyzing Source Types in Search Engine Results
by @browserology



Too Long; Didn't Read

The study delves into the classification, popularity, and exclusivity of domains in search engine results, comparing Google, Bing, DuckDuckGo, and Metager. It analyzes the distribution patterns using the Gini coefficient and explores the similarities and overlaps among search engine domains, providing valuable insights into the diversity and source concentration in search rankings.

Authors:

(1) Yagci, Nurce, HAW Hamburg, Germany & [email protected];

(2) Sünkler, Sebastian, HAW Hamburg, Germany & [email protected];

(3) Häußler, Helena, HAW Hamburg, Germany & [email protected];

(4) Lewandowski, Dirk, HAW Hamburg, Germany & [email protected].

Abstract and Introduction

Literature Review

Objectives and Research Questions

Methods

Results

Discussion

Conclusion, Research Data, Acknowledgments, and References

RESULTS

Classification of domains

Since we are working with sources from two different countries and the type of content behind some domains might not be immediately recognizable, we manually created a general set of categories based on the top 50 domains found across all results. Following this, we manually classified the 50 most popular domains for both the German and US results (see Table 1).


Among the top 50 domains in the German results, News services dominate with 54%, while Movies & Entertainment and Sports make up 18% and 14%, respectively. In the US, 34% of the top 50 domains are Sports websites, with News services following close behind at 30%. Movies & Entertainment make up only 12%. These are the overall distributions of domain classes across all search engines; below, we look at the differences between the search engines.


Variety of domains

A comparison of the number of root domains in the search results of Google, Bing, DuckDuckGo, and Metager in Germany shows that Google has the greatest diversity by a small margin (see Figure 2). Overall, the values are very similar, but Google's lead is most noticeable in the first three positions. Interestingly, the greater diversity of Google's sources is even more pronounced in the US results. To examine the differences more closely, we look at the numbers below.


Table 2 shows the cumulative frequencies for the German results. The number of unique root domains converges at the fifth position. We found 2,841 unique domains in Google's results, 2,783 unique domains for Bing, 2,707 unique domains for DuckDuckGo, and 2,683 unique domains for Metager in Germany.
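The cumulative counts described above can be reproduced with a short sketch. It assumes the results are available as (rank, URL) pairs pooled over all queries for one search engine; the function names and the naive "last two host labels" rule for root domains are illustrative assumptions, not the authors' pipeline, and the rule mishandles suffixes such as co.uk.

```python
from urllib.parse import urlparse


def root_domain(url):
    """Naive root-domain extraction (assumption): keep the last two host labels.

    This is wrong for multi-part suffixes such as 'co.uk'; a public-suffix
    list would be needed for real data.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def cumulative_unique_domains(results, max_rank):
    """Return the number of unique root domains seen up to each rank.

    results: iterable of (rank, url) pairs, ranks starting at 1.
    """
    by_rank = {}
    for rank, url in results:
        by_rank.setdefault(rank, []).append(root_domain(url))

    seen = set()
    counts = []
    for rank in range(1, max_rank + 1):
        seen.update(by_rank.get(rank, []))
        counts.append(len(seen))
    return counts
```

Calling cumulative_unique_domains once per search engine yields one column of a table like Table 2, and the last entry is the engine's total number of unique root domains.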



The German and US search results differ. In contrast to Germany, the US results show only minor differences between the search engines in the first positions. However, Google also shows the greatest variety of root domains in the US search results, and this is more evident than in the German results (see Table 3). Overall, we found 4,085 unique domains in Google, 3,602 in Bing, 3,579 in DuckDuckGo, and 3,500 in Metager.


When comparing the top domains for each search engine across all German search results collected, we find a clear preference for Wikipedia in all of them (see Table 4). While Wikipedia is the most popular domain in all search engines, it appears far less often in the Google results, with only 658 occurrences compared to 1,948 in Bing, 1,878 in Metager, and 1,752 in DuckDuckGo.


The other domains in the top 10 across the search engines are mostly News services (Google: 7, Bing: 4, DuckDuckGo: 3, Metager: 4). Sports websites account for two of the top 10 for every search engine except Google, which has only one. Surprisingly, Instagram is in the top 10 for DuckDuckGo and Metager, even though Table 1 shows that domains of the class Social Media make up only 4% of the top domains in the German results. The same goes for Amazon, which is the second most frequent domain in DuckDuckGo, even though E-Commerce websites make up only 2% of the overall results.
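As a rough illustration of how such a top-10 comparison can be tabulated, the sketch below counts domain occurrences for one search engine and then tallies the classes of its most frequent domains. The function names and the domain-to-class mapping are hypothetical; the paper's classification was done manually.

```python
from collections import Counter


def top_domains(domains, n=10):
    """Return the n most frequent root domains with their occurrence counts."""
    return Counter(domains).most_common(n)


def class_counts(top, domain_class):
    """Count how many of the top domains fall into each class.

    top: list of (domain, count) pairs as returned by top_domains.
    domain_class: dict mapping domain -> class label (assumed to be
    assigned by hand, as in the paper).
    """
    return Counter(domain_class.get(domain, "Unclassified") for domain, _ in top)
```

Running both functions per search engine gives the per-engine class tallies quoted in the text, for example the number of News services in each top 10.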



In contrast, in the US results, Wikipedia has the highest number of occurrences in Google, with 1,892 compared to Bing's 1,388, Metager's 1,304, and DuckDuckGo's 1,287 (see Table 5). What remains the same is that Wikipedia is the most popular domain across all search engines.


Another difference in the US results is that Social Media sources are much more prevalent across all search engines, especially in Google, with Instagram as the second most frequent domain, Facebook as the third, and Twitter as the sixth. This is surprising considering that Social Media domains make up only 6% of the top domains for the US results (see Table 1). Another finding is that YouTube, a subsidiary of Google, ranks eighth among the top domains for all three alternatives but not for Google. Again, 30% of the top domains for Google, DuckDuckGo, and Metager are News services. For Bing, it is 40%. For each search engine, Sports websites make up 20% of domains.


Exclusivity of domains

This section compares the top 50 domains for each search engine to determine which sources Google and its alternatives might exclusively offer to the user. Table 6 shows the domains found solely in Google's top 50 domains list for the German queries. Out of these 20 domains, eight are Sports websites, six are News services, and four belong to the Movies & Entertainment class, with one each for E-Commerce and Government websites.


Table 6. Domains only found in Google's top 50 (Germany)


Conversely, Table 7 shows the list of domains found in the top 50 domains of all three alternatives but not in Google. These would be the domains missed by users who only use Google. Out of the 13 domains, four are News services, followed by three Movies & Entertainment and two Celebrities sources. Interestingly, Instagram and YouTube are missing from Google as well.



Applying the same evaluation to the US results, we find that 17 domains are exclusively found in Google's top 50 domains, with five domains classified as Sports, four as Movies & Entertainment, another four as News services, three as Social Media, and one as Celebrities (see Table 8). Notably, social media giants Facebook and TikTok are only found in Google and not in the alternatives.



Conversely, 13 domains found in the top 50 of the alternatives are not found in Google's top 50 for the US results (see Table 9). Six of them are News services, three are websites about Celebrities, two are about Movies & Entertainment, and one each are Sports and E-Commerce websites. Interestingly, Amazon is missing in Google's results, as well as msn.com, a service from Microsoft, like Bing itself.
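Both exclusivity comparisons in this section reduce to set operations on the top-50 lists. The following is a minimal sketch under that reading; the function names are hypothetical, and the placeholder domains in the usage note are toy values, not the study's data.

```python
def exclusive_to(target, others):
    """Domains in the target engine's top list that appear in none of the others."""
    return set(target) - set().union(*others)


def missing_from(target, others):
    """Domains shared by every other engine's top list but absent from the target.

    others must contain at least one list.
    """
    common = set.intersection(*(set(other) for other in others))
    return common - set(target)
```

For example, with google = ["a.example", "b.example"] and three alternative lists that each contain "b.example" and "c.example", exclusive_to(google, alternatives) gives {"a.example"} and missing_from(google, alternatives) gives {"c.example"}. The first call corresponds to lists like Tables 6 and 8, the second to lists like Tables 7 and 9.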


Distribution of domains

Next, we look at the Gini coefficient to measure the concentration and statistical dispersion of root domains in search engines. The distributions across all search engines in both German and US results show very similar values. The results range from 0.77 to 0.79 and 0.73 to 0.76 for German and US results, respectively (see Figure 3). Overall, this means that there is an imbalance in the distribution of results from root domains and that some top sources dominate the search results for both countries.
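For readers unfamiliar with the measure, here is a minimal sketch of the Gini coefficient computed over the occurrence counts of root domains. It uses the standard sorted-values formula; whether the authors used this exact variant (for instance, with or without a small-sample correction) is not stated in this section, so treat it as an assumption.

```python
def gini(frequencies):
    """Gini coefficient of non-negative counts: 0 means perfectly even,
    values near 1 mean a few items hold almost all occurrences.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    with x sorted ascending and i starting at 1.
    """
    xs = sorted(frequencies)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero frequency")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

With this definition, four domains that each appear five times give 0, while four domains where one holds all ten occurrences give 0.75, so the reported values of 0.73 to 0.79 indicate a strongly concentrated distribution.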


Similarity of domains

Comparing the similarities of the top 10 German results, we find that Bing and Metager are most similar in terms of their top results, with 70%, followed by 64% overlap between DuckDuckGo and Metager and 63% between Bing and DuckDuckGo. However, when comparing the alternative search engines to Google, we find that the highest overlap is between Google and Bing with 28% and slightly lower ones with DuckDuckGo and Metager at 27%.


Interestingly, the top 1 result overlap is higher between Google and Bing (31%) than the overlap between Bing and DuckDuckGo (15%).


The results show a mean overlap of 30% of top 1, 40% of top 3, and 47% of top 10 results (see Table 10). This indicates that when looking at the entire first search results page, nearly half of the domains are the same in all search engines considered. For search engine users, this means that consulting an alternative search engine in addition to Google would show them search results that they would otherwise miss.
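The pairwise overlap figures can be approximated with a sketch like the one below. The exact overlap measure is defined in the paper's Methods section, which is not reproduced here; this example swaps in the Jaccard index of the top-k root domains per query, averaged over queries, and the function name and input layout are assumptions.

```python
def top_k_overlap(results_a, results_b, k):
    """Mean Jaccard overlap of the top-k root domains of two search engines.

    results_a, results_b: dict mapping query -> ranked list of root domains.
    Only queries present in both dicts are compared (assumption).
    """
    shares = []
    for query in results_a.keys() & results_b.keys():
        a = set(results_a[query][:k])
        b = set(results_b[query][:k])
        union = a | b
        if union:  # skip queries where both engines returned nothing
            shares.append(len(a & b) / len(union))
    if not shares:
        raise ValueError("no comparable queries")
    return sum(shares) / len(shares)
```

Evaluating the function with k set to 1, 3, and 10 for every pair of search engines, then averaging across pairs, gives summary figures comparable in spirit to the mean overlaps reported here, though the absolute values depend on the overlap measure chosen.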



The same evaluation for the US yields similar results. Again, the overlap between Bing and Metager is the highest, but the margin between this pair and the other alternative pairings has been reduced. When looking at the top 10 results, Bing and Metager overlap by 65%, Bing and DuckDuckGo by 64%, and DuckDuckGo and Metager by 62%. While Google's overlap with the other search engines is lower in the US results, again, the pair of Google and Bing is slightly higher (25%) than the 24% overlap between Google and the other two alternatives (see Table 11).


Narrowing the results down to the top 3, the overlap between Bing and DuckDuckGo increases minimally to 65% and remains the same for DuckDuckGo and Metager at 62%. Here, the overlap between Google and Bing is lower than in the German results, with only 26%. Furthermore, the top 1 result overlap between Google and Bing is at 29%, which is also lower than in the German results. Generally, compared to the German results, there is a higher overlap in the top 1 position for the pairs Google and DuckDuckGo (20%), Bing and DuckDuckGo (50%), and DuckDuckGo and Metager (47%). The averaged results make this trend clearer because the overlaps for the top 1, top 3, and top 10 results are similar, with 41%, 46%, and 44%, respectively.



This paper is available on arXiv under a CC 4.0 license.