
The Ouroboros Effect of Data Aggregation and Scraping

by Julius Černiauskas, October 27th, 2023

Too Long; Didn't Read

Scraping allows individuals and companies to create large databases from publicly available data quite easily. However, when such databases are constantly fed with public data, they may drift towards data convergence, an effect that can negatively influence decision-making for anyone who trusts such sources completely.


Hailed as one of the most important innovations in data acquisition, scraping allows individuals and companies to create large databases from publicly available data rather easily. Numerous useful applications, such as recommendation engines, have been built on top of these processes.

There’s an effect that’s not often talked about, however. Data scraping and aggregation feed off of each other, especially if the database is then used to produce publicly available information. As such, it’s a lot like the Greek Ouroboros, a snake continually eating its own tail.

Typical scraped data

Setting aside the legal and regulatory requirements and their intricacies, most scraped data will be publicly available and will not, in any form, be personally identifiable information. These restrictions have a few interesting implications.


First, all of the data will be much more dynamic. Most personal data will be static (such as names) and unlikely to be influenced by external factors. There can be minor variations in personal data, such as when someone changes their legally designated name. However, these occurrences will be relatively infrequent.


Scraping also requires the data to be publicly accessible, which heavily restricts the number of sources (compared to the vastness of the internet). Anything that’s behind a login (or otherwise publicly inaccessible) will be seen by a smaller number of people, reducing the likelihood of it influencing decisions for the majority of viewers.


What remains is a subset of data with two key properties. One is that it is, in general, more subject to change (pricing, opinions, popularity, and so on). The other is that, as accessibility increases, so does the potential to influence decisions on a major scale.


Outside of the current case, these properties are usually of minor importance. But when we consider publicly available databases created through scraping, they start to play a much greater role.

Recommendations and suggestions

A significant share of databases remains tightly under wraps within the businesses and individuals that create them for personal or commercial use. On the other hand, there are numerous public aggregators that display information for anyone to see.


One of the largest data aggregators in the world is Google. Some of the methodology used to rank websites has been lifted from academic circles. PageRank, one of the earliest iterations of its ranking algorithm, was based on academic citations: links to a website are treated as a measure of authority and accuracy.
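
To make the citation analogy concrete, here is a minimal PageRank-style sketch in Python. It is not Google’s production ranking (which uses far more signals); it only shows the core idea of power iteration over a toy, made-up link graph, where inbound links act like citations.

```python
# Minimal PageRank-style sketch over a hypothetical link graph.
# Each page distributes its score to the pages it links to, so a link
# behaves like an academic citation: a vote of authority.

DAMPING = 0.85
ITERATIONS = 50

# Toy link graph: page -> pages it links to (purely illustrative).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(ITERATIONS):
    new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += DAMPING * share
    rank = new_rank

# Pages with more (and better-ranked) inbound links end up with higher scores;
# "d", which nothing links to, stays at the bottom.
print(sorted(rank.items(), key=lambda kv: kv[1], reverse=True))
```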


Over time, Google became much more sophisticated in how it aggregates and ranks search engine results pages. One thing remains constant: it’s all a competitive game to get to the top of search, with the top three results getting about 75% of all clicks. These results essentially act as recommendations for the most authoritative sources.


These effects have been studied in scholarly research. James Evans, for example, showed in his paper “Electronic Publication and the Narrowing of Science and Scholarship” that as scientific journals moved to digital formats, the way scholars engage with the literature changed: they favor hyperlinks, cite more recent articles, and move towards greater consensus.


Google Search likely exhibits a similar effect. While the company does its best to serve diverse results, the top results will always have the greatest impact on shaping opinions and suggesting courses of action.


What most search, suggestion, and recommendation engines inevitably do is make decisions easier and faster. Whether it’s finding information or picking a product, instead of having to browse through dozens of pages, we can quickly find what we’re looking for. Unfortunately, that also has a side effect: decision-making convergence.

Dangers of decision-making convergence

Statistical aggregators that provide actionable insights by collecting publicly available data have a potentially major flaw. If they influence decision-making, and each decision made is then added back into the database, the recommendations will begin to converge towards a narrow cluster of choices.


A simple, albeit unusual, example comes from deck building in the popular card game “Magic: The Gathering.” Players construct their own card decks, and some choices are better than others for one reason or another.


EDHRec is a recommendation engine for one of the game’s formats. It works by scraping (or otherwise collecting) publicly available deck lists and then reporting the percentage of decks in which each card is used. As far as I know, there’s no machine learning model or other underlying mechanics involved: it’s simply a statistical database.
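
EDHRec’s internals aren’t public, so the sketch below is only a hypothetical illustration of what a purely statistical table of this kind could look like: count how many scraped decks include each card and report the share. The deck lists are made up for the example.

```python
# Hypothetical sketch of a purely statistical recommendation table:
# given scraped deck lists, report what percentage of decks include each card.
# No model, just counting -- the kind of aggregation described above.

from collections import Counter

def inclusion_rates(decks: list[set[str]]) -> dict[str, float]:
    """Return card -> share of decks that include it, most common first."""
    counts = Counter(card for deck in decks for card in deck)
    return {card: n / len(decks) for card, n in counts.most_common()}

# Made-up sample of scraped decks.
scraped_decks = [
    {"Sol Ring", "Arcane Signet", "Counterspell"},
    {"Sol Ring", "Arcane Signet", "Swords to Plowshares"},
    {"Sol Ring", "Cultivate"},
]

for card, rate in inclusion_rates(scraped_decks).items():
    print(f"{card}: {rate:.0%} of decks")
```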


Every time a player uses the website for recommendations and then publishes their own version of a deck, however, that deck is scraped and put back into EDHRec. Over time, even if certain choices are suboptimal, the recommendations begin to converge toward specific cards.


Another factor is that the tool’s increasing popularity has brought in more and more players of various skill levels. By simple statistics, most people playing the game cannot be in the top 1% of players, so most recommendations are extracted from relatively unskilled players. As such, the recommendation engine becomes slanted towards the general population rather than professional players.


There’s no self-cleanup process, either. The split between players using EDHRec recommendations and players building decks without it is heavily slanted towards the former. So even if someone goes against the grain, a larger number of people keep pushing the convergence forward.
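
The feedback loop is easy to simulate. The toy sketch below assumes a fixed card pool, a made-up share of players who simply copy the aggregator’s current top picks, and a database that absorbs every new deck; none of the numbers reflect real EDHRec data, but the concentration of the top picks grows generation after generation all the same.

```python
# Toy simulation of the scrape-recommend-republish-scrape loop described above.
# All parameters (pool size, follow rate, deck counts) are invented.

import random
from collections import Counter

random.seed(0)

CARD_POOL = [f"card_{i}" for i in range(50)]
DECK_SIZE = 10
FOLLOW_RATE = 0.8        # assumed share of players who copy the top picks
database = Counter()     # the public, continually updated database

# Seed the database with independently built decks.
for _ in range(200):
    database.update(random.sample(CARD_POOL, DECK_SIZE))

for generation in range(20):
    top_picks = [card for card, _ in database.most_common(DECK_SIZE)]
    for _ in range(200):
        if random.random() < FOLLOW_RATE:
            deck = top_picks                            # follow the recommendation
        else:
            deck = random.sample(CARD_POOL, DECK_SIZE)  # build independently
        database.update(deck)                           # deck is scraped back in

    # Share of all recorded card slots taken by the current top picks.
    concentration = sum(database[c] for c in top_picks) / sum(database.values())
    print(f"generation {generation}: top-{DECK_SIZE} share = {concentration:.0%}")
```

Even with a fifth of the players ignoring the recommendations entirely, the top picks’ share of the database keeps climbing, which is the convergence the article describes.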


While this example is confined to a niche hobby community, such effects may become more pronounced over time, as we have seen with the changes in scientific journal publishing. Additionally, publicly available databases of the most popular products or services can have a similar effect on decision-making.


Similarly, financial services may be affected by data on the most traded instruments. As my colleague once wrote, I have an inkling that the momentum effect (i.e., the tendency of well-performing stocks to continue performing well) may be partly influenced by something akin to the examples above.


Finally, there’s a lot more to be discussed about the issues with such engines. The best way to describe it, I think, is to treat decision-making as a deductive process: we have our reasons, and from them we arrive at a decision. Recommendation engines let us skip that deductive process and make decisions based on induction instead.


Yet the popularity of a certain decision says nothing about whether it’s the correct one to make. We could only assume so if all of the data were derived from sources where sound reasoning was applied. That, however, is impossible to guarantee when aggregating data from publicly available web sources.

Conclusion

There are three key factors that determine how pronounced the Ouroboros effect becomes:


  1. Data has to be collected from publicly available sources.
  2. The aggregated database has to be made publicly available and, in one way or another, influence decision-making.
  3. The database has to be continually updated with new occurrences of the same phenomena.


Once these factors are in place, such databases will slowly begin to converge towards a small pool of decisions. The more popular the database, the stronger the convergence.


As such, the danger lies in trusting these publicly available databases blindly. They can be tremendously helpful in making decisions easier, but they shouldn’t take over decision-making completely.