Information on public web pages is accessible to anyone, at no cost, from anywhere in the world with an internet connection. Some people pay others to deliver this information in a clean, convenient format. This is called web scraping, and it is becoming a commodity.

Being a commodity means that if we ask 100 web scrapers to collect the price shown on this product page on the exact same day (June 19th, 2023), we obtain the same answer 100 times: 3,350 USD. There is no difference among any of those 100 scrapers.

This information has been publicly disclosed by the website louisvuitton.com, which made it accessible to anyone. For those who paid for the scraping, the value of this information does not change based on who collected it. Even if louisvuitton.com itself were to offer a file with this data, that file would not embed any additional value (that is, it could not command a higher price) compared with independent web scrapes at like-for-like conditions.

According to Wikipedia, "[…] a commodity is an economic good, usually a resource, that has full or substantial fungibility: that is, the market treats instances of the good as equivalent or nearly so with no regard to who produced them […]".

So yes, web data has the features of a commodity.

The Need For A Market

Ok, it’s a commodity. Why does this matter? It matters because this technology is:

Growing (thanks to Pierluigi Vinciguerra for the heads-up on the influence of AI as an accelerator for web data).

Still in its Wild-West phase: cheap technology and widespread internet access have made web scraping available on a global scale, and almost everyone is doing it, planning to do it, or failing while trying to do it.

So, while current web scraping solutions focus on the technical side, the creation of a common marketplace is only in its early days.
Let’s see why we are encouraging this change in the industry and what the benefits of a data-as-a-commodity market are:

Regulation

The enormous fragmentation of web data (everyone can access it, and the tools for collecting it are available to anyone, like The Web Scraping Club) is causing great heterogeneity in formats, quality, types of data, and much more.

An independent entity that serves as a regulator will benefit the buyer by setting quality standards on data, collection processes, and monetary transactions. To unlock the global adoption of web data, the industry needs to instill trust among all actors.

Without a marketplace, all due diligence and all quality controls will have to be conducted at every purchase, as happens today.

Liquidity

Rules and regulations only serve one purpose: attracting a larger audience and creating a bigger market. In other words, liquidity:

Liquidity for the buyers means they participate in the market because they find what they need at the price they are willing to pay. Without a marketplace, every buyer will have to negotiate the acquisition price individually with providers (often internal ones).

Liquidity for the sellers means they can find multiple buyers for the same dataset they are collecting. Without a marketplace, extraction costs could not be spread among multiple customers, and there could be no productization.

Liquidity (which can be enabled only by regulation) is the prime and most powerful factor in ensuring buyers and sellers are active at the Fair Market Value for whatever goods or services they are trading, and web data is no exception.

Derivative Products

A data pipeline can be complex. While our first example of the Louis Vuitton bag was simple, things get ugly pretty fast: cross-website terminology, language differences, and changes over time, just to name a few, make data usage very painful. Raw web data is nice, but more needs to be done to make it useful.
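To give a feel for the kind of work hiding behind "cross-website terminology" and "language differences", here is a minimal Python sketch that maps two raw records onto one common schema. Everything in it is an assumption made for illustration: the site names, the field names, the sample values, and the `Product` schema are hypothetical and do not describe any real website or marketplace format.

```python
from dataclasses import dataclass

# Hypothetical raw records, as two different scrapers might deliver them.
# Field names and values are invented for illustration only.
RAW_SITE_A = {"product_name": "Leather Bag", "price": "1,250.00", "currency": "USD"}
RAW_SITE_B = {"titolo": "Borsa in pelle", "prezzo": "1.100,00", "valuta": "EUR"}

# Per-source mapping from site-specific field names to one common schema.
FIELD_MAPS = {
    "site_a": {"product_name": "title", "price": "price", "currency": "currency"},
    "site_b": {"titolo": "title", "prezzo": "price", "valuta": "currency"},
}

# Which character each (hypothetical) source uses as the decimal separator.
DECIMAL_SEPARATOR = {"site_a": ".", "site_b": ","}


@dataclass(frozen=True)
class Product:
    source: str
    title: str
    price: float
    currency: str


def parse_price(text: str, decimal_sep: str) -> float:
    """Convert a localized price string such as '1.100,00' to a float."""
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


def normalize(source: str, raw: dict) -> Product:
    """Rename site-specific fields and parse the price into the common schema."""
    mapping = FIELD_MAPS[source]
    common = {mapping[key]: value for key, value in raw.items() if key in mapping}
    return Product(
        source=source,
        title=common["title"],
        price=parse_price(common["price"], DECIMAL_SEPARATOR[source]),
        currency=common["currency"],
    )


if __name__ == "__main__":
    for source, raw in (("site_a", RAW_SITE_A), ("site_b", RAW_SITE_B)):
        print(normalize(source, raw))
```

Even this toy version needs one mapping and one number-format rule per source, which hints at why every buyer redoing this work independently is wasteful.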
One great advantage of having an independent marketplace for data is that you can have sellers offering raw data and other sellers deriving products based on the original raw data.

Without a marketplace, the transformations performed on raw data will have to be commissioned and negotiated case by case. A market for derived work enables competition and raises the quality of derived work.

Let’s give an example. We have raw data from product reviews from an electronics e-commerce store. The data would contain each review with the comment, the user, the time, and the final rating. A derived work would be a sentiment scoring for each review, and it would be provided by an independent seller, in line with the regulations of the marketplace, so that the buyer of this derivative work has a better knowledge of how the information was treated (is it in line with the European GDPR or the CCPA?).

Web Data: Buying Instead Of Scraping It

The core implication of having a marketplace is that businesses can buy web data instead of scraping it (or paying for a scraping project), and scrapers can scale their operations and efficiency.

The advent of AI has pushed the need for fast, large-scale, high-quality data, and internal production cannot keep pace with this scale. Wild-West mode is no longer fit for current data needs.

A well-functioning marketplace with transparent, independent regulations and quality standards will enable faster, safer exchanges of web data, promote ethical data collection, and create a new ecosystem for data treatment, which exists only within large corporations today.

We are building this at Data Boutique: an independent place where web data is handled and traded like a commodity.
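To make the product-review example from the Derivative Products section a bit more concrete, here is a minimal Python sketch of a derived dataset built on top of raw reviews. It is only an illustration under stated assumptions: the `Review` and `ScoredReview` types, the tiny word lists, and the scoring rule are hypothetical, and a real seller would likely use a proper sentiment model. Dropping the user field in the derived output is also just one possible design choice, not a statement about what GDPR or the CCPA requires.

```python
from dataclasses import dataclass

# Tiny illustrative word lists (an assumption, not a real sentiment lexicon).
POSITIVE = {"great", "excellent", "love", "good", "fast"}
NEGATIVE = {"bad", "broken", "slow", "poor", "terrible"}


@dataclass(frozen=True)
class Review:
    """Raw record: the fields the article lists for each review."""
    review_id: str
    user: str
    timestamp: str
    comment: str
    rating: int


@dataclass(frozen=True)
class ScoredReview:
    """Derived record: no user field, plus a sentiment score."""
    review_id: str
    rating: int
    sentiment: float  # from -1.0 (negative) to 1.0 (positive)


def sentiment_score(comment: str) -> float:
    """Share of positive minus negative words among all matched words."""
    words = [w.strip(".,!?").lower() for w in comment.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def derive(reviews: list[Review]) -> list[ScoredReview]:
    """Build the derivative dataset from the raw one."""
    return [
        ScoredReview(r.review_id, r.rating, sentiment_score(r.comment))
        for r in reviews
    ]


if __name__ == "__main__":
    raw = [
        Review("r1", "alice", "2023-06-19T10:00:00", "Great screen, fast delivery!", 5),
        Review("r2", "bob", "2023-06-19T11:30:00", "Arrived broken and slow.", 1),
    ]
    for scored in derive(raw):
        print(scored)
```

The raw seller and the derived-work seller can be different parties here: the second only needs the raw dataset as input and publishes `ScoredReview` records as a separate product.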

This is an opinion piece based on the author’s POV and does not necessarily reflect the views of HackerNoon.

Web Data Has Become a Commodity and Needs a Marketplace to Grow
