The practice of automated public web data collection, also known as web scraping, has become increasingly prevalent across industries. Despite bringing multiple benefits to data-driven businesses, the practice still draws skepticism from those who see it as 'unethical' and serving only business interests. However, web scraping can also deliver a significant positive impact for non-profit causes.
Journalists, watchdog organisations, and NGOs have harnessed the power of web scraping to expose corruption, shed light on societal issues, and foster transparency. Timely and reliable data intelligence can greatly improve society. Therefore, promoting it while maintaining ethical and lawful rules of conduct should be of interest to everyone.
Web intelligence, when used ethically and responsibly, has the potential to bring positive change and benefits to the public. It enhances access to information, promotes transparency and accountability, facilitates market research, and fosters innovation and development. However, without established guidelines and regulations, there is a potential for misuse.
Collaboration between stakeholders is vital to establish ethical guidelines and prevent the misuse of web scraping. Governments, NGOs, and technology companies should work together to agree on industry standards and regulations that strike a balance between the benefits of web scraping and the protection of individual rights.
The "Ethical Web Data Collection Initiative'' (EWDCI) aims at fostering such cooperation within the data aggregation industry. Its objective is to construct a framework that facilitates an open and inclusive process for formulating principles governing legal and ethical web scraping practices. By uniting efforts, different industry stakeholders can promote the ethical and responsible use of web scraping, benefiting both businesses and society.
Education and awareness also play a crucial role in promoting ethical web scraping practices. Training programs and workshops can be organised to educate individuals and organisations about the legal and ethical considerations associated with data extraction. By fostering a culture of responsible data handling, we can ensure that web scraping continues to be a force for good.
Furthermore, web intelligence and proxy providers are responsible for promoting ethical web scraping practices among their clients. This can be achieved by implementing robust usage policies and KYC practices, offering tools that enable responsible data extraction, and actively monitoring and addressing any potential misuse of their services.
Advocating for social justice and accountability. Web scraping has played a vital role in advocating for social justice and holding individuals and organisations accountable. By analysing publicly available data, journalists and researchers have revealed disparities in public services, discriminatory practices, and financial irregularities. Notably, web scraping has exposed wrongful property tax charges leading to home foreclosures and has shed light on the prevalence of cookie-cutter laws promoted by special interest groups. These revelations have prompted corrective actions and legal interventions to rectify systemic flaws.
Combating illegal activities. Web scraping has assisted law enforcement agencies in identifying and combating illicit activities. By analysing message boards and online marketplaces, scrapers have exposed underground markets for human trafficking, illegal firearm sales, and other criminal enterprises. These efforts have led to successful prosecutions, dismantled criminal networks, and saved lives.
Monitoring online hate speech. Web scraping can be employed to analyse and track online hate speech, providing insights into patterns, trends, and the spread of harmful ideologies. By monitoring online platforms and forums, researchers and organisations can identify emerging threats, devise strategies for countering hate speech, and promote online safety.
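As an illustration of what such monitoring can look like in practice, here is a minimal sketch that tallies how often researcher-supplied terms appear in posts scraped from a public forum, day by day. The URL, CSS selectors, and term list are placeholder assumptions, not references to any real platform or dataset.

```python
# Minimal sketch: tracking flagged-term frequency in scraped forum posts over time.
# The URL, CSS selectors, and term list below are placeholders for illustration only.
from collections import Counter

import requests
from bs4 import BeautifulSoup

FORUM_URL = "https://forum.example.com/board/123"   # hypothetical public board
FLAGGED_TERMS = {"term_a", "term_b"}                 # supplied by researchers

def fetch_posts(url: str) -> list[dict]:
    """Download a board page and return post dates and texts."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select("div.post"):             # placeholder selector
        posts.append({
            "date": node.select_one("time")["datetime"][:10],
            "text": node.select_one("div.body").get_text(" ", strip=True).lower(),
        })
    return posts

def daily_term_counts(posts: list[dict]) -> Counter:
    """Count how often flagged terms appear per day."""
    counts = Counter()
    for post in posts:
        hits = sum(post["text"].count(term) for term in FLAGGED_TERMS)
        counts[post["date"]] += hits
    return counts

if __name__ == "__main__":
    print(daily_term_counts(fetch_posts(FORUM_URL)))
```

Real monitoring projects would add rate limiting, far more nuanced classification than keyword counts, and careful handling of platform terms and researcher ethics approvals.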
Cleaning the internet of illegal content. The “4β” project aims to empower global research by offering researchers, universities, and NGOs the expertise, infrastructure, and resources needed to collect public web data. One notable endeavour supported by 4β is an automated detection system for illegal online content, commissioned by the Communications Regulatory Authority of the Republic of Lithuania (RRT).
Using specialised data acquisition and AI tools, this prototype identifies prohibited visual content, mainly related to child sexual abuse, allowing RRT to proactively manage and share this information with authorities both locally and abroad.
Monitoring air pollution through online searches. Web search analytics can be utilised to observe and track air pollution trends, offering a deeper understanding of environmental conditions based on public interest. By analysing aggregated search data from various platforms and integrating it with meteorological information, researchers can now "nowcast" pollution levels more effectively. This innovative approach paves the way for a more accessible means of monitoring air quality, shaping strategies for environmental protection, and forecasting potential pollution challenges.
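For a flavour of how such "nowcasting" can be approached, the sketch below fits a simple linear model relating a daily search-interest index and basic weather readings to measured PM2.5, then estimates today's level before official figures arrive. The file names, column names, and the choice of a linear model are assumptions made for illustration; they are not the method used in the research referenced above.

```python
# Minimal sketch: "nowcasting" a pollution measure from search-interest and weather data.
# File paths, column names, and the linear model are illustrative assumptions only.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical daily data: a search-interest index for pollution-related queries,
# meteorological readings, and ground-truth PM2.5 measurements for training.
searches = pd.read_csv("search_interest.csv", parse_dates=["date"])  # date, search_index
weather = pd.read_csv("weather.csv", parse_dates=["date"])           # date, temp_c, wind_ms, humidity
pm25 = pd.read_csv("pm25.csv", parse_dates=["date"])                 # date, pm25

df = searches.merge(weather, on="date").merge(pm25, on="date")
features = ["search_index", "temp_c", "wind_ms", "humidity"]

model = LinearRegression().fit(df[features], df["pm25"])

# Estimate today's PM2.5 from today's search interest and weather,
# before official monitoring data is published.
today = pd.DataFrame([{"search_index": 74, "temp_c": 21.0, "wind_ms": 2.3, "humidity": 0.62}])
print(f"Estimated PM2.5: {model.predict(today[features])[0]:.1f} µg/m³")
```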
As sustainability reporting is soon to be mandatory for many companies, some businesses have decided to take a proactive approach and begin reporting before it is compulsory. Oxylabs has already released two annual impact reports, contributing to the growing trend of companies keeping themselves accountable through such disclosures.
Parts of the sustainability reporting process can also draw on information gathered via web scraping. The method can collect essential sustainability data, such as industry-wide greenhouse gas emissions, waste production, and water usage, from publicly available sources. Such data, in turn, provides a valuable resource for improving the completeness and accuracy of sustainability reports.
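As a rough illustration, the sketch below pulls headline figures (emissions, water, waste) out of the text of a public sustainability page using regular expressions. The URL and patterns are placeholders; real reports vary so much in structure that a production pipeline would need per-source parsing rules.

```python
# Minimal sketch: extracting headline sustainability figures from a public web page.
# The URL and regular expressions are placeholders for illustration only.
import re

import requests
from bs4 import BeautifulSoup

REPORT_URL = "https://www.example.com/sustainability-report"   # hypothetical source

html = requests.get(REPORT_URL, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

patterns = {
    "ghg_emissions_tco2e": r"([\d,.]+)\s*(?:tonnes|t)\s*CO2e",
    "water_usage_m3": r"([\d,.]+)\s*(?:m3|cubic metres)\s*of water",
    "waste_tonnes": r"([\d,.]+)\s*tonnes of waste",
}

figures = {}
for name, pattern in patterns.items():
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match:
        figures[name] = float(match.group(1).replace(",", ""))

print(figures)
```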
Web scraping, when applied responsibly and ethically, can empower businesses to better align with the rising expectations of all stakeholders, from customers to employees to investors, as the collective consciousness about sustainability continues to grow.
Considering all the examples mentioned in this article, it is fair to say that web intelligence can bring immense value to society, from uncovering corruption to safeguarding public health and promoting fair competition. At the same time, it is crucial to strike a balance between protecting against data misuse and preserving the ability to leverage web scraping for good.
By embracing responsible web scraping practices, safeguarding user privacy, and respecting terms of service, companies and researchers can unlock the full potential of this powerful tool while upholding both legal and ethical standards.