Suppose you are searching for information on a website. Say a Twitter user writes about cryptocurrency. What do you do? You can copy and paste the tweets about cryptocurrency into your own file. But what if you want to retrieve massive volumes of information from Twitter, such as the data for a data science project? In that case, copying and pasting won’t work, and you will need web scraping.

What is Web Scraping?

The term “web scraping” refers to an automated process that collects large volumes of data from websites. Most of this data is unstructured and stored in HTML format. Before it can be used in applications, it must be converted into structured data stored in a spreadsheet or a database.

For many businesses, web scraping is a quick and inexpensive way to gather data that can then be analyzed for purposes such as news monitoring, sentiment analysis, and email marketing.

Web scraping can be carried out in several ways:

- Application Programming Interfaces (APIs), e.g. from Twitter, StackOverflow, and Google.
- Writing your own code (e.g. in the Python programming language).
- Online services from different providers, e.g. Octoparse.

In this article, you will learn how to:

- Execute web scraping on Twitter using the Python library snscrape.
- Store scraped data automatically in a database using HarperDB.
- Share your data via an API call by using a Custom Function from HarperDB.

So let’s get started.

What is Snscrape?

snscrape is a scraping tool for social networking services (SNS). It scrapes information like user profiles, hashtags, searches, and threads and returns the discovered items, e.g. the relevant posts.
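To make "returns the discovered items" a little more concrete: the scraping code later in this article reads items from a scraper through its get_items() method, one at a time. The sketch below imitates that pattern with a stand-in scraper so it runs without a network connection. Post, FakeScraper, and take_items are hypothetical names invented for this illustration and are not part of snscrape.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, List


@dataclass
class Post:
    """Hypothetical stand-in for an item that a scraper discovers."""
    username: str
    content: str


class FakeScraper:
    """Hypothetical scraper that mimics a get_items() generator interface."""

    def __init__(self, posts: Iterable[Post]) -> None:
        self._posts = list(posts)

    def get_items(self) -> Iterator[Post]:
        # Items are yielded one at a time, so callers can stop early.
        yield from self._posts


def take_items(scraper, limit: int) -> List[Post]:
    """Return at most `limit` items from any object that has get_items()."""
    return list(islice(scraper.get_items(), limit))


scraper = FakeScraper(Post(f"user{i}", f"post number {i}") for i in range(5))
first_three = take_items(scraper, 3)
for post in first_three:
    print(post.username, "->", post.content)
```

Because the items arrive lazily, stopping after a fixed number of results does not require fetching everything first, which is the same idea the scraping loop in Step 8 relies on.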
snscrape was released on July 8, 2020, and it can scrape data from a variety of platforms, including:

- Twitter
- Instagram
- Reddit
- Facebook
- Weibo
- Telegram
- Mastodon

You can use snscrape by typing its command-line interface (CLI) commands into the command prompt or terminal. If you don’t feel comfortable using a terminal, you can use snscrape as a Python library, but this is not yet documented.

Note: On Twitter, it can scrape users, user profiles, hashtags, searches, tweets (single or surrounding thread), list posts, and trends.

What is HarperDB?

HarperDB is a lightning-fast and versatile platform for managing SQL and NoSQL data. You can use it for a wide variety of purposes, including rapid application development, distributed computing, edge computing, and software as a service (SaaS).

HarperDB does not duplicate data, is fully indexed, and can run on any device, from the edge to the cloud. It can also be used with any programming language, such as JavaScript, Java, and Python.

Some of the features available in HarperDB:

- Allows JSON and CSV file insertions.
- Single endpoint API.
- Supports SQL queries for full CRUD operations.
- Custom Functions (a Lambda-like application development platform with direct access to HarperDB’s core methods).
- Limited database configuration required.
- Math.js and GeoJSON are both supported.

HarperDB has a built-in HTTP API, custom functions for user-defined endpoints, and a dynamic schema that can help you easily share your scraped data with your coworkers after storing it in a HarperDB cloud instance.

HarperDB allows you to quickly download scraped data held in the HarperDB instance as a CSV file so that you can perform extra analysis before making a final choice.

You have now been introduced to the tools (snscrape and HarperDB) that you will use to automate the process of scraping data and saving it in the database.
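As a rough illustration of the "single endpoint API" and SQL support listed above, the sketch below only builds (and does not send) two request descriptions: one insert and one SQL query. It assumes HarperDB's JSON operations format with an "operation" field and HTTP Basic authentication; the instance URL, credentials, and the build_operation helper are placeholders invented for this example, and the schema and table names are the ones created later in this article.

```python
import base64
import json


def build_operation(instance_url: str, username: str, password: str, body: dict) -> dict:
    """Describe an HTTP POST to a HarperDB instance without sending it."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {
        "method": "POST",
        "url": instance_url,  # every operation goes to the same endpoint
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        "data": json.dumps(body),
    }


# Placeholder values, not real credentials or a real instance.
URL = "https://my-instance.example.com"
insert_op = {
    "operation": "insert",
    "schema": "data_scraping",
    "table": "tweets",
    "records": [{"user_name": "example_user", "content": "hello"}],
}
sql_op = {
    "operation": "sql",
    "sql": "SELECT user_name, content FROM data_scraping.tweets",
}

for op in (insert_op, sql_op):
    request = build_operation(URL, "USERNAME", "PASSWORD", op)
    print(request["method"], request["url"], request["data"])
```

Both operations target the same URL and differ only in the JSON body. The Python package used in the steps below wraps calls like these, so you do not have to build them by hand.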
All you have to do now is follow the steps described below.

Step 1: Create a HarperDB Account

We will start with the HarperDB database. Visit https://harperdb.io/ and click the “Start Free” link in the navigation bar to create your account.

If you already have an account, sign in with your credentials at https://studio.harperdb.io/

Step 2: Create a HarperDB Cloud Instance

After registration, you need to create a cloud instance to store and fetch your scraped data from Twitter. Click the Create New HarperDB Cloud Instance link to add a new instance to your account.

Note: You just need to follow the instructions provided by HarperDB to create your cloud instance:

- Select the instance type.
- Choose a cloud provider.
- Add instance information.
- Select the instance specification (RAM size, instance storage size, and instance region).
- Confirm and create the cloud instance.

When the HarperDB Cloud Instance has been created successfully, you will see the status OK for that instance, as shown in the image below.

Step 3: Configure the HarperDB Schema and Table

To add the scraped Twitter data to the database, you must first create a schema and a table. Load the HarperDB cloud instance you created from the dashboard, then create the schema by giving it a name (like “data_scraping”). Next, add a table (e.g. “tweets”). HarperDB will also ask you to specify the hash attribute, which is equivalent to an ID number.

Step 4: Install the Required Packages

You need to install the following packages on your local machine.

(a) harper-sdk-python

This is the Python package we’ll use to call different HarperDB API functions, such as inserting data into the cloud instance. It also provides wrappers for an object-oriented interface.

    pip install harperdb

(b) snscrape

snscrape requires Python 3.8 or higher.
When you install snscrape, its Python dependencies are installed automatically.

    pip install snscrape

Step 5: Import Important Packages

The next step is to import the Python packages needed to scrape data from Twitter and automatically store it in the HarperDB cloud instance.

    # import packages

    # snscrape
    import snscrape.modules.twitter as sntwitter

    # harperdb
    import harperdb

    # ignore any warnings
    import warnings
    warnings.filterwarnings("ignore")

Step 6: Connect to HarperDB Cloud Instance

You need to connect to the HarperDB cloud instance in order to insert scraped tweets into the table called tweets. Here you need to provide three parameters:

- Full URL of the HarperDB instance
- Your username
- Your password

    # connect to harperdb
    URL = "https://1-mlproject.harperdbcloud.com"
    USERNAME = "USERNAME"
    PASSWORD = "PASSWORD"

    db = harperdb.HarperDB(url=URL, username=USERNAME, password=PASSWORD)

    # check if you are connected
    db.describe_all()

When you execute the above code, you will see output similar to that displayed below, indicating a successful connection to your HarperDB Cloud Instance.

    {'data_scraping': {'tweets': {'__createdtime__': 1660390877630,
       '__updatedtime__': 1660390877630,
       'hash_attribute': 'id',
       'id': 'd140645e-3af2-42d7-8594-2195826dabbc',
       'name': 'tweets',
       'residence': None,
       'schema': 'data_scraping',
       'attributes': [{'attribute': '__createdtime__'},
        {'attribute': '__updatedtime__'},
        {'attribute': 'id'}],
       'record_count': 0}}}

Step 7: Create a Function to Record the Scraped Tweets

Using the insert function from the harperdb Python package, the following function will insert the scraped tweets as data (in dictionary format) into the specified table. The insert function receives three parameters:

- SCHEMA name
- TABLE name
- data (scraped tweets)

    # define a function to record scraped data into the table
    def record_tweets(data):
        # define the schema and table
        SCHEMA = "data_scraping"
        TABLE = "tweets"

        # insert data into the table
        result = db.insert(SCHEMA, TABLE, [data])
        return result

Step 8: Scrape Tweets by Using snscrape

Now you can use the TwitterSearchScraper method from the snscrape Python package to scrape tweets matching a particular search query. In this example, I will show you how to scrape 1,000 tweets about “cryptocurrency” from 1st January 2022 to 13th August 2022.

    #1 Using TwitterSearchScraper to scrape tweets that match the query.
    # The query is kept as originally run (note the spelling 'crytocurrency');
    # the sample output later in this article matches it.
    for i, tweet in enumerate(
        sntwitter.TwitterSearchScraper(
            'crytocurrency since:2022-01-01 until:2022-08-13'
        ).get_items()
    ):
        # stop once 1,000 tweets have been stored
        if i >= 1000:
            break

        #2 save data automatically to the HarperDB cloud instance
        data = {
            "user_name": tweet.user.username,
            "content": tweet.content,
            "lang": tweet.lang,
            "url": tweet.url,
            "source": tweet.source,
        }

        # insert result into the HarperDB table
        result = record_tweets(data)

As you can see from the code block above (comment #2), HarperDB will automatically store the scraped data in the tweets table with the following attributes:

- user_name
- content
- lang
- url
- source

Step 9: View the Tweets Table

If you open your HarperDB cloud instance, you will be able to see all records of your scraped data from Twitter.

Congratulations 🎉 You have successfully completed all required steps to automate the process of scraping data and saving it in the database.

What if you wish to share the scraped information with your colleagues?
HarperDB’s Custom Functions provide a straightforward solution to this problem.

What is a Custom Function?

Custom Functions are a feature introduced in HarperDB’s 3.1+ release. You can use them to add your own API endpoints to HarperDB. Custom Functions are powered by Fastify, which is very flexible and makes it simple to interact with your data by using HarperDB core methods.

In this section, you will learn how to use HarperDB Studio to create your own custom function. You can then use an API call to share your scraped data with your coworkers.

Here are the steps you need to follow:

1. Enable Custom Functions

The first step is to enable Custom Functions by clicking “functions” in your HarperDB Studio (it is not enabled by default).

2. Create a Project

The next step is to create a project by specifying a name, for example tweets-api-v1. This also creates the project files, including:

- Routes folder
- File to add helper functions
- Static folder

Note: For this article, you will focus on the routes folder.

3. Define a Route

In this step, you will create the first route to fetch some data from the tweets table in the HarperDB datastore. Route URLs are resolved in the following manner:

    [Instance URL]:[Custom Functions Port]/[Project Name]/[Route URL]

It includes:

- Cloud Instance URL
- Custom Functions Port
- Project name you have created
- The route you have defined

In the route file (example.js) on the functions page, you will see some template code as an example. Replace that code with the following:

    'use strict';

    module.exports = async (server, { hdbCore, logger }) => {
      server.route({
        url: '/',
        method: 'GET',
        handler: (request) => {
          request.body = {
            operation: 'sql',
            sql: 'SELECT user_name,content,lang,url,source FROM data_scraping.tweets ORDER BY __createdtime__'
          };
          return hdbCore.requestWithoutAuthentication(request);
        }
      });
    };

In the code above, the route /tweets-api-v1 is defined with the GET method, and the handler function sends an SQL query to the database to get user_name, content, lang, url, and source from the tweets table, ordered by the __createdtime__ column.

4. Access data via API Endpoint

Finally, you can use the route you have defined to get the data from the tweets table. Here you will send an API request by using the Python requests package.

    # send an API request
    import requests

    # api-endpoint
    URL = "https://functions-1-mlproject.harperdbcloud.com/tweets-api-v1"

    # sending get request and saving the response as response object
    r = requests.get(url=URL)

    # extracting data in json format
    data = r.json()

    for record in data:
        print(record)

Here is the sample output from the above code.

{"user_name": "DailyCryptoTrad","content": "DXY forming a bullish bull flag on the daily - a break out of 106.6 will give crypto red days however if we fail below 105 will give crypto green days - Keep an eye on DXY #DXY #SPY #crypto #btc #eth #bitcoin #crytocurrency #cryptocurrencies https://t.co/AkF8Igf3Uc","lang": "en","url": "https://twitter.com/DailyCryptoTrad/status/1558211511461597188","source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Web App</a>"},
{"user_name": "Ariscrypto1970","content": "@scrypto_1977 @Epayme_uae #Saitama will go parabolic when it happens! This is the #WeAreSaitama and the world are waiting for.
🔥🔥🔥🚀🚀🚀🚀#crytocurrency #DeFi","lang": "en","url": "https://twitter.com/Ariscrypto1970/status/1558200674273345537","source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},
{"user_name": "dan_nyeche","content": "Cryptocurrency market up 24Hrs. #Bitcoin #Dan_Trades #crytocurrency Emilokan Big Brother Modella FireBoy Giddyfia  GTBank President Obama #gayfish Gen Z Ethereum Chi Chi Obidatti2023 Sapa Lewandoski #HAPPYJAEMINDAY #Jalsa4K #GomoraMzanzi #SheggzOlu𓃵 #ViratKohli𓃵 https://t.co/QbU4ei3MGA","lang": "in","url": "https://twitter.com/dan_nyeche/status/1558188248362467329","source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"},

Note: With HarperDB, you can quickly and easily build API endpoints to share the scraped data with your team working on the same data science project.

Conclusion

Congratulations 🎉, you have made it to the end of this article. You have learned how to:

- Execute web scraping on Twitter using the Python library snscrape.
- Store scraped data automatically in the database using a HarperDB cloud instance.
- Create a custom function from the HarperDB cloud instance to share your scraped data with your coworkers working on the project via an API endpoint.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! You can also find me on Twitter @Davis_McDavid.

How to Web Scrape Using Python, Snscrape & HarperDB
