In today’s world, where gaining valuable, data-driven insights is crucial for business growth and improving services, social media platforms like Facebook, YouTube, Instagram, and Twitter/X have become central to our everyday lives. People freely share their thoughts, opinions, and experiences, turning social media into a goldmine of public data. By scraping this data, we can unlock new opportunities to understand market trends and consumer behavior, helping to improve products and services in meaningful ways.
Imagine your company just launched an exciting new product or service, and posted videos on YouTube showcasing its features and benefits. By scraping YouTube comments, you can gather valuable insights into how users are perceiving and reviewing your product—helping you understand what’s working and where there might be room for improvement.
In this article, we’ll explore:
The anti-scraping measures that social media platforms put in place
How I overcame these hurdles to scrape data from YouTube effectively using Bright Data and Python
By the end of this article, you will understand how to scrape insight-rich social media data with Bright Data’s tools.
The traditional approach to scraping data from any platform usually starts with figuring out the platform's HTML structure. Next, you pinpoint where the information you need is located on the page. Then, you write scripts in Python using popular frameworks like Selenium, Beautiful Soup, or Playwright (a quick sketch of this naive approach follows the list below). But it doesn't end there. Every social media platform has specific measures in place to prevent data misuse and scraping. These measures include:
IP Address Blocking: Some platforms block your IP address when you make multiple automated requests from it in quick succession. The website may flag the IP address as harmful if it detects unusual traffic patterns.
Rate Limiting: When the number of requests exceeds a certain threshold, the website rate limits them to prevent abuse of its servers.
Header-Based Request Blocking: Websites can block requests from specific sources based on headers like User-Agent and Referer. If these headers look suspicious or illegitimate, the website may block the request.
CAPTCHAs: Another common way to stop abnormal activity is to ask the user to solve a CAPTCHA before showing the website content. This step verifies that a real human, not an automated bot, is making the request.
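To make the contrast concrete, here is a minimal sketch of the traditional approach (assuming the requests and beautifulsoup4 packages are installed). Because YouTube renders comments with JavaScript, a plain HTTP request like this often comes back without the data you want, and repeating it quickly can trigger the defenses listed above:

import requests
from bs4 import BeautifulSoup

# A naive scrape: one plain HTTP request with default headers.
# Sites can flag the default User-Agent, rate limit the IP,
# or serve a CAPTCHA page instead of the real content.
response = requests.get("https://www.youtube.com/watch?v=v94jRN2FhGo")
soup = BeautifulSoup(response.text, "html.parser")
print(response.status_code)
print(soup.title.string if soup.title else "No title, possibly blocked")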
Bright Data is an efficient all-in-one proxy and AI-powered web scraping tool that simplifies data scraping projects with a headful GUI browser that is fully compatible with Puppeteer/Playwright/Selenium APIs. Bright Data's powerful unlocker infrastructure and premium proxy network allow you to bypass the previously mentioned challenges right out of the box.
Bright Data tackles challenges like website blocks, CAPTCHAs, and browser fingerprinting by using advanced AI to mimic real user behavior and avoid detection. Plus, its Scraping Browser comes packed with features that make web scraping more reliable while saving you time, money, and effort.
1. Sign up for a free trial on the Bright Data website. You can do so by clicking on “Start free trial” or “Start free with Google”. You can proceed to the next step if you have an existing account.
2. From the dashboard, navigate to the “Proxy and Scraping Infrastructure” section, click the “Add” button, and select “Scraping Browser” from the dropdown menu.
3. Enter a name of your choice in the form to create a new Scraping Browser.
4. After creating a new Scraping Browser instance, click on its name, and navigate to “Access Parameters” to access the hostname, username, and password information.
5. You can use these parameters in the following Python script to access the Scraping Browser instance.
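Before building the full scraper, you can verify your credentials with a quick connection check. This is a minimal sketch; the placeholder username, password, and host values are stand-ins for your own Access Parameters:

import asyncio
from playwright.async_api import async_playwright

async def check_connection():
    # Stand-in credentials: replace with your Access Parameters
    auth = '<provide username here>:<provide password here>'
    host = '<provide host name here>'
    browser_url = f'wss://{auth}@{host}'
    async with async_playwright() as playwright:
        # Connect to the remote Scraping Browser over CDP
        browser = await playwright.chromium.connect_over_cdp(browser_url)
        print("Connected to the Scraping Browser")
        await browser.close()

asyncio.run(check_connection())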
Let's say you work for a company and want to know how people perceive your product. You can scrape the comments of a YouTube video that specifically reviewed your product and analyze them to arrive at some metrics.
We will look at a review of the iPhone 16 to gauge people's opinions.
Prerequisites
1. Install the necessary packages in your project folder. You'll use the Playwright Python library and Pandas to get insights from the data, and the NLTK and WordCloud libraries (along with Matplotlib for plotting) to analyze the retrieved comments. The asyncio module used for asynchronous requests ships with Python's standard library, so it needs no separate installation.
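You can install them all with a single pip command (a minimal sketch, assuming Python 3 and pip are available; these are the standard PyPI package names):

pip install playwright pandas nltk wordcloud matplotlib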
2. Import the necessary Python libraries in your Python script and create a get_comments() method to get the comments from the video page.
import asyncio
import csv
import json
import pandas as pd
from playwright.async_api import async_playwright

async def get_comments():
    async with async_playwright() as playwright:
        auth = '<provide username here>:<provide password here>'
        host = '<provide host name here>'
        browser_url = f'wss://{auth}@{host}'
        # Connect to the Scraping Browser over the Chrome DevTools Protocol
        browser = await playwright.chromium.connect_over_cdp(browser_url)
        page = await browser.new_page()
        page.set_default_timeout(3*60*1000)
        # Open the YouTube video page in the browser
        await page.goto('https://www.youtube.com/watch?v=v94jRN2FhGo&ab_channel=MarquesBrownlee')
        # Scroll down so YouTube lazy-loads the comment section
        for i in range(2):
            await page.evaluate("window.scrollBy(0, 500)")
            await page.wait_for_timeout(2000)
        await page.wait_for_selector("ytd-comment-renderer")
        # Parse the HTML tags to get the comments and likes
        data = await page.query_selector_all('ytd-comment-renderer#comment')
        comments = []
        for item in data:
            comment_div = await item.query_selector('yt-formatted-string#content-text')
            comment_likes = await item.query_selector('span#vote-count-middle')
            comment = {
                "Comments": await comment_div.inner_text(),
                "Likes": await comment_likes.inner_text()
            }
            comments.append(comment)
        comment_list = json.loads(json.dumps(comments))
        # Store the comments in a CSV file
        with open("youtube_comments.csv", 'w', newline='') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=comment_list[0].keys())
            writer.writeheader()
            for row in comment_list:
                writer.writerow(row)
        # Convert the CSV to a data frame for further processing
        df = pd.read_csv("youtube_comments.csv")
        await browser.close()
        return df
3. The get_comments() method then works as follows:
It connects to the Scraping Browser instance over the Chrome DevTools Protocol using your access credentials.
It opens the YouTube video page and scrolls down so the comment section loads.
It waits for the comment elements to render, then extracts each comment's text and like count.
It writes the results to youtube_comments.csv, loads that file into a Pandas data frame, closes the browser, and returns the data frame.
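To run the coroutine, call it through asyncio (a minimal sketch, assuming the script is executed directly):

if __name__ == "__main__":
    df = asyncio.run(get_comments())
    print(df.head())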
We’re almost through! You have extracted the data from the YouTube video. Next, let’s dig into the insights the data provides.
First, we need to gauge how many individuals have a positive outlook on the product. For that, you’ll conduct a sentiment analysis of the comments with the aid of the widely used NLTK (Natural Language Toolkit) library.
nltk.download("stopwords", quiet=True)
nltk.download("vader_lexicon", quiet=True)
def transform_comments(df):
#clean the comments
df["Cleaned Comments"] = (
df["Comments"].str.strip().str.lower().str.replace(r"[^\w\s]+", "",regex=True).str.replace("\n", " "))
stop_words = stopwords.words("english")
df["Cleaned Comments"] = df["Cleaned Comments"].apply(
lambda comment: " ".join([word for word in comment.split() if word not in stop_words]))
#analyse the sentiment of each comment and classify
df["Sentiment"] = df["Cleaned Comments"].apply(lambda comment: analyze_sentiment(comment))
#Create a bar graph to understand the sentiments of people
sentiment_counts = df.groupby('Sentiment').size().reset_index(name='Count')
plt.bar(sentiment_counts['Sentiment'], sentiment_counts['Count'],color=['red', 'blue', 'green'])
plt.grid(axis='y', linestyle=' - ', alpha=0.7)
plt.show()
def analyze_sentiment(text):
sentiment_analyzer = SentimentIntensityAnalyzer()
scores = sentiment_analyzer.polarity_scores(text)
sentiment_score = scores["compound"]
if sentiment_score <= -0.5:
sentiment = "Negative"
elif -0.5 < sentiment_score <= 0.5:
sentiment = "Neutral"
else:
sentiment = "Positive"
return sentiment
In the above code, you clean the comments to eliminate whitespace, special characters, and newlines.
Then, you remove common English stopwords, which don’t contribute much to the sentiment analysis.
After that, the sentiment of each comment is calculated and added as a new column in the data frame.
Finally, you create a bar graph that visually classifies the comments as “Positive”, “Negative”, and “Neutral.”
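To produce these results end to end, you can chain the two steps (a minimal sketch, assuming get_comments() and transform_comments() live in the same script):

# Scrape the comments, then clean, classify, and plot them
df = asyncio.run(get_comments())
transform_comments(df)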
According to the sentiment analysis results, many individuals hold a positive view of the product.
Which features of the product do people find most appealing?
The next intriguing piece of information you’re searching for is which aspects of the product people talk about. A helpful way to discover this is to create a word cloud from the comments, where the size of each word represents how frequently it appears in the comments.
from wordcloud import WordCloud

def generate_word_cloud(df):
    # Join all comments into one text blob and build the word cloud
    comments = "\n".join(df["Comments"].tolist())
    return WordCloud().generate(comments)
This code creates a word cloud from the YouTube comments and returns it.
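To actually display it, you can render the returned cloud with Matplotlib (a minimal sketch, reusing the plt import from the sentiment step):

wordcloud = generate_word_cloud(df)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()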
Looking at the word cloud, you can find the features people talked about, apart from the common words like iPhone, phone, and Apple. People also spoke about the display, model, camera, battery, and screen.
If you want to focus on more specific insights, you can filter the Pandas data frame on exact keywords such as “Camera” or “Battery.” By conducting a sentiment analysis and creating a word cloud from this filtered data, you can uncover insights explicitly tailored to those features.
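Here is a minimal sketch of that kind of filter (the “camera” keyword and the reuse of analyze_sentiment() are illustrative assumptions):

# Keep only the comments that mention the camera
camera_df = df[df["Comments"].str.contains("camera", case=False, na=False)].copy()
# Re-run sentiment classification on just this subset
camera_df["Sentiment"] = camera_df["Comments"].apply(analyze_sentiment)
print(camera_df["Sentiment"].value_counts())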
As you may have observed, I would ordinarily have needed additional techniques to overcome the challenges mentioned earlier. Instead, I leveraged Bright Data’s Scraping Browser to act as my web browser, and it took on all the problematic aspects of the job for me. It has several built-in features that can effortlessly eliminate obstacles on websites. Let me show you some of those benefits.
By utilizing Bright Data’s Scraping Browser in combination with Python, you can gather valuable information about customers, products, and the market, allowing your business to use data-driven strategies and informed decision-making in a scalable and cost-effective manner.