Have you ever had to find unique topics in a set of documents? If you have, then you’ve probably worked with Latent Dirichlet Allocation (LDA). This is how LDA works: the algorithm looks for groups of words that repeatedly appear together across the documents you're working with and reports each group as a so-called unique topic. These are topics that occur at some frequency in the documents. LDA is widely used for online content generation.

This algorithm is particularly helpful in sales and marketing. Suppose that you want to create a logo for your company. How can you find the one that will best represent your business and attract clients? The answer is to study the existing cases and review the competitors’ strategies. To create an effective logo, you should gather data from competitors’ websites. That is, you need to parse websites and apply LDA to generate goal-related topic clusters for you. Based on the results you get, you will be able to outline the keywords most frequently associated with your sphere and create an attractive logo.

Or suppose that you want to optimize your content based on SEO research. Applying LDA will allow you to gather unique topic clusters with keywords in each cluster. Based on the results, you can adjust your content to make it semantically valuable for search engines.

You can use LDA not only for websites but for any type of textual data. So, if you want to use LDA to generate unique topics, you need to have a large number of documents or websites to use for your analysis. But that’s not all: to get accurate results, you have to remove stop words from the dataset first. Stop words are the words that bear no value in the analysis, like “the,” “your,” “him,” “a,” and so on. Apart from deleting stop words, you have to delete everything related to HTML, CSS, JavaScript, and SEO, along with other words that are irrelevant to the LDA analysis.

Does that seem too complicated? Let’s lay it all out. In this article, we will discuss how to generate unique topics with LDA when parsing websites.
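To make the stop-word idea from the introduction concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-written stop list and a made-up sample sentence; the names `STOP_WORDS` and `remove_stop_words` are illustrative only and are not part of the article's code, which relies on NLTK's full English list instead.

```python
import re

# Illustrative stop list only (an assumption for this sketch);
# a real analysis would use a full list such as NLTK's English stop words.
STOP_WORDS = {"the", "your", "him", "a", "to", "for", "and", "is"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


sample = "Learn the Python basics and build a logo for your company"
print(remove_stop_words(sample))
# prints ['learn', 'python', 'basics', 'build', 'logo', 'company']
```

Only the words that carry meaning for the topic survive, which is the kind of input a topic model should see.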
How to Start Using the LDA Algorithm

To parse websites with LDA, first import the libraries you need for file processing and for cleaning out HTML, CSS, and JavaScript. You'll also import the LDA analyzer:

import os
import re
import gensim
import nltk
import time
import random
import itertools
import pandas as pd
from string import digits
from wordcloud import WordCloud
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from tqdm.notebook import tqdm, trange
import gensim.corpora as corpora
from bs4 import BeautifulSoup

Next, select the number of websites you need to analyze. The larger and more precise your sample is, the more reliable the LDA results will be. Here, both the quantity and the quality of the sample matter. That is, you should create a uniform sample of websites that are likely to contain the keywords you need for topic generation.

Then, choose the folder that will be used to read the documents:

entries = os.listdir('./dataset_htmls/')

Next, read the chosen websites and conduct a preliminary cleaning of the HTML, CSS, and JavaScript:

def cleanDocument(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements together with their contents
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()
    # Strip each line, split it on double spaces, and drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text


start_time = time.time()

# Assumption: analyze every file in the folder; lower this number to use a subset
files_number = len(entries)
files = [0] * files_number
rows = []  # collected rows; DataFrame.append was removed in pandas 2.0
for i in range(files_number):
    try:
        with open("./dataset_htmls/" + entries[i], "r", encoding='utf-8') as f:
            text = cleanDocument(f.read())
        # Remove words that are only one or two characters long
        text = re.sub(r'\b\w{1,2}\b', '', text)
        rows.append({'text': text})
    except Exception:
        files[i] = " "  # mark the file as unreadable
        print("problem with file " + entries[i])
df = pd.DataFrame(rows, columns=['text'])

print("--- %s seconds ---" % (time.time() - start_time))

When you're using LDA, keep in mind that websites are built with a large number of frameworks, such as Angular, Vue, React, WordPress, and more. There are also many ready-made no-code website builders, like Wix or Tilda. And each developer follows their own practices when building websites. Most problems come from the so-called "dirty hacks" that developers use: markup written this way can slip past the parser and the cleaning step, and the leftover code then pollutes the input to LDA.

How the LDA Algorithm Works

Let’s start with CSS. Make sure you detect the following ways of defining styles:

Inline CSS: use of the style attribute in HTML tags
Inline CSS in the head section: use of <style> in the document header
External CSS as a file: use of <link> to load a CSS file

For HTML, it's important to delete tags but not to delete the content of these tags. The same goes for JavaScript. One way to do so is to use regular expressions:

# Remove <style> and <script> blocks first, while their tags are still present;
# otherwise the generic tag pattern strips the tags and leaves the CSS/JS text behind.
df['text'] = df['text'].map(lambda x: re.sub(r'<style.*?</style>', '', x, flags=re.DOTALL))
df['text'] = df['text'].map(lambda x: re.sub(r'<script.+?</script>', '', x, flags=re.DOTALL))
# Then strip the remaining tags but keep the text between them
df['text'] = df['text'].map(lambda x: re.sub(r'<[^<]+?>', '', x))

Yet, there is a problem with such methods. Because of single (unpaired) tags, JavaScript plugins, and the various ways of building web applications, regular expressions cannot completely remove HTML, CSS, and JavaScript. You can easily check this on a site's input files. This is why, for this article, I used the bs4 library. To test the cleanup, I chose 1,000 websites, and no significant bugs with removing web elements were detected. So you can also use this library and remove web elements that are not related to the LDA analysis via the cleanDocument function.

The next step is to delete very short words, that is, words of one or two characters. Use the following regular expression to solve this problem:

text = re.sub(r'\b\w{1,2}\b', '', text)

All cleaned text is recorded to the df dataframe. To evaluate the work of the algorithm, look at the running time. The running time for 100 websites is about 5 seconds.

Your next task is to clean out stop words. As I mentioned, stop words are the words that bear no value for analysis, like “the,” “you,” “himself,” and so on. To clean stop words, download a list of standard stop words for the English language. Apart from stop words, consider the standard vocabulary used on nearly every website, for example “contact us,” “cookies,” “confirm,” “back,” and so on. Then add those words to your list of stop words. Also, when cleaning, review the "standard" set of words used in SEO optimization.

other_stop_words = ['contact', 'cookies', 'confirm', 'website', 'share']

nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words.extend(other_stop_words)

def sent_to_words(sentences):
    # Tokenize each document; deacc=True also strips accents
    for sentence in sentences:
        yield (gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

data = df['text'].values.tolist()
data_words = list(sent_to_words(data))

data_words = remove_stopwords(data_words)

Congratulations! You are halfway there. What’s next? You need to form a word dictionary for the LDA algorithm:

id2word = corpora.Dictionary(data_words)
texts = data_words
# Convert each document into a bag-of-words: a list of (word id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]

num_topics = 10
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

print(lda_model.print_topics())

Important note: to get precise results, you need to build a dictionary of stop words that will be used to clean the text. To see the output of this procedure, count how often each word occurs across all of the text. This will let you determine which words occur the most frequently:

from collections import Counter

def freq(data_words):
    # Count every word in a single pass instead of calling list.count per word
    counts = Counter(data_words)
    return pd.DataFrame(list(counts.items()), columns=['Word', 'Frequency'])

# Flatten the per-document word lists into one list of words
data_words = itertools.chain.from_iterable(data_words)
data_words = list(data_words)
frequency = freq(data_words)
frequency = frequency.sort_values(by="Frequency", ascending=False)
frequency.to_csv('frequency.csv')

The Results of LDA Analysis

Let’s review the results. As an example, we'll use 10 websites that provide courses for learning programming languages. Based on the algorithm above, I parsed the 10 websites and removed all words unrelated to the analysis. You can see the first result of the LDA analysis in the following image:

[Image: first result of the LDA analysis]

Below, you can see the results of frequency analysis:

[Image: results of the frequency analysis]

Frequency analysis provides information for SEO optimization, where the words can be used as search keys. However, to find a cluster of similar and "valuable" words, you should clean the text of base words. In our case, base words include: 'course', 'learn', 'programming', 'lpa', 'code', 'courses', 'java', 'learning', 'get', 'one', 'python', 'see', 'pluralsight', 'web', 'org', 'become', 'javascript', 'android', 'may', 'site', 'developer.' You can identify these words based on frequency analysis or on the results of the LDA algorithm after n iterations.

After cleaning out the base words, we get the following result:

[Image: LDA result after removing the base words]

In fact, the output is a set of topics built from the following set of words: Team, students, read, review, free, skills, reviews, path, data, project, build, masterclass, instructor, job, etc.

Keep in mind that you need to clean the topics manually and iteratively until you get your desired results. With stemming (for example, via the PorterStemmer imported above), only the roots of words remain, so endings and plural forms are dropped and duplicated words collapse into one.

That’s it. If you are interested in more info, I welcome you to check out the full code.

Wrapping Up

You can see that LDA can be an effective way of topic modeling. Based on the results above, you can easily choose the right word combination for your logo. You can also optimize your existing content and adjust it to perform better based on the results. The advantage of LDA lies in the fact that its functions are flexible.
Its purpose is primarily mathematical, so you can adjust LDA’s work to your needs. The disadvantage, though, is that you need to polish the results manually. While this is more time-consuming, you get more precise results in the end. So if you deal with marketing and SEO, using LDA is a practical way to dive deeper into content generation.

Special thanks for co-authoring this article to Volodia Andrushchak, Machine Learning Engineer at @KeenEthics. If you are interested in other useful readings on the Internet of Things and Artificial Intelligence, check out other articles by Volodia: https://keenethics.com/volodya-andrushchak?article=2697

Thank you for reading, and I sincerely hope that you found it helpful!

Using the LDA Algorithm for Websites
