Community detection is a topic that can be applied to datasets that have an inherent order among themselves. For example, redwood, birch and oak are types of trees. Fish, algae and octopus are types of underwater creatures. These entities can be grouped under a common category, and the task of doing that is called as community detection. Sometimes we have a dataset that is unstructured and we want to get some value out of it, in those cases community detection can be useful. In this article I will be talking about how to detect communities using LPA, label propagation algorithm in GraphFrames package of Spark. The code for the article can be found . here Label Propagation Algorithm Label Propagation Algorithm is a fast algorithm that detects communities in a graph. It does not require prior information about the communities, and it uses network structure to detect them. We label some initial communities and it takes it from there. LPA was proposed by Raghavan in , and it works by labelling the nodes, and propagating these labels throughout the network and forming communities based on the process of label propagation. this paper The idea is that a label will become dominant in a densely connected group of nodes, but will have trouble propagating in a sparsely connected region. When the algorithm finishes, the nodes with the same label are considered as part of the community. The algorithm works as follows: a) Every node is initialized with a unique label. b) Label are propagated through the network c) At every iteration of the propagation, each node updates its label to the one that the maximum number of its neighbours belongs to. d) LPA reaches convergence when each node has the majority label of its neighbours. e) LPA stops if either convergence or the user defined maximum number of iterations are achieved. Implementation Dataset and Processing I am using the dataset provided by , it is a dataset that contains a word, and a reference to the wikipedia url about that word. There are 10 files, 600 MB each, and i am working with one file. It contains 18 million rows. UMASS It is in the following format: URL: ftp://217.219.170.14/Computer%20Group/Faani/vaset%20fani/second/sattari/word/2007/source/s%20crt.docx MENTION vacuum tube 421 http://en.wikipedia.org/wiki/Vacuum_tube MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube MENTION electron gun 598 http://en.wikipedia.org/wiki/Electron_gun MENTION fluorescent 790 http://en.wikipedia.org/wiki/Fluorescent MENTION oscilloscope 1307 http://en.wikipedia.org/wiki/Oscilloscope MENTION computer monitor 1503 http://en.wikipedia.org/wiki/Computer_monitor MENTION computer monitors 3066 http://en.wikipedia.org/wiki/Computer_monitor MENTION radar 1657 http://en.wikipedia.org/wiki/Radar MENTION plasma screens 2162 http://en.wikipedia.org/wiki/Plasma_screen URL is the link to a particular topic, and mention contains the word that was mentioned in the article along with the Wikipedia link (also mentioned in the article). I used the following code to preprocess the data to arrive at a meaningful format: After preprocessing, i have the data in the following format: ftp://ftp.pcbsd.org/pub/handbook 9.0/handbook_en_ver9.0.html,Zfs ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,BSD_UNIX ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Acpi ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Softupdates ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Compiz ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Pcx ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Journaling_file_system ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Nextstep ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Freebsd_jail ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Mac_address ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,IPv6_address Basically, in this format, I have a URL followed by each word that was mentioned in the original file. Applying LPA Now that we have the data in the required format, we can apply LPA. I use the following code to do that. The largest community is has 8500 nodes. Let us look at it’s contents: ... ... ... ... ... ... ... ... http://commons.wikimedia.org/wiki/File:Vestv%C... ... http://commons.wikimedia.org/wiki/File:Aston_M http://www.free-photos.biz/photographs/industr http://commons.wikimedia.org/wiki/File:Mercede http://commons.wikimedia.org/wiki/File:Odiorne http://commons.wikimedia.org/wiki/File:Hannove http://commons.wikimedia.org/wiki/File:Autosta http://www.free-photos.biz/photographs/food/ve http://commons.wikimedia.org/wiki/File:Smeaton http://commons.wikimedia.org/wiki/File:Mitsubi As we can see, the largest community is about free photos. Let us look at another: Philip_Neame Monsoon#Asia Arab_Jew Cave_of_Elijah Shi’a_Islam Culture_of_the_Song_Dynasty Safavid_dynasty John_Romer_(Egyptologist) Bamyan_Province Henry_Phillips_(horticulturist) Christ_Church,_Nazareth The second largest community is about dynasties or provinces, and people related to them. It has about 6000 nodes. There are some false positives as well, which means more iterations in LPA can probably give good results. Third largest community is: Hansi Vadodara_district Cox-Forbes_theory Puppis_(constellation) University_of_Alexandria historical_scholarship Villuercas Center_of_the_universe genre Velayuthampalayam and it is mostly about science. It has 4000 nodes, and by the looks of it, it can probably use more number of iterations as well. Conclusion In this post we went through what communities are, and how do we detect them. We used the algorithm Label Propagation algorithm to determine that. It works by initially labelling nodes, then iterating and propagating labels and determining the final label at convergence. We then used a dataset that contains a URL and references to wikipedia text for several mentions of words in that url, and determined communities from them. Most popular community is about stock photos, then about dynasties and then science. I hope you liked my post.