Community detection applies to datasets whose items have an inherent grouping among themselves.
For example, redwood, birch and oak are types of trees. Fish, algae and octopus are types of underwater creatures. These entities can be grouped under a common category, and the task of doing that is called community detection.
Sometimes we have an unstructured dataset and want to get some value out of it; in those cases community detection can be useful. In this article I will talk about how to detect communities using LPA, the Label Propagation Algorithm, from the GraphFrames package for Spark. The code for the article can be found here.
The Label Propagation Algorithm is a fast algorithm for detecting communities in a graph. It does not require prior information about the communities and relies only on the network structure to find them. Each node starts out with its own label, and the algorithm takes it from there.
LPA was proposed by Raghavan et al. in this paper. It works by labelling the nodes, propagating these labels through the network, and forming communities based on how the labels spread.
The idea is that a label will become dominant in a densely connected group of nodes, but will have trouble propagating across a sparsely connected region. When the algorithm finishes, nodes that share the same label are considered part of the same community.
The algorithm works as follows:
a) Every node is initialized with a unique label.
b) Labels are propagated through the network.
c) At every iteration of the propagation, each node updates its label to the one shared by the maximum number of its neighbours.
d) LPA reaches convergence when each node has the majority label of its neighbours.
e) LPA stops when either convergence or the user-defined maximum number of iterations is reached (a toy sketch of these steps is shown below).
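To make the update rule concrete, here is a toy sketch of these steps in plain Python. The example graph, the tie-breaking rule and the node-by-node update order are illustrative choices on my part; this is not the GraphFrames implementation we will use later.

from collections import Counter
import random

def label_propagation(adjacency, max_iter=10):
    # (a) every node starts with a unique label (its own id)
    labels = {node: node for node in adjacency}
    for _ in range(max_iter):                        # (e) iteration cap
        changed = False
        # (b, c) visit nodes in random order and adopt the neighbours' majority label
        for node in random.sample(list(adjacency), len(adjacency)):
            counts = Counter(labels[nbr] for nbr in adjacency[node])
            if not counts:
                continue
            best = max(counts.values())
            new_label = random.choice([l for l, c in counts.items() if c == best])
            if labels[node] != new_label:
                labels[node] = new_label
                changed = True
        if not changed:                              # (d) convergence: no label changed
            break
    return labels

# Two triangles joined by a single edge should form two communities
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
         4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
print(label_propagation(graph))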
I am using a dataset provided by UMass; it contains words and, for each word, a reference to the Wikipedia URL about that word. There are 10 files of about 600 MB each, and I am working with one file. It contains 18 million rows.
It is in the following format:
URL: ftp://217.219.170.14/Computer%20Group/Faani/vaset%20fani/second/sattari/word/2007/source/s%20crt.docx
MENTION vacuum tube 421 http://en.wikipedia.org/wiki/Vacuum_tube
MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube
MENTION electron gun 598 http://en.wikipedia.org/wiki/Electron_gun
MENTION fluorescent 790 http://en.wikipedia.org/wiki/Fluorescent
MENTION oscilloscope 1307 http://en.wikipedia.org/wiki/Oscilloscope
MENTION computer monitor 1503 http://en.wikipedia.org/wiki/Computer_monitor
MENTION computer monitors 3066 http://en.wikipedia.org/wiki/Computer_monitor
MENTION radar 1657 http://en.wikipedia.org/wiki/Radar
MENTION plasma screens 2162 http://en.wikipedia.org/wiki/Plasma_screen
The URL line is the link to the source document, and each MENTION line contains a word that was mentioned in that document along with the Wikipedia link it refers to.
I used the following code to preprocess the data to arrive at a meaningful format:
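That snippet is not reproduced in this text, so here is a minimal sketch of what such preprocessing might look like in plain Python. The file names and the assumption that the raw lines are tab-separated are mine; the idea is simply to remember the most recent URL line and emit one (url, Wikipedia title) pair per MENTION line.

import csv
import urllib.parse

# Hypothetical file names; the raw file is read line by line.
with open("wikilinks_part1.txt", encoding="utf-8", errors="ignore") as raw, \
     open("mentions.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    current_url = None
    for line in raw:
        parts = line.rstrip("\n").split("\t")        # assuming tab-separated fields
        if line.startswith("URL") and len(parts) > 1:
            current_url = parts[-1]                  # remember the source document
        elif line.startswith("MENTION") and current_url and len(parts) >= 2:
            wiki_url = parts[-1]                     # e.g. http://en.wikipedia.org/wiki/Vacuum_tube
            title = urllib.parse.unquote(wiki_url.rsplit("/", 1)[-1])
            writer.writerow([current_url, title])    # one (url, concept) row per mention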
After preprocessing, I have the data in the following format:
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Zfs
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,BSD_UNIX
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Acpi
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Softupdates
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Compiz
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Pcx
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Journaling_file_system
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Nextstep
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Freebsd_jail
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,Mac_address
ftp://ftp.pcbsd.org/pub/handbook/9.0/handbook_en_ver9.0.html,IPv6_address
Basically, in this format, each row pairs the URL with one Wikipedia concept that was mentioned in the original document.
Now that we have the data in the required format, we can apply LPA. I use the following code to do that.
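The exact snippet is again not shown in this text, but the GraphFrames step might look roughly like this in PySpark. The input path, column names and the number of iterations are assumptions on my part, and the graphframes package has to be supplied to Spark separately (for example via --packages).

from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.appName("lpa-communities").getOrCreate()

# Edges: one row per (document URL, mentioned Wikipedia concept) pair
edges = spark.read.csv("mentions.csv").toDF("src", "dst")

# Vertices: every distinct URL or concept that appears in an edge
vertices = (edges.select(F.col("src").alias("id"))
                 .union(edges.select(F.col("dst").alias("id")))
                 .distinct())

g = GraphFrame(vertices, edges)

# Run Label Propagation; each vertex comes back with a "label" column
# identifying its community
communities = g.labelPropagation(maxIter=5)

# Community sizes, largest first
(communities.groupBy("label").count()
            .orderBy(F.desc("count"))
            .show(10, truncate=False))

# Members of a particular community (some_label would come from the listing above)
# communities.filter(F.col("label") == some_label).select("id").show(50, truncate=False)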
The largest community has 8500 nodes. Let us look at its contents:
http://commons.wikimedia.org/wiki/File:Aston_M...
http://www.free-photos.biz/photographs/industr...
http://commons.wikimedia.org/wiki/File:Mercede...
http://commons.wikimedia.org/wiki/File:Odiorne...
http://commons.wikimedia.org/wiki/File:Hannove...
http://commons.wikimedia.org/wiki/File:Autosta...
http://www.free-photos.biz/photographs/food/ve...
http://commons.wikimedia.org/wiki/File:Smeaton...
http://commons.wikimedia.org/wiki/File:Vestv%C...
http://commons.wikimedia.org/wiki/File:Mitsubi...
As we can see, the largest community is about free photos. Let us look at another:
Philip_Neame
Monsoon#Asia
Arab_Jew
Cave_of_Elijah
Shi’a_Islam
Culture_of_the_Song_Dynasty
Safavid_dynasty
John_Romer_(Egyptologist)
Bamyan_Province
Henry_Phillips_(horticulturist)
Christ_Church,_Nazareth
The second largest community is about dynasties and provinces, and people related to them. It has about 6000 nodes. There are some false positives as well, so running LPA for more iterations could probably improve the results.
The third largest community is:
Hansi
Vadodara_district
Cox-Forbes_theory
Puppis_(constellation)
University_of_Alexandria
historical_scholarship
Villuercas
Center_of_the_universe
genre
Velayuthampalayam
and it is mostly about science. It has 4000 nodes and, by the looks of it, could probably use more iterations as well.
In this post we went through what communities are and how to detect them. We used the Label Propagation Algorithm, which works by initially labelling nodes, then iteratively propagating the labels and settling on each node's final label at convergence.
We then used a dataset that pairs URLs with Wikipedia references for the words mentioned on each page, and detected communities in it. The largest community is about free photos, the next about dynasties, and the third about science.
I hope you liked my post.