With this article, I intend to give a simple and concise explanation of the Jaccard index, a measure of the similarity between two sets. The Jaccard similarity coefficient was first developed by Grove Karl Gilbert in 1884 and later, independently, by Paul Jaccard, whose name it bears. Since then it has seen a wide range of applications, from behavioral research to the stability of unicellular clusters, and of course NLP.

To fully grasp this concept, you might need to study a little bit of set theory. If you're an SQL developer, you can roughly think of it as the size of an inner join relative to the size of a full outer join. I know topics like this can seem like a drag, but stay with me.

We'll get started in Python by loading the libraries and then defining two sets.

    # libraries
    import matplotlib.pyplot as plt
    import matplotlib_venn as venn

    GroupA = {1, 2, 3}
    GroupB = {3, 4, 5}

To view the Venn diagram we use the library matplotlib_venn:

    venn.venn2([GroupA, GroupB], set_labels=('Group A', 'Group B'))
    plt.show()

Next, we compute the intersection:

    # Intersection method
    Intersection = GroupA.intersection(GroupB)
    print("Intersection of GroupA and GroupB:", Intersection)

    Intersection of GroupA and GroupB: {3}

Now we can see that the two sets intersect in a single element, 3. Then we move on to calculate the Jaccard index with the following formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

The denominator is the size of the union: adding the two set sizes counts the shared elements twice, so the intersection is subtracted once. Using the sizes of our sets, this expression can be read as:

    Jaccard = Intersection / (GroupA + GroupB - Intersection)
    Jaccard = 1 / (3 + 3 - 1)
    Jaccard = 1 / 5
    Jaccard = 0.2

In Python, the code for this specific case can be:

    # specific code
    Jaccard = len(Intersection) / (len(GroupA) + len(GroupB) - len(Intersection))
    print("Jaccard index:", Jaccard)

Of course, in practice you usually need to compare lists of items, so you loop over the source list and compare each record against every record in the comparison list. I wrote a little code for this. Feel free to review it, enjoy it, and suggest corrections. They are always welcome!

Resources

The notebook version of this text
The Jaccard code
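
For readers who want something to experiment with right away, here is a minimal sketch of the general case mentioned above: looping over a source list and comparing each record against every record in a comparison list. It is not the Jaccard code linked in the resources. The function name `jaccard_index`, the record names, and the sample data are all hypothetical and chosen only for illustration. Returning 0.0 when both sets are empty is a convention I assume here, since the index is undefined in that case.

```python
def jaccard_index(a, b):
    """Return |a ∩ b| / |a ∪ b| for two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both inputs are empty, so the ratio is undefined.
        # Returning 0.0 is an assumed convention, not part of the definition.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical sample data: named records, each holding a set of items.
source_records = {"GroupA": {1, 2, 3}, "GroupC": {1, 2}}
comparison_records = {"GroupB": {3, 4, 5}, "GroupD": {1, 2, 3, 4}}

# Compare every source record against every comparison record.
scores = {}
for source_name, source_items in source_records.items():
    for other_name, other_items in comparison_records.items():
        scores[(source_name, other_name)] = jaccard_index(source_items, other_items)

for pair, score in scores.items():
    print(pair, round(score, 2))
```

Converting the inputs with `set()` means duplicates inside a list are ignored, which matches the set-based definition of the index. If repeated items should count, a multiset variant would be needed instead.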