
Exploring Jaccard's Similarity Coefficient

by David Ochoa Corrales
@mxsundevice

Data enthusiast

March 1st, 2023

Too Long; Didn't Read

The Jaccard index is a measure of the similarity between two sets of information. To understand it clearly, you only need to study a little bit of set theory.
With this article, I intend to give a simple and concise explanation of the Jaccard index, a measure of the similarity between two sets of information. The coefficient was first described by Grove Karl Gilbert in 1884 and later developed independently by Paul Jaccard, whose name it carries. Since then it has seen a wide range of applications, from behavioral research to the stability of unicellular clusters and, of course, natural language processing (NLP).


To fully grasp this concept, you might need to study a little bit of set theory. If you’re an SQL developer, you can think of the intersection as an inner join and the union as a full outer join; the Jaccard index is the ratio between the sizes of the two.



Figure: inner join



I know topics like this can seem like a drag, but stay with me.


We’ll get started in Python by loading the libraries and then defining two sets.

# libraries
import matplotlib.pyplot as plt
import matplotlib_venn as venn


GroupA = {1, 2, 3}
GroupB = {3, 4, 5}


To view the Venn diagram, we use the matplotlib_venn library:

venn.venn2([GroupA, GroupB], set_labels=('Group A', 'Group B'))
plt.show()

Figure: Venn intersection

# Intersection method: returns the elements present in both sets
Intersection = GroupA.intersection(GroupB)
print("Intersection of GroupA and GroupB:", Intersection)

Intersection of GroupA and GroupB: {3}


Now we can see that the intersection of the two sets is the single element 3. Next, we calculate the Jaccard index with the following formula:



$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
$$


This expression can be conceptually interpreted as follows, where each name stands for the number of elements in that set:

Jaccard = Intersection / ( GroupA + GroupB - Intersection )

Jaccard = 1 / ( 3 + 3 - 1 )
Jaccard = 1/5
Jaccard = 0.2
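
As a quick check of the arithmetic above, the denominator is simply the size of the union of the two sets. This is a minimal sketch that reuses the same example values; the lowercase variable names are my own, chosen only for illustration.

```python
# Same example values as the two groups above (names are illustrative)
group_a = {1, 2, 3}
group_b = {3, 4, 5}

intersection = group_a & group_b   # elements in both sets
union = group_a | group_b          # elements in either set

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
denominator = len(group_a) + len(group_b) - len(intersection)
assert denominator == len(union)

jaccard = len(intersection) / len(union)
print(jaccard)  # 0.2
```

Subtracting the intersection avoids counting the shared element twice, which is why both forms of the denominator agree.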


In Python, this can be computed directly:

# Jaccard index: size of the intersection divided by size of the union
len(Intersection) / ( len(GroupA) + len(GroupB) - len(Intersection) )
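
If you need the calculation more than once, it may help to wrap it in a small function. The sketch below is one possible way to do it; the name `jaccard_index` and the choice to return 1.0 for two empty inputs are my assumptions, not part of any library.

```python
def jaccard_index(a, b):
    """Return the Jaccard index of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both inputs are empty, so the ratio would be 0/0.
        # Returning 1.0 is a convention assumed here, not a universal rule.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard_index({1, 2, 3}, {3, 4, 5}))  # 0.2
print(jaccard_index([1, 1, 2], [1, 2, 2]))  # 1.0, duplicates are ignored
```

Converting the inputs with `set()` means lists work too, but repeated items count only once.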


Of course, in practice you usually need to compare lists of items, so you loop over the source list and compare each of its records against every record in the comparison list.
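
A minimal sketch of such a loop follows. It assumes each record is a short string compared by its lowercase words and that the comparison list is not empty. The function names and sample data are hypothetical, and this is not the code linked under Resources.

```python
def jaccard_index(a, b):
    """Jaccard index of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Returning 1.0 for two empty inputs is an assumed convention
    return len(set_a & set_b) / len(union) if union else 1.0


def best_matches(source, comparison):
    """For each source record, find the most similar comparison record."""
    results = []
    for src in source:
        src_tokens = src.lower().split()
        # Score this record against every candidate and keep the best one
        scored = [
            (jaccard_index(src_tokens, cand.lower().split()), cand)
            for cand in comparison
        ]
        score, match = max(scored, key=lambda pair: pair[0])
        results.append((src, match, score))
    return results


# Hypothetical sample data
source = ["red apple pie", "green tea"]
comparison = ["apple pie", "iced green tea", "black coffee"]

for src, match, score in best_matches(source, comparison):
    print(f"{src!r} -> {match!r} ({score:.2f})")
```

Each source record is compared with every candidate, so the cost grows with the product of the two list lengths. That is fine for small lists but may need a smarter approach for large ones.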


I wrote a small piece of code for this. Feel free to review it, enjoy it, and suggest corrections. They are always welcome!

Resources

The notebook version of this text

The Jaccard code



Also published here.
