Intro

Hi guys, in this tutorial we learn how to make a Plagiarism Detector in Python using machine learning techniques such as TF-IDF vectorization and cosine similarity, in just a few lines of code.

Overview

Once finished, our plagiarism detector will be capable of loading students' assignments from text files and then computing the similarity between them to determine whether students copied from each other.

Requirements

To be able to follow along with this tutorial, you need to have scikit-learn installed on your machine.

Installation

```bash
pip install -U scikit-learn
```

How do we analyze text?

We all know that computers can only understand 0s and 1s, so for us to perform any computation on textual data we first need a way to convert the text into numbers.

Word embedding

The process of converting textual data into arrays of numbers is generally known as word embedding. This vectorization is not a random process; it follows certain algorithms, so that each document ends up represented as a position in space. We are going to use scikit-learn's built-in features to do this.

How do we detect similarity in documents?

Here we use the basic concept of the vector dot product: we determine how closely two texts resemble each other by computing the cosine similarity between the vector representations of the students' text assignments.

Also, you need sample text documents standing in for the students' assignments, which we will use to test our detector.

The text files need to be in the same directory as your script and have a .txt extension. If you want to use the sample text files I used for this tutorial, download them here.

The project directory should look like this:

```
.
├── app.py
├── fatma.txt
├── image.png
├── john.txt
└── juma.txt
```

Let's now build our Plagiarism Detector

Let's first import all the necessary modules:

```python
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```

We are going to use the os module to load the paths of the text files, TfidfVectorizer to perform word embedding on our textual data, and cosine_similarity to compute the plagiarism scores.

Reading all text files using List Comprehension

We are going to use list comprehensions to load the paths of all the text files in our project directory and read their contents, as shown below:

```python
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]
student_notes = [open(File).read() for File in student_files]
```

Lambda functions to Vectorize & Compute Similarity

We need to create two lambda functions: one to convert the text into arrays of numbers, and the other to compute the similarity between them.

```python
vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()
similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2])
```

Vectorize the Textual Data

Add the two lines below to vectorize the loaded student files:

```python
vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))
```
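Before we wire everything into a single function, it helps to see what these two lambdas are actually doing. Below is a minimal, self-contained sketch using made-up toy sentences (they are not part of the tutorial's sample files) that shows how TF-IDF vectors and cosine similarity behave on similar versus unrelated text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy sentences, purely for illustration
sample_texts = [
    "Python makes machine learning easy",
    "Machine learning is easy with Python",
    "The weather is nice today",
]

# Each row of the matrix is the TF-IDF vector of one sentence
sample_vectors = TfidfVectorizer().fit_transform(sample_texts).toarray()

# cosine_similarity returns a square matrix; entry [i][j] compares sentence i with sentence j
scores = cosine_similarity(sample_vectors)
print(scores[0][1])  # sentences that share wording -> score closer to 1
print(scores[0][2])  # unrelated sentences -> score close to 0
```

This is exactly the pattern the detector relies on: documents that share a lot of wording end up close together in vector space, so their cosine similarity is high.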
Creating a Function to Compute Similarity

Below is the main function of our script, check_plagiarism(), which is responsible for managing the whole process of computing the similarity among students:

```python
plagiarism_results = set()


def check_plagiarism():
    global s_vectors
    for student_a, text_vector_a in s_vectors:
        # Compare student_a against every other student, skipping itself
        new_vectors = s_vectors.copy()
        current_index = new_vectors.index((student_a, text_vector_a))
        del new_vectors[current_index]
        for student_b, text_vector_b in new_vectors:
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            # Sort the pair so (a, b) and (b, a) are stored only once in the set
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results
```

Let's print the plagiarism results:

```python
for data in check_plagiarism():
    print(data)
```

Final code

When you put together all the above concepts, you get the full script below, ready to detect plagiarism among students' assignments:

```python
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]
student_notes = [open(File).read() for File in student_files]

vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()
similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2])

vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))

plagiarism_results = set()


def check_plagiarism():
    global s_vectors
    for student_a, text_vector_a in s_vectors:
        new_vectors = s_vectors.copy()
        current_index = new_vectors.index((student_a, text_vector_a))
        del new_vectors[current_index]
        for student_b, text_vector_b in new_vectors:
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            student_pair = sorted((student_a, student_b))
            score = (student_pair[0], student_pair[1], sim_score)
            plagiarism_results.add(score)
    return plagiarism_results


for data in check_plagiarism():
    print(data)
```

Output

Once you run app.py, the output will look as shown below:

```
$ python app.py
#__________RESULT ___________
('john.txt', 'juma.txt', 0.5465972177348937)
('fatma.txt', 'john.txt', 0.14806887549598566)
('fatma.txt', 'juma.txt', 0.18643448370323362)
```

Congratulations, you have just made your own Plagiarism Detector in Python. Now go share it with your fellow peers.

Also published on: https://kalebujordan.com/learn-python-os-module-with-examples/
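As an optional next step, you could make the report easier to scan by printing only the pairs whose score crosses a threshold. The snippet below is a sketch of that idea, meant to be appended to the end of app.py; the 0.5 cut-off is an arbitrary assumption that you should tune for your own assignments.

```python
# Optional extension (illustrative only): flag suspicious pairs.
THRESHOLD = 0.5  # arbitrary cut-off, adjust for your own data

# Sort by score, highest first, and report pairs above the threshold
for student_a, student_b, score in sorted(check_plagiarism(),
                                          key=lambda result: result[2],
                                          reverse=True):
    if score > THRESHOLD:
        print(f"Possible plagiarism: {student_a} and {student_b} ({score:.2f})")
```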