One thing is clear: it is impossible to search for questions on the internet during the exam, but I can quickly take a picture when the examiner turns his back. That is the first part of the algorithm: somehow I need to extract the question from the picture.
There are a lot of services that provide text extraction tools, but I needed some kind of API to solve this problem. In the end, Google's Vision API was exactly the tool I was looking for. The great thing is that the first 1000 API calls are free each month, which is quite enough for me to test and use the API.
First, create a Google Cloud account and then search for Vision AI in the services. With Vision AI you can do things such as assign labels to an image to organize your images, get recommended crop vertices, detect famous landmarks or places, extract text, and a few other things.
Check the documentation to enable and set up the API. During configuration you have to create a JSON key file, which is downloaded to your computer.
Run the following command to install the client library:
pip install google-cloud-vision
Then provide authentication credentials to your application code by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that JSON key file.
import os, io
from google.cloud import vision
from google.cloud.vision import types
# JSON file that contains your key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your_private_key.json'
# Instantiates a client
client = vision.ImageAnnotatorClient()
FILE_NAME = 'your_image_file.jpg'
# Loads the image into memory
with io.open(FILE_NAME, 'rb') as image_file:
    content = image_file.read()
image = vision.types.Image(content=content)
# Performs text detection on the image file
response = client.text_detection(image=image)
print(response)
# Extract description
texts = response.text_annotations[0]
print(texts.description)
When you run the code you will see the response, which includes the specifications of every piece of detected text. But we only need the plain description, so I extracted just that part from the response.
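For context, text_annotations[0] holds the entire detected text block, while the remaining entries are the individual words with their own bounding boxes. A quick way to inspect this (just a small sketch):

# The first annotation contains the full detected text block
full_text = response.text_annotations[0].description
print(full_text)
# The remaining annotations are the individual words/elements
for word in response.text_annotations[1:6]:
    print(word.description)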
The next step is to search the question on Google to get some information. I used the re (regex) library to extract the question part from the description, and then we slugify (URL-encode) the extracted question so it can be used in a search URL.
import re
import urllib.parse

# If the text contains a question mark
if '?' in texts.description:
    question = re.search('([^?]+)', texts.description).group(1)
# If it contains a colon
elif ':' in texts.description:
    question = re.search('([^:]+)', texts.description).group(1)
# If it contains a newline
elif '\n' in texts.description:
    question = re.search('([^\n]+)', texts.description).group(1)

# Slugify (URL-encode) the match
slugify_keyword = urllib.parse.quote_plus(question)
print(slugify_keyword)
We are going to use BeautifulSoup to crawl the first 3 results to get some information about the question, because the answer is probably located in one of them.
Additionally, if you want to crawl particular data from Google's search results, don't use Inspect Element to find the attributes of the elements; instead, print the whole page that your script receives, because its attributes are different from what the browser shows.
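For example, you can fetch the search page the same way the crawler does and dump exactly what the script receives (a small sketch reusing the slugify_keyword from above):

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

# Fetch the search page with the same headers the crawler uses
req = Request('https://google.com/search?q=' + slugify_keyword,
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
# Print the HTML as the script sees it; it differs from what the browser renders
print(BeautifulSoup(html, 'html.parser').prettify())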
We need to crawl the first 3 links in the search results, but these links are really messed up, so it is important to get clean links for crawling.
/url?q=https://en.wikipedia.org/wiki/IAU_definition_of_planet&sa=U&ved=2ahUKEwiSmtrEsaTnAhXtwsQBHduCCO4QFjAAegQIBBAB&usg=AOvVaw0HzMKrBxdHZj5u1Yq1t0en
As you can see, the actual link is located between q= and &sa. Using regex we can extract this particular field, i.e. the valid URL.
result_urls = []

def crawl_result_urls():
    req = Request('https://google.com/search?q=' + slugify_keyword, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    results = bs.find_all('div', class_='ZINbbc')
    try:
        for result in results:
            link = result.find('a')['href']
            # Checking if it is a URL (just in case)
            if 'url' in link:
                result_urls.append(re.search('q=(.*)&sa', link).group(1))
    except (AttributeError, IndexError) as e:
        pass
Before we crawl the content of these URLs, let me show you the question answering system in Python.
This is the main part of the algorithm. After crawling the information from the first 3 results, the program should detect the answer by iterating over the documents. At first I thought it would be better to use a similarity algorithm to find the documents most similar to the question, but I had no idea how to implement it.
After hours of research I found an article on Medium which explains how to build a question answering system with Python. It describes cdQA, an easy-to-use Python package for implementing a QA system on your own private data. You can check it out for more explanation (it is listed in the references below).
Let's first install the package:
pip install cdqa
I am downloading the pre-trained models and data manually by using the download functions included in the example code block below:
import pandas as pd
from ast import literal_eval
from cdqa.utils.filters import filter_paragraphs
from cdqa.utils.download import download_model, download_bnpp_data
from cdqa.pipeline.cdqa_sklearn import QAPipeline
# Download data and models
download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
download_model(model='bert-squad_1.1', dir='./models')
# Loading data and filtering / preprocessing the documents
df = pd.read_csv('data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
df = filter_paragraphs(df)
# Loading QAPipeline with CPU version of BERT Reader pretrained on SQuAD 1.1
cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')
# Fitting the retriever to the list of documents in the dataframe
cdqa_pipeline.fit_retriever(df)
# Sending a question to the pipeline and getting prediction
query = 'Since when does the Excellence Program of BNP Paribas exist?'
prediction = cdqa_pipeline.predict(query)
print('query: {}\n'.format(query))
print('answer: {}\n'.format(prediction[0]))
print('title: {}\n'.format(prediction[1]))
print('paragraph: {}\n'.format(prediction[2]))
The output should look like this:
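The exact values depend on the data and models you downloaded, but following the print statements above, the output has this shape (placeholder values):

query: Since when does the Excellence Program of BNP Paribas exist?
answer: <answer span found by the Reader>
title: <title of the source document>
paragraph: <paragraph that contains the answer>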
It prints the exact answer and the paragraph that contains it.
Basically, when the question is extracted from the picture and sent to the system, the Retriever selects a list of documents from the crawled data that are the most likely to contain the answer. As I stated before, it computes the cosine similarity between the question and each document in the crawled data.
After selecting the most probable documents, the system divides each document into paragraphs and sends them, together with the question, to the Reader, which is basically a pre-trained deep learning model. The model used is the PyTorch version of the well-known NLP model BERT. The Reader then outputs the most probable answer it can find in each paragraph. Finally, a last layer in the system compares these answers using an internal score function and outputs the most likely one according to the scores, which will be the answer to our question.
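To make the Retriever step more concrete, here is a minimal sketch of cosine similarity over TF-IDF vectors using scikit-learn. It only illustrates the idea with placeholder documents; it is not cdQA's internal code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents standing in for the crawled pages
documents = [
    'The IAU definition of planet was adopted in 2006.',
    'Pluto was reclassified as a dwarf planet.',
    'BNP Paribas runs an Excellence Program for graduates.',
]
question = 'When was the IAU definition of planet adopted?'

# Vectorize the question together with the documents and rank by similarity
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([question] + documents)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
best = scores.argmax()
print('Most similar document ({:.2f}): {}'.format(scores[best], documents[best]))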
Here is the schema of the system mechanism.
You have to set up your dataframe (CSV) with a specific structure so it can be fed to the cdQA pipeline (a sketch of the expected structure follows below).
In my case, though, I used cdQA's PDF converter to create the input dataframe from a directory of PDF files. So I am going to save the crawled data in a PDF file for each result. Hopefully we will end up with 3 PDF files in total (it can be 1 or 2 as well). Additionally, we need to name these PDF files, which is why I crawled the heading of each page.
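If you ever want to build the dataframe yourself instead of using the PDF converter, cdQA expects one row per document, with a title column and a paragraphs column holding a list of paragraph strings. A rough illustration with placeholder content:

import pandas as pd

# Illustrative dataframe in the shape the cdQA pipeline expects:
# one row per document, 'paragraphs' holds a list of paragraph strings
df = pd.DataFrame({
    'title': ['IAU definition of planet'],
    'paragraphs': [[
        'The IAU definition of planet was adopted in 2006.',
        'It states that a planet must orbit the Sun and have cleared its neighbourhood.',
    ]],
})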
def get_result_details(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(req).read()
        bs = BeautifulSoup(html, 'html.parser')
        try:
            # Crawl any heading in the result to name the pdf file
            title = bs.find(re.compile('^h[1-6]$')).get_text().strip().replace('?', '').lower()
            # Naming the pdf file
            filename = "/home/coderasha/autoans/pdfs/" + title + ".pdf"
            if not os.path.exists(os.path.dirname(filename)):
                try:
                    os.makedirs(os.path.dirname(filename))
                except OSError as exc:  # Guard against race condition
                    if exc.errno != errno.EEXIST:
                        raise
            with open(filename, 'w') as f:
                # Crawl the first 5 paragraphs
                for line in bs.find_all('p')[:5]:
                    f.write(line.text + '\n')
        except AttributeError:
            pass
    except urllib.error.HTTPError:
        pass
def find_answer():
    df = pdf_converter(directory_path='/home/coderasha/autoans/pdfs')
    cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')
    cdqa_pipeline.fit_retriever(df)
    query = question + '?'
    prediction = cdqa_pipeline.predict(query)
    print('query: {}\n'.format(query))
    print('answer: {}\n'.format(prediction[0]))
    print('title: {}\n'.format(prediction[1]))
    print('paragraph: {}\n'.format(prediction[2]))
    return prediction[0]
Well, to summarize the algorithm: it extracts the question from the picture, searches it on Google, crawls the first 3 results, creates 3 PDF files from the crawled data, and finally finds the answer using the question answering system.
If you want to see how it works, check out "I made a bot that can solve exam questions from the picture".
Here is the Full Code:
import os, io
import errno
import urllib
import urllib.request
import hashlib
import re
import requests
from time import sleep
from google.cloud import vision
from google.cloud.vision import types
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd
from ast import literal_eval
from cdqa.utils.filters import filter_paragraphs
from cdqa.utils.download import download_model, download_bnpp_data
from cdqa.pipeline.cdqa_sklearn import QAPipeline
from cdqa.utils.converters import pdf_converter
result_urls = []
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your_private_key.json'
client = vision.ImageAnnotatorClient()
FILE_NAME = 'your_image_file.jpg'
with io.open(FILE_NAME, 'rb') as image_file:
    content = image_file.read()
image = vision.types.Image(content=content)
response = client.text_detection(image=image)
texts = response.text_annotations[0]
# print(texts.description)
if '?' in texts.description:
    question = re.search('([^?]+)', texts.description).group(1)
elif ':' in texts.description:
    question = re.search('([^:]+)', texts.description).group(1)
elif '\n' in texts.description:
    question = re.search('([^\n]+)', texts.description).group(1)
slugify_keyword = urllib.parse.quote_plus(question)
# print(slugify_keyword)
def crawl_result_urls():
    req = Request('https://google.com/search?q=' + slugify_keyword, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req).read()
    bs = BeautifulSoup(html, 'html.parser')
    results = bs.find_all('div', class_='ZINbbc')
    try:
        for result in results:
            link = result.find('a')['href']
            print(link)
            if 'url' in link:
                result_urls.append(re.search('q=(.*)&sa', link).group(1))
    except (AttributeError, IndexError) as e:
        pass
def get_result_details(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urlopen(req).read()
        bs = BeautifulSoup(html, 'html.parser')
        try:
            title = bs.find(re.compile('^h[1-6]$')).get_text().strip().replace('?', '').lower()
            # Set your path to the pdf directory
            filename = "/path/to/pdf_folder/" + title + ".pdf"
            if not os.path.exists(os.path.dirname(filename)):
                try:
                    os.makedirs(os.path.dirname(filename))
                except OSError as exc:
                    if exc.errno != errno.EEXIST:
                        raise
            with open(filename, 'w') as f:
                for line in bs.find_all('p')[:5]:
                    f.write(line.text + '\n')
        except AttributeError:
            pass
    except urllib.error.HTTPError:
        pass
def find_answer():
    # Set your path to the pdf directory
    df = pdf_converter(directory_path='/path/to/pdf_folder/')
    cdqa_pipeline = QAPipeline(reader='models/bert_qa.joblib')
    cdqa_pipeline.fit_retriever(df)
    query = question + '?'
    prediction = cdqa_pipeline.predict(query)
    # print('query: {}\n'.format(query))
    # print('answer: {}\n'.format(prediction[0]))
    # print('title: {}\n'.format(prediction[1]))
    # print('paragraph: {}\n'.format(prediction[2]))
    return prediction[0]
crawl_result_urls()
for url in result_urls[:3]:
    get_result_details(url)
    sleep(5)
answer = find_answer()
print('Answer: ' + answer)
It can get confused sometimes, but generally it is okay, I think. At least I can pass the exam with 60% correct answers :D
Alright Devs! Please tell me in the comments what you think about it. Actually, it would be better to iterate through all the questions at once so I don't need to take a picture of each question, but unfortunately I don't have enough time to do that, so it is better to leave it for next time.
Check Reverse Python for more cool content.
References:
How to create your own Question-Answering system easily with python
Stay Connected!