How to Converse With PDF Files Using Computer Vision and Open-Source Language Models

Written by sumedhsees | Published 2023/07/06
Tech Story Tags: llms | generative-ai | open-source | open-source-software | large-language-models | automation | ai-applications | chatbots

TLDR: The advent of language models has ignited a revolution, reshaping the landscape of computer vision and text data interaction. With pre-trained language models at our disposal, the barriers to creating powerful applications with smaller datasets have been significantly lowered. This article delves into the significance of utilizing open-source language models, showcasing their immense potential for building exceptional chatbots without the need for extensive training data.

The advent of language models has ignited a revolution, reshaping the landscape of computer vision and text data interaction. With pre-trained language models at our disposal, the barriers to creating powerful applications with smaller datasets have been significantly lowered, exemplified by the vast spectrum of applications emerging from GPT-based models. Among these exciting applications lies the intriguing ability to engage in meaningful conversations with PDF documents.

This article delves into the significance of utilizing open-source language models, showcasing their immense potential for building exceptional chatbots without the need for extensive training data. We will explore how open-source language models empower developers to create chatbots even in data-limited scenarios, providing a comprehensive tutorial on building such chatbots using these accessible and powerful tools.

Why Use Open Source Language Models?

The utilization of open-source language models is essential due to the paramount concern for data security. By leveraging open-source models, organizations can build chatbots and other applications using private and sensitive documents as data, without the apprehension of data misuse by enterprise language models. Open-source models offer the advantage of providing access to the intricate layers of the neural network, enabling a deeper understanding of the model's inner workings. Additionally, the transparency of open-source models facilitates the identification of information pertaining to the source of the document, allowing for greater accountability and traceability.

Benefits of using Open Source Models

Pretrained open-source models offer significant benefits, having already been trained on extensive corpora. These models possess a comprehensive understanding of language context and tone, which makes them highly effective right out of the box. In many cases, fine-tuning is either minimal or not required at all, similar to the transfer learning techniques employed in computer vision. This eliminates the need for additional training efforts. Notably, considerable research has focused on the remarkable zero-shot learning capabilities of language models. Leveraging these advancements, it is often unnecessary to fine-tune or train a pre-trained model, as they demonstrate impressive performance even without further modifications. This efficiency and effectiveness make open-source models a valuable resource for various natural language processing tasks.

Dependencies:

Before embarking on building a chatbot, it is crucial to ensure that the necessary dependencies are in place. Three main components are needed to build a powerful chatbot:

Selecting a pre-trained language model - Firstly, selecting an open-source pre-trained language model is paramount. The Hugging Face Open LLM Leaderboard offers a wide array of language models to choose from, catering to different requirements. In this tutorial, we will opt for a lightweight model that can efficiently run on a GPU, eliminating the need for quantization.

Data Engineering - To streamline the process, we will utilize the LangChain framework, which simplifies data loading and embedding creation. By leveraging this framework, the complexities associated with data handling are reduced, enabling a more streamlined development experience.

Transformers - We will leverage the Hugging Face transformer framework, which provides a comprehensive set of tools for downloading and utilizing pre-trained language models. This framework offers seamless integration with various language models, ensuring a smooth implementation process for building the chatbot.

By having these dependencies in place, including the appropriate pre-trained language model, the LangChain framework, and the Hugging Face transformer framework, we lay the foundation for building a powerful and efficient chatbot solution.
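As a rough sketch of the setup, these dependencies can usually be installed with pip; the exact package list and versions depend on your environment (for example, faiss-gpu instead of faiss-cpu, and einops is needed for Falcon models loaded with trust_remote_code):

# Install the core libraries used in this tutorial (versions may vary)
pip install langchain transformers accelerate einops sentence-transformers faiss-cpu pypdf torch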

Tutorial: Let’s build a chatbot

The tutorial is divided into two parts:

  1. Generating a document embedding database

a. Loading the Data

from langchain.document_loaders import PyPDFDirectoryLoader # for loading all PDFs in a folder

pdf_folder_path = "YOUR FOLDER PATH"

#Use the LangChain document loader to load the PDF files in the folder
loader = PyPDFDirectoryLoader(pdf_folder_path)
pages = loader.load()

b. Data Preprocessing

from langchain.text_splitter import RecursiveCharacterTextSplitter

#Preprocessing: split the documents into overlapping text chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
text_chunks = text_splitter.split_documents(pages)

c. Creating the Document Embedding Database

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS


#Get the embedding Model
document_embeddings = HuggingFaceEmbeddings()
#Create a database
db = FAISS.from_documents(text_chunks, document_embeddings)
#Create a retriever that returns the top 3 most relevant chunks
retriever = db.as_retriever(search_kwargs={"k": 3})
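Optionally, the FAISS index can be saved to disk so the embeddings are not recomputed on every run; save_local and load_local are LangChain's FAISS persistence helpers, and the "faiss_index" folder name below is just an example:

#Optional: persist the vector store to disk
db.save_local("faiss_index")

#Later, reload it instead of re-embedding all the documents
db = FAISS.load_local("faiss_index", document_embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})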

  2. Dialogue Management

a. Downloading the Pre-trained LLM Model from Hugging Face

import torch
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, pipeline

model = "tiiuae/falcon-7b-instruct" #Provide the exact model id from Hugging Face

#Download the tokenizer for the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model)

#Create the text-generation pipeline (the model weights are downloaded here)
pipe = pipeline(
    "text-generation", #task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id
)

#Wrap the pipeline so it can be used as a LangChain LLM
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0.0, "device": "cuda"})

b. Create a RetrievalQA chain from LangChain to retrieve and answer questions

from langchain import PromptTemplate, LLMChain
from langchain.chains import RetrievalQA

#Create a RetrievalQA chain (chain_type="stuff" puts the retrieved chunks directly into the prompt)
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

Under the hood

The QA chain takes in a prompt and converts it into an embedding. The embedding is compared against the database of document chunks, and the closest, most relevant chunks are selected. The content of those chunks is then passed to the downloaded pre-trained language model, which generates the answer to the prompt/question that was asked.
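To make this concrete, here is a minimal sketch of roughly what happens, written against the retriever and llm objects created above (an illustration of the flow, not the chain's actual internal code):

#1. Embed the question and fetch the closest chunks from the FAISS index
question = "What are language models?"
relevant_docs = retriever.get_relevant_documents(question)

#2. "Stuff" the retrieved chunk contents into a single prompt
context = "\n\n".join(doc.page_content for doc in relevant_docs)
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

#3. Generate the answer with the pre-trained model
print(llm(prompt))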

Inference - Let's get answers

prompt = "What are language models"
answer = qa_chain(prompt)
print(answer)

Create a simple Chatbot

#Create an infinite loop to ask questions

while True:
  question = input("How can I help you? ")
  answer = qa_chain(question)
  print(answer)

The answer is a dictionary, and it is up to the user to parse it and display the relevant information.
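For reference, a RetrievalQA chain typically returns the original question under the "query" key and the generated text under the "result" key, so the loop above can print just the answer text:

#Print only the generated answer text from the result dictionary
answer = qa_chain("What are language models?")
print(answer["result"])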

Conclusion

In conclusion, the utilization of open-source pre-trained language models offers an exciting opportunity to interact with PDF documents in a more dynamic and intelligent manner. By leveraging these models, we can extract valuable insights, generate summaries, and engage in meaningful conversations with PDF content.

The transparency and collaborative nature of open source models ensure accountability, facilitate ongoing improvement and address potential biases and ethical concerns. Through the convergence of open-source language models and PDF interaction, we are unlocking new possibilities for efficient document analysis, research, and knowledge extraction. With the power of open source, we are paving the way for more accessible, adaptable, and intelligent conversations with PDFs, enabling us to extract greater value from these information-rich resources.

Complete Code - https://github.com/sumedhvdatar/pdf-gpt

Sources

  1. https://arxiv.org/pdf/2109.01652.pdf
  2. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
  3. https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf

The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt “language model”.


Written by sumedhsees | Software Engineer with deep interest in applied ML which includes neural networks and transformers.
Published by HackerNoon on 2023/07/06