Build a Document Analyzer with ChatGPT, Google Cloud, and Python

Written by shanglun | Published 2023/08/05
Tech Story Tags: openai | chatgpt | google-cloud-platform | translation | macroeconomics | artificial-intelligence | hackernoon-top-story | web-development

TL;DR: We use ChatGPT, Google Cloud, and React JS to build a powerful AI application that analyzes documents in any language and can answer any question about the document. A great resource for researchers and coders alike.

When OpenAI released ChatGPT to the general public, few people, including executives at OpenAI itself, could anticipate how quickly it would be adopted. Since then, ChatGPT has unseated TikTok as the fastest app to reach 100 million users. People from all walks of life have found ways to use ChatGPT to improve their efficiency, and companies have scrambled to develop guidelines on its use. Some organizations, including many academic institutions, have been mostly skeptical about its use, while other organizations like technology companies have adopted a much more liberal policy, even creating applications around the ChatGPT API. Today, we will walk through building one such application.

Target Audience

This article is broken down into three parts: 1) the explanation of technologies underlying the application, 2) the back-end of the application, and 3) the front-end of the application. If you can read some basic Python code, you should be able to follow the first two sections easily, and if you have some basic experience with React.js, you can follow the third section without problem.

The Application

The application we’re building today will be useful to anyone who regularly does research using foreign language sources. A good example would be macroeconomists who often have to read through government reports published in foreign languages. Sometimes these reports can be copy-pasted into machine translation services, but occasionally they are published in the form of non-searchable PDFs. In those cases, the researcher will need to engage human translators, but resource constraints significantly limit the number of reports that can be translated. To further compound the problem, these reports can be very long and tedious to read, which makes translation and analysis costly and time-consuming.

Our application will make this process easier by combining several AI and machine learning tools at our disposal - OCR, Machine Translation, and Large Language Models. We will extract the raw text content from a PDF using OCR, translate it into English using machine translation, and analyze the translated extraction using a large language model.

For today’s application we will look at a PDF publication from the Japanese government, the Innovation White Paper from the Ministry of Education, Culture, Sports, Science, and Technology. While the PDF itself is searchable and can be copied into a translation engine, we will be acting as if the PDF is unsearchable in order to showcase the technologies used in the application. The original document can be found here.

If you just want to build the app now, feel free to skip the next section. However, if you want to get a better understanding of the various technologies we will be using in this article, the next section will give you a bit of background.

Underlying Technologies

The first technology we will use is OCR, or Optical Character Recognition, which is one of the earliest commercial machine learning applications to become available to the general public. OCR models and applications aim to take a picture or image and then identify and extract the textual information from it. This may seem like a simple task at first, but the problem is actually quite complex. For example, the letters may be slightly blurry, making a positive identification difficult. The letters may also be arranged in an unusual orientation, meaning the machine learning model has to identify vertical and upside-down text. Despite these challenges, researchers have developed many fast and powerful OCR models, many of which are available at a relatively low cost. For today’s application, we will use Google’s Cloud Vision model, which we can access using the Google Cloud API.

The next technology we will use is machine translation. This, like OCR, is an extremely difficult machine learning problem. Human language is full of idiosyncrasies and contextual intricacies that make it especially difficult for computers to process and understand. Translation between dissimilar language pairs like Chinese and English tends to yield particularly inaccurate and humorous results, because the inherently dissimilar structures of these languages require vastly different strategies for digitization and embedding. Despite these challenges, however, researchers have developed powerful and sophisticated models and have made them generally available. Today, we will be using Google’s Translation API, one of the best and most widely used machine translation tools available.

The last machine learning technology we will use is the LLM, or Large Language Model, which has been revolutionary for consumer artificial intelligence. The LLM is able to understand the structure of natural human language and draws upon a large body of data to produce detailed and informative responses. There are still many limitations to the technology, but its flexibility and data processing capabilities have inspired the creation of many novel techniques for engaging with the model. One such technique is called Prompt Engineering, where users craft and tweak skillfully worded and structured inputs, or prompts, to the model to get a desired result. In today’s application, we will use ChatGPT’s API and some simple prompt engineering to help us analyze the translated report.

The Backend

Cloud Services Setup

Before we begin coding the application, we will need to first sign up for the services.

Because ChatGPT’s API website is always changing, we will not be able to provide the exact steps to sign up for the ChatGPT API. However, you should find easy-to-follow instructions on the API documentation website. Simply progress until you obtain an API key, which we will need to call the ChatGPT API.

Google Cloud is slightly more complicated, but it is also relatively simple to sign up. Simply head to the Google Cloud console and follow the instructions for setting up a project. Once in the project, you will want to navigate to the IAM & Admin console and create a service account. While the Google Cloud console is changing all the time, you should be able to navigate the interface by simply searching for “IAM” and “Service Account” on the webpage. Once the service account is created, you will want to download a copy of the private key file to your computer. You will also want to create and copy an API key, since the Translation REST API calls we make later authenticate with an API key rather than the key file.

Before we wrap up the Google Cloud setup, you will want to enable the Cloud Vision API, which you can do from the main console page. Simply search for the Cloud Vision API, click on the product from Google, and activate the API. You will also want to create a Cloud Storage bucket to hold the data we’re going to use for this project.
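
If you prefer to do this step in code, here is a minimal sketch of creating the bucket from Python rather than the console (this assumes the google-cloud-storage package we install in the next section and credentials with storage permissions; bucket names must be globally unique, and 'your-unique-bucket-name' is a placeholder):

from google.cloud import storage

# create a globally unique bucket to hold the PDF uploads and OCR output
client = storage.Client()
bucket = client.create_bucket('your-unique-bucket-name')
print('Created bucket:', bucket.name)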

Python Setup

Now that we have signed up for the proper services, we’re ready to begin coding our application in Python. First things first, we will want to install the requisite packages to our Python environment.

pip install google-cloud-storage google-cloud-vision openai requests

Once the installation is complete, let’s create a new folder, download the PDF file, and create a new Python file in the same folder. We’ll call it document_analyze.py. We start by importing the necessary packages:

import requests
import openai
from google.cloud import vision
from google.cloud import storage
import os
import json
import time
import shutil

We can then do some basic setup so our application can use the cloud services we just signed up for:

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = [Path to your Google Cloud key file]
openai.api_key = [your openAI API key]
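
As an aside, hardcoding the OpenAI key is fine for experimentation, but if you plan to share or commit this code, a safer pattern is to read the key from an environment variable (the variable name OPENAI_API_KEY here is our own choice, not something the library requires):

# read the key from the environment so it never lands in source control
openai.api_key = os.environ['OPENAI_API_KEY']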

With these credentials in place, you should now be able to access Google Cloud and ChatGPT APIs from your Python script. We can now write the functions that will provide the desired functionality for our app.

OCR Code

Now we can start to build some of the functions that will become the building blocks of the application. Let’s start with the OCR functions:

# upload the file to google cloud storage bucket for further processing
def upload_file(file_name, bucket_name, bucket_path):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(bucket_path)
    blob.upload_from_filename(file_name)

# kick off the OCR process on the document. This is asynchronous because OCR can take a while
def async_detect_document(gcs_source_uri, gcs_destination_uri):
    client = vision.ImageAnnotatorClient()
    input_config = vision.InputConfig(gcs_source=vision.GcsSource(uri=gcs_source_uri), mime_type= 'application/pdf')
    output_config = vision.OutputConfig(
       gcs_destination=vision.GcsDestination(uri=gcs_destination_uri), 
       batch_size=100
    )
    async_request = vision.AsyncAnnotateFileRequest(
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)], 
        input_config=input_config, output_config=output_config
    )
    operation = client.async_batch_annotate_files(requests=[async_request])

# check the cloud storage bucket to see whether the OCR output files have been written yet
def check_results(bucket_path, prefix):
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_path)
    blob_list = list(bucket.list_blobs(prefix=prefix))
    output_blobs = [b for b in blob_list if 'output-' in b.name and '.json' in b.name]
    return len(output_blobs) != 0


# download the OCR result file
def write_to_text(bucket_name, prefix):
    bucket = storage.Client().get_bucket(bucket_name)
    blob_list = list(bucket.list_blobs(prefix=prefix))
    if not os.path.exists('ocr_results'):
        os.mkdir('ocr_results')
    for blob in blob_list:
        if blob.name.endswith('.json'):
            with open(os.path.join('ocr_results', blob.name), 'w') as fp_data:
                fp_data.write(blob.download_as_string().decode('utf-8'))

# remove the processed files from the bucket so we don't pay for storage we no longer need
def delete_objects(bucket, prefix):
    bucket = storage.Client().get_bucket(bucket)
    blob_list = list(bucket.list_blobs(prefix=prefix))
    for blob in blob_list:
        blob.delete()
        print('Blob', blob.name, 'Deleted')

Let’s examine what each function does in detail.

The upload_file function grabs a bucket from Google Cloud Storage and uploads your file to it. Google Cloud’s excellent abstractions make it very easy to write this function.

The async_detect_document function asynchronously invokes Google Cloud’s OCR function. Because of the number of options available in Google Cloud, we have to instantiate a few configuration objects, but they are really just letting Google Cloud know where the source file is and where the output should be written. The batch_size variable is set to 100, so Google Cloud will process the document 100 pages at a time. This cuts down on the number of output files that get written, which makes processing easier. Another important thing to note is that the invocation is asynchronous, meaning the execution of the Python script will continue instead of waiting for the processing to finish. While this doesn’t make a big difference at this particular stage, it will become more useful later when we turn the Python code into a web API.
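
As an aside, if you do not need the asynchronous behavior, say for a one-off script, the operation object returned by async_batch_annotate_files can simply be waited on. A minimal sketch of this blocking variant, which the rest of the article does not use:

# blocking variant: wait for the OCR operation to finish before continuing
operation = client.async_batch_annotate_files(requests=[async_request])
operation.result(timeout=600)  # raises an exception if OCR takes longer than 10 minutes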

The check_results function is a simple cloud storage function to check if the processing is done. Because we are invoking the OCR function asynchronously, we need to call this function periodically to see if the result file is present. If there is a result file, the function will return true, and we can continue with the analysis. If there is no result file, the function will return false and we will continue to wait until the processing finishes.

The write_to_text function downloads the result file(s) to disk for further processing. The function will iterate over all of the files in your bucket with a particular prefix, retrieve the output string, and write the result to the local file system.

The delete_objects function, while not strictly relevant to the OCR, cleans up the uploaded files so that the system doesn’t keep unnecessary artifacts in Google Cloud Storage.

Now that we’re done with the OCR invocations, let’s look at the machine translation code!

Machine Translation Code

Now we can define the translation functions:

# detect the language we’re translating from
def detect_language(text):
    url = 'https://translation.googleapis.com/language/translate/v2/detect'
    data = {
        "q": text,
        "key": [your google cloud API key]
    }
    res = requests.post(url, data=data)
    return res.json()['data']['detections'][0][0]['language']

# translate the text
def translate_text(text):
    url = 'https://translation.googleapis.com/language/translate/v2'
    language = detect_language(text)
    if language == 'en':
        return text
    data = {
        "q": text,
        "source": language,
        "target": "en",
        "format": "text",
        "key": [your google cloud API key]
    }
    res = requests.post(url, data=data)
    return res.json()['data']['translations'][0]['translatedText']

These functions are rather straightforward. The detect_language function calls the language detection API to determine the source language for the subsequent translate_text call. While we know that the PDF is written in Japanese, it is still best practice to run language detection so that the application can handle other languages. The translate_text function simply uses the Google Translation API to translate the text from the detected source language to English, although, if it determines that the source language is already English, it will skip the translation.
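
As a quick sanity check, you can try these functions on a short piece of Japanese text before running them against the full OCR output (the sample string is roughly the Japanese title of the white paper; any non-English text will do):

# expect 'ja' as the detected language and an English rendering of the title
sample = '科学技術・イノベーション白書'
print(detect_language(sample))
print(translate_text(sample))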

ChatGPT Code and Prompt Engineering

Lastly, we have the invocations to ChatGPT:

def run_chatgpt_api(report_text):
    completion = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[
            {"role": "user", "content": '''
        Consider the following report:
        ---
        %s
        ---
        1. Summarize the purpose of the report.
        2. Summarize the primary conclusion of the report.
        3. Summarize the secondary conclusion of the report
        4. Who is the intended audience for this report?
        5. What other additional context would a reader be interested in knowing?
Please reply in json format with the keys purpose, primaryConclusion, secondaryConclusion, intendedAudience, and additionalContextString.
            ''' % report_text},
          ]
        )
    return completion.choices[0]['message']['content']

Notice that the Python call is a relatively simple API call, but the prompt is written in a way to produce specific results:

  1. The prompt provides the text of the report as context, so ChatGPT can analyze the report easily. The text is demarcated using dash lines, making it easy to recognize where the report ends and where the questions start.

  2. The questions are enumerated rather than stated in a paragraph format. The response is therefore likely to follow a similar structure. The enumerated structure makes the result much easier to parse with code than if ChatGPT replied in paragraph format.

  3. The prompt specifies the format of the reply, in this case JSON format. JSON format is very easy to process with code.

  4. The prompt specifies the keys of the JSON object, and chooses keys that are very easy to associate with the questions.

  5. The keys also use a commonly used convention (camelCase) that ChatGPT should recognize. The JSON keys are full words instead of abbreviations. This makes it more likely that ChatGPT will use the actual key in the response, as ChatGPT has a habit of doing “spelling corrections” as part of its processing.

  6. The additionalContextString provides an outlet for ChatGPT to pass additional information. This takes advantage of the freeform analysis ability of large language models.

  7. The prompt uses phrasing often found in technical discussions of the subject at hand. As you may have surmised from the structure of the API call, the “true” goal of ChatGPT is not necessarily to provide an answer to the prompt, but rather to predict the next line in a dialogue.

    Therefore, if your prompt is phrased like a line from a surface level discussion, you will likely get a surface level answer, whereas if your prompt is phrased like a line from an expert discussion, you’re more likely to receive an expert result. This effect is especially pronounced for subjects like math or technology, but is relevant here as well.

The above are basic techniques in a new body of work called “prompt engineering”, where the user structures their prompts to get a specific result. With large language models like ChatGPT, careful crafting of the prompt can result in dramatic increases in the effectiveness of the model. There are many similarities to coding, but prompt engineering requires a lot more intuition and fuzzy reasoning on the part of the engineer. As tools like ChatGPT become more embedded in the workplace, engineering will no longer be a purely technical endeavor but also a philosophical one, and engineers should take care to develop their intuition and a “theory of mind” for large language models in order to boost their efficiency.
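
Even with a carefully engineered prompt, the model’s reply is not guaranteed to be perfectly formed JSON every time, so it is worth parsing the output defensively before using it. Here is a minimal sketch (the parse_analysis helper is our own addition, not part of the OpenAI API):

import json

# parse the ChatGPT reply, tolerating stray prose around the JSON object
def parse_analysis(reply):
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # fall back to extracting the outermost {...} block from the reply
        start, end = reply.find('{'), reply.rfind('}')
        if start == -1 or end == -1:
            raise
        return json.loads(reply[start:end + 1])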

Putting It All Together

Now let’s put it all together. If you followed the previous section, the below code block should be fairly straightforward to understand.

bucket = [Your bucket name]
upload_file_name = [the name of the PDF in the Google Cloud bucket]
upload_prefix = [the prefix to use for the analysis results]
pdf_to_analyze = [path to the PDF to analyze]

if __name__ == '__main__':
    upload_file(pdf_to_analyze, bucket, upload_file_name)
    async_detect_document(f'gs://{bucket}/{upload_file_name}', f'gs://{bucket}/{upload_prefix}')
    while not check_results(bucket, upload_prefix):
        print('Not done yet... checking again')
        time.sleep(5)

    write_to_text(bucket, upload_prefix)
    all_responses = []
    for result_json in os.listdir('ocr_results'): 
        with open(os.path.join('ocr_results', result_json)) as fp_res:
            response = json.load(fp_res)
        all_responses.extend(response['responses'])
    texts = [a['fullTextAnnotation']['text'] for a in all_responses]

    translated_text = [translate_text(t) for t in texts]
    print('Running cleanup...')
    delete_objects(bucket, upload_file_name)
    delete_objects(bucket, upload_prefix) 
    shutil.rmtree('ocr_results')

    print('Running Analysis...')
    analysis = run_chatgpt_api('\n'.join(translated_text))
    analysis_res = json.loads(analysis)

    print('=== purpose ====')
    print(analysis_res['purpose'])
    print()
    print('==== primary conclusion =====')
    print(analysis_res['primaryConclusion'])
    print()
    print('==== secondary conclusion =====')
    print(analysis_res['secondaryConclusion'])
    print()
    print('==== intended audience ====')
    print(analysis_res['intendedAudience'])
    print()
    print('===== additional context =====')
    print(analysis_res['additionalContextString'])

We upload the report to the bucket, kick off the OCR, and wait for the OCR to finish. We then download the OCR results and put them into a list. We translate the result list, send it to ChatGPT for analysis, and print out the result of the analysis.

Because AI tools are not deterministic, if you run this code you will likely get a similar but not identical result. However, here is what I got as the output, using the PDF linked above:

Not done yet... checking again
Not done yet... checking again
Running cleanup...
Blob [pdf name] Deleted
Blob [OCR Output Json] Deleted
Running Analysis...
=== purpose ====
The purpose of the report is to analyze the current status and issues of Japan's research capabilities, discuss the government's growth strategy and investment in science and technology, and introduce the initiatives towards realizing a science and technology nation.

==== primary conclusion =====
The primary conclusion of the report is that Japan's research capabilities, as measured by publication index, have been declining internationally, raising concerns about a decline in research capabilities.

==== secondary conclusion =====
The secondary conclusion of the report is that the Kishida Cabinet's growth strategy emphasizes becoming a science and technology nation and strengthening investment in people to achieve growth and distribution.

==== intended audience ====
The intended audience for this report is government officials, policymakers, researchers, and anyone interested in Japan's research capabilities and science and technology policies.

===== additional context =====
The report focuses on the importance of science, technology, and innovation for Japan's future development and highlights the government's efforts to promote a 'new capitalism' based on a virtuous cycle of growth and distribution. It also mentions the revision to the Basic Law on Science, Technology, and Innovation, the 6th Science, Technology, and Innovation Basic Plan, and the concept of Society 5.0 as the future society Japan aims for. The report suggests that comprehensive knowledge is necessary to promote science, technology, and innovation and emphasizes the importance of transcending disciplinary boundaries and utilizing diverse knowledge.

As a Japanese speaker, I can verify that the analysis is quite good! You can experiment on your own, supplying your own PDF and changing the questions to suit your needs. Now you have a powerful AI tool to translate and summarize any foreign language PDF you encounter. If you’re in the target audience for this article, I hope you’re feeling quite a bit of excitement from the possibilities that just opened up!

The full backend code can be found on my GitHub.

The Frontend

We already built a powerful app, but in its current form the user will have to know a bit of Python to use the application to its fullest extent. What if you had a user who doesn’t want to read or write any code? In that case, we will want to build a web application around this tool, so that people can access the full power of AI from the comfort of a browser.

Scaffolding the React Application

Let’s begin by creating a React application. Make sure Node is installed, navigate to the folder where you want the application code to live, and run the create-react-app script and install some basic packages:

npx create-react-app llm-frontend

And then:

cd llm-frontend
npm install bootstrap react-bootstrap axios

If we were developing a full-fledged application, we would also want to install packages to handle state management and routing, but that is out of the scope of this article. We will simply make edits to the App.jsx file.

Execute npm run start to start the development server, and your browser should open a page at http://localhost:3000. Keep that page open, and open the App.jsx file in your favorite text editor. You should see something like this:

import logo from './logo.svg';
import './App.css';

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <img src={logo} className="App-logo" alt="logo" />
        <p>
          Edit <code>src/App.js</code> and save to reload.
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  );
}

export default App;

Go ahead and delete the boilerplate code and replace it with some basic Bootstrap components.

import React, {useState, useEffect} from 'react';
import axios from 'axios';
import Container from 'react-bootstrap/Container';
import Row from 'react-bootstrap/Row';
import Col from 'react-bootstrap/Col';
import 'bootstrap/dist/css/bootstrap.min.css';

function App() {
  return (
    <Container>
      <Row>
        <Col md={{ span: 10, offset: 1 }}>
          Main Application Here
        </Col>
      </Row>
    </Container>
  );
}

export default App;

Save the app, and you should see it update in the browser. It will look quite sparse now, but not to worry, we will fix that soon.

Application Components

For this application to work, we will need four main components: a file selector to upload the file, a results display to display the translation and the summary, a text input so the user can ask their own questions, and a results display to display the answer to the user questions.

We can build out the simpler components and put placeholders in for the more complex interfaces for now. While we’re at it, let’s create the data containers that we will use to power the interface:

import React, {useState, useEffect} from 'react';
import axios from 'axios';
import Container from 'react-bootstrap/Container';
import Row from 'react-bootstrap/Row';
import Col from 'react-bootstrap/Col';
import Button from 'react-bootstrap/Button';
import Accordion from 'react-bootstrap/Accordion';
import Form from 'react-bootstrap/Form';
import ListGroup from 'react-bootstrap/ListGroup';
import 'bootstrap/dist/css/bootstrap.min.css';


const ResultDisplay = ({
    initialAnalysis, userQuestion, setUserQuestion,
    userQuestionResult, userQuestionMessage,
    userQuestionAsked
}) => {
    return <Row>
      <Col>
         <Row style={{marginTop: '10px'}}>
           <Col md={{ span: 10, offset: 1 }}>
              <Accordion defaultActiveKey="0">
                <Accordion.Item eventKey="0">
                  <Accordion.Header>Analysis Result</Accordion.Header>
                  <Accordion.Body>
                     {initialAnalysis.analysis}
                  </Accordion.Body>
                </Accordion.Item>
                <Accordion.Item eventKey="1">
                   <Accordion.Header>Raw Translated Text</Accordion.Header>
                   <Accordion.Body>
                     {initialAnalysis.translatedText}
                   </Accordion.Body>
                </Accordion.Item>
                <Accordion.Item eventKey="2">
                  <Accordion.Header>Raw Source Text</Accordion.Header>
                  <Accordion.Body>
                    {initialAnalysis.rawText}
                  </Accordion.Body>
                </Accordion.Item>
              </Accordion>
            </Col>
          </Row>
          <Row style={{marginTop: '10px'}}>
            <Col md={{ span: 8, offset: 1 }}>
              <Form.Control type="text"
                 placeholder="Additional Questions"
                 value={userQuestion}
                 onChange={e => setUserQuestion(e.target.value)}
              />
            </Col>
            <Col md={{ span: 2 }}>
              <Button variant="primary">Ask</Button>
            </Col>
          </Row>
          <Row><Col>{userQuestionMessage}</Col></Row>
          <Row style={{marginTop: '10px'}}>
            <Col md={{span: 10, offset: 1}}>
              {userQuestionResult && userQuestionAsked ?         <ListGroup>
                <ListGroup.Item>
                  <div><b>Q:</b> {userQuestionAsked}</div>
                  <div><b>A:</b> {userQuestionResult}</div></ListGroup.Item>
              </ListGroup>: ''}
            </Col>
          </Row>
        </Col>
      </Row>
}


function App() {
  const [file, setFile] = useState(null);
  const [haveFileAnalysisResults, setHaveFileAnalysisResults] = useState(false);
  const [message, setMessage] = useState('');
  const [userQuestionMessage, setUserQuestionMessage] = useState('');
  const [initialAnalysis, setInitialAnalysis] = useState({analysis: '', translatedText: '', rawText: ''});
  const [userQuestion, setUserQuestion] = useState('');
  const [userQuestionResult, setUserQuestionResult] = useState('');
  const [userQuestionAsked, setUserQuestionAsked] = useState('');


return (
<Container>
  <Row>
    <Col md={{ span: 8, offset: 1 }}>
      <Form.Group controlId="formFile">
        <Form.Label>Select a File to Analyze</Form.Label>
        <Form.Control type="file" onChange={e => setFile(e.target.files[0])} />
      </Form.Group>
    </Col>
    <Col md={{span: 2}}>
      <div style={{marginTop: '30px'}}><Button variant="primary" >Analyze</Button></div>
    </Col>
    <Col md={12}>{message}</Col>
  </Row>
  {haveFileAnalysisResults? <ResultDisplay
    initialAnalysis={initialAnalysis}
    userQuestion={userQuestion}
    setUserQuestion={setUserQuestion}
    userQuestionResult={userQuestionResult}
    userQuestionMessage={userQuestionMessage}
    userQuestionAsked={userQuestionAsked}
  />: ''}
  </Container>);
}

export default App;

There is quite a bit of new code here, but nothing complex or groundbreaking: it is just a basic data entry and display interface built from React-Bootstrap components.

Save the file, and you should see your browser update to show the file upload interface.

Play around with the state variables (for example, temporarily set haveFileAnalysisResults to true) and you should see the results interface render as well. This will be the frontend interface of our application.

Now that we have written the basic frontend interface, let’s write the functions that will connect the application to our (not yet written) API. These functions will all be defined inside the App component so they have access to all the React hooks. If you are not entirely sure where these functions should go, you can refer to the full code hosted on GitHub.

First, let’s write a couple of utility functions for passing messages to the user.

const flashMessageBuilder = (setMessage) => (message) => {
    setMessage(message);
    setTimeout(() => {
      setMessage('');
    }, (5000));
  }

  const flashMessage = flashMessageBuilder(setMessage);
  const flashUserQuestionMessage = flashMessageBuilder(setUserQuestionMessage);

As you can see, these are simple functions that display a message at the appropriate place and set a timer to remove the message after 5 seconds. This is a simple UI feature but makes the app feel much more dynamic and usable.

Next, let’s write the functions to analyze the file and check for results.

  const pollForResults = (batchId) => {
    flashMessage('Checking for results...');
    return new Promise((resolve, reject) => {
      setTimeout(() => {
        axios.post('http://localhost:5000/check_if_finished', {batchId})
          .then(r => r.data)
          .then(d => {
            // the result should have a "status" key and a "result" key. 
            if (d.status === 'complete') {
              resolve(d); // we're done!
            } else {
              resolve(pollForResults(batchId)); // wait 5 seconds and try again.
            }
          }).catch(e => reject(e));
      }, 5000);
    })
  }
  
  const analyzeFile = () => {
    if (file === null) {
      flashMessage('No file selected!');
      return;
    }
    flashMessage('Uploading file...');
    const formData = new FormData();
    formData.append("file", file);
    axios.post("http://localhost:5000/analyze_file", formData, {
        headers: {
          'Content-Type': 'multipart/form-data'
        }
    }).then(r => r.data)
    .then(d => {
      // the result should contain a batchId that we use to poll for results.
      flashMessage('File upload success, waiting for analysis results...');
      return pollForResults(d.batchId);
    })
    .then(({analysis, translatedText, rawText}) => {
      // the result should contain the initial analysis results with the proper format.
      setInitialAnalysis({analysis, translatedText, rawText});
      setHaveFileAnalysisResults(true); // show the results display now that we have results
    })
    .catch(e => {
      console.log(e);
      flashMessage('There was an error with the upload. Please check the console for details.');
    })
  }

Again a pretty simple set of functions. The analyzeFile function sends the file to the analyze_file endpoint for analysis. The API will give it a batch ID which it uses to check for results with the pollForResults function. The pollForResults function will hit the check_if_finished endpoint and return the results if the analysis is finished, or wait 5 seconds and try again if the analysis is still processing. The analyzeFile “thread” will then continue to execute, putting the data into the appropriate places. Remember to wire this up to the interface by adding onClick={analyzeFile} to the Analyze button.

Lastly, let’s write the function that lets the user ask freeform questions:

  const askUserQuestion = () => {
    flashUserQuestionMessage('Asking user question...')
    axios.post('http://localhost:5000/ask_user_question', {
      text: initialAnalysis.translatedText,
      userQuestion
    }).then(r => r.data)
    .then(d => {
      setUserQuestionResult(d.result);
      setUserQuestionAsked(userQuestion);
    }).catch(e => {
      console.log(e);
      flashUserQuestionMessage('There was an issue asking the question. Please check the console for details');
    });
  }

Again, a fairly simple function. We provide the translated text along with the user question so our API can construct the ChatGPT prompt. The result is then pushed to the appropriate data containers for display. To hook this up, pass askUserQuestion into ResultDisplay as a prop and attach it to the Ask button with onClick={askUserQuestion}.

We’re pretty much done with the React app, but before we move onto coding the API, let’s make one more cosmetic change. Right now, the analysis result display is configured to display the ChatGPT analysis as a string. However, the ChatGPT analysis is actually a JSON data object, so to properly display it for human use, we will want to add some formatting to the display object. Replace the first Accordion item with the following code:

  <Accordion.Item eventKey="0">
     <Accordion.Header>Analysis Result</Accordion.Header>
     <Accordion.Body>
        <h6>Purpose</h6>
        <p>{initialAnalysis.analysis.purpose}</p>
        <h6>Primary Conclusion</h6>
        <p>{initialAnalysis.analysis.primaryConclusion}</p>
        <h6>Secondary Conclusion</h6>
        <p>{initialAnalysis.analysis.secondaryConclusion}</p>
        <h6>Intended Audience</h6>
        <p>{initialAnalysis.analysis.intendedAudience}</p>
        <h6>Additional Context</h6>
        <p>{initialAnalysis.analysis.additionalContextString}</p>
     </Accordion.Body>
  </Accordion.Item>

Now the frontend is done, let’s go to our Python code and build the backend.

Flask API

First, let’s install Flask, which we will use to write our backend.

pip install flask flask-cors

Flask is a simple framework for building web applications and web APIs. The interface is incredibly simple, and getting a server running is as easy as:

from flask import Flask, request, jsonify
from flask_cors import CORS


app = Flask(__name__)
CORS(app)


@app.route('/')
def hello_world():
    return "Hello from flask!"


if __name__ == '__main__':
    app.run(debug=True)

Run this file and navigate to http://localhost:5000 in your browser, and you should see the “Hello from flask!” message.

Now we can begin building the API functionality. Let’s begin by importing the required functions and defining some constants:

from flask import Flask, request, jsonify
from flask_cors import CORS 
import uuid
import os
import json
from document_analyze import upload_file, async_detect_document, check_results, \
    write_to_text, translate_text, delete_objects, run_chatgpt_api,\
    ask_chatgpt_question


app = Flask(__name__)
CORS(app)
BUCKET = '[YOUR BUCKET NAME]'

This code assumes your server code is in the same folder as the document_analyze.py file we wrote earlier, but you can choose any directory structure you like, as long as the server code can find and import from document_analyze.py. Let’s write the handler for the file upload endpoint:

@app.route('/analyze_file', methods=['POST'])
def analyze_file():
    file_to_analyze = request.files['file']
    batch_name = str(uuid.uuid4())
    local_file_path = 'uploads/%s.pdf' % batch_name
    cloud_file_path = '%s.pdf' % batch_name
    # make sure the local uploads folder exists before saving the file
    os.makedirs('uploads', exist_ok=True)
    file_to_analyze.save(local_file_path)
    upload_file(local_file_path, BUCKET, cloud_file_path)
    async_detect_document(
        f'gs://{BUCKET}/{cloud_file_path}',
        f'gs://{BUCKET}/{batch_name}')

    return jsonify({
        'batchId': batch_name
    })

As you can see, this function takes the uploaded file, sends it over to Google Cloud Storage, and kicks off the OCR process. It should look pretty familiar, but there is one small change worth pointing out: the file is identified by a UUID that also serves as the batch name. This avoids potential collision issues that could arise from the API being called multiple times, and it also uniquely identifies all of the files used in a particular analysis batch, making it easier to check for progress and to perform cleanup down the line.

Let’s now write the handler that lets the app check if the analysis is finished.

@app.route('/check_if_finished', methods=['POST'])
def check_if_finished():
    batch_name = request.json['batchId']
    if not check_results(BUCKET, batch_name):
        return jsonify({
            'status': 'processing'
        })
    write_to_text(BUCKET, batch_name)
    all_responses = []
    for result_json in os.listdir('ocr_results'):
        if result_json.endswith('json') and result_json.startswith(batch_name):
            result_file = os.path.join('ocr_results', result_json)
            with open(os.path.join('ocr_results', result_json)) as fp_res:
                response = json.load(fp_res)
            all_responses.extend(response['responses'])
            os.remove(result_file)
    txts = [a['fullTextAnnotation']['text'] for a in all_responses]
    translated_text = [translate_text(t) for t in txts]
    print('Running cleanup...')
    delete_objects(BUCKET, batch_name)
    os.remove('uploads/%s.pdf' % batch_name)
    analysis = run_chatgpt_api('\n'.join(translated_text))
    analysis_res = json.loads(analysis)
    return jsonify({
        'status': 'complete',
        'analysis': analysis_res,
        'translatedText': '\n'.join(translated_text),
        'rawText': '\n'.join(txts)
    })

Again this should look quite familiar. We first check if the OCR is done, and if the OCR is not done, we simply return a message saying the batch is still processing. If the OCR is done, we continue the analysis, downloading the OCR results and running the translation and ChatGPT pipeline. We also make sure to clean up the source files once the analysis is done, in order to avoid incurring unnecessary storage costs. We package the result into the final result object, which contains the ChatGPT analysis JSON, the translated text, and the raw text extracted by the OCR.

While the custom question backend is a new feature, it is fairly straightforward. First we will want to define the function to ask a custom question:

def ask_chatgpt_question(report_text, question_text):
    completion = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[
            {"role": "user", "content": '''
        Consider the following report:
        ---
        %s
        ---
        Answer the following question:
        %s
            ''' % (report_text, question_text)},
          ]
        )
    return completion.choices[0]['message']['content']

This function should live in document_analyze.py alongside the other helpers; notice that we already imported it at the top of the server file. Now we can define the API endpoint:


@app.route('/ask_user_question', methods=['POST'])
def ask_user_question():
    report_text = request.json['text']
    user_question = request.json['userQuestion']
    response = ask_chatgpt_question(report_text, user_question)
    return jsonify({
        'result': response
    })
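
Before moving back to the browser, you can exercise this endpoint directly from Python (a quick sketch that assumes the Flask server is running locally; the sample text and question are just placeholders):

import requests

# ask a test question against the locally running API
res = requests.post('http://localhost:5000/ask_user_question', json={
    'text': 'The report discusses the state of research in Japan...',
    'userQuestion': 'What is the primary conclusion of the report?'
})
print(res.json()['result'])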

Now that we have written our API, let’s test it out from the frontend. Go ahead and upload the file through the frontend, and wait a little while; you should see the analysis result come through on the web app. Look around and you will find the raw translated text and the raw source text in the other accordion sections as well. Finally, test the custom questions interface by asking your own question about the document.

Pretty nice! With that, we have successfully built a React.js application around our AI tools!

Conclusion

In today’s article, we built an application that leverages some of the most powerful AI tools currently on the market. While this specific application is geared towards parsing and summarizing foreign language PDFs, similar techniques can be adapted to develop powerful applications in many fields.

I hope this article inspired you to write AI-driven applications of your own. If you would like to reach out and talk to me about your perspective on AI applications, I would love to hear from you. If you’re looking to build a similar app, feel free to refer to the code hosted on my GitHub page, where you can find repositories for the Frontend and the Backend.

If you want to use this app without having to build it on your own, I have built and hosted a more sophisticated version of the application for general use. If you would like access to this application, please get in touch and we can provision access to the website.

Please be on the lookout for a follow-up article, where we build and run our own instance of an LLM. When it is published, we will add a link to it here. Stay tuned!


Written by shanglun | Quant, technologist, occasional economist, cat lover, and tango organizer.
Published by HackerNoon on 2023/08/05