How to Use AWS Textract with S3

by Songtham Tung, June 13th, 2019


This article demonstrates how to use AWS Textract to extract text from scanned documents in an S3 bucket.

This goes beyond Amazon's documentation, whose examples only process a single image. Included in this blog is a sample code snippet using Boto3, the AWS SDK for Python, to help you get started quickly.

Definitions

  • Textract is a service that automatically extracts text and data from scanned documents.
  • Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

Code

#!/usr/bin/env python3

# Detects text in every document stored in an S3 bucket.
import boto3
from time import sleep
import pandas as pd


if __name__ == "__main__":

    bucket='your_bucket_name'
    ACCESS_KEY='your_access_key'
    SECRET_KEY='your_secret_key'
    
    client = boto3.client('textract', 
                          region_name='your_region', 
                          aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY)
    
    s3 = boto3.resource('s3',  
                      aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)
    
    your_bucket = s3.Bucket(bucket)

    extracted_data = []
    for s3_file in your_bucket.objects.all():
        print(s3_file)
        
        # use textract to process s3 file
        response = client.detect_document_text(
            Document={'S3Object': {'Bucket': bucket, 'Name': s3_file.key}})
        
        blocks=response['Blocks']

        for block in blocks:
            # PAGE blocks carry no 'Text'; LINE and WORD blocks do
            if block['BlockType'] != 'PAGE':
                confidence = "{:.2f}".format(block['Confidence']) + "%"
                print('Detected: ' + block['Text'])
                print('Confidence: ' + confidence)

                # Example case where you want to extract words with #
                if "#" in block['Text']:
                    for word in block['Text'].split():
                        if "#" in word:
                            extracted_data.append({"word": word, "file": s3_file.key, "confidence": confidence})
        
        # sleep 2 seconds to prevent ProvisionedThroughputExceededException
        sleep(2)

    df = pd.DataFrame(extracted_data)
    df = df.drop_duplicates()
    df.to_csv('output.csv')
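
The hashtag example in the loop above can also be factored into a small function that takes a Textract response and returns the matching words, which makes the parsing logic testable without calling AWS. The following is a minimal sketch under two assumptions that are not in the original script: the function name `extract_tagged_words` is hypothetical, and only LINE blocks are scanned so that a word is not reported twice (once from its line and once from its own WORD block). The `sample_response` dictionary is made-up data shaped like a `detect_document_text` response.

```python
def extract_tagged_words(response, file_key, marker="#"):
    """Return one record per word containing `marker` in a Textract response.

    Hypothetical helper: only LINE blocks are scanned (an assumption),
    so words are not counted again from their WORD blocks.
    """
    records = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = "{:.2f}%".format(block["Confidence"])
        for word in block["Text"].split():
            if marker in word:
                records.append(
                    {"word": word, "file": file_key, "confidence": confidence}
                )
    return records


# Made-up response, shaped like the output of detect_document_text
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #1042 due", "Confidence": 99.1234},
        {"BlockType": "WORD", "Text": "#1042", "Confidence": 98.5},
        {"BlockType": "LINE", "Text": "Thank you", "Confidence": 97.0},
    ]
}

records = extract_tagged_words(sample_response, "scan-001.png")
print(records)
```

Inside the script, the inner loop could then be replaced by something like `extracted_data.extend(extract_tagged_words(response, s3_file.key))`.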

Closing

Textract is a capable OCR (optical character recognition) tool. By automating the tedious and error-prone task of manual data entry, it can save your team many hours of work.
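
One practical note on the script: the fixed two-second pause is a simple way to stay under Textract's request limits, but for larger buckets it may be faster to retry with exponential backoff only when a call is actually throttled. Below is a minimal sketch of that idea. `call_with_backoff`, `error_code`, and their parameters are hypothetical names, not part of Boto3, and the sketch assumes the throttling error exposes its code at `exc.response['Error']['Code']`, as `botocore.exceptions.ClientError` does. To keep the example runnable without AWS, it is demonstrated with a stand-in exception class.

```python
import time

# Error codes treated as "slow down and retry" (an assumption for this sketch)
THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException"}


def error_code(exc):
    """Read the AWS error code from a botocore-style exception, if present."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return None
    return response.get("Error", {}).get("Code")


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on throttling errors."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not throttling, or the final failure
            if error_code(exc) not in THROTTLE_CODES or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for botocore.exceptions.ClientError so the demo runs offline
class FakeClientError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


calls = {"count": 0}
delays = []  # records requested delays instead of actually sleeping


def flaky_detect():
    """Fails twice with a throttling error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeClientError("ProvisionedThroughputExceededException")
    return {"Blocks": []}


result = call_with_backoff(flaky_detect, sleep=delays.append)
print(result, delays)
```

In the script, the Textract call could be wrapped as `call_with_backoff(lambda: client.detect_document_text(Document={'S3Object': {'Bucket': bucket, 'Name': s3_file.key}}))`, with the `sleep(2)` line removed.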