This article demonstrates how to use AWS Textract to extract text from scanned documents in an S3 bucket. This goes beyond Amazon's documentation, whose examples involve only a single image. This post includes a sample code snippet that uses Boto3, the AWS SDK for Python, to help you get started quickly.

Definitions

Textract is a service that automatically extracts text and data from scanned documents.

Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

Code

    #!/usr/bin/env python3

    # Detects text in a document stored in an S3 bucket.
    import boto3
    from time import sleep
    import pandas as pd

    if __name__ == "__main__":

        # Placeholders: replace with your own bucket, credentials, and region.
        bucket = 'your_bucket_name'
        ACCESS_KEY = 'your_access_key'
        SECRET_KEY = 'your_secret_key'

        client = boto3.client('textract',
                              region_name='your_region',
                              aws_access_key_id=ACCESS_KEY,
                              aws_secret_access_key=SECRET_KEY)

        s3 = boto3.resource('s3',
                            aws_access_key_id=ACCESS_KEY,
                            aws_secret_access_key=SECRET_KEY)

        your_bucket = s3.Bucket(bucket)

        extracted_data = []

        for s3_file in your_bucket.objects.all():
            print(s3_file)

            # use textract to process s3 file
            response = client.detect_document_text(
                Document={'S3Object': {'Bucket': bucket, 'Name': s3_file.key}})

            blocks = response['Blocks']

            for block in blocks:
                if block['BlockType'] != 'PAGE':
                    print('Detected: ' + block['Text'])
                    print('Confidence: ' + "{:.2f}".format(block['Confidence']) + "%")

                    # Example case where you want to extract words with #
                    if "#" in block['Text']:
                        words = block['Text'].split()
                        for word in words:
                            if "#" in word:
                                extracted_data.append({
                                    "word": word,
                                    "file": s3_file.key,
                                    "confidence": "{:.2f}".format(block['Confidence']) + "%"})

            # sleep 2 seconds to prevent ProvisionedThroughputExceededException
            sleep(2)

        # Collect the matches into a table, drop repeated rows, and save as CSV.
        df = pd.DataFrame(extracted_data)
        df = df.drop_duplicates()
        df.to_csv('output.csv')

Closing

Textract is an amazing OCR (optical character recognition) tool. It can save your team countless hours by automating the tedious and error-prone task of manual data entry.
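The "#" filtering step in the script can also be pulled into a small function and tried without calling AWS. The sketch below is a variation on the script, not a copy of it: it inspects only WORD blocks, because `detect_document_text` returns both LINE and WORD blocks and a LINE repeats the text of its words, so a match can otherwise be recorded twice. The function name `extract_tagged_words`, the sample blocks, and the file name `scan-001.png` are made up for illustration and are not real Textract output.

```python
def extract_tagged_words(blocks, marker="#", source=None):
    """Return one row per WORD block whose text contains `marker`.

    `blocks` is expected to look like response['Blocks'] from
    detect_document_text. `source` is a label such as the S3 key.
    """
    rows = []
    for block in blocks:
        # LINE blocks repeat the text of their WORD blocks, so only WORD
        # blocks are inspected here to avoid recording each match twice.
        if block.get("BlockType") != "WORD":
            continue
        text = block.get("Text", "")
        if marker in text:
            rows.append({
                "word": text,
                "file": source,
                "confidence": "{:.2f}%".format(block["Confidence"]),
            })
    return rows


# Hand-written stand-in for response['Blocks'] (hypothetical data).
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #1042 due", "Confidence": 99.1},
    {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.5},
    {"BlockType": "WORD", "Text": "#1042", "Confidence": 98.76},
    {"BlockType": "WORD", "Text": "due", "Confidence": 99.0},
]

rows = extract_tagged_words(sample_blocks, source="scan-001.png")
print(rows)
```

Inside the script's loop over S3 objects, the same function could be called as `extract_tagged_words(response['Blocks'], source=s3_file.key)` and its result appended to `extracted_data` with `extend`.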


How to Use AWS Textract with S3
