This article demonstrates how to use AWS Textract to extract text from scanned documents in an S3 bucket. This goes beyond Amazon's documentation, where they only use examples involving one image. Included in this blog is a sample code snippet using the AWS Python SDK, Boto3, to help you quickly get started.

Definitions

Textract is a service that automatically extracts text and data from scanned documents.

Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

Code

    #!/usr/bin/env python3
    # Detects text in a document stored in an S3 bucket.
    import boto3
    import sys
    from time import sleep
    import math
    import pandas as pd

    if __name__ == "__main__":
        bucket = 'your_bucket_name'
        ACCESS_KEY = 'your_access_key'
        SECRET_KEY = 'your_secret_key'

        client = boto3.client('textract',
                              region_name='your_region',
                              aws_access_key_id=ACCESS_KEY,
                              aws_secret_access_key=SECRET_KEY)
        s3 = boto3.resource('s3',
                            aws_access_key_id=ACCESS_KEY,
                            aws_secret_access_key=SECRET_KEY)
        your_bucket = s3.Bucket(bucket)

        extracted_data = []
        for s3_file in your_bucket.objects.all():
            print(s3_file)
            # use textract to process s3 file
            response = client.detect_document_text(
                Document={'S3Object': {'Bucket': bucket, 'Name': s3_file.key}})
            blocks = response['Blocks']
            for block in blocks:
                if block['BlockType'] != 'PAGE':
                    print('Detected: ' + block['Text'])
                    print('Confidence: ' + "{:.2f}".format(block['Confidence']) + "%")
                    # Example case where you want to extract words with #
                    if ("#" in block['Text']):
                        words = block['Text'].split()
                        for word in words:
                            if ("#" in word):
                                extracted_data.append({"word": word,
                                                       "file": s3_file.key,
                                                       "confidence": "{:.2f}".format(block['Confidence']) + "%"})
            # sleep 2 seconds to prevent ProvisionedThroughputExceededException
            sleep(2)

        df = pd.DataFrame(extracted_data)
        df = df.drop_duplicates()
        df.to_csv('output.csv')

Closing

Textract is an amazing OCR (optical character recognition) tool. It can save your team countless man hours by automating the tedious and error-prone task of manual data entry.
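If you want to try the "#" filtering step without calling AWS, one option is to pull it into a small function and feed it a hand-written dictionary shaped like a `detect_document_text` response. The sketch below is only an illustration: the function name `extract_tagged_words`, the sample blocks, and the file name `scan-001.png` are all made up for this example. It also assumes you only want `WORD` blocks, since Textract returns the same text again inside `LINE` blocks.

```python
def extract_tagged_words(response, source, marker="#"):
    """Collect WORD blocks containing `marker` from a Textract-style response.

    `response` is expected to look like the dict returned by
    detect_document_text; `source` is any label for the file (hypothetical).
    """
    rows = []
    for block in response.get("Blocks", []):
        # WORD blocks hold single tokens; LINE blocks repeat the same text,
        # so skipping them avoids recording each match twice.
        if block.get("BlockType") != "WORD":
            continue
        text = block.get("Text", "")
        if marker in text:
            rows.append({
                "word": text,
                "file": source,
                "confidence": "{:.2f}%".format(block["Confidence"]),
            })
    return rows


# Hand-written sample data, not real Textract output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Order #1234 shipped", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Order", "Confidence": 99.5},
        {"BlockType": "WORD", "Text": "#1234", "Confidence": 98.76},
        {"BlockType": "WORD", "Text": "shipped", "Confidence": 99.0},
    ]
}

rows = extract_tagged_words(sample_response, "scan-001.png")
print(rows)
```

The returned list has the same `word`, `file`, and `confidence` keys as the script's `extracted_data` rows, so it can be passed to `pd.DataFrame` in the same way.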