Amazon recently released Textract in the Asia Pacific (Sydney) region, so I decided to write a JavaScript OCR demo using it.
Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
In this post, I show how to use Amazon Textract to extract text from scanned PDF files.
The following diagram shows the architecture of the process.
Before following this guide, install the AWS SAM CLI, then create an application with sample code using:
sam init -r nodejs12.x
This creates a SAM template file (template.yaml) in the project directory. Let’s define the resources in the template file as below:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Timeout: 60
Parameters:
  Stage:
    Type: String
    Default: dev
  BucketName:
    Type: String
    Default: aiyi.demo.textract
Resources:
  TextractSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      DisplayName: !Sub "textract-sns-topic"
      TopicName: !Sub "textract-sns-topic"
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt TextractEndFunction.Arn
  TextractSNSTopicPolicy:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref TextractEndFunction
      Principal: sns.amazonaws.com
      Action: lambda:InvokeFunction
      SourceArn: !Ref TextractSNSTopic
  TextractEndFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.textractEndHandler
      Runtime: nodejs12.x
      Role: !GetAtt TextractRole.Arn
      Policies:
        - AWSLambdaExecute
        - Statement:
            - Effect: Allow
              Action:
                - "s3:PutObject"
              Resource: !Join [":", ["arn:aws:s3::", !Ref BucketName]]
  TextractStartFunction:
    Type: AWS::Serverless::Function
    Properties:
      Environment:
        Variables:
          TEXT_EXTRACT_ROLE: !GetAtt TextractRole.Arn
          SNS_TOPIC: !Ref TextractSNSTopic
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.textractStartHandler
      Runtime: nodejs12.x
      Events:
        PDFUploadEvent:
          Type: S3
          Properties:
            Bucket: !Ref S3Bucket
            Events: s3:ObjectCreated:*
            Filter:
              S3Key:
                Rules:
                  - Name: suffix
                    Value: ".pdf"
  TextractRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "TextractRole"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "textract.amazonaws.com"
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/AWSLambdaExecute"
      Policies:
        - PolicyName: "TextractRoleAccess"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sns:*"
                Resource: "*"
              - Effect: Allow
                Action:
                  - "textract:*"
                Resource: "*"
  GetTextractResult:
    Type: AWS::Serverless::Function
    Properties:
      Role: !GetAtt TextractRole.Arn
      CodeUri: src/
      Handler: handler.getTextractResult
      Runtime: nodejs12.x
      Events:
        TextExactStart:
          Type: HttpApi
          Properties:
            Path: /textract
            Method: post
  MyHttpApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      StageName: !Ref Stage
      Cors:
        AllowMethods: "'OPTIONS,POST,GET'"
        AllowHeaders: "'Content-Type'"
        AllowOrigin: "'*'"
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Ref BucketName
Note that the API Gateway HTTP API (AWS::Serverless::HttpApi) was still in beta at the time of writing and is subject to change, so please don’t use it in production.
The following code shows how to send a PDF to Amazon Textract’s asynchronous API from a Lambda function with a few lines of code. Once the Textract analysis job completes, a second Lambda function is triggered and fetches the JSON response by calling getDocumentAnalysis. We then iterate over the blocks in the JSON and save the detected text to S3.
// handler.js
const AWS = require("aws-sdk");
const path = require("path");
const { EOL } = require("os");

const textract = new AWS.Textract();
const s3 = new AWS.S3();

// Triggered by the S3 upload: starts an asynchronous Textract analysis job.
exports.textractStartHandler = async (event) => {
  try {
    const bucket = event.Records[0].s3.bucket.name;
    // Object keys arrive URL-encoded in S3 event notifications.
    const key = decodeURIComponent(
      event.Records[0].s3.object.key.replace(/\+/g, " ")
    );
    const params = {
      DocumentLocation: {
        S3Object: {
          Bucket: bucket,
          Name: key
        }
      },
      FeatureTypes: ["TABLES", "FORMS"],
      NotificationChannel: {
        RoleArn: process.env.TEXT_EXTRACT_ROLE,
        SNSTopicArn: process.env.SNS_TOPIC
      }
    };
    const response = await textract.startDocumentAnalysis(params).promise();
    console.log(response); // the response contains the JobId
  } catch (err) {
    console.log(err);
  }
};

// Triggered by the SNS notification Textract publishes when the job ends.
// An async handler reports failure by throwing, so no callback is needed.
exports.textractEndHandler = async (event) => {
  const {
    Sns: { Message }
  } = event.Records[0];
  const {
    JobId: jobId,
    Status: status,
    DocumentLocation: { S3ObjectName, S3Bucket }
  } = JSON.parse(Message);
  if (status === "SUCCEEDED") {
    const textResult = await getDocumentText(jobId, null);
    const params = {
      Bucket: S3Bucket,
      Key: `${path.parse(S3ObjectName).name}.txt`,
      Body: textResult
    };
    await s3.putObject(params).promise();
  }
};

// Fetches every page of results recursively and returns the LINE text.
const getDocumentText = async (jobId, nextToken) => {
  console.log("nextToken", nextToken);
  const params = {
    JobId: jobId,
    MaxResults: 100
  };
  if (nextToken) params.NextToken = nextToken;
  const {
    NextToken: _nextToken,
    Blocks: _blocks
  } = await textract.getDocumentAnalysis(params).promise();
  // Keep only LINE blocks; join("") avoids the default comma separator.
  let textractResult = _blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
  if (_nextToken) {
    textractResult += await getDocumentText(jobId, _nextToken);
  }
  return textractResult;
};
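To make the notification handling concrete, here is a minimal sketch of the parsing step in textractEndHandler, run against a hand-written SNS event. The field names follow the destructuring in the handler; the job ID, bucket, and file name are made-up sample values, and `parseTextractNotification` is a hypothetical helper name, not part of the project.

```javascript
const path = require("path");

// Hypothetical helper: pulls the job info out of an SNS-triggered Lambda
// event and derives the key of the .txt file the handler would write.
function parseTextractNotification(event) {
  const { Message } = event.Records[0].Sns;
  const { JobId, Status, DocumentLocation } = JSON.parse(Message);
  return {
    jobId: JobId,
    status: Status,
    bucket: DocumentLocation.S3Bucket,
    outputKey: `${path.parse(DocumentLocation.S3ObjectName).name}.txt`
  };
}

// Sample event with made-up values, shaped like the SNS message
// the handler destructures.
const sampleEvent = {
  Records: [
    {
      Sns: {
        Message: JSON.stringify({
          JobId: "example-job-id",
          Status: "SUCCEEDED",
          DocumentLocation: {
            S3ObjectName: "ocrscan.pdf",
            S3Bucket: "aiyi.demo.textract"
          }
        })
      }
    }
  ]
};

console.log(parseTextractNotification(sampleEvent));
```

One consequence of using `path.parse(...).name` is that any folder prefix is dropped, so a document uploaded as `reports/scan.pdf` would produce `scan.txt` at the root of the bucket.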
Now let’s add another Lambda function as a REST endpoint, using the HTTP API defined in template.yaml. With this endpoint, we can retrieve the text analysis result and job status by Textract job ID.
// Lives in the same handler.js, so it uses that file's textract client and EOL.
exports.getTextractResult = async (event) => {
  try {
    const body = event.body ? JSON.parse(event.body) : {};
    if (!body.jobId) {
      return {
        statusCode: 400,
        body: JSON.stringify({ message: "jobId is required" })
      };
    }
    const params = {
      JobId: body.jobId,
      MaxResults: 100
    };
    // The SDK parameter is NextToken (capital N); only send it when present.
    if (body.nextToken) params.NextToken = body.nextToken;
    const {
      JobStatus: jobStatus,
      NextToken: nextToken,
      Blocks: blocks
    } = await textract.getDocumentAnalysis(params).promise();
    let textractResult = "";
    if (jobStatus === "SUCCEEDED") {
      // Keep only LINE blocks; join("") avoids the default comma separator.
      textractResult = blocks
        .filter(({ BlockType }) => BlockType === "LINE")
        .map(({ Text }) => `${Text}${EOL}`)
        .join("");
    }
    return {
      statusCode: 200,
      body: JSON.stringify({
        text: textractResult,
        jobStatus,
        nextToken
      })
    };
  } catch ({ statusCode, message }) {
    return {
      statusCode: statusCode || 500,
      body: JSON.stringify({ message })
    };
  }
};
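Both getDocumentText and getTextractResult turn Textract blocks into text in the same way, so that step could be factored into one small helper. The sketch below assumes a hypothetical helper name, `blocksToText`, and uses made-up sample blocks that only mimic the `BlockType` and `Text` fields of a getDocumentAnalysis response.

```javascript
const { EOL } = require("os");

// Hypothetical shared helper: keep only LINE blocks and emit their text,
// one document line per output line.
function blocksToText(blocks) {
  return blocks
    .filter(({ BlockType }) => BlockType === "LINE")
    .map(({ Text }) => `${Text}${EOL}`)
    .join("");
}

// Made-up sample: WORD blocks repeat the words of their LINE,
// so only the LINE blocks should end up in the output.
const sampleBlocks = [
  { BlockType: "PAGE" },
  { BlockType: "LINE", Text: "Invoice 2020-001" },
  { BlockType: "WORD", Text: "Invoice" },
  { BlockType: "WORD", Text: "2020-001" },
  { BlockType: "LINE", Text: "Total: 42.00" }
];

console.log(blocksToText(sampleBlocks));
```

Note that `Array.prototype.join()` with no argument separates items with commas, which is why the sketch passes an empty string.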
Note that Amazon Textract retains the results of asynchronous operations for 7 days.
Now let’s deploy the service and test it out!
$ sam deploy --guided
After the deployment has finished, copy a PDF file to the S3 bucket.
$ aws s3 cp ~/downloads/ocrscan.pdf s3://aiyi.demo.textract
You will find the Textract job ID in the CloudWatch log group of the Lambda function TextractStartFunction. To monitor the CloudWatch logs in real time, run the following command:
$ sam logs --name TextractStartFunction -t --region YOUR_REGION --stack-name sam-app-appv2
Let’s check the job status by calling the API endpoint we just deployed.
$ curl -d '{"jobId":"xxxxx2bd5ad43875edxxxx5aee29b65f273fxxxxx"}' -H "Content-Type: application/json" https://xxxx.execute-api.ap-southeast-2.amazonaws.com/textract | jq '.'
The output shows that the job status is SUCCEEDED, so a text file should have been created in the S3 bucket. Let’s go to the AWS S3 console and have a look:
The following image shows the content of ocrscan.txt.
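The endpoint returns at most one page of results per call, together with a nextToken when more pages exist. As a rough sketch of how a client might collect the full text, the helper below keeps calling the endpoint until no nextToken is returned. The names `fetchAllText`, `post`, and `fakePost` are hypothetical: `post` stands for any async function that sends a JSON body to the /textract endpoint and resolves to the parsed JSON response, and `fakePost` is a stand-in with made-up pages so the example runs without a deployed API.

```javascript
// Hypothetical client helper: follows nextToken until all pages are read.
// Stops early if the job has not succeeded (still running or failed).
async function fetchAllText(post, jobId) {
  let text = "";
  let nextToken;
  let jobStatus;
  do {
    const body = nextToken ? { jobId, nextToken } : { jobId };
    const page = await post(body);
    jobStatus = page.jobStatus;
    if (jobStatus !== "SUCCEEDED") break;
    text += page.text || "";
    nextToken = page.nextToken;
  } while (nextToken);
  return { jobStatus, text };
}

// Stand-in for the real endpoint, returning two made-up pages.
const fakePages = {
  first: { jobStatus: "SUCCEEDED", text: "page one\n", nextToken: "t1" },
  t1: { jobStatus: "SUCCEEDED", text: "page two\n" }
};
const fakePost = async ({ nextToken }) => fakePages[nextToken || "first"];

fetchAllText(fakePost, "example-job-id").then((result) => console.log(result));
```

Passing the request function in as a parameter keeps the paging logic independent of whichever HTTP client is used.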
That’s all. Thanks for reading! I hope you found this article useful. You can find the complete project in my GitHub repo.