Amazon recently released Textract in the Asia Pacific (Sydney) region, so I decided to write a JavaScript OCR demo using it. Amazon Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. In this post, I show how to use Amazon Textract to extract text from scanned PDF files.

Overview of the process

Upload files to an S3 bucket. An S3 event trigger invokes an AWS Lambda function, which calls the Amazon Textract asynchronous operations to analyse the uploaded document. Once the document analysis job has completed, the status of the job is pushed to an SNS topic. The SNS topic invokes another Lambda function, which reads the status of the job and, if the status is SUCCEEDED, writes the extracted text to a .txt object in the S3 bucket. An HTTP API endpoint can also return the job status and result for a given job id.

The following diagram shows the architecture of the process.

Prerequisites

The following must be done before following this guide:

- Set up an AWS account.
- Configure the AWS CLI with user credentials.
- Install the AWS CLI and jq (optional).

Before getting started, install the AWS SAM CLI and create an application with sample code:

    sam init -r nodejs12.x

Let's get started

There will be a SAM template file (template.yaml) in the newly created project directory.
Let's start by defining the following set of objects in the template file:

- Lambda functions and inline policies
- S3 bucket
- IAM role
- SNS topic
- HTTP API

    AWSTemplateFormatVersion: "2010-09-09"
    Transform: AWS::Serverless-2016-10-31
    Globals:
      Function:
        Timeout: 60
    Parameters:
      Stage:
        Type: String
        Default: dev
      BucketName:
        Type: String
        Default: aiyi.demo.textract
    Resources:
      TextractSNSTopic:
        Type: AWS::SNS::Topic
        Properties:
          DisplayName: !Sub "textract-sns-topic"
          TopicName: !Sub "textract-sns-topic"
          Subscription:
            - Protocol: lambda
              Endpoint: !GetAtt TextractEndFunction.Arn
      TextractSNSTopicPolicy:
        Type: AWS::Lambda::Permission
        Properties:
          FunctionName: !Ref TextractEndFunction
          Principal: sns.amazonaws.com
          Action: lambda:InvokeFunction
          SourceArn: !Ref TextractSNSTopic
      TextractEndFunction:
        Type: AWS::Serverless::Function
        Properties:
          CodeUri: src/
          Handler: handler.textractEndHandler
          Runtime: nodejs12.x
          Role: !GetAtt TextractRole.Arn
          Policies:
            - AWSLambdaExecute
            - Statement:
                - Effect: Allow
                  Action:
                    - "s3:PutObject"
                  Resource: !Join [":", ["arn:aws:s3::", !Ref BucketName]]
      TextractStartFunction:
        Type: AWS::Serverless::Function
        Properties:
          Environment:
            Variables:
              TEXT_EXTRACT_ROLE: !GetAtt TextractRole.Arn
              SNS_TOPIC: !Ref TextractSNSTopic
          Role: !GetAtt TextractRole.Arn
          CodeUri: src/
          Handler: handler.textractStartHandler
          Runtime: nodejs12.x
          Events:
            PDFUploadEvent:
              Type: S3
              Properties:
                Bucket: !Ref S3Bucket
                Events: s3:ObjectCreated:*
                Filter:
                  S3Key:
                    Rules:
                      - Name: suffix
                        Value: ".pdf"
      TextractRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: "TextractRole"
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Principal:
                  Service:
                    - "textract.amazonaws.com"
                    - "lambda.amazonaws.com"
                Action:
                  - "sts:AssumeRole"
          ManagedPolicyArns:
            - "arn:aws:iam::aws:policy/AWSLambdaExecute"
          Policies:
            - PolicyName: "TextractRoleAccess"
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Effect: Allow
                    Action:
                      - "sns:*"
                    Resource: "*"
                  - Effect: Allow
                    Action:
                      - "textract:*"
                    Resource: "*"
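      GetTextractResult:
        Type: AWS::Serverless::Function
        Properties:
          Role: !GetAtt TextractRole.Arn
          CodeUri: src/
          Handler: handler.getTextractResult
          Runtime: nodejs12.x
          Events:
            TextExactStart:
              Type: HttpApi
              Properties:
                Path: /textract
                Method: post
      MyHttpApi:
        Type: AWS::Serverless::HttpApi
        Properties:
          StageName: !Ref Stage
          Cors:
            AllowMethods: "'OPTIONS,POST,GET'"
            AllowHeaders: "'Content-Type'"
            AllowOrigin: "'*'"
      S3Bucket:
        Type: AWS::S3::Bucket
        Properties:
          BucketName: !Ref BucketName

Note that the API Gateway HTTP API resource (AWS::Serverless::HttpApi) is still in beta at the time of writing and is subject to change, so please don't use it for production.

The following code example shows how a few lines of code in one Lambda function send a PDF to the Amazon Textract asynchronous operations. Once the Textract analysis job has completed, another Lambda function is triggered and gets the JSON response back by calling getDocumentAnalysis. We then iterate over the blocks in the JSON and save the detected text to S3.

    // Modules and clients used by the handlers in this file
    // (aws-sdk v2 ships with the nodejs12.x Lambda runtime).
    const AWS = require("aws-sdk");
    const path = require("path");
    const { EOL } = require("os");

    const textract = new AWS.Textract();
    const s3 = new AWS.S3();

    // Invoked by the S3 upload event: starts an asynchronous Textract job.
    exports.textractStartHandler = async (event, context, callback) => {
      try {
        const bucket = event.Records[0].s3.bucket.name;
        const key = event.Records[0].s3.object.key;
        const params = {
          DocumentLocation: {
            S3Object: {
              Bucket: bucket,
              Name: key
            }
          },
          FeatureTypes: ["TABLES", "FORMS"],
          // Textract publishes the job status to this topic when it finishes.
          NotificationChannel: {
            RoleArn: process.env.TEXT_EXTRACT_ROLE,
            SNSTopicArn: process.env.SNS_TOPIC
          }
        };
        // The response contains the JobId, which ends up in the CloudWatch logs.
        const response = await textract.startDocumentAnalysis(params).promise();
        console.log(response);
      } catch (err) {
        console.log(err);
      } finally {
        callback(null);
      }
    };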
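One detail the start handler passes over: S3 event notifications deliver the object key URL-encoded, with spaces sent as "+", so a file name containing spaces or special characters may not match the real object unless it is decoded first. The following is a minimal sketch of how the parameters could be built with the key decoded. The helper name buildStartParams and the sample record are hypothetical and are not part of the original project.

```javascript
// Hypothetical helper (not in the original handler): builds the
// startDocumentAnalysis parameters from a single S3 event record.
const buildStartParams = (record, roleArn, snsTopicArn) => {
  // Keys in S3 event notifications are URL-encoded and use "+" for spaces.
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
  return {
    DocumentLocation: {
      S3Object: { Bucket: record.s3.bucket.name, Name: key }
    },
    FeatureTypes: ["TABLES", "FORMS"],
    NotificationChannel: { RoleArn: roleArn, SNSTopicArn: snsTopicArn }
  };
};

// Made-up record for illustration only.
const exampleRecord = {
  s3: {
    bucket: { name: "aiyi.demo.textract" },
    object: { key: "scans/my+invoice%282%29.pdf" }
  }
};
const exampleParams = buildStartParams(
  exampleRecord,
  "role-arn-placeholder",
  "topic-arn-placeholder"
);
console.log(exampleParams.DocumentLocation.S3Object.Name);
// prints "scans/my invoice(2).pdf"
```

Inside the handler, the result could be passed straight to textract.startDocumentAnalysis, with process.env.TEXT_EXTRACT_ROLE and process.env.SNS_TOPIC as the last two arguments.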
    // Invoked by the SNS notification that Textract publishes when a job ends.
    exports.textractEndHandler = async (event, context, callback) => {
      try {
        const {
          Sns: { Message }
        } = event.Records[0];
        const {
          JobId: jobId,
          Status: status,
          DocumentLocation: { S3ObjectName, S3Bucket }
        } = JSON.parse(Message);
        if (status === "SUCCEEDED") {
          const textResult = await getDocumentText(jobId, null);
          // Store the text next to the source bucket as <file name>.txt.
          const params = {
            Bucket: S3Bucket,
            Key: `${path.parse(S3ObjectName).name}.txt`,
            Body: textResult
          };
          await s3.putObject(params).promise();
        }
        callback(null);
      } catch (error) {
        // The callback is invoked exactly once: here on failure, above on success.
        callback(error);
      }
    };

    // Collects the text of all LINE blocks, following NextToken page by page.
    const getDocumentText = async (jobId, nextToken) => {
      console.log("nextToken", nextToken);
      const params = {
        JobId: jobId,
        MaxResults: 100,
        NextToken: nextToken
      };
      if (!nextToken) delete params.NextToken;
      const {
        JobStatus: _jobStatus,
        NextToken: _nextToken,
        Blocks: _blocks
      } = await textract.getDocumentAnalysis(params).promise();
      // Keep only LINE blocks and join without a separator, so that no commas
      // or "undefined" entries end up in the output.
      let textractResult = _blocks
        .filter(({ BlockType }) => BlockType === "LINE")
        .map(({ Text }) => `${Text}${EOL}`)
        .join("");
      if (_nextToken) {
        textractResult += await getDocumentText(jobId, _nextToken);
      }
      return textractResult;
    };

Now let's add another Lambda function as a REST endpoint, using the HTTP API defined in template.yaml. With this API, we can retrieve the text analysis result and the job status by Textract job id.

    exports.getTextractResult = async (event, context, callback) => {
      try {
        if (event.body) {
          const body = JSON.parse(event.body);
          if (body.jobId) {
            const params = {
              JobId: body.jobId,
              MaxResults: 100,
              NextToken: body.nextToken
            };
            // Textract expects the key NextToken; omit it for the first page.
            if (!params.NextToken) delete params.NextToken;
            const {
              JobStatus: jobStatus,
              NextToken: nextToken,
              Blocks: blocks
            } = await textract.getDocumentAnalysis(params).promise();
            let textractResult;
            if (jobStatus === "SUCCEEDED") {
              textractResult = blocks
                .filter(({ BlockType }) => BlockType === "LINE")
                .map(({ Text }) => `${Text}${EOL}`)
                .join("");
            }
            return callback(null, {
              statusCode: 200,
              body: JSON.stringify({
                text: textractResult,
                jobStatus,
                nextToken
              })
            });
          }
        }
        // No request body or jobId was supplied.
        return callback(null);
      } catch ({ statusCode, message }) {
        return callback(null, {
          // Errors that do not come from the AWS SDK carry no statusCode.
          statusCode: statusCode || 500,
          body: JSON.stringify({ message })
        });
      }
    };

Note that Amazon Textract retains the results of asynchronous operations for 7 days.

Now let's deploy the service and test it out!

    $ sam deploy --guided

After the deployment has finished, copy a PDF file to the S3 bucket.

    $ aws s3 cp ~/downloads/ocrscan.pdf s3://aiyi.demo.textract

You will get a Textract job id in the CloudWatch log group of the Lambda function 'TextractStartFunction'. To monitor the CloudWatch logs in real time, you can run the following command:

    $ sam logs --name TextractStartFunction -t --region YOUR_REGION --stack-name sam-app-appv2

Let's check the job status by calling the API endpoint we just deployed.

    $ curl -d '{"jobId":"xxxxx2bd5ad43875edxxxx5aee29b65f273fxxxxx"}' -H "Content-Type: application/json" https://xxxx.execute-api.ap-southeast-2.amazonaws.com/textract | jq '.'

The output shows that the job status is SUCCEEDED, so a text file should have been created in the S3 bucket. Let's go to the AWS S3 console and have a look:

The following image is the content of ocrscan.txt.

That's all there is to it. Thanks for reading! I hope you have found this article useful. You can find the complete project in my GitHub repo.

Amazon Textract: Extract Text from PDF and Image Files [A How To Guide]
