In earlier articles, I covered how to extract text from an image using Tesseract OCR and Google Cloud Vision.
In this article, I’ll discuss the Amazon Textract service, which also lets you extract text from an image. Reading the text in an image is sometimes necessary. For example, you may want to detect whether the text is abusive or inappropriate for your audience, and you don’t want to check every image by hand. A more robust solution is to write a program and automate the workflow.
Usually, OCR (optical character recognition) software such as Tesseract OCR is used for this purpose. But this kind of software has to be installed and configured manually on your computer.
Amazon Textract extracts text using machine learning. It can be a more convenient choice than locally installed software because it runs in the cloud, so you don’t need to keep the software and its configuration up to date yourself.
With that said, let’s see how to extract text from an image using PHP and Amazon Textract.
Get Your AWS Security Credentials
Amazon provides an SDK for PHP applications. We’ll use this SDK to call the Textract service, which runs on AWS.
To use the AWS SDK, you need an AWS account. After creating an account, get the security credentials (an access key ID and a secret access key) that the SDK requires. The SDK uses these credentials to perform operations against your AWS account.
Extract Text From An Image
Let’s assume you want to read the text of the following image. I’ll save this image as 1.jpg on my local system.
Note: The Textract service accepts only specific document formats, such as JPEG and PNG images. If you pass an unsupported format, you’ll get the error ‘Request has unsupported document format’.
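Since an unsupported file only fails after the request has been sent, it can help to check the format locally first. The following is a minimal sketch, not part of the AWS SDK: detectImageFormat is a hypothetical helper name, and it recognizes only the JPEG and PNG signatures (the leading bytes that identify each format).

```php
<?php
// Hypothetical helper (not provided by the AWS SDK): identify an image
// format from its leading "magic" bytes.
// Returns 'jpeg', 'png', or null when the bytes match neither signature.
function detectImageFormat(string $bytes): ?string
{
    // JPEG files start with the bytes FF D8 FF.
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return 'jpeg';
    }

    // PNG files start with the fixed 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
    if (strncmp($bytes, "\x89PNG\r\n\x1A\n", 8) === 0) {
        return 'png';
    }

    return null;
}
```

You could read the file once with file_get_contents(), pass the bytes to detectImageFormat(), and send the request only when the result is not null. A null result here only means the file is neither JPEG nor PNG. It does not prove that Textract would reject it.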
Next, install the AWS SDK for PHP using the Composer command as follows.
composer require aws/aws-sdk-php
After installing the library, include the Composer autoloader and initialize TextractClient as follows.
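<?php
// Load the Composer autoloader so the AWS SDK classes are available.
require 'vendor/autoload.php';

use Aws\Textract\TextractClient;

$textractClient = new TextractClient([
    'version' => 'latest',
    'region' => 'us-west-2', // pass your region
    'credentials' => [
        'key' => 'ACCESS_KEY_ID',        // replace with your access key ID
        'secret' => 'ACCESS_KEY_SECRET', // replace with your secret access key
    ],
]);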
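Hard-coded keys are fine for a quick demo, but you may not want them in a file that gets committed or shared. As a sketch of one alternative, the hypothetical helper below (textractConfigFromEnv is my own name, not an SDK function) builds the same configuration array from environment variables. It assumes the variables are named AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_REGION, and it falls back to the region used in this article.

```php
<?php
// Hypothetical helper (not part of the AWS SDK): build the TextractClient
// configuration array from a map of environment variables.
function textractConfigFromEnv(array $env): array
{
    // Fail early with a clear message if a required variable is missing.
    foreach (['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'] as $name) {
        if (empty($env[$name])) {
            throw new RuntimeException("Missing environment variable: $name");
        }
    }

    return [
        'version' => 'latest',
        // Assumption: fall back to the region used in this article.
        'region' => $env['AWS_REGION'] ?? 'us-west-2',
        'credentials' => [
            'key' => $env['AWS_ACCESS_KEY_ID'],
            'secret' => $env['AWS_SECRET_ACCESS_KEY'],
        ],
    ];
}
```

With the variables set in your shell, the client could then be created with new TextractClient(textractConfigFromEnv(getenv())). Passing the environment in as an array, rather than calling getenv() inside the helper, keeps it easy to try out with made-up values.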
Now, to read the text, pass the contents of the image in the request format that AWS expects.
try {
    // Send the raw image bytes to Textract.
    $result = $textractClient->detectDocumentText([
        'Document' => [
            'Bytes' => file_get_contents(getcwd() . '/1.jpg'),
        ],
    ]);

    // Print only the WORD blocks, separated by spaces.
    foreach ($result->get('Blocks') as $block) {
        if ($block['BlockType'] !== 'WORD') {
            continue;
        }
        echo $block['Text'] . " ";
    }
} catch (Aws\Textract\Exception\TextractException $e) {
    // Output the error message if the request fails.
    echo $e->getMessage();
}
Run this file, and the output will be ‘How TO SEND EMAIL USING GMAIL API WITH PHPMAILER’.
Here, I am printing the Text from the WORD Block objects. If you use print_r($result), you will see both LINE and WORD Block objects. You can use either type of Block object to print the text.
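If you would rather print whole lines than single words, the same loop can filter on LINE instead. Here is a small sketch of that idea. The helper name textByBlockType is my own, and the sample array is hand-written to mimic the shape of the Blocks list. It is not real Textract output.

```php
<?php
// Hypothetical helper: collect the Text of every block of the given type.
function textByBlockType(array $blocks, string $type): array
{
    $texts = [];
    foreach ($blocks as $block) {
        // Skip blocks of other types and blocks without a Text field.
        if (($block['BlockType'] ?? null) === $type && isset($block['Text'])) {
            $texts[] = $block['Text'];
        }
    }
    return $texts;
}

// Hand-written sample shaped like the 'Blocks' array (not real output).
$sampleBlocks = [
    ['BlockType' => 'PAGE'],
    ['BlockType' => 'LINE', 'Text' => 'HELLO WORLD'],
    ['BlockType' => 'WORD', 'Text' => 'HELLO'],
    ['BlockType' => 'WORD', 'Text' => 'WORLD'],
];

// Prints one line of text per LINE block: here, "HELLO WORLD".
echo implode("\n", textByBlockType($sampleBlocks, 'LINE')), "\n";
```

With a real response, you would pass $result->get('Blocks') in place of $sampleBlocks.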
After running this code, I get the following error:
Error executing "DetectDocumentText" on "https://textract.us-east-1.amazonaws.com"; AWS HTTP error: Client error: `POST https://textract.us-east-1.amazonaws.com` resulted in a `400 Bad Request` response: {"__type":"AccessDeniedException"
You must pass a JPEG, PNG, or PDF file. Other formats produce the error ‘Request has unsupported document format’.