How to Use Tesseract OCR to Extract Text from Images or PDFs

20 January 2025 | Last Updated: 20 January 2025 | 4 min read

Extracting text from images and scanned documents is a common task, whether you're digitizing old records, processing receipts, or automating workflows. Optical Character Recognition (OCR) technology handles this conversion. Tesseract OCR, an open-source engine originally developed at Hewlett-Packard and later sponsored by Google, is one of the most popular solutions for this purpose. It is free, versatile, and supports over 100 languages, making it suitable for a wide range of text extraction needs.

This guide will walk you through using Tesseract OCR to extract text from images and PDFs, step-by-step.

What is Tesseract OCR?

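Tesseract OCR is an open-source OCR engine that converts images containing text into machine-readable formats. It reads common image formats such as PNG, JPEG, and TIFF; PDFs must first be converted to images, as described later in this guide. It supports multiple output formats, including plain text, searchable PDFs, and structured formats like TSV.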

Key Features of Tesseract OCR:

  1. Multilingual Support: Over 100 languages supported with language packs.
  2. Custom Training: Train or fine-tune Tesseract on specialized fonts. Handwriting is much harder, and accuracy on it is typically far lower than on printed text.
  3. Output Formats: Export text in plain text, searchable PDFs, or TSV formats.
  4. Extensibility: Easily integrated into scripts, workflows, or larger applications.
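
As a rough illustration of the output formats listed above, the sketch below builds the command line for each one. It assumes the tesseract binary is on your PATH and relies on the txt, pdf, and tsv config names that ship with standard Tesseract installs. The helper name build_tesseract_command is hypothetical, not part of Tesseract.

```python
# Output format -> Tesseract config name passed as the last CLI argument.
# Assumption: these config files are present, as in standard installs.
FORMAT_CONFIGS = {"text": "txt", "pdf": "pdf", "tsv": "tsv"}


def build_tesseract_command(image_path, output_base, output_format="text"):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    if output_format not in FORMAT_CONFIGS:
        raise ValueError(f"unsupported format: {output_format!r}")
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        FORMAT_CONFIGS[output_format],
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True). Tesseract appends the extension (.txt, .pdf, or .tsv) to the output base itself.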

Prerequisites

Before getting started, ensure the following:

  1. Install Tesseract OCR:

    • On Linux (Debian/Ubuntu):
      sudo apt update
      sudo apt install tesseract-ocr
      sudo apt install libtesseract-dev
      
    • On macOS:
      brew install tesseract
      
    • On Windows:
      • Download the Tesseract installer from Tesseract GitHub and follow the installation steps.
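  2. Install the Python Wrapper (Optional):

    • If you want to use Tesseract from Python, install the pytesseract library along with Pillow. pytesseract only wraps the command-line tool, so Tesseract itself must still be installed as described above: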
      pip install pytesseract pillow
      
  3. Input Files: Have image files (e.g., PNG, JPG) or PDFs ready for processing.
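
Before moving on, it can help to confirm that the installation worked. The following is a minimal sketch using only the Python standard library; the function names are made up for this example, and it assumes the binary is called tesseract.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(binary="tesseract"):
    """Return the first line of `tesseract --version`, or None if missing."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some older builds print the version banner to stderr instead of stdout.
    banner = (result.stdout or result.stderr).strip()
    return banner.splitlines()[0] if banner else None
```

If tesseract_version() returns None, the binary is probably not on your PATH, which is a common issue on Windows after running the installer.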

Extracting Text from Images Using Tesseract

Step 1: Basic Command-Line Usage

Tesseract can be run directly from the command line. Here's how:

  1. Open a terminal or command prompt.
  2. Run the following command:
    tesseract input_image.png output_text
    
    • input_image.png: Replace with the path to your image file.
    • output_text: Specify the name of the output text file (without extension).

Example:

For an image named invoice.png:

tesseract invoice.png invoice_text

This command generates a file named invoice_text.txt containing the extracted text.
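
If you would rather capture the text in a script than write a file, one option is to call the same command from Python. This sketch assumes tesseract is on your PATH and uses the special output base stdout, which makes Tesseract print the result instead of saving it. The function names are illustrative only.

```python
import subprocess


def stdout_command(image_path, binary="tesseract"):
    """Build a command that prints recognized text instead of writing a file."""
    # "stdout" as the output base tells Tesseract to write to standard output.
    return [binary, str(image_path), "stdout"]


def ocr_image(image_path, binary="tesseract"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        stdout_command(image_path, binary),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

For example, ocr_image("invoice.png") would return the same text that the command above writes to invoice_text.txt.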

Step 2: Specifying Languages

If your document is in a language other than English, specify the language using the -l option:

tesseract input_image.png output_text -l spa

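In this example, spa is the language code for Spanish. The matching language data must be installed (on Debian or Ubuntu, the tesseract-ocr-spa package). Run tesseract --list-langs to see the codes available on your system.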
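
Documents that mix languages can be handled by joining codes with a plus sign, for example -l eng+spa. The sketch below shows that, plus a way to read the installed languages from a script. The header format that parse_language_list skips is an assumption based on typical output, and all function names are hypothetical.

```python
import subprocess


def language_argument(codes):
    """Join language codes the way the -l option expects, e.g. 'eng+spa'."""
    return "+".join(codes)


def parse_language_list(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumed header line: 'List of available languages ... (2):'
    return [line for line in lines if not line.startswith("List of")]


def installed_languages(binary="tesseract"):
    """Return the language codes installed for the given Tesseract binary."""
    result = subprocess.run([binary, "--list-langs"], capture_output=True, text=True)
    # Older releases print the list to stderr, newer ones to stdout.
    return parse_language_list(result.stdout + result.stderr)
```

Checking installed_languages() before a run makes it possible to fail early with a clear message when a language pack is missing.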

Extracting Text from PDFs Using Tesseract

Tesseract does not natively support PDF input, but you can use tools like pdftoppm or ImageMagick to convert PDFs into images first.

Step 1: Convert PDF to Images

Use pdftoppm (part of Poppler utilities):

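pdftoppm -r 300 -png input.pdf output_image

This generates one PNG image per page, named output_image-1.png, output_image-2.png, and so on (page numbers are zero-padded for longer documents). The -r 300 option renders at 300 DPI instead of pdftoppm's default of 150 DPI, which generally gives Tesseract sharper text to work with.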
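
The same conversion can be driven from Python. This is a sketch, assuming pdftoppm from Poppler is on your PATH; the 300 DPI default is a common choice for OCR rather than a requirement, and the function names are invented for this example.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the pdftoppm call that renders each PDF page to a PNG file."""
    # -r sets the resolution in DPI; -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def pdf_to_images(pdf_path, image_prefix, dpi=300):
    """Render the PDF and return the generated page images in sorted order."""
    subprocess.run(pdftoppm_command(pdf_path, image_prefix, dpi), check=True)
    prefix = Path(image_prefix)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(prefix.parent.glob(prefix.name + "-*.png"))
```

Higher resolutions produce larger files and slower OCR, so it is worth trying a lower dpi value if the source PDF is already clean.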

Step 2: Run Tesseract on Each Image

Use a loop to process multiple pages:

for page in output_image*.png; do
  tesseract "$page" "${page%.png}_text" -l eng
done

This loop processes each generated image and writes the extracted text to a matching file; for example, output_image-1.png produces output_image-1_text.txt.
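
On systems without a Bash-style shell (for example, a plain Windows command prompt), the loop can be written in Python instead. The following sketch mirrors the shell version under the same file-naming assumptions; output_base_for and ocr_pages are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def output_base_for(image_path):
    """Mirror the shell's ${page%.png}_text: drop the suffix, add '_text'."""
    image_path = Path(image_path)
    return image_path.with_name(image_path.stem + "_text")


def ocr_pages(pattern="output_image*.png", lang="eng", directory="."):
    """Run Tesseract on every matching image and return the text file paths."""
    outputs = []
    for page in sorted(Path(directory).glob(pattern)):
        base = output_base_for(page)
        subprocess.run(["tesseract", str(page), str(base), "-l", lang], check=True)
        # Tesseract appends .txt to the output base itself.
        outputs.append(Path(str(base) + ".txt"))
    return outputs
```

Because check=True is set, the loop stops at the first page that Tesseract fails to process rather than silently skipping it.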

Step 3: Combine the Text Files (Optional)

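If your PDF has multiple pages, combine the per-page text files. Matching on the output_image prefix avoids picking up unrelated *_text.txt files in the same directory:

cat output_image*_text.txt > combined_output.txt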
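
The shell sorts file names alphabetically, which works because pdftoppm zero-pads page numbers. If your page files come from a tool that does not pad them, page 10 would sort before page 2. A small sketch of a numeric sort in Python follows, using hypothetical helper names:

```python
import re
from pathlib import Path


def page_number(path):
    """Return the last run of digits in the file name, or 0 if there is none."""
    numbers = re.findall(r"\d+", Path(path).stem)
    return int(numbers[-1]) if numbers else 0


def combine_text_files(paths, destination, separator="\n"):
    """Concatenate text files in page order and return that order."""
    ordered = sorted(paths, key=page_number)
    with open(destination, "w", encoding="utf-8") as out:
        for path in ordered:
            out.write(Path(path).read_text(encoding="utf-8"))
            out.write(separator)
    return ordered
```

For example, combine_text_files(Path(".").glob("output_image*_text.txt"), "combined_output.txt") would produce the same file as the cat command, with pages ordered by number.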

Using Tesseract with Python

For a more automated approach, you can use Tesseract with Python. Here's a basic example:

Step 1: Install Libraries

Install the required libraries:

pip install pytesseract pillow

Step 2: Write the Python Script

from PIL import Image
import pytesseract

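# Windows only: uncomment and adjust this line if tesseract.exe is not on your PATH.
# Leave it commented out on Linux and macOS, where this path does not exist.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'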

# Load the image
image = Image.open('input_image.png')

# Extract text
text = pytesseract.image_to_string(image, lang='eng')

# Save the extracted text to a file (UTF-8 keeps accented characters intact)
with open('output_text.txt', 'w', encoding='utf-8') as file:
    file.write(text)

print("Text extraction complete. Check output_text.txt")

Step 3: Run the Script

Execute the script:

python script_name.py

This script reads an image file (input_image.png), extracts text, and saves it to output_text.txt.
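
Beyond plain text, pytesseract can also return per-word details through image_to_data, including a confidence score for each word. The sketch below filters out low-confidence words. The 60.0 cutoff is an arbitrary assumption to tune for your documents, and both function names are illustrative.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence meets the threshold.

    `data` is the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Layout rows (blocks, lines) carry empty text and a confidence of -1.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_words(image_path, lang="eng", min_conf=60.0):
    """Run OCR on an image and return only the confidently recognized words."""
    # Imported here so the filter above can be used without Tesseract installed.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
    return confident_words(data, min_conf)
```

A large share of dropped words is often a sign that the image needs preprocessing or a different page segmentation mode.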

Tips for Better Results

  1. Preprocess Images: Use tools like OpenCV to enhance image quality (e.g., remove noise, adjust contrast).
  2. Use Configurations: Tesseract offers configuration options to fine-tune the OCR process. Example:
    tesseract input_image.png output_text --psm 6
    
    • --psm sets the page segmentation mode. Mode 6 assumes a single uniform block of text; the default, mode 3, performs fully automatic page segmentation.
  3. Train Custom Models: For specialized fonts or languages, train Tesseract with your data.
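
To make the preprocessing tip concrete, here is a minimal binarization sketch. Note that it swaps in Pillow and a pure-Python version of Otsu's thresholding method rather than OpenCV, so it needs nothing beyond the libraries installed earlier. It assumes a scan with dark text on a light background, and the function names are hypothetical.

```python
def otsu_threshold(histogram):
    """Return the grey level that best separates dark and light pixels (Otsu)."""
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for level, count in enumerate(histogram):
        dark_count += count
        dark_sum += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so there is nothing to separate
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        # Between-class variance (up to a constant factor).
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image):
    """Convert a Pillow image to pure black and white using Otsu's threshold."""
    gray = image.convert("L")
    threshold = otsu_threshold(gray.histogram())
    return gray.point(lambda value: 255 if value > threshold else 0)
```

The result of binarize(Image.open('input_image.png')) can be passed straight to pytesseract.image_to_string. Tesseract applies its own binarization internally, so this step tends to matter most for scans with uneven lighting or low contrast.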

Conclusion

Tesseract OCR is a powerful tool for extracting text from images and PDFs. Its open-source nature, multilingual support, and flexibility make it a go-to solution for businesses and individuals alike. Whether you're processing invoices, digitizing books, or automating workflows, Tesseract can handle the task efficiently.

By following this guide, you'll be able to leverage Tesseract OCR for your text extraction needs and integrate it into larger automation projects. Give it a try and unlock the potential of your document workflows!