How to Use Tesseract OCR to Extract Text from Images or PDFs
20 January 2025 | Last Updated: 20 January 2025 | 4 min read

In today's digital age, text extraction from images and scanned documents has become an essential task. Whether you're digitizing old records, processing receipts, or automating workflows, Optical Character Recognition (OCR) technology can simplify the process. Tesseract OCR, an open-source engine originally developed at Hewlett-Packard and later sponsored by Google, is one of the most popular solutions for this purpose. It is free, versatile, and supports over 100 languages, making it ideal for a variety of text extraction needs.
This guide will walk you through using Tesseract OCR to extract text from images and PDFs, step-by-step.
What is Tesseract OCR?
Tesseract OCR is an open-source OCR engine that converts images containing text into machine-readable formats; scanned PDFs can be processed once their pages are converted to images. It supports multiple output formats like plain text, searchable PDFs, and even structured data formats like TSV.
Key Features of Tesseract OCR:
- Multilingual Support: Over 100 languages supported with language packs.
- Custom Training: Train Tesseract to recognize specialized fonts or handwriting.
- Output Formats: Export text in plain text, searchable PDFs, or TSV formats.
- Extensibility: Easily integrated into scripts, workflows, or larger applications.
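To make the TSV output format concrete, here is a minimal Python sketch that reads word rows from text produced by a command such as tesseract input_image.png output_text tsv. The column names (conf, text, and so on) are the ones Tesseract writes in the TSV header; the function name words_from_tsv, the sample rows, and the confidence cutoff are illustrative assumptions.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=0.0):
    """Return (text, confidence) pairs for recognized words in Tesseract TSV output."""
    # QUOTE_NONE: recognized text may contain quote characters that are not CSV quoting.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        # Rows describing pages, blocks, and lines carry conf -1 and no text.
        if text and conf >= min_conf:
            words.append((text, conf))
    return words


# Hand-written sample in the TSV layout (not real OCR output).
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t14\t96.5\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t40\t14\t41.0\t#12\"\n"
)

print(words_from_tsv(SAMPLE, min_conf=60))
```

Filtering on the conf column is one simple way to discard low-confidence words before further processing; the right cutoff depends on your documents.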
Prerequisites
Before getting started, ensure the following:
- Install Tesseract OCR:
  - On Linux (Debian/Ubuntu):
    sudo apt update
    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
  - On macOS:
    brew install tesseract
  - On Windows:
    Download the Tesseract installer from the Tesseract GitHub page and follow the installation steps.
- Install Python (Optional):
  - If you want to use Tesseract with Python, install the pytesseract library:
    pip install pytesseract pillow
- Input Files: Have image files (e.g., PNG, JPG) or PDFs ready for processing.
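Once the prerequisites are in place, it can help to confirm that the tesseract command is actually reachable. The following is a small sketch, assuming tesseract is expected on PATH; the helper names parse_tesseract_version and installed_tesseract_version are made up for this example.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor) from a 'tesseract X.Y...' banner, or return None."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_tesseract_version():
    """Return the installed version as (major, minor), or None if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(["tesseract", "--version"], capture_output=True, text=True)
    # Check both streams in case the banner is written to stderr.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

A result of None suggests the install step did not complete or the executable is not on PATH.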
Extracting Text from Images Using Tesseract
Step 1: Basic Command-Line Usage
Tesseract can be run directly from the command line. Here's how:
- Open a terminal or command prompt.
- Run the following command:
tesseract input_image.png output_text
- input_image.png: Replace with the path to your image file.
- output_text: Specify the name of the output text file (without extension).
Example:
For an image named invoice.png:
tesseract invoice.png invoice_text
This command generates a file named invoice_text.txt containing the extracted text.
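If you would rather trigger the same command from a script, a thin wrapper around the command line is enough. This is a minimal sketch, assuming tesseract is on PATH; tesseract_command and extract_text are hypothetical helper names, not part of any library.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base):
    """Build the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), str(output_base)]


def extract_text(image_path, output_base):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Only runs when the example image from the text is present.
if __name__ == "__main__" and Path("invoice.png").exists():
    print(extract_text("invoice.png", "invoice_text"))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.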
Step 2: Specifying Languages
If your document is in a language other than English, specify the language using the -l option:
tesseract input_image.png output_text -l spa
In this example, spa is the language code for Spanish. The matching language data must be installed; running tesseract --list-langs shows the codes available on your system.
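Tesseract also accepts several languages at once, joined with a plus sign (for example -l eng+spa). The sketch below builds that argument and checks requested codes against the output of tesseract --list-langs. The sample output string is hand-written for illustration, and both helper names are assumptions.

```python
def language_argument(codes):
    """Join language codes for the -l option, e.g. ['eng', 'spa'] -> 'eng+spa'."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def parse_list_langs(output):
    """Parse the output of `tesseract --list-langs` into a list of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header that starts with 'List of available languages'.
    return [line for line in lines if not line.lower().startswith("list of")]


# Hand-written sample output, not captured from a real installation.
installed = parse_list_langs("List of available languages (2):\neng\nspa\n")
missing = [code for code in ["eng", "deu"] if code not in installed]
print(language_argument(["eng", "spa"]), missing)
```

Checking for missing codes up front gives a clearer error than letting the OCR run fail halfway through a batch.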
Extracting Text from PDFs Using Tesseract
Tesseract does not natively support PDF input, but you can use tools like pdftoppm or ImageMagick to convert PDFs into images first.
Step 1: Convert PDF to Images
Use pdftoppm (part of Poppler utilities):
pdftoppm input.pdf output_image -png
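This generates one PNG image for each page of the PDF. pdftoppm renders at 150 DPI by default; adding -r 300 to the command often gives Tesseract more detail to work with.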
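The conversion step can also be driven from Python. This is a rough sketch, assuming pdftoppm is installed and on PATH; the 300 DPI default and the function names are choices made for this example.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the argument list for rendering each PDF page to a PNG file."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def pdf_to_images(pdf_path, output_prefix, dpi=300):
    """Run pdftoppm and return the generated page images in sorted order."""
    subprocess.run(pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
    prefix = Path(output_prefix)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(prefix.parent.glob(f"{prefix.name}-*.png"))
```

For example, pdf_to_images("input.pdf", "output_image") would return the list of page images ready to hand to Tesseract.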
Step 2: Run Tesseract on Each Image
Use a loop to process multiple pages:
for page in output_image*.png; do
  tesseract "$page" "${page%.png}_text" -l eng
done
This loop processes all the generated images and extracts text into corresponding text files.
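For readers who prefer Python to a shell loop, the same per-page processing might look like the sketch below. It assumes tesseract is on PATH; text_base_for and ocr_pages are hypothetical names, and the run parameter exists only so the loop can be exercised without invoking Tesseract.

```python
import subprocess
from pathlib import Path


def text_base_for(page):
    """Mirror the shell expansion "${page%.png}_text" for a page image path."""
    page = Path(page)
    return page.with_name(page.stem + "_text")


def ocr_pages(pages, lang="eng", run=subprocess.run):
    """Run Tesseract on each page image and return the .txt paths it should write."""
    outputs = []
    for page in sorted(pages):
        base = text_base_for(page)
        run(["tesseract", str(page), str(base), "-l", lang], check=True)
        # Tesseract appends .txt to the output base name.
        outputs.append(base.with_name(base.name + ".txt"))
    return outputs
```

A typical call would be ocr_pages(Path(".").glob("output_image*.png")), which matches the same files as the shell pattern.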
Step 3: Combine the Text Files (Optional)
If your PDF has multiple pages, combine the text files:
cat *_text.txt > combined_output.txt
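The cat command relies on the shell's alphabetical ordering of file names, which can put page 10 before page 2 when page numbers are not zero-padded. As a hedged alternative, this sketch sorts by the page number found in each file name before concatenating; the helper names and the choice of "last run of digits" as the page number are assumptions.

```python
import re
from pathlib import Path


def page_number(path):
    """Return the last run of digits in the file name as an integer (0 if none)."""
    digits = re.findall(r"\d+", Path(path).name)
    return int(digits[-1]) if digits else 0


def combine_text_files(paths, destination):
    """Concatenate page text files in numeric page order into one file."""
    ordered = sorted(paths, key=page_number)
    with open(destination, "w", encoding="utf-8") as out:
        for path in ordered:
            out.write(Path(path).read_text(encoding="utf-8"))
            out.write("\n")
    return ordered
```

For example, combine_text_files(Path(".").glob("output_image*_text.txt"), "combined_output.txt") produces the same combined file with pages in numeric order.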
Using Tesseract with Python
For a more automated approach, you can use Tesseract with Python. Here's a basic example:
Step 1: Install Libraries
Install the required libraries:
pip install pytesseract pillow
Step 2: Write the Python Script
from PIL import Image
import pytesseract

# Windows only: point pytesseract at the Tesseract executable if it is not on PATH.
# Leave this line commented out on Linux and macOS.
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the image
image = Image.open('input_image.png')

# Extract text
text = pytesseract.image_to_string(image, lang='eng')

# Save the extracted text to a file (UTF-8 so non-ASCII characters are preserved)
with open('output_text.txt', 'w', encoding='utf-8') as file:
    file.write(text)

print("Text extraction complete. Check output_text.txt")
Step 3: Run the Script
Execute the script:
python script_name.py
This script reads an image file (input_image.png), extracts text, and saves it to output_text.txt.
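Because the tesseract_cmd setting only matters when the executable is not on PATH, one option is to resolve it at runtime so the same script works across platforms. This is a sketch under assumptions: the Windows path is the default location used in the script above and may differ on your machine, and resolve_tesseract_cmd with its injectable parameters is a made-up helper.

```python
import os
import shutil
import sys

# Default install location used in the script above; adjust if yours differs.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(which=shutil.which, exists=os.path.exists, platform=sys.platform):
    """Return a path to the Tesseract executable, or None if it cannot be found."""
    found = which("tesseract")
    if found:
        return found
    # Fall back to the default Windows location only on Windows.
    if platform.startswith("win") and exists(WINDOWS_DEFAULT):
        return WINDOWS_DEFAULT
    return None
```

In the script, you could then assign the result to pytesseract.pytesseract.tesseract_cmd when it is not None, and print an installation hint otherwise.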
Tips for Better Results
- Preprocess Images: Use tools like OpenCV to enhance image quality (e.g., remove noise, adjust contrast).
- Use Configurations: Tesseract offers configuration options to fine-tune the OCR process. Example:
  tesseract input_image.png output_text --psm 6
  --psm specifies the page segmentation mode. Use --psm 6 when the image contains a single uniform block of text.
- Train Custom Models: For specialized fonts or languages, train Tesseract with your data.
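To make the page segmentation tip easier to apply from Python, here is a small sketch that maps readable layout names to --psm values. The numeric modes are Tesseract's own (tesseract --help-psm prints the full list); the layout names and the psm_config helper are invented for this example.

```python
# A few of Tesseract's page segmentation modes; the dictionary keys are made-up labels.
PSM_MODES = {
    "auto": 3,           # fully automatic page segmentation (the default)
    "single_column": 4,  # a single column of text of variable sizes
    "single_block": 6,   # a single uniform block of text
    "single_line": 7,    # a single text line
    "single_word": 8,    # a single word
    "sparse_text": 11,   # find as much text as possible in no particular order
}


def psm_config(layout):
    """Return the '--psm N' option string for a named layout."""
    try:
        return f"--psm {PSM_MODES[layout]}"
    except KeyError:
        raise ValueError(f"unknown layout {layout!r}; choose from {sorted(PSM_MODES)}") from None


print(psm_config("single_block"))
```

With pytesseract, the resulting string can be passed through the config parameter, for example pytesseract.image_to_string(image, config=psm_config("single_line")).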
Conclusion
Tesseract OCR is a powerful tool for extracting text from images and PDFs. Its open-source nature, multilingual support, and flexibility make it a go-to solution for businesses and individuals alike. Whether you're processing invoices, digitizing books, or automating workflows, Tesseract can handle the task efficiently.
By following this guide, you'll be able to leverage Tesseract OCR for your text extraction needs and integrate it into larger automation projects. Give it a try and unlock the potential of your document workflows!