How to Convert PDF to Text: Extract Text from Any PDF

Text-Based vs. Image-Based PDFs: The Critical Distinction

Before choosing a text extraction method, identify which type of PDF you're working with:

Text-based PDF: Created digitally from a word processor, design tool, or PDF generator. The text exists as actual character data inside the PDF. You can click and drag to select text in a PDF viewer.

Image-based PDF: Created by scanning a physical document, or by exporting from a tool that rasterised the page as an image. The PDF contains a picture of the text, not the text itself. You cannot select individual characters.

Text-based PDFs can be extracted directly. Image-based PDFs require OCR (Optical Character Recognition) to convert the image of text back into actual characters.
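
If you need to make this check in a script rather than by trying to select text, one rough approach is to count the extractable characters on each page. The sketch below assumes PyMuPDF is installed; the function names and the 20-character threshold are illustrative choices, not part of any library.

```python
def classify_pages(page_texts, min_chars=20):
    """Label a PDF from the text extracted from each of its pages.

    min_chars is an arbitrary illustrative threshold: a page with fewer
    extractable non-whitespace characters is treated as image-only.
    """
    if not page_texts:
        return "empty"
    has_text = [len("".join(t.split())) >= min_chars for t in page_texts]
    if all(has_text):
        return "text-based"
    if not any(has_text):
        return "image-based"
    return "mixed"


def classify_pdf(path, min_chars=20):
    """Open a PDF with PyMuPDF and classify it (requires `pip install pymupdf`)."""
    import fitz  # imported lazily so classify_pages works without PyMuPDF

    with fitz.open(path) as doc:
        return classify_pages([page.get_text() for page in doc], min_chars)
```

Calling classify_pdf("document.pdf") returns "text-based", "image-based", "mixed", or "empty". A scanned PDF that already carries an OCR text layer will be reported as text-based, which is usually what you want for extraction purposes.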

Method 1: Copy and Paste (Simplest)

For text-based PDFs and small amounts of text:

Open the PDF in any viewer
Click and drag to select text
Ctrl+C (Windows) / Cmd+C (Mac) to copy
Paste into a text editor, Word, or wherever you need it

Ctrl+A (Windows) / Cmd+A (Mac) selects all the text at once: the whole document in most viewers, or only the current page in some.

Limitations:

Tables often paste with columns jumbled
Multi-column layouts may mix column text together
Formatting (bold, headings) is lost — you get plain text
Images and graphics are not copied

Method 2: Adobe Acrobat — Export to Text

Acrobat Pro: File → Export To → Text (Plain) or Text (Accessible)

Acrobat analyses the reading order and exports cleaner text than copy-paste, especially for multi-column layouts.

Acrobat Reader (Free): Doesn't support direct text export. Use copy-paste or a different tool.

Method 3: Online PDF to Text Tools

Several online tools convert PDF to .txt or editable formats:

PDF2Go

Upload a PDF → convert to TXT → download the plain text file. Free, no account needed.

Smallpdf / ILovePDF

Both offer PDF to Text or PDF to Word conversion. Word format (.docx) preserves more structure than plain text.

Tools of PDF

Upload and convert your PDF to Word, which you can then save as plain text from Word.

For scanned PDFs: Online tools often include OCR automatically during conversion. Check whether the tool mentions OCR support.

Method 4: pdftotext (Command Line, Free)

pdftotext is part of the Poppler utilities — free, fast, and excellent for scripting.

Install:

Linux: sudo apt install poppler-utils
Mac: brew install poppler
Windows: Download from poppler releases on GitHub

Basic usage:

pdftotext document.pdf output.txt

Preserve layout (approximate column positions with spaces):

pdftotext -layout document.pdf output.txt

Specific page range:

pdftotext -f 3 -l 7 document.pdf output.txt

(pages 3 through 7)

Raw extraction (no reading order heuristics):

pdftotext -raw document.pdf output.txt

Batch convert all PDFs in a folder:

for f in *.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done

pdftotext is among the fastest options for text-based PDFs, and its default mode handles many multi-column layouts well.
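
If you would rather drive pdftotext from Python than from a shell loop, the commands above can be wrapped roughly as follows. This is a sketch that assumes pdftotext is on your PATH; the helper names are made up for illustration, and only the -layout, -f and -l flags shown earlier are used.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, layout=False, first=None, last=None):
    """Build the argument list for pdftotext using the flags shown above."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_folder(folder, **options):
    """Run pdftotext on every PDF in a folder; return the names of files that failed.

    Raises FileNotFoundError if the pdftotext executable is not installed.
    """
    failed = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        cmd = build_pdftotext_command(pdf_path, pdf_path.with_suffix(".txt"), **options)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(pdf_path.name)
    return failed
```

For example, convert_folder("reports", layout=True) writes a .txt file next to each PDF in a folder named reports. Collecting failures instead of stopping at the first error is a design choice that suits large batches, where a few damaged files are common.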

Method 5: Python — Text Extraction Libraries

For integration into applications or complex extraction workflows:

PyMuPDF (fitz) — Recommended

import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
full_text = ""

for page in doc:
    full_text += page.get_text()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(full_text)

print("Done")

Extract with block structure (paragraphs):

for page in doc:
    blocks = page.get_text("blocks")
    for block in blocks:
        print(block[4])  # The text content of the block
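
The block list also contains image blocks, and blocks are not guaranteed to arrive in reading order. A minimal sketch of filtering and ordering them is shown below. It assumes each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type) with block_type 0 for text, and it uses hand-written sample data so it runs without a PDF. The simple top-to-bottom sort only suits single-column pages.

```python
def text_blocks_in_reading_order(blocks):
    """Keep text blocks only and sort them top-to-bottom, then left-to-right.

    Each block is assumed to be a tuple shaped like PyMuPDF's "blocks" output:
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text.
    """
    text_only = [b for b in blocks if b[6] == 0 and b[4].strip()]
    ordered = sorted(text_only, key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in ordered]


# Hand-written sample in the same shape, so the sketch runs without a PDF.
sample_blocks = [
    (72, 300, 500, 340, "Second paragraph.\n", 1, 0),
    (72, 100, 500, 140, "First paragraph.\n", 0, 0),
    (72, 150, 300, 290, "<image placeholder>", 2, 1),
]
paragraphs = text_blocks_in_reading_order(sample_blocks)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']
```

With a real document you would pass page.get_text("blocks") in place of sample_blocks.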

pdfplumber — Best for Tables

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

        # Extract tables as lists of lists
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

pdfplumber is particularly strong for extracting structured data from tables in PDFs.
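
Since each table comes back as a list of rows, saving it as CSV takes only the standard library. The sketch below uses a hand-written sample table in place of real extract_tables() output; the function name is illustrative, and it assumes empty cells arrive as None.

```python
import csv
import io


def table_to_csv(table):
    """Serialise one table (a list of rows) to CSV text.

    Empty cells, represented as None, are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue()


# Hand-written sample in the list-of-lists shape, so the sketch runs without a PDF.
sample_table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", None],
    ["Gadget, large", "1", "9.50"],
]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Using the csv module rather than joining cells with commas means that cells containing commas or line breaks are quoted correctly.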

pypdf (Pure Python)

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())

Method 6: OCR for Scanned PDFs

When the PDF is image-based (scanned), you need OCR to extract text.

Google Drive (Free, Easy)

Upload the scanned PDF to Google Drive
Right-click the file → Open with → Google Docs
Google performs OCR and creates an editable Google Doc
File → Download → Plain Text (.txt)

Accuracy: Good for clear, typed text. Fair for handwriting or degraded scans.

Tesseract (Free, Open Source, Command Line)

Tesseract is the most widely used open-source OCR engine.

Install:

Linux: sudo apt install tesseract-ocr
Mac: brew install tesseract
Windows: Download from github.com/UB-Mannheim/tesseract/wiki

Convert a scanned PDF to text:

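# First convert PDF to images (requires Ghostscript)
# -dNOPAUSE -dBATCH stop Ghostscript from pausing between pages and from waiting at its prompt afterwards
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile=page_%03d.png scanned.pdf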

# Then run OCR on each image
for img in page_*.png; do
  tesseract "$img" "${img%.png}" -l eng
done

# Combine all .txt files
cat page_*.txt > output.txt

Or use pytesseract (Python):

import pytesseract
from PIL import Image
import fitz

doc = fitz.open("scanned.pdf")
full_text = ""

for page in doc:
    # Render the page to a 300 DPI RGB image for Tesseract
    pix = page.get_pixmap(dpi=300)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    text = pytesseract.image_to_string(img, lang="eng")
    full_text += text + "\n"

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(full_text)
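
Many real-world PDFs mix digital pages with scanned ones, and OCR is slow, so it can be worth running it only on pages that have no usable text layer. The sketch below shows one way to arrange that. The function names and the 20-character threshold are illustrative assumptions, and ocr_page needs PyMuPDF, Pillow and pytesseract installed.

```python
def extract_with_ocr_fallback(pages, get_text, run_ocr, min_chars=20):
    """Return text per page, calling run_ocr only where get_text finds too little.

    get_text and run_ocr are callables that take a page and return a string.
    """
    results = []
    for page in pages:
        text = get_text(page)
        if len("".join(text.split())) < min_chars:
            text = run_ocr(page)
        results.append(text)
    return results


def ocr_page(page, dpi=300, lang="eng"):
    """OCR one PyMuPDF page with Tesseract (needs pymupdf, pillow, pytesseract)."""
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img, lang=lang)
```

With a document opened as doc = fitz.open("mixed.pdf"), the call would be extract_with_ocr_fallback(doc, lambda p: p.get_text(), ocr_page). Passing the two extractors in as callables keeps the decision logic easy to test with plain strings standing in for pages.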

ABBYY FineReader (Commercial, High Accuracy)

For high-accuracy OCR on large volumes, ABBYY FineReader is a widely used commercial option. It supports 190+ languages and complex layout recognition.

Handling Common Extraction Problems

Garbled text / wrong character order: Usually caused by non-standard font encoding. Try pdftotext -raw or PyMuPDF, which sometimes cope better with such files.

Text extracted in wrong order (mixed-up columns): Use pdftotext -layout, which preserves approximate column positioning, or pdfplumber, which exposes the coordinates of the extracted words so that columns can be separated.

Text appears correct in viewer but extracts as gibberish: The font uses a custom encoding (common in PDFs from older software). This may require manual character mapping, or Acrobat Pro's export, which can handle some custom encodings. Running OCR on the rendered pages (Method 6) is another fallback, because it reads the page image rather than the broken character data.

Tables extract as jumbled text: Use pdfplumber's extract_tables() function rather than extract_text(). It analyses bounding boxes to reconstruct table structure.

Scanned PDF with low OCR accuracy: Pre-process the images: increase contrast, remove noise, deskew. Tools like ImageMagick or OpenCV can pre-process before Tesseract.
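
For the mixed-up columns problem above, word coordinates can be used to rebuild the reading order by hand. The sketch below works on dictionaries with "text", "x0" and "top" keys, which is the shape pdfplumber's page.extract_words() returns. Splitting at the page midpoint is an assumption that only holds for simple, symmetric two-column pages, and the function name and tolerance value are illustrative.

```python
def two_column_reading_order(words, page_width, line_tolerance=3):
    """Read the left column top to bottom, then the right column.

    words: dicts with "text", "x0" and "top" keys. Words whose "top" values
    differ by at most line_tolerance are treated as being on the same line.
    """
    midpoint = page_width / 2
    columns = ([], [])
    for word in words:
        columns[0 if word["x0"] < midpoint else 1].append(word)

    lines = []
    for column in columns:
        rows = []  # each row is a list of words sharing roughly the same "top"
        for word in sorted(column, key=lambda w: w["top"]):
            if rows and abs(word["top"] - rows[-1][0]["top"]) <= line_tolerance:
                rows[-1].append(word)
            else:
                rows.append([word])
        for row in rows:
            row.sort(key=lambda w: w["x0"])
            lines.append(" ".join(w["text"] for w in row))
    return "\n".join(lines)


# Hand-written sample words, so the sketch runs without a PDF.
sample_words = [
    {"text": "Right", "x0": 320, "top": 100},
    {"text": "Left", "x0": 72, "top": 100},
    {"text": "one", "x0": 100, "top": 100.5},
    {"text": "column", "x0": 360, "top": 100},
    {"text": "Left", "x0": 72, "top": 114},
    {"text": "two", "x0": 100, "top": 114},
]
ordered_text = two_column_reading_order(sample_words, page_width=612)
print(ordered_text)
```

With pdfplumber the call would be two_column_reading_order(page.extract_words(), page.width). Pages with headers that span both columns, or with more than two columns, need a more careful split than this.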

Summary

For text-based PDFs, copy-paste, pdftotext, or Python (PyMuPDF / pdfplumber) are the fastest and most accurate methods. For scanned PDFs, OCR is required — Google Drive handles occasional documents for free, while Tesseract handles batch processing. pdfplumber is the tool of choice when extracting structured table data. Always test your extraction by checking that the output makes sense — encoding issues and column-order problems are common and need to be caught before downstream processing.
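
The final check can be partly automated. Below is a rough heuristic, not a definitive test: it flags output that is empty, that contains many replacement, control or private-use characters (a common sign of custom font encodings), or that contains few letters. The function name and both thresholds are illustrative guesses that you should tune on your own documents.

```python
import unicodedata


def looks_suspicious(text, max_bad_ratio=0.05, min_letter_ratio=0.5):
    """Flag extracted text that is probably garbled. Thresholds are illustrative."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    bad = sum(
        1
        for ch in visible
        # U+FFFD, control, private-use and unassigned code points
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Co", "Cn")
    )
    letters = sum(1 for ch in visible if ch.isalpha())
    return bad / len(visible) > max_bad_ratio or letters / len(visible) < min_letter_ratio
```

A page that is mostly numbers, such as a financial table, will be flagged by the letter-ratio test even when extraction worked, so treat a True result as a prompt to look at the page rather than as proof of failure.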
