How to Convert PDF to Text: Extract Text from Any PDF
Learn how to extract plain text from PDF files using free tools, OCR for scanned PDFs, and programmatic methods for developers.
Text-Based vs. Image-Based PDFs: The Critical Distinction
Before choosing a text extraction method, identify which type of PDF you're working with:
Text-based PDF: Created digitally from a word processor, design tool, or PDF generator. The text exists as actual character data inside the PDF. You can click and drag to select text in a PDF viewer.
Image-based PDF: Created by scanning a physical document, or by exporting from a tool that rasterised the page as an image. The PDF contains a picture of the text, not the text itself. You cannot select individual characters.
Text-based PDFs can be extracted directly. Image-based PDFs require OCR (Optical Character Recognition) to convert the image of text back into actual characters.
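The click-and-drag check can also be automated once you have the text of each page in hand (for example, from one of the Python libraries covered under Method 5). The sketch below is a rough heuristic, not a definitive test: the function name `classify_pdf` and the `min_chars` threshold of 20 are illustrative assumptions rather than part of any library.

```python
def classify_pdf(page_texts, min_chars=20):
    """Label a PDF from the text extracted from each of its pages.

    page_texts: list of strings, one per page.
    min_chars:  hypothetical threshold; pages with fewer non-whitespace
                characters are treated as having no usable text layer.
    """
    if not page_texts:
        return "empty"
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars
    )
    if pages_with_text == len(page_texts):
        return "text-based"
    if pages_with_text == 0:
        return "image-based"
    return "mixed"  # some pages have text, others probably need OCR
```

A "mixed" result usually means a scanned document with a few digitally generated pages (or the reverse), in which case only the empty pages need OCR.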
Method 1: Copy and Paste (Simplest)
For text-based PDFs and small amounts of text:
- Open the PDF in any viewer
- Click and drag to select text
- Press Ctrl+C (Windows) / Cmd+C (Mac) to copy
- Paste into a text editor, Word, or wherever you need it
Ctrl+A selects all text on the current page (or in some viewers, the entire document).
Limitations:
- Tables often paste with columns jumbled
- Multi-column layouts may mix column text together
- Formatting (bold, headings) is lost — you get plain text
- Images and graphics are not copied
Method 2: Adobe Acrobat — Export to Text
Acrobat Pro: File → Export To → Text (Plain) or Text (Accessible)
Acrobat analyses the reading order and exports cleaner text than copy-paste, especially for multi-column layouts.
Acrobat Reader (Free): Doesn't support direct text export. Use copy-paste or a different tool.
Method 3: Online PDF to Text Tools
Several online tools convert PDF to .txt or editable formats:
PDF2Go
Upload a PDF → convert to TXT → download the plain text file. Free, no account needed.
Smallpdf / ILovePDF
Both offer PDF to Text or PDF to Word conversion. Word format (.docx) preserves more structure than plain text.
Tools of PDF
Upload and convert your PDF to Word, which you can then save as plain text from Word.
For scanned PDFs: Online tools often include OCR automatically during conversion. Check whether the tool mentions OCR support.
Method 4: pdftotext (Command Line, Free)
pdftotext is part of the Poppler utilities — free, fast, and excellent for scripting.
Install:
- Linux: sudo apt install poppler-utils
- Mac: brew install poppler
- Windows: Download from poppler releases on GitHub
Basic usage:
pdftotext document.pdf output.txt
Preserve layout (approximate column positions with spaces):
pdftotext -layout document.pdf output.txt
Specific page range:
pdftotext -f 3 -l 7 document.pdf output.txt
(pages 3 through 7)
Raw extraction (no reading order heuristics):
pdftotext -raw document.pdf output.txt
Batch convert all PDFs in a folder:
for f in *.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done
pdftotext is one of the fastest options for text-based PDFs and, with -layout, generally handles multi-column layouts better than copy-paste.
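If you want to drive pdftotext from a Python script rather than a shell loop, a thin wrapper around `subprocess` is usually enough. The sketch below only uses the flags shown above (`-layout`, `-raw`, `-f`, `-l`); the helper names `build_pdftotext_cmd` and `run_pdftotext` are made up for illustration, and the code assumes the `pdftotext` binary is on your PATH.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, layout=False, raw=False,
                        first_page=None, last_page=None):
    """Assemble the pdftotext argument list from the chosen options."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if raw:
        cmd.append("-raw")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(pdf_path, txt_path, **options):
    """Run pdftotext; raises if the binary is missing or exits non-zero."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found on PATH; install Poppler first")
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, **options), check=True)
```

For example, `run_pdftotext("document.pdf", "output.txt", layout=True, first_page=3, last_page=7)` would run the same command as the layout and page-range examples combined. Passing the arguments as a list avoids shell-quoting problems with filenames that contain spaces.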
Method 5: Python — Text Extraction Libraries
For integration into applications or complex extraction workflows:
PyMuPDF (fitz) — Recommended
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
full_text = ""
for page in doc:
    full_text += page.get_text()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(full_text)
print("Done")
Extract with block structure (paragraphs):
for page in doc:
    blocks = page.get_text("blocks")
    for block in blocks:
        print(block[4])  # The text content of the block
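Each item returned by `get_text("blocks")` is a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)`, where a `block_type` of 0 marks a text block and 1 marks an image block. The text still contains the line breaks from the page layout. The sketch below shows one possible way to turn those tuples into clean paragraphs; `blocks_to_paragraphs` is a hypothetical helper name, and it does not try to rejoin words hyphenated across lines.

```python
def blocks_to_paragraphs(blocks):
    """Turn PyMuPDF "blocks" tuples into a list of paragraph strings.

    Each tuple is (x0, y0, x1, y1, text, block_no, block_type).
    """
    paragraphs = []
    for block in blocks:
        if block[6] != 0:  # skip image blocks
            continue
        # Collapse layout line breaks and repeated spaces into single spaces.
        text = " ".join(block[4].split())
        if text:
            paragraphs.append(text)
    return paragraphs
```

Called as `blocks_to_paragraphs(page.get_text("blocks"))`, it gives one string per block, which is often a closer match to the document's paragraphs than the raw page text.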
pdfplumber — Best for Tables
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
        # Extract tables as lists of lists
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
pdfplumber is particularly strong for extracting structured data from tables in PDFs.
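Since `extract_tables()` returns each table as a list of rows, writing the result to CSV takes only the standard library. The sketch below is one way to do it; `table_to_csv` is an illustrative name, and it assumes cells are either strings or `None` (which pdfplumber may return for empty cells).

```python
import csv
import io


def table_to_csv(table):
    """Serialise one table (a list of rows, each a list of cells) to CSV text.

    Empty cells given as None are written as empty strings, and
    surrounding whitespace in each cell is stripped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return buffer.getvalue()
```

Inside the loop above you could call `table_to_csv(table)` for each table and write the returned string to a file. The `csv` module handles quoting, so cells that contain commas survive the round trip.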
pypdf (Pure Python)
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())
Method 6: OCR for Scanned PDFs
When the PDF is image-based (scanned), you need OCR to extract text.
Google Drive (Free, Easy)
- Upload the scanned PDF to Google Drive
- Right-click → Open with Google Docs
- Google performs OCR and creates an editable Google Doc
- File → Download → Plain Text (.txt)
Accuracy: Good for clear, typed text. Fair for handwriting or degraded scans.
Tesseract (Free, Open Source, Command Line)
Tesseract is the most widely used open-source OCR engine.
Install:
- Linux: sudo apt install tesseract-ocr
- Mac: brew install tesseract
- Windows: Download from github.com/UB-Mannheim/tesseract/wiki
Convert a scanned PDF to text:
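# First convert the PDF to one 300 dpi PNG per page
# (-dNOPAUSE -dBATCH stop Ghostscript from waiting for input between pages)
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -sOutputFile=page_%03d.png scanned.pdf
# Then run OCR on each image (writes page_001.txt, page_002.txt, ...)
for img in page_*.png; do
  tesseract "$img" "${img%.png}" -l eng
done
# Combine all .txt files in page order
cat page_*.txt > output.txt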
Or use pytesseract (Python):
import fitz  # PyMuPDF, used here to render each page to an image
import pytesseract
from PIL import Image

doc = fitz.open("scanned.pdf")
full_text = ""
for page in doc:
    # Render at 300 dpi; lower resolutions tend to hurt OCR accuracy
    pix = page.get_pixmap(dpi=300)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    text = pytesseract.image_to_string(img, lang="eng")
    full_text += text + "\n"

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(full_text)
ABBYY FineReader (Commercial, High Accuracy)
For high-accuracy OCR on large volumes, ABBYY FineReader is a long-established commercial option. It supports 190+ languages and complex layout recognition.
Handling Common Extraction Problems
Garbled text / wrong character order:
Usually caused by non-standard font encoding. Try pdftotext -raw or PyMuPDF, which often handle such encodings better.
Text extracted in wrong order (mixed-up columns):
Use pdftotext -layout, which preserves approximate column positioning, or pdfplumber, which exposes text together with its coordinates so you can reorder it yourself.
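As a rough illustration of coordinate-based reordering, the sketch below takes word dictionaries with `text`, `x0`, and `top` keys (the shape returned by pdfplumber's `page.extract_words()`) and reads the left column before the right one. The helper name `order_two_columns` is hypothetical, and the code assumes a simple two-column page that splits exactly at the midpoint, which real layouts often do not.

```python
def order_two_columns(words, page_width):
    """Return word texts in reading order for a simple two-column page.

    words: dicts with "text", "x0" and "top" keys.
    Assumption: the gutter between the columns sits at the page midpoint.
    """
    midpoint = page_width / 2
    left = [w for w in words if w["x0"] < midpoint]
    right = [w for w in words if w["x0"] >= midpoint]
    ordered = []
    for column in (left, right):
        # Sort top-to-bottom, then left-to-right; rounding "top" is a crude
        # way to treat words on the same line as having the same height.
        column.sort(key=lambda w: (round(w["top"]), w["x0"]))
        ordered.extend(w["text"] for w in column)
    return ordered
```

With pdfplumber this could be called as `order_two_columns(page.extract_words(), page.width)`. Full-width headings and three-column pages would need a more careful split than this.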
Text appears correct in viewer but extracts as gibberish:
The font uses a custom encoding (common in PDFs from older software). Fixing this may require manual character mapping, or Acrobat Pro's export, which can handle some custom encodings.
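Gibberish output is easy to miss in a batch job, so a crude automated check can help. The sketch below flags text that contains the Unicode replacement character or too few ordinary characters. The name `looks_garbled`, the punctuation set, and the `min_ratio` threshold of 0.7 are all assumptions to tune on your own documents, and the check will not catch text that was remapped to wrong but ordinary-looking letters.

```python
def looks_garbled(text, min_ratio=0.7):
    """Heuristic: True if the extracted text is probably not readable.

    min_ratio is a hypothetical threshold for the share of non-whitespace
    characters that must be letters, digits, or common punctuation.
    """
    if "\ufffd" in text:  # Unicode replacement character
        return True
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    ordinary = sum(
        1 for ch in visible
        if ch.isalnum() or ch in ".,;:!?'\"()-"
    )
    return ordinary / len(visible) < min_ratio
```

Running this on each page's output and logging the pages that fail gives a short list to inspect by hand.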
Tables extract as jumbled text:
Use pdfplumber's extract_tables() function rather than extract_text(). It analyses bounding boxes to reconstruct table structure.
Scanned PDF with low OCR accuracy:
Pre-process the images: increase contrast, remove noise, deskew. Tools like ImageMagick or OpenCV can pre-process before Tesseract.
Summary
For text-based PDFs, copy-paste, pdftotext, or Python (PyMuPDF / pdfplumber) are the fastest and most accurate methods. For scanned PDFs, OCR is required — Google Drive handles occasional documents for free, while Tesseract handles batch processing. pdfplumber is the tool of choice when extracting structured table data. Always test your extraction by checking that the output makes sense — encoding issues and column-order problems are common and need to be caught before downstream processing.