Honest comparison of free and paid OCR tools. PDFWix doesn't offer OCR yet — here's what to use today.
Optical Character Recognition (OCR) turns an image of text into selectable, searchable characters. A classic OCR pipeline runs in four stages: image preprocessing (deskew, denoise, binarize to high-contrast black-and-white), layout analysis (detect columns, paragraphs, tables, headers), character segmentation (isolate each glyph), and recognition (a neural network maps each glyph image to a Unicode code point, then a language model fixes obvious errors using context, so 'cl0ud' becomes 'cloud'). Current engines such as Tesseract 5, Adobe's Sensei-based OCR and ABBYY use neural recognisers (Tesseract's is an LSTM that reads a whole text line at a time rather than isolated glyphs) and can reach 99%+ character accuracy on clean 300 dpi scans.
To get the best accuracy from any OCR tool: scan at 300 dpi minimum (600 dpi for small print, receipts and forms); avoid JPEG artifacts by exporting scans as PNG, or as PDF with losslessly compressed images, rather than re-compressed JPG; deskew before OCR if pages are crooked (most tools do this automatically, but they run faster on already-straight input); specify the source language explicitly when possible (restricting recognition to one known language, such as English, is faster and more accurate than auto-detect); and run a spell-check pass on the output.
Scanned PDFs from a phone camera typically OCR worse than flatbed scans because of perspective distortion and shadows. A free app like Microsoft Lens or the document scanner built into Apple Notes corrects perspective before saving, which makes downstream OCR markedly more accurate.
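The context-correction step mentioned above ('cl0ud' becomes 'cloud') can be illustrated with a toy sketch. This is not how Tesseract or any commercial engine implements it; real engines use statistical language models. The confusion table, the function name `correct_word` and the tiny vocabulary are all assumptions made up for the example.

```python
from itertools import product

# Hypothetical confusion table: digits that OCR output often contains
# where a letter was printed. Real engines learn this statistically.
CONFUSABLE = {"0": "o", "1": "l", "5": "s"}


def correct_word(word, vocabulary):
    """Return a vocabulary word reachable by swapping confusable
    characters, or the word unchanged if none is found."""
    if word.lower() in vocabulary:
        return word
    # Each position offers the original character and, if it is in the
    # confusion table, its likely intended letter.
    options = [
        (ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,)
        for ch in word
    ]
    # Try every combination; fine for short words, exponential for
    # words with many confusable characters.
    for candidate in product(*options):
        joined = "".join(candidate)
        if joined.lower() in vocabulary:
            return joined
    return word


vocabulary = {"cloud", "scan", "text"}
print(correct_word("cl0ud", vocabulary))
print(correct_word("2024", vocabulary))
```

A word that matches nothing in the vocabulary, such as a year or an invoice number, is left untouched, which is the behaviour you want from any post-correction pass.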
Not yet. We're working on a browser-based OCR using Tesseract.js. In the meantime, the tools listed above all handle PDF OCR well.
For one-off PDFs, Google Drive is the easiest option: upload the file, then choose 'Open with' and 'Google Docs'. For batch jobs, Tesseract on the command line, or a wrapper such as OCRmyPDF, is free and very accurate.
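For the batch route, a minimal sketch of driving OCRmyPDF over a folder might look like the following. It assumes the `ocrmypdf` command is installed and on your PATH; the helper names `build_ocrmypdf_command` and `ocr_folder` and the folder names are hypothetical. The flags used (`--language`, `--deskew`, `--skip-text`) are standard OCRmyPDF options.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng"):
    """Build the argument list for one OCRmyPDF run."""
    return [
        "ocrmypdf",
        "--language", language,  # explicit language beats auto-detect
        "--deskew",              # straighten crooked pages first
        "--skip-text",           # leave pages that already have text alone
        str(src),
        str(dst),
    ]


def ocr_folder(in_dir, out_dir, language="eng"):
    """OCR every PDF in in_dir, writing results to out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.pdf")):
        command = build_ocrmypdf_command(src, out_path / src.name, language)
        # check=True raises CalledProcessError if OCRmyPDF fails
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical folder names; adjust to your own layout.
    ocr_folder("scans", "searchable")
```

Keeping the command builder separate from the loop makes it easy to inspect or log the exact command before anything is run.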
Free OCR (Tesseract, Google Drive) typically reaches about 95% character accuracy on clean scans. Adobe and ABBYY reach 99%+ and preserve layout (tables, columns) far better. That is worth paying for in archive work, but overkill for personal use.
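If you want to check figures like these on your own documents, character accuracy is commonly computed as one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is one way to do that in plain Python; the function names are illustrative and the definition is an assumption about how the percentages above were measured.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recovered correctly, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


print(character_accuracy("cloud", "cl0ud"))
```

To put the percentages in perspective: on a page of about 2,000 characters, 95% accuracy means roughly 100 characters to fix by hand, while 99% means about 20.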