DIYbanter - View Single Post

Jeff Liebermann

On Mon, 29 May 2017 02:04:11 -0400, rickman wrote:

Jeff Liebermann wrote on 5/29/2017 12:36 AM:
On Mon, 29 May 2017 00:08:29 -0400, rickman wrote:

I received a PDF document of an old PDP-11 listing from someone who wanted
help typing it in. I realized when I clicked my cursor over the text it
would select even though it was clearly created from images. Seems some
software in the path (possibly my reader) was doing optical character
recognition on the document. Most of it came through ok, but once in a
while the slightly out of adjustment printer characters would be misread
like a 9 for a 0, or a 0 for an O. Still, it saved a lot of time.

Anyone else see scanned documents showing selectable text?

Searchable text is a standard PDF feature, even with bitmapped text.
PDF-Xchange has built-in OCR (optical character recognition) that will
read through the graphical text, do its best to convert it to ASCII
text, and save the combined file. After that, you can use the search,
select, and edit functions:
https://www.youtube.com/watch?v=CWtHOsIKaKw
https://www.tracker-software.com/knowledgebase/351-How-do-I-OCR-a-document
The free version will do all that except edit and save the resulting
text. For that, you need the registered version.

I'm not sure what "standard" means.

Bad choice of words. I meant that the PDF standard:
https://en.wikipedia.org/wiki/PDF/A
includes searchable text as part of the standard. I'm too lazy to
look up the chapter and verse.

I was viewing a document full of imaged
text the other day and none of the permissions were set to preclude
anything. Yet I couldn't select any text as it had not been OCR'd.

Yep. If you scan text as a bitmap image and save it in PDF format,
it cannot be text searched. You have to feed it to an OCR program
that can attach the recognized text to the PDF, save the result, and
then you can search.
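
A rough way to tell the two kinds of PDF apart without opening a viewer is to look for font objects, since an image-only scan normally has none and an OCR'd file needs at least one for its text layer. This is only a heuristic sketch in Python, not how PDF-Xchange decides anything. The function name is made up for illustration, and it assumes any compressed streams use Flate (zlib) compression:

```python
import re
import zlib

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def probably_has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if a /Font reference is visible anywhere in the file.

    Hypothetical helper. Checks the plain bytes first, then any stream
    that happens to decompress with zlib (covers compressed object streams).
    """
    if b"/Font" in pdf_bytes:
        return True
    for match in _STREAM.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, e.g. a JPEG image stream
        if b"/Font" in data:
            return True
    return False


def file_probably_has_text_layer(path: str) -> bool:
    """Same check, reading the PDF from disk."""
    with open(path, "rb") as handle:
        return probably_has_text_layer(handle.read())
```

A file that reports False here almost certainly needs OCR before it can be searched. True only says a font is present, which could also be a stamp or page number rather than a full text layer.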

I assume the OCR has to be done at capture time.

No. It can be done at any time with any reasonable document. I
usually make some effort to realign the text and improve the contrast
to make it easier (and faster) for the OCR program to do its thing.
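
For anyone wondering what "improve the contrast" amounts to, here is a minimal Python sketch of a linear contrast stretch on a grayscale image held as rows of 0-255 values. It is an illustration only, not what any particular scanner or viewer does internally, and the function name and data layout are my own assumptions:

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest becomes 0 and the
    brightest becomes 255. `pixels` is a list of rows of ints in 0..255.
    Hypothetical helper for illustration."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in pixels]
    span = hi - lo
    # Integer arithmetic keeps the result exact and within 0..255.
    return [[(p - lo) * 255 // span for p in row] for row in pixels]


# A washed-out scan: "ink" at 100, "paper" at 200.
faded = [[100, 150], [125, 200]]
print(stretch_contrast(faded))
```

After stretching, the ink and paper values sit at opposite ends of the range, which is the kind of separation that makes character shapes easier to pick out.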

Are you saying a reader will convert images to text?

If the images look like readable ASCII characters, yes. I don't think
size makes much difference, but I haven't experimented much with how
badly I can butcher the text and still have the OCR work. I also
haven't tried editing the text afterwards to correct OCR errors.
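
On correcting OCR errors in something like that PDP-11 listing: the misreads mentioned (a 9 for a 0, a 0 for an O) are easy to flag mechanically if the numeric columns are octal, because an 8, a 9, or a letter can never be legitimate there. Below is a small Python sketch. It assumes the listing prints 16-bit words as 6-digit octal fields, which I have not checked against the actual document, and the function name is invented:

```python
import re

# Assumption: numeric fields in the listing are 6-character octal words.
# Candidate fields are 6-character tokens made of digits or look-alike letters.
_FIELD = re.compile(r"\b[0-9OoIl]{6}\b")
_VALID_OCTAL = re.compile(r"[0-7]{6}")


def suspicious_fields(line: str) -> list[str]:
    """Return the 6-character numeric-looking tokens that are not valid octal."""
    return [tok for tok in _FIELD.findall(line)
            if not _VALID_OCTAL.fullmatch(tok)]


# Hypothetical OCR output lines, not taken from the real listing.
print(suspicious_fields("001000 012706 001O00  MOV #1000,SP"))
print(suspicious_fields("000009 177776"))
```

This only catches misreads that produce an impossible character. A 0 read as a 6 still looks like valid octal and needs a human eye or a reassembly check.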

Maybe a demo will help. Note that the initial scan and file saves
were done in Irfanview, while the OCR and subsequent saves were done
in PDF-Xchange:

Original document scanned to JPG using Irfanview 4.44:
http://802.11junk.com/jeffl/OCR%20Demo/JPG.jpg
This is not searchable.

Same document saved to PDF using Irfanview 4.44:
http://802.11junk.com/jeffl/OCR%20Demo/PDF-no-OCR.pdf
This is also NOT searchable.

Same document in PDF-Xchange 6.0 build 322.4 after OCR:
http://802.11junk.com/jeffl/OCR%20Demo/PDF-after-OCR.pdf
This one can be searched.

PDF-Xchange screen grab showing a typical search result:
http://802.11junk.com/jeffl/OCR%20Demo/PDF-Xchange-screen.jpg

--
Jeff Liebermann
150 Felker St #D http://www.LearnByDestroying.com
Santa Cruz CA 95060 http://802.11junk.com
Skype: JeffLiebermann AE6KS 831-336-2558