Wednesday, March 30, 2011

From a picture to a font — OCR in Acrobat 9

Ever been handed a piece of paper by a client and told, "This is all I have, but I want you to recreate this"? If it's a business card, that's not a big deal. There's not much time involved in re-typesetting someone's phone number. But what if it's several pages of type? Or a manuscript? Or a file of scanned images that a client then wants to make type changes to?

Acrobat (the full version, not Reader) has a great feature to help with this problem: OCR. OCR stands for Optical Character Recognition, and it can turn an image of text into real, searchable text. Acrobat has had an OCR feature for many versions now, but we are going to look at the OCR features in Acrobat 9. You can run OCR while you are scanning a document, or apply it to an image you already have.

I am going to assume you have an image file already and want the text in the image to actually be text, and not a picture. First, launch Acrobat 9. (Mac or PC, of course!) Then go to File > Create PDF > From File. Select your image file and click OK. When it opens, you have a PDF file all ready to go.

(A note: Adobe recommends switching to a lossless compression format before converting TIFFs to PDF. Go to Edit > Preferences > Convert to PDF, select TIFF, click Edit Settings, and change the compression to ZIP or, for monochrome images, to JBIG2 (Lossless) or CCITT G4.)

Then go to Document > OCR Text Recognition > Recognize Text Using OCR.



When that dialog pops up, choose all pages if you have a multiple-page document, or the current page, or a range of pages. For the settings the OCR will use, I prefer the ones in the image below.


ClearScan is new in Acrobat 9, and it is why I like using 9 instead of a previous version. In previous versions, the best setting to use was "Searchable Image." With Searchable Image, you end up with a compressed copy of the picture plus a type layer behind the image that Acrobat can then use for searches, etc. Since it keeps the picture, the "Searchable Image" setting leaves you with a much larger document, especially if your "Downsample" setting is at 600 dpi. The difference between "Searchable Image" and "Searchable Image (Exact)" is that the Exact option keeps the original image without compression, giving you an even larger final document.

ClearScan actually creates a custom font out of the picture, turning your bitmap text into vectors. This allows for a much smaller final document size and better-looking letters. For the third setting in this menu, "Downsample," I like to use the highest resolution possible, which also helps keep the letters looking their best.

To edit your settings, click on the "Edit" button. Once they are set, click OK to continue.

The OCR will now run. How long this takes depends on how many pages the PDF has and how much text is on those pages. Also, remember that the higher the resolution of your original image, the better the OCR will recognize your text.

Here is a close-up sample of my PDF document of an image of text before OCR. Note the artifacting around the letters.


And here is a close-up of my PDF after the OCR has run. The type is no longer bitmapped, but is magically transformed into vectors.


Voila, now I have selectable text:


Once my text is really text, I can do all sorts of things with it in Acrobat.

Or I could select it all, copy it, launch InDesign, and paste it into a document, where I can pick whichever font I wish for my text and move on from there to make any type corrections or layout modifications. The image below is small, but it is trying to show that my font has been changed from the custom OCR font to Arial.


The OCR in Acrobat is not perfect. It didn't recognize the fancy drop cap on the page as text, and some of the letters came out wrong (for example, the capital letter "I" and the numeral "1" can look a lot alike in some fonts), so a quick proofread was necessary. However, running the OCR was much faster than having me sit down and type in a bunch of pages of text (and much more accurate, too).
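
If you have a lot of pages to proofread, a small script can point you at the likeliest trouble spots. What follows is only a rough sketch in Python, not anything built into Acrobat: it assumes you have copied the recognized text out of the PDF and pasted it into a plain string, and the function name flag_suspects is made up for this example. It flags any word that mixes letters and digits, which is where a "1" standing in for an "I" (or the other way around) tends to show up.

```python
import re

def flag_suspects(text):
    """Return (line_number, word) pairs for words that mix letters and digits.

    Hypothetical helper for proofreading pasted OCR output; a mixed word
    such as "1n" or "2I" often means a letter and a digit were confused.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Split the line into runs of letters, digits, and underscores.
        for word in re.findall(r"\w+", line):
            has_letter = any(ch.isalpha() for ch in word)
            has_digit = any(ch.isdigit() for ch in word)
            if has_letter and has_digit:
                hits.append((line_no, word))
    return hits

if __name__ == "__main__":
    sample = "1n the beginning\nIt was 1999 and I saw 2I cats"
    for line_no, word in flag_suspects(sample):
        print(f"line {line_no}: check '{word}'")
```

This will also flag legitimate words like "3rd," and it cannot catch a lone "I" that should have been a "1," so it narrows down the proofread rather than replacing it.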

Adobe has been expanding Acrobat's capabilities with each version release. There's more to it than just PDF creation nowadays. We'll look at more useful Acrobat features in future posts.
