Are all PDF documents the same? – Investintech.com Inc.

No, they are not.

PDF documents can be created in a variety of ways. PDFs generated from an electronic source, such as a Word document, a computer-generated report, or spreadsheet data, have an internal structure that can be read and interpreted. These "electronically generated" PDF documents already store their text as character codes rather than as pictures of characters. Conversion from such PDF files can therefore read those characters directly and provide reliable output.

On the other hand, a PDF document can be created by scanning a paper document into an electronic format. A scanned document, however, is just a "picture" of the words on the page. To convert a scanned PDF into an editable format, OCR (optical character recognition) technology must analyze the "image" of each character and match it to an electronic character. Because of this, it is much more difficult to ensure that the character "recognized" by the OCR software is the character that actually appears on the scanned document. The quality of the OCR output is affected by factors such as poor image quality, a mixture of fonts, and italicized or underlined text, all of which can blur or distort the shape of individual characters.
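The difference between the two kinds of PDF can be illustrated with a short Python sketch. It is not how Able2Extract works internally; it is only a rough heuristic, and it assumes that the text of each page has already been extracted by some PDF library (for example, pypdf's `page.extract_text()`). The function names, the labels, and the `min_chars` threshold are hypothetical choices made for this illustration.

```python
from typing import List


def has_text_layer(page_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page carries real character data.

    min_chars is an arbitrary, assumed threshold: a scanned page
    usually yields no extractable text, or only a few stray characters.
    """
    # Count only visible characters; whitespace alone is not evidence of text.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible >= min_chars


def classify_pdf(page_texts: List[str], min_chars: int = 20) -> str:
    """Label a document from the text extracted from each of its pages."""
    if not page_texts:
        return "empty"
    flags = [has_text_layer(text, min_chars) for text in page_texts]
    if all(flags):
        return "electronic"  # every page has character data
    if not any(flags):
        return "scanned"     # no page has character data; OCR would be needed
    return "mixed"           # some pages are images, some are text


# Example input: one page with extractable text, one page that is only an image.
pages = ["Quarterly report: revenue rose 4% year over year.", ""]
print(classify_pdf(pages))
```

A scanned PDF that has already been run through OCR may contain a hidden text layer, so a check like this would label it "electronic" even though its characters came from recognition rather than from the original source.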

This article refers to Able2Extract and Able2Extract Professional.
