DynaPDF Manual - Page 557

Function Reference
Unicode conversion
The extraction of human-readable text requires a conversion to a well-known encoding like Unicode
because PDF strings are not necessarily human-readable.
Whether it is possible to convert a PDF string to Unicode depends on whether the required encoding
information is available. This is always the case if a font uses a predefined encoding like WinAnsi or
MacRoman or if the glyph names of Type1 fonts are available in the Adobe Glyph List or
ZapfDingbats encoding.
Fonts which use a symbol encoding can provide a ToUnicode CMap which offers the required
mapping to Unicode. However, this CMap is optional and is not necessarily available. If a symbol
font does not contain a ToUnicode CMap, the strings are converted using code page 1252.
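The code page 1252 fallback can be illustrated with a minimal sketch. It assumes that the fallback amounts to interpreting every byte of the string as a Windows-1252 character; the function name cp1252_to_utf16 and the use of U+FFFD for unassigned bytes are choices of this sketch and not part of DynaPDF.

```c
#include <stddef.h>
#include <stdint.h>

/* Unicode values for the Windows-1252 bytes 0x80..0x9F.
   The bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in code page
   1252; mapping them to U+FFFD is an assumption of this sketch. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Converts len bytes of src into UTF-16 code units stored in dst.
   dst must provide room for len code units. Returns the number of
   code units written, which is always len. */
size_t cp1252_to_utf16(const unsigned char *src, size_t len, uint16_t *dst)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        /* Only 0x80..0x9F differ from Latin-1; all other bytes map 1:1. */
        dst[i] = (c >= 0x80 && c <= 0x9F) ? cp1252_high[c - 0x80] : c;
    }
    return len;
}
```

For example, the byte 0x80 yields U+20AC (Euro sign), while 0x41 and 0xE9 are passed through unchanged as U+0041 and U+00E9.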
External CMaps
A widely used technique to reduce the amount of data that must be stored in a PDF file is the usage
of non-embedded CID fonts. CID fonts, whether embedded or not, can depend on external CMap files
which offer the required mapping to Unicode.
To process strings of such fonts correctly, DynaPDF must be able to load required CMap files if
necessary. Therefore, DynaPDF is delivered with the most important CMap files which are provided
by Adobe Systems. These CMaps can be found in the DynaPDF installation directory at
/Resource/CMap/. Applications which extract text from PDF files should include these CMaps so that
they can be loaded at runtime.
The search path to external CMaps must be set with SetCMapDir() before executing ParseContent()
the first time. The function creates a CMap cache that is held in memory until the PDF instance is
deleted. The search path(s) to external CMap files should be set only once per PDF instance,
and one PDF instance should be used to process as many PDF files as possible. This can significantly
improve processing speed.
If a required CMap is absent, the Decoded parameter of the TShowTextArrayW callback function is set
to false. The string should be ignored in this case because no meaningful values can be returned.
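The handling of the Decoded flag can be sketched as follows. The types text_run and extract_state and the function on_show_text are hypothetical, simplified stand-ins for illustration only; they do not reproduce the real TShowTextArrayW signature or the DynaPDF record structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXTRACT_CAPACITY 256

/* Hypothetical stand-in for one decoded text record (UTF-16 code units). */
typedef struct {
    const uint16_t *text;
    size_t length;
} text_run;

/* Hypothetical user data collected while the content stream is parsed. */
typedef struct {
    uint16_t buffer[EXTRACT_CAPACITY];
    size_t used;     /* number of code units stored in buffer */
    size_t skipped;  /* number of strings ignored because decoding failed */
} extract_state;

/* Simplified callback: appends the text of all runs to the buffer, but
   ignores the whole string when decoded is false. Text that does not fit
   into the buffer is dropped. Always returns 0 in this sketch. */
int on_show_text(extract_state *state, const text_run *runs, size_t count,
                 bool decoded)
{
    if (!decoded) {
        /* No usable Unicode values are available, so skip the string. */
        state->skipped++;
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < runs[i].length; j++) {
            if (state->used >= EXTRACT_CAPACITY)
                return 0;
            state->buffer[state->used++] = runs[i].text[j];
        }
    }
    return 0;
}
```

Counting the skipped strings is optional, but it makes it easy to report that a PDF file referenced a CMap that could not be loaded.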
Inside the Callback Functions
The following sub-clauses describe important operations which must be executed in the callback
functions to achieve correct results.
This callback function is executed when a Form XObject is painted (DynaPDF calls this object
type a template). The parameter BBox specifies the template's bounding box, that is, its visible
area measured in form space. The required mapping from form space to user space is specified by
the Matrix parameter (see also Coordinate Spaces). The Matrix parameter is optional and can be set to
NULL. In that case, the form matrix is the identity matrix.
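The mapping from form space to user space can be sketched as follows. The types matrix6, rect and point and the function bbox_to_user_space are hypothetical names used for illustration and are not DynaPDF types. The sketch assumes the usual PDF matrix convention x' = a*x + c*y + tx, y' = b*x + d*y + ty and treats a NULL matrix as the identity matrix.

```c
#include <stddef.h>

/* Hypothetical six-value PDF matrix [a b c d tx ty]. */
typedef struct { double a, b, c, d, tx, ty; } matrix6;

/* Hypothetical rectangle given by two opposite corners. */
typedef struct { double x1, y1, x2, y2; } rect;

typedef struct { double x, y; } point;

/* Applies the matrix m to the point (x, y). */
static point transform_point(const matrix6 *m, double x, double y)
{
    point p;
    p.x = m->a * x + m->c * y + m->tx;
    p.y = m->b * x + m->d * y + m->ty;
    return p;
}

/* Maps the four corners of bbox from form space to user space.
   form_matrix may be NULL, in which case the identity matrix is used.
   The corners are written counter-clockwise, starting at (x1, y1). */
void bbox_to_user_space(const rect *bbox, const matrix6 *form_matrix,
                        point out[4])
{
    static const matrix6 identity = { 1.0, 0.0, 0.0, 1.0, 0.0, 0.0 };
    const matrix6 *m = form_matrix ? form_matrix : &identity;

    out[0] = transform_point(m, bbox->x1, bbox->y1);
    out[1] = transform_point(m, bbox->x2, bbox->y1);
    out[2] = transform_point(m, bbox->x2, bbox->y2);
    out[3] = transform_point(m, bbox->x1, bbox->y2);
}
```

All four corners are transformed because a form matrix may rotate or skew the template, so the result is in general a quadrilateral and not an axis-aligned rectangle.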