
DynaPDF Manual - Page 540


Function Reference
a sub string or to convert an arbitrary source string manually to Unicode. It is always possible to
calculate the exact position of a string, but the recommended strategy depends on the text
callback function in use and on the kind of algorithm to be developed:
Text extraction algorithms usually do not require the exact position of every character or word in
a string. Coordinates of sub strings are only required if word spacing must be considered, and
word spacing applies to simple fonts only. Because the code length of a simple font is always
one byte, the string width can be computed easily with fntGetTextWidth(). In cases where
the source string is shorter than the Unicode string, the source string can be converted
manually to Unicode with TranslateRawString2() (the name is fntTranslateRawString2() in
C/C++). For this kind of algorithm the TShowTextArrayW callback function should be used
because it provides everything required to develop fast text extraction algorithms. The example
projects text_extraction and text_coordinates demonstrate how text extraction algorithms can
be developed.
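The width computation for a simple font can be illustrated with a minimal sketch. It is not the implementation of fntGetTextWidth(); it only applies the general PDF text-advance rule to one-byte codes. The type text_state, the function simple_string_width(), and the widths[] table are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical text state; the field names are illustrative only
   and are not DynaPDF types. */
typedef struct {
    double font_size;    /* font size in text space units */
    double char_spacing; /* added after every character */
    double word_spacing; /* added after the single-byte code 32 only */
    double text_scale;   /* horizontal scaling in percent, 100 = unscaled */
} text_state;

/* Width of a simple-font string: exactly one byte per character code.
   widths[] holds glyph advances in 1/1000 text space units, indexed by code. */
double simple_string_width(const unsigned char *s, size_t len,
                           const double widths[256], const text_state *ts)
{
    double w = 0.0;
    size_t i;

    assert(s != NULL && widths != NULL && ts != NULL);
    for (i = 0; i < len; i++) {
        double adv = widths[s[i]] / 1000.0 * ts->font_size + ts->char_spacing;
        if (s[i] == 32)          /* word spacing applies to the space code */
            adv += ts->word_spacing;
        w += adv;
    }
    /* Horizontal scaling affects glyph widths and both spacings. */
    return w * ts->text_scale / 100.0;
}
```

Because every code is one byte long, the character index and the byte index are identical, which is why no position mapping between the source string and the Unicode string is needed in this case.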
Text search algorithms could use the TShowTextArrayW callback function too, but its usage
is much more complicated if strings of CID fonts must be processed. CID fonts support
encodings with arbitrary code lengths from one through four bytes per character. Because the
string width cannot be computed from the translated Unicode string, the algorithm must be
able to find the corresponding position in the source string. This is not easy, especially if the
search text is stored in multiple text records.
To simplify the development of text search algorithms, the content parser provides the
TShowTextArrayA callback function, which returns the raw source strings. The conversion to
Unicode can be done with TranslateRawCode() (the name is fntTranslateRawCode() in
C/C++). The function converts a sequence of source bytes to Unicode and calculates the width
of the resulting character. The advantage is that the exact position of every character in a string
can be calculated easily, independent of the current font type. The overhead of calling the
function once per character is small because the function is heavily optimized for processing
speed. The example text_search demonstrates how a text search algorithm can be developed.
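The per-character strategy can be sketched as follows. This is not DynaPDF code: toy_translate() is a hypothetical stand-in for a translate function such as fntTranslateRawCode(), whose real signature is not reproduced here, and the toy encoding (bytes below 0x80 are one-byte codes, all others start a two-byte code) as well as the fixed widths are assumptions made only to keep the example self-contained. The sketch handles a single text record.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHARS 256

/* Result of decoding one character code from the raw source string. */
typedef struct {
    size_t   bytes_used; /* code length in bytes, 0 on error */
    unsigned unicode;    /* decoded character */
    double   width;      /* advance of this character */
} decoded_char;

/* HYPOTHETICAL stand-in for a per-code translate function.
   Toy encoding: bytes < 0x80 are one-byte codes (width 5),
   all other bytes start a two-byte code (width 10). */
static decoded_char toy_translate(const unsigned char *src, size_t len)
{
    decoded_char d = {0, 0, 0.0};
    if (len == 0) return d;
    if (src[0] < 0x80) {
        d.bytes_used = 1; d.unicode = src[0]; d.width = 5.0;
    } else if (len >= 2) {
        d.bytes_used = 2;
        d.unicode = ((unsigned)(src[0] & 0x7F) << 8) | src[1];
        d.width = 10.0;
    }
    return d;
}

/* Returns the x offset at which 'needle' starts in the raw string,
   or -1.0 if it does not occur. */
double find_text_offset(const unsigned char *src, size_t len,
                        const unsigned *needle, size_t needle_len)
{
    unsigned cp[MAX_CHARS]; /* decoded characters */
    double   x[MAX_CHARS];  /* x[i] = start position of character i */
    size_t   n = 0, pos = 0, i, j;
    double   cur = 0.0;

    assert(src != NULL && needle != NULL && needle_len > 0);
    /* Walk the source bytes one character code at a time. */
    while (pos < len && n < MAX_CHARS) {
        decoded_char d = toy_translate(src + pos, len - pos);
        if (d.bytes_used == 0) break;   /* truncated code: stop */
        cp[n] = d.unicode;
        x[n]  = cur;
        n++;
        cur += d.width;
        pos += d.bytes_used;
    }
    /* Naive search over the decoded characters. */
    for (i = 0; i + needle_len <= n; i++) {
        for (j = 0; j < needle_len && cp[i + j] == needle[j]; j++) {}
        if (j == needle_len) return x[i];
    }
    return -1.0;
}
```

Because the position is accumulated while the source bytes are decoded, the code length of each character never has to be reconstructed from the Unicode string. A search text that spans several text records would additionally require carrying the partial match and the current position from one record to the next.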
Using the Content Parser
The content parser can be used to extract text, vector graphics, and images from a PDF file. The
following sections describe which callback functions must be set, what must be stored in the
graphics state, and other important aspects.
Note that DynaPDF is delivered with several example projects which demonstrate how the content
parser can be used. Before developing your own code, take a look at the examples text_extraction,
text_search, or image_extraction.
Text Extraction or Text Search Algorithms
The following callback functions should be set to process PDF text:
TBeginTemplate
TEndTemplate // Optional
