DynaPDF Manual - Page 559


Function Reference
}
}
DynaPDF is delivered with several example projects that demonstrate how text coordinates must
be computed and how text can be extracted. The code above is a fragment of the example project
Text Coordinates, which is delivered with all DynaPDF versions.
Text Scaling
Like character and word spacing, the current text scaling is already included in the text width that
is passed to all text callback functions. However, the value must be stored in the graphics state if
the width of a sub string must be computed. Text scaling is measured in percent of the original,
unscaled text width.
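The role of the stored scaling value can be sketched in C as follows. This is a minimal illustration, not DynaPDF code: the Glyph record and substring_width() are hypothetical names, and the arithmetic assumes the glyph displacement rule of the PDF specification, tx = (w0 * Tfs + Tc + Tw) * Th, in which the scaling factor Th also applies to character and word spacing. In a real project the sub string width would normally be obtained from the library helpers such as fntGetTextWidth().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-glyph record for this sketch; not a DynaPDF type. */
typedef struct {
    double advance;  /* glyph advance in 1/1000 text space units */
    int    is_space; /* non-zero if the glyph is the single-byte code 32 */
} Glyph;

/* Width of glyphs[first .. first + count - 1] in text space units.
   Assumed rule (PDF specification): tx = (w0 * Tfs + Tc + Tw) * Th,
   where Th = scaling_percent / 100. */
double substring_width(const Glyph *glyphs, size_t first, size_t count,
                       double font_size, double char_spacing,
                       double word_spacing, double scaling_percent)
{
    double th = scaling_percent / 100.0; /* percent -> factor */
    double width = 0.0;
    size_t i;

    assert(glyphs != NULL || count == 0);

    for (i = first; i < first + count; i++) {
        double tx = glyphs[i].advance / 1000.0 * font_size + char_spacing;
        if (glyphs[i].is_space)
            tx += word_spacing;  /* word spacing applies to spaces only */
        width += tx * th;        /* scaling applies to spacing as well */
    }
    return width;
}
```

With three glyphs of advance 500, 250 (a space) and 500, a font size of 10, character spacing 1, word spacing 2 and a scaling of 50 percent, the full string measures 8.75 units and the last two glyphs measure 5.75 units. Without the stored scaling value the sub string result would be twice as large.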
Sub string coordinates
Sub string processing is somewhat more complicated because the width of a sub string cannot be
calculated from the Unicode string, and one source character does not necessarily correspond to one
Unicode character.
Simple fonts use one-byte encodings in which one source byte can be decoded to one or more Unicode
characters. CID fonts also support multi-byte encodings with fixed and variable code lengths. A
sequence of n source bytes can be decoded to m Unicode characters, so there is no direct positional
relationship between the source string and the converted Unicode string.
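One way to cope with the n-to-m relationship is to record, while decoding, which source byte produced each Unicode character. The following C sketch shows the idea for a one-byte encoding. Everything in it is hypothetical: the toy encoding (byte 0x01 stands for an "fi" ligature, all other bytes map to themselves) and the names Decoded, decode_byte(), decode_with_map() and unicode_to_source_range() are assumptions made for illustration and are not part of DynaPDF. A CID font would additionally need the code length of every character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decode result: one source byte yields 1..3 code units. */
typedef struct {
    unsigned short cp[3];
    size_t count;
} Decoded;

/* Toy one-byte encoding (an assumption for this sketch):
   0x01 is an "fi" ligature, every other byte maps to itself. */
static Decoded decode_byte(unsigned char b)
{
    Decoded d = { {0, 0, 0}, 1 };
    if (b == 0x01) {
        d.cp[0] = 'f';
        d.cp[1] = 'i';
        d.count = 2;
    } else {
        d.cp[0] = b;
    }
    return d;
}

/* Decodes src into out and stores, for every output character, the index
   of the source byte it came from. Returns the number of characters. */
size_t decode_with_map(const unsigned char *src, size_t src_len,
                       unsigned short *out, size_t *src_index, size_t cap)
{
    size_t n = 0, i, k;
    for (i = 0; i < src_len; i++) {
        Decoded d = decode_byte(src[i]);
        for (k = 0; k < d.count; k++) {
            assert(n < cap);      /* caller must provide enough room */
            out[n] = d.cp[k];
            src_index[n] = i;
            n++;
        }
    }
    return n;
}

/* Converts a match in the Unicode string into the source byte range
   that must be measured to obtain its width. */
void unicode_to_source_range(const size_t *src_index,
                             size_t u_first, size_t u_count,
                             size_t *s_first, size_t *s_count)
{
    assert(u_count > 0);
    *s_first = src_index[u_first];
    *s_count = src_index[u_first + u_count - 1] - *s_first + 1;
}
```

For the five source bytes 'o', 'f', 0x01, 'c', 'e' the decoder produces the six characters "office". A match of "ice" at Unicode position 3 maps back to three source bytes starting at index 2. The match begins inside the ligature, so the measured range necessarily covers the whole ligature glyph; a search algorithm has to decide how to treat such partial matches.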
If a text search algorithm must provide the coordinates of a found string, it must be able to
find the position of the search text in the source string, because the string width cannot be
calculated from the Unicode string. DynaPDF provides several helper functions to calculate the width of
a sub string or to convert an arbitrary source string to Unicode manually. It is always possible to
calculate the exact position of a string, but the recommended strategy depends on the text
callback function used and on the kind of algorithm to be developed:
Text extraction algorithms usually do not require the exact position of every character or word in
a string. Coordinates of sub strings are required only if word spacing must be considered, and
word spacing applies to simple fonts only. Because the code length of a simple font is always
one byte, the string width can be computed easily with fntGetTextWidth(). In cases where
the source string is shorter than the Unicode string, the source string can be converted
to Unicode manually with TranslateRawString2() (the name is fntTranslateRawString2() in
C/C++). For this kind of algorithm the TShowTextArrayW callback function should be used
because it provides everything required to develop fast text extraction algorithms. The example
projects text_extraction and text_coordinates demonstrate how text extraction algorithms can
be developed.
Text search algorithms could use the TShowTextArrayW callback function too, but its usage
is much more complicated if strings of CID fonts must be processed. CID fonts support
encodings with arbitrary code lengths from one through four bytes per character. Because the
string width cannot be computed from the translated Unicode string, the function must be
 
