
DynaPDF Manual - Page 436


Function Reference
Order of Text records
GetPageText() returns each time a text showing operator is found. The returned text therefore does
not necessarily represent a text line. It can be anything from a single character up to a complete
text line, depending on how the text is stored in the PDF file.
The order in which text is returned is essentially arbitrary. It depends on the file creator whether text
is stored in the logical reading order. For example, most PDF drivers convert headers and footers
first, so these strings appear at the beginning of the content stream. The remaining strings are not
necessarily ordered either, and one text line can be stored in several different text objects.
A text search or text replacement algorithm must correctly handle cases in which a word or sentence
is split across different text objects. In the worst case, every call to GetPageText() returns only a
single character. As long as the text is not rotated, it is relatively easy to determine whether two
text records lie on the same baseline (that is, have the same y-coordinate), but finding arbitrarily
rotated text that is also stored in several different text objects requires further math.
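
For the unrotated case, the grouping step can be sketched as follows. This is only an illustration
and does not call DynaPDF: it assumes the text records have already been collected as hypothetical
(x, y, text) tuples in user space, and the function name group_into_lines and the tolerance value
are made up for this example.

```python
def group_into_lines(fragments, tolerance=0.5):
    """Group (x, y, text) fragments of unrotated text into lines.

    Fragments whose y-coordinates differ by at most `tolerance` are treated
    as lying on the same baseline. The tuple layout is an assumption of this
    sketch, not a DynaPDF data structure.
    """
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    for x, y, text in fragments:
        for line in lines:
            if abs(line[0] - y) <= tolerance:
                line[1].append((x, text))
                break
        else:
            # No existing baseline is close enough: start a new line.
            lines.append([y, [(x, text)]])

    # In default PDF user space a larger y is higher on the page.
    lines.sort(key=lambda line: -line[0])

    # Within a line, order the fragments from left to right and join them.
    return [
        "".join(text for _, text in sorted(line[1], key=lambda item: item[0]))
        for line in lines
    ]


# The footer is stored first and the word "Hello" is split into two records.
fragments = [
    (72, 40, "Footer"),
    (100, 700, "lo"),
    (72, 700.2, "Hel"),
    (72, 680, "W"),
    (80, 680, "orld"),
]
lines = group_into_lines(fragments)
```

The sketch does not insert spaces between fragments. A real search algorithm would also have to
compare the gap between two neighbouring records with the space width to decide whether they belong
to the same word.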
The position of a text object is calculated from the two transformation matrices ctm and tm. The
global transformation matrix ctm represents the coordinate system that is in effect when a text
showing operator is found. Because GetPageText() does not return when a new transformation matrix
is applied, all preceding transformations are already pre-multiplied into ctm.
The text transformation matrix tm represents the text coordinate system in which text properties
such as text width, font size, character spacing, word spacing, or the space width are calculated. The
effects of all text positioning operators are already included in this matrix.
The combination of both matrices represents the final user space in which the text is rendered. Both
matrices must be combined to calculate the text position and orientation (see the examples on the
following pages to determine how the matrices must be combined).
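
As a rough illustration of what combining the two matrices means, the sketch below uses the general
PDF convention in which a matrix is written as six numbers (a, b, c, d, x, y) and the text matrix is
applied before the current transformation matrix. The helper names mul_matrix and
text_origin_and_angle are made up for this example and are not DynaPDF functions. The examples in
the manual remain the authoritative reference for how DynaPDF expects the matrices to be combined.

```python
import math


def mul_matrix(m1, m2):
    """Return m1 x m2 for PDF matrices (a, b, c, d, x, y).

    Row-vector convention as used by the PDF format: a point is first
    transformed by m1 and then by m2.
    """
    a1, b1, c1, d1, x1, y1 = m1
    a2, b2, c2, d2, x2, y2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        x1 * a2 + y1 * c2 + x2,
        x1 * b2 + y1 * d2 + y2,
    )


def text_origin_and_angle(ctm, tm):
    """Return the text origin in user space and the baseline angle in degrees."""
    m = mul_matrix(tm, ctm)          # text space first, then the ctm
    origin = (m[4], m[5])            # image of the text space origin (0, 0)
    angle = math.degrees(math.atan2(m[1], m[0]))
    return origin, angle


# ctm: rotate by 90 degrees and move to (300, 100); tm: move 10 units along the baseline.
ctm = (0, 1, -1, 0, 300, 100)
tm = (1, 0, 0, 1, 10, 0)
origin, angle = text_origin_and_angle(ctm, tm)
```

With these values the text origin ends up at (300, 110) and the baseline is rotated by 90 degrees.
Two text records belong to the same rotated line if their angles match and their origins lie on the
same line in that direction.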
Organization of content streams and pages
A PDF page consists of a content stream and a resource array which contains the resources such as
fonts, images, and so on which can be used by the page. The content stream contains the PDF
operators which paint the contents of a PDF page.
The PDF format supports two object types which can contain vector graphics and images: ordinary
pages and the so-called "Form XObjects", which act as templates (we call this object type template
here). Like a page object, a template consists of a content stream and a resource array, and it is
possible to convert a page to a template. A page object can display an arbitrary number of templates
and a template can in turn display arbitrary other templates. It is important to understand that the
content of a template is physically stored in a separate content stream because the function InitStack()
prepares only the currently open content stream of the page or template for editing.
Only this content stream can be parsed and edited. Templates which occur in a page or in another
template must be parsed separately. Because templates can contain other templates, it is usually best
to parse templates recursively.
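
The recursive approach can be sketched with a simplified model. The data layout below is purely
hypothetical and does not reflect the DynaPDF API: a content stream is a list of ("text", string) or
("template", name) entries, and the templates are stored in a dictionary keyed by name.

```python
def collect_text(stream, templates, visited=None):
    """Collect the text of a content stream and of all templates it displays.

    `stream` is a list of ("text", str) or ("template", name) entries and
    `templates` maps a template name to its own content stream. Both are
    assumptions of this sketch. Each template is parsed only once, even if
    it is displayed several times.
    """
    if visited is None:
        visited = set()
    result = []
    for kind, value in stream:
        if kind == "text":
            result.append(value)
        elif kind == "template" and value not in visited:
            visited.add(value)
            # A template has its own content stream: parse it separately.
            result.extend(collect_text(templates[value], templates, visited))
    return result


templates = {
    "T1": [("text", "logo"), ("template", "T2")],
    "T2": [("text", "stamp")],
}
page = [("text", "Title"), ("template", "T1"), ("template", "T2"), ("text", "Body")]
texts = collect_text(page, templates)
```

The visited set ensures that a template which is used by several pages or templates is processed
only once, which is usually what is wanted when the content is edited rather than only extracted.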
 
