
DynaPDF Manual - Page 438


Function Reference
We now take a look at a PDF content stream to see how an arbitrary piece of text can be stored
in a PDF file. The same text can be stored in many different ways, and it is important to
understand that all of these variants are possible and exist in real PDF files.
The rendered result of the string "The fox eats the lazy mouse." looks quite normal:
The fox eats the lazy mouse.
However, a PDF driver does not necessarily store this text in one record. There are many possible
variants:
%This is the easiest variant, one record contains the entire text line.
%It would be returned in one GetPageText() call as one coherent kerning
%record.
(The fox eats the lazy mouse.)Tj
%This version emulates the spaces with kerning space.
%It would be returned in one GetPageText() call with 6 kerning records.
[(The)-280(fox)-280(eats)-280(the)-280(lazy)-280(mouse.)]TJ
%This version uses PDF positioning operators to emulate spaces.
%It produces 6 separate GetPageText() calls.
(The)Tj
2.8 0 Td
(fox)Tj
2.8 0 Td
(eats)Tj
2.8 0 Td
(the)Tj
2.8 0 Td
(lazy)Tj
2.8 0 Td
(mouse.)Tj
In the worst case each text record consists of only one character. It is also possible that the
text is stored out of order, or interleaved with other text that lies at completely different
positions on the page. There is not necessarily a logical connection between what you see on screen
and what is stored in the PDF file. Especially if a PDF file contains tables, the order of text
records is sometimes very difficult to understand.
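The difference between the three variants can be made visible with a small sketch. The following Python code is not part of DynaPDF and does not use its API. It is a toy parser for a tiny subset of the content stream syntax (literal strings without escapes or nested parentheses, numbers, arrays, and operators), and the name text_records() is hypothetical. It returns one list of string fragments per text-showing operator, which roughly corresponds to one text record per operator.

```python
import re

# Tokens of a tiny content stream subset: literal strings without escapes or
# nested parentheses, numbers, array brackets, and operator names.
TOKEN = re.compile(r"\(([^()\\]*)\)|(-?\d+(?:\.\d+)?)|(\[|\])|([A-Za-z]+)")

def text_records(stream):
    """Return one list of string fragments per Tj or TJ operator."""
    records = []
    operands = []
    for match in TOKEN.finditer(stream):
        string, number, bracket, operator = match.groups()
        if string is not None:
            operands.append(string)      # string operand of Tj or TJ
        elif operator in ("Tj", "TJ"):
            records.append(operands)     # one record per text-showing operator
            operands = []
        elif operator is not None:
            operands = []                # other operators such as Td carry no text
        # numbers (kerning values, offsets) and brackets are skipped
    return records

one_string = "(The fox eats the lazy mouse.)Tj"
kerned = "[(The)-280(fox)-280(eats)-280(the)-280(lazy)-280(mouse.)]TJ"
positioned = ("(The)Tj 2.8 0 Td (fox)Tj 2.8 0 Td (eats)Tj 2.8 0 Td "
              "(the)Tj 2.8 0 Td (lazy)Tj 2.8 0 Td (mouse.)Tj")

for stream in (one_string, kerned, positioned):
    records = text_records(stream)
    print(len(records), "record(s),", sum(len(r) for r in records), "fragment(s)")
```

The first variant yields one record with one fragment, the second one record with six fragments, and the third six records with one fragment each. None of the last two variants contains a space character, so code that extracts text has to derive word boundaries from the kerning values or positions.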
Possible encoding issues
If text must be extracted, deleted, or replaced, then it is very important that the text in the PDF
file can be converted to Unicode. This conversion is possible if the font uses a standard encoding
like WinAnsi or MacRoman, if it contains a ToUnicode CMap, if it contains PostScript character
names which are listed in the Adobe Glyph List, or if it uses a predefined external CMap and this
CMap is available in one of the CMap search paths (see SetCMapDir() for further information).
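The first two conditions can be illustrated with a minimal Python sketch. It does not use DynaPDF, and everything in it is an assumption made for illustration: the function name decode_pdf_bytes() is hypothetical, a ToUnicode CMap is modelled as a plain dictionary from character code to Unicode string, and Python's cp1252 and mac_roman codecs are used as close approximations of the WinAnsi and MacRoman encodings. Glyph name lookup via the Adobe Glyph List and external CMaps are left out.

```python
def decode_pdf_bytes(raw, base_encoding=None, to_unicode=None):
    """Convert the bytes of a single-byte font string to Unicode.

    Returns None if neither a ToUnicode mapping nor a known standard
    encoding is available, that is, if the text cannot be converted.
    """
    # Approximation: Python codecs standing in for the PDF encodings.
    codecs_by_name = {
        "WinAnsiEncoding": "cp1252",
        "MacRomanEncoding": "mac_roman",
    }
    if to_unicode is not None:
        # Assumed model of a ToUnicode CMap: character code -> Unicode string.
        return "".join(to_unicode.get(code, "\ufffd") for code in raw)
    if base_encoding in codecs_by_name:
        return raw.decode(codecs_by_name[base_encoding], errors="replace")
    return None

print(decode_pdf_bytes(b"caf\xe9", "WinAnsiEncoding"))
print(decode_pdf_bytes(b"caf\x8e", "MacRomanEncoding"))
print(decode_pdf_bytes(b"\x01\x02", to_unicode={1: "f", 2: "ox"}))
print(decode_pdf_bytes(b"\x01\x02"))
```

The same word needs different bytes in WinAnsi and MacRoman, and the third call shows that a ToUnicode mapping may assign more than one Unicode character to a single code. The last call returns None because no conversion path exists, which is the case in which extracting or replacing text becomes unreliable.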
 
