PDWordFinder Definitions

WF_LATEST_VERSION

Header: PDExpT.h:3642

Description

Used to obtain the latest available version.

Syntax

#define WF_LATEST_VERSION 0

WF_VERSION_2

Header: PDExpT.h:3646

Description

The version used for Acrobat 3.x and 4.x.

Syntax

#define WF_VERSION_2 2

WF_VERSION_3

Header: PDExpT.h:3650

Description

For Acrobat 5.0 without accessibility enabled.

Syntax

#define WF_VERSION_3 3

WF_VERSION_4

Header: PDExpT.h:3654

Description

For Acrobat 5.0 with accessibility enabled.

Syntax

#define WF_VERSION_4 4

PDWordFinder Typedefs

PDWordFinder

Header: PDExpT.h:3359

Description

Extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document.

Syntax

typedef struct _t_PDWordFinder *PDWordFinder;

PDWordFinder Callback Signatures

PDWordFinderCtrlProc

Header: PDExpT.h:3914

Description

This is passed to PDWordFinderSetCtrlProc().

This is the callback function called by Word Finder when its page enumeration process takes longer than the specified time (in seconds). Return true to continue the enumeration process, or false to stop. startTime is the value that was set by ASGetSecs() when the Word Finder started processing the current page.

Syntax

ASBool PDWordFinderCtrlProc(ASUns32 startTime, void *clientData);
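The continue-or-stop decision described above can be sketched as follows. This is only an illustration under stated assumptions: the typedefs are stand-ins so the fragment compiles without the SDK headers, and CtrlBudget, WithinBudget, MyWordFinderCtrlProc, and the nowSecs clock hook are hypothetical names. In a real client the hook would presumably wrap ASGetSecs(), and the callback would be registered with PDWordFinderSetCtrlProc().

```c
#include <stdint.h>

/* Stand-ins so the sketch compiles without the SDK headers. */
typedef uint16_t ASBool;
typedef uint32_t ASUns32;

/* Hypothetical client data: a hard per-page limit and a clock hook. */
typedef struct {
    ASUns32 hardLimitSecs;        /* give up on a page after this many seconds */
    ASUns32 (*nowSecs)(void);     /* assumed to wrap ASGetSecs() in a real client */
} CtrlBudget;

/* Pure helper: nonzero while the elapsed time is still under the limit. */
ASBool WithinBudget(ASUns32 startTime, ASUns32 now, ASUns32 hardLimitSecs)
{
    return (ASUns32)(now - startTime) < hardLimitSecs ? 1 : 0;
}

/* Matches the callback signature shown above: a nonzero (true) result
   continues the page enumeration, zero (false) stops it. */
ASBool MyWordFinderCtrlProc(ASUns32 startTime, void *clientData)
{
    const CtrlBudget *budget = (const CtrlBudget *)clientData;
    return WithinBudget(startTime, budget->nowSecs(), budget->hardLimitSecs);
}
```

Keeping the time comparison in a separate helper makes the policy easy to check without a running word finder.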

PDWordFinder Structures

_t_PDWordFinderConfig

Header: PDExpT.h:3674

Description

A word finder configuration that customizes the way the extraction is performed. In the default configuration, all options are false.

Syntax

struct _t_PDWordFinderConfig {
ASSize_t recSize;
This is always sizeof(PDWordFinderConfigRec).
ASBool disableTaggedPDF;
When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version (WF_LATEST_VERSION).
ASBool noXYSort;
When true, it disables generating an XY-ordered word list. This option replaces the sort order flags in the older version of the word finder creation command (PDDocCreateWordFinder()). Setting this option is equivalent to omitting the WXE_XY_SORT flag.
ASBool preserveSpaces;
When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from output text. When false (the default), you can add spaces later by considering the word attribute flag WXE_ADJACENT_TO_SPACE, but there is no way to restore the exact number of consecutive space characters.
ASBool noLigatureExp;

When true, and the font has a ToUnicode table, it disables the expansion of ligatures using the default ligatures. The default ligatures are:

  • fi
  • ff
  • fl
  • ffi
  • ffl
  • st
  • oe
  • OE

When noLigatureExp is true and the font does not have a ToUnicode table, the ligature is expanded based on whether there is a representation of the ligature in the defined codePage. If there is no representation, the ligature is expanded; otherwise, the ligature is not expanded.

ASBool noEncodingGuess;
When true, it disables guessing encoding of fonts that have unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion for a client with its own encoding handling. Use the PDWordGetCharEncFlags() method to detect such characters.
ASBool unknownToStdEnc;
When true, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the noEncodingGuess option.
ASBool ignoreCharGaps;
When true, it disables converting large character gaps to space characters, so that the word finder reports a character space only when a space character appears in the original PDF content. This option has no effect on tagged PDF.
ASBool ignoreLineGaps;
When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF.
ASBool noAnnots;
When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box.
ASBool noHyphenDetection;
When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused.
ASBool trustNBSpace;
When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused.
ASBool noExtCharOffset;
When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve the text extraction efficiency. There is a minor difference in the text extraction performance, and less memory is needed for the extracted word list.
ASBool noStyleInfo;
When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot use PDWordGetNthCharStyle() and PDWordGetStyleTransition() with the output of the word finder.
const ASUns16 *decomposeTbl;
A custom UTF-16 decomposition table. This table can be used to expand Unicode ligatures not included in the default ligature list. Each decomposition record contains a UTF-16 character code (either a 16-bit or 32-bit surrogate), a replacement UTF-16 string, and the delimiter 0x0000.
ASSize_t decomposeTblSize;
The size of the decomposeTbl in bytes.
const ASUns16 *charTypeTbl;
A custom character type table to enhance word breaking quality. Each character type record contains a region start value, a region end value, and a character type flag as defined in PDExpT.h. A character code is in UTF-16, and is either a 16-bit or a 32-bit surrogate.
ASSize_t charTypeTblSize;
The size of the charTypeTbl in bytes.
ASBool preserveRedundantChars;

When true, it disables detecting and removing redundant characters. Some PDF pages have the same text drawn multiple times on the same spot to get a special visual effect. Normally, those redundant characters are removed from the word finder output.

Since this option may leave extra characters with overlapping bounding boxes, using it together with the disableCharReordering option is recommended for more consistent text extraction results.

ASBool disableCharReordering;
When true, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, character reordering improves the text extraction quality. However, on a PDF page with heavily overlapped character bounding boxes, the outcome becomes somewhat unpredictable. In such a case, disabling character reordering (disableCharReordering = true) may produce a more static result.
ASBool noSkewedQuads;
When true, it disables the creation of a quad per character for skewed words (words with a horizontally aligned but non-rectangular bounding region). Each skewed word is instead associated with a single rectangular bounding region.
ASBool noTextRenderMode3;
When true, it disables extracting text with Text Rendering mode Tr = 3 ("Neither fill nor stroke text (invisible)"). Normally, the word finder extracts such text like any other.
ASBool preciseQuad;
} PDWordFinderConfigRec, *PDWordFinderConfig;
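As an illustration of the fields above, the following sketch fills in a configuration record with a small custom decomposition table. It makes several assumptions: the typedefs and the struct are local stand-ins mirroring the listing above so the fragment compiles without PDExpT.h (a real client would include the SDK header instead), and kDecomposeTbl and InitWordFinderConfig are hypothetical names. The two ligatures are chosen only because they are not in the default list.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for SDK types; a real client includes PDExpT.h instead. */
typedef uint16_t ASBool;
typedef uint16_t ASUns16;
typedef size_t   ASSize_t;

/* Mirror of the listing above, in the same field order. */
typedef struct _t_PDWordFinderConfig {
    ASSize_t recSize;
    ASBool disableTaggedPDF, noXYSort, preserveSpaces, noLigatureExp;
    ASBool noEncodingGuess, unknownToStdEnc, ignoreCharGaps, ignoreLineGaps;
    ASBool noAnnots, noHyphenDetection, trustNBSpace, noExtCharOffset;
    ASBool noStyleInfo;
    const ASUns16 *decomposeTbl;
    ASSize_t decomposeTblSize;
    const ASUns16 *charTypeTbl;
    ASSize_t charTypeTblSize;
    ASBool preserveRedundantChars, disableCharReordering;
    ASBool noSkewedQuads, noTextRenderMode3, preciseQuad;
} PDWordFinderConfigRec, *PDWordFinderConfig;

/* Each record: character code, replacement UTF-16 string, 0x0000 delimiter. */
const ASUns16 kDecomposeTbl[] = {
    0x00C6, 0x0041, 0x0045, 0x0000,   /* U+00C6 (AE ligature) -> "AE" */
    0x0132, 0x0049, 0x004A, 0x0000    /* U+0132 (IJ ligature) -> "IJ" */
};

void InitWordFinderConfig(PDWordFinderConfig cfg)
{
    /* Zeroing gives the default configuration: all options false. */
    memset(cfg, 0, sizeof(PDWordFinderConfigRec));
    cfg->recSize = sizeof(PDWordFinderConfigRec);

    cfg->preserveSpaces = 1;                        /* keep space characters */
    cfg->noStyleInfo = 1;                           /* skip style information */
    cfg->decomposeTbl = kDecomposeTbl;
    cfg->decomposeTblSize = sizeof(kDecomposeTbl);  /* size in bytes */
}
```

Zeroing the record before setting recSize keeps every option not mentioned at its default of false, which matches the default configuration described above.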

PDWordFinder Functions

PDWordFinderAcquireVisibleWordList

Header: PDProcs.h:10690

Description

Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

The list contains only words that are visible in the given context. If the word states change in the given context, the word list will have to be released and re-acquired to reflect the changed set of visible words.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all visible words on a page.

This procedure is intended to replace the call to PDWordFinderAcquireWordList() in most cases where you want to work only with the content that is visible on screen (such as a text selection). Change this call to update an application to work with the Optional Content feature.

Syntax

void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext, PDWord *wInfoP, PDWord **xySortTable, PDWord **rdOrderTable, ASInt32 *numWords);

Parameters

wObj
The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.
pgNum
The page number for which words are found. The first page is 0, not 1 as designated in Acrobat.
ocContext
The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc).
wInfoP

(Filled by the method) A user-supplied PDWord variable. Acrobat will fill this in to point to an Acrobat-allocated array of PDWord objects, which should never be accessed directly.

Access the acquired list through PDWordFinderGetNthWord(); the array cannot be accessed directly. The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

xySortTable
(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder().
rdOrderTable
Currently unused. Pass NULL for its value.
numWords
(Filled by the method) The number of visible words found on the page.

Exceptions

PDWordFinderAcquireWordList

Header: PDProcs.h:4781

Description

Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method, if you wish to find one word at a time instead of obtaining a table containing all words on a page.

Syntax

void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord *wInfoP, PDWord **xySortTable, PDWord **rdOrderTable, ASInt32 *numWords);

Parameters

wObj
The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list.
pgNum
The page number for which words are found. The first page is 0, not 1 as designated in Acrobat.
wInfoP

(Filled by the method) A user-supplied PDWord variable. Acrobat will fill this in to point to an Acrobat-allocated array of PDWord objects, which should never be accessed directly.

Access the acquired list through PDWordFinderGetNthWord(); the array cannot be accessed directly. The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS().

xySortTable
(Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder().
rdOrderTable
Currently unused. Pass NULL for this value.
numWords
(Filled by the method) The number of words found on the page.
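To make the x-y order described for xySortTable concrete, here is a small standalone sketch of that ordering. It does not use the word finder at all: DemoWord, its line and left fields, CompareXY, and SortXY are hypothetical, and the line index is assumed to be already known. How the library itself groups words into lines is not specified here.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a word with a known line and left edge. */
typedef struct {
    const char *text;
    int line;       /* 0 = first line on the page, 1 = next line, ... */
    double left;    /* left edge of the word on its line */
} DemoWord;

/* Orders by line first, then left to right within a line. */
int CompareXY(const void *a, const void *b)
{
    const DemoWord *wa = (const DemoWord *)a;
    const DemoWord *wb = (const DemoWord *)b;

    if (wa->line != wb->line)
        return (wa->line > wb->line) - (wa->line < wb->line);
    return (wa->left > wb->left) - (wa->left < wb->left);
}

/* Sorts the array in place into the x-y order described above. */
void SortXY(DemoWord *words, size_t count)
{
    qsort(words, count, sizeof words[0], CompareXY);
}
```

For example, words given in drawing order as ("world", line 0), ("next", line 1), ("Hello", line 0) come out with both line-0 words first, ordered by their left edges, followed by the line-1 word.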

Exceptions

PDWordFinderDestroy

Header: PDProcs.h:4829

Description

Destroys a word finder. Use this when you are done extracting text in a file.

Syntax

void PDWordFinderDestroy(PDWordFinder wObj);

Parameters

wObj
IN/OUT The word finder to destroy.

PDWordFinderEnumVisibleWords

Header: PDProcs.h:10764

Description

Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words that are visible in the given optional-content context are enumerated.

Syntax

ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext, PDWordProc wordProc, void *clientData);

Parameters

wObj
A word finder object.
PageNum
The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.
ocContext
The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc).
wordProc
A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.
clientData
A pointer to user-supplied data to pass to wordProc each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.

Exceptions

An exception is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.

PDWordFinderEnumWords

Header: PDProcs.h:4865

Description

Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.

Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

Syntax

ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void *clientData);

Parameters

wObj
A word finder object.
PageNum
The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document.
wordProc
A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.
clientData
A pointer to user-supplied data to pass to wordProc each time it is called.

Returns

true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.

Exceptions

An exception is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document.
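The way wordProc controls enumeration can be sketched with a callback that stops after a fixed number of words. The callback signature used here (word finder, word, page number, client data) is an assumption, since this section does not show the PDWordProc declaration; check PDExpT.h before relying on it. The typedefs are stand-ins so the fragment compiles without the SDK, and WordBudget and CountWordsProc are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the SDK headers. */
typedef uint16_t ASBool;
typedef int32_t  ASInt32;
typedef struct _t_PDWordFinder *PDWordFinder;
typedef struct _t_PDWord *PDWord;

/* Hypothetical client data: how many words were seen, and when to stop. */
typedef struct {
    ASInt32 seen;
    ASInt32 maxWords;
} WordBudget;

/* Assumed PDWordProc shape. Returning zero (false) halts enumeration. */
ASBool CountWordsProc(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum,
                      void *clientData)
{
    WordBudget *budget = (WordBudget *)clientData;

    (void)wObj;     /* unused in this sketch */
    (void)wInfo;
    (void)pgNum;

    budget->seen++;
    return budget->seen < budget->maxWords ? 1 : 0;
}
```

A client would pass CountWordsProc as wordProc and a WordBudget pointer as clientData. A false return from the enumeration call then indicates that the callback stopped it early.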

PDWordFinderEnumWordsStr

Header: PDProcs.h:8726

Description

Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.

The words extracted by this method do not have quads, text style, or text selection information. The character offset is calculated from the beginning of the input string, and is increased by 2 on every 16 bits of data (the character offset of a character in a PDWord is the byte offset of the character in the source Unicode string).

Syntax

ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val *ucsStr, ASUns32 strLen, ASUns32 charOffsetAdj, PDWordProc wordProc, void *clientData);

Parameters

wObj
A word finder object.
ucsStr
A pointer to the Unicode string.
strLen
The length of the string in bytes.
charOffsetAdj

The character offset value of the first character in the input Unicode string. This value is added to the word character offsets, and is used to maintain contiguous word character offsets when multiple strings (and multiple calls to this method) are combined into one word list.

For example:

PDWordFinderEnumWordsStr(wf, str1, strlen(str1), 0, wp, d);

PDWordFinderEnumWordsStr(wf, str2, strlen(str2), strlen(str1), wp, d);

wordProc
A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false.
clientData
A pointer to user-supplied data to pass to wordProc each time it is called.
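The byte-based offset arithmetic described for strLen and charOffsetAdj can be sketched as below. The typedefs are stand-ins for the SDK types, and Utf16ByteLen and WordListCharOffset are hypothetical helpers; the sketch assumes the input strings are terminated by a 0x0000 code unit, which this section does not state.

```c
#include <stdint.h>

/* Stand-ins so the sketch compiles without the SDK headers. */
typedef uint16_t ASUTF16Val;
typedef uint32_t ASUns32;

/* Length in bytes of a 0x0000-terminated UTF-16 string (terminator excluded). */
ASUns32 Utf16ByteLen(const ASUTF16Val *str)
{
    ASUns32 units = 0;

    while (str[units] != 0)
        units++;
    return units * (ASUns32)sizeof(ASUTF16Val);
}

/* Character offset reported for the code unit at unitIndex when the string
   is enumerated with the given charOffsetAdj: the offset grows by 2 for
   every 16 bits of data. */
ASUns32 WordListCharOffset(ASUns32 charOffsetAdj, ASUns32 unitIndex)
{
    return charOffsetAdj + unitIndex * 2u;
}
```

With a first string of three code units, Utf16ByteLen gives 6, so a second string enumerated with charOffsetAdj = 6 continues the offsets at 6, 8, 10, and so on.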

Returns

true if the enumeration was successfully completed, false if the enumeration was terminated because wordProc returned false.

Exceptions

An exception is raised if wordProc is NULL.

PDWordFinderGetLatestAlgVersion

Header: PDProcs.h:4800

Description

Gets the version number of the specified word finder, or the version number of the latest word finder algorithm.

Syntax

ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj);

Parameters

wObj
IN/OUT The word finder whose algorithm's version is obtained. Pass NULL to obtain the latest word finding algorithm version number.

Returns

The algorithm version associated with wObj, or the version of the latest word finder algorithm if wObj is NULL.

PDWordFinderReleaseWordList

Header: PDProcs.h:4817

Description

Releases the word list for a given page. Use this to release a list created by PDWordFinderAcquireWordList() when you are done using this list.

Syntax

void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum);

Parameters

wObj
A word finder object.
pgNum
The page number for which the word list is released.

Exceptions

An exception is raised if the list has already been released.