#define WF_LATEST_VERSION 0
#define WF_VERSION_2 2
#define WF_VERSION_3 3
#define WF_VERSION_4 4
typedef struct _t_PDWordFinder *PDWordFinder;
This is passed to PDWordFinderSetCtrlProc().
This is the callback function called by Word Finder when its page enumeration process takes longer than the specified time (in seconds). Return true to continue the enumeration process, or false to stop. startTime is the value that was set by ASGetSecs() when the Word Finder started processing the current page.
ASBool PDWordFinderCtrlProc(ASUns32 startTime, void *clientData);
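A control callback of this shape might enforce a per-page time budget. The following is a minimal sketch, not SDK code: the typedefs are stand-ins so that it compiles without the SDK headers, and `PageTimeBudget` and `StopWhenOverBudget` are hypothetical names. The current time is passed in through the client data here; in a plug-in it would presumably come from ASGetSecs().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in typedefs (assumed widths) so the sketch compiles without the SDK. */
typedef unsigned short ASBool;
typedef unsigned int ASUns32;

/* Hypothetical client data: the current time and the allowed time per page. */
typedef struct {
    ASUns32 nowSecs;    /* in a plug-in this would come from ASGetSecs() */
    ASUns32 budgetSecs; /* how long the client is willing to wait per page */
} PageTimeBudget;

/* Same parameter list as the PDWordFinderCtrlProc declaration above. */
static ASBool StopWhenOverBudget(ASUns32 startTime, void *clientData)
{
    const PageTimeBudget *budget = (const PageTimeBudget *)clientData;
    ASUns32 elapsed;

    assert(budget != NULL); /* this sketch requires client data */

    /* Unsigned subtraction stays correct even if the seconds counter wraps. */
    elapsed = budget->nowSecs - startTime;

    /* true (1) continues the enumeration, false (0) stops it. */
    return (ASBool)(elapsed <= budget->budgetSecs);
}
```

With a budget of 10 seconds, a page that started 5 seconds ago continues and a page that started 15 seconds ago is stopped.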
false
.
| |
This is always sizeof(PDWordFinderConfigRec). | |
When true, it disables tagged PDF support and treats the document as non-tagged PDF. Use this to keep the word finder in legacy mode when it is created with the latest algorithm version (WF_LATEST_VERSION). | |
When true, it disables generating an XY-ordered word list. This option replaces the sort order flags in the older version of the word finder creation command (PDDocCreateWordFinder()). Setting this option is equivalent to omitting the WXE_XY_SORT flag. | |
When true, the word finder preserves space characters during word breaking. Otherwise, spaces are removed from output text. When false (the default), you can add spaces later by considering the word attribute flag WXE_ADJACENT_TO_SPACE, but there is no way to restore the exact number of consecutive space characters. | |
When
When | |
When true, it disables guessing encoding of fonts that have unknown or custom encoding when there is no ToUnicode table. Inappropriate encoding conversions can cause the word finder to mistakenly recognize non-Roman single-byte fonts as Standard Roman encoding fonts and extract the text in an unusable format. When this option is selected, the word finder avoids such unreliable encoding conversions and tries to provide the original characters without any encoding conversion for a client with its own encoding handling. Use the PDWordGetCharEncFlags() method to detect such characters. | |
When true, it assumes any font with unknown or custom encoding to be Standard Roman. This option overrides the noEncodingGuess option. | |
When true, it disables converting large character gaps to space characters, so that the word finder reports a character space only when a space character appears in the original PDF content. This option has no effect on tagged PDF. | |
When true, it disables treating vertical movements as line breaks, so that the word finder determines a line break only when a line break character or special tag information appears in the original PDF content. This option has no effect on tagged PDF. | |
When true, it disables extracting text from text annotations. Normally, the word finder extracts text from the normal appearances of text annotations that are inside the page crop box. | |
When true, it disables finding and removing soft hyphens in non-tagged PDF, so that the word finder trusts hard hyphens as non-soft hyphens. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between soft and hard hyphen characters in non-tagged PDF files, because these are often misused. | |
When true, it disables treating non-breaking space characters as regular space characters in non-tagged PDF files, so that the word finder preserves the space without breaking the word. This option has no effect on tagged PDF files. Normally, the word finder does not differentiate between breaking and non-breaking space characters in non-tagged PDF files, because these are often misused. | |
When true, it disables generating extended character offset information to improve text extraction performance. The extended character offset information is necessary to determine the exact character offset for character-by-character text selection. The beginning character offset of each word is always available regardless of this option, and can be used for word-by-word text selection with reasonable accuracy. When a client has no need for the detailed character offset information, it can use this option to improve the text extraction efficiency. There is a minor difference in the text extraction performance, and less memory is needed for the extracted word list. | |
When true, it disables generating character style information to improve text extraction performance and memory efficiency. When you select this option, you cannot use PDWordGetNthCharStyle() and PDWordGetStyleTransition() with the output of the word finder. | |
A custom UTF-16 decomposition table. This table can be used to expand Unicode ligatures not included in the default ligature list. Each decomposition record contains a UTF-16 character code (either a single 16-bit value or a 32-bit surrogate pair), a replacement UTF-16 string, and the delimiter 0x0000. | |
The size of the decomposeTbl in bytes. | |
A custom character type table to enhance word breaking quality. Each character type record contains a region start value, a region end value, and a character type flag as defined in PDExpT.h. A character code is in UTF-16, and is either a single 16-bit value or a 32-bit surrogate pair. | |
The size of the charTypeTbl in bytes. | |
When Since this option may leave extra characters with overlapping bounding boxes, using it together with the | |
When true, it disables reconstructing the character order, and the word finding algorithm is applied to the characters in the drawing order. By default, the word finder reorders characters on a single line by their relative horizontal locations. Most of the time, the character reordering feature improves the text extraction quality. However, on a PDF page with heavily overlapped character bounding boxes, the outcome becomes somewhat unpredictable. In such cases, disabling the character reordering (disableCharReordering = true) may produce a more stable result. | |
When true, it disables the creation of a quad per character for skewed words (words with a horizontally aligned but non-rectangular bounding region). Each skewed word is instead associated with a single rectangular bounding region. | |
When true, it disables extracting text with Text Rendering mode Tr = 3 ("Neither fill nor stroke text (invisible)."). Normally, the word finder extracts such text like any other. | |
|
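The record layout described for the decomposition table (character code, replacement string, 0x0000 delimiter) can be illustrated with a small static array. This is a sketch under assumptions: `uint16_t` stands in for the SDK's UTF-16 type, the two ligatures are arbitrary examples (the source does not say whether the default list already covers them), and `CountDecomposeRecords` is a hypothetical helper for sanity-checking a table before handing it to the word finder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each record: source code unit(s), replacement UTF-16 string, 0x0000 delimiter. */
static const uint16_t kDecomposeTbl[] = {
    0xFB06, 0x0073, 0x0074, 0x0000, /* U+FB06 LATIN SMALL LIGATURE ST -> "st" */
    0x0133, 0x0069, 0x006A, 0x0000, /* U+0133 LATIN SMALL LIGATURE IJ -> "ij" */
};

/* The table size is expressed in bytes, not in records or code units. */
static const size_t kDecomposeTblSize = sizeof kDecomposeTbl;

/* Counts records by counting 0x0000 delimiters.
   Returns 0 if the table is empty or does not end on a delimiter. */
static size_t CountDecomposeRecords(const uint16_t *tbl, size_t sizeInBytes)
{
    size_t units = sizeInBytes / sizeof(uint16_t);
    size_t records = 0;

    assert(tbl != NULL);
    if (units == 0 || tbl[units - 1] != 0x0000)
        return 0;
    for (size_t i = 0; i < units; ++i)
        if (tbl[i] == 0x0000)
            ++records;
    return records;
}
```

Using `sizeof` on the array keeps the byte count in step with the table contents when records are added or removed.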
Finds all words on the specified page that are visible in the given optional-content context and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
The list contains only words that are visible in the given context. If the word states change in the given context, the word list will have to be released and re-acquired to reflect the changed set of visible words.
There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.
Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all visible words on a page.
This procedure is intended to replace the call to PDWordFinderAcquireWordList() in most cases where you want to work only with the content that is visible on screen (such as a text selection). Change this call to update an application to work with the Optional Content feature.
void PDWordFinderAcquireVisibleWordList(PDWordFinder wObj, ASInt32 pgNum, PDOCContext ocContext,
    PDWord *wInfoP, PDWord **xySortTable, PDWord **rdOrderTable, ASInt32 *numWords);
wObj | The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list. |
pgNum | The page number for which words are found. The first page is 0, not 1 as designated in Acrobat. |
ocContext | The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc). |
wInfoP | (Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). |
xySortTable | (Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder(). |
rdOrderTable | Currently unused. Pass NULL for its value. |
numWords | (Filled by the method) The number of visible words found on the page. |
Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.
Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.
There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.
Use PDWordFinderEnumWords() instead of this method if you wish to find one word at a time instead of obtaining a table containing all words on a page.
void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord *wInfoP,
    PDWord **xySortTable, PDWord **rdOrderTable, ASInt32 *numWords);
wObj | The word finder (created using PDDocCreateWordFinder() or PDDocCreateWordFinderUCS()) used to acquire the word list. |
pgNum | The page number for which words are found. The first page is 0, not 1 as designated in Acrobat. |
wInfoP | (Filled by the method) A user-supplied PDWord variable. Acrobat fills this in to point to an Acrobat-allocated array of PDWord objects. Do not access this array directly; traverse it with PDWordFinderGetNthWord(). The words are ordered in PDF order, which is the order in which they appear in the PDF file's data. This is often, but not always, the order in which a person would read the words. This array is always filled, regardless of the flags used in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). |
xySortTable | (Filled by the method) Acrobat fills in this user-supplied pointer to a pointer with the location of an Acrobat-allocated array of PDWords, sorted in x-y order, meaning that all words on the first line, from left to right, are followed by all words on the next line. This array is only filled if the WXE_XY_SORT flag was set in the call to PDDocCreateWordFinder() or PDDocCreateWordFinderUCS(). PDWordFinderReleaseWordList() must be called to release allocated memory for this return or there will be a memory leak. As long as this parameter is non-NULL, the array is always filled regardless of the value of the rdFlags parameter in PDDocCreateWordFinder(). |
rdOrderTable | Currently unused. Pass NULL for this value. |
numWords | (Filled by the method) The number of words found on the page. |
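The acquire, traverse, release sequence described above might look like the following sketch. Everything below the typedefs is a stand-in so that the example compiles without the SDK: the three PDWordFinder functions are stubs that hand out a fixed two-word list, the struct bodies are invented, and the PDWordFinderGetNthWord() parameter list is an assumption to check against the SDK headers. Only `CountWordsOnPage` (a hypothetical name) shows the calling pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins: the real types are opaque and come from the SDK headers. */
typedef int ASInt32;
typedef struct _t_PDWord { const char *text; } *PDWord;
typedef struct _t_PDWordFinder { int listsOutstanding; } *PDWordFinder;

static struct _t_PDWord gWords[] = { { "Hello" }, { "world" } };

/* Stub with the parameter list documented above. */
static void PDWordFinderAcquireWordList(PDWordFinder wObj, ASInt32 pgNum, PDWord *wInfoP,
    PDWord **xySortTable, PDWord **rdOrderTable, ASInt32 *numWords)
{
    (void)pgNum; (void)rdOrderTable;
    assert(wObj->listsOutstanding == 0); /* only one word list at a time */
    wObj->listsOutstanding = 1;
    *wInfoP = gWords;
    if (xySortTable != NULL)
        *xySortTable = NULL;
    *numWords = 2;
}

/* Stub; the parameter list is assumed, not taken from this page. */
static PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh)
{
    (void)wObj;
    return (nTh >= 0 && nTh < 2) ? &gWords[nTh] : NULL;
}

/* Stub with the parameter list documented for PDWordFinderReleaseWordList(). */
static void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum)
{
    (void)pgNum;
    wObj->listsOutstanding = 0;
}

/* Calling pattern: acquire, walk the list through GetNthWord, then release. */
static ASInt32 CountWordsOnPage(PDWordFinder wObj, ASInt32 pgNum)
{
    PDWord wordList = NULL; /* filled by the call, never indexed directly */
    ASInt32 numWords = 0;
    ASInt32 seen = 0;

    PDWordFinderAcquireWordList(wObj, pgNum, &wordList, NULL, NULL, &numWords);
    for (ASInt32 i = 0; i < numWords; ++i)
        if (PDWordFinderGetNthWord(wObj, i) != NULL)
            ++seen;
    PDWordFinderReleaseWordList(wObj, pgNum); /* required before the next acquire */
    return seen;
}
```

Releasing inside the same function that acquires makes it harder to leave a list outstanding, which matters because a second acquire is not allowed until the first list is released.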
void PDWordFinderDestroy(PDWordFinder wObj);
wObj | IN/OUT The word finder to destroy. |
Extracts visible words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
Only words that are visible in the given optional-content context are enumerated.
ASBool PDWordFinderEnumVisibleWords(PDWordFinder wObj, ASInt32 PageNum, PDOCContext ocContext,
    PDWordProc wordProc, void *clientData);
wObj | A word finder object. |
PageNum | The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document. |
ocContext | The context within which the words are in a visible state. NULL is equivalent to passing PDDocGetOCContext(pdDoc). |
wordProc | A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false. |
clientData | A pointer to user-supplied data to pass to wordProc each time it is called. |
true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.
An exception is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document. | |
Extracts words, one at a time, from the specified page or the entire document. It calls a user-supplied procedure once for each word found. If you wish to extract all text from a page at once, use PDWordFinderAcquireWordList() instead of this method.
Only words within or partially within the page's crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.
ASBool PDWordFinderEnumWords(PDWordFinder wObj, ASInt32 PageNum, PDWordProc wordProc, void *clientData);
wObj | A word finder object. |
PageNum | The page number from which to extract words. Pass PDAllPages (see PDExpT.h) to sequentially process all pages in the document. |
wordProc | A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false. |
clientData | A pointer to user-supplied data to pass to wordProc each time it is called. |
true if enumeration was successfully completed, false if enumeration was terminated because wordProc returned false.
An exception is raised if wordProc is NULL, or PageNum is less than zero or greater than the total number of pages in the document. | |
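The halt-on-false contract for wordProc can be sketched with a callback that stops after a fixed number of words. The callback parameter list below is an assumption (this page does not show the PDWordProc declaration; see PDExpT.h), the typedefs are stand-ins, and `FakeEnumWords` is a hypothetical driver that only imitates the documented return values so the callback can be exercised without the SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in typedefs so the sketch compiles without the SDK headers. */
typedef unsigned short ASBool;
typedef int ASInt32;
typedef struct _t_PDWord *PDWord;             /* opaque stand-in */
typedef struct _t_PDWordFinder *PDWordFinder; /* opaque, as in the typedef above */

/* Assumed callback shape; check PDExpT.h for the real declaration. */
typedef ASBool (*PDWordProc)(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData);

/* Hypothetical client data: words seen so far and the maximum wanted. */
typedef struct {
    ASInt32 seen;
    ASInt32 limit;
} WordLimit;

/* Counts each word and returns false once the limit is reached. */
static ASBool StopAfterLimit(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData)
{
    WordLimit *state = (WordLimit *)clientData;
    (void)wObj; (void)wInfo; (void)pgNum;
    state->seen++;
    return (ASBool)(state->seen < state->limit);
}

/* Hypothetical driver imitating the documented contract for nWords words. */
static ASBool FakeEnumWords(ASInt32 nWords, PDWordProc wordProc, void *clientData)
{
    assert(wordProc != NULL);
    for (ASInt32 i = 0; i < nWords; ++i)
        if (!wordProc(NULL, NULL, 0, clientData))
            return 0; /* terminated early because the callback returned false */
    return 1;         /* every word was delivered */
}
```

A false result from the enumerator therefore signals an early stop requested by the callback, which the caller can tell apart from a completed pass by checking the return value.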
Constructs a PDWord list from a Unicode string, and calls a user-supplied procedure once for each word found.
The words extracted by this method do not have quads, text style, or text selection information. The character offset is calculated from the beginning of the input string, and is increased by 2 on every 16 bits of data (the character offset of a character in a PDWord is the byte offset of the character in the source Unicode string).
ASBool PDWordFinderEnumWordsStr(PDWordFinder wObj, const ASUTF16Val *ucsStr, ASUns32 strLen,
    ASUns32 charOffsetAdj, PDWordProc wordProc, void *clientData);
wObj | A word finder object. |
ucsStr | A pointer to the Unicode string. |
strLen | The length of the string in bytes. |
charOffsetAdj | The character offset value of the first character in the input Unicode string. This value is added to the word character offsets, and is used to maintain contiguous word character offsets when multiple strings (and multiple calls to this method) are combined into one word list. For example: |
wordProc | A user-supplied callback to call once for each word found. Enumeration halts if wordProc returns false. |
clientData | A pointer to user-supplied data to pass to wordProc each time it is called. |
true if the enumeration was successfully completed, false if the enumeration was terminated because wordProc returned false.
An exception is raised if wordProc is NULL. | |
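Because character offsets are byte offsets that advance by 2 per 16-bit unit, the charOffsetAdj value for each string in a sequence can be derived from the byte lengths of the strings before it. The sketch below shows that arithmetic only; `uint16_t` stands in for ASUTF16Val, both helper names are hypothetical, and it assumes the strings are 0x0000-terminated and should receive exactly contiguous offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of a 0x0000-terminated UTF-16 string, terminator excluded.
   This is the unit strLen is documented in. */
static uint32_t Utf16ByteLen(const uint16_t *s)
{
    uint32_t units = 0;

    assert(s != NULL);
    while (s[units] != 0x0000)
        ++units;
    return units * 2u; /* offsets advance by 2 for every 16 bits of data */
}

/* Fills adj[i] with the charOffsetAdj to pass for strs[i] so that word
   offsets stay contiguous across successive calls. */
static void ChainCharOffsetAdj(const uint16_t *const *strs, size_t count,
                               uint32_t firstAdj, uint32_t *adj)
{
    uint32_t next = firstAdj;

    for (size_t i = 0; i < count; ++i) {
        adj[i] = next;                 /* offset of this string's first character */
        next += Utf16ByteLen(strs[i]); /* the following string starts after it */
    }
}
```

For the strings "Hello " (6 code units) and "world", the first call would use an adjustment of 0 and the second an adjustment of 12.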
ASInt16 PDWordFinderGetLatestAlgVersion(PDWordFinder wObj);
wObj | IN/OUT The word finder whose algorithm's version is obtained. Pass NULL to obtain the latest word finding algorithm version number. |
The algorithm version associated with wObj, or the version of the latest word finder algorithm if wObj is NULL.
void PDWordFinderReleaseWordList(PDWordFinder wObj, ASInt32 pgNum);
wObj | A word finder object. |
pgNum | The page number for which the word list is released. |
An exception is raised if the list has already been released. |