#define WXE_ADJACENT_TO_SPACE 0x800
#define WXE_ENCODING_WARNING
#define WXE_ENC_MISSING
#define WXE_ENC_NO_UCS
#define WXE_ENC_UNMAPPED
#define WXE_EXT_CHAR_OFFSETS
#define WXE_FROM_ACTUALT
#define WXE_FRONT_TAB 0x01
#define WXE_HAS_DIGIT 0x8
#define WXE_HAS_HYPHEN 0x20
WXE_HAS_PUNCTUATION will also be set.
#define WXE_HAS_LEADING_PUNC 0x100
#define WXE_HAS_LETTER 0x2
#define WXE_HAS_LIGATURE 0x80
#define WXE_HAS_NONALPHANUM 0X1
WXE_HAS_LEADING_PUNC), the end of the word (WXE_HAS_TRAILING_PUNC), or elsewhere in the word.
#define WXE_HAS_PUNCTUATION 0x10
#define WXE_HAS_SOFT_HYPHEN 0x40
WXE_HAS_PUNCTUATION will also be set.
#define WXE_HAS_TRAILING_PUNC 0x200
#define WXE_HAS_UNMAPPED_CHAR 0x400
#define WXE_HAS_UPPERCASE 0x4
#define WXE_LAST_WORD_ON_LINE 0x8000
#define WXE_PDF_ORDER 0x2
#define WXE_RD_ORDER_SORT 0x8
#define WXE_REVERSE_DIRECTION 0x04
#define WXE_ROTATED 0x1000
#define WXE_STREAM 0x1
#define WXE_VERTICAL_FLOW 0x2000
#define WXE_WBREAK_WORD 0x4000
#define WXE_WORD_IS_UNICODE 0x08
#define WXE_XY_SORT 0x4
#define W_ACCENT 0x800
#define W_CNTL 0x1
#define W_COMMA 0x200
#define W_DIGIT 0x8
".", "?", "!", ":", and ";").
#define W_END_PHRASE 0x2000
#define W_HYPHEN 0x20
#define W_LETTER 0x2
#define W_LIGATURE 0x80
#define W_PERIOD 0x400
#define W_PUNCTUATION 0x10
#define W_SOFT_HYPHEN 0x40
#define W_UNMAPPED 0x1000
#define W_UPPERCASE 0x4
#define W_WHITE 0x100
"*" and "?") that should not be treated as a normal punctuation mark.
#define W_WILD_CARD 0x4000
#define W_WORD_BREAK 0x8000
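Each character type code above occupies its own bit, so several codes can be combined in one 16-bit value and tested with a bitwise AND. The following is a minimal sketch of that idea, not SDK code: the constants are redefined locally with the values listed above (real client code would take them from the SDK headers), and `count_types` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the listing above; real client code would take them
   from the SDK headers instead of redefining them. */
#define W_LETTER      0x2
#define W_UPPERCASE   0x4
#define W_DIGIT       0x8
#define W_PUNCTUATION 0x10

/* Hypothetical helper: counts the entries of a character-type array that
   have at least one of the bits in `mask` set. */
size_t count_types(const uint16_t *types, size_t n, uint16_t mask)
{
    size_t i, hits = 0;
    assert(types != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (types[i] & mask)
            hits++;
    }
    return hits;
}
```

Passing `W_LETTER | W_DIGIT` as the mask counts the alphanumeric entries of an array such as the one filled by PDWordGetCharacterTypes().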
typedef struct _t_PDWord *PDWord;
ASBool PDWordProc(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData);
wObj | IN/OUT The word finder. |
wInfo | IN/OUT The current word in the enumeration. |
pgNum | IN/OUT The page number on which wInfo is located. |
clientData | IN/OUT User-supplied data that was passed in the call to PDWordFinderEnumWords(). |
PDWordFinder method. The text selection can then be set as the current selection using AVDocSetSelection().
Note: For consistent text selection behavior, avoid using other PDTextSelect creation methods which depend on the word finder versions and word offsets. These include PDTextSelectCreatePageHiliteEx(), PDTextSelectCreateRanges(), PDTextSelectCreateRangesEx(), PDTextSelectCreateWordHilite(), and PDTextSelectCreateWordHiliteEx().
PDTextSelect PDWordCreateTextSelect(PDPage page, PDWord *wList, ASUns32 wListLen);
page | The page on which to select the words. |
wList | The word list to be selected. |
wListLen | The number of words in the word list. |
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word.
The determination of which characters are alphanumeric, wildcard, punctuation, and so forth, is made by the values in infoArray.
Although this method seems very similar to PDWordFilterWord(), the two methods treat letters and digits slightly differently. PDWordFilterWord() uses the encoding info array but also does a straight character code test for any characters that have not been mapped to anything. It does this to catch letters and digits from non-standard character sets, and is necessary to avoid removing words with non-standard character sets.
PDWordFilterString(), on the other hand, was designed for known character sets such as WinAnsi and Mac Roman.
For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.
Note: In Acrobat 6.0, the method PDWordFinderEnumWordsStr() is preferred to this method, which remains for backward compatibility.
ASBool PDWordFilterString(ASUns16 *infoArray, char *cNewWord, char *cOldWord);
infoArray | An array specifying the type of each character in the font. Each entry in this table must be one of the Character Type Codes. If For descriptions of You can find this document on the web store of the International Standards Organization (ISO). |
cNewWord | (Filled by the method) The filtered word. |
cOldWord | The unfiltered word. This value must be passed to the method. |
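The trimming rule described for PDWordFilterString() can be illustrated with a small standalone sketch. This is not the SDK implementation: `trim_word` and `is_trimmable` are hypothetical names, the table is assumed to have 256 entries indexed by character code, and the constants are redefined locally with the values from the Character Type Codes listing. It handles single-byte text only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the Character Type Codes listing. */
#define W_PUNCTUATION 0x10
#define W_SOFT_HYPHEN 0x40
#define W_WHITE       0x100
#define W_WILD_CARD   0x4000

/* Nonzero if a character of this type should be trimmed from either end. */
static int is_trimmable(uint16_t type)
{
    if (type & W_WILD_CARD)
        return 0;                      /* wildcards are never removed */
    return (type & (W_WHITE | W_PUNCTUATION | W_SOFT_HYPHEN)) != 0;
}

/* Hypothetical stand-in for the trimming step: copies src to dst without
   leading or trailing trimmable characters. Punctuation inside the word is
   left alone. info has 256 entries indexed by character code; dst must hold
   at least strlen(src) + 1 bytes. */
void trim_word(const uint16_t *info, char *dst, const char *src)
{
    size_t begin = 0, end;
    assert(info != NULL && dst != NULL && src != NULL);
    end = strlen(src);
    while (begin < end && is_trimmable(info[(unsigned char)src[begin]]))
        begin++;
    while (end > begin && is_trimmable(info[(unsigned char)src[end - 1]]))
        end--;
    memcpy(dst, src + begin, end - begin);
    dst[end - begin] = '\0';
}
```

With a table that marks space as white space, '(', ')', '-' and '.' as punctuation, and '*' as a wildcard, the input " (well-known). " is reduced to "well-known", while "*foo*." keeps both asterisks.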
Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word. It also converts ligatures to their constituent characters. The determination of which characters to remove is made by examining the flags in the outEncInfo array passed to PDDocCreateWordFinder(). As a result, this method is most useful when called with words obtained from PDWordFinderGetNthWord(), words received in the callback for PDWordFinderEnumWords(), and words in the pXYSortTable returned by PDWordFinderAcquireWordList(). See the description of PDWordFilterString() for further information, and for a description of how the two methods differ.
The Acrobat Catalog program uses this method to filter words before indexing them.
This method works with non-Roman systems.
Note: In Acrobat 6.0 and later, the method PDWordFinderEnumWords() is preferred to this method, which remains for backward compatibility.
ASBool PDWordFilterWord(PDWord word, char *buffer, ASInt16 bufferLen, ASInt16 *newLen);
word | The PDWord to filter. |
buffer | (Filled by the method) The filtered string. |
bufferLen | The maximum number of characters that buffer can hold. |
newLen | (Filled by the method) The number of characters actually written into buffer. |
PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh);
wObj | IN/OUT The word finder whose nth word is obtained. |
nTh | IN/OUT The index of the word to obtain. The first word on a page has an index of zero. Words are counted in PDF order. See the description of the wInfoP parameter in PDWordFinderAcquireWordList(). |
NULL when the end of the list is reached.
void PDWordGetASText(PDWord word, ASUns32 filter, ASText str);
word | The word whose text becomes the new ASText. |
filter | Character types to be dropped from the output string. For example, passing W_SOFT_HYPHEN | W_ACCENT returns text without soft hyphens and accent marks. |
str | An existing ASText object whose content will be replaced by the new text. |
Note: PDWordGetAttr() may return an attribute value greater than the maximum of all of the public attributes since there can be private attributes added on. It is recommended to AND the result with the attribute you are interested in.
ASUns16 PDWordGetAttr(PDWord word);
word | IN/OUT The word whose character types are obtained. |
OR of the Word Attributes.
This is a version 6.0 extension of PDWordGetAttr() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher. It can get an additional 16-bit flag group defined in Acrobat 6.
It gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.
Note: PDWordGetAttr() may return an attribute value greater than the maximum of all of the public attributes, since there can be private attributes added on. It is recommended that you AND the result with the attribute you are interested in.
ASUns16 PDWordGetAttrEx(PDWord word, ASUns32 groupID);
word | The word whose character types are obtained. |
groupID | |
word. The value is a logical OR of the Word Attributes.
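The recommendation to AND the returned value with the attribute of interest can be sketched as follows. This is an illustration only: `word_has_attr` is a hypothetical helper, and the constants are redefined locally with the values from the Word Attributes listing rather than taken from the SDK headers.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the Word Attributes listing. */
#define WXE_HAS_PUNCTUATION   0x10
#define WXE_HAS_HYPHEN        0x20
#define WXE_LAST_WORD_ON_LINE 0x8000

/* Hypothetical helper: tests the requested public attribute bits and
   ignores every other bit, including any private ones. */
int word_has_attr(uint16_t attrs, uint16_t flag)
{
    assert(flag != 0);
    return (attrs & flag) == flag;
}
```

In client code, `attrs` would be the value returned by PDWordGetAttr(). Comparing the whole value against a constant with `==` is what the note warns against, because private bits may also be set.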
Returns the byte offset within the specified word of the highlightable character at the specified character offset. The first character of a word is at byte offset 0. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
The returned byte offset can be passed to PDWordGetCharOffsetEx() and PDWordGetCharQuad() to get additional information. Use PDWordGetNumHiliteChar() to get the number of highlightable characters in a word.
ASUns32 PDWordGetByteIdxFromHiliteChar(PDWord word, ASUns32 charIdx);
word | The word containing the character. |
charIdx | The character index within the word. |
0 if the character index is out of range.
For example, if the word in the PDF content is "fi" (ligature) + "sh", the mapped word will be "fish". The ligature occupies only one character code, so in this case the character delta will be 3 - 4 = -1.
ASInt8 PDWordGetCharDelta(PDWord word);
word | IN/OUT The word whose character delta is obtained. |
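The arithmetic in the ligature example can be written out as a tiny sketch. It assumes, based only on that example, that the delta is the number of character codes in the original word minus the number of characters in the mapped word. `char_delta` is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stddef.h>

/* Character delta as in the example above: character codes in the original
   PDF word minus characters in the mapped word. */
int char_delta(size_t original_codes, size_t mapped_chars)
{
    int delta = (int)original_codes - (int)mapped_chars;
    assert(delta >= -128 && delta <= 127);   /* must fit an ASInt8 */
    return delta;
}
```

For the "fish" example, the original word has 3 character codes and the mapped word has 4 characters, so `char_delta(3, 4)` yields -1.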
0.
Gets the WordFinder Character Encoding Flags for each character in a word, which specify how reliably the word finder identified the character encoding.
This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
void PDWordGetCharEncFlags(PDWord word, ASUns32 *fList, ASUns32 size);
word | The word whose character encoding flags are obtained. |
fList | (Filled by the method) An array of character encoding flags types. This array contains one element for each byte of text in the word. The byte length of the text can be determined with PDWordGetLength(). Each element is the logical OR of one or more of the character encoding flags. |
size | The maximum number of elements in the array fList. |
ASUns16 PDWordGetCharOffset(PDWord word);
word | IN/OUT The word whose character offset is obtained. |
This is a version 6.0 extension of PDWordGetCharOffset() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
It returns the character offset for a character identified by its index number, and the number of bytes (length) used for that character. The length is usually 1 for single-byte characters and 2 for double-byte characters. If multiple bytes are used to construct one character, only the first byte has valid character offset information and the other bytes have zero offset length with the same character offset of the first byte. If the returned offset length is zero, it means the specified byte in the word is a part (other than the first byte) of a multi-byte character.
The character offset is the character position calculated in bytes from the beginning of a page. Because of the encoding conversions and character replacements applied by the word finder, some characters may have different byte lengths from the original PDF content. The character offset itself can locate a character in the PDF content. However, without the offset length (that is, the number of bytes in the PDF content), clients cannot tell whether two characters are next to each other in the PDF content. For example, suppose you want to create a Text Select object of two characters at character offset 1 and 3. You can create an object with two disconnected ranges of [Offset 1, length 1] and [Offset 3, length 1]. However, if you know that the offset length of both characters is 2, you can create a simpler object with a single range of [Offset 1, length 4].
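The range-merging idea in the example above can be sketched in a few lines. This is an illustration under stated assumptions, not SDK code: `ByteRange` and `merge_adjacent` are hypothetical names, and the input is assumed to be sorted by offset with no overlapping ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type: a start offset and a length, both measured in
   bytes of PDF content. */
typedef struct {
    uint32_t offset;
    uint32_t length;
} ByteRange;

/* Merges ranges that touch, in place. The input must be sorted by offset
   and non-overlapping. Returns the number of ranges left. */
size_t merge_adjacent(ByteRange *r, size_t n)
{
    size_t in, out = 0;
    assert(r != NULL || n == 0);
    for (in = 0; in < n; in++) {
        if (out > 0 && r[out - 1].offset + r[out - 1].length == r[in].offset)
            r[out - 1].length += r[in].length;   /* extend previous range */
        else
            r[out++] = r[in];                    /* start a new range */
    }
    return out;
}
```

With offset lengths of 2, the characters at offsets 1 and 3 collapse into the single range [Offset 1, length 4]. With lengths of 1 they stay as two separate ranges, because offset 2 is not covered.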
ASUns32 PDWordGetCharOffsetEx(PDWord word, ASUns32 byteIdx, ASUns32 *bytesConsumed, ASUns32 *offsetLen);
word | The word whose character offset is obtained. |
byteIdx | The byte index within the word of the character whose offset is obtained. Valid values are 0 to PDWordGetLength(word)-1. |
bytesConsumed | (Filled by method) Returns the number of bytes in the word that are occupied by the specified character. It can be NULL if it is not needed. Use (byteIdx + *bytesConsumed) to get the byte index of the next character in the word. |
offsetLen | (Filled by the method) Returns the number of bytes occupied by the specified character in the original PDF content. This is 0 if the specified byte is not the starting byte of a character in the PDF content. It can be NULL if it is not needed. |
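The `byteIdx + *bytesConsumed` stepping rule can be shown with a self-contained loop. Because the SDK is not available in a standalone example, the sketch below replaces the real call with a hypothetical callback type (`BytesConsumedFn`). `count_chars` and `table_consumed` are likewise illustrative names, not SDK functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback standing in for a call such as
   PDWordGetCharOffsetEx(): reports how many bytes of the word the character
   starting at byte_idx occupies. */
typedef uint32_t (*BytesConsumedFn)(uint32_t byte_idx, void *ctx);

/* Counts the characters in a word of word_len bytes by advancing
   byte_idx by the number of bytes each character consumes. */
uint32_t count_chars(uint32_t word_len, BytesConsumedFn consumed, void *ctx)
{
    uint32_t byte_idx = 0, chars = 0;
    assert(consumed != NULL);
    while (byte_idx < word_len) {
        uint32_t step = consumed(byte_idx, ctx);
        if (step == 0)
            step = 1;            /* defensive: always make progress */
        byte_idx += step;
        chars++;
    }
    return chars;
}

/* Toy callback for demonstration: ctx points to a per-byte table of sizes. */
uint32_t table_consumed(uint32_t byte_idx, void *ctx)
{
    return ((const uint8_t *)ctx)[byte_idx];
}
```

In real client code the callback body would call PDWordGetCharOffsetEx() and return the value written to bytesConsumed; the loop bound would come from PDWordGetLength().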
Gets the bounding quadrilateral of the character at a given index position in the word. If the specified character is constructed with multiple bytes, only the first byte returns a valid quad. Otherwise, this method returns false.
This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
ASBool PDWordGetCharQuad(PDWord word, ASUns32 byteIdx, ASFixedQuad *quad);
word | The word containing the character whose quad is obtained. |
byteIdx | The byte index within the word of the character whose quad is obtained. Valid values are 0 to PDWordGetLength(word)-1. |
quad | (Filled by method) A pointer to an existing quad structure in which to return the character's quad specified in user-space coordinates. |
void PDWordGetCharacterTypes(PDWord word, ASUns16 *cArr, ASInt16 size);
word | The word whose character types are obtained. |
cArr | (Filled by the method) An array of character types. This array contains one element for each character in the word. Use PDWordGetLength() to determine the number of elements that must be in the array. Each element is the logical OR of one or more of the Character Type Codes. For non-Roman character set viewers, meaningful values are returned only for Roman characters. For non-Roman characters, it returns 0, which is the same as W_CNTL. If the character is 2 bytes, both bytes indicate the same character type. |
size | The number of elements in cArr. |
ASUns8 PDWordGetLength(PDWord word);
word | IN/OUT The word object whose character count is obtained. |
PDStyle PDWordGetNthCharStyle(PDWordFinder wObj, PDWord word, ASInt32 dex);
wObj | IN/OUT A word finder object. |
word | IN/OUT The word whose nth style is obtained. |
dex | IN/OUT The index of the style to obtain. The first style in a word has an index of zero. |
NULL if dex is greater than the number of styles in the word.
is raised if dex < 0. |
Gets the specified word's nth quad, specified in user space coordinates. See PDWordGetNumQuads() for a description of a quad.
The quad's height is the height of the font's bounding box, not the height of the tallest character used in the word. The font's bounding box is determined by the glyphs in the font that extend farthest above and below the baseline; it often extends somewhat above the top of 'A' and below the bottom of 'y'.
The quad's width is determined from the characters actually present in the word.
For example, the quads for the words "AWAY" and "away" have the same height, but generally do not have the same width unless the font is a mono-spaced font (a font in which all characters have the same width).
Despite the names of the fields in an ASFixedQuad (tl for top left, bl for bottom left, and so forth) the corners of quad do not necessarily have these positions.
ASBool PDWordGetNthQuad(PDWord word, ASInt16 nTh, ASFixedQuad *quad);
word | The word whose nth quad is obtained. |
nTh | The quad to obtain. A word's first quad has an index of zero. |
quad | (Filled by the method) A pointer to the word's nth quad, specified in user-space coordinates. |
Gets the number of highlightable characters in a word. A highlightable character is the minimum text unit that Acrobat can select and highlight. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
Because of the encoding conversion, the characters in a word finder word list do not have a 1-to-1 correspondence to the characters displayed by Acrobat. For example, if the word is "fish" and the text operation in PDF content is "fi" (ligature) + 's' + 'h', this method returns the number of highlightable characters as 3, counting "fi" as one character. For the same word, the PDWordGetLength() method returns the byte-length as 4.
ASUns32 PDWordGetNumHiliteChar(PDWord word);
word | The word whose highlightable character count is obtained. |
word.
ASInt16 PDWordGetNumQuads(PDWord word);
word | IN/OUT The word whose quad count is obtained. |
This method gets a word's text. The string to return includes any word break characters (such as space characters) that follow the word, but not any that precede the word. The characters that are treated as word breaks are defined in the outEncInfo parameter of PDDocCreateWordFinder() method. Use PDWordFilterString() to subsequently remove the word break characters.
This method produces a string in whatever encoding the PDWord uses, for both Roman and non-Roman systems.
void PDWordGetString(PDWord word, char *str, ASInt32 len);
word | The word whose string is obtained. |
str | (Filled by the method) The string. The encoding of the string is the encoding used by the PDWordFinder that supplied the PDWord. For instance, if PDDocCreateWordFinderUCS() is used to create the word finder, PDWordGetString() returns only Unicode. There is no way to detect Unicode strings returned by PDWordGetString(), since there is no UCS header (FEFF) added to each string returned. |
len | The length of str in bytes. Up to len characters of word will be copied into str. If str is long enough, it will be NULL-terminated. |
ASInt16 PDWordGetStyleTransition(PDWord word, ASInt16 *transTbl, ASInt16 size);
word | IN/OUT The word whose style transition list is obtained. |
transTbl | IN/OUT (Filled by the method) An array of style transitions. Each element is the character offset in word where the style changes. The offset specifies the first character in the word that has the new style. The first character in a word has an offset of zero. |
size | IN/OUT The number of entries that transTbl can hold. The word is searched only until this number of style transitions have been found. |
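A style transition table of this shape can be consulted with a simple scan. The sketch below is not SDK code: `transitions_up_to` is a hypothetical helper, and it assumes only that the table holds ascending character offsets, as the transTbl description implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns how many entries of a style transition table are at or before
   char_idx. tbl holds n character offsets in ascending order. */
int16_t transitions_up_to(const int16_t *tbl, int16_t n, int16_t char_idx)
{
    int16_t i, count = 0;
    assert(tbl != NULL || n == 0);
    for (i = 0; i < n && tbl[i] <= char_idx; i++)
        count++;
    return count;
}
```

Whether the table includes an entry for offset 0 is not stated here. If it does, which is an assumption, the result minus one would be the style index that applies to the character at `char_idx`.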
transTbl.
ASBool PDWordIsCurrentlyVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx);
word | The word to test. |
pageNum | The page number for which the word is tested. |
ctx | The context in which the word is tested, as returned by PDDocGetOCContext(pdDoc). |
ASBool PDWordIsRotated(PDWord word);
word | The word to test. |
ASBool PDWordMakeVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx);
word | The word to be made visible. |
pageNum | The page number for which the word is to be made visible. |
ctx | The context in which the word is to be made visible, as returned by PDDocGetOCContext(pdDoc). |
Splits the specified string into words by substituting spaces for word separator characters. The list of characters considered to be word separators can be specified, or a default list can be used.
The characters ',' and '.' are context-sensitive word separators. If surrounded by digits (for example, 654,096.345), they are not considered word separators.
For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.
ASUns16 PDWordSplitString(ASUns16 *infoArray, char *cNewWord, char *cOldWord, ASUns16 nMaxLen);
infoArray | A character information table. It specifies each character's type; word separator characters must be marked as W_WORD_BREAK (see Character Type Codes). This table can be identical to the table to pass to PDDocCreateWordFinder(). If infoArray is NULL, a default table is used (see Glyph Names of Word Separators). |
cNewWord | (Filled by the method) The word that has been split. Word separator characters have been replaced with spaces. |
cOldWord | The word to split. |
nMaxLen | The number of characters that cNewWord can hold. Word splitting stops when cOldWord is completely processed or nMaxLen characters have been placed in cNewWord, whichever occurs first. |
is raised if infoArray is NULL, but host encoding cannot be obtained. |
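The splitting rule described for PDWordSplitString(), including the digit-context exception for ',' and '.', can be illustrated with a standalone sketch. It is not the SDK implementation. `split_string` is a hypothetical name, the table is assumed to have 256 entries indexed by character code, only single-byte text is handled, and the return value and terminating null are choices made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Value copied from the Character Type Codes listing. */
#define W_WORD_BREAK 0x8000

/* Hypothetical stand-in for the splitting rule: copies src to dst, replacing
   word separators with spaces. ',' and '.' are kept when they sit between
   two digits. At most max_len characters are written, plus a terminating
   null, so dst needs max_len + 1 bytes. Returns the characters written. */
uint16_t split_string(const uint16_t *info, char *dst, const char *src,
                      uint16_t max_len)
{
    uint16_t i;
    assert(info != NULL && dst != NULL && src != NULL);
    for (i = 0; i < max_len && src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        /* src[i + 1] is at worst the terminating null, so the read is safe */
        int between_digits = i > 0 &&
                             isdigit((unsigned char)src[i - 1]) &&
                             isdigit((unsigned char)src[i + 1]);
        if ((info[c] & W_WORD_BREAK) &&
            !((c == ',' || c == '.') && between_digits))
            dst[i] = ' ';
        else
            dst[i] = (char)c;
    }
    dst[i] = '\0';
    return i;
}
```

With ',', '.' and '/' marked as W_WORD_BREAK, the input "a,b/c 654,096.345" becomes "a b c 654,096.345": the separators between letters turn into spaces while those inside the number survive.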