PD Layer: PDWord

PDWord Definitions

WXE_ADJACENT_TO_SPACE: - The character following the end of the word is a space (either an explicit space character encoded in a string, or one that appears implicitly because the drawing point was moved).
WXE_ENCODING_WARNING: -
WXE_ENC_MISSING: -
WXE_ENC_NO_UCS: -
WXE_ENC_UNMAPPED: -
WXE_EXT_CHAR_OFFSETS: -
WXE_FROM_ACTUALT: -
WXE_FRONT_TAB: -
WXE_HAS_DIGIT: - One or more characters in the word are digits.
WXE_HAS_HYPHEN: - There is a hyphen in the word.
WXE_HAS_LEADING_PUNC: - The first character in the word is a punctuation mark.
WXE_HAS_LETTER: - The word contains a character between A-Z or a-z.
WXE_HAS_LIGATURE: - The word contains a ligature.
WXE_HAS_NONALPHANUM: - The word contains a character outside the range of A-Z, a-z, 0-9.
WXE_HAS_PUNCTUATION: - One or more characters in the word are punctuation marks.
WXE_HAS_SOFT_HYPHEN: - There is a soft hyphen in the word.
WXE_HAS_TRAILING_PUNC: - The last character in the word is a punctuation mark.
WXE_HAS_UNMAPPED_CHAR: - One or more characters in the word cannot be represented in the output font encoding.
WXE_HAS_UPPERCASE: - The word contains a character between A-Z.
WXE_LAST_WORD_ON_LINE: - The word is at the end of the current text line (for example, the word is followed by a line break).
WXE_PDF_ORDER: -
WXE_RD_ORDER_SORT: -
WXE_REVERSE_DIRECTION: -
WXE_ROTATED: - The writing direction of the word is not a multiple of 90 degrees, or the bounding box of the text is skewed.
WXE_STREAM: -
WXE_VERTICAL_FLOW: - The writing direction of the word is either 90 or 180 degrees.
WXE_WBREAK_WORD: -
WXE_WORD_IS_UNICODE: -
WXE_XY_SORT: -
W_ACCENT: - An accent mark.
W_CNTL: - A control code.
W_COMMA: - A comma.
W_DIGIT: - A digit.
W_END_PHRASE: - An end-of-phrase glyph (for example, ".", "?", "!", ":", and ";").
W_HYPHEN: - A hyphen.
W_LETTER: - A lowercase letter.
W_LIGATURE: - A ligature.
W_PERIOD: - A period.
W_PUNCTUATION: - A punctuation mark.
W_SOFT_HYPHEN: - A hyphen that is only present because a word is broken across two lines of text.
W_UNMAPPED: - A glyph that cannot be represented in the destination font encoding.
W_UPPERCASE: - An uppercase letter.
W_WHITE: - A white space glyph.
W_WILD_CARD: - A wildcard glyph (for example, "*" and "?") that should not be treated as a normal punctuation mark.
W_WORD_BREAK: - A glyph that acts as a delimiter between words.

WXE_ADJACENT_TO_SPACE

Header: PDExpT.h:3577

Description

The character following the end of the word is a space (either an explicit space character encoded in a string, or one that appears implicitly because the drawing point was moved).

Syntax

#define WXE_ADJACENT_TO_SPACE 0x800

WXE_ENCODING_WARNING

Header: PDExpT.h

Syntax

#define WXE_ENCODING_WARNING

WXE_ENC_MISSING

Header: PDExpT.h

Syntax

#define WXE_ENC_MISSING

WXE_ENC_NO_UCS

Header: PDExpT.h

Syntax

#define WXE_ENC_NO_UCS

WXE_ENC_UNMAPPED

Header: PDExpT.h

Syntax

#define WXE_ENC_UNMAPPED

WXE_EXT_CHAR_OFFSETS

Header: PDExpT.h

Syntax

#define WXE_EXT_CHAR_OFFSETS

WXE_FROM_ACTUALT

Header: PDExpT.h

Syntax

#define WXE_FROM_ACTUALT

WXE_FRONT_TAB

Header: PDExpT.h:3610

Syntax

#define WXE_FRONT_TAB 0x01

WXE_HAS_DIGIT

Header: PDExpT.h:3523

Description

One or more characters in the word are digits.

Syntax

#define WXE_HAS_DIGIT 0x8

WXE_HAS_HYPHEN

Header: PDExpT.h:3539

Description

There is a hyphen in the word.

Syntax

#define WXE_HAS_HYPHEN 0x20

WXE_HAS_LEADING_PUNC

Header: PDExpT.h:3556

Description

The first character in the word is a punctuation mark. If this bit is set, WXE_HAS_PUNCTUATION will also be set.

Syntax

#define WXE_HAS_LEADING_PUNC 0x100

WXE_HAS_LETTER

Header: PDExpT.h:3513

Description

The word contains a character between A-Z or a-z.

Syntax

#define WXE_HAS_LETTER 0x2

WXE_HAS_LIGATURE

Header: PDExpT.h:3549

Description

The word contains a ligature.

Syntax

#define WXE_HAS_LIGATURE 0x80

WXE_HAS_NONALPHANUM

Header: PDExpT.h:3508

Description

The word contains a character outside the range of A-Z, a-z, 0-9.

Syntax

#define WXE_HAS_NONALPHANUM 0x1

WXE_HAS_PUNCTUATION

Header: PDExpT.h:3534

Description

One or more characters in the word are punctuation marks. Other flag bits can be checked to test whether the punctuation was at the beginning of the word (WXE_HAS_LEADING_PUNC), the end of the word (WXE_HAS_TRAILING_PUNC), or elsewhere in the word.

Syntax

#define WXE_HAS_PUNCTUATION 0x10
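
The flag tests described above can be sketched in C as follows. This is a minimal illustration, not SDK code: the bit values are copied from the Syntax lines in this section, while the `punc_position` enum and `classify_punctuation()` are hypothetical names introduced here. It assumes `attrs` is a bit field such as the one returned by PDWordGetAttr().

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the Syntax lines in this section. The guard lets the
   real PDExpT.h definitions take precedence if that header is included. */
#ifndef WXE_HAS_PUNCTUATION
#define WXE_HAS_PUNCTUATION   0x10
#define WXE_HAS_LEADING_PUNC  0x100
#define WXE_HAS_TRAILING_PUNC 0x200
#endif

/* Hypothetical result codes; these are not part of the SDK. */
enum punc_position {
    PUNC_NONE, PUNC_LEADING, PUNC_TRAILING, PUNC_BOTH, PUNC_INTERIOR
};

/* Classifies where punctuation occurs, based only on the three flag bits.
   A word with leading AND interior punctuation is reported as PUNC_LEADING,
   because the flags alone cannot distinguish that case. */
static enum punc_position classify_punctuation(uint16_t attrs)
{
    int lead  = (attrs & WXE_HAS_LEADING_PUNC) != 0;
    int trail = (attrs & WXE_HAS_TRAILING_PUNC) != 0;

    if (!(attrs & WXE_HAS_PUNCTUATION)) {
        /* The leading/trailing bits are documented to imply the general bit. */
        assert(!lead && !trail);
        return PUNC_NONE;
    }
    if (lead && trail) return PUNC_BOTH;
    if (lead)          return PUNC_LEADING;
    if (trail)         return PUNC_TRAILING;
    return PUNC_INTERIOR;   /* punctuation elsewhere in the word */
}
```

Testing individual bits with `&`, rather than comparing the whole value, keeps the check valid when unrelated attribute bits are also set.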

WXE_HAS_SOFT_HYPHEN

Header: PDExpT.h:3544

Description

There is a soft hyphen in the word.

Syntax

#define WXE_HAS_SOFT_HYPHEN 0x40

WXE_HAS_TRAILING_PUNC

Header: PDExpT.h:3563

Description

The last character in the word is a punctuation mark. If this bit is set, WXE_HAS_PUNCTUATION will also be set.

Syntax

#define WXE_HAS_TRAILING_PUNC 0x200

WXE_HAS_UNMAPPED_CHAR

Header: PDExpT.h:3569

Description

One or more characters in the word cannot be represented in the output font encoding.

Syntax

#define WXE_HAS_UNMAPPED_CHAR 0x400

WXE_HAS_UPPERCASE

Header: PDExpT.h:3518

Description

The word contains a character between A-Z.

Syntax

#define WXE_HAS_UPPERCASE 0x4

WXE_LAST_WORD_ON_LINE

Header: PDExpT.h:3607

Description

The word is at the end of the current text line (for example, the word is followed by a line break).

Syntax

#define WXE_LAST_WORD_ON_LINE 0x8000

WXE_PDF_ORDER

Header: PDExpT.h:3630

Syntax

#define WXE_PDF_ORDER 0x2

WXE_RD_ORDER_SORT

Header: PDExpT.h:3638

Syntax

#define WXE_RD_ORDER_SORT 0x8

WXE_REVERSE_DIRECTION

Header: PDExpT.h:3614

Syntax

#define WXE_REVERSE_DIRECTION 0x04

WXE_ROTATED

Header: PDExpT.h:3587

Description

The writing direction of the word is not a multiple of 90 degrees, or the bounding box of the text is skewed. This flag indicates that the quads of the word should be used to specify the highlight area correctly.

Syntax

#define WXE_ROTATED 0x1000

WXE_STREAM

Header: PDExpT.h:3626

Syntax

#define WXE_STREAM 0x1

WXE_VERTICAL_FLOW

Header: PDExpT.h:3596

Description

The writing direction of the word is either 90 or 180 degrees. This flag ignores the page rotation parameter of the page dictionary. Therefore, if the page is rotated 90 degrees, this flag will be set on each word that appears horizontally on the screen.

Syntax

#define WXE_VERTICAL_FLOW 0x2000

WXE_WBREAK_WORD

Header: PDExpT.h:3601

Syntax

#define WXE_WBREAK_WORD 0x4000

WXE_WORD_IS_UNICODE

Header: PDExpT.h:3615

Syntax

#define WXE_WORD_IS_UNICODE 0x08

WXE_XY_SORT

Header: PDExpT.h:3634

Syntax

#define WXE_XY_SORT 0x4

W_ACCENT

Header: PDExpT.h:3461

Description

An accent mark.

Syntax

#define W_ACCENT 0x800

W_CNTL

Header: PDExpT.h:3402

Description

A control code.

Syntax

#define W_CNTL 0x1

W_COMMA

Header: PDExpT.h:3451

Description

A comma. Commas and periods are treated separately from other punctuation marks because they are used both as word punctuation marks and as delimiters in numbers, and need to be treated differently in the two cases.

Syntax

#define W_COMMA 0x200

W_DIGIT

Header: PDExpT.h:3417

Description

A digit.

Syntax

#define W_DIGIT 0x8

W_END_PHRASE

Header: PDExpT.h:3471

Description

An end-of-phrase glyph (for example, ".", "?", "!", ":", and ";").

Syntax

#define W_END_PHRASE 0x2000

W_HYPHEN

Header: PDExpT.h:3427

Description

A hyphen.

Syntax

#define W_HYPHEN 0x20

W_LETTER

Header: PDExpT.h:3407

Description

A lowercase letter.

Syntax

#define W_LETTER 0x2

W_LIGATURE

Header: PDExpT.h:3437

Description

A ligature.

Syntax

#define W_LIGATURE 0x80

W_PERIOD

Header: PDExpT.h:3456

Description

A period.

Syntax

#define W_PERIOD 0x400

W_PUNCTUATION

Header: PDExpT.h:3422

Description

A punctuation mark.

Syntax

#define W_PUNCTUATION 0x10

W_SOFT_HYPHEN

Header: PDExpT.h:3432

Description

A hyphen that is only present because a word is broken across two lines of text.

Syntax

#define W_SOFT_HYPHEN 0x40

W_UNMAPPED

Header: PDExpT.h:3466

Description

A glyph that cannot be represented in the destination font encoding.

Syntax

#define W_UNMAPPED 0x1000

W_UPPERCASE

Header: PDExpT.h:3412

Description

An uppercase letter.

Syntax

#define W_UPPERCASE 0x4

W_WHITE

Header: PDExpT.h:3442

Description

A white space glyph.

Syntax

#define W_WHITE 0x100

W_WILD_CARD

Header: PDExpT.h:3476

Description

A wildcard glyph (for example, "*" and "?") that should not be treated as a normal punctuation mark.

Syntax

#define W_WILD_CARD 0x4000

W_WORD_BREAK

Header: PDExpT.h:3482

Description

A glyph that acts as a delimiter between words.

Syntax

#define W_WORD_BREAK 0x8000

PDWord Typedefs

PDWord

Header: PDExpT.h:3368

Description

A word in a PDF file. Each word contains a sequence of characters in one or more styles (see PDStyle).

Syntax

typedef struct _t_PDWord *PDWord;

PDWord Callback Signatures

PDWordProc

Header: PDExpT.h:3389

Description

A callback for PDWordFinderEnumWords. It is called once for each word.

Syntax

ASBool PDWordProc(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void *clientData);

Parameters

`wObj`	IN/OUT The word finder.
`wInfo`	IN/OUT The current word in the enumeration.
`pgNum`	IN/OUT The page number on which `wInfo` is located.
`clientData`	IN/OUT User-supplied data that was passed in the call to PDWordFinderEnumWords().

Returns

true to continue enumeration, false to halt enumeration.
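
A callback with this signature might look like the sketch below. It is illustrative only: the typedefs are stand-ins so the fragment compiles without the SDK headers (remove them when building against PDExpT.h), and `word_counter` and `count_words_proc()` are hypothetical names. The callback counts words and halts the enumeration once a caller-chosen limit is reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in declarations, assumed to approximate the SDK types.
   Remove them when the real SDK headers are included. */
typedef struct _t_PDWordFinder *PDWordFinder;
typedef struct _t_PDWord *PDWord;
typedef int32_t ASInt32;
typedef uint16_t ASBool;

/* Hypothetical client data: how many words were seen, and when to stop. */
struct word_counter {
    ASInt32 seen;
    ASInt32 limit;
};

/* Matches the PDWordProc signature shown above. */
static ASBool count_words_proc(PDWordFinder wObj, PDWord wInfo,
                               ASInt32 pgNum, void *clientData)
{
    struct word_counter *wc = (struct word_counter *)clientData;

    (void)wObj; (void)wInfo; (void)pgNum;   /* unused in this sketch */
    assert(wc != NULL);

    wc->seen++;
    /* Nonzero (true) continues the enumeration; zero (false) halts it. */
    return (ASBool)(wc->seen < wc->limit);
}
```

The pointer to a `word_counter` would be supplied as the `clientData` argument of the enumeration call, and read back after the enumeration returns.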

PDWord Functions

PDWordCreateTextSelect: - Creates a text selection object for a given page that includes all words in a word list, as returned from a PDWordFinder method.
PDWordFilterString: - Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word.
PDWordFilterWord: - Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word.
PDWordFinderGetNthWord: - Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().
PDWordGetASText: - Copies the text from a word into an ASText object.
PDWordGetAttr: - Gets a bit field containing information on the types of characters in a word.
PDWordGetAttrEx: - This is a version 6.0 extension of PDWordGetAttr() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
PDWordGetByteIdxFromHiliteChar: - Returns the byte offset within the specified word of the highlightable character at the specified character offset.
PDWordGetCharDelta: - Gets the difference between the word length (the number of printed characters in the word) and the PDF word length (the number of character codes in the word).
PDWordGetCharEncFlags: - Gets the WordFinder Character Encoding Flags for each character in a word, which specify how reliably the word finder identified the character encoding.
PDWordGetCharOffset: - Returns a word's character offset from the beginning of its page.
PDWordGetCharOffsetEx: - This is a version 6.0 extension of PDWordGetCharOffset() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.
PDWordGetCharQuad: - Gets the quadrilateral bounding of the character at a given index position in the word.
PDWordGetCharacterTypes: - Gets the character type for each character in a word.
PDWordGetLength: - Gets the number of bytes in a word.
PDWordGetNthCharStyle: - Returns a PDStyle object for the nth style in a word.
PDWordGetNthQuad: - Gets the specified word's nth quad, specified in user space coordinates.
PDWordGetNumHiliteChar: - Gets the number of highlightable characters in a word.
PDWordGetNumQuads: - Gets the number of quads in a word.
PDWordGetString: - This method gets a word's text.
PDWordGetStyleTransition: - Gets the locations of style transitions in a word.
PDWordIsCurrentlyVisible: - Tests whether a word is visible in a given optional-content context on a given page.
PDWordIsRotated: - Tests whether a word is rotated.
PDWordMakeVisible: - Makes a word visible in a given optional-content context on a given page.
PDWordSplitString: - Splits the specified string into words by substituting spaces for word separator characters.

PDWordCreateTextSelect

Header: PDProcs.h:8679

Description

Creates a text selection object for a given page that includes all words in a word list, as returned from a PDWordFinder method. The text selection can then be set as the current selection using AVDocSetSelection().

Note: For consistent text selection behavior, avoid using other PDTextSelect creation methods which depend on the word finder versions and word offsets. These include PDTextSelectCreatePageHiliteEx(), PDTextSelectCreateRanges(), PDTextSelectCreateRangesEx(), PDTextSelectCreateWordHilite(), and PDTextSelectCreateWordHiliteEx().

Syntax

PDTextSelect PDWordCreateTextSelect(PDPage page, PDWord *wList, ASUns32 wListLen);

Parameters

`page`	The page on which to select the words.
`wList`	The word list to be selected.
`wListLen`	The number of words in the word list.

Returns

The newly created text selection.

PDWordFilterString

Header: PDProcs.h:5140

Description

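Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. The determination of which characters are alphanumeric, wildcard, punctuation, and so forth, is made by the values in infoArray.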

Although this method seems very similar to PDWordFilterWord(), the two methods treat letters and digits slightly differently. PDWordFilterWord() uses the encoding info array but also does a straight character code test for any characters that have not been mapped to anything. It does this to catch letters and digits from non-standard character sets, and is necessary to avoid removing words with non-standard character sets.

PDWordFilterString(), on the other hand, was designed for known character sets such as WinAnsi and Mac Roman.

For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.

Note: In Acrobat 6.0, the method PDWordFinderEnumWordsStr() is preferred to this method, which remains for backward compatibility.

Related Methods

PDWordFilterWord

Syntax

ASBool PDWordFilterString(ASUns16 *infoArray, char *cNewWord, char *cOldWord);

Parameters

`infoArray`	An array specifying the type of each character in the font. Each entry in this table must be one of the Character Type Codes. If `infoArray` is set to `NULL`, a default table is used. For non-UNIX Roman systems, it is `WinAnsiEncoding` on Windows and `MacRomanEncoding` on Mac OS. On UNIX (except HP-UX) Roman systems, it is `ISO8859-1` (ISO Latin-1); for HP-UX, it is `HP-ROMAN8`. For descriptions of `WinAnsiEncoding` and `MacRomanEncoding`, see Annex D, "Character Sets and Encodings," in ISO 32000-1:2008, Document Management-Portable Document Format-Part 1: PDF 1.7, page 651. You can find this document on the web store of the International Organization for Standardization (ISO).
`cNewWord`	(Filled by the method) The filtered word.
`cOldWord`	The unfiltered word. This value must be passed to the method.

Returns

true if the string required filtering, false if the filtered string is the same as the unfiltered string.

PDWordFilterWord

Header: PDProcs.h:5176

Description

Removes leading and trailing spaces and leading and trailing punctuation (including soft hyphens) from the specified word. It does not remove wildcard characters ('*' and '?') or any punctuation surrounded by alphanumeric characters within the word. It also converts ligatures to their constituent characters. The determination of which characters to remove is made by examining the flags in the outEncInfo array passed to PDDocCreateWordFinder(). As a result, this method is most useful when called with words obtained by calling PDWordFinderGetNthWord(), words received in the callback for PDWordFinderEnumWords(), and words in the pXYSortTable returned by PDWordFinderAcquireWordList(). See the description of PDWordFilterString() for further information, and for a description of how the two methods differ.

The Acrobat Catalog program uses this method to filter words before indexing them.

This method works with non-Roman systems.

Note: In Acrobat 6.0 and later, the method PDWordFinderEnumWords() is preferred to this method, which remains for backward compatibility.

Related Methods

PDWordFilterString

Syntax

ASBool PDWordFilterWord(PDWord word, char *buffer, ASInt16 bufferLen, ASInt16 *newLen);

Parameters

`word`	The PDWord to filter.
`buffer`	(Filled by the method) The filtered string.
`bufferLen`	The maximum number of characters that `buffer` can hold.
`newLen`	(Filled by the method) The number of characters actually written into `buffer`.

Returns

true if the word required filtering, false if the filtered string is the same as the unfiltered string.

PDWordFinderGetNthWord

Header: PDProcs.h:2225

Description

Gets the nth word in the word list obtained using PDWordFinderAcquireWordList().

Syntax

PDWord PDWordFinderGetNthWord(PDWordFinder wObj, ASInt32 nTh);

Parameters

`wObj`	IN/OUT The word finder whose nth word is obtained.
`nTh`	IN/OUT The index of the word to obtain. The first word on a page has an index of zero. Words are counted in PDF order. See the description of the `wInfoP` parameter in PDWordFinderAcquireWordList().

Returns

The nth word. It returns NULL when the end of the list is reached.

PDWordGetASText

Header: PDProcs.h:8596

Description

Copies the text from a word into an ASText object. It automatically performs the necessary encoding conversions from the specified word (either in Unicode or Host Encoding) to the ASText object.

Syntax

void PDWordGetASText(PDWord word, ASUns32 filter, ASText str);

Parameters

`word`	The word whose text becomes the new ASText.
`filter`	Character types to be dropped from the output string. For example, the following returns text without soft hyphens and accent marks: `PDWordGetASText(word, W_SOFT_HYPHEN + W_ACCENT, mystr);`
`str`	An existing ASText object whose content will be replaced by the new text.

PDWordGetAttr

Header: PDProcs.h:4928

Description

Gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.

Note: PDWordGetAttr() may return an attribute value greater than the maximum of all of the public attributes, since there can be private attributes added on. It is recommended that you AND the result with the attribute you are interested in.

Syntax

ASUns16 PDWordGetAttr(PDWord word);

Parameters

`word`	IN/OUT The word whose character types are obtained.

Returns

A bit field containing information on the types of characters in word. The value is a logical OR of the Word Attributes.
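
Because private bits may be present in the returned value, attribute checks are best written as mask tests rather than equality comparisons. The following C sketch shows one way to do that. The two bit values are copied from the Syntax lines earlier in this section; `attr_has_all()` and `attr_digits_without_letters()` are hypothetical helper names, and `attrs` is assumed to hold a value returned by PDWordGetAttr().

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the Syntax lines in this section. The guard lets the
   real PDExpT.h definitions take precedence if that header is included. */
#ifndef WXE_HAS_DIGIT
#define WXE_HAS_LETTER 0x2
#define WXE_HAS_DIGIT  0x8
#endif

/* Nonzero when every bit in `mask` is set in `attrs`.
   All other bits, public or private, are ignored. */
static int attr_has_all(uint16_t attrs, uint16_t mask)
{
    assert(mask != 0);   /* an empty mask would match every word */
    return (attrs & mask) == mask;
}

/* Nonzero when the word has at least one digit and no A-Z or a-z letter. */
static int attr_digits_without_letters(uint16_t attrs)
{
    return attr_has_all(attrs, WXE_HAS_DIGIT) && !(attrs & WXE_HAS_LETTER);
}
```

A comparison such as `attrs == WXE_HAS_DIGIT` would fail as soon as any other bit is set, which is why the mask form is used here.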

PDWordGetAttrEx

Header: PDProcs.h:8654

Description

This is a version 6.0 extension of PDWordGetAttr() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher. It can get an additional 16-bit flag group defined in Acrobat 6.

It gets a bit field containing information on the types of characters in a word. Use PDWordGetCharacterTypes() if you wish to check each character's type individually.

Note: PDWordGetAttr() may return an attribute value greater than the maximum of all of the public attributes, since there can be private attributes added on. It is recommended that you AND the result with the attribute you are interested in.

Syntax

ASUns16 PDWordGetAttrEx(PDWord word, ASUns32 groupID);

Parameters

`word`	The word whose character types are obtained.
`groupID`	The group number of the Word Attributes flags: `0`, the default, is the first 16-bit group, and is the same as PDWordGetAttr(). `1` gets the second group defined in Acrobat 6.

Returns

A bit field containing information on the types of characters in word. The value is a logical OR of the Word Attributes.

PDWordGetByteIdxFromHiliteChar

Header: PDProcs.h:8579

Description

Returns the byte offset within the specified word of the highlightable character at the specified character offset. The first character of a word is at byte offset 0. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.

The returned byte offset can be passed to PDWordGetCharOffsetEx() and PDWordGetCharQuad() to get additional information. Use PDWordGetNumHiliteChar() to get the number of highlightable characters in a word.

Syntax

ASUns32 PDWordGetByteIdxFromHiliteChar(PDWord word, ASUns32 charIdx);

Parameters

`word`	The word containing the character.
`charIdx`	The character index within the word.

Returns

The byte offset of the specified character within the word, or 0 if the character index is out of range.

PDWordGetCharDelta

Header: PDProcs.h:4989

Description

Gets the difference between the word length (the number of printed characters in the word) and the PDF word length (the number of character codes in the word). For instance, if the PDF word is "fi" (ligature) + "s" + "h", the mapped word will be "fish". The ligature occupies only one character code, so the PDF word has 3 character codes while the mapped word has 4 characters. In this case the character delta will be 3 - 4 = -1.

Related Methods

PDWordGetCharOffset
PDWordGetLength

Syntax

ASInt8 PDWordGetCharDelta(PDWord word);

Parameters

`word`	IN/OUT The word whose character delta is obtained.

Returns

The character delta for word. Cast the return value to an ASInt8 before using. If the PDWord's character set has no ligatures, such as on a non-Roman viewer supporting Japanese, it returns 0.
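
The arithmetic in the "fish" example can be reproduced with a small C sketch. It does not call the SDK: `char_delta_from_expansions()` is a hypothetical helper, and the per-code expansion counts are supplied by the caller as an assumption about how each character code maps to printed characters.

```c
#include <assert.h>
#include <stddef.h>

/* Computes (number of character codes) - (number of mapped characters),
   the sign convention used in the "fish" example above.
   expansions[i] is the number of mapped characters produced by the i-th
   character code: 1 for an ordinary glyph, 2 for an "fi" ligature. */
static int char_delta_from_expansions(const unsigned char *expansions,
                                      size_t n_codes)
{
    size_t mapped = 0;

    assert(expansions != NULL || n_codes == 0);
    for (size_t i = 0; i < n_codes; i++)
        mapped += expansions[i];

    return (int)n_codes - (int)mapped;
}
```

For the codes "fi" + "s" + "h" the expansion counts are {2, 1, 1}, giving 3 - 4 = -1; a word without ligatures gives 0.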

PDWordGetCharEncFlags

Header: PDProcs.h:8621

Description

Gets the WordFinder Character Encoding Flags for each character in a word, which specify how reliably the word finder identified the character encoding.

This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.

Related Methods

PDWordGetAttrEx
PDWordGetLength

Syntax

void PDWordGetCharEncFlags(PDWord word, ASUns32 *fList, ASUns32 size);

Parameters

`word`	The word whose character encoding flags are obtained.
`fList`	(Filled by the method) An array of character encoding flags. This array contains one element for each byte of text in the word. The byte length of the text can be determined with PDWordGetLength(). Each element is the logical `OR` of one or more of the character encoding flags.
`size`	The maximum number of elements in the array `fList`.

PDWordGetCharOffset

Header: PDProcs.h:4968

Description

Returns a word's character offset from the beginning of its page. This information, together with the character delta obtained from PDWordGetCharDelta(), can be used to highlight a range of words on a page, using PDTextSelectCreatePageHilite().

Syntax

ASUns16 PDWordGetCharOffset(PDWord word);

Parameters

`word`	IN/OUT The word whose character offset is obtained.

Returns

The word's character offset. On multi-byte systems, it points to the first byte.

PDWordGetCharOffsetEx

Header: PDProcs.h:8507

Description

This is a version 6.0 extension of PDWordGetCharOffset() that can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.

It returns the character offset for a character identified by its index number, and the number of bytes (length) used for that character. The length is usually 1 for single-byte characters and 2 for double-byte characters. If multiple bytes are used to construct one character, only the first byte has valid character offset information and the other bytes have zero offset length with the same character offset of the first byte. If the returned offset length is zero, it means the specified byte in the word is a part (other than the first byte) of a multi-byte character.

The character offset is the character position calculated in bytes from the beginning of a page. Because of the encoding conversions and character replacements applied by the word finder, some characters may have different byte lengths from the original PDF content. The character offset itself can locate a character in the PDF content. However, without the offset length (that is, the number of bytes in the PDF content), clients cannot tell whether two characters are next to each other in the PDF content. For example, suppose you want to create a text selection object for two characters at character offsets 1 and 3. You can create an object with two disconnected ranges of [offset 1, length 1] and [offset 3, length 1]. However, if you know that the offset length of both characters is 2, you can create a simpler object with a single range of [offset 1, length 4].

Syntax

ASUns32 PDWordGetCharOffsetEx(PDWord word, ASUns32 byteIdx, ASUns32 *bytesConsumed, ASUns32 *offsetLen);

Parameters

`word`	The word whose character offset is obtained.
`byteIdx`	The byte index within the word of the character whose offset is obtained. Valid values are `0` to `PDWordGetLength(word)-1`.
`bytesConsumed`	(Filled by the method) Returns the number of bytes in the word that are occupied by the specified character. It can be `NULL` if it is not needed. Use `(byteIdx + *bytesConsumed)` to get the byte index of the next character in the word.
`offsetLen`	(Filled by the method) Returns the number of bytes occupied by the specified character in the original PDF content. This is `0` if the specified byte is not the starting byte of a character in the PDF content. It can be `NULL` if it is not needed.

Returns

The word's character offset and the number of bytes occupied by the character.
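
The range simplification described for this method (two characters at offsets 1 and 3 collapsing into one range when each has an offset length of 2) can be sketched as a merge over (offset, length) pairs. This is an illustration under stated assumptions, not SDK code: `hl_range` and `merge_adjacent_ranges()` are hypothetical names, and the input is assumed to be sorted by offset, with each length taken from the `offsetLen` value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range record: a byte offset from the start of the page
   and a byte length in the PDF content. */
struct hl_range {
    uint32_t offset;
    uint32_t length;
};

/* Merges ranges that touch each other, in place.
   The input must be sorted by offset. Returns the new range count. */
static size_t merge_adjacent_ranges(struct hl_range *r, size_t n)
{
    size_t out = 0;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && r[out - 1].offset + r[out - 1].length == r[i].offset)
            r[out - 1].length += r[i].length;   /* contiguous: extend */
        else
            r[out++] = r[i];                    /* gap: start a new range */
    }
    return out;
}
```

With lengths of 2, the ranges [1, 2] and [3, 2] merge into [1, 4]; with lengths of 1, they remain two separate ranges.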

PDWordGetCharQuad

Header: PDProcs.h:8533

Description

Gets the quadrilateral bounding the character at a given index position in the word. If the specified character is constructed with multiple bytes, only the first byte returns a valid quad; for any other byte of the character, this method returns false.

This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.

Syntax

ASBool PDWordGetCharQuad(PDWord word, ASUns32 byteIdx, ASFixedQuad *quad);

Parameters

`word`	The word whose character quad is obtained.
`byteIdx`	The byte index within the word of the character whose quad is obtained. Valid values are `0` to `PDWordGetLength(word)-1`.
`quad`	(Filled by the method) A pointer to an existing quad structure in which to return the character's quad specified in user-space coordinates.

Returns

true if the provided byte index is the beginning byte of a character and a valid quad is returned, false otherwise.

PDWordGetCharacterTypes

Header: PDProcs.h:4950

Description

Gets the character type for each character in a word.

Related Methods

PDWordGetAttr
PDWordGetLength

Syntax

void PDWordGetCharacterTypes(PDWord word, ASUns16 *cArr, ASInt16 size);

Parameters

`word`	The word whose character types are obtained.
`cArr`	(Filled by the method) An array of character types. This array contains one element for each character in the word. Use PDWordGetLength() to determine the number of elements that must be in the array. Each element is the logical `OR` of one or more of the Character Type Codes. For non-Roman character set viewers, meaningful values are returned only for Roman characters. For non-Roman characters, it returns `0`, which is the same as `W_CNTL`. If the character is 2 bytes, both bytes indicate the same character type.
`size`	The number of elements in `cArr`.
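
Since each element of `cArr` is a logical OR of type codes, individual types are tested with a mask. The C sketch below counts the elements that match a mask. It works on a plain array and does not call the SDK: the three code values are copied from the Syntax lines in this section, and `count_chars_of_type()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the Syntax lines in this section. The guard lets the
   real PDExpT.h definitions take precedence if that header is included. */
#ifndef W_DIGIT
#define W_LETTER    0x2
#define W_UPPERCASE 0x4
#define W_DIGIT     0x8
#endif

/* Counts the elements of cArr that have at least one bit of `mask` set.
   cArr is assumed to have been filled by PDWordGetCharacterTypes(). */
static size_t count_chars_of_type(const uint16_t *cArr, size_t n,
                                  uint16_t mask)
{
    size_t hits = 0;

    assert(cArr != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cArr[i] & mask)
            hits++;

    return hits;
}
```

Passing `W_LETTER | W_UPPERCASE` as the mask counts letters of either case in one pass.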

PDWordGetLength

Header: PDProcs.h:4878

Description

Gets the number of bytes in a word. This method also works on non-Roman systems.

Syntax

ASUns8 PDWordGetLength(PDWord word);

Parameters

`word`	IN/OUT The word object whose byte count is obtained.

Returns

The number of bytes in word.

PDWordGetNthCharStyle

Header: PDProcs.h:5025

Description

Returns a PDStyle object for the nth style in a word.

Related Methods

PDWordGetStyleTransition

Syntax

PDStyle PDWordGetNthCharStyle(PDWordFinder wObj, PDWord word, ASInt32 dex);

Parameters

`wObj`	IN/OUT A word finder object.
`word`	IN/OUT The word whose nth style is obtained.
`dex`	IN/OUT The index of the style to obtain. The first style in a word has an index of zero.

Returns

The nth style in the word. It returns NULL if dex is greater than the number of styles in the word.

Exceptions

genErrBadParm is raised if dex < 0.

PDWordGetNthQuad

Header: PDProcs.h:5075

Description

Gets the specified word's nth quad, specified in user space coordinates. See PDWordGetNumQuads() for a description of a quad.

The quad's height is the height of the font's bounding box, not the height of the tallest character used in the word. The font's bounding box is determined by the glyphs in the font that extend farthest above and below the baseline; it often extends somewhat above the top of 'A' and below the bottom of 'y'.

The quad's width is determined from the characters actually present in the word.

For example, the quads for the words "AWAY" and "away" have the same height, but generally do not have the same width unless the font is a mono-spaced font (a font in which all characters have the same width).

Despite the names of the fields in an ASFixedQuad (tl for top left, bl for bottom left, and so forth), the corners of the quad do not necessarily have these positions.

Related Methods

PDWordGetNumQuads

Syntax

ASBool PDWordGetNthQuad(PDWord word, ASInt16 nTh, ASFixedQuad *quad);

Parameters

`word`	The word whose nth quad is obtained.
`nTh`	The quad to obtain. A word's first quad has an index of zero.
`quad`	(Filled by the method) A pointer to the word's nth quad, specified in user-space coordinates.

Returns

true if the word has an nth quad, false otherwise.

PDWordGetNumHiliteChar

Header: PDProcs.h:8556

Description

Gets the number of highlightable characters in a word. A highlightable character is the minimum text unit that Acrobat can select and highlight. This method can be used only with a word finder created with algorithm version WF_VERSION_3 or higher.

Because of the encoding conversion, the characters in a word finder word list do not have a 1-to-1 correspondence to the characters displayed by Acrobat. For example, if the word is "fish" and the text operation in PDF content is "fi" (ligature) + 's' + 'h', this method returns the number of highlightable characters as 3, counting "fi" as one character. For the same word, the PDWordGetLength() method returns the byte-length as 4.

Related Methods

PDWordGetLength

Syntax

ASUns32 PDWordGetNumHiliteChar(PDWord word);

Parameters

`word`	The word whose highlightable character count is obtained.

Returns

The number of highlightable characters in word.

PDWordGetNumQuads

Header: PDProcs.h:5040

Description

Gets the number of quads in a word. A quad is a quadrilateral bounding a contiguous piece of a word. Every word has at least one quad. A word has more than one quad, for example, if it is hyphenated and split across multiple lines or if the word is set on a curve rather than on a straight line.

Related Methods

PDWordGetNthQuad

Syntax

ASInt16 PDWordGetNumQuads(PDWord word);

Parameters

`word`	IN/OUT The word whose quad count is obtained.

Returns

The number of quads in word.

PDWordGetString

Header: PDProcs.h:4907

Description

This method gets a word's text. The returned string includes any word break characters (such as space characters) that follow the word, but not any that precede the word. The characters that are treated as word breaks are defined in the outEncInfo parameter of the PDDocCreateWordFinder() method. Use PDWordFilterString() to subsequently remove the word break characters.

This method produces a string in whatever encoding the PDWord uses, for both Roman and non-Roman systems.

Syntax

void PDWordGetString(PDWord word, char *str, ASInt32 len);

Parameters

`word`	The word whose string is obtained.
`str`	(Filled by the method) The string. The encoding of the string is the encoding used by the `PDWordFinder` that supplied the PDWord. For instance, if PDDocCreateWordFinderUCS() is used to create the word finder, PDWordGetString() returns only Unicode. There is no way to detect Unicode strings returned by PDWordGetString(), since there is no UCS header (`FEFF`) added to each string returned.
`len`	The length of `str` in bytes. Up to `len` characters of word will be copied into `str`. If `str` is long enough, it will be `NULL`-terminated.

Exceptions

genErrBadParm is raised if either word or str is NULL.

PDWordGetStyleTransition

Header: PDProcs.h:5010

Description

Gets the locations of style transitions in a word. Every word has at least one style transition, at character position zero in the word.

Related Methods

PDWordGetNthCharStyle

Syntax

ASInt16 PDWordGetStyleTransition(PDWord word, ASInt16 *transTbl, ASInt16 size);

Parameters

`word`	IN/OUT The word whose style transition list is obtained.
`transTbl`	IN/OUT (Filled by the method) An array of style transitions. Each element is the character offset in word where the style changes. The offset specifies the first character in the word that has the new style. The first character in a word has an offset of zero.
`size`	IN/OUT The number of entries that `transTbl` can hold. The word is searched only until this number of style transitions have been found.

Returns

The number of style transition offsets copied to transTbl.
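
Once the transition table is filled, a character position can be mapped to the style that covers it. The C sketch below shows one way to do this. It rests on two assumptions that the text above does not state outright: that the offsets in `transTbl` are in ascending order, and that the index of a transition is also the style index accepted by PDWordGetNthCharStyle(). `style_index_for_char()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the last transition at or before character `pos`.
   transTbl holds `count` ascending character offsets, starting with 0. */
static int style_index_for_char(const int16_t *transTbl, int16_t count,
                                int16_t pos)
{
    int idx = 0;

    /* Every word has a transition at character position zero. */
    assert(transTbl != NULL && count > 0 && transTbl[0] == 0);

    for (int i = 1; i < count && transTbl[i] <= pos; i++)
        idx = i;

    return idx;
}
```

For a table of {0, 3, 5}, positions 0 to 2 map to index 0, positions 3 and 4 to index 1, and position 5 onward to index 2.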

PDWordIsCurrentlyVisible

Header: PDProcs.h:10709

Description

Tests whether a word is visible in a given optional-content context on a given page.

Syntax

ASBool PDWordIsCurrentlyVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx);

Parameters

`word`	The word to test.
`pageNum`	The page number for which the word is tested.
`ctx`	The context in which the word is tested, as returned by `PDDocGetOCContext(pdDoc)`.

Returns

true if the word is visible in the given context, false if it is hidden.

PDWordIsRotated

Header: PDProcs.h:5084

Description

Tests whether a word is rotated.

Related Methods

PDWordGetNthQuad

Syntax

ASBool PDWordIsRotated(PDWord word);

Parameters

`word`	The word to test.

Returns

true if the word is rotated, false otherwise.

PDWordMakeVisible

Header: PDProcs.h:10726

Description

Makes a word visible in a given optional-content context on a given page.

Syntax

ASBool PDWordMakeVisible(PDWord word, ASInt32 pageNum, PDOCContext ctx);

Parameters

`word`	The word to make visible.
`pageNum`	The page number for which the word is to be made visible.
`ctx`	The context in which the word is to be made visible, as returned by `PDDocGetOCContext(pdDoc)`.

Returns

true if the word can be made visible in the given context, false otherwise.

PDWordSplitString

Header: PDProcs.h:2260

Description

Splits the specified string into words by substituting spaces for word separator characters. The list of characters considered to be word separators can be specified, or a default list can be used.

The characters ',' and '.' are context-sensitive word separators. If surrounded by digits (for example, 654,096.345), they are not considered word separators.

For non-Roman character set viewers, this method currently supports only SHIFT-JIS encoding on a Japanese system.

Related Methods

PDWordGetString

Syntax

ASUns16 PDWordSplitString(ASUns16 *infoArray, char *cNewWord, char *cOldWord, ASUns16 nMaxLen);

Parameters

`infoArray`	A character information table. It specifies each character's type; word separator characters must be marked as `W_WORD_BREAK` (see Character Type Codes). This table can be identical to the table passed to PDDocCreateWordFinder(). If `infoArray` is `NULL`, a default table is used (see Glyph Names of Word Separators).
`cNewWord`	(Filled by the method) The word that has been split. Word separator characters have been replaced with spaces.
`cOldWord`	The word to split.
`nMaxLen`	The number of characters that `cNewWord` can hold. Word splitting stops when `cOldWord` is completely processed or `nMaxLen` characters have been placed in `cNewWord`, whichever occurs first.

Returns

The number of splits that occurred.

Exceptions

genErrGeneral is raised if infoArray is NULL but the host encoding cannot be obtained.
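
The context-sensitive treatment of ',' and '.' described for PDWordSplitString() can be illustrated with a simplified stand-in. The sketch below is not the SDK's implementation: it handles single-byte ASCII only, takes the separator set as a plain string instead of an `infoArray` table, and assumes that the return value counts substitutions. `split_on_separators()` and its parameters are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies `in` to `out`, replacing each byte found in `seps` with a space.
   ',' and '.' are left alone when they sit between two digits.
   At most outCap - 1 bytes are written, followed by a terminating NUL.
   Returns the number of substitutions made. */
static unsigned split_on_separators(const char *in, char *out,
                                    size_t outCap, const char *seps)
{
    unsigned splits = 0;
    size_t i = 0;

    assert(in != NULL && out != NULL && seps != NULL && outCap > 0);

    for (; in[i] != '\0' && i + 1 < outCap; i++) {
        char c = in[i];
        int is_sep = strchr(seps, c) != NULL;

        /* in[i + 1] is readable here because in[i] is not the terminator. */
        if (is_sep && (c == ',' || c == '.') && i > 0 &&
            isdigit((unsigned char)in[i - 1]) &&
            isdigit((unsigned char)in[i + 1]))
            is_sep = 0;   /* numeric delimiter, not a word separator */

        if (is_sep) {
            c = ' ';
            splits++;
        }
        out[i] = c;
    }
    out[i] = '\0';
    return splits;
}
```

With the separator set ",./-", the input "654,096.345" is copied unchanged, while "red,green/blue" becomes "red green blue" with two substitutions.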