Value options for ASScript.
kASRomanScript | Roman.
|
kASJapaneseScript | Japanese.
|
kASTraditionalChineseScript | Traditional Chinese.
|
kASKoreanScript | Korean.
|
kASArabicScript | Arabic.
|
kASHebrewScript | Hebrew.
|
kASGreekScript | Greek.
|
kASCyrillicScript | Cyrillic.
|
kASRightLeftScript | RightLeft.
|
kASDevanagariScript | Devanagari.
|
kASGurmukhiScript | Gurmukhi.
|
kASGujaratiScript | Gujarati.
|
kASOriyaScript | Oriya.
|
kASBengaliScript | Bengali.
|
kASTamilScript | Tamil.
|
kASTeluguScript | Telugu.
|
kASKannadaScript | Kannada.
|
kASMalayalamScript | Malayalam.
|
kASSinhaleseScript | Sinhalese.
|
kASBurmeseScript | Burmese.
|
kASKhmerScript | Khmer.
|
kASThaiScript | Thai.
|
kASLaotianScript | Laotian.
|
kASGeorgianScript | Georgian.
|
kASArmenianScript | Armenian.
|
kASSimplifiedChineseScript | Simplified Chinese.
|
kASTibetanScript | Tibetan.
|
kASMongolianScript | Mongolian.
|
kASGeezScript | Ge'ez.
|
kASEastEuropeanRomanScript | East European Roman.
|
kASVietnameseScript | Vietnamese.
|
kASExtendedArabicScript | Extended Arabic.
|
kASEUnicodeScript | Unicode.
|
kASDontKnowScript=-1 | Unknown.
|
kASTextFilterIdentity | Does nothing.
|
kASTextFilterLineEndings | Normalizes line endings (equivalent to ASTextNormalizeEndOfLine()).
|
kASTextFilterUpperCase | Makes all text upper case. DEPRECATED: Case is not a reliably localizable concept. Do not use this.
|
kASTextFilterLowerCase | Makes all text lower case. DEPRECATED: Case is not a reliably localizable concept. Do not use this.
|
kASTextFilterXXXDebug | Changes any ASText to "XXX" (for debugging).
|
kASTextFilterUpperCaseDebug | Makes all text except scanf format strings upper case.
|
kASTextFilterLowerCaseDebug | Makes all text except scanf format strings lower case.
|
kASTextFilterRemoveAmpersands | Removes stand-alone ampersands, and turns && into &.
|
kASTextFilterNormalizeFullWidthASCIIVariants | Changes any full-width ASCII variants to their lower-ASCII version. For example, 0xFF21 (full-width 'A') becomes 0x0041 (ASCII 'A').
|
kASTextRemoveLineEndings | Removes line endings and replaces them with spaces.
|
kASTextFilterRsvd1=1000 | Reserved. Do not use.
|
kASTextFilterUnknown=-1 | An invalid filter type.
|
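The full-width normalization described for kASTextFilterNormalizeFullWidthASCIIVariants can be pictured with a short sketch. This is not the SDK implementation. It assumes the filter covers only the Unicode full-width ASCII range U+FF01 to U+FF5E, and the function names `fold_fullwidth_unit` and `fold_fullwidth` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one UTF-16 code unit from the full-width ASCII range
 * (U+FF01..U+FF5E, an assumed range) to its ASCII counterpart
 * (U+0021..U+007E). Other values are returned unchanged. */
uint16_t fold_fullwidth_unit(uint16_t u)
{
    if (u >= 0xFF01 && u <= 0xFF5E) {
        return (uint16_t)(u - 0xFEE0);
    }
    return u;
}

/* Apply the mapping in place to a buffer of `count` UTF-16 code units. */
void fold_fullwidth(uint16_t *text, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        text[i] = fold_fullwidth_unit(text[i]);
    }
}
```

With this mapping, 0xFF21 becomes 0x0041, matching the example in the table, while characters outside the range (such as 0x3042) pass through untouched.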
typedef const struct _t_ASTextRec *ASConstText;
typedef ASUns16 ASCountryCode;
On Windows, the host encoding is a CHARSET id. On UNIX, Acrobat currently only supports English, so the only valid ASHostEncoding is 0 (Roman). See ASScript.
typedef ASInt32 ASHostEncoding;
typedef ASUns16 ASLanguageCode;
For value options see ASScripts.
typedef ASInt32 ASScript;
An opaque object holding encoded text.
An ASText object represents a Unicode string. ASText objects can also be used to convert between Unicode and various platform-specific text encodings, as well as conversions between various Unicode formats such as UTF-16 or UTF-8. Since it is common for a Unicode string to be repeatedly converted to or from the same platform-specific text encoding, ASText objects are optimized for this operation. For example, they can cache both the Unicode and platform-specific text strings.
There are several ways of creating an ASText object depending on the type and format of the original text data. The following terminology is used throughout this API to describe the various text formats:
Text Format | Description
|---|---|
Encoded | A multi-byte string terminated with a single 0 character and coupled with a specific host encoding indicator. On Mac OS, the text encoding is specified using a script code. On Windows, the text encoding is specified using a CHARSET code. On UNIX the only valid host encoding indicator is 0, which specifies text in the platform's default Roman encoding. On all platforms, Asian text is typically specified using multi-byte strings.
ScriptText | A multi-byte string terminated with a single 0 character and coupled with an ASScript code. This is merely another way of specifying the Encoded case; the ASScript code is converted to a host encoding using ASScriptToHostEncoding().
Unicode | Text specified using UTF-16 or UTF-8. In the UTF-16 case, the bytes can be in either big-endian format or the endian-ness that matches the platform, and are always terminated with a single ASUns16 0 value. In the UTF-8 case, the text is always terminated with a trailing 0 byte. Unicode usage in this case is straight Unicode without the 0xFE 0xFF prefix or language and country codes that can be encoded inside a PDF document.
PDText | A string of text pulled out of a PDF document. This will either be a big-endian Unicode string prepended with the bytes 0xFE 0xFF, or a string in PDFDocEncoding. In this case, the Unicode string may have embedded language and country identifiers. ASText objects strip language and country information out of the PDText string and track them separately. See below for more details.
ASText objects can also be used to accomplish encoding and format conversions; you can request a string in any of the formats specified above. In all cases the ASText code attempts to preserve all characters. For example, if you attempt to concatenate two strings in separate host encodings, the implementation may convert both to Unicode and perform the concatenation in Unicode space.
When creating a new ASText object or putting new data into an existing object, the implementation will always copy the supplied data into the ASText object. The original data is yours to do with as you wish (and release if necessary).
The size of ASText data is always specified in bytes. For example, the len argument to ASTextFromSizedUnicode() specifies the number of bytes in the string, not the number of Unicode characters.
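Because sizes are given in bytes, a caller holding a 0-terminated UTF-16 buffer has to convert a code-unit count into a byte count before passing a length. The sketch below shows that arithmetic only. The helper names are hypothetical, and it assumes the length excludes the terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 16-bit code units that precede the terminating 0 value. */
size_t utf16_unit_count(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

/* Length in bytes, excluding the two-byte terminator (assumed convention). */
size_t utf16_byte_length(const uint16_t *s)
{
    return utf16_unit_count(s) * sizeof(uint16_t);
}
```

For a three-unit string such as { 'H', 'i', 0x3042, 0 }, the byte length is 6, which is the kind of value the len argument expects rather than 3.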
Host encoding and Unicode strings are always terminated with a NULL character (which consists of one NULL byte for host encoded strings and two NULL bytes for Unicode strings). You cannot create a string with an embedded NULL character, even using the calls which take an explicit length parameter.
The Getxxx calls return pointers to data held by the ASText object. You cannot free or manipulate this data directly. The GetxxxCopy calls return data you can manipulate and that you are responsible for freeing.
An ASText object can have language and country codes associated with it. A language code is a 2-character ISO 639 language code. A country code is a 2-character ISO 3166 country code. In both cases the 2-character codes are packed into an ASUns16 value: the first character is packed in bits 8-15, and the second character is packed in bits 0-7. These language and country codes can be encoded into a UTF-16 variant of PDText encoding using an escape sequence. See the description of "Common Data Structures" in ISO 32000-1:2008, Document Management-Portable Document Format-Part 1: PDF 1.7, section 7.9, page 84.
You can find this document on the web store of the International Organization for Standardization (ISO).
The ASText calls will automatically parse the language and country codes embedded inside a UTF-16 PDText object, and will also author appropriate escape sequences to embed the language and country codes (if present) when generating a UTF-16 PDText object.
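The bit packing used for language and country codes (first character in bits 8-15, second in bits 0-7) can be sketched as follows. The helper names are hypothetical, and `uint16_t` stands in for ASUns16 so the example compiles without the SDK headers.

```c
#include <stdint.h>

/* Pack a two-character code: first character in bits 8-15,
 * second character in bits 0-7. */
uint16_t pack_two_char_code(char first, char second)
{
    return (uint16_t)(((uint16_t)(unsigned char)first << 8) |
                      (uint16_t)(unsigned char)second);
}

/* Unpack a 16-bit code into a NUL-terminated two-character string. */
void unpack_two_char_code(uint16_t code, char out[3])
{
    out[0] = (char)(code >> 8);
    out[1] = (char)(code & 0xFF);
    out[2] = '\0';
}
```

Under this packing, the language code "en" becomes 0x656E and the value 0x5553 unpacks to the country code "US".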
typedef struct _t_ASTextRec *ASText;
typedef ASEnum16 ASTextFilterType;
Holds a single 16-bit value from a UTF-16 encoded Unicode string. It is typically used to point to the beginning of a UTF-16 string. For example: ASUTF16Val *utf16String =...
This data type is not large enough to hold any arbitrary Unicode character. Use ASUnicodeChar to pass individual Unicode characters.
typedef ASUns16 ASUTF16Val;
typedef ASUns32 ASUTF32Val;
typedef ASUns8 ASUTF8Val;
typedef ASUTF16Val ASUniChar;
typedef ASUns32 ASUnicodeChar;
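The reason a single 16-bit value cannot hold an arbitrary Unicode character is that code points above U+FFFF are stored in UTF-16 as a pair of 16-bit units (a surrogate pair). The sketch below illustrates the standard UTF-16 encoding rule. It is independent of the SDK, `encode_utf16` is a hypothetical name, and `uint32_t`/`uint16_t` stand in for the 32-bit and 16-bit types above.

```c
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if the
 * value is not a valid Unicode scalar value. */
int encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For example, U+0041 fits in one unit, while U+1F600 needs the two units 0xD83D 0xDE00.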
For value options see UTFOptions.
typedef ASEnum16 ASUnicodeFormat;
ASText ASTextEvalProc(ASCab params);
Determines whether the given byte is a lead byte of a multi-byte character, and how many tail bytes follow.
When parsing a string in a host encoding, you must keep in mind that the string could be in a variable length multi-byte encoding. In such an encoding (for example, Shift-JIS) the number of bytes required to represent a character varies on a character-by-character basis. To parse such a string you must start at the beginning and, for each byte, determine whether that byte represents a character or is the first byte of a multi-byte character. If the byte is a lead byte for a multi-byte character, you must also compute how many bytes will follow the lead byte to make up the entire character. Currently the API provides a call (PDHostMBLen()) that performs these computations, but only if the encoding in question is the operating system encoding (as returned by PDGetHostEncoding()). ASHostMBLen() allows you to determine this for any byte in any host encoding.
Note: ASHostMBLen() cannot confirm whether the required number of trailing bytes actually follow the first byte. If you are parsing a multi-byte string, make sure your code will stop at the first NULL (zero) byte even if it appears immediately after the lead byte of a multi-byte character.
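The parsing loop described above, including the rule about stopping at the first zero byte, might look like the following sketch. It does not call the SDK: `sjis_tail_bytes` is a hypothetical stand-in for ASHostMBLen() that assumes the lead-byte ranges commonly cited for Shift-JIS (0x81-0x9F and 0xE0-0xFC), and `count_mb_chars` is likewise an illustrative name.

```c
#include <stddef.h>

/* Hypothetical stand-in for ASHostMBLen() with a Shift-JIS encoding:
 * returns the number of tail bytes expected after `byte`. */
int sjis_tail_bytes(unsigned char byte)
{
    if ((byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC)) {
        return 1;
    }
    return 0;
}

/* Count characters in a 0-terminated multi-byte string. */
size_t count_mb_chars(const unsigned char *s)
{
    size_t chars = 0;
    while (*s != 0) {
        int tail = sjis_tail_bytes(*s);
        ++s;
        /* Skip tail bytes, but never run past the terminator, even if
         * it appears right after a lead byte. */
        while (tail > 0 && *s != 0) {
            ++s;
            --tail;
        }
        ++chars;
    }
    return chars;
}
```

In this sketch a truncated character (a lead byte followed directly by the terminator) is counted as one character; real code may prefer to report it as an error.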
ASInt32 ASHostMBLen(ASHostEncoding encoding, ASUns8 byte);
encoding | The host encoding type.
|
byte | The first byte of a multi-byte character.
|
1 for a two-byte character and 0 for a one-byte character. For Roman encodings, the return value will always be 0.
NULL-terminated.
ASBool ASIsValidUTF8(const ASUns8 *cIn, ASCount cInLen);
cIn | The string.
|
cInLen | The length of the string in bytes, not including the NULL byte at the end.
|
On Windows, the host encoding is a CHARSET id. On Mac OS, the host encoding is a script code.
ASScript ASScriptFromHostEncoding(ASHostEncoding osScript);
osScript | The host encoding type.
|
On Windows, the host encoding is a CHARSET id. On Mac OS, the host encoding is a script code.
ASHostEncoding ASScriptToHostEncoding(ASScript asScript);
asScript | The script value.
|
Compares two ASConstText objects, ignoring language and country information. The comparison is case-sensitive.
Various exceptions may be raised.
ASInt32 ASTextCaseSensitiveCmp(ASConstText str1, ASConstText str2);
str1 | First text object.
|
str2 | Second text object.
|
A negative number if str1 < str2, a positive number if str1 > str2, and 0 if they are equal.
Concatenates the from text to the end of the to text, altering to but not from. It does not change the language or country of to unless it has no language or country, in which case it acquires the language and country of from.
void ASTextCat(ASText to, ASConstText from);
to | IN/OUT The encoded text to which from is appended.
|
from | IN/OUT The encoded text to be appended to to.
|
void ASTextCatMany(ASText to, ...);
to |
Compares two ASText objects. This routine can be used to sort text objects using the default collating rules of the underlying operating system before presenting them to the user. The comparison is case-sensitive. The results are suitable for displaying a sorted list of strings to the user in their chosen language and according to the rules of the platform on which the application is running. The results can vary based on the platform and user locale. If you want to compare strings in a way that is consistent across locales and platforms (but not suitable for displaying sorted strings to a user) see ASTextCaseSensitiveCmp().
Various exceptions may be raised.
ASInt32 ASTextCmp(ASConstText str1, ASConstText str2);
str1 | The first text object.
|
str2 | The second text object.
|
A negative number if str1 < str2, a positive number if str1 > str2, and 0 if they are equal.
Copies the text in from to to, along with the country and language.
void ASTextCopy(ASText to, ASConstText from);
to | IN/OUT The destination text object.
|
from | IN/OUT The source text object.
|
void ASTextDestroy(ASText str);
str | IN/OUT A text object.
|
ASText ASTextDup(ASConstText str);
str | A text object.
|
is raised if str is NULL.
|
"%keyone%%keytwo%", the value is replaced with the concatenation of the values of the keys keyone and keytwo in the ASCab passed in. void ASTextEval(ASText theText, ASCab params);
theText | A text object containing percent-quoted expressions to replace.
|
params | The ASCab containing the key/value pairs to use for text replacement.
|
if theText is NULL.
|
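The percent-quoted substitution that ASTextEval() performs can be pictured with a plain C sketch. It does not use ASText or ASCab: the `KeyValue` array is a hypothetical stand-in for the ASCab, and `expand_percent_keys` is an illustrative name. How the real call treats unknown keys or an unmatched percent sign is not stated here, so the sketch assumes both are copied through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair standing in for an ASCab entry. */
typedef struct {
    const char *key;
    const char *value;
} KeyValue;

/* Find the value for a key given as (pointer, length); NULL if absent. */
static const char *lookup(const KeyValue *pairs, size_t n,
                          const char *key, size_t key_len)
{
    for (size_t i = 0; i < n; ++i) {
        if (strlen(pairs[i].key) == key_len &&
            memcmp(pairs[i].key, key, key_len) == 0) {
            return pairs[i].value;
        }
    }
    return NULL;
}

/* Expand %key% sequences from `in` into `out` (capacity `cap`, including
 * the terminator). Returns 0 on success, -1 if `out` is too small.
 * Assumption: unknown keys and an unmatched '%' are copied unchanged. */
int expand_percent_keys(const char *in, const KeyValue *pairs, size_t n,
                        char *out, size_t cap)
{
    size_t used = 0;
    if (cap == 0) {
        return -1;
    }
    while (*in != '\0') {
        const char *chunk = in; /* text to copy for this step */
        size_t chunk_len = 1;   /* default: one literal byte */
        size_t consumed = 1;    /* input bytes consumed */

        if (*in == '%') {
            const char *end = strchr(in + 1, '%');
            if (end != NULL) {
                const char *value =
                    lookup(pairs, n, in + 1, (size_t)(end - in - 1));
                consumed = (size_t)(end - in) + 1;
                if (value != NULL) {
                    chunk = value;
                    chunk_len = strlen(value);
                } else {
                    chunk_len = consumed; /* unknown key: keep as written */
                }
            }
        }
        if (chunk_len >= cap - used) {
            return -1; /* no room for the chunk plus the terminator */
        }
        memcpy(out + used, chunk, chunk_len);
        used += chunk_len;
        in += consumed;
    }
    out[used] = '\0';
    return 0;
}
```

With the pairs keyone = "Hello, " and keytwo = "world", the input "%keyone%%keytwo%" expands to "Hello, world", which is the concatenation behavior described above.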
void ASTextFilter(ASText text, ASTextFilterType filter);
text | A text object modified by the method.
|
filter | The filter to run on the text object.
|
if text is NULL or if an invalid filter is specified.
|
Creates a new text object from a NULL-terminated multi-byte string in the specified host encoding.
ASText ASTextFromEncoded(const char *str, ASHostEncoding encoding);
str | The input string.
|
encoding | The host encoding.
|
ASText ASTextFromInt32(ASInt32 num);
num | A number of type ASInt32.
|
0xFEFF prepended to the front or a PDFDocEncoding string. In either case the string is expected to have the appropriate NULL termination. If the PDText is in UTF-16, it may have embedded language and country information; this will cause the ASText object to have its language and country codes set to the values found in the string.
ASText ASTextFromPDText(const char *str);
str | A string.
|
Creates a new text object from a NULL-terminated multi-byte string of the specified script. This is a wrapper around ASTextFromEncoded(); the script is converted to a host encoding using ASScriptToHostEncoding().
ASText ASTextFromScriptText(const char *str, ASScript script);
str | A string.
|
script | The specified script.
|
ASText ASTextFromSizedEncoded(const char *str, ASTArraySize len, ASHostEncoding encoding);
str | A string.
|
len | The length in bytes.
|
encoding | The specified host encoding.
|
is raised if len < 0.
|
0xFEFF prepended to the front or a PDFDocEncoding string. If the PDText is in UTF-16, it may have embedded language and country information; this will cause the ASText object to have its language and country codes set to the values found in the string. The length parameter specifies the size, in bytes, of the string. The string must not contain embedded NULL characters.
ASText ASTextFromSizedPDText(const char *str, ASTArraySize length);
str | A string.
|
length | The length in bytes.
|
ASText ASTextFromSizedScriptText(const char *str, ASTArraySize len, ASScript script);
str | A string.
|
len | The length in bytes.
|
script | The specified script.
|
Creates a new text object from the specified Unicode string. This string is not expected to have 0xFE 0xFF prepended, or country/language identifiers.
The string cannot contain an embedded NULL character.
ASText ASTextFromSizedUnicode(const ASUTF16Val *ucs, ASUnicodeFormat format, ASTArraySize len);
ucs | The Unicode string.
|
format | The Unicode format of ucs.
|
len | The length of ucs in bytes.
|
is raised if len < 0.
|
Creates a new text object from a NULL-terminated Unicode string. This string is not expected to have 0xFE 0xFF prepended, or country/language identifiers.
ASText ASTextFromUnicode(const ASUTF16Val *ucs, ASUnicodeFormat format);
ucs | A Unicode string.
|
format | The Unicode format used by ucs.
|
ASText ASTextFromUns32(ASUns32 num);
num | IN/OUT A value of type ASUns32.
|
Returns the best host encoding for representing the text. The best host encoding is the one that is least likely to lose characters during the conversion from Unicode to host. If the string can be represented accurately in multiple encodings (for example, it is low-ASCII text that can be correctly represented in any host encoding), ASTextGetBestEncoding() returns the preferred encoding based on the preferredEncoding parameter.
Various exceptions may be raised.
ASHostEncoding ASTextGetBestEncoding(ASConstText str, ASHostEncoding preferredEncoding);
str | An ASText string.
|
preferredEncoding | The preferred encoding. There is no default.
|
// If you prefer to use the application's language encoding:
ASHostEncoding bestEncoding = ASTextGetBestEncoding(text, AVAppGetLanguageEncoding());
// If you prefer to use the operating system encoding:
ASHostEncoding bestEncoding = ASTextGetBestEncoding(text, (ASHostEncoding)PDGetHostEncoding());
// If you want to favor Roman encodings:
ASHostEncoding hostRoman = ASScriptToHostEncoding(kASRomanScript);
ASHostEncoding bestEncoding = ASTextGetBestEncoding(text, hostRoman);
ASScript ASTextGetBestScript(ASConstText str, ASScript preferredScript);
str | IN/OUT An ASText string.
|
preferredScript | IN/OUT The preferred host script. There is no default.
|
ASCountryCode ASTextGetCountry(ASConstText text);
text | IN/OUT An ASText object.
|
const char *ASTextGetEncoded(ASConstText str, ASHostEncoding encoding);
str | IN/OUT An ASText object.
|
encoding | IN/OUT The specified host encoding.
|
NULL-terminated string corresponding to the text in str.
char *ASTextGetEncodedCopy(ASConstText str, ASHostEncoding encoding);
str | An ASText object.
|
encoding | The specified encoding.
|
str. The client owns the resulting information and is responsible for freeing it using ASfree().
is raised if memory could not be allocated for the copy.
|
ASLanguageCode ASTextGetLanguage(ASConstText text);
text | An ASText object.
|
Returns the text in a form suitable for storage in a PDF file. If the text can be represented using PDFDocEncoding, it is; otherwise it is represented in big-endian UTF-16 format with 0xFE 0xFF prepended to the front and any country/language codes embedded in an escape sequence right after 0xFE 0xFF.
You can determine if the string is Unicode by inspecting the first two bytes. The Unicode case is used if the string has a language and country code set. The resulting string is NULL-terminated as appropriate. That is, one NULL byte is used for PDFDocEncoding, two are used for UTF-16.
Various exceptions may be raised.
char *ASTextGetPDTextCopy(ASConstText str, ASTArraySize *len);
str | A string.
|
len | The length in bytes of the resulting string, not counting the NULL bytes at the end.
|
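Inspecting the first two bytes of a PDText buffer, as described for ASTextGetPDTextCopy(), can be sketched as below. The helper names are hypothetical and the sketch works on a raw byte buffer rather than on SDK objects.

```c
#include <stddef.h>

/* Returns 1 if the buffer starts with the bytes 0xFE 0xFF, which mark
 * big-endian UTF-16 PDText; returns 0 otherwise (PDFDocEncoding). */
int pdtext_is_utf16(const char *str, size_t len)
{
    const unsigned char *p = (const unsigned char *)str;
    return len >= 2 && p[0] == 0xFE && p[1] == 0xFF;
}

/* Size in bytes of the terminator that follows the string data:
 * two NULL bytes for UTF-16, one for PDFDocEncoding. */
size_t pdtext_terminator_size(const char *str, size_t len)
{
    return pdtext_is_utf16(str, len) ? 2 : 1;
}
```

A caller could pass the pointer returned by the copy call together with the reported length, then use the result to decide how to interpret the remaining bytes.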
Converts the Unicode string in the ASText object to the appropriate script, and returns a pointer to the converted text. The memory to which it points is owned by the ASText object and must not be altered or destroyed by the client. The memory may also become invalid after subsequent operations are applied to the ASText object.
Various exceptions may be raised.
const char *ASTextGetScriptText(ASConstText str, ASScript script);
str | IN/OUT A string.
|
script | IN/OUT The writing script.
|
char *ASTextGetScriptTextCopy(ASConstText str, ASScript script);
str | A string.
|
script | A writing script.
|
is raised if memory could not be allocated for the copy.
|
Returns a pointer to a string in kUTF16HostEndian format (see ASUnicodeFormat). The memory to which this string points is owned by the ASText object, and may not be valid after additional operations are performed on the object.
The Unicode text returned will not have 0xFE 0xFF prepended or any language or country codes.
const ASUTF16Val *ASTextGetUnicode(ASConstText str);
str | A string.
|
Returns a pointer to a NULL-terminated string in the specified Unicode format. The memory to which this string points is owned by the client, which can modify it at will and is responsible for destroying it using ASfree.
The Unicode text returned will not have 0xFE 0xFF prepended or any language or country codes.
ASUTF16Val *ASTextGetUnicodeCopy(ASConstText str, ASUnicodeFormat format);
str | A string.
|
format | The Unicode format.
|
is raised if memory could not be allocated for the copy.
|
0-length string.
ASBool ASTextIsEmpty(ASConstText str);
str | A string.
|
void ASTextMakeEmpty(ASText str);
ASText object (converts it into an empty string). It clears the released storage (for security strings).
void ASTextMakeEmptyClear(ASText str);
ASText ASTextNew(void);
\r and \n are replaced with \r\n.
void ASTextNormalizeEndOfLine(ASText text);
text | An object of type ASText.
|
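The replacement rule mentioned for ASTextNormalizeEndOfLine() can be illustrated on a plain C string. This is only a sketch under assumptions: it targets the \r\n form named above, it treats an existing \r\n pair as a single line ending (not stated in this reference), and `normalize_to_crlf` is a hypothetical name.

```c
#include <stddef.h>

/* Rewrite every line ending in `in` as "\r\n".
 * A lone '\r', a lone '\n', and an existing "\r\n" pair each become
 * one "\r\n". Returns the output length, or (size_t)-1 if `cap` is
 * too small to hold the result and its terminator. */
size_t normalize_to_crlf(const char *in, char *out, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; in[i] != '\0'; ++i) {
        if (in[i] == '\r' || in[i] == '\n') {
            if (used + 2 >= cap) {
                return (size_t)-1;
            }
            /* Treat "\r\n" as a single line ending. */
            if (in[i] == '\r' && in[i + 1] == '\n') {
                ++i;
            }
            out[used++] = '\r';
            out[used++] = '\n';
        } else {
            if (used + 1 >= cap) {
                return (size_t)-1;
            }
            out[used++] = in[i];
        }
    }
    if (used >= cap) {
        return (size_t)-1;
    }
    out[used] = '\0';
    return used;
}
```

Given the input "a\rb\nc\r\nd", this sketch produces "a\r\nb\r\nc\r\nd".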
Replaces all occurrences of toReplace in src with the text specified in replacement. This uses an ASText string to indicate the toReplace string; ASTextReplaceASCII() uses a low ASCII Roman string to indicate the text to replace.
Various exceptions may be raised.
void ASTextReplace(ASText src, ASConstText toReplace, ASConstText replacement);
src | Source text.
|
toReplace | Text in source text to replace.
|
replacement | Text used in replacement.
|
Replaces all occurrences of toReplace in src with the text specified in replacement. ASTextReplace() uses an ASText string to indicate the toReplace string; this uses a low-ASCII Roman string to indicate the text to replace.
This call is intended for formatting strings for the user interface. For example, it can be used for replacing a known sequence such as '%1' with other text. Be sure to use only low ASCII characters, which are safe on all platforms. Avoid using backslash and currency symbols.
Various exceptions may be raised.
void ASTextReplaceASCII(ASText src, const char *toReplace, ASConstText replacement);
src | The ASText object containing the text.
|
toReplace | The text to replace.
|
replacement | The replacement text.
|
Replaces all occurrences of characters contained in the list pszBadCharList in the text with the specified replacement character.
Various exceptions may be raised.
void ASTextReplaceBadChars(ASText str, const char *pszBadCharList, char replaceChar);
str | The text in which to replace characters.
|
pszBadCharList | A list of characters to replace, in sorted order with no duplicates.
|
replaceChar | The character with which to replace any character appearing in the list.
|
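The requirement that pszBadCharList be sorted with no duplicates suggests the list may be searched efficiently, for example with a binary search. That is an assumption, not something this reference states. The sketch below shows the idea on a plain C string, with hypothetical function names and an ordering by unsigned byte value.

```c
#include <stddef.h>
#include <string.h>

/* Binary search for `c` in a list sorted by unsigned byte value,
 * with no duplicates. Returns 1 if found, 0 otherwise. */
static int in_sorted_list(const char *list, size_t n, char c)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] == c) {
            return 1;
        }
        if ((unsigned char)list[mid] < (unsigned char)c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return 0;
}

/* Replace, in place, every character of `text` that appears in
 * `bad_list` with `replacement`. */
void replace_bad_chars(char *text, const char *bad_list, char replacement)
{
    size_t n = strlen(bad_list);
    for (; *text != '\0'; ++text) {
        if (in_sorted_list(bad_list, n, *text)) {
            *text = replacement;
        }
    }
}
```

For example, with the sorted list "*/:" and the replacement character '_', the string "a/b:c*d" becomes "a_b_c_d".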
void ASTextSetCountry(ASText text, ASCountryCode country);
text | IN/OUT An ASText object.
|
country | IN/OUT Country code.
|
void ASTextSetEncoded(ASText str, const char *text, ASHostEncoding encoding);
str | IN/OUT An ASText object to hold the string.
|
text | IN/OUT A pointer to the text string.
|
encoding | IN/OUT The type of encoding.
|
is raised if text is NULL.
|
void ASTextSetLanguage(ASText text, ASLanguageCode language);
text | IN/OUT An ASText object.
|
language | IN/OUT The language code.
|
0xFEFF prepended to the front or a PDFDocEncoding string. In either case the string is expected to have the appropriate NULL termination. If the PDText is in UTF-16, it may have embedded language and country information; this will cause the ASText object to have its language and country codes set to the values found in the string.
void ASTextSetPDText(ASText str, const char *text);
str | A string.
|
text | A text string.
|
NULL-terminated multi-byte string of the specified script. This is a wrapper around ASTextFromEncoded(); the script is converted to a host encoding using ASScriptToHostEncoding().
void ASTextSetScriptText(ASText str, const char *text, ASScript script);
str | IN/OUT A string.
|
text | IN/OUT A pointer to the text string.
|
script | IN/OUT The writing script.
|
void ASTextSetSizedEncoded(ASText str, const char *text, ASTArraySize len, ASHostEncoding encoding);
str | IN/OUT A string.
|
text | IN/OUT A pointer to the text string.
|
len | IN/OUT The length of the text string.
|
encoding | IN/OUT The host encoding type.
|
is raised if text is NULL.
|
0xFEFF prepended to the front or a PDFDocEncoding string. In either case the length parameter indicates the number of bytes in the string. The string should not be NULL-terminated and must not contain any NULL characters. If the PDText is in UTF-16, it may have embedded language and country information; this will cause the ASText object to have its language and country codes set to the values found in the string.
void ASTextSetSizedPDText(ASText str, const char *text, ASTArraySize length);
str | A string.
|
text | A pointer to a text string.
|
length | The length of the text string.
|
void ASTextSetSizedScriptText(ASText str, const char *text, ASTArraySize len, ASScript script);
str | IN/OUT A string.
|
text | IN/OUT A pointer to the text string.
|
len | IN/OUT The length of the text string.
|
script | IN/OUT The writing script.
|
is raised if text is NULL.
|
void ASTextSetSizedUnicode(ASText str, const ASUTF16Val *ucsValue, ASUnicodeFormat format, ASTArraySize len);
str | (Filled by the method) A string.
|
ucsValue | A Unicode string.
|
format | The Unicode format.
|
len | The length of the string in bytes.
|
NULL-terminated Unicode string. This string is not expected to have 0xFE 0xFF prepended or embedded country/language identifiers.
void ASTextSetUnicode(ASText str, const ASUTF16Val *ucsValue, ASUnicodeFormat format);
str | (Filled by the method) A string.
|
ucsValue | A Unicode string.
|
format | The Unicode format.
|
void ASUCS_GetPasswordFromUnicode(ASUTF16Val *inPassword, void **outPassword, ASBool useUTF);
inPassword | |
outPassword | |
useUTF | IN A flag for controlling the conversion. Prior to Acrobat 9.0, passwords were converted from host code-page encoding (8-bit mode) to PDFDocEncoding. If useUTF == false, this routine does the same, starting from 16-bit Unicode. With encryption, Acrobat 9.0 and later allows Unicode passwords, normalized and converted to UTF-8 encoding. If useUTF == true, such a Unicode password is what is returned.
|