An opaque object holding encoded text.
An ASText object represents a Unicode string. ASText objects can also be used to convert between Unicode and various platform-specific text encodings, as well as between Unicode formats such as UTF-16 and UTF-8. Since it is common for a Unicode string to be repeatedly converted to or from the same platform-specific text encoding, ASText objects are optimized for this operation. For example, they can cache both the Unicode and platform-specific text strings.
There are several ways of creating an ASText object depending on the type and format of the original text data. The following terminology is used throughout this API to describe the various text formats:
| Text Format | Description |
| --- | --- |
| Encoded | A multi-byte string terminated with a single 0 character and coupled with a specific host encoding indicator. On Mac OS, the text encoding is specified using a script code. On Windows, the text encoding is specified using a CHARSET code. On UNIX, the only valid host encoding indicator is 0, which specifies text in the platform's default Roman encoding. On all platforms, Asian text is typically specified using multi-byte strings. |
| ScriptText | A multi-byte string terminated with a single 0 character and coupled with an ASScript code. This is merely another way of specifying the Encoded case; the ASScript code is converted to a host encoding using ASScriptToHostEncoding(). |
| Unicode | Text specified using UTF-16 or UTF-8. In the UTF-16 case, the bytes can be in either big-endian order or the byte order that matches the platform, and are always terminated with a single ASUns16 0 value. In the UTF-8 case, the text is always terminated with a trailing 0 byte. In this format the text is plain Unicode, without the 0xFE 0xFF prefix or the language and country codes that can be encoded inside a PDF document. |
| PDText | A string of text pulled out of a PDF document. This is either a big-endian Unicode string prefixed with the bytes 0xFE 0xFF, or a string in PDFDocEncoding. In the Unicode case, the string may have embedded language and country identifiers. ASText objects strip language and country information out of the PDText string and track them separately. See below for more details. |
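The PDText row above distinguishes the two forms by the leading bytes 0xFE 0xFF. A minimal sketch of that check follows; `classify_pdtext` and `PDTextKind` are hypothetical names used only for illustration and are not part of the ASText API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only; not an ASText call. */
typedef enum { PDTEXT_PDFDOC = 0, PDTEXT_UTF16BE = 1 } PDTextKind;

/* Report which of the two PDText forms a byte string uses, based solely
 * on whether it starts with the bytes 0xFE 0xFF. */
static PDTextKind classify_pdtext(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    if (len >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        return PDTEXT_UTF16BE;   /* big-endian Unicode with prefix */
    return PDTEXT_PDFDOC;        /* otherwise treated as PDFDocEncoding */
}
```

In practice the ASText creation calls perform this detection for you; the sketch only makes the rule concrete.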
ASText objects can also be used to accomplish encoding and format conversions; you can request a string in any of the formats specified above. In all cases the ASText code attempts to preserve all characters. For example, if you attempt to concatenate two strings in separate host encodings, the implementation may convert both to Unicode and perform the concatenation in Unicode space.
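As an illustration of the kind of format conversion involved, the following sketch turns big-endian UTF-16 bytes into 0-terminated 16-bit units in the platform's own byte order. It is written in standard C; `utf16be_to_host` is a hypothetical helper, not an ASText call, and it says nothing about how the ASText implementation works internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy big-endian UTF-16 bytes into host-order
 * 16-bit units. Stops at a 0 unit, at the end of the input, or when the
 * output is full. Always 0-terminates and returns the number of units
 * written, not counting the terminator. */
static size_t utf16be_to_host(const unsigned char *bytes, size_t byte_len,
                              uint16_t *out, size_t out_cap)
{
    assert(bytes != NULL && out != NULL && out_cap > 0);
    size_t n = 0;
    for (size_t i = 0; i + 1 < byte_len && n + 1 < out_cap; i += 2) {
        /* High byte first: this arithmetic is correct on any host. */
        uint16_t unit = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
        if (unit == 0)
            break;
        out[n++] = unit;
    }
    out[n] = 0;
    return n;
}
```

Because the units are assembled arithmetically rather than by reinterpreting memory, the same code produces host-order values on both big-endian and little-endian platforms.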
When creating a new ASText object or putting new data into an existing object, the implementation will always copy the supplied data into the ASText object. The original data is yours to do with as you wish (and release if necessary).
The size of ASText data is always specified in bytes. For example, the len argument to ASTextFromSizedUnicode() specifies the number of bytes in the string, not the number of Unicode characters.
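To make the byte-versus-character distinction concrete, here is a small sketch that computes the byte length of a 0-terminated UTF-16 buffer. `utf16_byte_length` is a hypothetical helper, and excluding the terminator from the count is an assumption made for this example; check the description of each call for its exact length convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes in a 0-terminated UTF-16 buffer,
 * NOT counting the 2-byte terminator (an assumption for this sketch). */
static size_t utf16_byte_length(const uint16_t *s)
{
    assert(s != NULL);
    size_t units = 0;
    while (s[units] != 0)
        units++;
    return units * sizeof(uint16_t);
}
```

For the two-character string "Hi" this yields 4 bytes. A single character outside the Basic Multilingual Plane, stored as a surrogate pair, also yields 4 bytes, which is why a character count cannot stand in for the byte count.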
Host-encoded and Unicode strings are always terminated with a NULL character (which consists of one NULL byte for host-encoded strings and two NULL bytes for Unicode strings). You cannot create a string with an embedded NULL character, even using the calls that take an explicit length parameter.
The Getxxx calls return pointers to data held by the ASText object. You cannot free or manipulate this data directly. The GetxxxCopy calls return data you can manipulate and that you are responsible for freeing.
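The ownership rules above (data is copied in, Getxxx returns a borrowed pointer, GetxxxCopy returns a buffer you own) can be modeled with a toy type. Everything in this sketch, including `ToyText` and the `toy_text_*` functions, is hypothetical and uses plain `malloc`/`free`; it is a model of the contract, not the ASText implementation or its allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an opaque text object (hypothetical). */
typedef struct ToyText { char *data; } ToyText;

/* Creation copies the caller's data; the caller keeps ownership of src. */
static ToyText *toy_text_new(const char *src)
{
    assert(src != NULL);
    ToyText *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    size_t n = strlen(src) + 1;
    t->data = malloc(n);
    if (t->data == NULL) {
        free(t);
        return NULL;
    }
    memcpy(t->data, src, n);
    return t;
}

/* "Get" style: borrowed pointer into the object; do not free or modify. */
static const char *toy_text_get(const ToyText *t)
{
    return t->data;
}

/* "GetCopy" style: fresh buffer the caller may modify and must free. */
static char *toy_text_get_copy(const ToyText *t)
{
    size_t n = strlen(t->data) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, t->data, n);
    return copy;
}

static void toy_text_destroy(ToyText *t)
{
    if (t != NULL) {
        free(t->data);
        free(t);
    }
}
```

Changing the original source buffer after creation, or changing a copy obtained from the copy accessor, leaves the object's own data untouched in this model.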
An ASText object can have language and country codes associated with it. A language code is a 2-character ISO 639 language code. A country code is a 2-character ISO 3166 country code. In both cases the 2-character codes are packed into an ASUns16 value: the first character is packed in bits 8-15, and the second character is packed in bits 0-7. These language and country codes can be encoded into a UTF-16 variant of PDText encoding using an escape sequence. See the description of "Common Data Structures" in ISO 32000-1:2008, Document Management-Portable Document Format-Part 1: PDF 1.7, section 7.9, page 84.
You can find this document on the web store of the International Organization for Standardization (ISO).
The ASText calls will automatically parse the language and country codes embedded inside a UTF-16 PDText object, and will also author appropriate escape sequences to embed the language and country codes (if present) when generating a UTF-16 PDText object.