Special thanks to (in alphabetical order): Lucjan Łyczak, Micha Nelissen, Ernst van der Pols and Karl Waclawek.
LICENSE
The contents of the Extended Document Object Model files are subject to the Mozilla Public License Version 1.1 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at "http://www.mozilla.org/MPL/"
Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.
The Original Code is "UnicodeUtils.pas".
The Initial Developer of the Original Code is Dieter Köhler (Heidelberg, Germany, "http://www.philo.de/"). Portions created by the Initial Developer are Copyright (C) 1999-2003 Dieter Köhler. All Rights Reserved.
Alternatively, the contents of these files may be used under the terms of the GNU General Public License Version 2 or later (the "GPL"), in which case the provisions of the GPL are applicable instead of those above. If you wish to allow use of your version of these files only under the terms of the GPL, and not to allow others to use your version of these files under the terms of the MPL, indicate your decision by deleting the provisions above and replace them with the notice and other provisions required by the GPL. If you do not delete the provisions above, a recipient may use your version of these files under the terms of any one of the MPL or the GPL.
2003
The Unicode Utility Library (UUL) contains several classes and helper functions to support processing and conversion of Unicode character data. Unicode is a character encoding standard that covers all major scripts of the world. For more information on Unicode and its related standards see the resources mentioned in the References section. The conversion functions are based on the mapping tables which can be found on the CD-ROM accompanying [Unicode 3.0].
Also included is a TCSMIB component for easy access to the Management Information Base (MIB) for character set encodings as specified in [CSMIB]. (This specification is occasionally used in the documentation below even if not explicitly quoted.)
The UUL was built and tested using Delphi 7. It was not tested with any other version of Delphi, Kylix or C++ Builder. Nevertheless, it should also run with Delphi 3, 4, 5 and 6, Kylix 1, 2 and 3, and compatible C++ Builder versions. To use the UUL in a Delphi unit just include a reference to UnicodeConv in its uses clause and make sure that the location of the file UnicodeConv.pas is included in the library path list of your Delphi IDE. To use the TCSMIB component at design time add it to an existing or newly created package via the "Component --> Install Component ..." menu item of the Delphi IDE. If not already available, a new "XML" palette page will appear containing the TCSMIB component.
The UUL is under continuous development. The latest version of this Software can be obtained via the OpenXML website at "http://www.philo.de/xml/". The preferred way to contact the author is via the OpenXML mailing list. Instructions on how to join the mailing list can also be found at "http://www.philo.de/xml/".
TdomEncodingType = (etUnknown, etUTF_8, etUTF_16BE, etUTF_16LE, etISO_10646_UCS_2, etUS_ASCII, etIso_8859_1, etIso_8859_2, etIso_8859_3, etIso_8859_4, etIso_8859_5, etIso_8859_6, etIso_8859_7, etIso_8859_8, etIso_8859_9, etIso_8859_10, etIso_8859_13, etIso_8859_14, etIso_8859_15, etKOI8_R, etJIS_X0201, etNextStep, etCp10000_MacRoman, etCp10006_MacGreek, etCp10007_MacCyrillic, etCp10029_MacLatin2, etCp10079_MacIcelandic, etCp10081_MacTurkish, etIBM037, etIBM424, etIBM437, etDOS_437, etIBM500, etDOS_737, etDOS_775, etIBM850, etDOS_850, etIBM852, etDOS_852, etIBM855, etDOS_855, etPC_856, etIBM857, etDOS_857, etIBM860, etDOS_860, etIBM861, etDOS_861, etIBM862, etDOS_862, etIBM863, etDOS_863, etIBM864, etDOS_864, etIBM865, etDOS_865, etIBM866, etDOS_866, etIBM869, etDOS_869, etCp874, etCp875, etCp1006, etIBM1026, etWindows_1250, etWindows_1251, etWindows_1252, etWindows_1253, etWindows_1254, etWindows_1255, etWindows_1256, etWindows_1257, etWindows_1258);
Constants for all supported encoding schemata plus an etUnknown constant.
TdomEncodingTypes = set of TdomEncodingType;
Defines a set of TdomEncodingType constants.
SINGLE_BYTE_ENCODINGS: TdomEncodingTypes = [etUS_ASCII, etIso_8859_1, etIso_8859_2, etIso_8859_3, etIso_8859_4, etIso_8859_5, etIso_8859_6, etIso_8859_7, etIso_8859_8, etIso_8859_9, etIso_8859_10, etIso_8859_13, etIso_8859_14, etIso_8859_15, etKOI8_R, etJIS_X0201, etNextStep, etCp10000_MacRoman, etCp10006_MacGreek, etCp10007_MacCyrillic, etCp10029_MacLatin2, etCp10079_MacIcelandic, etCp10081_MacTurkish, etIBM037, etIBM424, etIBM437, etDOS_437, etIBM500, etDOS_737, etDOS_775, etIBM850, etDOS_850, etIBM852, etDOS_852, etIBM855, etDOS_855, etPC_856, etIBM857, etDOS_857, etIBM860, etDOS_860, etIBM861, etDOS_861, etIBM862, etDOS_862, etIBM863, etDOS_863, etIBM864, etDOS_864, etIBM865, etDOS_865, etIBM866, etDOS_866, etIBM869, etDOS_869, etCp874, etCp875, etCp1006, etIBM1026, etWindows_1250, etWindows_1251, etWindows_1252, etWindows_1253, etWindows_1254, etWindows_1255, etWindows_1256, etWindows_1257, etWindows_1258];
Defines a constant set of TdomEncodingType constants for all supported single byte encodings.
MULTI_BYTE_ENCODINGS: TdomEncodingTypes = [etUTF_8, etUTF_16BE, etUTF_16LE, etISO_10646_UCS_2];
Defines a constant set of TdomEncodingType constants for all supported multi byte encodings.
TCharToUTF16ConvFunc = function(const W: word): WideChar;
Procedural type for conversion functions of single byte characters into UTF-16BE.
function GetACPEncodingName: String;
Returns the name of the current active code page of the Windows operating system. This function is not available in Kylix.
function GetACPEncodingType: TdomEncodingType;
Returns the encoding type of the current active code page of the Windows operating system. This function is not available in Kylix.
function EncodingToStr(const Encoding: TdomEncodingType): String;
Returns the standard name of the specified character encoding.
TConversionStream = class (TStream)
TConversionStream is an input/output stream for other streams. Its purpose is to transform data as they are written to or read from a target stream.
TUTF16BEToUTF8Stream = class (TConversionStream)
TUTF16BEToUTF8Stream is a descendant of TConversionStream that converts a UTF-16BE stream into a UTF-8 encoded stream.
The following functions serve for UTF-16 surrogate processing:
function Utf16HighSurrogate(const value: integer): WideChar;
Returns the high surrogate for a code point in the interval [$10000;$10FFFF].
function Utf16LowSurrogate(const value: integer): WideChar;
Returns the low surrogate for a code point in the interval [$10000;$10FFFF].
function Utf16SurrogateToInt(const highSurrogate, lowSurrogate: WideChar): integer;
Transforms a high surrogate plus a low surrogate into an integer.
highSurrogate
The high surrogate part of the integer.
lowSurrogate
The low surrogate part of the integer.
function IsUtf16HighSurrogate(const S: WideChar): boolean;
Tests whether the specified WideChar is a UTF-16 high surrogate.
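The arithmetic behind the surrogate functions above is defined in [RFC 2781]. The following is a minimal, self-contained sketch of that arithmetic and not the UUL source: all names starting with "Sketch" are hypothetical stand-ins, so that the program compiles without the library. Delphi or Free Pascal in Delphi mode is assumed.

```pascal
program SurrogateSketch;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical stand-in for Utf16HighSurrogate.
// Value is assumed to lie in [$10000;$10FFFF].
function SketchHighSurrogate(const Value: Integer): WideChar;
begin
  // Subtract $10000, keep the upper 10 bits, add the high surrogate base.
  Result := WideChar($D800 + ((Value - $10000) shr 10));
end;

// Hypothetical stand-in for Utf16LowSurrogate.
function SketchLowSurrogate(const Value: Integer): WideChar;
begin
  // Keep the lower 10 bits, add the low surrogate base.
  Result := WideChar($DC00 + ((Value - $10000) and $3FF));
end;

// Hypothetical stand-in for Utf16SurrogateToInt.
function SketchSurrogateToInt(const HighSur, LowSur: WideChar): Integer;
begin
  Result := $10000 + ((Ord(HighSur) - $D800) shl 10) + (Ord(LowSur) - $DC00);
end;

// Hypothetical stand-in for IsUtf16HighSurrogate.
function SketchIsHighSurrogate(const C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

var
  H, L: WideChar;
begin
  H := SketchHighSurrogate($1D11E);  // U+1D11E MUSICAL SYMBOL G CLEF
  L := SketchLowSurrogate($1D11E);
  WriteLn(IntToHex(Ord(H), 4), ' ', IntToHex(Ord(L), 4));  // D834 DD1E
  WriteLn(IntToHex(SketchSurrogateToInt(H, L), 5));        // 1D11E
end.
```

Splitting a code point and recombining the two halves yields the original value, which is a convenient self-check when working with characters outside the Basic Multilingual Plane.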
function GetCharToUTF16ConvFunc(Encoding: TdomEncodingType): TCharToUTF16ConvFunc;
Returns the character conversion function for the specified TdomEncodingType into UTF-16BE.
function GetUTF16ToCharConvFunc(Encoding: TdomEncodingType): TUTF16ToCharConvFunc;
Returns the character conversion function for UTF-16BE into the specified TdomEncodingType.
function UTF8ToUTF16BEStr(const S: string): WideString;
Converts a UTF-8 string into a UTF-16BE WideString. No special conversions (e.g. on line breaks) and no XML-Char checking are done. If 'S' starts with a byte order mark (#$EF #$BB #$BF), the byte order mark is skipped.
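To illustrate the byte order mark handling described above, here is a minimal sketch of a UTF-8 decoder that skips a leading #$EF #$BB #$BF. It is not the UUL implementation: the function names are hypothetical, only 1-, 2- and 3-byte sequences (the Basic Multilingual Plane) are decoded, and no validation is performed. Delphi or Free Pascal in Delphi mode is assumed.

```pascal
program Utf8BomSketch;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: index of the first byte after a leading UTF-8
// byte order mark, or 1 if the string does not start with one.
function SketchFirstIndexAfterBom(const S: AnsiString): Integer;
begin
  if (Length(S) >= 3) and (Ord(S[1]) = $EF) and (Ord(S[2]) = $BB)
    and (Ord(S[3]) = $BF) then
    Result := 4
  else
    Result := 1;
end;

// Hypothetical decoder for 1-, 2- and 3-byte sequences only.
// Decoding stops at the first 4-byte, stray or truncated sequence.
function SketchUtf8ToWide(const S: AnsiString): WideString;
var
  I, N, Len, Code: Integer;
  B: Byte;
begin
  SetLength(Result, Length(S));  // upper bound: one code unit per byte
  N := 0;
  I := SketchFirstIndexAfterBom(S);
  while I <= Length(S) do
  begin
    B := Ord(S[I]);
    if B < $80 then
      Len := 1
    else if (B and $E0) = $C0 then
      Len := 2
    else if (B and $F0) = $E0 then
      Len := 3
    else
      Break;                               // outside the scope of this sketch
    if I + Len - 1 > Length(S) then
      Break;                               // truncated sequence
    case Len of
      1: Code := B;
      2: Code := ((B and $1F) shl 6) or (Ord(S[I + 1]) and $3F);
    else
      Code := ((B and $0F) shl 12) or ((Ord(S[I + 1]) and $3F) shl 6)
        or (Ord(S[I + 2]) and $3F);
    end;
    Inc(N);
    Result[N] := WideChar(Code);
    Inc(I, Len);
  end;
  SetLength(Result, N);                    // trim to the decoded length
end;

const
  // BOM, 'A', then U+00E9 encoded as C3 A9
  Bytes: array[0..5] of Byte = ($EF, $BB, $BF, $41, $C3, $A9);
var
  S: AnsiString;
  W: WideString;
begin
  SetLength(S, SizeOf(Bytes));
  Move(Bytes[0], S[1], SizeOf(Bytes));
  W := SketchUtf8ToWide(S);
  WriteLn(Length(W));                      // 2: the BOM is not part of the result
  WriteLn(Ord(W[1]), ' ', Ord(W[2]));      // 65 233
end.
```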
function UTF16BEToUTF8Str(const WS: WideString): string;
Converts a UTF-16BE WideString into a UTF-8 encoded string. The implementation is optimized for text that consists mainly of ASCII characters (<=#$7F) with only a few non-ASCII characters. The buffer for the Result is initially set to the length of the WideString. With each non-ASCII character the Result buffer is expanded (by the Insert function), which leads to performance problems when processing, for example, mainly Japanese documents. If 'WS' starts with a byte order mark (#$FEFF), the byte order mark is skipped.
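For text with many non-ASCII characters, repeated buffer growth can be avoided by computing the UTF-8 length first and allocating the result once. The following sketch shows one such two-pass approach. It is an illustration under stated assumptions, not the UUL implementation: the function names are hypothetical, the input is assumed to be well-formed UTF-16, and Delphi or Free Pascal in Delphi mode is assumed.

```pascal
program Utf8EncodeSketch;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Reads the code point starting at WS[I] and advances I past it.
// Assumes that a high surrogate is always followed by a low surrogate.
function SketchNextCodePoint(const WS: WideString; var I: Integer): Integer;
begin
  Result := Ord(WS[I]);
  Inc(I);
  if (Result >= $D800) and (Result <= $DBFF) and (I <= Length(WS)) then
  begin
    Result := $10000 + ((Result - $D800) shl 10) + (Ord(WS[I]) - $DC00);
    Inc(I);
  end;
end;

// Hypothetical two-pass encoder: count the bytes, allocate once, then write.
function SketchUtf16ToUtf8(const WS: WideString): AnsiString;
var
  I, N, C, Start: Integer;
begin
  // Skip a leading byte order mark, as UTF16BEToUTF8Str is documented to do.
  if (Length(WS) > 0) and (Ord(WS[1]) = $FEFF) then Start := 2 else Start := 1;

  // Pass 1: compute the exact number of UTF-8 bytes.
  N := 0;
  I := Start;
  while I <= Length(WS) do
  begin
    C := SketchNextCodePoint(WS, I);
    if C < $80 then Inc(N)
    else if C < $800 then Inc(N, 2)
    else if C < $10000 then Inc(N, 3)
    else Inc(N, 4);
  end;
  SetLength(Result, N);

  // Pass 2: write the bytes into the pre-sized buffer.
  N := 0;
  I := Start;
  while I <= Length(WS) do
  begin
    C := SketchNextCodePoint(WS, I);
    if C < $80 then
    begin
      Result[N + 1] := AnsiChar(C);
      Inc(N);
    end
    else if C < $800 then
    begin
      Result[N + 1] := AnsiChar($C0 or (C shr 6));
      Result[N + 2] := AnsiChar($80 or (C and $3F));
      Inc(N, 2);
    end
    else if C < $10000 then
    begin
      Result[N + 1] := AnsiChar($E0 or (C shr 12));
      Result[N + 2] := AnsiChar($80 or ((C shr 6) and $3F));
      Result[N + 3] := AnsiChar($80 or (C and $3F));
      Inc(N, 3);
    end
    else
    begin
      Result[N + 1] := AnsiChar($F0 or (C shr 18));
      Result[N + 2] := AnsiChar($80 or ((C shr 12) and $3F));
      Result[N + 3] := AnsiChar($80 or ((C shr 6) and $3F));
      Result[N + 4] := AnsiChar($80 or (C and $3F));
      Inc(N, 4);
    end;
  end;
end;

var
  WS: WideString;
  U: AnsiString;
  K: Integer;
begin
  // BOM, U+0041, U+00E9, U+20AC, U+1D11E (as the pair D834 DD1E)
  SetLength(WS, 6);
  WS[1] := WideChar($FEFF);
  WS[2] := WideChar($0041);
  WS[3] := WideChar($00E9);
  WS[4] := WideChar($20AC);
  WS[5] := WideChar($D834);
  WS[6] := WideChar($DD1E);
  U := SketchUtf16ToUtf8(WS);
  for K := 1 to Length(U) do
    Write(IntToHex(Ord(U[K]), 2), ' ');
  WriteLn;  // 41 C3 A9 E2 82 AC F0 9D 84 9E
end.
```

The cost of this design is that the input is scanned twice. For mostly-ASCII input the single-pass approach described above may well be faster, so the choice depends on the expected text.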
function SingleByteEncodingToUTF16Char(const W: word; const Encoding: TdomEncodingType): WideChar;
Converts a single byte character of the specified encoding into a UTF-16BE WideChar.
W
The code point of the single byte character to be converted.
Encoding
The encoding of the character to be converted.
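The relationship between an encoding constant, a conversion function of the TCharToUTF16ConvFunc shape, and a per-character conversion can be sketched as follows. This is a self-contained illustration, not the UUL code: the two-member enumeration, the function names and the error handling for values above $FF (an EConvertError here) are assumptions made for the example. Only the ISO-8859-1 and ISO-8859-15 mappings are used, where ISO-8859-15 differs from ISO-8859-1 in eight positions.

```pascal
program SingleByteSketch;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  // Same shape as the documented TCharToUTF16ConvFunc.
  TSketchConvFunc = function(const W: Word): WideChar;
  // Hypothetical two-member stand-in for TdomEncodingType.
  TSketchEncoding = (seIso_8859_1, seIso_8859_15);

// ISO-8859-1 maps each byte to the code point with the same number.
function SketchIso8859_1ToUTF16Char(const W: Word): WideChar;
begin
  if W > $FF then  // assumption: reject values that are not a single byte
    raise EConvertError.CreateFmt('Invalid ISO-8859-1 character: %d', [W]);
  Result := WideChar(W);
end;

// ISO-8859-15 overrides eight positions and is otherwise like ISO-8859-1.
function SketchIso8859_15ToUTF16Char(const W: Word): WideChar;
begin
  case W of
    $A4: Result := WideChar($20AC);  // EURO SIGN
    $A6: Result := WideChar($0160);
    $A8: Result := WideChar($0161);
    $B4: Result := WideChar($017D);
    $B8: Result := WideChar($017E);
    $BC: Result := WideChar($0152);
    $BD: Result := WideChar($0153);
    $BE: Result := WideChar($0178);
  else
    Result := SketchIso8859_1ToUTF16Char(W);
  end;
end;

// Hypothetical counterpart of GetCharToUTF16ConvFunc.
function SketchGetConvFunc(Encoding: TSketchEncoding): TSketchConvFunc;
begin
  case Encoding of
    seIso_8859_1: Result := SketchIso8859_1ToUTF16Char;
  else
    Result := SketchIso8859_15ToUTF16Char;
  end;
end;

var
  Conv: TSketchConvFunc;
begin
  Conv := SketchGetConvFunc(seIso_8859_1);
  WriteLn(IntToHex(Ord(Conv($A4)), 4));   // 00A4
  Conv := SketchGetConvFunc(seIso_8859_15);
  WriteLn(IntToHex(Ord(Conv($A4)), 4));   // 20AC
end.
```

Fetching the conversion function once and calling it in a loop avoids repeating the encoding lookup for every character of a long text.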
The Unicode Converter Library contains more than 70 functions for character conversion from single byte encoding schemata to UTF-16BE. All these functions share the same structure which is as follows:
function ...ToUTF16Char(const W: word): WideChar;
function cp866_DOSCyrillicRussianToUTF16Char(const W: word): WideChar;
The Unicode Converter Library contains more than 70 functions for string conversion from single byte encoding schemata to UTF-16BE. All these functions share the same structure which is as follows:
function ...ToUTF16Str(const S: string): WideString;
function cp866_DOSCyrillicRussianToUTF16Str(const S: string): WideString;
The Unicode Converter Library contains more than 70 functions for character conversion from UTF-16 to single byte encoding schemata. All these functions share the same structure which is as follows:
function UTF16To...Char(const I: longint): Char;
function UTF16ToCp866_DOSCyrillicRussianChar(const I: longint): Char;
The Unicode Converter Library contains more than 70 functions for string conversion from UTF-16 to single byte encoding schemata. All these functions share the same structure which is as follows:
function UTF16To...Str(const S: WideString): string;
function UTF16ToCp866_DOSCyrillicRussianStr(const S: WideString): string;
These classes are designed for easy access to the Management Information Base (MIB) for character set encodings as specified in [CSMIB].
ECSMIBException = Exception;
ECSMIBException is the exception class for errors in the TCSMIB class.
TCSMIBChangingEvent = procedure (Sender: TObject; NewEnum: integer; var AllowChange: Boolean) of object;
This event type is used for the TCSMIB.OnChanging event.
The TCSMIB component decodes MIB enumeration (MIBenum) values which identify coded character sets as specified in [CSMIB].
property Enum: integer
The Enum property contains the unique MIB enum value to identify a coded character set.
If an invalid value is assigned, then, depending on the value of the IgnoreInvalidEnum property, either an ECSMIBException is raised or the attempt is silently ignored.
property IgnoreInvalidEnum: boolean
If set to TRUE an attempt to set the Enum property to an invalid value is silently ignored, i.e. Enum will remain the same and no notification about the failure is made.
If set to FALSE an attempt to set the Enum property to an invalid value results in an ECSMIBException being raised.
property Alias[i: integer]: string (readonly)
This property gives access to a list of official names for the character set with the MIB enum value specified in the Enum property.
These names are expressed in ANSI_X3.4-1968, also known as US-ASCII or simply ASCII. The names are not case-sensitive.
The aliases that start with "cs" have been added for use with the Printer MIB (see RFC 1759) and contain the standard numbers along with suggestive names in order to facilitate applications that want to display the names in user interfaces. The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").
The i parameter corresponds to the position of an alias in the list, where 0 is the first alias, 1 is the second alias, and so on. If there is no alias corresponding to the value of i, an ECSMIBException is raised.
The first alias, i.e. Alias[0], always contains the MIB name of the corresponding character set.
property AliasCount: integer (readonly)
This property represents the number of aliases in the list for the names of the coded character set with the MIB enum value specified in the Enum property.
Use the AliasCount property when iterating over all the aliases in the list, or when trying to locate the position of an alias relative to the last alias in the list.
property PreferredMIMEName: string (readonly)
This property contains the preferred MIME name, if any, of the coded character set with the MIB enum value specified in the Enum property. If no preferred MIME name is specified, this is an empty string.
function IsValidEnum(const Value: integer): boolean; virtual;
Returns TRUE if the specified value is a valid MIBenum value. Otherwise FALSE is returned.
function SetToAlias(const S: string): boolean; virtual;
Tries to set the Enum property to a value corresponding to the specified string. No distinction is made between upper and lower case letters. If the attempt was successful, TRUE is returned. Otherwise FALSE is returned and the value of the Enum property remains the same.
OnChange: TNotifyEvent
Occurs after the Enum property has changed.
type TNotifyEvent = procedure (Sender: TObject) of object;
OnChanging: TCSMIBChangingEvent
Occurs just before a change is made to the Enum property.
TCSMIBChangingEvent = procedure (Sender: TObject; NewEnum: integer; var AllowChange: Boolean) of object;
Write an OnChanging event handler to conditionally block changes to the Enum property. Set the AllowChange parameter to FALSE to prevent the change from taking place. The NewEnum parameter is the new Enum value about to be set.
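The veto mechanism of OnChanging can be sketched with a small stand-in class. TSketchMib and THandlers below are hypothetical and model only the AllowChange logic; they are not TCSMIB itself, whose setter also validates the value. The MIBenum values used are taken from [CSMIB]: 4 identifies ISO-8859-1 and 106 identifies UTF-8.

```pascal
program ChangingSketch;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Same shape as the documented TCSMIBChangingEvent.
  TSketchChangingEvent = procedure(Sender: TObject; NewEnum: Integer;
    var AllowChange: Boolean) of object;

  // Hypothetical stand-in for TCSMIB; only the veto logic is modelled.
  TSketchMib = class
  private
    FEnum: Integer;
    FOnChanging: TSketchChangingEvent;
    procedure SetEnum(const Value: Integer);
  public
    property Enum: Integer read FEnum write SetEnum;
    property OnChanging: TSketchChangingEvent
      read FOnChanging write FOnChanging;
  end;

  // Hypothetical owner of the event handler, e.g. a form in a real program.
  THandlers = class
  public
    procedure OnlyUtf8(Sender: TObject; NewEnum: Integer;
      var AllowChange: Boolean);
  end;

procedure TSketchMib.SetEnum(const Value: Integer);
var
  Allow: Boolean;
begin
  Allow := True;
  // Give the handler a chance to block the change before it happens.
  if Assigned(FOnChanging) then
    FOnChanging(Self, Value, Allow);
  if Allow then
    FEnum := Value;
end;

procedure THandlers.OnlyUtf8(Sender: TObject; NewEnum: Integer;
  var AllowChange: Boolean);
begin
  // Accept only the MIBenum value for UTF-8.
  AllowChange := NewEnum = 106;
end;

var
  Mib: TSketchMib;
  Handlers: THandlers;
begin
  Mib := TSketchMib.Create;
  Handlers := THandlers.Create;
  try
    Mib.OnChanging := Handlers.OnlyUtf8;
    Mib.Enum := 4;        // blocked by the handler
    WriteLn(Mib.Enum);    // 0 (unchanged initial value)
    Mib.Enum := 106;      // allowed
    WriteLn(Mib.Enum);    // 106
  finally
    Handlers.Free;
    Mib.Free;
  end;
end.
```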
[CSMIB] IANA: Character Sets, 2001-08-23, see: "http://www.iana.org/assignments/character-sets".
[ISO/IEC 10646] ISO (International Organization for Standardization): ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, [Geneva]: International Organization for Standardization, 1993 (+ amendments AM 1 through AM 7).
[RFC 2279] Yergeau, F.: "UTF-8, a Transformation Format of ISO 10646", RFC 2279, 1998, see "http://www.ietf.org/rfc/rfc2279.txt".
[RFC 2781] Hoffman, P. and F. Yergeau: "UTF-16, an Encoding of ISO 10646", RFC 2781, 2000, see "http://www.ietf.org/rfc/rfc2781.txt".
[Unicode 3.0] The Unicode Consortium: The Unicode Standard Version 3.0, Reading (Mass.): Addison-Wesley, 2000.