A character encoding maps each character in a character set to a numeric value that can be represented by a computer. These numbers can be represented by a single byte or multiple bytes. For example, the ASCII encoding uses seven bits to represent the Latin alphabet, punctuation, and control characters.
You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, to represent Japanese text. These encodings can vary slightly, but they include a common set of approximately 10,000 characters used in Japanese.
The following terms apply to character encodings:
The following table lists some common character encodings; however, there are many additional character encodings that browsers and web servers support:
Encoding | Type | Description |
---|---|---|
ASCII |
SBCS |
7-bit encoding used by English and Indonesian Bahasa languages |
Latin-1 |
SBCS |
8-bit encoding used for many Western European languages |
Shift_JIS |
DBCS |
16-bit Japanese encoding (Note that you must use an underscore character (_), not a hyphen (-) in the name in CFML attributes.) |
EUC-KR |
DBCS |
16-bit Korean encoding |
UCS-2 |
DBCS |
Two-byte Unicode encoding |
UTF-8 |
MBCS |
Multibyte Unicode encoding. ASCII is 7-bit; non-ASCII characters used in European and many Middle Eastern languages are two-byte; and most Asian characters are three-byte |
The World Wide Web Consortium maintains a list of all character encodings supported by the Internet. You can find this information at www.w3.org/International/O-charset.html.
Computers often must convert between character encodings. In particular, the character encodings most commonly used on the Internet are not used by Java or Windows. Character sets used on the Internet are typically single-byte or multiple-byte (including DBCS character sets that allow single-byte characters). These character sets are most efficient for transmitting data, because each character takes up the minimum necessary number of bytes. Currently, Latin characters are most frequently used on the web, and most character encodings used on the web represent those characters in a single byte.
Computers, however, process data most efficiently if each character occupies the same number of bytes. Therefore, Windows and Java both use double-byte encoding for internal processing.