About character encodings

A character encoding maps each character in a character set to a numeric value that a computer can represent. Each value is stored in a single byte or in multiple bytes. For example, the ASCII encoding uses seven bits to represent the Latin alphabet, digits, punctuation, and control characters.
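The mapping from characters to numbers can be seen in a minimal Python sketch. The sample string and variable names are illustrative assumptions, not part of the original documentation.

```python
# Each character has a numeric code value; an encoding turns it into bytes.
text = "Az~"
code_values = [ord(ch) for ch in text]   # numeric value of each character
ascii_bytes = text.encode("ascii")       # ASCII: one byte per character

# Every ASCII byte fits in seven bits, so its value is below 128.
fits_in_seven_bits = all(b < 128 for b in ascii_bytes)

# A character outside the ASCII set (here e-acute) cannot be encoded as ASCII.
try:
    "\u00e9".encode("ascii")
    e_acute_is_ascii = True
except UnicodeEncodeError:
    e_acute_is_ascii = False

print(code_values, list(ascii_bytes), fits_in_seven_bits, e_acute_is_ascii)
```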

You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, to represent Japanese text. These encodings can vary slightly, but they include a common set of approximately 10,000 characters used in Japanese.
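As a rough illustration of how these encodings differ, the following Python sketch encodes the same three kanji with each of them. The codec names are the spellings used by Python's standard library, which is an assumption of this example and not the naming that CFML requires.

```python
text = "\u65e5\u672c\u8a9e"  # the word "nihongo" (Japanese language), three kanji

# Codec names as spelled by Python's standard library (assumption of this sketch).
japanese_codecs = ["shift_jis", "euc_jp", "iso2022_jp"]

# Encode the same text once per encoding.
encoded = {name: text.encode(name) for name in japanese_codecs}

for name, data in encoded.items():
    # Show the raw bytes in hex and confirm they decode back to the same text.
    print(name, data.hex(), data.decode(name) == text)
```

The same characters produce a different byte sequence in each encoding, which is why the receiver must know which encoding the sender used.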

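The following terms apply to character encodings:

SBCS: Single-byte character set; a character set encoded in one byte per character.

DBCS: Double-byte character set; a character set encoded in no more than two bytes per character. Many encodings described as double-byte, including Japanese encodings such as Shift-JIS, allow single-byte and double-byte characters to be mixed. Others, such as UCS-2, require two bytes for every character.

MBCS: Multiple-byte character set; a character set encoded with a variable number of bytes per character.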

The following table lists some common character encodings; however, there are many additional character encodings that browsers and web servers support:

Encoding              Type  Description
ASCII                 SBCS  7-bit encoding used by English and Bahasa Indonesia
Latin-1 (ISO 8859-1)  SBCS  8-bit encoding used for many Western European languages
Shift_JIS             DBCS  Japanese encoding that uses one or two bytes per character. In CFML attributes, you must write the name with an underscore (_), not a hyphen (-).
EUC-KR                DBCS  Korean encoding that uses one or two bytes per character
UCS-2                 DBCS  Two-byte Unicode encoding
UTF-8                 MBCS  Multibyte Unicode encoding. ASCII characters take one byte; non-ASCII characters used in European and many Middle Eastern languages take two bytes; and most Asian characters take three bytes.
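The byte counts given for UTF-8 and UCS-2 can be checked with a short Python sketch. The sample characters are illustrative choices. Python has no codec named UCS-2, so this example assumes `utf-16-be` as a stand-in, which produces the same two bytes for the characters shown.

```python
# Sample characters, written as escapes: one per class mentioned in the table.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Western European letter (e-acute)",
    "\u05e9": "Hebrew letter (shin)",
    "\u65e5": "Japanese kanji (sun/day)",
}

# UTF-8 uses a variable number of bytes per character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A two-byte Unicode encoding uses the same width for each of these characters.
two_byte_lengths = [len(ch.encode("utf-16-be")) for ch in samples]

print(utf8_lengths)      # [1, 2, 2, 3]
print(two_byte_lengths)  # [2, 2, 2, 2]
```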

The World Wide Web Consortium (W3C) maintains information about the character encodings used on the Internet. You can find this information at www.w3.org/International/O-charset.html.

Computers often must convert between character encodings. In particular, the character encodings most commonly used on the Internet are not the ones that Java and Windows use internally. Character encodings used on the Internet are typically single-byte or multiple-byte (including DBCS encodings that allow single-byte characters). These encodings are the most efficient for transmitting data, because each character takes up the minimum necessary number of bytes. Currently, Latin characters are the most frequently used characters on the web, and most character encodings used on the web represent those characters in a single byte.
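A conversion between encodings amounts to decoding the bytes into characters and then encoding those characters again. The following Python sketch shows the idea; the `transcode` helper is a hypothetical name introduced for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding, then re-encode them in the target."""
    return data.decode(source).encode(target)


# "cafe" with an e-acute, as it might arrive in a single-byte web encoding.
latin1_bytes = "caf\u00e9".encode("latin-1")

utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
utf16_bytes = transcode(latin1_bytes, "latin-1", "utf-16-le")

# Same four characters, three different sizes on the wire or in memory.
print(len(latin1_bytes), len(utf8_bytes), len(utf16_bytes))  # 4 5 8
```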

Computers, however, process data most efficiently if each character occupies the same number of bytes. Therefore, Windows and Java both use a double-byte Unicode encoding (UTF-16, in which most characters occupy two bytes) for internal processing.
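One reason fixed-width storage is convenient is that the position of any character can be computed directly. The Python sketch below assumes UTF-16 big-endian as the double-byte encoding and assumes every character occupies exactly two bytes; the `char_at` helper is hypothetical.

```python
def char_at(data: bytes, index: int) -> str:
    """Return the character at `index`, assuming two bytes per character."""
    start = 2 * index  # offset is a simple multiplication, no scanning needed
    return data[start:start + 2].decode("utf-16-be")


text = "a\u00e9\u65e5z"                    # a, e-acute, a kanji, z
fixed_width = text.encode("utf-16-be")     # 2 bytes for each character here
variable_width = text.encode("utf-8")      # 1 + 2 + 3 + 1 bytes

third = char_at(fixed_width, 2)
print(len(fixed_width), len(variable_width), third == text[2])  # 8 7 True
```

With the variable-width bytes, finding the third character would require reading the preceding characters first to learn how many bytes each one takes.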
