Collation and Code Page Architecture
Collations control the physical storage of character strings in SQL Server 2005. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns. There are 16 bits in 2 bytes, and 16 bits can be turned ON and OFF in 65,536 unique patterns. A program that uses 2 bytes to represent each character can represent up to 65,536 characters.
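The arithmetic behind those two figures can be checked with a minimal sketch. The helper name `pattern_count` is a hypothetical one chosen for illustration and is not part of SQL Server.

```python
def pattern_count(num_bytes: int) -> int:
    """Return how many distinct ON/OFF bit patterns fit in num_bytes bytes."""
    bits = 8 * num_bytes      # 8 bits per byte
    return 2 ** bits          # each bit doubles the number of patterns


one_byte = pattern_count(1)   # patterns available to a 1-byte character
two_bytes = pattern_count(2)  # patterns available to a 2-byte character

print(one_byte)   # 256
print(two_bytes)  # 65536
```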
Single-byte code pages are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as the exclamation point (!), the at sign (@), the number sign (#), or the percent sign (%). Each European language, such as German or Spanish, is assigned a single-byte code page, and related languages often share one: German and Spanish both use code page 1252, for example. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters vary from one code page to the next.
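The effect of the code page on a stored byte can be illustrated outside SQL Server. The following sketch uses Python's built-in `cp1251`, `cp1252`, and `cp1253` codecs as stand-ins for the Windows code pages of the same numbers; the variable names are chosen only for this example.

```python
# One byte from the upper half of the range, and one from the A-Z range.
ACCENT_BYTE = bytes([0xE4])
LATIN_BYTE = bytes([0x41])

CODE_PAGES = ["cp1251", "cp1252", "cp1253"]  # Cyrillic, Latin1, Greek

# Decode the same bit pattern under each code page.
accented = {cp: ACCENT_BYTE.decode(cp) for cp in CODE_PAGES}
plain = {cp: LATIN_BYTE.decode(cp) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    print(cp, repr(plain[cp]), repr(accented[cp]))
```

Under these codecs the byte 0x41 decodes to the same letter everywhere, while 0xE4 decodes to a different character in each code page.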
Single-byte character sets cannot store all the characters used by many languages. Some Asian languages have thousands of characters; therefore, they need up to 2 bytes per character. Double-byte character sets have been defined for these languages, and code pages have also been defined around them.
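As a rough illustration of the storage difference, the sketch below encodes a short Japanese string with Python's `cp932` codec, used here as a stand-in for code page 932, and then tries the same string against the single-byte `cp1252` codec. The sample text and variable names are assumptions made for the example.

```python
text = "日本語"  # three Japanese characters

# In the double-byte code page, each of these characters takes 2 bytes.
dbcs_bytes = text.encode("cp932")
print(len(text), len(dbcs_bytes))

# Characters from the A-Z range still take a single byte in the same code page.
ascii_bytes = "A".encode("cp932")
print(len(ascii_bytes))

# A single-byte code page has no bit pattern for these characters at all.
try:
    text.encode("cp1252")
    fits_single_byte = True
except UnicodeEncodeError:
    fits_single_byte = False
print(fits_single_byte)
```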
The following table shows the code pages that SQL Server 2005 supports.
Code page | Description |
---|---|
1258 | Vietnamese |
1257 | Baltic |
1256 | Arabic |
1255 | Hebrew |
1254 | Turkish |
1253 | Greek |
1252 | Latin1 (ANSI) |
1251 | Cyrillic |
1250 | Central European |
950 | Chinese (Traditional) |
949 | Korean |
936 | Chinese (Simplified) |
932 | Japanese |
874 | Thai |
850 | Multilingual (MS-DOS Latin1) |
437 | MS-DOS U.S. English |
Multiple collations can use the same code page for non-Unicode data. For example, the 1251 code page defines a set of Cyrillic characters. This code page is used by several collations, such as Cyrillic_General, Ukrainian, and Macedonian_FYROM_90. Although all these collations use the same bit patterns to represent non-Unicode character data, the sorting and comparison rules they apply are slightly different. These differences let each collation follow the dictionary order of characters in the language or alphabet associated with it.
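The separation between stored bytes and ordering rules can be sketched as follows. This is not how SQL Server implements its collations: the two ordering functions, the word list, and the `UKRAINIAN_ALPHABET` string are hypothetical and exist only to show that identical code page 1251 bytes can be sorted under different rules.

```python
# Lowercase Ukrainian alphabet in dictionary order (assumed for this sketch).
UKRAINIAN_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"

words = ["дім", "ґава", "гора"]

# The stored form is the same no matter which ordering is applied later.
stored = [w.encode("cp1251") for w in words]


def byte_order_key(raw: bytes) -> bytes:
    """Order by the raw bit patterns of code page 1251."""
    return raw


def dictionary_key(raw: bytes) -> list:
    """Order by each character's position in the alphabet (illustrative only)."""
    return [UKRAINIAN_ALPHABET.index(ch) for ch in raw.decode("cp1251")]


by_bytes = [r.decode("cp1251") for r in sorted(stored, key=byte_order_key)]
by_dictionary = [r.decode("cp1251") for r in sorted(stored, key=dictionary_key)]

print(by_bytes)
print(by_dictionary)
```

In code page 1251 the letter ґ has a lower byte value than г, so the byte-order sort places it first, while the alphabet-based key places it between г and д.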