Small Basic: Character Set - Unicode
Introduction
This article explains character sets, especially as they apply to the Microsoft Small Basic programming language.
What is a Character Set
A character set is a set of codes assigned to characters. ASCII was a very popular 7-bit character code. Small Basic uses Unicode as its character set.
What is a Character Code
A character code is the number assigned to a character. You can get the code of any character with the Text.GetCharacterCode() operation, and you can get the character for a given code with the Text.GetCharacter() operation.
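As a minimal sketch of these two operations (the variable names are only for illustration), the following converts a character to its code and back:

```smallbasic
' Get the code for a character, then get a character back from a code.
code = Text.GetCharacterCode("A")
TextWindow.WriteLine(code)        ' 65
ch = Text.GetCharacter(code + 1)
TextWindow.WriteLine(ch)          ' B
```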
What is Unicode
Unicode was originally created as a 16-bit character code that includes many DBCS (double-byte character set) codes used in various countries. Today many operating systems, web sites, and applications can use Unicode. A Unicode character is usually written in a form like U+0022. This code means the double quotation mark, whose code is 34 (0x22); the 0x prefix means hexadecimal. The Character Map program in Windows Accessories can show the Unicode value of each character. This tool shows only the Unicode range from U+0000 to U+FFFF.
The following table is taken from the IME Pad program, which is included with the Japanese IME in Windows. The range from U+0000 to U+FFFF is called the BMP (Basic Multilingual Plane). The range from U+10000 to U+1FFFF is called the SMP (Supplementary Multilingual Plane). The table contains the full BMP map and part of the SMP. The other planes, which cover U+20000 to U+10FFFF, are not listed here.
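Small Basic has no built-in hexadecimal conversion, so the U+ notation has to be built by hand. The following is one possible sketch, not the only way to do it; the variable names digits, hex, and n are hypothetical:

```smallbasic
' Show the character code of the double quotation mark in U+XXXX notation.
ch = Text.GetCharacter(34)
code = Text.GetCharacterCode(ch)
digits = "0123456789ABCDEF"
hex = ""
n = code
' Take hexadecimal digits from the right-hand end of the number.
While n > 0
  digit = Text.GetSubText(digits, Math.Remainder(n, 16) + 1, 1)
  hex = Text.Append(digit, hex)
  n = Math.Floor(n / 16)
EndWhile
' Pad with zeros to at least four digits.
While Text.GetLength(hex) < 4
  hex = Text.Append("0", hex)
EndWhile
TextWindow.WriteLine(Text.Append("U+", hex))   ' U+0022
```

Text.Append is used instead of the + operator so that digit strings such as "22" are joined as text rather than added as numbers.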
Unicode (BMP) | From | To |
Basic Latin | U+0020 | U+007F |
Latin-1 Supplement | U+0080 | U+00FF |
Latin Extended-A | U+0100 | U+017F |
Latin Extended-B | U+0180 | U+024F |
IPA Extensions | U+0250 | U+02AF |
Spacing Modifier Letters | U+02B0 | U+02FF |
Combining Diacritical Marks | U+0300 | U+036F |
Greek and Coptic | U+0370 | U+03FF |
Cyrillic | U+0400 | U+04FF |
Cyrillic Supplement | U+0500 | U+052F |
Armenian | U+0530 | U+058F |
Hebrew | U+0590 | U+05FF |
Arabic | U+0600 | U+06FF |
Syriac | U+0700 | U+074F |
Arabic Supplement | U+0750 | U+077F |
Thaana | U+0780 | U+07BF |
NKo | U+07C0 | U+07FF |
Samaritan | U+0800 | U+083F |
Mandaic | U+0840 | U+08FF |
Devanagari | U+0900 | U+097F |
Bengali | U+0980 | U+09FF |
Gurmukhi | U+0A00 | U+0A7F |
Gujarati | U+0A80 | U+0AFF |
Oriya | U+0B00 | U+0B7F |
Tamil | U+0B80 | U+0BFF |
Telugu | U+0C00 | U+0C7F |
Kannada | U+0C80 | U+0CFF |
Malayalam | U+0D00 | U+0D7F |
Sinhala | U+0D80 | U+0DFF |
Thai | U+0E00 | U+0E7F |
Lao | U+0E80 | U+0EFF |
Tibetan | U+0F00 | U+0FFF |
Myanmar | U+1000 | U+109F |
Georgian | U+10A0 | U+10FF |
Hangul Jamo | U+1100 | U+11FF |
Ethiopic | U+1200 | U+137F |
Ethiopic Supplement | U+1380 | U+139F |
Cherokee | U+13A0 | U+13FF |
Unified Canadian Aboriginal Syllabics | U+1400 | U+167F |
Ogham | U+1680 | U+169F |
Runic | U+16A0 | U+16FF |
Tagalog | U+1700 | U+171F |
Hanunoo | U+1720 | U+173F |
Buhid | U+1740 | U+175F |
Tagbanwa | U+1760 | U+177F |
Khmer | U+1780 | U+17FF |
Mongolian | U+1800 | U+18AF |
Unified Canadian Aboriginal Syllabics Extended | U+18B0 | U+18FF |
Limbu | U+1900 | U+194F |
Tai Le | U+1950 | U+197F |
New Tai Lue | U+1980 | U+19DF |
Khmer Symbols | U+19E0 | U+19FF |
Buginese | U+1A00 | U+1A1F |
Tai Tham | U+1A20 | U+1AFF |
Balinese | U+1B00 | U+1B7F |
Sundanese | U+1B80 | U+1BBF |
Batak | U+1BC0 | U+1BFF |
Lepcha | U+1C00 | U+1C4F |
Ol Chiki | U+1C50 | U+1CCF |
Vedic Extensions | U+1CD0 | U+1CFF |
Phonetic Extensions | U+1D00 | U+1D7F |
Phonetic Extensions Supplement | U+1D80 | U+1DBF |
Combining Diacritical Marks Supplement | U+1DC0 | U+1DFF |
Latin Extended Additional | U+1E00 | U+1EFF |
Greek Extended | U+1F00 | U+1FFF |
General Punctuation | U+2000 | U+206F |
Superscripts and Subscripts | U+2070 | U+209F |
Currency Symbols | U+20A0 | U+20CF |
Combining Diacritical Marks for Symbols | U+20D0 | U+20FF |
Letterlike Symbols | U+2100 | U+214F |
Number Forms | U+2150 | U+218F |
Arrows | U+2190 | U+21FF |
Mathematical Operators | U+2200 | U+22FF |
Miscellaneous Technical | U+2300 | U+23FF |
Control Pictures | U+2400 | U+243F |
Optical Character Recognition | U+2440 | U+245F |
Enclosed Alphanumerics | U+2460 | U+24FF |
Box Drawing | U+2500 | U+257F |
Block Elements | U+2580 | U+259F |
Geometric Shapes | U+25A0 | U+25FF |
Miscellaneous Symbols | U+2600 | U+26FF |
Dingbats | U+2700 | U+27BF |
Miscellaneous Mathematical Symbols-A | U+27C0 | U+27EF |
Supplemental Arrows-A | U+27F0 | U+27FF |
Braille Patterns | U+2800 | U+28FF |
Supplemental Arrows-B | U+2900 | U+297F |
Miscellaneous Mathematical Symbols-B | U+2980 | U+29FF |
Supplemental Mathematical Operators | U+2A00 | U+2AFF |
Miscellaneous Symbols and Arrows | U+2B00 | U+2BFF |
Glagolitic | U+2C00 | U+2C5F |
Latin Extended-C | U+2C60 | U+2C7F |
Coptic | U+2C80 | U+2CFF |
Georgian Supplement | U+2D00 | U+2D2F |
Tifinagh | U+2D30 | U+2D7F |
Ethiopic Extended | U+2D80 | U+2DDF |
Cyrillic Extended-A | U+2DE0 | U+2DFF |
Supplemental Punctuation | U+2E00 | U+2E7F |
CJK Radicals Supplement | U+2E80 | U+2EFF |
Kangxi Radicals | U+2F00 | U+2FEF |
Ideographic Description Characters | U+2FF0 | U+2FFF |
CJK Symbols and Punctuation | U+3000 | U+303F |
Hiragana | U+3040 | U+309F |
Katakana | U+30A0 | U+30FF |
Bopomofo | U+3100 | U+312F |
Hangul Compatibility Jamo | U+3130 | U+318F |
Kanbun | U+3190 | U+319F |
Bopomofo Extended | U+31A0 | U+31BF |
CJK Strokes | U+31C0 | U+31EF |
Katakana Phonetic Extensions | U+31F0 | U+31FF |
Enclosed CJK Letters and Months | U+3200 | U+32FF |
CJK Compatibility | U+3300 | U+33FF |
CJK Unified Ideographs Extension A | U+3400 | U+4DBF |
Yijing Hexagram Symbols | U+4DC0 | U+4DFF |
CJK Unified Ideographs | U+4E00 | U+9FFF |
Yi Syllables | U+A000 | U+A48F |
Yi Radicals | U+A490 | U+A4CF |
Lisu | U+A4D0 | U+A4FF |
Vai | U+A500 | U+A63F |
Cyrillic Extended-B | U+A640 | U+A69F |
Bamum | U+A6A0 | U+A6FF |
Modifier Tone Letters | U+A700 | U+A71F |
Latin Extended-D | U+A720 | U+A7FF |
Syloti Nagri | U+A800 | U+A82F |
Common Indic Number Forms | U+A830 | U+A83F |
Phags-pa | U+A840 | U+A87F |
Saurashtra | U+A880 | U+A8DF |
Devanagari Extended | U+A8E0 | U+A8FF |
Kayah Li | U+A900 | U+A92F |
Rejang | U+A930 | U+A95F |
Hangul Jamo Extended-A | U+A960 | U+A97F |
Javanese | U+A980 | U+A9FF |
Cham | U+AA00 | U+AA5F |
Myanmar Extended-A | U+AA60 | U+AA7F |
Tai Viet | U+AA80 | U+AAFF |
Ethiopic Extended-A | U+AB00 | U+ABBF |
Meetei Mayek | U+ABC0 | U+ABFF |
Hangul Syllables | U+AC00 | U+D7AF |
Hangul Jamo Extended-B | U+D7B0 | U+D7FF |
High Surrogates | U+D800 | U+DB7F |
High Private Use Surrogates | U+DB80 | U+DBFF |
Low Surrogates | U+DC00 | U+DFFF |
Private Use Area | U+E000 | U+F8FF |
CJK Compatibility Ideographs | U+F900 | U+FAFF |
Alphabetic Presentation Forms | U+FB00 | U+FB4F |
Arabic Presentation Forms-A | U+FB50 | U+FDFF |
Variation Selectors | U+FE00 | U+FE0F |
Vertical Forms | U+FE10 | U+FE1F |
Combining Half Marks | U+FE20 | U+FE2F |
CJK Compatibility Forms | U+FE30 | U+FE4F |
Small Form Variants | U+FE50 | U+FE6F |
Arabic Presentation Forms-B | U+FE70 | U+FEFF |
Halfwidth and Fullwidth Forms | U+FF00 | U+FFEF |
Specials | U+FFF0 | U+FFFF |
Part of Unicode (SMP) | From | To |
Mahjong Tiles | U+1F000 | U+1F02F |
Domino Tiles | U+1F030 | U+1F09F |
Playing Cards | U+1F0A0 | U+1F2FF |
Miscellaneous Symbols And Pictographs | U+1F300 | U+1F5FF |
Emoticons | U+1F600 | U+1F67F |
Transport And Map Symbols | U+1F680 | U+1FFFF |
What is UTF-8
Small Basic source files, and files written and read with the File object, are encoded in UTF-8. UTF-8 stands for UCS (Universal Character Set) Transformation Format 8-bit and is one of the concrete formats used to encode Unicode. The letters of the Latin alphabet lie between U+0041 and U+007A, and these characters are single bytes (8 bits) in ASCII. To keep these common characters small, UTF-8 encodes the characters from U+0000 to U+007F as single bytes. You do not need to care about UTF-8 in a Small Basic program, though: operations such as Text.GetLength, Text.GetSubText, Text.GetSubTextToEnd, and Text.GetIndexOf work on Unicode characters rather than on UTF-8 bytes.
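To illustrate the difference between characters and UTF-8 bytes, the sketch below counts one character with Text.GetLength and estimates its UTF-8 size from the code ranges. It assumes a BMP character outside the surrogate range, and the variable name bytes is hypothetical:

```smallbasic
' Compare the character count with the estimated UTF-8 byte count.
ch = Text.GetCharacter(12354)   ' U+3042, Hiragana letter A
code = Text.GetCharacterCode(ch)
If code < 128 Then        ' U+0000 to U+007F
  bytes = 1
ElseIf code < 2048 Then   ' U+0080 to U+07FF
  bytes = 2
Else                      ' U+0800 to U+FFFF
  bytes = 3
EndIf
TextWindow.WriteLine(Text.Append("Characters: ", Text.GetLength(ch)))
TextWindow.WriteLine(Text.Append("UTF-8 bytes: ", bytes))
```

The program reports one character even though the same character would occupy three bytes in a UTF-8 file.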
Characters for Game
The Lucida Sans Unicode font and some other fonts contain emoji characters that are suitable for game programs. The picture of Character Map above shows these characters. You can see them by following the steps below, and you may find other characters for your game.
- Run Character Map.
- Select the Lucida Sans Unicode font.
- Check the Advanced View check box.
- Select Unicode Subrange in Group by.
- Select Symbol & Dingbats for Unicode Subrange in the Group By window.
The SMP table above also contains mahjong tiles, domino tiles, playing cards, and so on.
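Once you know the code of a symbol, you can draw it in the GraphicsWindow. The following sketch draws the four card suit symbols from the Miscellaneous Symbols block. It assumes that the Lucida Sans Unicode font is installed and contains these glyphs, and the array name suit is hypothetical:

```smallbasic
' Draw the four card suit symbols side by side.
GraphicsWindow.FontName = "Lucida Sans Unicode"
GraphicsWindow.FontSize = 48
suit[1] = 9824  ' U+2660 black spade suit
suit[2] = 9829  ' U+2665 black heart suit
suit[3] = 9830  ' U+2666 black diamond suit
suit[4] = 9827  ' U+2663 black club suit
For i = 1 To 4
  GraphicsWindow.DrawText(20 + (i - 1) * 60, 20, Text.GetCharacter(suit[i]))
EndFor
```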
Tips about Character Set
Glyph Depends on Font
Be careful about fonts. Unicode has a great many characters, but many fonts do not contain all of them, so different fonts may show different glyphs for the same character. Some fonts are not part of Windows itself but are installed together with certain applications. A Small Basic program can be published, so a published program may run in different environments: some have the Office fonts, some do not, and some have only Mac fonts. See more details about fonts here.
Character Set in TextWindow
In the TextWindow, the character set is not Unicode. It depends on the localization, such as ASCII in the US or Shift-JIS in Japan.
Characters in SMP
The Text.ConvertToLowerCase(), Text.ConvertToUpperCase(), Text.GetCharacter(), Text.GetCharacterCode(), Text.GetIndexOf(), Text.GetLength(), Text.GetSubText(), and Text.GetSubTextToEnd() operations do not support SMP (U+10000 to U+1FFFF) characters. However, SMP characters can be written in a text literal. See the SMP Characters sample below for details.
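A minimal sketch of the literal approach is shown here. It stores four playing card characters from the SMP directly as text literals and draws them without passing them through the Text operations. The font name Segoe UI Symbol is an assumption; use any installed font that contains these glyphs. The array name cards is hypothetical:

```smallbasic
' SMP characters written as literals, because Text.GetCharacter cannot create them.
cards[1] = "🂡"  ' U+1F0A1 ace of spades
cards[2] = "🂱"  ' U+1F0B1 ace of hearts
cards[3] = "🃁"  ' U+1F0C1 ace of diamonds
cards[4] = "🃑"  ' U+1F0D1 ace of clubs
GraphicsWindow.FontName = "Segoe UI Symbol"  ' assumed to be installed
GraphicsWindow.FontSize = 60
For i = 1 To 4
  GraphicsWindow.DrawText(20 + (i - 1) * 70, 20, cards[i])
EndFor
```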
Sample Code
ASCII Code Table
Program ID VQX212. This program shows the characters from U+0000 to U+007F.
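The published program is not reproduced here. As a much simpler sketch of the same idea, the loop below lists only the printable ASCII characters (codes 32 to 126) with their decimal codes, eight per row; the variable name entry is hypothetical:

```smallbasic
' List the printable ASCII characters with their codes, eight per row.
For code = 32 To 126
  entry = Text.Append(code, ":")
  entry = Text.Append(entry, Text.GetCharacter(code))
  TextWindow.Write(Text.Append(entry, "  "))
  If Math.Remainder(code - 31, 8) = 0 Then
    TextWindow.WriteLine("")
  EndIf
EndFor
TextWindow.WriteLine("")
```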
Hexadecimal Dump
Program ID XWT217. This program reads a UTF-8 text file and simulates the UTF-8 encoding of the file. Before running it, remove the comment marks that were automatically added to the File operations.
Symbol Samples
Program ID QZS270. This program shows symbols in the Webdings and Wingdings fonts and in Unicode.
Get Character from Unicode
Program ID RPZ143-2. This program shows the character for a given Unicode value in a text box, so you can copy the character. This program revealed that SMP characters are not supported by the Text object in Small Basic: only the first 16 bits of the internal UTF-16 code are used.
SMP Characters
Program ID QBS151. This program shows playing card characters in the SMP range of Unicode using an array of literals.
Program ID FSQ891. This program shows mahjong tile characters in the SMP range of Unicode using an array of literals.
See Also
Additional Resources
- Full Emoji List | The Unicode Consortium