Small Basic: Character Set - Unicode


Introduction

This article explains character sets, with a focus on the Microsoft Small Basic programming language.

What is Character Set

A character set is a set of codes that encode characters.  ASCII was very popular as a 7-bit character code.  Small Basic uses Unicode as its character set.

What is Character Code

A character code is the number assigned to a character.  You can get the code of any character with the Text.GetCharacterCode() operation.  You can also get the character for a given code with the Text.GetCharacter() operation.
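
The round trip between a character and its code can be sketched outside Small Basic as well.  The block below is written in Python, not Small Basic: the built-in ord() and chr() play roughly the role of Text.GetCharacterCode() and Text.GetCharacter(), and the helper names character_code and character are made up for this illustration.

```python
# Python analogue of the character <-> code round trip.
# This is NOT Small Basic code; the helper names are hypothetical.

def character_code(ch):
    """Return the Unicode code point of a single character."""
    return ord(ch)


def character(code):
    """Return the character for a Unicode code point."""
    return chr(code)


print(character_code("A"))             # code of the letter A
print(character(34))                   # the double quotation mark
print(character(character_code("z")))  # round trip gives back z
```

Converting a character to its code and back always returns the original character, which is what makes the code usable as an identifier for the character.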

What is Unicode

Unicode was originally designed as a 16-bit character code that incorporated many national character sets, including DBCS (double-byte character set) codes.  Today most operating systems, web sites, and applications support Unicode.  A Unicode character is usually written in a form such as U+0022.  This code is the double quotation mark, whose character code is 34 (0x22); the 0x prefix means hexadecimal.  The Character Map program in Windows Accessories shows the Unicode value of each character.  This tool shows only the range from U+0000 to U+FFFF.
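
The U+ notation is simply the character code written in hexadecimal with at least four digits.  The following Python sketch (an illustration only, not Small Basic code; the function names to_u_notation and from_u_notation are hypothetical) converts in both directions.

```python
# Convert between a character and its U+XXXX notation.
# Illustrative Python, not Small Basic; function names are hypothetical.

def to_u_notation(ch):
    """Format a character's code point as U+ followed by hex digits."""
    return "U+{:04X}".format(ord(ch))


def from_u_notation(text):
    """Parse a string such as 'U+0022' back into a character."""
    return chr(int(text[2:], 16))


print(to_u_notation('"'))         # the double quotation mark
print(from_u_notation("U+0041"))  # the letter A
```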

The following table is taken from IME Pad, a program included with the Japanese IME in Windows.  The range from U+0000 to U+FFFF is called the BMP (Basic Multilingual Plane).  The range from U+10000 to U+1FFFF is called the SMP (Supplementary Multilingual Plane).  The table covers the full BMP and part of the SMP.  The other planes, from U+20000 to U+10FFFF, are not listed.

Unicode (BMP) From To
Basic Latin U+0020 U+007F
Latin-1 Supplement U+0080 U+00FF
Latin Extended-A U+0100 U+017F
Latin Extended-B U+0180 U+024F
IPA Extensions U+0250 U+02AF
Spacing Modifier Letters U+02B0 U+02FF
Combining Diacritical Marks U+0300 U+036F
Greek and Coptic U+0370 U+03FF
Cyrillic U+0400 U+04FF
Cyrillic Supplementary U+0500 U+052F
Armenian U+0530 U+058F
Hebrew U+0590 U+05FF
Arabic U+0600 U+06FF
Syriac U+0700 U+074F
Arabic Supplement U+0750 U+077F
Thaana U+0780 U+07BF
NKo U+07C0 U+07FF
Samaritan U+0800 U+083F
Mandaic U+0840 U+08FF
Devanagari U+0900 U+097F
Bengali U+0980 U+09FF
Gurmukhi U+0A00 U+0A7F
Gujarati U+0A80 U+0AFF
Oriya U+0B00 U+0B7F
Tamil U+0B80 U+0BFF
Telugu U+0C00 U+0C7F
Kannada U+0C80 U+0CFF
Malayalam U+0D00 U+0D7F
Sinhala U+0D80 U+0DFF
Thai U+0E00 U+0E7F
Lao U+0E80 U+0EFF
Tibetan U+0F00 U+0FFF
Myanmar U+1000 U+109F
Georgian U+10A0 U+10FF
Hangul Jamo U+1100 U+11FF
Ethiopic U+1200 U+137F
Ethiopic Supplement U+1380 U+139F
Cherokee U+13A0 U+13FF
Unified Canadian Aboriginal Syllabics U+1400 U+167F
Ogham U+1680 U+169F
Runic U+16A0 U+16FF
Tagalog U+1700 U+171F
Hanunoo U+1720 U+173F
Buhid U+1740 U+175F
Tagbanwa U+1760 U+177F
Khmer U+1780 U+17FF
Mongolian U+1800 U+18AF
Unified Canadian Aboriginal Syllabics Extended U+18B0 U+18FF
Limbu U+1900 U+194F
Tai Le U+1950 U+197F
New Tai Lue U+1980 U+19DF
Khmer Symbols U+19E0 U+19FF
Buginese U+1A00 U+1A1F
Tai Tham U+1A20 U+1AFF
Balinese U+1B00 U+1B7F
Sundanese U+1B80 U+1BBF
Batak U+1BC0 U+1BFF
Lepcha U+1C00 U+1C4F
Ol Chiki U+1C50 U+1CCF
Vedic Extensions U+1CD0 U+1CFF
Phonetic Extensions U+1D00 U+1D7F
Phonetic Extensions Supplement U+1D80 U+1DBF
Combining Diacritical Marks Supplement U+1DC0 U+1DFF
Latin Extended Additional U+1E00 U+1EFF
Greek Extended U+1F00 U+1FFF
General Punctuation U+2000 U+206F
Superscripts and Subscripts U+2070 U+209F
Currency Symbols U+20A0 U+20CF
Combining Diacritical Marks for Symbols U+20D0 U+20FF
Letterlike Symbols U+2100 U+214F
Number Forms U+2150 U+218F
Arrows U+2190 U+21FF
Mathematical Operators U+2200 U+22FF
Miscellaneous Technical U+2300 U+23FF
Control Pictures U+2400 U+243F
Optical Character Recognition U+2440 U+245F
Enclosed Alphanumerics U+2460 U+24FF
Box Drawing U+2500 U+257F
Block Elements U+2580 U+259F
Geometric Shapes U+25A0 U+25FF
Miscellaneous Symbols U+2600 U+26FF
Dingbats U+2700 U+27BF
Miscellaneous Mathematical Symbols-A U+27C0 U+27EF
Supplemental Arrows-A U+27F0 U+27FF
Braille Patterns U+2800 U+28FF
Supplemental Arrows-B U+2900 U+297F
Miscellaneous Mathematical Symbols-B U+2980 U+29FF
Supplemental Mathematical Operators U+2A00 U+2AFF
Miscellaneous Symbols and Arrows U+2B00 U+2BFF
Glagolitic U+2C00 U+2C5F
Latin Extended-C U+2C60 U+2C7F
Coptic U+2C80 U+2CFF
Georgian Supplement U+2D00 U+2D2F
Tifinagh U+2D30 U+2D7F
Ethiopic Extended U+2D80 U+2DDF
Cyrillic Extended-A U+2DE0 U+2DFF
Supplemental Punctuation U+2E00 U+2E7F
CJK Radicals Supplement U+2E80 U+2EFF
Kangxi Radicals U+2F00 U+2FEF
Ideographic Description Characters U+2FF0 U+2FFF
CJK Symbols and Punctuation U+3000 U+303F
Hiragana U+3040 U+309F
Katakana U+30A0 U+30FF
Bopomofo U+3100 U+312F
Hangul Compatibility Jamo U+3130 U+318F
Kanbun U+3190 U+319F
Bopomofo Extended U+31A0 U+31BF
CJK Strokes U+31C0 U+31EF
Katakana Phonetic Extensions U+31F0 U+31FF
Enclosed CJK Letters and Months U+3200 U+32FF
CJK Compatibility U+3300 U+33FF
CJK Unified Ideographs Extension A U+3400 U+4DBF
Yijing Hexagram Symbols U+4DC0 U+4DFF
CJK Unified Ideographs U+4E00 U+9FFF
Yi Syllables U+A000 U+A48F
Yi Radicals U+A490 U+A4CF
Lisu U+A4D0 U+A4FF
Vai U+A500 U+A63F
Cyrillic Extended-B U+A640 U+A69F
Bamum U+A6A0 U+A6FF
Modifier Tone Letters U+A700 U+A71F
Latin Extended-D U+A720 U+A7FF
Syloti Nagri U+A800 U+A82F
Common Indic Number Forms U+A830 U+A83F
Phags-pa U+A840 U+A87F
Saurashtra U+A880 U+A8DF
Devanagari Extended U+A8E0 U+A8FF
Kayah Li U+A900 U+A92F
Rejang U+A930 U+A95F
Hangul Jamo Extended-A U+A960 U+A97F
Javanese U+A980 U+A9FF
Cham U+AA00 U+AA5F
Myanmar Extended-A U+AA60 U+AA7F
Tai Viet U+AA80 U+AAFF
Ethiopic Extended-A U+AB00 U+ABBF
Meetei Mayek U+ABC0 U+ABFF
Hangul Syllables U+AC00 U+D7AF
Hangul Jamo Extended-B U+D7B0 U+D7FF
High Surrogates U+D800 U+DB7F
High Private Use Surrogates U+DB80 U+DBFF
Low Surrogates U+DC00 U+DFFF
Private Use Area U+E000 U+F8FF
CJK Compatibility Ideographs U+F900 U+FAFF
Alphabetic Presentation Forms U+FB00 U+FB4F
Arabic Presentation Forms-A U+FB50 U+FDFF
Variation Selectors U+FE00 U+FE0F
Vertical Forms U+FE10 U+FE1F
Combining Half Marks U+FE20 U+FE2F
CJK Compatibility Forms U+FE30 U+FE4F
Small Form Variants U+FE50 U+FE6F
Arabic Presentation Forms-B U+FE70 U+FEFF
Halfwidth and Fullwidth Forms U+FF00 U+FFAF
Specials U+FFB0 U+FFFF
Part of Unicode (SMP) From To
Mahjong Tiles U+1F000 U+1F02F
Domino Tiles U+1F030 U+1F09F
Playing Cards U+1F0A0 U+1F2FF
Miscellaneous Symbols And Pictographs U+1F300 U+1F5FF
Emoticons U+1F600 U+1F67F
Transport And Map Symbols U+1F680 U+1FFFF
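
Because each plane holds 0x10000 code points, the plane of a character can be found by dividing its code by 0x10000.  The Python sketch below (illustrative only, not Small Basic code; plane_of and plane_name are hypothetical helper names) names only the two planes mentioned in this article.

```python
# Find which Unicode plane a code point belongs to.
# Illustrative Python; only the BMP and SMP are given names here.

PLANE_NAMES = {0: "BMP", 1: "SMP"}


def plane_of(code_point):
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16  # same as integer division by 0x10000


def plane_name(code_point):
    """Return 'BMP', 'SMP', or a generic label for other planes."""
    plane = plane_of(code_point)
    return PLANE_NAMES.get(plane, "plane {}".format(plane))


print(plane_name(0x0041))   # a Basic Latin letter
print(plane_name(0x1F0A1))  # a playing card character
```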

What is UTF-8

Small Basic source files, and files created or read with the File object, are encoded in UTF-8.  UTF-8 stands for UCS (Universal Character Set) Transformation Format 8-bit.  UTF-8 is one of the concrete formats used to encode Unicode.  The Latin letters lie between U+0041 and U+007A.  In ASCII, each of these characters fits in a single byte (8 bits).  To keep these common characters small, UTF-8 encodes every character between U+0000 and U+007F as a single byte; other characters take two to four bytes.  You don't need to care about UTF-8 in a Small Basic program, though.  Operations such as Text.GetLength, Text.GetSubText, Text.GetSubTextToEnd, and Text.GetIndexOf work on Unicode characters, not on UTF-8 bytes.
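
The difference between counting characters and counting UTF-8 bytes can be seen in a short Python sketch (an illustration only, not Small Basic code; utf8_length is a hypothetical helper name).

```python
# Number of bytes UTF-8 needs for one character.
# Illustrative Python, not Small Basic.

def utf8_length(ch):
    """Return how many bytes the UTF-8 encoding of ch occupies."""
    return len(ch.encode("utf-8"))


samples = [
    "A",           # U+0041, Basic Latin
    "\u00E9",      # U+00E9, Latin-1 Supplement
    "\u3042",      # U+3042, Hiragana
    "\U0001F0A1",  # U+1F0A1, Playing Cards (SMP)
]

for ch in samples:
    print("U+{:04X} takes {} byte(s)".format(ord(ch), utf8_length(ch)))
```

Characters up to U+007F take one byte, and the byte count grows with the code point, up to four bytes for characters outside the BMP.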

Characters for Game

The Lucida Sans Unicode font and some other fonts contain symbol characters that are suitable for game programs.  The picture of Character Map above shows these characters.  You can see them with the following instructions, and you may find other characters for your game.

  1. Run Character Map.
  2. Select Lucida Sans Unicode font.
  3. Check Advanced View check box.
  4. Select Unicode Subrange in Group by.
  5. Select Symbol & Dingbats for Unicode Subrange in Group by window.

The SMP table above also contains mahjong tiles, domino tiles, playing cards, and so on.

Tips about Character Set

Glyph Depends on Font

Be careful about fonts.  Unicode has a lot of characters, but many fonts don't contain all of them.  Different fonts may also show different glyphs for the same character.  Some fonts are not part of Windows itself but are installed along with certain applications.  A Small Basic program can be published, so a published program will run in different environments: some have Office fonts, some don't, and some have only Mac fonts.  See more detail about fonts here.

Character Set in TextWindow

In TextWindow, the character set is not Unicode.  It depends on the localization, such as ASCII in the US or Shift-JIS in Japan.

Characters in SMP

The Text.ConvertToLowerCase(), Text.ConvertToUpperCase(), Text.GetCharacter(), Text.GetCharacterCode(), Text.GetIndexOf(), Text.GetLength(), Text.GetSubText() and Text.GetSubTextToEnd() operations don't support SMP (U+10000 to U+1FFFF) characters.  SMP characters can, however, be written in a literal.  See the SMP Characters sample below for details.

Sample Code

ASCII Code Table

Program ID VQX212.  This program shows the characters from U+0000 to U+007F.

Hexadecimal Dump

Program ID XWT217.  This program reads a UTF-8 text file and simulates the UTF-8 encoding of the file.  Before running it, remove the automatically added comments from the File object lines.
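
For readers who want to see what such a dump looks like, here is a minimal Python sketch.  It is not the code of program XWT217; it only illustrates the idea of listing the UTF-8 bytes of a text in hexadecimal, and hex_dump is a hypothetical helper name.

```python
# Minimal hexadecimal dump of the UTF-8 bytes of a string.
# Illustrative Python; this is not the published Small Basic program.

def hex_dump(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join("{:02X}".format(b) for b in text.encode("utf-8"))


# 'A' is one byte; U+3042 (Hiragana letter A) is three bytes.
print(hex_dump("A\u3042"))
```

To dump a real file instead of a string, read it in binary mode and format each byte the same way.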

Symbol Samples

Program ID QZS270.  This program shows symbols in the Webdings and Wingdings fonts and in Unicode.

Get Character from Unicode

Program ID RPZ143-2.  This program shows the character for a given Unicode value in a text box, so you can copy the character.  This program also revealed that the Text object in Small Basic does not support the SMP: only the first 16 bits of the internal UTF-16 code are used.
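
In UTF-16, a character outside the BMP is stored as two 16-bit units called a surrogate pair, which is why an operation that looks at only one 16-bit unit cannot handle SMP characters.  The Python sketch below (illustrative only, not Small Basic code; to_surrogate_pair is a hypothetical helper name) computes the pair using the standard UTF-16 formula.

```python
# Split a code point above U+FFFF into its UTF-16 surrogate pair.
# Illustrative Python, not Small Basic.

def to_surrogate_pair(code_point):
    """Return (high, low) surrogate values for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F0A1)  # playing card, ace of spades
print("U+{:04X} U+{:04X}".format(high, low))
```

The two values fall in the High Surrogates and Low Surrogates ranges listed in the BMP table, which are reserved for exactly this purpose.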

SMP Characters

Program ID QBS151.  This program shows playing card characters in the SMP range of Unicode using an array of literals.
Program ID FSQ891.  This program shows mahjong tile characters in the SMP range of Unicode using an array of literals.
