What is Unicode?
Many of the writing systems used today are thousands of years old while digital text is still relatively new. In the early days of computer technology, digital representation of texts was mainly oriented around English. But nowadays much of global human interaction takes place online. People exchange information across linguistic and national borders. This change called for the development of a standardised structure for the exchange of texts in different alphabets and systems of writing. At the same time, technological advancements opened up new possibilities for displaying characters.
The emojis on your smartphone are a good example. These fun icons can be inserted with a special keyboard and used just like letters, almost as if they were a natural part of the alphabet. But how does that work? The Unicode standard forms the basis of it.
What is Unicode?
Unicode stands for universal character encoding. It is a standard for the binary coding of letters, numbers, and other characters and enables texts to be saved and processed in digital systems.
What makes Unicode special (and innovative at the time it came out) is that it’s not bound by the formats and encodings of any single human language. It was created to serve as a uniform standard for the representation of all human languages and writing systems.
Since Unicode 1.0 was released in 1991, the standard has successfully served its purpose. It is used in browsers and operating systems as a uniform format. Version 13.0, released in 2020 by the Unicode Consortium, boasts a repertoire of 143,859 characters.
The Unicode Consortium, a nonprofit organisation based in California, is responsible for the continued development of the standard. Leading tech companies like Adobe, Apple, Facebook, Google, IBM, Microsoft, Netflix, and SAP are members of the consortium. The Unicode character set is fully congruent with the Universal Coded Character Set (UCS), defined by the International Standard ISO/IEC 10646.
Technical basis for encoding characters
Text and writing are present everywhere in modern life. Reading and writing are among the first things we learn at school. So, it doesn’t come as a surprise that many people take the presence of digital text for granted. But how exactly does the technical representation of writing work? Let’s take a trip into the world of digital character coding.
Before we move on, it’s important to understand that on a deeper level all information in digital systems consists of sequences of zeros and ones. This is called binary representation. Binary code can be compared to an alphabet, in which there are only two 'letters' (zero and one). Each place in a series of zeros and ones is referred to as a bit.
The basic idea is to represent the characters of various alphabets as sequences of zeros and ones. This is how letters and numbers are encoded, as well as any other distinguishable states. All these characters are referred to as 'symbols'. The longer a sequence of zeros and ones is, the more symbols can be represented. The number of symbols that can be represented doubles with every bit you add to the sequence.
A concrete example: If we have binary 'words' that are two bits long, we can encode four numbers.
Two-bit word | Number |
---|---|
00 | 0 |
01 | 1 |
10 | 2 |
11 | 3 |
If we insert another bit at the beginning of the sequence, the number of possible bit words doubles. The new 'words' will consist of the bit sequences from above, with either a zero or one placed before them. So now we can encode eight numbers:
Three-bit word | Number |
---|---|
000 | 0 |
001 | 1 |
010 | 2 |
011 | 3 |
100 | 4 |
101 | 5 |
110 | 6 |
111 | 7 |
An eight-bit word is referred to as an octet or a byte.
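The doubling rule can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `bit_words` is made up for this example.

```python
from itertools import product

def bit_words(n_bits):
    """Return all binary words of the given length, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

for n_bits in (2, 3, 8):
    words = bit_words(n_bits)
    # Each additional bit doubles the number of distinct words: 2 ** n_bits
    print(n_bits, "bits:", len(words), "words, from", words[0], "to", words[-1])

# The number a word encodes is its value when read as a base-2 numeral
three_bit_table = {word: int(word, 2) for word in bit_words(3)}
print(three_bit_table)
```

With eight bits (one byte), the same rule gives 256 distinct words.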
For the sake of simplicity, we’ve demonstrated the encoding of numbers here. But the same principle comes into play for encoding letters or any other kind of character or state. See the following simplified example of the binary encoding of letters:
Three-bit word | Letter |
---|---|
000 | A |
001 | B |
010 | C |
Keep in mind that our explanation up to this point has had nothing to do with writing systems. We’re just talking about the internal model that’s used for the digital representation of characters. The graphic representation of a character is called a glyph. There are various glyphs for the same character, depending on the font used. Even within one font there can be several variants for a glyph - e.g., for bold and italics. The following table illustrates this encoding:
Binary representation | Decimal number | Encoded character | Glyph |
---|---|---|---|
1000001 | 65 | Capital 'A' in the Latin alphabet | A |
1100001 | 97 | Lowercase 'a' in the Latin alphabet | a |
0110000 | 48 | Arabic numeral '0' | 0 |
0111001 | 57 | Arabic numeral '9' | 9 |
11000100 | 196 | Capital 'Ä' | Ä |
11000001 | 193 | Capital 'Á' | Á |
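The rows of this table can be reproduced in Python, whose built-in `ord()` returns the number assigned to a character. The helper name `table_row` is an assumption made for this sketch.

```python
def table_row(character):
    """Return (binary representation, decimal number, character) for one character."""
    number = ord(character)
    # format(..., "b") writes the number in base 2 without leading zeros
    return format(number, "b"), number, character

for character in ["A", "a", "0", "9", "\u00c4", "\u00c1"]:  # the last two are Ä and Á
    print(*table_row(character), sep=" | ")
```

Note that `format(48, "b")` yields `110000`; the table above pads such values with a leading zero to seven bits.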
Terminology for encoding characters
Digitally encoding characters touches on several specific concepts. To provide a precise definition of Unicode, we’ve summed up some of the most important terms here:
Term | Meaning |
---|---|
Character set | The set of possible characters, e.g., the numerals '0-9', letters 'a-z', etc. |
Code point | A number assigned to a specific character within the code domain |
Coded character set | The assignment of each character to exactly one code point |
Character encoding | The process of converting each character into a technical structure such as binary representation |
Overview of common character encodings
Before the arrival of Unicode, there was a large variety of specific encodings. The norm was to establish a separate encoding for each language or language family. This frequently led to display errors or data inconsistencies when text was exchanged between systems. To limit this, character encodings were often modelled as a downward-compatible superset of an already-existing standard. In this way, the modern Unicode standard builds on the earlier encoding ISO Latin-1, which in turn is based on the ASCII encoding.
Character encoding | Bits per character | Possible characters | Character set |
---|---|---|---|
ASCII | 7 | 128 | Letters, numerals, and special characters from the US keyboard; control characters for teleprinters |
ISO Latin-1 (ISO 8859-1) | 8 | 256 | The 128 characters from ASCII, as well as 128 special characters from European languages |
Universal Coded Character Set 2 (UCS-2) | 16 | 65,536 | The characters from the 'Basic Multilingual Plane' (BMP); the 256 characters from ISO Latin-1 |
Universal Coded Character Set 4 (UCS-4) | 32 | 1,114,112 | The characters from BMP and further characters; 143,859 characters in Unicode Version 13.0; the first 256 characters from ISO Latin-1 |
UCS Transformation Format 8 Bit (UTF-8) | 8/16/24/32 | 1,114,112 | Characters from UCS-2 and UCS-4; the first 256 characters from ISO Latin-1 |
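The differences between these encodings become visible when the same characters are encoded in Python. The sketch below uses Python's built-in codecs; since Python ships no UCS-2 codec, `utf-16-be` stands in for it here (the two agree for characters in the BMP), and `utf-32-be` stands in for UCS-4. The helper name `byte_length` is made up for this example.

```python
def byte_length(character, encoding):
    """Return how many bytes the character occupies, or None if the encoding cannot represent it."""
    try:
        return len(character.encode(encoding))
    except UnicodeEncodeError:
        return None

samples = ["A", "\u00c4", "\u20ac", "\U0001F600"]  # A, Ä, €, and an emoji outside the BMP
encodings = ["ascii", "latin-1", "utf-8", "utf-16-be", "utf-32-be"]

for character in samples:
    lengths = {encoding: byte_length(character, encoding) for encoding in encodings}
    print(f"U+{ord(character):04X}", lengths)
```

UTF-8 grows from one to four bytes as the code point gets larger, while the fixed-width encodings either fail or always use the same number of bytes.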
The structure of the Unicode standard
The Unicode standard defines characters and corresponding code points for letters, syllable characters, ideograms, punctuation marks, special characters, and numerals. It supports not only the Latin alphabet but also the Greek, Cyrillic, Arabic, Hebrew, and Thai alphabets, as well as Japanese (Katakana, Hiragana), Chinese, and Korean (Hangul) writing systems. In addition, there are also mathematical, commercial, and technical characters, and historical control characters for teleprinters.
The characters are summarised in a series of character tables. We’ll give you an overview of the most common character tables.
Writing systems in the Unicode standard
Character table | Selection of the alphabets contained |
---|---|
European writing systems | Armenian, Georgian, Greek, Latin |
African writing systems | Ethiopic, Egyptian hieroglyphs, Coptic |
Writing systems of the Middle East | Arabic, Hebrew, Syriac |
Central Asian writing systems | Mongolian, Tibetan, Old-Turkic |
South Asian writing systems | Brahmi, Tamil, Vedic Sanskrit |
Southeast Asian writing systems | Khmer, Rohingya, Thai |
Writing systems of Indonesia and Oceania | Balinese, Buginese, Javanese |
East Asian writing systems | CJK (Chinese, Japanese, Korean), Hangul (Korean), Hiragana (Japanese) |
American writing systems | Cherokee, Canadian Aboriginal syllabics, Osage |
Symbols and punctuation marks as Unicode characters
Character table | Selection of the characters contained |
---|---|
Notation systems | Braille patterns, musical notes, Duployan shorthand |
Punctuation marks | Punctuation marks from English, other European languages, and CJK |
Alphanumeric symbols | Mathematical letters, enclosed letters and numbers |
Technology symbols | Symbols from the programming language APL, symbols for optical character recognition |
Numbers and numerals | Maya numerals, Ottoman Siyaq numbers, numerals from Sumerian cuneiform |
Mathematical symbols | Arrows, mathematical operations, geometrical forms |
Emojis and pictograms | Emoticons, dingbats, further pictograms |
Other symbols | Alchemical symbols, currency symbols, chess, dominoes and mahjong characters |
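Every character in these tables carries an official name and a general category. As a small sketch, Python's standard `unicodedata` module can look both up; the sample characters are chosen arbitrarily for illustration.

```python
import unicodedata

samples = [
    "\u20ac",  # currency symbol
    "\u2192",  # arrow
    "\u265e",  # chess piece
    "\u266a",  # musical note
    "\u2813",  # Braille pattern
]

for character in samples:
    name = unicodedata.name(character)
    category = unicodedata.category(character)  # e.g. "Sc" = currency symbol, "So" = other symbol
    print(f"U+{ord(character):04X}", category, name)
```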
What is Unicode used for?
The Unicode standard serves as the universal foundation for processing, saving, and sharing text in every language. Most modern software components that work with text, such as libraries, protocols, and databases, are based on Unicode. In this section, we’ll illustrate the wide range of use cases for Unicode.
Operating systems
Unicode is the internal standard for representing text in most operating systems. Some operating systems, such as Apple’s macOS, allow the use of Unicode characters in file names.
Websites
The Unicode encoding UTF-8 has become the standard for encoding HTML documents. In 2016, more than 80 percent of the most visited websites in the world used UTF-8 for saving and representing their HTML documents. For domain names that contain non-ASCII Unicode characters, Punycode is now the established way of representing them with ASCII characters only.
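To get a feel for what Punycode does, the following sketch uses the `idna` and `punycode` codecs from Python's standard library. The domain name is just an example, and the built-in `idna` codec implements the older IDNA 2003 rules, so treat this as an illustration rather than a production-grade conversion.

```python
domain = "m\u00fcnchen.de"  # "münchen.de"

# The idna codec converts each non-ASCII label to Punycode and adds the "xn--" prefix
ascii_domain = domain.encode("idna")
print(ascii_domain)

# Decoding restores the original Unicode form
restored = ascii_domain.decode("idna")
print(restored == domain)

# The raw Punycode of a single label, without the "xn--" prefix
print("m\u00fcnchen".encode("punycode"))
```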
Programming languages
Many modern programming languages use Unicode as a basis for processing text. A newer development is the option to use Unicode characters to name variables and functions. This is possible in, for example, ECMAScript/JavaScript. The following example shows how Unicode characters are used in code:
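// Any Unicode letter may appear in an identifier; the two CJK characters are illustrative
let 是 = true;
let 否 = false;
let bool_var = 是;
if (bool_var === 是) {
  // …
}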
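The same idea carries over to Python 3, which also accepts Unicode letters in identifiers. The names below are invented for this sketch; note that symbols such as check marks are not letters and are therefore rejected.

```python
import math

# Greek and accented Latin letters are valid in Python identifiers
π = math.pi

def fläche(radius):
    """Return the area of a circle ("Fläche" is German for area)."""
    return π * radius ** 2

print(fläche(2))

# str.isidentifier() tells whether a string would be accepted as a name
print("π".isidentifier())       # letters are allowed
print("\u2713".isidentifier())  # the check mark is a symbol, not a letter
```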
Databases
The popular and widely used database MySQL supports the complete Unicode character set with the character encoding 'utf8mb4'. The character encoding 'utf8', on the other hand, stores at most three bytes per character. Characters whose UTF-8 encoding needs four bytes, which includes most emojis, cannot be stored with it.
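Whether a given text is affected by the three-byte limit can be checked before it reaches the database. The following is a minimal sketch that does not talk to MySQL at all; the function name `fits_in_three_bytes` is hypothetical.

```python
def fits_in_three_bytes(text):
    """Return True if every character needs at most three bytes in UTF-8."""
    return all(len(character.encode("utf-8")) <= 3 for character in text)

print(fits_in_three_bytes("Gr\u00fc\u00dfe \u20ac"))  # German text with a euro sign
print(fits_in_three_bytes("Hi \U0001F600"))           # text containing an emoji
```

Characters that need four bytes are exactly those outside the Basic Multilingual Plane, i.e. with code points above U+FFFF.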
Fonts
Fonts contain the glyphs for the graphic representation of text. Due to the large number of Unicode characters, there is no single font that contains every character. In fact, there are only a few fonts that even cover the Basic Multilingual Plane in its entirety. Here are a few examples:
Unicode font | Glyphs | License |
---|---|---|
Noto | Approx. 65,000 | Open Font License |
Sun-ExtA/B | Approx. 50,000 | Freeware |
Unifont | Approx. 63,000 | GNU GPL |
Code2000 | Approx. 63,000 | Shareware |
How is Unicode implemented?
In many cases, people use Unicode without ever being aware of it. Digital text usually appears in Unicode and can be copied, pasted, and edited by users. Sometimes a user needs to insert a specific Unicode letter into a text. There are various ways to do this, as we explain below.
Special onscreen keyboards
Special onscreen keyboards (also called virtual keyboards or soft keyboards) are probably the most common way to insert Unicode characters into text. On keyboards on mobile devices, you can easily switch between different languages and alphabets. In each case, the keyboard layout will change, but all characters will come from the Unicode repertoire. The characters can be mixed and combined.
Emojis are a good example of this: they are perfectly normal Unicode characters, just like numbers and letters. The display of emojis is independent of their internal modelling, which is why each operating system displays emojis slightly differently.
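That an emoji behaves like any other character can be seen in a short Python sketch. The emoji is written as an escape sequence here so that the example does not depend on the editor's font.

```python
import unicodedata

text = "Hi \U0001F600"  # "Hi " followed by one emoji
emoji = text[3]

print(len(text))                 # the emoji counts as a single character
print(f"U+{ord(emoji):X}")       # its code point, like that of any letter
print(unicodedata.name(emoji))   # its official Unicode name
print(emoji.encode("utf-8"))     # four bytes in UTF-8
```

Only the code point is stored in the text; which picture appears on screen depends on the emoji font of the device.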
Onscreen keyboards are also used on desktop computers. In Windows, macOS, and many Linux distributions, virtual keyboards can be opened and display various Unicode characters, depending on the language you’ve chosen. Since the number of keys is limited, not all Unicode letters will be displayed. Rather, you’ll see a language-specific collection of the most common characters.
Unicode character tables
Aside from onscreen keyboards, Unicode character tables are probably one of the most convenient ways to access Unicode letters. A quick review: A coded character set is the set of all the characters and their corresponding unique code points. This kind of structure can easily be displayed as a table, which in the case of Unicode are called code charts. These tables are used to copy specific characters, which can then be pasted elsewhere. They can also be consulted by the end user to find out a code point for use as a numeric character reference - more on this in the next section.
Many desktop operating systems also have a Unicode character table, which provides an overview of all the available Unicode characters, including code point, description, and glyphs. A character can be copied and pasted in one click. You can also make a character table yourself with just a few lines of code. Later in this article, you’ll find an example of this using Python.
Numeric character reference
The assignment of characters to code points is a crucial part of the Unicode standard. If you know the code point for a character, you’ll be able to use the character in various contexts. In Windows, Unicode symbols are inserted using a special key combination on the normal hard keyboard. Please note that the code point number normally has to be entered in hexadecimal notation.
Programmers usually require the numeric character reference. The hexadecimal notation of a code point allows a Unicode character to be written using only characters from the ASCII character set. We’ll show you how that process works in HTML; the principle is similar in Python, C++, and other languages, although the exact escape syntax differs.
The general formula for inserting a character using its numeric reference includes the reference itself as well as an opening and closing term. In HTML documents, the numeric reference is opened with '&#x' and closed with ';'. The hexadecimal code point, which can have up to six digits, is inserted between these two terms, resulting in the pattern '&#xNNNN;'.
For example, to insert the copyright symbol '©' into an HTML document, you should follow these steps:
- Look for the character in the Unicode table.
- Check the code point for the character. In this example, the code point is 'U+00A9' in hexadecimal notation.
- Create the character reference and insert it into the HTML source text or a markdown document.
In our case, we’ll insert '&#xA9;', which will give us the rendered character “©”.
There is also a less common method, which uses code points in decimal rather than hexadecimal notation. In this case, the numerical reference will begin with '&#' (without the 'x') and end as before with ';'. The code point in decimal notation comes in between. For the copyright symbol, the result would be '&#169;'.
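Both notations can be generated and checked with Python's standard library. In this sketch, `numeric_reference` is a helper name invented for the example, while `html.unescape` is the standard function for resolving references.

```python
import html

def numeric_reference(character, hexadecimal=True):
    """Build the HTML numeric character reference for a single character."""
    code_point = ord(character)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"

hex_reference = numeric_reference("\u00a9")                     # copyright sign
dec_reference = numeric_reference("\u00a9", hexadecimal=False)
print(hex_reference, dec_reference)

# html.unescape() resolves both forms back to the same character
print(html.unescape(hex_reference) == html.unescape(dec_reference) == "\u00a9")
```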
Use the Unicode character Inspector to quickly look up the various codes for a symbol.
Named character entities
Since writing Unicode characters as numerical references isn’t intuitive for humans, there is also another method - named character entities. They’re defined for frequently used characters and assign each character a short, memorable name. A named character entity begins with an ampersand '&' and ends with a semicolon ';'. The name is placed in between the two symbols without any spaces. So, for example, to make the copyright sign '©' in HTML, you can simply write '&copy;'.
The complete list of named character references can be found in the HTML Standard.
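Python's standard `html.entities` module contains a mapping between entity names and code points, which makes the relationship easy to explore. A brief sketch:

```python
import html
from html.entities import codepoint2name, name2codepoint

# From entity name to code point and character
code_point = name2codepoint["copy"]
print(code_point, chr(code_point))

# From code point back to the entity name
entity = f"&{codepoint2name[169]};"
print(entity)

# html.unescape() resolves named entities just like numeric references
print(html.unescape("&copy; 2020") == "\u00a9 2020")
```

`name2codepoint` covers the entities defined in HTML 4; the larger HTML5 set is available as `html.entities.html5`.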
Programming languages
Most programming languages contain fundamental functions that can be used to transform characters and code points. These functions are often called 'ord(Character)' and 'chr(Code point)'. The functions work together as follows:
‘chr(ord(Character)) == Character’
Note that it’s always possible to ascertain which code point corresponds to a character. The other way around will only work for numbers that are defined as code points in the coded character set. In the following example, we show how these functions generally work in Python.
# Ascertain decimal code point of a character
ord('A') # `65`
# Ascertain hexadecimal code point of a character
hex(ord('A')) # `0x41`
# Ascertain which character corresponds to a code point
chr(65) # `'A'`
chr(0x41) # `'A'`
chr(0x110001) # Error (ValueError), since the largest valid code point is `0x10FFFF`
You can easily create a character table for code points from the Unicode character set using these functions. Just iterate the code points and output the corresponding character. This can be done with just a few lines of code in Python:
# Start the range at 32, since smaller values are non-printable control characters
# Output the ASCII character set
for code_point in range(32, 128):
    # Output code point in decimal and hexadecimal notation including the corresponding character
    print(code_point, hex(code_point), chr(code_point))
# Output ISO Latin-1 (this range includes the ASCII characters again)
for code_point in range(32, 256):
    print(code_point, hex(code_point), chr(code_point))
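The table becomes more informative when the official character names are added. The sketch below uses the standard `unicodedata` module; the function name `character_table` and the choice of the Greek capital letters are assumptions made for illustration. It also shows that not every number in a range is an assigned character.

```python
import unicodedata

def character_table(start, stop):
    """Return (code point, character, name) rows for the half-open range [start, stop)."""
    rows = []
    for code_point in range(start, stop):
        character = chr(code_point)
        # unicodedata.name() raises ValueError for unnamed code points unless a default is given
        name = unicodedata.name(character, "<no name>")
        rows.append((code_point, character, name))
    return rows

# Greek capital letters; U+03A2 is an unassigned gap in this range
for code_point, character, name in character_table(0x391, 0x3AA):
    print(f"U+{code_point:04X}", character, name)
```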
Program library ICU
The International Components for Unicode (ICU) are collected in a program library made available by the Unicode Consortium. The library is published with an open source license and can be used on many different operating systems. The software facilitates programmatic internationalisation (often abbreviated as 'i18n'). Its applications include:
- Processing Unicode texts
- Supporting regular expressions in Unicode
- Parsing and formatting calendar data, date and time information, numbers, currencies, and messages
There are two versions of the ICU library:
- 'icu4c' is written in C/C++ and provides an API for these languages.
- 'icu4j' is written in Java and provides an API for this language.
The implementation of the components delivers consistent results independent of the platform being used.
Charset meta attribute in HTML heads
Most HTML documents use the character encoding UTF-8. To ensure that a page is displayed without character errors, you should place a 'charset' meta specification in the head of the HTML document. This instructs the browser that the requested document should be interpreted as UTF-8. See the following example:
<head>
<meta charset="utf-8">
<!-- Additional head elements -->
</head>
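What happens without a correct charset declaration can be simulated in Python: bytes written as UTF-8 are decoded with a different encoding. In this sketch, `latin-1` stands in for whatever fallback encoding a browser might guess.

```python
original = "\u00a9 2020"  # "© 2020"

# The server stores and sends the page as UTF-8
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)

# A client that wrongly assumes Latin-1 turns the two-byte sequence into two characters
misread = utf8_bytes.decode("latin-1")
print(len(original), len(misread))

# Decoding with the declared encoding restores the text
print(utf8_bytes.decode("utf-8") == original)
```

The misread text starts with a stray 'Â' before the copyright sign, a typical symptom of this kind of encoding mismatch.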
Twitter fonts
The popular social media platform Twitter doesn’t allow any text formatting in its tweets, profiles, or usernames, limiting the creative possibilities for users. Resourceful developers have found a way around this: Twitter uses Unicode, which means that it’s possible to use special characters to compose a text that looks formatted. Especially useful are characters that are similar to letters from the Latin alphabet. The easiest way to do this is using a Twitter Fonts Generator.
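How such a generator might work can be sketched in Python. The example maps Latin letters to the 'Mathematical Bold' letters of Unicode, which start at U+1D400 (capitals) and U+1D41A (small letters). The function name `to_math_bold` is invented here, and real generators likely support many more styles.

```python
import unicodedata

BOLD_CAPITAL_A = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_SMALL_A = 0x1D41A    # MATHEMATICAL BOLD SMALL A

def to_math_bold(text):
    """Replace ASCII letters with their mathematical bold look-alikes."""
    result = []
    for character in text:
        if "A" <= character <= "Z":
            result.append(chr(BOLD_CAPITAL_A + ord(character) - ord("A")))
        elif "a" <= character <= "z":
            result.append(chr(BOLD_SMALL_A + ord(character) - ord("a")))
        else:
            result.append(character)  # digits, spaces, etc. stay unchanged
    return "".join(result)

bold = to_math_bold("Hello World")
print([f"U+{ord(character):X}" for character in bold])

# Compatibility normalisation (NFKC) maps the look-alikes back to plain letters
print(unicodedata.normalize("NFKC", bold))
```

Because these are different characters rather than formatted letters, such text may not be found by search and can be read out oddly by screen readers.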