Emoji.prototype.length — a tale of characters in Unicode
Emoji are the base for text-driven conversations these days. Without these tiny little symbols, a lot of chat conversations today would end in awkward situations and misunderstandings. I still remember the old days when SMS were a thing.
Text chats that don’t include smileys often lead to “Are you kidding?” messages, sent just to make sure a stupid joke isn’t taken too seriously. Everybody quickly learned that humor and sarcasm (we should be less sarcastic anyway) are not easily conveyed using only written characters. At some point the first Emoji appeared, and they quickly became a fundamental component of everyone’s text-based conversations.
Even though I use Emoji every day, I never questioned how they work technically. They surely have to have a connection to Unicode somehow, but I had no idea about the actual functionality. And honestly, I didn’t care too much...
[...'👨‍👩‍👦'] // ["👨", "", "👩", "", "👦"]
'👨‍👩‍👦'.length // 8
Okay, using the spread operator on a string like that didn’t get me excited, but the fact that this one visible symbol is split into three symbols and two empty strings puzzled me. Seeing the string property length return 8 confused me even more, as there were five entries in the spread array and not eight.
Unicode is an international computing industry standard. It maps each letter, character, or symbol to a numeric value. Thanks to Unicode, we can share documents that include, for example, special German characters like ß, ä, ö with people on systems that don’t use these characters. Thanks to Unicode, encoding works across different platforms and environments.
The Unicode code space contains 1,114,112 different code points, and these code points are usually written as U+ followed by a hexadecimal number. The range of Unicode code points goes from U+0000 to U+10FFFF.
These more than one million code points are divided into 17 so-called “planes”, and each plane holds 65,536 code points. The most important plane is the “Basic Multilingual Plane” (BMP), which ranges from U+0000 to U+FFFF.
The BMP includes characters for almost all modern languages plus a lot of different symbols. The other 16 planes are called “Supplementary Planes” and have several different use cases like — you might have guessed it — the definition of most of the Emoji symbols.
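To make the plane layout a bit more tangible, here is a minimal sketch that formats a code point in U+ notation and works out which plane it belongs to. The helper names formatCodePoint and planeOf are made up for this illustration and are not part of any library.

```javascript
// Hypothetical helpers for illustration only.

// Format a numeric code point in the usual U+XXXX notation.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Each plane holds 0x10000 (65,536) code points, so the plane index
// is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

// ß and the snowman live in the BMP (plane 0), the man Emoji does not.
const samples = ['\u00DF', '\u2603', '\u{1F468}'];
for (const symbol of samples) {
  const codePoint = symbol.codePointAt(0);
  console.log(symbol, formatCodePoint(codePoint), 'plane', planeOf(codePoint));
}
```

Plane 0 is the BMP, so everything that reports a plane of 1 or higher sits in one of the Supplementary Planes.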
An Emoji as we know it today is defined by at least one code point in the Unicode range. When looking at all defined Emoji listed in the Full Emoji Data list, you’ll see that there are a lot of them. And by saying “a lot”, I really mean a lot. You might ask yourself how many different Emoji we have defined in Unicode right now. The answer to this question is — as so often in computer science — “It depends”, and we have to understand them first to answer it.
As mentioned, an Emoji is defined by at least one code point. This means that some Emoji are a combination of several different Emoji and code points. These combinations are called sequences. Thanks to sequences, it is possible, for example, to modify neutral Emoji (usually displayed with yellow skin color) and make them fit your personal preference.
Modifier sequences for diversity in skin color
I still remember when I first noticed in a chat conversation that I could modify the “thumbs up” Emoji to match my own skin tone. It gave me a feeling of inclusion, and I felt way more connected to that thumb symbol that was all over my messages.
In Unicode, five modifiers can be used to alter the neutral Emoji of a human, resulting in a variation with the desired skin tone. The modifiers range from U+1F3FB to U+1F3FF and are based on the Fitzpatrick scale.
By using these, we can transform a neutral Emoji to one with a more expressive skin tone. So let’s look at an example here:
// U+1F467 + U+1F3FD
👧 + 🏽 > 👧🏽
When we take the girl Emoji, which has the code point U+1F467, and put a skin tone modifier (U+1F3FD) after it, we automatically get a girl with an adjusted skin tone on systems that support these sequences.
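In JavaScript, the same combination can be sketched with String.fromCodePoint. The constant names and the codePointsOf helper below are made up for this example.

```javascript
// Code points taken from the example above; the names are illustrative.
const GIRL = 0x1F467;
const SKIN_TONE_MODIFIER = 0x1F3FD; // one of the five skin tone modifiers

const girl = String.fromCodePoint(GIRL);
const girlWithSkinTone = String.fromCodePoint(GIRL, SKIN_TONE_MODIFIER);

// List the code points of a string in U+ notation.
// Array.from iterates a string code point by code point.
function codePointsOf(text) {
  return Array.from(text, (symbol) =>
    'U+' + symbol.codePointAt(0).toString(16).toUpperCase()
  );
}

console.log(girl, codePointsOf(girl));
console.log(girlWithSkinTone, codePointsOf(girlWithSkinTone));
```

Whether the second string renders as one symbol or as two depends on the font and platform, but the underlying code points are the same either way.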
ZWJ sequences for even more diversity
Skin color isn't the only thing people can relate to. When we’re looking back at the family example, it’s quite obvious that not every family consists of a man, a woman, and a boy.
Unicode includes a single code point for the neutral family (U+1F46A - 👪), but that’s not what every family looks like. We can create different families with a so-called Zero-Width-Joiner sequence.
And here is how it works: there is a code point called zero-width-joiner (U+200D). This code point acts like glue, indicating that two code points should be represented as one single symbol when possible.
Thinking about this sequence logically, what could we glue together to display a family? That’s a simple one: two grown-ups and a kid. By using a Zero-Width-Joiner sequence, diverse families can be represented easily.
// neutral family
// U+1F46A
> 👪

// ZWJ sequence: family (man, woman, boy)
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
// 👨 + U+200D + 👩 + U+200D + 👦
> 👨‍👩‍👦

// ZWJ sequence: family (woman, woman, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧
> 👩‍👩‍👧

// ZWJ sequence: family (woman, woman, girl, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧 + U+200D + 👧
> 👩‍👩‍👧‍👧
Looking at all the defined sequences, you’ll see that there are even more variants, for example one father with two girls. Unfortunately, support for these isn’t really good at the time of writing, but Zero-Width-Joiner sequences degrade gracefully: the single code points are displayed instead. This helps to preserve the meaning of the combined symbol.
// ZWJ sequence: family (man, girl, girl)
// U+1F468 + U+200D + U+1F467 + U+200D + U+1F467
// 👨 + U+200D + 👧 + U+200D + 👧
> 👨‍👧‍👧 -> single symbol not supported yet
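The gluing can be sketched in a few lines of JavaScript. This is only an illustration: joinWithZwj and the constant names are hypothetical, and whether the result renders as a single symbol still depends on the platform.

```javascript
const ZWJ = '\u200D'; // zero-width joiner

// Code points from the examples above; the constant names are illustrative.
const MAN = 0x1F468;
const WOMAN = 0x1F469;
const BOY = 0x1F466;
const GIRL = 0x1F467;

// Hypothetical helper: turn each code point into a string
// and glue the pieces together with the zero-width joiner.
function joinWithZwj(...codePoints) {
  return codePoints.map((codePoint) => String.fromCodePoint(codePoint)).join(ZWJ);
}

console.log(joinWithZwj(MAN, WOMAN, BOY));         // family (man, woman, boy)
console.log(joinWithZwj(WOMAN, WOMAN, GIRL, GIRL)); // family (woman, woman, girl, girl)
console.log(joinWithZwj(MAN, GIRL, GIRL));         // family (man, girl, girl)
```

On a system without support for a given sequence, the same string simply shows up as the individual Emoji next to each other, which is the graceful degradation described above.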
Another cool thing is that these principles don’t apply to the family Emoji only. Take, for example, the famous David Bowie Emoji (the real name of this Emoji is actually “man singer”). This one is also a ZWJ sequence, consisting of a man (U+1F468), a ZWJ, and a microphone (U+1F3A4).
And you might have guessed it: exchanging the man (U+1F468) with a woman (U+1F469) results in a female singer (or a female version of David Bowie). Skin tone modifiers can be brought in as well, for example to display a black female singer. Great stuff!
// ZWJ sequence: woman singer
// U+1F469 + U+1F3FF + U+200D + U+1F3A4
// 👩 + 🏿 + U+200D + 🎤
> 👩🏿‍🎤 -> single symbol not supported yet
Unfortunately support for these new sequences is also not very good at the time of writing.
Various counts of Emoji
The answer to the question of how many Emoji are out there really depends on what you count as an Emoji. Is it the number of different code points that can be used to display Emoji? Or do we count all the different Emoji variations that can be displayed?
When we count all the different Emoji that can be displayed (including all sequences and variations), we come up with an overall number of 2198. In case you’re interested in the counting, there is a complete section about that topic on unicode.org.
In addition to the “How to count” question, new Emoji and Unicode characters are constantly added to the spec, which makes it hard to keep track of the overall number.
'ﾂ'.length // 1 -> U+FF82
'⛷'.length // 1 -> U+26F7
'☃'.length // 1 -> U+2603
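When using the length property on a JavaScript string, the result is the number of 16-bit code units in the string, not the number of visible symbols. Every code point in the BMP fits into a single code unit, which is why each of the examples above returns 1. Code points outside the BMP don’t fit into 16 bits.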
Surrogate pairs to the rescue
It is possible to combine two code points defined in the BMP to express another code point that lies outside of the first 65 thousand code points. This combination is called a surrogate pair.
The code points from U+D800 to U+DBFF are reserved for the so-called high or “leading” surrogates, and those from U+DC00 to U+DFFF for the low or “trailing” surrogates.
These two code points always have to be used in pairs beginning with the high surrogate followed by the low surrogate. Then a specific formula will be applied to decode the out-of-range code points.
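As a rough sketch of the encoding direction, the following splits a code point above U+FFFF into its two surrogates, using the standard UTF-16 ranges U+D800 to U+DBFF (high) and U+DC00 to U+DFFF (low). The function names are made up for this illustration; JavaScript does all of this internally, so you would rarely write it yourself.

```javascript
// Hypothetical helpers for illustration only.

// Range checks for a single 16-bit code unit.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Split a code point outside the BMP into [high, low] surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only code points from U+10000 to U+10FFFF use surrogate pairs');
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F468); // the man Emoji
console.log(high.toString(16), low.toString(16)); // d83d dc68
console.log(isHighSurrogate(high), isLowSurrogate(low)); // true true
```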
Let’s have a look at an example here:
'👨'.length // 2
'👨'.charCodeAt(0) // 55357 -> U+D83D // returns the value of the leading surrogate
'👨'.charCodeAt(1) // 56424 -> U+DC68
'👨'.codePointAt(0) // 128104 -> U+1F468 // returns the combined code point of the surrogate pair
'👨'.codePointAt(1) // 56424 -> U+DC68
The neutral man Emoji has the code point U+1F468, which lies outside the BMP, so JavaScript stores it as a surrogate pair. There are two methods to inspect it. The first is charCodeAt, which returns the value of each individual surrogate when you hit a surrogate pair. The second method is codePointAt, which returns the code point of the combined surrogate pair when you hit the leading surrogate, and the value of the trailing surrogate when you hit the trailing one.
You think this is horribly confusing? I’m with you on that one and highly recommend reading the MDN articles on these two methods carefully.
Let’s have a deeper look at the man Emoji and do the math. Using charCodeAt, we can retrieve the values of the individual code units that make up the surrogate pair.
The first entry has the value 55357, which maps to D83D in hexadecimal. This is the high surrogate. The second entry has the value 56424, which maps to DC68, the low surrogate. It is a classic surrogate pair that, after applying the formula, results in 128104, which maps to the man Emoji.
// hexadecimal
0x1F468 = (0xD83D - 0xD800) * 0x400 + 0xDC68 - 0xDC00 + 0x10000
// decimal
128104 = (55357 - 55296) * 1024 + 56424 - 56320 + 65536
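The formula translates almost literally into JavaScript. The function name fromSurrogatePair is hypothetical; the check against codePointAt is only there to confirm that the arithmetic agrees with what the engine does itself.

```javascript
// Hypothetical helper: decode a surrogate pair with the formula above.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const man = '\u{1F468}'; // the man Emoji, written as an escape sequence

const high = man.charCodeAt(0); // 55357 (0xD83D)
const low = man.charCodeAt(1);  // 56424 (0xDC68)
const decoded = fromSurrogatePair(high, low);

console.log(decoded);                        // 128104
console.log(decoded.toString(16));           // 1f468
console.log(decoded === man.codePointAt(0)); // true
```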
With the knowledge of code units, we can now make sense of the puzzling length results.
Let’s get back to Wes’ initial example then.
// ZWJ sequence: family (man, woman, boy)
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
[...'👨‍👩‍👦'] // ["👨", "", "👩", "", "👦"]
'👨‍👩‍👦'.length // 8

// neutral family
// U+1F46A
[...'👪'] // ['👪']
'👪'.length // 2
The Emoji family we see here is a ZWJ sequence consisting of a man, a woman, and a boy. The spread operator iterates over code points, which is why it yields five entries. The “empty strings” are not empty strings but Zero-Width-Joiners. Calling length on the string returns 2 for each Emoji and 1 for each ZWJ, resulting in 3 × 2 + 2 × 1 = 8.
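To double-check that breakdown, here is a small sketch that lists every code point of the family together with the number of code units it occupies. describeCodePoints is a hypothetical helper, and the family is written with escape sequences so that the joiners are visible in the source.

```javascript
// family (man, woman, boy) as a ZWJ sequence
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// Hypothetical helper: one entry per code point, with its U+ notation
// and the number of UTF-16 code units it needs.
function describeCodePoints(text) {
  return Array.from(text, (symbol) => ({
    codePoint: 'U+' + symbol.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    codeUnits: symbol.length,
  }));
}

const parts = describeCodePoints(family);
const totalCodeUnits = parts.reduce((sum, part) => sum + part.codeUnits, 0);

console.log(parts);
console.log(parts.length);   // 5 code points
console.log(totalCodeUnits); // 8 code units
console.log(family.length);  // 8
```

The three Emoji each report two code units and the two joiners one each, which adds up to the 8 that length returns.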
I really enjoyed digging into Unicode. In case you’re also interested in this topic, I want to recommend the @fakeunicode Twitter account. It always shares great examples of what Unicode is capable of. And did you know that there is even a podcast and a conference about Emoji? I’ll continue looking at them, because I think it’s super interesting to learn more about these tiny symbols we use daily and maybe you’re interested, too.