There is a Unicode mode in JavaScript regular expressions

This post is part of my Today I learned series in which I share all my learnings regarding web development.

Unicode is such an interesting topic, and it feels like I discover something new about it every day. Today was one of those days. I was reading a blog post and came across the u flag, which was new to me. In the end, I found myself reading Axel's chapter on the topic in "Exploring ES6", which, as usual, covers everything.

So what's this u flag?

In JavaScript, we have the "problem" that strings are represented in UTF-16, which means that not every character fits into a single code unit. Characters outside the Basic Multilingual Plane need two code units, a so-called surrogate pair. This leads to surprising length values for some strings, and it gets tricky once regular expressions are involved. It raises the question: should . match a code point that needs two code units?

This is exactly where the u comes into play.

Let's have a look at an example:

const emoji = '\u{1F60A}'; // "smiling face with smiling eyes"
emoji.length               // 2 -> it's a surrogate pair
/^.$/.test(emoji)          // false
/^.$/u.test(emoji)         // true
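
To make it clearer why the version without u fails, here is a small sketch that looks at the same emoji as code units and as code points. The variable names are only for illustration. Without the flag the dot consumes a single code unit, so it takes two dots to cover the whole string.

```javascript
// Illustrative sketch: U+1F60A is stored as two UTF-16 code units.
const smiley = '\u{1F60A}';

const codeUnits = smiley.length;        // 2 -> length counts code units
const codePoints = [...smiley].length;  // 1 -> the string iterator walks code points

// The two halves of the surrogate pair, as hex strings.
const high = smiley.charCodeAt(0).toString(16); // 'd83d'
const low = smiley.charCodeAt(1).toString(16);  // 'de0a'

// Without u, each dot matches one code unit, so two dots are needed.
const twoDotsWithoutU = /^..$/.test(smiley); // true
// With u, a single dot already matches the whole code point.
const twoDotsWithU = /^..$/u.test(smiley);   // false
```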

This mode also lets you use code point escape sequences (\u{...}) in regular expressions, which comes in really handy because you no longer have to spell out the surrogate pairs yourself.

const emoji = '\u{1F42A}';  // "camel"
/\u{1F42A}/.test(emoji);    // false
/\uD83D\uDC2A/.test(emoji); // true
/\u{1F42A}/u.test(emoji);   // true
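
Code point escapes can also be used inside character classes when the u flag is set. The following is a minimal sketch, assuming we want to check whether a single character falls into the Unicode "Emoticons" block (U+1F600 to U+1F64F). The range and the variable names are my own choice for illustration and not taken from the example above.

```javascript
// Illustrative sketch: a character class range built from code point escapes.
const camel = '\u{1F42A}';
const smiley = '\u{1F60A}';

// With u, the range spans whole code points, not surrogate halves.
const emoticon = /^[\u{1F600}-\u{1F64F}]$/u;

const smileyIsEmoticon = emoticon.test(smiley); // true
const camelIsEmoticon = emoticon.test(camel);   // false, U+1F42A is outside the range

// A match with u returns the complete surrogate pair as one match.
const match = camel.match(/\u{1F42A}/u);
const matchedUnits = match[0].length;                         // 2
const matchedCodePoint = match[0].codePointAt(0).toString(16); // '1f42a'
```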

The u mode can definitely help when dealing with Unicode in regular expressions. I highly recommend reading Axel's chapter on this topic, and of course Mathias Bynens has also written an article about it. Have fun!
