SEO Strings, Regular Expressions and Template Literals

Strings are undoubtedly one of the most important data types in any programming language.

Strings exist in almost every programming language, and using them effectively is a basic skill for every developer. To work with strings effectively, a developer also needs to understand regular expressions, because they are the main tool for searching and manipulating strings. With ECMAScript 6, strings and regular expressions gain new features, including functionality that other programming languages already had and JavaScript was missing.

In this post I will list a few of the new string features and methods in ES6:

 

UTF-16 Code Points

Until ECMAScript 6, JavaScript strings were built entirely on 16-bit code units. All string properties and methods, such as length and charAt(), were based on these 16-bit code units. Sixteen bits used to be enough to contain any character, but that is no longer true now that Unicode has grown beyond 65,536 code points.

The first 2^16 (65,536) code points in UTF-16 are represented as single 16-bit code units. This range is called the Basic Multilingual Plane (BMP). Everything after that is in one of the supplementary planes, where the code points cannot be represented in just 16 bits. To solve this problem, UTF-16 introduced surrogate pairs, in which a single code point is represented by two 16-bit code units. That means any single character in a string can be either one code unit for BMP characters, giving a total of 16 bits, or two code units for supplementary plane characters, giving a total of 32 bits.

Because all string operations in ECMAScript 5 work on 16-bit code units, you may get unexpected results from strings that contain surrogate pairs:
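To make the surrogate pair arithmetic concrete, here is a minimal sketch of how a supplementary-plane code point is split into two 16-bit code units. The helper name toUtf16CodeUnits is hypothetical and exists only for illustration; it is not a built-in or library function.

```javascript
// Hypothetical helper (not a built-in): split a Unicode code point
// into the UTF-16 code units that represent it.
function toUtf16CodeUnits(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a valid Unicode code point: " + codePoint);
  }
  // BMP code points fit in a single 16-bit code unit.
  if (codePoint <= 0xFFFF) {
    return [codePoint];
  }
  // Supplementary planes: subtract 0x10000 to get a 20-bit offset,
  // then put the top 10 bits in the high surrogate and the
  // bottom 10 bits in the low surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}

console.log(toUtf16CodeUnits(0x61));     // [ 97 ]
console.log(toUtf16CodeUnits(0x20BB7));  // [ 55362, 57271 ]
```

The second call uses U+20BB7, the code point of 𠮷, and yields one high and one low surrogate.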

var text = "𠮷";

console.log(text.length);           // 2
console.log(/^.$/.test(text));      // false
console.log(text.charAt(0));        // "\uD842" (lone high surrogate, not printable)
console.log(text.charAt(1));        // "\uDFB7" (lone low surrogate, not printable)
console.log(text.charCodeAt(0));    // 55362
console.log(text.charCodeAt(1));    // 57271

The single Unicode character 𠮷 is represented using a surrogate pair, so the JavaScript string operations treat it as two 16-bit characters. That means:

  • The length of text is 2, when it should be 1.
  • A regular expression trying to match a single character fails, because it sees two characters.
  • The charAt() method is unable to return a valid character string, because neither set of 16 bits corresponds to a printable character on its own.
  • The charCodeAt() method also can’t identify the character as a whole; it returns the 16-bit number for each code unit instead of the code point.

ES6, on the other hand, enforces UTF-16 string encoding to address these types of problems. Standardizing string operations on this character encoding means that JavaScript can support functionality designed to work specifically with surrogate pairs.
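As a brief sketch of what that looks like in practice, the snippet below runs the same character through a few ES6 additions that work on whole code points: codePointAt(), String.fromCodePoint(), the regular expression u flag, and string iteration. The helper countCodePoints is a hypothetical name used only for this illustration.

```javascript
const text = "𠮷";

// codePointAt() reads the full code point starting at an index,
// combining the surrogate pair instead of returning one half.
console.log(text.codePointAt(0));                    // 134071 (0x20BB7)

// String.fromCodePoint() is the inverse of codePointAt().
console.log(String.fromCodePoint(134071) === text);  // true

// With the u flag, "." matches a whole code point, not a code unit.
console.log(/^.$/u.test(text));                      // true

// Hypothetical helper: string iteration (used by the spread operator)
// walks code points, so this counts characters rather than code units.
function countCodePoints(str) {
  return [...str].length;
}

console.log(text.length);            // 2 (code units)
console.log(countCodePoints(text));  // 1 (code points)
```

Note that length itself is unchanged in ES6 and still reports code units, which is why the sketch counts code points through iteration.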
