String normalization utilities for Unicode strings and IDs.
In a world where everyone types in Unicode (including emojis!), there are many things to consider when you accept input from users and plan to use those strings as identifiers, for example as tags, IDs, labels, or titles. Developers facing these situations run into a few common issues:
- Although `è` and `è` might look identical, they might in fact be two separate byte sequences, and need to be normalized or string comparisons will fail.
- Diacritics often need to be removed, for example turning `über` into `uber`, and `papà` into `papa`.
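The two issues above can be reproduced with nothing but the built-in `String.prototype.normalize`. The snippet below is a minimal sketch for illustration, not this module's implementation; `stripDiacritics` is a hypothetical helper name.

```javascript
// Two ways of writing "è": precomposed (U+00E8) and decomposed ("e" + U+0300)
const composed = '\u00E8'
const decomposed = 'e\u0300'

// They render identically but are different sequences, so comparison fails
const equalRaw = composed === decomposed

// Once both are normalized to the same form (NFC here), they compare equal
const equalNormalized = composed.normalize('NFC') === decomposed.normalize('NFC')

// Hypothetical helper: decompose, drop combining marks, then recompose
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/\p{M}/gu, '').normalize('NFC')
}

console.log(equalRaw, equalNormalized) // false true
console.log(stripDiacritics('über'), stripDiacritics('papà')) // uber papa
```

Dropping every combining mark is a blunt approach: it works for accented Latin letters, but it can damage text in scripts where such marks are essential.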
Data used by this module is based on Unicode 12.1.0, released in May 2019.
This module is written in TypeScript and transpiled to JavaScript. All typings are available alongside the code.
This code is licensed under the terms of the MIT license (see LICENSE.md).
Full documentation is available on GitHub pages.
Install from NPM:
npm install smnormalize
The module exports symbols as named exports.
const {Normalize} = require('smnormalize')
Normalize(str, options)
The method accepts an input string `str` and normalizes it with three steps:
In addition to that, you can perform other operations depending on the mode of operation.
The `options` argument is an object with the following properties:

- `options.mode` is the mode of operation, and can be one of the following:
  - `'basic'` (this is the default value): in this mode, all diacritics/accents are removed from the string, and the string is normalized in the NFKC form. Whitespace characters, including newlines, tabs, etc., are removed; spaces are converted to the character defined in `options.convertSpaces`. All control characters (unprintable characters) are removed too.
  - `'alphabetic'`: in addition to what basic mode does, all characters that are not letters (in any script/alphabet) are removed, including symbols, spaces, etc.
  - `'latin'`: similar to the alphabetic mode, but only allows letters that are part of the Latin alphabet.
- `options.removeNumbers` (boolean, default: `false`): when false, numbers are always allowed. In alphabetic mode, every kind of number is preserved, while in latin mode only Latin numbers are allowed (0-9). This option has no effect in basic mode.
- `options.allowEmoji` (boolean, default: `false`): if true, does not remove emojis from identifiers. Note that the characters `0-9` (Latin numbers) are considered valid emojis, and so are preserved regardless of the value of `options.removeNumbers`. This option has no effect in basic mode.
- `options.convertSpaces` (string, default: `-`): character to replace space characters (codepoints U+0020 and U+00A0) with. To preserve spaces as is, set this to `' '` (a single space character); note that non-breaking spaces (U+00A0) will be converted to normal spaces regardless. You can set it to `null` or to an empty string to remove spaces entirely. Note that other whitespace characters, such as newlines, tabs, etc., are removed as part of the basic normalization.
- `options.preserveCharacters` (string, default: `-_.`): optional list of individual characters that should not be removed, regardless of the mode of operation. By default, this includes the dash `-`, the underscore `_` and the dot `.`. You can disable this by setting it to an empty string.
- `options.lowercase` (boolean, default: `false`): optionally lowercases the string before returning it.

To show the difference between multiple modes of operation and options, consider this string as an example:
Hello Шѻrld_!1߁🤗

| | "basic" mode | "alphabetic" mode | "latin" mode |
|---|---|---|---|
| removeNumbers = false, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁ | Hello-rld_1 |
| removeNumbers = true, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_ | Hello-rld_ |
| removeNumbers = false, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁🤗 | Hello-rld_1🤗 |
| removeNumbers = true, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1🤗 | Hello-rld_1🤗 |
Note that in basic mode the `removeNumbers` and `keepEmojis` options have no effect, because no characters (aside from whitespace and control characters) are removed. In alphabetic and latin modes, Latin numbers are always kept when emojis are allowed (but numbers in other scripts are not). Also note that the exclamation mark was removed, but the underscore was kept because it is in the default `preserveCharacters` list.
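To make the differences between the modes more concrete, here is a rough approximation of the three modes with default options, written with built-in JavaScript features only. It is a sketch under stated assumptions and not this module's implementation: `sketchNormalize` is a hypothetical name, the module relies on its own Unicode data, and options such as `removeNumbers` or `lowercase` are not modeled. For the sample string it should reproduce the first row of the table.

```javascript
// Hypothetical helper, NOT the module's code: approximates the documented
// modes (default options only) using built-in Unicode support.
function sketchNormalize(str, mode = 'basic') {
  const preserve = '-_.' // documented default of preserveCharacters
  const out = str
    .normalize('NFD') // split letters from their accents
    .replace(/\p{M}/gu, '') // drop combining marks (diacritics)
    .normalize('NFKC') // compatibility composition
    .replace(/[\u0020\u00A0]/g, '-') // documented default of convertSpaces
    .replace(/[\s\p{Cc}]/gu, '') // other whitespace and control characters

  if (mode === 'basic') return out

  // 'latin': Latin-script characters and 0-9 (looser than "letters only");
  // 'alphabetic': letters and numbers in any script
  const allowed = mode === 'latin' ? /[\p{Script=Latin}0-9]/u : /[\p{L}\p{N}]/u

  // Iterate by code point so that emojis are handled as single characters
  return [...out]
    .filter((ch) => preserve.includes(ch) || allowed.test(ch))
    .join('')
}

const sample = 'Hello Шѻrld_!1߁🤗'
console.log(sketchNormalize(sample, 'basic'))
console.log(sketchNormalize(sample, 'alphabetic'))
console.log(sketchNormalize(sample, 'latin'))
```

The filter runs after the space conversion, which is why the dash that replaced the space survives in alphabetic and latin modes: it is part of the preserved characters.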