A Comprehensive Guide to JavaScript Regular Expressions


Regular expressions can be tricky and hard to comprehend at first, but in this article we'll cover the major things you need to get up to speed with creating and working with regular expressions in JavaScript. Here's the list of topics we'll discuss (in order):

  1. Defining a Regular Expression.
  2. Creating Simple Patterns with Regular Expressions.
  3. Flags.
  4. Sets and Ranges.
  5. Character Classes
  6. Anchors: ^ and $
  7. Quantifiers: +, *, ? and {n}.
  8. Capturing Groups.
  9. Alternation (OR): |.


So, what's a Regular Expression?

A regular expression (shortened as regex or regexp) is a sequence of characters that define a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. (Wikipedia)

So, basically, a regular expression is a series of characters (special characters like /, ? and *, as well as letters and digits) that defines a pattern. This pattern is usually used for finding text and for validating input values. Here are two scenarios where you'd need a regular expression:

  • You have a very long string and you want to check whether it contains any e-mail addresses. You can't search for one specific address because you don't know which address may be in the string. You just want to know if there's any e-mail address at all in the string. How do you do that? You create a regular expression defining the pattern which all e-mail addresses follow, and then check if there's any match in the string.

  • A user enters a name in an input field you provided. As a front-end developer your job is to validate the data on the client-side before the server also does it. Regular expressions provide a way to check if the data entered by the user is exact, or follows the pattern of data that's required. For instance, an input field that expects a name of a person shouldn't contain digits or special characters apart from a hyphen.

Defining a Regular Expression

Regular expressions can be defined in two ways: using the built-in RegExp constructor, or writing the pattern as a literal enclosed in a pair of forward slashes - //.

  1. Option 1: new RegExp()
let regex = new RegExp("pattern", "flags");
  2. Option 2: using forward slashes - //
let regex = /pattern/flags;
  • Where pattern is the pattern you intend to find;
  • and flags are optional tokens that add more functionality when finding the pattern you're trying to match.
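One practical difference between the two options is that the constructor takes the pattern as a string, so the pattern can be built while the program runs. Here's a minimal sketch of that idea; the searchTerm variable is a made-up stand-in for a value you only know at runtime:

```javascript
// Hypothetical value that is only known at runtime (e.g. typed by a user)
const searchTerm = "world";

// A literal like /searchTerm/ would look for the text "searchTerm" itself,
// but the constructor accepts any string, including one held in a variable
const dynamicRegex = new RegExp(searchTerm);

console.log(dynamicRegex.test("Hello world!")); // true
console.log(dynamicRegex.source); // "world"
```

When the pattern is fixed and known in advance, the literal syntax is usually the shorter and more readable choice.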

We'll cover flags in a bit, but for now, let's understand some fundamentals of creating patterns.

Creating Simple Patterns with Regular Expressions

We're going to play with some regular expressions and learn how to use the str.match() and regexp.test() methods to determine whether or not a string matches a pattern. Here's our first example:

// here's our string
let string = "Hello world!";

// and our regular expressions
// both syntaxes do the same thing
let rgxp1 = new RegExp("Hello");
let rgxp2 = /Hello/;

// check if Hello exists in the string, with str.match()
console.log(string.match(rgxp1));

// Result: ["Hello", index: 0, input: "Hello world!", groups: undefined]

// check if Hello exists in the string, with regexp.test()
console.log(rgxp2.test(string));

// Result: true

Explanation

  • We defined two different regular expressions with different syntaxes but with the same pattern. You'll notice that the flags part was omitted in both expressions. It's optional; we only include it when we need special functionality, which we'll discuss in a second. 😉

  • The str.match() method accepts a regular expression as an argument. This is possible because regular expressions are integrated with string methods. So, you invoke the match() method on the string in which you want to search (in our case, the string variable) and pass it the regular expression that defines the pattern. If there's a match, it returns an array containing the matched substring; if there isn't, it returns null. In our example, the substring "Hello" in the string "Hello world!" is matched. There are additional properties attached to the returned array: index, input and groups. The index property holds the index in the target string where the match was found; the input property holds a copy of the target string itself, thus "Hello world!"; the groups property holds the named groups captured by the regular expression, which, in our case, is undefined because we didn't define any groups in the expression. Don't worry, we'll discuss groups too. 😃 So, with this result, it's clear that the str.match() method returns a detailed description of the pattern search.

  • The regexp.test() method, however, is invoked on the regular expression which defines the pattern, and the target string is passed to it as an argument. Unlike the str.match() method, which returns detailed information about the search, the regexp.test() method returns a Boolean: true if the pattern was found in the string; false if not. In our case, true was returned because Hello can be found in the string "Hello world!".

The text Hello in between the forward slashes //, and string "Hello" passed to the RegExp constructor as an argument, are all simple examples of patterns in a regular expression. If you want to find a string which has the word "metal", you could define your pattern as /metal/ or new RegExp("metal").
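Because str.match() returns null when nothing matches, reading a property such as index straight off the result can throw an error. Here's a small sketch of a defensive check; the variable names are just for illustration:

```javascript
const greeting = "Hello world!";

const found = greeting.match(/Hello/);
const missing = greeting.match(/Goodbye/);

console.log(found[0]);    // "Hello"
console.log(found.index); // 0
console.log(missing);     // null

// guard against null before reading properties of the result
const position = missing ? missing.index : -1;
console.log(position);    // -1
```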

Now, what if we run the search on the string "hello world!" instead of "Hello world!" (Notice the uppercase "H")?

let string = "hello world!";

let rgxp = /Hello/;

// using the test() method
console.log(rgxp.test(string));

// Result: false

Wait, what? 😕🤔 We had no match! That's because regular expressions, by default, are case sensitive. Meaning, an uppercase "H" is not the same as a lowercase "h".

And what if we run the search on the string "Hello Hello world!"? It should match it twice right? Let's find out:

let string = "Hello Hello world!";

let rgxp = /Hello/;

// we need more info about the search
// so we'll use the match() method instead
console.log(string.match(rgxp));

// Result: ["Hello", index: 0, input: "Hello Hello world!", groups: undefined]

Oops! The same result. Notice it matched at the index of 0; meaning it matched only the first Hello and then it stopped searching.

Let's modify these behaviours with Flags. 😉

Flags

Flags, in regular expressions, are tokens that modify the searching behaviour of the regular expression. Flags are optional, as mentioned earlier, and they make it possible to alter the default behaviour of regular expressions. For instance, by default regular expressions are case sensitive. Meaning the pattern /Abc/ will not match the string "abc".

When this article was written, JavaScript had six flags, each denoted by a single letter (newer engines also support d and v, which aren't covered here). Here they are:

Flag | Name | Description
---- | ---- | -----------
i | Ignore case | Makes the expression search case-insensitively.
g | Global | Makes the expression search globally in the string, matching not only the first occurrence of the pattern, but all occurrences in the string.
s | Dot All | Ensures the special character . matches newlines.
m | Multiline | Ensures the boundary characters ^ and $ match the beginning and ending of every single line instead of the beginning and ending of the whole string.
y | Sticky | Makes the expression match only at the index indicated in its lastIndex property, instead of searching ahead from it.
u | Unicode | Enables full Unicode support.
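The sticky flag is the least obvious of the six, so here's a rough sketch of how it behaves. The string and variable names are made up for the demonstration; lastIndex is a standard property of every regular expression object:

```javascript
const text = "bar foo";
const sticky = /foo/y;

// "foo" starts at index 4, so a sticky match only succeeds from exactly there
sticky.lastIndex = 4;
const matchedAtFour = sticky.test(text);
console.log(matchedAtFour);    // true
console.log(sticky.lastIndex); // 7 (just past the match)

// from index 0 the text is "bar...", and a sticky regex won't search ahead
sticky.lastIndex = 0;
const matchedAtZero = sticky.test(text);
console.log(matchedAtZero);    // false
```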

The i Flag

Using this flag on a regular expression changes its default case-sensitive behaviour to case-insensitive.

Now, let's match the string "hello world!" even though our pattern is defined with an uppercase "H":

let string = "hello world!";

// pass the i flag at the end of the expression.
// No space must be between the enclosing / and the flags
let rgxp = /Hello/i;

/*
 Here's how you'd do it with the RegExp constructor:

 let rgxp = new RegExp("Hello", "i");
*/

console.log(rgxp.test(string));

// Result: true

With the i flag, the pattern /AbC/i will match the string "abc".

The g Flag

In our previous example where we tried matching the pattern /Hello/ on the string "Hello Hello world!", we only got one match, which is the first in the string. With the g flag, we can match all patterns in the string and not only the first:

let string = "Hello Hello world!";

let rgxp = /Hello/g;

/*
 Here's how you'd do it with the RegExp constructor:
 let rgxp = new RegExp("Hello", "g");
*/

// we need more info about the search
// so we'll use the match() method instead
console.log(string.match(rgxp));

// Result: (2) ["Hello", "Hello"]

Note that, with the g flag, we don't get the additional properties index, input, and groups.
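If you need every match and its position, one option is the standard str.matchAll() method, which this article doesn't otherwise cover. A minimal sketch (it requires the g flag on the expression):

```javascript
const text = "Hello Hello world!";

// matchAll() yields one detailed match object per occurrence
const allMatches = [...text.matchAll(/Hello/g)];
const positions = allMatches.map((match) => match.index);

console.log(positions); // [0, 6]
```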

You can use several flags at the same time, too:

let string = "Hello hello world!";

// it doesn't matter the order in which you write the flags
// just don't separate them with space or any other character
let rgxp = /Hello/gi;

/*
 Here's how you'd do it with the RegExp constructor:

 let rgxp = new RegExp("Hello", "ig");
*/

console.log(rgxp.test(string)); // true
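One thing to watch out for when combining test() with the g flag: the expression remembers where its last match ended (in its lastIndex property) and continues from there on the next call. A short sketch of that behaviour, with illustrative variable names:

```javascript
const globalRegex = /hello/gi;
const text = "Hello hello world!";

const first = globalRegex.test(text);  // finds "Hello" at index 0
const second = globalRegex.test(text); // continues and finds "hello" at index 6
const third = globalRegex.test(text);  // nothing left, so lastIndex resets to 0

console.log(first, second, third); // true true false
```

If you only want a yes-or-no answer, leaving the g flag off avoids this surprise.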

Sets and Ranges

Sets and Ranges provide a way to determine whether or not a character, from a set of characters, is matched at a given position.

Sets

Character sets are used to match one of several characters at a given position in a pattern.

Say you want to match these words: "sell", "tell", "fell", "yell". The words may be more than 4; but let's assume they have the same pattern; like with the 4 words above, they all end with "ell". The only thing different is their initial character. With character sets, we can define a set of these initial characters at the position where only a single one of them is needed, and then add the remaining parts of the pattern that's static.

To define a set of characters in a regular expression, enclose those characters in brackets [].

Example 1

So, for the four words "sell", "tell", "fell", "yell", we define the set of their initials as [stfy], followed by their last characters ell:

// i flag for case-insensitive
let expr = /[stfy]ell/i;

let words = ["sell", "tell", "fell", "yell", "bell", "well"];

// for each word in the words array, find a match
for (let word of words) {
  console.log(`Matched "${word}" :::: `, expr.test(word));
}

Result:

Matched "sell" ::::  true
Matched "tell" ::::  true
Matched "fell" ::::  true
Matched "yell" ::::  true
Matched "bell" ::::  false
Matched "well" ::::  false

Notice that the first four strings in the array match, but the last two don't. That's because the last two begin with b and w, which were not defined in the character set.

Example 2

Let's match either "odd" or "add". Both words end with "dd", but they both begin with different characters, "o" and "a".

// i flag for case-insensitive
let expr = /[oa]dd/i;

let strings = ["Odd", "Let's add 2 and 3."];

// for each string in the strings array, try finding a match
for (let string of strings) {
  console.log(`Matched "${string}" :::: `, expr.test(string));
}

Result:

Matched "Odd" ::::  true
Matched "Let's add 2 and 3." ::::  true

Our test() method matched the "Odd" string (with an uppercase "O") because of the i flag attached at the end of the regular expression. And "add" was matched in the second string as well, though it was somewhere in the middle of the string.

Note that amongst all characters in the brackets [], only one will be matched. Here's a demonstration:

// match either "hello" or "hallo"
let expr = /h[ae]llo/i;

console.log(expr.test("Hello")); // true
console.log(expr.test("Hallo")); // true
console.log(expr.test("Haello")); // false

In the above example, you notice only one character from the set is needed for a pattern to be matched.

The Negative Character Set

There may come a time when we want to match anything except some characters. In such scenarios, negative character sets come in handy. Negative character sets simply match any character that is not enclosed in the brackets [].

To indicate a negative character set, a caret (^) is written right after the opening bracket of a set [, followed by the characters you don't want to match, and then the closing bracket ]:

/[^abc]/

In the above expression, we want to match any character except a, b, or c.

Example

Let's match any series of characters which ends with "rill", beginning with any character (even numbers and symbols) except "a", "e", "i", "o" and "u"; basically a five-character string beginning with any character at all (except vowels), but ending with "rill". So, "drill" and "8rill" should match, but "arill" shouldn't.

let expr = /[^aeiou]rill/;

let texts = ["brill", "2rill", "$rill", "orill"];

// for each text in the texts array, try finding a match
for (let text of texts) {
  console.log(`Matched "${text}" :::: `, expr.test(text));
}

Result:

Matched "brill" ::::  true
Matched "2rill" ::::  true
Matched "$rill" ::::  true
Matched "orill" ::::  false

Notice that all the strings passed the test, except the last one, which begins with the letter "o".

Note: You cannot write an expected set of characters, and a negative set at the same time. By this, I mean you cannot define a set where characters abc are expected, except def. Let's take an instance:

// trying to match "a" or "e", except "i", "o" and "u"
let expr = /[ae^iou]rill/;

let texts = ["arill", "erill", "$rill", "orill"];

// for each text in the texts array, find a match
for (let text of texts) {
  console.log(`Matched "${text}" :::: `, expr.test(text));
}

Result:

Matched "arill" ::::  true
Matched "erill" ::::  true
Matched "$rill" ::::  false
Matched "orill" ::::  true

Does this result scare you? 👹 It should. 😅

This happened because by putting the caret ^ in the middle of the set, we revoked its special ability to negate the set, thereby making it a normal character like the others (aeiou). To prove this, in the texts array, change the third string "$rill" to "^rill" (with the caret ^), and you'll see it passes the test. JavaScript doesn't think it's the special negative character anymore. Hence, to write a negative set, always remember to put the caret ^ immediately after the opening bracket [.

Ranges

When you want a character set of letters from "a" to "z", you'd probably define your regular expression like this, right?

let expr = /[abcdefghijklmnopqrstuvwxyz]/;

Well, yes, you could. But you should've seen how much water I drank before I could type all those 26 letters. 😂

What if I told you, you could just do this, instead:

let expr = /[a-z]/;

Well, that's called a character range. Character ranges simply shorten writing a list of characters that are positioned in order. Like numbers from 0 to 9:

let expr = /[0-9]/;

Basically, you're just providing a range of characters from "this" to "that".

So, we could match a range of letters from "c" to "g", like:

let expr = /[c-g]/;

Or numbers from 5 to 8:

let expr = /[5-8]/;

You could match a range of alphabets from "c" to "g", or their uppercase versions, instead of using the i flag like this:

let expr = /[c-gC-G]/;

Just write them next to each other.

Also, say you want to match a range of alphabets from "a" to "d" or numbers from 1 to 4; here's how that'll be done:

let expr = /[a-d1-4]/;

// numbers could come before letters, it doesn't matter
expr = /[1-4a-d]/;

Now, an actual example: Let's try matching words that end with "rill", beginning only with either "a", "b", "c" or "d":

let expr = /[a-d]rill/;

let strings = ["arill", "brill", "drill", "erill"];

// for each string in the strings array, try finding a match
for (let string of strings) {
  console.log(`Matched "${string}" :::: `, expr.test(string));
}

Result:

Matched "arill" ::::  true
Matched "brill" ::::  true
Matched "drill" ::::  true
Matched "erill" ::::  false

All but the last string passed the test. The last one failed because its initial character is "e", which isn't in the range a-d.
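A range follows the order of the underlying character codes, not the alphabet as we think of it. That's why it's safer to write the uppercase and lowercase ranges separately instead of one wide range. Here's a small sketch of what can go wrong (the variable names are mine):

```javascript
// "A" to "z" also covers the symbols that sit between "Z" and "a"
// in the character table: [ \ ] ^ _ and the backtick
const tooWide = /[A-z]/;
const lettersOnly = /[A-Za-z]/;

const wideMatchesUnderscore = tooWide.test("_");
const strictMatchesUnderscore = lettersOnly.test("_");

console.log(wideMatchesUnderscore);   // true
console.log(strictMatchesUnderscore); // false
console.log(lettersOnly.test("g"));   // true
```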

Let's try matching a date-time string with this pattern:

21-JAN-2020 11:32 PM

First, we have two digits, then a hyphen (-), then three uppercase alphabets, then a hyphen (-), then four digits, then one space character, then two digits, blah blah blah...

We need to use this pattern to write a regular expression. Try writing the solution somewhere else and come back here and compare:

// optional: using i flag for case-insensitive effect
let expr = /[0-9][0-9]-[A-Z][A-Z][A-Z]-[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9] [A-Z][A-Z]/i;

let dateTime = "03-FEB-2019 09:00 AM";

console.log(expr.test(dateTime)); // true

Explanation:

  • The first two ranges [0-9][0-9], denote two digits, which matches "03" in our dateTime string. Remember, only one from a set of characters is expected. Followed by the hyphen -. Note that putting a space between the ranges will denote that there should be a space between the first and second digits; that, we don't want.

  • The three ranges [A-Z][A-Z][A-Z] denote three uppercase alphabets (although the i flag makes it match lowercase alphabets also). This matches "FEB" in our dateTime string. Then comes another hyphen -.

  • The four ranges [0-9][0-9][0-9][0-9] denote four digits, matching "2019" in our dateTime string, followed by a space, which matches the space after "2019". To match a space, you have to explicitly type space. The number of spaces you insert consecutively determine the number of consecutive spaces you want to match. Note that.

  • The next two ranges [0-9][0-9], denote two digits, which matches "09" in our dateTime string. Then a colon :. Then two more ranges [0-9][0-9], matching "00" in the dateTime string. Then a space. Followed by last two ranges [A-Z][A-Z], which match the last "AM" in the dateTime string.

If your solution is different, let me see it in the comments section.

Our code works, but seriously, it took a lot of time. And it's not easy to read the regular expression.

Let's change that with character classes.

Character Classes

It's annoying having to write a range, like [0-9], just to match a single digit. If our pattern needs to match about twenty digits, it means we'd have to write the number range twenty times. You could copy and paste, but still... 🥱

Fortunately, we don't have to. Character classes are shorthands for some commonly used character sets. Here's a table of two of them:

Character set | Shorthand | Description
------------- | --------- | -----------
[0-9] | \d | Matches any digit character.
[a-zA-Z0-9_] | \w | Matches an alphanumeric character (“word character”). Notice the underscore (_).

There are other shorthands. Make sure to check MDN.

Remember, it's a backslash \ (found below the Backspace key), not the forward slash /.

Based on the table above, we can rewrite our date-time pattern as follows:

// using the shorthands instead
let expr = /\d\d-\w\w\w-\d\d\d\d \d\d:\d\d \w\w/;

let dateTime = "03-FEB-2019 09:00 AM";

console.log(expr.test(dateTime)); // true

Though the shorthand \w will match digits and an underscore (_), we're utilising it for brevity. Use sets to enforce that only alphabets be matched.
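To see what that trade-off means in practice, here's a quick comparison of the shorthand version against one that uses sets for the letters. The "junk" string is a made-up input that merely has the right shape:

```javascript
// \w accepts digits and underscores where we expect letters
const loose = /\d\d-\w\w\w-\d\d\d\d \d\d:\d\d \w\w/;

// sets restrict those positions to uppercase letters only
const strict = /\d\d-[A-Z][A-Z][A-Z]-\d\d\d\d \d\d:\d\d [A-Z][A-Z]/;

const junk = "03-1_2-2019 09:00 __";
const valid = "03-FEB-2019 09:00 AM";

console.log(loose.test(junk));   // true
console.log(strict.test(junk));  // false
console.log(strict.test(valid)); // true
```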

Negated Ranges

We can also negate ranges, like we do with sets. Just add the caret (^) after the opening bracket:

  • [^w-z] - any character except the lowercase letters from "w" to "z".

  • [^0-9] - any character except digits from 0 to 9. Meaning all digits are excluded. In such a case you can use the shorthand \D instead, which does the same thing. Don't mistake this for \d (which matches a digit), or for [^\D] (a double negative that matches only digits).

  • [^a-z] - any character except the lowercase letters from "a" to "z". Add the i flag, or write [^a-zA-Z], to exclude the uppercase letters as well.
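The uppercase shorthands are the negated versions of the lowercase ones: \D is "not a digit" and \W is "not a word character". Here's a small sketch of them in use. The phone number is invented, and str.replace() is a standard string method that swaps every match for the second argument:

```javascript
// Hypothetical phone number with formatting characters
const phone = "+233 (0) 55-123-4567";

// remove everything that is NOT a digit
const digitsOnly = phone.replace(/\D/g, "");
console.log(digitsOnly); // "2330551234567"

// \W matches anything that is not a letter, digit or underscore
console.log(/\W/.test("snake_case")); // false
console.log(/\W/.test("kebab-case")); // true (the hyphen)
```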

Note that, since the backslash is used to denote a special character in a regular expression, if you want to match a backslash in a string, you'd need to escape it with another backslash, like this \\:

let expr = /\\/;

// apparently you need to escape it in a string too
console.log(expr.test("Here's a backslash: \\"));

// true

There are other characters that have special meanings in regular expressions. Like, +, ^, $, [, ], (, ), {, }, |, even a dot (.).

The Dot Character Class: .

A dot (.), in a regular expression, matches any single character except line terminators such as the newline \n. A newline is created when the return or enter key is pressed on the keyboard, or programmatically created using \n in a string.

Since you can't practically write a set which contains every character, a . is used to denote just that, much like writing \d instead of [0-9]. Let's demonstrate how it works with an example:

let expr = /./g;

let str = `This is a multiline

string.`;

// using match() to see which characters will be matched
console.log(str.match(expr));

Result:

(26) ["T", "h", "i", "s", " ", "i", "s", " ", "a", " ", "m", "u", "l", "t", "i", "l", "i", "n", "e", "s", "t", "r", "i", "n", "g", "."]

In the matched characters, you can see that even a space character is matched. A newline was created after the word "multiline" in the string, and then again, creating a blank line, before "string.". If those newlines were matched, we would get a string like this "↵" in the returned array.

We can make the dot (.) match newlines too, using the Dot All flag, s:

// the s flag makes the . match all newline characters too
let expr = /./gs;

let str = `This is a multiline

string.`;

// using match() to see which characters will be matched
console.log(str.match(expr));

Result:

(28) ["T", "h", "i", "s", " ", "i", "s", " ", "a", " ", "m", "u", "l", "t", "i", "l", "i", "n", "e", "↵", "↵", "s", "t", "r", "i", "n", "g", "."]

Now, we can see a newline character "↵" at index 19 and 20 of the returned array.

What if you want to match an actual dot (.) in a string? Just use the backslash \ to escape the special behaviour of the dot and it'll only match an actual dot:

// escaping the . so we can match an actual dot in the string
let expr = /\./gs;

let str = `This is a multiline

string.`;

// using match() to see which characters will be matched
console.log(str.match(expr));

// ["."]
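Forgetting to escape the dot is an easy mistake to make, because the pattern still appears to work. A brief sketch of the difference, using made-up version-like strings:

```javascript
const unescaped = /1.5/;  // "1", then ANY character, then "5"
const escaped = /1\.5/;   // "1", then an actual dot, then "5"

console.log(unescaped.test("1.5")); // true
console.log(unescaped.test("125")); // true (the dot matched "2")
console.log(escaped.test("1.5"));   // true
console.log(escaped.test("125"));   // false
```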

Anchors: ^ and $

We can explicitly define in our regular expression how a pattern should start and end using anchors: ^ and $.

The caret ^ denotes matching a pattern at the beginning of a string, and the dollar $ at the end.

To demonstrate each of them, let's match a string which starts with Hello:

let strs = ["Hello, world!", "Hi and hello devs!"];

// i flag - case-insensitive mode
let rgx = /^hello/i;

for (let str of strs) {
  console.log(`Matched "${str}" :::: `, rgx.test(str));
}

Result:

Matched "Hello, world!" ::::  true
Matched "Hi and hello devs!" ::::  false

So, the caret ^ enforced the Regex to match the string starting with Hello and even though the second string has the word "hello" in it, it didn't pass the test because it began with Hi, not Hello.

It's known that a question is written with a question mark ? at the end, right? Let's write a regular expression to match any sentence that ends with a question mark, using the dollar sign $:

let qs = ["I'm home.", "Anybody home?", "We're here!"];

// a backslash is used because ? has a special meaning
// we'll cover it when we discuss Quantifiers
let rgx = /\?$/;

for (let q of qs) {
  console.log(`"${q}" is a question :::: `, rgx.test(q));
}

Result:

"I'm home." is a question ::::  false
"Anybody home?" is a question ::::  true
"We're here!" is a question ::::  false

Only the second string passed the test, because it ended with a question mark.

The caret ^ must be used right at the beginning of the regular expression to denote that the pattern search should start at the beginning of the string.

The dollar $ must be used at the end of the regular expression to denote that the pattern search should stop at the end of the string.

Testing for a full match

We can use both anchors at the same time though. Say, we're receiving a time as value from an input field; we explicitly tell the user to input the time in the format HH:MM, where HH is hours and MM is minutes. They're expected to input nothing more, nothing less. Here's the regular expression to be written:

/*
 this denotes that, the pattern must begin like this,
 and end like that. Anything else attached to the value 
 from the input should not pass the test
*/

let rgx = /^\d\d:\d\d$/;

/*
 You could've done this:

 let rgx = /^[0-9][0-9]:[0-9][0-9]$/;
*/

Here's a Pen to demonstrate this. I've written some simple CSS and commented the JavaScript just so you know what's happening:

When you try typing a valid time, and then add any random character, the validation fails even though the pattern was in the value. It must start and end in the way defined by the regular expression.

I wouldn't advise you to use this pen in a real-world app as it's not really strict. Someone could type 90:00 as a time and it'll still be valid. You could also make it strict by implementing your own time picker for the form so the user doesn't have to type.
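If you'd still like the regular expression itself to reject values like 90:00, here's one possible sketch. It borrows a group ( ) and alternation |, which are later topics in this guide, so treat it as a preview and not the only way to do it:

```javascript
// hours: 00-19 ([01]\d) OR 20-23 (2[0-3]); minutes: 00-59 ([0-5]\d)
const strictTime = /^([01]\d|2[0-3]):[0-5]\d$/;

const samples = ["09:00", "23:59", "24:00", "90:00", "12:60", "9:00"];
const results = samples.map((time) => strictTime.test(time));

console.log(results); // [true, true, false, false, false, false]
```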

The Multiline Flag, m, for Anchors

Given a string with multiline text, like this:

@GyenAbubakar
@SenaGodson
@degraft_

A list of the usernames of top accounts on Twitter, each starting with the at (@) sign and on its own line.

If we want to check if each and every single line starts with @, how will we do it? You'd say the anchors ^ and $, right? Great idea, but... Do you remember I mentioned the anchors enforce search for a pattern at the start or end of a string? Using them with the above string will match the @ sign on the first line only, and yes, that would pass the test. But we don't want that.

If we want the anchors to search on every line in a multiline string, we need to use the multiline flag, m.

// enabling multiline mode with m flag
let rgx = /^@/m;

// our string, with template string -- backticks ``
// backtick is on the left of "1" key on the alphanumeric keys
let str = `@GyenAbubakar
@SenaGodson
@degraft_`;

// using match() on the string instead of test()
// for details
console.log(str.match(rgx));

Result:

["@", index: 0, input: "@GyenAbubakar↵@SenaGodson↵@degraft_", groups: undefined]

We had one match. Can you guess the problem? We didn't add the global flag, g, so the search stopped on the first match.

Change the regular expression to implement the g flag:

let rgx = /^@/gm;

Now, the result is:

(3) ["@", "@", "@"]

Now, we've successfully performed a pattern search for every single line in a multiline string!

In multiline mode, the anchor ^ starts the search at the beginning of the string, and at the beginning of every line in the multiline string. The anchor $ stops the search at the end of the string, and at the end of every line in the multiline string.
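Back to the original question of whether each and every line starts with @: one way to answer it is to compare the number of matches with the number of lines. This is only a sketch; everyLineStartsWithAt is a helper name I made up, and it assumes lines are separated by \n:

```javascript
// Hypothetical helper: true only if every line begins with "@"
function everyLineStartsWithAt(text) {
  const lineCount = text.split("\n").length;
  // match() returns null when there are no matches, so fall back to []
  const matches = text.match(/^@/gm) || [];
  return matches.length === lineCount;
}

const allGood = `@GyenAbubakar
@SenaGodson
@degraft_`;

const oneBad = `@GyenAbubakar
SenaGodson
@degraft_`;

console.log(everyLineStartsWithAt(allGood)); // true
console.log(everyLineStartsWithAt(oneBad));  // false
```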

Searching for Newline Characters

A newline character is that which is formed when the return or enter key is pressed on the keyboard. In order to match a point in a string where the enter key was pressed or where a newline was programmatically created, we can use the newline character \n. Just like the way you'd match a digit with \d, the \n is also used to match a newline character in a string.

// the m flag isn't needed to match \n; it only changes how ^ and $ behave
// using g flag to ensure all newlines are matched
let rgx = /\n/g;

// using our usernames string again
let str = `@GyenAbubakar
@SenaGodson
@degraft_`;

// using the match() on the string instead of the test
// for details
console.log(str.match(rgx));

Result:

(2) ["↵", "↵"]

Look at your enter key. Does the icon on it look like this: ↵? Mission accomplished! 😁

The first character was matched at the end of the line with @GyenAbubakar; and the second was matched at the end of the line with @SenaGodson.

Escaping Special Characters

The backslash, just like in strings, is used to denote a character class, like \w. It's also used to escape the default behaviour of special characters in regular expressions:

[ ] \ / ^ $ . | ? * + ( ) { } < >

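You've already seen how some of these are used. The rest will be discussed too. (The angle brackets < and > are only special inside constructs such as named groups; on their own they match literally and don't need escaping.)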

Whenever you want to match an actual plus (+) in a string, but not to use its special ability in regular expressions, write the backslash first, then add the plus (+), like so: \+.

Also, a forward slash will close the regular expression if used improperly (matching a string with "slash/ looks" in it):

let rgx = /slash/ looks/;

Instead of being closed by the last forward slash, the regular expression is closed by the forward slash after the "h". Hence, everything from there becomes meaningless. You get an error:

Uncaught SyntaxError: Unexpected token identifier

Be sure to escape it:

let rgx = /slash\/ looks/;
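Escaping matters even more when the pattern comes from a variable and is passed to the RegExp constructor. Below is a sketch of a commonly used helper that puts a backslash in front of every special character. The name escapeRegExp and the sample price string are my own choices; this isn't a built-in function:

```javascript
// Hypothetical helper: escape characters that are special in a pattern.
// "$&" in the replacement stands for the matched character itself.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const price = "$5.00 (approx.)";
const sentence = "It costs $5.00 (approx.) today";

const unescapedPrice = new RegExp(price);
const escapedPrice = new RegExp(escapeRegExp(price));

console.log(unescapedPrice.test(sentence)); // false ($ is read as an anchor)
console.log(escapedPrice.test(sentence));   // true

// with the constructor, a forward slash needs no escaping at all
const slashRegex = new RegExp("slash/ looks");
console.log(slashRegex.test("slash/ looks good")); // true
```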

Quantifiers: +, *, ? and {n}

Quantifiers are used to denote how many characters you need at a certain place in a pattern.

For example, at the beginning of a string, there should be 4 letters. Yes, you could write /^\w\w\w\w/ (even though it would also match digits and an underscore), but that becomes inconvenient when you need to match fifteen or thirty letters. Or perhaps you'd like to make a particular character optional. Quantifiers are a handy way to write such regular expressions.

In the Character Sets and Ranges sections, remember we wrote solutions for matching a date-time string in this format:

21-JAN-2020 11:32 PM

The first solution was:

let expr = /[0-9][0-9]-[A-Z][A-Z][A-Z]-[0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9] [A-Z][A-Z]/i;

The second was:

let expr = /\d\d-\w\w\w-\d\d\d\d \d\d:\d\d \w\w/;

Let's learn how conveniently we could make this.

Quantity: {n}

To specify how many times a character should appear in a pattern, we simply use the quantity syntax, specifying the number as n in between the braces {}. For instance, if we want to match any text with five consecutive "a"s:

// ensure there's no space between the character (in this case "a")
// and the opening brace, nor anywhere inside the braces
let rgx = /a{5}/;

let letterAs = ["aaa", "aaaaa"];

for (let a of letterAs) {
  console.log(`Matched "${a}" :::: `, rgx.test(a));
}

Result:

Matched "aaa" ::::  false
Matched "aaaaa" ::::  true

Let's match a four-digit number in a sentence:

// The quantity {n} can be used on sets, single characters,
// and character classes, like \w
let rgx = /\d{4}/;

/*
 Of course you could use this method instead:

 let rgx = /[0-9]{4}/;
*/

let sentence = "Little Kezia now recites 1234 after school.";

console.log(sentence.match(rgx));

Result:

["1234", index: 25, input: "Little Kezia now recites 1234 after school.", groups: undefined]

Previously, we would have achieved this by repeating the class: \d\d\d\d.

Now, back to the date-time pattern. We can rewrite the earlier regular expressions this way instead:

let expr = /\d{2}-\w{3}-\d{4} \d{2}:\d{2} \w{2}/;

let dateTime = "03-FEB-2019 09:00 AM";

console.log(expr.test(dateTime)); // true

This looks better, don't you think?

Define a range: {n,x}

Say you want to match a particular character (or character class) which appears at least 3 times and at most 5. You write 3 in the braces, append a comma (,), and then 5, like this: {3,5}. Ensure there's no space anywhere in the braces.

Let's match numbers in a string which are hundreds or thousands; a three-digit number and a four-digit number:

//let's use a multiline string
let withFigures = `Joe is 17 years old.
That building is 134 years old.
Nancy is 32 years old.
I earned $1000 today.`;

// using g flag so it continues searching after 1st match
let hundredOrThousand = /\d{3,4}/g;

console.log(withFigures.match(hundredOrThousand));

Result:

(2) ["134", "1000"]

By specifying 3, we're saying, "match a number with at least 3 digits", and by specifying 4, we're saying "match a number with at most 4 digits". That's why 134 and 1000 are matches.
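One caveat: "at most 4 digits" describes how much a single match may take, not how long the whole number may be. A longer number still produces a match on its first four digits. The sketch below shows this, along with one possible fix using the word boundary \b, which this article doesn't cover; the sample sentence is made up:

```javascript
const text = "Order 123456 shipped in 2020.";

// takes the first four digits of 123456, then finds 2020
const loose = /\d{3,4}/g;
const looseMatches = text.match(loose);
console.log(looseMatches); // ["1234", "2020"]

// \b requires the digits to stand alone as a whole "word"
const bounded = /\b\d{3,4}\b/g;
const boundedMatches = text.match(bounded);
console.log(boundedMatches); // ["2020"]
```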

We can omit the second number, leaving the first and a trailing comma:

let rgx = /[a-z]{2,}/i;

The regular expression will match any word which is at least two letters long. So, it matches a series of letters with a length of 2 or more, with no upper limit:

// Adding i flag to ignore casing
// g flag so it searches the entire string
let rgx = /[a-z]{2,}/gi;

let sentence =
  "We arrived late today, just to be welcomed with a cold reception.";

console.log(sentence.match(rgx));

Result:

(11) ["We", "arrived", "late", "today", "just", "to", "be", "welcomed", "with", "cold", "reception"]

It matched all the words in the sentence because they're all at least two characters long, except "a", which is only one character long.

Shorthand Quantifiers: +, *, ?

1. One or more Quantifier: +

Whenever you see the plus (+) after a character, set or range, or a character class, it simply means at least one of that character is expected in the pattern. It is a shorthand way of writing the quantifier {1,}.

Say we want to match a number. A number is formed by one or more digits, right? Therefore, to match a number, at least one digit is needed. Hence, the regular expression:

let rgx = /[0-9]+/g;

// or
rgx = /\d+/g;

With the regexp above, let's try and match all numbers in a string:

let rgx = /\d+/g;

let str = "It was 12 o'clock when we left, but we got there exactly 3 o'clock.";

console.log(str.match(rgx));

As you'd expect:

(2) ["12", "3"]

The + quantifier is used when we expect a character at least once in a pattern.

2. Zero or more Quantifier: *

This is a short way of writing {0,}. When used after a character, it means that the character may be absent from the text, or appear any number of times (without a limit).

To demonstrate this, let's find any number which may end with a zero:

// the * quantifier after the zero (0) means
// the zero may be absent, or appear several times
// so first, any digit then a zero (which may be absent)
let rgx = /\d0*/g;

let str = "200 30 7";

console.log(str.match(rgx));

Result:

(3) ["200", "30", "7"]

As you can see, even though the number 7 doesn't end with a zero, it's still considered a match.

3. Zero or one Quantifier: ?

Writing the quantifier {0,1} after a character means the character is allowed to be absent, but can appear at most once. In other words, the character is made optional. The shorthand for this quantifier is ?. Simply place a question mark (?) after a character (or character class, etc.) to make it optional.

Americans spell the word FAVOR like this, while the British include the letter U, like this: FAVOUR. Take off the letter U and the word is still a valid English word. So, generally speaking, the U is optional. Here's how we can match such a word:

// placing ? after the "U" makes it optional
let rgx = /favou?r/gi;

let str = "Favour or favor.";

console.log(str.match(rgx));

Result:

(2) ["Favour", "favor"]

Both words match.
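To tie the three shorthands back to their brace forms, here's a small sketch that runs each pair against the same text and prints both results side by side. The sample string and the variable names are made up for illustration:

```javascript
// Each shorthand paired with its brace form (pairs chosen for illustration)
const pairs = [
  [/\d+/g, /\d{1,}/g], // one or more
  [/\d0*/g, /\d0{0,}/g], // zero or more
  [/favou?r/gi, /favou{0,1}r/gi], // zero or one
];

const sample = "Favour 200 favor 30 7";

const comparisons = pairs.map(([shorthand, braces]) => ({
  shorthand: sample.match(shorthand),
  braces: sample.match(braces),
}));

for (const { shorthand, braces } of comparisons) {
  // both forms should produce identical match lists
  console.log(shorthand, braces);
}
```

Each pair should log the same list twice, since the shorthand and the brace form describe the same quantity.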

Capturing Groups

A group in a regular expression is any part of it which is enclosed in parentheses ().

It allows us to get a certain part of a match as a separate item in the results array. Secondly, when we write a quantifier after the closing parenthesis, the entire group is affected, not just a character.

Say we have a string like "gogogo", where "go" is repeated more than once. Appending + after the o, i.e. go+, won't help: it matches strings like "gooooooo", because the pattern matches a "g" followed by one or more "o"s.

With a group, we can treat "g" and "o" as a unit and make the pair appear once or more instead:

/* 1. Grouping g and o with parentheses.
   2. The + quantifier requires the whole group to appear at least once.
   3. The ^ and $ anchors denote that the string should be a full match, from start to end.
*/

let rgx = /^(go)+$/i;

let strings = ["go", "gogogo", "goooo"];

for (let str of strings) {
  console.log(`Matched "${str}" :::: `, str.match(rgx));
}

Result:

Matched "go" ::::  (2) ["go", "go", index: 0, input: "go", groups: undefined]

Matched "gogogo" ::::  (2) ["gogogo", "go", index: 0, input: "gogogo", groups: undefined]

Matched "goooo" ::::  null

  • Only the first two strings in the array matched, and each returned an array of two items instead of one. The second item in the returned array is the captured group, "go", defined in the regular expression as (go). Because we used the one or more quantifier (+), the group could appear once, or more.

  • The third string in the array, however, didn't match even though it began with "go". This is only because of the anchors we used, ^ and $, denoting that the string itself must be a full match. Remove these anchors and you'll see that the first two characters in the third string also match.
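To see the difference between quantifying a single character and quantifying a group, here's a short sketch that puts go+ and (go)+ side by side, and then drops the anchors as suggested above. The variable names are only for illustration:

```javascript
const repeatO = /^go+$/i; // "g", then one or more "o"
const repeatGo = /^(go)+$/i; // the whole "go" group, one or more times

console.log(repeatO.test("goooo")); // true
console.log(repeatO.test("gogogo")); // false
console.log(repeatGo.test("gogogo")); // true

// Without the anchors, a partial match is enough
const unanchored = "goooo".match(/(go)+/i);
console.log(unanchored[0]); // "go"
console.log(unanchored.index); // 0
```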

Example

Let's match a domain name. We know that the following are valid domain names:

site.com
subdomain.site.com
sub.subdomain.site.com
site2.co.gh

It's clear that all the domain names begin with a series of characters followed by a dot (.), and end with a TLD such as com, which could also be info, edu, gov, etc. The TLD could also have a country code, like .uk, attached to it; in this case, co.gh (Ghana).

Since the main domain and the sub-domain names follow the same pattern, we can repeat them:

let rgx = /([\w-]+\.)+(\w+)(\.\w+)?/;

let domains = [
  "site.com",
  "subdomain.site.com",
  "sub.subdomain.site.com",
  "site2.co.gh",
];

for (let d of domains) {
  console.log(`Matched "${d}" :::: `, rgx.test(d));
}

Result:

Matched "site.com" ::::  true
Matched "subdomain.site.com" ::::  true
Matched "sub.subdomain.site.com" ::::  true
Matched "site2.co.gh" ::::  true

Explanation:

  • The first group, ([\w-]+\.), defines a set containing the word characters of the class \w plus a hyphen -. The + quantifier inside the group denotes that one or more characters from the set are expected. Then comes an escaped dot, \., which matches an actual dot. The + after the first group denotes that the entire group could appear once in the string, or several times. Because + is greedy, the group repeats as many times as it can while still leaving something for the rest of the pattern. It therefore matches the portions site., subdomain.site., sub.subdomain.site., and site2.co. in the strings inside the domains array.

  • The second group, (\w+), uses the character class \w with the + quantifier to denote that one or more word characters, ([a-zA-Z0-9_]), are expected. This matches the final label, com or gh, in the strings inside the domains array.

  • The third group, (\.\w+), defines an actual dot \., followed by one or more word characters. The ? quantifier after the third group shows that the group itself is optional, so it may be absent from the string and it'll still be a match. It is meant to cover a country suffix such as .gh, but with these four strings the greedy first group has already consumed co., so the third group ends up matching nothing.
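It can help to look at what each group actually holds after a match, since a repeated group only keeps its last repetition and the greedy + decides how the string is divided. Here's a minimal sketch using match() without the g flag; the destructured names are only for illustration:

```javascript
const rgx = /([\w-]+\.)+(\w+)(\.\w+)?/;

// match() without the g flag returns the full match followed by each group
const [full, label, tld, extra] = "site2.co.gh".match(rgx);

console.log(full); // "site2.co.gh"
console.log(label); // "co."  (last repetition of the first group)
console.log(tld); // "gh"
console.log(extra); // undefined (the optional third group matched nothing)
```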

Named Groups

We can name groups in a regular expression.

To do this, immediately after the opening parenthesis (, write this: ?<name>; where <name> is the name you'd like to give the group.

Let's match a date in the format:

2020-03-23

We'll name the year, month and day portions in the date. So, our regexp should look like this:

let rgx = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

The benefit of doing this is that we'll be able to access them by name on the groups property of the array returned by the match() method.

Let's see if it works:

let rgx = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

let date = "2020-03-23";

let groups = date.match(rgx).groups;
/*
 Or, with ES6 destructuring:

 let { groups } = date.match(rgx);
*/

console.log(groups);
console.log("Year:", groups.year);
console.log("Month:", groups.month);
console.log("Day:", groups.day);

Result:

{year: "2020", month: "03", day: "23"}
Year: 2020
Month: 03
Day: 23
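Named groups can also be used when replacing text. Here's a small sketch, assuming we want to turn the date into day/month/year order. The $<name> syntax in the replacement string is standard JavaScript, while the target format is just an example:

```javascript
const rgx = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

const date = "2020-03-23";

// $<name> in the replacement string refers to a named group
const reordered = date.replace(rgx, "$<day>/$<month>/$<year>");
console.log(reordered); // "23/03/2020"

// Named groups are still available by position as well
const match = date.match(rgx);
console.log(match[1] === match.groups.year); // true
```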

The str.matchAll() Method

The str.match() method is limited in the sense that, when we search for all matches using the global flag (g), it doesn't return details about groups that were matched. Let's demonstrate this:

Re-write the code in the previous example where we searched for domain names:

// attach the g flag
let rgx = /([\w-]+\.)+(\w+)(\.\w+)?/g;

let domains = [
  "site.com",
  "subdomain.site.com",
  "sub.subdomain.site.com",
  "site2.co.gh",
];

for (let d of domains) {
  // use the match() method instead of test()
  console.log(d.match(rgx));
}

Results:

["site.com"]
["subdomain.site.com"]
["sub.subdomain.site.com"]
["site2.co.gh"]

You see that for each string in the domains array, the match() method returns an array containing only the full match, which here happens to be the entire string we're searching. No details about the matched groups are provided.

Change the match() method to matchAll(), and re-run the code.

We get these results:

RegExpStringIterator {}
RegExpStringIterator {}
RegExpStringIterator {}
RegExpStringIterator {}

For each pattern search, the str.matchAll() method returns a RegExpStringIterator object, which can be turned into an actual array using the static Array.from() method. So, re-write the for..of loop like so:

for (let d of domains) {
  // convert each RegExpStringIterator object to an array
  let result = Array.from(d.matchAll(rgx));

  console.log(result);
}

Results:

[Array(4)]
[Array(4)]
[Array(4)]
[Array(4)]

Now, all results have become arrays. You can expand them in the browser's developer console to see the details, or use the [index] syntax to get the details at a specific index.

You may avoid using the static Array.from() method and use a loop over the results returned by the .matchAll() method instead:

let domain = "sub.subdomain.site.com";

let rgx = /([\w-]+\.)+(\w+)(\.\w+)?/g;

// ignoring the Array.from() method
let results = domain.matchAll(rgx);

for (let result of results) {
  console.log(result);
}

Result:

(4) ["sub.subdomain.site.com", "site.", "com", undefined, index: 0, input: "sub.subdomain.site.com", groups: undefined]

For an array of domains, you'd have to run a loop for each of the domains to be able to do this.

Or... We could use ES6 destructuring:

let domain = "sub.subdomain.site.com";

let rgx = /([\w-]+\.)+(\w+)(\.\w+)?/g;

// destructuring
let [ result ] = domain.matchAll(rgx);

console.log(result);

Result:

(4) ["sub.subdomain.site.com", "site.", "com", undefined, index: 0, input: "sub.subdomain.site.com", groups: undefined]

Note that, for the str.matchAll() method to work, the regular expression must have the global flag, g. Otherwise, you'll run into a TypeError.
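Where matchAll() really pays off is a single string containing several matches, each with its own groups. Here's a minimal sketch that reuses the named-group date pattern on a made-up sentence, and also shows the TypeError mentioned above; the variable names and sample text are only for illustration:

```javascript
const rgx = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;

const log = "Opened 2020-03-23, closed 2020-04-01."; // sample text for illustration

// Each result carries its own groups and index, unlike match() with g
const found = [...log.matchAll(rgx)].map((m) => ({
  date: m[0],
  month: m.groups.month,
  index: m.index,
}));

console.log(found);

// Without the g flag, matchAll() throws
let errorName = null;
try {
  log.matchAll(/\d{4}/);
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // "TypeError"
```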


To end the article, we'll discuss Alternation.

Alternation (OR): |

Alternation in regular expression simply means "OR", denoted by the pipe character (|).

It can be used when you want to match an expression or the other. For instance, matching either "pizza" or "pie" could be written as /pizza|pie/. Since they both begin with "pi", we could do:

// g flag so that every occurrence is returned, not just the first
let rgx = /(pi)(zza|e)/gi;

let whatSnack = "Do you want pizza or pie?";

console.log(whatSnack.match(rgx));

// (2) ["pizza", "pie"]
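One thing to watch with alternation is its scope: without parentheses, | splits the entire pattern into two alternatives, not just the words next to it. Here's a short sketch of the difference; the patterns and test strings are made up for illustration:

```javascript
// Without parentheses, | splits the entire pattern in two
const unscoped = /^I like pizza|pie$/;
// With parentheses, only "pizza" and "pie" are alternatives
const scoped = /^I like (pizza|pie)$/;

console.log(unscoped.test("apple pie")); // true: the right-hand side "pie$" matches
console.log(scoped.test("apple pie")); // false
console.log(scoped.test("I like pie")); // true
```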

This marks the end of our discussion. I might write another article on Regular Expressions this week, so follow me or subscribe to my newsletter if you want to know when it's published.

If you have any questions, or anything you'd like to add, please leave a comment in the section below.

Thanks for reading! 🙏
