JavaScript Lecture Notes           

Instructor: Amadou O. Wane
Last update: 11/06/01

What are RegExps?

RegExps, shorthand for regular expressions, are used in pattern matching and substitution operations. A regular expression is actually a grammar for a little language. The regular expression interpreter (which we'll call the Engine) takes your grammar and compares it with the string you're doing pattern matching on. The Engine then returns a Boolean value, which depends on whether or not the string can be parsed as a sentence of your little language.

A regular expression is really just a sequence or a pattern of characters that is matched against a string of text when performing searches and replacements. A simple regular expression consists of a single character or a set of characters that matches itself.

Regexps are a very powerful tool. They pack a lot of meaning into a short space, so every single character in a regular expression matters. I once wasted several hours trying to fix a Perl script, only to find out later that I had forgotten a little "?" somewhere in one of the regular expressions I had constructed.

Many tasks can be done with regular expressions. The most common one is to find out whether a given string matches a particular pattern. You can also find out where the matching substring is located within the string. You can use a substitution command to replace matching sections with another string of your choice.
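As a rough sketch of these three tasks, the snippet below uses the standard test(), search(), and replace() methods. The sample string and variable names are made up for illustration, and console.log stands in for document.write so that the snippet can also run outside a browser.

```javascript
// Hypothetical sample text, chosen only for illustration.
var sentence = "Bart skates to school";
var pattern = /skates/;

// 1. Does the string match the pattern at all?
var found = pattern.test(sentence);               // true

// 2. Where does the matching substring start? (-1 would mean no match)
var position = sentence.search(pattern);          // 5

// 3. Replace the matching section with another string.
var changed = sentence.replace(pattern, "walks"); // "Bart walks to school"

console.log(found, position, changed);
```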

Don't worry if you still don't quite understand what regexps are. You'll soon become a regular expression expert.

Modifier   Description
g          Do global pattern matching.
i          Do case-insensitive pattern matching.
m*         Treat the string as multiple lines.
s*         Treat the string as a single line.
x*         Ignore whitespace within a pattern.

* Modifiers that are not supported by Navigator 4.0x and Internet Explorer 4.0.

The following pattern matches both "javascript" and "JavaScript":

/JavaScript/i

The /i modifier instructs the Engine to perform case-insensitive pattern matching, so the case of alphabetic characters doesn't matter.

The /x modifier tells the Engine to ignore whitespace that is not backslashed or within a character class. Use this modifier to break up your regular expression into more readable parts. The following patterns match "abc" (the second one uses Perl's m## notation, which is not JavaScript syntax):

/a b c/x
m#a b c#x

Although the /x modifier is a documented feature, it is not supported by Navigator 4.0x or Internet Explorer 4.0. The only modifiers that are currently supported by Navigator 4.0x and Internet Explorer 4.0 are /i and /g. You can attach both modifiers to a single pattern in the following fashion:

/abc/gi
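To see what each supported modifier changes, here is a small sketch that runs the same pattern with no modifier, with /i, and with /gi. The sample text is an assumption made for illustration, and console.log is used in place of document.write.

```javascript
// Hypothetical sample text containing three spellings of the same word.
var text = "JavaScript, javascript, JAVASCRIPT";

// No modifier: case-sensitive, and only the first match is reported.
var plain = text.match(/javascript/);

// /i: case-insensitive, still only the first match.
var insensitive = text.match(/javascript/i);

// /gi: case-insensitive and global, so every occurrence is returned.
var all = text.match(/javascript/gi);

console.log(plain[0]);        // "javascript"
console.log(insensitive[0]);  // "JavaScript"
console.log(all.length);      // 3
```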

Constructing Regular Expressions

In this section we'll discuss the basics of regular expressions. Before we dive into interpretation rules, let's examine some characteristics of regular expressions.

Most characters in a regular expression simply match themselves. If you string several characters in a row, they must match in order. So, if you write the pattern:

/Bart/

it won't match unless the string contains the substring "Bart" somewhere. The following pattern can be used to determine roughly if a string is a real e-mail address:

/@/

As we proceed, we will discuss much more reliable patterns for e-mail verification.

Some characters don't match themselves, but are metacharacters. You can match these characters literally by placing a backslash in front of them. For example, "\\" matches a backslash and "\$" matches a dollar-sign. Here's the list of metacharacters:

 

\ | () [ { ^ $ * + ? .

 

A backslash also turns an alphanumeric character into a metacharacter. So whenever you see a backslash followed by an alphanumeric character:

 

\d \D \w \W \t \s 

 

you'll know that the sequence matches something strange. For example, \t matches a tab character, while \d matches any digit. Some sequences are actually zero characters wide. For instance, "\b" matches a word boundary, which is not a real character -- it is zero characters wide.
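The following sketch tries out a few of these backslash sequences. The sample string is hypothetical; note how \b succeeds or fails depending on position, without consuming any character.

```javascript
// Hypothetical sample line containing a tab character.
var line = "Room\t42 is on floor 3";

var hasTab = /\t/.test(line);            // true: the string contains a tab
var firstDigit = line.match(/\d/)[0];    // "4": the first digit found

// \b is zero-width: it matches a position, not a character.
var wordIs = /\bis\b/.test(line);        // true: "is" stands alone as a word
var insideWord = /\bis\b/.test("this");  // false: "is" is inside a longer word

console.log(hasTab, firstDigit, wordIs, insideWord);
```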

Regular expressions consist mostly of assertions; a plain character, for example, simply asserts that it matches itself. We'll reserve the term "assertions" for the zero-width ones, and call assertions that have width atoms. As there is no standard terminology, we use the one from "Programming Perl." As a matter of fact, most of our explanations are based on this great book.

Regular expressions can include non-assertions, such as the alternation operator, which is indicated with a vertical bar:

 

/Homer|Marge|Bart|Lisa|Maggie/

 

Any of those strings can trigger a match. That is, the preceding expression matches all of the following strings:

  • "Homer"
  • "Bart"
  • "Lisa Simpson"
  • "Simpson, Marge"
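A quick way to check the list above is to run the pattern against each string. In this sketch the last sample ("Ned Flanders") is an extra, hypothetical string added to show a failing case.

```javascript
var family = /Homer|Marge|Bart|Lisa|Maggie/;

// The four strings from the list, plus one hypothetical non-matching string.
var samples = ["Homer", "Bart", "Lisa Simpson", "Simpson, Marge", "Ned Flanders"];

var results = [];
for (var i = 0; i < samples.length; i++) {
  // test() returns true as soon as any alternative is found anywhere in the string.
  results.push(family.test(samples[i]));
}

console.log(results.join(", ")); // true, true, true, true, false
```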

 

You can group alternatives with parentheses, as in the following expression:

 

/(Homer|Marge|Bart|Lisa|Maggie) Simpson/
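The parentheses matter because alternation otherwise extends to the whole pattern. The sketch below compares the grouped pattern with an ungrouped variant; the ungrouped pattern is a hypothetical counterexample, not something the notes recommend.

```javascript
var grouped = /(Homer|Marge|Bart|Lisa|Maggie) Simpson/;
// Without parentheses, " Simpson" belongs to the last alternative only.
var ungrouped = /Homer|Marge|Bart|Lisa|Maggie Simpson/;

var fullName = grouped.test("Bart Simpson");   // true
var firstOnly = grouped.test("Bart");          // false: " Simpson" is required
var looseMatch = ungrouped.test("Bart");       // true: "Bart" alone is one alternative

console.log(fullName, firstOnly, looseMatch);
```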

 

Quantifiers say how many of the previous substring should match in a row. Here are a few quantifiers:

 

* + ? {4,8} {5,}

 

Quantifiers can only be put after atoms, that is, assertions with width. They attach only to the previous atom, so if you want a quantifier to apply to multiple characters, you must group them together, like this:

 

/(Bart){3}/

 

This pattern matches "BartBartBart", whereas the following pattern matches the string "Barttt":

 

/Bart{3}/

Assertion   Description
^           Matches at the beginning of the string.
$           Matches at the end of the string.
\b          Matches a word boundary (between \w and \W), when not inside [].
\B          Matches a non-word boundary.

 

Quantifier   Range
{m,n}        Must occur at least m times, but not more than n times.
{n,}         Must occur at least n times.
{n}          Must occur exactly n times.
*            Must occur 0 or more times (same as {0,}).
+            Must occur 1 or more times (same as {1,}).
?            Must occur 0 or 1 time (same as {0,1}).
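The assertions and quantifiers in these tables can be tried out as follows. This is only a sketch: the five-digit zip pattern and the "colour" pattern are hypothetical examples, not patterns taken from the notes.

```javascript
// A quantifier binds to the previous atom only, unless you group.
var groupRepeated = /(Bart){3}/.test("BartBartBart");  // true
var lastCharOnly = /Bart{3}/.test("BartBartBart");     // false: needs "Barttt"
var tripleT = /Bart{3}/.test("Barttt");                // true

// ^ and $ pin the pattern to the start and the end of the string.
var zip = /^\d{5}$/;                   // hypothetical five-digit check
var exactZip = zip.test("33635");      // true
var tooLong = zip.test("336355");      // false: six digits
var embedded = zip.test("zip 33635");  // false: text precedes the digits

// ? makes the previous atom optional.
var optional = /^colou?r$/;
var bothSpellings = optional.test("color") && optional.test("colour"); // true

console.log(groupRepeated, lastCharOnly, tripleT, exactZip, tooLong, embedded, bothSpellings);
```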

 

A backslashed character matches a special character or a character class (a set of characters, any one of which may match). Here's a list of special characters:

Character   Matches
\n          Linefeed
\r          Carriage return
\t          Tab
\v          Vertical tab
\f          Form-feed
\d          A digit (same as [0-9])
\D          A non-digit (same as [^0-9])
\w          A word (alphanumeric) character (same as [a-zA-Z_0-9])
\W          A non-word character (same as [^a-zA-Z_0-9])
\s          A whitespace character (same as [ \t\v\n\r\f])
\S          A non-whitespace character (same as [^ \t\v\n\r\f])
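As an illustration of the character classes in this table, the sketch below pulls digits, words, and whitespace out of a made-up string. It relies on the /g modifier so that match() returns every occurrence.

```javascript
// Hypothetical sample string with spaces, a colon, and a tab.
var sample = "Order 66:\tship 3 crates";

var digits = sample.match(/\d/g);            // each digit on its own
var words = sample.match(/\w+/g);            // runs of word characters
var spaceCount = sample.match(/\s/g).length; // spaces and tabs both count

console.log(digits.join(""));   // 663
console.log(words.join("|"));   // Order|66|ship|3|crates
console.log(spaceCount);        // 4
```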

 

Regular Expression (and String) Methods

In this section we'll discuss methods that are related to regular expressions. Some are invoked as a method of a regular expression, whereas others are called as a string's method.

compile

compile() is invoked as a method of a regular expression. Its syntax is:

 

regexp.compile("PATTERN", ["g"|"i"|"gi"])

 

regexp is the name of a regular expression.
PATTERN is the text of the regular expression.

Use the compile() method with a regular expression that was created with the constructor function (not the literal notation). Use the compile method when you know the regular expression will remain constant (after getting its pattern) and will be used repeatedly throughout the script. This method actually converts the specified pattern into its internal format, for faster execution.

The compile() method can also be used to change a regular expression during execution:

var reg = new RegExp("Bart", "i");
// compile() recompiles reg in place with the given pattern and modifier.
reg.compile("Bart", "i");
// Writing a regular expression displays it in literal notation.
document.write("<br>" + reg);

You can also use this method to modify a regular expression's modifier:

 

var reg = new RegExp("bart", "i");
// reg matches "Bart" here (case-insensitive)
reg.compile("bart");
// reg no longer matches "Bart" here (the i modifier was dropped)

test

test() is invoked as a method of a regular expression. Its syntax is:

 

regexp.test(str)

 

regexp is the name of a regular expression.
str is the string against which the regular expression is matched.

The test() method checks whether a pattern exists within a string, and returns true if so and false otherwise. This method doesn't affect the global RegExp object.

The following script segment demonstrates the test() method:

 

var str = "tomer@netscent.com";
var reg = new RegExp("@");
if (reg.test(str))
  alert(str + " is a valid e-mail address!");
else
  alert(str + " is an invalid e-mail address!");

// "bar" contains no "@", so test() returns false here.
var my_regtest = reg.test("bar");
if (my_regtest) {
  document.write("<br> There is a match");
} else {
  document.write("<br> There is no match");
}

// The same check applied to a different address.
var email = "javascript@intechs.net";
var reg2 = new RegExp("@");
if (reg2.test(email))
  alert(email + " is a valid e-mail address");
else
  alert(email + " is an invalid e-mail address");
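Checking only for "@" accepts many strings that are clearly not addresses. As a hedged sketch of a somewhat stricter check, the pattern below requires text, an "@", more text, a dot, and more text, with no whitespace or second "@". Both the pattern and the looksLikeEmail helper are assumptions made for illustration; this is not a complete e-mail validator.

```javascript
// Hypothetical stricter pattern: local part, "@", domain, ".", suffix.
var emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Hypothetical helper that wraps test() for readability.
function looksLikeEmail(str) {
  return emailPattern.test(str);
}

var good = looksLikeEmail("tomer@netscent.com");           // true
var atOnly = looksLikeEmail("@");                          // false: nothing around the "@"
var noDot = looksLikeEmail("tomer@netscent");              // false: no dot in the domain
var withSpace = looksLikeEmail("two words@netscent.com");  // false: contains a space

console.log(good, atOnly, noDot, withSpace);
```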

match

match() is invoked as a method of a string. Its syntax is:

 

str.match(regexp)

regexp is the name of a regular expression. You can supply it as a literal or as a variable.
str is any string.

Without the /g modifier, this method works like exec(), except that its object is a string and its argument is a regular expression. With the /g modifier, it returns an array of all the matches in the string.

The following script segment demonstrates the match() method:

var address = "123 main street Tampa fl 33622-2566 tampa";
var my_string = "Beware of bad dogs, beware d99g dkdk";
var my_string2 = "possible, mission, dos";
// A ZIP+4 code: five digits, a hyphen, and four digits.
document.write("The zip code is: " + address.match(/\d{5}-\d{4}/g));
// Every non-digit character in the address.
document.write("<br>All non-digits: " + address.match(/\D/g));
// Each "d" that starts a word, plus the four characters after it.
document.write("<br> " + my_string.match(/\bd..../g));
// Each word containing an "s" that is not its first character.
document.write("<br> " + my_string2.match(/\w+s{1,2}\w{0,}/gi));
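The /g modifier changes what match() hands back, which the following sketch tries to make visible. The sample address is adapted from the notes; console.log replaces document.write so the snippet can run outside a browser.

```javascript
var address = "123 main street Tampa fl 33622-2566";

// Without /g: an array describing the first match only, with an index property.
var first = address.match(/\d+/);

// With /g: a plain array of every match.
var every = address.match(/\d+/g);

// No match at all: the result is null, so check before indexing.
var none = address.match(/xyz/);

console.log(first[0], first.index);  // 123 0
console.log(every.join(", "));       // 123, 33622, 2566
console.log(none);                   // null
```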


replace

replace() is invoked as a method of a string. Its syntax is:

 

str.replace(regexp, replaceStr)

 

regexp is the name of a regular expression. You can supply it as a literal or as a variable.
replaceStr is the string that replaces the matched text.
str is any string.

The following script replaces the first two words of a string with a new phrase:

var company = "Digital";
var str = "Intel is a chip manufacturer!";
var newstr = str.replace(/Intel is/, company + " is");
document.write(newstr); // prints "Digital is a chip manufacturer!"

 

Note that in Navigator 4.0x only, the regular expression can also be enclosed in ordinary quotes.

If you want to enable multiple replacements, the regular expression should utilize a /g modifier:

 

var str = "Car Car Car";
var newstr = str.replace(/Car/g, "Bus");
document.write(newstr); // prints "Bus Bus Bus"

If you do not include this modifier, only the first match is replaced with the alternative string, so the preceding script would print "Bus Car Car".
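The replacement string can also refer back to parenthesized groups in the pattern: $1 stands for the text matched by the first group and $2 for the second. The notes above do not cover this, so treat the following as a supplementary sketch; it swaps the first two words of a string.

```javascript
var str = "Intel is a chip manufacturer!";

// Group 1 captures the first word, group 2 the second; "$2 $1" writes them back reversed.
var swapped = str.replace(/^(\w+) (\w+)/, "$2 $1");

console.log(swapped); // "is Intel a chip manufacturer!"
```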


split

split() is invoked as a method of a string. Its syntax is:

 

str.split(regexp)

 

regexp is the name of a regular expression. You can supply it as a literal or as a variable. It can also be an ordinary string.
str is any string.

This method updates the RegExp object if a match is found.

The split() method scans a string (which is actually its object) for delimiters, and splits the string into a list of substrings, returning the resulting list in the form of an array. The delimiters are determined by repeated pattern matching, using the given regular expression. Thus, the delimiters may be of any size and do not need to be the same string on every match. If the pattern does not match at all, the method returns the original string as a single substring. If it matches once, you get two substrings, in the form of a two-element array.
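A short sketch of these cases follows. The sample strings are hypothetical; the delimiter pattern accepts a comma or a semicolon surrounded by any amount of whitespace, so the delimiters differ in size from match to match.

```javascript
// Delimiters of varying size and content.
var list = "apples, pears ;plums,  figs";
var fruits = list.split(/\s*[,;]\s*/);

// No match: the whole string comes back as the only element.
var noMatch = "no delimiters here".split(/[,;]/);

// One match: a two-element array.
var oneMatch = "left;right".split(/[,;]/);

console.log(fruits.join("|"));  // apples|pears|plums|figs
console.log(noMatch.length);    // 1
console.log(oneMatch.length);   // 2
```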

In Netscape Navigator 4.0x you can also hand the method an integer, so the method splits the string into no more than that many fields. If a regular expression is not provided, the method returns an array whose only element is the original string. So the following statement does not split the string; it merely wraps it in a one-element array:

 

str = str.split();
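The integer limit mentioned above is passed as a second argument, which standard JavaScript also accepts. A minimal sketch, using a made-up list of names:

```javascript
// Split on single spaces, but keep at most three fields.
var fields = "Homer Marge Bart Lisa Maggie".split(/ /, 3);

console.log(fields.join("|"));  // Homer|Marge|Bart
console.log(fields.length);     // 3
```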

 

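The following statements split a string at a literal "~" delimiter:

var split_string = "Hello my world ~ hello";
document.write("<br>" + split_string.split(/~/));
document.write("<br>" + split_string.split(/~/).join("|"));

The first statement writes the two substrings separated by a comma (the default when an array is converted to a string); the second writes them separated by "|".

A pattern never matches in one spot more than once, even if it matched with a zero width. Here's an example:

document.write("a string".split(/ */).join(", "));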

This statement outputs the following string:

 

a, s, t, r, i, n, g

 

The space between the two words (a, string) disappeared because it matched as part of the delimiter. As a reminder, the join() method joins an array's elements into one string and puts the given delimiter between each of the substrings. The statement:

 

anyString.split(//);

 

should return an array of the characters in anyString, including spaces. For example:

 

document.write("a string".split(//).join(", "));

 

should produce the following output, but Navigator 4.0x and Internet Explorer 4.0 generate an error when a null pattern is used, because JavaScript reads // as the start of a comment rather than as an empty pattern:

 

a,  , s, t, r, i, n, g

 

So instead of a null pattern, you should use an ordinary null string to split a string into characters:

 

document.write("a string".split("").join(", "));
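If you do want an empty pattern rather than an empty string, one option is to build it with the RegExp constructor, since the literal form // cannot be written. The sketch below compares the two approaches; it assumes a current JavaScript engine, and the behavior in Navigator 4.0x or Internet Explorer 4.0 has not been checked.

```javascript
// Splitting on an empty string yields one element per character.
var viaString = "a string".split("").join(", ");

// new RegExp("") builds an empty pattern without writing //.
var viaRegExp = "a string".split(new RegExp("")).join(", ");

console.log(viaString);               // a,  , s, t, r, i, n, g
console.log(viaString === viaRegExp); // true
```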