Chapter 3: Regular Expressions

Regular Expression Basics

Description

Understanding how to use regular expressions is fundamental to any Perl programmer. The essential purpose of a regular expression is to match a pattern, and Perl provides two operators for doing just that: m// (match) and s/// (substitute). (The ins and outs of those operators are covered in their own entries.)

When Perl encounters a regular expression, it's handed to a regular expression engine and compiled into a special kind of state machine (a Nondeterministic Finite Automaton). This state machine is used against your data to determine whether the regular expression matches your data. For example, to use the match operator to test whether the word fudge exists in a scalar value:

$r=q{"Oh fudge!" Only that's not what I said.};
if ($r =~ m/fudge/) {
   # ...
}

The regular expression engine takes /fudge/, compiles a state machine to use against $r, and executes the state machine. If it was successful, the pattern matched.

This was a simple example, and could have been accomplished more quickly with the index function. The regular expression engine comes in handy because the pattern can contain metacharacters. Regular expression metacharacters are used to specify things that might or might not be in the data, things that look different (uppercase? lowercase?) in the data, or portions of the pattern that you just don't care about.
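As a rough sketch of that trade-off, the following compares index with the match operator on the same string. The variable names and the second pattern are illustrative assumptions, not part of the original example.

```perl
use strict;
use warnings;

my $r = q{"Oh fudge!" Only that's not what I said.};

my $where   = index($r, "fudge");          # offset of the substring, or -1
my $literal = ($r =~ /fudge/)   ? 1 : 0;   # the same question asked with m//
my $pattern = ($r =~ /f.dge\W/) ? 1 : 0;   # a question index cannot ask

print "index: $where, m//: $literal, with metacharacters: $pattern\n";
```

For a fixed substring, index is enough; the pattern with . and \W is where the regular expression engine starts to earn its keep.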

The simplest metacharacter is the . (dot). Within a regular expression, the dot stands for a "don't care" position. Any character will be matched by a dot:

m/m..n/;  # Matches: main, mean, moan, morn, moon, honeymooner,
         # m--n, "m  n", m00n, m..n, m22n etc... (but not "mn")

The exception is that a dot won't normally match a newline character. For that to happen, the match must have the /s modifier tacked on to the end. See the modifiers entry for details.
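A minimal sketch of the newline exception, assuming a made-up three-character test string:

```perl
use strict;
use warnings;

my $text = "m\nn";                           # 'm', a newline, then 'n'

my $plain  = ($text =~ /m.n/)  ? 1 : 0;      # dot refuses the newline
my $dotall = ($text =~ /m.n/s) ? 1 : 0;      # /s lets dot match it

print "without /s: $plain, with /s: $dotall\n";
```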

Metacharacters stand in for other characters (see "Character Shorthand") or stand in for entire classes of characters (character classes). They also specify quantity (quantifiers), choices (alternators), or positions (anchors).

In general, something that is normally a metacharacter can be made "unspecial" by prefixing it with a backslash, which is sometimes called "escaping" the character. So to match a literal m..n (with real dots), change the expression to

m/m\.\.n/; # Matches only m..n

The full list of metacharacters is \, |, ^, $, *, +, ?, ., (, ), [, {

Everything else in Perl's regular expressions matches itself. A normal character (nonmetacharacter) can sometimes be turned into a metacharacter by adding a backslash. For example, "d" is just a letter "d". However, preceded by a backslash,

/\d/

it matches a digit. More of this is covered in the "Character Shorthand" section. The entire set of metacharacters as well as some contrived metacharacters are covered elsewhere in this book.

As you browse the remainder of this section, keep in mind that there are just a few rules associated with regular expression matching. These are summarized as follows:

The engine starts matching at the earliest possible position in the target string.

Unless otherwise directed (with ?), quantifiers will always match as much as possible while still allowing the whole expression to match.

To sum up: the largest possible first match is normally taken.
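The "largest possible first match" summary can be seen in a small sketch. The sample data here is invented for illustration.

```perl
use strict;
use warnings;

my $data = "xx 123 4567";

# \d+ could match "4567", the longest run of digits in the string, but
# the engine takes the first position where a match is possible and is
# greedy from that position onward.
my ($first) = $data =~ /(\d+)/;

print "matched: $first\n";
```

The match is 123 rather than 4567: earliest wins over longest, and greed applies only once the starting position is fixed.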

For more information on how regular expression engines work, see the book Mastering Regular Expressions by Jeffrey Friedl.


See Also

m//, s///, character classes, alternation, quantifiers, character shorthand, line anchors, word anchors, grouping, backreferences and qr in this book


Basic Metacharacters and Operators

Match Operator

m//

Usage

m/pattern/modifiers

Description

The m// operator is Perl's pattern match operator. The pattern is first interpolated as though it were a double-quoted string—scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

Next, the pattern is used to match data against the $_ variable unless the match operator has been bound with the =~ operator.

m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/;      # Match against $_
$t=~m/(?:\(?\d{3}\)?-)?\d{3}-\d{4}/;  # Match against $t

In a scalar context, the match operator returns true if it succeeds and false if it fails. With the /g modifier, in scalar context the match will proceed along the target string, returning true each time, until the target string is exhausted.

The modifiers (other than /g and /c) are described in the Match Modifiers entry.

In a list context, the match operator returns a list consisting of all the matched portions of the pattern that were captured with parentheses (as well as setting $1, $2 and so on as a side-effect of the match). If there are no parentheses in the match, the list (1) is returned. If the match fails, the empty list is returned.

In a list context with the /g modifier, the list of substrings matched by capturing parentheses is returned. If no parentheses are in the pattern, it returns the entire contents of each match.

$_=q{I do not like green eggs and ham, I do not like them Sam I Am};

$match=m/\w+/;        # $match=1
$match=m/(\w+)/g;     # $match=1, $1="I"
$match=m/(\w+)/g;     # $match=1, $1="do"
$match=m/(\w+)/g;     # $match=1, $1="not" .. and so on

@match=m/\w*am\b/i;       # @match=(1)
@match=m/(\b\w{4}\b)/i;   # @match=('like');
@match=m/(\w+)\W+(\w+)/i; # @match=qw(I do);

@match=m/\w*am\b/ig;      # @match=qw( ham Sam Am )
@match=m/(\b\w{4}\b)/ig;  # @match=qw( like eggs like them )
@match=m/(\w+)\W+(\w+)/ig;# @match=qw( I do not like [...] Sam I Am )

After a failed match with the /g modifier, the search position is normally reset to the beginning of the string. If the /c modifier also is specified, this won't happen, and the next /g search will continue where the old one failed. This is useful if you're matching against a target string that might be appended to during successive checks of the match.
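The effect of /c on the search position can be sketched with the pos function, which reports where the next /g search on a string will begin. The string and patterns are illustrative only.

```perl
use strict;
use warnings;

my $s = "aaa";
my ($ok, @seen);

$ok = $s =~ /a/g;            # succeeds; the position is now 1
push @seen, pos($s);

$ok = $s =~ /b/g;            # fails; the position is reset
push @seen, defined(pos($s)) ? pos($s) : 'undef';

$ok = $s =~ /a/g;            # starts over from the beginning
$ok = $s =~ /b/gc;           # fails, but /c keeps the position
push @seen, pos($s);

print "@seen\n";             # 1 undef 1
```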

The delimiters within the match operator can be changed by specifying another character after the initial m. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the expression.

m/\/home\/clintp\/bin/;   # Match clintp's /bin
m!/home/clintp/bin!;      # Somewhat more sane
m/$ENV{HOME}\/bin/;       # Match the user's own /bin
m'$ENV{HOME}/bin';        # Match literal '$ENV{HOME}/bin' -- useless?
m{/home/clintp};

If you're content with using // as delimiters for the pattern, the m can be omitted from the match operator:

while( <IRCLOG> ) {
   if (/<(?:Abigail|Addi)>/) {  # Look ma, no "m"!

       # See below for explanation of //
       if (grep(//, @users)) {
           print LOG "$_\n";
       }
   }
}

If the pattern is omitted completely, the pattern from the last successful regular expression match is used. In the previous sample of code, the expression <(?:Abigail|Addi)> is re-used for the grep's pattern.

Example Listing 3.1

# The example from the "backreferences" section
#   re-worked to use the list-context-with-/g return
#   value of the match operator.

open(CONFIG, "config") || die "Can't open config: $!";
{
   local $/;
   %conf=<CONFIG>=~m/([^=]+)=(.*)\n/g;
}
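The same list-context /g idiom can be tried without a file. This sketch assumes a hypothetical two-line configuration held in a string; the keys and values are invented.

```perl
use strict;
use warnings;

# Hypothetical file contents, one key=value pair per line.
my $config = "name=clintp\nshell=/bin/sh\n";

# In list context with /g, the captures come back as a flat list of
# key, value, key, value, which is exactly what a hash assignment wants.
my %conf = $config =~ /([^=]+)=(.*)\n/g;

print "$_ => $conf{$_}\n" for sort keys %conf;
```

Note that [^=]+ will also swallow newlines, so blank lines in real input would end up inside a key; a stricter pattern would be needed for anything but tidy files.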

See Also

Substitution operator, ??, and match modifiers in this book


Substitution Operator

s///

Usage

s/pattern/replacement/modifiers

Description

The s/// operator is Perl's substitution operator. The pattern is first interpolated as though it were a double-quoted string—scalar variables are expanded, backslash escapes are translated, and so on. Afterward, the pattern is compiled for the regular expression engine.

The pattern is then used to match against a target string; by default, the $_ variable is used unless another value is bound using the =~ operator.

s/today/yesterday/;           # Change string in $_
$t=~s/yesterday/long ago/;    # Change string in $t

If the pattern is successfully matched against the target string, the matched portion is substituted using the replacement.

The substitution operator returns the number of substitutions made. If no substitutions were made, the substitution operator returns false (the empty string). The return value is the same in both scalar and list contexts.

$_="It was, like, ya know, like, totally cool!";
$changes=s/It/She/;         # $changes=1, for the match
$changes=s/\slike,//g;      # $changes=2, for both matches

The /g modifier causes the substitution operator to repeat the match as often as possible. Unlike the match operator, /g has no other side effects (such as walking along the match in scalar context)—it simply repeats the substitution as often as possible for nonoverlapping regions of the target string.

During the substitution, captured patterns from the pattern portion of the operator are available during the replacement part of the operator as $1, $2, and so on. If the /g modifier is used, the captured patterns are refreshed for each replacement.

$_="One fish two fish red fish blue fish";
s/(\w+)\s(\w+)/$2 $1/g;  # Swap words for "fish One fish two..."

The /e modifier causes Perl to evaluate the replacement portion of the substitution for each replacement about to happen as though it were being run with eval {}. The replacement expression is syntax checked at compile time and variable substitutions occur at runtime, the same as eval {}.

# Make this URL component "safe" by changing non-letters
#   to 2-digit hex codes (RFC 1738)
$text=~s/(\W)/sprintf('%%%02x', ord($1))/ge;

# Perform word substitutions from a list...
%abrv=( 'A.D.' => 'Anno Domini',  'a.m.' => 'ante meridiem',
  'p.m.' => 'post meridiem', 'e.g.' => 'exempli gratia',
  'etc.' => 'et cetera',     'i.e.' => 'id est');
$text=qq{I awoke at 6 a.m. and went home, etc.};
$text=~s/([\w.]+)/exists $abrv{$1}?$abrv{$1}:$1/eg;

The delimiters within the substitution operator can be changed by specifying another character after the initial s. Any character except whitespace can be used, and using the delimiter of ' has the side-effect of not allowing string interpolation to be performed before the regular expression is compiled. Balanced characters (such as (), [], {}, and <>) can be used to contain the pattern and replacement. Additionally, a different set of characters can be used to encase the pattern and the replacement:

s/\/home\/clintp/\/users\/clintp/g;   # Ugh!
s,/home/clintp,/users/clintp,g;       # Whew!  Better.
s[/home/clintp]
   {/users/clintp}g;                 # This is really clear

The match modifiers (other than /e and /g) are covered in the entry on match modifiers.

Example Listing 3.2

# This function takes its argument and renders it in
#   Pig-Latin following the traditional rules for Pig Latin
# (Note that there's a substitution within a substitution.)
{
   my $notvowel=qr/[^aeiou_]/i;  # _ is because of \w

   sub igpay_atinlay {
       local $_=shift;

       # Match the word
       s[(\w+)]
            {
           local $_=$1;
           # Now re-arrange the leading consonants
           #   or if none, append "yay"
           s/^($notvowel+)(.*)/$2$1ay/
               or
               s/$/yay/;
           $_;  # Return the result
           }ge;
       return $_;
   }
}
print igpay_atinlay("Hello world");  # "elloHay orldway"

See Also

match operator, match modifiers, capturing, and backreferences in this book


Character Shorthand

Description

Regular expressions, similar to double-quoted strings, also allow you to specify hard-to-type characters as digraphs (backslash sequences), by name or ASCII/Unicode number.

They differ from double-quoted context in that, within a regular expression, you're trying to match the given character—not trying to emit it. A single digraph might match more than one kind of character.

The simplest character shorthand is for the common unprintables. These are as follows:

Character   Matches
\t          A tab (TAB and HT)
\n          A newline (LF, NL). On systems with multicharacter line termination characters, it matches both characters.
\r          A carriage return (CR)
\a          An alarm character (BEL)
\e          An escape character (ESC)

Regular expressions also can represent any ASCII character using the octal or hexadecimal code for that character. The format for the codes is \digits for octal and \xdigits for hexadecimal. So to represent a SYN (ASCII 22) character, you can say

/\x16/;  # Match SYN (hex)
/\026/;  # Match SYN (oct)

However, beware that using \digits can cause ambiguity with backreferences (captured pieces of a regexp). The sequence \2 can mean either ASCII 2 (STX), or it can mean the item that was captured from the second set of parentheses.

Ambiguous references are resolved in this manner: If the number of capturing parentheses is at least digits, \digits is the backreference from that capture; otherwise, the value is the corresponding ASCII value (in octal). Within a character class, \digits will never stand for a backreference. Single-digit references such as \digit always stand for a backreference, except for \0, which means ASCII 0 (NUL).

To avoid this mess, specify octal ASCII codes using three digits (with a leading zero if necessary). Backreferences will never have a leading zero, and there probably won't be more than 100 backreferences in a regular expression.
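A short sketch of the two readings, using invented test strings: \1 after a capture group is a backreference, while a three-digit octal code with a leading zero is always a character.

```perl
use strict;
use warnings;

# \1 after one capture group means "the same text again".
my $doubled = ("aa" =~ /(a)\1/) ? 1 : 0;     # matches: 'a' followed by 'a'
my $not_dup = ("ab" =~ /(a)\1/) ? 1 : 0;     # fails: 'b' is not a repeat

# Three octal digits with a leading zero name a character code.
my $soh = (chr(1) =~ /\001/) ? 1 : 0;        # matches ASCII 1 (SOH)

print "doubled=$doubled not_dup=$not_dup soh=$soh\n";
```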

Wide (multibyte) characters can be specified in hex by surrounding the hex code with {} to contain the entire sequence of digits. The utf8 pragma also must be in effect.

use utf8;
/\x{262f}/;     # Unicode YIN YANG

When the character is a named character, you can specify the name with a \N{name} sequence if the charnames module has been included.

use charnames ':full';
s/\N{CLOUD}/\N{LIGHTNING}/g;  # The weather worsens!

Control-character sequences can be specified directly with \ccharacter. For example, the control-g character is a BEL, and it can be represented as \cg; the control-t character is \ct.
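As a quick check of the \c notation, this sketch builds the characters with chr and matches them both ways. The variable names are illustrative.

```perl
use strict;
use warnings;

my $bel = chr(7);                                   # BEL, control-G
my $is_bel = ($bel =~ /\cg/ && $bel =~ /\a/) ? 1 : 0;

my $dc4 = chr(20);                                  # DC4, control-T
my $is_ct = ($dc4 =~ /\ct/) ? 1 : 0;

print "BEL via \\cg and \\a: $is_bel, control-T via \\ct: $is_ct\n";
```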

Example Listing 3.2

# Dump the file given on STDIN/command line translating any
#   low-value ASCII characters to their symbolic notation

@names{(0..32)}=qw( NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF
       CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN
       EM SUB ESC FS GS RS US SPACE);
$names{127}='DEL';

while(<>) {
   tr/\200-\377/\000-\177/;  # Strip 8th bit too.
   foreach(split(//)) {
       s/([\000-\x1f\x7f])/$names{ord($1)}/e;
       printf "%4s ", $_;
   }
}

See Also

charnames module documentation

character classes in this book


Character Classes

Description

Character classes in Perl are used to match a single character with a particular property. For example, if you want to match a single alphabetic uppercase character, it would be nice to have a convenient way to describe this property. In Perl, surround the characters that describe the property with a set of square brackets:

m/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/

This expression will match a single, alphabetic, uppercase character (at least for English speakers). This is a character class, and stands in for a single character.

Ranges can be used to simplify the character class:

m/[A-Z]/

Ranges that seem natural (0-9, A-Z, A-M, a-z, n-z) will work. If you're familiar with ASCII collating sequence, other less natural ranges (such as [!-/]) can be constructed. Ranges can be combined simply by putting them next to each other within the class:

m/[A-Za-z]/;    # Upper and lowercase alphabetics

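Some characters have special meaning within a character class and deserve attention. A ^ as the first character of the class negates it, so the class matches any character not listed; a - between two characters specifies a range; a ] ends the class; and a \ escapes the next character. For example, m/[^A-Z]/ matches any single character that is not an uppercase letter.

Remember that negating a character class might include some things you didn't expect. In the preceding example, control characters, whitespace, Unicode characters, 8-bit characters, and everything else imaginable would be matched, just not A-Z.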

In general, any other metacharacter (including the special character classes later in this section) can be included within a character class. Some exceptions to this are the characters .+()*|$^ which all have their mundane meanings when they appear within a character class, and backreferences (\1, \2) don't work within character classes. The \b sequence means "backspace" in a character class, and not a word boundary.
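The mundane meanings inside a class can be sketched as follows; the sample strings are made up for the purpose.

```perl
use strict;
use warnings;

# Inside a class, '.', '+' and '*' are ordinary characters.
my @ops = "3.5+4*2" =~ /[.+*]/g;

# A leading ^ negates the class: everything except A-Z.
my @rest = "Perl 5!" =~ /[^A-Z]/g;

# \b means backspace inside a class, not a word boundary.
my $bs = ("\x08" =~ /[\b]/) ? 1 : 0;

print "ops: @ops\n";
print "not uppercase: ", join('', @rest), "\n";
print "backspace matched: $bs\n";
```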

The hexadecimal, octal, Unicode, and control sequences for characters also work just fine within character classes:

m/[\ca-\cz]/;   # Match all control characters
m/[\x80-\xff]/; # Match high-bit-on characters
use charnames qw(:full);
m/[\N{ARIES}\N{SCORPIUS}\N{PISCES}\N{CANCER}\N{SAGITTARIUS}]/;

In Perl regular expressions, common character classes also can be represented by convenient shortcuts. These are listed as follows:

Class   Name                 What It Matches
\d      Digits               [0-9]
\D      Nondigits            [^0-9]
\s      Whitespace           [\x20\t\n\r\f]
\S      Non-whitespace       [^\x20\t\n\r\f]
\w      Word character       [a-zA-Z0-9_]
\W      Non-word character   [^a-zA-Z0-9_]

These shortcuts can be used within regular character classes or by themselves within a pattern match:

if (/x[\da-f]/i) {  }  # Match something hex-ish
s/(\w+)/reverse $1/e;  # Reverse word-things only

The actual meaning of these will change if a locale is in effect. So, when perl encounters a string such as ¡feliz cumpleaños!, the exact meaning of metacharacters such as \w will change. This code

$a="\xa1feliz cumplea\xf1os!";    # Happy birthday, feliz cumpleaños
while($a=~m/(\w+)/g) {
   print "Word: $1\n";
}

will find three words in that text: feliz, cumplea, and os. The \xf1 (n with a tilde) character isn't recognized as a word character. This code:

use locale;
use POSIX qw(locale_h);
setlocale(LC_CTYPE, "sp.ISO8859-1");  # Spanish, Latin-1 encoding

$a="\xa1feliz cumplea\xf1os!";  # Happy b-day.
while($a=~m/(\w+)/g) {
   print "Word: $1\n";
}

works as a Spanish speaker would expect, finding the words feliz and cumpleaños. The locale can be negated by specifying a bytes pragma within the lexical block, causing the character classes to go back to their original meanings.

Perl also defines character classes to match sets of Unicode characters. These are called Unicode properties, and are represented by \p{property}. The list of properties is extensive because Unicode's property list is long and perl adds a few custom properties to that list as well. Because the Unicode support in Perl is (currently) in flux, your best bet to find out what is currently implemented is to consult the perlunicode manual page for the version of perl that you're interested in.
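A minimal sketch of the \p{property} notation, assuming the general-category properties Lu (uppercase letter) and L (any letter). The set of supported properties depends on the perl version, so treat these two as examples to check against perlunicode rather than a survey.

```perl
use strict;
use warnings;

my $text = "Hello World 42";

my @upper   = $text =~ /\p{Lu}/g;     # uppercase letters
my @letters = $text =~ /\p{L}+/g;     # runs of letters of any case

print "uppercase: @upper\n";
print "letter runs: @letters\n";
```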

The last kind of character class shortcut (other than user-defined ones covered in the section on character classes) is defined by POSIX. Within another character class, the POSIX classes can be used to match even more specific kinds of characters. They all have the following form:

[:class:]

where class is the character class you're trying to match. To negate the class, write it as follows: [:^class:].

Class    Meaning
ascii    7-bit ASCII characters (with an ord value <128)
alpha    Matches a letter
lower    Matches a lowercase alpha
upper    Matches an uppercase alpha
digit    Matches a decimal digit
alnum    Matches both alpha and digit characters
space    Matches a whitespace character (just like \s)
punct    Matches a punctuation character
print    Matches alnum, punct, or space
graph    Matches alnum or punct
word     Matches alnum or underscore
xdigit   Matches hex digits: digit, a-f, and A-F
cntrl    The ASCII characters with an ord value <32, plus DEL (127) (control characters)

To use the POSIX character classes, they must be within another character class:

for(split(//,$line)) {
   if (/[[:print:]]/) { print; }
}

Using a POSIX class on its own:

if (/[:print:]/) { }  # WRONG!

won't have the intended effect. The previous bit of code would match :, p, r, i, n, and t.

If the locale pragma is in effect, the POSIX classes will work as the corresponding C library functions such as isalpha, isalnum, isascii, and so on.
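A small sketch of the POSIX classes in use, with no locale in effect and an invented sample line. It shows a class on its own, two classes combined inside one enclosing class, and the negated [:^class:] form.

```perl
use strict;
use warnings;

my $line = "Room 101, Floor 3!";

my @digits = $line =~ /[[:digit:]]/g;            # single digits
my @punct  = $line =~ /[[:punct:]]/g;            # punctuation characters
my @words  = $line =~ /[[:alpha:][:digit:]]+/g;  # two classes side by side
my @chunks = $line =~ /[[:^space:]]+/g;          # negated: non-whitespace runs

print "digits: @digits\n";
print "punct:  @punct\n";
print "words:  @words\n";
print "chunks: @chunks\n";
```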

Example Listing 3.3

# Analyze the file on STDIN (or the command line) to get the
#  makeup.  A typical MS-Word doc is about 60-70% high-bit
#  characters and control codes.  This book in XML form was
#  less than 4% control codes, 10.8% punctuation, 18.2% whitespace
#  and 69% alphanumeric characters.

use warnings;
use strict;
my(%chars, $total, %props, $code, %summary);
# Take the file apart, summarize the frequency for
#   each character.
while(<>) {
   $chars{$_}++ for(split(//));
   $total+=length;
}

# Warning: space and cntrl overlap so >100% is possible!
%props=(alpha => "Alphabetic",   digit => "Numeric",
   space => "Whitespace",   punct => "Punctuation",
   cntrl => "Control characters",
   '^ascii' => "8-bit characters");

# Build the code to analyze each kind of character
#   and classify it according to the POSIX classes above.
$code.="\$summary{'$_'}+=\$chars{\$_} if /[[:$_:]]/;\n"
    for(keys %props);
eval "for(keys %chars){ $code }";

foreach my $type (keys %props) {
   no warnings 'uninitialized';
   printf "%-18s %6d %4.1f%%\n", $props{$type}, $summary{$type},
       ($summary{$type}/$total)*100;
}

See Also

bytes, utf8, and POSIX module documentation

perlunicode in the perl documentation

isalpha in the C library reference


Quantifiers

Usage

{min,max}
{min,}
{min}
*
+
?

Description

Quantifiers are used to specify how many of a preceding item to match. That item can be a single character (/a*/), a group (/(foo)?/), or it can be anything that stands in for a single character such as a character class (/\w+/).

The first quantifier is ?, which means to match the preceding item zero or one times (in other words, the preceding item is optional).

/flowers?/;        # "flower" or "flowers" will match
/foo[0-9]?/;        # foo1, foo2 or just foo will match
/\b[A-Z]\w+('s)?\b/;    # Matches things like "Bob's" or "Carol" --
           #   capitalized singular words, possibly possessed

# Match day of week names like 'Mon', 'Thurs' and Friday.
#  (caution: also matches oddities like 'Satur' -- this can be
#   remedied, but makes a lousy example.)
/(Mon|Tues?|Wed(nes)?|Thu(rs)?|Fri|Sat(ur)?|Sun)(day)?/;

Any portion of a match quantified by ? will always be successful. Sometimes an item will be found, and sometimes not, but the match will always work.

The quantifier * is similar to ? in that the quantified item is optional, except * specifies that the preceding item can match zero or more times. Specifically, the quantified item should be matched as many times as possible and still have the regular expression match succeed. So,

/fo*bar/;

matches 'fobar', 'foobar', 'foooobar', and also 'fbar'. The * quantifier will always match positively, but whether a matching item will be found is another question. Because of this, beware of expressions such as the following:

/[A-Z]*\w*/

You might hope it will match a series of uppercase characters and then a set of word characters, and it will. But it also will match numbers, empty strings, and binary data. Because everything in this expression is optional, the expression will always match.
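To see how little that pattern demands, this sketch runs it against a few unlikely inputs and records the length of what was matched. The inputs are invented; a length of 0 means the pattern "succeeded" by matching nothing at all.

```perl
use strict;
use warnings;

my $pat = qr/[A-Z]*\w*/;
my @lengths;

for my $input ("", "!!!", "\x00\x01", "ABCdef") {
    if ($input =~ /($pat)/) {
        push @lengths, length $1;     # how much text the match consumed
    } else {
        push @lengths, -1;            # never reached: the pattern cannot fail
    }
}

print "@lengths\n";                   # 0 0 0 6
```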

With * you can absorb unwanted material to make your match less specific:

# This matches any of: <body>, <body background="">,
#   <body background="foo.gif">, <body onload="alert()">,
#   or <body onload="alert()" background="foo.gif">
/<\w+(\s+\w+="[^"]*")*>/;

In the preceding example, * was used to make [^"] match empty quote marks, or quote marks with something inside; it was used to make the attribute match (foo="bar") optional, and repeat it as often as necessary.

The + quantifier requires the match not only to succeed at least once, but also as many times as possible and still have the regular expression match be successful. So, it's similar to *, except that at least one match is guaranteed. In the preceding example, the space following the \w+ was specified as \s+; otherwise items such as <bodyonload="alert()"> would match.

/fo+bar/;

This matches 'fobar', 'foobar', and of course 'fooooobar'. But unlike *, it will not match 'fbar'.

Perl also allows you to match an item a minimal, fixed, or maximum number of times with the {} quantifiers.

Quantifier   Meaning
{min,max}    Matches at least min times, but at most max times.
{min,}       Matches at least min times, and as many more times as possible while still letting the match succeed.
{count}      Matches exactly count times.

Keep in mind that with the {min,} and {min,max} quantifiers, the item absorbs as many characters as it can, but only as many as still allow the rest of the pattern to match. Thus with the following:

$_="Python";
if (m/\w(\w{1,5})\w\w/) {
   print "Matched ", length($1), "\n";
}

The $1 variable winds up with only three characters because the first \w matched P, the last \w's needed "on" to be successful, and that left "yth" for the quantified \w.

Perl's quantifiers are normally maximal matching, meaning that they match as many characters as possible but still allow the regular expression as a whole to match. This is also called greedy matching.

The ? quantifier has another meaning in Perl: when affixed to a *, +, or {} quantifier, it causes the quantifier to match as few characters as necessary for the match to be successful. This is called minimal matching (or lazy matching).

Take the following code:

$_=q{"You maniacs!" he yelled at the surf. "You blew it up!"};
while (m/(".*")/g) {
   print "$1\n";
}

It might surprise you to see that the regular expression grabs the entire string, not just each quote individually. That's because ".*" matches as much as possible between the quote marks, including other quote marks. Changing the expression to:

m/".*?"/g

solves this problem by asking * to match as little as possible for the match to succeed.

Keep in mind that ? is just a convenient shorthand and might not represent the best possible solution to the problem. The pattern /"[^"]*"/ would have been a more efficient choice because the amount of backtracking by the regular expression engine to be done would have been less. But there is programmer efficiency to consider.
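The three approaches can be put side by side in a short sketch that reuses the quotation string from the example above; the array names are illustrative.

```perl
use strict;
use warnings;

my $s = q{"You maniacs!" he yelled at the surf. "You blew it up!"};

my @greedy  = $s =~ /(".*")/g;      # one match: the entire string
my @lazy    = $s =~ /(".*?")/g;     # one match per quotation
my @negated = $s =~ /("[^"]*")/g;   # same result, less backtracking

print scalar(@greedy), " greedy match\n";
print "lazy:    $_\n" for @lazy;
print "negated: $_\n" for @negated;
```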


See Also

m operator in this book


Modification Characters

Usage

\Q \E \L \l \U \u

Description

The modification characters used in string literals (in an interpolated context) are available in regular expressions as well. See the entry on modification characters for a list.

Understand that these "metacharacters" aren't really metacharacters at all. They do their work because regular expression match operators allow interpolation to happen when the pattern is first examined—much in the same way that \L and \U are only effective in double-quoted strings; they're only effective in regular expressions when the pattern is first examined by perl.

$foo='\U';
if (m/${foo}blah/) {  } # Won't look for BLAH, but 'Ublah'
if (m/\Ublah/) {  }     # Will look for BLAH
if (m/(a)\U\1/) { }     # Won't look for aA as you might hope

Most useful among these in regular expressions is the \Q modifier. The \Q modifier is used to quote any metacharacters that follow. When accepting something that will be used in a pattern match from an untrusted source, it is vitally important that you not put the pattern into the regular expression directly. Take this small sample:

# A CGI form is a _VERY_ untrustworthy source of info.

use CGI qw(:form :standard);
print header();
$pat=param("SEARCH");
# ...sometime later...
if (/$pat/) {
}

The trouble with this is that handing $pat to the regular expression engine opens up your system to running code that's determined solely by the user. If the user is malicious, he can supply any pattern he likes, including one that embeds Perl code with the (?{ }) construct.

That last possibility is probably the most malicious, so it is disabled unless a use re 'eval' pragma is in effect or the pattern is compiled with a qr operator.

The \Q modifier will cause perl to treat the contents of the pattern literally until an \E is encountered.
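What \Q does to a pattern can be sketched with the built-in quotemeta function, which applies the same quoting to a string. The value of $pat below is a hypothetical stand-in for user input.

```perl
use strict;
use warnings;

my $pat = "a.c (1+1)";           # hypothetical user input

my $quoted = quotemeta($pat);    # the same quoting \Q...\E applies

# Unquoted, the metacharacters are live and "abc 11" matches.
my $raw_hit  = ("abc 11"    =~ /$pat/)     ? 1 : 0;
# Quoted, only the literal text will do.
my $safe_hit = ("abc 11"    =~ /\Q$pat\E/) ? 1 : 0;
my $literal  = ("a.c (1+1)" =~ /\Q$pat\E/) ? 1 : 0;

print "quoted pattern: $quoted\n";
print "raw=$raw_hit safe=$safe_hit literal=$literal\n";
```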

Example Listing 3.4

# A re-working of the inline sample above a little
#   more safe.*  A form parameter "SEARCH" is used to
#   check the GECOS (often "name") field.
# *[Of course, software is only completely "safe" when
#   it's not being used. --cap]

use CGI qw(:form :standard);

print header(-type => 'text/plain');
$pat=param("SEARCH");

push(@ARGV, "/etc/passwd");
while(<>) {
   ($name)=(split(/:/, $_))[4];
   if ($name=~/\Q$pat\E/) {
       print "Yup, ${pat}'s in there somewhere!\n";
   }
}

See Also

Modification characters in this book


Anchors, Grouping, and Backreferences

Grouping

Usage

(pattern)
(?:pattern)
(?=pattern)
(?!pattern)
(?<=pattern)
(?<!pattern)
(?#text)
internal modifiers: i, m, s, x

Description

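Parentheses in regular expressions are used for grouping subpatterns within the larger pattern. This can be done to apply a quantifier to a subpattern, to limit the scope of alternation or of internal modifiers, or to capture a subpattern for backreferences.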

The ability to capture subpatterns for backreferences is covered in the entry on backreferences. Some of the examples in this section assume prior knowledge of backreferences.

Simple parentheses (pattern) and the (?:pattern) form allow you to group a subpattern of a regular expression. Once grouped, quantifiers can be applied against just that portion of the regular expression:

m/\w+:            # Match the first field (it's required)
  (?:[^:]*:){3}   # Match (and discard) the next three fields
  ([^:]*)        # Match (and capture) the next field
/x;

Also, alternation can be limited so that when an alternation symbol is seen, exactly what's being alternated against can be determined:

m/oats|peas|beans$/;  # oats, peas or beans (but beans at the end)
m/(oats|peas|beans)$/;# Any of oats, peas or beans only at the end

Internal modifiers can have their scope limited (in fact, internal modifiers can only be specified with parentheses). So in the following:

m/Tony\s(?i:the)\sTiger/;

the phrase will be matched only if the capitalization is just as it appears; however the word the will not be matched case sensitively. (This could have been accomplished with [Tt]he as well.)

The difference between () and (?:) is that the (pattern) form captures the subpattern it matches and the (?:pattern) form doesn't; (?:pattern) provides grouping without the capturing side effect. This makes a difference if you're using backreferences. See the backreferences entry.
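The capturing difference shows up directly in the list a match returns. A brief sketch, using an invented date string:

```perl
use strict;
use warnings;

my $date = "2024-01-15";

my @all  = $date =~ /(\d+)-(\d+)-(\d+)/;      # three capturing groups
my @some = $date =~ /(\d+)-(?:\d+)-(\d+)/;    # middle group only groups

print "all:  @all\n";     # 2024 01 15
print "some: @some\n";    # 2024 15
```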

The constructs (?=pattern), (?!pattern), (?<=pattern), and (?<!pattern) are all used to "look around" the current match to see what either precedes or follows it. They are zero-width assertions, meaning that the subpattern contained within is only used to look ahead or look behind the current point of the match to see whether something is true or not.

Pattern

Name

(?=pattern)

Positive lookahead. Is only true if pattern is seen after the current point of the match. So /Abraham\s(?=Simpson|Lincoln)/ matches only if Abraham is followed by Lincoln or Simpson. The benefit is that the last name is not absorbed by the match. See the later examples.

(?!pattern)

Negative lookahead. True only if pattern is not seen after the current point of the match. So if /^(?:\d{1,3}\.){3}\d{1,3}$/ matches an IP address (and some bad ones too, such as 999.888.777.666), /^(?!(?:0+\.){3}0+)(?:\d{1,3}\.){3}\d{1,3}$/x matches those same IP addresses, but disallows 0.0.0.0.

(?<=pattern)

Positive lookbehind. This asserts that pattern was seen before the current point in the match. /(?<=bar)foo/ matches only if foo was directly preceded by bar. There is a restriction on this subpattern: it must be fixed-width, so /(?<=bar.*)foo/ isn't allowed.

(?<!pattern)

Negative lookbehind. True only if pattern was not seen before the current point in the match. /(?<!bar)foo/ is true only if foo was not directly preceded by bar. Like positive lookbehind, the subpattern must be fixed-width.
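The four assertions above can be tried out in a short sketch. The names and strings are borrowed from the descriptions or invented; note how the lookahead leaves the last name out of the matched text.

```perl
use strict;
use warnings;

# Positive lookahead: the last name must follow but is not absorbed.
my ($matched) = "Abraham Lincoln" =~ /(Abraham\s(?=Simpson|Lincoln))/;

# Negative lookahead: succeeds only when neither name follows.
my $other = ("Abraham Jones" =~ /Abraham\s(?!Simpson|Lincoln)/) ? 1 : 0;

# Lookbehind, positive and negative, against the same text.
my @after_bar = "barfoo bazfoo" =~ /(?<=bar)foo/g;
my @not_bar   = "barfoo bazfoo" =~ /(?<!bar)foo/g;

print "matched: [$matched]\n";
print "other surname allowed: $other\n";
print "foo after bar: ", scalar(@after_bar), ", foo not after bar: ",
      scalar(@not_bar), "\n";
```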

The (?#text) construct is used to place comments in the body of a regular expression. For example, if the expression is long and convoluted, you might say:

/\D\d{5}(-\d{4})?(?# ZIP+4 optional)\D/

Because perl needs the ) to know when to terminate the comment, you cannot include a literal ) in the comment itself.

A cleaner way to include comments within a regular expression is to use the /x modifier to the expression.

The internal modifiers are modifiers (such as /i, /s, /x) that are applied to only a portion of the regular expression. They are specified with the non-capturing parentheses mechanism by inserting the modifier after the ? but before the next token or by using them within parentheses with a lone ?:

(?modifiers:pattern) (?modifiers)

To add a modifier to a portion of the expression, put the modifier letter inside the group:

if (/Linus Torvalds wrote L(?i:inux)/) { }

This match is case sensitive except the letter-sequence inux, which can be uppercase, lowercase or a mix. A modifier can be removed by preceding it with a dash:

(?modifiers_to_add - modifiers_to_remove:pattern)

For example,

if (/(?-i:Linus) wrote Linux/i) { }

The preceding match is not case sensitive, except the portion matching Linus.
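
To see both directions at work, here is a short sketch that tests the two patterns above against a few made-up strings (the variable names are only for illustration):

```perl
use strict;
use warnings;

# (?i:...) relaxes case for just the enclosed part.
my $partial = qr/Linus Torvalds wrote L(?i:inux)/;
my $ok1 = "Linus Torvalds wrote LINUX" =~ $partial ? 1 : 0;  # matches
my $ok2 = "linus torvalds wrote Linux" =~ $partial ? 1 : 0;  # does not

# (?-i:...) restores case sensitivity inside a /i match.
my $strict_name = qr/(?-i:Linus) wrote Linux/i;
my $ok3 = "Linus WROTE linux" =~ $strict_name ? 1 : 0;       # matches
my $ok4 = "LINUS wrote Linux" =~ $strict_name ? 1 : 0;       # does not

print "$ok1 $ok2 $ok3 $ok4\n";
```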

Alternation

Usage

pat|pat

Description

The | metacharacter is used to make the regular expression engine choose between two potential matches; this is called alternation. The | should be placed between potential choices within the pattern:

/cat|dogfish/

This matches either cat or dogfish. The alternation extends outward from the | to the end of the innermost enclosing parentheses or to another alternation symbol.

/(cat|dog)fish/;       # Either "catfish" or "dogfish"
/(cat|dog|sword)fish/;  # catfish, dogfish or swordfish

The alternation extends outward to include any anchors or zero-width assertions that are within the enclosed scope:

s/^\s+|\s+$//g;  # Remove leading/trailing whitespace

An empty alternative can be specified, which allows you to choose between a few choices or nothing at all:

/(cat|dog|sword|)fish/;  # catfish, dogfish or swordfish or just fish

Perl's regexp engine will process the alternatives left-to-right and select the first one that matches. Thus, if you have an alternative that is the prefix of a following alternative, or an empty alternative, it should be placed at the end:

/paper|paperbacks|paperweight/;   # The last two will never match
/(paperbacks|paperweight|paper)/; # Better!
/paper(backs|weight)?/;           # Even better still!

/(|bugle|bugs|bugaboo)/;        # The empty choice will always match
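
The effect of ordering is easy to observe by capturing what each form actually matched. This is a minimal sketch using one sample word; the variable names are illustrative:

```perl
use strict;
use warnings;

my $text = "paperweight";

# Prefix listed first: "paper" wins and the longer choices are never tried.
my ($short) = $text =~ /(paper|paperbacks|paperweight)/;

# Longer choices first: the full word is captured.
my ($long) = $text =~ /(paperbacks|paperweight|paper)/;

# Factored form: same result, with the common prefix stated once.
my ($factored) = $text =~ /(paper(?:backs|weight)?)/;

print "$short $long $factored\n";
```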

Alternation isn't always the best choice for determining whether a list of things will match. Because of the way that Perl's regex engine works, a list of alternations such as the following:

/than|that|thaw|them|then|they|thin|this|thud|thug|thus/

will run much slower than if the match is re-written as follows:

/th(?:an|at|aw|em|en|ey|in|is|ud|ug|us)/

The regex engine can't scan through the alternations and notice the obvious: the program is trying to match four-letter words that begin with th. It's not that smart (yet). By giving it a hint that a literal th will need to match before the alternations are searched, the speedup is tremendous. In this case, it is nearly 25 times faster for a large volume of text. (Perl 5.10 and later compile a list of literal alternatives into a trie, which narrows this gap considerably, so benchmark before rewriting.)

So avoid alternation for simple cases similar to:

m/\b\w(a|e|i|o|u)\w\b/;  # 3 letter words, vowel in the middle

when a character class ([aeiou]) or another construct would work better.
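
As a quick check that the two forms select the same words, here is a sketch run over a made-up word list:

```perl
use strict;
use warnings;

my @words = qw(cat dog tree bit sky up);

# Alternation form and character-class form of the same test.
my @alt   = grep { /\b\w(?:a|e|i|o|u)\w\b/ } @words;
my @class = grep { /\b\w[aeiou]\w\b/ } @words;

print "@alt\n@class\n";
```

Both lists come out identical; the character class simply states the intent more directly.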


See Also

character classes in this book


Capturing and Backreferences

Usage

()
\1 \2 \3 \n
$1 $2 $3 $n

Description

The parentheses in regular expressions, in addition to grouping and other functions mentioned in the grouping entry, also have a side effect: patterns matched within parentheses are stored, and can be used later in the expression or later in the program outside of the expression. This storage of matched patterns is called capturing, and references to the captured values are called backreferences.

Each set of capturing parentheses encountered takes the portion of the target string matched by the pattern and stores it in a register. The registers are numbered 1, 2, 3, and so on up to the number of sets of parentheses in the entire pattern match.

During the match, any captured values are available by referring to the proper register with \register. This allows you to refer to something previously matched later in the pattern:

/(\w+)\s\1/;  # Look for repeated words, separated by a space.

In the preceding example, (\w+) captures word characters into the first capture register, and \1 looks for whatever word was stored there after the whitespace character.

After the match has completed (or during the substitution phase with the s/// operator), the captured value will appear in the variables named $1, $2, $3, and so on up to the number of sets of parentheses captured in the match.

if ( s/(\w+)\s\1/$1/ ) {  # Remove repeated words, separated by a space.
   print "Removed duplicate word $1\n";
}

In this example, the backreference \1 is used to find the repeated word as shown previously. During the substitution, $1 is used to put back just one instance of the repeated word. After the match, $1 is still set to the captured value during the match.
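
Here is the same idea as a self-contained sketch that leaves the original string alone. The \b anchors are my addition, so that a word is not paired with the tail of a longer word, and the sample sentence is made up:

```perl
use strict;
use warnings;

my $line = "This is is a test";

# \1 finds the repeat inside the pattern; $1 supplies the replacement.
(my $fixed = $line) =~ s/\b(\w+)\s\1\b/$1/;

# $1 is still set after the substitution has finished.
my $dup = $1;

print "$fixed (removed duplicate '$dup')\n";
```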

Some notes about the variables $1, $2, and so on are as follows:

Example Listing 3.5

# Read a file in the format
#       key=value
#       key2=value2
#   and assign the data to %conf appropriately
# ** This is done with a clever code trick in the
#    match operator entry.  See TMTOWTDI in action!

open(CONFIG, "config") || die "Can't open config: $!";
while(<CONFIG>) {
   if (m/^([^=]+)=(.*)$/) {  # Look for FOO=BAR
       $conf{$1}=$2;
   }
}

See Also

local, dynamic scope, match operator, Regular Expression Special Variables, and Character shorthand in this book


Line Anchors

Usage

\A ^ \z \Z $

Description

Anchors are used within regular expression patterns to describe a location. Sometimes the location is relative to something else (\b) or the location can be absolute (\A). Because they don't match an actual character but make an assertion about the state of the match, they also are called zero-width assertions.

The first anchor (appropriately) is ^, which causes the match to happen at the beginning of the string. So,

if (m/^whales/) { }

will only be true if whales occurs at the beginning of $_. If whales occurs anywhere else in $_, the match won't succeed.

Next is the $ metacharacter that only matches at the end of a string:

if (m/Stimpy$/) { }

This pattern will only match if Stimpy occurs at the end of the string. These two metacharacters can be combined for interesting effects:

if (/^$/) {  }   # Matches empty lines
# Here, the middle "doesn't matter", but the beginning and
#   endings that must match are well-defined.
if (/^In the beginning.*Amen$/) {}
if (m/^/) { }    # Will always match

When you think you understand $ and ^, read on.

The first few anchors describe the beginning and ending of a string. These are complicated by the fact that "end of a string" can often mean "end of a logical line" or "end of the storage unit," depending on who you ask. The /m modifier on a regular expression match (or substitution) can change which meaning you want. The same goes for "beginning of a string."

From now on in this entry, I'll refer to a logical line and a string. A string is the entire storage unit. A logical line begins at the beginning of the string and extends to a newline character. It also begins after a newline character and extends to the next newline character in the string (or to the end of the string). Take, for example, the following string of characters in $t:

$t=q{That whim on the way
And again I took the day off
To roam the river's edge};

The string contains two newline characters: one following the word way and one following off. Three logical lines are in the one string.

The ^ metacharacter will match at the beginning of the string, unless /m is used as a modifier on the match. In that case, ^ can match at the beginning of any logical line in the string.

The $ metacharacter will match at the end of the string, unless /m is used as a modifier on the match. If that is the case, $ can match at the end of any logical line in the string.

So observe the following matches against $t from the preceding:

if ($t=~/way$/) { }  # False!  Without /m way isn't at the EOL
if ($t=~/way$/m) { } # True!  With /m way is at the End Of Line
if ($t=~/^That/) { } # Always true!
if ($t=~/^And/) { }  # False!  Without /m, And isn't at the beginning
if ($t=~/^And/m) { } # True!  With /m, And is at the beginning of line

while($t=~/(\w+)$/g) {  # Prints only "edge", because
   print "$1";    #  without /m, there is only one "end of line"
}

while($t=~/(\w+)$/gm) { # Prints way, off and edge
   print "$1";     #   because each represents an "end of line"
}                       #   with /m

The \A metacharacter matches the beginning of the string always, and without regard to the /m modifier being used on the match. So in the sample string $t, the expression $t=~/\A\w+/m will only match the word That. The \z metacharacter similarly will always match at the end of the string, regardless of whether /m is in effect.

The \Z metacharacter is similar to \z with a bit of a difference: \z anchors at the end of the string behind (to the right of) the newline character if any. The \Z metacharacter anchors at the end of the string just in front of the newline character, if there is one, and at the end of the string if there isn't.
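
The difference is easiest to see with a string that ends in a newline. The following sketch uses made-up strings and names; it also relies on the fact that, without /m, a plain $ tolerates one trailing newline in the same way \Z does:

```perl
use strict;
use warnings;

my $line = "Stimpy\n";   # a string that ends with a newline

my $dollar  = $line =~ /Stimpy$/    ? 1 : 0;  # $ allows a final newline
my $big_z   = $line =~ /Stimpy\Z/   ? 1 : 0;  # so does \Z
my $small_z = $line =~ /Stimpy\z/   ? 1 : 0;  # \z demands the very end
my $exact   = $line =~ /Stimpy\n\z/ ? 1 : 0;  # newline spelled out

# \A ignores /m, while ^ does not.
my $two    = "one\ntwo";
my @starts = $two =~ /^(\w+)/mg;    # every logical line
my @abs    = $two =~ /\A(\w+)/mg;   # only the start of the string

print "$dollar $big_z $small_z $exact @starts | @abs\n";
```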


See Also

multi match and word anchors in this book


Word Anchors

Usage

\b \B

Description

The word anchors \b and \B are zero-width assertions that deal with the boundary between nonword characters (\W) and word characters (\w). The beginning and ending of a string are considered nonword characters.

The \b assertion matches the boundary between \w and \W characters. So, \bFOO matches FOO but only if the character preceding FOO is not a \w. The \B assertion matches everywhere that \b does not: between two \w characters or between two \W characters. Thus \BFOO will find FOO, but only if it's preceded by a word character.

$t=q{There was a young lady from Hyde
Who ate a green apple and died.
While her lover lamented
The apple fermented
And made cider inside her inside.};

$t=~m/\bher\b/;   # Matches "her" but not "There"
$t=~m/\Bher\B/;   # Matches the "her" in "There"
$t=~m/\bide\b/;   # Matches nothing!  Not cider nor inside
$t=~m/\bThere/;   # Matches There; the start of the string is a word boundary

Within a character class, \b stands for backspace and not a word boundary.

A common mistake is to assume that \b matches what people consider to be word boundaries. It doesn't: \b knows only about \w and \W characters, and _ is a word character while @ and . are not. So, clintp@geeksalad.org is three words, U.S.A is also three, but War_And_Peace is only one word.
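
Those counts can be confirmed by letting \b do the counting. The word_count helper below is a hypothetical name invented for this sketch:

```perl
use strict;
use warnings;

# Count "words" the way \b sees them: maximal runs of \w characters.
sub word_count {
    my ($s) = @_;
    my @w = $s =~ /\b\w+\b/g;
    return scalar @w;
}

my $email = word_count('clintp@geeksalad.org');
my $abbr  = word_count('U.S.A');
my $title = word_count('War_And_Peace');

print "$email $abbr $title\n";
```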


See Also

line anchors in this book


Multimatch Anchor

Usage

\G

Description

Similar to the line anchors, the multimatch anchor is used to match positions within a string as opposed to actually matching characters. It is in that class of metacharacters called zero-width assertions.

The \G metacharacter matches the position right after the previous regular expression match. For example, given the following code:

$_="One fish, two fish, red fish, blue fish";
m/\b\w{3}\b/g;  # Matches "One"
m/\G\W+(\w+)/;  # $1 is fish
m/\b\w{3}\b/g;  # Picks up "two"
m/\G\W+(\w+)/;  # $1 is fish (number two)

\G is useful for incrementally bumping along within a string with regular expressions. The location marked by \G can be reset by assigning to the pos function:

pos($_)=0;      # Reset \G to the beginning

The advantage of \G over look-ahead or look-behind assertions is that you get to write smaller (and simpler!) regular expressions. The /g modifier will cause the match to resume at the position where the last /g left off. The \G assertion allows you to look ahead without destroying your last position.
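
One common use of this bumping-along style is a small tokenizer. The sketch below is an illustration only: it pairs \G with the /gc modifier combination (listed under Match Modifiers) so that a failed attempt leaves the position where it was, and the token names are invented for the example:

```perl
use strict;
use warnings;

my $src = "x = 42 + y";
my @tokens;

pos($src) = 0;                       # start \G at the beginning
while (pos($src) < length($src)) {
    if    ($src =~ /\G\s+/gc)    { next }                    # skip blanks
    elsif ($src =~ /\G(\d+)/gc)  { push @tokens, "NUM:$1" }  # digits first
    elsif ($src =~ /\G(\w+)/gc)  { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc) { push @tokens, "OP:$1" }
    else  { die "Unexpected character at offset " . pos($src) }
}

print "@tokens\n";
```

Each branch either consumes exactly the text at the current position or fails without moving, which is what keeps the alternatives from skipping ahead in the string.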

Example Listing 3.6

# Take apart the given paragraph looking for
#   phrases joined with the conjunctions "nor" and "or".
# Note that "now or later" and "later Or no" are both
#   picked up.  With a single regular expression and no \G
#   this would be much more complicated.

# C.J. lyrics and music by Bob Dorough (c)1973
$t=q{Conjunction Junction, what's your function?
Hookin' up two cars to one when you say
Something like this choice: Either now or later,
Or no choice.  Neither now nor ever.  (Hey that's clever)
Eat this or that, grow thin or fat.};

# The expression here picks up a word at a time, remembering
#   where we left off with /g
while( $t=~m/(\w+)/g ) {
   $left=$1;

   # Matching with \G here doesn't ruin our position in
   #   the match above...because we didn't use /g.
   if ($t=~/\G\W+(n?or)\W+(\w+)/i) {
       print "$left $1 $2\n";
   }
}

See Also

line anchors in this book


Match Modifiers

Usage

m//cgimosx
qr//imosx
s///egimosx

Description

This section describes the modifiers used with regular expression matches, substitutions, and compilations. Some modifiers are particular to an operator:

Modifier

Particular To

/g

Match and Substitution Operators

/gc

Match Operators

/e

Substitution Operators

These modifiers are discussed along with the particular operators to which they apply elsewhere in this book.

The /i modifier causes the regular expression to match without regard to case. During the match, no distinction is made between upper and lowercase letters, including those within character classes:

m/Scrabble/i;    # Matches Scrabble or scrabble or sCrAbBlE or...

The locale pragma causes a wider range of alphabetic characters to be recognized, and sensitivity of upper- and lowercase characters will expand appropriately.

The /m modifier causes the meaning of the ^ and $ anchors to change. With the /m modifier, ^ and $ will match at the beginning and end of logical lines (possibly multiple logical lines) within a target string. Some examples of this are in the "Line Anchors" section.

The /s modifier causes the nature of the . (dot) metacharacter to change. Normally, dot matches any single character except a newline character (\n). With /s in place, the newline is a potential match for .:

$text=q{You are my sunshine, my only sunshine.
   You make me happy, when skies are grey.};
$text=~m/You.*/;  # Matches from "You are" to "sunshine."
$text=~m/You.*/s; # Matches from "You are" to "grey."

The /o modifier causes perl to only compile a regular expression once. Normally, a regular expression containing variables is recompiled each time perl encounters the expression.

$pat='\w+\W\w+';
while(<>) {
   if (/$pat/o) {
       $a++;
   }
}

In this example, the pattern in $pat is only changed outside of the loop. Perl doesn't realize this, so each pass through the loop, the pattern /$pat/ has to be recompiled by the regex engine. Giving perl the hint with /o that the pattern won't change allows the regex engine to skip the recompilation.

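This optimization only makes sense when the pattern contains a value that could potentially change ($pat shown previously). Also, if the /o optimization is used and you do change the variables that make up the pattern, subsequent pattern matches won't reflect those changes. Current perls also skip the recompilation on their own when the interpolated string is unchanged, so the gain from /o is usually small; precompiling the pattern with qr// is the safer way to get the same effect.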

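The /x modifier allows you to specify whitespace and comments within a regular expression. Specifically, unescaped whitespace outside a character class is ignored, and a # begins a comment that runs to the end of the line.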
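
As an illustration, here is a ZIP+4 pattern laid out with /x. This is a sketch with made-up names; note that the comments must not contain the pattern delimiter, because perl finds the end of the pattern before it looks at the comments:

```perl
use strict;
use warnings;

my $zip = qr/
    \b
    (\d{5})            # five-digit ZIP code
    (?: - (\d{4}) )?   # optional +4 extension
    \b
/x;

my ($base, $plus4) = "Send to 12345-6789 today" =~ $zip;

print "$base $plus4\n";
```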


See Also

match operator and substitution operator in this book


Miscellaneous Regular Expression Operators

Binding Operators

Usage

expression =~ op
expression !~ op

Description

The binding operators bind an expression to a pattern match or translation operator. Normally the m//, s///, and tr/// operators work on the variable $_. If you need to work on a variable other than $_, use the binding operator as follows:

$line=~s/^\s*//;

This causes the substitution operator to work on $line instead of $_. The return value for the operator on the right is returned by the bind operator.

The !~ operator works exactly the same as the =~ operator except that the return value is logically inverted. So, $f !~ /pat/ is the same as saying not $f =~ /pat/.
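
A typical use of !~ is filtering out lines that match. The sample data below is made up for this sketch:

```perl
use strict;
use warnings;

my @lines = ("# comment", "value=1", "", "other=2");

# Keep lines that are neither comments nor blank.
my @keep = grep { $_ !~ /^#/ && $_ !~ /^\s*$/ } @lines;

# The same test for one value, spelled both ways.
my $f        = "value=1";
my $negated  = ($f !~ /^#/)      ? 1 : 0;
my $inverted = (not $f =~ /^#/)  ? 1 : 0;

print "@keep $negated $inverted\n";
```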

Because =~ has a higher precedence than assignment, this allows you to do curious (and useful) things with the return value from =~. To return a list from a pattern match on $_, you would normally capture that as follows:

($first, $second)=m/(\w+)\W+(\w+)/;

With the bind operator, it's no different except that you can name your variable:

($first, $second)=$sentence=~m/(\w+)\W+(\w+)/;

Coupling this with the fact that the assignment operator yields an assignable value, you can assign, bind, and alter a variable at the same time:

# Okay, here's an assignment, bind and change.
$orig="Won't see this trick in Teach Yourself Perl!";
($lower=$orig)=~s/!$/ in 24 hours!/;
# $lower is now "Won't see this [...] Yourself Perl in 24 hours!"

# Watch this:
$changes=($upper=$lower)=~s/(\w\w+)/ucfirst $1/ge;

That last statement is kind of difficult and bears some explanation. The highest precedence operator in this expression is =~, but in order for the bind to happen, the ($upper=$lower) must be taken care of. So, $lower's value is assigned to $upper. The bind then takes $upper and performs the substitution. The substitution operator returns the number of substitutions made. This value passes back through the bind and is assigned to $changes. So $changes is 11 and $upper is "Won't See This Trick...".

A special note: if the thing to the right of the bind operator is an expression instead of a pattern match, substitution, or translation operator, a pattern match is performed using the expression.

$pattern="Buick";
if ($shorts =~ $pattern) {
   print "There's a Buick in your shorts\n";
}

Using the bind operator as an implicit pattern match is slower than explicitly calling m// because perl must re-compile the pattern for each pass through the expression.


See Also

substitution operator, pattern match operator, and translation operator in this book


??

Usage

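?pattern?modifiers
m?pattern?modifiers

Description

The ?? operator works the same as the m// operator, with one small difference: it only attempts to match the pattern until it succeeds once, and thereafter no longer tries to match the pattern. The bare ?pattern? spelling was removed in perl 5.22; on current perls, write the operator with an explicit m, as m?pattern?.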

Each instance of the ?? operator maintains its own state. Once latched, the ?? can be reset by using the reset function. This resets all the ?? operators in the current package.
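
The latch and its reset can be watched in a few lines. This sketch uses the m?pattern? spelling and made-up data; without the call to reset, only the first round would record anything:

```perl
use strict;
use warnings;

my @seen;
for my $round (1, 2) {
    for my $line ("alpha", "apple", "avocado") {
        # m?...? succeeds once and then stays latched.
        push @seen, "$round-$line" if $line =~ m?^a?;
    }
    reset;   # re-arm every m?? in the current package
}

print "@seen\n";
```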

Example Listing 3.7

# Prints a summary of a given mailbox file.
# Unix mailbox format is extremely common and uses a paragraph
#   beginning with "From " to describe the start of a message header.
#   The body of the message follows in subsequent paragraphs.

use strict;
use warnings;
my($from, $subject, $to)=("","","");
open(MBOX, "mbox") || die;
$/="";            # Paragraph mode.
while(<MBOX>) {
   $from=$1     if (m?^From: (.*)?m);
   $to=$1       if (m?^To: (.*)?m);
   $subject=$1  if (m?^Subject: (.*)?m);
} continue {
   if (/^From/ or eof MBOX) {
       print "From: $from\nTo: $to\nSubject: $subject\n\n"
           if $from;
       # The 0-argument reset function resets all of the ??
       #   latches above for use in the next message.
       reset;
       $from=$subject=$to="";
   }
}

See Also

reset, match operator, and match modifiers in this book


pos

Usage

pos
pos target string

Description

The pos function returns the position in the target string where the last m//g left off. If no target string is specified, the target string $_ is used. The position returned is the one after the last match, so

$t="I am the very model of a modern major general with mojo";
$t=~m/mo\w+/g;
print pos($t);

prints 19, which is the offset of the substring " of a modern...".

The pos function also can be assigned; doing so causes the position of the next match to begin at that point:

$t="I am the very model of a modern major general with mojo";
$t=~m/mo\w+/g;    # Now we're at 19, just as before.
pos($t)=38;     # Skip forward to the word "general"
$t=~m/(mo\w+)/g;# Grab the next "mo" word...
print $1;    # It's "mojo"!
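
Because pos reports where each match ended, the start of a match can be recovered by subtracting its length. The following sketch walks the same sentence and records every "mo" word with its starting offset; the output format is made up for the example:

```perl
use strict;
use warnings;

my $t = "I am the very model of a modern major general with mojo";
my @found;

while ($t =~ /(mo\w+)/g) {
    my $end   = pos($t);            # offset just past the match
    my $start = $end - length($1);  # offset where the match began
    push @found, $1 . ":" . $start;
}

print "@found\n";
```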

Example Listing 3.8

# Sample from a text-processing system, where tags of the form
#   <#command> are substituted for variables, and other files can
#   be included, and so on.
# pos() is used to return to the original matchpoint to re-insert
#   the new and improved text. 

use strict;

# Just some sample data to play with.
our $r="Hello, world";
my $data='bar<#var r/>Foo<#include "/etc/passwd"/>';

while($data=~/(<#(.*?)\/?>)/sg) {
   my($whole, $inside)=($1,$2);

   if ($inside=~/var\s+(\w+)/) {    # Grab a variable from main::
       no strict 'refs';
       substr($data, pos($data)-length($whole),
            length($whole))=${'main::' . $1};
   }
   if ($inside=~/include\s+"(.*)"\s*/) { # Include another file..
       open(NEWFH, $1) ||
           die "Cannot open included file: $1";
       {
           local $/;
           my $t=<NEWFH>;
           $t=eval "qq\\$t\\";
           die "Included file $1 had eval error: $@"
               if $@;
           substr($data, pos($data)-length($whole),
               length($whole))=$t; 
       }
   }
   # ...and many more
}
print $data;  # Gives "barHello, worldFoo[contents of /etc/passwd]"

See Also

match operator in this book


Translation Operator

Usage

tr/searchlist/replacement/modifiers
y/searchlist/replacement/modifiers

Description

The tr/// operator is the translation (or transliteration) operator. Each character in searchlist is examined and replaced with the corresponding character from replacement. The tr/// operator returns the number of characters replaced or deleted. Similar to the match and substitution operators, the translation operator will use the $_ variable unless another variable is bound to it with =~:

tr/aeiou/AEIOU/;     # Change $_ vowels to uppercase
$t=~tr/AEIOU/aeiou/; # Change $t vowels to lowercase

The y/// operator is simply a synonym for the tr/// operator, and they are alike in every other respect.

The tr/// operator doesn't use regular expressions. The searchlist and replacement list are plain lists of characters, although ranges can be written with a dash, as in the following:

tr/a-zA-Z/n-za-mN-ZA-M/;  # ROT-13 encoding

Special characters are allowed, such as backslash escape sequences (covered in the "Character Shorthand" section). Special characters that represent classes (\w\d\s) aren't allowed. (tr/// doesn't use regular expressions!)

No variable interpolation occurs within the tr/// operator. If a character is repeated more than once in the searchlist, only the first instance counts.

The replacement list specifies the character into which searchlist will be translated. If the replacement list is shorter than the searchlist, the last character in the replacement list is repeated. If the replacement list is empty, the searchlist is used as the replacement list (that is, the characters aren't changed, merely counted). If the replacement list is too long, the extra characters are ignored.
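
These rules are easier to remember after seeing them run. A brief sketch, with made-up strings and variable names:

```perl
use strict;
use warnings;

# Short replacement list: its last character is repeated.
(my $padded = "abcde") =~ tr/a-e/xy/;     # a becomes x, b through e become y

# Empty replacement list: nothing changes, but matches are counted.
my $word    = "Mississippi";
my $s_count = ($word =~ tr/s//);

# The return value is the number of characters translated.
my $upper   = "hello world";
my $changed = ($upper =~ tr/a-z/A-Z/);

print "$padded $s_count $word $upper $changed\n";
```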

The modifiers are as follows:

Modifier

Meaning

/c

Complements the search list. In other words, similar to using a ^ in a character class; all the characters not represented in the searchlist will be used.

$consonants=$word=~tr/aeiouAEIOU//c; # Count consonants

/d

Deletes characters that are found in the searchlist but don't appear in the replacement list. This bends the aforementioned rules about empty or too-short replacement lists.

$text=~tr/.!?;://d; # Remove punctuation

/s

Takes repeated strings of characters and squashes them into a single instance of the character. For example,

$a="Pardon me, boy. Is that the Chattanooga Choo-Choo?";

$a=~tr/a-z A-Z//s; # Pardon me, boy. Is that the Chatanoga Cho-Cho?
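
The modifiers can also be combined. As a sketch with invented sample data, /c and /d together delete everything outside the searchlist, and /s with a one-character searchlist collapses runs of that character:

```perl
use strict;
use warnings;

# /cd: delete every character that is NOT a digit.
(my $digits = "(555) 867-5309") =~ tr/0-9//cd;

# /s: squash runs of spaces down to a single space.
(my $squeezed = "a   b    c") =~ tr/ //s;

print "$digits '$squeezed'\n";
```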


See Also

character shorthand and character classes in this book


study

Usage

study
study expression

Description

The study function is a potential optimization for perl's regular expression engine. It prepares an expression (or $_ if none is specified) for pattern matching with m// or s///. It does this by prescanning the expression and building a list of uncommon characters seen in the expression, so that the match operators jump right to them as anchors.

Calling the study function for a second expression undoes any optimizations by the previously studied expression.

Whether study will save any time on your regular expression matches depends on several factors:

As always, with any optimization, use the Benchmark module and determine whether there really is a cost savings to using study. Constructing a case in which study is actually useful is difficult. Do not use it indiscriminately. In perl 5.16 and later, study is a no-op, so on a current interpreter it has no effect either way.


See Also

qr in this book


Quote Regular Expression Operator

Usage

qr/pattern/

Description

The qr operator takes a regular expression and precompiles it for later matching. The compiled expression then can be used as a part of other regular expressions. For example,

$r=qr/\d{3}-\d{2}-\d{4} $name/i;
if (/$r/) {
   # Matched digits-digits-digits and whatever was in $name...
}

Similar to the match operator, the delimiters can be changed to any character other than whitespace. Also, using single quotes as delimiters prevents interpolation.
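
Both points, building larger patterns from compiled pieces and changing the delimiters, can be sketched in a few lines. The patterns and test strings here are made up for illustration:

```perl
use strict;
use warnings;

# A compiled piece reused inside a larger pattern.
my $octet = qr/\d{1,3}/;
my $ip    = qr/^$octet(?:\.$octet){3}$/;

# Braces as delimiters, so the / needs no escaping; /i travels with it.
my $path  = qr{usr/local}i;

my $is_ip   = "192.168.1.1"    =~ $ip   ? 1 : 0;
my $not_ip  = "1.2.3"          =~ $ip   ? 1 : 0;
my $is_path = "/USR/Local/bin" =~ $path ? 1 : 0;

print "$is_ip $not_ip $is_path\n";
```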

Example Listing 3.9

# A short demo of the qr// operator.  The fast subroutine
#   runs nearly 4 times faster than the slow subroutine
#   because the qr// operator pre-compiles all of the regular
#   expressions for &fast.
# Remember, if you're not sure something is faster: Benchmark it.

use Benchmark;
sub slow {
   seek(BIG, 0, 0);
   @pats=qw(the a an);
   while(my $line=<BIG>) {
       # Name the loop variable so it doesn't hide the line being searched.
       for my $pat (@pats) {
           if ($line=~/\b$pat\b/i) {
               $count{$pat}++;
           }
       }
   }
}
sub fast {
   seek(BIG, 0, 0);
   # Pre-compile all of the patterns with
   #   qr//
   @pats=map { qr/\b$_\b/i } qw(the a an);
   while(my $line=<BIG>) {
       for my $pat (@pats) {
           if ($line=~/$pat/) {
               $count{$pat}++;
           }
       }
   }
}

open(BIG, "bigfile.txt") || die;
timethese(10, {
   slow => \&slow,
   fast => \&fast, });

See Also

match modifiers in this book