Using Regex in Perl

In our last lesson, the Regular Expressions Tutorial, we walked through the basics of regular expressions and looked at some simple things one can do with them. This time, we’re going to dig a little deeper, but rather than trying to keep track of all the different regex syntaxes out there, we’re going to focus on just one: Perl.

If you have never worked with Perl before, what follows may look bizarre, but try not to let that intimidate you. It’s actually pretty easy. First, the requisite background.


Contents

  1. About Perl
  2. Using Markdown
    1. Replacement scripts
  3. Perl in contact forms
    1. Get inside the script
  4. Using modifiers
  5. Strip out HTML using Perl
  6. Further reading

About Perl

Perl was written by Larry Wall, supposedly as a quick hack to overcome a limitation of awk (which is a Unix text processing program). Like most of my quick hacks, Wall’s hack became an overnight worldwide success and full-fledged, virtually ubiquitous programming language. Just try to find something that doesn’t run Perl. Don’t even look at that PS2 — if it can run Linux, it can do Perl. Perl can process text from the command line, accept STDIN input from GUI editors, spit out dynamic web pages, power blogs, sing, dance, cook and clean up after itself.

Somewhere along the line, Perl became the de-facto standard for CGI programming on the web — apologies to my Python-wielding cohorts, but Perl is infinitely more common. And for good reason. Wall wrote Perl to be a quick and easy way to manipulate text, and what, my friends, is the World Wide Web, if not a really long string of text (and way too many pictures of tennis players in short skirts)?

If you have never used Perl or written any CGI scripts, fear not; Webmonkey has already trod these waters. Check out Colin Ferm’s Perl Tutorial for Beginners to help you get up to speed.

To be honest, I am a latecomer to Perl. And, like most of you, I use far more prewritten scripts (Movable Type, FormMail, etc.) than I do my own homebrewed variety. The reasoning for that is pretty simple: I don’t want to spend my time reinventing the wheel, and a lot of common things that you want to do with Perl have already been done. When I do write my own scripts, it’s almost always because I need to use some specific regex search pattern. With that in mind, I’m going to look at some scripts written by others and talk about the regular expressions being used. Hopefully, this will help you start to form some ideas about things you might be able to do with Perl on your own site.

The first thing we’re going to look at is a chunk of John Gruber’s Markdown script, the script which inspired me to learn Perl. Then we’ll look at an e-mail form processing script. Finally, we’ll return to the example from the Regular Expressions Tutorial and improve our font tag replacement pattern.

Using Markdown

Markdown was created to provide writers on the web with a means of writing human readable documents that could be quickly and simply converted to HTML markup. If you write for the web and you’re not using Markdown, you’re wasting a lot of valuable time. Just as stripping a word processor back to a text editor helps you focus on the writing rather than the formatting, so Markdown strips away the necessary HTML and lets you focus on the writing.

Markdown is essentially an intermediary language. You write in Markdown’s syntax and it generates the necessary HTML for you. Markdown is no panacea though. It won’t create a whole HTML document for you. Instead, it focuses on those elements that are likely to be used when writing content. It’s very useful for things like blog postings, comment forms, and bulletin boards.

The syntax of Markdown is based on common e-mail shorthand. For instance, *surrounding something with asterisks* in an e-mail typically adds emphasis to it. Markdown uses single asterisks to generate <em> tags and double asterisks to generate <strong> tags. For a complete rundown on Markdown’s syntax, see the Markdown site and, while you’re at it, go ahead and grab the script. Open the file in the text editor of your choice so you can follow along.

So, how does Markdown translate a “*” into an “<em>” tag? Why, regular expressions, of course. Markdown is written in Perl and relies primarily on regular expressions to translate its syntax into proper HTML markup. To further emphasize the flexibility of Perl, I would like to point out that the Markdown script can be run in no fewer than four different environments without modification. It can run as a Movable Type plugin, a Blosxom plugin, a BBEdit plugin, and from the command line. All that action from one script. As such, Markdown contains some code which is simply for the benefit of each of those architectures, but we’re not really interested in that code right now.

We’re interested in the regular expressions part of the script. And there are many, many regular expressions in Markdown. Most of them are well commented and I encourage you to look through them and familiarize yourself with as many of them as you can, looking up those parts you don’t recognize.

Replacement scripts

Opening up the Markdown script in a text editor, let’s look at the section that uses asterisks and underscores to create <em> and <strong> tags so I can point out a few things. Markdown’s regex for this operation starts on line 1035 and looks like this:

sub _DoItalicsAndBold {
      my $text = shift;

      # <strong> must go first:
      $text =~ s{ (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
          {<strong>$2</strong>}gsx;

      $text =~ s{ (\*|_) (?=\S) (.+?) (?<=\S) \1 }
          {<em>$2</em>}gsx;

      return $text;
}



OK, let’s start with the expressions themselves. Note that the comments say that the strong substitution must go first. If you know why, you get a big bunch of bananas. Don’t worry if you don’t; it will become obvious once we examine the patterns. We’re going to ignore the function definition and the shift bit (if you’re curious, take a look at the online Perl documentation). The regex we’re actually interested in is on these lines:

s{ (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
      {<strong>$2</strong>}gsx;

The “s” tells Perl that we’re about to give it a regular expression and a replacement pattern. We’ll delve deeper into that in our next example. For now, just accept that the format looks like this: s/pattern/replacement/options. The pattern itself is broken into five discrete chunks. Let’s look at them one by one. The first is this:

 (\*\*|__)

This simply says: Find two asterisks or two underscores and store them. The underscores are an alternate way of providing emphasis in Markdown syntax. Markdown lets you write **strong** or __strong__ or *emphasis* and _emphasis_, so we must look for both options. The asterisk must be escaped so Perl knows we want a literal asterisk, not the meta character. The pipe character “|” means “or” in Perl and in almost every other flavor of regex.

The next part of the function introduces something sort of odd. The pattern (?=\S) uses a construct, (?=…), called a “lookahead.” It tells Perl to make sure that the very next character is (in this case) a non-white space character (\S), without actually consuming that character. If it isn’t, the match fails at this position and Perl moves on without checking the rest of the pattern. This is what stops an opening ** that is followed by a space from being treated as the start of strong text. It also cuts down on the time Perl spends looking: rather than matching part of the pattern, failing to match the whole thing and then retreating to try again, Perl gives up on a bad starting point right away. Lookahead is part of Perl’s normal regex syntax. The x modifier at the end of the expression (the x in the gsx bit at the end of the second line) does something different: it lets the pattern contain white space and comments for readability, which is why the chunks can be spaced apart. We’ll cover all the modifiers in our next example.
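A lookahead is easier to believe once you watch it at work. The following is a minimal sketch, not code from Markdown; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from the Markdown source.
my $tight = '**bold';
my $loose = '** bold';

# (?=\S) only peeks at the next character; it does not consume it.
my $tight_ok = $tight =~ /\*\*(?=\S)/ ? 1 : 0;   # 'b' follows the asterisks
my $loose_ok = $loose =~ /\*\*(?=\S)/ ? 1 : 0;   # a space follows, so no match

# Because the lookahead is zero-width, the captured text is only the two asterisks.
my ($matched) = $tight =~ /(\*\*(?=\S))/;

print "tight: $tight_ok, loose: $loose_ok, matched: $matched\n";
```

The letter that satisfied the lookahead is not part of the captured text, so it is still available to whatever comes next in the pattern.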

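So, if there is a non-white space character, then Perl proceeds to the next bit: (.+?[*_]*). This pattern matches the text between the ** or __ markers in our original document. The .+? grabs one or more characters but as few as it can get away with, and the [*_]* picks up any asterisks or underscores sitting at the end of that text. The parentheses store this pattern so that we can use it in our replacement pattern.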

The next chunk, (?<=\S), is a lookbehind pattern, which behaves similarly to the lookahead. It checks to make sure that the character immediately behind it is a non-white space character. That way, if you want to use a literal asterisk, you can do so without escaping it.

Pop quiz! Which of the following will match our pattern?

4 * 2 = 8

** strong text**

**strong text **

**strong text**

The correct answer is the last line, and only the last line. The first line contains no double asterisk at all, and in the second and third lines the lookahead and lookbehind chunks of our expression stop the search because a space sits where a non-white space character is required.
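If you would rather let Perl mark the quiz, here is a small sketch that runs the four lines against the <strong> pattern. The pattern is the one from _DoItalicsAndBold; the surrounding variable names are invented for this example.

```perl
use strict;
use warnings;

# The <strong> pattern from _DoItalicsAndBold, used here only to test for a match.
my $strong = qr{ (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }sx;

my @quiz = (
    '4 * 2 = 8',
    '** strong text**',
    '**strong text **',
    '**strong text**',
);

my @results = map { $_ =~ $strong ? 'match' : 'no match' } @quiz;

printf "%-20s %s\n", $quiz[$_], $results[$_] for 0 .. $#quiz;
```

Only the last line should report a match. Note that this checks the double-asterisk pattern only; the single-asterisk <em> pattern is a separate test.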

The last little bit of our expression may look familiar if you paid extra-close attention during the Regular Expressions Tutorial. In that tutorial, we used \1, \2, \3 … \n in our replacement patterns to recall a saved chunk of our search pattern. Perl uses the same syntax, but in a different manner. In Perl, the \1 … \n syntax works only within the search pattern. In this case, we simply add \1 to the end of the pattern and recall the first stored subpattern, which will match the closing ** or __. The replacement pattern is then the stored text string (held in $2, since it came from the second set of parentheses) inserted between HTML <strong> tags.

Now, have you figured out why the “strong” pattern must be run before the “emphasis” pattern? It’s quite simple really. Because the emphasis pattern is the exact same pattern but matching only one “*” or one “_”, the script needs to make sure that all instances of double asterisk and underscores are already replaced. Otherwise, the single pattern would also match the double. Once all instances of the double pattern are replaced with HTML tags, then it’s safe to proceed with a single pattern search and replace.
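To watch the ordering problem happen, the sketch below splits the two substitutions into separate subs so they can be run in either order. The sub names do_strong and do_em and the sample sentence are made up for this example; Markdown itself keeps both substitutions inside _DoItalicsAndBold.

```perl
use strict;
use warnings;

# The two substitutions from _DoItalicsAndBold, split up so the order can be swapped.
sub do_strong {
    my $text = shift;
    $text =~ s{ (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }{<strong>$2</strong>}gsx;
    return $text;
}

sub do_em {
    my $text = shift;
    $text =~ s{ (\*|_) (?=\S) (.+?) (?<=\S) \1 }{<em>$2</em>}gsx;
    return $text;
}

my $sample = 'This is **strong** and this is *emphasis*.';   # made-up sample text

my $right_order = do_em(do_strong($sample));
my $wrong_order = do_strong(do_em($sample));

print "strong first: $right_order\n";
print "em first:     $wrong_order\n";
```

Run with <em> first, the single-asterisk pattern grabs part of the ** markers and the <strong> pass finds nothing left to convert.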


Perl in contact forms

We all know that if you want to make a contact form that allows visitors to contact you through your site, you need two things. First, some HTML (or Flash) to display and grab input from the form. And, second, you need to do something with that information. You know that you are not the first person who ever found themselves in need of a mailing script, so you head off to your trusty search engine and dig one up. Or, these days, your web hosting company just might provide one for you.

That’s fine. It gets the job done and helps you get paid. But, have you ever wondered how those scripts work their magic? All they usually ask is that you plug in the e-mail address you want info sent to. The rest of the script is an indecipherable jumble of letters and symbols that look more like a fight scene from the old Batman TV series than a programming language. And yet, that’s Perl hard at work. What is it using to grab the info sent from your HTML? You guessed it: regular expressions again. Do you see a pattern here?

Matt’s Script Archive has some ready-to-go scripts for this purpose, but I am more fond of TFMail, which addresses a couple of security holes in the FormMail script at Matt’s Script Archive. TFMail is open source and can be downloaded from sourceforge.net. Grab the file and unpack it. You’ll see that there are a whole bunch of files. The one we’re interested in is TFmail.pl, “pl” being one of several suffixes that denote Perl files.

Open the file in the text editor of your choice and scroll down to line 492, where you will find this little gem:

$addr =~ m#^[ \t]*[\w\-\.\*]{1,100}\@[\w\-\.]{1,100}[ \t]*$# ? 1 : 0;

In this sequence, Batman has clocked the Joker upside the head with an upright piano while Robin struggles fruitlessly against the ropes which bind him. Or, possibly, this is a ridiculously terse way of determining whether or not a given string is actually an e-mail address. Before we decipher this mess, I’d like to take this opportunity to suggest commenting your code. This is a public script, so presumably, there is someone with a commented version somewhere. Good commenting is good practice, but with regular expressions it’s doubly important — you think you’ll remember what a 200 character pattern does two weeks from now?


Get inside the script

Let’s rewrite that line of code above with a few comments plugged in to help us understand what’s going on.

# Same pattern, written with {} delimiters and the x modifier so that
# white space and comments are allowed inside it.
$addr =~ m{           # start the match
    ^[ \t]*           # match beginning of the string (allows for spaces and tabs)
    [\w\-\.\*]        # the brackets are a character class made of what's inside
    {1,100}           # control the char class, match 1, but not more than 100
    \@                # find the all important @ in the address
    [\w\-\.]{1,100}   # another character class, again matched 1 to 100 times
    [ \t]*            # allow for trailing spaces and tabs
    $                 # match end of string
}x ? 1 : 0;           # not actually part of the pattern, this is what happens
                      # based on what the pattern returns



In case you were wondering, yes, # is Perl’s comment delimiter. The “=~” binds a scalar expression to a pattern match. It tests the variable $addr (which happens to be a scalar expression) against our pattern and yields true or false. Note that in Perl you can comment and add line breaks just about anywhere and your code will still function as before; inside a pattern, that freedom requires the x modifier.

Now let’s get deeper. The “m” character tells Perl that we are about to give it a regular expression pattern. Traditionally, the sequence looks like this: m/PATTERN/MODIFIERS. In this case, the authors are using “#” instead of “/” to mark the pattern. Perl is very loose with its delimiters. In fact, you can use any pair of non-alphanumeric, non-white space characters as delimiters.

As the documentation at Perldoc.org points out, “this is particularly useful for matching path names that contain “/”, to avoid LTS (leaning toothpick syndrome).” In other words, we avoid patterns where we would have to escape every / with a backslash. Instead of m/nt.*/ we could write m#nt.*# which is much easier to read. At the same time, you probably don’t want to use 10 different characters in one Perl script. The “#” is a good choice since the only other time you’re likely to use it is in comments. It’s also worth pointing out that if you decide to stick to the forward slash notation, you can omit the m, and then you need only write /PATTERN/.
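Here is a quick sketch of leaning toothpick syndrome and its cure. The path is a hypothetical one picked for the demo; all three tests use the same pattern and differ only in their delimiters.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical path used only for this demo

# With slashes as delimiters, every / in the pattern needs a backslash.
my $with_slashes = $path =~ /^\/usr\/local\/bin\// ? 1 : 0;

# With # as the delimiter, the same pattern needs no escaping.
my $with_hashes = $path =~ m#^/usr/local/bin/# ? 1 : 0;

# Paired brackets work too.
my $with_braces = $path =~ m{^/usr/local/bin/} ? 1 : 0;

print "$with_slashes $with_hashes $with_braces\n";
```

All three report a match; the second and third are simply easier on the eyes.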

So far so good. Obviously, we want to match the beginning of the string. We do want the whole e-mail address, hence the “^” character.

The next section should look familiar from the previous regex lesson. The brackets indicate that what is inside is a class of characters to look for. The authors of TFMail decided to allow padding of the string by creating a character class made up of the space character and the tab character. This means our actual address can be preceded by any number (remember that the * means 0 or more) of tabs or spaces.

After that, we have another character class. In this case the class is made up of \w\-\.\*. The \w is shorthand for any word character, that is to say [a-zA-Z0-9_]. Because that character class is so common, Perl has the shorthand notation \w. The next thing in our regex is “\-”, which tells our character class to include hyphens, since e-mail addresses may contain hyphens. Because the normal use of a hyphen in a character class is to indicate a span of characters, i.e. [a-z], we precede it with a backslash to tell Perl we want to match a literal hyphen. The same thing goes for the dot. E-mail addresses may be something like your.name@yourhost.com, so we need to account for dot characters. Outside a character class an unescaped dot means “any character”; inside a class it is already literal, so the backslash here is harmless but not strictly required. The same is true of “\*”: inside a character class it is just a literal asterisk, not the “zero or more” quantifier, so this class also lets an asterisk appear before the @.

The next block in the pattern of our example script is a control sequence. The syntax is: {n,m}, where n is the minimum number of matches to allow and m is the maximum. The character class in question must match at least once, meaning there is at least one character in our user’s e-mail address, and it can match up to 100 times. This makes for a very long e-mail address, but you never know. After that, we have the “@” portion of the address.
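A tiny sketch shows the {n,m} syntax in isolation. The test strings are made up, and the limit of three is only there to keep the example short.

```perl
use strict;
use warnings;

# {1,3} means at least one and at most three of the preceding item.
my @tests   = ('', 'a', 'aaa', 'aaaa');
my @results = map { /^a{1,3}$/ ? 1 : 0 } @tests;

print "@results\n";
```

The empty string and the four-letter string fail; the two in between pass.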

There is a second character class which is basically the same as the first, but we have dropped the \* because a literal asterisk is not OK in the second part of an e-mail address. The {1,100} after it again means the class must match at least once, so there has to be something after the @. Because the class includes the dot, e-mail addresses at sub-domains are not left out. For instance, things like my.university.email.edu are pretty common for large domains. Note, however, that nothing in the pattern actually requires a dot, so it does not insist on a .com or .net suffix.

The pattern ends by allowing for tab and space padding “[ \t]*”, then the end-of-string anchor “$” (which matches at the very end of the string, or just before a final line break). The last part of the regex pattern is the closing # delimiter, which tells Perl it has reached the end of the pattern.

The rest of the line is just Perl-speak for, “If a match is found, return 1 (true). If not, return 0 (false).” With this script, we have detected whether or not a user has sent something shaped like a valid e-mail address.
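To try the pattern out yourself, here is a minimal sketch that wraps it in a helper. The sub name looks_like_email and the test addresses are made up for this example and do not appear in TFMail.

```perl
use strict;
use warnings;

# A sketch that wraps the TFMail pattern in a helper. The sub name
# looks_like_email is hypothetical and does not appear in TFMail.
sub looks_like_email {
    my $addr = shift;
    return $addr =~ m#^[ \t]*[\w\-\.\*]{1,100}\@[\w\-\.]{1,100}[ \t]*$# ? 1 : 0;
}

# Made-up test addresses.
my @candidates = (
    'your.name@yourhost.com',
    "  padded\@example.com\t",
    'no-at-sign.example.com',
    'two@@example.com',
);

for my $candidate (@candidates) {
    printf "%-28s %d\n", "'$candidate'", looks_like_email($candidate);
}
```

The first two pass and the last two fail. Something like a@b passes as well, since the pattern checks the shape of the string and not whether the domain is real.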

Whew! We made it.


Using modifiers

Now let’s address a couple of further points. I mentioned that Perl matches take the form of m/PATTERN/MODIFIERS. What about those modifiers? What can they do? If the sequence had read m#PATTERN#i, the “i” modifier would tell Perl that our pattern was case insensitive. For this pattern, the i modifier is unnecessary because we aren’t looking for actual words. We’re looking for alphanumeric characters, and \w already covers both upper- and lowercase letters, so our search is already case-insensitive. That’s just one of the possibilities. In all, there are seven modifiers for the “m/” expression. They are (from perldoc):

c   Do not reset search position on a failed match when /g is in effect.
g   Match globally, i.e., find all occurrences.
i   Do case-insensitive pattern matching.
m   Treat string as multiple lines.
o   Compile pattern only once.
s   Treat string as single line.
x   Use extended regular expressions.

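The list is pretty self-explanatory, so I won’t walk through all of them. Instead, I’d just like to point out the “g”, “m” and “s” modifiers. The “g” tells Perl to find every match rather than stopping at the first, “m” lets ^ and $ match at the start and end of each line within the string, and “s” lets the dot match line break characters. These will come in handy in just a minute.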
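The differences between these modifiers are easiest to see side by side. This is a small sketch using a made-up three-line string; the variable names are arbitrary.

```perl
use strict;
use warnings;

my $text = "Perl is fun\nperl is terse\nPERL is everywhere\n";   # made-up sample

# i: ignore case; g in list context: return every match, not just the first.
my @all = $text =~ /perl/gi;

# Without m, ^ only anchors at the very start of the string.
my @start_only = $text =~ /^perl/gi;

# With m, ^ also anchors at the start of each line.
my @each_line = $text =~ /^perl/gim;

# Without s, . stops at a line break; with s it can cross it.
my $without_s = $text =~ /fun.perl/  ? 1 : 0;
my $with_s    = $text =~ /fun.perl/s ? 1 : 0;

print scalar(@all), ' ', scalar(@start_only), ' ', scalar(@each_line), " $without_s $with_s\n";
```

Despite their names, “m” and “s” are not opposites: one changes what ^ and $ mean, the other changes what the dot means, and you can use both at once.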

What if we want to replace what we’ve matched? Well, for that we would use a different pattern matching tool. We would use s/// like we saw in the Markdown example. This method takes the form s/PATTERN/REPLACEMENT/MODIFIERS where the modifiers are:

e   Evaluate the right side as an expression.
g   Replace globally, i.e., all occurrences.
i   Do case-insensitive pattern matching.
m   Treat string as multiple lines.
o   Compile pattern only once.
s   Treat string as single line.
x   Use extended regular expressions.

Returning to our Markdown example, we had a pattern that went s/PATTERN/REPLACEMENT/gsx. The pattern was searching globally, treating the string as one line and using extended regular expressions. What does this mean? Well, by searching globally we replace every occurrence rather than just the first, and by treating the string as one line we let the dot match line break characters, so emphasized text can span more than one line. And the “x” let the pattern be laid out with white space between its chunks, which is why the lookahead and lookbehind pieces were so easy to pick out.

Going back to the exercise that we used in the previous regex lesson, let’s write a Perl script to strip out and replace those outdated <font> tags. Last time we tried that, we had a solution that looked like this:

<font size="20" .*>

and we were replacing it with this:

<p class="myclass">\1</p>

Now let’s put that into Perl-speak:

s/<font size="20" .*>/<p class="myclass">$1</p>/gis



The pattern and replacement pattern look pretty much the same. Note that Perl’s replacement sequence uses a $ instead of the \ we used before, and that $1 refers to whatever the first parenthesized group in the pattern captured. And, especially note the “g”, “i” and “s” modifiers after our replacement pattern. As mentioned above, these modifiers make the search global and case-insensitive, and they treat the string as a single line.

That’s much more powerful than our original pattern. Now we don’t have to worry about line breaks. Perl allows us to search the whole document in one pass.
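For something you can actually run, here is one possible sketch of the font tag replacement. It is not the pattern from the earlier tutorial: the capture group, the [^>]* in place of .* and the made-up input string are all assumptions added for this example.

```perl
use strict;
use warnings;

# Made-up input; note the line break inside the first <font> element.
my $html = qq{<FONT size="20" face="Arial">Hello,\nworld</FONT> and <font size="20">again</font>};

# Assumptions for this sketch: [^>]* keeps the match inside one tag,
# and (.*?) is non-greedy so it stops at the first closing </font>.
(my $clean = $html) =~ s{<font\s+size="20"[^>]*>(.*?)</font>}{<p class="myclass">$1</p>}gis;

print "$clean\n";
```

The parentheses are what give $1 something to hold, the “i” takes care of the uppercase tag, and the “s” lets the captured text run across the line break.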

Strip out HTML using Perl

One more before we go. This pattern is very handy. See if you can guess what it does.

s/<[^>]+>//g;

Give up? It strips all the HTML tags out of a document and just leaves behind text. Along these lines, there is a Python script that is a sort of anti-markdown for those who are interested.
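Here it is in action, as a quick sketch on a made-up snippet of HTML.

```perl
use strict;
use warnings;

my $html = '<p>Perl is <em>very</em> <a href="http://www.perl.org/">handy</a>.</p>';   # made-up sample

# Copy the string, then delete everything from a < up to the next >.
(my $plain = $html) =~ s/<[^>]+>//g;

print "$plain\n";
```

Keep in mind that the pattern assumes no tag contains a > inside an attribute value; for messy real-world HTML, treat it as a rough tool rather than a parser.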


Further reading

There you go folks, a brief intro to Perl’s regular expression powers. And when I say brief I do mean brief. We have barely scratched the surface of what Perl can do with regex. For more information, check out the regex portion of perldoc at perl.org.

For more information on Perl in general, check out CPAN and O’Reilly.com, which has something like 40 books on Perl.

And finally, I leave you with a bit of Perl Zen.

I would also like to thank John Gruber for his assistance with the Markdown section of this article.