Web Magazine for Information Professionals

Book Review: Mastering Regular Expressions, 3rd Edition

Emma Tonkin and Greg Tourte take a look at the new edition of an O'Reilly classic.

Introduction: Needles, Haystacks and Magnets

Since the early days of metadata, powerful textual search methods have been, as Wodehouse’s Wooster might have put it, ‘of the essence’. Effective use of search engines is all about understanding the use of the rich query syntax supported by that particular software. Examples include the use of Boolean logic (AND, OR and NOT), and wildcards, such as * and ?. Search engines such as Google naturally include their own selection of rich functionality and usage tricks, and O’Reilly has a book out to cover that, too (see Phil Bradley’s review of ‘Google Hacks’ in Ariadne [1]).

In the digital library environment, methods of free-text search have gained in importance as the availability of metadata and full-text corpora such as eprints archives has increased, along with the processing power of the average Web server. Today, all facets of search literacy are as important to the digital library as they are to data and text mining specialists, and indeed to everybody who deals with textual information. At the core of today’s small- to medium-scale search engines, you will find a set of routines designed to query a set of records, usually an index containing information about the objects in the database, and extract appropriate results. The chances are fairly high that those routines will make heavy use of the subject of this book: the regular expression.

The regular expression, known to its friends as the regex or regexp, is to search and manipulation of data what a powerful electromagnet is to the famous ‘needle-in-a-haystack’ metaphor; that is, an extremely powerful shortcut through an otherwise lengthy and tedious process.

Introducing Regular Expressions

The introduction to this book describes regular expressions as ‘the key to powerful, flexible and efficient text processing. Regular expressions themselves, with a general pattern notation almost like a mini-programming language, allow you to describe and parse text. [They] can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data.’

The regular expression is a canny beast, with many language- or environment-specific adaptations. Pretty much the only constant between the various implementations out there (a phenomenon referred to in this book as ‘linguistic diversification’) is the fact that almost all regular expressions look like accidental ASCII noise on a modem connection - and of course, the fact that they are all extraordinarily useful. It is difficult to imagine life without regular expressions; they’re just too useful. Here is an example randomly plucked from a Perl script:

$line =~ /title="(.*?)"/;

This means: “if the string $line contains title=", followed by any run of characters (letters, digits, hyphens, underscores, semicolons and so on) up to the next quotation mark, please return the characters between the two quotation marks.” In Perl, the captured text is then available in $1.
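As a minimal sketch of the same capture, here is the pattern in Python's re module rather than Perl. The choice of language, the helper name extract_title and the sample input line are ours, invented for illustration; none of them comes from the book or from the original script. The non-greedy .*? is what makes the match stop at the first closing quotation mark rather than the last one on the line.

```python
import re

# Same pattern as the Perl example: capture whatever sits between the
# quotation marks that follow title=
TITLE_RE = re.compile(r'title="(.*?)"')

def extract_title(line):
    """Return the text between the quotes after title=, or None if absent."""
    match = TITLE_RE.search(line)
    return match.group(1) if match else None

# Hypothetical input line, invented for illustration.
sample = '<a href="/issue50/" title="Ariadne Issue 50" class="nav">'
print(extract_title(sample))
```

With a greedy .* in place of .*?, the capture on this sample would run on to the last quotation mark on the line and swallow the class attribute as well.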

This book, unlike so many technical textbooks, is a good all-rounder. The first chapter presents a general introduction to regular expressions, explaining in an accessible manner what a regular expression is and some of the specialist terminology in use. Chapter 2 provides some very readable introductory text and a set of simple yet powerful examples. Chapter 3 follows with a concise history of the concept of regular expressions and general descriptions of their use, and chapter 4 with the development over time of the various implementations of regex ‘engines’, that is, libraries providing regular expression functionality. This makes fascinating reading for everybody who’s ever wondered why the regex functionality of UNIX tools such as grep, awk and sed doesn’t always line up with Perl (or, indeed, with each other). It also underlines one fact that explains almost everything related to the regexp: computer scientists just like to hack, to make functions just a little better and more powerful.

Chapter 5 is where the book gets technical, providing a cookbook of carefully explained regex techniques and examples, such as matching IP addresses, matching HTML tags, validating hostnames and parsing CSV files. The following chapter provides you with all of the reasons to be cautious with your new-found knowledge, with a particular focus on efficiency of execution, benchmarking and troubleshooting techniques. Chapters 7-10 each provide extended information about the ‘flavour’ of regular expressions in a particular language: Perl, Java, .NET and PHP.
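To give a flavour of the kind of cookbook problem mentioned above, the following is a rough sketch of matching a dotted-quad IPv4 address, written in Python. It is our own illustration and not the pattern developed in the book; the names OCTET, IPV4_RE and looks_like_ipv4 are hypothetical, and we assume that leading zeros (as in 01.2.3.4) should be rejected.

```python
import re

# One octet in the range 0-255, expressed purely as a pattern.
# Assumption: leading zeros are not allowed.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'

# Four octets separated by dots, anchored so the whole string must match.
IPV4_RE = re.compile(r'\A' + OCTET + r'(?:\.' + OCTET + r'){3}\Z')

def looks_like_ipv4(text):
    """Return True if text is a dotted-quad IPv4 address."""
    return IPV4_RE.match(text) is not None

for candidate in ("192.168.0.1", "256.1.1.1", "10.0.0"):
    print(candidate, looks_like_ipv4(candidate))
```

Building the pattern from a named OCTET fragment is a design choice made for readability; the same expression written out in full on one line is considerably harder to check by eye.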

Learning Regular Expressions

A side-effect of all this power is a great deal of potential for confusion. Regular expressions are a major cause of stress headaches in developers. Learning to use regular expressions is a challenge, with a painful learning curve. Firstly, there’s a lot to memorise. Secondly, they’re famously cryptic; depending on the implementation and the type of the regular expression engine, the behaviour may differ in a variety of baffling and unexpected ways. Thirdly, they’re just plain hard to read. To a great extent, the only solution is a combination of knowledge, reference and practice. Thankfully, this book offers all three, in an up-to-date, conveniently sized package. Aside from the expository chapters, it offers examples, quizzes and look-up tables. Helpful conventions are explained, such as the x modifier, which allows comments and spacing to be placed within a regular expression so that regular expressions in Perl can be written with greater legibility. For example, from page 301:

$url =~ m{
    href \s* = \s*        # Match the "href =" part, then the value...
    (?: "([^"]*)"         # a double-quoted value, or...
      | '([^']*)'         # a single-quoted value, or...
      | ([^'"<>]+) )      # an unquoted value
}ix;

Compare to the equivalent but uncommented rendition given below:

$url =~ m/href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'"<>]+))/i;
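The same contrast can be reproduced outside Perl. The sketch below uses Python, where the re.VERBOSE flag plays a role similar to Perl's x modifier; this is our own translation for illustration, and the helper href_value and the sample strings are hypothetical. It checks that the commented and uncommented forms pick out the same value.

```python
import re

# Commented form: re.VERBOSE ignores unescaped whitespace and # comments
# inside the pattern, much as Perl's x modifier does.
HREF_VERBOSE = re.compile(r"""
    href \s* = \s*        # Match the "href =" part, then the value...
    (?: "([^"]*)"         # a double-quoted value, or...
      | '([^']*)'         # a single-quoted value, or...
      | ([^'"<>]+) )      # an unquoted value
""", re.IGNORECASE | re.VERBOSE)

# Uncommented form of the same pattern.
HREF_COMPACT = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'"<>]+))''', re.IGNORECASE)

def href_value(pattern, text):
    """Return the value captured by whichever alternative matched, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    # Exactly one of the three groups takes part in a successful match.
    return next(g for g in match.groups() if g is not None)

# Hypothetical inputs, one per alternative.
samples = ['<a href="a.html">', "<a HREF = 'b.html'>", '<a href=c.html>']
for text in samples:
    print(href_value(HREF_VERBOSE, text), href_value(HREF_COMPACT, text))
```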

 

There’s more to this book than an applied introduction; readers will find that sections, such as the discussion of efficiency in regexps in Chapter 6, are of value in optimisation of code. The language-specific chapters are great as a reference to keep close to hand in everyday code development, debugging and deciphering.
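As a rough illustration of why efficiency deserves a chapter of its own, the sketch below times two patterns that accept the same strings on an input that neither can match. It is written in Python with the standard timeit module; the patterns, the input and the helper time_failure are our own assumptions rather than examples from the book. On a backtracking engine such as Python's, nested quantifiers typically make the failing case much slower, because the engine tries many ways of dividing the same run of characters before giving up.

```python
import re
import timeit

# Two patterns that accept the same strings: one or more 'a' followed by 'b'.
NESTED = re.compile(r'(a+)+b')   # nested quantifiers invite heavy backtracking
SIMPLE = re.compile(r'a+b')      # the same language, stated directly

# A near miss: a run of 'a' with no closing 'b', so both patterns must fail.
subject = 'a' * 18

def time_failure(pattern, text, repeats=3):
    """Return the best-of-N time in seconds for one failed match attempt."""
    return min(timeit.repeat(lambda: pattern.match(text),
                             number=1, repeat=repeats))

nested_seconds = time_failure(NESTED, subject)
simple_seconds = time_failure(SIMPLE, subject)
print(f'nested: {nested_seconds:.6f}s  simple: {simple_seconds:.6f}s')
```

The absolute timings depend on the machine and the Python version, so no figures are quoted here; lengthening the subject string by a few characters is usually enough to make the difference obvious.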

Those with an interest in the topic may also wish to take a look at O’Reilly’s range of pocket references - though Mastering Regular Expressions may be used as a reference, it is not designed exclusively for this purpose. The original Regular Expressions Pocket Reference [2] dating from 2003 was a little patchy in its coverage, but would work well as a quick reference for a developer with a good knowledge of regular expressions, given the option of referring back to a larger text. It is also notable that a new edition of the Pocket Reference (2nd Edition, July 2007) has been released, with updated coverage of a wide range of platforms and implementations.

Conclusion

This book should be on the desk of every programmer who deals with text analysis, almost by default. It’s readable, informative and full of useful detail. The subject matter is complex and technical, but the book presents information well and those with the time to work through the first chapters steadily will find they can acquire an enviable level of knowledge. Furthermore, the closing chapters’ coverage of real-world use of regular expressions in a range of programming languages (from the usual suspects such as Perl and PHP to Java and .NET) makes this book an essential read, and not only for those who do not consider regular expressions a major feature of their programming language (or who think of Perl as a bad habit!).

We will keep a copy of the updated 3rd edition on our desks. We conducted a small quantitative poll of two programmers (the authors) and found that five of the last six scripts they wrote in any language involved regular expressions. Regular expressions are too helpful to ignore. One final, compelling argument: once you’re comfortable with them, one regexp will save you hours of work.

But if we may make one small suggestion: before we get carried away designing appropriate regular expressions for all sorts of purposes, let’s all promise to comment our code…

References

  1. Bradley, P, Review of: ‘Google Hacks’, Ariadne 50, January 2007 http://www.ariadne.ac.uk/issue50/bradley-rvw/
  2. Stubblebine, T, “Regular Expression Pocket Reference: First Edition”, O’Reilly, August 2003

Author Details

Emma Tonkin
Interoperability Focus
UKOLN

Email: e.tonkin@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/ukoln/staff/e.tonkin/

Gregory J. L. Tourte
Scientific Software and Systems Support Engineer
School of Geographical Sciences
University of Bristol

Email: g.j.l.tourte@bristol.ac.uk
Web site: http://www.ggy.bris.ac.uk/staff/staff_tourte.html
