PHP Regular Expressions 101

Regular expressions are one of those things that I never really took the time to learn. Rather, whenever a situation arose I would discern and 'learn' the specific thing that I needed to resolve my problem.

Having built Whois Browser, as well as many other internal tools which depend heavily on regular expressions, this is a somewhat bemusing approach.

If you have a similar approach, this post seeks to serve as a basic overview of what you need to know to utilise regular expressions within PHP (at the most basic level).

For advanced level regular expressions check out http://www.regular-expressions.info/ - it is an awesome resource. It is well worth setting aside an afternoon and reading it all.

preg functions

As outlined in the PHP manual there are a number of preg functions contained within the PHP language. These functions all 'perform (a search for) a regular expression' and then do something.

You can for example use:

  • preg_replace to find any numerical values prefixed with a £ sign and replace them with something else.

  • preg_match_all to find all phone numbers in a given text block

These methods serve as your main interface for utilising regular expressions within PHP. They are incredibly powerful.
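As a rough sketch of those two uses (the sample text, prices and phone numbers are made up, and the phone pattern is deliberately naive; real phone numbers need something far more forgiving):

```php
<?php
// Hypothetical sample text; none of these values come from a real source.
$text = "Call 020 7946 0000 or 0161 496 0000. Tickets cost £25 or £40.";

// preg_replace: swap every £-prefixed number for a placeholder.
$redacted = preg_replace('#£[0-9]+#', '[price]', $text);

// preg_match_all: collect everything shaped like "digits digits digits".
// Assumes numbers are written as three space-separated groups.
preg_match_all('#[0-9]{3,5} [0-9]{3,4} [0-9]{4}#', $text, $phoneMatches);

echo $redacted, "\n";
print_r($phoneMatches[0]);
```

The full matches from preg_match_all end up in $phoneMatches[0]; the other indexes are reserved for captures, which are covered further down.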

Perl Compatible Regular Expressions (PCRE)

PHP utilises Perl Compatible Regular Expressions (PCRE). This is a format for writing regular expressions - the things that you are searching for.

The PHP manual has its own page on pattern syntax.

The basics

A basic code snippet outlining usage of the preg_match function would be as follows:

$regex = "#cat#";

preg_match($regex, $contentToSearch, $outputArray);

This would search your $contentToSearch for the word cat and populate $outputArray with the first instance of cat that it found (this is a completely pointless example). preg_match stops after the first match; use preg_match_all if you want every instance.
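It is worth checking what preg_match hands back. A minimal sketch, assuming a made-up sentence that contains the word twice:

```php
<?php
// Hypothetical input containing the word twice.
$contentToSearch = "The cat sat next to another cat.";
$regex = "#cat#";

// preg_match returns 1 on a match, 0 on no match, and false on error.
$found = preg_match($regex, $contentToSearch, $outputArray);

if ($found === 1) {
    // Only the first occurrence is recorded, so there is a single entry.
    echo count($outputArray), " entry: ", $outputArray[0], "\n";
}

// preg_match_all keeps going and returns the number of matches found.
$total = preg_match_all($regex, $contentToSearch, $allMatches);
echo $total, " matches in total\n";
```

Comparing the return value with === 1 keeps the "no match" and "error" cases apart from a successful match.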

The # symbol delimits the start and the end of the expression. You can utilise any "non-alphanumeric, non-backslash, non-whitespace character" as a delimiter but it is advisable to use something that is obvious, and will not be utilised within the expression itself (otherwise you'll have to escape characters).
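To illustrate why the choice of delimiter matters, here is a small sketch using a made-up URL. The same pattern is written twice, once with / as the delimiter and once with #:

```php
<?php
// Hypothetical URL used only for illustration.
$contentToSearch = "http://example.com/blog/2013/regex";

// With / as the delimiter, every / inside the pattern has to be escaped.
$slashDelimited = '/\/blog\/([0-9]{4})\//';

// With # as the delimiter, the same pattern needs no escaping.
$hashDelimited = '#/blog/([0-9]{4})/#';

preg_match($slashDelimited, $contentToSearch, $slashMatches);
preg_match($hashDelimited, $contentToSearch, $hashMatches);

// Both patterns find exactly the same thing.
var_dump($slashMatches === $hashMatches);
```

Both versions behave identically; the # version is simply easier to read when the pattern is full of slashes.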

Modifiers

After the closing delimiter you can add any number of modifiers. These modify the expression so for example you can:

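  • make your expression case insensitive (the i modifier)
  • make your expression work across multiple lines (the m modifier makes ^ and $ match at every line break; the s modifier lets . match newlines)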

If my regex above were $regex = "#cat#i"; it would match cat, CAT, Cat, etc.
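A quick sketch of the i, m and s modifiers side by side, assuming a made-up two-line input. The ^ and $ anchors used here mark the start and end of the string (or of each line, with m):

```php
<?php
// Hypothetical two-line input.
$contentToSearch = "Dog\nCAT";

// No modifier: the lowercase pattern does not match "CAT".
$plain = preg_match('#cat#', $contentToSearch);

// i: ignore case.
$caseInsensitive = preg_match('#cat#i', $contentToSearch);

// Without m, ^ only matches at the very start of the string.
$stringAnchored = preg_match('#^cat$#i', $contentToSearch);

// m: ^ and $ also match at the start and end of each line.
$lineAnchored = preg_match('#^cat$#im', $contentToSearch);

// By default . does not match a newline; the s modifier allows it.
$dotDefault = preg_match('#Dog.CAT#', $contentToSearch);
$dotAll     = preg_match('#Dog.CAT#s', $contentToSearch);

var_dump($plain, $caseInsensitive, $stringAnchored, $lineAnchored, $dotDefault, $dotAll);
```

Modifiers can be combined freely, as with the im pair above.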

Captures

Any sub-expression contained within parentheses, ( and ), is captured. That is to say, it is added to your $outputArray after the full match.

$contentToSearch = "123-ABC-456";

$regex = "#([0-9]*)-[A-Z]*-([0-9]*)#";

preg_match($regex, $contentToSearch, $outputArray);

In this example the contents of $outputArray would be:

Array
(
    [0] => 123-ABC-456
    [1] => 123
    [2] => 456
)

  • The first value is the match for the whole expression.
  • The second and third values are the values of the two capturing blocks.
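In practice you will usually pull the captures out by index. The sketch below does that, and also reuses the same captures in preg_replace, where $1 and $2 in the replacement string refer to the first and second capturing blocks (the variable names are just for illustration):

```php
<?php
$contentToSearch = "123-ABC-456";
$regex = "#([0-9]*)-[A-Z]*-([0-9]*)#";

if (preg_match($regex, $contentToSearch, $outputArray) === 1) {
    // Index 0 is the whole match; captures start at index 1.
    $firstNumber  = $outputArray[1];
    $secondNumber = $outputArray[2];
    echo $firstNumber, " and ", $secondNumber, "\n";
}

// Single quotes stop PHP from treating $2 and $1 as variables.
$swapped = preg_replace($regex, '$2-$1', $contentToSearch);
echo $swapped, "\n";
```

Note that the letters are dropped in $swapped because [A-Z]* is not inside parentheses, so it is matched but never captured.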

Matches

[ and ] enclose acceptable sets of characters.

  • [0-9] allows any character from 0-9
  • [A9] allows the characters A or 9
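A tiny sketch to check those two sets against single characters (the test characters are arbitrary):

```php
<?php
// [0-9] accepts any one digit.
$sevenIsDigit = preg_match("#[0-9]#", "7");
$xIsDigit     = preg_match("#[0-9]#", "x");

// [A9] accepts only the literal characters A or 9.
$nineInSet = preg_match("#[A9]#", "9");
$bInSet    = preg_match("#[A9]#", "B");

var_dump($sevenIsDigit, $xIsDigit, $nineInSet, $bInSet);
```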

How many

  • [A-Z] will match one character from A-Z.
  • [A-Z]* will match any number of characters from A-Z (including none at all)
  • [A-Z]{2,} will match two or more characters from A-Z
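The difference is easiest to see by looking at what each pattern actually matches. A minimal sketch, with made-up inputs:

```php
<?php
$contentToSearch = "ABC";

preg_match("#[A-Z]#", $contentToSearch, $one);           // exactly one character
preg_match("#[A-Z]*#", $contentToSearch, $anyNumber);    // as many as are available
preg_match("#[A-Z]{2,}#", $contentToSearch, $twoOrMore); // at least two

// * is satisfied by zero characters, so it still "matches" a string
// with no capitals in it; the match is simply empty.
$zeroResult = preg_match("#[A-Z]*#", "123", $zero);

// {2,} needs at least two in a row, so a lone capital is not enough.
$tooShort = preg_match("#[A-Z]{2,}#", "A1");

var_dump($one[0], $anyNumber[0], $twoOrMore[0], $zero[0], $tooShort);
```

The empty match from * is a common surprise; if you need at least one character, {1,} is the explicit way to say so.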

Greediness

$contentToSearch = "abc123 welcome back";

$regex = "#.* #";

preg_match($regex, $contentToSearch, $outputArray);


$regexTwo = "#.* #U";

preg_match($regexTwo, $contentToSearch, $outputArrayTwo);

In this example $outputArray would contain the following:

Array
(
    [0] => abc123 welcome
)

The . character matches any character (except a newline, by default). So this expression searches for any number of any characters followed by a space.

The match is by default greedy so it wants the largest possible match. abc123 welcome, plus the space that follows it, is the largest set of any number of any characters followed by a space.

You can use the U modifier to make the regex ungreedy by default.

$outputArrayTwo would contain:

Array
(
    [0] => abc123
)
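One detail the print_r output hides is that the matched space is part of the result. The sketch below makes that visible, and also shows the lazy quantifier .*? as an alternative to the U modifier. That quantifier is not covered above; it is standard PCRE syntax that makes just that one quantifier ungreedy.

```php
<?php
$contentToSearch = "abc123 welcome back";

preg_match("#.* #", $contentToSearch, $greedy);
preg_match("#.* #U", $contentToSearch, $ungreedyModifier);

// *? makes this single quantifier lazy without changing the rest.
preg_match("#.*? #", $contentToSearch, $lazyQuantifier);

// var_dump shows string lengths, which reveals the trailing space.
var_dump($greedy[0], $ungreedyModifier[0], $lazyQuantifier[0]);
```

The U modifier flips the default for the whole expression, whereas ? after a quantifier only affects that quantifier, which is usually easier to reason about in longer patterns.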

There you go

Whilst I am happy to answer any questions that you may have, there is no shortage of questions (and answers), tutorials, and websites outlining the usage of regular expressions.

This post is a bare minimum introduction to regular expressions. With what is outlined above you can probably achieve most simple things.

Whilst the actual expressions often look ugly, they are in fact incredibly simple. Don't be put off.

Take the time now to get a grasp on regular expressions and you will be thanking yourself the next time you reach for a strpos and str_replace combo and realise that it is a hell of a lot easier with preg_replace ;)