RegExp Tutorial
In this tutorial, you’ll learn how to use regular expressions (RegExp) to create powerful patterns.
What are RegExp patterns?
They are a way to tell your program “look for anything in this text that matches this set of rules, and give me the results”.
For this tutorial, we’re using Perl-compatible RegExp (PCRE). It is one of the most common RegExp flavors in use today. PHP uses it directly, and JavaScript, Ruby and plenty of other languages use a very similar syntax.
The PHP function for searching through text with Perl-compatible RegExp is:
<?php preg_match($needle,$haystack,$fill); ?>
$needle is the pattern to look for.
$haystack is the text we want to look inside.
$fill is the variable (array) that any matches will be thrown into.
RegExp is case-sensitive unless you explicitly specify otherwise. We’ll explain how to do that.
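To make the three parameters concrete, here is a minimal sketch of a call. The sample sentence and the variable $foundLower are invented for illustration; only preg_match and its arguments come from the text above.

```php
<?php
// Sample values invented for this sketch.
$needle   = '/Pizza/';
$haystack = 'I ordered a Pizza yesterday.';

// preg_match returns 1 on a match, 0 on no match, and false on error.
$found = preg_match($needle, $haystack, $fill);
echo $found, "\n";   // 1
echo $fill[0], "\n"; // Pizza ($fill[0] holds the full matched text)

// Matching is case-sensitive by default, so lowercase "pizza" is not found.
$foundLower = preg_match('/pizza/', $haystack);
echo $foundLower, "\n"; // 0
```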
How patterns work
A pattern starts and ends with /.
Between / and / is what we are looking for. You don’t have to know any of the words you’re looking for — it can be any set of anything. We’ll explain.
If you want to group a set of possibilities, you wrap them in ( and ). Inside you can include as many words as you want and include the relationships between them.
For example, let’s say I’m looking for the word Pizza. My pattern would look like this:
/Pizza/
What if I want to look for either “Pizza” or “Macaroni”? Then I’ll wrap the selection and put the OR syntax between them — you say OR by putting | between the words. Here’s how it should look.
/(Pizza|Macaroni)/
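As a rough illustration of the OR syntax, the sketch below runs the pattern against a few dish names. The dish names are made up. The group in parentheses is also captured, so the word that actually matched is available at index 1 of the match array.

```php
<?php
$pattern = '/(Pizza|Macaroni)/';

// Dish names invented for this sketch.
$menu = ['Pizza Margherita', 'Baked Macaroni', 'Green Salad'];

$results = [];
foreach ($menu as $dish) {
    // $m[1] is the text captured by the first ( ) group.
    $results[$dish] = preg_match($pattern, $dish, $m) ? $m[1] : null;
}

print_r($results);
```

The first two dishes report which alternative was found. The salad matches neither word, so it maps to null.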
Not too hard, is it? But this is pretty limited. Now we’ll teach you how to use sets of characters in a selection.
Let’s say I want to look for any letter from a to z, in either case. I wrap the set of characters I want to look for inside brackets: [ and ].
/[A-Za-z]/
As you can see, by using -, RegExp knows to look for any character within that range, endpoints included. Since matching is case-sensitive, we defined both uppercase and lowercase ranges. But notice that if you try this, it will only match the first letter in the text we’re looking in. We can tell it how many characters to match, and here’s how we do it.
/([0-9]*)([A-Za-z]+)/
* tells it to look for zero or more occurrences of the set. So that part should match either “” (blank) or 009 or 21527532 or… You get the point. + tells it to look for one or more occurrences of the set. So the second part needs at least one letter or the pattern won’t match. So it wouldn’t match “” but it would match “Pepperoni”.
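Here is a small sketch of how the set behaves with and without a quantifier. The input strings reuse the examples from the text; the variable names are invented.

```php
<?php
// A bare set matches a single character: only the first letter is returned.
preg_match('/[A-Za-z]/', 'Pepperoni', $one);
echo $one[0], "\n"; // P

// With + the set repeats, so the whole word is matched.
preg_match('/[A-Za-z]+/', 'Pepperoni', $word);
echo $word[0], "\n"; // Pepperoni

// Digits are optional (*), letters are required (+).
preg_match('/([0-9]*)([A-Za-z]+)/', '21527532Pepperoni', $both);
echo $both[1], ' ', $both[2], "\n"; // 21527532 Pepperoni

// Digits alone are not enough, because at least one letter must follow.
$digitsOnly = preg_match('/([0-9]*)([A-Za-z]+)/', '009');
echo $digitsOnly, "\n"; // 0
```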
/(.)*/
. (period) tells it to match any character except a line break. So in fact, this pattern will match almost anything: zero or more of any character on a line.
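One detail worth trying out: by default the period stops at a line break. The sketch below shows this, and also uses the s modifier, which is not covered in this tutorial but is PHP's standard way to let the period match line breaks too. The sample text is invented.

```php
<?php
// Two-line sample text, invented for this sketch.
$text = "first line\nsecond line";

// By default . does not match a line break, so the match ends at line one.
preg_match('/.*/', $text, $default);
echo $default[0], "\n"; // first line

// With the s modifier, . also matches line breaks and takes the whole text.
preg_match('/.*/s', $text, $dotAll);
echo strlen($dotAll[0]) === strlen($text) ? "whole text\n" : "partial\n"; // whole text
```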
/Beach(es)?/
This would match Beach and Beaches. ? tells it to look for zero or one occurrence of that set.
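A quick sketch of the optional group in action, with invented sample phrases:

```php
<?php
$pattern = '/Beach(es)?/';

// Sample phrases invented for this sketch.
preg_match($pattern, 'One Beach', $singular);
preg_match($pattern, 'Two Beaches', $plural);

echo $singular[0], "\n"; // Beach
echo $plural[0], "\n";   // Beaches
```

When the optional part is present, it is included in the match; when it is absent, the pattern still matches without it.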
/B(e{2,10})(a{3})(c{0,4})(h{2,})/
See what I did there? The left number in the braces is the minimum number of occurrences. The right is the maximum. So {2,10} tells it to look for two to ten occurrences of the letter e. If only one number is given, with no comma, it looks for exactly that many occurrences (for example, only exactly 3 occurrences of the letter a would be accepted). Leaving the right side blank means there is no upper limit, so h would match 2 or more times. The left side should always be written out, because older PCRE versions treat {,4} as literal text. That is why c uses {0,4} to match 0 to 4 times.
So…
Beeeaaachh
Beeaaaccchhhhhhhhh
Would both match! But…
Beach
Beaacccccch
Won’t match.
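The four sample words can be checked with a short loop. This is a sketch that writes the pattern with balanced parentheses and an explicit 0 as the lower bound for c; the $matched array is an invented helper.

```php
<?php
$pattern = '/B(e{2,10})(a{3})(c{0,4})(h{2,})/';

$samples = ['Beeeaaachh', 'Beeaaaccchhhhhhhhh', 'Beach', 'Beaacccccch'];

$matched = [];
foreach ($samples as $sample) {
    // Store true when the pattern is found in the sample, false otherwise.
    $matched[$sample] = preg_match($pattern, $sample) === 1;
}

foreach ($matched as $sample => $ok) {
    echo $sample, ': ', $ok ? 'match' : 'no match', "\n";
}
```

The last two words fail at the very first rule: each has only one e, and at least two are required.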
/^Start/
^ tells it to match only text that starts with the specified pattern. Used this way, it normally goes right after the opening /. Inside brackets it means something else.
/End$/
$ says the opposite – it will match only if the text ends with the pattern.
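To see the two anchors at work, here is a small sketch with invented sample strings. The same word matches or fails depending on where it sits in the text.

```php
<?php
// ^ requires the pattern at the beginning of the text.
$startsOk  = preg_match('/^Start/', 'Start here');    // 1
$startsBad = preg_match('/^Start/', 'A fresh Start'); // 0

// $ requires the pattern at the end of the text.
$endsOk  = preg_match('/End$/', 'The End');           // 1
$endsBad = preg_match('/End$/', 'End of story');      // 0

echo $startsOk, $startsBad, $endsOk, $endsBad, "\n"; // 1010
```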
/([^a-z])/
^ as the first character inside brackets means not. So the pattern right here would match any one character that is not a lowercase letter.
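A brief sketch of the negated set, using invented inputs. Note that "not a lowercase letter" includes uppercase letters, digits, spaces and punctuation.

```php
<?php
$pattern = '/([^a-z])/';

// The first character outside a-z is the uppercase D.
preg_match($pattern, 'abcDef', $upper);
echo $upper[1], "\n"; // D

// A space is also "not a lowercase letter".
preg_match($pattern, 'abc def', $space);
echo '[', $space[1], ']', "\n"; // [ ]

// Nothing but lowercase letters: no match at all.
$none = preg_match($pattern, 'abcdef');
echo $none, "\n"; // 0
```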
/\./
\ tells the pattern to escape the next character: treat it as plain text instead of giving it any special meaning. So instead of matching any character once, this pattern will match a period once.
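The difference between the escaped and unescaped period can be seen in a short sketch. The inputs are invented. The last lines use preg_quote, a PHP helper not mentioned in this tutorial, which adds the backslashes for you when the text to search for comes from somewhere else.

```php
<?php
// Unescaped, the period matches any character, so plain letters match.
$anyChar = preg_match('/./', 'abc');   // 1

// Escaped, it only matches a real period.
$noDot  = preg_match('/\./', 'abc');   // 0
$hasDot = preg_match('/\./', 'a.c');   // 1

echo $anyChar, $noDot, $hasDot, "\n"; // 101

// preg_quote escapes special characters; the second argument is the delimiter.
$escaped = preg_quote('example.com', '/');
echo $escaped, "\n"; // example\.com
```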
After the last /, you can put special flags. The most commonly used flags are i, g and m:
i means case-insensitive search (/pattern/i would match both pattern and PATTERN)
g means global matching: it will match the pattern not once, but as many times as it occurs, and will put the matches into an array or replace accordingly. PHP does not accept g; use preg_match_all instead of preg_match there.
m means multiline search: in text that spans more than one line, ^ matches at the start of every line and $ at the end of every line, instead of only at the start and end of the whole text
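Here is a sketch of the flags in PHP, with an invented three-line sample text. Since PHP's preg functions have no g flag, the global case is shown with preg_match_all, which returns the number of matches found.

```php
<?php
// i: case-insensitive search.
$insensitive = preg_match('/pattern/i', 'PATTERN'); // 1

// Three-line sample text, invented for this sketch.
$text = "one\ntwo\nthree";

// Without m, ^ and $ only anchor to the whole text, so the middle line fails.
$withoutM = preg_match('/^two$/', $text);  // 0
// With m, ^ and $ anchor to each line.
$withM    = preg_match('/^two$/m', $text); // 1

// Global matching: preg_match_all collects every occurrence in $all[0].
$count = preg_match_all('/o/', $text, $all);

echo $insensitive, $withoutM, $withM, $count, "\n"; // 1012
```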
Let’s try some example patterns!
// This would match a simple email address (letters and periods only)
preg_match("/^[a-zA-Z\.]+@[a-zA-Z]+\.[a-zA-Z\.]+$/","my@email.com");
preg_match_all("/(Apple|Orange)s?/i","Apples and Oranges",$match);
/* This would match Apple, Apples, Orange, Oranges,
in a case-insensitive manner and will put every occurrence into $match[0],
no matter how many occur. PHP has no g flag, so preg_match_all
does the global matching. So print_r($match[0]) would output:
Array
(
    [0] => Apples
    [1] => Oranges
)
*/
// This would match cumbers and cucumbers case insensitively
preg_match("/(cu){1,2}mbers/i","Cucumbers");
Hope this was helpful!