Regular expressions in PHP - Defining and matching a regular expression in PHP
In this article we will see how regular expressions can be defined in PHP (the syntax is very similar to that of Perl's regular expressions). To learn as you read, you can use the PHP function preg_match to test regular expressions by matching them against strings of your choice.
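As a minimal sketch of this way of testing, the snippet below wraps preg_match in a small helper. The helper name matches() and the sample strings are assumptions made for illustration; note that preg_match expects the pattern to be enclosed in delimiters, here forward slashes.

```php
<?php
// Hypothetical helper: true when $pattern is found somewhere in $subject.
// preg_match returns 1 on a match, 0 on no match, and false on error.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// The pattern is written between two delimiters (forward slashes here).
$found    = matches('/abc/', 'xxabcxx'); // "abc" occurs inside the subject
$notFound = matches('/abc/', 'abxc');    // "abc" never occurs contiguously

var_dump($found, $notFound);
```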
Regular expressions in PHP - Using PHP metacharacters in order to define regular expressions
Metacharacters are the characters that allow you to define a pattern, i.e. characters that are not to be taken literally but have a special meaning within the regular expression. For example, the regular expression [a]{1,2} matches one or two consecutive "a" within a string. In this example, the characters "[", "]", "{", "1", ",", "2" and "}" are there only to help form the regular expression and are not to be taken literally as independent characters.
Metacharacters can be any of the characters below (we will explain each of them later):
[ ] + * - ^ $ . ? \ ( ) | { }
( ) stands for a given sequence of characters: (abc) will match the strings "abcde", "erabcd", but won't match the string "abfdc".
(abc) matches any string containing the substring "abc" (no matter where within the string).
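The three strings above can be checked with a short loop. This is only a sketch: the variable names are illustrative, and the forward slashes around the pattern are the delimiters preg_match requires.

```php
<?php
// Sample strings taken from the examples above.
$subjects = ['abcde', 'erabcd', 'abfdc'];

$results = [];
foreach ($subjects as $subject) {
    // preg_match returns 1 when the pattern occurs anywhere in the subject.
    $results[$subject] = preg_match('/(abc)/', $subject) === 1;
}

var_export($results);
```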
[ ] stands for one character chosen among the characters between the brackets: [az] will match the string "abc" but also the strings "bac", "ezrrt", etc.
It will match any string containing at least one "a" or one "z".
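One possible way to try a character class against several strings at once is preg_grep, which keeps only the array entries that match. In the sketch below, "exrt" is an extra sample added as an assumption, chosen because it contains neither "a" nor "z".

```php
<?php
// "exrt" is an added sample containing neither "a" nor "z".
$subjects = ['abc', 'bac', 'ezrrt', 'exrt'];

// preg_grep returns the entries that match; array_values reindexes them.
$matching = array_values(preg_grep('/[az]/', $subjects));

print_r($matching);
```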
The + metacharacter means that the previous subexpression must occur one or more times consecutively: (ab)+ will match the strings "ab", "abab", "jababa", "axbab" but won't match the string "acb".
(ab)+ stands for one or more repetitions of the sequence "ab".
Similarly, [az]+ will match the strings "abfd", "deaas", "ze", but won't match "exrt".
[az]+ will match any string containing at least one "a" or one "z".
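To see which consecutive run the + metacharacter actually picks up, we can pass a third argument to preg_match; its element 0 receives the text matched by the whole pattern. The variable names below are illustrative assumptions.

```php
<?php
// $m[0] receives the text matched by the whole pattern.
preg_match('/(ab)+/', 'jababa', $m);
$groupRun = $m[0]; // the consecutive run "abab"

preg_match('/[az]+/', 'deaas', $m);
$classRun = $m[0]; // the consecutive run "aa"

// "exrt" contains no "a" or "z", so there is nothing to repeat.
$noMatch = preg_match('/[az]+/', 'exrt');

var_dump($groupRun, $classRun, $noMatch);
```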
By the same token, * means zero or more repetitions of a subexpression (essentially the same as +, except that the subexpression need not occur at all).
Example: (ab)+[cd]*(ab)+ will match the strings "abab", "abcab", "abdab", "abcdab" but won't match the string "abcdax".
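The example can be verified with array_filter, as in the following sketch (variable names are illustrative). Note that "abab" matches because [cd]* is allowed to match nothing at all.

```php
<?php
$pattern  = '/(ab)+[cd]*(ab)+/';
$subjects = ['abab', 'abcab', 'abdab', 'abcdab', 'abcdax'];

// Keep only the subjects in which the pattern is found.
$matching = array_filter($subjects, function (string $s) use ($pattern): bool {
    return preg_match($pattern, $s) === 1;
});

print_r(array_values($matching));
```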
The notations [a-z] and [A-Z] denote any lowercase letter and any uppercase letter, respectively. Similarly, [0-9] denotes any digit between 0 and 9.
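A small sketch of the three ranges, tried against three sample strings that are assumptions chosen for illustration (one all lowercase, one all uppercase, one with digits only besides a "#"):

```php
<?php
$patterns = ['/[a-z]/', '/[A-Z]/', '/[0-9]/'];
$subjects = ['php', 'PHP', '#42']; // illustrative samples

$table = [];
foreach ($patterns as $pattern) {
    foreach ($subjects as $subject) {
        // true when the subject contains at least one character of the range
        $table[$pattern][$subject] = preg_match($pattern, $subject) === 1;
    }
}

print_r($table);
```

Each range matches exactly one of the three samples, since the ranges are case-sensitive by default.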
The metacharacter . denotes any single character (except, by default, a newline), while ? makes the preceding character or subexpression optional, i.e. it may occur at most once. Therefore a.c will match the strings "abc" and "axc", but won't match "ac". By contrast, a?c can also match "ac" (and even "c" on its own).
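The difference between the two metacharacters can be sketched as follows. The subject "xc" is an added sample (an assumption), used to show that the "a" in a?c is optional.

```php
<?php
// "." requires exactly one character between "a" and "c".
$dot = [
    'abc' => preg_match('/a.c/', 'abc') === 1,
    'axc' => preg_match('/a.c/', 'axc') === 1,
    'ac'  => preg_match('/a.c/', 'ac') === 1,
];

// "?" makes the preceding "a" optional; $m[0] holds the matched text.
preg_match('/a?c/', 'ac', $m);
$withA = $m[0];    // the optional "a" is present

preg_match('/a?c/', 'xc', $m);
$withoutA = $m[0]; // the optional "a" is absent

var_dump($dot, $withA, $withoutA);
```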
The expression [ ]{n,m} denotes at least n and at most m consecutive characters among the characters between the brackets.
Therefore a[b]{1,2}c will match the strings "abc", "abbc", but won't match the strings "ac" or "abbbc". The expression ( ){n,m} denotes at least n and at most m repetitions of the subexpression between parentheses.
Therefore (ab){1,2}[^a] will match the strings "abc" and "ababc", but won't match "ab", "abab" or "ababa": the class [^a] (explained below) must consume one character other than "a" right after the repeated "ab".
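The following sketch runs both bounded-repetition patterns against a few strings. Keep in mind that [^a] has to consume one character that is not "a", so a string that stops right after "ab" or "abab" cannot match the second pattern.

```php
<?php
$cases = [
    '/a[b]{1,2}c/'    => ['abc', 'abbc', 'ac', 'abbbc'],
    '/(ab){1,2}[^a]/' => ['abc', 'ababc', 'ab', 'abab', 'ababa'],
];

$results = [];
foreach ($cases as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        $results[$pattern][$subject] = preg_match($pattern, $subject) === 1;
    }
}

print_r($results);
```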
The metacharacter ^ indicates that the expression that follows must be placed at the beginning of the string to match: for instance, ^abc will match "abcfr" but won't match "zabcjd". When used as the first character between [ ], the metacharacter ^ stands for a negation: for instance [^abc] stands for any character which is neither "a", nor "b", nor "c".
Similarly, the dollar symbol $ is a metacharacter which indicates that the expression preceding it must terminate the string: for instance abc$ will match "erertabc" but won't match "efrrabcf".
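Here is a short sketch of both anchors and of the negated class. The subjects "abcd" and "cab" used for [^abc] are added samples (assumptions); the others come from the examples above.

```php
<?php
// ^ anchors the pattern at the beginning of the string.
$startOk  = preg_match('/^abc/', 'abcfr') === 1;
$startBad = preg_match('/^abc/', 'zabcjd') === 1;

// $ anchors the pattern at the end of the string.
$endOk  = preg_match('/abc$/', 'erertabc') === 1;
$endBad = preg_match('/abc$/', 'efrrabcf') === 1;

// Inside brackets, ^ negates the class: any character except a, b or c.
$negOk  = preg_match('/[^abc]/', 'abcd') === 1; // "d" qualifies
$negBad = preg_match('/[^abc]/', 'cab') === 1;  // only a, b and c here

var_dump($startOk, $startBad, $endOk, $endBad, $negOk, $negBad);
```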
The metacharacter | denotes the alternation "or"; for instance, the expression [a]{2,2}|[b]{3,3} (which can be written more briefly as a{2}|b{3}) will match any string containing at least two consecutive "a" or at least three consecutive "b".
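To see which branch of the alternation is used, the sketch below inspects the matched text. All three subject strings are assumptions chosen for illustration.

```php
<?php
$pattern = '/[a]{2,2}|[b]{3,3}/';

// $m[0] shows which alternative produced the match.
preg_match($pattern, 'xaay', $m);
$firstBranch = $m[0];  // two consecutive "a"

preg_match($pattern, 'xbbby', $m);
$secondBranch = $m[0]; // three consecutive "b"

// "abba" has no "aa" and only two consecutive "b": neither branch applies.
$none = preg_match($pattern, 'abba');

var_dump($firstBranch, $secondBranch, $none);
```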
Regular expressions in PHP - Escaping the metacharacters in PHP regular expressions
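Because metacharacters are not interpreted literally, we must escape them with a backslash every time we mean them literally: for instance a\.c will match the string "a.c" but not "abc", whereas a.c matches both. The forward slash is a special case: it is not a metacharacter, but it is commonly used as the delimiter that encloses the pattern passed to preg_match, and in that case a literal forward slash must be escaped too. For instance the pattern /\/abc/ will match the strings "/abcde" and "ae/abcde", whereas //abc/ is rejected by PHP because the unescaped forward slash ends the pattern prematurely.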
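The sketch below illustrates escaping by hand and with the built-in preg_quote function, which adds the backslashes for us. It assumes the forward slash is used as the pattern delimiter; the variable names are illustrative.

```php
<?php
// An unescaped dot matches any character; an escaped dot only a literal ".".
$looseDot  = preg_match('/a.c/', 'abc') === 1;
$strictDot = preg_match('/a\.c/', 'abc') === 1;
$literal   = preg_match('/a\.c/', 'a.c') === 1;

// With "/" as the delimiter, a literal forward slash is escaped as well.
$slash = preg_match('/\/abc/', 'ae/abcde') === 1;

// preg_quote escapes metacharacters; its second argument names the
// delimiter so that it gets escaped too.
$quoted     = preg_quote('/abc', '/');
$viaQuoting = preg_match('/' . $quoted . '/', '/abcde') === 1;

var_dump($looseDot, $strictDot, $literal, $slash, $quoted, $viaQuoting);
```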
Regular expressions in PHP - Example of a real-life PHP regular expression
Let's try to build a regular expression that matches any string containing a URL of the form:
http://www.xxx...xxx.yyy OR
http://www.xxx...xxx.yyy/xxx...xxx/...
i.e. a URL pointing to a folder (not directly to a page) and which does not end with a forward slash. It is understood that the domain name must contain at most 20 characters, and that each substring of characters (except for the forward slash) making up the path name must also be at most 20 characters. The depth of the path (as measured by the number of forward slashes in the path) measures how deep in the public folder the URL is pointing.
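Proposed solution: "http://www\.[a-z0-9]{1,20}\.[a-z]{1,20}(/[a-z]{1,20}){0,20}(\s|$)". The quantifier {0,20} makes the path optional, so that both forms above are covered (the cap of 20 path segments is an arbitrary choice), and the final (\s|$) requires the URL to be followed by whitespace or by the end of the string, which rules out a trailing forward slash. Because this expression contains unescaped forward slashes, it must be enclosed in another delimiter, such as #, when passed to preg_match.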
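As a rough sketch, a variant of the proposed expression can be tried with preg_match. Two adjustments are assumptions made here, not requirements stated above: the path group is repeated {0,20} times so that a URL without a path is accepted, and the ending is (\s|$) so that a URL at the very end of the string is accepted too. The # delimiter avoids having to escape every forward slash, and the example.com URLs are illustrative.

```php
<?php
// "#" is used as the delimiter because the pattern contains forward slashes.
$pattern = '#http://www\.[a-z0-9]{1,20}\.[a-z]{1,20}(/[a-z]{1,20}){0,20}(\s|$)#';

$subjects = [
    'http://www.example.com',                          // no path
    'see http://www.example.com/docs/php for details', // path, then a space
    'http://www.example.com/docs/',                    // trailing slash
    'http://example.com',                              // missing "www."
];

$results = [];
foreach ($subjects as $subject) {
    $results[] = preg_match($pattern, $subject) === 1;
}

var_dump($results);
```

The first two subjects should match and the last two should not, since a trailing forward slash is followed neither by whitespace nor by the end of the string.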
