How to use Python RegEx
After the built-in package ‘re’ has been imported, regular expressions, or RegEx for short, can be used to search for certain patterns within strings.
What is Python RegEx?
RegEx, or regular expressions, are strings that are used to define specific search patterns. Once a search pattern has been defined, you can check Python strings for the specified pattern. Python RegEx has its own syntax and semantics.
Many Python tutorials do not cover advanced programming constructs like regular expressions in detail.
Python RegEx applications
Regular expressions are often used to check user input that needs to fit a specific format.
An example you might be familiar with is when a user needs to create a password that contains at least one uppercase letter and one number. Python RegEx can be used to check the input against these rules.
Regular expressions are also used for online forms to check if user input is valid. For example, they can check whether or not a user has entered a valid email address format when filling out a form or registering for a website.
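As a minimal sketch of these two checks, the snippet below validates a password and an email-like string. The function names `is_valid_password` and `looks_like_email` are hypothetical, and the email pattern is deliberately simplified for illustration; it is not a complete email validator. It also uses `re.fullmatch()`, which requires the whole string to match the pattern.

```python
import re

def is_valid_password(password):
    """Return True if the password has at least one uppercase letter and one digit."""
    has_upper = re.search(r"[A-Z]", password) is not None
    has_digit = re.search(r"[0-9]", password) is not None
    return has_upper and has_digit

def looks_like_email(text):
    """Rough format check (something@something.tld); simplified on purpose."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text) is not None

print(is_valid_password("Secret1"))          # True
print(is_valid_password("secret"))           # False
print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user@example"))      # False
```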
If you’re working on a web project with Python, regular expressions can help you in many areas of your project.
What are the semantics and syntax for Python RegEx?
Metacharacters are used in regular expressions. Each of these characters has a specific meaning and distinct function within the context of Python RegEx. The following table gives an overview of the most important metacharacters and their meaning along with an example:
Characters | Description | Example |
---|---|---|
. | Stands for any character except newline | ‘he..o’ -> finds all strings that start with ‘he’ followed by any two characters and then followed by ‘o’, e.g. ‘hello’ |
[] | Finds all letters specified between the brackets | ‘[a-e]’ -> finds all lowercase letters between a and e |
^ | Checks if a string starts with a specified character or string | ‘^hello’ -> checks if string starts with ‘hello’ |
$ | Checks if a string ends with a specified character or string | ‘world$’ -> checks if string ends with ‘world’ |
* | Zero or more occurrences of one character | ‘a*’ -> matches any number of a’s as well as no a’s at all |
+ | One or more occurrences of a character | ‘a+’ -> matches at least one occurrence of a |
? | One or no occurrence of a character | ‘a?’ -> matches exactly one a or none |
{} | Checks if a character occurs as often as specified in the curly braces | ‘hel{2}o’ -> matches the string ‘hello’ |
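To see the metacharacters from the table in action, you could try a short sketch like the following. The sample strings and variable names are made up for illustration; the patterns are written as raw strings (`r"..."`) so that Python passes them to the re module unchanged.

```python
import re

# '.' stands for any single character except a newline
dot_matches = re.findall(r"he..o", "hello heyyo hero")
print(dot_matches)  # ['hello', 'heyyo']

# '^' anchors the pattern at the start of the string, '$' at the end
starts_with_hello = re.search(r"^hello", "hello world") is not None
ends_with_world = re.search(r"world$", "hello world") is not None
print(starts_with_hello, ends_with_world)  # True True

# '+' needs at least one 'a'; '?' makes the 'u' optional
plus_matches = re.findall(r"a+", "caaat bat ct")
optional_matches = re.findall(r"colou?r", "color colour")
print(plus_matches)      # ['aaa', 'a']
print(optional_matches)  # ['color', 'colour']

# '{2}' requires exactly two consecutive 'l' characters
brace_matches = re.findall(r"hel{2}o", "helo hello helllo")
print(brace_matches)  # ['hello']
```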
Sets
Sets are RegEx patterns that start and end with a square bracket. These are very important for Python RegEx. The table above shows an example of a set that finds all lowercase letters between a and e. Below is an overview of sets in Python RegEx:
Set | Description |
---|---|
[abc] | Matches if one of the characters specified between the brackets (i.e. a, b, or c) occurs within a string |
[^abc] | Matches for all characters not specified inside the brackets |
[a-z] | Matches for all lowercase letters between a and z |
[a-zA-Z] | Matches for all letters between a and z in both upper- and lowercase |
[0-9] | Matches for any number between 0 and 9 |
[1-5][0-9] | Matches for all two-digit numbers between 10 and 59 |
As you can see, sets are a powerful tool for various regular expressions. However, when using sets, keep in mind that most of the metacharacters presented in the first table do not carry a special meaning when placed inside square brackets. So, for example, the set [+] would match every + in a string. The exceptions are ^ at the start of a set, which negates it, and - between two characters, which defines a range.
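The sets from the table can be tried out with a few `re.findall()` calls. This is only a sketch with invented sample strings and variable names:

```python
import re

# Any one of the listed characters
listed = re.findall(r"[abc]", "abcdef cab")
print(listed)  # ['a', 'b', 'c', 'c', 'a', 'b']

# '^' at the start of a set negates it
negated = re.findall(r"[^abc]", "abcd e")
print(negated)  # ['d', ' ', 'e']

# Two sets in a row: first digit 1-5, second digit 0-9
two_digit = re.findall(r"[1-5][0-9]", "7 10 42 59 60 99")
print(two_digit)  # ['10', '42', '59']

# Inside a set, '+' is a literal plus sign, not a quantifier
plus_signs = re.findall(r"[+]", "1+1=2, 2+2=4")
print(plus_signs)  # ['+', '+']
```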
Sequences
In addition to metacharacters, there are also special, predefined sequences for creating precise search patterns in Python RegEx.
Sequence | Description | Example |
---|---|---|
\A | Matches if the specified string is found at the beginning of a string | ‘\AMonday’ |
\b | Matches if the specified string is found at the beginning or at the end of a word | ‘\bes’ |
\B | Matches if the specified string is not found at the beginning or end of a word (opposite of \b) | ‘\Bes’ |
\d | Matches every digit between 0 and 9 (equivalent to [0-9]) | ‘123’ |
\D | Matches all characters that are not digits (equivalent to [^0-9]) | ‘123acb&’ |
\s | Matches every whitespace character, such as a space, tab, or newline | ‘Python RegEx’ |
\S | Matches every character that is not whitespace (opposite of \s) | ‘Python RegEx’ |
\w | Matches all alphanumeric characters and the underscore | ‘1abc$%3’ |
\W | Matches for all characters that are not alphanumeric characters (opposite of \w) | ‘1abc$%3’ |
\Z | Matches if the specified string is at the end of a string | ‘Python\Z’ |
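The following sketch applies several of these sequences to one sample string. The string and the variable names are assumptions chosen for illustration. Note the raw strings (`r"..."`): without the `r` prefix, Python itself may try to interpret the backslash before the re module ever sees it.

```python
import re

text = "Monday: 3 tests passed"

# \A and \Z anchor the pattern at the very start and end of the string
starts_with_monday = re.search(r"\AMonday", text) is not None
ends_with_passed = re.search(r"passed\Z", text) is not None
print(starts_with_monday, ends_with_passed)  # True True

# \d matches digits, \s matches whitespace characters
digits = re.findall(r"\d", text)
spaces = re.findall(r"\s", text)
print(digits)  # ['3']
print(spaces)  # [' ', ' ', ' ']

# \w+ collects runs of word characters, \W finds everything else
words = re.findall(r"\w+", text)
non_words = re.findall(r"\W", text)
print(words)      # ['Monday', '3', 'tests', 'passed']
print(non_words)  # [':', ' ', ' ', ' ']

# \b marks a word boundary: words that begin with 't'
t_words = re.findall(r"\bt\w*", text)
print(t_words)  # ['tests']
```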
What functions can I use for Python RegEx?
Several predefined functions will assist you when using RegEx in Python. These functions are located in a Python module called ‘re’. You’ll need to import these before you can start working with regular expressions:
import re
re.findall()
The findall() function is probably the most commonly used function when working with Python RegEx. It takes a search pattern and a Python string and returns a Python list. The list consists of strings containing all matches in the order that they were found. The findall() call will return an empty list if no match is found.
The following code example illustrates this function:
import re
string = "python 3.0"
regex = r"\D"  # raw string, so the backslash is passed to the re module unchanged
result = re.findall(regex, string)
print(result)
Notice that in the code snippet above the re module is imported first. The ‘string’ variable is then used to store the string ‘python 3.0’. The search pattern stored in the ‘regex’ variable is in the sequence table and matches all characters that are not digits. The findall() function carries out the matching. It takes the search pattern as an argument and examines the string. The list returned by the function is stored in the ‘result’ variable and is output to the screen with a call to Python print. The output looks like this:
['p', 'y', 't', 'h', 'o', 'n', ' ', '.']
The list contains every character from the string except the digits. Keep in mind that the space character counts as a separate character and as such appears in the list.
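Two further cases are worth trying: a pattern that matches more than one character at a time, and a search that finds nothing. The sketch below uses sample strings and variable names that are invented for illustration.

```python
import re

text = "python 3.0 was released in 2008"

# \d+ matches whole runs of digits instead of single characters
numbers = re.findall(r"\d+", text)
print(numbers)  # ['3', '0', '2008']

# If nothing matches, findall() returns an empty list rather than None
no_numbers = re.findall(r"\d+", "no digits here")
print(no_numbers)  # []
```

Because the result is always a list, you can loop over it or check its length without first testing for None.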
re.sub()
The sub() function replaces all matches with a text of your choice. Like findall(), this function takes a regular expression as the first parameter. In the second parameter, you need to pass the text that you want to replace the matches with. The function’s third parameter is the string that you want to search in. If you only want to replace a certain number of matches, you can specify a number as the fourth parameter, count. This indicates how many matches should be replaced starting with the first match.
The following code example will help to clarify how this works:
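import re
string = "python is a great programming language"
regex = r"\s"  # raw string: matches any whitespace character
# Replace every whitespace character with "0"
result1 = re.sub(regex, "0", string)
print(result1)
# Replace only the first two matches; passing count by keyword
# avoids the deprecation warning that newer Python versions emit
result2 = re.sub(regex, "0", string, count=2)
print(result2)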
As you can see, re is imported first and a string is stored in the variable ‘string’. The search pattern should match all spaces in the string.
This is followed by two similar calls to sub(). The first function call replaces every space in the passed string with a 0 and stores the result in the variable ‘result1’. The second function call limits the number of replacements using the fourth parameter, which is optional. Only the first two spaces in the passed string are replaced with a 0, and the result is stored in the variable ‘result2’.
The code’s output will look like this:
python0is0a0great0programming0language
python0is0a great programming language
re.split()
The split() function from the re module is similar to the built-in Python split() function, with both allowing you to split a string into a list. In this function, the first parameter is a search pattern, and the second parameter contains the string that should be split. The string is split at every position where the regular expression matches.
If you want to split a string a certain number of times, you can pass a number in the third parameter. This will determine the maximum number of splits. The third parameter, however, is optional. Here’s an example of how this works:
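import re
string = "python is a great programming language"
regex = r"\s"  # raw string: matches any whitespace character
# Split at every whitespace character
result1 = re.split(regex, string)
print(result1)
# Split only once; passing maxsplit by keyword avoids the
# deprecation warning that newer Python versions emit
result2 = re.split(regex, string, maxsplit=1)
print(result2)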
Most of the code in this example is similar to the previous example. The split() function call is the only difference. The split() function is called on the string and splits it every time a space occurs. The resulting list is assigned to the variable ‘result1’. The second split() call limits the number of splits to 1 by specifying the optional third parameter. It assigns the result to the variable named ‘result2’. The results are as follows when the program is executed:
['python', 'is', 'a', 'great', 'programming', 'language']
['python', 'is a great programming language']
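The difference from the built-in split() becomes clearer when the separators vary. The following sketch is an illustration with an invented sample string and variable names: it splits on any run of commas, semicolons, or whitespace, which the built-in method cannot do in a single call.

```python
import re

text = "python, is; a great language"

# Built-in split() only separates on whitespace here,
# so the punctuation stays attached to the words
builtin_result = text.split()
print(builtin_result)  # ['python,', 'is;', 'a', 'great', 'language']

# re.split() can treat a whole set of characters as the separator
regex_result = re.split(r"[,;\s]+", text)
print(regex_result)  # ['python', 'is', 'a', 'great', 'language']

# maxsplit limits the number of splits; the rest stays in one piece
limited_result = re.split(r"[,;\s]+", text, maxsplit=2)
print(limited_result)  # ['python', 'is', 'a great language']
```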
re.search()
The search() function searches a string for a match. It takes the regular expression as the first parameter and the string you want to examine as the second parameter. It then returns a Python match object describing the first match found. If no match is found, the function returns the value ‘None’. To better understand how the function works, take a look at the example below:
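import re
string = "python is a great programming language"
regex = r"\s"  # raw string: matches any whitespace character
match = re.search(regex, string)
# search() returns a match object or None, so it can be tested directly
if match:
    print("RegEx was found.")
else:
    print("RegEx was not found.")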
The search() function is called with a regular expression that searches for spaces, and a string. The match object returned by the function call is stored in the ‘match’ variable. The Python if-else statement is used to help illustrate this. If a match is found, the ‘match’ variable holds a match object instead of None, and the if-path is chosen. The program prints the following output:
RegEx was found.
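Because search() returns None when nothing matches, it is a good idea to check the result before using it. The sketch below wraps this check in a small helper; the function name `first_number` is hypothetical, and it uses the match object's group() method, which returns the matched text.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    # Only call group() when a match object was actually returned
    return match.group() if match else None

print(first_number("python 3.0"))  # 3
print(first_number("python"))      # None
```

Calling a method such as group() directly on the result of a failed search would raise an AttributeError, which is why the helper tests the match first.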
What is the match object?
The match object is returned by a search() call and contains information about the match that was found. You can access this information with various methods and attributes:
- object.start() returns the index of the first character of the Python substring that matches your search pattern.
- object.end() returns the index immediately after the last character of the match.
- object.span() combines start() and end(). The function returns a Python tuple containing the substring’s start and end index.
- object.string returns the string that you passed to search().
- object.re returns the Python RegEx that you passed to search().
You can get a better idea of these functions by adding the function calls to the last code example:
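import re
string = "python is a great programming language"
regex = r"\s"  # raw string: matches any whitespace character
match = re.search(regex, string)
if match:
    print("RegEx was found.")
else:
    print("RegEx was not found.")
print(match.start())  # index where the match begins
print(match.end())    # index just past the end of the match
print(match.span())   # (start, end) as a tuple
print(match.string)   # the string that was searched
print(match.re)       # re is an attribute, not a method, so no parentheses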
The output looks like this:
RegEx was found.
6
7
(6, 7)
python is a great programming language
re.compile('\\s')
The string ‘RegEx was found.’ is output first. This is because search() returned a match object rather than None, which makes the if-condition true. The first match’s index is then displayed. The value may have been easy to guess since the first blank has the index ‘6’. The value ‘7’, which is output by calling the end() function, is the index immediately after the match, so for this single-character match it is one higher than the start index. The tuple ‘(6, 7)’ unites the calls to start() and end() by specifying both indices at the same time. The string passed to search() is also output as expected.
But what about the output ‘re.compile('\\s')’? This is a Python RegEx object. It is created when the string that you passed as a regular expression is compiled into a pattern. You can display the RegEx object using the re attribute of your match object.
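You can also create such a RegEx object yourself with re.compile() and reuse it for several operations. The following is a minimal sketch; the variable names and the sample string are chosen for illustration only.

```python
import re

# Compile the pattern once and reuse the resulting RegEx object
whitespace = re.compile(r"\s")

match = whitespace.search("python is great")
print(match.span())  # (6, 7)

# The match object refers back to the pattern that produced it
same_object = match.re is whitespace
print(same_object)  # True

# The RegEx object offers the same operations as the module-level functions
parts = whitespace.split("python is great")
replaced = whitespace.sub("_", "python is great")
print(parts)     # ['python', 'is', 'great']
print(replaced)  # python_is_great
```

Compiling a pattern is mainly a convenience when the same expression is used in many places, since it gives the pattern a name and keeps it in one spot.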