Python Mastery: From Beginner to Expert - Sykalo Eugene 2023
Regular expressions
Additional language concepts
Introduction to Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. Regular expressions are available in many programming languages, including Python, and let you search for patterns in text, replace text, and extract information from it. They are a powerful tool that can save time and effort when working with text data.
In Python, regular expressions are supported by the `re` module. It provides functions such as `search`, `match`, `findall`, and `sub`, which allow you to search for patterns in text, replace text, and extract information from text.
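As a rough illustration of how these four functions differ, here is a minimal sketch. The sample string is made up for this example and is not from the original text.

```python
import re

# Hypothetical sample text, invented for this sketch
text = "Order 66 shipped on 2023-05-04; order 67 is pending."

# re.search scans the whole string and returns the first match (or None)
first_number = re.search(r"\d+", text)

# re.match only succeeds if the pattern matches at the very start of the string
starts_with_digit = re.match(r"\d+", text)

# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", text)

# re.sub replaces every match with the replacement text
masked = re.sub(r"\d", "#", text)

print(first_number.group())  # 66
print(starts_with_digit)     # None, because the text starts with a letter
print(all_numbers)           # ['66', '2023', '05', '04', '67']
print(masked)                # Order ## shipped on ####-##-##; order ## is pending.
```

Note that `search` and `match` return a match object (or `None`), while `findall` returns plain strings and `sub` returns a new string.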
Regular expressions can be complex, but they are also very flexible. They allow you to define patterns that can match a wide range of text. For example, you can use regular expressions to match email addresses, phone numbers, and even URLs. Regular expressions are also useful for data cleaning and data preprocessing tasks.
Basic Syntax and Operators
Regular expressions in Python are built from ordinary characters and operators. Ordinary characters match themselves in the text, while operators change how the surrounding characters are matched. Here are some of the basic operators used in regular expressions in Python:
- `.` : Matches any character except a newline character
- `^` : Matches the beginning of a string
- `$` : Matches the end of a string
- `[]` : Matches any single character inside the brackets
- `[^]` : Matches any single character not inside the brackets
- `*` : Matches zero or more occurrences of the preceding character
- `+` : Matches one or more occurrences of the preceding character
- `?` : Matches zero or one occurrence of the preceding character
- `{m}` : Matches exactly m occurrences of the preceding character
- `{m,n}` : Matches between m and n occurrences of the preceding character
- `()` : Groups characters together and creates a capture group
- `|` : Matches either the expression before or the expression after the operator
Here are some examples of how to use these operators in regular expressions:
- `.` : The regular expression `a.b` matches an `a`, followed by any single character, followed by a `b`.
- `^` : The regular expression `^a` matches an `a` at the start of a string.
- `$` : The regular expression `a$` matches an `a` at the end of a string.
- `[]` : The regular expression `[aeiou]` matches any lowercase vowel.
- `[^]` : The regular expression `[^aeiou]` matches any character that is not a lowercase vowel. This includes consonants, but also digits, spaces, and punctuation.
- `*` : The regular expression `ab*` matches `a`, followed by zero or more `b` characters.
- `+` : The regular expression `ab+` matches `a`, followed by one or more `b` characters.
- `?` : The regular expression `ab?` matches `a`, followed by zero or one `b` character.
- `{m}` : The regular expression `ab{3}` matches `a`, followed by exactly three `b` characters.
- `{m,n}` : The regular expression `ab{2,4}` matches `a`, followed by between two and four `b` characters.
- `()` : The regular expression `(ab)+` matches one or more occurrences of `ab`.
- `|` : The regular expression `a|b` matches either `a` or `b`.
These are only some of the basic operators used in regular expressions in Python. Many others are available for matching more complex patterns in text.
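The basic operators can be tried out directly with the `re` module. The following is a small sketch; the sample strings are invented for illustration.

```python
import re

# "." matches any single character except a newline
dot = re.findall(r"a.b", "acb a-b ab")          # ['acb', 'a-b']

# "^" and "$" anchor the pattern to the start and end of the string
starts_with_a = bool(re.search(r"^a", "apple"))  # True
ends_with_a = bool(re.search(r"a$", "apple"))    # False

# "[...]" matches one character from the set, "[^...]" one character outside it
vowels = re.findall(r"[aeiou]", "regex")         # ['e', 'e']
not_vowels = re.findall(r"[^aeiou]", "a1 b")     # ['1', ' ', 'b']

# "*", "+" and "?" control how many b characters may follow the a
star = re.findall(r"ab*", "a ab abbb")           # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")           # ['ab', 'abbb']
optional = re.findall(r"ab?", "a ab abbb")       # ['a', 'ab', 'ab']

# "(...)" groups characters so that "+" applies to the whole group
grouped = re.search(r"(ab)+", "xxababab").group()  # 'ababab'

# "|" matches either alternative
either = re.findall(r"a|b", "cab")               # ['a', 'b']

print(dot, vowels, not_vowels, star, plus, optional, grouped, either)
```

The `[^aeiou]` result shows that a negated class matches digits and spaces as well as consonants.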
Anchors and Boundaries
Anchors and boundaries are special characters that allow you to match patterns at specific positions in the text. Anchors and boundaries do not match any characters themselves, but they modify the way that other characters are matched.
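Here are some of the common anchors and boundaries used in regular expressions in Python:
- `^` : Matches the beginning of a string (or of each line when the `re.MULTILINE` flag is set)
- `$` : Matches the end of a string (or of each line when the `re.MULTILINE` flag is set)
- `\b` : Matches a word boundary
- `\B` : Matches a position that is not a word boundary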
Here are some examples of how to use these anchors and boundaries in regular expressions:
- `^` : The regular expression `^abc` matches `abc` at the start of a string.
- `$` : The regular expression `abc$` matches `abc` at the end of a string.
- `\b` : The regular expression `\bthe\b` matches the word `the` (and only `the`) in a string. It does not match the `the` inside `other` or `theme`.
- `\B` : The regular expression `\Bthe\B` matches `the` only when it has word characters on both sides, as in `other`. It does not match the standalone word `the`.
Anchors and boundaries are useful for matching patterns at specific positions in the text. For example, you can use the `^` and `$` anchors to match entire strings or lines, or you can use the `\b` and `\B` boundaries to match specific words or parts of words.
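To see where each anchor or boundary actually matches, the sketch below records the start position of every match. The two-line sample string is made up for this example.

```python
import re

# Hypothetical two-line sample text, invented for this sketch
text = "the other theme\nthe end"

def starts(pattern, flags=0):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# \bthe\b matches only the standalone word "the" (once on each line)
whole_word = starts(r"\bthe\b")

# \Bthe\B matches "the" only inside a longer word (here, in "other")
inside_word = starts(r"\Bthe\B")

# Without flags, ^ matches only at the very start of the string
string_start = starts(r"^the")

# With re.MULTILINE, ^ also matches at the start of each line
line_starts = starts(r"^the", re.MULTILINE)

print(whole_word)    # [0, 16]
print(inside_word)   # [5]
print(string_start)  # [0]
print(line_starts)   # [0, 16]
```

Note that the `the` in `theme` matches neither boundary pattern: it begins at a word boundary but does not end at one.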
Character Classes
Character classes allow you to match sets of characters in a text. They are defined using the `[]` notation and match any single character inside the brackets. Here are some examples of character classes:
- `[aeiou]` : Matches any lowercase vowel
- `[0-9]` : Matches any digit
- `[A-Z]` : Matches any uppercase letter
- `[a-z]` : Matches any lowercase letter
- `[A-Za-z]` : Matches any letter
- `[^aeiou]` : Matches any character that is not a lowercase vowel
You can also combine character classes with other operators to match more complex patterns. For example, the regular expression `[a-z]+` matches one or more lowercase letters.
Character classes are useful for matching specific types of characters in text. For example, you can use character classes to match phone numbers, email addresses, or other types of structured data.
In addition to predefined shorthand classes such as `\d` (any digit), you can define your own character classes by listing the characters you want inside `[]`. For example, the regular expression `[abc]` matches a single character that is an `a`, `b`, or `c`.
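A short sketch of these character classes in use follows. The sample strings are assumptions made for this example.

```python
import re

# Hypothetical sample text, invented for this sketch
text = "Call 555-0199 or email Bob at bob@example.com"

# [0-9]+ matches runs of one or more digits
digits = re.findall(r"[0-9]+", text)      # ['555', '0199']

# [A-Z] matches a single uppercase letter
capitals = re.findall(r"[A-Z]", text)     # ['C', 'B']

# [A-Za-z]+ matches runs of letters, splitting on digits and punctuation
words = re.findall(r"[A-Za-z]+", text)

# [a-z]+ skips uppercase letters, so capitalised words lose their first letter
lower_runs = re.findall(r"[a-z]+", "Hello World")  # ['ello', 'orld']

# [abc] matches one character at a time, not a whole string
abc = re.findall(r"[abc]", "cabbage")     # ['c', 'a', 'b', 'b', 'a']

print(digits, capitals, words, lower_runs, abc)
```

The `[abc]` result shows that a character class always consumes exactly one character per match unless a quantifier follows it.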
Quantifiers
Quantifiers allow you to match multiple occurrences of a pattern in a text. They modify the behavior of the preceding character or group in the regular expression. Here are some of the common quantifiers used in regular expressions in Python:
- `*` : Matches zero or more occurrences of the preceding character or group.
- `+` : Matches one or more occurrences of the preceding character or group.
- `?` : Matches zero or one occurrence of the preceding character or group.
- `{m}` : Matches exactly m occurrences of the preceding character or group.
- `{m,n}` : Matches between m and n occurrences of the preceding character or group.
- `{m,}` : Matches at least m occurrences of the preceding character or group.
Here are some examples of how to use these quantifiers in regular expressions:
- `*` : The regular expression `ab*` matches `a`, followed by zero or more `b` characters.
- `+` : The regular expression `ab+` matches `a`, followed by one or more `b` characters.
- `?` : The regular expression `ab?` matches `a`, followed by zero or one `b` character.
- `{m}` : The regular expression `ab{3}` matches `a`, followed by exactly three `b` characters.
- `{m,n}` : The regular expression `ab{2,4}` matches `a`, followed by between two and four `b` characters.
- `{m,}` : The regular expression `ab{2,}` matches `a`, followed by at least two `b` characters.
Quantifiers allow you to match patterns that occur multiple times in a text. For example, you can use quantifiers to match repeated characters, words, or other patterns.
A quantifier normally applies only to the single character before it. To repeat a longer sequence, wrap it in parentheses `()` so that the quantifier applies to the whole group. For example, the regular expression `(ab)+` matches one or more occurrences of `ab`.
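One way to compare the quantifiers is to test each pattern against the same set of candidate strings. The sketch below uses `re.fullmatch`, which succeeds only if the entire string matches the pattern. The candidate list and the helper `full_matches` are hypothetical names introduced for this example.

```python
import re

# Hypothetical candidates: an "a" followed by 0, 1, 2, 3 and 5 "b" characters
candidates = ["a", "ab", "abb", "abbb", "abbbbb"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"ab*")             # all five candidates
plus = full_matches(r"ab+")             # everything except "a"
optional = full_matches(r"ab?")         # ['a', 'ab']
exactly_three = full_matches(r"ab{3}")  # ['abbb']
two_to_four = full_matches(r"ab{2,4}")  # ['abb', 'abbb']
at_least_two = full_matches(r"ab{2,}")  # ['abb', 'abbb', 'abbbbb']

# With parentheses, "+" repeats the whole group "ab" rather than just "b"
group_repeat = re.fullmatch(r"(ab)+", "ababab") is not None  # True
char_repeat = re.fullmatch(r"(ab)+", "abb") is not None      # False

print(star, plus, optional, exactly_three, two_to_four, at_least_two)
print(group_repeat, char_repeat)
```

Using `fullmatch` rather than `search` keeps the comparison strict: with `search`, `ab{3}` would also find a match inside `abbbbb`.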
Grouping and Backreferences
Grouping and backreferences allow you to match and extract subpatterns within a regular expression. You can use parentheses `()` to group characters together and create a capture group. The contents of the capture group can then be referenced later in the regular expression using a backreference.
Here are some examples of how to use grouping and backreferences in regular expressions in Python:
- `()` : The regular expression `(ab)+` matches one or more occurrences of `ab`.
- `\1` : The backreference `\1` matches the contents of the first capture group.
Here is an example of how to use grouping and backreferences to extract information from a text:
    import re

    text = "John Smith (123) 456-7890"
    # Four capture groups: first name, last name, area code, local number.
    # \( and \) match literal parentheses around the area code.
    pattern = r"([A-Za-z]+) ([A-Za-z]+) \((\d{3})\) (\d{3}-\d{4})"
    match = re.search(pattern, text)
    if match:
        # Groups are numbered from 1 in the order their "(" appears
        first_name = match.group(1)
        last_name = match.group(2)
        area_code = match.group(3)
        phone_number = match.group(4)
        print("First Name:", first_name)
        print("Last Name:", last_name)
        print("Area Code:", area_code)
        print("Phone Number:", phone_number)
In this example, the regular expression `([A-Za-z]+) ([A-Za-z]+) \((\d{3})\) (\d{3}-\d{4})` matches a first name, last name, area code, and phone number in a text string. The four capture groups hold the first name, last name, area code, and local number, in that order, and are read back with `match.group(1)` through `match.group(4)`. The escaped parentheses `\(` and `\)` match literal parentheses rather than creating a group. This pattern does not itself contain a backreference. A backreference such as `\1` appears inside a pattern to require that the text captured by group 1 occurs again.
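A backreference can be sketched with a common use case: finding a word that is accidentally repeated. This example is not from the original text, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample sentence with two accidentally doubled words
text = "This is is a test of the the backreference feature."

# (\w+) captures a word; \1 then requires exactly the same text again.
# The \b boundaries stop the pattern from matching parts of longer words.
doubled = re.compile(r"\b(\w+) \1\b")

# findall returns the contents of the capture group for each match
repeated_words = doubled.findall(text)

# In the replacement string, \1 inserts the captured word once
cleaned = doubled.sub(r"\1", text)

print(repeated_words)  # ['is', 'the']
print(cleaned)         # This is a test of the backreference feature.
```

Here `\1` is used in two places: inside the pattern, to match the repeated word, and inside the replacement passed to `sub`, to keep a single copy of it.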
Grouping and backreferences are useful for extracting specific information from text. You can use them to pull fields out of unstructured text such as emails or social media posts, or out of simple, predictable fragments of structured documents such as HTML, XML, or JSON, although a dedicated parser is usually more robust for those formats.