Regular expressions

Python Mastery: From Beginner to Expert - Sykalo Eugene 2023

Introduction to Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. Regular expressions are available in many programming languages, including Python, and let you search text for a pattern, replace the matching text, and extract information from it. They are a powerful tool that can save time and effort when working with text data.

In Python, regular expressions are supported by the built-in re module. Its most commonly used functions are search, which finds the first match anywhere in a string; match, which matches only at the beginning of a string; findall, which returns all non-overlapping matches; and sub, which replaces matches with new text.
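As a minimal sketch of the four functions named above, the following applies each of them to the same string. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2023-05-04; order 67 is pending."

# search: first match anywhere in the string
first_number = re.search(r"\d+", text).group()

# match: succeeds only if the pattern matches at the start of the string
digits_at_start = re.match(r"\d+", text)          # the text starts with "Order"
word_at_start = re.match(r"Order", text).group()

# findall: every non-overlapping match, as a list of strings
all_numbers = re.findall(r"\d+", text)

# sub: replace every match with new text
masked = re.sub(r"\d+", "#", text)

print(first_number)     # 66
print(digits_at_start)  # None
print(word_at_start)    # Order
print(all_numbers)      # ['66', '2023', '05', '04', '67']
print(masked)           # Order # shipped on #-#-#; order # is pending.
```

Note the difference between search and match: both look for a single match, but match gives up unless the pattern fits at the very first character.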

Regular expressions can be complex, but they are also very flexible. They allow you to define patterns that can match a wide range of text. For example, you can use regular expressions to match email addresses, phone numbers, and even URLs. Regular expressions are also useful for data cleaning and data preprocessing tasks.

Basic Syntax and Operators

Regular expressions in Python are defined using a combination of characters and operators. The characters in a regular expression match themselves in the text, while the operators modify how the characters are matched. Here are some of the basic syntax and operators used in regular expressions in Python:

  • . : Matches any character except a newline character
  • ^ : Matches the beginning of a string
  • $ : Matches the end of a string
  • [] : Matches any character inside the brackets
  • [^] : Matches any character not inside the brackets
  • * : Matches zero or more occurrences of the preceding character
  • + : Matches one or more occurrences of the preceding character
  • ? : Matches zero or one occurrence of the preceding character
  • {m} : Matches exactly m occurrences of the preceding character
  • {m,n} : Matches between m and n occurrences of the preceding character
  • () : Groups characters together and creates a capture group
  • | : Matches either the expression before or after the operator

Here are some examples of how to use these syntax and operators in regular expressions:

  • . : The regular expression a.b matches any string that contains an a, followed by any character, followed by a b.
  • ^ : The regular expression ^a matches any string that starts with an a.
  • $ : The regular expression a$ matches any string that ends with an a.
  • [] : The regular expression [aeiou] matches any vowel.
  • [^] : The regular expression [^aeiou] matches any single character that is not a lowercase vowel, including consonants, digits, spaces, and punctuation.
  • * : The regular expression ab* matches a, followed by zero or more b characters.
  • + : The regular expression ab+ matches a, followed by one or more b characters.
  • ? : The regular expression ab? matches a, followed by zero or one b character.
  • {m} : The regular expression ab{3} matches a, followed by three b characters.
  • {m,n} : The regular expression ab{2,4} matches a, followed by between two and four b characters.
  • () : The regular expression (ab)+ matches one or more occurrences of ab.
  • | : The regular expression a|b matches either a or b.
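The examples above can be checked with re.search, which reports whether a pattern matches anywhere in a string. This is a small sketch; the sample strings and the checks table are chosen here for illustration.

```python
import re

# (pattern, sample text, whether a match is expected)
checks = [
    (r"a.b", "axb", True),
    (r"a.b", "ab", False),      # no character between a and b
    (r"^a", "apple", True),
    (r"^a", "banana", False),   # contains a, but does not start with it
    (r"a$", "banana", True),
    (r"ab*", "a", True),        # zero b characters is allowed
    (r"ab+", "a", False),       # at least one b is required
    (r"ab{3}", "abb", False),   # exactly three b characters are required
    (r"(ab)+", "abab", True),
    (r"a|b", "b", True),
]

# re.search returns a match object on success and None on failure.
results = [(p, s, re.search(p, s) is not None) for p, s, _ in checks]

for pattern, sample, found in results:
    print(f"{pattern!r} on {sample!r}: {'match' if found else 'no match'}")
```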

These are just some of the basic syntax and operators used in regular expressions in Python. Regular expressions can be very powerful, and there are many other syntax and operators available that can be used to match more complex patterns in text.

Anchors and Boundaries

Anchors and boundaries are special characters that allow you to match patterns at specific positions in the text. Anchors and boundaries do not match any characters themselves, but they modify the way that other characters are matched.

Here are some of the common anchors and boundaries used in regular expressions in Python:

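  • ^ : Matches the beginning of a string (or of each line when the re.MULTILINE flag is set)
  • $ : Matches the end of a string (or of each line when the re.MULTILINE flag is set)
  • \b : Matches a word boundary
  • \B : Matches a position that is not a word boundary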

Here are some examples of how to use these anchors and boundaries in regular expressions:

  • ^ : The regular expression ^abc matches any string that starts with abc.
  • $ : The regular expression abc$ matches any string that ends with abc.
  • \b : The regular expression \bthe\b matches the word the (and only the) in a string.
  • \B : The regular expression \Bthe\B matches the only where it has other word characters on both sides, as in bathed, and not where it stands alone as a word.

Anchors and boundaries are useful for matching patterns at specific positions in the text. For example, you can use the ^ and $ anchors to match entire strings or lines, or you can use the \b and \B boundaries to match specific words or parts of words.
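A short sketch of how these anchors and boundaries behave in practice. The sample strings are invented for illustration, and re.MULTILINE is the flag that makes ^ and $ apply to each line rather than only to the whole string.

```python
import re

text = "the other day, they bathed the cat"

# \bthe\b: 'the' as a whole word only.
whole_words = [m.start() for m in re.finditer(r"\bthe\b", text)]

# \Bthe\B: 'the' with word characters on both sides ('other', 'bathed').
inside_words = [m.start() for m in re.finditer(r"\Bthe\B", text)]

print(whole_words)   # [0, 27]
print(inside_words)  # [5, 22]

lines = "abc\nxyz abc\nabc xyz"

# Without a flag, ^ matches only at the start of the whole string.
start_of_string = re.findall(r"^abc", lines)

# With re.MULTILINE, ^ also matches at the start of each line.
start_of_lines = re.findall(r"^abc", lines, flags=re.MULTILINE)

print(start_of_string)  # ['abc']
print(start_of_lines)   # ['abc', 'abc']
```

The printed numbers are the positions where each match starts. The occurrence inside they is found by neither pattern: it begins at a word boundary but does not end at one.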

Character Classes

Character classes allow you to match sets of characters in a text. They are defined using the [] notation and match any single character inside the brackets. Here are some examples of character classes:

  • [aeiou] : Matches any vowel
  • [0-9] : Matches any digit
  • [A-Z] : Matches any uppercase letter
  • [a-z] : Matches any lowercase letter
  • [A-Za-z] : Matches any letter
  • [^aeiou] : Matches any character that is not a vowel

You can also combine character classes with other syntax and operators to match more complex patterns. For example, the regular expression [a-z]+ matches one or more lowercase letters.

Character classes are useful for matching specific types of characters in text. For example, you can use character classes to match phone numbers, email addresses, or other types of structured data.

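In addition to explicit sets and ranges, Python provides predefined shorthand classes such as \d (digits), \w (word characters), and \s (whitespace). Note that a character class always matches exactly one character: the regular expression [abc] matches a single a, b, or c, so it finds a match in any string that contains at least one of those characters.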
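The following sketch runs a few of the character classes above through re.findall. The sample strings are made up for illustration.

```python
import re

text = "Room B12, floor 3: call Ext. 4507"

# A class followed by + matches a run of characters from that class.
digit_runs = re.findall(r"[0-9]+", text)
lowercase_runs = re.findall(r"[a-z]+", text)

# A bare class matches one character at a time.
uppercase_letters = re.findall(r"[A-Z]", text)

# A negated class matches anything not listed, not just consonants.
not_vowels = re.findall(r"[^aeiou]", "a1 b!")

# [abc] matches each single a, b, or c in the string.
abc_chars = re.findall(r"[abc]", "cabbage")

print(digit_runs)         # ['12', '3', '4507']
print(lowercase_runs)     # ['oom', 'floor', 'call', 'xt']
print(uppercase_letters)  # ['R', 'B', 'E']
print(not_vowels)         # ['1', ' ', 'b', '!']
print(abc_chars)          # ['c', 'a', 'b', 'b', 'a']
```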

Quantifiers

Quantifiers allow you to match multiple occurrences of a pattern in a text. They modify the behavior of the preceding character or group in the regular expression. Here are some of the common quantifiers used in regular expressions in Python:

  • * : Matches zero or more occurrences of the preceding character or group.
  • + : Matches one or more occurrences of the preceding character or group.
  • ? : Matches zero or one occurrence of the preceding character or group.
  • {m} : Matches exactly m occurrences of the preceding character or group.
  • {m,n} : Matches between m and n occurrences of the preceding character or group.
  • {m,} : Matches at least m occurrences of the preceding character or group.

Here are some examples of how to use these quantifiers in regular expressions:

  • * : The regular expression ab* matches a, followed by zero or more b characters.
  • + : The regular expression ab+ matches a, followed by one or more b characters.
  • ? : The regular expression ab? matches a, followed by zero or one b character.
  • {m} : The regular expression ab{3} matches a, followed by three b characters.
  • {m,n} : The regular expression ab{2,4} matches a, followed by between two and four b characters.
  • {m,} : The regular expression ab{2,} matches a, followed by at least two b characters.

Quantifiers allow you to match patterns that occur multiple times in a text. For example, you can use quantifiers to match repeated characters, words, or other patterns.

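By default, a quantifier applies only to the single character before it. To repeat a longer sequence, wrap it in parentheses so that the quantifier applies to the whole group. For example, the regular expression (ab)+ matches one or more occurrences of ab, whereas ab+ matches a followed by one or more b characters.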
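To see the quantifiers side by side, this sketch tests each pattern against the same candidate strings. It uses re.fullmatch so that the whole string has to fit the pattern. The helper name fully_matching and the candidate list are hypothetical, chosen here for illustration.

```python
import re

candidates = ["a", "ab", "abb", "abbb", "abbbb", "abbbbb"]

def fully_matching(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(fully_matching(r"ab*"))      # all six candidates
print(fully_matching(r"ab+"))      # every candidate except 'a'
print(fully_matching(r"ab?"))      # ['a', 'ab']
print(fully_matching(r"ab{3}"))    # ['abbb']
print(fully_matching(r"ab{2,4}"))  # ['abb', 'abbb', 'abbbb']
print(fully_matching(r"ab{2,}"))   # ['abb', 'abbb', 'abbbb', 'abbbbb']

# Parentheses make the quantifier apply to the whole group 'ab'.
group_repeats = re.fullmatch(r"(ab)+", "ababab") is not None
char_repeats = re.fullmatch(r"ab+", "ababab") is not None
print(group_repeats)  # True
print(char_repeats)   # False
```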

Grouping and Backreferences

Grouping and backreferences allow you to match and extract subpatterns within a regular expression. You can use parentheses () to group characters together and create a capture group. The contents of the capture group can then be referenced later in the regular expression using a backreference.

Here are some examples of how to use grouping and backreferences in regular expressions in Python:

  • () : The regular expression (ab)+ matches one or more occurrences of ab.
  • \1 : The backreference \1 matches the same text that the first capture group matched. For example, (a|b)\1 matches aa or bb, but not ab.
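As a minimal sketch of a backreference inside a pattern, the following looks for a word that is immediately repeated. The sample sentence is invented for illustration.

```python
import re

text = "This is is a test of the the parser."

# (\w+) captures a word; \1 then requires the same word again after a space.
doubled_pattern = r"\b(\w+) \1\b"

# findall returns the text of group 1 for each match.
doubled_words = re.findall(doubled_pattern, text)

# In the replacement string, \1 inserts what group 1 captured,
# collapsing each doubled word into a single copy.
cleaned = re.sub(doubled_pattern, r"\1", text)

print(doubled_words)  # ['is', 'the']
print(cleaned)        # This is a test of the parser.
```

The leading \b keeps the pattern from starting in the middle of a word, so the is at the end of This is not treated as the first half of a repeat.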

Here is an example of how to use grouping and backreferences to extract information from a text:

import re

text = "John Smith (123) 456-7890"
# Four capture groups: first name, last name, area code, local number.
# \( and \) match literal parentheses; \d matches a digit.
pattern = r"([A-Za-z]+) ([A-Za-z]+) \((\d{3})\) (\d{3}-\d{4})"
match = re.search(pattern, text)

if match:
    # group(n) returns the text captured by the n-th pair of parentheses.
    first_name = match.group(1)
    last_name = match.group(2)
    area_code = match.group(3)
    phone_number = match.group(4)

    print("First Name:", first_name)
    print("Last Name:", last_name)
    print("Area Code:", area_code)
    print("Phone Number:", phone_number)

In this example, the regular expression ([A-Za-z]+) ([A-Za-z]+) \((\d{3})\) (\d{3}-\d{4}) matches a first name, last name, area code, and phone number in a text string. The four capture groups capture the first name, last name, area code, and seven-digit phone number, respectively, while \( and \) match the literal parentheses around the area code. The captured text is retrieved after the match with match.group(1) through match.group(4); the pattern itself does not use any backreferences.

Grouping and backreferences are useful for extracting specific information from text. You can use them to extract data from structured documents such as HTML, XML, or JSON, or to extract information from unstructured text data such as emails or social media posts.