Regular Expression Metacharacters and Function Parameters - Oracle PL/SQL Programming (2014)

Oracle PL/SQL Programming (2014)

Appendix A. Regular Expression Metacharacters and Function Parameters

This appendix describes the various regular expression metacharacters available starting with Oracle Database 10g. It also provides a summary of the syntax of the REGEXP_ functions. For more details on Oracle’s regular expression support, see Chapter 8.

Metacharacters

The “Initial release” column in Table A-1 through Table A-3 indicates which metacharacters were introduced in Oracle Database 10g Release 1 and which in Release 2.

Table A-1. Character-matching metacharacters

Syntax

Initial release

Description

.

10gR1

Matches any single character except for newline. Will match newline when the n flag is set. On Windows, Linux, and Unix platforms, chr(10) is recognized as the newline.

[ ... ]

10gR1

Defines a matching list that matches any character listed between the brackets. You may specify ranges of characters, as in a–z. These ranges are interpreted based on the NLS_SORT setting.

A dash (-) is a literal when it occurs first or last in the list (e.g., [abc-]). A closing bracket (]) is a literal when it occurs first in the list (e.g., []abc]). A caret (^) in the first position makes the list a nonmatching list (see the next entry).

[^ ... ]

10gR1

Matches any character not listed between the brackets. Referred to as a nonmatching list.

[:class:]

10gR1

Matches any character that belongs to the specified character class. May only be used within a matching list: [[:class:]abc] is a valid expression, but [:class:]abc is not. Table A-5 lists the valid character class names.

[.coll.]

10gR1

Matches the specified collation element, which may be one or more characters. May only be used within a matching list. For example, the expression [[.ch.]] matches the Spanish letter ch. Table A-4 lists the valid collation elements.

[=char=]

10gR1

Matches all characters that share the same base character as char. May be used only within a matching list. For example, [[=e=]] matches any of: “eéëèÉËÈE”.

\d

10gR2

Matches any digit. Equivalent to [[:digit:]].

\D

10gR2

Matches any nondigit. Equivalent to [^[:digit:]].

\w

10gR2

Matches any word character. Word characters are defined to be alphabetic characters, numeric characters, and the underscore.

\W

10gR2

Matches any nonword character.

\s

10gR2

Matches any whitespace character. Equivalent to [[:space:]].

\S

10gR2

Matches nonwhitespace characters. Equivalent to [^[:space:]].

Table A-2. Quantifiers

Syntax

Initial release

Description

?

10gR1

Zero or one.

*

10gR1

Zero or more.

+

10gR1

One or more.

{m}

10gR1

Exactly m occurrences.

{m,}

10gR1

At least m occurrences.

{m,n}

10gR1

At least m, and at most n occurrences.

+?

10gR2

One or more, but nongreedy.

??

10gR2

Zero or one, but nongreedy.

{m}?

10gR2

The same as {m}.

{m,}?

10gR2

At least m occurrences, but nongreedy and stops as soon as m occurrences are reached.

{m,n}?

10gR2

At least m, and at most n occurrences, but nongreedy; when possible, m occurrences are matched.

Table A-3. Other metacharacters

Syntax

Initial release

Description

|

10gR1

Specifies an alternation. An alternation within a subexpression doesn’t extend beyond the subexpression.

( ... )

10gR1

Defines a subexpresson.

\n

10gR1

References the text matched by the nth subexpression. Backreferences may range from \1 through \9.

\

10gR1

When not followed by a digit, the \ is an escape character. For example, use the pattern \\1 to look for a single backslash followed by the digit 1; use \( to look for an opening parenthesis (rather than begin a subexpression), etc.

^

10gR1

Anchors an expression to the beginning of the string (in multiline mode, to the beginning of a line).

$

10gR1

Anchors an expression to the end of the string (in multiline mode, to the end of a line).

\A

10gR2

Anchors an expression to the beginning of the string regardless of whether multiline mode is specified.

\Z

10gR2

Anchors an expression to the end of the string, or a newline that happens to be ending a string, regardless of whether multiline mode is specified.

\z

10gR2

Anchors an expression to the end of the string regardless of whether multiline mode is specified.

Table A-4. Collation elements

NLS_SORT

Multicharacter collation elements

XCROATIAN

d_

lj

nj

D_

LJ

Nj

D_

Lj

NJ

XCZECH

Ch

CH

Ch

XCZECH_PUNCTUATION

Ch

CH

Ch

XDANISH

aa

oe

AA

OE

Aa

Oe

XHUNGARIAN

cs

gy

ly

ny

sz

ty

zs

CS

GY

LY

NY

SZ

TY

ZS

Cs

Gy

Ly

Ny

Sz

Ty

Zs

XSLOVAK

dz

d_

ch

DZ

D_

CH

Dz

D_

Ch

XSPANISH

ch

ll

CH

LL

Ch

Ll

Table A-5. Supported character classes

Class

Description

[:alnum:]

Alphanumeric characters (same as [:alpha:] + [:digit:])

[:alpha:]

Alphabetic characters only

[:blank:]

Blank space characters, such as space and tab

[:cntrl:]

Nonprinting (control) characters

[:digit:]

Numeric digits

[:graph:]

Graphical characters (same as [:punct:] + [:upper:] + [:lower:] + [:digit:])

[:lower:]

Lowercase letters

[:print:]

Printable characters

[:punct:]

Punctuation characters

[:space:]

Whitespace characters such as space, formfeed, newline, carriage return, horizontal tab, and vertical tab

[:upper:]

Uppercase letters

[:xdigit:]

Hexadecimal characters

Functions and Parameters

The following subsection shows the syntax of Oracle’s regular expression functions. The meaning of the parameters is shown in Regular Expression Parameters.

Regular Expression Functions

The syntax for each regular expression function is shown next.

REGEXP_COUNT (Oracle Database 11g and later)

Returns a tally of occurrences of an expression in a target string. The syntax is:

REGEXP_COUNT(source_string, expression

[, position

[, match_parameter]])

REGEXP_INSTR

Returns the character position at which text can be found matching a regular expression in a target string. The syntax is:

REGEXP_INSTR(source_string, expression

[, position [, occurrence

[, return_option

[, match_parameter

[, subexpression]]]]])

REGEXP_LIKE

Determines whether a given string contains text matching an expression. This is a Boolean function, returning TRUE, FALSE, or NULL. The syntax is:

REGEXP_LIKE (source_string, expression

[, match_parameter])

REGEXP_REPLACE

Performs a regular expression search-and-replace operation (see Chapter 8 for details). The syntax is:

REGEXP_REPLACE(source_string, expression

[, replace_string

[, position [, occurrence

[, match_parameter]]]])

REGEXP_SUBSTR

Extracts text matching a regular expression from a string. The syntax is:

REGEXP_SUBSTR(source_string, expression

[, position [, occurrence

[, match_parameter

[, subexpression]]]])

Regular Expression Parameters

These are the parameters that may be included in the regular expression functions described in the preceding subsection:

source_string

Is a string to be searched.

expression

Is a regular expression describing the pattern of text that you seek.

replace_string

Is a string generating the replacement text to be used in a search-and-replace operation.

position

Is the character position within source_string at which to begin a search. This defaults to 1.

occurrence

Is the occurrence of the pattern you want to locate. This defaults to 1, giving you the first possible match.

return_option

Is valid only for REGEXP_INSTR, and determines whether the beginning or ending character position is returned for text matching a pattern. The default is 0, for the beginning. Use 1 to return the ending position.

match_parameter

Is a text string through which you may specify options to vary the behavior of the regular expression matching engine:

‘c’

Requests a case-sensitive search. (By default, your NLS_SORT setting determines whether a search is case-sensitive.)

‘i’

Requests a case insensitive search.

‘n’

Allows the period to match newline characters. By default, the period does not match newlines.

‘m’

Changes the definition of line with respect to the ^ and $ metacharacters. By default, line means the entire target string. Using the m option, however, causes the definition of line to change from the entire target string to any line within that string, where lines are delimited by newline characters.

subexpression (Oracle Database 11g and later)

Is a number (0–9) identifying which subexpression to match on. The default is 0 and signifies that subexpressions will not be used.

You can specify multiple match parameters in any order. For example, ‘in’ means the same as ‘ni’. If you specify conflicting options (such as ‘ic’), the last option (‘c’, in this case) is the one that takes precedence.