Functions for regular expressions

Functions for working with text strings using regular expressions. Regular expressions use the syntax RE2.

regexp_extract()

Description

Extracts a substring from a string using the given regular expression.

Usage

regexp_extract(string, regexp)

If the string string contains a substring overlaid by the regular expression pattern regexp, returns that substring. If there are several substrings covered by the pattern, it returns the first one. If no such substring is found, returns an empty string.

See examples
SELECT
    regexp_extract('Tengri', '..') AS result_1,
    regexp_extract('Tengri', 'n.*') AS result_2,
    regexp_extract('Tengri', '.*') AS result_3,
    regexp_extract('Tengri', '^.{5}') AS result_4,
    regexp_extract('Tengri', '[TNGRi]$') AS result_5,
    regexp_extract('Tengri', '.{7}') AS result_empty;
+----------+----------+----------+----------+----------+--------------+
| result_1 | result_2 | result_3 | result_4 | result_5 | result_empty |
+----------+----------+----------+----------+----------+--------------+
| Te       | ngri     | Tengri   | Tengr    | i        |              |
+----------+----------+----------+----------+----------+--------------+

regexp_extract_all()

Description

Extracts all non-overlapping substrings from a string using the given regular expression. Returns an array of substrings.

Usage

regexp_extract_all(string, regexp[, <num>])

If no substrings in string covered by the regexp pattern are found, returns an empty list.

Parameters

  • <num> — the number of the group within the pattern to return (for each substring). By default (if no parameter is given), the entire substring is returned. Group numbering starts with 1. Groups in the template are highlighted with parentheses.

    More details about the parameter on examples

    Let’s extract from the text all combinations of letters with a dot after them:

    SELECT
        regexp_extract_all('My name is Tengri. My nickname is TNGRi.',
                           '(\w+)(\.)')
    AS result;
    +------------------+
    |      result      |
    +------------------+
    | {Tengri.,TNGRi.} |
    +------------------+

    Now let’s extract from the same text using the same regular expression the combinations of letters before the dot, but without the dot itself. For this purpose, let’s set the group number 1:

    SELECT
        regexp_extract_all('My name is Tengri. My nickname is TNGRi.',
                           '(\w+)(\.)',
                           1)
    AS result;
    +----------------+
    |     result     |
    +----------------+
    | {Tengri,TNGRi} |
    +----------------+
See more examples
SELECT
    regexp_extract_all('My name is Tengri. My nickname is TNGRi.', '\w+')
AS words;
+------------------------------------------+
|                   words                  |
+------------------------------------------+
| {My,name,is,Tengri,My,nickname,is,TNGRi} |
+------------------------------------------+

We use a combination of the regexp_extract_all and unnest functions to extract data from structured text:

CREATE TABLE text_table(text_data VARCHAR);

INSERT INTO text_table VALUES
('Name: "Tengri", Nickname:"TNGRi"'),
('Country: "Russia", Capital:"Moscow"');

SELECT
    unnest(
        regexp_extract_all(text_data, '(\w+):\s*"(.*?)"', 1)
          ) AS key,
    unnest(
        regexp_extract_all(text_data, '(\w+):\s*"(.*?)"', 2)
          ) AS value
FROM text_table;
+----------+--------+
|    key   | value  |
+----------+--------+
| Name     | Tengri |
+----------+--------+
| Nickname | TNGRi  |
+----------+--------+
| Country  | Russia |
+----------+--------+
| Capital  | Moscow |
+----------+--------+

regexp_full_match()

Description

Checks whether a regular expression overlaps a string completely.

Usage

regexp_full_match(string, regexp)

If the regular expression pattern regexp completely overlaps the string string, returns true, otherwise — false.

See examples
SELECT
    regexp_full_match('Tengri', 'gri$')     AS result_1, -- false expected
    regexp_full_match('Tengri', '.')        AS result_2, -- false expected
    regexp_full_match('Tengri', '.*')       AS result_3, -- true expected
    regexp_full_match('Tengri', '^\w+gri$') AS result_4; -- true expected
+----------+----------+----------+----------+
| result_1 | result_2 | result_3 | result_4 |
+----------+----------+----------+----------+
| false    | false    | true     | true     |
+----------+----------+----------+----------+

regexp_matches()

Description

Checks if a regular expression is contained within a string.

Usage

regexp_matches(string, regexp)

If at least one substring covered by the regexp pattern is found inside the string string, returns true, otherwise — false.

See examples
SELECT
    regexp_matches('Tengri', '.+T') AS result_1, -- false expected
    regexp_matches('Tengri', '.*T') AS result_2; -- true expected
+----------+----------+
| result_1 | result_2 |
+----------+----------+
| false    | true     |
+----------+----------+

regexp_replace()

Description

Replaces a substring covered by a regular expression with the specified string.

Usage

regexp_replace(string, regexp, target)

If a substring covered by the regexp pattern is found inside the string string, it is replaced by the string target. If several such substrings are found, only the first one is replaced. If no substring is found, the original string string is returned.

See examples
SELECT
    regexp_replace('Tengri', '.', 't') AS result_1,
    regexp_replace('Tengri', '.*', 't') AS result_2,
    regexp_replace('Tengri', 'e.*r', 'NGR') AS result_3,
    regexp_replace('Tengri', 'a', 't') AS result_4;
+----------+----------+----------+----------+
| result_1 | result_2 | result_3 | result_4 |
+----------+----------+----------+----------+
| tengri   | t        | TNGRi    | Tengri   |
+----------+----------+----------+----------+

regexp_split_to_array()

Description

Splits a string into parts, separated by a regular expression, and returns the parts as an array.

Usage

regexp_split_to_array(string, regexp)

Aliases

string_split_regex()

If substrings overlapped by the regexp pattern are found in the string string, then the parts of the original string not overlapped by the pattern are returned as an array. If the found substrings are at the beginning or at the end of the source string, the resulting array will contain empty strings for the beginning and the end of the string. If no substrings are found, an array of one source string will be returned.

See examples
SELECT
    string_split_regex('My name is Tengri. My nickname is TNGRi.', '\.\s')
AS sentences,
    string_split_regex('My name is Tengri. My nickname is TNGRi.', '[\.\s]+')
AS words;
+-------------------------------------------+-------------------------------------------+
|                 sentences                 |                   words                   |
+-------------------------------------------+-------------------------------------------+
| {My name is Tengri,My nickname is TNGRi.} | {My,name,is,Tengri,My,nickname,is,TNGRi,} |
+-------------------------------------------+-------------------------------------------+

regexp_split_to_table()

Description

Splits a string into parts separated by a regular expression, and returns the parts as strings.

Usage

regexp_split_to_table(string, regexp)

If substrings overlapped by the regexp pattern are found in the string string, then the parts of the original string not overlapped by the pattern are returned as a column, with each part written to a different cell. If the found substrings are at the beginning or end of the source string, the result will include empty strings for the beginning and end of the string. If no substrings are found, a column with one cell in which the original string is written will be returned.

See examples
SELECT
    regexp_split_to_table('My name is Tengri. My nickname is TNGRi.', '\.\s')
AS sentences,
    regexp_split_to_table('My name is Tengri. My nickname is TNGRi.', '[\.\s]+')
AS words;
+-----------------------+----------+
|       sentences       |   words  |
+-----------------------+----------+
| My name is Tengri     | My       |
+-----------------------+----------+
| My nickname is TNGRi. | name     |
+-----------------------+----------+
| null                  | is       |
+-----------------------+----------+
| null                  | Tengri   |
+-----------------------+----------+
| null                  | My       |
+-----------------------+----------+
| null                  | nickname |
+-----------------------+----------+
| null                  | is       |
+-----------------------+----------+
| null                  | TNGRi    |
+-----------------------+----------+
| null                  |          |
+-----------------------+----------+

Operator ~

Description

Checks if the regular expression covers the string completely.

Usage

string ~ regexp.

Can be used with the negation operator !: string !~ regexp.

Fully synonymous with the SIMILAR TO operator.

See examples
SELECT
    'Tengri' !~ 'TNGRi' AS result_1,
    'Tengri' ~ 'T.*'    AS result_2,
    'Tengri' ~ 'T.+i'   AS result_3,
    'TNGRi'  ~ 'T.+i'   AS result_4,
    'T.+i'   ~ 'T.+i'   AS result_5;
+----------+----------+----------+----------+----------+
| result_1 | result_2 | result_3 | result_4 | result_5 |
+----------+----------+----------+----------+----------+
| true     | true     | true     | true     | true     |
+----------+----------+----------+----------+----------+