3.17. Regular Expressions

In order to fully utilize the power of shell scripting, you need to master Regular Expressions. Certain commands and utilities commonly used in scripts, such as sed and awk, interpret and use REs.

3.17.1. A Brief Introduction to Regular Expressions

An expression is a string of characters. Those characters that have an interpretation above and beyond their literal meaning are called metacharacters. A quote symbol ("), for example, may denote speech by a person, ditto, or a meta-meaning for the symbols that follow. Regular expressions are sets of characters and/or metacharacters that UNIX endows with special features. [1]

The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters (a substring or an entire string).

  • The asterisk * matches any number of repeats of the character or RE immediately preceding it, including zero.

    "1133*" matches 113 followed by zero or more 3's, possibly with other characters around it: 113, 1133, 111312, and so forth.

  • The dot . matches any one character, except a newline. [2]

    "13." matches 13 followed by any one character (including a space): 1133, 11333, but not 13 (additional character missing).

  • The caret ^ matches the beginning of a line, but sometimes, depending on context, negates the meaning of a set of characters in an RE.

  • The dollar sign $ at the end of an RE matches the end of a line.

    "^$" matches blank lines.

  • Brackets [...] enclose a set of characters to match in a single RE.

    "[xyz]" matches the characters x, y, or z.

    "[c-n]" matches any of the characters in the range c to n.

    "[B-Pk-y]" matches any of the characters in the ranges B to P and k to y.

    "[a-z0-9]" matches any lowercase letter or any digit.

    "[^b-d]" matches all characters except those in the range b to d. This is an instance of ^ negating or inverting the meaning of the following RE (taking on a role similar to ! in a different context).

    Combined sequences of bracketed characters match common word patterns. "[Yy][Ee][Ss]" matches yes, Yes, YES, yEs, and so forth. "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]" matches any Social Security number.

  • The backslash \ escapes a special character, which means that character gets interpreted literally.

    A "\$" reverts back to its literal meaning of "$", rather than its RE meaning of end-of-line. Likewise a "\\" has the literal meaning of "\".

  • Extended REs, used in egrep, awk, and Perl, add the operators that follow.

  • The question mark ? matches zero or one of the previous RE. It is generally used for matching single characters.

  • The plus + matches one or more of the previous RE. It serves a role similar to the *, but does not match zero occurrences.

       # GNU sed and grep accept "+" in a basic RE, but it must be escaped as "\+".
       # awk works with extended REs, so gawk takes the "+" unescaped.

       echo a111b | sed -ne '/a1\+b/p'
       echo a111b | grep 'a1\+b'
       echo a111b | gawk '/a1+b/'
       # All of the above are equivalent.

       # Thanks, S.C.

  • Escaped "curly brackets" \{ \} indicate the number of occurrences of a preceding RE to match.

    In a basic RE the curly brackets must be escaped, since unescaped they have only their literal character meaning. POSIX includes this escaped form in the basic RE set, while extended REs take the brackets unescaped.

    "[0-9]\{5\}" matches exactly five digits (characters in the range of 0 to 9).

    Caution

    Curly brackets are not available as an RE in the "classic" version of awk. However, gawk has the --re-interval option that permits them (without being escaped).

     bash$ echo 2222 | gawk --re-interval '/2{3}/'
     2222
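
The operators listed above can be tried out with grep. What follows is only a minimal sketch: the sample strings and the n_* variable names are invented for illustration, grep -c simply counts the matching lines, and LC_ALL=C is set on the assumption that bracket ranges should follow ASCII order.

```shell
#!/bin/bash
# Sketch only: sample lines and variable names are invented for illustration.
export LC_ALL=C   # assumption: keep ranges such as [b-d] in plain ASCII order

# Basic REs (plain grep). The -c option prints the number of matching lines.
n_star=$(printf '%s\n' 113 1133 111312 12 | grep -c '1133*')       # 113, then zero or more 3's
n_dot=$(printf '%s\n' 1133 11333 13 | grep -c '13.')               # 13 plus one more character
n_blank=$(printf '%s\n' one '' two '' | grep -c '^$')              # empty lines only
n_yes=$(printf '%s\n' yes YES yEs no | grep -c '^[Yy][Ee][Ss]$')   # any capitalization of "yes"
n_neg=$(printf '%s\n' b c e | grep -c '^[^b-d]$')                  # one character outside b-d
n_dollar=$(printf '%s\n' 'cost: $5' 'cost: 5' | grep -c '\$')      # a literal dollar sign
n_five=$(printf '%s\n' 12345 1234 | grep -c '[0-9]\{5\}')          # five digits in a row

# Extended REs (grep -E): ? and + are written unescaped.
n_opt=$(printf '%s\n' color colour colouur | grep -c -E '^colou?r$')  # zero or one "u"
n_plus=$(printf '%s\n' ab a1b a111b | grep -c -E 'a1+b')              # one or more 1's

echo "star=$n_star dot=$n_dot blank=$n_blank yes=$n_yes neg=$n_neg"
echo "dollar=$n_dollar five=$n_five opt=$n_opt plus=$n_plus"
# Prints:
# star=3 dot=2 blank=2 yes=3 neg=1
# dollar=1 five=1 opt=2 plus=2
```

Anchoring a pattern with ^ and $, as in the "yes" line, forces the whole line to match. Without the anchors grep reports any line that merely contains a match.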
     	      

"Sed & Awk", by Dougherty and Robbins gives a very complete and lucid treatment of REs (see the Bibliography).

3.17.2. Using REs in Scripts

Sed, awk, and Perl, used as filters in scripts, take REs as arguments when "sifting" or transforming files or I/O streams. See Example A-6 and Example A-11 for illustrations of this.
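
As a rough sketch of this filtering role, the script below passes the same text through sed and through awk. The log-style sample lines and the variable names are invented for illustration, and only standard sed and awk features are assumed.

```shell
#!/bin/bash
# Sketch only: the input lines and variable names are invented for illustration.
input='error: disk full
info: backup done
error: network down'

# sed as a filter: -n suppresses the default output, and the p flag prints
# only those lines on which the substitution succeeded.
# \(.*\) captures the rest of the line, and \1 reuses it in the replacement.
errors=$(printf '%s\n' "$input" | sed -n 's/^error: \(.*\)$/[!] \1/p')

# awk as a filter: for lines matching the RE ^info, print the second field.
infos=$(printf '%s\n' "$input" | awk '/^info/ { print $2 }')

printf '%s\n' "$errors"
printf '%s\n' "$infos"
# Prints:
# [!] disk full
# [!] network down
# backup
```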

Caution

Apart from the =~ operator of the [[ ... ]] test in version 3.0 and later, Bash itself does not recognize Regular Expressions. In scripts, it is commands and utilities, such as sed and awk, that interpret REs.

Bash does carry out filename expansion, but does not use the standard RE set. Instead, it recognizes and expands wildcards, a process known as "globbing". Globbing interprets the standard wildcard characters, * and ?, in addition to certain other special characters (such as bracketed character lists). There are some important limitations on wildcard characters in globbing, however. A pattern that begins with * will not match filenames that start with a dot (such as .bashrc). [3] Likewise, the ? has a different meaning in globbing than as part of an RE: it stands for exactly one character instead of making the preceding RE optional.

See Example 3-46.
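
The difference between the two notations can be sketched as follows. This is an illustration only: the scratch directory, the file names, and the variable names are made up, mktemp -d is assumed to be available, and LC_ALL=C is set so that the expanded names come back in plain ASCII order.

```shell
#!/bin/bash
# Sketch only: scratch directory, file names and variable names are invented.
export LC_ALL=C   # assumption: sort expanded names in plain ASCII order

tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/a.txt" "$tmpdir/ab.txt" "$tmpdir/abc.txt"

# Globbing (done by the shell): ? is exactly one character, * is any string.
glob_q=$(cd "$tmpdir" && echo a?.txt)
glob_star=$(cd "$tmpdir" && echo a*.txt)

# REs (done by grep): ? makes the preceding item optional,
# * repeats the preceding item zero or more times.
re_q=$(cd "$tmpdir" && printf '%s\n' * | grep -E '^ab?\.txt$' | xargs)
re_star=$(cd "$tmpdir" && printf '%s\n' * | grep '^a*\.txt$' | xargs)

echo "glob a?.txt      -> $glob_q"
echo "glob a*.txt      -> $glob_star"
echo "RE   ^ab?\.txt\$  -> $re_q"
echo "RE   ^a*\.txt\$   -> $re_star"
# Prints:
# glob a?.txt      -> ab.txt
# glob a*.txt      -> a.txt ab.txt abc.txt
# RE   ^ab?\.txt$  -> a.txt ab.txt
# RE   ^a*\.txt$   -> a.txt

rm -rf "$tmpdir"
```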

Notes

[1]

The simplest type of regular expression is a character string that retains its literal meaning, not containing any metacharacters.

[2]

Since sed, awk, and grep process single lines, there will usually not be a newline to match. In those cases where there is a newline in a multiple line expression, the dot will match the newline.

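   #!/bin/bash

   # sed: the N command appends the next input line to the pattern space,
   # so the pattern space holds both lines and .* spans the embedded newline.
   sed -e 'N;s/.*/[&]/' << EOF   # Here Document
   line1
   line2
   EOF
   # OUTPUT:
   # [line1
   # line2]

   echo

   # awk: rebuild the record with an embedded newline between the two fields.
   # The dot in /line.1/ then matches that newline.
   awk '{ $0=$1 "\n" $2; if (/line.1/) {print}}' << EOF
   line 1
   line 2
   EOF
   # OUTPUT:
   # line
   # 1

   # Thanks, S.C.

   exit 0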

[3]

Filename expansion can match dotfiles, but only if the pattern explicitly includes the dot.

   ~/[.]bashrc    # Will not expand to ~/.bashrc
   ~/?bashrc      # Neither will this.
                  # Wild cards and metacharacters will not match a leading dot in globbing.

   ~/.[b]ashrc    # Will expand to ~/.bashrc
   ~/.ba?hrc      # Likewise.
   ~/.bashr*      # Likewise.

   # Setting the "dotglob" option lifts this restriction.

   # Thanks, S.C.
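
The same behavior can be checked in a scratch directory. The following is a minimal sketch, assuming Bash (for shopt) and mktemp -d; the file names and variable names are invented for illustration.

```shell
#!/bin/bash
# Sketch only: scratch directory, file names and variable names are invented.
tmpdir=$(mktemp -d) || exit 1
touch "$tmpdir/.hidden" "$tmpdir/visible"

# Default behavior: * skips names that begin with a dot.
default_glob=$(cd "$tmpdir" && echo *)

# A pattern with an explicit leading dot does match the dotfile.
explicit_dot=$(cd "$tmpdir" && echo .h*)

# With dotglob set, * matches dotfiles as well ("." and ".." stay excluded).
# The command substitution runs in a subshell, so the option does not leak out.
with_dotglob=$(cd "$tmpdir" && shopt -s dotglob && echo *)

echo "default:  $default_glob"
echo "explicit: $explicit_dot"
echo "dotglob:  $with_dotglob"
# Prints:
# default:  visible
# explicit: .hidden
# dotglob:  .hidden visible

rm -rf "$tmpdir"
```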