------------------------------------------------------------------------
BASIC REGULAR EXPRESSIONS
------------------------------------------------------------------------

Anchors:

Anchors are location markers and do not correspond to a character.

  ^  ... beginning of line
  $  ... end of line

However, note that ^ and $ lose their special status when used at
locations other than the beginning or end of the regular expression,
respectively.

  ^A  .. line beginning with A
  A$  .. line ending with A
  ^$  .. line containing no characters
  A^  .. line containing "A^" at any location
  $A  .. line containing "$A" at any location
  ^^  .. line starting with "^"
  $$  .. line ending with "$"
  $^  .. line containing "$^" at any location

Character Set (search for one occurrence of a specified character):

  a  .. character "a" and so on
  0  .. character "0" and so on

When characters are enclosed by square brackets then interesting
possibilities open up!

Class:
  [0a]  .. match either 0 or a

Negation:
  [^a]    .. any character but "a"
  [^0-9]  .. any character but a digit
  [^>]    .. any character but ">"

Range:
  [0-9]        .. match one of ten digits
  [a-z]        .. match a lower case English letter
  [A-Z]        .. match an upper case English letter
  [a-zA-Z]     .. match an English letter (regardless of case)
  [0-9a-zA-Z]  .. match an alphanumeric character

Note: The basis of the range ("-") is increasing ASCII value (and not
something deeper). Strictly, this holds in the C/POSIX locale; other
locales may order characters differently. To be independent of the
ASCII value (especially important given that modern computer languages
allow for a rich set of characters), POSIX defines named character
classes. An incomplete listing can be found below:

  [:alnum:]  alphanumeric characters [a-zA-Z0-9]
  [:alpha:]  alphabetic characters [a-zA-Z]
  [:digit:]  digits [0-9]
  [:print:]  printable characters (includes space)
  [:punct:]  punctuation
  [:blank:]  space and tab
  [:cntrl:]  control characters, [\x00-\x1F\x7F]

Confusingly, to use these sets you need to include an additional set of
square brackets, e.g.
  $ grep '[[:digit:]]' infile    # identify lines with digits

Special characters:

  *      .. zero or more of previous character
  .      .. any one character
  \      .. backslash (escape character)
  \{ \}  .. see below (special meaning)
  \( \)  .. see below (special meaning)
  ^M     .. EOL character (carriage return, Ctrl-M; specific use)
  \n     .. newline character (specific use)

  \{q\}    .. exactly q occurrences of previous character
  \{q,\}   .. at least q occurrences of previous character
  \{p,q\}  .. p through q occurrences of previous character

Backreferencing:

  &      ... matched phrase
  \( \)  ... captures a token
  \1     ... refers to the first captured token

Note: Capturing tokens usually requires great specificity and will only
come with some experience (or trial and error).

Notes (specific to old-style or pre-GNU tools):

Control characters are specified as "Ctrl-V and then the character",
e.g.

  EOL =
  Tab = Ctrl-V and then Tab

------------------------------------------------------------------------
Regular Expressions Are Tricky: Homework 1
------------------------------------------------------------------------

Rule 1: The longest match is returned. The regular expression engine
does not stop as soon as a match is found. Starting from the leftmost
position in the line at which a match is possible, it returns the
longest match.

Rule 2: A null match is considered to be a valid match. [Remember this
rule when using "*".]
In order to appreciate these two rules, study the following examples:

  $ echo "abc,def,ghi,jkl,mno" | gsed 's/,.*,/1/'
  abc1mno
  $ echo "abc" | gsed 's/b*/1/'
  1abc
  $ echo "abc" | gsed 's/b*/1/2'
  a1c
  $ echo "abc" | gsed 's/b*/1/g'
  1a1c1
  $ echo "abc" | gsed -E 's/b+/1/'     # -E uses ERE (as egrep does), not BRE
  a1c
  $ echo "abc" | gsed -E 's/b+/1/g'    # caution: ERE, as in egrep
  a1c

------------------------------------------------------------------------
Homework 2: square brackets
------------------------------------------------------------------------

Rule 3: Inside square brackets, all metacharacters lose their special
meaning. Only "^", "-" and "]" need care, and their meaning depends on
where they sit in the set.

  $ echo "abc" | gsed 's/[$]//'
  abc
  $ echo "abc" | gsed 's/[*]//'
  abc
  $ echo "abc" | gsed 's/[\]/-/'
  abc
  $ echo "abc" | gsed 's/[^]//'        # Do you understand why this is not allowed?
  gsed: -e expression #1, char 7: unterminated `s' command
  $ echo "abc]" | gsed -n '/[]xy]/p'
  abc]
  $ echo "abc]" | gsed -n '/[xy]]/p'   # No match. No output. Why?
  $ echo "abc-]" | gsed -n '/[-xy]/p'
  abc-]
  $ echo "abc-]" | gsed -n '/[x-y]/p'  # No match. No output. Why?
  $ echo "[abc-]" | gsed -n '/[[x-y]/p'
  [abc-]
  $ echo "[abc-]" | gsed -n '/[x[-y]/p'
  [abc-]
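
The "Why?" questions above, and the interval and backreference syntax
from the first section, can be checked with a small script. This is only
a sketch under stated assumptions: it calls the system `sed` and `grep`
and assumes they accept the same POSIX BRE syntax as the gsed used in
these notes. It also sets LC_ALL=C so that ranges follow ASCII order.
The helper names (matches, mark_digits, swap_pair) are made up for
illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch only. Assumes the system sed and grep accept the same POSIX BRE
# syntax as the gsed used in the notes. The helper names are hypothetical.
LC_ALL=C          # make ranges such as [x[-y] follow ASCII order
export LC_ALL

# matches STRING REGEX: print "yes" if the BRE matches somewhere in STRING.
matches() {
    if printf '%s\n' "$1" | grep -q -e "$2"; then
        echo yes
    else
        echo no
    fi
}

# mark_digits STRING: wrap the first three consecutive digits in <>,
# using the interval \{3\} and "&" (the whole matched text).
mark_digits() {
    printf '%s\n' "$1" | sed 's/[0-9]\{3\}/<&>/'
}

# swap_pair STRING: turn "key=value" into "value=key" with two captured tokens.
swap_pair() {
    printf '%s\n' "$1" | sed 's/\([^=]*\)=\(.*\)/\2=\1/'
}

mark_digits 'ab123cd'        # ab<123>cd
swap_pair 'key=value'        # value=key

# "]" is literal only when it comes first in the set.
matches 'abc]' '[]xy]'       # yes: the set is ], x, y
matches 'abc]' '[xy]]'       # no:  means (x or y) followed by a literal "]"

# "-" is literal when first; between two characters it forms a range.
matches 'abc-]' '[-xy]'      # yes: the set is -, x, y
matches 'abc-]' '[x-y]'      # no:  the set is the range x through y

# "[" is an ordinary character inside a set.
matches '[abc-]' '[[x-y]'    # yes: the set is [ plus the range x through y
matches '[abc-]' '[x[-y]'    # yes: the set is x plus the range [ through y
```

The last pattern matches because, in ASCII order, the range from "["
to "y" covers "]" and all lower case letters up to "y", so the "a" in
the input is enough. The same reasoning explains the error from
's/[^]//': a "]" directly after "^" is taken literally, so the set never
closes and the slashes that should end the s command are swallowed by it.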