UNIX POWER TOOLS

sed is very well suited to editing text files (a file consisting of lines, not necessarily of the same length).

awk is well suited to manipulating and editing files which consist of records (columns of data). Each line (record) can have a variable number of entries (fields). A record can also span multiple lines.

In this first class we will focus on SED. In a later class we will look at AWK.


Class I: A Basic Course on SED
June 13, 2014 at 4p-5:30p, Cahill 219
[10 min]  Review Regular Expression (BRE or Basic Regular Expression)
[10 min]  BRE: Back-expressions, Already Matched, Word Matching,
[10 min]  grep 
[ 5 min]  Simple SED commands: p,=,q,w,r
[ 5 min]  Change of flow commands: n, d
[ 5 min]  The substitution command: s
[ 5 min]  Change commands: c,i,a,y
[ 5 min]  Commands involving buffer: h, x
[ 5 min]  Multiline SED commands: H,N,D,P
[ 5 min]  Control commands: t, b, :
[ 5 min]  Ganging up SED commands: Rules about "{..}", ";", EOL
[20 min]  Examples 
We will then work through the examples given below (until 5:30p).

Everything that SED and AWK can do can be done in Python or MATLAB or IDL. However, for specific tasks UNIX tools can be quite powerful and also very compact. I find it useful to invoke SED commands from MATLAB. This gives me the best of both worlds.


The first order of business is that you have to become familiar with "regular expressions". This is absolutely basic to understanding not just UNIX tools but all modern programming languages. A good website for regular expressions (and SED and AWK) is grymoire.com. For this class we will use the "classic" regular expressions and not the "extended set".

I suggest that you look at the following two files: Practice sample and associated input file reg.txt. I will assume that you will be using the "bash" shell (Bourne Again Shell), whence the signature prompt "$". I would like to assume that you will be using "vi" but will forgive you if you are using "emacs". "Real-life" examples now follow.


"GREP" is a simple but powerful tool to search files for patterns specified by regular expressions.

$ grep 'regexp' infile
will search for a regular expression specified by "regexp" in infile.

Examples
$ grep 'SGR.*' paper.tex
will search the file paper.tex and display all lines containing "SGR followed by any set of characters".
$ grep '[0-9][0-9]*' input.dat
will search and display lines which contain one or more digits.

The following are interesting flags for "grep"

$ grep -A num 'pattern' infile
$ grep -B num 'pattern' infile
$ grep -C num 'pattern' infile
will display "num" lines after (A), before (B) and centered (C) around the line in which the pattern is found.

$ grep -E 'pattern' infile
allows for use of extended regular expressions [in particular, +,?,|,()].

$ grep -v 'pattern' infile
will print lines that do NOT match the pattern.

It is useful to practice regular expressions with "grep" before moving onto SED.


SED is an editor for "streaming data". It is a line-oriented editor. A line is read into the "pattern space", manipulated (by commands) and then output. The next line is read in and the same commands are applied. A group of lines identified via some rule (e.g. "from 1 through 10, inclusive", "not line 5", or "lines containing a regular expression") can have its own specific editing commands.

AWK is quite a sophisticated programming language. For this class we will make the simplest use of AWK.

The basic SED and AWK structure is

$ sed 'address1,address2{commands}' InputFile
$ awk ' ADDRESS/PATTERN {ACTION}'

The addresses are optional (in most instances). If no addresses are given then the command group or ACTION applies to all lines.

The main advantage of SED over say MATLAB or IDL is that you do not have to "open" and "close" files, "read in a line" or "write a line out". It is the lack of these steps that allows you to undertake selective and comprehensive filtering with a few commands. The two basic uses of sed are via the command line or via a script file:

$ sed [-n] [-e 'address1,address2{commands}'] Infile

$ sed [-f scriptfile] Infile

Here "scriptfile" is a file containing SED commands. In the absence of the flag "-n" SED will "output" (print out) the (edited) pattern buffer and then go to the next input line. The flag "-e" allows for multiple commands (one for each invocation of "-e").
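The effect of "-n" and of repeated "-e" flags is easiest to see on a tiny input. The following is a minimal sketch; the three sample lines and the variable names are made up for illustration.

```shell
# Three sample lines (made up for illustration).
sample=$(printf 'alpha\nbeta\ngamma\n')

# Without -n every line is printed automatically; "p" prints line 2 a second time.
with_autoprint=$(printf '%s\n' "$sample" | sed '2p')

# With -n only what "p" explicitly prints appears.
only_line2=$(printf '%s\n' "$sample" | sed -n '2p')

# Two -e expressions are applied in order to each line.
two_edits=$(printf '%s\n' "$sample" | sed -e 's/a/A/' -e 's/$/!/')

printf '%s\n---\n%s\n---\n%s\n' "$with_autoprint" "$only_line2" "$two_edits"
```

Note that 's/a/A/' without the "g" flag changes only the first "a" on each line.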


Addressing in SED is of two types: by line number (e.g. '5' or '1,10', with '$' standing for the last line) and by regular expression (e.g. '/regexp/'). The default address is '1,$' (all lines).

The following commands accept an address range: p,n,d,s,c,y,H,N,D,P. The following accept only one address: a,i,r,q,=. A summary of the commands can be found in SEDCommands.txt.
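The address forms can be tried on a five-line sample. This is a minimal sketch; the sample lines are made up, and the mixed "regexp through last line" range and the "!" negation are simply the illustrations I chose.

```shell
lines=$(printf 'one\ntwo\nthree\nfour\nfive\n')

# Line-number range: print lines 2 through 3.
by_number=$(printf '%s\n' "$lines" | sed -n '2,3p')

# Regular-expression address: print lines containing "f".
by_regex=$(printf '%s\n' "$lines" | sed -n '/f/p')

# Mixed range: from the first line matching /two/ through the last line ($).
mixed=$(printf '%s\n' "$lines" | sed -n '/two/,$p')

# "!" negates an address: delete everything except line 5.
negated=$(printf '%s\n' "$lines" | sed '5!d')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$by_number" "$by_regex" "$mixed" "$negated"
```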


  1. Deleting errant lines

    The list of saved PTF transients can be fetched with
    $ wget --http-user=srk --http-passwd=RABBIT http://ptf.caltech.edu/cgi-bin/ptf/transient/name_radec.cgi -O Saved.txt
    Yi Cao noted that the last line in this file "None" is garbage.

    Old style: read in vi or emacs and then go to last line and delete the line and then save.
    New style:
    $ sed '$d' Saved.txt > Junk.txt; mv Junk.txt Saved.txt

  2. Reformatting Target files
    The format of an "Observation file" for the PTF Marshal is the following:
    TargetMarshal.txt
    14azi    08 23 53.64  +07 30 17.8  2000.0  ! P2 Classification | rubin: Possible bad subtraction (r=17.9)
    14azj    08 25 45.82  +07 34 18.1  2000.0  ! P3 Classification | rubin:  (r=18.3)
    14azm    08 35 55.88  +14 05 15.6  2000.0  ! P2 Classification | rubin:  (r=18.8)
    14azo    08 40 41.92  +14 03 24.0  2000.0  ! P3 Classification | rubin:  (r=19.1)
    14ayo    12 06 03.00  +47 29 33.2  2000.0  ! P5 Classification SN II | ycao:  (r=18.5)
    14azl    12 26 36.03  +10 50 44.8  2000.0  ! P2 Classification | rubin:  (r=18.8)
    
    A user (Yi Cao) wishes to convert the above file to a format that is suited for the mighty 3.5-m telescope [this is the biggest 3.5-m telescope in the world, according to M. Kasliwal] of the APO Observatory:
    14azi    08:23:53.64  +07:30:17.8  
    14azj    08:25:45.82  +07:34:18.1  
    14azm    08:35:55.88  +14:05:15.6  
    14azo    08:40:41.92  +14:03:24.0  
    14ayo    12:06:03.00  +47:29:33.2  
    14azl    12:26:36.03  +10:50:44.8  
    

    The following SED command will accomplish the task we set out to do:

    $ sed   -e 's/2000\..*$//'   -e 's/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g'   TargetMarshal.txt

    The first part of the script, 's/2000\..*$//', replaces all characters starting with "2000." through the end of the line ($) with no character (in effect, it deletes the characters). The second part looks for a sign or space followed by two digits (+nn, -nn or " nn"), then a space, two more digits, a space and a decimal number. Each such triplet is replaced by the same three groups joined by ":". The sequence for looking for "+" or "-" or " " can be specified either as "[-+ ]" or "[- +]" or "[+ -]" or "[ +-]" but other combinations will result in an error. This is because "-" has a special meaning -- a range indicator -- when used within square brackets ([..]). For example [a-z] means match any character between "a" and "z" and so on. The dash character loses its special meaning if it is next to either "[" or "]".

    I find the above SED command too long! You can combine the commands by using ";", as follows.
    $ sed -e 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g' TargetMarshal.txt
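    The combined command can be checked on a couple of lines without touching the real file. This is a minimal sketch: the first two lines are copied from TargetMarshal.txt above, while the "test1" entry with a southern declination is made up to exercise the literal dash in "[-+ ]".

```shell
# Two lines copied from TargetMarshal.txt above.
input=$(printf '%s\n' \
  '14azi    08 23 53.64  +07 30 17.8  2000.0  ! P2 Classification | rubin: Possible bad subtraction (r=17.9)' \
  '14ayo    12 06 03.00  +47 29 33.2  2000.0  ! P5 Classification SN II | ycao:  (r=18.5)')

converted=$(printf '%s\n' "$input" |
  sed 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g')

# A dash placed first in a bracket expression is a literal dash, so a
# hypothetical southern declination is handled by the same pattern.
south=$(printf '%s\n' 'test1    01 02 03.00  -07 30 17.8  2000.0  ! made-up entry' |
  sed 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g')

printf '%s\n%s\n' "$converted" "$south"
```

    The output lines keep the two blanks that preceded "2000.0"; append 's/ *$//' to the script if trailing blanks are unwelcome.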

  4. Object IDs to http statements
    Mansi would like to convert object IDs into http statements. The object IDs are stored in a file ID.txt
    149646130
    152523407
    152549537
    
    She would like to generate output along the following pattern:

    <a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=149646130"> 149646130 </a>

    $ sed 's;\(^.*\);<a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=\1">\1 </a>;' ID.txt

    produces the following output

    <a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=149646130">149646130 </a>
    <a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=152523407">152523407 </a>
    <a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=152549537">152549537 </a>

    This is basic use of the back-reference (\1) feature. However, the pattern to be substituted has forward slashes (// and /). It is cumbersome to escape each of these forward slashes. So we make use of a cute feature of the substitution command -- namely the "s" command can use almost any character as the delimiter. In this case we use ";" as the delimiter. With this delimiter the forward slash is no longer special and need not be escaped.

    When Mansi tried this approach the result was garbage. She traced this to the fact that every line in ID.txt ended with a "non-printable" character ("\r"). Some non-UNIX systems end lines with "\r\n" instead of the simple "\n" (new line) character. There are two ways to solve this problem: use a restricted search pattern, that is, "\(^[0-9]*\)" instead of "\(^.*\)", or use "tr" to delete this character before feeding the file to SED.

    [It is useful to view the file before unleashing SED.
    $ sed -n 'l;p' Infile
    will show the "non-print" characters clearly]
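    Both remedies can be tried on a small file with simulated "\r\n" line endings. This is a minimal sketch: the two IDs come from ID.txt above, the temporary file is created with mktemp, the output is wrapped in "<" and ">" only to keep the example short, and the second remedy is a variant that also discards whatever trails the digits.

```shell
# Simulate a file with DOS line endings (\r\n); the IDs are from ID.txt above.
ids=$(mktemp)
printf '149646130\r\n152523407\r\n' > "$ids"

# "l" makes the carriage return visible as \r, with $ marking the line end.
visible=$(sed -n 'l' "$ids")

# Remedy 1: delete the carriage returns with tr before sed sees the data.
fixed_tr=$(tr -d '\r' < "$ids" | sed 's;^\(.*\)$;<\1>;')

# Remedy 2: match only the digits, then drop whatever trails them.
fixed_re=$(sed 's;^\([0-9]*\).*$;<\1>;' "$ids")

printf '%s\n%s\n%s\n' "$visible" "$fixed_tr" "$fixed_re"
rm -f "$ids"
```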

  6. Convert a .csv file into a simple classical ASCII file.
    My daughter is interested in public health and recently downloaded a file, MMR1.csv, from the Centers for Disease Control (CDC). This file can be easily viewed with Excel (on your Mac, "$ open MMR1.csv" is sufficient). You can save it as a "csv" file (comma-separated values). I wanted to convert this file into the simplest possible ASCII file that can be easily read by other programs.

    Below I show how I used SED to accomplish this task. However, I needed to first set the "locale".

    $ LANG=C
    Some background. The original ASCII definition essentially used 7 bits of a byte. This corresponds to the "C" locale. Modern usage uses all 8 bits and MORE. For programming purposes (at least in the simple world I live in) we do not need umlauts and Sanskrit. In order to interpret control characters correctly the "locale" setting should be set to "C" (as above). This setting is consistent with the interpretation of ASCII characters as defined by POSIX standards. You only need to set the locale once in a given terminal session. [Typing in "echo $LANG" shows that the default on my computer is "en_US.UTF-8". This does not correspond to standard POSIX].

    Now coming to the main task at hand.

    $ sed 's/[^[:print:]]//g' MMR1.csv | sed '1,6d;69,$d;s/([0-9]*.[0-9]*)//g;s/NA/-1/g'

    The first command eliminates all non-print characters (this uses a POSIX-defined character class). In the second invocation of SED the non-data lines are removed and then the confidence levels [shown in (..)], at my daughter's request, are deleted. Finally all "NA" (not available) entries are converted to -1.

    A good SED guru aims for the most compact command. So we can combine the two and accomplish the task as follows:

    $ sed 's/[^[:print:]]//g;1,6d;69,$d;s/([0-9]*.[0-9]*)//g;s/NA/-1/g' MMR1.csv

  8. Filtering out unwanted lines and extracting columns of measurements
    A fairly common task in analyzing iPTF data is the analysis of photometric data. For each "event" a data file is created and has the following structure
    Header1
    Header2
    A | B | C | ... |Z
    A | B | C | ... |Z
    ....
    (227 rows)
    (blank line)
    
    Your goal is to make a table with two or three parameters of interest (which translate to entries in column 1, 2, 3, ..). The combination of SED and AWK works very well in this context. You can use SED to filter out the lines and AWK to read specific columns. Below I delete the first two lines, the footer line beginning with "(" and the last line (which is a blank line) and pass the remaining lines to awk for column filtering.

    $ sed '1,2d;/^(/d;$d' Infile | awk -F"|" '{print $2,$3}'

    Apply this command to DKaaaa.txt , an iPTF file.

  10. Add a running index to each line of a file.
    You may find it desirable to add a running line number to a text file, say Kitty.txt
    Hello my name is Kitty
    Really it is something else...
    
    But for now let us say it is Kitty
    

    Apparently there is a huge demand for this task, given the many ways the task can be accomplished.
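    A few of the usual ways are sketched below. This is a minimal sketch under stated assumptions: Kitty.txt is recreated as a temporary file from the listing above, and the particular choice of sed, awk, "cat -n" and "nl" is mine.

```shell
# Kitty.txt recreated from the listing above (temporary copy).
kitty=$(mktemp)
printf '%s\n' 'Hello my name is Kitty' 'Really it is something else...' '' \
  'But for now let us say it is Kitty' > "$kitty"

# sed: "=" prints the line number on its own line; a second sed joins each pair.
by_sed=$(sed '=' "$kitty" | sed 'N;s/\n/ /')

# awk: NR is the running record (line) number.
by_awk=$(awk '{print NR, $0}' "$kitty")

# cat -n and nl also number lines, with padded numbers; by default
# nl leaves blank lines unnumbered.
by_cat=$(cat -n "$kitty")
by_nl=$(nl "$kitty")

printf '%s\n' "$by_sed"
rm -f "$kitty"
```

    The sed and awk versions give identical output; the other two differ only in how the number is padded and in the treatment of blank lines.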

  12. Delete all but one blank line
    Say you have a file with paragraphs separated by groups of blank lines of varying lengths. You would like to replace each such group of blank lines by a single blank line.

    BlankLines.txt


    One blank lines follows.
    
    Two blank lines follow.
    
    
    Three blank lines follows.
    
    
    
    No more lines
    

    $ sed '/^$/{
    N
    /\n$/D
    }' BlankLines.txt

    produces

    One blank lines follows.
    
    Two blank lines follow.
    
    Three blank lines follows.
    
    No more lines
    
    It is worth understanding the flow. Each line is read into the "pattern space" (input buffer) and then checked to see if it is a blank line. If not, given that the "-n" flag has not been set, the pattern space is printed out and the next line is read in. If the line is a blank line then the command or "list of commands" (the list is the one enclosed in braces) is sequentially executed. The "N" command reads in the next line and appends it to the pattern buffer (with "\n" separating the old line from the new line). We then see if the second line is a blank line ("/\n$/") and if so the first line is deleted. "D" deletes only up to the first "\n" and restarts the command list on what is left, without reading a new line (whereas "d" empties the pattern space and starts afresh with the next input line). So the next line is read in ("N") and the cycle repeated, and the loop continues until a non-blank line is read in. At this point the pattern buffer, which consists of a blank line followed by "\n" and the non-blank line, is printed out.

    Above I have used the strict rules for a list of commands, namely, "{" must be succeeded by EOL, "}" must be preceded and succeeded by EOL, and all commands between the braces ("{ }") should likewise be preceded and succeeded by EOL. Fortunately, ";" is equivalent to EOL (for most commands). Thus the following will also work:

    $ sed '/^$/{N;/\n$/D;}' BlankLines.txt

    Note the last command must end with ";" (this is logically consistent with what I stated above).
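    The one-line form can be checked on a made-up sample containing runs of one, two and three blank lines. This is a minimal sketch; the letters A to D stand in for paragraphs.

```shell
# A made-up sample with runs of one, two and three blank lines.
squeezed=$(printf 'A\n\nB\n\n\nC\n\n\n\nD\n' | sed '/^$/{N;/\n$/D;}')
printf '%s\n' "$squeezed"
```

    Every run of blank lines collapses to a single blank line, so the letters end up separated by exactly one empty line each.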

  14. Update my daily on-line calendar
    My calendar is online. It is a simple php file with an html beginning block followed by a plain ASCII block, with "-------" marking the division.

    srk.plan.php

    [Contact]: Hopeless at returning phone calls. Call secretary Bronagh Glaser (+1 626 395 3734) or bglaser@astro.caltech.edu

    1. *, Monday (start of week). :Friday, -Sat, -Sunday
    2. [e], out of town entire day
    3. [m], out of town morning
    4. [a], out of town afternoon
    5. [h], holiday or on leave

    Return to homepage     2014 Schedule (past)     2013 Schedule (old)
    ------------------------------------------------------------------------
    Jun  9*    11a/KeckSearch 12n/lunch
    Jun 10            Jean Muller lunch
    Jun 11            PS/lunch
    Jun 12
    Jun 13
    Jun 14  [e]  Helin ceremony; depart to Russia
    Jun 15  [e]
    Jun 16* [e]                        Zeldovich conference, Russia
    ...
    
    At the start of a day, say June 10, I would like to move the line corresponding to June 9 to an archival file (srk.plan.2014). The script UpDateDailyCalendar does the job. It is a nice example showcasing the usage of the SED "r" and "w" commands.

    UpDateDailyCalendar


    #! /bin/bash
    #Execute as ./UpDateDailyCalendar at the beginning of the day
    
    cp srk.plan.php srk.plan.php.TEMP
    cp srk.plan.2014 srk.plan.2014.TEMP
    
    sed '/^--*$/{n;w out
    d;}' srk.plan.php > b.txt
    
    mv b.txt srk.plan.php
    
    sed '$r out'< srk.plan.2014 > b.txt
    
    rm out
    mv b.txt srk.plan.2014 
    
    echo "moved top line from srk.plan.php to bottom line of srk.plan.2014"
    echo "password please (to transfer to anju.caltech.edu)"
    scp srk.plan.php srk.plan.2014 srk@anju.caltech.edu:public_html
    
    

    Note that the "w out" and "r out" commands ALWAYS have to be the last command on their line (the ";" trick does not work), because everything up to the end of the line is taken as the file name. Also you need exactly one space between "r" or "w" and the filename.

    [In order to make this file executable you will have to "chmod +x UpDateDailyCalendar". Execution requires invoking "./UpDateDailyCalendar" (if you want to execute it as "UpDateDailyCalendar" then your PATH must include the directory in which the executable is placed.)]
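    The "w" and "r" steps can be rehearsed safely in a scratch directory before trusting them with the real calendar. This is a minimal sketch: the directory comes from "mktemp -d", and plan.txt, archive.txt and their contents are made-up stand-ins for srk.plan.php and srk.plan.2014.

```shell
# Work in a scratch directory; all file names here are made up for the demo.
work=$(mktemp -d)
cd "$work" || exit 1
printf '%s\n' 'header' '-----' 'Jun  9' 'Jun 10' 'Jun 11' > plan.txt
printf '%s\n' 'Jun  7' 'Jun  8' > archive.txt

# After the dashed line, fetch the next line (n), write it to "out" (w) and
# delete it (d). The file name after "w" runs to the end of the line.
sed '/^--*$/{n;w out
d;}' plan.txt > plan.new

# Append the contents of "out" after the last line ($) of the archive.
sed '$r out' archive.txt > archive.new

cat plan.new archive.new
```

    Only once plan.new and archive.new look right would one move them over the originals, as the script above does with "mv".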

  16. Converting DS9 region files to RA (degrees), Dec (degrees)
    Yi Cao would like to convert the coordinates of objects in ds9.reg, a ds9 region file, from sexagesimal units to degrees.
    % Region file format: DS9 version 4.1
    %global color=green dashlist=8 3 width=1 font="helvetica 10 normal roman" select=1 highlite=1 dash=0 fixed=0 edit=1 move=1 delete=1 include=1 source=1
    %fk5
    circle(22:22:49.519,+36:18:11.53,5.05")
    circle(22:22:57.390,+36:16:25.18,5.05")
    circle(22:22:52.291,+36:17:39.66,5.05")
    circle(22:22:33.037,+36:16:49.72,5.05")
    
    This can be accomplished with the following SED script or macrofile, RADEC_ds9
    
    #!/bin/bash
    #chmod +x RADEC_ds9
    #  USAGE:  ./RADEC_ds9 ds9.reg %
    #first parameter is ds9 file 
    #second parameter is comment character. Lines starting with it are not analyzed.
    #
    sed -e '/^'"$2"'/d' -e 's/circle(//;s/")//;s/:/,/g' -e 's/+/+1,/' -e 's/-/-1,/' "$1" | \
    awk -F"," '{ra=15*($1*3600+$2*60+$3)/3600;dec=($4*($5*3600+$6*60+$7)/3600); \
    print "ra(d)=" ra, "dec(d)=" dec}'
    
    In this script file we first use SED to get rid of the extraneous material and then pass the data to AWK for arithmetic processing and display. It is worth noting how we have passed the shell parameters ($1, $2) to SED.

    The action of SED is as follows: delete all lines starting with the second command-line parameter (which in this case is "%"). Next strip off "circle(" and the closing '")', and replace each ":" by ",". Then extract the sign of the declination degrees as a separate field (+1 or -1). The resulting values (rah, ram, rass, decsign, decd, decm, decss, followed by the radius, which is ignored) are passed to AWK as fields (the separator set with -F is ","). The necessary simple arithmetic is done by AWK and printed out. Right ascension is given in hours, so it is multiplied by 15 to obtain degrees. The usage is shown below.

    $ ./RADEC_ds9 ds9.reg %

    yields

    ra(d)=335.706 dec(d)=36.3032
    ra(d)=335.739 dec(d)=36.2737
    ra(d)=335.718 dec(d)=36.2944
    ra(d)=335.638 dec(d)=36.2805
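    The reason for splitting off the sign as its own field shows up for declinations between 0 and -1 degree: "-00" taken as a number is just zero, so the sign would be lost if it stayed attached to the degrees. The sketch below illustrates this with the same sed and awk steps applied to a declination only; the function name to_deg and the two test values are made up for illustration.

```shell
# Convert "sign,deg,min,sec" to decimal degrees, as in the awk step above.
# The two inputs are made-up declinations, not taken from ds9.reg.
to_deg() {
  printf '%s\n' "$1" |
    sed 's/:/,/g;s/^+/+1,/;s/^-/-1,/' |
    awk -F"," '{printf "%.4f\n", $1*($2*3600+$3*60+$4)/3600}'
}

north=$(to_deg '+36:18:11.53')
south=$(to_deg '-00:30:00.0')
printf '%s %s\n' "$north" "$south"
```

    With the sign carried separately, "-00:30:00.0" correctly comes out as half a degree south.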
    

  18. Asteroid & Comet Orbital Parameters
    The data for asteroids and comets are, by convention, stored as ASCII files whose lines are 202 characters long. The first character of each line has the following meaning: "%" is a comment line, "K" is an entry for an asteroid and "C" is an entry for a comet. Each line has either 25 or 26 fields (field 16 is either yyyy-yyyy or "nn days"; in the first case it is only one field and in the other case it is two fields). Field 12 is the quality of the orbit. If this field is zero then the orbit is excellent.

    CometAsteroid.txt

    %  -----------------------------------------------------------------------------
    %  Near-Earth asteroids discovered by PTF (10 total)
    %---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    %     1    2      3     4         5          6          7          8          9          10          11 12        13    14  15   16 17   18   19  20  21           22            23 24       25         26
    %---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    K14K00D 24.4   0.15 K145N 354.10001   39.43772  226.49863    5.24544  0.5466273  0.31160391   2.1547693  3 E2014-K20    68   1    4 days 0.43 M-v 3Eh MPCADO     0803          2014 KD     17.9   20140521
    K14J55G 29.2   0.15 K145N 348.77840  223.02395   49.93575    8.74017  0.4128117  0.49394282   1.5849598  5 MPO293441    37   1    0 days 1.00 M-v 3Eh MPCW       0803          2014 JG55   15.9   20140510
    K13V04V 19.7   0.15 K145N  46.57572  167.65980  231.90925   19.10378  0.5590444  0.21503219   2.7593094  3 MPO285783    93   1   79 days 0.22 M-v 38h MPCADO     0804          2013 VV4    19.9   20140121
    K13P06V 19.2   0.15 K145N  77.66743  161.29813  182.47902    6.17045  0.4700267  0.28608351   2.2810816  0 MPO278279   358   4 1951-2013 0.32 M-v 38h MPCADO     0804          2013 PV6    18.6   20131202
    K13O05O 20.9   0.15 K145N  70.52000  154.89703  159.23828   10.23810  0.5215132  0.23627753   2.5913204  4 MPO273983   101   1   68 days 0.23 M-v 38h MPCALB     2804          2013 OO5    20.0   20131006
    K13J14H 22.2   0.15 K145N 150.04931  356.84504  152.88209    2.63972  0.5721744  0.34294191   2.0214159  6 MPO263239    32   1   36 days 0.34 M-v 3Eh MPCADO     2803          2013 JH14   20.9   20130610
    K13J01E 19.1   0.15 K145N 165.12527  163.04354  185.78403   32.67397  0.4094445  0.61107138   1.3753337  2 MPO285771    69   3 2000-2014 0.26 M-v 3Eh MPCALB     0803          2013 JE1    20.3   20140123
    K13H11N 22.0   0.15 K145N 105.65226  144.03216   59.61312    6.32099  0.5224299  0.26477126   2.4019031  5 MPO265065    57   1   42 days 0.36 M-v 38h MPCADO     2804          2013 HN11   20.0   20130530
    
    Adam Waszczak would like to search through CometAsteroid.txt and identify asteroids with excellent orbital parameters.

    $ awk '/^ *K/ && $12==0 {print}' CometAsteroid.txt

    We use two filters to identify the desired objects: lines starting with "K" or " K" (asteroids) and those whose field value $12 is zero. Only those lines are printed.

    However this is not a robust solution since the asteroid and comet convention is by column number. In particular, it is possible that some fields are "empty". In this case the above script will fail. By convention, column 106 is assigned to the quality index. Thus, if we want to print out only those asteroids with high quality (index=0) and be ready to accommodate missing "entries" then the following command does the job (this command was formulated by Adam Waszczak):

    $ sed -e 's/./&,/105' -e 's/./&,/107' InputFile | awk -F "," '/^K/ && $2==0 {print}' | sed 's/,//g'

    There are three parts: sed, awk and then sed. Consider the first part. The command s/./&,/105 inserts a comma after the 105th character and s/./&,/107 inserts a comma after the 107th character. In effect, the old column 106 is now surrounded by ",". If only this part is executed you will get the following output:

    K14K00D 24.4   0.15 K145N 354.10001   39.43772  226.49863    5.24544  0.5466273  0.31160391   2.1547693  ,3, E2014-K20    68   1    4 days 0.43 M-v 3Eh MPCADO     0803          2014 KD     17.9   20140521
    K14J55G 29.2   0.15 K145N 348.77840  223.02395   49.93575    8.74017  0.4128117  0.49394282   1.5849598  ,5, MPO293441    37   1    0 days 1.00 M-v 3Eh MPCW       0803          2014 JG55   15.9   20140510
    K13V04V 19.7   0.15 K145N  46.57572  167.65980  231.90925   19.10378  0.5590444  0.21503219   2.7593094  ,3, MPO285783    93   1   79 days 0.22 M-v 38h MPCADO     0804          2013 VV4    19.9   20140121
    K13P06V 19.2   0.15 K145N  77.66743  161.29813  182.47902    6.17045  0.4700267  0.28608351   2.2810816  ,0, MPO278279   358   4 1951-2013 0.32 M-v 38h MPCADO     0804          2013 PV6    18.6   20131202
    K13O05O 20.9   0.15 K145N  70.52000  154.89703  159.23828   10.23810  0.5215132  0.23627753   2.5913204  ,4, MPO273983   101   1   68 days 0.23 M-v 38h MPCALB     2804          2013 OO5    20.0   20131006
    K13J14H 22.2   0.15 K145N 150.04931  356.84504  152.88209    2.63972  0.5721744  0.34294191   2.0214159  ,6, MPO263239    32   1   36 days 0.34 M-v 3Eh MPCADO     2803          2013 JH14   20.9   20130610
    K13J01E 19.1   0.15 K145N 165.12527  163.04354  185.78403   32.67397  0.4094445  0.61107138   1.3753337  ,2, MPO285771    69   3 2000-2014 0.26 M-v 3Eh MPCALB     0803          2013 JE1    20.3   20140123
    K13H11N 22.0   0.15 K145N 105.65226  144.03216   59.61312    6.32099  0.5224299  0.26477126   2.4019031  ,5, MPO265065    57   1   42 days 0.36 M-v 38h MPCADO     2804          2013 HN11   20.0   20130530
    K13G79Z 19.9   0.15 K145N 158.34646  102.80295  155.63692   15.98906  0.2906355  0.45529103   1.6734394  ,1, MPO271295   195   2 2002-2013 0.34 M-v 38h MPCADO     0804          2013 GZ79   19.7   20130831
    K13EC8W 20.3   0.15 K145N 102.24712  106.73109  120.98740    7.35944  0.6012466  0.27298205   2.3534952  ,4, MPO268498    90   1  140 days 0.35 M-v 3Eh MPCALB     2803          2013 EW128  18.5   20130802
    
    Note that the length of each record has increased by 2 characters. This output is then fed to awk with the delimiter set to "," (-F ","). The characters from the start of the line to column 105 form the first field, the character at column 107 is the second field and those from column 109 to the end of the line form the third field. The second field is then inspected for a zero value and if so the line is printed. The final step is to remove the "," (because these commas do not conform to the asteroid tabular convention) and this is done by sed 's/,//g'. I would say that this example is an excellent demonstration of the combined power of sed and awk.
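    The same trick is easier to follow on short records. In the minimal sketch below the four records are made up, the flag of interest sits in column 6 (so the commas go after characters 5 and 7), and the field is compared as a string so that a blank flag never counts as zero.

```shell
# A short made-up fixed-width record: the flag of interest sits in column 6.
records=$(printf '%s\n' 'K001 0 good' 'K002 3 poor' 'C003 0 comet' 'K004   blank')

# Put commas around column 6, let awk test that one-character field,
# then strip the commas again.
excellent=$(printf '%s\n' "$records" |
  sed -e 's/./&,/5' -e 's/./&,/7' |
  awk -F "," '/^K/ && $2 == "0" {print}' |
  sed 's/,//g')

printf '%s\n' "$excellent"
```

    Only the "K" record whose flag is 0 survives; the comet, the record with flag 3 and the record with a blank flag are all rejected.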

  20. Extract bibcodes from an ADS output file.
    If you are interested in bibliometrics then a common task is to find all the citations made to a paper with a given BIBCODE. Say that BIBCODE="1992ApJ...396...97R" (which is a paper that I wrote with Richard Rand when he was doing his PhD here). You would then launch

    $ curl 'http://adsabs.harvard.edu/cgi-bin/nph-ref_query?bibcode=BIBCODE&refs=CITATIONS&db_key=AST' -o ads.html

    Our goal is to write a file for each BIBCODE with the following output:
    line 1: BIBCODE
    line 2: number of citations to this BIBCODE
    line 3: bibcode of citation number 1
    line 4: bibcode for citation number 2
    ...

    Inspecting the file ads.html I found that the BIBCODE can be found on the following line

    <TITLE>Citation Query Results for 1992ApJ...396...97R</TITLE>

    Next, the number of citations was given on the following line

    Selected and retrieved <strong>80</strong> abstracts.

    Finally, the bibcodes of the citations appear on lines similar to

    ..<input type="checkbox" name="bibcode" value="2013A&amp;A...560A..42M"> ...

    I will set up three SED filters to identify each desired line(s)

    $ sed -n \
        -e 's/^<TITLE>.*for \(.*\)<\/TITLE>.*$/\1/p' \
        -e 's/^Selected and retrieved.*<strong>\([0-9]*\)<\/strong>.*$/\1/p' \
        -e 's/^.*name=\"bibcode\" value=\"\([^"]*\)\".*$/\1/p' \
        ads.html

    Another possible way to execute this long command is to put the sed commands into a file (say ads.sed):

    s/<TITLE>.*for \(.*\)<\/TITLE>.*$/\1/p
    s/^Selected and retrieved.*<strong>\([0-9]*\)<\/strong>.*$/\1/p
    s/^.*name=\"bibcode\" value=\"\([^"]*\)\".*$/\1/p

    and then execute

    $ sed -n -f ads.sed ads.html

    You will note that the execution is quite slow. It is well known that substitution can be sped up if the line filtering is done first and a substitution is attempted only on the selected lines. This can be accomplished with ads2.sed

    /<TITLE>.*for \(.*\)<\/TITLE>.*$/s//\1/p
    /^Selected and retrieved.*<strong>\([0-9]*\)<\/strong>.*$/s//\1/p
    /^.*name=\"bibcode\" value=\"\([^"]*\)\".*$/s//\1/p

    Above we make clever use of regular expressions: "s//\1/" means that the substitution command reuses the previous search pattern. Also note that the \(...\) group that \1 refers to is defined in the search pattern itself.
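    The address-then-empty-pattern idiom can be tried without downloading anything. This is a minimal sketch: the first two lines imitate the pieces of ads.html quoted above, and the third line is a made-up distractor.

```shell
# Three lines imitating the pieces of ads.html quoted above.
html=$(printf '%s\n' \
  '<TITLE>Citation Query Results for 1992ApJ...396...97R</TITLE>' \
  'Selected and retrieved <strong>80</strong> abstracts.' \
  '<p>some other line</p>')

# The address selects the line; the empty regex in s//.../ reuses that same
# pattern, so \1 refers to the group defined in the address.
extracted=$(printf '%s\n' "$html" | sed -n \
  -e '/^<TITLE>.*for \(.*\)<\/TITLE>.*$/s//\1/p' \
  -e '/^Selected and retrieved.*<strong>\([0-9]*\)<\/strong>.*$/s//\1/p')

printf '%s\n' "$extracted"
```

    The bibcode and the citation count are printed on successive lines; the distractor line matches neither address and is never touched by a substitution.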

  22. Join Regular lines to Special lines
    Say you have a file which has two types of lines: "regular" and "special" lines. You wish to join the special line(s) which follow a regular line onto that regular line.

    An example input file which reads as follows:

    1
    2
    a
    b
    3
    c
    4
    z
    
    $ ./RegularSpecial "[0-9][0-9]*" "[a-zA-Z][a-zA-Z]*" < RegularSpecialInput.txt
    produces
    1
    2ab
    3c
    4z
    
    Let me explain the program RegularSpecial.sed . To start with, do not use "sed", use "gsed". "sed" on Mac OSX is very buggy. In section A, I copy the two regular expressions (regular, special) into variables $R1, $R2. Since both are regular expressions, they can be "misunderstood", whence the use of '"$R1"' (the ' ' close and reopen the single-quoted sed script, and the inner " " let the shell substitute the variable while protecting "[", "*" and other symbols from being interpreted by the shell). Incidentally, we do not need to copy $1 to R1 etc. I have done this to make the script look nicer.

    In section B, I check to see if the first line is a special line. If so, the program quits because there is no way this special line can be attached to a preceding regular line. If it is a regular line then the first line is copied to the hold space and control falls to the bottom.

    The second line is read and it can (by construct) be either a regular line (in which case control goes to section C) or a special line (in which case control goes to section D).

    If you have reached section C then both the pattern space and hold space are occupied by regular lines (respectively, current and previous). The previous line is transferred to the pattern space, stripped of any embedded \n (newline) and printed. A check is made if the current line is the last line, in which case the line in the hold space is printed out.

    If you reach section D then it means that the current line is a special line. It is appended to the hold area. Again a special check is undertaken if the current line is the last line. Control falls to the bottom.

    #!/bin/sh
    
    #Join SPECIAL LINES (regexp R2) to prior REGULAR line (regexp R1)
    #Assume that file starts with REGULAR line
    
    R1=$1   #A
    R2=$2
    
    gsed -n '
    
    1{/'"$R1"'/!q      	#B first line must be regular. if special, quit
      h                     #populate the hold space with first line
      b}                    #fall to bottom to initiate next read
    
    	
    /'"$R1"'/{               #C  new line is REGULAR
    	x;s/\n//g;p;        #print line in hold (after removing \n)
                                #current line is now in hold
    	${x;p}              #special treatment if current line is last line
    	b
    	}
    
    
    /'"$R2"'/{	    	#D new line is SPECIAL 
    	H                   #append current line to hold space
    
    	${x;s/\n//g;p}      #special treatment if current line is last line
    }'
    

  24. Classic SED 1-liners
    See URL
     # double space a file
     sed G
    
     # insert a blank line below every line which matches "regex"
     sed '/regex/G'
    
     # insert a blank line above every line which matches "regex"
     sed '/regex/{x;p;x;}'
    
     # insert a blank line above and below every line which matches "regex"
     sed '/regex/{x;p;x;G;}'
    
     # count lines (emulates "wc -l")
     sed -n '$='
    
     # print line number 52
     sed -n '52p'                 # method 1
     sed '52!d'                   # method 2
     sed '52q;d'                  # method 3, efficient on large files
    
     sed -n '45,50p' filename           # print line nos. 45-50 of a file
     sed -n '51q;45,50p' filename       # same, but executes much faster for big files
    
    
     # join pairs of lines side-by-side (like "paste")
     sed '$!N;s/\n/ /'
    
    
     # remove most HTML tags (accommodates multiple-line tags)
     sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

    Probably this terse command is better understood as follows

     sed -f html.sh ads.html

    where html.sh is

     :a
     s/<[^>]*>//g
     /</N
     //b a
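    Several of these one-liners can be tried on a throwaway input. This is a minimal sketch: the ten numbered lines and the two-line HTML fragment are made up, and the line numbers in the addresses are scaled down to fit the small sample.

```shell
# Ten numbered sample lines (made up) to try the one-liners on.
ten=$(printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10)

count=$(printf '%s\n' "$ten" | sed -n '$=')          # count lines
fifth=$(printf '%s\n' "$ten" | sed '5q;d')           # print line 5, then stop reading
range=$(printf '%s\n' "$ten" | sed -n '4q;2,3p')     # lines 2-3, quitting at line 4
pairs=$(printf '%s\n' "$ten" | sed '$!N;s/\n/ /')    # join pairs of lines

# Strip tags, including one that is split across two lines.
notags=$(printf '%s\n' '<b>bold</b> and <a' 'href="x">link</a>' |
  sed -e :a -e 's/<[^>]*>//g;/</N;//ba')

printf '%s\n%s\n%s\n%s\n%s\n' "$count" "$fifth" "$range" "$pairs" "$notags"
```

    In the tag-stripping command the empty pattern "//" reuses "/</", so the loop back to ":a" is taken only while an unfinished tag remains in the pattern space.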

  26. Print the line number of lines with a given pattern
    Here is the file LineNumber
    #!/usr/bin/sed -nf
    
    s/Basic/*&*/
    t a
    b
    :a
    p
    =
    
    Since I like compact notation I attempted to combine all the commands on one line. However, labels must end with a newline and so the most compact form I could obtain was
    #!/usr/bin/sed -nf
    
    s/Basic/*&*/;t a
    b
    :a
    {p;=;}
    
    Executing this file as
    $ ./LineNumber Infile
    will identify all lines containing the word "Basic". Each such line will be printed after substituting "Basic" by "*Basic*" and followed by the line number. [Note that regular-expression meta-characters such as "*" have no special meaning in the replacement text; there only "&" and "\" are special.]

  28. Eliminate "the the" from a .tex file
    Consider the following The.txt file.
    When I write papers  I frequently find myself have one the follow another the.
    A simple example is "the star is the the  brightest known in that quadrant".
    A search for "the *the" is not good enough since you can have interlopers
    such as blithe the or tithe the  and other such unlikely or wrong combinations.
    
    The easiest to catch is when both the the's are on the same line. 
    
    The hardest is when one the is on line but the other the
    the is on the next line.
    Even harder is if you have two the's on one line and another
    pair split over the next line but all over the same two lines
    such as: the the brigthest star is really the
    the one we detected last year.
    
    Typically I want to find the offending line and then fix it with my own editor. So I also like to have the line number listed. Let me start with the simplest case. The phrase " the the " occurs on a single line.

    $ grep -n " the *the " The.txt

    will do the trick as does a somewhat longer command with sed

    $ sed -n '/ the *the /{
    =
    p
    }' The.txt

    This can be condensed to

    $ sed -n '/ the *the /{=;p;}' The.txt

    Next let us consider the case when one of the two "the" is on one line and the other on the next line (as is the case in the above example). In this case the following command does the trick.

    $ sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}' The.txt

    Here, "N" reads in the next line and appends it to the current line (in the buffer), and the command 's/\n/ /' replaces the newline separating the two lines by a " ". Following this, a search as before for two "the" is carried out and if successful the line number and the joined line are printed.

    However, if two consecutive lines end in "the" then this command fails to catch the second one. This can be addressed with increasingly complicated sed commands but I usually simply fix the offending lines found in the first cycle and then iterate until no additional such combinations are found.
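    The split-pair search can be tried on a few made-up lines. In the minimal sketch below the first two lines contain a "the" pair split across the line break, while the last two lines form a harmless pair; the inner braces make sure that the line number and the joined line are printed only when a pair is really found.

```shell
# Two made-up lines with a "the" pair split across the line break,
# followed by a harmless line ending in "the" and its continuation.
text=$(printf '%s\n' 'this is the other the' 'the second line' \
  'we saw the' 'star last night')

# Join a line ending in "the" with its successor and report only real pairs.
# "=" prints the number of the second line of the joined pair.
found=$(printf '%s\n' "$text" |
  sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}')

printf '%s\n' "$found"
```

    Because "=" runs after "N", the number reported is that of the second line of the pair, so the offending "the" sits at the end of the line just before it.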

    It is well worth reviewing the script the.sed

    / the *the /{
    =
    b
    }
    N
    h
    s/.*\n//
    / the *the /{
    =
    b
    }
    g
    s/ *\n/ /
    / the *the /{
    g
    =
    b
    }
    g
    D
    
    Execute as follows

    $ sed -f the.sed The.txt

    Separately, the most compact form for the script file, the.sed, is

    / the *the /{=;b
    };N;h;s/.*\n//;/ the *the /{=;b
    };g;s/ *\n/ /;/ the *the /{g;=;b
    };g;D
    
    Three commands: b, r and w must end with a label or file name (as the case may be) and then EOL.