UNIX POWER TOOLS

sed is very well suited to editing text files (a file consisting of lines, not necessarily of the same length).

awk is well suited to manipulating and editing files which consist of records (columns of data). Each line (record) can have variable number of entries (fields). Each record can have multiple lines as well.

In this first class we will focus on SED. In a later class we will look at AWK.

Class I: A Basic Course on SED
June 13, 2014 at 4p-5:30p, Cahill 219

[10 min]  Review Regular Expression (BRE or Basic Regular Expression)
[10 min]  BRE: Back-expressions, Already Matched, Word Matching,
[10 min]  grep 
[ 5 min]  Simple SED commands: p,=,q,w,r
[ 5 min]  Change of flow commands: n, d
[ 5 min]  The substitution command: s
[ 5 min]  Change commands: c,i,a,y
[ 5 min]  Commands involving buffer: h, x
[ 5 min]  Multiline SED commands: H,N,D,P
[ 5 min]  Control commands: t, b, :
[ 5 min]  Ganging up SED commands: Rules about "{..}", ";", EOL
[20 min]  Examples 
We will then work through the examples given below (until 5:30p).

Everything that SED and AWK can do can be done in Python or MATLAB or IDL.
However, for specific tasks UNIX tools can be quite powerful and also 
very compact.  I find it useful to invoke SED commands from MATLAB.
This gives me the best of both worlds.

The first order of business is that you have to become familiar with "regular expressions". This is absolutely basic to understanding not just UNIX but all modern programming languages. A good website for regular expressions (and SED and AWK) is grymoire.com . For this class we will use the "classic" regular expressions and not the "extended set".

I suggest that you look at the following two files: Practice sample and associated input file reg.txt. I will assume that you will be using the "bash" shell (Bourne Again Shell), whence the signature prompt "$". I would like to assume that you will be using "vi" but will forgive you if you are using "emacs". "Real-life" examples now follow.

"GREP" is a simple but powerful tool to search files for patterns specified by regular expressions.

$ grep 'regexp' infile
will search for a regular expression specified by "regexp" in infile.

Examples
$ grep 'SGR.*' paper.tex
will search the file paper.tex and display all lines containing "SGR" followed by any set of characters (including none).
$ grep '[0-9][0-9]*' input.dat
will search and display lines which contain one or more digits.

The following are interesting flags for "grep"

$ grep -A num 'pattern' infile
$ grep -B num 'pattern' infile
$ grep -C num 'pattern' infile
will display "num" lines after (A), before (B), or both before and after (C) the line in which the pattern is found.

$ grep -E 'pattern' infile
allows for use of extended regular expressions [in particular, +,?,|,()].

$ grep -v 'pattern' infile
will print lines that do NOT match the pattern.

It is useful to practice regular expressions with "grep" before moving onto SED.
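A minimal sketch for such practice, using a made-up three-line sample (the sample text and the variable names are assumptions, not class files). It checks that the classic form '[0-9][0-9]*' and the extended form '[0-9]+' select the same lines, and that -v selects the remaining ones.

```shell
# Hypothetical sample input (not one of the class files).
sample=$(printf 'SGR 1806-20\nPSR B1937+21\nno digits here\n')

# Classic (basic) regular expression: one or more digits.
with_digits=$(printf '%s\n' "$sample" | grep '[0-9][0-9]*')

# The same search written as an extended regular expression.
with_digits_ere=$(printf '%s\n' "$sample" | grep -E '[0-9]+')

# Inverted match: lines that contain no digit at all.
without_digits=$(printf '%s\n' "$sample" | grep -v '[0-9]')

printf '%s\n' "$with_digits"
printf '%s\n' "$without_digits"
```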

SED is an editor for "streaming data". It is a line-oriented editor. A line is read into the "pattern space", manipulated (by commands) and then output. The next line is read in and the same commands are applied. A group of lines identified via some rule (e.g. "from 1 through 10, inclusive", "not line 5", or "lines containing a regular expression") can have its own specific editing commands.

AWK is quite a sophisticated programming language. For this class we will make the simplest use of AWK.

The basic SED and AWK structure is

$ sed 'address1,address2{commands}' InputFile
$ awk ' ADDRESS/PATTERN {ACTION}'

The addresses are optional (in most instances). If no addresses are given then the command group or ACTION applies to all lines.

The main advantage of SED over, say, MATLAB or IDL is that you do not have to "open" and "close" files, nor "read in a line" nor "write a line out". It is the lack of these steps that allows you to undertake selective and comprehensive filtering with a few commands. The two basic uses of sed are via the command line or via a script file:

$ sed [-n] [-e 'address1,address2{commands}'] Infile

$ sed [-f scriptfile] Infile

Here "scriptfile" is a file containing SED commands. In the absence of the flag "-n" SED will "output" (print out) the (edited) pattern buffer and then go to the next input line. The flag "-e" allows for multiple commands (one set for each invocation of "-e").

Addressing in SED is of two types:

Line numbers. As in '1,10p' (print lines 1 through 10) or '100p' (print line 100).
Lines identified by regular expression. As in '/^From:/p' (print all lines which begin with "From:") or '/^From:/,/^End Of Message/d' (delete all lines starting with a line beginning with "From:" through to the line beginning with "End Of Message").
You can mix the two styles as in '1,/^From:/d'.

The default address is '1,$' (all lines).
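The three addressing styles can be tried on a small made-up message (the text and variable names below are assumptions for illustration only):

```shell
# Hypothetical five-line input.
msg=$(printf 'header\nFrom: alice\nbody line\nEnd Of Message\ntrailer\n')

# Line-number range: print lines 1 through 2 only (-n suppresses the default output).
first_two=$(printf '%s\n' "$msg" | sed -n '1,2p')

# Regular-expression range: delete from the "From:" line through "End Of Message".
kept=$(printf '%s\n' "$msg" | sed '/^From:/,/^End Of Message/d')

# Mixed: delete from line 1 through the first line beginning with "From:".
mixed=$(printf '%s\n' "$msg" | sed '1,/^From:/d')

printf '%s\n---\n%s\n---\n%s\n' "$first_two" "$kept" "$mixed"
```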

The following commands accept an address range: p,n,d,s,c,y,H,N,D,P. The following accept only one address: a,i,r,q,=. A summary of the commands can be found in SEDCommands.txt.

Deleting errant lines
The list of saved PTF transients can be retrieved with
$ wget --http-user=srk --http-passwd=RABBIT http://ptf.caltech.edu/cgi-bin/ptf/transient/name_radec.cgi -O Saved.txt
Yi Cao noted that the last line in this file "None" is garbage.
Old style: read in vi or emacs and then go to last line and delete the line and then save.
New style:
$ sed '$d' Saved.txt > Junk.txt; mv Junk.txt Saved.txt

Reformatting Target files
The format of an "Observation file" for the PTF Marshal is the following:
TargetMarshal.txt
```
14azi 08 23 53.64 +07 30 17.8 2000.0 ! P2 Classification | rubin: Possible bad subtraction (r=17.9)
14azj 08 25 45.82 +07 34 18.1 2000.0 ! P3 Classification | rubin: (r=18.3)
14azm 08 35 55.88 +14 05 15.6 2000.0 ! P2 Classification | rubin: (r=18.8)
14azo 08 40 41.92 +14 03 24.0 2000.0 ! P3 Classification | rubin: (r=19.1)
14ayo 12 06 03.00 +47 29 33.2 2000.0 ! P5 Classification SN II | ycao: (r=18.5)
14azl 12 26 36.03 +10 50 44.8 2000.0 ! P2 Classification | rubin: (r=18.8)
```
A user (Yi Cao) wishes to convert the above file to a format that is suited for the mighty 3.5-m telescope [this is the biggest 3.5-m telescope in the world, according to M. Kasliwal] of the APO Observatory:
```
14azi 08:23:53.64 +07:30:17.8 
14azj 08:25:45.82 +07:34:18.1 
14azm 08:35:55.88 +14:05:15.6 
14azo 08:40:41.92 +14:03:24.0 
14ayo 12:06:03.00 +47:29:33.2 
14azl 12:26:36.03 +10:50:44.8 
```
The following SED command will accomplish the task we set out to do
$ sed -e 's/2000\..*$//' -e 's/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g' TargetMarshal.txt
The first part of the script, 's/2000\..*$//', replaces all characters starting with "2000." through the end of the line ($) with no character (in effect, it deletes the characters). The second part of the script looks for three blank-separated fields, the first of which has one of the forms +nn, -nn or " nn" (where nn are two digits). Each such triplet is replaced by the same three fields but with ":" in place of the separating blanks. The set "+" or "-" or " " can be specified either as "[-+ ]" or "[- +]" or "[+ -]" or "[ +-]"; other orderings either result in an error or silently match the wrong set of characters. This is because "-" has a special meaning, that of a range indicator, when used within square brackets ([..]). For example, [a-z] matches any character between "a" and "z". The dash character loses its special meaning if it is next to either "[" or "]".
I find the above SED command too long! You can combine the two substitutions by using ";", as follows.
$ sed -e 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g' TargetMarshal.txt
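As a quick check, the two substitutions can be applied to a single line of TargetMarshal.txt. This is only a sketch: the variable names are made up, and the result is printed between brackets so that the trailing blank left behind by the first substitution is visible.

```shell
# One line taken from TargetMarshal.txt above.
line='14azi 08 23 53.64 +07 30 17.8 2000.0 ! P2 Classification | rubin: Possible bad subtraction (r=17.9)'

# Step 1: drop everything from "2000." to the end of the line.
# Step 2: turn "hh mm ss.ss" and "+dd mm ss.s" into colon-separated form.
apo=$(printf '%s\n' "$line" |
  sed -e 's/2000\..*$//' \
      -e 's/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g')

printf '[%s]\n' "$apo"
```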

Object IDs to http statements
Mansi would like to convert objects IDs into http statements. The objects IDs are stored in a file ID.txt
```
149646130
152523407
152549537
```
She would like to generate output along the following pattern:
<a href="http://ptf.nersc.gov/project/deepsky/ptfvet/iexamine.cgi?candid=149646130"> 149646130 </a>
$ sed 's;\(^.*\);<a href="http://ptf.nersc.gov/deepsky/ptfvet/iexamine.cgi?candid=\1">\1 </a>;' ID.txt
produces the following output
<a href="http://ptf.nersc.gov/deepsky/ptfvet/iexamine.cgi?candid=149646130">149646130 </a>
<a href="http://ptf.nersc.gov/deepsky/ptfvet/iexamine.cgi?candid=152523407">152523407 </a>
<a href="http://ptf.nersc.gov/deepsky/ptfvet/iexamine.cgi?candid=152549537">152549537 </a>
This is basic use of the back-reference (\1) feature. However, the pattern to be substituted has forward slashes (// and /). It is cumbersome to escape each of these forward slashes. So we make use of a cute feature of the substitution command, namely that "s" can use almost any character as the delimiter. In this case we use ";" as the delimiter. With this delimiter the forward slash is no longer special and need not be escaped.
When Mansi tried this approach the result was garbage. She traced this to the fact that every line in ID.txt ended with a "non-printable" character ("\r"). Some non-UNIX systems end their lines with this character (alone or followed by "\n") instead of the simple "\n" (new line) character. There are two ways to solve this problem. The first is to use a restricted search pattern, that is, "\(^[0-9]*\)" instead of "\(^.*\)". The second is to use "tr" to delete this character before feeding the file to SED.
[It is useful to view the file before unleashing SED.
$ sed -n 'l' Infile
will show the "non-print" characters clearly.]
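A minimal sketch of the "tr" route, assuming two IDs with DOS-style line endings. The URL is a shortened placeholder (example.invalid), not the real server path, and the variable names are made up.

```shell
# "sed -n l" makes the carriage return visible as \r and marks the line end with $.
visible=$(printf '149646130\r\n' | sed -n 'l')
printf '%s\n' "$visible"

# Delete every \r with tr, then build the links; ";" serves as the s delimiter
# so the forward slashes in the URL need no escaping.
links=$(printf '149646130\r\n152523407\r\n' | tr -d '\r' |
  sed 's;\(^.*\);<a href="http://example.invalid/cgi?candid=\1">\1 </a>;')
printf '%s\n' "$links"
```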

Convert a .csv file into a simple classical ASCII file.
My daughter is interested in public health and recently downloaded a file, MMR1.csv, from the Centers for Disease Control and Prevention (CDC). This file can be easily viewed in Excel (on your Mac, "$ open MMR1.csv" is sufficient). You can save it as a "csv" (comma-separated values) file. I wanted to convert this file into the simplest possible ASCII file that can be easily read by other programs.
Below I show how I used SED to accomplish this task. However, I needed to first set the "locale".
$ LANG=C
Some background. The original ASCII definition essentially used 7 bits of a byte. This corresponds to the "C" locale. Modern usage uses all 8 bits and MORE. For programming purposes (at least in the simple world I live in) we do not need umlauts and Sanskrit. In order to interpret control characters correctly the locale should be set to "C" (as above). This setting is consistent with the interpretation of ASCII characters as defined by the POSIX standards. You only need to set the locale once in a given terminal session. [Typing in "echo $LANG" shows that the default on my computer is "en_US.UTF-8". This does not correspond to the standard POSIX locale.]
Now coming to the main task at hand.
$ sed 's/[^[:print:]]//g' MMR1.csv | sed '1,6d;69,$d;s/([0-9]*.[0-9]*)//g;s/NA/-1/g'

The first command eliminates all non-print characters (this uses a POSIX-defined character class). In the second invocation of SED the non-data lines are removed and then the confidence levels [shown in (..)] are deleted, at my daughter's request. Finally all "NA" (not available) entries are converted to -1.
A good SED guru aims for the most compact command. So we can combine the two and accomplish the task as follows:
$ sed 's/[^[:print:]]//g;1,6d;69,$d;s/([0-9]*.[0-9]*)//g;s/NA/-1/g' MMR1.csv
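The effect of the three substitutions can be seen on a single made-up row. The row contents are an assumption (the real layout of MMR1.csv is not shown here), and the locale is set with LC_ALL=C for this one sed process only, which is a variation on the LANG=C setting above.

```shell
# A made-up row: a value with a bracketed confidence level, an "NA" entry,
# and a DOS carriage return at the end of the line.
row=$(printf 'Alabama,91.2 (3.4),NA\r\n' |
  LC_ALL=C sed 's/[^[:print:]]//g;s/([0-9]*.[0-9]*)//g;s/NA/-1/g')

# Brackets make the blank left in front of the deleted "(3.4)" visible.
printf '[%s]\n' "$row"
```

Note that the unescaped "." inside the parentheses matches any single character, not only a decimal point; writing "\." would restrict the match to a literal dot.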

Filtering out unwanted lines and extracting columns of measurements
A fairly common task in analyzing iPTF data is the analysis of photometric data. For each "event" a data file is created and has the following structure
```
Header1
Header2
A | B | C | ... |Z
A | B | C | ... |Z
....
(227 rows)
(blank line)
```
Your goal is to make a table with two or three parameters of interest (which translate to entries in column 1, 2, 3, ..). The combination of SED and AWK works very well in this context. You can use SED to filter out the lines and AWK to read specific columns. Below I delete the first two lines, the footer line beginning with "(" and the last line (which is a blank line), and pass the remaining lines to awk for column filtering.
$ sed '1,2d;/^(/d;$d' Infile | awk -F"|" '{print $2,$3}'
Apply this command to DKaaaa.txt , an iPTF file.
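For readers without DKaaaa.txt at hand, here is a sketch on a made-up file with the same layout (two header lines, "|"-separated rows, a footer in parentheses and a final blank line). The contents and variable names are assumptions.

```shell
# Hypothetical input with the structure described above.
table=$(printf 'Header1\nHeader2\n1 | 2 | 3\n4 | 5 | 6\n(2 rows)\n\n'; printf 'x')
table=${table%x}   # keep the trailing blank line that $(...) would otherwise strip

# sed removes headers, footer and the last (blank) line; awk picks columns 2 and 3.
cols=$(printf '%s' "$table" | sed '1,2d;/^(/d;$d' | awk -F"|" '{print $2,$3}')

printf '%s\n' "$cols"
```

The fields keep the blanks that surround each "|" in the input; a second pass through awk with its default field splitting removes them if needed.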

Add a running index to each line of a file.
You may find it desirable to add a running line number for a text file, say Kitty.txt
```
Hello my name is Kitty
Really it is something else...

But for now let us say it is Kitty
```
Apparently there is a huge demand for this task, given the many ways the task can be accomplished.
- Use the Unix command nl
 $ nl Kitty.txt #same as flags set to -b t
 $ nl -b a Kitty.txt #number all lines
 $ nl -b t Kitty.txt #number only non-empty lines
 $ nl -n ln Kitty.txt #left justified
 $ nl -n rn Kitty.txt #right justified
 $ nl -n rz Kitty.txt #right justified with leading zeros
 It is worth noting (for future use) the following feature of "nl"
 $ nl -b pREGULAR Infile
 will number only those lines which contain the regular expression REGULAR.
- Using SED
 $ sed '=' Kitty.txt | sed 'N;s/\n/ /'
 The command "=" prints the line number. Since we have not given a "-n" flag to sed, each line is also printed (after its number). These two lines serve as input for the next invocation of sed. The first line is read into the pattern buffer by default and the second line is appended to the pattern buffer by "N" (with an "\n" separating the two lines). The substitution command replaces the "\n" with a blank.
- Using AWK
 $ awk '{print NR, $0}' Kitty.txt
```
1 Hello my name is Kitty
2 Really it is something else...
3 
4 But for now let us say it is Kitty
```
- Using GREP
 Another approach is
 $ grep -n ".*" Kitty.txt
```
1:Hello my name is Kitty
2:Really it is something else...
3:
4:But for now let us say it is Kitty
```
 It is easy to get rid of the ":" by piping to SED
 $ grep -n ".*" Kitty.txt | sed 's/:/ /1'

Delete all but one blank line
Say you have a file with paragraphs separated by groups of blank lines of varying lengths. You would like to replace each such group of blank lines by a single blank line.
BlankLines.txt
```
One blank lines follows.

Two blank lines follow.


Three blank lines follows.



No more lines
```
$ sed '/^$/{
N
/\n$/D
}' BlankLines.txt
produces
```
One blank lines follows.

Two blank lines follow.

Three blank lines follows.

No more lines
```
It is worth understanding the flow. Each line is read into the "pattern space" (input buffer) and then checked to see if it is a blank line. If not, given that the "-n" flag has not been set, the pattern space is printed out and the next line is read in. If the line is a blank line then the command or "list of commands" (the list is the one enclosed in braces) is executed sequentially. The "N" command reads in the next line and appends it to the pattern buffer (with "\n" separating the old line from the new line). We then see if the second line is also a blank line ("/\n$/") and if so the first line is deleted. For both "D" and "d" the control goes back to the beginning of the script; the difference is that "D" deletes only up to the first "\n" and, if something remains in the pattern buffer (here, the second blank line), starts over without reading a new line. So the next line is read in by "N" and the cycle is repeated until a non-blank line is read in. At this point the pattern buffer, which consists of a blank line followed by "\n" and the non-blank line, is printed out.
Above I have used the strict rules for a list of commands, namely, "{" must be succeeded by EOL, "}" must be preceded and succeeded by EOL, and all commands between the braces ("{ }") should likewise be preceded and succeeded by EOL. Fortunately, in SED, ";" is equivalent to EOL for most commands. Thus the following will also work:
$ sed '/^$/{N;/\n$/D;}' BlankLines.txt
Note the last command must end with ";" (this is logically consistent with what I stated above).
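The flow can be checked on a compact made-up input in which the groups of blank lines have lengths one, two and three (the letters stand in for paragraphs; the variable name is an assumption):

```shell
# a, one blank, b, two blanks, c, three blanks, d
squeezed=$(printf 'a\n\nb\n\n\nc\n\n\n\nd\n' | sed '/^$/{N;/\n$/D;}')

# Every group of blank lines should now be a single blank line.
printf '%s\n' "$squeezed"
```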

Update my daily on-line calendar
My calendar is on line. It is a simple php file with an html beginning block followed by plain ASCII block with "-------" marking the division.
srk.plan.php
[Contact]: Hopeless at returning phone calls. Call secretary Bronagh Glaser (+1 626 395 3734) or bglaser@astro.caltech.edu
Return to homepage 2014 Schedule (past) 2013 Schedule (old)
```
------------------------------------------------------------------------
Jun 9* 11a/KeckSearch 12n/lunch
Jun 10 Jean Muller lunch
Jun 11 PS/lunch
Jun 12
Jun 13
Jun 14 [e] Helin ceremony; depart to Russia
Jun 15 [e]
Jun 16* [e] Zeldovich conference, Russia
...
```
At the start of a day, say June 10, I would like to move the line corresponding to June 9 to an archival file (srk.plan.2014). The script UpDateDailyCalendar does the job. It is a nice example showcasing the usage of the SED "r" and "w" commands.
UpDateDailyCalendar
```
#! /bin/bash
#Execute as ./UpDateDailyCalendar at the beginning of the day

cp srk.plan.php srk.plan.php.TEMP
cp srk.plan.2014 srk.plan.2014.TEMP

sed '/^--*$/{n;w out
d;}' srk.plan.php > b.txt

mv b.txt srk.plan.php

sed '$r out'< srk.plan.2014 > b.txt

rm out
mv b.txt srk.plan.2014 

echo "moved top line from srk.plan.php to bottom line of srk.plan.2014"
echo "password please (to transfer to anju.caltech.edu)"
scp srk.plan.php srk.plan.2014 srk@anju.caltech.edu:public_html
```
Note that the "w out" and "r out" commands have to be on their own lines ALWAYS (the ";" trick does not work). Also you need exactly one space between "r" or "w" and the filename.
[In order to make this file executable you will have to "chmod +x UpDateDailyCalendar". Execution requires invoking "./UpDateDailyCalendar" (if you want to execute it as "UpDateDailyCalendar" then your PATH must include the directory in which the executable is placed.)]
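The "w" and "r" steps can be tried safely in a scratch directory. The sketch below uses made-up stand-ins for the calendar and the archive (file names "plan", "archive" and "out" under a temporary directory are assumptions) and leaves the real files untouched.

```shell
work=$(mktemp -d)

# Stand-ins for srk.plan.php (html block, divider, entries) and the archive.
printf 'html header\n-----\nJun 9\nJun 10\n' > "$work/plan"
printf 'Jun 8\n' > "$work/archive"

# At the divider: print it, fetch the next line (n), write that line to "out" (w),
# then delete it from the output (d). The file name after "w" runs to end of line.
sed '/^--*$/{n;w '"$work"'/out
d;}' "$work/plan" > "$work/plan.new"

# Append the contents of "out" after the last line of the archive (r).
sed '$r '"$work"'/out' "$work/archive" > "$work/archive.new"

plan_new=$(cat "$work/plan.new")
archive_new=$(cat "$work/archive.new")
rm -rf "$work"

printf '%s\n---\n%s\n' "$plan_new" "$archive_new"
```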

Converting DS9 region files to RA (degrees), Dec (degrees)
Yi Cao would like to convert the coordinates of objects in ds9.reg, a ds9 region file, from sexagesimal units to degrees.
```
% Region file format: DS9 version 4.1
%global color=green dashlist=8 3 width=1 font="helvetica 10 normal roman" select=1 highlite=1 dash=0 fixed=0 edit=1 move=1 delete=1 include=1 source=1
%fk5
circle(22:22:49.519,+36:18:11.53,5.05")
circle(22:22:57.390,+36:16:25.18,5.05")
circle(22:22:52.291,+36:17:39.66,5.05")
circle(22:22:33.037,+36:16:49.72,5.05")
```
This can be accomplished with the following SED script or macrofile, RADEC_ds9
```
#!/bin/bash
#chmod +x RADEC_ds9
# USAGE: ./RADEC_ds9 ds9.reg %
#first parameter is ds9 file 
#second parameter is comment character. These lines are not analyzed.
#
sed -e '/^'"$2"'/d' -e 's/circle(//;s/")//;s/:/,/g' -e 's/+/+1,/' -e 's/-/-1,/' "$1" | \
awk -F"," '{ra=($1*3600+$2*60+$3)/3600;dec=($4*($5*3600+$6*60+$7)/3600); \
print "ra(d)=" ra, "dec(d)=" dec}'
```
In this script file we first use SED to get rid of the extraneous material and then pass the data to AWK for arithmetic processing and display. It is worth noting how we have passed the shell parameters ($1, $2) to SED.
The action of SED is as follows: delete all lines starting with the second command-line parameter (which in this case is "%"). Next strip off "circle(" and the closing '")', and replace each ":" by a ",". Then split off the sign of the declination degrees as a field of its own (+1 or -1). The resulting values (rah, ram, rass, decsign, decd, decm, decss) are passed to AWK as fields (the field separator -F is set to ","). The necessary simple arithmetic is done by AWK and printed out. Note that the RA arithmetic as written yields decimal hours; multiply by 15 to obtain degrees. The usage is shown below.
$ ./RADEC_ds9 ds9.reg %
yields
```
ra(d)=22.3804 dec(d)=36.3032
ra(d)=22.3826 dec(d)=36.2737
ra(d)=22.3812 dec(d)=36.2944
ra(d)=22.3758 dec(d)=36.2805
```
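Since right ascension in sexagesimal form is in hours, obtaining degrees requires a factor of 15 (24 hours = 360 degrees). The following is a sketch of that variant on the first region of ds9.reg, not the script as distributed; the variable names and the four-decimal output format are assumptions.

```shell
# First circle from ds9.reg above.
region='circle(22:22:49.519,+36:18:11.53,5.05")'

# Same sed clean-up as in RADEC_ds9, then awk converts:
#   RA:  hours -> degrees (factor 15); Dec: sign * (d + m/60 + s/3600).
radec=$(printf '%s\n' "$region" |
  sed -e 's/circle(//;s/")//;s/:/,/g' -e 's/+/+1,/' -e 's/-/-1,/' |
  awk -F"," '{ra_h = $1 + $2/60 + $3/3600
              dec  = $4 * ($5 + $6/60 + $7/3600)
              printf "%.4f %.4f\n", 15*ra_h, dec}')

printf '%s\n' "$radec"
```

Keeping the sign as a separate field means a declination such as -00:30:00 still comes out negative, which would not be the case if the sign were left attached to the degrees field.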
Asteroid & Comet Orbital Parameters
The data for asteroids and comets are, by convention, stored as ascii files whose lines are 202 characters long. The first character of each line has the following meaning: "%" is a comment line, "K" is an entry for an asteroid and "C" is an entry for a comet. Each line has either 25 or 26 fields (field 16 is either yyyy-yyyy or "nn days"; in the first case it is only one field and in the other case it is two fields). Field 12 is the quality of the orbit. If this field is zero then the orbit is excellent.
CometAsteroid.txt
```
% -----------------------------------------------------------------------------
% Near-Earth asteroids discovered by PTF (10 total)
%---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
%---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
K14K00D 24.4 0.15 K145N 354.10001 39.43772 226.49863 5.24544 0.5466273 0.31160391 2.1547693 3 E2014-K20 68 1 4 days 0.43 M-v 3Eh MPCADO 0803 2014 KD 17.9 20140521
K14J55G 29.2 0.15 K145N 348.77840 223.02395 49.93575 8.74017 0.4128117 0.49394282 1.5849598 5 MPO293441 37 1 0 days 1.00 M-v 3Eh MPCW 0803 2014 JG55 15.9 20140510
K13V04V 19.7 0.15 K145N 46.57572 167.65980 231.90925 19.10378 0.5590444 0.21503219 2.7593094 3 MPO285783 93 1 79 days 0.22 M-v 38h MPCADO 0804 2013 VV4 19.9 20140121
K13P06V 19.2 0.15 K145N 77.66743 161.29813 182.47902 6.17045 0.4700267 0.28608351 2.2810816 0 MPO278279 358 4 1951-2013 0.32 M-v 38h MPCADO 0804 2013 PV6 18.6 20131202
K13O05O 20.9 0.15 K145N 70.52000 154.89703 159.23828 10.23810 0.5215132 0.23627753 2.5913204 4 MPO273983 101 1 68 days 0.23 M-v 38h MPCALB 2804 2013 OO5 20.0 20131006
K13J14H 22.2 0.15 K145N 150.04931 356.84504 152.88209 2.63972 0.5721744 0.34294191 2.0214159 6 MPO263239 32 1 36 days 0.34 M-v 3Eh MPCADO 2803 2013 JH14 20.9 20130610
K13J01E 19.1 0.15 K145N 165.12527 163.04354 185.78403 32.67397 0.4094445 0.61107138 1.3753337 2 MPO285771 69 3 2000-2014 0.26 M-v 3Eh MPCALB 0803 2013 JE1 20.3 20140123
K13H11N 22.0 0.15 K145N 105.65226 144.03216 59.61312 6.32099 0.5224299 0.26477126 2.4019031 5 MPO265065 57 1 42 days 0.36 M-v 38h MPCADO 2804 2013 HN11 20.0 20130530
```
Adam Waszczak would like to search through CometAsteroid.txt and identify asteroids with excellent orbital parameters.
awk '/^ *K/ && $12==0 {print}' CometAsteroid.txt
We use two filters to identify the desired objects: lines starting with "K" or " K" (asteroids) and those whose field value $12 is zero. Only those lines are printed.
However this is not a robust solution since the asteroid and comet convention is by column number. In particular, it is possible that some fields are "empty". In this case the above script will fail. By convention, column 106 is assigned to the quality index. Thus, if we wanted to print out only those asteroids with high quality (index=0) and be ready to accommodate missing "entries" then the following command does the job (this command was formulated by Adam Waszczak):
$ sed -e 's/./&,/105' -e 's/./&,/107' InputFile | awk -F "," '/^K/ && $2==0 {print}' | sed 's/,//g'
There are three parts: sed, awk and then sed. Consider the first part. The command s/./&,/105 inserts a comma after the 105th character and s/./&,/107 inserts a comma after the 107th character of the resulting line. In effect, the old column 106 is now surrounded by ",". If only this first part is executed you will get the following output:
```
K14K00D 24.4 0.15 K145N 354.10001 39.43772 226.49863 5.24544 0.5466273 0.31160391 2.1547693 ,3, E2014-K20 68 1 4 days 0.43 M-v 3Eh MPCADO 0803 2014 KD 17.9 20140521
K14J55G 29.2 0.15 K145N 348.77840 223.02395 49.93575 8.74017 0.4128117 0.49394282 1.5849598 ,5, MPO293441 37 1 0 days 1.00 M-v 3Eh MPCW 0803 2014 JG55 15.9 20140510
K13V04V 19.7 0.15 K145N 46.57572 167.65980 231.90925 19.10378 0.5590444 0.21503219 2.7593094 ,3, MPO285783 93 1 79 days 0.22 M-v 38h MPCADO 0804 2013 VV4 19.9 20140121
K13P06V 19.2 0.15 K145N 77.66743 161.29813 182.47902 6.17045 0.4700267 0.28608351 2.2810816 ,0, MPO278279 358 4 1951-2013 0.32 M-v 38h MPCADO 0804 2013 PV6 18.6 20131202
K13O05O 20.9 0.15 K145N 70.52000 154.89703 159.23828 10.23810 0.5215132 0.23627753 2.5913204 ,4, MPO273983 101 1 68 days 0.23 M-v 38h MPCALB 2804 2013 OO5 20.0 20131006
K13J14H 22.2 0.15 K145N 150.04931 356.84504 152.88209 2.63972 0.5721744 0.34294191 2.0214159 ,6, MPO263239 32 1 36 days 0.34 M-v 3Eh MPCADO 2803 2013 JH14 20.9 20130610
K13J01E 19.1 0.15 K145N 165.12527 163.04354 185.78403 32.67397 0.4094445 0.61107138 1.3753337 ,2, MPO285771 69 3 2000-2014 0.26 M-v 3Eh MPCALB 0803 2013 JE1 20.3 20140123
K13H11N 22.0 0.15 K145N 105.65226 144.03216 59.61312 6.32099 0.5224299 0.26477126 2.4019031 ,5, MPO265065 57 1 42 days 0.36 M-v 38h MPCADO 2804 2013 HN11 20.0 20130530
K13G79Z 19.9 0.15 K145N 158.34646 102.80295 155.63692 15.98906 0.2906355 0.45529103 1.6734394 ,1, MPO271295 195 2 2002-2013 0.34 M-v 38h MPCADO 0804 2013 GZ79 19.7 20130831
K13EC8W 20.3 0.15 K145N 102.24712 106.73109 120.98740 7.35944 0.6012466 0.27298205 2.3534952 ,4, MPO268498 90 1 140 days 0.35 M-v 3Eh MPCALB 2803 2013 EW128 18.5 20130802
```
Note that the length of each record has increased by 2 characters. This output is then fed to awk with the delimiter set to "," (-F ","). The characters from the start of the line to column 105 form the first field, the character at column 107 is the second field and those from column 109 to the end of the line form the third field. The second field is then inspected for a zero value and if so the line is printed. The final step is to remove the "," (because these commas do not conform to the asteroid tabular convention) and this is done by sed 's/,//g'. I would say that this example is an excellent demonstration of the combined power of sed and awk.
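The same idea can be tried on short made-up records in which the quality digit sits in column 6 rather than 106 (the records and variable names are assumptions). The sketch also shows, as an alternative that is not in the original recipe, testing the column directly with awk's substr().

```shell
# Made-up fixed-width records: type letter in column 1, quality digit in column 6.
records=$(printf 'Kabcd0rest\nKefgh3rest\nCijkl0rest\n')

# Fence off column 6 with commas, filter on it with awk, then remove the commas.
good=$(printf '%s\n' "$records" |
  sed -e 's/./&,/5' -e 's/./&,/7' |
  awk -F"," '/^K/ && $2==0 {print}' |
  sed 's/,//g')

# Alternative: inspect column 6 directly, with no commas inserted or removed.
good_substr=$(printf '%s\n' "$records" | awk '/^K/ && substr($0, 6, 1) == "0"')

printf '%s\n%s\n' "$good" "$good_substr"
```

The substr() form has the side benefit of not disturbing records that already contain a comma somewhere in the line.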

Extract bibcodes from an ADS output file.

If you are interested in bibliometrics then a common task is to find all the citations made to a paper with a given BIBCODE. Say that BIBCODE="1992ApJ...396...97R" (which is a paper that I wrote with Richard Rand when he was doing his PhD here). You would then launch
$ curl "http://adsabs.harvard.edu/cgi-bin/nph-ref_query?bibcode=$BIBCODE&refs=CITATIONS&db_key=AST" -o ads.html
Our goal is to write a file for each BIBCODE with the following output:
line 1: BIBCODE
line 2: number of citations to this BIBCODE
line 3: bibcode of citation number 1
line 4: bibcode for citation number 2
...

Inspecting the file ads.html I found that the BIBCODE can be found on the following line

<TITLE>Citation Query Results for 1992ApJ...396...97R</TITLE>
Next, the number of citations was given on the following line
Selected and retrieved <strong>80</strong> abstracts.
Finally, the bibcodes of the citations appear on lines similar to
..<input type="checkbox" name="bibcode" value="2013A&A...560A..42M"> ...
I will set up three SED filters to identify each desired line(s)

$ sed -n \
 -e 's/^<TITLE>.*for \(.*\)<\/TITLE>.*$/\1/p' \
 -e 's/^Selected and retrieved[^0-9]*\([0-9]*\)<\/strong>.*$/\1/p' \
 -e 's/^.*name="bibcode" value="\([^"]*\)".*$/\1/p' \
 ads.html
(In the second filter "[^0-9]*" is used in front of the group rather than ".*", because a greedy ".*" would swallow the digits and leave \1 empty.)
Another possible way to execute this long command is to put the sed commands into a file (say ads.sed):
s/<TITLE>.*for \(.*\)<\/TITLE>.*$/\1/p
s/^Selected and retrieved[^0-9]*\([0-9]*\)<\/strong>.*$/\1/p
s/^.*name="bibcode" value="\([^"]*\)".*$/\1/p
and then execute
$ sed -n -f ads.sed ads.html
You will note that the execution is quite slow. It is well known that substitution can be sped up if the line filtering is done first and the substitution is attempted only on the lines that pass. This can be accomplished with ads2.sed
/<TITLE>.*for \(.*\)<\/TITLE>.*$/s//\1/p
/^Selected and retrieved[^0-9]*\([0-9]*\)<\/strong>.*$/s//\1/p
/^.*name="bibcode" value="\([^"]*\)".*$/s//\1/p
Above we make clever use of regular expressions: "s//\1/" means use the previous search pattern (here, the address) for the substitution command. Also note that the \( \) group which \1 refers to is defined in the address pattern itself.
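The address-then-substitute form can be tried without downloading anything. The three input lines below are a made-up miniature of ads.html; in particular, the <strong> tags around the count and the count of 2 are assumptions about the raw page, and "[^0-9]*" is used before the digit group so that the digits are not consumed by a greedy ".*".

```shell
# Hypothetical stand-in for the three interesting kinds of line in ads.html.
html=$(printf '%s\n' \
  '<TITLE>Citation Query Results for 1992ApJ...396...97R</TITLE>' \
  'Selected and retrieved <strong>2</strong> abstracts.' \
  '<input type="checkbox" name="bibcode" value="2013A&A...560A..42M">')

# Each address selects a line; the empty pattern in s//\1/ reuses that address.
summary=$(printf '%s\n' "$html" | sed -n \
  -e '/<TITLE>.*for \(.*\)<\/TITLE>.*$/s//\1/p' \
  -e '/^Selected and retrieved[^0-9]*\([0-9]*\)<\/strong>.*$/s//\1/p' \
  -e '/^.*name="bibcode" value="\([^"]*\)".*$/s//\1/p')

printf '%s\n' "$summary"
```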

Join Regular lines to Special lines
Say you have a file which has two types of lines: "regular" and "special" lines. You wish to join special line(s) which succeed a regular line together.
An example input file which reads as follows:
```
1
2
a
b
3
c
4
z
```
$ ./RegularSpecial "[0-9][0-9]*" "[a-zA-Z][a-zA-Z]*" < RegularSpecialInput.txt
produces
```
1
2ab
3c
4z
```
Let me explain the program RegularSpecial. To start with, do not use "sed", use "gsed". "sed" on Mac OSX is very buggy. In section A, I copy the two regular expressions (regular, special) into the variables $R1, $R2. Since both are regular expressions, they can be "misunderstood" by the shell, whence the use of '"$R1"': the first ' closes the quoted SED script, the inner " " protect "[", "*" and other symbols in the variable from being interpreted by the shell, and the second ' reopens the script. Incidentally, we do not need to copy $1 to R1 etc. I have done this to make the script look nicer.
In section B, I check to see if the first line is a special line. If so, the program quits because there is no way this special line can be attached to a preceding regular line. If it is a regular line then the first line is copied to the hold space and the control falls to the bottom.
The second line is read and it can (by construction) be either a regular line (in which case control goes to section C) or a special line (in which case control goes to section D).
If you have reached section C then the current line, in the pattern space, is a regular line and the hold space contains the previous regular line (together with any special lines appended to it). The previous line is swapped into the pattern space, stripped of any \n (newline) and printed. A check is made if the current line is the last line, in which case the line now in the hold space is printed out as well.
If you reach section D then it means that the current line is a special line. It is appended to the hold space. Again a special check is undertaken if the current line is the last line. Control falls to the bottom.
```
#!/bin/sh

#Join SPECIAL LINES (regexp R2) to prior REGULAR line (regexp R1)
#Assume that file starts with REGULAR line

R1=$1 #A
R2=$2

gsed -n '

1{/'"$R1"'/!q 	#B first line must be regular. if special, quit
 h #populate the hold space with first line
 b} #fall to bottom to initiate next read

	
/'"$R1"'/{ #C new line is REGULAR
	x;s/\n//g;p; #print line in hold (after removing \n)
 #current line is now in hold
	${x;p} #special treatment if current line is last line
	b
	}


/'"$R2"'/{	 	#D new line is SPECIAL 
	H #append current line to hold space

	${x;s/\n//g;p} #special treatment if current line is last line
}'
```
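For comparison, the same joining can be sketched in AWK, which avoids the dependence on "gsed". This is an alternative technique, not the script above: the variable names are made up, and unlike RegularSpecial it does not quit when the file starts with a special line (such a line is simply printed on its own).

```shell
joined=$(printf '1\n2\na\nb\n3\nc\n4\nz\n' |
  awk -v special='^[a-zA-Z][a-zA-Z]*$' '
    $0 ~ special { buf = buf $0; next }   # special line: glue it onto the held line
    NR > 1       { print buf }            # regular line: flush the previous group
                 { buf = $0 }             # and start a new group with this line
    END          { if (NR) print buf }    # flush the last group
  ')

printf '%s\n' "$joined"
```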
Classic SED 1-liners
See URL
```
 # double space a file
 sed G

 # insert a blank line below every line which matches "regex"
 sed '/regex/G'

 # insert a blank line above every line which matches "regex"
 sed '/regex/{x;p;x;}'

 # insert a blank line above and below every line which matches "regex"
 sed '/regex/{x;p;x;G;}'

 # count lines (emulates "wc -l")
 sed -n '$='

 # print line number 52
 sed -n '52p' # method 1
 sed '52!d' # method 2
 sed '52q;d' # method 3, efficient on large files

 sed -n '45,50p' filename # print line nos. 45-50 of a file
 sed -n '51q;45,50p' filename # same, but executes much faster for big files


 # join pairs of lines side-by-side (like "paste")
 sed '$!N;s/\n/ /'
```
 # remove most HTML tags (accommodates multiple-line tags)
 sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
Probably this terse command is better understood as follows
 sed -f html.sh ads.html
where html.sh is
:a
s/<[^>]*>//g
/</N
//b a
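A small sketch of the tag stripper on made-up HTML in which one tag is split across two lines (the input text and variable names are assumptions):

```shell
# The <a ...> tag starts on the first line and ends on the second.
html=$(printf 'A <b>bold</b> word and <a\nhref="x">link</a>\n')

# Strip complete tags; if a "<" is left over, append the next line (N) and,
# since the empty pattern // reuses /</, branch back to :a and try again.
text=$(printf '%s\n' "$html" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba')

printf '%s\n' "$text"
```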

Print the line number of lines with a given pattern
Here is the file LineNumber
```
#!/usr/bin/sed -nf

s/Basic/*&*/
t a
b
:a
p
=
```
Since I like compact notation I attempted to combine all the commands on one line. However, labels must end with a newline and so the most compact I could obtain was
```
#!/usr/bin/sed -nf

s/Basic/*&*/;t a
b
:a
{=;p;}
```
Executing this file as
$ ./LineNumber Infile
will identify all lines containing the word "Basic". Each such line will be printed after substituting "Basic" by "*Basic*" and followed by the line number. [Note that all meta-characters such as "*" have no special meaning in the pattern to be substituted.]
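If the branching is not needed for its own sake, an address can do the selection instead of "t" and "b". The following one-line sketch is an alternative formulation, not the LineNumber script itself; the two input lines are made up.

```shell
# Select lines containing "Basic", decorate the match (the empty pattern reuses
# the address), then print the line followed by its line number.
marked=$(printf 'Advanced topics\nA Basic Course\n' |
  sed -n '/Basic/{s//*&*/;p;=;}')

printf '%s\n' "$marked"
```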

Eliminate "the the" from a .tex file
Consider the following The.txt file.
```
When I write papers I frequently find myself have one the follow another the.
A simple example is "the star is the the brightest known in that quadrant".
A search for "the *the" is not good enough since you can have interlopers
such as blithe the or tithe the and other such unlikely or wrong combinations.

The easiest to catch is when both the the's are on the same line. 

The hardest is when one the is on line but the other the
the is on the next line.
Even harder is if you have two the's on one line and another
pair split over the next line but all over the same two lines
such as: the the brigthest star is really the
the one we detected last year.
```
Typically I want to find the offending line and then fix it with my own editor. So I also like to have the line number listed. Let me start with the simplest case. The phrase " the the " occurs on the same line.
$ grep -n " the *the " The.txt
will do the trick as does a somewhat longer command with sed
$ sed -n '/ the *the /{
=
p
}' The.txt
This can be condensed to
$ sed -n '/ the *the /{=;p;}' The.txt
Next let us consider the case when one of the two "the" is on one line and the other on the next line (as is the case in the above example). In this case the following command does the trick.
$ sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}' The.txt
Here, "N" reads in the next line and appends it to the current line (in the buffer), and the command 's/\n/ /' replaces the newline separating the two lines with a " ". Following this, a search as before for two "the" is carried out and, if successful, the line number (that of the second line) and the joined line are printed.
However, if two consecutive lines end in "the" then this command fails to catch the second one. This can be addressed with increasingly complicated sed commands but I usually simply fix the offending lines found in the first cycle and then iterate until no additional such combinations are found.
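A small sketch of the split-line case on a made-up three-line input (the text and variable names are assumptions). Here "=" and "p" are grouped in inner braces so that both fire only when the joined line really contains the doubled word.

```shell
# "the" ends the first line and starts the second.
hits=$(printf 'one the\nthe two\nthree\n' |
  sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}')

# Prints the number of the second line, then the two lines joined by a blank.
printf '%s\n' "$hits"
```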
It is well worth reviewing the script the.sed
```
/ the *the /{
=
b
}
N
h
s/.*\n//
/ the *the /{
=
b
}
g
s/ *\n/ /
/ the *the /{
g
=
b
}
g
D
```
Execute as follows
$ sed -f the.sed The.txt
Separately, the most compact form for the script file, the.sed, is
```
/ the *the /{=;b
};N;h;s/.*\n//;/ the *the /{=;b
};g;s/ *\n/ /;/ the *the /{g;=;b
};g;D
```
Three commands: b, r and w must end with a label or file name (as the case may be) and then EOL.
