awk is well suited to manipulating and editing files which consist of records (columns of data). Each line (record) can have a variable number of entries (fields). A record can also span multiple lines.
In this first class we will focus on SED. In a later class we will look at AWK.
[10 min] Review Regular Expression (BRE or Basic Regular Expression)
[10 min] BRE: Back-expressions, Already Matched, Word Matching
[10 min] grep
[ 5 min] Simple SED commands: p, =, q, w, r
[ 5 min] Change of flow commands: n, d
[ 5 min] The substitution command: s
[ 5 min] Change commands: c, i, a, y
[ 5 min] Commands involving buffer: h, x
[ 5 min] Multiline SED commands: H, N, D, P
[ 5 min] Control commands: t, b, :
[ 5 min] Ganging up SED commands: Rules about "{..}", ";", EOL
[20 min] Examples

We will then work through the examples given below (until 5:30p).

Everything that SED and AWK can do can be done in Python or MATLAB or IDL. However, for specific tasks UNIX tools can be quite powerful and also very compact. I find it useful to invoke SED commands from MATLAB. This gives me the best of both worlds.
I suggest that you look at the following two files: Practice sample and associated input file reg.txt. I will assume that you will be using the "bash" shell (Bourne Again Shell), whence the signature prompt "$". I would like to assume that you will be using "vi" but will forgive you if you are using "emacs". "Real-life" examples now follow.
$ grep 'regexp' infile
will search for a regular expression specified by "regexp"
in infile.
Examples
$ grep 'SGR.*' paper.tex
will search and display the file paper.tex for all lines
containing any pattern which has "SGR followed by any set
of characters".
$ grep '[0-9][0-9]*' input.dat
will search and display lines which contain one or more digits.
The following are interesting flags for "grep"
$ grep -A num 'pattern' infile
$ grep -B num 'pattern' infile
$ grep -C num 'pattern' infile
will display "num" lines after (A), before (B), or both before and after (C)
the line in which the pattern is found.
$ grep -E 'pattern' infile
allows for use of extended regular expressions [in particular,
+,?,|,()].
$ grep -v 'pattern' infile
will print lines that do NOT match the pattern.
It is useful to practice regular expressions with "grep" before moving onto SED.
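A quick way to practice is to feed grep a few lines from printf rather than a real file. The sketch below is only an illustration: the sample lines are invented, and the variable names are arbitrary.

```shell
# Hypothetical sample text; any small file would do just as well.
sample=$(printf 'SGR 1806-20 is a magnetar\nno digits here\nrun 42 of 7\n')

# BRE: lines containing one or more digits
digits=$(printf '%s\n' "$sample" | grep '[0-9][0-9]*')

# ERE (-E): the same search written with "+"
digits_ere=$(printf '%s\n' "$sample" | grep -E '[0-9]+')

# -v: lines that do NOT contain a digit
nodigits=$(printf '%s\n' "$sample" | grep -v '[0-9]')

printf '%s\n' "$digits" "--" "$nodigits"
```

The BRE and ERE searches select the same two lines, and "-v" returns the remaining one.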
AWK is quite a sophisticated programming language. For this class we will make the simplest use of AWK.
The basic SED and AWK structure is
$ sed 'address1,address2{commands}' InputFile
$ awk 'ADDRESS/PATTERN {ACTION}' InputFile
The addresses are optional (in most instances). If no addresses are given then the command group or ACTION applies to all lines.
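As a minimal sketch of this structure, the commands below apply a SED command group to an address range and an AWK action to lines selected by a pattern. The five input lines are made up for the illustration.

```shell
# Five hypothetical input lines.
lines=$(printf 'line1\nline2\nline3\nline4\nline5\n')

# sed: the command group in braces applies only to lines 2 through 4
sed_out=$(printf '%s\n' "$lines" | sed '2,4{s/line/LINE/;}')

# awk: the pattern selects the records, the action runs on each of them
awk_out=$(printf '%s\n' "$lines" | awk '/[24]$/ {print NR ": " $0}')

printf '%s\n' "$sed_out" "--" "$awk_out"
```

Lines outside the address range pass through SED unchanged, whereas AWK prints only what the action asks for.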
The main advantage of SED over, say, MATLAB or IDL is that you do not have to "open" and "close" files, "read in a line" or "write a line out". It is the lack of these steps that allows you to undertake selective and comprehensive filtering with a few commands. The two basic uses of SED are via the command line or via a script file:
$ sed [-n] [-e 'address1,address2{commands}'] Infile
$ sed [-f scriptfile] Infile
Here "scriptfile" is a file containing SED commands. In the absence of the flag "-n" SED will "output" (print out) the (edited) pattern buffer and then go to the next input line. The flag "-e" allows for multiple commands (one for each invocation of "-e").
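The effect of these flags is easiest to see on a tiny file. The following is a sketch under stated assumptions: the scratch directory and the file contents are invented, and "Infile" and "scriptfile" are just the placeholder names used above.

```shell
# Work in a scratch directory; the file contents are hypothetical.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/Infile"

# Without -n every line is printed, and "p" prints line 2 a second time.
with_autoprint=$(sed '2p' "$tmpdir/Infile")

# With -n only what "p" asks for is printed.
quiet=$(sed -n '2p' "$tmpdir/Infile")

# Two -e expressions are applied in order to each line.
two_e=$(sed -e 's/a/A/' -e 's/$/!/' "$tmpdir/Infile")

# The same two commands stored in a script file and run with -f.
printf 's/a/A/\ns/$/!/\n' > "$tmpdir/scriptfile"
from_file=$(sed -f "$tmpdir/scriptfile" "$tmpdir/Infile")

printf '%s\n' "$with_autoprint" "--" "$quiet" "--" "$two_e"
rm -r "$tmpdir"
```

The "-e" and "-f" forms give identical output; the script file simply holds one command per line.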
The following commands accept an address range: p,n,d,s,c,y,H,N,D,P. The following accept only one address: a,i,r,q,=. A summary of the commands can be found in SEDCommands.txt.
Old style: read in vi or emacs and then go to last line and delete the line
and then save.
New style:
$ sed '$d' Saved.txt > Junk.txt; mv Junk.txt Saved.txt
14azi 08 23 53.64 +07 30 17.8 2000.0 ! P2 Classification | rubin: Possible bad subtraction (r=17.9)
14azj 08 25 45.82 +07 34 18.1 2000.0 ! P3 Classification | rubin: (r=18.3)
14azm 08 35 55.88 +14 05 15.6 2000.0 ! P2 Classification | rubin: (r=18.8)
14azo 08 40 41.92 +14 03 24.0 2000.0 ! P3 Classification | rubin: (r=19.1)
14ayo 12 06 03.00 +47 29 33.2 2000.0 ! P5 Classification SN II | ycao: (r=18.5)
14azl 12 26 36.03 +10 50 44.8 2000.0 ! P2 Classification | rubin: (r=18.8)

A user (Yi Cao) wishes to convert the above file to a format that is suited for the mighty 3.5-m telescope [this is the biggest 3.5-m telescope in the world, according to M. Kasliwal] of the APO Observatory:
14azi 08:23:53.64 +07:30:17.8
14azj 08:25:45.82 +07:34:18.1
14azm 08:35:55.88 +14:05:15.6
14azo 08:40:41.92 +14:03:24.0
14ayo 12:06:03.00 +47:29:33.2
14azl 12:26:36.03 +10:50:44.8
The following SED command will accomplish the task we set out to do:
$ sed   -e 's/2000\..*$//'   -e 's/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g'   TargetMarshal.txt
The first part of the script, 's/2000\..*$//', replaces all characters starting with "2000." through the end of the line ($) with nothing (in effect, deleting them). The second part of the script looks for a leading "+", "-" or " " followed by three groups: two digits, two digits, and two digits with a decimal fraction, each separated by a single space. Each such match is replaced by the same three groups joined by ":". The set "+", "-" or " " can be specified as "[-+ ]", "[- +]", "[+ -]" or "[ +-]"; other orderings either produce an error or match unintended characters. This is because "-" has a special meaning, that of a range indicator, when used within square brackets ([..]). For example, [a-z] matches any character between "a" and "z". The dash character loses its special meaning if it is next to either "[" or "]".
I find the above SED command too long! You
can combine all the commands by using ";", as follows.
$ sed -e 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g' TargetMarshal.txt
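The sketch below runs both forms of the command on one record copied from the listing, and then checks the remark about where the dash may sit inside square brackets. The test string for the bracket check is invented, and LC_ALL=C is set there as an assumption so that the range is interpreted in ASCII order.

```shell
# One record copied from the target listing.
rec='14azi 08 23 53.64 +07 30 17.8 2000.0 ! P2 Classification | rubin: Possible bad subtraction (r=17.9)'

# Two -e expressions ...
two=$(printf '%s\n' "$rec" | sed -e 's/2000\..*$//' \
      -e 's/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g')

# ... and the same commands joined with ";".
one=$(printf '%s\n' "$rec" | \
      sed 's/2000\..*$//;s/\([-+ ][0-9][0-9]\) \([0-9][0-9]\) \([0-9][0-9]\.[0-9]*\)/\1:\2:\3/g')

# Dash placement, on a hypothetical string.
s='a*b+c-d e'
# Dash first: only "-", "+" and " " are matched.
dash_first=$(printf '%s\n' "$s" | LC_ALL=C sed 's/[-+ ]/_/g')
# Dash in the middle: " -+" is the ASCII range from " " to "+", which includes "*"
# but not "-".
dash_middle=$(printf '%s\n' "$s" | LC_ALL=C sed 's/[ -+]/_/g')

printf '[%s]\n' "$two" "$one" "$dash_first" "$dash_middle"
```

Note that the converted record keeps the single space that preceded "2000.0", so the output line ends in a blank.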
149646130
152523407
152549537

She would like to generate output along the following pattern:
When Mansi tried this approach the result was garbage. She traced this to the fact that every line in ID.txt ended with a "non-printable" character ("\r"). Some non-UNIX systems use this character, alone or together with "\n" (new line), to end a line. There are two ways to solve this problem: use a restricted search pattern, that is, "\(^[0-9]*\)" instead of "\(^.*\)", or use "tr" to delete this character before feeding the file to SED.
[It is useful to view the file before unleashing SED.
$ sed -n 'l' Infile
will show the "non-print" characters clearly]
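The following sketch shows the "l" command exposing the stray "\r", and then both fixes. The two IDs are taken from the list above, but the DOS line endings are generated here with printf, and the ".fits" suffix is purely hypothetical: it stands in for whatever output pattern is wanted.

```shell
# A hypothetical ID list with DOS line endings (\r\n).
ids() { printf '149646130\r\n152523407\r\n'; }

# "l" prints the pattern space unambiguously: \r is shown as \r and "$" marks
# the end of each line.
visible=$(ids | sed -n 'l')

# Fix 1: delete the carriage returns with tr before calling sed.
fixed_tr=$(ids | tr -d '\r' | sed 's/^\(.*\)$/\1.fits/')

# Fix 2: capture only the digits; the trailing \r is matched by ".*" and dropped.
fixed_re=$(ids | sed 's/^\([0-9]*\).*$/\1.fits/')

printf '%s\n' "$visible" "--" "$fixed_tr"
```

Both fixes produce the same clean lines; the restricted pattern has the advantage of not needing an extra process.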
Below I show how I used SED to accomplish this task. However, I needed to first set the "locale".
$ LANG=C
Some background. The original ASCII definition used only 7 bits
of a byte. This corresponds to the "C" locale. Modern usage uses all
8 bits and MORE. For programming purposes (at least in the simple
world I live in) we do not need umlauts and Sanskrit.
In order to interpret control characters correctly the
locale should be set to "C" (as above). This setting is
consistent with the interpretation of ASCII characters as
defined by the POSIX standards.
You only need to set the locale
once in a given terminal session. [Typing in "echo $LANG"
shows that the default on my computer is "en_US.UTF-8". This
does not correspond to standard POSIX].
Now coming to the main task at hand.
$ sed 's/[^[:print:]]//g' MMR1.csv | sed '1,6d;69,$d;s/([0-9]*\.[0-9]*)//g;s/NA/-1/g'
The first command eliminates all non-printing characters (this uses a POSIX-defined character class). In the second invocation of SED the non-data lines are removed and then the confidence levels [shown in (..)] are deleted, at my daughter's request. Finally all "NA" (not available) entries are converted to -1.
A good SED guru aims for the most compact command. So we can combine the two and accomplish the task as follows:
$ sed 's/[^[:print:]]//g;1,6d;69,$d;s/([0-9]*\.[0-9]*)//g;s/NA/-1/g' MMR1.csv
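Here is a scaled-down sketch of the same clean-up. The miniature CSV is invented (it is not the contents of MMR1.csv), the line ranges are shrunk to 1,2d and $d to fit it, and the decimal point inside the parentheses is escaped so that it matches only a literal ".".

```shell
# Hypothetical miniature of the export: two header lines, data with confidence
# levels in parentheses, DOS line endings, and a footer line.
csv() { printf 'title\r\ncolumns\r\n12.5(0.95),NA,7\r\n3.1(0.80),4,NA\r\nfooter\r\n'; }

# Strip non-printing characters, drop non-data lines, remove "(..)" and map NA to -1.
clean=$(csv | LC_ALL=C sed 's/[^[:print:]]//g;1,2d;$d;s/([0-9]*\.[0-9]*)//g;s/NA/-1/g')

printf '%s\n' "$clean"
```

Because "d" ends the cycle for the line it deletes, the substitutions that follow it run only on the surviving data lines.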
Header1
Header2
A | B | C | ... |Z
A | B | C | ... |Z
....
(227 rows)
(blank line)

Your goal is to make a table with two or three parameters of interest (which translate to entries in column 1, 2, 3, ..). The combination of SED and AWK works very well in this context. You can use SED to filter out the lines and AWK to read specific columns. Below I delete the first two lines, the footer line beginning with "(" and the last line (which is a blank line), and pass the remaining lines to awk for column filtering.
$ sed '1,2d;/^(/d;$d' Infile | awk -F"|" '{print $2,$3}'
Apply this command to DKaaaa.txt, an iPTF file.
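If DKaaaa.txt is not at hand, the command can be tried on a small stand-in. The table below is hypothetical; it only mimics the layout described above (two header lines, "|"-separated rows, a "(n rows)" footer and a final blank line).

```shell
# Hypothetical miniature of such a file.
table() { printf 'Header1\nHeader2\na|b|c|d\ne|f|g|h\n(2 rows)\n\n'; }

# sed removes headers, footer and the trailing blank line; awk picks columns 2 and 3.
cols=$(table | sed '1,2d;/^(/d;$d' | awk -F"|" '{print $2,$3}')

printf '%s\n' "$cols"
```

In a real file the fields usually carry blanks around the "|", and these blanks are kept by AWK as part of each field.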
Hello my name is Kitty
Really it is something else...

But for now let us say it is Kitty
1 Hello my name is Kitty
2 Really it is something else...
3
4 But for now let us say it is Kitty
1:Hello my name is Kitty
2:Really it is something else...
3:
4:But for now let us say it is Kitty

It is easy to get rid of the ":" by piping to SED

$ grep -n ".*" Kitty.txt | sed 's/:/ /1'
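A short sketch of why only the first ":" is replaced: if a line of text itself contains a colon, a global substitution would damage it. The three input lines are made up for the illustration.

```shell
# Hypothetical input: the first line contains its own colon, the second is blank.
numbered=$(printf 'Hello: world\n\nBye\n' | grep -n '.*' | sed 's/:/ /')

printf '%s\n' "$numbered"
```

Without a flag, "s" acts on the first match only, which is exactly the colon that grep inserted after the line number.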
$ nl -b pREGULAR Infile
One blank lines follows.

Two blank lines follow.


Three blank lines follows.



No more lines
produces
One blank lines follows.

Two blank lines follow.

Three blank lines follows.

No more lines

It is worth understanding the flow. Each line is read into the "pattern space" (input buffer) and then checked to see if it is a blank line. If not, given that the "-n" flag has not been set, the pattern space is printed out and the next line is read in. If the new line is a blank line then the command or "list of commands" (the list is the one enclosed in braces) is sequentially executed. The "N" command reads in the next line and appends it to the pattern buffer (with "\n" separating the old line from the new line). We then see if the second line is a blank line ("/\n$/") and if so the first line is deleted. For both "D" and "d" control returns to the beginning of the command list. So the next line is read in ("N") and the cycle is repeated, and the loop continues until a non-blank line is read in. At this point the pattern buffer, which consists of a blank line followed by "\n" and the non-blank line, is printed out.
Above I have used the strict rules for a list of commands, namely, "{" must be succeeded by EOL, "}" must be preceded and succeeded by EOL, and all commands between the braces ("{ }") should likewise be preceded and succeeded by EOL. Fortunately, in SED, ";" is equivalent to EOL. Thus the following will also work:
$ sed '/^$/{N;/\n$/D;}' Blanklines.txt
Note the last command must end with ";" (this is logically consistent with what I stated above).
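The one-line form can be checked on a small input with runs of one, two and three blank lines. The input below is invented for the test. It deliberately ends in a non-blank line, since the behaviour of "N" on the last line of a file differs between SED implementations.

```shell
# Hypothetical input: one, two and three blank lines between the letters.
squeezed=$(printf 'a\n\nb\n\n\nc\n\n\n\nd\n' | sed '/^$/{N;/\n$/D;}')

printf '%s\n' "$squeezed"
```

Every run of blank lines is reduced to a single blank line, while the non-blank lines pass through untouched.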
srk.plan.php
[Contact]: Hopeless at returning phone calls. Call secretary Bronagh Glaser (+1 626 395 3734) or bglaser@astro.caltech.edu
------------------------------------------------------------------------
Jun 9* 11a/KeckSearch 12n/lunch
Jun 10 Jean Muller lunch
Jun 11 PS/lunch
Jun 12
Jun 13
Jun 14 [e] Helin ceremony; depart to Russia
Jun 15 [e]
Jun 16* [e] Zeldovich conference, Russia
...

At the start of a day, say June 10, I would like to move the line corresponding to June 9 to an archival file (srk.plan.2014). The script UpDateDailyCalendar does the job. It is a nice example showcasing the usage of the SED "r" and "w" commands.
UpDateDailyCalendar
#! /bin/bash
#Execute as ./UpDateDailyCalendar at the beginning of the day
cp srk.plan.php srk.plan.php.TEMP
cp srk.plan.2014 srk.plan.2014.TEMP
sed '/^--*$/{n;w out
d;}' srk.plan.php > b.txt
mv b.txt srk.plan.php
sed '$r out'< srk.plan.2014 > b.txt
rm out
mv b.txt srk.plan.2014
echo "moved top line from srk.plan.php to bottom line of srk.plan.2014"
echo "password please (to transfer to anju.caltech.edu)"
scp srk.plan.php srk.plan.2014 srk@anju.caltech.edu:public_html
[In order to make this file executable you will have to "chmod +x UpDateDailyCalendar". Execution requires invoking "./UpDateDailyCalendar" (if you want to execute it as "UpDateDailyCalendar" then your PATH must include the directory in which the executable is placed.)]
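The "w" and "r" steps can be tried safely in a scratch directory before touching real files. This is a minimal sketch: plan.txt and archive.txt are hypothetical stand-ins for srk.plan.php and srk.plan.2014, and their contents are invented.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical stand-ins for the plan file and the archive.
printf 'contact line\n------\nJun 9 lunch\nJun 10 talk\n' > plan.txt
printf 'Jun 8 seminar\n' > archive.txt

# At the dashed line: "n" prints it and reads the next line, "w" writes that
# line to the file "out", and "d" deletes it. The file name after "w" runs to
# the end of the line, hence the line break inside the quotes.
sed '/^--*$/{n;w out
d;}' plan.txt > plan.new

# "$r out" appends the contents of "out" after the last line of the archive.
sed '$r out' archive.txt > archive.new

plan_new=$(cat plan.new)
archive_new=$(cat archive.new)
printf '%s\n' "$plan_new" "--" "$archive_new"

cd - > /dev/null
rm -r "$tmpdir"
```

The top entry leaves the plan and reappears as the last line of the archive, which is the whole job of the daily update.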
% Region file format: DS9 version 4.1
%global color=green dashlist=8 3 width=1 font="helvetica 10 normal roman" select=1 highlite=1 dash=0 fixed=0 edit=1 move=1 delete=1 include=1 source=1
%fk5
circle(22:22:49.519,+36:18:11.53,5.05")
circle(22:22:57.390,+36:16:25.18,5.05")
circle(22:22:52.291,+36:17:39.66,5.05")
circle(22:22:33.037,+36:16:49.72,5.05")

This can be accomplished with the following SED script or macrofile, RADEC_ds9
#!/bin/bash
#chmod RADEC_ds9
# USAGE: ./RADEC_ds9 ds9.reg %
#first parameter is ds9 file
#second parameter is comment character. These lines are not analyzed.
#
sed -e '/^'$2'/d' -e 's/circle(//;s/")//;s/:/,/g' -e 's/+/+1,/' -e 's/-/-1,/' $1 | \
awk -F"," '{ra=($1*3600+$2*60+$3)/3600;dec=($4*($5*3600+$6*60+$7)/3600); \
print "ra(d)=" ra, "dec(d)=" dec}'

In this script file we first use SED to get rid of the extraneous material and then pass the data to AWK for arithmetic processing and display. It is worth noting how we have passed the shell parameters ($1, $2) to SED.
The action of SED is as follows: delete all lines starting with the second command-line parameter (which in this case is "%"). Next strip off "circle(" and the closing double quote and ")", and replace every ":" by ",". Then extract the sign of the declination degrees as a separate field (+1 or -1). The resulting values (rah, ram, rass, decsign, decd, decm, decss) are passed to AWK as fields (the field separator, set with -F, is ","). The necessary simple arithmetic is done by AWK and printed out. Note that the value labeled "ra(d)" is in decimal hours; multiply by 15 to obtain degrees. The usage is shown below.
$ ./RADEC_ds9 ds9.reg %
yields
ra(d)=22.3804 dec(d)=36.3032
ra(d)=22.3826 dec(d)=36.2737
ra(d)=22.3812 dec(d)=36.2944
ra(d)=22.3758 dec(d)=36.2805
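To see why the sign is split off as its own field, the same SED and AWK steps can be wrapped in a small function and fed one line at a time. The function name to_decimal is an assumption made for this sketch, and the second (southern) region is a hypothetical entry, not one from ds9.reg.

```shell
# Same sed and awk steps as in RADEC_ds9, without the comment-line filter.
to_decimal() {
  sed -e 's/circle(//;s/")//;s/:/,/g' -e 's/+/+1,/' -e 's/-/-1,/' |
  awk -F"," '{ra=($1*3600+$2*60+$3)/3600; dec=$4*($5*3600+$6*60+$7)/3600;
              print "ra(d)=" ra, "dec(d)=" dec}'
}

# First region from the file above.
north=$(printf '%s\n' 'circle(22:22:49.519,+36:18:11.53,5.05")' | to_decimal)

# Hypothetical region with a negative declination.
south=$(printf '%s\n' 'circle(05:30:00.0,-10:30:00.0,5.05")' | to_decimal)

printf '%s\n' "$north" "$south"
```

With the sign held in its own field, the minutes and seconds of a negative declination are correctly subtracted rather than added.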
% -----------------------------------------------------------------------------
% Near-Earth asteroids discovered by PTF (10 total)
%---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
%---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
K14K00D 24.4 0.15 K145N 354.10001 39.43772 226.49863 5.24544 0.5466273 0.31160391 2.1547693 3 E2014-K20 68 1 4 days 0.43 M-v 3Eh MPCADO 0803 2014 KD 17.9 20140521
K14J55G 29.2 0.15 K145N 348.77840 223.02395 49.93575 8.74017 0.4128117 0.49394282 1.5849598 5 MPO293441 37 1 0 days 1.00 M-v 3Eh MPCW 0803 2014 JG55 15.9 20140510
K13V04V 19.7 0.15 K145N 46.57572 167.65980 231.90925 19.10378 0.5590444 0.21503219 2.7593094 3 MPO285783 93 1 79 days 0.22 M-v 38h MPCADO 0804 2013 VV4 19.9 20140121
K13P06V 19.2 0.15 K145N 77.66743 161.29813 182.47902 6.17045 0.4700267 0.28608351 2.2810816 0 MPO278279 358 4 1951-2013 0.32 M-v 38h MPCADO 0804 2013 PV6 18.6 20131202
K13O05O 20.9 0.15 K145N 70.52000 154.89703 159.23828 10.23810 0.5215132 0.23627753 2.5913204 4 MPO273983 101 1 68 days 0.23 M-v 38h MPCALB 2804 2013 OO5 20.0 20131006
K13J14H 22.2 0.15 K145N 150.04931 356.84504 152.88209 2.63972 0.5721744 0.34294191 2.0214159 6 MPO263239 32 1 36 days 0.34 M-v 3Eh MPCADO 2803 2013 JH14 20.9 20130610
K13J01E 19.1 0.15 K145N 165.12527 163.04354 185.78403 32.67397 0.4094445 0.61107138 1.3753337 2 MPO285771 69 3 2000-2014 0.26 M-v 3Eh MPCALB 0803 2013 JE1 20.3 20140123
K13H11N 22.0 0.15 K145N 105.65226 144.03216 59.61312 6.32099 0.5224299 0.26477126 2.4019031 5 MPO265065 57 1 42 days 0.36 M-v 38h MPCADO 2804 2013 HN11 20.0 20130530

Adam Waszczak would like to search through CometAsteroid.txt and identify asteroids with excellent orbital parameters.
awk '/^ *K/ && $12==0 {print}' CometAsteroid.txt
We use two filters to identify the desired objects: lines starting with "K" or " K" (asteroids) and those whose field value $12 is zero. Only those lines are printed.
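A minimal check of the two filters, using two records from the listing. As an assumption made to keep the example short, each record is truncated after its 13th field and only the designation ($1) is printed.

```shell
# Two records from the listing, truncated after field 13 for brevity.
records() {
  printf '%s\n' \
    'K14K00D 24.4 0.15 K145N 354.10001 39.43772 226.49863 5.24544 0.5466273 0.31160391 2.1547693 3 E2014-K20' \
    'K13P06V 19.2 0.15 K145N 77.66743 161.29813 182.47902 6.17045 0.4700267 0.28608351 2.2810816 0 MPO278279'
}

# Keep asteroid lines (starting with "K", possibly after blanks) whose 12th
# whitespace-separated field is zero.
best=$(records | awk '/^ *K/ && $12==0 {print $1}')

printf '%s\n' "$best"
```

Only K13P06V, whose 12th field is 0, survives the filter.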
However, this is not a robust solution since the asteroid and comet convention is by column number. In particular, it is possible that some fields are "empty". In this case the above script will fail. By convention, column 106 is assigned to the quality index. Thus, if we wanted to print out only those asteroids with high quality (index=0) and be ready to accommodate missing "entries" then the following command does the job (this command was formulated by Adam Waszczak):
$ sed -e 's/./&,/105' -e 's/./&,/107' InputFile | awk -F "," '/^K/ && $2==0 {print}' | sed 's/,//g'
There are three parts: sed, awk and then sed. Consider the first part. The command s/./&,/105 inserts a comma after the 105th character and s/./&,/107 inserts a comma after the 107th character. In effect, the old column 106 is now surrounded by ",". If only this command is executed you will get the following output:
K14K00D 24.4 0.15 K145N 354.10001 39.43772 226.49863 5.24544 0.5466273 0.31160391 2.1547693 ,3, E2014-K20 68 1 4 days 0.43 M-v 3Eh MPCADO 0803 2014 KD 17.9 20140521
K14J55G 29.2 0.15 K145N 348.77840 223.02395 49.93575 8.74017 0.4128117 0.49394282 1.5849598 ,5, MPO293441 37 1 0 days 1.00 M-v 3Eh MPCW 0803 2014 JG55 15.9 20140510
K13V04V 19.7 0.15 K145N 46.57572 167.65980 231.90925 19.10378 0.5590444 0.21503219 2.7593094 ,3, MPO285783 93 1 79 days 0.22 M-v 38h MPCADO 0804 2013 VV4 19.9 20140121
K13P06V 19.2 0.15 K145N 77.66743 161.29813 182.47902 6.17045 0.4700267 0.28608351 2.2810816 ,0, MPO278279 358 4 1951-2013 0.32 M-v 38h MPCADO 0804 2013 PV6 18.6 20131202
K13O05O 20.9 0.15 K145N 70.52000 154.89703 159.23828 10.23810 0.5215132 0.23627753 2.5913204 ,4, MPO273983 101 1 68 days 0.23 M-v 38h MPCALB 2804 2013 OO5 20.0 20131006
K13J14H 22.2 0.15 K145N 150.04931 356.84504 152.88209 2.63972 0.5721744 0.34294191 2.0214159 ,6, MPO263239 32 1 36 days 0.34 M-v 3Eh MPCADO 2803 2013 JH14 20.9 20130610
K13J01E 19.1 0.15 K145N 165.12527 163.04354 185.78403 32.67397 0.4094445 0.61107138 1.3753337 ,2, MPO285771 69 3 2000-2014 0.26 M-v 3Eh MPCALB 0803 2013 JE1 20.3 20140123
K13H11N 22.0 0.15 K145N 105.65226 144.03216 59.61312 6.32099 0.5224299 0.26477126 2.4019031 ,5, MPO265065 57 1 42 days 0.36 M-v 38h MPCADO 2804 2013 HN11 20.0 20130530
K13G79Z 19.9 0.15 K145N 158.34646 102.80295 155.63692 15.98906 0.2906355 0.45529103 1.6734394 ,1, MPO271295 195 2 2002-2013 0.34 M-v 38h MPCADO 0804 2013 GZ79 19.7 20130831
K13EC8W 20.3 0.15 K145N 102.24712 106.73109 120.98740 7.35944 0.6012466 0.27298205 2.3534952 ,4, MPO268498 90 1 140 days 0.35 M-v 3Eh MPCALB 2803 2013 EW128 18.5 20130802

Note that the length of each record has increased by 2 characters. This output is then fed to awk with the delimiter set to "," (-F ",").
Characters from the start of the line to column 105 form the first field, the character at column 107 is the second field, and those from column 109 to the end of the line form the third field. The second field is then inspected for a zero value and if so the line is printed. The final step is to remove the "," (because these commas do not conform to the asteroid tabular convention) and this is done by sed 's/,//g'. I would say that this example is an excellent demonstration of the combined power of sed and awk.
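The fixed-column trick is easier to follow on short records. In the sketch below the quality index sits in column 6 instead of column 106, so the commas go after characters 5 and 7; the records themselves are invented, including one whose index column is blank.

```shell
# Hypothetical fixed-width records: the quality index is the 6th character.
rows() { printf 'KAAAA0rest\nKBBBB5rest\nK    0more\nKCCCC rest\n'; }

# Surround column 6 with commas, test it with awk, then remove the commas again.
good=$(rows | sed -e 's/./&,/5' -e 's/./&,/7' \
            | awk -F"," '/^K/ && $2==0 {print}' \
            | sed 's/,//g')

printf '%s\n' "$good"
```

The record with blanks in columns 2 to 5 is still handled correctly, and the record with a blank index is rejected, which is where splitting on white space would have gone wrong.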
If you are interested in bibliometrics then a common task is to find all the citations made to a paper with a given BIBCODE. Say that BIBCODE="1992ApJ...396...97R" (which is a paper that I wrote with Richard Rand when he was doing his PhD here). You would then launch
$ curl "http://adsabs.harvard.edu/cgi-bin/nph-ref_query?bibcode=$BIBCODE&refs=CITATIONS&db_key=AST" -o ads.html
Our goal is to write a file for each BIBCODE with the following
output:
line 1: BIBCODE
line 2: number of citations to this BIBCODE
line 3: bibcode of citation number 1
line 4: bibcode for citation number 2
...
Inspecting the file ads.html I found that the BIBCODE
can be found on the following line
I will set up three SED filters to identify each desired line(s)
$ sed -n -f ads.sed ads.html
You will note that the execution is quite slow. It is well known that substitution can be sped up if the line filtering is done first and the substitution is attempted only on the selected lines. This can be accomplished with ads2.sed
An example input file which reads as follows:
1
2
a
b
3
c
4
z

$ ./RegularSpecial "[0-9][0-9]*" "[a-zA-Z][a-zA-Z]*" < RegularSpecialInput.txt

produces

1
2ab
3c
4z

Let me explain the program
In section B, I check to see if the first line is a special line. If so, the program quits because there is no way this special line can be attached to a preceding regular line. If it is a regular line then the first line is copied to the hold space and control falls to the bottom.
The second line is read and it can (by construction) be either a regular line (in which case control goes to section C) or a special line (in which case control goes to section D).
If you have reached section C then both the pattern space and hold space are occupied by regular lines (respectively, current and previous). The two are exchanged, so that the previous line moves to the pattern space, where any embedded \n (newline) is removed and the line is printed, while the current line moves to the hold space. A check is made if the current line is the last line, in which case the line now in the hold space is also printed out.
If you reach section D then it means that the current line is a special line. It is appended to the hold space. Again a special check is undertaken if the current line is the last line. Control falls to the bottom.
#!/bin/sh
#Join SPECIAL LINES (regexp R2) to prior REGULAR line (regexp R1)
#Assume that file starts with REGULAR line
R1=$1   #A
R2=$2
gsed -n '
1{/'"$R1"'/!q    #B first line must be regular. if special, quit
h                #populate the hold space with first line
b}               #fall to bottom to initiate next read
/'"$R1"'/{       #C new line is REGULAR
x;s/\n//g;p;     #print line in hold (after removing \n)
                 #current line is now in hold
${x;p}           #special treatment if current line is last line
b
}
/'"$R2"'/{       #D new line is SPECIAL
H                #append current line to hold space
${x;s/\n//g;p}   #special treatment if current line is last line
}'
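For readers without gsed, here is a sketch of the same logic with one command per line and no in-line comments, which plain sed is more likely to accept. The function name join_special and the anchored regular expressions passed to it are assumptions of this sketch. As in the script above, the expressions must not contain "/".

```shell
# $1: regexp for REGULAR lines, $2: regexp for SPECIAL lines.
join_special() {
  sed -n '
1{
/'"$1"'/!q
h
b
}
/'"$1"'/{
x
s/\n//g
p
${
x
p
}
b
}
/'"$2"'/{
H
${
x
s/\n//g
p
}
}'
}

# The example input from above: digits are REGULAR lines, letters are SPECIAL.
joined=$(printf '1\n2\na\nb\n3\nc\n4\nz\n' |
         join_special '^[0-9][0-9]*$' '^[a-zA-Z][a-zA-Z]*$')

printf '%s\n' "$joined"
```

Each regular line is held back until the next regular line (or the end of the file) shows that no more special lines will be attached to it.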
# double space a file
sed G
# insert a blank line below every line which matches "regex"
sed '/regex/G'
# insert a blank line above every line which matches "regex"
sed '/regex/{x;p;x;}'
# insert a blank line above and below every line which matches "regex"
sed '/regex/{x;p;x;G;}'
# count lines (emulates "wc -l")
sed -n '$='
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
sed -n '45,50p' filename # print line nos. 45-50 of a file
sed -n '51q;45,50p' filename # same, but executes much faster for big files
# join pairs of lines side-by-side (like "paste")
sed '$!N;s/\n/ /'
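A few of these one-liners can be verified on a four-line input. The input and the variable names below are invented for the check, and the line number 52 is scaled down to 2.

```shell
# Hypothetical four-line input.
four() { printf 'a\nb\nc\nd\n'; }

count=$(four | sed -n '$=')          # count lines, like "wc -l"
second=$(four | sed '2q;d')          # print line 2, then stop reading
pairs=$(four | sed '$!N;s/\n/ /')    # join pairs of lines side by side
above=$(four | sed '/c/{x;p;x;}')    # blank line above the line matching "c"

printf '%s\n' "$count" "$second" "$pairs" "$above"
```

The "x;p;x" trick relies on the hold space being empty, so it prints a blank line only if nothing has been stored there earlier in the script.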
#!/usr/bin/sed -nf
s/Basic/*&*/
t a
b
:a
p
=

Since I like compact notation I attempted to combine all the commands on one line. However, labels must end with a newline and so the most compact form I could obtain was
#!/usr/bin/sed -nf
s/Basic/*&*/;t a
b
:a
{=;p;}

Executing this file as
When I write papers I frequently find that I have typed one "the" following another "the". A simple example is "the star is the the brightest known in that quadrant". A search for "the *the" is not good enough since you can have interlopers such as "blithe the" or "tithe the" and other such unlikely or wrong combinations. The easiest case to catch is when both the's are on the same line. The hardest is when one "the" is on one line but the other "the" is on the next line. Even harder is if you have two the's on one line and another pair split over the next line, all within the same two lines, such as:

the the brightest star is really the
the one we detected last year.

Typically I want to find the offending line and then fix it with my own editor. So I also like to have the line number listed. Let me start with the simplest case. The phrase " the the " occurs on the same line.
$ grep -n "the *the " The.txt
will do the trick as does a somewhat longer command with sed
$ sed -n '/ the *the /{
=
p
}' The.txt
This can be condensed to
$ sed -n '/ the *the /{=;p;}' The.txt
Next let us consider the case when one of the two "the" is on one line and the other is on the next line (as is the case in the above example). In this case the following command does the trick.
$ sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}' The.txt
Here, "N" reads in the next line and appends it to the current line (in the buffer), and the command 's/\n/ /' replaces the newline separating the two lines by a " ". Following this, a search as before for two "the" is carried out and, if successful, the line number and the line are printed.
However, if two consecutive lines end in "the" then this command fails to catch the second one. This can be addressed with increasingly complicated sed commands but I usually simply fix the offending lines found in the first cycle and then iterate until no additional such combinations are found.
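A minimal check of the split-line case on three made-up lines. The variant used here groups "=" and "p" in braces, so that both run only when the joined pair really contains the doubled "the".

```shell
# Hypothetical text: the first line ends in "the", the second begins with "the".
text() { printf 'this is the\nthe star we saw\na fine line\n'; }

hits=$(text | sed -n '/ the *$/{N;s/\n/ /;/ the *the /{=;p;};}')

printf '%s\n' "$hits"
```

The line number reported is that of the second line of the pair, because "=" is executed after "N" has read it in.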
It is well worth reviewing the script the.sed
/ the *the /{
=
b
}
N
h
s/.*\n//
/ the *the /{
=
b
}
g
s/ *\n/ /
/ the *the /{
g
=
b
}
g
D

Execute as follows
$ sed -f the.sed The.txt
Separately, the most compact form for the script file, the.sed, is
/ the *the /{=;b
};N;h;s/.*\n//;/ the *the /{=;b
};g;s/ *\n/ /;/ the *the /{g;=;b
};g;D

Three commands: b, r and w must end with a label or file name (as the case may be) and then EOL.