When and How to use Sed and Awk
UNIX Tools can be as intimidating as they are ubiquitous, powerful, and flexible. Their command line syntaxes (short-names, unlabeled positional arguments, and one-letter flags) are built to be compact and quick to type in an ad-hoc manner; as a result, they can be opaque and frustrating to learn. None exemplify this more, I think, than sed and awk.
Many folks, a younger version of myself included, can only make use of sed and awk if given an example to copy and paste. If this is you, read this article not for its examples (there aren’t many), but for the mental model of each command that it offers. Hopefully, armed with an understanding of when and how to employ these tools, your future text-processing tasks can be managed without a Google search to break your workflow.
Sed
sed is a program that excels at editing a stream of text. It provides a compact command syntax that fits on the command line for one-off stream editing tasks.
$ cat fizzbuzz.txt | sed 's/fizz/wat/g'
1
2
wat
4
buzz
wat
...
An aside: macOS comes with BSD sed pre-installed. This is not the sed you’re looking for. GNU sed covers nearly everything that BSD sed does, plus a myriad of extensions. Do yourself a favour and install it.
brew install gnu-sed
By default, the Homebrew formula will not clobber the pre-installed sed. Either use the gsed command, or symlink to it with ln -s $(which gsed).
This article assumes you are using GNU sed.
When to use it
sed is most useful for its “find and replace” text substitution functionality. It can also be used for inserting, replacing, or deleting lines of text according to line number or regular expression pattern matching.
Don’t use sed’s more advanced features, such as branching and labeling. If you find that your stream editing task requires the use of an if-statement, it would be easier to write and debug a short Perl or Python script.
How to use it
A sed script is made up of sed commands separated by semicolons (;). A sed command takes the form:
[line address]C[options]
C is a single-letter command. The command is only executed against lines that match the line address. The line address is optional and can be a line number, a range of lines, or a regular expression; without one, the command runs on every line. Additional options are required for some commands.
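As a minimal sketch of that form, the script below chains two commands with a semicolon. The first carries a line address, the second does not and so applies to every line. The three-line input is made up for illustration.

```shell
# Two commands separated by ';'.
#   2d      -> address "2", command "d": delete the second line
#   s/a/A/  -> no address: substitute the first "a" on every remaining line
out=$(printf 'apple\nbanana\ngrape\n' | sed '2d; s/a/A/')
echo "$out"
```

This prints Apple and grApe: the second line is gone, and the substitution touched both of the lines that were left.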
Line Addresses
Select lines by:
- Line number
  - 5 matches the fifth line of the stream
  - 5~2 matches the fifth line of the stream, and every second line thereafter (lines 5, 7, 9, ...)
  - The special number $ matches the last line of the stream
- Regular Expressions
  - /regex/ matches any line satisfying the regular expression
  - /regex/I causes the regular expression to be case-insensitive
- Line ranges
  - 5,10 matches lines 5 through 10 inclusively
  - 5,/regex/ matches from line 5 up to and including the next line that matches /regex/
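To see the addresses in action, here is a rough sketch that pairs each one with the p command and the -n option, so that only the addressed lines are printed. It assumes GNU sed (the 5~2 step form is a GNU extension) and uses seq to generate the numbers 1 through 10 as throwaway input.

```shell
# -n suppresses sed's default output, so only lines hit by "p" appear.
by_number=$(seq 1 10 | sed -n '3p')      # line 3 only
by_step=$(seq 1 10 | sed -n '5~2p')      # lines 5, 7, 9 (GNU extension)
last=$(seq 1 10 | sed -n '$p')           # the last line
by_regex=$(seq 1 10 | sed -n '/^1/p')    # lines starting with "1": 1 and 10
by_range=$(seq 1 10 | sed -n '2,4p')     # lines 2 through 4
to_regex=$(seq 1 10 | sed -n '8,/0/p')   # line 8 up to the next line containing "0"

echo "$by_step"
echo "$to_regex"
```

The last address starts at line 8 and stops at line 10, the first later line that contains a 0.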
Commands
Substitute. s/regex/replacement/[flags]
- substitute regex with replacement, respecting the optional flags
- reference regex groups \(\) in the replacement with \1, \2, \3...
- flags:
  - g - replace all regex matches, not just the first
  - i - cause the regex to be case-insensitive
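A short sketch of the flags and group references, run against made-up input. It assumes GNU sed, since the i flag is a GNU extension.

```shell
# No flags: only the first match on the line is replaced.
first=$(echo 'fizz fizz Fizz' | sed 's/fizz/wat/')
# g: every match on the line.
all=$(echo 'fizz fizz Fizz' | sed 's/fizz/wat/g')
# g and i together: every match, ignoring case.
nocase=$(echo 'fizz fizz Fizz' | sed 's/fizz/wat/gi')
# Groups: capture two words with \( \) and swap them using \1 and \2.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

echo "$first"; echo "$all"; echo "$nocase"; echo "$swapped"
```

The four lines printed are "wat fizz Fizz", "wat wat Fizz", "wat wat wat", and "world hello".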
Delete. d – use with a line address to remove a specific line from the output stream.
Print. p – use with the -n option to print only those lines matching the line address.
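The sketch below shows both commands on a three-line made-up input, including what happens when p is used without -n.

```shell
# d with an address: drop line 2 from the output.
deleted=$(printf 'a\nb\nc\n' | sed '2d')
# p with -n: print only the lines matching the address.
printed=$(printf 'a\nb\nc\n' | sed -n '/b/p')
# p without -n: the matching line appears twice, because sed
# still prints every line by default.
doubled=$(printf 'a\nb\nc\n' | sed '/b/p')

echo "$doubled"
```

Forgetting -n is an easy mistake to make; the doubled line in the last result is the telltale sign.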
Append. a\text – write text on a new line after any line matching the line address.
Insert. i\text – write text on a new line before any line matching the line address.
Replace. c\text – replace any line matching the line address with text.
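Here is a small sketch of the three commands, each addressed to line 2 of a made-up input. It assumes GNU sed, which accepts the text on the same line as the command; other implementations may want the text on a line of its own after the backslash.

```shell
# a: add a line after line 2.
appended=$(printf 'one\ntwo\nthree\n' | sed '2a\APPENDED')
# i: add a line before line 2.
inserted=$(printf 'one\ntwo\nthree\n' | sed '2i\INSERTED')
# c: replace line 2 entirely.
changed=$(printf 'one\ntwo\nthree\n' | sed '2c\CHANGED')

echo "$appended"
```

The echo prints one, two, APPENDED, three, each on its own line.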
Quit. q – use with a line address to exit at a specific line.
Multiple Commands. { command ... } – use with a line address to run several commands under the same condition.
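A quick sketch of both, again on throwaway input and assuming GNU sed. Quitting at line 3 behaves much like head -n 3, and the braces let one address govern a substitution and a print together.

```shell
# q with an address: print lines 1 through 3, then stop reading input.
head3=$(seq 1 100 | sed '3q')
# { } with an address: on lines matching /b/, substitute and then
# print, so the changed line shows up twice in the output.
grouped=$(printf 'a\nb\nc\n' | sed '/b/{ s/b/B/; p; }')

echo "$head3"
echo "$grouped"
```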
Awk
awk excels at processing data records. It provides what is essentially a full programming language with shortcuts for record- and field-centric processing and reporting.
When to use it
Awk definitely breaks the UNIX philosophy of “do one thing well”. When you come across a problem which awk can solve, you’ll likely be in one of two situations:
- The problem is complex, and you should write a script in a real programming language.
- The problem is simple, and you should use an already-existing tool that does that one thing well.
If you find yourself with the latter problem, consider the following tools:
- use sed to do line replacement or text substitution
- use cut to pull one or two fields out of a line of data
- use grep to filter by regular expression
- use head or tail to take only the first or last N lines
- use wc to count lines, words, or characters
- use uniq to count or remove duplicates
- use sort to sort data
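These tools compose well. As a rough illustration, the pipeline below finds the most frequent value in the first column of a small, entirely hypothetical log of "user action" pairs, with no awk involved.

```shell
# Hypothetical log: one "user action" pair per line.
log='alice login
bob login
alice upload
carol login
alice logout'

# cut pulls out the first space-delimited field, sort groups identical
# names together (uniq only collapses adjacent duplicates), uniq -c
# counts each group, sort -rn orders by count, head keeps the top entry.
top=$(printf '%s\n' "$log" | cut -d ' ' -f 1 | sort | uniq -c | sort -rn | head -n 1)
echo "$top"
```

The result is a count of 3 followed by alice, with some leading padding added by uniq -c.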
Nevertheless, you may find yourself in a situation where a more advanced language is not available and you need to solve a complex problem with only the GNU tools at your disposal. For this reason, it helps to have a mental model of awk’s functionality, such that you can quickly deploy it with some help from the documentation when the situation demands.
For my part, I once used awk to parse and search weeks of service logs on a remote log-storage server. I was looking for just a few instances of a problem, and it would have taken a long time to transfer such a large volume of logs to a machine with the appropriate tools. The first question I asked when the event was over: how can we improve the tools on our log-storage servers?
If you ever find yourself in need of awk, it’s likely because your environment wasn’t set up with the right tools in the first place. Be sure to fix that.
How to use it
An awk program is a series of actions, each with an optional condition. Actions are delimited by braces {}, with the condition preceding the action it corresponds to.
[condition] { action }
Given an awk script and an input stream, awk divides the input into records and fields.
- Records are separated by the value in the special variable RS.
  - By default, records are newline-delimited: RS = "\n"
  - The entire record is stored in the variable $0
  - The record number is stored in the variable NR
- Fields are separated by the value in the special variable FS.
  - By default, fields are delimited by runs of whitespace (spaces and tabs): FS = " "
  - Each field is stored in the variables $1, $2, $3...
  - The total number of fields is stored in the variable NF
  - The last fields are stored in the variables $NF, $(NF-1), $(NF-2)...
For each record in the input, Awk applies every action whose condition is satisfied by the record.
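A minimal sketch of the record and field variables, using made-up input. The second command sets FS from the command line with the standard -F option.

```shell
# Default separators: records split on newlines, fields on whitespace.
# For each record, print its number, its field count, and its first
# and last fields.
fields=$(printf 'a b c\nd e\n' | awk '{ print NR, NF, $1, $NF }')
# -F: sets FS to a colon, so "root:x:0" has three fields.
colon=$(printf 'root:x:0\n' | awk -F: '{ print $1, $(NF-1) }')

echo "$fields"
echo "$colon"
```

The first command prints "1 3 a c" and "2 2 d e"; the second prints "root x".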
Conditions
$2 ~ /^[a-z]+/ { print $0; }
/^[a-z]+/ { print $0; }
$1 == "a" { print $0; }
- The ~ operator gives true if the value on the left matches the regex on the right.
- A condition that is just a regex is matched against the whole record, i.e. $0 ~ /regex/.
- A condition doesn’t have to be a regex.
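The three kinds of condition can be tried out on a few lines of made-up data. This is only a sketch; the variable names are mine.

```shell
data='a 1
7 2
c x'

# Regex against one field: the second field is all digits.
numeric=$(printf '%s\n' "$data" | awk '$2 ~ /^[0-9]+$/ { print $1 }')
# Bare regex: tested against the whole record, $0.
lower=$(printf '%s\n' "$data" | awk '/^[a-z]/ { print $0 }')
# Not a regex at all: a plain string comparison.
exact=$(printf '%s\n' "$data" | awk '$1 == "c" { print $2 }')

echo "$numeric"; echo "$lower"; echo "$exact"
```

The first condition selects the records "a 1" and "7 2", the second selects "a 1" and "c x", and the third selects only "c x".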
Special Conditions
BEGIN { setup, print a header }
{ action }
END { calculate a total, print a footer }
- BEGIN is true before the first record.
- Every record satisfies the empty condition.
- END is true after the last record.
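A concrete version of that skeleton might look like the sketch below, which prints a header, echoes each value, and finishes with a total. The input and the variable name total are invented for the example.

```shell
report=$(printf '3\n4\n5\n' | awk '
BEGIN { print "value" }           # runs once, before any input is read
      { total += $1; print $1 }   # empty condition: runs for every record
END   { print "total:", total }   # runs once, after the last record
')
echo "$report"
```

The output is the header, the three values, and a final line reading "total: 12".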
Actions
Awk is a programming language. It implements:
- dynamic variables
- associative arrays
- arithmetic (+-*/%)
- if statements with compound conditional expressions
  - == != ~ !~ < <= > >=
  - && || !
- for, for/in, and while loops
- string, time, and math functions
- user-defined functions
See the GNU Awk User’s Guide for a good language reference.
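To give a feel for the language, here is a small word-count sketch that touches several of those features: an associative array, a for loop, a for/in loop, an if statement, and a user-defined function. The function name label and the input are hypothetical.

```shell
counts=$(printf 'b a b\nc b a\n' | awk '
# User-defined function: pick a singular or plural label for a count.
function label(n) {
    if (n == 1) return "time"
    return "times"
}
{
    # for loop over the fields; count[] is an associative array
    # keyed by the word itself.
    for (i = 1; i <= NF; i++) count[$i]++
}
END {
    # for/in visits keys in an unspecified order, so the output
    # is piped through sort afterwards.
    for (word in count) print word, count[word], label(count[word])
}' | sort)
echo "$counts"
```

This prints "a 2 times", "b 3 times", and "c 1 time". At about this size, I would start asking whether a script in another language would be easier to maintain.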
Continuity
Each condition { action } block is not a scope. Variables are global, so a variable set in one block can be read and updated in any other.
BEGIN { records_with_numbers = 0; records_with_letters = 0; }
/[0-9]/ { records_with_numbers++; }
/[a-z]/ { records_with_letters++; }
{ print $0, "(" records_with_numbers, records_with_letters ")" }
END {
print "Total records with numbers:", records_with_numbers;
print "Total records with letters:", records_with_letters;
}
prints:
$ cat data.tab | awk -f script.awk
a b c (0 1)
0 0 0 (1 1)
...
1 1 1 (8 8)
y z a (8 9)
Total records with numbers: 8
Total records with letters: 9
Formatting
{
    print $0;
    print;
    print $1 $2 $3;
    print $1, $2, $3;
    printf "format", a, b, c
}
- Recall that $0 holds the entire record.
- print; is short for print $0;.
- Placing expressions side by side, as in $1 $2 $3, concatenates them. The space between them is effectively the concatenation operator, and nothing is printed in its place.
- A comma prints the output field separator between arguments to print.
  - The output field separator is stored in the variable OFS.
  - By default: OFS = " ".
- printf is a formatter that works the same as the C language printf() function.
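The difference between these forms is easiest to see side by side. The sketch below runs each one against a single made-up record.

```shell
# Side-by-side expressions are concatenated with nothing in between.
joined=$(echo 'a b c' | awk '{ print $1 $2 $3 }')
# Commas insert OFS, which is a single space by default.
spaced=$(echo 'a b c' | awk '{ print $1, $2, $3 }')
# Changing OFS changes what the commas produce.
dashed=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }')
# printf takes a C-style format string and adds no newline of its own.
padded=$(echo 'pi 3.14159' | awk '{ printf "%s=%.2f\n", $1, $2 }')

echo "$joined"; echo "$spaced"; echo "$dashed"; echo "$padded"
```

The four results are "abc", "a b c", "a-b-c", and "pi=3.14".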