When and How to Use Sed and Awk

UNIX tools can be as intimidating as they are ubiquitous, powerful, and flexible. Their command-line syntaxes (short names, unlabeled positional arguments, and one-letter flags) are built to be compact and quick to type in an ad-hoc manner; as a result, they can be opaque and frustrating to learn. None exemplify this more, I think, than sed and awk.

Many folks, a younger version of myself included, can only make use of sed and awk if given an example to copy and paste. If this is you, read this article not for its examples (there aren’t many), but for the mental model of each command that it offers. Hopefully, armed with an understanding of when and how to employ these tools, your future text-processing tasks can be managed without a Google search to break your workflow.

Sed

sed is a program that excels at editing a stream of text. It provides a compact command syntax that fits on the command line for one-off stream-editing tasks.

$ cat fizzbuzz.txt | sed 's/fizz/wat/g'
1
2
wat
4
buzz
wat
...

An aside: macOS comes with BSD sed pre-installed. This is not the sed you’re looking for. GNU sed covers the same ground as BSD sed and adds a myriad of extensions. Do yourself a favour and install it.

brew install gnu-sed

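By default, the Homebrew formula will not clobber the pre-installed sed. Either use the gsed command, or symlink it under the name sed in a directory that comes before /usr/bin on your PATH, for example ln -s "$(which gsed)" ~/bin/sed (assuming ~/bin is on your PATH).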

This article assumes you are using GNU sed.

When to use it

sed is most useful for its “find and replace” text substitution functionality. It can also be used for inserting, replacing, or deleting lines of text according to line-number or regular expression pattern matching.

Don’t use sed’s more advanced features, such as branching and labeling. If you find that your stream editing task requires the use of an if-statement, it would be easier to write and debug a short Perl or Python script.

How to use it

A sed script is made up of sed commands separated by ;. A sed command takes the form:

[line address]C[options]

C is a single letter command. The command is only executed against lines that match the line address. The line address is optional and can be a line number, range of lines, or a regular expression. Additional options are required for some commands.
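
To make that shape concrete, here is a minimal sketch. The sample function and its three lines of input are made up for illustration; only the sed invocations matter.

```shell
# Hypothetical three-line input, generated inline so the sketch is self-contained.
sample() { printf 'one\ntwo\nthree\n'; }

# Two commands separated by ";":
#   2s/two/TWO/  -> address 2, command s, options /two/TWO/
#   3d           -> address 3, command d, no options
sample | sed '2s/two/TWO/; 3d'
# one
# TWO

# With no address, the command runs against every line.
sample | sed 's/o/0/'
# 0ne
# tw0
# three
```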

Line Addresses

Select lines by:

Line number
- 5 matches the fifth line of the stream
- 5~2 matches the fifth line of the stream, and every second line thereafter (5, 7, 9, ...)
- The special number $ matches the last line of the stream
Regular Expressions
- /regex/ matches any line satisfying the regular expression
- /regex/I causes the regular expression to be case-insensitive
Line ranges
- 5,10 matches lines 5 through 10 inclusively
- 5,/regex/ matches from line 5 up to and including the next line that matches /regex/
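
A quick way to get a feel for addresses is to pair each one with p under -n and see which lines come out. The sketch below assumes GNU sed; pick is a hypothetical helper that I made up for this example, and the input is just the numbers 1 to 12.

```shell
# pick (hypothetical helper) prints the lines of 1..12 selected by an address,
# joined onto one line. -n suppresses default output; p prints addressed lines.
pick() { seq 1 12 | sed -n "${1}p" | paste -sd' ' -; }

pick '5'        # 5
pick '5~2'      # 5 7 9 11     (first~step is a GNU extension)
pick '$'        # 12
pick '/^1/'     # 1 10 11 12
pick '10,$'     # 10 11 12
pick '9,/1$/'   # 9 10 11      (the regex is first tested on line 10)

# The I modifier is also a GNU extension.
printf 'Foo\nbar\nFOO\n' | sed -n '/foo/Ip'
# Foo
# FOO
```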

Commands

Substitute. s/regex/replacement/[flags]

substitute the first match of regex on each addressed line with replacement, respecting the optional flags
reference regex groups in the replacement with \1, \2, \3...
flags:
- g - replace all regex matches on the line, not just the first
- i - cause the regex to be case-insensitive
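
The following sketch shows the flags side by side, plus one group reference. It assumes GNU sed (the i flag is an extension), and the input strings are invented for the example. Groups are written \( \) here because sed uses basic regular expressions unless told otherwise.

```shell
echo 'fizz Fizz fizz' | sed 's/fizz/wat/'      # wat Fizz fizz
echo 'fizz Fizz fizz' | sed 's/fizz/wat/g'     # wat Fizz wat
echo 'fizz Fizz fizz' | sed 's/fizz/wat/gi'    # wat wat wat

# Reorder a date by capturing three groups and referencing them in reverse.
echo '2024-01-31' | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3.\2.\1/'
# 31.01.2024
```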

Delete. d – use with a line address to remove a specific line from the output stream.

Print. p – use with the -n option to print only those lines matching the line address.

Append. a\text – write text on a new line after any line matching the line address.

Insert. i\text – write text on a new line before any line matching the line address.

Replace. c\text – replace any line matching the line address with text.

Quit. q – use with a line address to exit at a specific line.

Multiple Commands. { command ... } – use with a line address to run several commands under the same condition.
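
Here is each of those commands run against the same four lines. This is a sketch that assumes GNU sed, which accepts the one-line a\text, i\text, and c\text forms; the lines function and its contents are hypothetical.

```shell
# Hypothetical four-line input.
lines() { printf 'alpha\nbeta\ngamma\ndelta\n'; }

lines | sed '2d'                # alpha gamma delta (line 2 removed)
lines | sed -n '/^g/p'          # gamma
lines | sed '2q'                # alpha beta (stops after line 2)
lines | sed '/^beta$/a\after'   # alpha beta after gamma delta
lines | sed '/^beta$/i\before'  # alpha before beta gamma delta
lines | sed '/^beta$/c\BETA'    # alpha BETA gamma delta

# Grouping: on the gamma line only, substitute and then print.
lines | sed -n '/^gamma$/{s/a/A/g;p;}'   # gAmmA
```

The text of a, i, and c runs to the end of the line, so those commands cannot be followed by ; and another command on the same line.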

Awk

awk excels at processing data records. It provides what is essentially a full programming language with shortcuts for record- and field-centric processing and reporting.

When to Use it

Awk definitely breaks the UNIX philosophy of “do one thing well”. When you come across a problem which awk can solve, you’ll likely be in one of two situations:

The problem is complex, and you should write a script in a real programming language.
The problem is simple, and you should use an already-existing tool that does that one thing well.

If you find yourself with the latter problem, consider the following tools:

use sed to do line replacement or text substitution
use cut to pull one or two fields out of a line of data
use grep to filter by regular expression
use head or tail to take only the first or last N lines
use wc to count lines, words, or characters
use uniq to count or remove adjacent duplicates (sort first to catch them all)
use sort to sort data
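
As an illustration of how far these small tools go when chained, here is a sketch that summarises a made-up access log. The log function and its format (method, path, status code) are assumptions for the example, not anything from a real system.

```shell
# Hypothetical access log: method, path, status code.
log() {
  printf '%s\n' 'GET /a 200' 'GET /b 404' 'POST /a 200' \
                'GET /a 500' 'GET /c 404' 'GET /d 200'
}

# Two most common status codes: cut picks field 3, sort brings equal values
# together, uniq -c counts each run, sort -rn ranks by count.
# The final sed only strips the padding that uniq -c adds.
log | cut -d' ' -f3 | sort | uniq -c | sort -rn | head -n 2 | sed 's/^ *//'
# 3 200
# 2 404

# How many GET requests ended in a 404?
log | grep '^GET ' | grep -c ' 404$'
# 2
```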

Nevertheless, you may find yourself in a situation where a more advanced language is not available and you need to solve a complex problem with only the GNU tools at your disposal. For this reason, it helps to have a mental model of awk’s functionality, such that you can quickly deploy it with some help from the documentation when the situation demands.

For my part, I once used awk to parse and search weeks of service logs on a remote log-storage server. I was looking for just a few instances of a problem, and it would have taken a long time to transfer such a large volume of logs to a machine with the appropriate tools. The first question I asked when the event was over: how can we improve the tools on our log-storage servers?

If you ever find yourself in need of awk, it’s likely because your environment wasn’t set up with the right tools in the first place. Be sure to fix that.

How to use it

An awk program is a series of actions, each with an optional condition. Actions are delimited by braces {}, with the condition preceding the action it corresponds with.

[condition] { action }

Given an awk script and an input stream, awk divides the input into records and fields.

Records are separated by the value in the special variable RS.
- By default, records are newline-delimited: RS = "\n"
- The entire record is stored in the variable $0
- The record number is stored in the variable NR.
Fields are separated by the value in the special variable FS.
- By default, fields are whitespace-delimited: FS = " ", which splits on runs of spaces and tabs and ignores leading and trailing whitespace
- Each field is stored in the variables $1, $2, $3...
- The total number of fields is stored in the variable NF
- The last fields are stored in the variables $NF, $(NF-1), $(NF-2)...
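
A small sketch of those variables in action, using invented input. Note the run of spaces in the second record: with the default FS it still counts as a single separator. The second command sets FS in a BEGIN block (covered below) to split on colons instead.

```shell
printf 'alice 30 london\nbob   25 paris extra\n' |
  awk '{ print NR ": NF=" NF, "first=" $1, "last=" $NF, "prev=" $(NF-1) }'
# 1: NF=3 first=alice last=london prev=30
# 2: NF=4 first=bob last=extra prev=paris

# Changing the field separator.
printf 'a:b:c\n' | awk 'BEGIN { FS = ":" } { print $2 }'
# b
```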

For each record in the input, Awk applies every action whose condition is satisfied by the record.

Conditions

$2 ~ /^[a-z]+/ { print $0; }
/^[a-z]+/      { print $0; }
$1 == "a"      { print $0; }

The ~ operator is true if the value on the left matches the regex on the right.
A condition that is just a regex is matched against the whole record, i.e. $0 ~ /regex/.
A condition doesn’t have to be a regex; any expression works.
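
Here are the three kinds of condition run against the same input. The data function and its three records are hypothetical.

```shell
# Hypothetical records: a tag, a name, and a number.
data() { printf 'a apple 3\nb 7up 7\na cherry 12\n'; }

# Field matched against a regex.
data | awk '$2 ~ /^[a-z]/ { print $2 }'
# apple
# cherry

# Bare regex, tested against the whole record.
data | awk '/up/ { print NR }'
# 2

# Plain expressions, combined with &&.
data | awk '$1 == "a" && $3 > 5 { print $0 }'
# a cherry 12
```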

Special Conditions

BEGIN   { setup, print a header }
        { action }
END     { calculate a total, print a footer }

BEGIN is true before the first record.
Every record satisfies the empty condition.
END is true after the last record.
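
Filled in with real actions, that skeleton might look like the sketch below. The item names and quantities are invented; total is an ordinary awk variable that starts out empty and is treated as zero when first added to.

```shell
printf 'widget 4\ngadget 6\n' | awk '
BEGIN { print "item qty" }          # runs once, before any input is read
      { total += $2; print }        # runs for every record
END   { print "total " total }      # runs once, after the last record
'
# item qty
# widget 4
# gadget 6
# total 10
```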

Actions

Awk is a programming language. It implements:

dynamic variables
associative arrays
arithmetic (+-*/%)
if statements with compound conditional expressions
- == != ~ !~ < <= > >=
- && || !
for, for/in, and while loops
string, time, and math functions
user-defined functions

See the GNU Awk User’s Guide for a good language reference.
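
To give a taste of a few of these features together, here is a sketch that counts words with an associative array, a for loop, a for/in loop, and an if statement. The input and the names count, i, and w are made up. The order of a for/in loop is unspecified, so the output is piped through sort.

```shell
printf 'apple banana apple\ncherry apple banana\n' | awk '
# Count every field of every record, keyed by the word itself.
{ for (i = 1; i <= NF; i++) count[$i]++ }

# Report only the words seen more than once.
END { for (w in count) if (count[w] > 1) print w, count[w] }
' | sort
# apple 3
# banana 2
```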

Continuity

A condition { action } block is not a scope: variables are global, so a variable set in one block can be read and changed in any other.

BEGIN { records_with_numbers = 0; records_with_letters = 0; }
/[0-9]/ { records_with_numbers++; }
/[a-z]/ { records_with_letters++; }
        { print $0, "(" records_with_numbers, records_with_letters ")" }
END {
    print "Total records with numbers:", records_with_numbers;
    print "Total records with letters:", records_with_letters;
}

prints:

$ cat data.tab | awk -f script.awk
a b c (0 1)
0 0 0 (1 1)
...
1 1 1 (8 8)
y z a (8 9)
Total records with numbers: 8
Total records with letters: 9

Formatting

{
    print $0;
    print;
    print $1 $2 $3;
    print $1, $2, $3;
    printf "format", a, b, c
}

Recall that $0 holds the entire record.
print; is short for print $0;.
Placing expressions next to each other concatenates them; awk has no explicit concatenation operator, and the space between them is not printed.
A comma prints the output field separator between arguments to print.
- The output field separator is stored in the variable OFS.
- By default: OFS = " ".
printf is a formatter that works the same as the C language printf() function.
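
The differences are easiest to see side by side. The sketch below uses throwaway input; unlike print, printf adds no newline of its own, so the format string supplies one.

```shell
echo 'a b c' | awk '{ print $1 $2 $3 }'                         # abc
echo 'a b c' | awk '{ print $1, $2, $3 }'                       # a b c
echo 'a b c' | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }'   # a-b-c
echo 'pi 3.14159' | awk '{ printf "%s=%.2f\n", $1, $2 }'        # pi=3.14
```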