When and How to Use Sed and Awk

UNIX Tools can be as intimidating as they are ubiquitous, powerful, and flexible. Their command line syntaxes (short-names, unlabeled positional arguments, and one-letter flags) are built to be compact and quick to type in an ad-hoc manner; as a result, they can be opaque and frustrating to learn. None exemplify this more, I think, than sed and awk.

Many folks, a younger version of myself included, can only make use of sed and awk if given an example to copy and paste. If this is you, read this article not for its examples (there aren’t many), but for the mental model of each command that it offers. Hopefully, armed with an understanding of when and how to employ these tools, your future text-processing tasks can be managed without a Google search to break your workflow.

Sed

sed is a program that excels at editing a stream of text. It provides a compact command syntax that fits on the command line for one-off stream editing tasks.

$ cat fizzbuzz.txt | sed 's/fizz/wat/g'
1
2
wat
4
buzz
wat
...

An aside: macOS comes with BSD sed pre-installed. This is not the sed you’re looking for. GNU sed has everything that BSD sed has, plus a myriad of extensions. Do yourself a favour and install it.

brew install gnu-sed

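By default, the Homebrew formula will not clobber the pre-installed sed. Either use the gsed command, or symlink it under the name sed in a directory that comes before /usr/bin in your PATH, for example ln -s "$(which gsed)" /usr/local/bin/sed (adjust the target directory to your setup).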

This article assumes you are using GNU sed.

When to use it

sed is most useful for its “find and replace” text substitution functionality. It can also be used for inserting, replacing, or deleting lines of text according to line-number or regular expression pattern matching.

Don’t use sed’s more advanced features, such as branching and labeling. If you find that your stream editing task requires the use of an if-statement, it would be easier to write and debug a short Perl or Python script.

How to use it

A sed script is made up of sed commands separated by ;. A sed command takes the form:

[line address]C[options]

C is a single letter command. The command is only executed against lines that match the line address. The line address is optional and can be a line number, range of lines, or a regular expression. Additional options are required for some commands.

Line Addresses

Select lines by:
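As a minimal sketch of the address forms named above (a line number, a range of lines, and a regular expression), here is each one paired with the p command and the -n option. It assumes GNU sed and uses seq to generate the sample input; the variable names are only for illustration.

```shell
# Sample input: the numbers 1 through 10, one per line.
by_number=$(seq 10 | sed -n '3p')      # a single line number: line 3
by_range=$(seq 10 | sed -n '2,4p')     # a range: lines 2 through 4
by_regex=$(seq 10 | sed -n '/^1/p')    # a regex: lines starting with "1" (1 and 10)
last_line=$(seq 10 | sed -n '$p')      # "$" addresses the last line

printf '%s\n' "$by_number" "$by_range" "$by_regex" "$last_line"
```

Swap p for any other command and the same addresses apply.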

Commands

Substitute. s/regex/replacement/[flags]

Delete. d – use with a line address to remove a specific line from the output stream.

Print. p – use with the -n option to print only those lines matching the line address.

Append. a\text – write text on a new line after any line matching the line address.

Insert. i\text – write text on a new line before any line matching the line address.

Replace. c\text – replace any line matching the line address with text.

Quit. q – use with a line address to exit at a specific line.

Multiple commands. { command; ... } – use with a line address to run several commands under the same condition.
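The commands above can be tried out on a tiny input. This is a sketch assuming GNU sed (the one-line a\text, i\text and c\text forms are GNU extensions); the three-line sample input and the variable names are made up for illustration.

```shell
# Hypothetical sample input: three lines of text.
input=$(printf 'alpha\nbeta\ngamma')

# s: a numeric flag replaces only the Nth match on a line.
second_only=$(printf 'a-a-a\n' | sed 's/a/b/2')                    # a-b-a

# d: delete line 2.
deleted=$(printf '%s\n' "$input" | sed '2d')

# a and i: write a new line after or before lines matching /beta/.
appended=$(printf '%s\n' "$input" | sed '/beta/a\after beta')
inserted=$(printf '%s\n' "$input" | sed '/beta/i\before beta')

# c: replace line 1 entirely.
changed=$(printf '%s\n' "$input" | sed '1c\ALPHA')

# q: print up to and including line 2, then exit.
quit_early=$(printf '%s\n' "$input" | sed '2q')

# { ... }: on lines matching /gamma/, substitute and then print an extra copy.
grouped=$(printf '%s\n' "$input" | sed '/gamma/{s/a/A/g;p;}')

printf '%s\n' "$grouped"
```

The last command prints alpha, beta, and then gAmmA twice: once from p and once from sed's default output.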


Awk

awk excels at processing data records. It provides what is essentially a full programming language with shortcuts for record- and field-centric processing and reporting.

When to Use it

Awk definitely breaks the UNIX philosophy of “do one thing well”. When you come across a problem which awk can solve, you’ll likely be in one of two situations:

  1. The problem is complex, and you should write a script in a real programming language.
  2. The problem is simple, and you should use an already-existing tool that does that one thing well.

If you find yourself with the latter problem, consider the following tools:

Nevertheless, you may find yourself in a situation where a more advanced language is not available and you need to solve a complex problem with only the GNU tools at your disposal. For this reason, it helps to have a mental model of awk’s functionality, such that you can quickly deploy it with some help from the documentation when the situation demands.

For my part, I once used awk to parse and search weeks of service logs on a remote log-storage server. I was looking for just a few instances of a problem, and it would have taken a long time to transfer such a large volume of logs to a machine with the appropriate tools. The first question I asked when the event was over: how can we improve the tools on our log-storage servers?

If you ever find yourself in need of awk, it’s likely because your environment wasn’t set up with the right tools in the first place. Be sure to fix that.

How to use it

An awk program is a series of actions, each with an optional condition. Actions are delimited by braces {}, with the condition preceding the action it corresponds with.

[condition] { action }

Given an awk script and an input stream, awk divides the input into records and fields.

For each record in the input, awk applies every action whose condition is satisfied by the record.
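To make records and fields concrete, here is a small sketch. By default a record is a line and fields are separated by whitespace; NR, NF and the -F option are standard awk features, while the sample inputs and variable names are made up.

```shell
# Default splitting: each line is a record, whitespace separates fields.
second_field=$(printf 'a b c\nd e f\n' | awk '{ print $2 }')

# NR is the current record number, NF the number of fields in it.
counts=$(printf 'a b c\nd e\n' | awk '{ print NR, NF }')

# -F changes the field separator, here to a colon.
third_field=$(printf 'root:x:0\n' | awk -F: '{ print $3 }')

printf '%s\n' "$second_field" "$counts" "$third_field"
```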

Conditions

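$2 ~ /^[a-z]/ { print $0; }   # field 2 matches the regex
/^[a-z]/      { print $0; }   # the whole record ($0) matches the regex
$1 == "a"     { print $0; }   # field 1 equals the string "a"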
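Here is a sketch of the three kinds of condition run against a few made-up records: a regex tested against one field, a bare regex tested against the whole record, and a comparison. The regexes are simplified to "starts with a lowercase letter" for the demonstration.

```shell
# Hypothetical sample data: three whitespace-separated fields per record.
data=$(printf 'a b c\n0 x 0\n1 1 1')

# A regex matched against a single field.
field_match=$(printf '%s\n' "$data" | awk '$2 ~ /^[a-z]/ { print $0 }')

# A bare regex is matched against the whole record.
record_match=$(printf '%s\n' "$data" | awk '/^[a-z]/ { print $0 }')

# A comparison expression.
equals=$(printf '%s\n' "$data" | awk '$1 == "a" { print $0 }')

printf '%s\n' "$field_match"
```

The first condition selects both "a b c" and "0 x 0", since only the second field is tested; the other two select just "a b c".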

Special Conditions

BEGIN   { setup, print a header }
        { action }
END     { calculate a total, print a footer }
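A runnable version of that skeleton might look like the following sketch, which prints a header, echoes each record, and finishes with a total. The "name quantity" input is made up for illustration.

```shell
# Sum the second column of some hypothetical "name quantity" records.
report=$(printf 'apples 3\npears 4\n' | awk '
BEGIN { print "item qty" }            # runs once, before any input is read
      { total += $2; print $1, $2 }   # no condition: runs for every record
END   { print "total", total }        # runs once, after the last record
')

printf '%s\n' "$report"
```

Note that total is never declared: awk variables spring into existence as empty (zero in a numeric context) on first use.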

Actions

Awk is a programming language. It implements:

See the GNU Awk User’s Guide for a good language reference.

Continuity

A condition { action } block does not create its own scope. Variables are global: a variable set in one block can be read or modified from any other block.

BEGIN { records_with_numbers = 0; records_with_letters = 0; }
/[0-9]/ { records_with_numbers++; }
/[a-z]/ { records_with_letters++; }
        { print $0, "(" records_with_numbers, records_with_letters ")" }
END { 
    print "Total records with numbers:", records_with_numbers;
    print "Total records with letters:", records_with_letters;
}

prints:

$ cat data.tab | awk -f script.awk
a b c (0 1)
0 0 0 (1 1)
...
1 1 1 (8 8)
y z a (8 9)
Total records with numbers: 8
Total records with letters: 9

Formatting

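{
    print $0;                 # the whole record
    print;                    # shorthand for print $0
    print $1 $2 $3;           # fields concatenated with nothing between them
    print $1, $2, $3;         # fields joined by the output field separator (a space by default)
    printf "format", a, b, c  # C-style formatted output; no newline is added for you
}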
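To see the difference between these forms, here is a short sketch on made-up input. The printf format string is just one plausible layout, using standard C-style conversions.

```shell
# Without commas the fields are concatenated; with commas they are
# joined by the output field separator (a single space by default).
concatenated=$(printf 'a b c\n' | awk '{ print $1 $2 $3 }')    # abc
separated=$(printf 'a b c\n' | awk '{ print $1, $2, $3 }')     # a b c

# printf: %-8s is a left-justified string 8 wide, %3d an integer 3 wide,
# %6.2f a float 6 wide with 2 decimals. The newline must be explicit.
table=$(printf 'apples 3 0.5\npears 12 1.25\n' | awk '
{ printf "%-8s%3d%6.2f\n", $1, $2, $3 }')

printf '%s\n' "$concatenated" "$separated" "$table"
```

print is enough for quick output; printf earns its keep when columns need to line up.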