AWK birds

AWK crashcourse

AWK language course aims to explain AWK in 15 minutes to let you find awesome tool friend despite it’s given name. The correct pronunciation is [auk] after smaller seabirds Parakeet auklets.

General language description

AWK language (is):

(mainly) text processing language
available on most UNIX-like systems by default, on Windows there is either native binary or cygwin one
syntax is influenced by c and shell programming languages
programs from single line to multiple library files
several implementations available, notably gawk and mawk
solves generaly same problems as similar text-processing tools sed, grep, wc, tr, cut, printf, tail, head, cat, tac, bc, column, …

AWK language use-cases are:

computing int / floating point math formulas (based on input)
general text-processing
- cutting pieces from input text stream
- reformatting input text stream
(shell) meta-programming generator

AWK language capabilities:

text-processing functions
regular expression support
math functions
dynamic typing, support for
- integer / long
- floats
- associative arrays (including multi-dimensional array support)
external execution support

Processing workflow aka `main()`

Every AWK execution consist of folowing three phases:

[1] BEGIN{ ... } are actions performed at the beginning before first text character is read
- multiple blocks allowed (normally single)
[2] [condition]{ ... } are actions performed on every AWK record (default text line)
- every AWK record is automatically split into AWK fields (by default words)
- multiple blocks allowed
[3] END{ ... } are actions performed at the end of the execution after last text character is read
- multiple blocks allowed (normally single)

AWK process flow

warm-up basic example

$ echo -e "AWK is still useful\ntext-processing  technology!" | \
>   awk 'BEGIN{wcnt=0;print "lineno/#words/3rd-word:individual words\n"}
>             {printf("% 6d/% 6d/% 8s:%s\n",NR,NF,$3,$0);wcnt+=NF}
>          END{print "\nSummary:", NR, "lines/records,", wcnt, "words/fields"}'
lineno/#words/3rd-word:individual words
     1/     4/   still:AWK is still useful
     2/     2/        :text-processing  technology!
Summary:2 lines/records, 6 words/fields

Command-line basics

Passing text data to AWK:
- from pipe: cat input-data.txt | awk <app>
- from file[s] read by awk itself: awk <app> input-data.txt
AWK application execution styles (-f):
- on command-line awk '{ ... }' input-data.txt
- in separate files awk -f myapp.awk input-data.txt
specifying an AWK variable on command-line -v var=val
specifying AWK field separator FS variable or -F <FS> switch

Global variables

Global variables are documented here, most common ones are:

$0 value of current AWK record (whole line without line-break)
- $1, $2, … $NF values of first, second, … last AWK field (word)
FS Specifies the input AWK field separator, i.e. how AWK breaks input record into fields (default: a whitespace).
RS Specifies the input AWK record separator, i.e. how AWK breaks input stream into records (default: an universal line break).
OFS Specifies the output separator, i.e. how AWK print parsed fields to the output stream using print() (default: single space).
ORS Specifies the output separator, i.e. how AWK print parsed records to the output stream using print() (default: line break)
FILENAME contains the name of the input file read by awk (read only global variable)

Buildin functions

AWK functions are documented, the most important ones are:

print, printf() and sprintf()
- printing functions
length()
- length of an string argument
substr()
- splitting string to a substring
split()
- split string into an array of strings
index()
- find position of an substring in a string
sub() and gsub()
- (regexp) search and replace (once respectivelly globally)
~ operator and match()
- regexp search
tolower() and toupper()
- convert text to lowercase resp. uppercase

Learn by examples

Best practices

Portability

Prefer general awk before an specific AWK implementation:

use general awk for portable programs
otherwise use the particular implementation e.g. gawk

AWK programs extension and readability

General rule of thumb is to create AWK program as a *.awk file if equivalent one-liner is not well readable.

If you have troubles to understand one line awk program then feel free to use GNU AWK’s profiling functionality i.e. -p option to receive pretty printed AWK code (in awkprof.out).

Code quality

comment properly
indent similarly as in c/c++ programmimng languages
use functions whenever possible
stay explicit avoiding awk default (implicit) actions which make AWK application hard to understand
- example: length > 80 should be rather written 'length($0) > 80 { print }' or 'length($0) > 80 { print $0 }'

Pitfalls

don’t forget to always use apostrophe ' quotation when writing awk oneline applications to avoid shell expansion (for instance $1)
- awk "{print $1}" should be awk '{print $1}'
use one of the recommended implementations as old implementations are quite limited (old awk or nawk)
string / array indexing from 1 (index(), split(), $i, …)
GNU AWK implementation understand localization & utf-8/unicode and thus replacing with [g]sub() can lead to unwanted behavior unless you force gawk to drop such support via exporting environment variable LC_ALL=C
- other awk implementations may not support utf-8/unicode:
```
awk implementation versions
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.1)
mawk 1.3.4 20161107
BusyBox v1.22.1 (2016-02-03 18:22:11 UTC) multi-call binary.

$ echo “Zřetelně” | gawk ‘{print toupper($0)}’
ZŘETELNĚ
$ echo “Zřetelně” | mawk ‘{print toupper($0)}’
ZřETELNě
$ echo “Zřetelně” | busybox awk ‘{print toupper($0)}’
ZřETELNě

 * extended reqular expressions are available just for gawk (and for older version has to be explicitly enabled):

$ ps auxwww | gawk ‘{if($2~/^[0-9]{1,1}$/){print}}’
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd —switched-root —system —deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]

$ ps auxwww | gawk —re-interval ‘{if($2~/^[0-9]{1,1}$/){print}}’
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd —switched-root —system —deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]

$ ps auxwww | mawk ‘{if($2~/^[0-9]{1,1}$/){print}}’
$
```