The GNU Awk User’s Guide

Preface
Getting Started
Running awk and gawk
Regular Expressions
Reading Input Files
Printing Output
Expressions
Patterns, Actions, and Variables
Arrays in awk
Functions
Problem Solving with awk

https://www.gnu.org/software/gawk/manual/html_node/index.html

Preface

The GNU implementation of awk is called gawk; if you invoke it with the proper options or environment variables, it is fully compatible with the POSIX specification of the awk language and with the Unix version of awk maintained by Brian Kernighan.

https://www.gnu.org/software/gawk/manual/html_node/Preface.html#Preface

Getting Started

pattern { action }
pattern { action }
…

Programs in awk consist of pattern–action pairs.
An action without a pattern runs for every input record.
A pattern without an action uses the default action, { print $0 }.
If several patterns match, then several actions execute in the order in which they appear in the awk program.
If no patterns match, then no actions run.
Use either awk 'program' files or awk -f program-file files to run awk.
To make a program file directly executable, start it with the special #! /bin/awk -f header line.
You can add the extension .awk to the file name.
Comments in awk programs start with #.
You may use backslash continuation to continue a source line.
Lines are automatically continued after ,, {, ?, :, ||, &&, do, and else (continuation after ? and : is a gawk extension).
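
The pattern-action rules above can be sketched as follows. The sample input and the counter n are invented for illustration; only POSIX awk features are assumed.

```shell
# Sample input is made up for this sketch.
out=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '
    /an/                          # pattern only: the default action prints the record
    $2 > 5 { print "big:", $1 }   # pattern and action
    { n++ }                       # action only: runs for every record
    END { print n, "records" }
')
printf '%s\n' "$out"
```

The second record matches two patterns, so both actions run for it, in program order.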

Running `awk` and `gawk`

awk [options] -f progfile [--] file …
awk [options] [--] 'program' file …

-F fs / --field-separator fs: Set the FS variable to fs
-f source-file / --file source-file: Read the program from source-file. May be given multiple times; the source files are concatenated.
-v var=val / --assign var=val: Set the variable var to val before the program starts. Avoid setting built-in variables this way.
--: Signal the end of the command-line options. Useful if you have file names that start with -.

All nonoption command-line arguments, excluding the program text, are placed in the ARGV array.
Adjusting ARGC and ARGV affects how awk processes input.
Any additional arguments on the command line are normally treated as input files to be processed in the order specified.
However, an argument of the form var=value assigns the value value to the variable var. It does not specify a file at all.
Such a variable assignment is evaluated when awk reaches it, in between processing input files, while -v var=val is evaluated before BEGIN.

printf "\n" > pass1.txt
printf "\n\n" > pass2.txt
awk 'pass == 1  { print "hello" }
     pass == 2  { print "world" }' pass=1 pass1.txt pass=2 pass2.txt

hello
world
world
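
To see what ends up in ARGV, a small sketch: the arguments a, b=1, and c are placeholders, and because the program has only a BEGIN rule, awk never tries to open them. The loop starts at 1 because ARGV[0] holds the program name, which may differ between systems.

```shell
# a, b=1 and c are placeholder arguments; a BEGIN-only program reads no input.
out=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i, ARGV[i] }' a b=1 c)
printf '%s\n' "$out"
```

Note that the var=value argument b=1 is stored in ARGV like any other argument; it is only interpreted as an assignment when awk would otherwise open it as a file.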

# Standard input (-) is read as the second file.
some_command | awk -f myprog.awk file1 - file2

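In gawk, the AWKPATH and AWKLIBPATH environment variables specify search paths.
AWKPATH is searched for source files (-f and @include); AWKLIBPATH is searched for extension libraries (-l and @load).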

test1.awk

BEGIN {
    print "This is script test1."
}

test2.awk

@include "test1"
BEGIN {
  print "This is script test2."
}

gawk -f test2.awk

This is script test1.
This is script test2.

Regular Expressions

/li/ { print $2 }  # match against the line
exp ~ /regexp/     # match against the value of exp; use `!~` to negate

$ awk '$1 ~ /J/' inventory-shipped
-| Jan  13  25  15 115
-| Jun  31  42  75 492
-| Jul  24  34  67 436
-| Jan  21  36  64 620

$ awk '$1 !~ /J/' inventory-shipped
-| Feb  15  32  24 226
-| Mar  15  24  34 228
-| Apr  31  52  63 420
-| May  16  34  29 208
…

\symbol (suppresses the special meaning of symbol)
^, $, .
[...], [^...]
(...), |
*, +, ?
{n}, {n,}, {n,m}
[:alpha:], [:alnum:], [:digit:], [:xdigit:]
[:lower:], [:upper:]
[:blank:] (spaces and tabs), [:space:] (all whitespace, including newlines)
[:print:] (printable, including spaces)
[:graph:] (printable and visible, excluding spaces)
[:punct:] (punctuation), [:cntrl:] (control)
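
A few of these operators in use, as a rough sketch. The input lines are invented, and the example assumes an awk that supports POSIX character classes (gawk and current versions of mawk do). Note that a class such as [:digit:] is only valid inside a bracket expression, hence the doubled brackets.

```shell
# Invented sample lines; each rule labels the records it matches.
out=$(printf 'abc 123\nfoo\n42\nBar!\n' | awk '
    /^[[:digit:]]+$/                       { print "digits:", $0 }
    /^[[:upper:]][[:lower:]]+[[:punct:]]$/ { print "word+punct:", $0 }
    /^(foo|bar)$/                          { print "alternation:", $0 }
')
printf '%s\n' "$out"
```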

echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'

<A>bcd

BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp    { print }

x = "aB"
if (x ~ /ab/) …   # this test will fail

IGNORECASE = 1    # gawk extension
if (x ~ /ab/) …   # now it will succeed
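
Outside gawk, a common portable way to get a case-insensitive match is to fold the string with tolower() before matching. A minimal sketch, reusing the variable x from the snippet above:

```shell
# tolower() is part of POSIX awk, so this does not depend on IGNORECASE.
out=$(awk 'BEGIN { x = "aB"; if (tolower(x) ~ /ab/) print "match"; else print "no match" }')
printf '%s\n' "$out"
```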

Reading Input Files

Value of `RS`	Records are split on …	`awk` / `gawk`
Any single character	That character	`awk`
The empty string (`""`)	Runs of two or more newlines	`awk`

By default, the record separator is the newline character.
FNR indicates how many records have been read from the current input file;
NR indicates how many records have been read in total.
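
A sketch of the difference between NR and FNR, and of paragraph mode (RS set to the empty string). The file names one.txt and two.txt and their contents are invented for illustration and live in a temporary directory.

```shell
# Hypothetical input files, created in a scratch directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n'    > "$dir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
counts=$(awk '{ print NR, FNR, $0 }' "$dir/one.txt" "$dir/two.txt")
printf '%s\n' "$counts"

# With RS = "", blank lines separate records and newlines also separate fields.
paras=$(printf 'a\nb\n\nc\n' | awk 'BEGIN { RS = "" } { print NR ": " $1 }')
printf '%s\n' "$paras"

rm -r "$dir"
```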

Field separator value	Fields are split …	`awk` / `gawk`
`FS == " "`	On runs of whitespace	`awk`
`FS == any single character`	On that character	`awk`
`FS == regexp`	On text matching the regexp	`awk`

By default, fields are separated by whitespace.
FS may be set from the command line using the -F option.
$1 refers to the first field, $2 to the second, and so on; $0 is the whole record.
NF is a predefined variable whose value is the number of fields in the current record.
So, $NF refers to the last field.
Fields may also be assigned values, which causes the value of $0 to be recomputed when it is later referenced.
OFS, the output field separator, is used to join the fields when the record is recomputed.
Use getline in its various forms to read additional records from the default input stream, from a file, or from a pipe or coprocess.
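
The field rules above, as a short sketch with an invented colon-separated record. The second command shows one of the getline forms, reading a line from a command; the command string and the variable name line are arbitrary choices for the example.

```shell
# -F: splits on colons; assigning to $2 rebuilds $0 using OFS.
fields=$(echo 'a:b:c' | awk -F: 'BEGIN { OFS = "-" } { print NF; print $NF; $2 = "X"; print $0 }')
printf '%s\n' "$fields"

# command | getline var reads one line of the command output into var.
got=$(awk 'BEGIN { "echo from-a-pipe" | getline line; close("echo from-a-pipe"); print line }')
printf '%s\n' "$got"
```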

Printing Output

The simple statement print with no items is equivalent to print $0.
To print a blank line, use print "".
If you use print with a list of items separated by commas, the items are output separated by OFS (a single space by default) and followed by ORS (a newline by default).
Items separated only by spaces are concatenated into a single string.

awk 'BEGIN { print "foo", "bar" }'
awk 'BEGIN { print "foo" "bar" }'

foo bar
foobar

OFS and ORS are the output field separator and the output record separator.
Use printf for formatted output.
print and printf can redirect output to files with > and >>, and to commands with |.
Use |& (a gawk extension) for two-way communication with a coprocess, a separate process that gawk both writes to and reads from.
Use close() to close files, pipes, and coprocesses when you are done with them.

awk 'BEGIN { print "test" > "test.txt" }'
cat 'test.txt'

test
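
A sketch of printf formatting and of piping output to a command, then closing the pipe. The format string and the values are arbitrary; sort is assumed to be available as the external command.

```shell
# %-5s left-justifies in 5 columns, %3d right-justifies, %.2f keeps two decimals.
fmt=$(awk 'BEGIN { printf "%-5s|%3d|%.2f\n", "ab", 7, 3.14159 }')
printf '%s\n' "$fmt"

sorted=$(awk 'BEGIN {
    print "b" | "sort"
    print "a" | "sort"
    close("sort")      # wait for sort to finish before continuing
    print "done"
}')
printf '%s\n' "$sorted"
```

Without the close() call, sort would only emit its output when awk exits, so the relative order of its lines and "done" would not be guaranteed.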

Expressions

awk supplies three kinds of constants: numeric, string, and regexp.
Numbers are automatically converted to strings, and strings to numbers, as needed by awk.
An unassigned variable compares equal to both "" and 0.
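
These conversion rules can be checked with a small sketch. The variable names x, a, and b are arbitrary; x is deliberately never assigned.

```shell
out=$(awk 'BEGIN {
    if (x == "" && x == 0) print "x is unset"   # unassigned: equal to both
    print "3" + 4          # string converted to a number
    print 3 " apples"      # number converted to a string by concatenation
    a = (10 < 9)           # numeric comparison
    b = ("10" < "9")       # string comparison, character by character
    print a, b
}')
printf '%s\n' "$out"
```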

Patterns, Actions, and Variables

/regular expression/: match
expression: boolean
begpat, endpat: range
BEGIN, END: per program
BEGINFILE, ENDFILE: per file (gawk specific)
<empty>: all

onoff.txt

foo
on
bar
off
baz

awk '/^on$/, /^off$/ { print }' onoff.txt

on
bar
off
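
For the other pattern kinds, a minimal sketch with invented numeric input: BEGIN runs before any input, the expression pattern $1 > 6 selects records, and END runs after the last record. The names n and sum are arbitrary.

```shell
out=$(printf '5\n12\n7\n20\n' | awk '
    BEGIN   { print "start" }
    $1 > 6  { n++; sum += $1 }
    END     { print n, sum }
')
printf '%s\n' "$out"
```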

awk supports the if, for, while, and do control statements, and gawk adds switch. They are similar to those of C.
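
A short sketch of the C-like control statements, restricted to the ones available in every awk. The variable names and the arithmetic are made up for the example.

```shell
out=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        if (i % 2 == 0)
            continue          # skip even numbers
        odd += i              # 1 + 3 + 5
    }
    print odd

    n = 10
    while (n > 1)
        n = int(n / 2)        # 10, 5, 2, 1
    print n

    do { k++ } while (k < 3)
    print k
}')
printf '%s\n' "$out"
```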

Arrays in `awk`

Arrays in awk are associative.
Arrays are indexed by string values.
In gawk, which supports arrays of arrays, use the isarray() built-in function to determine if an array element is itself a subarray.

Index	Value
`3`	`30`
`1`	`"foo"`
`0`	`8`
`2`	`""`

if (2 in frequencies)
    print "Subscript 2 is present."

# The iteration order is undefined.
# gawk can control the order through PROCINFO["sorted_in"].
for (var in array)
    body

delete array[index-expression]
delete array  # deletes all elements, not the array itself
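
A sketch tying these array operations together. The input words and the array names freq and a are invented. Because for-in order is undefined, the first command pipes the result through sort to get stable output.

```shell
# Count word frequencies; subscripts are the words themselves.
freq=$(printf 'a b a\nc a b\n' | awk '
    { for (i = 1; i <= NF; i++) freq[$i]++ }
    END { for (w in freq) print w, freq[w] }
' | sort)
printf '%s\n' "$freq"

# Subscripts are strings, so 1 and "1" name the same element.
checks=$(awk 'BEGIN {
    a[1] = "one"
    if ("1" in a) print "same subscript"
    delete a[1]
    if (!(1 in a)) print "deleted"
}')
printf '%s\n' "$checks"
```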

https://www.gnu.org/software/gawk/manual/html_node/Arrays-Summary.html#Arrays-Summary

Functions

function name([parameter-list])
{
    body-of-function
}

Function parameters cannot have the same name as one of the special predefined variables.

foo.awk

function foo(s) {
    return length(s)
}

echo "apple" | gawk -f foo.awk -e '{ print foo($0) }'

awk has no local variable declarations, so local variables are mimicked with extra parameters that callers never pass. By convention, extra spaces separate these from the real parameters. In the example below, i is used as a local variable.

function foo(j,    i)
{
    i = j + 1
    print "foo's i=" i
    bar()
    print "foo's i=" i
}
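
To see the effect, here is a self-contained sketch. The bar() shown here is a hypothetical stand-in that assigns the global i, and foo() is simplified to return its local i instead of printing it.

```shell
out=$(awk '
function bar() { i = 99 }      # hypothetical helper: assigns the GLOBAL i
function foo(j,    i) {        # the i after the extra spaces is local to foo
    i = j + 1
    bar()                      # does not disturb the local i
    return i
}
BEGIN {
    print foo(0)               # value of the local i
    print i                    # value of the global i, set by bar
}')
printf '%s\n' "$out"
```

If i were dropped from the parameter list of foo, both functions would share the global i and foo(0) would return 99.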

Arrays passed to functions are not copied; they are passed by reference, so a function can modify the caller's array. Scalars, by contrast, are passed by value. This behavior can be used to mimic call by reference.
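
A minimal sketch of the difference, with invented function names fill() and bump(): the array filled inside fill() is visible to the caller, while the increment inside bump() is not.

```shell
out=$(awk '
function fill(arr, n,    i) { for (i = 1; i <= n; i++) arr[i] = i * i }
function bump(x) { x++ }       # changes only the local copy of x
BEGIN {
    fill(sq, 3)                # sq becomes an array, modified in place
    v = 1
    bump(v)                    # v is a scalar, passed by value
    print sq[1], sq[2], sq[3]
    print v
}')
printf '%s\n' "$out"
```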

https://www.gnu.org/software/gawk/manual/html_node/Functions-Summary.html#Functions-Summary

Problem Solving with `awk`

This part of the guide is devoted to worked examples.
