
Hands-on AWK

Mehvish Poshni
Jun 12, 2023
11 min read

For many of us, our first exposure to programming is through a general-purpose language like C, Python, or Java. AWK, on the other hand, was designed with a very targeted goal: processing text-based data without having to write many lines of code. That’s not to say that AWK is limited to this function alone; far from it. However, the effectiveness of AWK is largely due to the control it offers when writing quick one-liner command-line programs, as well as short scripts that serve an immediate need. Imagine having the prowess to manipulate system logs, configuration files, and spreadsheet data from the command line in just a few keystrokes. Another reason why it’s worthwhile learning AWK is that it comes pre-installed as the utility awk on Unix-like operating systems, and its inclusion in the Unix ecosystem makes it very convenient to use.

The letters A, W, and K in AWK stand for the last names of the individuals (Alfred Aho, Peter Weinberger, and Brian Kernighan) who designed the programming language in the late 1970s.

Note: This blog assumes a passing familiarity with a command-line shell (cat, echo, pipes, and redirection) and some prior programming exposure (concepts like comparison and logical operators, expressions, conditionals, and loops).

An AWK program takes input either in the form of one or more text files, or as the standard input stream coming from the shell environment in which awk executes. The default behavior is that each line in the input stream is considered one record, and each record has fields (text) separated by one or more whitespace characters (spaces or tabs). This default behavior can be overridden easily.

The examples in this blog use an input text file named inputfile. For convenience, we show the file here in a tabular format (the header row is not part of the file):

First     Last        Age  Joining      Score  Team         Remote
Colton    Dominguez   28   Feb-20-2021  33     Marketing    No
Megan     Porter      29   Dec-03-2021  81     Engineering  Yes
Candace   Walsh       25   Apr-14-2023  43     Sales        No
Grady     Clements    40   Feb-15-2023  36     Sales        Yes
Macaulay  Roy         33   Jul-11-2022  63     Engineering  Yes
Abraham   Strickland  31   Aug-25-2022  93     Marketing    Yes
Joelle    Higgins     42   Sep-23-2022  89     Engineering  Yes

Note: The input file may not necessarily have the same number of fields on each line.

An unusual workflow#

The manner in which an AWK program runs is unusual: behind the scenes, the code is executed repeatedly, once for each record in the input.

Whenever a record is read, there are special built-in variables $1, $2, $3 (and so on) that can be used for accessing the values in the first, second, third (and so on) fields of that record. The entire record can also be retrieved all at once using the built-in variable $0.
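As a small illustration of these variables, the sketch below pipes a single record (the first line of the input file) into awk and prints a few of its fields. The statements inside the braces run once for the record; the general form of a program is described in the next section. The shell variable name out is only an assumption made for this illustration.

```shell
# Feed one record to awk and pick out individual fields.
# $1 is the first field, $5 the fifth, and $0 the whole record.
out=$(echo "Colton Dominguez 28 Feb-20-2021 33 Marketing No" |
  awk '{ print $1; print $5; print $0 }')
echo "$out"
```

The first two lines of output are Colton and 33, followed by the complete record.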

An AWK program#

A basic AWK program consists of one or more pattern-action pairs in the following general form.

pattern { action }
  • The pattern is an expression that evaluates to a value that’s regarded as true or false.

    Note: AWK does not have a boolean data type, but 0 and the empty string "" are regarded as false, and all other values as true.

  • The action consists of one or more statements. In case there are multiple statements within an action, they may be separated by either a semicolon character (;) or a newline.

When running an AWK program, each pattern is tested against every record of the input stream, one by one. Whenever the pattern evaluates to true, the corresponding action is executed.

Execution of action depends on the value of the pattern
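To make the record-by-record testing concrete, here is a minimal sketch with three pattern-action pairs applied to two records borrowed from the input file. The messages and the shell variable out are made up for this illustration. For each record, every pattern is tested in order, and only the actions whose patterns are true are executed.

```shell
# Two records borrowed from the input file used in this blog.
out=$(printf '%s\n' \
  "Megan Porter 29 Dec-03-2021 81 Engineering Yes" \
  "Grady Clements 40 Feb-15-2023 36 Sales Yes" |
  awk '$3 < 30     { print $1 " is under 30" }
       $5 < 50     { print $1 " scored below 50" }
       $7 == "Yes" { print $1 " works remotely" }')
echo "$out"
```

Megan's record triggers the first and third actions, while Grady's record triggers the second and third.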

The entire program is enclosed within single quotes, and can be run from the command line using the awk utility:

awk 'pattern { action }' inputfile

Here, inputfile is the input to the program. More than one file can also be passed as input.

awk 'pattern { action }' file1 file2

Since we are running the program using the awk utility in the shell, output can be stored in a file using the redirection operator >.

awk 'pattern { action }' inputfile > outputfile

In the same vein, the input can also be taken using the pipe operator.

cat inputfile | awk 'pattern { action }'

Examples#

  1. In the following one-liner, we print the records for which the age (in the third column) is less than 30.
awk '$3 < 30 { print $0 }' inputfile
  2. See how none of the records get printed when 0 or the empty string "" is used as a pattern.
cat inputfile | awk '0 { print $0 }'
awk '"" { print $0 }' inputfile
  3. The pattern in the following snippet is a non-empty string (from the second column in inputfile), which is considered true. So the action is executed for all records. Observe, also, how we can concatenate different strings by placing them side by side.
awk '$2 { print $2 ", " $1 }' inputfile > outputfile
cat outputfile # To display the contents of the outputfile

Patterns and actions are optional#

It isn’t necessary to specify both pattern and { action }. Just one of them suffices:

  • When pattern is not specified, the action is performed for all the records.
awk '{ print $1 "." $2 "@educative.io" }' inputfile
  • When { action } is not specified, the default action is to print all the matched records.
awk '$6 == "Marketing" || $7 == "No"' inputfile

The BEGIN and END patterns#

There are other ways to specify a pattern than creating expressions using numbers, strings, and arithmetic or logical operators:

  • The pattern BEGIN matches the beginning of the input. Its associated action is executed once, before any record is read, which makes it a good place for tasks like initializing variables.
  • The pattern END matches the end of the input. Its associated action is executed once, after the last record has been read.

Think about how to add the scores listed in the fifth column of the input file.
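One possible answer is sketched below. It is a minimal illustration rather than the only way to do it: the variable name sum and the output label are our own choices, and the sample file is recreated inline so the snippet can be run on its own.

```shell
# Recreate the sample input file so the example is self-contained.
cat > inputfile <<'EOF'
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
EOF

# BEGIN runs once before any record is read, the middle action runs
# for every record, and END runs once after the last record.
total=$(awk 'BEGIN { sum = 0 }
             { sum += $5 }
             END { print "Total score: " sum }' inputfile)
echo "$total"   # Total score: 438
```

Strictly speaking, the BEGIN action is optional here, because an uninitialized AWK variable behaves as 0 in arithmetic.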

Contents of inputfile:
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes

Regular expressions as patterns#

Regular expressions (regex) are symbolic ways to represent a pattern, and specify what the matching text should look like.

The syntax of regular expressions used in AWK is known as Extended Regular Expressions (ERE). This syntax is also used by many other languages and Unix-based utilities, so it’s very useful to know.

In AWK, when specifying a regex as a pattern, we can include it between two forward slashes. The simplest form of a regex is as a plain sequence of characters. For example, the pattern /Feb/ matches all records containing the text Feb.

awk '/Feb/' inputfile

Instead of searching the entire record for a match against a regex, we can use the operator ~ to check if a regular expression matches a smaller portion of the given text. Similarly, the operator !~ is useful for checking if there is no match.

Usage: The regular expression must appear on the right of ~ or !~, and the text being searched must go on the left.

awk '$6 ~ /Sa/' inputfile # Sa present in the 2nd last column
echo " "
awk '$(6+1) !~ /Y/' inputfile # Absence of Y in the last column

Regex metacharacters#

A regular expression may include some special characters called metacharacters, so called because they are not matched with a text in a literal sense. Instead they are interpreted as a rule for matching text. Here are some examples:

  • The metacharacters [ and ] match one of (possibly) many characters that appear enclosed within the brackets. For example, [AbC] means a single character: either A, b, or C. Expressions like these are called character classes.
  • A range of characters can also be represented as character classes. For example:
    • [0-9] means a single numeric character from 0 to 9.
    • [a-zA-Z] means a single alphabetical character in upper or lower case.
  • The metacharacters [^ and ] specify a single character other than the ones appearing after the symbol ^ inside the character class. For example, [^bcd] means any character other than b, c, or d.

Note: A range inside a character class covers single characters only. For example, [20-25] does not mean the numbers 20 to 25; it matches one character that is 2, 5, or anything from 0 to 2.

# score column contains a two-digit number in the 30-39 range
awk '$5 ~ /3[0-9]/ { print $1 " scored in the 30 to 39 range" }' inputfile
echo " "
# year in the 4th column is 202x, where x is neither 2 nor 3
awk '$4 ~ /202[^23]/ { print "Joining year of " $1 " is neither 2022 nor 2023" }' inputfile

The metacharacters $ and ^ (outside a character class) take on meaning relative to some other character X in the following way:

  • ^X means lines that start with X.
  • X$ means lines that end with X.
awk '/^J/' inputfile # Lines that start with J
echo " "
awk '/o$/' inputfile # Lines that end with o
  • The metacharacter . means any single character.
  • The metacharacter | means characters specified by the regex on its left or its right. For example, ab|[cd] matches either ab, c or d.
  • The metacharacters () are used for grouping. For example, ^Me|Ma and ^(Me|Ma) mean two different things (lines beginning with Me or containing Ma anywhere, versus lines beginning with either Me or Ma).

Note: The GNU implementation of AWK, known as GAWK, also supports additional features including the use of the metacharacters () for capturing portions of matched text for later use.

awk '/(M..a)/' inputfile # Matches substrings of Megan and Macaulay
echo " "
awk '/(D|P)o/' inputfile # Matches substrings of Dominguez and Porter

The metacharacters *, +, ?, and {n,m} are called quantifiers. They take on meaning relative to their preceding character or group, say X:

  • The expression X* means zero or more occurrences of X.
  • The expression X+ means one or more occurrences of X.
  • The expression X? means zero or one occurrence of X.
  • The expression X{n,m} means at least n and at most m occurrences of X. (Some older awk implementations do not support this form.)
awk '/i[g]+/' inputfile # Joelle (igg in Higgins)
echo " "
awk '/oe?l/' inputfile # Joelle and Colton

Note: To match a metacharacter literally, we need to use the escape character \. For example, /\*/ matches the character *.
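The difference that escaping makes can be seen in the following minimal sketch. The two sample lines and the shell variable names unescaped and escaped are made up for this illustration.

```shell
# Two sample lines: one contains a literal asterisk, the other does not.
# Unescaped, 3* means "zero or more 3s", so /3*4/ matches any line containing a 4.
unescaped=$(printf '%s\n' "3*4" "34" | awk '/3*4/')
# Escaped, \* stands for the asterisk character itself.
escaped=$(printf '%s\n' "3*4" "34" | awk '/3\*4/')
echo "$unescaped"
echo "$escaped"
```

The unescaped pattern prints both lines, while the escaped pattern prints only the line 3*4.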

Data structure: Associative array#

An associative array is the only data structure supported by AWK. It essentially consists of index and value pairs, where the index can be used for retrieving the corresponding value.

An associative array

An associative array is created simply through an assignment statement that maps a value to an index. The syntax looks like this:

arr["ind"] = "val"

We can also add more elements to an array using assignment statements like the one above.
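A minimal sketch of creating and traversing such an array is shown here. The names and scores are taken from the input file, but the array is filled by hand inside a BEGIN action, so no input file is needed. The order in which a for(i in arr) loop visits the indexes is not specified, so the output is passed through sort to make it predictable.

```shell
# No input file is needed: everything happens in the BEGIN action.
out=$(awk 'BEGIN {
    arr["Colton"] = 33      # the first assignment creates the array
    arr["Megan"] = 81       # further assignments add more elements
    arr["Candace"] = 43
    for (i in arr)          # i takes each index in turn
        print i " : " arr[i]
}' | sort)
echo "$out"
```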


Notice how we loop over the array arr using a for(i in arr) style loop. In each round, i is set to the index of an element in arr (and not an element in arr).

AWK also supports a C-style for loop (see exact syntax below), but it isn’t suitable for traversing over an associative array because the keys of an associative array may not fall in the required range of numbers.

for(i = 1; i < 10; i++) 
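A C-style loop is a better fit when the indexes are consecutive numbers, as is the case with field numbers. The sketch below, which uses a made-up one-line input, walks over the fields of a record from 1 to NF (a built-in variable that holds the number of fields in the current record).

```shell
out=$(echo "Megan Porter 29" |
  awk '{
    # Field numbers run from 1 to NF, so a counting loop fits here.
    for (i = 1; i <= NF; i++)
        print i ": " $i
  }')
echo "$out"
```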

Here’s another example in which the number of individuals in each team is counted.

awk '!arr[$6] { arr[$6] = 0 }
{ arr[$6] += 1 }
END {
    for (i in arr)
    {
        print i " : " arr[i]
    }
}' inputfile

Built-in variables and functions#

Besides $0 and $1, $2, $3, and so on, there are other built-in variables that are easy to remember and easy to use. Some of these are shown in the following table:

Variable  Description
NR        Number of records read so far (counted across all input files; the related variable FNR restarts for each file)
NF        Number of fields in the current record
RS        Record separator (newline by default)
ORS       Output record separator (newline by default)
FS        Field separator (one or more whitespaces by default)
OFS       Output field separator (space by default)
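As a quick illustration of NR, NF, and FS working together, the following sketch reads two made-up comma-separated records with different numbers of fields. Setting FS in a BEGIN action is one common way to change the field separator; the sample data and the shell variable out are assumptions made for this example.

```shell
out=$(printf '%s\n' "Megan,Porter,29" "Roy,33" |
  awk 'BEGIN { FS = "," }   # fields are now separated by commas
       { print NR ": " NF " fields, first is " $1 }')
echo "$out"
```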

AWK supports many predefined mathematical functions (like log, sqrt, exp, sin) as well as functions for working with strings (such as substr, length, toupper).
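A few of these functions are tried out in the sketch below on a made-up one-line input. Note that substr counts characters starting from 1, not 0.

```shell
out=$(echo "Megan Porter 81" |
  awk '{
    print toupper($1)          # MEGAN
    print length($2)           # 6
    print substr($2, 1, 3)     # Por
    print sqrt($3)             # 9
  }')
echo "$out"
```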

Let’s see a few more examples before we call it a day.

Example 1: Overriding default values#

We can use any character as a field separator in the output by changing the value of the variable OFS. Its default value (a space) can be overridden as shown below.

Also note how, in the following example, we print the record number for each row using the variable NR (for number of records).

awk '{ print NR, $1, $5 }' OFS=, inputfile

Example 2: Accessing fields using rvalues#

If a variable varname is assigned an integer k, then the syntax $varname can be used for accessing the field in the k-th column.

For example, since NF stores the number of fields in the current record, we can access the last field in that row using the syntax $NF.

awk '{ print NR, $(NF-1), $NF }' inputfile

Example 3: Formatting output#

The C-style printf is used for showing the output formatted in a tabular form. The argument %-20s pads the string to a width of 20 characters and aligns it to the left.

awk 'BEGIN { printf "%-20s | %-5s\n", "Full Name", "Score" }
{ printf "%-20s | %-5d\n", $1 " " $2, $5 }' inputfile

The next two examples use built-in functions.

Example 4: Splitting a string#

The built-in function split(str, arr, ch) splits the string str around the character ch and stores the resulting substrings in the array arr. This function can be used to extract the month and year from each individual’s joining date.
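A minimal sketch of this idea follows, run on two records borrowed from the input file. The array name parts, the variable n, and the output wording are our own choices for this illustration.

```shell
out=$(printf '%s\n' \
  "Colton Dominguez 28 Feb-20-2021 33 Marketing No" \
  "Joelle Higgins 42 Sep-23-2022 89 Engineering Yes" |
  awk '{
    # Split the 4th field around "-": parts[1] is the month,
    # parts[2] the day, and parts[3] the year.
    n = split($4, parts, "-")   # n holds the number of pieces (3 here)
    print $1 " joined in " parts[1] " " parts[3]
  }')
echo "$out"
```

The array filled by split is indexed from 1, and the function returns the number of pieces it produced.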


Example 5: Find and replace#

The function gsub(regex, subst, str) looks for all matches of the regular expression regex in the string str, and replaces each one with the string subst. The g in gsub is for “global”. There’s also a related function sub (for replacing only the first occurrence).
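Both functions are shown in the sketch below, applied to two records borrowed from the input file. The choice of replacements is arbitrary and only meant as an illustration. Both functions modify the string passed as the third argument in place.

```shell
out=$(printf '%s\n' \
  "Colton Dominguez 28 Feb-20-2021 33 Marketing No" \
  "Megan Porter 29 Dec-03-2021 81 Engineering Yes" |
  awk '{
    gsub(/-/, "/", $4)      # replace every "-" in the 4th field with "/"
    sub(/Yes/, "Y", $7)     # replace only the first match in the 7th field
    print $1, $4, $7
  }')
echo "$out"
```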


Example 6: Bigger programs#

In AWK programs, we can use many constructs similar to the ones available in other languages, like if, else, while, and more (GAWK also offers switch). One can also define a function in an AWK program, and then call it from within an action.

awk 'BEGIN { max = -1; name = "" }
{
    if (max < $5)
    {
        max = $5
        name = $1
    }
}
END { print name ": " getMaxScore() }
function getMaxScore() { return max }' inputfile

A final word#

It’s worth noting that AWK is a Turing-complete language, which means that it can be used to implement any algorithm. That being said, its typical uses include tasks like filtering and manipulating data.

This blog is far from being a complete tutorial, but we hope that it is effective in removing any entry-level barriers for a faster and happier learning experience.
