For many of us, our first exposure to programming is through a general-purpose language like C, Python, or Java. AWK, on the other hand, was designed with a very targeted goal: processing text-based data without having to write many lines of code. That’s not to say that AWK is limited to this function alone; far from it. However, the effectiveness of AWK is largely due to the control it offers when writing quick one-liner command-line programs, as well as short scripts that serve an immediate need. Imagine having the prowess to manipulate system logs, configuration files, and spreadsheet data from the command line in just a few keystrokes. Another reason why it’s worthwhile learning AWK is that it comes pre-installed as the utility awk on Unix-like operating systems, and its inclusion in the Unix ecosystem makes it very convenient to use.
The letters A, W, and K in AWK stand for the last names of the individuals (Alfred Aho, Peter Weinberger, and Brian Kernighan) who designed the programming language in the late 1970s.
Note: This blog assumes a passing familiarity with a command-line shell (cat, echo, pipes, and redirection) and some prior programming exposure (concepts like comparison and logical operators, expressions, conditionals, and loops).
An AWK program takes input either in the form of one or more text files, or as the standard input stream coming from the shell environment in which awk executes. By default, each line in the input stream is considered one record, and each record has fields (text) separated by one or more whitespace characters (spaces or tabs). This default behavior can be overridden easily.
The coding environments included in this blog make use of an input text file that can be viewed in each coding environment. For convenience, we show the file here in a tabular format:
First | Last | Age | Joining | Score | Team | Remote |
Colton | Dominguez | 28 | Feb-20-2021 | 33 | Marketing | No |
Megan | Porter | 29 | Dec-03-2021 | 81 | Engineering | Yes |
Candace | Walsh | 25 | Apr-14-2023 | 43 | Sales | No |
Grady | Clements | 40 | Feb-15-2023 | 36 | Sales | Yes |
Macaulay | Roy | 33 | Jul-11-2022 | 63 | Engineering | Yes |
Abraham | Strickland | 31 | Aug-25-2022 | 93 | Marketing | Yes |
Joelle | Higgins | 42 | Sep-23-2022 | 89 | Engineering | Yes |
Note: The input file may not necessarily have the same number of fields on each line.
An AWK program runs in an unusual way: behind the scenes, its code is executed repeatedly, once for each record in the input.
Whenever a record is read, the special built-in variables $1, $2, $3 (and so on) can be used for accessing the values in the first, second, third (and so on) fields of that record. The entire record can also be retrieved all at once using the built-in variable $0.
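As a quick illustration of field access, the following sketch pipes a single record (copied from the sample input) into awk and prints a few of its fields. The shell variable names (record, name, whole) are only illustrative, and the { ... } part is an action, which is explained next.

```shell
# One record from the sample input, piped into awk.
record='Megan Porter 29 Dec-03-2021 81 Engineering Yes'

# $2 and $1 pick out the second and first fields of the record.
name=$(echo "$record" | awk '{ print $2 ", " $1 }')

# $0 holds the whole record, unchanged.
whole=$(echo "$record" | awk '{ print $0 }')

echo "$name"    # Porter, Megan
echo "$whole"   # Megan Porter 29 Dec-03-2021 81 Engineering Yes
```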
A basic AWK program consists of one or more pattern-action pairs in the following general form.
pattern { action }
The pattern is an expression that evaluates to a value that’s regarded as true or false.
Note: AWK does not have a boolean data type, but 0 and the empty string "" are regarded as false, and all other values as true.
The action consists of one or more statements. In case there are multiple statements within an action, they may be separated by either a semicolon character (;) or a newline.
When running an AWK program, each pattern is tested against every record of the input stream, one by one. Whenever the pattern evaluates to true, the corresponding action is executed.
The entire program is enclosed within single quotes, and can be run from the command line using the awk utility:
awk 'pattern { action }' inputfile
Here, inputfile is the input to the program. More than one file can also be passed as input.
awk 'pattern { action }' file1 file2
Since we are running the program using the awk utility in the shell, output can be stored in a file using the redirection operator >.
awk 'pattern { action }' inputfile > outputfile
In the same vein, the input can also be taken using the pipe operator.
cat inputfile | awk 'pattern { action }'
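For example, the following program prints every record in which the third field (the age) is less than 30:
awk '$3 < 30 { print $0 }' inputfile
The action is never executed when 0 or the empty string "" is used as a pattern.
cat inputfile | awk '0 { print $0 }'
awk '"" { print $0 }' inputfile
In the next example, the pattern $2 is a non-empty string (for every record in inputfile) which is considered true. So the action is executed for all records. Observe, also, how we can concatenate different strings by placing them side by side.
awk '$2 { print $2 ", " $1 }' inputfile > outputfile
cat outputfile # To display the contents of the outputfile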
It isn’t necessary to specify both pattern and { action }. Just one of them suffices:
If the pattern is not specified, the action is performed for all the records.
awk '{ print $1 "." $2 "@educative.io" }' inputfile
If the { action } is not specified, the default action is to print all the matched records.
awk '$6 == "Marketing" || $7 == "No"' inputfile
There are other ways to specify a pattern besides creating expressions using numbers, strings, and arithmetic or logical operators:
BEGIN is matched with the beginning of the input. So its associated action is executed at the beginning, before any record is read. It makes sense to use it for tasks like initializing variables.
END matches the end of the input, and its action is executed once after the last record has been read.
Think about how to add the scores listed in the fifth column of the input file.
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
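One possible way to add up the scores is sketched below. It recreates the sample file in the current directory under the name inputfile (an assumption made so that the snippet is self-contained), initializes a running total in BEGIN, updates it for every record, and prints it in END. The variable names sum and total are illustrative.

```shell
# Recreate the sample input file (overwrites ./inputfile if it exists).
cat > inputfile <<'EOF'
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
EOF

# BEGIN runs once before the first record, the middle action runs for
# every record, and END runs once after the last record.
total=$(awk 'BEGIN { sum = 0 } { sum += $5 } END { print sum }' inputfile)

echo "Total score: $total"   # Total score: 438
```

Strictly speaking, the BEGIN block is optional here because an uninitialized AWK variable behaves like 0 in arithmetic, but spelling it out makes the three phases easier to see.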
Regular expressions (regex) are symbolic ways to represent a pattern, and specify what the matching text should look like.
The syntax of regular expressions used in AWK is known as the Extended Regular Expression (ERE).
This syntax is also used by many other languages and Unix-based utilities. So it’s super useful to know.
In AWK, when specifying a regex as a pattern, we can include it between two forward slashes. The simplest form of a regex is a plain sequence of characters. For example, the pattern /Feb/ matches all records containing the text Feb.
awk '/Feb/' inputfile
Instead of searching the entire record for a match against a regex, we can use the operator ~ to check if a regular expression matches a smaller portion of the given text. Similarly, the operator !~ is useful for checking if there is no match.
Usage: The regular expression must appear on the right of ~ or !~, and the text being searched must go on the left.
awk '$6 ~ /Sa/' inputfile # Sa present in the 2nd last column
echo " "
awk '$(6+1) !~ /Y/' inputfile # Absence of Y in the last column
A regular expression may include some special characters called metacharacters, so called because they are not matched with a text in a literal sense. Instead they are interpreted as a rule for matching text. Here are some examples:
[ and ] match one of (possibly) many characters that appear enclosed within the brackets. For example, [AbC] means a single character: either A, b, or C. Expressions like these are called character classes.
[0-9] means a single numeric character from 0 to 9.
[a-zA-Z] means a single alphabetical character in upper or lower case.
[^ ] specifies a single character other than the ones appearing after the symbol ^ inside the character class. For example, [^bcd] means any character other than b, c, or d.
Note that a character class always matches a single character, so a range of numbers such as 20 to 25 has to be built digit by digit.
# score column contains a number in the 20-25 range (no record in the sample input does)
# ^ and $ anchor the match to the start and end of the field
awk '$5 ~ /^2[0-5]$/ { print $1 " scored in the 20 to 25 range" }' inputfile
echo " "
# 2022 or 2023 not present in the 4th column
awk '$4 ~ /202[^23]/ { print "Joining year of " $1 " is neither 2022 nor 2023" }' inputfile
The metacharacters $ and ^ (outside a character class) take on meaning relative to some other character X in the following way:
^X means lines that start with X.
X$ means lines that end with X.
awk '/^J/' inputfile # Lines that start with J
echo " "
awk '/o$/' inputfile # Lines that end with o
. means any single character.
| means characters specified by the regex on its left or its right. For example, ab|[cd] matches either ab, c, or d.
() are used for grouping characters. For example, ^M versus ^(Me) mean two different things (lines beginning with M versus lines beginning with Me).
Note: The GNU implementation of AWK, known as GAWK, also supports additional features including the use of metacharacters () for capturing portions of matched text for later use.
awk '/(M..a)/' inputfile # Matches substrings of Megan and Macaulay
echo " "
awk '/(D|P)o/' inputfile # Matches substrings of Dominguez and Porter
The metacharacters *, +, ?, and {n,m} are called quantifiers. They also take on meaning relative to their preceding character, say X:
X* means zero or more occurrences of X.
X+ means one or more occurrences of X.
X? means zero or one occurrence of X.
X{n,m} means at least n and at most m occurrences of X. (Some older awk implementations do not support this form.)
awk '/i[g]+/' inputfile # Joelle (ig in Higgins)
echo " "
awk '/oe?l/' inputfile # Joelle and Colton
Note: To match a metacharacter literally, we need to use the escape character \. For example, /\*/ matches the character *.
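To make the effect of escaping concrete, here is a small sketch using two made-up lines (they are not part of the sample input). It compares an unescaped dot, which matches any character, with an escaped one, which matches only a literal dot.

```shell
# Two made-up lines: only the first contains a literal dot.
lines=$(printf '%s\n' 'v1.2' 'v1x2')

# An unescaped . matches any single character, so both lines match.
any=$(echo "$lines" | awk '/1.2/')

# An escaped \. matches only a literal dot, so only the first line matches.
literal=$(echo "$lines" | awk '/1\.2/')

echo "$any"
echo "$literal"   # v1.2
```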
An associative array is the only data structure supported by AWK. It essentially consists of index and value pairs, where the index can be used for retrieving the corresponding value.
An associative array is created simply through an assignment statement that maps a value to an index. The syntax looks like this:
arr["ind"] = "val"
We can also add more elements to an array using assignment statements like the one above.
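A minimal sketch of creating and traversing an associative array follows. The names and scores are taken from the sample input, but the program itself is only an illustration. Since it consists of a BEGIN block alone, it needs no input file. The output is piped through sort because the order in which a for (i in arr) loop visits the indices is not specified.

```shell
out=$(awk 'BEGIN {
    # Each assignment adds one index/value pair to the array.
    arr["Colton"] = 33
    arr["Megan"] = 81
    arr["Candace"] = 43

    # i takes on each index in turn; arr[i] is the matching value.
    for (i in arr)
        print i " scored " arr[i]
}' | sort)

echo "$out"
```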
We can loop over an array arr using a for(i in arr) style loop. In each round, i is set to the index of an element in arr (and not an element in arr).
AWK also supports a C-style for loop (see exact syntax below), but it isn’t suitable for traversing over an associative array because the keys of an associative array may not fall in the required range of numbers.
for(i = 1; i < 10; i++)
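For completeness, here is a small sketch of the C-style loop in action. It uses the same loop header as above to add the integers from 1 to 9. The variable names sum and total are illustrative.

```shell
total=$(awk 'BEGIN {
    # i runs from 1 to 9; sum starts out empty, which counts as 0.
    for (i = 1; i < 10; i++)
        sum += i
    print sum
}')

echo "$total"   # 45
```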
Here’s another example where the number of individuals in each team is counted.
awk '!arr[$6] { arr[$6] = 0 }
{ arr[$6] += 1 }
END {
  for (i in arr) {
    print i " : " arr[i]
  }
}' inputfile
Other than $0, and $1, $2, $3, etc., there are other built-in variables that are easy to remember and easy to use. Some of these are shown in the following table:
Variable | Description |
FNR | Number of file records read so far (reset to 0 for each new input file) |
NF | Number of fields in the current record |
RS | Record separator (newline by default) |
ORS | Output record separator |
FS | Field separator (one or more whitespace characters by default) |
OFS | Output field separator (space by default) |
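The sketch below shows a few of these variables working together on two made-up comma-separated lines (an assumption, since the sample input is whitespace-separated). FS is set so that fields are split on commas, OFS controls what print places between its arguments, and NR and NF report the record number and the field count.

```shell
# Two made-up comma-separated records.
csv=$(printf '%s\n' 'Megan,Porter,81' 'Grady,Clements,36')

# Split input on commas and join output fields with " | ".
out=$(echo "$csv" | awk 'BEGIN { FS = ","; OFS = " | " } { print NR, NF, $1 }')

echo "$out"
# 1 | 3 | Megan
# 2 | 3 | Grady
```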
AWK supports many predefined mathematical functions (like log, sqrt, exp, sin) as well as functions for working with strings (such as substr, length, toupper).
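As a rough illustration, the following sketch applies a few of these functions to one record from the sample input. The choice of fields is arbitrary and only meant to show the call syntax.

```shell
record='Megan Porter 29 Dec-03-2021 81 Engineering Yes'

# toupper: upper-case the first name; length: characters in the last name;
# substr: first three characters of the date; sqrt: square root of the score.
out=$(echo "$record" | awk '{ print toupper($1), length($2), substr($4, 1, 3), sqrt($5) }')

echo "$out"   # MEGAN 6 Dec 9
```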
Let’s see a few more examples before we call it a day.
We can use any character as a field separator in the output by changing the default value of the variable OFS. The default value of OFS can be overridden as shown below.
Also note how, in the following example, we print the record number for each row using the variable NR (for number of records).
awk '{ print NR, $1, $5 }' OFS=, inputfile
If a variable varname is assigned an integer n, then the syntax $varname can be used for accessing the field in the nth column.
For example, since NF stores the number of fields in the current record, we can access the last field in that row using the syntax $NF.
awk '{ print NR, $(NF-1), $NF }' inputfile
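The same idea works with a variable of our own. In the sketch below, the hypothetical variable col is set from the command line with the standard -v option, and $col then refers to the field in that column.

```shell
record='Grady Clements 40 Feb-15-2023 36 Sales Yes'

# col holds the integer 5, so $col is the fifth field (the score).
picked=$(echo "$record" | awk -v col=5 '{ print $col }')

# NF holds 7 for this record, so $NF is the seventh (last) field.
last=$(echo "$record" | awk '{ print $NF }')

echo "$picked"   # 36
echo "$last"     # Yes
```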
The C-style printf is used for showing the output formatted in a tabular form. The argument %-20s sets the width of the padded string to 20 characters and aligns it to the left.
awk 'BEGIN { printf "%-20s | %-5s\n", "Full Name", "Score" }
{ printf "%-20s | %-5d\n", $1 " " $2, $5 }' inputfile
The next two examples use built-in functions.
The built-in function split(str, arr, ch) splits the string str around the character ch and stores the resulting substrings in the array arr. It can be used, for instance, to extract the month and year from each individual’s joining date.
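A minimal sketch of this use of split is shown here for a single record from the sample input. The array name parts and the wording of the output are illustrative choices.

```shell
record='Colton Dominguez 28 Feb-20-2021 33 Marketing No'

# Split the date on "-": parts[1] is the month, parts[2] the day,
# and parts[3] the year.
out=$(echo "$record" | awk '{ split($4, parts, "-"); print $1 " joined in " parts[1] " " parts[3] }')

echo "$out"   # Colton joined in Feb 2021
```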
The function gsub(regex, subst, str) looks for all matches of the regular expression regex in the string str, and replaces each of them with the string subst. The g in gsub is for “global”. There’s also a related function sub (for replacing only the first match).
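The difference between the two functions can be seen in the following sketch, which rewrites the date field of one sample record. Both functions modify their third argument in place and return the number of replacements made. The variable n is illustrative.

```shell
record='Abraham Strickland 31 Aug-25-2022 93 Marketing Yes'

# gsub replaces every hyphen in the date; n is the number of replacements.
all=$(echo "$record" | awk '{ n = gsub(/-/, "/", $4); print $4, n }')

# sub replaces only the first hyphen.
first=$(echo "$record" | awk '{ sub(/-/, "/", $4); print $4 }')

echo "$all"     # Aug/25/2022 2
echo "$first"   # Aug/25-2022
```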
In AWK programs, we can use many constructs similar to the ones available in other languages like if, else, while, switch (a GAWK extension), and more. One can also define a function in an AWK program, and then call it from within the scope of an action.
awk 'BEGIN { max = -1; name = "" }
{
  if (max < $5) {
    max = $5
    name = $1
  }
}
END { print name ": " getMaxScore() }
function getMaxScore() { return max }' inputfile
It’s worth noting that AWK is a Turing-complete language, which means that it can be utilized for implementing any algorithm. That being said, it is typically used for tasks like data filtering and manipulation.
This blog is far from being a complete tutorial, but we hope that it is effective in removing any entry-level barriers for a faster and happier learning experience.