Every Linux sysadmin knows that log files are a fact of life. Whenever there is a problem, they are the first place to look when diagnosing nearly any kind of issue, and sometimes they even offer a solution. Sysadmins also know that sifting through log files can be tedious. Scanning line after line after line, it is easy to see "the same thing" everywhere and miss the error message entirely, especially when one is not sure what to search for in the first place.
Linux offers plenty of log analysis tools, both open source and commercially licensed. This tutorial introduces the very powerful awk utility and shows how to "pluck out" error messages from various kinds of log files, making it easier to find where (and when) problems are happening. On Linux in particular, awk is implemented as the free GNU utility gawk, and either command can be used to invoke it.
To describe awk solely as a utility that converts the text contents of a file or stream into something that can be addressed positionally would do awk a tremendous disservice, but this capability, combined with the monotonously uniform structure of log files, makes it a very practical tool for searching log files quickly.
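As a quick, minimal illustration of that positional addressing (using a throwaway string rather than a real log file), awk splits each input line on whitespace by default and exposes the pieces as $1, $2, and so on:

$ echo "alpha beta gamma" | awk '{ print $2 }'
beta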
To that end, we will be looking at how to work with awk to analyze log files in this system administration tutorial.
How to Map Out Log Files
Anyone who is familiar with comma-separated value (CSV) files or tab-delimited files understands that these files have the following basic structure:
- Each line, or row, in the file is a record
- Within each line, the comma or tab separates the individual “columns”
- Unlike a database, the data format of the “columns” is not guaranteed to be consistent
Harkening back to our tutorial, Text Scraping in Python, this looks somewhat like the following:
Figure 1 – A sample CSV file with phony Social Security Numbers
Figure 2 – The same data, examined in Microsoft Excel
In both of these figures, the obvious “coordinate grid” jumps right out. It is easy to pluck out a particular piece of information just by using said grid. For instance, the value 4235 lives at row 5, column D of the file above.
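Translating that grid position into awk terms simply means choosing a delimiter and a record number. As a minimal sketch, assuming the data from Figure 1 is saved as a file named data.csv (a name used here purely for illustration) and that the file's first line corresponds to row 1, row 5, column D becomes record 5, field 4:

$ awk -F',' 'NR == 5 { print $4 }' data.csv
4235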
No doubt some readers are saying, “this works well only if the data is uniformly structured like it is in this idealized example!” But the great thing about awk is that this is not a requirement. The only thing that matters when using awk for log file analysis is that the individual lines being matched have a uniform structure, and for most log files in Linux systems this is most definitely the case.
This characteristic can be seen in the figure below for an example /var/log/auth.log file on an Ubuntu 22.04.1 LTS Server:
Figure 3 – An example log file, showing uniform structure among each of the lines.
If each line of a log file is a record, and if a space is used as the delimiter, then the following numerical identifiers can be used for each word of each line of the log file:
Figure 4 – Numerical identifiers for each word of a line.
Each line of the log file starts with the same information:
- Column 1: Month abbreviation
- Column 2: Day of the month
- Column 3: Event time in 24-hour format
- Column 4: Hostname
- Column 5: Process name and PID
Note, not every log file will look like this; formats can vary wildly from one application to another.
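Those column numbers can be used directly with awk. For example, the following minimal sketch prints only the timestamp, hostname, and process columns of each line in the log file shown above:

$ awk '{ print $1, $2, $3, $4, $5 }' /var/log/auth.log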
So, in examining the figure above, the easiest way to pull failed ssh logins for this host would be to look for the log lines in /var/log/auth.log that have the text Failed in column 6 and password in column 7. The numerical columns are prefixed with a dollar sign ($), with $0 representing the entire line currently being processed. This gives the awk command below:
$ awk '($6 == "Failed") && ($7 == "password") { print $0 }' /var/log/auth.log
Note: depending on permission configurations, it may be necessary to prefix the command above with sudo.
This gives the following output:
Figure 5 – The log entries which only contain failed ssh login attempts.
As awk is also a scripting language in its own right, it is no surprise that its syntax can look familiar to sysadmins who are also versed in coding. For example, the above command can be implemented as follows, if one prefers a more “coding”-style look:
$ awk '{ if ( ($6 == "Failed") && ($7 == "password") ) { print $0 } }' /var/log/auth.log
Or:
$ awk '\
{ \
   if ( ($6 == "Failed") && ($7 == "password") ) \
   { \
      print $0 \
   } \
}' /var/log/auth.log
In both command lines above, note the extra braces and parentheses around the matching logic. Both will give the same output:
Figure 6 – Mixing and matching awk inputs
Text matching logic can be as simple, or as complex, as necessary, as will be shown below.
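As a taste of something more complex, awk's associative arrays make it possible to summarize as well as filter. The sketch below assumes the usual sshd wording of "Failed password for ... from <IP> port <N> ssh2", in which the source address is always the fourth field from the end of the line, and tallies failed password attempts per source IP:

$ awk '($6 == "Failed") && ($7 == "password") { attempts[$(NF-3)]++ } END { for (ip in attempts) print ip, attempts[ip] }' /var/log/auth.log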
How to Perform Expanded Matching
Of course, an invalid login via ssh is not the only way to get listed as a failed login in the /var/log/auth.log file. Consider the following snippet from the same file:
Figure 7 – Log entries for failed direct logins
In this case, columns $6 and $7 have the values FAILED and LOGIN, respectively. These failed logins come from attempts to log in from the console.
It would, of course, be convenient to use a single awk call to handle both conditions, as opposed to multiple calls, and, naturally, trying to type a somewhat complex script on a single line would be tedious. To “have our cake and eat it too,” a script can be used to contain the logic for both conditions:
#!/usr/bin/awk -f
# parse-failed-logins.awk
{
   if ( ( ($6 == "Failed") && ($7 == "password") ) ||
        ( ($6 == "FAILED") && ($7 == "LOGIN") ) )
   {
      print $0
   }
}
Note that awk scripts are not entirely free-form text: newlines are significant, so while it is tempting to "better" organize this code, doing so will likely lead to syntax errors.
While the code for the awk script looks very C-like, the file itself is, like any other Linux script, subject to file permissions; parse-failed-logins.awk requires execute permission before it can be run directly:
$ chmod +x parse-failed-logins.awk
The following command line executes this script, assuming it is in the present working directory:
$ ./parse-failed-logins.awk /var/log/auth.log
By default, the current directory is not part of the executable search path (PATH) in Linux. This is why it is necessary to prefix a script in the current directory with ./ when running it.
The output of this script is shown below:
Figure 8 – Both types of login failures
The only downside of the log is that invalid usernames are not recorded for failed console login attempts. This script can be further simplified by using the tolower function to convert the value in $6 to lowercase:
#!/usr/bin/awk -f
# parse-failed-logins-ci.awk
{
   if ( tolower($6) == "failed" )
   {
      if ( ($7 == "password") || ($7 == "LOGIN") )
      {
         print $0
      }
   }
}
Note that the -f at the end of #!/usr/bin/awk -f at the top of these scripts is very important!
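The -f flag tells awk to read its program from the named file rather than from the command line. It also means that, if setting the execute bit is not an option, the same script can be run by invoking awk directly:

$ awk -f parse-failed-logins-ci.awk /var/log/auth.log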
Other Logging Sources
Below is a list of some of the other potential logging sources system administrators may encounter.
journald/journalctl
Of course, the text of log files is not the only source of security-related information. CentOS and Red Hat Enterprise Linux (RHEL), for instance, use journald to facilitate access to login-related information:
$ journalctl -u sshd -u gdm --no-pager
This command passes two units, namely sshd and gdm, into journalctl, as this is what is required to access login-related information in CentOS and RHEL.
Note that, by default, journalctl pages its output. This makes it difficult for awk to work with. The --no-pager option disables paging.
This gives the following output:
Figure 9 – using journalctl to get ssh-related login information
As can be seen above, while gdm does indicate that a failed login attempt took place, it does not specify the user name associated with the attempt. As a result, this unit will not be used in further demonstrations in this tutorial; however, other units specific to a particular Linux distribution could be used if they do provide this information.
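On a busy host the journal can also be very large, so it may help to bound the time range before handing the output to awk; for example, to restrict the sshd unit to entries since yesterday:

$ journalctl -u sshd --since yesterday --no-pager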
The following awk script can parse out the failed logins for CentOS:
#!/usr/bin/awk -f
# parse-failed-logins-centos.awk
{
   if ( (tolower($6) == "failed") && ($7 == "password") )
   {
      print $0
   }
}
The output of journalctl can be piped directly into awk via the command:
$ ./parse-failed-logins-centos.awk < <(journalctl -u sshd -u gdm --no-pager)
This type of piping is known as Process Substitution. Process Substitution allows for command output to be used the same way a file can.
Note that the spacing of the less-than signs and parentheses is critical. This command will not work if the spacing and arrangement of the parentheses is not correct.
This command gives the following output:
Figure 10 – Piping journalctl output into awk
Another way to perform this piping is to use the command:
$ journalctl --no-pager -u sshd | ./parse-failed-logins-centos.awk
SELinux/audit.log
SELinux can be a lifesaver for a system administrator, but a nightmare for a software developer. It is by design opaque with its messaging, except for when it comes to logging, at which point it can be almost too helpful.
SELinux logs are typically stored in /var/log/audit/audit.log. As is the case with any other log file subject to rotation, previous iterations of these logs may also be present in the /var/log/audit directory. Below is a sample of such a file, with the denied flag being highlighted.
Figure 11 – A typical SELinux audit.log file
In this specific context, SELinux is prohibiting the Apache httpd daemon from writing to specific files. This is not the same as Linux permissions prohibiting such a write. Even if the user account under which Apache httpd is running does have write access to these files, SELinux will prohibit the write attempt. This is a common good security practice which can help to prevent malicious code that may have been uploaded to a website from overwriting the website itself. However, if a web application is designed with the premise that it should be able to overwrite files in its directory, this can cause problems.
It should be noted that, if a web application is designed to have write access to its own web directory and it is being blocked by SELinux, the best practice is to “rework” the application so that it writes to a different directory instead. Modifying SELinux policies can be very risky and open a server up to many more attack vectors.
SELinux typically polices many different processes in many different contexts within Linux. As a result, the /var/log/audit/audit.log file may be too large and "messy" to analyze just by looking at it. Because of this, awk can be a useful tool for filtering out the parts of the /var/log/audit/audit.log file that a sysadmin is not interested in seeing. The following call to awk gives the desired results, in this case by looking for matching values in columns $4 and $10:
$ sudo awk '($4 == "denied" ) && ($10=="comm=\"httpd\"") { print $0 }' /var/log/audit/audit.log
Note how this command uses sudo, since this file is owned by root, as well as escaped quotes for the comm="httpd" entry. Below is sample output of this call:
Figure 12 – Filtered output via awk command.
It is typical for there to be many, many, many entries which match the criteria above, as publicly-accessible web servers are often subject to constant attacks.
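Here, too, awk's associative arrays can help turn that flood into a summary. The following minimal sketch assumes the AVC record layout shown above, in which the denied operation (such as write) appears in column $6, and counts denials per operation for the httpd daemon:

$ sudo awk '($4 == "denied") && ($10 == "comm=\"httpd\"") { ops[$6]++ } END { for (op in ops) print op, ops[op] }' /var/log/audit/audit.log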
Final Thoughts on Using Awk to Analyze Log Files
As stated earlier, the awk language is vast and quite capable of all sorts of useful file analysis tasks. The Free Software Foundation currently maintains the gawk utility, as well as its official documentation. It is the ideal free tool for performing precision log analysis, given the avalanche of information that Linux and its software typically write to log files. Because the language is designed specifically for processing text streams, its programs are far more concise than programs written in more general-purpose languages for the same kinds of tasks.
The awk utility can be incorporated into unattended text file analysis for just about any structured text file format, or if one dares, even unstructured text file formats as well. It is one of the “unsung” and sometimes overlooked tools in a sysadmin’s arsenal that can make that job so much easier, especially when dealing with ever-increasing volumes of data.
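As one final, purely illustrative sketch of that kind of unattended use, the failed-login script from earlier could be scheduled with cron and its output mailed to an administrator; the path, schedule, and recipient below are assumptions, not prescriptions:

# /etc/cron.d/failed-logins (hypothetical): mail a failed-login report to root at 06:00 daily
0 6 * * * root /usr/local/bin/parse-failed-logins.awk /var/log/auth.log | mail -s "Failed login report" root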