Command-line hacking: displaying system temperature

This is another in my series of articles on doing unusual and, perhaps, interesting things with Linux command-line tools and scripts. The purpose of this week's exercise is to create a formatted display of system temperatures from the various sensors, using only Bash constructs. This example is significantly simpler than some of my earlier ones, but it does demonstrate basic file and string handling in Bash.

The output of the script is something like this:

                zone   temp
              ------   -----
              acpitz   46
        x86_pkg_temp   46
         pch_skylake   52

I'll touch on interpretation of the zone (sensor) names later.

As ever, full source code is available in my GitHub repository, although this example is simple enough that I'll include most of the code in the article.

Modern CPUs and motherboards have multiple temperature sensors. Drivers in the Linux kernel make their readings available as pseudo-files in the directories /sys/class/thermal/thermal_zoneNNN/ (thermal_zone0, thermal_zone1, and so on). Unfortunately, there's no single pseudo-file that gives a nicely formatted display of all the temperatures; producing one is what this script tries to do.

Within each of the thermal_zoneNNN directories are two pseudo-files that are relevant here: type gives the name of the temperature sensor, while temp holds the temperature itself.

According to the kernel documentation, the temperature is supplied as an integer in millidegrees Celsius (millicelsius). So, to get an ordinary Celsius temperature, we must divide by 1000. For any reading between 10 and 99.999 degrees, which covers most running systems, the value has exactly five decimal digits, so that's easy: just take the top two digits and ignore the rest. This truncates rather than rounds, so a temperature of 49900 will be reported as 49 degrees when, ideally, it should be rounded to the nearest integer (50). Alternatively, we could display the first three digits with a decimal point between them ("49.9"). Both these approaches are left as an exercise for the reader: simply inserting the decimal point is straightforward in a Bash script, but rounding to the nearest integer requires some arithmetic. I'm happy with a result that is within a degree or so of the true value, and these sensors aren't hugely accurate, anyway.
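For anyone who wants to try the two exercises, here is a minimal sketch using nothing but Bash integer arithmetic. The function names are hypothetical, and it assumes the reading is a plain, non-negative integer, as the temp files supply. Adding 500 before dividing by 1000 turns truncation into rounding to the nearest degree; the one-decimal version works on any number of digits, not just five.

```shell
#!/usr/bin/env bash
# Sketch: convert a millicelsius reading to degrees using Bash arithmetic.
# Assumes a non-negative integer reading, such as the contents of a
# thermal_zone temp file. Function names are hypothetical.

# Round to the nearest whole degree: adding 500 first means that
# 49900 -> 50 while 49499 -> 49.
millic_to_round() {
  local m=$1
  echo $(( (m + 500) / 1000 ))
}

# Whole degrees plus one decimal place (truncated): 49900 -> 49.9.
# % is the remainder, so (m % 1000) / 100 picks out the tenths digit.
millic_to_tenths() {
  local m=$1
  echo "$(( m / 1000 )).$(( (m % 1000) / 100 ))"
}
```

For example, millic_to_round 49900 prints 50, and millic_to_tenths 49050 prints 49.0.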

The first task in the script is to gather a list of the thermal_zoneNNN directories. The way I've chosen to do this is shown below, but I'm sure there are many other possibilities.

SCT=/sys/class/thermal

# Quote the pattern so the shell passes it to find unexpanded
for zone in `find "$SCT" -name 'thermal_zone*' -exec basename {} \;`
do
  ...
done

There's no functional need to assign /sys/class/thermal to a variable but, as it will be used many times, doing this tidies up the script a little. The for loop will execute once for each thermal_zoneNNN directory, with zone set to the directory name.

If any of the directory names could contain spaces, we'd have to be a lot more careful about how the names are handled, because the output of find is split into words at whitespace before the loop sees it. In this case, however, the kernel's thermal_zoneNNN names never contain spaces.
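One of the other possibilities mentioned above is a plain shell glob, which would also cope with names containing spaces, because each match becomes a single word with no further splitting. The sketch below assumes a hypothetical helper, list_zones, with an optional directory argument that exists only to make it easy to test. ${dir##*/} strips everything up to the last slash, doing the same job as basename without starting a process.

```shell
#!/usr/bin/env bash
# Sketch: enumerate thermal zones with a glob instead of find/basename.
# list_zones and its optional directory argument are hypothetical.
list_zones() {
  local sct=${1:-/sys/class/thermal}
  local dir
  for dir in "$sct"/thermal_zone*; do
    # If nothing matches, the glob is left as the literal pattern: skip it
    [ -e "$dir" ] || continue
    # Strip the leading path, like basename, but without a new process
    printf '%s\n' "${dir##*/}"
  done
}
```

In the main script, the loop header would then become for dir in "$SCT"/thermal_zone*, with the zone name taken from ${dir##*/}.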

Within each thermal_zoneNNN directory, we need to read temp and type into variables. A common way to read a file into a variable is:

my_var=`cat /path/to/file`

Bash experts discourage this kind of thing, because it spawns an extra process when there's no real need to. I admit that I use this construct all the time, because it's so easy to read. However, let's try to be more modern:

   type=$(<"$SCT/$zone/type")
   temp=$(<"$SCT/$zone/temp")

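This uses Bash's $(<file) form of command substitution, which the Bash manual describes as an equivalent but faster replacement for $(cat file). Because it borrows the familiar < redirection syntax, it isn't totally unfamiliar to a shell programmer. Still, I often forget the exact syntax.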
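To see that the two forms really are interchangeable, here is a small sketch with a hypothetical helper, read_value. Both $(<file) and $(cat file) strip the trailing newline that sysfs files end with, so the resulting strings are identical; only the cat version starts a separate process.

```shell
#!/usr/bin/env bash
# Sketch: read a whole (small) file into a variable without running cat.
# read_value is a hypothetical helper name.
read_value() {
  local value
  # $(<file) reads the file directly; trailing newlines are removed,
  # exactly as they would be by $(cat file)
  value=$(<"$1")
  printf '%s' "$value"
}
```

Usage might look like type=$(read_value "$SCT/$zone/type"), although calling the helper through $(...) reintroduces a subshell, so inline $(<file) remains the cheaper choice inside the loop.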

This is where things get a little interesting, from a shell programming point of view. On some Linux systems, the temp pseudo-file does not supply a temperature, and trying to read it results in an error. This happens even though the file exists and has read permission. So what does temp end up containing?

It turns out to be an empty string -- which is fine in this simple example. A valid temperature will never be an empty string, so we don't have to try to distinguish between valid empty strings and errors. But what happens if a script has to read a file or pseudo-file that might actually validly be empty? When we use the < syntax, I'm not sure there's a way to find out whether the file was present but empty, or whether a read error occurred.

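If we do need to make this distinction, one possibility might be to use the read builtin, which returns a non-zero exit status when it fails. There is a catch, though: read also returns non-zero when it reaches end-of-file before reading anything, so a genuinely empty file fails this test too. What it does tell us is whether a file has at least one character that can actually be read:

  ok=""
  read -r -n 1 2> /dev/null < /path/to/file && ok="1"
  if [ -n "$ok" ]; then
    ...   # File is readable and not empty
  fi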

This is a rather ugly construction, and I'd be interested to know whether there's a nicer one. In practice, when writing shell scripts, it's usually fine to assume that if a file exists and the user has read permission, a read operation will succeed; we don't normally need to be more paranoid than that. However, files that are really interfaces to drivers can behave oddly, and those oddities sometimes need handling. Happily, in this simple example it's safe to treat an empty string as an error condition, so plain $(<file) is enough.
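One less ugly option, if we're willing to give up the "no extra process" goal, is to rely on cat's exit status. cat exits with status 0 after copying an empty file, but with a non-zero status if it cannot open the file or a read fails. Unlike the read test, this separates "empty" from "unreadable". This is a sketch with a hypothetical function name, and it costs one process per check.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the file can be opened and read to the end, even when
# it is empty; fail on a missing file or a read error.
# file_is_readable is a hypothetical name.
file_is_readable() {
  # Discard the contents and any error message; keep only cat's exit status
  cat -- "$1" > /dev/null 2>&1
}
```

In the temperature script this could be used as if file_is_readable "$SCT/$zone/temp"; then ... fi. A directory makes a handy stand-in for a misbehaving pseudo-file: opening it succeeds, but reading from it fails.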

With the zone name and the temperature in variables, we just need to format them nicely. To reduce the temperature to a whole number of degrees, we extract the first two digits, like this:

  temp=${temp:0:2}

This construct means 'take two characters, starting at position zero'. Exercise for the reader: does this still work for multibyte characters? We don't have to worry about that here, but sometimes we do.
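Taking two characters gives the whole-degree value only when the reading has exactly five digits. The sketch below compares it with integer division by 1000 for a few sample readings. first_two and div_1000 are hypothetical names used only for the comparison. For 9500 (9.5 degrees), the substring gives 95; for 102000, it gives 10. Division gives 9 and 102.

```shell
#!/usr/bin/env bash
# Sketch: substring extraction versus integer division for converting
# millicelsius readings to whole degrees. Names are hypothetical.

# ${t:0:2}: two characters starting at offset 0, correct only for 5 digits
first_two() { local t=$1; echo "${t:0:2}"; }

# Integer division truncates toward zero, whatever the number of digits
div_1000() { local t=$1; echo $(( t / 1000 )); }

for t in 49900 9500 102000; do
  printf '%7s  substring: %-3s  division: %s\n' \
    "$t" "$(first_two "$t")" "$(div_1000 "$t")"
done
```

For the usual five-digit readings the two agree, so the substring form is fine for this script; division is the safer choice if a sensor might report below 10 or above 99 degrees.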

To format the display so that the temperatures are aligned, we can use the printf command.

  printf "%20s   %s\n" "$type" "$temp"

This command should be familiar to C programmers. The first argument controls the format of the arguments that follow. %s denotes a simple text string; %20s right-justifies the string in a field at least twenty characters wide, padding it on the left with spaces. A longer string is printed in full, not truncated.
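Putting the steps together, a complete loop might look something like the sketch below. It is not the script from my repository. The function name show_temps, its optional directory argument (useful for testing against a fake directory tree), and the choice to skip zones whose temp reads as an empty string are all assumptions. The header relies on printf reusing its format string when given more arguments than the format consumes.

```shell
#!/usr/bin/env bash
# Sketch: print a formatted table of thermal zone temperatures.
# show_temps, its optional argument, and skipping unreadable zones are
# assumptions, not taken from the original script.
show_temps() {
  local sct=${1:-/sys/class/thermal}
  local dir type temp

  # Header: printf reuses the format for the second pair of arguments
  printf '%20s   %s\n' zone temp ------ -----

  for dir in "$sct"/thermal_zone*; do
    [ -e "$dir" ] || continue            # no zones: glob stayed literal
    type=$(<"$dir/type")
    { temp=$(<"$dir/temp"); } 2> /dev/null
    [ -n "$temp" ] || continue           # empty string: treat as read error
    temp=${temp:0:2}                     # whole degrees, assuming 5 digits
    printf '%20s   %s\n' "$type" "$temp"
  done
}
```

On a real system, calling show_temps with no argument scans /sys/class/thermal.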

Finally, how do we know how to interpret the sensor names? The sad fact is that, on the whole, we don't. These names are supplied by drivers in the kernel, based on the sensors they detect by probing the motherboard. From experience, I know that x86_pkg_temp is usually the CPU package temperature on x86 systems, while acpitz (the ACPI thermal zone) usually comes from a motherboard sensor mounted close to the CPU. ARM Linux systems that use the big.LITTLE architecture seem to report the 'Big', 'Middle', and 'Little' cores with their own thermal_zone entries. g3d is probably an integrated graphics processor.

For the record, in my example pch_skylake is the Platform Controller Hub (PCH), Intel's chipset. On desktop motherboards this is still a separate chip, while on many laptop processors it sits in the same package as the CPU. It isn't unusual for it to record a higher temperature than x86_pkg_temp or, of course, not to be present at all: the exact sensors depend on the motherboard and CPU design.

In short, interpretation of the thermal zone names will likely take a bit of web searching.