Setting up SMART Monitoring in Proxmox

The aim of this article is to configure SMART monitoring in Proxmox and to send emails if anything untoward is found. You should already have the email system set up as per this earlier article. This article assumes that your disks are attached directly to the system rather than via a RAID controller. There are probably other ways to monitor disks that sit behind a RAID controller.

Checking Things Out

Fortunately, most Proxmox installs come with all of the monitoring tools we need out of the box. As the user guide states, Proxmox has shipped with smartmontools, a comprehensive SMART monitoring utility, since version 4.3 (Sept 2016). As mentioned in this earlier article, inspecting disks with smartmontools manually is easy, but who wants to do a job like that manually? An understanding of what the smartmontools utilities can do will help with this task, so it’s well worth having a read of the manuals linked at the bottom of this article.

By default the smartmontools daemon, smartd, is running and polls the disks automatically every 30 minutes. The poll time is configurable, as noted in the manual, but I see no good reason to change it. If you do want to change it, the setting is in /etc/default/smartmontools: just uncomment and edit the smartd_opts line. These options are passed through to smartd when the service starts the daemon.
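
As a rough illustration, the edited line might look like the sketch below. The one hour value is only an example of mine, not a recommendation. The interval is given in seconds, so the 30 minute default corresponds to 1800.

```shell
# /etc/default/smartmontools (fragment)
# Example value only: poll every 3600 seconds (1 hour) instead of the 1800 second default.
smartd_opts="--interval=3600"
```

The change only takes effect once smartd has been restarted.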

To check that smartd is running you can list and filter active processes like this:

# ps aux | grep smart
root      251302  0.0  0.0  11996  6564 ?        Ss   May19   0:00 /usr/sbin/smartd -n
root      770262  0.0  0.0   6244   648 pts/0    S+   10:32   0:00 grep smart

The first line shows that smartd is running. The other columns show that the process is owned by root, has a PID of 251302, is using no CPU or memory (well, very little) and various other things. The Ss means that the process is in an interruptible sleep and that it’s the session leader. In other words, it wasn’t running when I looked at it.

Better than that though you can ask systemctl about the status of the service. Typical output would look something like this:

# systemctl status smartd

● smartmontools.service - Self Monitoring and Reporting Technology (SMART) Daemon
     Loaded: loaded (/lib/systemd/system/smartmontools.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-05-19 17:54:57 BST; 4 days ago
       Docs: man:smartd(8)
             man:smartd.conf(5)
   Main PID: 251302 (smartd)
     Status: "Next check of 6 devices will start at 18:24:57"
      Tasks: 1 (limit: 33474)
     Memory: 2.2M
        CPU: 587ms
     CGroup: /system.slice/smartmontools.service
             └─251302 /usr/sbin/smartd -n

May 23 13:55:24 xxx smartd[251302]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 68 to 67
May 23 13:55:24 xxx smartd[251302]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 33
May 23 14:25:02 xxx smartd[251302]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 66 to 65
May 23 14:25:02 xxx smartd[251302]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 34 to 35
May 23 14:25:13 xxx smartd[251302]: Device: /dev/sdc [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
May 23 14:25:13 xxx smartd[251302]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 37 to 36
May 23 14:25:18 xxx smartd[251302]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
May 23 14:25:18 xxx smartd[251302]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 32
May 23 17:55:08 xxx smartd[251302]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 63
May 23 17:55:08 xxx smartd[251302]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 37

As you can see, we get some information about when the service was started, when it’ll next check the drives and the last few log messages it created. Notice that the log messages are all about temperature and not very interesting. We’ll fix that when we create a more complete configuration.

Out of the Box Configuration of smartd

The configuration for smartd can be found in the /etc/smartd.conf file. The default settings, once all the comments have been removed, are extremely simple:

DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

The DEVICESCAN setting tells smartd to scan for drives to monitor rather than expect a list of drives. The comments in the conf file mention that normally systems should list all the drives to monitor manually, but from what I can find online most people seem to scan. This setting can take a number of directives. The ones used in the default configuration are described below.

-d removable: Ignore drives that are missing at start up. To be honest I’ve not found a really good explanation of this directive and whether it’s safe to use with drives that wouldn’t usually be removed. It seems to be aimed more at the situation where you enumerate drives manually. In that situation you wouldn’t want smartd complaining or exiting because you unplugged a USB drive. In a mixed system with some removable and some non-removable drives it’s not clear to me what this directive does.
-n standby: Don’t spin up a disk to check it, let it stay in sleep or standby. In other words, let sleeping disks lie.
-m root: If a problem is detected send an email to root. This can also be one or more full email addresses.
-M exec /usr/share/smartmontools/smartd-runner: Rather than run the mail command, run the specified script.

The smartd-runner script executes all the scripts that are found in the /etc/smartmontools/run.d directory. With a default install this is a single script that just calls the mail utility.
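
To give a feel for what an extra script in that directory could look like, here is a minimal sketch. The file name 20log and the format_alert function are made up for illustration. The SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE environment variables are the ones the smartd.conf man page describes for -M exec, and I’m assuming they are passed through to the run.d scripts unchanged.

```shell
#!/bin/sh
# Hypothetical hook: /etc/smartmontools/run.d/20log (the file name is an assumption).
# smartd exports SMARTD_DEVICE, SMARTD_FAILTYPE and SMARTD_MESSAGE to whatever
# -M exec runs; the fallback values only matter when running this by hand.
format_alert() {
    printf 'SMART alert: device=%s type=%s message=%s\n' \
        "${SMARTD_DEVICE:-unknown}" \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_MESSAGE:-none}"
}

# Print a one-line summary of the alert.
format_alert
```

On a real system the last line could be piped into something like logger -t smartd-alert so the summary ends up in the system log. The script would also need to be made executable before it gets picked up.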

Configuring smartd Better

The default configuration isn’t terrible but it also leaves a bit to be desired. After much documentation reading and searching I came across this post that provides a better starting point (also mentioned here linking to this). The configuration I’ll be using, which is based on that suggestion, is shown below.

DEVICESCAN -H -f -u -p -l error -l selftest -n standby,24,q \
-I 194 \
-I 190 \
-W 5,45,50 \
-i 9 \
-R 5! \
-C 197 \
-U 198 \
-o on -S on -s (S/../.././02|L/../01/./04) \
-m root \
-M test \
-M exec /usr/share/smartmontools/smartd-runner

A quick rundown of the directives is given below, for full details see the man page of smartd.conf.

-H: (ATA only) Check the SMART health status and log a critical error if any pre-failure attribute has reached or passed its threshold value. If this check has been triggered a failure is imminent.
-f: (ATA only) Checks for usage attribute failures, e.g. the drive is too warm or is past its design life. These don’t indicate a failure is imminent but are warning signs that one might be around the corner.
-u: (ATA only) Report changes in usage attributes.
-p: (ATA only) Report changes in pre-failure attributes.
-l: (lower case L) Reports increases in the SMART logs. The error argument watches the summary error log and the selftest argument watches the self-test log. The latter only makes sense if you are running regular self-tests with the -s directive.
-n: The standby argument tells smartd not to wake the disks if they aren’t spinning. The value 24 means a disk is checked anyway once 24 checks in a row have been skipped, which is 12 hours at the default 30 minute poll. The q suppresses the log messages for skipped checks.
-I: (upper case i) Ignore changes in the given attribute. 190 ignores airflow temperature changes, 194 ignores temperature changes. This cuts down the number of log messages for things that are likely to change very frequently and have no diagnostic worth.
-W: Track temperature changes. The first value reports a change of at least that many degrees since the last report, the next two values specify the info and critical report thresholds.
-i: Ignore an attribute when checking for usage attribute failures (the -f check). In this case attribute 9 is specified, which is power-on hours. This prevents emails being sent for old disks that otherwise seem to be working fine.
-R: Report all changes in raw value for the given attribute. In this case 5 is specified, which is reallocated sector count, a key indication that a drive is failing. The exclamation mark logs the change as critical.
-C: Report if the current pending sector count is non-zero. The number specifies the attribute to check. It’s usually 197, but some vendors have used other numbers in the past. This is a key indicator of a disk that may fail.
-U: Report if the offline uncorrectable count is non-zero. The number specifies the attribute to check. It’s usually 198, but some vendors have used other numbers in the past. This is a key indicator of a disk that may fail.
-o: (ATA only) Turns on automatic offline testing when smartd starts.
-S: (ATA only) Turns on attribute autosave when smartd starts.
-s: Runs self-tests on a schedule. The schedule is a regular expression matched against strings of the form T/MM/DD/d/HH (test type, month, day of month, day of week, hour); see the man page for the details. The setting shown runs a short test at 2AM every day and a long test at 4AM on the first of every month.
-m: Send email reports to all the users and email addresses listed.
-M: Modifies the behaviour of the -m directive. The test option causes smartd to send a test email for each monitored drive when the service starts. The exec option causes smartd to execute the given script rather than the built-in mail command.
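
The -s schedule is the least obvious of these, so here is a small sketch of how I understand the matching to work. It is not how smartd is implemented: the due_tests function is my own, and I’m assuming the regular expression has to match the whole T/MM/DD/d/HH string, where T is the test type (L for long, S for short, C for conveyance, O for offline) and d is the day of the week with 1 being Monday.

```shell
#!/bin/sh
# The schedule from the configuration above.
schedule='(S/../.././02|L/../01/./04)'

# due_tests MM/DD/d/HH
# Prints the test type letters whose T/MM/DD/d/HH string matches the schedule,
# or "none" when nothing is due at that hour.
due_tests() {
    due=""
    for t in L S C O; do
        # grep -x requires the pattern to match the whole line.
        if printf '%s/%s\n' "$t" "$1" | grep -Eqx "$schedule"; then
            due="$due$t"
        fi
    done
    printf '%s\n' "${due:-none}"
}

# What would be due right now (date's %u gives 1 for Monday to 7 for Sunday).
due_tests "$(date +%m/%d/%u/%H)"
```

Calling due_tests 06/01/4/04 prints L, because 4AM on the first of the month matches the long test half of the expression, while due_tests 06/02/5/04 prints none.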

Restart smartd

Restarting smartd is necessary to get it to pick up its new settings. This is done by asking systemctl to restart the service.

# systemctl restart smartd

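The restart can take a moment in my experience. If you’ve copied the settings above exactly you should now get a flurry of emails, one for each disk in your system. Once you’ve confirmed that they arrive you may want to remove the -M test line, otherwise the test emails will be sent again every time the service starts.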

That’s it, you should now have a fully working SMART monitoring system. May your disks live long and trouble-free lives.

References

Smartctl man page
Smartd.conf man page
Smartd man page
Old but still relevant guide for Proxmox