When buying new hard drives it’s always a good idea to check them for bad blocks. If a drive has bad blocks it can be a sign that it’s not long for this world, especially if the number of bad blocks grows over a short period of time. In this article I’ll show you how to quickly check the health of a hard drive. I should point out that none of these tests guarantees your drive is fine; all they show is that nothing could be detected at the time of the test. For that reason, always back up anything you care about.
NOTE: in the examples below I’m working as root since this is just a demo machine I put together. Some commands (e.g. fdisk -l) will usually need to be prefixed with sudo (e.g. sudo fdisk -l).
The first thing you need to do is find out what disks your system knows about. The easiest way to do that is with the fdisk command.
root@pm1:~# fdisk -l
Disk /dev/sdb: 298.09 GiB, 320072933376 bytes, 625142448 sectors
Disk model: WDC WD3200AAKX-0
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F891A5A3-B02B-084D-ACC8-8BD998ABF9CF

Device         Start       End   Sectors   Size Type
/dev/sdb1       2048 625125375 625123328 298.1G Solaris /usr & Apple ZFS
/dev/sdb9  625125376 625141759     16384     8M Solaris reserved 1

Disk /dev/sda: 149.01 GiB, 160000000000 bytes, 312500000 sectors
Disk model: ST3160812AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 200C55F7-350D-5C46-9A32-39A5EABCFD2C

Device         Start       End   Sectors  Size Type
/dev/sda1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sda9  312481792 312498175     16384    8M Solaris reserved 1

Disk /dev/sdc: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: ST2000DL003-9VT1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 77070F1F-38F8-41D4-B39B-D921AD600CFA

Device       Start        End    Sectors  Size Type
/dev/sdc1       34       2047       2014 1007K BIOS boot
/dev/sdc2     2048    1050623    1048576  512M EFI System
/dev/sdc3  1050624 3907029134 3905978511  1.8T Linux LVM

Disk /dev/sdd: 149.01 GiB, 160000000000 bytes, 312500000 sectors
Disk model: ST3160812AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A835D014-F977-8345-A677-99B78387A342

Device         Start       End   Sectors  Size Type
/dev/sdd1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sdd9  312481792 312498175     16384    8M Solaris reserved 1

Disk /dev/mapper/pve-swap: 8 GiB, 8589934592 bytes, 16777216 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/mapper/pve-root: 96 GiB, 103079215104 bytes, 201326592 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
This shows you a lot of information about the drives in your system. Each drive starts with a “Disk” line that indicates where the disk can be found, e.g. /dev/sda. It also tells you the size, model and a bunch of other information about the disk. As you can see, I have four physical disks installed in this system. If a disk is plugged in correctly but missing from this list, then either it’s hidden in the BIOS or it’s faulty.
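If you just want a quick per-disk overview without all the partition detail, lsblk gives a compact summary (and, unlike fdisk -l, it doesn’t need root for this basic view):

```shell
# Compact summary: device name, model, size and type.
# -d lists whole disks only (no partitions).
lsblk -d -o NAME,MODEL,SIZE,TYPE
```
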
Checking for SMART Errors
Modern hard drives have the ability to monitor their own health using a feature called SMART (Self-Monitoring, Analysis and Reporting Technology). If you keep an eye on the SMART information you’ll often get a warning that a disk is about to fail. In my experience the warning is usually quite short, maybe a few days, but it’s better than nothing. To read the SMART information from a drive you need the smartmontools utility. If it’s not already installed you can install it like this:
sudo apt-get install smartmontools
Note that SMART was originally designed for spinning disks, but SSDs also have SMART reporting. Some of the original SMART fields, such as spin-up time, don’t make sense for SSDs but will often still be reported. What is discussed here is aimed at HDDs, but much of it will also be relevant to SSDs.
Checking the health of a disk is a one-line command with a simple pass/fail output, as shown below. Note: it’s -H, not -h; the latter prints help for the smartctl command.
root@pm1:~# smartctl -H /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
As you can see this drive passes the SMART check. If you want more information you can replace -H with -a which prints all the information for a drive as shown here.
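If you have several disks installed, a quick loop saves typing. This is just a sketch: it assumes your disks appear with SATA-style names (/dev/sda, /dev/sdb, …), so adjust the glob for NVMe devices.

```shell
# Print a one-line health summary for every SATA-style disk.
for disk in /dev/sd?; do
    status=$(smartctl -H "$disk" | grep -o 'PASSED\|FAILED')
    echo "$disk: ${status:-unknown}"
done
```
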
root@pm1:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.74-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.9
Device Model:     ST3160812AS
Serial Number:    5LS9ADML
Firmware Version: 3.ADJ
User Capacity:    160,000,000,000 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Thu May  4 10:05:17 2023 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   088   006    Pre-fail  Always       -       112686697
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       468
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       99514528
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       17808
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       476
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   063   045    Old_age   Always       -       30 (Min/Max 18/32)
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   053   046   000    Old_age   Always       -       129752706
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     17185         -
# 2  Extended offline    Completed without error       00%     14683         -
# 3  Extended offline    Completed without error       00%         1         -
# 4  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
This gives a lot of information, but the section we’re really interested in is in the middle, where we have access to the raw values the drive is reporting. The first one to look at is ID#9 Power_On_Hours: this drive shows 17808 hours, which is just over two years of use. This drive is about 16 years old, so that tells us it has spent most of its life sitting around doing nothing. Now look at ID#12 Power_Cycle_Count, which this drive reports as 476. This drive was power cycled almost every day it was active, which suggests it probably wasn’t in a server farm – drives in a server farm are power cycled quite rarely. To put these numbers into perspective, my current NAS has drives with power-on hours of around 76500 (8.7 years) and power cycle counts of 29.
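Those rough calculations are easy to sanity-check yourself (there are 8760 hours in a year):

```shell
# Power-on hours expressed in years:
awk 'BEGIN { printf "%.1f\n", 17808 / 8760 }'   # just over 2 years
# Average hours of use per power cycle:
awk 'BEGIN { printf "%.0f\n", 17808 / 476 }'    # roughly a day and a half
# My NAS drives, for comparison:
awk 'BEGIN { printf "%.1f\n", 76500 / 8760 }'   # about 8.7 years
```
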
Now it’s time to look at the factors that give an indication of a possibly failing drive.
- ID#5 Reallocated Sectors Count
- ID#187 Reported Uncorrectable Errors
- ID#188 Command Timeout
- ID#197 Current Pending Sector Count
- ID#198 Uncorrectable Sector Count or Offline Uncorrectable
All of these, if present, should ideally have a raw value of zero. Notice that this drive doesn’t report ID#188 – not all drives report all statistics. These five measures are generally considered to be the ones that point to a possible issue. According to Backblaze, 77% of drives that fail are reporting at least one of these values as non-zero at the time of failure (while 4% of working drives have one non-zero value). Multiple non-zero values are definitely cause for concern. In my (limited) experience drives can run for a very long time with one of these values being non-zero. One of my NAS drives has had a non-zero Current_Pending_Sector value for years; all the other values it reports are zero. The key point, though, is that the value isn’t changing – a count that grows over time would be far more of a concern.
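You can pull just these five attributes out of the attribute table with a little awk. A sketch, assuming the standard smartctl -A column layout shown above (ID first, attribute name second, raw value in the last column – some drives append extra detail to the raw value):

```shell
# Print ID, attribute name and raw value for the five
# failure-indicator attributes (IDs 5, 187, 188, 197, 198).
smartctl -A /dev/sda | awk '$1 ~ /^(5|187|188|197|198)$/ { print $1, $2, $NF }'
```
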
SMART Self Testing
The smartctl application can not only read the SMART data but also trigger a number of self tests on the disk. To perform a quick test you’d use a command such as:
sudo smartctl -t short /dev/sdd
This will run a short test to look for obvious errors. A test like this should last no more than ten minutes – on modern drives it seems to last just a few seconds. I follow that up with a “conveyance” test, which is designed to look for shipping damage, and then I run a long test if I really want to give the drive a workout. Just to give you an idea of how long a long test will take: on the 16TB drives I use, it’s about 24 hours.
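For reference, the follow-up commands look like this (same device as above; note that not every drive supports the conveyance test – the old Barracuda earlier reports “No Conveyance Self-test supported”). Once a test finishes, the result appears in the self-test log:

```shell
sudo smartctl -t conveyance /dev/sdd   # check for shipping damage
sudo smartctl -t long /dev/sdd         # full extended self-test
# Check progress / results; the most recent test is entry # 1:
sudo smartctl -l selftest /dev/sdd
```
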
Now for the bad news…
If you are buying used drives the SMART records might not be reliable initially, as it is possible to reset the SMART values on at least some drives (Seagate drives seem to be particularly susceptible, from a quick search). The tools needed don’t seem to be particularly easy to get hold of, and I doubt this is a widespread problem, but it’s worth keeping in mind if you come across a used disk that has raw values of zero across the board. See below for what you can do about this. The ongoing SMART data the drive collects once you have it should be reliable.
Checking for Bad Blocks
Before SMART was widely used the best way to check a hard drive was a scan for bad blocks. The problem with this method is that it’s slow, especially for large modern drives, and a write scan is destructive to the data on the drive. Considering how good SMART is now a bad block scan is usually unnecessary but there is one time you might consider it. If you have a used drive and you suspect someone might have tampered with the SMART data a bad block check will turn up any issues. You might not actually get a report of bad blocks, as the drive may silently remap them, but you will see something reported in the SMART statistics.
Personally, I don’t think running badblocks is worth it. On a decent-sized modern drive, let’s say 16TB, you might expect a full destructive badblocks scan to take a week. Why so long? It writes four different patterns across the entire drive, reading each one back to verify it. All of this data has to go over the HBA and hit the processor, and that takes time. Add in latency from seeks and so on and you can see why it takes so long. Additionally, if badblocks turns up a bad block the drive is finished, as that indicates SMART was unable to remap the block. By the time badblocks is finding issues, SMART should have been screaming at you for a while.
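The back-of-the-envelope arithmetic bears that out. Assuming roughly 180 MB/s average sequential throughput (an illustrative figure, not a measurement – real drives slow down towards the inner tracks):

```shell
# 16 TB, four patterns, each written then read back = 8 full sweeps.
awk 'BEGIN { printf "%.0f hours\n", 16e12 * 8 / 180e6 / 3600 }'   # ~198 hours, roughly 8 days
```
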
The badblocks utility can be run in a number of different ways, but it basically boils down to whether you want to run a write test. If you really want to find bad blocks you need to write to the disk; the only real downside to performing a write test is time – it takes ages. For a typical destructive write test you need to be able to sacrifice all the data and structure on the disk. It will then perform four passes, writing and reading different data patterns and checking they match. On the 160GB drive we’ve been using in this article, a single write and read of one pattern took about 90 minutes! Obviously for a large modern drive you’re probably talking several days.
The command I use for a destructive write test is:
badblocks -wsv /dev/disk/by-id/<device_id>
Note: generally you’d use the device alias such as /dev/sda. I have shown the ID because I run a ZFS raidz1 array, where you should really use the ID. I actually offlined a disk for this article just to capture the output of badblocks.
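If you want to check a disk without destroying its contents, badblocks also has safer modes. A sketch – /dev/sdX is a placeholder for your device; these are gentler but less likely to provoke a marginal sector than a full write pass:

```shell
sudo badblocks -sv /dev/sdX    # read-only scan, data untouched
sudo badblocks -nsv /dev/sdX   # non-destructive read-write test (slower)
```
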
If neither smartctl nor badblocks turns up any problems with the drive then it’s probably good to use. Backblaze has released plenty of analyses of drive failures, and their conclusion is that if a drive doesn’t show any SMART warnings it’s probably good. About 20% of drives just die out of the blue without giving any warning, and no amount of testing is going to find those.