Diagnose and Replace a Defective Hard Drive (Linux Dedicated Server with Hardware RAID)
In this article, you will learn how to identify a defective hard drive and prepare your server to replace the defective drive.
This article assumes that you have basic knowledge of Linux server administration. If you have any questions or need help with the replacement of the defective hard drive, please contact IONOS Customer Support.
To ensure the greatest possible reliability of your drives, it is necessary that you monitor the hardware RAID of your dedicated server. If you discover that a hard drive is defective or you receive a notification email about a defective hard drive, you must contact customer service to arrange for the hard drive replacement. To get this done, you will first have to identify the defective hard drive and prepare the server for the drive exchange.
RAID systems enable greater reliability and/or higher speeds. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up your data regularly. Also, make sure that you back up your data before performing the following steps to ensure the security of your data.
For more information on creating backups, see the following article:
Backing Up Server Data (Linux)
Hardware RAID Controllers: General Information
A hardware RAID controller is a physical controller that is built into the server as a hardware component. This controller has its own processor for the calculation of RAID operations. This processor organises and manages the memory space. Thus, the CPU of the server is not burdened by RAID calculations. For hardware RAID controllers, the RAID functionality is independent of the operating system. They are managed by special Command Line Interface (CLI) programs which can vary depending on the manufacturer and model.
Diagnosing Hard Drive Errors
In order to detect hard drive errors, we recommend that you use the smartctl program.
Smartctl is a command-line program for monitoring drives using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program, you can check whether a hard drive is defective. It is part of the Smartmontools, which are available as packages for many Linux distributions.
In some cases, a hard drive defect cannot be detected from the SMART values. We therefore recommend that you also analyse the log file /var/log/messages.
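As a rough sketch of such a log analysis, the following helper prints only those log lines that often accompany a failing drive. The function name scan_disk_errors and the search patterns are assumptions chosen for illustration, not an exhaustive or official list. On Ubuntu, kernel messages are typically written to /var/log/syslog instead of /var/log/messages.

```shell
# Print log lines that commonly accompany failing drives.
# The patterns below are illustrative assumptions; extend them as needed.
# With no file argument, the function reads from standard input.
scan_disk_errors() {
    grep -iE 'i/o error|medium error|unrecovered read error|critical target error' "$@"
}
```

For example, scan_disk_errors /var/log/messages lists the matching lines of that file. An empty result does not prove that the drive is healthy.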
Installing Smartctl
To install Smartctl, type the following command:
CentOS:
yum install smartmontools
Ubuntu:
sudo apt-get install smartmontools
Determining the Hardware Controller Type
To check which hardware controller is installed in your server, you can use the lshw program. This program reports detailed information about the hardware components.
To install the program, enter the following command:
CentOS:
yum install lshw
Ubuntu:
sudo apt-get install lshw
Displaying the Hardware Information
To display a summary of the hardware information, type the following command:
lshw -short
To output the hardware information as a text file, type the following command:
lshw > lshw_edition.txt
In the following example, a PERC H330 hardware controller is installed in the server:
root@829F6DF:~# lshw -short
H/W path Device Class Description
==========================================================
system PowerEdge R230 (SKU=NotProvided;ModelName=PowerEdge R230)
/0 bus 0DWX9P
/0/0 memory 64KiB BIOS
/0/400 processor Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz
/0/400/700 memory 256KiB L1 cache
/0/400/701 memory 1MiB L2 cache
/0/400/702 memory 8MiB L3 cache
/0/1000 memory 32GiB System Memory
/0/1000/0 memory 16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/1000/1 memory 16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 2133 MHz (0.5 ns)
/0/1000/2 memory [empty]
/0/1000/3 memory [empty]
/0/100 bridge Intel Corporation
/0/100/1 bridge Skylake PCIe Controller (x16)
/0/100/1/0 scsi0 storage MegaRAID SAS-3 3008 [Fury]
/0/100/1/0/2.0.0 /dev/sda disk 799GB PERC H330 Adp
/0/100/1/0/2.0.0/1 /dev/sda1 volume 2047KiB BIOS Boot partition
/0/100/1/0/2.0.0/2 /dev/sda2 volume 27GiB EXT3 volume
/0/100/1/0/2.0.0/3 /dev/sda3 volume 9536MiB Linux swap volume
/0/100/1/0/2.0.0/4 /dev/sda4 volume 707GiB LVM Physical Volume
/0/100/1.1 bridge Skylake PCIe Controller (x8)
/0/100/14 bus Sunrise Point-H USB 3.0 xHCI Controller
/0/100/14/0 usb1 bus xHCI Host Controller
/0/100/14/0/3 bus Gadget USB HUB
/0/100/14/1 usb2 bus xHCI Host Controller
/0/100/14.2 generic Sunrise Point-H Thermal subsystem
/0/100/16 communication Sunrise Point-H CSME HECI #1
/0/100/16.1 communication Sunrise Point-H CSME HECI #2
/0/100/17 storage Sunrise Point-H SATA controller [AHCI mode]
/0/100/1d bridge Sunrise Point-H PCI Express Root Port #9
/0/100/1d/0 eth0 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/100/1d/0.1 eth1 network NetXtreme BCM5720 Gigabit Ethernet PCIe
/0/100/1d.2 bridge Sunrise Point-H PCI Express Root Port #11
/0/100/1d.2/0 bridge SH7758 PCIe Switch [PS]
/0/100/1d.2/0/0 bridge SH7758 PCIe Switch [PS]
/0/100/1d.2/0/0/0 bridge SH7758 PCIe-PCI Bridge [PPB]
/0/100/1d.2/0/0/0/0 display G200eR2
/0/100/1f bridge Sunrise Point-H LPC Controller
/0/100/1f.2 memory Memory controller
/0/100/1f.4 bus Sunrise Point-H SMBus
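Because the full listing is long, it can help to reduce it to the storage-related entries. The following sketch filters the output of lshw -short for the classes storage and disk. The function name filter_storage is hypothetical, and the filter assumes the column layout shown in the example above, where the Device column may be empty.

```shell
# Keep only lshw -short lines whose class is "storage" or "disk".
# The class is the second field when the Device column is empty,
# otherwise the third field.
filter_storage() {
    awk '$2 == "storage" || $2 == "disk" || $3 == "storage" || $3 == "disk"'
}
```

A possible call is lshw -short | filter_storage. For the example server, this would leave the MegaRAID controller, the PERC H330 logical disk, and the onboard SATA controller.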
Viewing Hard Drive Information
To use Smartctl to access hard drive information, you must always specify the appropriate command in combination with an option and a target device. The target device depends on the controller manufacturer.
Use the commands listed below to display the information required for diagnosing the hard drive:
Manufacturer | Hard disk | Command |
---|---|---|
ARECA | 1 | smartctl -iHAl error /dev/sg1 -d areca,1 |
ARECA | 2 | smartctl -iHAl error /dev/sg1 -d areca,2 |
LSI / 3Ware | 1 | smartctl -iHAl error /dev/twe0 -d 3ware,0 |
LSI / 3Ware | 2 | smartctl -iHAl error /dev/twe0 -d 3ware,1 |
Adaptec | 1 | smartctl -iHAl error /dev/sg2 -d sat |
Adaptec | 2 | smartctl -iHAl error /dev/sg3 -d sat |
Adaptec | (3) | smartctl -iHAl error /dev/sg4 -d sat |
Adaptec | (4) | smartctl -iHAl error /dev/sg5 -d sat |
Dell | 1 | smartctl -iHAl error -d sat+megaraid,0 /dev/sda |
Dell | 2 | smartctl -iHAl error -d sat+megaraid,1 /dev/sda |
Broadcom | 1 | smartctl -iHAl error -d sat+megaraid,0 /dev/sda |
Broadcom | 2 | smartctl -iHAl error -d sat+megaraid,1 /dev/sda |
Additional commands for supported hardware controllers can be found on the following page:
https://www.smartmontools.org/wiki/Supported_RAID-Controllers
Example:
[root@localhost ~]# smartctl -iHAl error /dev/sg1 -d areca,1
smartctl 7.0 2018-12-30 r4883 [x86_64-w64-mingw32-2016] (sf-7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K2
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 141 140 021 Pre-fail Always - 3933
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 34
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
16 Gas_Gauge 0x0022 000 200 000 Old_age Always - 1822115874
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 113 109 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
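The per-disk commands from the table can also be run in a loop. The following is a minimal sketch for MegaRAID-based controllers (Dell and Broadcom in the table). It assumes that the logical device is /dev/sda and that the physical disks use the IDs 0 and 1, as in the table; both may differ on your server. The function name check_megaraid_health is hypothetical.

```shell
# Print the SMART health summary for each physical disk behind a
# MegaRAID-based controller.
# Usage: check_megaraid_health DEVICE ID [ID...]
# The device and the disk IDs are assumptions taken from the command
# table; adjust them to your server.
check_megaraid_health() {
    device=$1
    shift
    for id in "$@"; do
        echo "== disk $id =="
        smartctl -H -d "sat+megaraid,$id" "$device"
    done
}
```

A call such as check_megaraid_health /dev/sda 0 1 then prints one health summary per disk. Run it as root, as with the individual commands.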
Interpreting the Data
Look through the detailed information you pulled up. The first section lists information that you can use to identify the hard drive. For example, this section displays the device model, serial number, and size of the hard drive under test.
=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K2
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
In the second section, Smartctl evaluates the current state of the hard drive. If, for example, the value FAILED or UNKNOWN is displayed instead of the value PASSED, you should replace the hard drive as soon as possible.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
In the third section, the SMART attributes are listed in detail. For each attribute, the current normalized value (VALUE), the worst value ever measured (WORST), and the respective threshold (THRESH) are listed. Higher normalized values are better. If the current value (VALUE) or the worst value ever measured (WORST) falls to or below the threshold (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 141 140 021 Pre-fail Always - 3933
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 34
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
16 Gas_Gauge 0x0022 000 200 000 Old_age Always - 1822115874
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 113 109 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
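The comparison of VALUE and WORST against THRESH can be sketched as a small filter over the attribute table. It assumes the ten-column layout shown above and treats an attribute as failed when its normalized value is less than or equal to the threshold. Attributes with a threshold of 0 are skipped, because they never trigger a warning. The function name failing_attributes is hypothetical.

```shell
# Read smartctl -A output on standard input and report attributes whose
# normalized VALUE ($4) or WORST ($5) is at or below THRESH ($6).
# Assumes the ten-column attribute table printed for ATA drives.
failing_attributes() {
    awk '
    $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 {
        if ($4 + 0 <= $6 + 0)
            print $2, "FAILING_NOW"
        else if ($5 + 0 <= $6 + 0)
            print $2, "In_the_past"
    }'
}
```

For the example output above, the filter prints nothing, because every attribute with a non-zero threshold is still well above it.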
The following parameters can indicate an imminent hard drive failure before a SMART warning is displayed:
Reallocated_Sector_Ct: Specifies the number of sectors reassigned due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically assigned to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign for incipient surface problems. If this value is not equal to zero, a hard drive failure is often imminent. This value is the most important indicator for a hard drive replacement.
Current_Pending_Sector: Specifies the number of unstable sectors waiting to be remapped. If a sector cannot be read or written correctly, it first receives the status Current Pending Sector. The sector is not reassigned in this state, since the data in the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned and the faulty sector permanently marked as unreadable. The value Current_Pending_Sector is an important indicator for a hard drive replacement. If this value is not equal to zero, a hard drive failure is often imminent.
Offline_Uncorrectable: Specifies the number of uncorrectable write and read sector errors.
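To watch these three parameters without reading the whole table, their raw values can be extracted with a short filter. The following sketch assumes the ten-column attribute table shown above, where RAW_VALUE is the last column. The function name sector_warnings is hypothetical.

```shell
# Read smartctl -A output on standard input and print the three sector
# attributes whose RAW_VALUE ($10) is greater than zero.
sector_warnings() {
    awk '
    $2 == "Reallocated_Sector_Ct" ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" {
        if ($10 + 0 > 0)
            print $2 "=" $10
    }'
}
```

Any line printed by this filter is a reason to look at the drive more closely. No output means that all three raw values are zero or that the attributes are not reported by the drive.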
The last section shows the internal error log of the hard drive. Errors are recorded here if the hard drive could not correctly process commands sent by the server. If this section reports ten or more errors, you should replace the hard drive as soon as possible.
SMART Error Log Version: 1
No Errors Logged
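The number of logged errors can be read from this section with a small filter. The sketch below assumes that smartctl prints either the line No Errors Logged or a line starting with ATA Error Count: followed by the number, which is the format used for ATA drives; other drive types may report errors differently. The function name error_log_count is hypothetical.

```shell
# Read smartctl -l error output on standard input and print the number
# of logged errors, or "unknown" if neither expected line is present.
error_log_count() {
    awk '
    /^ATA Error Count:/ { print $4; found = 1; exit }
    /^No Errors Logged/ { print 0; found = 1; exit }
    END { if (!found) print "unknown" }'
}
```

For the example output above, the result is 0.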
Displaying Log Files
For information on displaying the log files, please refer to the documentation of the respective manufacturer.
Areca
http://areca.starline.de/RaidCards/Documents/Manual_Spec/Software
3ware
http://www.3ware.com/support/userdocs.asp
Adaptec
http://download.adaptec.com/pdfs/user_guides/microsemi_raid_controller_iug_6_2017.pdf
Dell
https://www.dell.com/support/home/de/de/debsdt1/product-support/product/poweredge-rc-h330/manuals
Broadcom
https://www.broadcom.com/products/storage/raid-controllers/megaraid-9440-8i#documentation
Preparing Hard Drive Replacement
Viewing Detailed Information for Drive Replacement
The following information is required in order to replace the defective hard drive:
Name of the hard drive in the RAID
Serial number
Model
Log file (optional)
Creating a SMART Log
Use the commands listed below to generate a complete SMART log:
Manufacturer | Hard disk | Command |
---|---|---|
ARECA | 1 | smartctl -x /dev/sg1 -d areca,1 |
ARECA | 2 | smartctl -x /dev/sg1 -d areca,2 |
LSI / 3Ware | 1 | smartctl -x /dev/twe0 -d 3ware,0 |
LSI / 3Ware | 2 | smartctl -x /dev/twe0 -d 3ware,1 |
Adaptec | 1 | smartctl -x /dev/sg2 -d sat |
Adaptec | 2 | smartctl -x /dev/sg3 -d sat |
Adaptec | (3) | smartctl -x /dev/sg4 -d sat |
Adaptec | (4) | smartctl -x /dev/sg5 -d sat |
Dell | 1 | smartctl -x -d sat+megaraid,0 /dev/sda |
Dell | 2 | smartctl -x -d sat+megaraid,1 /dev/sda |
Broadcom | 1 | smartctl -x -d sat+megaraid,0 /dev/sda |
Broadcom | 2 | smartctl -x -d sat+megaraid,1 /dev/sda |
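To collect the model and serial number from such a log in one step, the relevant lines can be extracted with a short filter. This sketch assumes the field labels Device Model and Serial Number as they appear in the example output for ATA drives; other drive types may use different labels. The function name drive_identity and the log file name in the usage note are hypothetical.

```shell
# Read smartctl output on standard input and print the drive model and
# serial number as key=value pairs.
drive_identity() {
    awk -F': +' '
    /^Device Model:/  { print "model=" $2 }
    /^Serial Number:/ { print "serial=" $2 }'
}
```

A possible call is smartctl -x -d sat+megaraid,0 /dev/sda | tee smart_disk0.log | drive_identity, which saves the complete log to a file and prints the two values needed for the replacement request.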
A SMART log created as described above contains all the information required. You can then have the defective hard drive replaced by IONOS Customer Support.
If you cannot find the serial number of the defective hard drive using smartctl, you can alternatively provide Customer Service with the serial number of the functioning hard drive(s).
If you are unable to determine the information required for the replacement and wish to replace the hard drive, the hardware must be checked before replacing it. During this check, the server is usually temporarily unavailable. If a defect in the hard drive is detected during this test, it will be replaced.
Arranging Hard Drive Replacement
To have the defective hard drive replaced, please contact IONOS Customer Support and provide the information gathered in the previous steps.
Steps to Take After Replacing the Hard Drive
After the defective hard drive has been replaced, the RAID system usually starts rebuilding automatically. Please check that the rebuild starts and completes successfully.