Diagnose and Replace a Defective Hard Drive (Windows Dedicated Server with Hardware RAID)
In this article, you will learn how to identify a defective hard drive and prepare your server for the replacement.
Prerequisite
This article has been created for customers who have at least a basic knowledge of Windows server administration. If you have any questions or need help with the drive replacement, please contact Customer Service.
To keep your dedicated server running reliably, monitor its hardware RAID regularly. If you find that a hard drive is defective, or you receive a notification email about a defective hard drive, contact Customer Service to arrange the replacement. Before you do, identify the defective hard drive and prepare the server for the exchange.
Proceed with caution!
RAID systems enable greater reliability and/or higher speeds. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up your data regularly. Also back up your data before performing the following steps.
For more information on creating backups, click here:
Hardware RAID Controllers: General Information
A hardware RAID controller is a physical component built into the server. It has its own processor, which performs the RAID calculations and organises and manages the storage space, so the server's CPU is not burdened by RAID operations. The RAID functionality of a hardware RAID controller is also independent of the operating system. The controller is managed with a dedicated Command Line Interface (CLI) program, which varies depending on the manufacturer and model.
Diagnosing Hard Drive Errors
In order to detect hard drive errors, we recommend that you use the smartctl program.
Smartctl is a command line program for monitoring drives using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program, you can check whether a hard drive is defective. It is part of the Smartmontools package.
A list of supported hardware controllers can be found here:
https://www.smartmontools.org/wiki/Supported_RAID-Controllers
Install Smartctl
You can download the Smartmontools on the following page:
https://www.smartmontools.org/wiki/Download#InstalltheWindowspackage
Identifying Hardware RAID Controllers
How to check which hardware RAID controller is built into your server:
Open the Control Panel.
Click Hardware > Devices and Printers > Device Manager.
In the Storage controllers section, check which controller is installed in the server.
Checking the Status of the Hardware RAID
Information on checking the status of the hardware RAID can be found here:
Hardware RAID Monitoring / Rebuilding (Windows)
If a disk is missing from the RAID array, it may be faulty or broken. A defective RAID could look like this:
CLI> rsf info
# Name Disks TotalCap FreeCap DiskChannels State
====================================================================================================================================================================================================
1 Raid Set # 00 3 2250.
In the above example, disk 2 has the status incomplete. This indicates a defect.
Viewing Hard Drive Information
Smartctl accepts the same options on Windows and Linux, so the commands are identical on both systems. To use Smartctl for troubleshooting, open the command prompt and change to the directory where the Smartmontools are installed.
To access hard drive information, always run smartctl with an option and a target device. The target device depends on the controller manufacturer.
Use the commands listed below to retrieve the information required to diagnose the hard drive:
Manufacturer | Hard disk | Command |
---|---|---|
ARECA | 1 | smartctl -iHAl error /dev/sg1 -d areca,1 |
ARECA | 2 | smartctl -iHAl error /dev/sg1 -d areca,2 |
LSI / 3Ware | 1 | smartctl -iHAl error /dev/twe0 -d 3ware,0 |
LSI / 3Ware | 2 | smartctl -iHAl error /dev/twe0 -d 3ware,1 |
Adaptec | 1 | smartctl -iHAl error /dev/sg2 -d sat |
Adaptec | 2 | smartctl -iHAl error /dev/sg3 -d sat |
Adaptec | (3) | smartctl -iHAl error /dev/sg4 -d sat |
Adaptec | (4) | smartctl -iHAl error /dev/sg5 -d sat |
Dell | 1 | smartctl -iHAl error -d sat+megaraid,0 /dev/sda |
Dell | 2 | smartctl -iHAl error -d sat+megaraid,1 /dev/sda |
Broadcom | 1 | smartctl -iHAl error -d sat+megaraid,0 /dev/sda |
Broadcom | 2 | smartctl -iHAl error -d sat+megaraid,1 /dev/sda |
Additional commands for supported hardware controllers can be found on the following page:
https://www.smartmontools.org/wiki/Supported_RAID-Controllers
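If you check several drives regularly, the table above can be turned into a small lookup. The following Python sketch is only an illustration: the function name `smartctl_args` is hypothetical, and the device paths and `-d` values are copied from the table, so they may differ on your server.

```python
def smartctl_args(controller: str, disk: int, options: str = "-iHAl error") -> list[str]:
    """Build the smartctl argument list for a controller type and a 1-based disk number.

    Hypothetical helper: device paths and -d values mirror the table in this
    article and are assumptions about your setup, not a general rule.
    """
    if disk < 1:
        raise ValueError("disk numbers start at 1")
    controller = controller.lower()
    if controller == "areca":
        target = ["/dev/sg1", "-d", f"areca,{disk}"]
    elif controller == "3ware":
        # 3ware ports are counted from 0
        target = ["/dev/twe0", "-d", f"3ware,{disk - 1}"]
    elif controller == "adaptec":
        # In the table, disk 1 is /dev/sg2, disk 2 is /dev/sg3, and so on
        target = [f"/dev/sg{disk + 1}", "-d", "sat"]
    elif controller in ("dell", "broadcom"):
        # MegaRAID-based controllers count disks from 0
        target = ["-d", f"sat+megaraid,{disk - 1}", "/dev/sda"]
    else:
        raise ValueError(f"unknown controller type: {controller}")
    return ["smartctl", *options.split(), *target]


if __name__ == "__main__":
    # Prints: smartctl -iHAl error /dev/sg1 -d areca,1
    print(" ".join(smartctl_args("areca", 1)))
```

Returning a list instead of a single string means the result can be passed directly to `subprocess.run` without shell quoting.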
Example:
C:\Program Files\smartmontools\bin>smartctl -iHAl error /dev/sg1 -d areca,1
smartctl 7.0 2018-12-30 r4883 [x86_64-w64-mingw32-2016] (sf-7.0-1)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K2
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 141 140 021 Pre-fail Always - 3933
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 34
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
16 Gas_Gauge 0x0022 000 200 000 Old_age Always - 1822115874
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 113 109 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
Interpreting the Data
The first section lists characteristic information about the hard drive. In this section, you will find the device model, the serial number and the size of the tested hard disk:
=== START OF INFORMATION SECTION ===
Model Family: Hitachi/HGST Ultrastar 7K2
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6M0JAUEV8
LU WWN Device Id: 5 0014ee 00482c2ec
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jan 17 06:17:05 2019 CAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
In the second section, Smartctl evaluates the current state of the hard disk. If, for example, FAILED or UNKNOWN is displayed instead of PASSED, you should replace the hard disk as soon as possible.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
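If you save the smartctl output to a file or capture it in a script, the overall result can be read automatically. This is a minimal Python sketch; the function name `overall_health` is hypothetical, and returning UNKNOWN when the line is missing is an assumption of this example.

```python
import re


def overall_health(smartctl_output: str) -> str:
    """Return the overall-health result (e.g. PASSED) from smartctl output.

    Returns "UNKNOWN" if the result line is not present (assumption of this sketch).
    """
    match = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", smartctl_output
    )
    if not match:
        return "UNKNOWN"
    # smartctl may append an exclamation mark to a failed result
    return match.group(1).rstrip("!")


if __name__ == "__main__":
    sample = "SMART overall-health self-assessment test result: PASSED"
    print(overall_health(sample))  # prints PASSED
```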
In the third section, the SMART attributes are listed in detail. For each attribute, the table shows the current normalised value (VALUE), the worst value measured so far (WORST) and the threshold (THRESH). Higher normalised values are better. If VALUE or WORST falls to or below THRESH, a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 141 140 021 Pre-fail Always - 3933
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 34
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
16 Gas_Gauge 0x0022 000 200 000 Old_age Always - 1822115874
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 9
194 Temperature_Celsius 0x0022 113 109 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
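The comparison of VALUE and THRESH can also be scripted. The following Python sketch parses attribute rows like the ones above and lists the attributes whose normalised value has reached the threshold. The names `Attribute`, `parse_attributes` and `at_or_below_threshold` are hypothetical, and treating a threshold of 0 as "no threshold" is an assumption of this example.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    name: str
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attributes(smartctl_output: str) -> list[Attribute]:
    """Parse the rows of the 'Vendor Specific SMART Attributes' table."""
    attributes = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, and the third column is a hex flag
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attributes.append(
                Attribute(
                    name=fields[1],
                    value=int(fields[3]),
                    worst=int(fields[4]),
                    thresh=int(fields[5]),
                    when_failed=fields[8],
                    raw=" ".join(fields[9:]),
                )
            )
    return attributes


def at_or_below_threshold(attributes: list[Attribute]) -> list[str]:
    """Return the names of attributes whose VALUE is at or below THRESH.

    A threshold of 0 is skipped (assumption: such attributes have no failure limit).
    """
    return [a.name for a in attributes if a.thresh > 0 and a.value <= a.thresh]
```

With the example output above, `at_or_below_threshold(parse_attributes(output))` returns an empty list, because no attribute has reached its threshold.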
The following parameters can indicate an imminent hard drive failure before a SMART warning is displayed:
Reallocated_Sector_Ct: Specifies the number of sectors reassigned due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically assigned to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign for incipient surface problems. If this value is not equal to zero, a hard drive failure is often imminent. This value is the most important indicator for a hard drive replacement.
Current_Pending_Sector: Specifies the number of unstable sectors waiting to be remapped. If a sector cannot be read or written correctly, it first receives the status Current Pending Sector. The sector is not reassigned in this state because the data in it is unknown. Only after several unsuccessful read or write attempts is a replacement sector assigned and the defective sector permanently marked as unreadable. Current_Pending_Sector is an important indicator for a hard drive replacement. If this value is not equal to zero, a hard drive failure is often imminent.
Offline_Uncorrectable: Specifies the number of uncorrectable write and read sector errors.
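The check for these early warning signs can be sketched in Python as follows. The example looks at the RAW_VALUE column of the three attributes and reports those that are not zero. The function name `early_warnings` is hypothetical, and the attribute names are written as they appear in the example output in this article.

```python
# Attribute names as they appear in the ATTRIBUTE_NAME column of the example output
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def early_warnings(smartctl_output: str) -> dict[str, int]:
    """Return the watched attributes whose RAW_VALUE is greater than zero."""
    warnings = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the name is the second one
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                warnings[fields[1]] = int(raw)
    return warnings
```

An empty result means that none of the three counters has been incremented. Any entry in the result is a reason to look at the drive more closely, even if the overall result is still PASSED.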
The last section shows the hard drive's internal error log. Errors are recorded here when the hard drive did not process commands from the server correctly. If the number of errors in this section has at least two digits (10 or more), you should replace the hard drive as soon as possible.
SMART Error Log Version: 1
No Errors Logged
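The error count can also be read from captured output. This Python sketch is an illustration only: the function name `logged_error_count` is hypothetical, and it assumes that a non-empty log contains a line of the form `ATA Error Count: N`, which is how smartctl usually reports it for ATA drives.

```python
import re
from typing import Optional


def logged_error_count(smartctl_output: str) -> Optional[int]:
    """Return the number of errors in the SMART error log, or None if unknown."""
    if "No Errors Logged" in smartctl_output:
        return 0
    # Assumption: a non-empty ATA error log starts with "ATA Error Count: N"
    match = re.search(r"ATA Error Count:\s*(\d+)", smartctl_output)
    return int(match.group(1)) if match else None


def should_replace(error_count: Optional[int]) -> bool:
    """Apply the rule of thumb from this article: 10 or more logged errors."""
    return error_count is not None and error_count >= 10
```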
Viewing Log Files
Please refer to the documentation of the respective manufacturer for information regarding the viewing of log files.
Areca
http://areca.starline.de/RaidCards/Documents/Manual_Spec/Software
3ware
http://www.3ware.com/support/userdocs.asp
Adaptec
http://download.adaptec.com/pdfs/user_guides/microsemi_raid_controller_iug_6_2017.pdf
Dell
https://www.dell.com/support/home/de/de/debsdt1/product-support/product/poweredge-rc-h330/manuals
Broadcom
https://www.broadcom.com/products/storage/raid-controllers/megaraid-9440-8i#documentation
Preparing for a Hard Drive Replacement
Viewing Detailed Information for Drive Replacement
The following information is required in order to replace the defective hard drive:
Name of the hard drive in the RAID
Serial number
Model
Log file (optional)
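The model and serial number can be copied from the information section of the smartctl output. As a minimal Python sketch (the function name `drive_identity` is hypothetical), the two fields could be extracted like this. The name of the hard drive in the RAID is not part of the smartctl output and has to be taken from the CLI program of the controller.

```python
def drive_identity(smartctl_output: str) -> dict[str, str]:
    """Extract the device model and serial number from the information section."""
    wanted = {"Device Model": "model", "Serial Number": "serial"}
    result = {}
    for line in smartctl_output.splitlines():
        # Split at the first colon only, so values that contain colons stay intact
        key, separator, value = line.partition(":")
        if separator and key.strip() in wanted:
            result[wanted[key.strip()]] = value.strip()
    return result


if __name__ == "__main__":
    sample = "Device Model: HGST HUS722T1TALA604\nSerial Number: WMC6M0JAUEV8\n"
    # Prints: {'model': 'HGST HUS722T1TALA604', 'serial': 'WMC6M0JAUEV8'}
    print(drive_identity(sample))
```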
Creating a SMART Log
Use the commands listed below to generate a complete SMART log:
Manufacturer | Hard disk | Command |
---|---|---|
ARECA | 1 | smartctl -x /dev/sg1 -d areca,1 |
ARECA | 2 | smartctl -x /dev/sg1 -d areca,2 |
LSI / 3Ware | 1 | smartctl -x /dev/twe0 -d 3ware,0 |
LSI / 3Ware | 2 | smartctl -x /dev/twe0 -d 3ware,1 |
Adaptec | 1 | smartctl -x /dev/sg2 -d sat |
Adaptec | 2 | smartctl -x /dev/sg3 -d sat |
Adaptec | (3) | smartctl -x /dev/sg4 -d sat |
Adaptec | (4) | smartctl -x /dev/sg5 -d sat |
Dell | 1 | smartctl -x -d sat+megaraid,0 /dev/sda |
Dell | 2 | smartctl -x -d sat+megaraid,1 /dev/sda |
Broadcom | 1 | smartctl -x -d sat+megaraid,0 /dev/sda |
Broadcom | 2 | smartctl -x -d sat+megaraid,1 /dev/sda |
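To hand the log over to Customer Service, it helps to store the output in a file. In the command prompt, you can redirect the output of any of the commands above with `>`, followed by a file name. The same can be sketched in Python. The function name `save_smart_log` and the file name are hypothetical, and the example assumes that smartctl can be found via the PATH or that the script is started from the Smartmontools directory.

```python
import subprocess
from pathlib import Path


def save_smart_log(args: list[str], log_path: str, run=subprocess.run) -> Path:
    """Run smartctl with the given arguments and write its output to a file.

    The 'run' parameter exists only so the function can be tried without smartctl.
    """
    # smartctl uses a non-zero exit status to signal disk problems,
    # so the status is deliberately not treated as an error here
    result = run(args, capture_output=True, text=True, check=False)
    path = Path(log_path)
    path.write_text(result.stdout, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Example for an ARECA controller, disk 1 (arguments taken from the table above)
    save_smart_log(["smartctl", "-x", "/dev/sg1", "-d", "areca,1"], "smart_disk1.log")
```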
If the SMART log was created as described above, it contains all of the information you need to have the defective hard drive replaced.
If you cannot find the serial number of the defective hard drive using smartctl, you can instead provide Customer Service with the serial number of the functioning hard drive(s).
If you cannot determine the information required for the replacement but still want the hard drive replaced, the hardware must be checked first. During this check, the server is usually temporarily unavailable. If the check detects a defect in the hard drive, it will need to be replaced.
Arranging for a Hard Drive Replacement
You can then have the defective hard drive replaced. To do this, please contact IONOS Customer Support.
Steps to Take After Replacing the Hard Drive
After the defective hard drive has been replaced, the RAID system has to be rebuilt, which usually starts automatically. Please make sure that the rebuild of the RAID system starts and completes successfully.