Software Raid: Dealing with a failed drive
The simplest way to proceed is to power down the system and replace the drive.
Note that if you are using mdadm
and your boot drive fails you can move the survivor to the boot position so the system is bootable (for example, you have a RAID across hda and hdc, then hda dies -- move hdc to hda, then insert a new hdc). This is useful for systems that don't let you boot from disks other than the first one.
Skip down to Restore Partitioning Information
An Alternative: Hotswap Linux SCSI drives
(You probably want to read about matching SCSI Device Names And Bus IDs
. This is a really good idea if you think that disks have failed on this computer in the past and might not have been rebooted since.)
Since most SCSI controllers are Adaptec, and the Adaptec driver
supports the hotswap functionality in the Linux SCSI layer, you can echo
device add/remove commands into the realtime /proc filesystem and the
kernel will spin down or spin up the corresponding drive you pass to it.
To get the device coordinates to pass to the proc filesystem, run:
# cat /proc/scsi/scsi
It will list all SCSI devices Linux has detected at the moment you run
the command. Let's say a SMART alert is being raised for a drive, you
want to replace that drive, and you are running software RAID
on the system.
The output of cat /proc/scsi/scsi looks like:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: Seagate Model: and so on
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: Seagate and so on.
Id 1 has been the source of the SMART alerts in /var/log/messages, and you have
an exact duplicate of the drive to replace it with.
Pass this command to remove Id 1 from the system:
# echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi
(Note: Be aware that the 0 as in scsi0
is significant; it is the first 0 in the command above. Had your devices been listed on scsi1
instead, you would use 1 0 1 0
as the device target specification.)
You will then get a kernel message stating that it is spinning down the drive,
which dev it is, and when the operation has
completed. Once it has completed you can pull the drive and insert
the new one.
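The remove/add commands above can be wrapped in small helpers so the host, channel, id, and lun only have to be typed once. This is a hedged sketch: the function names (`scsi_cmd`, `scsi_eject`, `scsi_insert`) are made up for this example, and the coordinates must match what `cat /proc/scsi/scsi` showed for your drive.

```shell
#!/bin/sh
# Sketch of helpers around the /proc/scsi/scsi interface described above.
# The function names are illustrative, not standard tools.

scsi_cmd() {
    # $1 = add|remove, $2..$5 = host channel id lun
    printf 'scsi %s-single-device %s %s %s %s' "$1" "$2" "$3" "$4" "$5"
}

# These two actually write to the kernel interface; run as root.
scsi_eject()  { echo "$(scsi_cmd remove "$@")" > /proc/scsi/scsi; }
scsi_insert() { echo "$(scsi_cmd add "$@")"    > /proc/scsi/scsi; }
```

For the example in this page, `scsi_eject 0 0 1 0` writes the same string as the echo command above.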
The reverse command spins the drive up and makes it available to the system:
# echo "scsi add-single-device 0 0 1 0" > /proc/scsi/scsi
This will report the dev name, say the drive is spinning up, print
queue and tag information, and then report ready along with the size and
device name.
You should be able to confirm these operations once they are complete by examining the bottom of the output from the dmesg
command. Sometimes you will get unexpected drive letters:
for example, you replaced sda and the new drive comes in as sdc. This is OK; it just means the SCSI subsystem was confused at mount time and didn't think it could reuse the same letter. Just use the drive name (sdc) as the "new drive" identifier in all the instructions below.
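One way to spot the letter the kernel actually assigned is to compare the list of disks before and after the swap. This is a hedged sketch; the `new_disks` helper is invented for illustration, and on a live box the two lists would come from something like `ls /sys/block | grep '^sd'` taken before and after inserting the drive.

```shell
#!/bin/sh
# Sketch: given a "before" and "after" list of disk names, print the
# names that only appear after the swap, i.e. the letter the new drive got.
new_disks() {
    # $1 = before list, $2 = after list (whitespace-separated names)
    for d in $2; do
        case " $1 " in
            *" $d "*) ;;          # already present before the swap
            *) echo "$d" ;;       # newly appeared -> the replacement drive
        esac
    done
}
```

For example, `new_disks "sda sdb" "sda sdb sdc"` prints `sdc`.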
Restore Partitioning Information
Once the drive is installed, you will probably have to apply a partition table. Restore it using the sfdisk
command and the files you made at array creation time:
# sfdisk /dev/sdb < /etc/partitions.sdb
This will autopartition the drive to exactly what it was partitioned as before,
and now you are ready to recreate the raid partitions that were
on the failed disk.
(One interesting scenario that I ran into was that I inserted two identical disks that fdisk reported as having different geometries. I took the partitioning from the running one and applied it to the second, and the second disk took on the apparent geometry of the first. Note that I did not have to use the force option to apply the partitioning information.)

If you don't have a partitions.sdb (or whatever) file:
IF (AND ONLY IF) your disks are identical, you can pull the partitioning off the survivor and apply it to the new disk:
# sfdisk -d /dev/sda > /etc/partitions.sdb
# sfdisk /dev/sdb < /etc/partitions.sdb
If your disks don't match, you are running fdisk and partitioning things manually. This is left as an exercise for the reader, although there are good hints to be gleaned on Software Raid Quick Howto.
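To make sure those backup files exist the next time a drive dies, you can generate the sfdisk dump command for every disk in one pass. A hedged sketch, assuming the `/etc/partitions.<disk>` naming used on this page; it prints the commands instead of running them (sfdisk needs root), so drop the `echo` to execute for real.

```shell
#!/bin/sh
# Sketch: print the per-disk sfdisk backup commands recommended above.
# The disk list is passed as arguments, e.g.: backup_cmds sda sdb
backup_cmds() {
    for disk in "$@"; do
        echo "sfdisk -d /dev/$disk > /etc/partitions.$disk"
    done
}
```

Running `backup_cmds sda sdb` prints one backup command per disk.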
Running Raid Commands to Recover Partitions
Looking at the current configuration:
# less /proc/mdstat
md1 : active raid1 sda1[1]
      40064 blocks [2/1] [_U]
It will list all the md partitions, which are the actual raid devices,
and the raid counterpart of each /dev/sdXN partition. Notice the [2/1], which indicates that
one of the mirrored partitions is missing: the one that was on the other
drive, sdb1. The [1] after sda1 denotes that this is the id 1 drive.
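The `[2/1] [_U]` pattern makes degraded arrays easy to pick out mechanically. A hedged sketch: the `degraded_arrays` helper is invented for this example and reads mdstat-format text on stdin so it can be tested on a sample; on a live box, feed it /proc/mdstat.

```shell
#!/bin/sh
# Sketch: list md arrays whose status block (e.g. "[2/1] [_U]") shows a
# missing member. An underscore in the second bracket means a member is down.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[0-9]+\/[0-9]+\] \[[^]]*_[^]]*\]/ { print name }'
}
# Live usage:  degraded_arrays < /proc/mdstat
```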
Add the new disk partition into the raid set:
# mdadm /dev/md1 -a /dev/sdb1
...and it will go off and sync itself. Once complete, less /proc/mdstat will show:
md1 : active raid1 sdb1[0] sda1[1]
      40064 blocks [2/2] [UU]
Now both drives are shown, with id 0 (sdb1) first, and the [UU]
designates that both partitions are up; the earlier [_U] meant the first
partition was missing and not up. You can queue up all the partition
rebuilds and they will be processed one by one in the order queued. /proc/mdstat will
then show both partitions, with the ones that are still queued
shown as, e.g., sdb5[2], the 2 denoting a spare drive waiting to
rebuild.

Of course you have to do this for all RAID sets that have a partition on the replaced disk.
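Since every RAID set touching the replaced disk needs the same mdadm command, it can help to generate the whole queue at once. A hedged sketch: the `readd_cmds` helper and the md-to-partition pairing are examples, not a standard tool; it prints the commands so you can review them before running any as root.

```shell
#!/bin/sh
# Sketch: print the mdadm commands to queue every partition of the
# replacement disk back into its array. Arguments come in pairs:
#   readd_cmds md1 sdb1 md2 sdb2 ...
readd_cmds() {
    while [ "$#" -ge 2 ]; do
        echo "mdadm /dev/$1 -a /dev/$2"
        shift 2
    done
}
```

Adjust the pairs to match your own /proc/mdstat layout, then run the printed commands (or pipe them to `sh` as root).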
Run grub to make sure your system is actually bootable in this configuration
So if you've replaced one of the disks you are likely to want to boot from in the future, you need to make it bootable.
NOTE: You need to wait until the partition containing /boot has finished sync'ing before you do this!
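One way to honor that note is to poll /proc/mdstat until no resync or recovery is in progress. A hedged sketch: the `wait_for_sync` helper is invented here, the poll interval is arbitrary, and the file path is a parameter (defaulting to /proc/mdstat) purely so the function can be exercised on a sample file.

```shell
#!/bin/sh
# Sketch: block until the mdstat file no longer reports a resync or
# recovery in progress; after that it is safe to run grub.
wait_for_sync() {
    f="${1:-/proc/mdstat}"
    while grep -Eq 'resync|recovery' "$f" 2>/dev/null; do
        sleep 10
    done
}
```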
NOTE: Do not be confused! The “hd0” notation within grub is not the same as normal device notation when referring to devices in LINUX. The “device” command specifies which disk Grub will operate on (ie the disk where /boot's contents are
). Grub addresses the entire disk at the hardware level.
> device (hd0) /dev/sda
> root (hd0,0)
> setup (hd0)
(blah blah blah)
Running "install /boot/grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/boot/grub/stage2
Note: your values of (hd0,0) may vary depending on where your /boot or / partitions actually are.
At this point you should be good to go. A reboot is not necessary here, but if you have an unexpected drive name change (see above) it is a really good idea to reboot at your earliest convenience so that if one of the drives fail in the future you don't accidentally eject the wrong disk.
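The interactive grub session above can also be driven non-interactively by piping a batch script into `grub --batch`. A hedged sketch: the `grub_batch` helper is invented for this example, and the device and partition values are the ones used on this page; adjust them to wherever your /boot actually lives.

```shell
#!/bin/sh
# Sketch: emit the batch input for the interactive grub session shown above.
#   $1 = Linux disk name (e.g. /dev/sda), $2 = grub root (e.g. "(hd0,0)")
grub_batch() {
    printf 'device (hd0) %s\nroot %s\nsetup (hd0)\nquit\n' "$1" "$2"
}
# Usage (as root, after the /boot array has finished syncing):
#   grub_batch /dev/sda '(hd0,0)' | grub --batch
```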
(Based in part on: Source)