
Software Raid Failed Disk Howto


Software Raid: Dealing with a failed drive

The simplest way to proceed is to power down the system and replace the drive.

Note that if you are using mdadm and your boot drive fails, you can move the survivor to the boot position so the system remains bootable (for example, you have a RAID across hda and hdc and hda dies: move hdc to the hda position, then insert a new hdc). This is useful for systems that cannot boot from any disk other than the first one.

Skip down to Restore Partitioning Information

An Alternative: Hotswap Linux SCSI drives

(You probably want to read about matching SCSI Device Names And Bus IDs. This is a really good idea if you think that disks have failed on this computer in the past and might not have been rebooted since.)

Since most SCSI controllers are Adaptec, and the Adaptec driver supports hot-swapping through the Linux SCSI layer, you can echo a device out of (and back into) the /proc filesystem and the kernel will spin down or spin up the corresponding drive.

To find the device identifiers to pass to the proc filesystem, run:

# cat /proc/scsi/scsi

This lists all the SCSI devices Linux knows about at the moment you run the command. Let's say a SMART alert is being raised through the aic7xxx module, you want to replace the drive, and the system is running software RAID1.
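
(If smartmontools happens to be installed, smartctl can confirm which drive is actually raising the alerts before you pull anything; the device name below is just an example, substitute the suspect drive:)

# smartctl -H /dev/sdb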

Output of the /proc/scsi/scsi is:

Host:  scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Seagate Model: and so on
Host:  scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: Seagate and so on.

ID 1 is the drive that has been raising the SMART alerts in the logs, and you have an exact duplicate of the drive on hand to replace it with.
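
Before touching the SCSI layer, it is worth making sure the failing drive's partitions are out of the md arrays. A minimal sketch, assuming the failing drive is sdb and its partition sdb1 belongs to md1 (repeat for every array with a partition on that disk):

# mdadm /dev/md1 -f /dev/sdb1
# mdadm /dev/md1 -r /dev/sdb1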

You will need to pass this command to remove ID 1 from the system:

# echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi

(Note: the 0 as in scsi0 is significant; it is the first 0 in the command above. Had your devices been listed on scsi1 instead, you would use 1 0 1 0 as the device target specification.)

You will then get a message stating that the drive is being spun down, which device it is, and when the operation has completed. Once it has completed, you can remove the drive and replace it with the new one.

The reverse command spins the drive up and makes it available to Linux:

# echo "scsi add-single-device" 0 0 1 0 > /proc/scsi/scsi

This will print the device name, say the drive is spinning up, give the queue and tag information, and finally report ready along with the size and device name.

You should be able to confirm these operations once they are complete by examining the bottom of the output from the dmesg command.
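
For example (the exact messages vary by driver):

# dmesg | tail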

Sometimes you will get unexpected drive letters: for example, you replaced sda and the new drive comes in as sdc. This is OK; it just means the SCSI subsystem was confused at mount time and didn't think it could reuse the same letter. Just use the new name (sdc) as the "new drive" identifier in all the instructions below.

Restore Partitioning Information

Once the drive is installed, you probably have to apply a partition table. You can restore the partition table using the sfdisk command and the dump files you made at array creation time:

# sfdisk /dev/sdb < /etc/partitions.sdb

This repartitions the drive exactly as it was partitioned before, and you are now ready to recreate the RAID partitions that were previously on the failed disk.

(One interesting scenario I ran into: I inserted two identical disks that fdisk reported as having different geometries. I took the partitioning from the running disk and applied it to the second, and the second disk took on the apparent geometry of the first. Note that I did not have to use the force option to apply the partitioning information.)

If you don't have a partitions.sdb (or whatever) file: IF (AND ONLY IF) your disks are identical, you can pull the partitioning off the survivor and apply it to the new disk:

# sfdisk -d /dev/sda > /etc/partitions.sdb
# sfdisk /dev/sdb < /etc/partitions.sdb

If your disks don't match, you will be running fdisk and partitioning things manually. This is left as an exercise for the reader, although there are good hints to be gleaned from Software Raid Quick Howto.
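
Either way, it is cheap to sanity-check the result before rebuilding the arrays by listing both partition tables and comparing them (device names as in the examples above):

# sfdisk -l /dev/sda
# sfdisk -l /dev/sdb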

Running Raid Commands to Recover Partitions

Looking at the current configuration:

# less /proc/mdstat

md1 : active raid1 sda1[1]
      40064 blocks [2/1] [_U]

It lists all the md devices, which are the RAID counterparts of the underlying /dev/sdXN partitions. Notice the [2/1], which indicates that one of the mirrored partitions is missing: the one that lived on the other drive, sdb1. The [1] after sda1 denotes that this partition has RAID ID 1.
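
If the /proc/mdstat notation is unfamiliar, mdadm can show the same state more verbosely, for example:

# mdadm --detail /dev/md1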

Add the new disk partition into the raid set:

# mdadm /dev/md1 -a /dev/sdb1

...and it will go off and resync itself. Once complete, less /proc/mdstat will show:

md1 : active raid1 sdb1[0] sda1[1]
      40064 blocks [2/2] [UU]

Now both drives are shown, with ID 0 (sdb1) listed first. The [UU] designates that both partitions are up; the earlier [_U] meant the first partition was missing. You can queue up all the partition rebuilds and they will be processed one by one in the order you added them. Partitions still waiting to rebuild show up as, for example, sda5[2], where the 2 denotes a spare waiting to be built.
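
If you want to keep an eye on the rebuild as it runs, something like this works (adjust the interval to taste):

# watch -n 5 cat /proc/mdstat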

Of course you have to do this for all RAID sets that have a partition on the replaced disk.
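
For example, if the replaced disk also carried partitions belonging to md2 and md3, the whole set of adds might look like this (the md/partition pairing here is only an illustration; match it to your own layout):

# mdadm /dev/md1 -a /dev/sdb1
# mdadm /dev/md2 -a /dev/sdb2
# mdadm /dev/md3 -a /dev/sdb5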

Run grub to make sure your system is actually bootable in this configuration

If you've replaced one of the disks that you are likely to want to boot from in the future, you need to make it bootable.

NOTE: You need to wait until the partition containing /boot has finished sync'ing before you do this!
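
A finished array shows [UU] in /proc/mdstat with no resync or recovery line; you can also ask mdadm directly, for example:

# mdadm --detail /dev/md1 | grep -i state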

NOTE: Do not be confused! The "hd0" notation within grub is not the same as normal Linux device notation. The "device" command specifies which disk grub will operate on (i.e. the disk where /boot's contents are). Grub addresses the entire disk at the hardware level.

# grub
> device (hd0) /dev/sda
> root (hd0,0)
> setup (hd0)
(blah blah blah)
Running "install /boot/grub/stage1 (hd0) (hd0)1+16 p (hd0,0)/boot/grub/stage2
/boot/grub/grub.conf"… succeeded
Done.
> quit

Note: your values of (hd0,0) may vary depending on where your /boot or / partitions actually are.
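
If you want the other half of the mirror to be bootable as well (so the box still comes up if the first disk dies), you can repeat the same steps with grub pointed at the second disk. A sketch, assuming the other disk is /dev/sdb and its first partition also carries a copy of /boot:

# grub
> device (hd0) /dev/sdb
> root (hd0,0)
> setup (hd0)
> quit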

At this point you should be good to go. A reboot is not necessary here, but if you had an unexpected drive name change (see above) it is a really good idea to reboot at your earliest convenience so that if one of the drives fails in the future you don't accidentally eject the wrong disk.

(Based in part on: Source)
