Software Raid Failed Disk Howto

Created by dave. Last edited by dave, one year and 270 days ago. Viewed 3,889 times. #10

Software Raid: Dealing with a failed drive

The simplest way to proceed is to power down the system and replace the drive.

Note that if you are using mdadm and your boot drive fails you can move the survivor to the boot position so the system is bootable (for example, you have a RAID across hda and hdc, then hda dies -- move hdc to hda, then insert a new hdc). This is useful for systems that don't let you boot from disks other than the first one.

Skip down to Restore Partitioning Information

An Alternative: Hotswap Linux SCSI drives

(These instructions have been tried, with mixed results -- see below)

(You probably want to read about matching SCSI Device Names And Bus IDs.)

There is a lot of mythology that Linux software RAID is difficult and does not support hot-swapping drives. That is exactly what it is: a myth. Many SCSI controllers are Adaptec, and the Adaptec driver supports hot-swap through the Linux SCSI layer, so you can echo a device out of and back into the live /proc filesystem and the kernel will spin the corresponding drive down or up.

To get the syntax to pass to the proc filesystem you perform:

# cat /proc/scsi/scsi

This lists every SCSI device Linux has detected at the moment you run the command. Suppose the aic7xxx module is raising SMART alerts for a drive, you want to replace that drive, and the system is running software RAID1.

The output of /proc/scsi/scsi is:

Host:  scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Seagate Model: and so on
Host:  scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: Seagate and so on.

Id 1 is the drive behind the SMART alerts in the messages log, and you have an identical drive to replace it with.

To remove Id 1 from the system, run:

# echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi

The kernel then logs a message saying it is spinning down the drive, names the device, and reports when it has finished. Once it has, you can pull the drive and fit the new one.
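
The four numbers passed to remove-single-device are, in order, the Host, Channel, Id, and Lun values from the /proc/scsi/scsi listing. A minimal sketch of pulling them out automatically, assuming the listing has the layout shown earlier; scsi_tuple is a hypothetical helper name, not an existing tool:

```shell
# scsi_tuple (hypothetical helper): read /proc/scsi/scsi-style text on stdin
# and print "host channel id lun" for each device, one device per line.
scsi_tuple() {
    awk '/^Host:/ {
        host = $2
        sub(/^scsi/, "", host)          # "scsi0" -> "0"
        # $4, $6 and $8 are the zero-padded Channel, Id and Lun values;
        # %d strips the padding ("01" -> 1).
        printf "%d %d %d %d\n", host, $4, $6, $8
    }'
}

# On a real system:
#   scsi_tuple < /proc/scsi/scsi
```

Fed the two-device listing above, this would print 0 0 0 0 and 0 0 1 0; the second line is the tuple used to remove Id 1.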

The reverse command spins the new drive up and makes it available to Linux:

# echo "scsi add-single-device 0 0 1 0" > /proc/scsi/scsi

The kernel reports the device name, says the drive is spinning up, prints the queue and tag information, and finally reports that the drive is ready along with its size and device name.

Restore Partitioning Information

Once the drive is installed, you probably have to apply a partition table. The easiest way to do this is to run the fdisk command against the drive, then write the default partition table back to the disk:

# fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.

The number of cylinders for this disk is set to 19457.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

In any case, you have to restore the partitioning using the sfdisk command and the files you saved at array creation time:

# sfdisk /dev/sdb < /etc/partitions.sdb

This partitions the drive exactly as it was before. You are now ready to recreate the RAID partitions that were on the failed disk.

(One interesting scenario that I ran into was that I inserted two identical disks that fdisk reported as having different geometries. I took the partitioning from the running one and applied it to the second, and the second disk took on the apparent geometry of the first. Note that I did not have to use the force option to apply the partitioning information.)
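
Copying the partitioning from the surviving disk to the replacement, as in the note above, can be sketched with sfdisk -d, which dumps a partition table in a form sfdisk can read back. Whether the saved files mentioned earlier were made this way is an assumption; the text does not say. clone_table and DRY_RUN are hypothetical names for this sketch, and the function only prints the commands unless DRY_RUN=0, because swapping the two disk arguments would overwrite the good disk.

```shell
# clone_table (hypothetical helper): save the partition table of a healthy
# disk to a file and replay it onto a blank replacement.
# By default it only prints the commands; set DRY_RUN=0 to execute them.
clone_table() {
    src=$1    # healthy disk, e.g. /dev/sda
    dst=$2    # blank replacement, e.g. /dev/sdb
    dump=$3   # file to keep the dump in, e.g. /etc/partitions.sda
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "sfdisk -d $src > $dump"
        echo "sfdisk $dst < $dump"
    else
        # Dump first; only touch the replacement if the dump succeeded.
        sfdisk -d "$src" > "$dump" && sfdisk "$dst" < "$dump"
    fi
}

# Example (prints the two commands, changes nothing):
#   clone_table /dev/sda /dev/sdb /etc/partitions.sda
```

Read the printed commands carefully and check which disk is which before running with DRY_RUN=0.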

Running Raid Commands to Recover Partitions

The commands to rebuild or remove RAID partitions are rather simple, but you have to know where to look to see the health of the RAID partitions and which /dev/sda and /dev/sdb partitions are assigned to which md device. That information is in the /proc filesystem, just as it was for SCSI, in the file /proc/mdstat:

# less /proc/mdstat

md1 : active raid1 sda1[1]
      40064 blocks [2/1] [_U]

This lists every md device (the RAID partitions) along with the /dev/sdXX partitions that back each one. Notice the [2/1]: the array expects two members but only one is active. The missing member is the partition that was on the other drive, sdb1. The [1] after sda1 is that partition's device number within the array. Adding the partition back to the array is done with the raidhotadd command:

# raidhotadd /dev/md1 /dev/sdb1

If you don't have raidhotadd, use mdadm instead:

# mdadm /dev/md1 -a /dev/sdb1

Now less /proc/mdstat shows:

md1 : active raid1 sdb1[0] sda1[1]
      40064 blocks [2/2] [UU]

Both members are now listed, with device 0 (sdb1) first. [UU] means both partitions are up, whereas the earlier [_U] meant the first partition was missing. You can queue up all the partition rebuilds at once and they will be processed one by one, in the order you queued them. Arrays still waiting in the queue list both partitions, with the pending one shown like sda5[2]; the 2 denotes a spare waiting to be built.
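
With several md devices it is easy to miss one that is still degraded. A minimal sketch that scans mdstat-style text for any array whose status map contains an underscore, assuming the layout shown above; degraded_arrays is a hypothetical helper name:

```shell
# degraded_arrays (hypothetical helper): read /proc/mdstat-style text on
# stdin and print the name of every md device with a missing member,
# i.e. whose [UU]-style status map contains at least one "_".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }     # remember which array we are in
        /\[[U_]*_[U_]*\]/ && name != "" {
            print name
            name = ""                   # report each array only once
        }
    '
}

# On a real system:
#   degraded_arrays < /proc/mdstat
```

An empty result suggests every array has all of its members; it does not tell you whether a resync is still in progress.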

Once the resync is done, don't forget to go and make your new disk bootable through grub (see Software Raid Quick Howto near the bottom for details).

(>>Source)

Personal Note on Alternative Method ("Hot Swapping")

So far I've done this twice, stopping the failed disk, hot-swapping it, then starting the replacement before going on to rebuilding the mirrors. In the first case, which was a lab bench test, it worked flawlessly; in the second case, which was a live customer machine (of course) the replacement for sdb came up as sdc, which was weird. Rebuilding the mirrors worked, except that we had a three-way mirror (sda, sdb, sdc) with sdb being in degraded mode. As of this writing we have not had the opportunity to reboot this computer so I really don't know what state it will come back up in. (Update: since this was written a reboot has happened, and post-reboot I have a two-armed RAID with sda and sdb both active and happy members.)
