From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neil Brown
Subject: Re: Requesting replace mode for changing a disk
Date: Wed, 13 May 2009 21:02:05 +1000
Message-ID: <18954.43181.558444.360139@notabene.brown>
References: <20090513012112681.IEFQ19662@cdptpa-omta02.mail.rr.com>
	<87zldhvg0v.fsf@frosties.localdomain>
	<18954.20061.109627.591832@notabene.brown>
	<874ovpihcj.fsf@frosties.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: message from Goswin von Brederlow on Wednesday May 13
Sender: linux-raid-owner@vger.kernel.org
To: Goswin von Brederlow
Cc: lrhorer@satx.rr.com, 'Linux RAID'
List-Id: linux-raid.ids

On Wednesday May 13, goswin-v-b@web.de wrote:
> Neil Brown writes:
>
> > On Wednesday May 13, goswin-v-b@web.de wrote:
> >> > OK, basically the same question. How does one disassemble the RAID1 array
> >> > without wiping the data on the new drive?
> >>
> >> I think he meant this:
> >>
> >> mdadm --stop /dev/md0
> >> mdadm --build /dev/md9 --chunk=64k --level=1 --raid-devices=2 /dev/suspect /dev/new
> >> mdadm --assemble /dev/md0 /dev/md9 /dev/other ...
> >
> > or better still:
> >
> > mdadm --grow /dev/md0 --bitmap internal
> > mdadm /dev/md0 --fail /dev/suspect --remove /dev/suspect
> > mdadm --build /dev/md9 --level 1 --raid-devices 2 /dev/suspect missing
> > mdadm /dev/md0 --add /dev/md9
> > mdadm /dev/md9 --add /dev/new
> >
> > no down time at all. The bitmap ensures that /dev/md9 will be
> > recovered almost immediately once it is added back in to the array.
>
> I keep forgetting bitmaps. :)
>
> > The one problem with this approach is that if there is a read error on
> > /dev/suspect while data is being copied to /dev/new, you lose.
> >
> > Hence the requested functionality which I do hope to implement for
> > raid456 and raid10 (it adds no value to raid1).
> > Maybe by the end of this year... it is on the roadmap.
> >
> > NeilBrown
>
> What about raid0? You can't use your bitmap trick there.

I seriously had not considered raid0 for this functionality at all.
I guess I assume that people who use raid0 directly on normal drives
don't really value their data, so if a device starts failing they will
just give up the data as lost (i.e. they use the raid0 as a cache for
something).

Maybe I need to come up with a way to atomically swap a device in any
array....  maybe.

I actually would really like to provide this hot-replace functionality
without explicitly implementing it for each level.

The first part of that is to implement support for maintaining a
bad-block list.  This is a per-device list that identifies sectors
that should fail when read.  Then, if you resync a raid1 and get a
read failure, you don't have to reject the whole drive; you just
record the bad block (on both drives) and move on.

Then we can use "swap the drive for a raid1" to mostly implement
hot-replace.  Once the recovery finishes, mdadm can check the
bad-block list and trigger a resync in the top-level array for just
those sectors.  That causes the bad blocks to be over-written by good
data from the top level, which removes them from the list.  Once the
list is empty (for the new drive), we swap out the raid1, put the new
drive back in, and all is happy.

To be able to use this as a real solution, I think we want that
atomic-swap function.  Using the bitmap trick is OK, but not ideal
(and, as you say, it doesn't work on raid0).
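Pulled together, the whole procedure might end up looking something
like the sketch below.  Only the mdadm commands and the existing
sync_min/sync_max/sync_action knobs are real today; the per-device
bad_blocks file is hypothetical (it assumes the proposed bad-block
list gets exported through sysfs at a path like that), and the sector
numbers are purely illustrative.

  # Replace /dev/suspect in /dev/md0 with /dev/new via a temporary raid1.
  mdadm --grow /dev/md0 --bitmap internal
  mdadm /dev/md0 --fail /dev/suspect --remove /dev/suspect
  mdadm --build /dev/md9 --level 1 --raid-devices 2 /dev/suspect missing
  mdadm /dev/md0 --add /dev/md9        # bitmap makes this recovery near-instant
  mdadm /dev/md9 --add /dev/new        # copy suspect -> new inside the raid1

  # After the raid1 recovery finishes, any sectors that could not be read
  # from /dev/suspect would sit on the bad-block list for /dev/new.
  cat /sys/block/md9/md/dev-new/bad_blocks    # hypothetical sysfs file

  # Re-write just those sectors from good data in the rest of /dev/md0
  # (check/repair honour sync_min/sync_max; the range here is made up).
  echo 1000000 > /sys/block/md0/md/sync_min
  echo 1001024 > /sys/block/md0/md/sync_max
  echo repair > /sys/block/md0/md/sync_action

  # Once the bad-block list for /dev/new is empty, swap the raid1 back out;
  # the bitmap covers any writes that land during the swap.
  mdadm /dev/md0 --fail /dev/md9 --remove /dev/md9
  mdadm --stop /dev/md9
  mdadm /dev/md0 --re-add /dev/new     # or --add; see the metadata question below

Everything except reading the bad-block list and the final swap can be
done with existing tools; that last step is exactly where an
atomic-swap primitive would help.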
My other unresolved issue with this approach is correct handling of
the metadata.  If we crash in the middle of a hot-recovery, I want to
be sure that the new drive isn't mistakenly assumed to be fully
recovered.  When the metadata is at the end of the device, that should
"just work", but when it is at the start it becomes more awkward.
This is probably solvable, but I haven't solved it yet.

NeilBrown