From: NeilBrown
Subject: Re: raid5 (re)-add recovery data corruption
Date: Mon, 30 Jun 2014 13:40:45 +1000
Message-ID: <20140630134045.601cd33d@notabene.brown>
References: <53A518BB.60709@sbcglobal.net> <20140623113641.79965998@notabene.brown> <53AF5304.7020401@sbcglobal.net> <20140630132335.4361445e@notabene.brown>
In-Reply-To: <20140630132335.4361445e@notabene.brown>
To: Bill
Cc: linux-raid

On Mon, 30 Jun 2014 13:23:35 +1000 NeilBrown wrote:

> On Sat, 28 Jun 2014 18:43:00 -0500 Bill wrote:
> 
> > On 06/22/2014 08:36 PM, NeilBrown wrote:
> > > On Sat, 21 Jun 2014 00:31:39 -0500 Bill wrote:
> > >
> > >> Hi Neil,
> > >>
> > >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> > >> I have this array:
> > >>
> > >>   md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> > >>         16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> > >>         bitmap: 0/1 pages [0KB], 2048KB chunk
> > >>
> > >> with an xfs filesystem on it:
> > >>   /dev/md5 on /hdtv/data5 type xfs
> > >>   (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> > >>
> > >> and I do this in a loop:
> > >>
> > >>   1. start writing 1/4 GB files to the filesystem
> > >>   2. fail a disk. wait a bit
> > >>   3. remove it. wait a bit
> > >>   4. add the disk back into the array
> > >>   5. wait for the array to sync and the file writes to finish
> > >>   6. checksum the files
> > >>   7. wait a bit and do it all again
> > >>
> > >> The checksum QC will eventually fail, usually after a few hours.
> > >>
> > >> My last test failed after 4 hours:
> > >>
> > >>   18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> > >>   18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> > >>   18:52:06 - start writing 3 files
> > >>   18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> > >>   18:52:18 - array recovery done
> > >>   18:52:23 - writes finished. QC failed for one of the three files.
> > >>
> > >> dmesg shows no errors and the disks are operating normally.
> > >>
> > >> If I "check" /dev/md5 it shows mismatch_cnt = 896.
> > >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> > >> that sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> > >> previous file.
> > >>
> > >> If I fail sdc1, --zero-superblock it, and add it back, it then syncs
> > >> and the QC is correct.
> > >>
> > >> So somehow it seems like md is losing track of some changes which need
> > >> to be written to sdc1 in the recovery. But rarely - in this case it
> > >> failed after 175 cycles.
> > >>
> > >> Do you have any idea what could be happening here?
> > > No.  As you say, it looks like md is not setting a bit in the bitmap
> > > correctly, or ignoring one that is set, or maybe clearing one that
> > > shouldn't be cleared.
> > > The last is most likely, I would guess.
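
(As an aside for anyone trying to reproduce this: the cycle described above
can be driven by a small script along the following lines. This is only a
rough sketch of the reported procedure - the reference file, the sleep
times, and the way recovery completion is detected are illustrative, not
taken from the report; the device names and mount point are the ones quoted
above.)

#!/bin/sh
# Rough sketch of the fail / remove / re-add cycle described above.
MD=/dev/md5
DISK=/dev/sdc1
MNT=/hdtv/data5
REF=/root/reference.dat      # hypothetical ~256MB file of known data

cycle=0
while true; do
	cycle=$((cycle + 1))

	# 1. start writing 1/4 GB files to the filesystem (in background)
	for i in 1 2 3; do
		cp "$REF" "$MNT/qc-$cycle.$i" &
	done

	# 2./3. fail the disk, wait a bit, then remove it
	mdadm $MD -f $DISK; sleep 10
	mdadm $MD -r $DISK; sleep 10

	# 4. add it back; the bitmap should keep the recovery short
	mdadm $MD -a $DISK

	# 5. wait for the recovery and the writes to finish
	while grep -q recovery /proc/mdstat; do sleep 1; done
	wait

	# 6. checksum (compare) the files against the reference
	for i in 1 2 3; do
		cmp "$REF" "$MNT/qc-$cycle.$i" || echo "QC FAILED: cycle $cycle file $i"
	done

	# 7. wait a bit and do it all again
	sleep 30
done
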
> > 
> > Neil,
> > 
> > I'm still digging through this but I found something that might help
> > narrow it down - the bitmap stays dirty after the re-add and recovery
> > is complete:
> > 
> >           Filename : /dev/sde1
> >              Magic : 6d746962
> >            Version : 4
> >               UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> >             Events : 5259
> >     Events Cleared : 5259
> >              State : OK
> >          Chunksize : 2 MB
> >             Daemon : 5s flush period
> >         Write Mode : Normal
> >          Sync Size : 4194304 (4.00 GiB 4.29 GB)
> >             Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
> >                                          ^^^^^^^^^^^^^^
> > 
> > This is after 1/2 hour idle.  sde1 was the one removed / re-added, but
> > all five disks show the same bitmap info, and the event count matches
> > that of the array (5259).  At this point the QC check fails.
> > 
> > Then I manually failed, removed and re-added /dev/sde1, and shortly the
> > array synced the dirty chunks:
> > 
> >           Filename : /dev/sde1
> >              Magic : 6d746962
> >            Version : 4
> >               UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> >             Events : 5275
> >     Events Cleared : 5259
> >              State : OK
> >          Chunksize : 2 MB
> >             Daemon : 5s flush period
> >         Write Mode : Normal
> >          Sync Size : 4194304 (4.00 GiB 4.29 GB)
> >             Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
> >                                          ^^^^^^^^^^^^^^
> > 
> > Now the QC check succeeds and an array "check" shows no mismatches.
> > 
> > So it seems like md is ignoring a set bit in the bitmap, which then gets
> > noticed with the fail / remove / re-add sequence.
> 
> Thanks, that helps a lot ... maybe.
> 
> I have a theory.  This patch explains it and should fix it.
> I'm not sure this is the patch I will go with if it works, but it will
> help confirm my theory.
> Can you test it?
> 
> thanks,
> NeilBrown
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 34846856dbc6..27387a3740c8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
>  			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
>  			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> +			/* If there is a bitmap, we need to make sure
> +			 * all writes that started before we added a spare
> +			 * complete before we start doing a recovery.
> +			 * Otherwise the write might complete and set
> +			 * a bit in the bitmap after the recovery has
> +			 * checked that bit and skipped that region.
> +			 */
> +			mddev->pers->quiesce(mddev, 1);
> +			mddev->pers->quiesce(mddev, 0);
>  		} else if (mddev->recovery_cp < MaxSector) {
>  			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> 

Don't even bother trying that - it will definitely deadlock.
Please try this instead.

NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..f8cd0bd83402 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7501,6 +7501,16 @@ void md_do_sync(struct md_thread *thread)
 			    rdev->recovery_offset < j)
 				j = rdev->recovery_offset;
 		rcu_read_unlock();
+
+		/* If there is a bitmap, we need to make sure
+		 * all writes that started before we added a spare
+		 * complete before we start doing a recovery.
+		 * Otherwise the write might complete and set
+		 * a bit in the bitmap after the recovery has
+		 * checked that bit and skipped that region.
+		 */
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
 	}
 
 	printk(KERN_INFO "md: %s of RAID array %s\n", desc, mdname(mddev));
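
For what it's worth, an easy way to tell whether a given cycle has hit the
problem, without waiting for the QC checksums, is to look at the bitmap and
run a scrub once the array has gone idle - something like the following
(the device names are the ones from the report above; the idle wait and the
grep are illustrative, and mdadm's --examine-bitmap output format can vary
between versions):

# After the recovery has finished and the writes have completed:

# A healthy run should drop back to 0 dirty chunks once the array has
# been idle for a while; a count that stays non-zero matches the bug.
mdadm --examine-bitmap /dev/sde1 | grep Bitmap

# A subsequent scrub should then report no mismatches.
echo check > /sys/block/md5/md/sync_action
# ... wait for the check to finish (watch /proc/mdstat) ...
cat /sys/block/md5/md/mismatch_cnt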