* Raid1 resync problem with leap seconds ?
From: Arnold Schulz @ 2012-07-06 12:33 UTC
  To: linux-raid

Hi all,

about 8 seconds after inserting the leap second, a running raid1
resync crashed.

Not being able to assess if it is the raid code or some kernel
timer function to blame, I just present the log here.

Regards,
Arnold

--------------------------------------------
Jul  1 01:03:24 ip4-router kernel: md: data-check of RAID array md2
Jul  1 01:03:24 ip4-router kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jul  1 01:03:24 ip4-router kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jul  1 01:03:24 ip4-router kernel: md: using 128k window, over a total of 1924209408k.
Jul  1 01:59:59 ip4-router kernel: Clock: inserting leap second 23:59:60 UTC
Jul  1 02:07:12 ip4-router kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
Jul  1 02:07:12 ip4-router kernel: IP: [<ffffffff81300e18>] sync_request+0x628/0x970
Jul  1 02:07:12 ip4-router kernel: PGD 0
Jul  1 02:07:12 ip4-router kernel: Oops: 0000 [#1] PREEMPT SMP
Jul  1 02:07:12 ip4-router kernel: CPU 1
Jul  1 02:07:12 ip4-router kernel: Modules linked in: parport_pc parport binfmt_misc deflate zlib_deflate zlib_inflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 camellia_generic twofish_common camellia_x86_64 serpent_sse2_x86_64 serpent_generic cryptd lrw blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic sha256_generic sha1_generic crypto_null af_key fuse mt2060 dvb_usb_dib0700 dib3000mc dib8000 dvb_usb dib0070 dib7000m dib7000p dibx000_common dib0090 dvb_core hfcpci mISDN_core
Jul  1 02:07:12 ip4-router kernel:
Jul  1 02:07:12 ip4-router kernel: Pid: 17823, comm: md2_resync Not tainted 3.4.4 #109 To Be Filled By O.E.M. To Be Filled By O.E.M./N68PV-GS
Jul  1 02:07:12 ip4-router kernel: RIP: 0010:[<ffffffff81300e18>]  [<ffffffff81300e18>] sync_request+0x628/0x970
Jul  1 02:07:12 ip4-router kernel: RSP: 0018:ffff8800224e9c30  EFLAGS: 00010202
Jul  1 02:07:12 ip4-router kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000001
Jul  1 02:07:12 ip4-router kernel: RDX: 0000000000000002 RSI: ffff88006d7c4d30 RDI: 0000000000000000
Jul  1 02:07:12 ip4-router kernel: RBP: ffff8800224e9ce0 R08: ffff8800224e8000 R09: 0000000000000001
Jul  1 02:07:12 ip4-router kernel: R10: 000000000000013e R11: 0000000000000000 R12: 0000000000000080
Jul  1 02:07:12 ip4-router kernel: R13: ffff88006b403840 R14: ffff88006c711680 R15: ffffea0000ca7580
Jul  1 02:07:12 ip4-router kernel: FS:  00007f441eafa700(0000) GS:ffff88006fd00000(0000) knlGS:0000000000000000
Jul  1 02:07:12 ip4-router kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul  1 02:07:12 ip4-router kernel: CR2: 0000000000000050 CR3: 000000005c53f000 CR4: 00000000000007e0
Jul  1 02:07:12 ip4-router kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul  1 02:07:12 ip4-router kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul  1 02:07:12 ip4-router kernel: Process md2_resync (pid: 17823, threadinfo ffff8800224e8000, task ffff88006d7c4d30)
Jul  1 02:07:12 ip4-router kernel: Stack:
Jul  1 02:07:12 ip4-router kernel: 00000000e5623600 ffff880000000000 0000000029ca6880 0000008029ca5f00
Jul  1 02:07:12 ip4-router kernel: ffff8800224e9e2c 0000000000000000 0000000200000000 0000000029ca6900
Jul  1 02:07:12 ip4-router kernel: 0000000029ca6900 0000000000000080 0000000000001000 ffff88006b68dc00
Jul  1 02:07:12 ip4-router kernel: Call Trace:
Jul  1 02:07:12 ip4-router kernel: [<ffffffff813163e3>] md_do_sync+0x7d3/0xc60
Jul  1 02:07:12 ip4-router kernel: [<ffffffff8104ac90>] ? abort_exclusive_wait+0xb0/0xb0
Jul  1 02:07:12 ip4-router kernel: [<ffffffff81312f7e>] md_thread+0x10e/0x140
Jul  1 02:07:12 ip4-router kernel: [<ffffffff81312e70>] ? md_register_thread+0x110/0x110
Jul  1 02:07:12 ip4-router kernel: [<ffffffff8104a4ee>] kthread+0x8e/0xa0
Jul  1 02:07:12 ip4-router kernel: [<ffffffff8146b4f4>] kernel_thread_helper+0x4/0x10
Jul  1 02:07:12 ip4-router kernel: [<ffffffff8104a460>] ? kthread_worker_fn+0x130/0x130
Jul  1 02:07:12 ip4-router kernel: [<ffffffff8146b4f0>] ? gs_change+0xb/0xb
Jul  1 02:07:12 ip4-router kernel: Code: 0f 84 35 02 00 00 8b 45 84 41 89 06 41 8b 55 10 48 8b 45 98 8d 0c 12 85 c9 0f 8e 85 fa ff ff 31 db 66 90 48 63 c3 49 8b 7c c6 58 <48> 81 7f 50 10 f4 2f 81 0f 84 b2 01 00 00 8d 04 12 ff c3 39 d8
Jul  1 02:07:12 ip4-router kernel: RIP  [<ffffffff81300e18>] sync_request+0x628/0x970
Jul  1 02:07:12 ip4-router kernel: RSP <ffff8800224e9c30>
Jul  1 02:07:12 ip4-router kernel: CR2: 0000000000000050
Jul  1 02:07:12 ip4-router kernel: ---[ end trace 79aec5e8bd378abc ]---


* Re: Raid1 resync problem with leap seconds ?
From: NeilBrown @ 2012-07-09  1:09 UTC
  To: Arnold Schulz; +Cc: linux-raid

On Fri, 06 Jul 2012 14:33:47 +0200 Arnold Schulz <arnysch@gmx.net> wrote:

> Hi all,
> 
> about 8 seconds after inserting the leap second, a running raid1
> resync crashed.

Thanks for the report.

I think you mean "8 minutes" (though it was really 7 minutes and 12 seconds).

Also it was a 'data-check' rather than a 'resync' :-)

It is extremely unlikely that the two are related.

There appears to be a use-after-free bug in the data-check code which you
have managed to hit.  It has been there since 2006 (2.6.16) when data-check was
added to raid1, and you are the first known victim.  Well done!

I'll submit a patch shortly.

> 
> Not being able to assess if it is the raid code or some kernel
> timer function to blame, I just present the log here.

Thanks for providing the complete log.  It was very helpful.

NeilBrown





* Re: Raid1 resync problem with leap seconds ?
From: NeilBrown @ 2012-07-09  1:35 UTC
  To: Arnold Schulz; +Cc: linux-raid

On Mon, 9 Jul 2012 11:09:56 +1000 NeilBrown <neilb@suse.de> wrote:

> On Fri, 06 Jul 2012 14:33:47 +0200 Arnold Schulz <arnysch@gmx.net> wrote:
> 
> > Hi all,
> > 
> > about 8 seconds after inserting the leap second, a running raid1
> > resync crashed.
> 
> Thanks for the report.
> 
> I think you mean "8 minutes" (though it was really 7 minutes and 12 seconds).
> 
> Also it was a 'data-check' rather than a 'resync' :-)
> 
> It is extremely unlikely that the two are related.
> 
> There appears to be a use-after-free bug in the data-check code which you
> have managed to hit.  It has been there since 2006 (2.6.16) when data-check was
> added to raid1, and you are the first known victim.  Well done!
> 
> I'll submit a patch shortly.
> 

Below is the patch I'll be submitting, once it has been in -next for a day
or two.

Thanks,
NeilBrown


From 2d4f4f3384d4ef4f7c571448e803a1ce721113d5 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 9 Jul 2012 11:34:13 +1000
Subject: [PATCH] md/raid1: fix use-after-free bug in RAID1 data-check code.

This bug has been present ever since data-check was introduced
in 2.6.16.  However it would only fire if a data-check were
done on a degraded array, which is only possible if the array
has 3 or more devices.  This is certainly possible, but is quite
uncommon.

Since hot-replace was added in 3.3 it can happen more often as
the same condition can arise if not all possible replacements are
present.

The problem is that as soon as we submit the last read request, the
'r1_bio' structure could be freed at any time, so we really should
stop looking at it.  If the last device is being read from we will
stop looking at it.  However if the last device is not due to be read
from, we will still check the bio pointer in the r1_bio, but the
r1_bio might already be free.

So use the read_targets counter to make sure we stop looking for bios
to submit as soon as we have submitted them all.

This fix is suitable for any -stable kernel since 2.6.16.

Cc: stable@vger.kernel.org
Reported-by: Arnold Schulz <arnysch@gmx.net>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 8c2754f..240ff31 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2485,9 +2485,10 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
 	 */
 	if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
 		atomic_set(&r1_bio->remaining, read_targets);
-		for (i = 0; i < conf->raid_disks * 2; i++) {
+		for (i = 0; i < conf->raid_disks * 2 && read_targets; i++) {
 			bio = r1_bio->bios[i];
 			if (bio->bi_end_io == end_sync_read) {
+				read_targets--;
 				md_sync_acct(bio->bi_bdev, nr_sectors);
 				generic_make_request(bio);
 			}
