All of lore.kernel.org
* [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
@ 2014-01-19 22:00 Ian Kumlien
  2014-01-19 23:21 ` Richard Weinberger
  2014-01-20  0:38 ` NeilBrown
  0 siblings, 2 replies; 10+ messages in thread
From: Ian Kumlien @ 2014-01-19 22:00 UTC (permalink / raw)
  To: linux-kernel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 9768 bytes --]

OK, so this is my third try at actually emailing this...
---

Hi,

I started testing 3.13-rc8 on another machine since the first one seemed
to be working fine...

One spontaneous reboot later, I'm not so sure ;)

Right now I've captured what seems to be a kernel oops in the RAID code...

(Also attached to avoid mangling)

[33411.934672] ------------[ cut here ]------------
[33411.934685] kernel BUG at drivers/md/raid5.c:291!
[33411.934690] invalid opcode: 0000 [#1] PREEMPT SMP 
[33411.934696] Modules linked in: bonding btrfs microcode
[33411.934705] CPU: 4 PID: 2319 Comm: md2_raid6 Not tainted 3.13.0-rc8 #83
[33411.934709] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
[33411.934716] task: ffff880326265880 ti: ffff880320472000 task.ti: ffff880320472000
[33411.934720] RIP: 0010:[<ffffffff81a3a5be>]  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
[33411.934731] RSP: 0018:ffff880320473d28  EFLAGS: 00010087
[33411.934735] RAX: ffff8802f0875a60 RBX: 0000000000000001 RCX: ffff8800b0d816b0
[33411.934739] RDX: ffff880324eeee98 RSI: ffff8802f0875a40 RDI: ffff880324eeec00
[33411.934743] RBP: ffff8802f0875a50 R08: 0000000000000000 R09: 0000000000000001
[33411.934747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880324eeec00
[33411.934752] R13: ffff880324eeee58 R14: ffff880320473e88 R15: 0000000000000000
[33411.934756] FS:  00007fc38654d700(0000) GS:ffff880337d00000(0000) knlGS:0000000000000000
[33411.934761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[33411.934765] CR2: 00007f0cb28bd000 CR3: 00000002ebcf6000 CR4: 00000000000407e0
[33411.934769] Stack:
[33411.934771]  ffff8800bba09690 ffff8800b4f16588 ffff880303005a40 0000000000000001
[33411.934779]  ffff8800b33e43d0 ffffffff81a3a62d ffff880324eeee58 0000000000000000
[33411.934786]  ffff880324eeee58 ffff880326660670 ffff880326265880 ffffffff81a41692
[33411.934794] Call Trace:
[33411.934798]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
[33411.934803]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
[33411.934808]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
[33411.934814]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
[33411.934818]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
[33411.934823]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
[33411.934828]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
[33411.934833]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.934839]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
[33411.934843]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.934847] Code: f7 ff ff 66 90 48 8b 43 18 48 8b b8 48 01 00 00 48 89 14 24 48 89 74 24 08 e8 af 9a 02 00 48 8b 74 24 08 48 8b 14 24 eb 9f 0f 0b <0f> 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f0 ff 4e 
[33411.934912] RIP  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
[33411.934918]  RSP <ffff880320473d28>
[33411.941326] ---[ end trace 42d97d618cc5bfe2 ]---
[33411.941331] ------------[ cut here ]------------
[33411.941337] WARNING: CPU: 4 PID: 2319 at kernel/exit.c:703 do_exit+0x45/0xa40()
[33411.941351] Modules linked in: bonding btrfs microcode
[33411.941377] CPU: 4 PID: 2319 Comm: md2_raid6 Tainted: G      D      3.13.0-rc8 #83
[33411.941395] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
[33411.941417]  0000000000000000 0000000000000009 ffffffff81d4eeb8 0000000000000000
[33411.941449]  ffffffff810eba01 000000000000000b ffff880320473c78 0000000000000096
[33411.941480]  ffff880324eeee58 ffff880320473e88 ffffffff810ed7d5 ffff880324eeee58
[33411.941512] Call Trace:
[33411.941519]  [<ffffffff81d4eeb8>] ? dump_stack+0x4a/0x75
[33411.941530]  [<ffffffff810eba01>] ? warn_slowpath_common+0x81/0xb0
[33411.941544]  [<ffffffff810ed7d5>] ? do_exit+0x45/0xa40
[33411.941557]  [<ffffffff81d4c0d4>] ? printk+0x4f/0x54
[33411.941568]  [<ffffffff810463ed>] ? oops_end+0x8d/0xd0
[33411.941579]  [<ffffffff810431d2>] ? do_invalid_op+0xa2/0x100
[33411.941592]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
[33411.941606]  [<ffffffff81d596a8>] ? invalid_op+0x18/0x20
[33411.941617]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
[33411.941633]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
[33411.941647]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
[33411.941658]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
[33411.941670]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
[33411.941683]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
[33411.941695]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
[33411.941708]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
[33411.941722]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.941736]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
[33411.941749]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.941762] ---[ end trace 42d97d618cc5bfe3 ]---
[33411.941773] note: md2_raid6[2319] exited with preempt_count 1

[-- Attachment #2: kernel-oops-3.13-rc8-raid.txt --]
[-- Type: text/plain, Size: 4611 bytes --]

[33411.934672] ------------[ cut here ]------------
[33411.934685] kernel BUG at drivers/md/raid5.c:291!
[33411.934690] invalid opcode: 0000 [#1] PREEMPT SMP 
[33411.934696] Modules linked in: bonding btrfs microcode
[33411.934705] CPU: 4 PID: 2319 Comm: md2_raid6 Not tainted 3.13.0-rc8 #83
[33411.934709] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
[33411.934716] task: ffff880326265880 ti: ffff880320472000 task.ti: ffff880320472000
[33411.934720] RIP: 0010:[<ffffffff81a3a5be>]  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
[33411.934731] RSP: 0018:ffff880320473d28  EFLAGS: 00010087
[33411.934735] RAX: ffff8802f0875a60 RBX: 0000000000000001 RCX: ffff8800b0d816b0
[33411.934739] RDX: ffff880324eeee98 RSI: ffff8802f0875a40 RDI: ffff880324eeec00
[33411.934743] RBP: ffff8802f0875a50 R08: 0000000000000000 R09: 0000000000000001
[33411.934747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880324eeec00
[33411.934752] R13: ffff880324eeee58 R14: ffff880320473e88 R15: 0000000000000000
[33411.934756] FS:  00007fc38654d700(0000) GS:ffff880337d00000(0000) knlGS:0000000000000000
[33411.934761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[33411.934765] CR2: 00007f0cb28bd000 CR3: 00000002ebcf6000 CR4: 00000000000407e0
[33411.934769] Stack:
[33411.934771]  ffff8800bba09690 ffff8800b4f16588 ffff880303005a40 0000000000000001
[33411.934779]  ffff8800b33e43d0 ffffffff81a3a62d ffff880324eeee58 0000000000000000
[33411.934786]  ffff880324eeee58 ffff880326660670 ffff880326265880 ffffffff81a41692
[33411.934794] Call Trace:
[33411.934798]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
[33411.934803]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
[33411.934808]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
[33411.934814]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
[33411.934818]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
[33411.934823]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
[33411.934828]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
[33411.934833]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.934839]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
[33411.934843]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.934847] Code: f7 ff ff 66 90 48 8b 43 18 48 8b b8 48 01 00 00 48 89 14 24 48 89 74 24 08 e8 af 9a 02 00 48 8b 74 24 08 48 8b 14 24 eb 9f 0f 0b <0f> 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f0 ff 4e 
[33411.934912] RIP  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
[33411.934918]  RSP <ffff880320473d28>
[33411.941326] ---[ end trace 42d97d618cc5bfe2 ]---
[33411.941331] ------------[ cut here ]------------
[33411.941337] WARNING: CPU: 4 PID: 2319 at kernel/exit.c:703 do_exit+0x45/0xa40()
[33411.941351] Modules linked in: bonding btrfs microcode
[33411.941377] CPU: 4 PID: 2319 Comm: md2_raid6 Tainted: G      D      3.13.0-rc8 #83
[33411.941395] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
[33411.941417]  0000000000000000 0000000000000009 ffffffff81d4eeb8 0000000000000000
[33411.941449]  ffffffff810eba01 000000000000000b ffff880320473c78 0000000000000096
[33411.941480]  ffff880324eeee58 ffff880320473e88 ffffffff810ed7d5 ffff880324eeee58
[33411.941512] Call Trace:
[33411.941519]  [<ffffffff81d4eeb8>] ? dump_stack+0x4a/0x75
[33411.941530]  [<ffffffff810eba01>] ? warn_slowpath_common+0x81/0xb0
[33411.941544]  [<ffffffff810ed7d5>] ? do_exit+0x45/0xa40
[33411.941557]  [<ffffffff81d4c0d4>] ? printk+0x4f/0x54
[33411.941568]  [<ffffffff810463ed>] ? oops_end+0x8d/0xd0
[33411.941579]  [<ffffffff810431d2>] ? do_invalid_op+0xa2/0x100
[33411.941592]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
[33411.941606]  [<ffffffff81d596a8>] ? invalid_op+0x18/0x20
[33411.941617]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
[33411.941633]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
[33411.941647]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
[33411.941658]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
[33411.941670]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
[33411.941683]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
[33411.941695]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
[33411.941708]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
[33411.941722]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.941736]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
[33411.941749]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
[33411.941762] ---[ end trace 42d97d618cc5bfe3 ]---
[33411.941773] note: md2_raid6[2319] exited with preempt_count 1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-19 22:00 [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8 Ian Kumlien
@ 2014-01-19 23:21 ` Richard Weinberger
  2014-01-20  0:38 ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: Richard Weinberger @ 2014-01-19 23:21 UTC (permalink / raw)
  To: Ian Kumlien, NeilBrown; +Cc: linux-kernel, linux-raid

On Sun, Jan 19, 2014 at 11:00 PM, Ian Kumlien <ian.kumlien@gmail.com> wrote:
> Ok, so third try to actually email this...

Let's CC Neil too.

> ---
>
> Hi,
>
> I started testing 3.13-rc8 on another machine since the first one seemed
> to be working fine...
>
> One spontaneous reboot later i'm not so sure ;)
>
> Right now i captured a kernel oops in the raid code it seems...
>
> (Also attached to avoid mangling)
>
> [33411.934672] ------------[ cut here ]------------
> [33411.934685] kernel BUG at drivers/md/raid5.c:291!
> [33411.934690] invalid opcode: 0000 [#1] PREEMPT SMP
> [33411.934696] Modules linked in: bonding btrfs microcode
> [33411.934705] CPU: 4 PID: 2319 Comm: md2_raid6 Not tainted 3.13.0-rc8 #83
> [33411.934709] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
> [33411.934716] task: ffff880326265880 ti: ffff880320472000 task.ti: ffff880320472000
> [33411.934720] RIP: 0010:[<ffffffff81a3a5be>]  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
> [33411.934731] RSP: 0018:ffff880320473d28  EFLAGS: 00010087
> [33411.934735] RAX: ffff8802f0875a60 RBX: 0000000000000001 RCX: ffff8800b0d816b0
> [33411.934739] RDX: ffff880324eeee98 RSI: ffff8802f0875a40 RDI: ffff880324eeec00
> [33411.934743] RBP: ffff8802f0875a50 R08: 0000000000000000 R09: 0000000000000001
> [33411.934747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880324eeec00
> [33411.934752] R13: ffff880324eeee58 R14: ffff880320473e88 R15: 0000000000000000
> [33411.934756] FS:  00007fc38654d700(0000) GS:ffff880337d00000(0000) knlGS:0000000000000000
> [33411.934761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [33411.934765] CR2: 00007f0cb28bd000 CR3: 00000002ebcf6000 CR4: 00000000000407e0
> [33411.934769] Stack:
> [33411.934771]  ffff8800bba09690 ffff8800b4f16588 ffff880303005a40 0000000000000001
> [33411.934779]  ffff8800b33e43d0 ffffffff81a3a62d ffff880324eeee58 0000000000000000
> [33411.934786]  ffff880324eeee58 ffff880326660670 ffff880326265880 ffffffff81a41692
> [33411.934794] Call Trace:
> [33411.934798]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
> [33411.934803]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
> [33411.934808]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
> [33411.934814]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
> [33411.934818]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
> [33411.934823]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
> [33411.934828]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
> [33411.934833]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
> [33411.934839]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
> [33411.934843]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
> [33411.934847] Code: f7 ff ff 66 90 48 8b 43 18 48 8b b8 48 01 00 00 48 89 14 24 48 89 74 24 08 e8 af 9a 02 00 48 8b 74 24 08 48 8b 14 24 eb 9f 0f 0b <0f> 0b 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f0 ff 4e
> [33411.934912] RIP  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
> [33411.934918]  RSP <ffff880320473d28>
> [33411.941326] ---[ end trace 42d97d618cc5bfe2 ]---
> [33411.941331] ------------[ cut here ]------------
> [33411.941337] WARNING: CPU: 4 PID: 2319 at kernel/exit.c:703 do_exit+0x45/0xa40()
> [33411.941351] Modules linked in: bonding btrfs microcode
> [33411.941377] CPU: 4 PID: 2319 Comm: md2_raid6 Tainted: G      D      3.13.0-rc8 #83
> [33411.941395] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
> [33411.941417]  0000000000000000 0000000000000009 ffffffff81d4eeb8 0000000000000000
> [33411.941449]  ffffffff810eba01 000000000000000b ffff880320473c78 0000000000000096
> [33411.941480]  ffff880324eeee58 ffff880320473e88 ffffffff810ed7d5 ffff880324eeee58
> [33411.941512] Call Trace:
> [33411.941519]  [<ffffffff81d4eeb8>] ? dump_stack+0x4a/0x75
> [33411.941530]  [<ffffffff810eba01>] ? warn_slowpath_common+0x81/0xb0
> [33411.941544]  [<ffffffff810ed7d5>] ? do_exit+0x45/0xa40
> [33411.941557]  [<ffffffff81d4c0d4>] ? printk+0x4f/0x54
> [33411.941568]  [<ffffffff810463ed>] ? oops_end+0x8d/0xd0
> [33411.941579]  [<ffffffff810431d2>] ? do_invalid_op+0xa2/0x100
> [33411.941592]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
> [33411.941606]  [<ffffffff81d596a8>] ? invalid_op+0x18/0x20
> [33411.941617]  [<ffffffff81a3a5be>] ? do_release_stripe+0x18e/0x1a0
> [33411.941633]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
> [33411.941647]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
> [33411.941658]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
> [33411.941670]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
> [33411.941683]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
> [33411.941695]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
> [33411.941708]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
> [33411.941722]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
> [33411.941736]  [<ffffffff81d5857c>] ? ret_from_fork+0x7c/0xb0
> [33411.941749]  [<ffffffff811094d0>] ? flush_kthread_worker+0x80/0x80
> [33411.941762] ---[ end trace 42d97d618cc5bfe3 ]---
> [33411.941773] note: md2_raid6[2319] exited with preempt_count 1



-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-19 22:00 [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8 Ian Kumlien
  2014-01-19 23:21 ` Richard Weinberger
@ 2014-01-20  0:38 ` NeilBrown
  2014-01-20  0:49   ` Ian Kumlien
  1 sibling, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-01-20  0:38 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: linux-kernel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 2852 bytes --]

On Sun, 19 Jan 2014 23:00:23 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:

> Ok, so third try to actually email this... 
> ---
> 
> Hi,
> 
> I started testing 3.13-rc8 on another machine since the first one seemed
> to be working fine...
> 
> One spontaneous reboot later i'm not so sure ;)
> 
> Right now i captured a kernel oops in the raid code it seems...
> 
> (Also attached to avoid mangling)
> 
> [33411.934672] ------------[ cut here ]------------
> [33411.934685] kernel BUG at drivers/md/raid5.c:291!
> [33411.934690] invalid opcode: 0000 [#1] PREEMPT SMP 
> [33411.934696] Modules linked in: bonding btrfs microcode
> [33411.934705] CPU: 4 PID: 2319 Comm: md2_raid6 Not tainted 3.13.0-rc8 #83
> [33411.934709] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
> [33411.934716] task: ffff880326265880 ti: ffff880320472000 task.ti: ffff880320472000
> [33411.934720] RIP: 0010:[<ffffffff81a3a5be>]  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
> [33411.934731] RSP: 0018:ffff880320473d28  EFLAGS: 00010087
> [33411.934735] RAX: ffff8802f0875a60 RBX: 0000000000000001 RCX: ffff8800b0d816b0
> [33411.934739] RDX: ffff880324eeee98 RSI: ffff8802f0875a40 RDI: ffff880324eeec00
> [33411.934743] RBP: ffff8802f0875a50 R08: 0000000000000000 R09: 0000000000000001
> [33411.934747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880324eeec00
> [33411.934752] R13: ffff880324eeee58 R14: ffff880320473e88 R15: 0000000000000000
> [33411.934756] FS:  00007fc38654d700(0000) GS:ffff880337d00000(0000) knlGS:0000000000000000
> [33411.934761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [33411.934765] CR2: 00007f0cb28bd000 CR3: 00000002ebcf6000 CR4: 00000000000407e0
> [33411.934769] Stack:
> [33411.934771]  ffff8800bba09690 ffff8800b4f16588 ffff880303005a40 0000000000000001
> [33411.934779]  ffff8800b33e43d0 ffffffff81a3a62d ffff880324eeee58 0000000000000000
> [33411.934786]  ffff880324eeee58 ffff880326660670 ffff880326265880 ffffffff81a41692
> [33411.934794] Call Trace:
> [33411.934798]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
> [33411.934803]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
> [33411.934808]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
> [33411.934814]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
> [33411.934818]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
> [33411.934823]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
> [33411.934828]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70

Thanks for the report.
Can you provide any more context about the details of the array in question?
I see it was RAID6.  Was it degraded?  Was it resyncing?  Was it being
reshaped?
Was there any way in which it was different from the array on the machine where
it seemed to work?
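These array-state questions can usually be answered from md's sysfs attributes. A minimal sketch, with the values hard-coded for illustration (on a live system they would come from the `cat` commands shown in the comments; the array name md2 is taken from the report below):

```shell
# Illustrative md array-state check via sysfs (paths assume /dev/md2).
# On a live system:  sync_action=$(cat /sys/block/md2/md/sync_action)
#                    degraded=$(cat /sys/block/md2/md/degraded)
sync_action=idle   # "resync", "recover", "reshape", "check" or "repair" when busy
degraded=0         # number of missing member devices
if [ "$sync_action" = "idle" ] && [ "$degraded" -eq 0 ]; then
    echo "clean: not degraded, no resync or reshape running"
else
    echo "busy or degraded: action=$sync_action, missing=$degraded"
fi
```

`mdadm --detail /dev/md2` reports the same information in one place.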

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-20  0:38 ` NeilBrown
@ 2014-01-20  0:49   ` Ian Kumlien
  2014-01-20  3:37     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Kumlien @ 2014-01-20  0:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel, linux-raid

On Mon, 2014-01-20 at 11:38 +1100, NeilBrown wrote:
> On Sun, 19 Jan 2014 23:00:23 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:
> 
> > Ok, so third try to actually email this... 
> > ---
> > 
> > Hi,
> > 
> > I started testing 3.13-rc8 on another machine since the first one seemed
> > to be working fine...
> > 
> > One spontaneous reboot later i'm not so sure ;)
> > 
> > Right now i captured a kernel oops in the raid code it seems...
> > 
> > (Also attached to avoid mangling)
> > 
> > [33411.934672] ------------[ cut here ]------------
> > [33411.934685] kernel BUG at drivers/md/raid5.c:291!
> > [33411.934690] invalid opcode: 0000 [#1] PREEMPT SMP 
> > [33411.934696] Modules linked in: bonding btrfs microcode
> > [33411.934705] CPU: 4 PID: 2319 Comm: md2_raid6 Not tainted 3.13.0-rc8 #83
> > [33411.934709] Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029    10/09/2012
> > [33411.934716] task: ffff880326265880 ti: ffff880320472000 task.ti: ffff880320472000
> > [33411.934720] RIP: 0010:[<ffffffff81a3a5be>]  [<ffffffff81a3a5be>] do_release_stripe+0x18e/0x1a0
> > [33411.934731] RSP: 0018:ffff880320473d28  EFLAGS: 00010087
> > [33411.934735] RAX: ffff8802f0875a60 RBX: 0000000000000001 RCX: ffff8800b0d816b0
> > [33411.934739] RDX: ffff880324eeee98 RSI: ffff8802f0875a40 RDI: ffff880324eeec00
> > [33411.934743] RBP: ffff8802f0875a50 R08: 0000000000000000 R09: 0000000000000001
> > [33411.934747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880324eeec00
> > [33411.934752] R13: ffff880324eeee58 R14: ffff880320473e88 R15: 0000000000000000
> > [33411.934756] FS:  00007fc38654d700(0000) GS:ffff880337d00000(0000) knlGS:0000000000000000
> > [33411.934761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [33411.934765] CR2: 00007f0cb28bd000 CR3: 00000002ebcf6000 CR4: 00000000000407e0
> > [33411.934769] Stack:
> > [33411.934771]  ffff8800bba09690 ffff8800b4f16588 ffff880303005a40 0000000000000001
> > [33411.934779]  ffff8800b33e43d0 ffffffff81a3a62d ffff880324eeee58 0000000000000000
> > [33411.934786]  ffff880324eeee58 ffff880326660670 ffff880326265880 ffffffff81a41692
> > [33411.934794] Call Trace:
> > [33411.934798]  [<ffffffff81a3a62d>] ? release_stripe_list+0x4d/0x70
> > [33411.934803]  [<ffffffff81a41692>] ? raid5d+0xa2/0x4d0
> > [33411.934808]  [<ffffffff81a65ed6>] ? md_thread+0xe6/0x120
> > [33411.934814]  [<ffffffff81122060>] ? finish_wait+0x90/0x90
> > [33411.934818]  [<ffffffff81a65df0>] ? md_rdev_init+0x100/0x100
> > [33411.934823]  [<ffffffff8110958c>] ? kthread+0xbc/0xe0
> > [33411.934828]  [<ffffffff81110000>] ? smpboot_park_threads+0x70/0x70
> 
> Thanks for the report.
> Can you provide any more context about the details of the array in question?
> I see it was RAID6.  Was it degraded?  Was it resyncing?  Was it being
> reshaped?
> Was there any way that it was different from the array on the machine where
> it seemed to work?

Yes, it's a raid6 and no, there is no reshaping or syncing going on... 

Basically everything worked fine before:
reboot   system boot  3.13.0-rc8       Sun Jan 19 21:47 - 01:42  (03:55)    
reboot   system boot  3.13.0-rc8       Sun Jan 19 21:38 - 01:42  (04:04)    
reboot   system boot  3.13.0-rc8       Sun Jan 19 12:13 - 01:42  (13:29)    
reboot   system boot  3.13.0-rc8       Sat Jan 18 21:23 - 01:42 (1+04:19)   
reboot   system boot  3.12.6           Mon Dec 30 16:27 - 22:21 (19+05:53)  

As in, no problems before the 3.13.0-rc8 upgrade...

cat /proc/mdstat:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
md2 : active raid6 sdf1[2] sdd1[9] sdj1[8] sdg1[4] sde1[5] sdi1[11] sdc1[0] sdh1[10]
      11721074304 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk
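
[Editor's note: a quick sanity check of the mdstat output above, using my own
arithmetic rather than anything stated in the thread. RAID6 stores two
syndromes, so usable space is (n - 2) members' worth:]

```shell
# Illustrative arithmetic only: RAID6 usable space = (n - 2) members.
# Values below are taken from the /proc/mdstat output above.
total_kib=11721074304   # array size reported by /proc/mdstat
nr_disks=8              # sdc1..sdj1
per_disk_kib=$(( total_kib / (nr_disks - 2) ))
echo "$(( per_disk_kib / 1024 / 1024 )) GiB per member"
# prints "1863 GiB per member" -- i.e. ~2 TB drives, which is consistent
```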

What i do do is:
echo 32768 > /sys/block/*/md/stripe_cache_size

Which has caused no problems during intense write operations before... 
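
[Editor's note: for scale, a rough footprint estimate of that setting, using
the memory formula from the md documentation (stripe_cache_size pins about one
page per stripe per member device); the numbers are this array's, the estimate
is mine, not a figure from the thread:]

```shell
# Approximate stripe-cache memory: stripe_cache_size * PAGE_SIZE * nr_disks
# (formula per the md documentation; values from the array in this report).
stripe_cache_size=32768
page_size=4096
nr_disks=8
echo "$(( stripe_cache_size * page_size * nr_disks / 1024 / 1024 )) MiB"
# prints "1024 MiB" -- a full gigabyte pinned for the stripe cache
```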

I find it quite surprising since it only takes ~3 gigabytes of writes
to trigger the crash, so i almost assume it's related to the
stripe_cache_size. (All the memory is ECC, and i doubt it would break
quite literally overnight, so i haven't run extensive memory tests.)

I don't quite know what other information you might need...

> Thanks,
> NeilBrown


* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-20  0:49   ` Ian Kumlien
@ 2014-01-20  3:37     ` NeilBrown
  2014-01-20  9:22       ` Ian Kumlien
  2014-01-20 18:27         ` Ian Kumlien
  0 siblings, 2 replies; 10+ messages in thread
From: NeilBrown @ 2014-01-20  3:37 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: linux-kernel, linux-raid


On Mon, 20 Jan 2014 01:49:17 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:

> On mån, 2014-01-20 at 11:38 +1100, NeilBrown wrote:
> > On Sun, 19 Jan 2014 23:00:23 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > 
> > > Ok, so third try to actually email this... 
> > > ---
> > > 
> > > Hi,
> > > 
> > > I started testing 3.13-rc8 on another machine since the first one seemed
> > > to be working fine...
> > > 
> > > One spontaneous reboot later i'm not so sure ;)
> > > 
> > > Right now i captured a kernel oops in the raid code it seems...
> > > 
> > > (Also attached to avoid mangling)
> > > 
> > > [oops trace snipped; see the original report above]
> > 
> > Hi,
> > 
> > Thanks for the report.
> > Can you provide any more context about the details of the array in question?
> > I see it was RAID6.  Was it degraded?  Was it resyncing?  Was it being
> > reshaped?
> > Was there any way that it was different from the array one the machine where
> > it seemed to work?
> 
> Yes, it's a raid6 and no, there is no reshaping or syncing going on... 
> 
> Basically everything worked fine before:
> reboot   system boot  3.13.0-rc8       Sun Jan 19 21:47 - 01:42  (03:55)    
> reboot   system boot  3.13.0-rc8       Sun Jan 19 21:38 - 01:42  (04:04)    
> reboot   system boot  3.13.0-rc8       Sun Jan 19 12:13 - 01:42  (13:29)    
> reboot   system boot  3.13.0-rc8       Sat Jan 18 21:23 - 01:42 (1+04:19)   
> reboot   system boot  3.12.6           Mon Dec 30 16:27 - 22:21 (19+05:53)  
> 
> As in, no problems before the 3.13.0-rc8 upgrade...
> 
> cat /proc/mdstat:
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
> md2 : active raid6 sdf1[2] sdd1[9] sdj1[8] sdg1[4] sde1[5] sdi1[11] sdc1[0] sdh1[10]
>       11721074304 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
>       bitmap: 0/15 pages [0KB], 65536KB chunk
> 
> What i do do is:
> echo 32768 > /sys/block/*/md/stripe_cache_size
> 
> Which has caused no problems during intense write operations before... 
> 
> I find it quite surprising since it only takes ~3 gigabytes of writes
> to trigger the crash, so i almost assume it's related to the
> stripe_cache_size. (All the memory is ECC, and i doubt it would break
> quite literally overnight, so i haven't run extensive memory tests.)
> 
> I don't quite know what other information you might need...

Thanks - that extra info is quite useful.  Knowing that nothing else unusual
is happening can be quite valuable (and I don't like to assume).

I haven't found anything that would clearly cause your crash, but I have
found something that looks wrong and conceivably could.

Could you please try this patch on top of what you are currently using?  By
the look of it you get a crash at least every day, often more often.  So if
this produces a day with no crashes, that would be promising.

The important aspect of the patch is that it moves the "atomic_inc" of
"sh->count" back under the protection of ->device_lock in the case when some
other thread might be using the same 'sh'.

Thanks,
NeilBrown


diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3088d3af5a89..03f82ab87d9e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -675,8 +675,10 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 					 || !conf->inactive_blocked),
 					*(conf->hash_locks + hash));
 				conf->inactive_blocked = 0;
-			} else
+			} else {
 				init_stripe(sh, sector, previous);
+				atomic_inc(&sh->count);
+			}
 		} else {
 			spin_lock(&conf->device_lock);
 			if (atomic_read(&sh->count)) {
@@ -695,13 +697,11 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 					sh->group = NULL;
 				}
 			}
+			atomic_inc(&sh->count);
 			spin_unlock(&conf->device_lock);
 		}
 	} while (sh == NULL);
 
-	if (sh)
-		atomic_inc(&sh->count);
-
 	spin_unlock_irq(conf->hash_locks + hash);
 	return sh;
 }



* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-20  3:37     ` NeilBrown
@ 2014-01-20  9:22       ` Ian Kumlien
  2014-01-20 18:27         ` Ian Kumlien
  1 sibling, 0 replies; 10+ messages in thread
From: Ian Kumlien @ 2014-01-20  9:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel, linux-raid

On mån, 2014-01-20 at 14:37 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2014 01:49:17 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:
> 
> > On mån, 2014-01-20 at 11:38 +1100, NeilBrown wrote:
> > > On Sun, 19 Jan 2014 23:00:23 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > > 
> > > > Ok, so third try to actually email this... 
> > > > ---
> > > > 
> > > > Hi,
> > > > 
> > > > I started testing 3.13-rc8 on another machine since the first one seemed
> > > > to be working fine...
> > > > 
> > > > One spontaneous reboot later i'm not so sure ;)
> > > > 
> > > > Right now i captured a kernel oops in the raid code it seems...
> > > > 
> > > > (Also attached to avoid mangling)
> > > > 
> > > > [oops trace snipped; see the original report above]
> > > 
> > > Hi,
> > > 
> > > Thanks for the report.
> > > Can you provide any more context about the details of the array in question?
> > > I see it was RAID6.  Was it degraded?  Was it resyncing?  Was it being
> > > reshaped?
> > > Was there any way that it was different from the array on the machine where
> > > it seemed to work?
> > 
> > Yes, it's a raid6 and no, there is no reshaping or syncing going on... 
> > 
> > Basically everything worked fine before:
> > reboot   system boot  3.13.0-rc8       Sun Jan 19 21:47 - 01:42  (03:55)    
> > reboot   system boot  3.13.0-rc8       Sun Jan 19 21:38 - 01:42  (04:04)    
> > reboot   system boot  3.13.0-rc8       Sun Jan 19 12:13 - 01:42  (13:29)    
> > reboot   system boot  3.13.0-rc8       Sat Jan 18 21:23 - 01:42 (1+04:19)   
> > reboot   system boot  3.12.6           Mon Dec 30 16:27 - 22:21 (19+05:53)  
> > 
> > As in, no problems before the 3.13.0-rc8 upgrade...
> > 
> > cat /proc/mdstat:
> > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
> > md2 : active raid6 sdf1[2] sdd1[9] sdj1[8] sdg1[4] sde1[5] sdi1[11] sdc1[0] sdh1[10]
> >       11721074304 blocks super 1.2 level 6, 64k chunk, algorithm 2 [8/8] [UUUUUUUU]
> >       bitmap: 0/15 pages [0KB], 65536KB chunk
> > 
> > What i do do is:
> > echo 32768 > /sys/block/*/md/stripe_cache_size
> > 
> > Which has caused no problems during intense write operations before... 
> > 
> > I find it quite surprising since it only takes ~3 gigabytes of writes
> > to trigger the crash, so i almost assume it's related to the
> > stripe_cache_size. (All the memory is ECC, and i doubt it would break
> > quite literally overnight, so i haven't run extensive memory tests.)
> > 
> > I don't quite know what other information you might need...
> 
> Thanks - that extra info is quite useful.  Knowing that nothing else unusual
> is happening can be quite valuable (and I don't like to assume).

Yeah, i know, it can be hard to know which information to provide though
=)

> I haven't found anything that would clearly cause your crash, but I have
> found something that looks wrong and conceivably could.
> 
> Could you please try this patch on top of what you are currently using?  By
> the look of it you get a crash at least every day, often more often.  So if
> this produces a day with no crashes, that would be promising.

I haven't been able to crash it yet; it looks like we've found our
culprit =)

> The important aspect of the patch is that it moves the "atomic_inc" of
> "sh->count" back under the protection of ->device_lock in the case when some
> other thread might be using the same 'sh'.
> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3088d3af5a89..03f82ab87d9e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -675,8 +675,10 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
>  					 || !conf->inactive_blocked),
>  					*(conf->hash_locks + hash));
>  				conf->inactive_blocked = 0;
> -			} else
> +			} else {
>  				init_stripe(sh, sector, previous);
> +				atomic_inc(&sh->count);
> +			}
>  		} else {
>  			spin_lock(&conf->device_lock);
>  			if (atomic_read(&sh->count)) {
> @@ -695,13 +697,11 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
>  					sh->group = NULL;
>  				}
>  			}
> +			atomic_inc(&sh->count);
>  			spin_unlock(&conf->device_lock);
>  		}
>  	} while (sh == NULL);
>  
> -	if (sh)
> -		atomic_inc(&sh->count);
> -
>  	spin_unlock_irq(conf->hash_locks + hash);
>  	return sh;
>  }


* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-20  3:37     ` NeilBrown
@ 2014-01-20 18:27         ` Ian Kumlien
  2014-01-20 18:27         ` Ian Kumlien
  1 sibling, 0 replies; 10+ messages in thread
From: Ian Kumlien @ 2014-01-20 18:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel, linux-raid

On mån, 2014-01-20 at 14:37 +1100, NeilBrown wrote:
> 
> Thanks - that extra info is quite useful.  Knowing that nothing else unusual
> is happening can be quite valuable (and I don't like to assume).
> 
> I haven't found anything that would clearly cause your crash, but I have
> found something that looks wrong and conceivably could.
> 
> Could you please try this patch on top of what you are currently using?  By
> the look of it you get a crash at least every day, often more often.  So if
> this produces a day with no crashes, that would be promising.
> 
> The important aspect of the patch is that it moves the "atomic_inc" of
> "sh->count" back under the protection of ->device_lock in the case when some
> other thread might be using the same 'sh'.

I have been unable to trip this up, so this was it!

Tested-by: Ian Kumlien <ian.kumlien@gmail.com>

I hope this hits stable ASAP ;)

> Thanks,
> NeilBrown
> 
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3088d3af5a89..03f82ab87d9e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -675,8 +675,10 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
>  					 || !conf->inactive_blocked),
>  					*(conf->hash_locks + hash));
>  				conf->inactive_blocked = 0;
> -			} else
> +			} else {
>  				init_stripe(sh, sector, previous);
> +				atomic_inc(&sh->count);
> +			}
>  		} else {
>  			spin_lock(&conf->device_lock);
>  			if (atomic_read(&sh->count)) {
> @@ -695,13 +697,11 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
>  					sh->group = NULL;
>  				}
>  			}
> +			atomic_inc(&sh->count);
>  			spin_unlock(&conf->device_lock);
>  		}
>  	} while (sh == NULL);
>  
> -	if (sh)
> -		atomic_inc(&sh->count);
> -
>  	spin_unlock_irq(conf->hash_locks + hash);
>  	return sh;
>  }





* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-20 18:27         ` Ian Kumlien
  (?)
@ 2014-01-22  0:52         ` NeilBrown
  2014-01-23  0:00           ` Ian Kumlien
  -1 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-01-22  0:52 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: linux-kernel, linux-raid


On Mon, 20 Jan 2014 19:27:18 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:

> On mån, 2014-01-20 at 14:37 +1100, NeilBrown wrote:
> > 
> > Thanks - that extra info is quite useful.  Knowing that nothing else unusual
> > is happening can be quite valuable (and I don't like to assume).
> > 
> > I haven't found anything that would clearly cause your crash, but I have
> > found something that looks wrong and conceivably could.
> > 
> > Could you please try this patch on top of what you are currently using?  By
> > the look of it you get a crash at least every day, often more often.  So if
> > this produces a day with no crashes, that would be promising.
> > 
> > The important aspect of the patch is that it moves the "atomic_inc" of
> > "sh->count" back under the protection of ->device_lock in the case when some
> > other thread might be using the same 'sh'.
> 
> I have been unable to trip this up, so this was it!
> 
> Tested-by: Ian Kumlien <ian.kumlien@gmail.com>
> 
> I hope this hits stable ASAP ;)

I've pushed it out to my for-next branch.
I'll probably send a pull request to Linus tomorrow.
It has some chance of getting into a -stable branch next week (though I'm not
really sure of the schedule).

Thanks again for testing and reporting!

NeilBrown



* Re: [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8
  2014-01-22  0:52         ` NeilBrown
@ 2014-01-23  0:00           ` Ian Kumlien
  0 siblings, 0 replies; 10+ messages in thread
From: Ian Kumlien @ 2014-01-23  0:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel, linux-raid

On ons, 2014-01-22 at 11:52 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2014 19:27:18 +0100 Ian Kumlien <ian.kumlien@gmail.com> wrote:
> > I have been unable to trip this up, so this was it!
> > 
> > Tested-by: Ian Kumlien <ian.kumlien@gmail.com>
> > 
> > I hope this hits stable ASAP ;)
> 
> I've pushed it out to my for-next branch.
> I'll probably send a pull request to Linus tomorrow.
> It has some chance of getting into a -stable branch next week (though I'm not
> really sure of the schedule).

Good, I bet I'm not the only one that will hit this...

> Thanks again for testing and reporting!

No problem, and thank you! ;)

> NeilBrown


end of thread, other threads:[~2014-01-23  0:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-19 22:00 [BUG] at drivers/md/raid5.c:291! kernel 3.13-rc8 Ian Kumlien
2014-01-19 23:21 ` Richard Weinberger
2014-01-20  0:38 ` NeilBrown
2014-01-20  0:49   ` Ian Kumlien
2014-01-20  3:37     ` NeilBrown
2014-01-20  9:22       ` Ian Kumlien
2014-01-20 18:27       ` Ian Kumlien
2014-01-20 18:27         ` Ian Kumlien
2014-01-22  0:52         ` NeilBrown
2014-01-23  0:00           ` Ian Kumlien
