* Kernel panic under 3.2.14 Xen dom0 and SCST trunk
@ 2012-07-24 15:16 Joseph Glanville
  2012-07-24 15:43 ` Joseph Glanville
  2012-07-24 17:53 ` Bart Van Assche
  0 siblings, 2 replies; 14+ messages in thread
From: Joseph Glanville @ 2012-07-24 15:16 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
  Cc: Bart Van Assche

Hi guys,

I have been seeing this kernel panic occur about every 3 days on our staging cluster.
I am not exactly sure what the root cause is; I assume it is a bug in SCST.
The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
HA patches.

The SRP connection settings are still at their defaults; at this stage we are
only using the added ability to delete SRP connections without unloading the driver.

[35404.804901] IP: [<          (null)>]           (null)
[35404.804981] PGD 2ab2b067 PUD 75f5b067 PMD 0
[35404.805064] Oops: 0010 [#1] SMP
[35404.805140] CPU 0
[35404.805149] Modules linked in: tun xen_netback xen_blkback
dm_round_robin ib_srpt(O) scst_vdisk(O) scst(O) bonding dm_multipath
flashcache(O) raid0 raid1 md_mod
[35404.805463]
[35404.805528] Pid: 4585, comm: srpt_mlx4_0-2 Tainted: G           O
3.2.14+ #2 Dell                   PowerEdge C2100       /0P19C9
[35404.805690] RIP: e030:[<0000000000000000>]  [<          (null)>]
       (null)
[35404.805832] RSP: e02b:ffff8800bf42ace0  EFLAGS: 00010046
[35404.805910] RAX: ffff88001ac800c0 RBX: ffff88001ac0c4d0 RCX: ffff88001ac0d600
[35404.805994] RDX: ffff88001ac0dc30 RSI: ffff88001ac800c0 RDI: ffff88001654e900
[35404.806078] RBP: ffff8800bf42adb8 R08: ffff88001654e900 R09: ffff88001ac0d608
[35404.806162] R10: 0000000000000001 R11: ffff88001ac0d5f8 R12: ffff88009c443940
[35404.806263] R13: ffff88001b1a2000 R14: 00000000000004c8 R15: ffff88001ac0c4d0
[35404.806350] FS:  00007f2701406700(0000) GS:ffff8800bf427000(0000)
knlGS:0000000000000000
[35404.806492] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[35404.806571] CR2: 0000000000000000 CR3: 00000000830e0000 CR4: 0000000000002660
[35404.806655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[35404.806740] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[35404.806825] Process srpt_mlx4_0-2 (pid: 4585, threadinfo
ffff8800b50f4000, task ffff880017faeea0)
[35404.806969] Stack:
[35404.807034]  ffffffff8150285e 0000000000000000 ffff88001ac0c998
ffff880000000001
[35404.807183]  ffff88001ac0d608 ffff88001654e900 ffff88001ac0e3f0
ffff88001ac0e3b8
[35404.807332]  ffff88001ac0c528 ffff880068a11600 ffff88001ac0c4e0
ffff88001ac0c4f0
[35404.807480] Call Trace:
[35404.807548]  <IRQ>
[35404.807637]  [<ffffffff8150285e>] ? srp_recv_completion+0x44e/0x650
[35404.807722]  [<ffffffff81009f52>] ? check_events+0x12/0x20
[35404.807803]  [<ffffffff814ea3c2>] mlx4_ib_cq_comp+0x12/0x20
[35404.807883]  [<ffffffff81433beb>] mlx4_cq_completion+0x3b/0x80
[35404.807964]  [<ffffffff81434aa4>] mlx4_eq_int+0x224/0x290
[35404.808043]  [<ffffffff81434b81>] mlx4_interrupt+0x51/0x80
[35404.808125]  [<ffffffff810b72bd>] handle_irq_event_percpu+0x5d/0x210
[35404.808208]  [<ffffffff810b74bc>] handle_irq_event+0x4c/0x80
[35404.808289]  [<ffffffff810ba233>] handle_fasteoi_irq+0x83/0x140
[35404.808371]  [<ffffffff8130f756>] __xen_evtchn_do_upcall+0x1a6/0x260
[35404.808455]  [<ffffffff813114fa>] xen_evtchn_do_upcall+0x2a/0x40
[35404.808538]  [<ffffffff816846fe>] xen_do_hypervisor_callback+0x1e/0x30
[35404.808620]  <EOI>
[35404.808691]  [<ffffffffa006b0e7>] ?
scst_register_virtual_device+0x5d7/0x750 [scst]
[35404.808833]  [<ffffffffa007a473>] ? scst_cmd_init_done+0xb3/0x5a0 [scst]
[35404.808917]  [<ffffffffa00f0bed>] ? 0xffffffffa00f0bec
[35404.809006]  [<ffffffffa0072a47>] ? scst_rx_cmd+0xe7/0xce0 [scst]
[35404.809088]  [<ffffffffa00f2872>] ? 0xffffffffa00f2871
[35404.809166]  [<ffffffffa00f08e3>] ? 0xffffffffa00f08e2
[35404.809245]  [<ffffffffa00f797f>] ? 0xffffffffa00f797e
[35404.810244]  [<ffffffffa00f081f>] ? 0xffffffffa00f081e
[35404.810323]  [<ffffffffa00f7af0>] ? 0xffffffffa00f7aef
[35404.810417]  [<ffffffffa00f873f>] ? 0xffffffffa00f873e
[35404.810495]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
[35404.810573]  [<ffffffffa00f8880>] ? 0xffffffffa00f887f
[35404.810655]  [<ffffffff8167a8d9>] ? _raw_spin_unlock_irqrestore+0x19/0x20
[35404.810739]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
[35404.810818]  [<ffffffff81077246>] ? kthread+0x96/0xa0
[35404.810896]  [<ffffffff816845b4>] ? kernel_thread_helper+0x4/0x10
[35404.810979]  [<ffffffff81682673>] ? int_ret_from_sys_call+0x7/0x1b
[35404.811061]  [<ffffffff8167ab7c>] ? retint_restore_args+0x5/0x6
[35404.811142]  [<ffffffff816845b0>] ? gs_change+0x13/0x13
[35404.811219] Code:  Bad RIP value.
[35404.811297] RIP  [<          (null)>]           (null)
[35404.811377]  RSP <ffff8800bf42ace0>
[35404.811447] CR2: 0000000000000000
[35404.811739] ---[ end trace a002a9122b31526a ]---
[35404.811841] Kernel panic - not syncing: Fatal exception in interrupt
[35404.811950] Pid: 4585, comm: srpt_mlx4_0-2 Tainted: G      D    O 3.2.14+ #2
[35404.812061] Call Trace:
[35404.812155]  <IRQ>  [<ffffffff81677b48>] panic+0x8c/0x19d
[35404.812296]  [<ffffffff81009f52>] ? check_events+0x12/0x20
[35404.812402]  [<ffffffff8167b7fa>] oops_end+0xea/0xf0
[35404.812510]  [<ffffffff8103b5f2>] no_context+0xf2/0x270
[35404.812616]  [<ffffffff8103b895>] __bad_area_nosemaphore+0x125/0x210
[35404.812726]  [<ffffffff8103b98e>] bad_area_nosemaphore+0xe/0x10
[35404.812835]  [<ffffffff8167e135>] do_page_fault+0x335/0x4d0
[35404.812942]  [<ffffffff8100984d>] ? xen_force_evtchn_callback+0xd/0x10
[35404.813052]  [<ffffffff81009f52>] ? check_events+0x12/0x20
[35404.813174]  [<ffffffff8167adf5>] page_fault+0x25/0x30
[35404.813280]  [<ffffffff8150285e>] ? srp_recv_completion+0x44e/0x650
[35404.813390]  [<ffffffff81009f52>] ? check_events+0x12/0x20
[35404.813496]  [<ffffffff814ea3c2>] mlx4_ib_cq_comp+0x12/0x20
[35404.813603]  [<ffffffff81433beb>] mlx4_cq_completion+0x3b/0x80
[35404.813711]  [<ffffffff81434aa4>] mlx4_eq_int+0x224/0x290
[35404.813817]  [<ffffffff81434b81>] mlx4_interrupt+0x51/0x80
[35404.813924]  [<ffffffff810b72bd>] handle_irq_event_percpu+0x5d/0x210
[35404.814034]  [<ffffffff810b74bc>] handle_irq_event+0x4c/0x80
[35404.814141]  [<ffffffff810ba233>] handle_fasteoi_irq+0x83/0x140
[35404.814250]  [<ffffffff8130f756>] __xen_evtchn_do_upcall+0x1a6/0x260
[35404.814360]  [<ffffffff813114fa>] xen_evtchn_do_upcall+0x2a/0x40
[35404.814469]  [<ffffffff816846fe>] xen_do_hypervisor_callback+0x1e/0x30
[35404.814600]  <EOI>  [<ffffffffa006b0e7>] ?
scst_register_virtual_device+0x5d7/0x750 [scst]
[35404.814806]  [<ffffffffa007a473>] ? scst_cmd_init_done+0xb3/0x5a0 [scst]
[35404.814916]  [<ffffffffa00f0bed>] ? 0xffffffffa00f0bec
[35404.815022]  [<ffffffffa0072a47>] ? scst_rx_cmd+0xe7/0xce0 [scst]
[35404.815131]  [<ffffffffa00f2872>] ? 0xffffffffa00f2871
[35404.815236]  [<ffffffffa00f08e3>] ? 0xffffffffa00f08e2
[35404.815341]  [<ffffffffa00f797f>] ? 0xffffffffa00f797e
[35404.815447]  [<ffffffffa00f081f>] ? 0xffffffffa00f081e
[35404.815551]  [<ffffffffa00f7af0>] ? 0xffffffffa00f7aef
[35404.815656]  [<ffffffffa00f873f>] ? 0xffffffffa00f873e
[35404.815761]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
[35404.815869]  [<ffffffffa00f8880>] ? 0xffffffffa00f887f
[35404.815985]  [<ffffffff8167a8d9>] ? _raw_spin_unlock_irqrestore+0x19/0x20
[35404.816096]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
[35404.816201]  [<ffffffff81077246>] ? kthread+0x96/0xa0
[35404.816306]  [<ffffffff816845b4>] ? kernel_thread_helper+0x4/0x10
[35404.816414]  [<ffffffff81682673>] ? int_ret_from_sys_call+0x7/0x1b
[35404.816523]  [<ffffffff8167ab7c>] ? retint_restore_args+0x5/0x6
[35404.816631]  [<ffffffff816845b0>] ? gs_change+0x13/0x13

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
  2012-07-24 15:16 Kernel panic under 3.2.14 Xen dom0 and SCST trunk Joseph Glanville
@ 2012-07-24 15:43 ` Joseph Glanville
       [not found]   ` <CAOzFzEjDHpTROUcKg9cOZkNSX1LnShSombgt26+VOptVdy5i-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2012-07-24 17:53 ` Bart Van Assche
  1 sibling, 1 reply; 14+ messages in thread
From: Joseph Glanville @ 2012-07-24 15:43 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
  Cc: Bart Van Assche

On 25 July 2012 01:16, Joseph Glanville <joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> Hi guys,
>
> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
> HA patches.
>
> The SRP connection settings are still at their defaults; at this stage we are
> only using the added ability to delete SRP connections without unloading the driver.
>
> [35404.804901] IP: [<          (null)>]           (null)
> [35404.804981] PGD 2ab2b067 PUD 75f5b067 PMD 0
> [35404.805064] Oops: 0010 [#1] SMP
> [35404.805140] CPU 0
> [35404.805149] Modules linked in: tun xen_netback xen_blkback
> dm_round_robin ib_srpt(O) scst_vdisk(O) scst(O) bonding dm_multipath
> flashcache(O) raid0 raid1 md_mod
> [35404.805463]
> [35404.805528] Pid: 4585, comm: srpt_mlx4_0-2 Tainted: G           O
> 3.2.14+ #2 Dell                   PowerEdge C2100       /0P19C9
> [35404.805690] RIP: e030:[<0000000000000000>]  [<          (null)>]
>        (null)
> [35404.805832] RSP: e02b:ffff8800bf42ace0  EFLAGS: 00010046
> [35404.805910] RAX: ffff88001ac800c0 RBX: ffff88001ac0c4d0 RCX: ffff88001ac0d600
> [35404.805994] RDX: ffff88001ac0dc30 RSI: ffff88001ac800c0 RDI: ffff88001654e900
> [35404.806078] RBP: ffff8800bf42adb8 R08: ffff88001654e900 R09: ffff88001ac0d608
> [35404.806162] R10: 0000000000000001 R11: ffff88001ac0d5f8 R12: ffff88009c443940
> [35404.806263] R13: ffff88001b1a2000 R14: 00000000000004c8 R15: ffff88001ac0c4d0
> [35404.806350] FS:  00007f2701406700(0000) GS:ffff8800bf427000(0000)
> knlGS:0000000000000000
> [35404.806492] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [35404.806571] CR2: 0000000000000000 CR3: 00000000830e0000 CR4: 0000000000002660
> [35404.806655] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [35404.806740] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [35404.806825] Process srpt_mlx4_0-2 (pid: 4585, threadinfo
> ffff8800b50f4000, task ffff880017faeea0)
> [35404.806969] Stack:
> [35404.807034]  ffffffff8150285e 0000000000000000 ffff88001ac0c998
> ffff880000000001
> [35404.807183]  ffff88001ac0d608 ffff88001654e900 ffff88001ac0e3f0
> ffff88001ac0e3b8
> [35404.807332]  ffff88001ac0c528 ffff880068a11600 ffff88001ac0c4e0
> ffff88001ac0c4f0
> [35404.807480] Call Trace:
> [35404.807548]  <IRQ>
> [35404.807637]  [<ffffffff8150285e>] ? srp_recv_completion+0x44e/0x650
> [35404.807722]  [<ffffffff81009f52>] ? check_events+0x12/0x20
> [35404.807803]  [<ffffffff814ea3c2>] mlx4_ib_cq_comp+0x12/0x20
> [35404.807883]  [<ffffffff81433beb>] mlx4_cq_completion+0x3b/0x80
> [35404.807964]  [<ffffffff81434aa4>] mlx4_eq_int+0x224/0x290
> [35404.808043]  [<ffffffff81434b81>] mlx4_interrupt+0x51/0x80
> [35404.808125]  [<ffffffff810b72bd>] handle_irq_event_percpu+0x5d/0x210
> [35404.808208]  [<ffffffff810b74bc>] handle_irq_event+0x4c/0x80
> [35404.808289]  [<ffffffff810ba233>] handle_fasteoi_irq+0x83/0x140
> [35404.808371]  [<ffffffff8130f756>] __xen_evtchn_do_upcall+0x1a6/0x260
> [35404.808455]  [<ffffffff813114fa>] xen_evtchn_do_upcall+0x2a/0x40
> [35404.808538]  [<ffffffff816846fe>] xen_do_hypervisor_callback+0x1e/0x30
> [35404.808620]  <EOI>
> [35404.808691]  [<ffffffffa006b0e7>] ?
> scst_register_virtual_device+0x5d7/0x750 [scst]
> [35404.808833]  [<ffffffffa007a473>] ? scst_cmd_init_done+0xb3/0x5a0 [scst]
> [35404.808917]  [<ffffffffa00f0bed>] ? 0xffffffffa00f0bec
> [35404.809006]  [<ffffffffa0072a47>] ? scst_rx_cmd+0xe7/0xce0 [scst]
> [35404.809088]  [<ffffffffa00f2872>] ? 0xffffffffa00f2871
> [35404.809166]  [<ffffffffa00f08e3>] ? 0xffffffffa00f08e2
> [35404.809245]  [<ffffffffa00f797f>] ? 0xffffffffa00f797e
> [35404.810244]  [<ffffffffa00f081f>] ? 0xffffffffa00f081e
> [35404.810323]  [<ffffffffa00f7af0>] ? 0xffffffffa00f7aef
> [35404.810417]  [<ffffffffa00f873f>] ? 0xffffffffa00f873e
> [35404.810495]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
> [35404.810573]  [<ffffffffa00f8880>] ? 0xffffffffa00f887f
> [35404.810655]  [<ffffffff8167a8d9>] ? _raw_spin_unlock_irqrestore+0x19/0x20
> [35404.810739]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
> [35404.810818]  [<ffffffff81077246>] ? kthread+0x96/0xa0
> [35404.810896]  [<ffffffff816845b4>] ? kernel_thread_helper+0x4/0x10
> [35404.810979]  [<ffffffff81682673>] ? int_ret_from_sys_call+0x7/0x1b
> [35404.811061]  [<ffffffff8167ab7c>] ? retint_restore_args+0x5/0x6
> [35404.811142]  [<ffffffff816845b0>] ? gs_change+0x13/0x13
> [35404.811219] Code:  Bad RIP value.
> [35404.811297] RIP  [<          (null)>]           (null)
> [35404.811377]  RSP <ffff8800bf42ace0>
> [35404.811447] CR2: 0000000000000000
> [35404.811739] ---[ end trace a002a9122b31526a ]---
> [35404.811841] Kernel panic - not syncing: Fatal exception in interrupt
> [35404.811950] Pid: 4585, comm: srpt_mlx4_0-2 Tainted: G      D    O 3.2.14+ #2
> [35404.812061] Call Trace:
> [35404.812155]  <IRQ>  [<ffffffff81677b48>] panic+0x8c/0x19d
> [35404.812296]  [<ffffffff81009f52>] ? check_events+0x12/0x20
> [35404.812402]  [<ffffffff8167b7fa>] oops_end+0xea/0xf0
> [35404.812510]  [<ffffffff8103b5f2>] no_context+0xf2/0x270
> [35404.812616]  [<ffffffff8103b895>] __bad_area_nosemaphore+0x125/0x210
> [35404.812726]  [<ffffffff8103b98e>] bad_area_nosemaphore+0xe/0x10
> [35404.812835]  [<ffffffff8167e135>] do_page_fault+0x335/0x4d0
> [35404.812942]  [<ffffffff8100984d>] ? xen_force_evtchn_callback+0xd/0x10
> [35404.813052]  [<ffffffff81009f52>] ? check_events+0x12/0x20
> [35404.813174]  [<ffffffff8167adf5>] page_fault+0x25/0x30
> [35404.813280]  [<ffffffff8150285e>] ? srp_recv_completion+0x44e/0x650
> [35404.813390]  [<ffffffff81009f52>] ? check_events+0x12/0x20
> [35404.813496]  [<ffffffff814ea3c2>] mlx4_ib_cq_comp+0x12/0x20
> [35404.813603]  [<ffffffff81433beb>] mlx4_cq_completion+0x3b/0x80
> [35404.813711]  [<ffffffff81434aa4>] mlx4_eq_int+0x224/0x290
> [35404.813817]  [<ffffffff81434b81>] mlx4_interrupt+0x51/0x80
> [35404.813924]  [<ffffffff810b72bd>] handle_irq_event_percpu+0x5d/0x210
> [35404.814034]  [<ffffffff810b74bc>] handle_irq_event+0x4c/0x80
> [35404.814141]  [<ffffffff810ba233>] handle_fasteoi_irq+0x83/0x140
> [35404.814250]  [<ffffffff8130f756>] __xen_evtchn_do_upcall+0x1a6/0x260
> [35404.814360]  [<ffffffff813114fa>] xen_evtchn_do_upcall+0x2a/0x40
> [35404.814469]  [<ffffffff816846fe>] xen_do_hypervisor_callback+0x1e/0x30
> [35404.814600]  <EOI>  [<ffffffffa006b0e7>] ?
> scst_register_virtual_device+0x5d7/0x750 [scst]
> [35404.814806]  [<ffffffffa007a473>] ? scst_cmd_init_done+0xb3/0x5a0 [scst]
> [35404.814916]  [<ffffffffa00f0bed>] ? 0xffffffffa00f0bec
> [35404.815022]  [<ffffffffa0072a47>] ? scst_rx_cmd+0xe7/0xce0 [scst]
> [35404.815131]  [<ffffffffa00f2872>] ? 0xffffffffa00f2871
> [35404.815236]  [<ffffffffa00f08e3>] ? 0xffffffffa00f08e2
> [35404.815341]  [<ffffffffa00f797f>] ? 0xffffffffa00f797e
> [35404.815447]  [<ffffffffa00f081f>] ? 0xffffffffa00f081e
> [35404.815551]  [<ffffffffa00f7af0>] ? 0xffffffffa00f7aef
> [35404.815656]  [<ffffffffa00f873f>] ? 0xffffffffa00f873e
> [35404.815761]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
> [35404.815869]  [<ffffffffa00f8880>] ? 0xffffffffa00f887f
> [35404.815985]  [<ffffffff8167a8d9>] ? _raw_spin_unlock_irqrestore+0x19/0x20
> [35404.816096]  [<ffffffffa00f87a0>] ? 0xffffffffa00f879f
> [35404.816201]  [<ffffffff81077246>] ? kthread+0x96/0xa0
> [35404.816306]  [<ffffffff816845b4>] ? kernel_thread_helper+0x4/0x10
> [35404.816414]  [<ffffffff81682673>] ? int_ret_from_sys_call+0x7/0x1b
> [35404.816523]  [<ffffffff8167ab7c>] ? retint_restore_args+0x5/0x6
> [35404.816631]  [<ffffffff816845b0>] ? gs_change+0x13/0x13
>
> Joseph.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846

Sorry I missed the first line. :(

[35404.804723] BUG: unable to handle kernel NULL pointer dereference
at           (null)

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
  2012-07-24 15:16 Kernel panic under 3.2.14 Xen dom0 and SCST trunk Joseph Glanville
  2012-07-24 15:43 ` Joseph Glanville
@ 2012-07-24 17:53 ` Bart Van Assche
       [not found]   ` <500EE108.2090605-HInyCGIudOg@public.gmane.org>
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2012-07-24 17:53 UTC (permalink / raw)
  To: Joseph Glanville
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 07/24/12 15:16, Joseph Glanville wrote:
> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
> HA patches.

It would help if you could tell us a bit more about your setup. It looks
like SCST is running in dom0, and an IB workload in domU ? If so, which
workload was running in domU ?

Bart.

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]   ` <500EE108.2090605-HInyCGIudOg@public.gmane.org>
@ 2012-07-24 19:50     ` Joseph Glanville
       [not found]       ` <CAOzFzEiiiEsUqLjRM-TFsVZhQyvQi=abX0ufS6obvuZxtWgB-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Joseph Glanville @ 2012-07-24 19:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 25 July 2012 03:53, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
> On 07/24/12 15:16, Joseph Glanville wrote:
>> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
>> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
>> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
>> HA patches.
>
> It would help if you could tell us a bit more about your setup. It looks
> like SCST is running in dom0, and an IB workload in domU ? If so, which
> workload was running in domU ?
>
> Bart.

Hi Bart,

There is no IB workload in the domUs.
In this particular case there are two dom0s connected together, both
acting as SRP targets and initiators.
There are sometimes VMs running on these dom0s, but they aren't
currently in production so they aren't doing very much at the moment.

The workload is typically adding and removing LUNs to/from ini_groups,
then rescanning the host to ensure they are removed cleanly, etc.
As far as I can tell this would have to manifest as a race condition,
as it can go for about two or so weeks without occurring.
Also worth noting: I have a similar setup, also a pvops dom0 using SCST
and ib_srp, running on 2.6.32 with no issues.

Could it be that your patch series introduced the bug? Those are the only
patches we have in our tree that affect SRP.

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]       ` <CAOzFzEiiiEsUqLjRM-TFsVZhQyvQi=abX0ufS6obvuZxtWgB-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-24 19:52         ` Joseph Glanville
  2012-07-24 19:59         ` Bart Van Assche
  1 sibling, 0 replies; 14+ messages in thread
From: Joseph Glanville @ 2012-07-24 19:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 25 July 2012 05:50, Joseph Glanville <joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> On 25 July 2012 03:53, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
>> On 07/24/12 15:16, Joseph Glanville wrote:
>>> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
>>> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
>>> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
>>> HA patches.
>>
>> It would help if you could tell us a bit more about your setup. It looks
>> like SCST is running in dom0, and an IB workload in domU ? If so, which
>> workload was running in domU ?
>>
>> Bart.
>
> Hi Bart,
>
> There is no IB workload in the domUs.
> In this particular case there are two dom0s connected together, both
> acting as SRP targets and initiators.
> There are sometimes VMs running on these dom0s, but they aren't
> currently in production so they aren't doing very much at the moment.
>
> The workload is typically adding and removing LUNs to/from ini_groups,
> then rescanning the host to ensure they are removed cleanly, etc.
> As far as I can tell this would have to manifest as a race condition,
> as it can go for about two or so weeks without occurring.
> Also worth noting: I have a similar setup, also a pvops dom0 using SCST
> and ib_srp, running on 2.6.32 with no issues.

To clarify, because I realize it's ambiguous after sending:
the 2.6.32 cluster doesn't have the SRP HA series applied.

>
> Could it be that your patch series introduced the bug? Those are the only
> patches we have in our tree that affect SRP.
>
> Joseph.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846



-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]       ` <CAOzFzEiiiEsUqLjRM-TFsVZhQyvQi=abX0ufS6obvuZxtWgB-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2012-07-24 19:52         ` Joseph Glanville
@ 2012-07-24 19:59         ` Bart Van Assche
       [not found]           ` <500EFEB3.5020806-HInyCGIudOg@public.gmane.org>
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2012-07-24 19:59 UTC (permalink / raw)
  To: Joseph Glanville
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 07/24/12 19:50, Joseph Glanville wrote:
> On 25 July 2012 03:53, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
>> On 07/24/12 15:16, Joseph Glanville wrote:
>>> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
>>> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
>>> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
>>> HA patches.
>>
>> It would help if you could tell us a bit more about your setup. It looks
>> like SCST is running in dom0, and an IB workload in domU ? If so, which
>> workload was running in domU ?
> 
> There is no IB workload in the domUs.
> In this particular case there are two dom0s connected together, both
> acting as SRP targets and initiators.
> There are sometimes VMs running on these dom0s, but they aren't
> currently in production so they aren't doing very much at the moment.
> 
> The workload is typically adding and removing LUNs to/from ini_groups,
> then rescanning the host to ensure they are removed cleanly, etc.
> As far as I can tell this would have to manifest as a race condition,
> as it can go for about two or so weeks without occurring.
> Also worth noting: I have a similar setup, also a pvops dom0 using SCST
> and ib_srp, running on 2.6.32 with no issues.
> 
> Could it be that your patch series introduced the bug? Those are the only
> patches we have in our tree that affect SRP.

You might be hitting a device removal bug in the SCSI core. It would be
appreciated if you could retest with the srp-ha branch of this kernel
tree: http://github.com/bvanassche/linux. That tree contains Linux
kernel 3.5 + SCSI 3.6-rc1 + latest (yet to be posted) srp-ha patch series.

Bart.

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]           ` <500EFEB3.5020806-HInyCGIudOg@public.gmane.org>
@ 2012-07-24 20:14             ` Joseph Glanville
       [not found]               ` <CAOzFzEi8rnbTyomWEByJL3J_7QnCJSj-yWhMdh8d5mHnBRLVzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2012-07-25  2:30             ` Roland Dreier
  1 sibling, 1 reply; 14+ messages in thread
From: Joseph Glanville @ 2012-07-24 20:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 25 July 2012 05:59, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
> On 07/24/12 19:50, Joseph Glanville wrote:
>> On 25 July 2012 03:53, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
>>> On 07/24/12 15:16, Joseph Glanville wrote:
>>>> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
>>>> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
>>>> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
>>>> HA patches.
>>>
>>> It would help if you could tell us a bit more about your setup. It looks
>>> like SCST is running in dom0, and an IB workload in domU ? If so, which
>>> workload was running in domU ?
>>
>> There is no IB workload in the domUs.
>> In this particular case there are two dom0s connected together, both
>> acting as SRP targets and initiators.
>> There are sometimes VMs running on these dom0s, but they aren't
>> currently in production so they aren't doing very much at the moment.
>>
>> The workload is typically adding and removing LUNs to/from ini_groups,
>> then rescanning the host to ensure they are removed cleanly, etc.
>> As far as I can tell this would have to manifest as a race condition,
>> as it can go for about two or so weeks without occurring.
>> Also worth noting: I have a similar setup, also a pvops dom0 using SCST
>> and ib_srp, running on 2.6.32 with no issues.
>>
>> Could it be that your patch series introduced the bug? Those are the only
>> patches we have in our tree that affect SRP.
>
> You might be hitting a device removal bug in the SCSI core. It would be
> appreciated if you could retest with the srp-ha branch of this kernel
> tree: http://github.com/bvanassche/linux. That tree contains Linux
> kernel 3.5 + SCSI 3.6-rc1 + latest (yet to be posted) srp-ha patch series.
>
> Bart.

Will do.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]               ` <CAOzFzEi8rnbTyomWEByJL3J_7QnCJSj-yWhMdh8d5mHnBRLVzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-25  1:09                 ` Joseph Glanville
  0 siblings, 0 replies; 14+ messages in thread
From: Joseph Glanville @ 2012-07-25  1:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 25 July 2012 06:14, Joseph Glanville <joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> On 25 July 2012 05:59, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
>> On 07/24/12 19:50, Joseph Glanville wrote:
>>> On 25 July 2012 03:53, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
>>>> On 07/24/12 15:16, Joseph Glanville wrote:
>>>>> I have been seeing this kernel panic occur about every 3 days on our staging cluster.
>>>>> I am not exactly sure what the root cause is; I assume it is a bug in SCST.
>>>>> The kernel is 3.2.14 with the Ubuntu patch series applied and Bart's SRP
>>>>> HA patches.
>>>>
>>>> It would help if you could tell us a bit more about your setup. It looks
>>>> like SCST is running in dom0, and an IB workload in domU ? If so, which
>>>> workload was running in domU ?
>>>
>>> There is no IB workload in the domUs.
>>> In this particular case there are two dom0s connected together, both
>>> acting as SRP targets and initiators.
>>> There are sometimes VMs running on these dom0s, but they aren't
>>> currently in production so they aren't doing very much at the moment.
>>>
>>> The workload is typically adding and removing LUNs to/from ini_groups,
>>> then rescanning the host to ensure they are removed cleanly, etc.
>>> As far as I can tell this would have to manifest as a race condition,
>>> as it can go for about two or so weeks without occurring.
>>> Also worth noting: I have a similar setup, also a pvops dom0 using SCST
>>> and ib_srp, running on 2.6.32 with no issues.
>>>
>>> Could it be that your patch series introduced the bug? Those are the only
>>> patches we have in our tree that affect SRP.
>>
>> You might be hitting a device removal bug in the SCSI core. It would be
>> appreciated if you could retest with the srp-ha branch of this kernel
>> tree: http://github.com/bvanassche/linux. That tree contains Linux
>> kernel 3.5 + SCSI 3.6-rc1 + latest (yet to be posted) srp-ha patch series.
>>
>> Bart.
>
> Will do.
>
> --
> CTO | Orion Virtualisation Solutions | www.orionvm.com.au
> Phone: 1300 56 99 52 | Mobile: 0428 754 846

Hi Bart,

I managed to trigger the bug again (a kernel oops on the NULL dereference,
but it didn't panic this time). This is with the SRP HA patches removed.
To trigger it I was merely removing LUNs and rescanning on the initiator
many times per minute for a few hours.

I will pull down the tree you mentioned and try to reproduce.

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]           ` <500EFEB3.5020806-HInyCGIudOg@public.gmane.org>
  2012-07-24 20:14             ` Joseph Glanville
@ 2012-07-25  2:30             ` Roland Dreier
  1 sibling, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2012-07-25  2:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Joseph Glanville, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Tue, Jul 24, 2012 at 12:59 PM, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
> You might be hitting a device removal bug in the SCSI core. It would be
> appreciated if you could retest with the srp-ha branch of this kernel
> tree: http://github.com/bvanassche/linux. That tree contains Linux
> kernel 3.5 + SCSI 3.6-rc1 + latest (yet to be posted) srp-ha patch series.

The original crash is in a process named srpt_mlx4_0-2, so it seems more
likely to be an SCST bug.  In fact the crash doesn't even show ib_srp
loaded at all.

 - R.

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]   ` <CAOzFzEjDHpTROUcKg9cOZkNSX1LnShSombgt26+VOptVdy5i-Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-08-02 11:04     ` Bart Van Assche
       [not found]       ` <501A5EB0.4060904-HInyCGIudOg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2012-08-02 11:04 UTC (permalink / raw)
  To: Joseph Glanville
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 07/24/12 15:43, Joseph Glanville wrote:
> [35404.804723] BUG: unable to handle kernel NULL pointer dereference at (null)

I've been able to reproduce this ib_srp crash. Apparently if an SRP
response is received after srp_reset_host() has been invoked
srp_process_rsp() tries to call scmnd->scsi_done(scmnd) with scsi_done
== NULL, hence the kernel oops. A candidate fix is available in this
(rebased) tree: http://github.com/bvanassche/linux/tree/srp-ha.
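
The shape of the problem, reduced to a tiny standalone C program (every
name below is illustrative -- this is not the ib_srp code, just a model
of the race):

/*
 * The reset path completes the command and clears its completion
 * callback; a late response handler then chases the stale pointer and
 * calls through NULL, which matches the "IP: [<(null)>]" signature in
 * the oops above.  Build with: gcc -o null-done null-done.c
 */
#include <stdio.h>

struct fake_cmnd {
	int result;
	void (*done)(struct fake_cmnd *);	/* models scmnd->scsi_done */
};

struct fake_request {
	struct fake_cmnd *cmnd;			/* models req->scmnd */
};

static void midlayer_done(struct fake_cmnd *cmnd)
{
	printf("command completed, result=%d\n", cmnd->result);
}

/* Models srp_reset_host(): completes the command and tears it down,
 * but the request slot still refers to the now-dead command. */
static void reset_host(struct fake_request *req)
{
	req->cmnd->done(req->cmnd);
	req->cmnd->done = NULL;
}

/* Models srp_process_rsp(): a late SRP response reuses the stale pointer. */
static void process_late_rsp(struct fake_request *req)
{
	struct fake_cmnd *cmnd = req->cmnd;

	cmnd->result = 0;
	cmnd->done(cmnd);		/* done == NULL -> jump to address 0 */
}

int main(void)
{
	struct fake_cmnd cmnd = { .result = 0, .done = midlayer_done };
	struct fake_request req = { .cmnd = &cmnd };

	reset_host(&req);
	process_late_rsp(&req);		/* crashes, like the oops */
	return 0;
}

The idea behind the candidate fix is to make the completion and reset
paths claim the command under the target lock, so whichever side loses
the race sees NULL and drops the work instead of calling through a
stale pointer.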

Bart.


* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]       ` <501A5EB0.4060904-HInyCGIudOg@public.gmane.org>
@ 2012-08-02 15:45         ` Joseph Glanville
  2012-08-02 20:12         ` David Dillow
  1 sibling, 0 replies; 14+ messages in thread
From: Joseph Glanville @ 2012-08-02 15:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 2 August 2012 21:04, Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org> wrote:
> On 07/24/12 15:43, Joseph Glanville wrote:
>> [35404.804723] BUG: unable to handle kernel NULL pointer dereference at (null)
>
> I've been able to reproduce this ib_srp crash. Apparently if an SRP
> response is received after srp_reset_host() has been invoked
> srp_process_rsp() tries to call scmnd->scsi_done(scmnd) with scsi_done
> == NULL, hence the kernel oops. A candidate fix is available in this
> (rebased) tree: http://github.com/bvanassche/linux/tree/srp-ha.
>
> Bart.
>

Nice work. :)

Sorry I haven't had time to do more testing; I will try to get this
branch deployed on something tomorrow.

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]       ` <501A5EB0.4060904-HInyCGIudOg@public.gmane.org>
  2012-08-02 15:45         ` Joseph Glanville
@ 2012-08-02 20:12         ` David Dillow
       [not found]           ` <1343938328.25205.17.camel-zHLflQxYYDO4Hhoo1DtQwJ9G+ZOsUmrO@public.gmane.org>
  1 sibling, 1 reply; 14+ messages in thread
From: David Dillow @ 2012-08-02 20:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Joseph Glanville, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Thu, 2012-08-02 at 11:04 +0000, Bart Van Assche wrote:
> On 07/24/12 15:43, Joseph Glanville wrote:
> > [35404.804723] BUG: unable to handle kernel NULL pointer dereference at (null)
> 
> I've been able to reproduce this ib_srp crash. Apparently if an SRP
> response is received after srp_reset_host() has been invoked
> srp_process_rsp() tries to call scmnd->scsi_done(scmnd) with scsi_done
> == NULL, hence the kernel oops. A candidate fix is available in this
> (rebased) tree: http://github.com/bvanassche/linux/tree/srp-ha.

Hmm, I stopped looking at the thread when I noted the same points Roland
did -- it looked like it was in the target rather than the initiator,
and that ib_srp wasn't loaded (though it could have been built-in).

I think I'm good with your fix, given a few minor changes:
      * rebase it to mainline (I tried it quickly, got conflicts that
        should be simple to resolve)
      * s/srp_remove_req/srp_claim_req/ as it doesn't remove the
        request. This isn't an issue you introduced; it should probably
        have been renamed some time ago.
      * in srp_remove_req(), the test for (scmnd && req->scmnd == scmnd)
        should probably be marked likely()
      * Similarly, the !scmnd test in srp_process_rsp() should be
        unlikely()
      * The reclamation of credits should be moved to srp_free_req(),
        since we could see the case where a credit is available without
        a corresponding request structure.
      * Get rid of the BUG_ON in srp_process_rsp(); in the past, I would
        have probably added it myself, but Andrew Morton called me on
        one I had tried to add, and he was right -- it doesn't add
        anything.
      * I wonder if srp_free_req() is the right name, but I think I'm
        deep in bike-shedding territory here.

It'd be nice if we could avoid taking the lock twice in quick succession
during normal operations, but that's something for later.
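
Putting those points together, the claim helper might end up looking
roughly like the sketch below (against the existing ib_srp structures;
not a tested patch, just the shape the bullets above imply):

/*
 * Atomically "claim" the scsi_cmnd associated with a request slot, so
 * that only one of the normal completion path and the reset path gets
 * to complete it.  Returns NULL if the command was already claimed.
 */
static struct scsi_cmnd *srp_claim_req(struct srp_target_port *target,
				       struct srp_request *req,
				       struct scsi_cmnd *scmnd)
{
	unsigned long flags;

	spin_lock_irqsave(&target->lock, flags);
	if (likely(scmnd ? req->scmnd == scmnd : req->scmnd != NULL)) {
		scmnd = scmnd ? : req->scmnd;
		req->scmnd = NULL;
	} else {
		scmnd = NULL;	/* already claimed, e.g. by srp_reset_host() */
	}
	spin_unlock_irqrestore(&target->lock, flags);

	return scmnd;
}

srp_process_rsp() would then simply drop the response when this returns
NULL instead of calling through a stale scsi_done pointer.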

We should get this into 3.6, and send it to stable as well. I can make
the changes if you'd like, just let me know.

Thanks,

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]           ` <1343938328.25205.17.camel-zHLflQxYYDO4Hhoo1DtQwJ9G+ZOsUmrO@public.gmane.org>
@ 2012-08-02 22:51             ` Joseph Glanville
  2012-08-03 11:12             ` Bart Van Assche
  1 sibling, 0 replies; 14+ messages in thread
From: Joseph Glanville @ 2012-08-02 22:51 UTC (permalink / raw)
  To: David Dillow
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 3 August 2012 06:12, David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> On Thu, 2012-08-02 at 11:04 +0000, Bart Van Assche wrote:
>> On 07/24/12 15:43, Joseph Glanville wrote:
>> > [35404.804723] BUG: unable to handle kernel NULL pointer dereference at (null)
>>
>> I've been able to reproduce this ib_srp crash. Apparently if an SRP
>> response is received after srp_reset_host() has been invoked
>> srp_process_rsp() tries to call scmnd->scsi_done(scmnd) with scsi_done
>> == NULL, hence the kernel oops. A candidate fix is available in this
>> (rebased) tree: http://github.com/bvanassche/linux/tree/srp-ha.
>
> Hmm, I stopped looking at the thread when I noted the same points Roland
> did -- it looked like it was in the target rather than the initiator,
> and that ib_srp wasn't loaded (though it could have been built-in).
>
> I think I'm good with your fix, given a few minor changes:
>       * rebase it to mainline (I tried it quickly, got conflicts that
>         should be simple to resolve)
>       * s/srp_remove_req/srp_claim_req/ as it doesn't remove the
>         request. This isn't an issue you introduced; it should probably
>         have been renamed some time ago.
>       * in srp_remove_req(), the test for (scmnd && req->scmnd == scmnd)
>         should probably be marked likely()
>       * Similarly, the !scmnd test in srp_process_rsp() should be
>         unlikely()
>       * The reclamation of credits should be moved to srp_free_req(),
>         since we could see the case where a credit is available without
>         a corresponding request structure.
>       * Get rid of the BUG_ON in srp_process_rsp(); in the past, I would
>         have probably added it myself, but Andrew Morton called me on
>         one I had tried to add, and he was right -- it doesn't add
>         anything.
>       * I wonder if srp_free_req() is the right name, but I think I'm
>         deep in bike-shedding territory here.
>
> It'd be nice if we could avoid taking the lock twice in quick succession
> during normal operations, but that's something for later.
>
> We should get this into 3.6, and send it to stable as well. I can make
> the changes if you'd like, just let me know.
>
> Thanks,
>
> --
> Dave Dillow
> National Center for Computational Science
> Oak Ridge National Laboratory
> (865) 241-6602 office
>
>

Hi Bart/David.

I have had this deployed for a few hours now and haven't been able to
trigger the crash again.
Disconnecting targets works correctly, as does removing LUNs.

Thanks. :)

Joseph.

-- 
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel panic under 3.2.14 Xen dom0 and SCST trunk
       [not found]           ` <1343938328.25205.17.camel-zHLflQxYYDO4Hhoo1DtQwJ9G+ZOsUmrO@public.gmane.org>
  2012-08-02 22:51             ` Joseph Glanville
@ 2012-08-03 11:12             ` Bart Van Assche
  1 sibling, 0 replies; 14+ messages in thread
From: Bart Van Assche @ 2012-08-03 11:12 UTC (permalink / raw)
  To: David Dillow
  Cc: Joseph Glanville, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	scst-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On 08/02/12 20:12, David Dillow wrote:
> On Thu, 2012-08-02 at 11:04 +0000, Bart Van Assche wrote:
>> On 07/24/12 15:43, Joseph Glanville wrote:
>>> [35404.804723] BUG: unable to handle kernel NULL pointer dereference at (null)
>>
>> I've been able to reproduce this ib_srp crash. Apparently if an SRP
>> response is received after srp_reset_host() has been invoked
>> srp_process_rsp() tries to call scmnd->scsi_done(scmnd) with scsi_done
>> == NULL, hence the kernel oops. A candidate fix is available in this
>> (rebased) tree: http://github.com/bvanassche/linux/tree/srp-ha.
> 
> Hmm, I stopped looking at the thread when I noted the same points Roland
> did -- it looked like it was in the target rather than the initiator,
> and that ib_srp wasn't loaded (though it could have been built-in).
> 
> I think I'm good with your fix, given a few minor changes:
>       * rebase it to mainline (I tried it quickly, got conflicts that
>         should be simple to resolve)
>       * s/srp_remove_req/srp_claim_req/ as it doesn't remove the
>         request. This isn't an issue you introduced; it should probably
>         have been renamed some time ago.
>       * in srp_remove_req(), the test for (scmnd && req->scmnd == scmnd)
>         should probably be marked likely()
>       * Similarly, the !scmnd test in srp_process_rsp() should be
>         unlikely()
>       * The reclamation of credits should be moved to srp_free_req(),
>         since we could see the case where a credit is available without
>         a corresponding request structure.
>       * Get rid of the BUG_ON in srp_process_rsp(); in the past, I would
>         have probably added it myself, but Andrew Morton called me on
>         one I had tried to add, and he was right -- it doesn't add
>         anything.
>       * I wonder if srp_free_req() is the right name, but I think I'm
>         deep in bike-shedding territory here.
> 
> It'd be nice if we could avoid taking the lock twice in quick succession
> during normal operations, but that's something for later.
> 
> We should get this into 3.6, and send it to stable as well. I can make
> the changes if you'd like, just let me know.

Hello Dave,

Thanks for the feedback. I'll update the patch accordingly and move it
to the start of the patch series so that it applies cleanly on the
stable trees.

Bart.

