[Cluster-devel] GPF in dlm_lowcomms_stop

* [Cluster-devel] GPF in dlm_lowcomms_stop
@ 2012-03-22  1:59 dann frazier
  2012-03-30 15:42 ` David Teigland
  0 siblings, 1 reply; 6+ messages in thread
From: dann frazier @ 2012-03-22  1:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

I observed a GPF in dlm_lowcomms_stop() in a crash dump from a
user. I'll include a traceback at the bottom. I have a theory as to
what is going on, but could use some help proving it.

 1 void dlm_lowcomms_stop(void)
 2 {
 3 /* Set all the flags to prevent any
 4 socket activity.
 5 */
 6 mutex_lock(&connections_lock);
 7 foreach_conn(stop_conn);
 8 mutex_unlock(&connections_lock);
 9
10 work_stop();
11
12 mutex_lock(&connections_lock);
13 clean_writequeues();
14
15 foreach_conn(free_conn);
16
17 mutex_unlock(&connections_lock);
18 kmem_cache_destroy(con_cache);
19 }

lines 6-8 should stop any traffic coming in on existing connections,
and the connections_lock prevents any new connections from being
added.

line 10 shutsdown the work queues which process incoming data on
existing connections. Our connections are supposed to be dead, so no
new work should be added to these queues.

However... we've dropped the connections_lock, so its possible that a
new connection gets created on line 9. This connection structure would
have pointers to the workqueues that we're about to destroy. Sometime
later on we get data on this new connection, we try to add work to the
now-obliterated workqueues, and things blow up.

This is a 2.6.32 kernel, but this code looks to be about the
same in 3.3. Normally I'd like to instrument the code to widen the
race window and see if I can trigger it, but I'm pretty clueless about
the stack here. Is there an easy way to mimic this activity from
userspace?

[80061.292085] Call Trace:
[80061.312762]  <IRQ>
[80061.332554]  [<ffffffff810387b9>] default_spin_lock_flags+0x9/0x10
[80061.356963]  [<ffffffff8155c44f>] _spin_lock_irqsave+0x2f/0x40
[80061.380717]  [<ffffffff81080504>] __queue_work+0x24/0x50
[80061.403591]  [<ffffffff81080632>] queue_work_on+0x42/0x60
[80061.426314]  [<ffffffff810807ff>] queue_work+0x1f/0x30
[80061.448516]  [<ffffffffa027326b>] lowcomms_data_ready+0x3b/0x40 [dlm]
[80061.472149]  [<ffffffff814ba318>] tcp_data_queue+0x308/0xaf0
[80061.494912]  [<ffffffff814bd021>] tcp_rcv_established+0x371/0x730
[80061.517903]  [<ffffffff814c4bb3>] tcp_v4_do_rcv+0xf3/0x160
[80061.539872]  [<ffffffff814c6305>] tcp_v4_rcv+0x5b5/0x7e0
[80061.561284]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.583714]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.604714]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.626449]  [<ffffffff814a41ed>] ip_local_deliver_finish+0xdd/0x2d0
[80061.647767]  [<ffffffff814a4470>] ip_local_deliver+0x90/0xa0
[80061.667743]  [<ffffffff814a392d>] ip_rcv_finish+0x12d/0x440
[80061.687278]  [<ffffffff814a3eb5>] ip_rcv+0x275/0x360
[80061.705783]  [<ffffffff8147450a>] netif_receive_skb+0x38a/0x5d0
[80061.725345]  [<ffffffffa0257178>] br_handle_frame_finish+0x198/0x1a0 [bridge]
[80061.746206]  [<ffffffffa025cc78>] br_nf_pre_routing_finish+0x238/0x350 [bridge]
[80061.779534]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.798437]  [<ffffffffa025ca40>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
[80061.831598]  [<ffffffffa025d15a>] br_nf_pre_routing+0x3ca/0x468 [bridge]
[80061.851561]  [<ffffffff81561765>] ? do_IRQ+0x75/0xf0
[80061.869287]  [<ffffffff812b3a63>] ? cpumask_next_and+0x23/0x40
[80061.887717]  [<ffffffff8149bacc>] nf_iterate+0x6c/0xb0
[80061.905491]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.925829]  [<ffffffff8149bb84>] nf_hook_slow+0x74/0x100
[80061.944215]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.964735]  [<ffffffff8146b216>] ? __netdev_alloc_skb+0x36/0x60
[80061.984128]  [<ffffffffa025730c>] br_handle_frame+0x18c/0x250 [bridge]
[80062.004294]  [<ffffffff81537138>] ? vlan_hwaccel_do_receive+0x38/0xf0
[80062.024707]  [<ffffffff8147437a>] netif_receive_skb+0x1fa/0x5d0
[80062.044500]  [<ffffffff81474920>] napi_skb_finish+0x50/0x70
[80062.063911]  [<ffffffff81537794>] vlan_gro_receive+0x84/0x90
[80062.083512]  [<ffffffffa0015fd9>] igb_clean_rx_irq_adv+0x259/0x6d0 [igb]
[80062.104516]  [<ffffffffa00164c2>] igb_poll+0x72/0x380 [igb]
[80062.124363]  [<ffffffffa0168236>] ? ioat2_issue_pending+0x86/0xa0 [ioatdma]
[80062.145937]  [<ffffffff8147500f>] net_rx_action+0x10f/0x250
[80062.165915]  [<ffffffff8109331e>] ? tick_handle_oneshot_broadcast+0x10e/0x120
[80062.187812]  [<ffffffff8106d9e7>] __do_softirq+0xb7/0x1f0
[80062.207733]  [<ffffffff810c47b0>] ? handle_IRQ_event+0x60/0x170
[80062.228411]  [<ffffffff810132ec>] call_softirq+0x1c/0x30
[80062.248332]  [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[80062.268058]  [<ffffffff8106d7e5>] irq_exit+0x85/0x90
[80062.287578]  [<ffffffff81561765>] do_IRQ+0x75/0xf0
[80062.306763]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[80062.326418]  <EOI>
[80062.342191]  [<ffffffff8130f59f>] ? acpi_idle_enter_c1+0xa3/0xc1
[80062.362503]  [<ffffffff8130f57e>] ? acpi_idle_enter_c1+0x82/0xc1
[80062.382525]  [<ffffffff8144f1e7>] ? cpuidle_idle_call+0xa7/0x140
[80062.402409]  [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[80062.421258]  [<ffffffff81544137>] ? rest_init+0x77/0x80
[80062.439930]  [<ffffffff81882dd8>] ? start_kernel+0x368/0x371
[80062.459099]  [<ffffffff8188233a>] ? x86_64_start_reservations+0x125/0x129
[80062.479533]  [<ffffffff81882438>] ? x86_64_start_kernel+0xfa/0x109
[80062.499226] Code: 00 00 48 c7 c2 2e 85 03 81 48 c7 c1 31 85 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 0f 1f 44
[80062.563580] RIP  [<ffffffff81038729>] __ticket_spin_lock+0x9/0x20
[80062.584129]  RSP <ffff880016603740>

^ permalink raw reply	[flat|nested] 6+ messages in thread