All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net 0/2] bonding: fix panics for making bonding device up and down repeatedly
@ 2011-10-07 12:49 Mitsuo Hayasaka
  2011-10-07 12:49 ` [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame Mitsuo Hayasaka
  2011-10-07 12:50 ` [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close Mitsuo Hayasaka
  0 siblings, 2 replies; 8+ messages in thread
From: Mitsuo Hayasaka @ 2011-10-07 12:49 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek; +Cc: netdev, linux-kernel, yrl.pp-manager.tt

Hi,

I hit a kernel BUG when I made a bonding interface up and down
repeatedly, as below.

# while true; do ifconfig bond0 down; done &
# while true; do ifconfig bond0 up; done &

(panic messages are bottom of this mail)

I investigated it and found the below commit tries to address 
this kind of parallel up&down problem.

a0db2dad0935e798973bb79676e722b82f177206 "bonding: properly stop
queuing work when requested"

However, above fix is not enough. I've found two different BUG
patterns.

They happens as follows:

(1) NULL pointer dereference in bond_handle_frame()
1. bond_close() sets bond->recv_probe to NULL.
2. bond_handle_frame() calls bond->recv_probe.
3. a panic happens when running bond_close() in parallel
   with bond_handle_frame().

(2) NULL pointer dereference in proccess_one_work()
1. bond_close() calls cancel_delayed_work() but its last works
   are not cancelled because they were already queued in workqueue.
2. bond_open() initializes work->data.
3. process_one_work() refers get_work_cwq(work)->wq->flags in
   workqueue.c:1850, but get_work_cwq() returns NULL when work->data
   has been initialized.
4. a panic happens when running process_one_work() in parallel with
   bond_open().

Each log is shown below. 

The following two patches addresses these panics in detail:
 [PATCH 1/2] use local function pointer of bond->recv_probe
		in bond_handle_frame
 [PATCH 2/2] use flush_delayed_work_sync in bond_close

Thanks,

============== log for a panic in bond_handle_frame=========

# while true; do ifconfig bond0 down; done &
[1] 1441
# while true; do ifconfig bond0 up; done &
[2] 3383
# [  296.794817] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  296.795761] IP: [<          (null)>]           (null)
[  296.795761] PGD 79515067 PUD 375e9067 PMD 0 
[  296.795761] Oops: 0010 [#1] SMP 
[  296.795761] CPU 0 
[  296.795761] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm pcspkr virtio_balloon microcode i2c_piix4 joydev snd_timer snd virtio_net i2c_core soundcore snd_page_alloc ipv6 xfs exportfs virtio_blk [last unloaded: speedstep_lib]
[  296.795761] 
[  296.795761] Pid: 0, comm: swapper Not tainted 3.1.0-rc9+ #4 Bochs Bochs
[  296.795761] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[  296.795761] RSP: 0018:ffff88007fc03d28  EFLAGS: 00010286
[  296.795761] RAX: ffff88007b1e6500 RBX: ffff88007b1e6a00 RCX: 0000000000000000
[  296.795761] RDX: ffff880037670e00 RSI: ffff88007b522740 RDI: ffff88007b1e6500
[  296.795761] RBP: ffff88007fc03d50 R08: 0000000000000000 R09: 0000000180100001
[  296.795761] R10: ffffffff81a01fd8 R11: ffffffff81a01fd8 R12: ffff88007b522740
[  296.795761] R13: ffff880037670e00 R14: ffff88007b1e6500 R15: ffffe8ffffc02578
[  296.795761] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  296.795761] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  296.795761] CR2: 0000000000000000 CR3: 00000000794f9000 CR4: 00000000000006f0
[  296.795761] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  296.795761] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  296.795761] Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a0d020)
[  296.795761] Stack:
[  296.795761]  ffffffffa00b64d0 ffffffffa00b642e 0000000000000001 ffff880037a3f000
[  296.795761]  0000000000000000 ffff88007fc03db0 ffffffff813e2996 000000000000002e
[  296.795761]  ffff88007b1e6a00 ffff88007fc03d90 ffffffff81b5be40 ffff88007b1e6a00
[  296.795761] Call Trace:
[  296.795761]  <IRQ> 
[  296.795761]  [<ffffffffa00b64d0>] ? bond_handle_frame+0xa2/0x160 [bonding]
[  296.795761]  [<ffffffffa00b642e>] ? skb_cow_head+0x70/0x70 [bonding]
[  296.795761]  [<ffffffff813e2996>] __netif_receive_skb+0x2c5/0x417
[  296.795761]  [<ffffffff813e5c22>] netif_receive_skb+0x6c/0x73
[  296.795761]  [<ffffffffa0155360>] virtnet_poll+0x49a/0x562 [virtio_net]
[  296.795761]  [<ffffffff813e623b>] net_rx_action+0xa9/0x1b8
[  296.795761]  [<ffffffff8105c61b>] __do_softirq+0xc9/0x1b5
[  296.795761]  [<ffffffff81014fa0>] ? sched_clock+0x9/0xd
[  296.795761]  [<ffffffff814b8c2c>] call_softirq+0x1c/0x30
[  296.795761]  [<ffffffff81010af9>] do_softirq+0x46/0x81
[  296.795761]  [<ffffffff8105c8e3>] irq_exit+0x57/0xb1
[  296.795761]  [<ffffffff814b950e>] do_IRQ+0x8e/0xa5
[  296.795761]  [<ffffffff814b062e>] common_interrupt+0x6e/0x6e
[  296.795761]  <EOI> 
[  296.795761]  [<ffffffff810b27c3>] ? rcu_needs_cpu+0x11a/0x1cb
[  296.795761]  [<ffffffff8102f2f1>] ? native_safe_halt+0xb/0xd
[  296.795761]  [<ffffffff81015b32>] default_idle+0x4e/0x86
[  296.795761]  [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[  296.795761]  [<ffffffff8148cd6e>] rest_init+0x72/0x74
[  296.795761]  [<ffffffff81b74b7d>] start_kernel+0x3ab/0x3b6
[  296.795761]  [<ffffffff81b742c4>] x86_64_start_reservations+0xaf/0xb3
[  296.795761]  [<ffffffff81b74140>] ? early_idt_handlers+0x140/0x140
[  296.795761]  [<ffffffff81b743ca>] x86_64_start_kernel+0x102/0x111
[  296.795761] Code:  Bad RIP value.
[  296.795761] RIP  [<          (null)>]           (null)
[  296.795761]  RSP <ffff88007fc03d28>
[  296.795761] CR2: 0000000000000000
[  296.846940] ---[ end trace 2d2403022cca4fe1 ]---
[  296.847691] Kernel panic - not syncing: Fatal exception in interrupt
[  296.848701] Pid: 0, comm: swapper Tainted: G      D     3.1.0-rc9+ #4
[  296.849717] Call Trace:
[  296.850122]  <IRQ>  [<ffffffff814a6f2b>] panic+0x91/0x1a5
[  296.851080]  [<ffffffff814b13e6>] oops_end+0xb4/0xc5
[  296.851856]  [<ffffffff814a67d5>] no_context+0x203/0x212
[  296.852710]  [<ffffffff813e7601>] ? dev_queue_xmit+0x44b/0x472
[  296.853641]  [<ffffffff814a69af>] __bad_area_nosemaphore+0x1cb/0x1ec
[  296.854654]  [<ffffffffa00bc52c>] ? bond_3ad_xmit_xor+0x130/0x156 [bonding]
[  296.855747]  [<ffffffff814a69e3>] bad_area_nosemaphore+0x13/0x15
[  296.856705]  [<ffffffff814b33a3>] do_page_fault+0x1b8/0x37e
[  296.857602]  [<ffffffff811131d5>] ? virt_to_head_page+0xe/0x31
[  296.858533]  [<ffffffff81114f7a>] ? __cmpxchg_double_slab+0x2c/0x9e
[  296.859538]  [<ffffffff814a9d74>] ? __slab_alloc+0x39f/0x3b4
[  296.860453]  [<ffffffff813dcde0>] ? skb_clone+0x77/0x97
[  296.861277]  [<ffffffff8102fdbd>] ? pvclock_clocksource_read+0x48/0xb4
[  296.862325]  [<ffffffff814b08f5>] page_fault+0x25/0x30
[  296.863148]  [<ffffffffa00b64d0>] ? bond_handle_frame+0xa2/0x160 [bonding]
[  296.864235]  [<ffffffffa00b642e>] ? skb_cow_head+0x70/0x70 [bonding]
[  296.865242]  [<ffffffff813e2996>] __netif_receive_skb+0x2c5/0x417
[  296.866218]  [<ffffffff813e5c22>] netif_receive_skb+0x6c/0x73
[  296.867134]  [<ffffffffa0155360>] virtnet_poll+0x49a/0x562 [virtio_net]
[  296.868166]  [<ffffffff813e623b>] net_rx_action+0xa9/0x1b8
[  296.869037]  [<ffffffff8105c61b>] __do_softirq+0xc9/0x1b5
[  296.869888]  [<ffffffff81014fa0>] ? sched_clock+0x9/0xd
[  296.870726]  [<ffffffff814b8c2c>] call_softirq+0x1c/0x30
[  296.871584]  [<ffffffff81010af9>] do_softirq+0x46/0x81
[  296.872402]  [<ffffffff8105c8e3>] irq_exit+0x57/0xb1
[  296.873192]  [<ffffffff814b950e>] do_IRQ+0x8e/0xa5
[  296.873940]  [<ffffffff814b062e>] common_interrupt+0x6e/0x6e
[  296.874846]  <EOI>  [<ffffffff810b27c3>] ? rcu_needs_cpu+0x11a/0x1cb
[  296.875881]  [<ffffffff8102f2f1>] ? native_safe_halt+0xb/0xd
[  296.876771]  [<ffffffff81015b32>] default_idle+0x4e/0x86
[  296.877615]  [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[  296.878412]  [<ffffffff8148cd6e>] rest_init+0x72/0x74
[  296.879215]  [<ffffffff81b74b7d>] start_kernel+0x3ab/0x3b6
[  296.880096]  [<ffffffff81b742c4>] x86_64_start_reservations+0xaf/0xb3
[  296.881189]  [<ffffffff81b74140>] ? early_idt_handlers+0x140/0x140
[  296.882169]  [<ffffffff81b743ca>] x86_64_start_kernel+0x102/0x111



============== log for a panic in proccess_one_work=========

# while true; do ifconfig bond0 down; done &
[1] 1463
# while true; do ifconfig bond0 up; done &
[2] 7093
# [  734.859163] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  734.862624] IP: [<ffffffff8106da2e>] process_one_work+0x104/0x2a9
[  734.862624] PGD 0 
[  734.862624] Oops: 0000 [#1] SMP 
[  734.862624] CPU 3 
[  734.862624] Modules linked in: fuse nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding snd_hda_intel snd_hda_codec snd_hwdep joydev snd_seq snd_seq_device snd_pcm pcspkr microcode virtio_net virtio_balloon snd_timer snd soundcore snd_page_alloc i2c_piix4 i2c_core ipv6 xfs exportfs virtio_blk [last unloaded: speedstep_lib]
[  734.862624] 
[  734.862624] Pid: 47, comm: kworker/u:1 Not tainted 3.1.0-rc9+ #4 Bochs Bochs
[  734.862624] RIP: 0010:[<ffffffff8106da2e>]  [<ffffffff8106da2e>] process_one_work+0x104/0x2a9
[  734.862624] RSP: 0018:ffff880078f95e50  EFLAGS: 00010046
[  734.862624] RAX: 0000000000000000 RBX: ffff880078f5af00 RCX: 0000000000010000
[  734.862624] RDX: 0000000000010000 RSI: ffff88007bcada50 RDI: ffff88007bcad8f8
[  734.862624] RBP: ffff880078f95ea0 R08: ffff88007bcad900 R09: ffff880076c1bf00
[  734.862624] R10: ffff880076c1bf00 R11: ffff88007fd92e40 R12: ffffffff81cc3dc0
[  734.862624] R13: ffff88007a81f600 R14: ffffffffa00b8759 R15: ffff88007a81f607
[  734.862624] FS:  0000000000000000(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
[  734.862624] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  734.862624] CR2: 0000000000000008 CR3: 0000000001a05000 CR4: 00000000000006e0
[  734.862624] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  734.862624] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  734.862624] Process kworker/u:1 (pid: 47, threadinfo ffff880078f94000, task ffff880078f3ae40)
[  734.862624] Stack:
[  734.862624]  ffffffff81cc3dc8 ffff88007bcad900 0000009a7bc54a70 ffff88007bcad8f8
[  734.862624]  ffffffff81cc3dc0 ffff880078f5af00 ffffffff81cc3dc0 ffff880078f5af20
[  734.862624]  ffff880078f3ae40 ffff880078f5af20 ffff880078f95ee0 ffffffff8106e5aa
[  734.862624] Call Trace:
[  734.862624]  [<ffffffff8106e5aa>] worker_thread+0xda/0x15d
[  734.862624]  [<ffffffff8106e4d0>] ? manage_workers+0x176/0x176
[  734.862624]  [<ffffffff810719f7>] kthread+0x84/0x8c
[  734.862624]  [<ffffffff814b8b34>] kernel_thread_helper+0x4/0x10
[  734.862624]  [<ffffffff81071973>] ? kthread_worker_fn+0x148/0x148
[  734.862624]  [<ffffffff814b8b30>] ? gs_change+0x13/0x13
[  734.862624] Code: 8b 55 c8 48 89 42 08 48 89 42 10 41 f6 44 24 1c 10 74 31 49 8b 7c 24 08 49 8d 44 24 08 48 39 c7 74 1c 48 83 ef 08 e8 6a e0 ff ff 
[  734.862624]  8b 40 08 f6 00 10 74 0a 4c 89 e7 e8 85 e4 ff ff eb 06 41 83 
[  734.862624] RIP  [<ffffffff8106da2e>] process_one_work+0x104/0x2a9
[  734.862624]  RSP <ffff880078f95e50>
[  734.862624] CR2: 0000000000000008
[  734.862624] ---[ end trace 7cf0d13647b8711b ]---
[  734.997227] BUG: unable to handle kernel paging request at fffffffffffffff8

Mess[age  fr om734.998114] IP:

---

Mitsuo Hayasaka (2):
      [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
      [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame


 drivers/net/bonding/bond_main.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

-- 
Mitsuo Hayasaka (mitsuo.hayasaka.hu@hitachi.com)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame
  2011-10-07 12:49 [RFC PATCH net 0/2] bonding: fix panics for making bonding device up and down repeatedly Mitsuo Hayasaka
@ 2011-10-07 12:49 ` Mitsuo Hayasaka
  2011-10-07 13:24   ` Américo Wang
  2011-10-07 12:50 ` [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close Mitsuo Hayasaka
  1 sibling, 1 reply; 8+ messages in thread
From: Mitsuo Hayasaka @ 2011-10-07 12:49 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek
  Cc: netdev, linux-kernel, yrl.pp-manager.tt, Mitsuo Hayasaka,
	Jay Vosburgh, Andy Gospodarek

The bond->recv_probe is called in bond_handle_frame() when
a packet is received, but bond_close() sets it to NULL. So,
a panic occurs when both functions work in parallel.

Why this happen:
After null pointer check of bond->recv_probe, an sk_buff is
duplicated and bond->recv_probe is called in bond_handle_frame.
So, a panic occurs when bond_close() is called between the
check and call of bond->recv_probe.

Patch:
This patch uses a local function pointer of bond->recv_probe
in bond_handle_frame(). So, it can avoid the null pointer
dereference.


Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
---

 drivers/net/bonding/bond_main.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6d79b78..f1baec5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1435,6 +1435,8 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 	struct sk_buff *skb = *pskb;
 	struct slave *slave;
 	struct bonding *bond;
+	void (*recv_probe)(struct sk_buff *, struct bonding *,
+				struct slave *);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (unlikely(!skb))
@@ -1448,11 +1450,12 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 	if (bond->params.arp_interval)
 		slave->dev->last_rx = jiffies;
 
-	if (bond->recv_probe) {
+	recv_probe = bond->recv_probe;
+	if (recv_probe) {
 		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 
 		if (likely(nskb)) {
-			bond->recv_probe(nskb, bond, slave);
+			recv_probe(nskb, bond, slave);
 			dev_kfree_skb(nskb);
 		}
 	}


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
  2011-10-07 12:49 [RFC PATCH net 0/2] bonding: fix panics for making bonding device up and down repeatedly Mitsuo Hayasaka
  2011-10-07 12:49 ` [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame Mitsuo Hayasaka
@ 2011-10-07 12:50 ` Mitsuo Hayasaka
  2011-10-07 13:34   ` Américo Wang
  1 sibling, 1 reply; 8+ messages in thread
From: Mitsuo Hayasaka @ 2011-10-07 12:50 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek
  Cc: netdev, linux-kernel, yrl.pp-manager.tt, Mitsuo Hayasaka,
	Jay Vosburgh, Andy Gospodarek

The bond_close() calls cancel_delayed_work() to cancel delayed works.
It, however, cannot cancel works that were already queued in workqueue.
The bond_open() initializes work->data, and proccess_one_work() refers
get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
work->data has been initialized. Thus, a panic occurs.

This patch uses flush_delayed_work_sync() instead of cancel_delayed_work()
in bond_close(). It cancels delayed timer and waits for work to finish
execution. So, it can avoid the null pointer dereference due to the
parallel executions of proccess_one_work() and initializing proccess
of bond_open().

Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
---

 drivers/net/bonding/bond_main.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f1baec5..cd490b7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3504,27 +3504,27 @@ static int bond_close(struct net_device *bond_dev)
 	write_unlock_bh(&bond->lock);
 
 	if (bond->params.miimon) {  /* link check interval, in milliseconds. */
-		cancel_delayed_work(&bond->mii_work);
+		flush_delayed_work_sync(&bond->mii_work);
 	}
 
 	if (bond->params.arp_interval) {  /* arp interval, in milliseconds. */
-		cancel_delayed_work(&bond->arp_work);
+		flush_delayed_work_sync(&bond->arp_work);
 	}
 
 	switch (bond->params.mode) {
 	case BOND_MODE_8023AD:
-		cancel_delayed_work(&bond->ad_work);
+		flush_delayed_work_sync(&bond->ad_work);
 		break;
 	case BOND_MODE_TLB:
 	case BOND_MODE_ALB:
-		cancel_delayed_work(&bond->alb_work);
+		flush_delayed_work_sync(&bond->alb_work);
 		break;
 	default:
 		break;
 	}
 
 	if (delayed_work_pending(&bond->mcast_work))
-		cancel_delayed_work(&bond->mcast_work);
+		flush_delayed_work_sync(&bond->mcast_work);
 
 	if (bond_is_lb(bond)) {
 		/* Must be called only after all


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame
  2011-10-07 12:49 ` [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame Mitsuo Hayasaka
@ 2011-10-07 13:24   ` Américo Wang
  2011-10-11 13:02     ` HAYASAKA Mitsuo
  0 siblings, 1 reply; 8+ messages in thread
From: Américo Wang @ 2011-10-07 13:24 UTC (permalink / raw)
  To: Mitsuo Hayasaka
  Cc: Jay Vosburgh, Andy Gospodarek, netdev, linux-kernel, yrl.pp-manager.tt

On Fri, Oct 7, 2011 at 8:49 PM, Mitsuo Hayasaka
<mitsuo.hayasaka.hu@hitachi.com> wrote:
> The bond->recv_probe is called in bond_handle_frame() when
> a packet is received, but bond_close() sets it to NULL. So,
> a panic occurs when both functions work in parallel.
>
> Why this happen:
> After null pointer check of bond->recv_probe, an sk_buff is
> duplicated and bond->recv_probe is called in bond_handle_frame.
> So, a panic occurs when bond_close() is called between the
> check and call of bond->recv_probe.
>
> Patch:
> This patch uses a local function pointer of bond->recv_probe
> in bond_handle_frame(). So, it can avoid the null pointer
> dereference.
>

Hmm, I don't doubt it can fix the problem, I am wondering if
bond->recv_probe should be protected by bond->lock...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
  2011-10-07 12:50 ` [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close Mitsuo Hayasaka
@ 2011-10-07 13:34   ` Américo Wang
  0 siblings, 0 replies; 8+ messages in thread
From: Américo Wang @ 2011-10-07 13:34 UTC (permalink / raw)
  To: Mitsuo Hayasaka
  Cc: Jay Vosburgh, Andy Gospodarek, netdev, linux-kernel, yrl.pp-manager.tt

On Fri, Oct 7, 2011 at 8:50 PM, Mitsuo Hayasaka
<mitsuo.hayasaka.hu@hitachi.com> wrote:
> The bond_close() calls cancel_delayed_work() to cancel delayed works.
> It, however, cannot cancel works that were already queued in workqueue.
> The bond_open() initializes work->data, and proccess_one_work() refers
> get_work_cwq(work)->wq->flags. The get_work_cwq() returns NULL when
> work->data has been initialized. Thus, a panic occurs.
>
> This patch uses flush_delayed_work_sync() instead of cancel_delayed_work()
> in bond_close(). It cancels delayed timer and waits for work to finish
> execution. So, it can avoid the null pointer dereference due to the
> parallel executions of proccess_one_work() and initializing proccess
> of bond_open().
>
> Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
> Cc: Jay Vosburgh <fubar@us.ibm.com>
> Cc: Andy Gospodarek <andy@greyhouse.net>

Makes sense,

Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame
  2011-10-07 13:24   ` Américo Wang
@ 2011-10-11 13:02     ` HAYASAKA Mitsuo
  2011-10-11 13:23       ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: HAYASAKA Mitsuo @ 2011-10-11 13:02 UTC (permalink / raw)
  To: Américo Wang
  Cc: Jay Vosburgh, Andy Gospodarek, netdev, linux-kernel, yrl.pp-manager.tt

Hi WANG Cong

Thank you for your comments.

(2011/10/07 22:24), Américo Wang wrote:
> On Fri, Oct 7, 2011 at 8:49 PM, Mitsuo Hayasaka
> <mitsuo.hayasaka.hu@hitachi.com> wrote:
>> The bond->recv_probe is called in bond_handle_frame() when
>> a packet is received, but bond_close() sets it to NULL. So,
>> a panic occurs when both functions work in parallel.
>>
>> Why this happen:
>> After null pointer check of bond->recv_probe, an sk_buff is
>> duplicated and bond->recv_probe is called in bond_handle_frame.
>> So, a panic occurs when bond_close() is called between the
>> check and call of bond->recv_probe.
>>
>> Patch:
>> This patch uses a local function pointer of bond->recv_probe
>> in bond_handle_frame(). So, it can avoid the null pointer
>> dereference.
>>
> 
> Hmm, I don't doubt it can fix the problem, I am wondering if
> bond->recv_probe should be protected by bond->lock...

Indeed, in general any resources should be protected from the asynchronous
workers.

At first, I thought it should be handled with lock protection, as well.
However, I guess that using bond->lock on this kind of hot-path may
introduces unnecessary overhead. In addition, this code works well
without the strict lock protection. So, I think this change is the
right way to fix it.

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame
  2011-10-11 13:02     ` HAYASAKA Mitsuo
@ 2011-10-11 13:23       ` Eric Dumazet
  2011-10-12  7:04         ` HAYASAKA Mitsuo
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2011-10-11 13:23 UTC (permalink / raw)
  To: HAYASAKA Mitsuo
  Cc: Américo Wang, Jay Vosburgh, Andy Gospodarek, netdev,
	linux-kernel, yrl.pp-manager.tt

Le mardi 11 octobre 2011 à 22:02 +0900, HAYASAKA Mitsuo a écrit :
> Hi WANG Cong
> 
> Thank you for your comments.
> 
> (2011/10/07 22:24), Américo Wang wrote:
> > On Fri, Oct 7, 2011 at 8:49 PM, Mitsuo Hayasaka
> > <mitsuo.hayasaka.hu@hitachi.com> wrote:
> >> The bond->recv_probe is called in bond_handle_frame() when
> >> a packet is received, but bond_close() sets it to NULL. So,
> >> a panic occurs when both functions work in parallel.
> >>
> >> Why this happen:
> >> After null pointer check of bond->recv_probe, an sk_buff is
> >> duplicated and bond->recv_probe is called in bond_handle_frame.
> >> So, a panic occurs when bond_close() is called between the
> >> check and call of bond->recv_probe.
> >>
> >> Patch:
> >> This patch uses a local function pointer of bond->recv_probe
> >> in bond_handle_frame(). So, it can avoid the null pointer
> >> dereference.
> >>
> > 
> > Hmm, I don't doubt it can fix the problem, I am wondering if
> > bond->recv_probe should be protected by bond->lock...
> 
> Indeed, in general any resources should be protected from the asynchronous
> workers.
> 
> At first, I thought it should be handled with lock protection, as well.
> However, I guess that using bond->lock on this kind of hot-path may
> introduces unnecessary overhead. In addition, this code works well
> without the strict lock protection. So, I think this change is the
> right way to fix it.

Maybe, but then ACCESS_ONCE() is needed to prevent compiler to
'optimize' the temporary variable.

Or use rcu_dereference() to make the whole thing really safe and self
documented.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame
  2011-10-11 13:23       ` Eric Dumazet
@ 2011-10-12  7:04         ` HAYASAKA Mitsuo
  0 siblings, 0 replies; 8+ messages in thread
From: HAYASAKA Mitsuo @ 2011-10-12  7:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Américo Wang, Jay Vosburgh, Andy Gospodarek, netdev,
	linux-kernel, yrl.pp-manager.tt

Hi Eric,

Thank you for your comment.

(2011/10/11 22:23), Eric Dumazet wrote:
> Le mardi 11 octobre 2011 à 22:02 +0900, HAYASAKA Mitsuo a écrit :
>> Hi WANG Cong
>>
>> Thank you for your comments.
>>
>> (2011/10/07 22:24), Américo Wang wrote:
>>> On Fri, Oct 7, 2011 at 8:49 PM, Mitsuo Hayasaka
>>> <mitsuo.hayasaka.hu@hitachi.com> wrote:
>>>> The bond->recv_probe is called in bond_handle_frame() when
>>>> a packet is received, but bond_close() sets it to NULL. So,
>>>> a panic occurs when both functions work in parallel.
>>>>
>>>> Why this happen:
>>>> After null pointer check of bond->recv_probe, an sk_buff is
>>>> duplicated and bond->recv_probe is called in bond_handle_frame.
>>>> So, a panic occurs when bond_close() is called between the
>>>> check and call of bond->recv_probe.
>>>>
>>>> Patch:
>>>> This patch uses a local function pointer of bond->recv_probe
>>>> in bond_handle_frame(). So, it can avoid the null pointer
>>>> dereference.
>>>>
>>>
>>> Hmm, I don't doubt it can fix the problem, I am wondering if
>>> bond->recv_probe should be protected by bond->lock...
>>
>> Indeed, in general any resources should be protected from the asynchronous
>> workers.
>>
>> At first, I thought it should be handled with lock protection, as well.
>> However, I guess that using bond->lock on this kind of hot-path may
>> introduces unnecessary overhead. In addition, this code works well
>> without the strict lock protection. So, I think this change is the
>> right way to fix it.
> 
> Maybe, but then ACCESS_ONCE() is needed to prevent compiler to
> 'optimize' the temporary variable.
> 
> Or use rcu_dereference() to make the whole thing really safe and self
> documented.
> 
I agreed.
I'd like to send the patch with ACCESS_ONCE(), again.

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-10-12  7:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-07 12:49 [RFC PATCH net 0/2] bonding: fix panics for making bonding device up and down repeatedly Mitsuo Hayasaka
2011-10-07 12:49 ` [RFC PATCH net 1/2] [BUGFIX] bonding: use local function pointer of bond->recv_probe in bond_handle_frame Mitsuo Hayasaka
2011-10-07 13:24   ` Américo Wang
2011-10-11 13:02     ` HAYASAKA Mitsuo
2011-10-11 13:23       ` Eric Dumazet
2011-10-12  7:04         ` HAYASAKA Mitsuo
2011-10-07 12:50 ` [RFC PATCH net 2/2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close Mitsuo Hayasaka
2011-10-07 13:34   ` Américo Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.