All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: "David S. Miller" <davem@davemloft.net>
Cc: Julian Anastasov <ja@ssi.bg>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com>,
	ddstreet@ieee.org, dvyukov@google.com,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	syzkaller-bugs@googlegroups.com, David Ahern <dsahern@gmail.com>
Subject: Re: unregister_netdevice: waiting for DEV to become free (2)
Date: Tue, 16 Apr 2019 23:00:44 +0900	[thread overview]
Message-ID: <9d430543-33c3-0d9b-dc77-3a179a8e3919@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <4684eef5-ea50-2965-86a0-492b8b1e4f52@I-love.SAKURA.ne.jp>

Hello, David S. Miller.

I have a question regarding rt_flush_dev() introduced by commit caacf05e5ad1abf0
("ipv4: Properly purge netdev references on uncached routes.") which went to
Linux 3.6-rc1. That commit started replacing "a device to unregister" with
"a loopback device in that namespace", but there is no description why that
commit chose "a loopback device in that namespace". If a device to unregister
is "a loopback device in that namespace" itself, rt_flush_dev() becomes a no-op
because dev == net->loopback_dev from the beginning. Apart from a problem that
usage count keeps increasing because dev_put(rt->dst.dev) is not called after
rt->dst.dev was replaced with a loopback device, replacing "a device to unregister"
with "a loopback device in init namespace" (like shown below) avoids this problem.

----------
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a5da63e..aff6a44 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1492,7 +1492,6 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 
 void rt_flush_dev(struct net_device *dev)
 {
-	struct net *net = dev_net(dev);
 	struct rtable *rt;
 	int cpu;
 
@@ -1503,7 +1502,7 @@ void rt_flush_dev(struct net_device *dev)
 		list_for_each_entry(rt, &ul->head, rt_uncached) {
 			if (rt->dst.dev != dev)
 				continue;
-			rt->dst.dev = net->loopback_dev;
+			rt->dst.dev = init_net.loopback_dev;
 			dev_hold(rt->dst.dev);
 			dev_put(dev);
 		}
----------

On 2019/04/15 22:36, Tetsuo Handa wrote:
> On 2018/08/20 21:55, Julian Anastasov wrote:
>> 	Well, only IPVS and tun in the game? But IPVS does not
>> take any dev references for sync threads. Can it be a problem
>> in tun? For example, a side effects from dst_cache_reset?
>> May be dst_release is called too late? Here is what should happen
>> on unregistration:
>>
>> - NETDEV_UNREGISTER event: rt_flush_dev changes dst->dev with lo
>> but dst is not released
>>
>> - ndo_uninit/ip_tunnel_uninit: dst_cache_reset is called which
>> does nothing!?! May be dst_release call is needed here.
>>
>> - no more references are expected here ...
>>
>> - netdev_run_todo -> netdev_wait_allrefs: loop here due to refcnt!=0
>>
>> - dev->priv_destructor (ip_tunnel_dev_free) calls dst_cache_destroy
>> where dst_release is used but it is not reached because we loop in
>> netdev_wait_allrefs above
>>
>> - dst_cache_destroy: really call dst_release
>>
>> 	In fact, after calling rt_flush_dev and replacing the
>> dst->dev we should reach dev->priv_destructor (ip_tunnel_dev_free)
>> for tun device where dst_release for lo should be called. But may be 
>> something prevents it, exit batching?
> 
> I traced using debug printk() patch shown below.
> 
> ----------------------------------------
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 26f69cf763f4..25f7acacf457 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3726,6 +3726,11 @@ void netdev_run_todo(void);
>   */
>  static inline void dev_put(struct net_device *dev)
>  {
> +	if (!strcmp(dev->name, "lo")) {
> +		int count = netdev_refcnt_read(dev);
> +		printk("dev_put(%p): %u->%u\n", dev, count, count - 1);
> +		dump_stack();
> +	}
>  	this_cpu_dec(*dev->pcpu_refcnt);
>  }
>  
> @@ -3737,6 +3742,11 @@ static inline void dev_put(struct net_device *dev)
>   */
>  static inline void dev_hold(struct net_device *dev)
>  {
> +	if (!strcmp(dev->name, "lo")) {
> +		int count = netdev_refcnt_read(dev);
> +		printk("dev_hold(%p): %u->%u\n", dev, count, count + 1);
> +		dump_stack();
> +	}
>  	this_cpu_inc(*dev->pcpu_refcnt);
>  }
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fdcff29df915..53ff4385c8f7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -8897,9 +8897,9 @@ static void netdev_wait_allrefs(struct net_device *dev)
>  
>  		refcnt = netdev_refcnt_read(dev);
>  
> -		if (time_after(jiffies, warning_time + 10 * HZ)) {
> -			pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
> -				 dev->name, refcnt);
> +		if (time_after(jiffies, warning_time + HZ)) {
> +			pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d%s (%p)\n",
> +				 dev->name, refcnt, netdev_reg_state(dev), dev);
>  			warning_time = jiffies;
>  		}
>  	}
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index a5da63e5faa2..4c5baca105ed 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1503,6 +1503,7 @@ void rt_flush_dev(struct net_device *dev)
>  		list_for_each_entry(rt, &ul->head, rt_uncached) {
>  			if (rt->dst.dev != dev)
>  				continue;
> +			printk("rt_flush_dev(%p)->(%p)\n", dev, net->loopback_dev);
>  			rt->dst.dev = net->loopback_dev;
>  			dev_hold(rt->dst.dev);
>  			dev_put(dev);
> @@ -2560,6 +2561,7 @@ struct dst_entry *ipv4_blackhole_route(struct net *net, struct dst_entry *dst_or
>  		new->input = dst_discard;
>  		new->output = dst_discard_out;
>  
> +		printk("ipv4_blackhole_route(%p)->(%p)\n", new->dev, net->loopback_dev);
>  		new->dev = net->loopback_dev;
>  		if (new->dev)
>  			dev_hold(new->dev);
> ----------------------------------------
> 
> When a new loopback device in a new network namespace is created using unshare(),
> nothing is wrong.
> 
> [   60.873014][ T7306] dev_hold(00000000d9f4ea20): 0->1
> [   60.873019][ T7306] CPU: 4 PID: 7306 Comm: a.out Not tainted 5.1.0-rc5+ #177
> [   60.873021][ T7306] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [   60.873022][ T7306] Call Trace:
> [   60.873031][ T7306]  dump_stack+0xaa/0xd8
> [   60.873037][ T7306]  net_rx_queue_update_kobjects+0x1f5/0x200
> [   60.873042][ T7306]  netdev_register_kobject+0xf8/0x1a0
> [   60.873047][ T7306]  register_netdevice+0x4cc/0x650
> [   60.873052][ T7306]  register_netdev+0x23/0x40
> [   60.873056][ T7306]  loopback_net_init+0x50/0xc0
> [   60.873059][ T7306]  ? loopback_dev_init+0xa0/0xa0
> [   60.873064][ T7306]  ops_init+0x4f/0x140
> [   60.873068][ T7306]  setup_net+0xe7/0x250
> [   60.873072][ T7306]  copy_net_ns+0xee/0x1e0
> [   60.873077][ T7306]  create_new_namespaces+0x141/0x2a0
> [   60.873081][ T7306]  unshare_nsproxy_namespaces+0x7e/0xf0
> [   60.873086][ T7306]  ksys_unshare+0x268/0x4b0
> [   60.873090][ T7306]  __x64_sys_unshare+0x16/0x20
> [   60.873095][ T7306]  do_syscall_64+0x7c/0x180
> [   60.873099][ T7306]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> But when some device in a network namespace calls rt_flush_dev(),
> it gets a usage count on loopback device in that network namespace.
> 
> [   71.388104][ T7620] rt_flush_dev(00000000cd35e96a)->(00000000d9f4ea20)
> [   71.391757][ T7620] dev_hold(00000000d9f4ea20): 7->8
> [   71.394725][ T7620] CPU: 4 PID: 7620 Comm: a.out Not tainted 5.1.0-rc5+ #177
> [   71.398094][ T7620] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [   71.403711][ T7620] Call Trace:
> [   71.405912][ T7620]  dump_stack+0xaa/0xd8
> [   71.408252][ T7620]  rt_flush_dev+0x177/0x1b0
> [   71.410802][ T7620]  fib_netdev_event+0x150/0x1b0
> [   71.413270][ T7620]  notifier_call_chain+0x47/0xd0
> [   71.415849][ T7620]  raw_notifier_call_chain+0x2d/0x40
> [   71.418491][ T7620]  ? tun_show_group+0x90/0x90
> [   71.421108][ T7620]  call_netdevice_notifiers_info+0x32/0x70
> [   71.423854][ T7620]  rollback_registered_many+0x421/0x680
> [   71.426583][ T7620]  rollback_registered+0x68/0xb0
> [   71.429244][ T7620]  unregister_netdevice_queue+0xa5/0x100
> [   71.432191][ T7620]  __tun_detach+0x576/0x590
> [   71.435533][ T7620]  tun_chr_close+0x41/0x80
> [   71.437957][ T7620]  ? __tun_detach+0x590/0x590
> [   71.440500][ T7620]  __fput+0xeb/0x2d0
> [   71.442816][ T7620]  ____fput+0x15/0x20
> [   71.445090][ T7620]  task_work_run+0xa9/0xd0
> [   71.447467][ T7620]  do_exit+0x37a/0xf40
> [   71.449623][ T7620]  do_group_exit+0x57/0xe0
> [   71.451826][ T7620]  get_signal+0x114/0x950
> [   71.453989][ T7620]  do_signal+0x2f/0x700
> [   71.456126][ T7620]  ? handle_mm_fault+0x1a8/0x360
> [   71.458323][ T7620]  ? __x64_sys_futex+0x179/0x210
> [   71.460620][ T7620]  exit_to_usermode_loop+0x159/0x180
> [   71.462956][ T7620]  do_syscall_64+0x15d/0x180
> [   71.465110][ T7620]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The usage count printed by "unregister_netdevice: waiting for lo to become free."
> seems to match number of rt_flush_dev() traces shown above. (Complete log is at
> http://I-love.SAKURA.ne.jp/tmp/serial-20190415.txt.xz .)
> 
> Although netdev_wait_allrefs() is periodically calling
> call_netdevice_notifiers(NETDEV_UNREGISTER, dev) in order to try to drop
> the usgae count,
> 
> 	list_for_each_entry(rt, &ul->head, rt_uncached) {
> 		if (rt->dst.dev != dev)
> 			continue;
> 		rt->dst.dev = net->loopback_dev;
> 		dev_hold(rt->dst.dev);
> 		dev_put(dev);
> 	}
> 
> in rt_flush_dev() becomes a no-op because dev == net->loopback_dev and
> therefore cannot drop the usage count forever. That is, netdev_wait_allrefs()
> on a loopback device cannot make forward progress.
> 
> [   95.502947][ T4478] unregister_netdevice: waiting for lo to become free. Usage count = 28 (unregistered) (00000000d9f4ea20)
> [   95.509108][ T4478] rt_flush_dev(00000000d9f4ea20)->(00000000d9f4ea20)
> [   95.512598][ T4478] dev_hold(00000000d9f4ea20): 28->29
> [   95.517241][ T4478] CPU: 5 PID: 4478 Comm: kworker/u128:28 Not tainted 5.1.0-rc5+ #177
> [   95.522984][ T4478] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [   95.532898][ T4478] Workqueue: netns cleanup_net
> [   95.537134][ T4478] Call Trace:
> [   95.539751][ T4478]  dump_stack+0xaa/0xd8
> [   95.543642][ T4478]  rt_flush_dev+0x177/0x1b0
> [   95.546835][ T4478]  fib_netdev_event+0x150/0x1b0
> [   95.550609][ T4478]  notifier_call_chain+0x47/0xd0
> [   95.557990][ T4478]  raw_notifier_call_chain+0x2d/0x40
> [   95.563900][ T4478]  call_netdevice_notifiers_info+0x32/0x70
> [   95.568655][ T4478]  netdev_run_todo+0x197/0x410
> [   95.572554][ T4478]  rtnl_unlock+0xe/0x10
> [   95.576836][ T4478]  default_device_exit_batch+0x1ab/0x1d0
> [   95.579650][ T4478]  ? do_wait_intr_irq+0xb0/0xb0
> [   95.582236][ T4478]  ? unregister_netdevice_many+0x30/0x30
> [   95.584766][ T4478]  ? dev_change_net_namespace+0x4e0/0x4e0
> [   95.587397][ T4478]  ops_exit_list.isra.6+0x75/0x90
> [   95.590063][ T4478]  cleanup_net+0x20d/0x380
> [   95.592373][ T4478]  process_one_work+0x202/0x4f0
> [   95.595017][ T4478]  worker_thread+0x3c/0x4b0
> [   95.597520][ T4478]  kthread+0x139/0x160
> [   95.599822][ T4478]  ? process_one_work+0x4f0/0x4f0
> [   95.602351][ T4478]  ? kthread_destroy_worker+0x70/0x70
> [   95.604901][ T4478]  ret_from_fork+0x35/0x40
> [   95.607249][ T4478] dev_put(00000000d9f4ea20): 29->28
> [   95.609935][ T4478] CPU: 5 PID: 4478 Comm: kworker/u128:28 Not tainted 5.1.0-rc5+ #177
> [   95.613282][ T4478] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
> [   95.618699][ T4478] Workqueue: netns cleanup_net
> [   95.621345][ T4478] Call Trace:
> [   95.623381][ T4478]  dump_stack+0xaa/0xd8
> [   95.625787][ T4478]  rt_flush_dev+0x19f/0x1b0
> [   95.628284][ T4478]  fib_netdev_event+0x150/0x1b0
> [   95.630827][ T4478]  notifier_call_chain+0x47/0xd0
> [   95.633153][ T4478]  raw_notifier_call_chain+0x2d/0x40
> [   95.635702][ T4478]  call_netdevice_notifiers_info+0x32/0x70
> [   95.638372][ T4478]  netdev_run_todo+0x197/0x410
> [   95.640641][ T4478]  rtnl_unlock+0xe/0x10
> [   95.642828][ T4478]  default_device_exit_batch+0x1ab/0x1d0
> [   95.645230][ T4478]  ? do_wait_intr_irq+0xb0/0xb0
> [   95.647613][ T4478]  ? unregister_netdevice_many+0x30/0x30
> [   95.650148][ T4478]  ? dev_change_net_namespace+0x4e0/0x4e0
> [   95.652594][ T4478]  ops_exit_list.isra.6+0x75/0x90
> [   95.654975][ T4478]  cleanup_net+0x20d/0x380
> [   95.657140][ T4478]  process_one_work+0x202/0x4f0
> [   95.659829][ T4478]  worker_thread+0x3c/0x4b0
> [   95.662098][ T4478]  kthread+0x139/0x160
> [   95.664119][ T4478]  ? process_one_work+0x4f0/0x4f0
> [   95.666467][ T4478]  ? kthread_destroy_worker+0x70/0x70
> [   95.668916][ T4478]  ret_from_fork+0x35/0x40
> 
> If we do something like
> 
>  	list_for_each_entry(rt, &ul->head, rt_uncached) {
>  		if (rt->dst.dev != dev)
>  			continue;
> -                       rt->dst.dev = net->loopback_dev;
> +                       if (dev == net->loopback_dev)
> +                               rt->dst.dev = init_net.loopback_dev;
> +                       else
> +                               rt->dst.dev = net->loopback_dev;
>  		dev_hold(rt->dst.dev);
>  		dev_put(dev);
>  	}
> 
> at rt_flush_dev(), I guess that this problem will go away. I don't
> know which device should be used instead of init_net.loopback_dev .
> But I guess that we need to somehow avoid getting usage count on
> a loopback device (by e.g. using a dummy device which is not under
> unregistration procedure) when we want to unregister that loopback
> device.
> 


  parent reply	other threads:[~2019-04-16 14:02 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-15 18:50 unregister_netdevice: waiting for DEV to become free (2) syzbot
2018-08-15 20:28 ` syzbot
2018-08-15 20:41   ` Dmitry Vyukov
2018-08-20  4:31 ` syzbot
2018-08-20 12:55   ` Julian Anastasov
2018-08-21  5:40     ` Cong Wang
2018-08-22  4:11       ` Julian Anastasov
2019-04-15 13:36     ` Tetsuo Handa
2019-04-15 15:35       ` David Ahern
2019-04-21 20:41         ` Stephen Suryaputra
2019-04-22 14:58           ` David Ahern
2019-04-22 16:04             ` Eric Dumazet
2019-04-22 16:09               ` Eric Dumazet
2019-04-16 14:00       ` Tetsuo Handa [this message]
2019-04-26 13:43         ` Tetsuo Handa
2019-04-27 17:16           ` David Ahern
2019-04-27 22:33             ` Tetsuo Handa
2019-04-27 23:52               ` Eric Dumazet
2019-04-28  4:22                 ` Tetsuo Handa
2019-04-28 15:04                   ` Eric Dumazet
2019-04-29 18:34                   ` David Ahern
2019-04-29 18:43                     ` David Ahern
2019-05-01 13:38                       ` Tetsuo Handa
2019-05-01 14:52                         ` David Ahern
2019-05-01 16:16                           ` Tetsuo Handa
2019-05-04 14:52                             ` [PATCH] ipv4: Delete uncached routes upon unregistration of loopback device Tetsuo Handa
2019-05-04 15:56                               ` Eric Dumazet
2019-05-04 17:09                                 ` Tetsuo Handa
2019-05-04 17:24                                   ` Eric Dumazet
2019-05-04 20:13                               ` Julian Anastasov
2019-11-28  9:56     ` unregister_netdevice: waiting for DEV to become free (2) Tetsuo Handa
2019-11-29  5:54       ` Lukas Bulwahn
2019-11-29  6:51       ` Jouni Högander
2019-12-05 10:00       ` Jouni Högander
2019-12-05 11:00         ` Tetsuo Handa
2019-12-16 11:12           ` Tetsuo Handa
2019-12-17  7:08             ` Jouni Högander
2019-10-11 10:14   ` Tetsuo Handa
2019-10-11 15:12     ` Alexei Starovoitov
2019-10-16 10:34       ` Toke Høiland-Jørgensen
2019-11-15  9:43         ` Tetsuo Handa
2019-11-21 11:36           ` Toke Høiland-Jørgensen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9d430543-33c3-0d9b-dc77-3a179a8e3919@I-love.SAKURA.ne.jp \
    --to=penguin-kernel@i-love.sakura.ne.jp \
    --cc=davem@davemloft.net \
    --cc=ddstreet@ieee.org \
    --cc=dsahern@gmail.com \
    --cc=dvyukov@google.com \
    --cc=ja@ssi.bg \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.