linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next] ipv6/route: inherit max_sizes from current netns
@ 2020-05-20 14:58 Christian Brauner
  2020-05-20 16:54 ` David Ahern
  0 siblings, 1 reply; 5+ messages in thread
From: Christian Brauner @ 2020-05-20 14:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, Jakub Kicinski, netdev,
	linux-kernel, Christian Brauner

During NorthSec (cf. [1]) a very large number of unprivileged
containers and nested containers are run during the competition to
provide a safe environment for the various teams during the event. Every
year a range of feature requests or bug reports come out of this and
this year's no different.
One of the containers was running a simple VPN server. There were about
1.5k users connected to this VPN over ipv6 and the container was setup
with about 100 custom routing tables when it hit the max_sizes routing
limit. After this no new connections could be established anymore,
pinging didn't work anymore; you get the idea.

The kernel helpfully logs a message pointing out that bumping max_sizes
should be considered to fix that problem. Here's where it gets tricky.
Routing tables are namespaced and as such max_sizes along with the other
sysctls is a per-netns limit, not a global limit. Each network namespace
by default gets a limit of 4096 routes. Network namespaces owned by the
initial user namespace can change this limit afterwards but network
namespaces owned by non-initial user namespaces can't.

The workload outlined is not an uncommon one. Running VPNs in
unprivileged containers for a large number of users is something people
do and in general large routing tables in unprivileged containers are
pretty common.

There are multiple ways to fix or at least alleviate the problem:
1. Fully namespace at least max_sizes. Right now the max_sizes sysctl
   doesn't show up in network namespaces owned by non-initial user
   namespaces and hence, can't be written to by user namespace root. We
   could simply expose it to non-initial network namespaces owned by
   non-initial user namespaces.
   This solution is the cleanest one conceptually: each container
   already has a separate routing table regardless if privileged or
   unprivileged so allow them to determine the size of it too. The issue
   I see with this solution is that it would potentially make it easier
   to dos the system or even exceed kernel memory. If I'm not mistaken,
   a container could bump max_sizes to INT_MAX(2147483647) and keep
   allocating routes.
   An argument against this worry is that this is technically already
   possible by simply creating a huge number of network namespaces, each
   with a 4096 route limit and allocating 4096 routes in each of them.
   An unprivileged user could for example create at least
   /proc/sys/user/max_net_namespaces network namespaces. The limit on
   the number of network namespaces on my host is e.g. 62690 which means
   256778240 routes. So by exposing max_sizes we don't really add a new
   attack surface.
2. Namespace max_sizes but limiting it to init_net.sysctl.max_sizes. This
   would mean that a container would start with the 4096 limit but if
   the initial network namespace bumps max_sizes later the container
   could bump it too. This has two problems, first the container doesn't
   know what the host bumped max_sizes too and would need to guess by
   trying to write a larger value. Second, this would break network
   namespaces owned by the initial user namespace as they would suddenly
   be limited by the initial network namespace.
3. Inherit the limit from the initial network namespace at
   container/network namespace creation time. This would mean we don't
   fully namespace max_sizes but allow the host to set a limit that it
   is fine with each container to inherit.
   That sounds acceptable but will mean that a nested container can e.g.
   get a larger max_sizes value than the container it was created in.
4. Inherit the limit from the current network namespace. This to means
   we don't fully namespace max_sizes but allow the current network
   namespace to choose a limit it is comfortable with inheriting.

[1]: https://nsec.io/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 net/ipv6/route.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a52ec1b86432..cf951d147e3a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6246,6 +6246,7 @@ struct ctl_table * __net_init ipv6_route_sysctl_init(struct net *net)
 static int __net_init ip6_route_net_init(struct net *net)
 {
 	int ret = -ENOMEM;
+	struct net *current_net = current->nsproxy->net_ns;
 
 	memcpy(&net->ipv6.ip6_dst_ops, &ip6_dst_ops_template,
 	       sizeof(net->ipv6.ip6_dst_ops));
@@ -6294,9 +6295,9 @@ static int __net_init ip6_route_net_init(struct net *net)
 	net->ipv6.fib6_routes_require_src = 0;
 #endif
 #endif
-
 	net->ipv6.sysctl.flush_delay = 0;
-	net->ipv6.sysctl.ip6_rt_max_size = 4096;
+	net->ipv6.sysctl.ip6_rt_max_size =
+		READ_ONCE(current_net->ipv6.sysctl.ip6_rt_max_size);
 	net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
 	net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
 	net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

base-commit: 4f65e2f483b6f764c15094d14dd53dda048a4048
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] ipv6/route: inherit max_sizes from current netns
  2020-05-20 14:58 [PATCH net-next] ipv6/route: inherit max_sizes from current netns Christian Brauner
@ 2020-05-20 16:54 ` David Ahern
  2020-05-20 17:24   ` Christian Brauner
  0 siblings, 1 reply; 5+ messages in thread
From: David Ahern @ 2020-05-20 16:54 UTC (permalink / raw)
  To: Christian Brauner, David S. Miller
  Cc: Alexey Kuznetsov, Hideaki YOSHIFUJI, Jakub Kicinski, netdev,
	linux-kernel

On 5/20/20 8:58 AM, Christian Brauner wrote:
> During NorthSec (cf. [1]) a very large number of unprivileged
> containers and nested containers are run during the competition to
> provide a safe environment for the various teams during the event. Every
> year a range of feature requests or bug reports come out of this and
> this year's no different.
> One of the containers was running a simple VPN server. There were about
> 1.5k users connected to this VPN over ipv6 and the container was setup
> with about 100 custom routing tables when it hit the max_sizes routing
> limit. After this no new connections could be established anymore,
> pinging didn't work anymore; you get the idea.
> 

should have been addressed by:

commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri May 8 07:34:14 2020 -0700
    ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
    We currently have to adjust ipv6 route gc_thresh/max_size depending
    on number of cpus on a server, this makes very little sense.


Did your tests include this patch?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] ipv6/route: inherit max_sizes from current netns
  2020-05-20 16:54 ` David Ahern
@ 2020-05-20 17:24   ` Christian Brauner
  2020-05-20 17:27     ` David Ahern
  2020-05-20 17:27     ` Christian Brauner
  0 siblings, 2 replies; 5+ messages in thread
From: Christian Brauner @ 2020-05-20 17:24 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Jakub Kicinski, netdev, linux-kernel

On Wed, May 20, 2020 at 10:54:21AM -0600, David Ahern wrote:
> On 5/20/20 8:58 AM, Christian Brauner wrote:
> > During NorthSec (cf. [1]) a very large number of unprivileged
> > containers and nested containers are run during the competition to
> > provide a safe environment for the various teams during the event. Every
> > year a range of feature requests or bug reports come out of this and
> > this year's no different.
> > One of the containers was running a simple VPN server. There were about
> > 1.5k users connected to this VPN over ipv6 and the container was setup
> > with about 100 custom routing tables when it hit the max_sizes routing
> > limit. After this no new connections could be established anymore,
> > pinging didn't work anymore; you get the idea.
> > 
> 
> should have been addressed by:
> 
> commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Fri May 8 07:34:14 2020 -0700
>     ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
>     We currently have to adjust ipv6 route gc_thresh/max_size depending
>     on number of cpus on a server, this makes very little sense.
> 
> 
> Did your tests include this patch?

No, it's also pretty hard to trigger. The conference was pretty good for
this.
I tested on top of rc6. I'm probably missing the big picture here, could
you briefy explain how this commit fixes the problem we ran into?

Thanks!
Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] ipv6/route: inherit max_sizes from current netns
  2020-05-20 17:24   ` Christian Brauner
@ 2020-05-20 17:27     ` David Ahern
  2020-05-20 17:27     ` Christian Brauner
  1 sibling, 0 replies; 5+ messages in thread
From: David Ahern @ 2020-05-20 17:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Jakub Kicinski, netdev, linux-kernel

On 5/20/20 11:24 AM, Christian Brauner wrote:
> On Wed, May 20, 2020 at 10:54:21AM -0600, David Ahern wrote:
>> On 5/20/20 8:58 AM, Christian Brauner wrote:
>>> During NorthSec (cf. [1]) a very large number of unprivileged
>>> containers and nested containers are run during the competition to
>>> provide a safe environment for the various teams during the event. Every
>>> year a range of feature requests or bug reports come out of this and
>>> this year's no different.
>>> One of the containers was running a simple VPN server. There were about
>>> 1.5k users connected to this VPN over ipv6 and the container was setup
>>> with about 100 custom routing tables when it hit the max_sizes routing
>>> limit. After this no new connections could be established anymore,
>>> pinging didn't work anymore; you get the idea.
>>>
>>
>> should have been addressed by:
>>
>> commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
>> Author: Eric Dumazet <edumazet@google.com>
>> Date:   Fri May 8 07:34:14 2020 -0700
>>     ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
>>     We currently have to adjust ipv6 route gc_thresh/max_size depending
>>     on number of cpus on a server, this makes very little sense.
>>
>>
>> Did your tests include this patch?
> 
> No, it's also pretty hard to trigger. The conference was pretty good for
> this.
> I tested on top of rc6. I'm probably missing the big picture here, could
> you briefy explain how this commit fixes the problem we ran into?
> 

ipv6 still has limits on the number of dst_entry's that can be created.
Eric traced the overflow to per-cpu caches in each FIB entry.

Larger systems (lots of cpus) x lots of unique connections = overflow

Eric's change removes the per-cpu dst caches from the counting, so only
exceptions (mtu, redirect) are now counted towards the limit.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH net-next] ipv6/route: inherit max_sizes from current netns
  2020-05-20 17:24   ` Christian Brauner
  2020-05-20 17:27     ` David Ahern
@ 2020-05-20 17:27     ` Christian Brauner
  1 sibling, 0 replies; 5+ messages in thread
From: Christian Brauner @ 2020-05-20 17:27 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Jakub Kicinski, netdev, linux-kernel

On Wed, May 20, 2020 at 07:24:18PM +0200, Christian Brauner wrote:
> On Wed, May 20, 2020 at 10:54:21AM -0600, David Ahern wrote:
> > On 5/20/20 8:58 AM, Christian Brauner wrote:
> > > During NorthSec (cf. [1]) a very large number of unprivileged
> > > containers and nested containers are run during the competition to
> > > provide a safe environment for the various teams during the event. Every
> > > year a range of feature requests or bug reports come out of this and
> > > this year's no different.
> > > One of the containers was running a simple VPN server. There were about
> > > 1.5k users connected to this VPN over ipv6 and the container was setup
> > > with about 100 custom routing tables when it hit the max_sizes routing
> > > limit. After this no new connections could be established anymore,
> > > pinging didn't work anymore; you get the idea.
> > > 
> > 
> > should have been addressed by:
> > 
> > commit d8882935fcae28bceb5f6f56f09cded8d36d85e6
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Fri May 8 07:34:14 2020 -0700
> >     ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
> >     We currently have to adjust ipv6 route gc_thresh/max_size depending
> >     on number of cpus on a server, this makes very little sense.
> > 
> > 
> > Did your tests include this patch?
> 
> No, it's also pretty hard to trigger. The conference was pretty good for
> this.
> I tested on top of rc6. I'm probably missing the big picture here, could
> you briefy explain how this commit fixes the problem we ran into?

Hm, and it'd be great if we could expose the file - even just read-only
- to network namespaces owned by non-initial user namespaces. It doesn't
contain sensitive information and could probably be used to limit
connections etc.

Christian

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-05-20 17:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 14:58 [PATCH net-next] ipv6/route: inherit max_sizes from current netns Christian Brauner
2020-05-20 16:54 ` David Ahern
2020-05-20 17:24   ` Christian Brauner
2020-05-20 17:27     ` David Ahern
2020-05-20 17:27     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).