* [iproute PATCH 0/2] Netns performance improvements
@ 2016-07-05 14:42 Phil Sutter
  2016-07-05 14:42 ` [iproute PATCH 1/2] ipnetns: Move NETNS_RUN_DIR into its own propagation group Phil Sutter
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Phil Sutter @ 2016-07-05 14:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric W. Biederman, netdev

Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
when dealing with a large number of namespaces. The cause lies in the
combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
therein, and the fact that systemd makes all mount points of the system
shared.
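
Whether the parent mount of NETNS_RUN_DIR is shared can be checked with
findmnt. A minimal sketch, assuming the default NETNS_RUN_DIR of
/var/run/netns with /var/run symlinked to /run:

  # On systemd systems both / and /run typically report "shared",
  # since systemd issues the equivalent of 'mount --make-rshared /'
  # early during boot.
  findmnt -o TARGET,PROPAGATION /
  findmnt -o TARGET,PROPAGATION /run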

Phil Sutter (2):
  ipnetns: Move NETNS_RUN_DIR into its own propagation group
  ipnetns: Make netns mount points private

 ip/ipnetns.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

-- 
2.8.2

* [iproute PATCH 1/2] ipnetns: Move NETNS_RUN_DIR into its own propagation group
  2016-07-05 14:42 [iproute PATCH 0/2] Netns performance improvements Phil Sutter
@ 2016-07-05 14:42 ` Phil Sutter
  2016-07-05 14:42 ` [iproute PATCH 2/2] ipnetns: Make netns mount points private Phil Sutter
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Phil Sutter @ 2016-07-05 14:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric W. Biederman, netdev

On systems where the parent mount point is shared, NETNS_RUN_DIR
inherits the parent's propagation group. This leads to netns mount
points being propagated to the parent and thus showing up twice in the
output of 'mount'.

By making the newly mounted NETNS_RUN_DIR private first, then shared
again, it moves into its own propagation group. This still allows netns
mounts to propagate between mount namespaces while getting rid of the
duplicate netns entry.
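
For illustration only (not part of the patch), the same sequence
expressed with the mount(8) utility, assuming the default NETNS_RUN_DIR
of /var/run/netns:

  # Bind-mount the directory onto itself, detach it from the parent's
  # peer group, then make it shared again.  The --make-private step
  # ensures the subsequent --make-shared creates a fresh peer group
  # instead of keeping the parent's.
  mount --bind /var/run/netns /var/run/netns
  mount --make-private /var/run/netns
  mount --make-shared /var/run/netns
  # the "shared:N" tag should now differ from the parent mount's
  grep netns /proc/self/mountinfo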

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/ipnetns.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index b3ee23c23aaa2..1cefe73c68bfc 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -650,6 +650,11 @@ static int netns_add(int argc, char **argv)
 				NETNS_RUN_DIR, NETNS_RUN_DIR, strerror(errno));
 			return -1;
 		}
+		if (mount("", NETNS_RUN_DIR, "none", MS_PRIVATE, NULL)) {
+			fprintf(stderr, "mount --make-private %s failed: %s\n",
+				NETNS_RUN_DIR, strerror(errno));
+			return -1;
+		}
 		made_netns_run_dir_mount = 1;
 	}
 
-- 
2.8.2

* [iproute PATCH 2/2] ipnetns: Make netns mount points private
  2016-07-05 14:42 [iproute PATCH 0/2] Netns performance improvements Phil Sutter
  2016-07-05 14:42 ` [iproute PATCH 1/2] ipnetns: Move NETNS_RUN_DIR into its own propagation group Phil Sutter
@ 2016-07-05 14:42 ` Phil Sutter
  2016-07-05 14:44 ` [iproute PATCH 0/2] Netns performance improvements Eric W. Biederman
  2016-07-05 14:49 ` Phil Sutter
  3 siblings, 0 replies; 18+ messages in thread
From: Phil Sutter @ 2016-07-05 14:42 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric W. Biederman, netdev

On systems with a shared /proc mount point, the netns mounts inherit
that propagation group and therefore cause unnecessary overhead upon
netns deletion.

Make the netns mount points private instead. To achieve this, the
MS_REC flag used when making NETNS_RUN_DIR shared has to be dropped as
well, or it would make the existing netns mount points shared again.
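
The effect can be observed from a shell (illustration only, not part of
the patch; "test0" is a hypothetical namespace name):

  # A shared netns bind mount propagates its umount event to every
  # peer mount namespace upon deletion; a private one does not.
  ip netns add test0
  findmnt -o TARGET,PROPAGATION /var/run/netns/test0
  # without this patch: "shared"; with it: "private"
  ip netns del test0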

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/ipnetns.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index 1cefe73c68bfc..acaedd5894e6c 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -636,7 +636,7 @@ static int netns_add(int argc, char **argv)
 	 * file in all namespaces allowing the network namespace to be freed
 	 * sooner.
 	 */
-	while (mount("", NETNS_RUN_DIR, "none", MS_SHARED | MS_REC, NULL)) {
+	while (mount("", NETNS_RUN_DIR, "none", MS_SHARED, NULL)) {
 		/* Fail unless we need to make the mount point */
 		if (errno != EINVAL || made_netns_run_dir_mount) {
 			fprintf(stderr, "mount --make-shared %s failed: %s\n",
@@ -678,6 +678,11 @@ static int netns_add(int argc, char **argv)
 			netns_path, strerror(errno));
 		goto out_delete;
 	}
+	if (mount("", netns_path, "none", MS_PRIVATE, NULL)) {
+		fprintf(stderr, "mount --make-private %s failed: %s\n",
+			netns_path, strerror(errno));
+		return -1;
+	}
 	return 0;
 out_delete:
 	netns_delete(argc, argv);
-- 
2.8.2

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-05 14:42 [iproute PATCH 0/2] Netns performance improvements Phil Sutter
  2016-07-05 14:42 ` [iproute PATCH 1/2] ipnetns: Move NETNS_RUN_DIR into its own propagation group Phil Sutter
  2016-07-05 14:42 ` [iproute PATCH 2/2] ipnetns: Make netns mount points private Phil Sutter
@ 2016-07-05 14:44 ` Eric W. Biederman
  2016-07-05 20:51   ` Phil Sutter
  2016-07-05 14:49 ` Phil Sutter
  3 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2016-07-05 14:44 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Stephen Hemminger, netdev

Phil Sutter <phil@nwl.cc> writes:

> Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
> when dealing with a large number of namespaces. The cause lies in the
> combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
> therein, and the fact that systemd makes all mount points of the system
> shared.

So please tell me: given that it was clearly a deliberate choice in the
code to make these directories shared, and that this is not a result of
systemd making all directories shared by default, why is it better to
make these directories non-shared?

This may be the appropriate change, but saying you stress-tested things
and hit a problem without describing at what scale the problem occurred,
or anything else that would make it reproducible by anyone else, makes
it difficult to consider the merits of this change.

Sometimes things are a good default policy but have imperfect scaling on
extreme workloads.

My experience with the current situation with ip netns is that it
prevents a whole lot of confusion by making the network namespace names
visible, whichever mount namespace your processes are running in.

> Phil Sutter (2):
>   ipnetns: Move NETNS_RUN_DIR into its own propagation group
>   ipnetns: Make netns mount points private
>
>  ip/ipnetns.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

Eric

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-05 14:42 [iproute PATCH 0/2] Netns performance improvements Phil Sutter
                   ` (2 preceding siblings ...)
  2016-07-05 14:44 ` [iproute PATCH 0/2] Netns performance improvements Eric W. Biederman
@ 2016-07-05 14:49 ` Phil Sutter
  3 siblings, 0 replies; 18+ messages in thread
From: Phil Sutter @ 2016-07-05 14:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric W. Biederman, netdev

Hi,

On Tue, Jul 05, 2016 at 04:42:51PM +0200, Phil Sutter wrote:
> Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
> when dealing with a large number of namespaces. The cause lies in the
> combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
> therein, and the fact that systemd makes all mount points of the system
> shared.
> 
> Phil Sutter (2):
>   ipnetns: Move NETNS_RUN_DIR into its own propagation group
>   ipnetns: Make netns mount points private

Please disregard this series for now; I forgot to give credit to the
original author of the changeset.

Sorry for the noise!

Phil

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-05 14:44 ` [iproute PATCH 0/2] Netns performance improvements Eric W. Biederman
@ 2016-07-05 20:51   ` Phil Sutter
  2016-07-07  4:58     ` Eric W. Biederman
  0 siblings, 1 reply; 18+ messages in thread
From: Phil Sutter @ 2016-07-05 20:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen Hemminger, netdev

Hi Eric,

Thanks for your quick and insightful reply rightfully pointing out the
lack of rationale behind this change. So let me try to catch up:

On Tue, Jul 05, 2016 at 09:44:00AM -0500, Eric W. Biederman wrote:
> Phil Sutter <phil@nwl.cc> writes:
> 
> > Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
> > when dealing with a high amount of namespaces. The cause of this lies in
> > the combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
> > therein and the fact that systemd makes all mount points of the system
> > shared.
> 
> So please tell me.  Given that it was clearly a deliberate choice in the
> code to make these directories shared, and that this is not a result
> of a systemd making all directories shared by default.  Why is it
> better to these directories non-shared?

NETNS_RUN_DIR itself is kept shared as it was intended by you (I hope).
The only difference is that we should avoid it being in the same group
as the parent mount point. Otherwise, all netns mount points will occur
twice.

Regarding the shared state of the netns mount points, I actually have
no idea what the benefit is, as there won't be any child mount points
and therefore no propagation should occur. Or am I missing something?

> This may be the appropriate change but saying you stress testing things
> and have a problem but do not describe how large a scale you had a
> problem, or anything else to make your problem reproducible by anyone
> else makes it difficult to consider the merits of this change.
> 
> Sometimes things are a good default policy but have imperfect scaling on
> extreme workloads.
> 
> My experience with the current situtation with ip netns is that it
> prevents a whole lot of confusion by making the network namespace names
> visible whichever mount namespace your processes are running in.

The only functional difference I noticed was that the netns mount
points no longer appear twice. They are still visible in all namespaces
though, just as before.

Here's the script I wrote to benchmark 'ip netns':

| #!/bin/bash
| 
| IP=${IP:-/usr/sbin/ip}
| echo "using ip at $IP"
| 
| # make sure we start at a clean state
| for netns in $(ls /run/netns/* 2>/dev/null); do
|         $IP netns del ${netns##*/}
| done
| umount /run/netns
| 
| echo "creating 100 mount ns"
| touch /tmp/stay_alive
| for ((i = 0; i < 100; i++)); do
|         unshare -m --propagation unchanged bash -c \
| 		"while [[ -e /tmp/stay_alive ]]; do sleep 1; done" &
| done
| # give a little time for unshare to complete
| sleep 3
| 
| nscount=1000
| 
| echo -en "\ncreating $nscount netns"
| time (for ((i = 0; i < $nscount; i++)); do $IP netns add test$i; done)
| 
| echo -en "\ndeleting $nscount netns"
| time (for ((i = 0; i < $nscount; i++)); do $IP netns del test$i; done)
| 
| echo "removing mount ns again"
| rm /tmp/stay_alive
| wait

So basically it creates 100 idle mount namespaces, then times
adding/removing 1000 network namespaces. I called it three times:
without any patch, with just patch 1 and with both patches applied. Here
are the results:

| # IP=/tmp/base/ip /vmshare/reproducer/ip_netns_bench.sh
| using ip at /tmp/base/ip
| creating 100 mount ns
| 
| creating 1000 netns
| real	0m8.110s
| user	0m1.143s
| sys	0m6.235s
| 
| deleting 1000 netns
| real	0m15.347s
| user	0m0.957s
| sys	0m11.359s
| removing mount ns again

| # IP=/tmp/p1/ip /vmshare/reproducer/ip_netns_bench.sh
| using ip at /tmp/p1/ip
| creating 100 mount ns
| 
| creating 1000 netns
| real	0m7.956s
| user	0m0.987s
| sys	0m4.896s
| 
| deleting 1000 netns
| real	0m7.407s
| user	0m1.165s
| sys	0m3.418s
| removing mount ns again

| # IP=/tmp/p2/ip /vmshare/reproducer/ip_netns_bench.sh
| using ip at /tmp/p2/ip
| creating 100 mount ns
| 
| creating 1000 netns
| real	0m7.843s
| user	0m0.977s
| sys	0m4.915s
| 
| deleting 1000 netns
| real	0m6.407s
| user	0m1.006s
| sys	0m3.057s
| removing mount ns again

As you can see, the biggest improvement comes during deletion and from
patch 1. The second patch lowers the total time to delete the namespaces
by another second, which is still a sizeable amount relative to the
already low total time.

Cheers, Phil

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-05 20:51   ` Phil Sutter
@ 2016-07-07  4:58     ` Eric W. Biederman
  2016-07-07 11:17       ` Phil Sutter
  0 siblings, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2016-07-07  4:58 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Stephen Hemminger, netdev

Phil Sutter <phil@nwl.cc> writes:

> Hi Eric,
>
> Thanks for your quick and insightful reply rightfully pointing out the
> lack of rationale behind this change. So let me try to catch up:

Grr.  I did not get what you are trying to accomplish the first time I
skimmed this, and rereading it all again closely I still don't get what
you are trying to accomplish.

What real world scenario do you have that approximates 100 mount
namespaces all sharing with each other with 1000 network namespaces
in that shared world?

I am inclined to suspect you are setting up containers that don't
contain and those 100 mount namespaces that share with each other
are your real concern.  But I don't know.

> On Tue, Jul 05, 2016 at 09:44:00AM -0500, Eric W. Biederman wrote:
>> Phil Sutter <phil@nwl.cc> writes:
>> 
>> > Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
>> > when dealing with a large number of namespaces. The cause lies in the
>> > combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
>> > therein, and the fact that systemd makes all mount points of the system
>> > shared.
>> 
>> So please tell me: given that it was clearly a deliberate choice in the
>> code to make these directories shared, and that this is not a result of
>> systemd making all directories shared by default, why is it better to
>> make these directories non-shared?
>
> NETNS_RUN_DIR itself is kept shared as it was intended by you (I hope).
> The only difference is that we should avoid it being in the same group
> as the parent mount point. Otherwise, all netns mount points will occur
> twice.

How do they occur twice?  Are you dealing with a system that bind mounts
/run and /var/run?  The netns mount points occurring twice sounds
correct in that scenario.  Replacing a bind mount with a symlink would
be a more appropriate fix if you are concerned with the mount overhead.

> Regarding the shared state of the netns mount points, I actually have
> no idea what the benefit is, as there won't be any child mount points
> and therefore no propagation should occur. Or am I missing something?

I think the second patch is probably ok.  I get turned around with the
finer points of mount propagation some days, as it is the parent mount
whose attributes matter when it comes to propagating the children.

Still, if the change semantically does not matter we have a missing
optimization in the kernel, and I would much rather implement that
optimization in the kernel than in every application that might possibly
hit it.  Especially given that the default on systemd systems is
"mount --make-rshared /"

>> This may be the appropriate change, but saying you stress-tested things
>> and hit a problem without describing at what scale the problem occurred,
>> or anything else that would make it reproducible by anyone else, makes
>> it difficult to consider the merits of this change.
>>
>> Sometimes things are a good default policy but have imperfect scaling on
>> extreme workloads.
>>
>> My experience with the current situation with ip netns is that it
>> prevents a whole lot of confusion by making the network namespace names
>> visible, whichever mount namespace your processes are running in.
>
> The only functional difference I noticed was that the netns mount
> points no longer appear twice. They are still visible in all namespaces
> though, just as before.

But you are fighting how the rest of the system is configured at that
point, and that concerns me.  iproute is not the place to reconfigure
the system.

> Here's the script I wrote to benchmark 'ip netns':
>
> | #!/bin/bash
> | 
> | IP=${IP:-/usr/sbin/ip}
> | echo "using ip at $IP"
> | 
> | # make sure we start at a clean state
> | for netns in $(ls /run/netns/* 2>/dev/null); do
> |         $IP netns del ${netns##*/}
> | done
> | umount /run/netns
> | 
> | echo "creating 100 mount ns"
> | touch /tmp/stay_alive
> | for ((i = 0; i < 100; i++)); do
> |         unshare -m --propagation unchanged bash -c \
> | 		"while [[ -e /tmp/stay_alive ]]; do sleep 1; done" &
> | done
> | # give a little time for unshare to complete
> | sleep 3
> | 
> | nscount=1000
> | 
> | echo -en "\ncreating $nscount netns"
> | time (for ((i = 0; i < $nscount; i++)); do $IP netns add test$i; done)
> | 
> | echo -en "\ndeleting $nscount netns"
> | time (for ((i = 0; i < $nscount; i++)); do $IP netns del test$i; done)
> | 
> | echo "removing mount ns again"
> | rm /tmp/stay_alive
> | wait
>
> So basically it creates 100 idle mount namespaces, then times
> adding/removing 1000 network namespaces. I called it three times:
> without any patch, with just patch 1 and with both patches applied. Here
> are the results:
>
> | # IP=/tmp/base/ip /vmshare/reproducer/ip_netns_bench.sh
> | using ip at /tmp/base/ip
> | creating 100 mount ns
> | 
> | creating 1000 netns
> | real	0m8.110s
> | user	0m1.143s
> | sys	0m6.235s
> | 
> | deleting 1000 netns
> | real	0m15.347s
> | user	0m0.957s
> | sys	0m11.359s
> | removing mount ns again
>
> | # IP=/tmp/p1/ip /vmshare/reproducer/ip_netns_bench.sh
> | using ip at /tmp/p1/ip
> | creating 100 mount ns
> | 
> | creating 1000 netns
> | real	0m7.956s
> | user	0m0.987s
> | sys	0m4.896s
> | 
> | deleting 1000 netns
> | real	0m7.407s
> | user	0m1.165s
> | sys	0m3.418s
> | removing mount ns again
>
> | # IP=/tmp/p2/ip /vmshare/reproducer/ip_netns_bench.sh
> | using ip at /tmp/p2/ip
> | creating 100 mount ns
> | 
> | creating 1000 netns
> | real	0m7.843s
> | user	0m0.977s
> | sys	0m4.915s
> | 
> | deleting 1000 netns
> | real	0m6.407s
> | user	0m1.006s
> | sys	0m3.057s
> | removing mount ns again
>
> As you can see, the biggest improvement comes during deletion and from
> patch 1. The second patch lowers the total time to delete the namespaces
> by another second, which is still a sizeable amount relative to the
> already low total time.

Which all seems to be about making /run/netns and /var/run/netns not
shared with each other, which appears to be semantically wrong.

Eric

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07  4:58     ` Eric W. Biederman
@ 2016-07-07 11:17       ` Phil Sutter
  2016-07-07 12:59         ` Nicolas Dichtel
  0 siblings, 1 reply; 18+ messages in thread
From: Phil Sutter @ 2016-07-07 11:17 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Stephen Hemminger, netdev

Hi,

On Wed, Jul 06, 2016 at 11:58:54PM -0500, Eric W. Biederman wrote:
> Phil Sutter <phil@nwl.cc> writes:
> 
> > Hi Eric,
> >
> > Thanks for your quick and insightful reply rightfully pointing out the
> > lack of rationale behind this change. So let me try to catch up:
> 
> Grr.  I did not get what you are trying to accomplish the first time I
> skimmed this, and rereading it all again closely I still don't get what
> you are trying to accomplish.

Maybe I did not get what information you are missing. Communication
issues always include two parties. :)

> What real world scenario do you have that approximates 100 mount
> namespaces all sharing with each other with 1000 network namespaces
> in that shared world?
> 
> I am inclined to suspect you are setting up containers that don't
> contain and those 100 mount namespaces that share with each other
> are your real concern.  But I don't know.

The issue came up during OpenStack Neutron testing, see this ticket for
reference:

https://bugzilla.redhat.com/show_bug.cgi?id=1310795

> > On Tue, Jul 05, 2016 at 09:44:00AM -0500, Eric W. Biederman wrote:
> >> Phil Sutter <phil@nwl.cc> writes:
> >> 
> >> > Stress-testing OpenStack Neutron revealed poor performance of 'ip netns'
> >> > when dealing with a large number of namespaces. The cause lies in the
> >> > combination of how iproute2 mounts NETNS_RUN_DIR and the netns files
> >> > therein, and the fact that systemd makes all mount points of the system
> >> > shared.
> >> 
> >> So please tell me: given that it was clearly a deliberate choice in the
> >> code to make these directories shared, and that this is not a result of
> >> systemd making all directories shared by default, why is it better to
> >> make these directories non-shared?
> >
> > NETNS_RUN_DIR itself is kept shared as it was intended by you (I hope).
> > The only difference is that we should avoid it being in the same group
> > as the parent mount point. Otherwise, all netns mount points will occur
> > twice.
> 
> How do they occur twice?  Are you dealing with a system that bind mounts
> /run and /var/run?  The netns mount points occurring twice sounds
> correct in that scenario.  Replacing a bind mount with a symlink would
> be a more appropriate fix if you are concerned with the mount overhead.

In RHEL7, /var/run is a symlink to ../run. /run itself is a tmpfs mount.
After creating a namespace 'foo', findmnt lists /run/netns/foo as a
child of /run and /run/netns, hence it occurs twice in mount output.
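
A sketch of how to reproduce that, using a hypothetical namespace "foo":

  ip netns add foo
  findmnt | grep 'netns/foo'
  # expected on an affected system: one entry below /run and a
  # second below the /run/netns bind mount
  ip netns del foo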

> > Regarding the shared state of the netns mount points, I actually have
> > no idea what the benefit is, as there won't be any child mount points
> > and therefore no propagation should occur. Or am I missing something?
> 
> I think the second patch is probably ok.  I get turned around with the
> finer points of mount propagation some days, as it is the parent mount
> whose attributes matter when it comes to propagating the children.
>
> Still, if the change semantically does not matter we have a missing
> optimization in the kernel, and I would much rather implement that
> optimization in the kernel than in every application that might possibly
> hit it.  Especially given that the default on systemd systems is
> "mount --make-rshared /"

Which change are you talking about that semantically does not matter?

> >> This may be the appropriate change, but saying you stress-tested things
> >> and hit a problem without describing at what scale the problem occurred,
> >> or anything else that would make it reproducible by anyone else, makes
> >> it difficult to consider the merits of this change.
> >>
> >> Sometimes things are a good default policy but have imperfect scaling on
> >> extreme workloads.
> >>
> >> My experience with the current situation with ip netns is that it
> >> prevents a whole lot of confusion by making the network namespace names
> >> visible, whichever mount namespace your processes are running in.
> >
> > The only functional difference I noticed was that the netns mount
> > points no longer appear twice. They are still visible in all namespaces
> > though, just as before.
> 
> But you are fighting how the rest of the system is configured at that
> point, and that concerns me.  iproute is not the place to reconfigure
> the system.

But iproute is in control of the /run/netns mount point, at least in
that it manipulates its propagation flags. Therefore it should try not
to cause unexpected results, irrespective of how the parent mount point
is set up by the system.

> > Here's the script I wrote to benchmark 'ip netns':
> >
[...]
> >
> > As you can see, the biggest improvement comes during deletion and from
> > patch 1. The second patch lowers the total time to delete the namespaces
> > by another second, which is still a sizeable amount relative to the
> > already low total time.
> 
> Which all seems to be about making /run/netns and /var/run/netns not
> shared with each other, which appears to be semantically wrong.

No, it's basically about not making /run and /run/netns shared with
each other, since that is unnecessary.

I hope this clarifies things a bit.

Cheers, Phil

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 11:17       ` Phil Sutter
@ 2016-07-07 12:59         ` Nicolas Dichtel
  2016-07-07 15:48           ` Phil Sutter
  0 siblings, 1 reply; 18+ messages in thread
From: Nicolas Dichtel @ 2016-07-07 12:59 UTC (permalink / raw)
  To: Phil Sutter, Eric W. Biederman, Stephen Hemminger, netdev

On 07/07/2016 13:17, Phil Sutter wrote:
[snip]
> The issue came up during OpenStack Neutron testing, see this ticket for
> reference:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1310795
Access to this ticket is not public :(

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 12:59         ` Nicolas Dichtel
@ 2016-07-07 15:48           ` Phil Sutter
  2016-07-07 16:16             ` Rick Jones
  0 siblings, 1 reply; 18+ messages in thread
From: Phil Sutter @ 2016-07-07 15:48 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: Eric W. Biederman, Stephen Hemminger, netdev

On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:
> On 07/07/2016 13:17, Phil Sutter wrote:
> [snip]
> > The issue came up during OpenStack Neutron testing, see this ticket for
> > reference:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1310795
> Access to this ticket is not public :(

*Sigh* OK, here are a few quotes:

"OpenStack Neutron controller nodes, when undergoing testing, are
locking up specifically during creation and mounting of namespaces.
They appear to be blocking behind vfsmount_lock, and contention for the
namespace_sem"

"During the scale testing, we have 300 routers, 600 dhcp namespaces
spread across four neutron network nodes. When then start as one set of
standard Openstack Rally benchmark test cycle against neutron. An
example scenario is creating 10x networks, list them, delete them and
repeat 10x times. The second set performs an L3 benchmark test between
two instances."

Cheers, Phil

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 15:48           ` Phil Sutter
@ 2016-07-07 16:16             ` Rick Jones
  2016-07-07 16:34               ` Eric W. Biederman
  2016-07-08  8:01               ` Nicolas Dichtel
  0 siblings, 2 replies; 18+ messages in thread
From: Rick Jones @ 2016-07-07 16:16 UTC (permalink / raw)
  To: Phil Sutter, Nicolas Dichtel, Eric W. Biederman,
	Stephen Hemminger, netdev

On 07/07/2016 08:48 AM, Phil Sutter wrote:
> On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:
>> On 07/07/2016 13:17, Phil Sutter wrote:
>> [snip]
>>> The issue came up during OpenStack Neutron testing, see this ticket for
>>> reference:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1310795
>> Access to this ticket is not public :(
>
> *Sigh* OK, here are a few quotes:
>
> "OpenStack Neutron controller nodes, when undergoing testing, are
> locking up specifically during creation and mounting of namespaces.
> They appear to be blocking behind vfsmount_lock, and contention for the
> namespace_sem"
>
> "During the scale testing, we have 300 routers, 600 dhcp namespaces
> spread across four neutron network nodes. When then start as one set of
> standard Openstack Rally benchmark test cycle against neutron. An
> example scenario is creating 10x networks, list them, delete them and
> repeat 10x times. The second set performs an L3 benchmark test between
> two instances."
>

Those 300 routers will each have at least one namespace along with the 
dhcp namespaces.  Depending on the nature of the routers (Distributed 
versus Centralized Virtual Routers - DVR vs CVR) and whether the routers 
are supposed to be "HA" there can be more than one namespace for a given 
router.

300 routers is far from the upper limit/goal.  Back in HP Public Cloud, 
we were running as many as 700 routers per network node (*), and more 
than four network nodes. (back then it was just the one namespace per 
router and network). Mileage will of course vary based on the "oomph" of 
one's network node(s).

happy benchmarking,

rick jones

* Didn't want to go much higher than that because each router had a port 
on a common linux bridge and getting to > 1024 would be an unpleasant day.

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 16:16             ` Rick Jones
@ 2016-07-07 16:34               ` Eric W. Biederman
  2016-07-07 17:28                 ` Rick Jones
  2016-07-08  8:01               ` Nicolas Dichtel
  1 sibling, 1 reply; 18+ messages in thread
From: Eric W. Biederman @ 2016-07-07 16:34 UTC (permalink / raw)
  To: Rick Jones; +Cc: Phil Sutter, Nicolas Dichtel, Stephen Hemminger, netdev

Rick Jones <rick.jones2@hpe.com> writes:

> On 07/07/2016 08:48 AM, Phil Sutter wrote:
>> On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:
>>> On 07/07/2016 13:17, Phil Sutter wrote:
>>> [snip]
>>>> The issue came up during OpenStack Neutron testing, see this ticket for
>>>> reference:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1310795
>>> Access to this ticket is not public :(
>>
>> *Sigh* OK, here are a few quotes:
>>
>> "OpenStack Neutron controller nodes, when undergoing testing, are
>> locking up specifically during creation and mounting of namespaces.
>> They appear to be blocking behind vfsmount_lock, and contention for the
>> namespace_sem"
>>
>> "During the scale testing, we have 300 routers, 600 dhcp namespaces
>> spread across four neutron network nodes. When then start as one set of
>> standard Openstack Rally benchmark test cycle against neutron. An
>> example scenario is creating 10x networks, list them, delete them and
>> repeat 10x times. The second set performs an L3 benchmark test between
>> two instances."
>>
>
> Those 300 routers will each have at least one namespace along with the
> dhcp namespaces.  Depending on the nature of the routers (Distributed
> versus Centralized Virtual Routers - DVR vs CVR) and whether the
> routers are supposed to be "HA" there can be more than one namespace
> for a given router.
>
> 300 routers is far from the upper limit/goal.  Back in HP Public
> Cloud, we were running as many as 700 routers per network node (*),
> and more than four network nodes. (back then it was just the one
> namespace per router and network). Mileage will of course vary based
> on the "oomph" of one's network node(s).

To clarify: processes for these routers and dhcp servers are created
with "ip netns exec"?

If that is the case, and you are using this feature as effectively a
lightweight container and not lots of vrfs in a single network stack,
then I suspect much larger gains can be had by creating a variant of
ip netns exec that avoids the mount propagation.
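
For daemons that only need the network namespace, util-linux nsenter can
already approximate such a variant, since it merely does a setns() into
the netns without the mount namespace setup that "ip netns exec"
performs. A sketch, with a hypothetical namespace name "qrouter-1" and
"some-daemon" standing in for whatever should run inside:

  # enter only the network namespace; no new mount namespace and no
  # bind-mount propagation involved
  nsenter --net=/var/run/netns/qrouter-1 some-daemon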

> happy benchmarking,
>
> rick jones
>
> * Didn't want to go much higher than that because each router had a
> port on a common linux bridge and getting to > 1024 would be an
> unpleasant day.

* I would have thought all you have to do is bump up the size
  of the linux neighbour cache.  echo $BIGNUM > /proc/sys/net/ipv4/neigh/default/gc_thresh3

Eric

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 16:34               ` Eric W. Biederman
@ 2016-07-07 17:28                 ` Rick Jones
  2016-07-08  8:12                   ` Eric W. Biederman
  2016-07-08 14:31                   ` Brian Haley
  0 siblings, 2 replies; 18+ messages in thread
From: Rick Jones @ 2016-07-07 17:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Phil Sutter, Nicolas Dichtel, Stephen Hemminger, netdev

On 07/07/2016 09:34 AM, Eric W. Biederman wrote:
> Rick Jones <rick.jones2@hpe.com> writes:
>> 300 routers is far from the upper limit/goal.  Back in HP Public
>> Cloud, we were running as many as 700 routers per network node (*),
>> and more than four network nodes. (back then it was just the one
>> namespace per router and network). Mileage will of course vary based
>> on the "oomph" of one's network node(s).
>
> To clarify: processes for these routers and dhcp servers are created
> with "ip netns exec"?

I believe so, but it would be good to have someone else confirm that, 
and speak to your paragraph below.

> If that is the case, and you are using this feature as effectively a
> lightweight container and not lots of vrfs in a single network stack,
> then I suspect much larger gains can be had by creating a variant of
> ip netns exec that avoids the mount propagation.
>

...

>> * Didn't want to go much higher than that because each router had a
>> port on a common linux bridge and getting to > 1024 would be an
>> unpleasant day.
>
> * I would have thought all you have to do is bump up the size
>    of the linux neighbour cache.  echo $BIGNUM > /proc/sys/net/ipv4/neigh/default/gc_thresh3

We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

rick

Having a bit of deja vu, but I suspect things like commit
0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same
wavelength; just my brain seeing "namespaces" and "performance" and
lighting up :)

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 16:16             ` Rick Jones
  2016-07-07 16:34               ` Eric W. Biederman
@ 2016-07-08  8:01               ` Nicolas Dichtel
  2016-07-08 17:18                 ` Rick Jones
  1 sibling, 1 reply; 18+ messages in thread
From: Nicolas Dichtel @ 2016-07-08  8:01 UTC (permalink / raw)
  To: Rick Jones, Phil Sutter, Eric W. Biederman, Stephen Hemminger, netdev

On 07/07/2016 18:16, Rick Jones wrote:
> On 07/07/2016 08:48 AM, Phil Sutter wrote:
>> On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:
>>> On 07/07/2016 13:17, Phil Sutter wrote:
>>> [snip]
>>>> The issue came up during OpenStack Neutron testing, see this ticket for
>>>> reference:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1310795
>>> Access to this ticket is not public :(
>>
>> *Sigh* OK, here are a few quotes:
>>
>> "OpenStack Neutron controller nodes, when undergoing testing, are
>> locking up specifically during creation and mounting of namespaces.
>> They appear to be blocking behind vfsmount_lock, and contention for the
>> namespace_sem"
>>
>> "During the scale testing, we have 300 routers, 600 dhcp namespaces
>> spread across four neutron network nodes. When then start as one set of
>> standard Openstack Rally benchmark test cycle against neutron. An
>> example scenario is creating 10x networks, list them, delete them and
>> repeat 10x times. The second set performs an L3 benchmark test between
>> two instances."
>>
> 
> Those 300 routers will each have at least one namespace along with the dhcp
> namespaces.  Depending on the nature of the routers (Distributed versus
> Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed
> to be "HA" there can be more than one namespace for a given router.
> 
> 300 routers is far from the upper limit/goal.  Back in HP Public Cloud, we were
> running as many as 700 routers per network node (*), and more than four network
> nodes. (back then it was just the one namespace per router and network). Mileage
> will of course vary based on the "oomph" of one's network node(s).
Thank you for the details.

Do you have a script or something else to easily reproduce this problem?

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 17:28                 ` Rick Jones
@ 2016-07-08  8:12                   ` Eric W. Biederman
  2016-07-08 14:31                   ` Brian Haley
  1 sibling, 0 replies; 18+ messages in thread
From: Eric W. Biederman @ 2016-07-08  8:12 UTC (permalink / raw)
  To: Rick Jones; +Cc: Phil Sutter, Nicolas Dichtel, Stephen Hemminger, netdev

Rick Jones <rick.jones2@hpe.com> writes:

> On 07/07/2016 09:34 AM, Eric W. Biederman wrote:
>> Rick Jones <rick.jones2@hpe.com> writes:
>>> 300 routers is far from the upper limit/goal.  Back in HP Public
>>> Cloud, we were running as many as 700 routers per network node (*),
>>> and more than four network nodes. (back then it was just the one
>>> namespace per router and network). Mileage will of course vary based
>>> on the "oomph" of one's network node(s).
>>
>> To clarify: processes for these routers and dhcp servers are created
>> with "ip netns exec"?
>
> I believe so, but it would be good to have someone else confirm that, and speak
> to your paragraph below.

>> If that is the case, and you are using this feature as effectively a
>> lightweight container and not lots of vrfs in a single network stack,
>> then I suspect much larger gains can be had by creating a variant of
>> ip netns exec that avoids the mount propagation.
>>
>
> ...
>
>>> * Didn't want to go much higher than that because each router had a
>>> port on a common linux bridge and getting to > 1024 would be an
>>> unpleasant day.
>>
>> * I would have thought all you have to do is bump up the size
>>    of the linux neighbour cache.  echo $BIGNUM > /proc/sys/net/ipv4/neigh/default/gc_thresh3
>
> We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

Silly linux bridge.  I haven't run into that one.

> Having a bit of deja vu, but I suspect things like commit
> 0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same
> wavelength; just my brain seeing "namespaces" and "performance" and
> lighting up :)

Actually that could still be relevant. 100,000 or so mount entries is
larger than the 16384 mount hash entries on the machine I am looking at,
giving an expected average hash chain length of about 6.  So it might be
worth playing with the mhash= and mphash= kernel command line entries
and seeing if upping the count helps.  For upstream it is probably very
much worth looking at making the mount hash an rhashtable so it grows
to the size that is needed.

I looked a little more and I see where the double mounts are coming
from.  Because "ip netns" creates /var/run/netns as a local bind mount
of itself we get one copy of the mounts below the bind mount and
another copy above.  Ugh.

Unfortunately I think the way the first patch solves this (by breaking
mount propagation with the parent) will fail to do the right thing in
cases where "ip netns add" is called from a mount namespace with just a
private /tmp, like systemd creates to run services in.  If the mount
propagation is broken by making the bind mount private, I can't see how
the network namespace file descriptor mounts would propagate to the
rest of the ordinary mount namespaces in the system.

Unfortunately the semantics of the mount propagation directives were
not designed for easy use.  It seems extremely easy to do the wrong
thing.

So I think the correct way to avoid double mounts, and to safely and
reliably do what patch 1 is trying to do, is to read
/proc/self/mountinfo and see if /var/run/netns is under a shared mount
point (possibly itself).  If so, go on to creating the mount point for
the netns file descriptor.  Otherwise make /var/run/netns a bind mount
to itself and ensure it is marked MS_SHARED.
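
A minimal sketch of that check, assuming the mountinfo field layout
documented in proc(5), where the optional fields between the mount
options and the "-" separator carry the "shared:N" tags:

  # Print the propagation tags, if any, of the mounts that can contain
  # /var/run/netns here; no output means no shared propagation.
  awk '$5 == "/run" || $5 == "/var/run/netns" {
          for (i = 7; $i != "-"; i++)
                  print $5, $i
  }' /proc/self/mountinfo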

Effectively that is runtime detection of systemd.  But since it keys off
of what is actually happening on the system it will work in whatever
strange environment "ip netns" happens to be run in.

Eric

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-07 17:28                 ` Rick Jones
  2016-07-08  8:12                   ` Eric W. Biederman
@ 2016-07-08 14:31                   ` Brian Haley
  1 sibling, 0 replies; 18+ messages in thread
From: Brian Haley @ 2016-07-08 14:31 UTC (permalink / raw)
  To: Rick Jones, Eric W. Biederman
  Cc: Phil Sutter, Nicolas Dichtel, Stephen Hemminger, netdev

On 07/07/2016 01:28 PM, Rick Jones wrote:
> On 07/07/2016 09:34 AM, Eric W. Biederman wrote:
>> Rick Jones <rick.jones2@hpe.com> writes:
>>> 300 routers is far from the upper limit/goal.  Back in HP Public
>>> Cloud, we were running as many as 700 routers per network node (*),
>>> and more than four network nodes. (back then it was just the one
>>> namespace per router and network). Mileage will of course vary based
>>> on the "oomph" of one's network node(s).
>>
>> To clarify: processes for these routers and dhcp servers are created
>> with "ip netns exec"?
>
> I believe so, but it would be good to have someone else confirm that, and speak
> to your paragraph below.

Yes, the namespace is created and configured, then in the case of dhcp an 'ip 
netns exec $namespace dnsmasq ...' is run.  Routers typically have a small 
daemon running "inside" as well.

>> If that is the case, and you are using this feature as effectively a
>> lightweight container and not lots of vrfs in a single network stack,
>> then I suspect much larger gains can be had by creating a variant of
>> ip netns exec that avoids the mount propagation.

So you're thinking of a new command like 'ip netns daemon $namespace ...'?
Or if there's a better way to accomplish this with other tools today, I'd
be interested, as waiting for a new iproute2 to ripple through the distros
could take a while.

-Brian

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-08  8:01               ` Nicolas Dichtel
@ 2016-07-08 17:18                 ` Rick Jones
  2016-07-11 12:51                   ` Nicolas Dichtel
  0 siblings, 1 reply; 18+ messages in thread
From: Rick Jones @ 2016-07-08 17:18 UTC (permalink / raw)
  To: nicolas.dichtel, Phil Sutter, Eric W. Biederman,
	Stephen Hemminger, netdev

On 07/08/2016 01:01 AM, Nicolas Dichtel wrote:
>> Those 300 routers will each have at least one namespace along with the dhcp
>> namespaces.  Depending on the nature of the routers (Distributed versus
>> Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed
>> to be "HA" there can be more than one namespace for a given router.
>>
>> 300 routers is far from the upper limit/goal.  Back in HP Public Cloud, we were
>> running as many as 700 routers per network node (*), and more than four network
>> nodes. (back then it was just the one namespace per router and network). Mileage
>> will of course vary based on the "oomph" of one's network node(s).
> Thank you for the details.
>
> Do you have a script or something else to easily reproduce this problem?

Do you mean for my much older, slightly different stuff done in HP 
Public Cloud, or for what Phil (?) is doing presently?  I believe Phil 
posted something several messages back in the thread.

happy benchmarking,

rick jones

* Re: [iproute PATCH 0/2] Netns performance improvements
  2016-07-08 17:18                 ` Rick Jones
@ 2016-07-11 12:51                   ` Nicolas Dichtel
  0 siblings, 0 replies; 18+ messages in thread
From: Nicolas Dichtel @ 2016-07-11 12:51 UTC (permalink / raw)
  To: Rick Jones, Phil Sutter, Eric W. Biederman, Stephen Hemminger, netdev

On 08/07/2016 19:18, Rick Jones wrote:
> On 07/08/2016 01:01 AM, Nicolas Dichtel wrote:
[snip]
>> Do you have a script or something else to easily reproduce this problem?
> 
> Do you mean for my much older, slightly different stuff done in HP Public Cloud,
> or for what Phil (?) is doing presently?  I believe Phil posted something
> several messages back in the thread.
I was thinking of Phil's scenario.


Thank you,
Nicolas
