* Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
@ 2014-06-11  5:52 Rafael Tinoco
  2014-06-11  7:07 ` Eric W. Biederman
  2014-06-11 13:39 ` Paul E. McKenney
  0 siblings, 2 replies; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-11  5:52 UTC
  To: paulmck, linux-kernel; +Cc: davem, ebiederm, Dave Chiluk, Christopher Arges

Paul E. McKenney, Eric Biederman, David Miller (and/or anyone else interested):

It was brought to my attention that netns creation/execution might
have suffered a scalability/performance regression after v3.8.

I would like you, or anyone interested, to review these charts/data
and check whether there is anything worth discussing before I dig
further.

The following script was used for all the tests and charts generation:

====
#!/bin/bash
IP=/sbin/ip

function add_fake_router_uuid() {
    j=`uuidgen`
    $IP netns add bar-${j}
    $IP netns exec bar-${j} $IP link set lo up
    $IP netns exec bar-${j} sysctl -w net.ipv4.ip_forward=1 > /dev/null
    k=`echo $j | cut -b -11`
    $IP link add qro-${k} type veth peer name qri-${k} netns bar-${j}
    $IP link add qgo-${k} type veth peer name qgi-${k} netns bar-${j}
}

for i in `seq 1 $1`; do
    if [ `expr $i % 250` -eq 0 ]; then
        echo "$i by `date +%s`"
    fi
    add_fake_router_uuid
done
====

This script reports how many "fake routers" are added per second (from
0 up to a 3000-router creation mark, for example; the argument gives
the total to create). With this and a git bisect on the kernel tree I
was led to one specific commit causing the scalability/performance
regression: #911af50 "rcu: Provide compile-time control for no-CBs
CPUs". Even though this change was experimental at that point, it
introduced a performance/scalability regression (explained below) that
persists.

RCU-related code looked to be responsible for the problem. Based on
that, for every commit from tag v3.8 to master that changed any of
these files: "kernel/rcutree.c kernel/rcutree.h kernel/rcutree_plugin.h
include/trace/events/rcu.h include/linux/rcupdate.h", the kernel was
checked out, compiled and tested. The idea was to track the performance
regression across rcu development, if rcu was indeed the cause. In the
worst case, the regression not being related to rcu, I would still have
chronological data to interpret.

All text below refers to 2 groups of charts, generated during the study:

====
1) Kernel git tags from 3.8 to 3.14.
*** http://people.canonical.com/~inaddy/lp1328088/charts/250-tag.html ***

2) Kernel git commits for rcu development (111 commits) -> Clearly
shows regressions:
*** http://people.canonical.com/~inaddy/lp1328088/charts/250.html ***

Obs:

1) There is a general chart with 111 commits. With this chart you can
see the performance evolution/regression at each test mark. Test marks
go from 0 to 2500 and refer to "fake routers already created". Example:
Throughput was 50 routers/sec at the 250-already-created mark and 30
routers/sec at the 1250 mark.

2) Clicking on a specific commit will give you that commit evolution
from 0 routers already created to 2500 routers already created mark.
====

Since results differed depending on how many cpus were used and on how
the no-cb cpus were configured, 3 kernel config options were used for
every measurement, with 1 and 4 cpus.

====
- CONFIG_RCU_NOCB_CPU (disabled): nocbno
- CONFIG_RCU_NOCB_CPU_ALL (enabled): nocball
- CONFIG_RCU_NOCB_CPU_NONE (enabled): nocbnone

Obs: For the 1 cpu cases, nocbno, nocbnone and nocball behave the same
(or should), since with only 1 cpu there is no no-cb cpu.
====
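
For clarity, the three labels map onto .config fragments roughly as
follows (a sketch inferred from the option names; NONE and ALL are
choices that sit under CONFIG_RCU_NOCB_CPU=y):

====
# nocbno: callback offloading compiled out entirely
# CONFIG_RCU_NOCB_CPU is not set

# nocbnone: offloading compiled in, no CPUs offloaded by default
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_NONE=y

# nocball: offloading compiled in and applied to every CPU
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

# (specific CPUs can also be offloaded at boot time via the rcu_nocbs=
# kernel parameter, e.g. rcu_nocbs=0-3)
====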

After the charts were generated it was clear that NOCB_CPU_ALL (4 cpus)
hurt the "fake routers" creation performance, and that this regression
continues up to the current upstream version. It was also clear that,
after commit #911af50, having more than 1 cpu does not improve netns
performance/scalability; it makes it worse.

#911af50
====
...
+#ifdef CONFIG_RCU_NOCB_CPU_ALL
+ pr_info("\tExperimental no-CBs for all CPUs\n");
+ cpumask_setall(rcu_nocb_mask);
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU_ALL */
...
====

Comparing standing out points (see charts):

#81e5949 - good
#911af50 - bad

From the script above I was able to see that the following lines
cause a major impact on netns scalability/performance:

1) ip netns add -> huge performance regression:

 1 cpu: no regression
 4 cpu: regression for NOCB_CPU_ALL

 obs: regression from 250 netns/sec to 50 netns/sec at the
500-netns-already-created mark

2) ip netns exec -> some performance regression

 1 cpu: no regression
 4 cpu: regression for NOCB_CPU_ALL

 obs: regression from 40 netns/sec (+1 exec per netns creation) to 20
netns/sec at the 500-netns-created mark

========

FULL NOTE: http://people.canonical.com/~inaddy/lp1328088/

** Assumption: RCU callbacks being offloaded to multiple cpus
(cpumask_setall) caused regression in
copy_net_ns<-created_new_namespaces or unshare(clone_newnet).

** Next Steps: I'll probably begin function_graph tracing of netns creation/execution


* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11  5:52 Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus Rafael Tinoco
@ 2014-06-11  7:07 ` Eric W. Biederman
  2014-06-11 13:39 ` Paul E. McKenney
  1 sibling, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-11  7:07 UTC
  To: Rafael Tinoco
  Cc: paulmck, linux-kernel, davem, Dave Chiluk, Christopher Arges

Rafael Tinoco <rafael.tinoco@canonical.com> writes:

> Paul E. McKenney, Eric Biederman, David Miller (and/or anyone else interested):
>
> It was brought to my attention that netns creation/execution might
> have suffered a scalability/performance regression after v3.8.
>
> I would like you, or anyone interested, to review these charts/data
> and check whether there is anything worth discussing before I dig
> further.
>
> The following script was used for all the tests and charts generation:

> ====
> #!/bin/bash
> IP=/sbin/ip
>
> function add_fake_router_uuid() {
>     j=`uuidgen`
>     $IP netns add bar-${j}
>     $IP netns exec bar-${j} $IP link set lo up
>     $IP netns exec bar-${j} sysctl -w net.ipv4.ip_forward=1 > /dev/null
>     k=`echo $j | cut -b -11`
>     $IP link add qro-${k} type veth peer name qri-${k} netns bar-${j}
>     $IP link add qgo-${k} type veth peer name qgi-${k} netns bar-${j}
> }
>
> for i in `seq 1 $1`; do
>     if [ `expr $i % 250` -eq 0 ]; then
>         echo "$i by `date +%s`"
>     fi
>     add_fake_router_uuid
> done

[snip long explanation]

> From the script above I was able to see that the following lines
> cause a major impact on netns scalability/performance:
>
> 1) ip netns add -> huge performance regression:
>
>  1 cpu: no regression
>  4 cpu: regression for NOCB_CPU_ALL
>
>  obs: regression from 250 netns/sec to 50 netns/sec at the
> 500-netns-already-created mark

copy_net_ns, except possibly in the per_net callbacks, does not use
rcu, so I am mystified.  A little more digging to figure out which
rcu usage is causing the problem would be very interesting.

> 2) ip netns exec -> some performance regression
>
>  1 cpu: no regression
>  4 cpu: regression for NOCB_CPU_ALL
>
>  obs: regression from 40 netns/sec (+1 exec per netns creation) to
> 20 netns/sec at the 500-netns-created mark

The performance regression is probably in setns().
switch_task_namespaces is occasionally a choke point.

At one point I was playing with ideas on how to use the task lock
instead of rcu to protect nsproxy, as the original reason we could
not use task_lock appears to have disappeared.

That could be worth playing with.
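
For reference, the remote read of another task's nsproxy that rcu
currently protects follows this pattern (a sketch of the access rules
documented in include/linux/nsproxy.h, not new kernel code):

====
struct nsproxy *nsproxy;

rcu_read_lock();
nsproxy = task_nsproxy(tsk);
if (nsproxy != NULL) {
	/* take references on the namespaces needed, e.g.
	 * get_net(nsproxy->net_ns); a NULL nsproxy means the
	 * task is almost dead (zombie) */
}
rcu_read_unlock();
====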


> ========
>
> FULL NOTE: http://people.canonical.com/~inaddy/lp1328088/
>
> ** Assumption: RCU callbacks being offloaded to multiple cpus
> (cpumask_setall) caused regression in
> copy_net_ns<-created_new_namespaces or unshare(clone_newnet).
>
> ** Next Steps: I'll probably begin to function_graph netns creation execution

Eric




* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11  5:52 Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus Rafael Tinoco
  2014-06-11  7:07 ` Eric W. Biederman
@ 2014-06-11 13:39 ` Paul E. McKenney
  2014-06-11 15:17   ` Rafael Tinoco
  1 sibling, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-06-11 13:39 UTC
  To: Rafael Tinoco
  Cc: linux-kernel, davem, ebiederm, Dave Chiluk, Christopher Arges

On Wed, Jun 11, 2014 at 02:52:09AM -0300, Rafael Tinoco wrote:
> Paul E. McKenney, Eric Biederman, David Miller (and/or anyone else interested):
> 
> It was brought to my attention that netns creation/execution might
> have suffered a scalability/performance regression after v3.8.
> 
> I would like you, or anyone interested, to review these charts/data
> and check whether there is anything worth discussing before I dig
> further.
> 
> The following script was used for all the tests and charts generation:
> 
> ====
> #!/bin/bash
> IP=/sbin/ip
> 
> function add_fake_router_uuid() {
>     j=`uuidgen`
>     $IP netns add bar-${j}
>     $IP netns exec bar-${j} $IP link set lo up
>     $IP netns exec bar-${j} sysctl -w net.ipv4.ip_forward=1 > /dev/null
>     k=`echo $j | cut -b -11`
>     $IP link add qro-${k} type veth peer name qri-${k} netns bar-${j}
>     $IP link add qgo-${k} type veth peer name qgi-${k} netns bar-${j}
> }
> 
> for i in `seq 1 $1`; do
>     if [ `expr $i % 250` -eq 0 ]; then
>         echo "$i by `date +%s`"
>     fi
>     add_fake_router_uuid
> done
> ====
> 
> This script reports how many "fake routers" are added per second (from
> 0 up to a 3000-router creation mark, for example; the argument gives
> the total to create). With this and a git bisect on the kernel tree I
> was led to one specific commit causing the scalability/performance
> regression: #911af50 "rcu: Provide compile-time control for no-CBs
> CPUs". Even though this change was experimental at that point, it
> introduced a performance/scalability regression (explained below) that
> persists.
> 
> RCU-related code looked to be responsible for the problem. Based on
> that, for every commit from tag v3.8 to master that changed any of
> these files: "kernel/rcutree.c kernel/rcutree.h kernel/rcutree_plugin.h
> include/trace/events/rcu.h include/linux/rcupdate.h", the kernel was
> checked out, compiled and tested. The idea was to track the performance
> regression across rcu development, if rcu was indeed the cause. In the
> worst case, the regression not being related to rcu, I would still have
> chronological data to interpret.
> 
> All text below this refer to 2 groups of charts, generated during the study:
> 
> ====
> 1) Kernel git tags from 3.8 to 3.14.
> *** http://people.canonical.com/~inaddy/lp1328088/charts/250-tag.html ***
> 
> 2) Kernel git commits for rcu development (111 commits) -> Clearly
> shows regressions:
> *** http://people.canonical.com/~inaddy/lp1328088/charts/250.html ***

I am having a really hard time distinguishing the colors on both charts
(yeah, red-green colorblind, go figure).  Any chance of brighter colors,
patterned lines, or (better yet) the data in tabular form (for example,
with the configuration choices as columns and the releases/commits
as rows)?  That said, I must admire your thicket of linked charts,
even if I cannot reliably distinguish the lines.

OK, I can apparently click on the color spots to eliminate some of
the traces.  More on this later.

In addition, two of the color spots at the top of the graphs do not have
labels.  What are they?

What is a "250 MARK"?  250 fake netns routers?  OK, maybe this is
the routers/sec below, though I have no idea what that might mean.
(Laptops/sec?  Smartphones/sec?  Supercomputers/sec?)

You have the throughput apparently dropping all the way to zero, for
example, for "Merge commit 8700c95adb03 into timers/nohz."  Really???

> Obs:
> 
> 1) There is a general chart with 111 commits. With this chart you can
> see the performance evolution/regression at each test mark. Test marks
> go from 0 to 2500 and refer to "fake routers already created". Example:
> Throughput was 50 routers/sec at the 250-already-created mark and 30
> routers/sec at the 1250 mark.
> 
> 2) Clicking on a specific commit will give you that commit evolution
> from 0 routers already created to 2500 routers already created mark.
> ====
> 
> Since results differed depending on how many cpus were used and on how
> the no-cb cpus were configured, 3 kernel config options were used for
> every measurement, with 1 and 4 cpus.
> 
> ====
> - CONFIG_RCU_NOCB_CPU (disabled): nocbno
> - CONFIG_RCU_NOCB_CPU_ALL (enabled): nocball
> - CONFIG_RCU_NOCB_CPU_NONE (enabled): nocbnone
> 
> Obs: For the 1 cpu cases, nocbno, nocbnone and nocball behave the same
> (or should), since with only 1 cpu there is no no-cb cpu.

In addition, there should not be much in the way of change for the
nocbno case, but I see the nocbno-4cpu-250 line frequently dropping
to zero.  Again, really???

Also, the four-CPU case is getting only about 2x the throughput of the
one-CPU case.  Why such poor scaling?  Does this benchmark depend mostly
on the grace-period latency or something?  (Given the routers/sec
measure, I am thinking maybe so...)

Do you have CONFIG_RCU_FAST_NO_HZ=y?  If so, please try setting it to n.

> ====
> 
> After the charts were generated it was clear that NOCB_CPU_ALL (4 cpus)
> hurt the "fake routers" creation performance, and that this regression
> continues up to the current upstream version.

Given that NOCB_CPU_ALL was designed primarily for real-time and HPC
workloads, this is no surprise.  I am working on some changes to make
it better behaved for other workloads based on a bug report from Rik.
Something about certain distros having enabled it by default.  ;-)

>                                              It was also clear that,
> after commit #911af50, having more than 1 cpu does not improve
> netns performance/scalability; it makes it worse.

Well, before that commit, there was no such thing as CONFIG_RCU_NOCB_CPU_ALL,
for one thing.  ;-)

If you want to see the real CONFIG_RCU_NOCB_CPU_ALL effect before that
commit, you need to use the rcu_nocbs= boot parameter.

> #911af50
> ====
> ...
> +#ifdef CONFIG_RCU_NOCB_CPU_ALL
> + pr_info("\tExperimental no-CBs for all CPUs\n");
> + cpumask_setall(rcu_nocb_mask);
> +#endif /* #ifdef CONFIG_RCU_NOCB_CPU_ALL */
> ...
> ====
> 
> Comparing standing out points (see charts):
> 
> #81e5949 - good
> #911af50 - bad
> 
> From the script above I was able to see that the following lines
> cause a major impact on netns scalability/performance:
> 
> 1) ip netns add -> huge performance regression:
> 
>  1 cpu: no regression
>  4 cpu: regression for NOCB_CPU_ALL
> 
>  obs: regression from 250 netns/sec to 50 netns/sec at the
> 500-netns-already-created mark
> 
> 2) ip netns exec -> some performance regression
> 
>  1 cpu: no regression
>  4 cpu: regression for NOCB_CPU_ALL
> 
>  obs: regression from 40 netns/sec (+1 exec per netns creation) to
> 20 netns/sec at the 500-netns-created mark

Again, what you are seeing is the effect of callback offloading on
a workload not particularly suited for it.  That said, I don't understand
why you are seeing any particular effect when offloading is completely
disabled unless your workload is sensitive to grace-period latency.

> ========
> 
> FULL NOTE: http://people.canonical.com/~inaddy/lp1328088/
> 
> ** Assumption: RCU callbacks being offloaded to multiple cpus
> (cpumask_setall) caused regression in
> copy_net_ns<-created_new_namespaces or unshare(clone_newnet).
> 
> ** Next Steps: I'll probably begin to function_graph netns creation execution

Some questions:

o	Why does the throughput drop all the way to zero at various points?

o	What exactly is this benchmark doing?

o	Is this benchmark sensitive to grace-period latency?
	(You can check this by changing the value of HZ, give or take.)

o	How many runs were taken at each point?  If more than one, what
	was the variability?

o	Routers per second means what?

o	How did you account for the effects of other non-RCU commits?
	Did you rebase the RCU commits on top of an older release without
	the other commits or something similar?

							Thanx, Paul



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 13:39 ` Paul E. McKenney
@ 2014-06-11 15:17   ` Rafael Tinoco
  2014-06-11 15:46     ` David Chiluk
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-11 15:17 UTC
  To: paulmck
  Cc: linux-kernel, davem, ebiederm, Dave Chiluk, Christopher Arges,
	Jay Vosburgh

> I am having a really hard time distinguishing the colors on both charts
> (yeah, red-green colorblind, go figure).  Any chance of brighter colors,
> patterned lines, or (better yet) the data in tabular form (for example,
> with the configuration choices as columns and the releases/commits
> as rows)?  That said, I must admire your thicket of linked charts,
> even if I cannot reliably distinguish the lines.

For now the best option for me will be to generate the charts in
different colors, since that is not very time consuming and I can
focus on other things.

> OK, I can apparently click on the color spots to eliminate some of
> the traces.  More on this later.
>
> In addition, two of the color spots at the top of the graphs do not have
> labels.  What are they?

Those 2 lines only "fix" a minimum and a maximum (the scale). They
should always be kept checked so you keep the same scale in any chart
or measure.
>
> What is a "250 MARK"?  250 fake netns routers?  OK, maybe this is
> the routers/sec below, though I have no idea what that might mean.
> (Laptops/sec?  Smartphones/sec?  Supercomputers/sec?)

This script simulates a failure on a cloud infrastructure, for
example. As soon as one virtualization host fails, all its network
namespaces have to be migrated to another node. Creating thousands of
netns in the shortest time possible is the objective here. This
regression was observed while trying to migrate from v3.5 to v3.8+.

The script creates up to 3000/4000 network namespaces and places
links in them. At every 250 mark (netns already created) we have a
throughput average (how many were created per second from the last
mark to this one).
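
For illustration (invented timestamps), the script's "$i by `date +%s`"
lines make that per-interval average easy to derive:

====
250 by 1402500000
500 by 1402500010
====

i.e. 250 netns were created in the 10 seconds between the two marks,
giving 25 netns/sec for the 250-500 interval.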

> You have the throughput apparently dropping all the way to zero, for
> example, for "Merge commit 8700c95adb03 into timers/nohz."  Really???

You can de-select all lines but the affected one (including the 2
colors without labels). If you see "0,00", compilation probably did not
generate a bootable kernel for my testing tool. If you see something
like "0,xx", it is probably a serious regression.

Example:

http://people.canonical.com/~inaddy/lp1328088/charts/c0f4dfd4.html

If you select ONLY nocbno-4cpu and nocbnone-4cpu you will see that
nocbno has 0,09 (huge regression) and nocbnone has 0 (huge regression
or unbootable kernel).

> In addition, there should not be much in the way of change for the
> nocbno case, but I see the nocbno-4cpu-250 line frequently dropping
> to zero.  Again, really???

Yes, it was observed and I thought it was weird also.

> Also, the four-CPU case is getting only about 2x the throughput of the
> one-CPU case.  Why such poor scaling?  Does this benchmark depend mostly
> on the grace-period latency or something?  (Given the routers/sec
> measure, I am thinking maybe so...)

I would say the four-CPU case is getting *half* of the throughput of the
one-CPU case (yes, I will generate charts with other colors, sorry). This
is my main intention here: to understand whether this could be happening
just because of grace-period latency due to callbacks being offloaded (if
that makes sense). I'm starting to function_graph the netns calls to
check that.

> Do you have CONFIG_RCU_FAST_NO_HZ=y?  If so, please try setting it to n.

I have all 111 kernels compiled with CONFIG_RCU_FAST_NO_HZ=y. This is
probably because distributions try to configure a "fit-for-all-purposes"
kernel, and this option makes sense for small devices and their energy
consumption.

However I can take some commits that stand out and recompile the kernel
without this option to check whether that would be beneficial. I'll try
to avoid compiling everything again because it takes 5-7 days to compile
and run all the tests for all commits with all 3 config options each
(111 commits, 3 options = 333 kernels tested on 1 and 4 cpus).

Let me know if you have any specific commit you would like to see without
CONFIG_RCU_FAST_NO_HZ.

> Given that NOCB_CPU_ALL was designed primarily for real-time and HPC
> workloads, this is no surprise.  I am working on some changes to make
> it better behaved for other workloads based on a bug report from Rik.
> Something about certain distros having enabled it by default.  ;-)

:D Totally agree. Unfortunately having "nocbno" or "nocbnone" is also
giving us this performance regression (for netns, compared to kernels
<= 3.8). You can check that in the 250.html chart, at the last (recent)
commit.

Probably configuring rcu_nocbs would be the best scenario for a "general-
purpose" kernel.

Again, since the bisect showed a regression at a specific rcu commit,
this was the line of the "investigation". The regression could also be
tested manually, by compiling before and after the bisect-bad commit.

>
> Well, before that commit, there was no such thing as CONFIG_RCU_NOCB_CPU_ALL,
> for one thing.  ;-)

Yes!! :D I'm aware of that... but I built an automated testing tool for
this, and "make nconfig" fixed the non-existent CONFIG_* options for
kernels before that specific commit.

> If you want to see the real CONFIG_RCU_NOCB_CPU_ALL effect before that
> commit, you need to use the rcu_nocbs= boot parameter.
>

Absolutely, I'll try with 2 or 3 commits before #911af50 just in case.

> Again, what you are seeing is the effect of callback offloading on
> a workload not particularly suited for it.  That said, I don't understand
> why you are seeing any particular effect when offloading is completely
> disabled unless your workload is sensitive to grace-period latency.
>

Wanted to make sure the results were correct. Starting to investigate
the netns functions (copied some of the netns developers here also).
Totally agree, and this confirms my hypothesis.

>
> Some questions:
>
> o       Why does the throughput drop all the way to zero at various points?

Explained earlier. Check whether it is 0.00 or 0.xx; 0.00 can mean an
unbootable kernel.

>
> o       What exactly is this benchmark doing?

Explained earlier. Simulating cloud infrastructure migrating netns on failure.

>
> o       Is this benchmark sensitive to grace-period latency?
>         (You can check this by changing the value of HZ, give or take.)

Will do that.

> o       How many runs were taken at each point?  If more than one, what
>         was the variability?

For all commits, only one. For the commits that stood out, more than
one; results tend to be the same with minimal variation. I'm trying to
balance effort between digging into the problem and getting more
results.

If, after my next answers (changing HZ, FAST_NO_HZ), you think that
remeasuring everything is a must, let me know and I'll work on the
deviation for you.

>
> o       Routers per second means what?

Explained earlier.

>
> o       How did you account for the effects of other non-RCU commits?
>         Did you rebase the RCU commits on top of an older release without
>         the other commits or something similar?

I used the Linus git tree, checking out specific commits and compiling
the kernel. I've only used commits that changed RCU because of the
bisect result. Besides these commits I have only generated kernels for
the main release tags.

From my point of view, if this is related to RCU, several things have
to be discussed: Is using NOCB_CPU_ALL for a general-purpose kernel a
good option? Is the netns code too dependent on low grace-period
latency to scale? Is there a way of minimizing this?

>                                                         Thanx, Paul

No Paul, I have to thank you. Really appreciate your time.

Rafael (tinoco@canonical/~inaddy)


* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 15:17   ` Rafael Tinoco
@ 2014-06-11 15:46     ` David Chiluk
  2014-06-11 16:18       ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: David Chiluk @ 2014-06-11 15:46 UTC
  To: Rafael Tinoco, paulmck
  Cc: linux-kernel, davem, ebiederm, Christopher Arges, Jay Vosburgh

On 06/11/2014 10:17 AM, Rafael Tinoco wrote:
> This script simulates a failure on a cloud infrastructure, for
> example. As soon as one virtualization host fails, all its network
> namespaces have to be migrated to another node. Creating thousands of
> netns in the shortest time possible is the objective here. This
> regression was observed while trying to migrate from v3.5 to v3.8+.
> 
> The script creates up to 3000/4000 network namespaces and places
> links in them. At every 250 mark (netns already created) we have a
> throughput average (how many were created per second from the last
> mark to this one).

Here's a little more background, and the "why it matters".

In an openstack cloud, neutron (openstack's networking framework) keeps
all customers of the cloud separated via network namespaces.  On each
compute node this is not a big deal, since each compute node can only
handle at most a few hundred VMs.  However in order for neutron to route
a customer's network traffic between disparate compute hosts, it uses
the concept of a neutron gateway.  In order for customer A's vm on host
1 to talk to customer A's vm on host 2, it must first go through a gre
tunnel to the neutron gateway.  The Neutron gateway then turns around and
routes the network traffic over another gre tunnel to host 2.  The
neutron gateway is where the problem is.

The neutron gateway must have a network namespace for every net
namespace in the cloud.  Granted this collection can be split up by
increasing the number of neutron gateways (scaling out), but some
clouds have decided to run these gateways on very beefy machines.  As
you can see by the graph, there is a software limitation that prevents
these machines from hosting any more than a few thousand namespaces.
This makes the gateway's hardware severely under-utilized.

Now think about what happens when a gateway goes down, the namespaces
need to be migrated, or a new machine needs to be brought up to replace
it.  When we're talking about 3000 namespaces, the amount of time it
takes simply to recreate the namespaces becomes very significant.

The script is a stripped down example of what exactly is being done on
the neutron gateway in order to create namespaces.

Dave.



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 15:46     ` David Chiluk
@ 2014-06-11 16:18       ` Paul E. McKenney
  2014-06-11 18:27         ` Dave Chiluk
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-06-11 16:18 UTC
  To: chiluk
  Cc: Rafael Tinoco, linux-kernel, davem, ebiederm, Christopher Arges,
	Jay Vosburgh

On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
> On 06/11/2014 10:17 AM, Rafael Tinoco wrote:
> > This script simulates a failure on a cloud infrastructure, for
> > example. As soon as one virtualization host fails, all its network
> > namespaces have to be migrated to another node. Creating thousands of
> > netns in the shortest time possible is the objective here. This
> > regression was observed while trying to migrate from v3.5 to v3.8+.
> > 
> > The script creates up to 3000/4000 network namespaces and places
> > links in them. At every 250 mark (netns already created) we have a
> > throughput average (how many were created per second from the last
> > mark to this one).
> 
> Here's a little more background, and the "why it matters".

Thank you, this is quite helpful.

> In an openstack cloud, neutron (openstack's networking framework) keeps
> all customers of the cloud separated via network namespaces.  On each
> compute node this is not a big deal, since each compute node can only
> handle at most a few hundred VMs.  However in order for neutron to route
> a customer's network traffic between disparate compute hosts, it uses
> the concept of a neutron gateway.  In order for customer A's vm on host
> 1 to talk to customer A's vm on host 2, it must first go through a gre
> tunnel to the neutron gateway.  The Neutron gateway then turns around and
> routes the network traffic over another gre tunnel to host 2.  The
> neutron gateway is where the problem is.
> 
> The neutron gateway must have a network namespace for every net
> namespace in the cloud.  Granted this collection can be split up by
> increasing the number of neutron gateways (scaling out), but some
> clouds have decided to run these gateways on very beefy machines.  As
> you can see by the graph, there is a software limitation that prevents
> these machines from hosting any more than a few thousand namespaces.
> This makes the gateway's hardware severely under-utilized.
> 
> Now think about what happens when a gateway goes down, the namespaces
> need to be migrated, or a new machine needs to be brought up to replace
> it.  When we're talking about 3000 namespaces, the amount of time it
> takes simply to recreate the namespaces becomes very significant.
> 
> The script is a stripped down example of what exactly is being done on
> the neutron gateway in order to create namespaces.

Are the namespaces torn down and recreated one at a time, or is there some
syscall, ioctl(), or whatever that allows bulk tear down and recreating?

							Thanx, Paul



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 16:18       ` Paul E. McKenney
@ 2014-06-11 18:27         ` Dave Chiluk
  2014-06-11 19:48           ` Paul E. McKenney
  2014-06-11 20:46           ` Eric W. Biederman
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chiluk @ 2014-06-11 18:27 UTC
  To: paulmck
  Cc: Rafael Tinoco, linux-kernel, davem, ebiederm, Christopher Arges,
	Jay Vosburgh

On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
> On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
>> Now think about what happens when a gateway goes down, the namespaces
>> need to be migrated, or a new machine needs to be brought up to replace
>> it.  When we're talking about 3000 namespaces, the amount of time it
>> takes simply to recreate the namespaces becomes very significant.
>>
>> The script is a stripped down example of what exactly is being done on
>> the neutron gateway in order to create namespaces.
> 
> Are the namespaces torn down and recreated one at a time, or is there some
> syscall, ioctl(), or whatever that allows bulk tear down and recreating?
> 
> 							Thanx, Paul

In the normal running case, the namespaces are created one at a time, as
new customers create a new set of VMs on the cloud.

However, in the case of failover to a new neutron gateway the namespaces
are created all at once using the ip command (more or less serially).

As far as I know there is no syscall or ioctl that allows bulk tear down
and recreation.  If such a beast exists, it might be helpful.




* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 18:27         ` Dave Chiluk
@ 2014-06-11 19:48           ` Paul E. McKenney
  2014-06-11 20:55             ` Eric W. Biederman
  2014-06-11 20:46           ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-06-11 19:48 UTC
  To: Dave Chiluk
  Cc: Rafael Tinoco, linux-kernel, davem, ebiederm, Christopher Arges,
	Jay Vosburgh

On Wed, Jun 11, 2014 at 01:27:07PM -0500, Dave Chiluk wrote:
> On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
> > On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
> >> Now think about what happens when a gateway goes down, the namespaces
> >> need to be migrated, or a new machine needs to be brought up to replace
> >> it.  When we're talking about 3000 namespaces, the amount of time it
> >> takes simply to recreate the namespaces becomes very significant.
> >>
> >> The script is a stripped down example of what exactly is being done on
> >> the neutron gateway in order to create namespaces.
> > 
> > Are the namespaces torn down and recreated one at a time, or is there some
> > syscall, ioctl(), or whatever that allows bulk tear down and recreating?
> > 
> > 							Thanx, Paul
> 
> In the normal running case, the namespaces are created one at a time, as
> new customers create a new set of VMs on the cloud.
> 
> However, in the case of failover to a new neutron gateway the namespaces
> are created all at once using the ip command (more or less serially).
> 
> As far as I know there is no syscall or ioctl that allows bulk tear down
> and recreation.  If such a beast exists, it might be helpful.

The solution might be to create such a beast.  I might be able to shave
a bit of time off of this benchmark, but at the cost of significant
increases in RCU's CPU consumption.  A bulk teardown/recreation API could
reduce the RCU grace-period overhead by several orders of magnitude by
having a single RCU grace period cover a few thousand changes.

This is why other bulk-change syscalls exist.
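
To make the grace-period arithmetic concrete, here is a schematic
sketch (hypothetical struct obj, unlink_object() and free_object()
helpers, not kernel code from this thread):

====
/* Per-change grace periods: n full waits. */
static void teardown_each(struct obj *objs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		unlink_object(&objs[i]); /* remove from reader-visible structure */
		synchronize_rcu();       /* one full grace period per object */
		free_object(&objs[i]);
	}
}

/* Bulk variant: one grace period amortized over all n changes. */
static void teardown_bulk(struct obj *objs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		unlink_object(&objs[i]);
	synchronize_rcu();               /* a single wait covers all n objects */
	for (i = 0; i < n; i++)
		free_object(&objs[i]);
}
====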

Just out of curiosity, what syscalls does the ip command use?

							Thanx, Paul



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 18:27         ` Dave Chiluk
  2014-06-11 19:48           ` Paul E. McKenney
@ 2014-06-11 20:46           ` Eric W. Biederman
  2014-06-11 21:14             ` Dave Chiluk
  2014-06-11 22:52             ` Paul E. McKenney
  1 sibling, 2 replies; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-11 20:46 UTC
  To: chiluk
  Cc: paulmck, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

Dave Chiluk <chiluk@canonical.com> writes:

> On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
>> On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
>>> Now think about what happens when a gateway goes down, the namespaces
>>> need to be migrated, or a new machine needs to be brought up to replace
>>> it.  When we're talking about 3000 namespaces, the amount of time it
>>> takes simply to recreate the namespaces becomes very significant.
>>>
>>> The script is a stripped down example of what exactly is being done on
>>> the neutron gateway in order to create namespaces.
>> 
>> Are the namespaces torn down and recreated one at a time, or is there some
>> syscall, ioctl(), or whatever that allows bulk tear down and recreating?
>> 
>> 							Thanx, Paul
>
> In the normal running case, the namespaces are created one at a time, as
> new customers create a new set of VMs on the cloud.
>
> However, in the case of failover to a new neutron gateway the namespaces
> are created all at once using the ip command (more or less serially).
>
> As far as I know there is no syscall or ioctl that allows bulk tear down
> and recreation.  If such a beast exists, it might be helpful.

Bulk teardown exists for network namespaces, and it happens
automatically.  Bulk creation does not exist.  But then until now rcu
was not known to even exist on the namespace creation path.

Which is what puzzles me.

Looking a little closer, switch_task_namespaces, which calls
synchronize_rcu when the old nsproxy is dead, may exist in both
unshare/clone and in setns.  So that may be the culprit.

Certainly it is the only thing going on in the ip netns exec case.
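
For reference, a rough userspace sketch of what "ip netns exec" boils
down to (an assumption based on the behavior discussed in this thread,
not the iproute2 source); the setns() call is what reaches
switch_task_namespaces:

====
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int netns_exec(const char *name, char *const argv[])
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "/var/run/netns/%s", name);
	fd = open(path, O_RDONLY);
	if (fd < 0 || setns(fd, CLONE_NEWNET)) /* hits switch_task_namespaces() */
		return -1;
	close(fd);
	/* iproute2 also sets up a private mount namespace before exec'ing */
	return execvp(argv[0], argv);
}
====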

ip netns add also performs a bind mount so we get into all of the vfs
level locking as well.

On the chance that it is dropping the old nsproxy (which calls
synchronize_rcu in switch_task_namespaces) that is causing your
problems, I have attached a patch that changes from rcu_read_lock to
task_lock for code that calls task_nsproxy from a different task.  The
code should be safe, and it should be an unquestioned performance
improvement, but I have only compile tested it.

If you can try the patch, it will tell us whether the problem is the
rcu access in switch_task_namespaces (the only one I am aware of in
network namespace creation) or whether the problem rcu case is
somewhere else.

If nothing else, knowing which rcu accesses are causing the slowdown
seems important at the end of the day.

Eric


From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Wed, 11 Jun 2014 13:33:47 -0700
Subject: [PATCH] nsproxy: Protect remote reads of nsproxy with task_lock not rcu_read_lock.

Remote reads are rare, and setns/clone can be slow because we are using
synchronize_rcu.  Let's speed things up by using locking that does not
optimize for a case that does not exist.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/namespace.c           |  4 ++--
 fs/proc/proc_net.c       |  2 ++
 fs/proc_namespace.c      |  6 ++----
 include/linux/nsproxy.h  |  6 +++---
 ipc/namespace.c          |  4 ++--
 kernel/nsproxy.c         | 12 +++---------
 kernel/utsname.c         |  4 ++--
 net/core/net_namespace.c |  6 ++++--
 8 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 182bc41cd887..2d52c1676bbb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2972,13 +2972,13 @@ static void *mntns_get(struct task_struct *task)
 	struct mnt_namespace *ns = NULL;
 	struct nsproxy *nsproxy;
 
-	rcu_read_lock();
+	task_lock(task);
 	nsproxy = task_nsproxy(task);
 	if (nsproxy) {
 		ns = nsproxy->mnt_ns;
 		get_mnt_ns(ns);
 	}
-	rcu_read_unlock();
+	task_unlock(task);
 
 	return ns;
 }
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 4677bb7dc7c2..a5e2d5576645 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -113,9 +113,11 @@ static struct net *get_proc_task_net(struct inode *dir)
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
 	if (task != NULL) {
+		task_lock(task);
 		ns = task_nsproxy(task);
 		if (ns != NULL)
 			net = get_net(ns->net_ns);
+		task_unlock(task);
 	}
 	rcu_read_unlock();
 
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 1a81373947f3..2b0f6455af54 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -232,17 +232,15 @@ static int mounts_open_common(struct inode *inode, struct file *file,
 	if (!task)
 		goto err;
 
-	rcu_read_lock();
+	task_lock(task);
 	nsp = task_nsproxy(task);
 	if (!nsp || !nsp->mnt_ns) {
-		rcu_read_unlock();
+		task_unlock(task);
 		put_task_struct(task);
 		goto err;
 	}
 	ns = nsp->mnt_ns;
 	get_mnt_ns(ns);
-	rcu_read_unlock();
-	task_lock(task);
 	if (!task->fs) {
 		task_unlock(task);
 		put_task_struct(task);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d159ac..229aeb8ade5b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -46,7 +46,7 @@ extern struct nsproxy init_nsproxy;
  *     precautions should be taken - just dereference the pointers
  *
  *  3. the access to other task namespaces is performed like this
- *     rcu_read_lock();
+ *     task_lock(tsk);
  *     nsproxy = task_nsproxy(tsk);
  *     if (nsproxy != NULL) {
  *             / *
@@ -57,13 +57,13 @@ extern struct nsproxy init_nsproxy;
  *         * NULL task_nsproxy() means that this task is
  *         * almost dead (zombie)
  *         * /
- *     rcu_read_unlock();
+ *     task_unlock(tsk);
  *
  */
 
 static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
 {
-	return rcu_dereference(tsk->nsproxy);
+	return tsk->nsproxy;
 }
 
 int copy_namespaces(unsigned long flags, struct task_struct *tsk);
diff --git a/ipc/namespace.c b/ipc/namespace.c
index 59451c1e214d..15b2ee95c3a9 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -154,11 +154,11 @@ static void *ipcns_get(struct task_struct *task)
 	struct ipc_namespace *ns = NULL;
 	struct nsproxy *nsproxy;
 
-	rcu_read_lock();
+	task_lock(task);
 	nsproxy = task_nsproxy(task);
 	if (nsproxy)
 		ns = get_ipc_ns(nsproxy->ipc_ns);
-	rcu_read_unlock();
+	task_unlock(task);
 
 	return ns;
 }
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e7811086b82..20a9929ce342 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -204,18 +204,12 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 
 	might_sleep();
 
+	task_lock(p);
 	ns = p->nsproxy;
-
-	rcu_assign_pointer(p->nsproxy, new);
+	p->nsproxy = new;
+	task_unlock(p);
 
 	if (ns && atomic_dec_and_test(&ns->count)) {
-		/*
-		 * wait for others to get what they want from this nsproxy.
-		 *
-		 * cannot release this nsproxy via the call_rcu() since
-		 * put_mnt_ns() will want to sleep
-		 */
-		synchronize_rcu();
 		free_nsproxy(ns);
 	}
 }
diff --git a/kernel/utsname.c b/kernel/utsname.c
index fd393124e507..359b1f12e90f 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -93,13 +93,13 @@ static void *utsns_get(struct task_struct *task)
 	struct uts_namespace *ns = NULL;
 	struct nsproxy *nsproxy;
 
-	rcu_read_lock();
+	task_lock(task);
 	nsproxy = task_nsproxy(task);
 	if (nsproxy) {
 		ns = nsproxy->uts_ns;
 		get_uts_ns(ns);
 	}
-	rcu_read_unlock();
+	task_unlock(task);
 
 	return ns;
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b62691f4f2..4196f2a46fa8 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -373,9 +373,11 @@ struct net *get_net_ns_by_pid(pid_t pid)
 	tsk = find_task_by_vpid(pid);
 	if (tsk) {
 		struct nsproxy *nsproxy;
+		task_lock(tsk);
 		nsproxy = task_nsproxy(tsk);
 		if (nsproxy)
 			net = get_net(nsproxy->net_ns);
+		task_unlock(tsk);
 	}
 	rcu_read_unlock();
 	return net;
@@ -632,11 +634,11 @@ static void *netns_get(struct task_struct *task)
 	struct net *net = NULL;
 	struct nsproxy *nsproxy;
 
-	rcu_read_lock();
+	task_lock(task);
 	nsproxy = task_nsproxy(task);
 	if (nsproxy)
 		net = get_net(nsproxy->net_ns);
-	rcu_read_unlock();
+	task_unlock(task);
 
 	return net;
 }
-- 
1.9.1



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 19:48           ` Paul E. McKenney
@ 2014-06-11 20:55             ` Eric W. Biederman
  2014-06-11 21:03               ` Rafael Tinoco
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-11 20:55 UTC
  To: paulmck
  Cc: Dave Chiluk, Rafael Tinoco, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Wed, Jun 11, 2014 at 01:27:07PM -0500, Dave Chiluk wrote:
>> On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
>> > On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
>> >> Now think about what happens when a gateway goes down, the namespaces
>> >> need to be migrated, or a new machine needs to be brought up to replace
>> >> it.  When we're talking about 3000 namespaces, the amount of time it
>> >> takes simply to recreate the namespaces becomes very significant.
>> >>
>> >> The script is a stripped down example of what exactly is being done on
>> >> the neutron gateway in order to create namespaces.
>> > 
>> > Are the namespaces torn down and recreated one at a time, or is there some
>> > syscall, ioctl(), or whatever that allows bulk tear down and recreating?
>> > 
>> > 							Thanx, Paul
>> 
>> In the normal running case, the namespaces are created one at a time, as
>> new customers create a new set of VMs on the cloud.
>> 
>> However, in the case of failover to a new neutron gateway the namespaces
>> are created all at once using the ip command (more or less serially).
>> 
>> As far as I know there is no syscall or ioctl that allows bulk tear down
>> and recreation.  If such a beast exists, it might be helpful.
>
> The solution might be to create such a beast.  I might be able to shave
> a bit of time off of this benchmark, but at the cost of significant
> increases in RCU's CPU consumption.  A bulk teardown/recreation API could
> reduce the RCU grace-period overhead by several orders of magnitude by
> having a single RCU grace period cover a few thousand changes.
>
> This is why other bulk-change syscalls exist.
>
> Just out of curiosity, what syscalls does the ip command use?

You can look in iproute2 ip/ipnetns.c

But roughly, ip netns add does (see the C sketch below):

unshare(CLONE_NEWNET);
mkdir /var/run/netns/<name>
mount --bind /proc/self/ns/net /var/run/netns/<name>

I don't know if there is any sensible way to batch that work.
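
A rough userspace equivalent of those three steps (an assumption based
on the description above, not the iproute2 source):

====
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

static int netns_add(const char *name)
{
	char path[256];
	int fd;

	snprintf(path, sizeof(path), "/var/run/netns/%s", name);
	if (unshare(CLONE_NEWNET))      /* lands in copy_net_ns() */
		return -1;
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0); /* mount point */
	if (fd < 0)
		return -1;
	close(fd);
	/* the bind mount keeps the namespace alive after this process exits */
	return mount("/proc/self/ns/net", path, "none", MS_BIND, NULL);
}
====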

(The unshare gets you into copy_net_ns in net/core/net_namespace.c
 and to find all of the code it can call you have to trace all
 of the register_pernet_subsys and register_pernet_device calls).

At least for creation I would like to see if we can make all of the
rcu_callback synchronize_rcu calls go away.  That seems preferable
to batching at creation time.
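
A minimal sketch of the call_rcu() pattern that avoids a blocking
synchronize_rcu() (a generic illustration, not code from this thread;
note that, per the comment in the current code, nsproxy itself cannot
be freed via call_rcu() directly because put_mnt_ns() may sleep):

====
struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *rhp)
{
	/* runs after a grace period has elapsed; safe to free */
	kfree(container_of(rhp, struct foo, rcu));
}

static void foo_release(struct foo *p)
{
	/* unlink p from the reader-visible structure first, then
	 * defer the free -- the caller does not block */
	call_rcu(&p->rcu, foo_free_rcu);
}
====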

Eric


* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 20:55             ` Eric W. Biederman
@ 2014-06-11 21:03               ` Rafael Tinoco
  0 siblings, 0 replies; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-11 21:03 UTC
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Eric,

I'll test the patch with the same testcase and let you all know.

Really appreciate everybody's efforts.

On Wed, Jun 11, 2014 at 5:55 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
>> On Wed, Jun 11, 2014 at 01:27:07PM -0500, Dave Chiluk wrote:
>>> On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
>>> > On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
>>> >> Now think about what happens when a gateway goes down, the namespaces
>>> >> need to be migrated, or a new machine needs to be brought up to replace
>>> >> it.  When we're talking about 3000 namespaces, the amount of time it
>>> >> takes simply to recreate the namespaces becomes very significant.
>>> >>
>>> >> The script is a stripped down example of what exactly is being done on
>>> >> the neutron gateway in order to create namespaces.
>>> >
>>> > Are the namespaces torn down and recreated one at a time, or is there some
>>> > syscall, ioctl(), or whatever that allows bulk tear down and recreating?
>>> >
>>> >                                                    Thanx, Paul
>>>
>>> In the normal running case, the namespaces are created one at a time, as
>>> new customers create a new set of VMs on the cloud.
>>>
>>> However, in the case of failover to a new neutron gateway the namespaces
>>> are created all at once using the ip command (more or less serially).
>>>
>>> As far as I know there is no syscall or ioctl that allows bulk tear down
>>> and recreation.  If such a beast exists, it might be helpful.
>>
>> The solution might be to create such a beast.  I might be able to shave
>> a bit of time off of this benchmark, but at the cost of significant
>> increases in RCU's CPU consumption.  A bulk teardown/recreation API could
>> reduce the RCU grace-period overhead by several orders of magnitude by
>> having a single RCU grace period cover a few thousand changes.
>>
>> This is why other bulk-change syscalls exist.
>>
>> Just out of curiosity, what syscalls does the ip command use?
>
> You can look in iproute2 ip/ipnetns.c
>
> But roughly, ip netns add does:
>
> unshare(CLONE_NEWNET);
> mkdir /var/run/netns/<name>
> mount --bind /proc/self/ns/net /var/run/netns/<name>
>
> I don't know if there is any sensible way to batch that work.
>
> (The unshare gets you into copy_net_ns in net/core/net_namespace.c
>  and to find all of the code it can call you have to trace all
>  of the register_pernet_subsys and register_pernet_device calls).
>
> At least for creation I would like to see if we can make all of the
> rcu_callback synchronize_rcu calls go away.  That seems preferable
> to batching at creation time.
>
> Eric



-- 
Rafael David Tinoco
Software Sustaining Engineer @ Canonical
Canonical Technical Services Engineering Team
# Email: rafael.tinoco@canonical.com (GPG: 87683FC0)
# Phone: +55.11.9.6777.2727 (Americas/Sao_Paulo)
# LP: ~inaddy | IRC: tinoco | Skype: rafael.tinoco


* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 20:46           ` Eric W. Biederman
@ 2014-06-11 21:14             ` Dave Chiluk
  2014-06-11 22:52             ` Paul E. McKenney
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Chiluk @ 2014-06-11 21:14 UTC
  To: Eric W. Biederman
  Cc: paulmck, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

On 06/11/2014 03:46 PM, Eric W. Biederman wrote:
> ip netns add also performs a bind mount so we get into all of the vfs
> level locking as well.

It's actually quite a bit worse than that, as ip netns exec creates a
new mount namespace as well.  That being said, the vfs issues have been
largely mitigated by a number of recent patches by Al Viro to
fs/namespace.c.



* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 20:46           ` Eric W. Biederman
  2014-06-11 21:14             ` Dave Chiluk
@ 2014-06-11 22:52             ` Paul E. McKenney
  2014-06-11 23:12               ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-06-11 22:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: chiluk, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
> Dave Chiluk <chiluk@canonical.com> writes:
> 
> > On 06/11/2014 11:18 AM, Paul E. McKenney wrote:
> >> On Wed, Jun 11, 2014 at 10:46:00AM -0500, David Chiluk wrote:
> >>> Now think about what happens when a gateway goes down, the namespaces
> >>> need to be migrated, or a new machine needs to be brought up to replace
> >>> it.  When we're talking about 3000 namespaces, the amount of time it
> >>> takes simply to recreate the namespaces becomes very significant.
> >>>
> >>> The script is a stripped down example of what exactly is being done on
> >>> the neutron gateway in order to create namespaces.
> >> 
> >> Are the namespaces torn down and recreated one at a time, or is there some
> >> syscall, ioctl(), or whatever that allows bulk tear down and recreating?
> >> 
> >> 							Thanx, Paul
> >
> > In the normal running case, the namespaces are created one at a time, as
> > new customers create a new set of VMs on the cloud.
> >
> > However, in the case of failover to a new neutron gateway the namespaces
> > are created all at once using the ip command (more or less serially).
> >
> > As far as I know there is no syscall or ioctl that allows bulk tear down
> > and recreation.  If such a beast exists, it might be helpful.
> 
> Bulk teardown exists for network namespaces, and it happens
> automatically.  Bulk creation does not exist.  But then until now rcu
> was not known to even exist on the namespace creation path.
> 
> Which is what puzzles me.
> 
> Looking a little closer, switch_task_namespaces, which calls
> synchronize_rcu when the old nsproxy is dead, may exist in both
> unshare/clone and in setns.  So that may be the culprit.
> 
> Certainly it is the only thing going on in the ip netns exec case.
> 
> ip netns add also performs a bind mount so we get into all of the vfs
> level locking as well.
> 
> On the chance that it is dropping the old nsproxy (which calls
> synchronize_rcu in switch_task_namespaces) that is causing your
> problems, I have attached a patch that changes from rcu_read_lock to
> task_lock for code that calls task_nsproxy from a different task.  The
> code should be safe, and it should be an unquestioned performance
> improvement, but I have only compile tested it.
> 
> If you can try the patch, it will tell us whether the problem is the
> rcu access in switch_task_namespaces (the only one I am aware of in
> network namespace creation) or whether the problem rcu case is
> somewhere else.
> 
> If nothing else, knowing which rcu accesses are causing the slowdown
> seems important at the end of the day.
> 
> Eric
> 
> 
> From: "Eric W. Biederman" <ebiederm@xmission.com>
> Date: Wed, 11 Jun 2014 13:33:47 -0700
> Subject: [PATCH] nsproxy: Protect remote reads of nsproxy with task_lock not rcu_read_lock.
> 
> Remote reads are rare, and setns/clone can be slow because we are using
> synchronize_rcu.  Let's speed things up by using locking that does not
> optimize for a case that does not exist.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/namespace.c           |  4 ++--
>  fs/proc/proc_net.c       |  2 ++
>  fs/proc_namespace.c      |  6 ++----
>  include/linux/nsproxy.h  |  6 +++---
>  ipc/namespace.c          |  4 ++--
>  kernel/nsproxy.c         | 12 +++---------
>  kernel/utsname.c         |  4 ++--
>  net/core/net_namespace.c |  6 ++++--
>  8 files changed, 20 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 182bc41cd887..2d52c1676bbb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2972,13 +2972,13 @@ static void *mntns_get(struct task_struct *task)
>  	struct mnt_namespace *ns = NULL;
>  	struct nsproxy *nsproxy;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsproxy = task_nsproxy(task);
>  	if (nsproxy) {
>  		ns = nsproxy->mnt_ns;
>  		get_mnt_ns(ns);
>  	}
> -	rcu_read_unlock();
> +	task_unlock(task);
> 
>  	return ns;
>  }
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 4677bb7dc7c2..a5e2d5576645 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -113,9 +113,11 @@ static struct net *get_proc_task_net(struct inode *dir)
>  	rcu_read_lock();
>  	task = pid_task(proc_pid(dir), PIDTYPE_PID);
>  	if (task != NULL) {
> +		task_lock(task);
>  		ns = task_nsproxy(task);
>  		if (ns != NULL)
>  			net = get_net(ns->net_ns);
> +		task_unlock(task);
>  	}
>  	rcu_read_unlock();
> 
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 1a81373947f3..2b0f6455af54 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -232,17 +232,15 @@ static int mounts_open_common(struct inode *inode, struct file *file,
>  	if (!task)
>  		goto err;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsp = task_nsproxy(task);
>  	if (!nsp || !nsp->mnt_ns) {
> -		rcu_read_unlock();
> +		task_unlock(task);
>  		put_task_struct(task);
>  		goto err;
>  	}
>  	ns = nsp->mnt_ns;
>  	get_mnt_ns(ns);
> -	rcu_read_unlock();
> -	task_lock(task);
>  	if (!task->fs) {
>  		task_unlock(task);
>  		put_task_struct(task);
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index b4ec59d159ac..229aeb8ade5b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -46,7 +46,7 @@ extern struct nsproxy init_nsproxy;
>   *     precautions should be taken - just dereference the pointers
>   *
>   *  3. the access to other task namespaces is performed like this
> - *     rcu_read_lock();
> + *     task_lock(tsk);
>   *     nsproxy = task_nsproxy(tsk);
>   *     if (nsproxy != NULL) {
>   *             / *
> @@ -57,13 +57,13 @@ extern struct nsproxy init_nsproxy;
>   *         * NULL task_nsproxy() means that this task is
>   *         * almost dead (zombie)
>   *         * /
> - *     rcu_read_unlock();
> + *     task_unlock(tsk);
>   *
>   */
> 
>  static inline struct nsproxy *task_nsproxy(struct task_struct *tsk)
>  {
> -	return rcu_dereference(tsk->nsproxy);
> +	return tsk->nsproxy;
>  }
> 
>  int copy_namespaces(unsigned long flags, struct task_struct *tsk);
> diff --git a/ipc/namespace.c b/ipc/namespace.c
> index 59451c1e214d..15b2ee95c3a9 100644
> --- a/ipc/namespace.c
> +++ b/ipc/namespace.c
> @@ -154,11 +154,11 @@ static void *ipcns_get(struct task_struct *task)
>  	struct ipc_namespace *ns = NULL;
>  	struct nsproxy *nsproxy;
> 
> -	rcu_read_lock();
> +	task_lock(task);
>  	nsproxy = task_nsproxy(task);
>  	if (nsproxy)
>  		ns = get_ipc_ns(nsproxy->ipc_ns);
> -	rcu_read_unlock();
> +	task_unlock(task);
> 
>  	return ns;
>  }
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 8e7811086b82..20a9929ce342 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -204,18 +204,12 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
> 
>  	might_sleep();
> 
> +	task_lock(p);
>  	ns = p->nsproxy;
> -
> -	rcu_assign_pointer(p->nsproxy, new);
> +	p->nsproxy = new;
> +	task_unlock(p);
> 
>  	if (ns && atomic_dec_and_test(&ns->count)) {
> -		/*
> -		 * wait for others to get what they want from this nsproxy.
> -		 *
> -		 * cannot release this nsproxy via the call_rcu() since
> -		 * put_mnt_ns() will want to sleep
> -		 */
> -		synchronize_rcu();
>  		free_nsproxy(ns);

If this is the culprit, another approach would be to use workqueues from
RCU callbacks.  The following (untested, probably does not even build)
patch illustrates one such approach.

							Thanx, Paul

------------------------------------------------------------------------

nsproxy: Substitute call_rcu/queue_work for synchronize_rcu

This commit gets synchronize_rcu() out of the way by getting the work
done in workqueue context from an RCU callback function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d159ac..489bf4c7a3a0 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -33,6 +33,10 @@ struct nsproxy {
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns_for_children;
 	struct net 	     *net_ns;
+	union {
+		struct rcu_head      rh;
+		struct work_struct   ws;
+	} cu;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e7811086b82..d9a4730ce386 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -27,6 +27,7 @@
 #include <linux/syscalls.h>
 
 static struct kmem_cache *nsproxy_cachep;
+static struct workqueue_struct *nsproxy_wq;
 
 struct nsproxy init_nsproxy = {
 	.count			= ATOMIC_INIT(1),
@@ -198,6 +199,21 @@ out:
 	return err;
 }
 
+static void free_nsproxy_wq(struct work_struct *work)
+{
+	struct nsproxy *ns = container_of(work, struct nsproxy, cu.ws);
+
+	free_nsproxy(ns);
+}
+
+static void free_nsproxy_rcu(struct rcu_head *rhp)
+{
+	struct nsproxy *ns = container_of(rhp, struct nsproxy, cu.rh);
+
+	INIT_WORK(&ns->cu.ws, free_nsproxy_wq);
+	queue_work(nsproxy_wq, &ns->cu.ws);
+}
+
 void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 {
 	struct nsproxy *ns;
@@ -210,13 +226,10 @@ void switch_task_namespaces(struct task_struct *p, struct nsproxy *new)
 
 	if (ns && atomic_dec_and_test(&ns->count)) {
 		/*
-		 * wait for others to get what they want from this nsproxy.
-		 *
-		 * cannot release this nsproxy via the call_rcu() since
-		 * put_mnt_ns() will want to sleep
+		 * Invoke free_nsproxy() (from workqueue context) to clean
+		 * up after others to get what they want from this nsproxy.
 		 */
-		synchronize_rcu();
-		free_nsproxy(ns);
+		call_rcu(&ns->cu.rh, free_nsproxy_rcu);
 	}
 }
 
@@ -264,5 +277,6 @@ out:
 int __init nsproxy_cache_init(void)
 {
 	nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
+	nsproxy_wq = alloc_workqueue("nsproxy_wq", WQ_MEM_RECLAIM, 0);
 	return 0;
 }
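
As an aside, the lifetime pattern above can be shown standalone: the
rcu_head is consumed by call_rcu(), and after the grace period the same
storage is reused as a work_struct so that the sleepable part of the
teardown runs in process context.  A minimal sketch, with a hypothetical
"struct foo" standing in for nsproxy and schedule_work() standing in for
the dedicated workqueue:

====
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct foo {
	/* ... payload whose teardown may sleep ... */
	union {
		struct rcu_head    rh;	/* consumed first, by call_rcu() */
		struct work_struct ws;	/* reused after the grace period */
	} cu;
};

static void foo_free_wq(struct work_struct *work)
{
	struct foo *f = container_of(work, struct foo, cu.ws);

	/* Workqueue (process) context: sleeping is allowed here. */
	kfree(f);
}

static void foo_free_rcu(struct rcu_head *rhp)
{
	struct foo *f = container_of(rhp, struct foo, cu.rh);

	/* RCU callback (softirq) context: must not sleep, so punt. */
	INIT_WORK(&f->cu.ws, foo_free_wq);
	schedule_work(&f->cu.ws);
}

static void foo_put_final(struct foo *f)
{
	/* The grace period elapses first; then the work item runs. */
	call_rcu(&f->cu.rh, foo_free_rcu);
}
====

The point is that no caller ever blocks in synchronize_rcu(); both the
grace period and the sleepable cleanup happen asynchronously.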


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 22:52             ` Paul E. McKenney
@ 2014-06-11 23:12               ` Eric W. Biederman
  2014-06-11 23:49                 ` Paul E. McKenney
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-11 23:12 UTC (permalink / raw)
  To: paulmck
  Cc: chiluk, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
>> On the chance it is dropping the old nsproxy, which calls synchronize_rcu
>> in switch_task_namespaces, that is causing you problems I have attached
>> a patch that changes from rcu_read_lock to task_lock for code that
>> calls task_nsproxy from a different task.  The code should be safe
>> and it should be an unquestionable performance improvement, but I have only
>> compile tested it.
>> 
>> If you can try the patch it will tell us if the problem is the rcu
>> access in switch_task_namespaces (the only one I am aware of in network
>> namespace creation) or if the problematic rcu case is somewhere else.
>> 
>> If nothing else, knowing which rcu accesses are causing the slowdown
>> seems important at the end of the day.
>> 
>> Eric
>> 
>
> If this is the culprit, another approach would be to use workqueues from
> RCU callbacks.  The following (untested, probably does not even build)
> patch illustrates one such approach.

For reference, the only reason we are using rcu_read_lock today for nsproxy is
an old lock ordering problem that does not exist anymore.

I can say that in some workloads setns is a bit heavy today because of
the synchronize_rcu, and setns is more important than I had previously
thought because pthreads break the classic unix ability to do things in
your process after fork() (sigh).

Today daemonize is gone, and notifying the parent process with a signal
relies on task_active_pid_ns, which does not use nsproxy.  So the old
lock ordering problem/race is gone.

Below is the description of what was happening when the code switched from
task_lock to rcu_read_lock to protect nsproxy.

commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
Author: Pavel Emelyanov <xemul@openvz.org>
Date:   Thu Oct 18 23:39:54 2007 -0700

    Make access to task's nsproxy lighter
    
    When someone wants to deal with some other task's namespaces it has to lock
    the task and then to get the desired namespace if one exists.  This is
    slow on read-only paths and may be impossible in some cases.
    
    E.g.  Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.
    
    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is a rather rare operation and we
    can sacrifice its speed to solve the issues above.
    
    The access to other task namespaces is proposed to be performed
    like this:
    
         rcu_read_lock();
         nsproxy = task_nsproxy(tsk);
         if (nsproxy != NULL) {
                 / *
                   * work with the namespaces here
                   * e.g. get the reference on one of them
                   * /
         } / *
             * NULL task_nsproxy() means that this task is
             * almost dead (zombie)
             * /
         rcu_read_unlock();
    
    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.
    
    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Serge Hallyn <serue@us.ibm.com>
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 23:12               ` Eric W. Biederman
@ 2014-06-11 23:49                 ` Paul E. McKenney
  2014-06-12  0:14                   ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Paul E. McKenney @ 2014-06-11 23:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: chiluk, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

On Wed, Jun 11, 2014 at 04:12:15PM -0700, Eric W. Biederman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
> >> On the chance it is dropping the old nsproxy, which calls synchronize_rcu
> >> in switch_task_namespaces, that is causing you problems I have attached
> >> a patch that changes from rcu_read_lock to task_lock for code that
> >> calls task_nsproxy from a different task.  The code should be safe
> >> and it should be an unquestionable performance improvement, but I have only
> >> compile tested it.
> >> 
> >> If you can try the patch it will tell us if the problem is the rcu
> >> access in switch_task_namespaces (the only one I am aware of in network
> >> namespace creation) or if the problematic rcu case is somewhere else.
> >> 
> >> If nothing else, knowing which rcu accesses are causing the slowdown
> >> seems important at the end of the day.
> >> 
> >> Eric
> >> 
> >
> > If this is the culprit, another approach would be to use workqueues from
> > RCU callbacks.  The following (untested, probably does not even build)
> > patch illustrates one such approach.
> 
> For reference, the only reason we are using rcu_read_lock today for nsproxy is
> an old lock ordering problem that does not exist anymore.
> 
> I can say that in some workloads setns is a bit heavy today because of
> the synchronize_rcu, and setns is more important than I had previously
> thought because pthreads break the classic unix ability to do things in
> your process after fork() (sigh).
> 
> Today daemonize is gone, and notifying the parent process with a signal
> relies on task_active_pid_ns, which does not use nsproxy.  So the old
> lock ordering problem/race is gone.
> 
> Below is the description of what was happening when the code switched from
> task_lock to rcu_read_lock to protect nsproxy.

OK, never mind, then!  ;-)

							Thanx, Paul

> commit cf7b708c8d1d7a27736771bcf4c457b332b0f818
> Author: Pavel Emelyanov <xemul@openvz.org>
> Date:   Thu Oct 18 23:39:54 2007 -0700
> 
>     Make access to task's nsproxy lighter
>     
>     When someone wants to deal with some other task's namespaces it has to lock
>     the task and then to get the desired namespace if one exists.  This is
>     slow on read-only paths and may be impossible in some cases.
>     
>     E.g.  Oleg recently noticed a race between unshare() and the (sent for
>     review in cgroups) pid namespaces - when the task notifies the parent it
>     has to know the parent's namespace, but taking the task_lock() is
>     impossible there - the code is under write locked tasklist lock.
>     
>     On the other hand switching the namespace on task (daemonize) and releasing
>     the namespace (after the last task exit) is a rather rare operation and we
>     can sacrifice its speed to solve the issues above.
>     
>     The access to other task namespaces is proposed to be performed
>     like this:
>     
>          rcu_read_lock();
>          nsproxy = task_nsproxy(tsk);
>          if (nsproxy != NULL) {
>                  / *
>                    * work with the namespaces here
>                    * e.g. get the reference on one of them
>                    * /
>          } / *
>              * NULL task_nsproxy() means that this task is
>              * almost dead (zombie)
>              * /
>          rcu_read_unlock();
>     
>     This patch has passed the review by Eric and Oleg :) and,
>     of course, tested.
>     
>     [clg@fr.ibm.com: fix unshare()]
>     [ebiederm@xmission.com: Update get_net_ns_by_pid]
>     Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>     Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>     Cc: Oleg Nesterov <oleg@tv-sign.ru>
>     Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Cc: Serge Hallyn <serue@us.ibm.com>
>     Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Eric
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-11 23:49                 ` Paul E. McKenney
@ 2014-06-12  0:14                   ` Eric W. Biederman
  2014-06-12  0:25                     ` Rafael Tinoco
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-12  0:14 UTC (permalink / raw)
  To: paulmck
  Cc: chiluk, Rafael Tinoco, linux-kernel, davem, Christopher Arges,
	Jay Vosburgh

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Wed, Jun 11, 2014 at 04:12:15PM -0700, Eric W. Biederman wrote:
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> 
>> > On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
>> >> On the chance it is dropping the old nsproxy, which calls synchronize_rcu
>> >> in switch_task_namespaces, that is causing you problems I have attached
>> >> a patch that changes from rcu_read_lock to task_lock for code that
>> >> calls task_nsproxy from a different task.  The code should be safe
>> >> and it should be an unquestionable performance improvement, but I have only
>> >> compile tested it.
>> >> 
>> >> If you can try the patch it will tell us if the problem is the rcu
>> >> access in switch_task_namespaces (the only one I am aware of in network
>> >> namespace creation) or if the problematic rcu case is somewhere else.
>> >> 
>> >> If nothing else, knowing which rcu accesses are causing the slowdown
>> >> seems important at the end of the day.
>> >> 
>> >> Eric
>> >> 
>> >
>> > If this is the culprit, another approach would be to use workqueues from
>> > RCU callbacks.  The following (untested, probably does not even build)
>> > patch illustrates one such approach.
>> 
>> For reference, the only reason we are using rcu_read_lock today for nsproxy is
>> an old lock ordering problem that does not exist anymore.
>> 
>> I can say that in some workloads setns is a bit heavy today because of
>> the synchronize_rcu, and setns is more important than I had previously
>> thought because pthreads break the classic unix ability to do things in
>> your process after fork() (sigh).
>> 
>> Today daemonize is gone, and notifying the parent process with a signal
>> relies on task_active_pid_ns, which does not use nsproxy.  So the old
>> lock ordering problem/race is gone.
>> 
>> Below is the description of what was happening when the code switched from
>> task_lock to rcu_read_lock to protect nsproxy.
>
> OK, never mind, then!  ;-)

I appreciate you posting your approach.  I just figured I should do
my homework, and verify my fuzzy memory.

Who knows there might be different performance problems with my
approach.  But I am hoping this is one of those happy instances where we
can just make everything simpler.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-12  0:14                   ` Eric W. Biederman
@ 2014-06-12  0:25                     ` Rafael Tinoco
  2014-06-12  1:09                       ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-12  0:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

I'm getting a kernel panic with your patch:

-- panic
-- mount_block_root
-- mount_root
-- prepare_namespace
-- kernel_init_freeable

It is giving me an unknown block device for the same config file I
used on other builds. Since my test is running on a kvm guest under a
ramdisk, I'm still checking if there are any differences between this
build and other ones, but I think there aren't.

Any chance that "prepare_namespace" might be breaking mount_root?

Tks

On Wed, Jun 11, 2014 at 9:14 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
>> On Wed, Jun 11, 2014 at 04:12:15PM -0700, Eric W. Biederman wrote:
>>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>
>>> > On Wed, Jun 11, 2014 at 01:46:08PM -0700, Eric W. Biederman wrote:
>>> >> On the chance it is dropping the old nsproxy, which calls synchronize_rcu
>>> >> in switch_task_namespaces, that is causing you problems I have attached
>>> >> a patch that changes from rcu_read_lock to task_lock for code that
>>> >> calls task_nsproxy from a different task.  The code should be safe
>>> >> and it should be an unquestionable performance improvement, but I have only
>>> >> compile tested it.
>>> >>
>>> >> If you can try the patch it will tell us if the problem is the rcu
>>> >> access in switch_task_namespaces (the only one I am aware of in network
>>> >> namespace creation) or if the problematic rcu case is somewhere else.
>>> >>
>>> >> If nothing else, knowing which rcu accesses are causing the slowdown
>>> >> seems important at the end of the day.
>>> >>
>>> >> Eric
>>> >>
>>> >
>>> > If this is the culprit, another approach would be to use workqueues from
>>> > RCU callbacks.  The following (untested, probably does not even build)
>>> > patch illustrates one such approach.
>>>
>>> For reference, the only reason we are using rcu_read_lock today for nsproxy is
>>> an old lock ordering problem that does not exist anymore.
>>>
>>> I can say that in some workloads setns is a bit heavy today because of
>>> the synchronize_rcu, and setns is more important than I had previously
>>> thought because pthreads break the classic unix ability to do things in
>>> your process after fork() (sigh).
>>>
>>> Today daemonize is gone, and notifying the parent process with a signal
>>> relies on task_active_pid_ns, which does not use nsproxy.  So the old
>>> lock ordering problem/race is gone.
>>>
>>> Below is the description of what was happening when the code switched from
>>> task_lock to rcu_read_lock to protect nsproxy.
>>
>> OK, never mind, then!  ;-)
>
> I appreciate you posting your approach.  I just figured I should do
> my homework, and verify my fuzzy memory.
>
> Who knows there might be different performance problems with my
> approach.  But I am hoping this is one of those happy instances where we
> can just make everything simpler.
>
> Eric



-- 
Rafael David Tinoco
Software Sustaining Engineer @ Canonical
Canonical Technical Services Engineering Team
# Email: rafael.tinoco@canonical.com (GPG: 87683FC0)
# Phone: +55.11.9.6777.2727 (Americas/Sao_Paulo)
# LP: ~inaddy | IRC: tinoco | Skype: rafael.tinoco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-12  0:25                     ` Rafael Tinoco
@ 2014-06-12  1:09                       ` Eric W. Biederman
  2014-06-12  1:14                         ` Rafael Tinoco
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-12  1:09 UTC (permalink / raw)
  To: Rafael Tinoco
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Rafael Tinoco <rafael.tinoco@canonical.com> writes:

> I'm getting a kernel panic with your patch:
>
> -- panic
> -- mount_block_root
> -- mount_root
> -- prepare_namespace
> -- kernel_init_freeable
>
> It is giving me an unknown block device for the same config file I
> used on other builds. Since my test is running on a kvm guest under a
> ramdisk, I'm still checking if there are any differences between this
> build and other ones, but I think there aren't.
>
> Any chance that "prepare_namespace" might be breaking mount_root?

My patch boots for me....

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-12  1:09                       ` Eric W. Biederman
@ 2014-06-12  1:14                         ` Rafael Tinoco
       [not found]                           ` <CAJE_dJzjcWP=e_CPM1M64URVHiEFFb+fP6g2YKZVdoFntkQMZg@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-12  1:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Ok, probably some misconfiguration here; never mind. I'll finish the
tests tomorrow, compare with the existing ones, and let you know ASAP. Tks.

On Wed, Jun 11, 2014 at 10:09 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Rafael Tinoco <rafael.tinoco@canonical.com> writes:
>
>> I'm getting a kernel panic with your patch:
>>
>> -- panic
>> -- mount_block_root
>> -- mount_root
>> -- prepare_namespace
>> -- kernel_init_freeable
>>
>> It is giving me an unknown block device for the same config file I
>> used on other builds. Since my test is running on a kvm guest under a
>> ramdisk, I'm still checking if there are any differences between this
>> build and other ones, but I think there aren't.
>>
>> Any chance that "prepare_namespace" might be breaking mount_root?
>
> My patch boots for me....
>
> Eric



-- 
Rafael David Tinoco
Software Sustaining Engineer @ Canonical
Canonical Technical Services Engineering Team
# Email: rafael.tinoco@canonical.com (GPG: 87683FC0)
# Phone: +55.11.9.6777.2727 (Americas/Sao_Paulo)
# LP: ~inaddy | IRC: tinoco | Skype: rafael.tinoco

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
       [not found]                           ` <CAJE_dJzjcWP=e_CPM1M64URVHiEFFb+fP6g2YKZVdoFntkQMZg@mail.gmail.com>
@ 2014-06-13 18:22                             ` Rafael Tinoco
  2014-06-14  0:02                             ` Eric W. Biederman
  1 sibling, 0 replies; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-13 18:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Okay,

Tests with the same script were done.
I'm comparing: master + patch vs 3.15.0-rc5 (last synced rcu commit)
and 3.9 (last bisect good).

Same tests were made. I'm comparing the following versions:

1) master + suggested patch
2) 3.15.0-rc5 (last rcu commit in my clone)
3) 3.9-rc2 (last bisect good)

          master + sug patch          3.15.0-rc5 (last rcu)   3.9-rc2 (bisect good)
mark      no      none    all         no       none     all        no

# (netns add) / sec

 250      125.00  250.00  250.00      20.83    22.73    50.00      83.33
 500      250.00  250.00  250.00      22.73    22.73    50.00      125.00
 750      250.00  125.00  125.00      20.83    22.73    62.50      125.00
1000      125.00  250.00  125.00      20.83    20.83    50.00      250.00
1250      125.00  125.00  250.00      22.73    22.73    50.00      125.00
1500      125.00  125.00  125.00      22.73    22.73    41.67      125.00
1750      125.00  125.00  83.33       22.73    22.73    50.00      83.33
2000      125.00  83.33   125.00      22.73    25.00    50.00      125.00

-> From 3.15 to the patched tree, netns add performance was
*** restored/improved ***.

# (netns add + 1 x exec) / sec

 250      11.90   14.71   31.25       5.00     6.76     15.63      62.50
 500      11.90   13.89   31.25       5.10     7.14     15.63      41.67
 750      11.90   13.89   27.78       5.10     7.14     15.63      50.00
1000      11.90   13.16   25.00       4.90     6.41     15.63      35.71
1250      11.90   13.89   25.00       4.90     6.58     15.63      27.78
1500      11.36   13.16   25.00       4.72     6.25     15.63      25.00
1750      11.90   12.50   22.73       4.63     5.56     14.71      20.83
2000      11.36   12.50   22.73       4.55     5.43     13.89      17.86

-> From 3.15 to the patched tree, performance improves by +100% but is
still only ~50% of 3.9-rc2.

# (netns add + 2 x exec) / sec

250       6.58    8.62    16.67       2.81     3.97     9.26       41.67
500       6.58    8.33    15.63       2.78     4.10     9.62       31.25
750       5.95    7.81    15.63       2.69     3.85     8.93       25.00
1000      5.95    7.35    13.89       2.60     3.73     8.93       20.83
1250      5.81    7.35    13.89       2.55     3.52     8.62       16.67
1500      5.81    7.35    13.16       0.00     3.47     8.62       13.89
1750      5.43    6.76    13.16       0.00     3.47     8.62       11.36
2000      5.32    6.58    12.50       0.00     3.38     8.33        9.26

-> Same as before.

# netns add + 2 x exec + 1 x ip link to netns

250       7.14    8.33    14.71       2.87     3.97     8.62       35.71
500       6.94    8.33    13.89       2.91     3.91     8.93       25.00
750       6.10    7.58    13.89       2.75     3.79     8.06       19.23
1000      5.56    6.94    12.50       2.69     3.85     8.06       14.71
1250      5.68    6.58    11.90       2.58     3.57     7.81       11.36
1500      5.56    6.58    10.87       0.00     3.73     7.58       10.00
1750      5.43    6.41    10.42       0.00     3.57     7.14       8.62
2000      5.21    6.25    10.00       0.00     3.33     7.14       6.94

-> "ip link add" into the netns did not change the performance proportions much.

# netns add + 2 x exec + 2 x ip link to netns

250       7.35    8.62    13.89       2.94     4.03     8.33       31.25
500       7.14    8.06    12.50       2.94     4.03     8.06       20.83
750       6.41    7.58    11.90       2.81     3.85     7.81       15.63
1000      5.95    7.14    10.87       2.69     3.79     7.35       12.50
1250      5.81    6.76    10.00       2.66     3.62     7.14       10.00
1500      5.68    6.41    9.62        3.73     6.76     8.06
1750      5.32    6.25    8.93        3.68     6.58     7.35
2000      5.43    6.10    8.33        3.42     6.10     6.41

-> Same as before.

OBS:

1) It seems that performance improved for network namespace
addition, but there is likely room for improvement on netns
execution as well. That way we might achieve the same performance
3.9-rc2 (good bisect) had.

2) These tests were made with 4 cpus only.

3) Initial charts showed that the 1 cpu case with all cpus as no-cb
(without this patch) had something like 50% of bisect-good
performance. The 4 cpu (nocball) case had 26% of bisect good (as
shown in the last table above: 8.33 vs 31.25, i.e. ~26%).

4) With the patch, using 4 cpus and nocball, we now have 44% of
bisect-good performance (13.89 vs 31.25, against the 26% we had).

5) NOCB_* is still an issue. It is clear that only the NOCB_CPU_ALL
option is giving us something near the last good commit's performance.

Thank you

Rafael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
       [not found]                           ` <CAJE_dJzjcWP=e_CPM1M64URVHiEFFb+fP6g2YKZVdoFntkQMZg@mail.gmail.com>
  2014-06-13 18:22                             ` Rafael Tinoco
@ 2014-06-14  0:02                             ` Eric W. Biederman
  2014-06-16 15:01                               ` Rafael Tinoco
  1 sibling, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2014-06-14  0:02 UTC (permalink / raw)
  To: Rafael Tinoco
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Rafael Tinoco <rafael.tinoco@canonical.com> writes:

> Okay,
>
> Tests with the same script were done. 
> I'm comparing : master + patch vs 3.15.0-rc5 (last sync'ed rcu commit)
> and 3.9 last bisect good.
>
> Same tests were made. I'm comparing the following versions:
>
> 1) master + suggested patch
> 2) 3.15.0-rc5 (last rcu commit in my clone)
> 3) 3.9-rc2 (last bisect good)

I am having a hard time making sense of your numbers.

If I have read your email correctly my suggested patch caused:
"ip netns add" numbers to improve
1x "ip netns exec" to improve some
2x "ip netns exec" to show no improvement
"ip link add" to show no effect (after the 2x ip netns exec)

This is interesting in a lot of ways.
- This seems to confirm that the only rcu usage in ip netns add
  was switch_task_namespaces, which is convenient as it rules
  out most of the network stack when looking for performance oddities.

- "ip netns exec" had an expected performance improvement
- "ip netns exec" is still slow (so something odd is still going on)

- "ip link add" appears immaterial to the performance problem.

It would be interesting to switch the "ip link add" and "ip netns exec"
in your test case to confirm that there is nothing interesting/slow
going on in "ip link add".
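
For concreteness, one way to sanity-check "ip link add" on its own would
be something like the following (an illustrative harness, not the
original test script; interface names are made up):

====
#!/bin/bash
# Time bare veth-pair creation in the init netns, with no
# "ip netns exec" involved, to see whether link creation alone
# degrades as the count grows.
IP=/sbin/ip
N=${1:-2000}
t0=$(date +%s)
for i in $(seq 1 "$N"); do
    $IP link add v${i}a type veth peer name v${i}b
    if [ $((i % 250)) -eq 0 ]; then
        echo "$i links by $(($(date +%s) - t0))s"
    fi
done
====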

Which leaves me with the question of what rcu usage in "ip netns exec"
remains that is slowing all of this down.
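
One way to answer that empirically is to trace callers of
synchronize_rcu() while the test runs.  A sketch using the stock ftrace
function tracer (assuming debugfs is mounted at /sys/kernel/debug and
synchronize_rcu is listed in available_filter_functions):

====
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo synchronize_rcu > set_ftrace_filter
echo function > current_tracer
echo 1 > options/func_stack_trace   # record a stack for each hit
echo 1 > tracing_on
ip netns add probe0
ip netns exec probe0 true
echo 0 > tracing_on
cat trace   # the stack traces show who called synchronize_rcu()
====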

Eric


> master + sug patch 3.15.0-rc5 (last rcu) 3.9-rc2 (bisect good)
> mark no none all no none all no 
>
> # (netns add) / sec 
>
> 250  125.00 250.00 250.00   20.83 22.73 50.00   83.33
> 500  250.00 250.00 250.00   22.73 22.73 50.00  125.00
> 750  250.00 125.00 125.00   20.83 22.73 62.50  125.00
> 1000 125.00 250.00 125.00   20.83 20.83 50.00  250.00
> 1250 125.00 125.00 250.00   22.73 22.73 50.00  125.00
> 1500 125.00 125.00 125.00   22.73 22.73 41.67  125.00
> 1750 125.00 125.00  83.33   22.73 22.73 50.00   83.33
> 2000 125.00  83.33 125.00   22.73 25.00 50.00  125.00
>
> -> From 3.15 to the patched tree, netns add performance was
> *** restored/improved ***.
>
> # (netns add + 1 x exec) / sec
>
> 250  11.90 14.71 31.25 5.00 6.76 15.63 62.50
> 500  11.90 13.89 31.25 5.10 7.14 15.63 41.67
> 750  11.90 13.89 27.78 5.10 7.14 15.63 50.00
> 1000 11.90 13.16 25.00 4.90 6.41 15.63 35.71
> 1250 11.90 13.89 25.00 4.90 6.58 15.63 27.78
> 1500 11.36 13.16 25.00 4.72 6.25 15.63 25.00
> 1750 11.90 12.50 22.73 4.63 5.56 14.71 20.83
> 2000 11.36 12.50 22.73 4.55 5.43 13.89 17.86
>
> -> From 3.15 to the patched tree, performance improves by +100% but is
> still only ~50% of 3.9-rc2.
>
> # (netns add + 2 x exec) / sec
>
> 250 6.58 8.62 16.67 2.81 3.97 9.26 41.67
> 500 6.58 8.33 15.63 2.78 4.10 9.62 31.25
> 750 5.95 7.81 15.63 2.69 3.85 8.93 25.00
> 1000 5.95 7.35 13.89 2.60 3.73 8.93 20.83
> 1250 5.81 7.35 13.89 2.55 3.52 8.62 16.67
> 1500 5.81 7.35 13.16 0.00 3.47 8.62 13.89
> 1750 5.43 6.76 13.16 0.00 3.47 8.62 11.36
> 2000 5.32 6.58 12.50 0.00 3.38 8.33 9.26
>
> -> Same as before.
>
> # netns add + 2 x exec + 1 x ip link to netns
>
> 250 7.14 8.33 14.71 2.87 3.97 8.62 35.71
> 500 6.94 8.33 13.89 2.91 3.91 8.93 25.00
> 750 6.10 7.58 13.89 2.75 3.79 8.06 19.23
> 1000 5.56 6.94 12.50 2.69 3.85 8.06 14.71
> 1250 5.68 6.58 11.90 2.58 3.57 7.81 11.36
> 1500 5.56 6.58 10.87 0.00 3.73 7.58 10.00
> 1750 5.43 6.41 10.42 0.00 3.57 7.14 8.62
> 2000 5.21 6.25 10.00 0.00 3.33 7.14 6.94
>
> -> "ip link add" into the netns did not change the performance proportions much.
>
> # netns add + 2 x exec + 2 x ip link to netns
>
> 250 7.35 8.62 13.89 2.94 4.03 8.33 31.25
> 500 7.14 8.06 12.50 2.94 4.03 8.06 20.83
> 750 6.41 7.58 11.90 2.81 3.85 7.81 15.63
> 1000 5.95 7.14 10.87 2.69 3.79 7.35 12.50
> 1250 5.81 6.76 10.00 2.66 3.62 7.14 10.00
> 1500 5.68 6.41 9.62 3.73 6.76 8.06
> 1750 5.32 6.25 8.93 3.68 6.58 7.35
> 2000 5.43 6.10 8.33 3.42 6.10 6.41
>
> -> Same as before.
>
> OBS:
>
> 1) It seems that performance improved for network namespace
> addition, but there is likely room for improvement on netns
> execution as well. That way we might achieve the same performance
> 3.9-rc2 (good bisect) had.
>
> 2) These tests were made with 4 cpus only.
>
> 3) Initial charts showed that the 1 cpu case with all cpus as no-cb
> (without this patch) had something like 50% of bisect-good
> performance. The 4 cpu (nocball) case had 26% of bisect good (as
> shown in the last table above: 8.33 vs 31.25, i.e. ~26%).
>
> 4) With the patch, using 4 cpus and nocball, we now have 44% of
> bisect-good performance (13.89 vs 31.25, against the 26% we had).
>
> 5) NOCB_* is still an issue. It is clear that only the NOCB_CPU_ALL
> option is giving us something near the last good commit's performance.
>
> Thank you
>
> Rafael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-14  0:02                             ` Eric W. Biederman
@ 2014-06-16 15:01                               ` Rafael Tinoco
  2014-07-17 12:05                                 ` Rafael David Tinoco
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael Tinoco @ 2014-06-16 15:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

...

On Fri, Jun 13, 2014 at 9:02 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Rafael Tinoco <rafael.tinoco@canonical.com> writes:
>
>> Okay,
>>
>> Tests with the same script were done.
>> I'm comparing : master + patch vs 3.15.0-rc5 (last sync'ed rcu commit)
>> and 3.9 last bisect good.
>>
>> Same tests were made. I'm comparing the following versions:
>>
>> 1) master + suggested patch
>> 2) 3.15.0-rc5 (last rcu commit in my clone)
>> 3) 3.9-rc2 (last bisect good)
>
> I am having a hard time making sense of your numbers.
>
> If I have read your email correctly my suggested patch caused:
> "ip netns add" numbers to improve
> 1x "ip netns exec" to improve some
> 2x "ip netns exec" to show no improvement
> "ip link add" to show no effect (after the 2x ip netns exec)

 - "netns add" numbers are as good as they were before this regression.
 - "netns exec" numbers are improved but still only 50% of the last good bisect commit.
 - "link add" didn't show a difference.

> This is interesting in a lot of ways.
> - This seems to confirm that the only rcu usage in ip netns add
>   was switch_task_namespaces, which is convenient as it rules
>   out most of the network stack when looking for performance oddities.
>
> - "ip netns exec" had an expected performance improvement
> - "ip netns exec" is still slow (so something odd is still going on)
> - "ip link add" appears immaterial to the performance problem.
>
> It would be interesting to switch the "ip link add" and "ip netns exec"
> in your test case to confirm that there is nothing interesting/slow
> going on in "ip link add".

 - will do that.

>
> Which leaves me with the question of what rcu usage in "ip netns exec"
> remains that is slowing all of this down.

 - will check this also.

> Eric

Tks

Rafael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-06-16 15:01                               ` Rafael Tinoco
@ 2014-07-17 12:05                                 ` Rafael David Tinoco
  2014-07-24  7:01                                   ` Eric W. Biederman
  0 siblings, 1 reply; 24+ messages in thread
From: Rafael David Tinoco @ 2014-07-17 12:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Hello Eric,

Coming back to this...

On Jun 16, 2014, at 12:01 PM, Rafael Tinoco <rafael.tinoco@canonical.com> wrote:

> ...
> 
> On Fri, Jun 13, 2014 at 9:02 PM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> Rafael Tinoco <rafael.tinoco@canonical.com> writes:
>> 
>>> Okay,
>>> 
>>> Tests with the same script were done.
>>> I'm comparing : master + patch vs 3.15.0-rc5 (last sync'ed rcu commit)
>>> and 3.9 last bisect good.
>>> 
>>> Same tests were made. I'm comparing the following versions:
>>> 
>>> 1) master + suggested patch
>>> 2) 3.15.0-rc5 (last rcu commit in my clone)
>>> 3) 3.9-rc2 (last bisect good)
>> 
>> I am having a hard time making sense of your numbers.
>> 
>> If I have read your email correctly my suggested patch caused:
>> "ip netns add" numbers to improve
>> 1x "ip netns exec" to improve some
>> 2x "ip netns exec" to show no improvement
>> "ip link add" to show no effect (after the 2x ip netns exec)
> 
> - "netns add" numbers are as good as they were before this regression.
> - "netns exec" numbers are improved but still only 50% of the last good bisect commit.
> - "link add" didn't show a difference.
> 
>> This is interesting in a lot of ways.
>> - This seems to confirm that the only rcu usage in ip netns add
>>  was switch_task_namespaces, which is convenient as it rules
>>  out most of the network stack when looking for performance oddities.
>> 
>> - "ip netns exec" had an expected performance improvement
>> - "ip netns exec" is still slow (so something odd is still going on)
>> - "ip link add" appears immaterial to the performance problem.
>> 
>> It would be interesting to switch the "ip link add" and "ip netns exec"
>> in your test case to confirm that there is nothing interesting/slow
>> going on in "ip link add".
> 
> - will do that.

IP link add seems ok.

> 
>> 
>> Which leaves me with the question of what rcu usage in "ip netns exec"
>> remains that is slowing all of this down.
> 
> - will check this also.

Based on my tests (and on some other users who deployed this patch on a server
farm), it looks like changing rcu_read_lock() to task_lock() did the trick. We
are getting the same (sometimes much better) results, compared to the good
bisect kernel, for a large number of netns being created simultaneously.

Is it possible to make this change permanent in the kernel tree?

I much appreciate your attention, Eric.

Regards

Rafael Tinoco

> 
>> Eric
> 
> Tks
> 
> Rafael


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus
  2014-07-17 12:05                                 ` Rafael David Tinoco
@ 2014-07-24  7:01                                   ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2014-07-24  7:01 UTC (permalink / raw)
  To: Rafael David Tinoco
  Cc: Paul McKenney, Dave Chiluk, linux-kernel, davem,
	Christopher Arges, Jay Vosburgh

Rafael David Tinoco <rafael.tinoco@canonical.com> writes:

> Hello Eric,
>
> Coming back to this...
>
> On Jun 16, 2014, at 12:01 PM, Rafael Tinoco <rafael.tinoco@canonical.com> wrote:
>
>> ...
>> 
>> On Fri, Jun 13, 2014 at 9:02 PM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> Rafael Tinoco <rafael.tinoco@canonical.com> writes:
>>> 
>>>> Okay,
>>>> 
>>>> Tests with the same script were done.
>>>> I'm comparing : master + patch vs 3.15.0-rc5 (last sync'ed rcu commit)
>>>> and 3.9 last bisect good.
>>>> 
>>>> Same tests were made. I'm comparing the following versions:
>>>> 
>>>> 1) master + suggested patch
>>>> 2) 3.15.0-rc5 (last rcu commit in my clone)
>>>> 3) 3.9-rc2 (last bisect good)
>>> 
>>> I am having a hard time making sense of your numbers.
>>> 
>>> If I have read your email correctly my suggested patch caused:
>>> "ip netns add" numbers to improve
>>> 1x "ip netns exec" to improve some
>>> 2x "ip netns exec" to show no improvement
>>> "ip link add" to show no effect (after the 2x ip netns exec)
>> 
>> - "netns add" numbers are as good as they were before this regression.
>> - "netns exec" numbers are improved but still only 50% of the last good bisect commit.
>> - "link add" didn't show a difference.
>> 
>>> This is interesting in a lot of ways.
>>> - This seems to confirm that the only rcu usage in ip netns add
>>>  was switch_task_namespaces, which is convenient as it rules
>>>  out most of the network stack when looking for performance oddities.
>>> 
>>> - "ip netns exec" had an expected performance improvement
>>> - "ip netns exec" is still slow (so something odd is still going on)
>>> - "ip link add" appears immaterial to the performance problem.
>>> 
>>> It would be interesting to switch the "ip link add" and "ip netns exec"
>>> in your test case to confirm that there is nothing interesting/slow
>>> going on in "ip link add".
>> 
>> - will do that.
>
> IP link add seems ok.
>
>> 
>>> 
>>> Which leaves me with the question of what rcu usage in "ip netns exec"
>>> remains that is slowing all of this down.
>> 
>> - will check this also.
>
> Based on my tests (and on some other users who deployed this patch on a server
> farm), it looks like changing rcu_read_lock() to task_lock() did the trick. We
> are getting the same (sometimes much better) results, compared to the good
> bisect kernel, for a large number of netns being created simultaneously.
>
> Is it possible to make this change permanent in the kernel tree?

Definitely.  I just need to finish getting my act together.

It sounded like you had seen other performance problems, and I was
waiting on your further testing so we could narrow them down.

If you can't see other problems then I am happy to move forward with
this.

Thank you for testing and reporting this,

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2014-06-11  5:52 Possible netns creation and execution performance/scalability regression since v3.8 due to rcu callbacks being offloaded to multiple cpus Rafael Tinoco
2014-06-11  7:07 ` Eric W. Biederman
2014-06-11 13:39 ` Paul E. McKenney
2014-06-11 15:17   ` Rafael Tinoco
2014-06-11 15:46     ` David Chiluk
2014-06-11 16:18       ` Paul E. McKenney
2014-06-11 18:27         ` Dave Chiluk
2014-06-11 19:48           ` Paul E. McKenney
2014-06-11 20:55             ` Eric W. Biederman
2014-06-11 21:03               ` Rafael Tinoco
2014-06-11 20:46           ` Eric W. Biederman
2014-06-11 21:14             ` Dave Chiluk
2014-06-11 22:52             ` Paul E. McKenney
2014-06-11 23:12               ` Eric W. Biederman
2014-06-11 23:49                 ` Paul E. McKenney
2014-06-12  0:14                   ` Eric W. Biederman
2014-06-12  0:25                     ` Rafael Tinoco
2014-06-12  1:09                       ` Eric W. Biederman
2014-06-12  1:14                         ` Rafael Tinoco
     [not found]                           ` <CAJE_dJzjcWP=e_CPM1M64URVHiEFFb+fP6g2YKZVdoFntkQMZg@mail.gmail.com>
2014-06-13 18:22                             ` Rafael Tinoco
2014-06-14  0:02                             ` Eric W. Biederman
2014-06-16 15:01                               ` Rafael Tinoco
2014-07-17 12:05                                 ` Rafael David Tinoco
2014-07-24  7:01                                   ` Eric W. Biederman
