netdev.vger.kernel.org archive mirror
* Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
@ 2014-05-27 21:29 sowmini varadhan
  2014-05-28  1:41 ` Eric Dumazet
  2014-05-29  6:34 ` Julian Anastasov
  0 siblings, 2 replies; 13+ messages in thread
From: sowmini varadhan @ 2014-05-27 21:29 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: Eric Dumazet, Niels Möller, netdev, Jonas Bonn

On Sat, May 24, 2014 at 8:06 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 05/23/14 10:14, Eric Dumazet wrote:
>
>> Use the batch mode, and it will be much faster than ifconfig, as
>> ifconfig does not support this mode (you need one fork()/exec() per IP
>> address)
>>
>> ip -batch filename
>>
>
> The address dumping algorithm is a very likely contributor as well.
> It tries to remember indices and then skips on the next iteration
> all the way to where it left off.... has never been a big deal until
> someone tries a substantial number of addresses.
>
> cheers,
> jamal

Niels (nisse@southpole.se) reported:

   I've done a simple benchmark with a script assigning n addresses
   using "ip address add", and this seems to have O(n^2) complexity.
   E.g, assigning n=25500 addresses took 26 s, and doubling n, assigning
   51000 addresses, took 122 s, 4.6 times longer. Which isn't
   necessarily a problem once all the addresses are assigned, but it
   sounds a bit like there's a linear datastructure in there, not
   intended for a large number of addresses.

And this bothered me, since the suggested workaround of
"ip -b", plus the comment about slow address dumping algorithm
are both saying that there may be some fundamental scaling
issues here.
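
As a rough illustration (not part of the original message), the two
timings quoted above can be turned into an empirical exponent k for
t ~ n^k. The helper name scaling_exponent is made up for this sketch,
and two data points can only hint at the growth rate, not establish it.

```shell
#!/bin/sh
# Estimate the exponent k in t ~ n^k from two (n, seconds) measurements.
# Usage: scaling_exponent N1 T1 N2 T2
scaling_exponent() {
    awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
        'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# Timings quoted above: 25500 addresses in 26 s, 51000 addresses in 122 s.
scaling_exponent 25500 26 51000 122
```

An exponent a little above 2 would be consistent with the quadratic
suspicion, but it says nothing about where the time is spent.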

Also, my earlier comment about netlink vs ioctl was possibly
a red herring: when I compared my experiment with what Niels is
trying to do, the experiment was different. I was adding
an address to a (newly created) tunnel interface (which
grows both the number of interfaces and of addresses), whereas
Niels is adding all addresses to the same interface.

So I looked at Niels' test script with perf. Some observations:

perf tells me:

   80.13%       ip  [other]
                 |
                 |--30.12%-- fib_sync_up
                 |          |
                 |           --30.12%-- fib_inetaddr_event
                 |                     notifier_call_chain
                 |                     __blocking_notifier_call_chain
                 |                     blocking_notifier_call_chain
                 |                     __inet_insert_ifa
                 |                     inet_rtm_newaddr
                 |                     rtnetlink_rcv_msg
                 |                     netlink_rcv_skb
                 |                     rtnetlink_rcv
                 |                     netlink_unicast
                 |                     netlink_sendmsg
                 |                     sock_sendmsg
                 |                     ___sys_sendmsg
                 |                     __sys_sendmsg
                 |                     SyS_sendmsg
                 |                     SyS_socketcall
                 |                     syscall_call

thus fib_sync_up() itself doesn't scale very well. Not sure
how much tweak-potential exists here.

Further, in __inet_insert_ifa, we walk the ifa_list at least once
(which is probably unavoidable),

static int __inet_insert_ifa( /* .. */
                             u32 portid)
{

        /* ... */
       for (ifap = &in_dev->ifa_list; (ifa1 = *ifap) != NULL;
             ifap = &ifa1->ifa_next) {
        /* ... */
       blocking_notifier_call_chain(&inetaddr_chain, NETDEV_UP, ifa);

       return (0);
}
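
To see why a per-insert list walk adds up, here is a toy model in
shell (not kernel code). It assumes, purely for illustration, that
adding the i-th address visits each of the i-1 entries already on the
list exactly once; walk_cost is a hypothetical helper name.

```shell
#!/bin/sh
# Toy model, not kernel code: adding the i-th address walks the i-1
# entries already on the list, so n adds cost 0 + 1 + ... + (n-1) visits.
walk_cost() {
    n=$1
    i=0
    total=0
    while [ "$i" -lt "$n" ]; do
        total=$((total + i))
        i=$((i + 1))
    done
    echo "$total"
}

# Doubling n roughly quadruples the number of node visits.
walk_cost 1000
walk_cost 2000
```

Under that assumption a single walk per insert is already enough to
make the whole run quadratic in the number of addresses.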

But in addition, the fib callback, fib_inetaddr_event(), has another
potential ifa_list walk for SECONDARY addresses:

        switch (event) {
        case NETDEV_UP:
                fib_add_ifaddr(ifa);
#ifdef CONFIG_IP_ROUTE_MULTIPATH
                fib_sync_up(dev);
#endif

For Niels' script, since there are many addresses in the same
subnet, we'll have a lot of cases of an IFA_F_SECONDARY address,
so fib_add_ifaddr will then do another walk of the ifa_list.

Has anyone looked at consolidating some of this?
All of this could easily become a factor when the system
has a large number of interfaces and addresses, and the
control plane only wants to modify a very small subset of
that state.

--Sowmini


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-27 21:29 Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?) sowmini varadhan
@ 2014-05-28  1:41 ` Eric Dumazet
  2014-05-28 10:01   ` sowmini varadhan
  2014-05-28 12:18   ` sowmini varadhan
  2014-05-29  6:34 ` Julian Anastasov
  1 sibling, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2014-05-28  1:41 UTC (permalink / raw)
  To: sowmini varadhan; +Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn

On Tue, 2014-05-27 at 17:29 -0400, sowmini varadhan wrote:

> Has anyone looked at consolidating some of this?
> All of this could easily become a factor when the system
> has a large number of interfaces and addresses, and the
> control plane only wants to modify a very small subset of
> that state.

You did not provide kernel version you used.

Last time I checked, we were spending _lot_ of time in check_lifetime()

( this stuff was added in linux-3.9 )


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28  1:41 ` Eric Dumazet
@ 2014-05-28 10:01   ` sowmini varadhan
  2014-05-28 11:23     ` Jamal Hadi Salim
  2014-05-28 12:18   ` sowmini varadhan
  1 sibling, 1 reply; 13+ messages in thread
From: sowmini varadhan @ 2014-05-28 10:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn

> You did not provide kernel version you used.
>
> Last time I checked, we were spending _lot_ of time in check_lifetime()
>
> ( this stuff was added in linux-3.9 )
>

Built the latest kernel (vmlinuz-3.15.0-rc6+) from a clone
of kernel.git from may 24

--Sowmini


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 10:01   ` sowmini varadhan
@ 2014-05-28 11:23     ` Jamal Hadi Salim
  2014-05-28 11:54       ` sowmini varadhan
  0 siblings, 1 reply; 13+ messages in thread
From: Jamal Hadi Salim @ 2014-05-28 11:23 UTC (permalink / raw)
  To: sowmini varadhan, Eric Dumazet; +Cc: Niels Möller, netdev, Jonas Bonn

On 05/28/14 06:01, sowmini varadhan wrote:
>> You did not provide kernel version you used.
>>
>> Last time I checked, we were spending _lot_ of time in check_lifetime()
>>
>> ( this stuff was added in linux-3.9 )
>>
>
> Built the latest kernel (vmlinuz-3.15.0-rc6+) from a clone
> of kernel.git from may 24
>
> --Sowmini
>

Did you try the scenario where you add many IP addresses to the same
interface? Try to do a listing of the ip addresses then, should be
fun.

cheers,
jamal


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 11:23     ` Jamal Hadi Salim
@ 2014-05-28 11:54       ` sowmini varadhan
  0 siblings, 0 replies; 13+ messages in thread
From: sowmini varadhan @ 2014-05-28 11:54 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: Eric Dumazet, Niels Möller, netdev, Jonas Bonn

On Wed, May 28, 2014 at 7:23 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:


>
> Did you try the scenario where you add many IP addresses to the same
> interface? Try to do a listing of the ip addresses then, should be
> fun.

I did. Here, you try it too :-) attached below is the script I got from Niels.

And yes, while it is "fun" to do "ip addr show dev $IF|grep '\<inet\>'|wc -l",
I think Niels is first trying to scale the address addition itself.

BTW, I'm not sure it's O(n^2) - we don't have enough data points from
the script below to establish that. Once you optimize fib_sync_up,
I suspect it scales as K * n, where K > 1. FWIW, doing the quick+dirty
thing of just commenting out fib_sync_up as an experiment *halves* the
wallclock time.

$ cat nils.sh
#!/bin/sh
#
# From nisse@southpole.se (Niels Moller)
#  I used the below script. Run, e.g, with arguments
#  eth0 100 add
# to assign 100*255 addresses..
# And to get numbers, I just ran it with time(1).


if [ $# -lt 2 ] ; then
   echo Too few arguments
   exit 1
fi

IF=$1
CNT=$2
if [ $# -gt 2 ] ; then
    CMD=$3
else
    CMD=del
fi

echo "IF $IF CNT $CNT CMD $CMD"

for x in `seq 1 $CNT` ; do
    echo 10.200.$x.x
    for y in `seq 1 255` ; do
        addr="10.200.$x.$y"
        ip address "$CMD" "$addr"/32 dev "$IF"  || echo FAIL: $addr
    done
done
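
For comparison, the batch mode suggested earlier in the thread avoids
one fork()/exec() per address. A minimal sketch of a batch-file
generator for the same address plan follows; gen_batch is a
hypothetical helper name, and it only prints the batch lines, leaving
the call to "ip -batch" (which needs root) to the user.

```shell
#!/bin/sh
# Print one "ip -batch" line per address, using the same address plan
# as the script above (10.200.x.y/32, x = 1..CNT, y = 1..255).
# Usage: gen_batch IF CNT [CMD]   (CMD defaults to del, as above)
gen_batch() {
    awk -v ifname="$1" -v cnt="$2" -v cmd="${3:-del}" 'BEGIN {
        for (x = 1; x <= cnt; x++)
            for (y = 1; y <= 255; y++)
                printf "address %s 10.200.%d.%d/32 dev %s\n", cmd, x, y, ifname
    }'
}

# Example (needs root, so left commented out):
#   gen_batch eth0 100 add > batch && ip -batch batch
gen_batch eth0 1 add | head -n 3
```

This only removes the per-address process startup cost; any
kernel-side scaling behaviour is unchanged.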


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28  1:41 ` Eric Dumazet
  2014-05-28 10:01   ` sowmini varadhan
@ 2014-05-28 12:18   ` sowmini varadhan
  2014-05-28 13:44     ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: sowmini varadhan @ 2014-05-28 12:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn

On Tue, May 27, 2014 at 9:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Last time I checked, we were spending _lot_ of time in check_lifetime()

Although check_lifetime() itself did not light up in my
perf output, i see your point- it has a few ifa_list walks
that might end up being expensive - but these are static
IPv4 addresses- they should be marked IFA_F_PERMANENT,
right? That's probably why they did not show up in my perf output?

--Sowmini


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 12:18   ` sowmini varadhan
@ 2014-05-28 13:44     ` Eric Dumazet
  2014-05-28 14:48       ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2014-05-28 13:44 UTC (permalink / raw)
  To: sowmini varadhan; +Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn

On Wed, 2014-05-28 at 08:18 -0400, sowmini varadhan wrote:
> On Tue, May 27, 2014 at 9:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Last time I checked, we were spending _lot_ of time in check_lifetime()
> 
> Although check_lifetime() itself did not light up in my
> perf output, i see your point- it has a few ifa_list walks
> that might end up being expensive - but these are static
> IPv4 addresses- they should be marked IFA_F_PERMANENT,
> right? That's probably why they did not show up in my perf output?


How did you run perf ?

If you run :

perf record ./your_script.sh

It wont catch all the load that is triggered from work queues.

If you do

./your_script.sh &

perf top

Then you'll see check_lifetime() at first position.


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 13:44     ` Eric Dumazet
@ 2014-05-28 14:48       ` Eric Dumazet
  2014-05-28 16:00         ` Eric Dumazet
  2014-05-28 17:18         ` sowmini varadhan
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2014-05-28 14:48 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn, Jiri Pirko

On Wed, 2014-05-28 at 06:44 -0700, Eric Dumazet wrote:
> On Wed, 2014-05-28 at 08:18 -0400, sowmini varadhan wrote:
> > On Tue, May 27, 2014 at 9:41 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > Last time I checked, we were spending _lot_ of time in check_lifetime()
> > 
> > Although check_lifetime() itself did not light up in my
> > perf output, i see your point- it has a few ifa_list walks
> > that might end up being expensive - but these are static
> > IPv4 addresses- they should be marked IFA_F_PERMANENT,
> > right? That's probably why they did not show up in my perf output?
> 
> 
> How did you run perf ?
> 
> If you run :
> 
> perf record ./your_script.sh
> 
> It wont catch all the load that is triggered from work queues.
> 
> If you do
> 
> ./your_script.sh &
> 
> perf top
> 
> Then you'll see check_lifetime() at first position.

Following patch helps a lot
(30 seconds to add 65536 addresses on lo interface instead of 52 seconds
on my host)

for i in `seq 0 255`; do for j in `seq 0 255`; do echo "addr add 127.2.$i.$j/8 dev lo"; done; done >batch
ip -b batch

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index bdbf68bb2e2d..9b9763f27607 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -474,7 +474,8 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	inet_hash_insert(dev_net(in_dev->dev), ifa);
 
 	cancel_delayed_work(&check_lifetime_work);
-	queue_delayed_work(system_power_efficient_wq, &check_lifetime_work, 0);
+	queue_delayed_work(system_power_efficient_wq, &check_lifetime_work,
+			   HZ / 4);
 
 	/* Send message first, then call notifier.
 	   Notifier will trigger FIB update, so that


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 14:48       ` Eric Dumazet
@ 2014-05-28 16:00         ` Eric Dumazet
  2014-05-28 17:18         ` sowmini varadhan
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2014-05-28 16:00 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn, Jiri Pirko

On Wed, 2014-05-28 at 07:48 -0700, Eric Dumazet wrote:


> for i in `seq 0 255`; do for j in `seq 0 255`; do echo "addr add 127.2.$i.$j/8 dev lo"; done; done >batch
> ip -b batch

BTW, after the previous additions, do not try the following command:

ip addr del 127.0.0.1/8 dev lo

It triggers a softlockup :(

[ 1074.543962] BUG: soft lockup - CPU#11 stuck for 11s! [ip:10729]
[ 1074.549908] task: ffff8808afa15580 ti: ffff8808afb18000 task.ti: ffff8808afb18000
[ 1074.549909] RIP: 0010:[<ffffffff8152d991>]  [<ffffffff8152d991>] fib_del_ifaddr+0x161/0x490
[ 1074.549915] RSP: 0018:ffff8808afb19878  EFLAGS: 00000202
[ 1074.549916] RAX: ffff880891705640 RBX: 0000000000000000 RCX: 0000000000000001
[ 1074.549917] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000e
[ 1074.549918] RBP: ffff8808afb198d8 R08: 0000000000000000 R09: 000000000100007f
[ 1074.549919] R10: ffff880891705640 R11: 000000000000007f R12: ffff88089a417ec0
[ 1074.549920] R13: ffff88089a417ec0 R14: 0000000000000286 R15: 000000007f02b9e9
[ 1074.549921] FS:  00007f92f64d66d0(0000) GS:ffff88107fc60000(0000) knlGS:0000000000000000
[ 1074.549922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1074.549923] CR2: 00000000022f9008 CR3: 0000000891b41000 CR4: 00000000000427e0
[ 1074.549924] Stack:
[ 1074.549925]  ffff8801143f7d40 0000000000000000 ffff88088f0c3000 00000000812f3f40
[ 1074.549934]  ffff88010fdf4a00 ffffff7f00000001 000000d000000000 ffff88088f0c3000
[ 1074.549944]  ffffffff81c73a00 0000000000000002 ffff8801143f7d40 0000000000000000
[ 1074.549953] Call Trace:
[ 1074.549958]  [<ffffffff8152dd46>] fib_inetaddr_event+0x86/0xe0
[ 1074.549965]  [<ffffffff8156699d>] notifier_call_chain+0x4d/0x70
[ 1074.549972]  [<ffffffff81078c08>] __blocking_notifier_call_chain+0x58/0x80
[ 1074.549976]  [<ffffffff81078c46>] blocking_notifier_call_chain+0x16/0x20
[ 1074.549981]  [<ffffffff81523409>] __inet_del_ifa+0xf9/0x2a0
[ 1074.549986]  [<ffffffff810cc736>] ? __res_counter_charge+0xf6/0x130
[ 1074.549990]  [<ffffffff815236ba>] inet_rtm_deladdr+0x10a/0x160
[ 1074.549996]  [<ffffffff814bc544>] rtnetlink_rcv_msg+0xa4/0x240
[ 1074.550000]  [<ffffffff814bc4a0>] ? __rtnl_unlock+0x20/0x20
[ 1074.550006]  [<ffffffff814dc8a1>] netlink_rcv_skb+0xb1/0xc0
[ 1074.550010]  [<ffffffff814b90e5>] rtnetlink_rcv+0x25/0x40
[ 1074.550013]  [<ffffffff814dc145>] netlink_unicast+0x145/0x200
[ 1074.550017]  [<ffffffff814dc503>] netlink_sendmsg+0x303/0x400
[ 1074.550023]  [<ffffffff81491e0c>] sock_sendmsg+0x9c/0xd0
[ 1074.550028]  [<ffffffff814926c0>] ? move_addr_to_kernel+0x40/0xa0
[ 1074.550033]  [<ffffffff8149fa39>] ? verify_iovec+0x49/0xd0

We have a quadratic behavior here. 65536*65536 is too big.
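
To put a number on that product, here is a back-of-the-envelope
sketch. The helper names are made up, and the 5 ns per-step cost is
purely an assumption for illustration, not a measurement from this
thread.

```shell
#!/bin/sh
# Pairwise steps if each of n addresses is compared against all n.
pair_steps() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", n * n }'
}

# Seconds for that many steps at an ASSUMED cost per step, in nanoseconds.
steps_to_seconds() {
    awk -v s="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", s * ns / 1e9 }'
}

steps=$(pair_steps 65536)
echo "$steps steps"
steps_to_seconds "$steps" 5    # 5 ns per step is an assumption
```

Even with a per-step cost of a few nanoseconds, 2^32 steps lands in
the range of tens of seconds on one CPU.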


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-28 14:48       ` Eric Dumazet
  2014-05-28 16:00         ` Eric Dumazet
@ 2014-05-28 17:18         ` sowmini varadhan
  1 sibling, 0 replies; 13+ messages in thread
From: sowmini varadhan @ 2014-05-28 17:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jamal Hadi Salim, Niels Möller, netdev, Jonas Bonn, Jiri Pirko

On Wed, May 28, 2014 at 10:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:


> Following patch helps a lot
> (30 seconds to add 65536 addresses on lo interface instead of 52 seconds
> on my host)

I haven't had a chance to try out your patch yet (thanks for
the "dont do a del" caveat :-)) and I'm going to try it out now,
but reading your info makes me wonder: in this case all the
addresses that are being added are static IPv4 addrs, i.e., they
are clearly IFA_F_PERMANENT addresses, so check_lifetime() should
not have too much work to do (other than to examine the ifa_flags
and continue). So why is this making such a big difference?

Alternatively, afaict check_lifetime() is only interested in
an !IFA_F_PERMANENT. So maybe __inet_insert_ifa should
only do the {inet_hash_insert(..); /* reset check_lifetime_work */}
when it added a !PERMANENT address? Would this break
something else? (I realize it may need more changes elsewhere,
if the PERMANENT flag is modified on the fly, I've not checked
for those yet)

--Sowmini


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-27 21:29 Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?) sowmini varadhan
  2014-05-28  1:41 ` Eric Dumazet
@ 2014-05-29  6:34 ` Julian Anastasov
  2014-05-29 16:11   ` sowmini varadhan
  1 sibling, 1 reply; 13+ messages in thread
From: Julian Anastasov @ 2014-05-29  6:34 UTC (permalink / raw)
  To: sowmini varadhan
  Cc: Jamal Hadi Salim, Eric Dumazet, Niels Möller, netdev, Jonas Bonn


	Hello,

On Tue, 27 May 2014, sowmini varadhan wrote:

> For Niels' script, since there are many addresses in the same
> subnet, we'll have a lot of cases of an IFA_F_SECONDARY address,
> so fib_add_ifaddr will then do another walk of the ifa_list.
> 
> Has anyone looked at consolidating some of this?
> All of this could easily become a factor when the system
> has a large number of interfaces and addresses, and the
> control plane only wants to modify a very small subset of
> that state.

	First improvement without adding fields to
struct in_ifaddr would be (step 1):

- find_matching_ifa:
	- walk inet_addr_lst and match

- devinet_ioctl:
	- tryaddrmatch: walk inet_addr_lst and match

- inet_rtm_deladdr:
	- if IFA_LOCAL is provided find ifa_local in
	inet_addr_lst, then do other matches

	With additional pointer we can optimize
__inet_insert_ifa and __inet_del_ifa: we will know
how after finding ifa by walking inet_addr_lst to reach
the primary ifa: with new pointer ifa_parent that
points to our subnet. All secondaries for the subnet
can be known with pointer to the first one: ifa_sec,
because all secondaries are after all primaries:

- pri1
- pri2
- ...
- sec1_1
- sec1_2
- sec2_1
- sec2_2

	In fact ifa_sec and ifa_parent can be one field:
ifa_pri_sec, ifa_link or another better name, used depending
on IFA_F_SECONDARY.

	So, step 2: add pointer in ifa

	The real pain is fib_del_ifaddr: for ifa_local
we have a fast way (inet_addr_lst hash table) to determine
if this is the last local address in system (for prefsrc
purposes) but for ifa_broadcast we don't have such hash table.
May be with such hash table we can solve the problem but
it needs more ifa fields.

	Step 3: hash table for ifa_broadcast and
struct hlist_node for ifa_has_brd (ifa_broadcast),
ifa_hash_brd0 (first addr in subnet), ifa_hash_brd1 (last
addr in subnet).

	Any ideas?

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-29  6:34 ` Julian Anastasov
@ 2014-05-29 16:11   ` sowmini varadhan
  2014-05-29 16:19     ` David Ahern
  0 siblings, 1 reply; 13+ messages in thread
From: sowmini varadhan @ 2014-05-29 16:11 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Jamal Hadi Salim, Eric Dumazet, Niels Möller, netdev, Jonas Bonn

On Thu, May 29, 2014 at 2:34 AM, Julian Anastasov <ja@ssi.bg> wrote:

>         First improvement without adding fields to
> struct in_ifaddr would be (step 1):

< proposal to walk inet_addr_lst instead of ifa_list in 3 critical add/del
functions >

>         With additional pointer we can optimize
> __inet_insert_ifa and __inet_del_ifa: we will know
> how after finding ifa by walking inet_addr_lst to reach
> the primary ifa: with new pointer ifa_parent that
> points to our subnet. All secondaries for the subnet
> can be known with pointer to the first one: ifa_sec,
> because all secondaries are after all primaries:
  :
>
>         In fact ifa_sec and ifa_parent can be one field:
> ifa_pri_sec, ifa_link or another better name, used depending
> on IFA_F_SECONDARY.
>
>         So, step 2: add pointer in ifa

There would have to be some work done in the addition/deletion
code to promote and track primary/secondary addresses.


>         Step 3: hash table for ifa_broadcast and
> struct hlist_node for ifa_has_brd (ifa_broadcast),
> ifa_hash_brd0 (first addr in subnet), ifa_hash_brd1 (last
> addr in subnet).
>
>         Any ideas?

Interesting proposal. But by itself, it might be a lot of code
change, with the real bottlenecks being elsewhere.

From what Eric and I observed, it seems the primary
time-suckers in these paths are check_lifetime() and fib_sync_up().
Both show up at the top of the list in `perf top`, and judging by
the quick hacks that Eric and I tried to get them out of the
way, fixing them would give the most bang for the buck.

I'm still not sure I understand *why* check_lifetime ends
up being expensive, though. For this particular test, all the
addresses are PERMANENT, so the two inet_addr_lst walking
loops should not be expensive at all: they should just bail
out quickly and continue. Eric's patch suggests that this is due
to the churn caused by doing cancel_delayed_work +
queue_delayed_work from __inet_insert_ifa with a delay of 0?
If that's true, can this be safely skipped for PERMANENT addresses (i.e.
make delay infinite) with some careful adjustments on the delete
side to avoid the soft lockup? Been meaning to play with that.
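
The cancel/requeue hypothesis above can be pictured with a toy model
(not kernel code). It assumes each insert cancels any pending work and
requeues it with a fixed delay, and that the work runs only if no
later insert arrives before the delay expires. count_runs and the
1 ms insert spacing are made up for this sketch.

```shell
#!/bin/sh
# Toy model of "cancel pending work, requeue with delay D" (times in ms).
# Reads one insert timestamp per line on stdin, in increasing order, and
# prints how many times the delayed work would actually get to run.
count_runs() {
    awk -v delay="$1" '
        NR > 1 && $1 > prev + delay { runs++ }   # timer fired before this insert
        { prev = $1 }
        END { if (NR > 0) runs++; print runs + 0 }   # last queued work always fires
    '
}

# 1000 inserts, 1 ms apart (an assumed rate, purely illustrative)
seq 1 1000 | count_runs 0      # delay 0
seq 1 1000 | count_runs 250    # delay of 250 ms, i.e. HZ / 4
```

In this model a delay of 0 lets the work run once per insert, while a
250 ms delay collapses the whole burst into a single run. If each run
has to visit every hashed address, even just to test a flag, that
alone might explain a large difference, but this is a guess and not
something the thread confirms.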

fib_sync_up() could also use a magnifying glass - was going to
look at this later.

And, of course, there's all the other config/control paths that
could use optimization - my tunnel experiment, and Jamal's
point about dumping ip addresses come to mind.

--Sowmini


* Re: Scaling 'ip addr add' (was Re: What's the right way to use a *large* number of source addresses?)
  2014-05-29 16:11   ` sowmini varadhan
@ 2014-05-29 16:19     ` David Ahern
  0 siblings, 0 replies; 13+ messages in thread
From: David Ahern @ 2014-05-29 16:19 UTC (permalink / raw)
  To: sowmini varadhan, Julian Anastasov
  Cc: Jamal Hadi Salim, Eric Dumazet, Niels Möller, netdev, Jonas Bonn

On 5/29/14, 10:11 AM, sowmini varadhan wrote:
> I'm still not sure I understand *why* check_lifetime ends
> up being expensive, though. For this particular test, all the
...
>
> fib_sync_up() could also use a magnifying glass - was going to
> look at this later.

Annotation feature of perf should help
     perf record -ga -- <command>

     'perf report' with a TUI and then hit 'a' for annotation mode
     or use 'perf annotate -k vmlinux -s <symbol name>'

David
