linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
@ 2015-08-25  7:54 Raghavendra K T
  2015-08-25  7:54 ` [PATCH RFC 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25  7:54 UTC (permalink / raw)
  To: davem, kuznet, jmorris, yoshfuji, kaber
  Cc: jiri, edumazet, hannes, tom, azhou, ebiederm, ipm,
	nicolas.dichtel, netdev, linux-kernel, raghavendra.kt, anton,
	nacc, srikar

While creating 1000 containers, perf shows a lot of time spent in
snmp_fold_field on a large cpu system.

The current patch tries to improve this by reordering the statistics gathering.

Please note that similar overhead was also reported while creating
veth pairs  https://lkml.org/lkml/2013/3/19/556

Setup:
160 cpu (20 core) baremetal powerpc system with 1TB memory

1000 docker containers were created by running the command
docker run -itd  ubuntu:15.04  /bin/bash  in a loop

Observation:
Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
(at 1000 containers). perf data showed that creating veth interfaces, which
results in the code path below, was where the extra time was being spent.

rtnl_fill_ifinfo
  -> inet6_fill_link_af
    -> inet6_fill_ifla6_attrs
      -> snmp_fold_field

Proposed idea:
 Currently __snmp6_fill_stats64 calls snmp_fold_field, which walks
through the per cpu data of one item at a time (iteratively for around 90 items).
 The patch instead aggregates the statistics by going through
all the items of each cpu sequentially, which reduces cache
misses.
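
 Roughly, the reordering is the following (sketch only; tmp[] is a
zero-initialized scratch buffer and the names match the patches below):

	/* before: for each item, walk the per cpu data of every cpu */
	for (i = 1; i < items; i++)
		put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);

	/* after: for each cpu, walk all items once while its data is cache hot */
	for_each_possible_cpu(c)
		for (i = 1; i < items; i++)
			tmp[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);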

Docker creation performance improved by more than 2x after the patch.

before the patch: 
================
time docker run -itd  ubuntu:15.04  /bin/bash
3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617
real	0m6.836s
user	0m0.095s
sys	0m0.011s

perf record -a docker run -itd  ubuntu:15.04  /bin/bash
=======================================================
# Samples: 32K of event 'cycles'
# Event count (approx.): 24688700190
# Overhead  Command          Shared Object           Symbol                                                                                         
# ........  ...............  ......................  ........................
    50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field                                                                                                        
     9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop                                                                                                            
     3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one                                                                                                         
     2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock                                                                                                         
     1.37%  docker           docker                  [.] backtrace_qsort                                                                                                        
     1.31%  docker           docker                  [.] strings.FieldsFunc                                                                      

  cache-misses:  2.7%
                                                      
after the patch:
================
time docker run -itd  ubuntu:15.04  /bin/bash
4e0619421332990bdea413fe455ab187607ed63d33d5c37aa5291bc2f5b35857
real	0m3.357s
user	0m0.092s
sys	0m0.010s

perf record -a docker run -itd  ubuntu:15.04  /bin/bash
=======================================================
# Samples: 15K of event 'cycles'
# Event count (approx.): 11471830714
# Overhead  Command          Shared Object         Symbol                                                                                         
# ........  ...............  ....................  .........................
    10.56%  swapper          [kernel.kallsyms]     [k] snooze_loop                                                                                            
     8.72%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field                                                                                     
     7.59%  docker           [kernel.kallsyms]     [k] veth_stats_one                                                                                         
     3.65%  swapper          [kernel.kallsyms]     [k] _raw_spin_lock                                                                                         
     3.06%  docker           docker                [.] strings.FieldsFunc                                                                                     
     2.96%  docker           docker                [.] backtrace_qsort      
                
cache-misses: 1.38 %

Please let me know if you have suggestions/comments.

Raghavendra K T (2):
  net: Introduce helper functions to get the per cpu data
  net: Optimize snmp stat aggregation by walking all the percpu data at
    once

 include/net/ip.h    | 10 ++++++++++
 net/ipv4/af_inet.c  | 41 +++++++++++++++++++++++++++--------------
 net/ipv6/addrconf.c | 14 +++++++++++---
 3 files changed, 48 insertions(+), 17 deletions(-)

-- 
1.7.11.7



* [PATCH RFC 1/2] net: Introduce helper functions to get the per cpu data
  2015-08-25  7:54 [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
@ 2015-08-25  7:54 ` Raghavendra K T
  2015-08-25  7:54 ` [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25  7:54 UTC (permalink / raw)
  To: davem, kuznet, jmorris, yoshfuji, kaber
  Cc: jiri, edumazet, hannes, tom, azhou, ebiederm, ipm,
	nicolas.dichtel, netdev, linux-kernel, raghavendra.kt, anton,
	nacc, srikar

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 include/net/ip.h   | 10 ++++++++++
 net/ipv4/af_inet.c | 41 +++++++++++++++++++++++++++--------------
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index d5fe9f2..93bf12e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd)
 #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd)
 
+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt);
 unsigned long snmp_fold_field(void __percpu *mib, int offt);
 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt,
+			 size_t syncp_offset);
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off);
 #else
+static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt,
+					size_t syncp_offset)
+{
+	return snmp_get_cpu_field(mib, cpu, offt);
+
+}
+
 static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off)
 {
 	return snmp_fold_field(mib, offt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..302e36b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 }
 EXPORT_SYMBOL_GPL(inet_ctl_sock_create);
 
+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
+{
+	return  *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt);
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field);
+
 unsigned long snmp_fold_field(void __percpu *mib, int offt)
 {
 	unsigned long res = 0;
 	int i;
 
 	for_each_possible_cpu(i)
-		res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt);
+		res += snmp_get_cpu_field(mib, i, offt);
 	return res;
 }
 EXPORT_SYMBOL_GPL(snmp_fold_field);
 
 #if BITS_PER_LONG==32
 
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt,
+			 size_t syncp_offset)
+{
+	void *bhptr;
+	struct u64_stats_sync *syncp;
+	u64 v;
+	unsigned int start;
+
+	bhptr = per_cpu_ptr(mib, cpu);
+	syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
+	do {
+		start = u64_stats_fetch_begin_irq(syncp);
+		v = *(((u64 *)bhptr) + offt);
+	} while (u64_stats_fetch_retry_irq(syncp, start));
+
+	return v;
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field64);
+
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 {
 	u64 res = 0;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		void *bhptr;
-		struct u64_stats_sync *syncp;
-		u64 v;
-		unsigned int start;
-
-		bhptr = per_cpu_ptr(mib, cpu);
-		syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
-		do {
-			start = u64_stats_fetch_begin_irq(syncp);
-			v = *(((u64 *) bhptr) + offt);
-		} while (u64_stats_fetch_retry_irq(syncp, start));
-
-		res += v;
+		res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset);
 	}
 	return res;
 }
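
(Not part of the patch: an illustrative sketch of the intended caller pattern.
NR_ITEMS is only a placeholder for the number of counters in the given mib,
and sum[] is a caller-provided accumulation buffer.)

	u64 sum[NR_ITEMS] = { 0 };
	int cpu, i;

	/* walk each cpu's per-cpu area once, accumulating all items */
	for_each_possible_cpu(cpu)
		for (i = 0; i < NR_ITEMS; i++)
			sum[i] += snmp_get_cpu_field(mib, cpu, i);
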
-- 
1.7.11.7



* [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25  7:54 [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
  2015-08-25  7:54 ` [PATCH RFC 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
@ 2015-08-25  7:54 ` Raghavendra K T
  2015-08-25 14:28   ` Eric Dumazet
  2015-08-25 14:33 ` [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Eric Dumazet
  2015-08-25 23:07 ` David Miller
  3 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25  7:54 UTC (permalink / raw)
  To: davem, kuznet, jmorris, yoshfuji, kaber
  Cc: jiri, edumazet, hannes, tom, azhou, ebiederm, ipm,
	nicolas.dichtel, netdev, linux-kernel, raghavendra.kt, anton,
	nacc, srikar

Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
(at 1000 containers), and perf data showed 50% overhead in snmp_fold_field.

reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
through per cpu data of an item (iteratively for around 90 items).

idea: This patch tries to aggregate the statistics by going through
all the items of each cpu sequentially, which reduces cache
misses.

Docker creation got faster by more than 2x after the patch.

Result:
                       Before           After
Docker creation time   6.836s           3.357s
cache miss             2.7%             1.38%

perf before:
    50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
     9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
     3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
     2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock

perf after:
    10.56%  swapper          [kernel.kallsyms]     [k] snooze_loop
     8.72%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
     7.59%  docker           [kernel.kallsyms]     [k] veth_stats_one
     3.65%  swapper          [kernel.kallsyms]     [k] _raw_spin_lock

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 net/ipv6/addrconf.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..2ec905f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4624,16 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 }
 
 static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
-				      int items, int bytes, size_t syncpoff)
+					int items, int bytes, size_t syncpoff)
 {
-	int i;
+	int i, c;
+	u64 *tmp;
 	int pad = bytes - sizeof(u64) * items;
 	BUG_ON(pad < 0);
 
+	tmp = kcalloc(items, sizeof(u64), GFP_KERNEL);
+
 	/* Use put_unaligned() because stats may not be aligned for u64. */
 	put_unaligned(items, &stats[0]);
+
+	for_each_possible_cpu(c)
+		for (i = 1; i < items; i++)
+			tmp[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
+
 	for (i = 1; i < items; i++)
-		put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]);
+		put_unaligned(tmp[i], &stats[i]);
 
 	memset(&stats[items], 0, pad);
 }
-- 
1.7.11.7



* Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25  7:54 ` [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
@ 2015-08-25 14:28   ` Eric Dumazet
  2015-08-25 15:47     ` Raghavendra K T
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-08-25 14:28 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On Tue, 2015-08-25 at 13:24 +0530, Raghavendra K T wrote:
> Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
> (at 1000 containers), and perf data showed 50% overhead in snmp_fold_field.
> 
> reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
> through per cpu data of an item (iteratively for around 90 items).
> 
> idea: This patch tries to aggregate the statistics by going through
> all the items of each cpu sequentially, which reduces cache
> misses.
> 
> Docker creation got faster by more than 2x after the patch.
> 
> Result:
>                        Before           After
> Docker creation time   6.836s           3.357s
> cache miss             2.7%             1.38%
> 
> perf before:
>     50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
>      9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
>      3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
>      2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
> 
> perf after:
>     10.56%  swapper          [kernel.kallsyms]     [k] snooze_loop
>      8.72%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
>      7.59%  docker           [kernel.kallsyms]     [k] veth_stats_one
>      3.65%  swapper          [kernel.kallsyms]     [k] _raw_spin_lock
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  net/ipv6/addrconf.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 21c2c81..2ec905f 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4624,16 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
>  }
>  
>  static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
> -				      int items, int bytes, size_t syncpoff)
> +					int items, int bytes, size_t syncpoff)
>  {
> -	int i;
> +	int i, c;
> +	u64 *tmp;
>  	int pad = bytes - sizeof(u64) * items;
>  	BUG_ON(pad < 0);
>  
> +	tmp = kcalloc(items, sizeof(u64), GFP_KERNEL);
> +


This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash
the whole kernel at this point.
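
(For illustration, one minimal guard; whether to bail out silently or to
propagate -ENOMEM up to the caller is exactly the question discussed in the
follow-ups below.)

	tmp = kcalloc(items, sizeof(u64), GFP_KERNEL);
	if (!tmp)
		return;	/* stats are left unfilled; alternatively plumb -ENOMEM up */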





* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-25  7:54 [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
  2015-08-25  7:54 ` [PATCH RFC 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
  2015-08-25  7:54 ` [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
@ 2015-08-25 14:33 ` Eric Dumazet
  2015-08-25 15:55   ` Raghavendra K T
  2015-08-25 23:07 ` David Miller
  3 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-08-25 14:33 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On Tue, 2015-08-25 at 13:24 +0530, Raghavendra K T wrote:
> While creating 1000 containers, perf shows a lot of time spent in
> snmp_fold_field on a large cpu system.
> 
> The current patch tries to improve this by reordering the statistics gathering.
> 
> Please note that similar overhead was also reported while creating
> veth pairs  https://lkml.org/lkml/2013/3/19/556
> 
> Setup:
> 160 cpu (20 core) baremetal powerpc system with 1TB memory

I wonder if these kinds of results would demonstrate cache coloring
problems on this host. Looks like all the per cpu data are colliding on
the same cache lines.




* Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25 14:28   ` Eric Dumazet
@ 2015-08-25 15:47     ` Raghavendra K T
  2015-08-25 16:00       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On 08/25/2015 07:58 PM, Eric Dumazet wrote:
> On Tue, 2015-08-25 at 13:24 +0530, Raghavendra K T wrote:
>> Docker container creation time increased linearly from around 1.6 sec to 7.5 sec
>> (at 1000 containers), and perf data showed 50% overhead in snmp_fold_field.
>>
>> reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
>> through per cpu data of an item (iteratively for around 90 items).
>>
>> idea: This patch tries to aggregate the statistics by going through
>> all the items of each cpu sequentially, which reduces cache
>> misses.
>>
>> Docker creation got faster by more than 2x after the patch.
>>
>> Result:
>>                         Before           After
>> Docker creation time   6.836s           3.357s
>> cache miss             2.7%             1.38%
>>
>> perf before:
>>      50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
>>       9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
>>       3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
>>       2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
>>
>> perf after:
>>      10.56%  swapper          [kernel.kallsyms]     [k] snooze_loop
>>       8.72%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
>>       7.59%  docker           [kernel.kallsyms]     [k] veth_stats_one
>>       3.65%  swapper          [kernel.kallsyms]     [k] _raw_spin_lock
>>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   net/ipv6/addrconf.c | 14 +++++++++++---
>>   1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>> index 21c2c81..2ec905f 100644
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -4624,16 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
>>   }
>>
>>   static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
>> -				      int items, int bytes, size_t syncpoff)
>> +					int items, int bytes, size_t syncpoff)
>>   {
>> -	int i;
>> +	int i, c;
>> +	u64 *tmp;
>>   	int pad = bytes - sizeof(u64) * items;
>>   	BUG_ON(pad < 0);
>>
>> +	tmp = kcalloc(items, sizeof(u64), GFP_KERNEL);
>> +
>
>
> This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash
> the whole kernel at this point.
>

Good catch, and my bad. Though the system would be in a bad memory condition,
since fill_stat is not critical for the system, do you think silently
returning from here is a good idea?
Or do you think we should propagate -ENOMEM all the way up?





* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-25 14:33 ` [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Eric Dumazet
@ 2015-08-25 15:55   ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25 15:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On 08/25/2015 08:03 PM, Eric Dumazet wrote:
> On Tue, 2015-08-25 at 13:24 +0530, Raghavendra K T wrote:
>> While creating 1000 containers, perf shows a lot of time spent in
>> snmp_fold_field on a large cpu system.
>>
>> The current patch tries to improve this by reordering the statistics gathering.
>>
>> Please note that similar overhead was also reported while creating
>> veth pairs  https://lkml.org/lkml/2013/3/19/556
>>
>> Setup:
>> 160 cpu (20 core) baremetal powerpc system with 1TB memory
>
> I wonder if these kinds of results would demonstrate cache coloring
> problems on this host. Looks like all the per cpu data are colliding on
> the same cache lines.
>

It could be. My testing on a 128 cpu system with less memory did not
incur a huge time penalty for 1000 containers,
but snmp_fold_field in general had the problem:
e.g. in the same experiment the snmp_fold_field overhead was around 15%,
reduced to 5% after the patch.





* Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25 15:47     ` Raghavendra K T
@ 2015-08-25 16:00       ` Eric Dumazet
  2015-08-25 16:06         ` Raghavendra K T
  2015-08-26 11:04         ` Raghavendra K T
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-08-25 16:00 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On Tue, 2015-08-25 at 21:17 +0530, Raghavendra K T wrote:
> On 08/25/2015 07:58 PM, Eric Dumazet wrote:

> >
> >
> > This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash
> > the whole kernel at this point.
> >
> 
> Good catch, and my bad. Though the system would be in a bad memory condition,
> since fill_stat is not critical for the system, do you think silently
> returning from here is a good idea?
> Or do you think we should propagate -ENOMEM all the way up?

Hmm... presumably these 288 bytes could be allocated in
inet6_fill_ifla6_attrs() stack frame.

Also it is weird that we fill all these stats for a device we just created
and never enabled: initial stats are all 0 for them.




* Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25 16:00       ` Eric Dumazet
@ 2015-08-25 16:06         ` Raghavendra K T
  2015-08-26 11:04         ` Raghavendra K T
  1 sibling, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-25 16:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On 08/25/2015 09:30 PM, Eric Dumazet wrote:
> On Tue, 2015-08-25 at 21:17 +0530, Raghavendra K T wrote:
>> On 08/25/2015 07:58 PM, Eric Dumazet wrote:
>
>>>
>>>
>>> This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash
>>> the whole kernel at this point.
>>>
>>
>> Good catch, and my bad. Though the system would be in a bad memory condition,
>> since fill_stat is not critical for the system, do you think silently
>> returning from here is a good idea?
>> Or do you think we should propagate -ENOMEM all the way up?
>
> Hmm... presumably these 288 bytes could be allocated in
> inet6_fill_ifla6_attrs() stack frame.
>
> Also it is weird that we fill all these stats for a device we just created
> and never enabled: initial stats are all 0 for them.
>

Yes it is. Initially I was even thinking:

1. whether we could disable the stat filling just after creation (only
allocate the space for statistics but do not fill it).

2. whether we should have a PROC_FS_NET_SNMP config option which we could
disable if not necessary.

3. whether we should defer this snmp_fold walk to a workqueue (unfortunately
there is not much work left after this stat filling that could run in
parallel before we wait for completion... or maybe there is a way).







* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-25  7:54 [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
                   ` (2 preceding siblings ...)
  2015-08-25 14:33 ` [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Eric Dumazet
@ 2015-08-25 23:07 ` David Miller
  2015-08-26 10:25   ` Raghavendra K T
  3 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2015-08-25 23:07 UTC (permalink / raw)
  To: raghavendra.kt
  Cc: kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes, tom,
	azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Date: Tue, 25 Aug 2015 13:24:24 +0530

> Please let me know if you have suggestions/comments.

Like Eric Dumazet said the idea is good but needs some adjustments.

You might want to see whether a per-cpu work buffer works for this.

It's extremely unfortunate that we can't depend upon the destination
buffer being properly aligned, because we wouldn't need a temporary
scratch area if it were aligned properly.


* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-25 23:07 ` David Miller
@ 2015-08-26 10:25   ` Raghavendra K T
  2015-08-26 14:09     ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Raghavendra K T @ 2015-08-26 10:25 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes, tom,
	azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On 08/26/2015 04:37 AM, David Miller wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Date: Tue, 25 Aug 2015 13:24:24 +0530
>
>> Please let me know if you have suggestions/comments.
>
> Like Eric Dumazet said the idea is good but needs some adjustments.
>
> You might want to see whether a per-cpu work buffer works for this.

Sure, let me know if I understood correctly:

we allocate the temp buffer,
we will have an "add_this_cpu_data" function and do

for_each_online_cpu(cpu)
                 smp_call_function_single(cpu, add_this_cpu_data, buffer, 1)

If not, could you please point to an example you had in mind?

>
> It's extremely unfortunate that we can't depend upon the destination
> buffer being properly aligned, because we wouldn't need a temporary
> scratch area if it were aligned properly.

True, but I think for 64 bit cpus when (pad == 0) we can go ahead and
use the stats array directly and get rid of put_unaligned(). Is it correct?

(my internal initial patch had this version but I thought it was ugly to
have an ifdef BITS_PER_LONG==64)






* Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
  2015-08-25 16:00       ` Eric Dumazet
  2015-08-25 16:06         ` Raghavendra K T
@ 2015-08-26 11:04         ` Raghavendra K T
  1 sibling, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-26 11:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet, hannes,
	tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev, linux-kernel,
	anton, nacc, srikar

On 08/25/2015 09:30 PM, Eric Dumazet wrote:
> On Tue, 2015-08-25 at 21:17 +0530, Raghavendra K T wrote:
>> On 08/25/2015 07:58 PM, Eric Dumazet wrote:
>
>>>
>>>
>>> This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash
>>> the whole kernel at this point.
>>>
>>
>> Good catch, and my bad. Though the system would be in a bad memory condition,
>> since fill_stat is not critical for the system, do you think silently
>> returning from here is a good idea?
>> Or do you think we should propagate -ENOMEM all the way up?
>
> Hmm... presumably these 288 bytes could be allocated in
> inet6_fill_ifla6_attrs() stack frame.

Correct. Since we need to allocate for IPSTATS_MIB_MAX, we could do
this even in the snmp6_fill_stats() stack frame.





* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-26 10:25   ` Raghavendra K T
@ 2015-08-26 14:09     ` Eric Dumazet
  2015-08-26 14:30       ` Raghavendra K T
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-08-26 14:09 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: David Miller, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet,
	hannes, tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev,
	linux-kernel, anton, nacc, srikar

On Wed, 2015-08-26 at 15:55 +0530, Raghavendra K T wrote:
> On 08/26/2015 04:37 AM, David Miller wrote:
> > From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > Date: Tue, 25 Aug 2015 13:24:24 +0530
> >
> >> Please let me know if you have suggestions/comments.
> >
> > Like Eric Dumazet said the idea is good but needs some adjustments.
> >
> > You might want to see whether a per-cpu work buffer works for this.
> 
> Sure, let me know if I understood correctly:
> 
> we allocate the temp buffer,
> we will have an "add_this_cpu_data" function and do
> 
> for_each_online_cpu(cpu)
>                  smp_call_function_single(cpu, add_this_cpu_data, buffer, 1)
> 
> If not, could you please point to an example you had in mind?


Sorry, I do not think it is a good idea.

Sending an IPI is way more expensive and intrusive than reading 4 or 5
cache lines from memory (per cpu).

Definitely not something we want.

> 
> >
> > It's extremely unfortunate that we can't depend upon the destination
> > buffer being properly aligned, because we wouldn't need a temporary
> > scratch area if it were aligned properly.
> 
> True, but I think for 64 bit cpus when (pad == 0) we can go ahead and
> use the stats array directly and get rid of put_unaligned(). Is it correct?


Nope. We have no alignment guarantee. It could be 0x............04
pointer value. (ie not a multiple of 8)

> 
> (my internal initial patch had this version but I thought it was ugly to
> have an ifdef BITS_PER_LONG==64)

This has nothing to do with the arch having 64 bits per long. It is about
the alignment of a u64.





* Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
  2015-08-26 14:09     ` Eric Dumazet
@ 2015-08-26 14:30       ` Raghavendra K T
  0 siblings, 0 replies; 14+ messages in thread
From: Raghavendra K T @ 2015-08-26 14:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, kuznet, jmorris, yoshfuji, kaber, jiri, edumazet,
	hannes, tom, azhou, ebiederm, ipm, nicolas.dichtel, netdev,
	linux-kernel, anton, nacc, srikar

On 08/26/2015 07:39 PM, Eric Dumazet wrote:
> On Wed, 2015-08-26 at 15:55 +0530, Raghavendra K T wrote:
>> On 08/26/2015 04:37 AM, David Miller wrote:
>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> Date: Tue, 25 Aug 2015 13:24:24 +0530
>>>
>>>> Please let me know if you have suggestions/comments.
>>>
>>> Like Eric Dumazet said the idea is good but needs some adjustments.
>>>
>>> You might want to see whether a per-cpu work buffer works for this.
>>
>> sure, Let me know if I understood correctly,
>>
>> we allocate the temp buffer,
>> we will have a  "add_this_cpu_data" function and do
>>
>> for_each_online_cpu(cpu)
>>                   smp_call_function_single(cpu, add_this_cpu_data, buffer, 1)
>>
>> if not could you please point to an example you had in mind.
>
>
> Sorry, I do not think it is a good idea.
> 
> Sending an IPI is way more expensive and intrusive than reading 4 or 5
> cache lines from memory (per cpu).
>
> Definitely not something we want.

Okay. Another problem I thought of here was that we could only loop over
online cpus.

>>> It's extremely unfortunate that we can't depend upon the destination
>>> buffer being properly aligned, because we wouldn't need a temporary
>>> scratch area if it were aligned properly.
>>
>> True, but I think for 64 bit cpus when (pad == 0) we can go ahead and
>> use the stats array directly and get rid of put_unaligned(). Is it correct?
>
>
> Nope. We have no alignment guarantee. It could be 0x............04
> pointer value. (ie not a multiple of 8)
>
>>
>> (my internal initial patch had this version but I thought it was ugly to
>> have an ifdef BITS_PER_LONG==64)
>
> This has nothing to do with the arch having 64 bits per long. It is about
> the alignment of a u64.
>

Okay. I'll send V2 declaring the tmp buffer on the stack.
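
(For reference, a rough sketch of such a stack-based variant, illustration
only; the actual V2 may differ. The buffer is bounded by IPSTATS_MIB_MAX as
discussed above, and put_unaligned() is kept because stats may be unaligned.)

	static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
						int items, int bytes, size_t syncpoff)
	{
		int i, c;
		u64 buff[IPSTATS_MIB_MAX] = {0};
		int pad = bytes - sizeof(u64) * items;

		BUG_ON(pad < 0);

		put_unaligned(items, &stats[0]);

		/* walk each cpu's per-cpu area once, accumulating all items */
		for_each_possible_cpu(c)
			for (i = 1; i < items; i++)
				buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);

		for (i = 1; i < items; i++)
			put_unaligned(buff[i], &stats[i]);

		memset(&stats[items], 0, pad);
	}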




Thread overview: 14+ messages
2015-08-25  7:54 [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Raghavendra K T
2015-08-25  7:54 ` [PATCH RFC 1/2] net: Introduce helper functions to get the per cpu data Raghavendra K T
2015-08-25  7:54 ` [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Raghavendra K T
2015-08-25 14:28   ` Eric Dumazet
2015-08-25 15:47     ` Raghavendra K T
2015-08-25 16:00       ` Eric Dumazet
2015-08-25 16:06         ` Raghavendra K T
2015-08-26 11:04         ` Raghavendra K T
2015-08-25 14:33 ` [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus Eric Dumazet
2015-08-25 15:55   ` Raghavendra K T
2015-08-25 23:07 ` David Miller
2015-08-26 10:25   ` Raghavendra K T
2015-08-26 14:09     ` Eric Dumazet
2015-08-26 14:30       ` Raghavendra K T
