linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics
@ 2016-08-29 19:03 Kirill Tkhai
  2016-08-29 19:03 ` [PATCH RFC 1/2] net: Implement net_stats callbacks Kirill Tkhai
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Kirill Tkhai @ 2016-08-29 19:03 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: yoshfuji, jmorris, davem, peterz, edumazet, mingo, kaber, ktkhai

Many variables of statistics type are made percpu in kernel. This allows
to do not make them atomic or to do not use synchronization. The result
value is calculated as sum of values on every possible cpu.

The problem is this scales bad. The calculations may took a lot of time.
For example, some machine configurations have many possible cpus like below:

"smpboot: Allowing 192 CPUs, 160 hotplug CPUs"

There are only 32 real cpus, but 192 possible cpus.

I had a report about very slow getifaddrs() on older kernel, when there are
possible only 590 getifaddrs calls/second on Xeon(R) CPU E5-2667 v3 @ 3.20GHz.

The patchset aims to begin solving of this problem. It makes possible to
iterate over present cpus mask instead of possible. When cpu is going down,
a statistics is being moved to an alive cpu. It's made in CPU_DYING callback,
which happens when machine is stopped. So, iteration  over present cpus mask
is safe under preemption disabled.

Patchset could exclude even offline cpus, but I didn't do that, because
the main problem seems to be possible cpus. Also, this would require to
do some changes in kernel/cpu.c, so I'd like to hear people opinion about
expediency of this before.

One more question is whether the whole kernel needs the same possibility
and the patchset should be more generic.

Please, comment!

For the above configuration the patchset improves the below test in 2.9 times:

#define _GNU_SOURCE     /* To get defns of NI_MAXSERV and NI_MAXHOST */
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/if_link.h>

int do_gia()
{
   struct ifaddrs *ifaddr, *ifa;
   int family, s, n;
   char host[NI_MAXHOST];

   if (getifaddrs(&ifaddr) == -1) {
           perror("getifaddrs");
           exit(EXIT_FAILURE);
   }

   /* touch the data  */
   for (ifa = ifaddr, n = 0; ifa != NULL; ifa = ifa->ifa_next, n++) {
           if (ifa->ifa_addr == NULL)
                   continue;
           family = ifa->ifa_addr->sa_family;
   }
   freeifaddrs(ifaddr);
}

int main(int argc, char *argv[])
{
        int i;
        for(i=0; i<10000; i++)
           do_gia();
}

---

Kirill Tkhai (2):
      net: Implement net_stats callbacks
      net: Iterate over present cpus only during ipstats calculation


 include/net/stats.h |    9 ++++++
 net/core/Makefile   |    1 +
 net/core/stats.c    |   83 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/addrconf.c |    4 ++
 net/ipv6/af_inet6.c |   56 ++++++++++++++++++++++++++++++++++
 5 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 include/net/stats.h
 create mode 100644 net/core/stats.c

--
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH RFC 1/2] net: Implement net_stats callbacks
  2016-08-29 19:03 [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Kirill Tkhai
@ 2016-08-29 19:03 ` Kirill Tkhai
  2016-08-29 19:04 ` [PATCH RFC 2/2] net: Iterate over present cpus only during ipstats calculation Kirill Tkhai
  2016-08-30  6:55 ` [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Peter Zijlstra
  2 siblings, 0 replies; 4+ messages in thread
From: Kirill Tkhai @ 2016-08-29 19:03 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: yoshfuji, jmorris, davem, peterz, edumazet, mingo, kaber, ktkhai

net_stats callbacks are functions, which are called
during a cpu is going down. They operate on percpu
statistics and should move it from dying cpu to an
alive one.

The callbacks are called on CPU_DYING stage, when
machine is stopped, and they are executed on dying
cpu.

This allows to minimize overhead of summation of
percpu statistics on all possible cpus, and iterate
only present cpus instead. It may give a signify
growth of performance on configurations with big
number of possible cpus.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 include/net/stats.h |    9 ++++++
 net/core/Makefile   |    1 +
 net/core/stats.c    |   83 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 93 insertions(+)
 create mode 100644 include/net/stats.h
 create mode 100644 net/core/stats.c

diff --git a/include/net/stats.h b/include/net/stats.h
new file mode 100644
index 0000000..7aebc56
--- /dev/null
+++ b/include/net/stats.h
@@ -0,0 +1,9 @@
+#ifndef _NET_STATS_H_
+#define _NET_STATS_H_
+
+typedef int (*net_stats_cb_t)(int cpu);
+
+extern int register_net_stats_cb(net_stats_cb_t func);
+extern int unregister_net_stats_cb(net_stats_cb_t func);
+
+#endif /* _NET_STATS_H_ */
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..c1093c3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
+obj-$(CONFIG_HOTPLUG_CPU) += stats.o
diff --git a/net/core/stats.c b/net/core/stats.c
new file mode 100644
index 0000000..0307e2c
--- /dev/null
+++ b/net/core/stats.c
@@ -0,0 +1,83 @@
+#include <linux/init.h>
+#include <linux/cpu.h>
+#include <linux/vmalloc.h>
+#include <net/stats.h>
+
+struct net_stats_cb_entry {
+	struct list_head list;
+	net_stats_cb_t func;
+};
+
+static DEFINE_SPINLOCK(net_stats_lock);
+static LIST_HEAD(net_stats_cb_list);
+
+int register_net_stats_cb(net_stats_cb_t func)
+{
+	struct net_stats_cb_entry *entry = vmalloc(sizeof(*entry));
+
+	if (!entry)
+		return -ENOMEM;
+	entry->func = func;
+	spin_lock(&net_stats_lock);
+	list_add_tail(&entry->list, &net_stats_cb_list);
+	spin_unlock(&net_stats_lock);
+	return 0;
+}
+EXPORT_SYMBOL(register_net_stats_cb);
+
+int unregister_net_stats_cb(net_stats_cb_t func)
+{
+	struct net_stats_cb_entry *entry;
+	bool found = false;
+
+	spin_lock(&net_stats_lock);
+	list_for_each_entry(entry, &net_stats_cb_list, list) {
+		if (entry->func == func) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&net_stats_lock);
+
+	if (!found)
+		return -ENOENT;
+
+	list_del(&entry->list);
+	vfree(entry);
+	return 0;
+}
+EXPORT_SYMBOL(unregister_net_stats_cb);
+
+static int net_stats_cpu_notify(struct notifier_block *nb,
+				unsigned long action, void *hcpu)
+{
+	struct net_stats_cb_entry *entry;
+	long cpu = (long)hcpu;
+	int ret;
+
+	if ((action & ~CPU_TASKS_FROZEN) == CPU_DYING) {
+		/* We call callbacks in dying stage, when machine is stopped */
+		spin_lock(&net_stats_lock);
+		list_for_each_entry(entry, &net_stats_cb_list, list) {
+			ret = entry->func(cpu);
+			if (ret)
+				break;
+		}
+		spin_unlock(&net_stats_lock);
+
+		if (ret)
+			return NOTIFY_BAD;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block net_stats_nfb = {
+	.notifier_call = net_stats_cpu_notify,
+};
+
+static int __init net_stats_init(void)
+{
+	return register_cpu_notifier(&net_stats_nfb);
+}
+
+subsys_initcall(net_stats_init);

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 2/2] net: Iterate over present cpus only during ipstats calculation
  2016-08-29 19:03 [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Kirill Tkhai
  2016-08-29 19:03 ` [PATCH RFC 1/2] net: Implement net_stats callbacks Kirill Tkhai
@ 2016-08-29 19:04 ` Kirill Tkhai
  2016-08-30  6:55 ` [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Peter Zijlstra
  2 siblings, 0 replies; 4+ messages in thread
From: Kirill Tkhai @ 2016-08-29 19:04 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: yoshfuji, jmorris, davem, peterz, edumazet, mingo, kaber, ktkhai

Use net_stats callback to iterate only present cpus mask.
This gives a signify performance growth on configurations
with large number of possible cpus.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 net/ipv6/addrconf.c |    4 +++-
 net/ipv6/af_inet6.c |   56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index df8425f..2ab088d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4970,10 +4970,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib,
 	memset(buff, 0, sizeof(buff));
 	buff[0] = IPSTATS_MIB_MAX;
 
-	for_each_possible_cpu(c) {
+	preempt_disable();
+	for_each_present_cpu(c) {
 		for (i = 1; i < IPSTATS_MIB_MAX; i++)
 			buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);
 	}
+	preempt_enable();
 
 	memcpy(stats, buff, IPSTATS_MIB_MAX * sizeof(u64));
 	memset(&stats[IPSTATS_MIB_MAX], 0, pad);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 46ad699..1c7a431 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -61,6 +61,7 @@
 #include <net/ip6_tunnel.h>
 #endif
 #include <net/calipso.h>
+#include <net/stats.h>
 
 #include <asm/uaccess.h>
 #include <linux/mroute6.h>
@@ -785,6 +786,55 @@ static void ipv6_cleanup_mibs(struct net *net)
 	kfree(net->mib.icmpv6msg_statistics);
 }
 
+static void in6_move_stats(int src_cpu, void __percpu *mib, int max)
+{
+	int i, target_cpu;
+	u64 *ptr, v;
+
+	target_cpu = cpumask_first(cpu_online_mask);
+
+	BUG_ON(target_cpu > nr_cpu_ids || src_cpu != smp_processor_id());
+
+	for (i = 1; i < max; i++) {
+		ptr = ((u64 *)per_cpu_ptr(mib, src_cpu)) + i;
+		v = *ptr;
+		*ptr = 0;
+
+		ptr = ((u64 *)per_cpu_ptr(mib, target_cpu)) + i;
+		*ptr += v;
+	}
+}
+
+static int ipv6_stats_cb(int cpu)
+{
+	struct net_device *dev;
+	struct inet6_dev *idev;
+	struct net *net;
+
+	WARN_ON(!irqs_disabled());
+
+	rcu_read_lock();
+	write_lock(&dev_base_lock);
+
+	for_each_net(net) {
+		if (!maybe_get_net(net))
+			continue;
+		for_each_netdev(net, dev) {
+			idev = in6_dev_get(dev);
+			if (!idev)
+				continue;
+			in6_move_stats(cpu, idev->stats.ipv6, IPSTATS_MIB_MAX);
+			in6_dev_put(idev);
+		}
+		put_net(net);
+	}
+
+	write_unlock(&dev_base_lock);
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static int __net_init inet6_net_init(struct net *net)
 {
 	int err = 0;
@@ -986,6 +1036,10 @@ static int __init inet6_init(void)
 	if (err)
 		goto pingv6_fail;
 
+	err = register_net_stats_cb(ipv6_stats_cb);
+	if (err)
+		goto net_stats_fail;
+
 	err = calipso_init();
 	if (err)
 		goto calipso_fail;
@@ -1003,6 +1057,8 @@ static int __init inet6_init(void)
 	calipso_exit();
 #endif
 calipso_fail:
+	unregister_net_stats_cb(ipv6_stats_cb);
+net_stats_fail:
 	pingv6_exit();
 pingv6_fail:
 	ipv6_packet_cleanup();

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics
  2016-08-29 19:03 [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Kirill Tkhai
  2016-08-29 19:03 ` [PATCH RFC 1/2] net: Implement net_stats callbacks Kirill Tkhai
  2016-08-29 19:04 ` [PATCH RFC 2/2] net: Iterate over present cpus only during ipstats calculation Kirill Tkhai
@ 2016-08-30  6:55 ` Peter Zijlstra
  2 siblings, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2016-08-30  6:55 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: netdev, linux-kernel, yoshfuji, jmorris, davem, edumazet, mingo, kaber

On Mon, Aug 29, 2016 at 10:03:48PM +0300, Kirill Tkhai wrote:
> Many variables of statistics type are made percpu in kernel. This allows
> to do not make them atomic or to do not use synchronization. The result
> value is calculated as sum of values on every possible cpu.
> 
> The problem is this scales bad. The calculations may took a lot of time.
> For example, some machine configurations have many possible cpus like below:
> 
> "smpboot: Allowing 192 CPUs, 160 hotplug CPUs"
> 
> There are only 32 real cpus, but 192 possible cpus.

This is fairly rare AFAIK. Its typically only found on machines with
empty sockets (rare, because empty sockets are expensive) or broken
BIOSes (I have one of the latter).

I've cured things by adding "possible_cpus=40" to the cmdline.

> I had a report about very slow getifaddrs() on older kernel, when there are
> possible only 590 getifaddrs calls/second on Xeon(R) CPU E5-2667 v3 @ 3.20GHz.
> 
> The patchset aims to begin solving of this problem. It makes possible to
> iterate over present cpus mask instead of possible. When cpu is going down,
> a statistics is being moved to an alive cpu. It's made in CPU_DYING callback,
> which happens when machine is stopped. So, iteration  over present cpus mask
> is safe under preemption disabled.
> 
> Patchset could exclude even offline cpus, but I didn't do that, because
> the main problem seems to be possible cpus. Also, this would require to
> do some changes in kernel/cpu.c, so I'd like to hear people opinion about
> expediency of this before.
> 
> One more question is whether the whole kernel needs the same possibility
> and the patchset should be more generic.

I'd vote for no. This isn't a fundamental optimization, the thing is
still O(n), we just reduced the n for one particular
machine.

[ and note that that machine is still wasting an enormous amount of
memory actually _having_ all that per-cpu storage, which too is gone
with the cmdline 'fix' ]

On machines which really have 192 (or more) CPUs, this will still be as
slow as ever.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-08-30  6:55 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-29 19:03 [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Kirill Tkhai
2016-08-29 19:03 ` [PATCH RFC 1/2] net: Implement net_stats callbacks Kirill Tkhai
2016-08-29 19:04 ` [PATCH RFC 2/2] net: Iterate over present cpus only during ipstats calculation Kirill Tkhai
2016-08-30  6:55 ` [PATCH RFC 0/2] net: Iterate over cpu_present_mask during calculation of percpu statistics Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).