* [patch v2 0/5] percpu_counter: bug fix and enhancement
@ 2011-05-11  8:10 Shaohua Li
  2011-05-11  8:10 ` [patch v2 1/5] percpu_counter: fix code for 32bit systems for UP Shaohua Li
                   ` (5 more replies)
  0 siblings, 6 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin

This patch set does two things.
1. Fix a bug on 32-bit systems. percpu_counter uses an s64 counter. Without
any locking, reading an s64 on a 32-bit system isn't safe and can return a
badly wrong value.
2. Improve the scalability of __percpu_counter_add. In some cases _add can
cause heavy lock contention (see patch 4 for detailed information and data).
The patches remove the contention and speed it up a bit. The last post
(http://marc.info/?l=linux-kernel&m=130259547913607&w=2) simply used
atomic64 for percpu_counter, but Tejun pointed out this could cause
deviation in __percpu_counter_sum.
The new implementation uses lglock to protect the percpu data. Each cpu has
its own lock, which no other cpu takes. This way _add no longer needs to take
the global lock, and the deviation is gone. This still gives me about a
5x ~ 6x speedup (not as fast as the original 7x, but still good) with the
workload mentioned in patch 4.
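Roughly, _add and _sum end up looking like the sketch below once patches 2-4
are applied (patch 5 then makes the _add fast path preemptless). This is a
simplified sketch only: lockdep annotations, error handling and the
CPU-hotplug list member are omitted.

struct percpu_counter {
	atomic64_t count;
	s32 __percpu *counters;
	struct lglock lglock;	/* one arch_spinlock_t per cpu */
};

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (count >= batch || count <= -batch) {
		lg_local_lock(fbc->lglock);	/* this cpu's lock only */
		atomic64_add(count, &fbc->count);
		__this_cpu_write(*fbc->counters, 0);
		lg_local_unlock(fbc->lglock);
	} else {
		__this_cpu_write(*fbc->counters, count);
	}
	preempt_enable();
}

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;

	lg_global_lock_online(fbc->lglock);	/* every online cpu's lock */
	ret = atomic64_read(&fbc->count);
	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	lg_global_unlock_online(fbc->lglock);
	return ret;
}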

patch 1 fixes the s64 read bug on 32-bit systems for UP.
patch 2 converts lglock to work with dynamically allocated structures. A
later patch uses lglock for percpu_counter.
patches 3,4 fix the s64 read bug on 32-bit systems for SMP and also improve
the scalability of __percpu_counter_add.
patch 5 is from Christoph Lameter and makes the __percpu_counter_add fast
path preemptless. I added it here because I converted percpu_counter to use
lglock. Any bugs there are mine.

Comments and suggestions are welcome!

Thanks,
Shaohua


* [patch v2 1/5] percpu_counter: fix code for 32bit systems for UP
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
@ 2011-05-11  8:10 ` Shaohua Li
  2011-05-11  8:10 ` [patch v2 2/5] lglock: convert it to work with dynamically allocated structure Shaohua Li
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin, Shaohua Li

[-- Attachment #1: percpu-counter-32bits.patch --]
[-- Type: text/plain, Size: 1915 bytes --]

percpu_counter.count is an 's64'. Accessing it on a 32-bit system is racy;
we need some locking to protect it, otherwise a very wrong (torn) value can
be read.
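For illustration only (userspace, not kernel code): the same kind of torn
access can be reproduced with a plain 64-bit variable on a 32-bit build
(e.g. gcc -m32 -O2 -pthread), because each 64-bit access is typically two
32-bit loads/stores there. In the UP kernel the interleaving comes from
preemption rather than from a second CPU, which is why preempt_disable()
around the access is enough.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile int64_t counter = 0x0ffffffffLL;

static void *writer(void *arg)
{
	for (;;) {
		counter = 0x100000000LL;	/* hi = 1, lo = 0 */
		counter = 0x0ffffffffLL;	/* hi = 0, lo = 0xffffffff */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	for (;;) {
		int64_t v = counter;
		/* a torn read mixes the halves of the two values above */
		if (v != 0x100000000LL && v != 0x0ffffffffLL)
			printf("torn read: %#llx\n", (unsigned long long)v);
	}
	return 0;
}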

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 include/linux/percpu_counter.h |   31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h	2011-05-05 10:33:12.000000000 +0800
+++ linux/include/linux/percpu_counter.h	2011-05-06 11:19:26.000000000 +0800
@@ -101,14 +101,34 @@ static inline void percpu_counter_destro
 
 static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
+#if BITS_PER_LONG == 32
+	preempt_disable();
+	fbc->count = amount;
+	preempt_enable();
+#else
 	fbc->count = amount;
+#endif
+}
+
+static inline s64 percpu_counter_read(struct percpu_counter *fbc)
+{
+#if BITS_PER_LONG == 32
+	s64 count;
+	preempt_disable();
+	count = fbc->count;
+	preempt_enable();
+	return count;
+#else
+	return fbc->count;
+#endif
 }
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
 {
-	if (fbc->count > rhs)
+	s64 count = percpu_counter_read(fbc);
+	if (count > rhs)
 		return 1;
-	else if (fbc->count < rhs)
+	else if (count < rhs)
 		return -1;
 	else
 		return 0;
@@ -128,18 +148,13 @@ __percpu_counter_add(struct percpu_count
 	percpu_counter_add(fbc, amount);
 }
 
-static inline s64 percpu_counter_read(struct percpu_counter *fbc)
-{
-	return fbc->count;
-}
-
 /*
  * percpu_counter is intended to track positive number. In UP case, the number
  * should never be negative.
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	return percpu_counter_read(fbc);
 }
 
 static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)



* [patch v2 2/5] lglock: convert it to work with dynamically allocated structure
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
  2011-05-11  8:10 ` [patch v2 1/5] percpu_counter: fix code for 32bit systems for UP Shaohua Li
@ 2011-05-11  8:10 ` Shaohua Li
  2011-05-11  8:10 ` [patch v2 3/5] percpu_counter: use lglock to protect percpu data Shaohua Li
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin, Shaohua Li

[-- Attachment #1: lglock-work-struct.patch --]
[-- Type: text/plain, Size: 12242 bytes --]

Convert lglock to work with dynamically allocated structures.
I see no fundamental reason lglock must be static, and the conversion
actually reduces code size. The next patch uses lglock inside a dynamically
allocated structure.
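For reference, this is roughly how the converted API is meant to be used from
a dynamically allocated object (a usage sketch only; "struct foo" and the
functions around it are made up for illustration):

#include <linux/lglock.h>
#include <linux/slab.h>

struct foo {
	struct lglock lg;
	/* ... per-cpu data protected by lg ... */
};

static struct foo *foo_create(void)
{
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;
	if (lglock_alloc(&f->lg)) {	/* allocate the percpu locks */
		kfree(f);
		return NULL;
	}
	lglock_init(&f->lg, "foo_lg");	/* init locks + lockdep class */
	return f;
}

static void foo_destroy(struct foo *f)
{
	lglock_free(&f->lg);
	kfree(f);
}

static void foo_fast_path(struct foo *f)
{
	lglock_local_lock(&f->lg);	/* only this cpu's lock */
	/* touch this cpu's part of f */
	lglock_local_unlock(&f->lg);
}

static void foo_slow_path(struct foo *f)
{
	lglock_global_lock_online(&f->lg);	/* all online cpus' locks */
	/* walk every cpu's part of f */
	lglock_global_unlock_online(&f->lg);
}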

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 include/linux/lglock.h |  192 ++++++++++++-------------------------------------
 kernel/Makefile        |    2 
 kernel/lglock.c        |  124 +++++++++++++++++++++++++++++++
 3 files changed, 175 insertions(+), 143 deletions(-)

Index: linux/include/linux/lglock.h
===================================================================
--- linux.orig/include/linux/lglock.h	2011-05-10 16:10:55.000000000 +0800
+++ linux/include/linux/lglock.h	2011-05-11 09:42:07.000000000 +0800
@@ -1,7 +1,5 @@
 /*
- * Specialised local-global spinlock. Can only be declared as global variables
- * to avoid overhead and keep things simple (and we don't want to start using
- * these inside dynamically allocated structures).
+ * Specialised local-global spinlock.
  *
  * "local/global locks" (lglocks) can be used to:
  *
@@ -23,150 +21,60 @@
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
 
-/* can make br locks by using local lock for read side, global lock for write */
-#define br_lock_init(name)	name##_lock_init()
-#define br_read_lock(name)	name##_local_lock()
-#define br_read_unlock(name)	name##_local_unlock()
-#define br_write_lock(name)	name##_global_lock_online()
-#define br_write_unlock(name)	name##_global_unlock_online()
-
-#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
-#define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
-
-
-#define lg_lock_init(name)	name##_lock_init()
-#define lg_local_lock(name)	name##_local_lock()
-#define lg_local_unlock(name)	name##_local_unlock()
-#define lg_local_lock_cpu(name, cpu)	name##_local_lock_cpu(cpu)
-#define lg_local_unlock_cpu(name, cpu)	name##_local_unlock_cpu(cpu)
-#define lg_global_lock(name)	name##_global_lock()
-#define lg_global_unlock(name)	name##_global_unlock()
-#define lg_global_lock_online(name) name##_global_lock_online()
-#define lg_global_unlock_online(name) name##_global_unlock_online()
-
+struct lglock {
+	arch_spinlock_t __percpu *locks;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-#define LOCKDEP_INIT_MAP lockdep_init_map
-
-#define DEFINE_LGLOCK_LOCKDEP(name)					\
- struct lock_class_key name##_lock_key;					\
- struct lockdep_map name##_lock_dep_map;				\
- EXPORT_SYMBOL(name##_lock_dep_map)
-
-#else
-#define LOCKDEP_INIT_MAP(a, b, c, d)
-
-#define DEFINE_LGLOCK_LOCKDEP(name)
+	struct lockdep_map lock_dep_map;
 #endif
-
+};
 
 #define DECLARE_LGLOCK(name)						\
- extern void name##_lock_init(void);					\
- extern void name##_local_lock(void);					\
- extern void name##_local_unlock(void);					\
- extern void name##_local_lock_cpu(int cpu);				\
- extern void name##_local_unlock_cpu(int cpu);				\
- extern void name##_global_lock(void);					\
- extern void name##_global_unlock(void);				\
- extern void name##_global_lock_online(void);				\
- extern void name##_global_unlock_online(void);				\
+	extern struct lglock name;
 
 #define DEFINE_LGLOCK(name)						\
 									\
- DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
- DEFINE_LGLOCK_LOCKDEP(name);						\
-									\
- void name##_lock_init(void) {						\
-	int i;								\
-	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_lock_init);					\
-									\
- void name##_local_lock(void) {						\
-	arch_spinlock_t *lock;						\
-	preempt_disable();						\
-	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
-	lock = &__get_cpu_var(name##_lock);				\
-	arch_spin_lock(lock);						\
- }									\
- EXPORT_SYMBOL(name##_local_lock);					\
-									\
- void name##_local_unlock(void) {					\
-	arch_spinlock_t *lock;						\
-	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
-	lock = &__get_cpu_var(name##_lock);				\
-	arch_spin_unlock(lock);						\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_local_unlock);					\
-									\
- void name##_local_lock_cpu(int cpu) {					\
-	arch_spinlock_t *lock;						\
-	preempt_disable();						\
-	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
-	lock = &per_cpu(name##_lock, cpu);				\
-	arch_spin_lock(lock);						\
- }									\
- EXPORT_SYMBOL(name##_local_lock_cpu);					\
-									\
- void name##_local_unlock_cpu(int cpu) {				\
-	arch_spinlock_t *lock;						\
-	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
-	lock = &per_cpu(name##_lock, cpu);				\
-	arch_spin_unlock(lock);						\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_local_unlock_cpu);				\
-									\
- void name##_global_lock_online(void) {					\
-	int i;								\
-	preempt_disable();						\
-	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_lock(lock);					\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_global_lock_online);				\
-									\
- void name##_global_unlock_online(void) {				\
-	int i;								\
-	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_unlock(lock);					\
-	}								\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_global_unlock_online);				\
-									\
- void name##_global_lock(void) {					\
-	int i;								\
-	preempt_disable();						\
-	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_lock(lock);					\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_global_lock);					\
-									\
- void name##_global_unlock(void) {					\
-	int i;								\
-	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_unlock(lock);					\
-	}								\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_global_unlock);
+DEFINE_PER_CPU(arch_spinlock_t, name##_percpulock);			\
+struct lglock name = {							\
+	.locks = &name##_percpulock,					\
+};
+
+extern int lglock_alloc(struct lglock *lglock);
+extern void lglock_free(struct lglock *lglock);
+extern void __lglock_init(struct lglock *lglock, const char *name,
+	struct lock_class_key *key);
+#define lglock_init(lock, name)						\
+({									\
+	static struct lock_class_key key;				\
+	__lglock_init(lock, name, &key);				\
+})
+extern void lglock_local_lock(struct lglock *lglock);
+extern void lglock_local_unlock(struct lglock *lglock);
+extern void lglock_local_lock_cpu(struct lglock *lglock, int cpu);
+extern void lglock_local_unlock_cpu(struct lglock *lglock, int cpu);
+extern void lglock_global_lock_online(struct lglock *lglock);
+extern void lglock_global_unlock_online(struct lglock *lglock);
+extern void lglock_global_lock(struct lglock *lglock);
+extern void lglock_global_unlock(struct lglock *lglock);
+
+/* can make br locks by using local lock for read side, global lock for write */
+#define br_lock_init(name)	lglock_init(&name, #name)
+#define br_read_lock(name)	lglock_local_lock(&name)
+#define br_read_unlock(name)	lglock_local_unlock(&name)
+#define br_write_lock(name)	lglock_global_lock_online(&name)
+#define br_write_unlock(name)	lglock_global_unlock_online(&name)
+
+#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
+#define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
+
+#define lg_lock_init(name)	lglock_init(&name, #name)
+#define lg_local_lock(name)	lglock_local_lock(&name)
+#define lg_local_unlock(name)	lglock_local_unlock(&name)
+#define lg_local_lock_cpu(name, cpu)	lglock_local_lock_cpu(&name, cpu)
+#define lg_local_unlock_cpu(name, cpu) \
+	lglock_local_unlock_cpu(&name, cpu)
+#define lg_global_lock(name)	lglock_global_lock(&name)
+#define lg_global_unlock(name)	lglock_global_unlock(&name)
+#define lg_global_lock_online(name) lglock_global_lock_online(&name)
+#define lg_global_unlock_online(name) lglock_global_unlock_online(&name)
+
 #endif
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile	2011-05-10 16:10:55.000000000 +0800
+++ linux/kernel/Makefile	2011-05-10 16:23:01.000000000 +0800
@@ -10,7 +10,7 @@ obj-y     = sched.o fork.o exec_domain.o
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
-	    async.o range.o jump_label.o
+	    async.o range.o jump_label.o lglock.o
 obj-y += groups.o
 
 ifdef CONFIG_FUNCTION_TRACER
Index: linux/kernel/lglock.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/kernel/lglock.c	2011-05-11 09:39:46.000000000 +0800
@@ -0,0 +1,124 @@
+#include <linux/lglock.h>
+#include <linux/module.h>
+
+int lglock_alloc(struct lglock *lglock)
+{
+	lglock->locks = alloc_percpu(arch_spinlock_t);
+	if (!lglock->locks)
+		return -ENOMEM;
+	return 0;
+}
+EXPORT_SYMBOL(lglock_alloc);
+
+void lglock_free(struct lglock *lglock)
+{
+	free_percpu(lglock->locks);
+}
+EXPORT_SYMBOL(lglock_free);
+
+void __lglock_init(struct lglock *lglock, const char *name,
+	struct lock_class_key *key)
+{
+	int i;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	lockdep_init_map(&lglock->lock_dep_map, name, key, 0);
+#endif
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lglock->locks, i);
+		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
+	}
+}
+EXPORT_SYMBOL(__lglock_init);
+
+void lglock_local_lock(struct lglock *lglock)
+{
+	arch_spinlock_t *lock;
+	preempt_disable();
+	rwlock_acquire_read(&lglock->lock_dep_map, 0, 0, _THIS_IP_);
+	lock = __this_cpu_ptr(lglock->locks);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lglock_local_lock);
+
+void lglock_local_unlock(struct lglock *lglock)
+{
+	arch_spinlock_t *lock;
+	rwlock_release(&lglock->lock_dep_map, 1, _THIS_IP_);
+	lock = __this_cpu_ptr(lglock->locks);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lglock_local_unlock);
+
+void lglock_local_lock_cpu(struct lglock *lglock, int cpu)
+{
+	arch_spinlock_t *lock;
+	preempt_disable();
+	rwlock_acquire_read(&lglock->lock_dep_map, 0, 0, _THIS_IP_);
+	lock = per_cpu_ptr(lglock->locks, cpu);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lglock_local_lock_cpu);
+
+void lglock_local_unlock_cpu(struct lglock *lglock, int cpu)
+{
+	arch_spinlock_t *lock;
+	rwlock_release(&lglock->lock_dep_map, 1, _THIS_IP_);
+	lock = per_cpu_ptr(lglock->locks, cpu);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lglock_local_unlock_cpu);
+
+void lglock_global_lock_online(struct lglock *lglock)
+{
+	int i;
+	preempt_disable();
+	rwlock_acquire(&lglock->lock_dep_map, 0, 0, _RET_IP_);
+	for_each_online_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lglock->locks, i);
+		arch_spin_lock(lock);
+	}
+}
+EXPORT_SYMBOL(lglock_global_lock_online);
+
+void lglock_global_unlock_online(struct lglock *lglock)
+{
+	int i;
+	rwlock_release(&lglock->lock_dep_map, 1, _RET_IP_);
+	for_each_online_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lglock->locks, i);
+		arch_spin_unlock(lock);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(lglock_global_unlock_online);
+
+void lglock_global_lock(struct lglock *lglock)
+{
+	int i;
+	preempt_disable();
+	rwlock_acquire(&lglock->lock_dep_map, 0, 0, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lglock->locks, i);
+		arch_spin_lock(lock);
+	}
+}
+EXPORT_SYMBOL(lglock_global_lock);
+
+void lglock_global_unlock(struct lglock *lglock)
+{
+	int i;
+	rwlock_release(&lglock->lock_dep_map, 1, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lglock->locks, i);
+		arch_spin_unlock(lock);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(lglock_global_unlock);



* [patch v2 3/5] percpu_counter: use lglock to protect percpu data
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
  2011-05-11  8:10 ` [patch v2 1/5] percpu_counter: fix code for 32bit systems for UP Shaohua Li
  2011-05-11  8:10 ` [patch v2 2/5] lglock: convert it to work with dynamically allocated structure Shaohua Li
@ 2011-05-11  8:10 ` Shaohua Li
  2011-05-11  8:10 ` [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP Shaohua Li
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin, Shaohua Li

[-- Attachment #1: percpu-counter-percpulock.patch --]
[-- Type: text/plain, Size: 3744 bytes --]

Use lglock to protect the percpu data. This is preparation for removing the
percpu_counter global lock.
This slows __percpu_counter_sum, but that function isn't supposed to be
called frequently, so it doesn't matter.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 include/linux/percpu_counter.h |    9 ++++++---
 lib/percpu_counter.c           |   15 ++++++++++++++-
 2 files changed, 20 insertions(+), 4 deletions(-)

Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h	2011-05-10 16:23:01.000000000 +0800
+++ linux/include/linux/percpu_counter.h	2011-05-11 09:28:55.000000000 +0800
@@ -12,6 +12,7 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
+#include <linux/lglock.h>
 
 #ifdef CONFIG_SMP
 
@@ -22,18 +23,20 @@ struct percpu_counter {
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
 	s32 __percpu *counters;
+	struct lglock lglock;
 };
 
 extern int percpu_counter_batch;
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key);
+			struct lock_class_key *key, const char *name,
+			struct lock_class_key *key2);
 
 #define percpu_counter_init(fbc, value)					\
 	({								\
-		static struct lock_class_key __key;			\
+		static struct lock_class_key __key, __key2;		\
 									\
-		__percpu_counter_init(fbc, value, &__key);		\
+		__percpu_counter_init(fbc, value, &__key, #fbc, &__key2);\
 	})
 
 void percpu_counter_destroy(struct percpu_counter *fbc);
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c	2011-05-10 16:10:54.000000000 +0800
+++ linux/lib/percpu_counter.c	2011-05-11 09:28:55.000000000 +0800
@@ -77,8 +77,10 @@ void __percpu_counter_add(struct percpu_
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
+		lg_local_lock(fbc->lglock);
 		fbc->count += count;
 		__this_cpu_write(*fbc->counters, 0);
+		lg_local_unlock(fbc->lglock);
 		spin_unlock(&fbc->lock);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
@@ -97,18 +99,21 @@ s64 __percpu_counter_sum(struct percpu_c
 	int cpu;
 
 	spin_lock(&fbc->lock);
+	lg_global_lock_online(fbc->lglock);
 	ret = fbc->count;
 	for_each_online_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
+	lg_global_unlock_online(fbc->lglock);
 	spin_unlock(&fbc->lock);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key)
+		struct lock_class_key *key, const char *name,
+		struct lock_class_key *key2)
 {
 	spin_lock_init(&fbc->lock);
 	lockdep_set_class(&fbc->lock, key);
@@ -116,6 +121,11 @@ int __percpu_counter_init(struct percpu_
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
+	if (lglock_alloc(&fbc->lglock)) {
+		free_percpu(fbc->counters);
+		return -ENOMEM;
+	}
+	__lglock_init(&fbc->lglock, name, key2);
 
 	debug_percpu_counter_activate(fbc);
 
@@ -143,6 +153,7 @@ void percpu_counter_destroy(struct percp
 #endif
 	free_percpu(fbc->counters);
 	fbc->counters = NULL;
+	lglock_free(&fbc->lglock);
 }
 EXPORT_SYMBOL(percpu_counter_destroy);
 
@@ -174,9 +185,11 @@ static int __cpuinit percpu_counter_hotc
 		unsigned long flags;
 
 		spin_lock_irqsave(&fbc->lock, flags);
+		lg_local_lock_cpu(fbc->lglock, cpu);
 		pcount = per_cpu_ptr(fbc->counters, cpu);
 		fbc->count += *pcount;
 		*pcount = 0;
+		lg_local_unlock_cpu(fbc->lglock, cpu);
 		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);



* [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
                   ` (2 preceding siblings ...)
  2011-05-11  8:10 ` [patch v2 3/5] percpu_counter: use lglock to protect percpu data Shaohua Li
@ 2011-05-11  8:10 ` Shaohua Li
  2011-05-11  9:34   ` Andrew Morton
  2011-05-11  8:10 ` [patch v2 5/5] percpu_counter: preemptless __per_cpu_counter_add Shaohua Li
  2011-05-11  9:28 ` [patch v2 0/5] percpu_counter: bug fix and enhancement Tejun Heo
  5 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin, Shaohua Li

[-- Attachment #1: percpu-counter-atomic.patch --]
[-- Type: text/plain, Size: 5403 bytes --]

The percpu_counter global lock now only protects updates of fbc->count,
since lglock protects the percpu data. Switch fbc->count to atomic64, which
is cheaper than a spinlock. This doesn't slow the fast path
(percpu_counter_read): atomic64_read is a plain read of fbc->count on 64-bit
systems, and spin_lock-read-spin_unlock on 32-bit systems.

Note that percpu_counter_read on 32-bit systems originally didn't take the
spinlock, but that is buggy and can return a very wrong value. This patch
fixes the issue.

This can also improve workloads where percpu_counter->lock is heavily
contended. For example, vm_committed_as sometimes causes such contention.
We could tune the batch count instead, but if we can make percpu_counter
better, why not? On a 24-CPU system with 24 processes, each running:
while (1) {
	mmap(128M);
	munmap(128M);
}
we measure how many loops each process completes:
orig: 1226976
patched: 6727264
The atomic method is 5x~6x faster.
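For reference, the per-process loop above is essentially the following
sketch (assumptions not stated above: anonymous private mappings and a fixed
measurement window; the harness that forks the 24 processes and collects the
counts is omitted):

#include <sys/mman.h>
#include <stdio.h>
#include <time.h>

#define MAP_SIZE	(128UL << 20)	/* the 128M above */
#define DURATION	60		/* seconds, an arbitrary window */

int main(void)
{
	unsigned long loops = 0;
	time_t end = time(NULL) + DURATION;

	while (time(NULL) < end) {
		void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		munmap(p, MAP_SIZE);
		loops++;	/* this count is what orig/patched compare */
	}
	printf("%lu loops\n", loops);
	return 0;
}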

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 include/linux/percpu_counter.h |   14 ++++++--------
 lib/percpu_counter.c           |   27 +++++++++------------------
 2 files changed, 15 insertions(+), 26 deletions(-)

Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h	2011-05-10 16:23:01.000000000 +0800
+++ linux/include/linux/percpu_counter.h	2011-05-10 16:23:01.000000000 +0800
@@ -17,8 +17,7 @@
 #ifdef CONFIG_SMP
 
 struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+	atomic64_t count;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
@@ -29,14 +28,13 @@ struct percpu_counter {
 extern int percpu_counter_batch;
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			struct lock_class_key *key, const char *name,
-			struct lock_class_key *key2);
+			struct lock_class_key *key, const char *name);
 
 #define percpu_counter_init(fbc, value)					\
 	({								\
-		static struct lock_class_key __key, __key2;		\
+		static struct lock_class_key __key;			\
 									\
-		__percpu_counter_init(fbc, value, &__key, #fbc, &__key2);\
+		__percpu_counter_init(fbc, value, &__key, #fbc);	\
 	})
 
 void percpu_counter_destroy(struct percpu_counter *fbc);
@@ -63,7 +61,7 @@ static inline s64 percpu_counter_sum(str
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	return atomic64_read(&fbc->count);
 }
 
 /*
@@ -73,7 +71,7 @@ static inline s64 percpu_counter_read(st
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c	2011-05-10 16:23:01.000000000 +0800
+++ linux/lib/percpu_counter.c	2011-05-11 09:24:24.000000000 +0800
@@ -59,13 +59,11 @@ void percpu_counter_set(struct percpu_co
 {
 	int cpu;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&fbc->count, amount);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
@@ -76,12 +74,10 @@ void __percpu_counter_add(struct percpu_
 	preempt_disable();
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
 		lg_local_lock(fbc->lglock);
-		fbc->count += count;
+		atomic64_add(count, &fbc->count);
 		__this_cpu_write(*fbc->counters, 0);
 		lg_local_unlock(fbc->lglock);
-		spin_unlock(&fbc->lock);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
@@ -98,26 +94,21 @@ s64 __percpu_counter_sum(struct percpu_c
 	s64 ret;
 	int cpu;
 
-	spin_lock(&fbc->lock);
 	lg_global_lock_online(fbc->lglock);
-	ret = fbc->count;
+	ret = atomic64_read(&fbc->count);
 	for_each_online_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
 	lg_global_unlock_online(fbc->lglock);
-	spin_unlock(&fbc->lock);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-		struct lock_class_key *key, const char *name,
-		struct lock_class_key *key2)
+		struct lock_class_key *key, const char *name)
 {
-	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	atomic64_set(&fbc->count, amount);
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
@@ -125,7 +116,7 @@ int __percpu_counter_init(struct percpu_
 		free_percpu(fbc->counters);
 		return -ENOMEM;
 	}
-	__lglock_init(&fbc->lglock, name, key2);
+	__lglock_init(&fbc->lglock, name, key);
 
 	debug_percpu_counter_activate(fbc);
 
@@ -184,13 +175,13 @@ static int __cpuinit percpu_counter_hotc
 		s32 *pcount;
 		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
+		local_irq_save(flags);
 		lg_local_lock_cpu(fbc->lglock, cpu);
 		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		atomic64_add(*pcount, &fbc->count);
 		*pcount = 0;
 		lg_local_unlock_cpu(fbc->lglock, cpu);
-		spin_unlock_irqrestore(&fbc->lock, flags);
+		local_irq_restore(flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



* [patch v2 5/5] percpu_counter: preemptless __per_cpu_counter_add
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
                   ` (3 preceding siblings ...)
  2011-05-11  8:10 ` [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP Shaohua Li
@ 2011-05-11  8:10 ` Shaohua Li
  2011-05-11  9:28 ` [patch v2 0/5] percpu_counter: bug fix and enhancement Tejun Heo
  5 siblings, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-11  8:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, tj, eric.dumazet, cl, npiggin, Shaohua Li

[-- Attachment #1: percpu-counter-add-preemptless.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

From: Christoph Lameter <cl@linux.com>

Make percpu_counter_add hot path preemptless.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 lib/percpu_counter.c |   37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c	2011-05-10 16:23:01.000000000 +0800
+++ linux/lib/percpu_counter.c	2011-05-10 16:23:01.000000000 +0800
@@ -67,21 +67,34 @@ void percpu_counter_set(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
+static bool __percpu_counter_add_cmpxchg(struct percpu_counter *fbc,
+	s64 count, s64 new)
+{
+#ifdef CONFIG_PREEMPT
+	return this_cpu_cmpxchg(*fbc->counters, count, new) == count;
+#else
+	this_cpu_write(*fbc->counters, new);
+	return true;
+#endif
+}
+
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	s64 count;
+	s64 count, new;
+
+	do {
+		count = this_cpu_read(*fbc->counters);
+		new = count + amount;
 
-	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
-	if (count >= batch || count <= -batch) {
-		lg_local_lock(fbc->lglock);
-		atomic64_add(count, &fbc->count);
-		__this_cpu_write(*fbc->counters, 0);
-		lg_local_unlock(fbc->lglock);
-	} else {
-		__this_cpu_write(*fbc->counters, count);
-	}
-	preempt_enable();
+		if (new >= batch || new <= -batch) {
+			lg_local_lock(fbc->lglock);
+			count = __this_cpu_read(*fbc->counters) + amount;
+			atomic64_add(count, &fbc->count);
+			__this_cpu_write(*fbc->counters, 0);
+			lg_local_unlock(fbc->lglock);
+			return;
+		}
+	} while (!__percpu_counter_add_cmpxchg(fbc, count, new));
 }
 EXPORT_SYMBOL(__percpu_counter_add);
 



* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
                   ` (4 preceding siblings ...)
  2011-05-11  8:10 ` [patch v2 5/5] percpu_counter: preemptless __per_cpu_counter_add Shaohua Li
@ 2011-05-11  9:28 ` Tejun Heo
  2011-05-12  2:48   ` Shaohua Li
  2011-05-12 14:38   ` [patch v2 0/5] percpu_counter: bug fix and enhancement Christoph Lameter
  5 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2011-05-11  9:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, akpm, eric.dumazet, cl, npiggin

Hey, Shaohua.

On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> The new implementation uses lglock to protect percpu data. Each cpu has its
> private lock while other cpu doesn't take. In this way _add doesn't need take
> global lock anymore and remove the deviation. This still gives me about
> about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> good) with the workload mentioned in patch 4.

I'm afraid I'm not too thrilled about lglock + atomic64 usage.  It is
a very patchy approach which addresses a very specific use case which
might just need a higher @batch.  I just can't see enough benefits to
justify the overhead and complexity.  :-(

Thanks.

-- 
tejun


* Re: [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP
  2011-05-11  8:10 ` [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP Shaohua Li
@ 2011-05-11  9:34   ` Andrew Morton
  2011-05-12  2:40     ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Andrew Morton @ 2011-05-11  9:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, tj, eric.dumazet, cl, npiggin

On Wed, 11 May 2011 16:10:16 +0800 Shaohua Li <shaohua.li@intel.com> wrote:

> The percpu_counter global lock is only used to protect updating fbc->count after
> we use lglock to protect percpu data. Uses atomic64 for percpu_counter, because
> it is cheaper than spinlock. This doesn't slow fast path (percpu_counter_read).
> atomic64_read equals to read fbc->count for 64-bit system, or equals to
> spin_lock-read-spin_unlock for 32-bit system.
> 
> Note, originally the percpu_counter_read for 32-bit system doesn't hold
> spin_lock, but that is buggy and might cause very wrong value accessed.
> This patch fixes the issue.
> 
> This can also improve some workloads with percpu_counter->lock heavily
> contented. For example, vm_committed_as sometimes causes the contention.
> We should tune the batch count, but if we can make percpu_counter better,
> why not? In a 24 CPUs system and 24 processes, each runs:
> while (1) {
> 	mmap(128M);
> 	munmap(128M);
> }
> we then measure how many loops each process can take:
> orig: 1226976
> patched: 6727264
> The atomic method gives 5x~6x faster.

How much slower did percpu_counter_sum() become?


* Re: [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP
  2011-05-11  9:34   ` Andrew Morton
@ 2011-05-12  2:40     ` Shaohua Li
  0 siblings, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-12  2:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, tj, eric.dumazet, cl, npiggin

On Wed, 2011-05-11 at 17:34 +0800, Andrew Morton wrote:
> On Wed, 11 May 2011 16:10:16 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> 
> > The percpu_counter global lock is only used to protect updating fbc->count after
> > we use lglock to protect percpu data. Uses atomic64 for percpu_counter, because
> > it is cheaper than spinlock. This doesn't slow fast path (percpu_counter_read).
> > atomic64_read equals to read fbc->count for 64-bit system, or equals to
> > spin_lock-read-spin_unlock for 32-bit system.
> > 
> > Note, originally the percpu_counter_read for 32-bit system doesn't hold
> > spin_lock, but that is buggy and might cause very wrong value accessed.
> > This patch fixes the issue.
> > 
> > This can also improve some workloads with percpu_counter->lock heavily
> > contented. For example, vm_committed_as sometimes causes the contention.
> > We should tune the batch count, but if we can make percpu_counter better,
> > why not? In a 24 CPUs system and 24 processes, each runs:
> > while (1) {
> > 	mmap(128M);
> > 	munmap(128M);
> > }
> > we then measure how many loops each process can take:
> > orig: 1226976
> > patched: 6727264
> > The atomic method gives 5x~6x faster.
> 
> How much slower did percpu_counter_sum() become?
I did a stress test: 23 CPUs run _add, one CPU runs _sum.
In both cases (_add fast path, which doesn't take the lock, and _add slow
path, which does), _sum becomes about 2.4x slower. Not too much slower, and
anyway _sum isn't frequently used.

Thanks,
Shaohua



* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-11  9:28 ` [patch v2 0/5] percpu_counter: bug fix and enhancement Tejun Heo
@ 2011-05-12  2:48   ` Shaohua Li
  2011-05-12  8:21     ` Tejun Heo
  2011-05-12 14:38   ` [patch v2 0/5] percpu_counter: bug fix and enhancement Christoph Lameter
  1 sibling, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-12  2:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, akpm, eric.dumazet, cl, npiggin

On Wed, 2011-05-11 at 17:28 +0800, Tejun Heo wrote:
> Hey, Shaohua.
> 
> On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> > The new implementation uses lglock to protect percpu data. Each cpu has its
> > private lock while other cpu doesn't take. In this way _add doesn't need take
> > global lock anymore and remove the deviation. This still gives me about
> > about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> > good) with the workload mentioned in patch 4.
> 
> I'm afraid I'm not too thrilled about lglock + atomic64 usage.  It is
> a very patchy approach which addresses a very specific use case which
> might just need a higher @batch. 
It's quite hard to pick a higher @batch. Please see my comments in
http://marc.info/?l=linux-kernel&m=130153302319613&w=2

And the atomic64 approach doesn't just improve performance (which is
always welcome), it also fixes a bug for 32-bit systems. The usage
of lglock here is actually quite straightforward and is the standard usage
of lglock (the comments in lglock.h describe exactly this usage); it's just
that lglock currently doesn't work for dynamically allocated structures,
which is what the conversion addresses.

Thanks,
Shaohua



* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  2:48   ` Shaohua Li
@ 2011-05-12  8:21     ` Tejun Heo
  2011-05-12  8:55       ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-12  8:21 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, akpm, eric.dumazet, cl, npiggin

Hello,

On Thu, May 12, 2011 at 10:48:13AM +0800, Shaohua Li wrote:
> And the atomic64 approach not just improved the performance (which is
> always welcomed), but it also fixes a bug for 32-bit system. The usage
> of lglock is actually quite straightforward and is standard usage of
> lglock (the comments of lglock.h declare such usage), just lglock
> doesn't work for dynamatically allocated structure currently, which
> needs a convert.

lglock doesn't seem like the right solution for the problem at hand.
It's way too heavy-handed for closing a very small race window.
It doesn't seem right.  Eric's idea seemed much better to me and I
don't see why that can't be improved and used instead.  Would you be
interested in looking in that direction?

Thanks.

-- 
tejun


* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  8:21     ` Tejun Heo
@ 2011-05-12  8:55       ` Shaohua Li
  2011-05-12  8:59         ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-12  8:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, akpm, eric.dumazet, cl, npiggin

Hi,
On Thu, 2011-05-12 at 16:21 +0800, Tejun Heo wrote:
> On Thu, May 12, 2011 at 10:48:13AM +0800, Shaohua Li wrote:
> > And the atomic64 approach not just improved the performance (which is
> > always welcomed), but it also fixes a bug for 32-bit system. The usage
> > of lglock is actually quite straightforward and is standard usage of
> > lglock (the comments of lglock.h declare such usage), just lglock
> > doesn't work for dynamatically allocated structure currently, which
> > needs a convert.
> 
> lglock doesn't seem like the right solution for the problem at hand.
> It's way too heavy handed and used to close a very small race window.
> It doesn't seem right.  Eric's idea seemed much better to me and I
> don't see why that can't be improved and used instead.  Would you be
> interested in looking that direction?
Sure, but I thought it's quite difficult to determine @maxfuzzy in his
proposal (and it could confuse users). Did I miss anything?

Thanks,
Shaohua




* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  8:55       ` Shaohua Li
@ 2011-05-12  8:59         ` Tejun Heo
  2011-05-12  9:02           ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-12  8:59 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, akpm, eric.dumazet, cl, npiggin

Hello,

On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> I thought (and could confuse user), did I miss anything?

I don't think @maxfuzzy is necessary there.  I wrote this before but
why can't we track the actual deviation instead of the number of
deviation events?

Thanks.

-- 
tejun


* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  8:59         ` Tejun Heo
@ 2011-05-12  9:02           ` Eric Dumazet
  2011-05-12  9:03             ` Eric Dumazet
  2011-05-12  9:05             ` Tejun Heo
  0 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-05-12  9:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

On Thu, 12 May 2011 at 10:59 +0200, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> > sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> > I thought (and could confuse user), did I miss anything?
> 
> I don't think @maxfuzzy is necessary there.  I wrote this before but
> why can't we track the actual deviation instead of the number of
> deviation events?
> 

That's roughly the same thing (apart from a BATCH multiplier factor).

Most percpu_counter users for a given percpu_counter object use a given
BATCH, don't they?





* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  9:02           ` Eric Dumazet
@ 2011-05-12  9:03             ` Eric Dumazet
  2011-05-12  9:05             ` Tejun Heo
  1 sibling, 0 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-05-12  9:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

On Thu, 12 May 2011 at 11:02 +0200, Eric Dumazet wrote:
> On Thu, 12 May 2011 at 10:59 +0200, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, May 12, 2011 at 04:55:20PM +0800, Shaohua Li wrote:
> > > sure, but it's quite difficult to determine a @maxfuzzy in his proposal
> > > I thought (and could confuse user), did I miss anything?
> > 
> > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > why can't we track the actual deviation instead of the number of
> > deviation events?
> > 
> 
> Thats roughly same thing (BATCH multiplicator factor apart)
> 
> Most percpu_counter users for a given percpu_counter object use a given
> BATCH, dont they ?
> 
> 


I guess nr_cpu_ids would be a nice @maxfuzzy default value...





* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  9:02           ` Eric Dumazet
  2011-05-12  9:03             ` Eric Dumazet
@ 2011-05-12  9:05             ` Tejun Heo
  2011-05-13  3:09               ` Shaohua Li
  2011-05-13  4:37               ` Shaohua Li
  1 sibling, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2011-05-12  9:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Hello,

On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > why can't we track the actual deviation instead of the number of
> > deviation events?
> 
> Thats roughly same thing (BATCH multiplicator factor apart)
> 
> Most percpu_counter users for a given percpu_counter object use a given
> BATCH, dont they ?

Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
Although I haven't really thought about it that much, I think it might
be possible to eliminate it.  Maybe I'm confused.  I'll take another
look later but if someone can think of something, please jump right
in.

Thanks.

-- 
tejun


* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-11  9:28 ` [patch v2 0/5] percpu_counter: bug fix and enhancement Tejun Heo
  2011-05-12  2:48   ` Shaohua Li
@ 2011-05-12 14:38   ` Christoph Lameter
  1 sibling, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-05-12 14:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, eric.dumazet, npiggin

On Wed, 11 May 2011, Tejun Heo wrote:

> Hey, Shaohua.
>
> On Wed, May 11, 2011 at 04:10:12PM +0800, Shaohua Li wrote:
> > The new implementation uses lglock to protect percpu data. Each cpu has its
> > private lock while other cpu doesn't take. In this way _add doesn't need take
> > global lock anymore and remove the deviation. This still gives me about
> > about 5x ~ 6x faster (not that faster than the original 7x faster, but still
> > good) with the workload mentioned in patch 4.
>
> I'm afraid I'm not too thrilled about lglock + atomic64 usage.  It is
> a very patchy approach which addresses a very specific use case which
> might just need a higher @batch.  I just can't see enough benefits to
> justify the overhead and complexity.  :-(

Same here.



* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  9:05             ` Tejun Heo
@ 2011-05-13  3:09               ` Shaohua Li
  2011-05-13  4:37               ` Shaohua Li
  1 sibling, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-13  3:09 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Eric Dumazet, linux-kernel, akpm, cl, npiggin

Hi,
On Thu, May 12, 2011 at 05:05:34PM +0800, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > why can't we track the actual deviation instead of the number of
> > > deviation events?
> > 
> > Thats roughly same thing (BATCH multiplicator factor apart)
> > 
> > Most percpu_counter users for a given percpu_counter object use a given
> > BATCH, dont they ?
> 
> Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> Although I haven't really thought about it that much, I think it might
> be possible to eliminate it.  Maybe I'm confused.  I'll take another
> look later but if someone can think of something, please jump right
> in.
There is another problem here: _sum could keep spinning if a concurrent
updater keeps running.

We could slightly change Eric's idea. How about something like this:

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	int retry_times = 0;
	unsigned int old_seq;
	s64 sum;

retry:
	old_seq = fbc->seq;
	sum = do_sum();
	if (old_seq != fbc->seq && retry_times++ < MAX_RETRY)
		goto retry;
	return sum;
}
MAX_RETRY could be nr_cpu_ids. The rationale is that if a concurrent updater
keeps running, we can't get an accurate sum anyway, so just don't try too hard.

Thanks,
Shaohua


* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-12  9:05             ` Tejun Heo
  2011-05-13  3:09               ` Shaohua Li
@ 2011-05-13  4:37               ` Shaohua Li
  2011-05-13  5:20                 ` Eric Dumazet
  1 sibling, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-13  4:37 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Eric Dumazet, linux-kernel, akpm, cl, npiggin

On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > why can't we track the actual deviation instead of the number of
> > > deviation events?
> > 
> > Thats roughly same thing (BATCH multiplicator factor apart)
> > 
> > Most percpu_counter users for a given percpu_counter object use a given
> > BATCH, dont they ?
> 
> Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> Although I haven't really thought about it that much, I think it might
> be possible to eliminate it.  Maybe I'm confused.  I'll take another
> look later but if someone can think of something, please jump right
> in.
Hmm, it looks like Eric's approach doesn't work: because we want to remove
the lock in _add, checking seq in _sum still races with _add.

Can we do something like this:
void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        s64 count;

        preempt_disable();
        count = __this_cpu_read(*fbc->counters) + amount;
        if (count >= batch || count <= -batch) {
                while (1) {
                        atomic_inc(&fbc->add_start);
                        if (atomic_read(&fbc->sum_start) != 0)
                                atomic_dec(&fbc->add_start);
                        else
                                break;
                        while (atomic_read(&fbc->sum_start) != 0)
                                cpu_relax();
                }

                atomic64_add(count, &fbc->count);
                __this_cpu_write(*fbc->counters, 0);
                atomic_dec(&fbc->add_start);
        } else {
                __this_cpu_write(*fbc->counters, count);
        }
        preempt_enable();
}

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
        s64 ret = 0;
        int cpu;
        int old_seq;
        s64 old_count;

        atomic_inc(&fbc->sum_start);
        while (atomic_read(&fbc->add_start) != 0)
                cpu_relax();

        old_count = atomic64_read(&fbc->count);

        for_each_online_cpu(cpu) {
                s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
                ret += *pcount;
        }
        ret += atomic64_read(&fbc->count);
        atomic_dec(&fbc->sum_start);
        return ret;
}
If _add finds _sum is in progress, it gives up and waits for _sum. If
_sum finds _add is in progress, it waits for _add to give up or finish. We
let _add wait for _sum here because _sum is seldom called; if _sum waited
for _add, _sum might spin forever. Maybe we need a spinlock to protect
concurrent _sum too. Anything wrong here?



* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-13  4:37               ` Shaohua Li
@ 2011-05-13  5:20                 ` Eric Dumazet
  2011-05-13  5:28                   ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13  5:20 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Fri, 13 May 2011 at 12:37 +0800, Shaohua Li wrote:
> On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > > why can't we track the actual deviation instead of the number of
> > > > deviation events?
> > > 
> > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > 
> > > Most percpu_counter users for a given percpu_counter object use a given
> > > BATCH, dont they ?
> > 
> > Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> > Although I haven't really thought about it that much, I think it might
> > be possible to eliminate it.  Maybe I'm confused.  I'll take another
> > look later but if someone can think of something, please jump right
> > in.
> Hmm, looks Eric's approach doesn't work. because we want to remove lock
> in _add, checking seq in _sum still races with _add.
> 

Why ?

I'll code a patch, I believe it should work.

A seqcount is not a 'lock'.

The thing is we want _add to be real fast, so it must not hit a lock set
in _sum()

[Think about a machine with 4096 cpus]





* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-13  5:20                 ` Eric Dumazet
@ 2011-05-13  5:28                   ` Shaohua Li
  2011-05-13  6:34                     ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-13  5:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Hi,
On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> On Fri, 13 May 2011 at 12:37 +0800, Shaohua Li wrote:
> > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > > > why can't we track the actual deviation instead of the number of
> > > > > deviation events?
> > > > 
> > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > > 
> > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > BATCH, dont they ?
> > > 
> > > Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> > > Although I haven't really thought about it that much, I think it might
> > > be possible to eliminate it.  Maybe I'm confused.  I'll take another
> > > look later but if someone can think of something, please jump right
> > > in.
> > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > in _add, checking seq in _sum still races with _add.
> > 
> 
> Why ?
> 
> I'll code a patch, I believe it should work.
I thought your proposal is:
in _add
{
	if (count >= batch || count <= -batch) {
		fbc->seq_count++;
               atomic64_add(count, &fbc->count);
-------->
                __this_cpu_write(*fbc->counters, 0);
	}
}

in _sum
{
restart:
	oldseq = fbc->seqcount;
	smp_rmb();
	do_sum();
	smp_rmb()
	newseq = fbc->seqcount;
	if (newseq - oldseq >= maxfuzzy)
		goto restart;
	return ret;
} 
If _sum runs at the point marked above in _add, then the seqcount check
doesn't work and we still have the deviation Tejun pointed out.

Thanks,
Shaohua


* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-13  5:28                   ` Shaohua Li
@ 2011-05-13  6:34                     ` Eric Dumazet
  2011-05-13  7:33                       ` Shaohua Li
  2011-05-13 14:51                       ` [patch] percpu_counter: scalability works Eric Dumazet
  0 siblings, 2 replies; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13  6:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Fri, 13 May 2011 at 13:28 +0800, Shaohua Li wrote:
> Hi,
> On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> > On Fri, 13 May 2011 at 12:37 +0800, Shaohua Li wrote:
> > > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > > > > why can't we track the actual deviation instead of the number of
> > > > > > deviation events?
> > > > > 
> > > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > > > 
> > > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > > BATCH, dont they ?
> > > > 
> > > > Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> > > > Although I haven't really thought about it that much, I think it might
> > > > be possible to eliminate it.  Maybe I'm confused.  I'll take another
> > > > look later but if someone can think of something, please jump right
> > > > in.
> > > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > > in _add, checking seq in _sum still races with _add.
> > > 
> > 
> > Why ?
> > 
> > I'll code a patch, I believe it should work.
> I thought your proposal is:
> in _add
> {
> 	if (count >= batch || count <= -batch) {
> 		fbc->seq_count++;
>                atomic64_add(count, &fbc->count);
> -------->
>                 __this_cpu_write(*fbc->counters, 0);
> 	}
> }
> 
> in _sum
> {
> restart:
> 	oldseq = fbc->seqcount;
> 	smp_rmb();
> 	do_sum();
> 	smp_rmb()
> 	newseq = fbc->seqcount;
> 	if (newseq - oldseq >= maxfuzzy)
> 		goto restart;
> 	return ret;
> } 
> if _sum run between above line marked in _add, then the seqcount check
> doesn't work, we still have deviation Tejun pointed out.
> 

I see the point, thanks. I'll think a bit more about it.

We currently serialize both _sum() and _add() with a spinlock.

My idea was OK if we still kept the spinlock in _add(), but that is
obviously not what we want here.

Your goal is letting _add() run without a spinlock, but can we agree that
_sum() can keep running under a spinlock like today [no more than one
instance of _sum() running per percpu_counter]?




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch v2 0/5] percpu_counter: bug fix and enhancement
  2011-05-13  6:34                     ` Eric Dumazet
@ 2011-05-13  7:33                       ` Shaohua Li
  2011-05-13 14:51                       ` [patch] percpu_counter: scalability works Eric Dumazet
  1 sibling, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-13  7:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Fri, 2011-05-13 at 14:34 +0800, Eric Dumazet wrote:
> Le vendredi 13 mai 2011 à 13:28 +0800, Shaohua Li a écrit :
> > Hi,
> > On Fri, May 13, 2011 at 01:20:06PM +0800, Eric Dumazet wrote:
> > > Le vendredi 13 mai 2011 à 12:37 +0800, Shaohua Li a écrit :
> > > > On Thu, 2011-05-12 at 17:05 +0800, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Thu, May 12, 2011 at 11:02:15AM +0200, Eric Dumazet wrote:
> > > > > > > I don't think @maxfuzzy is necessary there.  I wrote this before but
> > > > > > > why can't we track the actual deviation instead of the number of
> > > > > > > deviation events?
> > > > > > 
> > > > > > Thats roughly same thing (BATCH multiplicator factor apart)
> > > > > > 
> > > > > > Most percpu_counter users for a given percpu_counter object use a given
> > > > > > BATCH, dont they ?
> > > > > 
> > > > > Well, @maxfuzzy is much harder than @batch.  It's way less intuitive.
> > > > > Although I haven't really thought about it that much, I think it might
> > > > > be possible to eliminate it.  Maybe I'm confused.  I'll take another
> > > > > look later but if someone can think of something, please jump right
> > > > > in.
> > > > Hmm, looks Eric's approach doesn't work. because we want to remove lock
> > > > in _add, checking seq in _sum still races with _add.
> > > > 
> > > 
> > > Why ?
> > > 
> > > I'll code a patch, I believe it should work.
> > I thought your proposal is:
> > in _add
> > {
> > 	if (count >= batch || count <= -batch) {
> > 		fbc->seq_count++;
> >                atomic64_add(count, &fbc->count);
> > -------->
> >                 __this_cpu_write(*fbc->counters, 0);
> > 	}
> > }
> > 
> > in _sum
> > {
> > restart:
> > 	oldseq = fbc->seqcount;
> > 	smp_rmb();
> > 	do_sum();
> > 	smp_rmb()
> > 	newseq = fbc->seqcount;
> > 	if (newseq - oldseq >= maxfuzzy)
> > 		goto restart;
> > 	return ret;
> > } 
> > if _sum run between above line marked in _add, then the seqcount check
> > doesn't work, we still have deviation Tejun pointed out.
> > 
> 
> I see the point thanks, I'll think a bit more about it.
> 
> We currently serializes both _sum() and _add() with a spinlock.
> 
> My idea was OK if we still kept spinlock in _add(), but this obviously
> is not the need.
> 
> Your goal is letting _add() run without spinlock, but can we agree
> _sum() can run with a spinlock() like today [no more than one instance
> of _sum() running per percpu_counter] ?
Locking _sum() should be fine.



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch] percpu_counter: scalability works
  2011-05-13  6:34                     ` Eric Dumazet
  2011-05-13  7:33                       ` Shaohua Li
@ 2011-05-13 14:51                       ` Eric Dumazet
  2011-05-13 15:39                         ` Eric Dumazet
  1 sibling, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13 14:51 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Friday, 13 May 2011 at 08:34 +0200, Eric Dumazet wrote:

> I see the point thanks, I'll think a bit more about it.
> 
> We currently serializes both _sum() and _add() with a spinlock.
> 
> My idea was OK if we still kept spinlock in _add(), but this obviously
> is not the need.
> 
> Your goal is letting _add() run without spinlock, but can we agree
> _sum() can run with a spinlock() like today [no more than one instance
> of _sum() running per percpu_counter] ?
> 
> 

Here is the patch I cooked (on top of linux-2.6).

This solves the problem quite well for me.

The idea is:

Consider _sum() the slow path. It is still serialized by a spinlock.

Add a fbc->sequence, so that _add() can detect that a _sum() is in
flight and then add directly to a new atomic64_t field I named
"fbc->slowcount" (without touching its percpu s32 variable, so that
_sum() can get an accurate percpu_counter value).

The low order bit of 'sequence' is used to signal that a _sum() is in
flight, while _add() threads that overflow their percpu s32 variable do
sequence += 2, so that _sum() can detect that at least one cpu touched
fbc->count and reset its s32 variable. _sum() can then restart its loop,
but since the low order bit of sequence is still set, we have a
guarantee that the _sum() loop won't be restarted ad infinitum.

Note: I disabled IRQs in _add() to reduce the window, making _add() as
fast as possible and avoiding extra _sum() loops. This is not strictly
necessary and we can discuss this point, since _sum() is the slow
path :)

_sum() is accurate and no longer blocks _add(). It does slow _add() down
a bit, of course, since every _add() will touch fbc->slowcount.

_sum() is about the same speed as before in my tests.

On my 8 cpu (Intel(R) Xeon(R) CPU E5450 @ 3.00GHz) machine and 32bit
kernel, the following bench:
	loop (10000000 times) {
		p = mmap(128M, ANONYMOUS);
		munmap(p, 128M);
	}
run on all 8 cpus gives:

Before patch :
real	3m22.759s
user	0m6.353s
sys	26m28.919s

After patch :
real	0m23.420s
user	0m6.332s
sys	2m44.561s

Quite good results considering atomic64_add() uses two "lock cmpxchg8b"
on x86_32 :

    33.03%        mmap_test  [kernel.kallsyms]       [k] unmap_vmas
    12.99%        mmap_test  [kernel.kallsyms]       [k] atomic64_add_return_cx8
     5.62%        mmap_test  [kernel.kallsyms]       [k] free_pgd_range
     3.07%        mmap_test  [kernel.kallsyms]       [k] sysenter_past_esp
     2.48%        mmap_test  [kernel.kallsyms]       [k] memcpy
     2.24%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap
     2.21%        mmap_test  [kernel.kallsyms]       [k] _raw_spin_lock
     2.02%        mmap_test  [vdso]                  [.] 0xffffe424
     2.01%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap_output
     1.38%        mmap_test  [kernel.kallsyms]       [k] vma_adjust
     1.24%        mmap_test  [kernel.kallsyms]       [k] sched_clock_local
     1.23%             perf  [kernel.kallsyms]       [k] __copy_from_user_ll_nozero
     1.07%        mmap_test  [kernel.kallsyms]       [k] down_write


If only one cpu runs the program :

real	0m16.685s
user	0m0.771s
sys	0m15.815s
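
In case someone wants to reproduce the numbers, the bench is essentially
the following (a minimal sketch; the exact mmap flags and error handling
are my reconstruction, not the original source):

	/* mmap_test.c - rough reconstruction of the bench loop above */
	#include <sys/mman.h>
	#include <stdlib.h>

	#define SZ	(128UL << 20)	/* 128M */

	int main(void)
	{
		long i;

		for (i = 0; i < 10000000; i++) {
			void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

			if (p == MAP_FAILED)
				exit(1);
			munmap(p, SZ);
		}
		return 0;
	}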

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/percpu_counter.h |   15 +++++++--
 lib/percpu_counter.c           |   47 +++++++++++++++++++------------
 2 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..288acf4 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,14 +16,21 @@
 #ifdef CONFIG_SMP
 
 struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+	spinlock_t	lock;
+	atomic_t	sequence; /* low order bit set if one sum() is in flight */
+	atomic64_t	count;
+	atomic64_t	slowcount;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
 	s32 __percpu *counters;
 };
 
+static inline bool percpu_counter_active_sum(const struct percpu_counter *fbc)
+{
+	return (atomic_read(&fbc->sequence) & 1) ? true : false;
+}
+
 extern int percpu_counter_batch;
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
@@ -60,7 +67,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	return atomic64_read(&fbc->count) + atomic64_read(&fbc->slowcount);
 }
 
 /*
@@ -70,7 +77,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..aef4bd5 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -59,31 +59,35 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 	int cpu;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&fbc->count, amount);
+	atomic64_set(&fbc->slowcount, 0);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	unsigned long flags;
 
-	preempt_disable();
+	if (percpu_counter_active_sum(fbc)) {
+		atomic64_add(amount, &fbc->slowcount);
+		return;
+	}
+
+	local_irq_save(flags);
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
+		atomic_add(2, &fbc->sequence);
+		atomic64_add(count, &fbc->count);
 		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
-	preempt_enable();
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(__percpu_counter_add);
 
@@ -95,13 +99,22 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
+	unsigned int seq;
 
 	spin_lock(&fbc->lock);
-	ret = fbc->count;
-	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
-	}
+	atomic_inc(&fbc->sequence);
+	do {
+		seq = atomic_read(&fbc->sequence);
+	
+		ret = atomic64_read(&fbc->count);
+		for_each_online_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			ret += *pcount;
+		}
+	} while (atomic_read(&fbc->sequence) != seq);
+
+	atomic_inc(&fbc->sequence);
+	ret += atomic64_read(&fbc->slowcount);
 	spin_unlock(&fbc->lock);
 	return ret;
 }
@@ -112,7 +125,8 @@ int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
 {
 	spin_lock_init(&fbc->lock);
 	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	atomic64_set(&fbc->count, amount);
+	atomic64_set(&fbc->slowcount, 0);
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
@@ -171,13 +185,10 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 	mutex_lock(&percpu_counters_lock);
 	list_for_each_entry(fbc, &percpu_counters, list) {
 		s32 *pcount;
-		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
 		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		atomic64_add(*pcount, &fbc->count);
 		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch] percpu_counter: scalability works
  2011-05-13 14:51                       ` [patch] percpu_counter: scalability works Eric Dumazet
@ 2011-05-13 15:39                         ` Eric Dumazet
  2011-05-13 16:35                           ` [patch V2] " Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13 15:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Friday, 13 May 2011 at 16:51 +0200, Eric Dumazet wrote:

> Here the patch I cooked (on top of linux-2.6)
> 
> This solves the problem quite well for me.
> 
> Idea is :
> 
> Consider _sum() being slow path. It is still serialized by a spinlock().
> 
> Add a fbc->sequence, so that _add() can detect a sum() is in flight, and
> directly add to a new atomic64_t field I named "fbc->slowcount" (and not
> touch its percpu s32 variable so that _sum() can get accurate
> percpu_counter 'Value')
> 
> Low order bit of the 'sequence' is used to signal _sum() is in flight,
> while _add() threads that overflow their percpu s32 variable do a
> sequence += 2, so that _sum() can detect at least one cpu messed the
> fbc->count and reset its s32 variable. _sum() can restart its loop, but
> since sequence has still low order bit set, we have guarantee that the
> _sum() loop wont be restarted ad infinitum.
> 
> Notes : I disabled IRQ in _add() to reduce window, making _add() as fast
> as possible to avoid _sum() extra loops, but its not strictly necessary,
> we can discuss this point, since _sum() is slow path :)
> 
> _sum() is accurate and not blocking anymore _add(). It's slowing it a
> bit of course since all _add() will touch fbc->slowcount.
> 
> _sum() is about same speed than before in my tests.
> 
> On my 8 cpu (Intel(R) Xeon(R) CPU E5450 @ 3.00GHz) machine, and 32bit
> kernel, the : 
> 	loop (10000000 times) {
> 		p = mmap(128M, ANONYMOUS);
> 		munmap(p, 128M);
> 	} 
> done on 8 cpus bench :
> 
> Before patch :
> real	3m22.759s
> user	0m6.353s
> sys	26m28.919s
> 
> After patch :
> real	0m23.420s
> user	0m6.332s
> sys	2m44.561s
> 
> Quite good results considering atomic64_add() uses two "lock cmpxchg8b"
> on x86_32 :
> 
>     33.03%        mmap_test  [kernel.kallsyms]       [k] unmap_vmas
>     12.99%        mmap_test  [kernel.kallsyms]       [k] atomic64_add_return_cx8
>      5.62%        mmap_test  [kernel.kallsyms]       [k] free_pgd_range
>      3.07%        mmap_test  [kernel.kallsyms]       [k] sysenter_past_esp
>      2.48%        mmap_test  [kernel.kallsyms]       [k] memcpy
>      2.24%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap
>      2.21%        mmap_test  [kernel.kallsyms]       [k] _raw_spin_lock
>      2.02%        mmap_test  [vdso]                  [.] 0xffffe424
>      2.01%        mmap_test  [kernel.kallsyms]       [k] perf_event_mmap_output
>      1.38%        mmap_test  [kernel.kallsyms]       [k] vma_adjust
>      1.24%        mmap_test  [kernel.kallsyms]       [k] sched_clock_local
>      1.23%             perf  [kernel.kallsyms]       [k] __copy_from_user_ll_nozero
>      1.07%        mmap_test  [kernel.kallsyms]       [k] down_write
> 
> 
> If only one cpu runs the program :
> 
> real	0m16.685s
> user	0m0.771s
> sys	0m15.815s

Thinking a bit more, we could allow several _sum() calls in flight (we
would need an atomic_t count of in-flight _sum() calls, not a single
bit) and remove the spinlock.

This would allow using a separate integer for the add_did_change_fbc_count
signal, and would remove one atomic operation in _add() { the
atomic_add(2, &fbc->sequence); of my previous patch }.


Another idea would be to also put fbc->count / fbc->slowcount out of
line, to keep "struct percpu_counter" read mostly.

I'll send a V2 with this updated scheme.
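
Roughly, the layout I have in mind is (just a sketch, the actual V2
patch may differ in details):

	/* written often: gets its own cache line */
	struct percpu_counter_rw {
		atomic64_t	count;
		unsigned int	sequence;
		atomic64_t	slowcount;
	} ____cacheline_aligned_in_smp;

	/* read mostly */
	struct percpu_counter {
		atomic_t		  sum_cnt; /* count of in flight sum() */
		struct percpu_counter_rw *pcrw;
		s32 __percpu		  *counters;
	};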


By the way, I ran the bench on a more recent 2x4x2 machine and 64bit
kernel (HP G6 : Intel(R) Xeon(R) CPU E5540  @ 2.53GHz)

1) One process started (no contention) :

Before :
real	0m21.372s
user	0m0.680s
sys	0m20.670s

After V1 patch :
real	0m19.941s
user	0m0.750s
sys	0m19.170s


2) 16 processes started

Before patch:
real	2m14.509s
user	0m13.780s
sys	35m24.170s

After V1 patch :
real	0m48.617s
user	0m16.980s
sys	12m9.400s




^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch V2] percpu_counter: scalability works
  2011-05-13 15:39                         ` Eric Dumazet
@ 2011-05-13 16:35                           ` Eric Dumazet
  2011-05-13 16:46                             ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13 16:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Friday, 13 May 2011 at 17:39 +0200, Eric Dumazet wrote:

> Thinking a bit more, we could allow several _sum() in flight (we would
> need an atomic_t counter for counter of _sum(), not a single bit, and
> remove the spinlock.
> 
> This would allow using a separate integer for the
> add_did_change_fbc_count and remove one atomic operation in _add() { the
> atomic_add(2, &fbc->sequence); of my previous patch }
> 
> 
> Another idea would also put fbc->count / fbc->slowcount out of line,
> to keep "struct percpu_counter" read mostly.
> 
> I'll send a V2 with this updated schem.
> 

Here is V2 of the patch:

The idea is:

We consider _sum() the slow path. We don't try to make it fast [ but
this implementation should be better, since I remove the spinlock that
used to serialize _sum() / _add() invocations ].

Add a fbc->sum_cnt, so that _add() can detect that a _sum() is in flight
and then add directly to a new atomic64_t field I named "fbc->slowcount"
(without touching its percpu s32 variable, so that _sum() can get an
accurate percpu_counter value).

Use an out-of-line structure to make "struct percpu_counter" read
mostly. This structure uses its own cache line to reduce false sharing.

Each time an _add() thread overflows its percpu s32 variable, it
increments a sequence, so that _sum() can detect that at least one cpu
touched fbc->count and reset its s32 variable.
_sum() can restart its loop, but since sum_cnt is non null, we have a
guarantee that the _sum() loop won't be restarted ad infinitum.

In practice, it should be restarted at most once.

I disabled IRQs in _add() to reduce the window, making _add() as fast as
possible and avoiding extra _sum() loops. This is not strictly necessary
and we can discuss this point, since _sum() is the slow path :)

_sum() is accurate and no longer blocks _add(). It does slow _add() down
a bit, of course, since every _add() will touch fbc->slowcount.

On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540 @ 2.53GHz) machine and 64bit
kernel, the following bench:
        loop (10000000 times) {
                p = mmap(128M, ANONYMOUS);
                munmap(p, 128M);
        } 

1) One process started (no contention) :

Before :
real    0m21.372s
user    0m0.680s
sys     0m20.670s

After V2 patch :
real	0m19.654s
user	0m0.730s
sys	0m18.890s


2) 16 processes started

Before patch:
real    2m14.509s
user    0m13.780s
sys     35m24.170s

After V2 patch:

real	0m35.635s
user	0m17.670s
sys	8m11.020s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/percpu_counter.h |   26 +++++++--
 lib/percpu_counter.c           |   83 ++++++++++++++++++++-----------
 2 files changed, 74 insertions(+), 35 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..4aac7f5 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,25 @@
 
 #ifdef CONFIG_SMP
 
-struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+	atomic64_t	count;
+	unsigned int	sequence;
+	atomic64_t	slowcount;
+
+	/* since we have plenty room, store list here, even if never used */
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
+	struct percpu_counter *fbc;
 #endif
-	s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+	atomic_t		 sum_cnt; /* count of in flight sum() */
+	struct percpu_counter_rw *pcrw;
+	s32 __percpu		 *counters;
 };
 
 extern int percpu_counter_batch;
@@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+	return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
 }
 
 /*
@@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..c9c33c1 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/debugobjects.h>
+#include <linux/slab.h>
 
 static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,32 +59,38 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 	int cpu;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&pcrw->count, amount);
+	atomic64_set(&pcrw->slowcount, 0);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	unsigned long flags;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	preempt_disable();
+	if (atomic_read(&fbc->sum_cnt)) {
+		atomic64_add(amount, &pcrw->slowcount);
+		return;
+	}
+
+	local_irq_save(flags);
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
+		pcrw->sequence++; /* lazy increment (not atomic) */
+		atomic64_add(count, &pcrw->count);
 		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
-	preempt_enable();
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(__percpu_counter_add);
 
@@ -95,14 +102,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
+	unsigned int seq;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
-	ret = fbc->count;
-	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
-	}
-	spin_unlock(&fbc->lock);
+	atomic_inc(&fbc->sum_cnt);
+	do {
+		seq = pcrw->sequence;
+		smp_rmb();
+
+		ret = atomic64_read(&pcrw->count);
+		for_each_online_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			ret += *pcount;
+		}
+
+		smp_rmb();
+	} while (pcrw->sequence != seq);
+
+	atomic_dec(&fbc->sum_cnt);
+	ret += atomic64_read(&pcrw->slowcount);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +128,27 @@ EXPORT_SYMBOL(__percpu_counter_sum);
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
 			  struct lock_class_key *key)
 {
-	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	struct percpu_counter_rw *pcrw; 
+
+	pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+	if (!pcrw)
+		return -ENOMEM;
+	atomic64_set(&pcrw->count, amount);
+
 	fbc->counters = alloc_percpu(s32);
-	if (!fbc->counters)
+	if (!fbc->counters) {
+		kfree(pcrw);
 		return -ENOMEM;
+	}
+	fbc->pcrw = pcrw;
 
 	debug_percpu_counter_activate(fbc);
 
 #ifdef CONFIG_HOTPLUG_CPU
-	INIT_LIST_HEAD(&fbc->list);
+	INIT_LIST_HEAD(&pcrw->list);
+	pcrw->fbc = fbc;
 	mutex_lock(&percpu_counters_lock);
-	list_add(&fbc->list, &percpu_counters);
+	list_add(&pcrw->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	return 0;
@@ -138,11 +164,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
 
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
-	list_del(&fbc->list);
+	list_del(&fbc->pcrw->list);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	free_percpu(fbc->counters);
 	fbc->counters = NULL;
+	kfree(fbc->pcrw);
+	fbc->pcrw = NULL;
 }
 EXPORT_SYMBOL(percpu_counter_destroy);
 
@@ -161,7 +189,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 {
 #ifdef CONFIG_HOTPLUG_CPU
 	unsigned int cpu;
-	struct percpu_counter *fbc;
+	struct percpu_counter_rw *pcrw;
 
 	compute_batch_value();
 	if (action != CPU_DEAD)
@@ -169,15 +197,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 
 	cpu = (unsigned long)hcpu;
 	mutex_lock(&percpu_counters_lock);
-	list_for_each_entry(fbc, &percpu_counters, list) {
+	list_for_each_entry(pcrw, &percpu_counters, list) {
 		s32 *pcount;
-		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
-		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+		atomic64_add(*pcount, &pcrw->count);
 		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch V2] percpu_counter: scalability works
  2011-05-13 16:35                           ` [patch V2] " Eric Dumazet
@ 2011-05-13 16:46                             ` Eric Dumazet
  2011-05-13 22:03                               ` [patch V3] " Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13 16:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Friday, 13 May 2011 at 18:35 +0200, Eric Dumazet wrote:

If you want to try this patch, please note that I missed the following
in __percpu_counter_init():

fbc->sum_cnt = 0; 


>  int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
>  			  struct lock_class_key *key)
>  {
> -	spin_lock_init(&fbc->lock);
> -	lockdep_set_class(&fbc->lock, key);
> -	fbc->count = amount;
> +	struct percpu_counter_rw *pcrw; 
> +
> +	pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
> +	if (!pcrw)
> +		return -ENOMEM;
> +	atomic64_set(&pcrw->count, amount);
> +
>  	fbc->counters = alloc_percpu(s32);
> -	if (!fbc->counters)
> +	if (!fbc->counters) {
> +		kfree(pcrw);
>  		return -ENOMEM;
> +	}
> +	fbc->pcrw = pcrw;

	fbc->sum_cnt = 0;

>  
>  	debug_percpu_counter_activate(fbc);
>  





^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch V3] percpu_counter: scalability works
  2011-05-13 16:46                             ` Eric Dumazet
@ 2011-05-13 22:03                               ` Eric Dumazet
  2011-05-16  0:58                                 ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-13 22:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Shaohua Li reported a scalability problem with many threads doing
mmap()/munmap() calls: the vm_committed_as percpu_counter hits its
spinlock very hard if the size of the mmaped zones is bigger than the
percpu_counter batch. We could tune this batch value, but it is better
to have a more scalable percpu_counter infrastructure.

Shaohua provided some patches to speed up __percpu_counter_add() by
removing the need for a spinlock, using an atomic64_t fbc->count
instead.

The problem with these patches was a possibly big deviation seen by
__percpu_counter_sum().

The idea of this patch is to extend Shaohua's idea:

We consider _sum() the slow path. We don't try to make it fast [ but
this implementation should be better, since we remove the spinlock that
used to serialize _sum() / _add() invocations ].

Add a fbc->sum_cnt, so that _add() can detect that a _sum() is in flight
and then add directly to a new atomic64_t field named "fbc->slowcount"
(without touching its percpu s32 variable, so that _sum() can get a more
accurate percpu_counter value).

Use an out-of-line structure to make "struct percpu_counter" read
mostly. This structure uses its own cache line to reduce false sharing.

Each time an _add() thread overflows its percpu s32 variable, it
increments a sequence, so that _sum() can detect that at least one cpu
touched fbc->count and reset its s32 variable.
_sum() can restart its loop, but since sum_cnt is non null, we have a
guarantee that the _sum() loop won't be restarted ad infinitum.

_sum() is accurate and no longer blocks _add() [ it does slow _add()
down a bit, of course, since every _add() will touch the shared
fbc->slowcount ].

On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540 @ 2.53GHz) machine and 64bit
kernel, the following bench:

    loop (10000000 times) {
            p = mmap(128M, ANONYMOUS);
            munmap(p, 128M);
    } 

16 processes started

Before patch:
real    2m14.509s
user    0m13.780s
sys     35m24.170s

After patch:
real	0m34.055s
user	0m17.910s
sys	8m1.680s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Shaohua Li <shaohua.li@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Lameter <cl@linux.com>
CC: Tejun Heo <tj@kernel.org>
CC: Nick Piggin <npiggin@kernel.dk>
---
V3: remove irq masking in __percpu_counter_add()
    initialize fbc->sum_cnt in __percpu_counter_init

 include/linux/percpu_counter.h |   26 +++++++---
 lib/percpu_counter.c           |   79 ++++++++++++++++++++-----------
 2 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..4aac7f5 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,25 @@
 
 #ifdef CONFIG_SMP
 
-struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+	atomic64_t	count;
+	unsigned int	sequence;
+	atomic64_t	slowcount;
+
+	/* since we have plenty room, store list here, even if never used */
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
+	struct percpu_counter *fbc;
 #endif
-	s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+	atomic_t		 sum_cnt; /* count of in flight sum() */
+	struct percpu_counter_rw *pcrw;
+	s32 __percpu		 *counters;
 };
 
 extern int percpu_counter_batch;
@@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+	return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
 }
 
 /*
@@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..ff486b2 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/debugobjects.h>
+#include <linux/slab.h>
 
 static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,28 +59,33 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 	int cpu;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&pcrw->count, amount);
+	atomic64_set(&pcrw->slowcount, 0);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+	if (atomic_read(&fbc->sum_cnt)) {
+		atomic64_add(amount, &pcrw->slowcount);
+		return;
+	}
 
 	preempt_disable();
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
+		atomic64_add(count, &pcrw->count);
+		pcrw->sequence++;
 		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
@@ -95,14 +101,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
+	unsigned int seq;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
-	ret = fbc->count;
-	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
-	}
-	spin_unlock(&fbc->lock);
+	atomic_inc(&fbc->sum_cnt);
+	do {
+		seq = pcrw->sequence;
+		smp_rmb();
+
+		ret = atomic64_read(&pcrw->count);
+		for_each_online_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			ret += *pcount;
+		}
+
+		smp_rmb();
+	} while (pcrw->sequence != seq);
+
+	atomic_dec(&fbc->sum_cnt);
+	ret += atomic64_read(&pcrw->slowcount);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +127,28 @@ EXPORT_SYMBOL(__percpu_counter_sum);
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
 			  struct lock_class_key *key)
 {
-	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	struct percpu_counter_rw *pcrw; 
+
+	pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+	if (!pcrw)
+		return -ENOMEM;
+	atomic64_set(&pcrw->count, amount);
+
 	fbc->counters = alloc_percpu(s32);
-	if (!fbc->counters)
+	if (!fbc->counters) {
+		kfree(pcrw);
 		return -ENOMEM;
+	}
+	fbc->pcrw = pcrw;
+	atomic_set(&fbc->sum_cnt, 0);
 
 	debug_percpu_counter_activate(fbc);
 
 #ifdef CONFIG_HOTPLUG_CPU
-	INIT_LIST_HEAD(&fbc->list);
+	INIT_LIST_HEAD(&pcrw->list);
+	pcrw->fbc = fbc;
 	mutex_lock(&percpu_counters_lock);
-	list_add(&fbc->list, &percpu_counters);
+	list_add(&pcrw->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	return 0;
@@ -138,11 +164,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
 
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
-	list_del(&fbc->list);
+	list_del(&fbc->pcrw->list);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	free_percpu(fbc->counters);
 	fbc->counters = NULL;
+	kfree(fbc->pcrw);
+	fbc->pcrw = NULL;
 }
 EXPORT_SYMBOL(percpu_counter_destroy);
 
@@ -161,7 +189,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 {
 #ifdef CONFIG_HOTPLUG_CPU
 	unsigned int cpu;
-	struct percpu_counter *fbc;
+	struct percpu_counter_rw *pcrw;
 
 	compute_batch_value();
 	if (action != CPU_DEAD)
@@ -169,15 +197,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 
 	cpu = (unsigned long)hcpu;
 	mutex_lock(&percpu_counters_lock);
-	list_for_each_entry(fbc, &percpu_counters, list) {
+	list_for_each_entry(pcrw, &percpu_counters, list) {
 		s32 *pcount;
-		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
-		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+		atomic64_add(*pcount, &pcrw->count);
 		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-13 22:03                               ` [patch V3] " Eric Dumazet
@ 2011-05-16  0:58                                 ` Shaohua Li
  2011-05-16  6:11                                   ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-16  0:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Sat, 2011-05-14 at 06:03 +0800, Eric Dumazet wrote:
> Shaohua Li reported a scalability problem with many threads doing
> mmap()/munmap() calls. vm_committed_as percpu_counter is hitting its
> spinlock very hard, if size of mmaped zones are bigger than
> percpu_counter batch. We could tune this batch value but better have a
> more scalable percpu_counter infrastructure.
> 
> Shaohua provided some patches to speedup __percpu_counter_add(), by
> removing the need to use a spinlock and use an atomic64_t fbc->count
> instead.
> 
> Problem of these patches were a possible big deviation seen by
> __percpu_counter_sum()
> 
> Idea of this patch is to extend Shaohua idea :
> 
> We consider _sum() being slow path. We dont try to make it fast [ but
> this implementation should be better since we remove the spinlock that
> used to serialize _sum() / _add() invocations ]
> 
> Add a fbc->sum_cnt, so that _add() can detect a _sum() is in flight, and
> directly add to a new atomic64_t field named "fbc->slowcount" (and not
> touch its percpu s32 variable so that _sum() can get more accurate
> percpu_counter 'Value')
> 
> Use an out of line structure to make "struct percpu_count" mostly read
> This structure uses its own cache line to reduce false sharing.
> 
> Each time one _add() thread overflows its percpu s32 variable, do an
> increment of a sequence, so that _sum() can detect at least one cpu
> messed the fbc->count and reset its s32 variable.
> _sum() can restart its loop, but since sum_cnt is non null, we have
> guarantee that the _sum() loop wont be restarted ad infinitum.
> 
> _sum() is accurate and not blocking anymore _add() [ It's slowing it a
> bit of course since all _add() will touch shared fbc->slowcount ]
> 
> On my 2x4x2 cpu (Intel(R) Xeon(R) CPU E5540  @ 2.53GHz) machine, and
> 64bit kernel, the following bench : 
> 
>     loop (10000000 times) {
>             p = mmap(128M, ANONYMOUS);
>             munmap(p, 128M);
>     } 
> 
> 16 processes started
> 
> Before patch:
> real    2m14.509s
> user    0m13.780s
> sys     35m24.170s
> 
> After patch:
> real	0m34.055s
> user	0m17.910s
> sys	8m1.680s
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Shaohua Li <shaohua.li@intel.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Christoph Lameter <cl@linux.com>
> CC: Tejun Heo <tj@kernel.org>
> CC: Nick Piggin <npiggin@kernel.dk>
> ---
> V3: remove irq masking in __percpu_counter_add()
>     initialize fbc->sum_cnt in __percpu_counter_init
> 
>  include/linux/percpu_counter.h |   26 +++++++---
>  lib/percpu_counter.c           |   79 ++++++++++++++++++++-----------
>  2 files changed, 72 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 46f6ba5..4aac7f5 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -15,13 +15,25 @@
>  
>  #ifdef CONFIG_SMP
>  
> -struct percpu_counter {
> -	spinlock_t lock;
> -	s64 count;
> +/*
> + * For performance reasons, we keep this part in a separate cache line
> + */
> +struct percpu_counter_rw {
> +	atomic64_t	count;
> +	unsigned int	sequence;
> +	atomic64_t	slowcount;
> +
> +	/* since we have plenty room, store list here, even if never used */
>  #ifdef CONFIG_HOTPLUG_CPU
>  	struct list_head list;	/* All percpu_counters are on a list */
> +	struct percpu_counter *fbc;
>  #endif
> -	s32 __percpu *counters;
> +} ____cacheline_aligned_in_smp;
> +
> +struct percpu_counter {
> +	atomic_t		 sum_cnt; /* count of in flight sum() */
> +	struct percpu_counter_rw *pcrw;
> +	s32 __percpu		 *counters;
>  };
>  
>  extern int percpu_counter_batch;
> @@ -60,7 +72,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>  
>  static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>  {
> -	return fbc->count;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> +	return atomic64_read(&pcrw->count) + atomic64_read(&pcrw->slowcount);
>  }
>  
>  /*
> @@ -70,7 +84,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>   */
>  static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
>  {
> -	s64 ret = fbc->count;
> +	s64 ret = percpu_counter_read(fbc);
>  
>  	barrier();		/* Prevent reloads of fbc->count */
>  	if (ret >= 0)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 28f2c33..ff486b2 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -9,6 +9,7 @@
>  #include <linux/cpu.h>
>  #include <linux/module.h>
>  #include <linux/debugobjects.h>
> +#include <linux/slab.h>
>  
>  static LIST_HEAD(percpu_counters);
>  static DEFINE_MUTEX(percpu_counters_lock);
> @@ -58,28 +59,33 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
>  void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
>  {
>  	int cpu;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
>  
> -	spin_lock(&fbc->lock);
>  	for_each_possible_cpu(cpu) {
>  		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
>  		*pcount = 0;
>  	}
> -	fbc->count = amount;
> -	spin_unlock(&fbc->lock);
> +	atomic64_set(&pcrw->count, amount);
> +	atomic64_set(&pcrw->slowcount, 0);
>  }
>  EXPORT_SYMBOL(percpu_counter_set);
>  
>  void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
>  {
>  	s64 count;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> +	if (atomic_read(&fbc->sum_cnt)) {
> +		atomic64_add(amount, &pcrw->slowcount);
> +		return;
> +	}
>  
>  	preempt_disable();
>  	count = __this_cpu_read(*fbc->counters) + amount;
>  	if (count >= batch || count <= -batch) {
> -		spin_lock(&fbc->lock);
> -		fbc->count += count;
> +		atomic64_add(count, &pcrw->count);
So if _sum() starts and ends here, _sum() can still see a deviation.

I had a patch which uses the idea I described in the last email and
should remove the deviation, see below. It delays _add() if _sum() is
running, which sounds scary, but since _sum() is called quite rarely
this isn't a big problem. We can further convert add_start below to a
percpu counter to reduce cache line bouncing.

---
 include/linux/percpu_counter.h |   18 ++++-------------
 lib/percpu_counter.c           |   43 +++++++++++++++++++++++++----------------
 2 files changed, 32 insertions(+), 29 deletions(-)

Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h	2011-05-13 11:13:25.000000000 +0800
+++ linux/include/linux/percpu_counter.h	2011-05-13 16:22:41.000000000 +0800
@@ -16,8 +16,9 @@
 #ifdef CONFIG_SMP
 
 struct percpu_counter {
+	atomic_t sum_start, add_start;
+	atomic64_t count;
 	spinlock_t lock;
-	s64 count;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
@@ -26,16 +27,7 @@ struct percpu_counter {
 
 extern int percpu_counter_batch;
 
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key);
-
-#define percpu_counter_init(fbc, value)					\
-	({								\
-		static struct lock_class_key __key;			\
-									\
-		__percpu_counter_init(fbc, value, &__key);		\
-	})
-
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -60,7 +52,7 @@ static inline s64 percpu_counter_sum(str
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	return atomic64_read(&fbc->count);
 }
 
 /*
@@ -70,7 +62,7 @@ static inline s64 percpu_counter_read(st
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c	2011-05-13 10:29:04.000000000 +0800
+++ linux/lib/percpu_counter.c	2011-05-13 16:22:03.000000000 +0800
@@ -59,13 +59,11 @@ void percpu_counter_set(struct percpu_co
 {
 	int cpu;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&fbc->count, amount);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
@@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
 	preempt_disable();
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
+		while (1) {
+			atomic_inc_return(&fbc->add_start);
+			if (atomic_read(&fbc->sum_start) != 0)
+				atomic_dec(&fbc->add_start);
+			else
+				break;
+			while (atomic_read(&fbc->sum_start) != 0)
+				cpu_relax();
+		}
+
+		atomic64_add(count, &fbc->count);
 		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
+
+		atomic_dec(&fbc->add_start);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
@@ -97,22 +105,28 @@ s64 __percpu_counter_sum(struct percpu_c
 	int cpu;
 
 	spin_lock(&fbc->lock);
-	ret = fbc->count;
+	atomic_inc_return(&fbc->sum_start);
+	while (atomic_read(&fbc->add_start) != 0)
+		cpu_relax();
+
+	ret = atomic64_read(&fbc->count);
 	for_each_online_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		ret += *pcount;
 	}
+
+	atomic_dec(&fbc->sum_start);
 	spin_unlock(&fbc->lock);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
-int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
-			  struct lock_class_key *key)
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	atomic64_set(&fbc->count, amount);
+	atomic_set(&fbc->sum_start, 0);
+	atomic_set(&fbc->add_start, 0);
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
@@ -127,7 +141,7 @@ int __percpu_counter_init(struct percpu_
 #endif
 	return 0;
 }
-EXPORT_SYMBOL(__percpu_counter_init);
+EXPORT_SYMBOL(percpu_counter_init);
 
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
@@ -171,13 +185,10 @@ static int __cpuinit percpu_counter_hotc
 	mutex_lock(&percpu_counters_lock);
 	list_for_each_entry(fbc, &percpu_counters, list) {
 		s32 *pcount;
-		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
 		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		atomic64_add(*pcount, &fbc->count);
 		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  0:58                                 ` Shaohua Li
@ 2011-05-16  6:11                                   ` Eric Dumazet
  2011-05-16  6:37                                     ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-16  6:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Monday, 16 May 2011 at 08:58 +0800, Shaohua Li wrote:

> so if _sum starts and ends here, _sum can still get deviation.

This makes no sense at all. If you have that many cpus 'here' right
before you increment fbc->sum_cnt, then no matter how precise and super
cautious you are in your _sum() implementation, as soon as you exit from
_sum(), other cpus have already changed the percpu counter's global
value.


 
> @@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
>  	preempt_disable();
>  	count = __this_cpu_read(*fbc->counters) + amount;
>  	if (count >= batch || count <= -batch) {
> -		spin_lock(&fbc->lock);
> -		fbc->count += count;
> +		while (1) {
> +			atomic_inc_return(&fbc->add_start);
> +			if (atomic_read(&fbc->sum_start) != 0)
> +				atomic_dec(&fbc->add_start);
> +			else
> +				break;
> +			while (atomic_read(&fbc->sum_start) != 0)
> +				cpu_relax();
> +		}
> +
> +		atomic64_add(count, &fbc->count);
>  		__this_cpu_write(*fbc->counters, 0);
> -		spin_unlock(&fbc->lock);
> +
> +		atomic_dec(&fbc->add_start);
>  	} else {
>  		__this_cpu_write(*fbc->counters, count);
>  	}
> 

This is way too heavy. You have 3 atomic ops here and a very slow
atomic_inc_return() in the fast path [ not all machines are x86 ].

Not all percpu_counters are used in such a degenerate way. Most of them
hit the global count only rarely.

Your version slows down a very common case (a single cpu calling _add()
several times, for example the network stack in the input path).

With fbc->counters in the same cache line as fbc->add_start/sum_start
and the rest, I bet everything will be very slow during a _sum() on a
4096 cpu machine, especially if this _sum() is interrupted by some long
lasting interrupt.

I believe the 'deviation' risk is almost nil with my patch.
Remember, percpu_counter is not an exact counter but a very lazy one.
(The only requirement is to not have drift.)

The risk is small, especially if we move the:
__this_cpu_write(*fbc->counters, 0);
before the:
atomic64_add(count, &fbc->count);

and then do the sequence increment _after_ this.



Here is my V4: we don't need the second fbc->slowcount, given that
_sum() reads fbc->count after the folding, not before: if some cpus
enter _add() while _sum() is running, they'll see the sum_cnt signal and
change fbc->count immediately.

I also use the following sequence in _add():

__this_cpu_write(*fbc->counters, 0);
atomic64_add(count, &pcrw->count);
pcrw->sequence++;


 include/linux/percpu_counter.h |   25 +++++++--
 lib/percpu_counter.c           |   78 ++++++++++++++++++++-----------
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..e3e62b1 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -15,13 +15,24 @@
 
 #ifdef CONFIG_SMP
 
-struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+/*
+ * For performance reasons, we keep this part in a separate cache line
+ */
+struct percpu_counter_rw {
+	atomic64_t	count;
+	unsigned int	sequence;
+
+	/* since we have plenty room, store list here, even if never used */
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
+	struct percpu_counter *fbc;
 #endif
-	s32 __percpu *counters;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter {
+	atomic_t		 sum_cnt; /* count of in flight sum() */
+	struct percpu_counter_rw *pcrw;
+	s32 __percpu		 *counters;
 };
 
 extern int percpu_counter_batch;
@@ -60,7 +71,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+	return atomic64_read(&pcrw->count);
 }
 
 /*
@@ -70,7 +83,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..27292ba 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -9,6 +9,7 @@
 #include <linux/cpu.h>
 #include <linux/module.h>
 #include <linux/debugobjects.h>
+#include <linux/slab.h>
 
 static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
@@ -58,28 +59,32 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 	int cpu;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&pcrw->count, amount);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
+
+	if (atomic_read(&fbc->sum_cnt)) {
+		atomic64_add(amount, &pcrw->count);
+		return;
+	}
 
 	preempt_disable();
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
 		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
+		atomic64_add(count, &pcrw->count);
+		pcrw->sequence++;
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
@@ -95,14 +100,25 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
+	unsigned int seq;
+	struct percpu_counter_rw *pcrw = fbc->pcrw;
 
-	spin_lock(&fbc->lock);
-	ret = fbc->count;
-	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
-	}
-	spin_unlock(&fbc->lock);
+	atomic_inc(&fbc->sum_cnt);
+	do {
+		seq = pcrw->sequence;
+		smp_rmb();
+
+		ret = 0;
+		for_each_online_cpu(cpu) {
+			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+			ret += *pcount;
+		}
+		ret += atomic64_read(&pcrw->count);
+
+		smp_rmb();
+	} while (pcrw->sequence != seq);
+
+	atomic_dec(&fbc->sum_cnt);
 	return ret;
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
@@ -110,19 +126,28 @@ EXPORT_SYMBOL(__percpu_counter_sum);
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
 			  struct lock_class_key *key)
 {
-	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	struct percpu_counter_rw *pcrw; 
+
+	pcrw = kzalloc(sizeof(*pcrw), GFP_KERNEL);
+	if (!pcrw)
+		return -ENOMEM;
+	atomic64_set(&pcrw->count, amount);
+
 	fbc->counters = alloc_percpu(s32);
-	if (!fbc->counters)
+	if (!fbc->counters) {
+		kfree(pcrw);
 		return -ENOMEM;
+	}
+	fbc->pcrw = pcrw;
+	atomic_set(&fbc->sum_cnt, 0);
 
 	debug_percpu_counter_activate(fbc);
 
 #ifdef CONFIG_HOTPLUG_CPU
-	INIT_LIST_HEAD(&fbc->list);
+	INIT_LIST_HEAD(&pcrw->list);
+	pcrw->fbc = fbc;
 	mutex_lock(&percpu_counters_lock);
-	list_add(&fbc->list, &percpu_counters);
+	list_add(&pcrw->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	return 0;
@@ -138,11 +163,13 @@ void percpu_counter_destroy(struct percpu_counter *fbc)
 
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
-	list_del(&fbc->list);
+	list_del(&fbc->pcrw->list);
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	free_percpu(fbc->counters);
 	fbc->counters = NULL;
+	kfree(fbc->pcrw);
+	fbc->pcrw = NULL;
 }
 EXPORT_SYMBOL(percpu_counter_destroy);
 
@@ -161,7 +188,7 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 {
 #ifdef CONFIG_HOTPLUG_CPU
 	unsigned int cpu;
-	struct percpu_counter *fbc;
+	struct percpu_counter_rw *pcrw;
 
 	compute_batch_value();
 	if (action != CPU_DEAD)
@@ -169,15 +196,12 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 
 	cpu = (unsigned long)hcpu;
 	mutex_lock(&percpu_counters_lock);
-	list_for_each_entry(fbc, &percpu_counters, list) {
+	list_for_each_entry(pcrw, &percpu_counters, list) {
 		s32 *pcount;
-		unsigned long flags;
 
-		spin_lock_irqsave(&fbc->lock, flags);
-		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
+		pcount = per_cpu_ptr(pcrw->fbc->counters, cpu);
+		atomic64_add(*pcount, &pcrw->count);
 		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  6:11                                   ` Eric Dumazet
@ 2011-05-16  6:37                                     ` Shaohua Li
  2011-05-16  6:55                                       ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-16  6:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
> 
> > so if _sum starts and ends here, _sum can still get deviation.
> 
> This makes no sense at all. If you have so many cpus 'here' right before
> you increment fbc->sum_cnt, then no matter how precise and super
> cautious you are in your _sum() implementation, as soon as you exit from
> sum(), other cpus already changed the percpu counter global value.
I don't agree here. The original implementation also has just a quite
small window where we get a deviation; the window only exists between
the two lines:
		atomic64_add(count, &fbc->count);
		__this_cpu_write(*fbc->counters, 0);
If you think we should ignore it, we'd better not use any protection
here.

> > @@ -76,10 +74,20 @@ void __percpu_counter_add(struct percpu_
> >  	preempt_disable();
> >  	count = __this_cpu_read(*fbc->counters) + amount;
> >  	if (count >= batch || count <= -batch) {
> > -		spin_lock(&fbc->lock);
> > -		fbc->count += count;
> > +		while (1) {
> > +			atomic_inc_return(&fbc->add_start);
> > +			if (atomic_read(&fbc->sum_start) != 0)
> > +				atomic_dec(&fbc->add_start);
> > +			else
> > +				break;
> > +			while (atomic_read(&fbc->sum_start) != 0)
> > +				cpu_relax();
> > +		}
> > +
> > +		atomic64_add(count, &fbc->count);
> >  		__this_cpu_write(*fbc->counters, 0);
> > -		spin_unlock(&fbc->lock);
> > +
> > +		atomic_dec(&fbc->add_start);
> >  	} else {
> >  		__this_cpu_write(*fbc->counters, count);
> >  	}
> > 
> 
> This is way too heavy. You have 3 atomic ops here and a very slow
> atomic_inc_return() in fast path [ not all machines are x86].
> 
> Not all percpu_counters are used in degenerated way. Most of them hit
> the global count not very often.
> 
> Your version slows down a very common case (one cpu only calling _add()
> several times, for example network stack in input path)
> 
> fbc->counters being in same cache line than fbc->add_start/sum_start and
> all, I bet everything will be very slow during a _sum() on a 4096 cpu
> machine, especially if this _sum() is interrupted by some long lasting
> interrupt.
As I wrote in the email, the atomic and cacheline issues can be resolved
with per-cpu data; I just didn't post the patch. I'm posting it this time,
please see below. There is no cache line bouncing anymore.

> I believe the 'deviation' risk is almost null with my patch.
> Remember percpu_counter is not an exact counter but a very lazy one.
> (Only requirement is to not have drift)
> 
> The risk is small especially if we move the :
> __this_cpu_write(*fbc->counters, 0);
> before the :
> atomic64_add(count, &fbc->count);
> 
> and then do the sequence increment _after_ this.
> 
> 
> 
> Here is my V4 : We dont need the second fbc->slowcount, given sum() get
> fbc->count after the folding, not before : If some cpus enter _add()
> while _sum() is running they'll see the sum_cnt signal and change
> fbc->count immediately.
> 
> I also make following sequence in _add() :
> 
> __this_cpu_write(*fbc->counters, 0);
We still have the deviation issue if _sum starts and ends here; this
doesn't change anything.

> atomic64_add(count, &pcrw->count);
> pcrw->sequence++;
> 
> 
>  include/linux/percpu_counter.h |   25 +++++++--
>  lib/percpu_counter.c           |   78 ++++++++++++++++++++-----------
>  2 files changed, 70 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
> index 46f6ba5..e3e62b1 100644
> --- a/include/linux/percpu_counter.h
> +++ b/include/linux/percpu_counter.h
> @@ -15,13 +15,24 @@
>  
>  #ifdef CONFIG_SMP
>  
> -struct percpu_counter {
> -	spinlock_t lock;
> -	s64 count;
> +/*
> + * For performance reasons, we keep this part in a separate cache line
> + */
> +struct percpu_counter_rw {
> +	atomic64_t	count;
> +	unsigned int	sequence;
> +
> +	/* since we have plenty room, store list here, even if never used */
>  #ifdef CONFIG_HOTPLUG_CPU
>  	struct list_head list;	/* All percpu_counters are on a list */
> +	struct percpu_counter *fbc;
>  #endif
> -	s32 __percpu *counters;
> +} ____cacheline_aligned_in_smp;
> +
> +struct percpu_counter {
> +	atomic_t		 sum_cnt; /* count of in flight sum() */
> +	struct percpu_counter_rw *pcrw;
> +	s32 __percpu		 *counters;
>  };
>  
>  extern int percpu_counter_batch;
> @@ -60,7 +71,9 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
>  
>  static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>  {
> -	return fbc->count;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> +	return atomic64_read(&pcrw->count);
>  }
>  
>  /*
> @@ -70,7 +83,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
>   */
>  static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
>  {
> -	s64 ret = fbc->count;
> +	s64 ret = percpu_counter_read(fbc);
>  
>  	barrier();		/* Prevent reloads of fbc->count */
>  	if (ret >= 0)
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 28f2c33..27292ba 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -9,6 +9,7 @@
>  #include <linux/cpu.h>
>  #include <linux/module.h>
>  #include <linux/debugobjects.h>
> +#include <linux/slab.h>
>  
>  static LIST_HEAD(percpu_counters);
>  static DEFINE_MUTEX(percpu_counters_lock);
> @@ -58,28 +59,32 @@ static inline void debug_percpu_counter_deactivate(struct percpu_counter *fbc)
>  void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
>  {
>  	int cpu;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
>  
> -	spin_lock(&fbc->lock);
>  	for_each_possible_cpu(cpu) {
>  		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
>  		*pcount = 0;
>  	}
> -	fbc->count = amount;
> -	spin_unlock(&fbc->lock);
> +	atomic64_set(&pcrw->count, amount);
>  }
>  EXPORT_SYMBOL(percpu_counter_set);
>  
>  void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
>  {
>  	s64 count;
> +	struct percpu_counter_rw *pcrw = fbc->pcrw;
> +
> +	if (atomic_read(&fbc->sum_cnt)) {
> +		atomic64_add(amount, &pcrw->count);
> +		return;
> +	}
>  
>  	preempt_disable();
>  	count = __this_cpu_read(*fbc->counters) + amount;
>  	if (count >= batch || count <= -batch) {
> -		spin_lock(&fbc->lock);
> -		fbc->count += count;
>  		__this_cpu_write(*fbc->counters, 0);
> -		spin_unlock(&fbc->lock);
> +		atomic64_add(count, &pcrw->count);
smp_wmb() or atomic64_add_return() here to guarantee the changes are
seen before sequence++;

> +		pcrw->sequence++;
sequence++ can introduce cache line bouncing.
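
To make the requested ordering concrete, a minimal sketch using the pcrw
names from the V4 patch above (either the explicit barrier below or the
full barrier implied by atomic64_add_return() is enough):

		__this_cpu_write(*fbc->counters, 0);
		atomic64_add(count, &pcrw->count);
		smp_wmb();	/* publish the new count before the sequence bump */
		pcrw->sequence++;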

add_start causes a lot of cache bouncing because it's updated by all
cpus. We can actually make it a percpu variable; this completely
removes the cache bouncing.
With this patch plus the last patch, I get about 7x faster running the
workload the last patch described. With the last patch alone, the
workload is only about 4x faster.
This doesn't slow down _sum because we removed the lock from _sum. I did
a stress test: 23 CPUs run _add, one cpu runs _sum. In the _add fast path
(lock not held), _sum runs a little slower (about 20% slower). In the
_add slow path (lock held), _sum runs much faster (about 9x faster).

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 include/linux/percpu_counter.h |    3 ++-
 lib/percpu_counter.c           |   22 ++++++++++++++++------
 2 files changed, 18 insertions(+), 7 deletions(-)

Index: linux/include/linux/percpu_counter.h
===================================================================
--- linux.orig/include/linux/percpu_counter.h	2011-05-16 10:26:05.000000000 +0800
+++ linux/include/linux/percpu_counter.h	2011-05-16 10:27:48.000000000 +0800
@@ -16,12 +16,13 @@
 #ifdef CONFIG_SMP
 
 struct percpu_counter {
-	atomic_t sum_start, add_start;
+	atomic_t sum_start;
 	atomic64_t count;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
 	s32 __percpu *counters;
+	char __percpu *add_starts;
 };
 
 extern int percpu_counter_batch;
Index: linux/lib/percpu_counter.c
===================================================================
--- linux.orig/lib/percpu_counter.c	2011-05-16 10:26:58.000000000 +0800
+++ linux/lib/percpu_counter.c	2011-05-16 10:46:12.000000000 +0800
@@ -75,10 +75,12 @@ void __percpu_counter_add(struct percpu_
 	count = __this_cpu_read(*fbc->counters) + amount;
 	if (count >= batch || count <= -batch) {
 		while (1) {
-			atomic_inc_return(&fbc->add_start);
+			__this_cpu_write(*fbc->add_starts, 1);
+			/* Guarantee add_starts is seen by _sum */
+			smp_wmb();
 			if (atomic_read(&fbc->sum_start) == 0)
 				break;
-			atomic_dec(&fbc->add_start);
+			__this_cpu_write(*fbc->add_starts, 0);
 			while (atomic_read(&fbc->sum_start) != 0)
 				cpu_relax();
 		}
@@ -86,7 +88,7 @@ void __percpu_counter_add(struct percpu_
 		atomic64_add(count, &fbc->count);
 		__this_cpu_write(*fbc->counters, 0);
 
-		atomic_dec(&fbc->add_start);
+		__this_cpu_write(*fbc->add_starts, 0);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
@@ -104,8 +106,10 @@ s64 __percpu_counter_sum(struct percpu_c
 	int cpu;
 
 	atomic_inc_return(&fbc->sum_start);
-	while (atomic_read(&fbc->add_start) != 0)
-		cpu_relax();
+	for_each_online_cpu(cpu) {
+		while (*per_cpu_ptr(fbc->add_starts, cpu) != 0)
+			cpu_relax();
+	}
 
 	ret = atomic64_read(&fbc->count);
 	for_each_online_cpu(cpu) {
@@ -122,10 +126,15 @@ int percpu_counter_init(struct percpu_co
 {
 	atomic64_set(&fbc->count, amount);
 	atomic_set(&fbc->sum_start, 0);
-	atomic_set(&fbc->add_start, 0);
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
+	fbc->add_starts = alloc_percpu(char);
+	if (!fbc->add_starts) {
+		free_percpu(fbc->counters);
+		return -ENOMEM;
+	}
+
 
 	debug_percpu_counter_activate(fbc);
 
@@ -152,6 +161,7 @@ void percpu_counter_destroy(struct percp
 	mutex_unlock(&percpu_counters_lock);
 #endif
 	free_percpu(fbc->counters);
+	free_percpu(fbc->add_starts);
 	fbc->counters = NULL;
 }
 EXPORT_SYMBOL(percpu_counter_destroy);



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  6:37                                     ` Shaohua Li
@ 2011-05-16  6:55                                       ` Eric Dumazet
  2011-05-16  7:15                                         ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-16  6:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le lundi 16 mai 2011 à 14:37 +0800, Shaohua Li a écrit :
> On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> > Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
> > 
> > > so if _sum starts and ends here, _sum can still get deviation.
> > 
> > This makes no sense at all. If you have so many cpus 'here' right before
> > you increment fbc->sum_cnt, then no matter how precise and super
> > cautious you are in your _sum() implementation, as soon as you exit from
> > sum(), other cpus already changed the percpu counter global value.
> I don't agree here. The original implementation also just has quite
> small window we have deviation, the window only exists between the two
> lines:
> 		atomic64_add(count, &fbc->count);
> 	        __this_cpu_write(*fbc->counters, 0);
> if you think we should ignore it, we'd better not use any protection
> here.
> 

Not at all. Your version didn't forbid a new cpu from coming into _add()
and hitting the deviation problem.

There is a small difference, or else I wouldn't have bothered.


> as I wrote in the email, the atomic and cacheline issue can be resolved
> with a per_cpu data, I just didn't post the patch. I post it this time,
> please see below. There is no cache line bounce anymore.
> 

I am afraid we are making no progress at all here if you just try to push
your patch and ignore my comments.

percpu_counter is a compromise; don't make it too slow for normal
operations. It works well if most _add() operations only go through
percpu data.

Please just move vm_committed_as to a plain atomic_t; this will solve
your problem.

Thanks



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  6:55                                       ` Eric Dumazet
@ 2011-05-16  7:15                                         ` Shaohua Li
  2011-05-16  7:44                                           ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-16  7:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Mon, 2011-05-16 at 14:55 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 14:37 +0800, Shaohua Li a écrit :
> > On Mon, 2011-05-16 at 14:11 +0800, Eric Dumazet wrote:
> > > Le lundi 16 mai 2011 à 08:58 +0800, Shaohua Li a écrit :
> > > 
> > > > so if _sum starts and ends here, _sum can still get deviation.
> > > 
> > > This makes no sense at all. If you have so many cpus 'here' right before
> > > you increment fbc->sum_cnt, then no matter how precise and super
> > > cautious you are in your _sum() implementation, as soon as you exit from
> > > sum(), other cpus already changed the percpu counter global value.
> > I don't agree here. The original implementation also just has quite
> > small window we have deviation, the window only exists between the two
> > lines:
> > 		atomic64_add(count, &fbc->count);
> > 	        __this_cpu_write(*fbc->counters, 0);
> > if you think we should ignore it, we'd better not use any protection
> > here.
> > 
> 
> Not at all. Your version didnt forbid new cpu to come in _add() and
> hitting the deviation problem.
If everybody agrees the deviation isn't a problem, I will not bother to
argue here.
But your patch does have the deviation issue which Tejun dislikes.


> There is a small difference, or else I wouldnt had bother.
In _sum, set a bit; in _add, wait till the bit is cleared. This can
solve the issue too, and much more simply.

> > as I wrote in the email, the atomic and cacheline issue can be resolved
> > with a per_cpu data, I just didn't post the patch. I post it this time,
> > please see below. There is no cache line bounce anymore.
> > 
> 
> I am afraid we make no progress at all here, if you just try to push
> your patch and ignore my comments.
I did try to push my patch, but I didn't ignore your comments. I pointed
out that your patch still has the deviation issue and you didn't think it's
an issue, so actually you are ignoring my comments. On the other hand, I
pushed my patch because I thought mine doesn't have the deviation.

> percpu_counter is a compromise, dont make it too slow for normal
> operations. It works well if most _add() operations only go through
> percpu data.
> 
> Please just move vm_committed_as to a plain atomic_t, this will solve
> your problem.
I can, but you can't prevent me from optimizing percpu_counter.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  7:15                                         ` Shaohua Li
@ 2011-05-16  7:44                                           ` Eric Dumazet
  2011-05-16  8:34                                             ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-16  7:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le lundi 16 mai 2011 à 15:15 +0800, Shaohua Li a écrit :

> I can, but you can't prevent me to optimize percpu_counter.
> 

Well, I have the right to say you're wrong.

Your last patch is not good, sorry. Please take the time to read it
again and fix the obvious problems. And also give us numbers for one process
doing the mmap()/munmap() loop, before and after your patch.

A percpu_counter is already a beast as is; you're suggesting doubling
its size for a pathological case.

It's absolutely not clear to me why vm_committed_as is using the default
percpu_counter_batch.

By the way could you make sure percpu_counter_batch has the right value
on your 24 cpus machine ?

Your 128Mbyte mmap threshold sounds like percpu_counter_batch=32 instead
of 48




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  7:44                                           ` Eric Dumazet
@ 2011-05-16  8:34                                             ` Shaohua Li
  2011-05-16  9:35                                               ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-16  8:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Mon, 2011-05-16 at 15:44 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 15:15 +0800, Shaohua Li a écrit :
> 
> > I can, but you can't prevent me to optimize percpu_counter.
> > 
> 
> Well, I have the right to say you're wrong.
sure, but please give a reason.

> Your last patch is not good, sorry. 
> Please take the time to read it
> again and fix obvious problems.
what kind of obvious problems?

> And also give us numbers if one process
> does the mmap()/munmap() loop, before and after your patch.
I did a stress test with one thread:

while {
	__percpu_counter_add(+count)
	__percpu_counter_add(-count)
}
The loop runs 10000000 times.
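
Roughly, as a self-contained sketch (a hypothetical throwaway module, not
the actual harness that produced the numbers below):

	#include <linux/module.h>
	#include <linux/percpu_counter.h>

	static struct percpu_counter fbc;

	static int __init pcc_stress_init(void)
	{
		/* 1 stays in the fast path; a value >= percpu_counter_batch
		 * would force the fold into the global count on every call */
		s64 count = 1;
		int i;

		if (percpu_counter_init(&fbc, 0))
			return -ENOMEM;
		for (i = 0; i < 10000000; i++) {
			__percpu_counter_add(&fbc, +count, percpu_counter_batch);
			__percpu_counter_add(&fbc, -count, percpu_counter_batch);
		}
		percpu_counter_destroy(&fbc);
		return 0;
	}
	module_init(pcc_stress_init);
	MODULE_LICENSE("GPL");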
In the _add fast path (no lock held):
before my patch:
real    0m0.133s
user    0m0.000s
sys     0m0.124s
after:
real    0m0.129s
user    0m0.000s
sys     0m0.120s
the difference is variation.

In the _add slow path (lock held):
before my patch:
real    0m0.374s
user    0m0.000s
sys     0m0.372s
after:
real    0m0.245s
user    0m0.000s
sys     0m0.020s

My patch actually makes _add faster, because it removes the spin_lock.

> A percpu_counter is already a beast as is, you're suggesting to double
> its size, for a pathological case.
> 
> Its absolutely not clear to me why vm_committed_as is using the default
> percpu_counter_batch. 
> 
> By the way could you make sure percpu_counter_batch has the right value
> on your 24 cpus machine ?
> 
> Your 128Mbyte mmap threshold sounds like percpu_counter_batch=32 instead
> of 48
Let's not argue about the batch size anymore. If we can make percpu_counter
faster, why don't we (even your patch mentions this)?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  8:34                                             ` Shaohua Li
@ 2011-05-16  9:35                                               ` Eric Dumazet
  2011-05-16 14:22                                                 ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-16  9:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le lundi 16 mai 2011 à 16:34 +0800, Shaohua Li a écrit :

> let's not argue the batch size anymore. If we can make percpu_counter
> faster, why we don't (even your patch mentioned this).
> 

I actually changed my mind after trying to solve the problem and spending
a few hours on it. This is not worth it and is counterproductive.

The whole point of percpu_counter is being able to avoid the false sharing
in _most_ cases. I would even make the false sharing case more
expensive just to pinpoint a bad user, thanks to profiling.

Trying to make it fast in the pathological case is just throwing a brown
paper bag over the real problem.

An interesting move would be to make percpu_counter hierarchical,
because we might need it for 4096 cpus machines.

Given that vm_committed_as only needs one-percent resolution
(sysctl_overcommit_ratio is expressed with percent resolution), it
should be used with an appropriate batch value, something like:

vm_committed_as_batch = max(percpu_counter_batch,
			    total_ram_pages/(num_possible_cpus()*100));

Instead of the default percpu_counter_batch, which is aimed more at
_add(1)/_add(-1) uses.

Note : This won't solve your mmap(128M)/munmap() problem unless your
machine has a _lot_ of memory. Still, this will be a win on many real
workloads.
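
As a sketch of how that could be wired up (hypothetical init code; it
assumes the in-tree totalram_pages counter for 'total_ram_pages' above,
and shows mm's existing vm_acct_memory() helper switched over to the
explicit-batch variant):

	int vm_committed_as_batch;

	static int __init committed_as_batch_init(void)
	{
		vm_committed_as_batch = max_t(int, percpu_counter_batch,
				totalram_pages / (num_possible_cpus() * 100));
		return 0;
	}
	core_initcall(committed_as_batch_init);

	/* callers then pass the bigger batch explicitly: */
	static inline void vm_acct_memory(long pages)
	{
		__percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
	}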




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16  9:35                                               ` Eric Dumazet
@ 2011-05-16 14:22                                                 ` Eric Dumazet
  2011-05-17  0:55                                                   ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-16 14:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le lundi 16 mai 2011 à 11:35 +0200, Eric Dumazet a écrit :

> Given that vm_committed has one percent resolution need
> (sysctl_overcommit_ratio is expressed with percent resolution), it
> should be used with an appropriate batch value, something like :
> 
> vm_committed_as_batch = max(percpu_counter_batch,
> 			    total_ram_pages/(num_possible_cpus()*100));
> 


The funny thing with vm_committed_as is that we don't even read its value
with the default vm configuration

(sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)

[ In this case, we read it only for /proc/meminfo output ]

Ideally, we could dynamically change vm_committed_as_batch when
sysctl_overcommit_memory or another param is changed. This would need a
mechanism to ask all cpus to transfer/clear their local s32 into the global
fbc->count [when lowering vm_committed_as_batch]

Another idea would be to use an atomic operation when manipulating the
percpu s32, so that __percpu_counter_sum() is able to do the folding itself:
at the end of __percpu_counter_sum(), fbc->count would be the final
result, and all the s32s would be zero, unless some cpus called _add()
meanwhile.
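
(A full patch along these lines follows later in the thread; the core of
the idea, assuming fbc->count has been converted to an atomic64_t, is
just:)

	s64 __percpu_counter_sum(struct percpu_counter *fbc)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
			s32 count = xchg(pcount, 0);	/* steal the local delta */

			if (count)
				atomic64_add(count, &fbc->count);
		}
		return atomic64_read(&fbc->count);
	}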




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-16 14:22                                                 ` Eric Dumazet
@ 2011-05-17  0:55                                                   ` Shaohua Li
  2011-05-17  4:56                                                     ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-17  0:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Mon, 2011-05-16 at 22:22 +0800, Eric Dumazet wrote:
> Le lundi 16 mai 2011 à 11:35 +0200, Eric Dumazet a écrit :
> 
> > Given that vm_committed has one percent resolution need
> > (sysctl_overcommit_ratio is expressed with percent resolution), it
> > should be used with an appropriate batch value, something like :
> > 
> > vm_committed_as_batch = max(percpu_counter_batch,
> > 			    total_ram_pages/(num_possible_cpus()*100));
> > 
> 
> 
> Funny thing with vm_committed_as is we dont even read its value with
> default vm configuration 
> 
> (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)
> 
> [ In this case, we read it only for /proc/meminfo output ]
> 
> Ideally, we could dynamically change vm_committed_as_batch when
> sysctl_overcommit_memory or other param is changed. This would need a
> mechanism to ask all cpus to transfer/clear their local s32 into global
> fbc->count [when lowering vm_committed_as_batch]
I actually posted something like this before:
http://marc.info/?l=linux-mm&m=130144785326028&w=2
but this could affect /proc/meminfo reads.

> Another idea would be to use an atomic when manipulating the percpu s32,
> so that __percpu_counter_sum() is able to make this operation itself :
> At the end of __percpu_counter_sum(), fbc->count would be the final
> result, and all s32 would be zero, unless some cpus called _add()
> meanwhile.
I don't understand it fully. But if only concurrent _add can introduce
deviation, this is good.

I'm still interested in improving percpu_counter itself. If we can
improve it, why don't we? My patch doesn't slow down anything in any of
the tests. Why didn't you ever look at it?

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  0:55                                                   ` Shaohua Li
@ 2011-05-17  4:56                                                     ` Eric Dumazet
  2011-05-17  5:22                                                       ` Shaohua Li
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-17  4:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le mardi 17 mai 2011 à 08:55 +0800, Shaohua Li a écrit :

> I'm still interesting in improving percpu_counter itself. If we can
> improve it, why we don't? My patch doesn't slow down anything for all
> tests. Why didn't you ever look at it?
> 

I did and said there were obvious problems in it.

1) 4096 cpus : size of one percpu_counter is 16Kbytes.

After your last patch -> 20 kbytes for no good reason.

2) Two separate alloc_percpu() -> two separate cache lines instead of
one.

But then, if one alloc_percpu() -> 32 kbytes per object.

3) Focus on percpu_counter() implementation instead of making an
analysis of callers.

I did a lot of rwlock removal in the network stack because they are not the
right synchronization primitive in many cases. I did not optimize
rwlocks. If rwlocks were even slower, I suspect other people would have
helped me to convert things faster.

4) There is a possible way to solve your deviation case : add at the
beginning of _add() a short cut for crazy 'amount' values. It's a bit
expensive on 32bit arches, so it might be added in a new helper to let
_add() be fast for normal and gentle users.

if (unlikely(amount >= batch || amount <= -batch)) {
	atomic64_add(amount, &fbc->count);
	return;
}

I.e. don't bother reading the s32 value and setting it to 0, just leave it
alone.

Thanks

------------------------------------------------------------------

About making percpu s32 'atomic', here is the patch to show the idea:

Each time we call __percpu_counter_sum(), the fbc->counters[] values are
zeroed and fbc->count is the "precise" value of the counter (the deviation
comes close to 0).

So if we need to dynamically change one percpu counter's batch value, we
can call __percpu_counter_sum() to bring the 'deviation' close to 0;
there is no need to send an IPI to all cpus so that they perform the
transfer themselves.

1) vm_committed_as could use a big vm_committed_as_batch when
(sysctl_overcommit_memory == OVERCOMMIT_ALWAYS or OVERCOMMIT_GUESS)

2) We could switch vm_committed_as_batch to an adequate value if/when
sysctl_overcommit_memory is changed to OVERCOMMIT_NEVER (and redo the
batch computation if swap space changes as well, or hugepage
reservations, or the number of online cpus, or ...)
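
A hedged sketch of what 2) might look like (a hypothetical sysctl handler;
it reuses the vm_committed_as_batch idea above and assumes a _sum() that
folds the percpu deltas, so the tighter batch takes effect right away):

	int overcommit_policy_handler(struct ctl_table *table, int write,
				      void __user *buffer, size_t *lenp, loff_t *ppos)
	{
		int ret = proc_dointvec(table, write, buffer, lenp, ppos);

		if (!ret && write) {
			if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
				vm_committed_as_batch = percpu_counter_batch;
			else
				vm_committed_as_batch = max_t(int, percpu_counter_batch,
						totalram_pages / (num_possible_cpus() * 100));
			percpu_counter_sum(&vm_committed_as);	/* fold pending deltas */
		}
		return ret;
	}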

Note: this has a cost, because the percpu_counter fast path would have one
atomic operation instead of none as in the current implementation (a
cmpxchg() on the percpu s32). I'm not sure current cpus would care for
percpu data (no contention).

 include/linux/percpu_counter.h |    7 +---
 lib/percpu_counter.c           |   51 +++++++++++++++----------------
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 46f6ba5..d6b7831 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -16,8 +16,7 @@
 #ifdef CONFIG_SMP
 
 struct percpu_counter {
-	spinlock_t lock;
-	s64 count;
+	atomic64_t count;
 #ifdef CONFIG_HOTPLUG_CPU
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
@@ -60,7 +59,7 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
-	return fbc->count;
+	return atomic64_read(&fbc->count);
 }
 
 /*
@@ -70,7 +69,7 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc)
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	s64 ret = percpu_counter_read(fbc);
 
 	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 28f2c33..745787e 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -59,29 +59,36 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
 	int cpu;
 
-	spin_lock(&fbc->lock);
 	for_each_possible_cpu(cpu) {
 		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
 		*pcount = 0;
 	}
-	fbc->count = amount;
-	spin_unlock(&fbc->lock);
+	atomic64_set(&fbc->count, amount);
 }
 EXPORT_SYMBOL(percpu_counter_set);
 
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
+	s32 *ptr, old;
+
+	if (unlikely(amount >= batch || amount <= -batch)) {
+		atomic64_add(amount, &fbc->count);
+		return;
+	}
 
 	preempt_disable();
-	count = __this_cpu_read(*fbc->counters) + amount;
+	ptr = this_cpu_ptr(fbc->counters);
+retry:
+	old = *ptr;
+	count = old + amount;
 	if (count >= batch || count <= -batch) {
-		spin_lock(&fbc->lock);
-		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
-		spin_unlock(&fbc->lock);
+		if (unlikely(cmpxchg(ptr, old, 0) != old))
+			goto retry;
+		atomic64_add(count, &fbc->count);
 	} else {
-		__this_cpu_write(*fbc->counters, count);
+		if (unlikely(cmpxchg(ptr, old, count) != old))
+			goto retry;
 	}
 	preempt_enable();
 }
@@ -93,26 +100,23 @@ EXPORT_SYMBOL(__percpu_counter_add);
  */
 s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
-	s64 ret;
 	int cpu;
 
-	spin_lock(&fbc->lock);
-	ret = fbc->count;
 	for_each_online_cpu(cpu) {
-		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
-		ret += *pcount;
+		s32 count, *pcount = per_cpu_ptr(fbc->counters, cpu);
+
+		count = xchg(pcount, 0);
+		if (count)
+			atomic64_add(count, &fbc->count);
 	}
-	spin_unlock(&fbc->lock);
-	return ret;
+	return atomic64_read(&fbc->count);
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
 int __percpu_counter_init(struct percpu_counter *fbc, s64 amount,
 			  struct lock_class_key *key)
 {
-	spin_lock_init(&fbc->lock);
-	lockdep_set_class(&fbc->lock, key);
-	fbc->count = amount;
+	atomic64_set(&fbc->count, amount);
 	fbc->counters = alloc_percpu(s32);
 	if (!fbc->counters)
 		return -ENOMEM;
@@ -170,14 +174,11 @@ static int __cpuinit percpu_counter_hotcpu_callback(struct notifier_block *nb,
 	cpu = (unsigned long)hcpu;
 	mutex_lock(&percpu_counters_lock);
 	list_for_each_entry(fbc, &percpu_counters, list) {
-		s32 *pcount;
-		unsigned long flags;
+		s32 count, *pcount;
 
-		spin_lock_irqsave(&fbc->lock, flags);
 		pcount = per_cpu_ptr(fbc->counters, cpu);
-		fbc->count += *pcount;
-		*pcount = 0;
-		spin_unlock_irqrestore(&fbc->lock, flags);
+		count = xchg(pcount, 0);
+		atomic64_add(count, &fbc->count);
 	}
 	mutex_unlock(&percpu_counters_lock);
 #endif



^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  4:56                                                     ` Eric Dumazet
@ 2011-05-17  5:22                                                       ` Shaohua Li
  2011-05-17  9:01                                                         ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Shaohua Li @ 2011-05-17  5:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

On Tue, 2011-05-17 at 12:56 +0800, Eric Dumazet wrote:
> Le mardi 17 mai 2011 à 08:55 +0800, Shaohua Li a écrit :
> 
> > I'm still interesting in improving percpu_counter itself. If we can
> > improve it, why we don't? My patch doesn't slow down anything for all
> > tests. Why didn't you ever look at it?
> > 
> 
> I did and said there were obvious problems in it.
> 
> 1) 4096 cpus : size of one percpu_counter is 16Kbytes.
> 
> After your last patch -> 20 kbytes for no good reason.
I don't know why you said there is no good reason. I posted a lot of
data which shows the improvement, while you just ignore it.

The size issue is completely pointless. If you have 4096 CPUs, how could
you worry about 16k bytes of memory? Especially since the extra memory
makes the API much faster.

> 2) Two separate alloc_percpu() -> two separate cache lines instead of
> one.
They might be in one cache line actually, and it can easily be fixed if not
anyway. On the other hand, even touching two cache lines, it's still faster
than the original spinlock implementation, for which I already posted data.

> But then, if one alloc_percpu() -> 32 kbytes per object.
The size issue is completely pointless.

> 3) Focus on percpu_counter() implementation instead of making an
> analysis of callers.
> 
> I did a lot of rwlocks removal in network stack because they are not the
> right synchronization primitive in many cases. I did not optimize
> rwlocks. If rwlocks were even slower, I suspect other people would have
> help me to convert things faster.
My original issue is mmap, but I already said several times that we can
make percpu_counter better, so why wouldn't we?

> 4) There is a possible way to solve your deviation case : add at _add()
> beginning a short cut for crazy 'amount' values. Its a bit expensive on
> 32bit arches, so might be added in a new helper to let _add() be fast
> for normal and gentle users.

+		if (unlikely(cmpxchg(ptr, old, 0) != old))
> +			goto retry;
This doesn't change anything; you still have the deviation issue here.

> +		atomic64_add(count, &fbc->count);

> if (unlikely(amount >= batch || amount <= -batch)) {
> 	atomic64_add(amount, &fbc->count);
> 	return;
> }
Why handle just this special case? My patch can make the whole path
faster without deviation.

So you didn't actually point out any obvious problem with my patch. This
is good.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  5:22                                                       ` Shaohua Li
@ 2011-05-17  9:01                                                         ` Eric Dumazet
  2011-05-17  9:11                                                           ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-17  9:01 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Tejun Heo, linux-kernel, akpm, cl, npiggin

Le mardi 17 mai 2011 à 13:22 +0800, Shaohua Li a écrit :

> I don't know why you said there is no good reason. I posted a lot of
> data which shows improvement, while you just ignore.
> 

Dear Shaohua, ignoring you would mean I would not even answer and would let
other people do so, when they have time (maybe in 2 or 3 months, maybe
never; just take a look at my previous attempts, two years ago, when
atomic64_t didn't even exist, obviously).

I hope you can see the value I add to your concerns, keeping this subject
alive and even writing code. We all share ideas; we are not fighting.



> The size issue is completely pointless. If you have 4096 CPUs, how could
> you worry about 16k bytes memory. Especially the extra memory makes the
> API much faster.
> 

It is not pointless at all, maybe for Intel guys it is.

I just NACK this idea

> > 2) Two separate alloc_percpu() -> two separate cache lines instead of
> > one.
> Might be in one cache line actually, but can be easily fixed if not
> anyway. On the other hand, even touch two cache lines, it's still faster
> than the original spinlock implementation, which I already posted data.
> 
> > But then, if one alloc_percpu() -> 32 kbytes per object.
> the size issue is completely pointless
> 

That's your opinion.

> > 3) Focus on percpu_counter() implementation instead of making an
> > analysis of callers.
> > 
> > I did a lot of rwlocks removal in network stack because they are not the
> > right synchronization primitive in many cases. I did not optimize
> > rwlocks. If rwlocks were even slower, I suspect other people would have
> > help me to convert things faster.
> My original issue is mmap, but I already declaimed several times we can
> make percpu_counter better, why won't we?
> 

Only if it's a good compromise. Your last patches are not yet good
candidates I'm afraid.

> > 4) There is a possible way to solve your deviation case : add at _add()
> > beginning a short cut for crazy 'amount' values. Its a bit expensive on
> > 32bit arches, so might be added in a new helper to let _add() be fast
> > for normal and gentle users.
> 
> +		if (unlikely(cmpxchg(ptr, old, 0) != old))
> > +			goto retry;
> this doesn't change anything, you still have the deviation issue here
> 

You do understand that 'my last patch' doesn't address the deviation
problem anymore? It's a completely different matter, addressing the
vm_committed_as problem (and maybe other percpu_counters).

That's the thing you prefer not to touch so that your 'results' sound better...

If your percpu_counter is hit so hard that you have many cpus
competing in atomic64_add(count, &fbc->count), the _sum() result is wrong
right after it returns. So _sum() _can_ deviate even if it claims to be
more precise.



> > +		atomic64_add(count, &fbc->count);
> 
> > if (unlikely(amount >= batch || amount <= -batch)) {
> > 	atomic64(amount, &fbc->count);
> > 	return;
> > }
> why we just handle this special case, my patch can make the whole part
> faster without deviation
> 

This 'special case' is the whole problem others pointed out, and it
makes the worst-case deviation the same as before your initial patch.

> so you didn't point out any obvious problem with my patch actually. This
> is good.
> 

This brings nothing. Just say NO to people saying it's needed.

Just because Tejun says there is a deviation "problem" doesn't mean you
need to change lglock and bring lglock into percpu_counter, or double the
percpu_counter size, or whatever other crazy idea.

Just convince him that percpu_counter by itself cannot give a max
deviation guarantee. No percpu_counter user cares at all. If they do,
then the choice of percpu_counter for their implementation is probably wrong.

[ We don't yet provide a percpu_counter_add_return() function ]





^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  9:01                                                         ` Eric Dumazet
@ 2011-05-17  9:11                                                           ` Tejun Heo
  2011-05-17  9:45                                                             ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-17  9:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Hello, Eric, Shaohua.

On Tue, May 17, 2011 at 11:01:01AM +0200, Eric Dumazet wrote:
> Just convince him that percpu_counter by itself cannot bring a max
> deviation guarantee. No percpu_counter user cares at all. If they do,
> then percpu_counter choice for their implementation is probably wrong.
> 
> [ We dont provide yet a percpu_counter_add_return() function ]

I haven't gone through this thread yet (I will do so later today), but
let me clarify the whole deviation thing.

1. I don't care about a reasonable (can't think of a better word at the
   moment) level of deviation.  Under a high level of concurrency, the
   exact value isn't even well defined - nobody can tell in what order
   operations happened anyway.

2. But I _do_ object to _sum() having the possibility of deviating by
   multiples of @batch even with a very low level of activity.

I'm completely fine with #1.  I'm not that crazy, but I don't really
want to take #2 - that makes the whole _sum() interface almost
pointless.  Also, I don't want to add a big honking lglock just to avoid
#2 unless it can be shown that the same effect can't be achieved in a
saner manner, and I'm highly skeptical that would happen.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  9:11                                                           ` Tejun Heo
@ 2011-05-17  9:45                                                             ` Eric Dumazet
  2011-05-17  9:50                                                               ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-17  9:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Le mardi 17 mai 2011 à 11:11 +0200, Tejun Heo a écrit :

> I'm completely fine with #1.  I'm not that crazy but I don't really
> want to take #2 - that makes the whole _sum() interface almost
> pointless.  

Hi Tejun

_sum() is a bit more precise than percpu_counter_read(), but making it
really precise means we have to stop concurrent activity, and we
never did that in the previous/current implementation.

We could add this (as Shaohua and I tried in various patches)
later, if needed, but nowhere in the kernel do we currently need that.

Even /proc/meminfo doesn't call _sum(&vm_committed_as) but the lazy
percpu_counter_read_positive() function...

Really, _sum() gives a good approximation of the counter, more precise
because of the percpu s32 folding, but no guarantee on deviation.




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  9:45                                                             ` Eric Dumazet
@ 2011-05-17  9:50                                                               ` Tejun Heo
  2011-05-17 12:20                                                                 ` Eric Dumazet
  2011-05-18  1:00                                                                 ` Shaohua Li
  0 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2011-05-17  9:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Hello, Eric.

On Tue, May 17, 2011 at 11:45:41AM +0200, Eric Dumazet wrote:
> _sum() is a bit more precise than percpu_counter_read(), but making it
> really precise means we have to stop concurrent activity, and we
> never did that in the previous/current implementation.
> 
> We could add this (as Shaohua and I tried in various patches)
> later, if needed, but nowhere in the kernel do we currently need that.
> 
> Even /proc/meminfo doesn't call _sum(&vm_committed_as) but the lazy
> percpu_counter_read_positive() function...
> 
> Really, _sum() gives a good approximation of the counter, more precise
> because of the percpu s32 folding, but no guarantee on deviation.

I'm not asking to make it more accurate, but the initial patches from
Shaohua made the _sum() result deviate by @batch even when only one
thread is doing _inc(), due to the race window between adding to the
main counter and resetting the local one.  All I'm asking is that we close
that hole, and I'll be completely happy with it.  The lglock does that,
but it's ummm.... not a very nice way to do it.
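
For reference, the window in question is between these two lines of the
atomic64-based _add() earlier in the thread:

	atomic64_add(count, &fbc->count);
		/* a _sum() running in this window sees the updated global
		 * count and the not-yet-cleared percpu s32, i.e. it can
		 * report up to a batch's worth too much for one updater */
	__this_cpu_write(*fbc->counters, 0);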

Please forget about deviations from concurrent activity.  I don't
care and nobody should.  All I'm asking for is removing the possibility of
that unnecessary spike from any single update, and I don't think
that would be too hard.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  9:50                                                               ` Tejun Heo
@ 2011-05-17 12:20                                                                 ` Eric Dumazet
  2011-05-17 12:45                                                                   ` Tejun Heo
  2011-05-18  1:00                                                                 ` Shaohua Li
  1 sibling, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-17 12:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Le mardi 17 mai 2011 à 11:50 +0200, Tejun Heo a écrit :

> I'm not asking to make it more accurate but the initial patches from
> Shaohua made the _sum() result to deviate by @batch even when only one
> thread is doing _inc() due to the race window between adding to the
> main counter and resetting the local one.  All I'm asking is closing
> that hole and I'll be completely happy with it.  The lglock does that
> but it's ummm.... not a very nice way to do it.
> 
> Please forget about deviations from concurrent activities.  I don't
> care and nobody should.  All I'm asking is removing that any update
> having the possibility of that unnecessary spike and I don't think
> that would be too hard.
> 

Spikes are expected and have no effect by design.

The batch value is chosen so that the granularity of the percpu_counter
(batch*num_online_cpus()) is the spike factor, and that's pretty
difficult when the number of cpus is high.

In Shaohua's workload, 'amount' for a 128Mbyte mapping is 32768 pages, while
the batch value is 48. 48*24 = 1152.
So the percpu s32 being in the [-47 .. 47] range would not change the
accuracy of the _sum() function [ if it was eventually called, but it's
not ]
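
Spelling the numbers out (assuming 4KB pages):

	one 128MB mmap()     -> 128 * 1024 / 4  = 32768 pages per _add()
	counter granularity  -> batch * ncpus   = 48 * 24 = 1152 pages (~4.5MB)

so a single such _add() blows straight past the batch and always hits the
global count.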

No drift in the counter is the only thing we care about - and _read() not
being too far away from the _sum() value, in particular if the
percpu_counter is used to check a limit that happens to be low (relative to
the granularity of the percpu_counter : batch*num_online_cpus()).

I claim the extra care is not needed. It might give the reader/user the
false impression that a percpu_counter object can replace a plain
atomic64_t.

For example, I feel vm_committed_as could be a plain atomic_long_t




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 12:20                                                                 ` Eric Dumazet
@ 2011-05-17 12:45                                                                   ` Tejun Heo
  2011-05-17 13:00                                                                     ` Eric Dumazet
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-17 12:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Hello, Eric.

On Tue, May 17, 2011 at 02:20:07PM +0200, Eric Dumazet wrote:
> Spikes are expected and have no effect by design.
> 
> batch value is chosen so that granularity of the percpu_counter
> (batch*num_online_cpus()) is the spike factor, and thats pretty
> difficult when number of cpus is high.
> 
> In Shaohua workload, 'amount' for a 128Mbyte mapping is 32768, while the
> batch value is 48. 48*24 = 1152.
> So the percpu s32 being in [-47 .. 47] range would not change the
> accuracy of the _sum() function [ if it was eventually called, but its
> not ]
> 
> No drift in the counter is the only thing we care - and _read() being
> not too far away from the _sum() value, in particular if the
> percpu_counter is used to check a limit that happens to be low (against
> granularity of the percpu_counter : batch*num_online_cpus()).
> 
> I claim extra care is not needed. This might give the false impression
> to reader/user that percpu_counter object can replace a plain
> atomic64_t.

We already had this discussion.  Sure, we can argue about it again all
day, but I just don't think it's a necessary compromise, and it really
makes _sum() quite dubious.  It's not about strict correctness, it
can't be, but if I spend the overhead to walk all the different percpu
counters, I'd like to get a rather exact number if there's nothing
much going on (free block count, for example).  Also, I want to be able
to use a large @batch if the situation allows for it without worrying
about _sum() accuracy.

Given that _sum() is a super-slow path and we have a lot of latitude
there, this should be possible without resorting to a heavy-handed
approach like lglock.  I was hoping that someone would come up with a
better solution, which doesn't seem to have happened.  Maybe I was
wrong, I don't know.  I'll give it a shot.

But, anyways, here's my position regarding the issue.

* If we're gonna just fix up the slow path, I don't want to make
  _sum() less useful by making its accuracy dependent upon @batch.

* If somebody is interested, it would be worthwhile to see whether we
  can integrate vmstat and percpu counters so that its deviation is
  automatically regulated and we don't have to think about all this
  anymore.

I'll see if I can come up with something.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 12:45                                                                   ` Tejun Heo
@ 2011-05-17 13:00                                                                     ` Eric Dumazet
  2011-05-17 13:04                                                                       ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Dumazet @ 2011-05-17 13:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Le mardi 17 mai 2011 à 14:45 +0200, Tejun Heo a écrit :

> Given that _sum() is super-slow path and we have a lot of latitude
> there, 

Absolutely not. It's a slow path, yes, but not super-slow.

Don't change the rules now ;)

Take a look at include/net/tcp.h and commit ad1af0fedba14f82b
(tcp: Combat per-cpu skew in orphan tests.)

If you guys make percpu_counter much slower, we'll just remove its use
from the network stack, and there will not be a lot of users left.
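
For reference, the pattern that commit uses is roughly the following
(paraphrased from memory, not a verbatim copy): a cheap
percpu_counter_read_positive() check first, falling back to the expensive
_sum() only when the lazy value says we are near the limit:

	static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
	{
		struct percpu_counter *ocp = sk->sk_prot->orphan_count;
		int orphans = percpu_counter_read_positive(ocp);

		if (orphans << shift > sysctl_tcp_max_orphans) {
			orphans = percpu_counter_sum_positive(ocp);
			if (orphans << shift > sysctl_tcp_max_orphans)
				return true;
		}
		return false;
	}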




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 13:00                                                                     ` Eric Dumazet
@ 2011-05-17 13:04                                                                       ` Tejun Heo
  2011-05-17 13:55                                                                         ` Christoph Lameter
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-17 13:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Shaohua Li, linux-kernel, akpm, cl, npiggin

Hello,

On Tue, May 17, 2011 at 03:00:23PM +0200, Eric Dumazet wrote:
> Absolutely not. Its slow path yes, but not super-slow.

I was speaking in relative terms.  We're talking about a local-only fast
path vs. something which hits every percpu counter, likely causing
cache misses in many of them.  It's bound to be multiple orders of
magnitude heavier than the fast path.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 13:04                                                                       ` Tejun Heo
@ 2011-05-17 13:55                                                                         ` Christoph Lameter
  2011-05-17 14:02                                                                           ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Christoph Lameter @ 2011-05-17 13:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Eric Dumazet, Shaohua Li, linux-kernel, akpm, npiggin

On Tue, 17 May 2011, Tejun Heo wrote:

> Hello,
>
> On Tue, May 17, 2011 at 03:00:23PM +0200, Eric Dumazet wrote:
> > Absolutely not. Its slow path yes, but not super-slow.
>
> I was speaking in relative terms.  We're talking about local only fast
> path vs. something which hits every percpu counter likely causing
> cache misses in many of them.  It's bound to be multiple orders of
> magnitude heavier than fast path.

Well, let's just adopt the system that the vm statistics use: bound the
error by time and batch, and allow the user to change the batch if more
accuracy is desired.

The _sum function is optional, and it should be explained that the
result *could* be better, but don't count on it.
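
A minimal sketch of the time-bounded part of that idea (a hypothetical
fold_work delayed work inside struct percpu_counter, armed at init time,
and a _sum() that folds the percpu deltas like the xchg-based version
earlier in the thread):

	static void percpu_counter_fold_fn(struct work_struct *work)
	{
		struct percpu_counter *fbc = container_of(to_delayed_work(work),
						struct percpu_counter, fold_work);

		__percpu_counter_sum(fbc);	/* error now bounded by time and batch */
		schedule_delayed_work(&fbc->fold_work, round_jiffies_relative(HZ));
	}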


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 13:55                                                                         ` Christoph Lameter
@ 2011-05-17 14:02                                                                           ` Tejun Heo
  2011-05-17 14:38                                                                             ` Christoph Lameter
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2011-05-17 14:02 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Eric Dumazet, Shaohua Li, linux-kernel, akpm, npiggin

Hey,

On Tue, May 17, 2011 at 08:55:33AM -0500, Christoph Lameter wrote:
> Well lets just adopt the system that vm statistics use. Bound the error by
> time and batch and allow the user to change the batch if more accuracy is
> desired.

Yeah, that would be the better long term solution.

> The _sum function is optional and should it should be explained that the
> result *could* be better but dont count on it.

We can implement a force-flush for dire situations, similar to the
expedited RCU flush thing.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17 14:02                                                                           ` Tejun Heo
@ 2011-05-17 14:38                                                                             ` Christoph Lameter
  0 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2011-05-17 14:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Eric Dumazet, Shaohua Li, linux-kernel, akpm, npiggin

On Tue, 17 May 2011, Tejun Heo wrote:

> We can implement force-flush for dire situations, similar to expedited
> RCU flush thing.

I would suggest that if the user wants accurate results then another
external form of synchronization needs to be provided. Maybe the counter
is always incremented when another lock has been taken.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [patch V3] percpu_counter: scalability works
  2011-05-17  9:50                                                               ` Tejun Heo
  2011-05-17 12:20                                                                 ` Eric Dumazet
@ 2011-05-18  1:00                                                                 ` Shaohua Li
  1 sibling, 0 replies; 52+ messages in thread
From: Shaohua Li @ 2011-05-18  1:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Eric Dumazet, linux-kernel, akpm, cl, npiggin

On Tue, 2011-05-17 at 17:50 +0800, Tejun Heo wrote:
> Hello, Eric.
> 
> On Tue, May 17, 2011 at 11:45:41AM +0200, Eric Dumazet wrote:
> > _sum() is a bit more precise than percpu_counter_read(), but making it
> > really precise means we have to stop concurrent activity, and we
> > never did that in the previous/current implementation.
> > 
> > We could add this (as Shaohua and I tried in various patches)
> > later, if needed, but nowhere in the kernel do we currently need that.
> > 
> > Even /proc/meminfo doesn't call _sum(&vm_committed_as) but the lazy
> > percpu_counter_read_positive() function...
> > 
> > Really, _sum() gives a good approximation of the counter, more precise
> > because of the percpu s32 folding, but no guarantee on deviation.
> 
> I'm not asking to make it more accurate but the initial patches from
> Shaohua made the _sum() result to deviate by @batch even when only one
> thread is doing _inc() due to the race window between adding to the
> main counter and resetting the local one.  All I'm asking is closing
> that hole and I'll be completely happy with it.  The lglock does that
> but it's ummm.... not a very nice way to do it.
> 
> Please forget about deviations from concurrent activities.  I don't
> care and nobody should.  All I'm asking is removing that any update
> having the possibility of that unnecessary spike and I don't think
> that would be too hard.
Hmm, we are once again talking about the deviation issue. I thought we
agreed the deviation issue should be resolved in the last discussion, but
it seems not...

I would suggest you guys seriously look at my v3 patches, which don't
use lglock but can solve the deviation issue and have no significant
overhead.

Thanks,
Shaohua


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2011-05-18  1:00 UTC | newest]

Thread overview: 52+ messages
2011-05-11  8:10 [patch v2 0/5] percpu_counter: bug fix and enhancement Shaohua Li
2011-05-11  8:10 ` [patch v2 1/5] percpu_counter: fix code for 32bit systems for UP Shaohua Li
2011-05-11  8:10 ` [patch v2 2/5] lglock: convert it to work with dynamically allocated structure Shaohua Li
2011-05-11  8:10 ` [patch v2 3/5] percpu_counter: use lglock to protect percpu data Shaohua Li
2011-05-11  8:10 ` [patch v2 4/5] percpu_counter: use atomic64 for counter in SMP Shaohua Li
2011-05-11  9:34   ` Andrew Morton
2011-05-12  2:40     ` Shaohua Li
2011-05-11  8:10 ` [patch v2 5/5] percpu_counter: preemptless __per_cpu_counter_add Shaohua Li
2011-05-11  9:28 ` [patch v2 0/5] percpu_counter: bug fix and enhancement Tejun Heo
2011-05-12  2:48   ` Shaohua Li
2011-05-12  8:21     ` Tejun Heo
2011-05-12  8:55       ` Shaohua Li
2011-05-12  8:59         ` Tejun Heo
2011-05-12  9:02           ` Eric Dumazet
2011-05-12  9:03             ` Eric Dumazet
2011-05-12  9:05             ` Tejun Heo
2011-05-13  3:09               ` Shaohua Li
2011-05-13  4:37               ` Shaohua Li
2011-05-13  5:20                 ` Eric Dumazet
2011-05-13  5:28                   ` Shaohua Li
2011-05-13  6:34                     ` Eric Dumazet
2011-05-13  7:33                       ` Shaohua Li
2011-05-13 14:51                       ` [patch] percpu_counter: scalability works Eric Dumazet
2011-05-13 15:39                         ` Eric Dumazet
2011-05-13 16:35                           ` [patch V2] " Eric Dumazet
2011-05-13 16:46                             ` Eric Dumazet
2011-05-13 22:03                               ` [patch V3] " Eric Dumazet
2011-05-16  0:58                                 ` Shaohua Li
2011-05-16  6:11                                   ` Eric Dumazet
2011-05-16  6:37                                     ` Shaohua Li
2011-05-16  6:55                                       ` Eric Dumazet
2011-05-16  7:15                                         ` Shaohua Li
2011-05-16  7:44                                           ` Eric Dumazet
2011-05-16  8:34                                             ` Shaohua Li
2011-05-16  9:35                                               ` Eric Dumazet
2011-05-16 14:22                                                 ` Eric Dumazet
2011-05-17  0:55                                                   ` Shaohua Li
2011-05-17  4:56                                                     ` Eric Dumazet
2011-05-17  5:22                                                       ` Shaohua Li
2011-05-17  9:01                                                         ` Eric Dumazet
2011-05-17  9:11                                                           ` Tejun Heo
2011-05-17  9:45                                                             ` Eric Dumazet
2011-05-17  9:50                                                               ` Tejun Heo
2011-05-17 12:20                                                                 ` Eric Dumazet
2011-05-17 12:45                                                                   ` Tejun Heo
2011-05-17 13:00                                                                     ` Eric Dumazet
2011-05-17 13:04                                                                       ` Tejun Heo
2011-05-17 13:55                                                                         ` Christoph Lameter
2011-05-17 14:02                                                                           ` Tejun Heo
2011-05-17 14:38                                                                             ` Christoph Lameter
2011-05-18  1:00                                                                 ` Shaohua Li
2011-05-12 14:38   ` [patch v2 0/5] percpu_counter: bug fix and enhancement Christoph Lameter
