* [PATCH v5 00/12] Lockless update of reference count protected by spinlock
@ 2013-07-05 14:47 Waiman Long
  2013-07-05 14:47 ` [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-05 14:47 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	Thomas Gleixner
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Benjamin Herrenschmidt,
	Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J

v4->v5:
 - Add a d_count() helper for readonly access of reference count and
   change all references to d_count outside of dcache.c, dcache.h
   and namei.c to use d_count().

v3->v4:
 - Replace helper function access to d_lock and d_count by using
   macros to redefine the old d_lock name to the spinlock and the new
   d_refcount name to the reference count. This greatly reduces the
   size of this patchset from 25 patches to 12 and makes it easier
   to review.

v2->v3:
 - Completely revamp the packaging by adding a new lockref data
   structure that combines the spinlock with the reference
   count. Helper functions are also added to manipulate the new data
   structure. That results in modifying over 50 files, but the changes
   were trivial in most of them.
 - Change initial spinlock wait to use a timeout.
 - Force 64-bit alignment of the spinlock & reference count structure.
 - Add a new way to use the combo by using a new union and helper
   functions.

v1->v2:
 - Add one more layer of indirection to LOCK_WITH_REFCOUNT macro.
 - Add __LINUX_SPINLOCK_REFCOUNT_H protection to spinlock_refcount.h.
 - Add some generic get/put macros into spinlock_refcount.h.

This patchset introduces a generic mechanism to atomically update
a reference count that is protected by a spinlock without actually
acquiring the lock itself. If the update doesn't succeed, the caller
will have to acquire the lock and update the reference count the
old way.  This will help in situations where there is a lot of
spinlock contention because of frequent reference count updates.
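
In condensed form, the fast path of a "get" operation looks like the
sketch below (adapted from the lockref_get()/__lockref_add_unless()
code in patch 1; the bounded wait and retry logic are omitted):

	struct lockref old, new;

	old.__lock_count = ACCESS_ONCE(lr->__lock_count);
	new.__lock_count = old.__lock_count;
	new.refcnt++;
	if (!spin_is_locked(&old.lock) &&
	    (cmpxchg64(&lr->__lock_count, old.__lock_count,
		       new.__lock_count) == old.__lock_count))
		return;				/* lockless update done */
	spin_lock(&lr->lock);			/* fall back to the old way */
	lr->refcnt++;
	spin_unlock(&lr->lock);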

The d_lock and d_count fields of struct dentry in dcache.h were
modified to use the new lockref data structure, and the d_lock name
is now a macro for the actual spinlock. The d_count name, however,
cannot be reused as it collides with other uses elsewhere in the
kernel. So a new d_refcount macro is now used for the reference
count, and files outside of dcache.c, dcache.h and namei.c are
modified to use the d_count() helper function.
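
For example, a typical read-only check in a filesystem is converted
as follows (this is the pattern used throughout patches 4-11; the
actual call sites vary):

	-	if (dentry->d_count > 1)
	+	if (d_count(dentry) > 1)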

The various patches were applied one by one to make sure that no
intermediate step breaks the build.

This patch set yields a significant performance improvement in the
short workload of the AIM7 benchmark on an 8-socket x86-64 machine
with 80 cores.

patch 1:	Introduce the new lockref data structure
patch 2:	Enable the x86 architecture to use the feature
patch 3:	Rename d_count references in dcache.c, dcache.h and
		namei.c to d_refcount
patches 4-11:	Convert direct d_count accesses elsewhere to the
		d_count() helper
patch 12:	Change the dentry structure to use the lockref
		structure to improve performance for high contention
		cases

Thanks to Thomas Gleixner, Andi Kleen and Linus for their valuable
input in shaping this patchset.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>

Waiman Long (12):
  spinlock: A new lockref structure for lockless update of refcount
  spinlock: Enable x86 architecture to do lockless refcount update
  dcache: rename d_count field of dentry to d_refcount
  auto-fs: replace direct access of d_count with the d_count() helper
  ceph-fs: replace direct access of d_count with the d_count() helper
  coda-fs: replace direct access of d_count with the d_count() helper
  config-fs: replace direct access of d_count with the d_count() helper
  ecrypt-fs: replace direct access of d_count with the d_count() helper
  file locking: replace direct access of d_count with the d_count()
    helper
  nfs: replace direct access of d_count with the d_count() helper
  nilfs2: replace direct access of d_count with the d_count() helper
  dcache: Enable lockless update of refcount in dentry structure

 arch/x86/Kconfig                         |    3 +
 arch/x86/include/asm/spinlock_refcount.h |    1 +
 fs/autofs4/expire.c                      |    8 +-
 fs/autofs4/root.c                        |    2 +-
 fs/ceph/inode.c                          |    4 +-
 fs/ceph/mds_client.c                     |    2 +-
 fs/coda/dir.c                            |    2 +-
 fs/configfs/dir.c                        |    2 +-
 fs/dcache.c                              |   72 +++++-----
 fs/ecryptfs/inode.c                      |    2 +-
 fs/locks.c                               |    2 +-
 fs/namei.c                               |    6 +-
 fs/nfs/dir.c                             |    6 +-
 fs/nfs/unlink.c                          |    2 +-
 fs/nilfs2/super.c                        |    2 +-
 include/asm-generic/spinlock_refcount.h  |   46 ++++++
 include/linux/dcache.h                   |   31 +++--
 include/linux/spinlock_refcount.h        |  107 ++++++++++++++
 kernel/Kconfig.locks                     |    5 +
 lib/Makefile                             |    2 +
 lib/spinlock_refcount.c                  |  229 ++++++++++++++++++++++++++++++
 21 files changed, 474 insertions(+), 62 deletions(-)
 create mode 100644 arch/x86/include/asm/spinlock_refcount.h
 create mode 100644 include/asm-generic/spinlock_refcount.h
 create mode 100644 include/linux/spinlock_refcount.h
 create mode 100644 lib/spinlock_refcount.c



* [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
@ 2013-07-05 14:47 ` Waiman Long
  2013-07-05 18:59   ` Thomas Gleixner
  2013-07-05 14:47 ` [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-07-05 14:47 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	Thomas Gleixner
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Benjamin Herrenschmidt,
	Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J

This patch introduces a new set of spinlock_refcount.h header files
to be included by kernel code that wants to do a faster lockless
update of a reference count protected by a spinlock.

The new lockref structure consists of just the spinlock and the
reference count data. Helper functions are defined in the new
<linux/spinlock_refcount.h> header file to access the content of
the new structure. There is a generic structure defined for all
architectures, but each architecture can also optionally define its
own structure and use its own helper functions.
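
A minimal usage sketch (the "foo" structure and functions below are
hypothetical and not part of this patch; they only illustrate how the
helpers declared in the header are meant to be called):

	struct foo {
		struct lockref	f_cnt;	/* lock + reference count combo */
	};

	static void foo_get(struct foo *f)
	{
		/* lockless fast path with spinlock fallback inside */
		lockref_get(&f->f_cnt);
	}

	static void foo_put(struct foo *f)
	{
		if (lockref_put_or_locked(&f->f_cnt))
			return;		/* count was > 1 and was decremented */
		/* count is <= 1 and f_cnt.lock is now held by us */
		/* ... do the final teardown here ... */
		spin_unlock(&f->f_cnt.lock);
	}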

Two new config parameters are introduced:
1. SPINLOCK_REFCOUNT
2. ARCH_SPINLOCK_REFCOUNT

The first one is defined in kernel/Kconfig.locks and is used
to enable or disable the faster lockless reference count update
optimization. The second one has to be defined in each
architecture's Kconfig file to enable the optimization for that
architecture. Therefore, each architecture has to opt in to this
optimization or it won't get it. This allows each architecture plenty
of time to test it out before deciding to use it or replace it with
a better architecture-specific solution.
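
Opting in amounts to a one-line Kconfig entry plus an
<asm/spinlock_refcount.h> header in the architecture tree; patch 2
does exactly this for x86 by simply including the generic header:

	config ARCH_SPINLOCK_REFCOUNT
		def_bool y

	/* arch/<arch>/include/asm/spinlock_refcount.h */
	#include <asm-generic/spinlock_refcount.h>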

This optimization won't work for non-SMP systems or when spinlock
debugging is turned on. As a result, it is turned off when any of
them is true. It also won't work with full preempt-RT and so should
be turned off in that case.

The current patch allows 3 levels of access to the new lockref
structure:

1. The lockless update optimization is turned off (SPINLOCK_REFCOUNT=n).
2. The lockless update optimization is turned on and the generic version
   is used (SPINLOCK_REFCOUNT=y and ARCH_SPINLOCK_REFCOUNT=y).
3. The lockless update optimization is turned on and the architecture
   provides its own version.

To maximize the chance of doing a lockless update in the generic
version, the inlined __lockref_add_unless() function will wait for a
bounded amount of time for the lock to be freed before attempting
the update.

The new code also attempts the lockless atomic update twice before
falling back to the old code path of acquiring the lock before doing
the update. This is because a fair amount of contention can remain
when only one attempt is made.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/spinlock_refcount.h |   46 ++++++
 include/linux/spinlock_refcount.h       |  107 ++++++++++++++
 kernel/Kconfig.locks                    |    5 +
 lib/Makefile                            |    2 +
 lib/spinlock_refcount.c                 |  229 +++++++++++++++++++++++++++++++
 5 files changed, 389 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/spinlock_refcount.h
 create mode 100644 include/linux/spinlock_refcount.h
 create mode 100644 lib/spinlock_refcount.c

diff --git a/include/asm-generic/spinlock_refcount.h b/include/asm-generic/spinlock_refcount.h
new file mode 100644
index 0000000..469b96b
--- /dev/null
+++ b/include/asm-generic/spinlock_refcount.h
@@ -0,0 +1,46 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+#define __ASM_GENERIC_SPINLOCK_REFCOUNT_H
+
+/*
+ * The lockref structure defines a combined spinlock with reference count
+ * data structure to be embedded in a larger structure. The combined data
+ * structure is always 8-byte aligned. So proper placement of this structure
+ * in the larger embedding data structure is needed to ensure that there is
+ * no hole in it.
+ */
+struct __aligned(sizeof(u64)) lockref {
+	union {
+		u64		__lock_count;
+		struct {
+			unsigned int	refcnt;	/* Reference count */
+			spinlock_t	lock;
+		};
+	};
+};
+
+/*
+ * Struct lockref helper functions
+ */
+extern void lockref_get(struct lockref *lockcnt);
+extern int  lockref_put(struct lockref *lockcnt);
+extern int  lockref_get_not_zero(struct lockref *lockcnt);
+extern int  lockref_put_or_locked(struct lockref *lockcnt);
+
+#endif /* __ASM_GENERIC_SPINLOCK_REFCOUNT_H */
diff --git a/include/linux/spinlock_refcount.h b/include/linux/spinlock_refcount.h
new file mode 100644
index 0000000..28b0b89
--- /dev/null
+++ b/include/linux/spinlock_refcount.h
@@ -0,0 +1,107 @@
+/*
+ * Spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (c) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __LINUX_SPINLOCK_REFCOUNT_H
+#define __LINUX_SPINLOCK_REFCOUNT_H
+
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_SPINLOCK_REFCOUNT
+#include <asm/spinlock_refcount.h>
+#else
+/*
+ * If the spinlock & reference count optimization feature is disabled,
+ * the spinlock and reference count are accessed separately on its own.
+ */
+struct lockref {
+	unsigned int refcnt;	/* Reference count */
+	spinlock_t   lock;
+};
+
+/*
+ * Struct lockref helper functions
+ */
+/*
+ * lockref_get - increments reference count while not locked
+ * @lockcnt: pointer to lockref structure
+ */
+static __always_inline void
+lockref_get(struct lockref *lockcnt)
+{
+	spin_lock(&lockcnt->lock);
+	lockcnt->refcnt++;
+	spin_unlock(&lockcnt->lock);
+}
+
+/*
+ * lockref_get_not_zero - increments count unless the count is 0
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count is 0
+ */
+static __always_inline int
+lockref_get_not_zero(struct lockref *lockcnt)
+{
+	int retval = 0;
+
+	spin_lock(&lockcnt->lock);
+	if (likely(lockcnt->refcnt)) {
+		lockcnt->refcnt++;
+		retval = 1;
+	}
+	spin_unlock(&lockcnt->lock);
+	return retval;
+}
+
+/*
+ * lockref_put - decrements count unless count <= 1 before decrement
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1
+ */
+static __always_inline int
+lockref_put(struct lockref *lockcnt)
+{
+	int retval = 0;
+
+	spin_lock(&lockcnt->lock);
+	if (likely(lockcnt->refcnt > 1)) {
+		lockcnt->refcnt--;
+		retval = 1;
+	}
+	spin_unlock(&lockcnt->lock);
+	return retval;
+}
+
+/*
+ * lockref_put_or_locked - decrements count unless count <= 1 before decrement
+ *			   otherwise the lock will be taken
+ * @lockcnt: pointer to lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ */
+static __always_inline int
+lockref_put_or_locked(struct lockref *lockcnt)
+{
+	spin_lock(&lockcnt->lock);
+	if (likely(lockcnt->refcnt > 1)) {
+		lockcnt->refcnt--;
+		spin_unlock(&lockcnt->lock);
+		return 1;
+	}
+	return 0;	/* Count is 1 & lock taken */
+}
+
+#endif /* CONFIG_SPINLOCK_REFCOUNT */
+#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 44511d1..d1f8670 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,8 @@ endif
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
 	depends on SMP && !DEBUG_MUTEXES
+
+config SPINLOCK_REFCOUNT
+	def_bool y
+	depends on ARCH_SPINLOCK_REFCOUNT && SMP
+	depends on !GENERIC_LOCKBREAK && !DEBUG_SPINLOCK && !DEBUG_LOCK_ALLOC
diff --git a/lib/Makefile b/lib/Makefile
index c55a037..0367915 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -179,3 +179,5 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c
 
 obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
+
+obj-$(CONFIG_SPINLOCK_REFCOUNT) += spinlock_refcount.o
diff --git a/lib/spinlock_refcount.c b/lib/spinlock_refcount.c
new file mode 100644
index 0000000..1f599a7
--- /dev/null
+++ b/lib/spinlock_refcount.c
@@ -0,0 +1,229 @@
+/*
+ * Generic spinlock with reference count combo
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+
+#include <linux/spinlock.h>
+#include <asm-generic/spinlock_refcount.h>
+
+#ifdef _SHOW_WAIT_LOOP_TIME
+#include <linux/time.h>
+#endif
+
+/*
+ * The maximum number of waiting spins when the lock was acquiring before
+ * trying to attempt a lockless update. The purpose of the timeout is to
+ * limit the amount of unfairness to the thread that is doing the reference
+ * count update. Otherwise, it is theoretically possible for that thread to
+ * be starved for a really long time causing all kind of problems. If it
+ * times out and the lock is still not free, the code will fall back to the
+ * traditional way of queuing up to acquire the lock before updating the count.
+ *
+ * The actual time spent in the wait loop very much depends on the CPU being
+ * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
+ * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
+ * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
+ * about 72us before the count reaches 0. With cacheline contention, the
+ * wait time can go up to 3x as much (about 210us). Having multiple
+ * cpu_relax's in the wait loop does seem to reduce cacheline contention
+ * somewhat and give slightly better performance.
+ *
+ * The preset timeout value is rather arbitrary and really depends on the CPU
+ * being used. If customization is needed, we could add a config variable for
+ * that. The exact value is not that important. A longer wait time will
+ * increase the chance of doing a lockless update, but it results in more
+ * unfairness to the waiting thread and vice versa.
+ */
+#ifndef CONFIG_LOCKREF_WAIT_SHIFT
+#define CONFIG_LOCKREF_WAIT_SHIFT	12
+#endif
+#define	LOCKREF_SPIN_WAIT_MAX	(1<<CONFIG_LOCKREF_WAIT_SHIFT)
+
+/**
+ *
+ * __lockref_add_unless - atomically add the given value to the count unless
+ *			  the lock was acquired or the count equals to the
+ *			  given threshold value.
+ *
+ * @plockcnt : pointer to the combined lock and count 8-byte data
+ * @plock    : pointer to the spinlock
+ * @pcount   : pointer to the reference count
+ * @value    : value to be added
+ * @threshold: threshold value for acquiring the lock
+ * Return    : 1 if operation succeeds, 0 otherwise
+ *
+ * If the lock was not acquired, __lockref_add_unless() atomically adds the
+ * given value to the reference count unless the given threshold is reached.
+ * If the lock was acquired or the threshold was reached, 0 is returned and
+ * the caller will have to acquire the lock and update the count accordingly
+ * (can be done in a non-atomic way).
+ *
+ * N.B. The lockdep code won't be run as this optimization should be disabled
+ *	when LOCK_STAT is enabled.
+ */
+static __always_inline int
+__lockref_add_unless(u64 *plockcnt, spinlock_t *plock, unsigned int *pcount,
+		     int value, int threshold)
+{
+	struct lockref old, new;
+	int   get_lock, loopcnt;
+#ifdef _SHOW_WAIT_LOOP_TIME
+	struct timespec tv1, tv2;
+#endif
+
+	/*
+	 * Code doesn't work if raw spinlock is larger than 4 bytes
+	 * or is empty.
+	 */
+	BUILD_BUG_ON(sizeof(arch_spinlock_t) == 0);
+	if (sizeof(arch_spinlock_t) > 4)
+		return 0;	/* Need to acquire the lock explicitly */
+
+	/*
+	 * Wait a certain amount of time until the lock is free or the loop
+	 * counter reaches 0. Doing multiple cpu_relax() helps to reduce
+	 * contention in the spinlock cacheline and achieve better performance.
+	 */
+#ifdef _SHOW_WAIT_LOOP_TIME
+	getnstimeofday(&tv1);
+#endif
+	loopcnt = LOCKREF_SPIN_WAIT_MAX;
+	while (loopcnt && spin_is_locked(plock)) {
+		loopcnt--;
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+		cpu_relax();
+	}
+#ifdef _SHOW_WAIT_LOOP_TIME
+	if (loopcnt == 0) {
+		unsigned long ns;
+
+		getnstimeofday(&tv2);
+		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
+		     (tv2.tv_nsec - tv1.tv_nsec);
+		pr_info("lockref wait loop time = %lu ns\n", ns);
+	}
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+	if (loopcnt != LOCKREF_SPIN_WAIT_MAX)
+		lock_contended(plock->dep_map, _RET_IP_);
+#endif
+	old.__lock_count = ACCESS_ONCE(*plockcnt);
+	get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
+	if (likely(!get_lock && !spin_is_locked(&old.lock))) {
+		new.__lock_count = old.__lock_count;
+		new.refcnt += value;
+		if (cmpxchg64(plockcnt, old.__lock_count, new.__lock_count)
+		    == old.__lock_count)
+			return 1;
+		cpu_relax();
+		cpu_relax();
+		/*
+		 * Try one more time
+		 */
+		old.__lock_count = ACCESS_ONCE(*plockcnt);
+		get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
+		if (likely(!get_lock && !spin_is_locked(&old.lock))) {
+			new.__lock_count = old.__lock_count;
+			new.refcnt += value;
+			if (cmpxchg64(plockcnt, old.__lock_count,
+				      new.__lock_count) == old.__lock_count)
+				return 1;
+			cpu_relax();
+		}
+	}
+	return 0;	/* The caller will need to acquire the lock */
+}
+
+/*
+ * Generic macros to increment and decrement reference count
+ * @sname : pointer to lockref structure
+ * @lock  : name of spinlock in structure
+ * @count : name of reference count in structure
+ *
+ * LOCKREF_GET()  - increments count unless it is locked
+ * LOCKREF_GET0() - increments count unless it is locked or count is 0
+ * LOCKREF_PUT()  - decrements count unless it is locked or count is 1
+ */
+#define LOCKREF_GET(sname, lock, count)					\
+	__lockref_add_unless(&(sname)->__lock_count, &(sname)->lock,	\
+			     &(sname)->count, 1, -1)
+#define LOCKREF_GET0(sname, lock, count)				\
+	__lockref_add_unless(&(sname)->__lock_count, &(sname)->lock,	\
+			     &(sname)->count, 1, 0)
+#define LOCKREF_PUT(sname, lock, count)					\
+	__lockref_add_unless(&(sname)->__lock_count, &(sname)->lock,	\
+			     &(sname)->count, -1, 1)
+
+/*
+ * Struct lockref helper functions
+ */
+/*
+ * lockref_get - increments reference count
+ * @lockcnt: pointer to struct lockref structure
+ */
+void
+lockref_get(struct lockref *lockcnt)
+{
+	if (LOCKREF_GET(lockcnt, lock, refcnt))
+		return;
+	spin_lock(&lockcnt->lock);
+	lockcnt->refcnt++;
+	spin_unlock(&lockcnt->lock);
+}
+EXPORT_SYMBOL(lockref_get);
+
+/*
+ * lockref_get_not_zero - increments count unless the count is 0
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count is 0 and lock taken
+ */
+int
+lockref_get_not_zero(struct lockref *lockcnt)
+{
+	return LOCKREF_GET0(lockcnt, lock, refcnt);
+}
+EXPORT_SYMBOL(lockref_get_not_zero);
+
+/*
+ * lockref_put - decrements count unless the count <= 1
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1
+ */
+int
+lockref_put(struct lockref *lockcnt)
+{
+	return LOCKREF_PUT(lockcnt, lock, refcnt);
+}
+EXPORT_SYMBOL(lockref_put);
+
+/*
+ * lockref_put_or_locked - decrements count unless the count is <= 1
+ *			   otherwise, the lock will be taken
+ * @lockcnt: pointer to struct lockref structure
+ * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
+ */
+int
+lockref_put_or_locked(struct lockref *lockcnt)
+{
+	if (LOCKREF_PUT(lockcnt, lock, refcnt))
+		return 1;
+	spin_lock(&lockcnt->lock);
+	return 0;
+}
+EXPORT_SYMBOL(lockref_put_or_locked);
-- 
1.7.1



* [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update
  2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
  2013-07-05 14:47 ` [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount Waiman Long
@ 2013-07-05 14:47 ` Waiman Long
  2013-07-05 20:19   ` Thomas Gleixner
  2013-07-05 14:47 ` [PATCH v5 03/12] dcache: rename d_count field of dentry to d_refcount Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-07-05 14:47 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	Thomas Gleixner
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Benjamin Herrenschmidt,
	Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J

There are two steps to enable an architecture to do lockless
reference count updates:
1. Define the ARCH_SPINLOCK_REFCOUNT config parameter in its Kconfig
   file.
2. Add an architecture-specific <asm/spinlock_refcount.h> header file.

This is done here for the x86 architecture, which simply uses the
generic version.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                         |    3 +++
 arch/x86/include/asm/spinlock_refcount.h |    1 +
 2 files changed, 4 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/spinlock_refcount.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fe120da..649ed4b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -253,6 +253,9 @@ config ARCH_CPU_PROBE_RELEASE
 config ARCH_SUPPORTS_UPROBES
 	def_bool y
 
+config ARCH_SPINLOCK_REFCOUNT
+	def_bool y
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
diff --git a/arch/x86/include/asm/spinlock_refcount.h b/arch/x86/include/asm/spinlock_refcount.h
new file mode 100644
index 0000000..ab6224f
--- /dev/null
+++ b/arch/x86/include/asm/spinlock_refcount.h
@@ -0,0 +1 @@
+#include <asm-generic/spinlock_refcount.h>
-- 
1.7.1



* [PATCH v5 03/12] dcache: rename d_count field of dentry to d_refcount
  2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
  2013-07-05 14:47 ` [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount Waiman Long
  2013-07-05 14:47 ` [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update Waiman Long
@ 2013-07-05 14:47 ` Waiman Long
  2013-07-05 15:02 ` [PATCH v5 00/12] Lockless update of reference count protected by spinlock Al Viro
  2013-07-05 20:33 ` Thomas Gleixner
  4 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-05 14:47 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	Thomas Gleixner
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Benjamin Herrenschmidt,
	Andi Kleen, Chandramouleeswaran, Aswin, Norton, Scott J

Before converting the d_lock and d_count fields of the dentry data
structure to the new lockref structure, we need to consider the
implications of such a change. All current references to d_count and
d_lock have to be changed accordingly.

One way to minimize the changes is to redefine the original field
names as macros for the new names.  For d_lock, it is possible to do
so, saving a lot of changes, as this name is not used anywhere else
in the kernel. For d_count, this is not possible as the name is used
elsewhere in the kernel for different things.

The dcache.c and namei.c files need to modify the reference count
value.  They are changed to use a new reference count name,
d_refcount, which is unique in the kernel source code. To avoid
breaking the build in other parts of the kernel, d_refcount is
currently defined as an alias of d_count. For other parts of the
kernel that only need read-only access to the reference count, a new
helper function d_count() is now provided.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 fs/dcache.c            |   54 ++++++++++++++++++++++++------------------------
 fs/namei.c             |    6 ++--
 include/linux/dcache.h |   18 ++++++++++++++-
 3 files changed, 46 insertions(+), 32 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f09b908..61d2f11 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -54,7 +54,7 @@
  *   - d_flags
  *   - d_name
  *   - d_lru
- *   - d_count
+ *   - d_refcount
  *   - d_unhashed()
  *   - d_parent and d_subdirs
  *   - childrens' d_child and d_parent
@@ -229,7 +229,7 @@ static void __d_free(struct rcu_head *head)
  */
 static void d_free(struct dentry *dentry)
 {
-	BUG_ON(dentry->d_count);
+	BUG_ON(dentry->d_refcount);
 	this_cpu_dec(nr_dentry);
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
@@ -467,7 +467,7 @@ relock:
 	}
 
 	if (ref)
-		dentry->d_count--;
+		dentry->d_refcount--;
 	/*
 	 * inform the fs via d_prune that this dentry is about to be
 	 * unhashed and destroyed.
@@ -513,12 +513,12 @@ void dput(struct dentry *dentry)
 		return;
 
 repeat:
-	if (dentry->d_count == 1)
+	if (dentry->d_refcount == 1)
 		might_sleep();
 	spin_lock(&dentry->d_lock);
-	BUG_ON(!dentry->d_count);
-	if (dentry->d_count > 1) {
-		dentry->d_count--;
+	BUG_ON(!dentry->d_refcount);
+	if (dentry->d_refcount > 1) {
+		dentry->d_refcount--;
 		spin_unlock(&dentry->d_lock);
 		return;
 	}
@@ -535,7 +535,7 @@ repeat:
 	dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
-	dentry->d_count--;
+	dentry->d_refcount--;
 	spin_unlock(&dentry->d_lock);
 	return;
 
@@ -590,7 +590,7 @@ int d_invalidate(struct dentry * dentry)
 	 * We also need to leave mountpoints alone,
 	 * directory or not.
 	 */
-	if (dentry->d_count > 1 && dentry->d_inode) {
+	if (dentry->d_refcount > 1 && dentry->d_inode) {
 		if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
 			spin_unlock(&dentry->d_lock);
 			return -EBUSY;
@@ -606,7 +606,7 @@ EXPORT_SYMBOL(d_invalidate);
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
 {
-	dentry->d_count++;
+	dentry->d_refcount++;
 }
 
 static inline void __dget(struct dentry *dentry)
@@ -634,8 +634,8 @@ repeat:
 		goto repeat;
 	}
 	rcu_read_unlock();
-	BUG_ON(!ret->d_count);
-	ret->d_count++;
+	BUG_ON(!ret->d_refcount);
+	ret->d_refcount++;
 	spin_unlock(&ret->d_lock);
 	return ret;
 }
@@ -718,7 +718,7 @@ restart:
 	spin_lock(&inode->i_lock);
 	hlist_for_each_entry(dentry, &inode->i_dentry, d_alias) {
 		spin_lock(&dentry->d_lock);
-		if (!dentry->d_count) {
+		if (!dentry->d_refcount) {
 			__dget_dlock(dentry);
 			__d_drop(dentry);
 			spin_unlock(&dentry->d_lock);
@@ -734,7 +734,7 @@ EXPORT_SYMBOL(d_prune_aliases);
 
 /*
  * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
+ * Requires dentry->d_lock is held, and dentry->d_refcount == 0.
  * Releases dentry->d_lock.
  *
  * This may fail if locks cannot be acquired no problem, just try again.
@@ -764,8 +764,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
 	dentry = parent;
 	while (dentry) {
 		spin_lock(&dentry->d_lock);
-		if (dentry->d_count > 1) {
-			dentry->d_count--;
+		if (dentry->d_refcount > 1) {
+			dentry->d_refcount--;
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -793,7 +793,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 * the LRU because of laziness during lookup.  Do not free
 		 * it - just keep it off the LRU list.
 		 */
-		if (dentry->d_count) {
+		if (dentry->d_refcount) {
 			dentry_lru_del(dentry);
 			spin_unlock(&dentry->d_lock);
 			continue;
@@ -913,7 +913,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 			dentry_lru_del(dentry);
 			__d_shrink(dentry);
 
-			if (dentry->d_count != 0) {
+			if (dentry->d_refcount != 0) {
 				printk(KERN_ERR
 				       "BUG: Dentry %p{i=%lx,n=%s}"
 				       " still in use (%d)"
@@ -922,7 +922,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				       dentry->d_inode ?
 				       dentry->d_inode->i_ino : 0UL,
 				       dentry->d_name.name,
-				       dentry->d_count,
+				       dentry->d_refcount,
 				       dentry->d_sb->s_type->name,
 				       dentry->d_sb->s_id);
 				BUG();
@@ -933,7 +933,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 				list_del(&dentry->d_u.d_child);
 			} else {
 				parent = dentry->d_parent;
-				parent->d_count--;
+				parent->d_refcount--;
 				list_del(&dentry->d_u.d_child);
 			}
 
@@ -981,7 +981,7 @@ void shrink_dcache_for_umount(struct super_block *sb)
 
 	dentry = sb->s_root;
 	sb->s_root = NULL;
-	dentry->d_count--;
+	dentry->d_refcount--;
 	shrink_dcache_for_umount_subtree(dentry);
 
 	while (!hlist_bl_empty(&sb->s_anon)) {
@@ -1147,7 +1147,7 @@ resume:
 		 * loop in shrink_dcache_parent() might not make any progress
 		 * and loop forever.
 		 */
-		if (dentry->d_count) {
+		if (dentry->d_refcount) {
 			dentry_lru_del(dentry);
 		} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
 			dentry_lru_move_list(dentry, dispose);
@@ -1269,7 +1269,7 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	smp_wmb();
 	dentry->d_name.name = dname;
 
-	dentry->d_count = 1;
+	dentry->d_refcount = 1;
 	dentry->d_flags = 0;
 	spin_lock_init(&dentry->d_lock);
 	seqcount_init(&dentry->d_seq);
@@ -1970,7 +1970,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
 				goto next;
 		}
 
-		dentry->d_count++;
+		dentry->d_refcount++;
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
@@ -2069,7 +2069,7 @@ again:
 	spin_lock(&dentry->d_lock);
 	inode = dentry->d_inode;
 	isdir = S_ISDIR(inode->i_mode);
-	if (dentry->d_count == 1) {
+	if (dentry->d_refcount == 1) {
 		if (!spin_trylock(&inode->i_lock)) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
@@ -2937,7 +2937,7 @@ resume:
 		}
 		if (!(dentry->d_flags & DCACHE_GENOCIDE)) {
 			dentry->d_flags |= DCACHE_GENOCIDE;
-			dentry->d_count--;
+			dentry->d_refcount--;
 		}
 		spin_unlock(&dentry->d_lock);
 	}
@@ -2945,7 +2945,7 @@ resume:
 		struct dentry *child = this_parent;
 		if (!(this_parent->d_flags & DCACHE_GENOCIDE)) {
 			this_parent->d_flags |= DCACHE_GENOCIDE;
-			this_parent->d_count--;
+			this_parent->d_refcount--;
 		}
 		this_parent = try_to_ascend(this_parent, locked, seq);
 		if (!this_parent)
diff --git a/fs/namei.c b/fs/namei.c
index 9ed9361..3a59198 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -536,8 +536,8 @@ static int unlazy_walk(struct nameidata *nd, struct dentry *dentry)
 		 * a reference at this point.
 		 */
 		BUG_ON(!IS_ROOT(dentry) && dentry->d_parent != parent);
-		BUG_ON(!parent->d_count);
-		parent->d_count++;
+		BUG_ON(!parent->d_refcount);
+		parent->d_refcount++;
 		spin_unlock(&dentry->d_lock);
 	}
 	spin_unlock(&parent->d_lock);
@@ -3280,7 +3280,7 @@ void dentry_unhash(struct dentry *dentry)
 {
 	shrink_dcache_parent(dentry);
 	spin_lock(&dentry->d_lock);
-	if (dentry->d_count == 1)
+	if (dentry->d_refcount == 1)
 		__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 1a6bb81..c07198d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -130,6 +130,10 @@ struct dentry {
 	struct list_head d_subdirs;	/* our children */
 	struct hlist_node d_alias;	/* inode alias list */
 };
+/*
+ * Define d_refcount as an alias to d_count.
+ */
+#define d_refcount	d_count
 
 /*
  * dentry->d_lock spinlock nesting subclasses:
@@ -258,6 +262,16 @@ extern int have_submounts(struct dentry *);
 extern void d_rehash(struct dentry *);
 
 /**
+ * d_count - return the reference count in dentry
+ * @entry: dentry pointer
+ * Returns: current value of reference count
+ */
+static inline unsigned int d_count(struct dentry *entry)
+{
+	return entry->d_refcount;
+}
+
+/**
  * d_add - add dentry to hash queues
  * @entry: dentry to add
  * @inode: The inode to attach to this dentry
@@ -319,7 +333,7 @@ static inline int __d_rcu_to_refcount(struct dentry *dentry, unsigned seq)
 	assert_spin_locked(&dentry->d_lock);
 	if (!read_seqcount_retry(&dentry->d_seq, seq)) {
 		ret = 1;
-		dentry->d_count++;
+		dentry->d_refcount++;
 	}
 
 	return ret;
@@ -352,7 +366,7 @@ extern char *dentry_path(struct dentry *, char *, int);
 static inline struct dentry *dget_dlock(struct dentry *dentry)
 {
 	if (dentry)
-		dentry->d_count++;
+		dentry->d_refcount++;
 	return dentry;
 }
 
-- 
1.7.1



* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
                   ` (2 preceding siblings ...)
  2013-07-05 14:47 ` [PATCH v5 03/12] dcache: rename d_count field of dentry to d_refcount Waiman Long
@ 2013-07-05 15:02 ` Al Viro
  2013-07-05 15:29   ` Waiman Long
  2013-07-05 20:33 ` Thomas Gleixner
  4 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2013-07-05 15:02 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, Jul 05, 2013 at 10:47:33AM -0400, Waiman Long wrote:
> v4->v5:
>  - Add a d_count() helper for readonly access of reference count and
>    change all references to d_count outside of dcache.c, dcache.h
>    and namei.c to use d_count().

Sigh...  You are breaking bisectability again.  This helper should
be introduced *first*, along with conversion to it.  Then you can
do the rest on top of that.

I've just pushed such commit into vfs.git#for-linus; please, do the rest
on top of it.  And keep it bisectable, i.e. so that at any intermediate
point the tree would build and work.


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 15:02 ` [PATCH v5 00/12] Lockless update of reference count protected by spinlock Al Viro
@ 2013-07-05 15:29   ` Waiman Long
  2013-07-05 15:31     ` Waiman Long
  2013-07-05 17:54     ` Al Viro
  0 siblings, 2 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-05 15:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/05/2013 11:02 AM, Al Viro wrote:
> On Fri, Jul 05, 2013 at 10:47:33AM -0400, Waiman Long wrote:
>> v4->v5:
>>   - Add a d_count() helper for readonly access of reference count and
>>     change all references to d_count outside of dcache.c, dcache.h
>>     and namei.c to use d_count().
> Sigh...  You are breaking bisectability again.  This helper should
> be introduced *first*, along with conversion to it.  Then you can
> do the rest on top of that.
>
> I've just pushed such commit into vfs.git#for-linus; please, do the rest
> on top of it.  And keep it bisectable, i.e. so that at any intermediate
> point the tree would build and work.

I am sorry. I didn't change anything in the dentry structure in patch 3. 
So putting patches 4-11 on top of it won't break the build.

Regards,
Longman


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 15:29   ` Waiman Long
@ 2013-07-05 15:31     ` Waiman Long
  2013-07-05 17:54     ` Al Viro
  1 sibling, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-05 15:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/05/2013 11:29 AM, Waiman Long wrote:
> On 07/05/2013 11:02 AM, Al Viro wrote:
>> On Fri, Jul 05, 2013 at 10:47:33AM -0400, Waiman Long wrote:
>>> v4->v5:
>>>   - Add a d_count() helper for readonly access of reference count and
>>>     change all references to d_count outside of dcache.c, dcache.h
>>>     and namei.c to use d_count().
>> Sigh...  You are breaking bisectability again.  This helper should
>> be introduced *first*, along with conversion to it.  Then you can
>> do the rest on top of that.
>>
>> I've just pushed such commit into vfs.git#for-linus; please, do the rest
>> on top of it.  And keep it bisectable, i.e. so that at any intermediate
>> point the tree would build and work.
>
> I am sorry. I didn't change anything in the dentry structure in patch 
> 3. So putting patches 4-11 on top of it won't break the build.

I explicitly applied the patches one at a time to make sure that there 
was no broken build.

Regards,
Longman


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 15:29   ` Waiman Long
  2013-07-05 15:31     ` Waiman Long
@ 2013-07-05 17:54     ` Al Viro
  2013-07-05 18:56       ` Waiman Long
  1 sibling, 1 reply; 18+ messages in thread
From: Al Viro @ 2013-07-05 17:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, Jul 05, 2013 at 11:29:10AM -0400, Waiman Long wrote:
> >I've just pushed such commit into vfs.git#for-linus; please, do the rest
> >on top of it.  And keep it bisectable, i.e. so that at any intermediate
> >point the tree would build and work.
> 
> I am sorry. I didn't change anything in the dentry structure in
> patch 3. So putting patches 4-11 on top of it won't break the build.

*gyah*...  I'd missed the !@#!# macro you've added there.  Could you
explain the reasons for using it at all?  Not to mention anything
else, you've missed
#  define d_refcount(d)                 ((d)->d_count)
in Lustre.  What's the point of your macro (d_refcount -> d_count), anyway?  
All references outside of fs/namei.c, fs/dcache.c, include/linux/dcache.h
should be via d_count(dentry) anyway...


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 17:54     ` Al Viro
@ 2013-07-05 18:56       ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-05 18:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/05/2013 01:54 PM, Al Viro wrote:
> On Fri, Jul 05, 2013 at 11:29:10AM -0400, Waiman Long wrote:
>>> I've just pushed such commit into vfs.git#for-linus; please, do the rest
>>> on top of it.  And keep it bisectable, i.e. so that at any intermediate
>>> point the tree would build and work.
>> I am sorry. I didn't change anything in the dentry structure in
>> patch 3. So putting patches 4-11 on top of it won't break the build.
> *gyah*...  I'd missed the !@#!# macro you've added there.  Could you
> explain the reasons for using it at all?  Not to mention anything
> else, you've missed
> #  define d_refcount(d)                 ((d)->d_count)
> in Lustre.  What's the point of your macro (d_refcount ->  d_count), anyway?
> All references outside of fs/namei.c, fs/dcache.c, include/linux/dcache.h
> should be via d_count(dentry) anyway...

I am sorry, I am still using the 3.10 bits that I pulled in last 
week as the basis for my patchset. I want to get my changes 
stabilized before looking at the latest bits, so I missed the 3.11 
bits that were merged this week, including the Lustre change that you 
mentioned. I will pull in the latest bits and rebase my patch on top 
of them. I guess I also need to make changes to that d_refcount() 
macro as well as the code that references it. Are you aware of other 
upcoming patches that may conflict with my patch?

The d_refcount macro in patch 3 is there to make the name change 
first so that I don't need to do it in the last patch. This makes the 
last patch easier to review by moving those mechanical name changes 
out of it.

Regards,
Longman


* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-05 14:47 ` [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount Waiman Long
@ 2013-07-05 18:59   ` Thomas Gleixner
  2013-07-08 14:21     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2013-07-05 18:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, 5 Jul 2013, Waiman Long wrote:
> + * If the spinlock & reference count optimization feature is disabled,
> + * the spinlock and reference count are accessed separately on its own.
> + */
> +struct lockref {
> +	unsigned int refcnt;	/* Reference count */
> +	spinlock_t   lock;
> +};
> +
> +/*
> + * Struct lockref helper functions
> + */
> +/*

Function documentation starts with /**

> + * lockref_get - increments reference count while not locked

This should be: Increments reference count unconditionally.

> + * @lockcnt: pointer to lockref structure
> + */
> +static __always_inline void
> +lockref_get(struct lockref *lockcnt)

Please avoid these line breaks if the line fits in 80 chars.

> +{
> +	spin_lock(&lockcnt->lock);
> +	lockcnt->refcnt++;
> +	spin_unlock(&lockcnt->lock);
> +}

> +/*
> + * lockref_put_or_locked - decrements count unless count <= 1 before decrement
> + *			   otherwise the lock will be taken

  lockref_put_or_lock please

Docbook cannot work with multiline comments for the function name.

So make that shorter and add a longer explanation below the @argument
docs.

> + * @lockcnt: pointer to lockref structure
> + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
> + */
> +static __always_inline int
> +lockref_put_or_locked(struct lockref *lockcnt)
> +{
> +	spin_lock(&lockcnt->lock);
> +	if (likely(lockcnt->refcnt > 1)) {
> +		lockcnt->refcnt--;
> +		spin_unlock(&lockcnt->lock);
> +		return 1;
> +	}
> +	return 0;	/* Count is 1 & lock taken */

Please no tail comments. They are horrible to parse and it's obvious
from the code.

> +}
> +
> +#endif /* CONFIG_SPINLOCK_REFCOUNT */
> +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
> index 44511d1..d1f8670 100644
> --- a/kernel/Kconfig.locks
> +++ b/kernel/Kconfig.locks
> @@ -223,3 +223,8 @@ endif
>  config MUTEX_SPIN_ON_OWNER
>  	def_bool y
>  	depends on SMP && !DEBUG_MUTEXES
> +
> +config SPINLOCK_REFCOUNT
> +	def_bool y
> +	depends on ARCH_SPINLOCK_REFCOUNT && SMP

This looks wrong. We want three options:

1) Always take the lock
2) Use the generic implementation
3) Use an architecture specific implementation

So we want

   config GENERIC_SPINLOCK_REFCOUNT
   	  bool

   config ARCH_SPINLOCK_REFCOUNT
   	  bool

   config SPINLOCK_REFCOUNT
   	  def_bool y
	  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
	  depends on .....
   	  	  
So an architectire can select GENERIC_SPINLOCK_REFCOUNT to get the
generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
own special version.

And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT

> +/*
> + * The maximum number of waiting spins when the lock was acquiring before
> + * trying to attempt a lockless update. The purpose of the timeout is to
> + * limit the amount of unfairness to the thread that is doing the reference
> + * count update. Otherwise, it is theoretically possible for that thread to
> + * be starved for a really long time causing all kind of problems. If it
> + * times out and the lock is still not free, the code will fall back to the
> + * traditional way of queuing up to acquire the lock before updating the count.
> + *
> + * The actual time spent in the wait loop very much depends on the CPU being
> + * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
> + * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
> + * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
> + * about 72us before the count reaches 0. With cacheline contention, the
> + * wait time can go up to 3x as much (about 210us). Having multiple
> + * cpu_relax's in the wait loop does seem to reduce cacheline contention
> + * somewhat and give slightly better performance.
> + *
> + * The preset timeout value is rather arbitrary and really depends on the CPU
> + * being used. If customization is needed, we could add a config variable for
> + * that. The exact value is not that important. A longer wait time will
> + * increase the chance of doing a lockless update, but it results in more
> + * unfairness to the waiting thread and vice versa.

As I told you before it's a horrible idea.

This depends completely on the CPU, the cache implementation and
whatever. These heuristics can never be made correct. They are prone
to fail and in the worst case you end up with a regression on some
systems.

Let's start with a simple version because it IS simple and easy
to analyze and debug and then incrementally build improvements on it
instead of creating a heuristics monster in the first place, i.e.:

   if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
      return 0;
   return 1;

Take numbers for this on a zoo of different machines: Intel and AMD,
old and new.

Then you can add an incremental patch on that, which add loops and
hoops. Along with numbers on the same zoo of machines.

> + */
> +#ifndef CONFIG_LOCKREF_WAIT_SHIFT

No, we don't want this to be a config option. Either you can determine
it at runtime with a proper test or you just avoid the whole loop
business.

> +#define CONFIG_LOCKREF_WAIT_SHIFT	12
> +#endif
> +#define	LOCKREF_SPIN_WAIT_MAX	(1<<CONFIG_LOCKREF_WAIT_SHIFT)
> +
> +/**
> + *
> + * __lockref_add_unless - atomically add the given value to the count unless
> + *			  the lock was acquired or the count equals to the
> + *			  given threshold value.

Short one-line descriptions please.

> + * @plockcnt : pointer to the combined lock and count 8-byte data
> + * @plock    : pointer to the spinlock
> + * @pcount   : pointer to the reference count
> + * @value    : value to be added
> + * @threshold: threshold value for acquiring the lock
> + * Return    : 1 if operation succeeds, 0 otherwise
> + *
> + * If the lock was not acquired, __lockref_add_unless() atomically adds the
> + * given value to the reference count unless the given threshold is reached.
> + * If the lock was acquired or the threshold was reached, 0 is returned and
> + * the caller will have to acquire the lock and update the count accordingly
> + * (can be done in a non-atomic way).
> + *
> + * N.B. The lockdep code won't be run as this optimization should be disabled
> + *	when LOCK_STAT is enabled.

-ENOPARSE

> + */
> +static __always_inline int
> +__lockref_add_unless(u64 *plockcnt, spinlock_t *plock, unsigned int *pcount,
> +		     int value, int threshold)
> +{
> +	struct lockref old, new;
> +	int   get_lock, loopcnt;
> +#ifdef _SHOW_WAIT_LOOP_TIME
> +	struct timespec tv1, tv2;
> +#endif
> +
> +	/*
> +	 * Code doesn't work if raw spinlock is larger than 4 bytes
> +	 * or is empty.
> +	 */
> +	BUILD_BUG_ON(sizeof(arch_spinlock_t) == 0);
> +	if (sizeof(arch_spinlock_t) > 4)
> +		return 0;	/* Need to acquire the lock explicitly */

You can fail the build here as well. Why go through loops and hoops at
runtime if the compiler can figure it out already?

> +	/*
> +	 * Wait a certain amount of time until the lock is free or the loop
> +	 * counter reaches 0. Doing multiple cpu_relax() helps to reduce
> +	 * contention in the spinlock cacheline and achieve better performance.
> +	 */
> +#ifdef _SHOW_WAIT_LOOP_TIME
> +	getnstimeofday(&tv1);
> +#endif
> +	loopcnt = LOCKREF_SPIN_WAIT_MAX;
> +	while (loopcnt && spin_is_locked(plock)) {
> +		loopcnt--;
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
> +		cpu_relax();
> +	}
> +#ifdef _SHOW_WAIT_LOOP_TIME
> +	if (loopcnt == 0) {
> +		unsigned long ns;
> +
> +		getnstimeofday(&tv2);
> +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> +		     (tv2.tv_nsec - tv1.tv_nsec);
> +		pr_info("lockref wait loop time = %lu ns\n", ns);

Using getnstimeofday() for timestamping and spamming the logs with
printouts? You can't be serious about that? 

That's what tracepoints are for. Tracing is the only way to get proper
numbers which take preemption, interrupts etc. into account without
hurting runtime performance.

> +	}
> +#endif
> +
> +#ifdef CONFIG_LOCK_STAT
> +	if (loopcnt != LOCKREF_SPIN_WAIT_MAX)
> +		lock_contended(plock->dep_map, _RET_IP_);
> +#endif

So you already failed and nevertheless you try again? What's the
point?

> +	old.__lock_count = ACCESS_ONCE(*plockcnt);
> +	get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
> +	if (likely(!get_lock && !spin_is_locked(&old.lock))) {
> +		new.__lock_count = old.__lock_count;
> +		new.refcnt += value;
> +		if (cmpxchg64(plockcnt, old.__lock_count, new.__lock_count)
> +		    == old.__lock_count)
> +			return 1;
> +		cpu_relax();
> +		cpu_relax();
> +		/*
> +		 * Try one more time
> +		 */

There are proper loop constructs available in C. Copying code is
definitely not one of them.

> +		old.__lock_count = ACCESS_ONCE(*plockcnt);
> +		get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
> +		if (likely(!get_lock && !spin_is_locked(&old.lock))) {
> +			new.__lock_count = old.__lock_count;
> +			new.refcnt += value;
> +			if (cmpxchg64(plockcnt, old.__lock_count,
> +				      new.__lock_count) == old.__lock_count)
> +				return 1;
> +			cpu_relax();
> +		}
> +	}
> +	return 0;	/* The caller will need to acquire the lock */
> +}
> +
> +/*
> + * Generic macros to increment and decrement reference count
> + * @sname : pointer to lockref structure
> + * @lock  : name of spinlock in structure
> + * @count : name of reference count in structure
> + *
> + * LOCKREF_GET()  - increments count unless it is locked
> + * LOCKREF_GET0() - increments count unless it is locked or count is 0
> + * LOCKREF_PUT()  - decrements count unless it is locked or count is 1
> + */
> +#define LOCKREF_GET(sname, lock, count)					\
> +	__lockref_add_unless(&(sname)->__lock_count, &(sname)->lock,	\
> +			     &(sname)->count, 1, -1)

> +void
> +lockref_get(struct lockref *lockcnt)

One line please

> +{
> +	if (LOCKREF_GET(lockcnt, lock, refcnt))
> +		return;

What's wrong with writing:

       if (lockref_mod(&lc->__lock_count, &lc->lock, &lc->refcnt, 1, -1))
       	  	return;

Instead of forcing everyone to read through a macro maze?

Seems to be an old HP tradition. Ask your elder colleagues about
George Anzinger.

Thanks,

	tglx


* Re: [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update
  2013-07-05 14:47 ` [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update Waiman Long
@ 2013-07-05 20:19   ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2013-07-05 20:19 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, 5 Jul 2013, Waiman Long wrote:

> There are two steps to enable each architecture to do lockless
> reference count update:
> 1. Define the ARCH_SPINLOCK_REFCOUNT config parameter in its Kconfig
>    file.

No, we select GENERIC_SPINLOCK_REFCOUNT unless we provide an
architecture specific implementation. In that case we select
ARCH_SPINLOCK_REFCOUNT.

Thanks,

	tglx


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
                   ` (3 preceding siblings ...)
  2013-07-05 15:02 ` [PATCH v5 00/12] Lockless update of reference count protected by spinlock Al Viro
@ 2013-07-05 20:33 ` Thomas Gleixner
  2013-07-08 14:22   ` Waiman Long
  4 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2013-07-05 20:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Fri, 5 Jul 2013, Waiman Long wrote:
> patch 1:	Introduce the new lockref data structure
> patch 2:	Enable x86 architecture to use the feature
> patch 3:	Rename all d_count references to d_refcount

And after that the mail threading does not work anymore, because you
managed to lose the

In-Reply-To: <1373035656-40600-1-git-send-email-Waiman.Long@hp.com>
References: <1373035656-40600-1-git-send-email-Waiman.Long@hp.com>

tags in patches 4-12

Thanks,

	tglx



* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-05 18:59   ` Thomas Gleixner
@ 2013-07-08 14:21     ` Waiman Long
  2013-07-15 14:41       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-07-08 14:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Waiman Long wrote:
>> + * If the spinlock & reference count optimization feature is disabled,
>> + * the spinlock and reference count are accessed separately on its own.
>> + */
>> +struct lockref {
>> +	unsigned int refcnt;	/* Reference count */
>> +	spinlock_t   lock;
>> +};
>> +
>> +/*
>> + * Struct lockref helper functions
>> + */
>> +/*
> Function documentation starts with /**

I will fix that.

>> + * lockref_get - increments reference count while not locked
> This should be: Increments reference count unconditionally.

Will change that.

>> + * @lockcnt: pointer to lockref structure
>> + */
>> +static __always_inline void
>> +lockref_get(struct lockref *lockcnt)
> Please avoid these line breaks if the line fits in 80 chars.

Will make the change.

>> +{
>> +	spin_lock(&lockcnt->lock);
>> +	lockcnt->refcnt++;
>> +	spin_unlock(&lockcnt->lock);
>> +}
>> +/*
>> + * lockref_put_or_locked - decrements count unless count <= 1 before decrement
>> + *			   otherwise the lock will be taken
>    lockref_put_or_lock please
>
> Docbook cannot work with multiline comments for the function name.
>
> So make that shorter and add a longer explanation below the @argument
> docs.

Will fix that.

>> + * @lockcnt: pointer to lockref structure
>> + * Return: 1 if count updated successfully or 0 if count <= 1 and lock taken
>> + */
>> +static __always_inline int
>> +lockref_put_or_locked(struct lockref *lockcnt)
>> +{
>> +	spin_lock(&lockcnt->lock);
>> +	if (likely(lockcnt->refcnt > 1)) {
>> +		lockcnt->refcnt--;
>> +		spin_unlock(&lockcnt->lock);
>> +		return 1;
>> +	}
>> +	return 0;	/* Count is 1 & lock taken */
> Please no tail comments. They are horrible to parse and it's obvious
> from the code.

Will remove the tail comments.

>> +}
>> +
>> +#endif /* CONFIG_SPINLOCK_REFCOUNT */
>> +#endif /* __LINUX_SPINLOCK_REFCOUNT_H */
>> diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
>> index 44511d1..d1f8670 100644
>> --- a/kernel/Kconfig.locks
>> +++ b/kernel/Kconfig.locks
>> @@ -223,3 +223,8 @@ endif
>>   config MUTEX_SPIN_ON_OWNER
>>   	def_bool y
>>   	depends on SMP && !DEBUG_MUTEXES
>> +
>> +config SPINLOCK_REFCOUNT
>> +	def_bool y
>> +	depends on ARCH_SPINLOCK_REFCOUNT && SMP
> This looks wrong. We want three options:
>
> 1) Always take the lock
> 2) Use the generic implementation
> 3) Use an architecture specific implementation
>
> So we want
>
>     config GENERIC_SPINLOCK_REFCOUNT
>     	  bool
>
>     config ARCH_SPINLOCK_REFCOUNT
>     	  bool
>
>     config SPINLOCK_REFCOUNT
>     	  def_bool y
> 	  depends on GENERIC_SPINLOCK_REFCOUNT || ARCH_SPINLOCK_REFCOUNT
> 	  depends on .....
>     	  	
> So an architectire can select GENERIC_SPINLOCK_REFCOUNT to get the
> generic implementation or ARCH_SPINLOCK_REFCOUNT if it provides its
> own special version.
>
> And lib/spinlock_refcount.o depends on GENERIC_SPINLOCK_REFCOUNT

I think it is OK to add the GENERIC option, but I would like to make 
available a slightly different set of options:
1) Always take the lock
2) Use the generic implementation with the default parameters
3) Use the generic implementation with a customized set of parameters
4) Use an architecture specific implementation.

2) set only GENERIC_SPINLOCK_REFCOUNT
3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
4) set only ARCH_SPINLOCK_REFCOUNT

The customized parameters will be set in the "asm/spinlock_refcount.h" 
file. Currently, there are two parameters that can be customized for each 
architecture:
1) How much time the function will wait until the lock is free
2) How many attempts to do a lockless cmpxchg to update the reference count

>> +/*
>> + * The maximum number of waiting spins when the lock was acquiring before
>> + * trying to attempt a lockless update. The purpose of the timeout is to
>> + * limit the amount of unfairness to the thread that is doing the reference
>> + * count update. Otherwise, it is theoretically possible for that thread to
>> + * be starved for a really long time causing all kind of problems. If it
>> + * times out and the lock is still not free, the code will fall back to the
>> + * traditional way of queuing up to acquire the lock before updating the count.
>> + *
>> + * The actual time spent in the wait loop very much depends on the CPU being
>> + * used. On a 2.4GHz Westmere CPU, the execution time of a PAUSE instruction
>> + * (cpu_relax) in a 4k loop is about 16us. The lock checking and looping
>> + * overhead is about 8us. With 4 cpu_relax's in the loop, it will wait
>> + * about 72us before the count reaches 0. With cacheline contention, the
>> + * wait time can go up to 3x as much (about 210us). Having multiple
>> + * cpu_relax's in the wait loop does seem to reduce cacheline contention
>> + * somewhat and give slightly better performance.
>> + *
>> + * The preset timeout value is rather arbitrary and really depends on the CPU
>> + * being used. If customization is needed, we could add a config variable for
>> + * that. The exact value is not that important. A longer wait time will
>> + * increase the chance of doing a lockless update, but it results in more
>> + * unfairness to the waiting thread and vice versa.
> As I told you before it's a horrible idea.
>
> This is completely depending on the CPU, the cache implementation and
> whatever. These heuristic can never be made correct. Their are prone
> to fail and in the worst case you end up with a regression on some
> systems.
>
> Let's start with a simple version because it IS simple and easy
> to analyze and debug and then incrementally build improvements on it
> instead of creating an heuristics monster in the first place, i.e.:
>
>     if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
>        return 0;
>     return 1;
>
> Take numbers for this on a zoo of different machines: Intel and AMD,
> old and new.
>
> Then you can add an incremental patch on that, which add loops and
> hoops. Along with numbers on the same zoo of machines.

I originally tried to do a cmpxchg without waiting and there was 
practically no performance gain. I believe that as soon as contention 
happens, it will force all the upcoming reference count update threads 
to take the lock, eliminating any potential performance gain that we can 
have. To make it simple, I can change the default to wait indefinitely 
until the lock is free instead of looping a certain number of times, but 
I still like the option of letting each architecture decide how many 
times it wants to try. I agree that the actual waiting time even for 
one architecture depends on the actual CPU chip that is being used. 
I have done some experiments on that. As long as the count is large 
enough, increasing the loop count doesn't have any significant impact on 
performance any more. The main reason for having a finite wait time is to 
avoid having the waiting thread wait virtually forever if the lock 
happens to be busy for a long time.
>> + */
>> +#ifndef CONFIG_LOCKREF_WAIT_SHIFT
> No, we don't want this to be a config option. Either you can determine
> it at runtime with a proper test or you just avoid the whole loop
> business.

This will be a customized parameter settable by each architecture.

>> +#define CONFIG_LOCKREF_WAIT_SHIFT	12
>> +#endif
>> +#define	LOCKREF_SPIN_WAIT_MAX	(1<<CONFIG_LOCKREF_WAIT_SHIFT)
>> +
>> +/**
>> + *
>> + * __lockref_add_unless - atomically add the given value to the count unless
>> + *			  the lock was acquired or the count equals to the
>> + *			  given threshold value.
> Short online descriptions please.

Will change that.

>> + * @plockcnt : pointer to the combined lock and count 8-byte data
>> + * @plock    : pointer to the spinlock
>> + * @pcount   : pointer to the reference count
>> + * @value    : value to be added
>> + * @threshold: threshold value for acquiring the lock
>> + * Return    : 1 if operation succeeds, 0 otherwise
>> + *
>> + * If the lock was not acquired, __lockref_add_unless() atomically adds the
>> + * given value to the reference count unless the given threshold is reached.
>> + * If the lock was acquired or the threshold was reached, 0 is returned and
>> + * the caller will have to acquire the lock and update the count accordingly
>> + * (can be done in a non-atomic way).
>> + *
>> + * N.B. The lockdep code won't be run as this optimization should be disabled
>> + *	when LOCK_STAT is enabled.
> -ENOPARSE
>
>> + */
>> +static __always_inline int
>> +__lockref_add_unless(u64 *plockcnt, spinlock_t *plock, unsigned int *pcount,
>> +		     int value, int threshold)
>> +{
>> +	struct lockref old, new;
>> +	int   get_lock, loopcnt;
>> +#ifdef _SHOW_WAIT_LOOP_TIME
>> +	struct timespec tv1, tv2;
>> +#endif
>> +
>> +	/*
>> +	 * Code doesn't work if raw spinlock is larger than 4 bytes
>> +	 * or is empty.
>> +	 */
>> +	BUILD_BUG_ON(sizeof(arch_spinlock_t) == 0);
>> +	if (sizeof(arch_spinlock_t) > 4)
>> +		return 0;	/* Need to acquire the lock explicitly */
> You can fail the build here as well. Why go through loops and hoops at
> runtime if the compiler can figure it out already?

Will change that to BUILD_BUG_ON.
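
Something along these lines, assuming the lockless path really does
require the lock to fit into 4 bytes (a sketch only):

	/* The fast path packs lock and count into one 64-bit word */
	BUILD_BUG_ON(sizeof(arch_spinlock_t) == 0);
	BUILD_BUG_ON(sizeof(arch_spinlock_t) > 4);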

>> +	/*
>> +	 * Wait a certain amount of time until the lock is free or the loop
>> +	 * counter reaches 0. Doing multiple cpu_relax() helps to reduce
>> +	 * contention in the spinlock cacheline and achieve better performance.
>> +	 */
>> +#ifdef _SHOW_WAIT_LOOP_TIME
>> +	getnstimeofday(&tv1);
>> +#endif
>> +	loopcnt = LOCKREF_SPIN_WAIT_MAX;
>> +	while (loopcnt && spin_is_locked(plock)) {
>> +		loopcnt--;
>> +		cpu_relax();
>> +		cpu_relax();
>> +		cpu_relax();
>> +		cpu_relax();
>> +	}
>> +#ifdef _SHOW_WAIT_LOOP_TIME
>> +	if (loopcnt == 0) {
>> +		unsigned long ns;
>> +
>> +		getnstimeofday(&tv2);
>> +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
>> +		     (tv2.tv_nsec - tv1.tv_nsec);
>> +		pr_info("lockref wait loop time = %lu ns\n", ns);
> Using getnstimeofday() for timestamping and spamming the logs with
> printouts? You can't be serious about that?
>
> Thats what tracepoints are for. Tracing is the only way to get proper
> numbers which take preemption, interrupts etc. into account without
> hurting runtime performace.

The _SHOW_WAIT_LOOP_TIME code is for debugging and instrumentation purposes 
only during the development cycle. It is not supposed to be turned on in a 
production system. I will document that in the code.

>> +	}
>> +#endif
>> +
>> +#ifdef CONFIG_LOCK_STAT
>> +	if (loopcnt != LOCKREF_SPIN_WAIT_MAX)
>> +		lock_contended(plock->dep_map, _RET_IP_);
>> +#endif
> So you already failed and nevertheless you try again? Whats' the
> point?

I will take this code out as it is not supposed to be used anyway.

>> +	old.__lock_count = ACCESS_ONCE(*plockcnt);
>> +	get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
>> +	if (likely(!get_lock && !spin_is_locked(&old.lock))) {
>> +		new.__lock_count = old.__lock_count;
>> +		new.refcnt += value;
>> +		if (cmpxchg64(plockcnt, old.__lock_count, new.__lock_count)
>> +		    == old.__lock_count)
>> +			return 1;
>> +		cpu_relax();
>> +		cpu_relax();
>> +		/*
>> +		 * Try one more time
>> +		 */
> There are proper loop constructs available in C. Copying code is
> definitely none of them.

Will change that into a loop.
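
A rough sketch of the loop form (reusing the locals from the quoted
function; LOCKREF_RETRY_COUNT is an invented name for the per-arch
attempt limit discussed above, not something from the patch):

	int retry;

	for (retry = LOCKREF_RETRY_COUNT; retry > 0; retry--) {
		old.__lock_count = ACCESS_ONCE(*plockcnt);
		get_lock = (threshold >= 0) && (old.refcnt <= threshold);
		if (get_lock || spin_is_locked(&old.lock))
			break;
		new.__lock_count = old.__lock_count;
		new.refcnt += value;
		if (cmpxchg64(plockcnt, old.__lock_count,
			      new.__lock_count) == old.__lock_count)
			return 1;
		cpu_relax();
	}
	/* The caller will need to acquire the lock */
	return 0;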

>> +		old.__lock_count = ACCESS_ONCE(*plockcnt);
>> +		get_lock = ((threshold >= 0) && (old.refcnt <= threshold));
>> +		if (likely(!get_lock && !spin_is_locked(&old.lock))) {
>> +			new.__lock_count = old.__lock_count;
>> +			new.refcnt += value;
>> +			if (cmpxchg64(plockcnt, old.__lock_count,
>> +				      new.__lock_count) == old.__lock_count)
>> +				return 1;
>> +			cpu_relax();
>> +		}
>> +	}
>> +	return 0;	/* The caller will need to acquire the lock */
>> +}
>> +
>> +/*
>> + * Generic macros to increment and decrement reference count
>> + * @sname : pointer to lockref structure
>> + * @lock  : name of spinlock in structure
>> + * @count : name of reference count in structure
>> + *
>> + * LOCKREF_GET()  - increments count unless it is locked
>> + * LOCKREF_GET0() - increments count unless it is locked or count is 0
>> + * LOCKREF_PUT()  - decrements count unless it is locked or count is 1
>> + */
>> +#define LOCKREF_GET(sname, lock, count)					\
>> +	__lockref_add_unless(&(sname)->__lock_count, &(sname)->lock,	\
>> +			&(sname)->count, 1, -1)
>> +void
>> +lockref_get(struct lockref *lockcnt)
> One line please

Will do that.

>> +{
>> +	if (LOCKREF_GET(lockcnt, lock, refcnt))
>> +		return;
> What's wrong with writing:
>
>         if (lockref_mod(&lc->__lock_count, &lc->lock, &lc->refcnt, 1, -1))
>         	  	return;

Will remove the macros.

Thanks for your time reviewing this patch.

Regards,
Longman


* Re: [PATCH v5 00/12] Lockless update of reference count protected by spinlock
  2013-07-05 20:33 ` Thomas Gleixner
@ 2013-07-08 14:22   ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-08 14:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/05/2013 04:33 PM, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Waiman Long wrote:
>> patch 1:	Introduce the new lockref data structure
>> patch 2:	Enable x86 architecture to use the feature
>> patch 3:	Rename all d_count references to d_refcount
> And after that the mail threading does not work anymore, because you
> managed to lose the
>
> In-Reply-To:<1373035656-40600-1-git-send-email-Waiman.Long@hp.com>
> References:<1373035656-40600-1-git-send-email-Waiman.Long@hp.com>
>
> tags in patches 4-12

Thanks for pointing this out. I will change the way I send the patches to 
fix this issue in the next revision of the patchset.

Regards,
Longman


* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-08 14:21     ` Waiman Long
@ 2013-07-15 14:41       ` Thomas Gleixner
  2013-07-15 21:24         ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2013-07-15 14:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Mon, 8 Jul 2013, Waiman Long wrote:
> On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
> > On Fri, 5 Jul 2013, Waiman Long wrote:
> I think it is OK to add the GENERIC option, but I would like to make available
> a slightly different set of options:
> 1) Always take the lock
> 2) Use the generic implementation with the default parameters
> 3) Use the generic implementation with a customized set of parameters
> 4) Use an architecture specific implementation.
> 
> 2) set only GENERIC_SPINLOCK_REFCOUNT
> 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
> 4) set only ARCH_SPINLOCK_REFCOUNT
> 
> The customized parameters will be set in the "asm/spinlock_refcount.h" file.
> Currently ,there is 2 parameters that can be customized for each architecture:
> 1) How much time will the function wait until the lock is free
> 2) How many attempts to do a lockless cmpxchg to update reference count

Sigh. GENERIC means that you use the generic implementation, ARCH
means the architecture has a private implementation of that code.

The generic implementation can use arch specific includes and if there
is none we simply fall back to the asm-generic one.

> > Let's start with a simple version because it IS simple and easy
> > to analyze and debug and then incrementally build improvements on it
> > instead of creating an heuristics monster in the first place, i.e.:
> > 
> >     if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> >        return 0;
> >     return 1;
> > 
> > Take numbers for this on a zoo of different machines: Intel and AMD,
> > old and new.
> > 
> > Then you can add an incremental patch on that, which add loops and
> > hoops. Along with numbers on the same zoo of machines.
> 
> I originally tried to do a cmpxchg without waiting and there was
> practically no performance gain. I believe that as soon as

Well, you did not see a difference on your particular machine. Still
we don't have an idea how all of this works on a set of different
machines. There is a world beside 8 socket servers.

> contention happens, it will force all the upcoming reference count
> update threads to take the lock eliminating any potential
> performance gain that we can have. To make it simple, I can change
> the default to wait indefinitely until the lock is free instead of
> looping a certain number of times, but I still like the option of
> letting each architecture to decide how many times they want to
> try. I agree that the actual waiting time even for one architecture
> is depending on the actual CPU chip that is being used. I have done
> some experiment on that. As long as the count is large enough,
> increasing the loop count doesn't have any significant impact on
> performance any more. The main reason for having a finite time is to
> avoid having the waiting thread to wait virtually forever if the
> lock happens to be busy for a long time.

Again, if we make this tunable then we still want documentation for
the behaviour on small, medium and large machines.

Also what's the approach to tune that? Running some random testbench
and monitoring the results for various settings?

If that's the case we better have that as variables with generic
initial values in the code, which can be modified by sysctl.


> > > +		getnstimeofday(&tv2);
> > > +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > +		     (tv2.tv_nsec - tv1.tv_nsec);
> > > +		pr_info("lockref wait loop time = %lu ns\n", ns);
> > Using getnstimeofday() for timestamping and spamming the logs with
> > printouts? You can't be serious about that?
> > 
> > Thats what tracepoints are for. Tracing is the only way to get proper
> > numbers which take preemption, interrupts etc. into account without
> > hurting runtime performace.
> 
> The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only
> during development cycle. It is not supposed to be turned on at production
> system. I will document that in the code.

No, no, no! Again: That's what tracepoints are for.

This kind of debugging is completely pointless. Tracepoints are
designed to be low overhead and can be enabled on production
systems.

Your debugging depends on slow timestamps against CLOCK_REALTIME and
an even slower output via printk. How useful is that if you want to
really instrument the behaviour of this code?

Thanks,

	tglx


* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-15 14:41       ` Thomas Gleixner
@ 2013-07-15 21:24         ` Waiman Long
  2013-07-15 23:47           ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-07-15 21:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> On Mon, 8 Jul 2013, Waiman Long wrote:
>> On 07/05/2013 02:59 PM, Thomas Gleixner wrote:
>>> On Fri, 5 Jul 2013, Waiman Long wrote:
>> I think it is OK to add the GENERIC option, but I would like to make available
>> a slightly different set of options:
>> 1) Always take the lock
>> 2) Use the generic implementation with the default parameters
>> 3) Use the generic implementation with a customized set of parameters
>> 4) Use an architecture specific implementation.
>>
>> 2) set only GENERIC_SPINLOCK_REFCOUNT
>> 3) set both GENERIC_SPINLOCK_REFCOUNT and ARCH_SPINLOCK_REFCOUNT
>> 4) set only ARCH_SPINLOCK_REFCOUNT
>>
>> The customized parameters will be set in the "asm/spinlock_refcount.h" file.
>> Currently ,there is 2 parameters that can be customized for each architecture:
>> 1) How much time will the function wait until the lock is free
>> 2) How many attempts to do a lockless cmpxchg to update reference count
> Sigh. GENERIC means, that you use the generic implementation, ARCH
> means the architecture has a private implementation of that code.
>
> The generic implementation can use arch specific includes and if there
> is none we simply fallback to the asm-generic one.

I used the ARCH+GENERIC combination to mean using the generic code but 
with an arch-specific include.
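
For illustration only, the conventional asm-generic fallback shape (the
paths and the LOCKREF_RETRY_COUNT name are invented here; only
LOCKREF_SPIN_WAIT_MAX appears in the patch):

	/* include/asm-generic/spinlock_refcount.h: default tunables */
	#ifndef _ASM_GENERIC_SPINLOCK_REFCOUNT_H
	#define _ASM_GENERIC_SPINLOCK_REFCOUNT_H

	#define LOCKREF_SPIN_WAIT_MAX	(1 << 12)	/* wait loop bound */
	#define LOCKREF_RETRY_COUNT	2		/* cmpxchg attempts */

	#endif /* _ASM_GENERIC_SPINLOCK_REFCOUNT_H */

	/* arch/<arch>/include/asm/spinlock_refcount.h, if no private tuning */
	#include <asm-generic/spinlock_refcount.h>

With this shape, an architecture that wants different values provides its
own asm/spinlock_refcount.h instead of pointing at the asm-generic one,
and the generic lib/spinlock_refcount.c code just includes
<asm/spinlock_refcount.h> unconditionally.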

>>> Let's start with a simple version because it IS simple and easy
>>> to analyze and debug and then incrementally build improvements on it
>>> instead of creating an heuristics monster in the first place, i.e.:
>>>
>>>      if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
>>>         return 0;
>>>      return 1;
>>>
>>> Take numbers for this on a zoo of different machines: Intel and AMD,
>>> old and new.
>>>
>>> Then you can add an incremental patch on that, which add loops and
>>> hoops. Along with numbers on the same zoo of machines.
>> I originally tried to do a cmpxchg without waiting and there was
>> practically no performance gain. I believe that as soon as
> Well, you did not see a difference on your particular machine. Still
> we don't have an idea how all of this works on a set of different
> machines. There is a world beside 8 socket servers.

I understand that. I can live with try_cmpxchg_once, though doing it 
twice will give slightly better performance. However, without waiting 
for the lock to be free, this patch won't do much good. To keep it 
simple, I can remove the customization ability, do the cmpxchg once, 
and wait until the lock is free. Please let me know if this is 
acceptable to you.

>> contention happens, it will force all the upcoming reference count
>> update threads to take the lock eliminating any potential
>> performance gain that we can have. To make it simple, I can change
>> the default to wait indefinitely until the lock is free instead of
>> looping a certain number of times, but I still like the option of
>> letting each architecture to decide how many times they want to
>> try. I agree that the actual waiting time even for one architecture
>> is depending on the actual CPU chip that is being used. I have done
>> some experiment on that. As long as the count is large enough,
>> increasing the loop count doesn't have any significant impact on
>> performance any more. The main reason for having a finite time is to
>> avoid having the waiting thread to wait virtually forever if the
>> lock happens to be busy for a long time.
> Again, if we make this tunable then we still want documentation for
> the behaviour on small, medium and large machines.
>
> Also what's the approach to tune that? Running some random testbench
> and monitor the results for various settings?
>
> If that's the case we better have a that as variables with generic
> initial values in the code, which can be modified by sysctl.

As I said above, I can remove the customization. I may reintroduce user 
customization using sysctl as you suggested in a follow-up patch 
after this one is merged.
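
A minimal sketch of what such a sysctl hook could look like (the
variable, table and procname here are invented for illustration and are
not part of this patchset):

	#include <linux/init.h>
	#include <linux/sysctl.h>

	static int lockref_wait_shift = 12;

	static struct ctl_table lockref_sysctl_table[] = {
		{
			.procname	= "lockref_wait_shift",
			.data		= &lockref_wait_shift,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
		{ }
	};

	static int __init lockref_sysctl_init(void)
	{
		/* Expose the tunable under /proc/sys/kernel/ */
		register_sysctl("kernel", lockref_sysctl_table);
		return 0;
	}
	late_initcall(lockref_sysctl_init);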

>
>>>> +		getnstimeofday(&tv2);
>>>> +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
>>>> +		     (tv2.tv_nsec - tv1.tv_nsec);
>>>> +		pr_info("lockref wait loop time = %lu ns\n", ns);
>>> Using getnstimeofday() for timestamping and spamming the logs with
>>> printouts? You can't be serious about that?
>>>
>>> Thats what tracepoints are for. Tracing is the only way to get proper
>>> numbers which take preemption, interrupts etc. into account without
>>> hurting runtime performace.
>> The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only
>> during development cycle. It is not supposed to be turned on at production
>> system. I will document that in the code.
> No, no, no! Again: That's what tracepoints are for.
>
> This kind of debugging is completely pointless. Tracepoints are
> designed to be low overhead and can be enabled on production
> systems.
>
> Your debugging depends on slow timestamps against CLOCK_REALTIME and
> an even slower output via printk. How useful is that if you want to
> really instrument the behaviour of this code?

This code is not critical and I can certainly remove it.

Regards,
Longman


* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-15 21:24         ` Waiman Long
@ 2013-07-15 23:47           ` Thomas Gleixner
  2013-07-16  1:40             ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2013-07-15 23:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On Mon, 15 Jul 2013, Waiman Long wrote:
> On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
> > On Mon, 8 Jul 2013, Waiman Long wrote:
> > Sigh. GENERIC means, that you use the generic implementation, ARCH
> > means the architecture has a private implementation of that code.
> > 
> > The generic implementation can use arch specific includes and if there
> > is none we simply fallback to the asm-generic one.
> 
> I used the ARCH+GENERIC to mean using the generic code but with arch specific
> include.

And what's the point of that? I just explained it to you that you do
not need the ARCH=y and GENERIC=y at all.
 
> > > > Let's start with a simple version because it IS simple and easy
> > > > to analyze and debug and then incrementally build improvements on it
> > > > instead of creating an heuristics monster in the first place, i.e.:
> > > > 
> > > >      if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
> > > >         return 0;
> > > >      return 1;
> > > > 
> > > > Take numbers for this on a zoo of different machines: Intel and AMD,
> > > > old and new.
> > > > 
> > > > Then you can add an incremental patch on that, which add loops and
> > > > hoops. Along with numbers on the same zoo of machines.
> > > I originally tried to do a cmpxchg without waiting and there was
> > > practically no performance gain. I believe that as soon as
> > Well, you did not see a difference on your particular machine. Still
> > we don't have an idea how all of this works on a set of different
> > machines. There is a world beside 8 socket servers.
> 
> I understand that. I can live with try_cmpxchg_once, though doing it
> twice will give a slightly better performance. However, without

I asked you several times now to explain and document the whole thing
with numbers instead of your handwaving "slightly better performance"
arguments.

I CANNOT verify that at all simply because I have no access to a 8
socket machine. Hint: Ask your boss to send me one :)

> waiting for the lock to be free, this patch won't do much good. To
> keep it simple, I can remove the ability to do customization while
> doing cmpxchg once and wait until the lock is free. Please let me
> know if this is acceptable to you.

No, it's not acceptable at all if you are not able to provide data for
1,2,4,8 socket machines (from single core to your precious big
boxes). It's that simple. We are not accepting patches which optimize
for a single use case and might negatively affect 99,9999% of the
existing users which have no access to this kind of hardware unless
proven otherwise.

As I told you versus the qrwlocks: you are failing to explain WHY it
improves your particular workloads and WHY this change won't affect
other workloads. All I can see from your explanations is: works better
on 8 socket big iron. I have yet to see any numbers that this won't
affect the vast majority of users.

> > Also what's the approach to tune that? Running some random testbench
> > and monitor the results for various settings?
> > 
> > If that's the case we better have a that as variables with generic
> > initial values in the code, which can be modified by sysctl.
> 
> As I said above, I can remove the customization. I may reintroduce user
> customization using sysctl as you suggested in the a follow up patch after
> this one is merged.

And I asked for a step by step approach in the first review, but you
decided to ignore that. And now you think that it's acceptable for you
as long as you get what you want. That's not the way it works, really.

> > > > > +		getnstimeofday(&tv2);
> > > > > +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
> > > > > +		     (tv2.tv_nsec - tv1.tv_nsec);
> > > > > +		pr_info("lockref wait loop time = %lu ns\n", ns);
> > > > Using getnstimeofday() for timestamping and spamming the logs with
> > > > printouts? You can't be serious about that?
> > > > 
> > > > Thats what tracepoints are for. Tracing is the only way to get proper
> > > > numbers which take preemption, interrupts etc. into account without
> > > > hurting runtime performace.
> > > The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only
> > > during development cycle. It is not supposed to be turned on at production
> > > system. I will document that in the code.
> > No, no, no! Again: That's what tracepoints are for.
> > 
> > This kind of debugging is completely pointless. Tracepoints are
> > designed to be low overhead and can be enabled on production
> > systems.
> > 
> > Your debugging depends on slow timestamps against CLOCK_REALTIME and
> > an even slower output via printk. How useful is that if you want to
> > really instrument the behaviour of this code?
> 
> This code is not critical and I can certainly remove it.

Did you even try to understand what I wrote? I did not ask you to
remove instrumentation. I asked you to use useful instrumentation
instead of some totally useless crap.

Thanks,

	tglx




* Re: [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount
  2013-07-15 23:47           ` Thomas Gleixner
@ 2013-07-16  1:40             ` Waiman Long
  0 siblings, 0 replies; 18+ messages in thread
From: Waiman Long @ 2013-07-16  1:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Alexander Viro, Jeff Layton, Miklos Szeredi, Ingo Molnar,
	linux-fsdevel, linux-kernel, Peter Zijlstra, Steven Rostedt,
	Linus Torvalds, Benjamin Herrenschmidt, Andi Kleen,
	Chandramouleeswaran, Aswin, Norton, Scott J

On 07/15/2013 07:47 PM, Thomas Gleixner wrote:
> On Mon, 15 Jul 2013, Waiman Long wrote:
>> On 07/15/2013 10:41 AM, Thomas Gleixner wrote:
>>> On Mon, 8 Jul 2013, Waiman Long wrote:
>>> Sigh. GENERIC means, that you use the generic implementation, ARCH
>>> means the architecture has a private implementation of that code.
>>>
>>> The generic implementation can use arch specific includes and if there
>>> is none we simply fallback to the asm-generic one.
>> I used the ARCH+GENERIC to mean using the generic code but with arch specific
>> include.
> And what's the point of that? I just explained it to you that you do
> not need the ARCH=y and GENERIC=y at all.

As I said in my previous mail, I can remove the ARCH+GENERIC option.

>>>>> Let's start with a simple version because it IS simple and easy
>>>>> to analyze and debug and then incrementally build improvements on it
>>>>> instead of creating an heuristics monster in the first place, i.e.:
>>>>>
>>>>>       if (!spin_is_locked(&lr->lock) && try_cmpxchg_once(lr))
>>>>>          return 0;
>>>>>       return 1;
>>>>>
>>>>> Take numbers for this on a zoo of different machines: Intel and AMD,
>>>>> old and new.
>>>>>
>>>>> Then you can add an incremental patch on that, which add loops and
>>>>> hoops. Along with numbers on the same zoo of machines.
>>>> I originally tried to do a cmpxchg without waiting and there was
>>>> practically no performance gain. I believe that as soon as
>>> Well, you did not see a difference on your particular machine. Still
>>> we don't have an idea how all of this works on a set of different
>>> machines. There is a world beside 8 socket servers.
>> I understand that. I can live with try_cmpxchg_once, though doing it
>> twice will give a slightly better performance. However, without
> I asked you several times now to explain and document the whole thing
> with numbers instead of your handwaving "slightly better performance"
> arguments.

I will provide performance data for 1 and 2 retries in my next patch 
version.

>
>> waiting for the lock to be free, this patch won't do much good. To
>> keep it simple, I can remove the ability to do customization while
>> doing cmpxchg once and wait until the lock is free. Please let me
>> know if this is acceptable to you.
> No, it's not acceptable at all if you are not able to provide data for
> 1,2,4,8 socket machines (from single core to your precious big
> boxes). It's that simple. We are not accepting patches which optimize
> for a single use case and might negatively affect 99,9999% of the
> existing users which have no access to this kind of hardware unless
> proven otherwise.

I did provide performance data for 1, 2, 4 and 8 socket configurations in 
my commit message. I used numactl to simulate different socket 
configurations by forcing the code to use only a subset of the total number 
of sockets. I know that is not ideal, but I think it should be close 
enough. I will provide performance data on a more common 2-socket test 
machine that I have.

Yes, I don't provide data for the single-thread use case. I will also 
provide that data in my next version by measuring the average time for 
doing a low-level reference count update with the lock and with the 
lockless update, like what I did for the qrwlock patch. For the 
single-thread case, I don't believe any real workload will show any 
appreciable difference in performance due to the differing reference 
count update mechanisms.

>>> Also what's the approach to tune that? Running some random testbench
>>> and monitor the results for various settings?
>>>
>>> If that's the case we better have a that as variables with generic
>>> initial values in the code, which can be modified by sysctl.
>> As I said above, I can remove the customization. I may reintroduce user
>> customization using sysctl as you suggested in the a follow up patch after
>> this one is merged.
> And I asked for a step by step approach in the first review, but you
> decided to ignore that. And now you think that it's accetable for you
> as long as you get what you want. That's not the way it works, really.

I am trying to provide what you are asking for while at the same time 
meeting my own needs.

>>>>>> +		getnstimeofday(&tv2);
>>>>>> +		ns = (tv2.tv_sec - tv1.tv_sec) * NSEC_PER_SEC +
>>>>>> +		     (tv2.tv_nsec - tv1.tv_nsec);
>>>>>> +		pr_info("lockref wait loop time = %lu ns\n", ns);
>>>>> Using getnstimeofday() for timestamping and spamming the logs with
>>>>> printouts? You can't be serious about that?
>>>>>
>>>>> Thats what tracepoints are for. Tracing is the only way to get proper
>>>>> numbers which take preemption, interrupts etc. into account without
>>>>> hurting runtime performace.
>>>> The _SHOW_WAIT_LOOP_TIME is for debugging and instrumentation purpose only
>>>> during development cycle. It is not supposed to be turned on at production
>>>> system. I will document that in the code.
>>> No, no, no! Again: That's what tracepoints are for.
>>>
>>> This kind of debugging is completely pointless. Tracepoints are
>>> designed to be low overhead and can be enabled on production
>>> systems.
>>>
>>> Your debugging depends on slow timestamps against CLOCK_REALTIME and
>>> an even slower output via printk. How useful is that if you want to
>>> really instrument the behaviour of this code?
>> This code is not critical and I can certainly remove it.
> Did you even try to understand what I wrote? I did not ask you to
> remove instrumentation. I asked you to use useful instrumentation
> instead of some totally useless crap.

I am not that familiar with using the tracepoint instrumentation for 
timing measurements. I will try to use that in the code for that purpose.
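
For reference, a minimal tracepoint for the wait loop could be declared
roughly like this (the lockref_wait name and the header location are
invented for illustration):

	/* include/trace/events/lockref.h */
	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM lockref

	#if !defined(_TRACE_LOCKREF_H) || defined(TRACE_HEADER_MULTI_READ)
	#define _TRACE_LOCKREF_H

	#include <linux/tracepoint.h>

	TRACE_EVENT(lockref_wait,

		TP_PROTO(unsigned int loops),

		TP_ARGS(loops),

		TP_STRUCT__entry(
			__field(unsigned int,	loops)
		),

		TP_fast_assign(
			__entry->loops = loops;
		),

		TP_printk("wait loops=%u", __entry->loops)
	);

	#endif /* _TRACE_LOCKREF_H */

	/* This part must be outside the multi-read protection */
	#include <trace/define_trace.h>

The wait loop would then call trace_lockref_wait() with the number of
spins taken, with CREATE_TRACE_POINTS defined in exactly one C file
before including the header.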

Regards,
Longman


end of thread, newest message: 2013-07-16  1:40 UTC

Thread overview: 18+ messages
2013-07-05 14:47 [PATCH v5 00/12] Lockless update of reference count protected by spinlock Waiman Long
2013-07-05 14:47 ` [PATCH v5 01/12] spinlock: A new lockref structure for lockless update of refcount Waiman Long
2013-07-05 18:59   ` Thomas Gleixner
2013-07-08 14:21     ` Waiman Long
2013-07-15 14:41       ` Thomas Gleixner
2013-07-15 21:24         ` Waiman Long
2013-07-15 23:47           ` Thomas Gleixner
2013-07-16  1:40             ` Waiman Long
2013-07-05 14:47 ` [PATCH v5 02/12] spinlock: Enable x86 architecture to do lockless refcount update Waiman Long
2013-07-05 20:19   ` Thomas Gleixner
2013-07-05 14:47 ` [PATCH v5 03/12] dcache: rename d_count field of dentry to d_refcount Waiman Long
2013-07-05 15:02 ` [PATCH v5 00/12] Lockless update of reference count protected by spinlock Al Viro
2013-07-05 15:29   ` Waiman Long
2013-07-05 15:31     ` Waiman Long
2013-07-05 17:54     ` Al Viro
2013-07-05 18:56       ` Waiman Long
2013-07-05 20:33 ` Thomas Gleixner
2013-07-08 14:22   ` Waiman Long
