* [PATCH 0/5] locking: Introduce range reader/writer lock
@ 2017-03-07  5:03 Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel

Hi,

Here's a very tardy proposal for enhancements to Jan's original[1] range lock
using interval trees. Because at some point it would be awesome to switch mmap_sem
from rwsem to a range rwlock, I've focused on making it sharable and on performance
enhancements that reduce the delta between this and conventional locks as much as
possible -- details in patch 1.

The rest of the patches add support for testing the new lock and actually
make use of it for lustre. It has passed quite a bit of artificial pounding and
I believe/hope it is in shape to be considered.

Applies on top of v4.11-rc1.

[1] https://lkml.org/lkml/2013/1/31/483

Thanks!

Davidlohr Bueso (5):
  locking: Introduce range reader/writer lock
  locking/locktorture: Fix rwsem reader_delay
  locking/locktorture: Fix num reader/writer corner cases
  locking/locktorture: Support range rwlocks
  staging/lustre: Use generic range rwlock

 drivers/gpu/drm/Kconfig                            |   2 -
 drivers/gpu/drm/i915/Kconfig                       |   1 -
 drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
 drivers/staging/lustre/lustre/llite/file.c         |  21 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
 drivers/staging/lustre/lustre/llite/range_lock.c   | 239 -----------
 drivers/staging/lustre/lustre/llite/range_lock.h   |  82 ----
 include/linux/range_rwlock.h                       |  96 +++++
 kernel/locking/Makefile                            |   2 +-
 kernel/locking/locktorture.c                       | 299 +++++++++----
 kernel/locking/range_rwlock.c                      | 462 +++++++++++++++++++++
 lib/Kconfig                                        |  14 -
 lib/Kconfig.debug                                  |   1 -
 lib/Makefile                                       |   2 +-
 15 files changed, 792 insertions(+), 438 deletions(-)
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

-- 
2.6.6


* [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 [PATCH 0/5] locking: Introduce range reader/writer lock Davidlohr Bueso
@ 2017-03-07  5:03 ` Davidlohr Bueso
  2017-03-09 11:03   ` Jan Kara
                     ` (9 more replies)
  2017-03-07  5:03 ` [PATCH 2/5] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
                   ` (3 subsequent siblings)
  4 siblings, 10 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel,
	Davidlohr Bueso

This implements a sleepable range rwlock, based on an interval tree, serializing
conflicting/intersecting/overlapping ranges within the tree. The largest range
is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
involves dealing with both the tree itself and the range to be locked, which is
normally stack allocated and always explicitly prepared/initialized by the user
in a sorted [a0, a1] manner (a0 <= a1), before actually taking the lock.

We allow exclusive locking of arbitrary ranges. We guarantee that each
range is locked only after all conflicting range locks requested previously
have been unlocked. Thus we achieve fairness and avoid livelocks.

When a new range lock is requested, we add its interval to the tree and store the
number of intervals intersecting it in 'blocking_ranges'. When a range is
unlocked, we again walk the intervals that intersect with the unlocked one and
decrement their 'blocking_ranges'. We wake up the owner of any range lock whose
'blocking_ranges' drops to 0.

For the shared case, 'blocking_ranges' is only incremented if the
intersecting range is not marked as a reader. In order to mitigate some of
the tree walk overhead for non-intersecting ranges, the tree's min/max values
are maintained and consulted in O(1) in the fastpath.
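
As an illustration, here is a minimal usage sketch against the API added below;
the tree name and the helper function are made up for the example, while the
range_rwlock calls are the ones declared in include/linux/range_rwlock.h:

	#include <linux/range_rwlock.h>

	static DEFINE_RANGE_RWLOCK_TREE(foo_tree);	/* hypothetical tree */

	static void foo_write(unsigned long start, unsigned long last)
	{
		struct range_rwlock range;	/* caller owned, normally on the stack */

		/* prepare the sorted interval [start, last], start <= last */
		range_rwlock_init(&range, start, last);

		/* blocks until all previously requested conflicting ranges are released */
		range_write_lock(&foo_tree, &range);
		/* ... exclusive access to [start, last] within foo_tree ... */
		range_write_unlock(&foo_tree, &range);
	}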

How much does it cost:
----------------------

The cost of lock and unlock of a range is O(log(R_all) + R_int), where R_all is
the total number of ranges and R_int is the number of ranges intersecting the
new range to be added.

Due to its sharable nature, a full range lock can be compared with rw-semaphores,
and the comparison also serves from a mutex standpoint, as writer-only situations
are pretty similar nowadays.

The first difference is the memory footprint: rwsems are larger than tree locks, 40 vs
24 bytes, but the latter requires an additional 64 bytes of stack for the range
structure.

Secondly, because every range call is serialized by the tree->lock, any lock()
fastpath will incur at least an interval_tree_insert() plus spinlock lock+unlock
overhead, compared to a single atomic instruction in the case of rwsems. The same
obviously applies to the unlock() case.

The torture module was used to measure 1:1 differences in lock acquisition with
increasing core counts over a period of 10 minutes. Readers and writers are
interleaved, with a slight advantage to writers as the writer kthread is created
first. The following shows the avg ops/minute with various thread setups on
boxes with small and large core counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For the read-only and write-only cases, there is very little difference between the
range lock and rwsems, with up to a 3% hit, which could very well be considered within the noise.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While single reader and writer threads do not show much difference, increasing
core counts shows that in reader/writer workloads, writer threads can take a hit in
raw performance of up to ~20%, while reader throughput is quite similar
for both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, larger machines show that range locks take only a minor
(up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to ~40%,
similar to the 4-core machine; however, reader threads can increase the number of
acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
always be higher for rwsems. A huge factor in this behavior is that range locks
do not have a writer spin-on-owner feature.

On both machines, when actually testing threads acquiring different ranges, the
range lock will always outperform the rwsem in throughput, due to the increased
parallelism, which is no surprise either. As such, microbenchmarks that merely
pound on a lock will pretty much always suffer upon direct lock conversions,
but not enough to matter in the overall picture.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/gpu/drm/Kconfig       |   2 -
 drivers/gpu/drm/i915/Kconfig  |   1 -
 include/linux/range_rwlock.h  |  96 +++++++++
 kernel/locking/Makefile       |   2 +-
 kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                   |  14 --
 lib/Kconfig.debug             |   1 -
 lib/Makefile                  |   2 +-
 8 files changed, 560 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 88e01e08e279..e4d9eadd2c47 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -154,7 +154,6 @@ config DRM_RADEON
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have an ATI Radeon graphics card.  There
 	  are both PCI and AGP versions.  You don't need to choose this to
@@ -174,7 +173,6 @@ config DRM_AMDGPU
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have a recent AMD Radeon graphics card.
 
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 183f5dc1c3f2..8a9154550f46 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -3,7 +3,6 @@ config DRM_I915
 	depends on DRM
 	depends on X86 && PCI
 	select INTEL_GTT
-	select INTERVAL_TREE
 	# we need shmfs for the swappable backing store, and in particular
 	# the shmem_readpage() which depends upon tmpfs
 	select SHMEM
diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
new file mode 100644
index 000000000000..174e13668f67
--- /dev/null
+++ b/include/linux/range_rwlock.h
@@ -0,0 +1,96 @@
+/*
+ * Range locking
+ *
+ * We allow exclusive locking of arbitrary ranges. We guarantee that each
+ * range is locked only after all conflicting range locks requested previously
+ * have been unlocked. Thus we achieve fairness and avoid livelocks.
+ *
+ * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
+ * total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_RWLOCK_H
+#define _LINUX_RANGE_RWLOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_RWLOCK_INFINITY].
+ */
+#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
+
+struct range_rwlock {
+	struct interval_tree_node node;
+	struct task_struct *task;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	bool reader;
+};
+
+struct range_rwlock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+};
+
+#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
+
+#define DEFINE_RANGE_RWLOCK_TREE(name) \
+       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
+		.node = {			\
+			.start = (start)	\
+			,.last = (last)		\
+		}				\
+	}
+
+#define DEFINE_RANGE_RWLOCK(name, start, last)				\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_RWLOCK_INF(name)		\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
+
+static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+}
+
+void range_rwlock_init(struct range_rwlock *lock,
+		       unsigned long start, unsigned long last);
+void range_rwlock_init_inf(struct range_rwlock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock);
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock);
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock);
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock);
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock);
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..bf054d0af82b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
new file mode 100644
index 000000000000..4d92baa9bd80
--- /dev/null
+++ b/kernel/locking/range_rwlock.c
@@ -0,0 +1,462 @@
+/*
+ * Implementation of read/write range locks.
+ *
+ * We keep interval tree of locked and to-be-locked ranges. When new range lock
+ * is requested, we add its interval to the tree and store number of intervals
+ * intersecting it to 'blocking_ranges'.
+ *
+ * When a range is unlocked, we again walk intervals that intersect with the
+ * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
+ * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
+ * 'blocking_ranges' is only incremented if the intersecting range is not marked
+ * as a reader. In order to mitigate some of the tree walk overhead for
+ * non-intersecting ranges, the tree's min/max values are maintained and consulted
+ * in O(1) in the fastpath.
+ */
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_rwlock.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+
+#define range_entry(ptr, type, member) container_of(ptr, type, member)
+
+/*
+ * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the latter, we rely on the root node,
+ * which, by the augmented interval tree property, holds the largest value in
+ * its last-in-subtree.
+ */
+static inline bool
+__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		tree->leftmost = &lock->node;
+	else if (lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+static inline void
+__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		  unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->reader = false;
+}
+
+/**
+ * range_rwlock_init - initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range's [start, last] such that it can
+ * later be locked. User is expected to enter a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		     unsigned long last)
+{
+	__range_rwlock_init(lock, start, last);
+}
+EXPORT_SYMBOL(range_rwlock_init);
+
+/**
+ * range_rwlock_init_inf - initialize the range lock to infinity
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init_inf(struct range_rwlock *lock)
+{
+	__range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
+}
+EXPORT_SYMBOL(range_rwlock_init_inf);
+
+static inline void
+range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, lock->task);
+}
+
+static inline int wait_for_ranges(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		if (unlikely(signal_pending_state(state, current))) {
+			unsigned long flags;
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+			lock->reader = false;
+			__range_tree_remove(tree, lock);
+			spin_unlock_irqrestore(&tree->lock, flags);
+			break;
+		}
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+static __always_inline int
+__range_read_lock_common(struct range_rwlock_tree *tree,
+			 struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	lock->reader = true;
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					 lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader)
+			lock->blocking_ranges++;
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->task = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+
+}
+EXPORT_SYMBOL(range_read_lock);
+
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+
+}
+EXPORT_SYMBOL(range_read_lock_interruptible);
+
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+
+}
+EXPORT_SYMBOL(range_read_lock_killable);
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader) {
+			ret = false;
+			goto unlock;
+		}
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	lock->reader = true;
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(range_read_trylock);
+
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	lock->reader = false;
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader)
+			range_rwlock_unblock(blocked_lock, &wake_q);
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_read_unlock);
+
+static inline int __sched
+__range_write_lock_common(struct range_rwlock_tree *tree,
+			  struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	lock->reader = false;
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to block; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->task = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(range_write_lock);
+
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL(range_write_lock_interruptible);
+
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(range_write_lock_killable);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		lock->reader = false;
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+EXPORT_SYMBOL(range_write_trylock);
+
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	lock->reader = false;
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+				     lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		range_rwlock_unblock(blocked_lock, &wake_q);
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(lock->reader);
+
+	/*
+	 * Unaccount for any blocked reader lock. Wakeup if possible.
+	 */
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (blocked_lock->reader)
+			range_rwlock_unblock(blocked_lock, &wake_q);
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	lock->reader = true;
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_downgrade_write);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..52ca6668a16d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
 config BTREE
 	bool
 
-config INTERVAL_TREE
-	bool
-	help
-	  Simple, embeddable, interval-tree. Can find the start of an
-	  overlapping range in log(n) time and then iterate over all
-	  overlapping nodes. The algorithm is implemented as an
-	  augmented rbtree.
-
-	  See:
-
-		Documentation/rbtree.txt
-
-	  for more information.
-
 config RADIX_TREE_MULTIORDER
 	bool
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97d62c2da6c2..a8a64eb54d3a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1771,7 +1771,6 @@ config RBTREE_TEST
 config INTERVAL_TREE_TEST
 	tristate "Interval tree test"
 	depends on m && DEBUG_KERNEL
-	select INTERVAL_TREE
 	help
 	  A benchmark measuring the performance of the interval tree library
 
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..d4ff3057450a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
 
 obj-$(CONFIG_BTREE) += btree.o
-obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
+obj-y += interval_tree.o
 obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
-- 
2.6.6


* [PATCH 2/5] locking/locktorture: Fix rwsem reader_delay
  2017-03-07  5:03 [PATCH 0/5] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
@ 2017-03-07  5:03 ` Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 3/5] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel,
	Davidlohr Bueso

We should account for the number of reader threads, not writers, in this
callback. The current code could even trigger a division by zero if the user
explicitly disables writers.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index f24582d4dad3..0ef5510f8742 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -570,7 +570,7 @@ static void torture_rwsem_read_delay(struct torture_random_state *trsp)
 
 	/* We want a long delay occasionally to force massive contention.  */
 	if (!(torture_random(trsp) %
-	      (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
+	      (cxt.nrealreaders_stress * 2000 * longdelay_ms)))
 		mdelay(longdelay_ms * 2);
 	else
 		mdelay(longdelay_ms / 2);
-- 
2.6.6


* [PATCH 3/5] locking/locktorture: Fix num reader/writer corner cases
  2017-03-07  5:03 [PATCH 0/5] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 2/5] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
@ 2017-03-07  5:03 ` Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 4/5] locking/locktorture: Support range rwlocks Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
  4 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel,
	Davidlohr Bueso

Things can explode for locktorture if the user passes combinations of
nwriters_stress=0 and/or nreaders_stress=0. Fix this by not assuming we
always want to torture writer threads.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 76 +++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 0ef5510f8742..a68167803eee 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -715,8 +715,7 @@ static void __torture_print_stats(char *page,
 {
 	bool fail = 0;
 	int i, n_stress;
-	long max = 0;
-	long min = statp[0].n_lock_acquired;
+	long max = 0, min = statp ? statp[0].n_lock_acquired : 0;
 	long long sum = 0;
 
 	n_stress = write ? cxt.nrealwriters_stress : cxt.nrealreaders_stress;
@@ -823,7 +822,7 @@ static void lock_torture_cleanup(void)
 	 * such, only perform the underlying torture-specific cleanups,
 	 * and avoid anything related to locktorture.
 	 */
-	if (!cxt.lwsa)
+	if (!cxt.lwsa && !cxt.lrsa)
 		goto end;
 
 	if (writer_tasks) {
@@ -898,6 +897,13 @@ static int __init lock_torture_init(void)
 		firsterr = -EINVAL;
 		goto unwind;
 	}
+
+	if (nwriters_stress == 0 && nreaders_stress == 0) {
+		pr_alert("lock-torture: must run at least one locking thread\n");
+		firsterr = -EINVAL;
+		goto unwind;
+	}
+
 	if (cxt.cur_ops->init)
 		cxt.cur_ops->init();
 
@@ -921,17 +927,19 @@ static int __init lock_torture_init(void)
 #endif
 
 	/* Initialize the statistics so that each run gets its own numbers. */
+	if (nwriters_stress) {
+		lock_is_write_held = 0;
+		cxt.lwsa = kmalloc(sizeof(*cxt.lwsa) * cxt.nrealwriters_stress, GFP_KERNEL);
+		if (cxt.lwsa == NULL) {
+			VERBOSE_TOROUT_STRING("cxt.lwsa: Out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
 
-	lock_is_write_held = 0;
-	cxt.lwsa = kmalloc(sizeof(*cxt.lwsa) * cxt.nrealwriters_stress, GFP_KERNEL);
-	if (cxt.lwsa == NULL) {
-		VERBOSE_TOROUT_STRING("cxt.lwsa: Out of memory");
-		firsterr = -ENOMEM;
-		goto unwind;
-	}
-	for (i = 0; i < cxt.nrealwriters_stress; i++) {
-		cxt.lwsa[i].n_lock_fail = 0;
-		cxt.lwsa[i].n_lock_acquired = 0;
+		for (i = 0; i < cxt.nrealwriters_stress; i++) {
+			cxt.lwsa[i].n_lock_fail = 0;
+			cxt.lwsa[i].n_lock_acquired = 0;
+		}
 	}
 
 	if (cxt.cur_ops->readlock) {
@@ -948,19 +956,21 @@ static int __init lock_torture_init(void)
 			cxt.nrealreaders_stress = cxt.nrealwriters_stress;
 		}
 
-		lock_is_read_held = 0;
-		cxt.lrsa = kmalloc(sizeof(*cxt.lrsa) * cxt.nrealreaders_stress, GFP_KERNEL);
-		if (cxt.lrsa == NULL) {
-			VERBOSE_TOROUT_STRING("cxt.lrsa: Out of memory");
-			firsterr = -ENOMEM;
-			kfree(cxt.lwsa);
-			cxt.lwsa = NULL;
-			goto unwind;
-		}
-
-		for (i = 0; i < cxt.nrealreaders_stress; i++) {
-			cxt.lrsa[i].n_lock_fail = 0;
-			cxt.lrsa[i].n_lock_acquired = 0;
+		if (nreaders_stress) {
+			lock_is_read_held = 0;
+			cxt.lrsa = kmalloc(sizeof(*cxt.lrsa) * cxt.nrealreaders_stress, GFP_KERNEL);
+			if (cxt.lrsa == NULL) {
+				VERBOSE_TOROUT_STRING("cxt.lrsa: Out of memory");
+				firsterr = -ENOMEM;
+				kfree(cxt.lwsa);
+				cxt.lwsa = NULL;
+				goto unwind;
+			}
+
+			for (i = 0; i < cxt.nrealreaders_stress; i++) {
+				cxt.lrsa[i].n_lock_fail = 0;
+				cxt.lrsa[i].n_lock_acquired = 0;
+			}
 		}
 	}
 
@@ -990,12 +1000,14 @@ static int __init lock_torture_init(void)
 			goto unwind;
 	}
 
-	writer_tasks = kzalloc(cxt.nrealwriters_stress * sizeof(writer_tasks[0]),
-			       GFP_KERNEL);
-	if (writer_tasks == NULL) {
-		VERBOSE_TOROUT_ERRSTRING("writer_tasks: Out of memory");
-		firsterr = -ENOMEM;
-		goto unwind;
+	if (nwriters_stress) {
+		writer_tasks = kzalloc(cxt.nrealwriters_stress * sizeof(writer_tasks[0]),
+				       GFP_KERNEL);
+		if (writer_tasks == NULL) {
+			VERBOSE_TOROUT_ERRSTRING("writer_tasks: Out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
 	}
 
 	if (cxt.cur_ops->readlock) {
-- 
2.6.6


* [PATCH 4/5] locking/locktorture: Support range rwlocks
  2017-03-07  5:03 [PATCH 0/5] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2017-03-07  5:03 ` [PATCH 3/5] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
@ 2017-03-07  5:03 ` Davidlohr Bueso
  2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
  4 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel,
	Davidlohr Bueso

Torture the reader/writer range locks. Each thread will attempt to
lock+unlock a random range within [0, 4096].
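
Roughly, each iteration prepares a random sorted sub-range before calling into the
lock operations; a simplified sketch of the idea follows (prandom_u32() stands in
for the torture module's own RNG, the tree matches the one defined in this patch):

	struct range_rwlock range;
	unsigned long last = prandom_u32() % (4096 + 1);	/* end point within [0, 4096] */
	unsigned long start = prandom_u32() % (last + 1);	/* keep start <= last */

	range_rwlock_init(&range, start, last);
	range_write_lock(&torture_range_rwlock, &range);
	/* ... simulated critical section (write_delay) ... */
	range_write_unlock(&torture_range_rwlock, &range);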

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 221 +++++++++++++++++++++++++++++++++----------
 1 file changed, 172 insertions(+), 49 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index a68167803eee..76de50da4cdc 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -29,6 +29,7 @@
 #include <linux/rwlock.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
+#include <linux/range_rwlock.h>
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/sched.h>
@@ -89,13 +90,13 @@ static void lock_torture_cleanup(void);
  */
 struct lock_torture_ops {
 	void (*init)(void);
-	int (*writelock)(void);
+	int (*writelock)(void *arg);
 	void (*write_delay)(struct torture_random_state *trsp);
 	void (*task_boost)(struct torture_random_state *trsp);
-	void (*writeunlock)(void);
-	int (*readlock)(void);
+	void (*writeunlock)(void *arg);
+	int (*readlock)(void *arg);
 	void (*read_delay)(struct torture_random_state *trsp);
-	void (*readunlock)(void);
+	void (*readunlock)(void *arg);
 
 	unsigned long flags; /* for irq spinlocks */
 	const char *name;
@@ -117,7 +118,7 @@ static struct lock_torture_cxt cxt = { 0, 0, false,
  * Definitions for lock torture testing.
  */
 
-static int torture_lock_busted_write_lock(void)
+static int torture_lock_busted_write_lock(void *arg)
 {
 	return 0;  /* BUGGY, do not use in real life!!! */
 }
@@ -136,7 +137,7 @@ static void torture_lock_busted_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_lock_busted_write_unlock(void)
+static void torture_lock_busted_write_unlock(void *arg)
 {
 	  /* BUGGY, do not use in real life!!! */
 }
@@ -159,7 +160,8 @@ static struct lock_torture_ops lock_busted_ops = {
 
 static DEFINE_SPINLOCK(torture_spinlock);
 
-static int torture_spin_lock_write_lock(void) __acquires(torture_spinlock)
+static int torture_spin_lock_write_lock(void *arg)
+	__acquires(torture_spinlock)
 {
 	spin_lock(&torture_spinlock);
 	return 0;
@@ -185,7 +187,8 @@ static void torture_spin_lock_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_spin_lock_write_unlock(void) __releases(torture_spinlock)
+static void torture_spin_lock_write_unlock(void *arg)
+	__releases(torture_spinlock)
 {
 	spin_unlock(&torture_spinlock);
 }
@@ -201,8 +204,8 @@ static struct lock_torture_ops spin_lock_ops = {
 	.name		= "spin_lock"
 };
 
-static int torture_spin_lock_write_lock_irq(void)
-__acquires(torture_spinlock)
+static int torture_spin_lock_write_lock_irq(void *arg)
+	__acquires(torture_spinlock)
 {
 	unsigned long flags;
 
@@ -211,7 +214,7 @@ __acquires(torture_spinlock)
 	return 0;
 }
 
-static void torture_lock_spin_write_unlock_irq(void)
+static void torture_lock_spin_write_unlock_irq(void *arg)
 __releases(torture_spinlock)
 {
 	spin_unlock_irqrestore(&torture_spinlock, cxt.cur_ops->flags);
@@ -230,7 +233,8 @@ static struct lock_torture_ops spin_lock_irq_ops = {
 
 static DEFINE_RWLOCK(torture_rwlock);
 
-static int torture_rwlock_write_lock(void) __acquires(torture_rwlock)
+static int torture_rwlock_write_lock(void *arg)
+	__acquires(torture_rwlock)
 {
 	write_lock(&torture_rwlock);
 	return 0;
@@ -251,12 +255,14 @@ static void torture_rwlock_write_delay(struct torture_random_state *trsp)
 		udelay(shortdelay_us);
 }
 
-static void torture_rwlock_write_unlock(void) __releases(torture_rwlock)
+static void torture_rwlock_write_unlock(void *arg)
+	__releases(torture_rwlock)
 {
 	write_unlock(&torture_rwlock);
 }
 
-static int torture_rwlock_read_lock(void) __acquires(torture_rwlock)
+static int torture_rwlock_read_lock(void *arg)
+	__acquires(torture_rwlock)
 {
 	read_lock(&torture_rwlock);
 	return 0;
@@ -277,7 +283,8 @@ static void torture_rwlock_read_delay(struct torture_random_state *trsp)
 		udelay(shortdelay_us);
 }
 
-static void torture_rwlock_read_unlock(void) __releases(torture_rwlock)
+static void torture_rwlock_read_unlock(void *arg)
+	__releases(torture_rwlock)
 {
 	read_unlock(&torture_rwlock);
 }
@@ -293,7 +300,8 @@ static struct lock_torture_ops rw_lock_ops = {
 	.name		= "rw_lock"
 };
 
-static int torture_rwlock_write_lock_irq(void) __acquires(torture_rwlock)
+static int torture_rwlock_write_lock_irq(void *arg)
+	__acquires(torture_rwlock)
 {
 	unsigned long flags;
 
@@ -302,13 +310,14 @@ static int torture_rwlock_write_lock_irq(void) __acquires(torture_rwlock)
 	return 0;
 }
 
-static void torture_rwlock_write_unlock_irq(void)
-__releases(torture_rwlock)
+static void torture_rwlock_write_unlock_irq(void *arg)
+	__releases(torture_rwlock)
 {
 	write_unlock_irqrestore(&torture_rwlock, cxt.cur_ops->flags);
 }
 
-static int torture_rwlock_read_lock_irq(void) __acquires(torture_rwlock)
+static int torture_rwlock_read_lock_irq(void *arg)
+	__acquires(torture_rwlock)
 {
 	unsigned long flags;
 
@@ -317,8 +326,8 @@ static int torture_rwlock_read_lock_irq(void) __acquires(torture_rwlock)
 	return 0;
 }
 
-static void torture_rwlock_read_unlock_irq(void)
-__releases(torture_rwlock)
+static void torture_rwlock_read_unlock_irq(void *arg)
+	__releases(torture_rwlock)
 {
 	read_unlock_irqrestore(&torture_rwlock, cxt.cur_ops->flags);
 }
@@ -336,7 +345,8 @@ static struct lock_torture_ops rw_lock_irq_ops = {
 
 static DEFINE_MUTEX(torture_mutex);
 
-static int torture_mutex_lock(void) __acquires(torture_mutex)
+static int torture_mutex_lock(void *arg)
+	__acquires(torture_mutex)
 {
 	mutex_lock(&torture_mutex);
 	return 0;
@@ -358,7 +368,8 @@ static void torture_mutex_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_mutex_unlock(void) __releases(torture_mutex)
+static void torture_mutex_unlock(void *arg)
+	__releases(torture_mutex)
 {
 	mutex_unlock(&torture_mutex);
 }
@@ -380,7 +391,7 @@ static DEFINE_WW_MUTEX(torture_ww_mutex_0, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_1, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_2, &torture_ww_class);
 
-static int torture_ww_mutex_lock(void)
+static int torture_ww_mutex_lock(void *arg)
 __acquires(torture_ww_mutex_0)
 __acquires(torture_ww_mutex_1)
 __acquires(torture_ww_mutex_2)
@@ -425,7 +436,7 @@ __acquires(torture_ww_mutex_2)
 	return 0;
 }
 
-static void torture_ww_mutex_unlock(void)
+static void torture_ww_mutex_unlock(void *arg)
 __releases(torture_ww_mutex_0)
 __releases(torture_ww_mutex_1)
 __releases(torture_ww_mutex_2)
@@ -449,7 +460,8 @@ static struct lock_torture_ops ww_mutex_lock_ops = {
 #ifdef CONFIG_RT_MUTEXES
 static DEFINE_RT_MUTEX(torture_rtmutex);
 
-static int torture_rtmutex_lock(void) __acquires(torture_rtmutex)
+static int torture_rtmutex_lock(void *arg)
+	__acquires(torture_rtmutex)
 {
 	rt_mutex_lock(&torture_rtmutex);
 	return 0;
@@ -513,7 +525,8 @@ static void torture_rtmutex_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rtmutex_unlock(void) __releases(torture_rtmutex)
+static void torture_rtmutex_unlock(void *arg)
+	__releases(torture_rtmutex)
 {
 	rt_mutex_unlock(&torture_rtmutex);
 }
@@ -531,7 +544,8 @@ static struct lock_torture_ops rtmutex_lock_ops = {
 #endif
 
 static DECLARE_RWSEM(torture_rwsem);
-static int torture_rwsem_down_write(void) __acquires(torture_rwsem)
+static int torture_rwsem_down_write(void *arg)
+	__acquires(torture_rwsem)
 {
 	down_write(&torture_rwsem);
 	return 0;
@@ -553,12 +567,14 @@ static void torture_rwsem_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rwsem_up_write(void) __releases(torture_rwsem)
+static void torture_rwsem_up_write(void *arg)
+	__releases(torture_rwsem)
 {
 	up_write(&torture_rwsem);
 }
 
-static int torture_rwsem_down_read(void) __acquires(torture_rwsem)
+static int torture_rwsem_down_read(void *arg)
+	__acquires(torture_rwsem)
 {
 	down_read(&torture_rwsem);
 	return 0;
@@ -580,7 +596,8 @@ static void torture_rwsem_read_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rwsem_up_read(void) __releases(torture_rwsem)
+static void torture_rwsem_up_read(void *arg)
+	__releases(torture_rwsem)
 {
 	up_read(&torture_rwsem);
 }
@@ -596,6 +613,70 @@ static struct lock_torture_ops rwsem_lock_ops = {
 	.name		= "rwsem_lock"
 };
 
+/*
+ * Lock torturing starts here, and is slightly different than other locks
+ * in that it expects the range to lock to be passed as well as the actual
+ * lock itself.
+ */
+#define locktorture_is_range_lock(type) (!strncmp(type, "range_rwlock", 12))
+
+struct locktorture_range {
+	unsigned long start;
+	unsigned long end;
+	struct range_rwlock lock;
+};
+
+#define RANGE_POOL_SIZE (1UL << 12)
+struct locktorture_range *range_pool = NULL;
+
+static DEFINE_RANGE_RWLOCK_TREE(torture_range_rwlock);
+
+static int torture_range_rwlock_read_lock(void *arg)
+	__acquires(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_read_lock(&torture_range_rwlock, range);
+	return 0;
+}
+
+static void torture_range_rwlock_read_unlock(void *arg)
+	__releases(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_read_unlock(&torture_range_rwlock, range);
+}
+
+static int torture_range_rwlock_write_lock(void *arg)
+	__acquires(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_write_lock(&torture_range_rwlock, range);
+	return 0;
+}
+
+static void torture_range_rwlock_write_unlock(void *arg)
+	__releases(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_write_unlock(&torture_range_rwlock, range);
+}
+
+static struct lock_torture_ops range_rwlock_ops = {
+	.writelock	= torture_range_rwlock_write_lock,
+	.write_delay	= torture_rwsem_write_delay,
+	.task_boost     = torture_boost_dummy,
+	.writeunlock	= torture_range_rwlock_write_unlock,
+
+	.readlock       = torture_range_rwlock_read_lock,
+	.read_delay	= torture_rwsem_read_delay,
+	.readunlock     = torture_range_rwlock_read_unlock,
+	.name		= "range_rwlock"
+};
+
 #include <linux/percpu-rwsem.h>
 static struct percpu_rw_semaphore pcpu_rwsem;
 
@@ -604,24 +685,28 @@ void torture_percpu_rwsem_init(void)
 	BUG_ON(percpu_init_rwsem(&pcpu_rwsem));
 }
 
-static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem)
+static int torture_percpu_rwsem_down_write(void *arg)
+	__acquires(pcpu_rwsem)
 {
 	percpu_down_write(&pcpu_rwsem);
 	return 0;
 }
 
-static void torture_percpu_rwsem_up_write(void) __releases(pcpu_rwsem)
+static void torture_percpu_rwsem_up_write(void *arg)
+	__releases(pcpu_rwsem)
 {
 	percpu_up_write(&pcpu_rwsem);
 }
 
-static int torture_percpu_rwsem_down_read(void) __acquires(pcpu_rwsem)
+static int torture_percpu_rwsem_down_read(void *arg)
+	__acquires(pcpu_rwsem)
 {
 	percpu_down_read(&pcpu_rwsem);
 	return 0;
 }
 
-static void torture_percpu_rwsem_up_read(void) __releases(pcpu_rwsem)
+static void torture_percpu_rwsem_up_read(void *arg)
+	__releases(pcpu_rwsem)
 {
 	percpu_up_read(&pcpu_rwsem);
 }
@@ -638,6 +723,17 @@ static struct lock_torture_ops percpu_rwsem_lock_ops = {
 	.name		= "percpu_rwsem_lock"
 };
 
+static void mkrandom_range(struct range_rwlock *range,
+			   struct torture_random_state *trsp)
+{
+	unsigned long start, last;
+
+	last = (torture_random(trsp) >> 4) % (RANGE_POOL_SIZE + 1);
+	start = (torture_random(trsp) >> 4) % (last + 1);
+
+	range_rwlock_init(range, start, last);
+}
+
 /*
  * Lock torture writer kthread.  Repeatedly acquires and releases
  * the lock, checking for duplicate acquisitions.
@@ -646,26 +742,38 @@ static int lock_torture_writer(void *arg)
 {
 	struct lock_stress_stats *lwsp = arg;
 	static DEFINE_TORTURE_RANDOM(rand);
+	bool is_range = locktorture_is_range_lock(torture_type);
 
 	VERBOSE_TOROUT_STRING("lock_torture_writer task started");
 	set_user_nice(current, MAX_NICE);
 
 	do {
+		struct range_rwlock range;
+
 		if ((torture_random(&rand) & 0xfffff) == 0)
 			schedule_timeout_uninterruptible(1);
 
+		if (is_range)
+			mkrandom_range(&range, &rand);
+
 		cxt.cur_ops->task_boost(&rand);
-		cxt.cur_ops->writelock();
-		if (WARN_ON_ONCE(lock_is_write_held))
-			lwsp->n_lock_fail++;
-		lock_is_write_held = 1;
-		if (WARN_ON_ONCE(lock_is_read_held))
-			lwsp->n_lock_fail++; /* rare, but... */
+		cxt.cur_ops->writelock(&range);
+
+		if (!is_range) {
+			if (WARN_ON_ONCE(lock_is_write_held))
+				lwsp->n_lock_fail++;
+			if (WARN_ON_ONCE(lock_is_read_held))
+				lwsp->n_lock_fail++; /* rare, but... */
+
+			lock_is_write_held = true;
+		}
 
 		lwsp->n_lock_acquired++;
 		cxt.cur_ops->write_delay(&rand);
-		lock_is_write_held = 0;
-		cxt.cur_ops->writeunlock();
+
+		if (!is_range)
+			lock_is_write_held = false;
+		cxt.cur_ops->writeunlock(&range);
 
 		stutter_wait("lock_torture_writer");
 	} while (!torture_must_stop());
@@ -683,26 +791,40 @@ static int lock_torture_reader(void *arg)
 {
 	struct lock_stress_stats *lrsp = arg;
 	static DEFINE_TORTURE_RANDOM(rand);
+	bool is_range = locktorture_is_range_lock(torture_type);
 
 	VERBOSE_TOROUT_STRING("lock_torture_reader task started");
 	set_user_nice(current, MAX_NICE);
 
 	do {
+		struct range_rwlock range;
+
 		if ((torture_random(&rand) & 0xfffff) == 0)
 			schedule_timeout_uninterruptible(1);
 
-		cxt.cur_ops->readlock();
-		lock_is_read_held = 1;
-		if (WARN_ON_ONCE(lock_is_write_held))
-			lrsp->n_lock_fail++; /* rare, but... */
+		if (is_range)
+			mkrandom_range(&range, &rand);
+
+		cxt.cur_ops->readlock(&range);
+
+		if (!is_range) {
+			if (WARN_ON_ONCE(lock_is_write_held))
+				lrsp->n_lock_fail++; /* rare, but... */
+
+			lock_is_read_held = 1;
+		}
 
 		lrsp->n_lock_acquired++;
 		cxt.cur_ops->read_delay(&rand);
-		lock_is_read_held = 0;
-		cxt.cur_ops->readunlock();
+
+		if (!is_range)
+			lock_is_read_held = 0;
+
+		cxt.cur_ops->readunlock(&range);
 
 		stutter_wait("lock_torture_reader");
 	} while (!torture_must_stop());
+
 	torture_kthread_stopping("lock_torture_reader");
 	return 0;
 }
@@ -876,6 +998,7 @@ static int __init lock_torture_init(void)
 #endif
 		&rwsem_lock_ops,
 		&percpu_rwsem_lock_ops,
+		&range_rwlock_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose, &torture_runnable))
-- 
2.6.6


* [PATCH 5/5] staging/lustre: Use generic range rwlock
  2017-03-07  5:03 [PATCH 0/5] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2017-03-07  5:03 ` [PATCH 4/5] locking/locktorture: Support range rwlocks Davidlohr Bueso
@ 2017-03-07  5:03 ` Davidlohr Bueso
  2017-03-07  6:05     ` [lustre-devel] " Oleg Drokin
                     ` (2 more replies)
  4 siblings, 3 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-07  5:03 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, dave, linux-kernel,
	oleg.drokin, andreas.dilger, jsimmons, lustre-devel,
	Davidlohr Bueso

This replaces the in-house version, which is also derived
from Jan's interval tree implementation.

Cc: oleg.drokin@intel.com
Cc: andreas.dilger@intel.com
Cc: jsimmons@infradead.org
Cc: lustre-devel@lists.lustre.org

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
XXX: compile tested only. In house uses 'ulong long', generic uses 'ulong', is this a problem?

 drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
 drivers/staging/lustre/lustre/llite/file.c         |  21 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
 drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------------------
 drivers/staging/lustre/lustre/llite/range_lock.h   |  82 -------
 6 files changed, 15 insertions(+), 336 deletions(-)
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h

diff --git a/drivers/staging/lustre/lustre/llite/Makefile b/drivers/staging/lustre/lustre/llite/Makefile
index 322d4fa63f5d..922a901bc62c 100644
--- a/drivers/staging/lustre/lustre/llite/Makefile
+++ b/drivers/staging/lustre/lustre/llite/Makefile
@@ -1,6 +1,6 @@
 obj-$(CONFIG_LUSTRE_FS) += lustre.o
 lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
-	    rw.o rw26.o namei.o symlink.o llite_mmap.o range_lock.o \
+	    rw.o rw26.o namei.o symlink.o llite_mmap.o \
 	    xattr.o xattr_cache.o xattr_security.o \
 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
 	    vvp_dev.o vvp_page.o vvp_lock.o vvp_io.o vvp_object.o \
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 481c0d01d4c6..1a14a79f87f8 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -42,6 +42,7 @@
 #include <linux/file.h>
 #include <linux/sched.h>
 #include <linux/mount.h>
+#include <linux/range_rwlock.h>
 #include "../include/lustre/ll_fiemap.h"
 #include "../include/lustre/lustre_ioctl.h"
 #include "../include/lustre_swab.h"
@@ -1055,7 +1056,7 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 	struct ll_inode_info *lli = ll_i2info(file_inode(file));
 	struct ll_file_data  *fd  = LUSTRE_FPRIVATE(file);
 	struct vvp_io *vio = vvp_env_io(env);
-	struct range_lock range;
+	struct range_rwlock range;
 	struct cl_io	 *io;
 	ssize_t result = 0;
 	int rc = 0;
@@ -1072,9 +1073,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		bool range_locked = false;
 
 		if (file->f_flags & O_APPEND)
-			range_lock_init(&range, 0, LUSTRE_EOF);
+			range_rwlock_init(&range, 0, LUSTRE_EOF);
 		else
-			range_lock_init(&range, *ppos, *ppos + count - 1);
+			range_rwlock_init(&range, *ppos, *ppos + count - 1);
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
 		vio->vui_iter = args->u.normal.via_iter;
@@ -1087,10 +1088,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		if (((iot == CIT_WRITE) ||
 		     (iot == CIT_READ && (file->f_flags & O_DIRECT))) &&
 		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-			CDEBUG(D_VFSTRACE, "Range lock [%llu, %llu]\n",
-			       range.rl_node.in_extent.start,
-			       range.rl_node.in_extent.end);
-			rc = range_lock(&lli->lli_write_tree, &range);
+			CDEBUG(D_VFSTRACE, "Range lock [%lu, %lu]\n",
+			       range.node.start, range.node.last);
+			rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
 			if (rc < 0)
 				goto out;
 
@@ -1100,10 +1100,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		rc = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
 		if (range_locked) {
-			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
-			       range.rl_node.in_extent.start,
-			       range.rl_node.in_extent.end);
-			range_unlock(&lli->lli_write_tree, &range);
+			CDEBUG(D_VFSTRACE, "Range unlock [%lu, %lu]\n",
+			       range.node.start, range.node.last);
+			range_write_unlock(&lli->lli_write_tree, &range);
 		}
 	} else {
 		/* cl_io_rw_init() handled IO */
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 55f68acd85d1..aa2ae72e3e70 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -49,8 +49,8 @@
 #include <linux/namei.h>
 #include <linux/xattr.h>
 #include <linux/posix_acl_xattr.h>
+#include <linux/range_rwlock.h>
 #include "vvp_internal.h"
-#include "range_lock.h"
 
 #ifndef FMODE_EXEC
 #define FMODE_EXEC 0
@@ -193,7 +193,7 @@ struct ll_inode_info {
 			 * }
 			 */
 			struct rw_semaphore		lli_trunc_sem;
-			struct range_lock_tree		lli_write_tree;
+			struct range_rwlock_tree	lli_write_tree;
 
 			struct rw_semaphore		lli_glimpse_sem;
 			unsigned long			lli_glimpse_time;
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index b229cbc7bb33..8054e916b3f5 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -40,6 +40,7 @@
 #include <linux/statfs.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/range_rwlock.h>
 
 #include "../include/lustre/lustre_ioctl.h"
 #include "../include/lustre_ha.h"
@@ -853,7 +854,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		mutex_init(&lli->lli_size_mutex);
 		lli->lli_symlink_name = NULL;
 		init_rwsem(&lli->lli_trunc_sem);
-		range_lock_tree_init(&lli->lli_write_tree);
+		range_rwlock_tree_init(&lli->lli_write_tree);
 		init_rwsem(&lli->lli_glimpse_sem);
 		lli->lli_glimpse_time = 0;
 		INIT_LIST_HEAD(&lli->lli_agl_list);
diff --git a/drivers/staging/lustre/lustre/llite/range_lock.c b/drivers/staging/lustre/lustre/llite/range_lock.c
deleted file mode 100644
index 14148a097476..000000000000
--- a/drivers/staging/lustre/lustre/llite/range_lock.c
+++ /dev/null
@@ -1,239 +0,0 @@
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Range lock is used to allow multiple threads writing a single shared
- * file given each thread is writing to a non-overlapping portion of the
- * file.
- *
- * Refer to the possible upstream kernel version of range lock by
- * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
- *
- * This file could later replaced by the upstream kernel version.
- */
-/*
- * Author: Prakash Surya <surya1@llnl.gov>
- * Author: Bobi Jam <bobijam.xu@intel.com>
- */
-#include "range_lock.h"
-#include "../include/lustre/lustre_user.h"
-
-/**
- * Initialize a range lock tree
- *
- * \param tree [in]	an empty range lock tree
- *
- * Pre:  Caller should have allocated the range lock tree.
- * Post: The range lock tree is ready to function.
- */
-void range_lock_tree_init(struct range_lock_tree *tree)
-{
-	tree->rlt_root = NULL;
-	tree->rlt_sequence = 0;
-	spin_lock_init(&tree->rlt_lock);
-}
-
-/**
- * Initialize a range lock node
- *
- * \param lock  [in]	an empty range lock node
- * \param start [in]	start of the covering region
- * \param end   [in]	end of the covering region
- *
- * Pre:  Caller should have allocated the range lock node.
- * Post: The range lock node is meant to cover [start, end] region
- */
-int range_lock_init(struct range_lock *lock, __u64 start, __u64 end)
-{
-	int rc;
-
-	memset(&lock->rl_node, 0, sizeof(lock->rl_node));
-	if (end != LUSTRE_EOF)
-		end >>= PAGE_SHIFT;
-	rc = interval_set(&lock->rl_node, start >> PAGE_SHIFT, end);
-	if (rc)
-		return rc;
-
-	INIT_LIST_HEAD(&lock->rl_next_lock);
-	lock->rl_task = NULL;
-	lock->rl_lock_count = 0;
-	lock->rl_blocking_ranges = 0;
-	lock->rl_sequence = 0;
-	return rc;
-}
-
-static inline struct range_lock *next_lock(struct range_lock *lock)
-{
-	return list_entry(lock->rl_next_lock.next, typeof(*lock), rl_next_lock);
-}
-
-/**
- * Helper function of range_unlock()
- *
- * \param node [in]	a range lock found overlapped during interval node
- *			search
- * \param arg [in]	the range lock to be tested
- *
- * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
- *				overlapping range node
- * \retval INTERVAL_ITER_STOP	indicate to stop the search
- */
-static enum interval_iter range_unlock_cb(struct interval_node *node, void *arg)
-{
-	struct range_lock *lock = arg;
-	struct range_lock *overlap = node2rangelock(node);
-	struct range_lock *iter;
-
-	list_for_each_entry(iter, &overlap->rl_next_lock, rl_next_lock) {
-		if (iter->rl_sequence > lock->rl_sequence) {
-			--iter->rl_blocking_ranges;
-			LASSERT(iter->rl_blocking_ranges > 0);
-		}
-	}
-	if (overlap->rl_sequence > lock->rl_sequence) {
-		--overlap->rl_blocking_ranges;
-		if (overlap->rl_blocking_ranges == 0)
-			wake_up_process(overlap->rl_task);
-	}
-	return INTERVAL_ITER_CONT;
-}
-
-/**
- * Unlock a range lock, wake up locks blocked by this lock.
- *
- * \param tree [in]	range lock tree
- * \param lock [in]	range lock to be deleted
- *
- * If this lock has been granted, relase it; if not, just delete it from
- * the tree or the same region lock list. Wake up those locks only blocked
- * by this lock through range_unlock_cb().
- */
-void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
-{
-	spin_lock(&tree->rlt_lock);
-	if (!list_empty(&lock->rl_next_lock)) {
-		struct range_lock *next;
-
-		if (interval_is_intree(&lock->rl_node)) { /* first lock */
-			/* Insert the next same range lock into the tree */
-			next = next_lock(lock);
-			next->rl_lock_count = lock->rl_lock_count - 1;
-			interval_erase(&lock->rl_node, &tree->rlt_root);
-			interval_insert(&next->rl_node, &tree->rlt_root);
-		} else {
-			/* find the first lock in tree */
-			list_for_each_entry(next, &lock->rl_next_lock,
-					    rl_next_lock) {
-				if (!interval_is_intree(&next->rl_node))
-					continue;
-
-				LASSERT(next->rl_lock_count > 0);
-				next->rl_lock_count--;
-				break;
-			}
-		}
-		list_del_init(&lock->rl_next_lock);
-	} else {
-		LASSERT(interval_is_intree(&lock->rl_node));
-		interval_erase(&lock->rl_node, &tree->rlt_root);
-	}
-
-	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
-			range_unlock_cb, lock);
-	spin_unlock(&tree->rlt_lock);
-}
-
-/**
- * Helper function of range_lock()
- *
- * \param node [in]	a range lock found overlapped during interval node
- *			search
- * \param arg [in]	the range lock to be tested
- *
- * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
- *				overlapping range node
- * \retval INTERVAL_ITER_STOP	indicate to stop the search
- */
-static enum interval_iter range_lock_cb(struct interval_node *node, void *arg)
-{
-	struct range_lock *lock = (struct range_lock *)arg;
-	struct range_lock *overlap = node2rangelock(node);
-
-	lock->rl_blocking_ranges += overlap->rl_lock_count + 1;
-	return INTERVAL_ITER_CONT;
-}
-
-/**
- * Lock a region
- *
- * \param tree [in]	range lock tree
- * \param lock [in]	range lock node containing the region span
- *
- * \retval 0	get the range lock
- * \retval <0	error code while not getting the range lock
- *
- * If there exists overlapping range lock, the new lock will wait and
- * retry, if later it find that it is not the chosen one to wake up,
- * it wait again.
- */
-int range_lock(struct range_lock_tree *tree, struct range_lock *lock)
-{
-	struct interval_node *node;
-	int rc = 0;
-
-	spin_lock(&tree->rlt_lock);
-	/*
-	 * We need to check for all conflicting intervals
-	 * already in the tree.
-	 */
-	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
-			range_lock_cb, lock);
-	/*
-	 * Insert to the tree if I am unique, otherwise I've been linked to
-	 * the rl_next_lock of another lock which has the same range as mine
-	 * in range_lock_cb().
-	 */
-	node = interval_insert(&lock->rl_node, &tree->rlt_root);
-	if (node) {
-		struct range_lock *tmp = node2rangelock(node);
-
-		list_add_tail(&lock->rl_next_lock, &tmp->rl_next_lock);
-		tmp->rl_lock_count++;
-	}
-	lock->rl_sequence = ++tree->rlt_sequence;
-
-	while (lock->rl_blocking_ranges > 0) {
-		lock->rl_task = current;
-		__set_current_state(TASK_INTERRUPTIBLE);
-		spin_unlock(&tree->rlt_lock);
-		schedule();
-
-		if (signal_pending(current)) {
-			range_unlock(tree, lock);
-			rc = -EINTR;
-			goto out;
-		}
-		spin_lock(&tree->rlt_lock);
-	}
-	spin_unlock(&tree->rlt_lock);
-out:
-	return rc;
-}
diff --git a/drivers/staging/lustre/lustre/llite/range_lock.h b/drivers/staging/lustre/lustre/llite/range_lock.h
deleted file mode 100644
index 779091ccec4e..000000000000
--- a/drivers/staging/lustre/lustre/llite/range_lock.h
+++ /dev/null
@@ -1,82 +0,0 @@
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Range lock is used to allow multiple threads writing a single shared
- * file given each thread is writing to a non-overlapping portion of the
- * file.
- *
- * Refer to the possible upstream kernel version of range lock by
- * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
- *
- * This file could later replaced by the upstream kernel version.
- */
-/*
- * Author: Prakash Surya <surya1@llnl.gov>
- * Author: Bobi Jam <bobijam.xu@intel.com>
- */
-#ifndef _RANGE_LOCK_H
-#define _RANGE_LOCK_H
-
-#include "../../include/linux/libcfs/libcfs.h"
-#include "../include/interval_tree.h"
-
-struct range_lock {
-	struct interval_node	rl_node;
-	/**
-	 * Process to enqueue this lock.
-	 */
-	struct task_struct	*rl_task;
-	/**
-	 * List of locks with the same range.
-	 */
-	struct list_head	rl_next_lock;
-	/**
-	 * Number of locks in the list rl_next_lock
-	 */
-	unsigned int		rl_lock_count;
-	/**
-	 * Number of ranges which are blocking acquisition of the lock
-	 */
-	unsigned int		rl_blocking_ranges;
-	/**
-	 * Sequence number of range lock. This number is used to get to know
-	 * the order the locks are queued; this is required for range_cancel().
-	 */
-	__u64			rl_sequence;
-};
-
-static inline struct range_lock *node2rangelock(const struct interval_node *n)
-{
-	return container_of(n, struct range_lock, rl_node);
-}
-
-struct range_lock_tree {
-	struct interval_node	*rlt_root;
-	spinlock_t		 rlt_lock;	/* protect range lock tree */
-	__u64			 rlt_sequence;
-};
-
-void range_lock_tree_init(struct range_lock_tree *tree);
-int range_lock_init(struct range_lock *lock, __u64 start, __u64 end);
-int  range_lock(struct range_lock_tree *tree, struct range_lock *lock);
-void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
-#endif
-- 
2.6.6
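
For reference, the call-site conversion in this patch boils down to the usage
pattern below. This is a simplified sketch only, not code from the patch: the
helper name, locals and trimmed error handling are made up for illustration,
while struct range_rwlock and the range_rwlock_* calls are the ones this
series introduces.

static int write_range_locked(struct ll_inode_info *lli, loff_t pos,
                              size_t count, bool append)
{
        struct range_rwlock range;
        int rc;

        /* O_APPEND writers cover the whole file, others a byte range. */
        if (append)
                range_rwlock_init(&range, 0, LUSTRE_EOF);
        else
                range_rwlock_init(&range, pos, pos + count - 1);

        /* Replaces the old range_lock()/range_unlock() pair. */
        rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
        if (rc < 0)
                return rc;

        /* ... perform the write under the range lock ... */

        range_write_unlock(&lli->lli_write_tree, &range);
        return 0;
}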

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] staging/lustre: Use generic range rwlock
  2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
@ 2017-03-07  6:05     ` Oleg Drokin
  2017-03-09  8:56     ` [lustre-devel] " kbuild test robot
  2017-03-09 14:40     ` [lustre-devel] " kbuild test robot
  2 siblings, 0 replies; 37+ messages in thread
From: Oleg Drokin @ 2017-03-07  6:05 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, andreas.dilger, jsimmons, lustre-devel,
	Davidlohr Bueso


On Mar 7, 2017, at 12:03 AM, Davidlohr Bueso wrote:

> This replaces the in-house version, which is also derived
> from Jan's interval tree implementation.
> 
> Cc: oleg.drokin@intel.com
> Cc: andreas.dilger@intel.com
> Cc: jsimmons@infradead.org
> Cc: lustre-devel@lists.lustre.org
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> XXX: compile tested only. In house uses 'ulong long', generic uses 'ulong', is this a problem?

Hm, I cannot seem to find the other patches in this series anywhere to verify against,
and my subscription to linux-kernel broke, as it turns out.
You mean the range is ulong? So this can only work up to 2G offsets on
32-bit systems and then wrap around?

> 
> drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
> drivers/staging/lustre/lustre/llite/file.c         |  21 +-
> .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
> drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
> drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------------------
> drivers/staging/lustre/lustre/llite/range_lock.h   |  82 -------
> 6 files changed, 15 insertions(+), 336 deletions(-)
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h
> 
> diff --git a/drivers/staging/lustre/lustre/llite/Makefile b/drivers/staging/lustre/lustre/llite/Makefile
> index 322d4fa63f5d..922a901bc62c 100644
> --- a/drivers/staging/lustre/lustre/llite/Makefile
> +++ b/drivers/staging/lustre/lustre/llite/Makefile
> @@ -1,6 +1,6 @@
> obj-$(CONFIG_LUSTRE_FS) += lustre.o
> lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
> -	    rw.o rw26.o namei.o symlink.o llite_mmap.o range_lock.o \
> +	    rw.o rw26.o namei.o symlink.o llite_mmap.o \
> 	    xattr.o xattr_cache.o xattr_security.o \
> 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
> 	    vvp_dev.o vvp_page.o vvp_lock.o vvp_io.o vvp_object.o \
> diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
> index 481c0d01d4c6..1a14a79f87f8 100644
> --- a/drivers/staging/lustre/lustre/llite/file.c
> +++ b/drivers/staging/lustre/lustre/llite/file.c
> @@ -42,6 +42,7 @@
> #include <linux/file.h>
> #include <linux/sched.h>
> #include <linux/mount.h>
> +#include <linux/range_rwlock.h>
> #include "../include/lustre/ll_fiemap.h"
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_swab.h"
> @@ -1055,7 +1056,7 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 	struct ll_inode_info *lli = ll_i2info(file_inode(file));
> 	struct ll_file_data  *fd  = LUSTRE_FPRIVATE(file);
> 	struct vvp_io *vio = vvp_env_io(env);
> -	struct range_lock range;
> +	struct range_rwlock range;
> 	struct cl_io	 *io;
> 	ssize_t result = 0;
> 	int rc = 0;
> @@ -1072,9 +1073,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		bool range_locked = false;
> 
> 		if (file->f_flags & O_APPEND)
> -			range_lock_init(&range, 0, LUSTRE_EOF);
> +			range_rwlock_init(&range, 0, LUSTRE_EOF);
> 		else
> -			range_lock_init(&range, *ppos, *ppos + count - 1);
> +			range_rwlock_init(&range, *ppos, *ppos + count - 1);
> 
> 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
> 		vio->vui_iter = args->u.normal.via_iter;
> @@ -1087,10 +1088,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		if (((iot == CIT_WRITE) ||
> 		     (iot == CIT_READ && (file->f_flags & O_DIRECT))) &&
> 		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
> -			CDEBUG(D_VFSTRACE, "Range lock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			rc = range_lock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range lock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
> 			if (rc < 0)
> 				goto out;
> 
> @@ -1100,10 +1100,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		rc = cl_io_loop(env, io);
> 		ll_cl_remove(file, env);
> 		if (range_locked) {
> -			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			range_unlock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range unlock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			range_write_unlock(&lli->lli_write_tree, &range);
> 		}
> 	} else {
> 		/* cl_io_rw_init() handled IO */
> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
> index 55f68acd85d1..aa2ae72e3e70 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
> @@ -49,8 +49,8 @@
> #include <linux/namei.h>
> #include <linux/xattr.h>
> #include <linux/posix_acl_xattr.h>
> +#include <linux/range_rwlock.h>
> #include "vvp_internal.h"
> -#include "range_lock.h"
> 
> #ifndef FMODE_EXEC
> #define FMODE_EXEC 0
> @@ -193,7 +193,7 @@ struct ll_inode_info {
> 			 * }
> 			 */
> 			struct rw_semaphore		lli_trunc_sem;
> -			struct range_lock_tree		lli_write_tree;
> +			struct range_rwlock_tree	lli_write_tree;
> 
> 			struct rw_semaphore		lli_glimpse_sem;
> 			unsigned long			lli_glimpse_time;
> diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> index b229cbc7bb33..8054e916b3f5 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> @@ -40,6 +40,7 @@
> #include <linux/statfs.h>
> #include <linux/types.h>
> #include <linux/mm.h>
> +#include <linux/range_rwlock.h>
> 
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_ha.h"
> @@ -853,7 +854,7 @@ void ll_lli_init(struct ll_inode_info *lli)
> 		mutex_init(&lli->lli_size_mutex);
> 		lli->lli_symlink_name = NULL;
> 		init_rwsem(&lli->lli_trunc_sem);
> -		range_lock_tree_init(&lli->lli_write_tree);
> +		range_rwlock_tree_init(&lli->lli_write_tree);
> 		init_rwsem(&lli->lli_glimpse_sem);
> 		lli->lli_glimpse_time = 0;
> 		INIT_LIST_HEAD(&lli->lli_agl_list);
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.c b/drivers/staging/lustre/lustre/llite/range_lock.c
> deleted file mode 100644
> index 14148a097476..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.c
> +++ /dev/null
> @@ -1,239 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#include "range_lock.h"
> -#include "../include/lustre/lustre_user.h"
> -
> -/**
> - * Initialize a range lock tree
> - *
> - * \param tree [in]	an empty range lock tree
> - *
> - * Pre:  Caller should have allocated the range lock tree.
> - * Post: The range lock tree is ready to function.
> - */
> -void range_lock_tree_init(struct range_lock_tree *tree)
> -{
> -	tree->rlt_root = NULL;
> -	tree->rlt_sequence = 0;
> -	spin_lock_init(&tree->rlt_lock);
> -}
> -
> -/**
> - * Initialize a range lock node
> - *
> - * \param lock  [in]	an empty range lock node
> - * \param start [in]	start of the covering region
> - * \param end   [in]	end of the covering region
> - *
> - * Pre:  Caller should have allocated the range lock node.
> - * Post: The range lock node is meant to cover [start, end] region
> - */
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end)
> -{
> -	int rc;
> -
> -	memset(&lock->rl_node, 0, sizeof(lock->rl_node));
> -	if (end != LUSTRE_EOF)
> -		end >>= PAGE_SHIFT;
> -	rc = interval_set(&lock->rl_node, start >> PAGE_SHIFT, end);
> -	if (rc)
> -		return rc;
> -
> -	INIT_LIST_HEAD(&lock->rl_next_lock);
> -	lock->rl_task = NULL;
> -	lock->rl_lock_count = 0;
> -	lock->rl_blocking_ranges = 0;
> -	lock->rl_sequence = 0;
> -	return rc;
> -}
> -
> -static inline struct range_lock *next_lock(struct range_lock *lock)
> -{
> -	return list_entry(lock->rl_next_lock.next, typeof(*lock), rl_next_lock);
> -}
> -
> -/**
> - * Helper function of range_unlock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_unlock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -	struct range_lock *iter;
> -
> -	list_for_each_entry(iter, &overlap->rl_next_lock, rl_next_lock) {
> -		if (iter->rl_sequence > lock->rl_sequence) {
> -			--iter->rl_blocking_ranges;
> -			LASSERT(iter->rl_blocking_ranges > 0);
> -		}
> -	}
> -	if (overlap->rl_sequence > lock->rl_sequence) {
> -		--overlap->rl_blocking_ranges;
> -		if (overlap->rl_blocking_ranges == 0)
> -			wake_up_process(overlap->rl_task);
> -	}
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Unlock a range lock, wake up locks blocked by this lock.
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock to be deleted
> - *
> - * If this lock has been granted, relase it; if not, just delete it from
> - * the tree or the same region lock list. Wake up those locks only blocked
> - * by this lock through range_unlock_cb().
> - */
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	spin_lock(&tree->rlt_lock);
> -	if (!list_empty(&lock->rl_next_lock)) {
> -		struct range_lock *next;
> -
> -		if (interval_is_intree(&lock->rl_node)) { /* first lock */
> -			/* Insert the next same range lock into the tree */
> -			next = next_lock(lock);
> -			next->rl_lock_count = lock->rl_lock_count - 1;
> -			interval_erase(&lock->rl_node, &tree->rlt_root);
> -			interval_insert(&next->rl_node, &tree->rlt_root);
> -		} else {
> -			/* find the first lock in tree */
> -			list_for_each_entry(next, &lock->rl_next_lock,
> -					    rl_next_lock) {
> -				if (!interval_is_intree(&next->rl_node))
> -					continue;
> -
> -				LASSERT(next->rl_lock_count > 0);
> -				next->rl_lock_count--;
> -				break;
> -			}
> -		}
> -		list_del_init(&lock->rl_next_lock);
> -	} else {
> -		LASSERT(interval_is_intree(&lock->rl_node));
> -		interval_erase(&lock->rl_node, &tree->rlt_root);
> -	}
> -
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_unlock_cb, lock);
> -	spin_unlock(&tree->rlt_lock);
> -}
> -
> -/**
> - * Helper function of range_lock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_lock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = (struct range_lock *)arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -
> -	lock->rl_blocking_ranges += overlap->rl_lock_count + 1;
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Lock a region
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock node containing the region span
> - *
> - * \retval 0	get the range lock
> - * \retval <0	error code while not getting the range lock
> - *
> - * If there exists overlapping range lock, the new lock will wait and
> - * retry, if later it find that it is not the chosen one to wake up,
> - * it wait again.
> - */
> -int range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	struct interval_node *node;
> -	int rc = 0;
> -
> -	spin_lock(&tree->rlt_lock);
> -	/*
> -	 * We need to check for all conflicting intervals
> -	 * already in the tree.
> -	 */
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_lock_cb, lock);
> -	/*
> -	 * Insert to the tree if I am unique, otherwise I've been linked to
> -	 * the rl_next_lock of another lock which has the same range as mine
> -	 * in range_lock_cb().
> -	 */
> -	node = interval_insert(&lock->rl_node, &tree->rlt_root);
> -	if (node) {
> -		struct range_lock *tmp = node2rangelock(node);
> -
> -		list_add_tail(&lock->rl_next_lock, &tmp->rl_next_lock);
> -		tmp->rl_lock_count++;
> -	}
> -	lock->rl_sequence = ++tree->rlt_sequence;
> -
> -	while (lock->rl_blocking_ranges > 0) {
> -		lock->rl_task = current;
> -		__set_current_state(TASK_INTERRUPTIBLE);
> -		spin_unlock(&tree->rlt_lock);
> -		schedule();
> -
> -		if (signal_pending(current)) {
> -			range_unlock(tree, lock);
> -			rc = -EINTR;
> -			goto out;
> -		}
> -		spin_lock(&tree->rlt_lock);
> -	}
> -	spin_unlock(&tree->rlt_lock);
> -out:
> -	return rc;
> -}
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.h b/drivers/staging/lustre/lustre/llite/range_lock.h
> deleted file mode 100644
> index 779091ccec4e..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.h
> +++ /dev/null
> @@ -1,82 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#ifndef _RANGE_LOCK_H
> -#define _RANGE_LOCK_H
> -
> -#include "../../include/linux/libcfs/libcfs.h"
> -#include "../include/interval_tree.h"
> -
> -struct range_lock {
> -	struct interval_node	rl_node;
> -	/**
> -	 * Process to enqueue this lock.
> -	 */
> -	struct task_struct	*rl_task;
> -	/**
> -	 * List of locks with the same range.
> -	 */
> -	struct list_head	rl_next_lock;
> -	/**
> -	 * Number of locks in the list rl_next_lock
> -	 */
> -	unsigned int		rl_lock_count;
> -	/**
> -	 * Number of ranges which are blocking acquisition of the lock
> -	 */
> -	unsigned int		rl_blocking_ranges;
> -	/**
> -	 * Sequence number of range lock. This number is used to get to know
> -	 * the order the locks are queued; this is required for range_cancel().
> -	 */
> -	__u64			rl_sequence;
> -};
> -
> -static inline struct range_lock *node2rangelock(const struct interval_node *n)
> -{
> -	return container_of(n, struct range_lock, rl_node);
> -}
> -
> -struct range_lock_tree {
> -	struct interval_node	*rlt_root;
> -	spinlock_t		 rlt_lock;	/* protect range lock tree */
> -	__u64			 rlt_sequence;
> -};
> -
> -void range_lock_tree_init(struct range_lock_tree *tree);
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end);
> -int  range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> -#endif
> -- 
> 2.6.6

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] staging/lustre: Use generic range rwlock
  2017-03-07  6:05     ` [lustre-devel] " Oleg Drokin
  (?)
@ 2017-03-08 15:02     ` Davidlohr Bueso
  -1 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-08 15:02 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, andreas.dilger, jsimmons, lustre-devel,
	Davidlohr Bueso

On Tue, 07 Mar 2017, Oleg Drokin wrote:
>On Mar 7, 2017, at 12:03 AM, Davidlohr Bueso wrote:
>
>> This replaces the in-house version, which is also derived
>> from Jan's interval tree implementation.
>>
>> Cc: oleg.drokin@intel.com
>> Cc: andreas.dilger@intel.com
>> Cc: jsimmons@infradead.org
>> Cc: lustre-devel@lists.lustre.org
>>
>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
>> ---
>> XXX: compile tested only. In house uses 'ulong long', generic uses 'ulong', is this a problem?
>
>Hm, cannot seem to find the other patches in this series anywhere to verify and my subscription to linux-kernel broke as it turns out.

You can find the full series here:

https://lwn.net/Articles/716383/

>You mean the range is ulong? So only can have this working up to 2G offsets on the
>32bit systems and then wrap around?

Yes.
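
To make that concrete, a small stand-alone illustration (not from the series)
of what happens when a 64-bit file offset is stored in an unsigned long on a
32-bit build:

#include <stdio.h>

int main(void)
{
        /* A file offset that does not fit in 32 bits. */
        unsigned long long pos = 6ULL << 30;            /* 6 GiB */
        unsigned long start = (unsigned long)pos;       /* truncated on ILP32 */

        printf("sizeof(unsigned long) = %zu, start = %lu\n",
               sizeof(unsigned long), start);
        return 0;
}

On a 64-bit kernel both values match, but with a 4-byte unsigned long the
offset wraps, which is the limitation being discussed here.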

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] staging/lustre: Use generic range rwlock
  2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
@ 2017-03-09  8:56     ` kbuild test robot
  2017-03-09  8:56     ` [lustre-devel] " kbuild test robot
  2017-03-09 14:40     ` [lustre-devel] " kbuild test robot
  2 siblings, 0 replies; 37+ messages in thread
From: kbuild test robot @ 2017-03-09  8:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: kbuild-all, mingo, peterz, akpm, jack, kirill.shutemov, mhocko,
	mgorman, dave, linux-kernel, oleg.drokin, andreas.dilger,
	jsimmons, lustre-devel, Davidlohr Bueso

[-- Attachment #1: Type: text/plain, Size: 3631 bytes --]

Hi Davidlohr,

[auto build test WARNING on staging/staging-testing]
[also build test WARNING on v4.11-rc1 next-20170309]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Davidlohr-Bueso/locking-Introduce-range-reader-writer-lock/20170309-140444
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from drivers/staging/lustre/lustre/llite/../include/lustre/lustre_idl.h:76:0,
                    from drivers/staging/lustre/lustre/llite/../include/lustre_lib.h:49,
                    from drivers/staging/lustre/lustre/llite/../include/lustre_dlm.h:47,
                    from drivers/staging/lustre/lustre/llite/file.c:40:
   drivers/staging/lustre/lustre/llite/file.c: In function 'll_file_io_generic':
>> drivers/staging/lustre/lustre/llite/../include/lustre/lustre_user.h:78:20: warning: large integer implicitly truncated to unsigned type [-Woverflow]
    #define LUSTRE_EOF 0xffffffffffffffffULL
                       ^
>> drivers/staging/lustre/lustre/llite/file.c:1072:33: note: in expansion of macro 'LUSTRE_EOF'
       range_rwlock_init(&range, 0, LUSTRE_EOF);
                                    ^~~~~~~~~~
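
In other words, the 64-bit LUSTRE_EOF constant is being narrowed to the
unsigned long parameters of range_rwlock_init() on i386. A minimal
stand-alone reproducer of the same diagnostic (illustrative only; the
function below is a stand-in, not the real initializer):

#define LUSTRE_EOF 0xffffffffffffffffULL

static void init_range(unsigned long start, unsigned long last)
{
        /* stand-in for range_rwlock_init(); body irrelevant here */
        (void)start;
        (void)last;
}

void example(void)
{
        /* On 32-bit targets this triggers -Woverflow:
         * "large integer implicitly truncated to unsigned type"
         */
        init_range(0, LUSTRE_EOF);
}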

vim +78 drivers/staging/lustre/lustre/llite/../include/lustre/lustre_user.h

23ec6607e9 John L. Hammond 2016-09-18  62   * are co-existing.
23ec6607e9 John L. Hammond 2016-09-18  63   */
23ec6607e9 John L. Hammond 2016-09-18  64  #if __BITS_PER_LONG != 64 || defined(__ARCH_WANT_STAT64)
23ec6607e9 John L. Hammond 2016-09-18  65  typedef struct stat64   lstat_t;
23ec6607e9 John L. Hammond 2016-09-18  66  #define lstat_f  lstat64
f0cf21abcc John L. Hammond 2016-10-02  67  #define fstat_f		fstat64
f0cf21abcc John L. Hammond 2016-10-02  68  #define fstatat_f	fstatat64
23ec6607e9 John L. Hammond 2016-09-18  69  #else
23ec6607e9 John L. Hammond 2016-09-18  70  typedef struct stat     lstat_t;
23ec6607e9 John L. Hammond 2016-09-18  71  #define lstat_f  lstat
f0cf21abcc John L. Hammond 2016-10-02  72  #define fstat_f		fstat
f0cf21abcc John L. Hammond 2016-10-02  73  #define fstatat_f	fstatat
23ec6607e9 John L. Hammond 2016-09-18  74  #endif
23ec6607e9 John L. Hammond 2016-09-18  75  
23ec6607e9 John L. Hammond 2016-09-18  76  #define HAVE_LOV_USER_MDS_DATA
d7e09d0397 Peng Tao        2013-05-02  77  
00c0a6aea0 John L. Hammond 2016-08-16 @78  #define LUSTRE_EOF 0xffffffffffffffffULL
00c0a6aea0 John L. Hammond 2016-08-16  79  
d7e09d0397 Peng Tao        2013-05-02  80  /* for statfs() */
d7e09d0397 Peng Tao        2013-05-02  81  #define LL_SUPER_MAGIC 0x0BD00BD0
d7e09d0397 Peng Tao        2013-05-02  82  
d7e09d0397 Peng Tao        2013-05-02  83  #ifndef FSFILT_IOC_GETFLAGS
d7e09d0397 Peng Tao        2013-05-02  84  #define FSFILT_IOC_GETFLAGS	       _IOR('f', 1, long)
d7e09d0397 Peng Tao        2013-05-02  85  #define FSFILT_IOC_SETFLAGS	       _IOW('f', 2, long)
d7e09d0397 Peng Tao        2013-05-02  86  #define FSFILT_IOC_GETVERSION	     _IOR('f', 3, long)

:::::: The code at line 78 was first introduced by commit
:::::: 00c0a6aea0d0ab2c11594616244d787ad7bf64dc staging: lustre: uapi: reduce scope of lustre_idl.h

:::::: TO: John L. Hammond <john.hammond@intel.com>
:::::: CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 59084 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
@ 2017-03-09 11:03   ` Jan Kara
  2017-03-28 10:00   ` Laurent Dufour
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-03-09 11:03 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

Hi!

On Mon 06-03-17 21:03:26, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
> 
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
> 
> When new range lock is requested, we add its interval to the tree and store
> number of intervals intersecting it to 'blocking_ranges'. When a range is
> unlocked, we again walk intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up owner of any range lock whose
> 'blocking_ranges' drops to 0.
> 
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
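[ For reference, the usage pattern implied by the above boils down to the
  sketch below. It is only an illustration built from the declarations in
  this patch; the surrounding function and variable names are made up. ]

	#include <linux/range_rwlock.h>

	/* one tree per protected resource */
	DEFINE_RANGE_RWLOCK_TREE(mytree);

	void frob_range(unsigned long start, unsigned long last)
	{
		struct range_rwlock range;	/* normally stack allocated */

		range_rwlock_init(&range, start, last);	/* caller ensures start <= last */
		range_write_lock(&mytree, &range);
		/* ... exclusive access to [start, last] ... */
		range_write_unlock(&mytree, &range);
	}

	void peek_range(unsigned long start, unsigned long last)
	{
		struct range_rwlock range;

		range_rwlock_init(&range, start, last);
		range_read_lock(&mytree, &range);
		/* ... shared access; overlapping readers do not block each other ... */
		range_read_unlock(&mytree, &range);
	}

[ Per the changelog, the range is explicitly initialized by the caller before
  taking the lock, and may not be (re)initialized while it is held. ]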

Thanks for pushing this into completion! I like the patch and FWIW I didn't
find any problem in it so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> the total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
> 
> Due to its sharable nature, full range locks can be compared with rw-semaphores,
> which also serve from a mutex standpoint, as writer-only situations are
> pretty similar nowadays.
> 
> The first difference is the memory footprint: rwsems are larger than tree locks, 40 vs
> 24 bytes, but the latter requires an additional 64 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as it's the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> between both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have the writer spin-on-owner feature.
> 
> On both machines, when actually testing threads acquiring different ranges, the
> throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  drivers/gpu/drm/Kconfig       |   2 -
>  drivers/gpu/drm/i915/Kconfig  |   1 -
>  include/linux/range_rwlock.h  |  96 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
>  lib/Kconfig                   |  14 --
>  lib/Kconfig.debug             |   1 -
>  lib/Makefile                  |   2 +-
>  8 files changed, 560 insertions(+), 20 deletions(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have an ATI Radeon graphics card.  There
>  	  are both PCI and AGP versions.  You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have a recent AMD Radeon graphics card.
>  
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
>  	depends on DRM
>  	depends on X86 && PCI
>  	select INTEL_GTT
> -	select INTERVAL_TREE
>  	# we need shmfs for the swappable backing store, and in particular
>  	# the shmem_readpage() which depends upon tmpfs
>  	select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *task;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	bool reader;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
> +		.node = {			\
> +			.start = (start)	\
> +			,.last = (last)		\
> +		}				\
> +	}
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
>  
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
>  
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		tree->leftmost = &lock->node;
> +	else if (lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		  unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			unsigned long flags;
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +			lock->reader = false;
> +			__range_tree_remove(tree, lock);
> +			spin_unlock_irqrestore(&tree->lock, flags);
> +			break;
> +		}
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +	return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = true;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					 lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int ret = true;
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	/*
> +	 * We have overlapping ranges in the tree, ensure that we can
> +	 * in fact share the lock.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader) {
> +			ret = false;
> +			goto unlock;
> +		}
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	lock->reader = true;
> +	__range_tree_insert(tree, lock);
> +unlock:
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> +			  struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = false;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		/*
> +		 * As a writer, we always consider an existing node. We
> +		 * need to block; either the intersecting node is another
> +		 * writer or we have a reader that needs to finish.
> +		 */
> +		lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int intersects;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	intersects = __range_intersects_intree(tree, lock);
> +
> +	if (!intersects) {
> +		lock->reader = false;
> +		__range_tree_insert(tree, lock);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +				     lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		range_rwlock_unblock(blocked_lock, &wake_q);
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock)
> +{
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	WARN_ON(lock->reader);
> +
> +	/*
> +	 * Unaccount for any blocked reader lock. Wakeup if possible.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	lock->reader = true;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
>  config BTREE
>  	bool
>  
> -config INTERVAL_TREE
> -	bool
> -	help
> -	  Simple, embeddable, interval-tree. Can find the start of an
> -	  overlapping range in log(n) time and then iterate over all
> -	  overlapping nodes. The algorithm is implemented as an
> -	  augmented rbtree.
> -
> -	  See:
> -
> -		Documentation/rbtree.txt
> -
> -	  for more information.
> -
>  config RADIX_TREE_MULTIORDER
>  	bool
>  
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
>  config INTERVAL_TREE_TEST
>  	tristate "Interval tree test"
>  	depends on m && DEBUG_KERNEL
> -	select INTERVAL_TREE
>  	help
>  	  A benchmark measuring the performance of the interval tree library
>  
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
>  obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
>  
>  obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
>  obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
>  obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
>  obj-$(CONFIG_DEBUG_LIST) += list_debug.o
> -- 
> 2.6.6
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/5] staging/lustre: Use generic range rwlock
  2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
@ 2017-03-09 14:40     ` kbuild test robot
  2017-03-09  8:56     ` [lustre-devel] " kbuild test robot
  2017-03-09 14:40     ` [lustre-devel] " kbuild test robot
  2 siblings, 0 replies; 37+ messages in thread
From: kbuild test robot @ 2017-03-09 14:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: kbuild-all, mingo, peterz, akpm, jack, kirill.shutemov, mhocko,
	mgorman, dave, linux-kernel, oleg.drokin, andreas.dilger,
	jsimmons, lustre-devel, Davidlohr Bueso

Hi Davidlohr,

[auto build test WARNING on staging/staging-testing]
[also build test WARNING on v4.11-rc1 next-20170309]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Davidlohr-Bueso/locking-Introduce-range-reader-writer-lock/20170309-140444
config: i386-randconfig-i1-201710 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/staging/lustre/lustre/llite/file.c: In function 'll_file_io_generic':
>> drivers/staging/lustre/lustre/llite/file.c:1072:4: warning: large integer implicitly truncated to unsigned type [-Woverflow]
       range_rwlock_init(&range, 0, LUSTRE_EOF);
       ^
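
[ What the warning is about (a note, assuming the i386 ABI where unsigned long
  is 32 bits): LUSTRE_EOF is a 64-bit constant, while the range bounds are
  unsigned long, so the upper bound is silently truncated. Roughly: ]

	#define LUSTRE_EOF 0xffffffffffffffffULL	/* 64-bit constant */

	void range_rwlock_init(struct range_rwlock *lock,
			       unsigned long start, unsigned long last);

	/* on i386 'last' is 32 bits, so LUSTRE_EOF is truncated to 0xffffffff here: */
	range_rwlock_init(&range, 0, LUSTRE_EOF);

[ Using range_rwlock_init_inf(&range), or passing RANGE_RWLOCK_INFINITY
  (defined as ~0UL - 1), would stay within unsigned long; whether that is the
  intended range for the O_APPEND case is for the patch author to decide. ]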

vim +1072 drivers/staging/lustre/lustre/llite/file.c

  1056		struct cl_io	 *io;
  1057		ssize_t result = 0;
  1058		int rc = 0;
  1059	
  1060		CDEBUG(D_VFSTRACE, "file: %pD, type: %d ppos: %llu, count: %zu\n",
  1061		       file, iot, *ppos, count);
  1062	
  1063	restart:
  1064		io = vvp_env_thread_io(env);
  1065		ll_io_init(io, file, iot == CIT_WRITE);
  1066	
  1067		if (cl_io_rw_init(env, io, iot, *ppos, count) == 0) {
  1068			struct vvp_io *vio = vvp_env_io(env);
  1069			bool range_locked = false;
  1070	
  1071			if (file->f_flags & O_APPEND)
> 1072				range_rwlock_init(&range, 0, LUSTRE_EOF);
  1073			else
  1074				range_rwlock_init(&range, *ppos, *ppos + count - 1);
  1075	
  1076			vio->vui_fd  = LUSTRE_FPRIVATE(file);
  1077			vio->vui_iter = args->u.normal.via_iter;
  1078			vio->vui_iocb = args->u.normal.via_iocb;
  1079			/*
  1080			 * Direct IO reads must also take range lock,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


^ permalink raw reply	[flat|nested] 37+ messages in thread


* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
  2017-03-09 11:03   ` Jan Kara
@ 2017-03-28 10:00   ` Laurent Dufour
  2017-03-28 16:39     ` Davidlohr Bueso
  2017-03-29  8:56   ` Peter Zijlstra
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-03-28 10:00 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso

On 07/03/2017 06:03, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> sorted [a0, a1] manner (a0 <= a1), before actually taking the lock.
> 
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
> 
> When new range lock is requested, we add its interval to the tree and store
> number of intervals intersecting it to 'blocking_ranges'. When a range is
> unlocked, we again walk intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up owner of any range lock whose
> 'blocking_ranges' drops to 0.
> 
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> the total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
> 
> Due to its sharable nature, full range locks can be compared with rw-semaphores,
> which also serve from a mutex standpoint, as writer-only situations are
> pretty similar nowadays.
> 
> The first difference is the memory footprint: rwsems are larger than tree locks, 40 vs
> 24 bytes, but the latter requires an additional 64 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as it's the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> between both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have the writer spin-on-owner feature.
> 
> On both machines, when actually testing threads acquiring different ranges, the
> throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  drivers/gpu/drm/Kconfig       |   2 -
>  drivers/gpu/drm/i915/Kconfig  |   1 -
>  include/linux/range_rwlock.h  |  96 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
>  lib/Kconfig                   |  14 --
>  lib/Kconfig.debug             |   1 -
>  lib/Makefile                  |   2 +-
>  8 files changed, 560 insertions(+), 20 deletions(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have an ATI Radeon graphics card.  There
>  	  are both PCI and AGP versions.  You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have a recent AMD Radeon graphics card.
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
>  	depends on DRM
>  	depends on X86 && PCI
>  	select INTEL_GTT
> -	select INTERVAL_TREE
>  	# we need shmfs for the swappable backing store, and in particular
>  	# the shmem_readpage() which depends upon tmpfs
>  	select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *task;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	bool reader;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
> +		.node = {			\
> +			.start = (start)	\
> +			,.last = (last)		\
> +		}				\
> +	}

Hi Davidlohr,

This macro doesn't expand correctly because the field names ".start" and
".last" are replaced by the macro parameters. It should rather be:

#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
		.node = {			\
			.start = (__start)	\
			,.last = (__last)	\
		}				\
	}

By the way, should the other fields be set as in __range_rwlock_init()?
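
[ If so, one possible shape -- a sketch only, and note that RB_CLEAR_NODE()
  cannot be expressed here since a static initializer has no access to the
  node's own address -- would be: ]

#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
		.node = {			\
			.start = (__start)	\
			,.last = (__last)	\
		}				\
		, .task = NULL			\
		, .blocking_ranges = 0		\
		, .reader = false		\
	}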

> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
             ^^^^
      range_write_trylock(...) isn't it ?

Cheers,
Laurent.

> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
> 
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		tree->leftmost = &lock->node;
> +	else if (lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		  unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			unsigned long flags;
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +			lock->reader = false;
> +			__range_tree_remove(tree, lock);
> +			spin_unlock_irqrestore(&tree->lock, flags);
> +			break;
> +		}
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +	return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = true;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					 lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int ret = true;
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	/*
> +	 * We have overlapping ranges in the tree, ensure that we can
> +	 * in fact share the lock.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader) {
> +			ret = false;
> +			goto unlock;
> +		}
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	lock->reader = true;
> +	__range_tree_insert(tree, lock);
> +unlock:
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> +			  struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = false;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		/*
> +		 * As a writer, we always consider an existing node. We
> +		 * need to block; either the intersecting node is another
> +		 * writer or we have a reader that needs to finish.
> +		 */
> +		lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int intersects;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	intersects = __range_intersects_intree(tree, lock);
> +
> +	if (!intersects) {
> +		lock->reader = false;
> +		__range_tree_insert(tree, lock);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +				     lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		range_rwlock_unblock(blocked_lock, &wake_q);
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock)
> +{
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	WARN_ON(lock->reader);
> +
> +	/*
> +	 * Unaccount for any blocked reader lock. Wakeup if possible.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	lock->reader = true;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
>  config BTREE
>  	bool
> 
> -config INTERVAL_TREE
> -	bool
> -	help
> -	  Simple, embeddable, interval-tree. Can find the start of an
> -	  overlapping range in log(n) time and then iterate over all
> -	  overlapping nodes. The algorithm is implemented as an
> -	  augmented rbtree.
> -
> -	  See:
> -
> -		Documentation/rbtree.txt
> -
> -	  for more information.
> -
>  config RADIX_TREE_MULTIORDER
>  	bool
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
>  config INTERVAL_TREE_TEST
>  	tristate "Interval tree test"
>  	depends on m && DEBUG_KERNEL
> -	select INTERVAL_TREE
>  	help
>  	  A benchmark measuring the performance of the interval tree library
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
>  obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
> 
>  obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
>  obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
>  obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
>  obj-$(CONFIG_DEBUG_LIST) += list_debug.o
> 
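
For anyone skimming the API: a minimal usage sketch (illustration only, not
part of the patch), locking two disjoint ranges of one tree with the
interfaces introduced by this patch; the tree name is made up:

	DEFINE_RANGE_RWLOCK_TREE(mmap_ranges);	/* hypothetical tree */

	void example(void)
	{
		struct range_rwlock r1, r2;

		range_rwlock_init(&r1, 0, 4095);	/* [0, 4095] */
		range_rwlock_init(&r2, 4096, 8191);	/* [4096, 8191] */

		range_write_lock(&mmap_ranges, &r1);
		/* disjoint range, so this does not block on r1 */
		range_read_lock(&mmap_ranges, &r2);

		/* ... modify [0, 4095], read [4096, 8191] ... */

		range_read_unlock(&mmap_ranges, &r2);
		range_write_unlock(&mmap_ranges, &r1);
	}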

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-28 10:00   ` Laurent Dufour
@ 2017-03-28 16:39     ` Davidlohr Bueso
  2017-03-28 16:58       ` Kirill A. Shutemov
  2017-04-03 14:19       ` Laurent Dufour
  0 siblings, 2 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-28 16:39 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Tue, 28 Mar 2017, Laurent Dufour wrote:
>> +#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
>> +		.node = {			\
>> +			.start = (start)	\
>> +			,.last = (last)		\
>> +		}				\
>> +	}
>
>Hi Davidlohr,
>
>This macro doesn't expand correctly because the field name ".start" is
>replaced by the start parameter. Should rather be :
>
>#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
>		.node = {			\
>			.start = (__start)	\
>			,.last = (__last)	\
>		}				\
>	}
>
>By the way, should the other fields set as in __range_rwlock_init() ?

Indeed.
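
For the record, a quick illustration of why the original spelling breaks:
the preprocessor substitutes the 'start'/'last' parameter tokens even when
they follow the '.' of a designated initializer. Made-up usage, not from
the tree:

	struct range_rwlock r = __RANGE_RWLOCK_INITIALIZER(0, 10);

	/* with (start, last) as parameter names this expands to roughly: */
	struct range_rwlock r = { .node = { .0 = (0) ,.10 = (10) } };	/* does not compile */

	/* with (__start, __last) it expands as intended: */
	struct range_rwlock r = { .node = { .start = (0) ,.last = (10) } };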

>> +/*
>> + * lock for writing
>> + */
>> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
>> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
>> +				   struct range_rwlock *lock);
>> +int range_write_lock_killable(struct range_rwlock_tree *tree,
>> +			      struct range_rwlock *lock);
>> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
>             ^^^^
>      range_write_trylock(...) isn't it ?
>

Duh, yeah.

I'll wait to see if there are any more concerns and send a v2 with your corrections.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-28 16:39     ` Davidlohr Bueso
@ 2017-03-28 16:58       ` Kirill A. Shutemov
  2017-03-29  8:38         ` Laurent Dufour
  2017-04-03 14:19       ` Laurent Dufour
  1 sibling, 1 reply; 37+ messages in thread
From: Kirill A. Shutemov @ 2017-03-28 16:58 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
> I'll wait to see if there are any more concerns and send a v2 with your corrections.

Have you tried drop-in replacement of mmap_sem with full range lock?
It would be interesting to see performance implication for this.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-28 16:58       ` Kirill A. Shutemov
@ 2017-03-29  8:38         ` Laurent Dufour
  2017-03-29 15:31           ` Davidlohr Bueso
  0 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-03-29  8:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On 28/03/2017 18:58, Kirill A. Shutemov wrote:
> On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>> I'll wait to see if there are any more concerns and send a v2 with your corrections.
> 
> Have you tried drop-in replacement of mmap_sem with full range lock?
> It would be interesting to see performance implication for this.
> 

I've a patch that replaces the mmap_sem with a full range lock; it seems
to work fine on x86 and ppc64 for now. I'll send it soon.
But I didn't check performance yet. What is the best way to do that?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
  2017-03-09 11:03   ` Jan Kara
  2017-03-28 10:00   ` Laurent Dufour
@ 2017-03-29  8:56   ` Peter Zijlstra
  2017-03-29 15:12     ` Davidlohr Bueso
  2017-03-29  8:59   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29  8:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have an ATI Radeon graphics card.  There
>  	  are both PCI and AGP versions.  You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have a recent AMD Radeon graphics card.
>  
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
>  	depends on DRM
>  	depends on X86 && PCI
>  	select INTEL_GTT
> -	select INTERVAL_TREE
>  	# we need shmfs for the swappable backing store, and in particular
>  	# the shmem_readpage() which depends upon tmpfs
>  	select SHMEM

I presume this is part of making INTERVAL_TREE unconditional; should be
a separate patch, no?


> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)

That's a strange limit, what's wrong with ~0UL ?

> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *task;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	bool reader;
> +};

Hate the name; our rwlock is a spinlock, therefore this thing suggests
it is too.

Also, no bool in structures.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2017-03-29  8:56   ` Peter Zijlstra
@ 2017-03-29  8:59   ` Peter Zijlstra
  2017-03-29  9:08   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29  8:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */

Was your editor broken? Those lines are > 80 for no reason whatsoever.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (3 preceding siblings ...)
  2017-03-29  8:59   ` Peter Zijlstra
@ 2017-03-29  9:08   ` Peter Zijlstra
  2017-03-29 15:05     ` Davidlohr Bueso
  2017-03-29  9:11   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29  9:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)

> +#define DEFINE_RANGE_RWLOCK_INF(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)

> +void range_rwlock_init_inf(struct range_rwlock *lock);

Aye, I'm a pedant, but that's a very small infinity. I always thought
infinity wasn't enumerable.

Can we think of a different name here? 'whole' / 'all' / 'full' ?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (4 preceding siblings ...)
  2017-03-29  9:08   ` Peter Zijlstra
@ 2017-03-29  9:11   ` Peter Zijlstra
  2017-03-29  9:44   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29  9:11 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. 

Not a word about fairness and starvation... Such important topics for
lock primitives.

                   In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */

Maybe that ought not be here; it doesn't seem like a fundamental design
point and would thus be better suited as a comment near where this
implementation detail is located?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (5 preceding siblings ...)
  2017-03-29  9:11   ` Peter Zijlstra
@ 2017-03-29  9:44   ` Peter Zijlstra
  2017-03-29 10:35   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29  9:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = true;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					 lock->node.last);
> +	while (node) {

	for (node = interval_tree_iter_first(); node;
	     node = interval_tree_iter_next()) {

Or some interval_tree_for_each helper?
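
Something along these lines, say (hypothetical helper name, just wrapping
the two iterator calls already used in this patch):

	#define range_interval_tree_foreach(node, root, start, last)	      \
		for (node = interval_tree_iter_first(root, start, last);     \
		     node; node = interval_tree_iter_next(node, start, last))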

> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			lock->blocking_ranges++;

Can this @blocked range now go EINTR and remove itself from
wait_for_ranges()? If so, who will decrement our blocking_ranges?

> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (6 preceding siblings ...)
  2017-03-29  9:44   ` Peter Zijlstra
@ 2017-03-29 10:35   ` Peter Zijlstra
  2017-03-29 10:49   ` Peter Zijlstra
  2017-03-30 14:56   ` Laurent Dufour
  9 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29 10:35 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +EXPORT_SYMBOL(range_rwlock_init);
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +EXPORT_SYMBOL(range_read_lock);
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +EXPORT_SYMBOL(range_read_lock_killable);
> +EXPORT_SYMBOL(range_read_trylock);
> +EXPORT_SYMBOL(range_read_unlock);
> +EXPORT_SYMBOL(range_write_lock);
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +EXPORT_SYMBOL(range_write_lock_killable);
> +EXPORT_SYMBOL(range_write_trylock);
> +EXPORT_SYMBOL(range_write_unlock);
> +EXPORT_SYMBOL(range_downgrade_write);

Is there a good reason this isn't _GPL ?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (7 preceding siblings ...)
  2017-03-29 10:35   ` Peter Zijlstra
@ 2017-03-29 10:49   ` Peter Zijlstra
  2017-03-29 15:14     ` Davidlohr Bueso
  2017-03-30 14:56   ` Laurent Dufour
  9 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2017-03-29 10:49 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = true;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					 lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}

Another thing; this implementation lacks lockdep annotations and has 0
other debugging features.

Is there something that makes the normal rwsem lockdep annotation not
work for this?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29  9:08   ` Peter Zijlstra
@ 2017-03-29 15:05     ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Wed, 29 Mar 2017, Peter Zijlstra wrote:

>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
>
>> +#define DEFINE_RANGE_RWLOCK_INF(name)		\
>> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
>
>> +void range_rwlock_init_inf(struct range_rwlock *lock);
>
>Aye, I'm a pedant, but that's a very small infinity. I always thought
>infinity wasn't enumerable.

:-)

>
>Can we think of a different name here? 'whole' / 'all' / 'full' ?

Yeah, I guess 'all' is more suitable.
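
Something like this, perhaps (sketch of the renamed helpers, also picking
up the ~0UL suggestion from the other subthread):

	#define RANGE_RWLOCK_ALL	(~0UL)

	#define DEFINE_RANGE_RWLOCK_ALL(name)	\
		struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_ALL)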

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29  8:56   ` Peter Zijlstra
@ 2017-03-29 15:12     ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Wed, 29 Mar 2017, Peter Zijlstra wrote:

>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index 88e01e08e279..e4d9eadd2c47 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -154,7 +154,6 @@ config DRM_RADEON
>>  	select HWMON
>>  	select BACKLIGHT_CLASS_DEVICE
>>  	select BACKLIGHT_LCD_SUPPORT
>> -	select INTERVAL_TREE
>>  	help
>>  	  Choose this option if you have an ATI Radeon graphics card.  There
>>  	  are both PCI and AGP versions.  You don't need to choose this to
>> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>>  	select HWMON
>>  	select BACKLIGHT_CLASS_DEVICE
>>  	select BACKLIGHT_LCD_SUPPORT
>> -	select INTERVAL_TREE
>>  	help
>>  	  Choose this option if you have a recent AMD Radeon graphics card.
>>
>> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
>> index 183f5dc1c3f2..8a9154550f46 100644
>> --- a/drivers/gpu/drm/i915/Kconfig
>> +++ b/drivers/gpu/drm/i915/Kconfig
>> @@ -3,7 +3,6 @@ config DRM_I915
>>  	depends on DRM
>>  	depends on X86 && PCI
>>  	select INTEL_GTT
>> -	select INTERVAL_TREE
>>  	# we need shmfs for the swappable backing store, and in particular
>>  	# the shmem_readpage() which depends upon tmpfs
>>  	select SHMEM
>
>I presume this is part of making INTERVAL_TREE unconditional; should be
>a separate patch, no?

Ok, I can separate it.

>> +/*
>> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
>> + */
>> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
>
>That's a strange limit, what's wrong with ~0UL ?

Nothing, I was under the impression I had updated that.

>> +
>> +struct range_rwlock {
>> +	struct interval_tree_node node;
>> +	struct task_struct *task;
>> +	/* Number of ranges which are blocking acquisition of the lock */
>> +	unsigned int blocking_ranges;
>> +	bool reader;
>> +};
>
>Hate the name; our rwlock is a spinlock, therefore this thing suggests
>it is too.

Hmmm not sure of a better name, kinda liked it. I'll see what I can come
up with.

>
>Also, no bool in structures.

Ok, but is that because sizeof(bool) can be undefined? In any case, I'm thinking
we could skip the 'reader' member entirely and tag the task pointer (similar to
what we do for other locks).
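
Roughly along these lines (sketch only, names made up): stash the reader
bit in the low bit of the owner pointer, which is safe because task_struct
pointers are suitably aligned:

	#define RANGE_RWLOCK_READER	1UL

	static inline void range_lock_set_reader(struct range_rwlock *lock)
	{
		lock->task = (struct task_struct *)
			((unsigned long)lock->task | RANGE_RWLOCK_READER);
	}

	static inline bool range_lock_is_reader(struct range_rwlock *lock)
	{
		return (unsigned long)lock->task & RANGE_RWLOCK_READER;
	}

	static inline struct task_struct *range_lock_owner(struct range_rwlock *lock)
	{
		return (struct task_struct *)
			((unsigned long)lock->task & ~RANGE_RWLOCK_READER);
	}

The wakeup path would then go through range_lock_owner() instead of
dereferencing ->task directly.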

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29 10:49   ` Peter Zijlstra
@ 2017-03-29 15:14     ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Wed, 29 Mar 2017, Peter Zijlstra wrote:

>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> +static __always_inline int
>> +__range_read_lock_common(struct range_rwlock_tree *tree,
>> +			 struct range_rwlock *lock, long state)
>> +{
>> +	struct interval_tree_node *node;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tree->lock, flags);
>> +	lock->reader = true;
>> +
>> +	if (!__range_intersects_intree(tree, lock))
>> +		goto insert;
>> +
>> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
>> +					 lock->node.last);
>> +	while (node) {
>> +		struct range_rwlock *blocked_lock;
>> +		blocked_lock = range_entry(node, struct range_rwlock, node);
>> +
>> +		if (!blocked_lock->reader)
>> +			lock->blocking_ranges++;
>> +		node = interval_tree_iter_next(node, lock->node.start,
>> +					       lock->node.last);
>> +	}
>> +insert:
>> +	__range_tree_insert(tree, lock);
>> +
>> +	lock->task = current;
>> +	spin_unlock_irqrestore(&tree->lock, flags);
>> +
>> +	return wait_for_ranges(tree, lock, state);
>> +}
>
>Another thing; this implementation lacks lockdep annotations and has 0
>other debugging features.

True.

>Is there something that makes the normal rwsem lockdep annotation not
>work for this?

Not that I know of. Let me look into it.
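
For my own notes, the obvious shape would be something like the below
(sketch, assuming a dep_map is added to struct range_rwlock_tree under
CONFIG_DEBUG_LOCK_ALLOC and initialized with lockdep_init_map()):

	void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
	{
		might_sleep();
		rwsem_acquire(&tree->dep_map, 0, 0, _RET_IP_);
		__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
	}

	void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
	{
		might_sleep();
		rwsem_acquire_read(&tree->dep_map, 0, 0, _RET_IP_);
		__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
	}

... with matching rwsem_release(&tree->dep_map, 1, _RET_IP_) in the unlock
paths and in the error paths of the interruptible/killable variants.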

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29  8:38         ` Laurent Dufour
@ 2017-03-29 15:31           ` Davidlohr Bueso
  2017-03-29 15:40             ` Kirill A. Shutemov
  0 siblings, 1 reply; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:31 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Kirill A. Shutemov, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Wed, 29 Mar 2017, Laurent Dufour wrote:

>On 28/03/2017 18:58, Kirill A. Shutemov wrote:
>> On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>>> I'll wait to see if there are any more concerns and send a v2 with your corrections.
>>
>> Have you tried drop-in replacement of mmap_sem with full range lock?
>> It would be interesting to see performance implication for this.
>>
>
>I've a patch that replace the mmap_sem with a full range lock, it seems
>to work fine for x86 and ppc64 for now. I'll send it soon.
>But I didn't yet check for performance. What is the best way to that ?

I expect performance to take a measurable hit if we simply use the full range
lock as a drop-in replacement. My rwsem vs range lock measurements were
done with this in mind. We only win with range locks when improving the
level of parallelism. There are plenty of benchmarks out there that
stress address space concurrency. I've been using some from the mosbench
suite (metis, psearchy, dedup); Ingo also has a microbench that does
mmap()/munmap()/mremap() combos:

http://people.redhat.com/mingo/threaded-mmap-stresstest/
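
(For reference, the kind of loop these stress tests hammer the address
space with looks roughly like the below; simplified sketch, not Ingo's
actual code.)

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <pthread.h>

	#define LEN		(2UL << 20)
	#define NTHREADS	8

	static void *worker(void *arg)
	{
		for (;;) {
			void *p, *q;

			p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				continue;
			*(volatile char *)p = 1;	/* fault in a page */

			q = mremap(p, LEN, 2 * LEN, MREMAP_MAYMOVE);
			if (q == MAP_FAILED)
				munmap(p, LEN);
			else
				munmap(q, 2 * LEN);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		int i;

		for (i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, NULL);
		for (i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);
		return 0;
	}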

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29 15:31           ` Davidlohr Bueso
@ 2017-03-29 15:40             ` Kirill A. Shutemov
  2017-03-29 16:10               ` Davidlohr Bueso
  0 siblings, 1 reply; 37+ messages in thread
From: Kirill A. Shutemov @ 2017-03-29 15:40 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Wed, Mar 29, 2017 at 08:31:33AM -0700, Davidlohr Bueso wrote:
> On Wed, 29 Mar 2017, Laurent Dufour wrote:
> 
> > On 28/03/2017 18:58, Kirill A. Shutemov wrote:
> > > On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
> > > > I'll wait to see if there are any more concerns and send a v2 with your corrections.
> > > 
> > > Have you tried drop-in replacement of mmap_sem with full range lock?
> > > It would be interesting to see performance implication for this.
> > > 
> > 
> > I've a patch that replace the mmap_sem with a full range lock, it seems
> > to work fine for x86 and ppc64 for now. I'll send it soon.
> > But I didn't yet check for performance. What is the best way to that ?
> 
> I expect performance to take a measurable hit if we simply use full range
> lock as a drop in replacement. My rwsem vs range lock measurements were
> done with this in mind. We only win with range locks when improving the
> level of parallelism.

It would be a hard sell if we saw performance degradation in a simple
single-threaded workload.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-29 15:40             ` Kirill A. Shutemov
@ 2017-03-29 16:10               ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 16:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Wed, 29 Mar 2017, Kirill A. Shutemov wrote:

>On Wed, Mar 29, 2017 at 08:31:33AM -0700, Davidlohr Bueso wrote:
>> On Wed, 29 Mar 2017, Laurent Dufour wrote:
>>
>> > On 28/03/2017 18:58, Kirill A. Shutemov wrote:
>> > > On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>> > > > I'll wait to see if there are any more concerns and send a v2 with your corrections.
>> > >
>> > > Have you tried drop-in replacement of mmap_sem with full range lock?
>> > > It would be interesting to see performance implication for this.
>> > >
>> >
>> > I've a patch that replace the mmap_sem with a full range lock, it seems
>> > to work fine for x86 and ppc64 for now. I'll send it soon.
>> > But I didn't yet check for performance. What is the best way to that ?
>>
>> I expect performance to take a measurable hit if we simply use full range
>> lock as a drop in replacement. My rwsem vs range lock measurements were
>> done with this in mind. We only win with range locks when improving the
>> level of parallelism.
>
>It would be hard sell if we would see performance degradation simple
>single-threaded workload.

Yeah, that's why I included very low contention in the lock comparison.
Deltas are very much within the noise region; it is with high contention
that things go south performance-wise.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
                     ` (8 preceding siblings ...)
  2017-03-29 10:49   ` Peter Zijlstra
@ 2017-03-30 14:56   ` Laurent Dufour
  2017-03-30 17:13     ` Davidlohr Bueso
  9 siblings, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-03-30 14:56 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso

On 07/03/2017 06:03, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
> 
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
> 
> When new range lock is requested, we add its interval to the tree and store
> number of intervals intersecting it to 'blocking_ranges'. When a range is
> unlocked, we again walk intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up owner of any range lock whose
> 'blocking_ranges' drops to 0.
> 
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> total number of ranges and R_int is the number of ranges intersecting the
> new range range to be added.
> 
> Due to its sharable nature, full range locks can be compared with rw-sempahores,
> which also serves from a mutex standpoint as writer-only situations are
> pretty similar nowadays.
> 
> The first is the memory footprint, rwsems are larger than tree locks: 40 vs
> 24 bytes, but the later requires an additional 64 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as its the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While a single reader and writer threads does not show must difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while the number of reader throughput is quite similar
> among both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4 core machine, however, reader threads can increase the number of
> acquisitions in up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have writer spin-on-owner feature.
> 
> On both machines when actually testing threads acquiring different ranges, the
> amount of throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such microbenchmarks that merely
> pounds on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>  drivers/gpu/drm/Kconfig       |   2 -
>  drivers/gpu/drm/i915/Kconfig  |   1 -
>  include/linux/range_rwlock.h  |  96 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
>  lib/Kconfig                   |  14 --
>  lib/Kconfig.debug             |   1 -
>  lib/Makefile                  |   2 +-
>  8 files changed, 560 insertions(+), 20 deletions(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have an ATI Radeon graphics card.  There
>  	  are both PCI and AGP versions.  You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>  	select HWMON
>  	select BACKLIGHT_CLASS_DEVICE
>  	select BACKLIGHT_LCD_SUPPORT
> -	select INTERVAL_TREE
>  	help
>  	  Choose this option if you have a recent AMD Radeon graphics card.
> 
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
>  	depends on DRM
>  	depends on X86 && PCI
>  	select INTEL_GTT
> -	select INTERVAL_TREE
>  	# we need shmfs for the swappable backing store, and in particular
>  	# the shmem_readpage() which depends upon tmpfs
>  	select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *task;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	bool reader;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
> +		.node = {			\
> +			.start = (start)	\
> +			,.last = (last)		\
> +		}				\
> +	}
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
> 
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		tree->leftmost = &lock->node;
> +	else if (lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		  unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			unsigned long flags;
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +			lock->reader = false;
> +			__range_tree_remove(tree, lock);
> +			spin_unlock_irqrestore(&tree->lock, flags);
> +			break;
> +		}
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;

Hi Davidlohr,

While building a kernel on top of a patched kernel using the full range lock
in place of mmap_sem, I found that fork() sometimes failed, returning
ENOMEM.
It happens that if fork() gets called at the time a signal is sent to the
calling process, the call to range_write_lock_interruptible() fails
even if there is no contention on the lock. This is because we check for
a pending signal before checking for lock contention in
wait_for_ranges().

The loop here should rather be:

	while (true) {
		/* do we need to go to sleep? */
		if (!lock->blocking_ranges)
			break;

		if (unlikely(signal_pending_state(state, current))) {
			unsigned long flags;

			ret = -EINTR;
			/*
			 * We're not taking the lock after all, cleanup
			 * after ourselves.
			 */
			spin_lock_irqsave(&tree->lock, flags);
			lock->reader = false;
			__range_tree_remove(tree, lock);
			spin_unlock_irqrestore(&tree->lock, flags);
			break;
		}

		set_current_state(state);

		schedule();
	}

I also moved the call to set_current_state() before calling schedule(),
not sure this has to be done this way, but my system seems to work fine
like this.

Laurent.

> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +	return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = true;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					 lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int ret = true;
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	/*
> +	 * We have overlapping ranges in the tree, ensure that we can
> +	 * in fact share the lock.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader) {
> +			ret = false;
> +			goto unlock;
> +		}
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	lock->reader = true;
> +	__range_tree_insert(tree, lock);
> +unlock:
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> +			  struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	lock->reader = false;
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		/*
> +		 * As a writer, we always consider an existing node. We
> +		 * need to block; either the intersecting node is another
> +		 * writer or we have a reader that needs to finish.
> +		 */
> +		lock->blocking_ranges++;
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->task = current;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int intersects;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	intersects = __range_intersects_intree(tree, lock);
> +
> +	if (!intersects) {
> +		lock->reader = false;
> +		__range_tree_insert(tree, lock);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	lock->reader = false;
> +	__range_tree_remove(tree, lock);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +				     lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		range_rwlock_unblock(blocked_lock, &wake_q);
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock)
> +{
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	WARN_ON(lock->reader);
> +
> +	/*
> +	 * Unaccount for any blocked reader lock. Wakeup if possible.
> +	 */
> +	node = interval_tree_iter_first(&tree->root, lock->node.start,
> +					lock->node.last);
> +	while (node) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (blocked_lock->reader)
> +			range_rwlock_unblock(blocked_lock, &wake_q);
> +
> +		node = interval_tree_iter_next(node, lock->node.start,
> +					       lock->node.last);
> +	}
> +
> +	lock->reader = true;
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
>  config BTREE
>  	bool
> 
> -config INTERVAL_TREE
> -	bool
> -	help
> -	  Simple, embeddable, interval-tree. Can find the start of an
> -	  overlapping range in log(n) time and then iterate over all
> -	  overlapping nodes. The algorithm is implemented as an
> -	  augmented rbtree.
> -
> -	  See:
> -
> -		Documentation/rbtree.txt
> -
> -	  for more information.
> -
>  config RADIX_TREE_MULTIORDER
>  	bool
> 
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
>  config INTERVAL_TREE_TEST
>  	tristate "Interval tree test"
>  	depends on m && DEBUG_KERNEL
> -	select INTERVAL_TREE
>  	help
>  	  A benchmark measuring the performance of the interval tree library
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
>  obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
> 
>  obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
>  obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
>  obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
>  obj-$(CONFIG_DEBUG_LIST) += list_debug.o
> 
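For reference, a minimal usage sketch of the write-side API in this hunk.
The initializer (range_rwlock_init()), the header path, and the read-side
unlock are assumed from the rest of the patch and are not visible in this
hunk, so their exact names and signatures may differ:

#include <linux/range_rwlock.h>		/* header added by this patch (assumed path) */

/* Illustrative only; not part of the patch itself. */
static struct range_rwlock_tree my_tree;	/* assumed initialized elsewhere */

static int frob_range(unsigned long start, unsigned long last)
{
	struct range_rwlock range;

	range_rwlock_init(&range, start, last);	/* hypothetical initializer */

	/* Fast path: try to take [start, last] for writing without sleeping. */
	if (!range_write_trylock(&my_tree, &range)) {
		/* Contended: sleep, bailing out on a fatal signal. */
		int err = range_write_lock_killable(&my_tree, &range);

		if (err)
			return err;
	}

	/* ... exclusive access to [start, last] ... */

	/* Let overlapping readers in for the read-mostly tail work. */
	range_downgrade_write(&my_tree, &range);

	/* ... shared access to [start, last] ... */

	range_read_unlock(&my_tree, &range);	/* assumed read-side unlock */
	return 0;
}

The trylock-then-killable pattern is only one way to compose these calls;
the point of the sketch is that the downgrade lets blocked readers proceed
before the owner is completely done with the range.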

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-30 14:56   ` Laurent Dufour
@ 2017-03-30 17:13     ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-03-30 17:13 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Davidlohr Bueso, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel

On 2017-03-30 07:56, Laurent Dufour wrote:
> On 07/03/2017 06:03, Davidlohr Bueso wrote:
>> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
>> +				  struct range_rwlock *lock, long state)
>> +{
>> +	int ret = 0;
>> +
>> +	while (true) {
>> +		set_current_state(state);
>> +
>> +		if (unlikely(signal_pending_state(state, current))) {
>> +			unsigned long flags;
>> +
>> +			ret = -EINTR;
>> +			/*
>> +			 * We're not taking the lock after all, cleanup
>> +			 * after ourselves.
>> +			 */
>> +			spin_lock_irqsave(&tree->lock, flags);
>> +			lock->reader = false;
>> +			__range_tree_remove(tree, lock);
>> +			spin_unlock_irqrestore(&tree->lock, flags);
>> +			break;
>> +		}
>> +
>> +		/* do we need to go to sleep? */
>> +		if (!lock->blocking_ranges)
>> +			break;
> 
> Hi Davidlohr,
> 
> While building a kernel on top of a patched kernel using the full range
> lock in place of mmap_sem, I found that fork() sometimes failed,
> returning ENOMEM.
> It happens that if fork() gets called at the time a signal is sent to the
> calling process, the call to range_write_lock_interruptible() fails
> even if there is no contention on the lock. This is because we check
> for a pending signal before checking for lock contention in
> wait_for_ranges().
> 
> The loop here should rather be:
> 
> 	while (true) {
> 		/* do we need to go to sleep? */
> 		if (!lock->blocking_ranges)
> 			break;
> 

Thanks, this makes sense, and is actually the standard way of waiting in
most locks.

> 		if (unlikely(signal_pending_state(state, current))) {
> 			unsigned long flags;
> 
> 			ret = -EINTR;
> 			/*
> 			 * We're not taking the lock after all, cleanup
> 			 * after ourselves.
> 			 */
> 			spin_lock_irqsave(&tree->lock, flags);
> 			lock->reader = false;
> 			__range_tree_remove(tree, lock);
> 			spin_unlock_irqrestore(&tree->lock, flags);
> 			break;
> 		}
> 
> 		set_current_state(state);
> 
> 		schedule();
> 	}
> 
> I also moved the call to set_current_state() to just before calling
> schedule(); I am not sure it has to be done this way, but my system
> seems to work fine like this.

No, we do not hold any locks. Please keep set_current_state() the very
first thing we do in the loop. You can check
Documentation/memory-barriers.txt for details :-)

Thanks,
Davidlohr
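Putting both points together -- the contention check ahead of the signal
check, but set_current_state() still first -- a sketch of what the loop
could look like, based on the helpers quoted above. This is not
necessarily the exact code of any later revision, and it still leaves the
abort-path accounting problem discussed further down in the thread:

static inline int wait_for_ranges(struct range_rwlock_tree *tree,
				  struct range_rwlock *lock, long state)
{
	int ret = 0;

	while (true) {
		/* Publish our state before re-checking the condition. */
		set_current_state(state);

		/* Do we need to go to sleep? If not, we own the range. */
		if (!lock->blocking_ranges)
			break;

		if (unlikely(signal_pending_state(state, current))) {
			unsigned long flags;

			ret = -EINTR;
			/*
			 * We're not taking the lock after all, cleanup
			 * after ourselves.
			 */
			spin_lock_irqsave(&tree->lock, flags);
			lock->reader = false;
			__range_tree_remove(tree, lock);
			spin_unlock_irqrestore(&tree->lock, flags);
			break;
		}

		schedule();
	}

	__set_current_state(TASK_RUNNING);
	return ret;
}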

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-03-28 16:39     ` Davidlohr Bueso
  2017-03-28 16:58       ` Kirill A. Shutemov
@ 2017-04-03 14:19       ` Laurent Dufour
  2017-04-03 15:26         ` Davidlohr Bueso
  1 sibling, 1 reply; 37+ messages in thread
From: Laurent Dufour @ 2017-04-03 14:19 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Tue, 28 Mar 2017 09:39:18 -0700,
Davidlohr Bueso <dave@stgolabs.net> wrote:
> I'll wait to see if there are any more concerns and send a v2 with
> your corrections.

Hi Davidlohr, I think there is a major issue regarding the task
catching a signal in wait_for_ranges().
I can see that when a thread catches a signal, the process deadlocks
in the exit path.

Let's imagine these tasks all waiting for the complete range lock, so
the ranges don't matter:

A gets the lock for write
B wants the read lock => B->blocking_ranges=1 (because of A)
C wants the write lock => C->blocking_ranges=2 (A,B)
D wants the read lock => D->blocking_ranges=3 (A,B,C)
=> C catches a signal and exits wait_for_ranges()
A releases the lock
        => B->blocking_ranges=0
        => D->blocking_ranges=2 (D has not seen C's removal)
=> B gets the lock
B releases the lock
        => D->blocking_ranges=1

D remains blocked while no one holds the lock!

The issue is that when removing a task from the interval tree, we
should decrement the blocking_ranges of every task that was added to
an overlapping range after the one leaving... I can't see an easy fix
for that :(

Am I right?

Cheers,
Laurent.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-04-03 14:19       ` Laurent Dufour
@ 2017-04-03 15:26         ` Davidlohr Bueso
  2017-04-03 16:06           ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Davidlohr Bueso @ 2017-04-03 15:26 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, 03 Apr 2017, Laurent Dufour wrote:

>On Tue, 28 Mar 2017 09:39:18 -0700,
>Davidlohr Bueso <dave@stgolabs.net> wrote:
>> I'll wait to see if there are any more concerns and send a v2 with
>> your corrections.
>
>Hi Davidlohr, I think there is a major issue regarding the task
>catching a signal in wait_for_ranges().
>I can see that when a thread catches a signal, the process deadlocks
>in the exit path.
>
>Let's imagine these tasks all waiting for the complete range lock, so
>the ranges don't matter:
>
>A gets the lock for write
>B wants the read lock => B->blocking_ranges=1 (because of A)
>C wants the write lock => C->blocking_ranges=2 (A,B)
>D wants the read lock => D->blocking_ranges=3 (A,B,C)
>=> C catches a signal and exits wait_for_ranges()
>A releases the lock
>        => B->blocking_ranges=0
>        => D->blocking_ranges=2 (D has not seen C's removal)
>=> B gets the lock
>B releases the lock
>        => D->blocking_ranges=1
>
>D remains blocked while no one holds the lock!
>
>The issue is that when removing a task from the interval tree, we
>should decrement the blocking_ranges of every task that was added to
>an overlapping range after the one leaving... I can't see an easy fix
>for that :(
>
>Am I right?

Yes, Peter had also mentioned this issue. One way I thought of fixing
the problem was to track a jiffies timestamp on a per-range_rwlock
basis for when it was added, and in the signal_pending() case, along with
removing the lock from the tree, iterate the tree again and decrement
blocking_ranges for the waiters with a higher timestamp. It would add some
overhead, but again this is the unlikely() case. It also adds an extra 8
bytes of footprint, but the lock is usually stack allocated.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-04-03 15:26         ` Davidlohr Bueso
@ 2017-04-03 16:06           ` Jan Kara
  2017-04-04 15:31             ` Davidlohr Bueso
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-03 16:06 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Mon 03-04-17 08:26:43, Davidlohr Bueso wrote:
> On Mon, 03 Apr 2017, Laurent Dufour wrote:
> >On Tue, 28 Mar 2017 09:39:18 -0700,
> >Davidlohr Bueso <dave@stgolabs.net> wrote:
> >>I'll wait to see if there are any more concerns and send a v2 with
> >>your corrections.
> >
> >Hi Davidlohr, I think there is a major issue regarding the task
> >catching a signal in wait_for_ranges().
> >I can see that when a thread catches a signal, the process deadlocks
> >in the exit path.
> >
> >Let's imagine these tasks all waiting for the complete range lock, so
> >the ranges don't matter:
> >
> >A gets the lock for write
> >B wants the read lock => B->blocking_ranges=1 (because of A)
> >C wants the write lock => C->blocking_ranges=2 (A,B)
> >D wants the read lock => D->blocking_ranges=3 (A,B,C)
> >=> C catches a signal and exits wait_for_ranges()
> >A releases the lock
> >       => B->blocking_ranges=0
> >       => D->blocking_ranges=2 (D has not seen C's removal)
> >=> B gets the lock
> >B releases the lock
> >       => D->blocking_ranges=1
> >
> >D remains blocked while no one holds the lock!
> >
> >The issue is that when removing a task from the interval tree, we
> >should decrement the blocking_ranges of every task that was added to
> >an overlapping range after the one leaving... I can't see an easy fix
> >for that :(
> >
> >Am I right?
> 
> Yes, Peter had also mentioned this issue. One way I thought of fixing
> the problem was to track a jiffies timestamp on a per-range_rwlock
> basis for when it was added, and in the signal_pending() case, along with
> removing the lock from the tree, iterate the tree again and decrement
> blocking_ranges for the waiters with a higher timestamp. It would add some
> overhead, but again this is the unlikely() case. It also adds an extra 8
> bytes of footprint, but the lock is usually stack allocated.

Or just a plain sequence counter of the lock operations? There are ways
we could implement the functionality without the counter, and although
asymptotically they would be the same, they are more complex code-wise, so I
think the counter is the best solution.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
  2017-04-03 16:06           ` Jan Kara
@ 2017-04-04 15:31             ` Davidlohr Bueso
  0 siblings, 0 replies; 37+ messages in thread
From: Davidlohr Bueso @ 2017-04-04 15:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Laurent Dufour, mingo, peterz, akpm, kirill.shutemov, mhocko,
	mgorman, linux-kernel, Davidlohr Bueso

On Mon, 03 Apr 2017, Jan Kara wrote:

>Or just a plain sequence counter of the lock operations?

So what I dislike about this is that we'd also have to enlarge
struct range_rwlock_tree. OTOH, I'm hesitant to depend on
the tick rate for lock correctness, so perhaps your suggestion
is best.

Thanks,
Davidlohr
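A rough sketch of how such a sequence counter could be used on the
signal-abort path, assuming each waiter is stamped with a per-tree counter
at insertion time and that range_rwlock_unblock() decrements
blocking_ranges and wakes the waiter once it reaches zero, as the unlock
path in patch 1 suggests. The seqnum fields and the helper name are
hypothetical, and this is not the code that was ultimately proposed:

/*
 * Illustrative only: tree->seqnum and lock->seqnum are hypothetical
 * fields; a waiter would be stamped with ++tree->seqnum under tree->lock
 * when it is inserted. Counter wraparound and the fact that two readers
 * do not block each other are glossed over for brevity.
 */
static void range_rwlock_abort(struct range_rwlock_tree *tree,
			       struct range_rwlock *lock)
{
	struct interval_tree_node *node;
	unsigned long flags;
	DEFINE_WAKE_Q(wake_q);

	spin_lock_irqsave(&tree->lock, flags);
	__range_tree_remove(tree, lock);

	for (node = interval_tree_iter_first(&tree->root, lock->node.start,
					     lock->node.last);
	     node;
	     node = interval_tree_iter_next(node, lock->node.start,
					    lock->node.last)) {
		struct range_rwlock *overlap;

		overlap = range_entry(node, struct range_rwlock, node);

		/*
		 * Only waiters inserted after us counted us in their
		 * blocking_ranges; drop that reference (and wake them
		 * if we were the last range they were waiting for).
		 */
		if (overlap->seqnum > lock->seqnum)
			range_rwlock_unblock(overlap, &wake_q);
	}

	spin_unlock_irqrestore(&tree->lock, flags);
	wake_up_q(&wake_q);
}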

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-04-04 15:31 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-07  5:03 [PATCH 0/5] locking Introduce range reader/writer lock Davidlohr Bueso
2017-03-07  5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
2017-03-09 11:03   ` Jan Kara
2017-03-28 10:00   ` Laurent Dufour
2017-03-28 16:39     ` Davidlohr Bueso
2017-03-28 16:58       ` Kirill A. Shutemov
2017-03-29  8:38         ` Laurent Dufour
2017-03-29 15:31           ` Davidlohr Bueso
2017-03-29 15:40             ` Kirill A. Shutemov
2017-03-29 16:10               ` Davidlohr Bueso
2017-04-03 14:19       ` Laurent Dufour
2017-04-03 15:26         ` Davidlohr Bueso
2017-04-03 16:06           ` Jan Kara
2017-04-04 15:31             ` Davidlohr Bueso
2017-03-29  8:56   ` Peter Zijlstra
2017-03-29 15:12     ` Davidlohr Bueso
2017-03-29  8:59   ` Peter Zijlstra
2017-03-29  9:08   ` Peter Zijlstra
2017-03-29 15:05     ` Davidlohr Bueso
2017-03-29  9:11   ` Peter Zijlstra
2017-03-29  9:44   ` Peter Zijlstra
2017-03-29 10:35   ` Peter Zijlstra
2017-03-29 10:49   ` Peter Zijlstra
2017-03-29 15:14     ` Davidlohr Bueso
2017-03-30 14:56   ` Laurent Dufour
2017-03-30 17:13     ` Davidlohr Bueso
2017-03-07  5:03 ` [PATCH 2/5] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
2017-03-07  5:03 ` [PATCH 3/5] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
2017-03-07  5:03 ` [PATCH 4/5] locking/locktorture: Support range rwlocks Davidlohr Bueso
2017-03-07  5:03 ` [PATCH 5/5] staging/lustre: Use generic range rwlock Davidlohr Bueso
2017-03-07  6:05   ` Oleg Drokin
2017-03-07  6:05     ` [lustre-devel] " Oleg Drokin
2017-03-08 15:02     ` Davidlohr Bueso
2017-03-09  8:56   ` kbuild test robot
2017-03-09  8:56     ` [lustre-devel] " kbuild test robot
2017-03-09 14:40   ` kbuild test robot
2017-03-09 14:40     ` [lustre-devel] " kbuild test robot
