* [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
@ 2017-04-06  8:46 Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 1/6] interval-tree: Build unconditionally Davidlohr Bueso
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave, linux-kernel

Changes from v1 (https://lwn.net/Articles/716383/), all in patch 2:
  - s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL
  - Made the tree walks a foreach loop, instead of while.
  - Fixed signal_pending() lockup issue for unaccounted waiters.
  - Fixed initialization macros.
  - Check condition before signal_pending in loop.
  - Make building the interval-tree unconditionally a separate patch.
  - More/better documentation.
  - Added jack's reviewed-by.
  - Renamed inf to full.

** What's still pending:
  - Debug support (it's been a pain to use lockdep with range locking).
  - Renaming the lock; I don't agree that rwlocks should have exclusivity
    on such a generic name; maybe they should be called rwspinlocks :)
  - The lustre 32-bit offset caveat.

Hi,

Here's a very tardy proposal for enhancements to Jan's original[1] range lock
using interval trees. Because at some point it would be awesome to switch mmap_sem
from rwsem to a range rwlock, I've focused on making it sharable and on performance
enhancements that reduce the delta between this and conventional locks as much as
possible -- details in patch 2.

The rest of the patches add support for testing the new lock and actually make
use of it for lustre. It has passed quite a bit of artificial pounding and I
believe/hope it is in shape to consider.

Laurent, while there are some pending items Peter wants, it might be worth
using this series for any further mmap_sem testing.

Applies on top of tip v4.11-rc5

[1] https://lkml.org/lkml/2013/1/31/483

Thanks!

Davidlohr Bueso (6):
  interval-tree: Build unconditionally
  locking: Introduce range reader/writer lock
  locking/locktorture: Fix rwsem reader_delay
  locking/locktorture: Fix num reader/writer corner cases
  locking/locktorture: Support range rwlocks
  staging/lustre: Use generic range rwlock

 drivers/gpu/drm/Kconfig                            |   2 -
 drivers/gpu/drm/i915/Kconfig                       |   1 -
 drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
 drivers/staging/lustre/lustre/llite/file.c         |  21 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
 drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------
 drivers/staging/lustre/lustre/llite/range_lock.h   |  82 ---
 include/linux/range_rwlock.h                       | 115 +++++
 kernel/locking/Makefile                            |   2 +-
 kernel/locking/locktorture.c                       | 299 ++++++++---
 kernel/locking/range_rwlock.c                      | 554 +++++++++++++++++++++
 lib/Kconfig                                        |  14 -
 lib/Kconfig.debug                                  |   1 -
 lib/Makefile                                       |   3 +-
 15 files changed, 903 insertions(+), 439 deletions(-)
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

-- 
2.12.0

* [PATCH 1/6] interval-tree: Build unconditionally
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

In preparation for range locking, this patch gets rid of the
CONFIG_INTERVAL_TREE option, as the interval tree will now be
built unconditionally.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/gpu/drm/Kconfig      |  2 --
 drivers/gpu/drm/i915/Kconfig |  1 -
 lib/Kconfig                  | 14 --------------
 lib/Kconfig.debug            |  1 -
 lib/Makefile                 |  3 +--
 5 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 88e01e08e279..e4d9eadd2c47 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -154,7 +154,6 @@ config DRM_RADEON
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have an ATI Radeon graphics card.  There
 	  are both PCI and AGP versions.  You don't need to choose this to
@@ -174,7 +173,6 @@ config DRM_AMDGPU
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have a recent AMD Radeon graphics card.
 
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 183f5dc1c3f2..8a9154550f46 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -3,7 +3,6 @@ config DRM_I915
 	depends on DRM
 	depends on X86 && PCI
 	select INTEL_GTT
-	select INTERVAL_TREE
 	# we need shmfs for the swappable backing store, and in particular
 	# the shmem_readpage() which depends upon tmpfs
 	select SHMEM
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..52ca6668a16d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
 config BTREE
 	bool
 
-config INTERVAL_TREE
-	bool
-	help
-	  Simple, embeddable, interval-tree. Can find the start of an
-	  overlapping range in log(n) time and then iterate over all
-	  overlapping nodes. The algorithm is implemented as an
-	  augmented rbtree.
-
-	  See:
-
-		Documentation/rbtree.txt
-
-	  for more information.
-
 config RADIX_TREE_MULTIORDER
 	bool
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97d62c2da6c2..a8a64eb54d3a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1771,7 +1771,6 @@ config RBTREE_TEST
 config INTERVAL_TREE_TEST
 	tristate "Interval tree test"
 	depends on m && DEBUG_KERNEL
-	select INTERVAL_TREE
 	help
 	  A benchmark measuring the performance of the interval tree library
 
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..7881de026baf 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o
+	 once.o refcount.o interval_tree.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
@@ -80,7 +80,6 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
 
 obj-$(CONFIG_BTREE) += btree.o
-obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
 obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
-- 
2.12.0

* [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 1/6] interval-tree: Build unconditionally Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-06  9:01   ` Laurent Dufour
                     ` (2 more replies)
  2017-04-06  8:46 ` [PATCH 3/6] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

This implements a sleepable range rwlock, based on an interval tree, serializing
conflicting/intersecting/overlapping ranges within the tree. The largest range
is given by [0, ~0] (inclusive). Unlike traditional locks, range locking
involves dealing with the tree itself and the range to be locked, which is
normally stack allocated and always explicitly prepared/initialized by the user
as a sorted interval [a0, a1], with a0 <= a1, before actually taking the lock.
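
For illustration, a minimal usage sketch against the interface added by this
patch (the 'foo_tree' tree and the [first, last] region are made up; the calls
are the ones declared in include/linux/range_rwlock.h):

  #include <linux/range_rwlock.h>

  /* One tree serializes all ranges of one shared resource. */
  DEFINE_RANGE_RWLOCK_TREE(foo_tree);

  static void foo_write_region(unsigned long first, unsigned long last)
  {
          struct range_rwlock range; /* normally lives on the stack */

          range_rwlock_init(&range, first, last); /* caller ensures first <= last */

          /* sleeps until no overlapping range conflicts with [first, last] */
          range_write_lock(&foo_tree, &range);
          /* ... exclusive access to [first, last] ... */
          range_write_unlock(&foo_tree, &range);
  }

Readers use range_read_lock()/range_read_unlock() in the same way, and only
block on overlapping ranges that belong to writers.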

An interval tree of locked and to-be-locked ranges is kept. When a new range
lock is requested, we add its interval to the tree and store the number of
intervals intersecting it in 'blocking_ranges'. For the reader case,
'blocking_ranges' is only accounted for if the intersecting range is
marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
guarantee that the task is blocked until there are no overlapping ranges in
the tree.

When a range is unlocked, we again walk the intervals that overlap with the
unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up the
owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
therefore relies on the order of the interval tree -- as opposed to a more
traditional fifo mechanism. There is no lock stealing either, which prevents
starvation and guarantees fairness.

How much does it cost:
----------------------

The cost of lock and unlock of a range is O((1+R_int)log(R_all)), where R_all
is the total number of ranges and R_int is the number of ranges intersecting the
new range to be added.

Due to their sharable nature, full range locks can be compared with rw-semaphores,
which also serve as a comparison point from a mutex standpoint, as writer-only
situations are pretty similar nowadays.

The first difference is the memory footprint: tree locks are larger than rwsems
(80 vs 40 bytes), and they also require an additional 72 bytes of stack for the
range structure.

Secondly, because every range call is serialized by the tree->lock, any lock()
fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
overhead compared to a single atomic insn in the case of rwsems. The same
obviously holds for the unlock() case.

The torture module was used to measure 1-1 differences in lock acquisition with
increasing core counts over a period of 10 minutes. Readers and writers are
interleaved, with a slight advantage to writers, as they are the first kthreads
to be created. The following shows the avg ops/minute with various thread setups
on boxes with small and large core counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For the read/write-only cases, there is very little difference between the range lock
and rwsems, with up to a 3% hit, which could very well be considered in the noise range.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While single reader and writer threads do not show much difference, increasing
core counts shows that in reader/writer workloads, writer threads can take a hit in
raw performance of up to ~20%, while reader throughput is quite similar between
both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, larger machines show that range locks take only a minor
(up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to ~40%,
similar to the 4-core machine; however, reader threads can increase their number of
acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
always be higher for rwsems. A huge factor in this behavior is that range locks
do not have the writer spin-on-owner feature.

On both machines, when actually testing threads acquiring different ranges, the
range lock will always outperform the rwsem in throughput, due to the increased
parallelism, which is no surprise either. As such, microbenchmarks that merely
pound on a single lock will pretty much always suffer upon direct lock conversions,
but not enough to matter in the overall picture.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/range_rwlock.h  | 115 +++++++++
 kernel/locking/Makefile       |   2 +-
 kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 670 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
new file mode 100644
index 000000000000..0a8dad62d097
--- /dev/null
+++ b/include/linux/range_rwlock.h
@@ -0,0 +1,115 @@
+/*
+ * Range/interval rw-locking
+ * -------------------------
+ *
+ * An interval tree of locked and to-be-locked ranges is kept. When a new range
+ * lock is requested, we add its interval to the tree and store number of
+ * intervals intersecting it to 'blocking_ranges'. For the reader case,
+ * 'blocking_ranges' is only accounted for if the intersecting range is
+ * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
+ * guarantee that task is blocked until there are no overlapping ranges in the
+ * tree.
+ *
+ * When a range is unlocked, we again walk intervals that overlap with the
+ * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
+ * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
+ * therefore relies on the order of the interval tree  -- as opposed to a
+ * more traditional fifo mechanism. There is no lock stealing either, which
+ * prevents starvation and guarantees fairness.
+ *
+ * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
+ * is total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_RWLOCK_H
+#define _LINUX_RANGE_RWLOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_RWLOCK_FULL].
+ */
+#define RANGE_RWLOCK_FULL  ~0UL
+
+struct range_rwlock {
+	struct interval_tree_node node;
+	struct task_struct *waiter;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	u64 seqnum;
+};
+
+struct range_rwlock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+	u64 seqnum; /* track order of incoming ranges, avoid overflows */
+};
+
+#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .seqnum = 0					\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
+
+#define DEFINE_RANGE_RWLOCK_TREE(name) \
+       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
+		.node = {			\
+			.start = (__start)	\
+			,.last = (__last)	\
+		}				\
+		, .task = NULL			\
+		, .blocking_ranges = 0		\
+		, .reader = false		\
+		, .seqnum = 0			\
+	}
+
+#define DEFINE_RANGE_RWLOCK(name, start, last)				\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_RWLOCK_FULL(name)		\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
+
+static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+	tree->seqnum = 0;
+}
+
+void range_rwlock_init(struct range_rwlock *lock,
+		       unsigned long start, unsigned long last);
+void range_rwlock_init_full(struct range_rwlock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock);
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock);
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock);
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock);
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock);
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..bf054d0af82b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
new file mode 100644
index 000000000000..550ad360d87a
--- /dev/null
+++ b/kernel/locking/range_rwlock.c
@@ -0,0 +1,554 @@
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_rwlock.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+#define range_entry(ptr, type, member) container_of(ptr, type, member)
+
+#define range_interval_tree_foreach(node, root, start, last)	\
+	for (node = interval_tree_iter_first(root, start, last); \
+	     node; node = interval_tree_iter_next(node, start, last))
+
+/*
+ * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the latter, we rely on the root node,
+ * which by augmented interval tree property, holds the largest value in
+ * its last-in-subtree. This allows mitigating some of the tree walk overhead
+ * for non-intersecting ranges, maintained and consulted in O(1).
+ */
+static inline bool
+__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
+	    lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	lock->seqnum = tree->seqnum++;
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+/*
+ * lock->waiter reader tracking.
+ */
+#define RANGE_FLAG_READER	1UL
+
+static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
+{
+	return (struct task_struct *)
+		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline void range_lock_set_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter | RANGE_FLAG_READER);
+}
+
+static inline void range_lock_clear_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline bool range_lock_is_reader(struct range_rwlock *lock)
+{
+	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
+}
+
+static inline void
+__range_rwlock_init(struct range_rwlock *lock,
+		    unsigned long start, unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->waiter = NULL;
+	lock->seqnum = 0;
+}
+
+/**
+ * range_rwlock_init - Initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range's [start, last] such that it can
+ * later be locked. User is expected to enter a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		     unsigned long last)
+{
+	__range_rwlock_init(lock, start, last);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init);
+
+/**
+ * range_rwlock_init_full - Initialize a full range lock
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init_full(struct range_rwlock *lock)
+{
+	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init_full);
+
+static inline void
+range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, range_lock_waiter(lock));
+}
+
+static inline int wait_for_ranges(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		if (unlikely(signal_pending_state(state, current))) {
+			struct interval_tree_node *node;
+			unsigned long flags;
+			DEFINE_WAKE_Q(wake_q);
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+
+			range_lock_clear_reader(lock);
+			__range_tree_remove(tree, lock);
+
+			if (!__range_intersects_intree(tree, lock))
+				goto unlock;
+
+			range_interval_tree_foreach(node, &tree->root,
+						    lock->node.start,
+						    lock->node.last) {
+				struct range_rwlock *blked;
+				blked = range_entry(node,
+						    struct range_rwlock, node);
+
+				if (range_lock_is_reader(lock) &&
+				    range_lock_is_reader(blked))
+					continue;
+
+				/* unaccount for threads _we_ are blocking */
+				if (lock->seqnum < blked->seqnum)
+					range_rwlock_unblock(blked, &wake_q);
+			}
+
+		unlock:
+			spin_unlock_irqrestore(&tree->lock, flags);
+			wake_up_q(&wake_q);
+			break;
+		}
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+static __always_inline int __sched
+__range_read_lock_common(struct range_rwlock_tree *tree,
+			 struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	range_lock_set_reader(lock);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_read_lock - Lock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired, or sleeps until
+ * there are no overlapping ranges.
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock);
+
+/**
+ * range_read_lock_interruptible - Lock for reading (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_interruptible);
+
+/**
+ * range_read_lock_killable - Lock for reading (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_killable);
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock)) {
+			ret = false;
+			goto unlock;
+		}
+	}
+insert:
+	range_lock_set_reader(lock);
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(range_read_trylock);
+
+/**
+ * range_read_unlock - Unlock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked writers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired read lock.
+ */
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_read_unlock);
+
+static __always_inline int __sched
+__range_write_lock_common(struct range_rwlock_tree *tree,
+			  struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to block; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_write_lock - Lock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired, or sleeps until
+ * there are no overlapping ranges.
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock);
+
+/**
+ * range_write_lock_interruptible - Lock for writing (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_interruptible);
+
+/**
+ * range_write_lock_killable - Lock for writing (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_killable);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		range_lock_clear_reader(lock);
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+EXPORT_SYMBOL_GPL(range_write_trylock);
+
+/**
+ * range_write_unlock - Unlock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked waiters, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired write lock.
+ */
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(range_lock_is_reader(lock));
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		/*
+		 * Unaccount for any blocked reader lock. Wakeup if possible.
+		 */
+		if (range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	range_lock_set_reader(lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_downgrade_write);
-- 
2.12.0

* [PATCH 3/6] locking/locktorture: Fix rwsem reader_delay
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 1/6] interval-tree: Build unconditionally Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 4/6] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

We should account for the number of reader threads, not writers, in this
callback. The current code could even trigger a division by zero if the
user explicitly disables writers.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index f24582d4dad3..0ef5510f8742 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -570,7 +570,7 @@ static void torture_rwsem_read_delay(struct torture_random_state *trsp)
 
 	/* We want a long delay occasionally to force massive contention.  */
 	if (!(torture_random(trsp) %
-	      (cxt.nrealwriters_stress * 2000 * longdelay_ms)))
+	      (cxt.nrealreaders_stress * 2000 * longdelay_ms)))
 		mdelay(longdelay_ms * 2);
 	else
 		mdelay(longdelay_ms / 2);
-- 
2.12.0

* [PATCH 4/6] locking/locktorture: Fix num reader/writer corner cases
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (2 preceding siblings ...)
  2017-04-06  8:46 ` [PATCH 3/6] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 5/6] locking/locktorture: Support range rwlocks Davidlohr Bueso
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

Things can explode for locktorture if the user passes parameter combinations
such as nwriters_stress=0 or nreaders_stress=0. Fix this by not assuming we
always want to torture writer threads.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 76 +++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 32 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 0ef5510f8742..a68167803eee 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -715,8 +715,7 @@ static void __torture_print_stats(char *page,
 {
 	bool fail = 0;
 	int i, n_stress;
-	long max = 0;
-	long min = statp[0].n_lock_acquired;
+	long max = 0, min = statp ? statp[0].n_lock_acquired : 0;
 	long long sum = 0;
 
 	n_stress = write ? cxt.nrealwriters_stress : cxt.nrealreaders_stress;
@@ -823,7 +822,7 @@ static void lock_torture_cleanup(void)
 	 * such, only perform the underlying torture-specific cleanups,
 	 * and avoid anything related to locktorture.
 	 */
-	if (!cxt.lwsa)
+	if (!cxt.lwsa && !cxt.lrsa)
 		goto end;
 
 	if (writer_tasks) {
@@ -898,6 +897,13 @@ static int __init lock_torture_init(void)
 		firsterr = -EINVAL;
 		goto unwind;
 	}
+
+	if (nwriters_stress == 0 && nreaders_stress == 0) {
+		pr_alert("lock-torture: must run at least one locking thread\n");
+		firsterr = -EINVAL;
+		goto unwind;
+	}
+
 	if (cxt.cur_ops->init)
 		cxt.cur_ops->init();
 
@@ -921,17 +927,19 @@ static int __init lock_torture_init(void)
 #endif
 
 	/* Initialize the statistics so that each run gets its own numbers. */
+	if (nwriters_stress) {
+		lock_is_write_held = 0;
+		cxt.lwsa = kmalloc(sizeof(*cxt.lwsa) * cxt.nrealwriters_stress, GFP_KERNEL);
+		if (cxt.lwsa == NULL) {
+			VERBOSE_TOROUT_STRING("cxt.lwsa: Out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
 
-	lock_is_write_held = 0;
-	cxt.lwsa = kmalloc(sizeof(*cxt.lwsa) * cxt.nrealwriters_stress, GFP_KERNEL);
-	if (cxt.lwsa == NULL) {
-		VERBOSE_TOROUT_STRING("cxt.lwsa: Out of memory");
-		firsterr = -ENOMEM;
-		goto unwind;
-	}
-	for (i = 0; i < cxt.nrealwriters_stress; i++) {
-		cxt.lwsa[i].n_lock_fail = 0;
-		cxt.lwsa[i].n_lock_acquired = 0;
+		for (i = 0; i < cxt.nrealwriters_stress; i++) {
+			cxt.lwsa[i].n_lock_fail = 0;
+			cxt.lwsa[i].n_lock_acquired = 0;
+		}
 	}
 
 	if (cxt.cur_ops->readlock) {
@@ -948,19 +956,21 @@ static int __init lock_torture_init(void)
 			cxt.nrealreaders_stress = cxt.nrealwriters_stress;
 		}
 
-		lock_is_read_held = 0;
-		cxt.lrsa = kmalloc(sizeof(*cxt.lrsa) * cxt.nrealreaders_stress, GFP_KERNEL);
-		if (cxt.lrsa == NULL) {
-			VERBOSE_TOROUT_STRING("cxt.lrsa: Out of memory");
-			firsterr = -ENOMEM;
-			kfree(cxt.lwsa);
-			cxt.lwsa = NULL;
-			goto unwind;
-		}
-
-		for (i = 0; i < cxt.nrealreaders_stress; i++) {
-			cxt.lrsa[i].n_lock_fail = 0;
-			cxt.lrsa[i].n_lock_acquired = 0;
+		if (nreaders_stress) {
+			lock_is_read_held = 0;
+			cxt.lrsa = kmalloc(sizeof(*cxt.lrsa) * cxt.nrealreaders_stress, GFP_KERNEL);
+			if (cxt.lrsa == NULL) {
+				VERBOSE_TOROUT_STRING("cxt.lrsa: Out of memory");
+				firsterr = -ENOMEM;
+				kfree(cxt.lwsa);
+				cxt.lwsa = NULL;
+				goto unwind;
+			}
+
+			for (i = 0; i < cxt.nrealreaders_stress; i++) {
+				cxt.lrsa[i].n_lock_fail = 0;
+				cxt.lrsa[i].n_lock_acquired = 0;
+			}
 		}
 	}
 
@@ -990,12 +1000,14 @@ static int __init lock_torture_init(void)
 			goto unwind;
 	}
 
-	writer_tasks = kzalloc(cxt.nrealwriters_stress * sizeof(writer_tasks[0]),
-			       GFP_KERNEL);
-	if (writer_tasks == NULL) {
-		VERBOSE_TOROUT_ERRSTRING("writer_tasks: Out of memory");
-		firsterr = -ENOMEM;
-		goto unwind;
+	if (nwriters_stress) {
+		writer_tasks = kzalloc(cxt.nrealwriters_stress * sizeof(writer_tasks[0]),
+				       GFP_KERNEL);
+		if (writer_tasks == NULL) {
+			VERBOSE_TOROUT_ERRSTRING("writer_tasks: Out of memory");
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
 	}
 
 	if (cxt.cur_ops->readlock) {
-- 
2.12.0

* [PATCH 5/6] locking/locktorture: Support range rwlocks
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (3 preceding siblings ...)
  2017-04-06  8:46 ` [PATCH 4/6] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-06  8:46 ` [PATCH 6/6] staging/lustre: Use generic range rwlock Davidlohr Bueso
  2017-04-19 12:37 ` [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Peter Zijlstra
  6 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

Torture the range reader/writer lock. Each thread will attempt to
lock+unlock a random range within [0, 4096].
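
As a hypothetical example invocation (these are the usual locktorture module
parameters; only torture_type=range_rwlock is new with this patch):

  modprobe locktorture torture_type=range_rwlock nwriters_stress=4 nreaders_stress=4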

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 kernel/locking/locktorture.c | 221 +++++++++++++++++++++++++++++++++----------
 1 file changed, 172 insertions(+), 49 deletions(-)

diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index a68167803eee..76de50da4cdc 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -29,6 +29,7 @@
 #include <linux/rwlock.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
+#include <linux/range_rwlock.h>
 #include <linux/smp.h>
 #include <linux/interrupt.h>
 #include <linux/sched.h>
@@ -89,13 +90,13 @@ static void lock_torture_cleanup(void);
  */
 struct lock_torture_ops {
 	void (*init)(void);
-	int (*writelock)(void);
+	int (*writelock)(void *arg);
 	void (*write_delay)(struct torture_random_state *trsp);
 	void (*task_boost)(struct torture_random_state *trsp);
-	void (*writeunlock)(void);
-	int (*readlock)(void);
+	void (*writeunlock)(void *arg);
+	int (*readlock)(void *arg);
 	void (*read_delay)(struct torture_random_state *trsp);
-	void (*readunlock)(void);
+	void (*readunlock)(void *arg);
 
 	unsigned long flags; /* for irq spinlocks */
 	const char *name;
@@ -117,7 +118,7 @@ static struct lock_torture_cxt cxt = { 0, 0, false,
  * Definitions for lock torture testing.
  */
 
-static int torture_lock_busted_write_lock(void)
+static int torture_lock_busted_write_lock(void *arg)
 {
 	return 0;  /* BUGGY, do not use in real life!!! */
 }
@@ -136,7 +137,7 @@ static void torture_lock_busted_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_lock_busted_write_unlock(void)
+static void torture_lock_busted_write_unlock(void *arg)
 {
 	  /* BUGGY, do not use in real life!!! */
 }
@@ -159,7 +160,8 @@ static struct lock_torture_ops lock_busted_ops = {
 
 static DEFINE_SPINLOCK(torture_spinlock);
 
-static int torture_spin_lock_write_lock(void) __acquires(torture_spinlock)
+static int torture_spin_lock_write_lock(void *arg)
+	__acquires(torture_spinlock)
 {
 	spin_lock(&torture_spinlock);
 	return 0;
@@ -185,7 +187,8 @@ static void torture_spin_lock_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_spin_lock_write_unlock(void) __releases(torture_spinlock)
+static void torture_spin_lock_write_unlock(void *arg)
+	__releases(torture_spinlock)
 {
 	spin_unlock(&torture_spinlock);
 }
@@ -201,8 +204,8 @@ static struct lock_torture_ops spin_lock_ops = {
 	.name		= "spin_lock"
 };
 
-static int torture_spin_lock_write_lock_irq(void)
-__acquires(torture_spinlock)
+static int torture_spin_lock_write_lock_irq(void *arg)
+	__acquires(torture_spinlock)
 {
 	unsigned long flags;
 
@@ -211,7 +214,7 @@ __acquires(torture_spinlock)
 	return 0;
 }
 
-static void torture_lock_spin_write_unlock_irq(void)
+static void torture_lock_spin_write_unlock_irq(void *arg)
 __releases(torture_spinlock)
 {
 	spin_unlock_irqrestore(&torture_spinlock, cxt.cur_ops->flags);
@@ -230,7 +233,8 @@ static struct lock_torture_ops spin_lock_irq_ops = {
 
 static DEFINE_RWLOCK(torture_rwlock);
 
-static int torture_rwlock_write_lock(void) __acquires(torture_rwlock)
+static int torture_rwlock_write_lock(void *arg)
+	__acquires(torture_rwlock)
 {
 	write_lock(&torture_rwlock);
 	return 0;
@@ -251,12 +255,14 @@ static void torture_rwlock_write_delay(struct torture_random_state *trsp)
 		udelay(shortdelay_us);
 }
 
-static void torture_rwlock_write_unlock(void) __releases(torture_rwlock)
+static void torture_rwlock_write_unlock(void *arg)
+	__releases(torture_rwlock)
 {
 	write_unlock(&torture_rwlock);
 }
 
-static int torture_rwlock_read_lock(void) __acquires(torture_rwlock)
+static int torture_rwlock_read_lock(void *arg)
+	__acquires(torture_rwlock)
 {
 	read_lock(&torture_rwlock);
 	return 0;
@@ -277,7 +283,8 @@ static void torture_rwlock_read_delay(struct torture_random_state *trsp)
 		udelay(shortdelay_us);
 }
 
-static void torture_rwlock_read_unlock(void) __releases(torture_rwlock)
+static void torture_rwlock_read_unlock(void *arg)
+	__releases(torture_rwlock)
 {
 	read_unlock(&torture_rwlock);
 }
@@ -293,7 +300,8 @@ static struct lock_torture_ops rw_lock_ops = {
 	.name		= "rw_lock"
 };
 
-static int torture_rwlock_write_lock_irq(void) __acquires(torture_rwlock)
+static int torture_rwlock_write_lock_irq(void *arg)
+	__acquires(torture_rwlock)
 {
 	unsigned long flags;
 
@@ -302,13 +310,14 @@ static int torture_rwlock_write_lock_irq(void) __acquires(torture_rwlock)
 	return 0;
 }
 
-static void torture_rwlock_write_unlock_irq(void)
-__releases(torture_rwlock)
+static void torture_rwlock_write_unlock_irq(void *arg)
+	__releases(torture_rwlock)
 {
 	write_unlock_irqrestore(&torture_rwlock, cxt.cur_ops->flags);
 }
 
-static int torture_rwlock_read_lock_irq(void) __acquires(torture_rwlock)
+static int torture_rwlock_read_lock_irq(void *arg)
+	__acquires(torture_rwlock)
 {
 	unsigned long flags;
 
@@ -317,8 +326,8 @@ static int torture_rwlock_read_lock_irq(void) __acquires(torture_rwlock)
 	return 0;
 }
 
-static void torture_rwlock_read_unlock_irq(void)
-__releases(torture_rwlock)
+static void torture_rwlock_read_unlock_irq(void *arg)
+	__releases(torture_rwlock)
 {
 	read_unlock_irqrestore(&torture_rwlock, cxt.cur_ops->flags);
 }
@@ -336,7 +345,8 @@ static struct lock_torture_ops rw_lock_irq_ops = {
 
 static DEFINE_MUTEX(torture_mutex);
 
-static int torture_mutex_lock(void) __acquires(torture_mutex)
+static int torture_mutex_lock(void *arg)
+	__acquires(torture_mutex)
 {
 	mutex_lock(&torture_mutex);
 	return 0;
@@ -358,7 +368,8 @@ static void torture_mutex_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_mutex_unlock(void) __releases(torture_mutex)
+static void torture_mutex_unlock(void *arg)
+	__releases(torture_mutex)
 {
 	mutex_unlock(&torture_mutex);
 }
@@ -380,7 +391,7 @@ static DEFINE_WW_MUTEX(torture_ww_mutex_0, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_1, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_2, &torture_ww_class);
 
-static int torture_ww_mutex_lock(void)
+static int torture_ww_mutex_lock(void *arg)
 __acquires(torture_ww_mutex_0)
 __acquires(torture_ww_mutex_1)
 __acquires(torture_ww_mutex_2)
@@ -425,7 +436,7 @@ __acquires(torture_ww_mutex_2)
 	return 0;
 }
 
-static void torture_ww_mutex_unlock(void)
+static void torture_ww_mutex_unlock(void *arg)
 __releases(torture_ww_mutex_0)
 __releases(torture_ww_mutex_1)
 __releases(torture_ww_mutex_2)
@@ -449,7 +460,8 @@ static struct lock_torture_ops ww_mutex_lock_ops = {
 #ifdef CONFIG_RT_MUTEXES
 static DEFINE_RT_MUTEX(torture_rtmutex);
 
-static int torture_rtmutex_lock(void) __acquires(torture_rtmutex)
+static int torture_rtmutex_lock(void *arg)
+	__acquires(torture_rtmutex)
 {
 	rt_mutex_lock(&torture_rtmutex);
 	return 0;
@@ -513,7 +525,8 @@ static void torture_rtmutex_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rtmutex_unlock(void) __releases(torture_rtmutex)
+static void torture_rtmutex_unlock(void *arg)
+	__releases(torture_rtmutex)
 {
 	rt_mutex_unlock(&torture_rtmutex);
 }
@@ -531,7 +544,8 @@ static struct lock_torture_ops rtmutex_lock_ops = {
 #endif
 
 static DECLARE_RWSEM(torture_rwsem);
-static int torture_rwsem_down_write(void) __acquires(torture_rwsem)
+static int torture_rwsem_down_write(void *arg)
+	__acquires(torture_rwsem)
 {
 	down_write(&torture_rwsem);
 	return 0;
@@ -553,12 +567,14 @@ static void torture_rwsem_write_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rwsem_up_write(void) __releases(torture_rwsem)
+static void torture_rwsem_up_write(void *arg)
+	__releases(torture_rwsem)
 {
 	up_write(&torture_rwsem);
 }
 
-static int torture_rwsem_down_read(void) __acquires(torture_rwsem)
+static int torture_rwsem_down_read(void *arg)
+	__acquires(torture_rwsem)
 {
 	down_read(&torture_rwsem);
 	return 0;
@@ -580,7 +596,8 @@ static void torture_rwsem_read_delay(struct torture_random_state *trsp)
 #endif
 }
 
-static void torture_rwsem_up_read(void) __releases(torture_rwsem)
+static void torture_rwsem_up_read(void *arg)
+	__releases(torture_rwsem)
 {
 	up_read(&torture_rwsem);
 }
@@ -596,6 +613,70 @@ static struct lock_torture_ops rwsem_lock_ops = {
 	.name		= "rwsem_lock"
 };
 
+/*
+ * Lock torturing starts here, and is slightly different than other locks
+ * in that it expects the range to lock to be passed as well as the actual
+ * lock itself.
+ */
+#define locktorture_is_range_lock(type) (!strncmp(type, "range_rwlock", 12))
+
+struct locktorture_range {
+	unsigned long start;
+	unsigned long end;
+	struct range_rwlock lock;
+};
+
+#define RANGE_POOL_SIZE (1UL << 12)
+struct locktorture_range *range_pool = NULL;
+
+static DEFINE_RANGE_RWLOCK_TREE(torture_range_rwlock);
+
+static int torture_range_rwlock_read_lock(void *arg)
+	__acquires(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_read_lock(&torture_range_rwlock, range);
+	return 0;
+}
+
+static void torture_range_rwlock_read_unlock(void *arg)
+	__releases(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_read_unlock(&torture_range_rwlock, range);
+}
+
+static int torture_range_rwlock_write_lock(void *arg)
+	__acquires(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_write_lock(&torture_range_rwlock, range);
+	return 0;
+}
+
+static void torture_range_rwlock_write_unlock(void *arg)
+	__releases(torture_range_rwlock)
+{
+	struct range_rwlock *range = (struct range_rwlock *) arg;
+
+	range_write_unlock(&torture_range_rwlock, range);
+}
+
+static struct lock_torture_ops range_rwlock_ops = {
+	.writelock	= torture_range_rwlock_write_lock,
+	.write_delay	= torture_rwsem_write_delay,
+	.task_boost     = torture_boost_dummy,
+	.writeunlock	= torture_range_rwlock_write_unlock,
+
+	.readlock       = torture_range_rwlock_read_lock,
+	.read_delay	= torture_rwsem_read_delay,
+	.readunlock     = torture_range_rwlock_read_unlock,
+	.name		= "range_rwlock"
+};
+
 #include <linux/percpu-rwsem.h>
 static struct percpu_rw_semaphore pcpu_rwsem;
 
@@ -604,24 +685,28 @@ void torture_percpu_rwsem_init(void)
 	BUG_ON(percpu_init_rwsem(&pcpu_rwsem));
 }
 
-static int torture_percpu_rwsem_down_write(void) __acquires(pcpu_rwsem)
+static int torture_percpu_rwsem_down_write(void *arg)
+	__acquires(pcpu_rwsem)
 {
 	percpu_down_write(&pcpu_rwsem);
 	return 0;
 }
 
-static void torture_percpu_rwsem_up_write(void) __releases(pcpu_rwsem)
+static void torture_percpu_rwsem_up_write(void *arg)
+	__releases(pcpu_rwsem)
 {
 	percpu_up_write(&pcpu_rwsem);
 }
 
-static int torture_percpu_rwsem_down_read(void) __acquires(pcpu_rwsem)
+static int torture_percpu_rwsem_down_read(void *arg)
+	__acquires(pcpu_rwsem)
 {
 	percpu_down_read(&pcpu_rwsem);
 	return 0;
 }
 
-static void torture_percpu_rwsem_up_read(void) __releases(pcpu_rwsem)
+static void torture_percpu_rwsem_up_read(void *arg)
+	__releases(pcpu_rwsem)
 {
 	percpu_up_read(&pcpu_rwsem);
 }
@@ -638,6 +723,17 @@ static struct lock_torture_ops percpu_rwsem_lock_ops = {
 	.name		= "percpu_rwsem_lock"
 };
 
+static void mkrandom_range(struct range_rwlock *range,
+			   struct torture_random_state *trsp)
+{
+	unsigned long start, last;
+
+	last = (torture_random(trsp) >> 4) % (RANGE_POOL_SIZE + 1);
+	start = (torture_random(trsp) >> 4) % (last + 1);
+
+	range_rwlock_init(range, start, last);
+}
+
 /*
  * Lock torture writer kthread.  Repeatedly acquires and releases
  * the lock, checking for duplicate acquisitions.
@@ -646,26 +742,38 @@ static int lock_torture_writer(void *arg)
 {
 	struct lock_stress_stats *lwsp = arg;
 	static DEFINE_TORTURE_RANDOM(rand);
+	bool is_range = locktorture_is_range_lock(torture_type);
 
 	VERBOSE_TOROUT_STRING("lock_torture_writer task started");
 	set_user_nice(current, MAX_NICE);
 
 	do {
+		struct range_rwlock range;
+
 		if ((torture_random(&rand) & 0xfffff) == 0)
 			schedule_timeout_uninterruptible(1);
 
+		if (is_range)
+			mkrandom_range(&range, &rand);
+
 		cxt.cur_ops->task_boost(&rand);
-		cxt.cur_ops->writelock();
-		if (WARN_ON_ONCE(lock_is_write_held))
-			lwsp->n_lock_fail++;
-		lock_is_write_held = 1;
-		if (WARN_ON_ONCE(lock_is_read_held))
-			lwsp->n_lock_fail++; /* rare, but... */
+		cxt.cur_ops->writelock(&range);
+
+		if (!is_range) {
+			if (WARN_ON_ONCE(lock_is_write_held))
+				lwsp->n_lock_fail++;
+			if (WARN_ON_ONCE(lock_is_read_held))
+				lwsp->n_lock_fail++; /* rare, but... */
+
+			lock_is_write_held = true;
+		}
 
 		lwsp->n_lock_acquired++;
 		cxt.cur_ops->write_delay(&rand);
-		lock_is_write_held = 0;
-		cxt.cur_ops->writeunlock();
+
+		if (!is_range)
+			lock_is_write_held = false;
+		cxt.cur_ops->writeunlock(&range);
 
 		stutter_wait("lock_torture_writer");
 	} while (!torture_must_stop());
@@ -683,26 +791,40 @@ static int lock_torture_reader(void *arg)
 {
 	struct lock_stress_stats *lrsp = arg;
 	static DEFINE_TORTURE_RANDOM(rand);
+	bool is_range = locktorture_is_range_lock(torture_type);
 
 	VERBOSE_TOROUT_STRING("lock_torture_reader task started");
 	set_user_nice(current, MAX_NICE);
 
 	do {
+		struct range_rwlock range;
+
 		if ((torture_random(&rand) & 0xfffff) == 0)
 			schedule_timeout_uninterruptible(1);
 
-		cxt.cur_ops->readlock();
-		lock_is_read_held = 1;
-		if (WARN_ON_ONCE(lock_is_write_held))
-			lrsp->n_lock_fail++; /* rare, but... */
+		if (is_range)
+			mkrandom_range(&range, &rand);
+
+		cxt.cur_ops->readlock(&range);
+
+		if (!is_range) {
+			if (WARN_ON_ONCE(lock_is_write_held))
+				lrsp->n_lock_fail++; /* rare, but... */
+
+			lock_is_read_held = 1;
+		}
 
 		lrsp->n_lock_acquired++;
 		cxt.cur_ops->read_delay(&rand);
-		lock_is_read_held = 0;
-		cxt.cur_ops->readunlock();
+
+		if (!is_range)
+			lock_is_read_held = 0;
+
+		cxt.cur_ops->readunlock(&range);
 
 		stutter_wait("lock_torture_reader");
 	} while (!torture_must_stop());
+
 	torture_kthread_stopping("lock_torture_reader");
 	return 0;
 }
@@ -876,6 +998,7 @@ static int __init lock_torture_init(void)
 #endif
 		&rwsem_lock_ops,
 		&percpu_rwsem_lock_ops,
+		&range_rwlock_ops,
 	};
 
 	if (!torture_init_begin(torture_type, verbose, &torture_runnable))
-- 
2.12.0

* [PATCH 6/6] staging/lustre: Use generic range rwlock
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (4 preceding siblings ...)
  2017-04-06  8:46 ` [PATCH 5/6] locking/locktorture: Support range rwlocks Davidlohr Bueso
@ 2017-04-06  8:46 ` Davidlohr Bueso
  2017-04-07 10:08     ` [lustre-devel] " Dilger, Andreas
  2017-04-19 12:37 ` [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Peter Zijlstra
  6 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06  8:46 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, oleg.drokin, andreas.dilger, jsimmons,
	lustre-devel, Davidlohr Bueso

This replaces the in-house version, which is also derived
from Jan's interval tree implementation.

Cc: oleg.drokin@intel.com
Cc: andreas.dilger@intel.com
Cc: jsimmons@infradead.org
Cc: lustre-devel@lists.lustre.org
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
 drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
 drivers/staging/lustre/lustre/llite/file.c         |  21 +-
 .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
 drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
 drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------------------
 drivers/staging/lustre/lustre/llite/range_lock.h   |  82 -------
 6 files changed, 15 insertions(+), 336 deletions(-)
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
 delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h

diff --git a/drivers/staging/lustre/lustre/llite/Makefile b/drivers/staging/lustre/lustre/llite/Makefile
index 322d4fa63f5d..922a901bc62c 100644
--- a/drivers/staging/lustre/lustre/llite/Makefile
+++ b/drivers/staging/lustre/lustre/llite/Makefile
@@ -1,6 +1,6 @@
 obj-$(CONFIG_LUSTRE_FS) += lustre.o
 lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
-	    rw.o rw26.o namei.o symlink.o llite_mmap.o range_lock.o \
+	    rw.o rw26.o namei.o symlink.o llite_mmap.o \
 	    xattr.o xattr_cache.o xattr_security.o \
 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
 	    vvp_dev.o vvp_page.o vvp_lock.o vvp_io.o vvp_object.o \
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 481c0d01d4c6..1a14a79f87f8 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -42,6 +42,7 @@
 #include <linux/file.h>
 #include <linux/sched.h>
 #include <linux/mount.h>
+#include <linux/range_rwlock.h>
 #include "../include/lustre/ll_fiemap.h"
 #include "../include/lustre/lustre_ioctl.h"
 #include "../include/lustre_swab.h"
@@ -1055,7 +1056,7 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 	struct ll_inode_info *lli = ll_i2info(file_inode(file));
 	struct ll_file_data  *fd  = LUSTRE_FPRIVATE(file);
 	struct vvp_io *vio = vvp_env_io(env);
-	struct range_lock range;
+	struct range_rwlock range;
 	struct cl_io	 *io;
 	ssize_t result = 0;
 	int rc = 0;
@@ -1072,9 +1073,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		bool range_locked = false;
 
 		if (file->f_flags & O_APPEND)
-			range_lock_init(&range, 0, LUSTRE_EOF);
+			range_rwlock_init(&range, 0, LUSTRE_EOF);
 		else
-			range_lock_init(&range, *ppos, *ppos + count - 1);
+			range_rwlock_init(&range, *ppos, *ppos + count - 1);
 
 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
 		vio->vui_iter = args->u.normal.via_iter;
@@ -1087,10 +1088,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		if (((iot == CIT_WRITE) ||
 		     (iot == CIT_READ && (file->f_flags & O_DIRECT))) &&
 		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-			CDEBUG(D_VFSTRACE, "Range lock [%llu, %llu]\n",
-			       range.rl_node.in_extent.start,
-			       range.rl_node.in_extent.end);
-			rc = range_lock(&lli->lli_write_tree, &range);
+			CDEBUG(D_VFSTRACE, "Range lock [%lu, %lu]\n",
+			       range.node.start, range.node.last);
+			rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
 			if (rc < 0)
 				goto out;
 
@@ -1100,10 +1100,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
 		rc = cl_io_loop(env, io);
 		ll_cl_remove(file, env);
 		if (range_locked) {
-			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
-			       range.rl_node.in_extent.start,
-			       range.rl_node.in_extent.end);
-			range_unlock(&lli->lli_write_tree, &range);
+			CDEBUG(D_VFSTRACE, "Range unlock [%lu, %lu]\n",
+			       range.node.start, range.node.last);
+			range_write_unlock(&lli->lli_write_tree, &range);
 		}
 	} else {
 		/* cl_io_rw_init() handled IO */
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 55f68acd85d1..aa2ae72e3e70 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -49,8 +49,8 @@
 #include <linux/namei.h>
 #include <linux/xattr.h>
 #include <linux/posix_acl_xattr.h>
+#include <linux/range_rwlock.h>
 #include "vvp_internal.h"
-#include "range_lock.h"
 
 #ifndef FMODE_EXEC
 #define FMODE_EXEC 0
@@ -193,7 +193,7 @@ struct ll_inode_info {
 			 * }
 			 */
 			struct rw_semaphore		lli_trunc_sem;
-			struct range_lock_tree		lli_write_tree;
+			struct range_rwlock_tree	lli_write_tree;
 
 			struct rw_semaphore		lli_glimpse_sem;
 			unsigned long			lli_glimpse_time;
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index b229cbc7bb33..8054e916b3f5 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -40,6 +40,7 @@
 #include <linux/statfs.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/range_rwlock.h>
 
 #include "../include/lustre/lustre_ioctl.h"
 #include "../include/lustre_ha.h"
@@ -853,7 +854,7 @@ void ll_lli_init(struct ll_inode_info *lli)
 		mutex_init(&lli->lli_size_mutex);
 		lli->lli_symlink_name = NULL;
 		init_rwsem(&lli->lli_trunc_sem);
-		range_lock_tree_init(&lli->lli_write_tree);
+		range_rwlock_tree_init(&lli->lli_write_tree);
 		init_rwsem(&lli->lli_glimpse_sem);
 		lli->lli_glimpse_time = 0;
 		INIT_LIST_HEAD(&lli->lli_agl_list);
diff --git a/drivers/staging/lustre/lustre/llite/range_lock.c b/drivers/staging/lustre/lustre/llite/range_lock.c
deleted file mode 100644
index 14148a097476..000000000000
--- a/drivers/staging/lustre/lustre/llite/range_lock.c
+++ /dev/null
@@ -1,239 +0,0 @@
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Range lock is used to allow multiple threads writing a single shared
- * file given each thread is writing to a non-overlapping portion of the
- * file.
- *
- * Refer to the possible upstream kernel version of range lock by
- * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
- *
- * This file could later replaced by the upstream kernel version.
- */
-/*
- * Author: Prakash Surya <surya1@llnl.gov>
- * Author: Bobi Jam <bobijam.xu@intel.com>
- */
-#include "range_lock.h"
-#include "../include/lustre/lustre_user.h"
-
-/**
- * Initialize a range lock tree
- *
- * \param tree [in]	an empty range lock tree
- *
- * Pre:  Caller should have allocated the range lock tree.
- * Post: The range lock tree is ready to function.
- */
-void range_lock_tree_init(struct range_lock_tree *tree)
-{
-	tree->rlt_root = NULL;
-	tree->rlt_sequence = 0;
-	spin_lock_init(&tree->rlt_lock);
-}
-
-/**
- * Initialize a range lock node
- *
- * \param lock  [in]	an empty range lock node
- * \param start [in]	start of the covering region
- * \param end   [in]	end of the covering region
- *
- * Pre:  Caller should have allocated the range lock node.
- * Post: The range lock node is meant to cover [start, end] region
- */
-int range_lock_init(struct range_lock *lock, __u64 start, __u64 end)
-{
-	int rc;
-
-	memset(&lock->rl_node, 0, sizeof(lock->rl_node));
-	if (end != LUSTRE_EOF)
-		end >>= PAGE_SHIFT;
-	rc = interval_set(&lock->rl_node, start >> PAGE_SHIFT, end);
-	if (rc)
-		return rc;
-
-	INIT_LIST_HEAD(&lock->rl_next_lock);
-	lock->rl_task = NULL;
-	lock->rl_lock_count = 0;
-	lock->rl_blocking_ranges = 0;
-	lock->rl_sequence = 0;
-	return rc;
-}
-
-static inline struct range_lock *next_lock(struct range_lock *lock)
-{
-	return list_entry(lock->rl_next_lock.next, typeof(*lock), rl_next_lock);
-}
-
-/**
- * Helper function of range_unlock()
- *
- * \param node [in]	a range lock found overlapped during interval node
- *			search
- * \param arg [in]	the range lock to be tested
- *
- * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
- *				overlapping range node
- * \retval INTERVAL_ITER_STOP	indicate to stop the search
- */
-static enum interval_iter range_unlock_cb(struct interval_node *node, void *arg)
-{
-	struct range_lock *lock = arg;
-	struct range_lock *overlap = node2rangelock(node);
-	struct range_lock *iter;
-
-	list_for_each_entry(iter, &overlap->rl_next_lock, rl_next_lock) {
-		if (iter->rl_sequence > lock->rl_sequence) {
-			--iter->rl_blocking_ranges;
-			LASSERT(iter->rl_blocking_ranges > 0);
-		}
-	}
-	if (overlap->rl_sequence > lock->rl_sequence) {
-		--overlap->rl_blocking_ranges;
-		if (overlap->rl_blocking_ranges == 0)
-			wake_up_process(overlap->rl_task);
-	}
-	return INTERVAL_ITER_CONT;
-}
-
-/**
- * Unlock a range lock, wake up locks blocked by this lock.
- *
- * \param tree [in]	range lock tree
- * \param lock [in]	range lock to be deleted
- *
- * If this lock has been granted, relase it; if not, just delete it from
- * the tree or the same region lock list. Wake up those locks only blocked
- * by this lock through range_unlock_cb().
- */
-void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
-{
-	spin_lock(&tree->rlt_lock);
-	if (!list_empty(&lock->rl_next_lock)) {
-		struct range_lock *next;
-
-		if (interval_is_intree(&lock->rl_node)) { /* first lock */
-			/* Insert the next same range lock into the tree */
-			next = next_lock(lock);
-			next->rl_lock_count = lock->rl_lock_count - 1;
-			interval_erase(&lock->rl_node, &tree->rlt_root);
-			interval_insert(&next->rl_node, &tree->rlt_root);
-		} else {
-			/* find the first lock in tree */
-			list_for_each_entry(next, &lock->rl_next_lock,
-					    rl_next_lock) {
-				if (!interval_is_intree(&next->rl_node))
-					continue;
-
-				LASSERT(next->rl_lock_count > 0);
-				next->rl_lock_count--;
-				break;
-			}
-		}
-		list_del_init(&lock->rl_next_lock);
-	} else {
-		LASSERT(interval_is_intree(&lock->rl_node));
-		interval_erase(&lock->rl_node, &tree->rlt_root);
-	}
-
-	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
-			range_unlock_cb, lock);
-	spin_unlock(&tree->rlt_lock);
-}
-
-/**
- * Helper function of range_lock()
- *
- * \param node [in]	a range lock found overlapped during interval node
- *			search
- * \param arg [in]	the range lock to be tested
- *
- * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
- *				overlapping range node
- * \retval INTERVAL_ITER_STOP	indicate to stop the search
- */
-static enum interval_iter range_lock_cb(struct interval_node *node, void *arg)
-{
-	struct range_lock *lock = (struct range_lock *)arg;
-	struct range_lock *overlap = node2rangelock(node);
-
-	lock->rl_blocking_ranges += overlap->rl_lock_count + 1;
-	return INTERVAL_ITER_CONT;
-}
-
-/**
- * Lock a region
- *
- * \param tree [in]	range lock tree
- * \param lock [in]	range lock node containing the region span
- *
- * \retval 0	get the range lock
- * \retval <0	error code while not getting the range lock
- *
- * If there exists overlapping range lock, the new lock will wait and
- * retry, if later it find that it is not the chosen one to wake up,
- * it wait again.
- */
-int range_lock(struct range_lock_tree *tree, struct range_lock *lock)
-{
-	struct interval_node *node;
-	int rc = 0;
-
-	spin_lock(&tree->rlt_lock);
-	/*
-	 * We need to check for all conflicting intervals
-	 * already in the tree.
-	 */
-	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
-			range_lock_cb, lock);
-	/*
-	 * Insert to the tree if I am unique, otherwise I've been linked to
-	 * the rl_next_lock of another lock which has the same range as mine
-	 * in range_lock_cb().
-	 */
-	node = interval_insert(&lock->rl_node, &tree->rlt_root);
-	if (node) {
-		struct range_lock *tmp = node2rangelock(node);
-
-		list_add_tail(&lock->rl_next_lock, &tmp->rl_next_lock);
-		tmp->rl_lock_count++;
-	}
-	lock->rl_sequence = ++tree->rlt_sequence;
-
-	while (lock->rl_blocking_ranges > 0) {
-		lock->rl_task = current;
-		__set_current_state(TASK_INTERRUPTIBLE);
-		spin_unlock(&tree->rlt_lock);
-		schedule();
-
-		if (signal_pending(current)) {
-			range_unlock(tree, lock);
-			rc = -EINTR;
-			goto out;
-		}
-		spin_lock(&tree->rlt_lock);
-	}
-	spin_unlock(&tree->rlt_lock);
-out:
-	return rc;
-}
diff --git a/drivers/staging/lustre/lustre/llite/range_lock.h b/drivers/staging/lustre/lustre/llite/range_lock.h
deleted file mode 100644
index 779091ccec4e..000000000000
--- a/drivers/staging/lustre/lustre/llite/range_lock.h
+++ /dev/null
@@ -1,82 +0,0 @@
-/*
- * GPL HEADER START
- *
- * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 only,
- * as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License version 2 for more details (a copy is included
- * in the LICENSE file that accompanied this code).
- *
- * You should have received a copy of the GNU General Public License
- * version 2 along with this program; If not, see
- * http://www.gnu.org/licenses/gpl-2.0.html
- *
- * GPL HEADER END
- */
-/*
- * Range lock is used to allow multiple threads writing a single shared
- * file given each thread is writing to a non-overlapping portion of the
- * file.
- *
- * Refer to the possible upstream kernel version of range lock by
- * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
- *
- * This file could later replaced by the upstream kernel version.
- */
-/*
- * Author: Prakash Surya <surya1@llnl.gov>
- * Author: Bobi Jam <bobijam.xu@intel.com>
- */
-#ifndef _RANGE_LOCK_H
-#define _RANGE_LOCK_H
-
-#include "../../include/linux/libcfs/libcfs.h"
-#include "../include/interval_tree.h"
-
-struct range_lock {
-	struct interval_node	rl_node;
-	/**
-	 * Process to enqueue this lock.
-	 */
-	struct task_struct	*rl_task;
-	/**
-	 * List of locks with the same range.
-	 */
-	struct list_head	rl_next_lock;
-	/**
-	 * Number of locks in the list rl_next_lock
-	 */
-	unsigned int		rl_lock_count;
-	/**
-	 * Number of ranges which are blocking acquisition of the lock
-	 */
-	unsigned int		rl_blocking_ranges;
-	/**
-	 * Sequence number of range lock. This number is used to get to know
-	 * the order the locks are queued; this is required for range_cancel().
-	 */
-	__u64			rl_sequence;
-};
-
-static inline struct range_lock *node2rangelock(const struct interval_node *n)
-{
-	return container_of(n, struct range_lock, rl_node);
-}
-
-struct range_lock_tree {
-	struct interval_node	*rlt_root;
-	spinlock_t		 rlt_lock;	/* protect range lock tree */
-	__u64			 rlt_sequence;
-};
-
-void range_lock_tree_init(struct range_lock_tree *tree);
-int range_lock_init(struct range_lock *lock, __u64 start, __u64 end);
-int  range_lock(struct range_lock_tree *tree, struct range_lock *lock);
-void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
-#endif
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread
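For readers following the conversion above, the old lustre helpers map one-to-one onto the new API: range_lock_tree_init()/range_lock_init()/range_lock()/range_unlock() become range_rwlock_tree_init()/range_rwlock_init()/range_write_lock_interruptible()/range_write_unlock(). A minimal kernel-style sketch of the resulting write-side pattern, using only calls introduced earlier in this series (the tree and function names are made up for illustration):

#include <linux/range_rwlock.h>

static DEFINE_RANGE_RWLOCK_TREE(demo_write_tree);	/* per-file tree in lustre */

static int demo_write_range(unsigned long pos, unsigned long count)
{
	struct range_rwlock range;
	int rc;

	/* protect [pos, pos + count - 1], inclusive on both ends */
	range_rwlock_init(&range, pos, pos + count - 1);

	rc = range_write_lock_interruptible(&demo_write_tree, &range);
	if (rc < 0)
		return rc;	/* interrupted while waiting for the range */

	/* ... do the range-protected write here ... */

	range_write_unlock(&demo_write_tree, &range);
	return 0;
}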

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
@ 2017-04-06  9:01   ` Laurent Dufour
  2017-04-06 16:50     ` Davidlohr Bueso
  2017-04-06 10:24   ` Peter Zijlstra
  2017-04-18 13:57   ` Laurent Dufour
  2 siblings, 1 reply; 31+ messages in thread
From: Laurent Dufour @ 2017-04-06  9:01 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso

On 06/04/2017 10:46, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on an interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in an
> [a0, a1], a0 <= a1, sorted manner, before actually taking the lock.
> 
> An interval tree of locked and to-be-locked ranges is kept. When a new range
> lock is requested, we add its interval to the tree and store number of
> intervals intersecting it to 'blocking_ranges'. For the reader case,
> 'blocking_ranges' is only accounted for if the intersecting range is
> marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> guarantee that task is blocked until there are no overlapping ranges in the
> tree.
> 
> When a range is unlocked, we again walk intervals that overlap with the
> unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> therefore relies on the order of the interval tree  -- as opposed to a
> more traditional fifo mechanism. There is no lock stealing either, which
> prevents starvation and guarantees fairness.
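(For illustration, the accounting rule described above reduces to an overlap test plus a reader/reader exception when deciding whether an already-queued range counts toward a new request's 'blocking_ranges'. A stand-alone user-space sketch, not code from the patch:)

#include <stdbool.h>
#include <stdio.h>

struct range { unsigned long start, last; bool reader; };

/* closed-interval intersection: a0 <= b1 && b0 <= a1 */
static bool overlaps(const struct range *a, const struct range *b)
{
	return a->start <= b->last && b->start <= a->last;
}

/* how many queued ranges would block @req, per the rule above */
static unsigned int blocking_ranges(const struct range *queued, int n,
				    const struct range *req)
{
	unsigned int blocked = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (!overlaps(&queued[i], req))
			continue;
		if (req->reader && queued[i].reader)
			continue;	/* readers never block readers */
		blocked++;
	}
	return blocked;
}

int main(void)
{
	struct range held[] = {
		{ 0,    4095,  false },	/* writer  [0, 4095]     */
		{ 8192, 12287, true  },	/* reader  [8192, 12287] */
	};
	struct range rd = { 4000,  9000,  true  };	/* overlaps both    */
	struct range wr = { 20000, 30000, false };	/* overlaps neither */

	printf("%u\n", blocking_ranges(held, 2, &rd));	/* 1: only the writer */
	printf("%u\n", blocking_ranges(held, 2, &wr));	/* 0: can go at once  */
	return 0;
}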
> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> is total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
> 
> Due to its sharable nature, full range locks can be compared with rw-semaphores,
> which also serve from a mutex standpoint, as writer-only situations are
> pretty similar nowadays.
> 
> The first is the memory footprint: tree locks are larger than rwsems, 80 vs
> 40 bytes, and they also require an additional 72 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers, as the writer is the first kthread
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> for both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have the writer spin-on-owner feature.
> 
> On both machines, when actually testing threads acquiring different ranges, the
> throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/range_rwlock.h  | 115 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 670 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..0a8dad62d097
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,115 @@
> +/*
> + * Range/interval rw-locking
> + * -------------------------
> + *
> + * An interval tree of locked and to-be-locked ranges is kept. When a new range
> + * lock is requested, we add its interval to the tree and store number of
> + * intervals intersecting it to 'blocking_ranges'. For the reader case,
> + * 'blocking_ranges' is only accounted for if the intersecting range is
> + * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> + * guarantee that task is blocked until there are no overlapping ranges in the
> + * tree.
> + *
> + * When a range is unlocked, we again walk intervals that overlap with the
> + * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> + * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> + * therefore relies on the order of the interval tree  -- as opposed to a
> + * more traditional fifo mechanism. There is no lock stealing either, which
> + * prevents starvation and guarantees fairness.
> + *
> + * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> + * is total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_FULL].
> + */
> +#define RANGE_RWLOCK_FULL  ~0UL
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *waiter;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	u64 seqnum;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +	u64 seqnum; /* track order of incoming ranges, avoid overflows */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .seqnum = 0					\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
> +		.node = {			\
> +			.start = (__start)	\
> +			,.last = (__last)	\
> +		}				\
> +		, .task = NULL			\
> +		, .blocking_ranges = 0		\
> +		, .reader = false		\
> +		, .seqnum = 0			\
> +	}
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_FULL(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +	tree->seqnum = 0;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_full(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
> 
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..550ad360d87a
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,554 @@
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +#define range_interval_tree_foreach(node, root, start, last)	\
> +	for (node = interval_tree_iter_first(root, start, last); \
> +	     node; node = interval_tree_iter_next(node, start, last))
> +
> +/*
> + * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree. This allows mitigating some of the tree walk overhead
> + * for non-intersecting ranges, maintained and consulted in O(1).
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
> +	    lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	lock->seqnum = tree->seqnum++;
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +/*
> + * lock->waiter reader tracking.
> + */
> +#define RANGE_FLAG_READER	1UL
> +
> +static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
> +{
> +	return (struct task_struct *)
> +		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_set_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter | RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_clear_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline bool range_lock_is_reader(struct range_rwlock *lock)
> +{
> +	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock,
> +		    unsigned long start, unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->waiter = NULL;
> +	lock->seqnum = 0;
> +}
> +
> +/**
> + * range_rwlock_init - Initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_full - Initialize a full range lock
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_full(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init_full);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, range_lock_waiter(lock));
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			struct interval_tree_node *node;
> +			unsigned long flags;
> +			DEFINE_WAKE_Q(wake_q);
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +
> +			range_lock_clear_reader(lock);
> +			__range_tree_remove(tree, lock);
> +
> +			if (!__range_intersects_intree(tree, lock))
> +				goto unlock;
> +
> +			range_interval_tree_foreach(node, &tree->root,
> +						    lock->node.start,
> +						    lock->node.last) {
> +				struct range_rwlock *blked;
> +				blked = range_entry(node,
> +						    struct range_rwlock, node);
> +
> +				if (range_lock_is_reader(lock) &&
> +				    range_lock_is_reader(blked))
> +					continue;
> +
> +				/* unaccount for threads _we_ are blocking */
> +				if (lock->seqnum < blked->seqnum)
> +					range_rwlock_unblock(blked, &wake_q);

How is 'seqnum' wrapping handled here?
Shouldn't something like time_before() be used here instead?

Cheers,
Laurent.

^ permalink raw reply	[flat|nested] 31+ messages in thread
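(For illustration, the time_before()-style comparison suggested above would look something like the following; seq_before() is a made-up name and not part of this series:)

#include <stdio.h>
#include <stdint.h>

/* "a comes before b", robust against the counter wrapping around,
 * in the spirit of the kernel's time_before()/time_after() macros. */
#define seq_before(a, b)	((int64_t)((a) - (b)) < 0)

int main(void)
{
	uint64_t older = UINT64_MAX - 1;	/* issued just before a wrap */
	uint64_t newer = 2;			/* issued just after the wrap */

	printf("%d\n", seq_before(older, newer));	/* 1: order preserved */
	printf("%d\n", seq_before(newer, older));	/* 0 */
	return 0;
}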

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-04-06  9:01   ` Laurent Dufour
@ 2017-04-06 10:24   ` Peter Zijlstra
  2017-04-18 13:57   ` Laurent Dufour
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-06 10:24 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Thu, Apr 06, 2017 at 01:46:16AM -0700, Davidlohr Bueso wrote:
> +/*
> + * Range/interval rw-locking
> + * -------------------------
> + *
> + * An interval tree of locked and to-be-locked ranges is kept. When a new range
> + * lock is requested, we add its interval to the tree and store number of
> + * intervals intersecting it to 'blocking_ranges'.

You're again confusing semantics with implementation here.

> For the reader case,
> + * 'blocking_ranges' is only accounted for if the intersecting range is
> + * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> + * guarantee that task is blocked until there are no overlapping ranges in the
> + * tree.
> + *
> + * When a range is unlocked, we again walk intervals that overlap with the
> + * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> + * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> + * therefore relies on the order of the interval tree  -- as opposed to a
> + * more traditional fifo mechanism.

Which order is that? (I could of course go read the interval tree code,
but it shouldn't be too much effort to mention it here).

> There is no lock stealing either, which
> + * prevents starvation and guarantees fairness.

So, the lack of lock stealing has always been very bad for performance. Are you
sure people will not frob this back in?


> +#ifndef _LINUX_RANGE_RWLOCK_H

Still don't like the name... rwlock_t is a spinlock.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06  9:01   ` Laurent Dufour
@ 2017-04-06 16:50     ` Davidlohr Bueso
  2017-04-13  8:07       ` Laurent Dufour
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-06 16:50 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Thu, 06 Apr 2017, Laurent Dufour wrote:

>How is 'seqnum' wrapping handled here?
>Shouldn't something like time_before() be used here instead?

It's a 64-bit counter, no overflows.
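(For scale, a quick back-of-the-envelope check of how long a 64-bit counter takes to wrap, assuming a very generous one billion range-lock acquisitions per second:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	double per_sec = 1e9;			/* assumed acquisition rate */
	double secs = (double)UINT64_MAX / per_sec;

	/* roughly 584 years: wrapping is not a practical concern */
	printf("%.0f years\n", secs / (365.25 * 24 * 3600));
	return 0;
}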

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] staging/lustre: Use generic range rwlock
  2017-04-06  8:46 ` [PATCH 6/6] staging/lustre: Use generic range rwlock Davidlohr Bueso
@ 2017-04-07 10:08     ` Dilger, Andreas
  0 siblings, 0 replies; 31+ messages in thread
From: Dilger, Andreas @ 2017-04-07 10:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, Andrew Morton, jack, kirill.shutemov, ldufour,
	mhocko, mgorman, linux-kernel, Drokin, Oleg, jsimmons,
	lustre-devel, Davidlohr Bueso

On Apr 6, 2017, at 02:46, Davidlohr Bueso <dave@stgolabs.net> wrote:
> 
> This replaces the in-house version, which is also derived
> from Jan's interval tree implementation.
> 
> Cc: oleg.drokin@intel.com
> Cc: andreas.dilger@intel.com
> Cc: jsimmons@infradead.org
> Cc: lustre-devel@lists.lustre.org
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

This patch looks fine, but it would probably be better to also post the whole series on
linux-fsdevel for further review.

Cheers, Andreas

> ---
> drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
> drivers/staging/lustre/lustre/llite/file.c         |  21 +-
> .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
> drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
> drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------------------
> drivers/staging/lustre/lustre/llite/range_lock.h   |  82 -------
> 6 files changed, 15 insertions(+), 336 deletions(-)
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h
> 
> diff --git a/drivers/staging/lustre/lustre/llite/Makefile b/drivers/staging/lustre/lustre/llite/Makefile
> index 322d4fa63f5d..922a901bc62c 100644
> --- a/drivers/staging/lustre/lustre/llite/Makefile
> +++ b/drivers/staging/lustre/lustre/llite/Makefile
> @@ -1,6 +1,6 @@
> obj-$(CONFIG_LUSTRE_FS) += lustre.o
> lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
> -	    rw.o rw26.o namei.o symlink.o llite_mmap.o range_lock.o \
> +	    rw.o rw26.o namei.o symlink.o llite_mmap.o \
> 	    xattr.o xattr_cache.o xattr_security.o \
> 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
> 	    vvp_dev.o vvp_page.o vvp_lock.o vvp_io.o vvp_object.o \
> diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
> index 481c0d01d4c6..1a14a79f87f8 100644
> --- a/drivers/staging/lustre/lustre/llite/file.c
> +++ b/drivers/staging/lustre/lustre/llite/file.c
> @@ -42,6 +42,7 @@
> #include <linux/file.h>
> #include <linux/sched.h>
> #include <linux/mount.h>
> +#include <linux/range_rwlock.h>
> #include "../include/lustre/ll_fiemap.h"
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_swab.h"
> @@ -1055,7 +1056,7 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 	struct ll_inode_info *lli = ll_i2info(file_inode(file));
> 	struct ll_file_data  *fd  = LUSTRE_FPRIVATE(file);
> 	struct vvp_io *vio = vvp_env_io(env);
> -	struct range_lock range;
> +	struct range_rwlock range;
> 	struct cl_io	 *io;
> 	ssize_t result = 0;
> 	int rc = 0;
> @@ -1072,9 +1073,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		bool range_locked = false;
> 
> 		if (file->f_flags & O_APPEND)
> -			range_lock_init(&range, 0, LUSTRE_EOF);
> +			range_rwlock_init(&range, 0, LUSTRE_EOF);
> 		else
> -			range_lock_init(&range, *ppos, *ppos + count - 1);
> +			range_rwlock_init(&range, *ppos, *ppos + count - 1);
> 
> 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
> 		vio->vui_iter = args->u.normal.via_iter;
> @@ -1087,10 +1088,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		if (((iot == CIT_WRITE) ||
> 		     (iot == CIT_READ && (file->f_flags & O_DIRECT))) &&
> 		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
> -			CDEBUG(D_VFSTRACE, "Range lock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			rc = range_lock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range lock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
> 			if (rc < 0)
> 				goto out;
> 
> @@ -1100,10 +1100,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		rc = cl_io_loop(env, io);
> 		ll_cl_remove(file, env);
> 		if (range_locked) {
> -			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			range_unlock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range unlock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			range_write_unlock(&lli->lli_write_tree, &range);
> 		}
> 	} else {
> 		/* cl_io_rw_init() handled IO */
> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
> index 55f68acd85d1..aa2ae72e3e70 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
> @@ -49,8 +49,8 @@
> #include <linux/namei.h>
> #include <linux/xattr.h>
> #include <linux/posix_acl_xattr.h>
> +#include <linux/range_rwlock.h>
> #include "vvp_internal.h"
> -#include "range_lock.h"
> 
> #ifndef FMODE_EXEC
> #define FMODE_EXEC 0
> @@ -193,7 +193,7 @@ struct ll_inode_info {
> 			 * }
> 			 */
> 			struct rw_semaphore		lli_trunc_sem;
> -			struct range_lock_tree		lli_write_tree;
> +			struct range_rwlock_tree	lli_write_tree;
> 
> 			struct rw_semaphore		lli_glimpse_sem;
> 			unsigned long			lli_glimpse_time;
> diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> index b229cbc7bb33..8054e916b3f5 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> @@ -40,6 +40,7 @@
> #include <linux/statfs.h>
> #include <linux/types.h>
> #include <linux/mm.h>
> +#include <linux/range_rwlock.h>
> 
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_ha.h"
> @@ -853,7 +854,7 @@ void ll_lli_init(struct ll_inode_info *lli)
> 		mutex_init(&lli->lli_size_mutex);
> 		lli->lli_symlink_name = NULL;
> 		init_rwsem(&lli->lli_trunc_sem);
> -		range_lock_tree_init(&lli->lli_write_tree);
> +		range_rwlock_tree_init(&lli->lli_write_tree);
> 		init_rwsem(&lli->lli_glimpse_sem);
> 		lli->lli_glimpse_time = 0;
> 		INIT_LIST_HEAD(&lli->lli_agl_list);
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.c b/drivers/staging/lustre/lustre/llite/range_lock.c
> deleted file mode 100644
> index 14148a097476..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.c
> +++ /dev/null
> @@ -1,239 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#include "range_lock.h"
> -#include "../include/lustre/lustre_user.h"
> -
> -/**
> - * Initialize a range lock tree
> - *
> - * \param tree [in]	an empty range lock tree
> - *
> - * Pre:  Caller should have allocated the range lock tree.
> - * Post: The range lock tree is ready to function.
> - */
> -void range_lock_tree_init(struct range_lock_tree *tree)
> -{
> -	tree->rlt_root = NULL;
> -	tree->rlt_sequence = 0;
> -	spin_lock_init(&tree->rlt_lock);
> -}
> -
> -/**
> - * Initialize a range lock node
> - *
> - * \param lock  [in]	an empty range lock node
> - * \param start [in]	start of the covering region
> - * \param end   [in]	end of the covering region
> - *
> - * Pre:  Caller should have allocated the range lock node.
> - * Post: The range lock node is meant to cover [start, end] region
> - */
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end)
> -{
> -	int rc;
> -
> -	memset(&lock->rl_node, 0, sizeof(lock->rl_node));
> -	if (end != LUSTRE_EOF)
> -		end >>= PAGE_SHIFT;
> -	rc = interval_set(&lock->rl_node, start >> PAGE_SHIFT, end);
> -	if (rc)
> -		return rc;
> -
> -	INIT_LIST_HEAD(&lock->rl_next_lock);
> -	lock->rl_task = NULL;
> -	lock->rl_lock_count = 0;
> -	lock->rl_blocking_ranges = 0;
> -	lock->rl_sequence = 0;
> -	return rc;
> -}
> -
> -static inline struct range_lock *next_lock(struct range_lock *lock)
> -{
> -	return list_entry(lock->rl_next_lock.next, typeof(*lock), rl_next_lock);
> -}
> -
> -/**
> - * Helper function of range_unlock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_unlock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -	struct range_lock *iter;
> -
> -	list_for_each_entry(iter, &overlap->rl_next_lock, rl_next_lock) {
> -		if (iter->rl_sequence > lock->rl_sequence) {
> -			--iter->rl_blocking_ranges;
> -			LASSERT(iter->rl_blocking_ranges > 0);
> -		}
> -	}
> -	if (overlap->rl_sequence > lock->rl_sequence) {
> -		--overlap->rl_blocking_ranges;
> -		if (overlap->rl_blocking_ranges == 0)
> -			wake_up_process(overlap->rl_task);
> -	}
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Unlock a range lock, wake up locks blocked by this lock.
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock to be deleted
> - *
> - * If this lock has been granted, relase it; if not, just delete it from
> - * the tree or the same region lock list. Wake up those locks only blocked
> - * by this lock through range_unlock_cb().
> - */
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	spin_lock(&tree->rlt_lock);
> -	if (!list_empty(&lock->rl_next_lock)) {
> -		struct range_lock *next;
> -
> -		if (interval_is_intree(&lock->rl_node)) { /* first lock */
> -			/* Insert the next same range lock into the tree */
> -			next = next_lock(lock);
> -			next->rl_lock_count = lock->rl_lock_count - 1;
> -			interval_erase(&lock->rl_node, &tree->rlt_root);
> -			interval_insert(&next->rl_node, &tree->rlt_root);
> -		} else {
> -			/* find the first lock in tree */
> -			list_for_each_entry(next, &lock->rl_next_lock,
> -					    rl_next_lock) {
> -				if (!interval_is_intree(&next->rl_node))
> -					continue;
> -
> -				LASSERT(next->rl_lock_count > 0);
> -				next->rl_lock_count--;
> -				break;
> -			}
> -		}
> -		list_del_init(&lock->rl_next_lock);
> -	} else {
> -		LASSERT(interval_is_intree(&lock->rl_node));
> -		interval_erase(&lock->rl_node, &tree->rlt_root);
> -	}
> -
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_unlock_cb, lock);
> -	spin_unlock(&tree->rlt_lock);
> -}
> -
> -/**
> - * Helper function of range_lock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_lock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = (struct range_lock *)arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -
> -	lock->rl_blocking_ranges += overlap->rl_lock_count + 1;
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Lock a region
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock node containing the region span
> - *
> - * \retval 0	get the range lock
> - * \retval <0	error code while not getting the range lock
> - *
> - * If there exists overlapping range lock, the new lock will wait and
> - * retry, if later it find that it is not the chosen one to wake up,
> - * it wait again.
> - */
> -int range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	struct interval_node *node;
> -	int rc = 0;
> -
> -	spin_lock(&tree->rlt_lock);
> -	/*
> -	 * We need to check for all conflicting intervals
> -	 * already in the tree.
> -	 */
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_lock_cb, lock);
> -	/*
> -	 * Insert to the tree if I am unique, otherwise I've been linked to
> -	 * the rl_next_lock of another lock which has the same range as mine
> -	 * in range_lock_cb().
> -	 */
> -	node = interval_insert(&lock->rl_node, &tree->rlt_root);
> -	if (node) {
> -		struct range_lock *tmp = node2rangelock(node);
> -
> -		list_add_tail(&lock->rl_next_lock, &tmp->rl_next_lock);
> -		tmp->rl_lock_count++;
> -	}
> -	lock->rl_sequence = ++tree->rlt_sequence;
> -
> -	while (lock->rl_blocking_ranges > 0) {
> -		lock->rl_task = current;
> -		__set_current_state(TASK_INTERRUPTIBLE);
> -		spin_unlock(&tree->rlt_lock);
> -		schedule();
> -
> -		if (signal_pending(current)) {
> -			range_unlock(tree, lock);
> -			rc = -EINTR;
> -			goto out;
> -		}
> -		spin_lock(&tree->rlt_lock);
> -	}
> -	spin_unlock(&tree->rlt_lock);
> -out:
> -	return rc;
> -}
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.h b/drivers/staging/lustre/lustre/llite/range_lock.h
> deleted file mode 100644
> index 779091ccec4e..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.h
> +++ /dev/null
> @@ -1,82 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#ifndef _RANGE_LOCK_H
> -#define _RANGE_LOCK_H
> -
> -#include "../../include/linux/libcfs/libcfs.h"
> -#include "../include/interval_tree.h"
> -
> -struct range_lock {
> -	struct interval_node	rl_node;
> -	/**
> -	 * Process to enqueue this lock.
> -	 */
> -	struct task_struct	*rl_task;
> -	/**
> -	 * List of locks with the same range.
> -	 */
> -	struct list_head	rl_next_lock;
> -	/**
> -	 * Number of locks in the list rl_next_lock
> -	 */
> -	unsigned int		rl_lock_count;
> -	/**
> -	 * Number of ranges which are blocking acquisition of the lock
> -	 */
> -	unsigned int		rl_blocking_ranges;
> -	/**
> -	 * Sequence number of range lock. This number is used to get to know
> -	 * the order the locks are queued; this is required for range_cancel().
> -	 */
> -	__u64			rl_sequence;
> -};
> -
> -static inline struct range_lock *node2rangelock(const struct interval_node *n)
> -{
> -	return container_of(n, struct range_lock, rl_node);
> -}
> -
> -struct range_lock_tree {
> -	struct interval_node	*rlt_root;
> -	spinlock_t		 rlt_lock;	/* protect range lock tree */
> -	__u64			 rlt_sequence;
> -};
> -
> -void range_lock_tree_init(struct range_lock_tree *tree);
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end);
> -int  range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> -#endif
> -- 
> 2.12.0
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [lustre-devel] [PATCH 6/6] staging/lustre: Use generic range rwlock
@ 2017-04-07 10:08     ` Dilger, Andreas
  0 siblings, 0 replies; 31+ messages in thread
From: Dilger, Andreas @ 2017-04-07 10:08 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, Andrew Morton, jack, kirill.shutemov, ldufour,
	mhocko, mgorman, linux-kernel, Drokin, Oleg, jsimmons,
	lustre-devel, Davidlohr Bueso

On Apr 6, 2017, at 02:46, Davidlohr Bueso <dave@stgolabs.net> wrote:
> 
> This replaces the in-house version, which is also derived
> from Jan's interval tree implementation.
> 
> Cc: oleg.drokin at intel.com
> Cc: andreas.dilger at intel.com
> Cc: jsimmons at infradead.org
> Cc: lustre-devel at lists.lustre.org
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>

This patch looks fine, but it would probably be better to also post the whole series on
linux-fsdevel for further review.

Cheers, Andreas

> ---
> drivers/staging/lustre/lustre/llite/Makefile       |   2 +-
> drivers/staging/lustre/lustre/llite/file.c         |  21 +-
> .../staging/lustre/lustre/llite/llite_internal.h   |   4 +-
> drivers/staging/lustre/lustre/llite/llite_lib.c    |   3 +-
> drivers/staging/lustre/lustre/llite/range_lock.c   | 239 ---------------------
> drivers/staging/lustre/lustre/llite/range_lock.h   |  82 -------
> 6 files changed, 15 insertions(+), 336 deletions(-)
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.c
> delete mode 100644 drivers/staging/lustre/lustre/llite/range_lock.h
> 
> diff --git a/drivers/staging/lustre/lustre/llite/Makefile b/drivers/staging/lustre/lustre/llite/Makefile
> index 322d4fa63f5d..922a901bc62c 100644
> --- a/drivers/staging/lustre/lustre/llite/Makefile
> +++ b/drivers/staging/lustre/lustre/llite/Makefile
> @@ -1,6 +1,6 @@
> obj-$(CONFIG_LUSTRE_FS) += lustre.o
> lustre-y := dcache.o dir.o file.o llite_lib.o llite_nfs.o \
> -	    rw.o rw26.o namei.o symlink.o llite_mmap.o range_lock.o \
> +	    rw.o rw26.o namei.o symlink.o llite_mmap.o \
> 	    xattr.o xattr_cache.o xattr_security.o \
> 	    super25.o statahead.o glimpse.o lcommon_cl.o lcommon_misc.o \
> 	    vvp_dev.o vvp_page.o vvp_lock.o vvp_io.o vvp_object.o \
> diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
> index 481c0d01d4c6..1a14a79f87f8 100644
> --- a/drivers/staging/lustre/lustre/llite/file.c
> +++ b/drivers/staging/lustre/lustre/llite/file.c
> @@ -42,6 +42,7 @@
> #include <linux/file.h>
> #include <linux/sched.h>
> #include <linux/mount.h>
> +#include <linux/range_rwlock.h>
> #include "../include/lustre/ll_fiemap.h"
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_swab.h"
> @@ -1055,7 +1056,7 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 	struct ll_inode_info *lli = ll_i2info(file_inode(file));
> 	struct ll_file_data  *fd  = LUSTRE_FPRIVATE(file);
> 	struct vvp_io *vio = vvp_env_io(env);
> -	struct range_lock range;
> +	struct range_rwlock range;
> 	struct cl_io	 *io;
> 	ssize_t result = 0;
> 	int rc = 0;
> @@ -1072,9 +1073,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		bool range_locked = false;
> 
> 		if (file->f_flags & O_APPEND)
> -			range_lock_init(&range, 0, LUSTRE_EOF);
> +			range_rwlock_init(&range, 0, LUSTRE_EOF);
> 		else
> -			range_lock_init(&range, *ppos, *ppos + count - 1);
> +			range_rwlock_init(&range, *ppos, *ppos + count - 1);
> 
> 		vio->vui_fd  = LUSTRE_FPRIVATE(file);
> 		vio->vui_iter = args->u.normal.via_iter;
> @@ -1087,10 +1088,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		if (((iot == CIT_WRITE) ||
> 		     (iot == CIT_READ && (file->f_flags & O_DIRECT))) &&
> 		    !(vio->vui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
> -			CDEBUG(D_VFSTRACE, "Range lock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			rc = range_lock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range lock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			rc = range_write_lock_interruptible(&lli->lli_write_tree, &range);
> 			if (rc < 0)
> 				goto out;
> 
> @@ -1100,10 +1100,9 @@ ll_file_io_generic(const struct lu_env *env, struct vvp_io_args *args,
> 		rc = cl_io_loop(env, io);
> 		ll_cl_remove(file, env);
> 		if (range_locked) {
> -			CDEBUG(D_VFSTRACE, "Range unlock [%llu, %llu]\n",
> -			       range.rl_node.in_extent.start,
> -			       range.rl_node.in_extent.end);
> -			range_unlock(&lli->lli_write_tree, &range);
> +			CDEBUG(D_VFSTRACE, "Range unlock [%lu, %lu]\n",
> +			       range.node.start, range.node.last);
> +			range_write_unlock(&lli->lli_write_tree, &range);
> 		}
> 	} else {
> 		/* cl_io_rw_init() handled IO */
> diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
> index 55f68acd85d1..aa2ae72e3e70 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_internal.h
> +++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
> @@ -49,8 +49,8 @@
> #include <linux/namei.h>
> #include <linux/xattr.h>
> #include <linux/posix_acl_xattr.h>
> +#include <linux/range_rwlock.h>
> #include "vvp_internal.h"
> -#include "range_lock.h"
> 
> #ifndef FMODE_EXEC
> #define FMODE_EXEC 0
> @@ -193,7 +193,7 @@ struct ll_inode_info {
> 			 * }
> 			 */
> 			struct rw_semaphore		lli_trunc_sem;
> -			struct range_lock_tree		lli_write_tree;
> +			struct range_rwlock_tree	lli_write_tree;
> 
> 			struct rw_semaphore		lli_glimpse_sem;
> 			unsigned long			lli_glimpse_time;
> diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
> index b229cbc7bb33..8054e916b3f5 100644
> --- a/drivers/staging/lustre/lustre/llite/llite_lib.c
> +++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
> @@ -40,6 +40,7 @@
> #include <linux/statfs.h>
> #include <linux/types.h>
> #include <linux/mm.h>
> +#include <linux/range_rwlock.h>
> 
> #include "../include/lustre/lustre_ioctl.h"
> #include "../include/lustre_ha.h"
> @@ -853,7 +854,7 @@ void ll_lli_init(struct ll_inode_info *lli)
> 		mutex_init(&lli->lli_size_mutex);
> 		lli->lli_symlink_name = NULL;
> 		init_rwsem(&lli->lli_trunc_sem);
> -		range_lock_tree_init(&lli->lli_write_tree);
> +		range_rwlock_tree_init(&lli->lli_write_tree);
> 		init_rwsem(&lli->lli_glimpse_sem);
> 		lli->lli_glimpse_time = 0;
> 		INIT_LIST_HEAD(&lli->lli_agl_list);
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.c b/drivers/staging/lustre/lustre/llite/range_lock.c
> deleted file mode 100644
> index 14148a097476..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.c
> +++ /dev/null
> @@ -1,239 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#include "range_lock.h"
> -#include "../include/lustre/lustre_user.h"
> -
> -/**
> - * Initialize a range lock tree
> - *
> - * \param tree [in]	an empty range lock tree
> - *
> - * Pre:  Caller should have allocated the range lock tree.
> - * Post: The range lock tree is ready to function.
> - */
> -void range_lock_tree_init(struct range_lock_tree *tree)
> -{
> -	tree->rlt_root = NULL;
> -	tree->rlt_sequence = 0;
> -	spin_lock_init(&tree->rlt_lock);
> -}
> -
> -/**
> - * Initialize a range lock node
> - *
> - * \param lock  [in]	an empty range lock node
> - * \param start [in]	start of the covering region
> - * \param end   [in]	end of the covering region
> - *
> - * Pre:  Caller should have allocated the range lock node.
> - * Post: The range lock node is meant to cover [start, end] region
> - */
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end)
> -{
> -	int rc;
> -
> -	memset(&lock->rl_node, 0, sizeof(lock->rl_node));
> -	if (end != LUSTRE_EOF)
> -		end >>= PAGE_SHIFT;
> -	rc = interval_set(&lock->rl_node, start >> PAGE_SHIFT, end);
> -	if (rc)
> -		return rc;
> -
> -	INIT_LIST_HEAD(&lock->rl_next_lock);
> -	lock->rl_task = NULL;
> -	lock->rl_lock_count = 0;
> -	lock->rl_blocking_ranges = 0;
> -	lock->rl_sequence = 0;
> -	return rc;
> -}
> -
> -static inline struct range_lock *next_lock(struct range_lock *lock)
> -{
> -	return list_entry(lock->rl_next_lock.next, typeof(*lock), rl_next_lock);
> -}
> -
> -/**
> - * Helper function of range_unlock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_unlock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -	struct range_lock *iter;
> -
> -	list_for_each_entry(iter, &overlap->rl_next_lock, rl_next_lock) {
> -		if (iter->rl_sequence > lock->rl_sequence) {
> -			--iter->rl_blocking_ranges;
> -			LASSERT(iter->rl_blocking_ranges > 0);
> -		}
> -	}
> -	if (overlap->rl_sequence > lock->rl_sequence) {
> -		--overlap->rl_blocking_ranges;
> -		if (overlap->rl_blocking_ranges == 0)
> -			wake_up_process(overlap->rl_task);
> -	}
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Unlock a range lock, wake up locks blocked by this lock.
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock to be deleted
> - *
> - * If this lock has been granted, relase it; if not, just delete it from
> - * the tree or the same region lock list. Wake up those locks only blocked
> - * by this lock through range_unlock_cb().
> - */
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	spin_lock(&tree->rlt_lock);
> -	if (!list_empty(&lock->rl_next_lock)) {
> -		struct range_lock *next;
> -
> -		if (interval_is_intree(&lock->rl_node)) { /* first lock */
> -			/* Insert the next same range lock into the tree */
> -			next = next_lock(lock);
> -			next->rl_lock_count = lock->rl_lock_count - 1;
> -			interval_erase(&lock->rl_node, &tree->rlt_root);
> -			interval_insert(&next->rl_node, &tree->rlt_root);
> -		} else {
> -			/* find the first lock in tree */
> -			list_for_each_entry(next, &lock->rl_next_lock,
> -					    rl_next_lock) {
> -				if (!interval_is_intree(&next->rl_node))
> -					continue;
> -
> -				LASSERT(next->rl_lock_count > 0);
> -				next->rl_lock_count--;
> -				break;
> -			}
> -		}
> -		list_del_init(&lock->rl_next_lock);
> -	} else {
> -		LASSERT(interval_is_intree(&lock->rl_node));
> -		interval_erase(&lock->rl_node, &tree->rlt_root);
> -	}
> -
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_unlock_cb, lock);
> -	spin_unlock(&tree->rlt_lock);
> -}
> -
> -/**
> - * Helper function of range_lock()
> - *
> - * \param node [in]	a range lock found overlapped during interval node
> - *			search
> - * \param arg [in]	the range lock to be tested
> - *
> - * \retval INTERVAL_ITER_CONT	indicate to continue the search for next
> - *				overlapping range node
> - * \retval INTERVAL_ITER_STOP	indicate to stop the search
> - */
> -static enum interval_iter range_lock_cb(struct interval_node *node, void *arg)
> -{
> -	struct range_lock *lock = (struct range_lock *)arg;
> -	struct range_lock *overlap = node2rangelock(node);
> -
> -	lock->rl_blocking_ranges += overlap->rl_lock_count + 1;
> -	return INTERVAL_ITER_CONT;
> -}
> -
> -/**
> - * Lock a region
> - *
> - * \param tree [in]	range lock tree
> - * \param lock [in]	range lock node containing the region span
> - *
> - * \retval 0	get the range lock
> - * \retval <0	error code while not getting the range lock
> - *
> - * If there exists overlapping range lock, the new lock will wait and
> - * retry, if later it find that it is not the chosen one to wake up,
> - * it wait again.
> - */
> -int range_lock(struct range_lock_tree *tree, struct range_lock *lock)
> -{
> -	struct interval_node *node;
> -	int rc = 0;
> -
> -	spin_lock(&tree->rlt_lock);
> -	/*
> -	 * We need to check for all conflicting intervals
> -	 * already in the tree.
> -	 */
> -	interval_search(tree->rlt_root, &lock->rl_node.in_extent,
> -			range_lock_cb, lock);
> -	/*
> -	 * Insert to the tree if I am unique, otherwise I've been linked to
> -	 * the rl_next_lock of another lock which has the same range as mine
> -	 * in range_lock_cb().
> -	 */
> -	node = interval_insert(&lock->rl_node, &tree->rlt_root);
> -	if (node) {
> -		struct range_lock *tmp = node2rangelock(node);
> -
> -		list_add_tail(&lock->rl_next_lock, &tmp->rl_next_lock);
> -		tmp->rl_lock_count++;
> -	}
> -	lock->rl_sequence = ++tree->rlt_sequence;
> -
> -	while (lock->rl_blocking_ranges > 0) {
> -		lock->rl_task = current;
> -		__set_current_state(TASK_INTERRUPTIBLE);
> -		spin_unlock(&tree->rlt_lock);
> -		schedule();
> -
> -		if (signal_pending(current)) {
> -			range_unlock(tree, lock);
> -			rc = -EINTR;
> -			goto out;
> -		}
> -		spin_lock(&tree->rlt_lock);
> -	}
> -	spin_unlock(&tree->rlt_lock);
> -out:
> -	return rc;
> -}
> diff --git a/drivers/staging/lustre/lustre/llite/range_lock.h b/drivers/staging/lustre/lustre/llite/range_lock.h
> deleted file mode 100644
> index 779091ccec4e..000000000000
> --- a/drivers/staging/lustre/lustre/llite/range_lock.h
> +++ /dev/null
> @@ -1,82 +0,0 @@
> -/*
> - * GPL HEADER START
> - *
> - * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 only,
> - * as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License version 2 for more details (a copy is included
> - * in the LICENSE file that accompanied this code).
> - *
> - * You should have received a copy of the GNU General Public License
> - * version 2 along with this program; If not, see
> - * http://www.gnu.org/licenses/gpl-2.0.html
> - *
> - * GPL HEADER END
> - */
> -/*
> - * Range lock is used to allow multiple threads writing a single shared
> - * file given each thread is writing to a non-overlapping portion of the
> - * file.
> - *
> - * Refer to the possible upstream kernel version of range lock by
> - * Jan Kara <jack@suse.cz>: https://lkml.org/lkml/2013/1/31/480
> - *
> - * This file could later replaced by the upstream kernel version.
> - */
> -/*
> - * Author: Prakash Surya <surya1@llnl.gov>
> - * Author: Bobi Jam <bobijam.xu@intel.com>
> - */
> -#ifndef _RANGE_LOCK_H
> -#define _RANGE_LOCK_H
> -
> -#include "../../include/linux/libcfs/libcfs.h"
> -#include "../include/interval_tree.h"
> -
> -struct range_lock {
> -	struct interval_node	rl_node;
> -	/**
> -	 * Process to enqueue this lock.
> -	 */
> -	struct task_struct	*rl_task;
> -	/**
> -	 * List of locks with the same range.
> -	 */
> -	struct list_head	rl_next_lock;
> -	/**
> -	 * Number of locks in the list rl_next_lock
> -	 */
> -	unsigned int		rl_lock_count;
> -	/**
> -	 * Number of ranges which are blocking acquisition of the lock
> -	 */
> -	unsigned int		rl_blocking_ranges;
> -	/**
> -	 * Sequence number of range lock. This number is used to get to know
> -	 * the order the locks are queued; this is required for range_cancel().
> -	 */
> -	__u64			rl_sequence;
> -};
> -
> -static inline struct range_lock *node2rangelock(const struct interval_node *n)
> -{
> -	return container_of(n, struct range_lock, rl_node);
> -}
> -
> -struct range_lock_tree {
> -	struct interval_node	*rlt_root;
> -	spinlock_t		 rlt_lock;	/* protect range lock tree */
> -	__u64			 rlt_sequence;
> -};
> -
> -void range_lock_tree_init(struct range_lock_tree *tree);
> -int range_lock_init(struct range_lock *lock, __u64 start, __u64 end);
> -int  range_lock(struct range_lock_tree *tree, struct range_lock *lock);
> -void range_unlock(struct range_lock_tree *tree, struct range_lock *lock);
> -#endif
> -- 
> 2.12.0
> 

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06 16:50     ` Davidlohr Bueso
@ 2017-04-13  8:07       ` Laurent Dufour
  2017-04-13  8:38         ` Jan Kara
  0 siblings, 1 reply; 31+ messages in thread
From: Laurent Dufour @ 2017-04-13  8:07 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On 06/04/2017 18:50, Davidlohr Bueso wrote:
> On Thu, 06 Apr 2017, Laurent Dufour wrote:
> 
>> How is 'seqnum' wrapping handled here ?
>> I'd rather see something like time_before() here, isn't it ?
> 
> Its a 64bit counter, no overflows.

I must have missed something: what prevents this 64-bit counter from
overflowing?
At some point the counter could reach ~0UL and then wrap back to 0UL, which
would break this check.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-13  8:07       ` Laurent Dufour
@ 2017-04-13  8:38         ` Jan Kara
  2017-04-13  8:58           ` Laurent Dufour
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2017-04-13  8:38 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: Davidlohr Bueso, mingo, peterz, akpm, jack, kirill.shutemov,
	mhocko, mgorman, linux-kernel, Davidlohr Bueso

On Thu 13-04-17 10:07:48, Laurent Dufour wrote:
> On 06/04/2017 18:50, Davidlohr Bueso wrote:
> > On Thu, 06 Apr 2017, Laurent Dufour wrote:
> > 
> >> How is 'seqnum' wrapping handled here ?
> >> I'd rather see something like time_before() here, isn't it ?
> > 
> > Its a 64bit counter, no overflows.
> 
> I must have missed something: what prevents this 64-bit counter from
> overflowing?
> At some point the counter could reach ~0UL and then wrap back to 0UL, which
> would break this check.

Count for yourself how long would it take for the counter to overflow if we
incremented it say every nanosecond?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-13  8:38         ` Jan Kara
@ 2017-04-13  8:58           ` Laurent Dufour
  0 siblings, 0 replies; 31+ messages in thread
From: Laurent Dufour @ 2017-04-13  8:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Davidlohr Bueso, mingo, peterz, akpm, kirill.shutemov, mhocko,
	mgorman, linux-kernel, Davidlohr Bueso

On 13/04/2017 10:38, Jan Kara wrote:
> On Thu 13-04-17 10:07:48, Laurent Dufour wrote:
>> On 06/04/2017 18:50, Davidlohr Bueso wrote:
>>> On Thu, 06 Apr 2017, Laurent Dufour wrote:
>>>
>>>> How is 'seqnum' wrapping handled here ?
>>>> I'd rather see something like time_before() here, isn't it ?
>>>
>>> Its a 64bit counter, no overflows.
>>
>> I must have missed something: what prevents this 64-bit counter from
>> overflowing?
>> At some point the counter could reach ~0UL and then wrap back to 0UL, which
>> would break this check.
> 
> Count for yourself how long would it take for the counter to overflow if we
> incremented it say every nanosecond?

I do agree that this will take time, but systems keep getting larger. With
200 threads hitting a lock like the mmap_sem every nanosecond, the counter
increments faster (it would still need about 3 years)... and with 1000
threads, it would take less than 1 year to overflow.
This is not a problem for today, I agree.
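
For reference, the arithmetic above is easy to check with a standalone
snippet; the one-increment-per-nanosecond rate and the 200/1000 thread counts
are taken from the exchange, the rest is plain arithmetic:

#include <stdio.h>

int main(void)
{
        /* 2^64 sequence numbers, consumed at one increment per thread per ns. */
        const double seqs = 18446744073709551616.0;     /* 2^64 */
        const double ns_per_year = 365.25 * 24 * 3600 * 1e9;
        const int threads[] = { 1, 200, 1000 };

        for (int i = 0; i < 3; i++)
                printf("%4d thread(s): %6.1f years to wrap\n",
                       threads[i], seqs / (threads[i] * ns_per_year));
        return 0;
}

This prints roughly 584.5, 2.9 and 0.6 years for 1, 200 and 1000 threads,
matching the "about 3 years" and "less than 1 year" figures above.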

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
  2017-04-06  9:01   ` Laurent Dufour
  2017-04-06 10:24   ` Peter Zijlstra
@ 2017-04-18 13:57   ` Laurent Dufour
  2017-04-20 16:01     ` Davidlohr Bueso
  2 siblings, 1 reply; 31+ messages in thread
From: Laurent Dufour @ 2017-04-18 13:57 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso

On 06/04/2017 10:46, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
> 
> An interval tree of locked and to-be-locked ranges is kept. When a new range
> lock is requested, we add its interval to the tree and store number of
> intervals intersecting it to 'blocking_ranges'. For the reader case,
> 'blocking_ranges' is only accounted for if the intersecting range is
> marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> guarantee that task is blocked until there are no overlapping ranges in the
> tree.
> 
> When a range is unlocked, we again walk intervals that overlap with the
> unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> therefore relies on the order of the interval tree  -- as opposed to a
> more traditional fifo mechanism. There is no lock stealing either, which
> prevents starvation and guarantees fairness.
> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> is total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
> 
> Due to their sharable nature, full range locks can be compared with rw-semaphores,
> which also serve from a mutex standpoint, as writer-only situations are
> pretty similar nowadays.
> 
> The first difference is the memory footprint: tree locks are larger than rwsems (80 vs
> 40 bytes), and also require an additional 72 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as it is the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While single reader and writer threads do not show much difference, increasing
> core counts show that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> among both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have a writer spin-on-owner feature.
> 
> On both machines, when actually testing threads acquiring different ranges, the
> range lock will always outperform the rwsem in throughput, due to the increased
> parallelism, which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/range_rwlock.h  | 115 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 670 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..0a8dad62d097
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,115 @@
> +/*
> + * Range/interval rw-locking
> + * -------------------------
> + *
> + * An interval tree of locked and to-be-locked ranges is kept. When a new range
> + * lock is requested, we add its interval to the tree and store number of
> + * intervals intersecting it to 'blocking_ranges'. For the reader case,
> + * 'blocking_ranges' is only accounted for if the intersecting range is
> + * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> + * guarantee that task is blocked until there are no overlapping ranges in the
> + * tree.
> + *
> + * When a range is unlocked, we again walk intervals that overlap with the
> + * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> + * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> + * therefore relies on the order of the interval tree  -- as opposed to a
> + * more traditional fifo mechanism. There is no lock stealing either, which
> + * prevents starvation and guarantees fairness.
> + *
> + * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> + * is total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_FULL].
> + */
> +#define RANGE_RWLOCK_FULL  ~0UL
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *waiter;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	u64 seqnum;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +	u64 seqnum; /* track order of incoming ranges, avoid overflows */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .seqnum = 0					\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
> +		.node = {			\
> +			.start = (__start)	\
> +			,.last = (__last)	\
> +		}				\
> +		, .task = NULL			\
> +		, .blocking_ranges = 0		\
> +		, .reader = false		\
> +		, .seqnum = 0			\
> +	}
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_FULL(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +	tree->seqnum = 0;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_full(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
> 
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..550ad360d87a
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,554 @@
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +#define range_interval_tree_foreach(node, root, start, last)	\
> +	for (node = interval_tree_iter_first(root, start, last); \
> +	     node; node = interval_tree_iter_next(node, start, last))
> +
> +/*
> + * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree. This allows mitigating some of the tree walk overhead
> + * for non-intersecting ranges, maintained and consulted in O(1).
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
> +	    lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	lock->seqnum = tree->seqnum++;
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +/*
> + * lock->waiter reader tracking.
> + */
> +#define RANGE_FLAG_READER	1UL
> +
> +static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
> +{
> +	return (struct task_struct *)
> +		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_set_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter | RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_clear_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline bool range_lock_is_reader(struct range_rwlock *lock)
> +{
> +	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock,
> +		    unsigned long start, unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->waiter = NULL;
> +	lock->seqnum = 0;
> +}
> +
> +/**
> + * range_rwlock_init - Initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_full - Initialize a full range lock
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_full(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init_full);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, range_lock_waiter(lock));
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			struct interval_tree_node *node;
> +			unsigned long flags;
> +			DEFINE_WAKE_Q(wake_q);
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +
> +			range_lock_clear_reader(lock);
> +			__range_tree_remove(tree, lock);
> +
> +			if (!__range_intersects_intree(tree, lock))
> +				goto unlock;
> +
> +			range_interval_tree_foreach(node, &tree->root,
> +						    lock->node.start,
> +						    lock->node.last) {
> +				struct range_rwlock *blked;
> +				blked = range_entry(node,
> +						    struct range_rwlock, node);
> +
> +				if (range_lock_is_reader(lock) &&
> +				    range_lock_is_reader(blked))
> +					continue;
> +
> +				/* unaccount for threads _we_ are blocking */
> +				if (lock->seqnum < blked->seqnum)
> +					range_rwlock_unblock(blked, &wake_q);
> +			}
> +
> +		unlock:
> +			spin_unlock_irqrestore(&tree->lock, flags);
> +			wake_up_q(&wake_q);
> +			break;
> +		}
> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +	return ret;
> +}
> +
> +static __always_inline int __sched
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> +			 struct range_rwlock *lock, long state)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +	range_lock_set_reader(lock);
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!range_lock_is_reader(blocked_lock))
> +			lock->blocking_ranges++;
> +	}
> +insert:
> +	__range_tree_insert(tree, lock);
> +
> +	lock->waiter = current;

Hi Davidlohr,

Setting lock->waiter after calling range_lock_set_reader() is resetting
the reader flag. Moving the call to range_lock_set_reader() here fixes that.
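
The problem is that RANGE_FLAG_READER lives in bit 0 of the waiter pointer, so
the plain pointer store afterwards silently drops the flag. A tiny userspace
model of the two orderings (struct and names simplified from the code quoted
above):

#include <stdio.h>

#define RANGE_FLAG_READER 1UL

struct range { void *waiter; };

static void set_reader(struct range *r)
{
        r->waiter = (void *)((unsigned long)r->waiter | RANGE_FLAG_READER);
}

static int is_reader(const struct range *r)
{
        return ((unsigned long)r->waiter & RANGE_FLAG_READER) != 0;
}

int main(void)
{
        long task;                      /* stand-in for 'current' */
        struct range r = { NULL };

        /* Order as posted in v2: tag first, then the plain pointer store. */
        set_reader(&r);
        r.waiter = &task;
        printf("set_reader() then waiter store: is_reader=%d\n", is_reader(&r));

        /* Order suggested above: store the task first, then tag it. */
        r.waiter = &task;
        set_reader(&r);
        printf("waiter store then set_reader(): is_reader=%d\n", is_reader(&r));
        return 0;
}

The first printf reports is_reader=0 (flag lost), the second is_reader=1,
which is why reordering the two statements fixes it.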

> +	spin_unlock_irqrestore(&tree->lock, flags);
> +
> +	return wait_for_ranges(tree, lock, state);
> +}
> +
> +/**
> + * range_read_lock - Lock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be locked
> + *
> + * Returns when the lock has been acquired or sleep until
> + * until there are no overlapping ranges.
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL_GPL(range_read_lock);
> +
> +/**
> + * range_read_lock_interruptible - Lock for reading (interruptible)
> + * @tree: interval tree
> + * @lock: the range lock to be locked
> + *
> + * Lock the range like range_read_lock(), and return 0 if the
> + * lock has been acquired or sleep until until there are no
> + * overlapping ranges. If a signal arrives while waiting for the
> + * lock then this function returns -EINTR.
> + */
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL_GPL(range_read_lock_interruptible);
> +
> +/**
> + * range_read_lock_killable - Lock for reading (killable)
> + * @tree: interval tree
> + * @lock: the range lock to be locked
> + *
> + * Lock the range like range_read_lock(), and return 0 if the
> + * lock has been acquired or sleep until until there are no
> + * overlapping ranges. If a signal arrives while waiting for the
> + * lock then this function returns -EINTR.
> + */
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock)
> +{
> +	might_sleep();
> +	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL_GPL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	int ret = true;
> +	unsigned long flags;
> +	struct interval_tree_node *node;
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	/*
> +	 * We have overlapping ranges in the tree, ensure that we can
> +	 * in fact share the lock.
> +	 */
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {
> +		struct range_rwlock *blocked_lock;
> +		blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> +		if (!range_lock_is_reader(blocked_lock)) {
> +			ret = false;
> +			goto unlock;
> +		}
> +	}
> +insert:
> +	range_lock_set_reader(lock);

Here, the lock->waiter field should have been set to current before
calling range_lock_set_reader()

Cheers,
Laurent.
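
For completeness, a minimal usage sketch of the v2 interface quoted above;
only the range_rwlock calls come from include/linux/range_rwlock.h as posted,
while the tree name, the function and the offsets are made up for
illustration:

#include <linux/range_rwlock.h>

static DEFINE_RANGE_RWLOCK_TREE(demo_tree);

static int demo_write(unsigned long first, unsigned long last)
{
        struct range_rwlock range;
        int err;

        /* Ranges are caller-owned (typically on stack) and initialized
         * before every acquisition; [first, last] is inclusive. */
        range_rwlock_init(&range, first, last);

        err = range_write_lock_interruptible(&demo_tree, &range);
        if (err)
                return err;             /* interrupted while blocked */

        /* ... exclusive access to [first, last]; non-overlapping ranges
         * are unaffected and can be locked in parallel ... */

        range_write_unlock(&demo_tree, &range);
        return 0;
}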

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
                   ` (5 preceding siblings ...)
  2017-04-06  8:46 ` [PATCH 6/6] staging/lustre: Use generic range rwlock Davidlohr Bueso
@ 2017-04-19 12:37 ` Peter Zijlstra
  2017-04-20 17:13   ` Davidlohr Bueso
  6 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-19 12:37 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Thu, Apr 06, 2017 at 01:46:14AM -0700, Davidlohr Bueso wrote:
> ** What's still pending:
>   - Debug support (it's been a pain to use lockdep with range locking).

How so? Just assume that every range is the full range. Which isn't such
a weird assumption as it would seem. After all, you cannot assume
anything much about the ranges to begin with. So therefore you cannot
assume the ranges don't all overlap either. At which point you're back
to the regular r/w semantics for deadlocks.

Also:

  - explain interval order and what that means for forward progress
    guarantees. This is currently still unparsable.

  - explain why the loss of lock stealing makes sense. IIRC walken added
    that specifically to address mmap_sem performance issues.
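
One possible shape of that suggestion, as a sketch rather than code from the
series: annotate every acquisition as if it covered the full range, reusing
the stock rwsem lockdep hooks. This assumes the tree gains a dep_map under
CONFIG_DEBUG_LOCK_ALLOC, as the v3 posting later in the thread does:

#include <linux/lockdep.h>

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static inline void range_lock_acquire(struct range_rwlock_tree *tree, int read)
{
        /* Every range is treated as the full range for dependency checks. */
        if (read)
                rwsem_acquire_read(&tree->dep_map, 0, 0, _RET_IP_);
        else
                rwsem_acquire(&tree->dep_map, 0, 0, _RET_IP_);
}

static inline void range_lock_release(struct range_rwlock_tree *tree)
{
        rwsem_release(&tree->dep_map, 1, _RET_IP_);
}
#else
static inline void range_lock_acquire(struct range_rwlock_tree *tree, int read) { }
static inline void range_lock_release(struct range_rwlock_tree *tree) { }
#endif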

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-18 13:57   ` Laurent Dufour
@ 2017-04-20 16:01     ` Davidlohr Bueso
  2017-04-21  7:00       ` Laurent Dufour
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-20 16:01 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Tue, 18 Apr 2017, Laurent Dufour wrote:

>On 06/04/2017 10:46, Davidlohr Bueso wrote:
>> +__range_read_lock_common(struct range_rwlock_tree *tree,
>> +			 struct range_rwlock *lock, long state)
>> +{
>> +	struct interval_tree_node *node;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&tree->lock, flags);
>> +	range_lock_set_reader(lock);
>> +
>> +	if (!__range_intersects_intree(tree, lock))
>> +		goto insert;
>> +
>> +	range_interval_tree_foreach(node, &tree->root,
>> +				    lock->node.start, lock->node.last) {
>> +		struct range_rwlock *blocked_lock;
>> +		blocked_lock = range_entry(node, struct range_rwlock, node);
>> +
>> +		if (!range_lock_is_reader(blocked_lock))
>> +			lock->blocking_ranges++;
>> +	}
>> +insert:
>> +	__range_tree_insert(tree, lock);
>> +
>> +	lock->waiter = current;
>
>Hi Davidlohr,
>
>Setting lock->waiter after calling range_lock_set_reader() is resetting
>the reader flag. Moving the call to range_lock_set_reader() here fixes that.

Yeah, I moved the set_reader() call to after setting lock->waiter.

>
>> +	spin_unlock_irqrestore(&tree->lock, flags);
>> +
>> +	return wait_for_ranges(tree, lock, state);
>> +}

[...]

>> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
>> +{
>> +	int ret = true;
>> +	unsigned long flags;
>> +	struct interval_tree_node *node;
>> +
>> +	spin_lock_irqsave(&tree->lock, flags);
>> +
>> +	if (!__range_intersects_intree(tree, lock))
>> +		goto insert;
>> +
>> +	/*
>> +	 * We have overlapping ranges in the tree, ensure that we can
>> +	 * in fact share the lock.
>> +	 */
>> +	range_interval_tree_foreach(node, &tree->root,
>> +				    lock->node.start, lock->node.last) {
>> +		struct range_rwlock *blocked_lock;
>> +		blocked_lock = range_entry(node, struct range_rwlock, node);
>> +
>> +		if (!range_lock_is_reader(blocked_lock)) {
>> +			ret = false;
>> +			goto unlock;
>> +		}
>> +	}
>> +insert:
>> +	range_lock_set_reader(lock);
>
>Here, the lock->waiter field should have been set to current before
>calling range_lock_set_reader()

But this is a trylock attempt, there is no waiting going on.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-19 12:37 ` [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Peter Zijlstra
@ 2017-04-20 17:13   ` Davidlohr Bueso
  2017-04-20 17:53     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-20 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Wed, 19 Apr 2017, Peter Zijlstra wrote:

>  - explain why the loss of lock stealing makes sense. IIRC walken added
>    that specifically to address mmap_sem performance issues.

That's right, and the same applies to the writer spinning stuff, which
can make a huge difference - more so than plain stealing. But as I've
mentioned, range locks can improve parallelism, which is/should be much
more welcome than optimizations to the primitive. So yeah, we lose
in comparing a full range to rwsem (not to mention the xadd stuff).
I have thought of some heuristics for avoiding sleeping under certain
constraints, which could mitigate the spinning step we lose, but
I fear it will never be exactly as fast as rwsems -- just consider
we always take the tree->lock.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-20 17:13   ` Davidlohr Bueso
@ 2017-04-20 17:53     ` Peter Zijlstra
  2017-04-20 18:36       ` Davidlohr Bueso
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-20 17:53 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Thu, Apr 20, 2017 at 10:13:26AM -0700, Davidlohr Bueso wrote:

> I have thought of some heuristics for avoiding sleeping under certain
> constraints, which could mitigate the spinning step we lose, but I fear it
> will never be exactly as fast as rwsems -- just consider
> we always take the tree->lock.

But tree->lock is a spinlock, so while this gets us out of rwsem-xadd
territory for the fast paths, the whole lock-stealing and optimistic
spinning stuff is on a different scale.

Those are about avoiding actually going to sleep and having to be woken
up (and waiting to become running) again, which is a long time.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-20 17:53     ` Peter Zijlstra
@ 2017-04-20 18:36       ` Davidlohr Bueso
  2017-04-20 19:10         ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Davidlohr Bueso @ 2017-04-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Thu, 20 Apr 2017, Peter Zijlstra wrote:

>On Thu, Apr 20, 2017 at 10:13:26AM -0700, Davidlohr Bueso wrote:
>
>> I have thought of some heuristics for avoiding sleeping under certain
>> constraints, which could mitigate the spinning step we lose, but I fear it
>> will never be exactly as fast as rwsems -- just consider
>> we always take the tree->lock.
>
>But tree->lock is a spinlock, so while this gets us out of rwsem-xadd
>territory for the fast paths, the whole lock-stealing and optimistic
>spinning stuff is on a different scale.

Oh, absolutely. I was merely pointing out the differences at a hair
splitting level.

>Those are about avoiding actually going to sleep and having to be woken
>up (and waiting to become running) again, which is a long time.

Yes, which is why I was thinking of ways to mitigate this. Ie: for
blocked writers with low counts of 'blocking_ranges'.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-20 18:36       ` Davidlohr Bueso
@ 2017-04-20 19:10         ` Peter Zijlstra
  2017-05-15  9:19           ` Davidlohr Bueso
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-04-20 19:10 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Thu, Apr 20, 2017 at 11:36:46AM -0700, Davidlohr Bueso wrote:
> On Thu, 20 Apr 2017, Peter Zijlstra wrote:
> >Those are about avoiding actually going to sleep and having to be woken
> >up (and waiting to become running) again, which is a long time.
> 
> Yes, which is why I was thinking of ways to mitigate this. Ie: for
> blocked writers with low counts of 'blocking_ranges'.

So for this it would be good to have a better understanding of that
whole fairness / interval order crud.

IIRC rwsem only does writer-writer stealing and opt spinning, right? And
for stealing it doesn't matter how many are pending, just that you are
running and they are not (and then you get fairness issues and handover
etc..).

For opt spinning we need to specifically know who would be next in
order, again, doesn't matter how many, just who's next.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-04-20 16:01     ` Davidlohr Bueso
@ 2017-04-21  7:00       ` Laurent Dufour
  0 siblings, 0 replies; 31+ messages in thread
From: Laurent Dufour @ 2017-04-21  7:00 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On 20/04/2017 18:01, Davidlohr Bueso wrote:
>>> +int range_read_trylock(struct range_rwlock_tree *tree, struct
>>> range_rwlock *lock)
>>> +{
>>> +    int ret = true;
>>> +    unsigned long flags;
>>> +    struct interval_tree_node *node;
>>> +
>>> +    spin_lock_irqsave(&tree->lock, flags);
>>> +
>>> +    if (!__range_intersects_intree(tree, lock))
>>> +        goto insert;
>>> +
>>> +    /*
>>> +     * We have overlapping ranges in the tree, ensure that we can
>>> +     * in fact share the lock.
>>> +     */
>>> +    range_interval_tree_foreach(node, &tree->root,
>>> +                    lock->node.start, lock->node.last) {
>>> +        struct range_rwlock *blocked_lock;
>>> +        blocked_lock = range_entry(node, struct range_rwlock, node);
>>> +
>>> +        if (!range_lock_is_reader(blocked_lock)) {
>>> +            ret = false;
>>> +            goto unlock;
>>> +        }
>>> +    }
>>> +insert:
>>> +    range_lock_set_reader(lock);
>>
>> Here, the lock->waiter field should have been set to current before
>> calling range_lock_set_reader()
> 
> But this is a trylock attempt, there is no waiting going on.

Agreed, this one is useless.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock
  2017-04-20 19:10         ` Peter Zijlstra
@ 2017-05-15  9:19           ` Davidlohr Bueso
  0 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-05-15  9:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel

On Thu, 20 Apr 2017, Peter Zijlstra wrote:

>For opt spinning we need to specifically know who would be next in
>order, again, doesn't matter how many, just who's next.

I've sent a v3 with a more precise description of this, which I hope is
to your satisfaction.

Given a clear tree iteration order defined by interval trees (indexed by the
low endpoint, with duplicates walked as an inorder traversal), it is not
something I would wish to alter. Over the weekend I've been experimenting more
with still taking the tree->lock, but spinning while 'blocking_ranges' is 1 and
the 'owner' (in this case the task of the first overlapping node, remembered
when we did the initial lookup adding to the tree, _with_ the tree->lock held)
is on_cpu. This would maintain the order and prevent blocking threads that are
about (?) to receive the lock.

While I have somewhat of a patch, I'm tired and have not had the chance to
even test the thing, so I went ahead and sent v3 anyway to not delay further.

Thanks,
Davidlohr
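
A hypothetical sketch of the heuristic being described, under the assumption
that a new lock->spin_on field caches the task owning the first conflicting
range (recorded under tree->lock at insert time); nothing like this is in the
posted series:

static inline bool range_lock_should_spin(struct range_rwlock *lock)
{
        struct task_struct *owner = READ_ONCE(lock->spin_on);

        /* Only worth spinning if exactly one range still blocks us... */
        if (READ_ONCE(lock->blocking_ranges) != 1)
                return false;

        /* ...and the task holding it is currently running on a CPU. */
        return owner && READ_ONCE(owner->on_cpu);
}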

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
                     ` (2 preceding siblings ...)
  2017-05-15 13:59   ` Peter Zijlstra
@ 2017-05-23 15:12   ` Laurent Dufour
  3 siblings, 0 replies; 31+ messages in thread
From: Laurent Dufour @ 2017-05-23 15:12 UTC (permalink / raw)
  To: Davidlohr Bueso, mingo, peterz, akpm
  Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso

On 15/05/2017 11:07, Davidlohr Bueso wrote:
> --- /dev/null
> +++ b/include/linux/range_lock.h
> @@ -0,0 +1,181 @@
> +/*
> + * Range/interval rw-locking
> + * -------------------------
> + *
> + * Interval-tree based range locking is about controlling tasks' forward
> + * progress when adding an arbitrary interval (node) to the tree, depending
> + * on any overlapping ranges. A task can only continue (wakeup) if there are
> + * no intersecting ranges, thus achieving mutual exclusion. To this end, a
> + * reference counter is kept for each intersecting range in the tree
> + * (_before_ adding itself to it). To enable shared locking semantics,
> + * the reader to-be-locked will not take reference if an intersecting node
> + * is also a reader, therefore ignoring the node altogether.
> + *
> + * Fairness and freedom of starvation are guaranteed by the lack of lock
> + * stealing, thus range locks depend directly on interval tree semantics.
> + * This is particularly for iterations, where the key for the rbtree is
> + * given by the interval's low endpoint, and duplicates are walked as it
> + * would an inorder traversal of the tree.
> + *
> + * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where
> + * R_all is total number of ranges and R_int is the number of ranges
> + * intersecting the operated range.
> + */
> +#ifndef _LINUX_RANGE_LOCK_H
> +#define _LINUX_RANGE_LOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_LOCK_FULL].
> + */
> +#define RANGE_LOCK_FULL  ~0UL
> +
> +struct range_lock {
> +	struct interval_tree_node node;
> +	struct task_struct *tsk;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	u64 seqnum;
> +};
> +
> +struct range_lock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +	u64 seqnum; /* track order of incoming ranges, avoid overflows */
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map dep_map;
> +#endif
> +};
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +# define __RANGE_LOCK_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
> +#else
> +# define __RANGE_LOCK_DEP_MAP_INIT(lockname)
> +#endif
> +
> +#define __RANGE_LOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .seqnum = 0					\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock)       \
> +	__RANGE_LOCK_DEP_MAP_INIT(name) }		\
> +
> +#define DEFINE_RANGE_LOCK_TREE(name) \
> +	struct range_lock_tree name = __RANGE_LOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_LOCK_INITIALIZER(__start, __last) {	\
> +		.node = {				\
> +			.start = (__start)		\
> +			,.last = (__last)		\
> +		}					\
> +		, .task = NULL				\
                   ^tsk

> +		, .blocking_ranges = 0			\
> +		, .reader = false			\
                   ^ this field doesn't exist anymore
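
A version of the initializer consistent with the struct range_lock fields
above would be something like this (untested sketch, not part of the posted
patch):

#define __RANGE_LOCK_INITIALIZER(__start, __last) {	\
		.node = {				\
			.start = (__start)		\
			,.last = (__last)		\
		}					\
		, .tsk = NULL				\
		, .blocking_ranges = 0			\
		, .seqnum = 0				\
	}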

Cheers,
Laurent.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15 13:02   ` Peter Zijlstra
@ 2017-05-16 22:19     ` Davidlohr Bueso
  0 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-05-16 22:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, 15 May 2017, Peter Zijlstra wrote:

>On Mon, May 15, 2017 at 02:07:21AM -0700, Davidlohr Bueso wrote:
>
>> + * Fairness and freedom of starvation are guaranteed by the lack of lock
>> + * stealing, thus range locks depend directly on interval tree semantics.
>> + * This is particularly for iterations, where the key for the rbtree is
>> + * given by the interval's low endpoint,
>
>
>So suppose the lock is held at [a,n], and I want to acquire [g,z], this
>conflicts, therefore I wait.

Ok, then, ref [g,z] = 1, ref [a,n] = 0 (lock owner). Per below,
at this point the tree will overlap with anything between [a,z],
which is the world.

>
>While I wait, someone else comes in at [b,m], they too wait.

[b,m] intersects with both nodes above, thus ref [b,m] = 2.

>
>[a,n] is released, per ordering [b,m] acquires, I still wait.

Now:
ref [g,z] = 0
ref [b,m] = 1

So due to reference counting [g,z] is acquired, despite [b,m]
being _put() before [g,z].

>[a,n] returns to wait.

Similar to the [b,m] case, when [a,n] comes back, it too will
get a ref = 2 and hence "go back in line". Iow, lock ordering does
have FIFO semantics among contended ranges.
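
In code terms, this ordering is just the blocking_ranges accounting from the
patch, condensed here as a sketch (writer case):

	/* lock side: count every conflicting range already in the tree */
	range_interval_tree_foreach(node, &tree->root,
				    lock->node.start, lock->node.last)
		lock->blocking_ranges++;

	/* unlock side: drop one reference from each blocked waiter; a
	 * waiter is only woken once its own count reaches zero */
	range_interval_tree_foreach(node, &tree->root,
				    lock->node.start, lock->node.last)
		range_lock_put(to_range_lock(node), &wake_q);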

>
>[b,m] releases, does the iteration then restart and grant it to [a,n] or
>will I (at [g,z]) finally acquire?
>
>
>Since the code always does range_interval_tree_foreach() it would appear
>to me [b,m] will always win and [g,z] could be made to wait
>indefinitely (by always contending with another range that has a lower
>starting point).
>
>
>
>>                                          and duplicates are walked as it
>> + * would an inorder traversal of the tree.
>
>Are duplicates ordered in FIFO ? Afaict the above is free of actual
>semantics.

This will strictly depend on the rotations that happen when more
nodes with duplicate keys are added later. But again, that is just
the order in which the tree is walked.

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15 13:44   ` Peter Zijlstra
@ 2017-05-16 21:17     ` Davidlohr Bueso
  0 siblings, 0 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-05-16 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, 15 May 2017, Peter Zijlstra wrote:

>Nearly every range_interval_tree_foreach() usage has a
>__range_intersects_intree() in front, suggesting our
>range_interval_tree_foreach() is 'broken'.
>
>I suppose the only question is if we should fix
>range_interval_tree_foreach() or interval_tree_iter_first(). I'm tempted
>to suggest the latter.

Yes, this functionality would be helpful to all interval tree users,
but for that we have to cache the leftmost node, and given the way
interval trees are set up via templates, the latter gets icky _fast_:

- For one we could add extra parameters to INTERVAL_TREE_DEFINE and
pass an arbitrary structure that contains a ptr to the node type
as well as the rb_root. This of course busts the generic flavor.

Or,

- Add a second leftmost rb_node to struct rb_root and (at least) only
use it for interval trees; while there are rbtree users that do this
caching explicitly, I doubt folks would like it for the general case.

So it would seem that we ought to tuck __range_intersects_intree()
into the range_interval_tree_foreach() helper, along the lines of
the sketch below.
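
One way that could look (a sketch only; it assumes __range_intersects_intree()
is reworked to take the start/last pair instead of a struct range_lock, and
that callers pass the tree rather than &tree->root):

#define range_interval_tree_foreach(node, tree, start, last)		   \
	for (node = __range_intersects_intree(tree, start, last)	   \
		    ? interval_tree_iter_first(&(tree)->root, start, last) \
		    : NULL;						   \
	     node;							   \
	     node = interval_tree_iter_next(node, start, last))

Callers could then drop the explicit __range_intersects_intree() pre-check.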

Thanks,
Davidlohr

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
  2017-05-15 13:02   ` Peter Zijlstra
  2017-05-15 13:44   ` Peter Zijlstra
@ 2017-05-15 13:59   ` Peter Zijlstra
  2017-05-23 15:12   ` Laurent Dufour
  3 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-15 13:59 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, May 15, 2017 at 02:07:21AM -0700, Davidlohr Bueso wrote:
> +static inline int wait_for_ranges(struct range_lock_tree *tree,
> +				  struct range_lock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			struct interval_tree_node *node;
> +			unsigned long flags;
> +			DEFINE_WAKE_Q(wake_q);
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +
> +			range_lock_clear_reader(lock);
> +			__range_tree_remove(tree, lock);
> +
> +			if (!__range_intersects_intree(tree, lock))
> +				goto unlock;
> +
> +			range_interval_tree_foreach(node, &tree->root,
> +						    lock->node.start,
> +						    lock->node.last) {
> +				struct range_lock *blked;
> +				blked = to_range_lock(node);
> +
> +				if (range_lock_is_reader(lock) &&
> +				    range_lock_is_reader(blked))
> +					continue;
> +
> +				/* unaccount for threads _we_ are blocking */
> +				if (lock->seqnum < blked->seqnum)
> +					range_lock_put(blked, &wake_q);
> +			}
> +
> +		unlock:
> +			spin_unlock_irqrestore(&tree->lock, flags);
> +			wake_up_q(&wake_q);
> +			break;
> +		}
> +
> +		schedule();
> +	}
> +
> +	__set_current_state(TASK_RUNNING);
> +	return ret;
> +}


> +void range_read_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	range_lock_clear_reader(lock);
> +	__range_tree_remove(tree, lock);
> +
> +	range_lock_release(&tree->dep_map, 1, _RET_IP_);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {
> +		struct range_lock *blocked_lock;
> +		blocked_lock = to_range_lock(node);
> +
> +		if (!range_lock_is_reader(blocked_lock))
> +			range_lock_put(blocked_lock, &wake_q);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL_GPL(range_read_unlock);

> +void range_write_unlock(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +	struct interval_tree_node *node;
> +	unsigned long flags;
> +	DEFINE_WAKE_Q(wake_q);
> +
> +	spin_lock_irqsave(&tree->lock, flags);
> +
> +	range_lock_clear_reader(lock);
> +	__range_tree_remove(tree, lock);
> +
> +	range_lock_release(&tree->dep_map, 1, _RET_IP_);
> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {
> +		struct range_lock *blocked_lock;
> +		blocked_lock = to_range_lock(node);
> +
> +		range_lock_put(blocked_lock, &wake_q);
> +	}
> +
> +	spin_unlock_irqrestore(&tree->lock, flags);
> +	wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL_GPL(range_write_unlock);


There is significant duplication here. Can't we have a
__range_unlock_common() and use that 3 times?
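
For illustration, one possible shape for such a helper (a sketch against the
unlock paths quoted above, untested; the wake policy is passed in as a
predicate, and the signal-cleanup path in wait_for_ranges() would still need
the lockdep release and the seqnum check handled separately):

static void __range_unlock_common(struct range_lock_tree *tree,
				  struct range_lock *lock,
				  bool (*should_put)(struct range_lock *lock,
						     struct range_lock *blked))
{
	struct interval_tree_node *node;
	unsigned long flags;
	DEFINE_WAKE_Q(wake_q);

	spin_lock_irqsave(&tree->lock, flags);

	range_lock_clear_reader(lock);
	__range_tree_remove(tree, lock);

	range_lock_release(&tree->dep_map, 1, _RET_IP_);

	if (__range_intersects_intree(tree, lock)) {
		range_interval_tree_foreach(node, &tree->root,
					    lock->node.start,
					    lock->node.last) {
			struct range_lock *blked = to_range_lock(node);

			/* e.g. read unlock skips fellow readers here */
			if (should_put(lock, blked))
				range_lock_put(blked, &wake_q);
		}
	}

	spin_unlock_irqrestore(&tree->lock, flags);
	wake_up_q(&wake_q);
}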

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
  2017-05-15 13:02   ` Peter Zijlstra
@ 2017-05-15 13:44   ` Peter Zijlstra
  2017-05-16 21:17     ` Davidlohr Bueso
  2017-05-15 13:59   ` Peter Zijlstra
  2017-05-23 15:12   ` Laurent Dufour
  3 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-15 13:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, May 15, 2017 at 02:07:21AM -0700, Davidlohr Bueso wrote:

> +#define range_interval_tree_foreach(node, root, start, last)	\
> +	for (node = interval_tree_iter_first(root, start, last); \
> +	     node; node = interval_tree_iter_next(node, start, last))
> +

> +/*
> + * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the latter, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree. This allows mitigating some of the tree walk overhead
> + * for non-intersecting ranges, maintained and consulted in O(1).
> + */
> +static inline bool
> +__range_intersects_intree(struct range_lock_tree *tree, struct range_lock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = to_interval_tree_node(tree->root.rb_node);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}

> +
> +			if (!__range_intersects_intree(tree, lock))
> +				goto unlock;
> +
> +			range_interval_tree_foreach(node, &tree->root,
> +						    lock->node.start,
> +						    lock->node.last) {


> +
> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	/*
> +	 * We have overlapping ranges in the tree, ensure that we can
> +	 * in fact share the lock.
> +	 */
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {


> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {


> +
> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {


> +	if (!__range_intersects_intree(tree, lock))
> +		goto insert;
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {



> +	if (!__range_intersects_intree(tree, lock)) {
> +		/* nobody to wakeup, we're done */
> +		spin_unlock_irqrestore(&tree->lock, flags);
> +		return;
> +	}
> +
> +	range_interval_tree_foreach(node, &tree->root,
> +				    lock->node.start, lock->node.last) {


Nearly every range_interval_tree_foreach() usage has a
__range_intersects_intree() in front, suggesting our
range_interval_tree_foreach() is 'broken'.

I suppose the only question is if we should fix
range_interval_tree_foreach() or interval_tree_iter_first(). I'm tempted
to suggest the latter.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
@ 2017-05-15 13:02   ` Peter Zijlstra
  2017-05-16 22:19     ` Davidlohr Bueso
  2017-05-15 13:44   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2017-05-15 13:02 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: mingo, akpm, jack, kirill.shutemov, ldufour, mhocko, mgorman,
	linux-kernel, Davidlohr Bueso

On Mon, May 15, 2017 at 02:07:21AM -0700, Davidlohr Bueso wrote:

> + * Fairness and freedom of starvation are guaranteed by the lack of lock
> + * stealing, thus range locks depend directly on interval tree semantics.
> + * This is particularly for iterations, where the key for the rbtree is
> + * given by the interval's low endpoint,


So suppose the lock is held at [a,n], and I want to acquire [g,z], this
conflicts, therefore I wait.

While I wait, someone else comes in at [b,m], they too wait.

[a,n] is released, per ordering [b,m] acquires, I still wait.

[a,n] returns to wait.

[b,m] releases, does the iteration then restart and grant it to [a,n] or
will I (at [g,z]) finally acquire?


Since the code always does range_interval_tree_foreach() it would appear
to me [b,m] will always win and [g,z] could be made to wait
indefinitely (by always contending with another range that has a lower
starting point).



>                                          and duplicates are walked as it
> + * would an inorder traversal of the tree.

Are duplicates ordered in FIFO ? Afaict the above is free of actual
semantics.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 2/6] locking: Introduce range reader/writer lock
  2017-05-15  9:07 [PATCH v3 " Davidlohr Bueso
@ 2017-05-15  9:07 ` Davidlohr Bueso
  2017-05-15 13:02   ` Peter Zijlstra
                     ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Davidlohr Bueso @ 2017-05-15  9:07 UTC (permalink / raw)
  To: mingo, peterz, akpm
  Cc: jack, kirill.shutemov, ldufour, mhocko, mgorman, dave,
	linux-kernel, Davidlohr Bueso

This implements a sleepable range rwlock, based on an interval tree, serializing
conflicting/intersecting/overlapping ranges within the tree. The largest range
is given by [0, ~0] (inclusive). Unlike traditional locks, range locking
involves dealing with both the tree itself and the range to be locked, which is
normally stack allocated and always explicitly prepared/initialized by the user
as a sorted interval [a0, a1] with a0 <= a1, before actually taking the lock.
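
For illustration, a minimal usage sketch against the API introduced below (the
tree name, function and range values here are made up; only the range lock
calls are from this patch):

DEFINE_RANGE_LOCK_TREE(frob_tree);	/* or range_lock_tree_init() at runtime */

static void frob_range(unsigned long start, unsigned long last)
{
	struct range_lock rl;			/* typically lives on the stack */

	range_lock_init(&rl, start, last);	/* caller ensures start <= last */

	range_write_lock(&frob_tree, &rl);
	/* ... exclusive access to [start, last]; lockers of
	 * non-overlapping ranges on the same tree proceed in parallel ... */
	range_write_unlock(&frob_tree, &rl);
}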

Interval-tree based range locking is about controlling tasks' forward
progress when adding an arbitrary interval (node) to the tree, depending
on any overlapping ranges. A task can only continue (wakeup) if there are
no intersecting ranges, thus achieving mutual exclusion. To this end, a
reference counter is kept for each intersecting range in the tree
(_before_ adding itself to it). To enable shared locking semantics,
the reader to-be-locked will not take reference if an intersecting node
is also a reader, therefore ignoring the node altogether.

Fairness and freedom of starvation are guaranteed by the lack of lock
stealing, thus range locks depend directly on interval tree semantics.
This is particularly for iterations, where the key for the rbtree is
given by the interval's low endpoint, and duplicates are walked as it
would an inorder traversal of the tree.

The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where
R_all is total number of ranges and R_int is the number of ranges
intersecting the operated range.

How much does it cost:
----------------------

The cost of lock and unlock of a range is O((1+R_int)log(R_all)), where R_all
is the total number of ranges in the tree and R_int is the number of ranges
intersecting the range being added.

Due to its sharable nature, the full range lock is best compared with
rw-semaphores; this also serves as a comparison against mutexes, since
writer-only usage of both is pretty similar nowadays.

The first difference is the memory footprint: tree locks are smaller than
rwsems (32 vs 40 bytes), but they require an additional 72 bytes of stack for
the range structure.

Secondly, because every range call is serialized by the tree->lock, any lock()
fastpath will have at least an interval_tree_insert() plus spinlock lock+unlock
overhead, compared to a single atomic instruction in the case of rwsems. The
same obviously applies to the unlock() case.

The torture module was used to measure 1-1 differences in lock acquisition with
increasing core counts over a period of 10 minutes. Readers and writers are
interleaved, with a slight advantage to writers, as the writer kthread is
created first. The following shows the avg ops/minute with various thread
setups on boxes with small and large core counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For the read-only and write-only cases, there is very little difference between
the range lock and rwsems, with up to a 3% hit, which could very well be
considered within the noise.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While a single reader and a single writer thread do not show much difference,
increasing core counts shows that in mixed reader/writer workloads, writer
threads can take a hit in raw performance of up to ~20%, while reader
throughput is quite similar between both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, larger machines show that range locks take only a minor
(up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to
~40%, similar to the 4-core machine; however, reader threads can increase their
number of acquisitions by up to ~80%. In any case, the overall writer+reader
throughput will always be higher for rwsems. A huge factor in this behavior is
that range locks lack the writer spin-on-owner feature.

On both machines, when actually testing threads that acquire different
(non-overlapping) ranges, throughput will always outperform the rwsem due to
the increased parallelism, which is no surprise either. As such,
microbenchmarks that merely pound on a single lock will pretty much always
suffer upon direct lock conversions, but not enough to matter in the overall
picture.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/lockdep.h     |  33 +++
 include/linux/range_lock.h  | 181 ++++++++++++
 kernel/locking/Makefile     |   2 +-
 kernel/locking/range_lock.c | 690 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 905 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/range_lock.h
 create mode 100644 kernel/locking/range_lock.c

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index fffe49f188e6..bbce43976b92 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -481,6 +481,16 @@ do {								\
 	lock_acquired(&(_lock)->dep_map, _RET_IP_);			\
 } while (0)
 
+#define RANGE_LOCK_CONTENDED(tree, _lock, try, lock)		\
+do {								\
+	if (!try(tree, _lock)) {				\
+		lock_contended(&(tree)->dep_map, _RET_IP_);	\
+		lock(tree, _lock);				\
+	}							\
+	lock_acquired(&(tree)->dep_map, _RET_IP_);		\
+} while (0)
+
+
 #define LOCK_CONTENDED_RETURN(_lock, try, lock)			\
 ({								\
 	int ____err = 0;					\
@@ -493,6 +503,18 @@ do {								\
 	____err;						\
 })
 
+#define RANGE_LOCK_CONTENDED_RETURN(tree, _lock, try, lock)	\
+({								\
+	int ____err = 0;					\
+	if (!try(tree, _lock)) {				\
+		lock_contended(&(tree)->dep_map, _RET_IP_);	\
+		____err = lock(tree, _lock);			\
+	}							\
+	if (!____err)						\
+		lock_acquired(&(tree)->dep_map, _RET_IP_);	\
+	____err;						\
+})
+
 #else /* CONFIG_LOCK_STAT */
 
 #define lock_contended(lockdep_map, ip) do {} while (0)
@@ -501,9 +523,15 @@ do {								\
 #define LOCK_CONTENDED(_lock, try, lock) \
 	lock(_lock)
 
+#define RANGE_LOCK_CONTENDED(tree, _lock, try, lock) \
+	lock(tree, _lock)
+
 #define LOCK_CONTENDED_RETURN(_lock, try, lock) \
 	lock(_lock)
 
+#define RANGE_LOCK_CONTENDED_RETURN(tree, _lock, try, lock) \
+	lock(tree, _lock)
+
 #endif /* CONFIG_LOCK_STAT */
 
 #ifdef CONFIG_LOCKDEP
@@ -568,6 +596,11 @@ static inline void print_irqtrace_events(struct task_struct *curr)
 #define rwsem_acquire_read(l, s, t, i)		lock_acquire_shared(l, s, t, NULL, i)
 #define rwsem_release(l, n, i)			lock_release(l, n, i)
 
+#define range_lock_acquire(l, s, t, i)		lock_acquire_exclusive(l, s, t, NULL, i)
+#define range_lock_acquire_nest(l, s, t, n, i)	lock_acquire_exclusive(l, s, t, n, i)
+#define range_lock_acquire_read(l, s, t, i)	lock_acquire_shared(l, s, t, NULL, i)
+#define range_lock_release(l, n, i)		lock_release(l, n, i)
+
 #define lock_map_acquire(l)			lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
 #define lock_map_acquire_read(l)		lock_acquire_shared_recursive(l, 0, 0, NULL, _THIS_IP_)
 #define lock_map_acquire_tryread(l)		lock_acquire_shared_recursive(l, 0, 1, NULL, _THIS_IP_)
diff --git a/include/linux/range_lock.h b/include/linux/range_lock.h
new file mode 100644
index 000000000000..45e6ed1a1c5c
--- /dev/null
+++ b/include/linux/range_lock.h
@@ -0,0 +1,181 @@
+/*
+ * Range/interval rw-locking
+ * -------------------------
+ *
+ * Interval-tree based range locking is about controlling tasks' forward
+ * progress when adding an arbitrary interval (node) to the tree, depending
+ * on any overlapping ranges. A task can only continue (wakeup) if there are
+ * no intersecting ranges, thus achieving mutual exclusion. To this end, a
+ * reference counter is kept for each intersecting range in the tree
+ * (_before_ adding itself to it). To enable shared locking semantics,
+ * the reader to-be-locked will not take reference if an intersecting node
+ * is also a reader, therefore ignoring the node altogether.
+ *
+ * Fairness and freedom of starvation are guaranteed by the lack of lock
+ * stealing, thus range locks depend directly on interval tree semantics.
+ * This is particularly for iterations, where the key for the rbtree is
+ * given by the interval's low endpoint, and duplicates are walked as it
+ * would an inorder traversal of the tree.
+ *
+ * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where
+ * R_all is total number of ranges and R_int is the number of ranges
+ * intersecting the operated range.
+ */
+#ifndef _LINUX_RANGE_LOCK_H
+#define _LINUX_RANGE_LOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_LOCK_FULL].
+ */
+#define RANGE_LOCK_FULL  ~0UL
+
+struct range_lock {
+	struct interval_tree_node node;
+	struct task_struct *tsk;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	u64 seqnum;
+};
+
+struct range_lock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+	u64 seqnum; /* track order of incoming ranges, avoid overflows */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+# define __RANGE_LOCK_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RANGE_LOCK_DEP_MAP_INIT(lockname)
+#endif
+
+#define __RANGE_LOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .seqnum = 0					\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock)       \
+	__RANGE_LOCK_DEP_MAP_INIT(name) }		\
+
+#define DEFINE_RANGE_LOCK_TREE(name) \
+	struct range_lock_tree name = __RANGE_LOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_LOCK_INITIALIZER(__start, __last) {	\
+		.node = {				\
+			.start = (__start)		\
+			,.last = (__last)		\
+		}					\
+		, .task = NULL				\
+		, .blocking_ranges = 0			\
+		, .reader = false			\
+		, .seqnum = 0				\
+	}
+
+#define DEFINE_RANGE_LOCK(name, start, last)				\
+	struct range_lock name = __RANGE_LOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_LOCK_FULL(name)					\
+	struct range_lock name = __RANGE_LOCK_INITIALIZER(0, RANGE_LOCK_FULL)
+
+static inline void
+__range_lock_tree_init(struct range_lock_tree *tree,
+		       const char *name, struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/*
+	 * Make sure we are not reinitializing a held lock:
+	 */
+	debug_check_no_locks_freed((void *)tree, sizeof(*tree));
+	lockdep_init_map(&tree->dep_map, name, key, 0);
+#endif
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+	tree->seqnum = 0;
+}
+
+#define range_lock_tree_init(tree)				\
+do {								\
+	static struct lock_class_key __key;			\
+								\
+	__range_lock_tree_init((tree), #tree, &__key);		\
+} while (0)
+
+void range_lock_init(struct range_lock *lock,
+		       unsigned long start, unsigned long last);
+void range_lock_init_full(struct range_lock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_lock_tree *tree, struct range_lock *lock);
+int range_read_lock_interruptible(struct range_lock_tree *tree,
+				  struct range_lock *lock);
+int range_read_lock_killable(struct range_lock_tree *tree,
+			     struct range_lock *lock);
+int range_read_trylock(struct range_lock_tree *tree, struct range_lock *lock);
+void range_read_unlock(struct range_lock_tree *tree, struct range_lock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_lock_tree *tree, struct range_lock *lock);
+int range_write_lock_interruptible(struct range_lock_tree *tree,
+				   struct range_lock *lock);
+int range_write_lock_killable(struct range_lock_tree *tree,
+			      struct range_lock *lock);
+int range_write_trylock(struct range_lock_tree *tree, struct range_lock *lock);
+void range_write_unlock(struct range_lock_tree *tree, struct range_lock *lock);
+
+void range_downgrade_write(struct range_lock_tree *tree,
+			   struct range_lock *lock);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+/*
+ * nested locking. NOTE: range locks are not allowed to recurse
+ * (which occurs if the same task tries to acquire the same
+ * lock instance multiple times), but multiple locks of the
+ * same lock class might be taken, if the order of the locks
+ * is always the same. This ordering rule can be expressed
+ * to lockdep via the _nested() APIs, but enumerating the
+ * subclasses that are used. (If the nesting relationship is
+ * static then another method for expressing nested locking is
+ * the explicit definition of lock class keys and the use of
+ * lockdep_set_class() at lock initialization time.
+ * See Documentation/locking/lockdep-design.txt for more details.)
+ */
+extern void range_read_lock_nested(struct range_lock_tree *tree,
+		struct range_lock *lock, int subclass);
+extern void range_write_lock_nested(struct range_lock_tree *tree,
+		struct range_lock *lock, int subclass);
+extern int range_write_lock_killable_nested(struct range_lock_tree *tree,
+		struct range_lock *lock, int subclass);
+extern void _range_write_lock_nest_lock(struct range_lock_tree *tree,
+		struct range_lock *lock, struct lockdep_map *nest_lock);
+
+# define range_write_lock_nest_lock(tree, lock, nest_lock)		\
+do {									\
+	typecheck(struct lockdep_map *, &(nest_lock)->dep_map);		\
+	_range_write_lock_nest_lock(tree, lock, &(nest_lock)->dep_map);	\
+} while (0);
+
+#else
+# define range_read_lock_nested(tree, lock, subclass) \
+	range_read_lock(tree, lock)
+# define range_write_lock_nest_lock(tree, lock, nest_lock) \
+	range_write_lock(tree, lock)
+# define range_write_lock_nested(tree, lock, subclass) \
+	range_write_lock(tree, lock)
+# define range_write_lock_killable_nested(tree, lock, subclass) \
+	range_write_lock_killable(tree, lock)
+#endif
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..2564548ecd37 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_lock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_lock.c b/kernel/locking/range_lock.c
new file mode 100644
index 000000000000..9c0d848603e6
--- /dev/null
+++ b/kernel/locking/range_lock.c
@@ -0,0 +1,690 @@
+/*
+ * Copyright (C) 2017 Jan Kara, Davidlohr Bueso.
+ */
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_lock.h>
+#include <linux/lockdep.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+#define range_interval_tree_foreach(node, root, start, last)	\
+	for (node = interval_tree_iter_first(root, start, last); \
+	     node; node = interval_tree_iter_next(node, start, last))
+
+#define to_range_lock(ptr) container_of(ptr, struct range_lock, node)
+#define to_interval_tree_node(ptr) \
+	container_of(ptr, struct interval_tree_node, rb)
+
+/*
+ * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the latter, we rely on the root node,
+ * which by augmented interval tree property, holds the largest value in
+ * its last-in-subtree. This allows mitigating some of the tree walk overhead
+ * for non-intersecting ranges, maintained and consulted in O(1).
+ */
+static inline bool
+__range_intersects_intree(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = to_interval_tree_node(tree->root.rb_node);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
+	    lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	lock->seqnum = tree->seqnum++;
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = to_interval_tree_node(next);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+/*
+ * lock->tsk reader tracking.
+ */
+#define RANGE_FLAG_READER	1UL
+
+static inline struct task_struct *range_lock_waiter(struct range_lock *lock)
+{
+	return (struct task_struct *)
+		((unsigned long) lock->tsk & ~RANGE_FLAG_READER);
+}
+
+static inline void range_lock_set_reader(struct range_lock *lock)
+{
+	lock->tsk = (struct task_struct *)
+		((unsigned long)lock->tsk | RANGE_FLAG_READER);
+}
+
+static inline void range_lock_clear_reader(struct range_lock *lock)
+{
+	lock->tsk = (struct task_struct *)
+		((unsigned long)lock->tsk & ~RANGE_FLAG_READER);
+}
+
+static inline bool range_lock_is_reader(struct range_lock *lock)
+{
+	return (unsigned long) lock->tsk & RANGE_FLAG_READER;
+}
+
+static inline void
+__range_lock_init(struct range_lock *lock,
+		  unsigned long start, unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->tsk = NULL;
+	lock->seqnum = 0;
+}
+
+/**
+ * range_lock_init - Initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range's [start, last] such that it can
+ * later be locked. User is expected to enter a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_lock_init(struct range_lock *lock,
+		     unsigned long start, unsigned long last)
+{
+	__range_lock_init(lock, start, last);
+}
+EXPORT_SYMBOL_GPL(range_lock_init);
+
+/**
+ * range_lock_init_full - Initialize a full range lock
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_lock_init_full(struct range_lock *lock)
+{
+	__range_lock_init(lock, 0, RANGE_LOCK_FULL);
+}
+EXPORT_SYMBOL_GPL(range_lock_init_full);
+
+static inline void
+range_lock_put(struct range_lock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, range_lock_waiter(lock));
+}
+
+static inline int wait_for_ranges(struct range_lock_tree *tree,
+				  struct range_lock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		if (unlikely(signal_pending_state(state, current))) {
+			struct interval_tree_node *node;
+			unsigned long flags;
+			DEFINE_WAKE_Q(wake_q);
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+
+			range_lock_clear_reader(lock);
+			__range_tree_remove(tree, lock);
+
+			if (!__range_intersects_intree(tree, lock))
+				goto unlock;
+
+			range_interval_tree_foreach(node, &tree->root,
+						    lock->node.start,
+						    lock->node.last) {
+				struct range_lock *blked;
+				blked = to_range_lock(node);
+
+				if (range_lock_is_reader(lock) &&
+				    range_lock_is_reader(blked))
+					continue;
+
+				/* unaccount for threads _we_ are blocking */
+				if (lock->seqnum < blked->seqnum)
+					range_lock_put(blked, &wake_q);
+			}
+
+		unlock:
+			spin_unlock_irqrestore(&tree->lock, flags);
+			wake_up_q(&wake_q);
+			break;
+		}
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+static inline int __range_read_trylock(struct range_lock_tree *tree,
+				       struct range_lock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_lock *blocked_lock;
+		blocked_lock = to_range_lock(node);
+
+		if (!range_lock_is_reader(blocked_lock)) {
+			ret = false;
+			goto unlock;
+		}
+	}
+insert:
+	range_lock_set_reader(lock);
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+
+int range_read_trylock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	int ret = __range_read_trylock(tree, lock);
+
+	if (ret)
+		range_lock_acquire_read(&tree->dep_map, 0, 1, _RET_IP_);
+
+	return ret;
+}
+
+EXPORT_SYMBOL_GPL(range_read_trylock);
+
+static __always_inline int __sched
+__range_read_lock_common(struct range_lock_tree *tree,
+			 struct range_lock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_lock *blocked_lock;
+		blocked_lock = to_range_lock(node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->tsk = current;
+	range_lock_set_reader(lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+static __always_inline int
+__range_read_lock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	return __range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+
+/**
+ * range_read_lock - Lock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Sleeps until there are no overlapping ranges and returns
+ * once the lock has been acquired.
+ */
+void range_read_lock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	might_sleep();
+	range_lock_acquire_read(&tree->dep_map, 0, 0, _RET_IP_);
+
+	RANGE_LOCK_CONTENDED(tree, lock,
+			     __range_read_trylock, __range_read_lock);
+}
+EXPORT_SYMBOL_GPL(range_read_lock);
+
+/**
+ * range_read_lock_interruptible - Lock for reading (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 once the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_interruptible(struct range_lock_tree *tree,
+				  struct range_lock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_interruptible);
+
+/**
+ * range_read_lock_killable - Lock for reading (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 once the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a fatal signal arrives while waiting
+ * for the lock then this function returns -EINTR.
+ */
+static __always_inline int
+__range_read_lock_killable(struct range_lock_tree *tree,
+			   struct range_lock *lock)
+{
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+}
+
+int range_read_lock_killable(struct range_lock_tree *tree,
+			     struct range_lock *lock)
+{
+	might_sleep();
+	range_lock_acquire_read(&tree->dep_map, 0, 0, _RET_IP_);
+
+	if (RANGE_LOCK_CONTENDED_RETURN(tree, lock, __range_read_trylock,
+					__range_read_lock_killable)) {
+		range_lock_release(&tree->dep_map, 1, _RET_IP_);
+		return -EINTR;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(range_read_lock_killable);
+
+/**
+ * range_read_unlock - Unlock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked writers for which @lock was the last conflicting range.
+ *
+ * It is not allowed to unlock an unacquired read lock.
+ */
+void range_read_unlock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	range_lock_release(&tree->dep_map, 1, _RET_IP_);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_lock *blocked_lock;
+		blocked_lock = to_range_lock(node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			range_lock_put(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_read_unlock);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+static inline int __range_write_trylock(struct range_lock_tree *tree,
+					struct range_lock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		range_lock_clear_reader(lock);
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+
+int range_write_trylock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	int ret = __range_write_trylock(tree, lock);
+
+	if (ret)
+		range_lock_acquire(&tree->dep_map, 0, 1, _RET_IP_);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(range_write_trylock);
+
+static __always_inline int __sched
+__range_write_lock_common(struct range_lock_tree *tree,
+			  struct range_lock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to wait; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->tsk = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+static __always_inline int
+__range_write_lock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	return __range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+
+/**
+ * range_write_lock - Lock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Sleeps until there are no overlapping ranges and returns
+ * once the lock has been acquired.
+ */
+void range_write_lock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	might_sleep();
+	range_lock_acquire(&tree->dep_map, 0, 0, _RET_IP_);
+
+	RANGE_LOCK_CONTENDED(tree, lock,
+			     __range_write_trylock, __range_write_lock);
+}
+EXPORT_SYMBOL_GPL(range_write_lock);
+
+/**
+ * range_write_lock_interruptible - Lock for writing (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 once the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_interruptible(struct range_lock_tree *tree,
+				   struct range_lock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_interruptible);
+
+/**
+ * range_write_lock_killable - Lock for writing (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 once the
+ * lock has been acquired, sleeping until there are no
+ * overlapping ranges. If a fatal signal arrives while waiting
+ * for the lock then this function returns -EINTR.
+ */
+static __always_inline int
+__range_write_lock_killable(struct range_lock_tree *tree,
+			   struct range_lock *lock)
+{
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+
+int range_write_lock_killable(struct range_lock_tree *tree,
+			      struct range_lock *lock)
+{
+	might_sleep();
+	range_lock_acquire(&tree->dep_map, 0, 0, _RET_IP_);
+
+	if (RANGE_LOCK_CONTENDED_RETURN(tree, lock, __range_write_trylock,
+					__range_write_lock_killable)) {
+		range_lock_release(&tree->dep_map, 1, _RET_IP_);
+		return -EINTR;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(range_write_lock_killable);
+
+/**
+ * range_write_unlock - Unlock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked waiters for which @lock was the last conflicting range.
+ *
+ * It is not allowed to unlock an unacquired write lock.
+ */
+void range_write_unlock(struct range_lock_tree *tree, struct range_lock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	range_lock_release(&tree->dep_map, 1, _RET_IP_);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_lock *blocked_lock;
+		blocked_lock = to_range_lock(node);
+
+		range_lock_put(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_lock_tree *tree,
+			   struct range_lock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	lock_downgrade(&tree->dep_map, _RET_IP_);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(range_lock_is_reader(lock));
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_lock *blocked_lock;
+		blocked_lock = to_range_lock(node);
+
+		/*
+		 * Unaccount for any blocked reader lock. Wakeup if possible.
+		 */
+		if (range_lock_is_reader(blocked_lock))
+			range_lock_put(blocked_lock, &wake_q);
+	}
+
+	range_lock_set_reader(lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_downgrade_write);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+void range_read_lock_nested(struct range_lock_tree *tree,
+			    struct range_lock *lock, int subclass)
+{
+	might_sleep();
+	range_lock_acquire_read(&tree->dep_map, subclass, 0, _RET_IP_);
+
+	RANGE_LOCK_CONTENDED(tree, lock, __range_read_trylock, __range_read_lock);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_nested);
+
+void _range_write_lock_nest_lock(struct range_lock_tree *tree,
+				struct range_lock *lock,
+				struct lockdep_map *nest)
+{
+	might_sleep();
+	range_lock_acquire_nest(&tree->dep_map, 0, 0, nest, _RET_IP_);
+
+	RANGE_LOCK_CONTENDED(tree, lock,
+			     __range_write_trylock, __range_write_lock);
+}
+EXPORT_SYMBOL_GPL(_range_write_lock_nest_lock);
+
+void range_write_lock_nested(struct range_lock_tree *tree,
+			    struct range_lock *lock, int subclass)
+{
+	might_sleep();
+	range_lock_acquire(&tree->dep_map, subclass, 0, _RET_IP_);
+
+	RANGE_LOCK_CONTENDED(tree, lock,
+			     __range_write_trylock, __range_write_lock);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_nested);
+
+
+int range_write_lock_killable_nested(struct range_lock_tree *tree,
+				     struct range_lock *lock, int subclass)
+{
+	might_sleep();
+	range_lock_acquire(&tree->dep_map, subclass, 0, _RET_IP_);
+
+	if (RANGE_LOCK_CONTENDED_RETURN(tree, lock, __range_write_trylock,
+					__range_write_lock_killable)) {
+		range_lock_release(&tree->dep_map, 1, _RET_IP_);
+		return -EINTR;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(range_write_lock_killable_nested);
+
+#endif
-- 
2.12.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-05-23 15:12 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 1/6] interval-tree: Build unconditionally Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 2/6] locking: Introduce range reader/writer lock Davidlohr Bueso
2017-04-06  9:01   ` Laurent Dufour
2017-04-06 16:50     ` Davidlohr Bueso
2017-04-13  8:07       ` Laurent Dufour
2017-04-13  8:38         ` Jan Kara
2017-04-13  8:58           ` Laurent Dufour
2017-04-06 10:24   ` Peter Zijlstra
2017-04-18 13:57   ` Laurent Dufour
2017-04-20 16:01     ` Davidlohr Bueso
2017-04-21  7:00       ` Laurent Dufour
2017-04-06  8:46 ` [PATCH 3/6] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 4/6] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 5/6] locking/locktorture: Support range rwlocks Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 6/6] staging/lustre: Use generic range rwlock Davidlohr Bueso
2017-04-07 10:08   ` Dilger, Andreas
2017-04-07 10:08     ` [lustre-devel] " Dilger, Andreas
2017-04-19 12:37 ` [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Peter Zijlstra
2017-04-20 17:13   ` Davidlohr Bueso
2017-04-20 17:53     ` Peter Zijlstra
2017-04-20 18:36       ` Davidlohr Bueso
2017-04-20 19:10         ` Peter Zijlstra
2017-05-15  9:19           ` Davidlohr Bueso
2017-05-15  9:07 [PATCH v3 " Davidlohr Bueso
2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
2017-05-15 13:02   ` Peter Zijlstra
2017-05-16 22:19     ` Davidlohr Bueso
2017-05-15 13:44   ` Peter Zijlstra
2017-05-16 21:17     ` Davidlohr Bueso
2017-05-15 13:59   ` Peter Zijlstra
2017-05-23 15:12   ` Laurent Dufour
