All of lore.kernel.org
 help / color / mirror / Atom feed
From: Davidlohr Bueso <dave@stgolabs.net>
To: mingo@kernel.org, peterz@infradead.org, akpm@linux-foundation.org
Cc: jack@suse.cz, kirill.shutemov@linux.intel.com,
	ldufour@linux.vnet.ibm.com, mhocko@suse.com,
	mgorman@techsingularity.net, dave@stgolabs.net,
	linux-kernel@vger.kernel.org, Davidlohr Bueso <dbueso@suse.de>
Subject: [PATCH 2/6] locking: Introduce range reader/writer lock
Date: Thu,  6 Apr 2017 01:46:16 -0700	[thread overview]
Message-ID: <20170406084620.22700-3-dave@stgolabs.net> (raw)
In-Reply-To: <20170406084620.22700-1-dave@stgolabs.net>

This implements a sleepable range rwlock, based on interval tree, serializing
conflicting/intersecting/overlapping ranges within the tree. The largest range
is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
involves dealing with the tree itself and the range to be locked, normally
stack allocated and always explicitly prepared/initialized by the user in a
[a0, a1] a0 <= a1 sorted manner, before actually taking the lock.

An interval tree of locked and to-be-locked ranges is kept. When a new range
lock is requested, we add its interval to the tree and store number of
intervals intersecting it to 'blocking_ranges'. For the reader case,
'blocking_ranges' is only accounted for if the intersecting range is
marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
guarantee that task is blocked until there are no overlapping ranges in the
tree.

When a range is unlocked, we again walk intervals that overlap with the
unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
therefore relies on the order of the interval tree  -- as opposed to a
more traditional fifo mechanism. There is no lock stealing either, which
prevents starvation and guarantees fairness.

How much does it cost:
----------------------

The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
is total number of ranges and R_int is the number of ranges intersecting the
new range range to be added.

Due to its sharable nature, full range locks can be compared with rw-sempahores,
which also serves from a mutex standpoint as writer-only situations are
pretty similar nowadays.

The first is the memory footprint, tree locks are larger than rwsems: 80 vs
40 bytes, and also requires an additional 72 bytes of stack for the range
structure.

Secondly, because every range call is serialized by the tree->lock, any lock()
fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
overhead compared to a single atomic insn in the case of rwsems. Similar scenario
obviously for the unlock() case.

The torture module was used to measure 1-1 differences in lock acquisition with
increasing core counts over a period of 10 minutes. Readers and writers are
interleaved, with a slight advantage to writers as its the first kthread that is
created. The following shows the avg ops/minute with various thread-setups on
boxes with small and large core-counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For the read/write-only cases, there is very little difference between the range lock
and rwsems, with up to a 3% hit, which could very well be considered in the noise range.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While a single reader and writer threads does not show must difference, increasing
core counts shows that in reader/writer workloads, writer threads can take a hit in
raw performance of up to ~20%, while the number of reader throughput is quite similar
among both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, larger machines show that range locks take only a minor
(up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to ~40%,
similar to the 4 core machine, however, reader threads can increase the number of
acquisitions in up to ~80%. In any case, the overall writer+reader throughput will
always be higher for rwsems. A huge factor in this behavior is that range locks
do not have writer spin-on-owner feature.

On both machines when actually testing threads acquiring different ranges, the
amount of throughput will always outperform the rwsem, due to the increased
parallelism; which is no surprise either. As such microbenchmarks that merely
pounds on a lock will pretty much always suffer upon direct lock conversions,
but not enough to matter in the overall picture.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/range_rwlock.h  | 115 +++++++++
 kernel/locking/Makefile       |   2 +-
 kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 670 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
new file mode 100644
index 000000000000..0a8dad62d097
--- /dev/null
+++ b/include/linux/range_rwlock.h
@@ -0,0 +1,115 @@
+/*
+ * Range/interval rw-locking
+ * -------------------------
+ *
+ * An interval tree of locked and to-be-locked ranges is kept. When a new range
+ * lock is requested, we add its interval to the tree and store number of
+ * intervals intersecting it to 'blocking_ranges'. For the reader case,
+ * 'blocking_ranges' is only accounted for if the intersecting range is
+ * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
+ * guarantee that task is blocked until there are no overlapping ranges in the
+ * tree.
+ *
+ * When a range is unlocked, we again walk intervals that overlap with the
+ * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
+ * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
+ * therefore relies on the order of the interval tree  -- as opposed to a
+ * more traditional fifo mechanism. There is no lock stealing either, which
+ * prevents starvation and guarantees fairness.
+ *
+ * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
+ * is total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_RWLOCK_H
+#define _LINUX_RANGE_RWLOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_RWLOCK_FULL].
+ */
+#define RANGE_RWLOCK_FULL  ~0UL
+
+struct range_rwlock {
+	struct interval_tree_node node;
+	struct task_struct *waiter;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	u64 seqnum;
+};
+
+struct range_rwlock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+	u64 seqnum; /* track order of incoming ranges, avoid overflows */
+};
+
+#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .seqnum = 0					\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
+
+#define DEFINE_RANGE_RWLOCK_TREE(name) \
+       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
+		.node = {			\
+			.start = (__start)	\
+			,.last = (__last)	\
+		}				\
+		, .task = NULL			\
+		, .blocking_ranges = 0		\
+		, .reader = false		\
+		, .seqnum = 0			\
+	}
+
+#define DEFINE_RANGE_RWLOCK(name, start, last)				\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_RWLOCK_FULL(name)		\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
+
+static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+	tree->seqnum = 0;
+}
+
+void range_rwlock_init(struct range_rwlock *lock,
+		       unsigned long start, unsigned long last);
+void range_rwlock_init_full(struct range_rwlock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock);
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock);
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock);
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock);
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock);
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..bf054d0af82b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
new file mode 100644
index 000000000000..550ad360d87a
--- /dev/null
+++ b/kernel/locking/range_rwlock.c
@@ -0,0 +1,554 @@
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_rwlock.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+#define range_entry(ptr, type, member) container_of(ptr, type, member)
+
+#define range_interval_tree_foreach(node, root, start, last)	\
+	for (node = interval_tree_iter_first(root, start, last); \
+	     node; node = interval_tree_iter_next(node, start, last))
+
+/*
+ * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the later, we rely on the root node,
+ * which by augmented interval tree property, holds the largest value in
+ * its last-in-subtree. This allows mitigating some of the tree walk overhead
+ * for non-intersecting ranges, maintained and consulted in O(1).
+ */
+static inline bool
+__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
+	    lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	lock->seqnum = tree->seqnum++;
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+/*
+ * lock->waiter reader tracking.
+ */
+#define RANGE_FLAG_READER	1UL
+
+static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
+{
+	return (struct task_struct *)
+		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline void range_lock_set_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter | RANGE_FLAG_READER);
+}
+
+static inline void range_lock_clear_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline bool range_lock_is_reader(struct range_rwlock *lock)
+{
+	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
+}
+
+static inline void
+__range_rwlock_init(struct range_rwlock *lock,
+		    unsigned long start, unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->waiter = NULL;
+	lock->seqnum = 0;
+}
+
+/**
+ * range_rwlock_init - Initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range's [start, last] such that it can
+ * later be locked. User is expected to enter a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		     unsigned long last)
+{
+	__range_rwlock_init(lock, start, last);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init);
+
+/**
+ * range_rwlock_init_full - Initialize a full range lock
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init_full(struct range_rwlock *lock)
+{
+	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init_full);
+
+static inline void
+range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, range_lock_waiter(lock));
+}
+
+static inline int wait_for_ranges(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		if (unlikely(signal_pending_state(state, current))) {
+			struct interval_tree_node *node;
+			unsigned long flags;
+			DEFINE_WAKE_Q(wake_q);
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+
+			range_lock_clear_reader(lock);
+			__range_tree_remove(tree, lock);
+
+			if (!__range_intersects_intree(tree, lock))
+				goto unlock;
+
+			range_interval_tree_foreach(node, &tree->root,
+						    lock->node.start,
+						    lock->node.last) {
+				struct range_rwlock *blked;
+				blked = range_entry(node,
+						    struct range_rwlock, node);
+
+				if (range_lock_is_reader(lock) &&
+				    range_lock_is_reader(blked))
+					continue;
+
+				/* unaccount for threads _we_ are blocking */
+				if (lock->seqnum < blked->seqnum)
+					range_rwlock_unblock(blked, &wake_q);
+			}
+
+		unlock:
+			spin_unlock_irqrestore(&tree->lock, flags);
+			wake_up_q(&wake_q);
+			break;
+		}
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+static __always_inline int __sched
+__range_read_lock_common(struct range_rwlock_tree *tree,
+			 struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	range_lock_set_reader(lock);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_read_lock - Lock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired or sleep until
+ * until there are no overlapping ranges.
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock);
+
+/**
+ * range_read_lock_interruptible - Lock for reading (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_interruptible);
+
+/**
+ * range_read_lock_killable - Lock for reading (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_killable);
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock)) {
+			ret = false;
+			goto unlock;
+		}
+	}
+insert:
+	range_lock_set_reader(lock);
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(range_read_trylock);
+
+/**
+ * range_read_unlock - Unlock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired read lock.
+ */
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_read_unlock);
+
+static __always_inline int __sched
+__range_write_lock_common(struct range_rwlock_tree *tree,
+			  struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to block; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_write_lock - Lock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired or sleep until
+ * until there are no overlapping ranges.
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock);
+
+/**
+ * range_write_lock_interruptible - Lock for writing (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_interruptible);
+
+/**
+ * range_write_lock_killable - Lock for writing (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_killable);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		range_lock_clear_reader(lock);
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+EXPORT_SYMBOL_GPL(range_write_trylock);
+
+/**
+ * range_write_unlock - Unlock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired write lock.
+ */
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(range_lock_is_reader(lock));
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		/*
+		 * Unaccount for any blocked reader lock. Wakeup if possible.
+		 */
+		if (range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	range_lock_set_reader(lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_downgrade_write);
-- 
2.12.0

  parent reply	other threads:[~2017-04-06  8:52 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06  8:46 [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 1/6] interval-tree: Build unconditionally Davidlohr Bueso
2017-04-06  8:46 ` Davidlohr Bueso [this message]
2017-04-06  9:01   ` [PATCH 2/6] locking: Introduce range reader/writer lock Laurent Dufour
2017-04-06 16:50     ` Davidlohr Bueso
2017-04-13  8:07       ` Laurent Dufour
2017-04-13  8:38         ` Jan Kara
2017-04-13  8:58           ` Laurent Dufour
2017-04-06 10:24   ` Peter Zijlstra
2017-04-18 13:57   ` Laurent Dufour
2017-04-20 16:01     ` Davidlohr Bueso
2017-04-21  7:00       ` Laurent Dufour
2017-04-06  8:46 ` [PATCH 3/6] locking/locktorture: Fix rwsem reader_delay Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 4/6] locking/locktorture: Fix num reader/writer corner cases Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 5/6] locking/locktorture: Support range rwlocks Davidlohr Bueso
2017-04-06  8:46 ` [PATCH 6/6] staging/lustre: Use generic range rwlock Davidlohr Bueso
2017-04-07 10:08   ` Dilger, Andreas
2017-04-07 10:08     ` [lustre-devel] " Dilger, Andreas
2017-04-19 12:37 ` [PATCH v2 -tip 0/6] locking: Introduce range reader/writer lock Peter Zijlstra
2017-04-20 17:13   ` Davidlohr Bueso
2017-04-20 17:53     ` Peter Zijlstra
2017-04-20 18:36       ` Davidlohr Bueso
2017-04-20 19:10         ` Peter Zijlstra
2017-05-15  9:19           ` Davidlohr Bueso
2017-05-15  9:07 [PATCH v3 " Davidlohr Bueso
2017-05-15  9:07 ` [PATCH 2/6] " Davidlohr Bueso
2017-05-15 13:02   ` Peter Zijlstra
2017-05-16 22:19     ` Davidlohr Bueso
2017-05-15 13:44   ` Peter Zijlstra
2017-05-16 21:17     ` Davidlohr Bueso
2017-05-15 13:59   ` Peter Zijlstra
2017-05-23 15:12   ` Laurent Dufour

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170406084620.22700-3-dave@stgolabs.net \
    --to=dave@stgolabs.net \
    --cc=akpm@linux-foundation.org \
    --cc=dbueso@suse.de \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=ldufour@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mhocko@suse.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.