[PATCH 2/6] locking: Introduce range reader/writer lock

From: Davidlohr Bueso <dave@stgolabs.net>
To: mingo@kernel.org, peterz@infradead.org, akpm@linux-foundation.org
Cc: jack@suse.cz, kirill.shutemov@linux.intel.com,
	ldufour@linux.vnet.ibm.com, mhocko@suse.com,
	mgorman@techsingularity.net, dave@stgolabs.net,
	linux-kernel@vger.kernel.org, Davidlohr Bueso <dbueso@suse.de>
Subject: [PATCH 2/6] locking: Introduce range reader/writer lock
Date: Thu,  6 Apr 2017 01:46:16 -0700	[thread overview]
Message-ID: <20170406084620.22700-3-dave@stgolabs.net> (raw)
In-Reply-To: <20170406084620.22700-1-dave@stgolabs.net>

This implements a sleepable range rwlock, based on interval tree, serializing
conflicting/intersecting/overlapping ranges within the tree. The largest range
is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
involves dealing with the tree itself and the range to be locked, normally
stack allocated and always explicitly prepared/initialized by the user in a
[a0, a1] a0 <= a1 sorted manner, before actually taking the lock.

An interval tree of locked and to-be-locked ranges is kept. When a new range
lock is requested, we add its interval to the tree and store number of
intervals intersecting it to 'blocking_ranges'. For the reader case,
'blocking_ranges' is only accounted for if the intersecting range is
marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
guarantee that task is blocked until there are no overlapping ranges in the
tree.

When a range is unlocked, we again walk intervals that overlap with the
unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
therefore relies on the order of the interval tree  -- as opposed to a
more traditional fifo mechanism. There is no lock stealing either, which
prevents starvation and guarantees fairness.

How much does it cost:
----------------------

The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
is total number of ranges and R_int is the number of ranges intersecting the
new range range to be added.

Due to its sharable nature, full range locks can be compared with rw-sempahores,
which also serves from a mutex standpoint as writer-only situations are
pretty similar nowadays.

The first is the memory footprint, tree locks are larger than rwsems: 80 vs
40 bytes, and also requires an additional 72 bytes of stack for the range
structure.

Secondly, because every range call is serialized by the tree->lock, any lock()
fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
overhead compared to a single atomic insn in the case of rwsems. Similar scenario
obviously for the unlock() case.

The torture module was used to measure 1-1 differences in lock acquisition with
increasing core counts over a period of 10 minutes. Readers and writers are
interleaved, with a slight advantage to writers as its the first kthread that is
created. The following shows the avg ops/minute with various thread-setups on
boxes with small and large core-counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For the read/write-only cases, there is very little difference between the range lock
and rwsems, with up to a 3% hit, which could very well be considered in the noise range.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While a single reader and writer threads does not show must difference, increasing
core counts shows that in reader/writer workloads, writer threads can take a hit in
raw performance of up to ~20%, while the number of reader throughput is quite similar
among both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, larger machines show that range locks take only a minor
(up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to ~40%,
similar to the 4 core machine, however, reader threads can increase the number of
acquisitions in up to ~80%. In any case, the overall writer+reader throughput will
always be higher for rwsems. A huge factor in this behavior is that range locks
do not have writer spin-on-owner feature.

On both machines when actually testing threads acquiring different ranges, the
amount of throughput will always outperform the rwsem, due to the increased
parallelism; which is no surprise either. As such microbenchmarks that merely
pounds on a lock will pretty much always suffer upon direct lock conversions,
but not enough to matter in the overall picture.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/range_rwlock.h  | 115 +++++++++
 kernel/locking/Makefile       |   2 +-
 kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 670 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
new file mode 100644
index 000000000000..0a8dad62d097
--- /dev/null
+++ b/include/linux/range_rwlock.h
@@ -0,0 +1,115 @@
+/*
+ * Range/interval rw-locking
+ * -------------------------
+ *
+ * An interval tree of locked and to-be-locked ranges is kept. When a new range
+ * lock is requested, we add its interval to the tree and store number of
+ * intervals intersecting it to 'blocking_ranges'. For the reader case,
+ * 'blocking_ranges' is only accounted for if the intersecting range is
+ * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
+ * guarantee that task is blocked until there are no overlapping ranges in the
+ * tree.
+ *
+ * When a range is unlocked, we again walk intervals that overlap with the
+ * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
+ * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
+ * therefore relies on the order of the interval tree  -- as opposed to a
+ * more traditional fifo mechanism. There is no lock stealing either, which
+ * prevents starvation and guarantees fairness.
+ *
+ * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
+ * is total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_RWLOCK_H
+#define _LINUX_RANGE_RWLOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_RWLOCK_FULL].
+ */
+#define RANGE_RWLOCK_FULL  ~0UL
+
+struct range_rwlock {
+	struct interval_tree_node node;
+	struct task_struct *waiter;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	u64 seqnum;
+};
+
+struct range_rwlock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+	u64 seqnum; /* track order of incoming ranges, avoid overflows */
+};
+
+#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .seqnum = 0					\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
+
+#define DEFINE_RANGE_RWLOCK_TREE(name) \
+       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
+		.node = {			\
+			.start = (__start)	\
+			,.last = (__last)	\
+		}				\
+		, .task = NULL			\
+		, .blocking_ranges = 0		\
+		, .reader = false		\
+		, .seqnum = 0			\
+	}
+
+#define DEFINE_RANGE_RWLOCK(name, start, last)				\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_RWLOCK_FULL(name)		\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
+
+static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+	tree->seqnum = 0;
+}
+
+void range_rwlock_init(struct range_rwlock *lock,
+		       unsigned long start, unsigned long last);
+void range_rwlock_init_full(struct range_rwlock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock);
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock);
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock);
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock);
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock);
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..bf054d0af82b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
new file mode 100644
index 000000000000..550ad360d87a
--- /dev/null
+++ b/kernel/locking/range_rwlock.c
@@ -0,0 +1,554 @@
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_rwlock.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+#define range_entry(ptr, type, member) container_of(ptr, type, member)
+
+#define range_interval_tree_foreach(node, root, start, last)	\
+	for (node = interval_tree_iter_first(root, start, last); \
+	     node; node = interval_tree_iter_next(node, start, last))
+
+/*
+ * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the later, we rely on the root node,
+ * which by augmented interval tree property, holds the largest value in
+ * its last-in-subtree. This allows mitigating some of the tree walk overhead
+ * for non-intersecting ranges, maintained and consulted in O(1).
+ */
+static inline bool
+__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
+	    lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	lock->seqnum = tree->seqnum++;
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+/*
+ * lock->waiter reader tracking.
+ */
+#define RANGE_FLAG_READER	1UL
+
+static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
+{
+	return (struct task_struct *)
+		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline void range_lock_set_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter | RANGE_FLAG_READER);
+}
+
+static inline void range_lock_clear_reader(struct range_rwlock *lock)
+{
+	lock->waiter = (struct task_struct *)
+		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
+}
+
+static inline bool range_lock_is_reader(struct range_rwlock *lock)
+{
+	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
+}
+
+static inline void
+__range_rwlock_init(struct range_rwlock *lock,
+		    unsigned long start, unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->waiter = NULL;
+	lock->seqnum = 0;
+}
+
+/**
+ * range_rwlock_init - Initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range's [start, last] such that it can
+ * later be locked. User is expected to enter a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		     unsigned long last)
+{
+	__range_rwlock_init(lock, start, last);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init);
+
+/**
+ * range_rwlock_init_full - Initialize a full range lock
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init_full(struct range_rwlock *lock)
+{
+	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
+}
+EXPORT_SYMBOL_GPL(range_rwlock_init_full);
+
+static inline void
+range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, range_lock_waiter(lock));
+}
+
+static inline int wait_for_ranges(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		if (unlikely(signal_pending_state(state, current))) {
+			struct interval_tree_node *node;
+			unsigned long flags;
+			DEFINE_WAKE_Q(wake_q);
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+
+			range_lock_clear_reader(lock);
+			__range_tree_remove(tree, lock);
+
+			if (!__range_intersects_intree(tree, lock))
+				goto unlock;
+
+			range_interval_tree_foreach(node, &tree->root,
+						    lock->node.start,
+						    lock->node.last) {
+				struct range_rwlock *blked;
+				blked = range_entry(node,
+						    struct range_rwlock, node);
+
+				if (range_lock_is_reader(lock) &&
+				    range_lock_is_reader(blked))
+					continue;
+
+				/* unaccount for threads _we_ are blocking */
+				if (lock->seqnum < blked->seqnum)
+					range_rwlock_unblock(blked, &wake_q);
+			}
+
+		unlock:
+			spin_unlock_irqrestore(&tree->lock, flags);
+			wake_up_q(&wake_q);
+			break;
+		}
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+static __always_inline int __sched
+__range_read_lock_common(struct range_rwlock_tree *tree,
+			 struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	range_lock_set_reader(lock);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_read_lock - Lock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired or sleep until
+ * until there are no overlapping ranges.
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock);
+
+/**
+ * range_read_lock_interruptible - Lock for reading (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_interruptible);
+
+/**
+ * range_read_lock_killable - Lock for reading (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_read_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_read_lock_killable);
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock)) {
+			ret = false;
+			goto unlock;
+		}
+	}
+insert:
+	range_lock_set_reader(lock);
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(range_read_trylock);
+
+/**
+ * range_read_unlock - Unlock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired read lock.
+ */
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_read_unlock);
+
+static __always_inline int __sched
+__range_write_lock_common(struct range_rwlock_tree *tree,
+			  struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to block; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->waiter = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+/**
+ * range_write_lock - Lock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Returns when the lock has been acquired or sleep until
+ * until there are no overlapping ranges.
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock);
+
+/**
+ * range_write_lock_interruptible - Lock for writing (interruptible)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_interruptible);
+
+/**
+ * range_write_lock_killable - Lock for writing (killable)
+ * @tree: interval tree
+ * @lock: the range lock to be locked
+ *
+ * Lock the range like range_write_lock(), and return 0 if the
+ * lock has been acquired or sleep until until there are no
+ * overlapping ranges. If a signal arrives while waiting for the
+ * lock then this function returns -EINTR.
+ */
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL_GPL(range_write_lock_killable);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		range_lock_clear_reader(lock);
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+EXPORT_SYMBOL_GPL(range_write_trylock);
+
+/**
+ * range_write_unlock - Unlock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be unlocked
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to unlock an unacquired write lock.
+ */
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	range_lock_clear_reader(lock);
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(range_lock_is_reader(lock));
+
+	range_interval_tree_foreach(node, &tree->root,
+				    lock->node.start, lock->node.last) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		/*
+		 * Unaccount for any blocked reader lock. Wakeup if possible.
+		 */
+		if (range_lock_is_reader(blocked_lock))
+			range_rwlock_unblock(blocked_lock, &wake_q);
+	}
+
+	range_lock_set_reader(lock);
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL_GPL(range_downgrade_write);
-- 
2.12.0