* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
@ 2017-03-09 11:03 ` Jan Kara
2017-03-28 10:00 ` Laurent Dufour
` (8 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2017-03-09 11:03 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
Hi!
On Mon 06-03-17 21:03:26, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on an interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with both the tree itself and the range to be locked; the latter
> is normally stack allocated and always explicitly prepared/initialized by the
> user as a sorted [a0, a1] interval with a0 <= a1, before actually taking the lock.
>
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
>
> When a new range lock is requested, we add its interval to the tree and store the
> number of intervals intersecting it in 'blocking_ranges'. When a range is
> unlocked, we again walk the intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up the owner of any range lock whose
> 'blocking_ranges' drops to 0.
>
> For the shared case, 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
Thanks for pushing this to completion! I like the patch and FWIW I didn't
find any problems with it, so feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> How much does it cost:
> ----------------------
>
> The cost of a lock and unlock of a range is O(log(R_all)+R_int), where R_all is
> the total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
>
> Due to their shareable nature, full range locks can be compared with rw-semaphores;
> the comparison also covers the mutex case, as writer-only rwsems and mutexes
> behave quite similarly nowadays.
>
> The first difference is memory footprint: rwsems are larger than tree locks, 40 vs
> 24 bytes, but the latter requires an additional 64 bytes of stack for the range
> structure.
>
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will have at least an interval_tree_insert() and a spinlock lock+unlock
> overhead, compared to a single atomic insn in the case of rwsems. The same
> obviously applies to the unlock() case.
>
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as the writer is the first kthread
> to be created. The following shows the avg ops/minute with various thread setups on
> boxes with small and large core counts.
>
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
>
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
>
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
>
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
>
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
>
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
>
> For the read-only and write-only cases, there is very little difference between the range
> lock and rwsems, with at most a ~3% hit, which could very well be considered within the noise.
>
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr: 1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr: 1022.5, stddev: 6.41
>
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr: 1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr: 1243.5, stddev: 30.40
>
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr: 1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr: 1398.0, stddev: 56.56
>
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> for both locks.
>
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
>
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
>
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
>
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
>
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
>
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
>
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
>
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr: 1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr: 1852.8, stddev: 147.5
>
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr: 1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr: 1856.5, stddev: 198.69
>
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr: 1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr: 1551.5, stddev: 361.33
>
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase their number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have a writer spin-on-owner feature.
>
> On both machines, when actually testing threads acquiring different ranges, the
> range lock will always outperform the rwsem in throughput, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a single lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> drivers/gpu/drm/Kconfig | 2 -
> drivers/gpu/drm/i915/Kconfig | 1 -
> include/linux/range_rwlock.h | 96 +++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig | 14 --
> lib/Kconfig.debug | 1 -
> lib/Makefile | 2 +-
> 8 files changed, 560 insertions(+), 20 deletions(-)
> create mode 100644 include/linux/range_rwlock.h
> create mode 100644 kernel/locking/range_rwlock.c
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have an ATI Radeon graphics card. There
> are both PCI and AGP versions. You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have a recent AMD Radeon graphics card.
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
> depends on DRM
> depends on X86 && PCI
> select INTEL_GTT
> - select INTERVAL_TREE
> # we need shmfs for the swappable backing store, and in particular
> # the shmem_readpage() which depends upon tmpfs
> select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
> +
> +struct range_rwlock {
> + struct interval_tree_node node;
> + struct task_struct *task;
> + /* Number of ranges which are blocking acquisition of the lock */
> + unsigned int blocking_ranges;
> + bool reader;
> +};
> +
> +struct range_rwlock_tree {
> + struct rb_root root;
> + spinlock_t lock;
> + struct interval_tree_node *leftmost; /* caches the smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name) \
> + { .leftmost = NULL \
> + , .root = RB_ROOT \
> + , .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> + struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) { \
> + .node = { \
> + .start = (start) \
> + ,.last = (last) \
> + } \
> + }
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> + tree->root = RB_ROOT;
> + spin_lock_init(&tree->lock);
> + tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> + unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
> # and is generally not a function of system call inputs.
> KCOV_INSTRUMENT := n
>
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + * a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the latter, we rely on the root node,
> + * which by the augmented interval tree property holds the largest value
> + * in its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *root;
> +
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + return false;
> +
> + root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> + return lock->node.start <= root->__subtree_last &&
> + tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + tree->leftmost = &lock->node;
> + else if (lock->node.start < tree->leftmost->start)
> + tree->leftmost = &lock->node;
> +
> + interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (tree->leftmost == &lock->node) {
> + struct rb_node *next = rb_next(&tree->leftmost->rb);
> + tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> + }
> +
> + interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + WARN_ON(start > last);
> +
> + lock->node.start = start;
> + lock->node.last = last;
> + RB_CLEAR_NODE(&lock->node.rb);
> + lock->blocking_ranges = 0;
> + lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + __range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> + __range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> + if (!--lock->blocking_ranges)
> + wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + int ret = 0;
> +
> + while (true) {
> + set_current_state(state);
> +
> + if (unlikely(signal_pending_state(state, current))) {
> + unsigned long flags;
> +
> + ret = -EINTR;
> + /*
> + * We're not taking the lock after all, cleanup
> + * after ourselves.
> + */
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> + spin_unlock_irqrestore(&tree->lock, flags);
> + break;
> + }
> +
> + /* do we need to go to sleep? */
> + if (!lock->blocking_ranges)
> + break;
> +
> + schedule();
> + }
> +
> + __set_current_state(TASK_RUNNING);
> + return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int ret = true;
> + unsigned long flags;
> + struct interval_tree_node *node;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + /*
> + * We have overlapping ranges in the tree, ensure that we can
> + * in fact share the lock.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader) {
> + ret = false;
> + goto unlock;
> + }
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + lock->reader = true;
> + __range_tree_insert(tree, lock);
> +unlock:
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + /*
> + * As a writer, we always consider an existing node. We
> + * need to block; either the intersecting node is another
> + * writer or we have a reader that needs to finish.
> + */
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int intersects;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + intersects = __range_intersects_intree(tree, lock);
> +
> + if (!intersects) {
> + lock->reader = false;
> + __range_tree_insert(tree, lock);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + range_rwlock_unblock(blocked_lock, &wake_q);
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + unsigned long flags;
> + struct interval_tree_node *node;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + WARN_ON(lock->reader);
> +
> + /*
> + * Unaccount for any blocked reader lock. Wakeup if possible.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + lock->reader = true;
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
> config BTREE
> bool
>
> -config INTERVAL_TREE
> - bool
> - help
> - Simple, embeddable, interval-tree. Can find the start of an
> - overlapping range in log(n) time and then iterate over all
> - overlapping nodes. The algorithm is implemented as an
> - augmented rbtree.
> -
> - See:
> -
> - Documentation/rbtree.txt
> -
> - for more information.
> -
> config RADIX_TREE_MULTIORDER
> bool
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
> config INTERVAL_TREE_TEST
> tristate "Interval tree test"
> depends on m && DEBUG_KERNEL
> - select INTERVAL_TREE
> help
> A benchmark measuring the performance of the interval tree library
>
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
> obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
>
> obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
> obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
> obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
> obj-$(CONFIG_DEBUG_LIST) += list_debug.o
> --
> 2.6.6
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
2017-03-09 11:03 ` Jan Kara
@ 2017-03-28 10:00 ` Laurent Dufour
2017-03-28 16:39 ` Davidlohr Bueso
2017-03-29 8:56 ` Peter Zijlstra
` (7 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Laurent Dufour @ 2017-03-28 10:00 UTC (permalink / raw)
To: Davidlohr Bueso, mingo, peterz, akpm
Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso
On 07/03/2017 06:03, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
>
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
>
> When new range lock is requested, we add its interval to the tree and store
> number of intervals intersecting it to 'blocking_ranges'. When a range is
> unlocked, we again walk intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up owner of any range lock whose
> 'blocking_ranges' drops to 0.
>
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
>
> How much does it cost:
> ----------------------
>
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> total number of ranges and R_int is the number of ranges intersecting the
> new range range to be added.
>
> Due to its sharable nature, full range locks can be compared with rw-sempahores,
> which also serves from a mutex standpoint as writer-only situations are
> pretty similar nowadays.
>
> The first is the memory footprint, rwsems are larger than tree locks: 40 vs
> 24 bytes, but the later requires an additional 64 bytes of stack for the range
> structure.
>
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
>
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as its the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
>
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
>
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
>
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
>
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
>
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
>
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
>
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
>
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr: 1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr: 1022.5, stddev: 6.41
>
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr: 1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr: 1243.5, stddev: 30.40
>
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr: 1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr: 1398.0, stddev: 56.56
>
> While a single reader and writer threads does not show must difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while the number of reader throughput is quite similar
> among both locks.
>
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
>
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
>
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
>
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
>
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
>
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
>
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
>
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr: 1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr: 1852.8, stddev: 147.5
>
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr: 1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr: 1856.5, stddev: 198.69
>
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr: 1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr: 1551.5, stddev: 361.33
>
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have the writer spin-on-owner feature.
>
> On both machines, when actually testing threads acquiring different ranges, the
> range lock will always outperform the rwsem in throughput, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> drivers/gpu/drm/Kconfig | 2 -
> drivers/gpu/drm/i915/Kconfig | 1 -
> include/linux/range_rwlock.h | 96 +++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig | 14 --
> lib/Kconfig.debug | 1 -
> lib/Makefile | 2 +-
> 8 files changed, 560 insertions(+), 20 deletions(-)
> create mode 100644 include/linux/range_rwlock.h
> create mode 100644 kernel/locking/range_rwlock.c
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have an ATI Radeon graphics card. There
> are both PCI and AGP versions. You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have a recent AMD Radeon graphics card.
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
> depends on DRM
> depends on X86 && PCI
> select INTEL_GTT
> - select INTERVAL_TREE
> # we need shmfs for the swappable backing store, and in particular
> # the shmem_readpage() which depends upon tmpfs
> select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
> +
> +struct range_rwlock {
> + struct interval_tree_node node;
> + struct task_struct *task;
> + /* Number of ranges which are blocking acquisition of the lock */
> + unsigned int blocking_ranges;
> + bool reader;
> +};
> +
> +struct range_rwlock_tree {
> + struct rb_root root;
> + spinlock_t lock;
> + struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name) \
> + { .leftmost = NULL \
> + , .root = RB_ROOT \
> + , .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> + struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) { \
> + .node = { \
> + .start = (start) \
> + ,.last = (last) \
> + } \
> + }
Hi Davidlohr,
This macro doesn't expand correctly because the member designator ".start" is
itself replaced by the start parameter. It should rather be:
#define __RANGE_RWLOCK_INITIALIZER(__start, __last) { \
.node = { \
.start = (__start) \
,.last = (__last) \
} \
}
By the way, should the other fields be set as in __range_rwlock_init()?
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> + tree->root = RB_ROOT;
> + spin_lock_init(&tree->lock);
> + tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> + unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
^^^^
range_write_trylock(...) isn't it ?
Cheers,
Laurent.
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
> # and is generally not a function of system call inputs.
> KCOV_INSTRUMENT := n
>
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + * a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *root;
> +
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + return false;
> +
> + root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> + return lock->node.start <= root->__subtree_last &&
> + tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + tree->leftmost = &lock->node;
> + else if (lock->node.start < tree->leftmost->start)
> + tree->leftmost = &lock->node;
> +
> + interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (tree->leftmost == &lock->node) {
> + struct rb_node *next = rb_next(&tree->leftmost->rb);
> + tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> + }
> +
> + interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + WARN_ON(start > last);
> +
> + lock->node.start = start;
> + lock->node.last = last;
> + RB_CLEAR_NODE(&lock->node.rb);
> + lock->blocking_ranges = 0;
> + lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + __range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> + __range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> + if (!--lock->blocking_ranges)
> + wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + int ret = 0;
> +
> + while (true) {
> + set_current_state(state);
> +
> + if (unlikely(signal_pending_state(state, current))) {
> + unsigned long flags;
> +
> + ret = -EINTR;
> + /*
> + * We're not taking the lock after all, cleanup
> + * after ourselves.
> + */
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> + spin_unlock_irqrestore(&tree->lock, flags);
> + break;
> + }
> +
> + /* do we need to go to sleep? */
> + if (!lock->blocking_ranges)
> + break;
> +
> + schedule();
> + }
> +
> + __set_current_state(TASK_RUNNING);
> + return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int ret = true;
> + unsigned long flags;
> + struct interval_tree_node *node;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + /*
> + * We have overlapping ranges in the tree, ensure that we can
> + * in fact share the lock.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader) {
> + ret = false;
> + goto unlock;
> + }
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + lock->reader = true;
> + __range_tree_insert(tree, lock);
> +unlock:
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + /*
> + * As a writer, we always consider an existing node. We
> + * need to block; either the intersecting node is another
> + * writer or we have a reader that needs to finish.
> + */
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int intersects;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + intersects = __range_intersects_intree(tree, lock);
> +
> + if (!intersects) {
> + lock->reader = false;
> + __range_tree_insert(tree, lock);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + range_rwlock_unblock(blocked_lock, &wake_q);
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + unsigned long flags;
> + struct interval_tree_node *node;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + WARN_ON(lock->reader);
> +
> + /*
> + * Unaccount for any blocked reader lock. Wakeup if possible.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + lock->reader = true;
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
> config BTREE
> bool
>
> -config INTERVAL_TREE
> - bool
> - help
> - Simple, embeddable, interval-tree. Can find the start of an
> - overlapping range in log(n) time and then iterate over all
> - overlapping nodes. The algorithm is implemented as an
> - augmented rbtree.
> -
> - See:
> -
> - Documentation/rbtree.txt
> -
> - for more information.
> -
> config RADIX_TREE_MULTIORDER
> bool
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
> config INTERVAL_TREE_TEST
> tristate "Interval tree test"
> depends on m && DEBUG_KERNEL
> - select INTERVAL_TREE
> help
> A benchmark measuring the performance of the interval tree library
>
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
> obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
>
> obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
> obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
> obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
> obj-$(CONFIG_DEBUG_LIST) += list_debug.o
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-28 10:00 ` Laurent Dufour
@ 2017-03-28 16:39 ` Davidlohr Bueso
2017-03-28 16:58 ` Kirill A. Shutemov
2017-04-03 14:19 ` Laurent Dufour
0 siblings, 2 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-28 16:39 UTC (permalink / raw)
To: Laurent Dufour
Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Tue, 28 Mar 2017, Laurent Dufour wrote:
>> +#define __RANGE_RWLOCK_INITIALIZER(start, last) { \
>> + .node = { \
>> + .start = (start) \
>> + ,.last = (last) \
>> + } \
>> + }
>
>Hi Davidlohr,
>
>This macro doesn't expand correctly because the member designator ".start" is
>itself replaced by the start parameter. It should rather be:
>
>#define __RANGE_RWLOCK_INITIALIZER(__start, __last) { \
> .node = { \
> .start = (__start) \
> ,.last = (__last) \
> } \
> }
>
>By the way, should the other fields be set as in __range_rwlock_init()?
Indeed.
>> +/*
>> + * lock for writing
>> + */
>> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
>> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
>> + struct range_rwlock *lock);
>> +int range_write_lock_killable(struct range_rwlock_tree *tree,
>> + struct range_rwlock *lock);
>> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> ^^^^
> range_write_trylock(...) isn't it ?
>
Duh, yeah.
I'll wait to see if there are any more concerns and send a v2 with your corrections.
Thanks,
Davidlohr
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-28 16:39 ` Davidlohr Bueso
@ 2017-03-28 16:58 ` Kirill A. Shutemov
2017-03-29 8:38 ` Laurent Dufour
2017-04-03 14:19 ` Laurent Dufour
1 sibling, 1 reply; 34+ messages in thread
From: Kirill A. Shutemov @ 2017-03-28 16:58 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel, Davidlohr Bueso
On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
> I'll wait to see if there are any more concerns and send a v2 with your corrections.
Have you tried a drop-in replacement of mmap_sem with the full range lock?
It would be interesting to see the performance implications of this.
--
Kirill A. Shutemov
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-28 16:58 ` Kirill A. Shutemov
@ 2017-03-29 8:38 ` Laurent Dufour
2017-03-29 15:31 ` Davidlohr Bueso
0 siblings, 1 reply; 34+ messages in thread
From: Laurent Dufour @ 2017-03-29 8:38 UTC (permalink / raw)
To: Kirill A. Shutemov, Davidlohr Bueso
Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On 28/03/2017 18:58, Kirill A. Shutemov wrote:
> On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>> I'll wait to see if there are any more concerns and send a v2 with your corrections.
>
> Have you tried a drop-in replacement of mmap_sem with the full range lock?
> It would be interesting to see the performance implications of this.
>
I have a patch that replaces the mmap_sem with a full range lock; it seems
to work fine on x86 and ppc64 for now. I'll send it soon.
But I didn't check performance yet. What is the best way to do that?
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 8:38 ` Laurent Dufour
@ 2017-03-29 15:31 ` Davidlohr Bueso
2017-03-29 15:40 ` Kirill A. Shutemov
0 siblings, 1 reply; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:31 UTC (permalink / raw)
To: Laurent Dufour
Cc: Kirill A. Shutemov, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel, Davidlohr Bueso
On Wed, 29 Mar 2017, Laurent Dufour wrote:
>On 28/03/2017 18:58, Kirill A. Shutemov wrote:
>> On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>>> I'll wait to see if there are any more concerns and send a v2 with your corrections.
>>
>> Have you tried drop-in replacement of mmap_sem with full range lock?
>> It would be interesting to see performance implication for this.
>>
>
>I have a patch that replaces the mmap_sem with a full range lock; it seems
>to work fine on x86 and ppc64 for now. I'll send it soon.
>But I didn't check performance yet. What is the best way to do that?
I expect performance to take a measurable hit if we simply use the full range
lock as a drop-in replacement. My rwsem vs range lock measurements were
done with this in mind. We only win with range locks when improving the
level of parallelism. There are plenty of benchmarks out there that
stress address space concurrency. I've been using some from the mosbench
suite (metis, psearchy, dedup); Ingo also has a microbench that does
mmap()/munmap()/mremap() combos:
http://people.redhat.com/mingo/threaded-mmap-stresstest/
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 15:31 ` Davidlohr Bueso
@ 2017-03-29 15:40 ` Kirill A. Shutemov
2017-03-29 16:10 ` Davidlohr Bueso
0 siblings, 1 reply; 34+ messages in thread
From: Kirill A. Shutemov @ 2017-03-29 15:40 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel, Davidlohr Bueso
On Wed, Mar 29, 2017 at 08:31:33AM -0700, Davidlohr Bueso wrote:
> On Wed, 29 Mar 2017, Laurent Dufour wrote:
>
> > On 28/03/2017 18:58, Kirill A. Shutemov wrote:
> > > On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
> > > > I'll wait to see if there are any more concerns and send a v2 with your corrections.
> > >
> > > Have you tried drop-in replacement of mmap_sem with full range lock?
> > > It would be interesting to see performance implication for this.
> > >
> >
> > I've a patch that replace the mmap_sem with a full range lock, it seems
> > to work fine for x86 and ppc64 for now. I'll send it soon.
> > But I didn't yet check for performance. What is the best way to that ?
>
> I expect performance to take a measurable hit if we simply use the full range
> lock as a drop-in replacement. My rwsem vs range lock measurements were
> done with this in mind. We only win with range locks when improving the
> level of parallelism.
It would be a hard sell if we saw performance degradation in a simple
single-threaded workload.
--
Kirill A. Shutemov
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 15:40 ` Kirill A. Shutemov
@ 2017-03-29 16:10 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 16:10 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel, Davidlohr Bueso
On Wed, 29 Mar 2017, Kirill A. Shutemov wrote:
>On Wed, Mar 29, 2017 at 08:31:33AM -0700, Davidlohr Bueso wrote:
>> On Wed, 29 Mar 2017, Laurent Dufour wrote:
>>
>> > On 28/03/2017 18:58, Kirill A. Shutemov wrote:
>> > > On Tue, Mar 28, 2017 at 09:39:18AM -0700, Davidlohr Bueso wrote:
>> > > > I'll wait to see if there are any more concerns and send a v2 with your corrections.
>> > >
>> > > Have you tried drop-in replacement of mmap_sem with full range lock?
>> > > It would be interesting to see performance implication for this.
>> > >
>> >
>> > I've a patch that replace the mmap_sem with a full range lock, it seems
>> > to work fine for x86 and ppc64 for now. I'll send it soon.
>> > But I didn't yet check for performance. What is the best way to that ?
>>
>> I expect performance to take a measurable hit if we simply use full range
>> lock as a drop in replacement. My rwsem vs range lock measurements were
>> done with this in mind. We only win with range locks when improving the
>> level of parallelism.
>
>It would be a hard sell if we saw performance degradation in a simple
>single-threaded workload.
Yeah, that's why I included very low contention in the lock comparison.
Deltas are very much within the noise region; it is with high contention
that things go south performance-wise.
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-28 16:39 ` Davidlohr Bueso
2017-03-28 16:58 ` Kirill A. Shutemov
@ 2017-04-03 14:19 ` Laurent Dufour
2017-04-03 15:26 ` Davidlohr Bueso
1 sibling, 1 reply; 34+ messages in thread
From: Laurent Dufour @ 2017-04-03 14:19 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Tue, 28 Mar 2017 09:39:18 -0700,
Davidlohr Bueso <dave@stgolabs.net> wrote:
> I'll wait to see if there are any more concerns and send a v2 with
> your corrections.
Hi Davidlohr, I think there is a major issue regarding a task
catching a signal in wait_for_ranges().
I can see that when a thread catches a signal, the process deadlocks
in the exit path.
Let's imagine all these tasks waiting for the complete range lock, so
range doesn't matter:
A get the lock in write
B want the read lock => B->blocking_range=1 (because of A)
C want the write lock => C->blocking_range=2 (A,B)
D want the read lock => D->blocking_range=3 (A,B,C)
=> C catch a signal and exit wait_for_ranges()
A release the lock
=> B->blocking_range=0
=> D->blocking_range=2 (D has not seen C removal)
=> B get the lock
B release the lock
=> D->blocking_range=1
D remains blocked while no one has the lock !
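For what it's worth, the broken counting in that sequence can be reproduced with a tiny userspace model (a plain C sketch of the blocking_ranges accounting, not the kernel code):

```c
#include <assert.h>

/* Toy userspace model of the scenario above (not the kernel code):
 * each waiter counts how many earlier conflicting ranges block it. */
struct waiter { unsigned int blocking_ranges; int is_reader; };

/* C bails out on a signal without fixing up later waiters. Returns
 * D->blocking_ranges after A and B have both released the lock. */
static unsigned int simulate_lost_decrement(void)
{
	struct waiter A = { 0, 0 };	/* writer, holds the lock    */
	struct waiter B = { 1, 1 };	/* reader, blocked by A      */
	struct waiter C = { 2, 0 };	/* writer, blocked by A,B    */
	struct waiter D = { 3, 1 };	/* reader, blocked by A,B,C  */

	(void)C;		/* C catches a signal: it just leaves the
				 * tree; nobody decrements D's counter. */

	D.blocking_ranges--;	/* A releases: wakes B, decrements D */
	B.blocking_ranges--;
	assert(B.blocking_ranges == 0);	/* B runs... */
	D.blocking_ranges--;		/* ...then releases */

	return D.blocking_ranges;	/* still 1: D is stuck forever */
}
```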
The issue is that when removing a task from the interval tree, we
should decrement the blocking_ranges of all the tasks added to that
range after the one leaving... I can't see an easy fix for that :(
Am I right?
Cheers,
Laurent.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-04-03 14:19 ` Laurent Dufour
@ 2017-04-03 15:26 ` Davidlohr Bueso
2017-04-03 16:06 ` Jan Kara
0 siblings, 1 reply; 34+ messages in thread
From: Davidlohr Bueso @ 2017-04-03 15:26 UTC (permalink / raw)
To: Laurent Dufour
Cc: mingo, peterz, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, 03 Apr 2017, Laurent Dufour wrote:
>On Tue, 28 Mar 2017 09:39:18 -0700,
>Davidlohr Bueso <dave@stgolabs.net> wrote:
>> I'll wait to see if there are any more concerns and send a v2 with
>> your corrections.
>
>Hi Davidlohr, I think there is a major issue regarding a task
>catching a signal in wait_for_range(): when a thread catches a
>signal, the process can deadlock in the exit path.
>
>Let's imagine all these tasks waiting for the complete range lock, so
>the ranges don't matter:
>
>A gets the lock for write
>B wants the read lock => B->blocking_range=1 (because of A)
>C wants the write lock => C->blocking_range=2 (A,B)
>D wants the read lock => D->blocking_range=3 (A,B,C)
>=> C catches a signal and exits wait_for_ranges()
>A releases the lock
> => B->blocking_range=0
> => D->blocking_range=2 (D has not seen C's removal)
>=> B gets the lock
>B releases the lock
> => D->blocking_range=1
>
>D remains blocked while no one holds the lock!
>
>The issue is that when removing a task from the interval tree, we
>should decrement the blocking_ranges of all the tasks added to that
>range after the one leaving... I can't see an easy fix for that :(
>
>Am I right?
Yes. Peter had mentioned the issue too. One way I thought of fixing
the problem was to track the jiffies timestamp on a per-range_rwlock
basis for when it was added, and in the signal_pending() case, along with
removing the lock from the tree, we iterate the tree again and decrement
the blocking_ranges for those with a higher timestamp. It would add some
overhead, but again this is the unlikely() case. It also adds an extra 8
bytes of footprint, but this is usually stack allocated.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-04-03 15:26 ` Davidlohr Bueso
@ 2017-04-03 16:06 ` Jan Kara
2017-04-04 15:31 ` Davidlohr Bueso
0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2017-04-03 16:06 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Laurent Dufour, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel, Davidlohr Bueso
On Mon 03-04-17 08:26:43, Davidlohr Bueso wrote:
> On Mon, 03 Apr 2017, Laurent Dufour wrote:
> >On Tue, 28 Mar 2017 09:39:18 -0700,
> >Davidlohr Bueso <dave@stgolabs.net> wrote:
> >>I'll wait to see if there are any more concerns and send a v2 with
> >>your corrections.
> >
> >Hi Davidlohr, I think there is a major issue regarding a task
> >catching a signal in wait_for_range(): when a thread catches a
> >signal, the process can deadlock in the exit path.
> >
> >Let's imagine all these tasks waiting for the complete range lock, so
> >the ranges don't matter:
> >
> >A gets the lock for write
> >B wants the read lock => B->blocking_range=1 (because of A)
> >C wants the write lock => C->blocking_range=2 (A,B)
> >D wants the read lock => D->blocking_range=3 (A,B,C)
> >=> C catches a signal and exits wait_for_ranges()
> >A releases the lock
> > => B->blocking_range=0
> > => D->blocking_range=2 (D has not seen C's removal)
> >=> B gets the lock
> >B releases the lock
> > => D->blocking_range=1
> >
> >D remains blocked while no one holds the lock!
> >
> >The issue is that when removing a task from the interval tree, we
> >should decrement the blocking_ranges of all the tasks added to that
> >range after the one leaving... I can't see an easy fix for that :(
> >
> >Am I right?
>
> Yes. Peter had mentioned the issue too. One way I thought of fixing
> the problem was to track the jiffies timestamp on a per-range_rwlock
> basis for when it was added, and in the signal_pending() case, along with
> removing the lock from the tree, we iterate the tree again and decrement
> the blocking_ranges for those with a higher timestamp. It would add some
> overhead, but again this is the unlikely() case. It also adds an extra 8
> bytes of footprint, but this is usually stack allocated.
Or just a plain sequence counter of the lock operations? There are ways
we could implement the functionality without the counter, and although
asymptotically they would be the same, they are more complex code-wise, so I
think the counter is the best solution.
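A toy userspace model of the sequence-counter approach (hypothetical names and a flat array standing in for the interval tree, not kernel code): each waiter is stamped on insertion, and a waiter aborting on a signal decrements the counters of every conflicting waiter stamped after it.

```c
#include <assert.h>

/* Sketch of the sequence-counter fix (userspace model): */
struct waiter { unsigned long seq; unsigned int blocking_ranges; int active; };

/* A waiter leaving on a signal fixes up everyone inserted after it. */
static void abort_waiter(struct waiter *w, struct waiter *tree[], int n)
{
	int i;

	w->active = 0;
	for (i = 0; i < n; i++)
		if (tree[i]->active && tree[i]->seq > w->seq)
			tree[i]->blocking_ranges--;
}

/* Replays the A/B/C/D scenario; returns D->blocking_ranges at the end. */
static unsigned int simulate_seq_fix(void)
{
	struct waiter A = { 0, 0, 1 }, B = { 1, 1, 1 };
	struct waiter C = { 2, 2, 1 }, D = { 3, 3, 1 };
	struct waiter *tree[] = { &A, &B, &C, &D };

	abort_waiter(&C, tree, 4);		/* C catches a signal */
	assert(D.blocking_ranges == 2);		/* D saw C's removal  */

	D.blocking_ranges--; B.blocking_ranges--;	/* A releases */
	D.blocking_ranges--;				/* B runs, releases */

	return D.blocking_ranges;	/* 0: D can now take the lock */
}
```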
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-04-03 16:06 ` Jan Kara
@ 2017-04-04 15:31 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-04-04 15:31 UTC (permalink / raw)
To: Jan Kara
Cc: Laurent Dufour, mingo, peterz, akpm, kirill.shutemov, mhocko,
mgorman, linux-kernel, Davidlohr Bueso
On Mon, 03 Apr 2017, Jan Kara wrote:
>Or just a plain sequence counter of the lock operations?
So what I dislike about this is that we'd also have to enlarge
struct range_rwlock_tree. OTOH, I'm hesitant to depend on
the tick rate for lock correctness, so perhaps your suggestion
is best.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
2017-03-09 11:03 ` Jan Kara
2017-03-28 10:00 ` Laurent Dufour
@ 2017-03-29 8:56 ` Peter Zijlstra
2017-03-29 15:12 ` Davidlohr Bueso
2017-03-29 8:59 ` Peter Zijlstra
` (6 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 8:56 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have an ATI Radeon graphics card. There
> are both PCI and AGP versions. You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have a recent AMD Radeon graphics card.
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
> depends on DRM
> depends on X86 && PCI
> select INTEL_GTT
> - select INTERVAL_TREE
> # we need shmfs for the swappable backing store, and in particular
> # the shmem_readpage() which depends upon tmpfs
> select SHMEM
I presume this is part of making INTERVAL_TREE unconditional; should be
a separate patch, no?
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
That's a strange limit, what's wrong with ~0UL ?
> +
> +struct range_rwlock {
> + struct interval_tree_node node;
> + struct task_struct *task;
> + /* Number of ranges which are blocking acquisition of the lock */
> + unsigned int blocking_ranges;
> + bool reader;
> +};
Hate the name; our rwlock is a spinlock, therefore this thing suggests
it is too.
Also, no bool in structures.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 8:56 ` Peter Zijlstra
@ 2017-03-29 15:12 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Wed, 29 Mar 2017, Peter Zijlstra wrote:
>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index 88e01e08e279..e4d9eadd2c47 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -154,7 +154,6 @@ config DRM_RADEON
>> select HWMON
>> select BACKLIGHT_CLASS_DEVICE
>> select BACKLIGHT_LCD_SUPPORT
>> - select INTERVAL_TREE
>> help
>> Choose this option if you have an ATI Radeon graphics card. There
>> are both PCI and AGP versions. You don't need to choose this to
>> @@ -174,7 +173,6 @@ config DRM_AMDGPU
>> select HWMON
>> select BACKLIGHT_CLASS_DEVICE
>> select BACKLIGHT_LCD_SUPPORT
>> - select INTERVAL_TREE
>> help
>> Choose this option if you have a recent AMD Radeon graphics card.
>>
>> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
>> index 183f5dc1c3f2..8a9154550f46 100644
>> --- a/drivers/gpu/drm/i915/Kconfig
>> +++ b/drivers/gpu/drm/i915/Kconfig
>> @@ -3,7 +3,6 @@ config DRM_I915
>> depends on DRM
>> depends on X86 && PCI
>> select INTEL_GTT
>> - select INTERVAL_TREE
>> # we need shmfs for the swappable backing store, and in particular
>> # the shmem_readpage() which depends upon tmpfs
>> select SHMEM
>
>I presume this is part of making INTERVAL_TREE unconditional; should be
>a separate patch, no?
Ok, I can separate it.
>> +/*
>> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
>> + */
>> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
>
>That's a strange limit, what's wrong with ~0UL ?
Nothing, I was under the impression I had updated that.
>> +
>> +struct range_rwlock {
>> + struct interval_tree_node node;
>> + struct task_struct *task;
>> + /* Number of ranges which are blocking acquisition of the lock */
>> + unsigned int blocking_ranges;
>> + bool reader;
>> +};
>
>Hate the name; our rwlock is a spinlock, therefore this thing suggests
>it is too.
Hmmm not sure of a better name, kinda liked it. I'll see what I can come
up with.
>
>Also, no bool in structures.
Ok, but is that because sizeof(bool) can be undefined? In any case, I'm thinking
we could skip the 'reader' member entirely and tag the task pointer (similar to
what we do for other locks).
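As a sketch of that pointer-tagging idea (hypothetical helper names, assuming task_struct pointers are at least 2-byte aligned so the low bit is free):

```c
#include <assert.h>
#include <stdint.h>

/* Encode the reader flag in the low bit of the owner pointer,
 * removing the need for a separate 'bool reader' member. */
#define RANGE_READER_BIT 1UL

static inline void *task_tag_reader(void *task)
{
	return (void *)((uintptr_t)task | RANGE_READER_BIT);
}

static inline int task_is_reader(void *tagged)
{
	return (uintptr_t)tagged & RANGE_READER_BIT;
}

/* Strip the tag before dereferencing the pointer. */
static inline void *task_untag(void *tagged)
{
	return (void *)((uintptr_t)tagged & ~RANGE_READER_BIT);
}
```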
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (2 preceding siblings ...)
2017-03-29 8:56 ` Peter Zijlstra
@ 2017-03-29 8:59 ` Peter Zijlstra
2017-03-29 9:08 ` Peter Zijlstra
` (5 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 8:59 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
Was your editor broken? Those lines are > 80 for no reason whatsoever.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (3 preceding siblings ...)
2017-03-29 8:59 ` Peter Zijlstra
@ 2017-03-29 9:08 ` Peter Zijlstra
2017-03-29 15:05 ` Davidlohr Bueso
2017-03-29 9:11 ` Peter Zijlstra
` (4 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 9:08 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
> +#define DEFINE_RANGE_RWLOCK_INF(name) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +void range_rwlock_init_inf(struct range_rwlock *lock);
Aye, I'm a pedant, but that's a very small infinity. I always thought
infinity wasn't enumerable.
Can we think of a different name here? 'whole' / 'all' / 'full' ?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 9:08 ` Peter Zijlstra
@ 2017-03-29 15:05 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Wed, 29 Mar 2017, Peter Zijlstra wrote:
>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
>
>> +#define DEFINE_RANGE_RWLOCK_INF(name) \
>> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
>
>> +void range_rwlock_init_inf(struct range_rwlock *lock);
>
>Aye, I'm a pedant, but that's a very small infinity. I always thought
>infinity wasn't enumerable.
:-)
>
>Can we think of a different name here? 'whole' / 'all' / 'full' ?
Yeah, I guess 'all' is more suitable.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (4 preceding siblings ...)
2017-03-29 9:08 ` Peter Zijlstra
@ 2017-03-29 9:11 ` Peter Zijlstra
2017-03-29 9:44 ` Peter Zijlstra
` (3 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 9:11 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader.
Not a word about fairness and starvation... Such important topics for
lock primitives.
> + * In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
Maybe that ought not be here, doesn't seem like a fundamental design
point and would thus be better suited for a comment near where this
implementation detail is located ?
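For reference, the O(1) min/max fastpath the quoted comment describes can be sketched in userspace like this (hypothetical field names, not the actual patch):

```c
#include <stdbool.h>
#include <assert.h>

/* Keep the smallest start and largest last of all ranges in the
 * tree, so the common non-intersecting case can bail out without
 * walking the interval tree at all. */
struct toy_tree { unsigned long min_start, max_last; bool empty; };
struct toy_range { unsigned long start, last; };

static bool range_intersects_intree(const struct toy_tree *t,
				    const struct toy_range *r)
{
	if (t->empty)
		return false;
	/* no overlap when entirely below the min or above the max */
	return !(r->last < t->min_start || r->start > t->max_last);
}
```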
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (5 preceding siblings ...)
2017-03-29 9:11 ` Peter Zijlstra
@ 2017-03-29 9:44 ` Peter Zijlstra
2017-03-29 10:35 ` Peter Zijlstra
` (2 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 9:44 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
for (node = interval_tree_iter_first(); node;
node = interval_tree_iter_next()) {
Or some interval_tree_for_each helper?
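Such a helper could look like the toy model below, where array-backed stand-ins replace the real lib/interval_tree.c iterators (the macro name is Peter's suggestion, not an existing kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for interval_tree_iter_first/next(), just to
 * exercise the macro shape: entries whose [start,last] overlaps
 * the query range are "intersecting". */
struct itnode { unsigned long start, last; };

static struct itnode nodes[] = { { 0, 10 }, { 5, 20 }, { 30, 40 } };
static const size_t nnodes = 3;

static struct itnode *iter_from(size_t idx, unsigned long start,
				unsigned long last)
{
	for (; idx < nnodes; idx++)
		if (nodes[idx].start <= last && nodes[idx].last >= start)
			return &nodes[idx];
	return NULL;
}

static struct itnode *toy_iter_first(unsigned long start, unsigned long last)
{
	return iter_from(0, start, last);
}

static struct itnode *toy_iter_next(struct itnode *node, unsigned long start,
				    unsigned long last)
{
	return iter_from((size_t)(node - nodes) + 1, start, last);
}

/* The suggested helper: callers can no longer forget the advance step. */
#define toy_interval_tree_for_each(node, start, last)		\
	for (node = toy_iter_first(start, last); node;		\
	     node = toy_iter_next(node, start, last))

static int count_intersections(unsigned long start, unsigned long last)
{
	struct itnode *node;
	int n = 0;

	toy_interval_tree_for_each(node, start, last)
		n++;
	return n;
}
```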
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
Can this @blocked range now go EINTR and remove itself from
wait_for_ranges()? If so, who will decrement our blocking_ranges?
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (6 preceding siblings ...)
2017-03-29 9:44 ` Peter Zijlstra
@ 2017-03-29 10:35 ` Peter Zijlstra
2017-03-29 10:49 ` Peter Zijlstra
2017-03-30 14:56 ` Laurent Dufour
9 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 10:35 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +EXPORT_SYMBOL(range_rwlock_init);
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +EXPORT_SYMBOL(range_read_lock);
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +EXPORT_SYMBOL(range_read_lock_killable);
> +EXPORT_SYMBOL(range_read_trylock);
> +EXPORT_SYMBOL(range_read_unlock);
> +EXPORT_SYMBOL(range_write_lock);
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +EXPORT_SYMBOL(range_write_lock_killable);
> +EXPORT_SYMBOL(range_write_trylock);
> +EXPORT_SYMBOL(range_write_unlock);
> +EXPORT_SYMBOL(range_downgrade_write);
Is there a good reason this isn't _GPL ?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (7 preceding siblings ...)
2017-03-29 10:35 ` Peter Zijlstra
@ 2017-03-29 10:49 ` Peter Zijlstra
2017-03-29 15:14 ` Davidlohr Bueso
2017-03-30 14:56 ` Laurent Dufour
9 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2017-03-29 10:49 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
Another thing; this implementation lacks lockdep annotations and has 0
other debugging features.
Is there something that makes the normal rwsem lockdep annotation not
work for this?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-29 10:49 ` Peter Zijlstra
@ 2017-03-29 15:14 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-29 15:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, akpm, jack, kirill.shutemov, mhocko, mgorman,
linux-kernel, Davidlohr Bueso
On Wed, 29 Mar 2017, Peter Zijlstra wrote:
>On Mon, Mar 06, 2017 at 09:03:26PM -0800, Davidlohr Bueso wrote:
>> +static __always_inline int
>> +__range_read_lock_common(struct range_rwlock_tree *tree,
>> + struct range_rwlock *lock, long state)
>> +{
>> + struct interval_tree_node *node;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&tree->lock, flags);
>> + lock->reader = true;
>> +
>> + if (!__range_intersects_intree(tree, lock))
>> + goto insert;
>> +
>> + node = interval_tree_iter_first(&tree->root, lock->node.start,
>> + lock->node.last);
>> + while (node) {
>> + struct range_rwlock *blocked_lock;
>> + blocked_lock = range_entry(node, struct range_rwlock, node);
>> +
>> + if (!blocked_lock->reader)
>> + lock->blocking_ranges++;
>> + node = interval_tree_iter_next(node, lock->node.start,
>> + lock->node.last);
>> + }
>> +insert:
>> + __range_tree_insert(tree, lock);
>> +
>> + lock->task = current;
>> + spin_unlock_irqrestore(&tree->lock, flags);
>> +
>> + return wait_for_ranges(tree, lock, state);
>> +}
>
>Another thing; this implementation lacks lockdep annotations and has 0
>other debugging features.
True.
>Is there something that makes the normal rwsem lockdep annotation not
>work for this?
Not that I know of. Let me look into it.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-07 5:03 ` [PATCH 1/5] locking: " Davidlohr Bueso
` (8 preceding siblings ...)
2017-03-29 10:49 ` Peter Zijlstra
@ 2017-03-30 14:56 ` Laurent Dufour
2017-03-30 17:13 ` Davidlohr Bueso
9 siblings, 1 reply; 34+ messages in thread
From: Laurent Dufour @ 2017-03-30 14:56 UTC (permalink / raw)
To: Davidlohr Bueso, mingo, peterz, akpm
Cc: jack, kirill.shutemov, mhocko, mgorman, linux-kernel, Davidlohr Bueso
On 07/03/2017 06:03, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
>
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
>
> When new range lock is requested, we add its interval to the tree and store
> number of intervals intersecting it to 'blocking_ranges'. When a range is
> unlocked, we again walk intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up owner of any range lock whose
> 'blocking_ranges' drops to 0.
>
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
>
> How much does it cost:
> ----------------------
>
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
>
> Due to their sharable nature, full range locks can be compared with rw-semaphores,
> which also serve from a mutex standpoint, as writer-only situations are
> pretty similar nowadays.
>
> The first difference is memory footprint: rwsems are larger than tree locks, 40 vs
> 24 bytes, but the latter requires an additional 64 bytes of stack for the range
> structure.
>
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. The same scenario
> obviously applies to the unlock() case.
>
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as it's the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
>
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
>
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
>
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
>
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
>
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
>
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
>
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
>
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr: 1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr: 1022.5, stddev: 6.41
>
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr: 1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr: 1243.5, stddev: 30.40
>
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr: 1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr: 1398.0, stddev: 56.56
>
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> for both locks.
>
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
>
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
>
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
>
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
>
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
>
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
>
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
>
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr: 1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr: 1852.8, stddev: 147.5
>
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr: 1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr: 1856.5, stddev: 198.69
>
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr: 1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr: 1551.5, stddev: 361.33
>
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have writer spin-on-owner feature.
>
> On both machines, when actually testing threads acquiring different ranges,
> throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
> drivers/gpu/drm/Kconfig | 2 -
> drivers/gpu/drm/i915/Kconfig | 1 -
> include/linux/range_rwlock.h | 96 +++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig | 14 --
> lib/Kconfig.debug | 1 -
> lib/Makefile | 2 +-
> 8 files changed, 560 insertions(+), 20 deletions(-)
> create mode 100644 include/linux/range_rwlock.h
> create mode 100644 kernel/locking/range_rwlock.c
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have an ATI Radeon graphics card. There
> are both PCI and AGP versions. You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have a recent AMD Radeon graphics card.
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
> depends on DRM
> depends on X86 && PCI
> select INTEL_GTT
> - select INTERVAL_TREE
> # we need shmfs for the swappable backing store, and in particular
> # the shmem_readpage() which depends upon tmpfs
> select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
> +
> +struct range_rwlock {
> + struct interval_tree_node node;
> + struct task_struct *task;
> + /* Number of ranges which are blocking acquisition of the lock */
> + unsigned int blocking_ranges;
> + bool reader;
> +};
> +
> +struct range_rwlock_tree {
> + struct rb_root root;
> + spinlock_t lock;
> + struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name) \
> + { .leftmost = NULL \
> + , .root = RB_ROOT \
> + , .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> + struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) { \
> + .node = { \
> + .start = (start) \
> + ,.last = (last) \
> + } \
> + }
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> + tree->root = RB_ROOT;
> + spin_lock_init(&tree->lock);
> + tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> + unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
> # and is generally not a function of system call inputs.
> KCOV_INSTRUMENT := n
>
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep interval tree of locked and to-be-locked ranges. When new range lock
> + * is requested, we add its interval to the tree and store number of intervals
> + * intersecting it to 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + * a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *root;
> +
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + return false;
> +
> + root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> + return lock->node.start <= root->__subtree_last &&
> + tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + tree->leftmost = &lock->node;
> + else if (lock->node.start < tree->leftmost->start)
> + tree->leftmost = &lock->node;
> +
> + interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (tree->leftmost == &lock->node) {
> + struct rb_node *next = rb_next(&tree->leftmost->rb);
> + tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> + }
> +
> + interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + WARN_ON(start > last);
> +
> + lock->node.start = start;
> + lock->node.last = last;
> + RB_CLEAR_NODE(&lock->node.rb);
> + lock->blocking_ranges = 0;
> + lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + __range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> + __range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> + if (!--lock->blocking_ranges)
> + wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + int ret = 0;
> +
> + while (true) {
> + set_current_state(state);
> +
> + if (unlikely(signal_pending_state(state, current))) {
> + unsigned long flags;
> +
> + ret = -EINTR;
> + /*
> + * We're not taking the lock after all, cleanup
> + * after ourselves.
> + */
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> + spin_unlock_irqrestore(&tree->lock, flags);
> + break;
> + }
> +
> + /* do we need to go to sleep? */
> + if (!lock->blocking_ranges)
> + break;
Hi Davidlohr,
While building a kernel on top of a patched kernel using the full range
lock in place of mmap_sem, I found that fork() sometimes failed,
returning ENOMEM.
It happens that if fork() gets called at the time a signal is sent to
the calling process, the call to range_write_lock_interruptible() fails
even if there is no contention on the lock. This is because we check
for a pending signal before checking for lock contention in
wait_for_ranges().
The loop here should rather be:

	while (true) {
		/* do we need to go to sleep? */
		if (!lock->blocking_ranges)
			break;

		if (unlikely(signal_pending_state(state, current))) {
			unsigned long flags;

			ret = -EINTR;
			/*
			 * We're not taking the lock after all, cleanup
			 * after ourselves.
			 */
			spin_lock_irqsave(&tree->lock, flags);
			lock->reader = false;
			__range_tree_remove(tree, lock);
			spin_unlock_irqrestore(&tree->lock, flags);
			break;
		}

		set_current_state(state);
		schedule();
	}
I also moved the call to set_current_state() before calling schedule();
I'm not sure it has to be done this way, but my system seems to work
fine like this.
Laurent.
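Laurent's point can be illustrated with a small userspace model (hypothetical code, not part of the patch): wait_signal_first() mimics the posted ordering and wait_contention_first() the proposed one, with the blocking_ranges decrement standing in for schedule() followed by a wakeup.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model of the two loop orderings in
 * wait_for_ranges().  'blocking_ranges' counts conflicting ranges and
 * 'signal_pending' models signal_pending_state().  Returns 0 when the
 * lock would be acquired, -4 (-EINTR) when the task would back out.
 */

/* Posted ordering: the signal is checked before lock contention. */
static int wait_signal_first(unsigned int blocking_ranges, bool signal_pending)
{
	while (true) {
		if (signal_pending)
			return -4;	/* -EINTR: fails even when uncontended */
		if (!blocking_ranges)
			return 0;
		blocking_ranges--;	/* stand-in for schedule() + wakeup */
	}
}

/* Proposed ordering: contention is checked first. */
static int wait_contention_first(unsigned int blocking_ranges, bool signal_pending)
{
	while (true) {
		if (!blocking_ranges)
			return 0;	/* uncontended: acquire regardless of signal */
		if (signal_pending)
			return -4;	/* -EINTR */
		blocking_ranges--;	/* stand-in for schedule() + wakeup */
	}
}
```

With a signal pending and no conflicting ranges, the posted ordering returns -EINTR (the spurious fork() failure), while the proposed one acquires the lock.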
> +
> + schedule();
> + }
> +
> + __set_current_state(TASK_RUNNING);
> + return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int ret = true;
> + unsigned long flags;
> + struct interval_tree_node *node;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + /*
> + * We have overlapping ranges in the tree, ensure that we can
> + * in fact share the lock.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader) {
> + ret = false;
> + goto unlock;
> + }
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + lock->reader = true;
> + __range_tree_insert(tree, lock);
> +unlock:
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + /*
> + * As a writer, we always consider an existing node. We
> + * need to block; either the intersecting node is another
> + * writer or we have a reader that needs to finish.
> + */
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int intersects;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + intersects = __range_intersects_intree(tree, lock);
> +
> + if (!intersects) {
> + lock->reader = false;
> + __range_tree_insert(tree, lock);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + range_rwlock_unblock(blocked_lock, &wake_q);
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + unsigned long flags;
> + struct interval_tree_node *node;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + WARN_ON(lock->reader);
> +
> + /*
> + * Unaccount for any blocked reader lock. Wakeup if possible.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + lock->reader = true;
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
> config BTREE
> bool
>
> -config INTERVAL_TREE
> - bool
> - help
> - Simple, embeddable, interval-tree. Can find the start of an
> - overlapping range in log(n) time and then iterate over all
> - overlapping nodes. The algorithm is implemented as an
> - augmented rbtree.
> -
> - See:
> -
> - Documentation/rbtree.txt
> -
> - for more information.
> -
> config RADIX_TREE_MULTIORDER
> bool
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
> config INTERVAL_TREE_TEST
> tristate "Interval tree test"
> depends on m && DEBUG_KERNEL
> - select INTERVAL_TREE
> help
> A benchmark measuring the performance of the interval tree library
>
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
> obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
>
> obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
> obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
> obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
> obj-$(CONFIG_DEBUG_LIST) += list_debug.o
>
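For reference, the fastpath test in __range_intersects_intree() above reduces to the classic closed-interval overlap predicate; a minimal userspace sketch (illustrative only):

```c
#include <stdbool.h>

/*
 * Closed intervals A = [a0, a1] and B = [b0, b1] overlap iff
 * a0 <= b1 && b0 <= a1.  In the patch, B is the synthetic range
 * spanning the tree's smallest 'start' (tree->leftmost) and largest
 * 'last' (the root's __subtree_last), so a false result proves the
 * lock range conflicts with nothing in the tree.
 */
static bool ranges_intersect(unsigned long a0, unsigned long a1,
			     unsigned long b0, unsigned long b1)
{
	return a0 <= b1 && b0 <= a1;
}
```

Note that the endpoints are inclusive: [0, 5] and [5, 9] do intersect.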
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/5] locking: Introduce range reader/writer lock
2017-03-30 14:56 ` Laurent Dufour
@ 2017-03-30 17:13 ` Davidlohr Bueso
0 siblings, 0 replies; 34+ messages in thread
From: Davidlohr Bueso @ 2017-03-30 17:13 UTC (permalink / raw)
To: Laurent Dufour
Cc: Davidlohr Bueso, mingo, peterz, akpm, jack, kirill.shutemov,
mhocko, mgorman, linux-kernel
On 2017-03-30 07:56, Laurent Dufour wrote:
> On 07/03/2017 06:03, Davidlohr Bueso wrote:
>> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
>> + struct range_rwlock *lock, long state)
>> +{
>> + int ret = 0;
>> +
>> + while (true) {
>> + set_current_state(state);
>> +
>> + if (unlikely(signal_pending_state(state, current))) {
>> + unsigned long flags;
>> +
>> + ret = -EINTR;
>> + /*
>> + * We're not taking the lock after all, cleanup
>> + * after ourselves.
>> + */
>> + spin_lock_irqsave(&tree->lock, flags);
>> + lock->reader = false;
>> + __range_tree_remove(tree, lock);
>> + spin_unlock_irqrestore(&tree->lock, flags);
>> + break;
>> + }
>> +
>> + /* do we need to go to sleep? */
>> + if (!lock->blocking_ranges)
>> + break;
>
> Hi Davidlohr,
>
> While building a kernel on top of a patched kernel using the full
> range lock in place of mmap_sem, I found that fork() sometimes
> failed, returning ENOMEM.
> It happens that if fork() gets called at the time a signal is sent to
> the calling process, the call to range_write_lock_interruptible()
> fails even if there is no contention on the lock. This is because we
> check for a pending signal before checking for lock contention in
> wait_for_ranges().
>
> The loop here should rather be:
>
> 	while (true) {
> 		/* do we need to go to sleep? */
> 		if (!lock->blocking_ranges)
> 			break;
>
> 		if (unlikely(signal_pending_state(state, current))) {
> 			unsigned long flags;
>
> 			ret = -EINTR;
> 			/*
> 			 * We're not taking the lock after all, cleanup
> 			 * after ourselves.
> 			 */
> 			spin_lock_irqsave(&tree->lock, flags);
> 			lock->reader = false;
> 			__range_tree_remove(tree, lock);
> 			spin_unlock_irqrestore(&tree->lock, flags);
> 			break;
> 		}
>
> 		set_current_state(state);
> 		schedule();
> 	}
>
> I also moved the call to set_current_state() before calling schedule(),
> not sure this has to be done this way, but my system seems to work fine
> like this.
No, we do not hold any locks. Please keep set_current_state() the very
first thing we do in the loop. You can check
Documentation/memory-barriers.txt for details :-)
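The lost-wakeup hazard behind this rule can be shown with a deterministic userspace model (hypothetical, not kernel code): wake_up() resets a task's state to RUNNING, schedule() only blocks while the state still announces intent to sleep, and the adversarial waker runs right after the condition check.

```c
#include <stdbool.h>

enum task_state { RUNNING, WANT_SLEEP };

struct task { enum task_state state; };

/* wake_up(): puts the task back to RUNNING. */
static void wake_up(struct task *t) { t->state = RUNNING; }

/* schedule(): blocks only if the task still announces intent to sleep. */
static bool schedule_blocks(struct task *t) { return t->state == WANT_SLEEP; }

/*
 * Returns true if the waiter would sleep forever (lost wakeup) when the
 * waker fires between the condition check and schedule().
 */
static bool hangs(bool set_state_first)
{
	struct task t = { RUNNING };
	bool condition = false;
	bool need_sleep;

	if (set_state_first)
		t.state = WANT_SLEEP;	/* set_current_state(state) first */

	need_sleep = !condition;	/* "do we need to go to sleep?" */

	/* adversary: the waker runs right here */
	condition = true;
	wake_up(&t);			/* no-op if we haven't announced yet */

	if (!set_state_first)
		t.state = WANT_SLEEP;	/* buggy: announced after the check */

	return need_sleep && schedule_blocks(&t);
}
```

Setting the state first means the concurrent wake_up() flips us back to RUNNING and schedule() falls through; setting it after the condition check lets the wakeup slip by unnoticed.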
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 34+ messages in thread