From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S933876AbdDFJCC (ORCPT <rfc822;w@1wt.eu>);
        Thu, 6 Apr 2017 05:02:02 -0400
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:38547 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
        by vger.kernel.org with ESMTP id S933801AbdDFJBv (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 6 Apr 2017 05:01:51 -0400
Subject: Re: [PATCH 2/6] locking: Introduce range reader/writer lock
To: Davidlohr Bueso <dave@stgolabs.net>, mingo@kernel.org,
        peterz@infradead.org, akpm@linux-foundation.org
References: <20170406084620.22700-1-dave@stgolabs.net>
 <20170406084620.22700-3-dave@stgolabs.net>
Cc: jack@suse.cz, kirill.shutemov@linux.intel.com, mhocko@suse.com,
        mgorman@techsingularity.net, linux-kernel@vger.kernel.org,
        Davidlohr Bueso <dbueso@suse.de>
From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Date: Thu, 6 Apr 2017 11:01:39 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <20170406084620.22700-3-dave@stgolabs.net>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-TM-AS-GCONF: 00
x-cbid: 17040609-0040-0000-0000-0000037E5D78
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17040609-0041-0000-0000-00001F957514
Message-Id: <2a5c78b9-23ac-3b81-71ee-6a88b965f859@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,, definitions=2017-04-06_07:,,
 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 suspectscore=0
 malwarescore=0 phishscore=0 adultscore=0 bulkscore=0 classifier=spam
 adjust=0 reason=mlx scancount=1 engine=8.0.1-1702020001
 definitions=main-1704060076
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/04/2017 10:46, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with the tree itself and the range to be locked, normally
> stack allocated and always explicitly prepared/initialized by the user in a
> [a0, a1] a0 <= a1 sorted manner, before actually taking the lock.
> 
> An interval tree of locked and to-be-locked ranges is kept. When a new range
> lock is requested, we add its interval to the tree and store number of
> intervals intersecting it to 'blocking_ranges'. For the reader case,
> 'blocking_ranges' is only accounted for if the intersecting range is
> marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> guarantee that task is blocked until there are no overlapping ranges in the
> tree.
> 
> When a range is unlocked, we again walk intervals that overlap with the
> unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> therefore relies on the order of the interval tree  -- as opposed to a
> more traditional fifo mechanism. There is no lock stealing either, which
> prevents starvation and guarantees fairness.
> 
> How much does it cost:
> ----------------------
> 
> The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> is total number of ranges and R_int is the number of ranges intersecting the
> new range range to be added.
> 
> Due to its sharable nature, full range locks can be compared with rw-sempahores,
> which also serves from a mutex standpoint as writer-only situations are
> pretty similar nowadays.
> 
> The first is the memory footprint, tree locks are larger than rwsems: 80 vs
> 40 bytes, and also requires an additional 72 bytes of stack for the range
> structure.
> 
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least have an interval_tree_insert() and spinlock lock+unlock
> overhead compared to a single atomic insn in the case of rwsems. Similar scenario
> obviously for the unlock() case.
> 
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers as its the first kthread that is
> created. The following shows the avg ops/minute with various thread-setups on
> boxes with small and large core-counts.
> 
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
> 
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
> 
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
> 
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
> 
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
> 
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
> 
> For the read/write-only cases, there is very little difference between the range lock
> and rwsems, with up to a 3% hit, which could very well be considered in the noise range.
> 
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr:  1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr:  1022.5, stddev: 6.41
> 
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr:  1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr:  1243.5, stddev: 30.40
> 
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr:  1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr:  1398.0, stddev: 56.56
> 
> While a single reader and writer threads does not show must difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while the number of reader throughput is quite similar
> among both locks.
> 
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
> 
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
> 
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
> 
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
> 
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
> 
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
> 
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
> 
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr:  1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr:  1852.8, stddev: 147.5
> 
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr:  1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr:  1856.5, stddev: 198.69
> 
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr:  1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr:  1551.5, stddev: 361.33
> 
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4 core machine, however, reader threads can increase the number of
> acquisitions in up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have writer spin-on-owner feature.
> 
> On both machines when actually testing threads acquiring different ranges, the
> amount of throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such microbenchmarks that merely
> pounds on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
> 
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/range_rwlock.h  | 115 +++++++++
>  kernel/locking/Makefile       |   2 +-
>  kernel/locking/range_rwlock.c | 554 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 670 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/range_rwlock.h
>  create mode 100644 kernel/locking/range_rwlock.c
> 
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..0a8dad62d097
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,115 @@
> +/*
> + * Range/interval rw-locking
> + * -------------------------
> + *
> + * An interval tree of locked and to-be-locked ranges is kept. When a new range
> + * lock is requested, we add its interval to the tree and store number of
> + * intervals intersecting it to 'blocking_ranges'. For the reader case,
> + * 'blocking_ranges' is only accounted for if the intersecting range is
> + * marked as a writer. To achieve mutual exclusion of arbitrary ranges, we
> + * guarantee that task is blocked until there are no overlapping ranges in the
> + * tree.
> + *
> + * When a range is unlocked, we again walk intervals that overlap with the
> + * unlocked one and decrement their 'blocking_ranges'. Naturally, we wake up
> + * owner of any range lock whose 'blocking_ranges' drops to 0. Wakeup order
> + * therefore relies on the order of the interval tree  -- as opposed to a
> + * more traditional fifo mechanism. There is no lock stealing either, which
> + * prevents starvation and guarantees fairness.
> + *
> + * The cost of lock and unlock of a range is O((1+R_int)log(R_all)) where R_all
> + * is total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_FULL].
> + */
> +#define RANGE_RWLOCK_FULL  ~0UL
> +
> +struct range_rwlock {
> +	struct interval_tree_node node;
> +	struct task_struct *waiter;
> +	/* Number of ranges which are blocking acquisition of the lock */
> +	unsigned int blocking_ranges;
> +	u64 seqnum;
> +};
> +
> +struct range_rwlock_tree {
> +	struct rb_root root;
> +	spinlock_t lock;
> +	struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +	u64 seqnum; /* track order of incoming ranges, avoid overflows */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
> +	{ .leftmost = NULL                              \
> +	, .root = RB_ROOT				\
> +	, .seqnum = 0					\
> +	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> +       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(__start, __last) {	\
> +		.node = {			\
> +			.start = (__start)	\
> +			,.last = (__last)	\
> +		}				\
> +		, .task = NULL			\
> +		, .blocking_ranges = 0		\
> +		, .reader = false		\
> +		, .seqnum = 0			\
> +	}
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last)				\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_FULL(name)		\
> +	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_FULL)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> +	tree->root = RB_ROOT;
> +	spin_lock_init(&tree->lock);
> +	tree->leftmost = NULL;
> +	tree->seqnum = 0;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> +		       unsigned long start, unsigned long last);
> +void range_rwlock_init_full(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> +			     struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> +				   struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> +			      struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> +			   struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT		:= n
> 
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..550ad360d87a
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,554 @@
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +#define range_interval_tree_foreach(node, root, start, last)	\
> +	for (node = interval_tree_iter_first(root, start, last); \
> +	     node; node = interval_tree_iter_next(node, start, last))
> +
> +/*
> + * Fastpath range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + *        a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the later, we rely on the root node,
> + * which by augmented interval tree property, holds the largest value in
> + * its last-in-subtree. This allows mitigating some of the tree walk overhead
> + * for non-intersecting ranges, maintained and consulted in O(1).
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	struct interval_tree_node *root;
> +
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> +		return false;
> +
> +	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> +	return lock->node.start <= root->__subtree_last &&
> +		tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (unlikely(RB_EMPTY_ROOT(&tree->root)) ||
> +	    lock->node.start < tree->leftmost->start)
> +		tree->leftmost = &lock->node;
> +
> +	lock->seqnum = tree->seqnum++;
> +	interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> +	if (tree->leftmost == &lock->node) {
> +		struct rb_node *next = rb_next(&tree->leftmost->rb);
> +		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> +	}
> +
> +	interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +/*
> + * lock->waiter reader tracking.
> + */
> +#define RANGE_FLAG_READER	1UL
> +
> +static inline struct task_struct *range_lock_waiter(struct range_rwlock *lock)
> +{
> +	return (struct task_struct *)
> +		((unsigned long) lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_set_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter | RANGE_FLAG_READER);
> +}
> +
> +static inline void range_lock_clear_reader(struct range_rwlock *lock)
> +{
> +	lock->waiter = (struct task_struct *)
> +		((unsigned long)lock->waiter & ~RANGE_FLAG_READER);
> +}
> +
> +static inline bool range_lock_is_reader(struct range_rwlock *lock)
> +{
> +	return (unsigned long) lock->waiter & RANGE_FLAG_READER;
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock,
> +		    unsigned long start, unsigned long last)
> +{
> +	WARN_ON(start > last);
> +
> +	lock->node.start = start;
> +	lock->node.last = last;
> +	RB_CLEAR_NODE(&lock->node.rb);
> +	lock->blocking_ranges = 0;
> +	lock->waiter = NULL;
> +	lock->seqnum = 0;
> +}
> +
> +/**
> + * range_rwlock_init - Initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range's [start, last] such that it can
> + * later be locked. User is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> +		     unsigned long last)
> +{
> +	__range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_full - Initialize a full range lock
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_full(struct range_rwlock *lock)
> +{
> +	__range_rwlock_init(lock, 0, RANGE_RWLOCK_FULL);
> +}
> +EXPORT_SYMBOL_GPL(range_rwlock_init_full);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> +	if (!--lock->blocking_ranges)
> +		wake_q_add(wake_q, range_lock_waiter(lock));
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> +				  struct range_rwlock *lock, long state)
> +{
> +	int ret = 0;
> +
> +	while (true) {
> +		set_current_state(state);
> +
> +		/* do we need to go to sleep? */
> +		if (!lock->blocking_ranges)
> +			break;
> +
> +		if (unlikely(signal_pending_state(state, current))) {
> +			struct interval_tree_node *node;
> +			unsigned long flags;
> +			DEFINE_WAKE_Q(wake_q);
> +
> +			ret = -EINTR;
> +			/*
> +			 * We're not taking the lock after all, cleanup
> +			 * after ourselves.
> +			 */
> +			spin_lock_irqsave(&tree->lock, flags);
> +
> +			range_lock_clear_reader(lock);
> +			__range_tree_remove(tree, lock);
> +
> +			if (!__range_intersects_intree(tree, lock))
> +				goto unlock;
> +
> +			range_interval_tree_foreach(node, &tree->root,
> +						    lock->node.start,
> +						    lock->node.last) {
> +				struct range_rwlock *blked;
> +				blked = range_entry(node,
> +						    struct range_rwlock, node);
> +
> +				if (range_lock_is_reader(lock) &&
> +				    range_lock_is_reader(blked))
> +					continue;
> +
> +				/* unaccount for threads _we_ are blocking */
> +				if (lock->seqnum < blked->seqnum)
> +					range_rwlock_unblock(blked, &wake_q);

How is 'seqnum' wrapping handled here ?
I'd rather see something like time_before() here, isn't it ?

Cheers,
Laurent.