linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] RFC improvements to radix-tree related to DAX
@ 2016-02-28  5:09 NeilBrown
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: NeilBrown @ 2016-02-28  5:09 UTC (permalink / raw)
  To: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

Hi,

While pondering some issues with DAX and how it uses the radix tree I
conceived the following patches.  I don't know if they'll be useful
but I thought I would post them in case they are helpful.

The first is quite independent of the others - it removes some DAX
specific #defines from radix-tree.h which is a generic ADT.

The second makes an extra bit available when exception data is
stored in the radix tree.

The third uses this bit to provide a sleeping lock.  With this
it should be possible to delete exceptional entries from the radix
tree in a race-free way without external locking.
Like the page lock it requires an external set of wait_queue_heads.
The same ones used for page_lock would be suitable.

Note that this code is only compile tested.

NeilBrown


---

NeilBrown (3):
      DAX: move RADIX_DAX_ definitions to dax.c
      radix-tree: make 'indirect' bit available to exception entries.
      radix-tree: support locking of individual exception entries.


 fs/dax.c                   |    9 ++
 include/linux/radix-tree.h |   28 +++++---
 lib/radix-tree.c           |  160 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 185 insertions(+), 12 deletions(-)

--
Signature

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c
  2016-02-28  5:09 [PATCH 0/3] RFC improvements to radix-tree related to DAX NeilBrown
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
  2016-02-28  5:09 ` [PATCH 2/3] radix-tree: make 'indirect' bit available to " NeilBrown
@ 2016-02-28  5:09 ` NeilBrown
  2016-02-29 14:28   ` Wilcox, Matthew R
  2 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-02-28  5:09 UTC (permalink / raw)
  To: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dax.c                   |    9 +++++++++
 include/linux/radix-tree.h |    9 ---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 711172450da6..9c4d697fb6fc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f54be7082207..968150ab8a1c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,15 +51,6 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
-#define RADIX_DAX_MASK	0xf
-#define RADIX_DAX_SHIFT	4
-#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] radix-tree: make 'indirect' bit available to exception entries.
  2016-02-28  5:09 [PATCH 0/3] RFC improvements to radix-tree related to DAX NeilBrown
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
@ 2016-02-28  5:09 ` NeilBrown
  2016-02-29 14:41   ` Wilcox, Matthew R
  2016-02-28  5:09 ` [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c NeilBrown
  2 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-02-28  5:09 UTC (permalink / raw)
  To: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

A pointer to a radix_tree_node will always have the 'exception'
bit cleared, so if the exception bit is set the value cannot
be an indirect pointer.  Thus it is safe to make the 'indirect bit'
available to store extra information in exception entries.

This patch adds a 'PTR_MASK' and a value is only treated as
an indirect (pointer) entry the 2 ls-bits are '01'.

The change in radix-tree.c ensures the stored value still looks like an
indirect pointer, and saves a load as well.

We could swap the two bits and so keep all the exectional bits contigious.
But I have other plans for that bit....

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/radix-tree.h |   11 +++++++++--
 lib/radix-tree.c           |    2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 968150ab8a1c..450c12b546b7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -40,8 +40,13 @@
  * Indirect pointer in fact is also used to tag the last pointer of a node
  * when it is shrunk, before we rcu free the node. See shrink code for
  * details.
+ *
+ * To allow an exception entry to only lose one bit, we ignore
+ * the INDIRECT bit when the exception bit is set.  So an entry is
+ * indirect if the least significant 2 bits are 01.
  */
 #define RADIX_TREE_INDIRECT_PTR		1
+#define RADIX_TREE_INDIRECT_MASK	3
 /*
  * A common use of the radix tree is to store pointers to struct pages;
  * but shmem/tmpfs needs also to store swap entries in the same tree:
@@ -53,7 +58,8 @@
 
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
+	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
+		== RADIX_TREE_INDIRECT_PTR;
 }
 
 /*** radix-tree API starts here ***/
@@ -221,7 +227,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
  */
 static inline int radix_tree_deref_retry(void *arg)
 {
-	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
+			== RADIX_TREE_INDIRECT_PTR);
 }
 
 /**
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6b79e9026e24..37d4643ab5c0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1305,7 +1305,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 		 * to force callers to retry.
 		 */
 		if (root->height == 0)
-			*((unsigned long *)&to_free->slots[0]) |=
+			*((unsigned long *)&to_free->slots[0]) =
 						RADIX_TREE_INDIRECT_PTR;
 
 		radix_tree_node_free(to_free);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-02-28  5:09 [PATCH 0/3] RFC improvements to radix-tree related to DAX NeilBrown
@ 2016-02-28  5:09 ` NeilBrown
  2016-02-28  5:30   ` kbuild test robot
                     ` (2 more replies)
  2016-02-28  5:09 ` [PATCH 2/3] radix-tree: make 'indirect' bit available to " NeilBrown
  2016-02-28  5:09 ` [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c NeilBrown
  2 siblings, 3 replies; 15+ messages in thread
From: NeilBrown @ 2016-02-28  5:09 UTC (permalink / raw)
  To: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

The least significant bit of an exception entry is used as a lock flag.
A caller can:
 - create a locked entry by simply adding an entry with this flag set
 - lock an existing entry with radix_tree_lookup_lock().  This may return
    NULL if the entry doesn't exists, or was deleted while waiting for
    the lock.  It may return a non-exception entry if that is what is
    found.  If it returns a locked entry then it has exclusive rights
    to delete the entry.
 - unlock an entry that is already locked.  This will wake any waiters.
 - delete an entry that is locked.  This will wake waiters so that they
   return NULL without looking at the slot in the radix tree.

These must all be called with the radix tree locked (i.e. a spinlock held).
That spinlock is passed to radix_tree_lookup_lock() so that it can drop
the lock while waiting.

This is a "demonstration of concept".  I haven't actually tested, only compiled.
A possible use case is for the exception entries used by DAX.

It is possible that some of the lookups can be optimised away in some
cases by storing a slot pointer.  I wanted to keep it reasonable
simple until it was determined if it might be useful.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/radix-tree.h |    8 ++
 lib/radix-tree.c           |  158 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 166 insertions(+)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 450c12b546b7..8f579f66574b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -308,6 +308,14 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item);
 
+void *radix_tree_lookup_lock(struct radix_tree_root *root, wait_queue_head_t *wq,
+			     unsigned long index, spinlock_t *lock);
+void radix_tree_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
+		       unsigned long index);
+void radix_tree_delete_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
+			      unsigned long index);
+
+
 static inline void radix_tree_preload_end(void)
 {
 	preempt_enable();
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 37d4643ab5c0..a24ea002f3eb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1500,3 +1500,161 @@ void __init radix_tree_init(void)
 	radix_tree_init_maxindex();
 	hotcpu_notifier(radix_tree_callback, 0);
 }
+
+/* Exception entry locking.
+ * The least significant bit of an exception entry can be used as a
+ * "locked" flag.  Supported locking operations are:
+ * radix_tree_lookup_lock() - if the indexed entry exists, lock it and
+ *         return the value, else return NULL.  If the indexed entry is not
+ *         exceptional it is returned without locking.
+ * radix_tree_unlock() - release the lock on the indexed entry
+ * radix_tree_delete_unlock() - the entry must be locked.  It will be atomically
+ *     unlocked and removed.  Any threads sleeping in lookup_lock() will return.
+ * Each of these take a radix_tree_root, a wait_queue_head_t, and an index.
+ * The '*lock' function also takes a spinlock_t which must be held when any
+ * of the functions is called.  *lock will drop the spinlock while waiting for
+ * the entry lock.
+ *
+ * As delete_unlock could free the radix_tree_node, waiters much not touch it
+ * when woken.  We provide a wake function for the waitq which records when the
+ * item has been deleted.
+ *
+ * The wait_queue_head passed should be one that is used for bit_wait, such
+ * as zone->wait_table.  We re-use the 'flags' and 'timeout' fields of the
+ * wait_bit_key to store the root and index that we are waiting for.
+ * __wake_up may only be called on one of these keys while the radix tree
+ * is locked.  The wakeup function will take the lock itself if appropriate, or
+ * may record that the radix tree entry has been deleted.  In either case
+ * the waiting function just looks at the status reported by the wakeup function
+ * and doesn't look at the radix tree itself.
+ *
+ * There is no function for locking an entry while inserting it.  Simply
+ * insert an entry that is already marked as 'locked' - lsb set.
+ *
+ */
+
+struct wait_slot_queue {
+	struct radix_tree_root	*root;
+	unsigned long		index;
+	wait_queue_t		wait;
+	enum {SLOT_WAITING, SLOT_LOCKED, SLOT_GONE} state;
+	void			*ret;
+};
+
+static inline int slot_locked(void *v)
+{
+	unsigned long l = (unsigned long)v;
+	return l & 1;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= 1);
+}
+
+static inline void * unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~1UL);
+}
+
+static int wake_slot_function(wait_queue_t *wait, unsigned mode, int sync,
+			      void *arg)
+{
+	struct wait_bit_key *key = arg;
+	struct wait_slot_queue *wait_slot =
+		container_of(wait, struct wait_slot_queue, wait);
+	void **slot;
+
+	if (wait_slot->root != key->flags ||
+	    wait_slot->index != key->timeout)
+		/* Not waking this waiter */
+		return 0;
+	if (wait_slot->state != SLOT_WAITING)
+		/* Should be impossible.... */
+		return 1;
+	if (key->bit_nr == -3)
+		/* Was just deleted, no point in doing a lookup */
+		wait_slot = NULL;
+	else
+		wait_slot->ret = __radix_tree_lookup(
+			wait_slot->root, wait_slot->index, NULL, &slot);
+	if (!wait_slot->ret || !radix_tree_exceptional_entry(wait_slot->ret)) {
+		wait_slot->state = SLOT_GONE;
+		return 1;
+	}
+	if (slot_locked(slot))
+		/* still locked */
+		return 0;
+	wait_slot->ret = lock_slot(slot);
+	wait_slot->state = SLOT_LOCKED;
+	return 1;
+}
+
+void *radix_tree_lookup_lock(struct radix_tree_root *root, wait_queue_head_t *wq,
+			     unsigned long index, spinlock_t *lock)
+{
+	void *ret, **slot;
+	struct wait_slot_queue wait;
+
+	ret = __radix_tree_lookup(root, index, NULL, &slot);
+	if (!ret || !radix_tree_exceptional_entry(ret))
+		return ret;
+	if (!slot_locked(slot))
+		return lock_slot(slot);
+
+	wait.wait.private = current;
+	wait.wait.func = wake_slot_function;
+	INIT_LIST_HEAD(&wait.wait.task_list);
+	wait.state = SLOT_WAITING;
+	wait.root = root;
+	wait.index = index;
+	wait.ret = NULL;
+	for (;;) {
+		prepare_to_wait(wq, &wait.wait,
+				TASK_UNINTERRUPTIBLE);
+		if (wait.state != SLOT_WAITING)
+			break;
+
+		spin_unlock(lock);
+		schedule();
+		spin_lock(lock);
+	}
+	finish_wait(wq, &wait.wait);
+	return wait.ret;
+}
+EXPORT_SYMBOL(radix_tree_lookup_lock);
+
+void radix_tree_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
+			unsigned long index)
+{
+	void *ret, **slot;
+
+	ret = __radix_tree_lookup(root, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret)))
+		return;
+	if (WARN_ON_ONCE(!slot_locked(slot)))
+		return;
+	unlock_slot(slot);
+
+	if (waitqueue_active(wq)) {
+		struct wait_bit_key key = {.flags = root, .bit_nr = -2,
+					   .timeout = index};
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+EXPORT_SYMBOL(radix_tree_unlock);
+
+void radix_tree_delete_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
+			      unsigned long index)
+{
+	radix_tree_delete(root, index);
+	if (waitqueue_active(wq)) {
+		/* -3 here indicates deletion */
+		struct wait_bit_key key = {.flags = root, .bit_nr = -3,
+					   .timeout = index};
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+EXPORT_SYMBOL(radix_tree_delete_unlock);


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
@ 2016-02-28  5:30   ` kbuild test robot
  2016-02-28  6:27   ` NeilBrown
  2016-03-03 13:10   ` Jan Kara
  2 siblings, 0 replies; 15+ messages in thread
From: kbuild test robot @ 2016-02-28  5:30 UTC (permalink / raw)
  To: NeilBrown
  Cc: kbuild-all, Ross Zwisler, Matthew Wilcox, Andrew Morton,
	Jan Kara, linux-kernel, linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2763 bytes --]

Hi NeilBrown,

[auto build test ERROR on v4.5-rc5]
[also build test ERROR on next-20160226]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/NeilBrown/RFC-improvements-to-radix-tree-related-to-DAX/20160228-132214
config: i386-tinyconfig (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   lib/radix-tree.c: In function 'radix_tree_lookup_lock':
>> lib/radix-tree.c:1616:5: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
        TASK_UNINTERRUPTIBLE);
        ^
   lib/radix-tree.c:1616:5: note: each undeclared identifier is reported only once for each function it appears in
>> lib/radix-tree.c:1621:3: error: implicit declaration of function 'schedule' [-Werror=implicit-function-declaration]
      schedule();
      ^
   lib/radix-tree.c: In function 'radix_tree_unlock':
>> lib/radix-tree.c:1644:17: error: 'TASK_NORMAL' undeclared (first use in this function)
      __wake_up(wq, TASK_NORMAL, 1, &key);
                    ^
   lib/radix-tree.c: In function 'radix_tree_delete_unlock':
   lib/radix-tree.c:1657:17: error: 'TASK_NORMAL' undeclared (first use in this function)
      __wake_up(wq, TASK_NORMAL, 1, &key);
                    ^
   cc1: some warnings being treated as errors

vim +/TASK_UNINTERRUPTIBLE +1616 lib/radix-tree.c

  1610		wait.state = SLOT_WAITING;
  1611		wait.root = root;
  1612		wait.index = index;
  1613		wait.ret = NULL;
  1614		for (;;) {
  1615			prepare_to_wait(wq, &wait.wait,
> 1616					TASK_UNINTERRUPTIBLE);
  1617			if (wait.state != SLOT_WAITING)
  1618				break;
  1619	
  1620			spin_unlock(lock);
> 1621			schedule();
  1622			spin_lock(lock);
  1623		}
  1624		finish_wait(wq, &wait.wait);
  1625		return wait.ret;
  1626	}
  1627	EXPORT_SYMBOL(radix_tree_lookup_lock);
  1628	
  1629	void radix_tree_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
  1630				unsigned long index)
  1631	{
  1632		void *ret, **slot;
  1633	
  1634		ret = __radix_tree_lookup(root, index, NULL, &slot);
  1635		if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret)))
  1636			return;
  1637		if (WARN_ON_ONCE(!slot_locked(slot)))
  1638			return;
  1639		unlock_slot(slot);
  1640	
  1641		if (waitqueue_active(wq)) {
  1642			struct wait_bit_key key = {.flags = root, .bit_nr = -2,
  1643						   .timeout = index};
> 1644			__wake_up(wq, TASK_NORMAL, 1, &key);
  1645		}
  1646	}
  1647	EXPORT_SYMBOL(radix_tree_unlock);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 6204 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
  2016-02-28  5:30   ` kbuild test robot
@ 2016-02-28  6:27   ` NeilBrown
  2016-03-03 13:10   ` Jan Kara
  2 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-02-28  6:27 UTC (permalink / raw)
  To: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1176 bytes --]

On Sun, Feb 28 2016, NeilBrown <neilb@suse.com> wrote:

> +static int wake_slot_function(wait_queue_t *wait, unsigned mode, int sync,
> +			      void *arg)
> +{
> +	struct wait_bit_key *key = arg;
> +	struct wait_slot_queue *wait_slot =
> +		container_of(wait, struct wait_slot_queue, wait);
> +	void **slot;
> +
> +	if (wait_slot->root != key->flags ||
> +	    wait_slot->index != key->timeout)
> +		/* Not waking this waiter */
> +		return 0;
> +	if (wait_slot->state != SLOT_WAITING)
> +		/* Should be impossible.... */
> +		return 1;
> +	if (key->bit_nr == -3)
> +		/* Was just deleted, no point in doing a lookup */
> +		wait_slot = NULL;
> +	else
> +		wait_slot->ret = __radix_tree_lookup(
> +			wait_slot->root, wait_slot->index, NULL, &slot);
> +	if (!wait_slot->ret || !radix_tree_exceptional_entry(wait_slot->ret)) {
> +		wait_slot->state = SLOT_GONE;
> +		return 1;
> +	}
> +	if (slot_locked(slot))
> +		/* still locked */
> +		return 0;
> +	wait_slot->ret = lock_slot(slot);
> +	wait_slot->state = SLOT_LOCKED;
> +	return 1;
> +}

Sorry, just realized that this should:
  return autoremove_wake_function(wait, mode, sync, arg);

instead of "return 1;"

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c
  2016-02-28  5:09 ` [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c NeilBrown
@ 2016-02-29 14:28   ` Wilcox, Matthew R
  2016-02-29 17:46     ` Ross Zwisler
  0 siblings, 1 reply; 15+ messages in thread
From: Wilcox, Matthew R @ 2016-02-29 14:28 UTC (permalink / raw)
  To: NeilBrown, Ross Zwisler, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

I agree with this patch, but it's already part of the patchset that I'm working on, so merging this patch now would just introduce churn for me.

-----Original Message-----
From: NeilBrown [mailto:neilb@suse.com] 
Sent: Saturday, February 27, 2016 9:09 PM
To: Ross Zwisler; Wilcox, Matthew R; Andrew Morton; Jan Kara
Cc: linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c

These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/dax.c                   |    9 +++++++++
 include/linux/radix-tree.h |    9 ---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 711172450da6..9c4d697fb6fc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f54be7082207..968150ab8a1c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,15 +51,6 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
-#define RADIX_DAX_MASK	0xf
-#define RADIX_DAX_SHIFT	4
-#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
-		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: [PATCH 2/3] radix-tree: make 'indirect' bit available to exception entries.
  2016-02-28  5:09 ` [PATCH 2/3] radix-tree: make 'indirect' bit available to " NeilBrown
@ 2016-02-29 14:41   ` Wilcox, Matthew R
  2016-03-01 21:59     ` Ross Zwisler
  0 siblings, 1 reply; 15+ messages in thread
From: Wilcox, Matthew R @ 2016-02-29 14:41 UTC (permalink / raw)
  To: NeilBrown, Ross Zwisler, Andrew Morton, Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-mm

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 3723 bytes --]

So based on the bottom two bits, we can tell what this entry is:

00 - data pointer
01 - indirect entry (pointer to another level of the radix tree)
10 - exceptional entry
11 - locked exceptional entry

I was concerned that this patch would clash with the support for multi-order entries in the radix tree, but after some thought, I now believe that it doesn't.  The multi-order entries changes permit finding data pointers or exceptional entries in the tree where before only indirect entries could be found, but with the changes to radix_tree_is_indirect_ptr below, everything should work fine.

-----Original Message-----
From: NeilBrown [mailto:neilb@suse.com] 
Sent: Saturday, February 27, 2016 9:09 PM
To: Ross Zwisler; Wilcox, Matthew R; Andrew Morton; Jan Kara
Cc: linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: [PATCH 2/3] radix-tree: make 'indirect' bit available to exception entries.

A pointer to a radix_tree_node will always have the 'exception'
bit cleared, so if the exception bit is set the value cannot
be an indirect pointer.  Thus it is safe to make the 'indirect bit'
available to store extra information in exception entries.

This patch adds a 'PTR_MASK' and a value is only treated as
an indirect (pointer) entry the 2 ls-bits are '01'.

The change in radix-tree.c ensures the stored value still looks like an
indirect pointer, and saves a load as well.

We could swap the two bits and so keep all the exectional bits contigious.
But I have other plans for that bit....

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/radix-tree.h |   11 +++++++++--
 lib/radix-tree.c           |    2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 968150ab8a1c..450c12b546b7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -40,8 +40,13 @@
  * Indirect pointer in fact is also used to tag the last pointer of a node
  * when it is shrunk, before we rcu free the node. See shrink code for
  * details.
+ *
+ * To allow an exception entry to only lose one bit, we ignore
+ * the INDIRECT bit when the exception bit is set.  So an entry is
+ * indirect if the least significant 2 bits are 01.
  */
 #define RADIX_TREE_INDIRECT_PTR		1
+#define RADIX_TREE_INDIRECT_MASK	3
 /*
  * A common use of the radix tree is to store pointers to struct pages;
  * but shmem/tmpfs needs also to store swap entries in the same tree:
@@ -53,7 +58,8 @@
 
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
+	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
+		== RADIX_TREE_INDIRECT_PTR;
 }
 
 /*** radix-tree API starts here ***/
@@ -221,7 +227,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
  */
 static inline int radix_tree_deref_retry(void *arg)
 {
-	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
+	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
+			== RADIX_TREE_INDIRECT_PTR);
 }
 
 /**
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6b79e9026e24..37d4643ab5c0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1305,7 +1305,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
 		 * to force callers to retry.
 		 */
 		if (root->height == 0)
-			*((unsigned long *)&to_free->slots[0]) |=
+			*((unsigned long *)&to_free->slots[0]) =
 						RADIX_TREE_INDIRECT_PTR;
 
 		radix_tree_node_free(to_free);


N‹§²æìr¸›zǧu©ž²Æ {\b­†éì¹»\x1c®&Þ–)îÆi¢žØ^n‡r¶‰šŽŠÝ¢j$½§$¢¸\x05¢¹¨­è§~Š'.)îÄÃ,yèm¶ŸÿÃ\f%Š{±šj+ƒðèž×¦j)Z†·Ÿ

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c
  2016-02-29 14:28   ` Wilcox, Matthew R
@ 2016-02-29 17:46     ` Ross Zwisler
  0 siblings, 0 replies; 15+ messages in thread
From: Ross Zwisler @ 2016-02-29 17:46 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: NeilBrown, Ross Zwisler, Andrew Morton, Jan Kara, linux-kernel,
	linux-fsdevel, linux-mm

On Mon, Feb 29, 2016 at 02:28:46PM +0000, Wilcox, Matthew R wrote:
> I agree with this patch, but it's already part of the patchset that I'm
> working on, so merging this patch now would just introduce churn for me.
> 
> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.com] 
> Sent: Saturday, February 27, 2016 9:09 PM
> To: Ross Zwisler; Wilcox, Matthew R; Andrew Morton; Jan Kara
> Cc: linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
> Subject: [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c
> 
> These don't belong in radix-tree.c any more than PAGECACHE_TAG_* do.
> Let's try to maintain the idea that radix-tree simply implements an
> abstract data type.

Looks good.  I'm fine with this change, whether it happens via this standalone
patch or via Matthew's larger change set.

> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/dax.c                   |    9 +++++++++
>  include/linux/radix-tree.h |    9 ---------
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 711172450da6..9c4d697fb6fc 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -32,6 +32,15 @@
>  #include <linux/pfn_t.h>
>  #include <linux/sizes.h>
>  
> +#define RADIX_DAX_MASK	0xf
> +#define RADIX_DAX_SHIFT	4
> +#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> +		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> +
>  static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
>  {
>  	struct request_queue *q = bdev->bd_queue;
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index f54be7082207..968150ab8a1c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -51,15 +51,6 @@
>  #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
>  #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
>  
> -#define RADIX_DAX_MASK	0xf
> -#define RADIX_DAX_SHIFT	4
> -#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> -		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
> -
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
>  	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] radix-tree: make 'indirect' bit available to exception entries.
  2016-02-29 14:41   ` Wilcox, Matthew R
@ 2016-03-01 21:59     ` Ross Zwisler
  0 siblings, 0 replies; 15+ messages in thread
From: Ross Zwisler @ 2016-03-01 21:59 UTC (permalink / raw)
  To: Wilcox, Matthew R
  Cc: NeilBrown, Ross Zwisler, Andrew Morton, Jan Kara, linux-kernel,
	linux-fsdevel, linux-mm

On Mon, Feb 29, 2016 at 02:41:55PM +0000, Wilcox, Matthew R wrote:
> So based on the bottom two bits, we can tell what this entry is:
> 
> 00 - data pointer
> 01 - indirect entry (pointer to another level of the radix tree)
> 10 - exceptional entry
> 11 - locked exceptional entry
> 
> I was concerned that this patch would clash with the support for multi-order
> entries in the radix tree, but after some thought, I now believe that it
> doesn't.  The multi-order entries changes permit finding data pointers or
> exceptional entries in the tree where before only indirect entries could be
> found, but with the changes to radix_tree_is_indirect_ptr below, everything
> should work fine.

Yep, this seems workable to me.

> -----Original Message-----
> From: NeilBrown [mailto:neilb@suse.com] 
> Sent: Saturday, February 27, 2016 9:09 PM
> To: Ross Zwisler; Wilcox, Matthew R; Andrew Morton; Jan Kara
> Cc: linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
> Subject: [PATCH 2/3] radix-tree: make 'indirect' bit available to exception entries.
> 
> A pointer to a radix_tree_node will always have the 'exception'
> bit cleared, so if the exception bit is set the value cannot
> be an indirect pointer.  Thus it is safe to make the 'indirect bit'
> available to store extra information in exception entries.
> 
> This patch adds a 'PTR_MASK' and a value is only treated as
> an indirect (pointer) entry the 2 ls-bits are '01'.
> 
> The change in radix-tree.c ensures the stored value still looks like an
> indirect pointer, and saves a load as well.
> 
> We could swap the two bits and so keep all the exectional bits contigious.
> But I have other plans for that bit....
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  include/linux/radix-tree.h |   11 +++++++++--
>  lib/radix-tree.c           |    2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 968150ab8a1c..450c12b546b7 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -40,8 +40,13 @@
>   * Indirect pointer in fact is also used to tag the last pointer of a node
>   * when it is shrunk, before we rcu free the node. See shrink code for
>   * details.
> + *
> + * To allow an exception entry to only lose one bit, we ignore
> + * the INDIRECT bit when the exception bit is set.  So an entry is
> + * indirect if the least significant 2 bits are 01.
>   */
>  #define RADIX_TREE_INDIRECT_PTR		1
> +#define RADIX_TREE_INDIRECT_MASK	3
>  /*
>   * A common use of the radix tree is to store pointers to struct pages;
>   * but shmem/tmpfs needs also to store swap entries in the same tree:
> @@ -53,7 +58,8 @@
>  
>  static inline int radix_tree_is_indirect_ptr(void *ptr)
>  {
> -	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
> +	return ((unsigned long)ptr & RADIX_TREE_INDIRECT_MASK)
> +		== RADIX_TREE_INDIRECT_PTR;
>  }
>  
>  /*** radix-tree API starts here ***/
> @@ -221,7 +227,8 @@ static inline void *radix_tree_deref_slot_protected(void **pslot,
>   */
>  static inline int radix_tree_deref_retry(void *arg)
>  {
> -	return unlikely((unsigned long)arg & RADIX_TREE_INDIRECT_PTR);
> +	return unlikely(((unsigned long)arg & RADIX_TREE_INDIRECT_MASK)
> +			== RADIX_TREE_INDIRECT_PTR);
>  }
>  
>  /**
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 6b79e9026e24..37d4643ab5c0 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1305,7 +1305,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
>  		 * to force callers to retry.
>  		 */
>  		if (root->height == 0)
> -			*((unsigned long *)&to_free->slots[0]) |=
> +			*((unsigned long *)&to_free->slots[0]) =
>  						RADIX_TREE_INDIRECT_PTR;
>  
>  		radix_tree_node_free(to_free);
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
  2016-02-28  5:30   ` kbuild test robot
  2016-02-28  6:27   ` NeilBrown
@ 2016-03-03 13:10   ` Jan Kara
  2016-03-03 23:51     ` NeilBrown
  2 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-03-03 13:10 UTC (permalink / raw)
  To: NeilBrown
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, Jan Kara,
	linux-kernel, linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 8876 bytes --]

Hi Neil,

On Sun 28-02-16 16:09:29, NeilBrown wrote:
> The least significant bit of an exception entry is used as a lock flag.
> A caller can:
>  - create a locked entry by simply adding an entry with this flag set
>  - lock an existing entry with radix_tree_lookup_lock().  This may return
>     NULL if the entry doesn't exists, or was deleted while waiting for
>     the lock.  It may return a non-exception entry if that is what is
>     found.  If it returns a locked entry then it has exclusive rights
>     to delete the entry.
>  - unlock an entry that is already locked.  This will wake any waiters.
>  - delete an entry that is locked.  This will wake waiters so that they
>    return NULL without looking at the slot in the radix tree.
> 
> These must all be called with the radix tree locked (i.e. a spinlock held).
> That spinlock is passed to radix_tree_lookup_lock() so that it can drop
> the lock while waiting.
> 
> This is a "demonstration of concept".  I haven't actually tested, only compiled.
> A possible use case is for the exception entries used by DAX.
> 
> It is possible that some of the lookups can be optimised away in some
> cases by storing a slot pointer.  I wanted to keep it reasonable
> simple until it was determined if it might be useful.

Thanks for having a look! So the patch looks like it would do the work but
frankly the amount of hackiness in it has exceeded my personal threshold...
several times ;)

In particular I don't quite understand why have you decided to re-lookup
the exceptional entry in the wake function? That seems to be the source of
a lot of a hackiness? I was hoping for something simpler like what I've
attached (compile tested only). What do you think?

To avoid false wakeups and thundering herd issues which my simple version does
have, we could do something like what I outline in the second patch. Now
that I look at the result that is closer to your patch, just cleaner IMHO :).
But I wanted to have it separated to see how much complexity does this
additional functionality brings...

Now I'm going to have a look how to use this in DAX...

								Honza


> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  include/linux/radix-tree.h |    8 ++
>  lib/radix-tree.c           |  158 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 166 insertions(+)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 450c12b546b7..8f579f66574b 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -308,6 +308,14 @@ unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
>  int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
>  unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item);
>  
> +void *radix_tree_lookup_lock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +			     unsigned long index, spinlock_t *lock);
> +void radix_tree_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +		       unsigned long index);
> +void radix_tree_delete_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +			      unsigned long index);
> +
> +
>  static inline void radix_tree_preload_end(void)
>  {
>  	preempt_enable();
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 37d4643ab5c0..a24ea002f3eb 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1500,3 +1500,161 @@ void __init radix_tree_init(void)
>  	radix_tree_init_maxindex();
>  	hotcpu_notifier(radix_tree_callback, 0);
>  }
> +
> +/* Exception entry locking.
> + * The least significant bit of an exception entry can be used as a
> + * "locked" flag.  Supported locking operations are:
> + * radix_tree_lookup_lock() - if the indexed entry exists, lock it and
> + *         return the value, else return NULL.  If the indexed entry is not
> + *         exceptional it is returned without locking.
> + * radix_tree_unlock() - release the lock on the indexed entry
> + * radix_tree_delete_unlock() - the entry must be locked.  It will be atomically
> + *     unlocked and removed.  Any threads sleeping in lookup_lock() will return.
> + * Each of these take a radix_tree_root, a wait_queue_head_t, and an index.
> + * The '*lock' function also takes a spinlock_t which must be held when any
> + * of the functions is called.  *lock will drop the spinlock while waiting for
> + * the entry lock.
> + *
> + * As delete_unlock could free the radix_tree_node, waiters much not touch it
> + * when woken.  We provide a wake function for the waitq which records when the
> + * item has been deleted.
> + *
> + * The wait_queue_head passed should be one that is used for bit_wait, such
> + * as zone->wait_table.  We re-use the 'flags' and 'timeout' fields of the
> + * wait_bit_key to store the root and index that we are waiting for.
> + * __wake_up may only be called on one of these keys while the radix tree
> + * is locked.  The wakeup function will take the lock itself if appropriate, or
> + * may record that the radix tree entry has been deleted.  In either case
> + * the waiting function just looks at the status reported by the wakeup function
> + * and doesn't look at the radix tree itself.
> + *
> + * There is no function for locking an entry while inserting it.  Simply
> + * insert an entry that is already marked as 'locked' - lsb set.
> + *
> + */
> +
> +struct wait_slot_queue {
> +	struct radix_tree_root	*root;
> +	unsigned long		index;
> +	wait_queue_t		wait;
> +	enum {SLOT_WAITING, SLOT_LOCKED, SLOT_GONE} state;
> +	void			*ret;
> +};
> +
> +static inline int slot_locked(void *v)
> +{
> +	unsigned long l = (unsigned long)v;
> +	return l & 1;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= 1);
> +}
> +
> +static inline void * unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~1UL);
> +}
> +
> +static int wake_slot_function(wait_queue_t *wait, unsigned mode, int sync,
> +			      void *arg)
> +{
> +	struct wait_bit_key *key = arg;
> +	struct wait_slot_queue *wait_slot =
> +		container_of(wait, struct wait_slot_queue, wait);
> +	void **slot;
> +
> +	if (wait_slot->root != key->flags ||
> +	    wait_slot->index != key->timeout)
> +		/* Not waking this waiter */
> +		return 0;
> +	if (wait_slot->state != SLOT_WAITING)
> +		/* Should be impossible.... */
> +		return 1;
> +	if (key->bit_nr == -3)
> +		/* Was just deleted, no point in doing a lookup */
> +		wait_slot = NULL;
> +	else
> +		wait_slot->ret = __radix_tree_lookup(
> +			wait_slot->root, wait_slot->index, NULL, &slot);
> +	if (!wait_slot->ret || !radix_tree_exceptional_entry(wait_slot->ret)) {
> +		wait_slot->state = SLOT_GONE;
> +		return 1;
> +	}
> +	if (slot_locked(slot))
> +		/* still locked */
> +		return 0;
> +	wait_slot->ret = lock_slot(slot);
> +	wait_slot->state = SLOT_LOCKED;
> +	return 1;
> +}
> +
> +void *radix_tree_lookup_lock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +			     unsigned long index, spinlock_t *lock)
> +{
> +	void *ret, **slot;
> +	struct wait_slot_queue wait;
> +
> +	ret = __radix_tree_lookup(root, index, NULL, &slot);
> +	if (!ret || !radix_tree_exceptional_entry(ret))
> +		return ret;
> +	if (!slot_locked(slot))
> +		return lock_slot(slot);
> +
> +	wait.wait.private = current;
> +	wait.wait.func = wake_slot_function;
> +	INIT_LIST_HEAD(&wait.wait.task_list);
> +	wait.state = SLOT_WAITING;
> +	wait.root = root;
> +	wait.index = index;
> +	wait.ret = NULL;
> +	for (;;) {
> +		prepare_to_wait(wq, &wait.wait,
> +				TASK_UNINTERRUPTIBLE);
> +		if (wait.state != SLOT_WAITING)
> +			break;
> +
> +		spin_unlock(lock);
> +		schedule();
> +		spin_lock(lock);
> +	}
> +	finish_wait(wq, &wait.wait);
> +	return wait.ret;
> +}
> +EXPORT_SYMBOL(radix_tree_lookup_lock);
> +
> +void radix_tree_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +			unsigned long index)
> +{
> +	void *ret, **slot;
> +
> +	ret = __radix_tree_lookup(root, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret)))
> +		return;
> +	if (WARN_ON_ONCE(!slot_locked(slot)))
> +		return;
> +	unlock_slot(slot);
> +
> +	if (waitqueue_active(wq)) {
> +		struct wait_bit_key key = {.flags = root, .bit_nr = -2,
> +					   .timeout = index};
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}
> +}
> +EXPORT_SYMBOL(radix_tree_unlock);
> +
> +void radix_tree_delete_unlock(struct radix_tree_root *root, wait_queue_head_t *wq,
> +			      unsigned long index)
> +{
> +	radix_tree_delete(root, index);
> +	if (waitqueue_active(wq)) {
> +		/* -3 here indicates deletion */
> +		struct wait_bit_key key = {.flags = root, .bit_nr = -3,
> +					   .timeout = index};
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}
> +}
> +EXPORT_SYMBOL(radix_tree_delete_unlock);
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-radix-tree-support-locking-of-individual-exception-e.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-03-03 13:10   ` Jan Kara
@ 2016-03-03 23:51     ` NeilBrown
  2016-03-04 10:14       ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-03-03 23:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3730 bytes --]

On Fri, Mar 04 2016, Jan Kara wrote:

> Hi Neil,
>
> On Sun 28-02-16 16:09:29, NeilBrown wrote:
>> The least significant bit of an exception entry is used as a lock flag.
>> A caller can:
>>  - create a locked entry by simply adding an entry with this flag set
>>  - lock an existing entry with radix_tree_lookup_lock().  This may return
>>     NULL if the entry doesn't exists, or was deleted while waiting for
>>     the lock.  It may return a non-exception entry if that is what is
>>     found.  If it returns a locked entry then it has exclusive rights
>>     to delete the entry.
>>  - unlock an entry that is already locked.  This will wake any waiters.
>>  - delete an entry that is locked.  This will wake waiters so that they
>>    return NULL without looking at the slot in the radix tree.
>> 
>> These must all be called with the radix tree locked (i.e. a spinlock held).
>> That spinlock is passed to radix_tree_lookup_lock() so that it can drop
>> the lock while waiting.
>> 
>> This is a "demonstration of concept".  I haven't actually tested, only compiled.
>> A possible use case is for the exception entries used by DAX.
>> 
>> It is possible that some of the lookups can be optimised away in some
>> cases by storing a slot pointer.  I wanted to keep it reasonable
>> simple until it was determined if it might be useful.
>
> Thanks for having a look! So the patch looks like it would do the work but
> frankly the amount of hackiness in it has exceeded my personal threshold...
> several times ;)

Achievement unlocked ? :-)

>
> In particular I don't quite understand why have you decided to re-lookup
> the exceptional entry in the wake function? That seems to be the source of
> a lot of a hackiness?

Yes.....

My original idea was that there would only be a single lookup.  If the
slot was found to be locked, the address of the slot would be stored in
the key so the wakeup function could trivially detect if it was being
deleted, or could claim the lock, and would signal the result to the
caller.

But then I realized that the address of the slot is not necessarily
stable.  In particular the slot for address 0 can be in the root, or it
can be in a node.  I could special-case address zero but it was easier
just to do the search.

Of course the slot disappears if the entry is deleted.  That is why the
wakeup function (which is called under the tree lock so can still safely
inspect the slot) would signal to the caller that the delete had
happened.

So the patch was a little confused....

>                       I was hoping for something simpler like what I've
> attached (compile tested only). What do you think?

Certainly simpler.

By not layering on top of wait_bit_key, you've precluded the use of the
current page wait_queues for these locks - you need to allocate new wait
queue heads.

If in

> +struct wait_exceptional_entry_queue {
> +	wait_queue_t wait;
> +	struct exceptional_entry_key key;
> +};

you had the exceptional_entry_key first (like wait_bit_queue does) you
would be closer to being able to re-use the queues.

Also I don't think it is safe to use an exclusive wait.  When a slot is
deleted, you need to wake up *all* the waiters.

Thanks,
NeilBrown


>
> To avoid false wakeups and thundering herd issues which my simple version does
> have, we could do something like what I outline in the second patch. Now
> that I look at the result that is closer to your patch, just cleaner IMHO :).
> But I wanted to have it separated to see how much complexity does this
> additional functionality brings...
>
> Now I'm going to have a look how to use this in DAX...
>
> 								Honza

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-03-03 23:51     ` NeilBrown
@ 2016-03-04 10:14       ` NeilBrown
  2016-03-04 12:31         ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2016-03-04 10:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 807 bytes --]

On Fri, Mar 04 2016, NeilBrown wrote:

>
> By not layering on top of wait_bit_key, you've precluded the use of the
> current page wait_queues for these locks - you need to allocate new wait
> queue heads.
>
> If in
>
>> +struct wait_exceptional_entry_queue {
>> +	wait_queue_t wait;
>> +	struct exceptional_entry_key key;
>> +};
>
> you had the exceptional_entry_key first (like wait_bit_queue does) you
> would be closer to being able to re-use the queues.

Scratch that bit, I was confusing myself again.  Sorry.
Each wait_queue_t has it's own function so one function will never be
called on other items in the queue - of course.

>
> Also I don't think it is safe to use an exclusive wait.  When a slot is
> deleted, you need to wake up *all* the waiters.

I think this issue is still valid.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-03-04 10:14       ` NeilBrown
@ 2016-03-04 12:31         ` Jan Kara
  2016-03-09  2:13           ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-03-04 12:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, Ross Zwisler, Matthew Wilcox, Andrew Morton,
	linux-kernel, linux-fsdevel, linux-mm

On Fri 04-03-16 21:14:24, NeilBrown wrote:
> On Fri, Mar 04 2016, NeilBrown wrote:
> 
> >
> > By not layering on top of wait_bit_key, you've precluded the use of the
> > current page wait_queues for these locks - you need to allocate new wait
> > queue heads.
> >
> > If in
> >
> >> +struct wait_exceptional_entry_queue {
> >> +	wait_queue_t wait;
> >> +	struct exceptional_entry_key key;
> >> +};
> >
> > you had the exceptional_entry_key first (like wait_bit_queue does) you
> > would be closer to being able to re-use the queues.
> 
> Scratch that bit, I was confusing myself again.  Sorry.
> Each wait_queue_t has it's own function so one function will never be
> called on other items in the queue - of course.

Yes.

> > Also I don't think it is safe to use an exclusive wait.  When a slot is
> > deleted, you need to wake up *all* the waiters.
> 
> I think this issue is still valid.

Yes, you are right. I have deleted your function radix_tree_delete_unlock()
because I thought it won't be needed - but if we use exclusive waits (which
I think we want to) we need to wakeup all waiters when deleting entry as you
properly spotted.

Currently I'm undecided how we want to deal with that. The thing is - when
exceptional entries use locking, we need deleting of a radix tree entry to
avoid deleting locked entry so the only proper way to delete entry would be
via something like radix_tree_delete_unlock(). OTOH when entry locking is
not used (like for tmpfs exceptional entries), we don't want to bother with
passing waitqueues around and locking entry just to delete it. The best I
came up with was that radix_tree_delete_item() would complain about
deleting locked entry so that we catch when someone doesn't properly obey
the locking protocol... But I'm still somewhat hesitating whether it would
not be better to move the locking out of generic radix tree code since it
is not quite as generic as I'd like and e.g. clear_exceptional_entry()
would use locked delete only for DAX mappings anyway.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] radix-tree: support locking of individual exception entries.
  2016-03-04 12:31         ` Jan Kara
@ 2016-03-09  2:13           ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2016-03-09  2:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Matthew Wilcox, Andrew Morton, linux-kernel,
	linux-fsdevel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1845 bytes --]

On Fri, Mar 04 2016, Jan Kara wrote:

> On Fri 04-03-16 21:14:24, NeilBrown wrote:
>> On Fri, Mar 04 2016, NeilBrown wrote:
>> 
>> >
>> > By not layering on top of wait_bit_key, you've precluded the use of the
>> > current page wait_queues for these locks - you need to allocate new wait
>> > queue heads.
>> >
>> > If in
>> >
>> >> +struct wait_exceptional_entry_queue {
>> >> +	wait_queue_t wait;
>> >> +	struct exceptional_entry_key key;
>> >> +};
>> >
>> > you had the exceptional_entry_key first (like wait_bit_queue does) you
>> > would be closer to being able to re-use the queues.
>> 
>> Scratch that bit, I was confusing myself again.  Sorry.
>> Each wait_queue_t has it's own function so one function will never be
>> called on other items in the queue - of course.
>
> Yes.

I was thinking about this some more, wondering why I thought what I did,
and realised there is a small issue that it is worth being aware of.

If different users of the same work queue use different "keys", a wake
function can get a key of a different type to the one it is expecting.

e.g. __wake_up_bit passes the address of a "struct wait_bit_key" to
__wake_up which is then passed as a "void* arg" to the
wait_queue_func_t.

With your code, a "struct exceptional_entry_key" could be passed to
wake_bit_function, or a "struct wait_bit_key" could be passed to
wake_exceptional_entry_func.

Both structures start with a pointer which will never appear in the
wrong type, and both function test that pointer first and access nothing
else if it doesn't match, so the code is safe.  But it could be seen as
a bit fragile, and doing something to make it obvious that the different
key types need to align on that primary key field would not be a bad
thing. .... If we end up using this code.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-03-09  2:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-28  5:09 [PATCH 0/3] RFC improvements to radix-tree related to DAX NeilBrown
2016-02-28  5:09 ` [PATCH 3/3] radix-tree: support locking of individual exception entries NeilBrown
2016-02-28  5:30   ` kbuild test robot
2016-02-28  6:27   ` NeilBrown
2016-03-03 13:10   ` Jan Kara
2016-03-03 23:51     ` NeilBrown
2016-03-04 10:14       ` NeilBrown
2016-03-04 12:31         ` Jan Kara
2016-03-09  2:13           ` NeilBrown
2016-02-28  5:09 ` [PATCH 2/3] radix-tree: make 'indirect' bit available to " NeilBrown
2016-02-29 14:41   ` Wilcox, Matthew R
2016-03-01 21:59     ` Ross Zwisler
2016-02-28  5:09 ` [PATCH 1/3] DAX: move RADIX_DAX_ definitions to dax.c NeilBrown
2016-02-29 14:28   ` Wilcox, Matthew R
2016-02-29 17:46     ` Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).