* [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
@ 2016-10-19 17:24 Johannes Weiner
  2016-10-19 17:24 ` [PATCH 1/5] lib: radix-tree: provide node-granular interface for radix tree tags Johannes Weiner
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

Hi,

this is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
entries without radix tree node"). That patch fixed an issue that was
caused mainly by the page cache sneaking special shadow page entries
into the radix tree and relying on subtleties in the radix tree code
to make that work. The fix also had to stop tracking refaults for
single-page files because shadow pages stored as direct pointers in
radix_tree_root->rnode weren't properly handled during tree extension.

These patches make the radix tree code explicitly support and track
such special entries, to eliminate the subtleties and to restore the
thrash detection for single-page files.
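
For context, a shadow entry is the eviction information of a reclaimed
page, packed into an unsigned long and planted in the page's old slot
as an exceptional entry. Simplified from pack_shadow() in
mm/workingset.c:

	shadow = (void *)((eviction << RADIX_TREE_EXCEPTIONAL_SHIFT) |
			  RADIX_TREE_EXCEPTIONAL_ENTRY);

That is, a non-pointer value living in a radix tree slot, which is
exactly what the tree code needs to learn to recognize and account for.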

They then turn the BUG_ONs in the shadow shrinker into mere warnings,
to prevent unnecessary crashes such as those mentioned in d3798ae8c6f3.

The changes have been running stable on my main machines for a couple
of days, survived kernel builds, chrome, and various synthetic stress
tests that exercise the shadow page tracking code. They've also been
solid doing scalability and page cache tests from Mel's mmtests.

It's more code, but it should be a lot less fragile. What do you think?

 include/linux/radix-tree.h      |  33 +++++----
 include/linux/swap.h            |  16 +++--
 lib/dma-debug.c                 |   6 +-
 lib/radix-tree.c                | 142 +++++++++++++++++++++++---------------
 mm/filemap.c                    |  25 ++++---
 mm/truncate.c                   |   2 +
 mm/workingset.c                 |  19 +++--
 tools/testing/radix-tree/test.c |   4 +-
 8 files changed, 149 insertions(+), 98 deletions(-)


* [PATCH 1/5] lib: radix-tree: provide node-granular interface for radix tree tags
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
@ 2016-10-19 17:24 ` Johannes Weiner
  2016-10-19 17:24 ` [PATCH 2/5] lib: radix-tree: internal tags Johannes Weiner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

Page cache insertion and deletion will need to tag shadow entries, but
these callsites already look up the raw node and slot for accounting.
Provide an interface to modify tags without another tree lookup.

We already have node_tag_clear(); factor node_tag_set() from
radix_tree_tag_set(), then provide node-granular functions for
clearing and setting. We won't be needing a getter for now.

The existing radix_tree_clear_tags() can be implemented on top of
__radix_tree_tag_clear(). Since it's also a node-granular function,
rename and relocate it accordingly.
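
As a rough usage sketch (mapping, index and tag are placeholders; the
page cache conversion happens later in the series), a callsite that
already has the node and slot from __radix_tree_lookup() can now
manipulate tags without walking the tree a second time:

	struct radix_tree_node *node;
	void **slot;

	__radix_tree_lookup(&mapping->page_tree, index, &node, &slot);
	/* ... per-node accounting ... */
	__radix_tree_tag_set(&mapping->page_tree, node, slot, tag);
	__radix_tree_tag_clear(&mapping->page_tree, node, slot, tag);

Both helpers also accept node == NULL, i.e. a slot that is the direct
root->rnode pointer, in which case they operate on the root tags.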

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  12 +++--
 lib/radix-tree.c           | 109 +++++++++++++++++++++++++++------------------
 mm/filemap.c               |   2 +-
 3 files changed, 76 insertions(+), 47 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af3581b8a451..dc261da5096c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -280,9 +280,6 @@ bool __radix_tree_delete_node(struct radix_tree_root *root,
 			      struct radix_tree_node *node);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
-void radix_tree_clear_tags(struct radix_tree_root *root,
-			   struct radix_tree_node *node,
-			   void **slot);
 unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
 			void **results, unsigned long first_index,
 			unsigned int max_items);
@@ -293,6 +290,15 @@ int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
 void radix_tree_init(void);
+void __radix_tree_tag_set(struct radix_tree_root *root,
+			  struct radix_tree_node *node,
+			  void **slot, unsigned int tag);
+void __radix_tree_tag_clear(struct radix_tree_root *root,
+			  struct radix_tree_node *node,
+			  void **slot, unsigned int tag);
+void __radix_tree_clear_tags(struct radix_tree_root *root,
+			     struct radix_tree_node *node,
+			     void **slot);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
 void *radix_tree_tag_clear(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e6d552c40dd..d04d0938d7b7 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -746,6 +746,67 @@ void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
+static void node_tag_set(struct radix_tree_root *root,
+			 struct radix_tree_node *node,
+			 unsigned int tag, unsigned int offset)
+{
+	while (node) {
+		if (tag_get(node, tag, offset))
+			return;
+		tag_set(node, tag, offset);
+
+		offset = node->offset;
+		node = node->parent;
+	}
+
+	if (!root_tag_get(root, tag))
+		root_tag_set(root, tag);
+}
+
+static void node_tag_clear(struct radix_tree_root *root,
+				struct radix_tree_node *node,
+				unsigned int tag, unsigned int offset)
+{
+	while (node) {
+		if (!tag_get(node, tag, offset))
+			return;
+		tag_clear(node, tag, offset);
+		if (any_tag_set(node, tag))
+			return;
+
+		offset = node->offset;
+		node = node->parent;
+	}
+
+	/* clear the root's tag bit */
+	if (root_tag_get(root, tag))
+		root_tag_clear(root, tag);
+}
+
+void __radix_tree_tag_set(struct radix_tree_root *root,
+			  struct radix_tree_node *node,
+			  void **slot, unsigned int tag)
+{
+	node_tag_set(root, node, tag, node ? get_slot_offset(node, slot) : 0);
+}
+
+void __radix_tree_tag_clear(struct radix_tree_root *root,
+			    struct radix_tree_node *node,
+			    void **slot, unsigned int tag)
+{
+	node_tag_clear(root, node, tag, node ? get_slot_offset(node, slot) : 0);
+}
+
+void __radix_tree_clear_tags(struct radix_tree_root *root,
+			     struct radix_tree_node *node,
+			     void **slot)
+{
+	unsigned int tag;
+
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+		__radix_tree_tag_clear(root, node, slot, tag);
+}
+
 /**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
@@ -764,49 +825,25 @@ void *radix_tree_tag_set(struct radix_tree_root *root,
 {
 	struct radix_tree_node *node, *parent;
 	unsigned long maxindex;
+	int uninitialized_var(offset);
 
 	radix_tree_load_root(root, &node, &maxindex);
 	BUG_ON(index > maxindex);
 
-	while (radix_tree_is_internal_node(node)) {
-		unsigned offset;
+	parent = NULL;
 
+	while (radix_tree_is_internal_node(node)) {
 		parent = entry_to_node(node);
 		offset = radix_tree_descend(parent, &node, index);
-		BUG_ON(!node);
-
-		if (!tag_get(parent, tag, offset))
-			tag_set(parent, tag, offset);
 	}
 
-	/* set the root's tag bit */
-	if (!root_tag_get(root, tag))
-		root_tag_set(root, tag);
+	if (node)
+		node_tag_set(root, parent, tag, offset);
 
 	return node;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
 
-static void node_tag_clear(struct radix_tree_root *root,
-				struct radix_tree_node *node,
-				unsigned int tag, unsigned int offset)
-{
-	while (node) {
-		if (!tag_get(node, tag, offset))
-			return;
-		tag_clear(node, tag, offset);
-		if (any_tag_set(node, tag))
-			return;
-
-		offset = node->offset;
-		node = node->parent;
-	}
-
-	/* clear the root's tag bit */
-	if (root_tag_get(root, tag))
-		root_tag_clear(root, tag);
-}
-
 /**
  *	radix_tree_tag_clear - clear a tag on a radix tree node
  *	@root:		radix tree root
@@ -1583,20 +1620,6 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 }
 EXPORT_SYMBOL(radix_tree_delete);
 
-void radix_tree_clear_tags(struct radix_tree_root *root,
-			   struct radix_tree_node *node,
-			   void **slot)
-{
-	if (node) {
-		unsigned int tag, offset = get_slot_offset(node, slot);
-		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
-			node_tag_clear(root, node, tag, offset);
-	} else {
-		/* Clear root node tags */
-		root->gfp_mask &= __GFP_BITS_MASK;
-	}
-}
-
 /**
  *	radix_tree_tagged - test whether any items in the tree are tagged
  *	@root:		radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index 849f459ad078..42e1f006aa3d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -182,7 +182,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
 		__radix_tree_lookup(&mapping->page_tree, page->index + i,
 				    &node, &slot);
 
-		radix_tree_clear_tags(&mapping->page_tree, node, slot);
+		__radix_tree_clear_tags(&mapping->page_tree, node, slot);
 
 		if (!node) {
 			VM_BUG_ON_PAGE(nr != 1, page);
-- 
2.10.0


* [PATCH 2/5] lib: radix-tree: internal tags
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
  2016-10-19 17:24 ` [PATCH 1/5] lib: radix-tree: provide node-granular interface for radix tree tags Johannes Weiner
@ 2016-10-19 17:24 ` Johannes Weiner
  2016-10-19 17:24 ` [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries Johannes Weiner
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

To make the radix tree implementation aware of, and able to handle,
the special shadow entries in the page cache, it needs tags that go
beyond the freely allocatable tag bits we have right now: native tags
that can influence radix tree behavior.

Turn the existing RADIX_TREE_MAX_TAGS definition into an enum that
starts out with the number of freely allocatable user tags and is
followed by internal tags that we can tie radix tree behavior to.
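
Note that every tag, internal or not, also needs a root tag bit. Those
live in the unused high bits of radix_tree_root->gfp_mask, roughly like
this (simplified from lib/radix-tree.c):

	static inline void root_tag_set(struct radix_tree_root *root,
					unsigned int tag)
	{
		root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
	}

which is why radix_tree_init() gains a build-time check that
RADIX_TREE_NR_TAGS + __GFP_BITS_SHIFT still fits into a gfp_t.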

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h      |  9 +++++++--
 lib/dma-debug.c                 |  6 +++---
 lib/radix-tree.c                | 21 ++++++++++++---------
 tools/testing/radix-tree/test.c |  4 ++--
 4 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index dc261da5096c..756b2909467e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -64,7 +64,12 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 
 /*** radix-tree API starts here ***/
 
-#define RADIX_TREE_MAX_TAGS 3
+enum radix_tree_tags {
+	/* Freely allocatable radix tree user tags */
+	RADIX_TREE_NR_USER_TAGS = 3,
+	/* Radix tree internal tags */
+	RADIX_TREE_NR_TAGS = RADIX_TREE_NR_USER_TAGS,
+};
 
 #ifndef RADIX_TREE_MAP_SHIFT
 #define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
@@ -101,7 +106,7 @@ struct radix_tree_node {
 	/* For tree user */
 	struct list_head private_list;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
-	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+	unsigned long	tags[RADIX_TREE_NR_TAGS][RADIX_TREE_TAG_LONGS];
 };
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
diff --git a/lib/dma-debug.c b/lib/dma-debug.c
index 8971370bfb16..b0798a6a5b8a 100644
--- a/lib/dma-debug.c
+++ b/lib/dma-debug.c
@@ -464,7 +464,7 @@ EXPORT_SYMBOL(debug_dma_dump_mappings);
  */
 static RADIX_TREE(dma_active_cacheline, GFP_NOWAIT);
 static DEFINE_SPINLOCK(radix_lock);
-#define ACTIVE_CACHELINE_MAX_OVERLAP ((1 << RADIX_TREE_MAX_TAGS) - 1)
+#define ACTIVE_CACHELINE_MAX_OVERLAP ((1 << RADIX_TREE_NR_USER_TAGS) - 1)
 #define CACHELINE_PER_PAGE_SHIFT (PAGE_SHIFT - L1_CACHE_SHIFT)
 #define CACHELINES_PER_PAGE (1 << CACHELINE_PER_PAGE_SHIFT)
 
@@ -478,7 +478,7 @@ static int active_cacheline_read_overlap(phys_addr_t cln)
 {
 	int overlap = 0, i;
 
-	for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
+	for (i = RADIX_TREE_NR_USER_TAGS - 1; i >= 0; i--)
 		if (radix_tree_tag_get(&dma_active_cacheline, cln, i))
 			overlap |= 1 << i;
 	return overlap;
@@ -491,7 +491,7 @@ static int active_cacheline_set_overlap(phys_addr_t cln, int overlap)
 	if (overlap > ACTIVE_CACHELINE_MAX_OVERLAP || overlap < 0)
 		return overlap;
 
-	for (i = RADIX_TREE_MAX_TAGS - 1; i >= 0; i--)
+	for (i = RADIX_TREE_NR_USER_TAGS - 1; i >= 0; i--)
 		if (overlap & 1 << i)
 			radix_tree_tag_set(&dma_active_cacheline, cln, i);
 		else
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index d04d0938d7b7..bb6ddfb60557 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -321,7 +321,7 @@ static void radix_tree_node_rcu_free(struct rcu_head *head)
 	 * can leave us with a non-NULL entry in the first slot, so clear
 	 * that here to make sure.
 	 */
-	for (i = 0; i < RADIX_TREE_MAX_TAGS; i++)
+	for (i = 0; i < RADIX_TREE_NR_TAGS; i++)
 		tag_clear(node, i, 0);
 
 	node->slots[0] = NULL;
@@ -512,7 +512,7 @@ static int radix_tree_extend(struct radix_tree_root *root,
 			return -ENOMEM;
 
 		/* Propagate the aggregated tag info into the new root */
-		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		for (tag = 0; tag < RADIX_TREE_NR_TAGS; tag++) {
 			if (root_tag_get(root, tag))
 				tag_set(node, tag, 0);
 		}
@@ -803,7 +803,7 @@ void __radix_tree_clear_tags(struct radix_tree_root *root,
 {
 	unsigned int tag;
 
-	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+	for (tag = 0; tag < RADIX_TREE_NR_TAGS; tag++)
 		__radix_tree_tag_clear(root, node, slot, tag);
 }
 
@@ -813,7 +813,7 @@ void __radix_tree_clear_tags(struct radix_tree_root *root,
  *	@index:		index key
  *	@tag:		tag index
  *
- *	Set the search tag (which must be < RADIX_TREE_MAX_TAGS)
+ *	Set the search tag (which must be < RADIX_TREE_NR_TAGS)
  *	corresponding to @index in the radix tree.  From
  *	the root all the way down to the leaf node.
  *
@@ -850,7 +850,7 @@ EXPORT_SYMBOL(radix_tree_tag_set);
  *	@index:		index key
  *	@tag:		tag index
  *
- *	Clear the search tag (which must be < RADIX_TREE_MAX_TAGS)
+ *	Clear the search tag (which must be < RADIX_TREE_NR_TAGS)
  *	corresponding to @index in the radix tree.  If this causes
  *	the leaf node to have no tags set then clear the tag in the
  *	next-to-leaf node, etc.
@@ -887,7 +887,7 @@ EXPORT_SYMBOL(radix_tree_tag_clear);
  * radix_tree_tag_get - get a tag on a radix tree node
  * @root:		radix tree root
  * @index:		index key
- * @tag:		tag index (< RADIX_TREE_MAX_TAGS)
+ * @tag:		tag index (< RADIX_TREE_NR_TAGS)
  *
  * Return values:
  *
@@ -1262,7 +1262,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
  *	@results:	where the results of the lookup are placed
  *	@first_index:	start the lookup from this key
  *	@max_items:	place up to this many items at *results
- *	@tag:		the tag index (< RADIX_TREE_MAX_TAGS)
+ *	@tag:		the tag index (< RADIX_TREE_NR_TAGS)
  *
  *	Performs an index-ascending scan of the tree for present items which
  *	have the tag indexed by @tag set.  Places the items at *@results and
@@ -1303,7 +1303,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
  *	@results:	where the results of the lookup are placed
  *	@first_index:	start the lookup from this key
  *	@max_items:	place up to this many items at *results
- *	@tag:		the tag index (< RADIX_TREE_MAX_TAGS)
+ *	@tag:		the tag index (< RADIX_TREE_NR_TAGS)
  *
  *	Performs an index-ascending scan of the tree for present items which
  *	have the tag indexed by @tag set.  Places the slots at *@results and
@@ -1592,7 +1592,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	offset = get_slot_offset(node, slot);
 
 	/* Clear all tags associated with the item to be deleted.  */
-	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+	for (tag = 0; tag < RADIX_TREE_NR_TAGS; tag++)
 		node_tag_clear(root, node, tag, offset);
 
 	delete_sibling_entries(node, node_to_entry(slot), offset);
@@ -1687,6 +1687,9 @@ static int radix_tree_callback(struct notifier_block *nfb,
 
 void __init radix_tree_init(void)
 {
+	/* Root tags have to squeeze into radix_tree_root->gfp_mask */
+	BUILD_BUG_ON(RADIX_TREE_NR_TAGS + __GFP_BITS_SHIFT > sizeof(gfp_t) * 8);
+
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index a6e8099eaf4f..2cebba63a1a2 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -161,7 +161,7 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag,
 	if (tagged != anyset) {
 		printf("tag: %u, shift %u, tagged: %d, anyset: %d\n",
 			tag, slot->shift, tagged, anyset);
-		for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) {
+		for (j = 0; j < RADIX_TREE_NR_USER_TAGS; j++) {
 			printf("tag %d: ", j);
 			for (i = 0; i < RADIX_TREE_TAG_LONGS; i++)
 				printf("%016lx ", slot->tags[j][i]);
@@ -178,7 +178,7 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag,
 				if (verify_node(slot->slots[i], tag,
 					    !!test_bit(i, slot->tags[tag]))) {
 					printf("Failure at off %d\n", i);
-					for (j = 0; j < RADIX_TREE_MAX_TAGS; j++) {
+					for (j = 0; j < RADIX_TREE_NR_USER_TAGS; j++) {
 						printf("tag %d: ", j);
 						for (i = 0; i < RADIX_TREE_TAG_LONGS; i++)
 							printf("%016lx ", slot->tags[j][i]);
-- 
2.10.0


* [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
  2016-10-19 17:24 ` [PATCH 1/5] lib: radix-tree: provide node-granular interface for radix tree tags Johannes Weiner
  2016-10-19 17:24 ` [PATCH 2/5] lib: radix-tree: internal tags Johannes Weiner
@ 2016-10-19 17:24 ` Johannes Weiner
  2016-10-20 22:33   ` Dave Chinner
  2016-10-19 17:24 ` [PATCH 4/5] mm: workingset: restore single-page file refault tracking Johannes Weiner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

Add an internal tag to identify special entries that are accounted in
node->special in addition to node->count.

With this in place, the next patch can restore refault detection in
single-page files. It will also move the shadow count from the upper
bits of count to the new special counter, and then shrink count to a
char as well; the growth of struct radix_tree_node is temporary.
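
As a sketch of the intended use (root and special_entry are
placeholders; the page cache conversion follows in the next patch): a
special entry can now live directly in root->rnode, marked through the
root tag, and radix_tree_extend() carries that information into
node->special when the tree grows:

	struct radix_tree_node *node;
	void **slot;

	/* single-entry tree: node comes back NULL, slot is &root->rnode */
	__radix_tree_lookup(root, 0, &node, &slot);
	radix_tree_replace_slot(slot, special_entry);
	__radix_tree_tag_set(root, node, slot, RADIX_TREE_TAG_SPECIAL);
	if (node)
		node->special++;	/* larger tree: account in the node */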

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h | 10 ++++++----
 lib/radix-tree.c           | 14 ++++++++++----
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 756b2909467e..2e1c9added23 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -68,7 +68,8 @@ enum radix_tree_tags {
 	/* Freely allocatable radix tree user tags */
 	RADIX_TREE_NR_USER_TAGS = 3,
 	/* Radix tree internal tags */
-	RADIX_TREE_NR_TAGS = RADIX_TREE_NR_USER_TAGS,
+	RADIX_TREE_TAG_SPECIAL = RADIX_TREE_NR_USER_TAGS,
+	RADIX_TREE_NR_TAGS,
 };
 
 #ifndef RADIX_TREE_MAP_SHIFT
@@ -90,9 +91,10 @@ enum radix_tree_tags {
 #define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
 
 struct radix_tree_node {
-	unsigned char	shift;	/* Bits remaining in each slot */
-	unsigned char	offset;	/* Slot offset in parent */
-	unsigned int	count;
+	unsigned char	shift;		/* Bits remaining in each slot */
+	unsigned char	offset;		/* Slot offset in parent */
+	unsigned int	count;		/* Total entry count */
+	unsigned char	special;	/* Special entry count */
 	union {
 		struct {
 			/* Used when ascending tree */
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index bb6ddfb60557..e58cff1d97ed 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -220,10 +220,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
 {
 	unsigned long i;
 
-	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
+	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d special %d parent %p\n",
 		node, node->offset,
 		node->tags[0][0], node->tags[1][0], node->tags[2][0],
-		node->shift, node->count, node->parent);
+		node->shift, node->count, node->special, node->parent);
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		unsigned long first = index | (i << node->shift);
@@ -522,9 +522,15 @@ static int radix_tree_extend(struct radix_tree_root *root,
 		node->offset = 0;
 		node->count = 1;
 		node->parent = NULL;
-		if (radix_tree_is_internal_node(slot))
-			entry_to_node(slot)->parent = node;
 		node->slots[0] = slot;
+		/* Extending an existing node or root->rnode? */
+		if (radix_tree_is_internal_node(slot)) {
+			entry_to_node(slot)->parent = node;
+		} else {
+			/* Moving a special root->rnode to a node */
+			if (root_tag_get(root, RADIX_TREE_TAG_SPECIAL))
+				node->special = 1;
+		}
 		slot = node_to_entry(node);
 		rcu_assign_pointer(root->rnode, slot);
 		shift += RADIX_TREE_MAP_SHIFT;
-- 
2.10.0


* [PATCH 4/5] mm: workingset: restore single-page file refault tracking
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
                   ` (2 preceding siblings ...)
  2016-10-19 17:24 ` [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries Johannes Weiner
@ 2016-10-19 17:24 ` Johannes Weiner
  2016-10-19 17:24 ` [PATCH 5/5] mm: workingset: turn shadow node shrinker bugs into warnings Johannes Weiner
  2016-10-19 18:16 ` [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Linus Torvalds
  5 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

Currently, we account shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we have to prevent shadow entries from going through operations where
the tree implementation relies on or modifies node->count: extending
and shrinking the tree from and to a single direct root->rnode entry.

As a consequence, we cannot store shadow entries for files that only
have index 0 populated, and thus cannot detect refaults from them,
which in turn degrades the thrashing compensation in LRU reclaim.

Another consequence is that we rely on subtleties throughout the radix
tree code, such as the node->count != 1 check in the shrinking code,
which is meant to exclude multi-entry nodes but also skips nodes with
only one shadow entry since they are accounted in the upper bits. This
is error prone, and has in fact caused the bug fixed in d3798ae8c6f3
("mm: filemap: don't plant shadow entries without radix tree node").

To fix this, this patch moves the shadow counter from the upper bits
of node->count into the new node->special counter and tags shadow
entries RADIX_TREE_TAG_SPECIAL so the radix tree code handles them
properly. node->count then counts all tree entries again, including
shadows, and becomes a superset of node->special.

Switching from a magic node->count to a special entry tracking scheme
that is native to the radix tree code removes the fragile subtleties
mentioned above. By being able to tag special entries even when
they're a direct pointer in the tree root, we can store shadow entries
for single-page files again, and thus restore refault detection and
thrashing compensation for them.

As the upper bits of node->count are no longer used, we can shrink it
down to an unsigned char, which reverts the size increase of the radix
tree node caused by the previous patch.
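
For illustration, the refault side now works even when the shadow sits
directly in root->rnode; roughly, from page_cache_tree_insert() below
(simplified, non-DAX path only):

	p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
	if (radix_tree_exceptional_entry(p)) {
		/* the old slot holds a shadow entry: untag and unaccount */
		__radix_tree_tag_clear(&mapping->page_tree, node, slot,
				       RADIX_TREE_TAG_SPECIAL);
		if (node)
			workingset_node_shadows_dec(node);
		if (shadowp)
			*shadowp = p;	/* reported to workingset_refault() */
	}

With node == NULL this clears the root tag, and the shadow value still
makes it to workingset_refault(), which is the case that d3798ae8c6f3
had to disable.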

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  6 +-----
 include/linux/swap.h       | 16 ++++++++++------
 mm/filemap.c               | 23 +++++++++++------------
 mm/truncate.c              |  2 ++
 4 files changed, 24 insertions(+), 23 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 2e1c9added23..f6dbbd2eb4e0 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -86,14 +86,10 @@ enum radix_tree_tags {
 #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
 					  RADIX_TREE_MAP_SHIFT))
 
-/* Internally used bits of node->count */
-#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
-#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
-
 struct radix_tree_node {
 	unsigned char	shift;		/* Bits remaining in each slot */
 	unsigned char	offset;		/* Slot offset in parent */
-	unsigned int	count;		/* Total entry count */
+	unsigned char	count;		/* Total entry count */
 	unsigned char	special;	/* Special entry count */
 	union {
 		struct {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523cefb9b..22786f2334fb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -250,7 +250,7 @@ extern struct list_lru workingset_shadow_nodes;
 
 static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
 {
-	return node->count & RADIX_TREE_COUNT_MASK;
+	return node->count - node->special;
 }
 
 static inline void workingset_node_pages_inc(struct radix_tree_node *node)
@@ -260,24 +260,28 @@ static inline void workingset_node_pages_inc(struct radix_tree_node *node)
 
 static inline void workingset_node_pages_dec(struct radix_tree_node *node)
 {
-	VM_WARN_ON_ONCE(!workingset_node_pages(node));
+	VM_WARN_ON_ONCE(node->count == node->special);
+	VM_WARN_ON_ONCE(!node->count);
 	node->count--;
 }
 
 static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
 {
-	return node->count >> RADIX_TREE_COUNT_SHIFT;
+	return node->special;
 }
 
 static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
 {
-	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+	node->special++;
+	node->count++;
 }
 
 static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
 {
-	VM_WARN_ON_ONCE(!workingset_node_shadows(node));
-	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+	VM_WARN_ON_ONCE(!node->special);
+	VM_WARN_ON_ONCE(!node->count);
+	node->special--;
+	node->count--;
 }
 
 /* linux/mm/page_alloc.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 42e1f006aa3d..f684bd3c0838 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -130,10 +130,12 @@ static int page_cache_tree_insert(struct address_space *mapping,
 
 		mapping->nrexceptional--;
 		if (!dax_mapping(mapping)) {
-			if (shadowp)
-				*shadowp = p;
+			__radix_tree_tag_clear(&mapping->page_tree, node, slot,
+					       RADIX_TREE_TAG_SPECIAL);
 			if (node)
 				workingset_node_shadows_dec(node);
+			if (shadowp)
+				*shadowp = p;
 		} else {
 			/* DAX can replace empty locked entry with a hole */
 			WARN_ON_ONCE(p !=
@@ -184,19 +186,16 @@ static void page_cache_tree_delete(struct address_space *mapping,
 
 		__radix_tree_clear_tags(&mapping->page_tree, node, slot);
 
-		if (!node) {
-			VM_BUG_ON_PAGE(nr != 1, page);
-			/*
-			 * We need a node to properly account shadow
-			 * entries. Don't plant any without. XXX
-			 */
-			shadow = NULL;
-		}
-
 		radix_tree_replace_slot(slot, shadow);
 
-		if (!node)
+		if (shadow)
+			__radix_tree_tag_set(&mapping->page_tree, node, slot,
+					     RADIX_TREE_TAG_SPECIAL);
+
+		if (!node) {
+			VM_BUG_ON_PAGE(nr != 1, page);
 			break;
+		}
 
 		workingset_node_pages_dec(node);
 		if (shadow)
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..bec210e5ee4b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -50,6 +50,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	if (*slot != entry)
 		goto unlock;
 	radix_tree_replace_slot(slot, NULL);
+	__radix_tree_tag_clear(&mapping->page_tree, node, slot,
+			       RADIX_TREE_TAG_SPECIAL);
 	mapping->nrexceptional--;
 	if (!node)
 		goto unlock;
-- 
2.10.0


* [PATCH 5/5] mm: workingset: turn shadow node shrinker bugs into warnings
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
                   ` (3 preceding siblings ...)
  2016-10-19 17:24 ` [PATCH 4/5] mm: workingset: restore single-page file refault tracking Johannes Weiner
@ 2016-10-19 17:24 ` Johannes Weiner
  2016-10-19 18:16 ` [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Linus Torvalds
  5 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-19 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Jan Kara, Dave Jones, linux-mm, linux-kernel,
	kernel-team

When the shadow page shrinker tries to reclaim a radix tree node but
finds it in an unexpected state (it should contain no pages and a
non-zero number of shadow entries), there is no need to kill the
executing task or even the entire system.

Warn about the invalid state, then leave that tree node be. Simply
don't put it back on the shadow LRU for future reclaim and move on.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/workingset.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 617475f529f4..5f07db171c03 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,23 +418,28 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 * no pages, so we expect to be able to remove them all and
 	 * delete and free the empty node afterwards.
 	 */
-	BUG_ON(!workingset_node_shadows(node));
-	BUG_ON(workingset_node_pages(node));
+	if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+		goto out_invalid;
+	if (WARN_ON_ONCE(workingset_node_pages(node)))
+		goto out_invalid;
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
-			BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+			if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
+				goto out_invalid;
 			node->slots[i] = NULL;
 			workingset_node_shadows_dec(node);
-			BUG_ON(!mapping->nrexceptional);
+			if (WARN_ON_ONCE(!mapping->nrexceptional))
+				goto out_invalid;
 			mapping->nrexceptional--;
 		}
 	}
-	BUG_ON(workingset_node_shadows(node));
+	if (WARN_ON_ONCE(workingset_node_shadows(node)))
+		goto out_invalid;
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
-	if (!__radix_tree_delete_node(&mapping->page_tree, node))
-		BUG();
+	__radix_tree_delete_node(&mapping->page_tree, node);
 
+out_invalid:
 	spin_unlock(&mapping->tree_lock);
 	ret = LRU_REMOVED_RETRY;
 out:
-- 
2.10.0


* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
                   ` (4 preceding siblings ...)
  2016-10-19 17:24 ` [PATCH 5/5] mm: workingset: turn shadow node shrinker bugs into warnings Johannes Weiner
@ 2016-10-19 18:16 ` Linus Torvalds
  2016-10-24 18:47   ` Johannes Weiner
  5 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2016-10-19 18:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jan Kara, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Wed, Oct 19, 2016 at 10:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> These patches make the radix tree code explicitly support and track
> such special entries, to eliminate the subtleties and to restore the
> thrash detection for single-page files.

Ugh. I'm not a huge fan. The patches may be good and be a cleanup in
one respect, but they make one of my least favorite parts of the radix
tree code worse.

The radix tree "tag" thing is really horribly confusing, and part of
it is that there are two totally different "tags": the externally
visible tags (separate array), and the magical internal tags (low bits
of the node pointers that tag the pointers as internal or exceptional
entries).
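
To spell that out, the two look roughly like this today (simplified):

	/* "internal" tags: encoded in the low bits of the entry pointer */
	if (radix_tree_is_internal_node(entry))
		/* pointer to a child node */;
	if (radix_tree_exceptional_entry(entry))
		/* shadow/swap/DAX value, not a page */;

	/* "external" tags: per-slot bitmaps in every node, with the bits
	 * replicated at each level all the way up to the root
	 */
	struct radix_tree_node {
		...
		unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
	};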

And I think this series actually makes things even *more* complicated,
because now the radix tree itself uses one magical entry in the
externally visible tags for its own internal logic. So now it's
*really* messed up - the external tags aren't entirely external any
more.

Maybe I'm mis-reading it, and I'm just confused by the radix tree
implementation? But if so, it's just another sign of just how
confusing things are.

The external tag array itself is also somewhat nasty, in that it
spreads out the tag bits for one entry maximally (ie different bits
are in different words) so you can't even clear them together. I know
why - it makes both iterating over a specific tag and any_tag_set()
simpler, but it does seem confusing to me how we spread out the data
almost maximally.

I really would love to see somebody take a big look at the two
different tagging methods. If nothing else, explain it to me.

Because maybe this series is all great, and my objection is just that
it makes it even harder for me to understand the code.

For example, could we do this simplification:

 - get rid of RADIX_TREE_TAG_LONGS entirely
 - get rid of CONFIG_BASE_SMALL entirely
 - just say that the tag bitmap is one unsigned long
 - so RADIX_TREE_MAP_SIZE is just BITS_PER_LONG

and then at least we'd get rid of the double array and the confusion
about loops that are actually almost never loops. Because right now
RADIX_TREE_TAG_LONGS is usually 1, but is 2 if you're a 32-bit
platform with !CONFIG_BASE_SMALL. So you need those loops, but it all
looks almost entirely pointless.
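
Roughly this (untested sketch, details negotiable):

	#define RADIX_TREE_MAP_SHIFT	(BITS_PER_LONG == 64 ? 6 : 5)
	#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)

	struct radix_tree_node {
		...
		void __rcu	*slots[RADIX_TREE_MAP_SIZE];
		unsigned long	tags[RADIX_TREE_MAX_TAGS];	/* one word each */
	};

	static inline int any_tag_set(struct radix_tree_node *node,
				      unsigned int tag)
	{
		return node->tags[tag] != 0;
	}

so any_tag_set() and all the tag-copying loops collapse into
single-word operations.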

I just get the feeling that we already have unnecessary
complexity here, and that this patch series makes the code even more
impenetrable.

Comments?

                Linus


* Re: [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries
  2016-10-19 17:24 ` [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries Johannes Weiner
@ 2016-10-20 22:33   ` Dave Chinner
  2016-10-24 16:01     ` Johannes Weiner
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2016-10-20 22:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Jan Kara, Dave Jones, linux-mm,
	linux-kernel, kernel-team

On Wed, Oct 19, 2016 at 01:24:26PM -0400, Johannes Weiner wrote:
> Add an internal tag to identify special entries that are accounted in
> node->special in addition to node->count.
> 
> With this in place, the next patch can restore refault detection in
> single-page files. It will also move the shadow count from the upper
> bits of count to the new special counter, and then shrink count to a
> char as well; the growth of struct radix_tree_node is temporary.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/radix-tree.h | 10 ++++++----
>  lib/radix-tree.c           | 14 ++++++++++----
>  2 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 756b2909467e..2e1c9added23 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -68,7 +68,8 @@ enum radix_tree_tags {
>  	/* Freely allocatable radix tree user tags */
>  	RADIX_TREE_NR_USER_TAGS = 3,
>  	/* Radix tree internal tags */
> -	RADIX_TREE_NR_TAGS = RADIX_TREE_NR_USER_TAGS,
> +	RADIX_TREE_TAG_SPECIAL = RADIX_TREE_NR_USER_TAGS,
> +	RADIX_TREE_NR_TAGS,
>  };
>  
>  #ifndef RADIX_TREE_MAP_SHIFT
> @@ -90,9 +91,10 @@ enum radix_tree_tags {
>  #define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
>  
>  struct radix_tree_node {
> -	unsigned char	shift;	/* Bits remaining in each slot */
> -	unsigned char	offset;	/* Slot offset in parent */
> -	unsigned int	count;
> +	unsigned char	shift;		/* Bits remaining in each slot */
> +	unsigned char	offset;		/* Slot offset in parent */
> +	unsigned int	count;		/* Total entry count */
> +	unsigned char	special;	/* Special entry count */

How about putting the new char field into the implicit hole between
offset and count? pahole is your friend here:

struct radix_tree_node {
        unsigned char              shift;                /*     0     1 */
        unsigned char              offset;               /*     1     1 */

        /* XXX 2 bytes hole, try to pack */

        unsigned int               count;                /*     4     4 */
.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries
  2016-10-20 22:33   ` Dave Chinner
@ 2016-10-24 16:01     ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-24 16:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Linus Torvalds, Jan Kara, Dave Jones, linux-mm,
	linux-kernel, kernel-team

Hi Dave,

On Fri, Oct 21, 2016 at 09:33:08AM +1100, Dave Chinner wrote:
> On Wed, Oct 19, 2016 at 01:24:26PM -0400, Johannes Weiner wrote:
> > With this in place, the next patch can restore refault detection in
> > single-page files. It will also move the shadow count from the upper
> > bits of count to the new special counter, and then shrink count to a
> > char as well; the growth of struct radix_tree_node is temporary.

[...]

> > @@ -90,9 +91,10 @@ enum radix_tree_tags {
> >  #define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> >  
> >  struct radix_tree_node {
> > -	unsigned char	shift;	/* Bits remaining in each slot */
> > -	unsigned char	offset;	/* Slot offset in parent */
> > -	unsigned int	count;
> > +	unsigned char	shift;		/* Bits remaining in each slot */
> > +	unsigned char	offset;		/* Slot offset in parent */
> > +	unsigned int	count;		/* Total entry count */
> > +	unsigned char	special;	/* Special entry count */
> 
> How about putting the new char field into the implicit hole between
> offset and count? pahole is your friend here:
> 
> struct radix_tree_node {
>         unsigned char              shift;                /*     0     1 */
>         unsigned char              offset;               /*     1     1 */
> 
>         /* XXX 2 bytes hole, try to pack */
> 
>         unsigned int               count;                /*     4     4 */
> .....

The next patch shrinks `count' to an unsigned char as well, so the
hole is only temporary.

Thanks


* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-19 18:16 ` [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Linus Torvalds
@ 2016-10-24 18:47   ` Johannes Weiner
  2016-10-26  9:21     ` Jan Kara
  2016-10-26 18:18     ` Linus Torvalds
  0 siblings, 2 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-24 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jan Kara, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Wed, Oct 19, 2016 at 11:16:30AM -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2016 at 10:24 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > These patches make the radix tree code explicitly support and track
> > such special entries, to eliminate the subtleties and to restore the
> > thrash detection for single-page files.
> 
> Ugh. I'm not a huge fan. The patches may be good and be a cleanup in
> one respect, but they make one of my least favorite parts of the radix
> tree code worse.
> 
> The radix tree "tag" thing is really horribly confusing, and part of
> it is that there are two totally different "tags": the externally
> visible tags (separate array), and the magical internal tags (low bits
> of the node pointers that tag the pointers as internal or exceptional
> entries).
>
> And I think this series actually makes things even *more* complicated,
> because now the radix tree itself uses one magical entry in the
> externally visible tags for its own internal logic. So now it's
> *really* messed up - the external tags aren't entirely external any
> more.
> 
> Maybe I'm mis-reading it, and I'm just confused by the radix tree
> implementation? But if so, it's just another sign of just how
> confusing things are.

No, I think you're right. This is no good.

As I see it, the main distinction between the "tags" tags and the
lower bits in the entry pointers is that the former recurse down to
the root, and so they make lookup by tag very efficient (e.g. find
dirty pages to sync), whereas the pointer bits are cheaper when we
specifically operate on entries anyway and branch out to different
behavior depending on the type of entry (truncate a cache range,
extend tree, descend tree).
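
I.e., roughly (the identifiers are placeholders):

	/* tags: "find all entries of this kind", cheap because the
	 * per-node bitmaps are replicated up to the root
	 */
	nr = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages,
					index, nr_pages, PAGECACHE_TAG_DIRTY);

	/* pointer bits: cheap per-entry type check once we hold the entry */
	if (radix_tree_exceptional_entry(entry))
		/* shadow, swap or DAX value, not a struct page */;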

My patch violated that by adding a recursively-set "lookup" tag for
the single purpose of distinguishing entry types in root->rnode.

How about this instead: given that we already mark the shadow entries
exceptional, and the exceptional bit is part of the radix tree API,
can we just introduce a node->exceptional counter for those entries
and have the radix tree code assist us with that instead? It adds the
counting for non-shadow exceptional entries as well (shmem swap slots,
and DAX non-page entries), unfortunately, but this is way cleaner. It
also makes mapping->nrexceptional and node->exceptional consistent in
DAX (Jan, could you please double check the accounting there?)

What do you think? Lightly tested patch below.

> The external tag array itself is also somewhat nasty, in that it
> spreads out the tag bits for one entry maximally (ie different bits
> are in different words) so you can't even clear them together. I know
> why - it makes both iterating over a specific tag and any_tag_set()
> simpler, but it does seem confusing to me how we spread out the data
> almost maximally.
> 
> I really would love to see somebody take a big look at the two
> different tagging methods. If nothing else, explain it to me.
> 
> Because maybe this series is all great, and my objection is just that
> it makes it even harder for me to understand the code.
> 
> For example, could we do this simplification:
> 
>  - get rid of RADIX_TREE_TAG_LONGS entirely
>  - get rid of CONFIG_BASE_SMALL entirely
>  - just say that the tag bitmap is one unsigned long
>  - so RADIX_TREE_MAP_SIZE is just BITS_PER_LONG
> 
> and then at least we'd get rid of the double array and the confusion
> about loops that are actually almost never loops. Because right now
> RADIX_TREE_TAG_LONGS is usually 1, but is 2 if you're a 32-bit
> platform with !CONFIG_BASE_SMALL. So you need those loops, but it all
> looks almost entirely pointless.

AFAICS, !BASE_SMALL is the default unless you de-select BASE_FULL in
EXPERT, so that would cut the radix tree node in half on most 32 bit
setups and would more than double the tree nodes we have to allocate
(two where we had one, plus the additional intermediate levels).

The extra code sucks, but is that a cost we'd be willing to pay?

---

From 192c2589a5845d197f758045868844623e06b4db Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 17 Oct 2016 09:00:04 -0400
Subject: [PATCH] mm: workingset: restore single-page file refault tracking

Currently, we account shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we have to prevent shadow entries from going through operations where
the tree implementation relies on or modifies node->count: extending
and shrinking the tree from and to a single direct root->rnode entry.

As a consequence, we cannot store shadow entries for files that only
have index 0 populated, and thus cannot detect refaults from them,
which in turn degrades the thrashing compensation in LRU reclaim.

Another consequence is that we rely on subtleties throughout the radix
tree code, such as the node->count != 1 check in the shrinking code,
which is meant to exclude multi-entry nodes but also skips nodes with
only one shadow entry since they are accounted in the upper bits. This
is error prone, and has in fact caused the bug fixed in d3798ae8c6f3
("mm: filemap: don't plant shadow entries without radix tree node").

To fix this, this patch moves the shadow counter from the upper bits
of node->count into the new node->exceptional counter, where all
exceptional entries are explicitly tracked by the radix tree.
node->count then counts all tree entries again, including shadows.

Switching from a magic node->count to accounting exceptional entries
natively in the radix tree code removes the fragile subtleties
mentioned above. It also allows us to store shadow entries for
single-page files again, as the radix tree recognizes exceptional
entries when extending the tree from the root->rnode singleton, and
thus restore refault detection and thrashing compensation for them.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h | 11 ++++-------
 include/linux/swap.h       | 32 --------------------------------
 lib/radix-tree.c           | 16 +++++++++++++---
 mm/filemap.c               | 45 ++++++++++++++++++++-------------------------
 mm/truncate.c              |  6 +++---
 mm/workingset.c            | 13 ++++++++-----
 6 files changed, 48 insertions(+), 75 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af3581b8a451..2674ed31fa84 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -80,14 +80,11 @@ static inline bool radix_tree_is_internal_node(void *ptr)
 #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
 					  RADIX_TREE_MAP_SHIFT))
 
-/* Internally used bits of node->count */
-#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
-#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
-
 struct radix_tree_node {
-	unsigned char	shift;	/* Bits remaining in each slot */
-	unsigned char	offset;	/* Slot offset in parent */
-	unsigned int	count;
+	unsigned char	shift;		/* Bits remaining in each slot */
+	unsigned char	offset;		/* Slot offset in parent */
+	unsigned char	count;		/* Total entry count */
+	unsigned char	exceptional;	/* Exceptional entry count */
 	union {
 		struct {
 			/* Used when ascending tree */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523cefb9b..660a11de0186 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -248,38 +248,6 @@ bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
-static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
-{
-	return node->count & RADIX_TREE_COUNT_MASK;
-}
-
-static inline void workingset_node_pages_inc(struct radix_tree_node *node)
-{
-	node->count++;
-}
-
-static inline void workingset_node_pages_dec(struct radix_tree_node *node)
-{
-	VM_WARN_ON_ONCE(!workingset_node_pages(node));
-	node->count--;
-}
-
-static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
-{
-	return node->count >> RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
-{
-	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
-{
-	VM_WARN_ON_ONCE(!workingset_node_shadows(node));
-	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e6d552c40dd..c7d8452d755e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -220,10 +220,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
 {
 	unsigned long i;
 
-	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
+	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d exceptional %d parent %p\n",
 		node, node->offset,
 		node->tags[0][0], node->tags[1][0], node->tags[2][0],
-		node->shift, node->count, node->parent);
+		node->shift, node->count, node->exceptional, node->parent);
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		unsigned long first = index | (i << node->shift);
@@ -522,8 +522,14 @@ static int radix_tree_extend(struct radix_tree_root *root,
 		node->offset = 0;
 		node->count = 1;
 		node->parent = NULL;
-		if (radix_tree_is_internal_node(slot))
+		/* Extending an existing node or root->rnode */
+		if (radix_tree_is_internal_node(slot)) {
 			entry_to_node(slot)->parent = node;
+		} else {
+			/* Moving an exceptional root->rnode to a node */
+			if (radix_tree_exceptional_entry(slot))
+				node->exceptional = 1;
+		}
 		node->slots[0] = slot;
 		slot = node_to_entry(node);
 		rcu_assign_pointer(root->rnode, slot);
@@ -649,6 +655,8 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
 	if (node) {
 		unsigned offset = get_slot_offset(node, slot);
 		node->count++;
+		if (radix_tree_exceptional_entry(item))
+			node->exceptional++;
 		BUG_ON(tag_get(node, 0, offset));
 		BUG_ON(tag_get(node, 1, offset));
 		BUG_ON(tag_get(node, 2, offset));
@@ -1561,6 +1569,8 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
 	delete_sibling_entries(node, node_to_entry(slot), offset);
 	node->slots[offset] = NULL;
 	node->count--;
+	if (radix_tree_exceptional_entry(entry))
+		node->exceptional--;
 
 	__radix_tree_delete_node(root, node);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 849f459ad078..bf7d88b18374 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -128,20 +128,20 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
 
+		if (node) {
+			node->exceptional--;
+			node->count--;
+		}
 		mapping->nrexceptional--;
+
 		if (!dax_mapping(mapping)) {
 			if (shadowp)
 				*shadowp = p;
-			if (node)
-				workingset_node_shadows_dec(node);
 		} else {
 			/* DAX can replace empty locked entry with a hole */
 			WARN_ON_ONCE(p !=
 				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
 					 RADIX_DAX_ENTRY_LOCK));
-			/* DAX accounts exceptional entries as normal pages */
-			if (node)
-				workingset_node_pages_dec(node);
 			/* Wakeup waiters for exceptional entry lock */
 			dax_wake_mapping_entry_waiter(mapping, page->index,
 						      false);
@@ -150,7 +150,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
 	radix_tree_replace_slot(slot, page);
 	mapping->nrpages++;
 	if (node) {
-		workingset_node_pages_inc(node);
+		node->count++;
 		/*
 		 * Don't track node that contains actual pages.
 		 *
@@ -184,38 +184,33 @@ static void page_cache_tree_delete(struct address_space *mapping,
 
 		radix_tree_clear_tags(&mapping->page_tree, node, slot);
 
-		if (!node) {
-			VM_BUG_ON_PAGE(nr != 1, page);
-			/*
-			 * We need a node to properly account shadow
-			 * entries. Don't plant any without. XXX
-			 */
-			shadow = NULL;
-		}
-
 		radix_tree_replace_slot(slot, shadow);
 
-		if (!node)
+		if (!node) {
+			VM_BUG_ON_PAGE(nr != 1, page);
 			break;
+		}
 
-		workingset_node_pages_dec(node);
-		if (shadow)
-			workingset_node_shadows_inc(node);
-		else
+		if (shadow) {
+			node->exceptional++;
+		} else {
+			node->count--;
 			if (__radix_tree_delete_node(&mapping->page_tree, node))
 				continue;
+		}
 
 		/*
-		 * Track node that only contains shadow entries. DAX mappings
-		 * contain no shadow entries and may contain other exceptional
-		 * entries so skip those.
+		 * Track node that only contains shadow entries. DAX and SHMEM
+		 * mappings contain no shadow entries and may contain other
+		 * exceptional entries so skip those.
 		 *
 		 * Avoid acquiring the list_lru lock if already tracked.
 		 * The list_empty() test is safe as node->private_list is
 		 * protected by mapping->tree_lock.
 		 */
-		if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
-				list_empty(&node->private_list)) {
+		if (!dax_mapping(mapping) && !shmem_mapping(mapping) &&
+		    node->count == node->exceptional &&
+		    list_empty(&node->private_list)) {
 			node->private_data = mapping;
 			list_lru_add(&workingset_shadow_nodes,
 					&node->private_list);
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..b9b2a1b42822 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -53,7 +53,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	mapping->nrexceptional--;
 	if (!node)
 		goto unlock;
-	workingset_node_shadows_dec(node);
+	node->exceptional--;
+	node->count--;
 	/*
 	 * Don't track node without shadow entries.
 	 *
@@ -61,8 +62,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	 * The list_empty() test is safe as node->private_list is
 	 * protected by mapping->tree_lock.
 	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
+	if (!node->exceptional && !list_empty(&node->private_list))
 		list_lru_del(&workingset_shadow_nodes,
 				&node->private_list);
 	__radix_tree_delete_node(&mapping->page_tree, node);
diff --git a/mm/workingset.c b/mm/workingset.c
index 5f07db171c03..dfaec27aedd3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,23 +418,26 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	 * no pages, so we expect to be able to remove them all and
 	 * delete and free the empty node afterwards.
 	 */
-	if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+	if (WARN_ON_ONCE(!node->exceptional))
 		goto out_invalid;
-	if (WARN_ON_ONCE(workingset_node_pages(node)))
+	if (WARN_ON_ONCE(node->count != node->exceptional))
 		goto out_invalid;
 
 	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
 		if (node->slots[i]) {
 			if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
 				goto out_invalid;
-			node->slots[i] = NULL;
-			workingset_node_shadows_dec(node);
+			if (WARN_ON_ONCE(!node->exceptional))
+				goto out_invalid;
 			if (WARN_ON_ONCE(!mapping->nrexceptional))
 				goto out_invalid;
+			node->slots[i] = NULL;
+			node->exceptional--;
+			node->count--;
 			mapping->nrexceptional--;
 		}
 	}
-	if (WARN_ON_ONCE(workingset_node_shadows(node)))
+	if (WARN_ON_ONCE(node->exceptional))
 		goto out_invalid;
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
 	__radix_tree_delete_node(&mapping->page_tree, node);
-- 
2.10.0


* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-24 18:47   ` Johannes Weiner
@ 2016-10-26  9:21     ` Jan Kara
  2016-10-26 18:15       ` Johannes Weiner
  2016-10-26 18:18     ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2016-10-26  9:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, Jan Kara, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> 
> ---
> 
> From 192c2589a5845d197f758045868844623e06b4db Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 17 Oct 2016 09:00:04 -0400
> Subject: [PATCH] mm: workingset: restore single-page file refault tracking
> 
> Currently, we account shadow entries in the page cache in the upper
> bits of the radix_tree_node->count, behind the back of the radix tree
> implementation. Because the radix tree code has no awareness of them,
> we have to prevent shadow entries from going through operations where
> the tree implementation relies on or modifies node->count: extending
> and shrinking the tree from and to a single direct root->rnode entry.
> 
> As a consequence, we cannot store shadow entries for files that only
> have index 0 populated, and thus cannot detect refaults from them,
> which in turn degrades the thrashing compensation in LRU reclaim.
> 
> Another consequence is that we rely on subtleties throughout the radix
> tree code, such as the node->count != 1 check in the shrinking code,
> which is meant to exclude multi-entry nodes but also skips nodes with
> only one shadow entry since they are accounted in the upper bits. This
> is error prone, and has in fact caused the bug fixed in d3798ae8c6f3
> ("mm: filemap: don't plant shadow entries without radix tree node").
> 
> To fix this, this patch moves the shadow counter from the upper bits
> of node->count into the new node->exceptional counter, where all
> exceptional entries are explicitly tracked by the radix tree.
> node->count then counts all tree entries again, including shadows.
> 
> Switching from a magic node->count to accounting exceptional entries
> natively in the radix tree code removes the fragile subtleties
> mentioned above. It also allows us to store shadow entries for
> single-page files again, as the radix tree recognizes exceptional
> entries when extending the tree from the root->rnode singleton, and
> thus restore refault detection and thrashing compensation for them.

I like this solution. Just one suggestion: I think
radix_tree_replace_slot() can now do the node counter update on its own and
that would save us having to do quite a bit of accounting outside of the
radix tree code itself and it would be less prone to bugs (forgotten
updates of a counter). What do you think?
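
(For illustration, a minimal sketch of the kind of accounting-aware
replace helper being suggested here -- the name and exact shape are
hypothetical, not from the posted series, and it assumes it sits in
lib/radix-tree.c where the node internals are visible:)

	static void __radix_tree_replace(struct radix_tree_node *node,
					 void **slot, void *item)
	{
		void *old = rcu_dereference_raw(*slot);

		if (node) {
			/* adjust counters from the old vs. new entry type */
			node->count += !!item - !!old;
			node->exceptional +=
				!!radix_tree_exceptional_entry(item) -
				!!radix_tree_exceptional_entry(old);
		}
		rcu_assign_pointer(*slot, item);
	}

Callers that already hold the node from __radix_tree_lookup() or
__radix_tree_create() could then drop their open-coded node->count and
node->exceptional updates.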

								Honza

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/radix-tree.h | 11 ++++-------
>  include/linux/swap.h       | 32 --------------------------------
>  lib/radix-tree.c           | 16 +++++++++++++---
>  mm/filemap.c               | 45 ++++++++++++++++++++-------------------------
>  mm/truncate.c              |  6 +++---
>  mm/workingset.c            | 13 ++++++++-----
>  6 files changed, 48 insertions(+), 75 deletions(-)
> 
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index af3581b8a451..2674ed31fa84 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -80,14 +80,11 @@ static inline bool radix_tree_is_internal_node(void *ptr)
>  #define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
>  					  RADIX_TREE_MAP_SHIFT))
>  
> -/* Internally used bits of node->count */
> -#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
> -#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
> -
>  struct radix_tree_node {
> -	unsigned char	shift;	/* Bits remaining in each slot */
> -	unsigned char	offset;	/* Slot offset in parent */
> -	unsigned int	count;
> +	unsigned char	shift;		/* Bits remaining in each slot */
> +	unsigned char	offset;		/* Slot offset in parent */
> +	unsigned char	count;		/* Total entry count */
> +	unsigned char	exceptional;	/* Exceptional entry count */
>  	union {
>  		struct {
>  			/* Used when ascending tree */
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a56523cefb9b..660a11de0186 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -248,38 +248,6 @@ bool workingset_refault(void *shadow);
>  void workingset_activation(struct page *page);
>  extern struct list_lru workingset_shadow_nodes;
>  
> -static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
> -{
> -	return node->count & RADIX_TREE_COUNT_MASK;
> -}
> -
> -static inline void workingset_node_pages_inc(struct radix_tree_node *node)
> -{
> -	node->count++;
> -}
> -
> -static inline void workingset_node_pages_dec(struct radix_tree_node *node)
> -{
> -	VM_WARN_ON_ONCE(!workingset_node_pages(node));
> -	node->count--;
> -}
> -
> -static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
> -{
> -	return node->count >> RADIX_TREE_COUNT_SHIFT;
> -}
> -
> -static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
> -{
> -	node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> -}
> -
> -static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
> -{
> -	VM_WARN_ON_ONCE(!workingset_node_shadows(node));
> -	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> -}
> -
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 8e6d552c40dd..c7d8452d755e 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -220,10 +220,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
>  {
>  	unsigned long i;
>  
> -	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
> +	pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d exceptional %d parent %p\n",
>  		node, node->offset,
>  		node->tags[0][0], node->tags[1][0], node->tags[2][0],
> -		node->shift, node->count, node->parent);
> +		node->shift, node->count, node->exceptional, node->parent);
>  
>  	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
>  		unsigned long first = index | (i << node->shift);
> @@ -522,8 +522,14 @@ static int radix_tree_extend(struct radix_tree_root *root,
>  		node->offset = 0;
>  		node->count = 1;
>  		node->parent = NULL;
> -		if (radix_tree_is_internal_node(slot))
> +		/* Extending an existing node or root->rnode */
> +		if (radix_tree_is_internal_node(slot)) {
>  			entry_to_node(slot)->parent = node;
> +		} else {
> +			/* Moving an exceptional root->rnode to a node */
> +			if (radix_tree_exceptional_entry(slot))
> +				node->exceptional = 1;
> +		}
>  		node->slots[0] = slot;
>  		slot = node_to_entry(node);
>  		rcu_assign_pointer(root->rnode, slot);
> @@ -649,6 +655,8 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
>  	if (node) {
>  		unsigned offset = get_slot_offset(node, slot);
>  		node->count++;
> +		if (radix_tree_exceptional_entry(item))
> +			node->exceptional++;
>  		BUG_ON(tag_get(node, 0, offset));
>  		BUG_ON(tag_get(node, 1, offset));
>  		BUG_ON(tag_get(node, 2, offset));
> @@ -1561,6 +1569,8 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
>  	delete_sibling_entries(node, node_to_entry(slot), offset);
>  	node->slots[offset] = NULL;
>  	node->count--;
> +	if (radix_tree_exceptional_entry(entry))
> +		node->exceptional--;
>  
>  	__radix_tree_delete_node(root, node);
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 849f459ad078..bf7d88b18374 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -128,20 +128,20 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  		if (!radix_tree_exceptional_entry(p))
>  			return -EEXIST;
>  
> +		if (node) {
> +			node->exceptional--;
> +			node->count--;
> +		}
>  		mapping->nrexceptional--;
> +
>  		if (!dax_mapping(mapping)) {
>  			if (shadowp)
>  				*shadowp = p;
> -			if (node)
> -				workingset_node_shadows_dec(node);
>  		} else {
>  			/* DAX can replace empty locked entry with a hole */
>  			WARN_ON_ONCE(p !=
>  				(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
>  					 RADIX_DAX_ENTRY_LOCK));
> -			/* DAX accounts exceptional entries as normal pages */
> -			if (node)
> -				workingset_node_pages_dec(node);
>  			/* Wakeup waiters for exceptional entry lock */
>  			dax_wake_mapping_entry_waiter(mapping, page->index,
>  						      false);
> @@ -150,7 +150,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
>  	radix_tree_replace_slot(slot, page);
>  	mapping->nrpages++;
>  	if (node) {
> -		workingset_node_pages_inc(node);
> +		node->count++;
>  		/*
>  		 * Don't track node that contains actual pages.
>  		 *
> @@ -184,38 +184,33 @@ static void page_cache_tree_delete(struct address_space *mapping,
>  
>  		radix_tree_clear_tags(&mapping->page_tree, node, slot);
>  
> -		if (!node) {
> -			VM_BUG_ON_PAGE(nr != 1, page);
> -			/*
> -			 * We need a node to properly account shadow
> -			 * entries. Don't plant any without. XXX
> -			 */
> -			shadow = NULL;
> -		}
> -
>  		radix_tree_replace_slot(slot, shadow);
>  
> -		if (!node)
> +		if (!node) {
> +			VM_BUG_ON_PAGE(nr != 1, page);
>  			break;
> +		}
>  
> -		workingset_node_pages_dec(node);
> -		if (shadow)
> -			workingset_node_shadows_inc(node);
> -		else
> +		if (shadow) {
> +			node->exceptional++;
> +		} else {
> +			node->count--;
>  			if (__radix_tree_delete_node(&mapping->page_tree, node))
>  				continue;
> +		}
>  
>  		/*
> -		 * Track node that only contains shadow entries. DAX mappings
> -		 * contain no shadow entries and may contain other exceptional
> -		 * entries so skip those.
> +		 * Track node that only contains shadow entries. DAX and SHMEM
> +		 * mappings contain no shadow entries and may contain other
> +		 * exceptional entries so skip those.
>  		 *
>  		 * Avoid acquiring the list_lru lock if already tracked.
>  		 * The list_empty() test is safe as node->private_list is
>  		 * protected by mapping->tree_lock.
>  		 */
> -		if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
> -				list_empty(&node->private_list)) {
> +		if (!dax_mapping(mapping) && !shmem_mapping(mapping) &&
> +		    node->count == node->exceptional &&
> +		    list_empty(&node->private_list)) {
>  			node->private_data = mapping;
>  			list_lru_add(&workingset_shadow_nodes,
>  					&node->private_list);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index a01cce450a26..b9b2a1b42822 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -53,7 +53,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	mapping->nrexceptional--;
>  	if (!node)
>  		goto unlock;
> -	workingset_node_shadows_dec(node);
> +	node->exceptional--;
> +	node->count--;
>  	/*
>  	 * Don't track node without shadow entries.
>  	 *
> @@ -61,8 +62,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  	 * The list_empty() test is safe as node->private_list is
>  	 * protected by mapping->tree_lock.
>  	 */
> -	if (!workingset_node_shadows(node) &&
> -	    !list_empty(&node->private_list))
> +	if (!node->exceptional && !list_empty(&node->private_list))
>  		list_lru_del(&workingset_shadow_nodes,
>  				&node->private_list);
>  	__radix_tree_delete_node(&mapping->page_tree, node);
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 5f07db171c03..dfaec27aedd3 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -418,23 +418,26 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
>  	 * no pages, so we expect to be able to remove them all and
>  	 * delete and free the empty node afterwards.
>  	 */
> -	if (WARN_ON_ONCE(!workingset_node_shadows(node)))
> +	if (WARN_ON_ONCE(!node->exceptional))
>  		goto out_invalid;
> -	if (WARN_ON_ONCE(workingset_node_pages(node)))
> +	if (WARN_ON_ONCE(node->count != node->exceptional))
>  		goto out_invalid;
>  
>  	for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
>  		if (node->slots[i]) {
>  			if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
>  				goto out_invalid;
> -			node->slots[i] = NULL;
> -			workingset_node_shadows_dec(node);
> +			if (WARN_ON_ONCE(!node->exceptional))
> +				goto out_invalid;
>  			if (WARN_ON_ONCE(!mapping->nrexceptional))
>  				goto out_invalid;
> +			node->slots[i] = NULL;
> +			node->exceptional--;
> +			node->count--;
>  			mapping->nrexceptional--;
>  		}
>  	}
> -	if (WARN_ON_ONCE(workingset_node_shadows(node)))
> +	if (WARN_ON_ONCE(node->exceptional))
>  		goto out_invalid;
>  	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
>  	__radix_tree_delete_node(&mapping->page_tree, node);
> -- 
> 2.10.0
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-26  9:21     ` Jan Kara
@ 2016-10-26 18:15       ` Johannes Weiner
  2016-10-27  8:48         ` Jan Kara
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Weiner @ 2016-10-26 18:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Wed, Oct 26, 2016 at 11:21:07AM +0200, Jan Kara wrote:
> On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 17 Oct 2016 09:00:04 -0400
> > Subject: [PATCH] mm: workingset: restore single-page file refault tracking
> > 
> > Currently, we account shadow entries in the page cache in the upper
> > bits of the radix_tree_node->count, behind the back of the radix tree
> > implementation. Because the radix tree code has no awareness of them,
> > we have to prevent shadow entries from going through operations where
> > the tree implementation relies on or modifies node->count: extending
> > and shrinking the tree from and to a single direct root->rnode entry.
> > 
> > As a consequence, we cannot store shadow entries for files that only
> > have index 0 populated, and thus cannot detect refaults from them,
> > which in turn degrades the thrashing compensation in LRU reclaim.
> > 
> > Another consequence is that we rely on subtleties throughout the radix
> > tree code, such as the node->count != 1 check in the shrinking code,
> > which is meant to exclude multi-entry nodes but also skips nodes with
> > only one shadow entry since they are accounted in the upper bits. This
> > is error prone, and has in fact caused the bug fixed in d3798ae8c6f3
> > ("mm: filemap: don't plant shadow entries without radix tree node").
> > 
> > To fix this, this patch moves the shadow counter from the upper bits
> > of node->count into the new node->exceptional counter, where all
> > exceptional entries are explicitly tracked by the radix tree.
> > node->count then counts all tree entries again, including shadows.
> > 
> > Switching from a magic node->count to accounting exceptional entries
> > natively in the radix tree code removes the fragile subtleties
> > mentioned above. It also allows us to store shadow entries for
> > single-page files again, as the radix tree recognizes exceptional
> > entries when extending the tree from the root->rnode singleton, and
> > thus restore refault detection and thrashing compensation for them.
> 
> I like this solution.

Thanks Jan.

> Just one suggestion: I think radix_tree_replace_slot() can now do
> the node counter update on its own and that would save us having to
> do quite a bit of accounting outside of the radix tree code itself
> and it would be less prone to bugs (forgotten updates of a
> counter). What do you think?

This would be nice indeed, but it's bigger surgery. We need the node
in the context of existing users that do slot lookup and replacement,
which is easier for individual lookups, and harder for gang lookups
(e.g. drivers/sh/intc/virq.c::intc_subgroup_map). And they'd all get
more complicated, AFAICS, without even using exceptional entries.
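
(To make the asymmetry concrete, an illustrative fragment only, using
the existing lookup APIs; the function name is just for the example and
nothing below is from the posted series. __radix_tree_lookup() already
hands the node back for single-slot users, while the gang-lookup API
returns bare slot pointers, so an accounting-aware replace helper would
have no node to work against there:)

	static unsigned int example_lookups(struct address_space *mapping,
					    pgoff_t index)
	{
		struct radix_tree_node *node;
		void **slot, **slots[16];

		/* individual lookup: the node comes back with the slot */
		__radix_tree_lookup(&mapping->page_tree, index, &node, &slot);

		/* gang lookup: only slot pointers, no owning nodes */
		return radix_tree_gang_lookup_slot(&mapping->page_tree, slots,
						   NULL, index, 16);
	}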

I'll see if I can find a nice way to do it, but any ideas are welcome.

Thanks

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-24 18:47   ` Johannes Weiner
  2016-10-26  9:21     ` Jan Kara
@ 2016-10-26 18:18     ` Linus Torvalds
  2016-10-26 18:29       ` Johannes Weiner
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2016-10-26 18:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jan Kara, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Mon, Oct 24, 2016 at 11:47 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> How about this instead: given that we already mark the shadow entries
> exceptional, and the exceptional bit is part of the radix tree API,
> can we just introduce a node->exceptional counter for those entries
> and have the radix tree code assist us with that instead? It adds the
> counting for non-shadow exceptional entries as well (shmem swap slots,
> and DAX non-page entries), unfortunately, but this is way cleaner. It
> also makes mapping->nrexceptional and node->exceptional consistent in
> DAX (Jan, could you please double check the accounting there?)
>
> What do you think? Lightly tested patch below.

This certainly looks way better to me. I didn't *test* it, but it
doesn't make me scratch my head the way your previous patch did.

               Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-26 18:18     ` Linus Torvalds
@ 2016-10-26 18:29       ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2016-10-26 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Jan Kara, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Wed, Oct 26, 2016 at 11:18:52AM -0700, Linus Torvalds wrote:
> On Mon, Oct 24, 2016 at 11:47 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > How about this instead: given that we already mark the shadow entries
> > exceptional, and the exceptional bit is part of the radix tree API,
> > can we just introduce a node->exceptional counter for those entries
> > and have the radix tree code assist us with that instead? It adds the
> > counting for non-shadow exceptional entries as well (shmem swap slots,
> > and DAX non-page entries), unfortunately, but this is way cleaner. It
> > also makes mapping->nrexceptional and node->exceptional consistent in
> > DAX (Jan, could you please double check the accounting there?)
> >
> > What do you think? Lightly tested patch below.
> 
> This certainly looks way better to me. I didn't *test* it, but it
> doesn't make me scratch my head the way your previous patch did.

Awesome, thanks. I'll continue to beat on this for a while and then
send it on to Andrew.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults
  2016-10-26 18:15       ` Johannes Weiner
@ 2016-10-27  8:48         ` Jan Kara
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kara @ 2016-10-27  8:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jan Kara, Linus Torvalds, Andrew Morton, Dave Jones, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Wed 26-10-16 14:15:09, Johannes Weiner wrote:
> On Wed, Oct 26, 2016 at 11:21:07AM +0200, Jan Kara wrote:
> > On Mon 24-10-16 14:47:39, Johannes Weiner wrote:
> > > From: Johannes Weiner <hannes@cmpxchg.org>
> > > Date: Mon, 17 Oct 2016 09:00:04 -0400
> > > Subject: [PATCH] mm: workingset: restore single-page file refault tracking
> > > 
> > > Currently, we account shadow entries in the page cache in the upper
> > > bits of the radix_tree_node->count, behind the back of the radix tree
> > > implementation. Because the radix tree code has no awareness of them,
> > > we have to prevent shadow entries from going through operations where
> > > the tree implementation relies on or modifies node->count: extending
> > > and shrinking the tree from and to a single direct root->rnode entry.
> > > 
> > > As a consequence, we cannot store shadow entries for files that only
> > > have index 0 populated, and thus cannot detect refaults from them,
> > > which in turn degrades the thrashing compensation in LRU reclaim.
> > > 
> > > Another consequence is that we rely on subtleties throughout the radix
> > > tree code, such as the node->count != 1 check in the shrinking code,
> > > which is meant to exclude multi-entry nodes but also skips nodes with
> > > only one shadow entry since they are accounted in the upper bits. This
> > > is error prone, and has in fact caused the bug fixed in d3798ae8c6f3
> > > ("mm: filemap: don't plant shadow entries without radix tree node").
> > > 
> > > To fix this, this patch moves the shadow counter from the upper bits
> > > of node->count into the new node->exceptional counter, where all
> > > exceptional entries are explicitly tracked by the radix tree.
> > > node->count then counts all tree entries again, including shadows.
> > > 
> > > Switching from a magic node->count to accounting exceptional entries
> > > natively in the radix tree code removes the fragile subtleties
> > > mentioned above. It also allows us to store shadow entries for
> > > single-page files again, as the radix tree recognizes exceptional
> > > entries when extending the tree from the root->rnode singleton, and
> > > thus restore refault detection and thrashing compensation for them.
> > 
> > I like this solution.
> 
> Thanks Jan.
> 
> > Just one suggestion: I think radix_tree_replace_slot() can now do
> > the node counter update on its own and that would save us having to
> > do quite a bit of accounting outside of the radix tree code itself
> > and it would be less prone to bugs (forgotten updates of a
> > counter). What do you think?
> 
> This would be nice indeed, but it's bigger surgery. We need the node
> in the context of existing users that do slot lookup and replacement,
> which is easier for individual lookups, and harder for gang lookups
> (e.g. drivers/sh/intc/virq.c::intc_subgroup_map). And they'd all get
> more complicated, AFAICS, without even using exceptional entries.

Hum, I agree. But actually, looking at e.g. the usage of
radix_tree_replace_slot() in mm/khugepaged.c, I think that code is even
buggy: it replaces a slot referencing a page with NULL without updating
node->count. It is in the error bail-out path, so I'm not surprised we
did not stumble over it.

So a relatively easy solution looks like this: create a new function,
say radix_tree_replace_node_slot(), that takes both the node and the
slot and updates the accounting information as needed. Make this
function WARN if node is NULL but the accounting information would need
updating. Make the original radix_tree_replace_slot() a wrapper around
radix_tree_replace_node_slot() that passes NULL as the node.

This should provide safe accounting info updates without forcing all users
to work with node pointers...
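
(A rough sketch of that shape, with hypothetical names, just to make
the idea concrete -- assuming it sits next to the current
radix_tree_replace_slot() in include/linux/radix-tree.h:)

	static inline void
	radix_tree_replace_node_slot(struct radix_tree_node *node,
				     void **slot, void *item)
	{
		void *old = rcu_dereference_raw(*slot);
		int count = !!item - !!old;
		int exceptional = !!radix_tree_exceptional_entry(item) -
				  !!radix_tree_exceptional_entry(old);

		/* without a node there is no accounting we could fix up */
		WARN_ON_ONCE(!node && (count || exceptional));

		if (node) {
			node->count += count;
			node->exceptional += exceptional;
		}
		rcu_assign_pointer(*slot, item);
	}

	static inline void radix_tree_replace_slot(void **slot, void *item)
	{
		radix_tree_replace_node_slot(NULL, slot, item);
	}

That way a caller that passes no node, like the khugepaged error path
above, would at least trip the WARN instead of silently leaving
node->count stale.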

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-10-27  9:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-19 17:24 [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Johannes Weiner
2016-10-19 17:24 ` [PATCH 1/5] lib: radix-tree: provide node-granular interface for radix tree tags Johannes Weiner
2016-10-19 17:24 ` [PATCH 2/5] lib: radix-tree: internal tags Johannes Weiner
2016-10-19 17:24 ` [PATCH 3/5] lib: radix-tree: native accounting and tracking of special entries Johannes Weiner
2016-10-20 22:33   ` Dave Chinner
2016-10-24 16:01     ` Johannes Weiner
2016-10-19 17:24 ` [PATCH 4/5] mm: workingset: restore single-page file refault tracking Johannes Weiner
2016-10-19 17:24 ` [PATCH 5/5] mm: workingset: turn shadow node shrinker bugs into warnings Johannes Weiner
2016-10-19 18:16 ` [PATCH 0/5] mm: workingset: radix tree subtleties & single-page file refaults Linus Torvalds
2016-10-24 18:47   ` Johannes Weiner
2016-10-26  9:21     ` Jan Kara
2016-10-26 18:15       ` Johannes Weiner
2016-10-27  8:48         ` Jan Kara
2016-10-26 18:18     ` Linus Torvalds
2016-10-26 18:29       ` Johannes Weiner
