* [PATCH 0/2] Use multi-index entries in the page cache
@ 2020-06-29 15:20 Matthew Wilcox (Oracle)
  2020-06-29 15:20 ` [PATCH 1/2] XArray: Add xas_split Matthew Wilcox (Oracle)
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-29 15:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, Andrew Morton
  Cc: Matthew Wilcox (Oracle), Hugh Dickins

Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
but it turns out to be hard to support range writeback with the pagecache
using multiple single entries.  Of course, this isn't a problem for
shmem because it doesn't have a real backing device (only swap), but
real filesystems need to start writeback at arbitrary file offsets and
have problems if we don't notice that the first (huge) page in the range
is dirty.
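
To illustrate the property that range writeback relies on, here is a
minimal XArray-level sketch (not part of the series; test-harness style,
assuming CONFIG_XARRAY_MULTI, with xa pointing to an empty XArray,
XA_MARK_0 standing in for the dirty tag and a value entry standing in
for a page):

	XA_STATE_ORDER(xas, xa, 0x200, 9);	/* one order-9 entry at index 512 */

	do {
		xas_lock(&xas);
		xas_store(&xas, xa_mk_value(42));	/* covers 512..1023 */
		xas_unlock(&xas);
	} while (xas_nomem(&xas, GFP_KERNEL));

	xa_set_mark(xa, 0x200, XA_MARK_0);	/* "dirty" the entry */

	/* a lookup anywhere in the range sees the entry and its mark */
	XA_BUG_ON(xa, xa_load(xa, 0x300) != xa_mk_value(42));
	XA_BUG_ON(xa, !xa_get_mark(xa, 0x300, XA_MARK_0));

With 2^N single entries, only index 0x200 carries the mark, so the
equivalent check at 0x300 comes back false; that is the writeback
problem described above.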

Hugh, I would love it if you could test this.  It didn't introduce any new
regressions to the xfstests, but shmem does exercise different paths and
of course I don't have a clean xfstests run yet, so there could easily
still be bugs.

I'd like this to be included in mmotm, but it is less urgent than the
previous patch series that I've sent.  As far as risk, I think it only
affects shmem/tmpfs.

Matthew Wilcox (Oracle) (2):
  XArray: Add xas_split
  mm: Use multi-index entries in the page cache

 Documentation/core-api/xarray.rst |  16 ++--
 include/linux/xarray.h            |   2 +
 lib/test_xarray.c                 |  41 ++++++++
 lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
 mm/filemap.c                      |  42 ++++----
 mm/huge_memory.c                  |  21 +++-
 mm/khugepaged.c                   |  12 ++-
 mm/shmem.c                        |  11 +--
 8 files changed, 245 insertions(+), 53 deletions(-)

-- 
2.27.0


* [PATCH 1/2] XArray: Add xas_split
  2020-06-29 15:20 [PATCH 0/2] Use multi-index entries in the page cache Matthew Wilcox (Oracle)
@ 2020-06-29 15:20 ` Matthew Wilcox (Oracle)
  2020-06-29 17:18   ` William Kucharski
  2020-06-29 15:20 ` [PATCH 2/2] mm: Use multi-index entries in the page cache Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-29 15:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, Andrew Morton
  Cc: Matthew Wilcox (Oracle), Hugh Dickins

In order to use multi-index entries for huge pages in the page cache,
we need to be able to split a multi-index entry (eg if a file is
truncated in the middle of a huge page entry).  This version does not
support splitting more than one level of the tree at a time.  This is an
acceptable limitation for the page cache as we do not expect to support
order-12 pages in the near future.
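
For illustration, the intended calling convention (matching the
check_split_1() test added below; mapping, index, page and order are
placeholders here, with order being the order of the existing entry,
split all the way down to order 0):

	XA_STATE_ORDER(xas, &mapping->i_pages, index, order);

	/* preallocate the replacement nodes without the lock held */
	xas_split_alloc(&xas, page, 0, GFP_KERNEL);
	if (xas_error(&xas))
		return xas_error(&xas);		/* -ENOMEM */

	xas_lock(&xas);
	xas_split(&xas, page, 0);	/* now 1 << order separate entries */
	xas_unlock(&xas);

A caller which may end up not splitting can release an unused
preallocation afterwards with xas_nomem(&xas, 0), as the huge_memory.c
path in the next patch does.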

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/core-api/xarray.rst |  16 ++--
 include/linux/xarray.h            |   2 +
 lib/test_xarray.c                 |  41 ++++++++
 lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
 4 files changed, 196 insertions(+), 16 deletions(-)

diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index 640934b6f7b4..a137a0e6d068 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -475,13 +475,15 @@ or iterations will move the index to the first index in the range.
 Each entry will only be returned once, no matter how many indices it
 occupies.
 
-Using xas_next() or xas_prev() with a multi-index xa_state
-is not supported.  Using either of these functions on a multi-index entry
-will reveal sibling entries; these should be skipped over by the caller.
-
-Storing ``NULL`` into any index of a multi-index entry will set the entry
-at every index to ``NULL`` and dissolve the tie.  Splitting a multi-index
-entry into entries occupying smaller ranges is not yet supported.
+Using xas_next() or xas_prev() with a multi-index xa_state is not
+supported.  Using either of these functions on a multi-index entry will
+reveal sibling entries; these should be skipped over by the caller.
+
+Storing ``NULL`` into any index of a multi-index entry will set the
+entry at every index to ``NULL`` and dissolve the tie.  A multi-index
+entry can be split into entries occupying smaller ranges by calling
+xas_split_alloc() without the xa_lock held, followed by taking the lock
+and calling xas_split().
 
 Functions and structures
 ========================
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index b4d70e7568b2..fd4364e17632 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1491,6 +1491,8 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry)
 
 void *xas_load(struct xa_state *);
 void *xas_store(struct xa_state *, void *entry);
+void xas_split(struct xa_state *, void *entry, unsigned int order);
+void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
 void *xas_find(struct xa_state *, unsigned long max);
 void *xas_find_conflict(struct xa_state *);
 
diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index acf5ccba58b0..282cce959aba 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1525,6 +1525,46 @@ static noinline void check_store_range(struct xarray *xa)
 	}
 }
 
+#ifdef CONFIG_XARRAY_MULTI
+static void check_split_1(struct xarray *xa, unsigned long index,
+							unsigned int order)
+{
+	XA_STATE_ORDER(xas, xa, index, order);
+	void *entry;
+	unsigned int i = 0;
+
+	xa_store_order(xa, index, order, xa, GFP_KERNEL);
+
+	xas_split_alloc(&xas, xa, 0, GFP_KERNEL);
+	xas_lock(&xas);
+	xas_split(&xas, xa, 0);
+	xas_unlock(&xas);
+
+	xa_for_each(xa, index, entry) {
+		XA_BUG_ON(xa, entry != xa);
+		i++;
+	}
+	XA_BUG_ON(xa, i != 1 << order);
+
+	xa_destroy(xa);
+}
+
+static noinline void check_split(struct xarray *xa)
+{
+	unsigned int order;
+
+	XA_BUG_ON(xa, !xa_empty(xa));
+
+	for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {
+		check_split_1(xa, 0, order);
+		check_split_1(xa, 1UL << order, order);
+		check_split_1(xa, 3UL << order, order);
+	}
+}
+#else
+static void check_split(struct xarray *xa) { }
+#endif
+
 static void check_align_1(struct xarray *xa, char *name)
 {
 	int i;
@@ -1730,6 +1770,7 @@ static int xarray_checks(void)
 	check_store_range(&array);
 	check_store_iter(&array);
 	check_align(&xa0);
+	check_split(&array);
 
 	check_workingset(&array, 0);
 	check_workingset(&array, 64);
diff --git a/lib/xarray.c b/lib/xarray.c
index e9e641d3c0c3..faca1ac95b0e 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -266,13 +266,15 @@ static void xa_node_free(struct xa_node *node)
  */
 static void xas_destroy(struct xa_state *xas)
 {
-	struct xa_node *node = xas->xa_alloc;
+	struct xa_node *next, *node = xas->xa_alloc;
 
-	if (!node)
-		return;
-	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
-	kmem_cache_free(radix_tree_node_cachep, node);
-	xas->xa_alloc = NULL;
+	while (node) {
+		XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
+		next = rcu_dereference_raw(node->parent);
+		/* XXX: need to free children */
+		kmem_cache_free(radix_tree_node_cachep, node);
+		xas->xa_alloc = node = next;
+	}
 }
 
 /**
@@ -304,6 +306,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
 	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
 	if (!xas->xa_alloc)
 		return false;
+	xas->xa_alloc->parent = NULL;
 	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
 	xas->xa_node = XAS_RESTART;
 	return true;
@@ -339,6 +342,7 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
 	}
 	if (!xas->xa_alloc)
 		return false;
+	xas->xa_alloc->parent = NULL;
 	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
 	xas->xa_node = XAS_RESTART;
 	return true;
@@ -403,7 +407,7 @@ static unsigned long xas_size(const struct xa_state *xas)
 /*
  * Use this to calculate the maximum index that will need to be created
  * in order to add the entry described by @xas.  Because we cannot store a
- * multiple-index entry at index 0, the calculation is a little more complex
+ * multi-index entry at index 0, the calculation is a little more complex
  * than you might expect.
  */
 static unsigned long xas_max(struct xa_state *xas)
@@ -946,6 +950,137 @@ void xas_init_marks(const struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_init_marks);
 
+#ifdef CONFIG_XARRAY_MULTI
+static unsigned int node_get_marks(struct xa_node *node, unsigned int offset)
+{
+	unsigned int marks = 0;
+	xa_mark_t mark = XA_MARK_0;
+
+	for (;;) {
+		if (node_get_mark(node, offset, mark))
+			marks |= 1 << (__force unsigned int)mark;
+		if (mark == XA_MARK_MAX)
+			break;
+		mark_inc(mark);
+	}
+
+	return marks;
+}
+
+static void node_set_marks(struct xa_node *node, unsigned int offset,
+			struct xa_node *child, unsigned int marks)
+{
+	xa_mark_t mark = XA_MARK_0;
+
+	for (;;) {
+		if (marks & (1 << (__force unsigned int)mark))
+			node_set_mark(node, offset, mark);
+		if (child)
+			node_mark_all(child, mark);
+		if (mark == XA_MARK_MAX)
+			break;
+		mark_inc(mark);
+	}
+}
+
+void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
+		gfp_t gfp)
+{
+	unsigned int sibs = xas->xa_sibs;
+	unsigned int mask = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+
+	/* XXX: no support for splitting really large entries yet */
+	if (WARN_ON(order + XA_CHUNK_SHIFT < xas->xa_shift))
+		goto nomem;
+	if (xas->xa_shift <= order)
+		return;
+
+	do {
+		unsigned int i;
+		void *sibling;
+		struct xa_node *node;
+
+		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		if (!node)
+			goto nomem;
+		node->array = xas->xa;
+		for (i = 0; i < XA_CHUNK_SIZE; i++) {
+			if ((i & mask) == 0) {
+				RCU_INIT_POINTER(node->slots[i], entry);
+				sibling = xa_mk_sibling(0);
+			} else {
+				RCU_INIT_POINTER(node->slots[i], sibling);
+			}
+		}
+		RCU_INIT_POINTER(node->parent, xas->xa_alloc);
+		xas->xa_alloc = node;
+	} while (sibs-- > 0);
+
+	return;
+nomem:
+	xas_destroy(xas);
+	xas_set_err(xas, -ENOMEM);
+}
+
+/**
+ * xas_split() - Split a multi-index entry into smaller entries.
+ * @xas: XArray operation state.
+ * @entry: New entry to store in the array.
+ * @order: New entry order.
+ *
+ * The value in the entry is copied to all the replacement entries.
+ *
+ * Context: Any context.  The caller should hold the xa_lock.
+ */
+void xas_split(struct xa_state *xas, void *entry, unsigned int order)
+{
+	unsigned int offset, marks;
+	struct xa_node *node;
+	void *curr = xas_load(xas);
+	int values = 0;
+
+	node = xas->xa_node;
+	if (xas_top(node))
+		return;
+
+	marks = node_get_marks(node, xas->xa_offset);
+
+	offset = xas->xa_offset + xas->xa_sibs;
+	do {
+		if (order < node->shift) {
+			struct xa_node *child = xas->xa_alloc;
+
+			xas->xa_alloc = rcu_dereference_raw(child->parent);
+			child->shift = node->shift - XA_CHUNK_SHIFT;
+			child->offset = offset;
+			child->count = XA_CHUNK_SIZE;
+			child->nr_values = xa_is_value(entry) ?
+					XA_CHUNK_SIZE : 0;
+			RCU_INIT_POINTER(child->parent, node);
+			node_set_marks(node, offset, child, marks);
+			rcu_assign_pointer(node->slots[offset],
+					xa_mk_node(child));
+			if (xa_is_value(curr))
+				values--;
+		} else {
+			unsigned int canon = offset -
+					(1 << (order % XA_CHUNK_SHIFT)) + 1;
+
+			node_set_marks(node, canon, NULL, marks);
+			rcu_assign_pointer(node->slots[canon], entry);
+			while (offset > canon)
+				rcu_assign_pointer(node->slots[offset--],
+						xa_mk_sibling(canon));
+			values += (xa_is_value(entry) - xa_is_value(curr)) *
+					(1 << (order % XA_CHUNK_SHIFT));
+		}
+	} while (offset-- > xas->xa_offset);
+
+	node->nr_values += values;
+}
+EXPORT_SYMBOL_GPL(xas_split);
+#endif
+
 /**
  * xas_pause() - Pause a walk to drop a lock.
  * @xas: XArray operation state.
@@ -1407,7 +1542,7 @@ EXPORT_SYMBOL(__xa_store);
  * @gfp: Memory allocation flags.
  *
  * After this function returns, loads from this index will return @entry.
- * Storing into an existing multislot entry updates the entry of every index.
+ * Storing into an existing multi-index entry updates the entry of every index.
  * The marks associated with @index are unaffected unless @entry is %NULL.
  *
  * Context: Any context.  Takes and releases the xa_lock.
@@ -1549,7 +1684,7 @@ static void xas_set_range(struct xa_state *xas, unsigned long first,
  *
  * After this function returns, loads from any index between @first and @last,
  * inclusive will return @entry.
- * Storing into an existing multislot entry updates the entry of every index.
+ * Storing into an existing multi-index entry updates the entry of every index.
  * The marks associated with @index are unaffected unless @entry is %NULL.
  *
  * Context: Process context.  Takes and releases the xa_lock.  May sleep
-- 
2.27.0


* [PATCH 2/2] mm: Use multi-index entries in the page cache
  2020-06-29 15:20 [PATCH 0/2] Use multi-index entries in the page cache Matthew Wilcox (Oracle)
  2020-06-29 15:20 ` [PATCH 1/2] XArray: Add xas_split Matthew Wilcox (Oracle)
@ 2020-06-29 15:20 ` Matthew Wilcox (Oracle)
  2020-06-30  3:16 ` [PATCH 0/2] " William Kucharski
  2020-07-04 20:20   ` Hugh Dickins
  3 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox (Oracle) @ 2020-06-29 15:20 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, Andrew Morton
  Cc: Matthew Wilcox (Oracle), Hugh Dickins

We currently store order-N THPs as 2^N consecutive entries.  While this
consumes rather more memory than necessary, it also turns out to be
buggy for filesystems which track dirty pages: a writeback operation
which starts in the middle of a dirty THP will not notice it, as the
dirty bit is only set on the head index.  With a single multi-index
entry, a lookup at any index in the range finds the entry, and with it
the dirty bit.

This does end up simplifying the page cache slightly, although not as
much as I had hoped.
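
Schematically, the store side goes from writing the head page into each
of the 2^N slots (as the old shmem loop removed below did) to a single
multi-index store; a sketch, with mapping, index and page as
placeholders:

	XA_STATE_ORDER(xas, &mapping->i_pages, index, thp_order(page));

	do {
		xas_lock_irq(&xas);
		xas_store(&xas, page);	/* covers all thp_nr_pages(page) indices */
		xas_unlock_irq(&xas);
	} while (xas_nomem(&xas, GFP_KERNEL));

Lookups are unchanged for callers: a load at any index in the range
returns the head entry, and find_subpage() still picks out the page
that was actually asked for.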

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/filemap.c     | 42 +++++++++++++++++++-----------------------
 mm/huge_memory.c | 21 +++++++++++++++++----
 mm/khugepaged.c  | 12 +++++++++++-
 mm/shmem.c       | 11 ++---------
 4 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 80ce3658b147..28859bc43a3a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -126,13 +126,12 @@ static void page_cache_delete(struct address_space *mapping,
 
 	/* hugetlb pages are represented by a single entry in the xarray */
 	if (!PageHuge(page)) {
-		xas_set_order(&xas, page->index, compound_order(page));
-		nr = compound_nr(page);
+		xas_set_order(&xas, page->index, thp_order(page));
+		nr = thp_nr_pages(page);
 	}
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	VM_BUG_ON_PAGE(nr != 1 && shadow, page);
 
 	xas_store(&xas, shadow);
 	xas_init_marks(&xas);
@@ -322,19 +321,12 @@ static void page_cache_delete_batch(struct address_space *mapping,
 
 		WARN_ON_ONCE(!PageLocked(page));
 
-		if (page->index == xas.xa_index)
-			page->mapping = NULL;
+		page->mapping = NULL;
 		/* Leave page->index set: truncation lookup relies on it */
 
-		/*
-		 * Move to the next page in the vector if this is a regular
-		 * page or the index is of the last sub-page of this compound
-		 * page.
-		 */
-		if (page->index + compound_nr(page) - 1 == xas.xa_index)
-			i++;
+		i++;
 		xas_store(&xas, NULL);
-		total_pages++;
+		total_pages += thp_nr_pages(page);
 	}
 	mapping->nrpages -= total_pages;
 }
@@ -851,20 +843,24 @@ static int __add_to_page_cache_locked(struct page *page,
 	}
 
 	do {
-		xas_lock_irq(&xas);
-		old = xas_load(&xas);
-		if (old && !xa_is_value(old))
-			xas_set_err(&xas, -EEXIST);
-		xas_store(&xas, page);
-		if (xas_error(&xas))
-			goto unlock;
+		unsigned long exceptional = 0;
 
-		if (xa_is_value(old)) {
-			mapping->nrexceptional--;
+		xas_lock_irq(&xas);
+		xas_for_each_conflict(&xas, old) {
+			if (!xa_is_value(old)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			exceptional++;
 			if (shadowp)
 				*shadowp = old;
 		}
-		mapping->nrpages++;
+
+		xas_store(&xas, page);
+		if (xas_error(&xas))
+			goto unlock;
+		mapping->nrexceptional -= exceptional;
+		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..7e5ff05ceeaa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2603,6 +2603,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
+	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index,
+				compound_order(head));
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
@@ -2667,19 +2669,28 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
+	if (mapping) {
+		/* XXX: Need better GFP flags here */
+		xas_split_alloc(&xas, head, 0, GFP_ATOMIC);
+		if (xas_error(&xas)) {
+			ret = xas_error(&xas);
+			goto out_unlock;
+		}
+	}
+
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irqsave(&pgdata->lru_lock, flags);
 
 	if (mapping) {
-		XA_STATE(xas, &mapping->i_pages, page_index(head));
-
 		/*
 		 * Check if the head page is present in page cache.
 		 * We assume all tail are present too, if head is there.
 		 */
-		xa_lock(&mapping->i_pages);
+		xas_lock(&xas);
+		xas_reset(&xas);
 		if (xas_load(&xas) != head)
 			goto fail;
+		xas_split(&xas, head, 0);
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
@@ -2717,7 +2728,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:		if (mapping)
-			xa_unlock(&mapping->i_pages);
+			xas_unlock(&xas);
 		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
 		remap_page(head);
 		ret = -EBUSY;
@@ -2731,6 +2742,8 @@ fail:		if (mapping)
 	if (mapping)
 		i_mmap_unlock_read(mapping);
 out:
+	/* Free any memory we didn't use */
+	xas_nomem(&xas, 0);
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b043c40a21d4..52dcec90e1c3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1638,7 +1638,10 @@ static void collapse_file(struct mm_struct *mm,
 	}
 	count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
 
-	/* This will be less messy when we use multi-index entries */
+	/*
+	 * Ensure we have slots for all the pages in the range.  This is
+	 * almost certainly a no-op because most of the pages must be present
+	 */
 	do {
 		xas_lock_irq(&xas);
 		xas_create_range(&xas);
@@ -1844,6 +1847,9 @@ static void collapse_file(struct mm_struct *mm,
 			__mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
 	}
 
+	/* Join all the small entries into a single multi-index entry */
+	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+	xas_store(&xas, new_page);
 xa_locked:
 	xas_unlock_irq(&xas);
 xa_unlocked:
@@ -1965,6 +1971,10 @@ static void khugepaged_scan_file(struct mm_struct *mm,
 			continue;
 		}
 
+		/*
+		 * XXX: khugepaged should compact smaller compound pages
+		 * into a PMD sized page
+		 */
 		if (PageTransCompound(page)) {
 			result = SCAN_PAGE_COMPOUND;
 			break;
diff --git a/mm/shmem.c b/mm/shmem.c
index a0dbe62f8042..030cc483dd3f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -608,7 +608,6 @@ static int shmem_add_to_page_cache(struct page *page,
 				   struct mm_struct *charge_mm)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
-	unsigned long i = 0;
 	unsigned long nr = compound_nr(page);
 	int error;
 
@@ -638,17 +637,11 @@ static int shmem_add_to_page_cache(struct page *page,
 		void *entry;
 		xas_lock_irq(&xas);
 		entry = xas_find_conflict(&xas);
-		if (entry != expected)
+		if (entry != expected) {
 			xas_set_err(&xas, -EEXIST);
-		xas_create_range(&xas);
-		if (xas_error(&xas))
 			goto unlock;
-next:
-		xas_store(&xas, page);
-		if (++i < nr) {
-			xas_next(&xas);
-			goto next;
 		}
+		xas_store(&xas, page);
 		if (PageTransHuge(page)) {
 			count_vm_event(THP_FILE_ALLOC);
 			__inc_node_page_state(page, NR_SHMEM_THPS);
-- 
2.27.0


* Re: [PATCH 1/2] XArray: Add xas_split
  2020-06-29 15:20 ` [PATCH 1/2] XArray: Add xas_split Matthew Wilcox (Oracle)
@ 2020-06-29 17:18   ` William Kucharski
  0 siblings, 0 replies; 14+ messages in thread
From: William Kucharski @ 2020-06-29 17:18 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Hugh Dickins

Very nice, nit inline.

For the series: Reviewed-by: William Kucharski <william.kucharski@oracle.com>

> On Jun 29, 2020, at 9:20 AM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> In order to use multi-index entries for huge pages in the page cache,
> we need to be able to split a multi-index entry (eg if a file is
> truncated in the middle of a huge page entry).  This version does not
> support splitting more than one level of the tree at a time.  This is an
> acceptable limitation for the page cache as we do not expect to support
> order-12 pages in the near future.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> Documentation/core-api/xarray.rst |  16 ++--
> include/linux/xarray.h            |   2 +
> lib/test_xarray.c                 |  41 ++++++++
> lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
> 4 files changed, 196 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
> index 640934b6f7b4..a137a0e6d068 100644
> --- a/Documentation/core-api/xarray.rst
> +++ b/Documentation/core-api/xarray.rst
> @@ -475,13 +475,15 @@ or iterations will move the index to the first index in the range.
> Each entry will only be returned once, no matter how many indices it
> occupies.
> 
> -Using xas_next() or xas_prev() with a multi-index xa_state
> -is not supported.  Using either of these functions on a multi-index entry
> -will reveal sibling entries; these should be skipped over by the caller.
> -
> -Storing ``NULL`` into any index of a multi-index entry will set the entry
> -at every index to ``NULL`` and dissolve the tie.  Splitting a multi-index
> -entry into entries occupying smaller ranges is not yet supported.
> +Using xas_next() or xas_prev() with a multi-index xa_state is not
> +supported.  Using either of these functions on a multi-index entry will
> +reveal sibling entries; these should be skipped over by the caller.
> +
> +Storing ``NULL`` into any index of a multi-index entry will set the
> +entry at every index to ``NULL`` and dissolve the tie.  A multi-index
> +entry can be split into entries occupying smaller ranges by calling
> +xas_split_alloc() without the xa_lock held, followed by taking the lock
> +and calling xas_split().
> 
> Functions and structures
> ========================
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index b4d70e7568b2..fd4364e17632 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1491,6 +1491,8 @@ static inline bool xas_retry(struct xa_state *xas, const void *entry)
> 
> void *xas_load(struct xa_state *);
> void *xas_store(struct xa_state *, void *entry);
> +void xas_split(struct xa_state *, void *entry, unsigned int order);
> +void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
> void *xas_find(struct xa_state *, unsigned long max);
> void *xas_find_conflict(struct xa_state *);
> 
> diff --git a/lib/test_xarray.c b/lib/test_xarray.c
> index acf5ccba58b0..282cce959aba 100644
> --- a/lib/test_xarray.c
> +++ b/lib/test_xarray.c
> @@ -1525,6 +1525,46 @@ static noinline void check_store_range(struct xarray *xa)
> 	}
> }
> 
> +#ifdef CONFIG_XARRAY_MULTI
> +static void check_split_1(struct xarray *xa, unsigned long index,
> +							unsigned int order)
> +{
> +	XA_STATE_ORDER(xas, xa, index, order);
> +	void *entry;
> +	unsigned int i = 0;
> +
> +	xa_store_order(xa, index, order, xa, GFP_KERNEL);
> +
> +	xas_split_alloc(&xas, xa, 0, GFP_KERNEL);
> +	xas_lock(&xas);
> +	xas_split(&xas, xa, 0);
> +	xas_unlock(&xas);
> +
> +	xa_for_each(xa, index, entry) {
> +		XA_BUG_ON(xa, entry != xa);
> +		i++;
> +	}
> +	XA_BUG_ON(xa, i != 1 << order);
> +
> +	xa_destroy(xa);
> +}
> +
> +static noinline void check_split(struct xarray *xa)
> +{
> +	unsigned int order;
> +
> +	XA_BUG_ON(xa, !xa_empty(xa));
> +
> +	for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {

It would be nice to have a comment here as to why you are checking
these particular size thresholds.
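
For what it's worth, something along these lines, based on the "one
level of the tree at a time" limitation in the commit message (wording
is only a suggestion):

	/*
	 * Exercise every order the split code supports: xas_split()
	 * only splits one level of the tree at a time, so entries up
	 * to order 2 * XA_CHUNK_SHIFT - 1 can be split down to order 0.
	 */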

> +		check_split_1(xa, 0, order);
> +		check_split_1(xa, 1UL << order, order);
> +		check_split_1(xa, 3UL << order, order);
> +	}
> +}
> +#else
> +static void check_split(struct xarray *xa) { }
> +#endif
> +
> static void check_align_1(struct xarray *xa, char *name)
> {
> 	int i;
> @@ -1730,6 +1770,7 @@ static int xarray_checks(void)
> 	check_store_range(&array);
> 	check_store_iter(&array);
> 	check_align(&xa0);
> +	check_split(&array);
> 
> 	check_workingset(&array, 0);
> 	check_workingset(&array, 64);
> diff --git a/lib/xarray.c b/lib/xarray.c
> index e9e641d3c0c3..faca1ac95b0e 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -266,13 +266,15 @@ static void xa_node_free(struct xa_node *node)
>  */
> static void xas_destroy(struct xa_state *xas)
> {
> -	struct xa_node *node = xas->xa_alloc;
> +	struct xa_node *next, *node = xas->xa_alloc;
> 
> -	if (!node)
> -		return;
> -	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
> -	kmem_cache_free(radix_tree_node_cachep, node);
> -	xas->xa_alloc = NULL;
> +	while (node) {
> +		XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
> +		next = rcu_dereference_raw(node->parent);
> +		/* XXX: need to free children */
> +		kmem_cache_free(radix_tree_node_cachep, node);
> +		xas->xa_alloc = node = next;
> +	}
> }
> 
> /**
> @@ -304,6 +306,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
> 	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> 	if (!xas->xa_alloc)
> 		return false;
> +	xas->xa_alloc->parent = NULL;
> 	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
> 	xas->xa_node = XAS_RESTART;
> 	return true;
> @@ -339,6 +342,7 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
> 	}
> 	if (!xas->xa_alloc)
> 		return false;
> +	xas->xa_alloc->parent = NULL;
> 	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
> 	xas->xa_node = XAS_RESTART;
> 	return true;
> @@ -403,7 +407,7 @@ static unsigned long xas_size(const struct xa_state *xas)
> /*
>  * Use this to calculate the maximum index that will need to be created
>  * in order to add the entry described by @xas.  Because we cannot store a
> - * multiple-index entry at index 0, the calculation is a little more complex
> + * multi-index entry at index 0, the calculation is a little more complex
>  * than you might expect.
>  */
> static unsigned long xas_max(struct xa_state *xas)
> @@ -946,6 +950,137 @@ void xas_init_marks(const struct xa_state *xas)
> }
> EXPORT_SYMBOL_GPL(xas_init_marks);
> 
> +#ifdef CONFIG_XARRAY_MULTI
> +static unsigned int node_get_marks(struct xa_node *node, unsigned int offset)
> +{
> +	unsigned int marks = 0;
> +	xa_mark_t mark = XA_MARK_0;
> +
> +	for (;;) {
> +		if (node_get_mark(node, offset, mark))
> +			marks |= 1 << (__force unsigned int)mark;
> +		if (mark == XA_MARK_MAX)
> +			break;
> +		mark_inc(mark);
> +	}
> +
> +	return marks;
> +}
> +
> +static void node_set_marks(struct xa_node *node, unsigned int offset,
> +			struct xa_node *child, unsigned int marks)
> +{
> +	xa_mark_t mark = XA_MARK_0;
> +
> +	for (;;) {
> +		if (marks & (1 << (__force unsigned int)mark))
> +			node_set_mark(node, offset, mark);
> +		if (child)
> +			node_mark_all(child, mark);
> +		if (mark == XA_MARK_MAX)
> +			break;
> +		mark_inc(mark);
> +	}
> +}
> +
> +void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
> +		gfp_t gfp)
> +{
> +	unsigned int sibs = xas->xa_sibs;
> +	unsigned int mask = (1 << (order % XA_CHUNK_SHIFT)) - 1;
> +
> +	/* XXX: no support for splitting really large entries yet */
> +	if (WARN_ON(order + XA_CHUNK_SHIFT < xas->xa_shift))
> +		goto nomem;
> +	if (xas->xa_shift <= order)
> +		return;
> +
> +	do {
> +		unsigned int i;
> +		void *sibling;
> +		struct xa_node *node;
> +
> +		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
> +		if (!node)
> +			goto nomem;
> +		node->array = xas->xa;
> +		for (i = 0; i < XA_CHUNK_SIZE; i++) {
> +			if ((i & mask) == 0) {
> +				RCU_INIT_POINTER(node->slots[i], entry);
> +				sibling = xa_mk_sibling(0);
> +			} else {
> +				RCU_INIT_POINTER(node->slots[i], sibling);
> +			}
> +		}
> +		RCU_INIT_POINTER(node->parent, xas->xa_alloc);
> +		xas->xa_alloc = node;
> +	} while (sibs-- > 0);
> +
> +	return;
> +nomem:
> +	xas_destroy(xas);
> +	xas_set_err(xas, -ENOMEM);
> +}
> +
> +/**
> + * xas_split() - Split a multi-index entry into smaller entries.
> + * @xas: XArray operation state.
> + * @entry: New entry to store in the array.
> + * @order: New entry order.
> + *
> + * The value in the entry is copied to all the replacement entries.
> + *
> + * Context: Any context.  The caller should hold the xa_lock.
> + */
> +void xas_split(struct xa_state *xas, void *entry, unsigned int order)
> +{
> +	unsigned int offset, marks;
> +	struct xa_node *node;
> +	void *curr = xas_load(xas);
> +	int values = 0;
> +
> +	node = xas->xa_node;
> +	if (xas_top(node))
> +		return;
> +
> +	marks = node_get_marks(node, xas->xa_offset);
> +
> +	offset = xas->xa_offset + xas->xa_sibs;
> +	do {
> +		if (order < node->shift) {
> +			struct xa_node *child = xas->xa_alloc;
> +
> +			xas->xa_alloc = rcu_dereference_raw(child->parent);
> +			child->shift = node->shift - XA_CHUNK_SHIFT;
> +			child->offset = offset;
> +			child->count = XA_CHUNK_SIZE;
> +			child->nr_values = xa_is_value(entry) ?
> +					XA_CHUNK_SIZE : 0;
> +			RCU_INIT_POINTER(child->parent, node);
> +			node_set_marks(node, offset, child, marks);
> +			rcu_assign_pointer(node->slots[offset],
> +					xa_mk_node(child));
> +			if (xa_is_value(curr))
> +				values--;
> +		} else {
> +			unsigned int canon = offset -
> +					(1 << (order % XA_CHUNK_SHIFT)) + 1;
> +
> +			node_set_marks(node, canon, NULL, marks);
> +			rcu_assign_pointer(node->slots[canon], entry);
> +			while (offset > canon)
> +				rcu_assign_pointer(node->slots[offset--],
> +						xa_mk_sibling(canon));
> +			values += (xa_is_value(entry) - xa_is_value(curr)) *
> +					(1 << (order % XA_CHUNK_SHIFT));
> +		}
> +	} while (offset-- > xas->xa_offset);
> +
> +	node->nr_values += values;
> +}
> +EXPORT_SYMBOL_GPL(xas_split);
> +#endif
> +
> /**
>  * xas_pause() - Pause a walk to drop a lock.
>  * @xas: XArray operation state.
> @@ -1407,7 +1542,7 @@ EXPORT_SYMBOL(__xa_store);
>  * @gfp: Memory allocation flags.
>  *
>  * After this function returns, loads from this index will return @entry.
> - * Storing into an existing multislot entry updates the entry of every index.
> + * Storing into an existing multi-index entry updates the entry of every index.
>  * The marks associated with @index are unaffected unless @entry is %NULL.
>  *
>  * Context: Any context.  Takes and releases the xa_lock.
> @@ -1549,7 +1684,7 @@ static void xas_set_range(struct xa_state *xas, unsigned long first,
>  *
>  * After this function returns, loads from any index between @first and @last,
>  * inclusive will return @entry.
> - * Storing into an existing multislot entry updates the entry of every index.
> + * Storing into an existing multi-index entry updates the entry of every index.
>  * The marks associated with @index are unaffected unless @entry is %NULL.
>  *
>  * Context: Process context.  Takes and releases the xa_lock.  May sleep
> -- 
> 2.27.0
> 
> 


* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-06-29 15:20 [PATCH 0/2] Use multi-index entries in the page cache Matthew Wilcox (Oracle)
  2020-06-29 15:20 ` [PATCH 1/2] XArray: Add xas_split Matthew Wilcox (Oracle)
  2020-06-29 15:20 ` [PATCH 2/2] mm: Use multi-index entries in the page cache Matthew Wilcox (Oracle)
@ 2020-06-30  3:16 ` William Kucharski
  2020-07-04 20:20   ` Hugh Dickins
  3 siblings, 0 replies; 14+ messages in thread
From: William Kucharski @ 2020-06-30  3:16 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Hugh Dickins

Another nice cleanup.

For the series:

Reviewed-by: William Kucharski <william.kucharski@oracle.com>

> On Jun 29, 2020, at 9:20 AM, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
> 
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
> 
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
> 
> Matthew Wilcox (Oracle) (2):
>  XArray: Add xas_split
>  mm: Use multi-index entries in the page cache
> 
> Documentation/core-api/xarray.rst |  16 ++--
> include/linux/xarray.h            |   2 +
> lib/test_xarray.c                 |  41 ++++++++
> lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
> mm/filemap.c                      |  42 ++++----
> mm/huge_memory.c                  |  21 +++-
> mm/khugepaged.c                   |  12 ++-
> mm/shmem.c                        |  11 +--
> 8 files changed, 245 insertions(+), 53 deletions(-)
> 
> -- 
> 2.27.0
> 
> 


* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-06-29 15:20 [PATCH 0/2] Use multi-index entries in the page cache Matthew Wilcox (Oracle)
@ 2020-07-04 20:20   ` Hugh Dickins
  2020-06-29 15:20 ` [PATCH 2/2] mm: Use multi-index entries in the page cache Matthew Wilcox (Oracle)
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2020-07-04 20:20 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Hugh Dickins

On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:

> Following Hugh's advice at LSFMM 2019, I was trying to avoid doing this,
> but it turns out to be hard to support range writeback with the pagecache
> using multiple single entries.  Of course, this isn't a problem for
> shmem because it doesn't have a real backing device (only swap), but
> real filesystems need to start writeback at arbitrary file offsets and
> have problems if we don't notice that the first (huge) page in the range
> is dirty.
> 
> Hugh, I would love it if you could test this.  It didn't introduce any new
> regressions to the xfstests, but shmem does exercise different paths and
> of course I don't have a clean xfstests run yet, so there could easily
> still be bugs.
> 
> I'd like this to be included in mmotm, but it is less urgent than the
> previous patch series that I've sent.  As far as risk, I think it only
> affects shmem/tmpfs.
> 
> Matthew Wilcox (Oracle) (2):
>   XArray: Add xas_split
>   mm: Use multi-index entries in the page cache
> 
>  Documentation/core-api/xarray.rst |  16 ++--
>  include/linux/xarray.h            |   2 +
>  lib/test_xarray.c                 |  41 ++++++++
>  lib/xarray.c                      | 153 ++++++++++++++++++++++++++++--
>  mm/filemap.c                      |  42 ++++----
>  mm/huge_memory.c                  |  21 +++-
>  mm/khugepaged.c                   |  12 ++-
>  mm/shmem.c                        |  11 +--
>  8 files changed, 245 insertions(+), 53 deletions(-)
> 
> -- 
> 2.27.0

I have been trying it, and it's not good yet. I had hoped to work
out what's wrong and send a patch, but I'm not making progress,
so better hand back to you with what I've found.

First problem was that it did not quite build on top of 5.8-rc3
plus your 1-7 THP prep patches: was there some other series we
were supposed to add in too? If so, that might make a big difference,
but I fixed up __add_to_page_cache_locked() as in the diff below
(and I'm not bothering about hugetlbfs, so haven't looked to see if
its page indexing is or isn't still exceptional with your series).

Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
after warning from shmem_evict_inode(). Surprisingly, that first
happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
while doing an "rm -rf".

I've seen it on other machines since, with THP enabled, doing other
things; but have not seen it in the last couple of days, which is
intriguing (though partly consequence of narrowing down to xfstests).

The original non-THP machine ran the same load for
ten hours yesterday, but hit no problem. The only significant
difference in what ran successfully, is that I've been surprised
by all the non-zero entries I saw in xarray nodes, exceeding
total entry "count" (I've also been bothered by non-zero "offset"
at root, but imagine that's just noise that never gets used).
So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
kmem_cache_zalloc()s, as in the diff below: not suggesting that
as necessary, just a temporary precaution in case something is
not being initialized as intended.

I've also changed shmem_evict_inode()'s warning to BUG to crash
sooner, as in the diff below; but not hit it since doing so.

Here's a script to run in the xfstests directory, just running
the tests I've seen frequent problems from: maybe you were not
choosing the "huge=always" option to tmpfs?  An alternative is
"echo force >/sys/kernel/mm/transparent_hugepage/shmem_enabled":
that's good when running random tests that you cannot poke into;
but can cause crashes if the graphics driver e.g. i915 is using
shmem, and then gets hugepages it was not expecting.

sync
export FSTYP=tmpfs
export DISABLE_UDF_TEST=1
export TEST_DEV=tmpfs1:
export TEST_DIR=/xft		# I have that, you'll have somewhere else
export SCRATCH_DEV=tmpfs2:
export SCRATCH_MNT=/mnt
export TMPFS_MOUNT_OPTIONS="-o size=1088M,huge=always"
mount -t $FSTYP $TMPFS_MOUNT_OPTIONS $TEST_DEV $TEST_DIR || exit $?
./check generic/083 generic/224 generic/228 generic/269 generic/476
umount /xft /mnt 2>/dev/null

These problems were either mm/filemap.c:1565 find_lock_entry()
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
(at least the ones that I went on to investigate) turned out also to be
find_lock_entry(), circling around with page_mapping(page) != mapping.
It seems that find_get_entry() is sometimes supplying the wrong page,
and you will work out why much quicker than I shall.  (One tantalizing
detail of the bad offset crashes: very often page pgoff is exactly one
less than the requested offset.)

I have modified find_lock_entry() as in the diff below, to be stricter:
we all prefer to investigate a crash than a hang, and I think it's been
unnecessarily forgiving of page->mapping. I might be wrong (fuse? page
migration? splice stealing? shmem swap? THP collapse? THP splitting?)
but I believe they are all okay with a simple head->mapping check there.

Hugh

--- 5083w/lib/radix-tree.c	2020-06-14 15:13:01.406042750 -0700
+++ 5083wh/lib/radix-tree.c	2020-07-04 11:35:11.433181700 -0700
@@ -249,7 +249,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		 * cache first for the new node to get accounted to the memory
 		 * cgroup.
 		 */
-		ret = kmem_cache_alloc(radix_tree_node_cachep,
+		ret = kmem_cache_zalloc(radix_tree_node_cachep,
 				       gfp_mask | __GFP_NOWARN);
 		if (ret)
 			goto out;
@@ -257,7 +257,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		/*
 		 * Provided the caller has preloaded here, we will always
 		 * succeed in getting a node here (and never reach
-		 * kmem_cache_alloc)
+		 * kmem_cache_zalloc)
 		 */
 		rtp = this_cpu_ptr(&radix_tree_preloads);
 		if (rtp->nr) {
@@ -272,7 +272,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, st
 		kmemleak_update_trace(ret);
 		goto out;
 	}
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 out:
 	BUG_ON(radix_tree_is_internal_node(ret));
 	if (ret) {
@@ -334,7 +334,7 @@ static __must_check int __radix_tree_pre
 	rtp = this_cpu_ptr(&radix_tree_preloads);
 	while (rtp->nr < nr) {
 		local_unlock(&radix_tree_preloads.lock);
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_zalloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		local_lock(&radix_tree_preloads.lock);
--- 5083w/mm/filemap.c	2020-06-30 11:38:58.094010948 -0700
+++ 5083wh/mm/filemap.c	2020-07-04 11:35:11.437181729 -0700
@@ -824,6 +824,7 @@ static int __add_to_page_cache_locked(st
 				      void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, offset);
+	unsigned int nr = thp_nr_pages(page);
 	int huge = PageHuge(page);
 	int error;
 	void *old;
@@ -859,12 +860,13 @@ static int __add_to_page_cache_locked(st
 		xas_store(&xas, page);
 		if (xas_error(&xas))
 			goto unlock;
+
 		mapping->nrexceptional -= exceptional;
 		mapping->nrpages += nr;
 
 		/* hugetlb pages do not participate in page cache accounting */
 		if (!huge)
-			__inc_lruvec_page_state(page, NR_FILE_PAGES);
+			__mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
@@ -1548,19 +1550,26 @@ out:
  */
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 {
-	struct page *page;
-
+	struct page *page, *head;
+	pgoff_t pgoff;
 repeat:
 	page = find_get_entry(mapping, offset);
 	if (page && !xa_is_value(page)) {
-		lock_page(page);
+		head = compound_head(page);
+		lock_page(head);
 		/* Has the page been truncated? */
-		if (unlikely(page_mapping(page) != mapping)) {
-			unlock_page(page);
+		if (unlikely(!head->mapping)) {
+			unlock_page(head);
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+		pgoff = head->index + (page - head);
+		if (PageSwapCache(head) ||
+		    head->mapping != mapping || pgoff != offset) {
+			printk("page=%px pgoff=%lx offset=%lx headmapping=%px mapping=%px\n",
+				page, pgoff, offset, head->mapping, mapping);
+			VM_BUG_ON_PAGE(1, page);
+		}
 	}
 	return page;
 }
--- 5083w/mm/shmem.c	2020-06-30 11:38:58.098010985 -0700
+++ 5083wh/mm/shmem.c	2020-07-04 11:35:11.441181757 -0700
@@ -1108,7 +1108,12 @@ static void shmem_evict_inode(struct ino
 	}
 
 	simple_xattrs_free(&info->xattrs);
-	WARN_ON(inode->i_blocks);
+	if (inode->i_blocks) {
+		printk("mapping=%px i_blocks=%lx nrpages=%lx nrexcept=%lx\n",
+			inode->i_mapping, (unsigned long)inode->i_blocks,
+			inode->i_data.nrpages, inode->i_data.nrexceptional);
+		BUG();
+	}
 	shmem_free_inode(inode->i_sb);
 	clear_inode(inode);
 }


* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-07-04 20:20   ` Hugh Dickins
@ 2020-07-06 14:43   ` Matthew Wilcox
  2020-07-06 18:50     ` Matthew Wilcox
  2020-07-07  3:21       ` Hugh Dickins
  0 siblings, 2 replies; 14+ messages in thread
From: Matthew Wilcox @ 2020-07-06 14:43 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-fsdevel, linux-mm, Andrew Morton

On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> On Mon, 29 Jun 2020, Matthew Wilcox (Oracle) wrote:
> > Hugh, I would love it if you could test this.  It didn't introduce any new
> > regressions to the xfstests, but shmem does exercise different paths and
> > of course I don't have a clean xfstests run yet, so there could easily
> > still be bugs.
> 
> I have been trying it, and it's not good yet. I had hoped to work
> out what's wrong and send a patch, but I'm not making progress,
> so better hand back to you with what I've found.

Thank you so much!  I've made some progress.

> First problem was that it did not quite build on top of 5.8-rc3
> plus your 1-7 THP prep patches: was there some other series we
> were supposed to add in too? If so, that might make a big difference,
> but I fixed up __add_to_page_cache_locked() as in the diff below
> (and I'm not bothering about hugetlbfs, so haven't looked to see if
> its page indexing is or isn't still exceptional with your series).

Oops.  I shifted some patches around and clearly didn't get it quite
right.  I'll fix it up.

> Second problem was fs/inode.c:530 BUG_ON(inode->i_data.nrpages),
> after warning from shmem_evict_inode(). Surprisingly, that first
> happened on a machine with CONFIG_TRANSPARENT_HUGEPAGE not set,
> while doing an "rm -rf".

I've seen that occasionally too.

> The original non-THP machine ran the same load for
> ten hours yesterday, but hit no problem. The only significant
> difference in what ran successfully, is that I've been surprised
> by all the non-zero entries I saw in xarray nodes, exceeding
> total entry "count" (I've also been bothered by non-zero "offset"
> at root, but imagine that's just noise that never gets used).
> So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
> kmem_cache_zalloc()s, as in the diff below: not suggesting that
> as necessary, just a temporary precaution in case something is
> not being initialized as intended.

Umm.  ->count should always be accurate and match the number of non-NULL
entries in a node.  The zalloc shouldn't be necessary, and will probably
break the workingset code.  Actually, it should BUG because we have both
a constructor and an instruction to zero the allocation, and they can't
both be right.

You're right that ->offset is never used at root.  I had plans to
repurpose that to support smaller files more efficiently, but never
got round to implementing those plans.

> These problems were either mm/filemap.c:1565 find_lock_entry()
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> (at least the ones that I went on to investigate) turned out also to be
> find_lock_entry(), circling around with page_mapping(page) != mapping.
> It seems that find_get_entry() is sometimes supplying the wrong page,
> and you will work out why much quicker than I shall.  (One tantalizing
> detail of the bad offset crashes: very often page pgoff is exactly one
> less than the requested offset.)

I added this:

@@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
                goto repeat;
        }
        page = find_subpage(page, offset);
+       if (page_to_index(page) != offset) {
+               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
+               dump_page(page, "index mismatch");
+               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
+       }
 out:
        rcu_read_unlock();
 
and I have a good clue now:

1322 offset 631 xas index 631 offset 48
1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
1322 page dumped because: index mismatch
1322 xa_load 000000008c9a9bc3

0x276 is decimal 630.  So we're looking up a tail page and getting its
erstwhile head page.  I'll dig in and figure out exactly how that's
happening.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-07-06 14:43   ` Matthew Wilcox
@ 2020-07-06 18:50     ` Matthew Wilcox
  2020-07-07  3:43         ` Hugh Dickins
  2020-07-07  3:21       ` Hugh Dickins
  1 sibling, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2020-07-06 18:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-fsdevel, linux-mm, Andrew Morton

On Mon, Jul 06, 2020 at 03:43:20PM +0100, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to be
> > find_lock_entry(), circling around with page_mapping(page) != mapping.
> > It seems that find_get_entry() is sometimes supplying the wrong page,
> > and you will work out why much quicker than I shall.  (One tantalizing
> > detail of the bad offset crashes: very often page pgoff is exactly one
> > less than the requested offset.)
> 
> I added this:
> 
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>                 goto repeat;
>         }
>         page = find_subpage(page, offset);
> +       if (page_to_index(page) != offset) {
> +               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);
> +               dump_page(page, "index mismatch");
> +               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +       }
>  out:
>         rcu_read_unlock();
>  
> and I have a good clue now:
> 
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
> 
> 0x276 is decimal 630.  So we're looking up a tail page and getting its
> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

@@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
                nr = thp_nr_pages(page);
        }
 
+       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
        page_ref_add(page, nr);
        page->mapping = mapping;
        page->index = offset;
@@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
                goto error;
        }
 
+       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
        trace_mm_filemap_add_to_page_cache(page);
        return 0;
 error:

The second one triggers with generic/051 (running against xfs with the
rest of my patches).  So I deduce that we have a shadow entry which
takes up multiple indices, then when we store the page, we now have
a multi-index entry which refers to a single page.  And this explains
basically all the accounting problems.

Now I'm musing how best to fix this.

1. When removing a compound page from the cache, we could store only
a single entry.  That seems bad because we could hit somewhere else in
the compound page and we'd have the wrong information about workingset
history (or worse believe that a shmem page isn't in swap!)

2. When removing a compound page from the cache, we could split the
entry and store the same entry N times.  Again, this seems bad for shmem
because then all the swap entries would be the same, and we'd fetch the
wrong data from swap.

3 & 4. When adding a page to the cache, we delete any shadow entry which was
previously there, or replicate the shadow entry.  Same drawbacks as the
above two, I feel.

5. Use the size of the shadow entry to allocate a page of the right size.
We don't currently have an API to find the right size, so that would
need to be written.  And what do we do if we can't allocate a page of
sufficient size?

So that's five ideas with their drawbacks as I see them.  Maybe you have
a better idea?
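
For what it's worth, the lookup that idea 5 needs could be sketched along
the lines below.  xa_entry_order() is a made-up name, the code is untested,
and it simply counts the sibling slots that a multi-index store leaves
behind the canonical entry.

/* Hypothetical: how many indices does the entry at @index occupy? */
static unsigned int xa_entry_order(struct xarray *xa, unsigned long index)
{
        XA_STATE(xas, xa, index);
        unsigned int order = 0;
        void *entry;

        rcu_read_lock();
        entry = xas_load(&xas);
        if (!entry || !xas.xa_node)
                goto out;

        /* Count the sibling slots following the canonical slot. */
        while (xas.xa_offset + (1U << order) < XA_CHUNK_SIZE &&
               xa_is_sibling(xa_entry(xa, xas.xa_node,
                                      xas.xa_offset + (1U << order))))
                order++;
        order += xas.xa_node->shift;
out:
        rcu_read_unlock();
        return order;
}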

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-07-06 14:43   ` Matthew Wilcox
@ 2020-07-07  3:21       ` Hugh Dickins
  2020-07-07  3:21       ` Hugh Dickins
  1 sibling, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2020-07-07  3:21 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Hugh Dickins, linux-fsdevel, linux-mm, Andrew Morton

On Mon, 6 Jul 2020, Matthew Wilcox wrote:
> On Sat, Jul 04, 2020 at 01:20:19PM -0700, Hugh Dickins wrote:
> 
> > The original non-THP machine ran the same load for
> > ten hours yesterday, but hit no problem. The only significant
> > difference in what ran successfully, is that I've been surprised
> > by all the non-zero entries I saw in xarray nodes, exceeding
> > total entry "count" (I've also been bothered by non-zero "offset"
> > at root, but imagine that's just noise that never gets used).
> > So I've changed the kmem_cache_alloc()s in lib/radix-tree.c to
> > kmem_cache_zalloc()s, as in the diff below: not suggesting that
> > as necessary, just a temporary precaution in case something is
> > not being initialized as intended.
> 
> Umm.  ->count should always be accurate and match the number of non-NULL
> entries in a node.  The zalloc shouldn't be necessary, and will probably
> break the workingset code.  Actually, it should BUG because we have both
> a constructor and an instruction to zero the allocation, and they can't
> both be right.

Those kmem_cache_zalloc()s did not cause BUGs because I did them in
lib/radix-tree.c: it occurred to me yesterday morning that that was
an odd choice to have made! So I then extended them to lib/xarray.c,
immediately hit the BUGs you point out, and so also added INIT_LIST_HEADs.

And very interestingly, the little huge tmpfs xfstests script I gave
last time then usually succeeds without any crash; but not always.

To me that suggests that there's somewhere you omit to clear a slot;
and that usually those xfstests don't encounter that stale slot again,
before the node is recycled for use elsewhere (with the zalloc making
it right again); but occasionally the tests do hit the bogus entry
before the node is recycled.

Before the zallocs, I had noticed a node with count 4, but filled
with non-zero entries (most of them looking like huge head pointers).

> 
> You're right that ->offset is never used at root.  I had plans to
> repurpose that to support smaller files more efficiently, but never
> got round to implementing those plans.
> 
> > These problems were either mm/filemap.c:1565 find_lock_entry()
> > VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); or hangs, which
> > (at least the ones that I went on to investigate) turned out also to be
> > find_lock_entry(), circling around with page_mapping(page) != mapping.
> > It seems that find_get_entry() is sometimes supplying the wrong page,
> > and you will work out why much quicker than I shall.  (One tantalizing
> > detail of the bad offset crashes: very often page pgoff is exactly one
> > less than the requested offset.)
> 
> I added this:
> 
> @@ -1535,6 +1535,11 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
>                 goto repeat;
>         }
>         page = find_subpage(page, offset);
> +       if (page_to_index(page) != offset) {
> +               printk("offset %ld xas index %ld offset %d\n", offset, xas.xa_index, xas.xa_offset);

(It's much easier to spot huge page issues with %x than with %d.)

> +               dump_page(page, "index mismatch");
> +               printk("xa_load %p\n", xa_load(&mapping->i_pages, offset));
> +       }
>  out:
>         rcu_read_unlock();
>  
> and I have a good clue now:
> 
> 1322 offset 631 xas index 631 offset 48
> 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276

(You appear to have more patience with all this ghastly hashing of
kernel addresses in debug messages than I have: it really gets in our way.)

> 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> 1322 page dumped because: index mismatch
> 1322 xa_load 000000008c9a9bc3
> 
> 0x276 is decimal 630.  So we're looking up a tail page and getting its

Index 630 for offset 631, yes, that's exactly the kind of off-by-one
I saw so frequently.

> erstwhile head page.  I'll dig in and figure out exactly how that's
> happening.

Very pleasing to see the word "erstwhile" in use (and I'm serious about
that); but I don't get your head for tail point - I can't make heads or
tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Certainly it's finding a bogus page, and that's related to the multi-order
splitting (I saw no problem with your 1-7 series); I think it's from a
stale entry being left behind. (But that does not at all explain the
off-by-one pattern.)

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-07-06 18:50     ` Matthew Wilcox
@ 2020-07-07  3:43         ` Hugh Dickins
  0 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2020-07-07  3:43 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Hugh Dickins, linux-fsdevel, linux-mm, Andrew Morton

On Mon, 6 Jul 2020, Matthew Wilcox wrote:
> 
> @@ -841,6 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 nr = thp_nr_pages(page);
>         }
>  
> +       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>         page_ref_add(page, nr);
>         page->mapping = mapping;
>         page->index = offset;
> @@ -880,6 +882,7 @@ static int __add_to_page_cache_locked(struct page *page,
>                 goto error;
>         }
>  
> +       VM_BUG_ON_PAGE(xa_load(&mapping->i_pages, offset + nr) == page, page);
>         trace_mm_filemap_add_to_page_cache(page);
>         return 0;
>  error:
> 
> The second one triggers with generic/051 (running against xfs with the
> rest of my patches).  So I deduce that we have a shadow entry which
> takes up multiple indices; then, when we store the page, we end up with
> a multi-index entry which refers to a single page.  And this explains
> basically all the accounting problems.

I think you are jumping too far ahead by bringing in xfs and your
later patches.  Don't let me stop you from thinking ahead, but the
problem at hand is with tmpfs.

tmpfs doesn't use shadow entries, or not where we use the term "shadow"
for workingset business: it does save swap entries as xarray values,
but I don't suppose the five xfstests I was running get into swap
(without applying additional pressure); and (since a huge page
gets split before shmem swapout, at present anyway) I don't see
a philosophical problem about their multi-index entries.

Multi-index shadows, sorry, not a subject I can think about at present.

> 
> Now I'm musing how best to fix this.
> 
> 1. When removing a compound page from the cache, we could store only
> a single entry.  That seems bad because we could hit somewhere else in
> the compound page and we'd have the wrong information about workingset
> history (or worse believe that a shmem page isn't in swap!)
> 
> 2. When removing a compound page from the cache, we could split the
> entry and store the same entry N times.  Again, this seems bad for shmem
> because then all the swap entries would be the same, and we'd fetch the
> wrong data from swap.
> 
> 3 & 4. When adding a page to the cache, we delete any shadow entry which was
> previously there, or replicate the shadow entry.  Same drawbacks as the
> above two, I feel.
> 
> 5. Use the size of the shadow entry to allocate a page of the right size.
> We don't currently have an API to find the right size, so that would
> need to be written.  And what do we do if we can't allocate a page of
> sufficient size?
> 
> So that's five ideas with their drawbacks as I see them.  Maybe you have
> a better idea?
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/2] Use multi-index entries in the page cache
  2020-07-07  3:21       ` Hugh Dickins
  (?)
@ 2020-07-07  3:49       ` Matthew Wilcox
  -1 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2020-07-07  3:49 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-fsdevel, linux-mm, Andrew Morton

On Mon, Jul 06, 2020 at 08:21:54PM -0700, Hugh Dickins wrote:
> > and I have a good clue now:
> > 
> > 1322 offset 631 xas index 631 offset 48
> > 1322 page:000000008c9a9bc3 refcount:4 mapcount:0 mapping:00000000d8615d47 index:0x276
> 
> (You appear to have more patience with all this ghastly hashing of
> kernel addresses in debug messages than I have: it really gets in our way.)
> 
> > 1322 flags: 0x4000000000002026(referenced|uptodate|active|private)
> > 1322 mapping->aops:0xffffffff88a2ebc0 ino 1800b82 dentry name:"f1141"
> > 1322 raw: 4000000000002026 dead000000000100 dead000000000122 ffff98ff2a8b8a20
> > 1322 raw: 0000000000000276 ffff98ff1ac271a0 00000004ffffffff 0000000000000000
> > 1322 page dumped because: index mismatch
> > 1322 xa_load 000000008c9a9bc3
> > 
> > 0x276 is decimal 630.  So we're looking up a tail page and getting its
> 
> Index 630 for offset 631, yes, that's exactly the kind of off-by-one
> I saw so frequently.
> 
> > erstwhile head page.  I'll dig in and figure out exactly how that's
> > happening.
> 
> Very pleasing to see the word "erstwhile" in use (and I'm serious about
> that); but I don't get your head for tail point - I can't make heads or
> tails of what you're saying there :) Neither 631 nor 630 is close to 512.

Ah, I'm working with another 40 or so patches on top of the single patch
I sent you, which include the patches to make XFS use thp and make thps
arbitrary order.  In this instance, we've hit what was probably an order-2
or order-4 page, not an order-9 page.

> Certainly it's finding a bogus page, and that's related to the multi-order
> splitting (I saw no problem with your 1-7 series), I think it's from a
> stale entry being left behind. (But that does not at all explain the
> off-by-one pattern.)

I think I understand that.  We create an order-4 page, and then swap
it out.  That leaves us with a shadow entry which stretches across 16
indices (0x270-0x27f).  Then we access one of those 16 indices again
and store that page (whichever index we happened to be looking at) in
the page cache, but the XArray doesn't know that it's supposed to be an
order-0 store, so it helpfully replaces the 'head' page (at 0x270) with
this one for 0x276.  Usually, we'll go on to access the page at 0x277,
and instead of seeing no page there, we find the wrong page and bring
everything to a juddering halt.
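
In XArray terms the sequence is roughly the sketch below (indices taken
from the dump above, the original page assumed to be order-4; mapping,
shadow and page stand in for the obvious objects, and locking is omitted).

/* 1. Swapout of the order-4 page leaves one shadow entry covering
 *    indices 0x270-0x27f:
 */
XA_STATE_ORDER(xas, &mapping->i_pages, 0x270, 4);
xas_store(&xas, shadow);

/* 2. The page at 0x276 later comes back as an order-0 page.  The plain
 *    store walks to the slot holding the multi-index shadow and
 *    overwrites it, so the single page now "covers" all 16 indices:
 */
XA_STATE(xas2, &mapping->i_pages, 0x276);
xas_store(&xas2, page);

/* 3. Looking up 0x277 (or 0x270, or 0x27f) now returns the page whose
 *    ->index is 0x276 -- the off-by-one in the VM_BUG_ON.
 */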

So seeing the common pattern of page at index n returned for page at
index n+1 is merely an artefact of our normal access patterns.  If we
were accessing random addresses, we'd've seen no pattern at all.

I'm halfway through a dual solution: one that first attempts to use the
information about the shadow entry to bring in the entire thp (if it was
in use as a thp before being paged out, we should probably try to keep
using it as a thp afterwards), but then splits the shadow entry across
the indices if the page being added is smaller than the shadow entry.
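
The second half of that might end up looking something like the sketch
below in __add_to_page_cache_locked().  Only a rough sketch: locking, the
allocation retry loop and the lookup of old_order are all omitted, and the
xas_split() signature is assumed from patch 1/2.

        void *old = xas_load(&xas);

        /* If a larger shadow entry is in the way, split it down so the
         * other indices keep their own values, then store the smaller
         * page over just its own index.
         */
        if (xa_is_value(old) && old_order > compound_order(page))
                xas_split(&xas, old, old_order);
        xas_store(&xas, page);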


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread

Thread overview: 14+ messages
2020-06-29 15:20 [PATCH 0/2] Use multi-index entries in the page cache Matthew Wilcox (Oracle)
2020-06-29 15:20 ` [PATCH 1/2] XArray: Add xas_split Matthew Wilcox (Oracle)
2020-06-29 17:18   ` William Kucharski
2020-06-29 15:20 ` [PATCH 2/2] mm: Use multi-index entries in the page cache Matthew Wilcox (Oracle)
2020-06-30  3:16 ` [PATCH 0/2] " William Kucharski
2020-07-04 20:20 ` Hugh Dickins
2020-07-04 20:20   ` Hugh Dickins
2020-07-06 14:43   ` Matthew Wilcox
2020-07-06 18:50     ` Matthew Wilcox
2020-07-07  3:43       ` Hugh Dickins
2020-07-07  3:43         ` Hugh Dickins
2020-07-07  3:21     ` Hugh Dickins
2020-07-07  3:21       ` Hugh Dickins
2020-07-07  3:49       ` Matthew Wilcox
