All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 00/21] Readdir improvements
@ 2022-02-23 21:12 trondmy
  2022-02-23 21:12 ` [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

This series also implement Ben's page cache filter to ensure that we can
improve the ability to share cached data between processes that are
reading the same directory at the same time, and to avoid live-locks
when the directory is simultaneously changing.

--
v2: Remove reset of dtsize when NFS_INO_FORCE_READDIR is set
v3: Avoid excessive window shrinking in uncached_readdir case
v4: Track 'ls -l' cache hit/miss statistics
    Improved algorithm for falling back to uncached readdir
    Skip readdirplus when files are being written to
v5: bugfixes
    Skip readdirplus when the acdirmax/acregmax values are low
    Request a full XDR buffer when doing READDIRPLUS
v6: Add tracing
    Don't have lookup request readdirplus when it won't help
v7: Implement Ben's page cache filter
    Reduce the use of uncached readdir
    Change indexing of the page cache to improve seekdir() performance.

Trond Myklebust (21):
  NFS: constify nfs_server_capable() and nfs_have_writebacks()
  NFS: Trace lookup revalidation failure
  NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
  NFS: Calculate page offsets algorithmically
  NFS: Store the change attribute in the directory page cache
  NFS: If the cookie verifier changes, we must invalidate the page cache
  NFS: Don't re-read the entire page cache to find the next cookie
  NFS: Adjust the amount of readahead performed by NFS readdir
  NFS: Simplify nfs_readdir_xdr_to_array()
  NFS: Reduce use of uncached readdir
  NFS: Improve heuristic for readdirplus
  NFS: Don't ask for readdirplus unless it can help nfs_getattr()
  NFSv4: Ask for a full XDR buffer of readdir goodness
  NFS: Readdirplus can't help lookup for case insensitive filesystems
  NFS: Don't request readdirplus when revalidation was forced
  NFS: Add basic readdir tracing
  NFS: Trace effects of readdirplus on the dcache
  NFS: Trace effects of the readdirplus heuristic
  NFS: Convert readdir page cache to use a cookie based index
  NFS: Fix up forced readdirplus
  NFS: Remove unnecessary cache invalidations for directories

 fs/nfs/dir.c           | 450 ++++++++++++++++++++++++-----------------
 fs/nfs/inode.c         |  46 ++---
 fs/nfs/internal.h      |   4 +-
 fs/nfs/nfs3xdr.c       |   7 +-
 fs/nfs/nfs4proc.c      |   2 -
 fs/nfs/nfs4xdr.c       |   6 +-
 fs/nfs/nfstrace.h      | 122 ++++++++++-
 include/linux/nfs_fs.h |  19 +-
 8 files changed, 421 insertions(+), 235 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks()
  2022-02-23 21:12 [PATCH v7 00/21] Readdir improvements trondmy
@ 2022-02-23 21:12 ` trondmy
  2022-02-23 21:12   ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure trondmy
  2022-02-24 12:25 ` [PATCH v7 00/21] Readdir improvements David Wysochanski
  2022-02-24 15:07 ` David Wysochanski
  2 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 include/linux/nfs_fs.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 72a732a5103c..6e10725887d1 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -363,7 +363,7 @@ static inline void nfs_mark_for_revalidate(struct inode *inode)
 	spin_unlock(&inode->i_lock);
 }
 
-static inline int nfs_server_capable(struct inode *inode, int cap)
+static inline int nfs_server_capable(const struct inode *inode, int cap)
 {
 	return NFS_SERVER(inode)->caps & cap;
 }
@@ -587,12 +587,11 @@ extern struct nfs_commit_data *nfs_commitdata_alloc(bool never_fail);
 extern void nfs_commit_free(struct nfs_commit_data *data);
 bool nfs_commit_end(struct nfs_mds_commit_info *cinfo);
 
-static inline int
-nfs_have_writebacks(struct inode *inode)
+static inline bool nfs_have_writebacks(const struct inode *inode)
 {
 	if (S_ISREG(inode->i_mode))
 		return atomic_long_read(&NFS_I(inode)->nrequests) != 0;
-	return 0;
+	return false;
 }
 
 /*
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 02/21] NFS: Trace lookup revalidation failure
  2022-02-23 21:12 ` [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
@ 2022-02-23 21:12   ` trondmy
  2022-02-23 21:12     ` [PATCH v7 03/21] NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context trondmy
  2022-02-24 14:14     ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure Benjamin Coddington
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Enable tracing of lookup revalidation failures.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ebddc736eac2..1aa55cac9d9a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1474,9 +1474,7 @@ nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
 {
 	switch (error) {
 	case 1:
-		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is valid\n",
-			__func__, dentry);
-		return 1;
+		break;
 	case 0:
 		/*
 		 * We can't d_drop the root of a disconnected tree:
@@ -1485,13 +1483,10 @@ nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
 		 * inodes on unmount and further oopses.
 		 */
 		if (inode && IS_ROOT(dentry))
-			return 1;
-		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
-				__func__, dentry);
-		return 0;
+			error = 1;
+		break;
 	}
-	dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) lookup returned error %d\n",
-				__func__, dentry, error);
+	trace_nfs_lookup_revalidate_exit(dir, dentry, 0, error);
 	return error;
 }
 
@@ -1623,9 +1618,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 		goto out_bad;
 
 	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
-	error = nfs_lookup_revalidate_dentry(dir, dentry, inode);
-	trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
-	return error;
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
 out_valid:
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 out_bad:
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 03/21] NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
  2022-02-23 21:12   ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure trondmy
@ 2022-02-23 21:12     ` trondmy
  2022-02-23 21:12       ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically trondmy
  2022-02-24 14:14     ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure Benjamin Coddington
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 1aa55cac9d9a..8f17aaebcd77 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -69,18 +69,15 @@ const struct address_space_operations nfs_dir_aops = {
 	.freepage = nfs_readdir_clear_array,
 };
 
-static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir)
+static struct nfs_open_dir_context *
+alloc_nfs_open_dir_context(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
 	struct nfs_open_dir_context *ctx;
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
 	if (ctx != NULL) {
-		ctx->duped = 0;
 		ctx->attr_gencount = nfsi->attr_gencount;
-		ctx->dir_cookie = 0;
-		ctx->dup_cookie = 0;
-		ctx->page_index = 0;
-		ctx->eof = false;
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
 		    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 04/21] NFS: Calculate page offsets algorithmically
  2022-02-23 21:12     ` [PATCH v7 03/21] NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context trondmy
@ 2022-02-23 21:12       ` trondmy
  2022-02-23 21:12         ` [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache trondmy
  2022-02-24 14:15         ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically Benjamin Coddington
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Instead of relying on counting the page offsets as we walk through the
page cache, switch to calculating them algorithmically.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8f17aaebcd77..f2258e926df2 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -248,17 +248,20 @@ static const char *nfs_readdir_copy_name(const char *name, unsigned int len)
 	return ret;
 }
 
+static size_t nfs_readdir_array_maxentries(void)
+{
+	return (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
+	       sizeof(struct nfs_cache_array_entry);
+}
+
 /*
  * Check that the next array entry lies entirely within the page bounds
  */
 static int nfs_readdir_array_can_expand(struct nfs_cache_array *array)
 {
-	struct nfs_cache_array_entry *cache_entry;
-
 	if (array->page_full)
 		return -ENOSPC;
-	cache_entry = &array->array[array->size + 1];
-	if ((char *)cache_entry - (char *)array > PAGE_SIZE) {
+	if (array->size == nfs_readdir_array_maxentries()) {
 		array->page_full = 1;
 		return -ENOSPC;
 	}
@@ -317,6 +320,11 @@ static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
 	return page;
 }
 
+static loff_t nfs_readdir_page_offset(struct page *page)
+{
+	return (loff_t)page->index * (loff_t)nfs_readdir_array_maxentries();
+}
+
 static u64 nfs_readdir_page_last_cookie(struct page *page)
 {
 	struct nfs_cache_array *array;
@@ -447,7 +455,7 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 		if (array->array[i].cookie == desc->dir_cookie) {
 			struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
 
-			new_pos = desc->current_index + i;
+			new_pos = nfs_readdir_page_offset(desc->page) + i;
 			if (desc->attr_gencount != nfsi->attr_gencount ||
 			    !nfs_readdir_inode_mapping_valid(nfsi)) {
 				desc->duped = 0;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-23 21:12       ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically trondmy
@ 2022-02-23 21:12         ` trondmy
  2022-02-23 21:12           ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the " trondmy
  2022-02-24 14:53           ` [PATCH v7 05/21] NFS: Store the change attribute in the directory " Benjamin Coddington
  2022-02-24 14:15         ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically Benjamin Coddington
  1 sibling, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Use the change attribute and the first cookie in a directory page cache
entry to validate that the page is up to date.

Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 68 ++++++++++++++++++++++++++++------------------------
 1 file changed, 37 insertions(+), 31 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index f2258e926df2..5d9367d9b651 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
 };
 
 struct nfs_cache_array {
+	u64 change_attr;
 	u64 last_cookie;
 	unsigned int size;
 	unsigned char page_full : 1,
@@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct nfs_cache_array *array)
 	memset(array, 0, sizeof(struct nfs_cache_array));
 }
 
-static void nfs_readdir_page_init_array(struct page *page, u64 last_cookie)
+static void nfs_readdir_page_init_array(struct page *page, u64 last_cookie,
+					u64 change_attr)
 {
 	struct nfs_cache_array *array;
 
@@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 last_cookie, gfp_t gfp_flags)
 {
 	struct page *page = alloc_page(gfp_flags);
 	if (page)
-		nfs_readdir_page_init_array(page, last_cookie);
+		nfs_readdir_page_init_array(page, last_cookie, 0);
 	return page;
 }
 
@@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
 	return ret;
 }
 
+static bool nfs_readdir_page_cookie_match(struct page *page, u64 last_cookie,
+					  u64 change_attr)
+{
+	struct nfs_cache_array *array = kmap_atomic(page);
+	int ret = true;
+
+	if (array->change_attr != change_attr)
+		ret = false;
+	if (array->size > 0 && array->array[0].cookie != last_cookie)
+		ret = false;
+	kunmap_atomic(array);
+	return ret;
+}
+
+static void nfs_readdir_page_unlock_and_put(struct page *page)
+{
+	unlock_page(page);
+	put_page(page);
+}
+
 static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
 						pgoff_t index, u64 last_cookie)
 {
 	struct page *page;
+	u64 change_attr;
 
 	page = grab_cache_page(mapping, index);
-	if (page && !PageUptodate(page)) {
-		nfs_readdir_page_init_array(page, last_cookie);
-		if (invalidate_inode_pages2_range(mapping, index + 1, -1) < 0)
-			nfs_zap_mapping(mapping->host, mapping);
-		SetPageUptodate(page);
+	if (!page)
+		return NULL;
+	change_attr = inode_peek_iversion_raw(mapping->host);
+	if (PageUptodate(page)) {
+		if (nfs_readdir_page_cookie_match(page, last_cookie,
+						  change_attr))
+			return page;
+		nfs_readdir_clear_array(page);
 	}
-
+	nfs_readdir_page_init_array(page, last_cookie, change_attr);
+	SetPageUptodate(page);
 	return page;
 }
 
@@ -356,12 +383,6 @@ static void nfs_readdir_page_set_eof(struct page *page)
 	kunmap_atomic(array);
 }
 
-static void nfs_readdir_page_unlock_and_put(struct page *page)
-{
-	unlock_page(page);
-	put_page(page);
-}
-
 static struct page *nfs_readdir_page_get_next(struct address_space *mapping,
 					      pgoff_t index, u64 cookie)
 {
@@ -418,16 +439,6 @@ static int nfs_readdir_search_for_pos(struct nfs_cache_array *array,
 	return -EBADCOOKIE;
 }
 
-static bool
-nfs_readdir_inode_mapping_valid(struct nfs_inode *nfsi)
-{
-	if (nfsi->cache_validity & (NFS_INO_INVALID_CHANGE |
-				    NFS_INO_INVALID_DATA))
-		return false;
-	smp_rmb();
-	return !test_bit(NFS_INO_INVALIDATING, &nfsi->flags);
-}
-
 static bool nfs_readdir_array_cookie_in_range(struct nfs_cache_array *array,
 					      u64 cookie)
 {
@@ -456,8 +467,7 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 			struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
 
 			new_pos = nfs_readdir_page_offset(desc->page) + i;
-			if (desc->attr_gencount != nfsi->attr_gencount ||
-			    !nfs_readdir_inode_mapping_valid(nfsi)) {
+			if (desc->attr_gencount != nfsi->attr_gencount) {
 				desc->duped = 0;
 				desc->attr_gencount = nfsi->attr_gencount;
 			} else if (new_pos < desc->prev_index) {
@@ -1094,11 +1104,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	 * to either find the entry with the appropriate number or
 	 * revalidate the cookie.
 	 */
-	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
-		res = nfs_revalidate_mapping(inode, file->f_mapping);
-		if (res < 0)
-			goto out;
-	}
+	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
 
 	res = -ENOMEM;
 	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the page cache
  2022-02-23 21:12         ` [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache trondmy
@ 2022-02-23 21:12           ` trondmy
  2022-02-23 21:12             ` [PATCH v7 07/21] NFS: Don't re-read the entire page cache to find the next cookie trondmy
  2022-02-24 16:18             ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the page cache Anna Schumaker
  2022-02-24 14:53           ` [PATCH v7 05/21] NFS: Store the change attribute in the directory " Benjamin Coddington
  1 sibling, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Ensure that if the cookie verifier changes when we use the zero-valued
cookie, then we invalidate any cached pages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 5d9367d9b651..7932d474ce00 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -945,9 +945,14 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		/*
 		 * Set the cookie verifier if the page cache was empty
 		 */
-		if (desc->page_index == 0)
+		if (desc->last_cookie == 0 &&
+		    memcmp(nfsi->cookieverf, verf, sizeof(nfsi->cookieverf))) {
 			memcpy(nfsi->cookieverf, verf,
 			       sizeof(nfsi->cookieverf));
+			invalidate_inode_pages2_range(desc->file->f_mapping,
+						      desc->page_index_max + 1,
+						      -1);
+		}
 	}
 	res = nfs_readdir_search_array(desc);
 	if (res == 0)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 07/21] NFS: Don't re-read the entire page cache to find the next cookie
  2022-02-23 21:12           ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the " trondmy
@ 2022-02-23 21:12             ` trondmy
  2022-02-23 21:12               ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
  2022-02-24 16:18             ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the page cache Anna Schumaker
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the page cache entry that was last read gets invalidated for some
reason, then make sure we can re-create it on the next call to readdir.
This, combined with the cache page validation, allows us to reuse the
cached value of page-index on successive calls to nfs_readdir.

Credit is due to Benjamin Coddington for showing that the concept works,
and that it allows for improved cache sharing between processes even in
the case where pages are lost due to LRU or active invalidation.

Suggested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 10 +++++++---
 include/linux/nfs_fs.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 7932d474ce00..70c0db877815 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1124,6 +1124,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->dup_cookie = dir_ctx->dup_cookie;
 	desc->duped = dir_ctx->duped;
 	page_index = dir_ctx->page_index;
+	desc->page_index = page_index;
+	desc->last_cookie = dir_ctx->last_cookie;
 	desc->attr_gencount = dir_ctx->attr_gencount;
 	desc->eof = dir_ctx->eof;
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
@@ -1172,6 +1174,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	spin_lock(&file->f_lock);
 	dir_ctx->dir_cookie = desc->dir_cookie;
 	dir_ctx->dup_cookie = desc->dup_cookie;
+	dir_ctx->last_cookie = desc->last_cookie;
 	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
@@ -1213,10 +1216,11 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 	}
 	if (offset != filp->f_pos) {
 		filp->f_pos = offset;
-		if (nfs_readdir_use_cookie(filp))
-			dir_ctx->dir_cookie = offset;
-		else
+		if (!nfs_readdir_use_cookie(filp)) {
 			dir_ctx->dir_cookie = 0;
+			dir_ctx->page_index = 0;
+		} else
+			dir_ctx->dir_cookie = offset;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
 		dir_ctx->duped = 0;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 6e10725887d1..1c533f2c1f36 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -105,6 +105,7 @@ struct nfs_open_dir_context {
 	__be32	verf[NFS_DIR_VERIFIER_SIZE];
 	__u64 dir_cookie;
 	__u64 dup_cookie;
+	__u64 last_cookie;
 	pgoff_t page_index;
 	signed char duped;
 	bool eof;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir
  2022-02-23 21:12             ` [PATCH v7 07/21] NFS: Don't re-read the entire page cache to find the next cookie trondmy
@ 2022-02-23 21:12               ` trondmy
  2022-02-23 21:12                 ` [PATCH v7 09/21] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
  2022-02-24 16:30                 ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir Anna Schumaker
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

This patch therefore tries to tone down the amount of readahead we
perform, and adjust it to try to match the amount of data being
requested by user space.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 55 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/nfs_fs.h |  1 +
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 70c0db877815..83933b7018ea 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -69,6 +69,8 @@ const struct address_space_operations nfs_dir_aops = {
 	.freepage = nfs_readdir_clear_array,
 };
 
+#define NFS_INIT_DTSIZE PAGE_SIZE
+
 static struct nfs_open_dir_context *
 alloc_nfs_open_dir_context(struct inode *dir)
 {
@@ -78,6 +80,7 @@ alloc_nfs_open_dir_context(struct inode *dir)
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
 	if (ctx != NULL) {
 		ctx->attr_gencount = nfsi->attr_gencount;
+		ctx->dtsize = NFS_INIT_DTSIZE;
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
 		    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
@@ -153,6 +156,7 @@ struct nfs_readdir_descriptor {
 	struct page	*page;
 	struct dir_context *ctx;
 	pgoff_t		page_index;
+	pgoff_t		page_index_max;
 	u64		dir_cookie;
 	u64		last_cookie;
 	u64		dup_cookie;
@@ -165,12 +169,36 @@ struct nfs_readdir_descriptor {
 	unsigned long	gencount;
 	unsigned long	attr_gencount;
 	unsigned int	cache_entry_index;
+	unsigned int	buffer_fills;
+	unsigned int	dtsize;
 	signed char duped;
 	bool plus;
 	bool eob;
 	bool eof;
 };
 
+static void nfs_set_dtsize(struct nfs_readdir_descriptor *desc, unsigned int sz)
+{
+	struct nfs_server *server = NFS_SERVER(file_inode(desc->file));
+	unsigned int maxsize = server->dtsize;
+
+	if (sz > maxsize)
+		sz = maxsize;
+	if (sz < NFS_MIN_FILE_IO_SIZE)
+		sz = NFS_MIN_FILE_IO_SIZE;
+	desc->dtsize = sz;
+}
+
+static void nfs_shrink_dtsize(struct nfs_readdir_descriptor *desc)
+{
+	nfs_set_dtsize(desc, desc->dtsize >> 1);
+}
+
+static void nfs_grow_dtsize(struct nfs_readdir_descriptor *desc)
+{
+	nfs_set_dtsize(desc, desc->dtsize << 1);
+}
+
 static void nfs_readdir_array_init(struct nfs_cache_array *array)
 {
 	memset(array, 0, sizeof(struct nfs_cache_array));
@@ -774,6 +802,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 				break;
 			arrays++;
 			*arrays = page = new;
+			desc->page_index_max++;
 		} else {
 			new = nfs_readdir_page_get_next(mapping,
 							page->index + 1,
@@ -783,6 +812,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 			if (page != *arrays)
 				nfs_readdir_page_unlock_and_put(page);
 			page = new;
+			desc->page_index_max = new->index;
 		}
 		status = nfs_readdir_add_to_array(entry, page);
 	} while (!status && !entry->eof);
@@ -848,7 +878,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	struct nfs_entry *entry;
 	size_t array_size;
 	struct inode *inode = file_inode(desc->file);
-	size_t dtsize = NFS_SERVER(inode)->dtsize;
+	unsigned int dtsize = desc->dtsize;
 	int status = -ENOMEM;
 
 	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
@@ -884,6 +914,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 
 		status = nfs_readdir_page_filler(desc, entry, pages, pglen,
 						 arrays, narrays);
+		desc->buffer_fills++;
 	} while (!status && nfs_readdir_page_needs_filling(page) &&
 		page_mapping(page));
 
@@ -931,6 +962,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	if (!desc->page)
 		return -ENOMEM;
 	if (nfs_readdir_page_needs_filling(desc->page)) {
+		desc->page_index_max = desc->page_index;
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
 		if (res < 0) {
@@ -1067,6 +1099,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	desc->cache_entry_index = 0;
 	desc->last_cookie = desc->dir_cookie;
 	desc->duped = 0;
+	desc->page_index_max = 0;
 
 	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
 
@@ -1076,10 +1109,22 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	}
 	desc->page = NULL;
 
+	/*
+	 * Grow the dtsize if we have to go back for more pages,
+	 * or shrink it if we're reading too many.
+	 */
+	if (!desc->eof) {
+		if (!desc->eob)
+			nfs_grow_dtsize(desc);
+		else if (desc->buffer_fills == 1 &&
+			 i < (desc->page_index_max >> 1))
+			nfs_shrink_dtsize(desc);
+	}
 
 	for (i = 0; i < sz && arrays[i]; i++)
 		nfs_readdir_page_array_free(arrays[i]);
 out:
+	desc->page_index_max = -1;
 	kfree(arrays);
 	dfprintk(DIRCACHE, "NFS: %s: returns %d\n", __func__, status);
 	return status;
@@ -1118,6 +1163,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->file = file;
 	desc->ctx = ctx;
 	desc->plus = nfs_use_readdirplus(inode, ctx);
+	desc->page_index_max = -1;
 
 	spin_lock(&file->f_lock);
 	desc->dir_cookie = dir_ctx->dir_cookie;
@@ -1128,6 +1174,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->last_cookie = dir_ctx->last_cookie;
 	desc->attr_gencount = dir_ctx->attr_gencount;
 	desc->eof = dir_ctx->eof;
+	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
 	spin_unlock(&file->f_lock);
 
@@ -1169,6 +1216,11 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 
 		nfs_do_filldir(desc, nfsi->cookieverf);
 		nfs_readdir_page_unlock_and_put_cached(desc);
+		if (desc->eob || desc->eof)
+			break;
+		/* Grow the dtsize if we have to go back for more pages */
+		if (desc->page_index == desc->page_index_max)
+			nfs_grow_dtsize(desc);
 	} while (!desc->eob && !desc->eof);
 
 	spin_lock(&file->f_lock);
@@ -1179,6 +1231,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
 	dir_ctx->eof = desc->eof;
+	dir_ctx->dtsize = desc->dtsize;
 	memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
 	spin_unlock(&file->f_lock);
 out_free:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 1c533f2c1f36..691a27936849 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -107,6 +107,7 @@ struct nfs_open_dir_context {
 	__u64 dup_cookie;
 	__u64 last_cookie;
 	pgoff_t page_index;
+	unsigned int dtsize;
 	signed char duped;
 	bool eof;
 };
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 09/21] NFS: Simplify nfs_readdir_xdr_to_array()
  2022-02-23 21:12               ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
@ 2022-02-23 21:12                 ` trondmy
  2022-02-23 21:12                   ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir trondmy
  2022-02-24 16:30                 ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir Anna Schumaker
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Recent changes to readdir mean that we can cope with partially filled
page cache entries, so we no longer need to rely on looping in
nfs_readdir_xdr_to_array().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 83933b7018ea..9b0f13b52dbf 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -879,6 +879,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	size_t array_size;
 	struct inode *inode = file_inode(desc->file);
 	unsigned int dtsize = desc->dtsize;
+	unsigned int pglen;
 	int status = -ENOMEM;
 
 	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
@@ -896,28 +897,20 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	if (!pages)
 		goto out;
 
-	do {
-		unsigned int pglen;
-		status = nfs_readdir_xdr_filler(desc, verf_arg, entry->cookie,
-						pages, dtsize,
-						verf_res);
-		if (status < 0)
-			break;
-
-		pglen = status;
-		if (pglen == 0) {
-			nfs_readdir_page_set_eof(page);
-			break;
-		}
-
-		verf_arg = verf_res;
+	status = nfs_readdir_xdr_filler(desc, verf_arg, entry->cookie, pages,
+					dtsize, verf_res);
+	if (status < 0)
+		goto free_pages;
 
+	pglen = status;
+	if (pglen != 0)
 		status = nfs_readdir_page_filler(desc, entry, pages, pglen,
 						 arrays, narrays);
-		desc->buffer_fills++;
-	} while (!status && nfs_readdir_page_needs_filling(page) &&
-		page_mapping(page));
+	else
+		nfs_readdir_page_set_eof(page);
+	desc->buffer_fills++;
 
+free_pages:
 	nfs_readdir_free_pages(pages, array_size);
 out:
 	nfs_free_fattr(entry->fattr);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 10/21] NFS: Reduce use of uncached readdir
  2022-02-23 21:12                 ` [PATCH v7 09/21] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
@ 2022-02-23 21:12                   ` trondmy
  2022-02-23 21:12                     ` [PATCH v7 11/21] NFS: Improve heuristic for readdirplus trondmy
  2022-02-24 16:55                     ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir Anna Schumaker
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. With the change to allow
readdir to continue reading even when the cache is incomplete, we no
longer need to fall back to uncached readdir in order to scale to large
directories.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 9b0f13b52dbf..982b5dbe30d7 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -986,28 +986,11 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	return res;
 }
 
-static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
-{
-	struct address_space *mapping = desc->file->f_mapping;
-	struct inode *dir = file_inode(desc->file);
-	unsigned int dtsize = NFS_SERVER(dir)->dtsize;
-	loff_t size = i_size_read(dir);
-
-	/*
-	 * Default to uncached readdir if the page cache is empty, and
-	 * we're looking for a non-zero cookie in a large directory.
-	 */
-	return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
-}
-
 /* Search for desc->dir_cookie from the beginning of the page cache */
 static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 {
 	int res;
 
-	if (nfs_readdir_dont_search_cache(desc))
-		return -EBADCOOKIE;
-
 	do {
 		if (desc->page_index == 0) {
 			desc->current_index = 0;
@@ -1262,10 +1245,10 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 	}
 	if (offset != filp->f_pos) {
 		filp->f_pos = offset;
-		if (!nfs_readdir_use_cookie(filp)) {
+		dir_ctx->page_index = 0;
+		if (!nfs_readdir_use_cookie(filp))
 			dir_ctx->dir_cookie = 0;
-			dir_ctx->page_index = 0;
-		} else
+		else
 			dir_ctx->dir_cookie = offset;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 11/21] NFS: Improve heuristic for readdirplus
  2022-02-23 21:12                   ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir trondmy
@ 2022-02-23 21:12                     ` trondmy
  2022-02-23 21:12                       ` [PATCH v7 12/21] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
  2022-02-24 16:55                     ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir Anna Schumaker
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.

The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.

To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 82 ++++++++++++++++++++++++++----------------
 fs/nfs/inode.c         |  4 +--
 fs/nfs/internal.h      |  4 +--
 fs/nfs/nfstrace.h      |  1 -
 include/linux/nfs_fs.h |  5 +--
 5 files changed, 58 insertions(+), 38 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 982b5dbe30d7..e7942fbe3f50 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -87,8 +87,7 @@ alloc_nfs_open_dir_context(struct inode *dir)
 			nfs_set_cache_invalid(dir,
 					      NFS_INO_INVALID_DATA |
 						      NFS_INO_REVAL_FORCED);
-		list_add(&ctx->list, &nfsi->open_files);
-		clear_bit(NFS_INO_FORCE_READDIR, &nfsi->flags);
+		list_add_tail_rcu(&ctx->list, &nfsi->open_files);
 		spin_unlock(&dir->i_lock);
 		return ctx;
 	}
@@ -98,9 +97,9 @@ alloc_nfs_open_dir_context(struct inode *dir)
 static void put_nfs_open_dir_context(struct inode *dir, struct nfs_open_dir_context *ctx)
 {
 	spin_lock(&dir->i_lock);
-	list_del(&ctx->list);
+	list_del_rcu(&ctx->list);
 	spin_unlock(&dir->i_lock);
-	kfree(ctx);
+	kfree_rcu(ctx, rcu_head);
 }
 
 /*
@@ -584,7 +583,6 @@ static int nfs_readdir_xdr_filler(struct nfs_readdir_descriptor *desc,
 		/* We requested READDIRPLUS, but the server doesn't grok it */
 		if (error == -ENOTSUPP && desc->plus) {
 			NFS_SERVER(inode)->caps &= ~NFS_CAP_READDIRPLUS;
-			clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
 			desc->plus = arg.plus = false;
 			goto again;
 		}
@@ -634,51 +632,61 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
 	return 1;
 }
 
-static
-bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
+#define NFS_READDIR_CACHE_USAGE_THRESHOLD (8UL)
+
+static bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
+				unsigned int cache_hits,
+				unsigned int cache_misses)
 {
 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
 		return false;
-	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
-		return true;
-	if (ctx->pos == 0)
+	if (ctx->pos == 0 ||
+	    cache_hits + cache_misses > NFS_READDIR_CACHE_USAGE_THRESHOLD)
 		return true;
 	return false;
 }
 
 /*
- * This function is called by the lookup and getattr code to request the
+ * This function is called by the getattr code to request the
  * use of readdirplus to accelerate any future lookups in the same
  * directory.
  */
-void nfs_advise_use_readdirplus(struct inode *dir)
+void nfs_readdir_record_entry_cache_hit(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
+	struct nfs_open_dir_context *ctx;
 
-	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
-	    !list_empty(&nfsi->open_files))
-		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
+	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)) {
+		rcu_read_lock();
+		list_for_each_entry_rcu (ctx, &nfsi->open_files, list)
+			atomic_inc(&ctx->cache_hits);
+		rcu_read_unlock();
+	}
 }
 
 /*
  * This function is mainly for use by nfs_getattr().
  *
  * If this is an 'ls -l', we want to force use of readdirplus.
- * Do this by checking if there is an active file descriptor
- * and calling nfs_advise_use_readdirplus, then forcing a
- * cache flush.
  */
-void nfs_force_use_readdirplus(struct inode *dir)
+void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
+	struct nfs_open_dir_context *ctx;
 
-	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
-	    !list_empty(&nfsi->open_files)) {
-		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
-		set_bit(NFS_INO_FORCE_READDIR, &nfsi->flags);
+	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)) {
+		rcu_read_lock();
+		list_for_each_entry_rcu (ctx, &nfsi->open_files, list)
+			atomic_inc(&ctx->cache_misses);
+		rcu_read_unlock();
 	}
 }
 
+static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
+{
+	nfs_readdir_record_entry_cache_miss(dir);
+}
+
 static
 void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		unsigned long dir_verifier)
@@ -1106,6 +1114,19 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	return status;
 }
 
+#define NFS_READDIR_CACHE_MISS_THRESHOLD (16UL)
+
+static void nfs_readdir_handle_cache_misses(struct inode *inode,
+					    struct nfs_readdir_descriptor *desc,
+					    pgoff_t page_index,
+					    unsigned int cache_misses)
+{
+	if (desc->ctx->pos == 0 ||
+	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD)
+		return;
+	invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+}
+
 /* The file offset position represents the dirent entry number.  A
    last cookie cache takes care of the common case of reading the
    whole directory.
@@ -1117,6 +1138,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_open_dir_context *dir_ctx = file->private_data;
 	struct nfs_readdir_descriptor *desc;
+	unsigned int cache_hits, cache_misses;
 	pgoff_t page_index;
 	int res;
 
@@ -1138,7 +1160,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		goto out;
 	desc->file = file;
 	desc->ctx = ctx;
-	desc->plus = nfs_use_readdirplus(inode, ctx);
 	desc->page_index_max = -1;
 
 	spin_lock(&file->f_lock);
@@ -1152,6 +1173,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->eof = dir_ctx->eof;
 	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
+	cache_hits = atomic_xchg(&dir_ctx->cache_hits, 0);
+	cache_misses = atomic_xchg(&dir_ctx->cache_misses, 0);
 	spin_unlock(&file->f_lock);
 
 	if (desc->eof) {
@@ -1159,9 +1182,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		goto out_free;
 	}
 
-	if (test_and_clear_bit(NFS_INO_FORCE_READDIR, &nfsi->flags) &&
-	    list_is_singular(&nfsi->open_files))
-		invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+	desc->plus = nfs_use_readdirplus(inode, ctx, cache_hits, cache_misses);
+	nfs_readdir_handle_cache_misses(inode, desc, page_index, cache_misses);
 
 	do {
 		res = readdir_search_pagecache(desc);
@@ -1180,7 +1202,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 			break;
 		}
 		if (res == -ETOOSMALL && desc->plus) {
-			clear_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
 			nfs_zap_caches(inode);
 			desc->page_index = 0;
 			desc->plus = false;
@@ -1599,7 +1620,7 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 	nfs_set_verifier(dentry, dir_verifier);
 
 	/* set a readdirplus hint that we had a cache miss */
-	nfs_force_use_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir);
 	ret = 1;
 out:
 	nfs_free_fattr(fattr);
@@ -1656,7 +1677,6 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 				nfs_mark_dir_for_revalidate(dir);
 			goto out_bad;
 		}
-		nfs_advise_use_readdirplus(dir);
 		goto out_valid;
 	}
 
@@ -1861,7 +1881,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 		goto out;
 
 	/* Notify readdir to use READDIRPLUS */
-	nfs_force_use_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir);
 
 no_entry:
 	res = d_splice_alias(inode, dentry);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7cecabf57b95..bbf4357ff727 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -787,7 +787,7 @@ static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
 	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
 		return;
 	parent = dget_parent(dentry);
-	nfs_force_use_readdirplus(d_inode(parent));
+	nfs_readdir_record_entry_cache_miss(d_inode(parent));
 	dput(parent);
 }
 
@@ -798,7 +798,7 @@ static void nfs_readdirplus_parent_cache_hit(struct dentry *dentry)
 	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
 		return;
 	parent = dget_parent(dentry);
-	nfs_advise_use_readdirplus(d_inode(parent));
+	nfs_readdir_record_entry_cache_hit(d_inode(parent));
 	dput(parent);
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index b5398af53c7f..194840a97e3a 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -366,8 +366,8 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
 			   const struct nfs_client_initdata *);
 
 /* dir.c */
-extern void nfs_advise_use_readdirplus(struct inode *dir);
-extern void nfs_force_use_readdirplus(struct inode *dir);
+extern void nfs_readdir_record_entry_cache_hit(struct inode *dir);
+extern void nfs_readdir_record_entry_cache_miss(struct inode *dir);
 extern unsigned long nfs_access_cache_count(struct shrinker *shrink,
 					    struct shrink_control *sc);
 extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 45a310b586ce..3672f6703ee7 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -36,7 +36,6 @@
 
 #define nfs_show_nfsi_flags(v) \
 	__print_flags(v, "|", \
-			{ BIT(NFS_INO_ADVISE_RDPLUS), "ADVISE_RDPLUS" }, \
 			{ BIT(NFS_INO_STALE), "STALE" }, \
 			{ BIT(NFS_INO_ACL_LRU_SET), "ACL_LRU_SET" }, \
 			{ BIT(NFS_INO_INVALIDATING), "INVALIDATING" }, \
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 691a27936849..20a4cf0acad2 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -101,6 +101,8 @@ struct nfs_open_context {
 
 struct nfs_open_dir_context {
 	struct list_head list;
+	atomic_t cache_hits;
+	atomic_t cache_misses;
 	unsigned long attr_gencount;
 	__be32	verf[NFS_DIR_VERIFIER_SIZE];
 	__u64 dir_cookie;
@@ -110,6 +112,7 @@ struct nfs_open_dir_context {
 	unsigned int dtsize;
 	signed char duped;
 	bool eof;
+	struct rcu_head rcu_head;
 };
 
 /*
@@ -274,13 +277,11 @@ struct nfs4_copy_state {
 /*
  * Bit offsets in flags field
  */
-#define NFS_INO_ADVISE_RDPLUS	(0)		/* advise readdirplus */
 #define NFS_INO_STALE		(1)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
 #define NFS_INO_INVALIDATING	(3)		/* inode is being invalidated */
 #define NFS_INO_PRESERVE_UNLINKED (4)		/* preserve file if removed while open */
 #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
-#define NFS_INO_FORCE_READDIR	(7)		/* force readdirplus */
 #define NFS_INO_LAYOUTCOMMIT	(9)		/* layoutcommit required */
 #define NFS_INO_LAYOUTCOMMITTING (10)		/* layoutcommit inflight */
 #define NFS_INO_LAYOUTSTATS	(11)		/* layoutstats inflight */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 12/21] NFS: Don't ask for readdirplus unless it can help nfs_getattr()
  2022-02-23 21:12                     ` [PATCH v7 11/21] NFS: Improve heuristic for readdirplus trondmy
@ 2022-02-23 21:12                       ` trondmy
  2022-02-23 21:12                         ` [PATCH v7 13/21] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
  0 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If attribute caching is turned off, then use of readdirplus is not going
to help stat() performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index bbf4357ff727..10d17cfb8639 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -780,24 +780,26 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr,
 }
 EXPORT_SYMBOL_GPL(nfs_setattr_update_inode);
 
-static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
+/*
+ * Don't request help from readdirplus if the file is being written to,
+ * or if attribute caching is turned off
+ */
+static bool nfs_getattr_readdirplus_enable(const struct inode *inode)
 {
-	struct dentry *parent;
+	return nfs_server_capable(inode, NFS_CAP_READDIRPLUS) &&
+	       !nfs_have_writebacks(inode) && NFS_MAXATTRTIMEO(inode) > 5 * HZ;
+}
 
-	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
-		return;
-	parent = dget_parent(dentry);
+static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
+{
+	struct dentry *parent = dget_parent(dentry);
 	nfs_readdir_record_entry_cache_miss(d_inode(parent));
 	dput(parent);
 }
 
 static void nfs_readdirplus_parent_cache_hit(struct dentry *dentry)
 {
-	struct dentry *parent;
-
-	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
-		return;
-	parent = dget_parent(dentry);
+	struct dentry *parent = dget_parent(dentry);
 	nfs_readdir_record_entry_cache_hit(d_inode(parent));
 	dput(parent);
 }
@@ -835,6 +837,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	int err = 0;
 	bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
 	bool do_update = false;
+	bool readdirplus_enabled = nfs_getattr_readdirplus_enable(inode);
 
 	trace_nfs_getattr_enter(inode);
 
@@ -843,7 +846,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 			STATX_INO | STATX_SIZE | STATX_BLOCKS;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
-		nfs_readdirplus_parent_cache_hit(path->dentry);
+		if (readdirplus_enabled)
+			nfs_readdirplus_parent_cache_hit(path->dentry);
 		goto out_no_revalidate;
 	}
 
@@ -893,15 +897,12 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		do_update |= cache_validity & NFS_INO_INVALID_BLOCKS;
 
 	if (do_update) {
-		/* Update the attribute cache */
-		if (!(server->flags & NFS_MOUNT_NOAC))
+		if (readdirplus_enabled)
 			nfs_readdirplus_parent_cache_miss(path->dentry);
-		else
-			nfs_readdirplus_parent_cache_hit(path->dentry);
 		err = __nfs_revalidate_inode(server, inode);
 		if (err)
 			goto out;
-	} else
+	} else if (readdirplus_enabled)
 		nfs_readdirplus_parent_cache_hit(path->dentry);
 out_no_revalidate:
 	/* Only return attributes that were revalidated. */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 13/21] NFSv4: Ask for a full XDR buffer of readdir goodness
  2022-02-23 21:12                       ` [PATCH v7 12/21] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
@ 2022-02-23 21:12                         ` trondmy
  2022-02-23 21:12                           ` [PATCH v7 14/21] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
  0 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Instead of pretending that we know the ratio of directory info vs
readdirplus attribute info, just set the 'dircount' field to the same
value as the 'maxcount' field.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3xdr.c | 7 ++++---
 fs/nfs/nfs4xdr.c | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index 54a1d21cbcc6..296320f91579 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -1261,6 +1261,8 @@ static void nfs3_xdr_enc_readdir3args(struct rpc_rqst *req,
 static void encode_readdirplus3args(struct xdr_stream *xdr,
 				    const struct nfs3_readdirargs *args)
 {
+	uint32_t dircount = args->count;
+	uint32_t maxcount = args->count;
 	__be32 *p;
 
 	encode_nfs_fh3(xdr, args->fh);
@@ -1273,9 +1275,8 @@ static void encode_readdirplus3args(struct xdr_stream *xdr,
 	 * readdirplus: need dircount + buffer size.
 	 * We just make sure we make dircount big enough
 	 */
-	*p++ = cpu_to_be32(args->count >> 3);
-
-	*p = cpu_to_be32(args->count);
+	*p++ = cpu_to_be32(dircount);
+	*p = cpu_to_be32(maxcount);
 }
 
 static void nfs3_xdr_enc_readdirplus3args(struct rpc_rqst *req,
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 8e70b92df4cc..b7780b97dc4d 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1605,7 +1605,8 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 		FATTR4_WORD0_RDATTR_ERROR,
 		FATTR4_WORD1_MOUNTED_ON_FILEID,
 	};
-	uint32_t dircount = readdir->count >> 1;
+	uint32_t dircount = readdir->count;
+	uint32_t maxcount = readdir->count;
 	__be32 *p, verf[2];
 	uint32_t attrlen = 0;
 	unsigned int i;
@@ -1618,7 +1619,6 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 			FATTR4_WORD1_SPACE_USED|FATTR4_WORD1_TIME_ACCESS|
 			FATTR4_WORD1_TIME_METADATA|FATTR4_WORD1_TIME_MODIFY;
 		attrs[2] |= FATTR4_WORD2_SECURITY_LABEL;
-		dircount >>= 1;
 	}
 	/* Use mounted_on_fileid only if the server supports it */
 	if (!(readdir->bitmask[1] & FATTR4_WORD1_MOUNTED_ON_FILEID))
@@ -1634,7 +1634,7 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 	encode_nfs4_verifier(xdr, &readdir->verifier);
 	p = reserve_space(xdr, 12 + (attrlen << 2));
 	*p++ = cpu_to_be32(dircount);
-	*p++ = cpu_to_be32(readdir->count);
+	*p++ = cpu_to_be32(maxcount);
 	*p++ = cpu_to_be32(attrlen);
 	for (i = 0; i < attrlen; i++)
 		*p++ = cpu_to_be32(attrs[i]);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 14/21] NFS: Readdirplus can't help lookup for case insensitive filesystems
  2022-02-23 21:12                         ` [PATCH v7 13/21] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
@ 2022-02-23 21:12                           ` trondmy
  2022-02-23 21:12                             ` [PATCH v7 15/21] NFS: Don't request readdirplus when revalidation was forced trondmy
  0 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the filesystem is case insensitive, then readdirplus can't help with
cache misses, since it won't return case folded variants of the filename.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e7942fbe3f50..a9098d5a9fc8 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -684,6 +684,8 @@ void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 
 static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
 {
+	if (nfs_server_capable(dir, NFS_CAP_CASE_INSENSITIVE))
+		return;
 	nfs_readdir_record_entry_cache_miss(dir);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 15/21] NFS: Don't request readdirplus when revalidation was forced
  2022-02-23 21:12                           ` [PATCH v7 14/21] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
@ 2022-02-23 21:12                             ` trondmy
  2022-02-23 21:13                               ` [PATCH v7 16/21] NFS: Add basic readdir tracing trondmy
  0 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:12 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the revalidation was forced, due to the presence of a LOOKUP_EXCL or
a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help
when we're doing a path component lookup.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a9098d5a9fc8..54f0d37485d5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -682,10 +682,13 @@ void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 	}
 }
 
-static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
+static void nfs_lookup_advise_force_readdirplus(struct inode *dir,
+						unsigned int flags)
 {
 	if (nfs_server_capable(dir, NFS_CAP_CASE_INSENSITIVE))
 		return;
+	if (flags & (LOOKUP_EXCL | LOOKUP_PARENT | LOOKUP_REVAL))
+		return;
 	nfs_readdir_record_entry_cache_miss(dir);
 }
 
@@ -1583,15 +1586,17 @@ nfs_lookup_revalidate_delegated(struct inode *dir, struct dentry *dentry,
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 }
 
-static int
-nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
-			     struct inode *inode)
+static int nfs_lookup_revalidate_dentry(struct inode *dir,
+					struct dentry *dentry,
+					struct inode *inode, unsigned int flags)
 {
 	struct nfs_fh *fhandle;
 	struct nfs_fattr *fattr;
 	unsigned long dir_verifier;
 	int ret;
 
+	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
+
 	ret = -ENOMEM;
 	fhandle = nfs_alloc_fhandle();
 	fattr = nfs_alloc_fattr_with_label(NFS_SERVER(inode));
@@ -1612,6 +1617,10 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 		}
 		goto out;
 	}
+
+	/* Request help from readdirplus */
+	nfs_lookup_advise_force_readdirplus(dir, flags);
+
 	ret = 0;
 	if (nfs_compare_fh(NFS_FH(inode), fhandle))
 		goto out;
@@ -1621,8 +1630,6 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 	nfs_setsecurity(inode, fattr);
 	nfs_set_verifier(dentry, dir_verifier);
 
-	/* set a readdirplus hint that we had a cache miss */
-	nfs_lookup_advise_force_readdirplus(dir);
 	ret = 1;
 out:
 	nfs_free_fattr(fattr);
@@ -1688,8 +1695,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 	if (NFS_STALE(inode))
 		goto out_bad;
 
-	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
-	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode, flags);
 out_valid:
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 out_bad:
@@ -1883,7 +1889,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 		goto out;
 
 	/* Notify readdir to use READDIRPLUS */
-	nfs_lookup_advise_force_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir, flags);
 
 no_entry:
 	res = d_splice_alias(inode, dentry);
@@ -2146,7 +2152,7 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 reval_dentry:
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
-	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode, flags);
 
 full_reval:
 	return nfs_do_lookup_revalidate(dir, dentry, flags);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 16/21] NFS: Add basic readdir tracing
  2022-02-23 21:12                             ` [PATCH v7 15/21] NFS: Don't request readdirplus when revalidation was forced trondmy
@ 2022-02-23 21:13                               ` trondmy
  2022-02-23 21:13                                 ` [PATCH v7 17/21] NFS: Trace effects of readdirplus on the dcache trondmy
  2022-02-24 15:53                                 ` [PATCH v7 16/21] NFS: Add basic readdir tracing Benjamin Coddington
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Add tracing to track how often the client goes to the server for updated
readdir information.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      | 13 ++++++++-
 fs/nfs/nfstrace.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 54f0d37485d5..41e2d02d8611 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -969,10 +969,14 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		return -ENOMEM;
 	if (nfs_readdir_page_needs_filling(desc->page)) {
 		desc->page_index_max = desc->page_index;
+		trace_nfs_readdir_cache_fill(desc->file, nfsi->cookieverf,
+					     desc->last_cookie,
+					     desc->page_index, desc->dtsize);
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
 		if (res < 0) {
 			nfs_readdir_page_unlock_and_put_cached(desc);
+			trace_nfs_readdir_cache_fill_done(inode, res);
 			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
 				invalidate_inode_pages2(desc->file->f_mapping);
 				desc->page_index = 0;
@@ -1090,7 +1094,14 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	desc->duped = 0;
 	desc->page_index_max = 0;
 
+	trace_nfs_readdir_uncached(desc->file, desc->verf, desc->last_cookie,
+				   -1, desc->dtsize);
+
 	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
+	if (status < 0) {
+		trace_nfs_readdir_uncached_done(file_inode(desc->file), status);
+		goto out_free;
+	}
 
 	for (i = 0; !desc->eob && i < sz && arrays[i]; i++) {
 		desc->page = arrays[i];
@@ -1109,7 +1120,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 			 i < (desc->page_index_max >> 1))
 			nfs_shrink_dtsize(desc);
 	}
-
+out_free:
 	for (i = 0; i < sz && arrays[i]; i++)
 		nfs_readdir_page_array_free(arrays[i]);
 out:
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 3672f6703ee7..c2d0543ecb2d 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -160,6 +160,8 @@ DEFINE_NFS_INODE_EVENT(nfs_fsync_enter);
 DEFINE_NFS_INODE_EVENT_DONE(nfs_fsync_exit);
 DEFINE_NFS_INODE_EVENT(nfs_access_enter);
 DEFINE_NFS_INODE_EVENT_DONE(nfs_set_cache_invalid);
+DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_cache_fill_done);
+DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_uncached_done);
 
 TRACE_EVENT(nfs_access_exit,
 		TP_PROTO(
@@ -271,6 +273,72 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
 DEFINE_NFS_UPDATE_SIZE_EVENT(update);
 DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
 
+DECLARE_EVENT_CLASS(nfs_readdir_event,
+		TP_PROTO(
+			const struct file *file,
+			const __be32 *verifier,
+			u64 cookie,
+			pgoff_t page_index,
+			unsigned int dtsize
+		),
+
+		TP_ARGS(file, verifier, cookie, page_index, dtsize),
+
+		TP_STRUCT__entry(
+			__field(dev_t, dev)
+			__field(u32, fhandle)
+			__field(u64, fileid)
+			__field(u64, version)
+			__array(char, verifier, NFS4_VERIFIER_SIZE)
+			__field(u64, cookie)
+			__field(pgoff_t, index)
+			__field(unsigned int, dtsize)
+		),
+
+		TP_fast_assign(
+			const struct inode *dir = file_inode(file);
+			const struct nfs_inode *nfsi = NFS_I(dir);
+
+			__entry->dev = dir->i_sb->s_dev;
+			__entry->fileid = nfsi->fileid;
+			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
+			__entry->version = inode_peek_iversion_raw(dir);
+			if (cookie != 0)
+				memcpy(__entry->verifier, verifier,
+				       NFS4_VERIFIER_SIZE);
+			else
+				memset(__entry->verifier, 0,
+				       NFS4_VERIFIER_SIZE);
+			__entry->cookie = cookie;
+			__entry->index = page_index;
+			__entry->dtsize = dtsize;
+		),
+
+		TP_printk(
+			"fileid=%02x:%02x:%llu fhandle=0x%08x version=%llu "
+			"cookie=%s:0x%llx cache_index=%lu dtsize=%u",
+			MAJOR(__entry->dev), MINOR(__entry->dev),
+			(unsigned long long)__entry->fileid, __entry->fhandle,
+			__entry->version, show_nfs4_verifier(__entry->verifier),
+			(unsigned long long)__entry->cookie, __entry->index,
+			__entry->dtsize
+		)
+);
+
+#define DEFINE_NFS_READDIR_EVENT(name) \
+	DEFINE_EVENT(nfs_readdir_event, name, \
+			TP_PROTO( \
+				const struct file *file, \
+				const __be32 *verifier, \
+				u64 cookie, \
+				pgoff_t page_index, \
+				unsigned int dtsize \
+				), \
+			TP_ARGS(file, verifier, cookie, page_index, dtsize))
+
+DEFINE_NFS_READDIR_EVENT(nfs_readdir_cache_fill);
+DEFINE_NFS_READDIR_EVENT(nfs_readdir_uncached);
+
 DECLARE_EVENT_CLASS(nfs_lookup_event,
 		TP_PROTO(
 			const struct inode *dir,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 17/21] NFS: Trace effects of readdirplus on the dcache
  2022-02-23 21:13                               ` [PATCH v7 16/21] NFS: Add basic readdir tracing trondmy
@ 2022-02-23 21:13                                 ` trondmy
  2022-02-23 21:13                                   ` [PATCH v7 18/21] NFS: Trace effects of the readdirplus heuristic trondmy
  2022-02-24 15:53                                 ` [PATCH v7 16/21] NFS: Add basic readdir tracing Benjamin Coddington
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Trace the effects of readdirplus on attribute and dentry revalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      | 5 +++++
 fs/nfs/nfstrace.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 41e2d02d8611..95b18d1ad0cf 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -742,8 +742,12 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 			status = nfs_refresh_inode(d_inode(dentry), entry->fattr);
 			if (!status)
 				nfs_setsecurity(d_inode(dentry), entry->fattr);
+			trace_nfs_readdir_lookup_revalidate(d_inode(parent),
+							    dentry, 0, status);
 			goto out;
 		} else {
+			trace_nfs_readdir_lookup_revalidate_failed(
+				d_inode(parent), dentry, 0);
 			d_invalidate(dentry);
 			dput(dentry);
 			dentry = NULL;
@@ -765,6 +769,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		dentry = alias;
 	}
 	nfs_set_verifier(dentry, dir_verifier);
+	trace_nfs_readdir_lookup(d_inode(parent), dentry, 0);
 out:
 	dput(dentry);
 }
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index c2d0543ecb2d..7c1102b991d0 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -432,6 +432,9 @@ DEFINE_NFS_LOOKUP_EVENT(nfs_lookup_enter);
 DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_exit);
 DEFINE_NFS_LOOKUP_EVENT(nfs_lookup_revalidate_enter);
 DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_revalidate_exit);
+DEFINE_NFS_LOOKUP_EVENT(nfs_readdir_lookup);
+DEFINE_NFS_LOOKUP_EVENT(nfs_readdir_lookup_revalidate_failed);
+DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_readdir_lookup_revalidate);
 
 TRACE_EVENT(nfs_atomic_open_enter,
 		TP_PROTO(
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 18/21] NFS: Trace effects of the readdirplus heuristic
  2022-02-23 21:13                                 ` [PATCH v7 17/21] NFS: Trace effects of readdirplus on the dcache trondmy
@ 2022-02-23 21:13                                   ` trondmy
  2022-02-23 21:13                                     ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index trondmy
  0 siblings, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Enable tracking of when the readdirplus heuristic causes a page cache
invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      | 11 ++++++++++-
 fs/nfs/nfstrace.h | 50 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 95b18d1ad0cf..06bd612296d5 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -985,6 +985,8 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
 				invalidate_inode_pages2(desc->file->f_mapping);
 				desc->page_index = 0;
+				trace_nfs_readdir_invalidate_cache_range(
+					inode, 0, MAX_LFS_FILESIZE);
 				return -EAGAIN;
 			}
 			return res;
@@ -999,6 +1001,9 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 			invalidate_inode_pages2_range(desc->file->f_mapping,
 						      desc->page_index_max + 1,
 						      -1);
+			trace_nfs_readdir_invalidate_cache_range(
+				inode, desc->page_index_max + 1,
+				MAX_LFS_FILESIZE);
 		}
 	}
 	res = nfs_readdir_search_array(desc);
@@ -1145,7 +1150,11 @@ static void nfs_readdir_handle_cache_misses(struct inode *inode,
 	if (desc->ctx->pos == 0 ||
 	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD)
 		return;
-	invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+	if (invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1) == 0)
+		return;
+	trace_nfs_readdir_invalidate_cache_range(
+		inode, (loff_t)(page_index + 1) << PAGE_SHIFT,
+		MAX_LFS_FILESIZE);
 }
 
 /* The file offset position represents the dirent entry number.  A
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 7c1102b991d0..ec2645d20abf 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -273,6 +273,56 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
 DEFINE_NFS_UPDATE_SIZE_EVENT(update);
 DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
 
+DECLARE_EVENT_CLASS(nfs_inode_range_event,
+		TP_PROTO(
+			const struct inode *inode,
+			loff_t range_start,
+			loff_t range_end
+		),
+
+		TP_ARGS(inode, range_start, range_end),
+
+		TP_STRUCT__entry(
+			__field(dev_t, dev)
+			__field(u32, fhandle)
+			__field(u64, fileid)
+			__field(u64, version)
+			__field(loff_t, range_start)
+			__field(loff_t, range_end)
+		),
+
+		TP_fast_assign(
+			const struct nfs_inode *nfsi = NFS_I(inode);
+
+			__entry->dev = inode->i_sb->s_dev;
+			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
+			__entry->fileid = nfsi->fileid;
+			__entry->version = inode_peek_iversion_raw(inode);
+			__entry->range_start = range_start;
+			__entry->range_end = range_end;
+		),
+
+		TP_printk(
+			"fileid=%02x:%02x:%llu fhandle=0x%08x version=%llu "
+			"range=[%lld, %lld]",
+			MAJOR(__entry->dev), MINOR(__entry->dev),
+			(unsigned long long)__entry->fileid,
+			__entry->fhandle, __entry->version,
+			__entry->range_start, __entry->range_end
+		)
+);
+
+#define DEFINE_NFS_INODE_RANGE_EVENT(name) \
+	DEFINE_EVENT(nfs_inode_range_event, name, \
+			TP_PROTO( \
+				const struct inode *inode, \
+				loff_t range_start, \
+				loff_t range_end \
+			), \
+			TP_ARGS(inode, range_start, range_end))
+
+DEFINE_NFS_INODE_RANGE_EVENT(nfs_readdir_invalidate_cache_range);
+
 DECLARE_EVENT_CLASS(nfs_readdir_event,
 		TP_PROTO(
 			const struct file *file,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-23 21:13                                   ` [PATCH v7 18/21] NFS: Trace effects of the readdirplus heuristic trondmy
@ 2022-02-23 21:13                                     ` trondmy
  2022-02-23 21:13                                       ` [PATCH v7 20/21] NFS: Fix up forced readdirplus trondmy
  2022-02-24 17:31                                       ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index Benjamin Coddington
  0 siblings, 2 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Instead of using a linear index to address the pages, use the cookie of
the first entry, since that is what we use to match the page anyway.

This allows us to avoid re-reading the entire cache on a seekdir() type
of operation. The latter is very common when re-exporting NFS, and is a
major performance drain.

The change does affect our duplicate cookie detection, since we can no
longer rely on the page index as a linear offset for detecting whether
we looped backwards. However since we no longer do a linear search
through all the pages on each call to nfs_readdir(), this is less of a
concern than it was previously.
The other downside is that invalidate_mapping_pages() no longer can use
the page index to avoid clearing pages that have been read. A subsequent
patch will restore the functionality this provides to the 'ls -l'
heuristic.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 99 +++++++++++++++---------------------------
 include/linux/nfs_fs.h |  2 -
 2 files changed, 34 insertions(+), 67 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 06bd612296d5..2007eebfb5cf 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -39,6 +39,7 @@
 #include <linux/sched.h>
 #include <linux/kmemleak.h>
 #include <linux/xattr.h>
+#include <linux/xxhash.h>
 
 #include "delegation.h"
 #include "iostat.h"
@@ -158,9 +159,7 @@ struct nfs_readdir_descriptor {
 	pgoff_t		page_index_max;
 	u64		dir_cookie;
 	u64		last_cookie;
-	u64		dup_cookie;
 	loff_t		current_index;
-	loff_t		prev_index;
 
 	__be32		verf[NFS_DIR_VERIFIER_SIZE];
 	unsigned long	dir_verifier;
@@ -170,7 +169,6 @@ struct nfs_readdir_descriptor {
 	unsigned int	cache_entry_index;
 	unsigned int	buffer_fills;
 	unsigned int	dtsize;
-	signed char duped;
 	bool plus;
 	bool eob;
 	bool eof;
@@ -333,6 +331,13 @@ int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
 	return ret;
 }
 
+static pgoff_t nfs_readdir_page_cookie_hash(u64 cookie)
+{
+	if (cookie == 0)
+		return 0;
+	return xxhash(&cookie, sizeof(cookie), 0);
+}
+
 static bool nfs_readdir_page_cookie_match(struct page *page, u64 last_cookie,
 					  u64 change_attr)
 {
@@ -354,8 +359,9 @@ static void nfs_readdir_page_unlock_and_put(struct page *page)
 }
 
 static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
-						pgoff_t index, u64 last_cookie)
+						u64 last_cookie)
 {
+	pgoff_t index = nfs_readdir_page_cookie_hash(last_cookie);
 	struct page *page;
 	u64 change_attr;
 
@@ -374,11 +380,6 @@ static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
 	return page;
 }
 
-static loff_t nfs_readdir_page_offset(struct page *page)
-{
-	return (loff_t)page->index * (loff_t)nfs_readdir_array_maxentries();
-}
-
 static u64 nfs_readdir_page_last_cookie(struct page *page)
 {
 	struct nfs_cache_array *array;
@@ -411,11 +412,11 @@ static void nfs_readdir_page_set_eof(struct page *page)
 }
 
 static struct page *nfs_readdir_page_get_next(struct address_space *mapping,
-					      pgoff_t index, u64 cookie)
+					      u64 cookie)
 {
 	struct page *page;
 
-	page = nfs_readdir_page_get_locked(mapping, index, cookie);
+	page = nfs_readdir_page_get_locked(mapping, cookie);
 	if (page) {
 		if (nfs_readdir_page_last_cookie(page) == cookie)
 			return page;
@@ -443,6 +444,13 @@ bool nfs_readdir_use_cookie(const struct file *filp)
 	return true;
 }
 
+static void nfs_readdir_rewind_search(struct nfs_readdir_descriptor *desc)
+{
+	desc->current_index = 0;
+	desc->last_cookie = 0;
+	desc->page_index = 0;
+}
+
 static int nfs_readdir_search_for_pos(struct nfs_cache_array *array,
 				      struct nfs_readdir_descriptor *desc)
 {
@@ -491,32 +499,11 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 
 	for (i = 0; i < array->size; i++) {
 		if (array->array[i].cookie == desc->dir_cookie) {
-			struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
-
-			new_pos = nfs_readdir_page_offset(desc->page) + i;
-			if (desc->attr_gencount != nfsi->attr_gencount) {
-				desc->duped = 0;
-				desc->attr_gencount = nfsi->attr_gencount;
-			} else if (new_pos < desc->prev_index) {
-				if (desc->duped > 0
-				    && desc->dup_cookie == desc->dir_cookie) {
-					if (printk_ratelimit()) {
-						pr_notice("NFS: directory %pD2 contains a readdir loop."
-								"Please contact your server vendor.  "
-								"The file: %s has duplicate cookie %llu\n",
-								desc->file, array->array[i].name, desc->dir_cookie);
-					}
-					status = -ELOOP;
-					goto out;
-				}
-				desc->dup_cookie = desc->dir_cookie;
-				desc->duped = -1;
-			}
+			new_pos = desc->current_index + i;
 			if (nfs_readdir_use_cookie(desc->file))
 				desc->ctx->pos = desc->dir_cookie;
 			else
 				desc->ctx->pos = new_pos;
-			desc->prev_index = new_pos;
 			desc->cache_entry_index = i;
 			return 0;
 		}
@@ -527,7 +514,6 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 		if (desc->dir_cookie == array->last_cookie)
 			desc->eof = true;
 	}
-out:
 	return status;
 }
 
@@ -820,18 +806,16 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 				break;
 			arrays++;
 			*arrays = page = new;
-			desc->page_index_max++;
 		} else {
 			new = nfs_readdir_page_get_next(mapping,
-							page->index + 1,
 							entry->prev_cookie);
 			if (!new)
 				break;
 			if (page != *arrays)
 				nfs_readdir_page_unlock_and_put(page);
 			page = new;
-			desc->page_index_max = new->index;
 		}
+		desc->page_index_max++;
 		status = nfs_readdir_add_to_array(entry, page);
 	} while (!status && !entry->eof);
 
@@ -954,7 +938,6 @@ static struct page *
 nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 {
 	return nfs_readdir_page_get_locked(desc->file->f_mapping,
-					   desc->page_index,
 					   desc->last_cookie);
 }
 
@@ -984,7 +967,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 			trace_nfs_readdir_cache_fill_done(inode, res);
 			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
 				invalidate_inode_pages2(desc->file->f_mapping);
-				desc->page_index = 0;
+				nfs_readdir_rewind_search(desc);
 				trace_nfs_readdir_invalidate_cache_range(
 					inode, 0, MAX_LFS_FILESIZE);
 				return -EAGAIN;
@@ -998,12 +981,10 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		    memcmp(nfsi->cookieverf, verf, sizeof(nfsi->cookieverf))) {
 			memcpy(nfsi->cookieverf, verf,
 			       sizeof(nfsi->cookieverf));
-			invalidate_inode_pages2_range(desc->file->f_mapping,
-						      desc->page_index_max + 1,
+			invalidate_inode_pages2_range(desc->file->f_mapping, 1,
 						      -1);
 			trace_nfs_readdir_invalidate_cache_range(
-				inode, desc->page_index_max + 1,
-				MAX_LFS_FILESIZE);
+				inode, 1, MAX_LFS_FILESIZE);
 		}
 	}
 	res = nfs_readdir_search_array(desc);
@@ -1019,11 +1000,6 @@ static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 	int res;
 
 	do {
-		if (desc->page_index == 0) {
-			desc->current_index = 0;
-			desc->prev_index = 0;
-			desc->last_cookie = 0;
-		}
 		res = find_and_lock_cache_page(desc);
 	} while (res == -EAGAIN);
 	return res;
@@ -1058,8 +1034,6 @@ static void nfs_do_filldir(struct nfs_readdir_descriptor *desc,
 			desc->ctx->pos = desc->dir_cookie;
 		else
 			desc->ctx->pos++;
-		if (desc->duped != 0)
-			desc->duped = 1;
 	}
 	if (array->page_is_eof)
 		desc->eof = !desc->eob;
@@ -1101,7 +1075,6 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	desc->page_index = 0;
 	desc->cache_entry_index = 0;
 	desc->last_cookie = desc->dir_cookie;
-	desc->duped = 0;
 	desc->page_index_max = 0;
 
 	trace_nfs_readdir_uncached(desc->file, desc->verf, desc->last_cookie,
@@ -1134,6 +1107,8 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	for (i = 0; i < sz && arrays[i]; i++)
 		nfs_readdir_page_array_free(arrays[i]);
 out:
+	if (!nfs_readdir_use_cookie(desc->file))
+		nfs_readdir_rewind_search(desc);
 	desc->page_index_max = -1;
 	kfree(arrays);
 	dfprintk(DIRCACHE, "NFS: %s: returns %d\n", __func__, status);
@@ -1144,17 +1119,14 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 
 static void nfs_readdir_handle_cache_misses(struct inode *inode,
 					    struct nfs_readdir_descriptor *desc,
-					    pgoff_t page_index,
 					    unsigned int cache_misses)
 {
 	if (desc->ctx->pos == 0 ||
 	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD)
 		return;
-	if (invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1) == 0)
+	if (invalidate_mapping_pages(inode->i_mapping, 0, -1) == 0)
 		return;
-	trace_nfs_readdir_invalidate_cache_range(
-		inode, (loff_t)(page_index + 1) << PAGE_SHIFT,
-		MAX_LFS_FILESIZE);
+	trace_nfs_readdir_invalidate_cache_range(inode, 0, MAX_LFS_FILESIZE);
 }
 
 /* The file offset position represents the dirent entry number.  A
@@ -1194,8 +1166,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 
 	spin_lock(&file->f_lock);
 	desc->dir_cookie = dir_ctx->dir_cookie;
-	desc->dup_cookie = dir_ctx->dup_cookie;
-	desc->duped = dir_ctx->duped;
 	page_index = dir_ctx->page_index;
 	desc->page_index = page_index;
 	desc->last_cookie = dir_ctx->last_cookie;
@@ -1213,7 +1183,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	}
 
 	desc->plus = nfs_use_readdirplus(inode, ctx, cache_hits, cache_misses);
-	nfs_readdir_handle_cache_misses(inode, desc, page_index, cache_misses);
+	nfs_readdir_handle_cache_misses(inode, desc, cache_misses);
 
 	do {
 		res = readdir_search_pagecache(desc);
@@ -1233,7 +1203,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		}
 		if (res == -ETOOSMALL && desc->plus) {
 			nfs_zap_caches(inode);
-			desc->page_index = 0;
 			desc->plus = false;
 			desc->eof = false;
 			continue;
@@ -1252,9 +1221,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 
 	spin_lock(&file->f_lock);
 	dir_ctx->dir_cookie = desc->dir_cookie;
-	dir_ctx->dup_cookie = desc->dup_cookie;
 	dir_ctx->last_cookie = desc->last_cookie;
-	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
 	dir_ctx->eof = desc->eof;
@@ -1297,13 +1264,15 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 	if (offset != filp->f_pos) {
 		filp->f_pos = offset;
 		dir_ctx->page_index = 0;
-		if (!nfs_readdir_use_cookie(filp))
+		if (!nfs_readdir_use_cookie(filp)) {
 			dir_ctx->dir_cookie = 0;
-		else
+			dir_ctx->last_cookie = 0;
+		} else {
 			dir_ctx->dir_cookie = offset;
+			dir_ctx->last_cookie = offset;
+		}
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
-		dir_ctx->duped = 0;
 		dir_ctx->eof = false;
 	}
 	spin_unlock(&filp->f_lock);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 20a4cf0acad2..42aad886d3c0 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -106,11 +106,9 @@ struct nfs_open_dir_context {
 	unsigned long attr_gencount;
 	__be32	verf[NFS_DIR_VERIFIER_SIZE];
 	__u64 dir_cookie;
-	__u64 dup_cookie;
 	__u64 last_cookie;
 	pgoff_t page_index;
 	unsigned int dtsize;
-	signed char duped;
 	bool eof;
 	struct rcu_head rcu_head;
 };
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 20/21] NFS: Fix up forced readdirplus
  2022-02-23 21:13                                     ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index trondmy
@ 2022-02-23 21:13                                       ` trondmy
  2022-02-23 21:13                                         ` [PATCH v7 21/21] NFS: Remove unnecessary cache invalidations for directories trondmy
  2022-02-24 17:31                                       ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index Benjamin Coddington
  1 sibling, 1 reply; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Avoid clearing the entire readdir page cache if we're just doing forced
readdirplus for the 'ls -l' heuristic.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 49 ++++++++++++++++++++++++++----------------
 include/linux/nfs_fs.h |  1 +
 2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2007eebfb5cf..d41ea614edec 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -169,6 +169,7 @@ struct nfs_readdir_descriptor {
 	unsigned int	cache_entry_index;
 	unsigned int	buffer_fills;
 	unsigned int	dtsize;
+	bool force_plus;
 	bool plus;
 	bool eob;
 	bool eof;
@@ -352,6 +353,16 @@ static bool nfs_readdir_page_cookie_match(struct page *page, u64 last_cookie,
 	return ret;
 }
 
+static bool nfs_readdir_page_is_full(struct page *page)
+{
+	struct nfs_cache_array *array = kmap_atomic(page);
+	int ret;
+
+	ret = nfs_readdir_array_is_full(array);
+	kunmap_atomic(array);
+	return ret;
+}
+
 static void nfs_readdir_page_unlock_and_put(struct page *page)
 {
 	unlock_page(page);
@@ -359,7 +370,7 @@ static void nfs_readdir_page_unlock_and_put(struct page *page)
 }
 
 static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
-						u64 last_cookie)
+						u64 last_cookie, bool clear)
 {
 	pgoff_t index = nfs_readdir_page_cookie_hash(last_cookie);
 	struct page *page;
@@ -371,8 +382,10 @@ static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
 	change_attr = inode_peek_iversion_raw(mapping->host);
 	if (PageUptodate(page)) {
 		if (nfs_readdir_page_cookie_match(page, last_cookie,
-						  change_attr))
-			return page;
+						  change_attr)) {
+			if (!clear || !nfs_readdir_page_is_full(page))
+				return page;
+		}
 		nfs_readdir_clear_array(page);
 	}
 	nfs_readdir_page_init_array(page, last_cookie, change_attr);
@@ -393,13 +406,7 @@ static u64 nfs_readdir_page_last_cookie(struct page *page)
 
 static bool nfs_readdir_page_needs_filling(struct page *page)
 {
-	struct nfs_cache_array *array;
-	bool ret;
-
-	array = kmap_atomic(page);
-	ret = !nfs_readdir_array_is_full(array);
-	kunmap_atomic(array);
-	return ret;
+	return !nfs_readdir_page_is_full(page);
 }
 
 static void nfs_readdir_page_set_eof(struct page *page)
@@ -412,11 +419,11 @@ static void nfs_readdir_page_set_eof(struct page *page)
 }
 
 static struct page *nfs_readdir_page_get_next(struct address_space *mapping,
-					      u64 cookie)
+					      u64 cookie, bool clear)
 {
 	struct page *page;
 
-	page = nfs_readdir_page_get_locked(mapping, cookie);
+	page = nfs_readdir_page_get_locked(mapping, cookie, clear);
 	if (page) {
 		if (nfs_readdir_page_last_cookie(page) == cookie)
 			return page;
@@ -808,7 +815,8 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 			*arrays = page = new;
 		} else {
 			new = nfs_readdir_page_get_next(mapping,
-							entry->prev_cookie);
+							entry->prev_cookie,
+							desc->force_plus);
 			if (!new)
 				break;
 			if (page != *arrays)
@@ -937,8 +945,8 @@ nfs_readdir_page_unlock_and_put_cached(struct nfs_readdir_descriptor *desc)
 static struct page *
 nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 {
-	return nfs_readdir_page_get_locked(desc->file->f_mapping,
-					   desc->last_cookie);
+	return nfs_readdir_page_get_locked(
+		desc->file->f_mapping, desc->last_cookie, desc->force_plus);
 }
 
 /*
@@ -1124,9 +1132,7 @@ static void nfs_readdir_handle_cache_misses(struct inode *inode,
 	if (desc->ctx->pos == 0 ||
 	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD)
 		return;
-	if (invalidate_mapping_pages(inode->i_mapping, 0, -1) == 0)
-		return;
-	trace_nfs_readdir_invalidate_cache_range(inode, 0, MAX_LFS_FILESIZE);
+	desc->force_plus = true;
 }
 
 /* The file offset position represents the dirent entry number.  A
@@ -1170,6 +1176,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->page_index = page_index;
 	desc->last_cookie = dir_ctx->last_cookie;
 	desc->attr_gencount = dir_ctx->attr_gencount;
+	desc->force_plus = dir_ctx->force_plus;
 	desc->eof = dir_ctx->eof;
 	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
@@ -1183,7 +1190,10 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	}
 
 	desc->plus = nfs_use_readdirplus(inode, ctx, cache_hits, cache_misses);
-	nfs_readdir_handle_cache_misses(inode, desc, cache_misses);
+	if (desc->plus)
+		nfs_readdir_handle_cache_misses(inode, desc, cache_misses);
+	else
+		desc->force_plus = false;
 
 	do {
 		res = readdir_search_pagecache(desc);
@@ -1224,6 +1234,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->last_cookie = desc->last_cookie;
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
+	dir_ctx->force_plus = desc->force_plus;
 	dir_ctx->eof = desc->eof;
 	dir_ctx->dtsize = desc->dtsize;
 	memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 42aad886d3c0..3f9625c7d0ef 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -109,6 +109,7 @@ struct nfs_open_dir_context {
 	__u64 last_cookie;
 	pgoff_t page_index;
 	unsigned int dtsize;
+	bool force_plus;
 	bool eof;
 	struct rcu_head rcu_head;
 };
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v7 21/21] NFS: Remove unnecessary cache invalidations for directories
  2022-02-23 21:13                                       ` [PATCH v7 20/21] NFS: Fix up forced readdirplus trondmy
@ 2022-02-23 21:13                                         ` trondmy
  0 siblings, 0 replies; 57+ messages in thread
From: trondmy @ 2022-02-23 21:13 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Now that the directory page cache entries police themselves, don't
bother with marking the page cache for invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 5 -----
 fs/nfs/inode.c         | 9 +++------
 fs/nfs/nfs4proc.c      | 2 --
 include/linux/nfs_fs.h | 2 --
 4 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d41ea614edec..21aae3a3e282 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -83,11 +83,6 @@ alloc_nfs_open_dir_context(struct inode *dir)
 		ctx->attr_gencount = nfsi->attr_gencount;
 		ctx->dtsize = NFS_INIT_DTSIZE;
 		spin_lock(&dir->i_lock);
-		if (list_empty(&nfsi->open_files) &&
-		    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
-			nfs_set_cache_invalid(dir,
-					      NFS_INO_INVALID_DATA |
-						      NFS_INO_REVAL_FORCED);
 		list_add_tail_rcu(&ctx->list, &nfsi->open_files);
 		spin_unlock(&dir->i_lock);
 		return ctx;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 10d17cfb8639..43af1b6de5a6 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -210,6 +210,8 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
 	if (flags & NFS_INO_INVALID_DATA)
 		nfs_fscache_invalidate(inode, 0);
 	flags &= ~NFS_INO_REVAL_FORCED;
+	if (S_ISDIR(inode->i_mode))
+		flags &= ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
 
 	nfsi->cache_validity |= flags;
 
@@ -1429,10 +1431,7 @@ static void nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 			&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
 			&& inode_eq_iversion_raw(inode, fattr->pre_change_attr)) {
 		inode_set_iversion_raw(inode, fattr->change_attr);
-		if (S_ISDIR(inode->i_mode))
-			nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
-		else if (nfs_server_capable(inode, NFS_CAP_XATTR))
-			nfs_set_cache_invalid(inode, NFS_INO_INVALID_XATTR);
+		nfs_set_cache_invalid(inode, NFS_INO_INVALID_XATTR);
 	}
 	/* If we have atomic WCC data, we may update some attributes */
 	ts = inode->i_ctime;
@@ -1851,8 +1850,6 @@ EXPORT_SYMBOL_GPL(nfs_refresh_inode);
 static int nfs_post_op_update_inode_locked(struct inode *inode,
 		struct nfs_fattr *fattr, unsigned int invalid)
 {
-	if (S_ISDIR(inode->i_mode))
-		invalid |= NFS_INO_INVALID_DATA;
 	nfs_set_cache_invalid(inode, invalid);
 	if ((fattr->valid & NFS_ATTR_FATTR) == 0)
 		return 0;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 8b875355824b..f1aa6b3c8523 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1206,8 +1206,6 @@ nfs4_update_changeattr_locked(struct inode *inode,
 	u64 change_attr = inode_peek_iversion_raw(inode);
 
 	cache_validity |= NFS_INO_INVALID_CTIME | NFS_INO_INVALID_MTIME;
-	if (S_ISDIR(inode->i_mode))
-		cache_validity |= NFS_INO_INVALID_DATA;
 
 	switch (NFS_SERVER(inode)->change_attr_type) {
 	case NFS4_CHANGE_TYPE_IS_UNDEFINED:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 3f9625c7d0ef..08ba4db0db4a 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -360,8 +360,6 @@ static inline void nfs_mark_for_revalidate(struct inode *inode)
 	nfsi->cache_validity |= NFS_INO_INVALID_ACCESS | NFS_INO_INVALID_ACL |
 				NFS_INO_INVALID_CHANGE | NFS_INO_INVALID_CTIME |
 				NFS_INO_INVALID_SIZE;
-	if (S_ISDIR(inode->i_mode))
-		nfsi->cache_validity |= NFS_INO_INVALID_DATA;
 	spin_unlock(&inode->i_lock);
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 00/21] Readdir improvements
  2022-02-23 21:12 [PATCH v7 00/21] Readdir improvements trondmy
  2022-02-23 21:12 ` [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
@ 2022-02-24 12:25 ` David Wysochanski
  2022-02-25  4:00   ` Trond Myklebust
  2022-02-24 15:07 ` David Wysochanski
  2 siblings, 1 reply; 57+ messages in thread
From: David Wysochanski @ 2022-02-24 12:25 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On Wed, Feb 23, 2022 at 4:24 PM <trondmy@kernel.org> wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> The current NFS readdir code will always try to maximise the amount of
> readahead it performs on the assumption that we can cache anything that
> isn't immediately read by the process.
> There are several cases where this assumption breaks down, including
> when the 'ls -l' heuristic kicks in to try to force use of readdirplus
> as a batch replacement for lookup/getattr.
>
> This series also implement Ben's page cache filter to ensure that we can
> improve the ability to share cached data between processes that are
> reading the same directory at the same time, and to avoid live-locks
> when the directory is simultaneously changing.
>
> --
> v2: Remove reset of dtsize when NFS_INO_FORCE_READDIR is set
> v3: Avoid excessive window shrinking in uncached_readdir case
> v4: Track 'ls -l' cache hit/miss statistics
>     Improved algorithm for falling back to uncached readdir
>     Skip readdirplus when files are being written to
> v5: bugfixes
>     Skip readdirplus when the acdirmax/acregmax values are low
>     Request a full XDR buffer when doing READDIRPLUS
> v6: Add tracing
>     Don't have lookup request readdirplus when it won't help
> v7: Implement Ben's page cache filter
>     Reduce the use of uncached readdir
>     Change indexing of the page cache to improve seekdir() performance.
>
> Trond Myklebust (21):
>   NFS: constify nfs_server_capable() and nfs_have_writebacks()
>   NFS: Trace lookup revalidation failure
>   NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
>   NFS: Calculate page offsets algorithmically
>   NFS: Store the change attribute in the directory page cache
>   NFS: If the cookie verifier changes, we must invalidate the page cache
>   NFS: Don't re-read the entire page cache to find the next cookie
>   NFS: Adjust the amount of readahead performed by NFS readdir
>   NFS: Simplify nfs_readdir_xdr_to_array()
>   NFS: Reduce use of uncached readdir
>   NFS: Improve heuristic for readdirplus
>   NFS: Don't ask for readdirplus unless it can help nfs_getattr()
>   NFSv4: Ask for a full XDR buffer of readdir goodness
>   NFS: Readdirplus can't help lookup for case insensitive filesystems
>   NFS: Don't request readdirplus when revalidation was forced
>   NFS: Add basic readdir tracing
>   NFS: Trace effects of readdirplus on the dcache
>   NFS: Trace effects of the readdirplus heuristic
>   NFS: Convert readdir page cache to use a cookie based index
>   NFS: Fix up forced readdirplus
>   NFS: Remove unnecessary cache invalidations for directories
>
>  fs/nfs/dir.c           | 450 ++++++++++++++++++++++++-----------------
>  fs/nfs/inode.c         |  46 ++---
>  fs/nfs/internal.h      |   4 +-
>  fs/nfs/nfs3xdr.c       |   7 +-
>  fs/nfs/nfs4proc.c      |   2 -
>  fs/nfs/nfs4xdr.c       |   6 +-
>  fs/nfs/nfstrace.h      | 122 ++++++++++-
>  include/linux/nfs_fs.h |  19 +-
>  8 files changed, 421 insertions(+), 235 deletions(-)
>
> --
> 2.35.1
>

Trond I have been following your work here with periodic tests though
not fully following all the patches content.  As you know this is a tricky
area and seems to be a hotspot area for customers that use NFS, with
many scenarios that may go wrong.  Thanks for your work, which now
includes even some tracepoints and Ben's page cache filler.

This patchset seems to be the best of all the ones so far.  My initial
tests (listings when modifying as well as idle directories) indicate
that the issue that Gonzalo reported on Jan 14th [1] looks to be fixed
by this set, but I'll let him confirm.  I'll do some more testing and
let you know if there's anything else I find.  If there're some
scenarios (mount options, servers, etc) you need more testing on, let
us know and we'll try to make that happen.

[1] [PATCH] NFS: limit block size reported for directories


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 02/21] NFS: Trace lookup revalidation failure
  2022-02-23 21:12   ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure trondmy
  2022-02-23 21:12     ` [PATCH v7 03/21] NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context trondmy
@ 2022-02-24 14:14     ` Benjamin Coddington
  2022-02-25  2:09       ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-24 14:14 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Enable tracing of lookup revalidation failures.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c | 17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index ebddc736eac2..1aa55cac9d9a 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1474,9 +1474,7 @@ nfs_lookup_revalidate_done(struct inode *dir, 
> struct dentry *dentry,
>  {
>  	switch (error) {
>  	case 1:
> -		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is valid\n",
> -			__func__, dentry);
> -		return 1;
> +		break;
>  	case 0:
>  		/*
>  		 * We can't d_drop the root of a disconnected tree:
> @@ -1485,13 +1483,10 @@ nfs_lookup_revalidate_done(struct inode *dir, 
> struct dentry *dentry,
>  		 * inodes on unmount and further oopses.
>  		 */
>  		if (inode && IS_ROOT(dentry))
> -			return 1;
> -		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
> -				__func__, dentry);
> -		return 0;
> +			error = 1;
> +		break;
>  	}
> -	dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) lookup returned error %d\n",
> -				__func__, dentry, error);
> +	trace_nfs_lookup_revalidate_exit(dir, dentry, 0, error);


There's a path through nfs4_lookup_revalidate that will now only produce
this exit tracepoint.  Does it need the _enter tracepoint added?

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 04/21] NFS: Calculate page offsets algorithmically
  2022-02-23 21:12       ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically trondmy
  2022-02-23 21:12         ` [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache trondmy
@ 2022-02-24 14:15         ` Benjamin Coddington
  2022-02-25  2:11           ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-24 14:15 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Instead of relying on counting the page offsets as we walk through the
> page cache, switch to calculating them algorithmically.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 8f17aaebcd77..f2258e926df2 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -248,17 +248,20 @@ static const char *nfs_readdir_copy_name(const 
> char *name, unsigned int len)
>  	return ret;
>  }
>
> +static size_t nfs_readdir_array_maxentries(void)
> +{
> +	return (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
> +	       sizeof(struct nfs_cache_array_entry);
> +}
> +

Why the choice to use a runtime function call rather than the compiler's
calculation?  I suspect that the end result is the same, as the compiler
will optimize it away, but I'm curious if there's a good reason for 
this.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-23 21:12         ` [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache trondmy
  2022-02-23 21:12           ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the " trondmy
@ 2022-02-24 14:53           ` Benjamin Coddington
  2022-02-25  2:26             ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-24 14:53 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Use the change attribute and the first cookie in a directory page 
> cache
> entry to validate that the page is up to date.
>
> Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c | 68 
> ++++++++++++++++++++++++++++------------------------
>  1 file changed, 37 insertions(+), 31 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index f2258e926df2..5d9367d9b651 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
>  };
>
>  struct nfs_cache_array {
> +	u64 change_attr;
>  	u64 last_cookie;
>  	unsigned int size;
>  	unsigned char page_full : 1,
> @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct 
> nfs_cache_array *array)
>  	memset(array, 0, sizeof(struct nfs_cache_array));
>  }
>
> -static void nfs_readdir_page_init_array(struct page *page, u64 
> last_cookie)
> +static void nfs_readdir_page_init_array(struct page *page, u64 
> last_cookie,
> +					u64 change_attr)
>  {
>  	struct nfs_cache_array *array;


There's a hunk missing here, something like:

@@ -185,6 +185,7 @@ static void nfs_readdir_page_init_array(struct page 
*page, u64 last_cookie,
         nfs_readdir_array_init(array);
         array->last_cookie = last_cookie;
         array->cookies_are_ordered = 1;
+       array->change_attr = change_attr;
         kunmap_atomic(array);
  }

>
> @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 last_cookie, 
> gfp_t gfp_flags)
>  {
>  	struct page *page = alloc_page(gfp_flags);
>  	if (page)
> -		nfs_readdir_page_init_array(page, last_cookie);
> +		nfs_readdir_page_init_array(page, last_cookie, 0);
>  	return page;
>  }
>
> @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct nfs_entry 
> *entry, struct page *page)
>  	return ret;
>  }
>
> +static bool nfs_readdir_page_cookie_match(struct page *page, u64 
> last_cookie,
> +					  u64 change_attr)

How about "nfs_readdir_page_valid()"?  There's more going on than a 
cookie match.


> +{
> +	struct nfs_cache_array *array = kmap_atomic(page);
> +	int ret = true;
> +
> +	if (array->change_attr != change_attr)
> +		ret = false;

Can we skip the next test if ret = false?

> +	if (array->size > 0 && array->array[0].cookie != last_cookie)
> +		ret = false;
> +	kunmap_atomic(array);
> +	return ret;
> +}
> +
> +static void nfs_readdir_page_unlock_and_put(struct page *page)
> +{
> +	unlock_page(page);
> +	put_page(page);
> +}
> +
>  static struct page *nfs_readdir_page_get_locked(struct address_space 
> *mapping,
>  						pgoff_t index, u64 last_cookie)
>  {
>  	struct page *page;
> +	u64 change_attr;
>
>  	page = grab_cache_page(mapping, index);
> -	if (page && !PageUptodate(page)) {
> -		nfs_readdir_page_init_array(page, last_cookie);
> -		if (invalidate_inode_pages2_range(mapping, index + 1, -1) < 0)
> -			nfs_zap_mapping(mapping->host, mapping);
> -		SetPageUptodate(page);
> +	if (!page)
> +		return NULL;
> +	change_attr = inode_peek_iversion_raw(mapping->host);
> +	if (PageUptodate(page)) {
> +		if (nfs_readdir_page_cookie_match(page, last_cookie,
> +						  change_attr))
> +			return page;
> +		nfs_readdir_clear_array(page);


Why use i_version rather than nfs_save_change_attribute?  Seems having a
consistent value across the pachecache and dir_verifiers would help
debugging, and we've already have a bunch of machinery around the
change_attribute.

Don't we need to send a GETATTR with READDIR for v4?  Not doing so means
that the pagecache is going to behave differently for v3 and v4, and 
we'll
potentially end up with totally bogus listings for cases where one 
reader
has cached a page of entries in the middle of the pagecache marked with
i_version A, but entries are actually from i_version A++ on the server.
Then another reader comes along and follows earlier entries from 
i_version A
on the server that lead into entries from A++.  I don't think we can 
detect
this case unless we're checking the directory on every READDIR.

Sending a GETATTR for v4 doesn't eliminate that race on the server side, 
but
does remove the large window on the client created by the attribute 
cache
timeouts, and I think its mostly harmless performance-wise.

Also, we don't need the local change_attr variable just to pass it to 
other
functions that can access it themselves.

>  	}
> -
> +	nfs_readdir_page_init_array(page, last_cookie, change_attr);
> +	SetPageUptodate(page);
>  	return page;
>  }
>
> @@ -356,12 +383,6 @@ static void nfs_readdir_page_set_eof(struct page 
> *page)
>  	kunmap_atomic(array);
>  }
>
> -static void nfs_readdir_page_unlock_and_put(struct page *page)
> -{
> -	unlock_page(page);
> -	put_page(page);
> -}
> -
>  static struct page *nfs_readdir_page_get_next(struct address_space 
> *mapping,
>  					      pgoff_t index, u64 cookie)
>  {
> @@ -418,16 +439,6 @@ static int nfs_readdir_search_for_pos(struct 
> nfs_cache_array *array,
>  	return -EBADCOOKIE;
>  }
>
> -static bool
> -nfs_readdir_inode_mapping_valid(struct nfs_inode *nfsi)
> -{
> -	if (nfsi->cache_validity & (NFS_INO_INVALID_CHANGE |
> -				    NFS_INO_INVALID_DATA))
> -		return false;
> -	smp_rmb();
> -	return !test_bit(NFS_INO_INVALIDATING, &nfsi->flags);
> -}
> -
>  static bool nfs_readdir_array_cookie_in_range(struct nfs_cache_array 
> *array,
>  					      u64 cookie)
>  {
> @@ -456,8 +467,7 @@ static int nfs_readdir_search_for_cookie(struct 
> nfs_cache_array *array,
>  			struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
>
>  			new_pos = nfs_readdir_page_offset(desc->page) + i;
> -			if (desc->attr_gencount != nfsi->attr_gencount ||
> -			    !nfs_readdir_inode_mapping_valid(nfsi)) {
> +			if (desc->attr_gencount != nfsi->attr_gencount) {
>  				desc->duped = 0;
>  				desc->attr_gencount = nfsi->attr_gencount;
>  			} else if (new_pos < desc->prev_index) {
> @@ -1094,11 +1104,7 @@ static int nfs_readdir(struct file *file, 
> struct dir_context *ctx)
>  	 * to either find the entry with the appropriate number or
>  	 * revalidate the cookie.
>  	 */
> -	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> -		res = nfs_revalidate_mapping(inode, file->f_mapping);
> -		if (res < 0)
> -			goto out;
> -	}
> +	nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);

Same as above -> why not send GETATTR with READDIR instead of doing it 
in a
separate RPC?

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 00/21] Readdir improvements
  2022-02-23 21:12 [PATCH v7 00/21] Readdir improvements trondmy
  2022-02-23 21:12 ` [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
  2022-02-24 12:25 ` [PATCH v7 00/21] Readdir improvements David Wysochanski
@ 2022-02-24 15:07 ` David Wysochanski
  2 siblings, 0 replies; 57+ messages in thread
From: David Wysochanski @ 2022-02-24 15:07 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On Wed, Feb 23, 2022 at 4:24 PM <trondmy@kernel.org> wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> The current NFS readdir code will always try to maximise the amount of
> readahead it performs on the assumption that we can cache anything that
> isn't immediately read by the process.
> There are several cases where this assumption breaks down, including
> when the 'ls -l' heuristic kicks in to try to force use of readdirplus
> as a batch replacement for lookup/getattr.
>
> This series also implement Ben's page cache filter to ensure that we can
> improve the ability to share cached data between processes that are
> reading the same directory at the same time, and to avoid live-locks
> when the directory is simultaneously changing.
>
> --
> v2: Remove reset of dtsize when NFS_INO_FORCE_READDIR is set
> v3: Avoid excessive window shrinking in uncached_readdir case
> v4: Track 'ls -l' cache hit/miss statistics
>     Improved algorithm for falling back to uncached readdir
>     Skip readdirplus when files are being written to
> v5: bugfixes
>     Skip readdirplus when the acdirmax/acregmax values are low
>     Request a full XDR buffer when doing READDIRPLUS
> v6: Add tracing
>     Don't have lookup request readdirplus when it won't help
> v7: Implement Ben's page cache filter
>     Reduce the use of uncached readdir
>     Change indexing of the page cache to improve seekdir() performance.
>
> Trond Myklebust (21):
>   NFS: constify nfs_server_capable() and nfs_have_writebacks()
>   NFS: Trace lookup revalidation failure
>   NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
>   NFS: Calculate page offsets algorithmically
>   NFS: Store the change attribute in the directory page cache
>   NFS: If the cookie verifier changes, we must invalidate the page cache
>   NFS: Don't re-read the entire page cache to find the next cookie
>   NFS: Adjust the amount of readahead performed by NFS readdir
>   NFS: Simplify nfs_readdir_xdr_to_array()
>   NFS: Reduce use of uncached readdir
>   NFS: Improve heuristic for readdirplus
>   NFS: Don't ask for readdirplus unless it can help nfs_getattr()
>   NFSv4: Ask for a full XDR buffer of readdir goodness
>   NFS: Readdirplus can't help lookup for case insensitive filesystems
>   NFS: Don't request readdirplus when revalidation was forced
>   NFS: Add basic readdir tracing
>   NFS: Trace effects of readdirplus on the dcache
>   NFS: Trace effects of the readdirplus heuristic
>   NFS: Convert readdir page cache to use a cookie based index
>   NFS: Fix up forced readdirplus
>   NFS: Remove unnecessary cache invalidations for directories
>
>  fs/nfs/dir.c           | 450 ++++++++++++++++++++++++-----------------
>  fs/nfs/inode.c         |  46 ++---
>  fs/nfs/internal.h      |   4 +-
>  fs/nfs/nfs3xdr.c       |   7 +-
>  fs/nfs/nfs4proc.c      |   2 -
>  fs/nfs/nfs4xdr.c       |   6 +-
>  fs/nfs/nfstrace.h      | 122 ++++++++++-
>  include/linux/nfs_fs.h |  19 +-
>  8 files changed, 421 insertions(+), 235 deletions(-)
>
> --
> 2.35.1
>

I don't see this pushed this to your 'testing' branch, but I applied
manually after resetting at 3f58e709c162


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/21] NFS: Add basic readdir tracing
  2022-02-23 21:13                               ` [PATCH v7 16/21] NFS: Add basic readdir tracing trondmy
  2022-02-23 21:13                                 ` [PATCH v7 17/21] NFS: Trace effects of readdirplus on the dcache trondmy
@ 2022-02-24 15:53                                 ` Benjamin Coddington
  2022-02-25  2:35                                   ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-24 15:53 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 23 Feb 2022, at 16:13, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Add tracing to track how often the client goes to the server for 
> updated
> readdir information.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c      | 13 ++++++++-
>  fs/nfs/nfstrace.h | 68 
> +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 54f0d37485d5..41e2d02d8611 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -969,10 +969,14 @@ static int find_and_lock_cache_page(struct 
> nfs_readdir_descriptor *desc)
>  		return -ENOMEM;
>  	if (nfs_readdir_page_needs_filling(desc->page)) {
>  		desc->page_index_max = desc->page_index;
> +		trace_nfs_readdir_cache_fill(desc->file, nfsi->cookieverf,
> +					     desc->last_cookie,
> +					     desc->page_index, desc->dtsize);
>  		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
>  					       &desc->page, 1);
>  		if (res < 0) {
>  			nfs_readdir_page_unlock_and_put_cached(desc);
> +			trace_nfs_readdir_cache_fill_done(inode, res);
>  			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
>  				invalidate_inode_pages2(desc->file->f_mapping);
>  				desc->page_index = 0;
> @@ -1090,7 +1094,14 @@ static int uncached_readdir(struct 
> nfs_readdir_descriptor *desc)
>  	desc->duped = 0;
>  	desc->page_index_max = 0;
>
> +	trace_nfs_readdir_uncached(desc->file, desc->verf, 
> desc->last_cookie,
> +				   -1, desc->dtsize);
> +
>  	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, 
> sz);
> +	if (status < 0) {
> +		trace_nfs_readdir_uncached_done(file_inode(desc->file), status);
> +		goto out_free;
> +	}
>
>  	for (i = 0; !desc->eob && i < sz && arrays[i]; i++) {
>  		desc->page = arrays[i];
> @@ -1109,7 +1120,7 @@ static int uncached_readdir(struct 
> nfs_readdir_descriptor *desc)
>  			 i < (desc->page_index_max >> 1))
>  			nfs_shrink_dtsize(desc);
>  	}
> -
> +out_free:
>  	for (i = 0; i < sz && arrays[i]; i++)
>  		nfs_readdir_page_array_free(arrays[i]);
>  out:
> diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> index 3672f6703ee7..c2d0543ecb2d 100644
> --- a/fs/nfs/nfstrace.h
> +++ b/fs/nfs/nfstrace.h
> @@ -160,6 +160,8 @@ DEFINE_NFS_INODE_EVENT(nfs_fsync_enter);
>  DEFINE_NFS_INODE_EVENT_DONE(nfs_fsync_exit);
>  DEFINE_NFS_INODE_EVENT(nfs_access_enter);
>  DEFINE_NFS_INODE_EVENT_DONE(nfs_set_cache_invalid);
> +DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_cache_fill_done);
> +DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_uncached_done);
>
>  TRACE_EVENT(nfs_access_exit,
>  		TP_PROTO(
> @@ -271,6 +273,72 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
>  DEFINE_NFS_UPDATE_SIZE_EVENT(update);
>  DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
>
> +DECLARE_EVENT_CLASS(nfs_readdir_event,
> +		TP_PROTO(
> +			const struct file *file,
> +			const __be32 *verifier,
> +			u64 cookie,
> +			pgoff_t page_index,
> +			unsigned int dtsize
> +		),
> +
> +		TP_ARGS(file, verifier, cookie, page_index, dtsize),
> +
> +		TP_STRUCT__entry(
> +			__field(dev_t, dev)
> +			__field(u32, fhandle)
> +			__field(u64, fileid)
> +			__field(u64, version)
> +			__array(char, verifier, NFS4_VERIFIER_SIZE)
> +			__field(u64, cookie)
> +			__field(pgoff_t, index)
> +			__field(unsigned int, dtsize)


I'd like to be able to see the change_attr too, whether or not it's the
cache_change_attribute or i_version.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the page cache
  2022-02-23 21:12           ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the " trondmy
  2022-02-23 21:12             ` [PATCH v7 07/21] NFS: Don't re-read the entire page cache to find the next cookie trondmy
@ 2022-02-24 16:18             ` Anna Schumaker
  1 sibling, 0 replies; 57+ messages in thread
From: Anna Schumaker @ 2022-02-24 16:18 UTC (permalink / raw)
  To: trondmy; +Cc: Linux NFS Mailing List

Hi Trond,

On Wed, Feb 23, 2022 at 7:48 PM <trondmy@kernel.org> wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Ensure that if the cookie verifier changes when we use the zero-valued
> cookie, then we invalidate any cached pages.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 5d9367d9b651..7932d474ce00 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -945,9 +945,14 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
>                 /*
>                  * Set the cookie verifier if the page cache was empty
>                  */
> -               if (desc->page_index == 0)
> +               if (desc->last_cookie == 0 &&
> +                   memcmp(nfsi->cookieverf, verf, sizeof(nfsi->cookieverf))) {
>                         memcpy(nfsi->cookieverf, verf,
>                                sizeof(nfsi->cookieverf));
> +                       invalidate_inode_pages2_range(desc->file->f_mapping,
> +                                                     desc->page_index_max + 1,

I'm getting this when I try to compile this patch:

fs/nfs/dir.c: In function ‘find_and_lock_cache_page’:
fs/nfs/dir.c:953:61: error: ‘struct nfs_readdir_descriptor’ has no
member named ‘page_index_max’; did you mean ‘page_index’?
  953 |
desc->page_index_max + 1,
      |
^~~~~~~~~~~~~~
      |                                                             page_index
make[2]: *** [scripts/Makefile.build:288: fs/nfs/dir.o] Error 1
make[1]: *** [scripts/Makefile.build:550: fs/nfs] Error 2
make: *** [Makefile:1831: fs] Error 2
make: *** Waiting for unfinished jobs....

It looks like the "page_index_max" field is added in patch 8.

Anna


Anna
> +                                                     -1);
> +               }
>         }
>         res = nfs_readdir_search_array(desc);
>         if (res == 0)
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir
  2022-02-23 21:12               ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
  2022-02-23 21:12                 ` [PATCH v7 09/21] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
@ 2022-02-24 16:30                 ` Anna Schumaker
  1 sibling, 0 replies; 57+ messages in thread
From: Anna Schumaker @ 2022-02-24 16:30 UTC (permalink / raw)
  To: trondmy; +Cc: Linux NFS Mailing List

Hi Trond,

On Wed, Feb 23, 2022 at 8:11 PM <trondmy@kernel.org> wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> The current NFS readdir code will always try to maximise the amount of
> readahead it performs on the assumption that we can cache anything that
> isn't immediately read by the process.
> There are several cases where this assumption breaks down, including
> when the 'ls -l' heuristic kicks in to try to force use of readdirplus
> as a batch replacement for lookup/getattr.
>
> This patch therefore tries to tone down the amount of readahead we
> perform, and adjust it to try to match the amount of data being
> requested by user space.

I'm seeing cthon basic tests fail at this patch, but I'm unsure if it
would have started now or in patches 6 or 7 due to the earlier compile
error. The other cthon tests still pass, however:

Thu Feb 24 11:27:44 EST 2022
./server -b -o tcp,v3,sec=sys -m /mnt/nfsv3tcp -p /srv/test/anna/nfsv3tcp server
./server -b -o proto=tcp,sec=sys,v4.0 -m /mnt/nfsv4tcp -p
/srv/test/anna/nfsv4tcp server
./server -b -o proto=tcp,sec=sys,v4.1 -m /mnt/nfsv41tcp -p
/srv/test/anna/nfsv41tcp server
./server -b -o proto=tcp,sec=sys,v4.2 -m /mnt/nfsv42tcp -p
/srv/test/anna/nfsv42tcp server
Waiting for 'b' to finish...
The '-b' test using '-o tcp,v3,sec=sys' args to server: Failed!!
The '-b' test using '-o proto=tcp,sec=sys,v4.0' args to server: Failed!!
The '-b' test using '-o proto=tcp,sec=sys,v4.2' args to server: Failed!!
The '-b' test using '-o proto=tcp,sec=sys,v4.1' args to server: Failed!!
 Done: 11:27:46

Anna

>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> ---
>  fs/nfs/dir.c           | 55 +++++++++++++++++++++++++++++++++++++++++-
>  include/linux/nfs_fs.h |  1 +
>  2 files changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 70c0db877815..83933b7018ea 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -69,6 +69,8 @@ const struct address_space_operations nfs_dir_aops = {
>         .freepage = nfs_readdir_clear_array,
>  };
>
> +#define NFS_INIT_DTSIZE PAGE_SIZE
> +
>  static struct nfs_open_dir_context *
>  alloc_nfs_open_dir_context(struct inode *dir)
>  {
> @@ -78,6 +80,7 @@ alloc_nfs_open_dir_context(struct inode *dir)
>         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
>         if (ctx != NULL) {
>                 ctx->attr_gencount = nfsi->attr_gencount;
> +               ctx->dtsize = NFS_INIT_DTSIZE;
>                 spin_lock(&dir->i_lock);
>                 if (list_empty(&nfsi->open_files) &&
>                     (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
> @@ -153,6 +156,7 @@ struct nfs_readdir_descriptor {
>         struct page     *page;
>         struct dir_context *ctx;
>         pgoff_t         page_index;
> +       pgoff_t         page_index_max;
>         u64             dir_cookie;
>         u64             last_cookie;
>         u64             dup_cookie;
> @@ -165,12 +169,36 @@ struct nfs_readdir_descriptor {
>         unsigned long   gencount;
>         unsigned long   attr_gencount;
>         unsigned int    cache_entry_index;
> +       unsigned int    buffer_fills;
> +       unsigned int    dtsize;
>         signed char duped;
>         bool plus;
>         bool eob;
>         bool eof;
>  };
>
> +static void nfs_set_dtsize(struct nfs_readdir_descriptor *desc, unsigned int sz)
> +{
> +       struct nfs_server *server = NFS_SERVER(file_inode(desc->file));
> +       unsigned int maxsize = server->dtsize;
> +
> +       if (sz > maxsize)
> +               sz = maxsize;
> +       if (sz < NFS_MIN_FILE_IO_SIZE)
> +               sz = NFS_MIN_FILE_IO_SIZE;
> +       desc->dtsize = sz;
> +}
> +
> +static void nfs_shrink_dtsize(struct nfs_readdir_descriptor *desc)
> +{
> +       nfs_set_dtsize(desc, desc->dtsize >> 1);
> +}
> +
> +static void nfs_grow_dtsize(struct nfs_readdir_descriptor *desc)
> +{
> +       nfs_set_dtsize(desc, desc->dtsize << 1);
> +}
> +
>  static void nfs_readdir_array_init(struct nfs_cache_array *array)
>  {
>         memset(array, 0, sizeof(struct nfs_cache_array));
> @@ -774,6 +802,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
>                                 break;
>                         arrays++;
>                         *arrays = page = new;
> +                       desc->page_index_max++;
>                 } else {
>                         new = nfs_readdir_page_get_next(mapping,
>                                                         page->index + 1,
> @@ -783,6 +812,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
>                         if (page != *arrays)
>                                 nfs_readdir_page_unlock_and_put(page);
>                         page = new;
> +                       desc->page_index_max = new->index;
>                 }
>                 status = nfs_readdir_add_to_array(entry, page);
>         } while (!status && !entry->eof);
> @@ -848,7 +878,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
>         struct nfs_entry *entry;
>         size_t array_size;
>         struct inode *inode = file_inode(desc->file);
> -       size_t dtsize = NFS_SERVER(inode)->dtsize;
> +       unsigned int dtsize = desc->dtsize;
>         int status = -ENOMEM;
>
>         entry = kzalloc(sizeof(*entry), GFP_KERNEL);
> @@ -884,6 +914,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
>
>                 status = nfs_readdir_page_filler(desc, entry, pages, pglen,
>                                                  arrays, narrays);
> +               desc->buffer_fills++;
>         } while (!status && nfs_readdir_page_needs_filling(page) &&
>                 page_mapping(page));
>
> @@ -931,6 +962,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
>         if (!desc->page)
>                 return -ENOMEM;
>         if (nfs_readdir_page_needs_filling(desc->page)) {
> +               desc->page_index_max = desc->page_index;
>                 res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
>                                                &desc->page, 1);
>                 if (res < 0) {
> @@ -1067,6 +1099,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
>         desc->cache_entry_index = 0;
>         desc->last_cookie = desc->dir_cookie;
>         desc->duped = 0;
> +       desc->page_index_max = 0;
>
>         status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
>
> @@ -1076,10 +1109,22 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
>         }
>         desc->page = NULL;
>
> +       /*
> +        * Grow the dtsize if we have to go back for more pages,
> +        * or shrink it if we're reading too many.
> +        */
> +       if (!desc->eof) {
> +               if (!desc->eob)
> +                       nfs_grow_dtsize(desc);
> +               else if (desc->buffer_fills == 1 &&
> +                        i < (desc->page_index_max >> 1))
> +                       nfs_shrink_dtsize(desc);
> +       }
>
>         for (i = 0; i < sz && arrays[i]; i++)
>                 nfs_readdir_page_array_free(arrays[i]);
>  out:
> +       desc->page_index_max = -1;
>         kfree(arrays);
>         dfprintk(DIRCACHE, "NFS: %s: returns %d\n", __func__, status);
>         return status;
> @@ -1118,6 +1163,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>         desc->file = file;
>         desc->ctx = ctx;
>         desc->plus = nfs_use_readdirplus(inode, ctx);
> +       desc->page_index_max = -1;
>
>         spin_lock(&file->f_lock);
>         desc->dir_cookie = dir_ctx->dir_cookie;
> @@ -1128,6 +1174,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>         desc->last_cookie = dir_ctx->last_cookie;
>         desc->attr_gencount = dir_ctx->attr_gencount;
>         desc->eof = dir_ctx->eof;
> +       nfs_set_dtsize(desc, dir_ctx->dtsize);
>         memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
>         spin_unlock(&file->f_lock);
>
> @@ -1169,6 +1216,11 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>
>                 nfs_do_filldir(desc, nfsi->cookieverf);
>                 nfs_readdir_page_unlock_and_put_cached(desc);
> +               if (desc->eob || desc->eof)
> +                       break;
> +               /* Grow the dtsize if we have to go back for more pages */
> +               if (desc->page_index == desc->page_index_max)
> +                       nfs_grow_dtsize(desc);
>         } while (!desc->eob && !desc->eof);
>
>         spin_lock(&file->f_lock);
> @@ -1179,6 +1231,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
>         dir_ctx->attr_gencount = desc->attr_gencount;
>         dir_ctx->page_index = desc->page_index;
>         dir_ctx->eof = desc->eof;
> +       dir_ctx->dtsize = desc->dtsize;
>         memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
>         spin_unlock(&file->f_lock);
>  out_free:
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 1c533f2c1f36..691a27936849 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -107,6 +107,7 @@ struct nfs_open_dir_context {
>         __u64 dup_cookie;
>         __u64 last_cookie;
>         pgoff_t page_index;
> +       unsigned int dtsize;
>         signed char duped;
>         bool eof;
>  };
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/21] NFS: Reduce use of uncached readdir
  2022-02-23 21:12                   ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir trondmy
  2022-02-23 21:12                     ` [PATCH v7 11/21] NFS: Improve heuristic for readdirplus trondmy
@ 2022-02-24 16:55                     ` Anna Schumaker
  2022-02-25  4:07                       ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Anna Schumaker @ 2022-02-24 16:55 UTC (permalink / raw)
  To: trondmy; +Cc: Linux NFS Mailing List

Hi Trond,

On Wed, Feb 23, 2022 at 8:25 PM <trondmy@kernel.org> wrote:
>
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When reading a very large directory, we want to try to keep the page
> cache up to date if doing so is inexpensive. With the change to allow
> readdir to continue reading even when the cache is incomplete, we no
> longer need to fall back to uncached readdir in order to scale to large
> directories.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

As of this patch, cthon tests are passing again.

Anna


Anna

> ---
>  fs/nfs/dir.c | 23 +++--------------------
>  1 file changed, 3 insertions(+), 20 deletions(-)
>
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 9b0f13b52dbf..982b5dbe30d7 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -986,28 +986,11 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
>         return res;
>  }
>
> -static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
> -{
> -       struct address_space *mapping = desc->file->f_mapping;
> -       struct inode *dir = file_inode(desc->file);
> -       unsigned int dtsize = NFS_SERVER(dir)->dtsize;
> -       loff_t size = i_size_read(dir);
> -
> -       /*
> -        * Default to uncached readdir if the page cache is empty, and
> -        * we're looking for a non-zero cookie in a large directory.
> -        */
> -       return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
> -}
> -
>  /* Search for desc->dir_cookie from the beginning of the page cache */
>  static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
>  {
>         int res;
>
> -       if (nfs_readdir_dont_search_cache(desc))
> -               return -EBADCOOKIE;
> -
>         do {
>                 if (desc->page_index == 0) {
>                         desc->current_index = 0;
> @@ -1262,10 +1245,10 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
>         }
>         if (offset != filp->f_pos) {
>                 filp->f_pos = offset;
> -               if (!nfs_readdir_use_cookie(filp)) {
> +               dir_ctx->page_index = 0;
> +               if (!nfs_readdir_use_cookie(filp))
>                         dir_ctx->dir_cookie = 0;
> -                       dir_ctx->page_index = 0;
> -               } else
> +               else
>                         dir_ctx->dir_cookie = offset;
>                 if (offset == 0)
>                         memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-23 21:13                                     ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index trondmy
  2022-02-23 21:13                                       ` [PATCH v7 20/21] NFS: Fix up forced readdirplus trondmy
@ 2022-02-24 17:31                                       ` Benjamin Coddington
  2022-02-25  2:33                                         ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-24 17:31 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 23 Feb 2022, at 16:13, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> Instead of using a linear index to address the pages, use the cookie of
> the first entry, since that is what we use to match the page anyway.
>
> This allows us to avoid re-reading the entire cache on a seekdir() type
> of operation. The latter is very common when re-exporting NFS, and is a
> major performance drain.
>
> The change does affect our duplicate cookie detection, since we can no
> longer rely on the page index as a linear offset for detecting whether
> we looped backwards. However since we no longer do a linear search
> through all the pages on each call to nfs_readdir(), this is less of a
> concern than it was previously.
> The other downside is that invalidate_mapping_pages() no longer can use
> the page index to avoid clearing pages that have been read. A subsequent
> patch will restore the functionality this provides to the 'ls -l'
> heuristic.

This is cool, but one reason I did not explore this was that the page cache
index uses XArray, which is optimized for densly clustered indexes.  This
particular sentence in the documentation was enough to scare me away:

"The XArray implementation is efficient when the indices used are densely
clustered; hashing the object and using the hash as the index will not
perform well."

However, the "not perform well" may be orders of magnitude smaller than
anthing like RPC.  Do you have concerns about this?

Another option might be to flag the context after a seekdir, which would
trigger a shift in the page_index or "turn on" hashed indexes, however
that's really only going to improve the re-export case with v4 or cached
fds.

Or maybe the /first/ seekdir on a context sets its own offset into the
pagecache - that could be a hash, and pages are filled from there.

Hmm..

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 02/21] NFS: Trace lookup revalidation failure
  2022-02-24 14:14     ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure Benjamin Coddington
@ 2022-02-25  2:09       ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  2:09 UTC (permalink / raw)
  To: bcodding, trondmy; +Cc: linux-nfs

On Thu, 2022-02-24 at 09:14 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > Enable tracing of lookup revalidation failures.
> > 
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > ---
> >  fs/nfs/dir.c | 17 +++++------------
> >  1 file changed, 5 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index ebddc736eac2..1aa55cac9d9a 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -1474,9 +1474,7 @@ nfs_lookup_revalidate_done(struct inode *dir,
> > struct dentry *dentry,
> >  {
> >         switch (error) {
> >         case 1:
> > -               dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is valid\n",
> > -                       __func__, dentry);
> > -               return 1;
> > +               break;
> >         case 0:
> >                 /*
> >                  * We can't d_drop the root of a disconnected tree:
> > @@ -1485,13 +1483,10 @@ nfs_lookup_revalidate_done(struct inode
> > *dir, 
> > struct dentry *dentry,
> >                  * inodes on unmount and further oopses.
> >                  */
> >                 if (inode && IS_ROOT(dentry))
> > -                       return 1;
> > -               dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
> > -                               __func__, dentry);
> > -               return 0;
> > +                       error = 1;
> > +               break;
> >         }
> > -       dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) lookup returned error
> > %d\n",
> > -                               __func__, dentry, error);
> > +       trace_nfs_lookup_revalidate_exit(dir, dentry, 0, error);
> 
> 
> There's a path through nfs4_lookup_revalidate that will now only
> produce
> this exit tracepoint.  Does it need the _enter tracepoint added?


You're thinking about the nfs_lookup_revalidate_delegated() path? The
_enter() tracepoint doesn't provide any useful information that isn't
already provided by the _exit(), AFAICS.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 04/21] NFS: Calculate page offsets algorithmically
  2022-02-24 14:15         ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically Benjamin Coddington
@ 2022-02-25  2:11           ` Trond Myklebust
  2022-02-25 11:28             ` Benjamin Coddington
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  2:11 UTC (permalink / raw)
  To: bcodding, trondmy; +Cc: linux-nfs

On Thu, 2022-02-24 at 09:15 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > Instead of relying on counting the page offsets as we walk through
> > the
> > page cache, switch to calculating them algorithmically.
> > 
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > ---
> >  fs/nfs/dir.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 8f17aaebcd77..f2258e926df2 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -248,17 +248,20 @@ static const char
> > *nfs_readdir_copy_name(const 
> > char *name, unsigned int len)
> >         return ret;
> >  }
> > 
> > +static size_t nfs_readdir_array_maxentries(void)
> > +{
> > +       return (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
> > +              sizeof(struct nfs_cache_array_entry);
> > +}
> > +
> 
> Why the choice to use a runtime function call rather than the
> compiler's
> calculation?  I suspect that the end result is the same, as the
> compiler
> will optimize it away, but I'm curious if there's a good reason for 
> this.
> 

The comparison is more efficient because no pointer arithmetic is
needed. As you said, the above function always evaluates to a constant,
and the array->size has been pre-calculated.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-24 14:53           ` [PATCH v7 05/21] NFS: Store the change attribute in the directory " Benjamin Coddington
@ 2022-02-25  2:26             ` Trond Myklebust
  2022-02-25  3:51               ` Trond Myklebust
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  2:26 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > Use the change attribute and the first cookie in a directory page 
> > cache
> > entry to validate that the page is up to date.
> > 
> > Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > ---
> >  fs/nfs/dir.c | 68 
> > ++++++++++++++++++++++++++++------------------------
> >  1 file changed, 37 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index f2258e926df2..5d9367d9b651 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> >  };
> > 
> >  struct nfs_cache_array {
> > +       u64 change_attr;
> >         u64 last_cookie;
> >         unsigned int size;
> >         unsigned char page_full : 1,
> > @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct 
> > nfs_cache_array *array)
> >         memset(array, 0, sizeof(struct nfs_cache_array));
> >  }
> > 
> > -static void nfs_readdir_page_init_array(struct page *page, u64 
> > last_cookie)
> > +static void nfs_readdir_page_init_array(struct page *page, u64 
> > last_cookie,
> > +                                       u64 change_attr)
> >  {
> >         struct nfs_cache_array *array;
> 
> 
> There's a hunk missing here, something like:
> 
> @@ -185,6 +185,7 @@ static void nfs_readdir_page_init_array(struct
> page 
> *page, u64 last_cookie,
>          nfs_readdir_array_init(array);
>          array->last_cookie = last_cookie;
>          array->cookies_are_ordered = 1;
> +       array->change_attr = change_attr;
>          kunmap_atomic(array);
>   }
> 
> > 
> > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 last_cookie, 
> > gfp_t gfp_flags)
> >  {
> >         struct page *page = alloc_page(gfp_flags);
> >         if (page)
> > -               nfs_readdir_page_init_array(page, last_cookie);
> > +               nfs_readdir_page_init_array(page, last_cookie, 0);
> >         return page;
> >  }
> > 
> > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct nfs_entry
> > *entry, struct page *page)
> >         return ret;
> >  }
> > 
> > +static bool nfs_readdir_page_cookie_match(struct page *page, u64 
> > last_cookie,
> > +                                         u64 change_attr)
> 
> How about "nfs_readdir_page_valid()"?  There's more going on than a 
> cookie match.
> 
> 
> > +{
> > +       struct nfs_cache_array *array = kmap_atomic(page);
> > +       int ret = true;
> > +
> > +       if (array->change_attr != change_attr)
> > +               ret = false;
> 
> Can we skip the next test if ret = false?

I'd expect the compiler to do that.

> 
> > +       if (array->size > 0 && array->array[0].cookie !=
> > last_cookie)
> > +               ret = false;
> > +       kunmap_atomic(array);
> > +       return ret;
> > +}
> > +
> > +static void nfs_readdir_page_unlock_and_put(struct page *page)
> > +{
> > +       unlock_page(page);
> > +       put_page(page);
> > +}
> > +
> >  static struct page *nfs_readdir_page_get_locked(struct
> > address_space 
> > *mapping,
> >                                                 pgoff_t index, u64
> > last_cookie)
> >  {
> >         struct page *page;
> > +       u64 change_attr;
> > 
> >         page = grab_cache_page(mapping, index);
> > -       if (page && !PageUptodate(page)) {
> > -               nfs_readdir_page_init_array(page, last_cookie);
> > -               if (invalidate_inode_pages2_range(mapping, index +
> > 1, -1) < 0)
> > -                       nfs_zap_mapping(mapping->host, mapping);
> > -               SetPageUptodate(page);
> > +       if (!page)
> > +               return NULL;
> > +       change_attr = inode_peek_iversion_raw(mapping->host);
> > +       if (PageUptodate(page)) {
> > +               if (nfs_readdir_page_cookie_match(page,
> > last_cookie,
> > +                                                 change_attr))
> > +                       return page;
> > +               nfs_readdir_clear_array(page);
> 
> 
> Why use i_version rather than nfs_save_change_attribute?  Seems
> having a
> consistent value across the pachecache and dir_verifiers would help
> debugging, and we've already have a bunch of machinery around the
> change_attribute.

The directory cache_change_attribute is not reported in tracepoints
because it is a directory-specific field, so it's not as useful for
debugging.

The inode change attribute is what we have traditionally used for
determining cache consistency, and when to invalidate the cache.

> 
> Don't we need to send a GETATTR with READDIR for v4?  Not doing so
> means
> that the pagecache is going to behave differently for v3 and v4, and 
> we'll
> potentially end up with totally bogus listings for cases where one 
> reader
> has cached a page of entries in the middle of the pagecache marked
> with
> i_version A, but entries are actually from i_version A++ on the
> server.
> Then another reader comes along and follows earlier entries from 
> i_version A
> on the server that lead into entries from A++.  I don't think we can 
> detect
> this case unless we're checking the directory on every READDIR.

The value of the change attribute is determined when the page is
allocated. That value is unaffected by the READDIR call.

That works with NFSv2 as well as NFSv3+v4.

> 
> Sending a GETATTR for v4 doesn't eliminate that race on the server
> side, 
> but
> does remove the large window on the client created by the attribute 
> cache
> timeouts, and I think its mostly harmless performance-wise.
> 
> Also, we don't need the local change_attr variable just to pass it to
> other
> functions that can access it themselves.
> 
> >         }
> > -
> > +       nfs_readdir_page_init_array(page, last_cookie,
> > change_attr);
> > +       SetPageUptodate(page);
> >         return page;
> >  }
> > 
> > @@ -356,12 +383,6 @@ static void nfs_readdir_page_set_eof(struct
> > page 
> > *page)
> >         kunmap_atomic(array);
> >  }
> > 
> > -static void nfs_readdir_page_unlock_and_put(struct page *page)
> > -{
> > -       unlock_page(page);
> > -       put_page(page);
> > -}
> > -
> >  static struct page *nfs_readdir_page_get_next(struct address_space
> > *mapping,
> >                                               pgoff_t index, u64
> > cookie)
> >  {
> > @@ -418,16 +439,6 @@ static int nfs_readdir_search_for_pos(struct 
> > nfs_cache_array *array,
> >         return -EBADCOOKIE;
> >  }
> > 
> > -static bool
> > -nfs_readdir_inode_mapping_valid(struct nfs_inode *nfsi)
> > -{
> > -       if (nfsi->cache_validity & (NFS_INO_INVALID_CHANGE |
> > -                                   NFS_INO_INVALID_DATA))
> > -               return false;
> > -       smp_rmb();
> > -       return !test_bit(NFS_INO_INVALIDATING, &nfsi->flags);
> > -}
> > -
> >  static bool nfs_readdir_array_cookie_in_range(struct
> > nfs_cache_array 
> > *array,
> >                                               u64 cookie)
> >  {
> > @@ -456,8 +467,7 @@ static int nfs_readdir_search_for_cookie(struct
> > nfs_cache_array *array,
> >                         struct nfs_inode *nfsi =
> > NFS_I(file_inode(desc->file));
> > 
> >                         new_pos = nfs_readdir_page_offset(desc-
> > >page) + i;
> > -                       if (desc->attr_gencount != nfsi-
> > >attr_gencount ||
> > -                           !nfs_readdir_inode_mapping_valid(nfsi))
> > {
> > +                       if (desc->attr_gencount != nfsi-
> > >attr_gencount) {
> >                                 desc->duped = 0;
> >                                 desc->attr_gencount = nfsi-
> > >attr_gencount;
> >                         } else if (new_pos < desc->prev_index) {
> > @@ -1094,11 +1104,7 @@ static int nfs_readdir(struct file *file, 
> > struct dir_context *ctx)
> >          * to either find the entry with the appropriate number or
> >          * revalidate the cookie.
> >          */
> > -       if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
> > -               res = nfs_revalidate_mapping(inode, file-
> > >f_mapping);
> > -               if (res < 0)
> > -                       goto out;
> > -       }
> > +       nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE);
> 
> Same as above -> why not send GETATTR with READDIR instead of doing
> it 
> in a
> separate RPC?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-24 17:31                                       ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index Benjamin Coddington
@ 2022-02-25  2:33                                         ` Trond Myklebust
  2022-02-25  3:17                                           ` NeilBrown
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  2:33 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Thu, 2022-02-24 at 12:31 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 16:13, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > Instead of using a linear index to address the pages, use the
> > cookie of
> > the first entry, since that is what we use to match the page
> > anyway.
> > 
> > This allows us to avoid re-reading the entire cache on a seekdir()
> > type
> > of operation. The latter is very common when re-exporting NFS, and
> > is a
> > major performance drain.
> > 
> > The change does affect our duplicate cookie detection, since we can
> > no
> > longer rely on the page index as a linear offset for detecting
> > whether
> > we looped backwards. However since we no longer do a linear search
> > through all the pages on each call to nfs_readdir(), this is less
> > of a
> > concern than it was previously.
> > The other downside is that invalidate_mapping_pages() no longer can
> > use
> > the page index to avoid clearing pages that have been read. A
> > subsequent
> > patch will restore the functionality this provides to the 'ls -l'
> > heuristic.
> 
> This is cool, but one reason I did not explore this was that the page
> cache
> index uses XArray, which is optimized for densly clustered indexes. 
> This
> particular sentence in the documentation was enough to scare me away:
> 
> "The XArray implementation is efficient when the indices used are
> densely
> clustered; hashing the object and using the hash as the index will
> not
> perform well."
> 
> However, the "not perform well" may be orders of magnitude smaller
> than
> anthing like RPC.  Do you have concerns about this?

What is the difference between this workload and a random access
database workload?

If the XArray is incapable of dealing with random access, then we
should never have chosen it for the page cache. I'm therefore assuming
that either the above comment is referring to micro-optimisations that
don't matter much with these workloads, or else that the plan is to
replace the XArray with something more appropriate for a page cache
workload.


> 
> Another option might be to flag the context after a seekdir, which
> would
> trigger a shift in the page_index or "turn on" hashed indexes,
> however
> that's really only going to improve the re-export case with v4 or
> cached
> fds.
> 
> Or maybe the /first/ seekdir on a context sets its own offset into
> the
> pagecache - that could be a hash, and pages are filled from there.
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 16/21] NFS: Add basic readdir tracing
  2022-02-24 15:53                                 ` [PATCH v7 16/21] NFS: Add basic readdir tracing Benjamin Coddington
@ 2022-02-25  2:35                                   ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  2:35 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Thu, 2022-02-24 at 10:53 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 16:13, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > Add tracing to track how often the client goes to the server for 
> > updated
> > readdir information.
> > 
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > ---
> >  fs/nfs/dir.c      | 13 ++++++++-
> >  fs/nfs/nfstrace.h | 68 
> > +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 80 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > index 54f0d37485d5..41e2d02d8611 100644
> > --- a/fs/nfs/dir.c
> > +++ b/fs/nfs/dir.c
> > @@ -969,10 +969,14 @@ static int find_and_lock_cache_page(struct 
> > nfs_readdir_descriptor *desc)
> >                 return -ENOMEM;
> >         if (nfs_readdir_page_needs_filling(desc->page)) {
> >                 desc->page_index_max = desc->page_index;
> > +               trace_nfs_readdir_cache_fill(desc->file, nfsi-
> > >cookieverf,
> > +                                            desc->last_cookie,
> > +                                            desc->page_index,
> > desc->dtsize);
> >                 res = nfs_readdir_xdr_to_array(desc, nfsi-
> > >cookieverf, verf,
> >                                                &desc->page, 1);
> >                 if (res < 0) {
> >                         nfs_readdir_page_unlock_and_put_cached(desc
> > );
> > +                       trace_nfs_readdir_cache_fill_done(inode,
> > res);
> >                         if (res == -EBADCOOKIE || res == -ENOTSYNC)
> > {
> >                                 invalidate_inode_pages2(desc->file-
> > >f_mapping);
> >                                 desc->page_index = 0;
> > @@ -1090,7 +1094,14 @@ static int uncached_readdir(struct 
> > nfs_readdir_descriptor *desc)
> >         desc->duped = 0;
> >         desc->page_index_max = 0;
> > 
> > +       trace_nfs_readdir_uncached(desc->file, desc->verf, 
> > desc->last_cookie,
> > +                                  -1, desc->dtsize);
> > +
> >         status = nfs_readdir_xdr_to_array(desc, desc->verf, verf,
> > arrays, 
> > sz);
> > +       if (status < 0) {
> > +               trace_nfs_readdir_uncached_done(file_inode(desc-
> > >file), status);
> > +               goto out_free;
> > +       }
> > 
> >         for (i = 0; !desc->eob && i < sz && arrays[i]; i++) {
> >                 desc->page = arrays[i];
> > @@ -1109,7 +1120,7 @@ static int uncached_readdir(struct 
> > nfs_readdir_descriptor *desc)
> >                          i < (desc->page_index_max >> 1))
> >                         nfs_shrink_dtsize(desc);
> >         }
> > -
> > +out_free:
> >         for (i = 0; i < sz && arrays[i]; i++)
> >                 nfs_readdir_page_array_free(arrays[i]);
> >  out:
> > diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
> > index 3672f6703ee7..c2d0543ecb2d 100644
> > --- a/fs/nfs/nfstrace.h
> > +++ b/fs/nfs/nfstrace.h
> > @@ -160,6 +160,8 @@ DEFINE_NFS_INODE_EVENT(nfs_fsync_enter);
> >  DEFINE_NFS_INODE_EVENT_DONE(nfs_fsync_exit);
> >  DEFINE_NFS_INODE_EVENT(nfs_access_enter);
> >  DEFINE_NFS_INODE_EVENT_DONE(nfs_set_cache_invalid);
> > +DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_cache_fill_done);
> > +DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_uncached_done);
> > 
> >  TRACE_EVENT(nfs_access_exit,
> >                 TP_PROTO(
> > @@ -271,6 +273,72 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
> >  DEFINE_NFS_UPDATE_SIZE_EVENT(update);
> >  DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
> > 
> > +DECLARE_EVENT_CLASS(nfs_readdir_event,
> > +               TP_PROTO(
> > +                       const struct file *file,
> > +                       const __be32 *verifier,
> > +                       u64 cookie,
> > +                       pgoff_t page_index,
> > +                       unsigned int dtsize
> > +               ),
> > +
> > +               TP_ARGS(file, verifier, cookie, page_index,
> > dtsize),
> > +
> > +               TP_STRUCT__entry(
> > +                       __field(dev_t, dev)
> > +                       __field(u32, fhandle)
> > +                       __field(u64, fileid)
> > +                       __field(u64, version)
> > +                       __array(char, verifier, NFS4_VERIFIER_SIZE)
> > +                       __field(u64, cookie)
> > +                       __field(pgoff_t, index)
> > +                       __field(unsigned int, dtsize)
> 
> 
> I'd like to be able to see the change_attr too, whether or not it's
> the
> cache_change_attribute or i_version.
> 

It is reported by the __entry->version.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-25  2:33                                         ` Trond Myklebust
@ 2022-02-25  3:17                                           ` NeilBrown
  2022-02-25  4:25                                             ` Trond Myklebust
  0 siblings, 1 reply; 57+ messages in thread
From: NeilBrown @ 2022-02-25  3:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bcodding, linux-nfs

On Fri, 25 Feb 2022, Trond Myklebust wrote:
> On Thu, 2022-02-24 at 12:31 -0500, Benjamin Coddington wrote:
> > 
> > "The XArray implementation is efficient when the indices used are
> > densely
> > clustered; hashing the object and using the hash as the index will
> > not
> > perform well."
> > 
> > However, the "not perform well" may be orders of magnitude smaller
> > than
> > anthing like RPC.  Do you have concerns about this?
> 
> What is the difference between this workload and a random access
> database workload?

Probably the range of expected addresses.
If I understand the proposal correctly, the page addresses in this
workload could be any 64bit number.
For a large database, it would be at most 52 bits (assuming 64bits worth
of bytes), and very likely substantially smaller - maybe 40 bits for a
really really big database.

> 
> If the XArray is incapable of dealing with random access, then we
> should never have chosen it for the page cache. I'm therefore assuming
> that either the above comment is referring to micro-optimisations that
> don't matter much with these workloads, or else that the plan is to
> replace the XArray with something more appropriate for a page cache
> workload.

I haven't looked at the code recently so this might not be 100%
accurate, but XArray generally assumes that pages are often adjacent.
They don't have to be, but there is a cost.
It uses a multi-level array with 9 bits per level.  At each level there
are a whole number of pages for indexes to the next level.

If there are two entries, that are 2^45 separate, that is 5 levels of
indexing that cannot be shared.  So the path to one entry is 5 pages,
each of which contains a single pointer.  The path to the other entry is
a separate set of 5 pages.

So worst case, the index would be about 64/9 or 7 times the size of the
data.  As the number of data pages increases, this would shrink
slightly, but I suspect you wouldn't get below a factor of 3 before you
fill up all of your memory.

NeilBrown

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25  2:26             ` Trond Myklebust
@ 2022-02-25  3:51               ` Trond Myklebust
  2022-02-25 11:38                 ` Benjamin Coddington
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  3:51 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
> On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> > On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> > 
> > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > 
> > > Use the change attribute and the first cookie in a directory page
> > > cache
> > > entry to validate that the page is up to date.
> > > 
> > > Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> > > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > ---
> > >  fs/nfs/dir.c | 68 
> > > ++++++++++++++++++++++++++++------------------------
> > >  1 file changed, 37 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > index f2258e926df2..5d9367d9b651 100644
> > > --- a/fs/nfs/dir.c
> > > +++ b/fs/nfs/dir.c
> > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> > >  };
> > > 
> > >  struct nfs_cache_array {
> > > +       u64 change_attr;
> > >         u64 last_cookie;
> > >         unsigned int size;
> > >         unsigned char page_full : 1,
> > > @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct 
> > > nfs_cache_array *array)
> > >         memset(array, 0, sizeof(struct nfs_cache_array));
> > >  }
> > > 
> > > -static void nfs_readdir_page_init_array(struct page *page, u64 
> > > last_cookie)
> > > +static void nfs_readdir_page_init_array(struct page *page, u64 
> > > last_cookie,
> > > +                                       u64 change_attr)
> > >  {
> > >         struct nfs_cache_array *array;
> > 
> > 
> > There's a hunk missing here, something like:
> > 
> > @@ -185,6 +185,7 @@ static void nfs_readdir_page_init_array(struct
> > page 
> > *page, u64 last_cookie,
> >          nfs_readdir_array_init(array);
> >          array->last_cookie = last_cookie;
> >          array->cookies_are_ordered = 1;
> > +       array->change_attr = change_attr;
> >          kunmap_atomic(array);
> >   }
> > 
> > > 
> > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 last_cookie,
> > > gfp_t gfp_flags)
> > >  {
> > >         struct page *page = alloc_page(gfp_flags);
> > >         if (page)
> > > -               nfs_readdir_page_init_array(page, last_cookie);
> > > +               nfs_readdir_page_init_array(page, last_cookie,
> > > 0);
> > >         return page;
> > >  }
> > > 
> > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
> > > nfs_entry
> > > *entry, struct page *page)
> > >         return ret;
> > >  }
> > > 
> > > +static bool nfs_readdir_page_cookie_match(struct page *page, u64
> > > last_cookie,
> > > +                                         u64 change_attr)
> > 
> > How about "nfs_readdir_page_valid()"?  There's more going on than a
> > cookie match.
> > 
> > 
> > > +{
> > > +       struct nfs_cache_array *array = kmap_atomic(page);
> > > +       int ret = true;
> > > +
> > > +       if (array->change_attr != change_attr)
> > > +               ret = false;
> > 
> > Can we skip the next test if ret = false?
> 
> I'd expect the compiler to do that.
> 
> > 
> > > +       if (array->size > 0 && array->array[0].cookie !=
> > > last_cookie)
> > > +               ret = false;
> > > +       kunmap_atomic(array);
> > > +       return ret;
> > > +}
> > > +
> > > +static void nfs_readdir_page_unlock_and_put(struct page *page)
> > > +{
> > > +       unlock_page(page);
> > > +       put_page(page);
> > > +}
> > > +
> > >  static struct page *nfs_readdir_page_get_locked(struct
> > > address_space 
> > > *mapping,
> > >                                                 pgoff_t index,
> > > u64
> > > last_cookie)
> > >  {
> > >         struct page *page;
> > > +       u64 change_attr;
> > > 
> > >         page = grab_cache_page(mapping, index);
> > > -       if (page && !PageUptodate(page)) {
> > > -               nfs_readdir_page_init_array(page, last_cookie);
> > > -               if (invalidate_inode_pages2_range(mapping, index
> > > +
> > > 1, -1) < 0)
> > > -                       nfs_zap_mapping(mapping->host, mapping);
> > > -               SetPageUptodate(page);
> > > +       if (!page)
> > > +               return NULL;
> > > +       change_attr = inode_peek_iversion_raw(mapping->host);
> > > +       if (PageUptodate(page)) {
> > > +               if (nfs_readdir_page_cookie_match(page,
> > > last_cookie,
> > > +                                                 change_attr))
> > > +                       return page;
> > > +               nfs_readdir_clear_array(page);
> > 
> > 
> > Why use i_version rather than nfs_save_change_attribute?  Seems
> > having a
> > consistent value across the pachecache and dir_verifiers would help
> > debugging, and we've already have a bunch of machinery around the
> > change_attribute.
> 
> The directory cache_change_attribute is not reported in tracepoints
> because it is a directory-specific field, so it's not as useful for
> debugging.
> 
> The inode change attribute is what we have traditionally used for
> determining cache consistency, and when to invalidate the cache.

I should probably elaborate a little further on the differences between
the inode change attribute and the cache_change_attribute.

One of the main reasons for introducing the latter was to have
something that allows us to track changes to the directory, but to
avoid forcing unnecessary revalidations of the dcache.

What this means is that when we create or remove a file, and the
pre/post-op attributes tell us that there were no third party changes
to the directory, we update the dcache, but we do _not_ update the
cache_change_attribute, because we know that the rest of the directory
contents are valid, and so we don't have to revalidate the dentries.
However in that case, we _do_ want to update the readdir cache to
reflect the fact that an entry was added or deleted. While we could
figure out how to remove an entry (at least for the case where the
filesystem is case-sensitive), we do not know where the filesystem
added the new file, or what cookies was assigned.

This is why the inode change attribute is more appropriate for indexing
the page cache pages. It reflects the cases where we want to revalidate
the readdir cache, as opposed to the dcache.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 00/21] Readdir improvements
  2022-02-24 12:25 ` [PATCH v7 00/21] Readdir improvements David Wysochanski
@ 2022-02-25  4:00   ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  4:00 UTC (permalink / raw)
  To: bcodding, dwysocha; +Cc: linux-nfs

On Thu, 2022-02-24 at 07:25 -0500, David Wysochanski wrote:
> On Wed, Feb 23, 2022 at 4:24 PM <trondmy@kernel.org> wrote:
> > 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > The current NFS readdir code will always try to maximise the amount
> > of
> > readahead it performs on the assumption that we can cache anything
> > that
> > isn't immediately read by the process.
> > There are several cases where this assumption breaks down,
> > including
> > when the 'ls -l' heuristic kicks in to try to force use of
> > readdirplus
> > as a batch replacement for lookup/getattr.
> > 
> > This series also implement Ben's page cache filter to ensure that
> > we can
> > improve the ability to share cached data between processes that are
> > reading the same directory at the same time, and to avoid live-
> > locks
> > when the directory is simultaneously changing.
> > 
> > --
> > v2: Remove reset of dtsize when NFS_INO_FORCE_READDIR is set
> > v3: Avoid excessive window shrinking in uncached_readdir case
> > v4: Track 'ls -l' cache hit/miss statistics
> >     Improved algorithm for falling back to uncached readdir
> >     Skip readdirplus when files are being written to
> > v5: bugfixes
> >     Skip readdirplus when the acdirmax/acregmax values are low
> >     Request a full XDR buffer when doing READDIRPLUS
> > v6: Add tracing
> >     Don't have lookup request readdirplus when it won't help
> > v7: Implement Ben's page cache filter
> >     Reduce the use of uncached readdir
> >     Change indexing of the page cache to improve seekdir()
> > performance.
> > 
> > Trond Myklebust (21):
> >   NFS: constify nfs_server_capable() and nfs_have_writebacks()
> >   NFS: Trace lookup revalidation failure
> >   NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
> >   NFS: Calculate page offsets algorithmically
> >   NFS: Store the change attribute in the directory page cache
> >   NFS: If the cookie verifier changes, we must invalidate the page
> > cache
> >   NFS: Don't re-read the entire page cache to find the next cookie
> >   NFS: Adjust the amount of readahead performed by NFS readdir
> >   NFS: Simplify nfs_readdir_xdr_to_array()
> >   NFS: Reduce use of uncached readdir
> >   NFS: Improve heuristic for readdirplus
> >   NFS: Don't ask for readdirplus unless it can help nfs_getattr()
> >   NFSv4: Ask for a full XDR buffer of readdir goodness
> >   NFS: Readdirplus can't help lookup for case insensitive
> > filesystems
> >   NFS: Don't request readdirplus when revalidation was forced
> >   NFS: Add basic readdir tracing
> >   NFS: Trace effects of readdirplus on the dcache
> >   NFS: Trace effects of the readdirplus heuristic
> >   NFS: Convert readdir page cache to use a cookie based index
> >   NFS: Fix up forced readdirplus
> >   NFS: Remove unnecessary cache invalidations for directories
> > 
> >  fs/nfs/dir.c           | 450 ++++++++++++++++++++++++-------------
> > ----
> >  fs/nfs/inode.c         |  46 ++---
> >  fs/nfs/internal.h      |   4 +-
> >  fs/nfs/nfs3xdr.c       |   7 +-
> >  fs/nfs/nfs4proc.c      |   2 -
> >  fs/nfs/nfs4xdr.c       |   6 +-
> >  fs/nfs/nfstrace.h      | 122 ++++++++++-
> >  include/linux/nfs_fs.h |  19 +-
> >  8 files changed, 421 insertions(+), 235 deletions(-)
> > 
> > --
> > 2.35.1
> > 
> 
> Trond I have been following your work here with periodic tests though
> not fully following all the patches content.  As you know this is a
> tricky
> area and seems to be a hotspot area for customers that use NFS, with
> many scenarios that may go wrong.  Thanks for your work, which now
> includes even some tracepoints and Ben's page cache filler.
> 
> This patchset seems to be the best of all the ones so far.  My
> initial
> tests (listings when modifying as well as idle directories) indicate
> that the issue that Gonzalo reported on Jan 14th [1] looks to be
> fixed
> by this set, but I'll let him confirm.  I'll do some more testing and
> let you know if there's anything else I find.  If there're some
> scenarios (mount options, servers, etc) you need more testing on, let
> us know and we'll try to make that happen.
> 
> [1] [PATCH] NFS: limit block size reported for directories
> 

Thanks Dave! I very much appreciate your testing the patches. This is a
very complex area due to the interplay of readdir/readdirplus, the
dcache and the attribute cache, and so it is key to get as much
statistics as possible. Thanks again to you and Red Hat for
contributing to that effort.

Thanks also to Ben for his comments and reviews. His insights are key
to the progress we appear to be making.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 10/21] NFS: Reduce use of uncached readdir
  2022-02-24 16:55                     ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir Anna Schumaker
@ 2022-02-25  4:07                       ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  4:07 UTC (permalink / raw)
  To: schumaker.anna, trondmy; +Cc: linux-nfs

On Thu, 2022-02-24 at 11:55 -0500, Anna Schumaker wrote:
> Hi Trond,
> 
> On Wed, Feb 23, 2022 at 8:25 PM <trondmy@kernel.org> wrote:
> > 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When reading a very large directory, we want to try to keep the
> > page
> > cache up to date if doing so is inexpensive. With the change to
> > allow
> > readdir to continue reading even when the cache is incomplete, we
> > no
> > longer need to fall back to uncached readdir in order to scale to
> > large
> > directories.
> > 
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> 
> As of this patch, cthon tests are passing again.
> 

I'm going to push out a v8 patchset. I'm just waiting for a few more
comments, etc.

Anyhow, I wonder if the fact that we're not initialising the verifier
in nfs_opendir() is part of the problem here. I've added a patch to do
this.

I also found an issue with the nfs2/nfs3_decode_dirent() return values
(and have a fix for that), however I'm assuming that is not the problem
you're seeing, since you were reporting issues with NFSv4, which is
unaffected.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-25  3:17                                           ` NeilBrown
@ 2022-02-25  4:25                                             ` Trond Myklebust
  2022-02-25 12:33                                               ` Benjamin Coddington
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25  4:25 UTC (permalink / raw)
  To: neilb; +Cc: linux-nfs, bcodding

On Fri, 2022-02-25 at 14:17 +1100, NeilBrown wrote:
> On Fri, 25 Feb 2022, Trond Myklebust wrote:
> > On Thu, 2022-02-24 at 12:31 -0500, Benjamin Coddington wrote:
> > > 
> > > "The XArray implementation is efficient when the indices used are
> > > densely
> > > clustered; hashing the object and using the hash as the index
> > > will
> > > not
> > > perform well."
> > > 
> > > However, the "not perform well" may be orders of magnitude
> > > smaller
> > > than
> > > anthing like RPC.  Do you have concerns about this?
> > 
> > What is the difference between this workload and a random access
> > database workload?
> 
> Probably the range of expected addresses.
> If I understand the proposal correctly, the page addresses in this
> workload could be any 64bit number.
> For a large database, it would be at most 52 bits (assuming 64bits
> worth
> of bytes), and very likely substantially smaller - maybe 40 bits for
> a
> really really big database.
> 
> > 
> > If the XArray is incapable of dealing with random access, then we
> > should never have chosen it for the page cache. I'm therefore
> > assuming
> > that either the above comment is referring to micro-optimisations
> > that
> > don't matter much with these workloads, or else that the plan is to
> > replace the XArray with something more appropriate for a page cache
> > workload.
> 
> I haven't looked at the code recently so this might not be 100%
> accurate, but XArray generally assumes that pages are often adjacent.
> They don't have to be, but there is a cost.
> It uses a multi-level array with 9 bits per level.  At each level
> there
> are a whole number of pages for indexes to the next level.
> 
> If there are two entries, that are 2^45 separate, that is 5 levels of
> indexing that cannot be shared.  So the path to one entry is 5 pages,
> each of which contains a single pointer.  The path to the other entry
> is
> a separate set of 5 pages.
> 
> So worst case, the index would be about 64/9 or 7 times the size of
> the
> data.  As the number of data pages increases, this would shrink
> slightly, but I suspect you wouldn't get below a factor of 3 before
> you
> fill up all of your memory.
> 


If the problem is just the range, then that is trivial to fix: we can
just use xxhash32(), and take the hit of more collisions. However if
the problem is the access pattern, then I have serious questions about
the choice of implementation for the page cache. If the cache can't
support file random access, then we're barking up the wrong tree on the
wrong continent.

Either way, I see avoiding linear searches for cookies as a benefit
that is worth pursuing.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 04/21] NFS: Calculate page offsets algorithmically
  2022-02-25  2:11           ` Trond Myklebust
@ 2022-02-25 11:28             ` Benjamin Coddington
  2022-02-25 12:44               ` Trond Myklebust
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 11:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: trondmy, linux-nfs

On 24 Feb 2022, at 21:11, Trond Myklebust wrote:

> On Thu, 2022-02-24 at 09:15 -0500, Benjamin Coddington wrote:
>> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
>>
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>
>>> Instead of relying on counting the page offsets as we walk through
>>> the
>>> page cache, switch to calculating them algorithmically.
>>>
>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>>> ---
>>>  fs/nfs/dir.c | 18 +++++++++++++-----
>>>  1 file changed, 13 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>> index 8f17aaebcd77..f2258e926df2 100644
>>> --- a/fs/nfs/dir.c
>>> +++ b/fs/nfs/dir.c
>>> @@ -248,17 +248,20 @@ static const char
>>> *nfs_readdir_copy_name(const
>>> char *name, unsigned int len)
>>>         return ret;
>>>  }
>>>
>>> +static size_t nfs_readdir_array_maxentries(void)
>>> +{
>>> +       return (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
>>> +              sizeof(struct nfs_cache_array_entry);
>>> +}
>>> +
>>
>> Why the choice to use a runtime function call rather than the
>> compiler's
>> calculation?  I suspect that the end result is the same, as the
>> compiler
>> will optimize it away, but I'm curious if there's a good reason for
>> this.
>>
>
> The comparison is more efficient because no pointer arithmetic is
> needed. As you said, the above function always evaluates to a constant,
> and the array->size has been pre-calculated.

Comparisons are more efficient than using something like this?:

static const int nfs_readdir_array_maxentries =
        (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
        sizeof(struct nfs_cache_array_entry);

I don't understand why, I must admit.   I'm not saying it should be changed,
I'm just trying to figure out the reason for the function declaration when
the value is a constant, and I thought there was a hole in my head.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25  3:51               ` Trond Myklebust
@ 2022-02-25 11:38                 ` Benjamin Coddington
  2022-02-25 13:10                   ` Trond Myklebust
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 11:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 24 Feb 2022, at 22:51, Trond Myklebust wrote:

> On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
>> On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
>>> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
>>>
>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>>
>>>> Use the change attribute and the first cookie in a directory page
>>>> cache
>>>> entry to validate that the page is up to date.
>>>>
>>>> Suggested-by: Benjamin Coddington <bcodding@redhat.com>
>>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>> ---
>>>>  fs/nfs/dir.c | 68
>>>> ++++++++++++++++++++++++++++------------------------
>>>>  1 file changed, 37 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>> index f2258e926df2..5d9367d9b651 100644
>>>> --- a/fs/nfs/dir.c
>>>> +++ b/fs/nfs/dir.c
>>>> @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
>>>>  };
>>>>
>>>>  struct nfs_cache_array {
>>>> +       u64 change_attr;
>>>>         u64 last_cookie;
>>>>         unsigned int size;
>>>>         unsigned char page_full : 1,
>>>> @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct
>>>> nfs_cache_array *array)
>>>>         memset(array, 0, sizeof(struct nfs_cache_array));
>>>>  }
>>>>
>>>> -static void nfs_readdir_page_init_array(struct page *page, u64
>>>> last_cookie)
>>>> +static void nfs_readdir_page_init_array(struct page *page, u64
>>>> last_cookie,
>>>> +                                       u64 
>>>> change_attr)
>>>>  {
>>>>         struct nfs_cache_array *array;
>>>
>>>
>>> There's a hunk missing here, something like:
>>>
>>> @@ -185,6 +185,7 @@ static void nfs_readdir_page_init_array(struct
>>> page
>>> *page, u64 last_cookie,
>>>          nfs_readdir_array_init(array);
>>>          array->last_cookie = last_cookie;
>>>          array->cookies_are_ordered = 1;
>>> +       array->change_attr = change_attr;
>>>          kunmap_atomic(array);
>>>   }
>>>
>>>>
>>>> @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64 last_cookie,
>>>> gfp_t gfp_flags)
>>>>  {
>>>>         struct page *page = alloc_page(gfp_flags);
>>>>         if (page)
>>>> -               nfs_readdir_page_init_array(page, 
>>>> last_cookie);
>>>> +               nfs_readdir_page_init_array(page, 
>>>> last_cookie,
>>>> 0);
>>>>         return page;
>>>>  }
>>>>
>>>> @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
>>>> nfs_entry
>>>> *entry, struct page *page)
>>>>         return ret;
>>>>  }
>>>>
>>>> +static bool nfs_readdir_page_cookie_match(struct page *page, u64
>>>> last_cookie,
>>>> +                                         
>>>> u64 change_attr)
>>>
>>> How about "nfs_readdir_page_valid()"?  There's more going on than a
>>> cookie match.
>>>
>>>
>>>> +{
>>>> +       struct nfs_cache_array *array = kmap_atomic(page);
>>>> +       int ret = true;
>>>> +
>>>> +       if (array->change_attr != change_attr)
>>>> +               ret = false;
>>>
>>> Can we skip the next test if ret = false?
>>
>> I'd expect the compiler to do that.
>>
>>>
>>>> +       if (array->size > 0 && array->array[0].cookie !=
>>>> last_cookie)
>>>> +               ret = false;
>>>> +       kunmap_atomic(array);
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static void nfs_readdir_page_unlock_and_put(struct page *page)
>>>> +{
>>>> +       unlock_page(page);
>>>> +       put_page(page);
>>>> +}
>>>> +
>>>>  static struct page *nfs_readdir_page_get_locked(struct
>>>> address_space
>>>> *mapping,
>>>>                                                 pgoff_t 
>>>> index,
>>>> u64
>>>> last_cookie)
>>>>  {
>>>>         struct page *page;
>>>> +       u64 change_attr;
>>>>
>>>>         page = grab_cache_page(mapping, index);
>>>> -       if (page && !PageUptodate(page)) {
>>>> -               nfs_readdir_page_init_array(page, 
>>>> last_cookie);
>>>> -               if 
>>>> (invalidate_inode_pages2_range(mapping, index
>>>> +
>>>> 1, -1) < 0)
>>>> -                       nfs_zap_mapping(mapping->host, 
>>>> mapping);
>>>> -               SetPageUptodate(page);
>>>> +       if (!page)
>>>> +               return NULL;
>>>> +       change_attr = 
>>>> inode_peek_iversion_raw(mapping->host);
>>>> +       if (PageUptodate(page)) {
>>>> +               if 
>>>> (nfs_readdir_page_cookie_match(page,
>>>> last_cookie,
>>>> +                                                 
>>>> change_attr))
>>>> +                       return page;
>>>> +               nfs_readdir_clear_array(page);
>>>
>>>
>>> Why use i_version rather than nfs_save_change_attribute?  Seems
>>> having a
>>> consistent value across the pachecache and dir_verifiers would help
>>> debugging, and we've already have a bunch of machinery around the
>>> change_attribute.
>>
>> The directory cache_change_attribute is not reported in tracepoints
>> because it is a directory-specific field, so it's not as useful for
>> debugging.
>>
>> The inode change attribute is what we have traditionally used for
>> determining cache consistency, and when to invalidate the cache.
>
> I should probably elaborate a little further on the differences 
> between
> the inode change attribute and the cache_change_attribute.
>
> One of the main reasons for introducing the latter was to have
> something that allows us to track changes to the directory, but to
> avoid forcing unnecessary revalidations of the dcache.
>
> What this means is that when we create or remove a file, and the
> pre/post-op attributes tell us that there were no third party changes
> to the directory, we update the dcache, but we do _not_ update the
> cache_change_attribute, because we know that the rest of the directory
> contents are valid, and so we don't have to revalidate the dentries.
> However in that case, we _do_ want to update the readdir cache to
> reflect the fact that an entry was added or deleted. While we could
> figure out how to remove an entry (at least for the case where the
> filesystem is case-sensitive), we do not know where the filesystem
> added the new file, or what cookies was assigned.
>
> This is why the inode change attribute is more appropriate for 
> indexing
> the page cache pages. It reflects the cases where we want to 
> revalidate
> the readdir cache, as opposed to the dcache.

Ok, thanks for explaining this.

I've noticed that you haven't responded about my concerns about not 
checking
the directory for changes with every v4 READDIR.  For v3, we have 
post-op
updates to the directory, but with v4 the directory can change and we'll
end up with entries in the cache that are marked with an old 
change_attr.

I'm pretty positive that not checking for changes to the directory (not
sending GETATTR with READDIR) is going to create cases of double-listed 
and
truncated-listings for dirctory listers.  Not handling those cases means 
I'm
going to have some very unhappy customers that complain about their 
files
disappearing/reappearing on NFS.

If you need me to prove that its an issue, I can take the time to write 
up
program that shows this problem.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-25  4:25                                             ` Trond Myklebust
@ 2022-02-25 12:33                                               ` Benjamin Coddington
  2022-02-25 13:11                                                 ` Trond Myklebust
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 12:33 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: neilb, linux-nfs

On 24 Feb 2022, at 23:25, Trond Myklebust wrote:

> On Fri, 2022-02-25 at 14:17 +1100, NeilBrown wrote:

>> I haven't looked at the code recently so this might not be 100% accurate,
>> but XArray generally assumes that pages are often adjacent.  They don't
>> have to be, but there is a cost.  It uses a multi-level array with 9 bits
>> per level.  At each level there are a whole number of pages for indexes
>> to the next level.
>>
>> If there are two entries, that are 2^45 separate, that is 5 levels of
>> indexing that cannot be shared.  So the path to one entry is 5 pages,
>> each of which contains a single pointer.  The path to the other entry is
>> a separate set of 5 pages.
>>
>> So worst case, the index would be about 64/9 or 7 times the size of the
>> data.  As the number of data pages increases, this would shrink slightly,
>> but I suspect you wouldn't get below a factor of 3 before you fill up all
>> of your memory.

Yikes!

> If the problem is just the range, then that is trivial to fix: we can
> just use xxhash32(), and take the hit of more collisions. However if
> the problem is the access pattern, then I have serious questions about
> the choice of implementation for the page cache. If the cache can't
> support file random access, then we're barking up the wrong tree on the
> wrong continent.

I'm guessing the issue might be "get next", which for an "array" is probably
the operation tested for "perform well".  We're not doing any of that, we're
directly addressing pages with our hashed index.

> Either way, I see avoiding linear searches for cookies as a benefit
> that is worth pursuing.

Me too.  What about just kicking the seekdir users up into the second half
of the index, to use xxhash32() up there.  Everyone else can hang out in the
bottom half with dense indexes and help each other out.

The vast  majority of readdir() use is going to be short listings traversed
in order.  The memory inflation created by a process that needs to walk a
tree and for every two pages of readdir data require 10 pages of indexes
seems pretty extreme.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 04/21] NFS: Calculate page offsets algorithmically
  2022-02-25 11:28             ` Benjamin Coddington
@ 2022-02-25 12:44               ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 12:44 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 06:28 -0500, Benjamin Coddington wrote:
> On 24 Feb 2022, at 21:11, Trond Myklebust wrote:
> 
> > On Thu, 2022-02-24 at 09:15 -0500, Benjamin Coddington wrote:
> > > On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> > > 
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > 
> > > > Instead of relying on counting the page offsets as we walk
> > > > through
> > > > the
> > > > page cache, switch to calculating them algorithmically.
> > > > 
> > > > Signed-off-by: Trond Myklebust
> > > > <trond.myklebust@hammerspace.com>
> > > > ---
> > > >  fs/nfs/dir.c | 18 +++++++++++++-----
> > > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > index 8f17aaebcd77..f2258e926df2 100644
> > > > --- a/fs/nfs/dir.c
> > > > +++ b/fs/nfs/dir.c
> > > > @@ -248,17 +248,20 @@ static const char
> > > > *nfs_readdir_copy_name(const
> > > > char *name, unsigned int len)
> > > >         return ret;
> > > >  }
> > > > 
> > > > +static size_t nfs_readdir_array_maxentries(void)
> > > > +{
> > > > +       return (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
> > > > +              sizeof(struct nfs_cache_array_entry);
> > > > +}
> > > > +
> > > 
> > > Why the choice to use a runtime function call rather than the
> > > compiler's
> > > calculation?  I suspect that the end result is the same, as the
> > > compiler
> > > will optimize it away, but I'm curious if there's a good reason
> > > for
> > > this.
> > > 
> > 
> > The comparison is more efficient because no pointer arithmetic is
> > needed. As you said, the above function always evaluates to a
> > constant,
> > and the array->size has been pre-calculated.
> 
> Comparisons are more efficient than using something like this?:
> 
> static const int nfs_readdir_array_maxentries =
>         (PAGE_SIZE - sizeof(struct nfs_cache_array)) /
>         sizeof(struct nfs_cache_array_entry);
> 
> I don't understand why, I must admit.   I'm not saying it should be
> changed,
> I'm just trying to figure out the reason for the function declaration
> when
> the value is a constant, and I thought there was a hole in my head.
> 

Unless we're talking about a compiler from the 1960s, there is little
difference between the two proposals. Any modern C compiler worth its
salt will know to inline the numeric value into the comparison (and
will optimise away the storage of your variable).

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 11:38                 ` Benjamin Coddington
@ 2022-02-25 13:10                   ` Trond Myklebust
  2022-02-25 13:26                     ` Trond Myklebust
  2022-02-25 14:44                     ` Benjamin Coddington
  0 siblings, 2 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 13:10 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
> On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
> 
> > On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
> > > On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> > > > On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> > > > 
> > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > > 
> > > > > Use the change attribute and the first cookie in a directory
> > > > > page
> > > > > cache
> > > > > entry to validate that the page is up to date.
> > > > > 
> > > > > Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> > > > > Signed-off-by: Trond Myklebust
> > > > > <trond.myklebust@hammerspace.com>
> > > > > ---
> > > > >  fs/nfs/dir.c | 68
> > > > > ++++++++++++++++++++++++++++------------------------
> > > > >  1 file changed, 37 insertions(+), 31 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > index f2258e926df2..5d9367d9b651 100644
> > > > > --- a/fs/nfs/dir.c
> > > > > +++ b/fs/nfs/dir.c
> > > > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> > > > >  };
> > > > > 
> > > > >  struct nfs_cache_array {
> > > > > +       u64 change_attr;
> > > > >         u64 last_cookie;
> > > > >         unsigned int size;
> > > > >         unsigned char page_full : 1,
> > > > > @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct
> > > > > nfs_cache_array *array)
> > > > >         memset(array, 0, sizeof(struct nfs_cache_array));
> > > > >  }
> > > > > 
> > > > > -static void nfs_readdir_page_init_array(struct page *page,
> > > > > u64
> > > > > last_cookie)
> > > > > +static void nfs_readdir_page_init_array(struct page *page,
> > > > > u64
> > > > > last_cookie,
> > > > > +                                       u64 
> > > > > change_attr)
> > > > >  {
> > > > >         struct nfs_cache_array *array;
> > > > 
> > > > 
> > > > There's a hunk missing here, something like:
> > > > 
> > > > @@ -185,6 +185,7 @@ static void
> > > > nfs_readdir_page_init_array(struct
> > > > page
> > > > *page, u64 last_cookie,
> > > >          nfs_readdir_array_init(array);
> > > >          array->last_cookie = last_cookie;
> > > >          array->cookies_are_ordered = 1;
> > > > +       array->change_attr = change_attr;
> > > >          kunmap_atomic(array);
> > > >   }
> > > > 
> > > > > 
> > > > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
> > > > > last_cookie,
> > > > > gfp_t gfp_flags)
> > > > >  {
> > > > >         struct page *page = alloc_page(gfp_flags);
> > > > >         if (page)
> > > > > -               nfs_readdir_page_init_array(page, 
> > > > > last_cookie);
> > > > > +               nfs_readdir_page_init_array(page, 
> > > > > last_cookie,
> > > > > 0);
> > > > >         return page;
> > > > >  }
> > > > > 
> > > > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
> > > > > nfs_entry
> > > > > *entry, struct page *page)
> > > > >         return ret;
> > > > >  }
> > > > > 
> > > > > +static bool nfs_readdir_page_cookie_match(struct page *page,
> > > > > u64
> > > > > last_cookie,
> > > > > +                                         
> > > > > u64 change_attr)
> > > > 
> > > > How about "nfs_readdir_page_valid()"?  There's more going on
> > > > than a
> > > > cookie match.
> > > > 
> > > > 
> > > > > +{
> > > > > +       struct nfs_cache_array *array = kmap_atomic(page);
> > > > > +       int ret = true;
> > > > > +
> > > > > +       if (array->change_attr != change_attr)
> > > > > +               ret = false;
> > > > 
> > > > Can we skip the next test if ret = false?
> > > 
> > > I'd expect the compiler to do that.
> > > 
> > > > 
> > > > > +       if (array->size > 0 && array->array[0].cookie !=
> > > > > last_cookie)
> > > > > +               ret = false;
> > > > > +       kunmap_atomic(array);
> > > > > +       return ret;
> > > > > +}
> > > > > +
> > > > > +static void nfs_readdir_page_unlock_and_put(struct page
> > > > > *page)
> > > > > +{
> > > > > +       unlock_page(page);
> > > > > +       put_page(page);
> > > > > +}
> > > > > +
> > > > >  static struct page *nfs_readdir_page_get_locked(struct
> > > > > address_space
> > > > > *mapping,
> > > > >                                                 pgoff_t 
> > > > > index,
> > > > > u64
> > > > > last_cookie)
> > > > >  {
> > > > >         struct page *page;
> > > > > +       u64 change_attr;
> > > > > 
> > > > >         page = grab_cache_page(mapping, index);
> > > > > -       if (page && !PageUptodate(page)) {
> > > > > -               nfs_readdir_page_init_array(page, 
> > > > > last_cookie);
> > > > > -               if 
> > > > > (invalidate_inode_pages2_range(mapping, index
> > > > > +
> > > > > 1, -1) < 0)
> > > > > -                       nfs_zap_mapping(mapping->host, 
> > > > > mapping);
> > > > > -               SetPageUptodate(page);
> > > > > +       if (!page)
> > > > > +               return NULL;
> > > > > +       change_attr = 
> > > > > inode_peek_iversion_raw(mapping->host);
> > > > > +       if (PageUptodate(page)) {
> > > > > +               if 
> > > > > (nfs_readdir_page_cookie_match(page,
> > > > > last_cookie,
> > > > > +                                                 
> > > > > change_attr))
> > > > > +                       return page;
> > > > > +               nfs_readdir_clear_array(page);
> > > > 
> > > > 
> > > > Why use i_version rather than nfs_save_change_attribute?  Seems
> > > > having a
> > > > consistent value across the pachecache and dir_verifiers would
> > > > help
> > > > debugging, and we've already have a bunch of machinery around
> > > > the
> > > > change_attribute.
> > > 
> > > The directory cache_change_attribute is not reported in
> > > tracepoints
> > > because it is a directory-specific field, so it's not as useful
> > > for
> > > debugging.
> > > 
> > > The inode change attribute is what we have traditionally used for
> > > determining cache consistency, and when to invalidate the cache.
> > 
> > I should probably elaborate a little further on the differences 
> > between
> > the inode change attribute and the cache_change_attribute.
> > 
> > One of the main reasons for introducing the latter was to have
> > something that allows us to track changes to the directory, but to
> > avoid forcing unnecessary revalidations of the dcache.
> > 
> > What this means is that when we create or remove a file, and the
> > pre/post-op attributes tell us that there were no third party
> > changes
> > to the directory, we update the dcache, but we do _not_ update the
> > cache_change_attribute, because we know that the rest of the
> > directory
> > contents are valid, and so we don't have to revalidate the
> > dentries.
> > However in that case, we _do_ want to update the readdir cache to
> > reflect the fact that an entry was added or deleted. While we could
> > figure out how to remove an entry (at least for the case where the
> > filesystem is case-sensitive), we do not know where the filesystem
> > added the new file, or what cookies was assigned.
> > 
> > This is why the inode change attribute is more appropriate for 
> > indexing
> > the page cache pages. It reflects the cases where we want to 
> > revalidate
> > the readdir cache, as opposed to the dcache.
> 
> Ok, thanks for explaining this.
> 
> I've noticed that you haven't responded about my concerns about not 
> checking
> the directory for changes with every v4 READDIR.  For v3, we have 
> post-op
> updates to the directory, but with v4 the directory can change and
> we'll
> end up with entries in the cache that are marked with an old 
> change_attr.
> 

Then they will be rejected by nfs_readdir_page_cookie_match() if a user
looks up that page again after we've revalidated the change attribute
on the directory.

...and note that NFSv4 does returns a struct change_info4 for all
operations that change the directory, so we will update the change
attribute in all those cases.

If the change is made on the server, well then we will detect it
through the standard revalidation process that usually decides when to
invalidate the directory page cache.

> I'm pretty positive that not checking for changes to the directory
> (not
> sending GETATTR with READDIR) is going to create cases of double-
> listed 
> and
> truncated-listings for dirctory listers.  Not handling those cases
> means 
> I'm
> going to have some very unhappy customers that complain about their 
> files
> disappearing/reappearing on NFS.
> 
> If you need me to prove that its an issue, I can take the time to
> write 
> up
> program that shows this problem.
> 

If you label the page contents with an attribute that was retrieved
_after_ the READDIR op, then you will introduce this as a problem for
your customers.

The reason is that there is no atomicity between operations in a
COMPOUND. Worse, the implementation of readdir in scalable modern
systems, including Linux, does not even guarantee atomicity of the
readdir operation itself. Instead each readdir entry is filled without
holding any locks or preventing any changes to the directory or to the
object itself.

POSIX states very explicitly that if you're making changes to the
directory after the call to opendir() or rewinddir(), then the
behaviour w.r.t. whether that file appears in the readdir() call is
unspecified. See
https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html

This is also consistent with how glibc caches the results of a
getdents() call.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index
  2022-02-25 12:33                                               ` Benjamin Coddington
@ 2022-02-25 13:11                                                 ` Trond Myklebust
  0 siblings, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 13:11 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs, neilb

On Fri, 2022-02-25 at 07:33 -0500, Benjamin Coddington wrote:
> On 24 Feb 2022, at 23:25, Trond Myklebust wrote:
> 
> > On Fri, 2022-02-25 at 14:17 +1100, NeilBrown wrote:
> 
> > > I haven't looked at the code recently so this might not be 100%
> > > accurate,
> > > but XArray generally assumes that pages are often adjacent.  They
> > > don't
> > > have to be, but there is a cost.  It uses a multi-level array
> > > with 9 bits
> > > per level.  At each level there are a whole number of pages for
> > > indexes
> > > to the next level.
> > > 
> > > If there are two entries, that are 2^45 separate, that is 5
> > > levels of
> > > indexing that cannot be shared.  So the path to one entry is 5
> > > pages,
> > > each of which contains a single pointer.  The path to the other
> > > entry is
> > > a separate set of 5 pages.
> > > 
> > > So worst case, the index would be about 64/9 or 7 times the size
> > > of the
> > > data.  As the number of data pages increases, this would shrink
> > > slightly,
> > > but I suspect you wouldn't get below a factor of 3 before you
> > > fill up all
> > > of your memory.
> 
> Yikes!
> 
> > If the problem is just the range, then that is trivial to fix: we
> > can
> > just use xxhash32(), and take the hit of more collisions. However
> > if
> > the problem is the access pattern, then I have serious questions
> > about
> > the choice of implementation for the page cache. If the cache can't
> > support file random access, then we're barking up the wrong tree on
> > the
> > wrong continent.
> 
> I'm guessing the issue might be "get next", which for an "array" is
> probably
> the operation tested for "perform well".  We're not doing any of
> that, we're
> directly addressing pages with our hashed index.
> 
> > Either way, I see avoiding linear searches for cookies as a benefit
> > that is worth pursuing.
> 
> Me too.  What about just kicking the seekdir users up into the second
> half
> of the index, to use xxhash32() up there.  Everyone else can hang out
> in the
> bottom half with dense indexes and help each other out.
> 
> The vast  majority of readdir() use is going to be short listings
> traversed
> in order.  The memory inflation created by a process that needs to
> walk a
> tree and for every two pages of readdir data require 10 pages of
> indexes
> seems pretty extreme.
> 
> Ben
> 

#define NFS_READDIR_COOKIE_MASK (U32_MAX >> 14)
/*
 * Hash algorithm allowing content addressible access to sequences
 * of directory cookies. Content is addressed by the value of the
 * cookie index of the first readdir entry in a page.
 *
 * The xxhash algorithm is chosen because it is fast, and is supposed
 * to result in a decent flat distribution of hashes.
 *
 * We then select only the first 18 bits to avoid issues with excessive
 * memory use for the page cache XArray. 18 bits should allow the caching
 * of 262144 pages of sequences of readdir entries. Since each page holds
 * 127 readdir entries for a typical 64-bit system, that works out to a
 * cache of ~ 33 million entries per directory.
 */
static pgoff_t nfs_readdir_page_cookie_hash(u64 cookie)
{
       if (cookie == 0)
               return 0;
       return xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
}

So no, this is not a show-stopper.


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 13:10                   ` Trond Myklebust
@ 2022-02-25 13:26                     ` Trond Myklebust
  2022-02-25 14:44                     ` Benjamin Coddington
  1 sibling, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 13:26 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 13:10 +0000, Trond Myklebust wrote:
> On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
> > On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
> > 
> > > On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
> > > > On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> > > > > On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> > > > > 
> > > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > > > 
> > > > > > Use the change attribute and the first cookie in a
> > > > > > directory
> > > > > > page
> > > > > > cache
> > > > > > entry to validate that the page is up to date.
> > > > > > 
> > > > > > Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> > > > > > Signed-off-by: Trond Myklebust
> > > > > > <trond.myklebust@hammerspace.com>
> > > > > > ---
> > > > > >  fs/nfs/dir.c | 68
> > > > > > ++++++++++++++++++++++++++++------------------------
> > > > > >  1 file changed, 37 insertions(+), 31 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > index f2258e926df2..5d9367d9b651 100644
> > > > > > --- a/fs/nfs/dir.c
> > > > > > +++ b/fs/nfs/dir.c
> > > > > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> > > > > >  };
> > > > > > 
> > > > > >  struct nfs_cache_array {
> > > > > > +       u64 change_attr;
> > > > > >         u64 last_cookie;
> > > > > >         unsigned int size;
> > > > > >         unsigned char page_full : 1,
> > > > > > @@ -175,7 +176,8 @@ static void
> > > > > > nfs_readdir_array_init(struct
> > > > > > nfs_cache_array *array)
> > > > > >         memset(array, 0, sizeof(struct nfs_cache_array));
> > > > > >  }
> > > > > > 
> > > > > > -static void nfs_readdir_page_init_array(struct page *page,
> > > > > > u64
> > > > > > last_cookie)
> > > > > > +static void nfs_readdir_page_init_array(struct page *page,
> > > > > > u64
> > > > > > last_cookie,
> > > > > > +                                       u64 
> > > > > > change_attr)
> > > > > >  {
> > > > > >         struct nfs_cache_array *array;
> > > > > 
> > > > > 
> > > > > There's a hunk missing here, something like:
> > > > > 
> > > > > @@ -185,6 +185,7 @@ static void
> > > > > nfs_readdir_page_init_array(struct
> > > > > page
> > > > > *page, u64 last_cookie,
> > > > >          nfs_readdir_array_init(array);
> > > > >          array->last_cookie = last_cookie;
> > > > >          array->cookies_are_ordered = 1;
> > > > > +       array->change_attr = change_attr;
> > > > >          kunmap_atomic(array);
> > > > >   }
> > > > > 
> > > > > > 
> > > > > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
> > > > > > last_cookie,
> > > > > > gfp_t gfp_flags)
> > > > > >  {
> > > > > >         struct page *page = alloc_page(gfp_flags);
> > > > > >         if (page)
> > > > > > -               nfs_readdir_page_init_array(page, 
> > > > > > last_cookie);
> > > > > > +               nfs_readdir_page_init_array(page, 
> > > > > > last_cookie,
> > > > > > 0);
> > > > > >         return page;
> > > > > >  }
> > > > > > 
> > > > > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
> > > > > > nfs_entry
> > > > > > *entry, struct page *page)
> > > > > >         return ret;
> > > > > >  }
> > > > > > 
> > > > > > +static bool nfs_readdir_page_cookie_match(struct page
> > > > > > *page,
> > > > > > u64
> > > > > > last_cookie,
> > > > > > +                                         
> > > > > > u64 change_attr)
> > > > > 
> > > > > How about "nfs_readdir_page_valid()"?  There's more going on
> > > > > than a
> > > > > cookie match.
> > > > > 
> > > > > 
> > > > > > +{
> > > > > > +       struct nfs_cache_array *array = kmap_atomic(page);
> > > > > > +       int ret = true;
> > > > > > +
> > > > > > +       if (array->change_attr != change_attr)
> > > > > > +               ret = false;
> > > > > 
> > > > > Can we skip the next test if ret = false?
> > > > 
> > > > I'd expect the compiler to do that.
> > > > 
> > > > > 
> > > > > > +       if (array->size > 0 && array->array[0].cookie !=
> > > > > > last_cookie)
> > > > > > +               ret = false;
> > > > > > +       kunmap_atomic(array);
> > > > > > +       return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void nfs_readdir_page_unlock_and_put(struct page
> > > > > > *page)
> > > > > > +{
> > > > > > +       unlock_page(page);
> > > > > > +       put_page(page);
> > > > > > +}
> > > > > > +
> > > > > >  static struct page *nfs_readdir_page_get_locked(struct
> > > > > > address_space
> > > > > > *mapping,
> > > > > >                                                 pgoff_t 
> > > > > > index,
> > > > > > u64
> > > > > > last_cookie)
> > > > > >  {
> > > > > >         struct page *page;
> > > > > > +       u64 change_attr;
> > > > > > 
> > > > > >         page = grab_cache_page(mapping, index);
> > > > > > -       if (page && !PageUptodate(page)) {
> > > > > > -               nfs_readdir_page_init_array(page, 
> > > > > > last_cookie);
> > > > > > -               if 
> > > > > > (invalidate_inode_pages2_range(mapping, index
> > > > > > +
> > > > > > 1, -1) < 0)
> > > > > > -                       nfs_zap_mapping(mapping->host, 
> > > > > > mapping);
> > > > > > -               SetPageUptodate(page);
> > > > > > +       if (!page)
> > > > > > +               return NULL;
> > > > > > +       change_attr = 
> > > > > > inode_peek_iversion_raw(mapping->host);
> > > > > > +       if (PageUptodate(page)) {
> > > > > > +               if 
> > > > > > (nfs_readdir_page_cookie_match(page,
> > > > > > last_cookie,
> > > > > > +                                                 
> > > > > > change_attr))
> > > > > > +                       return page;
> > > > > > +               nfs_readdir_clear_array(page);
> > > > > 
> > > > > 
> > > > > Why use i_version rather than nfs_save_change_attribute? 
> > > > > Seems
> > > > > having a
> > > > > consistent value across the pachecache and dir_verifiers
> > > > > would
> > > > > help
> > > > > debugging, and we've already have a bunch of machinery around
> > > > > the
> > > > > change_attribute.
> > > > 
> > > > The directory cache_change_attribute is not reported in
> > > > tracepoints
> > > > because it is a directory-specific field, so it's not as useful
> > > > for
> > > > debugging.
> > > > 
> > > > The inode change attribute is what we have traditionally used
> > > > for
> > > > determining cache consistency, and when to invalidate the
> > > > cache.
> > > 
> > > I should probably elaborate a little further on the differences 
> > > between
> > > the inode change attribute and the cache_change_attribute.
> > > 
> > > One of the main reasons for introducing the latter was to have
> > > something that allows us to track changes to the directory, but
> > > to
> > > avoid forcing unnecessary revalidations of the dcache.
> > > 
> > > What this means is that when we create or remove a file, and the
> > > pre/post-op attributes tell us that there were no third party
> > > changes
> > > to the directory, we update the dcache, but we do _not_ update
> > > the
> > > cache_change_attribute, because we know that the rest of the
> > > directory
> > > contents are valid, and so we don't have to revalidate the
> > > dentries.
> > > However in that case, we _do_ want to update the readdir cache to
> > > reflect the fact that an entry was added or deleted. While we
> > > could
> > > figure out how to remove an entry (at least for the case where
> > > the
> > > filesystem is case-sensitive), we do not know where the
> > > filesystem
> > > added the new file, or what cookies was assigned.
> > > 
> > > This is why the inode change attribute is more appropriate for 
> > > indexing
> > > the page cache pages. It reflects the cases where we want to 
> > > revalidate
> > > the readdir cache, as opposed to the dcache.
> > 
> > Ok, thanks for explaining this.
> > 
> > I've noticed that you haven't responded about my concerns about not
> > checking
> > the directory for changes with every v4 READDIR.  For v3, we have 
> > post-op
> > updates to the directory, but with v4 the directory can change and
> > we'll
> > end up with entries in the cache that are marked with an old 
> > change_attr.
> > 
> 
> Then they will be rejected by nfs_readdir_page_cookie_match() if a
> user
> looks up that page again after we've revalidated the change attribute
> on the directory.
> 
> ...and note that NFSv4 does returns a struct change_info4 for all
> operations that change the directory, so we will update the change
> attribute in all those cases.
> 
> If the change is made on the server, well then we will detect it
> through the standard revalidation process that usually decides when
> to
> invalidate the directory page cache.
> 
> > I'm pretty positive that not checking for changes to the directory
> > (not
> > sending GETATTR with READDIR) is going to create cases of double-
> > listed 
> > and
> > truncated-listings for dirctory listers.  Not handling those cases
> > means 
> > I'm
> > going to have some very unhappy customers that complain about their
> > files
> > disappearing/reappearing on NFS.
> > 
> > If you need me to prove that its an issue, I can take the time to
> > write 
> > up
> > program that shows this problem.
> > 
> 
> If you label the page contents with an attribute that was retrieved
> _after_ the READDIR op, then you will introduce this as a problem for
> your customers.
> 
> The reason is that there is no atomicity between operations in a
> COMPOUND. Worse, the implementation of readdir in scalable modern
> systems, including Linux, does not even guarantee atomicity of the
> readdir operation itself. Instead each readdir entry is filled
> without
> holding any locks or preventing any changes to the directory or to
> the
> object itself.
> 
> POSIX states very explicitly that if you're making changes to the
> directory after the call to opendir() or rewinddir(), then the
> behaviour w.r.t. whether that file appears in the readdir() call is
> unspecified. See
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
> 
> This is also consistent with how glibc caches the results of a
> getdents() call.
> 

Ah, wait a minute...

There is a problem with the call to nfs_readdir_page_get_next(). It
will allocate the page _after_ the readdir call itself, and so might
label it with a newer change attribute... I'll fix that so we can pass
in the change attribute associated with the readdir call.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 13:10                   ` Trond Myklebust
  2022-02-25 13:26                     ` Trond Myklebust
@ 2022-02-25 14:44                     ` Benjamin Coddington
  2022-02-25 15:18                       ` Trond Myklebust
  1 sibling, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 14:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 25 Feb 2022, at 8:10, Trond Myklebust wrote:

> On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
>> On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
>>
>>> On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
>>>> On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
>>>>> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
>>>>>
>>>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>>>>
>>>>>> Use the change attribute and the first cookie in a directory
>>>>>> page
>>>>>> cache
>>>>>> entry to validate that the page is up to date.
>>>>>>
>>>>>> Suggested-by: Benjamin Coddington <bcodding@redhat.com>
>>>>>> Signed-off-by: Trond Myklebust
>>>>>> <trond.myklebust@hammerspace.com>
>>>>>> ---
>>>>>>  fs/nfs/dir.c | 68
>>>>>> ++++++++++++++++++++++++++++------------------------
>>>>>>  1 file changed, 37 insertions(+), 31 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>> index f2258e926df2..5d9367d9b651 100644
>>>>>> --- a/fs/nfs/dir.c
>>>>>> +++ b/fs/nfs/dir.c
>>>>>> @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
>>>>>>  };
>>>>>>
>>>>>>  struct nfs_cache_array {
>>>>>> +       u64 change_attr;
>>>>>>         u64 last_cookie;
>>>>>>         unsigned int size;
>>>>>>         unsigned char page_full : 1,
>>>>>> @@ -175,7 +176,8 @@ static void nfs_readdir_array_init(struct
>>>>>> nfs_cache_array *array)
>>>>>>         memset(array, 0, sizeof(struct nfs_cache_array));
>>>>>>  }
>>>>>>
>>>>>> -static void nfs_readdir_page_init_array(struct page *page,
>>>>>> u64
>>>>>> last_cookie)
>>>>>> +static void nfs_readdir_page_init_array(struct page *page,
>>>>>> u64
>>>>>> last_cookie,
>>>>>> +                                       u64
>>>>>> change_attr)
>>>>>>  {
>>>>>>         struct nfs_cache_array *array;
>>>>>
>>>>>
>>>>> There's a hunk missing here, something like:
>>>>>
>>>>> @@ -185,6 +185,7 @@ static void
>>>>> nfs_readdir_page_init_array(struct
>>>>> page
>>>>> *page, u64 last_cookie,
>>>>>          nfs_readdir_array_init(array);
>>>>>          array->last_cookie = last_cookie;
>>>>>          array->cookies_are_ordered = 1;
>>>>> +       array->change_attr = change_attr;
>>>>>          kunmap_atomic(array);
>>>>>   }
>>>>>
>>>>>>
>>>>>> @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
>>>>>> last_cookie,
>>>>>> gfp_t gfp_flags)
>>>>>>  {
>>>>>>         struct page *page = alloc_page(gfp_flags);
>>>>>>         if (page)
>>>>>> -               nfs_readdir_page_init_array(page,
>>>>>> last_cookie);
>>>>>> +               nfs_readdir_page_init_array(page,
>>>>>> last_cookie,
>>>>>> 0);
>>>>>>         return page;
>>>>>>  }
>>>>>>
>>>>>> @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
>>>>>> nfs_entry
>>>>>> *entry, struct page *page)
>>>>>>         return ret;
>>>>>>  }
>>>>>>
>>>>>> +static bool nfs_readdir_page_cookie_match(struct page *page,
>>>>>> u64
>>>>>> last_cookie,
>>>>>> +                                        
>>>>>> u64 change_attr)
>>>>>
>>>>> How about "nfs_readdir_page_valid()"?  There's more going on
>>>>> than a
>>>>> cookie match.
>>>>>
>>>>>
>>>>>> +{
>>>>>> +       struct nfs_cache_array *array = kmap_atomic(page);
>>>>>> +       int ret = true;
>>>>>> +
>>>>>> +       if (array->change_attr != change_attr)
>>>>>> +               ret = false;
>>>>>
>>>>> Can we skip the next test if ret = false?
>>>>
>>>> I'd expect the compiler to do that.
>>>>
>>>>>
>>>>>> +       if (array->size > 0 && array->array[0].cookie !=
>>>>>> last_cookie)
>>>>>> +               ret = false;
>>>>>> +       kunmap_atomic(array);
>>>>>> +       return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static void nfs_readdir_page_unlock_and_put(struct page
>>>>>> *page)
>>>>>> +{
>>>>>> +       unlock_page(page);
>>>>>> +       put_page(page);
>>>>>> +}
>>>>>> +
>>>>>>  static struct page *nfs_readdir_page_get_locked(struct
>>>>>> address_space
>>>>>> *mapping,
>>>>>>                                                 pgoff_t
>>>>>> index,
>>>>>> u64
>>>>>> last_cookie)
>>>>>>  {
>>>>>>         struct page *page;
>>>>>> +       u64 change_attr;
>>>>>>
>>>>>>         page = grab_cache_page(mapping, index);
>>>>>> -       if (page && !PageUptodate(page)) {
>>>>>> -               nfs_readdir_page_init_array(page,
>>>>>> last_cookie);
>>>>>> -               if
>>>>>> (invalidate_inode_pages2_range(mapping, index
>>>>>> +
>>>>>> 1, -1) < 0)
>>>>>> -                       nfs_zap_mapping(mapping->host,
>>>>>> mapping);
>>>>>> -               SetPageUptodate(page);
>>>>>> +       if (!page)
>>>>>> +               return NULL;
>>>>>> +       change_attr =
>>>>>> inode_peek_iversion_raw(mapping->host);
>>>>>> +       if (PageUptodate(page)) {
>>>>>> +               if
>>>>>> (nfs_readdir_page_cookie_match(page,
>>>>>> last_cookie,
>>>>>> +                                                
>>>>>> change_attr))
>>>>>> +                       return page;
>>>>>> +               nfs_readdir_clear_array(page);
>>>>>
>>>>>
>>>>> Why use i_version rather than nfs_save_change_attribute?  Seems
>>>>> having a
>>>>> consistent value across the pachecache and dir_verifiers would
>>>>> help
>>>>> debugging, and we've already have a bunch of machinery around
>>>>> the
>>>>> change_attribute.
>>>>
>>>> The directory cache_change_attribute is not reported in
>>>> tracepoints
>>>> because it is a directory-specific field, so it's not as useful
>>>> for
>>>> debugging.
>>>>
>>>> The inode change attribute is what we have traditionally used for
>>>> determining cache consistency, and when to invalidate the cache.
>>>
>>> I should probably elaborate a little further on the differences
>>> between
>>> the inode change attribute and the cache_change_attribute.
>>>
>>> One of the main reasons for introducing the latter was to have
>>> something that allows us to track changes to the directory, but to
>>> avoid forcing unnecessary revalidations of the dcache.
>>>
>>> What this means is that when we create or remove a file, and the
>>> pre/post-op attributes tell us that there were no third party
>>> changes
>>> to the directory, we update the dcache, but we do _not_ update the
>>> cache_change_attribute, because we know that the rest of the
>>> directory
>>> contents are valid, and so we don't have to revalidate the
>>> dentries.
>>> However in that case, we _do_ want to update the readdir cache to
>>> reflect the fact that an entry was added or deleted. While we could
>>> figure out how to remove an entry (at least for the case where the
>>> filesystem is case-sensitive), we do not know where the filesystem
>>> added the new file, or what cookies was assigned.
>>>
>>> This is why the inode change attribute is more appropriate for
>>> indexing
>>> the page cache pages. It reflects the cases where we want to
>>> revalidate
>>> the readdir cache, as opposed to the dcache.
>>
>> Ok, thanks for explaining this.
>>
>> I've noticed that you haven't responded about my concerns about not
>> checking
>> the directory for changes with every v4 READDIR.  For v3, we have
>> post-op
>> updates to the directory, but with v4 the directory can change and
>> we'll
>> end up with entries in the cache that are marked with an old
>> change_attr.
>>
>
> Then they will be rejected by nfs_readdir_page_cookie_match() if a 
> user
> looks up that page again after we've revalidated the change attribute
> on the directory.
>
> ...and note that NFSv4 does returns a struct change_info4 for all
> operations that change the directory, so we will update the change
> attribute in all those cases.

I'm not worried about changes from the same client.

> If the change is made on the server, well then we will detect it
> through the standard revalidation process that usually decides when to
> invalidate the directory page cache.

The environments I'm concerned about are setup very frequently: they 
look
like multiple NFS clients co-ordinating on a directory with millions of
files.  Some clients are adding files as they do work, other clients are
then looking for those files by walking the directory entries to 
validate
their existence.  The systems that do this have a "very bad time" if 
some
of them produce listings that are _dramatically_ and transiently 
different
from a listing they produced before.

That can happen really easily with what we've got here, and it can 
create a
huge problem for these setups.  And it won't be easily reproduceable, 
and it
will be hard to find.  It will cost everyone involved a lot of time and
effort to track down, and we can fix it easily.

>> I'm pretty positive that not checking for changes to the directory
>> (not
>> sending GETATTR with READDIR) is going to create cases of double-
>> listed
>> and
>> truncated-listings for dirctory listers.  Not handling those cases
>> means
>> I'm
>> going to have some very unhappy customers that complain about their
>> files
>> disappearing/reappearing on NFS.
>>
>> If you need me to prove that its an issue, I can take the time to
>> write
>> up
>> program that shows this problem.
>>
>
> If you label the page contents with an attribute that was retrieved
> _after_ the READDIR op, then you will introduce this as a problem for
> your customers.

No the problem is already here, we're not introducing it.  By labeling 
the
page contents with every call we're shifting the race window from the 
client
where it's a very large window to the server where the window is small.

Its still possible, but *much* less likely.

> The reason is that there is no atomicity between operations in a
> COMPOUND. Worse, the implementation of readdir in scalable modern
> systems, including Linux, does not even guarantee atomicity of the
> readdir operation itself. Instead each readdir entry is filled without
> holding any locks or preventing any changes to the directory or to the
> object itself.

I understand all this, but its not a reason to make the problem worse.

> POSIX states very explicitly that if you're making changes to the
> directory after the call to opendir() or rewinddir(), then the
> behaviour w.r.t. whether that file appears in the readdir() call is
> unspecified. See
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html

Yes, but again - just because the problem exists doesn't give us reason 
to
amplify it when we can easily make a better choice for almost no cost.

Here are my reasons for wanting the GETATTR added:
  - it makes it *much* less likely for this problem to occur, with the 
minor
    downside of decreased caching for unstable directories.
  - it makes v3 and v4 readdir pagecache behavior consistent WRT 
changing
    directories.

I spent a non-trivial amount of time working on this problem, and saw 
this
exact issue appear.  Its definitely something that's going to come back 
and
bite us if we don't fix it.

How can I convince you?  I've offered to produce a working example of 
this
problem.  Will you review those results?  If I cannot convince you, I 
feel
I'll have to pursue distro-specific changes for this work.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 14:44                     ` Benjamin Coddington
@ 2022-02-25 15:18                       ` Trond Myklebust
  2022-02-25 15:34                         ` Benjamin Coddington
  0 siblings, 1 reply; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 15:18 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 09:44 -0500, Benjamin Coddington wrote:
> On 25 Feb 2022, at 8:10, Trond Myklebust wrote:
> 
> > On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
> > > On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
> > > 
> > > > On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
> > > > > On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
> > > > > > On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
> > > > > > 
> > > > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > > > > 
> > > > > > > Use the change attribute and the first cookie in a
> > > > > > > directory
> > > > > > > page
> > > > > > > cache
> > > > > > > entry to validate that the page is up to date.
> > > > > > > 
> > > > > > > Suggested-by: Benjamin Coddington <bcodding@redhat.com>
> > > > > > > Signed-off-by: Trond Myklebust
> > > > > > > <trond.myklebust@hammerspace.com>
> > > > > > > ---
> > > > > > >  fs/nfs/dir.c | 68
> > > > > > > ++++++++++++++++++++++++++++------------------------
> > > > > > >  1 file changed, 37 insertions(+), 31 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> > > > > > > index f2258e926df2..5d9367d9b651 100644
> > > > > > > --- a/fs/nfs/dir.c
> > > > > > > +++ b/fs/nfs/dir.c
> > > > > > > @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
> > > > > > >  };
> > > > > > > 
> > > > > > >  struct nfs_cache_array {
> > > > > > > +       u64 change_attr;
> > > > > > >         u64 last_cookie;
> > > > > > >         unsigned int size;
> > > > > > >         unsigned char page_full : 1,
> > > > > > > @@ -175,7 +176,8 @@ static void
> > > > > > > nfs_readdir_array_init(struct
> > > > > > > nfs_cache_array *array)
> > > > > > >         memset(array, 0, sizeof(struct nfs_cache_array));
> > > > > > >  }
> > > > > > > 
> > > > > > > -static void nfs_readdir_page_init_array(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie)
> > > > > > > +static void nfs_readdir_page_init_array(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie,
> > > > > > > +                                       u64
> > > > > > > change_attr)
> > > > > > >  {
> > > > > > >         struct nfs_cache_array *array;
> > > > > > 
> > > > > > 
> > > > > > There's a hunk missing here, something like:
> > > > > > 
> > > > > > @@ -185,6 +185,7 @@ static void
> > > > > > nfs_readdir_page_init_array(struct
> > > > > > page
> > > > > > *page, u64 last_cookie,
> > > > > >          nfs_readdir_array_init(array);
> > > > > >          array->last_cookie = last_cookie;
> > > > > >          array->cookies_are_ordered = 1;
> > > > > > +       array->change_attr = change_attr;
> > > > > >          kunmap_atomic(array);
> > > > > >   }
> > > > > > 
> > > > > > > 
> > > > > > > @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
> > > > > > > last_cookie,
> > > > > > > gfp_t gfp_flags)
> > > > > > >  {
> > > > > > >         struct page *page = alloc_page(gfp_flags);
> > > > > > >         if (page)
> > > > > > > -               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie);
> > > > > > > +               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie,
> > > > > > > 0);
> > > > > > >         return page;
> > > > > > >  }
> > > > > > > 
> > > > > > > @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
> > > > > > > nfs_entry
> > > > > > > *entry, struct page *page)
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > > 
> > > > > > > +static bool nfs_readdir_page_cookie_match(struct page
> > > > > > > *page,
> > > > > > > u64
> > > > > > > last_cookie,
> > > > > > > +                                        
> > > > > > > u64 change_attr)
> > > > > > 
> > > > > > How about "nfs_readdir_page_valid()"?  There's more going
> > > > > > on
> > > > > > than a
> > > > > > cookie match.
> > > > > > 
> > > > > > 
> > > > > > > +{
> > > > > > > +       struct nfs_cache_array *array =
> > > > > > > kmap_atomic(page);
> > > > > > > +       int ret = true;
> > > > > > > +
> > > > > > > +       if (array->change_attr != change_attr)
> > > > > > > +               ret = false;
> > > > > > 
> > > > > > Can we skip the next test if ret = false?
> > > > > 
> > > > > I'd expect the compiler to do that.
> > > > > 
> > > > > > 
> > > > > > > +       if (array->size > 0 && array->array[0].cookie !=
> > > > > > > last_cookie)
> > > > > > > +               ret = false;
> > > > > > > +       kunmap_atomic(array);
> > > > > > > +       return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void nfs_readdir_page_unlock_and_put(struct page
> > > > > > > *page)
> > > > > > > +{
> > > > > > > +       unlock_page(page);
> > > > > > > +       put_page(page);
> > > > > > > +}
> > > > > > > +
> > > > > > >  static struct page *nfs_readdir_page_get_locked(struct
> > > > > > > address_space
> > > > > > > *mapping,
> > > > > > >                                                 pgoff_t
> > > > > > > index,
> > > > > > > u64
> > > > > > > last_cookie)
> > > > > > >  {
> > > > > > >         struct page *page;
> > > > > > > +       u64 change_attr;
> > > > > > > 
> > > > > > >         page = grab_cache_page(mapping, index);
> > > > > > > -       if (page && !PageUptodate(page)) {
> > > > > > > -               nfs_readdir_page_init_array(page,
> > > > > > > last_cookie);
> > > > > > > -               if
> > > > > > > (invalidate_inode_pages2_range(mapping, index
> > > > > > > +
> > > > > > > 1, -1) < 0)
> > > > > > > -                       nfs_zap_mapping(mapping->host,
> > > > > > > mapping);
> > > > > > > -               SetPageUptodate(page);
> > > > > > > +       if (!page)
> > > > > > > +               return NULL;
> > > > > > > +       change_attr =
> > > > > > > inode_peek_iversion_raw(mapping->host);
> > > > > > > +       if (PageUptodate(page)) {
> > > > > > > +               if
> > > > > > > (nfs_readdir_page_cookie_match(page,
> > > > > > > last_cookie,
> > > > > > > +                                                
> > > > > > > change_attr))
> > > > > > > +                       return page;
> > > > > > > +               nfs_readdir_clear_array(page);
> > > > > > 
> > > > > > 
> > > > > > Why use i_version rather than nfs_save_change_attribute? 
> > > > > > Seems
> > > > > > having a
> > > > > > consistent value across the pachecache and dir_verifiers
> > > > > > would
> > > > > > help
> > > > > > debugging, and we've already have a bunch of machinery
> > > > > > around
> > > > > > the
> > > > > > change_attribute.
> > > > > 
> > > > > The directory cache_change_attribute is not reported in
> > > > > tracepoints
> > > > > because it is a directory-specific field, so it's not as
> > > > > useful
> > > > > for
> > > > > debugging.
> > > > > 
> > > > > The inode change attribute is what we have traditionally used
> > > > > for
> > > > > determining cache consistency, and when to invalidate the
> > > > > cache.
> > > > 
> > > > I should probably elaborate a little further on the differences
> > > > between
> > > > the inode change attribute and the cache_change_attribute.
> > > > 
> > > > One of the main reasons for introducing the latter was to have
> > > > something that allows us to track changes to the directory, but
> > > > to
> > > > avoid forcing unnecessary revalidations of the dcache.
> > > > 
> > > > What this means is that when we create or remove a file, and
> > > > the
> > > > pre/post-op attributes tell us that there were no third party
> > > > changes
> > > > to the directory, we update the dcache, but we do _not_ update
> > > > the
> > > > cache_change_attribute, because we know that the rest of the
> > > > directory
> > > > contents are valid, and so we don't have to revalidate the
> > > > dentries.
> > > > However in that case, we _do_ want to update the readdir cache
> > > > to
> > > > reflect the fact that an entry was added or deleted. While we
> > > > could
> > > > figure out how to remove an entry (at least for the case where
> > > > the
> > > > filesystem is case-sensitive), we do not know where the
> > > > filesystem
> > > > added the new file, or what cookies was assigned.
> > > > 
> > > > This is why the inode change attribute is more appropriate for
> > > > indexing
> > > > the page cache pages. It reflects the cases where we want to
> > > > revalidate
> > > > the readdir cache, as opposed to the dcache.
> > > 
> > > Ok, thanks for explaining this.
> > > 
> > > I've noticed that you haven't responded about my concerns about
> > > not
> > > checking
> > > the directory for changes with every v4 READDIR.  For v3, we have
> > > post-op
> > > updates to the directory, but with v4 the directory can change
> > > and
> > > we'll
> > > end up with entries in the cache that are marked with an old
> > > change_attr.
> > > 
> > 
> > Then they will be rejected by nfs_readdir_page_cookie_match() if a 
> > user
> > looks up that page again after we've revalidated the change
> > attribute
> > on the directory.
> > 
> > ...and note that NFSv4 does returns a struct change_info4 for all
> > operations that change the directory, so we will update the change
> > attribute in all those cases.
> 
> I'm not worried about changes from the same client.
> 
> > If the change is made on the server, well then we will detect it
> > through the standard revalidation process that usually decides when
> > to
> > invalidate the directory page cache.
> 
> The environments I'm concerned about are setup very frequently: they 
> look
> like multiple NFS clients co-ordinating on a directory with millions
> of
> files.  Some clients are adding files as they do work, other clients
> are
> then looking for those files by walking the directory entries to 
> validate
> their existence.  The systems that do this have a "very bad time" if 
> some
> of them produce listings that are _dramatically_ and transiently 
> different
> from a listing they produced before.
> 
> That can happen really easily with what we've got here, and it can 
> create a
> huge problem for these setups.  And it won't be easily reproduceable,
> and it
> will be hard to find.  It will cost everyone involved a lot of time
> and
> effort to track down, and we can fix it easily.
> 
> > > I'm pretty positive that not checking for changes to the
> > > directory
> > > (not
> > > sending GETATTR with READDIR) is going to create cases of double-
> > > listed
> > > and
> > > truncated-listings for dirctory listers.  Not handling those
> > > cases
> > > means
> > > I'm
> > > going to have some very unhappy customers that complain about
> > > their
> > > files
> > > disappearing/reappearing on NFS.
> > > 
> > > If you need me to prove that its an issue, I can take the time to
> > > write
> > > up
> > > program that shows this problem.
> > > 
> > 
> > If you label the page contents with an attribute that was retrieved
> > _after_ the READDIR op, then you will introduce this as a problem
> > for
> > your customers.
> 
> No the problem is already here, we're not introducing it.  By
> labeling 
> the
> page contents with every call we're shifting the race window from the
> client
> where it's a very large window to the server where the window is
> small.
> 
> Its still possible, but *much* less likely.
> 
> > The reason is that there is no atomicity between operations in a
> > COMPOUND. Worse, the implementation of readdir in scalable modern
> > systems, including Linux, does not even guarantee atomicity of the
> > readdir operation itself. Instead each readdir entry is filled
> > without
> > holding any locks or preventing any changes to the directory or to
> > the
> > object itself.
> 
> I understand all this, but its not a reason to make the problem
> worse.
> 
> > POSIX states very explicitly that if you're making changes to the
> > directory after the call to opendir() or rewinddir(), then the
> > behaviour w.r.t. whether that file appears in the readdir() call is
> > unspecified. See
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
> 
> Yes, but again - just because the problem exists doesn't give us
> reason 
> to
> amplify it when we can easily make a better choice for almost no
> cost.
> 
> Here are my reasons for wanting the GETATTR added:
>   - it makes it *much* less likely for this problem to occur, with
> the 
> minor
>     downside of decreased caching for unstable directories.
>   - it makes v3 and v4 readdir pagecache behavior consistent WRT 
> changing
>     directories.
> 
> I spent a non-trivial amount of time working on this problem, and saw
> this
> exact issue appear.  Its definitely something that's going to come
> back 
> and
> bite us if we don't fix it.
> 
> How can I convince you?  I've offered to produce a working example of
> this
> problem.  Will you review those results?  If I cannot convince you, I
> feel
> I'll have to pursue distro-specific changes for this work.

Ben, the main cause of this kind of issue in the current code is the
following line:

        /*
         * ctx->pos points to the dirent entry number.
         * *desc->dir_cookie has the cookie for the next entry. We have
         * to either find the entry with the appropriate number or
         * revalidate the cookie.
         */
        if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
                res = nfs_revalidate_mapping(inode, file->f_mapping);
                if (res < 0)
                        goto out;
        }


That line protects the page cache against changes aften opendir(). It
was introduced by Red Hat in commmit 07b5ce8ef2d8 in order to fix a
claim of a severe performance problem.

These patches _remove_ that protection, because we're now able to cope
with more frequent revalidation without needing to restart directory
reads from scratch.

So no. Without further proof, I don't accept your claim that this
patchset introduces a regression. I don't accept your claim that we are
required to revalidate the change attribute on every readdir call. We
can't do that for NFSv2 or NFSv3 (the latter offers a post_op
attribute, not pre-op attribute) and as I already pointed out, there is
nothing in POSIX that requires this.

If you want to fork the Red Hat kernel over it, then that's your
decision.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 15:18                       ` Trond Myklebust
@ 2022-02-25 15:34                         ` Benjamin Coddington
  2022-02-25 20:23                           ` Benjamin Coddington
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 15:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 25 Feb 2022, at 10:18, Trond Myklebust wrote:

> On Fri, 2022-02-25 at 09:44 -0500, Benjamin Coddington wrote:
>> On 25 Feb 2022, at 8:10, Trond Myklebust wrote:
>>
>>> On Fri, 2022-02-25 at 06:38 -0500, Benjamin Coddington wrote:
>>>> On 24 Feb 2022, at 22:51, Trond Myklebust wrote:
>>>>
>>>>> On Fri, 2022-02-25 at 02:26 +0000, Trond Myklebust wrote:
>>>>>> On Thu, 2022-02-24 at 09:53 -0500, Benjamin Coddington wrote:
>>>>>>> On 23 Feb 2022, at 16:12, trondmy@kernel.org wrote:
>>>>>>>
>>>>>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>>>>>>
>>>>>>>> Use the change attribute and the first cookie in a
>>>>>>>> directory
>>>>>>>> page
>>>>>>>> cache
>>>>>>>> entry to validate that the page is up to date.
>>>>>>>>
>>>>>>>> Suggested-by: Benjamin Coddington <bcodding@redhat.com>
>>>>>>>> Signed-off-by: Trond Myklebust
>>>>>>>> <trond.myklebust@hammerspace.com>
>>>>>>>> ---
>>>>>>>>  fs/nfs/dir.c | 68
>>>>>>>> ++++++++++++++++++++++++++++------------------------
>>>>>>>>  1 file changed, 37 insertions(+), 31 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>>>> index f2258e926df2..5d9367d9b651 100644
>>>>>>>> --- a/fs/nfs/dir.c
>>>>>>>> +++ b/fs/nfs/dir.c
>>>>>>>> @@ -139,6 +139,7 @@ struct nfs_cache_array_entry {
>>>>>>>>  };
>>>>>>>>
>>>>>>>>  struct nfs_cache_array {
>>>>>>>> +       u64 change_attr;
>>>>>>>>         u64 last_cookie;
>>>>>>>>         unsigned int size;
>>>>>>>>         unsigned char page_full : 1,
>>>>>>>> @@ -175,7 +176,8 @@ static void
>>>>>>>> nfs_readdir_array_init(struct
>>>>>>>> nfs_cache_array *array)
>>>>>>>>         memset(array, 0, sizeof(struct 
>>>>>>>> nfs_cache_array));
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -static void nfs_readdir_page_init_array(struct page
>>>>>>>> *page,
>>>>>>>> u64
>>>>>>>> last_cookie)
>>>>>>>> +static void nfs_readdir_page_init_array(struct page
>>>>>>>> *page,
>>>>>>>> u64
>>>>>>>> last_cookie,
>>>>>>>> +                                       u64
>>>>>>>> change_attr)
>>>>>>>>  {
>>>>>>>>         struct nfs_cache_array *array;
>>>>>>>
>>>>>>>
>>>>>>> There's a hunk missing here, something like:
>>>>>>>
>>>>>>> @@ -185,6 +185,7 @@ static void
>>>>>>> nfs_readdir_page_init_array(struct
>>>>>>> page
>>>>>>> *page, u64 last_cookie,
>>>>>>>          nfs_readdir_array_init(array);
>>>>>>>          array->last_cookie = last_cookie;
>>>>>>>          array->cookies_are_ordered = 1;
>>>>>>> +       array->change_attr = change_attr;
>>>>>>>          kunmap_atomic(array);
>>>>>>>   }
>>>>>>>
>>>>>>>>
>>>>>>>> @@ -207,7 +209,7 @@ nfs_readdir_page_array_alloc(u64
>>>>>>>> last_cookie,
>>>>>>>> gfp_t gfp_flags)
>>>>>>>>  {
>>>>>>>>         struct page *page = alloc_page(gfp_flags);
>>>>>>>>         if (page)
>>>>>>>> -               nfs_readdir_page_init_array(page,
>>>>>>>> last_cookie);
>>>>>>>> +               nfs_readdir_page_init_array(page,
>>>>>>>> last_cookie,
>>>>>>>> 0);
>>>>>>>>         return page;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> @@ -304,19 +306,44 @@ int nfs_readdir_add_to_array(struct
>>>>>>>> nfs_entry
>>>>>>>> *entry, struct page *page)
>>>>>>>>         return ret;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static bool nfs_readdir_page_cookie_match(struct page
>>>>>>>> *page,
>>>>>>>> u64
>>>>>>>> last_cookie,
>>>>>>>> +                                        
>>>>>>>> u64 change_attr)
>>>>>>>
>>>>>>> How about "nfs_readdir_page_valid()"?  There's more going
>>>>>>> on
>>>>>>> than a
>>>>>>> cookie match.
>>>>>>>
>>>>>>>
>>>>>>>> +{
>>>>>>>> +       struct nfs_cache_array *array =
>>>>>>>> kmap_atomic(page);
>>>>>>>> +       int ret = true;
>>>>>>>> +
>>>>>>>> +       if (array->change_attr != change_attr)
>>>>>>>> +               ret = false;
>>>>>>>
>>>>>>> Can we skip the next test if ret = false?
>>>>>>
>>>>>> I'd expect the compiler to do that.
>>>>>>
>>>>>>>
>>>>>>>> +       if (array->size > 0 && array->array[0].cookie !=
>>>>>>>> last_cookie)
>>>>>>>> +               ret = false;
>>>>>>>> +       kunmap_atomic(array);
>>>>>>>> +       return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void nfs_readdir_page_unlock_and_put(struct page
>>>>>>>> *page)
>>>>>>>> +{
>>>>>>>> +       unlock_page(page);
>>>>>>>> +       put_page(page);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static struct page *nfs_readdir_page_get_locked(struct
>>>>>>>> address_space
>>>>>>>> *mapping,
>>>>>>>>                                                 pgoff_t
>>>>>>>> index,
>>>>>>>> u64
>>>>>>>> last_cookie)
>>>>>>>>  {
>>>>>>>>         struct page *page;
>>>>>>>> +       u64 change_attr;
>>>>>>>>
>>>>>>>>         page = grab_cache_page(mapping, index);
>>>>>>>> -       if (page && !PageUptodate(page)) {
>>>>>>>> -               nfs_readdir_page_init_array(page,
>>>>>>>> last_cookie);
>>>>>>>> -               if
>>>>>>>> (invalidate_inode_pages2_range(mapping, index
>>>>>>>> +
>>>>>>>> 1, -1) < 0)
>>>>>>>> -                       nfs_zap_mapping(mapping->host,
>>>>>>>> mapping);
>>>>>>>> -               SetPageUptodate(page);
>>>>>>>> +       if (!page)
>>>>>>>> +               return NULL;
>>>>>>>> +       change_attr =
>>>>>>>> inode_peek_iversion_raw(mapping->host);
>>>>>>>> +       if (PageUptodate(page)) {
>>>>>>>> +               if
>>>>>>>> (nfs_readdir_page_cookie_match(page,
>>>>>>>> last_cookie,
>>>>>>>> +                                                
>>>>>>>> change_attr))
>>>>>>>> +                       return page;
>>>>>>>> +               nfs_readdir_clear_array(page);
>>>>>>>
>>>>>>>
>>>>>>> Why use i_version rather than nfs_save_change_attribute? 
>>>>>>> Seems
>>>>>>> having a
>>>>>>> consistent value across the pachecache and dir_verifiers
>>>>>>> would
>>>>>>> help
>>>>>>> debugging, and we've already have a bunch of machinery
>>>>>>> around
>>>>>>> the
>>>>>>> change_attribute.
>>>>>>
>>>>>> The directory cache_change_attribute is not reported in
>>>>>> tracepoints
>>>>>> because it is a directory-specific field, so it's not as
>>>>>> useful
>>>>>> for
>>>>>> debugging.
>>>>>>
>>>>>> The inode change attribute is what we have traditionally used
>>>>>> for
>>>>>> determining cache consistency, and when to invalidate the
>>>>>> cache.
>>>>>
>>>>> I should probably elaborate a little further on the differences
>>>>> between
>>>>> the inode change attribute and the cache_change_attribute.
>>>>>
>>>>> One of the main reasons for introducing the latter was to have
>>>>> something that allows us to track changes to the directory, but
>>>>> to
>>>>> avoid forcing unnecessary revalidations of the dcache.
>>>>>
>>>>> What this means is that when we create or remove a file, and
>>>>> the
>>>>> pre/post-op attributes tell us that there were no third party
>>>>> changes
>>>>> to the directory, we update the dcache, but we do _not_ update
>>>>> the
>>>>> cache_change_attribute, because we know that the rest of the
>>>>> directory
>>>>> contents are valid, and so we don't have to revalidate the
>>>>> dentries.
>>>>> However in that case, we _do_ want to update the readdir cache
>>>>> to
>>>>> reflect the fact that an entry was added or deleted. While we
>>>>> could
>>>>> figure out how to remove an entry (at least for the case where
>>>>> the
>>>>> filesystem is case-sensitive), we do not know where the
>>>>> filesystem
>>>>> added the new file, or what cookies was assigned.
>>>>>
>>>>> This is why the inode change attribute is more appropriate for
>>>>> indexing
>>>>> the page cache pages. It reflects the cases where we want to
>>>>> revalidate
>>>>> the readdir cache, as opposed to the dcache.
>>>>
>>>> Ok, thanks for explaining this.
>>>>
>>>> I've noticed that you haven't responded about my concerns about
>>>> not
>>>> checking
>>>> the directory for changes with every v4 READDIR.  For v3, we have
>>>> post-op
>>>> updates to the directory, but with v4 the directory can change
>>>> and
>>>> we'll
>>>> end up with entries in the cache that are marked with an old
>>>> change_attr.
>>>>
>>>
>>> Then they will be rejected by nfs_readdir_page_cookie_match() if a
>>> user
>>> looks up that page again after we've revalidated the change
>>> attribute
>>> on the directory.
>>>
>>> ...and note that NFSv4 does returns a struct change_info4 for all
>>> operations that change the directory, so we will update the change
>>> attribute in all those cases.
>>
>> I'm not worried about changes from the same client.
>>
>>> If the change is made on the server, well then we will detect it
>>> through the standard revalidation process that usually decides when
>>> to
>>> invalidate the directory page cache.
>>
>> The environments I'm concerned about are setup very frequently: they
>> look
>> like multiple NFS clients co-ordinating on a directory with millions
>> of
>> files.  Some clients are adding files as they do work, other clients
>> are
>> then looking for those files by walking the directory entries to
>> validate
>> their existence.  The systems that do this have a "very bad time" if
>> some
>> of them produce listings that are _dramatically_ and transiently
>> different
>> from a listing they produced before.
>>
>> That can happen really easily with what we've got here, and it can
>> create a
>> huge problem for these setups.  And it won't be easily 
>> reproduceable,
>> and it
>> will be hard to find.  It will cost everyone involved a lot of time
>> and
>> effort to track down, and we can fix it easily.
>>
>>>> I'm pretty positive that not checking for changes to the
>>>> directory
>>>> (not
>>>> sending GETATTR with READDIR) is going to create cases of double-
>>>> listed
>>>> and
>>>> truncated-listings for dirctory listers.  Not handling those
>>>> cases
>>>> means
>>>> I'm
>>>> going to have some very unhappy customers that complain about
>>>> their
>>>> files
>>>> disappearing/reappearing on NFS.
>>>>
>>>> If you need me to prove that its an issue, I can take the time to
>>>> write
>>>> up
>>>> program that shows this problem.
>>>>
>>>
>>> If you label the page contents with an attribute that was retrieved
>>> _after_ the READDIR op, then you will introduce this as a problem
>>> for
>>> your customers.
>>
>> No the problem is already here, we're not introducing it.  By
>> labeling
>> the
>> page contents with every call we're shifting the race window from the
>> client
>> where it's a very large window to the server where the window is
>> small.
>>
>> Its still possible, but *much* less likely.
>>
>>> The reason is that there is no atomicity between operations in a
>>> COMPOUND. Worse, the implementation of readdir in scalable modern
>>> systems, including Linux, does not even guarantee atomicity of the
>>> readdir operation itself. Instead each readdir entry is filled
>>> without
>>> holding any locks or preventing any changes to the directory or to
>>> the
>>> object itself.
>>
>> I understand all this, but its not a reason to make the problem
>> worse.
>>
>>> POSIX states very explicitly that if you're making changes to the
>>> directory after the call to opendir() or rewinddir(), then the
>>> behaviour w.r.t. whether that file appears in the readdir() call is
>>> unspecified. See
>>> https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html
>>
>> Yes, but again - just because the problem exists doesn't give us
>> reason
>> to
>> amplify it when we can easily make a better choice for almost no
>> cost.
>>
>> Here are my reasons for wanting the GETATTR added:
>>   - it makes it *much* less likely for this problem to occur, with
>> the
>> minor
>>     downside of decreased caching for unstable directories.
>>   - it makes v3 and v4 readdir pagecache behavior consistent WRT
>> changing
>>     directories.
>>
>> I spent a non-trivial amount of time working on this problem, and saw
>> this
>> exact issue appear.  Its definitely something that's going to come
>> back
>> and
>> bite us if we don't fix it.
>>
>> How can I convince you?  I've offered to produce a working example 
>> of
>> this
>> problem.  Will you review those results?  If I cannot convince you, 
>> I
>> feel
>> I'll have to pursue distro-specific changes for this work.
>
> Ben, the main cause of this kind of issue in the current code is the
> following line:
>
>         /*
>          * ctx->pos points to the dirent entry number.
>          * *desc->dir_cookie has the cookie for the next entry. We 
> have
>          * to either find the entry with the appropriate number or
>          * revalidate the cookie.
>          */
>         if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
>                 res = nfs_revalidate_mapping(inode, file->f_mapping);
>                 if (res < 0)
>                         goto out;
>         }
>
>
> That line protects the page cache against changes aften opendir(). It
> was introduced by Red Hat in commmit 07b5ce8ef2d8 in order to fix a
> claim of a severe performance problem.
>
> These patches _remove_ that protection, because we're now able to cope
> with more frequent revalidation without needing to restart directory
> reads from scratch.

Yes, I know.  But the big change is that now we're heavily relying on
page validation to produce sane listing results, and proper page 
validation
relies on up-to-date change info.

> So no. Without further proof, I don't accept your claim that this
> patchset introduces a regression. I don't accept your claim that we 
> are
> required to revalidate the change attribute on every readdir call. We
> can't do that for NFSv2 or NFSv3 (the latter offers a post_op
> attribute, not pre-op attribute) and as I already pointed out, there 
> is
> nothing in POSIX that requires this.

You don't need a pre-op attribute.  You just need to detect the case 
where
you're walking into pages that contain entries that don't match the ones
you're currently using, and post-op is as good as we can get it.

Ok, so I'm reading that further proof is required, and I'm happy to do 
the
work.  Thanks for the replies here and elsewhere.

Ben


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 15:34                         ` Benjamin Coddington
@ 2022-02-25 20:23                           ` Benjamin Coddington
  2022-02-25 20:28                             ` Benjamin Coddington
  2022-02-25 20:41                             ` Trond Myklebust
  0 siblings, 2 replies; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 20:23 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 25 Feb 2022, at 10:34, Benjamin Coddington wrote:
> Ok, so I'm reading that further proof is required, and I'm happy to do 
> the
> work.  Thanks for the replies here and elsewhere.

Here's an example of this problem on a tmpfs export using v8 of your
patchset with the fix to set the change_attr in
nfs_readdir_page_init_array().

I'm using tmpfs, because it reliably orders cookies in reverse order of
creation (or perhaps sorted by name).

The program drives both the client-side and server-side - so on this one
system, /exports/tmpfs is:
tmpfs /exports/tmpfs tmpfs rw,seclabel,relatime,size=102400k 0 0

and /mnt/localhost is:
localhost:/exports/tmpfs /mnt/localhost/tmpfs nfs4 
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 
0 0

The program creates 256 files on the server, walks through them once on 
the
client, deletes the last 127 on the server, drops the first page from 
the
pagecache, and walks through them again on the client.

The second listing produces 124 duplicate entries.

I just have to say again: this behavior is _new_ (but not new to me), 
and it
is absolutely going to crop up on our customer's systems that are 
walking
through millions of directory entries on loaded servers under memory
pressure.  The directory listings as a whole become very likely to be
nonsense at random times.  I realize they are not /supposed/ to be 
coherent,
but what we're getting here is going to be far far less coherent, and 
its
going to be a mess.

There are other scenarios that are worse when the cookies aren't 
ordered,
you can end up with EOF, or get into repeating patterns.

Please compare this with v3, and before this patchset, and tell me if 
I'm
not justified playing chicken little.

Here's what I do to run this:

mount -t tmpfs -osize=100M tmpfs /exports/tmpfs/
exportfs -ofsid=0 *:/exports
exportfs -ofsid=1 *:/exports/tmpfs
mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost
./getdents2

Compare "Listing 1" with "Listing 2".

I would also do a "rm -f /export/tmpfs/*" between each run.

Thanks again for your time and work.

Ben

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <string.h>

#define NFSDIR "/mnt/localhost/tmpfs"
#define LOCDIR "/exports/tmpfs"
#define BUF_SIZE 4096

int main(int argc, char **argv)
{
	int i, dir_fd, bpos, total = 0;
     size_t nread;
	struct linux_dirent {
		long           d_ino;
		off_t          d_off;
		unsigned short d_reclen;
		char           d_name[];
	};
     struct linux_dirent *d;
	char buf[BUF_SIZE];

     /* create files: */
     for (i = 0; i < 256; i++) {
         sprintf(buf, LOCDIR "/file_%03d", i);
         close(open(buf, O_CREAT, 666));
     }

	dir_fd = open(NFSDIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
	if (dir_fd < 0) {
		perror("cannot open dir");
		return 1;
	}

	while (1) {
		nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
		if (nread == 0 || nread == -1)
			break;
		for (bpos = 0; bpos < nread;) {
             d = (struct linux_dirent *) (buf + bpos);
             printf("%s\n", d->d_name);
             total++;
             bpos += d->d_reclen;
         }
     }
     printf("Listing 1: %d total dirents\n", total);

     /* rewind */
     lseek(dir_fd, 0, SEEK_SET);

     /* drop the first page */
     posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);

     /* delete the last 127 files: */
     for (i = 127; i < 256; i++) {
         sprintf(buf, LOCDIR "/file_%03d", i);
         unlink(buf);
     }

     total = 0;
	while (1) {
		nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
		if (nread == 0 || nread == -1)
			break;
		for (bpos = 0; bpos < nread;) {
             d = (struct linux_dirent *) (buf + bpos);
             printf("%s\n", d->d_name);
             total++;
             bpos += d->d_reclen;
         }
     }
     printf("Listing 2: %d total dirents\n", total);

	close(dir_fd);
	return 0;
}


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 20:23                           ` Benjamin Coddington
@ 2022-02-25 20:28                             ` Benjamin Coddington
  2022-02-25 20:41                             ` Trond Myklebust
  1 sibling, 0 replies; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 20:28 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On 25 Feb 2022, at 15:23, Benjamin Coddington wrote:

> int main(int argc, char **argv)
> {
> 	int i, dir_fd, bpos, total = 0;
>     size_t nread;
> 	struct linux_dirent {

Ugh.. and sorry about the whitespace mess.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 20:23                           ` Benjamin Coddington
  2022-02-25 20:28                             ` Benjamin Coddington
@ 2022-02-25 20:41                             ` Trond Myklebust
  2022-02-25 22:04                               ` Benjamin Coddington
  2022-02-25 22:29                               ` Trond Myklebust
  1 sibling, 2 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 20:41 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 15:23 -0500, Benjamin Coddington wrote:
> On 25 Feb 2022, at 10:34, Benjamin Coddington wrote:
> > Ok, so I'm reading that further proof is required, and I'm happy to
> > do 
> > the
> > work.  Thanks for the replies here and elsewhere.
> 
> Here's an example of this problem on a tmpfs export using v8 of your
> patchset with the fix to set the change_attr in
> nfs_readdir_page_init_array().
> 
> I'm using tmpfs, because it reliably orders cookies in reverse order
> of
> creation (or perhaps sorted by name).
> 
> The program drives both the client-side and server-side - so on this
> one
> system, /exports/tmpfs is:
> tmpfs /exports/tmpfs tmpfs rw,seclabel,relatime,size=102400k 0 0
> 
> and /mnt/localhost is:
> localhost:/exports/tmpfs /mnt/localhost/tmpfs nfs4 
> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,prot
> o=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=non
> e,addr=127.0.0.1 
> 0 0
> 
> The program creates 256 files on the server, walks through them once
> on 
> the
> client, deletes the last 127 on the server, drops the first page from
> the
> pagecache, and walks through them again on the client.
> 
> The second listing produces 124 duplicate entries.
> 
> I just have to say again: this behavior is _new_ (but not new to me),
> and it
> is absolutely going to crop up on our customer's systems that are 
> walking
> through millions of directory entries on loaded servers under memory
> pressure.  The directory listings as a whole become very likely to be
> nonsense at random times.  I realize they are not /supposed/ to be 
> coherent,
> but what we're getting here is going to be far far less coherent, and
> its
> going to be a mess.
> 
> There are other scenarios that are worse when the cookies aren't 
> ordered,
> you can end up with EOF, or get into repeating patterns.
> 
> Please compare this with v3, and before this patchset, and tell me if
> I'm
> not justified playing chicken little.
> 
> Here's what I do to run this:
> 
> mount -t tmpfs -osize=100M tmpfs /exports/tmpfs/
> exportfs -ofsid=0 *:/exports
> exportfs -ofsid=1 *:/exports/tmpfs
> mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost
> ./getdents2
> 
> Compare "Listing 1" with "Listing 2".
> 
> I would also do a "rm -f /export/tmpfs/*" between each run.
> 
> Thanks again for your time and work.
> 
> Ben
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <string.h>
> 
> #define NFSDIR "/mnt/localhost/tmpfs"
> #define LOCDIR "/exports/tmpfs"
> #define BUF_SIZE 4096
> 
> int main(int argc, char **argv)
> {
>         int i, dir_fd, bpos, total = 0;
>      size_t nread;
>         struct linux_dirent {
>                 long           d_ino;
>                 off_t          d_off;
>                 unsigned short d_reclen;
>                 char           d_name[];
>         };
>      struct linux_dirent *d;
>         char buf[BUF_SIZE];
> 
>      /* create files: */
>      for (i = 0; i < 256; i++) {
>          sprintf(buf, LOCDIR "/file_%03d", i);
>          close(open(buf, O_CREAT, 666));
>      }
> 
>         dir_fd = open(NFSDIR,
> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
>         if (dir_fd < 0) {
>                 perror("cannot open dir");
>                 return 1;
>         }
> 
>         while (1) {
>                 nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
>                 if (nread == 0 || nread == -1)
>                         break;
>                 for (bpos = 0; bpos < nread;) {
>              d = (struct linux_dirent *) (buf + bpos);
>              printf("%s\n", d->d_name);
>              total++;
>              bpos += d->d_reclen;
>          }
>      }
>      printf("Listing 1: %d total dirents\n", total);
> 
>      /* rewind */
>      lseek(dir_fd, 0, SEEK_SET);
> 
>      /* drop the first page */
>      posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);
> 
>      /* delete the last 127 files: */
>      for (i = 127; i < 256; i++) {
>          sprintf(buf, LOCDIR "/file_%03d", i);
>          unlink(buf);
>      }
> 
>      total = 0;
>         while (1) {
>                 nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
>                 if (nread == 0 || nread == -1)
>                         break;
>                 for (bpos = 0; bpos < nread;) {
>              d = (struct linux_dirent *) (buf + bpos);
>              printf("%s\n", d->d_name);
>              total++;
>              bpos += d->d_reclen;
>          }
>      }
>      printf("Listing 2: %d total dirents\n", total);
> 
>         close(dir_fd);
>         return 0;
> }


tmpfs is broken on the server. It doesn't provide stable cookies, and
knfsd doesn't use the verifier to tell you that the cookie assignment
has changed.


Re-export of tmpfs has never worked reliably.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 20:41                             ` Trond Myklebust
@ 2022-02-25 22:04                               ` Benjamin Coddington
  2022-02-25 22:29                               ` Trond Myklebust
  1 sibling, 0 replies; 57+ messages in thread
From: Benjamin Coddington @ 2022-02-25 22:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 25 Feb 2022, at 15:41, Trond Myklebust wrote:

> On Fri, 2022-02-25 at 15:23 -0500, Benjamin Coddington wrote:
>> On 25 Feb 2022, at 10:34, Benjamin Coddington wrote:
>>> Ok, so I'm reading that further proof is required, and I'm happy to
>>> do
>>> the
>>> work.  Thanks for the replies here and elsewhere.
>>
>> Here's an example of this problem on a tmpfs export using v8 of your
>> patchset with the fix to set the change_attr in
>> nfs_readdir_page_init_array().
>>
>> I'm using tmpfs, because it reliably orders cookies in reverse order
>> of
>> creation (or perhaps sorted by name).
>>
>> The program drives both the client-side and server-side - so on this
>> one
>> system, /exports/tmpfs is:
>> tmpfs /exports/tmpfs tmpfs rw,seclabel,relatime,size=102400k 0 0
>>
>> and /mnt/localhost is:
>> localhost:/exports/tmpfs /mnt/localhost/tmpfs nfs4
>> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,prot
>> o=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=non
>> e,addr=127.0.0.1
>> 0 0
>>
>> The program creates 256 files on the server, walks through them once
>> on
>> the
>> client, deletes the last 127 on the server, drops the first page from
>> the
>> pagecache, and walks through them again on the client.
>>
>> The second listing produces 124 duplicate entries.
>>
>> I just have to say again: this behavior is _new_ (but not new to me),
>> and it
>> is absolutely going to crop up on our customer's systems that are
>> walking
>> through millions of directory entries on loaded servers under memory
>> pressure.  The directory listings as a whole become very likely to 
>> be
>> nonsense at random times.  I realize they are not /supposed/ to be
>> coherent,
>> but what we're getting here is going to be far far less coherent, and
>> its
>> going to be a mess.
>>
>> There are other scenarios that are worse when the cookies aren't
>> ordered,
>> you can end up with EOF, or get into repeating patterns.
>>
>> Please compare this with v3, and before this patchset, and tell me if
>> I'm
>> not justified playing chicken little.
>>
>> Here's what I do to run this:
>>
>> mount -t tmpfs -osize=100M tmpfs /exports/tmpfs/
>> exportfs -ofsid=0 *:/exports
>> exportfs -ofsid=1 *:/exports/tmpfs
>> mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost
>> ./getdents2
>>
>> Compare "Listing 1" with "Listing 2".
>>
>> I would also do a "rm -f /export/tmpfs/*" between each run.
>>
>> Thanks again for your time and work.
>>
>> Ben
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <sys/syscall.h>
>> #include <string.h>
>>
>> #define NFSDIR "/mnt/localhost/tmpfs"
>> #define LOCDIR "/exports/tmpfs"
>> #define BUF_SIZE 4096
>>
>> int main(int argc, char **argv)
>> {
>>         int i, dir_fd, bpos, total = 0;
>>      size_t nread;
>>         struct linux_dirent {
>>                 long           d_ino;
>>                 off_t          d_off;
>>                 unsigned short d_reclen;
>>                 char           d_name[];
>>         };
>>      struct linux_dirent *d;
>>         char buf[BUF_SIZE];
>>
>>      /* create files: */
>>      for (i = 0; i < 256; i++) {
>>          sprintf(buf, LOCDIR "/file_%03d", i);
>>          close(open(buf, O_CREAT, 666));
>>      }
>>
>>         dir_fd = open(NFSDIR,
>> O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
>>         if (dir_fd < 0) {
>>                 perror("cannot open dir");
>>                 return 1;
>>         }
>>
>>         while (1) {
>>                 nread = syscall(SYS_getdents, dir_fd, 
>> buf, BUF_SIZE);
>>                 if (nread == 0 || nread == -1)
>>                         break;
>>                 for (bpos = 0; bpos < nread;) {
>>              d = (struct linux_dirent *) (buf + bpos);
>>              printf("%s\n", d->d_name);
>>              total++;
>>              bpos += d->d_reclen;
>>          }
>>      }
>>      printf("Listing 1: %d total dirents\n", total);
>>
>>      /* rewind */
>>      lseek(dir_fd, 0, SEEK_SET);
>>
>>      /* drop the first page */
>>      posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);
>>
>>      /* delete the last 127 files: */
>>      for (i = 127; i < 256; i++) {
>>          sprintf(buf, LOCDIR "/file_%03d", i);
>>          unlink(buf);
>>      }
>>
>>      total = 0;
>>         while (1) {
>>                 nread = syscall(SYS_getdents, dir_fd, 
>> buf, BUF_SIZE);
>>                 if (nread == 0 || nread == -1)
>>                         break;
>>                 for (bpos = 0; bpos < nread;) {
>>              d = (struct linux_dirent *) (buf + bpos);
>>              printf("%s\n", d->d_name);
>>              total++;
>>              bpos += d->d_reclen;
>>          }
>>      }
>>      printf("Listing 2: %d total dirents\n", total);
>>
>>         close(dir_fd);
>>         return 0;
>> }
>
>
> tmpfs is broken on the server. It doesn't provide stable cookies, and
> knfsd doesn't use the verifier to tell you that the cookie assignment
> has changed.
>
>
> Re-export of tmpfs has never worked reliably.

In this case, the cookies are stable, they can be verified with a wire
capture.

I've just adapted the program slightly to ext4 below.  In this case the
Listing 2 shows files "file_125 file_126" that don't exist on the server 
and
leave out files "elif_125 elif_126".  It would take me more time to 
produce
more dramatic results, but I'm sure I could produce them.

And its not hard to understand either, so I'm not sure why so much proof 
is
needed.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <string.h>

#define NFSDIR "/mnt/localhost/ext4"
#define LOCDIR "/exports/ext4"
#define BUF_SIZE 4096

int main(int argc, char **argv)
{
     int i, dir_fd, bpos, total = 0;
     size_t nread;
     struct linux_dirent {
             long           d_ino;
             off_t          d_off;
             unsigned short d_reclen;
             char           d_name[];
     };
     struct linux_dirent *d;
     char buf[BUF_SIZE];

     /* create files: */
     for (i = 0; i < 256; i++) {
         sprintf(buf, LOCDIR "/file_%03d", i);
         close(open(buf, O_CREAT, 666));
     }

     dir_fd = open(NFSDIR, O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
     if (dir_fd < 0) {
             perror("cannot open dir");
             return 1;
     }

     while (1) {
         nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
         if (nread == 0 || nread == -1)
             break;
         for (bpos = 0; bpos < nread;) {
             d = (struct linux_dirent *) (buf + bpos);
             printf("%s\n", d->d_name);
             total++;
             bpos += d->d_reclen;
         }
     }
     printf("Listing 1: %d total dirents\n", total);

     /* rewind */
     lseek(dir_fd, 0, SEEK_SET);

     /* drop the first page */
     posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);

     /* delete the first 127 files: */
     for (i = 0; i < 127; i++) {
         sprintf(buf, LOCDIR "/file_%03d", i);
         unlink(buf);
     }

     /* create 127 more: */
     for (i = 0; i < 127; i++) {
         sprintf(buf, LOCDIR "/elif_%03d", i);
         close(open(buf, O_CREAT, 666));
     }

     total = 0;
     while (1) {
         nread = syscall(SYS_getdents, dir_fd, buf, BUF_SIZE);
         if (nread == 0 || nread == -1)
             break;
         for (bpos = 0; bpos < nread;) {
             d = (struct linux_dirent *) (buf + bpos);
             printf("%s\n", d->d_name);
             total++;
             bpos += d->d_reclen;
         }
     }
     printf("Listing 2: %d total dirents\n", total);

     close(dir_fd);
     return 0;
}


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache
  2022-02-25 20:41                             ` Trond Myklebust
  2022-02-25 22:04                               ` Benjamin Coddington
@ 2022-02-25 22:29                               ` Trond Myklebust
  1 sibling, 0 replies; 57+ messages in thread
From: Trond Myklebust @ 2022-02-25 22:29 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Fri, 2022-02-25 at 15:41 -0500, Trond Myklebust wrote:
> On Fri, 2022-02-25 at 15:23 -0500, Benjamin Coddington wrote:
> > On 25 Feb 2022, at 10:34, Benjamin Coddington wrote:
> > > Ok, so I'm reading that further proof is required, and I'm happy
> > > to
> > > do 
> > > the
> > > work.  Thanks for the replies here and elsewhere.
> > 
> > Here's an example of this problem on a tmpfs export using v8 of
> > your
> > patchset with the fix to set the change_attr in
> > nfs_readdir_page_init_array().
> > 
> > I'm using tmpfs, because it reliably orders cookies in reverse
> > order
> > of
> > creation (or perhaps sorted by name).
> > 
> > The program drives both the client-side and server-side - so on
> > this
> > one
> > system, /exports/tmpfs is:
> > tmpfs /exports/tmpfs tmpfs rw,seclabel,relatime,size=102400k 0 0
> > 
> > and /mnt/localhost is:
> > localhost:/exports/tmpfs /mnt/localhost/tmpfs nfs4 
> > rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,pr
> > ot
> > o=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=n
> > on
> > e,addr=127.0.0.1 
> > 0 0
> > 
> > The program creates 256 files on the server, walks through them
> > once
> > on 
> > the
> > client, deletes the last 127 on the server, drops the first page
> > from
> > the
> > pagecache, and walks through them again on the client.
> > 
> > The second listing produces 124 duplicate entries.
> > 
> > I just have to say again: this behavior is _new_ (but not new to
> > me),
> > and it
> > is absolutely going to crop up on our customer's systems that are 
> > walking
> > through millions of directory entries on loaded servers under
> > memory
> > pressure.  The directory listings as a whole become very likely to
> > be
> > nonsense at random times.  I realize they are not /supposed/ to be 
> > coherent,
> > but what we're getting here is going to be far far less coherent,
> > and
> > its
> > going to be a mess.
> > 
> > There are other scenarios that are worse when the cookies aren't 
> > ordered,
> > you can end up with EOF, or get into repeating patterns.
> > 
> > Please compare this with v3, and before this patchset, and tell me
> > if
> > I'm
> > not justified playing chicken little.
> > 
> > Here's what I do to run this:
> > 
> > mount -t tmpfs -osize=100M tmpfs /exports/tmpfs/
> > exportfs -ofsid=0 *:/exports
> > exportfs -ofsid=1 *:/exports/tmpfs
> > mount -t nfs -ov4.1,sec=sys localhost:/exports /mnt/localhost
> > ./getdents2
> > 
> > Compare "Listing 1" with "Listing 2".
> > 
> > I would also do a "rm -f /export/tmpfs/*" between each run.
> > 
> > Thanks again for your time and work.
> > 
> > Ben
> > 
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <unistd.h>
> > #include <fcntl.h>
> > #include <sched.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <sys/syscall.h>
> > #include <string.h>
> > 
> > #define NFSDIR "/mnt/localhost/tmpfs"
> > #define LOCDIR "/exports/tmpfs"
> > #define BUF_SIZE 4096
> > 
> > int main(int argc, char **argv)
> > {
> >         int i, dir_fd, bpos, total = 0;
> >      size_t nread;
> >         struct linux_dirent {
> >                 long           d_ino;
> >                 off_t          d_off;
> >                 unsigned short d_reclen;
> >                 char           d_name[];
> >         };
> >      struct linux_dirent *d;
> >         char buf[BUF_SIZE];
> > 
> >      /* create files: */
> >      for (i = 0; i < 256; i++) {
> >          sprintf(buf, LOCDIR "/file_%03d", i);
> >          close(open(buf, O_CREAT, 666));
> >      }
> > 
> >         dir_fd = open(NFSDIR,
> > O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC);
> >         if (dir_fd < 0) {
> >                 perror("cannot open dir");
> >                 return 1;
> >         }
> > 
> >         while (1) {
> >                 nread = syscall(SYS_getdents, dir_fd, buf,
> > BUF_SIZE);
> >                 if (nread == 0 || nread == -1)
> >                         break;
> >                 for (bpos = 0; bpos < nread;) {
> >              d = (struct linux_dirent *) (buf + bpos);
> >              printf("%s\n", d->d_name);
> >              total++;
> >              bpos += d->d_reclen;
> >          }
> >      }
> >      printf("Listing 1: %d total dirents\n", total);
> > 
> >      /* rewind */
> >      lseek(dir_fd, 0, SEEK_SET);
> > 
> >      /* drop the first page */
> >      posix_fadvise(dir_fd, 0, 4096, POSIX_FADV_DONTNEED);
> > 
> >      /* delete the last 127 files: */
> >      for (i = 127; i < 256; i++) {
> >          sprintf(buf, LOCDIR "/file_%03d", i);
> >          unlink(buf);
> >      }
> > 
> >      total = 0;
> >         while (1) {
> >                 nread = syscall(SYS_getdents, dir_fd, buf,
> > BUF_SIZE);
> >                 if (nread == 0 || nread == -1)
> >                         break;
> >                 for (bpos = 0; bpos < nread;) {
> >              d = (struct linux_dirent *) (buf + bpos);
> >              printf("%s\n", d->d_name);
> >              total++;
> >              bpos += d->d_reclen;
> >          }
> >      }
> >      printf("Listing 2: %d total dirents\n", total);
> > 
> >         close(dir_fd);
> >         return 0;
> > }
> 
> 
> tmpfs is broken on the server. It doesn't provide stable cookies, and
> knfsd doesn't use the verifier to tell you that the cookie assignment
> has changed.
> 
> 
> Re-export of tmpfs has never worked reliably.

What I mean is that tmpfs is always a poor choice for NFS because
seekdir()/telldir() don't work reliably, and so READDIR cannot work
reliably, since it relies on open()+seekdir() to continue reading the
directory in successive RPC calls.

Anyhow, to get back to your question about whether we should or should
not be detecting that the directory changed when you delete the files
on the server. The answer is no... Nothing in the above guarantees that
the cache is revalidated.

NFS close to open cache consistency means that we guarantee to
revalidate the cached data on open(), and only then. That guarantee
does not extend to lseek() or to the rewinddir/seekdir wrappers.

If your application wants stronger cache consistency, then there are
tricks to enable that. Now that statx() has the AT_STATX_FORCE_SYNC
flag, you could use that to force a revalidation of the directory
attributes on the client. You might also use the posix_fadvise() trick
to try to clear the cache. However note that none of these tricks are
guaranteed to work. They're not reliable now, and that situation is
unlikely to change in the future barring a deliberate (and documented!)
change in kernel policy.
So as of now, the only way to reliably introduce a revalidation point
in your testcase above is to close() and then open().

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-02-25 22:29 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 21:12 [PATCH v7 00/21] Readdir improvements trondmy
2022-02-23 21:12 ` [PATCH v7 01/21] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
2022-02-23 21:12   ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure trondmy
2022-02-23 21:12     ` [PATCH v7 03/21] NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context trondmy
2022-02-23 21:12       ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically trondmy
2022-02-23 21:12         ` [PATCH v7 05/21] NFS: Store the change attribute in the directory page cache trondmy
2022-02-23 21:12           ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the " trondmy
2022-02-23 21:12             ` [PATCH v7 07/21] NFS: Don't re-read the entire page cache to find the next cookie trondmy
2022-02-23 21:12               ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
2022-02-23 21:12                 ` [PATCH v7 09/21] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
2022-02-23 21:12                   ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir trondmy
2022-02-23 21:12                     ` [PATCH v7 11/21] NFS: Improve heuristic for readdirplus trondmy
2022-02-23 21:12                       ` [PATCH v7 12/21] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
2022-02-23 21:12                         ` [PATCH v7 13/21] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
2022-02-23 21:12                           ` [PATCH v7 14/21] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
2022-02-23 21:12                             ` [PATCH v7 15/21] NFS: Don't request readdirplus when revalidation was forced trondmy
2022-02-23 21:13                               ` [PATCH v7 16/21] NFS: Add basic readdir tracing trondmy
2022-02-23 21:13                                 ` [PATCH v7 17/21] NFS: Trace effects of readdirplus on the dcache trondmy
2022-02-23 21:13                                   ` [PATCH v7 18/21] NFS: Trace effects of the readdirplus heuristic trondmy
2022-02-23 21:13                                     ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index trondmy
2022-02-23 21:13                                       ` [PATCH v7 20/21] NFS: Fix up forced readdirplus trondmy
2022-02-23 21:13                                         ` [PATCH v7 21/21] NFS: Remove unnecessary cache invalidations for directories trondmy
2022-02-24 17:31                                       ` [PATCH v7 19/21] NFS: Convert readdir page cache to use a cookie based index Benjamin Coddington
2022-02-25  2:33                                         ` Trond Myklebust
2022-02-25  3:17                                           ` NeilBrown
2022-02-25  4:25                                             ` Trond Myklebust
2022-02-25 12:33                                               ` Benjamin Coddington
2022-02-25 13:11                                                 ` Trond Myklebust
2022-02-24 15:53                                 ` [PATCH v7 16/21] NFS: Add basic readdir tracing Benjamin Coddington
2022-02-25  2:35                                   ` Trond Myklebust
2022-02-24 16:55                     ` [PATCH v7 10/21] NFS: Reduce use of uncached readdir Anna Schumaker
2022-02-25  4:07                       ` Trond Myklebust
2022-02-24 16:30                 ` [PATCH v7 08/21] NFS: Adjust the amount of readahead performed by NFS readdir Anna Schumaker
2022-02-24 16:18             ` [PATCH v7 06/21] NFS: If the cookie verifier changes, we must invalidate the page cache Anna Schumaker
2022-02-24 14:53           ` [PATCH v7 05/21] NFS: Store the change attribute in the directory " Benjamin Coddington
2022-02-25  2:26             ` Trond Myklebust
2022-02-25  3:51               ` Trond Myklebust
2022-02-25 11:38                 ` Benjamin Coddington
2022-02-25 13:10                   ` Trond Myklebust
2022-02-25 13:26                     ` Trond Myklebust
2022-02-25 14:44                     ` Benjamin Coddington
2022-02-25 15:18                       ` Trond Myklebust
2022-02-25 15:34                         ` Benjamin Coddington
2022-02-25 20:23                           ` Benjamin Coddington
2022-02-25 20:28                             ` Benjamin Coddington
2022-02-25 20:41                             ` Trond Myklebust
2022-02-25 22:04                               ` Benjamin Coddington
2022-02-25 22:29                               ` Trond Myklebust
2022-02-24 14:15         ` [PATCH v7 04/21] NFS: Calculate page offsets algorithmically Benjamin Coddington
2022-02-25  2:11           ` Trond Myklebust
2022-02-25 11:28             ` Benjamin Coddington
2022-02-25 12:44               ` Trond Myklebust
2022-02-24 14:14     ` [PATCH v7 02/21] NFS: Trace lookup revalidation failure Benjamin Coddington
2022-02-25  2:09       ` Trond Myklebust
2022-02-24 12:25 ` [PATCH v7 00/21] Readdir improvements David Wysochanski
2022-02-25  4:00   ` Trond Myklebust
2022-02-24 15:07 ` David Wysochanski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.