* [PATCH v6 00/13] Readdir improvements
@ 2022-02-21 16:08 trondmy
  2022-02-21 16:08 ` [PATCH v6 01/13] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

--
v2: Remove reset of dtsize when NFS_INO_FORCE_READDIR is set
v3: Avoid excessive window shrinking in uncached_readdir case
v4: Track 'ls -l' cache hit/miss statistics
    Improved algorithm for falling back to uncached readdir
    Skip readdirplus when files are being written to
v5: bugfixes
    Skip readdirplus when the acdirmax/acregmax values are low
    Request a full XDR buffer when doing READDIRPLUS
v6: Add tracing
    Don't have lookup request readdirplus when it won't help

Trond Myklebust (13):
  NFS: constify nfs_server_capable() and nfs_have_writebacks()
  NFS: Trace lookup revalidation failure
  NFS: Adjust the amount of readahead performed by NFS readdir
  NFS: Simplify nfs_readdir_xdr_to_array()
  NFS: Improve algorithm for falling back to uncached readdir
  NFS: Improve heuristic for readdirplus
  NFS: Don't ask for readdirplus unless it can help nfs_getattr()
  NFSv4: Ask for a full XDR buffer of readdir goodness
  NFS: Readdirplus can't help lookup for case insensitive filesystems
  NFS: Don't request readdirplus when revalidation was forced
  NFS: Add basic readdir tracing
  NFS: Trace effects of readdirplus on the dcache
  NFS: Trace effects of the readdirplus heuristic

 fs/nfs/dir.c           | 275 +++++++++++++++++++++++++++--------------
 fs/nfs/inode.c         |  37 +++---
 fs/nfs/internal.h      |   4 +-
 fs/nfs/nfs3xdr.c       |   7 +-
 fs/nfs/nfs4xdr.c       |   6 +-
 fs/nfs/nfstrace.h      | 122 +++++++++++++++++-
 include/linux/nfs_fs.h |  14 ++-
 7 files changed, 339 insertions(+), 126 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v6 01/13] NFS: constify nfs_server_capable() and nfs_have_writebacks()
  2022-02-21 16:08 [PATCH v6 00/13] Readdir improvements trondmy
@ 2022-02-21 16:08 ` trondmy
  2022-02-21 16:08   ` [PATCH v6 02/13] NFS: Trace lookup revalidation failure trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 include/linux/nfs_fs.h | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 72a732a5103c..6e10725887d1 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -363,7 +363,7 @@ static inline void nfs_mark_for_revalidate(struct inode *inode)
 	spin_unlock(&inode->i_lock);
 }
 
-static inline int nfs_server_capable(struct inode *inode, int cap)
+static inline int nfs_server_capable(const struct inode *inode, int cap)
 {
 	return NFS_SERVER(inode)->caps & cap;
 }
@@ -587,12 +587,11 @@ extern struct nfs_commit_data *nfs_commitdata_alloc(bool never_fail);
 extern void nfs_commit_free(struct nfs_commit_data *data);
 bool nfs_commit_end(struct nfs_mds_commit_info *cinfo);
 
-static inline int
-nfs_have_writebacks(struct inode *inode)
+static inline bool nfs_have_writebacks(const struct inode *inode)
 {
 	if (S_ISREG(inode->i_mode))
 		return atomic_long_read(&NFS_I(inode)->nrequests) != 0;
-	return 0;
+	return false;
 }
 
 /*
-- 
2.35.1



* [PATCH v6 02/13] NFS: Trace lookup revalidation failure
  2022-02-21 16:08 ` [PATCH v6 01/13] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
@ 2022-02-21 16:08   ` trondmy
  2022-02-21 16:08     ` [PATCH v6 03/13] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Enable tracing of lookup revalidation failures.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8b190c8e4a45..8e750ef34559 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1474,9 +1474,7 @@ nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
 {
 	switch (error) {
 	case 1:
-		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is valid\n",
-			__func__, dentry);
-		return 1;
+		break;
 	case 0:
 		/*
 		 * We can't d_drop the root of a disconnected tree:
@@ -1485,13 +1483,10 @@ nfs_lookup_revalidate_done(struct inode *dir, struct dentry *dentry,
 		 * inodes on unmount and further oopses.
 		 */
 		if (inode && IS_ROOT(dentry))
-			return 1;
-		dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
-				__func__, dentry);
-		return 0;
+			error = 1;
+		break;
 	}
-	dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) lookup returned error %d\n",
-				__func__, dentry, error);
+	trace_nfs_lookup_revalidate_exit(dir, dentry, 0, error);
 	return error;
 }
 
@@ -1623,9 +1618,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 		goto out_bad;
 
 	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
-	error = nfs_lookup_revalidate_dentry(dir, dentry, inode);
-	trace_nfs_lookup_revalidate_exit(dir, dentry, flags, error);
-	return error;
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
 out_valid:
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 out_bad:
-- 
2.35.1



* [PATCH v6 03/13] NFS: Adjust the amount of readahead performed by NFS readdir
  2022-02-21 16:08   ` [PATCH v6 02/13] NFS: Trace lookup revalidation failure trondmy
@ 2022-02-21 16:08     ` trondmy
  2022-02-21 16:08       ` [PATCH v6 04/13] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.

This patch therefore tones down the amount of readahead we perform,
adjusting it to match the amount of data being requested by
user space.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 55 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/nfs_fs.h |  1 +
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8e750ef34559..c84a3bbda216 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -69,6 +69,8 @@ const struct address_space_operations nfs_dir_aops = {
 	.freepage = nfs_readdir_clear_array,
 };
 
+#define NFS_INIT_DTSIZE PAGE_SIZE
+
 static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
@@ -80,6 +82,7 @@ static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir
 		ctx->dir_cookie = 0;
 		ctx->dup_cookie = 0;
 		ctx->page_index = 0;
+		ctx->dtsize = NFS_INIT_DTSIZE;
 		ctx->eof = false;
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
@@ -155,6 +158,7 @@ struct nfs_readdir_descriptor {
 	struct page	*page;
 	struct dir_context *ctx;
 	pgoff_t		page_index;
+	pgoff_t		page_index_max;
 	u64		dir_cookie;
 	u64		last_cookie;
 	u64		dup_cookie;
@@ -167,12 +171,36 @@ struct nfs_readdir_descriptor {
 	unsigned long	gencount;
 	unsigned long	attr_gencount;
 	unsigned int	cache_entry_index;
+	unsigned int	buffer_fills;
+	unsigned int	dtsize;
 	signed char duped;
 	bool plus;
 	bool eob;
 	bool eof;
 };
 
+static void nfs_set_dtsize(struct nfs_readdir_descriptor *desc, unsigned int sz)
+{
+	struct nfs_server *server = NFS_SERVER(file_inode(desc->file));
+	unsigned int maxsize = server->dtsize;
+
+	if (sz > maxsize)
+		sz = maxsize;
+	if (sz < NFS_MIN_FILE_IO_SIZE)
+		sz = NFS_MIN_FILE_IO_SIZE;
+	desc->dtsize = sz;
+}
+
+static void nfs_shrink_dtsize(struct nfs_readdir_descriptor *desc)
+{
+	nfs_set_dtsize(desc, desc->dtsize >> 1);
+}
+
+static void nfs_grow_dtsize(struct nfs_readdir_descriptor *desc)
+{
+	nfs_set_dtsize(desc, desc->dtsize << 1);
+}
+
 static void nfs_readdir_array_init(struct nfs_cache_array *array)
 {
 	memset(array, 0, sizeof(struct nfs_cache_array));
@@ -759,6 +787,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 				break;
 			arrays++;
 			*arrays = page = new;
+			desc->page_index_max++;
 		} else {
 			new = nfs_readdir_page_get_next(mapping,
 							page->index + 1,
@@ -768,6 +797,7 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 			if (page != *arrays)
 				nfs_readdir_page_unlock_and_put(page);
 			page = new;
+			desc->page_index_max = new->index;
 		}
 		status = nfs_readdir_add_to_array(entry, page);
 	} while (!status && !entry->eof);
@@ -833,7 +863,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	struct nfs_entry *entry;
 	size_t array_size;
 	struct inode *inode = file_inode(desc->file);
-	size_t dtsize = NFS_SERVER(inode)->dtsize;
+	unsigned int dtsize = desc->dtsize;
 	int status = -ENOMEM;
 
 	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
@@ -869,6 +899,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 
 		status = nfs_readdir_page_filler(desc, entry, pages, pglen,
 						 arrays, narrays);
+		desc->buffer_fills++;
 	} while (!status && nfs_readdir_page_needs_filling(page) &&
 		page_mapping(page));
 
@@ -916,6 +947,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	if (!desc->page)
 		return -ENOMEM;
 	if (nfs_readdir_page_needs_filling(desc->page)) {
+		desc->page_index_max = desc->page_index;
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
 		if (res < 0) {
@@ -1047,6 +1079,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	desc->cache_entry_index = 0;
 	desc->last_cookie = desc->dir_cookie;
 	desc->duped = 0;
+	desc->page_index_max = 0;
 
 	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
 
@@ -1056,10 +1089,22 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	}
 	desc->page = NULL;
 
+	/*
+	 * Grow the dtsize if we have to go back for more pages,
+	 * or shrink it if we're reading too many.
+	 */
+	if (!desc->eof) {
+		if (!desc->eob)
+			nfs_grow_dtsize(desc);
+		else if (desc->buffer_fills == 1 &&
+			 i < (desc->page_index_max >> 1))
+			nfs_shrink_dtsize(desc);
+	}
 
 	for (i = 0; i < sz && arrays[i]; i++)
 		nfs_readdir_page_array_free(arrays[i]);
 out:
+	desc->page_index_max = -1;
 	kfree(arrays);
 	dfprintk(DIRCACHE, "NFS: %s: returns %d\n", __func__, status);
 	return status;
@@ -1102,6 +1147,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->file = file;
 	desc->ctx = ctx;
 	desc->plus = nfs_use_readdirplus(inode, ctx);
+	desc->page_index_max = -1;
 
 	spin_lock(&file->f_lock);
 	desc->dir_cookie = dir_ctx->dir_cookie;
@@ -1110,6 +1156,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	page_index = dir_ctx->page_index;
 	desc->attr_gencount = dir_ctx->attr_gencount;
 	desc->eof = dir_ctx->eof;
+	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
 	spin_unlock(&file->f_lock);
 
@@ -1151,6 +1198,11 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 
 		nfs_do_filldir(desc, nfsi->cookieverf);
 		nfs_readdir_page_unlock_and_put_cached(desc);
+		if (desc->eob || desc->eof)
+			break;
+		/* Grow the dtsize if we have to go back for more pages */
+		if (desc->page_index == desc->page_index_max)
+			nfs_grow_dtsize(desc);
 	} while (!desc->eob && !desc->eof);
 
 	spin_lock(&file->f_lock);
@@ -1160,6 +1212,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
 	dir_ctx->eof = desc->eof;
+	dir_ctx->dtsize = desc->dtsize;
 	memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
 	spin_unlock(&file->f_lock);
 out_free:
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 6e10725887d1..d27f7e788624 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -106,6 +106,7 @@ struct nfs_open_dir_context {
 	__u64 dir_cookie;
 	__u64 dup_cookie;
 	pgoff_t page_index;
+	unsigned int dtsize;
 	signed char duped;
 	bool eof;
 };
-- 
2.35.1



* [PATCH v6 04/13] NFS: Simplify nfs_readdir_xdr_to_array()
  2022-02-21 16:08     ` [PATCH v6 03/13] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
@ 2022-02-21 16:08       ` trondmy
  2022-02-21 16:08         ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Recent changes to readdir mean that we can cope with partially filled
page cache entries, so we no longer need to rely on looping in
nfs_readdir_xdr_to_array().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c84a3bbda216..a7f25fef890a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -864,6 +864,7 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	size_t array_size;
 	struct inode *inode = file_inode(desc->file);
 	unsigned int dtsize = desc->dtsize;
+	unsigned int pglen;
 	int status = -ENOMEM;
 
 	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
@@ -881,28 +882,20 @@ static int nfs_readdir_xdr_to_array(struct nfs_readdir_descriptor *desc,
 	if (!pages)
 		goto out;
 
-	do {
-		unsigned int pglen;
-		status = nfs_readdir_xdr_filler(desc, verf_arg, entry->cookie,
-						pages, dtsize,
-						verf_res);
-		if (status < 0)
-			break;
-
-		pglen = status;
-		if (pglen == 0) {
-			nfs_readdir_page_set_eof(page);
-			break;
-		}
-
-		verf_arg = verf_res;
+	status = nfs_readdir_xdr_filler(desc, verf_arg, entry->cookie, pages,
+					dtsize, verf_res);
+	if (status < 0)
+		goto free_pages;
 
+	pglen = status;
+	if (pglen != 0)
 		status = nfs_readdir_page_filler(desc, entry, pages, pglen,
 						 arrays, narrays);
-		desc->buffer_fills++;
-	} while (!status && nfs_readdir_page_needs_filling(page) &&
-		page_mapping(page));
+	else
+		nfs_readdir_page_set_eof(page);
+	desc->buffer_fills++;
 
+free_pages:
 	nfs_readdir_free_pages(pages, array_size);
 out:
 	nfs_free_fattr(entry->fattr);
-- 
2.35.1



* [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 16:08       ` [PATCH v6 04/13] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
@ 2022-02-21 16:08         ` trondmy
  2022-02-21 16:08           ` [PATCH v6 06/13] NFS: Improve heuristic for readdirplus trondmy
  2022-02-21 16:45           ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir Benjamin Coddington
  0 siblings, 2 replies; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. Right now, we will try to
refill the page cache if it is non-empty, irrespective of whether or not
doing so is going to take a long time.

Replace that algorithm with something that looks at how many times we've
refilled the page cache without seeing a cache hit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 51 +++++++++++++++++++++---------------------
 include/linux/nfs_fs.h |  1 +
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a7f25fef890a..d0707e63ce45 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -71,19 +71,16 @@ const struct address_space_operations nfs_dir_aops = {
 
 #define NFS_INIT_DTSIZE PAGE_SIZE
 
-static struct nfs_open_dir_context *alloc_nfs_open_dir_context(struct inode *dir)
+static struct nfs_open_dir_context *
+alloc_nfs_open_dir_context(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
 	struct nfs_open_dir_context *ctx;
-	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT);
 	if (ctx != NULL) {
-		ctx->duped = 0;
 		ctx->attr_gencount = nfsi->attr_gencount;
-		ctx->dir_cookie = 0;
-		ctx->dup_cookie = 0;
-		ctx->page_index = 0;
 		ctx->dtsize = NFS_INIT_DTSIZE;
-		ctx->eof = false;
 		spin_lock(&dir->i_lock);
 		if (list_empty(&nfsi->open_files) &&
 		    (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
@@ -170,6 +167,7 @@ struct nfs_readdir_descriptor {
 	unsigned long	timestamp;
 	unsigned long	gencount;
 	unsigned long	attr_gencount;
+	unsigned int	page_fill_misses;
 	unsigned int	cache_entry_index;
 	unsigned int	buffer_fills;
 	unsigned int	dtsize;
@@ -925,6 +923,18 @@ nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 					   desc->last_cookie);
 }
 
+#define NFS_READDIR_PAGE_FILL_MISS_MAX 5
+/*
+ * If we've tried to refill the page cache more than 5 times, and
+ * still not found our cookie, then we should stop and fall back
+ * to uncached readdir
+ */
+static bool nfs_readdir_may_fill_pagecache(struct nfs_readdir_descriptor *desc)
+{
+	return desc->dir_cookie == 0 ||
+	       desc->page_fill_misses < NFS_READDIR_PAGE_FILL_MISS_MAX;
+}
+
 /*
  * Returns 0 if desc->dir_cookie was found on page desc->page_index
  * and locks the page to prevent removal from the page cache.
@@ -940,6 +950,8 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	if (!desc->page)
 		return -ENOMEM;
 	if (nfs_readdir_page_needs_filling(desc->page)) {
+		if (!nfs_readdir_may_fill_pagecache(desc))
+			return -EBADCOOKIE;
 		desc->page_index_max = desc->page_index;
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
@@ -958,36 +970,22 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		if (desc->page_index == 0)
 			memcpy(nfsi->cookieverf, verf,
 			       sizeof(nfsi->cookieverf));
+		desc->page_fill_misses++;
 	}
 	res = nfs_readdir_search_array(desc);
-	if (res == 0)
+	if (res == 0) {
+		desc->page_fill_misses = 0;
 		return 0;
+	}
 	nfs_readdir_page_unlock_and_put_cached(desc);
 	return res;
 }
 
-static bool nfs_readdir_dont_search_cache(struct nfs_readdir_descriptor *desc)
-{
-	struct address_space *mapping = desc->file->f_mapping;
-	struct inode *dir = file_inode(desc->file);
-	unsigned int dtsize = NFS_SERVER(dir)->dtsize;
-	loff_t size = i_size_read(dir);
-
-	/*
-	 * Default to uncached readdir if the page cache is empty, and
-	 * we're looking for a non-zero cookie in a large directory.
-	 */
-	return desc->dir_cookie != 0 && mapping->nrpages == 0 && size > dtsize;
-}
-
 /* Search for desc->dir_cookie from the beginning of the page cache */
 static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 {
 	int res;
 
-	if (nfs_readdir_dont_search_cache(desc))
-		return -EBADCOOKIE;
-
 	do {
 		if (desc->page_index == 0) {
 			desc->current_index = 0;
@@ -1149,6 +1147,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	page_index = dir_ctx->page_index;
 	desc->attr_gencount = dir_ctx->attr_gencount;
 	desc->eof = dir_ctx->eof;
+	desc->page_fill_misses = dir_ctx->page_fill_misses;
 	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
 	spin_unlock(&file->f_lock);
@@ -1204,6 +1203,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
 	dir_ctx->page_index = desc->page_index;
+	dir_ctx->page_fill_misses = desc->page_fill_misses;
 	dir_ctx->eof = desc->eof;
 	dir_ctx->dtsize = desc->dtsize;
 	memcpy(dir_ctx->verf, desc->verf, sizeof(dir_ctx->verf));
@@ -1247,6 +1247,7 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 			dir_ctx->dir_cookie = offset;
 		else
 			dir_ctx->dir_cookie = 0;
+		dir_ctx->page_fill_misses = 0;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
 		dir_ctx->duped = 0;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index d27f7e788624..3165927048e4 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -106,6 +106,7 @@ struct nfs_open_dir_context {
 	__u64 dir_cookie;
 	__u64 dup_cookie;
 	pgoff_t page_index;
+	unsigned int page_fill_misses;
 	unsigned int dtsize;
 	signed char duped;
 	bool eof;
-- 
2.35.1



* [PATCH v6 06/13] NFS: Improve heuristic for readdirplus
  2022-02-21 16:08         ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir trondmy
@ 2022-02-21 16:08           ` trondmy
  2022-02-21 16:08             ` [PATCH v6 07/13] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
  2022-02-21 16:45           ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir Benjamin Coddington
  1 sibling, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.

The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.

To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c           | 83 ++++++++++++++++++++++++++----------------
 fs/nfs/inode.c         |  4 +-
 fs/nfs/internal.h      |  4 +-
 fs/nfs/nfstrace.h      |  1 -
 include/linux/nfs_fs.h |  5 ++-
 5 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d0707e63ce45..c4d962bca9ef 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -87,8 +87,7 @@ alloc_nfs_open_dir_context(struct inode *dir)
 			nfs_set_cache_invalid(dir,
 					      NFS_INO_INVALID_DATA |
 						      NFS_INO_REVAL_FORCED);
-		list_add(&ctx->list, &nfsi->open_files);
-		clear_bit(NFS_INO_FORCE_READDIR, &nfsi->flags);
+		list_add_tail_rcu(&ctx->list, &nfsi->open_files);
 		spin_unlock(&dir->i_lock);
 		return ctx;
 	}
@@ -98,9 +97,9 @@ alloc_nfs_open_dir_context(struct inode *dir)
 static void put_nfs_open_dir_context(struct inode *dir, struct nfs_open_dir_context *ctx)
 {
 	spin_lock(&dir->i_lock);
-	list_del(&ctx->list);
+	list_del_rcu(&ctx->list);
 	spin_unlock(&dir->i_lock);
-	kfree(ctx);
+	kfree_rcu(ctx, rcu_head);
 }
 
 /*
@@ -567,7 +566,6 @@ static int nfs_readdir_xdr_filler(struct nfs_readdir_descriptor *desc,
 		/* We requested READDIRPLUS, but the server doesn't grok it */
 		if (error == -ENOTSUPP && desc->plus) {
 			NFS_SERVER(inode)->caps &= ~NFS_CAP_READDIRPLUS;
-			clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);
 			desc->plus = arg.plus = false;
 			goto again;
 		}
@@ -617,51 +615,61 @@ int nfs_same_file(struct dentry *dentry, struct nfs_entry *entry)
 	return 1;
 }
 
-static
-bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
+#define NFS_READDIR_CACHE_USAGE_THRESHOLD (8UL)
+
+static bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx,
+				unsigned int cache_hits,
+				unsigned int cache_misses)
 {
 	if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
 		return false;
-	if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
-		return true;
-	if (ctx->pos == 0)
+	if (ctx->pos == 0 ||
+	    cache_hits + cache_misses > NFS_READDIR_CACHE_USAGE_THRESHOLD)
 		return true;
 	return false;
 }
 
 /*
- * This function is called by the lookup and getattr code to request the
+ * This function is called by the getattr code to request the
  * use of readdirplus to accelerate any future lookups in the same
  * directory.
  */
-void nfs_advise_use_readdirplus(struct inode *dir)
+void nfs_readdir_record_entry_cache_hit(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
+	struct nfs_open_dir_context *ctx;
 
-	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
-	    !list_empty(&nfsi->open_files))
-		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
+	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)) {
+		rcu_read_lock();
+		list_for_each_entry_rcu (ctx, &nfsi->open_files, list)
+			atomic_inc(&ctx->cache_hits);
+		rcu_read_unlock();
+	}
 }
 
 /*
  * This function is mainly for use by nfs_getattr().
  *
  * If this is an 'ls -l', we want to force use of readdirplus.
- * Do this by checking if there is an active file descriptor
- * and calling nfs_advise_use_readdirplus, then forcing a
- * cache flush.
  */
-void nfs_force_use_readdirplus(struct inode *dir)
+void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 {
 	struct nfs_inode *nfsi = NFS_I(dir);
+	struct nfs_open_dir_context *ctx;
 
-	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS) &&
-	    !list_empty(&nfsi->open_files)) {
-		set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
-		set_bit(NFS_INO_FORCE_READDIR, &nfsi->flags);
+	if (nfs_server_capable(dir, NFS_CAP_READDIRPLUS)) {
+		rcu_read_lock();
+		list_for_each_entry_rcu (ctx, &nfsi->open_files, list)
+			atomic_inc(&ctx->cache_misses);
+		rcu_read_unlock();
 	}
 }
 
+static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
+{
+	nfs_readdir_record_entry_cache_miss(dir);
+}
+
 static
 void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		unsigned long dir_verifier)
@@ -1101,6 +1109,20 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	return status;
 }
 
+#define NFS_READDIR_CACHE_MISS_THRESHOLD (16UL)
+
+static void nfs_readdir_handle_cache_misses(struct inode *inode,
+					    struct nfs_readdir_descriptor *desc,
+					    pgoff_t page_index,
+					    unsigned int cache_misses)
+{
+	if (desc->ctx->pos == 0 ||
+	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD ||
+	    !nfs_readdir_may_fill_pagecache(desc))
+		return;
+	invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+}
+
 /* The file offset position represents the dirent entry number.  A
    last cookie cache takes care of the common case of reading the
    whole directory.
@@ -1112,6 +1134,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_open_dir_context *dir_ctx = file->private_data;
 	struct nfs_readdir_descriptor *desc;
+	unsigned int cache_hits, cache_misses;
 	pgoff_t page_index;
 	int res;
 
@@ -1137,7 +1160,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		goto out;
 	desc->file = file;
 	desc->ctx = ctx;
-	desc->plus = nfs_use_readdirplus(inode, ctx);
 	desc->page_index_max = -1;
 
 	spin_lock(&file->f_lock);
@@ -1150,6 +1172,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->page_fill_misses = dir_ctx->page_fill_misses;
 	nfs_set_dtsize(desc, dir_ctx->dtsize);
 	memcpy(desc->verf, dir_ctx->verf, sizeof(desc->verf));
+	cache_hits = atomic_xchg(&dir_ctx->cache_hits, 0);
+	cache_misses = atomic_xchg(&dir_ctx->cache_misses, 0);
 	spin_unlock(&file->f_lock);
 
 	if (desc->eof) {
@@ -1157,9 +1181,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		goto out_free;
 	}
 
-	if (test_and_clear_bit(NFS_INO_FORCE_READDIR, &nfsi->flags) &&
-	    list_is_singular(&nfsi->open_files))
-		invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+	desc->plus = nfs_use_readdirplus(inode, ctx, cache_hits, cache_misses);
+	nfs_readdir_handle_cache_misses(inode, desc, page_index, cache_misses);
 
 	do {
 		res = readdir_search_pagecache(desc);
@@ -1178,7 +1201,6 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 			break;
 		}
 		if (res == -ETOOSMALL && desc->plus) {
-			clear_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags);
 			nfs_zap_caches(inode);
 			desc->page_index = 0;
 			desc->plus = false;
@@ -1597,7 +1619,7 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 	nfs_set_verifier(dentry, dir_verifier);
 
 	/* set a readdirplus hint that we had a cache miss */
-	nfs_force_use_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir);
 	ret = 1;
 out:
 	nfs_free_fattr(fattr);
@@ -1654,7 +1676,6 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 				nfs_mark_dir_for_revalidate(dir);
 			goto out_bad;
 		}
-		nfs_advise_use_readdirplus(dir);
 		goto out_valid;
 	}
 
@@ -1859,7 +1880,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 		goto out;
 
 	/* Notify readdir to use READDIRPLUS */
-	nfs_force_use_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir);
 
 no_entry:
 	res = d_splice_alias(inode, dentry);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f9fc506ebb29..1bef81f5373a 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -789,7 +789,7 @@ static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
 	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
 		return;
 	parent = dget_parent(dentry);
-	nfs_force_use_readdirplus(d_inode(parent));
+	nfs_readdir_record_entry_cache_miss(d_inode(parent));
 	dput(parent);
 }
 
@@ -800,7 +800,7 @@ static void nfs_readdirplus_parent_cache_hit(struct dentry *dentry)
 	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
 		return;
 	parent = dget_parent(dentry);
-	nfs_advise_use_readdirplus(d_inode(parent));
+	nfs_readdir_record_entry_cache_hit(d_inode(parent));
 	dput(parent);
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2de7c56a1fbe..46dc97b65661 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -366,8 +366,8 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
 			   const struct nfs_client_initdata *);
 
 /* dir.c */
-extern void nfs_advise_use_readdirplus(struct inode *dir);
-extern void nfs_force_use_readdirplus(struct inode *dir);
+extern void nfs_readdir_record_entry_cache_hit(struct inode *dir);
+extern void nfs_readdir_record_entry_cache_miss(struct inode *dir);
 extern unsigned long nfs_access_cache_count(struct shrinker *shrink,
 					    struct shrink_control *sc);
 extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 45a310b586ce..3672f6703ee7 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -36,7 +36,6 @@
 
 #define nfs_show_nfsi_flags(v) \
 	__print_flags(v, "|", \
-			{ BIT(NFS_INO_ADVISE_RDPLUS), "ADVISE_RDPLUS" }, \
 			{ BIT(NFS_INO_STALE), "STALE" }, \
 			{ BIT(NFS_INO_ACL_LRU_SET), "ACL_LRU_SET" }, \
 			{ BIT(NFS_INO_INVALIDATING), "INVALIDATING" }, \
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 3165927048e4..0a5425a58bbd 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -101,6 +101,8 @@ struct nfs_open_context {
 
 struct nfs_open_dir_context {
 	struct list_head list;
+	atomic_t cache_hits;
+	atomic_t cache_misses;
 	unsigned long attr_gencount;
 	__be32	verf[NFS_DIR_VERIFIER_SIZE];
 	__u64 dir_cookie;
@@ -110,6 +112,7 @@ struct nfs_open_dir_context {
 	unsigned int dtsize;
 	signed char duped;
 	bool eof;
+	struct rcu_head rcu_head;
 };
 
 /*
@@ -274,13 +277,11 @@ struct nfs4_copy_state {
 /*
  * Bit offsets in flags field
  */
-#define NFS_INO_ADVISE_RDPLUS	(0)		/* advise readdirplus */
 #define NFS_INO_STALE		(1)		/* possible stale inode */
 #define NFS_INO_ACL_LRU_SET	(2)		/* Inode is on the LRU list */
 #define NFS_INO_INVALIDATING	(3)		/* inode is being invalidated */
 #define NFS_INO_PRESERVE_UNLINKED (4)		/* preserve file if removed while open */
 #define NFS_INO_FSCACHE		(5)		/* inode can be cached by FS-Cache */
-#define NFS_INO_FORCE_READDIR	(7)		/* force readdirplus */
 #define NFS_INO_LAYOUTCOMMIT	(9)		/* layoutcommit required */
 #define NFS_INO_LAYOUTCOMMITTING (10)		/* layoutcommit inflight */
 #define NFS_INO_LAYOUTSTATS	(11)		/* layoutstats inflight */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v6 07/13] NFS: Don't ask for readdirplus unless it can help nfs_getattr()
  2022-02-21 16:08           ` [PATCH v6 06/13] NFS: Improve heuristic for readdirplus trondmy
@ 2022-02-21 16:08             ` trondmy
  2022-02-21 16:08               ` [PATCH v6 08/13] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If attribute caching is turned off, or the attribute cache timeout is
set very low, then use of readdirplus is not going to help stat()
performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 1bef81f5373a..9d2af9887715 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -782,24 +782,26 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr,
 }
 EXPORT_SYMBOL_GPL(nfs_setattr_update_inode);
 
-static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
+/*
+ * Don't request help from readdirplus if the file is being written to,
+ * or if attribute caching is turned off
+ */
+static bool nfs_getattr_readdirplus_enable(const struct inode *inode)
 {
-	struct dentry *parent;
+	return nfs_server_capable(inode, NFS_CAP_READDIRPLUS) &&
+	       !nfs_have_writebacks(inode) && NFS_MAXATTRTIMEO(inode) > 5 * HZ;
+}
 
-	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
-		return;
-	parent = dget_parent(dentry);
+static void nfs_readdirplus_parent_cache_miss(struct dentry *dentry)
+{
+	struct dentry *parent = dget_parent(dentry);
 	nfs_readdir_record_entry_cache_miss(d_inode(parent));
 	dput(parent);
 }
 
 static void nfs_readdirplus_parent_cache_hit(struct dentry *dentry)
 {
-	struct dentry *parent;
-
-	if (!nfs_server_capable(d_inode(dentry), NFS_CAP_READDIRPLUS))
-		return;
-	parent = dget_parent(dentry);
+	struct dentry *parent = dget_parent(dentry);
 	nfs_readdir_record_entry_cache_hit(d_inode(parent));
 	dput(parent);
 }
@@ -837,6 +839,7 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 	int err = 0;
 	bool force_sync = query_flags & AT_STATX_FORCE_SYNC;
 	bool do_update = false;
+	bool readdirplus_enabled = nfs_getattr_readdirplus_enable(inode);
 
 	trace_nfs_getattr_enter(inode);
 
@@ -845,7 +848,8 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 			STATX_INO | STATX_SIZE | STATX_BLOCKS;
 
 	if ((query_flags & AT_STATX_DONT_SYNC) && !force_sync) {
-		nfs_readdirplus_parent_cache_hit(path->dentry);
+		if (readdirplus_enabled)
+			nfs_readdirplus_parent_cache_hit(path->dentry);
 		goto out_no_revalidate;
 	}
 
@@ -895,15 +899,12 @@ int nfs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		do_update |= cache_validity & NFS_INO_INVALID_BLOCKS;
 
 	if (do_update) {
-		/* Update the attribute cache */
-		if (!(server->flags & NFS_MOUNT_NOAC))
+		if (readdirplus_enabled)
 			nfs_readdirplus_parent_cache_miss(path->dentry);
-		else
-			nfs_readdirplus_parent_cache_hit(path->dentry);
 		err = __nfs_revalidate_inode(server, inode);
 		if (err)
 			goto out;
-	} else
+	} else if (readdirplus_enabled)
 		nfs_readdirplus_parent_cache_hit(path->dentry);
 out_no_revalidate:
 	/* Only return attributes that were revalidated. */
-- 
2.35.1



* [PATCH v6 08/13] NFSv4: Ask for a full XDR buffer of readdir goodness
  2022-02-21 16:08             ` [PATCH v6 07/13] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
@ 2022-02-21 16:08               ` trondmy
  2022-02-21 16:08                 ` [PATCH v6 09/13] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Instead of pretending that we know the ratio of directory info vs
readdirplus attribute info, just set the 'dircount' field to the same
value as the 'maxcount' field.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/nfs3xdr.c | 7 ++++---
 fs/nfs/nfs4xdr.c | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index 9274c9c5efea..feb6e2e36138 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -1261,6 +1261,8 @@ static void nfs3_xdr_enc_readdir3args(struct rpc_rqst *req,
 static void encode_readdirplus3args(struct xdr_stream *xdr,
 				    const struct nfs3_readdirargs *args)
 {
+	uint32_t dircount = args->count;
+	uint32_t maxcount = args->count;
 	__be32 *p;
 
 	encode_nfs_fh3(xdr, args->fh);
@@ -1273,9 +1275,8 @@ static void encode_readdirplus3args(struct xdr_stream *xdr,
 	 * readdirplus: need dircount + buffer size.
 	 * We just make sure we make dircount big enough
 	 */
-	*p++ = cpu_to_be32(args->count >> 3);
-
-	*p = cpu_to_be32(args->count);
+	*p++ = cpu_to_be32(dircount);
+	*p = cpu_to_be32(maxcount);
 }
 
 static void nfs3_xdr_enc_readdirplus3args(struct rpc_rqst *req,
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 8e70b92df4cc..b7780b97dc4d 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -1605,7 +1605,8 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 		FATTR4_WORD0_RDATTR_ERROR,
 		FATTR4_WORD1_MOUNTED_ON_FILEID,
 	};
-	uint32_t dircount = readdir->count >> 1;
+	uint32_t dircount = readdir->count;
+	uint32_t maxcount = readdir->count;
 	__be32 *p, verf[2];
 	uint32_t attrlen = 0;
 	unsigned int i;
@@ -1618,7 +1619,6 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 			FATTR4_WORD1_SPACE_USED|FATTR4_WORD1_TIME_ACCESS|
 			FATTR4_WORD1_TIME_METADATA|FATTR4_WORD1_TIME_MODIFY;
 		attrs[2] |= FATTR4_WORD2_SECURITY_LABEL;
-		dircount >>= 1;
 	}
 	/* Use mounted_on_fileid only if the server supports it */
 	if (!(readdir->bitmask[1] & FATTR4_WORD1_MOUNTED_ON_FILEID))
@@ -1634,7 +1634,7 @@ static void encode_readdir(struct xdr_stream *xdr, const struct nfs4_readdir_arg
 	encode_nfs4_verifier(xdr, &readdir->verifier);
 	p = reserve_space(xdr, 12 + (attrlen << 2));
 	*p++ = cpu_to_be32(dircount);
-	*p++ = cpu_to_be32(readdir->count);
+	*p++ = cpu_to_be32(maxcount);
 	*p++ = cpu_to_be32(attrlen);
 	for (i = 0; i < attrlen; i++)
 		*p++ = cpu_to_be32(attrs[i]);
-- 
2.35.1



* [PATCH v6 09/13] NFS: Readdirplus can't help lookup for case insensitive filesystems
  2022-02-21 16:08               ` [PATCH v6 08/13] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
@ 2022-02-21 16:08                 ` trondmy
  2022-02-21 16:08                   ` [PATCH v6 10/13] NFS: Don't request readdirplus when revalidation was forced trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the filesystem is case insensitive, then readdirplus can't help with
cache misses, since it won't return case-folded variants of the filename.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index c4d962bca9ef..b1e6c56d7e1a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -667,6 +667,8 @@ void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 
 static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
 {
+	if (nfs_server_capable(dir, NFS_CAP_CASE_INSENSITIVE))
+		return;
 	nfs_readdir_record_entry_cache_miss(dir);
 }
 
-- 
2.35.1



* [PATCH v6 10/13] NFS: Don't request readdirplus when revalidation was forced
  2022-02-21 16:08                 ` [PATCH v6 09/13] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
@ 2022-02-21 16:08                   ` trondmy
  2022-02-21 16:08                     ` [PATCH v6 11/13] NFS: Add basic readdir tracing trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

If the revalidation was forced due to the presence of a LOOKUP_EXCL or
a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help
when we're doing a path component (LOOKUP_PARENT) lookup.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index b1e6c56d7e1a..b6e21c5e1a0f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -665,10 +665,13 @@ void nfs_readdir_record_entry_cache_miss(struct inode *dir)
 	}
 }
 
-static void nfs_lookup_advise_force_readdirplus(struct inode *dir)
+static void nfs_lookup_advise_force_readdirplus(struct inode *dir,
+						unsigned int flags)
 {
 	if (nfs_server_capable(dir, NFS_CAP_CASE_INSENSITIVE))
 		return;
+	if (flags & (LOOKUP_EXCL | LOOKUP_PARENT | LOOKUP_REVAL))
+		return;
 	nfs_readdir_record_entry_cache_miss(dir);
 }
 
@@ -1582,15 +1585,17 @@ nfs_lookup_revalidate_delegated(struct inode *dir, struct dentry *dentry,
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 }
 
-static int
-nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
-			     struct inode *inode)
+static int nfs_lookup_revalidate_dentry(struct inode *dir,
+					struct dentry *dentry,
+					struct inode *inode, unsigned int flags)
 {
 	struct nfs_fh *fhandle;
 	struct nfs_fattr *fattr;
 	unsigned long dir_verifier;
 	int ret;
 
+	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
+
 	ret = -ENOMEM;
 	fhandle = nfs_alloc_fhandle();
 	fattr = nfs_alloc_fattr_with_label(NFS_SERVER(inode));
@@ -1611,6 +1616,10 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 		}
 		goto out;
 	}
+
+	/* Request help from readdirplus */
+	nfs_lookup_advise_force_readdirplus(dir, flags);
+
 	ret = 0;
 	if (nfs_compare_fh(NFS_FH(inode), fhandle))
 		goto out;
@@ -1620,8 +1629,6 @@ nfs_lookup_revalidate_dentry(struct inode *dir, struct dentry *dentry,
 	nfs_setsecurity(inode, fattr);
 	nfs_set_verifier(dentry, dir_verifier);
 
-	/* set a readdirplus hint that we had a cache miss */
-	nfs_lookup_advise_force_readdirplus(dir);
 	ret = 1;
 out:
 	nfs_free_fattr(fattr);
@@ -1687,8 +1694,7 @@ nfs_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 	if (NFS_STALE(inode))
 		goto out_bad;
 
-	trace_nfs_lookup_revalidate_enter(dir, dentry, flags);
-	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode, flags);
 out_valid:
 	return nfs_lookup_revalidate_done(dir, dentry, inode, 1);
 out_bad:
@@ -1882,7 +1888,7 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 		goto out;
 
 	/* Notify readdir to use READDIRPLUS */
-	nfs_lookup_advise_force_readdirplus(dir);
+	nfs_lookup_advise_force_readdirplus(dir, flags);
 
 no_entry:
 	res = d_splice_alias(inode, dentry);
@@ -2145,7 +2151,7 @@ nfs4_do_lookup_revalidate(struct inode *dir, struct dentry *dentry,
 reval_dentry:
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
-	return nfs_lookup_revalidate_dentry(dir, dentry, inode);
+	return nfs_lookup_revalidate_dentry(dir, dentry, inode, flags);
 
 full_reval:
 	return nfs_do_lookup_revalidate(dir, dentry, flags);
-- 
2.35.1



* [PATCH v6 11/13] NFS: Add basic readdir tracing
  2022-02-21 16:08                   ` [PATCH v6 10/13] NFS: Don't request readdirplus when revalidation was forced trondmy
@ 2022-02-21 16:08                     ` trondmy
  2022-02-21 16:08                       ` [PATCH v6 12/13] NFS: Trace effects of readdirplus on the dcache trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Add tracing to track how often the client goes to the server for updated
readdir information.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      | 13 ++++++++-
 fs/nfs/nfstrace.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index b6e21c5e1a0f..273a35851e42 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -966,10 +966,14 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		if (!nfs_readdir_may_fill_pagecache(desc))
 			return -EBADCOOKIE;
 		desc->page_index_max = desc->page_index;
+		trace_nfs_readdir_cache_fill(desc->file, nfsi->cookieverf,
+					     desc->last_cookie,
+					     desc->page_index, desc->dtsize);
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
 		if (res < 0) {
 			nfs_readdir_page_unlock_and_put_cached(desc);
+			trace_nfs_readdir_cache_fill_done(inode, res);
 			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
 				invalidate_inode_pages2(desc->file->f_mapping);
 				desc->page_index = 0;
@@ -1085,7 +1089,14 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	desc->duped = 0;
 	desc->page_index_max = 0;
 
+	trace_nfs_readdir_uncached(desc->file, desc->verf, desc->last_cookie,
+				   -1, desc->dtsize);
+
 	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
+	if (status < 0) {
+		trace_nfs_readdir_uncached_done(file_inode(desc->file), status);
+		goto out_free;
+	}
 
 	for (i = 0; !desc->eob && i < sz && arrays[i]; i++) {
 		desc->page = arrays[i];
@@ -1104,7 +1115,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 			 i < (desc->page_index_max >> 1))
 			nfs_shrink_dtsize(desc);
 	}
-
+out_free:
 	for (i = 0; i < sz && arrays[i]; i++)
 		nfs_readdir_page_array_free(arrays[i]);
 out:
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 3672f6703ee7..c2d0543ecb2d 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -160,6 +160,8 @@ DEFINE_NFS_INODE_EVENT(nfs_fsync_enter);
 DEFINE_NFS_INODE_EVENT_DONE(nfs_fsync_exit);
 DEFINE_NFS_INODE_EVENT(nfs_access_enter);
 DEFINE_NFS_INODE_EVENT_DONE(nfs_set_cache_invalid);
+DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_cache_fill_done);
+DEFINE_NFS_INODE_EVENT_DONE(nfs_readdir_uncached_done);
 
 TRACE_EVENT(nfs_access_exit,
 		TP_PROTO(
@@ -271,6 +273,72 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
 DEFINE_NFS_UPDATE_SIZE_EVENT(update);
 DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
 
+DECLARE_EVENT_CLASS(nfs_readdir_event,
+		TP_PROTO(
+			const struct file *file,
+			const __be32 *verifier,
+			u64 cookie,
+			pgoff_t page_index,
+			unsigned int dtsize
+		),
+
+		TP_ARGS(file, verifier, cookie, page_index, dtsize),
+
+		TP_STRUCT__entry(
+			__field(dev_t, dev)
+			__field(u32, fhandle)
+			__field(u64, fileid)
+			__field(u64, version)
+			__array(char, verifier, NFS4_VERIFIER_SIZE)
+			__field(u64, cookie)
+			__field(pgoff_t, index)
+			__field(unsigned int, dtsize)
+		),
+
+		TP_fast_assign(
+			const struct inode *dir = file_inode(file);
+			const struct nfs_inode *nfsi = NFS_I(dir);
+
+			__entry->dev = dir->i_sb->s_dev;
+			__entry->fileid = nfsi->fileid;
+			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
+			__entry->version = inode_peek_iversion_raw(dir);
+			if (cookie != 0)
+				memcpy(__entry->verifier, verifier,
+				       NFS4_VERIFIER_SIZE);
+			else
+				memset(__entry->verifier, 0,
+				       NFS4_VERIFIER_SIZE);
+			__entry->cookie = cookie;
+			__entry->index = page_index;
+			__entry->dtsize = dtsize;
+		),
+
+		TP_printk(
+			"fileid=%02x:%02x:%llu fhandle=0x%08x version=%llu "
+			"cookie=%s:0x%llx cache_index=%lu dtsize=%u",
+			MAJOR(__entry->dev), MINOR(__entry->dev),
+			(unsigned long long)__entry->fileid, __entry->fhandle,
+			__entry->version, show_nfs4_verifier(__entry->verifier),
+			(unsigned long long)__entry->cookie, __entry->index,
+			__entry->dtsize
+		)
+);
+
+#define DEFINE_NFS_READDIR_EVENT(name) \
+	DEFINE_EVENT(nfs_readdir_event, name, \
+			TP_PROTO( \
+				const struct file *file, \
+				const __be32 *verifier, \
+				u64 cookie, \
+				pgoff_t page_index, \
+				unsigned int dtsize \
+				), \
+			TP_ARGS(file, verifier, cookie, page_index, dtsize))
+
+DEFINE_NFS_READDIR_EVENT(nfs_readdir_cache_fill);
+DEFINE_NFS_READDIR_EVENT(nfs_readdir_uncached);
+
 DECLARE_EVENT_CLASS(nfs_lookup_event,
 		TP_PROTO(
 			const struct inode *dir,
-- 
2.35.1



* [PATCH v6 12/13] NFS: Trace effects of readdirplus on the dcache
  2022-02-21 16:08                     ` [PATCH v6 11/13] NFS: Add basic readdir tracing trondmy
@ 2022-02-21 16:08                       ` trondmy
  2022-02-21 16:08                         ` [PATCH v6 13/13] NFS: Trace effects of the readdirplus heuristic trondmy
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Trace the effects of readdirplus on attribute and dentry revalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      | 5 +++++
 fs/nfs/nfstrace.h | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 273a35851e42..bcbfe03e3835 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -725,8 +725,12 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 			status = nfs_refresh_inode(d_inode(dentry), entry->fattr);
 			if (!status)
 				nfs_setsecurity(d_inode(dentry), entry->fattr);
+			trace_nfs_readdir_lookup_revalidate(d_inode(parent),
+							    dentry, 0, status);
 			goto out;
 		} else {
+			trace_nfs_readdir_lookup_revalidate_failed(
+				d_inode(parent), dentry, 0);
 			d_invalidate(dentry);
 			dput(dentry);
 			dentry = NULL;
@@ -748,6 +752,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		dentry = alias;
 	}
 	nfs_set_verifier(dentry, dir_verifier);
+	trace_nfs_readdir_lookup(d_inode(parent), dentry, 0);
 out:
 	dput(dentry);
 }
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index c2d0543ecb2d..7c1102b991d0 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -432,6 +432,9 @@ DEFINE_NFS_LOOKUP_EVENT(nfs_lookup_enter);
 DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_exit);
 DEFINE_NFS_LOOKUP_EVENT(nfs_lookup_revalidate_enter);
 DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_lookup_revalidate_exit);
+DEFINE_NFS_LOOKUP_EVENT(nfs_readdir_lookup);
+DEFINE_NFS_LOOKUP_EVENT(nfs_readdir_lookup_revalidate_failed);
+DEFINE_NFS_LOOKUP_EVENT_DONE(nfs_readdir_lookup_revalidate);
 
 TRACE_EVENT(nfs_atomic_open_enter,
 		TP_PROTO(
-- 
2.35.1



* [PATCH v6 13/13] NFS: Trace effects of the readdirplus heuristic
  2022-02-21 16:08                       ` [PATCH v6 12/13] NFS: Trace effects of readdirplus on the dcache trondmy
@ 2022-02-21 16:08                         ` trondmy
  2022-02-23 13:40                           ` [PATCH v3 1/8] NFS: save the directory's change attribute on pagecache pages Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: trondmy @ 2022-02-21 16:08 UTC (permalink / raw)
  To: linux-nfs

From: Trond Myklebust <trond.myklebust@hammerspace.com>

Enable tracking of when the readdirplus heuristic causes a page cache
invalidation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
---
 fs/nfs/dir.c      |  6 +++++-
 fs/nfs/nfstrace.h | 50 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index bcbfe03e3835..9f48c75dbf4c 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1141,7 +1141,11 @@ static void nfs_readdir_handle_cache_misses(struct inode *inode,
 	    cache_misses <= NFS_READDIR_CACHE_MISS_THRESHOLD ||
 	    !nfs_readdir_may_fill_pagecache(desc))
 		return;
-	invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1);
+	if (invalidate_mapping_pages(inode->i_mapping, page_index + 1, -1) == 0)
+		return;
+	trace_nfs_readdir_invalidate_cache_range(
+		inode, (loff_t)(page_index + 1) << PAGE_SHIFT,
+		MAX_LFS_FILESIZE);
 }
 
 /* The file offset position represents the dirent entry number.  A
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 7c1102b991d0..ec2645d20abf 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -273,6 +273,56 @@ DEFINE_NFS_UPDATE_SIZE_EVENT(wcc);
 DEFINE_NFS_UPDATE_SIZE_EVENT(update);
 DEFINE_NFS_UPDATE_SIZE_EVENT(grow);
 
+DECLARE_EVENT_CLASS(nfs_inode_range_event,
+		TP_PROTO(
+			const struct inode *inode,
+			loff_t range_start,
+			loff_t range_end
+		),
+
+		TP_ARGS(inode, range_start, range_end),
+
+		TP_STRUCT__entry(
+			__field(dev_t, dev)
+			__field(u32, fhandle)
+			__field(u64, fileid)
+			__field(u64, version)
+			__field(loff_t, range_start)
+			__field(loff_t, range_end)
+		),
+
+		TP_fast_assign(
+			const struct nfs_inode *nfsi = NFS_I(inode);
+
+			__entry->dev = inode->i_sb->s_dev;
+			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
+			__entry->fileid = nfsi->fileid;
+			__entry->version = inode_peek_iversion_raw(inode);
+			__entry->range_start = range_start;
+			__entry->range_end = range_end;
+		),
+
+		TP_printk(
+			"fileid=%02x:%02x:%llu fhandle=0x%08x version=%llu "
+			"range=[%lld, %lld]",
+			MAJOR(__entry->dev), MINOR(__entry->dev),
+			(unsigned long long)__entry->fileid,
+			__entry->fhandle, __entry->version,
+			__entry->range_start, __entry->range_end
+		)
+);
+
+#define DEFINE_NFS_INODE_RANGE_EVENT(name) \
+	DEFINE_EVENT(nfs_inode_range_event, name, \
+			TP_PROTO( \
+				const struct inode *inode, \
+				loff_t range_start, \
+				loff_t range_end \
+			), \
+			TP_ARGS(inode, range_start, range_end))
+
+DEFINE_NFS_INODE_RANGE_EVENT(nfs_readdir_invalidate_cache_range);
+
 DECLARE_EVENT_CLASS(nfs_readdir_event,
 		TP_PROTO(
 			const struct file *file,
-- 
2.35.1



* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 16:08         ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir trondmy
  2022-02-21 16:08           ` [PATCH v6 06/13] NFS: Improve heuristic for readdirplus trondmy
@ 2022-02-21 16:45           ` Benjamin Coddington
  2022-02-21 19:58             ` Trond Myklebust
  1 sibling, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-21 16:45 UTC (permalink / raw)
  To: trondmy; +Cc: linux-nfs

On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:

> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> When reading a very large directory, we want to try to keep the page
> cache up to date if doing so is inexpensive. Right now, we will try to
> refill the page cache if it is non-empty, irrespective of whether or not
> doing so is going to take a long time.
>
> Replace that algorithm with something that looks at how many times we've
> refilled the page cache without seeing a cache hit.

Hi Trond, I've been following your work here - thanks for it.

I'm wondering if there might be a regression on this patch for the case
where two or more directory readers are part way through a large directory
when the pagecache is truncated.  If I'm reading this correctly, those
readers will stop caching after 5 fills and finish the remainder of their
directory reads in the uncached mode.

Isn't there an OP amplification per reader in this case?

Ben



* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 16:45           ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir Benjamin Coddington
@ 2022-02-21 19:58             ` Trond Myklebust
  2022-02-21 20:22               ` Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2022-02-21 19:58 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
> 
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > 
> > When reading a very large directory, we want to try to keep the
> > page
> > cache up to date if doing so is inexpensive. Right now, we will try
> > to
> > refill the page cache if it is non-empty, irrespective of whether
> > or not
> > doing so is going to take a long time.
> > 
> > Replace that algorithm with something that looks at how many times
> > we've
> > refilled the page cache without seeing a cache hit.
> 
> Hi Trond, I've been following your work here - thanks for it.
> 
> I'm wondering if there might be a regression on this patch for the
> case
> where two or more directory readers are part way through a large
> directory
> when the pagecache is truncated.  If I'm reading this correctly,
> those
> readers will stop caching after 5 fills and finish the remainder of
> their
> directory reads in the uncached mode.
> 
> Isn't there an OP amplification per reader in this case?
> 

Depends... In the old case, we basically stopped doing uncached readdir
if a third process started filling the page cache again. In particular,
this meant we were vulnerable to restarting over and over once page
reclaim started to kick in for very large directories.

In this new one, we have each process give it a try (5 fills each), and
then fall back to uncached. Yes, there will be corner cases where this
will perform less well than the old algorithm, but it should also be
more deterministic.

I am open to suggestions for better ways to determine when to cut over
to uncached readdir. This is one way that I think is better than what
we have; however, I'm sure it can be improved upon.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 19:58             ` Trond Myklebust
@ 2022-02-21 20:22               ` Benjamin Coddington
  2022-02-21 20:55                 ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-21 20:22 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 21 Feb 2022, at 14:58, Trond Myklebust wrote:

> On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
>>
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>
>>> When reading a very large directory, we want to try to keep the
>>> page
>>> cache up to date if doing so is inexpensive. Right now, we will try
>>> to
>>> refill the page cache if it is non-empty, irrespective of whether
>>> or not
>>> doing so is going to take a long time.
>>>
>>> Replace that algorithm with something that looks at how many times
>>> we've
>>> refilled the page cache without seeing a cache hit.
>>
>> Hi Trond, I've been following your work here - thanks for it.
>>
>> I'm wondering if there might be a regression on this patch for the
>> case
>> where two or more directory readers are part way through a large
>> directory
>> when the pagecache is truncated.  If I'm reading this correctly,
>> those
>> readers will stop caching after 5 fills and finish the remainder of
>> their
>> directory reads in the uncached mode.
>>
>> Isn't there an OP amplification per reader in this case?
>>
>
> Depends... In the old case, we basically stopped doing uncached readdir
> if a third process starts filling the page cache again. In particular,
> this means we were vulnerable to restarting over and over once page
> reclaim starts to kick in for very large directories.
>
> In this new one, we have each process give it a try (5 fills each), and
> then fallback to uncached. Yes, there will be corner cases where this
> will perform less well than the old algorithm, but it should also be
> more deterministic.
>
> I am open to suggestions for better ways to determine when to cut over
> to uncached readdir. This is one way, that I think is better than what
> we have, however I'm sure it can be improved upon.

I still have old patches that allow each page to be "versioned" with the
change attribute, page_index, and cookie.  This allows the page cache to be
culled page-by-page, and multiple fillers can continue to fill pages at
"headless" page offsets that match their original cookie and page_index
pair.  This change would mean readers don't have to start over filling the
page cache when the cache is dropped, so we wouldn't need to worry about
when to cut over to the uncached mode - it makes the problem go away.

I felt there wasn't much interest in this work, and our most vocal customer
was happy enough with last winter's readdir improvements (thanks!) that I
didn't follow up, but I can refresh those patches and send them along again.

Ben



* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 20:22               ` Benjamin Coddington
@ 2022-02-21 20:55                 ` Trond Myklebust
  2022-02-21 21:10                   ` Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2022-02-21 20:55 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Mon, 2022-02-21 at 15:22 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 14:58, Trond Myklebust wrote:
> 
> > On Mon, 2022-02-21 at 11:45 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 11:08, trondmy@kernel.org wrote:
> > > 
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > 
> > > > When reading a very large directory, we want to try to keep the
> > > > page
> > > > cache up to date if doing so is inexpensive. Right now, we will
> > > > try
> > > > to
> > > > refill the page cache if it is non-empty, irrespective of
> > > > whether
> > > > or not
> > > > doing so is going to take a long time.
> > > > 
> > > > Replace that algorithm with something that looks at how many
> > > > times
> > > > we've
> > > > refilled the page cache without seeing a cache hit.
> > > 
> > > Hi Trond, I've been following your work here - thanks for it.
> > > 
> > > I'm wondering if there might be a regression on this patch for
> > > the
> > > case
> > > where two or more directory readers are part way through a large
> > > directory
> > > when the pagecache is truncated.  If I'm reading this correctly,
> > > those
> > > readers will stop caching after 5 fills and finish the remainder
> > > of
> > > their
> > > directory reads in the uncached mode.
> > > 
> > > Isn't there an OP amplification per reader in this case?
> > > 
> > 
> > Depends... In the old case, we basically stopped doing uncached
> > readdir
> > if a third process starts filling the page cache again. In
> > particular,
> > this means we were vulnerable to restarting over and over once page
> > reclaim starts to kick in for very large directories.
> > 
> > In this new one, we have each process give it a try (5 fills each),
> > and
> > then fallback to uncached. Yes, there will be corner cases where
> > this
> > will perform less well than the old algorithm, but it should also
> > be
> > more deterministic.
> > 
> > I am open to suggestions for better ways to determine when to cut
> > over
> > to uncached readdir. This is one way, that I think is better than
> > what
> > we have, however I'm sure it can be improved upon.
> 
> I still have old patches that allow each page to be "versioned" with
> the
> change attribute, page_index, and cookie.  This allows the page cache
> to be
> culled page-by-page, and multiple fillers can continue to fill pages
> at
> "headless" page offsets that match their original cookie and
> page_index
> pair.  This change would mean readers don't have to start over
> filling the
> page cache when the cache is dropped, so we wouldn't need to worry
> about
> when to cut over to the uncached mode - it makes the problem go away.
> 
> I felt there wasn't much interest in this work, and our most vocal
> customer
> was happy enough with last winter's readdir improvements (thanks!)
> that I
> didn't follow up, but I can refresh those patches and send them along
> again.
> 

We will always need the ability to cut over to uncached readdir.

If the cookie is no longer returned by the server because one or more
files were deleted then we need to resolve the situation somehow (IOW:
the 'rm *' case). The new algorithm _does_ improve performance on those
situations, because it no longer requires us to read the entire
directory before switching over: we try 5 times, then fail over.
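
The fail-over heuristic described above can be modeled in a few lines. This is
a user-space sketch with illustrative names (not the kernel's actual code):
each reader counts page-cache fills that produce no cache hit, and cuts over
to uncached readdir once the count reaches a small threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold and field names are hypothetical, for illustration only. */
#define READDIR_CACHE_MISS_THRESHOLD 5

struct readdir_ctx {
	unsigned int cache_misses;	/* consecutive fills without a cache hit */
	bool force_uncached;		/* stop filling the page cache */
};

static void record_fill(struct readdir_ctx *ctx, bool was_cache_hit)
{
	if (was_cache_hit) {
		ctx->cache_misses = 0;	/* the cache is proving useful */
		return;
	}
	if (++ctx->cache_misses >= READDIR_CACHE_MISS_THRESHOLD)
		ctx->force_uncached = true;	/* fail over to uncached readdir */
}
```

The point of the counter reset is that a single cache hit is evidence another
reader (or a re-read) is benefiting from the cache, so filling continues.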

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 20:55                 ` Trond Myklebust
@ 2022-02-21 21:10                   ` Benjamin Coddington
  2022-02-21 23:20                     ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-21 21:10 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>
> We will always need the ability to cut over to uncached readdir.

Yes.

> If the cookie is no longer returned by the server because one or more
> files were deleted then we need to resolve the situation somehow (IOW:
> the 'rm *' case). The new algorithm _does_ improve performance on those
> situations, because it no longer requires us to read the entire
> directory before switching over: we try 5 times, then fail over.

Yes, using per-page validation doesn't remove the need for uncached readdir.
It does allow a reader to simply resume filling the cache where it left
off.  There's no need to try 5 times and fail over.  And there's no need to
make a trade-off and make the situation worse in certain scenarios.

I thought I'd point that out and make an offer to re-submit it.  Any
interest?

Ben


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 21:10                   ` Benjamin Coddington
@ 2022-02-21 23:20                     ` Trond Myklebust
  2022-02-22 12:50                       ` Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2022-02-21 23:20 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > 
> > We will always need the ability to cut over to uncached readdir.
> 
> Yes.
> 
> > If the cookie is no longer returned by the server because one or
> > more
> > files were deleted then we need to resolve the situation somehow
> > (IOW:
> > the 'rm *' case). The new algorithm _does_ improve performance on
> > those
> > situations, because it no longer requires us to read the entire
> > directory before switching over: we try 5 times, then fail over.
> 
> Yes, using per-page validation doesn't remove the need for uncached
> readdir.
> It does allow a reader to simply resume filling the cache where it
> left
> off.  There's no need to try 5 times and fail over.  And there's no
> need to
> make a trade-off and make the situation worse in certain scenarios.
> 
> I thought I'd point that out and make an offer to re-submit it.  Any
> interest?
> 

As I recall, I had concerns about that approach. Can you explain again
how it will work?

A few of the concerns I have revolve around telldir()/seekdir(). If the
platform is 32-bit, then we cannot use cookies as the telldir() output,
and so our only option is to use offsets into the page cache (this is
why this patch carves out an exception if desc->dir_cookie == 0). How
would that work with your scheme?
Even in the 64-bit case where we are able to use cookies for
telldir()/seekdir(), how do we determine an appropriate page index
after a seekdir()?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-21 23:20                     ` Trond Myklebust
@ 2022-02-22 12:50                       ` Benjamin Coddington
  2022-02-22 20:11                         ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-22 12:50 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 21 Feb 2022, at 18:20, Trond Myklebust wrote:

> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>
>>> We will always need the ability to cut over to uncached readdir.
>>
>> Yes.
>>
>>> If the cookie is no longer returned by the server because one or more
>>> files were deleted then we need to resolve the situation somehow (IOW:
>>> the 'rm *' case). The new algorithm _does_ improve performance on those
>>> situations, because it no longer requires us to read the entire
>>> directory before switching over: we try 5 times, then fail over.
>>
>> Yes, using per-page validation doesn't remove the need for uncached
>> readdir.  It does allow a reader to simply resume filling the cache where
>> it left off.  There's no need to try 5 times and fail over.  And there's
>> no need to make a trade-off and make the situation worse in certain
>> scenarios.
>>
>> I thought I'd point that out and make an offer to re-submit it.  Any
>> interest?
>>
>
> As I recall, I had concerns about that approach. Can you explain again
> how it will work?

Every page of readdir results has the directory's change attr stored on the
page.  That, along with the page's index and the first cookie is enough
information to determine if the page's data can be used by another process.

Which means that when the pagecache is dropped, fillers don't have to restart
filling the cache at page index 0, they can continue to fill at whatever
index they were at previously.  If another process finds a page that doesn't
match its page index, cookie, and the current directory's change attr, the
page is dropped and refilled from that process' indexing.
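The validation rule can be sketched as follows (a user-space model with
illustrative struct and field names, not the actual patch): a page is reusable
only if its recorded change attribute, page index, and first cookie all match
what the reader expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page tag; the real patches store this via
 * page_private (or in the array header, per the discussion below). */
struct dir_page_tag {
	uint64_t change_attr;	/* directory's change attribute when filled */
	uint64_t index;		/* page index in the filler's stream */
	uint64_t first_cookie;	/* cookie of the first entry on the page */
};

static bool page_usable(const struct dir_page_tag *tag,
			uint64_t cur_change_attr,
			uint64_t want_index, uint64_t want_cookie)
{
	/* All three must match; otherwise the page is dropped and
	 * refilled from the current reader's indexing. */
	return tag->change_attr == cur_change_attr &&
	       tag->index == want_index &&
	       tag->first_cookie == want_cookie;
}
```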

> A few of the concerns I have revolve around telldir()/seekdir(). If the
> platform is 32-bit, then we cannot use cookies as the telldir() output,
> and so our only option is to use offsets into the page cache (this is
> why this patch carves out an exception if desc->dir_cookie == 0). How
> would that work with your scheme?

For 32-bit seekdir, pages are filled starting at index 0.  This is very
unlikely to match other readers (unless they also do the _same_ seekdir).

> Even in the 64-bit case where we are able to use cookies for
> telldir()/seekdir(), how do we determine an appropriate page index
> after a seekdir()?

We don't.  Instead we start filling at index 0.  Again, the pagecache will
only be useful to other processes that have done the same llseek.

This approach optimizes the pagecache for processes that are doing similar
work, and has the added benefit of scaling well for large directory listings
under memory pressure.  Also a number of classes of directory modifications
(such as renames, or insertions/deletions at locations a reader has moved
beyond) are no longer a reason to re-fill the pagecache from scratch.

Ben


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-22 12:50                       ` Benjamin Coddington
@ 2022-02-22 20:11                         ` Trond Myklebust
  2022-02-22 20:21                           ` Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2022-02-22 20:11 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> 
> > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > 
> > > > We will always need the ability to cut over to uncached
> > > > readdir.
> > > 
> > > Yes.
> > > 
> > > > If the cookie is no longer returned by the server because one
> > > > or more
> > > > files were deleted then we need to resolve the situation
> > > > somehow (IOW:
> > > > the 'rm *' case). The new algorithm _does_ improve performance
> > > > on those
> > > > situations, because it no longer requires us to read the entire
> > > > directory before switching over: we try 5 times, then fail
> > > > over.
> > > 
> > > Yes, using per-page validation doesn't remove the need for
> > > uncached
> > > readdir.  It does allow a reader to simply resume filling the
> > > cache where
> > > it left off.  There's no need to try 5 times and fail over.  And
> > > there's
> > > no need to make a trade-off and make the situation worse in
> > > certain
> > > scenarios.
> > > 
> > > I thought I'd point that out and make an offer to re-submit it. 
> > > Any
> > > interest?
> > > 
> > 
> > As I recall, I had concerns about that approach. Can you explain
> > again
> > how it will work?
> 
> Every page of readdir results has the directory's change attr stored
> on the
> page.  That, along with the page's index and the first cookie is
> enough
> information to determine if the page's data can be used by another
> process.
> 
> Which means that when the pagecache is dropped, fillers don't have to
> restart
> filling the cache at page index 0, they can continue to fill at
> whatever
> index they were at previously.  If another process finds a page that
> doesn't
> match its page index, cookie, and the current directory's change
> attr, the
> page is dropped and refilled from that process' indexing.
> 
> > A few of the concerns I have revolve around telldir()/seekdir(). If
> > the
> > platform is 32-bit, then we cannot use cookies as the telldir()
> > output,
> > and so our only option is to use offsets into the page cache (this
> > is
> > why this patch carves out an exception if desc->dir_cookie == 0).
> > How
> > would that work with your scheme?
> 
> For 32-bit seekdir, pages are filled starting at index 0.  This is
> very
> unlikely to match other readers (unless they also do the _same_
> seekdir).
> 
> > Even in the 64-bit case where we are able to use cookies for
> > telldir()/seekdir(), how do we determine an appropriate page index
> > after a seekdir()?
> 
> We don't.  Instead we start filling at index 0.  Again, the pagecache
> will
> only be useful to other processes that have done the same llseek.
> 
> This approach optimizes the pagecache for processes that are doing
> similar
> work, and has the added benefit of scaling well for large directory
> listings
> under memory pressure.  Also a number of classes of directory
> modifications
> (such as renames, or insertions/deletions at locations a reader has
> moved
> beyond) are no longer a reason to re-fill the pagecache from scratch.
> 

OK, you've got me more or less sold on it.

I'd still like to figure out how to improve the performance for seekdir
(since I do have an interest in re-exporting NFS) but I've been playing
around with a couple of patches that implement your concept and they do
seem to work well for the common case of a linear read through the
directory.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-22 20:11                         ` Trond Myklebust
@ 2022-02-22 20:21                           ` Benjamin Coddington
  2022-02-23 12:17                             ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-22 20:21 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 22 Feb 2022, at 15:11, Trond Myklebust wrote:

> On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
>> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
>>
>>> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>>>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>>>
>>>>> We will always need the ability to cut over to uncached
>>>>> readdir.
>>>>
>>>> Yes.
>>>>
>>>>> If the cookie is no longer returned by the server because one
>>>>> or more
>>>>> files were deleted then we need to resolve the situation
>>>>> somehow (IOW:
>>>>> the 'rm *' case). The new algorithm _does_ improve performance
>>>>> on those
>>>>> situations, because it no longer requires us to read the entire
>>>>> directory before switching over: we try 5 times, then fail
>>>>> over.
>>>>
>>>> Yes, using per-page validation doesn't remove the need for
>>>> uncached
>>>> readdir.  It does allow a reader to simply resume filling the
>>>> cache where
>>>> it left off.  There's no need to try 5 times and fail over.  And
>>>> there's
>>>> no need to make a trade-off and make the situation worse in
>>>> certain
>>>> scenarios.
>>>>
>>>> I thought I'd point that out and make an offer to re-submit it. 
>>>> Any
>>>> interest?
>>>>
>>>
>>> As I recall, I had concerns about that approach. Can you explain
>>> again
>>> how it will work?
>>
>> Every page of readdir results has the directory's change attr stored
>> on the
>> page.  That, along with the page's index and the first cookie is
>> enough
>> information to determine if the page's data can be used by another
>> process.
>>
>> Which means that when the pagecache is dropped, fillers don't have to
>> restart
>> filling the cache at page index 0, they can continue to fill at
>> whatever
>> index they were at previously.  If another process finds a page that
>> doesn't
>> match its page index, cookie, and the current directory's change
>> attr, the
>> page is dropped and refilled from that process' indexing.
>>
>>> A few of the concerns I have revolve around telldir()/seekdir(). If
>>> the
>>> platform is 32-bit, then we cannot use cookies as the telldir()
>>> output,
>>> and so our only option is to use offsets into the page cache (this
>>> is
>>> why this patch carves out an exception if desc->dir_cookie == 0).
>>> How
>>> would that work with your scheme?
>>
>> For 32-bit seekdir, pages are filled starting at index 0.  This is
>> very
>> unlikely to match other readers (unless they also do the _same_
>> seekdir).
>>
>>> Even in the 64-bit case where we are able to use cookies for
>>> telldir()/seekdir(), how do we determine an appropriate page index
>>> after a seekdir()?
>>
>> We don't.  Instead we start filling at index 0.  Again, the pagecache
>> will
>> only be useful to other processes that have done the same llseek.
>>
>> This approach optimizes the pagecache for processes that are doing
>> similar
>> work, and has the added benefit of scaling well for large directory
>> listings
>> under memory pressure.  Also a number of classes of directory
>> modifications
>> (such as renames, or insertions/deletions at locations a reader has
>> moved
>> beyond) are no longer a reason to re-fill the pagecache from scratch.
>>
>
> OK, you've got me more or less sold on it.
>
> I'd still like to figure out how to improve the performance for seekdir
> (since I do have an interest in re-exporting NFS) but I've been playing
> around with a couple of patches that implement your concept and they do
> seem to work well for the common case of a linear read through the
> directory.

Nice.  I have another version from the one I originally posted:
https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/

.. but I don't remember exactly the changes and it needs rebasing.  Should I
rebase it against your testing branch and send the result?

Ben


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-22 20:21                           ` Benjamin Coddington
@ 2022-02-23 12:17                             ` Trond Myklebust
  2022-02-23 13:34                               ` Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Trond Myklebust @ 2022-02-23 12:17 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
> On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
> 
> > On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> > > On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> > > 
> > > > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> > > > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > > > 
> > > > > > We will always need the ability to cut over to uncached
> > > > > > readdir.
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > If the cookie is no longer returned by the server because
> > > > > > one
> > > > > > or more
> > > > > > files were deleted then we need to resolve the situation
> > > > > > somehow (IOW:
> > > > > > the 'rm *' case). The new algorithm _does_ improve
> > > > > > performance
> > > > > > on those
> > > > > > situations, because it no longer requires us to read the
> > > > > > entire
> > > > > > directory before switching over: we try 5 times, then fail
> > > > > > over.
> > > > > 
> > > > > Yes, using per-page validation doesn't remove the need for
> > > > > uncached
> > > > > readdir.  It does allow a reader to simply resume filling the
> > > > > cache where
> > > > > it left off.  There's no need to try 5 times and fail over. 
> > > > > And
> > > > > there's
> > > > > no need to make a trade-off and make the situation worse in
> > > > > certain
> > > > > scenarios.
> > > > > 
> > > > > I thought I'd point that out and make an offer to re-submit
> > > > > it. 
> > > > > Any
> > > > > interest?
> > > > > 
> > > > 
> > > > As I recall, I had concerns about that approach. Can you
> > > > explain
> > > > again
> > > > how it will work?
> > > 
> > > Every page of readdir results has the directory's change attr
> > > stored
> > > on the
> > > page.  That, along with the page's index and the first cookie is
> > > enough
> > > information to determine if the page's data can be used by
> > > another
> > > process.
> > > 
> > > Which means that when the pagecache is dropped, fillers don't
> > > have to
> > > restart
> > > filling the cache at page index 0, they can continue to fill at
> > > whatever
> > > index they were at previously.  If another process finds a page
> > > that
> > > doesn't
> > > match its page index, cookie, and the current directory's change
> > > attr, the
> > > page is dropped and refilled from that process' indexing.
> > > 
> > > > A few of the concerns I have revolve around
> > > > telldir()/seekdir(). If
> > > > the
> > > > platform is 32-bit, then we cannot use cookies as the telldir()
> > > > output,
> > > > and so our only option is to use offsets into the page cache
> > > > (this
> > > > is
> > > > why this patch carves out an exception if desc->dir_cookie ==
> > > > 0).
> > > > How
> > > > would that work with your scheme?
> > > 
> > > For 32-bit seekdir, pages are filled starting at index 0.  This
> > > is
> > > very
> > > unlikely to match other readers (unless they also do the _same_
> > > seekdir).
> > > 
> > > > Even in the 64-bit case where we are able to use cookies for
> > > > telldir()/seekdir(), how do we determine an appropriate page
> > > > index
> > > > after a seekdir()?
> > > 
> > > We don't.  Instead we start filling at index 0.  Again, the
> > > pagecache
> > > will
> > > only be useful to other processes that have done the same llseek.
> > > 
> > > This approach optimizes the pagecache for processes that are
> > > doing
> > > similar
> > > work, and has the added benefit of scaling well for large
> > > directory
> > > listings
> > > under memory pressure.  Also a number of classes of directory
> > > modifications
> > > (such as renames, or insertions/deletions at locations a reader
> > > has
> > > moved
> > > beyond) are no longer a reason to re-fill the pagecache from
> > > scratch.
> > > 
> > 
> > OK, you've got me more or less sold on it.
> > 
> > I'd still like to figure out how to improve the performance for
> > seekdir
> > (since I do have an interest in re-exporting NFS) but I've been
> > playing
> > around with a couple of patches that implement your concept and
> > they do
> > seem to work well for the common case of a linear read through the
> > directory.
> 
> Nice.  I have another version from the one I originally posted:
> https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
> 
> .. but I don't remember exactly the changes and it needs rebasing. 
> Should I
> rebase it against your testing branch and send the result?

My 2 patches did something slightly different to yours, storing the
change attribute in the array header instead of in page_private. That
makes for a slightly smaller change.

Having further slept on the idea, I think I know how to solve the
seekdir() issue, at least for the cases where we can use cookies as
telldir()/seekdir() offsets. If we ditch the linear index, and instead
use a hash of the desc->last_cookie as the index, then we can make
random access to the directories work with your idea.
There are a couple of further problems with this concept, but I think
those are solvable. The issues I have identified so far are:

 * invalidate_inode_pages2_range() no longer works to clear out the
   page cache in a predictable way for the readdirplus heuristic.
 * duplicate cookie detection needs to be changed.

I think those are solvable problems.
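
The hashed-index idea above can be sketched with one helper (a user-space
sketch; the multiplier is the kernel's 64-bit golden-ratio constant as used
by hash_64(), but the function name and exact mapping here are illustrative):
deriving the page index from the cookie that starts the page means a
seekdir() to any cookie maps directly to a page-cache slot, with no need to
replay the directory from index 0.

```c
#include <assert.h>
#include <stdint.h>

/* Map a readdir cookie to a page-cache index via Fibonacci hashing.
 * Any reader that resumes at the same last_cookie computes the same
 * index, so pages filled after a seekdir() remain shareable. */
static uint64_t cookie_to_index(uint64_t last_cookie)
{
	return (last_cookie * 0x61C8864680B583EBull) >> 32;
}
```

A consequence Trond notes above: because indices are no longer linear,
range-based invalidation and duplicate-cookie detection both need rework.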

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-23 12:17                             ` Trond Myklebust
@ 2022-02-23 13:34                               ` Benjamin Coddington
  2022-02-23 21:31                                 ` Trond Myklebust
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:34 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On 23 Feb 2022, at 7:17, Trond Myklebust wrote:

> On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
>> On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
>>
>>> On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
>>>> On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
>>>>
>>>>> On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
>>>>>> On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
>>>>>>>
>>>>>>> We will always need the ability to cut over to uncached
>>>>>>> readdir.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> If the cookie is no longer returned by the server because
>>>>>>> one
>>>>>>> or more
>>>>>>> files were deleted then we need to resolve the situation
>>>>>>> somehow (IOW:
>>>>>>> the 'rm *' case). The new algorithm _does_ improve
>>>>>>> performance
>>>>>>> on those
>>>>>>> situations, because it no longer requires us to read the
>>>>>>> entire
>>>>>>> directory before switching over: we try 5 times, then fail
>>>>>>> over.
>>>>>>
>>>>>> Yes, using per-page validation doesn't remove the need for
>>>>>> uncached
>>>>>> readdir.  It does allow a reader to simply resume filling the
>>>>>> cache where
>>>>>> it left off.  There's no need to try 5 times and fail over. 
>>>>>> And
>>>>>> there's
>>>>>> no need to make a trade-off and make the situation worse in
>>>>>> certain
>>>>>> scenarios.
>>>>>>
>>>>>> I thought I'd point that out and make an offer to re-submit
>>>>>> it. 
>>>>>> Any
>>>>>> interest?
>>>>>>
>>>>>
>>>>> As I recall, I had concerns about that approach. Can you
>>>>> explain
>>>>> again
>>>>> how it will work?
>>>>
>>>> Every page of readdir results has the directory's change attr
>>>> stored
>>>> on the
>>>> page.  That, along with the page's index and the first cookie is
>>>> enough
>>>> information to determine if the page's data can be used by
>>>> another
>>>> process.
>>>>
>>>> Which means that when the pagecache is dropped, fillers don't
>>>> have to
>>>> restart
>>>> filling the cache at page index 0, they can continue to fill at
>>>> whatever
>>>> index they were at previously.  If another process finds a page
>>>> that
>>>> doesn't
>>>> match its page index, cookie, and the current directory's change
>>>> attr, the
>>>> page is dropped and refilled from that process' indexing.
>>>>
>>>>> A few of the concerns I have revolve around
>>>>> telldir()/seekdir(). If
>>>>> the
>>>>> platform is 32-bit, then we cannot use cookies as the telldir()
>>>>> output,
>>>>> and so our only option is to use offsets into the page cache
>>>>> (this
>>>>> is
>>>>> why this patch carves out an exception if desc->dir_cookie ==
>>>>> 0).
>>>>> How
>>>>> would that work with your scheme?
>>>>
>>>> For 32-bit seekdir, pages are filled starting at index 0.  This
>>>> is
>>>> very
>>>> unlikely to match other readers (unless they also do the _same_
>>>> seekdir).
>>>>
>>>>> Even in the 64-bit case where we are able to use cookies for
>>>>> telldir()/seekdir(), how do we determine an appropriate page
>>>>> index
>>>>> after a seekdir()?
>>>>
>>>> We don't.  Instead we start filling at index 0.  Again, the
>>>> pagecache
>>>> will
>>>> only be useful to other processes that have done the same llseek.
>>>>
>>>> This approach optimizes the pagecache for processes that are
>>>> doing
>>>> similar
>>>> work, and has the added benefit of scaling well for large
>>>> directory
>>>> listings
>>>> under memory pressure.  Also a number of classes of directory
>>>> modifications
>>>> (such as renames, or insertions/deletions at locations a reader
>>>> has
>>>> moved
>>>> beyond) are no longer a reason to re-fill the pagecache from
>>>> scratch.
>>>>
>>>
>>> OK, you've got me more or less sold on it.
>>>
>>> I'd still like to figure out how to improve the performance for
>>> seekdir
>>> (since I do have an interest in re-exporting NFS) but I've been
>>> playing
>>> around with a couple of patches that implement your concept and
>>> they do
>>> seem to work well for the common case of a linear read through the
>>> directory.
>>
>> Nice.  I have another version from the one I originally posted:
>> https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
>>
>> .. but I don't remember exactly the changes and it needs rebasing. 
>> Should I
>> rebase it against your testing branch and send the result?
>
> My 2 patches did something slightly different to yours, storing the
> change attribute in the array header instead of in page_private. That
> makes for a slightly smaller change.

I worried that increasing the size of the array header wouldn't allow us
to store as many entries per page.

> Having further slept on the idea, I think I know how to solve the
> seekdir() issue, at least for the cases where we can use cookies as
> telldir()/seekdir() offsets. If we ditch the linear index, and instead
> use a hash of the desc->last_cookie as the index, then we can make
> random access to the directories work with your idea.

Yes!  Nice!

> There are a couple of further problems with this concept, but I think
> those are solvable. The issues I have identified so far are:
>
>  * invalidate_inode_pages2_range() no longer works to clear out the
>    page cache in a predictable way for the readdirplus heuristic.
>  * duplicate cookie detection needs to be changed.

I don't understand those problems yet.  I have a few other things I have
to do, but after those I will come back and give them some attention.

FWIW, I rebased that old series of page-validation work on your
e85f86150b32 NFS: Trace effects of the readdirplus heuristic

I'll send it as a reply to [ v6 13/13] in this posting, but I've only
compile-tested it so far.

Ben


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 1/8] NFS: save the directory's change attribute on pagecache pages
  2022-02-21 16:08                         ` [PATCH v6 13/13] NFS: Trace effects of the readdirplus heuristic trondmy
@ 2022-02-23 13:40                           ` Benjamin Coddington
  2022-02-23 13:40                             ` [PATCH v3 2/8] NFSv4: Send GETATTR with READDIR Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

After a pagecache page has been filled with entries, set PagePrivate and
the directory's change attribute on the page.  This will help us perform
per-page invalidations in a later patch.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 9f48c75dbf4c..79bdcedc0cad 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -54,6 +54,9 @@ static int nfs_closedir(struct inode *, struct file *);
 static int nfs_readdir(struct file *, struct dir_context *);
 static int nfs_fsync_dir(struct file *, loff_t, loff_t, int);
 static loff_t nfs_llseek_dir(struct file *, loff_t, int);
+static void nfs_readdir_invalidatepage(struct page *,
+			unsigned int, unsigned int);
+static int nfs_readdir_clear_page(struct page*, gfp_t);
 static void nfs_readdir_clear_array(struct page*);
 
 const struct file_operations nfs_dir_operations = {
@@ -66,6 +69,8 @@ const struct file_operations nfs_dir_operations = {
 };
 
 const struct address_space_operations nfs_dir_aops = {
+	.invalidatepage = nfs_readdir_invalidatepage,
+	.releasepage = nfs_readdir_clear_page,
 	.freepage = nfs_readdir_clear_array,
 };
 
@@ -212,6 +217,27 @@ static void nfs_readdir_page_init_array(struct page *page, u64 last_cookie)
 	array->last_cookie = last_cookie;
 	array->cookies_are_ordered = 1;
 	kunmap_atomic(array);
+	set_page_private(page, 0);
+}
+
+static int
+nfs_readdir_clear_page(struct page *page, gfp_t gfp_mask)
+{
+	detach_page_private(page);
+	return 1;
+}
+
+static void
+nfs_readdir_invalidatepage(struct page *page, unsigned int offset,
+							unsigned int length)
+{
+	nfs_readdir_clear_page(page, GFP_KERNEL);
+}
+
+static void
+nfs_readdir_set_page_verifier(struct page *page, unsigned long verf)
+{
+	attach_page_private(page, (void *)verf);
 }
 
 /*
@@ -794,6 +820,8 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 		if (status != -ENOSPC)
 			continue;
 
+		nfs_readdir_set_page_verifier(page, desc->dir_verifier);
+
 		if (page->mapping != mapping) {
 			if (!--narrays)
 				break;
@@ -822,10 +850,13 @@ static int nfs_readdir_page_filler(struct nfs_readdir_descriptor *desc,
 	case -EBADCOOKIE:
 		if (entry->eof) {
 			nfs_readdir_page_set_eof(page);
+			nfs_readdir_set_page_verifier(page, desc->dir_verifier);
 			status = 0;
 		}
 		break;
 	case -ENOSPC:
+		nfs_readdir_set_page_verifier(page, desc->dir_verifier);
+		fallthrough;
 	case -EAGAIN:
 		status = 0;
 		break;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 2/8] NFSv4: Send GETATTR with READDIR
  2022-02-23 13:40                           ` [PATCH v3 1/8] NFS: save the directory's change attribute on pagecache pages Benjamin Coddington
@ 2022-02-23 13:40                             ` Benjamin Coddington
  2022-02-23 13:40                               ` [PATCH v3 3/8] NFS: Add a struct to track readdir pagecache location Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

For each batch of entries, track whether the directory has changed.  We can
use this information to better manage the cache when reading long
directories.

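The patch builds the new "_nl" (no-label) bitmask by copying the cache
consistency bitmask and clearing word 2, which after this change carries
only the security-label bit. A minimal sketch of that construction (the
numeric values are placeholders, not the real FATTR4_* constants):

```c
#include <assert.h>
#include <string.h>

typedef unsigned int u32;

/* Build the no-label variant of a 3-word attribute bitmask: identical
 * to the source except word 2 (the label word) is cleared. */
static void build_nl_bitmask(const u32 src[3], u32 dst[3])
{
	memcpy(dst, src, 3 * sizeof(u32));
	dst[2] = 0;	/* drop the label word for non-label requests */
}
```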
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/nfs42proc.c        |  2 +-
 fs/nfs/nfs4proc.c         | 29 +++++++++++++++++++++--------
 fs/nfs/nfs4xdr.c          |  6 ++++++
 include/linux/nfs_fs_sb.h |  5 +++++
 include/linux/nfs_xdr.h   |  2 ++
 5 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 882bf84484ac..3ab54228b2ed 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -1082,7 +1082,7 @@ static int _nfs42_proc_clone(struct rpc_message *msg, struct file *src_f,
 	if (!res.dst_fattr)
 		return -ENOMEM;
 
-	nfs4_bitmask_set(dst_bitmask, server->cache_consistency_bitmask,
+	nfs4_bitmask_set(dst_bitmask, server->cache_consistency_bitmask_nl,
 			 dst_inode, NFS_INO_INVALID_BLOCKS);
 
 	status = nfs4_call_sync(server->client, server, msg,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 73a9b6de666c..45285447c077 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -3665,7 +3665,7 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
 		/* Close-to-open cache consistency revalidation */
 		if (!nfs4_have_delegation(inode, FMODE_READ)) {
 			nfs4_bitmask_set(calldata->arg.bitmask_store,
-					 server->cache_consistency_bitmask,
+					 server->cache_consistency_bitmask_nl,
 					 inode, 0);
 			calldata->arg.bitmask = calldata->arg.bitmask_store;
 		} else
@@ -3905,7 +3905,12 @@ static int _nfs4_server_capabilities(struct nfs_server *server, struct nfs_fh *f
 		memcpy(server->cache_consistency_bitmask, res.attr_bitmask, sizeof(server->cache_consistency_bitmask));
 		server->cache_consistency_bitmask[0] &= FATTR4_WORD0_CHANGE|FATTR4_WORD0_SIZE;
 		server->cache_consistency_bitmask[1] &= FATTR4_WORD1_TIME_METADATA|FATTR4_WORD1_TIME_MODIFY;
-		server->cache_consistency_bitmask[2] = 0;
+		server->cache_consistency_bitmask[2] = res.attr_bitmask[2] & FATTR4_WORD2_SECURITY_LABEL;
+
+		memcpy(server->cache_consistency_bitmask_nl, server->cache_consistency_bitmask, sizeof(server->cache_consistency_bitmask));
+		server->cache_consistency_bitmask_nl[2] = 0;
+
+
 
 		/* Avoid a regression due to buggy server */
 		for (i = 0; i < ARRAY_SIZE(res.exclcreat_bitmask); i++)
@@ -4576,7 +4581,7 @@ static int _nfs4_proc_access(struct inode *inode, struct nfs_access_entry *entry
 		res.fattr = nfs_alloc_fattr();
 		if (res.fattr == NULL)
 			return -ENOMEM;
-		args.bitmask = server->cache_consistency_bitmask;
+		args.bitmask = server->cache_consistency_bitmask_nl;
 	}
 	status = nfs4_call_sync(server->client, server, &msg, &args.seq_args, &res.seq_res, 0);
 	if (!status) {
@@ -5098,14 +5103,19 @@ static int _nfs4_proc_readdir(struct nfs_readdir_arg *nr_arg,
 		.rpc_resp = &res,
 		.rpc_cred = nr_arg->cred,
 	};
-	int			status;
+	int			status = -ENOMEM;
 
 	dprintk("%s: dentry = %pd2, cookie = %llu\n", __func__,
 		nr_arg->dentry, (unsigned long long)nr_arg->cookie);
 	if (!(server->caps & NFS_CAP_SECURITY_LABEL))
-		args.bitmask = server->attr_bitmask_nl;
+		args.bitmask = server->cache_consistency_bitmask_nl;
 	else
-		args.bitmask = server->attr_bitmask;
+		args.bitmask = server->cache_consistency_bitmask;
+
+	res.dir_attr = nfs_alloc_fattr();
+	if (res.dir_attr == NULL)
+		goto out;
+	res.server = server;
 
 	nfs4_setup_readdir(nr_arg->cookie, nr_arg->verf, nr_arg->dentry, &args);
 	res.pgbase = args.pgbase;
@@ -5118,6 +5128,9 @@ static int _nfs4_proc_readdir(struct nfs_readdir_arg *nr_arg,
 
 	nfs_invalidate_atime(dir);
 
+	nfs_refresh_inode(dir, res.dir_attr);
+	nfs_free_fattr(res.dir_attr);
+out:
 	dprintk("%s: returns %d\n", __func__, status);
 	return status;
 }
@@ -5583,7 +5596,7 @@ static void nfs4_proc_write_setup(struct nfs_pgio_header *hdr,
 		hdr->res.fattr = NULL;
 	} else {
 		nfs4_bitmask_set(hdr->args.bitmask_store,
-				 server->cache_consistency_bitmask,
+				 server->cache_consistency_bitmask_nl,
 				 hdr->inode, NFS_INO_INVALID_BLOCKS);
 		hdr->args.bitmask = hdr->args.bitmask_store;
 	}
@@ -6622,7 +6635,7 @@ static int _nfs4_proc_delegreturn(struct inode *inode, const struct cred *cred,
 	data->args.fhandle = &data->fh;
 	data->args.stateid = &data->stateid;
 	nfs4_bitmask_set(data->args.bitmask_store,
-			 server->cache_consistency_bitmask, inode, 0);
+			 server->cache_consistency_bitmask_nl, inode, 0);
 	data->args.bitmask = data->args.bitmask_store;
 	nfs_copy_fh(&data->fh, NFS_FH(inode));
 	nfs4_stateid_copy(&data->stateid, stateid);
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index b7780b97dc4d..1cd0d49ef992 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -469,10 +469,12 @@ static int decode_layoutget(struct xdr_stream *xdr, struct rpc_rqst *req,
 #define NFS4_enc_readdir_sz	(compound_encode_hdr_maxsz + \
 				encode_sequence_maxsz + \
 				encode_putfh_maxsz + \
+				encode_getattr_maxsz + \
 				encode_readdir_maxsz)
 #define NFS4_dec_readdir_sz	(compound_decode_hdr_maxsz + \
 				decode_sequence_maxsz + \
 				decode_putfh_maxsz + \
+				decode_getattr_maxsz + \
 				decode_readdir_maxsz)
 #define NFS4_enc_write_sz	(compound_encode_hdr_maxsz + \
 				encode_sequence_maxsz + \
@@ -2529,6 +2531,7 @@ static void nfs4_xdr_enc_readdir(struct rpc_rqst *req, struct xdr_stream *xdr,
 	encode_compound_hdr(xdr, req, &hdr);
 	encode_sequence(xdr, &args->seq_args, &hdr);
 	encode_putfh(xdr, args->fh, &hdr);
+	encode_getfattr(xdr, args->bitmask, &hdr);
 	encode_readdir(xdr, args, req, &hdr);
 
 	rpc_prepare_reply_pages(req, args->pages, args->pgbase,
@@ -6769,6 +6772,9 @@ static int nfs4_xdr_dec_readdir(struct rpc_rqst *rqstp, struct xdr_stream *xdr,
 	if (status)
 		goto out;
 	status = decode_putfh(xdr);
+	if (status)
+		goto out;
+	status = decode_getfattr(xdr, res->dir_attr, res->server);
 	if (status)
 		goto out;
 	status = decode_readdir(xdr, rqstp, res);
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index ca0959e51e81..04bc827e4367 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -213,6 +213,11 @@ struct nfs_server {
 						   of change attribute, size, ctime
 						   and mtime attributes supported by
 						   the server */
+	u32			cache_consistency_bitmask_nl[3];
+						/* V4 bitmask representing the subset
+						   of change attribute, size, ctime
+						   and mtime attributes supported by
+						   the server excluding label support */
 	u32			acl_bitmask;	/* V4 bitmask representing the ACEs
 						   that are supported on this
 						   filesystem */
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 728cb0c1f0b6..fbb8b7695c30 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1139,6 +1139,8 @@ struct nfs4_readdir_res {
 	struct nfs4_sequence_res	seq_res;
 	nfs4_verifier			verifier;
 	unsigned int			pgbase;
+	struct nfs_fattr		*dir_attr;
+	const struct nfs_server *server;
 };
 
 struct nfs4_readlink {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 3/8] NFS: Add a struct to track readdir pagecache location
  2022-02-23 13:40                             ` [PATCH v3 2/8] NFSv4: Send GETATTR with READDIR Benjamin Coddington
@ 2022-02-23 13:40                               ` Benjamin Coddington
  2022-02-23 13:40                                 ` [PATCH v3 4/8] NFS: Keep the readdir pagecache cursor updated Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

Directory entries in the NFS readdir pagecache are referenced by their
cookie value and offset.  By defining a structure to group these values,
we'll simplify changes to validate pagecache pages in patches that follow.

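The grouping pays off because the whole position can then be saved or
restored with a single struct assignment under f_lock, instead of three
separate field copies. A plain C sketch (comments are illustrative):

```c
#include <assert.h>

/* The three values grouped by this patch into one cursor. */
struct nfs_dir_page_cursor {
	unsigned long long index_cookie; /* cookie at this position */
	unsigned long page_index;        /* pagecache page number */
	unsigned int entry_index;        /* slot within the page's array */
};
```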
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c           | 46 ++++++++++++++++++++----------------------
 include/linux/nfs_fs.h |  6 ++++++
 2 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 79bdcedc0cad..009187c0ae0f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -158,11 +158,10 @@ struct nfs_readdir_descriptor {
 	struct file	*file;
 	struct page	*page;
 	struct dir_context *ctx;
-	pgoff_t		page_index;
 	pgoff_t		page_index_max;
 	u64		dir_cookie;
-	u64		last_cookie;
 	u64		dup_cookie;
+	struct nfs_dir_page_cursor pgc;
 	loff_t		current_index;
 	loff_t		prev_index;
 
@@ -172,7 +171,6 @@ struct nfs_readdir_descriptor {
 	unsigned long	gencount;
 	unsigned long	attr_gencount;
 	unsigned int	page_fill_misses;
-	unsigned int	cache_entry_index;
 	unsigned int	buffer_fills;
 	unsigned int	dtsize;
 	signed char duped;
@@ -457,7 +455,7 @@ static int nfs_readdir_search_for_pos(struct nfs_cache_array *array,
 
 	index = (unsigned int)diff;
 	desc->dir_cookie = array->array[index].cookie;
-	desc->cache_entry_index = index;
+	desc->pgc.entry_index = index;
 	return 0;
 out_eof:
 	desc->eof = true;
@@ -526,7 +524,7 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 			else
 				desc->ctx->pos = new_pos;
 			desc->prev_index = new_pos;
-			desc->cache_entry_index = i;
+			desc->pgc.entry_index = i;
 			return 0;
 		}
 	}
@@ -553,9 +551,9 @@ static int nfs_readdir_search_array(struct nfs_readdir_descriptor *desc)
 		status = nfs_readdir_search_for_cookie(array, desc);
 
 	if (status == -EAGAIN) {
-		desc->last_cookie = array->last_cookie;
+		desc->pgc.index_cookie = array->last_cookie;
 		desc->current_index += array->size;
-		desc->page_index++;
+		desc->pgc.page_index++;
 	}
 	kunmap_atomic(array);
 	return status;
@@ -968,8 +966,8 @@ static struct page *
 nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 {
 	return nfs_readdir_page_get_locked(desc->file->f_mapping,
-					   desc->page_index,
-					   desc->last_cookie);
+					   desc->pgc.page_index,
+					   desc->pgc.index_cookie);
 }
 
 #define NFS_READDIR_PAGE_FILL_MISS_MAX 5
@@ -1001,10 +999,10 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 	if (nfs_readdir_page_needs_filling(desc->page)) {
 		if (!nfs_readdir_may_fill_pagecache(desc))
 			return -EBADCOOKIE;
-		desc->page_index_max = desc->page_index;
+		desc->page_index_max = desc->pgc.page_index;
 		trace_nfs_readdir_cache_fill(desc->file, nfsi->cookieverf,
-					     desc->last_cookie,
-					     desc->page_index, desc->dtsize);
+					     desc->pgc.index_cookie,
+					     desc->pgc.page_index, desc->dtsize);
 		res = nfs_readdir_xdr_to_array(desc, nfsi->cookieverf, verf,
 					       &desc->page, 1);
 		if (res < 0) {
@@ -1012,7 +1010,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 			trace_nfs_readdir_cache_fill_done(inode, res);
 			if (res == -EBADCOOKIE || res == -ENOTSYNC) {
 				invalidate_inode_pages2(desc->file->f_mapping);
-				desc->page_index = 0;
+				desc->pgc.page_index = 0;
 				return -EAGAIN;
 			}
 			return res;
@@ -1020,7 +1018,7 @@ static int find_and_lock_cache_page(struct nfs_readdir_descriptor *desc)
 		/*
 		 * Set the cookie verifier if the page cache was empty
 		 */
-		if (desc->page_index == 0)
+		if (desc->pgc.page_index == 0)
 			memcpy(nfsi->cookieverf, verf,
 			       sizeof(nfsi->cookieverf));
 		desc->page_fill_misses++;
@@ -1040,10 +1038,10 @@ static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 	int res;
 
 	do {
-		if (desc->page_index == 0) {
+		if (desc->pgc.page_index == 0) {
 			desc->current_index = 0;
 			desc->prev_index = 0;
-			desc->last_cookie = 0;
+			desc->pgc.index_cookie = 0;
 		}
 		res = find_and_lock_cache_page(desc);
 	} while (res == -EAGAIN);
@@ -1061,7 +1059,7 @@ static void nfs_do_filldir(struct nfs_readdir_descriptor *desc,
 	unsigned int i = 0;
 
 	array = kmap(desc->page);
-	for (i = desc->cache_entry_index; i < array->size; i++) {
+	for (i = desc->pgc.entry_index; i < array->size; i++) {
 		struct nfs_cache_array_entry *ent;
 
 		ent = &array->array[i];
@@ -1119,13 +1117,13 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 	if (!arrays[0])
 		goto out;
 
-	desc->page_index = 0;
-	desc->cache_entry_index = 0;
-	desc->last_cookie = desc->dir_cookie;
+	desc->pgc.page_index = 0;
+	desc->pgc.entry_index = 0;
+	desc->pgc.index_cookie = desc->dir_cookie;
 	desc->duped = 0;
 	desc->page_index_max = 0;
 
-	trace_nfs_readdir_uncached(desc->file, desc->verf, desc->last_cookie,
+	trace_nfs_readdir_uncached(desc->file, desc->verf, desc->pgc.index_cookie,
 				   -1, desc->dtsize);
 
 	status = nfs_readdir_xdr_to_array(desc, desc->verf, verf, arrays, sz);
@@ -1258,7 +1256,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		}
 		if (res == -ETOOSMALL && desc->plus) {
 			nfs_zap_caches(inode);
-			desc->page_index = 0;
+			desc->pgc.page_index = 0;
 			desc->plus = false;
 			desc->eof = false;
 			continue;
@@ -1271,7 +1269,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 		if (desc->eob || desc->eof)
 			break;
 		/* Grow the dtsize if we have to go back for more pages */
-		if (desc->page_index == desc->page_index_max)
+		if (desc->pgc.page_index == desc->page_index_max)
 			nfs_grow_dtsize(desc);
 	} while (!desc->eob && !desc->eof);
 
@@ -1280,7 +1278,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	dir_ctx->dup_cookie = desc->dup_cookie;
 	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
-	dir_ctx->page_index = desc->page_index;
+	dir_ctx->page_index = desc->pgc.page_index;
 	dir_ctx->page_fill_misses = desc->page_fill_misses;
 	dir_ctx->eof = desc->eof;
 	dir_ctx->dtsize = desc->dtsize;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 0a5425a58bbd..2f5ded282477 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -99,6 +99,12 @@ struct nfs_open_context {
 	struct rcu_head	rcu_head;
 };
 
+struct nfs_dir_page_cursor {
+	__u64 index_cookie;
+	pgoff_t page_index;
+	unsigned int entry_index;
+};
+
 struct nfs_open_dir_context {
 	struct list_head list;
 	atomic_t cache_hits;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 4/8] NFS: Keep the readdir pagecache cursor updated
  2022-02-23 13:40                               ` [PATCH v3 3/8] NFS: Add a struct to track readdir pagecache location Benjamin Coddington
@ 2022-02-23 13:40                                 ` Benjamin Coddington
  2022-02-23 13:40                                   ` [PATCH v3 5/8] NFS: readdir per-page cache validation Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

Whenever we successfully locate our dir_cookie within the pagecache, or
finish emitting entries to userspace, update the pagecache cursor.  These
updates provide marker points to validate pagecache pages in a future
patch.

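The cursor-advance step in this patch can be modeled in plain C as follows;
ENTRIES_PER_PAGE below is a stand-in for the real cache_entries_per_page
constant, which the patch computes from PAGE_SIZE and the entry size.

```c
#include <assert.h>

enum { ENTRIES_PER_PAGE = 127 };	/* stand-in for the computed value */

struct nfs_dir_page_cursor {
	unsigned long long index_cookie;
	unsigned long page_index;
	unsigned int entry_index;
};

/* Record the cookie just emitted and step to the next slot; stepping
 * past the last slot wraps the cursor onto the next pagecache page. */
static void cursor_next(struct nfs_dir_page_cursor *pgc,
			unsigned long long cookie)
{
	pgc->index_cookie = cookie;
	if (++pgc->entry_index == ENTRIES_PER_PAGE) {
		pgc->entry_index = 0;
		pgc->page_index++;
	}
}
```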
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 009187c0ae0f..2b1a0c1cdce4 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -154,6 +154,10 @@ struct nfs_cache_array {
 	struct nfs_cache_array_entry array[];
 };
 
+static const int cache_entries_per_page =
+	(PAGE_SIZE - sizeof(struct nfs_cache_array)) /
+	sizeof(struct nfs_cache_array_entry);
+
 struct nfs_readdir_descriptor {
 	struct file	*file;
 	struct page	*page;
@@ -282,6 +286,21 @@ static bool nfs_readdir_array_is_full(struct nfs_cache_array *array)
 	return array->page_full;
 }
 
+static void nfs_readdir_set_cursor(struct nfs_readdir_descriptor *desc, int index)
+{
+	desc->pgc.entry_index = index;
+	desc->pgc.index_cookie = desc->dir_cookie;
+}
+
+static void nfs_readdir_cursor_next(struct nfs_dir_page_cursor *pgc, u64 cookie)
+{
+	pgc->index_cookie = cookie;
+	if (++pgc->entry_index == cache_entries_per_page) {
+		pgc->entry_index = 0;
+		pgc->page_index++;
+	}
+}
+
 /*
  * the caller is responsible for freeing qstr.name
  * when called by nfs_readdir_add_to_array, the strings will be freed in
@@ -455,7 +474,7 @@ static int nfs_readdir_search_for_pos(struct nfs_cache_array *array,
 
 	index = (unsigned int)diff;
 	desc->dir_cookie = array->array[index].cookie;
-	desc->pgc.entry_index = index;
+	nfs_readdir_set_cursor(desc, index);
 	return 0;
 out_eof:
 	desc->eof = true;
@@ -524,7 +543,7 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 			else
 				desc->ctx->pos = new_pos;
 			desc->prev_index = new_pos;
-			desc->pgc.entry_index = i;
+			nfs_readdir_set_cursor(desc, i);
 			return 0;
 		}
 	}
@@ -551,9 +570,9 @@ static int nfs_readdir_search_array(struct nfs_readdir_descriptor *desc)
 		status = nfs_readdir_search_for_cookie(array, desc);
 
 	if (status == -EAGAIN) {
-		desc->pgc.index_cookie = array->last_cookie;
+		desc->pgc.entry_index = array->size - 1;
+		nfs_readdir_cursor_next(&desc->pgc, array->last_cookie);
 		desc->current_index += array->size;
-		desc->pgc.page_index++;
 	}
 	kunmap_atomic(array);
 	return status;
@@ -1084,6 +1103,8 @@ static void nfs_do_filldir(struct nfs_readdir_descriptor *desc,
 		desc->eof = !desc->eob;
 
 	kunmap(desc->page);
+	desc->pgc.entry_index = i-1;
+	nfs_readdir_cursor_next(&desc->pgc, desc->dir_cookie);
 	dfprintk(DIRCACHE, "NFS: nfs_do_filldir() filling ended @ cookie %llu\n",
 			(unsigned long long)desc->dir_cookie);
 }
@@ -1118,8 +1139,7 @@ static int uncached_readdir(struct nfs_readdir_descriptor *desc)
 		goto out;
 
 	desc->pgc.page_index = 0;
-	desc->pgc.entry_index = 0;
-	desc->pgc.index_cookie = desc->dir_cookie;
+	nfs_readdir_set_cursor(desc, 0);
 	desc->duped = 0;
 	desc->page_index_max = 0;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 5/8] NFS: readdir per-page cache validation
  2022-02-23 13:40                                 ` [PATCH v3 4/8] NFS: Keep the readdir pagecache cursor updated Benjamin Coddington
@ 2022-02-23 13:40                                   ` Benjamin Coddington
  2022-02-23 13:40                                     ` [PATCH v3 6/8] NFS: stash the readdir pagecache cursor on the open directory context Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

The current implementation of the readdir page cache requires that all
pages contain entries ordered such that the cookie references lead to the
first entry as represented by cookie 0.  The invalidation of the cache
truncates either the entire cache or every page beyond a known good page.

A process that wants to emit directory entries near the end of a directory
must first fill in any entries missing in the cache near the beginning of
the directory in order that the entries decoded from READDIR XDR are
appropriately page-aligned for any readers that may come later (and for
some error handling).

However, if we're careful to check the alignment of directory entries on
each page when the page is read, then it should be permissible to allow
"disconnected" filling of the pagecache.  Rather than requiring pagecache
data to always be positionally aligned, we can instead validate that each
page is properly aligned to the reading process' directory context. If it
doesn't match our alignment, we'll refresh the entries in the page so that
it does.

This patch implements a check for validity for each page as it is obtained
from the pagecache.  A page is valid if it was filled within the client's
current version of the directory and if the entries are aligned with the
current reader's directory context.

Invalid pages are re-filled by READDIR operations before being used to emit
entries for the current reader.

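A simplified model of the two-part validity check: the page's saved change
attribute must match the directory's, and the reader's cursor must line up
with the entry stored at its index (or with last_cookie on an empty page).
The struct and names below are illustrative stand-ins for the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

struct model_array {
	unsigned long long last_cookie;
	unsigned int size;
	unsigned long long cookies[8];	/* per-entry cookies */
};

static bool page_valid(unsigned long long page_changeattr,
		       unsigned long long dir_changeattr,
		       const struct model_array *array,
		       unsigned int entry_index,
		       unsigned long long index_cookie)
{
	if (page_changeattr != dir_changeattr)
		return false;	/* directory changed since the page was filled */
	if (array->size == 0 && array->last_cookie == index_cookie)
		return true;	/* empty page with the cursor at its start */
	return array->size > entry_index &&
	       array->cookies[entry_index] == index_cookie;
}
```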
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 68 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 55 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2b1a0c1cdce4..ba75a9593dae 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -219,7 +219,9 @@ static void nfs_readdir_page_init_array(struct page *page, u64 last_cookie)
 	array->last_cookie = last_cookie;
 	array->cookies_are_ordered = 1;
 	kunmap_atomic(array);
-	set_page_private(page, 0);
+	if (page->mapping)
+		set_page_private(page, nfs_save_change_attribute(page->mapping->host));
+	SetPageUptodate(page);
 }
 
 static int
@@ -256,6 +258,15 @@ void nfs_readdir_clear_array(struct page *page)
 		kfree(array->array[i].name);
 	nfs_readdir_array_init(array);
 	kunmap_atomic(array);
+	ClearPageUptodate(page);
+}
+
+static void
+nfs_readdir_recycle_page(struct page *page, u64 last_cookie)
+{
+	nfs_readdir_clear_array(page);
+	nfs_readdir_invalidatepage(page, 0, 0);
+	nfs_readdir_page_init_array(page, last_cookie);
 }
 
 static struct page *
@@ -372,18 +383,47 @@ int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
 	return ret;
 }
 
+static bool
+nfs_readdir_page_valid(struct page *page, unsigned int entry_index, u64 index_cookie)
+{
+	bool ret = false;
+	struct nfs_cache_array *array;
+
+	if (page_private(page) != nfs_save_change_attribute(page->mapping->host))
+		goto out;
+
+	ret = true;
+	array = kmap_atomic(page);
+
+	if (array->size == 0 && array->last_cookie == index_cookie)
+		goto out_unmap;
+
+	if (array->size > entry_index &&
+		array->array[entry_index].cookie == index_cookie)
+		goto out_unmap;
+
+	ret = false;
+out_unmap:
+	kunmap_atomic(array);
+out:
+	return ret;
+}
+
 static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
-						pgoff_t index, u64 last_cookie)
+						struct nfs_dir_page_cursor *pgc)
 {
 	struct page *page;
 
-	page = grab_cache_page(mapping, index);
-	if (page && !PageUptodate(page)) {
-		nfs_readdir_page_init_array(page, last_cookie);
-		if (invalidate_inode_pages2_range(mapping, index + 1, -1) < 0)
-			nfs_zap_mapping(mapping->host, mapping);
-		SetPageUptodate(page);
-	}
+	page = grab_cache_page(mapping, pgc->page_index);
+
+	if (!page)
+		return page;
+
+	if (!PageUptodate(page))
+		nfs_readdir_page_init_array(page, pgc->index_cookie);
+
+	if (!nfs_readdir_page_valid(page, pgc->entry_index, pgc->index_cookie))
+		nfs_readdir_recycle_page(page, pgc->index_cookie);
 
 	return page;
 }
@@ -429,8 +469,12 @@ static struct page *nfs_readdir_page_get_next(struct address_space *mapping,
 					      pgoff_t index, u64 cookie)
 {
 	struct page *page;
+	struct nfs_dir_page_cursor pgc = {
+		.page_index = index,
+		.index_cookie = cookie,
+	};
 
-	page = nfs_readdir_page_get_locked(mapping, index, cookie);
+	page = nfs_readdir_page_get_locked(mapping, &pgc);
 	if (page) {
 		if (nfs_readdir_page_last_cookie(page) == cookie)
 			return page;
@@ -984,9 +1028,7 @@ nfs_readdir_page_unlock_and_put_cached(struct nfs_readdir_descriptor *desc)
 static struct page *
 nfs_readdir_page_get_cached(struct nfs_readdir_descriptor *desc)
 {
-	return nfs_readdir_page_get_locked(desc->file->f_mapping,
-					   desc->pgc.page_index,
-					   desc->pgc.index_cookie);
+	return nfs_readdir_page_get_locked(desc->file->f_mapping, &desc->pgc);
 }
 
 #define NFS_READDIR_PAGE_FILL_MISS_MAX 5
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 6/8] NFS: stash the readdir pagecache cursor on the open directory context
  2022-02-23 13:40                                   ` [PATCH v3 5/8] NFS: readdir per-page cache validation Benjamin Coddington
@ 2022-02-23 13:40                                     ` Benjamin Coddington
  2022-02-23 13:40                                       ` [PATCH v3 7/8] NFS: Support headless readdir pagecache pages Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

Now that we have per-page cache validation, we can allow filling of the
pagecache from arbitrary offsets.  Store the pagecache cursor on the open
directory context so it can be used to validate our entries the next time
we enter nfs_readdir() without having to fill the pagecache from the
beginning.

The nfs_open_dir_context's dir_cookie and index_cookie will store the
same value between calls to nfs_readdir(), so we can save a little room in
the struct by dropping dir_cookie.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c           | 13 +++++++------
 include/linux/nfs_fs.h |  2 +-
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ba75a9593dae..4f4a139f3181 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -395,7 +395,8 @@ nfs_readdir_page_valid(struct page *page, unsigned int entry_index, u64 index_co
 	ret = true;
 	array = kmap_atomic(page);
 
-	if (array->size == 0 && array->last_cookie == index_cookie)
+	if ((array->size == 0 || array->size == entry_index)
+		&& array->last_cookie == index_cookie)
 		goto out_unmap;
 
 	if (array->size > entry_index &&
@@ -1102,7 +1103,6 @@ static int readdir_search_pagecache(struct nfs_readdir_descriptor *desc)
 		if (desc->pgc.page_index == 0) {
 			desc->current_index = 0;
 			desc->prev_index = 0;
-			desc->pgc.index_cookie = 0;
 		}
 		res = find_and_lock_cache_page(desc);
 	} while (res == -EAGAIN);
@@ -1279,7 +1279,8 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	desc->page_index_max = -1;
 
 	spin_lock(&file->f_lock);
-	desc->dir_cookie = dir_ctx->dir_cookie;
+	desc->dir_cookie = dir_ctx->pgc.index_cookie;
+	desc->pgc = dir_ctx->pgc;
 	desc->dup_cookie = dir_ctx->dup_cookie;
 	desc->duped = dir_ctx->duped;
 	page_index = dir_ctx->page_index;
@@ -1336,7 +1337,7 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	} while (!desc->eob && !desc->eof);
 
 	spin_lock(&file->f_lock);
-	dir_ctx->dir_cookie = desc->dir_cookie;
+	dir_ctx->pgc = desc->pgc;
 	dir_ctx->dup_cookie = desc->dup_cookie;
 	dir_ctx->duped = desc->duped;
 	dir_ctx->attr_gencount = desc->attr_gencount;
@@ -1382,9 +1383,9 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 	if (offset != filp->f_pos) {
 		filp->f_pos = offset;
 		if (nfs_readdir_use_cookie(filp))
-			dir_ctx->dir_cookie = offset;
+			dir_ctx->pgc.index_cookie = offset;
 		else
-			dir_ctx->dir_cookie = 0;
+			dir_ctx->pgc.index_cookie = 0;
 		dir_ctx->page_fill_misses = 0;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 2f5ded282477..aaeaad4006a4 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -111,7 +111,7 @@ struct nfs_open_dir_context {
 	atomic_t cache_misses;
 	unsigned long attr_gencount;
 	__be32	verf[NFS_DIR_VERIFIER_SIZE];
-	__u64 dir_cookie;
+	struct nfs_dir_page_cursor pgc;
 	__u64 dup_cookie;
 	pgoff_t page_index;
 	unsigned int page_fill_misses;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 7/8] NFS: Support headless readdir pagecache pages
  2022-02-23 13:40                                     ` [PATCH v3 6/8] NFS: stash the readdir pagecache cursor on the open directory context Benjamin Coddington
@ 2022-02-23 13:40                                       ` Benjamin Coddington
  2022-02-23 13:40                                         ` [PATCH v3 8/8] NFS: Revalidate the directory pagecache on every nfs_readdir() Benjamin Coddington
  0 siblings, 1 reply; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

It is now possible that a reader will resume a directory listing after an
invalidation and fill the rest of the pages with the offset left over from
the last partially-filled page.  These pages will then be recycled and
refilled by the next reader since their alignment is incorrect.

Add an index to the nfs_cache_array that will indicate where the next entry
should be filled.  This allows partially-filled pages to have the best
alignment possible.  They are more likely to be useful to readers that
follow.

This optimization targets the case when there are multiple processes
listing the directory simultaneously.  Often the processes will collect and
block on the same page waiting for a READDIR call to fill the pagecache.
If the pagecache is invalidated, a partially-filled page will usually
result.  This partially-filled page can immediately be used by all
processes to emit entries rather than having to discard and refill it for
every process.

The addition of another integer to struct nfs_cache_array increases its
size to 24 bytes. We do not lose the original capacity of 127 entries per
page.

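The capacity claim checks out arithmetically, assuming 4K pages and x86_64
struct packing (a 32-byte entry, and a header that grows from 16 to 24
bytes with the new 'index' field): both header sizes leave room for 127
entries. A sketch of the calculation:

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH 4096UL		/* assumed 4K page */
#define ENTRY_SIZE 32UL			/* assumed packed entry size */

/* Entries fill whatever is left of the page after the array header. */
static unsigned long readdir_capacity(unsigned long header_size)
{
	return (PAGE_SIZE_SKETCH - header_size) / ENTRY_SIZE;
}
```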
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 47 ++++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 4f4a139f3181..a570f14633ab 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -147,6 +147,7 @@ struct nfs_cache_array_entry {
 
 struct nfs_cache_array {
 	u64 last_cookie;
+	unsigned int index;
 	unsigned int size;
 	unsigned char page_full : 1,
 		      page_is_eof : 1,
@@ -210,13 +211,15 @@ static void nfs_readdir_array_init(struct nfs_cache_array *array)
 	memset(array, 0, sizeof(struct nfs_cache_array));
 }
 
-static void nfs_readdir_page_init_array(struct page *page, u64 last_cookie)
+static void
+nfs_readdir_page_init_array(struct page *page, struct nfs_dir_page_cursor *pgc)
 {
 	struct nfs_cache_array *array;
 
 	array = kmap_atomic(page);
 	nfs_readdir_array_init(array);
-	array->last_cookie = last_cookie;
+	array->last_cookie = pgc->index_cookie;
+	array->index = pgc->entry_index;
 	array->cookies_are_ordered = 1;
 	kunmap_atomic(array);
 	if (page->mapping)
@@ -254,7 +257,7 @@ void nfs_readdir_clear_array(struct page *page)
 	int i;
 
 	array = kmap_atomic(page);
-	for (i = 0; i < array->size; i++)
+	for (i = array->index - array->size; i < array->index; i++)
 		kfree(array->array[i].name);
 	nfs_readdir_array_init(array);
 	kunmap_atomic(array);
@@ -262,19 +265,20 @@ void nfs_readdir_clear_array(struct page *page)
 }
 
 static void
-nfs_readdir_recycle_page(struct page *page, u64 last_cookie)
+nfs_readdir_recycle_page(struct page *page, struct nfs_dir_page_cursor *pgc)
 {
 	nfs_readdir_clear_array(page);
 	nfs_readdir_invalidatepage(page, 0, 0);
-	nfs_readdir_page_init_array(page, last_cookie);
+	nfs_readdir_page_init_array(page, pgc);
 }
 
 static struct page *
 nfs_readdir_page_array_alloc(u64 last_cookie, gfp_t gfp_flags)
 {
 	struct page *page = alloc_page(gfp_flags);
+	struct nfs_dir_page_cursor pgc = { .index_cookie = last_cookie };
 	if (page)
-		nfs_readdir_page_init_array(page, last_cookie);
+		nfs_readdir_page_init_array(page, &pgc);
 	return page;
 }
 
@@ -339,7 +343,7 @@ static int nfs_readdir_array_can_expand(struct nfs_cache_array *array)
 
 	if (array->page_full)
 		return -ENOSPC;
-	cache_entry = &array->array[array->size + 1];
+	cache_entry = &array->array[array->index + 1];
 	if ((char *)cache_entry - (char *)array > PAGE_SIZE) {
 		array->page_full = 1;
 		return -ENOSPC;
@@ -366,7 +370,7 @@ int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
 		goto out;
 	}
 
-	cache_entry = &array->array[array->size];
+	cache_entry = &array->array[array->index];
 	cache_entry->cookie = entry->prev_cookie;
 	cache_entry->ino = entry->ino;
 	cache_entry->d_type = entry->d_type;
@@ -375,6 +379,7 @@ int nfs_readdir_add_to_array(struct nfs_entry *entry, struct page *page)
 	array->last_cookie = entry->cookie;
 	if (array->last_cookie <= cache_entry->cookie)
 		array->cookies_are_ordered = 0;
+	array->index++;
 	array->size++;
 	if (entry->eof != 0)
 		nfs_readdir_array_set_eof(array);
@@ -395,13 +400,15 @@ nfs_readdir_page_valid(struct page *page, unsigned int entry_index, u64 index_co
 	ret = true;
 	array = kmap_atomic(page);
 
-	if ((array->size == 0 || array->size == entry_index)
-		&& array->last_cookie == index_cookie)
-		goto out_unmap;
+	if (entry_index >= array->index - array->size) {
+		if ((array->size == 0 || array->index == entry_index)
+			&& array->last_cookie == index_cookie)
+			goto out_unmap;
 
-	if (array->size > entry_index &&
-		array->array[entry_index].cookie == index_cookie)
-		goto out_unmap;
+		if (array->index > entry_index &&
+			array->array[entry_index].cookie == index_cookie)
+			goto out_unmap;
+	}
 
 	ret = false;
 out_unmap:
@@ -421,10 +428,10 @@ static struct page *nfs_readdir_page_get_locked(struct address_space *mapping,
 		return page;
 
 	if (!PageUptodate(page))
-		nfs_readdir_page_init_array(page, pgc->index_cookie);
+		nfs_readdir_page_init_array(page, pgc);
 
 	if (!nfs_readdir_page_valid(page, pgc->entry_index, pgc->index_cookie))
-		nfs_readdir_recycle_page(page, pgc->index_cookie);
+		nfs_readdir_recycle_page(page, pgc);
 
 	return page;
 }
@@ -544,7 +551,7 @@ static bool nfs_readdir_array_cookie_in_range(struct nfs_cache_array *array,
 	/* Optimisation for monotonically increasing cookies */
 	if (cookie >= array->last_cookie)
 		return false;
-	if (array->size && cookie < array->array[0].cookie)
+	if (array->size && cookie < array->array[array->index - array->size].cookie)
 		return false;
 	return true;
 }
@@ -559,7 +566,7 @@ static int nfs_readdir_search_for_cookie(struct nfs_cache_array *array,
 	if (!nfs_readdir_array_cookie_in_range(array, desc->dir_cookie))
 		goto check_eof;
 
-	for (i = 0; i < array->size; i++) {
+	for (i = array->index - array->size; i < array->index; i++) {
 		if (array->array[i].cookie == desc->dir_cookie) {
 			struct nfs_inode *nfsi = NFS_I(file_inode(desc->file));
 
@@ -1120,7 +1127,7 @@ static void nfs_do_filldir(struct nfs_readdir_descriptor *desc,
 	unsigned int i = 0;
 
 	array = kmap(desc->page);
-	for (i = desc->pgc.entry_index; i < array->size; i++) {
+	for (i = desc->pgc.entry_index; i < array->index; i++) {
 		struct nfs_cache_array_entry *ent;
 
 		ent = &array->array[i];
@@ -1387,6 +1394,8 @@ static loff_t nfs_llseek_dir(struct file *filp, loff_t offset, int whence)
 		else
 			dir_ctx->pgc.index_cookie = 0;
 		dir_ctx->page_fill_misses = 0;
+		dir_ctx->pgc.page_index = 0;
+		dir_ctx->pgc.entry_index = 0;
 		if (offset == 0)
 			memset(dir_ctx->verf, 0, sizeof(dir_ctx->verf));
 		dir_ctx->duped = 0;
-- 
2.31.1



* [PATCH v3 8/8] NFS: Revalidate the directory pagecache on every nfs_readdir()
  2022-02-23 13:40                                       ` [PATCH v3 7/8] NFS: Support headless readdir pagecache pages Benjamin Coddington
@ 2022-02-23 13:40                                         ` Benjamin Coddington
  0 siblings, 0 replies; 34+ messages in thread
From: Benjamin Coddington @ 2022-02-23 13:40 UTC (permalink / raw)
  To: linux-nfs

Since there's no longer a significant performance penalty for dropping the
pagecache, and because we want to ensure that the directory's change
attribute is accurate before validating pagecache pages, revalidate the
directory's mapping on every call to nfs_readdir().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
 fs/nfs/dir.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a570f14633ab..565ff26e1b52 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1271,11 +1271,9 @@ static int nfs_readdir(struct file *file, struct dir_context *ctx)
 	 * to either find the entry with the appropriate number or
 	 * revalidate the cookie.
 	 */
-	if (ctx->pos == 0 || nfs_attribute_cache_expired(inode)) {
-		res = nfs_revalidate_mapping(inode, file->f_mapping);
-		if (res < 0)
-			goto out;
-	}
+	res = nfs_revalidate_mapping(inode, file->f_mapping);
+	if (res < 0)
+		goto out;
 
 	res = -ENOMEM;
 	desc = kzalloc(sizeof(*desc), GFP_KERNEL);
-- 
2.31.1



* Re: [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir
  2022-02-23 13:34                               ` Benjamin Coddington
@ 2022-02-23 21:31                                 ` Trond Myklebust
  0 siblings, 0 replies; 34+ messages in thread
From: Trond Myklebust @ 2022-02-23 21:31 UTC (permalink / raw)
  To: bcodding; +Cc: linux-nfs

On Wed, 2022-02-23 at 08:34 -0500, Benjamin Coddington wrote:
> On 23 Feb 2022, at 7:17, Trond Myklebust wrote:
> 
> > On Tue, 2022-02-22 at 15:21 -0500, Benjamin Coddington wrote:
> > > On 22 Feb 2022, at 15:11, Trond Myklebust wrote:
> > > 
> > > > On Tue, 2022-02-22 at 07:50 -0500, Benjamin Coddington wrote:
> > > > > On 21 Feb 2022, at 18:20, Trond Myklebust wrote:
> > > > > 
> > > > > > > On Mon, 2022-02-21 at 16:10 -0500, Benjamin Coddington wrote:
> > > > > > > On 21 Feb 2022, at 15:55, Trond Myklebust wrote:
> > > > > > > > 
> > > > > > > > We will always need the ability to cut over to uncached
> > > > > > > > readdir.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > If the cookie is no longer returned by the server because
> > > > > > > > one or more files were deleted then we need to resolve the
> > > > > > > > situation somehow (IOW: the 'rm *' case). The new algorithm
> > > > > > > > _does_ improve performance on those situations, because it
> > > > > > > > no longer requires us to read the entire directory before
> > > > > > > > switching over: we try 5 times, then fail over.
> > > > > > > 
> > > > > > > Yes, using per-page validation doesn't remove the need for
> > > > > > > uncached readdir.  It does allow a reader to simply resume
> > > > > > > filling the cache where it left off.  There's no need to try
> > > > > > > 5 times and fail over.  And there's no need to make a
> > > > > > > trade-off and make the situation worse in certain scenarios.
> > > > > > > 
> > > > > > > I thought I'd point that out and make an offer to re-submit
> > > > > > > it.  Any interest?
> > > > > > > 
> > > > > > 
> > > > > > As I recall, I had concerns about that approach. Can you
> > > > > > explain again how it will work?
> > > > > 
> > > > > Every page of readdir results has the directory's change attr
> > > > > stored on the page.  That, along with the page's index and the
> > > > > first cookie, is enough information to determine if the page's
> > > > > data can be used by another process.
> > > > > 
> > > > > Which means that when the pagecache is dropped, fillers don't
> > > > > have to restart filling the cache at page index 0, they can
> > > > > continue to fill at whatever index they were at previously.  If
> > > > > another process finds a page that doesn't match its page index,
> > > > > cookie, and the current directory's change attr, the page is
> > > > > dropped and refilled from that process' indexing.
> > > > > 
> > > > > > A few of the concerns I have revolve around
> > > > > > telldir()/seekdir(). If the platform is 32-bit, then we
> > > > > > cannot use cookies as the telldir() output, and so our only
> > > > > > option is to use offsets into the page cache (this is why
> > > > > > this patch carves out an exception if desc->dir_cookie == 0).
> > > > > > How would that work with your scheme?
> > > > > 
> > > > > For 32-bit seekdir, pages are filled starting at index 0.
> > > > > This is very unlikely to match other readers (unless they also
> > > > > do the _same_ seekdir).
> > > > > 
> > > > > > Even in the 64-bit case where we are able to use cookies for
> > > > > > telldir()/seekdir(), how do we determine an appropriate page
> > > > > > index after a seekdir()?
> > > > > 
> > > > > We don't.  Instead we start filling at index 0.  Again, the
> > > > > pagecache will only be useful to other processes that have done
> > > > > the same llseek.
> > > > > 
> > > > > This approach optimizes the pagecache for processes that are
> > > > > doing similar work, and has the added benefit of scaling well
> > > > > for large directory listings under memory pressure.  Also a
> > > > > number of classes of directory modifications (such as renames,
> > > > > or insertions/deletions at locations a reader has moved beyond)
> > > > > are no longer a reason to re-fill the pagecache from scratch.
> > > > 
> > > > OK, you've got me more or less sold on it.
> > > > 
> > > > I'd still like to figure out how to improve the performance for
> > > > seekdir (since I do have an interest in re-exporting NFS) but
> > > > I've been playing around with a couple of patches that implement
> > > > your concept and they do seem to work well for the common case
> > > > of a linear read through the directory.
> > > 
> > > Nice.  I have a newer version than the one I originally posted:
> > > https://lore.kernel.org/linux-nfs/cover.1611160120.git.bcodding@redhat.com/
> > > 
> > > .. but I don't remember exactly what changed, and it needs rebasing.
> > > Should I rebase it against your testing branch and send the result?
> > 
> > My 2 patches did something slightly different to yours, storing the
> > change attribute in the array header instead of in page_private.
> > That makes for a slightly smaller change.
> 
> I worried that increasing the size of the array header wouldn't allow
> us to store as many entries per page.

The struct nfs_cache_array header is 24 bytes long with the change
attribute (as opposed to 16 bytes without it). This size is independent
of the architecture, assuming that 'unsigned int' is 32 bits and
'unsigned char' is 8 bits (as is always the case on Linux).

On a 64-bit system, a single struct nfs_cache_array_entry ends up being
32 bytes long. So in a standard 4k page with a 24-byte or 16-byte
header, you will be able to cache exactly 127 cache array entries.

On a 32-bit system, the cache entry is 28 bytes long (the difference
being the pointer length), and you can pack 145 entries in the 4k page.

IOW: The change in header size makes no difference to the number of
entries you can cache, because in both cases, the header remains
smaller than a single entry.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




end of thread, other threads:[~2022-02-23 21:32 UTC | newest]

Thread overview: 34+ messages
2022-02-21 16:08 [PATCH v6 00/13] Readdir improvements trondmy
2022-02-21 16:08 ` [PATCH v6 01/13] NFS: constify nfs_server_capable() and nfs_have_writebacks() trondmy
2022-02-21 16:08   ` [PATCH v6 02/13] NFS: Trace lookup revalidation failure trondmy
2022-02-21 16:08     ` [PATCH v6 03/13] NFS: Adjust the amount of readahead performed by NFS readdir trondmy
2022-02-21 16:08       ` [PATCH v6 04/13] NFS: Simplify nfs_readdir_xdr_to_array() trondmy
2022-02-21 16:08         ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir trondmy
2022-02-21 16:08           ` [PATCH v6 06/13] NFS: Improve heuristic for readdirplus trondmy
2022-02-21 16:08             ` [PATCH v6 07/13] NFS: Don't ask for readdirplus unless it can help nfs_getattr() trondmy
2022-02-21 16:08               ` [PATCH v6 08/13] NFSv4: Ask for a full XDR buffer of readdir goodness trondmy
2022-02-21 16:08                 ` [PATCH v6 09/13] NFS: Readdirplus can't help lookup for case insensitive filesystems trondmy
2022-02-21 16:08                   ` [PATCH v6 10/13] NFS: Don't request readdirplus when revaldation was forced trondmy
2022-02-21 16:08                     ` [PATCH v6 11/13] NFS: Add basic readdir tracing trondmy
2022-02-21 16:08                       ` [PATCH v6 12/13] NFS: Trace effects of readdirplus on the dcache trondmy
2022-02-21 16:08                         ` [PATCH v6 13/13] NFS: Trace effects of the readdirplus heuristic trondmy
2022-02-23 13:40                           ` [PATCH v3 1/8] NFS: save the directory's change attribute on pagecache pages Benjamin Coddington
2022-02-23 13:40                             ` [PATCH v3 2/8] NFSv4: Send GETATTR with READDIR Benjamin Coddington
2022-02-23 13:40                               ` [PATCH v3 3/8] NFS: Add a struct to track readdir pagecache location Benjamin Coddington
2022-02-23 13:40                                 ` [PATCH v3 4/8] NFS: Keep the readdir pagecache cursor updated Benjamin Coddington
2022-02-23 13:40                                   ` [PATCH v3 5/8] NFS: readdir per-page cache validation Benjamin Coddington
2022-02-23 13:40                                     ` [PATCH v3 6/8] NFS: stash the readdir pagecache cursor on the open directory context Benjamin Coddington
2022-02-23 13:40                                       ` [PATCH v3 7/8] NFS: Support headless readdir pagecache pages Benjamin Coddington
2022-02-23 13:40                                         ` [PATCH v3 8/8] NFS: Revalidate the directory pagecache on every nfs_readdir() Benjamin Coddington
2022-02-21 16:45           ` [PATCH v6 05/13] NFS: Improve algorithm for falling back to uncached readdir Benjamin Coddington
2022-02-21 19:58             ` Trond Myklebust
2022-02-21 20:22               ` Benjamin Coddington
2022-02-21 20:55                 ` Trond Myklebust
2022-02-21 21:10                   ` Benjamin Coddington
2022-02-21 23:20                     ` Trond Myklebust
2022-02-22 12:50                       ` Benjamin Coddington
2022-02-22 20:11                         ` Trond Myklebust
2022-02-22 20:21                           ` Benjamin Coddington
2022-02-23 12:17                             ` Trond Myklebust
2022-02-23 13:34                               ` Benjamin Coddington
2022-02-23 21:31                                 ` Trond Myklebust
