Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
From: Kirill Smelkov <kirr@nexedi.com>
To: Miklos Szeredi <miklos@szeredi.hu>, Miklos Szeredi <mszeredi@redhat.com>
Cc: Brian Foster <bfoster@redhat.com>,
	Maxim Patlasov <mpatlasov@parallels.com>,
	Anatol Pomozov <anatol.pomozov@gmail.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Andrew Gallagher <andrewjcg@fb.com>,
	"Anand V . Avati" <avati@redhat.com>,
	Alexey Kuznetsov <kuznet@virtuozzo.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Constantine Shulyupin <const@MakeLinux.com>,
	Chad Austin <chadaustin@fb.com>,
	Dan Schatzberg <dschatzberg@fb.com>,
	<linux-fsdevel@vger.kernel.org>,
	<fuse-devel@lists.sourceforge.net>,
	<linux-kernel@vger.kernel.org>,
	Han-Wen Nienhuys <hanwen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kirill Smelkov <kirr@nexedi.com>
Subject: [RESEND1, PATCH v2 2/2] fuse: allow filesystems to have precise control over data cache
Date: Wed, 27 Mar 2019 10:14:12 +0000
Message-ID: <e0b43507976d6ea9010f1bacaef067f18de49f1f.1553677194.git.kirr@nexedi.com> (raw)
In-Reply-To: <cover.1553677194.git.kirr@nexedi.com>

On networked filesystems file data can be changed externally.
FUSE provides notification messages for filesystem to inform kernel that
metadata or data region of a file needs to be invalidated in local page
cache. That provides the basis for filesystem implementations to
invalidate kernel cache precisely based on observed filesystem-specific
events.

FUSE has also "automatic" invalidation mode(*) when the kernel
automatically invalidates data cache of a file if it sees mtime change.
It also automatically invalidates whole data cache of a file if it sees
file size being changed.

The automatic mode has corresponding capability - FUSE_AUTO_INVAL_DATA.
However, due to probably historical reason, that capability controls
only whether mtime change should be resulting in automatic invalidation
or not. A change in file size always results in invalidating whole data
cache of a file irregardless of whether FUSE_AUTO_INVAL_DATA was
negotiated(+).

The filesystem I write[1] represents data arrays stored in networked
database as local files suitable for mmap. It is read-only filesystem -
changes to data are committed externally via database interfaces and the
filesystem only glues data into contiguous file streams suitable for
mmap and traditional array processing. The files are big - starting from
hundreds gigabytes and more. The files change regularly, and frequently
by data being appended to their end. The size of files thus changes
frequently.

If a file was accessed locally and some part of its data got into page
cache, we want that data to stay cached unless there is memory pressure,
or unless corresponding part of the file was actually changed. However
current FUSE behaviour - when it sees file size change - is to
invalidate the whole file. The data cache of the file is thus completely
lost even on small size change, and despite that the filesystem server is
careful to accurately translate database changes into FUSE invalidation
messages to kernel.

Let's fix it: if a filesystem, through new FUSE_PRECISE_INVAL_DATA
capability, indicates to kernel that it is fully responsible for data
cache invalidation, then the kernel won't invalidate files data cache on
size change and only truncate that cache to new size in case the size
decreased.

(*) see 72d0d248ca "fuse: add FUSE_AUTO_INVAL_DATA init flag",
eed2179efe "fuse: invalidate inode mapping if mtime changes"

(+) in writeback mode the kernel does not invalidate data cache on file
size change, but neither it allows the filesystem to set the size due to
external event (see 8373200b12 "fuse: Trust kernel i_size only")

[1] https://lab.nexedi.com/kirr/wendelin.core/blob/a50f1d9f/wcfs/wcfs.go#L20

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
---
 fs/fuse/fuse_i.h          |  3 +++
 fs/fuse/inode.c           | 12 ++++++++++--
 include/uapi/linux/fuse.h |  7 ++++++-
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e6195bc8f836..154f6cdd94d1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -694,6 +694,9 @@ struct fuse_conn {
 	/** Use enhanced/automatic page cache invalidation. */
 	unsigned auto_inval_data:1;
 
+	/** Filesystem is fully reponsible for page cache invalidation. */
+	unsigned precise_inval_data:1;
+
 	/** Does the filesystem support readdirplus? */
 	unsigned do_readdirplus:1;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1bca5023bcc5..2be0bca7f76c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -237,7 +237,8 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 
 		if (oldsize != attr->size) {
 			truncate_pagecache(inode, attr->size);
-			inval = true;
+			if (!fc->precise_inval_data)
+				inval = true;
 		} else if (fc->auto_inval_data) {
 			struct timespec64 new_mtime = {
 				.tv_sec = attr->mtime,
@@ -912,6 +913,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->dont_mask = 1;
 			if (arg->flags & FUSE_AUTO_INVAL_DATA)
 				fc->auto_inval_data = 1;
+			if (arg->flags & FUSE_PRECISE_INVAL_DATA)
+				fc->precise_inval_data = 1;
+			if (fc->auto_inval_data && fc->precise_inval_data) {
+				pr_warn("filesystem requested both auto and "
+					"precise cache control - using auto\n");
+				fc->precise_inval_data = 0;
+			}
 			if (arg->flags & FUSE_DO_READDIRPLUS) {
 				fc->do_readdirplus = 1;
 				if (arg->flags & FUSE_READDIRPLUS_AUTO)
@@ -973,7 +981,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
 		FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
-		FUSE_NO_OPENDIR_SUPPORT;
+		FUSE_NO_OPENDIR_SUPPORT | FUSE_PRECISE_INVAL_DATA;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2ac598614a8f..33de8f6391ec 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -125,6 +125,9 @@
  *
  *  7.29
  *  - add FUSE_NO_OPENDIR_SUPPORT flag
+ *
+ *  7.30
+ *  - add FUSE_PRECISE_INVAL_DATA
  */
 
 #ifndef _LINUX_FUSE_H
@@ -160,7 +163,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 29
+#define FUSE_KERNEL_MINOR_VERSION 30
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -263,6 +266,7 @@ struct fuse_file_lock {
  * FUSE_MAX_PAGES: init_out.max_pages contains the max number of req pages
  * FUSE_CACHE_SYMLINKS: cache READLINK responses
  * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
+ * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for data cache invalidation
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -289,6 +293,7 @@ struct fuse_file_lock {
 #define FUSE_MAX_PAGES		(1 << 22)
 #define FUSE_CACHE_SYMLINKS	(1 << 23)
 #define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
+#define FUSE_PRECISE_INVAL_DATA (1 << 25)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.21.0.392.gf8f6787159

      parent reply index

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-27  9:15 [RESEND1, PATCH 0/2] " Kirill Smelkov
2019-03-27  9:15 ` [RESEND1, PATCH 1/2] fuse: convert printk -> pr_* Kirill Smelkov
2019-04-23 14:57   ` Miklos Szeredi
2019-04-24  8:38     ` FUSE workflow=? (Re: [RESEND1, PATCH 1/2] fuse: convert printk -> pr_*) Kirill Smelkov
2019-04-24  8:57       ` Miklos Szeredi
2019-04-24  9:54         ` Kirill Smelkov
2019-03-27 10:14 ` Kirill Smelkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e0b43507976d6ea9010f1bacaef067f18de49f1f.1553677194.git.kirr@nexedi.com \
    --to=kirr@nexedi.com \
    --cc=akpm@linux-foundation.org \
    --cc=anatol.pomozov@gmail.com \
    --cc=andrewjcg@fb.com \
    --cc=aryabinin@virtuozzo.com \
    --cc=avati@redhat.com \
    --cc=bfoster@redhat.com \
    --cc=chadaustin@fb.com \
    --cc=const@MakeLinux.com \
    --cc=dschatzberg@fb.com \
    --cc=fuse-devel@lists.sourceforge.net \
    --cc=hanwen@google.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=kuznet@virtuozzo.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=miklos@szeredi.hu \
    --cc=mpatlasov@parallels.com \
    --cc=mszeredi@redhat.com \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git