All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy
@ 2013-10-10 13:09 Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 01/11] fuse: Linking file to inode helper Maxim Patlasov
                   ` (11 more replies)
  0 siblings, 12 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:09 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Hi,

This is the sixth iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. This version is for the branch
"writepages.v2" of the fuse git tree, so a few patches from the previous
version are dropped (because they are already in the tree). Initial patch-set
description was the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back policy.
With this file data are pushed to the userspace with big chunks (depending on the
dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
  This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with few RAM and doesn't make the system stuck (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare meta-data
updates (virtual machines' and containers' disk images). With the existing cache
policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
too much time to proceed, much longer than if it was simply scp-ed over the same
network. The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage concepts
details if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
     on page writeback because page writeback can extend beyond the lifetime of
     the page-cache page
   - fuse_send_writepages can end_page_writeback on original page only after adding
     request to fi->writepages list; otherwise another writeback may happen inside
     the gap between end_page_writeback and adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
     due to reordering requests
   - fuse_flush must flush dirty memory and wait for all writeback on given inode
     before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
     otherwise a race with a writer extending i_size is possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
   on buffered writes) for details.
 - put enabling writeback cache under fusermount control; (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on for-next branch of the fuse tree (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
 - rebased on for-next branch of the fuse tree (634734b63ac39e137a1c623ba74f3e062b6577db)
 - fixed fuse_fillattr() for non-writeback_chace case
 - added comments explaining why we cannot trust size from server
 - rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
 - simplified patch titled Flush-files-on-wb-close
 - eliminated code duplications from fuse_readpage() ans fuse_prepare_write()
 - added comment about "disk full" errors to fuse_write_begin()

Changed in v5:
 - rebased on for-next branch of the fuse tree (e5c5f05dca0cf90f0f3bb1aea85dcf658baff185)
   with "hold i_mutex in fuse_file_fallocate() - v2" patch applied manually
 - updated patch descriptions according to Miklos' demand (using From: for
   patches initially developed by someone else)
 - moved ->writepages() to a separate patch and enabled it unconditionally
   for mmaped writeback
 - moved restructuring fuse_readpage to a separate patch
 - avoided use of union for fuse_fill_data
 - grabbed a ref to the request in fuse_send_writepages

Changed in v6:
 - rebased on writepages.v2 branch of the fuse tree (beba4ae4c2fda4e03a813aec640586087fa80a8b)
 - added waiting for fuse writeback to ->write_begin fuse address space operation.

Thanks,
Maxim

---

Maxim Patlasov (9):
      fuse: Linking file to inode helper
      fuse: Connection bit for enabling writeback
      fuse: Trust kernel i_size only - v4
      fuse: Trust kernel i_mtime only -v2
      fuse: Flush files on wb close -v2
      fuse: restructure fuse_readpage()
      fuse: fuse_flush() should wait on writeback
      fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
      fuse: Turn writeback cache on

Pavel Emelyanov (2):
      fuse: Prepare to handle short reads
      fuse: Implement write_begin/write_end callbacks -v2


 fs/fuse/cuse.c            |    5 -
 fs/fuse/dir.c             |  120 +++++++++++++++-
 fs/fuse/file.c            |  328 +++++++++++++++++++++++++++++++++++++++------
 fs/fuse/fuse_i.h          |   26 +++-
 fs/fuse/inode.c           |   37 ++++-
 include/uapi/linux/fuse.h |    2 
 6 files changed, 451 insertions(+), 67 deletions(-)

-- 
Signature

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 01/11] fuse: Linking file to inode helper
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
@ 2013-10-10 13:10 ` Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 02/11] fuse: Prepare to handle short reads Maxim Patlasov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

When writeback is ON every writeable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3f219fd..5e49c8f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -171,6 +171,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1929,20 +1945,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file_inode(file);
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 02/11] fuse: Prepare to handle short reads
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 01/11] fuse: Linking file to inode helper Maxim Patlasov
@ 2013-10-10 13:10 ` Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 03/11] fuse: Connection bit for enabling writeback Maxim Patlasov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

A helper which gets called when read reports less bytes than was requested.
See patch #4 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5e49c8f..886f3a2 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -654,6 +654,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+			    u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct fuse_io_priv io = { .async = 0, .file = file };
@@ -691,18 +700,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->page_descs[0].length = count;
 	num_read = fuse_send_read(req, &io, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -725,13 +734,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		/*
 		 * Short read means EOF. If file size is larger, truncate it
 		 */
-		if (!req->out.h.error && num_read < count) {
-			loff_t pos;
+		if (!req->out.h.error && num_read < count)
+			fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-			pos = page_offset(req->pages[0]) + num_read;
-			fuse_read_update_size(inode, pos,
-					      req->misc.read.attr_ver);
-		}
 		fuse_invalidate_attr(inode); /* atime changed */
 	}
 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 03/11] fuse: Connection bit for enabling writeback
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 01/11] fuse: Linking file to inode helper Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 02/11] fuse: Prepare to handle short reads Maxim Patlasov
@ 2013-10-10 13:10 ` Maxim Patlasov
  2013-10-10 13:10 ` [PATCH 04/11] fuse: Trust kernel i_size only - v4 Maxim Patlasov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/fuse_i.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a08b355..a490ba3 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -479,6 +479,9 @@ struct fuse_conn {
 	/** Set if bdi is valid */
 	unsigned bdi_initialized:1;
 
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 04/11] fuse: Trust kernel i_size only - v4
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (2 preceding siblings ...)
  2013-10-10 13:10 ` [PATCH 03/11] fuse: Connection bit for enabling writeback Maxim Patlasov
@ 2013-10-10 13:10 ` Maxim Patlasov
  2013-10-10 13:10   ` Maxim Patlasov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
 - fixed fuse_fillattr() not to use local i_size if writeback-cache is off
 - added a comment explaining why we cannot trust attr.size from server

Changed in v4:
 - do not change fuse_file_fallocate() because what we need is already
   implemented in commits a200a2d and 14c1441

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c   |   13 +++++++++++--
 fs/fuse/file.c  |   26 ++++++++++++++++++++++++--
 fs/fuse/inode.c |   11 +++++++++--
 3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b5..f022968 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -854,6 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 			  struct kstat *stat)
 {
 	unsigned int blkbits;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see the comment in fuse_change_attributes() */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		attr->size = i_size_read(inode);
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1594,6 +1599,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1665,7 +1671,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	/* see the comment in fuse_change_attributes() */
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1677,7 +1685,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 886f3a2..333a764 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -658,9 +658,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache,
+		 * but have not reached the client fs yet. So, the hole is not
+		 * present there.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
+
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a8ce6da..63818ab 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	struct timespec old_mtime;
 
@@ -211,10 +212,16 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+	/*
+	 * In case of writeback_cache enabled, the cached writes beyond EOF
+	 * extend local i_size without keeping userspace server in sync. So,
+	 * attr->size coming from server can be stale. We cannot trust it.
+	 */
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode)) {
+	if (!is_wb && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
 		if (oldsize != attr->size) {


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
@ 2013-10-10 13:10   ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.

Changed in v2 (thanks to Brian):
 - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
 - simplified fuse_set_mtime_local()
 - abandoned optimizations: clearing the flag on some operations (like
   direct write) is doable, but may lead to unexpected changes of
   user-visible mtime.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |  109 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/fuse/file.c   |   30 +++++++++++++--
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 138 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f022968..eda248b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+
+	return err;
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+	unsigned ivalid = iattr->ia_valid;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
+		/* client fs has just set mtime to iattr->ia_mtime */
+		inode->i_mtime = iattr->ia_mtime;
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	} else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
+		/* client fs doesn't know that we're updating i_mtime */
+		inode->i_mtime = current_fs_time(inode->i_sb);
+		set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	}
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1668,6 +1737,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		fuse_set_mtime_local(attr, inode);
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1898,6 +1971,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1937,6 +2021,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 333a764..eabe202 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2862,8 +2876,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 
 	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE))
-		fuse_write_update_size(inode, offset + length);
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		bool changed = fuse_write_update_size(inode, offset + length);
+
+		if (changed && fc->writeback_cache) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+		}
+	}
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a490ba3..3028b8a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -117,6 +117,8 @@ enum {
 	FUSE_I_ADVISE_RDPLUS,
 	/** An operation changing file size is in progress  */
 	FUSE_I_SIZE_UNSTABLE,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_DIRTY,
 };
 
 struct fuse_conn;
@@ -870,7 +872,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		    struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 63818ab..253701f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -250,6 +253,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -296,7 +301,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
@ 2013-10-10 13:10   ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:10 UTC (permalink / raw)
  To: miklos-sUDqSbJrdHQHWmgEVkV9KA
  Cc: dev-bzQdu9zFT3WakBO8gow8eQ, xemul-bzQdu9zFT3WakBO8gow8eQ,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jbottomley-bzQdu9zFT3WakBO8gow8eQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	devel-GEFAQzZX7r8dnm+yROfE0A

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.

Changed in v2 (thanks to Brian):
 - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
 - simplified fuse_set_mtime_local()
 - abandoned optimizations: clearing the flag on some operations (like
   direct write) is doable, but may lead to unexpected changes of
   user-visible mtime.

Signed-off-by: Maxim Patlasov <MPatlasov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/fuse/dir.c    |  109 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/fuse/file.c   |   30 +++++++++++++--
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 138 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f022968..eda248b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+
+	return err;
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+	unsigned ivalid = iattr->ia_valid;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
+		/* client fs has just set mtime to iattr->ia_mtime */
+		inode->i_mtime = iattr->ia_mtime;
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	} else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
+		/* client fs doesn't know that we're updating i_mtime */
+		inode->i_mtime = current_fs_time(inode->i_sb);
+		set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	}
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1668,6 +1737,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		fuse_set_mtime_local(attr, inode);
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1898,6 +1971,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1937,6 +2021,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 333a764..eabe202 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2862,8 +2876,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 
 	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE))
-		fuse_write_update_size(inode, offset + length);
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		bool changed = fuse_write_update_size(inode, offset + length);
+
+		if (changed && fc->writeback_cache) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+		}
+	}
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a490ba3..3028b8a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -117,6 +117,8 @@ enum {
 	FUSE_I_ADVISE_RDPLUS,
 	/** An operation changing file size is in progress  */
 	FUSE_I_SIZE_UNSTABLE,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_DIRTY,
 };
 
 struct fuse_conn;
@@ -870,7 +872,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		    struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 63818ab..253701f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -250,6 +253,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -296,7 +301,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);


------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 06/11] fuse: Flush files on wb close -v2
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (4 preceding siblings ...)
  2013-10-10 13:10   ` Maxim Patlasov
@ 2013-10-10 13:11 ` Maxim Patlasov
  2013-10-10 13:19     ` Maxim Patlasov
  2013-10-10 13:11 ` [PATCH 07/11] fuse: restructure fuse_readpage() Maxim Patlasov
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:11 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Changed in v2:
 - updated patch to be applied smoothly on top of
   "Trust kernel i_mtime only -v2".

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eabe202..b4d4189 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);
 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 07/11] fuse: restructure fuse_readpage()
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (5 preceding siblings ...)
  2013-10-10 13:11 ` [PATCH 06/11] fuse: Flush files on wb close -v2 Maxim Patlasov
@ 2013-10-10 13:11 ` Maxim Patlasov
  2013-11-12 17:17   ` Miklos Szeredi
  2013-10-10 13:11 ` [PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2 Maxim Patlasov
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:11 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Move the code filling and sending read request to a separate function. Future
patches will use it for .write_begin -- partial modification of a page
requires reading the page from the storage very similarly to what fuse_readpage
does.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b4d4189..77eb849 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 	}
 }
 
-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
 {
 	struct fuse_io_priv io = { .async = 0, .file = file };
 	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_req *req;
 	size_t num_read;
-	loff_t pos = page_offset(page);
-	size_t count = PAGE_CACHE_SIZE;
-	u64 attr_ver;
-	int err;
-
-	err = -EIO;
-	if (is_bad_inode(inode))
-		goto out;
 
 	/*
 	 * Page writeback can extend beyond the lifetime of the
@@ -724,20 +717,45 @@ static int fuse_readpage(struct file *file, struct page *page)
 	fuse_wait_on_page_writeback(inode, page->index);
 
 	req = fuse_get_req(fc, 1);
-	err = PTR_ERR(req);
+	*err = PTR_ERR(req);
 	if (IS_ERR(req))
-		goto out;
+		return 0;
 
-	attr_ver = fuse_get_attr_version(fc);
+	if (attr_ver_p)
+		*attr_ver_p = fuse_get_attr_version(fc);
 
 	req->out.page_zeroing = 1;
 	req->out.argpages = 1;
 	req->num_pages = 1;
 	req->pages[0] = page;
 	req->page_descs[0].length = count;
-	num_read = fuse_send_read(req, &io, pos, count, NULL);
-	err = req->out.h.error;
 
+	num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
+	*err = req->out.h.error;
+
+	if (*err)
+		fuse_put_request(fc, req);
+	else
+		*req_pp = req;
+
+	return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	size_t num_read;
+	size_t count = PAGE_CACHE_SIZE;
+	u64 attr_ver = 0;
+	int err;
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
@@ -747,10 +765,11 @@ static int fuse_readpage(struct file *file, struct page *page)
 
 		SetPageUptodate(page);
 	}
-
-	fuse_put_request(fc, req);
-	fuse_invalidate_attr(inode); /* atime changed */
- out:
+	if (req) {
+		fuse_put_request(fc, req);
+		fuse_invalidate_attr(inode); /* atime changed */
+	}
+out:
 	unlock_page(page);
 	return err;
 }


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (6 preceding siblings ...)
  2013-10-10 13:11 ` [PATCH 07/11] fuse: restructure fuse_readpage() Maxim Patlasov
@ 2013-10-10 13:11 ` Maxim Patlasov
  2013-10-10 13:11 ` [PATCH 09/11] fuse: fuse_flush() should wait on writeback Maxim Patlasov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:11 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

The .write_begin and .write_end are requiered to use generic routines
(generic_file_aio_write --> ... --> generic_perform_write) for buffered
writes.

Changed in v2:
 - added fuse_wait_on_page_writeback() to fuse ->write_begin; this is
   crucial to avoid the lost of user data by incorrect crop of a secondary
   request to a stale inarg->size of the primary.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77eb849..642bd51 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1950,6 +1950,104 @@ out:
 	return err;
 }
 
+/*
+ * Determine the number of bytes of data the page contains
+ */
+static inline unsigned fuse_page_length(struct page *page)
+{
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
+
+	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
+		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
+		if (page_index < end_index)
+			return PAGE_CACHE_SIZE;
+		if (page_index == end_index)
+			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
+	}
+	return 0;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req = NULL;
+	unsigned num_read;
+	unsigned page_len;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE)) {
+		fuse_wait_on_page_writeback(page->mapping->host, page->index);
+		return 0;
+	}
+
+	page_len = fuse_page_length(page);
+	if (!page_len) {
+		fuse_wait_on_page_writeback(page->mapping->host, page->index);
+		zero_user(page, 0, PAGE_CACHE_SIZE);
+		return 0;
+	}
+
+	num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL);
+	if (req)
+		fuse_put_request(fc, req);
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	} else if (num_read != PAGE_CACHE_SIZE) {
+		zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
+	}
+	return err;
+}
+
+/*
+ * It's worthy to make sure that space is reserved on disk for the write,
+ * but how to implement it without killing performance need more thinking.
+ */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2974,6 +3072,8 @@ static const struct address_space_operations fuse_file_aops  = {
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 09/11] fuse: fuse_flush() should wait on writeback
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (7 preceding siblings ...)
  2013-10-10 13:11 ` [PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2 Maxim Patlasov
@ 2013-10-10 13:11 ` Maxim Patlasov
  2013-10-10 13:12 ` [PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:11 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

The aim of .flush fop is to hint file-system that flushing its state or caches
or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may be not visible to userspace
yet if we won't ensure it explicitly.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 642bd51..f23fc65 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,7 @@
 #include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -399,6 +400,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fc->no_flush)
 		return 0;
 
+	err = filemap_write_and_wait(file->f_mapping);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
+	fuse_sync_writes(inode);
+	mutex_unlock(&inode->i_mutex);
+
 	req = fuse_get_req_nofail_nopages(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (8 preceding siblings ...)
  2013-10-10 13:11 ` [PATCH 09/11] fuse: fuse_flush() should wait on writeback Maxim Patlasov
@ 2013-10-10 13:12 ` Maxim Patlasov
  2013-10-10 13:12 ` [PATCH 11/11] fuse: Turn writeback cache on Maxim Patlasov
       [not found] ` <87li13pido.fsf@vostro.rath.org>
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:12 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When direct op occurs the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operation.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/cuse.c   |    5 +++--
 fs/fuse/file.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h |   13 ++++++++++++-
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index adbfd66..9bf4bb2 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -95,7 +95,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
 	struct iovec iov = { .iov_base = buf, .iov_len = count };
 	struct fuse_io_priv io = { .async = 0, .file = file };
 
-	return fuse_direct_io(&io, &iov, 1, count, &pos, 0);
+	return fuse_direct_io(&io, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -109,7 +109,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 	 * No locking or generic_write_checks(), the server is
 	 * responsible for locking and sanity checks.
 	 */
-	return fuse_direct_io(&io, &iov, 1, count, &pos, 1);
+	return fuse_direct_io(&io, &iov, 1, count, &pos,
+			      FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f23fc65..653ae9c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -341,6 +341,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -385,6 +410,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
 	return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
 	struct inode *inode = file_inode(file);
@@ -1363,8 +1401,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write)
+		       int flags)
 {
+	int write = flags & FUSE_DIO_WRITE;
+	int cuse = flags & FUSE_DIO_CUSE;
 	struct file *file = io->file;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
@@ -1393,6 +1433,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 			break;
 		}
 
+		if (!cuse)
+			fuse_wait_on_writeback(file->f_mapping->host, pos,
+					       nbytes);
+
 		if (write)
 			nres = fuse_send_write(req, io, pos, nbytes, owner);
 		else
@@ -1471,7 +1515,8 @@ static ssize_t __fuse_direct_write(struct fuse_io_priv *io,
 
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res)
-		res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);
+		res = fuse_direct_io(io, iov, nr_segs, count, ppos,
+				     FUSE_DIO_WRITE);
 
 	fuse_invalidate_attr(inode);
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3028b8a..129af78 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -862,9 +862,20 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir);
+
+/**
+ * fuse_direct_io() flags
+ */
+
+/** If set, it is WRITE; otherwise - READ */
+#define FUSE_DIO_WRITE (1 << 0)
+
+/** CUSE pass fuse_direct_io() a file which f_mapping->host is not from FUSE */
+#define FUSE_DIO_CUSE  (1 << 1)
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write);
+		       int flags);
 long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
 		   unsigned int flags);
 long fuse_ioctl_common(struct file *file, unsigned int cmd,


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 11/11] fuse: Turn writeback cache on
  2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (9 preceding siblings ...)
  2013-10-10 13:12 ` [PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
@ 2013-10-10 13:12 ` Maxim Patlasov
       [not found] ` <87li13pido.fsf@vostro.rath.org>
  11 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:12 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this and
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c            |    5 +++++
 fs/fuse/fuse_i.h          |    4 ++++
 fs/fuse/inode.c           |   13 +++++++++++++
 include/uapi/linux/fuse.h |    2 ++
 4 files changed, 24 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 653ae9c..aa38e77 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -208,6 +208,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		spin_unlock(&fc->lock);
 		fuse_invalidate_attr(inode);
 	}
+	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+		fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1228,6 +1230,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct iov_iter i;
 	loff_t endbyte = 0;
 
+	if (get_fuse_conn(inode)->writeback_cache)
+		return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
 	WARN_ON(iocb->ki_pos != pos);
 
 	ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 129af78..76c4f77 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER         (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
+    module will enable support of writback cache */
+#define FUSE_ALLOW_WBCACHE       (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 253701f..6ddf135 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -460,6 +460,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_ALLOW_WBCACHE,
 	OPT_ERR
 };
 
@@ -472,6 +473,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_ALLOW_WBCACHE,		"allow_wbcache"},
 	{OPT_ERR,			NULL}
 };
 
@@ -545,6 +547,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_ALLOW_WBCACHE:
+			d->flags |= FUSE_ALLOW_WBCACHE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -572,6 +578,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		seq_puts(m, ",allow_wbcache");
 	return 0;
 }
 
@@ -889,6 +897,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 			}
 			if (arg->flags & FUSE_ASYNC_DIO)
 				fc->async_dio = 1;
+			if (arg->flags & FUSE_WRITEBACK_CACHE &&
+			    fc->flags & FUSE_ALLOW_WBCACHE)
+				fc->writeback_cache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_CACHE_SIZE;
 			fc->no_lock = 1;
@@ -917,6 +928,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
 		FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
 		FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO;
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		arg->flags |= FUSE_WRITEBACK_CACHE;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 60bb2f9..5c98dd1 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -219,6 +219,7 @@ struct fuse_file_lock {
  * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
  * FUSE_READDIRPLUS_AUTO: adaptive readdirplus
  * FUSE_ASYNC_DIO: asynchronous direct I/O submission
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -236,6 +237,7 @@ struct fuse_file_lock {
 #define FUSE_DO_READDIRPLUS	(1 << 13)
 #define FUSE_READDIRPLUS_AUTO	(1 << 14)
 #define FUSE_ASYNC_DIO		(1 << 15)
+#define FUSE_WRITEBACK_CACHE	(1 << 16)
 
 /**
  * CUSE INIT request/reply flags


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] fuse: Flush files on wb close -v2
@ 2013-10-10 13:19     ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:19 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

From: Pavel Emelyanov <xemul@openvz.org>

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case of "write cache" turned off, the flush is ensured by fuse_vma_close().

Changed in v2:
 - updated patch to be applied smoothly on top of
   "Trust kernel i_mtime only -v2".

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eabe202..b4d4189 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);
 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH] fuse: Flush files on wb close -v2
@ 2013-10-10 13:19     ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-10-10 13:19 UTC (permalink / raw)
  To: miklos-sUDqSbJrdHQHWmgEVkV9KA
  Cc: dev-bzQdu9zFT3WakBO8gow8eQ, xemul-bzQdu9zFT3WakBO8gow8eQ,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jbottomley-bzQdu9zFT3WakBO8gow8eQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	devel-GEFAQzZX7r8dnm+yROfE0A

From: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case of "write cache" turned off, the flush is ensured by fuse_vma_close().

Changed in v2:
 - updated patch to be applied smoothly on top of
   "Trust kernel i_mtime only -v2".

Signed-off-by: Maxim Patlasov <MPatlasov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/fuse/file.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eabe202..b4d4189 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);
 


------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from 
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60134071&iu=/4140/ostg.clktrk

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
  2013-10-10 13:10   ` Maxim Patlasov
  (?)
@ 2013-11-12 16:52   ` Miklos Szeredi
  2013-12-26 15:41     ` Maxim Patlasov
  2013-12-26 15:51     ` [PATCH 05/11] fuse: Trust kernel i_mtime only -v3 Maxim Patlasov
  -1 siblings, 2 replies; 25+ messages in thread
From: Miklos Szeredi @ 2013-11-12 16:52 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On Thu, Oct 10, 2013 at 05:10:56PM +0400, Maxim Patlasov wrote:
> Let the kernel maintain i_mtime locally:
>  - clear S_NOCMTIME
>  - implement i_op->update_time()
>  - flush mtime on fsync and last close
>  - update i_mtime explicitly on truncate and fallocate
> 
> Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
> should be flushed to the server eventually.
> 
> Changed in v2 (thanks to Brian):
>  - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
>  - simplified fuse_set_mtime_local()
>  - abandoned optimizations: clearing the flag on some operations (like
>    direct write) is doable, but may lead to unexpected changes of
>    user-visible mtime.
> 
> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
> ---
>  fs/fuse/dir.c    |  109 ++++++++++++++++++++++++++++++++++++++++++++++++------
>  fs/fuse/file.c   |   30 +++++++++++++--
>  fs/fuse/fuse_i.h |    6 ++-
>  fs/fuse/inode.c  |   13 +++++-
>  4 files changed, 138 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index f022968..eda248b 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  
>  	/* see the comment in fuse_change_attributes() */
> -	if (fc->writeback_cache && S_ISREG(inode->i_mode))
> +	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
>  		attr->size = i_size_read(inode);
> +		attr->mtime = inode->i_mtime.tv_sec;
> +		attr->mtimensec = inode->i_mtime.tv_nsec;
> +	}
>  
>  	stat->dev = inode->i_sb->s_dev;
>  	stat->ino = attr->ino;
> @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode)
>  	spin_unlock(&fc->lock);
>  }
>  
> +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
> +			      struct inode *inode,
> +			      struct fuse_setattr_in *inarg_p,
> +			      struct fuse_attr_out *outarg_p)
> +{
> +	req->in.h.opcode = FUSE_SETATTR;
> +	req->in.h.nodeid = get_node_id(inode);
> +	req->in.numargs = 1;
> +	req->in.args[0].size = sizeof(*inarg_p);
> +	req->in.args[0].value = inarg_p;
> +	req->out.numargs = 1;
> +	if (fc->minor < 9)
> +		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
> +	else
> +		req->out.args[0].size = sizeof(*outarg_p);
> +	req->out.args[0].value = outarg_p;
> +}
> +
> +/*
> + * Flush inode->i_mtime to the server
> + */
> +int fuse_flush_mtime(struct file *file, bool nofail)
> +{
> +	struct inode *inode = file->f_mapping->host;
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_req *req = NULL;
> +	struct fuse_setattr_in inarg;
> +	struct fuse_attr_out outarg;
> +	int err;
> +
> +	if (nofail) {
> +		req = fuse_get_req_nofail_nopages(fc, file);
> +	} else {
> +		req = fuse_get_req_nopages(fc);
> +		if (IS_ERR(req))
> +			return PTR_ERR(req);
> +	}
> +
> +	memset(&inarg, 0, sizeof(inarg));
> +	memset(&outarg, 0, sizeof(outarg));
> +
> +	inarg.valid |= FATTR_MTIME;
> +	inarg.mtime = inode->i_mtime.tv_sec;
> +	inarg.mtimensec = inode->i_mtime.tv_nsec;
> +
> +	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
> +	fuse_request_send(fc, req);
> +	err = req->out.h.error;
> +	fuse_put_request(fc, req);
> +
> +	if (!err)
> +		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);

Doing the test and the clear separately opens a huge race window when i_mtime
modifications are bound to get lost.

> +
> +	return err;
> +}
> +
> +/*
> + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
> + */
> +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
> +{
> +	unsigned ivalid = iattr->ia_valid;
> +	struct fuse_inode *fi = get_fuse_inode(inode);
> +
> +	if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
> +		/* client fs has just set mtime to iattr->ia_mtime */
> +		inode->i_mtime = iattr->ia_mtime;
> +		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);

This is protected by i_mutex, so it should be safe.

> +	} else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
> +		/* client fs doesn't know that we're updating i_mtime */

If so, why not tell the client fs to update mtime?

> +		inode->i_mtime = current_fs_time(inode->i_sb);
> +		set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
> +	}
> +}
> +
>  /*
>   * Set attributes, and at the same time refresh them.
>   *
> @@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>  		inarg.valid |= FATTR_LOCKOWNER;
>  		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
>  	}
> -	req->in.h.opcode = FUSE_SETATTR;
> -	req->in.h.nodeid = get_node_id(inode);
> -	req->in.numargs = 1;
> -	req->in.args[0].size = sizeof(inarg);
> -	req->in.args[0].value = &inarg;
> -	req->out.numargs = 1;
> -	if (fc->minor < 9)
> -		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
> -	else
> -		req->out.args[0].size = sizeof(outarg);
> -	req->out.args[0].value = &outarg;
> +	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
>  	fuse_request_send(fc, req);
>  	err = req->out.h.error;
>  	fuse_put_request(fc, req);
> @@ -1668,6 +1737,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>  	}
>  
>  	spin_lock(&fc->lock);
> +	/* the kernel maintains i_mtime locally */
> +	if (fc->writeback_cache && S_ISREG(inode->i_mode))
> +		fuse_set_mtime_local(attr, inode);
> +
>  	fuse_change_attributes_common(inode, &outarg.attr,
>  				      attr_timeout(&outarg));
>  	oldsize = inode->i_size;
> @@ -1898,6 +1971,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
>  	return err;
>  }
>  
> +static int fuse_update_time(struct inode *inode, struct timespec *now,
> +			    int flags)
> +{
> +	if (flags & S_MTIME) {
> +		inode->i_mtime = *now;
> +		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
> +		BUG_ON(!S_ISREG(inode->i_mode));
> +	}
> +	return 0;
> +}
> +
>  static const struct inode_operations fuse_dir_inode_operations = {
>  	.lookup		= fuse_lookup,
>  	.mkdir		= fuse_mkdir,
> @@ -1937,6 +2021,7 @@ static const struct inode_operations fuse_common_inode_operations = {
>  	.getxattr	= fuse_getxattr,
>  	.listxattr	= fuse_listxattr,
>  	.removexattr	= fuse_removexattr,
> +	.update_time	= fuse_update_time,
>  };
>  
>  static const struct inode_operations fuse_symlink_inode_operations = {
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 333a764..eabe202 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
>  
>  static int fuse_release(struct inode *inode, struct file *file)
>  {
> +	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))

So, why not make this test_and_clear_bit()?


> +		fuse_flush_mtime(file, true);
> +
>  	fuse_release_common(file, FUSE_RELEASE);
>  
>  	/* return value is ignored by VFS */
> @@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>  
>  	fuse_sync_writes(inode);
>  
> +	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {

And this too.

> +		int err = fuse_flush_mtime(file, false);
> +		if (err)
> +			goto out;
> +	}
> +
>  	req = fuse_get_req_nopages(fc);
>  	if (IS_ERR(req)) {
>  		err = PTR_ERR(req);
> @@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
>  	return req->misc.write.out.size;
>  }
>  
> -void fuse_write_update_size(struct inode *inode, loff_t pos)
> +bool fuse_write_update_size(struct inode *inode, loff_t pos)
>  {
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	bool ret = false;
>  
>  	spin_lock(&fc->lock);
>  	fi->attr_version = ++fc->attr_version;
> -	if (pos > inode->i_size)
> +	if (pos > inode->i_size) {
>  		i_size_write(inode, pos);
> +		ret = true;
> +	}
>  	spin_unlock(&fc->lock);
> +
> +	return ret;
>  }
>  
>  static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
> @@ -2862,8 +2876,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>  		goto out;
>  
>  	/* we could have extended the file */
> -	if (!(mode & FALLOC_FL_KEEP_SIZE))
> -		fuse_write_update_size(inode, offset + length);
> +	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> +		bool changed = fuse_write_update_size(inode, offset + length);
> +
> +		if (changed && fc->writeback_cache) {
> +			struct fuse_inode *fi = get_fuse_inode(inode);
> +
> +			inode->i_mtime = current_fs_time(inode->i_sb);
> +			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
> +		}
> +	}
>  
>  	if (mode & FALLOC_FL_PUNCH_HOLE)
>  		truncate_pagecache_range(inode, offset, offset + length - 1);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index a490ba3..3028b8a 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -117,6 +117,8 @@ enum {
>  	FUSE_I_ADVISE_RDPLUS,
>  	/** An operation changing file size is in progress  */
>  	FUSE_I_SIZE_UNSTABLE,
> +	/** i_mtime has been updated locally; a flush to userspace needed */
> +	FUSE_I_MTIME_DIRTY,
>  };
>  
>  struct fuse_conn;
> @@ -870,7 +872,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
>  unsigned fuse_file_poll(struct file *file, poll_table *wait);
>  int fuse_dev_release(struct inode *inode, struct file *file);
>  
> -void fuse_write_update_size(struct inode *inode, loff_t pos);
> +bool fuse_write_update_size(struct inode *inode, loff_t pos);
> +
> +int fuse_flush_mtime(struct file *file, bool nofail);
>  
>  int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>  		    struct file *file);
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 63818ab..253701f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>  	inode->i_blocks  = attr->blocks;
>  	inode->i_atime.tv_sec   = attr->atime;
>  	inode->i_atime.tv_nsec  = attr->atimensec;
> -	inode->i_mtime.tv_sec   = attr->mtime;
> -	inode->i_mtime.tv_nsec  = attr->mtimensec;
> +	/* mtime from server may be stale due to local buffered write */
> +	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
> +		inode->i_mtime.tv_sec   = attr->mtime;
> +		inode->i_mtime.tv_nsec  = attr->mtimensec;
> +	}
>  	inode->i_ctime.tv_sec   = attr->ctime;
>  	inode->i_ctime.tv_nsec  = attr->ctimensec;
>  
> @@ -250,6 +253,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
>  {
>  	inode->i_mode = attr->mode & S_IFMT;
>  	inode->i_size = attr->size;
> +	inode->i_mtime.tv_sec  = attr->mtime;
> +	inode->i_mtime.tv_nsec = attr->mtimensec;
>  	if (S_ISREG(inode->i_mode)) {
>  		fuse_init_common(inode);
>  		fuse_init_file_inode(inode);
> @@ -296,7 +301,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>  		return NULL;
>  
>  	if ((inode->i_state & I_NEW)) {
> -		inode->i_flags |= S_NOATIME|S_NOCMTIME;
> +		inode->i_flags |= S_NOATIME;
> +		if (!fc->writeback_cache)
> +			inode->i_flags |= S_NOCMTIME;
>  		inode->i_generation = generation;
>  		inode->i_data.backing_dev_info = &fc->bdi;
>  		fuse_init_inode(inode, attr);
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/11] fuse: restructure fuse_readpage()
  2013-10-10 13:11 ` [PATCH 07/11] fuse: restructure fuse_readpage() Maxim Patlasov
@ 2013-11-12 17:17   ` Miklos Szeredi
  2013-12-20 14:54     ` Maxim Patlasov
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2013-11-12 17:17 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote:
> Move the code filling and sending read request to a separate function. Future
> patches will use it for .write_begin -- partial modification of a page
> requires reading the page from the storage very similarly to what fuse_readpage
> does.
> 
> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
> ---
>  fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 37 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index b4d4189..77eb849 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
>  	}
>  }
>  
> -static int fuse_readpage(struct file *file, struct page *page)
> +static int __fuse_readpage(struct file *file, struct page *page, size_t count,
> +			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)

Signature of this helper looks really ugly.  A quick look tells me that neither
caller actually needs 'req'.  And fuse_get_attr_version() can be moved to the
one caller that needs it.  And negative err can be returned.  And then all those
ugly pointer args are gone and the whole thing is much simpler.

Thanks,
Miklos



>  {
>  	struct fuse_io_priv io = { .async = 0, .file = file };
>  	struct inode *inode = page->mapping->host;
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	struct fuse_req *req;
>  	size_t num_read;
> -	loff_t pos = page_offset(page);
> -	size_t count = PAGE_CACHE_SIZE;
> -	u64 attr_ver;
> -	int err;
> -
> -	err = -EIO;
> -	if (is_bad_inode(inode))
> -		goto out;
>  
>  	/*
>  	 * Page writeback can extend beyond the lifetime of the
> @@ -724,20 +717,45 @@ static int fuse_readpage(struct file *file, struct page *page)
>  	fuse_wait_on_page_writeback(inode, page->index);
>  
>  	req = fuse_get_req(fc, 1);
> -	err = PTR_ERR(req);
> +	*err = PTR_ERR(req);
>  	if (IS_ERR(req))
> -		goto out;
> +		return 0;
>  
> -	attr_ver = fuse_get_attr_version(fc);
> +	if (attr_ver_p)
> +		*attr_ver_p = fuse_get_attr_version(fc);
>  
>  	req->out.page_zeroing = 1;
>  	req->out.argpages = 1;
>  	req->num_pages = 1;
>  	req->pages[0] = page;
>  	req->page_descs[0].length = count;
> -	num_read = fuse_send_read(req, &io, pos, count, NULL);
> -	err = req->out.h.error;
>  
> +	num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
> +	*err = req->out.h.error;
> +
> +	if (*err)
> +		fuse_put_request(fc, req);
> +	else
> +		*req_pp = req;
> +
> +	return num_read;
> +}
> +
> +static int fuse_readpage(struct file *file, struct page *page)
> +{
> +	struct inode *inode = page->mapping->host;
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +	struct fuse_req *req = NULL;
> +	size_t num_read;
> +	size_t count = PAGE_CACHE_SIZE;
> +	u64 attr_ver = 0;
> +	int err;
> +
> +	err = -EIO;
> +	if (is_bad_inode(inode))
> +		goto out;
> +
> +	num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
>  	if (!err) {
>  		/*
>  		 * Short read means EOF.  If file size is larger, truncate it
> @@ -747,10 +765,11 @@ static int fuse_readpage(struct file *file, struct page *page)
>  
>  		SetPageUptodate(page);
>  	}
> -
> -	fuse_put_request(fc, req);
> -	fuse_invalidate_attr(inode); /* atime changed */
> - out:
> +	if (req) {
> +		fuse_put_request(fc, req);
> +		fuse_invalidate_attr(inode); /* atime changed */
> +	}
> +out:
>  	unlock_page(page);
>  	return err;
>  }
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/11] fuse: restructure fuse_readpage()
  2013-11-12 17:17   ` Miklos Szeredi
@ 2013-12-20 14:54     ` Maxim Patlasov
  2014-01-06 16:43       ` Miklos Szeredi
  0 siblings, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2013-12-20 14:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Hi Miklos,

Sorry for delay, see please inline comments below.

On 11/12/2013 09:17 PM, Miklos Szeredi wrote:
> On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote:
>> Move the code filling and sending read request to a separate function. Future
>> patches will use it for .write_begin -- partial modification of a page
>> requires reading the page from the storage very similarly to what fuse_readpage
>> does.
>>
>> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
>> ---
>>   fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
>>   1 file changed, 37 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index b4d4189..77eb849 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
>>   	}
>>   }
>>   
>> -static int fuse_readpage(struct file *file, struct page *page)
>> +static int __fuse_readpage(struct file *file, struct page *page, size_t count,
>> +			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
> Signature of this helper looks really ugly.  A quick look tells me that neither
> caller actually needs 'req'.

fuse_readpage() passes 'req' to fuse_short_read(). And the latter uses 
req->pages[] to nullify a part of request.

> And fuse_get_attr_version() can be moved to the
> one caller that needs it.

Yes, it's doable. But this would make attr_version mechanism less 
efficient (under some loads): suppose the file on server was truncated 
externally, then fuse_readpage() acquires fc->attr_version, then some 
innocent write bumps fc->attr_version while we're waiting for fuse 
writeback, then fuse_read_update_size() would noop. In the other words, 
it's beneficial to keep the time interval between acquiring 
fc->attr_version and subsequent comparison as short as possible.

> And negative err can be returned.

Yes, but this will require some precautions for positive 
req->out.h.error. Like "err = (req->out.h.error <= 0) ? req->out.h.error 
: -EIO;". But this must be OK - filtering out positive req->out.h.error 
is a good idea, imho.


> And then all those
> ugly pointer args are gone and the whole thing is much simpler.

If you agree with my comments above, only 1 of 3 ugly pointers can be 
avoided. Another way would be to revert the code back to the initial 
implementation where fuse_readpage() and fuse_prepare_write() sent read 
requests independently.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
  2013-11-12 16:52   ` Miklos Szeredi
@ 2013-12-26 15:41     ` Maxim Patlasov
  2014-01-06 16:22       ` Miklos Szeredi
  2013-12-26 15:51     ` [PATCH 05/11] fuse: Trust kernel i_mtime only -v3 Maxim Patlasov
  1 sibling, 1 reply; 25+ messages in thread
From: Maxim Patlasov @ 2013-12-26 15:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

Hi Miklos,

Sorry for delay, see please inline comments below.

On 11/12/2013 08:52 PM, Miklos Szeredi wrote:
> On Thu, Oct 10, 2013 at 05:10:56PM +0400, Maxim Patlasov wrote:
>> Let the kernel maintain i_mtime locally:
>>   - clear S_NOCMTIME
>>   - implement i_op->update_time()
>>   - flush mtime on fsync and last close
>>   - update i_mtime explicitly on truncate and fallocate
>>
>> Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
>> should be flushed to the server eventually.
>>
>> Changed in v2 (thanks to Brian):
>>   - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
>>   - simplified fuse_set_mtime_local()
>>   - abandoned optimizations: clearing the flag on some operations (like
>>     direct write) is doable, but may lead to unexpected changes of
>>     user-visible mtime.
>>
>> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
>> ---
>>   fs/fuse/dir.c    |  109 ++++++++++++++++++++++++++++++++++++++++++++++++------
>>   fs/fuse/file.c   |   30 +++++++++++++--
>>   fs/fuse/fuse_i.h |    6 ++-
>>   fs/fuse/inode.c  |   13 +++++-
>>   4 files changed, 138 insertions(+), 20 deletions(-)
>>
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index f022968..eda248b 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>>   	struct fuse_conn *fc = get_fuse_conn(inode);
>>   
>>   	/* see the comment in fuse_change_attributes() */
>> -	if (fc->writeback_cache && S_ISREG(inode->i_mode))
>> +	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
>>   		attr->size = i_size_read(inode);
>> +		attr->mtime = inode->i_mtime.tv_sec;
>> +		attr->mtimensec = inode->i_mtime.tv_nsec;
>> +	}
>>   
>>   	stat->dev = inode->i_sb->s_dev;
>>   	stat->ino = attr->ino;
>> @@ -1582,6 +1585,82 @@ void fuse_release_nowrite(struct inode *inode)
>>   	spin_unlock(&fc->lock);
>>   }
>>   
>> +static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
>> +			      struct inode *inode,
>> +			      struct fuse_setattr_in *inarg_p,
>> +			      struct fuse_attr_out *outarg_p)
>> +{
>> +	req->in.h.opcode = FUSE_SETATTR;
>> +	req->in.h.nodeid = get_node_id(inode);
>> +	req->in.numargs = 1;
>> +	req->in.args[0].size = sizeof(*inarg_p);
>> +	req->in.args[0].value = inarg_p;
>> +	req->out.numargs = 1;
>> +	if (fc->minor < 9)
>> +		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
>> +	else
>> +		req->out.args[0].size = sizeof(*outarg_p);
>> +	req->out.args[0].value = outarg_p;
>> +}
>> +
>> +/*
>> + * Flush inode->i_mtime to the server
>> + */
>> +int fuse_flush_mtime(struct file *file, bool nofail)
>> +{
>> +	struct inode *inode = file->f_mapping->host;
>> +	struct fuse_inode *fi = get_fuse_inode(inode);
>> +	struct fuse_conn *fc = get_fuse_conn(inode);
>> +	struct fuse_req *req = NULL;
>> +	struct fuse_setattr_in inarg;
>> +	struct fuse_attr_out outarg;
>> +	int err;
>> +
>> +	if (nofail) {
>> +		req = fuse_get_req_nofail_nopages(fc, file);
>> +	} else {
>> +		req = fuse_get_req_nopages(fc);
>> +		if (IS_ERR(req))
>> +			return PTR_ERR(req);
>> +	}
>> +
>> +	memset(&inarg, 0, sizeof(inarg));
>> +	memset(&outarg, 0, sizeof(outarg));
>> +
>> +	inarg.valid |= FATTR_MTIME;
>> +	inarg.mtime = inode->i_mtime.tv_sec;
>> +	inarg.mtimensec = inode->i_mtime.tv_nsec;
>> +
>> +	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
>> +	fuse_request_send(fc, req);
>> +	err = req->out.h.error;
>> +	fuse_put_request(fc, req);
>> +
>> +	if (!err)
>> +		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
> Doing the test and the clear separately opens a huge race window when i_mtime
> modifications are bound to get lost.

No. Because the whole operation is protected by i_mutex (see 
fuse_fsync_common()).

>
>> +
>> +	return err;
>> +}
>> +
>> +/*
>> + * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
>> + */
>> +static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
>> +{
>> +	unsigned ivalid = iattr->ia_valid;
>> +	struct fuse_inode *fi = get_fuse_inode(inode);
>> +
>> +	if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
>> +		/* client fs has just set mtime to iattr->ia_mtime */
>> +		inode->i_mtime = iattr->ia_mtime;
>> +		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
> This is protected by i_mutex, so it should be safe.

Yes.

>
>> +	} else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
>> +		/* client fs doesn't know that we're updating i_mtime */
> If so, why not tell the client fs to update mtime?

Looks reasonable. I'll resend updated patch soon.

>
>> +		inode->i_mtime = current_fs_time(inode->i_sb);
>> +		set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
>> +	}
>> +}
>> +
>>   /*
>>    * Set attributes, and at the same time refresh them.
>>    *
>> @@ -1641,17 +1720,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>>   		inarg.valid |= FATTR_LOCKOWNER;
>>   		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
>>   	}
>> -	req->in.h.opcode = FUSE_SETATTR;
>> -	req->in.h.nodeid = get_node_id(inode);
>> -	req->in.numargs = 1;
>> -	req->in.args[0].size = sizeof(inarg);
>> -	req->in.args[0].value = &inarg;
>> -	req->out.numargs = 1;
>> -	if (fc->minor < 9)
>> -		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
>> -	else
>> -		req->out.args[0].size = sizeof(outarg);
>> -	req->out.args[0].value = &outarg;
>> +	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
>>   	fuse_request_send(fc, req);
>>   	err = req->out.h.error;
>>   	fuse_put_request(fc, req);
>> @@ -1668,6 +1737,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>>   	}
>>   
>>   	spin_lock(&fc->lock);
>> +	/* the kernel maintains i_mtime locally */
>> +	if (fc->writeback_cache && S_ISREG(inode->i_mode))
>> +		fuse_set_mtime_local(attr, inode);
>> +
>>   	fuse_change_attributes_common(inode, &outarg.attr,
>>   				      attr_timeout(&outarg));
>>   	oldsize = inode->i_size;
>> @@ -1898,6 +1971,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
>>   	return err;
>>   }
>>   
>> +static int fuse_update_time(struct inode *inode, struct timespec *now,
>> +			    int flags)
>> +{
>> +	if (flags & S_MTIME) {
>> +		inode->i_mtime = *now;
>> +		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
>> +		BUG_ON(!S_ISREG(inode->i_mode));
>> +	}
>> +	return 0;
>> +}
>> +
>>   static const struct inode_operations fuse_dir_inode_operations = {
>>   	.lookup		= fuse_lookup,
>>   	.mkdir		= fuse_mkdir,
>> @@ -1937,6 +2021,7 @@ static const struct inode_operations fuse_common_inode_operations = {
>>   	.getxattr	= fuse_getxattr,
>>   	.listxattr	= fuse_listxattr,
>>   	.removexattr	= fuse_removexattr,
>> +	.update_time	= fuse_update_time,
>>   };
>>   
>>   static const struct inode_operations fuse_symlink_inode_operations = {
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 333a764..eabe202 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
>>   
>>   static int fuse_release(struct inode *inode, struct file *file)
>>   {
>> +	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
> So, why not make this test_and_clear_bit()?

To be uniform with fuse_fsync_common(). See please the next comment.

>
>
>> +		fuse_flush_mtime(file, true);
>> +
>>   	fuse_release_common(file, FUSE_RELEASE);
>>   
>>   	/* return value is ignored by VFS */
>> @@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
>>   
>>   	fuse_sync_writes(inode);
>>   
>> +	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
> And this too.

If mtime flush fails, the bit is kept. Another way would be to use 
test_and_clear_bit() as you suggested, but set the bit back in case of 
failure. Technically they are equivalent (due to i_mutex protection), 
but current implementation looks more natural.

Thanks,
Maxim

>
>> +		int err = fuse_flush_mtime(file, false);
>> +		if (err)
>> +			goto out;
>> +	}
>> +
>>   	req = fuse_get_req_nopages(fc);
>>   	if (IS_ERR(req)) {
>>   		err = PTR_ERR(req);
>> @@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
>>   	return req->misc.write.out.size;
>>   }
>>   
>> -void fuse_write_update_size(struct inode *inode, loff_t pos)
>> +bool fuse_write_update_size(struct inode *inode, loff_t pos)
>>   {
>>   	struct fuse_conn *fc = get_fuse_conn(inode);
>>   	struct fuse_inode *fi = get_fuse_inode(inode);
>> +	bool ret = false;
>>   
>>   	spin_lock(&fc->lock);
>>   	fi->attr_version = ++fc->attr_version;
>> -	if (pos > inode->i_size)
>> +	if (pos > inode->i_size) {
>>   		i_size_write(inode, pos);
>> +		ret = true;
>> +	}
>>   	spin_unlock(&fc->lock);
>> +
>> +	return ret;
>>   }
>>   
>>   static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
>> @@ -2862,8 +2876,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>>   		goto out;
>>   
>>   	/* we could have extended the file */
>> -	if (!(mode & FALLOC_FL_KEEP_SIZE))
>> -		fuse_write_update_size(inode, offset + length);
>> +	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
>> +		bool changed = fuse_write_update_size(inode, offset + length);
>> +
>> +		if (changed && fc->writeback_cache) {
>> +			struct fuse_inode *fi = get_fuse_inode(inode);
>> +
>> +			inode->i_mtime = current_fs_time(inode->i_sb);
>> +			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
>> +		}
>> +	}
>>   
>>   	if (mode & FALLOC_FL_PUNCH_HOLE)
>>   		truncate_pagecache_range(inode, offset, offset + length - 1);
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index a490ba3..3028b8a 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -117,6 +117,8 @@ enum {
>>   	FUSE_I_ADVISE_RDPLUS,
>>   	/** An operation changing file size is in progress  */
>>   	FUSE_I_SIZE_UNSTABLE,
>> +	/** i_mtime has been updated locally; a flush to userspace needed */
>> +	FUSE_I_MTIME_DIRTY,
>>   };
>>   
>>   struct fuse_conn;
>> @@ -870,7 +872,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
>>   unsigned fuse_file_poll(struct file *file, poll_table *wait);
>>   int fuse_dev_release(struct inode *inode, struct file *file);
>>   
>> -void fuse_write_update_size(struct inode *inode, loff_t pos);
>> +bool fuse_write_update_size(struct inode *inode, loff_t pos);
>> +
>> +int fuse_flush_mtime(struct file *file, bool nofail);
>>   
>>   int fuse_do_setattr(struct inode *inode, struct iattr *attr,
>>   		    struct file *file);
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 63818ab..253701f 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
>>   	inode->i_blocks  = attr->blocks;
>>   	inode->i_atime.tv_sec   = attr->atime;
>>   	inode->i_atime.tv_nsec  = attr->atimensec;
>> -	inode->i_mtime.tv_sec   = attr->mtime;
>> -	inode->i_mtime.tv_nsec  = attr->mtimensec;
>> +	/* mtime from server may be stale due to local buffered write */
>> +	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
>> +		inode->i_mtime.tv_sec   = attr->mtime;
>> +		inode->i_mtime.tv_nsec  = attr->mtimensec;
>> +	}
>>   	inode->i_ctime.tv_sec   = attr->ctime;
>>   	inode->i_ctime.tv_nsec  = attr->ctimensec;
>>   
>> @@ -250,6 +253,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
>>   {
>>   	inode->i_mode = attr->mode & S_IFMT;
>>   	inode->i_size = attr->size;
>> +	inode->i_mtime.tv_sec  = attr->mtime;
>> +	inode->i_mtime.tv_nsec = attr->mtimensec;
>>   	if (S_ISREG(inode->i_mode)) {
>>   		fuse_init_common(inode);
>>   		fuse_init_file_inode(inode);
>> @@ -296,7 +301,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>>   		return NULL;
>>   
>>   	if ((inode->i_state & I_NEW)) {
>> -		inode->i_flags |= S_NOATIME|S_NOCMTIME;
>> +		inode->i_flags |= S_NOATIME;
>> +		if (!fc->writeback_cache)
>> +			inode->i_flags |= S_NOCMTIME;
>>   		inode->i_generation = generation;
>>   		inode->i_data.backing_dev_info = &fc->bdi;
>>   		fuse_init_inode(inode, attr);
>>
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 05/11] fuse: Trust kernel i_mtime only -v3
  2013-11-12 16:52   ` Miklos Szeredi
  2013-12-26 15:41     ` Maxim Patlasov
@ 2013-12-26 15:51     ` Maxim Patlasov
  1 sibling, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2013-12-26 15:51 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, linux-fsdevel,
	akpm, fengguang.wu, devel

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.

Changed in v2 (thanks to Brian):
 - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
 - simplified fuse_set_mtime_local()
 - abandoned optimizations: clearing the flag on some operations (like
   direct write) is doable, but may lead to unexpected changes of
   user-visible mtime.

Changed in v3 (thanks to Miklos):
 - let client fs know about local mtime change for ATTR_SIZE and
   !ATTR_MTIME_SET cases.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |  113 +++++++++++++++++++++++++++++++++++++++++++++---------
 fs/fuse/file.c   |   30 ++++++++++++--
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 136 insertions(+), 26 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f022968..d1a3167 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -857,8 +857,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1510,7 +1513,8 @@ static bool update_mtime(unsigned ivalid)
 	return true;
 }
 
-static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg)
+static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg,
+			   bool trust_local_mtime, struct timespec *mtime)
 {
 	unsigned ivalid = iattr->ia_valid;
 
@@ -1529,11 +1533,17 @@ static void iattr_to_fattr(struct iattr *iattr, struct fuse_setattr_in *arg)
 		if (!(ivalid & ATTR_ATIME_SET))
 			arg->valid |= FATTR_ATIME_NOW;
 	}
-	if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+
+	if (!trust_local_mtime || (ivalid & ATTR_MTIME_SET) ||
+	    !(ivalid & ATTR_SIZE))
+		*mtime = iattr->ia_mtime;
+
+	if ((ivalid & ATTR_MTIME) && (update_mtime(ivalid) ||
+				      trust_local_mtime)) {
 		arg->valid |= FATTR_MTIME;
-		arg->mtime = iattr->ia_mtime.tv_sec;
-		arg->mtimensec = iattr->ia_mtime.tv_nsec;
-		if (!(ivalid & ATTR_MTIME_SET))
+		arg->mtime = mtime->tv_sec;
+		arg->mtimensec = mtime->tv_nsec;
+		if (!(ivalid & ATTR_MTIME_SET) && !trust_local_mtime)
 			arg->valid |= FATTR_MTIME_NOW;
 	}
 }
@@ -1582,6 +1592,63 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+
+	return err;
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1602,6 +1669,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
+	bool trust_local_mtime = is_wb && S_ISREG(inode->i_mode);
+	struct timespec new_mtime = current_fs_time(inode->i_sb);
 
 	if (!(fc->flags & FUSE_DEFAULT_PERMISSIONS))
 		attr->ia_valid |= ATTR_FORCE;
@@ -1630,7 +1699,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 
 	memset(&inarg, 0, sizeof(inarg));
 	memset(&outarg, 0, sizeof(outarg));
-	iattr_to_fattr(attr, &inarg);
+	iattr_to_fattr(attr, &inarg, trust_local_mtime, &new_mtime);
 	if (file) {
 		struct fuse_file *ff = file->private_data;
 		inarg.valid |= FATTR_FH;
@@ -1641,17 +1710,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1668,6 +1727,12 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (trust_local_mtime && (attr->ia_valid & ATTR_MTIME)) {
+		inode->i_mtime = new_mtime;
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	}
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1898,6 +1963,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1937,6 +2013,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 333a764..eabe202 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2862,8 +2876,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 
 	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE))
-		fuse_write_update_size(inode, offset + length);
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		bool changed = fuse_write_update_size(inode, offset + length);
+
+		if (changed && fc->writeback_cache) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+		}
+	}
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a490ba3..3028b8a 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -117,6 +117,8 @@ enum {
 	FUSE_I_ADVISE_RDPLUS,
 	/** An operation changing file size is in progress  */
 	FUSE_I_SIZE_UNSTABLE,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_DIRTY,
 };
 
 struct fuse_conn;
@@ -870,7 +872,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		    struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 63818ab..253701f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -250,6 +253,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -296,7 +301,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
  2013-12-26 15:41     ` Maxim Patlasov
@ 2014-01-06 16:22       ` Miklos Szeredi
  2014-01-20 11:33         ` Maxim Patlasov
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2014-01-06 16:22 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On Thu, Dec 26, 2013 at 07:41:41PM +0400, Maxim Patlasov wrote:
> >>+
> >>+	if (!err)
> >>+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
> >Doing the test and the clear separately opens a huge race window when i_mtime
> >modifications are bound to get lost.
> 
> No. Because the whole operation is protected by i_mutex (see
> fuse_fsync_common()).

fuse_release_common() doesn't have i_mutex.  It's probably safe to acquire it,
but is that really needed?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/11] fuse: restructure fuse_readpage()
  2013-12-20 14:54     ` Maxim Patlasov
@ 2014-01-06 16:43       ` Miklos Szeredi
  2014-01-20 11:46         ` Maxim Patlasov
  0 siblings, 1 reply; 25+ messages in thread
From: Miklos Szeredi @ 2014-01-06 16:43 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On Fri, Dec 20, 2013 at 06:54:40PM +0400, Maxim Patlasov wrote:
> Hi Miklos,
> 
> Sorry for delay, see please inline comments below.
> 
> On 11/12/2013 09:17 PM, Miklos Szeredi wrote:
> >On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote:
> >>Move the code filling and sending read request to a separate function. Future
> >>patches will use it for .write_begin -- partial modification of a page
> >>requires reading the page from the storage very similarly to what fuse_readpage
> >>does.
> >>
> >>Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
> >>---
> >>  fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
> >>  1 file changed, 37 insertions(+), 18 deletions(-)
> >>
> >>diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> >>index b4d4189..77eb849 100644
> >>--- a/fs/fuse/file.c
> >>+++ b/fs/fuse/file.c
> >>@@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
> >>  	}
> >>  }
> >>-static int fuse_readpage(struct file *file, struct page *page)
> >>+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
> >>+			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
> >Signature of this helper looks really ugly.  A quick look tells me that neither
> >caller actually needs 'req'.
> 
> fuse_readpage() passes 'req' to fuse_short_read(). And the latter
> uses req->pages[] to nullify a part of request.

I don't get it.  __fuse_readpage() itself call's fuse_short_read(), not callers
of __fuse_readpage().  Or do they?

> 
> >And fuse_get_attr_version() can be moved to the
> >one caller that needs it.
> 
> Yes, it's doable. But this would make attr_version mechanism less
> efficient (under some loads): suppose the file on server was
> truncated externally, then fuse_readpage() acquires
> fc->attr_version, then some innocent write bumps fc->attr_version
> while we're waiting for fuse writeback, then fuse_read_update_size()
> would noop. In the other words, it's beneficial to keep the time
> interval between acquiring fc->attr_version and subsequent
> comparison as short as possible.

Okay, lets try to keep this the way it is.  I don't like it very much, but I
fear changing user visible behavior.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/11] fuse: Trust kernel i_mtime only -v2
  2014-01-06 16:22       ` Miklos Szeredi
@ 2014-01-20 11:33         ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2014-01-20 11:33 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On 01/06/2014 08:22 PM, Miklos Szeredi wrote:
> On Thu, Dec 26, 2013 at 07:41:41PM +0400, Maxim Patlasov wrote:
>>>> +
>>>> +	if (!err)
>>>> +		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
>>> Doing the test and the clear separately opens a huge race window when i_mtime
>>> modifications are bound to get lost.
>> No. Because the whole operation is protected by i_mutex (see
>> fuse_fsync_common()).
> fuse_release_common() doesn't have i_mutex.  It's probably safe to acquire it,
> but is that really needed?

No, that's not needed, I think. Because by the time of calling 
fuse_release(), file->f_count is already zero and no userspace activity 
is possible on the file.

Are you OK about -v3 version of the patch (sent 12/26/2013)?

Thanks,
Maxim

>
> Thanks,
> Miklos
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/11] fuse: restructure fuse_readpage()
  2014-01-06 16:43       ` Miklos Szeredi
@ 2014-01-20 11:46         ` Maxim Patlasov
  0 siblings, 0 replies; 25+ messages in thread
From: Maxim Patlasov @ 2014-01-20 11:46 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-fsdevel, akpm, fengguang.wu, devel

On 01/06/2014 08:43 PM, Miklos Szeredi wrote:
> On Fri, Dec 20, 2013 at 06:54:40PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> Sorry for delay, see please inline comments below.
>>
>> On 11/12/2013 09:17 PM, Miklos Szeredi wrote:
>>> On Thu, Oct 10, 2013 at 05:11:25PM +0400, Maxim Patlasov wrote:
>>>> Move the code filling and sending read request to a separate function. Future
>>>> patches will use it for .write_begin -- partial modification of a page
>>>> requires reading the page from the storage very similarly to what fuse_readpage
>>>> does.
>>>>
>>>> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
>>>> ---
>>>>   fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
>>>>   1 file changed, 37 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>> index b4d4189..77eb849 100644
>>>> --- a/fs/fuse/file.c
>>>> +++ b/fs/fuse/file.c
>>>> @@ -700,21 +700,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
>>>>   	}
>>>>   }
>>>> -static int fuse_readpage(struct file *file, struct page *page)
>>>> +static int __fuse_readpage(struct file *file, struct page *page, size_t count,
>>>> +			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
>>> Signature of this helper looks really ugly.  A quick look tells me that neither
>>> caller actually needs 'req'.
>> fuse_readpage() passes 'req' to fuse_short_read(). And the latter
>> uses req->pages[] to nullify a part of request.
> I don't get it.  __fuse_readpage() itself call's fuse_short_read(), not callers
> of __fuse_readpage().  Or do they?

fuse_readpage() is a caller of __fuse_readpage() and it looks (after 
applying the patch) like this:

 > static int fuse_readpage(struct file *file, struct page *page)
 > {
 > ...
 >    num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
 >    if (!err) {
 >        /*
 >         * Short read means EOF.  If file size is larger, truncate it
 >         */
 >        if (num_read < count)
 >            fuse_short_read(req, inode, attr_ver);
 >
 >        SetPageUptodate(page);
 >    }

Thanks,
Maxim

>
>>> And fuse_get_attr_version() can be moved to the
>>> one caller that needs it.
>> Yes, it's doable. But this would make attr_version mechanism less
>> efficient (under some loads): suppose the file on server was
>> truncated externally, then fuse_readpage() acquires
>> fc->attr_version, then some innocent write bumps fc->attr_version
>> while we're waiting for fuse writeback, then fuse_read_update_size()
>> would noop. In the other words, it's beneficial to keep the time
>> interval between acquiring fc->attr_version and subsequent
>> comparison as short as possible.
> Okay, lets try to keep this the way it is.  I don't like it very much, but I
> fear changing user visible behavior.
>
> Thanks,
> Miklos
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* fuse write performance
       [not found]               ` <87wqgwdvi1.fsf@vostro.rath.org>
@ 2014-02-27 15:33                 ` Miklos Szeredi
  0 siblings, 0 replies; 25+ messages in thread
From: Miklos Szeredi @ 2014-02-27 15:33 UTC (permalink / raw)
  To: Maxim Patlasov; +Cc: linux-fsdevel, fuse-devel, Teijo Holzer

I did a little benchmarking on fusexmp_fh over tmpfs.

First off, xattr support in the filesystem needs to be turned off otherwise the
kernel will do getxattr on each write(2) call.  This can be achieved by
e.g. commenting out the "#define HAVE_SETXATTR 1" line in include/config.h in
the fuse tree and recompiling the examples.

Splice support was broken in libfuse git, fixed now (head is 0096c126aa45).

Added kernel support to splice directly into a tmpfs file (patch attached; has
been tested, but please be careful, don't use on important data, etc).  Is
against:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

I'm using iozone, as it is somewhat more suited to the task than dd.

Tested two blocksizes 4k and 128k:

  iozone -s 128M -r4k -i0 -+n -e -f $file
  iozone -s 128M -r128k -i0 -+n -e -f $file

Results are in MB/s.

4k    128k  test                     # of memcpy's
---------------------------------------------------
1030  1140  native                               1
  80   750  fuse (direct_io,splice_read)	 1
  80   600  fuse (direct_io)			 2
 500   520  fuse (writeback_cache,splice_read)	 2
 320   320  fuse (writeback_cache)		 4
  75   470  fuse (big_writes,splice_read)	 2
  75   360  fuse (big_writes)			 3

So, things aren't looking all that bad.  The "direct_io" option is the best
performer if using a large blocksize.  But "writeback_cache,splice_read" isn't
doing bad at all, and is fast regardless of the blocksize (as Maxim mentioned
there's a performance bug in there that causes slowdown for <4k blocksizes, but
that's easily fixable).

In case you're wondering why writeback_cache is doing 4 copies, not 3: the fuse
kernel module uses a temporary buffer for cached writeback.  With
splice+writeback_cache this temporary page can be directly inserted into the
target file's page cache and so two copies can be saved (one copy from temporary
page to userspace buffer and one copy from userspace buffer to target's page
cache).  Without the attached patch "writeback_cache,splice_read" would be 3
copies and performance somewhere in between the 4 and 2 copies.

Also note: the above measurements are *sustained* write rates.  With
"writeback_cache" writes can be as quick as native.  This can be demonstrated by
omitting the "-e" option of iozone.  To make better use of the writeback cache
it is possible to enlarge the maximum size of the cache with:

  echo 100 > /sys/class/bdi/0:30/max_ratio

Where "100" is a percent (default is 1 for fuse 100 for everything else).

And "30" is the device number of the fuse filesystem.  Can be determined by

  stat -c%d fuse-mountpoint

Thanks,
Miklos


---
 fs/fuse/dev.c    |   31 +++++++++++--
 fs/fuse/file.c   |    8 ++-
 fs/fuse/fuse_i.h |    3 +
 mm/shmem.c       |  125 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 153 insertions(+), 14 deletions(-)

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -724,6 +724,7 @@ static int fuse_copy_fill(struct fuse_co
 			buf->page = page;
 			buf->offset = 0;
 			buf->len = 0;
+			buf->ops = &nosteal_pipe_buf_ops;
 
 			cs->currbuf = buf;
 			cs->mapaddr = kmap(page);
@@ -884,6 +885,26 @@ static int fuse_try_move_page(struct fus
 	return 1;
 }
 
+static int fuse_pipe_buf_steal(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf)
+{
+	struct page *page = buf->page;
+	lock_page(page);
+	return 0;
+}
+
+
+/* HACK HACK HACK: This will Oops if buffer persist after module unload: */
+const struct pipe_buf_operations fuse_pipe_buf_ops = {
+	.can_merge = 0,
+	.map = generic_pipe_buf_map,
+	.unmap = generic_pipe_buf_unmap,
+	.confirm = generic_pipe_buf_confirm,
+	.release = generic_pipe_buf_release,
+	.steal = fuse_pipe_buf_steal,
+	.get = generic_pipe_buf_get,
+};
+
 static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
 			 unsigned offset, unsigned count)
 {
@@ -900,6 +921,10 @@ static int fuse_ref_page(struct fuse_cop
 	buf->page = page;
 	buf->offset = offset;
 	buf->len = count;
+	if (cs->req->in.stealpages)
+		buf->ops = &fuse_pipe_buf_ops;
+	else
+		buf->ops = &nosteal_pipe_buf_ops;
 
 	cs->pipebufs++;
 	cs->nr_segs++;
@@ -1342,11 +1367,7 @@ static ssize_t fuse_dev_splice_read(stru
 		buf->page = bufs[page_nr].page;
 		buf->offset = bufs[page_nr].offset;
 		buf->len = bufs[page_nr].len;
-		/*
-		 * Need to be careful about this.  Having buf->ops in module
-		 * code can Oops if the buffer persists after module unload.
-		 */
-		buf->ops = &nosteal_pipe_buf_ops;
+		buf->ops = bufs[page_nr].ops;
 
 		pipe->nrbufs++;
 		page_nr++;
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1530,7 +1530,7 @@ static void fuse_writepage_free(struct f
 	int i;
 
 	for (i = 0; i < req->num_pages; i++)
-		__free_page(req->pages[i]);
+		put_page(req->pages[i]);
 
 	if (req->ff)
 		fuse_file_put(req->ff, false);
@@ -1704,6 +1704,7 @@ static int fuse_writepage_locked(struct
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
 	req->misc.write.next = NULL;
 	req->in.argpages = 1;
+	req->in.stealpages = 1;
 	req->num_pages = 1;
 	req->pages[0] = tmp_page;
 	req->page_descs[0].offset = 0;
@@ -1897,7 +1898,7 @@ static int fuse_writepages_fill(struct p
 		err = -ENOMEM;
 		req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
 		if (!req) {
-			__free_page(tmp_page);
+			put_page(tmp_page);
 			goto out_unlock;
 		}
 
@@ -1905,6 +1906,7 @@ static int fuse_writepages_fill(struct p
 		req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
 		req->misc.write.next = NULL;
 		req->in.argpages = 1;
+		req->in.stealpages = 1;
 		req->background = 1;
 		req->num_pages = 0;
 		req->end = fuse_writepage_end;
@@ -2670,7 +2672,7 @@ long fuse_do_ioctl(struct file *file, un
 		fuse_put_request(fc, req);
 	free_page((unsigned long) iov_page);
 	while (num_pages)
-		__free_page(pages[--num_pages]);
+		put_page(pages[--num_pages]);
 	kfree(pages);
 
 	return err ? err : outarg.result;
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -175,6 +175,9 @@ struct fuse_in {
 	/** True if the data for the last argument is in req->pages */
 	unsigned argpages:1;
 
+	/** True if the pages can be stolen */
+	unsigned stealpages:1;
+
 	/** Number of arguments */
 	unsigned numargs;
 
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -32,6 +32,7 @@
 #include <linux/export.h>
 #include <linux/swap.h>
 #include <linux/aio.h>
+#include <linux/pipe_fs_i.h>
 
 static struct vfsmount *shm_mnt;
 
@@ -116,13 +117,14 @@ static bool shmem_should_replace_page(st
 static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 				struct shmem_inode_info *info, pgoff_t index);
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
-	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);
+	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type,
+	struct pipe_buffer *bufm, struct pipe_inode_info *pipe);
 
 static inline int shmem_getpage(struct inode *inode, pgoff_t index,
 	struct page **pagep, enum sgp_type sgp, int *fault_type)
 {
 	return shmem_getpage_gfp(inode, index, pagep, sgp,
-			mapping_gfp_mask(inode->i_mapping), fault_type);
+		mapping_gfp_mask(inode->i_mapping), fault_type, NULL, NULL);
 }
 
 static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
@@ -1065,7 +1067,8 @@ static int shmem_replace_page(struct pag
  * entry since a page cannot live in both the swap and page cache
  */
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
-	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
+	struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type,
+	struct pipe_buffer *buf, struct pipe_inode_info *pipe)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info;
@@ -1075,6 +1078,7 @@ static int shmem_getpage_gfp(struct inod
 	int error;
 	int once = 0;
 	int alloced = 0;
+	int lru = 0;
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
@@ -1191,6 +1195,15 @@ static int shmem_getpage_gfp(struct inod
 			percpu_counter_inc(&sbinfo->used_blocks);
 		}
 
+		if (buf && buf->ops->steal(pipe, buf) == 0) {
+			page = buf->page;
+			ClearPageMappedToDisk(page);
+			page_cache_get(page);
+			SetPageSwapBacked(page);
+			if (buf->flags & PIPE_BUF_FLAG_LRU)
+				lru = 1;
+			goto skip_alloc;
+		}
 		page = shmem_alloc_page(gfp, info, index);
 		if (!page) {
 			error = -ENOMEM;
@@ -1199,6 +1212,7 @@ static int shmem_getpage_gfp(struct inod
 
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
+skip_alloc:
 		error = mem_cgroup_cache_charge(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
 		if (error)
@@ -1213,7 +1227,8 @@ static int shmem_getpage_gfp(struct inod
 			mem_cgroup_uncharge_cache_page(page);
 			goto decused;
 		}
-		lru_cache_add_anon(page);
+		if (!lru)
+			lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
 		info->alloced++;
@@ -1714,6 +1729,103 @@ static ssize_t shmem_file_splice_read(st
 	return error;
 }
 
+static int shmem_pipe_to_file(struct pipe_inode_info *pipe,
+			      struct pipe_buffer *buf,
+			      struct splice_desc *sd)
+{
+	struct file *file = sd->u.file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	unsigned int offset, this_len;
+	pgoff_t index;
+	struct page *page;
+	int ret;
+	bool whole_page;
+
+	index = sd->pos >> PAGE_CACHE_SHIFT;
+	offset = sd->pos & ~PAGE_CACHE_MASK;
+
+	this_len = sd->len;
+	if (this_len + offset > PAGE_CACHE_SIZE)
+		this_len = PAGE_CACHE_SIZE - offset;
+
+	whole_page = (offset == 0) && (this_len == PAGE_CACHE_SIZE);
+
+	ret = shmem_getpage_gfp(inode, index, &page, SGP_WRITE,
+				mapping_gfp_mask(mapping), NULL,
+				whole_page ? buf : NULL, pipe);
+	if (unlikely(ret))
+		goto out;
+
+	if (buf->page != page) {
+		char *src = buf->ops->map(pipe, buf, 1);
+		char *dst = kmap_atomic(page);
+
+		memcpy(dst + offset, src + buf->offset, this_len);
+		flush_dcache_page(page);
+		kunmap_atomic(dst);
+		buf->ops->unmap(pipe, buf, src);
+	}
+	mark_page_accessed(page);
+	ret = shmem_write_end(file, mapping, sd->pos, this_len, this_len,
+			      page, NULL);
+out:
+	return ret;
+}
+
+static ssize_t shmem_file_splice_write(struct pipe_inode_info *pipe,
+				       struct file *out, loff_t *ppos,
+				       size_t len, unsigned int flags)
+{
+	struct address_space *mapping = out->f_mapping;
+	struct inode *inode = mapping->host;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags,
+		.pos = *ppos,
+		.u.file = out,
+	};
+	ssize_t ret;
+
+	pipe_lock(pipe);
+
+	splice_from_pipe_begin(&sd);
+	do {
+		ret = splice_from_pipe_next(pipe, &sd);
+		if (ret <= 0)
+			break;
+
+		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
+		ret = file_remove_suid(out);
+		if (!ret) {
+			ret = file_update_time(out);
+			if (!ret)
+				ret = splice_from_pipe_feed(pipe, &sd,
+							    shmem_pipe_to_file);
+		}
+		mutex_unlock(&inode->i_mutex);
+	} while (ret > 0);
+	splice_from_pipe_end(pipe, &sd);
+
+	pipe_unlock(pipe);
+
+	if (sd.num_spliced)
+		ret = sd.num_spliced;
+
+	if (ret > 0) {
+		int err;
+
+		err = generic_write_sync(out, *ppos, ret);
+		if (err)
+			ret = err;
+		else
+			*ppos += ret;
+		balance_dirty_pages_ratelimited(mapping);
+	}
+
+	return ret;
+}
+
 /*
  * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
  */
@@ -2714,7 +2826,7 @@ static const struct file_operations shme
 	.aio_write	= generic_file_aio_write,
 	.fsync		= noop_fsync,
 	.splice_read	= shmem_file_splice_read,
-	.splice_write	= generic_file_splice_write,
+	.splice_write	= shmem_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif
 };
@@ -3034,7 +3146,8 @@ struct page *shmem_read_mapping_page_gfp
 	int error;
 
 	BUG_ON(mapping->a_ops != &shmem_aops);
-	error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL);
+	error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp,
+				  NULL, NULL, NULL);
 	if (error)
 		page = ERR_PTR(error);
 	else

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-02-27 15:32 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-10 13:09 [PATCH v6 00/11] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
2013-10-10 13:10 ` [PATCH 01/11] fuse: Linking file to inode helper Maxim Patlasov
2013-10-10 13:10 ` [PATCH 02/11] fuse: Prepare to handle short reads Maxim Patlasov
2013-10-10 13:10 ` [PATCH 03/11] fuse: Connection bit for enabling writeback Maxim Patlasov
2013-10-10 13:10 ` [PATCH 04/11] fuse: Trust kernel i_size only - v4 Maxim Patlasov
2013-10-10 13:10 ` [PATCH 05/11] fuse: Trust kernel i_mtime only -v2 Maxim Patlasov
2013-10-10 13:10   ` Maxim Patlasov
2013-11-12 16:52   ` Miklos Szeredi
2013-12-26 15:41     ` Maxim Patlasov
2014-01-06 16:22       ` Miklos Szeredi
2014-01-20 11:33         ` Maxim Patlasov
2013-12-26 15:51     ` [PATCH 05/11] fuse: Trust kernel i_mtime only -v3 Maxim Patlasov
2013-10-10 13:11 ` [PATCH 06/11] fuse: Flush files on wb close -v2 Maxim Patlasov
2013-10-10 13:19   ` [PATCH] " Maxim Patlasov
2013-10-10 13:19     ` Maxim Patlasov
2013-10-10 13:11 ` [PATCH 07/11] fuse: restructure fuse_readpage() Maxim Patlasov
2013-11-12 17:17   ` Miklos Szeredi
2013-12-20 14:54     ` Maxim Patlasov
2014-01-06 16:43       ` Miklos Szeredi
2014-01-20 11:46         ` Maxim Patlasov
2013-10-10 13:11 ` [PATCH 08/11] fuse: Implement write_begin/write_end callbacks -v2 Maxim Patlasov
2013-10-10 13:11 ` [PATCH 09/11] fuse: fuse_flush() should wait on writeback Maxim Patlasov
2013-10-10 13:12 ` [PATCH 10/11] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
2013-10-10 13:12 ` [PATCH 11/11] fuse: Turn writeback cache on Maxim Patlasov
     [not found] ` <87li13pido.fsf@vostro.rath.org>
     [not found]   ` <CAJfpegvFFXPuEoXXXJCBLEQi+T3mx95g4X+MjxAbyg0rhjbfDA@mail.gmail.com>
     [not found]     ` <528321C3.20307@parallels.com>
     [not found]       ` <CAJfpegu_ScgdgHDVM_5GMQC86OFnpXLDsuArKbWP_cX-D8m8Jw@mail.gmail.com>
     [not found]         ` <52FA4ECD.2030608@parallels.com>
     [not found]           ` <52FC2DE5.9010008@rath.org>
     [not found]             ` <52FCB029.9040608@parallels.com>
     [not found]               ` <87wqgwdvi1.fsf@vostro.rath.org>
2014-02-27 15:33                 ` fuse write performance Miklos Szeredi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.