linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
@ 2012-11-16 17:04 Maxim Patlasov
  2012-11-16 17:05 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim Patlasov
                   ` (15 more replies)
  0 siblings, 16 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:04 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

Hi,

This is the second iteration of Pavel Emelyanov's patch-set implementing
a write-back policy for the FUSE page cache. The initial patch-set description
was as follows:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy, which results in performance problems on certain
workloads. For example, when copying a big file into a FUSE file, cp pushes every
128k to userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution to this is switching the FUSE page cache to a write-back policy.
With this, file data are pushed to userspace in big chunks (depending on the
dirty memory limits, but this is much more than 128k), which lets the FUSE daemons
handle the size updates in a more efficient manner.
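
To put rough numbers on the difference, here is a minimal sketch of the
arithmetic only. The helper name demo_requests_needed() is hypothetical, and
the 32 MiB write-back batch used in the note below is an assumed figure for
illustration; the real batch size depends on the dirty memory limits.

```c
#include <stdint.h>

/*
 * Number of requests needed to push total_bytes to userspace when each
 * request carries at most chunk_bytes (ceiling division).
 * chunk_bytes must be non-zero.
 */
static uint64_t demo_requests_needed(uint64_t total_bytes, uint64_t chunk_bytes)
{
	return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

For a 1 GiB copy, 128k synchronous pushes need 8192 round trips to userspace,
while an assumed 32 MiB batch would need 32.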

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?). When writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages decision.
  This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with little RAM and doesn't stall the system (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation,
which is heavily oriented toward storing big blobs of data with extremely rare
meta-data updates (virtual machines' and containers' disk images). With the existing
cache policy, a typical usage scenario (copying a big VM disk into a cloud) takes
far too long, much longer than if it were simply scp-ed over the same network.
The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage
concepts if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
     on page writeback because page writeback can extend beyond the lifetime of
     the page-cache page
   - fuse_send_writepages can end_page_writeback on original page only after adding
     request to fi->writepages list; otherwise another writeback may happen inside
     the gap between end_page_writeback and adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
     due to reordering requests
   - fuse_flush must flush dirty memory and wait for all writeback on given inode
     before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
     otherwise a race with a writer extending i_size is possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
   on buffered writes) for details)
 - put enabling writeback cache under fusermount control (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Thanks,
Maxim

---

Maxim Patlasov (14):
      fuse: Linking file to inode helper
      fuse: Getting file for writeback helper
      fuse: Prepare to handle short reads
      fuse: Prepare to handle multiple pages in writeback
      fuse: Connection bit for enabling writeback
      fuse: Trust kernel i_size only
      fuse: Update i_mtime on buffered writes
      fuse: Flush files on wb close
      fuse: Implement writepages and write_begin/write_end callbacks
      fuse: fuse_writepage_locked() should wait on writeback
      fuse: fuse_flush() should wait on writeback
      fuse: Fix O_DIRECT operations vs cached writeback misorder
      fuse: Turn writeback cache on
      mm: Account for WRITEBACK_TEMP in balance_dirty_pages


 fs/fuse/dir.c             |   51 ++++
 fs/fuse/file.c            |  523 +++++++++++++++++++++++++++++++++++++++++----
 fs/fuse/fuse_i.h          |   20 ++
 fs/fuse/inode.c           |   98 ++++++++
 include/uapi/linux/fuse.h |    1 
 mm/page-writeback.c       |    3 
 6 files changed, 638 insertions(+), 58 deletions(-)

-- 
Signature

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 01/14] fuse: Linking file to inode helper
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
@ 2012-11-16 17:05 ` Maxim Patlasov
  2012-11-16 17:05 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim Patlasov
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:05 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

When writeback is ON, every writeable file should be on the per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.
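
The "link only if not already linked" idiom the helper relies on can be
sketched in plain userspace C. Everything named demo_* below is a hypothetical
stand-in for the kernel's list_head API, and the fc->lock spinlock that the
real helper takes is deliberately left out, so this shows the list logic only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h;
	h->prev = h;
}

/* An initialised entry that points at itself is on no list. */
static bool demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

/* Insert entry right after head. */
static void demo_list_add(struct demo_list *entry, struct demo_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static size_t demo_list_len(const struct demo_list *head)
{
	size_t n = 0;

	for (const struct demo_list *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/*
 * Chain a file's write_entry onto the inode's write_files list, but only
 * once: repeated calls for the same file are no-ops. Locking is omitted.
 */
static void demo_link_write_file(struct demo_list *write_entry,
				 struct demo_list *write_files)
{
	if (demo_list_empty(write_entry))
		demo_list_add(write_entry, write_files);
}
```

Because the check and the insertion are not atomic on their own, the real
helper performs both under fc->lock.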

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 78d2837..9e85ef0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1384,20 +1400,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file->f_dentry->d_inode;
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;



* [PATCH 02/14] fuse: Getting file for writeback helper
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2012-11-16 17:05 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim Patlasov
@ 2012-11-16 17:05 ` Maxim Patlasov
  2012-11-16 17:06 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim Patlasov
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:05 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

There will be a .writepages callback implementation which will need to
get a fuse_file out of a fuse_inode, so add a helper for this.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9e85ef0..cd41b56 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1276,6 +1276,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
 	fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+					 struct fuse_inode *fi)
+{
+	struct fuse_file *ff;
+
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+	fuse_file_get(ff);
+	spin_unlock(&fc->lock);
+
+	return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1283,7 +1297,6 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_req *req;
-	struct fuse_file *ff;
 	struct page *tmp_page;
 
 	set_page_writeback(page);
@@ -1296,13 +1309,8 @@ static int fuse_writepage_locked(struct page *page)
 	if (!tmp_page)
 		goto err_free;
 
-	spin_lock(&fc->lock);
-	BUG_ON(list_empty(&fi->write_files));
-	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-	req->ff = fuse_file_get(ff);
-	spin_unlock(&fc->lock);
-
-	fuse_write_fill(req, ff, page_offset(page), 0);
+	req->ff = fuse_write_file(fc, fi);
+	fuse_write_fill(req, req->ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;



* [PATCH 03/14] fuse: Prepare to handle short reads
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2012-11-16 17:05 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim Patlasov
  2012-11-16 17:05 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim Patlasov
@ 2012-11-16 17:06 ` Maxim Patlasov
  2012-11-16 17:07 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:06 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

A helper which gets called when a read reports fewer bytes than were requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cd41b56..51804cf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+			    u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct inode *inode = page->mapping->host;
@@ -573,18 +582,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->pages[0] = page;
 	num_read = fuse_send_read(req, file, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -607,13 +616,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		/*
 		 * Short read means EOF. If file size is larger, truncate it
 		 */
-		if (!req->out.h.error && num_read < count) {
-			loff_t pos;
+		if (!req->out.h.error && num_read < count)
+			fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-			pos = page_offset(req->pages[0]) + num_read;
-			fuse_read_update_size(inode, pos,
-					      req->misc.read.attr_ver);
-		}
 		fuse_invalidate_attr(inode); /* atime changed */
 	}
 



* [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (2 preceding siblings ...)
  2012-11-16 17:06 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim Patlasov
@ 2012-11-16 17:07 ` Maxim Patlasov
  2012-11-16 17:07 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim Patlasov
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:07 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.
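
The two calculations that change for multi-page requests can be sketched in
userspace C as follows. The wb_* names and the 4096-byte page size are
assumptions for illustration; the kernel code uses PAGE_CACHE_SHIFT,
PAGE_CACHE_SIZE and the fields of struct fuse_req.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define WB_PAGE_SIZE  (1ULL << WB_PAGE_SHIFT)

/*
 * Does a writeback request starting at byte req_offset and carrying
 * num_pages pages cover page number index? The request spans the
 * half-open page range [first, first + num_pages).
 */
static bool wb_req_covers_page(uint64_t req_offset, unsigned int num_pages,
			       uint64_t index)
{
	uint64_t first = req_offset >> WB_PAGE_SHIFT;

	return first <= index && index < first + num_pages;
}

/*
 * Number of bytes to send for the request, clamped to the file size.
 * Returns 0 when the whole request lies at or beyond i_size (truncated).
 */
static uint64_t wb_write_size(uint64_t req_offset, unsigned int num_pages,
			      uint64_t i_size)
{
	uint64_t data_size = (uint64_t)num_pages * WB_PAGE_SIZE;

	if (req_offset + data_size <= i_size)
		return data_size;
	if (req_offset < i_size)
		return i_size - req_offset;
	return 0;
}
```

For a single page-aligned page, i_size - req_offset equals
i_size & (WB_PAGE_SIZE - 1), which is why the old masking form was sufficient;
once a request spans several pages, only the subtraction remains correct.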

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   22 +++++++++++++++-------
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 51804cf..0cc62f59 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
 		BUG_ON(req->inode != inode);
 		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-		if (curr_index == index) {
+		if (curr_index <= index &&
+		    index < curr_index + req->num_pages) {
 			found = true;
 			break;
 		}
@@ -1196,7 +1197,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-	__free_page(req->pages[0]);
+	int i;
+
+	for (i = 0; i < req->num_pages; i++)
+		__free_page(req->pages[i]);
 	fuse_file_put(req->ff, false);
 }
 
@@ -1205,10 +1209,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	int i;
 
 	list_del(&req->writepages_entry);
-	dec_bdi_stat(bdi, BDI_WRITEBACK);
-	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+	for (i = 0; i < req->num_pages; i++) {
+		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
 }
@@ -1221,14 +1228,15 @@ __acquires(fc->lock)
 	struct fuse_inode *fi = get_fuse_inode(req->inode);
 	loff_t size = i_size_read(req->inode);
 	struct fuse_write_in *inarg = &req->misc.write.in;
+	__u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
 	if (!fc->connected)
 		goto out_free;
 
-	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-		inarg->size = PAGE_CACHE_SIZE;
+	if (inarg->offset + data_size <= size) {
+		inarg->size = data_size;
 	} else if (inarg->offset < size) {
-		inarg->size = size & (PAGE_CACHE_SIZE - 1);
+		inarg->size = size - inarg->offset;
 	} else {
 		/* Got truncated off completely */
 		goto out_free;



* [PATCH 05/14] fuse: Connection bit for enabling writeback
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (3 preceding siblings ...)
  2012-11-16 17:07 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
@ 2012-11-16 17:07 ` Maxim Patlasov
  2012-11-16 17:07 ` [PATCH 06/14] fuse: Trust kernel i_size only Maxim Patlasov
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:07 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/fuse_i.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e24dd74..89375e2 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -428,6 +428,9 @@ struct fuse_conn {
 	/** Set if bdi is valid */
 	unsigned bdi_initialized:1;
 
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction



* [PATCH 06/14] fuse: Trust kernel i_size only
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (4 preceding siblings ...)
  2012-11-16 17:07 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim Patlasov
@ 2012-11-16 17:07 ` Maxim Patlasov
  2012-12-05 16:39   ` [PATCH] fuse: Trust kernel i_size only - v2 Maxim Patlasov
  2012-12-05 16:40   ` [PATCH] fuse: Implement writepages and write_begin/write_end callbacks " Maxim Patlasov
  2012-11-16 17:09 ` [PATCH 07/14] fuse: Update i_mtime on buffered writes Maxim Patlasov
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:07 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

When writeback is on, make fuse treat the inode's i_size as always up to date
and do not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.
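
The zero-filling of the 'hole' can be sketched in userspace C like this. The
sr_* names and the 4096-byte page size are assumptions for illustration, and
plain byte buffers stand in for page-cache pages (so no kmap_atomic is
needed).

```c
#include <stddef.h>
#include <string.h>

#define SR_PAGE_SHIFT 12                    /* assumed 4 KiB pages */
#define SR_PAGE_SIZE  (1u << SR_PAGE_SHIFT)

/*
 * A read into num_pages page-sized buffers returned only num_read bytes.
 * Zero everything after the last byte actually read: the tail of the page
 * the read stopped in, then every following page in full.
 */
static void sr_zero_after_short_read(unsigned char **pages, size_t num_pages,
				     size_t num_read)
{
	size_t start_idx = num_read >> SR_PAGE_SHIFT;
	size_t off = num_read & (SR_PAGE_SIZE - 1);

	for (size_t i = start_idx; i < num_pages; i++) {
		memset(pages[i] + off, 0, SR_PAGE_SIZE - off);
		off = 0;	/* later pages are zeroed from their start */
	}
}
```

If the read filled every page, start_idx equals num_pages and the loop does
nothing.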

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if the FUSE_FALLOCATE request succeeds.

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c   |    9 ++++++---
 fs/fuse/file.c  |   39 +++++++++++++++++++++++++++++++++++++--
 fs/fuse/inode.c |    6 ++++--
 3 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 324bc08..3e7250e 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -827,7 +827,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->mtime.tv_nsec = attr->mtimensec;
 	stat->ctime.tv_sec = attr->ctime;
 	stat->ctime.tv_nsec = attr->ctimensec;
-	stat->size = attr->size;
+	stat->size = i_size_read(inode);
 	stat->blocks = attr->blocks;
 
 	if (attr->blksize != 0)
@@ -1388,6 +1388,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1460,7 +1461,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1472,7 +1474,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+			S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0cc62f59..d13d57b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -543,9 +543,29 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
+
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2216,6 +2236,7 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
+	bool is_wb = fc->writeback_cache;
 
 	if (fc->no_fallocate)
 		return -EOPNOTSUPP;
@@ -2224,6 +2245,11 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
+	if (is_wb) {
+		struct inode *inode = file->f_mapping->host;
+		mutex_lock(&inode->i_mutex);
+	}
+
 	req->in.h.opcode = FUSE_FALLOCATE;
 	req->in.h.nodeid = ff->nodeid;
 	req->in.numargs = 1;
@@ -2237,6 +2263,15 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	}
 	fuse_put_request(fc, req);
 
+	if (is_wb) {
+		struct inode *inode = file->f_mapping->host;
+
+		if (!err)
+			fuse_write_update_size(inode, offset + length);
+
+		mutex_unlock(&inode->i_mutex);
+	}
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(fuse_file_fallocate);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f0eda12..b2d1a27 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	struct timespec old_mtime;
 
@@ -209,10 +210,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode)) {
+	if (!is_wb && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
 		if (oldsize != attr->size) {



* [PATCH 07/14] fuse: Update i_mtime on buffered writes
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (5 preceding siblings ...)
  2012-11-16 17:07 ` [PATCH 06/14] fuse: Trust kernel i_size only Maxim Patlasov
@ 2012-11-16 17:09 ` Maxim Patlasov
  2012-11-16 17:09 ` [PATCH 08/14] fuse: Flush files on wb close Maxim Patlasov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:09 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

If the writeback cache is on, a buffered write doesn't result in an immediate
mtime update in userspace, because userspace will see the modified data only
later, when writeback happens. Consequently, the mtime provided by userspace may
be older than the actual time of the buffered write.

The problem can be solved by generating mtime locally (will come in next
patches) and flushing it to userspace periodically. Here we introduce a flag to
keep the state of fuse_inode: the flag is ON if and only if locally generated
mtime (stored in inode->i_mtime) was not pushed to the userspace yet.

The patch also implements all bits related to flushing and clearing the flag.
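
The flag's life cycle can be sketched as a small single-threaded userspace
model. All demo_* names are hypothetical, a plain bool stands in for the
FUSE_I_MTIME_UPDATED bit, and the locking and race handling of the real
fuse_flush_mtime() are not modelled.

```c
#include <stdbool.h>

struct demo_inode {
	long mtime;		/* locally cached mtime (inode->i_mtime) */
	long server_mtime;	/* mtime last pushed to the simulated server */
	bool mtime_updated;	/* stands in for FUSE_I_MTIME_UPDATED */
};

/* A buffered write generates mtime locally and marks it as not yet pushed. */
static void demo_buffered_write(struct demo_inode *in, long now)
{
	in->mtime = now;
	in->mtime_updated = true;
}

/* Push the local mtime to the server and clear the flag, if it is set. */
static void demo_flush_mtime(struct demo_inode *in)
{
	if (!in->mtime_updated)
		return;
	in->server_mtime = in->mtime;
	in->mtime_updated = false;
}

/*
 * Attributes arriving from the server are ignored while the local mtime
 * has not been pushed yet, since they may be older than the local value.
 */
static void demo_change_attributes(struct demo_inode *in, long attr_mtime)
{
	if (in->mtime_updated)
		return;
	in->mtime = attr_mtime;
}
```

In this model the flag is ON exactly between a buffered write and the next
flush, which matches the invariant stated above.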

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |   42 +++++++++++++++++++++++++----
 fs/fuse/file.c   |   31 ++++++++++++++++++---
 fs/fuse/fuse_i.h |   13 ++++++++-
 fs/fuse/inode.c  |   79 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 3e7250e..d673698 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -177,6 +177,13 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 		if (flags & LOOKUP_RCU)
 			return -ECHILD;
 
+		if (test_bit(FUSE_I_MTIME_UPDATED,
+			     &get_fuse_inode(inode)->state)) {
+			err = fuse_flush_mtime(inode, 0);
+			if (err)
+				return 0;
+		}
+
 		fc = get_fuse_conn(inode);
 		req = fuse_get_req(fc);
 		if (IS_ERR(req))
@@ -839,7 +846,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 }
 
 static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
-			   struct file *file)
+			   struct file *file, int locked)
 {
 	int err;
 	struct fuse_getattr_in inarg;
@@ -848,6 +855,12 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
 	struct fuse_req *req;
 	u64 attr_version;
 
+	if (test_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state)) {
+		err = fuse_flush_mtime(inode, locked);
+		if (err)
+			return err;
+	}
+
 	req = fuse_get_req(fc);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -893,7 +906,7 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
 }
 
 int fuse_update_attributes(struct inode *inode, struct kstat *stat,
-			   struct file *file, bool *refreshed)
+			   struct file *file, bool *refreshed, int locked)
 {
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	int err;
@@ -901,7 +914,7 @@ int fuse_update_attributes(struct inode *inode, struct kstat *stat,
 
 	if (fi->i_time < get_jiffies_64()) {
 		r = true;
-		err = fuse_do_getattr(inode, stat, file);
+		err = fuse_do_getattr(inode, stat, file, locked);
 	} else {
 		r = false;
 		err = 0;
@@ -1055,7 +1068,7 @@ static int fuse_perm_getattr(struct inode *inode, int mask)
 	if (mask & MAY_NOT_BLOCK)
 		return -ECHILD;
 
-	return fuse_do_getattr(inode, NULL, NULL);
+	return fuse_do_getattr(inode, NULL, NULL, 0);
 }
 
 /*
@@ -1371,6 +1384,12 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static inline bool fuse_operation_updates_mtime_on_server(unsigned ivalid)
+{
+	return (ivalid & ATTR_SIZE) ||
+		((ivalid & ATTR_MTIME) && update_mtime(ivalid));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1411,6 +1430,15 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	if (attr->ia_valid & ATTR_SIZE)
 		is_truncate = true;
 
+	if (!fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+		if (test_bit(FUSE_I_MTIME_UPDATED, &fi->state)) {
+			err = fuse_flush_mtime(inode, 1);
+			if (err)
+				return err;
+		}
+	}
+
 	req = fuse_get_req(fc);
 	if (IS_ERR(req))
 		return PTR_ERR(req);
@@ -1458,6 +1486,10 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	if (fuse_operation_updates_mtime_on_server(attr->ia_valid)) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+		clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+	}
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1506,7 +1538,7 @@ static int fuse_getattr(struct vfsmount *mnt, struct dentry *entry,
 	if (!fuse_allow_task(fc, current))
 		return -EACCES;
 
-	return fuse_update_attributes(inode, stat, NULL, NULL);
+	return fuse_update_attributes(inode, stat, NULL, NULL, 0);
 }
 
 static int fuse_setxattr(struct dentry *entry, const char *name,
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d13d57b..3fee4a8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -382,6 +382,13 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (is_bad_inode(inode))
 		return -EIO;
 
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state)) {
+		err = fuse_flush_mtime(inode, 0);
+		if (err)
+			return err;
+	}
+
 	if (fc->no_flush)
 		return 0;
 
@@ -485,6 +492,15 @@ out:
 static int fuse_fsync(struct file *file, loff_t start, loff_t end,
 		      int datasync)
 {
+	struct inode *inode = file->f_mapping->host;
+
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(inode, 0);
+		if (err)
+			return err;
+	}
+
 	return fuse_fsync_common(file, start, end, datasync, 0);
 }
 
@@ -755,7 +771,8 @@ static ssize_t fuse_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	if (fc->auto_inval_data ||
 	    (pos + iov_length(iov, nr_segs) > i_size_read(inode))) {
 		int err;
-		err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL);
+		err = fuse_update_attributes(inode, NULL, iocb->ki_filp, NULL,
+					     0);
 		if (err)
 			return err;
 	}
@@ -1189,8 +1206,11 @@ static ssize_t __fuse_direct_write(struct file *file, const char __user *buf,
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res) {
 		res = fuse_direct_io(file, buf, count, ppos, 1);
-		if (res > 0)
+		if (res > 0) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, *ppos);
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		}
 	}
 
 	fuse_invalidate_attr(inode);
@@ -1655,7 +1675,7 @@ static loff_t fuse_file_llseek(struct file *file, loff_t offset, int origin)
 		return generic_file_llseek(file, offset, origin);
 
 	mutex_lock(&inode->i_mutex);
-	retval = fuse_update_attributes(inode, NULL, file, NULL);
+	retval = fuse_update_attributes(inode, NULL, file, NULL, 1);
 	if (!retval)
 		retval = generic_file_llseek(file, offset, origin);
 	mutex_unlock(&inode->i_mutex);
@@ -2266,8 +2286,11 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (is_wb) {
 		struct inode *inode = file->f_mapping->host;
 
-		if (!err)
+		if (!err) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, offset + length);
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		}
 
 		mutex_unlock(&inode->i_mutex);
 	}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 89375e2..72645e8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -103,6 +103,15 @@ struct fuse_inode {
 
 	/** List of writepage requestst (pending or sent) */
 	struct list_head writepages;
+
+	/** Miscellaneous bits describing inode state */
+	unsigned long state;
+};
+
+/** FUSE inode state bits */
+enum {
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_UPDATED,
 };
 
 struct fuse_conn;
@@ -749,7 +758,7 @@ int fuse_allow_task(struct fuse_conn *fc, struct task_struct *task);
 u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id);
 
 int fuse_update_attributes(struct inode *inode, struct kstat *stat,
-			   struct file *file, bool *refreshed);
+			   struct file *file, bool *refreshed, int locked);
 
 void fuse_flush_writepages(struct inode *inode);
 
@@ -790,4 +799,6 @@ int fuse_dev_release(struct inode *inode, struct file *file);
 
 void fuse_write_update_size(struct inode *inode, loff_t pos);
 
+int fuse_flush_mtime(struct inode *inode, int locked);
+
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b2d1a27..92afee5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -201,7 +201,8 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	struct timespec old_mtime;
 
 	spin_lock(&fc->lock);
-	if (attr_version != 0 && fi->attr_version > attr_version) {
+	if ((attr_version != 0 && fi->attr_version > attr_version) ||
+	    test_bit(FUSE_I_MTIME_UPDATED, &fi->state)) {
 		spin_unlock(&fc->lock);
 		return;
 	}
@@ -257,6 +258,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 				   new_decode_dev(attr->rdev));
 	} else
 		BUG();
+
+	get_fuse_inode(inode)->state = 0;
 }
 
 int fuse_inode_eq(struct inode *inode, void *_nodeidp)
@@ -335,6 +338,80 @@ int fuse_reverse_inval_inode(struct super_block *sb, u64 nodeid,
 	return 0;
 }
 
+/*
+ * Flush inode->i_mtime to the server and clear FUSE_I_MTIME_UPDATED flag
+ *
+ * Do nothing if anybody cleared FUSE_I_MTIME_UPDATED flag by the time we
+ * acquired i_mutex.
+ *
+ * Do not clear FUSE_I_MTIME_UPDATED flag after flush if anybody (buffered
+ * write) updated i_mtime by the time we acquired fc->lock.
+ */
+int fuse_flush_mtime(struct inode *inode, int locked)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	req = fuse_get_req(fc);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	if (!locked)
+		mutex_lock(&inode->i_mutex);
+
+	/*
+	 * This is crucial. We must re-check flag holding i_mutex. Otherwise
+	 * it would be possible to overwrite fresh mtime on server (for
+	 * example, updated as result of dio write) with our already outdated
+	 * inode->i_mtime.
+	 */
+	if (!test_bit(FUSE_I_MTIME_UPDATED, &fi->state)) {
+		mutex_unlock(&inode->i_mutex);
+		fuse_put_request(fc, req);
+		return 0;
+	}
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(inarg);
+	req->in.args[0].value = &inarg;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(outarg);
+	req->out.args[0].value = &outarg;
+
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err) {
+		spin_lock(&fc->lock);
+		if (inarg.mtime == inode->i_mtime.tv_sec &&
+		    inarg.mtimensec == inode->i_mtime.tv_nsec)
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		spin_unlock(&fc->lock);
+	}
+
+	if (!locked)
+		mutex_unlock(&inode->i_mutex);
+
+	return err;
+}
+
 static void fuse_umount_begin(struct super_block *sb)
 {
 	fuse_abort_conn(get_fuse_conn_super(sb));


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 08/14] fuse: Flush files on wb close
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (6 preceding siblings ...)
  2012-11-16 17:09 ` [PATCH 07/14] fuse: Update i_mtime on buffered writes Maxim Patlasov
@ 2012-11-16 17:09 ` Maxim Patlasov
  2012-11-16 17:09 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks Maxim Patlasov
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:09 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

Every write request must carry a file handle to report to userspace. Thus,
when we close a file (and free the fuse_file that holds the handle), we have
to flush all outstanding writeback first. Note that simply calling
filemap_write_and_wait() is not enough: fuse finishes page writeback
immediately, so the wait part of that call is a no-op. Do a real wait on the
per-inode writepages list instead.
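
The reasoning above can be sketched as a small user-space model. This is an
illustration only, not the kernel code: the struct and all model_* names are
hypothetical, and the loop in model_release() merely stands in for sleeping
until the daemon has answered every request.

```c
/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_inode {
	int pages_under_writeback;	/* what a page-cache wait would look at */
	int requests_in_flight;		/* models the per-inode writepages list */
};

/*
 * Submit one writeback request. The data is copied to a temporary page,
 * so page writeback ends at once while the request itself stays pending.
 */
void model_submit(struct model_inode *in)
{
	in->pages_under_writeback++;
	in->requests_in_flight++;
	in->pages_under_writeback--;	/* writeback on the page ends right away */
}

/* The daemon answered one request. */
void model_complete(struct model_inode *in)
{
	if (in->requests_in_flight > 0)
		in->requests_in_flight--;
}

/*
 * Close path: a page-level wait sees nothing to wait for, so drain the
 * request list itself. Returns how many requests had to be waited for.
 */
int model_release(struct model_inode *in)
{
	int waited = 0;

	while (in->requests_in_flight > 0) {
		model_complete(in);	/* stands in for the daemon's reply */
		waited++;
	}
	return waited;
}
```

After two submissions the model has no pages under writeback but two requests
in flight, which is exactly the state a page-level wait cannot see.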

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   26 +++++++++++++++++++++++++-
 1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3fee4a8..de9726a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -137,6 +137,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 	}
 }
 
+static void __fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count))
+		BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir)
 {
@@ -285,8 +291,23 @@ static int fuse_open(struct inode *inode, struct file *file)
 	return fuse_open_common(inode, file, false);
 }
 
+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	filemap_write_and_wait(file->f_mapping);
+	wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+	spin_unlock_wait(&fc->lock);
+}
+
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache)
+		fuse_flush_writeback(inode, file);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -1241,7 +1262,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 
 	for (i = 0; i < req->num_pages; i++)
 		__free_page(req->pages[i]);
-	fuse_file_put(req->ff, false);
+	if (!fc->writeback_cache)
+		fuse_file_put(req->ff, false);
 }
 
 static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1258,6 +1280,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
+	if (fc->writeback_cache)
+		__fuse_file_put(req->ff);
 }
 
 /* Called under fc->lock, may release and reacquire it */


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (7 preceding siblings ...)
  2012-11-16 17:09 ` [PATCH 08/14] fuse: Flush files on wb close Maxim Patlasov
@ 2012-11-16 17:09 ` Maxim Patlasov
  2012-11-16 17:09 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:09 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

The .writepages callback is required so that each writeback request can carry
more than one page.
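
As a rough illustration of how pages are grouped into requests, the sketch
below mirrors the three split conditions used by fuse_writepages_fill() in
this patch: the request is full, the next page would exceed max_write, or the
next page is not contiguous. The function name and the two MODEL_* limits are
assumptions made for the example, not values taken from the kernel.

```c
#include <stddef.h>

/* Assumed limits, for illustration only. */
#define MODEL_PAGE_SIZE		4096u
#define MODEL_MAX_PAGES_PER_REQ	32u

/*
 * Count the requests needed to write the pages whose indices are given
 * in ascending order in idx[0..n-1].
 */
size_t model_count_requests(const unsigned long *idx, size_t n,
			    size_t max_write)
{
	size_t requests = 0;
	size_t in_req = 0;	/* pages in the request being built */
	size_t i;

	for (i = 0; i < n; i++) {
		if (in_req &&
		    (in_req == MODEL_MAX_PAGES_PER_REQ ||
		     (in_req + 1) * MODEL_PAGE_SIZE > max_write ||
		     idx[i - 1] + 1 != idx[i])) {
			requests++;	/* send what has been collected */
			in_req = 0;	/* and start a new request */
		}
		in_req++;
	}
	if (in_req)
		requests++;		/* send the last, possibly partial, request */

	return requests;
}
```

With a 128k max_write, four contiguous pages fit in one request, while a gap
in the indices forces a second one.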

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |  251 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 250 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index de9726a..3274708 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -718,7 +718,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
 	struct fuse_req *req;
-	struct file *file;
+	union {
+		struct file *file;
+		struct fuse_file *ff;
+	};
 	struct inode *inode;
 };
 
@@ -1427,6 +1430,249 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	if (!data->ff) {
+		for (i = 0; i < req->num_pages; i++)
+			end_page_writeback(req->pages[i]);
+		return -EIO;
+	}
+
+	req->inode = inode;
+	req->misc.write.in.offset = page_offset(req->pages[0]);
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	spin_unlock(&fc->lock);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok) {
+		for (i = 0; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			if (page) {
+				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+				__free_page(page);
+				req->pages[i] = NULL;
+			}
+		}
+
+		spin_lock(&fc->lock);
+		list_del(&req->writepages_entry);
+		wake_up(&fi->page_waitq);
+		spin_unlock(&fc->lock);
+		return -ENOMEM;
+	}
+
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	req->page_offset = 0;
+	req->end = fuse_writepage_end;
+
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
+	if (req->num_pages &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+
+		data->req = req = fuse_request_alloc_nofs();
+		if (!req) {
+			unlock_page(page);
+			return -ENOMEM;
+		}
+	}
+
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+
+	if (test_set_page_writeback(page))
+		BUG();
+
+	unlock_page(page);
+
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_data data;
+	int err;
+
+	if (!fc->writeback_cache)
+		return generic_writepages(mapping, wbc);
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.ff = NULL;
+	data.inode = inode;
+	data.req = fuse_request_alloc_nofs();
+	err = -ENOMEM;
+	if (!data.req)
+		goto out_put;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req) {
+		if (!err && data.req->num_pages) {
+			err = fuse_send_writepages(&data);
+			if (err)
+				fuse_put_request(fc, data.req);
+		} else
+			fuse_put_request(fc, data.req);
+	}
+out_put:
+	if (data.ff)
+		fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+		return 0;
+
+	/*
+	 * Page writeback can extend beyond the lifetime of the
+	 * page-cache page, so make sure we read a properly synced
+	 * page.
+	 */
+	fuse_wait_on_page_writeback(page->mapping->host, page->index);
+
+	req = fuse_get_req(fc);
+	err = PTR_ERR(req);
+	if (IS_ERR(req))
+		goto out;
+
+	req->out.page_zeroing = 1;
+	req->out.argpages = 1;
+	req->num_pages = 1;
+	req->pages[0] = page;
+	fuse_send_read(req, file, page_offset(page), PAGE_CACHE_SIZE, NULL);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+out:
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return err;
+}
+
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2364,11 +2610,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (8 preceding siblings ...)
  2012-11-16 17:09 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks Maxim Patlasov
@ 2012-11-16 17:09 ` Maxim Patlasov
  2012-11-16 17:10 ` [PATCH 11/14] fuse: fuse_flush() " Maxim Patlasov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:09 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

fuse_writepage_locked() should never submit new i/o for a given page->index
if another request for it is already in progress. In most cases it is safe to
wait on page writeback. But if it was called due to memory shortage
(WB_SYNC_NONE), we should redirty the page rather than block the caller.
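
The decision described above can be written as a small pure function. This is
a sketch under assumed names: the enums and model_writepage_action() are
hypothetical and only model the choice between submitting, redirtying and
waiting.

```c
/* Hypothetical stand-ins for the kernel's WB_SYNC_NONE / WB_SYNC_ALL. */
enum model_sync_mode { MODEL_SYNC_NONE, MODEL_SYNC_ALL };

enum model_action {
	MODEL_SUBMIT,			/* no request in flight: send now */
	MODEL_REDIRTY,			/* do not block: try again later */
	MODEL_WAIT_THEN_SUBMIT		/* data integrity sync: wait first */
};

/* Decide what to do with a page whose index may already be in flight. */
enum model_action model_writepage_action(int already_in_flight,
					 enum model_sync_mode mode)
{
	if (!already_in_flight)
		return MODEL_SUBMIT;
	if (mode != MODEL_SYNC_ALL)
		return MODEL_REDIRTY;
	return MODEL_WAIT_THEN_SUBMIT;
}
```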

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3274708..4e4f6fd 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1370,7 +1370,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
 	return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+				 struct writeback_control *wbc)
 {
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
@@ -1379,6 +1380,14 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_req *req;
 	struct page *tmp_page;
 
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
 	set_page_writeback(page);
 
 	req = fuse_request_alloc_nofs();
@@ -1424,7 +1433,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int err;
 
-	err = fuse_writepage_locked(page);
+	err = fuse_writepage_locked(page, wbc);
 	unlock_page(page);
 
 	return err;
@@ -1678,7 +1687,10 @@ static int fuse_launder_page(struct page *page)
 	int err = 0;
 	if (clear_page_dirty_for_io(page)) {
 		struct inode *inode = page->mapping->host;
-		err = fuse_writepage_locked(page);
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_ALL,
+		};
+		err = fuse_writepage_locked(page, &wbc);
 		if (!err)
 			fuse_wait_on_page_writeback(inode, page->index);
 	}


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 11/14] fuse: fuse_flush() should wait on writeback
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (9 preceding siblings ...)
  2012-11-16 17:09 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
@ 2012-11-16 17:10 ` Maxim Patlasov
  2012-11-16 17:10 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder Maxim Patlasov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

The aim of the .flush fop is to hint to the file system that flushing its
state, caches, or any other important data to reliable storage would be
desirable now. fuse_flush() passes this hint on by sending a FUSE_FLUSH
request to userspace. However, dirty pages and pages under writeback may not
be visible to userspace yet unless we ensure that explicitly.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 4e4f6fd..b73fe2a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -17,6 +17,7 @@
 #include <linux/swap.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -413,6 +414,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fc->no_flush)
 		return 0;
 
+	err = filemap_write_and_wait(file->f_mapping);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
+	fuse_sync_writes(inode);
+	mutex_unlock(&inode->i_mutex);
+
 	req = fuse_get_req_nofail(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (10 preceding siblings ...)
  2012-11-16 17:10 ` [PATCH 11/14] fuse: fuse_flush() " Maxim Patlasov
@ 2012-11-16 17:10 ` Maxim Patlasov
  2012-12-05 16:43   ` [PATCH] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
  2012-11-16 17:10 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim Patlasov
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the data that was in the file
before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait()
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, without waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait for the
in-flight writeback to finish.
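
The wait condition boils down to an interval-overlap test between the pages
touched by the direct op and the pages covered by each in-flight request. The
sketch below reproduces that test in user space. The struct, the model_*
names and the 4k page size are assumptions for the example, and bytes is
assumed to be non-zero.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4k pages */

/* One in-flight writeback request: pages [first_index, first_index + num_pages). */
struct model_req {
	unsigned long first_index;
	unsigned long num_pages;
};

/* Does the inclusive page range [idx_from, idx_to] touch the request? */
bool model_req_overlaps(const struct model_req *req,
			unsigned long idx_from, unsigned long idx_to)
{
	return !(idx_from >= req->first_index + req->num_pages ||
		 idx_to < req->first_index);
}

/* Is any byte of [start, start + bytes) still under writeback? bytes > 0. */
bool model_range_is_writeback(const struct model_req *reqs, size_t n,
			      unsigned long long start, size_t bytes)
{
	unsigned long idx_from = (unsigned long)(start >> MODEL_PAGE_SHIFT);
	unsigned long idx_to =
		(unsigned long)((start + bytes - 1) >> MODEL_PAGE_SHIFT);
	size_t i;

	for (i = 0; i < n; i++)
		if (model_req_overlaps(&reqs[i], idx_from, idx_to))
			return true;
	return false;
}
```

A direct op would then sleep until model_range_is_writeback() turns false for
its range, rather than until the whole inode is clean.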

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b73fe2a..741e9b4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -348,6 +348,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -392,6 +417,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
 	return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
@@ -1178,6 +1216,8 @@ ssize_t fuse_direct_io(struct file *file, const char __user *buf,
 			break;
 		}
 
+		fuse_wait_on_writeback(file->f_mapping->host, pos, nbytes);
+
 		if (write)
 			nres = fuse_send_write(req, file, pos, nbytes, owner);
 		else


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 13/14] fuse: Turn writeback cache on
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (11 preceding siblings ...)
  2012-11-16 17:10 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder Maxim Patlasov
@ 2012-11-16 17:10 ` Maxim Patlasov
  2012-11-16 17:10 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim Patlasov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

Introduce a flag bit that the kernel and userspace exchange at the init
stage, and turn writeback on if userspace wants it and the mount option
'allow_wbcache' is present (controlled by fusermount).

Also add each writable file to the per-inode write list and call
generic_file_aio_write() to make use of the Linux page cache engine.
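
The two-sided handshake can be sketched as follows. The bit values match the
FUSE_ALLOW_WBCACHE and FUSE_WRITEBACK_CACHE definitions added by this patch,
but the MODEL_* macros and both function names are hypothetical and exist
only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ALLOW_WBCACHE	(1u << 2)	/* mount flag, as FUSE_ALLOW_WBCACHE */
#define MODEL_WRITEBACK_CACHE	(1u << 13)	/* INIT flag, as FUSE_WRITEBACK_CACHE */

/* Kernel side of INIT: offer writeback only if the mount option allows it. */
uint32_t model_init_request_flags(uint32_t base_flags, uint32_t mount_flags)
{
	if (mount_flags & MODEL_ALLOW_WBCACHE)
		base_flags |= MODEL_WRITEBACK_CACHE;
	return base_flags;
}

/* INIT reply: enable only if userspace asked for it and the mount allows it. */
bool model_writeback_enabled(uint32_t reply_flags, uint32_t mount_flags)
{
	return (reply_flags & MODEL_WRITEBACK_CACHE) &&
	       (mount_flags & MODEL_ALLOW_WBCACHE);
}
```

Checking the mount flag again on the reply means a daemon cannot switch the
feature on by itself when the mount option is absent.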

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c            |   15 +++++++++++++++
 fs/fuse/fuse_i.h          |    4 ++++
 fs/fuse/inode.c           |   13 +++++++++++++
 include/uapi/linux/fuse.h |    1 +
 4 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 741e9b4..f62463b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -210,6 +210,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		spin_unlock(&fc->lock);
 		fuse_invalidate_attr(inode);
 	}
+	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+		fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1069,6 +1071,19 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct iov_iter i;
 	loff_t endbyte = 0;
 
+	if (get_fuse_conn(inode)->writeback_cache) {
+		if (!(file->f_flags & O_DIRECT)) {
+			struct fuse_conn *fc = get_fuse_conn(inode);
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			spin_lock(&fc->lock);
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			set_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+			spin_unlock(&fc->lock);
+		}
+		return generic_file_aio_write(iocb, iov, nr_segs, pos);
+	}
+
 	WARN_ON(iocb->ki_pos != pos);
 
 	ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 72645e8..3b9a2fe 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER         (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
+    module will enable support of writeback cache */
+#define FUSE_ALLOW_WBCACHE       (1 << 2)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 92afee5..85b716f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -521,6 +521,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_ALLOW_WBCACHE,
 	OPT_ERR
 };
 
@@ -533,6 +534,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_ALLOW_WBCACHE,		"allow_wbcache"},
 	{OPT_ERR,			NULL}
 };
 
@@ -602,6 +604,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_ALLOW_WBCACHE:
+			d->flags |= FUSE_ALLOW_WBCACHE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -629,6 +635,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		seq_puts(m, ",allow_wbcache");
 	return 0;
 }
 
@@ -938,6 +946,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->dont_mask = 1;
 			if (arg->flags & FUSE_AUTO_INVAL_DATA)
 				fc->auto_inval_data = 1;
+			if (arg->flags & FUSE_WRITEBACK_CACHE &&
+			    fc->flags & FUSE_ALLOW_WBCACHE)
+				fc->writeback_cache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_CACHE_SIZE;
 			fc->no_lock = 1;
@@ -965,6 +976,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK |
 		FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
 		FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA;
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		arg->flags |= FUSE_WRITEBACK_CACHE;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d8c713e..96cd120 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -193,6 +193,7 @@ struct fuse_file_lock {
 #define FUSE_FLOCK_LOCKS	(1 << 10)
 #define FUSE_HAS_IOCTL_DIR	(1 << 11)
 #define FUSE_AUTO_INVAL_DATA	(1 << 12)
+#define FUSE_WRITEBACK_CACHE	(1 << 13)
 
 /**
  * CUSE INIT request/reply flags


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (12 preceding siblings ...)
  2012-11-16 17:10 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim Patlasov
@ 2012-11-16 17:10 ` Maxim Patlasov
  2012-11-21 12:01   ` Maxim Patlasov
  2012-11-27  1:04 ` [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Feng Shuo
  2012-12-12 14:53 ` Maxim V. Patlasov
  15 siblings, 1 reply; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-16 17:10 UTC (permalink / raw)
  To: miklos
  Cc: dev, fuse-devel, linux-kernel, jbottomley, viro, linux-fsdevel, xemul

Make balance_dirty_pages start throttling when the WRITEBACK_TEMP counter
is high enough. This prevents us from having too many dirty pages on fuse,
thus giving the userspace part of it a chance to write stuff properly.

Note that the existing balance logic is per-bdi, i.e. if the fuse user task
gets stuck in the function, this means that it is either writing to the
mountpoint it serves (but that can deadlock even without the writeback) or
writing to some _other_ dirty bdi, and in the latter case someone else will
free the memory for it.
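
The effect on the accounting is simple arithmetic, sketched below in user
space. The struct and function are hypothetical; the include_temp switch only
exists to compare the sum before and after this change, and the real
throttling decision in balance_dirty_pages involves more than this sum.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the global page counters involved. */
struct model_page_counts {
	unsigned long file_dirty;	/* models NR_FILE_DIRTY */
	unsigned long unstable_nfs;	/* models NR_UNSTABLE_NFS */
	unsigned long writeback;	/* models NR_WRITEBACK */
	unsigned long writeback_temp;	/* models NR_WRITEBACK_TEMP */
};

/* Dirty total compared against the limits, without or with the temp pages. */
unsigned long model_nr_dirty(const struct model_page_counts *c,
			     bool include_temp)
{
	unsigned long nr_reclaimable = c->file_dirty + c->unstable_nfs;
	unsigned long nr_dirty = nr_reclaimable + c->writeback;

	if (include_temp)
		nr_dirty += c->writeback_temp;
	return nr_dirty;
}
```

With 100 dirty pages, 50 under writeback and 400 temporary writeback pages,
the old sum is 150 while the new one is 550, so a workload that keeps most of
its data in temporary pages becomes visible to the throttling logic.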

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 mm/page-writeback.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..499a606 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+			global_page_state(NR_WRITEBACK_TEMP);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-11-16 17:10 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim Patlasov
@ 2012-11-21 12:01   ` Maxim Patlasov
  2012-11-22 13:27     ` Jaegeuk Hanse
  0 siblings, 1 reply; 27+ messages in thread
From: Maxim Patlasov @ 2012-11-21 12:01 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, linux-mm, viro,
	linux-fsdevel

Added linux-mm@ to cc:. The patch can stand on its own.

> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
> counter is high enough. This prevents us from having too many dirty
> pages on fuse, thus giving the userspace part of it a chance to write
> stuff properly.
> 
> Note, that the existing balance logic is per-bdi, i.e. if the fuse
> user task gets stuck in the function this means, that it either
> writes to the mountpoint it serves (but it can deadlock even without
> the writeback) or it is writing to some _other_ dirty bdi and in the
> latter case someone else will free the memory for it.

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 mm/page-writeback.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..499a606 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+			global_page_state(NR_WRITEBACK_TEMP);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-11-21 12:01   ` Maxim Patlasov
@ 2012-11-22 13:27     ` Jaegeuk Hanse
  2012-11-22 13:56       ` Maxim V. Patlasov
  0 siblings, 1 reply; 27+ messages in thread
From: Jaegeuk Hanse @ 2012-11-22 13:27 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: miklos, dev, xemul, fuse-devel, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel

On 11/21/2012 08:01 PM, Maxim Patlasov wrote:
> Added linux-mm@ to cc:. The patch can stand on its own.
>
>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>> counter is high enough. This prevents us from having too many dirty
>> pages on fuse, thus giving the userspace part of it a chance to write
>> stuff properly.
>>
>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>> user task gets stuck in the function this means, that it either
>> writes to the mountpoint it serves (but it can deadlock even without
>> the writeback) or it is writing to some _other_ dirty bdi and in the
>> latter case someone else will free the memory for it.
> Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
>   mm/page-writeback.c |    3 ++-
>   1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 830893b..499a606 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>   		 */
>   		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>   					global_page_state(NR_UNSTABLE_NFS);
> -		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> +		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
> +			global_page_state(NR_WRITEBACK_TEMP);
>   

Could you explain what NR_WRITEBACK_TEMP accounts for, and when it is
incremented?

>   		global_dirty_limits(&background_thresh, &dirty_thresh);
>   
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-11-22 13:27     ` Jaegeuk Hanse
@ 2012-11-22 13:56       ` Maxim V. Patlasov
  0 siblings, 0 replies; 27+ messages in thread
From: Maxim V. Patlasov @ 2012-11-22 13:56 UTC (permalink / raw)
  To: Jaegeuk Hanse
  Cc: miklos, dev, xemul, fuse-devel, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel

Hi,

On 11/22/2012 05:27 PM, Jaegeuk Hanse wrote:
> On 11/21/2012 08:01 PM, Maxim Patlasov wrote:
>> Added linux-mm@ to cc:. The patch can stand on its own.
>>
>>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>>> counter is high enough. This prevents us from having too many dirty
>>> pages on fuse, thus giving the userspace part of it a chance to write
>>> stuff properly.
>>>
>>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>>> user task gets stuck in the function this means, that it either
>>> writes to the mountpoint it serves (but it can deadlock even without
>>> the writeback) or it is writing to some _other_ dirty bdi and in the
>>> latter case someone else will free the memory for it.
>> Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
>> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>> ---
>>   mm/page-writeback.c |    3 ++-
>>   1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 830893b..499a606 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct 
>> address_space *mapping,
>>            */
>>           nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>>                       global_page_state(NR_UNSTABLE_NFS);
>> -        nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
>> +        nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
>> +            global_page_state(NR_WRITEBACK_TEMP);
>
> Could you explain what NR_WRITEBACK_TEMP accounts for, and when it is
> incremented?

The only user of NR_WRITEBACK_TEMP is fuse. When handling .writepage, it:

1) allocates new page
2) copies original page (that came to .writepage as argument) to new page
3) attaches new page to fuse request
4) increments NR_WRITEBACK_TEMP
5) does end_page_writeback on original page
6) schedules fuse request for processing

Later, the fuse request is sent to userspace, which processes it and
ACKs it to the in-kernel fuse. While processing this ACK, the in-kernel
fuse frees the new page and decrements NR_WRITEBACK_TEMP.

So, effectively, NR_WRITEBACK_TEMP keeps track of pages which are under 
'fuse writeback'.
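
The lifecycle described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `model_page`, `model_request`, `model_writepage`, `model_request_acked` and the plain `nr_writeback_temp` variable are hypothetical stand-ins, not kernel types or APIs, and locking and request queueing are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical stand-in for a page-cache page; not the kernel's struct page. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	int under_writeback;	/* models the writeback flag of the original page */
};

/* Hypothetical stand-in for a fuse request carrying one temporary page. */
struct model_request {
	struct model_page *tmp_page;
};

/* Models the NR_WRITEBACK_TEMP counter as a plain variable. */
static long nr_writeback_temp;

/* Steps 1-5 of the list above. Returns 0 on success, -1 if allocation fails. */
static int model_writepage(struct model_page *orig, struct model_request *req)
{
	struct model_page *tmp = malloc(sizeof(*tmp));	/* 1) allocate new page */

	if (!tmp)
		return -1;
	memcpy(tmp->data, orig->data, MODEL_PAGE_SIZE);	/* 2) copy original page */
	tmp->under_writeback = 0;
	req->tmp_page = tmp;				/* 3) attach copy to request */
	nr_writeback_temp++;				/* 4) account the temporary page */
	orig->under_writeback = 0;			/* 5) original page is done */
	return 0;					/* 6) caller would queue req */
}

/* Models the handling of the ACK from userspace: free the copy, drop the count. */
static void model_request_acked(struct model_request *req)
{
	assert(req->tmp_page != NULL && nr_writeback_temp > 0);
	free(req->tmp_page);
	req->tmp_page = NULL;
	nr_writeback_temp--;
}
```

In this model the counter is non-zero exactly while a copied page is waiting for the userspace ACK, which is the window that patch 14/14 adds to nr_dirty.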

Thanks,
Maxim

>
>> global_dirty_limits(&background_thresh, &dirty_thresh);
>>
>
>
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (13 preceding siblings ...)
  2012-11-16 17:10 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim Patlasov
@ 2012-11-27  1:04 ` Feng Shuo
  2012-11-27  7:56   ` Maxim V. Patlasov
  2012-12-12 14:53 ` Maxim V. Patlasov
  15 siblings, 1 reply; 27+ messages in thread
From: Feng Shuo @ 2012-11-27  1:04 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: miklos, dev, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, xemul

Hi Maxim,

I'm new to fuse but have some experience with NFS. From my
understanding after reviewing your patchset, it seems to work only with
a local file system or a distributed file system whose files are never
modified (they could grow, but with no or very few modifications),
because it doesn't examine the pre/post status of the object being
written (e.g. a file). So if a file is modified from outside, fuse might
not get any chance to handle it. Correct me if I got this wrong, since
I'm really new to fuse. :-)

On Sat, Nov 17, 2012 at 1:04 AM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
> Hi,
>
> This is the second iteration of Pavel Emelyanov's patch-set implementing
> write-back policy for FUSE page cache. Initial patch-set description was
> the following:
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
> turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
>   and finish the page writeback immediately
>
> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
>   with the user space
>
> * take NR_WRITEBACK_TEMP into account when making balance_dirty_pages decision.
>   This protects us from having too many dirty pages on FUSE
>
> The provided patchset survives the fsx test. Performance measurements are not yet
> all finished, but the mentioned copying of a huge file becomes noticeably faster
> even on machines with few RAM and doesn't make the system stuck (the dirty pages
> balancer does its work OK). Applies on top of v3.5-rc4.
>
> We are currently exploring this with our own distributed storage implementation
> which is heavily oriented on storing big blobs of data with extremely rare meta-data
> updates (virtual machines' and containers' disk images). With the existing cache
> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
> too much time to proceed, much longer than if it was simply scp-ed over the same
> network. The write-back policy (as I mentioned) noticeably improves this scenario.
> Kirill (in Cc) can share more details about the performance and the storage concepts
> details if required.
>
> Changed in v2:
>  - numerous bugfixes:
>    - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
>      on page writeback because page writeback can extend beyond the lifetime of
>      the page-cache page
>    - fuse_send_writepages can end_page_writeback on original page only after adding
>      request to fi->writepages list; otherwise another writeback may happen inside
>      the gap between end_page_writeback and adding to the list
>    - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
>      due to reordering requests
>    - fuse_flush must flush dirty memory and wait for all writeback on given inode
>      before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
>    - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
>      otherwise a race with a writer extending i_size is possible
>    - fix handling errors in fuse_writepages and fuse_send_writepages
>  - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
>    on buffered writes) for details.
>  - put enabling writeback cache under fusermount control; (see mount option
>    'allow_wbcache' introduced by patch #13 (turn writeback cache on))
>  - rebased on v3.7-rc5
>
> Thanks,
> Maxim
>
> ---
>
> Maxim Patlasov (14):
>       fuse: Linking file to inode helper
>       fuse: Getting file for writeback helper
>       fuse: Prepare to handle short reads
>       fuse: Prepare to handle multiple pages in writeback
>       fuse: Connection bit for enabling writeback
>       fuse: Trust kernel i_size only
>       fuse: Update i_mtime on buffered writes
>       fuse: Flush files on wb close
>       fuse: Implement writepages and write_begin/write_end callbacks
>       fuse: fuse_writepage_locked() should wait on writeback
>       fuse: fuse_flush() should wait on writeback
>       fuse: Fix O_DIRECT operations vs cached writeback misorder
>       fuse: Turn writeback cache on
>       mm: Account for WRITEBACK_TEMP in balance_dirty_pages
>
>
>  fs/fuse/dir.c             |   51 ++++
>  fs/fuse/file.c            |  523 +++++++++++++++++++++++++++++++++++++++++----
>  fs/fuse/fuse_i.h          |   20 ++
>  fs/fuse/inode.c           |   98 ++++++++
>  include/uapi/linux/fuse.h |    1
>  mm/page-writeback.c       |    3
>  6 files changed, 638 insertions(+), 58 deletions(-)
>



-- 
Feng Shuo
Tel: (86)10-59851155-2116
Fax: (86)10-59851155-2008
Tianjin Zhongke Blue Whale Information Technologies Co., Ltd
10th Floor, Tower A, The GATE building, No. 19 Zhong-guan-cun Avenue
Haidian District, Beijing, China
Postcode 100080

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2012-11-27  1:04 ` [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Feng Shuo
@ 2012-11-27  7:56   ` Maxim V. Patlasov
  0 siblings, 0 replies; 27+ messages in thread
From: Maxim V. Patlasov @ 2012-11-27  7:56 UTC (permalink / raw)
  To: Feng Shuo
  Cc: miklos, dev, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, xemul

Hi Feng,

On 11/27/2012 05:04 AM, Feng Shuo wrote:
> Hi Maxim,
>
> I'm new to fuse but have some experience with NFS. From my
> understanding after reviewing your patchset, it seems to work only with
> a local file system or a distributed file system whose files are never
> modified (they could grow, but with no or very few modifications),
> because it doesn't examine the pre/post status of the object being
> written (e.g. a file). So if a file is modified from outside, fuse might
> not get any chance to handle it. Correct me if I got this wrong, since
> I'm really new to fuse. :-)

This topic was discussed when Pavel sent the initial version of the
patches (you can find it in the fuse-devel archives). Brian asked:

> Would this pose a problem for a filesystem in which the size of the
> inode can change remotely (i.e., not visible to the local instance of
> fuse)? I haven't tested this, but it seems like it could be an issue
> based on the implementation.

And Pavel replied:

> Yes, it will. The model of i_size management I implemented here is based
> on an assumption that the userspace is just a storage for data and should
> catch up with the kernel i_size value. In order to make it possible for user
> space to update i_size in kernel we'd have to implement some (probably)
> tricky algorithm, I haven't yet thought about it.

The patch-set follows the model "trust kernel i_size only". This works
fine at least in the case of a userspace fuse with exclusive write
semantics. In the case of mutual concurrent internal/external read/write
access, the sysadmin should not turn the feature on (it's turned off by
default). I wouldn't like to complicate the patch-set further by adding
bits for that case. This area is open for future enhancements :)
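
As an illustration of the "trust kernel i_size only" rule, here is a minimal userspace sketch. It mirrors the conditions that appear in the "Trust kernel i_size only" patch later in this thread (the size reported by userspace is taken only when writeback cache is off, on truncate, or for non-regular files). The names `model_inode`, `model_accepts_server_size` and `model_change_size` are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the size-related inode state; not a kernel structure. */
struct model_inode {
	int64_t i_size;
	bool is_regular;	/* models S_ISREG(inode->i_mode) */
};

/*
 * Decide whether a size reported by userspace may overwrite the kernel's
 * i_size. With writeback cache on, a regular file keeps the kernel value
 * unless the change comes from an explicit truncate.
 */
static bool model_accepts_server_size(bool writeback_cache, bool is_regular,
				      bool is_truncate)
{
	return !writeback_cache || is_truncate || !is_regular;
}

/* Apply an attribute update carrying server_size to the model inode. */
static void model_change_size(struct model_inode *inode, bool writeback_cache,
			      bool is_truncate, int64_t server_size)
{
	assert(server_size >= 0);
	if (model_accepts_server_size(writeback_cache, inode->is_regular,
				      is_truncate))
		inode->i_size = server_size;
}
```

With the feature on, a stale size coming back from the daemon cannot shrink a file that still has dirty data in the page cache. The cost is the one discussed above: a size change made behind the kernel's back goes unnoticed.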

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] fuse: Trust kernel i_size only - v2
  2012-11-16 17:07 ` [PATCH 06/14] fuse: Trust kernel i_size only Maxim Patlasov
@ 2012-12-05 16:39   ` Maxim Patlasov
  2012-12-05 16:40   ` [PATCH] fuse: Implement writepages and write_begin/write_end callbacks " Maxim Patlasov
  1 sibling, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-12-05 16:39 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley, viro,
	linux-fsdevel

When writeback is on, make fuse treat the inode's i_size as always
up-to-date and not overwrite it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.
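
The hole-filling step can be pictured with the following userspace sketch. It follows the same index arithmetic as the fuse_short_read() hunk below, but operates on plain byte buffers. `MODEL_PAGE_SHIFT`, `MODEL_PAGE_SIZE` and `model_zero_after_short_read` are hypothetical names, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE 4096

/*
 * Zero everything past num_read in an array of page-sized buffers.
 * The first affected page is zeroed from the offset where the read
 * stopped; every later page is zeroed completely.
 */
static void model_zero_after_short_read(unsigned char *pages[],
					size_t num_pages, size_t num_read)
{
	size_t start_idx = num_read >> MODEL_PAGE_SHIFT;
	size_t off = num_read & (MODEL_PAGE_SIZE - 1);
	size_t i;

	assert(num_read <= num_pages * MODEL_PAGE_SIZE);
	for (i = start_idx; i < num_pages; i++) {
		memset(pages[i] + off, 0, MODEL_PAGE_SIZE - off);
		off = 0;	/* only the first page is partially kept */
	}
}
```

If the read filled all pages, start_idx equals num_pages and the loop does nothing.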

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c   |    9 ++++++---
 fs/fuse/file.c  |   43 +++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/inode.c |    6 ++++--
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 324bc08..3e7250e 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -827,7 +827,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->mtime.tv_nsec = attr->mtimensec;
 	stat->ctime.tv_sec = attr->ctime;
 	stat->ctime.tv_nsec = attr->ctimensec;
-	stat->size = attr->size;
+	stat->size = i_size_read(inode);
 	stat->blocks = attr->blocks;
 
 	if (attr->blksize != 0)
@@ -1388,6 +1388,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1460,7 +1461,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1472,7 +1474,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+			S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0cc62f59..7db1736 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
+#include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache,
+		 * but have not reached the client fs yet. So, the hole is not
+		 * present there.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
+
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2216,6 +2239,8 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
+	bool change_i_size = fc->writeback_cache &&
+		!(mode & FALLOC_FL_KEEP_SIZE);
 
 	if (fc->no_fallocate)
 		return -EOPNOTSUPP;
@@ -2224,6 +2249,11 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
+	if (change_i_size) {
+		struct inode *inode = file->f_mapping->host;
+		mutex_lock(&inode->i_mutex);
+	}
+
 	req->in.h.opcode = FUSE_FALLOCATE;
 	req->in.h.nodeid = ff->nodeid;
 	req->in.numargs = 1;
@@ -2237,6 +2267,15 @@ long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	}
 	fuse_put_request(fc, req);
 
+	if (change_i_size) {
+		struct inode *inode = file->f_mapping->host;
+
+		if (!err)
+			fuse_write_update_size(inode, offset + length);
+
+		mutex_unlock(&inode->i_mutex);
+	}
+
 	return err;
 }
 EXPORT_SYMBOL_GPL(fuse_file_fallocate);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f0eda12..b2d1a27 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	struct timespec old_mtime;
 
@@ -209,10 +210,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode)) {
+	if (!is_wb && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
 		if (oldsize != attr->size) {


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH] fuse: Implement writepages and write_begin/write_end callbacks - v2
  2012-11-16 17:07 ` [PATCH 06/14] fuse: Trust kernel i_size only Maxim Patlasov
  2012-12-05 16:39   ` [PATCH] fuse: Trust kernel i_size only - v2 Maxim Patlasov
@ 2012-12-05 16:40   ` Maxim Patlasov
  1 sibling, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-12-05 16:40 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley, viro,
	linux-fsdevel

The .writepages one is required to make each writeback request carry more than
one page on it.
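
How pages get batched into one request can be sketched as below. This is a userspace restatement of the check in fuse_writepages_fill() from the diff that follows, under the assumption of a 4096-byte page. `model_must_send_first` and its parameters are hypothetical names; `max_pages` plays the role of FUSE_MAX_PAGES_PER_REQ and `max_write` the role of fc->max_write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Return true if the request collected so far must be sent before the
 * page at next_index can be added: the request is full, adding one more
 * page would exceed max_write, or the new page is not contiguous with
 * the last page already in the request.
 */
static bool model_must_send_first(unsigned num_pages, unsigned max_pages,
				  size_t max_write, uint64_t last_index,
				  uint64_t next_index)
{
	assert(max_pages > 0);
	if (num_pages == 0)
		return false;	/* an empty request can always take a page */
	return num_pages == max_pages ||
	       ((size_t)num_pages + 1) * MODEL_PAGE_SIZE > max_write ||
	       last_index + 1 != next_index;
}
```

A run of contiguous dirty pages therefore ends up in a single request until one of the three limits is hit.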

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero uninitialized part of page

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |  279 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 278 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8daa0ef..44e966f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -721,7 +721,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
 	struct fuse_req *req;
-	struct file *file;
+	union {
+		struct file *file;
+		struct fuse_file *ff;
+	};
 	struct inode *inode;
 };
 
@@ -1430,6 +1433,277 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	if (!data->ff) {
+		for (i = 0; i < req->num_pages; i++)
+			end_page_writeback(req->pages[i]);
+		return -EIO;
+	}
+
+	req->inode = inode;
+	req->misc.write.in.offset = page_offset(req->pages[0]);
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	spin_unlock(&fc->lock);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok) {
+		for (i = 0; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			if (page) {
+				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+				__free_page(page);
+				req->pages[i] = NULL;
+			}
+		}
+
+		spin_lock(&fc->lock);
+		list_del(&req->writepages_entry);
+		wake_up(&fi->page_waitq);
+		spin_unlock(&fc->lock);
+		return -ENOMEM;
+	}
+
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	req->page_offset = 0;
+	req->end = fuse_writepage_end;
+
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
+	if (req->num_pages &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+
+		data->req = req = fuse_request_alloc_nofs();
+		if (!req) {
+			unlock_page(page);
+			return -ENOMEM;
+		}
+	}
+
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+
+	if (test_set_page_writeback(page))
+		BUG();
+
+	unlock_page(page);
+
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_data data;
+	int err;
+
+	if (!fc->writeback_cache)
+		return generic_writepages(mapping, wbc);
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.ff = NULL;
+	data.inode = inode;
+	data.req = fuse_request_alloc_nofs();
+	err = -ENOMEM;
+	if (!data.req)
+		goto out_put;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req) {
+		if (!err && data.req->num_pages) {
+			err = fuse_send_writepages(&data);
+			if (err)
+				fuse_put_request(fc, data.req);
+		} else
+			fuse_put_request(fc, data.req);
+	}
+out_put:
+	if (data.ff)
+		fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
+/*
+ * Determine the number of bytes of data the page contains
+ */
+static inline unsigned fuse_page_length(struct page *page)
+{
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
+
+	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
+		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
+		if (page_index < end_index)
+			return PAGE_CACHE_SIZE;
+		if (page_index == end_index)
+			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
+	}
+	return 0;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req;
+	unsigned num_read;
+	unsigned page_len;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+		return 0;
+
+	page_len = fuse_page_length(page);
+	if (!page_len) {
+		zero_user(page, 0, PAGE_CACHE_SIZE);
+		return 0;
+	}
+
+	/*
+	 * Page writeback can extend beyond the lifetime of the
+	 * page-cache page, so make sure we read a properly synced
+	 * page.
+	 */
+	fuse_wait_on_page_writeback(page->mapping->host, page->index);
+
+	req = fuse_get_req(fc);
+	err = PTR_ERR(req);
+	if (IS_ERR(req))
+		goto out;
+
+	req->out.page_zeroing = 1;
+	req->out.argpages = 1;
+	req->num_pages = 1;
+	req->pages[0] = page;
+	num_read = fuse_send_read(req, file, page_offset(page), page_len, NULL);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+out:
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	} else if (num_read != PAGE_CACHE_SIZE) {
+		zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
+	}
+	return err;
+}
+
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2368,11 +2642,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  2012-11-16 17:10 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder Maxim Patlasov
@ 2012-12-05 16:43   ` Maxim Patlasov
  0 siblings, 0 replies; 27+ messages in thread
From: Maxim Patlasov @ 2012-12-05 16:43 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley, viro,
	linux-fsdevel

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the data that was in the
file before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls
filemap_write_and_wait to flush all the cached ops in flight. But fuse
acks the writeback right after the ->writepages callback exits, without
waiting for the real write to happen. Thus the subsequent direct op
proceeds while the real writeback is still in flight. This is a problem
for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.
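
The wait condition can be sketched in userspace as follows. The byte range of the direct I/O is converted to page indices and compared against each in-flight writeback request, using the same overlap test as fuse_range_is_writeback() in the diff below. `model_wb_req`, `model_req_overlaps` and `model_range_is_writeback` are hypothetical names, a 4096-byte page is assumed, and the fc->lock locking and the wait queue are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12

/* Hypothetical model of one in-flight writeback request. */
struct model_wb_req {
	uint64_t first_index;	/* page index of the first page in the request */
	unsigned num_pages;	/* number of contiguous pages it carries */
};

/* True if pages [idx_from, idx_to] intersect the pages covered by req. */
static bool model_req_overlaps(const struct model_wb_req *req,
			       uint64_t idx_from, uint64_t idx_to)
{
	return !(idx_from >= req->first_index + req->num_pages ||
		 idx_to < req->first_index);
}

/* True if any request in reqs[] covers part of the byte range [pos, pos + bytes). */
static bool model_range_is_writeback(const struct model_wb_req *reqs, size_t n,
				     uint64_t pos, size_t bytes)
{
	uint64_t idx_from, idx_to;
	size_t i;

	assert(bytes > 0);
	idx_from = pos >> MODEL_PAGE_SHIFT;
	idx_to = (pos + bytes - 1) >> MODEL_PAGE_SHIFT;

	for (i = 0; i < n; i++)
		if (model_req_overlaps(&reqs[i], idx_from, idx_to))
			return true;
	return false;
}
```

A direct read or write would proceed only once this predicate turns false for its range, so it cannot overtake cached data that is still on its way to userspace.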

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/cuse.c   |    5 +++--
 fs/fuse/file.c   |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h |   13 ++++++++++++-
 3 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index ee8d550..7e19bd2 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
 {
 	loff_t pos = 0;
 
-	return fuse_direct_io(file, buf, count, &pos, 0);
+	return fuse_direct_io(file, buf, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -104,7 +104,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 	 * No locking or generic_write_checks(), the server is
 	 * responsible for locking and sanity checks.
 	 */
-	return fuse_direct_io(file, buf, count, &pos, 1);
+	return fuse_direct_io(file, buf, count, &pos,
+			      FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f2da298..d4ee020 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -349,6 +349,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -393,6 +418,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
 	return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
@@ -1158,8 +1196,10 @@ static int fuse_get_user_pages(struct fuse_req *req, const char __user *buf,
 }
 
 ssize_t fuse_direct_io(struct file *file, const char __user *buf,
-		       size_t count, loff_t *ppos, int write)
+		       size_t count, loff_t *ppos, int flags)
 {
+	int write = flags & FUSE_DIO_WRITE;
+	int cuse = flags & FUSE_DIO_CUSE;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
 	size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1181,6 +1221,10 @@ ssize_t fuse_direct_io(struct file *file, const char __user *buf,
 			break;
 		}
 
+		if (!cuse)
+			fuse_wait_on_writeback(file->f_mapping->host, pos,
+					       nbytes);
+
 		if (write)
 			nres = fuse_send_write(req, file, pos, nbytes, owner);
 		else
@@ -1241,7 +1285,7 @@ static ssize_t __fuse_direct_write(struct file *file, const char __user *buf,
 
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res) {
-		res = fuse_direct_io(file, buf, count, ppos, 1);
+		res = fuse_direct_io(file, buf, count, ppos, FUSE_DIO_WRITE);
 		if (res > 0) {
 			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, *ppos);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 72645e8..ec67597 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -788,8 +788,19 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir);
+
+/**
+ * fuse_direct_io() flags
+ */
+
+/** If set, it is WRITE; otherwise - READ */
+#define FUSE_DIO_WRITE (1 << 0)
+
+/** CUSE passes fuse_direct_io() a file whose f_mapping->host is not from FUSE */
+#define FUSE_DIO_CUSE  (1 << 1)
+
 ssize_t fuse_direct_io(struct file *file, const char __user *buf,
-		       size_t count, loff_t *ppos, int write);
+		       size_t count, loff_t *ppos, int flags);
 long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
 		   unsigned int flags);
 long fuse_ioctl_common(struct file *file, unsigned int cmd,



* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (14 preceding siblings ...)
  2012-11-27  1:04 ` [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Feng Shuo
@ 2012-12-12 14:53 ` Maxim V. Patlasov
  2013-01-15 15:20   ` Maxim V. Patlasov
  15 siblings, 1 reply; 27+ messages in thread
From: Maxim V. Patlasov @ 2012-12-12 14:53 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: miklos, Kirill Korotaev, fuse-devel, linux-kernel,
	James Bottomley, viro, linux-fsdevel, Pavel Emelianov

Hi Miklos,

On 11/16/2012 09:04 PM, Maxim Patlasov wrote:
> Hi,
>
> This is the second iteration of Pavel Emelyanov's patch-set implementing
> write-back policy for FUSE page cache. Initial patch-set description was
> the following:
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution to this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
> turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
>    and finish the page writeback immediately
>
> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
>    with the user space
>
> * take NR_WRITEBACK_TEMP into account when making balance_dirty_pages decision.
>    This protects us from having too many dirty pages on FUSE
>
> The provided patchset survives the fsx test. Performance measurements are not yet
> all finished, but the mentioned copying of a huge file becomes noticeably faster
> even on machines with little RAM and doesn't make the system stuck (the dirty pages
> balancer does its work OK). Applies on top of v3.5-rc4.
>
> We are currently exploring this with our own distributed storage implementation
> which is heavily oriented on storing big blobs of data with extremely rare meta-data
> updates (virtual machines' and containers' disk images). With the existing cache
> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
> too much time to proceed, much longer than if it was simply scp-ed over the same
> network. The write-back policy (as I mentioned) noticeably improves this scenario.
> Kirill (in Cc) can share more details about the performance and the storage concepts
> details if required.
>
> Changed in v2:
>   - numerous bugfixes:
>     - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
>       on page writeback because page writeback can extend beyond the lifetime of
>       the page-cache page
>     - fuse_send_writepages can end_page_writeback on original page only after adding
>       request to fi->writepages list; otherwise another writeback may happen inside
>       the gap between end_page_writeback and adding to the list
>     - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
>       due to reordering requests
>     - fuse_flush must flush dirty memory and wait for all writeback on given inode
>       before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
>     - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
>       otherwise a race with a writer extending i_size is possible
>     - fix handling errors in fuse_writepages and fuse_send_writepages
>   - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
>     on buffered writes) for details)
>   - put enabling writeback cache under fusermount control; (see mount option
>     'allow_wbcache' introduced by patch #13 (turn writeback cache on))
>   - rebased on v3.7-rc5

Any feedback on this version (v2) would be appreciated.

Thanks,
Maxim

>
> Thanks,
> Maxim
>
> ---
>
> Maxim Patlasov (14):
>        fuse: Linking file to inode helper
>        fuse: Getting file for writeback helper
>        fuse: Prepare to handle short reads
>        fuse: Prepare to handle multiple pages in writeback
>        fuse: Connection bit for enabling writeback
>        fuse: Trust kernel i_size only
>        fuse: Update i_mtime on buffered writes
>        fuse: Flush files on wb close
>        fuse: Implement writepages and write_begin/write_end callbacks
>        fuse: fuse_writepage_locked() should wait on writeback
>        fuse: fuse_flush() should wait on writeback
>        fuse: Fix O_DIRECT operations vs cached writeback misorder
>        fuse: Turn writeback cache on
>        mm: Account for WRITEBACK_TEMP in balance_dirty_pages
>
>
>   fs/fuse/dir.c             |   51 ++++
>   fs/fuse/file.c            |  523 +++++++++++++++++++++++++++++++++++++++++----
>   fs/fuse/fuse_i.h          |   20 ++
>   fs/fuse/inode.c           |   98 ++++++++
>   include/uapi/linux/fuse.h |    1
>   mm/page-writeback.c       |    3
>   6 files changed, 638 insertions(+), 58 deletions(-)
>



* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2012-12-12 14:53 ` Maxim V. Patlasov
@ 2013-01-15 15:20   ` Maxim V. Patlasov
  2013-01-25 10:21     ` Miklos Szeredi
  0 siblings, 1 reply; 27+ messages in thread
From: Maxim V. Patlasov @ 2013-01-15 15:20 UTC (permalink / raw)
  To: miklos
  Cc: Kirill Korotaev, fuse-devel, linux-kernel, James Bottomley, viro,
	linux-fsdevel, Pavel Emelianov

Hi Miklos,

On 12/12/2012 06:53 PM, Maxim V. Patlasov wrote:
> Hi Miklos,
>
> On 11/16/2012 09:04 PM, Maxim Patlasov wrote:
>> Hi,
>>
>> This is the second iteration of Pavel Emelyanov's patch-set implementing
>> write-back policy for FUSE page cache. Initial patch-set description was
>> the following:
>>
>> One of the problems with the existing FUSE implementation is that it uses the
>> write-through cache policy which results in performance problems on certain
>> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
>> 128k to the userspace synchronously. This becomes a problem when the userspace
>> back-end uses networking for storing the data.
>>
>> A good solution to this is switching the FUSE page cache into a write-back policy.
>> With this file data are pushed to the userspace with big chunks (depending on the
>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>> handle the size updates in a more efficient manner.
>>
>> The writeback feature is per-connection and is explicitly configurable at the
>> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
>> turned ON:
>>
>> * still copy writeback pages to temporary buffer when sending a writeback request
>>    and finish the page writeback immediately
>>
>> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
>>    with the user space
>>
>> * take NR_WRITEBACK_TEMP into account when making balance_dirty_pages decision.
>>    This protects us from having too many dirty pages on FUSE
>>
>> The provided patchset survives the fsx test. Performance measurements are not yet
>> all finished, but the mentioned copying of a huge file becomes noticeably faster
>> even on machines with little RAM and doesn't make the system stuck (the dirty pages
>> balancer does its work OK). Applies on top of v3.5-rc4.
>>
>> We are currently exploring this with our own distributed storage implementation
>> which is heavily oriented on storing big blobs of data with extremely rare meta-data
>> updates (virtual machines' and containers' disk images). With the existing cache
>> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
>> too much time to proceed, much longer than if it was simply scp-ed over the same
>> network. The write-back policy (as I mentioned) noticeably improves this scenario.
>> Kirill (in Cc) can share more details about the performance and the storage concepts
>> details if required.
>>
>> Changed in v2:
>>   - numerous bugfixes:
>>     - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
>>       on page writeback because page writeback can extend beyond the lifetime of
>>       the page-cache page
>>     - fuse_send_writepages can end_page_writeback on original page only after adding
>>       request to fi->writepages list; otherwise another writeback may happen inside
>>       the gap between end_page_writeback and adding to the list
>>     - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
>>       due to reordering requests
>>     - fuse_flush must flush dirty memory and wait for all writeback on given inode
>>       before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
>>     - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
>>       otherwise a race with a writer extending i_size is possible
>>     - fix handling errors in fuse_writepages and fuse_send_writepages
>>   - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
>>     on buffered writes) for details)
>>   - put enabling writeback cache under fusermount control; (see mount option
>>     'allow_wbcache' introduced by patch #13 (turn writeback cache on))
>>   - rebased on v3.7-rc5
>
> Any feedback on this version (v2) would be appreciated.

Heard nothing from you for two months. Any feedback would still be 
appreciated.

Thanks,
Maxim

>
> Thanks,
> Maxim
>
>>
>> Thanks,
>> Maxim
>>
>> ---
>>
>> Maxim Patlasov (14):
>>        fuse: Linking file to inode helper
>>        fuse: Getting file for writeback helper
>>        fuse: Prepare to handle short reads
>>        fuse: Prepare to handle multiple pages in writeback
>>        fuse: Connection bit for enabling writeback
>>        fuse: Trust kernel i_size only
>>        fuse: Update i_mtime on buffered writes
>>        fuse: Flush files on wb close
>>        fuse: Implement writepages and write_begin/write_end callbacks
>>        fuse: fuse_writepage_locked() should wait on writeback
>>        fuse: fuse_flush() should wait on writeback
>>        fuse: Fix O_DIRECT operations vs cached writeback misorder
>>        fuse: Turn writeback cache on
>>        mm: Account for WRITEBACK_TEMP in balance_dirty_pages
>>
>>
>>   fs/fuse/dir.c             |   51 ++++
>>   fs/fuse/file.c            |  523 +++++++++++++++++++++++++++++++++++++++++----
>>   fs/fuse/fuse_i.h          |   20 ++
>>   fs/fuse/inode.c           |   98 ++++++++
>>   include/uapi/linux/fuse.h |    1
>>   mm/page-writeback.c       |    3
>>   6 files changed, 638 insertions(+), 58 deletions(-)
>>
>
>
>



* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2013-01-15 15:20   ` Maxim V. Patlasov
@ 2013-01-25 10:21     ` Miklos Szeredi
  2013-01-25 12:50       ` Maxim V. Patlasov
  0 siblings, 1 reply; 27+ messages in thread
From: Miklos Szeredi @ 2013-01-25 10:21 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, fuse-devel, linux-kernel, James Bottomley, viro,
	linux-fsdevel, Pavel Emelianov

On Tue, Jan 15, 2013 at 4:20 PM, Maxim V. Patlasov
<mpatlasov@parallels.com> wrote:
> Heard nothing from you for two months. Any feedback would still be
> appreciated.

Sorry about the long silence.

I haven't done a detailed review yet.  It would be good if you could
resend the patchset against the for-next branch of the fuse tree.

I see that you have some other patchsets pending.   Are they independent?

Thanks,
Miklos


* Re: [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy
  2013-01-25 10:21     ` Miklos Szeredi
@ 2013-01-25 12:50       ` Maxim V. Patlasov
  0 siblings, 0 replies; 27+ messages in thread
From: Maxim V. Patlasov @ 2013-01-25 12:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, fuse-devel, linux-kernel, James Bottomley, viro,
	linux-fsdevel, Pavel Emelianov

Hi Miklos,

On 01/25/2013 02:21 PM, Miklos Szeredi wrote:
> On Tue, Jan 15, 2013 at 4:20 PM, Maxim V. Patlasov
> <mpatlasov@parallels.com> wrote:
>> Heard nothing from you for two months. Any feedback would still be
>> appreciated.
> Sorry about the long silence.
>
> I haven't done a detailed review yet.  It would be good if you could
> resend the patchset against the for-next branch of the fuse tree.

OK.

> I see that you have some other patchsets pending.   Are they independent?

They are logically independent, but some of them may require cosmetic 
changes to be applied on top of the others.

Thanks,
Maxim


end of thread, other threads:[~2013-01-25 12:50 UTC | newest]

Thread overview: 27+ messages
2012-11-16 17:04 [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
2012-11-16 17:05 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim Patlasov
2012-11-16 17:05 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim Patlasov
2012-11-16 17:06 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim Patlasov
2012-11-16 17:07 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
2012-11-16 17:07 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim Patlasov
2012-11-16 17:07 ` [PATCH 06/14] fuse: Trust kernel i_size only Maxim Patlasov
2012-12-05 16:39   ` [PATCH] fuse: Trust kernel i_size only - v2 Maxim Patlasov
2012-12-05 16:40   ` [PATCH] fuse: Implement writepages and write_begin/write_end callbacks " Maxim Patlasov
2012-11-16 17:09 ` [PATCH 07/14] fuse: Update i_mtime on buffered writes Maxim Patlasov
2012-11-16 17:09 ` [PATCH 08/14] fuse: Flush files on wb close Maxim Patlasov
2012-11-16 17:09 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks Maxim Patlasov
2012-11-16 17:09 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
2012-11-16 17:10 ` [PATCH 11/14] fuse: fuse_flush() " Maxim Patlasov
2012-11-16 17:10 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder Maxim Patlasov
2012-12-05 16:43   ` [PATCH] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
2012-11-16 17:10 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim Patlasov
2012-11-16 17:10 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim Patlasov
2012-11-21 12:01   ` Maxim Patlasov
2012-11-22 13:27     ` Jaegeuk Hanse
2012-11-22 13:56       ` Maxim V. Patlasov
2012-11-27  1:04 ` [PATCH v2 00/14] fuse: An attempt to implement a write-back cache policy Feng Shuo
2012-11-27  7:56   ` Maxim V. Patlasov
2012-12-12 14:53 ` Maxim V. Patlasov
2013-01-15 15:20   ` Maxim V. Patlasov
2013-01-25 10:21     ` Miklos Szeredi
2013-01-25 12:50       ` Maxim V. Patlasov
