linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy
@ 2013-04-01 10:40 Maxim V. Patlasov
  2013-04-01 10:40 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim V. Patlasov
                   ` (14 more replies)
  0 siblings, 15 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:40 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Hi,

This is the fourth iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back policy.
With this file data are pushed to the userspace with big chunks (depending on the
dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
  This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with few RAM and doesn't make the system stuck (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare meta-data
updates (virtual machines' and containers' disk images). With the existing cache
policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
too much time to proceed, much longer than if it was simply scp-ed over the same
network. The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage concepts
details if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
     on page writeback because page writeback can extend beyond the lifetime of
     the page-cache page
   - fuse_send_writepages can end_page_writeback on original page only after adding
     request to fi->writepages list; otherwise another writeback may happen inside
     the gap between end_page_writeback and adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
     due to reordering requests
   - fuse_flush must flush dirty memory and wait for all writeback on given inode
     before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
     otherwise a race with a writer extending i_size is possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
   on buffered writes) for details.
 - put enabling writeback cache under fusermount control; (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on for-next branch of the fuse tree (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
 - rebased on for-next branch of the fuse tree (634734b63ac39e137a1c623ba74f3e062b6577db)
 - fixed fuse_fillattr() for non-writeback_chace case
 - added comments explaining why we cannot trust size from server
 - rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
 - simplified patch titled Flush-files-on-wb-close
 - eliminated code duplications from fuse_readpage() ans fuse_prepare_write()
 - added comment about "disk full" errors to fuse_write_begin()

Thanks,
Maxim

---

Maxim V. Patlasov (14):
      fuse: Linking file to inode helper
      fuse: Getting file for writeback helper
      fuse: Prepare to handle short reads
      fuse: Prepare to handle multiple pages in writeback
      fuse: Connection bit for enabling writeback
      fuse: Trust kernel i_size only - v3
      fuse: Trust kernel i_mtime only
      fuse: Flush files on wb close
      fuse: Implement writepages and write_begin/write_end callbacks - v3
      fuse: fuse_writepage_locked() should wait on writeback
      fuse: fuse_flush() should wait on writeback
      fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
      fuse: Turn writeback cache on
      mm: Account for WRITEBACK_TEMP in balance_dirty_pages


 fs/fuse/cuse.c            |    5 
 fs/fuse/dir.c             |  127 +++++++++-
 fs/fuse/file.c            |  575 ++++++++++++++++++++++++++++++++++++++++-----
 fs/fuse/fuse_i.h          |   26 ++
 fs/fuse/inode.c           |   37 +++
 include/uapi/linux/fuse.h |    2 
 mm/page-writeback.c       |    3 
 7 files changed, 689 insertions(+), 86 deletions(-)

-- 
Signature

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/14] fuse: Linking file to inode helper
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
@ 2013-04-01 10:40 ` Maxim V. Patlasov
  2013-04-01 10:40 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim V. Patlasov
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:40 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

When writeback is ON every writeable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c807176..356610c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1484,20 +1500,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file->f_dentry->d_inode;
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 02/14] fuse: Getting file for writeback helper
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
  2013-04-01 10:40 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim V. Patlasov
@ 2013-04-01 10:40 ` Maxim V. Patlasov
  2013-04-01 10:41 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim V. Patlasov
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:40 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 356610c..6fc65b4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1375,6 +1375,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
 	fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+					 struct fuse_inode *fi)
+{
+	struct fuse_file *ff;
+
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+	fuse_file_get(ff);
+	spin_unlock(&fc->lock);
+
+	return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1382,7 +1396,6 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_req *req;
-	struct fuse_file *ff;
 	struct page *tmp_page;
 
 	set_page_writeback(page);
@@ -1395,13 +1408,8 @@ static int fuse_writepage_locked(struct page *page)
 	if (!tmp_page)
 		goto err_free;
 
-	spin_lock(&fc->lock);
-	BUG_ON(list_empty(&fi->write_files));
-	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-	req->ff = fuse_file_get(ff);
-	spin_unlock(&fc->lock);
-
-	fuse_write_fill(req, ff, page_offset(page), 0);
+	req->ff = fuse_write_file(fc, fi);
+	fuse_write_fill(req, req->ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 03/14] fuse: Prepare to handle short reads
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
  2013-04-01 10:40 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim V. Patlasov
  2013-04-01 10:40 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-01 10:41 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim V. Patlasov
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

A helper which gets called when read reports less bytes than was requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6fc65b4..648de34 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+			    u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct inode *inode = page->mapping->host;
@@ -574,18 +583,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->page_descs[0].length = count;
 	num_read = fuse_send_read(req, file, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -608,13 +617,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		/*
 		 * Short read means EOF. If file size is larger, truncate it
 		 */
-		if (!req->out.h.error && num_read < count) {
-			loff_t pos;
+		if (!req->out.h.error && num_read < count)
+			fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-			pos = page_offset(req->pages[0]) + num_read;
-			fuse_read_update_size(inode, pos,
-					      req->misc.read.attr_ver);
-		}
 		fuse_invalidate_attr(inode); /* atime changed */
 	}
 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (2 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-25 10:22   ` Miklos Szeredi
  2013-04-01 10:41 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim V. Patlasov
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   22 +++++++++++++++-------
 1 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 648de34..ee44b24 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
 		BUG_ON(req->inode != inode);
 		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-		if (curr_index == index) {
+		if (curr_index <= index &&
+		    index < curr_index + req->num_pages) {
 			found = true;
 			break;
 		}
@@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-	__free_page(req->pages[0]);
+	int i;
+
+	for (i = 0; i < req->num_pages; i++)
+		__free_page(req->pages[i]);
 	fuse_file_put(req->ff, false);
 }
 
@@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	int i;
 
 	list_del(&req->writepages_entry);
-	dec_bdi_stat(bdi, BDI_WRITEBACK);
-	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+	for (i = 0; i < req->num_pages; i++) {
+		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
 }
@@ -1320,14 +1327,15 @@ __acquires(fc->lock)
 	struct fuse_inode *fi = get_fuse_inode(req->inode);
 	loff_t size = i_size_read(req->inode);
 	struct fuse_write_in *inarg = &req->misc.write.in;
+	__u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
 	if (!fc->connected)
 		goto out_free;
 
-	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-		inarg->size = PAGE_CACHE_SIZE;
+	if (inarg->offset + data_size <= size) {
+		inarg->size = data_size;
 	} else if (inarg->offset < size) {
-		inarg->size = size & (PAGE_CACHE_SIZE - 1);
+		inarg->size = size - inarg->offset;
 	} else {
 		/* Got truncated off completely */
 		goto out_free;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 05/14] fuse: Connection bit for enabling writeback
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (3 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-01 10:41 ` [PATCH 06/14] fuse: Trust kernel i_size only - v3 Maxim V. Patlasov
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/fuse_i.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6aeba86..09c3139 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -455,6 +455,9 @@ struct fuse_conn {
 	/** Set if bdi is valid */
 	unsigned bdi_initialized:1;
 
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 06/14] fuse: Trust kernel i_size only - v3
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (4 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-01 10:41 ` [PATCH 07/14] fuse: Trust kernel i_mtime only Maxim V. Patlasov
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Make fuse think that when writeback is on the inode's i_size is always
up-to-date and not update it with the value received from the userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if FUSE_FALLOCATE request succeeded.

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
 - fixed fuse_fillattr() not to use local i_size if writeback-cache is off
 - added a comment explaining why we cannot trust attr.size from server

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c   |   13 +++++++++++--
 fs/fuse/file.c  |   43 +++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/inode.c |   11 +++++++++--
 3 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8506522..8672ee4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -845,6 +845,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 			  struct kstat *stat)
 {
 	unsigned int blkbits;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see the comment in fuse_change_attributes() */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		attr->size = i_size_read(inode);
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1571,6 +1576,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1643,7 +1649,9 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	/* see the comment in fuse_change_attributes() */
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1655,7 +1663,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+			S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ee44b24..af58bbf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/compat.h>
 #include <linux/swap.h>
+#include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
 
@@ -543,9 +544,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache,
+		 * but have not reached the client fs yet. So, the hole is not
+		 * present there.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
+
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
@@ -2286,6 +2309,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		.mode = mode
 	};
 	int err;
+	bool change_i_size = fc->writeback_cache &&
+		!(mode & FALLOC_FL_KEEP_SIZE);
 
 	if (fc->no_fallocate)
 		return -EOPNOTSUPP;
@@ -2294,6 +2319,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (IS_ERR(req))
 		return PTR_ERR(req);
 
+	if (change_i_size) {
+		struct inode *inode = file->f_mapping->host;
+		mutex_lock(&inode->i_mutex);
+	}
+
 	req->in.h.opcode = FUSE_FALLOCATE;
 	req->in.h.nodeid = ff->nodeid;
 	req->in.numargs = 1;
@@ -2307,6 +2337,15 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	}
 	fuse_put_request(fc, req);
 
+	if (change_i_size) {
+		struct inode *inode = file->f_mapping->host;
+
+		if (!err)
+			fuse_write_update_size(inode, offset + length);
+
+		mutex_unlock(&inode->i_mutex);
+	}
+
 	return err;
 }
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 01353ed..94319e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	struct timespec old_mtime;
 
@@ -210,10 +211,16 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+        /*
+	 * In case of writeback_cache enabled, the cached writes beyond EOF
+	 * extend local i_size without keeping userspace server in sync. So,
+	 * attr->size coming from server can be stale. We cannot trust it.
+	 */
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode)) {
+	if (!is_wb && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
 		if (oldsize != attr->size) {


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 07/14] fuse: Trust kernel i_mtime only
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (5 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 06/14] fuse: Trust kernel i_size only - v3 Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-01 10:41 ` [PATCH 08/14] fuse: Flush files on wb close Maxim V. Patlasov
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_UPDATED serves as indication that local i_mtime
should be flushed to the server eventually. Some operations (direct write,
truncate, fallocate and setattr) leads to updating mtime on server. So, we can
clear FUSE_I_MTIME_UPDATED when such an operation is completed. This is safe
because these operations (as well as i_op->update_time and fsync) are
protected by i_mutex.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |  116 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/fuse/file.c   |   33 +++++++++++++--
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 147 insertions(+), 21 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8672ee4..8c04677 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -848,8 +848,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1559,6 +1562,89 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+
+	return err;
+}
+
+static inline void set_mtime_helper(struct inode *inode, struct timespec mtime)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	inode->i_mtime = mtime;
+	clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. But
+ * we can also clear FUSE_I_MTIME_UPDATED if FUSE_SETATTR has just changed
+ * mtime on server.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+	unsigned ivalid = iattr->ia_valid;
+
+	if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+		if (ivalid & ATTR_MTIME_SET)
+			set_mtime_helper(inode, iattr->ia_mtime);
+		else
+			set_mtime_helper(inode, current_fs_time(inode->i_sb));
+	} else if (ivalid & ATTR_SIZE)
+		set_mtime_helper(inode, current_fs_time(inode->i_sb));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1619,17 +1705,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1646,6 +1722,10 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		fuse_set_mtime_local(attr, inode);
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1865,6 +1945,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1904,6 +1995,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index af58bbf..6821e95 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -288,6 +288,10 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -454,6 +458,13 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -820,16 +831,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct file *file,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -1290,8 +1306,11 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res) {
 		res = fuse_direct_io(file, iov, nr_segs, count, ppos, 1);
-		if (res > 0)
+		if (res > 0) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, *ppos);
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		}
 	}
 
 	fuse_invalidate_attr(inode);
@@ -2340,8 +2359,12 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 	if (change_i_size) {
 		struct inode *inode = file->f_mapping->host;
 
-		if (!err)
-			fuse_write_update_size(inode, offset + length);
+		if (!err && fuse_write_update_size(inode, offset + length)) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		}
 
 		mutex_unlock(&inode->i_mutex);
 	}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 09c3139..ddf482b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
 enum {
 	/** Advise readdirplus  */
 	FUSE_I_ADVISE_RDPLUS,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_UPDATED,
 };
 
 struct fuse_conn;
@@ -836,6 +838,8 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 #endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 94319e6..921930f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -249,6 +252,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -295,7 +300,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 08/14] fuse: Flush files on wb close
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (6 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 07/14] fuse: Trust kernel i_mtime only Maxim V. Patlasov
@ 2013-04-01 10:41 ` Maxim V. Patlasov
  2013-04-01 10:42 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3 Maxim V. Patlasov
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:41 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case of "write cache" turned off, the flush is ensured by fuse_vma_close().

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6821e95..5509c0b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -288,6 +288,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_UPDATED,
 		     &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (7 preceding siblings ...)
  2013-04-01 10:41 ` [PATCH 08/14] fuse: Flush files on wb close Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-25 10:35   ` Miklos Szeredi
  2013-04-01 10:42 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim V. Patlasov
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

The .writepages one is required to make each writeback request carry more than
one page on it.

Changed in v2:
 - fixed fuse_prepare_write() to avoid reads beyond EOF
 - fixed fuse_prepare_write() to zero uninitialized part of page

Changed in v3:
 - moved common part of fuse_readpage() and fuse_prepare_write() to
   __fuse_readpage().

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |  322 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 303 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5509c0b..6ceffdf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -588,20 +588,13 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 	}
 }
 
-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
 {
 	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_req *req;
 	size_t num_read;
-	loff_t pos = page_offset(page);
-	size_t count = PAGE_CACHE_SIZE;
-	u64 attr_ver;
-	int err;
-
-	err = -EIO;
-	if (is_bad_inode(inode))
-		goto out;
 
 	/*
 	 * Page writeback can extend beyond the lifetime of the
@@ -611,20 +604,45 @@ static int fuse_readpage(struct file *file, struct page *page)
 	fuse_wait_on_page_writeback(inode, page->index);
 
 	req = fuse_get_req(fc, 1);
-	err = PTR_ERR(req);
+	*err = PTR_ERR(req);
 	if (IS_ERR(req))
-		goto out;
+		return 0;
 
-	attr_ver = fuse_get_attr_version(fc);
+	if (attr_ver_p)
+		*attr_ver_p = fuse_get_attr_version(fc);
 
 	req->out.page_zeroing = 1;
 	req->out.argpages = 1;
 	req->num_pages = 1;
 	req->pages[0] = page;
 	req->page_descs[0].length = count;
-	num_read = fuse_send_read(req, file, pos, count, NULL);
-	err = req->out.h.error;
 
+	num_read = fuse_send_read(req, file, page_offset(page), count, NULL);
+	*err = req->out.h.error;
+
+	if (*err)
+		fuse_put_request(fc, req);
+	else
+		*req_pp = req;
+
+	return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	size_t num_read;
+	size_t count = PAGE_CACHE_SIZE;
+	u64 attr_ver;
+	int err;
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
@@ -634,10 +652,11 @@ static int fuse_readpage(struct file *file, struct page *page)
 
 		SetPageUptodate(page);
 	}
-
-	fuse_put_request(fc, req);
-	fuse_invalidate_attr(inode); /* atime changed */
- out:
+	if (req) {
+		fuse_put_request(fc, req);
+		fuse_invalidate_attr(inode); /* atime changed */
+	}
+out:
 	unlock_page(page);
 	return err;
 }
@@ -702,7 +721,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
 	struct fuse_req *req;
-	struct file *file;
+	union {
+		struct file *file;
+		struct fuse_file *ff;
+	};
 	struct inode *inode;
 	unsigned nr_pages;
 };
@@ -1511,6 +1533,265 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	if (!data->ff) {
+		for (i = 0; i < req->num_pages; i++)
+			end_page_writeback(req->pages[i]);
+		return -EIO;
+	}
+
+	req->inode = inode;
+	req->misc.write.in.offset = page_offset(req->pages[0]);
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	spin_unlock(&fc->lock);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok) {
+		for (i = 0; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			if (page) {
+				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+				__free_page(page);
+				req->pages[i] = NULL;
+			}
+		}
+
+		spin_lock(&fc->lock);
+		list_del(&req->writepages_entry);
+		wake_up(&fi->page_waitq);
+		spin_unlock(&fc->lock);
+		return -ENOMEM;
+	}
+
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	fuse_page_descs_length_init(req, 0, req->num_pages);
+	req->end = fuse_writepage_end;
+
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
+	if (req->num_pages &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+
+		data->req = req =
+			fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+		if (!req) {
+			unlock_page(page);
+			return -ENOMEM;
+		}
+	}
+
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+
+	if (test_set_page_writeback(page))
+		BUG();
+
+	unlock_page(page);
+
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_data data;
+	int err;
+
+	if (!fc->writeback_cache)
+		return generic_writepages(mapping, wbc);
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.ff = NULL;
+	data.inode = inode;
+	data.req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+	err = -ENOMEM;
+	if (!data.req)
+		goto out_put;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req) {
+		if (!err && data.req->num_pages) {
+			err = fuse_send_writepages(&data);
+			if (err)
+				fuse_put_request(fc, data.req);
+		} else
+			fuse_put_request(fc, data.req);
+	}
+out_put:
+	if (data.ff)
+		fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
+/*
+ * Determine the number of bytes of data the page contains
+ */
+static inline unsigned fuse_page_length(struct page *page)
+{
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
+
+	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
+		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
+		if (page_index < end_index)
+			return PAGE_CACHE_SIZE;
+		if (page_index == end_index)
+			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
+	}
+	return 0;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req = NULL;
+	unsigned num_read;
+	unsigned page_len;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+		return 0;
+
+	page_len = fuse_page_length(page);
+	if (!page_len) {
+		zero_user(page, 0, PAGE_CACHE_SIZE);
+		return 0;
+	}
+
+	num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL);
+	if (req)
+		fuse_put_request(fc, req);
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	} else if (num_read != PAGE_CACHE_SIZE) {
+		zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
+	}
+	return err;
+}
+
+/*
+ * It's worthy to make sure that space is reserved on disk for the write,
+ * but how to implement it without killing performance need more thinking.
+ */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2419,11 +2700,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (8 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3 Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-01 10:42 ` [PATCH 11/14] fuse: fuse_flush() " Maxim V. Patlasov
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

fuse_writepage_locked() should never submit new i/o for given page->index
if there is another one 'in progress' already. In most cases it's safe to
wait on page writeback. But if it was called due to memory shortage
(WB_SYNC_NONE), we should redirty page rather than blocking caller.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6ceffdf..2409654 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1472,7 +1472,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
 	return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+				 struct writeback_control *wbc)
 {
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
@@ -1481,6 +1482,14 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_req *req;
 	struct page *tmp_page;
 
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
 	set_page_writeback(page);
 
 	req = fuse_request_alloc_nofs(1);
@@ -1527,7 +1536,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int err;
 
-	err = fuse_writepage_locked(page);
+	err = fuse_writepage_locked(page, wbc);
 	unlock_page(page);
 
 	return err;
@@ -1797,7 +1806,10 @@ static int fuse_launder_page(struct page *page)
 	int err = 0;
 	if (clear_page_dirty_for_io(page)) {
 		struct inode *inode = page->mapping->host;
-		err = fuse_writepage_locked(page);
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_ALL,
+		};
+		err = fuse_writepage_locked(page, &wbc);
 		if (!err)
 			fuse_wait_on_page_writeback(inode, page->index);
 	}


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/14] fuse: fuse_flush() should wait on writeback
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (9 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-01 10:42 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim V. Patlasov
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

The aim of .flush fop is to hint file-system that flushing its state or caches
or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may be not visible to userspace
yet if we won't ensure it explicitly.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 2409654..7c24f6b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,7 @@
 #include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -396,6 +397,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fc->no_flush)
 		return 0;
 
+	err = filemap_write_and_wait(file->f_mapping);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
+	fuse_sync_writes(inode);
+	mutex_unlock(&inode->i_mutex);
+
 	req = fuse_get_req_nofail_nopages(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (10 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 11/14] fuse: fuse_flush() " Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-01 10:42 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim V. Patlasov
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the one that was in a file
before the 1st op. Problem is in how fuse manages writeback.

When direct op occurs the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operation.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/cuse.c   |    5 +++--
 fs/fuse/file.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h |   13 ++++++++++++-
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 6f96a8d..fb63185 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -93,7 +93,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
 	loff_t pos = 0;
 	struct iovec iov = { .iov_base = buf, .iov_len = count };
 
-	return fuse_direct_io(file, &iov, 1, count, &pos, 0);
+	return fuse_direct_io(file, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -106,7 +106,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 	 * No locking or generic_write_checks(), the server is
 	 * responsible for locking and sanity checks.
 	 */
-	return fuse_direct_io(file, &iov, 1, count, &pos, 1);
+	return fuse_direct_io(file, &iov, 1, count, &pos,
+			      FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7c24f6b..14880bb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -338,6 +338,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -382,6 +407,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
 	return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
@@ -1248,8 +1286,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write)
+		       int flags)
 {
+	int write = flags & FUSE_DIO_WRITE;
+	int cuse = flags & FUSE_DIO_CUSE;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
 	size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1274,6 +1314,10 @@ ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 			break;
 		}
 
+		if (!cuse)
+			fuse_wait_on_writeback(file->f_mapping->host, pos,
+					       nbytes);
+
 		if (write)
 			nres = fuse_send_write(req, file, pos, nbytes, owner);
 		else
@@ -1342,7 +1386,8 @@ static ssize_t __fuse_direct_write(struct file *file, const struct iovec *iov,
 
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res) {
-		res = fuse_direct_io(file, iov, nr_segs, count, ppos, 1);
+		res = fuse_direct_io(file, iov, nr_segs, count, ppos,
+				     FUSE_DIO_WRITE);
 		if (res > 0) {
 			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, *ppos);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ddf482b..f54d669 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -828,9 +828,20 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir);
+
+/**
+ * fuse_direct_io() flags
+ */
+
+/** If set, it is WRITE; otherwise - READ */
+#define FUSE_DIO_WRITE (1 << 0)
+
+/** CUSE pass fuse_direct_io() a file which f_mapping->host is not from FUSE */
+#define FUSE_DIO_CUSE  (1 << 1)
+
 ssize_t fuse_direct_io(struct file *file, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write);
+		       int flags);
 long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
 		   unsigned int flags);
 long fuse_ioctl_common(struct file *file, unsigned int cmd,


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 13/14] fuse: Turn writeback cache on
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (11 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-01 10:42 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim V. Patlasov
  2013-04-11 11:18 ` [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
  14 siblings, 0 replies; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this and
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.

Original patch by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c            |    5 +++++
 fs/fuse/fuse_i.h          |    4 ++++
 fs/fuse/inode.c           |   13 +++++++++++++
 include/uapi/linux/fuse.h |    2 ++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 14880bb..5d2c77f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -205,6 +205,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		spin_unlock(&fc->lock);
 		fuse_invalidate_attr(inode);
 	}
+	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+		fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1099,6 +1101,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct iov_iter i;
 	loff_t endbyte = 0;
 
+	if (get_fuse_conn(inode)->writeback_cache)
+		return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
 	WARN_ON(iocb->ki_pos != pos);
 
 	ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f54d669..f023814 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER         (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
+    module will enable support of writback cache */
+#define FUSE_ALLOW_WBCACHE       (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 921930f..2271177 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -457,6 +457,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_ALLOW_WBCACHE,
 	OPT_ERR
 };
 
@@ -469,6 +470,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_ALLOW_WBCACHE,		"allow_wbcache"},
 	{OPT_ERR,			NULL}
 };
 
@@ -542,6 +544,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_ALLOW_WBCACHE:
+			d->flags |= FUSE_ALLOW_WBCACHE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -569,6 +575,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		seq_puts(m, ",allow_wbcache");
 	return 0;
 }
 
@@ -882,6 +890,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->do_readdirplus = 1;
 			if (arg->flags & FUSE_READDIRPLUS_AUTO)
 				fc->readdirplus_auto = 1;
+			if (arg->flags & FUSE_WRITEBACK_CACHE &&
+			    fc->flags & FUSE_ALLOW_WBCACHE)
+				fc->writeback_cache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_CACHE_SIZE;
 			fc->no_lock = 1;
@@ -910,6 +921,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
 		FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
 		FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO;
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		arg->flags |= FUSE_WRITEBACK_CACHE;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 4c43b44..6acda83 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -220,6 +220,7 @@ struct fuse_file_lock {
  * FUSE_AUTO_INVAL_DATA: automatically invalidate cached pages
  * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
  * FUSE_READDIRPLUS_AUTO: adaptive readdirplus
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -236,6 +237,7 @@ struct fuse_file_lock {
 #define FUSE_AUTO_INVAL_DATA	(1 << 12)
 #define FUSE_DO_READDIRPLUS	(1 << 13)
 #define FUSE_READDIRPLUS_AUTO	(1 << 14)
+#define FUSE_WRITEBACK_CACHE	(1 << 15)
 
 /**
  * CUSE INIT request/reply flags


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (12 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim V. Patlasov
@ 2013-04-01 10:42 ` Maxim V. Patlasov
  2013-04-25 14:29   ` [fuse-devel] " Maxim V. Patlasov
  2013-04-11 11:18 ` [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
  14 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-01 10:42 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note, that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in the function this means, that it either
writes to the mountpoint it serves (but it can deadlock even without
the writeback) or it is writing to some _other_ dirty bdi and in the
latter case someone else will free the memory for it.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 mm/page-writeback.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..c47bcd4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+			global_page_state(NR_WRITEBACK_TEMP);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy
  2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
                   ` (13 preceding siblings ...)
  2013-04-01 10:42 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim V. Patlasov
@ 2013-04-11 11:18 ` Maxim V. Patlasov
  2013-04-11 14:36   ` Miklos Szeredi
  14 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-11 11:18 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

Hi Miklos,

Any feedback would be highly appreciated.

Thanks,
Maxim

04/01/2013 02:40 PM, Maxim V. Patlasov пишет:
> Hi,
>
> This is the fourth iteration of Pavel Emelyanov's patch-set implementing
> write-back policy for FUSE page cache. Initial patch-set description was
> the following:
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
> turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
>    and finish the page writeback immediately
>
> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
>    with the user space
>
> * take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
>    This protects us from having too many dirty pages on FUSE
>
> The provided patchset survives the fsx test. Performance measurements are not yet
> all finished, but the mentioned copying of a huge file becomes noticeably faster
> even on machines with few RAM and doesn't make the system stuck (the dirty pages
> balancer does its work OK). Applies on top of v3.5-rc4.
>
> We are currently exploring this with our own distributed storage implementation
> which is heavily oriented on storing big blobs of data with extremely rare meta-data
> updates (virtual machines' and containers' disk images). With the existing cache
> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
> too much time to proceed, much longer than if it was simply scp-ed over the same
> network. The write-back policy (as I mentioned) noticeably improves this scenario.
> Kirill (in Cc) can share more details about the performance and the storage concepts
> details if required.
>
> Changed in v2:
>   - numerous bugfixes:
>     - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
>       on page writeback because page writeback can extend beyond the lifetime of
>       the page-cache page
>     - fuse_send_writepages can end_page_writeback on original page only after adding
>       request to fi->writepages list; otherwise another writeback may happen inside
>       the gap between end_page_writeback and adding to the list
>     - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
>       due to reordering requests
>     - fuse_flush must flush dirty memory and wait for all writeback on given inode
>       before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
>     - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
>       otherwise a race with a writer extending i_size is possible
>     - fix handling errors in fuse_writepages and fuse_send_writepages
>   - handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
>     on buffered writes) for details.
>   - put enabling writeback cache under fusermount control; (see mount option
>     'allow_wbcache' introduced by patch #13 (turn writeback cache on))
>   - rebased on v3.7-rc5
>
> Changed in v3:
>   - rebased on for-next branch of the fuse tree (fb05f41f5f96f7423c53da4d87913fb44fd0565d)
>
> Changed in v4:
>   - rebased on for-next branch of the fuse tree (634734b63ac39e137a1c623ba74f3e062b6577db)
>   - fixed fuse_fillattr() for non-writeback_chace case
>   - added comments explaining why we cannot trust size from server
>   - rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
>   - simplified patch titled Flush-files-on-wb-close
>   - eliminated code duplications from fuse_readpage() ans fuse_prepare_write()
>   - added comment about "disk full" errors to fuse_write_begin()
>
> Thanks,
> Maxim
>
> ---
>
> Maxim V. Patlasov (14):
>        fuse: Linking file to inode helper
>        fuse: Getting file for writeback helper
>        fuse: Prepare to handle short reads
>        fuse: Prepare to handle multiple pages in writeback
>        fuse: Connection bit for enabling writeback
>        fuse: Trust kernel i_size only - v3
>        fuse: Trust kernel i_mtime only
>        fuse: Flush files on wb close
>        fuse: Implement writepages and write_begin/write_end callbacks - v3
>        fuse: fuse_writepage_locked() should wait on writeback
>        fuse: fuse_flush() should wait on writeback
>        fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
>        fuse: Turn writeback cache on
>        mm: Account for WRITEBACK_TEMP in balance_dirty_pages
>
>
>   fs/fuse/cuse.c            |    5
>   fs/fuse/dir.c             |  127 +++++++++-
>   fs/fuse/file.c            |  575 ++++++++++++++++++++++++++++++++++++++++-----
>   fs/fuse/fuse_i.h          |   26 ++
>   fs/fuse/inode.c           |   37 +++
>   include/uapi/linux/fuse.h |    2
>   mm/page-writeback.c       |    3
>   7 files changed, 689 insertions(+), 86 deletions(-)
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy
  2013-04-11 11:18 ` [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
@ 2013-04-11 14:36   ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-11 14:36 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel

On Thu, Apr 11, 2013 at 1:18 PM, Maxim V. Patlasov
<mpatlasov@parallels.com> wrote:
>
> Any feedback would be highly appreciated.

I was reviewing the patchset but that got delayed by a few weeks due
to various things.

On the whole it looks good, and I'll try to push it for 3.10.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback
  2013-04-01 10:41 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim V. Patlasov
@ 2013-04-25 10:22   ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-25 10:22 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel

On Mon, Apr 1, 2013 at 12:41 PM, Maxim V. Patlasov
<MPatlasov@parallels.com> wrote:
>
> The .writepages callback will issue writeback requests with more than one
> page aboard. Make existing end/check code be aware of this.
>
> Original patch by: Pavel Emelyanov <xemul@openvz.org>

If this patch was written by Pavel and then modified by you then
please add  a "From:" header at the top of the patch instead.  See
Documentation/SubmittingPatches.

> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
> ---
>  fs/fuse/file.c |   22 +++++++++++++++-------
>  1 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 648de34..ee44b24 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -345,7 +345,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
>
>                 BUG_ON(req->inode != inode);
>                 curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
> -               if (curr_index == index) {
> +               if (curr_index <= index &&
> +                   index < curr_index + req->num_pages) {
>                         found = true;
>                         break;
>                 }
> @@ -1295,7 +1296,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
>
>  static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
>  {
> -       __free_page(req->pages[0]);
> +       int i;
> +
> +       for (i = 0; i < req->num_pages; i++)
> +               __free_page(req->pages[i]);
>         fuse_file_put(req->ff, false);
>  }
>
> @@ -1304,10 +1308,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>         struct inode *inode = req->inode;
>         struct fuse_inode *fi = get_fuse_inode(inode);
>         struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
> +       int i;
>
>         list_del(&req->writepages_entry);
> -       dec_bdi_stat(bdi, BDI_WRITEBACK);
> -       dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> +       for (i = 0; i < req->num_pages; i++) {
> +               dec_bdi_stat(bdi, BDI_WRITEBACK);
> +               dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
> +       }
>         bdi_writeout_inc(bdi);
>         wake_up(&fi->page_waitq);
>  }
> @@ -1320,14 +1327,15 @@ __acquires(fc->lock)
>         struct fuse_inode *fi = get_fuse_inode(req->inode);
>         loff_t size = i_size_read(req->inode);
>         struct fuse_write_in *inarg = &req->misc.write.in;
> +       __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
>
>         if (!fc->connected)
>                 goto out_free;
>
> -       if (inarg->offset + PAGE_CACHE_SIZE <= size) {
> -               inarg->size = PAGE_CACHE_SIZE;
> +       if (inarg->offset + data_size <= size) {
> +               inarg->size = data_size;
>         } else if (inarg->offset < size) {
> -               inarg->size = size & (PAGE_CACHE_SIZE - 1);
> +               inarg->size = size - inarg->offset;
>         } else {
>                 /* Got truncated off completely */
>                 goto out_free;
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3
  2013-04-01 10:42 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3 Maxim V. Patlasov
@ 2013-04-25 10:35   ` Miklos Szeredi
  2013-06-14 14:03     ` Maxim Patlasov
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-25 10:35 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel

On Mon, Apr 1, 2013 at 12:42 PM, Maxim V. Patlasov
<MPatlasov@parallels.com> wrote:
> The .writepages one is required to make each writeback request carry more than
> one page on it.

I'd split this into two parts:

1) implement ->writepages() and enable it unconditionally for mmaped
writeback (why is it not enabled by this patch?)

2) implement ->write_begin() and ->write_end() and conditionally
enable cached writeback

More comments inline.


>
> Changed in v2:
>  - fixed fuse_prepare_write() to avoid reads beyond EOF
>  - fixed fuse_prepare_write() to zero uninitialized part of page
>
> Changed in v3:
>  - moved common part of fuse_readpage() and fuse_prepare_write() to
>    __fuse_readpage().
>
> Original patch by: Pavel Emelyanov <xemul@openvz.org>
> Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
> ---
>  fs/fuse/file.c |  322 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 303 insertions(+), 19 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 5509c0b..6ceffdf 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -588,20 +588,13 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
>         }
>  }
>
> -static int fuse_readpage(struct file *file, struct page *page)
> +static int __fuse_readpage(struct file *file, struct page *page, size_t count,
> +                          int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
>  {
>         struct inode *inode = page->mapping->host;
>         struct fuse_conn *fc = get_fuse_conn(inode);
>         struct fuse_req *req;
>         size_t num_read;
> -       loff_t pos = page_offset(page);
> -       size_t count = PAGE_CACHE_SIZE;
> -       u64 attr_ver;
> -       int err;
> -
> -       err = -EIO;
> -       if (is_bad_inode(inode))
> -               goto out;
>

Make this a separate patch too.  Restructuring ...readpage() doesn't
have a lot to do with implementing ...writepages().

>         /*
>          * Page writeback can extend beyond the lifetime of the
> @@ -611,20 +604,45 @@ static int fuse_readpage(struct file *file, struct page *page)
>         fuse_wait_on_page_writeback(inode, page->index);
>
>         req = fuse_get_req(fc, 1);
> -       err = PTR_ERR(req);
> +       *err = PTR_ERR(req);
>         if (IS_ERR(req))
> -               goto out;
> +               return 0;
>
> -       attr_ver = fuse_get_attr_version(fc);
> +       if (attr_ver_p)
> +               *attr_ver_p = fuse_get_attr_version(fc);
>
>         req->out.page_zeroing = 1;
>         req->out.argpages = 1;
>         req->num_pages = 1;
>         req->pages[0] = page;
>         req->page_descs[0].length = count;
> -       num_read = fuse_send_read(req, file, pos, count, NULL);
> -       err = req->out.h.error;
>
> +       num_read = fuse_send_read(req, file, page_offset(page), count, NULL);
> +       *err = req->out.h.error;
> +
> +       if (*err)
> +               fuse_put_request(fc, req);
> +       else
> +               *req_pp = req;
> +
> +       return num_read;
> +}
> +
> +static int fuse_readpage(struct file *file, struct page *page)
> +{
> +       struct inode *inode = page->mapping->host;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +       struct fuse_req *req = NULL;
> +       size_t num_read;
> +       size_t count = PAGE_CACHE_SIZE;
> +       u64 attr_ver;
> +       int err;
> +
> +       err = -EIO;
> +       if (is_bad_inode(inode))
> +               goto out;
> +
> +       num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
>         if (!err) {
>                 /*
>                  * Short read means EOF.  If file size is larger, truncate it
> @@ -634,10 +652,11 @@ static int fuse_readpage(struct file *file, struct page *page)
>
>                 SetPageUptodate(page);
>         }
> -
> -       fuse_put_request(fc, req);
> -       fuse_invalidate_attr(inode); /* atime changed */
> - out:
> +       if (req) {
> +               fuse_put_request(fc, req);
> +               fuse_invalidate_attr(inode); /* atime changed */
> +       }
> +out:
>         unlock_page(page);
>         return err;
>  }
> @@ -702,7 +721,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
>
>  struct fuse_fill_data {
>         struct fuse_req *req;
> -       struct file *file;
> +       union {
> +               struct file *file;
> +               struct fuse_file *ff;
> +       };

Don't do a union. Make two separate structures, one for read and one for write.

>         struct inode *inode;
>         unsigned nr_pages;
>  };
> @@ -1511,6 +1533,265 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
>         return err;
>  }
>
> +static int fuse_send_writepages(struct fuse_fill_data *data)
> +{
> +       int i, all_ok = 1;
> +       struct fuse_req *req = data->req;
> +       struct inode *inode = data->inode;
> +       struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +       struct fuse_inode *fi = get_fuse_inode(inode);
> +       loff_t off = -1;
> +
> +       if (!data->ff)
> +               data->ff = fuse_write_file(fc, fi);
> +
> +       if (!data->ff) {
> +               for (i = 0; i < req->num_pages; i++)
> +                       end_page_writeback(req->pages[i]);
> +               return -EIO;
> +       }
> +
> +       req->inode = inode;
> +       req->misc.write.in.offset = page_offset(req->pages[0]);

We need to set req->background.

> +
> +       spin_lock(&fc->lock);
> +       list_add(&req->writepages_entry, &fi->writepages);
> +       spin_unlock(&fc->lock);
> +
> +       for (i = 0; i < req->num_pages; i++) {
> +               struct page *page = req->pages[i];
> +               struct page *tmp_page;
> +
> +               tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
> +               if (tmp_page) {
> +                       copy_highpage(tmp_page, page);
> +                       inc_bdi_stat(bdi, BDI_WRITEBACK);
> +                       inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

Is there a reason why we do the page allocation/copying here instead
of at fill time?  I'd guess it would be simpler and more logical
there.

> +               } else
> +                       all_ok = 0;
> +               req->pages[i] = tmp_page;
> +               if (i == 0)
> +                       off = page_offset(page);
> +
> +               end_page_writeback(page);
> +       }
> +
> +       if (!all_ok) {
> +               for (i = 0; i < req->num_pages; i++) {
> +                       struct page *page = req->pages[i];
> +                       if (page) {
> +                               dec_bdi_stat(bdi, BDI_WRITEBACK);
> +                               dec_zone_page_state(page, NR_WRITEBACK_TEMP);
> +                               __free_page(page);
> +                               req->pages[i] = NULL;
> +                       }
> +               }
> +
> +               spin_lock(&fc->lock);
> +               list_del(&req->writepages_entry);
> +               wake_up(&fi->page_waitq);
> +               spin_unlock(&fc->lock);
> +               return -ENOMEM;
> +       }
> +
> +       req->ff = fuse_file_get(data->ff);
> +       fuse_write_fill(req, data->ff, off, 0);
> +
> +       req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
> +       req->in.argpages = 1;
> +       fuse_page_descs_length_init(req, 0, req->num_pages);
> +       req->end = fuse_writepage_end;
> +
> +       spin_lock(&fc->lock);
> +       list_add_tail(&req->list, &fi->queued_writes);
> +       fuse_flush_writepages(data->inode);
> +       spin_unlock(&fc->lock);
> +
> +       return 0;
> +}
> +
> +static int fuse_writepages_fill(struct page *page,
> +               struct writeback_control *wbc, void *_data)
> +{
> +       struct fuse_fill_data *data = _data;
> +       struct fuse_req *req = data->req;
> +       struct inode *inode = data->inode;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +
> +       if (fuse_page_is_writeback(inode, page->index)) {
> +               if (wbc->sync_mode != WB_SYNC_ALL) {
> +                       redirty_page_for_writepage(wbc, page);
> +                       unlock_page(page);
> +                       return 0;
> +               }
> +               fuse_wait_on_page_writeback(inode, page->index);
> +       }
> +
> +       if (req->num_pages &&
> +           (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
> +            (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
> +            req->pages[req->num_pages - 1]->index + 1 != page->index)) {
> +               int err;
> +
> +               err = fuse_send_writepages(data);
> +               if (err) {
> +                       unlock_page(page);
> +                       return err;
> +               }
> +
> +               data->req = req =
> +                       fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
> +               if (!req) {
> +                       unlock_page(page);
> +                       return -ENOMEM;
> +               }
> +       }
> +
> +       req->pages[req->num_pages] = page;
> +       req->num_pages++;
> +
> +       if (test_set_page_writeback(page))
> +               BUG();
> +
> +       unlock_page(page);
> +
> +       return 0;
> +}
> +
> +static int fuse_writepages(struct address_space *mapping,
> +                          struct writeback_control *wbc)
> +{
> +       struct inode *inode = mapping->host;
> +       struct fuse_conn *fc = get_fuse_conn(inode);
> +       struct fuse_fill_data data;
> +       int err;
> +
> +       if (!fc->writeback_cache)
> +               return generic_writepages(mapping, wbc);
> +
> +       err = -EIO;
> +       if (is_bad_inode(inode))
> +               goto out;
> +
> +       data.ff = NULL;
> +       data.inode = inode;
> +       data.req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
> +       err = -ENOMEM;
> +       if (!data.req)
> +               goto out_put;
> +
> +       err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
> +       if (data.req) {
> +               if (!err && data.req->num_pages) {
> +                       err = fuse_send_writepages(&data);
> +                       if (err)
> +                               fuse_put_request(fc, data.req);
> +               } else
> +                       fuse_put_request(fc, data.req);

I think fuse_send_writepages() should grab a ref to the request with
__fuse_get_request() if it successfully sent the request.  Then we can
unconditionally do the fuse_put_request() here.

> +       }
> +out_put:
> +       if (data.ff)
> +               fuse_file_put(data.ff, false);
> +out:
> +       return err;
> +}
> +
> +/*
> + * Determine the number of bytes of data the page contains
> + */
> +static inline unsigned fuse_page_length(struct page *page)
> +{
> +       loff_t i_size = i_size_read(page_file_mapping(page)->host);
> +
> +       if (i_size > 0) {
> +               pgoff_t page_index = page_file_index(page);
> +               pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
> +               if (page_index < end_index)
> +                       return PAGE_CACHE_SIZE;
> +               if (page_index == end_index)
> +                       return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
> +       }
> +       return 0;
> +}
> +
> +static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
> +               struct page *page, loff_t pos, unsigned len)
> +{
> +       struct fuse_req *req = NULL;
> +       unsigned num_read;
> +       unsigned page_len;
> +       int err;
> +
> +       if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
> +               return 0;
> +
> +       page_len = fuse_page_length(page);
> +       if (!page_len) {
> +               zero_user(page, 0, PAGE_CACHE_SIZE);
> +               return 0;
> +       }
> +
> +       num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL);
> +       if (req)
> +               fuse_put_request(fc, req);
> +       if (err) {
> +               unlock_page(page);
> +               page_cache_release(page);
> +       } else if (num_read != PAGE_CACHE_SIZE) {
> +               zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
> +       }
> +       return err;
> +}
> +
> +/*
> + * It's worthy to make sure that space is reserved on disk for the write,
> + * but how to implement it without killing performance need more thinking.
> + */
> +static int fuse_write_begin(struct file *file, struct address_space *mapping,
> +               loff_t pos, unsigned len, unsigned flags,
> +               struct page **pagep, void **fsdata)
> +{
> +       pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> +       struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
> +
> +       BUG_ON(!fc->writeback_cache);
> +
> +       *pagep = grab_cache_page_write_begin(mapping, index, flags);
> +       if (!*pagep)
> +               return -ENOMEM;
> +
> +       return fuse_prepare_write(fc, file, *pagep, pos, len);
> +}
> +
> +static int fuse_commit_write(struct file *file, struct page *page,
> +               unsigned from, unsigned to)
> +{
> +       struct inode *inode = page->mapping->host;
> +       loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +
> +       if (!PageUptodate(page))
> +               SetPageUptodate(page);
> +
> +       fuse_write_update_size(inode, pos);
> +       set_page_dirty(page);
> +       return 0;
> +}
> +
> +static int fuse_write_end(struct file *file, struct address_space *mapping,
> +               loff_t pos, unsigned len, unsigned copied,
> +               struct page *page, void *fsdata)
> +{
> +       unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> +
> +       fuse_commit_write(file, page, from, from+copied);
> +
> +       unlock_page(page);
> +       page_cache_release(page);
> +
> +       return copied;
> +}
> +
>  static int fuse_launder_page(struct page *page)
>  {
>         int err = 0;
> @@ -2419,11 +2700,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
>  static const struct address_space_operations fuse_file_aops  = {
>         .readpage       = fuse_readpage,
>         .writepage      = fuse_writepage,
> +       .writepages     = fuse_writepages,
>         .launder_page   = fuse_launder_page,
>         .readpages      = fuse_readpages,
>         .set_page_dirty = __set_page_dirty_nobuffers,
>         .bmap           = fuse_bmap,
>         .direct_IO      = fuse_direct_IO,
> +       .write_begin    = fuse_write_begin,
> +       .write_end      = fuse_write_end,
>  };
>
>  void fuse_init_file_inode(struct inode *inode)
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-01 10:42 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim V. Patlasov
@ 2013-04-25 14:29   ` Maxim V. Patlasov
  2013-04-25 15:49     ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-25 14:29 UTC (permalink / raw)
  To: miklos
  Cc: dev, xemul, fuse-devel, linux-kernel, jbottomley, viro,
	linux-fsdevel, devel, akpm, fengguang.wu

Hi Miklos,

04/01/2013 02:42 PM, Maxim V. Patlasov пишет:
> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
> counter is high enough. This prevents us from having too many dirty
> pages on fuse, thus giving the userspace part of it a chance to write
> stuff properly.
>
> Note, that the existing balance logic is per-bdi, i.e. if the fuse
> user task gets stuck in the function this means, that it either
> writes to the mountpoint it serves (but it can deadlock even without
> the writeback) or it is writing to some _other_ dirty bdi and in the
> latter case someone else will free the memory for it.
>
> Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
>   mm/page-writeback.c |    3 ++-
>   1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0713bfb..c47bcd4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>   		 */
>   		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>   					global_page_state(NR_UNSTABLE_NFS);
> -		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> +		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
> +			global_page_state(NR_WRITEBACK_TEMP);
>   
>   		global_dirty_limits(&background_thresh, &dirty_thresh);

Please drop this patch. As we discussed in LSF/MM, the fix above is 
correct, but it's not enough: we also need to ensure disregard of 
NR_WRITEBACK_TEMP when balance_dirty_pages() is called from fuse daemon. 
I'll send a separate patch-set soon.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-25 14:29   ` [fuse-devel] " Maxim V. Patlasov
@ 2013-04-25 15:49     ` Miklos Szeredi
  2013-04-25 16:16       ` Maxim V. Patlasov
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-25 15:49 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu

On Thu, Apr 25, 2013 at 4:29 PM, Maxim V. Patlasov
<mpatlasov@parallels.com> wrote:
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 0713bfb..c47bcd4 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space
>> *mapping,
>>                  */
>>                 nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>>
>> global_page_state(NR_UNSTABLE_NFS);
>> -               nr_dirty = nr_reclaimable +
>> global_page_state(NR_WRITEBACK);
>> +               nr_dirty = nr_reclaimable +
>> global_page_state(NR_WRITEBACK) +
>> +                       global_page_state(NR_WRITEBACK_TEMP);
>>                 global_dirty_limits(&background_thresh, &dirty_thresh);
>
>
> Please drop this patch. As we discussed in LSF/MM, the fix above is correct,
> but it's not enough: we also need to ensure disregard of NR_WRITEBACK_TEMP
> when balance_dirty_pages() is called from fuse daemon. I'll send a separate
> patch-set soon.

Please elaborate.  From a technical perspective "fuse daemon" is very
hard to define, so anything that relies on whether something came from
the fuse daemon or not is conceptually broken.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-25 15:49     ` Miklos Szeredi
@ 2013-04-25 16:16       ` Maxim V. Patlasov
  2013-04-25 20:43         ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-25 16:16 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, mgorman

Hi,

04/25/2013 07:49 PM, Miklos Szeredi пишет:
> On Thu, Apr 25, 2013 at 4:29 PM, Maxim V. Patlasov
> <mpatlasov@parallels.com> wrote:
>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>> index 0713bfb..c47bcd4 100644
>>> --- a/mm/page-writeback.c
>>> +++ b/mm/page-writeback.c
>>> @@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space
>>> *mapping,
>>>                   */
>>>                  nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>>>
>>> global_page_state(NR_UNSTABLE_NFS);
>>> -               nr_dirty = nr_reclaimable +
>>> global_page_state(NR_WRITEBACK);
>>> +               nr_dirty = nr_reclaimable +
>>> global_page_state(NR_WRITEBACK) +
>>> +                       global_page_state(NR_WRITEBACK_TEMP);
>>>                  global_dirty_limits(&background_thresh, &dirty_thresh);
>>
>> Please drop this patch. As we discussed in LSF/MM, the fix above is correct,
>> but it's not enough: we also need to ensure disregard of NR_WRITEBACK_TEMP
>> when balance_dirty_pages() is called from fuse daemon. I'll send a separate
>> patch-set soon.
> Please elaborate.  From a technical perspective "fuse daemon" is very
> hard to define, so anything that relies on whether something came from
> the fuse daemon or not is conceptually broken.

As Mel Gorman pointed out, fuse daemon diving into balance_dirty_pages 
should not kick flusher judging on NR_WRITEBACK_TEMP. Essentially, all 
we need in balance_dirty_pages is:

     if (I'm not fuse daemon)
         nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

The way how to identify fuse daemon was not thoroughly scrutinized 
during LSF/MM. Firstly, I thought it would be enough to set a 
per-process flag handling fuse device open. But now I understand that 
fuse daemon may be quite a complicated multi-threaded multi-process 
construction. I'm going to add new FUSE_NOTIFY to allow fuse daemon 
decide when it works on behalf of draining writeout-s. Having in mind 
that fuse-lib is multi-threaded, I'm also going to inherit the flag on 
copy_process(). Does it make sense for you?

Also, another patch will put this ad-hoc FUSE_NOTIFY under fusermount 
control. This will prevent malicious unprivileged fuse mounts from 
setting the flag for malicious purposes.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-25 16:16       ` Maxim V. Patlasov
@ 2013-04-25 20:43         ` Miklos Szeredi
  2013-04-26  8:32           ` Maxim V. Patlasov
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-25 20:43 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, mgorman

On Thu, Apr 25, 2013 at 08:16:45PM +0400, Maxim V. Patlasov wrote:
> 
> As Mel Gorman pointed out, fuse daemon diving into
> balance_dirty_pages should not kick flusher judging on
> NR_WRITEBACK_TEMP. Essentially, all we need in balance_dirty_pages
> is:
> 
>     if (I'm not fuse daemon)
>         nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

I strongly dislike the above.

What about something like the following untested patch?

The idea is that fuse filesystems should not go over the bdi limit even if the
global limit hasn't been reached.

Thanks,
Miklos

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 137185c..195ee45 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
+		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..97f6a0c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..91a9e6e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1226,6 +1226,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	int strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
 	unsigned long start_time = jiffies;
 
 	for (;;) {
@@ -1250,7 +1251,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun) {
+		if (nr_dirty <= freerun && !strictlimit) {
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
 			current->nr_dirtied_pause =
@@ -1297,7 +1298,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
-				  (nr_dirty > dirty_thresh);
+				  ((nr_dirty > dirty_thresh) || strictlimit);
 		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-25 20:43         ` Miklos Szeredi
@ 2013-04-26  8:32           ` Maxim V. Patlasov
  2013-04-26 14:02             ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-26  8:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, mgorman, riel, hughd,
	gthelen, linux-mm, Andrew Morton

Hi Miklos,

04/26/2013 12:43 AM, Miklos Szeredi пишет:
> On Thu, Apr 25, 2013 at 08:16:45PM +0400, Maxim V. Patlasov wrote:
>> As Mel Gorman pointed out, fuse daemon diving into
>> balance_dirty_pages should not kick flusher judging on
>> NR_WRITEBACK_TEMP. Essentially, all we need in balance_dirty_pages
>> is:
>>
>>      if (I'm not fuse daemon)
>>          nr_dirty += global_page_state(NR_WRITEBACK_TEMP);
> I strongly dislike the above.

The above was well-discussed on mm track of LSF/MM. Everybody seemed to 
agree with solution above. I'm cc-ing some guys who were involved in 
discussion, mm mailing list and Andrew as well. For those who don't 
follow from the beginning here is an excerpt:

> 04/25/2013 07:49 PM, Miklos Szeredi пишет:
>> On Thu, Apr 25, 2013 at 4:29 PM, Maxim V. Patlasov
>> <mpatlasov@parallels.com>  wrote:
>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>> index 0713bfb..c47bcd4 100644
>>>> --- a/mm/page-writeback.c
>>>> +++ b/mm/page-writeback.c
>>>> @@ -1235,7 +1235,8 @@ static void balance_dirty_pages(struct address_space
>>>> *mapping,
>>>>                    */
>>>>                   nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>>>>
>>>> global_page_state(NR_UNSTABLE_NFS);
>>>> -               nr_dirty = nr_reclaimable +
>>>> global_page_state(NR_WRITEBACK);
>>>> +               nr_dirty = nr_reclaimable +
>>>> global_page_state(NR_WRITEBACK) +
>>>> +                       global_page_state(NR_WRITEBACK_TEMP);
>>>>                   global_dirty_limits(&background_thresh, &dirty_thresh);
>>> Please drop this patch. As we discussed in LSF/MM, the fix above is correct,
>>> but it's not enough: we also need to ensure disregard of NR_WRITEBACK_TEMP
>>> when balance_dirty_pages() is called from fuse daemon. I'll send a separate
>>> patch-set soon.
>> Please elaborate.  From a technical perspective "fuse daemon" is very
>> hard to define, so anything that relies on whether something came from
>> the fuse daemon or not is conceptually broken.
> As Mel Gorman pointed out, fuse daemon diving into balance_dirty_pages
> should not kick flusher judging on NR_WRITEBACK_TEMP. Essentially, all
> we need in balance_dirty_pages is:
>
>       if (I'm not fuse daemon)
>           nr_dirty += global_page_state(NR_WRITEBACK_TEMP);
>
> The way how to identify fuse daemon was not thoroughly scrutinized
> during LSF/MM. Firstly, I thought it would be enough to set a
> per-process flag handling fuse device open. But now I understand that
> fuse daemon may be quite a complicated multi-threaded multi-process
> construction. I'm going to add new FUSE_NOTIFY to allow fuse daemon
> decide when it works on behalf of draining writeout-s. Having in mind
> that fuse-lib is multi-threaded, I'm also going to inherit the flag on
> copy_process(). Does it make sense for you?
>
> Also, another patch will put this ad-hoc FUSE_NOTIFY under fusermount
> control. This will prevent malicious unprivileged fuse mounts from
> setting the flag for malicious purposes.

And returning back to the last Miklos' mail...

>
> What about something like the following untested patch?
>
> The idea is that fuse filesystems should not go over the bdi limit even if the
> global limit hasn't been reached.

This might work, but kicking flusher every time someone write to fuse 
mount and dives into balance_dirty_pages looks fishy. However, setting 
ad-hoc inode flag for files on fuse makes much more sense than my 
approach of identifying fuse daemons (a feeble hope that userspace 
daemons would notify in-kernel fuse saying "I'm fuse daemon, please 
disregard NR_WRITEBACK_TEMP for me"). Let's combine our suggestions: 
mark fuse inodes with AS_FUSE_WRITEBACK flag and convert what you 
strongly dislike above to:

if (test_bit(AS_FUSE_WRITEBACK, &mapping->flags))
nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

Thanks,
Maxim

>
> Thanks,
> Miklos
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 137185c..195ee45 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>   		inode->i_flags |= S_NOATIME|S_NOCMTIME;
>   		inode->i_generation = generation;
>   		inode->i_data.backing_dev_info = &fc->bdi;
> +		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
>   		fuse_init_inode(inode, attr);
>   		unlock_new_inode(inode);
>   	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 0e38e13..97f6a0c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
>   	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>   	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>   	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> +	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
>   };
>   
>   static inline void mapping_set_error(struct address_space *mapping, int error)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..91a9e6e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1226,6 +1226,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   	unsigned long dirty_ratelimit;
>   	unsigned long pos_ratio;
>   	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	int strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
>   	unsigned long start_time = jiffies;
>   
>   	for (;;) {
> @@ -1250,7 +1251,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   		 */
>   		freerun = dirty_freerun_ceiling(dirty_thresh,
>   						background_thresh);
> -		if (nr_dirty <= freerun) {
> +		if (nr_dirty <= freerun && !strictlimit) {
>   			current->dirty_paused_when = now;
>   			current->nr_dirtied = 0;
>   			current->nr_dirtied_pause =
> @@ -1297,7 +1298,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   		}
>   
>   		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
> -				  (nr_dirty > dirty_thresh);
> +				  ((nr_dirty > dirty_thresh) || strictlimit);
>   		if (dirty_exceeded && !bdi->dirty_exceeded)
>   			bdi->dirty_exceeded = 1;
>   
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-26  8:32           ` Maxim V. Patlasov
@ 2013-04-26 14:02             ` Miklos Szeredi
  2013-04-26 17:44               ` Maxim V. Patlasov
  0 siblings, 1 reply; 28+ messages in thread
From: Miklos Szeredi @ 2013-04-26 14:02 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, mgorman, riel, hughd,
	gthelen, linux-mm

On Fri, Apr 26, 2013 at 12:32:24PM +0400, Maxim V. Patlasov wrote:

> > The idea is that fuse filesystems should not go over the bdi limit even if
> > the global limit hasn't been reached.
> 
> This might work, but kicking flusher every time someone write to
> fuse mount and dives into balance_dirty_pages looks fishy.

Yeah.  Fixed patch attached.

> Let's combine
> our suggestions: mark fuse inodes with AS_FUSE_WRITEBACK flag and
> convert what you strongly dislike above to:
> 
> if (test_bit(AS_FUSE_WRITEBACK, &mapping->flags))
> nr_dirty += global_page_state(NR_WRITEBACK_TEMP);

I don't think this is right.  The fuse daemon could itself be writing to another
fuse filesystem, in which case blocking because of NR_WRITEBACK_TEMP being high
isn't a smart strategy.

Furthermore it isn't enough.  Becuase the root problem, I think, is that we
allow fuse filesystems to grow a large number of dirty pages before throttling.
This was never intended and it may actually have worked properly at a point in
time but broke by some change to the dirty throttling algorithm.

Thanks,
Miklos


diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 137185c..195ee45 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
+		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0e38e13..97f6a0c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index efe6814..b6db421 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1226,6 +1226,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	int strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
 	unsigned long start_time = jiffies;
 
 	for (;;) {
@@ -1250,7 +1251,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun) {
+		if (nr_dirty <= freerun && !strictlimit) {
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
 			current->nr_dirtied_pause =
@@ -1258,7 +1259,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
+		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
 			bdi_start_background_writeback(bdi);
 
 		/*
@@ -1296,8 +1297,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (unlikely(!writeback_in_progress(bdi)) &&
+		    bdi_dirty > bdi_thresh / 2)
+			bdi_start_background_writeback(bdi);
+
 		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
-				  (nr_dirty > dirty_thresh);
+				  ((nr_dirty > dirty_thresh) || strictlimit);
 		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-26 14:02             ` Miklos Szeredi
@ 2013-04-26 17:44               ` Maxim V. Patlasov
  2013-05-07 11:39                 ` Miklos Szeredi
  0 siblings, 1 reply; 28+ messages in thread
From: Maxim V. Patlasov @ 2013-04-26 17:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, mgorman, riel, hughd,
	gthelen, linux-mm

Miklos, MM folks,

04/26/2013 06:02 PM, Miklos Szeredi пишет:
> On Fri, Apr 26, 2013 at 12:32:24PM +0400, Maxim V. Patlasov wrote:
>
>>> The idea is that fuse filesystems should not go over the bdi limit even if
>>> the global limit hasn't been reached.
>> This might work, but kicking flusher every time someone write to
>> fuse mount and dives into balance_dirty_pages looks fishy.
> Yeah.  Fixed patch attached.

The patch didn't work for me. I'll investigate what's wrong and get back 
to you later.

>
>> Let's combine
>> our suggestions: mark fuse inodes with AS_FUSE_WRITEBACK flag and
>> convert what you strongly dislike above to:
>>
>> if (test_bit(AS_FUSE_WRITEBACK, &mapping->flags))
>> nr_dirty += global_page_state(NR_WRITEBACK_TEMP);
> I don't think this is right.  The fuse daemon could itself be writing to another
> fuse filesystem, in which case blocking because of NR_WRITEBACK_TEMP being high
> isn't a smart strategy.

Please don't say 'blocking'. Per-bdi checks will decide whether to block 
or not. In the case you set forth, judging on per-bdi checks would be 
completely fine for upper fuse: it may and should block for a while if 
lower fuse doesn't catch up.

>
> Furthermore it isn't enough.  Becuase the root problem, I think, is that we
> allow fuse filesystems to grow a large number of dirty pages before throttling.
> This was never intended and it may actually have worked properly at a point in
> time but broke by some change to the dirty throttling algorithm.

Could someone from mm list step in and comment on this point? Which 
approach is better to follow: account NR_WRITEBACK_TEMP in 
balance_dirty_pages accurately (as we discussed in LSF/MM) or re-work 
balance_dirty_pages in direction suggested by Miklos (fuse should never 
go over the bdi limit even if the global limit hasn't been reached)?

I'm for accounting NR_WRITEBACK_TEMP because balance_dirty_pages is 
already overcomplicated (imho) and adding new clauses for FUSE makes me 
sick.

Thanks,
Maxim

>
> Thanks,
> Miklos
>
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 137185c..195ee45 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -291,6 +291,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>   		inode->i_flags |= S_NOATIME|S_NOCMTIME;
>   		inode->i_generation = generation;
>   		inode->i_data.backing_dev_info = &fc->bdi;
> +		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
>   		fuse_init_inode(inode, attr);
>   		unlock_new_inode(inode);
>   	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 0e38e13..97f6a0c 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
>   	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>   	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>   	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> +	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
>   };
>   
>   static inline void mapping_set_error(struct address_space *mapping, int error)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index efe6814..b6db421 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1226,6 +1226,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   	unsigned long dirty_ratelimit;
>   	unsigned long pos_ratio;
>   	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	int strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
>   	unsigned long start_time = jiffies;
>   
>   	for (;;) {
> @@ -1250,7 +1251,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   		 */
>   		freerun = dirty_freerun_ceiling(dirty_thresh,
>   						background_thresh);
> -		if (nr_dirty <= freerun) {
> +		if (nr_dirty <= freerun && !strictlimit) {
>   			current->dirty_paused_when = now;
>   			current->nr_dirtied = 0;
>   			current->nr_dirtied_pause =
> @@ -1258,7 +1259,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>   			break;
>   		}
>   
> -		if (unlikely(!writeback_in_progress(bdi)))
> +		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
>   			bdi_start_background_writeback(bdi);
>   
>   		/*
> @@ -1296,8 +1297,12 @@ static void balance_dirty_pages(struct address_space *mapping,
>   				    bdi_stat(bdi, BDI_WRITEBACK);
>   		}
>   
> +		if (unlikely(!writeback_in_progress(bdi)) &&
> +		    bdi_dirty > bdi_thresh / 2)
> +			bdi_start_background_writeback(bdi);
> +
>   		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
> -				  (nr_dirty > dirty_thresh);
> +				  ((nr_dirty > dirty_thresh) || strictlimit);
>   		if (dirty_exceeded && !bdi->dirty_exceeded)
>   			bdi->dirty_exceeded = 1;
>   
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2013-04-26 17:44               ` Maxim V. Patlasov
@ 2013-05-07 11:39                 ` Miklos Szeredi
  0 siblings, 0 replies; 28+ messages in thread
From: Miklos Szeredi @ 2013-05-07 11:39 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel, Andrew Morton, fengguang.wu, Mel Gorman, riel, hughd,
	gthelen, linux-mm

On Fri, Apr 26, 2013 at 7:44 PM, Maxim V. Patlasov
<mpatlasov@parallels.com> wrote:
> I'm for accounting NR_WRITEBACK_TEMP because balance_dirty_pages is already
> overcomplicated (imho) and adding new clauses for FUSE makes me sick.

Agreed.

But instead of further complexifying balance_dirty_pages() fuse
specific throttling can be done in fuse_page_mkwrite(), I think.

And at that point NR_WRITEBACK_TEMP really becomes irrelevant to the
dirty balancing logic.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3
  2013-04-25 10:35   ` Miklos Szeredi
@ 2013-06-14 14:03     ` Maxim Patlasov
  0 siblings, 0 replies; 28+ messages in thread
From: Maxim Patlasov @ 2013-06-14 14:03 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, Pavel Emelianov, fuse-devel,
	Kernel Mailing List, James Bottomley, Al Viro, Linux-Fsdevel,
	devel

Hi Miklos,

04/25/2013 02:35 PM, Miklos Szeredi пишет:
> On Mon, Apr 1, 2013 at 12:42 PM, Maxim V. Patlasov
> <MPatlasov@parallels.com> wrote:
>> The .writepages one is required to make each writeback request carry more than
>> one page on it.
> I'd split this into two parts:
>
> 1) implement ->writepages() and enable it unconditionally for mmaped
> writeback (why is it not enabled by this patch?)
>
> 2) implement ->write_begin() and ->write_end() and conditionally
> enable cached writeback
>
> More comments inline.

Thanks a lot for careful review. I agree with most of your comments and 
will address them in the next version of patchset. The only point I 
disagree is the following:

>> +
>> +       spin_lock(&fc->lock);
>> +       list_add(&req->writepages_entry, &fi->writepages);
>> +       spin_unlock(&fc->lock);
>> +
>> +       for (i = 0; i < req->num_pages; i++) {
>> +               struct page *page = req->pages[i];
>> +               struct page *tmp_page;
>> +
>> +               tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
>> +               if (tmp_page) {
>> +                       copy_highpage(tmp_page, page);
>> +                       inc_bdi_stat(bdi, BDI_WRITEBACK);
>> +                       inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> Is there a reason why we do the page allocation/copying here instead
> of at fill time?  I'd guess it would be simpler and more logical
> there.

There is a problem to have in mind: we can't call 
end_page_writeback(page) before update of fuse writeback state 
(fi->writepages). Otherwise a nasty race would be possible when an 
activity for that particular page offset intervenes in the middle of 
writeback leading to multiple in-flight fuse requests for given page offset.

What you suggested is doable but would require additional space to stash 
pointers to original pages because we cannot release them immediately 
after copying (due to the problem described above). The size of the 
space is not known in advance because we don't know how many pages 
write_cache_pages() will process. The size is currently limited by 
sizeof(struct page *) * FUSE_MAX_PAGES_PER_REQ, but may become variable 
in future (a lot of people complain about 128K limit of fuse request). 
Adding dynamically growing array of pages would make the code neither 
simpler nor logical.

The approach I implemented utilizes the fact that 
fuse_page_is_writeback() and friends require only 
req->misc.write.in.offset and req->num_pages to be set correctly. Actual 
pointers in req->pages[] doesn't matter. Thus, as soon as the two 
parameters are known, I add the request to fi->writepages (blocking 
other operations on given page offset) and perform "in place" 
allocation/copying avoiding need for extra space to stash page pointers.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2013-06-14 14:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-01 10:40 [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
2013-04-01 10:40 ` [PATCH 01/14] fuse: Linking file to inode helper Maxim V. Patlasov
2013-04-01 10:40 ` [PATCH 02/14] fuse: Getting file for writeback helper Maxim V. Patlasov
2013-04-01 10:41 ` [PATCH 03/14] fuse: Prepare to handle short reads Maxim V. Patlasov
2013-04-01 10:41 ` [PATCH 04/14] fuse: Prepare to handle multiple pages in writeback Maxim V. Patlasov
2013-04-25 10:22   ` Miklos Szeredi
2013-04-01 10:41 ` [PATCH 05/14] fuse: Connection bit for enabling writeback Maxim V. Patlasov
2013-04-01 10:41 ` [PATCH 06/14] fuse: Trust kernel i_size only - v3 Maxim V. Patlasov
2013-04-01 10:41 ` [PATCH 07/14] fuse: Trust kernel i_mtime only Maxim V. Patlasov
2013-04-01 10:41 ` [PATCH 08/14] fuse: Flush files on wb close Maxim V. Patlasov
2013-04-01 10:42 ` [PATCH 09/14] fuse: Implement writepages and write_begin/write_end callbacks - v3 Maxim V. Patlasov
2013-04-25 10:35   ` Miklos Szeredi
2013-06-14 14:03     ` Maxim Patlasov
2013-04-01 10:42 ` [PATCH 10/14] fuse: fuse_writepage_locked() should wait on writeback Maxim V. Patlasov
2013-04-01 10:42 ` [PATCH 11/14] fuse: fuse_flush() " Maxim V. Patlasov
2013-04-01 10:42 ` [PATCH 12/14] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim V. Patlasov
2013-04-01 10:42 ` [PATCH 13/14] fuse: Turn writeback cache on Maxim V. Patlasov
2013-04-01 10:42 ` [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Maxim V. Patlasov
2013-04-25 14:29   ` [fuse-devel] " Maxim V. Patlasov
2013-04-25 15:49     ` Miklos Szeredi
2013-04-25 16:16       ` Maxim V. Patlasov
2013-04-25 20:43         ` Miklos Szeredi
2013-04-26  8:32           ` Maxim V. Patlasov
2013-04-26 14:02             ` Miklos Szeredi
2013-04-26 17:44               ` Maxim V. Patlasov
2013-05-07 11:39                 ` Miklos Szeredi
2013-04-11 11:18 ` [fuse-devel] [PATCH v4 00/14] fuse: An attempt to implement a write-back cache policy Maxim V. Patlasov
2013-04-11 14:36   ` Miklos Szeredi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).