linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy
@ 2013-06-29 17:41 Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 01/16] fuse: Linking file to inode helper Maxim Patlasov
                   ` (14 more replies)
  0 siblings, 15 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:41 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

Hi,

This is the fifth iteration of Pavel Emelyanov's patch-set implementing
write-back policy for FUSE page cache. Initial patch-set description was
the following:

One of the problems with the existing FUSE implementation is that it uses a
write-through cache policy, which results in performance problems on certain
workloads. E.g., when copying a big file onto a FUSE mount, cp pushes every
128k to userspace synchronously. This becomes a problem when the userspace
back-end uses the network for storing the data.

A good solution to this is switching the FUSE page cache to a write-back
policy. With this, file data is pushed to userspace in big chunks (depending
on the dirty memory limits, but much more than 128k), which lets FUSE daemons
handle size updates in a more efficient manner.
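The difference in userspace round trips can be sketched with a bit of
illustrative C (the 4 MiB write-back flush size below is an assumption made
for the example only; the real flush size depends on the dirty memory limits):

```c
#include <assert.h>

/* Userspace round trips needed to push file_size bytes to the FUSE
 * daemon when each request carries at most chunk bytes. */
unsigned long requests_needed(unsigned long file_size, unsigned long chunk)
{
	return (file_size + chunk - 1) / chunk;
}
```

Under this model, a 1 GiB copy costs 8192 synchronous 128k write-through
requests, while flushing in (assumed) 4 MiB chunks cuts that to 256.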

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?). When writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages
  decision. This protects us from having too many dirty pages on FUSE
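The third point can be modeled in isolation (a toy sketch with an assumed
single threshold; the real balance_dirty_pages() logic is considerably more
involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the throttling decision with NR_WRITEBACK_TEMP accounted
 * for: pages copied into FUSE temporary writeback buffers still occupy
 * memory, so they must count against the dirty threshold; otherwise a
 * slow daemon would let dirty + temp pages grow without bound. */
bool should_throttle(unsigned long nr_dirty,
                     unsigned long nr_writeback_temp,
                     unsigned long dirty_limit)
{
	return nr_dirty + nr_writeback_temp > dirty_limit;
}
```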

The provided patchset survives the fsx test. Performance measurements are not
all finished yet, but the mentioned copying of a huge file becomes noticeably
faster even on machines with little RAM and doesn't leave the system stuck
(the dirty pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage
implementation, which is heavily oriented toward storing big blobs of data
with extremely rare metadata updates (virtual machines' and containers' disk
images). With the existing cache policy, a typical usage scenario -- copying a
big VM disk into a cloud -- takes far too long, much longer than if it were
simply scp-ed over the same network. The write-back policy (as I mentioned)
noticeably improves this scenario. Kirill (in Cc) can share more details about
the performance and the storage concepts if required.

Changed in v2:
 - numerous bugfixes:
   - fuse_write_begin and fuse_writepages_fill and fuse_writepage_locked must wait
     on page writeback because page writeback can extend beyond the lifetime of
     the page-cache page
   - fuse_send_writepages can end_page_writeback on original page only after adding
     request to fi->writepages list; otherwise another writeback may happen inside
     the gap between end_page_writeback and adding to the list
   - fuse_direct_io must wait on page writeback; otherwise data corruption is possible
     due to reordering requests
   - fuse_flush must flush dirty memory and wait for all writeback on given inode
     before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
   - fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
     otherwise a race with a writer extending i_size is possible
   - fix handling errors in fuse_writepages and fuse_send_writepages
 - handle i_mtime intelligently if writeback cache is on (see patch #7 (update
   i_mtime on buffered writes) for details)
 - put enabling writeback cache under fusermount control; (see mount option
   'allow_wbcache' introduced by patch #13 (turn writeback cache on))
 - rebased on v3.7-rc5

Changed in v3:
 - rebased on for-next branch of the fuse tree (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
 - rebased on for-next branch of the fuse tree (634734b63ac39e137a1c623ba74f3e062b6577db)
 - fixed fuse_fillattr() for the non-writeback_cache case
 - added comments explaining why we cannot trust size from server
 - rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
 - simplified patch titled Flush-files-on-wb-close
 - eliminated code duplication from fuse_readpage() and fuse_prepare_write()
 - added comment about "disk full" errors to fuse_write_begin()

Changed in v5:
 - rebased on for-next branch of the fuse tree (e5c5f05dca0cf90f0f3bb1aea85dcf658baff185)
   with "hold i_mutex in fuse_file_fallocate() - v2" patch applied manually
 - updated patch descriptions according to Miklos' demand (using From: for
   patches initially developed by someone else)
 - moved ->writepages() to a separate patch and enabled it unconditionally
   for mmaped writeback
 - moved restructuring fuse_readpage to a separate patch
 - avoided use of union for fuse_fill_data
 - grabbed a ref to the request in fuse_send_writepages

Thanks,
Maxim

---

Maxim Patlasov (10):
      fuse: Connection bit for enabling writeback
      fuse: Trust kernel i_mtime only
      fuse: Flush files on wb close
      fuse: restructure fuse_readpage()
      fuse: Implement write_begin/write_end callbacks
      fuse: fuse_writepage_locked() should wait on writeback
      fuse: fuse_flush() should wait on writeback
      fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
      fuse: Turn writeback cache on
      mm: strictlimit feature

Pavel Emelyanov (6):
      fuse: Linking file to inode helper
      fuse: Getting file for writeback helper
      fuse: Prepare to handle short reads
      fuse: Prepare to handle multiple pages in writeback
      fuse: Trust kernel i_size only - v4
      fuse: Implement writepages callback


 fs/fs-writeback.c           |    2 
 fs/fuse/cuse.c              |    5 
 fs/fuse/dir.c               |  127 +++++++++-
 fs/fuse/file.c              |  572 ++++++++++++++++++++++++++++++++++++++-----
 fs/fuse/fuse_i.h            |   26 ++
 fs/fuse/inode.c             |   38 ++-
 include/linux/backing-dev.h |    5 
 include/linux/pagemap.h     |    1 
 include/linux/writeback.h   |    3 
 include/uapi/linux/fuse.h   |    2 
 mm/backing-dev.c            |    3 
 mm/page-writeback.c         |  131 ++++++++--
 12 files changed, 794 insertions(+), 121 deletions(-)

-- 
Signature

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 01/16] fuse: Linking file to inode helper
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
@ 2013-06-29 17:42 ` Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 02/16] fuse: Getting file for writeback helper Maxim Patlasov
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:42 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

When writeback is ON, every writable file should be on the per-inode write
list, not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 35f2810..cc858ee 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -171,6 +171,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1615,20 +1631,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file_inode(file);
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 02/16] fuse: Getting file for writeback helper
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 01/16] fuse: Linking file to inode helper Maxim Patlasov
@ 2013-06-29 17:42 ` Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:42 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

There will be a .writepages callback implementation which will need to
get a fuse_file out of a fuse_inode, so make a helper for this.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cc858ee..6b6d307 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1505,6 +1505,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
 	fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+					 struct fuse_inode *fi)
+{
+	struct fuse_file *ff;
+
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+	fuse_file_get(ff);
+	spin_unlock(&fc->lock);
+
+	return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1512,7 +1526,6 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_req *req;
-	struct fuse_file *ff;
 	struct page *tmp_page;
 
 	set_page_writeback(page);
@@ -1526,13 +1539,8 @@ static int fuse_writepage_locked(struct page *page)
 	if (!tmp_page)
 		goto err_free;
 
-	spin_lock(&fc->lock);
-	BUG_ON(list_empty(&fi->write_files));
-	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-	req->ff = fuse_file_get(ff);
-	spin_unlock(&fc->lock);
-
-	fuse_write_fill(req, ff, page_offset(page), 0);
+	req->ff = fuse_write_file(fc, fi);
+	fuse_write_fill(req, req->ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 03/16] fuse: Prepare to handle short reads
       [not found] ` <20130629172211.20175.70154.stgit-vWG5eQQidJHciZdyczg/7Q@public.gmane.org>
@ 2013-06-29 17:42   ` Maxim Patlasov
  2013-06-29 17:45   ` [PATCH 09/16] fuse: restructure fuse_readpage() Maxim Patlasov
  1 sibling, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:42 UTC (permalink / raw)
  To: miklos-sUDqSbJrdHQHWmgEVkV9KA
  Cc: riel-H+wXaHxf7aLQT0dZR+AlfA, dev-bzQdu9zFT3WakBO8gow8eQ,
	xemul-bzQdu9zFT3WakBO8gow8eQ,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jbottomley-bzQdu9zFT3WakBO8gow8eQ,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	devel-GEFAQzZX7r8dnm+yROfE0A, mgorman-l3A5Bk7waGM

From: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

A helper which gets called when a read reports fewer bytes than were
requested. See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Maxim Patlasov <MPatlasov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/fuse/file.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6b6d307..ea70814 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -653,6 +653,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+			    u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct fuse_io_priv io = { .async = 0, .file = file };
@@ -690,18 +699,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->page_descs[0].length = count;
 	num_read = fuse_send_read(req, &io, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -724,13 +733,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		/*
 		 * Short read means EOF. If file size is larger, truncate it
 		 */
-		if (!req->out.h.error && num_read < count) {
-			loff_t pos;
+		if (!req->out.h.error && num_read < count)
+			fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-			pos = page_offset(req->pages[0]) + num_read;
-			fuse_read_update_size(inode, pos,
-					      req->misc.read.attr_ver);
-		}
 		fuse_invalidate_attr(inode); /* atime changed */
 	}
 



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 01/16] fuse: Linking file to inode helper Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 02/16] fuse: Getting file for writeback helper Maxim Patlasov
@ 2013-06-29 17:42 ` Maxim Patlasov
  2013-06-29 17:42 ` [PATCH 05/16] fuse: Connection bit for enabling writeback Maxim Patlasov
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:42 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.
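The widened containment check this requires can be modeled on its own
(an illustrative userspace sketch, not the kernel code itself):

```c
#include <assert.h>
#include <stdbool.h>

/* With multi-page requests, a page index is under writeback if it falls
 * anywhere within [curr_index, curr_index + num_pages), not only when
 * it equals the request's first page index. */
bool page_in_writeback_request(unsigned long curr_index,
                               unsigned num_pages,
                               unsigned long index)
{
	return curr_index <= index && index < curr_index + num_pages;
}
```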

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ea70814..90fb235 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -350,7 +350,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
 		BUG_ON(req->inode != inode);
 		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-		if (curr_index == index) {
+		if (curr_index <= index &&
+		    index < curr_index + req->num_pages) {
 			found = true;
 			break;
 		}
@@ -1425,7 +1426,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-	__free_page(req->pages[0]);
+	int i;
+
+	for (i = 0; i < req->num_pages; i++)
+		__free_page(req->pages[i]);
 	fuse_file_put(req->ff, false);
 }
 
@@ -1434,11 +1438,14 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	int i;
 
 	list_del(&req->writepages_entry);
-	dec_bdi_stat(bdi, BDI_WRITEBACK);
-	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
-	bdi_writeout_inc(bdi);
+	for (i = 0; i < req->num_pages; i++) {
+		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+		bdi_writeout_inc(bdi);
+	}
 	wake_up(&fi->page_waitq);
 }
 
@@ -1450,14 +1457,15 @@ __acquires(fc->lock)
 	struct fuse_inode *fi = get_fuse_inode(req->inode);
 	loff_t size = i_size_read(req->inode);
 	struct fuse_write_in *inarg = &req->misc.write.in;
+	__u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
 	if (!fc->connected)
 		goto out_free;
 
-	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-		inarg->size = PAGE_CACHE_SIZE;
+	if (inarg->offset + data_size <= size) {
+		inarg->size = data_size;
 	} else if (inarg->offset < size) {
-		inarg->size = size & (PAGE_CACHE_SIZE - 1);
+		inarg->size = size - inarg->offset;
 	} else {
 		/* Got truncated off completely */
 		goto out_free;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 05/16] fuse: Connection bit for enabling writeback
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (2 preceding siblings ...)
  2013-06-29 17:42 ` [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
@ 2013-06-29 17:42 ` Maxim Patlasov
  2013-06-29 17:44 ` [PATCH 06/16] fuse: Trust kernel i_size only - v4 Maxim Patlasov
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:42 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/fuse_i.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fde7249..a0062a0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -476,6 +476,9 @@ struct fuse_conn {
 	/** Set if bdi is valid */
 	unsigned bdi_initialized:1;
 
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 06/16] fuse: Trust kernel i_size only - v4
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (3 preceding siblings ...)
  2013-06-29 17:42 ` [PATCH 05/16] fuse: Connection bit for enabling writeback Maxim Patlasov
@ 2013-06-29 17:44 ` Maxim Patlasov
  2013-06-29 17:44 ` [PATCH 07/16] fuse: Trust kernel i_mtime only Maxim Patlasov
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:44 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

Make fuse assume that, when writeback is on, the inode's i_size is always
up-to-date, and do not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper:
when a short read occurs, the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date, so it must be updated if the FUSE_FALLOCATE request succeeded.
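The hole filling amounts to zeroing every cached byte beyond what the server
actually returned; the index/offset arithmetic can be sketched as a userspace
model (assuming 4096-byte pages; names here are illustrative, not the kernel
helpers):

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL

/* Userspace model of the short-read hole filling: the server returned
 * only num_read bytes for a read covering num_pages pages, so zero the
 * remainder -- the tail of the partially filled page and every page
 * after it -- instead of shrinking the locally authoritative i_size. */
void zero_fill_tail(char *pages[], unsigned num_pages, unsigned long num_read)
{
	unsigned i;
	unsigned start_idx = num_read / MODEL_PAGE_SIZE;
	unsigned long off = num_read & (MODEL_PAGE_SIZE - 1);

	for (i = start_idx; i < num_pages; i++) {
		memset(pages[i] + off, 0, MODEL_PAGE_SIZE - off);
		off = 0; /* pages after the first affected one are fully zeroed */
	}
}
```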

Changed in v2:
 - improved comment in fuse_short_read()
 - fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
 - fixed fuse_fillattr() not to use local i_size if writeback-cache is off
 - added a comment explaining why we cannot trust attr.size from server

Changed in v4:
 - do not change fuse_file_fallocate() because what we need is already
   implemented in commits a200a2d and 14c1441

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c   |   13 +++++++++++--
 fs/fuse/file.c  |   26 ++++++++++++++++++++++++--
 fs/fuse/inode.c |   11 +++++++++--
 3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f3f783d..b755884 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -851,6 +851,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 			  struct kstat *stat)
 {
 	unsigned int blkbits;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see the comment in fuse_change_attributes() */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		attr->size = i_size_read(inode);
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1576,6 +1581,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
 	bool is_truncate = false;
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1645,7 +1651,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	/* see the comment in fuse_change_attributes() */
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1657,7 +1665,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+	    S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 90fb235..98bc0d0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -658,9 +658,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 			    u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache,
+		 * but have not reached the client fs yet. So, the hole is not
+		 * present there.
+		 */
+		int i;
+		int start_idx = num_read >> PAGE_CACHE_SHIFT;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
+
+		for (i = start_idx; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9a0cdde..121638d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	struct timespec old_mtime;
 
@@ -210,10 +211,16 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+	/*
+	 * In case of writeback_cache enabled, the cached writes beyond EOF
+	 * extend local i_size without keeping userspace server in sync. So,
+	 * attr->size coming from server can be stale. We cannot trust it.
+	 */
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode)) {
+	if (!is_wb && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
 		if (oldsize != attr->size) {


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 07/16] fuse: Trust kernel i_mtime only
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (4 preceding siblings ...)
  2013-06-29 17:44 ` [PATCH 06/16] fuse: Trust kernel i_size only - v4 Maxim Patlasov
@ 2013-06-29 17:44 ` Maxim Patlasov
  2013-07-11 16:14   ` [PATCH 07/16] fuse: Trust kernel i_mtime only -v2 Maxim Patlasov
  2013-06-29 17:45 ` [PATCH 08/16] fuse: Flush files on wb close Maxim Patlasov
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:44 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

The fuse inode flag FUSE_I_MTIME_UPDATED serves as an indication that the
local i_mtime should eventually be flushed to the server. Some operations
(direct write, truncate, fallocate and setattr) lead to updating mtime on the
server, so we can clear FUSE_I_MTIME_UPDATED when such an operation completes.
This is safe because these operations (as well as i_op->update_time and fsync)
are protected by i_mutex.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |  116 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/fuse/file.c   |   38 +++++++++++++++---
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 151 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b755884..cd67a0c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -854,8 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1565,6 +1568,89 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+
+	return err;
+}
+
+static inline void set_mtime_helper(struct inode *inode, struct timespec mtime)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	inode->i_mtime = mtime;
+	clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. But
+ * we can also clear FUSE_I_MTIME_UPDATED if FUSE_SETATTR has just changed
+ * mtime on server.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+	unsigned ivalid = iattr->ia_valid;
+
+	if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+		if (ivalid & ATTR_MTIME_SET)
+			set_mtime_helper(inode, iattr->ia_mtime);
+		else
+			set_mtime_helper(inode, current_fs_time(inode->i_sb));
+	} else if (ivalid & ATTR_SIZE)
+		set_mtime_helper(inode, current_fs_time(inode->i_sb));
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1621,17 +1707,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1648,6 +1724,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		fuse_set_mtime_local(attr, inode);
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1872,6 +1952,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1911,6 +2002,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 98bc0d0..f53697c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,10 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -458,6 +462,13 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_UPDATED,
+		     &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -948,16 +959,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2495,9 +2511,11 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	}
 
 	if (rw == WRITE) {
-		if (ret > 0)
+		if (ret > 0) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
 			fuse_write_update_size(inode, pos);
-		else if (ret < 0 && offset + count > i_size)
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		} else if (ret < 0 && offset + count > i_size)
 			fuse_do_truncate(file);
 	}
 
@@ -2553,8 +2571,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 
 	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE))
-		fuse_write_update_size(inode, offset + length);
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		bool changed = fuse_write_update_size(inode, offset + length);
+
+		if (changed && fc->writeback_cache) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+		}
+	}
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a0062a0..82abb82 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
 enum {
 	/** Advise readdirplus  */
 	FUSE_I_ADVISE_RDPLUS,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_UPDATED,
 };
 
 struct fuse_conn;
@@ -867,7 +869,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		    struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 121638d..591b46c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -249,6 +252,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -295,7 +300,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 08/16] fuse: Flush files on wb close
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (5 preceding siblings ...)
  2013-06-29 17:44 ` [PATCH 07/16] fuse: Trust kernel i_mtime only Maxim Patlasov
@ 2013-06-29 17:45 ` Maxim Patlasov
  2013-07-11 16:18   ` [PATCH 08/16] fuse: Flush files on wb close -v2 Maxim Patlasov
       [not found] ` <20130629172211.20175.70154.stgit-vWG5eQQidJHciZdyczg/7Q@public.gmane.org>
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:45 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

Any write request requires a file handle to report to userspace. Thus,
when we close a file (and free the fuse_file holding that handle) we have
to flush all outstanding dirty pages first.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted for in ff->count, which delays the actual close until all fuse
writeback is completed.

When the write-back cache is turned off, the flush is ensured by fuse_vma_close().

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f53697c..799bf46 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_UPDATED,
 		     &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 09/16] fuse: restructure fuse_readpage()
       [not found] ` <20130629172211.20175.70154.stgit-vWG5eQQidJHciZdyczg/7Q@public.gmane.org>
  2013-06-29 17:42   ` [PATCH 03/16] fuse: Prepare to handle short reads Maxim Patlasov
@ 2013-06-29 17:45   ` Maxim Patlasov
  1 sibling, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:45 UTC (permalink / raw)
  To: miklos-sUDqSbJrdHQHWmgEVkV9KA
  Cc: riel-H+wXaHxf7aLQT0dZR+AlfA, dev-bzQdu9zFT3WakBO8gow8eQ,
	xemul-bzQdu9zFT3WakBO8gow8eQ,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jbottomley-bzQdu9zFT3WakBO8gow8eQ,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	devel-GEFAQzZX7r8dnm+yROfE0A, mgorman-l3A5Bk7waGM

Move the code that fills and sends a read request into a separate function.
Future patches will use it for .write_begin: partial modification of a page
requires reading the page from storage, very similarly to what fuse_readpage
does.

Signed-off-by: Maxim Patlasov <MPatlasov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
---
 fs/fuse/file.c |   55 +++++++++++++++++++++++++++++++++++++------------------
 1 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 799bf46..e693e35 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -702,21 +702,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
 	}
 }
 
-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+			   int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
 {
 	struct fuse_io_priv io = { .async = 0, .file = file };
 	struct inode *inode = page->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_req *req;
 	size_t num_read;
-	loff_t pos = page_offset(page);
-	size_t count = PAGE_CACHE_SIZE;
-	u64 attr_ver;
-	int err;
-
-	err = -EIO;
-	if (is_bad_inode(inode))
-		goto out;
 
 	/*
 	 * Page writeback can extend beyond the lifetime of the
@@ -726,20 +719,45 @@ static int fuse_readpage(struct file *file, struct page *page)
 	fuse_wait_on_page_writeback(inode, page->index);
 
 	req = fuse_get_req(fc, 1);
-	err = PTR_ERR(req);
+	*err = PTR_ERR(req);
 	if (IS_ERR(req))
-		goto out;
+		return 0;
 
-	attr_ver = fuse_get_attr_version(fc);
+	if (attr_ver_p)
+		*attr_ver_p = fuse_get_attr_version(fc);
 
 	req->out.page_zeroing = 1;
 	req->out.argpages = 1;
 	req->num_pages = 1;
 	req->pages[0] = page;
 	req->page_descs[0].length = count;
-	num_read = fuse_send_read(req, &io, pos, count, NULL);
-	err = req->out.h.error;
 
+	num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
+	*err = req->out.h.error;
+
+	if (*err)
+		fuse_put_request(fc, req);
+	else
+		*req_pp = req;
+
+	return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	size_t num_read;
+	size_t count = PAGE_CACHE_SIZE;
+	u64 attr_ver = 0;
+	int err;
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
@@ -749,10 +767,11 @@ static int fuse_readpage(struct file *file, struct page *page)
 
 		SetPageUptodate(page);
 	}
-
-	fuse_put_request(fc, req);
-	fuse_invalidate_attr(inode); /* atime changed */
- out:
+	if (req) {
+		fuse_put_request(fc, req);
+		fuse_invalidate_attr(inode); /* atime changed */
+	}
+out:
 	unlock_page(page);
 	return err;
 }



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 10/16] fuse: Implement writepages callback
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (7 preceding siblings ...)
       [not found] ` <20130629172211.20175.70154.stgit-vWG5eQQidJHciZdyczg/7Q@public.gmane.org>
@ 2013-06-29 17:45 ` Maxim Patlasov
  2013-07-19 16:50   ` Miklos Szeredi
  2013-06-29 17:45 ` [PATCH 11/16] fuse: Implement write_begin/write_end callbacks Maxim Patlasov
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:45 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

The .writepages callback is required to make each writeback request carry more
than one page. The patch enables the optimized behaviour unconditionally,
i.e. mmap-ed writes benefit from it even if fc->writeback_cache=0.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |  170 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e693e35..ad3bb8a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1657,6 +1657,175 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
+struct fuse_fill_wb_data {
+	struct fuse_req *req;
+	struct fuse_file *ff;
+	struct inode *inode;
+	unsigned nr_pages;
+};
+
+static int fuse_send_writepages(struct fuse_fill_wb_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	if (!data->ff) {
+		for (i = 0; i < req->num_pages; i++)
+			end_page_writeback(req->pages[i]);
+		return -EIO;
+	}
+
+	req->inode = inode;
+	req->misc.write.in.offset = page_offset(req->pages[0]);
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	spin_unlock(&fc->lock);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok) {
+		for (i = 0; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			if (page) {
+				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+				__free_page(page);
+				req->pages[i] = NULL;
+			}
+		}
+
+		spin_lock(&fc->lock);
+		list_del(&req->writepages_entry);
+		wake_up(&fi->page_waitq);
+		spin_unlock(&fc->lock);
+		return -ENOMEM;
+	}
+
+	__fuse_get_request(req);
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	fuse_page_descs_length_init(req, 0, req->num_pages);
+	req->end = fuse_writepage_end;
+	req->background = 1;
+
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_wb_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
+	if (req->num_pages &&
+	    (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+	     (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+	     req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+		fuse_put_request(fc, req);
+
+		data->req = req =
+			fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+		if (!req) {
+			unlock_page(page);
+			return -ENOMEM;
+		}
+	}
+
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+
+	if (test_set_page_writeback(page))
+		BUG();
+
+	unlock_page(page);
+
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_wb_data data;
+	int err;
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.ff = NULL;
+	data.inode = inode;
+	data.req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+	err = -ENOMEM;
+	if (!data.req)
+		goto out_put;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req) {
+		if (!err && data.req->num_pages)
+			err = fuse_send_writepages(&data);
+
+		fuse_put_request(fc, data.req);
+	}
+out_put:
+	if (data.ff)
+		fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2663,6 +2832,7 @@ static const struct file_operations fuse_direct_io_file_operations = {
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 11/16] fuse: Implement write_begin/write_end callbacks
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (8 preceding siblings ...)
  2013-06-29 17:45 ` [PATCH 10/16] fuse: Implement writepages callback Maxim Patlasov
@ 2013-06-29 17:45 ` Maxim Patlasov
  2013-06-29 17:46 ` [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:45 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

The .write_begin and .write_end callbacks are required to use the generic
routines (generic_file_aio_write --> ... --> generic_perform_write) for
buffered writes.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad3bb8a..89b08d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1826,6 +1826,101 @@ out:
 	return err;
 }
 
+/*
+ * Determine the number of bytes of data the page contains
+ */
+static inline unsigned fuse_page_length(struct page *page)
+{
+	loff_t i_size = i_size_read(page_file_mapping(page)->host);
+
+	if (i_size > 0) {
+		pgoff_t page_index = page_file_index(page);
+		pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
+		if (page_index < end_index)
+			return PAGE_CACHE_SIZE;
+		if (page_index == end_index)
+			return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
+	}
+	return 0;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req = NULL;
+	unsigned num_read;
+	unsigned page_len;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+		return 0;
+
+	page_len = fuse_page_length(page);
+	if (!page_len) {
+		zero_user(page, 0, PAGE_CACHE_SIZE);
+		return 0;
+	}
+
+	num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL);
+	if (req)
+		fuse_put_request(fc, req);
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	} else if (num_read != PAGE_CACHE_SIZE) {
+		zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
+	}
+	return err;
+}
+
+/*
+ * It's worthy to make sure that space is reserved on disk for the write,
+ * but how to implement it without killing performance need more thinking.
+ */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2838,6 +2933,8 @@ static const struct address_space_operations fuse_file_aops  = {
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (9 preceding siblings ...)
  2013-06-29 17:45 ` [PATCH 11/16] fuse: Implement write_begin/write_end callbacks Maxim Patlasov
@ 2013-06-29 17:46 ` Maxim Patlasov
  2013-06-29 17:46 ` [PATCH 13/16] fuse: fuse_flush() " Maxim Patlasov
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:46 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

fuse_writepage_locked() should never submit new I/O for a given page->index
if another request for that index is already in progress. In most cases it
is safe to wait on the page writeback. But if the call came from the memory
shortage path (WB_SYNC_NONE), we should redirty the page rather than block
the caller.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 89b08d6..f48f796 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1595,7 +1595,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
 	return ff;
 }
 
-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+				 struct writeback_control *wbc)
 {
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = mapping->host;
@@ -1604,6 +1605,14 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_req *req;
 	struct page *tmp_page;
 
+	if (fuse_page_is_writeback(inode, page->index)) {
+		if (wbc->sync_mode != WB_SYNC_ALL) {
+			redirty_page_for_writepage(wbc, page);
+			return 0;
+		}
+		fuse_wait_on_page_writeback(inode, page->index);
+	}
+
 	set_page_writeback(page);
 
 	req = fuse_request_alloc_nofs(1);
@@ -1651,7 +1660,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int err;
 
-	err = fuse_writepage_locked(page);
+	err = fuse_writepage_locked(page, wbc);
 	unlock_page(page);
 
 	return err;
@@ -1926,7 +1935,10 @@ static int fuse_launder_page(struct page *page)
 	int err = 0;
 	if (clear_page_dirty_for_io(page)) {
 		struct inode *inode = page->mapping->host;
-		err = fuse_writepage_locked(page);
+		struct writeback_control wbc = {
+			.sync_mode = WB_SYNC_ALL,
+		};
+		err = fuse_writepage_locked(page, &wbc);
 		if (!err)
 			fuse_wait_on_page_writeback(inode, page->index);
 	}


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 13/16] fuse: fuse_flush() should wait on writeback
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (10 preceding siblings ...)
  2013-06-29 17:46 ` [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
@ 2013-06-29 17:46 ` Maxim Patlasov
  2013-06-29 17:46 ` [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:46 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

The aim of the .flush fop is to hint to the file system that flushing its
state, caches, or any other important data to reliable storage would be
desirable now. fuse_flush() passes this hint along by sending a FUSE_FLUSH
request to userspace. However, dirty pages and pages under writeback may not
be visible to userspace yet unless we ensure it explicitly.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f48f796..86ef60f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,7 @@
 #include <linux/falloc.h>
 
 static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 			  int opcode, struct fuse_open_out *outargp)
@@ -400,6 +401,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
 	if (fc->no_flush)
 		return 0;
 
+	err = filemap_write_and_wait(file->f_mapping);
+	if (err)
+		return err;
+
+	mutex_lock(&inode->i_mutex);
+	fuse_sync_writes(inode);
+	mutex_unlock(&inode->i_mutex);
+
 	req = fuse_get_req_nofail_nopages(fc, file);
 	memset(&inarg, 0, sizeof(inarg));
 	inarg.fh = ff->fh;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (11 preceding siblings ...)
  2013-06-29 17:46 ` [PATCH 13/16] fuse: fuse_flush() " Maxim Patlasov
@ 2013-06-29 17:46 ` Maxim Patlasov
  2013-06-29 17:47 ` [PATCH 15/16] fuse: Turn writeback cache on Maxim Patlasov
  2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:46 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the data that was in the file
before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, without waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback is
still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait on the
in-flight writeback to finish.

Changed in v2:
 - do not wait on writeback if fuse_direct_io() call came from
   CUSE (because it doesn't use fuse inodes)

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/cuse.c   |    5 +++--
 fs/fuse/file.c   |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/fuse/fuse_i.h |   13 ++++++++++++-
 3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index aef34b1..b57b5ee 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -95,7 +95,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
 	struct iovec iov = { .iov_base = buf, .iov_len = count };
 	struct fuse_io_priv io = { .async = 0, .file = file };
 
-	return fuse_direct_io(&io, &iov, 1, count, &pos, 0);
+	return fuse_direct_io(&io, &iov, 1, count, &pos, FUSE_DIO_CUSE);
 }
 
 static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -109,7 +109,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
 	 * No locking or generic_write_checks(), the server is
 	 * responsible for locking and sanity checks.
 	 */
-	return fuse_direct_io(&io, &iov, 1, count, &pos, 1);
+	return fuse_direct_io(&io, &iov, 1, count, &pos,
+			      FUSE_DIO_WRITE | FUSE_DIO_CUSE);
 }
 
 static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 86ef60f..ea45cf3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -342,6 +342,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
 	return (u64) v0 + ((u64) v1 << 32);
 }
 
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+				    pgoff_t idx_to)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_req *req;
+	bool found = false;
+
+	spin_lock(&fc->lock);
+	list_for_each_entry(req, &fi->writepages, writepages_entry) {
+		pgoff_t curr_index;
+
+		BUG_ON(req->inode != inode);
+		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+		if (!(idx_from >= curr_index + req->num_pages ||
+		      idx_to < curr_index)) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&fc->lock);
+
+	return found;
+}
+
 /*
  * Check if page is under writeback
  *
@@ -386,6 +411,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
 	return 0;
 }
 
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+				   size_t bytes)
+{
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	pgoff_t idx_from, idx_to;
+
+	idx_from = start >> PAGE_CACHE_SHIFT;
+	idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+	wait_event(fi->page_waitq,
+		   !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
 static int fuse_flush(struct file *file, fl_owner_t id)
 {
 	struct inode *inode = file_inode(file);
@@ -1360,8 +1398,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)
 
 ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write)
+		       int flags)
 {
+	int write = flags & FUSE_DIO_WRITE;
+	int cuse = flags & FUSE_DIO_CUSE;
 	struct file *file = io->file;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fc;
@@ -1390,6 +1430,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 			break;
 		}
 
+		if (!cuse)
+			fuse_wait_on_writeback(file->f_mapping->host, pos,
+					       nbytes);
+
 		if (write)
 			nres = fuse_send_write(req, io, pos, nbytes, owner);
 		else
@@ -1468,7 +1512,8 @@ static ssize_t __fuse_direct_write(struct fuse_io_priv *io,
 
 	res = generic_write_checks(file, ppos, &count, 0);
 	if (!res)
-		res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);
+		res = fuse_direct_io(io, iov, nr_segs, count, ppos,
+				     FUSE_DIO_WRITE);
 
 	fuse_invalidate_attr(inode);
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 82abb82..f969b22 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -859,9 +859,20 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,
 
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir);
+
+/**
+ * fuse_direct_io() flags
+ */
+
+/** If set, it is WRITE; otherwise - READ */
+#define FUSE_DIO_WRITE (1 << 0)
+
+/** CUSE passes fuse_direct_io() a file whose f_mapping->host is not from FUSE */
+#define FUSE_DIO_CUSE  (1 << 1)
+
 ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
 		       unsigned long nr_segs, size_t count, loff_t *ppos,
-		       int write);
+		       int flags);
 long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
 		   unsigned int flags);
 long fuse_ioctl_common(struct file *file, unsigned int cmd,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [PATCH 15/16] fuse: Turn writeback cache on
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (12 preceding siblings ...)
  2013-06-29 17:46 ` [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
@ 2013-06-29 17:47 ` Maxim Patlasov
  2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
  14 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:47 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Pavel Emelyanov <xemul@openvz.org>

Introduce a bit that the kernel and userspace exchange with each other at
the init stage, and turn writeback on if userspace wants it and the
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file to a per-inode write list and call
generic_file_aio_write to make use of the Linux page cache engine.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c            |    5 +++++
 fs/fuse/fuse_i.h          |    4 ++++
 fs/fuse/inode.c           |   13 +++++++++++++
 include/uapi/linux/fuse.h |    2 ++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ea45cf3..6d30db1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -208,6 +208,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		spin_unlock(&fc->lock);
 		fuse_invalidate_attr(inode);
 	}
+	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+		fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1225,6 +1227,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct iov_iter i;
 	loff_t endbyte = 0;
 
+	if (get_fuse_conn(inode)->writeback_cache)
+		return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
 	WARN_ON(iocb->ki_pos != pos);
 
 	ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f969b22..3494457 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
     doing the mount will be allowed to access the filesystem */
 #define FUSE_ALLOW_OTHER         (1 << 1)
 
+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
    module will enable support of the writeback cache */
+#define FUSE_ALLOW_WBCACHE       (1 << 2)
+
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 591b46c..4beb8e3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -459,6 +459,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_ALLOW_WBCACHE,
 	OPT_ERR
 };
 
@@ -471,6 +472,7 @@ static const match_table_t tokens = {
 	{OPT_ALLOW_OTHER,		"allow_other"},
 	{OPT_MAX_READ,			"max_read=%u"},
 	{OPT_BLKSIZE,			"blksize=%u"},
+	{OPT_ALLOW_WBCACHE,		"allow_wbcache"},
 	{OPT_ERR,			NULL}
 };
 
@@ -544,6 +546,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
 			d->blksize = value;
 			break;
 
+		case OPT_ALLOW_WBCACHE:
+			d->flags |= FUSE_ALLOW_WBCACHE;
+			break;
+
 		default:
 			return 0;
 		}
@@ -571,6 +577,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
 		seq_printf(m, ",max_read=%u", fc->max_read);
 	if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
 		seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		seq_puts(m, ",allow_wbcache");
 	return 0;
 }
 
@@ -888,6 +896,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 			}
 			if (arg->flags & FUSE_ASYNC_DIO)
 				fc->async_dio = 1;
+			if (arg->flags & FUSE_WRITEBACK_CACHE &&
+			    fc->flags & FUSE_ALLOW_WBCACHE)
+				fc->writeback_cache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_CACHE_SIZE;
 			fc->no_lock = 1;
@@ -916,6 +927,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 		FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
 		FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
 		FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO;
+	if (fc->flags & FUSE_ALLOW_WBCACHE)
+		arg->flags |= FUSE_WRITEBACK_CACHE;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 60bb2f9..5c98dd1 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -219,6 +219,7 @@ struct fuse_file_lock {
  * FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
  * FUSE_READDIRPLUS_AUTO: adaptive readdirplus
  * FUSE_ASYNC_DIO: asynchronous direct I/O submission
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -236,6 +237,7 @@ struct fuse_file_lock {
 #define FUSE_DO_READDIRPLUS	(1 << 13)
 #define FUSE_READDIRPLUS_AUTO	(1 << 14)
 #define FUSE_ASYNC_DIO		(1 << 15)
+#define FUSE_WRITEBACK_CACHE	(1 << 16)
 
 /**
  * CUSE INIT request/reply flags

--


* [PATCH 16/16] mm: strictlimit feature
  2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
                   ` (13 preceding siblings ...)
  2013-06-29 17:47 ` [PATCH 15/16] fuse: Turn writeback cache on Maxim Patlasov
@ 2013-06-29 17:48 ` Maxim Patlasov
  2013-07-01 21:16   ` Andrew Morton
  2013-07-02 17:44   ` [PATCH] mm: strictlimit feature -v2 Maxim Patlasov
  14 siblings, 2 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-06-29 17:48 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Miklos Szeredi <mszeredi@suse.cz>

The feature prevents mistrusted filesystems from growing a large number of
dirty pages before throttling. For such filesystems balance_dirty_pages
always checks bdi counters against bdi limits. I.e. even if the global
"nr_dirty" is under "freerun", it is not allowed to skip the bdi checks. The
only use case for now is fuse: it sets bdi max_ratio to 1% by default and
system administrators are supposed to expect that this limit won't be
exceeded.

The feature is on if the address space is marked by the AS_STRICTLIMIT flag.
A filesystem may set the flag when it initializes a new inode.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fs-writeback.c           |    2 -
 fs/fuse/file.c              |    3 +
 fs/fuse/inode.c             |    1 
 include/linux/backing-dev.h |    5 ++
 include/linux/pagemap.h     |    1 
 include/linux/writeback.h   |    3 +
 mm/backing-dev.c            |    3 +
 mm/page-writeback.c         |  131 +++++++++++++++++++++++++++++++++----------
 8 files changed, 116 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..8c75277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -762,7 +762,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time, false);
 }
 
 /*
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d30db1..99b5c97 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1697,6 +1697,7 @@ static int fuse_writepage_locked(struct page *page,
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITTEN_BACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
@@ -1766,6 +1767,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
 		if (tmp_page) {
 			copy_highpage(tmp_page, page);
 			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
 			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 		} else
 			all_ok = 0;
@@ -1781,6 +1783,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
 			struct page *page = req->pages[i];
 			if (page) {
 				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_bdi_stat(bdi, BDI_WRITTEN_BACK);
 				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
 				__free_page(page);
 				req->pages[i] = NULL;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4beb8e3..00a28af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
+		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..9c368af 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -33,6 +33,8 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_idle,		/* No pages under writeback at the moment of
+				 * last update of write bw */
 	BDI_unused,		/* Available bits start here */
 };
 
@@ -43,6 +45,7 @@ enum bdi_stat_item {
 	BDI_WRITEBACK,
 	BDI_DIRTIED,
 	BDI_WRITTEN,
+	BDI_WRITTEN_BACK,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -76,6 +79,8 @@ struct backing_dev_info {
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
 	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
+	unsigned long writeback_stamp;	/* pages sent to writeback at
+					 * bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..baac702 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a500..5144051 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,7 +159,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
-			    unsigned long start_time);
+			    unsigned long start_time,
+			    bool strictlimit);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..71413d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
+		   "BdiWrittenBack:     %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -107,6 +108,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN_BACK)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
@@ -455,6 +457,7 @@ int bdi_init(struct backing_dev_info *bdi)
 
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
+	bdi->writeback_stamp = 0;
 
 	bdi->balanced_dirty_ratelimit = INIT_BW;
 	bdi->dirty_ratelimit = INIT_BW;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..77449b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,6 +585,37 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 }
 
 /*
+ *                           setpoint - dirty 3
+ *        f(dirty) := 1.0 + (----------------)
+ *                           limit - setpoint
+ *
+ * it's a 3rd order polynomial that subjects to
+ *
+ * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
+ * (2) f(setpoint) = 1.0 => the balance point
+ * (3) f(limit)    = 0   => the hard limit
+ * (4) df/dx      <= 0	 => negative feedback control
+ * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+ *     => fast response on large errors; small oscillation near setpoint
+ */
+static inline long long pos_ratio_polynom(unsigned long setpoint,
+					  unsigned long dirty,
+					  unsigned long limit)
+{
+	long long pos_ratio;
+	long x;
+
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	return pos_ratio;
+}
+
+/*
  * Dirty position control.
  *
  * (o) global/bdi setpoints
@@ -664,7 +695,8 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 					unsigned long bg_thresh,
 					unsigned long dirty,
 					unsigned long bdi_thresh,
-					unsigned long bdi_dirty)
+					unsigned long bdi_dirty,
+					bool strictlimit)
 {
 	unsigned long write_bw = bdi->avg_write_bandwidth;
 	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
@@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 	if (unlikely(dirty >= limit))
 		return 0;
 
+	if (unlikely(strictlimit)) {
+		if (bdi_dirty < 8)
+			return 2 << RATELIMIT_CALC_SHIFT;
+
+		if (bdi_dirty >= bdi_thresh)
+			return 0;
+
+		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
+		bdi_setpoint /= 2;
+
+		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+			return 0;
+
+		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
+					      bdi_thresh);
+		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
+	}
+
 	/*
 	 * global setpoint
 	 *
-	 *                           setpoint - dirty 3
-	 *        f(dirty) := 1.0 + (----------------)
-	 *                           limit - setpoint
-	 *
-	 * it's a 3rd order polynomial that subjects to
-	 *
-	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
-	 * (2) f(setpoint) = 1.0 => the balance point
-	 * (3) f(limit)    = 0   => the hard limit
-	 * (4) df/dx      <= 0	 => negative feedback control
-	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
-	 *     => fast response on large errors; small oscillation near setpoint
+	 * See comment for pos_ratio_polynom().
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
-		    limit - setpoint + 1);
-	pos_ratio = x;
-	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
-	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
-	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
 
 	/*
 	 * We have computed basic pos_ratio above based on global situation. If
@@ -892,7 +926,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 				       unsigned long bdi_thresh,
 				       unsigned long bdi_dirty,
 				       unsigned long dirtied,
-				       unsigned long elapsed)
+				       unsigned long elapsed,
+				       bool strictlimit)
 {
 	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
 	unsigned long limit = hard_dirty_limit(thresh);
@@ -913,7 +948,7 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
 
 	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
-				       bdi_thresh, bdi_dirty);
+				       bdi_thresh, bdi_dirty, strictlimit);
 	/*
 	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
 	 */
@@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 	 * keep that period small to reduce time lags).
 	 */
 	step = 0;
+
+	if (unlikely(strictlimit)) {
+		dirty = bdi_dirty;
+		if (bdi_dirty < 8)
+			setpoint = bdi_dirty + 1;
+		else
+			setpoint = (bdi_thresh +
+				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
+	}
+
 	if (dirty < setpoint) {
 		x = min(bdi->balanced_dirty_ratelimit,
 			 min(balanced_dirty_ratelimit, task_ratelimit));
@@ -1034,12 +1079,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
-			    unsigned long start_time)
+			    unsigned long start_time,
+			    bool strictlimit)
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
 	unsigned long dirtied;
 	unsigned long written;
+	unsigned long writeback;
 
 	/*
 	 * rate-limit, only update once every 200ms.
@@ -1049,6 +1096,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 
 	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
@@ -1057,11 +1105,15 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
+	if (test_bit(BDI_idle, &bdi->state) &&
+	    bdi->writeback_stamp == writeback)
+		goto snapshot;
+
 	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
 		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
 					   bdi_thresh, bdi_dirty,
-					   dirtied, elapsed);
+					   dirtied, elapsed, strictlimit);
 	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
@@ -1069,6 +1121,12 @@ snapshot:
 	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
+
+	bdi->writeback_stamp = writeback;
+	if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
+		set_bit(BDI_idle, &bdi->state);
+	else
+		clear_bit(BDI_idle, &bdi->state);
 }
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
@@ -1077,13 +1135,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
-				 unsigned long start_time)
+				 unsigned long start_time,
+				 bool strictlimit)
 {
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
 	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
-			       bdi_thresh, bdi_dirty, start_time);
+			       bdi_thresh, bdi_dirty, start_time, strictlimit);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -1226,6 +1285,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
 	unsigned long start_time = jiffies;
 
 	for (;;) {
@@ -1250,7 +1310,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun) {
+		if (nr_dirty <= freerun  && !strictlimit) {
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
 			current->nr_dirtied_pause =
@@ -1258,7 +1318,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
+		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
 			bdi_start_background_writeback(bdi);
 
 		/*
@@ -1296,19 +1356,24 @@ static void balance_dirty_pages(struct address_space *mapping,
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (unlikely(!writeback_in_progress(bdi)) &&
+		    bdi_dirty > bdi_thresh / 4)
+			bdi_start_background_writeback(bdi);
+
 		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
-				  (nr_dirty > dirty_thresh);
+				 ((nr_dirty > dirty_thresh) || strictlimit);
 		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
-				     start_time);
+				     start_time, strictlimit);
 
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
-					       bdi_thresh, bdi_dirty);
+					       bdi_thresh, bdi_dirty,
+					       strictlimit);
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		max_pause = bdi_max_pause(bdi, bdi_dirty);
@@ -1362,6 +1427,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 pause:
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
 		trace_balance_dirty_pages(bdi,
 					  dirty_thresh,
 					  background_thresh,
@@ -2265,8 +2332,10 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+				__inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,



* Re: [PATCH 16/16] mm: strictlimit feature
  2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
@ 2013-07-01 21:16   ` Andrew Morton
  2013-07-02  8:33     ` Maxim Patlasov
  2013-07-02 17:44   ` [PATCH] mm: strictlimit feature -v2 Maxim Patlasov
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2013-07-01 21:16 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: miklos, riel, dev, xemul, fuse-devel, bfoster, linux-kernel,
	jbottomley, linux-mm, viro, linux-fsdevel, fengguang.wu, devel,
	mgorman

On Sat, 29 Jun 2013 21:48:54 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:

> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> The feature prevents mistrusted filesystems to grow a large number of dirty
> pages before throttling. For such filesystems balance_dirty_pages always
> check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> "freerun", it's not allowed to skip bdi checks. The only use case for now is
> fuse: it sets bdi max_ratio to 1% by default and system administrators are
> supposed to expect that this limit won't be exceeded.
> 
> The feature is on if address space is marked by AS_STRICTLIMIT flag.
> A filesystem may set the flag when it initializes a new inode.
> 

Fengguang, could you please review this patch?

I suggest you await the next version, which hopefully will be more
reviewable...

>
> ...
>
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -33,6 +33,8 @@ enum bdi_state {
>  	BDI_sync_congested,	/* The sync queue is getting full */
>  	BDI_registered,		/* bdi_register() was done */
>  	BDI_writeback_running,	/* Writeback is in progress */
> +	BDI_idle,		/* No pages under writeback at the moment of
> +				 * last update of write bw */

Why does BDI_idle exist?

>  	BDI_unused,		/* Available bits start here */
>  };
>  
> @@ -43,6 +45,7 @@ enum bdi_stat_item {
>  	BDI_WRITEBACK,
>  	BDI_DIRTIED,
>  	BDI_WRITTEN,
> +	BDI_WRITTEN_BACK,
>  	NR_BDI_STAT_ITEMS
>  };
>  
> @@ -76,6 +79,8 @@ struct backing_dev_info {
>  	unsigned long bw_time_stamp;	/* last time write bw is updated */
>  	unsigned long dirtied_stamp;
>  	unsigned long written_stamp;	/* pages written at bw_time_stamp */
> +	unsigned long writeback_stamp;	/* pages sent to writeback at
> +					 * bw_time_stamp */

Well this sucks.  Some of the "foo_stamp" fields are in units of time
(jiffies?  We aren't told) and some of the "foo_stamp" fields are in
units of number-of-pages.  It would be good to fix the naming here.

>  	unsigned long write_bandwidth;	/* the estimated write bandwidth */
>  	unsigned long avg_write_bandwidth; /* further smoothed write bw */
>  
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..baac702 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
>  	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>  	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>  	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> +	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */

Thing is, "strict dirty limit" isn't documented anywhere, so this
reference is left dangling.

>
> ...
>
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  		   "BackgroundThresh:   %10lu kB\n"
>  		   "BdiDirtied:         %10lu kB\n"
>  		   "BdiWritten:         %10lu kB\n"
> +		   "BdiWrittenBack:     %10lu kB\n"
>  		   "BdiWriteBandwidth:  %10lu kBps\n"
>  		   "b_dirty:            %10lu\n"
>  		   "b_io:               %10lu\n"

I can't imagine what the difference is between BdiWritten and
BdiWrittenBack.

I suggest you document this at the BDI_WRITTEN_BACK definition site in
enum bdi_stat_item.  BDI_WRITTEN (at least) will also need
documentation so people can understand the difference.

>
> ...
>
> @@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  	if (unlikely(dirty >= limit))
>  		return 0;
>  
> +	if (unlikely(strictlimit)) {
> +		if (bdi_dirty < 8)
> +			return 2 << RATELIMIT_CALC_SHIFT;
> +
> +		if (bdi_dirty >= bdi_thresh)
> +			return 0;
> +
> +		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
> +		bdi_setpoint /= 2;
> +
> +		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
> +			return 0;
> +
> +		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
> +					      bdi_thresh);
> +		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
> +	}

This would be a suitable site at which to document the strictlimit
feature.  What it is, how it works and most importantly, why it exists.

>
> ...
>
> @@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>  	 * keep that period small to reduce time lags).
>  	 */
>  	step = 0;
> +
> +	if (unlikely(strictlimit)) {
> +		dirty = bdi_dirty;
> +		if (bdi_dirty < 8)
> +			setpoint = bdi_dirty + 1;
> +		else
> +			setpoint = (bdi_thresh +
> +				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
> +	}

Explain this to the reader, please.

>
> ...
>



* Re: [PATCH 16/16] mm: strictlimit feature
  2013-07-01 21:16   ` Andrew Morton
@ 2013-07-02  8:33     ` Maxim Patlasov
  0 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-02  8:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: miklos, riel, dev, xemul, fuse-devel, bfoster, linux-kernel,
	jbottomley, linux-mm, viro, linux-fsdevel, fengguang.wu, devel,
	mgorman

Hi Andrew,

07/02/2013 01:16 AM, Andrew Morton wrote:
> On Sat, 29 Jun 2013 21:48:54 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:
>
>> From: Miklos Szeredi <mszeredi@suse.cz>
>>
>> The feature prevents mistrusted filesystems to grow a large number of dirty
>> pages before throttling. For such filesystems balance_dirty_pages always
>> check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
>> "freerun", it's not allowed to skip bdi checks. The only use case for now is
>> fuse: it sets bdi max_ratio to 1% by default and system administrators are
>> supposed to expect that this limit won't be exceeded.
>>
>> The feature is on if address space is marked by AS_STRICTLIMIT flag.
>> A filesystem may set the flag when it initializes a new inode.
>>
> Fengguang, could you please review this patch?
>
> I suggest you await the next version, which hopefully will be more
> reviewable...

Thanks a lot for the quick review, I'll update the patch according to your 
comments soon.

I'm answering the question about BDI_idle inline below, and I'll also add 
a comment about it where BDI_idle is actually used.

Thanks,
Maxim

>
>> ...
>>
>> --- a/include/linux/backing-dev.h
>> +++ b/include/linux/backing-dev.h
>> @@ -33,6 +33,8 @@ enum bdi_state {
>>   	BDI_sync_congested,	/* The sync queue is getting full */
>>   	BDI_registered,		/* bdi_register() was done */
>>   	BDI_writeback_running,	/* Writeback is in progress */
>> +	BDI_idle,		/* No pages under writeback at the moment of
>> +				 * last update of write bw */
> Why does BDI_idle exist?

BDI_idle, along with BDI_WRITTEN_BACK, exists to distinguish two cases:

1st: BDI_WRITTEN has not been incremented since we last looked at it 
because the backing dev is unresponsive, i.e. it had some pages under 
writeback but has not made any progress for some reason.

2nd: BDI_WRITTEN has not been incremented since we last looked at it 
because the backing dev had nothing to do, i.e. there are some dirty 
pages on the bdi, but they have not been passed to the backing dev yet. 
This is the case when bdi_dirty is under the bdi background threshold 
and the flusher refrains from flushing even if we woke it up explicitly 
via bdi_start_background_writeback().

We have to skip bdi_update_write_bandwidth() in the 2nd case because 
otherwise it would see written == 0 and mistakenly decrease 
write_bandwidth. The criterion for skipping is: BDI_idle is set (i.e. 
there were no pages under writeback when we last looked at the bdi) && 
the BDI_WRITTEN_BACK counter has not changed (i.e. no new pages have 
been sent to writeback since we last looked at the bdi).

Thanks,
Maxim

>
>>   	BDI_unused,		/* Available bits start here */
>>   };
>>   
>> @@ -43,6 +45,7 @@ enum bdi_stat_item {
>>   	BDI_WRITEBACK,
>>   	BDI_DIRTIED,
>>   	BDI_WRITTEN,
>> +	BDI_WRITTEN_BACK,
>>   	NR_BDI_STAT_ITEMS
>>   };
>>   
>> @@ -76,6 +79,8 @@ struct backing_dev_info {
>>   	unsigned long bw_time_stamp;	/* last time write bw is updated */
>>   	unsigned long dirtied_stamp;
>>   	unsigned long written_stamp;	/* pages written at bw_time_stamp */
>> +	unsigned long writeback_stamp;	/* pages sent to writeback at
>> +					 * bw_time_stamp */
> Well this sucks.  Some of the "foo_stamp" fields are in units of time
> (jiffies?  We aren't told) and some of the "foo_stamp" fields are in
> units of number-of-pages.  It would be good to fix the naming here.
>
>>   	unsigned long write_bandwidth;	/* the estimated write bandwidth */
>>   	unsigned long avg_write_bandwidth; /* further smoothed write bw */
>>   
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e3dea75..baac702 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -25,6 +25,7 @@ enum mapping_flags {
>>   	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
>>   	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
>>   	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
>> +	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
> Thing is, "strict dirty limit" isn't documented anywhere, so this
> reference is left dangling.
>
>> ...
>>
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>   		   "BackgroundThresh:   %10lu kB\n"
>>   		   "BdiDirtied:         %10lu kB\n"
>>   		   "BdiWritten:         %10lu kB\n"
>> +		   "BdiWrittenBack:     %10lu kB\n"
>>   		   "BdiWriteBandwidth:  %10lu kBps\n"
>>   		   "b_dirty:            %10lu\n"
>>   		   "b_io:               %10lu\n"
> I can't imagine what the difference is between BdiWritten and
> BdiWrittenBack.
>
> I suggest you document this at the BDI_WRITTEN_BACK definition site in
> enum bdi_stat_item.  BDI_WRITTEN (at least) will also need
> documentation so people can understand the difference.
>
>> ...
>>
>> @@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>>   	if (unlikely(dirty >= limit))
>>   		return 0;
>>   
>> +	if (unlikely(strictlimit)) {
>> +		if (bdi_dirty < 8)
>> +			return 2 << RATELIMIT_CALC_SHIFT;
>> +
>> +		if (bdi_dirty >= bdi_thresh)
>> +			return 0;
>> +
>> +		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
>> +		bdi_setpoint /= 2;
>> +
>> +		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
>> +			return 0;
>> +
>> +		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
>> +					      bdi_thresh);
>> +		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
>> +	}
> This would be a suitable site at which to document the strictlimit
> feature.  What it is, how it works and most importantly, why it exists.
>
>> ...
>>
>> @@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>>   	 * keep that period small to reduce time lags).
>>   	 */
>>   	step = 0;
>> +
>> +	if (unlikely(strictlimit)) {
>> +		dirty = bdi_dirty;
>> +		if (bdi_dirty < 8)
>> +			setpoint = bdi_dirty + 1;
>> +		else
>> +			setpoint = (bdi_thresh +
>> +				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
>> +	}
> Explain this to the reader, please.
>
>> ...
>>
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] mm: strictlimit feature -v2
  2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
  2013-07-01 21:16   ` Andrew Morton
@ 2013-07-02 17:44   ` Maxim Patlasov
  2013-07-02 19:38     ` Andrew Morton
  1 sibling, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-02 17:44 UTC (permalink / raw)
  To: miklos
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

From: Miklos Szeredi <mszeredi@suse.cz>

The feature prevents mistrusted filesystems from growing a large number of
dirty pages before throttling. For such filesystems balance_dirty_pages
always checks bdi counters against bdi limits, i.e. even if the global
"nr_dirty" is under "freerun", it is not allowed to skip bdi checks. The
only use case for now is fuse: it sets bdi max_ratio to 1% by default and
system administrators are supposed to expect that this limit won't be
exceeded.

The feature is on if address space is marked by AS_STRICTLIMIT flag.
A filesystem may set the flag when it initializes a new inode.

Changed in v2 (thanks to Andrew Morton):
 - added a few explanatory comments
 - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
   stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
   fields to reflect that they are in units of number-of-pages.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fs-writeback.c           |    2 
 fs/fuse/file.c              |    3 +
 fs/fuse/inode.c             |    1 
 include/linux/backing-dev.h |   22 ++++-
 include/linux/pagemap.h     |    1 
 include/linux/writeback.h   |    3 -
 mm/backing-dev.c            |    5 +
 mm/page-writeback.c         |  176 +++++++++++++++++++++++++++++++++++--------
 8 files changed, 171 insertions(+), 42 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..8c75277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -762,7 +762,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time, false);
 }
 
 /*
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d30db1..99b5c97 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1697,6 +1697,7 @@ static int fuse_writepage_locked(struct page *page,
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITTEN_BACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
@@ -1766,6 +1767,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
 		if (tmp_page) {
 			copy_highpage(tmp_page, page);
 			inc_bdi_stat(bdi, BDI_WRITEBACK);
+			inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
 			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 		} else
 			all_ok = 0;
@@ -1781,6 +1783,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
 			struct page *page = req->pages[i];
 			if (page) {
 				dec_bdi_stat(bdi, BDI_WRITEBACK);
+				dec_bdi_stat(bdi, BDI_WRITTEN_BACK);
 				dec_zone_page_state(page, NR_WRITEBACK_TEMP);
 				__free_page(page);
 				req->pages[i] = NULL;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4beb8e3..00a28af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
+		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..6b12d01 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -33,6 +33,8 @@ enum bdi_state {
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_registered,		/* bdi_register() was done */
 	BDI_writeback_running,	/* Writeback is in progress */
+	BDI_idle,		/* No pages under writeback at the moment of
+				 * last update of write bw */
 	BDI_unused,		/* Available bits start here */
 };
 
@@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
-	BDI_DIRTIED,
-	BDI_WRITTEN,
+
+	/*
+	 * The three counters below reflects number of events of specific type
+	 * happened since bdi_init(). The type is defined in comments below:
+	 */
+	BDI_DIRTIED,	  /* a page was dirtied */
+	BDI_WRITTEN,	  /* writeout completed for a page */
+	BDI_WRITTEN_BACK, /* a page went to writeback */
+
 	NR_BDI_STAT_ITEMS
 };
 
@@ -73,9 +82,12 @@ struct backing_dev_info {
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
-	unsigned long bw_time_stamp;	/* last time write bw is updated */
-	unsigned long dirtied_stamp;
-	unsigned long written_stamp;	/* pages written at bw_time_stamp */
+	unsigned long bw_time_stamp;	/* last time (in jiffies) write bw
+					 * is updated */
+	unsigned long dirtied_nr_stamp;
+	unsigned long written_nr_stamp;	/* pages written at bw_time_stamp */
+	unsigned long writeback_nr_stamp; /* pages sent to writeback at
+					   * bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..baac702 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
 	AS_BALLOON_MAP  = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+	AS_STRICTLIMIT	= __GFP_BITS_SHIFT + 5, /* strict dirty limit */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a500..5144051 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,7 +159,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
-			    unsigned long start_time);
+			    unsigned long start_time,
+			    bool strictlimit);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..acfeec7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BackgroundThresh:   %10lu kB\n"
 		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
+		   "BdiWrittenBack:     %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
@@ -107,6 +108,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN_BACK)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
@@ -454,7 +456,8 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->dirty_exceeded = 0;
 
 	bdi->bw_time_stamp = jiffies;
-	bdi->written_stamp = 0;
+	bdi->written_nr_stamp = 0;
+	bdi->writeback_nr_stamp = 0;
 
 	bdi->balanced_dirty_ratelimit = INIT_BW;
 	bdi->dirty_ratelimit = INIT_BW;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..83c7434 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,6 +585,37 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 }
 
 /*
+ *                           setpoint - dirty 3
+ *        f(dirty) := 1.0 + (----------------)
+ *                           limit - setpoint
+ *
+ * it's a 3rd order polynomial that subjects to
+ *
+ * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
+ * (2) f(setpoint) = 1.0 => the balance point
+ * (3) f(limit)    = 0   => the hard limit
+ * (4) df/dx      <= 0	 => negative feedback control
+ * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+ *     => fast response on large errors; small oscillation near setpoint
+ */
+static inline long long pos_ratio_polynom(unsigned long setpoint,
+					  unsigned long dirty,
+					  unsigned long limit)
+{
+	long long pos_ratio;
+	long x;
+
+	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	return pos_ratio;
+}
+
+/*
  * Dirty position control.
  *
  * (o) global/bdi setpoints
@@ -664,7 +695,8 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 					unsigned long bg_thresh,
 					unsigned long dirty,
 					unsigned long bdi_thresh,
-					unsigned long bdi_dirty)
+					unsigned long bdi_dirty,
+					bool strictlimit)
 {
 	unsigned long write_bw = bdi->avg_write_bandwidth;
 	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
@@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 		return 0;
 
 	/*
-	 * global setpoint
+	 * The strictlimit feature is a tool preventing mistrusted filesystems
+	 * to grow a large number of dirty pages before throttling. For such
+	 * filesystems balance_dirty_pages always checks bdi counters against
+	 * bdi limits. Even if global "nr_dirty" is under "freerun". This is
+	 * especially important for fuse who sets bdi->max_ratio to 1% by
+	 * default. Without strictlimit feature, fuse writeback may consume
+	 * arbitrary amount of RAM because it is accounted in
+	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
 	 *
-	 *                           setpoint - dirty 3
-	 *        f(dirty) := 1.0 + (----------------)
-	 *                           limit - setpoint
+	 * Here, in bdi_position_ratio(), we calculate pos_ratio based on
+	 * two values: bdi_dirty and bdi_thresh. Let's consider an example:
+	 * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
+	 * limits are set by default to 10% and 20% (background and throttle).
+	 * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
+	 * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
+	 * about ~6K pages (as the average of background and throttle bdi
+	 * limits). The 3rd order polynomial will provide positive feedback if
+	 * bdi_dirty is under bdi_setpoint and vice versa.
 	 *
-	 * it's a 3rd order polynomial that subjects to
+	 * Note, that we cannot use global counters in these calculations
+	 * because we want to throttle process writing to strictlimit address
+	 * space much earlier than global "freerun" is reached (~23MB vs.
+	 * ~2.3GB in the example above).
+	 */
+	if (unlikely(strictlimit)) {
+		if (bdi_dirty < 8)
+			return 2 << RATELIMIT_CALC_SHIFT;
+
+		if (bdi_dirty >= bdi_thresh)
+			return 0;
+
+		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
+		bdi_setpoint /= 2;
+
+		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+			return 0;
+
+		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
+					      bdi_thresh);
+		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
+	}
+
+	/*
+	 * global setpoint
 	 *
-	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
-	 * (2) f(setpoint) = 1.0 => the balance point
-	 * (3) f(limit)    = 0   => the hard limit
-	 * (4) df/dx      <= 0	 => negative feedback control
-	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
-	 *     => fast response on large errors; small oscillation near setpoint
+	 * See comment for pos_ratio_polynom().
 	 */
 	setpoint = (freerun + limit) / 2;
-	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
-		    limit - setpoint + 1);
-	pos_ratio = x;
-	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
-	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
-	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
 
 	/*
 	 * We have computed basic pos_ratio above based on global situation. If
@@ -799,7 +858,7 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 	 * write_bandwidth = ---------------------------------------------------
 	 *                                          period
 	 */
-	bw = written - bdi->written_stamp;
+	bw = written - bdi->written_nr_stamp;
 	bw *= HZ;
 	if (unlikely(elapsed > period)) {
 		do_div(bw, elapsed);
@@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 				       unsigned long bdi_thresh,
 				       unsigned long bdi_dirty,
 				       unsigned long dirtied,
-				       unsigned long elapsed)
+				       unsigned long elapsed,
+				       bool strictlimit)
 {
 	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
 	unsigned long limit = hard_dirty_limit(thresh);
@@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 	 * The dirty rate will match the writeout rate in long term, except
 	 * when dirty pages are truncated by userspace or re-dirtied by FS.
 	 */
-	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+	dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
 
 	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
-				       bdi_thresh, bdi_dirty);
+				       bdi_thresh, bdi_dirty, strictlimit);
 	/*
 	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
 	 */
@@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
 	 * keep that period small to reduce time lags).
 	 */
 	step = 0;
+
+	/*
+	 * For strictlimit case, balanced_dirty_ratelimit was calculated
+	 * above based on bdi counters and limits (see bdi_position_ratio()).
+	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
+	 * "dirty" and bdi_setpoint as "setpoint".
+	 *
+	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
+	 * it's possible that bdi_thresh is close to zero due to inactivity
+	 * of backing device (see the implementation of bdi_dirty_limit()).
+	 */
+	if (unlikely(strictlimit)) {
+		dirty = bdi_dirty;
+		if (bdi_dirty < 8)
+			setpoint = bdi_dirty + 1;
+		else
+			setpoint = (bdi_thresh +
+				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
+	}
+
 	if (dirty < setpoint) {
 		x = min(bdi->balanced_dirty_ratelimit,
 			 min(balanced_dirty_ratelimit, task_ratelimit));
@@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
-			    unsigned long start_time)
+			    unsigned long start_time,
+			    bool strictlimit)
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
 	unsigned long dirtied;
 	unsigned long written;
+	unsigned long writeback;
 
 	/*
 	 * rate-limit, only update once every 200ms.
@@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 
 	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
 
 	/*
 	 * Skip quiet periods when disk bandwidth is under-utilized.
@@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
+	/*
+	 * Skip periods when backing dev was idle due to abscence of pages
+	 * under writeback (when over_bground_thresh() returns false)
+	 */
+	if (test_bit(BDI_idle, &bdi->state) &&
+	    bdi->writeback_nr_stamp == writeback)
+		goto snapshot;
+
 	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
 		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
 					   bdi_thresh, bdi_dirty,
-					   dirtied, elapsed);
+					   dirtied, elapsed, strictlimit);
 	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
-	bdi->dirtied_stamp = dirtied;
-	bdi->written_stamp = written;
+	bdi->dirtied_nr_stamp = dirtied;
+	bdi->written_nr_stamp = written;
 	bdi->bw_time_stamp = now;
+
+	bdi->writeback_nr_stamp = writeback;
+	if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
+		set_bit(BDI_idle, &bdi->state);
+	else
+		clear_bit(BDI_idle, &bdi->state);
 }
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
@@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
-				 unsigned long start_time)
+				 unsigned long start_time,
+				 bool strictlimit)
 {
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
 	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
-			       bdi_thresh, bdi_dirty, start_time);
+			       bdi_thresh, bdi_dirty, start_time, strictlimit);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
 	unsigned long start_time = jiffies;
 
 	for (;;) {
@@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun) {
+		if (nr_dirty <= freerun  && !strictlimit) {
 			current->dirty_paused_when = now;
 			current->nr_dirtied = 0;
 			current->nr_dirtied_pause =
@@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
+		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
 			bdi_start_background_writeback(bdi);
 
 		/*
@@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
 				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
+		if (unlikely(!writeback_in_progress(bdi)) &&
+		    bdi_dirty > bdi_thresh / 4)
+			bdi_start_background_writeback(bdi);
+
 		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
-				  (nr_dirty > dirty_thresh);
+				 ((nr_dirty > dirty_thresh) || strictlimit);
 		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
 				     nr_dirty, bdi_thresh, bdi_dirty,
-				     start_time);
+				     start_time, strictlimit);
 
 		dirty_ratelimit = bdi->dirty_ratelimit;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
-					       bdi_thresh, bdi_dirty);
+					       bdi_thresh, bdi_dirty,
+					       strictlimit);
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		max_pause = bdi_max_pause(bdi, bdi_dirty);
@@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 pause:
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
 		trace_balance_dirty_pages(bdi,
 					  dirty_thresh,
 					  background_thresh,
@@ -2265,8 +2371,10 @@ int test_set_page_writeback(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+				__inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: strictlimit feature -v2
  2013-07-02 17:44   ` [PATCH] mm: strictlimit feature -v2 Maxim Patlasov
@ 2013-07-02 19:38     ` Andrew Morton
  2013-07-03 11:01       ` Maxim Patlasov
  0 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2013-07-02 19:38 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: miklos, riel, dev, xemul, fuse-devel, bfoster, linux-kernel,
	jbottomley, linux-mm, viro, linux-fsdevel, fengguang.wu, devel,
	mgorman

On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:

> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> The feature prevents mistrusted filesystems to grow a large number of dirty
> pages before throttling. For such filesystems balance_dirty_pages always
> check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> "freerun", it's not allowed to skip bdi checks. The only use case for now is
> fuse: it sets bdi max_ratio to 1% by default and system administrators are
> supposed to expect that this limit won't be exceeded.
> 
> The feature is on if address space is marked by AS_STRICTLIMIT flag.
> A filesystem may set the flag when it initializes a new inode.
> 
> Changed in v2 (thanks to Andrew Morton):
>  - added a few explanatory comments
>  - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
>    stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
>    fields to reflect that they are in units of number-of-pages.
> 

Better, thanks.

The writeback arithmetic makes my head spin - I'd really like Fengguang
to go over this, please.

A quick visit from the spelling police:

>
> ...
>
> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>  enum bdi_stat_item {
>  	BDI_RECLAIMABLE,
>  	BDI_WRITEBACK,
> -	BDI_DIRTIED,
> -	BDI_WRITTEN,
> +
> +	/*
> +	 * The three counters below reflects number of events of specific type
> +	 * happened since bdi_init(). The type is defined in comments below:

"The three counters below reflect the number of events of specific
types since bdi_init()"

> +	 */
> +	BDI_DIRTIED,	  /* a page was dirtied */
> +	BDI_WRITTEN,	  /* writeout completed for a page */
> +	BDI_WRITTEN_BACK, /* a page went to writeback */
> +
>  	NR_BDI_STAT_ITEMS
>  };
>  
>
> ...
>
> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  		return 0;
>  
>  	/*
> -	 * global setpoint
> +	 * The strictlimit feature is a tool preventing mistrusted filesystems
> +	 * to grow a large number of dirty pages before throttling. For such

"from growing"

> +	 * filesystems balance_dirty_pages always checks bdi counters against
> +	 * bdi limits. Even if global "nr_dirty" is under "freerun". This is
> +	 * especially important for fuse who sets bdi->max_ratio to 1% by

s/who/which/

> +	 * default. Without strictlimit feature, fuse writeback may consume
> +	 * arbitrary amount of RAM because it is accounted in
> +	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>
> ...
>
> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>  	 * keep that period small to reduce time lags).
>  	 */
>  	step = 0;
> +
> +	/*
> +	 * For strictlimit case, balanced_dirty_ratelimit was calculated

balance_dirty_ratelimit?

> +	 * above based on bdi counters and limits (see bdi_position_ratio()).
> +	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
> +	 * "dirty" and bdi_setpoint as "setpoint".
> +	 *
> +	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
> +	 * it's possible that bdi_thresh is close to zero due to inactivity
> +	 * of backing device (see the implementation of bdi_dirty_limit()).
> +	 */
> +	if (unlikely(strictlimit)) {
> +		dirty = bdi_dirty;
> +		if (bdi_dirty < 8)
> +			setpoint = bdi_dirty + 1;
> +		else
>
> ...
>
> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> +	/*
> +	 * Skip periods when backing dev was idle due to abscence of pages

"absence"

> +	 * under writeback (when over_bground_thresh() returns false)
> +	 */
> +	if (test_bit(BDI_idle, &bdi->state) &&
> +	    bdi->writeback_nr_stamp == writeback)
>
> ...
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] mm: strictlimit feature -v2
  2013-07-02 19:38     ` Andrew Morton
@ 2013-07-03 11:01       ` Maxim Patlasov
  2013-07-03 23:16         ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-03 11:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: miklos, riel, dev, xemul, fuse-devel, bfoster, linux-kernel,
	jbottomley, linux-mm, viro, linux-fsdevel, fengguang.wu, devel,
	mgorman

On 07/02/2013 11:38 PM, Andrew Morton wrote:
> On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:
>
>> From: Miklos Szeredi <mszeredi@suse.cz>
>>
>> The feature prevents mistrusted filesystems to grow a large number of dirty
>> pages before throttling. For such filesystems balance_dirty_pages always
>> check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
>> "freerun", it's not allowed to skip bdi checks. The only use case for now is
>> fuse: it sets bdi max_ratio to 1% by default and system administrators are
>> supposed to expect that this limit won't be exceeded.
>>
>> The feature is on if address space is marked by AS_STRICTLIMIT flag.
>> A filesystem may set the flag when it initializes a new inode.
>>
>> Changed in v2 (thanks to Andrew Morton):
>>   - added a few explanatory comments
>>   - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
>>     stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
>>     fields to reflect that they are in units of number-of-pages.
>>
> Better, thanks.
>
> The writeback arithemtic makes my head spin - I'd really like Fengguang
> to go over this, please.
>
> A quick visit from the spelling police:

Great! Thank you, Andrew. I'll wait for Fengguang's feedback for a while 
before respinning.

>
>> ...
>>
>> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>>   enum bdi_stat_item {
>>   	BDI_RECLAIMABLE,
>>   	BDI_WRITEBACK,
>> -	BDI_DIRTIED,
>> -	BDI_WRITTEN,
>> +
>> +	/*
>> +	 * The three counters below reflects number of events of specific type
>> +	 * happened since bdi_init(). The type is defined in comments below:
> "The three counters below reflect the number of events of specific
> types since bdi_init()"
>
>> +	 */
>> +	BDI_DIRTIED,	  /* a page was dirtied */
>> +	BDI_WRITTEN,	  /* writeout completed for a page */
>> +	BDI_WRITTEN_BACK, /* a page went to writeback */
>> +
>>   	NR_BDI_STAT_ITEMS
>>   };
>>   
>>
>> ...
>>
>> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>>   		return 0;
>>   
>>   	/*
>> -	 * global setpoint
>> +	 * The strictlimit feature is a tool preventing mistrusted filesystems
>> +	 * to grow a large number of dirty pages before throttling. For such
> "from growing"
>
>> +	 * filesystems balance_dirty_pages always checks bdi counters against
>> +	 * bdi limits. Even if global "nr_dirty" is under "freerun". This is
>> +	 * especially important for fuse who sets bdi->max_ratio to 1% by
> s/who/which/
>
>> +	 * default. Without strictlimit feature, fuse writeback may consume
>> +	 * arbitrary amount of RAM because it is accounted in
>> +	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>>
>> ...
>>
>> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>>   	 * keep that period small to reduce time lags).
>>   	 */
>>   	step = 0;
>> +
>> +	/*
>> +	 * For strictlimit case, balanced_dirty_ratelimit was calculated
> balance_dirty_ratelimit?
>
>> +	 * above based on bdi counters and limits (see bdi_position_ratio()).
>> +	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
>> +	 * "dirty" and bdi_setpoint as "setpoint".
>> +	 *
>> +	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
>> +	 * it's possible that bdi_thresh is close to zero due to inactivity
>> +	 * of backing device (see the implementation of bdi_dirty_limit()).
>> +	 */
>> +	if (unlikely(strictlimit)) {
>> +		dirty = bdi_dirty;
>> +		if (bdi_dirty < 8)
>> +			setpoint = bdi_dirty + 1;
>> +		else
>>
>> ...
>>
>> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>>   	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>>   		goto snapshot;
>>   
>> +	/*
>> +	 * Skip periods when backing dev was idle due to abscence of pages
> "absence"
>
>> +	 * under writeback (when over_bground_thresh() returns false)
>> +	 */
>> +	if (test_bit(BDI_idle, &bdi->state) &&
>> +	    bdi->writeback_nr_stamp == writeback)
>>
>> ...
>>
>
>



* Re: [PATCH] mm: strictlimit feature -v2
  2013-07-03 11:01       ` Maxim Patlasov
@ 2013-07-03 23:16         ` Jan Kara
  2013-07-05 13:14           ` Maxim Patlasov
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2013-07-03 23:16 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: Andrew Morton, miklos, riel, dev, xemul, fuse-devel, bfoster,
	linux-kernel, jbottomley, linux-mm, viro, linux-fsdevel,
	fengguang.wu, devel, mgorman

On Wed 03-07-13 15:01:19, Maxim Patlasov wrote:
> 07/02/2013 11:38 PM, Andrew Morton wrote:
> >On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:
> >
> >>From: Miklos Szeredi <mszeredi@suse.cz>
> >>
> >>The feature prevents mistrusted filesystems to grow a large number of dirty
> >>pages before throttling. For such filesystems balance_dirty_pages always
> >>check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> >>"freerun", it's not allowed to skip bdi checks. The only use case for now is
> >>fuse: it sets bdi max_ratio to 1% by default and system administrators are
> >>supposed to expect that this limit won't be exceeded.
> >>
> >>The feature is on if address space is marked by AS_STRICTLIMIT flag.
> >>A filesystem may set the flag when it initializes a new inode.
> >>
> >>Changed in v2 (thanks to Andrew Morton):
> >>  - added a few explanatory comments
> >>  - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
> >>    stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
> >>    fields to reflect that they are in units of number-of-pages.
> >>
> >Better, thanks.
> >
> >The writeback arithemtic makes my head spin - I'd really like Fengguang
> >to go over this, please.
> >
> >A quick visit from the spelling police:
> 
> Great! Thank you, Andrew. I'll wait for Fengguang' feedback for a
> while before respin.
  Sorry for the bad mail threading, but I noticed the thread only now and no
longer have the email with your patches in my mailbox. Below is a review of
your strictlimit patch. In principle I'm OK with the idea (I even wanted a
similar ability, e.g. for NFS mounts), but I have some reservations about
the implementation:

> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 4beb8e3..00a28af 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>  			inode->i_flags |= S_NOCMTIME;
>  		inode->i_generation = generation;
>  		inode->i_data.backing_dev_info = &fc->bdi;
> +		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
>  		fuse_init_inode(inode, attr);
>  		unlock_new_inode(inode);
>  	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
  It seems wrong to use address space bits for this. Using BDI capabilities
for this would look more appropriate. Sure, you couldn't then always tune
this on a per-fs basis since filesystems can share a BDI (e.g. when they are
on different partitions of one disk), but if several filesystems sharing a
BDI wanted 'strict' behavior for some and not others, I don't believe the
resulting behavior would be sane; e.g. a non-strict fs could push the bdi
over its per-bdi limit without any throttling while a strictly limited fs
would be continuously stalled. Or do I miss any reason why this is set on
the address space?

As a bonus, you wouldn't have to pass the 'strictlimit' flag to writeback
functions if it were a bdi flag.

> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..6b12d01 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -33,6 +33,8 @@ enum bdi_state {
>  	BDI_sync_congested,	/* The sync queue is getting full */
>  	BDI_registered,		/* bdi_register() was done */
>  	BDI_writeback_running,	/* Writeback is in progress */
> +	BDI_idle,		/* No pages under writeback at the moment of
> +				 * last update of write bw */
>  	BDI_unused,		/* Available bits start here */
>  };
>  
> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>  enum bdi_stat_item {
>  	BDI_RECLAIMABLE,
>  	BDI_WRITEBACK,
> -	BDI_DIRTIED,
> -	BDI_WRITTEN,
> +
> +	/*
> +	 * The three counters below reflects number of events of specific type
> +	 * happened since bdi_init(). The type is defined in comments below:
> +	 */
> +	BDI_DIRTIED,	  /* a page was dirtied */
> +	BDI_WRITTEN,	  /* writeout completed for a page */
> +	BDI_WRITTEN_BACK, /* a page went to writeback */
> +
>  	NR_BDI_STAT_ITEMS
>  };
>  
> @@ -73,9 +82,12 @@ struct backing_dev_info {
>  
>  	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>  
> -	unsigned long bw_time_stamp;	/* last time write bw is updated */
> -	unsigned long dirtied_stamp;
> -	unsigned long written_stamp;	/* pages written at bw_time_stamp */
> +	unsigned long bw_time_stamp;	/* last time (in jiffies) write bw
> +					 * is updated */
> +	unsigned long dirtied_nr_stamp;
> +	unsigned long written_nr_stamp;	/* pages written at bw_time_stamp */
> +	unsigned long writeback_nr_stamp; /* pages sent to writeback at
> +					   * bw_time_stamp */
>  	unsigned long write_bandwidth;	/* the estimated write bandwidth */
>  	unsigned long avg_write_bandwidth; /* further smoothed write bw */
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..83c7434 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  		return 0;
>  
>  	/*
> -	 * global setpoint
> +	 * The strictlimit feature is a tool preventing mistrusted filesystems
> +	 * to grow a large number of dirty pages before throttling. For such
> +	 * filesystems balance_dirty_pages always checks bdi counters against
> +	 * bdi limits. Even if global "nr_dirty" is under "freerun". This is
> +	 * especially important for fuse who sets bdi->max_ratio to 1% by
> +	 * default. Without strictlimit feature, fuse writeback may consume
> +	 * arbitrary amount of RAM because it is accounted in
> +	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>  	 *
> -	 *                           setpoint - dirty 3
> -	 *        f(dirty) := 1.0 + (----------------)
> -	 *                           limit - setpoint
> +	 * Here, in bdi_position_ratio(), we calculate pos_ratio based on
> +	 * two values: bdi_dirty and bdi_thresh. Let's consider an example:
> +	 * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
> +	 * limits are set by default to 10% and 20% (background and throttle).
> +	 * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
> +	 * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
> +	 * about ~6K pages (as the average of background and throttle bdi
> +	 * limits). The 3rd order polynomial will provide positive feedback if
> +	 * bdi_dirty is under bdi_setpoint and vice versa.
>  	 *
> -	 * it's a 3rd order polynomial that subjects to
> +	 * Note, that we cannot use global counters in these calculations
> +	 * because we want to throttle process writing to strictlimit address
> +	 * space much earlier than global "freerun" is reached (~23MB vs.
> +	 * ~2.3GB in the example above).
> +	 */
> +	if (unlikely(strictlimit)) {
> +		if (bdi_dirty < 8)
> +			return 2 << RATELIMIT_CALC_SHIFT;
> +
> +		if (bdi_dirty >= bdi_thresh)
> +			return 0;
> +
> +		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
> +		bdi_setpoint /= 2;
> +
> +		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
> +			return 0;
> +
> +		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
> +					      bdi_thresh);
> +		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
> +	}
  But if the global limits are exceeded while a BDI in strict mode is below
its limit, this would still allow dirtying on that BDI, which seems wrong.
Also the logic in bdi_position_ratio() is already supposed to take the bdi
limits into account (although the math is somewhat convoluted), so you
shouldn't have to touch it. Only the increase of bdi_thresh to
(limit - dirty) / 8 might be too much for strict limiting, so that may need
some tweaking (although setting it at 8 pages as your patch does seems
*too* strict to me).

> +
> +	/*
> +	 * global setpoint
>  	 *
> -	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
> -	 * (2) f(setpoint) = 1.0 => the balance point
> -	 * (3) f(limit)    = 0   => the hard limit
> -	 * (4) df/dx      <= 0	 => negative feedback control
> -	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> -	 *     => fast response on large errors; small oscillation near setpoint
> +	 * See comment for pos_ratio_polynom().
>  	 */
>  	setpoint = (freerun + limit) / 2;
> -	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
> -		    limit - setpoint + 1);
> -	pos_ratio = x;
> -	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> -	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> -	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
>  
>  	/*
>  	 * We have computed basic pos_ratio above based on global situation. If
> @@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>  				       unsigned long bdi_thresh,
>  				       unsigned long bdi_dirty,
>  				       unsigned long dirtied,
> -				       unsigned long elapsed)
> +				       unsigned long elapsed,
> +				       bool strictlimit)
>  {
>  	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
>  	unsigned long limit = hard_dirty_limit(thresh);
> @@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>  	 * The dirty rate will match the writeout rate in long term, except
>  	 * when dirty pages are truncated by userspace or re-dirtied by FS.
>  	 */
> -	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +	dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
>  
>  	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
> -				       bdi_thresh, bdi_dirty);
> +				       bdi_thresh, bdi_dirty, strictlimit);
>  	/*
>  	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
>  	 */
> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>  	 * keep that period small to reduce time lags).
>  	 */
>  	step = 0;
> +
> +	/*
> +	 * For strictlimit case, balanced_dirty_ratelimit was calculated
> +	 * above based on bdi counters and limits (see bdi_position_ratio()).
> +	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
> +	 * "dirty" and bdi_setpoint as "setpoint".
> +	 *
> +	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
> +	 * it's possible that bdi_thresh is close to zero due to inactivity
> +	 * of backing device (see the implementation of bdi_dirty_limit()).
> +	 */
> +	if (unlikely(strictlimit)) {
> +		dirty = bdi_dirty;
> +		if (bdi_dirty < 8)
> +			setpoint = bdi_dirty + 1;
> +		else
> +			setpoint = (bdi_thresh +
> +				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
> +	}
> +
>  	if (dirty < setpoint) {
>  		x = min(bdi->balanced_dirty_ratelimit,
>  			 min(balanced_dirty_ratelimit, task_ratelimit));
> @@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> -			    unsigned long start_time)
> +			    unsigned long start_time,
> +			    bool strictlimit)
>  {
>  	unsigned long now = jiffies;
>  	unsigned long elapsed = now - bdi->bw_time_stamp;
>  	unsigned long dirtied;
>  	unsigned long written;
> +	unsigned long writeback;
>  
>  	/*
>  	 * rate-limit, only update once every 200ms.
> @@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  
>  	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>  	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> +	writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
>  
>  	/*
>  	 * Skip quiet periods when disk bandwidth is under-utilized.
> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> +	/*
> +	 * Skip periods when backing dev was idle due to abscence of pages
> +	 * under writeback (when over_bground_thresh() returns false)
> +	 */
> +	if (test_bit(BDI_idle, &bdi->state) &&
> +	    bdi->writeback_nr_stamp == writeback)
> +		goto snapshot;
> +
  Hum, I understand the desire behind BDI_idle, but that seems to solve only
a special case, doesn't it? When there is only light traffic on the bdi, the
bandwidth will get updated anyway, and the computed bandwidth will drop below
the real maximum value seen when there are enough pages to write. So I'm not
sure how much this really helps. Plus, this 'idle' logic seems completely
independent of the balance_dirty_pages() tweaks, so it would be better done
as a separate patch (in case you have convincing reasons we really need that
logic).

>  	if (thresh) {
>  		global_update_bandwidth(thresh, dirty, now);
>  		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
>  					   bdi_thresh, bdi_dirty,
> -					   dirtied, elapsed);
> +					   dirtied, elapsed, strictlimit);
>  	}
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>  
>  snapshot:
> -	bdi->dirtied_stamp = dirtied;
> -	bdi->written_stamp = written;
> +	bdi->dirtied_nr_stamp = dirtied;
> +	bdi->written_nr_stamp = written;
>  	bdi->bw_time_stamp = now;
> +
> +	bdi->writeback_nr_stamp = writeback;
> +	if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
> +		set_bit(BDI_idle, &bdi->state);
> +	else
> +		clear_bit(BDI_idle, &bdi->state);
>  }
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> @@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> -				 unsigned long start_time)
> +				 unsigned long start_time,
> +				 bool strictlimit)
>  {
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
>  	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> -			       bdi_thresh, bdi_dirty, start_time);
> +			       bdi_thresh, bdi_dirty, start_time, strictlimit);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	unsigned long dirty_ratelimit;
>  	unsigned long pos_ratio;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> @@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		 */
>  		freerun = dirty_freerun_ceiling(dirty_thresh,
>  						background_thresh);
> -		if (nr_dirty <= freerun) {
> +		if (nr_dirty <= freerun  && !strictlimit) {
>  			current->dirty_paused_when = now;
>  			current->nr_dirtied = 0;
>  			current->nr_dirtied_pause =
  I'd rather change this to check bdi_dirty <= bdi_freerun in the
strictlimit case.

> @@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  			break;
>  		}
>  
> -		if (unlikely(!writeback_in_progress(bdi)))
> +		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
>  			bdi_start_background_writeback(bdi);
  This can then go away.

>  		/*
> @@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
>  				    bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
> +		if (unlikely(!writeback_in_progress(bdi)) &&
> +		    bdi_dirty > bdi_thresh / 4)
> +			bdi_start_background_writeback(bdi);
> +
  Why is this?

>  		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
> -				  (nr_dirty > dirty_thresh);
> +				 ((nr_dirty > dirty_thresh) || strictlimit);
>  		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>  				     nr_dirty, bdi_thresh, bdi_dirty,
> -				     start_time);
> +				     start_time, strictlimit);
>  
>  		dirty_ratelimit = bdi->dirty_ratelimit;
>  		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
>  					       background_thresh, nr_dirty,
> -					       bdi_thresh, bdi_dirty);
> +					       bdi_thresh, bdi_dirty,
> +					       strictlimit);
>  		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
>  							RATELIMIT_CALC_SHIFT;
>  		max_pause = bdi_max_pause(bdi, bdi_dirty);
> @@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		}
>  
>  pause:
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
>  		trace_balance_dirty_pages(bdi,
>  					  dirty_thresh,
>  					  background_thresh,
  And this shouldn't be necessary either after updating the freerun test
properly.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



* Re: [PATCH] mm: strictlimit feature -v2
  2013-07-03 23:16         ` Jan Kara
@ 2013-07-05 13:14           ` Maxim Patlasov
  0 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-05 13:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, miklos, riel, dev, xemul, fuse-devel, bfoster,
	linux-kernel, jbottomley, linux-mm, viro, linux-fsdevel,
	fengguang.wu, devel, mgorman

Hi Jan,

First of all, thanks a lot for the review; I highly appreciate it. I agree 
with most of your comments; please see my inline replies below.

To make further reviews easier, I'm going to send the next version as a 
small mm-only patchset (i.e. putting all the "fuse writeback cache 
policy" stuff aside).

07/04/2013 03:16 AM, Jan Kara wrote:
> On Wed 03-07-13 15:01:19, Maxim Patlasov wrote:
>> 07/02/2013 11:38 PM, Andrew Morton wrote:
>>> On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <MPatlasov@parallels.com> wrote:
>>>
>>>> From: Miklos Szeredi <mszeredi@suse.cz>
>>>>
>>>> The feature prevents mistrusted filesystems to grow a large number of dirty
>>>> pages before throttling. For such filesystems balance_dirty_pages always
>>>> check bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
>>>> "freerun", it's not allowed to skip bdi checks. The only use case for now is
>>>> fuse: it sets bdi max_ratio to 1% by default and system administrators are
>>>> supposed to expect that this limit won't be exceeded.
>>>>
>>>> The feature is on if address space is marked by AS_STRICTLIMIT flag.
>>>> A filesystem may set the flag when it initializes a new inode.
>>>>
>>>> Changed in v2 (thanks to Andrew Morton):
>>>>   - added a few explanatory comments
>>>>   - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
>>>>     stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
>>>>     fields to reflect that they are in units of number-of-pages.
>>>>
>>> Better, thanks.
>>>
>>> The writeback arithemtic makes my head spin - I'd really like Fengguang
>>> to go over this, please.
>>>
>>> A quick visit from the spelling police:
>> Great! Thank you, Andrew. I'll wait for Fengguang' feedback for a
>> while before respin.
>    Sorry for the bad mail threading but I've noticed the thread only now and
> I don't have email with your patches in my mailbox anymore. Below is a
> review of your strictlimit patch. In principle, I'm OK with the idea (I
> even wanted to have a similar ability e.g. for NFS mounts) but I have some
> reservations regarding the implementation:
>
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 4beb8e3..00a28af 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>>   			inode->i_flags |= S_NOCMTIME;
>>   		inode->i_generation = generation;
>>   		inode->i_data.backing_dev_info = &fc->bdi;
>> +		set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
>>   		fuse_init_inode(inode, attr);
>>   		unlock_new_inode(inode);
>>   	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
>    It seems wrong to use address space bits for this. Using BDI capabilities for
> this would look more appropriate. Sure you couldn't then always tune this on
> per-fs basis since filesystems can share BDI (e.g. when they are on different
> partitions of one disk) but if several filesystems are sharing a BDI and some
> would want 'strict' behavior and some don't I don't believe the resulting
> behavior would be sane - e.g. non-strict fs could be getting bdi over per-bdi
> limit without any throttling and strictly limited fs would be continuously
> stalled. Or do I miss any reason why this is set on address space?
>
> As a bonus you don't have to pass the 'strictlimit' flag to writeback
> functions when it is a bdi flag.

Completely agree. The address space flag was Miklos' idea; I should have 
converted it to a bdi capability from the beginning.

>
>> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
>> index c388155..6b12d01 100644
>> --- a/include/linux/backing-dev.h
>> +++ b/include/linux/backing-dev.h
>> @@ -33,6 +33,8 @@ enum bdi_state {
>>   	BDI_sync_congested,	/* The sync queue is getting full */
>>   	BDI_registered,		/* bdi_register() was done */
>>   	BDI_writeback_running,	/* Writeback is in progress */
>> +	BDI_idle,		/* No pages under writeback at the moment of
>> +				 * last update of write bw */
>>   	BDI_unused,		/* Available bits start here */
>>   };
>>   
>> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>>   enum bdi_stat_item {
>>   	BDI_RECLAIMABLE,
>>   	BDI_WRITEBACK,
>> -	BDI_DIRTIED,
>> -	BDI_WRITTEN,
>> +
>> +	/*
>> +	 * The three counters below reflects number of events of specific type
>> +	 * happened since bdi_init(). The type is defined in comments below:
>> +	 */
>> +	BDI_DIRTIED,	  /* a page was dirtied */
>> +	BDI_WRITTEN,	  /* writeout completed for a page */
>> +	BDI_WRITTEN_BACK, /* a page went to writeback */
>> +
>>   	NR_BDI_STAT_ITEMS
>>   };
>>   
>> @@ -73,9 +82,12 @@ struct backing_dev_info {
>>   
>>   	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>>   
>> -	unsigned long bw_time_stamp;	/* last time write bw is updated */
>> -	unsigned long dirtied_stamp;
>> -	unsigned long written_stamp;	/* pages written at bw_time_stamp */
>> +	unsigned long bw_time_stamp;	/* last time (in jiffies) write bw
>> +					 * is updated */
>> +	unsigned long dirtied_nr_stamp;
>> +	unsigned long written_nr_stamp;	/* pages written at bw_time_stamp */
>> +	unsigned long writeback_nr_stamp; /* pages sent to writeback at
>> +					   * bw_time_stamp */
>>   	unsigned long write_bandwidth;	/* the estimated write bandwidth */
>>   	unsigned long avg_write_bandwidth; /* further smoothed write bw */
>>   
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 4514ad7..83c7434 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>>   		return 0;
>>   
>>   	/*
>> -	 * global setpoint
>> +	 * The strictlimit feature is a tool preventing mistrusted filesystems
>> +	 * to grow a large number of dirty pages before throttling. For such
>> +	 * filesystems balance_dirty_pages always checks bdi counters against
>> +	 * bdi limits. Even if global "nr_dirty" is under "freerun". This is
>> +	 * especially important for fuse who sets bdi->max_ratio to 1% by
>> +	 * default. Without strictlimit feature, fuse writeback may consume
>> +	 * arbitrary amount of RAM because it is accounted in
>> +	 * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>>   	 *
>> -	 *                           setpoint - dirty 3
>> -	 *        f(dirty) := 1.0 + (----------------)
>> -	 *                           limit - setpoint
>> +	 * Here, in bdi_position_ratio(), we calculate pos_ratio based on
>> +	 * two values: bdi_dirty and bdi_thresh. Let's consider an example:
>> +	 * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
>> +	 * limits are set by default to 10% and 20% (background and throttle).
>> +	 * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
>> +	 * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
>> +	 * about ~6K pages (as the average of background and throttle bdi
>> +	 * limits). The 3rd order polynomial will provide positive feedback if
>> +	 * bdi_dirty is under bdi_setpoint and vice versa.
>>   	 *
>> -	 * it's a 3rd order polynomial that subjects to
>> +	 * Note, that we cannot use global counters in these calculations
>> +	 * because we want to throttle process writing to strictlimit address
>> +	 * space much earlier than global "freerun" is reached (~23MB vs.
>> +	 * ~2.3GB in the example above).
>> +	 */
>> +	if (unlikely(strictlimit)) {
>> +		if (bdi_dirty < 8)
>> +			return 2 << RATELIMIT_CALC_SHIFT;
>> +
>> +		if (bdi_dirty >= bdi_thresh)
>> +			return 0;
>> +
>> +		bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
>> +		bdi_setpoint /= 2;
>> +
>> +		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
>> +			return 0;
>> +
>> +		pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
>> +					      bdi_thresh);
>> +		return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
>> +	}
>    But if global limits are exceeded but a BDI in strict mode is below limit,
> this would allow dirtying on that BDI which seems wrong. Also the logic
> in bdi_position_ratio() is already supposed to take bdi limits in account
> (although the math is somewhat convoluted) so you shouldn't have to touch it.
> Only maybe the increasing of bdi_thresh to (limit - dirty) / 8 might be too
> much for strict limitting so that may need some tweaking (although setting it
> at 8 pages as your patch does seems *too* strict to me).

I agree that if global nr_dirty comes close to the global limit (or has 
already exceeded it), we must take the global pos_ratio into 
consideration (even if bdi_dirty << bdi_thresh). But I don't think we 
can keep bdi_position_ratio() intact, because: 1) the math there goes 
crazy if dirty < freerun; 2) "(limit - dirty) / 8" is useless for FUSE, 
which accounts internal writeback in NR_WRITEBACK_TEMP; i.e. you can 
easily observe "dirty" close to zero while bdi_dirty is huge. Please let 
me know if you need more details about the NR_WRITEBACK_TEMP peculiarities.

>
>> +
>> +	/*
>> +	 * global setpoint
>>   	 *
>> -	 * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
>> -	 * (2) f(setpoint) = 1.0 => the balance point
>> -	 * (3) f(limit)    = 0   => the hard limit
>> -	 * (4) df/dx      <= 0	 => negative feedback control
>> -	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
>> -	 *     => fast response on large errors; small oscillation near setpoint
>> +	 * See comment for pos_ratio_polynom().
>>   	 */
>>   	setpoint = (freerun + limit) / 2;
>> -	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>> -		    limit - setpoint + 1);
>> -	pos_ratio = x;
>> -	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>> -	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>> -	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
>> +	pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
>>   
>>   	/*
>>   	 * We have computed basic pos_ratio above based on global situation. If
>> @@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>>   				       unsigned long bdi_thresh,
>>   				       unsigned long bdi_dirty,
>>   				       unsigned long dirtied,
>> -				       unsigned long elapsed)
>> +				       unsigned long elapsed,
>> +				       bool strictlimit)
>>   {
>>   	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
>>   	unsigned long limit = hard_dirty_limit(thresh);
>> @@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>>   	 * The dirty rate will match the writeout rate in long term, except
>>   	 * when dirty pages are truncated by userspace or re-dirtied by FS.
>>   	 */
>> -	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
>> +	dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
>>   
>>   	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
>> -				       bdi_thresh, bdi_dirty);
>> +				       bdi_thresh, bdi_dirty, strictlimit);
>>   	/*
>>   	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
>>   	 */
>> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>>   	 * keep that period small to reduce time lags).
>>   	 */
>>   	step = 0;
>> +
>> +	/*
>> +	 * For strictlimit case, balanced_dirty_ratelimit was calculated
>> +	 * above based on bdi counters and limits (see bdi_position_ratio()).
>> +	 * Hence, to calculate "step" properly, we have to use bdi_dirty as
>> +	 * "dirty" and bdi_setpoint as "setpoint".
>> +	 *
>> +	 * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
>> +	 * it's possible that bdi_thresh is close to zero due to inactivity
>> +	 * of backing device (see the implementation of bdi_dirty_limit()).
>> +	 */
>> +	if (unlikely(strictlimit)) {
>> +		dirty = bdi_dirty;
>> +		if (bdi_dirty < 8)
>> +			setpoint = bdi_dirty + 1;
>> +		else
>> +			setpoint = (bdi_thresh +
>> +				    bdi_dirty_limit(bdi, bg_thresh)) / 2;
>> +	}
>> +
>>   	if (dirty < setpoint) {
>>   		x = min(bdi->balanced_dirty_ratelimit,
>>   			 min(balanced_dirty_ratelimit, task_ratelimit));
>> @@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>>   			    unsigned long dirty,
>>   			    unsigned long bdi_thresh,
>>   			    unsigned long bdi_dirty,
>> -			    unsigned long start_time)
>> +			    unsigned long start_time,
>> +			    bool strictlimit)
>>   {
>>   	unsigned long now = jiffies;
>>   	unsigned long elapsed = now - bdi->bw_time_stamp;
>>   	unsigned long dirtied;
>>   	unsigned long written;
>> +	unsigned long writeback;
>>   
>>   	/*
>>   	 * rate-limit, only update once every 200ms.
>> @@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>>   
>>   	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>>   	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>> +	writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
>>   
>>   	/*
>>   	 * Skip quiet periods when disk bandwidth is under-utilized.
>> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>>   	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>>   		goto snapshot;
>>   
>> +	/*
>> +	 * Skip periods when backing dev was idle due to absence of pages
>> +	 * under writeback (when over_bground_thresh() returns false)
>> +	 */
>> +	if (test_bit(BDI_idle, &bdi->state) &&
>> +	    bdi->writeback_nr_stamp == writeback)
>> +		goto snapshot;
>> +
>    Hum, I understand the desire behind BDI_idle but that seems to be solving
> only a special case, isn't it? When there is small traffic on the bdi, the
> bandwidth will get updated anyway and the computed bandwidth will go down
> from the real maximum value when there are enough pages to write. So I'm not
> sure how much this really helps. Plus this 'idle' logic seems completely
> independent to balance_dirty_pages() tweaks so it would be better done as a
> separate patch (in case you have convincing reasons we really need that
> logic).

Yes, I'll move it to a separate patch. The rationale behind BDI_idle looks 
like this:

> BDI_idle along with BDI_WRITTEN_BACK exists to distinguish two cases:
>
> 1st. BDI_WRITTEN has not been incremented since we looked at it last
> time because the backing dev is unresponsive. I.e. it had some pages under
> writeback but it has not made any progress for some reason.
>
> 2nd. BDI_WRITTEN has not been incremented since we looked at it last
> time because backing dev had nothing to do. I.e. there are some dirty
> pages on bdi, but they have not been passed to backing dev yet. This is
> the case when bdi_dirty is under bdi background threshold and flusher
> refrains from flushing even if we woke it up explicitly by
> bdi_start_background_writeback.
>
> We have to skip bdi_update_write_bandwidth() in the 2nd case because
> otherwise bdi_update_write_bandwidth() will see written==0 and
> mistakenly decrease write_bandwidth. The criterion to skip is the
> following: BDI_idle is set (i.e. there were no pages under writeback
> when we looked at the bdi last time) && BDI_WRITTEN_BACK counter has not
> changed (i.e. no new pages have been sent to writeback since we looked at
> the bdi last time).

Notice that BDI_idle is useful right now only for the strictlimit feature 
because, without strictlimit, we do not call bdi_update_write_bandwidth() 
until "dirty" exceeds "freerun", and over_bground_thresh() guarantees 
that the flusher has passed some amount of work to the backing dev by then.

>
>>   	if (thresh) {
>>   		global_update_bandwidth(thresh, dirty, now);
>>   		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
>>   					   bdi_thresh, bdi_dirty,
>> -					   dirtied, elapsed);
>> +					   dirtied, elapsed, strictlimit);
>>   	}
>>   	bdi_update_write_bandwidth(bdi, elapsed, written);
>>   
>>   snapshot:
>> -	bdi->dirtied_stamp = dirtied;
>> -	bdi->written_stamp = written;
>> +	bdi->dirtied_nr_stamp = dirtied;
>> +	bdi->written_nr_stamp = written;
>>   	bdi->bw_time_stamp = now;
>> +
>> +	bdi->writeback_nr_stamp = writeback;
>> +	if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
>> +		set_bit(BDI_idle, &bdi->state);
>> +	else
>> +		clear_bit(BDI_idle, &bdi->state);
>>   }
>>   
>>   static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>> @@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>>   				 unsigned long dirty,
>>   				 unsigned long bdi_thresh,
>>   				 unsigned long bdi_dirty,
>> -				 unsigned long start_time)
>> +				 unsigned long start_time,
>> +				 bool strictlimit)
>>   {
>>   	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>>   		return;
>>   	spin_lock(&bdi->wb.list_lock);
>>   	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
>> -			       bdi_thresh, bdi_dirty, start_time);
>> +			       bdi_thresh, bdi_dirty, start_time, strictlimit);
>>   	spin_unlock(&bdi->wb.list_lock);
>>   }
>>   
>> @@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>>   	unsigned long dirty_ratelimit;
>>   	unsigned long pos_ratio;
>>   	struct backing_dev_info *bdi = mapping->backing_dev_info;
>> +	bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
>>   	unsigned long start_time = jiffies;
>>   
>>   	for (;;) {
>> @@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>>   		 */
>>   		freerun = dirty_freerun_ceiling(dirty_thresh,
>>   						background_thresh);
>> -		if (nr_dirty <= freerun) {
>> +		if (nr_dirty <= freerun  && !strictlimit) {
>>   			current->dirty_paused_when = now;
>>   			current->nr_dirtied = 0;
>>   			current->nr_dirtied_pause =
>    I'd rather change this to check bdi_dirty <= bdi_freerun in strictlimit
> case.

An excellent idea, thank you very much! Some complication comes from the 
fact that bdi_dirty and bdi_freerun have not been calculated at that point, 
but we can check bdi_dirty <= bdi_freerun a bit later -- that shouldn't 
make a big difference.

>
>> @@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>>   			break;
>>   		}
>>   
>> -		if (unlikely(!writeback_in_progress(bdi)))
>> +		if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
>>   			bdi_start_background_writeback(bdi);
>    This can then go away.

Yes.

>
>>   		/*
>> @@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
>>   				    bdi_stat(bdi, BDI_WRITEBACK);
>>   		}
>>   
>> +		if (unlikely(!writeback_in_progress(bdi)) &&
>> +		    bdi_dirty > bdi_thresh / 4)
>> +			bdi_start_background_writeback(bdi);
>> +
>    Why is this?

We attempted to avoid kicking the flusher every time we dive into 
balance_dirty_pages(). This will surely go away.

>
>>   		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
>> -				  (nr_dirty > dirty_thresh);
>> +				 ((nr_dirty > dirty_thresh) || strictlimit);
>>   		if (dirty_exceeded && !bdi->dirty_exceeded)
>>   			bdi->dirty_exceeded = 1;
>>   
>>   		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>>   				     nr_dirty, bdi_thresh, bdi_dirty,
>> -				     start_time);
>> +				     start_time, strictlimit);
>>   
>>   		dirty_ratelimit = bdi->dirty_ratelimit;
>>   		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
>>   					       background_thresh, nr_dirty,
>> -					       bdi_thresh, bdi_dirty);
>> +					       bdi_thresh, bdi_dirty,
>> +					       strictlimit);
>>   		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
>>   							RATELIMIT_CALC_SHIFT;
>>   		max_pause = bdi_max_pause(bdi, bdi_dirty);
>> @@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>>   		}
>>   
>>   pause:
>> +		if (unlikely(!writeback_in_progress(bdi)))
>> +			bdi_start_background_writeback(bdi);
>>   		trace_balance_dirty_pages(bdi,
>>   					  dirty_thresh,
>>   					  background_thresh,
>    And this shouldn't be necessary either after updating the freerun test
> properly.

Yep!

Thanks,
Maxim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 07/16] fuse: Trust kernel i_mtime only -v2
  2013-06-29 17:44 ` [PATCH 07/16] fuse: Trust kernel i_mtime only Maxim Patlasov
@ 2013-07-11 16:14   ` Maxim Patlasov
  0 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-11 16:14 UTC (permalink / raw)
  Cc: dev, xemul, fuse-devel, bfoster, miklos, linux-kernel,
	jbottomley, viro, linux-fsdevel, devel

Let the kernel maintain i_mtime locally:
 - clear S_NOCMTIME
 - implement i_op->update_time()
 - flush mtime on fsync and last close
 - update i_mtime explicitly on truncate and fallocate

Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime
should be flushed to the server eventually.

Changed in v2 (thanks to Brian):
 - renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
 - simplified fuse_set_mtime_local()
 - abandoned optimizations: clearing the flag on some operations (like
   direct write) is doable, but may lead to unexpected changes of
   user-visible mtime.

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/dir.c    |  109 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/fuse/file.c   |   30 +++++++++++++--
 fs/fuse/fuse_i.h |    6 ++-
 fs/fuse/inode.c  |   13 +++++-
 4 files changed, 138 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b755884..b834e46 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -854,8 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	struct fuse_conn *fc = get_fuse_conn(inode);
 
 	/* see the comment in fuse_change_attributes() */
-	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+	if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
 		attr->size = i_size_read(inode);
+		attr->mtime = inode->i_mtime.tv_sec;
+		attr->mtimensec = inode->i_mtime.tv_nsec;
+	}
 
 	stat->dev = inode->i_sb->s_dev;
 	stat->ino = attr->ino;
@@ -1565,6 +1568,82 @@ void fuse_release_nowrite(struct inode *inode)
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+			      struct inode *inode,
+			      struct fuse_setattr_in *inarg_p,
+			      struct fuse_attr_out *outarg_p)
+{
+	req->in.h.opcode = FUSE_SETATTR;
+	req->in.h.nodeid = get_node_id(inode);
+	req->in.numargs = 1;
+	req->in.args[0].size = sizeof(*inarg_p);
+	req->in.args[0].value = inarg_p;
+	req->out.numargs = 1;
+	if (fc->minor < 9)
+		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+	else
+		req->out.args[0].size = sizeof(*outarg_p);
+	req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+	struct inode *inode = file->f_mapping->host;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_req *req = NULL;
+	struct fuse_setattr_in inarg;
+	struct fuse_attr_out outarg;
+	int err;
+
+	if (nofail) {
+		req = fuse_get_req_nofail_nopages(fc, file);
+	} else {
+		req = fuse_get_req_nopages(fc);
+		if (IS_ERR(req))
+			return PTR_ERR(req);
+	}
+
+	memset(&inarg, 0, sizeof(inarg));
+	memset(&outarg, 0, sizeof(outarg));
+
+	inarg.valid |= FATTR_MTIME;
+	inarg.mtime = inode->i_mtime.tv_sec;
+	inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+	fuse_request_send(fc, req);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+
+	if (!err)
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+
+	return err;
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+	unsigned ivalid = iattr->ia_valid;
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
+		/* client fs has just set mtime to iattr->ia_mtime */
+		inode->i_mtime = iattr->ia_mtime;
+		clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	} else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
+		/* client fs doesn't know that we're updating i_mtime */
+		inode->i_mtime = current_fs_time(inode->i_sb);
+		set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+	}
+}
+
 /*
  * Set attributes, and at the same time refresh them.
  *
@@ -1621,17 +1700,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		inarg.valid |= FATTR_LOCKOWNER;
 		inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
 	}
-	req->in.h.opcode = FUSE_SETATTR;
-	req->in.h.nodeid = get_node_id(inode);
-	req->in.numargs = 1;
-	req->in.args[0].size = sizeof(inarg);
-	req->in.args[0].value = &inarg;
-	req->out.numargs = 1;
-	if (fc->minor < 9)
-		req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
-	else
-		req->out.args[0].size = sizeof(outarg);
-	req->out.args[0].value = &outarg;
+	fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
 	fuse_request_send(fc, req);
 	err = req->out.h.error;
 	fuse_put_request(fc, req);
@@ -1648,6 +1717,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 	}
 
 	spin_lock(&fc->lock);
+	/* the kernel maintains i_mtime locally */
+	if (fc->writeback_cache && S_ISREG(inode->i_mode))
+		fuse_set_mtime_local(attr, inode);
+
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
@@ -1872,6 +1945,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
 	return err;
 }
 
+static int fuse_update_time(struct inode *inode, struct timespec *now,
+			    int flags)
+{
+	if (flags & S_MTIME) {
+		inode->i_mtime = *now;
+		set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
+		BUG_ON(!S_ISREG(inode->i_mode));
+	}
+	return 0;
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1911,6 +1995,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.getxattr	= fuse_getxattr,
 	.listxattr	= fuse_listxattr,
 	.removexattr	= fuse_removexattr,
+	.update_time	= fuse_update_time,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 98bc0d0..09a9eeb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
+		fuse_flush_mtime(file, true);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 
 	fuse_sync_writes(inode);
 
+	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
+		int err = fuse_flush_mtime(file, false);
+		if (err)
+			goto out;
+	}
+
 	req = fuse_get_req_nopages(fc);
 	if (IS_ERR(req)) {
 		err = PTR_ERR(req);
@@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
 	return req->misc.write.out.size;
 }
 
-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool ret = false;
 
 	spin_lock(&fc->lock);
 	fi->attr_version = ++fc->attr_version;
-	if (pos > inode->i_size)
+	if (pos > inode->i_size) {
 		i_size_write(inode, pos);
+		ret = true;
+	}
 	spin_unlock(&fc->lock);
+
+	return ret;
 }
 
 static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2553,8 +2567,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
 		goto out;
 
 	/* we could have extended the file */
-	if (!(mode & FALLOC_FL_KEEP_SIZE))
-		fuse_write_update_size(inode, offset + length);
+	if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+		bool changed = fuse_write_update_size(inode, offset + length);
+
+		if (changed && fc->writeback_cache) {
+			struct fuse_inode *fi = get_fuse_inode(inode);
+
+			inode->i_mtime = current_fs_time(inode->i_sb);
+			set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+		}
+	}
 
 	if (mode & FALLOC_FL_PUNCH_HOLE)
 		truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a0062a0..c192e6f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
 enum {
 	/** Advise readdirplus  */
 	FUSE_I_ADVISE_RDPLUS,
+	/** i_mtime has been updated locally; a flush to userspace needed */
+	FUSE_I_MTIME_DIRTY,
 };
 
 struct fuse_conn;
@@ -867,7 +869,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
 unsigned fuse_file_poll(struct file *file, poll_table *wait);
 int fuse_dev_release(struct inode *inode, struct file *file);
 
-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);
 
 int fuse_do_setattr(struct inode *inode, struct iattr *attr,
 		    struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 121638d..591b46c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
 	inode->i_blocks  = attr->blocks;
 	inode->i_atime.tv_sec   = attr->atime;
 	inode->i_atime.tv_nsec  = attr->atimensec;
-	inode->i_mtime.tv_sec   = attr->mtime;
-	inode->i_mtime.tv_nsec  = attr->mtimensec;
+	/* mtime from server may be stale due to local buffered write */
+	if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+		inode->i_mtime.tv_sec   = attr->mtime;
+		inode->i_mtime.tv_nsec  = attr->mtimensec;
+	}
 	inode->i_ctime.tv_sec   = attr->ctime;
 	inode->i_ctime.tv_nsec  = attr->ctimensec;
 
@@ -249,6 +252,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
 {
 	inode->i_mode = attr->mode & S_IFMT;
 	inode->i_size = attr->size;
+	inode->i_mtime.tv_sec  = attr->mtime;
+	inode->i_mtime.tv_nsec = attr->mtimensec;
 	if (S_ISREG(inode->i_mode)) {
 		fuse_init_common(inode);
 		fuse_init_file_inode(inode);
@@ -295,7 +300,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 		return NULL;
 
 	if ((inode->i_state & I_NEW)) {
-		inode->i_flags |= S_NOATIME|S_NOCMTIME;
+		inode->i_flags |= S_NOATIME;
+		if (!fc->writeback_cache)
+			inode->i_flags |= S_NOCMTIME;
 		inode->i_generation = generation;
 		inode->i_data.backing_dev_info = &fc->bdi;
 		fuse_init_inode(inode, attr);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 08/16] fuse: Flush files on wb close -v2
  2013-06-29 17:45 ` [PATCH 08/16] fuse: Flush files on wb close Maxim Patlasov
@ 2013-07-11 16:18   ` Maxim Patlasov
  0 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-07-11 16:18 UTC (permalink / raw)
  Cc: dev, xemul, fuse-devel, bfoster, miklos, linux-kernel,
	jbottomley, viro, linux-fsdevel, devel

From: Pavel Emelyanov <xemul@openvz.org>

Any write request requires a file handle to report to userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays the actual close until all fuse
writeback has completed.

When the write-back cache is turned off, the flush is ensured by fuse_vma_close().

Changed in v2:
 - updated patch to be applied smoothly on top of
   "Trust kernel i_mtime only -v2".

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
---
 fs/fuse/file.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 09a9eeb..3778de1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)
 
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	/* see fuse_vma_close() for !writeback_cache case */
+	if (fc->writeback_cache)
+		filemap_write_and_wait(file->f_mapping);
+
 	if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
 		fuse_flush_mtime(file, true);
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-06-29 17:45 ` [PATCH 10/16] fuse: Implement writepages callback Maxim Patlasov
@ 2013-07-19 16:50   ` Miklos Szeredi
  2013-08-02 15:40     ` Maxim Patlasov
  0 siblings, 1 reply; 35+ messages in thread
From: Miklos Szeredi @ 2013-07-19 16:50 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
> From: Pavel Emelyanov <xemul@openvz.org>
> 
> The .writepages one is required to make each writeback request carry more than
> one page on it. The patch enables optimized behaviour unconditionally,
> i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

I rewrote this a bit, so we won't have to do the thing in two passes, which
makes it simpler and more robust.  Waiting for page writeback here is wrong
anyway, see comment above fuse_page_mkwrite().  BTW we had a race there because
fuse_page_mkwrite() didn't take the page lock.  I've also fixed that up and
pushed a series containing these patches up to implementing ->writepages() to

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git writepages

Passed some trivial testing but more is needed.

I'll get to the rest of the patches next week.

Thanks,
Miklos


Subject: fuse: Implement writepages callback
From: Pavel Emelyanov <xemul@openvz.org>
Date: Sat, 29 Jun 2013 21:45:29 +0400

The .writepages one is required to make each writeback request carry more than
one page on it. The patch enables optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

[SzM: simplify, add comments]

Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/fuse/file.c |  139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1579,6 +1579,144 @@ static int fuse_writepage(struct page *p
 	return err;
 }
 
+struct fuse_fill_wb_data {
+	struct fuse_req *req;
+	struct fuse_file *ff;
+	struct inode *inode;
+};
+
+static void fuse_writepages_send(struct fuse_fill_wb_data *data)
+{
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	req->ff = fuse_file_get(data->ff);
+	spin_lock(&fc->lock);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(inode);
+	spin_unlock(&fc->lock);
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_wb_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct page *tmp_page;
+	int err;
+
+	if (req) {
+		BUG_ON(!req->num_pages);
+		if (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+		    (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+		    req->pages[req->num_pages - 1]->index + 1 != page->index) {
+
+			fuse_writepages_send(data);
+			data->req = NULL;
+		}
+	}
+	err = -ENOMEM;
+	tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	if (!tmp_page)
+		goto out_unlock;
+
+	/*
+	 * The page must not be redirtied until the writeout is completed
+	 * (i.e. userspace has sent a reply to the write request).  Otherwise
+	 * there could be more than one temporary page instance for each real
+	 * page.
+	 *
+	 * This is ensured by holding the page lock in page_mkwrite() while
+	 * checking fuse_page_is_writeback().  We already hold the page lock
+	 * since clear_page_dirty_for_io() and keep it held until we add the
+	 * request to the fi->writepages list and increment req->num_pages.
+	 * After this fuse_page_is_writeback() will indicate that the page is
+	 * under writeback, so we can release the page lock.
+	 */
+	if (data->req == NULL) {
+		struct fuse_inode *fi = get_fuse_inode(inode);
+
+		err = -ENOMEM;
+		req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+		if (!req) {
+			__free_page(tmp_page);
+			goto out_unlock;
+		}
+
+		fuse_write_fill(req, data->ff, page_offset(page), 0);
+		req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+		req->in.argpages = 1;
+		req->background = 1;
+		req->num_pages = 0;
+		req->end = fuse_writepage_end;
+		req->inode = inode;
+
+		spin_lock(&fc->lock);
+		list_add(&req->writepages_entry, &fi->writepages);
+		spin_unlock(&fc->lock);
+
+		data->req = req;
+	}
+	set_page_writeback(page);
+
+	copy_highpage(tmp_page, page);
+	req->pages[req->num_pages] = tmp_page;
+	req->page_descs[req->num_pages].offset = 0;
+	req->page_descs[req->num_pages].length = PAGE_SIZE;
+
+	inc_bdi_stat(page->mapping->backing_dev_info, BDI_WRITEBACK);
+	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+	end_page_writeback(page);
+
+	/*
+	 * Protected by fc->lock against concurrent access by
+	 * fuse_page_is_writeback().
+	 */
+	spin_lock(&fc->lock);
+	req->num_pages++;
+	spin_unlock(&fc->lock);
+
+	err = 0;
+out_unlock:
+	unlock_page(page);
+
+	return err;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_wb_data data;
+	int err;
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.req = NULL;
+	data.inode = inode;
+	data.ff = fuse_write_file(fc, get_fuse_inode(inode));
+	if (!data.ff)
+		goto out;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (data.req) {
+		/* Ignore errors if we can write at least one page */
+		BUG_ON(!data.req->num_pages);
+		fuse_writepages_send(&data);
+		err = 0;
+	}
+	fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2589,6 +2727,7 @@ static const struct file_operations fuse
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-07-19 16:50   ` Miklos Szeredi
@ 2013-08-02 15:40     ` Maxim Patlasov
  2013-08-06 16:25       ` Miklos Szeredi
  0 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-08-02 15:40 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: riel, dev, xemul, fuse-devel, bfoster, linux-kernel, jbottomley,
	linux-mm, viro, linux-fsdevel, akpm, fengguang.wu, devel,
	mgorman

On 07/19/2013 08:50 PM, Miklos Szeredi wrote:
> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>> From: Pavel Emelyanov <xemul@openvz.org>
>>
>> The .writepages one is required to make each writeback request carry more than
>> one page on it. The patch enables optimized behaviour unconditionally,
>> i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.
> I rewrote this a bit, so we won't have to do the thing in two passes, which
> makes it simpler and more robust.  Waiting for page writeback here is wrong
> anyway, see comment above fuse_page_mkwrite().  BTW we had a race there because
> fuse_page_mkwrite() didn't take the page lock.  I've also fixed that up and
> pushed a series containing these patches up to implementing ->writepages() to
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git writepages
>
> Passed some trivial testing but more is needed.

Thanks a lot for your efforts. The approach you implemented looks promising, 
but it introduces the following assumption: a page cannot become dirty 
before we have a chance to wait on fuse writeback while holding the page 
locked. This is already true for mmap-ed writes (due to your fixes) and 
it seems doable for cached writes as well (like we do in 
fuse_perform_write). But the assumption seems to be broken in the case of a 
direct read from a local fs (e.g. ext4) into a memory region mmap-ed to a 
file on a fuse fs. See how dio_bio_submit() marks pages dirty via 
bio_set_pages_dirty(). I can't see any solution for this use-case. Do you?

Thanks,
Maxim

>
> I'll get to the rest of the patches next week.
>
> Thanks,
> Miklos
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-02 15:40     ` Maxim Patlasov
@ 2013-08-06 16:25       ` Miklos Szeredi
  2013-08-06 16:26         ` Eric Boxer
  2013-08-09 15:02         ` Maxim Patlasov
  0 siblings, 2 replies; 35+ messages in thread
From: Miklos Szeredi @ 2013-08-06 16:25 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

On Fri, Aug 2, 2013 at 5:40 PM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
> On 07/19/2013 08:50 PM, Miklos Szeredi wrote:
>
>> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>>>
>>> From: Pavel Emelyanov <xemul@openvz.org>
>>>
>>> The .writepages one is required to make each writeback request carry more
>>> than
>>> one page on it. The patch enables optimized behaviour unconditionally,
>>> i.e. mmap-ed writes will benefit from the patch even if
>>> fc->writeback_cache=0.
>>
>> I rewrote this a bit, so we won't have to do the thing in two passes,
>> which
>> makes it simpler and more robust.  Waiting for page writeback here is
>> wrong
>> anyway, see comment above fuse_page_mkwrite().  BTW we had a race there
>> because
>> fuse_page_mkwrite() didn't take the page lock.  I've also fixed that up
>> and
>> pushed a series containing these patches up to implementing ->writepages()
>> to
>>
>>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>> writepages
>>
>> Passed some trivial testing but more is needed.
>
>
> Thanks a lot for efforts. The approach you implemented looks promising, but
> it introduces the following assumption: a page cannot become dirty before we
> have a chance to wait on fuse writeback holding the page locked. This is
> already true for mmap-ed writes (due to your fixes) and it seems doable for
> cached writes as well (like we do in fuse_perform_write). But the assumption
> seems to be broken in case of direct read from local fs (e.g. ext4) to a
> memory region mmap-ed to a file on fuse fs. See how dio_bio_submit() marks
> pages dirty by bio_set_pages_dirty(). I can't see any solution for this
> use-case. Do you?

Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
do the necessary page fault magic and ->page_mkwrite() will be called.
At least AFAICS.

The page cannot become dirty through a memory mapping without first
switching the pte from read-only to read-write.  Page accounting
logic relies on this too.  The other way the page can become dirty is
through write(2) on the fs.  But we do get notified about that too.

Thanks,
Miklos



* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-06 16:25       ` Miklos Szeredi
@ 2013-08-06 16:26         ` Eric Boxer
  2013-08-09 15:02         ` Maxim Patlasov
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Boxer @ 2013-08-06 16:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: James Bottomley, devel, Kirill Korotaev, Brian Foster, linux-mm,
	Kernel Mailing List, Linux-Fsdevel, fuse-devel, riel,
	Pavel Emelianov, Al Viro, Maxim Patlasov, Andrew Morton,
	fengguang wu, Mel Gorman

[-- Attachment #1: Type: text/plain, Size: 2576 bytes --]



Ok

On Fri, Aug 2, 2013 at 5:40 PM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
> 07/19/2013 08:50 PM, Miklos Szeredi wrote:
>
>> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>>>
>>> From: Pavel Emelyanov <xemul@openvz.org>
>>>
>>> The .writepages one is required to make each writeback request carry more
>>> than
>>> one page on it. The patch enables optimized behaviour unconditionally,
>>> i.e. mmap-ed writes will benefit from the patch even if
>>> fc->writeback_cache=0.
>>
>> I rewrote this a bit, so we won't have to do the thing in two passes,
>> which
>> makes it simpler and more robust.  Waiting for page writeback here is
>> wrong
>> anyway, see comment above fuse_page_mkwrite().  BTW we had a race there
>> because
>> fuse_page_mkwrite() didn't take the page lock.  I've also fixed that up
>> and
>> pushed a series containing these patches up to implementing ->writepages()
>> to
>>
>>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>> writepages
>>
>> Passed some trivial testing but more is needed.
>
>
> Thanks a lot for efforts. The approach you implemented looks promising, but
> it introduces the following assumption: a page cannot become dirty before we
> have a chance to wait on fuse writeback holding the page locked. This is
> already true for mmap-ed writes (due to your fixes) and it seems doable for
> cached writes as well (like we do in fuse_perform_write). But the assumption
> seems to be broken in case of direct read from local fs (e.g. ext4) to a
> memory region mmap-ed to a file on fuse fs. See how dio_bio_submit() marks
> pages dirty by bio_set_pages_dirty(). I can't see any solution for this
> use-case. Do you?

Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
do the necessary page fault magic and ->page_mkwrite() will be called.
At least AFAICS.

The page cannot become dirty through a memory mapping without first
switching the pte from read-only to read-write.  Page accounting
logic relies on this too.  The other way the page can become dirty is
through write(2) on the fs.  But we do get notified about that too.

Thanks,
Miklos

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-06 16:25       ` Miklos Szeredi
  2013-08-06 16:26         ` Eric Boxer
@ 2013-08-09 15:02         ` Maxim Patlasov
  2013-08-30 10:12           ` Miklos Szeredi
  1 sibling, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-08-09 15:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

Hi Miklos,

08/06/2013 08:25 PM, Miklos Szeredi wrote:
> On Fri, Aug 2, 2013 at 5:40 PM, Maxim Patlasov <mpatlasov@parallels.com> wrote:
>> 07/19/2013 08:50 PM, Miklos Szeredi wrote:
>>
>>> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>>>> From: Pavel Emelyanov <xemul@openvz.org>
>>>>
>>>> The .writepages one is required to make each writeback request carry more
>>>> than
>>>> one page on it. The patch enables optimized behaviour unconditionally,
>>>> i.e. mmap-ed writes will benefit from the patch even if
>>>> fc->writeback_cache=0.
>>> I rewrote this a bit, so we won't have to do the thing in two passes,
>>> which
>>> makes it simpler and more robust.  Waiting for page writeback here is
>>> wrong
>>> anyway, see comment above fuse_page_mkwrite().  BTW we had a race there
>>> because
>>> fuse_page_mkwrite() didn't take the page lock.  I've also fixed that up
>>> and
>>> pushed a series containing these patches up to implementing ->writepages()
>>> to
>>>
>>>     git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>>> writepages
>>>
>>> Passed some trivial testing but more is needed.
>>
>> Thanks a lot for efforts. The approach you implemented looks promising, but
>> it introduces the following assumption: a page cannot become dirty before we
>> have a chance to wait on fuse writeback holding the page locked. This is
>> already true for mmap-ed writes (due to your fixes) and it seems doable for
>> cached writes as well (like we do in fuse_perform_write). But the assumption
>> seems to be broken in case of direct read from local fs (e.g. ext4) to a
>> memory region mmap-ed to a file on fuse fs. See how dio_bio_submit() marks
>> pages dirty by bio_set_pages_dirty(). I can't see any solution for this
>> use-case. Do you?
> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
> do the necessary page fault magic and ->page_mkwrite() will be called.
> At least AFAICS.

Yes, I agree.

>
> The page cannot become dirty through a memory mapping without first
> switching the pte from read-only to read-write first.  Page accounting
> logic relies on this too.  The other way the page can become dirty is
> through write(2) on the fs.  But we do get notified about that too.

Yes, that's correct, but I don't understand why you disregard two other 
cases of marking a page dirty (both related to a direct AIO read from a file 
into a memory region mmap-ed to a fuse file):

1. dio_bio_submit() -->
       bio_set_pages_dirty() -->
         set_page_dirty_lock()

2. dio_bio_complete() -->
       bio_check_pages_dirty() -->
          bio_dirty_fn() -->
             bio_set_pages_dirty() -->
                set_page_dirty_lock()

As soon as a page becomes dirty through a memory mapping (exactly as you 
explained), nothing would prevent it from being written back. And fuse will 
call end_page_writeback almost immediately after copying the real page to a 
temporary one. Then dio_bio_submit may re-dirty the page speculatively 
without notifying fuse. And again, from then on nothing would prevent it 
from being written back once more. Hence we can end up with more than one 
temporary page in fuse write-back. And there is a similar concern for the 
dio_bio_complete() re-dirty.

This makes me think that we do need fuse_page_is_writeback() in 
fuse_writepages_fill(). But it shouldn't be harmful because it will be a 
no-op practically always, due to waiting for fuse writeback in 
->page_mkwrite() and in the course of handling write(2).

Thanks,
Maxim



* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-09 15:02         ` Maxim Patlasov
@ 2013-08-30 10:12           ` Miklos Szeredi
  2013-08-30 14:50             ` Maxim Patlasov
  0 siblings, 1 reply; 35+ messages in thread
From: Miklos Szeredi @ 2013-08-30 10:12 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
> >Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
> >do the necessary page fault magic and ->page_mkwrite() will be called.
> >At least AFAICS.
> 
> Yes, I agree.
> 
> >
> >The page cannot become dirty through a memory mapping without first
> >switching the pte from read-only to read-write first.  Page accounting
> >logic relies on this too.  The other way the page can become dirty is
> >through write(2) on the fs.  But we do get notified about that too.
> 
> Yes, that's correct, but I don't understand why you disregard two
> other cases of marking page dirty (both related to direct AIO read
> from a file to a memory region mmap-ed to a fuse file):
> 
> 1. dio_bio_submit() -->
>       bio_set_pages_dirty() -->
>         set_page_dirty_lock()
> 
> 2. dio_bio_complete() -->
>       bio_check_pages_dirty() -->
>          bio_dirty_fn() -->
>             bio_set_pages_dirty() -->
>                set_page_dirty_lock()
> 
> As soon as a page became dirty through a memory mapping (exactly as
> you explained), nothing would prevent it to be written-back. And
> fuse will call end_page_writeback almost immediately after copying
> the real page to a temporary one. Then dio_bio_submit may re-dirty
> page speculatively w/o notifying fuse. And again, since then nothing
> would prevent it to be written-back once more. Hence we can end up
> in more than one temporary page in fuse write-back. And similar
> concern for dio_bio_complete() re-dirty.
> 
> This makes me think that we do need fuse_page_is_writeback() in
> fuse_writepages_fill(). But it shouldn't be harmful because it will
> no-op practically always due to waiting for fuse writeback in
> ->page_mkwrite() and in course of handling write(2).

The problem is: if we need it in ->writepages, we need it in ->writepage too.
And that's where we can't have it because it would deadlock in reclaim.

There's a way to work around this:

   - if the request is still in queue, just update it with the contents of the
     new page

   - if the request is already in userspace, create a new request, but only let
     userspace have it once the previous request for the same page completes, so
     the ordering is not messed up

But that's a lot of hairy code.

Any other ideas?

The best would be if we could get rid of the ugly temporary page requirement for
fuse writeback.  The big blocker has always been direct reclaim: allocation must
not wait on fuse writebacks.  AFAICS there's still a wait_on_page_writeback() in
relation to memcg.  And it interacts with page migration which also has them.
Those are the really difficult ones...

The other offender, balance_dirty_pages(), is much easier and needs to be taught
about fuse behavior anyway.

Thoughts?

Thanks,
Miklos



* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-30 10:12           ` Miklos Szeredi
@ 2013-08-30 14:50             ` Maxim Patlasov
  2013-09-03 10:31               ` Miklos Szeredi
  0 siblings, 1 reply; 35+ messages in thread
From: Maxim Patlasov @ 2013-08-30 14:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

Hi Miklos,

08/30/2013 02:12 PM, Miklos Szeredi wrote:
> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
>>> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>> At least AFAICS.
>> Yes, I agree.
>>
>>> The page cannot become dirty through a memory mapping without first
>>> switching the pte from read-only to read-write first.  Page accounting
>>> logic relies on this too.  The other way the page can become dirty is
>>> through write(2) on the fs.  But we do get notified about that too.
>> Yes, that's correct, but I don't understand why you disregard two
>> other cases of marking page dirty (both related to direct AIO read
>> from a file to a memory region mmap-ed to a fuse file):
>>
>> 1. dio_bio_submit() -->
>>        bio_set_pages_dirty() -->
>>          set_page_dirty_lock()
>>
>> 2. dio_bio_complete() -->
>>        bio_check_pages_dirty() -->
>>           bio_dirty_fn() -->
>>              bio_set_pages_dirty() -->
>>                 set_page_dirty_lock()
>>
>> As soon as a page became dirty through a memory mapping (exactly as
>> you explained), nothing would prevent it to be written-back. And
>> fuse will call end_page_writeback almost immediately after copying
>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>> page speculatively w/o notifying fuse. And again, since then nothing
>> would prevent it to be written-back once more. Hence we can end up
>> in more than one temporary page in fuse write-back. And similar
>> concern for dio_bio_complete() re-dirty.
>>
>> This makes me think that we do need fuse_page_is_writeback() in
>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>> no-op practically always due to waiting for fuse writeback in
>> ->page_mkwrite() and in course of handling write(2).
> The problem is: if we need it in ->writepages, we need it in ->writepage too.
> And that's where we can't have it because it would deadlock in reclaim.

I thought we're protected from the deadlock by the following chunk (in 
the very beginning of fuse_writepage):

> +	if (fuse_page_is_writeback(inode, page->index)) {
> +		if (wbc->sync_mode != WB_SYNC_ALL) {
> +			redirty_page_for_writepage(wbc, page);
> +			return 0;
> +		}
> +		fuse_wait_on_page_writeback(inode, page->index);
> +	}

Because the reclaimer will never call us with WB_SYNC_ALL. Did I miss 
something?

>
> There's a way to work around this:
>
>     - if the request is still in queue, just update it with the contents of the
>       new page
>
>     - if the request is already in userspace, create a new request, but only let
>       userspace have it once the previous request for the same page completes, so
>       the ordering is not messed up
>
> But that's a lot of hairy code.

Is that exactly how NFS solves a similar problem?

>
> Any other ideas?
>
> The best would be if we could get rid of the ugly temporary page requirement for
> fuse writeback.  The big blocker has always been direct reclaim: allocation must
> not wait on fuse writebacks.  AFAICS there's still a wait_on_page_writeback() in
> relation to memcg.  And it interacts with page migration which also has them.
> Those are the really difficult ones...

Yes, I agree. I think there are quite a few reasons not to keep the original 
page under writeback for long, and not to make that dependent on a userspace 
process either.

>
> The other offender, balance_dirty_pages() is much easier and needs to be tought
> about fuse behavior anyway.

BTW, the strictlimit feature (including enabling it for fuse) is already in 
linux-next.

Thanks,
Maxim



* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-08-30 14:50             ` Maxim Patlasov
@ 2013-09-03 10:31               ` Miklos Szeredi
  2013-09-03 16:02                 ` Maxim Patlasov
  0 siblings, 1 reply; 35+ messages in thread
From: Miklos Szeredi @ 2013-09-03 10:31 UTC (permalink / raw)
  To: Maxim Patlasov
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
> Hi Miklos,
> 
> 08/30/2013 02:12 PM, Miklos Szeredi wrote:
> >On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
> >>08/06/2013 08:25 PM, Miklos Szeredi wrote:
> >>>Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
> >>>do the necessary page fault magic and ->page_mkwrite() will be called.
> >>>At least AFAICS.
> >>Yes, I agree.
> >>
> >>>The page cannot become dirty through a memory mapping without first
> >>>switching the pte from read-only to read-write first.  Page accounting
> >>>logic relies on this too.  The other way the page can become dirty is
> >>>through write(2) on the fs.  But we do get notified about that too.
> >>Yes, that's correct, but I don't understand why you disregard two
> >>other cases of marking page dirty (both related to direct AIO read
> >>from a file to a memory region mmap-ed to a fuse file):
> >>
> >>1. dio_bio_submit() -->
> >>       bio_set_pages_dirty() -->
> >>         set_page_dirty_lock()
> >>
> >>2. dio_bio_complete() -->
> >>       bio_check_pages_dirty() -->
> >>          bio_dirty_fn() -->
> >>             bio_set_pages_dirty() -->
> >>                set_page_dirty_lock()
> >>
> >>As soon as a page became dirty through a memory mapping (exactly as
> >>you explained), nothing would prevent it to be written-back. And
> >>fuse will call end_page_writeback almost immediately after copying
> >>the real page to a temporary one. Then dio_bio_submit may re-dirty
> >>page speculatively w/o notifying fuse. And again, since then nothing
> >>would prevent it to be written-back once more. Hence we can end up
> >>in more than one temporary page in fuse write-back. And similar
> >>concern for dio_bio_complete() re-dirty.
> >>
> >>This makes me think that we do need fuse_page_is_writeback() in
> >>fuse_writepages_fill(). But it shouldn't be harmful because it will
> >>no-op practically always due to waiting for fuse writeback in
> >>->page_mkwrite() and in course of handling write(2).
> >The problem is: if we need it in ->writepages, we need it in ->writepage too.
> >And that's where we can't have it because it would deadlock in reclaim.
> 
> I thought we're protected from the deadlock by the following chunk
> (in the very beginning of fuse_writepage):
> 
> >+	if (fuse_page_is_writeback(inode, page->index)) {
> >+		if (wbc->sync_mode != WB_SYNC_ALL) {
> >+			redirty_page_for_writepage(wbc, page);
> >+			return 0;
> >+		}
> >+		fuse_wait_on_page_writeback(inode, page->index);
> >+	}
> 
> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss
> something?

Yeah, we could have that in ->writepage() too.  And apparently that would work,
reclaim would just leave us alone.

Then there's sync(2), which does do WB_SYNC_ALL.  Yet for an unprivileged fuse
mount we don't want ->writepages() to block, because that's quite a clear DoS
issue.

So we are left with this:

> >There's a way to work around this:
> >
> >    - if the request is still in queue, just update it with the contents of
> >      the new page
> >
> >    - if the request is already in userspace, create a new request, but only let
> >      userspace have it once the previous request for the same page
> >      completes, so the ordering is not messed up
> >
> >But that's a lot of hairy code.
> 
> Is that exactly how NFS solves a similar problem?

NFS will apparently just block if there's a request outstanding and we are in
WB_SYNC_ALL mode.  Which is somewhat simpler.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 10/16] fuse: Implement writepages callback
  2013-09-03 10:31               ` Miklos Szeredi
@ 2013-09-03 16:02                 ` Maxim Patlasov
  0 siblings, 0 replies; 35+ messages in thread
From: Maxim Patlasov @ 2013-09-03 16:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: riel, Kirill Korotaev, Pavel Emelianov, fuse-devel, Brian Foster,
	Kernel Mailing List, James Bottomley, linux-mm, Al Viro,
	Linux-Fsdevel, Andrew Morton, fengguang.wu, devel, Mel Gorman

09/03/2013 02:31 PM, Miklos Szeredi wrote:
> On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> 08/30/2013 02:12 PM, Miklos Szeredi wrote:
>>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>>>> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
>>>>> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
>>>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>>>> At least AFAICS.
>>>> Yes, I agree.
>>>>
>>>>> The page cannot become dirty through a memory mapping without first
>>>>> switching the pte from read-only to read-write first.  Page accounting
>>>>> logic relies on this too.  The other way the page can become dirty is
>>>>> through write(2) on the fs.  But we do get notified about that too.
>>>> Yes, that's correct, but I don't understand why you disregard two
>>>> other cases of marking page dirty (both related to direct AIO read
>>> >from a file to a memory region mmap-ed to a fuse file):
>>>> 1. dio_bio_submit() -->
>>>>        bio_set_pages_dirty() -->
>>>>          set_page_dirty_lock()
>>>>
>>>> 2. dio_bio_complete() -->
>>>>        bio_check_pages_dirty() -->
>>>>           bio_dirty_fn() -->
>>>>              bio_set_pages_dirty() -->
>>>>                 set_page_dirty_lock()
>>>>
>>>> As soon as a page became dirty through a memory mapping (exactly as
>>>> you explained), nothing would prevent it to be written-back. And
>>>> fuse will call end_page_writeback almost immediately after copying
>>>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>>>> page speculatively w/o notifying fuse. And again, since then nothing
>>>> would prevent it to be written-back once more. Hence we can end up
>>>> in more than one temporary page in fuse write-back. And similar
>>>> concern for dio_bio_complete() re-dirty.
>>>>
>>>> This makes me think that we do need fuse_page_is_writeback() in
>>>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>>>> no-op practically always due to waiting for fuse writeback in
>>>> ->page_mkwrite() and in course of handling write(2).
>>> The problem is: if we need it in ->writepages, we need it in ->writepage too.
>>> And that's where we can't have it because it would deadlock in reclaim.
>> I thought we're protected from the deadlock by the following chunk
>> (in the very beginning of fuse_writepage):
>>
>>> +	if (fuse_page_is_writeback(inode, page->index)) {
>>> +		if (wbc->sync_mode != WB_SYNC_ALL) {
>>> +			redirty_page_for_writepage(wbc, page);
>>> +			return 0;
>>> +		}
>>> +		fuse_wait_on_page_writeback(inode, page->index);
>>> +	}
>> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss
>> something?
> Yeah, we could have that in ->writepage() too.  And apparently that would work,
> reclaim would just leave us alone.
>
> Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse
> mount we don't want ->writepages() to block because that's a quite clear DoS
> issue.

Yes, I agree, but those cases (sync(2) hitting a page that is under fuse 
writeback originated by the flusher and has just been re-dirtied by a direct 
AIO read) should be very rare. I'd suggest going ahead and putting up with 
it for now: unprivileged users won't be able to use the writeback_cache 
option until the sysadmin enables allow_wbcache in fusermount.

>
> So we are left with this:

Yes. Could we implement that as a separate fix, after this patch-set is 
included?

>
>>> There's a way to work around this:
>>>
>>>     - if the request is still in queue, just update it with the contents of
>>>       the new page
>>>
>>>     - if the request is already in userspace, create a new request, but only let
>>>       userspace have it once the previous request for the same page
>>>       completes, so the ordering is not messed up
>>>
>>> But that's a lot of hairy code.
>> Is that exactly how NFS solves a similar problem?
> NFS will apparently just block if there's a request outstanding and we are in
> WB_SYNC_ALL mode.  Which is somewhat simpler.

Yes, indeed.

Thanks,
Maxim



end of thread, other threads:[~2013-09-03 16:02 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
2013-06-29 17:42 ` [PATCH 01/16] fuse: Linking file to inode helper Maxim Patlasov
2013-06-29 17:42 ` [PATCH 02/16] fuse: Getting file for writeback helper Maxim Patlasov
2013-06-29 17:42 ` [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
2013-06-29 17:42 ` [PATCH 05/16] fuse: Connection bit for enabling writeback Maxim Patlasov
2013-06-29 17:44 ` [PATCH 06/16] fuse: Trust kernel i_size only - v4 Maxim Patlasov
2013-06-29 17:44 ` [PATCH 07/16] fuse: Trust kernel i_mtime only Maxim Patlasov
2013-07-11 16:14   ` [PATCH 07/16] fuse: Trust kernel i_mtime only -v2 Maxim Patlasov
2013-06-29 17:45 ` [PATCH 08/16] fuse: Flush files on wb close Maxim Patlasov
2013-07-11 16:18   ` [PATCH 08/16] fuse: Flush files on wb close -v2 Maxim Patlasov
     [not found] ` <20130629172211.20175.70154.stgit-vWG5eQQidJHciZdyczg/7Q@public.gmane.org>
2013-06-29 17:42   ` [PATCH 03/16] fuse: Prepare to handle short reads Maxim Patlasov
2013-06-29 17:45   ` [PATCH 09/16] fuse: restructure fuse_readpage() Maxim Patlasov
2013-06-29 17:45 ` [PATCH 10/16] fuse: Implement writepages callback Maxim Patlasov
2013-07-19 16:50   ` Miklos Szeredi
2013-08-02 15:40     ` Maxim Patlasov
2013-08-06 16:25       ` Miklos Szeredi
2013-08-06 16:26         ` Eric Boxer
2013-08-09 15:02         ` Maxim Patlasov
2013-08-30 10:12           ` Miklos Szeredi
2013-08-30 14:50             ` Maxim Patlasov
2013-09-03 10:31               ` Miklos Szeredi
2013-09-03 16:02                 ` Maxim Patlasov
2013-06-29 17:45 ` [PATCH 11/16] fuse: Implement write_begin/write_end callbacks Maxim Patlasov
2013-06-29 17:46 ` [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
2013-06-29 17:46 ` [PATCH 13/16] fuse: fuse_flush() " Maxim Patlasov
2013-06-29 17:46 ` [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
2013-06-29 17:47 ` [PATCH 15/16] fuse: Turn writeback cache on Maxim Patlasov
2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
2013-07-01 21:16   ` Andrew Morton
2013-07-02  8:33     ` Maxim Patlasov
2013-07-02 17:44   ` [PATCH] mm: strictlimit feature -v2 Maxim Patlasov
2013-07-02 19:38     ` Andrew Morton
2013-07-03 11:01       ` Maxim Patlasov
2013-07-03 23:16         ` Jan Kara
2013-07-05 13:14           ` Maxim Patlasov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).