[PATCH 0/10] fuse: An attempt to implement a write-back cache policy

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
@ 2012-07-03 15:53 Pavel Emelyanov
  2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
                   ` (7 more replies)
  0 siblings, 8 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:53 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

Hi everyone.

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back policy.
With this file data are pushed to the userspace with big chunks (depending on the
dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
  This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with few RAM and doesn't make the system stuck (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare meta-data
updates (virtual machines' and containers' disk images). With the existing cache
policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
too much time to proceed, much longer than if it was simply scp-ed over the same
network. The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage concepts
details if required.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/10] fuse: Linking file to inode helper
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
@ 2012-07-03 15:53 ` Pavel Emelyanov
  2012-07-03 15:54 ` [PATCH 2/10] fuse: Getting file for writeback helper Pavel Emelyanov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:53 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

When writeback is ON every writable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b321a68..7d10b4c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 }
 EXPORT_SYMBOL_GPL(fuse_do_open);
 
+static void fuse_link_write_file(struct file *file)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+	struct fuse_file *ff = file->private_data;
+	/*
+	 * file may be written through mmap, so chain it onto the
+	 * inodes's write_file list
+	 */
+	spin_lock(&fc->lock);
+	if (list_empty(&ff->write_entry))
+		list_add(&ff->write_entry, &fi->write_files);
+	spin_unlock(&fc->lock);
+}
+
 void fuse_finish_open(struct inode *inode, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
@@ -1380,20 +1396,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
-		struct inode *inode = file->f_dentry->d_inode;
-		struct fuse_conn *fc = get_fuse_conn(inode);
-		struct fuse_inode *fi = get_fuse_inode(inode);
-		struct fuse_file *ff = file->private_data;
-		/*
-		 * file may be written through mmap, so chain it onto the
-		 * inodes's write_file list
-		 */
-		spin_lock(&fc->lock);
-		if (list_empty(&ff->write_entry))
-			list_add(&ff->write_entry, &fi->write_files);
-		spin_unlock(&fc->lock);
-	}
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		fuse_link_write_file(file);
+
 	file_accessed(file);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/10] fuse: Getting file for writeback helper
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
  2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
@ 2012-07-03 15:54 ` Pavel Emelyanov
  2012-07-03 15:54 ` [PATCH 3/10] fuse: Prepare to handle short reads Pavel Emelyanov
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:54 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7d10b4c..9c32bf5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1273,6 +1273,19 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
 	fuse_writepage_free(fc, req);
 }
 
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc, struct fuse_inode *fi)
+{
+	struct fuse_file *ff;
+
+	spin_lock(&fc->lock);
+	BUG_ON(list_empty(&fi->write_files));
+	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+	fuse_file_get(ff);
+	spin_unlock(&fc->lock);
+
+	return ff;
+}
+
 static int fuse_writepage_locked(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
@@ -1280,7 +1293,6 @@ static int fuse_writepage_locked(struct page *page)
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct fuse_req *req;
-	struct fuse_file *ff;
 	struct page *tmp_page;
 
 	set_page_writeback(page);
@@ -1293,13 +1305,8 @@ static int fuse_writepage_locked(struct page *page)
 	if (!tmp_page)
 		goto err_free;
 
-	spin_lock(&fc->lock);
-	BUG_ON(list_empty(&fi->write_files));
-	ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
-	req->ff = fuse_file_get(ff);
-	spin_unlock(&fc->lock);
-
-	fuse_write_fill(req, ff, page_offset(page), 0);
+	req->ff = fuse_write_file(fc, fi);
+	fuse_write_fill(req, req->ff, page_offset(page), 0);
 
 	copy_highpage(tmp_page, page);
 	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/10] fuse: Prepare to handle short reads
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
  2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
  2012-07-03 15:54 ` [PATCH 2/10] fuse: Getting file for writeback helper Pavel Emelyanov
@ 2012-07-03 15:54 ` Pavel Emelyanov
  2012-07-03 15:55 ` [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback Pavel Emelyanov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:54 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

A helper which gets called when read reports less bytes than was requested.
See patch #6 (trust kernel i_size only) for details.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   20 ++++++++++++--------
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9c32bf5..6bf9723 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,14 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 	spin_unlock(&fc->lock);
 }
 
+static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
+{
+	size_t num_read = req->out.args[0].size;
+
+	loff_t pos = page_offset(req->pages[0]) + num_read;
+	fuse_read_update_size(inode, pos, attr_ver);
+}
+
 static int fuse_readpage(struct file *file, struct page *page)
 {
 	struct inode *inode = page->mapping->host;
@@ -573,18 +581,18 @@ static int fuse_readpage(struct file *file, struct page *page)
 	req->pages[0] = page;
 	num_read = fuse_send_read(req, file, pos, count, NULL);
 	err = req->out.h.error;
-	fuse_put_request(fc, req);
 
 	if (!err) {
 		/*
 		 * Short read means EOF.  If file size is larger, truncate it
 		 */
 		if (num_read < count)
-			fuse_read_update_size(inode, pos + num_read, attr_ver);
+			fuse_short_read(req, inode, attr_ver);
 
 		SetPageUptodate(page);
 	}
 
+	fuse_put_request(fc, req);
 	fuse_invalidate_attr(inode); /* atime changed */
  out:
 	unlock_page(page);
@@ -607,13 +615,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
 		/*
 		 * Short read means EOF. If file size is larger, truncate it
 		 */
-		if (!req->out.h.error && num_read < count) {
-			loff_t pos;
+		if (!req->out.h.error && num_read < count)
+			fuse_short_read(req, inode, req->misc.read.attr_ver);
 
-			pos = page_offset(req->pages[0]) + num_read;
-			fuse_read_update_size(inode, pos,
-					      req->misc.read.attr_ver);
-		}
 		fuse_invalidate_attr(inode); /* atime changed */
 	}
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2012-07-03 15:54 ` [PATCH 3/10] fuse: Prepare to handle short reads Pavel Emelyanov
@ 2012-07-03 15:55 ` Pavel Emelyanov
  2012-07-04 13:06   ` Miklos Szeredi
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/file.c |   21 ++++++++++++++-------
 1 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6bf9723..47f0f2e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
 
 		BUG_ON(req->inode != inode);
 		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
-		if (curr_index == index) {
+		if (curr_index == index && index < curr_index + req->num_pages) {
 			found = true;
 			break;
 		}
@@ -1192,7 +1192,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
 
 static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 {
-	__free_page(req->pages[0]);
+	int i;
+
+	for (i = 0; i < req->num_pages; i++)
+		__free_page(req->pages[i]);
 	fuse_file_put(req->ff, false);
 }
 
@@ -1201,10 +1204,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
 	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	int i;
 
 	list_del(&req->writepages_entry);
-	dec_bdi_stat(bdi, BDI_WRITEBACK);
-	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+	for (i = 0; i < req->num_pages; i++) {
+		dec_bdi_stat(bdi, BDI_WRITEBACK);
+		dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
 }
@@ -1217,14 +1223,15 @@ __acquires(fc->lock)
 	struct fuse_inode *fi = get_fuse_inode(req->inode);
 	loff_t size = i_size_read(req->inode);
 	struct fuse_write_in *inarg = &req->misc.write.in;
+	__u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
 
 	if (!fc->connected)
 		goto out_free;
 
-	if (inarg->offset + PAGE_CACHE_SIZE <= size) {
-		inarg->size = PAGE_CACHE_SIZE;
+	if (inarg->offset + data_size <= size) {
+		inarg->size = data_size;
 	} else if (inarg->offset < size) {
-		inarg->size = size & (PAGE_CACHE_SIZE - 1);
+		inarg->size = size - inarg->offset;
 	} else {
 		/* Got truncated off completely */
 		goto out_free;
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/10] fuse: Connection bit for enabling writeback
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-03 15:55   ` Pavel Emelyanov
  2012-07-03 15:56   ` [PATCH 7/10] fuse: Flush files on wb close Pavel Emelyanov
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
  To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
	Alexander Viro, linux-fsdevel
  Cc: Kirill Korotaev, James Bottomley

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/fuse/fuse_i.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 771fb63..218b0e6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -428,6 +428,9 @@ struct fuse_conn {
 	/** Set if bdi is valid */
 	unsigned bdi_initialized:1;
 
+	/** write-back cache policy (default is write-through) */
+	unsigned writeback_cache:1;
+
 	/*
 	 * The following bitfields are only for optimization purposes
 	 * and hence races in setting them will not cause malfunction
-- 
1.5.5.6

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
                   ` (4 preceding siblings ...)
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-03 15:55 ` Pavel Emelyanov
  2012-07-04 14:39   ` Miklos Szeredi
  2012-07-03 15:57 ` [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Pavel Emelyanov
  2012-07-05 19:26 ` [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Anand Avati
  7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

Make fuse think that when writeback is on the inode's i_size is alway up-to-date
and not update it with the value received from the userspace. This is done because
the page cache code may update i_size without letting the FS know.

This assumption implies fixing the previously introduced short-read helper -- when
a short read occurs the 'hole' is filled with zeroes.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 fs/fuse/dir.c   |   10 ++++++----
 fs/fuse/file.c  |   23 +++++++++++++++++++++--
 fs/fuse/inode.c |    6 ++++--
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 334e0b1..032fb20 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -790,7 +790,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
 	stat->mtime.tv_nsec = attr->mtimensec;
 	stat->ctime.tv_sec = attr->ctime;
 	stat->ctime.tv_nsec = attr->ctimensec;
-	stat->size = attr->size;
+	stat->size = i_size_read(inode);
 	stat->blocks = attr->blocks;
 
 	if (attr->blksize != 0)
@@ -1350,7 +1350,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	struct fuse_req *req;
 	struct fuse_setattr_in inarg;
 	struct fuse_attr_out outarg;
-	bool is_truncate = false;
+	bool is_truncate = false, is_wb = fc->writeback_cache;
 	loff_t oldsize;
 	int err;
 
@@ -1423,7 +1423,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	fuse_change_attributes_common(inode, &outarg.attr,
 				      attr_timeout(&outarg));
 	oldsize = inode->i_size;
-	i_size_write(inode, outarg.attr.size);
+	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+		i_size_write(inode, outarg.attr.size);
 
 	if (is_truncate) {
 		/* NOTE: this may release/reacquire fc->lock */
@@ -1435,7 +1436,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
 	 * Only call invalidate_inode_pages2() after removing
 	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
 	 */
-	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+	if ((is_truncate || !is_wb) &&
+			S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
 		truncate_pagecache(inode, oldsize, outarg.attr.size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 47f0f2e..9bc1390 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -541,9 +541,28 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
 static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
 {
 	size_t num_read = req->out.args[0].size;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache) {
+		/*
+		 * A hole in a file. Some data after the hole are in page cache.
+		 */
+		int i;
+		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
+
+		for (i = num_read >> PAGE_CACHE_SHIFT; i < req->num_pages; i++) {
+			struct page *page = req->pages[i];
+			void *mapaddr = kmap_atomic(page, KM_USER0);
 
-	loff_t pos = page_offset(req->pages[0]) + num_read;
-	fuse_read_update_size(inode, pos, attr_ver);
+			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+			kunmap_atomic(mapaddr, KM_USER0);
+			off = 0;
+		}
+	} else {
+		loff_t pos = page_offset(req->pages[0]) + num_read;
+		fuse_read_update_size(inode, pos, attr_ver);
+	}
 }
 
 static int fuse_readpage(struct file *file, struct page *page)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1cd6165..88c577f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 {
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	bool is_wb = fc->writeback_cache;
 	loff_t oldsize;
 
 	spin_lock(&fc->lock);
@@ -207,10 +208,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	fuse_change_attributes_common(inode, attr, attr_valid);
 
 	oldsize = inode->i_size;
-	i_size_write(inode, attr->size);
+	if (!is_wb || !S_ISREG(inode->i_mode))
+		i_size_write(inode, attr->size);
 	spin_unlock(&fc->lock);
 
-	if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
+	if (!is_wb && S_ISREG(inode->i_mode) && oldsize != attr->size) {
 		truncate_pagecache(inode, oldsize, attr->size);
 		invalidate_inode_pages2(inode->i_mapping);
 	}
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/10] fuse: Flush files on wb close
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-07-03 15:55   ` [PATCH 5/10] fuse: Connection bit for enabling writeback Pavel Emelyanov
@ 2012-07-03 15:56   ` Pavel Emelyanov
  2012-07-03 15:56   ` [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks Pavel Emelyanov
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:56 UTC (permalink / raw)
  To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
	Alexander Viro, linux-fsdevel
  Cc: Kirill Korotaev, James Bottomley

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding writeback cache. Note, that simply calling the
filemap_write_and_wait() is not enough since fuse finishes page writeback
immediatelly and thus the -wait part of the mentioned call will be no-op.
Do real wait on per-inode writepages list.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/fuse/file.c |   26 +++++++++++++++++++++++++-
 1 files changed, 25 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9bc1390..d9d566e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -137,6 +137,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
 	}
 }
 
+static void __fuse_file_put(struct fuse_file *ff)
+{
+	if (atomic_dec_and_test(&ff->count))
+		BUG();
+}
+
 int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
 		 bool isdir)
 {
@@ -285,8 +291,23 @@ static int fuse_open(struct inode *inode, struct file *file)
 	return fuse_open_common(inode, file, false);
 }
 
+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_inode *fi = get_fuse_inode(inode);
+
+	filemap_write_and_wait(file->f_mapping);
+	wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+	spin_unlock_wait(&fc->lock);
+}
+
 static int fuse_release(struct inode *inode, struct file *file)
 {
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (fc->writeback_cache)
+		fuse_flush_writeback(inode, file);
+
 	fuse_release_common(file, FUSE_RELEASE);
 
 	/* return value is ignored by VFS */
@@ -1215,7 +1236,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
 
 	for (i = 0; i < req->num_pages; i++)
 		__free_page(req->pages[i]);
-	fuse_file_put(req->ff, false);
+	if (!fc->writeback_cache)
+		fuse_file_put(req->ff, false);
 }
 
 static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1232,6 +1254,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 	}
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
+	if (fc->writeback_cache)
+		__fuse_file_put(req->ff);
 }
 
 /* Called under fc->lock, may release and reacquire it */
-- 
1.5.5.6

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-07-03 15:55   ` [PATCH 5/10] fuse: Connection bit for enabling writeback Pavel Emelyanov
  2012-07-03 15:56   ` [PATCH 7/10] fuse: Flush files on wb close Pavel Emelyanov
@ 2012-07-03 15:56   ` Pavel Emelyanov
  2012-07-03 15:57   ` [PATCH 9/10] fuse: Turn writeback on Pavel Emelyanov
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:56 UTC (permalink / raw)
  To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
	Alexander Viro, linux-fsdevel
  Cc: Kirill Korotaev, James Bottomley

The .writepages one is required to make each writeback request carry more than
one page on it.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/fuse/file.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 201 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d9d566e..058498d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -699,7 +699,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
 
 struct fuse_fill_data {
 	struct fuse_req *req;
-	struct file *file;
+	union {
+		struct file *file;
+		struct fuse_file *ff;
+	};
 	struct inode *inode;
 };
 
@@ -1400,6 +1403,200 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
 	return err;
 }
 
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+	int i, all_ok = 1;
+	struct fuse_req *req = data->req;
+	struct fuse_conn *fc = get_fuse_conn(data->inode);
+	struct fuse_inode *fi = get_fuse_inode(data->inode);
+	loff_t off = -1;
+
+	if (!data->ff)
+		data->ff = fuse_write_file(fc, fi);
+
+	for (i = 0; i < req->num_pages; i++) {
+		struct page *page = req->pages[i];
+		struct address_space *mapping = page->mapping;
+		struct page *tmp_page;
+
+		tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (tmp_page) {
+			copy_highpage(tmp_page, page);
+			inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+			inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+		} else
+			all_ok = 0;
+		req->pages[i] = tmp_page;
+		if (i == 0)
+			off = page_offset(page);
+
+		end_page_writeback(page);
+	}
+
+	if (!all_ok)
+		return -ENOMEM;
+
+	req->ff = fuse_file_get(data->ff);
+	fuse_write_fill(req, data->ff, off, 0);
+
+	req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+	req->in.argpages = 1;
+	req->page_offset = 0;
+	req->end = fuse_writepage_end;
+	req->inode = data->inode;
+
+	spin_lock(&fc->lock);
+	list_add(&req->writepages_entry, &fi->writepages);
+	list_add_tail(&req->list, &fi->queued_writes);
+	fuse_flush_writepages(data->inode);
+	spin_unlock(&fc->lock);
+
+	return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+		struct writeback_control *wbc, void *_data)
+{
+	struct fuse_fill_data *data = _data;
+	struct fuse_req *req = data->req;
+	struct inode *inode = data->inode;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+
+	if (req->num_pages &&
+			(req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+			 (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+			 req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+		int err;
+
+		err = fuse_send_writepages(data);
+		if (err) {
+			unlock_page(page);
+			return err;
+		}
+
+		data->req = req = fuse_request_alloc_nofs();
+		if (IS_ERR(req)) {
+			unlock_page(page);
+			return PTR_ERR(req);
+		}
+	}
+
+	req->pages[req->num_pages] = page;
+	req->num_pages++;
+	set_page_writeback(page);
+	unlock_page(page);
+
+	return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+	struct fuse_conn *fc = get_fuse_conn(inode);
+	struct fuse_fill_data data;
+	int err;
+
+	if (!fc->writeback_cache)
+		return generic_writepages(mapping, wbc);
+
+	err = -EIO;
+	if (is_bad_inode(inode))
+		goto out;
+
+	data.ff = NULL;
+	data.inode = inode;
+	data.req = fuse_request_alloc_nofs();
+	err = PTR_ERR(data.req);
+	if (IS_ERR(data.req))
+		goto out_put;
+
+	err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+	if (!err) {
+		if (data.req->num_pages)
+			err = fuse_send_writepages(&data);
+		else
+			fuse_put_request(fc, data.req);
+	}
+out_put:
+	if (data.ff)
+		fuse_file_put(data.ff, false);
+out:
+	return err;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+		struct page *page, loff_t pos, unsigned len)
+{
+	struct fuse_req *req;
+	int err;
+
+	if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+		return 0;
+
+	req = fuse_get_req(fc);
+	err = PTR_ERR(req);
+	if (IS_ERR(req))
+		goto out;
+
+	req->out.page_zeroing = 1;
+	req->out.argpages = 1;
+	req->num_pages = 1;
+	req->pages[0] = page;
+	fuse_send_read(req, file, page_offset(page), PAGE_CACHE_SIZE, NULL);
+	err = req->out.h.error;
+	fuse_put_request(fc, req);
+out:
+	if (err) {
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	return err;
+}
+
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned flags,
+		struct page **pagep, void **fsdata)
+{
+	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+	struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+	BUG_ON(!fc->writeback_cache);
+
+	*pagep = grab_cache_page_write_begin(mapping, index, flags);
+	if (!*pagep)
+		return -ENOMEM;
+
+	return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+	if (!PageUptodate(page))
+		SetPageUptodate(page);
+
+	fuse_write_update_size(inode, pos);
+	set_page_dirty(page);
+	return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
+{
+	unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+	fuse_commit_write(file, page, from, from+copied);
+
+	unlock_page(page);
+	page_cache_release(page);
+
+	return copied;
+}
+
 static int fuse_launder_page(struct page *page)
 {
 	int err = 0;
@@ -2318,11 +2515,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
 	.writepage	= fuse_writepage,
+	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
 	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
+	.write_begin	= fuse_write_begin,
+	.write_end	= fuse_write_end,
 };
 
 void fuse_init_file_inode(struct inode *inode)
-- 
1.5.5.6

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 9/10] fuse: Turn writeback on
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2012-07-03 15:56   ` [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks Pavel Emelyanov
@ 2012-07-03 15:57   ` Pavel Emelyanov
  2012-07-04  3:01   ` [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Nikolaus Rath
  2012-07-05 19:31   ` Anand Avati
  5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:57 UTC (permalink / raw)
  To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
	Alexander Viro, linux-fsdevel
  Cc: Kirill Korotaev, James Bottomley

Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this.

Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.

Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
 fs/fuse/file.c       |    5 +++++
 fs/fuse/inode.c      |    4 +++-
 include/linux/fuse.h |    1 +
 3 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 058498d..e3a1cc8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -209,6 +209,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
 		spin_unlock(&fc->lock);
 		fuse_invalidate_attr(inode);
 	}
+	if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+		fuse_link_write_file(file);
 }
 
 int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -999,6 +1001,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	struct iov_iter i;
 	loff_t endbyte = 0;
 
+	if (get_fuse_conn(inode)->writeback_cache)
+		return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
 	WARN_ON(iocb->ki_pos != pos);
 
 	ocount = 0;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 88c577f..df1dc83 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -836,6 +836,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
 				fc->big_writes = 1;
 			if (arg->flags & FUSE_DONT_MASK)
 				fc->dont_mask = 1;
+			if (arg->flags & FUSE_WRITEBACK_CACHE)
+				fc->writeback_cache = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_CACHE_SIZE;
 			fc->no_lock = 1;
@@ -861,7 +863,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 	arg->max_readahead = fc->bdi.ra_pages * PAGE_CACHE_SIZE;
 	arg->flags |= FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC |
 		FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK |
-		FUSE_FLOCK_LOCKS;
+		FUSE_FLOCK_LOCKS | FUSE_WRITEBACK_CACHE;
 	req->in.h.opcode = FUSE_INIT;
 	req->in.numargs = 1;
 	req->in.args[0].size = sizeof(*arg);
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 9303348..7ed5ced 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -176,6 +176,7 @@ struct fuse_file_lock {
 #define FUSE_BIG_WRITES		(1 << 5)
 #define FUSE_DONT_MASK		(1 << 6)
 #define FUSE_FLOCK_LOCKS	(1 << 10)
+#define FUSE_WRITEBACK_CACHE	(1 << 11)
 
 /**
  * CUSE INIT request/reply flags
-- 
1.5.5.6

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
                   ` (5 preceding siblings ...)
  2012-07-03 15:55 ` [PATCH 6/10] fuse: Trust kernel i_size only Pavel Emelyanov
@ 2012-07-03 15:57 ` Pavel Emelyanov
       [not found]   ` <4FF3166B.5090800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2012-07-05 19:26 ` [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Anand Avati
  7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:57 UTC (permalink / raw)
  To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
  Cc: James Bottomley, Kirill Korotaev

Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is hight ehough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note, that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in the function this means, that it either
writes to the mountpoint it serves (but it can deadlock even without
the writeback) or it is wrting to some _other_ dirty bdi and in the
latter case someone else will free the memory for it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 mm/page-writeback.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 93d8d2f..7cb54db 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1192,7 +1192,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+			global_page_state(NR_WRITEBACK_TEMP);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
-- 
1.5.5.6

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (3 preceding siblings ...)
  2012-07-03 15:57   ` [PATCH 9/10] fuse: Turn writeback on Pavel Emelyanov
@ 2012-07-04  3:01   ` Nikolaus Rath
       [not found]     ` <87a9zg1b7q.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
  2012-07-05 19:31   ` Anand Avati
  5 siblings, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-04  3:01 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel

Hi Pavel,

I think it's great that you're working on this! I've been waiting for
FUSE being able to supply write data in bigger chunks for a long time,
and I'm very excited to see some progress on this. I'm not a kernel
developer, but I'll be happy to try the patches.

While I try to get this to compile, a few more questions:

Pavel Emelyanov <xemul@parallels.com> writes:
> Hi everyone.
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?)

From your description it sounds as if the only effect of write-back is
to increase the chunk size. Why is the a need to require special
privileges for this?

> When the writeback is turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
>   and finish the page writeback immediately

Could you elaborate? I don't understand what you're saying here.



Best,

   -Nikolaus

-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
fuse-devel mailing list
fuse-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found]     ` <87a9zg1b7q.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
@ 2012-07-04  7:11       ` Pavel Emelyanov
  2012-07-04 13:22         ` Nikolaus Rath
  2012-07-05 13:07         ` Nikolaus Rath
  0 siblings, 2 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-04  7:11 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel

On 07/04/2012 07:01 AM, Nikolaus Rath wrote:
> Hi Pavel,
> 
> I think it's great that you're working on this! I've been waiting for
> FUSE being able to supply write data in bigger chunks for a long time,
> and I'm very excited to see some progress on this. I'm not a kernel
> developer, but I'll be happy to try the patches.

Just to make it clear. I didn't increase the 32 pages per request limit. What
I did is made FUSE submit more than one request at a time while serving massive
writes. So yes, bigger chunks can be now seen by the daemon, but it should read
several requests for that.

> While I try to get this to compile,

That's good news! Thanks a lot!

> a few more questions:

> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> Hi everyone.
>>
>> One of the problems with the existing FUSE implementation is that it uses the
>> write-through cache policy which results in performance problems on certain
>> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
>> 128k to the userspace synchronously. This becomes a problem when the userspace
>> back-end uses networking for storing the data.
>>
>> A good solution of this is switching the FUSE page cache into a write-back policy.
>> With this file data are pushed to the userspace with big chunks (depending on the
>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>> handle the size updates in a more efficient manner.
>>
>> The writeback feature is per-connection and is explicitly configurable at the
>> init stage (is it worth making it CAP_SOMETHING protected?)
> 
> From your description it sounds as if the only effect of write-back is
> to increase the chunk size. Why is the a need to require special
> privileges for this?

Provided I understand the code correctly: if FUSE daemon turns writeback on and sets
per-bdi dirty limit too high it can cause a deadlock on the box. Thus then daemon
should be trusted by the kernel, i.e. -- privileged.

>> When the writeback is turned ON:
>>
>> * still copy writeback pages to temporary buffer when sending a writeback request
>>   and finish the page writeback immediately
> 
> Could you elaborate? I don't understand what you're saying here.

This is an implementation detail. To avoid a deadlock in the memory reclaim code
existing FUSE copies a mmapped dirty page contents into a temporary buffer before
sending it to the user space. What I wanted to say here is that I did use the same
trick in the introduced writeback paths. This doesn't affect kernel-to-user API
at all.

> 
> 
> Best,
> 
>    -Nikolaus
> 

Thanks,
Pavel

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
  2012-07-03 15:55 ` [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback Pavel Emelyanov
@ 2012-07-04 13:06   ` Miklos Szeredi
  2012-07-04 14:26     ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-04 13:06 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

Pavel Emelyanov <xemul@parallels.com> writes:

> The .writepages callback will issue writeback requests with more than one
> page aboard. Make existing end/check code be aware of this.
>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
>  fs/fuse/file.c |   21 ++++++++++++++-------
>  1 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6bf9723..47f0f2e 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
>  
>  		BUG_ON(req->inode != inode);
>  		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
> -		if (curr_index == index) {
> +		if (curr_index == index && index < curr_index + req->num_pages) {

This condition looks bogus.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-04  7:11       ` Pavel Emelyanov
@ 2012-07-04 13:22         ` Nikolaus Rath
       [not found]           ` <4FF4438B.8050807-BTH8mxji4b0@public.gmane.org>
  2012-07-05 13:07         ` Nikolaus Rath
  1 sibling, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-04 13:22 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: fuse-devel, linux-fsdevel

On 07/04/2012 03:11 AM, Pavel Emelyanov wrote:
> On 07/04/2012 07:01 AM, Nikolaus Rath wrote:
>> Hi Pavel,
>>
>> I think it's great that you're working on this! I've been waiting for
>> FUSE being able to supply write data in bigger chunks for a long time,
>> and I'm very excited to see some progress on this. I'm not a kernel
>> developer, but I'll be happy to try the patches.
> 
> Just to make it clear. I didn't increase the 32 pages per request limit. What
> I did is made FUSE submit more than one request at a time while serving massive
> writes. So yes, bigger chunks can be now seen by the daemon, but it should read
> several requests for that.

Ah, I thought that your patch would do both. So with the patch an
userspace client can now writes data in say 4 kb chunks, and the FUSE
daemon will still receive it from the kernel in 128 kb chunks? But if
the client writes a say 1 MB chunk, the FUSE daemon will still see 8
128kb write requests?

Would it be very hard to raise the 32 pages per request limit at the
same time?


>>> A good solution of this is switching the FUSE page cache into a write-back policy.
>>> With this file data are pushed to the userspace with big chunks (depending on the
>>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>>> handle the size updates in a more efficient manner.
>>>
>>> The writeback feature is per-connection and is explicitly configurable at the
>>> init stage (is it worth making it CAP_SOMETHING protected?)
>>
>> From your description it sounds as if the only effect of write-back is
>> to increase the chunk size. Why the need to require special
>> privileges for this?
> 
> Provided I understand the code correctly: if FUSE daemon turns writeback on and sets
> per-bdi dirty limit too high it can cause a deadlock on the box. Thus then daemon
> should be trusted by the kernel, i.e. -- privileged.

Wouldn't it be more reasonable to enforce that the bdi dirty limit is
not set too high then?


Thanks,

   -Nikolaus

-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
  2012-07-04 13:06   ` Miklos Szeredi
@ 2012-07-04 14:26     ` Pavel Emelyanov
  0 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-04 14:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/04/2012 05:06 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> The .writepages callback will issue writeback requests with more than one
>> page aboard. Make existing end/check code be aware of this.
>>
>> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>> ---
>>  fs/fuse/file.c |   21 ++++++++++++++-------
>>  1 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 6bf9723..47f0f2e 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
>>  
>>  		BUG_ON(req->inode != inode);
>>  		curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
>> -		if (curr_index == index) {
>> +		if (curr_index == index && index < curr_index + req->num_pages) {
> 
> This condition looks bogus.

Oops, indeed :( This should be

if (curr_index <= index && index < curr_index + req->num_pages)

> Thanks,
> Miklos
> .
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found]           ` <4FF4438B.8050807-BTH8mxji4b0@public.gmane.org>
@ 2012-07-04 14:33             ` Pavel Emelyanov
       [not found]               ` <4FF45447.5000705-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-04 14:33 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel

On 07/04/2012 05:22 PM, Nikolaus Rath wrote:
> On 07/04/2012 03:11 AM, Pavel Emelyanov wrote:
>> On 07/04/2012 07:01 AM, Nikolaus Rath wrote:
>>> Hi Pavel,
>>>
>>> I think it's great that you're working on this! I've been waiting for
>>> FUSE being able to supply write data in bigger chunks for a long time,
>>> and I'm very excited to see some progress on this. I'm not a kernel
>>> developer, but I'll be happy to try the patches.
>>
>> Just to make it clear. I didn't increase the 32 pages per request limit. What
>> I did is made FUSE submit more than one request at a time while serving massive
>> writes. So yes, bigger chunks can be now seen by the daemon, but it should read
>> several requests for that.
> 
> Ah, I thought that your patch would do both. So with the patch an
> userspace client can now writes data in say 4 kb chunks, and the FUSE
> daemon will still receive it from the kernel in 128 kb chunks?

In 128 kb per request.

> But if the client writes a say 1 MB chunk, the FUSE daemon will still
> see 8 128kb write requests?

Yes, but this time daemon can see all 8 requests at once. Before the patch you had
to ack the 1st request before seeing the 2nd one.

> Would it be very hard to raise the 32 pages per request limit at the
> same time?

I haven't looked at it yet, but it doesn't seem too hard. The only problem that
can arise is compatibility with the existing binaries. I don't know whether every
single fuse daemon is capable of handling this change... OTOH this parameter can
be done as configurable at init time with the default value of 32, but this is a
separate task.

>>>> A good solution of this is switching the FUSE page cache into a write-back policy.
>>>> With this file data are pushed to the userspace with big chunks (depending on the
>>>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>>>> handle the size updates in a more efficient manner.
>>>>
>>>> The writeback feature is per-connection and is explicitly configurable at the
>>>> init stage (is it worth making it CAP_SOMETHING protected?)
>>>
>>> From your description it sounds as if the only effect of write-back is
>>> to increase the chunk size. Why the need to require special
>>> privileges for this?
>>
>> Provided I understand the code correctly: if FUSE daemon turns writeback on and sets
>> per-bdi dirty limit too high it can cause a deadlock on the box. Thus then daemon
>> should be trusted by the kernel, i.e. -- privileged.
> 
> Wouldn't it be more reasonable to enforce that the bdi dirty limit is
> not set too high then?

Hardly, since if you have several mounts with small bdi limit each the sum of them
can be still high.

> 
> Thanks,
> 
>    -Nikolaus
> 


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-03 15:55 ` [PATCH 6/10] fuse: Trust kernel i_size only Pavel Emelyanov
@ 2012-07-04 14:39   ` Miklos Szeredi
  2012-07-05 14:10     ` Pavel Emelyanov
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-04 14:39 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

Pavel Emelyanov <xemul@parallels.com> writes:

> Make fuse think that when writeback is on the inode's i_size is alway
> up-to-date and not update it with the value received from the
> userspace. This is done because the page cache code may update i_size
> without letting the FS know.

Similar rule applies to i_mtime.  Except it's even more tricky, since
you have to flush the mtime together with the data (the flush should not
update the modification time).  And we have other operations which also
change the mtime, and those also need to be updated.

We should probably look at what NFS is doing.

> This assumption implies fixing the previously introduced short-read
> helper -- when a short read occurs the 'hole' is filled with zeroes.
>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
>  fs/fuse/dir.c   |   10 ++++++----
>  fs/fuse/file.c  |   23 +++++++++++++++++++++--
>  fs/fuse/inode.c |    6 ++++--
>  3 files changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 334e0b1..032fb20 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -790,7 +790,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
>  	stat->mtime.tv_nsec = attr->mtimensec;
>  	stat->ctime.tv_sec = attr->ctime;
>  	stat->ctime.tv_nsec = attr->ctimensec;
> -	stat->size = attr->size;
> +	stat->size = i_size_read(inode);
>  	stat->blocks = attr->blocks;
>  
>  	if (attr->blksize != 0)
> @@ -1350,7 +1350,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
>  	struct fuse_req *req;
>  	struct fuse_setattr_in inarg;
>  	struct fuse_attr_out outarg;
> -	bool is_truncate = false;
> +	bool is_truncate = false, is_wb = fc->writeback_cache;
>  	loff_t oldsize;
>  	int err;
>  
> @@ -1423,7 +1423,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
>  	fuse_change_attributes_common(inode, &outarg.attr,
>  				      attr_timeout(&outarg));
>  	oldsize = inode->i_size;
> -	i_size_write(inode, outarg.attr.size);
> +	if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
> +		i_size_write(inode, outarg.attr.size);
>  
>  	if (is_truncate) {
>  		/* NOTE: this may release/reacquire fc->lock */
> @@ -1435,7 +1436,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
>  	 * Only call invalidate_inode_pages2() after removing
>  	 * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
>  	 */
> -	if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> +	if ((is_truncate || !is_wb) &&
> +			S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
>  		truncate_pagecache(inode, oldsize, outarg.attr.size);
>  		invalidate_inode_pages2(inode->i_mapping);
>  	}
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 47f0f2e..9bc1390 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -541,9 +541,28 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
>  static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
>  {
>  	size_t num_read = req->out.args[0].size;
> +	struct fuse_conn *fc = get_fuse_conn(inode);
> +
> +	if (fc->writeback_cache) {
> +		/*
> +		 * A hole in a file. Some data after the hole are in page cache.
> +		 */
> +		int i;
> +		size_t off = num_read & (PAGE_CACHE_SIZE - 1);
> +
> +		for (i = num_read >> PAGE_CACHE_SHIFT; i < req->num_pages; i++) {
> +			struct page *page = req->pages[i];
> +			void *mapaddr = kmap_atomic(page, KM_USER0);
>  
> -	loff_t pos = page_offset(req->pages[0]) + num_read;
> -	fuse_read_update_size(inode, pos, attr_ver);
> +			memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
> +
> +			kunmap_atomic(mapaddr, KM_USER0);
> +			off = 0;
> +		}
> +	} else {
> +		loff_t pos = page_offset(req->pages[0]) + num_read;
> +		fuse_read_update_size(inode, pos, attr_ver);
> +	}
>  }
>  
>  static int fuse_readpage(struct file *file, struct page *page)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 1cd6165..88c577f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>  {
>  	struct fuse_conn *fc = get_fuse_conn(inode);
>  	struct fuse_inode *fi = get_fuse_inode(inode);
> +	bool is_wb = fc->writeback_cache;
>  	loff_t oldsize;
>  
>  	spin_lock(&fc->lock);
> @@ -207,10 +208,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>  	fuse_change_attributes_common(inode, attr, attr_valid);
>  
>  	oldsize = inode->i_size;
> -	i_size_write(inode, attr->size);
> +	if (!is_wb || !S_ISREG(inode->i_mode))
> +		i_size_write(inode, attr->size);
>  	spin_unlock(&fc->lock);
>  
> -	if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
> +	if (!is_wb && S_ISREG(inode->i_mode) && oldsize != attr->size) {
>  		truncate_pagecache(inode, oldsize, attr->size);
>  		invalidate_inode_pages2(inode->i_mapping);
>  	}

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found]               ` <4FF45447.5000705-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-04 17:08                 ` Nikolaus Rath
  2012-07-05  9:01                   ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-04 17:08 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel

On 07/04/2012 10:33 AM, Pavel Emelyanov wrote:
>>>>> A good solution of this is switching the FUSE page cache into a write-back policy.
>>>>> >>>> With this file data are pushed to the userspace with big chunks (depending on the
>>>>> >>>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>>>>> >>>> handle the size updates in a more efficient manner.
>>>>> >>>>
>>>>> >>>> The writeback feature is per-connection and is explicitly configurable at the
>>>>> >>>> init stage (is it worth making it CAP_SOMETHING protected?)
>>>> >>>
>>>> >>> From your description it sounds as if the only effect of write-back is
>>>> >>> to increase the chunk size. Why the need to require special
>>>> >>> privileges for this?
>>> >>
>>> >> Provided I understand the code correctly: if FUSE daemon turns writeback on and sets
>>> >> per-bdi dirty limit too high it can cause a deadlock on the box. Thus then daemon
>>> >> should be trusted by the kernel, i.e. -- privileged.
>> > 
>> > Wouldn't it be more reasonable to enforce that the bdi dirty limit is
>> > not set too high then?
>
> Hardly, since if you have several mounts with small bdi limit each the sum of them
> can be still high.

Yeah, so isn't it possible to put a limit on the sum of them?


I think it would be much better if this could be made safe to use
instead of requiring special privileges. If that isn't possible,
wouldn't it be better to enable/disable the feature in /etc/fuse.conf,
similar to the user_allow_* options?

Having to explain how to assign capabilities in the documentation of a
fuse filesystem sounds quite dreadful to me...


Best,

   -Nikolaus


PS: My naive attempt to apply your patches to my Debian 3.2 kernel
failed. I'm compiling 3.5rc5 at the moment - will see if I can get it to
boot afterwards...


-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
fuse-devel mailing list
fuse-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-04 17:08                 ` Nikolaus Rath
@ 2012-07-05  9:01                   ` Pavel Emelyanov
  0 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05  9:01 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel, linux-fsdevel

On 07/04/2012 09:08 PM, Nikolaus Rath wrote:
> On 07/04/2012 10:33 AM, Pavel Emelyanov wrote:
>>>>>> A good solution of this is switching the FUSE page cache into a write-back policy.
>>>>>>>>>> With this file data are pushed to the userspace with big chunks (depending on the
>>>>>>>>>> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
>>>>>>>>>> handle the size updates in a more efficient manner.
>>>>>>>>>>
>>>>>>>>>> The writeback feature is per-connection and is explicitly configurable at the
>>>>>>>>>> init stage (is it worth making it CAP_SOMETHING protected?)
>>>>>>>>
>>>>>>>> From your description it sounds as if the only effect of write-back is
>>>>>>>> to increase the chunk size. Why the need to require special
>>>>>>>> privileges for this?
>>>>>>
>>>>>> Provided I understand the code correctly: if FUSE daemon turns writeback on and sets
>>>>>> per-bdi dirty limit too high it can cause a deadlock on the box. Thus then daemon
>>>>>> should be trusted by the kernel, i.e. -- privileged.
>>>>
>>>> Wouldn't it be more reasonable to enforce that the bdi dirty limit is
>>>> not set too high then?
>>
>> Hardly, since if you have several mounts with small bdi limit each the sum of them
>> can be still high.
> 
> Yeah, so isn't it possible to put a limit on the sum of them?

Only on FUSE mounts? Currently no, only on the overall dirty set. Which, in turn,
will affect disk filesystems behavior.

> I think it would be much better if this could be made safe to use
> instead of requiring special privileges. If that isn't possible,
> wouldn't it be better to enable/disable the feature in /etc/fuse.conf,
> similar to the user_allow_* options?

I'll look at it.

> Having to explain how to assign capabilities in the documentation of a
> fuse filesystem sounds quite dreadful to me...
> 
> 
> Best,
> 
>    -Nikolaus
> 
> 
> PS: My naive attempt to apply your patches to my Debian 3.2 kernel
> failed. I'm compiling 3.5rc5 at the moment - will see if I can get it to
> boot afterwards...
> 
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-04  7:11       ` Pavel Emelyanov
  2012-07-04 13:22         ` Nikolaus Rath
@ 2012-07-05 13:07         ` Nikolaus Rath
  2012-07-05 14:08           ` Pavel Emelyanov
  1 sibling, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-05 13:07 UTC (permalink / raw)
  To: xemul
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	linux-fsdevel



Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>> While I try to get this to compile,
>
> That's good news! Thanks a lot!

Ok, I got it to compile and tested it by copying a large file into an
S3QL FUSE file system using different block sizes. At first glance,
things look fantastic.

With kernel 3.2:

Blocksize: 4k
131072000 bytes (131 MB) copied, 7.76037 s, 16.9 MB/s

Blocksize: 8k
131072000 bytes (131 MB) copied, 2.48469 s, 52.8 MB/s

Blocksize: 16k
131072000 bytes (131 MB) copied, 2.21868 s, 59.1 MB/s

Blocksize: 32k
131072000 bytes (131 MB) copied, 2.0455 s, 64.1 MB/s

Blocksize: 64k
131072000 bytes (131 MB) copied, 1.83897 s, 71.3 MB/s

Blocksize: 96k
131072000 bytes (131 MB) copied, 1.72298 s, 76.1 MB/s

Blocksize: 128k
131072000 bytes (131 MB) copied, 2.10194 s, 62.4 MB/s

Blocksize: 256k
131072000 bytes (131 MB) copied, 1.96788 s, 66.6 MB/s

Blocksize: 512k
131072000 bytes (131 MB) copied, 2.42342 s, 54.1 MB/s

Blocksize: 1024k
131072000 bytes (131 MB) copied, 1.6651 s, 78.7 MB/s


With 3.5-pre and write-back patches:

Blocksize: 4k
131072000 bytes (131 MB) copied, 5.50489 s, 23.8 MB/s

Blocksize: 8k
131072000 bytes (131 MB) copied, 1.43694 s, 91.2 MB/s

Blocksize: 16k
131072000 bytes (131 MB) copied, 0.967118 s, 136 MB/s

Blocksize: 32k
131072000 bytes (131 MB) copied, 0.707767 s, 185 MB/s

Blocksize: 64k
131072000 bytes (131 MB) copied, 0.605011 s, 217 MB/s

Blocksize: 96k
131072000 bytes (131 MB) copied, 0.573445 s, 229 MB/s

Blocksize: 128k
131072000 bytes (131 MB) copied, 0.51943 s, 252 MB/s

Blocksize: 256k
131072000 bytes (131 MB) copied, 0.458335 s, 286 MB/s

Blocksize: 512k
131072000 bytes (131 MB) copied, 0.452351 s, 290 MB/s

Blocksize: 1024k
131072000 bytes (131 MB) copied, 0.446027 s, 294 MB/s



However, I suspect that most of the gain is really because with the
patch most of the data is still in the kernel cache when dd finishes and
hasn't yet been received by the FUSE client.


Is there a way to force flushing of the fuse cache, so that I can
measure the time for dd + final flush?


Best,

   -Nikolaus

-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-05 13:07         ` Nikolaus Rath
@ 2012-07-05 14:08           ` Pavel Emelyanov
  2012-07-05 14:29             ` Nikolaus Rath
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05 14:08 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel, linux-fsdevel

On 07/05/2012 05:07 PM, Nikolaus Rath wrote:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>> While I try to get this to compile,
>>
>> That's good news! Thanks a lot!

The fuse and fsdevel mainling lists were corrupted in you original message,
so looping the lists back in.

> Ok, I got it to compile and tested it by copying a large file into an
> S3QL FUSE file system using different block sizes. At first glance,
> things look fantastic.

That's good news actually! :)

> With kernel 3.2:
> 
> Blocksize: 4k
> 131072000 bytes (131 MB) copied, 7.76037 s, 16.9 MB/s
> 

[snip]

> 
> However, I suspect that most of the gain is really because with the
> patch most of the data is still in the kernel cache when dd finishes and
> hasn't yet been received by the FUSE client.
> 
> 
> Is there a way to force flushing of the fuse cache, so that I can
> measure the time for dd + final flush?

Actually when dd closes an output file the whole page cache which relates to
it gets flushed to the userspace! This is due to how FUSE write requests are
served.

That said -- you've already measured writeback with flush :)

> 
> Best,
> 
>    -Nikolaus
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-04 14:39   ` Miklos Szeredi
@ 2012-07-05 14:10     ` Pavel Emelyanov
  2012-07-10  5:53     ` Pavel Emelyanov
       [not found]     ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
  2 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05 14:10 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
> 
> Similar rule applies to i_mtime.  Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time).  And we have other operations which also
> change the mtime, and those also need to be updated.

Yup, I've missed that part :( Thanks for pointing this out.

> We should probably look at what NFS is doing.

I'll look at what can be done about it.


Thanks,
Pavel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-05 14:08           ` Pavel Emelyanov
@ 2012-07-05 14:29             ` Nikolaus Rath
  2012-07-05 14:34               ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-05 14:29 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: fuse-devel, linux-fsdevel

On 07/05/2012 10:08 AM, Pavel Emelyanov wrote:
> On 07/05/2012 05:07 PM, Nikolaus Rath wrote:
>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>> While I try to get this to compile,
>>>
>>> That's good news! Thanks a lot!
> 
> The fuse and fsdevel mainling lists were corrupted in you original message,
> so looping the lists back in.
> 
>> Ok, I got it to compile and tested it by copying a large file into an
>> S3QL FUSE file system using different block sizes. At first glance,
>> things look fantastic.
> 
> That's good news actually! :)
> 
>> With kernel 3.2:
>>
>> Blocksize: 4k
>> 131072000 bytes (131 MB) copied, 7.76037 s, 16.9 MB/s
>>
> 
> [snip]
> 
>>
>> However, I suspect that most of the gain is really because with the
>> patch most of the data is still in the kernel cache when dd finishes and
>> hasn't yet been received by the FUSE client.
>>
>>
>> Is there a way to force flushing of the fuse cache, so that I can
>> measure the time for dd + final flush?
> 
> Actually when dd closes an output file the whole page cache which relates to
> it gets flushed to the userspace! This is due to how FUSE write requests are
> served.
> 
> That said -- you've already measured writeback with flush :)


But then how is it possible that there is such a big difference even
when using 128kb blocks?

Kernel 3.2:
131072000 bytes (131 MB) copied, 2.10194 s, 62.4 MB/s

Kernel 3.5-pre:
Blocksize: 128k
131072000 bytes (131 MB) copied, 0.51943 s, 252 MB/s


I would think that it both cases the FUSE daemon gets the data in 128 kb
packets, so why the difference? Or is this due to some other change
between 3.2 and 3.5 that significantly increased FUSE performance?

I can try the same test with the unpatched 3.5 tonight if that would be
of interest.

Best,

   -Nikolaus

-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-05 14:29             ` Nikolaus Rath
@ 2012-07-05 14:34               ` Pavel Emelyanov
  2012-07-06  2:04                 ` Nikolaus Rath
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05 14:34 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel, linux-fsdevel

On 07/05/2012 06:29 PM, Nikolaus Rath wrote:
> On 07/05/2012 10:08 AM, Pavel Emelyanov wrote:
>> On 07/05/2012 05:07 PM, Nikolaus Rath wrote:
>>> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>>>>> While I try to get this to compile,
>>>>
>>>> That's good news! Thanks a lot!
>>
>> The fuse and fsdevel mainling lists were corrupted in you original message,
>> so looping the lists back in.
>>
>>> Ok, I got it to compile and tested it by copying a large file into an
>>> S3QL FUSE file system using different block sizes. At first glance,
>>> things look fantastic.
>>
>> That's good news actually! :)
>>
>>> With kernel 3.2:
>>>
>>> Blocksize: 4k
>>> 131072000 bytes (131 MB) copied, 7.76037 s, 16.9 MB/s
>>>
>>
>> [snip]
>>
>>>
>>> However, I suspect that most of the gain is really because with the
>>> patch most of the data is still in the kernel cache when dd finishes and
>>> hasn't yet been received by the FUSE client.
>>>
>>>
>>> Is there a way to force flushing of the fuse cache, so that I can
>>> measure the time for dd + final flush?
>>
>> Actually when dd closes an output file the whole page cache which relates to
>> it gets flushed to the userspace! This is due to how FUSE write requests are
>> served.
>>
>> That said -- you've already measured writeback with flush :)
> 
> 
> But then how is it possible that there is such a big difference even
> when using 128kb blocks?
> 
> Kernel 3.2:
> 131072000 bytes (131 MB) copied, 2.10194 s, 62.4 MB/s
> 
> Kernel 3.5-pre:
> Blocksize: 128k
> 131072000 bytes (131 MB) copied, 0.51943 s, 252 MB/s
> 
> 
> I would think that it both cases the FUSE daemon gets the data in 128 kb
> packets, so why the difference? Or is this due to some other change
> between 3.2 and 3.5 that significantly increased FUSE performance?

I don't know for sure. Need to read an analyze the FUSE userspace side
logs (if any).

> I can try the same test with the unpatched 3.5 tonight if that would be
> of interest.

Yes, this can help.

> Best,
> 
>    -Nikolaus
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
                   ` (6 preceding siblings ...)
  2012-07-03 15:57 ` [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Pavel Emelyanov
@ 2012-07-05 19:26 ` Anand Avati
  7 siblings, 0 replies; 43+ messages in thread
From: Anand Avati @ 2012-07-05 19:26 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Kirill Korotaev, James Bottomley, fuse-devel, Miklos Szeredi,
	Alexander Viro, linux-fsdevel

Doesn't such a change increase the "dirty debt" held in the kernel and therefore
increase the probability of resulting in a deadlock (when the userspace filesystem's
memory request ends up waiting on a dirty page writeout caused by this write-back
feature)? Why not implement the write-back inside the userspace filesystem leaving
the kernel operate in write-through, which makes things overall less unsafe? Do you
have performance numbers comparing write-back in kernel v/s write-back in userspace?

Avati

----- Original Message -----
From: "Pavel Emelyanov" <xemul@parallels.com>
To: fuse-devel@lists.sourceforge.net, "Miklos Szeredi" <miklos@szeredi.hu>, "Alexander Viro" <viro@zeniv.linux.org.uk>, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>
Cc: "Kirill Korotaev" <dev@parallels.com>, "James Bottomley" <jbottomley@parallels.com>
Sent: Tuesday, July 3, 2012 8:53:18 AM
Subject: [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a	write-back cache policy

Hi everyone.

One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution of this is switching the FUSE page cache into a write-back policy.
With this file data are pushed to the userspace with big chunks (depending on the
dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
handle the size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
turned ON:

* still copy writeback pages to temporary buffer when sending a writeback request
  and finish the page writeback immediately

* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
  with the user space

* take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
  This protects us from having too many dirty pages on FUSE

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with few RAM and doesn't make the system stuck (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare meta-data
updates (virtual machines' and containers' disk images). With the existing cache
policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
too much time to proceed, much longer than if it was simply scp-ed over the same
network. The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage concepts
details if required.

Thanks,
Pavel

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
fuse-devel mailing list
fuse-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
                     ` (4 preceding siblings ...)
  2012-07-04  3:01   ` [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Nikolaus Rath
@ 2012-07-05 19:31   ` Anand Avati
       [not found]     ` <4FF5EB85.1050701-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  5 siblings, 1 reply; 43+ messages in thread
From: Anand Avati @ 2012-07-05 19:31 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Kirill Korotaev, Miklos Szeredi,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, James Bottomley,
	Alexander Viro, linux-fsdevel

Doesn't such a change increase the "dirty debt" held in the kernel and 
therefore increase the probability of resulting in a deadlock (when the 
userspace filesystem's memory allocation request ends up waiting on a 
dirty page writeout caused by this write-back feature)? Why not 
implement the write-back inside the userspace filesystem leaving the 
kernel operate in write-through, which makes things overall less unsafe? 
Do you have performance numbers comparing write-back in kernel v/s 
write-back in userspace?

Avati


On 07/03/2012 08:53 AM, Pavel Emelyanov wrote:
> Hi everyone.
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
> turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
>    and finish the page writeback immediately
>
> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
>    with the user space
>
> * take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
>    This protects us from having too many dirty pages on FUSE
>
> The provided patchset survives the fsx test. Performance measurements are not yet
> all finished, but the mentioned copying of a huge file becomes noticeably faster
> even on machines with few RAM and doesn't make the system stuck (the dirty pages
> balancer does its work OK). Applies on top of v3.5-rc4.
>
> We are currently exploring this with our own distributed storage implementation
> which is heavily oriented on storing big blobs of data with extremely rare meta-data
> updates (virtual machines' and containers' disk images). With the existing cache
> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
> too much time to proceed, much longer than if it was simply scp-ed over the same
> network. The write-back policy (as I mentioned) noticeably improves this scenario.
> Kirill (in Cc) can share more details about the performance and the storage concepts
> details if required.
>
> Thanks,
> Pavel
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> fuse-devel mailing list
> fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/fuse-devel


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found]     ` <4FF5EB85.1050701-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2012-07-05 20:07       ` Pavel Emelyanov
  2012-07-06 11:52         ` [fuse-devel] " Kirill Korotaev
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05 20:07 UTC (permalink / raw)
  To: Anand Avati
  Cc: aavati-H+wXaHxf7aLQT0dZR+AlfA, Kirill Korotaev, Miklos Szeredi,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, James Bottomley,
	Alexander Viro, linux-fsdevel

On 07/05/2012 11:31 PM, Anand Avati wrote:
> Doesn't such a change increase the "dirty debt" held in the kernel and 
> therefore increase the probability of resulting in a deadlock (when the 
> userspace filesystem's memory allocation request ends up waiting on a 
> dirty page writeout caused by this write-back feature)?

That's a very good question! I've tried to address this in the patch #10.
The balance_dirty_pages start throttling task which tries to put too many
dirty pages on a bdi. On FUSE this limit is set to a reasonably low value,
so that the dirty set of a FUSE mount won't grow too high.

> Why not  implement the write-back inside the userspace filesystem leaving the 
> kernel operate in write-through, which makes things overall less unsafe? 

Making yet another cache in the user space is extremely hard, since the
user space app has no idea at all when to start shrinking this cache. Neither
it knows which part of the cache (clean/dirty) is better candidate for that.
On the other hand, the in kernel cache gets shrunk when required and is balanced.

> Do you have performance numbers comparing write-back in kernel v/s 
> write-back in userspace?

I have only very preliminary measurements, I mostly played with the patches
stability. Pushing a big portion of data (4x times bigger than RAM) from one
VM to another via FUSE + networking backend with dd in 4k chunks increases
the reported by dd speed more than twice.

But, Kirill Korotaev (in Cc) was supposed to perform more accurate measurements,
hopefully he can provide the results.

> Avati

Thanks,
Pavel

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-05 14:34               ` Pavel Emelyanov
@ 2012-07-06  2:04                 ` Nikolaus Rath
       [not found]                   ` <8762a1odbf.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
  0 siblings, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-06  2:04 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: fuse-devel, linux-fsdevel

Pavel Emelyanov <xemul@parallels.com> writes:
>>>> With kernel 3.2:
>>>>
>>>> Blocksize: 4k
>>>> 131072000 bytes (131 MB) copied, 7.76037 s, 16.9 MB/s
>>>>
>>>
>>> [snip]
>>>
>>>>
>>>> However, I suspect that most of the gain is really because with the
>>>> patch most of the data is still in the kernel cache when dd finishes and
>>>> hasn't yet been received by the FUSE client.
>>>>
>>>>
>>>> Is there a way to force flushing of the fuse cache, so that I can
>>>> measure the time for dd + final flush?
>>>
>>> Actually when dd closes an output file the whole page cache which relates to
>>> it gets flushed to the userspace! This is due to how FUSE write requests are
>>> served.
>>>
>>> That said -- you've already measured writeback with flush :)
>> 
>> 
>> But then how is it possible that there is such a big difference even
>> when using 128kb blocks?
>> 
>> Kernel 3.2:
>> 131072000 bytes (131 MB) copied, 2.10194 s, 62.4 MB/s
>> 
>> Kernel 3.5-pre:
>> Blocksize: 128k
>> 131072000 bytes (131 MB) copied, 0.51943 s, 252 MB/s
>> 
>> I would think that it both cases the FUSE daemon gets the data in 128 kb
>> packets, so why the difference? Or is this due to some other change
>> between 3.2 and 3.5 that significantly increased FUSE performance?
>
> I don't know for sure. Need to read an analyze the FUSE userspace side
> logs (if any).
>
>> I can try the same test with the unpatched 3.5 tonight if that would be
>> of interest.
>
> Yes, this can help.

Alright, here are the resulting transfer rates for 131 MB of data:

Blocksize    Kernel 3.2         3.5-pre          3.5-pre + patch
4k           16.2               31.5             23.8
8k           52.8               48.5             91.2
16k          59.1               46.8             136
32k          64.1               47.6             185
64k          71.3               50.9             217
96k          76.1               50.6             229
128k         62.4               76.5             252
256k         66.6               68.8             286
512k         54.1               55.1             290
1024k        78.7               68.5             294

There's up to 20 MB/s variation for a block size on successive runs, but
it's still obvious that the patch really makes a huge difference.
Thinking about it, it actually makes sense even for the block sizes
greater than 128 kb. Here the speedup probably comes from the fact that
the FUSE daemon can now process several chunks at the same time.


Please let me know if there is anything I can do to help get this
merged. The results look fantastic.

Best,

   -Nikolaus

-- 
 »Time flies like an arrow, fruit flies like a Banana.«

  PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
       [not found]                   ` <8762a1odbf.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
@ 2012-07-06  8:46                     ` Pavel Emelyanov
  0 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-06  8:46 UTC (permalink / raw)
  To: Nikolaus Rath; +Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel

> Alright, here are the resulting transfer rates for 131 MB of data:
> 
> Blocksize    Kernel 3.2         3.5-pre          3.5-pre + patch
> 4k           16.2               31.5             23.8
> 8k           52.8               48.5             91.2
> 16k          59.1               46.8             136
> 32k          64.1               47.6             185
> 64k          71.3               50.9             217
> 96k          76.1               50.6             229
> 128k         62.4               76.5             252
> 256k         66.6               68.8             286
> 512k         54.1               55.1             290
> 1024k        78.7               68.5             294
> 
> There's up to 20 MB/s variation for a block size on successive runs, but
> it's still obvious that the patch really makes a huge difference.
> Thinking about it, it actually makes sense even for the block sizes
> greater than 128 kb. Here the speedup probably comes from the fact that
> the FUSE daemon can now process several chunks at the same time.

Thanks a lot for testing the patches, I really appreciate that!

> Please let me know if there is anything I can do to help get this
> merged. The results look fantastic.

Well, I'm going to address Miklos' comments and resubmit the series (hopefully
next week). When I resubmit them you can put your Tested-by: tags on the series
and provide details about how you tested it. This will give the patches more
chances to get merged earlier.

> Best,
> 
>    -Nikolaus
> 

Thanks,
Pavel

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
  2012-07-05 20:07       ` Pavel Emelyanov
@ 2012-07-06 11:52         ` Kirill Korotaev
  0 siblings, 0 replies; 43+ messages in thread
From: Kirill Korotaev @ 2012-07-06 11:52 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Anand Avati, fuse-devel, Miklos Szeredi, Alexander Viro,
	linux-fsdevel, James Bottomley, aavati


On Jul 6, 2012, at 00:07 , Pavel Emelyanov wrote:

> On 07/05/2012 11:31 PM, Anand Avati wrote:
>> Doesn't such a change increase the "dirty debt" held in the kernel and 
>> therefore increase the probability of resulting in a deadlock (when the 
>> userspace filesystem's memory allocation request ends up waiting on a 
>> dirty page writeout caused by this write-back feature)?
> 
> That's a very good question! I've tried to address this in the patch #10.
> The balance_dirty_pages start throttling task which tries to put too many
> dirty pages on a bdi. On FUSE this limit is set to a reasonably low value,
> so that the dirty set of a FUSE mount won't grow too high.
> 
>> Why not  implement the write-back inside the userspace filesystem leaving the 
>> kernel operate in write-through, which makes things overall less unsafe? 
> 
> Making yet another cache in the user space is extremely hard, since the
> user space app has no idea at all when to start shrinking this cache. Neither
> it knows which part of the cache (clean/dirty) is better candidate for that.
> On the other hand, the in kernel cache gets shrunk when required and is balanced.

Anand, do you have examples of such cache working in user-space?
It's more then extremely hard. Even ANK stopped trying after spending weeks on this.
And it's rather ugly to have this machinery in 2 places.

>> Do you have performance numbers comparing write-back in kernel v/s 
>> write-back in userspace?
> 
> I have only very preliminary measurements, I mostly played with the patches
> stability. Pushing a big portion of data (4x times bigger than RAM) from one
> VM to another via FUSE + networking backend with dd in 4k chunks increases
> the reported by dd speed more than twice.
> 
> But, Kirill Korotaev (in Cc) was supposed to perform more accurate measurements,
> hopefully he can provide the results.

In our case file write with i_size changing improved from 60MB/sec to ~1GBit performance (~110MB/sec).
I'm pretty sure in case of 10GBit the win would be even higher.

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-04 14:39   ` Miklos Szeredi
  2012-07-05 14:10     ` Pavel Emelyanov
@ 2012-07-10  5:53     ` Pavel Emelyanov
  2012-07-13 16:30       ` Miklos Szeredi
       [not found]     ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
  2 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-10  5:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
> 
> Similar rule applies to i_mtime.  Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time).  And we have other operations which also
> change the mtime, and those also need to be updated.
> 
> We should probably look at what NFS is doing.

Miklos,

I've looked at how NFS and FUSE manage the i_mtime.

The NFS (if not looking at the fscache) tries to keep the i_mtime at the
maximum between the local and the remote values. This looks like correct
behavior for FUSE too.

But, looking at the FUSE code I see that the existing attr_version checks
do implement the same approach even if we turn the writeback cache (this
series) ON.

That said, I see no problems with the i_mtime management after this series.
Is there something I've missed?

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-10  5:53     ` Pavel Emelyanov
@ 2012-07-13 16:30       ` Miklos Szeredi
  2012-07-16  3:32         ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-13 16:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

Pavel Emelyanov <xemul@parallels.com> writes:

> On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>> Make fuse think that when writeback is on the inode's i_size is alway
>>> up-to-date and not update it with the value received from the
>>> userspace. This is done because the page cache code may update i_size
>>> without letting the FS know.
>> 
>> Similar rule applies to i_mtime.  Except it's even more tricky, since
>> you have to flush the mtime together with the data (the flush should not
>> update the modification time).  And we have other operations which also
>> change the mtime, and those also need to be updated.
>> 
>> We should probably look at what NFS is doing.
>
> Miklos,
>
> I've looked at how NFS and FUSE manage the i_mtime.
>
> The NFS (if not looking at the fscache) tries to keep the i_mtime at the
> maximum between the local and the remote values. This looks like correct
> behavior for FUSE too.
>
> But, looking at the FUSE code I see that the existing attr_version checks
> do implement the same approach even if we turn the writeback cache (this
> series) ON.

It's not as simple as that.

First off, fuse sets S_NOCMTIME which means the kernel won't update
i_mtime on writes.  Which is fine for write-through, since the time
update is always the responsibility of the userspace filesystem.

If we are doing buffered writes, then the kernel must update i_mtime on
write(2) and must flush that to the userspace filesystem at some point
(with a SETTATTR operation).

The attr_version thing is about making sure that we don't update the
kernel's cached attribute with stale data from the userspace filesystem.
E.g. if a GETATTR races with a WRITE then by the time the GETATTR
finishes write may have already updated i_size despite the fact that
GETATTR is returning i_size before the write.  In that case we don't
store the i_size we believe is stale but we do use it the stat(2) reply,
since it came from the userspace filesystem, which is the authoritative
source of information.

But that means that if we want stat(2) to work correctly, then we must
flush the updated i_mtime to userspace before we do the GETATTR.

So basically what we need is a per-inode flag that says that i_mtime has
been updated (it is more recent then what userspace has) and we must
update i_mtime *only* in write and not other operations which still do
the mtime update in the userspace filesystem.  Any operation that
modifies i_mtime (and hence invalidate the attributes) must clear the
flag.  Any other operation which updates or invalidates the attributes
must first flush the i_mtime to userspace if the flag is set.

In addition the userspace fileystem need to implement the policy similar
to NFS, in which it only updates mtime if it is greater than the current
one.  This means that we must differentiate between an mtime update due
to a buffered write from an mtime update due to an utime (and friends)
system call.

I think that would work, but it's a tricky thing and needs to be thought
trough.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
       [not found]   ` <4FF3166B.5090800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-13 16:57     ` Miklos Szeredi
  2012-07-16  3:27       ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-13 16:57 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel,
	Alexander Viro, James Bottomley, Kirill Korotaev

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
> counter is hight ehough. This prevents us from having too many dirty
> pages on fuse, thus giving the userspace part of it a chance to write
> stuff properly.
>
> Note, that the existing balance logic is per-bdi, i.e. if the fuse
> user task gets stuck in the function this means, that it either
> writes to the mountpoint it serves (but it can deadlock even without
> the writeback) or it is wrting to some _other_ dirty bdi and in the
> latter case someone else will free the memory for it.

This is not just about deadlocking.  Unprivileged fuse filesystems
should not impact the operation of other filesystems.  I.e. a fuse
filesystem which is not making progress writing out pages shouln't cause
a write on an unrelated filesystem to block.

I believe this patch breaks that promise.

Thanks,
Miklos


>
> Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
> ---
>  mm/page-writeback.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..7cb54db 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1192,7 +1192,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> +		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
> +			global_page_state(NR_WRITEBACK_TEMP);
>  
>  		global_dirty_limits(&background_thresh, &dirty_thresh);

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-07-13 16:57     ` Miklos Szeredi
@ 2012-07-16  3:27       ` Pavel Emelyanov
  2012-07-17 19:11         ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-16  3:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/13/2012 08:57 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>> counter is hight ehough. This prevents us from having too many dirty
>> pages on fuse, thus giving the userspace part of it a chance to write
>> stuff properly.
>>
>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>> user task gets stuck in the function this means, that it either
>> writes to the mountpoint it serves (but it can deadlock even without
>> the writeback) or it is wrting to some _other_ dirty bdi and in the
>> latter case someone else will free the memory for it.
> 
> This is not just about deadlocking.  Unprivileged fuse filesystems
> should not impact the operation of other filesystems.  I.e. a fuse
> filesystem which is not making progress writing out pages shouln't cause
> a write on an unrelated filesystem to block.
> 
> I believe this patch breaks that promise.

Hm... I believe it does not, and that's why.

When a task writes to some bdi the balance_dirty_pages will evaluate the
amount of time to block this task on based on this bdi dirty set counters.
The global stats are only used to a) check whether this decision should be
made at all and b) evaluate the dirty "fraction" of a bdi. That said, even
if we stop the fuse daemon (I actually did this) other filesystems won't
lock. The global counter would be high, yes, but the dirty set fraction of 
non-fuse bdi would be low thus allowing others to progress.

> Thanks,
> Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-13 16:30       ` Miklos Szeredi
@ 2012-07-16  3:32         ` Pavel Emelyanov
  2012-07-17 15:17           ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-16  3:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/13/2012 08:30 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
> 
>> On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Make fuse think that when writeback is on the inode's i_size is alway
>>>> up-to-date and not update it with the value received from the
>>>> userspace. This is done because the page cache code may update i_size
>>>> without letting the FS know.
>>>
>>> Similar rule applies to i_mtime.  Except it's even more tricky, since
>>> you have to flush the mtime together with the data (the flush should not
>>> update the modification time).  And we have other operations which also
>>> change the mtime, and those also need to be updated.
>>>
>>> We should probably look at what NFS is doing.
>>
>> Miklos,
>>
>> I've looked at how NFS and FUSE manage the i_mtime.
>>
>> The NFS (if not looking at the fscache) tries to keep the i_mtime at the
>> maximum between the local and the remote values. This looks like correct
>> behavior for FUSE too.
>>
>> But, looking at the FUSE code I see that the existing attr_version checks
>> do implement the same approach even if we turn the writeback cache (this
>> series) ON.
> 
> It's not as simple as that.
> 
> First off, fuse sets S_NOCMTIME which means the kernel won't update
> i_mtime on writes.  Which is fine for write-through, since the time
> update is always the responsibility of the userspace filesystem.

Oops, I've missed that fact :( I wonder why writes did update the mtime
in my tests then...

> If we are doing buffered writes, then the kernel must update i_mtime on
> write(2) and must flush that to the userspace filesystem at some point
> (with a SETTATTR operation).

Agree.

> The attr_version thing is about making sure that we don't update the
> kernel's cached attribute with stale data from the userspace filesystem.
> E.g. if a GETATTR races with a WRITE then by the time the GETATTR
> finishes write may have already updated i_size despite the fact that
> GETATTR is returning i_size before the write.  In that case we don't
> store the i_size we believe is stale but we do use it the stat(2) reply,
> since it came from the userspace filesystem, which is the authoritative
> source of information.
> 
> But that means that if we want stat(2) to work correctly, then we must
> flush the updated i_mtime to userspace before we do the GETATTR.

Why? Isn't it enough just to look at the per-inode flag (you're talking
below) and take either cached i_mtime value or the user-space verion of it?

> So basically what we need is a per-inode flag that says that i_mtime has
> been updated (it is more recent then what userspace has) and we must
> update i_mtime *only* in write and not other operations which still do
> the mtime update in the userspace filesystem.  Any operation that
> modifies i_mtime (and hence invalidate the attributes) must clear the
> flag.  Any other operation which updates or invalidates the attributes
> must first flush the i_mtime to userspace if the flag is set.
> 
> In addition the userspace fileystem need to implement the policy similar
> to NFS, in which it only updates mtime if it is greater than the current
> one.  This means that we must differentiate between an mtime update due
> to a buffered write from an mtime update due to an utime (and friends)
> system call.
> 
> I think that would work, but it's a tricky thing and needs to be thought
> trough.
> 
> Thanks,
> Miklos
> .
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-07-16  3:32         ` Pavel Emelyanov
@ 2012-07-17 15:17           ` Miklos Szeredi
  0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-17 15:17 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

Pavel Emelyanov <xemul@parallels.com> writes:

>> The attr_version thing is about making sure that we don't update the
>> kernel's cached attribute with stale data from the userspace filesystem.
>> E.g. if a GETATTR races with a WRITE then by the time the GETATTR
>> finishes write may have already updated i_size despite the fact that
>> GETATTR is returning i_size before the write.  In that case we don't
>> store the i_size we believe is stale but we do use it the stat(2) reply,
>> since it came from the userspace filesystem, which is the authoritative
>> source of information.
>> 
>> But that means that if we want stat(2) to work correctly, then we must
>> flush the updated i_mtime to userspace before we do the GETATTR.
>
> Why? Isn't it enough just to look at the per-inode flag (you're talking
> below) and take either cached i_mtime value or the user-space verion
> of it?

That would only work if the kernel has the more up-to-date mtime value,
which is not necessarily true.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-07-16  3:27       ` Pavel Emelyanov
@ 2012-07-17 19:11         ` Miklos Szeredi
  2012-07-27  4:01           ` Pavel Emelyanov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-17 19:11 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

Pavel Emelyanov <xemul@parallels.com> writes:

> On 07/13/2012 08:57 PM, Miklos Szeredi wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>> 
>>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>>> counter is hight ehough. This prevents us from having too many dirty
>>> pages on fuse, thus giving the userspace part of it a chance to write
>>> stuff properly.
>>>
>>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>>> user task gets stuck in the function this means, that it either
>>> writes to the mountpoint it serves (but it can deadlock even without
>>> the writeback) or it is wrting to some _other_ dirty bdi and in the
>>> latter case someone else will free the memory for it.
>> 
>> This is not just about deadlocking.  Unprivileged fuse filesystems
>> should not impact the operation of other filesystems.  I.e. a fuse
>> filesystem which is not making progress writing out pages shouln't cause
>> a write on an unrelated filesystem to block.
>> 
>> I believe this patch breaks that promise.
>
> Hm... I believe it does not, and that's why.
>
> When a task writes to some bdi the balance_dirty_pages will evaluate the
> amount of time to block this task on based on this bdi dirty set counters.
> The global stats are only used to a) check whether this decision should be
> made at all

Okay, maybe I'm blind but if this is true, then how is
balance_dirty_pages() supposed to ensure that the per-bdi limit is not
exceeded?

> and b) evaluate the dirty "fraction" of a bdi. That said, even
> if we stop the fuse daemon (I actually did this) other filesystems won't
> lock. The global counter would be high, yes, but the dirty set fraction of 
> non-fuse bdi would be low thus allowing others to progress.

That makes some sense, but it looks to me that FUSE, NFS and friends
want a stricter dirty balancing logic that looks at the bdi thresholds
even if the global limits are not exceeded.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
  2012-07-17 19:11         ` Miklos Szeredi
@ 2012-07-27  4:01           ` Pavel Emelyanov
       [not found]             ` <5012127C.8070203-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-27  4:01 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
	Kirill Korotaev

On 07/17/2012 11:11 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:

Miklos, sorry for the late response. Please, find the answers inline.

>> On 07/13/2012 08:57 PM, Miklos Szeredi wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
>>>> counter is hight ehough. This prevents us from having too many dirty
>>>> pages on fuse, thus giving the userspace part of it a chance to write
>>>> stuff properly.
>>>>
>>>> Note, that the existing balance logic is per-bdi, i.e. if the fuse
>>>> user task gets stuck in the function this means, that it either
>>>> writes to the mountpoint it serves (but it can deadlock even without
>>>> the writeback) or it is wrting to some _other_ dirty bdi and in the
>>>> latter case someone else will free the memory for it.
>>>
>>> This is not just about deadlocking.  Unprivileged fuse filesystems
>>> should not impact the operation of other filesystems.  I.e. a fuse
>>> filesystem which is not making progress writing out pages shouln't cause
>>> a write on an unrelated filesystem to block.
>>>
>>> I believe this patch breaks that promise.
>>
>> Hm... I believe it does not, and that's why.
>>
>> When a task writes to some bdi the balance_dirty_pages will evaluate the
>> amount of time to block this task on based on this bdi dirty set counters.
>> The global stats are only used to a) check whether this decision should be
>> made at all
> 
> Okay, maybe I'm blind but if this is true, then how is
> balance_dirty_pages() supposed to ensure that the per-bdi limit is not
> exceeded?

The balance_dirty_pages logic is _very_ roughly the the following:

Let this_bdi be a bdi the current task is writing to
Let D be the total amount of dirty and writeback memory (and writeback_tmp after this patch)
Let L be the limit of dirty memory (L = ram_size * ratio)
Let d be the amount of dirty and writeback on this_bdi
And let l be the limit of dirty memory on this_bdi

With that the balancer logic look like

while (1) {
	if (D < L)
		return;

	start_background_writeback(this_bdi);

	if (d < l)
		return;

	timeout = get_sleep_timeout(d, l, D, L);
	shcedule_timeout(timeout);
}

The d and l are calculated out of the D and L using this_bdi and global IO completions
proportions (with more complexity, but still).

Thus, since we throttle tasks looking ad d and l only we cannot affect all the bdis in
the system by live-locking a single one of them.

Accounting for writeback_tmp is required since the D should become high when there are
lots of pages in-flight in FUSE. Otherwise, the balance_dirty_pages will not limit the
task writing on a fuse mount.

>> and b) evaluate the dirty "fraction" of a bdi. That said, even
>> if we stop the fuse daemon (I actually did this) other filesystems won't
>> lock. The global counter would be high, yes, but the dirty set fraction of 
>> non-fuse bdi would be low thus allowing others to progress.
> 
> That makes some sense, but it looks to me that FUSE, NFS and friends
> want a stricter dirty balancing logic that looks at the bdi thresholds
> even if the global limits are not exceeded.

Probably, but I did a very straighforward test -- I just stopped the fuse daemon and started
writing to a fuse file. After some time the writing task was locked in balance_dirty_pages,
since fuse daemon didn't ack-ed writeback. At the same time I tried to write to other bdis
(disks and nfs) and none of them was locked, all the writes succeeded. After I let the fuse
daemon run again the fuse-writer unlocked and went on writing.

Do you have some trickier scenario in mind?

> Thanks,
> Miklos
> .
> 

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
       [not found]             ` <5012127C.8070203-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-08-07 17:30               ` Miklos Szeredi
  0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2012-08-07 17:30 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel,
	Alexander Viro, James Bottomley, Kirill Korotaev

Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:

> On 07/17/2012 11:11 PM, Miklos Szeredi wrote:
>> 
>> Okay, maybe I'm blind but if this is true, then how is
>> balance_dirty_pages() supposed to ensure that the per-bdi limit is not
>> exceeded?
>
> The balance_dirty_pages logic is _very_ roughly the the following:
>
> Let this_bdi be a bdi the current task is writing to
> Let D be the total amount of dirty and writeback memory (and writeback_tmp after this patch)
> Let L be the limit of dirty memory (L = ram_size * ratio)
> Let d be the amount of dirty and writeback on this_bdi
> And let l be the limit of dirty memory on this_bdi
>
> With that the balancer logic look like
>
> while (1) {
> 	if (D < L)
> 		return;
>
> 	start_background_writeback(this_bdi);
>
> 	if (d < l)
> 		return;
>
> 	timeout = get_sleep_timeout(d, l, D, L);
> 	shcedule_timeout(timeout);
> }
>
> The d and l are calculated out of the D and L using this_bdi and
> global IO completions proportions (with more complexity, but still).
>
> Thus, since we throttle tasks looking ad d and l only we cannot affect
> all the bdis in the system by live-locking a single one of them.
>
> Accounting for writeback_tmp is required since the D should become
> high when there are lots of pages in-flight in FUSE. Otherwise, the
> balance_dirty_pages will not limit the task writing on a fuse mount.

Okay, that makes sense, and it's certainly an improvement from the
current situation.

What I'm worried about is that with the above algorithm a filesystem's
"d" can grow as high as "L" if only that filesystem is dirtying memory.

If that filesystem is very slow or broken and other filesystems start
dirtying data then they are left with only a fraction of the original
limit.

They won't deadlock, but performance will be affected.  So ideally I'd
like to see more strict per-bdi limit enforcement for fuse (the per-bdi
limit is just 1% of "L" by default on fuse).

Thanks,
Miklos

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
       [not found]     ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
@ 2012-10-01 17:30       ` Maxim V. Patlasov
  2012-11-16  9:49         ` Miklos Szeredi
  0 siblings, 1 reply; 43+ messages in thread
From: Maxim V. Patlasov @ 2012-10-01 17:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Kirill Korotaev, Pavel Emelianov,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, James Bottomley,
	Alexander Viro, linux-fsdevel

Hi Miklos,

07/04/2012 06:39 PM, Miklos Szeredi ?????:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
> Similar rule applies to i_mtime.  Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time).  And we have other operations which also
> change the mtime, and those also need to be updated.
>
> We should probably look at what NFS is doing.

In case of NFS, the flush does updates the modification time on server. 
And on client, getattr triggers flush:

> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct 
> kstat *stat)
> {
>     ...
>
>     /* Flush out writes to the server in order to update c/mtime. */
>     if (S_ISREG(inode->i_mode)) {
>         nfs_inode_dio_wait(inode);
>         err = filemap_write_and_wait(inode->i_mapping);
>         if (err)
>             goto out;
>     }

In another email of this thread you suggested some approach where 
in-kernel fuse flushes i_mtime to userspace:

> So basically what we need is a per-inode flag that says that i_mtime has
> been updated (it is more recent then what userspace has) and we must
> update i_mtime*only*  in write and not other operations which still do
> the mtime update in the userspace filesystem.  Any operation that
> modifies i_mtime (and hence invalidate the attributes) must clear the
> flag.  Any other operation which updates or invalidates the attributes
> must first flush the i_mtime to userspace if the flag is set.
>
> In addition the userspace fileystem need to implement the policy similar
> to NFS, in which it only updates mtime if it is greater than the current
> one.  This means that we must differentiate between an mtime update due
> to a buffered write from an mtime update due to an utime (and friends)
> system call.

My question is why do we need all these complications if we could follow 
NFS way: trigger flush and wait for its (and fuse write-back) completion 
before sending FUSE_GETATTR to userspace?

Another concern is about the idea of sending i_mtime to userspace per 
se. You wrote:

> If we are doing buffered writes, then the kernel must update i_mtime on
> write(2) and must flush that to the userspace filesystem at some point
> (with a SETTATTR operation).

Fuse userspace may have its own non-trivial concept of 'modification 
time'. It's not obliged to advance its mtime on every write. The only 
requirement is to be consistent: if we expose new data handling READs, 
mtime must be advanced properly as well. But, for example, the 
granularity of changes is up to implementation. From this view, in-kenel 
fuse pushing i_mtime with a SETATTR operation would look like a cheating 
userspace. What do you think?

Thanks,
Maxim
------------------------------------------------------------------------------
Got visibility?
Most devs has no idea what their production app looks like.
Find out how fast your code is with AppDynamics Lite.
http://ad.doubleclick.net/clk;262219671;13503038;y?
http://info.appdynamics.com/FreeJavaPerformanceDownload.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-10-01 17:30       ` Maxim V. Patlasov
@ 2012-11-16  9:49         ` Miklos Szeredi
  2012-11-16 10:32           ` Maxim V. Patlasov
  0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-11-16  9:49 UTC (permalink / raw)
  To: Maxim V. Patlasov
  Cc: Pavel Emelianov, fuse-devel, Alexander Viro, linux-fsdevel,
	James Bottomley, Kirill Korotaev

"Maxim V. Patlasov" <mpatlasov@parallels.com> writes:

>     We should probably look at what NFS is doing.
>
>
> In case of NFS, the flush does updates the modification time on server. And on
> client, getattr triggers flush:
>
>
>     int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
>     *stat)
>     {
>         ...
>
>         /* Flush out writes to the server in order to update c/mtime.  */
>         if (S_ISREG(inode->i_mode)) {
>             nfs_inode_dio_wait(inode);
>             err = filemap_write_and_wait(inode->i_mapping);
>             if (err)
>                 goto out;
>         }
>
>
> In another email of this thread you suggested some approach where in-kernel
> fuse flushes i_mtime to userspace:
>
>
>     So basically what we need is a per-inode flag that says that i_mtime has
>     been updated (it is more recent then what userspace has) and we must
>     update i_mtime *only* in write and not other operations which still do
>     the mtime update in the userspace filesystem.  Any operation that
>     modifies i_mtime (and hence invalidate the attributes) must clear the
>     flag.  Any other operation which updates or invalidates the attributes
>     must first flush the i_mtime to userspace if the flag is set.
>
>     In addition the userspace fileystem need to implement the policy similar
>     to NFS, in which it only updates mtime if it is greater than the current
>     one.  This means that we must differentiate between an mtime update due
>     to a buffered write from an mtime update due to an utime (and friends)
>     system call.
>
>
> My question is why do we need all these complications if we could follow NFS
> way: trigger flush and wait for its (and fuse write-back) completion before
> sending FUSE_GETATTR to userspace?


Yes, the NFS way seems like a good approach assuming that getattrs are
not too frequent.  But I guess the fact that NFS does this is a pretty
good assurance that will work fine.


>
> Another concern is about the idea of sending i_mtime to userspace per se. You
> wrote:
>
>
>     If we are doing buffered writes, then the kernel must update i_mtime on
>     write(2) and must flush that to the userspace filesystem at some point
>     (with a SETTATTR operation).
>
>
> Fuse userspace may have its own non-trivial concept of 'modification time'.
> It's not obliged to advance its mtime on every write. The only requirement is
> to be consistent: if we expose new data handling READs, mtime must be advanced
> properly as well. But, for example, the granularity of changes is up to
> implementation. From this view, in-kenel fuse pushing i_mtime with a SETATTR
> operation would look like a cheating userspace. What do you think?

I think you are right in that mixing kernel mtime updates with userspace
mtime updates doesn't work.  Either the kernel should be wholly
responsible (which works only for "local" filesystems) or the userspace
is fully responsible for mtime updates (which works in all cases but may
be suboptimal).

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/10] fuse: Trust kernel i_size only
  2012-11-16  9:49         ` Miklos Szeredi
@ 2012-11-16 10:32           ` Maxim V. Patlasov
  0 siblings, 0 replies; 43+ messages in thread
From: Maxim V. Patlasov @ 2012-11-16 10:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Pavel Emelianov, fuse-devel, Alexander Viro, linux-fsdevel,
	James Bottomley, Kirill Korotaev

Hi Miklos,

Thanks a lot for reply. See please inline comments below...

11/16/2012 01:49 PM, Miklos Szeredi пишет:
> "Maxim V. Patlasov" <mpatlasov@parallels.com> writes:
>
>>      We should probably look at what NFS is doing.
>>
>>
>> In case of NFS, the flush does updates the modification time on server. And on
>> client, getattr triggers flush:
>>
>>
>>      int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
>>      *stat)
>>      {
>>          ...
>>
>>          /* Flush out writes to the server in order to update c/mtime.  */
>>          if (S_ISREG(inode->i_mode)) {
>>              nfs_inode_dio_wait(inode);
>>              err = filemap_write_and_wait(inode->i_mapping);
>>              if (err)
>>                  goto out;
>>          }
>>
>>
>> In another email of this thread you suggested some approach where in-kernel
>> fuse flushes i_mtime to userspace:
>>
>>
>>      So basically what we need is a per-inode flag that says that i_mtime has
>>      been updated (it is more recent then what userspace has) and we must
>>      update i_mtime *only* in write and not other operations which still do
>>      the mtime update in the userspace filesystem.  Any operation that
>>      modifies i_mtime (and hence invalidate the attributes) must clear the
>>      flag.  Any other operation which updates or invalidates the attributes
>>      must first flush the i_mtime to userspace if the flag is set.
>>
>>      In addition the userspace fileystem need to implement the policy similar
>>      to NFS, in which it only updates mtime if it is greater than the current
>>      one.  This means that we must differentiate between an mtime update due
>>      to a buffered write from an mtime update due to an utime (and friends)
>>      system call.
>>
>>
>> My question is why do we need all these complications if we could follow NFS
>> way: trigger flush and wait for its (and fuse write-back) completion before
>> sending FUSE_GETATTR to userspace?
>
> Yes, the NFS way seems like a good approach assuming that getattrs are
> not too frequent.  But I guess the fact that NFS does this is a pretty
> good assurance that will work fine.

My first intention was to follow NFS way because it's very simple and 
straightforward. But then I realized it would impact performance too 
badly. You correctly noticed that frequent getattrs will be a problem. 
But there will be a problem even without it: a single innocent 'ls' will 
wait for pretty long till all dirty memory is flushed.


>> Another concern is about the idea of sending i_mtime to userspace per se. You
>> wrote:
>>
>>
>>      If we are doing buffered writes, then the kernel must update i_mtime on
>>      write(2) and must flush that to the userspace filesystem at some point
>>      (with a SETTATTR operation).
>>
>>
>> Fuse userspace may have its own non-trivial concept of 'modification time'.
>> It's not obliged to advance its mtime on every write. The only requirement is
>> to be consistent: if we expose new data handling READs, mtime must be advanced
>> properly as well. But, for example, the granularity of changes is up to
>> implementation. From this view, in-kenel fuse pushing i_mtime with a SETATTR
>> operation would look like a cheating userspace. What do you think?
> I think you are right in that mixing kernel mtime updates with userspace
> mtime updates doesn't work.  Either the kernel should be wholly
> responsible (which works only for "local" filesystems) or the userspace
> is fully responsible for mtime updates (which works in all cases but may
> be suboptimal).

Not having any feedback from you for long while I worked pretty hard to 
implement the approach you suggested early (update mtime locally on 
buffered writes, flush it to userspace when needed). Now I have an 
implementation that works in my tests. I'll send patches soon.

Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2012-11-16 10:32 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
2012-07-03 15:54 ` [PATCH 2/10] fuse: Getting file for writeback helper Pavel Emelyanov
2012-07-03 15:54 ` [PATCH 3/10] fuse: Prepare to handle short reads Pavel Emelyanov
2012-07-03 15:55 ` [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback Pavel Emelyanov
2012-07-04 13:06   ` Miklos Szeredi
2012-07-04 14:26     ` Pavel Emelyanov
     [not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-03 15:55   ` [PATCH 5/10] fuse: Connection bit for enabling writeback Pavel Emelyanov
2012-07-03 15:56   ` [PATCH 7/10] fuse: Flush files on wb close Pavel Emelyanov
2012-07-03 15:56   ` [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks Pavel Emelyanov
2012-07-03 15:57   ` [PATCH 9/10] fuse: Turn writeback on Pavel Emelyanov
2012-07-04  3:01   ` [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Nikolaus Rath
     [not found]     ` <87a9zg1b7q.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
2012-07-04  7:11       ` Pavel Emelyanov
2012-07-04 13:22         ` Nikolaus Rath
     [not found]           ` <4FF4438B.8050807-BTH8mxji4b0@public.gmane.org>
2012-07-04 14:33             ` Pavel Emelyanov
     [not found]               ` <4FF45447.5000705-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-04 17:08                 ` Nikolaus Rath
2012-07-05  9:01                   ` Pavel Emelyanov
2012-07-05 13:07         ` Nikolaus Rath
2012-07-05 14:08           ` Pavel Emelyanov
2012-07-05 14:29             ` Nikolaus Rath
2012-07-05 14:34               ` Pavel Emelyanov
2012-07-06  2:04                 ` Nikolaus Rath
     [not found]                   ` <8762a1odbf.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
2012-07-06  8:46                     ` Pavel Emelyanov
2012-07-05 19:31   ` Anand Avati
     [not found]     ` <4FF5EB85.1050701-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-07-05 20:07       ` Pavel Emelyanov
2012-07-06 11:52         ` [fuse-devel] " Kirill Korotaev
2012-07-03 15:55 ` [PATCH 6/10] fuse: Trust kernel i_size only Pavel Emelyanov
2012-07-04 14:39   ` Miklos Szeredi
2012-07-05 14:10     ` Pavel Emelyanov
2012-07-10  5:53     ` Pavel Emelyanov
2012-07-13 16:30       ` Miklos Szeredi
2012-07-16  3:32         ` Pavel Emelyanov
2012-07-17 15:17           ` Miklos Szeredi
     [not found]     ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
2012-10-01 17:30       ` Maxim V. Patlasov
2012-11-16  9:49         ` Miklos Szeredi
2012-11-16 10:32           ` Maxim V. Patlasov
2012-07-03 15:57 ` [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Pavel Emelyanov
     [not found]   ` <4FF3166B.5090800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-13 16:57     ` Miklos Szeredi
2012-07-16  3:27       ` Pavel Emelyanov
2012-07-17 19:11         ` Miklos Szeredi
2012-07-27  4:01           ` Pavel Emelyanov
     [not found]             ` <5012127C.8070203-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-08-07 17:30               ` Miklos Szeredi
2012-07-05 19:26 ` [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Anand Avati

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.