* [PATCH 1/10] fuse: Linking file to inode helper
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
@ 2012-07-03 15:53 ` Pavel Emelyanov
2012-07-03 15:54 ` [PATCH 2/10] fuse: Getting file for writeback helper Pavel Emelyanov
` (6 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:53 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
When writeback is ON every writable file should be in per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
fs/fuse/file.c | 33 +++++++++++++++++++--------------
1 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b321a68..7d10b4c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -167,6 +167,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
}
EXPORT_SYMBOL_GPL(fuse_do_open);
+static void fuse_link_write_file(struct file *file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_file *ff = file->private_data;
+ /*
+ * file may be written through mmap, so chain it onto the
+ * inodes's write_file list
+ */
+ spin_lock(&fc->lock);
+ if (list_empty(&ff->write_entry))
+ list_add(&ff->write_entry, &fi->write_files);
+ spin_unlock(&fc->lock);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -1380,20 +1396,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
- if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
- struct inode *inode = file->f_dentry->d_inode;
- struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
- struct fuse_file *ff = file->private_data;
- /*
- * file may be written through mmap, so chain it onto the
- * inodes's write_file list
- */
- spin_lock(&fc->lock);
- if (list_empty(&ff->write_entry))
- list_add(&ff->write_entry, &fi->write_files);
- spin_unlock(&fc->lock);
- }
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+ fuse_link_write_file(file);
+
file_accessed(file);
vma->vm_ops = &fuse_file_vm_ops;
return 0;
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 2/10] fuse: Getting file for writeback helper
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
@ 2012-07-03 15:54 ` Pavel Emelyanov
2012-07-03 15:54 ` [PATCH 3/10] fuse: Prepare to handle short reads Pavel Emelyanov
` (5 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:54 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
fs/fuse/file.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7d10b4c..9c32bf5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1273,6 +1273,19 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
fuse_writepage_free(fc, req);
}
+static struct fuse_file *fuse_write_file(struct fuse_conn *fc, struct fuse_inode *fi)
+{
+ struct fuse_file *ff;
+
+ spin_lock(&fc->lock);
+ BUG_ON(list_empty(&fi->write_files));
+ ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+ fuse_file_get(ff);
+ spin_unlock(&fc->lock);
+
+ return ff;
+}
+
static int fuse_writepage_locked(struct page *page)
{
struct address_space *mapping = page->mapping;
@@ -1280,7 +1293,6 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_req *req;
- struct fuse_file *ff;
struct page *tmp_page;
set_page_writeback(page);
@@ -1293,13 +1305,8 @@ static int fuse_writepage_locked(struct page *page)
if (!tmp_page)
goto err_free;
- spin_lock(&fc->lock);
- BUG_ON(list_empty(&fi->write_files));
- ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
- req->ff = fuse_file_get(ff);
- spin_unlock(&fc->lock);
-
- fuse_write_fill(req, ff, page_offset(page), 0);
+ req->ff = fuse_write_file(fc, fi);
+ fuse_write_fill(req, req->ff, page_offset(page), 0);
copy_highpage(tmp_page, page);
req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 3/10] fuse: Prepare to handle short reads
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
2012-07-03 15:53 ` [PATCH 1/10] fuse: Linking file to inode helper Pavel Emelyanov
2012-07-03 15:54 ` [PATCH 2/10] fuse: Getting file for writeback helper Pavel Emelyanov
@ 2012-07-03 15:54 ` Pavel Emelyanov
2012-07-03 15:55 ` [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback Pavel Emelyanov
` (4 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:54 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
A helper which gets called when read reports less bytes than was requested.
See patch #6 (trust kernel i_size only) for details.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
fs/fuse/file.c | 20 ++++++++++++--------
1 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9c32bf5..6bf9723 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -538,6 +538,14 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
spin_unlock(&fc->lock);
}
+static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
+{
+ size_t num_read = req->out.args[0].size;
+
+ loff_t pos = page_offset(req->pages[0]) + num_read;
+ fuse_read_update_size(inode, pos, attr_ver);
+}
+
static int fuse_readpage(struct file *file, struct page *page)
{
struct inode *inode = page->mapping->host;
@@ -573,18 +581,18 @@ static int fuse_readpage(struct file *file, struct page *page)
req->pages[0] = page;
num_read = fuse_send_read(req, file, pos, count, NULL);
err = req->out.h.error;
- fuse_put_request(fc, req);
if (!err) {
/*
* Short read means EOF. If file size is larger, truncate it
*/
if (num_read < count)
- fuse_read_update_size(inode, pos + num_read, attr_ver);
+ fuse_short_read(req, inode, attr_ver);
SetPageUptodate(page);
}
+ fuse_put_request(fc, req);
fuse_invalidate_attr(inode); /* atime changed */
out:
unlock_page(page);
@@ -607,13 +615,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
/*
* Short read means EOF. If file size is larger, truncate it
*/
- if (!req->out.h.error && num_read < count) {
- loff_t pos;
+ if (!req->out.h.error && num_read < count)
+ fuse_short_read(req, inode, req->misc.read.attr_ver);
- pos = page_offset(req->pages[0]) + num_read;
- fuse_read_update_size(inode, pos,
- req->misc.read.attr_ver);
- }
fuse_invalidate_attr(inode); /* atime changed */
}
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
` (2 preceding siblings ...)
2012-07-03 15:54 ` [PATCH 3/10] fuse: Prepare to handle short reads Pavel Emelyanov
@ 2012-07-03 15:55 ` Pavel Emelyanov
2012-07-04 13:06 ` Miklos Szeredi
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (3 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
The .writepages callback will issue writeback requests with more than one
page aboard. Make existing end/check code be aware of this.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
fs/fuse/file.c | 21 ++++++++++++++-------
1 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6bf9723..47f0f2e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
BUG_ON(req->inode != inode);
curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
- if (curr_index == index) {
+ if (curr_index == index && index < curr_index + req->num_pages) {
found = true;
break;
}
@@ -1192,7 +1192,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,
static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
- __free_page(req->pages[0]);
+ int i;
+
+ for (i = 0; i < req->num_pages; i++)
+ __free_page(req->pages[i]);
fuse_file_put(req->ff, false);
}
@@ -1201,10 +1204,13 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+ int i;
list_del(&req->writepages_entry);
- dec_bdi_stat(bdi, BDI_WRITEBACK);
- dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
+ for (i = 0; i < req->num_pages; i++) {
+ dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+ }
bdi_writeout_inc(bdi);
wake_up(&fi->page_waitq);
}
@@ -1217,14 +1223,15 @@ __acquires(fc->lock)
struct fuse_inode *fi = get_fuse_inode(req->inode);
loff_t size = i_size_read(req->inode);
struct fuse_write_in *inarg = &req->misc.write.in;
+ __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;
if (!fc->connected)
goto out_free;
- if (inarg->offset + PAGE_CACHE_SIZE <= size) {
- inarg->size = PAGE_CACHE_SIZE;
+ if (inarg->offset + data_size <= size) {
+ inarg->size = data_size;
} else if (inarg->offset < size) {
- inarg->size = size & (PAGE_CACHE_SIZE - 1);
+ inarg->size = size - inarg->offset;
} else {
/* Got truncated off completely */
goto out_free;
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
2012-07-03 15:55 ` [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback Pavel Emelyanov
@ 2012-07-04 13:06 ` Miklos Szeredi
2012-07-04 14:26 ` Pavel Emelyanov
0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-04 13:06 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
Pavel Emelyanov <xemul@parallels.com> writes:
> The .writepages callback will issue writeback requests with more than one
> page aboard. Make existing end/check code be aware of this.
>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
> fs/fuse/file.c | 21 ++++++++++++++-------
> 1 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6bf9723..47f0f2e 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
>
> BUG_ON(req->inode != inode);
> curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
> - if (curr_index == index) {
> + if (curr_index == index && index < curr_index + req->num_pages) {
This condition looks bogus.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 4/10] fuse: Prepare to handle multiple pages in writeback
2012-07-04 13:06 ` Miklos Szeredi
@ 2012-07-04 14:26 ` Pavel Emelyanov
0 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-04 14:26 UTC (permalink / raw)
To: Miklos Szeredi
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
On 07/04/2012 05:06 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> The .writepages callback will issue writeback requests with more than one
>> page aboard. Make existing end/check code be aware of this.
>>
>> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
>> ---
>> fs/fuse/file.c | 21 ++++++++++++++-------
>> 1 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 6bf9723..47f0f2e 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -345,7 +345,7 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
>>
>> BUG_ON(req->inode != inode);
>> curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
>> - if (curr_index == index) {
>> + if (curr_index == index && index < curr_index + req->num_pages) {
>
> This condition looks bogus.
Oops, indeed :( This should be
if (curr_index <= index && index < curr_index + req->num_pages)
> Thanks,
> Miklos
> .
>
Thanks,
Pavel
^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* [PATCH 5/10] fuse: Connection bit for enabling writeback
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-03 15:55 ` Pavel Emelyanov
2012-07-03 15:56 ` [PATCH 7/10] fuse: Flush files on wb close Pavel Emelyanov
` (4 subsequent siblings)
5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
Alexander Viro, linux-fsdevel
Cc: Kirill Korotaev, James Bottomley
Off (0) by default. Will be used in the next patches and will be turned
on at the very end.
Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
fs/fuse/fuse_i.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 771fb63..218b0e6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -428,6 +428,9 @@ struct fuse_conn {
/** Set if bdi is valid */
unsigned bdi_initialized:1;
+ /** write-back cache policy (default is write-through) */
+ unsigned writeback_cache:1;
+
/*
* The following bitfields are only for optimization purposes
* and hence races in setting them will not cause malfunction
--
1.5.5.6
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 7/10] fuse: Flush files on wb close
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-03 15:55 ` [PATCH 5/10] fuse: Connection bit for enabling writeback Pavel Emelyanov
@ 2012-07-03 15:56 ` Pavel Emelyanov
2012-07-03 15:56 ` [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks Pavel Emelyanov
` (3 subsequent siblings)
5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:56 UTC (permalink / raw)
To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
Alexander Viro, linux-fsdevel
Cc: Kirill Korotaev, James Bottomley
Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding writeback cache. Note, that simply calling the
filemap_write_and_wait() is not enough since fuse finishes page writeback
immediatelly and thus the -wait part of the mentioned call will be no-op.
Do real wait on per-inode writepages list.
Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
fs/fuse/file.c | 26 +++++++++++++++++++++++++-
1 files changed, 25 insertions(+), 1 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9bc1390..d9d566e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -137,6 +137,12 @@ static void fuse_file_put(struct fuse_file *ff, bool sync)
}
}
+static void __fuse_file_put(struct fuse_file *ff)
+{
+ if (atomic_dec_and_test(&ff->count))
+ BUG();
+}
+
int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
bool isdir)
{
@@ -285,8 +291,23 @@ static int fuse_open(struct inode *inode, struct file *file)
return fuse_open_common(inode, file, false);
}
+static void fuse_flush_writeback(struct inode *inode, struct file *file)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ filemap_write_and_wait(file->f_mapping);
+ wait_event(fi->page_waitq, list_empty_careful(&fi->writepages));
+ spin_unlock_wait(&fc->lock);
+}
+
static int fuse_release(struct inode *inode, struct file *file)
{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (fc->writeback_cache)
+ fuse_flush_writeback(inode, file);
+
fuse_release_common(file, FUSE_RELEASE);
/* return value is ignored by VFS */
@@ -1215,7 +1236,8 @@ static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
for (i = 0; i < req->num_pages; i++)
__free_page(req->pages[i]);
- fuse_file_put(req->ff, false);
+ if (!fc->writeback_cache)
+ fuse_file_put(req->ff, false);
}
static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
@@ -1232,6 +1254,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
}
bdi_writeout_inc(bdi);
wake_up(&fi->page_waitq);
+ if (fc->writeback_cache)
+ __fuse_file_put(req->ff);
}
/* Called under fc->lock, may release and reacquire it */
--
1.5.5.6
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-03 15:55 ` [PATCH 5/10] fuse: Connection bit for enabling writeback Pavel Emelyanov
2012-07-03 15:56 ` [PATCH 7/10] fuse: Flush files on wb close Pavel Emelyanov
@ 2012-07-03 15:56 ` Pavel Emelyanov
2012-07-03 15:57 ` [PATCH 9/10] fuse: Turn writeback on Pavel Emelyanov
` (2 subsequent siblings)
5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:56 UTC (permalink / raw)
To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
Alexander Viro, linux-fsdevel
Cc: Kirill Korotaev, James Bottomley
The .writepages one is required to make each writeback request carry more than
one page on it.
Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
fs/fuse/file.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 201 insertions(+), 1 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d9d566e..058498d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -699,7 +699,10 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
struct fuse_fill_data {
struct fuse_req *req;
- struct file *file;
+ union {
+ struct file *file;
+ struct fuse_file *ff;
+ };
struct inode *inode;
};
@@ -1400,6 +1403,200 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
return err;
}
+static int fuse_send_writepages(struct fuse_fill_data *data)
+{
+ int i, all_ok = 1;
+ struct fuse_req *req = data->req;
+ struct fuse_conn *fc = get_fuse_conn(data->inode);
+ struct fuse_inode *fi = get_fuse_inode(data->inode);
+ loff_t off = -1;
+
+ if (!data->ff)
+ data->ff = fuse_write_file(fc, fi);
+
+ for (i = 0; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+ struct address_space *mapping = page->mapping;
+ struct page *tmp_page;
+
+ tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+ if (tmp_page) {
+ copy_highpage(tmp_page, page);
+ inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ } else
+ all_ok = 0;
+ req->pages[i] = tmp_page;
+ if (i == 0)
+ off = page_offset(page);
+
+ end_page_writeback(page);
+ }
+
+ if (!all_ok)
+ return -ENOMEM;
+
+ req->ff = fuse_file_get(data->ff);
+ fuse_write_fill(req, data->ff, off, 0);
+
+ req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+ req->in.argpages = 1;
+ req->page_offset = 0;
+ req->end = fuse_writepage_end;
+ req->inode = data->inode;
+
+ spin_lock(&fc->lock);
+ list_add(&req->writepages_entry, &fi->writepages);
+ list_add_tail(&req->list, &fi->queued_writes);
+ fuse_flush_writepages(data->inode);
+ spin_unlock(&fc->lock);
+
+ return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+ struct writeback_control *wbc, void *_data)
+{
+ struct fuse_fill_data *data = _data;
+ struct fuse_req *req = data->req;
+ struct inode *inode = data->inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (req->num_pages &&
+ (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+ (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+ req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+ int err;
+
+ err = fuse_send_writepages(data);
+ if (err) {
+ unlock_page(page);
+ return err;
+ }
+
+ data->req = req = fuse_request_alloc_nofs();
+ if (IS_ERR(req)) {
+ unlock_page(page);
+ return PTR_ERR(req);
+ }
+ }
+
+ req->pages[req->num_pages] = page;
+ req->num_pages++;
+ set_page_writeback(page);
+ unlock_page(page);
+
+ return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping, struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_fill_data data;
+ int err;
+
+ if (!fc->writeback_cache)
+ return generic_writepages(mapping, wbc);
+
+ err = -EIO;
+ if (is_bad_inode(inode))
+ goto out;
+
+ data.ff = NULL;
+ data.inode = inode;
+ data.req = fuse_request_alloc_nofs();
+ err = PTR_ERR(data.req);
+ if (IS_ERR(data.req))
+ goto out_put;
+
+ err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+ if (!err) {
+ if (data.req->num_pages)
+ err = fuse_send_writepages(&data);
+ else
+ fuse_put_request(fc, data.req);
+ }
+out_put:
+ if (data.ff)
+ fuse_file_put(data.ff, false);
+out:
+ return err;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+ struct page *page, loff_t pos, unsigned len)
+{
+ struct fuse_req *req;
+ int err;
+
+ if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+ return 0;
+
+ req = fuse_get_req(fc);
+ err = PTR_ERR(req);
+ if (IS_ERR(req))
+ goto out;
+
+ req->out.page_zeroing = 1;
+ req->out.argpages = 1;
+ req->num_pages = 1;
+ req->pages[0] = page;
+ fuse_send_read(req, file, page_offset(page), PAGE_CACHE_SIZE, NULL);
+ err = req->out.h.error;
+ fuse_put_request(fc, req);
+out:
+ if (err) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ return err;
+}
+
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+ struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+ BUG_ON(!fc->writeback_cache);
+
+ *pagep = grab_cache_page_write_begin(mapping, index, flags);
+ if (!*pagep)
+ return -ENOMEM;
+
+ return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+ if (!PageUptodate(page))
+ SetPageUptodate(page);
+
+ fuse_write_update_size(inode, pos);
+ set_page_dirty(page);
+ return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+ fuse_commit_write(file, page, from, from+copied);
+
+ unlock_page(page);
+ page_cache_release(page);
+
+ return copied;
+}
+
static int fuse_launder_page(struct page *page)
{
int err = 0;
@@ -2318,11 +2515,14 @@ static const struct file_operations fuse_direct_io_file_operations = {
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
.writepage = fuse_writepage,
+ .writepages = fuse_writepages,
.launder_page = fuse_launder_page,
.readpages = fuse_readpages,
.set_page_dirty = __set_page_dirty_nobuffers,
.bmap = fuse_bmap,
.direct_IO = fuse_direct_IO,
+ .write_begin = fuse_write_begin,
+ .write_end = fuse_write_end,
};
void fuse_init_file_inode(struct inode *inode)
--
1.5.5.6
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH 9/10] fuse: Turn writeback on
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (2 preceding siblings ...)
2012-07-03 15:56 ` [PATCH 8/10] fuse: Implement writepages and write_begin/write_end callbacks Pavel Emelyanov
@ 2012-07-03 15:57 ` Pavel Emelyanov
2012-07-04 3:01 ` [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Nikolaus Rath
2012-07-05 19:31 ` Anand Avati
5 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:57 UTC (permalink / raw)
To: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Miklos Szeredi,
Alexander Viro, linux-fsdevel
Cc: Kirill Korotaev, James Bottomley
Introduce a bit kernel and userspace exchange between each-other on
the init stage and turn writeback on if the userspace want this.
Also add each writable file into per-inode write list and call the
generic_file_aio_write to make use of the Linux page cache engine.
Signed-off-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
---
fs/fuse/file.c | 5 +++++
fs/fuse/inode.c | 4 +++-
include/linux/fuse.h | 1 +
3 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 058498d..e3a1cc8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -209,6 +209,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
}
+ if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+ fuse_link_write_file(file);
}
int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -999,6 +1001,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct iov_iter i;
loff_t endbyte = 0;
+ if (get_fuse_conn(inode)->writeback_cache)
+ return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
WARN_ON(iocb->ki_pos != pos);
ocount = 0;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 88c577f..df1dc83 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -836,6 +836,8 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
fc->big_writes = 1;
if (arg->flags & FUSE_DONT_MASK)
fc->dont_mask = 1;
+ if (arg->flags & FUSE_WRITEBACK_CACHE)
+ fc->writeback_cache = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -861,7 +863,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
arg->max_readahead = fc->bdi.ra_pages * PAGE_CACHE_SIZE;
arg->flags |= FUSE_ASYNC_READ | FUSE_POSIX_LOCKS | FUSE_ATOMIC_O_TRUNC |
FUSE_EXPORT_SUPPORT | FUSE_BIG_WRITES | FUSE_DONT_MASK |
- FUSE_FLOCK_LOCKS;
+ FUSE_FLOCK_LOCKS | FUSE_WRITEBACK_CACHE;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 9303348..7ed5ced 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -176,6 +176,7 @@ struct fuse_file_lock {
#define FUSE_BIG_WRITES (1 << 5)
#define FUSE_DONT_MASK (1 << 6)
#define FUSE_FLOCK_LOCKS (1 << 10)
+#define FUSE_WRITEBACK_CACHE (1 << 11)
/**
* CUSE INIT request/reply flags
--
1.5.5.6
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (3 preceding siblings ...)
2012-07-03 15:57 ` [PATCH 9/10] fuse: Turn writeback on Pavel Emelyanov
@ 2012-07-04 3:01 ` Nikolaus Rath
[not found] ` <87a9zg1b7q.fsf-sKB8Sp2ER+yL2G7IJ6k9tw@public.gmane.org>
2012-07-05 19:31 ` Anand Avati
5 siblings, 1 reply; 43+ messages in thread
From: Nikolaus Rath @ 2012-07-04 3:01 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, linux-fsdevel
Hi Pavel,
I think it's great that you're working on this! I've been waiting for
FUSE being able to supply write data in bigger chunks for a long time,
and I'm very excited to see some progress on this. I'm not a kernel
developer, but I'll be happy to try the patches.
While I try to get this to compile, a few more questions:
Pavel Emelyanov <xemul@parallels.com> writes:
> Hi everyone.
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?)
From your description it sounds as if the only effect of write-back is
to increase the chunk size. Why is the a need to require special
privileges for this?
> When the writeback is turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
> and finish the page writeback immediately
Could you elaborate? I don't understand what you're saying here.
Best,
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
fuse-devel mailing list
fuse-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
` (4 preceding siblings ...)
2012-07-04 3:01 ` [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Nikolaus Rath
@ 2012-07-05 19:31 ` Anand Avati
[not found] ` <4FF5EB85.1050701-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
5 siblings, 1 reply; 43+ messages in thread
From: Anand Avati @ 2012-07-05 19:31 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: Kirill Korotaev, Miklos Szeredi,
fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, James Bottomley,
Alexander Viro, linux-fsdevel
Doesn't such a change increase the "dirty debt" held in the kernel and
therefore increase the probability of resulting in a deadlock (when the
userspace filesystem's memory allocation request ends up waiting on a
dirty page writeout caused by this write-back feature)? Why not
implement the write-back inside the userspace filesystem leaving the
kernel operate in write-through, which makes things overall less unsafe?
Do you have performance numbers comparing write-back in kernel v/s
write-back in userspace?
Avati
On 07/03/2012 08:53 AM, Pavel Emelyanov wrote:
> Hi everyone.
>
> One of the problems with the existing FUSE implementation is that it uses the
> write-through cache policy which results in performance problems on certain
> workloads. E.g. when copying a big file into a FUSE file the cp pushes every
> 128k to the userspace synchronously. This becomes a problem when the userspace
> back-end uses networking for storing the data.
>
> A good solution of this is switching the FUSE page cache into a write-back policy.
> With this file data are pushed to the userspace with big chunks (depending on the
> dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
> handle the size updates in a more efficient manner.
>
> The writeback feature is per-connection and is explicitly configurable at the
> init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
> turned ON:
>
> * still copy writeback pages to temporary buffer when sending a writeback request
> and finish the page writeback immediately
>
> * make kernel maintain the inode's i_size to avoid frequent i_size synchronization
> with the user space
>
> * take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
> This protects us from having too many dirty pages on FUSE
>
> The provided patchset survives the fsx test. Performance measurements are not yet
> all finished, but the mentioned copying of a huge file becomes noticeably faster
> even on machines with few RAM and doesn't make the system stuck (the dirty pages
> balancer does its work OK). Applies on top of v3.5-rc4.
>
> We are currently exploring this with our own distributed storage implementation
> which is heavily oriented on storing big blobs of data with extremely rare meta-data
> updates (virtual machines' and containers' disk images). With the existing cache
> policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
> too much time to proceed, much longer than if it was simply scp-ed over the same
> network. The write-back policy (as I mentioned) noticeably improves this scenario.
> Kirill (in Cc) can share more details about the performance and the storage concepts
> details if required.
>
> Thanks,
> Pavel
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> fuse-devel mailing list
> fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/fuse-devel
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
` (4 preceding siblings ...)
[not found] ` <4FF3156E.8030109-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-07-03 15:55 ` Pavel Emelyanov
2012-07-04 14:39 ` Miklos Szeredi
2012-07-03 15:57 ` [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Pavel Emelyanov
2012-07-05 19:26 ` [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Anand Avati
7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:55 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
Make fuse think that when writeback is on the inode's i_size is alway up-to-date
and not update it with the value received from the userspace. This is done because
the page cache code may update i_size without letting the FS know.
This assumption implies fixing the previously introduced short-read helper -- when
a short read occurs the 'hole' is filled with zeroes.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
fs/fuse/dir.c | 10 ++++++----
fs/fuse/file.c | 23 +++++++++++++++++++++--
fs/fuse/inode.c | 6 ++++--
3 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 334e0b1..032fb20 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -790,7 +790,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
stat->mtime.tv_nsec = attr->mtimensec;
stat->ctime.tv_sec = attr->ctime;
stat->ctime.tv_nsec = attr->ctimensec;
- stat->size = attr->size;
+ stat->size = i_size_read(inode);
stat->blocks = attr->blocks;
if (attr->blksize != 0)
@@ -1350,7 +1350,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
struct fuse_req *req;
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
- bool is_truncate = false;
+ bool is_truncate = false, is_wb = fc->writeback_cache;
loff_t oldsize;
int err;
@@ -1423,7 +1423,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
fuse_change_attributes_common(inode, &outarg.attr,
attr_timeout(&outarg));
oldsize = inode->i_size;
- i_size_write(inode, outarg.attr.size);
+ if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+ i_size_write(inode, outarg.attr.size);
if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
@@ -1435,7 +1436,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
* Only call invalidate_inode_pages2() after removing
* FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
*/
- if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+ if ((is_truncate || !is_wb) &&
+ S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
truncate_pagecache(inode, oldsize, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 47f0f2e..9bc1390 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -541,9 +541,28 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
{
size_t num_read = req->out.args[0].size;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (fc->writeback_cache) {
+ /*
+ * A hole in a file. Some data after the hole are in page cache.
+ */
+ int i;
+ size_t off = num_read & (PAGE_CACHE_SIZE - 1);
+
+ for (i = num_read >> PAGE_CACHE_SHIFT; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+ void *mapaddr = kmap_atomic(page, KM_USER0);
- loff_t pos = page_offset(req->pages[0]) + num_read;
- fuse_read_update_size(inode, pos, attr_ver);
+ memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+ kunmap_atomic(mapaddr, KM_USER0);
+ off = 0;
+ }
+ } else {
+ loff_t pos = page_offset(req->pages[0]) + num_read;
+ fuse_read_update_size(inode, pos, attr_ver);
+ }
}
static int fuse_readpage(struct file *file, struct page *page)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1cd6165..88c577f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ bool is_wb = fc->writeback_cache;
loff_t oldsize;
spin_lock(&fc->lock);
@@ -207,10 +208,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
fuse_change_attributes_common(inode, attr, attr_valid);
oldsize = inode->i_size;
- i_size_write(inode, attr->size);
+ if (!is_wb || !S_ISREG(inode->i_mode))
+ i_size_write(inode, attr->size);
spin_unlock(&fc->lock);
- if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
+ if (!is_wb && S_ISREG(inode->i_mode) && oldsize != attr->size) {
truncate_pagecache(inode, oldsize, attr->size);
invalidate_inode_pages2(inode->i_mapping);
}
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-03 15:55 ` [PATCH 6/10] fuse: Trust kernel i_size only Pavel Emelyanov
@ 2012-07-04 14:39 ` Miklos Szeredi
2012-07-05 14:10 ` Pavel Emelyanov
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-04 14:39 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
Pavel Emelyanov <xemul@parallels.com> writes:
> Make fuse think that when writeback is on the inode's i_size is alway
> up-to-date and not update it with the value received from the
> userspace. This is done because the page cache code may update i_size
> without letting the FS know.
Similar rule applies to i_mtime. Except it's even more tricky, since
you have to flush the mtime together with the data (the flush should not
update the modification time). And we have other operations which also
change the mtime, and those also need to be updated.
We should probably look at what NFS is doing.
> This assumption implies fixing the previously introduced short-read
> helper -- when a short read occurs the 'hole' is filled with zeroes.
>
> Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
> ---
> fs/fuse/dir.c | 10 ++++++----
> fs/fuse/file.c | 23 +++++++++++++++++++++--
> fs/fuse/inode.c | 6 ++++--
> 3 files changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 334e0b1..032fb20 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -790,7 +790,7 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
> stat->mtime.tv_nsec = attr->mtimensec;
> stat->ctime.tv_sec = attr->ctime;
> stat->ctime.tv_nsec = attr->ctimensec;
> - stat->size = attr->size;
> + stat->size = i_size_read(inode);
> stat->blocks = attr->blocks;
>
> if (attr->blksize != 0)
> @@ -1350,7 +1350,7 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
> struct fuse_req *req;
> struct fuse_setattr_in inarg;
> struct fuse_attr_out outarg;
> - bool is_truncate = false;
> + bool is_truncate = false, is_wb = fc->writeback_cache;
> loff_t oldsize;
> int err;
>
> @@ -1423,7 +1423,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
> fuse_change_attributes_common(inode, &outarg.attr,
> attr_timeout(&outarg));
> oldsize = inode->i_size;
> - i_size_write(inode, outarg.attr.size);
> + if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
> + i_size_write(inode, outarg.attr.size);
>
> if (is_truncate) {
> /* NOTE: this may release/reacquire fc->lock */
> @@ -1435,7 +1436,8 @@ static int fuse_do_setattr(struct dentry *entry, struct iattr *attr,
> * Only call invalidate_inode_pages2() after removing
> * FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
> */
> - if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> + if ((is_truncate || !is_wb) &&
> + S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
> truncate_pagecache(inode, oldsize, outarg.attr.size);
> invalidate_inode_pages2(inode->i_mapping);
> }
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 47f0f2e..9bc1390 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -541,9 +541,28 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
> static void fuse_short_read(struct fuse_req *req, struct inode *inode, u64 attr_ver)
> {
> size_t num_read = req->out.args[0].size;
> + struct fuse_conn *fc = get_fuse_conn(inode);
> +
> + if (fc->writeback_cache) {
> + /*
> + * A hole in a file. Some data after the hole are in page cache.
> + */
> + int i;
> + size_t off = num_read & (PAGE_CACHE_SIZE - 1);
> +
> + for (i = num_read >> PAGE_CACHE_SHIFT; i < req->num_pages; i++) {
> + struct page *page = req->pages[i];
> + void *mapaddr = kmap_atomic(page, KM_USER0);
>
> - loff_t pos = page_offset(req->pages[0]) + num_read;
> - fuse_read_update_size(inode, pos, attr_ver);
> + memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
> +
> + kunmap_atomic(mapaddr, KM_USER0);
> + off = 0;
> + }
> + } else {
> + loff_t pos = page_offset(req->pages[0]) + num_read;
> + fuse_read_update_size(inode, pos, attr_ver);
> + }
> }
>
> static int fuse_readpage(struct file *file, struct page *page)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 1cd6165..88c577f 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -196,6 +196,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> {
> struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> + bool is_wb = fc->writeback_cache;
> loff_t oldsize;
>
> spin_lock(&fc->lock);
> @@ -207,10 +208,11 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> fuse_change_attributes_common(inode, attr, attr_valid);
>
> oldsize = inode->i_size;
> - i_size_write(inode, attr->size);
> + if (!is_wb || !S_ISREG(inode->i_mode))
> + i_size_write(inode, attr->size);
> spin_unlock(&fc->lock);
>
> - if (S_ISREG(inode->i_mode) && oldsize != attr->size) {
> + if (!is_wb && S_ISREG(inode->i_mode) && oldsize != attr->size) {
> truncate_pagecache(inode, oldsize, attr->size);
> invalidate_inode_pages2(inode->i_mapping);
> }
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-04 14:39 ` Miklos Szeredi
@ 2012-07-05 14:10 ` Pavel Emelyanov
2012-07-10 5:53 ` Pavel Emelyanov
[not found] ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
2 siblings, 0 replies; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-05 14:10 UTC (permalink / raw)
To: Miklos Szeredi
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
>
> Similar rule applies to i_mtime. Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time). And we have other operations which also
> change the mtime, and those also need to be updated.
Yup, I've missed that part :( Thanks for pointing this out.
> We should probably look at what NFS is doing.
I'll look at what can be done about it.
Thanks,
Pavel
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-04 14:39 ` Miklos Szeredi
2012-07-05 14:10 ` Pavel Emelyanov
@ 2012-07-10 5:53 ` Pavel Emelyanov
2012-07-13 16:30 ` Miklos Szeredi
[not found] ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
2 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-10 5:53 UTC (permalink / raw)
To: Miklos Szeredi
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
>
> Similar rule applies to i_mtime. Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time). And we have other operations which also
> change the mtime, and those also need to be updated.
>
> We should probably look at what NFS is doing.
Miklos,
I've looked at how NFS and FUSE manage the i_mtime.
The NFS (if not looking at the fscache) tries to keep the i_mtime at the
maximum between the local and the remote values. This looks like correct
behavior for FUSE too.
But, looking at the FUSE code I see that the existing attr_version checks
do implement the same approach even if we turn the writeback cache (this
series) ON.
That said, I see no problems with the i_mtime management after this series.
Is there something I've missed?
Thanks,
Pavel
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-10 5:53 ` Pavel Emelyanov
@ 2012-07-13 16:30 ` Miklos Szeredi
2012-07-16 3:32 ` Pavel Emelyanov
0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-13 16:30 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
Pavel Emelyanov <xemul@parallels.com> writes:
> On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>> Make fuse think that when writeback is on the inode's i_size is alway
>>> up-to-date and not update it with the value received from the
>>> userspace. This is done because the page cache code may update i_size
>>> without letting the FS know.
>>
>> Similar rule applies to i_mtime. Except it's even more tricky, since
>> you have to flush the mtime together with the data (the flush should not
>> update the modification time). And we have other operations which also
>> change the mtime, and those also need to be updated.
>>
>> We should probably look at what NFS is doing.
>
> Miklos,
>
> I've looked at how NFS and FUSE manage the i_mtime.
>
> The NFS (if not looking at the fscache) tries to keep the i_mtime at the
> maximum between the local and the remote values. This looks like correct
> behavior for FUSE too.
>
> But, looking at the FUSE code I see that the existing attr_version checks
> do implement the same approach even if we turn the writeback cache (this
> series) ON.
It's not as simple as that.
First off, fuse sets S_NOCMTIME which means the kernel won't update
i_mtime on writes. Which is fine for write-through, since the time
update is always the responsibility of the userspace filesystem.
If we are doing buffered writes, then the kernel must update i_mtime on
write(2) and must flush that to the userspace filesystem at some point
(with a SETTATTR operation).
The attr_version thing is about making sure that we don't update the
kernel's cached attribute with stale data from the userspace filesystem.
E.g. if a GETATTR races with a WRITE then by the time the GETATTR
finishes write may have already updated i_size despite the fact that
GETATTR is returning i_size before the write. In that case we don't
store the i_size we believe is stale but we do use it the stat(2) reply,
since it came from the userspace filesystem, which is the authoritative
source of information.
But that means that if we want stat(2) to work correctly, then we must
flush the updated i_mtime to userspace before we do the GETATTR.
So basically what we need is a per-inode flag that says that i_mtime has
been updated (it is more recent then what userspace has) and we must
update i_mtime *only* in write and not other operations which still do
the mtime update in the userspace filesystem. Any operation that
modifies i_mtime (and hence invalidate the attributes) must clear the
flag. Any other operation which updates or invalidates the attributes
must first flush the i_mtime to userspace if the flag is set.
In addition the userspace fileystem need to implement the policy similar
to NFS, in which it only updates mtime if it is greater than the current
one. This means that we must differentiate between an mtime update due
to a buffered write from an mtime update due to an utime (and friends)
system call.
I think that would work, but it's a tricky thing and needs to be thought
trough.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-13 16:30 ` Miklos Szeredi
@ 2012-07-16 3:32 ` Pavel Emelyanov
2012-07-17 15:17 ` Miklos Szeredi
0 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-16 3:32 UTC (permalink / raw)
To: Miklos Szeredi
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
On 07/13/2012 08:30 PM, Miklos Szeredi wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> On 07/04/2012 06:39 PM, Miklos Szeredi wrote:
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Make fuse think that when writeback is on the inode's i_size is alway
>>>> up-to-date and not update it with the value received from the
>>>> userspace. This is done because the page cache code may update i_size
>>>> without letting the FS know.
>>>
>>> Similar rule applies to i_mtime. Except it's even more tricky, since
>>> you have to flush the mtime together with the data (the flush should not
>>> update the modification time). And we have other operations which also
>>> change the mtime, and those also need to be updated.
>>>
>>> We should probably look at what NFS is doing.
>>
>> Miklos,
>>
>> I've looked at how NFS and FUSE manage the i_mtime.
>>
>> The NFS (if not looking at the fscache) tries to keep the i_mtime at the
>> maximum between the local and the remote values. This looks like correct
>> behavior for FUSE too.
>>
>> But, looking at the FUSE code I see that the existing attr_version checks
>> do implement the same approach even if we turn the writeback cache (this
>> series) ON.
>
> It's not as simple as that.
>
> First off, fuse sets S_NOCMTIME which means the kernel won't update
> i_mtime on writes. Which is fine for write-through, since the time
> update is always the responsibility of the userspace filesystem.
Oops, I've missed that fact :( I wonder why writes did update the mtime
in my tests then...
> If we are doing buffered writes, then the kernel must update i_mtime on
> write(2) and must flush that to the userspace filesystem at some point
> (with a SETTATTR operation).
Agree.
> The attr_version thing is about making sure that we don't update the
> kernel's cached attribute with stale data from the userspace filesystem.
> E.g. if a GETATTR races with a WRITE then by the time the GETATTR
> finishes write may have already updated i_size despite the fact that
> GETATTR is returning i_size before the write. In that case we don't
> store the i_size we believe is stale but we do use it the stat(2) reply,
> since it came from the userspace filesystem, which is the authoritative
> source of information.
>
> But that means that if we want stat(2) to work correctly, then we must
> flush the updated i_mtime to userspace before we do the GETATTR.
Why? Isn't it enough just to look at the per-inode flag (you're talking
below) and take either cached i_mtime value or the user-space verion of it?
> So basically what we need is a per-inode flag that says that i_mtime has
> been updated (it is more recent then what userspace has) and we must
> update i_mtime *only* in write and not other operations which still do
> the mtime update in the userspace filesystem. Any operation that
> modifies i_mtime (and hence invalidate the attributes) must clear the
> flag. Any other operation which updates or invalidates the attributes
> must first flush the i_mtime to userspace if the flag is set.
>
> In addition the userspace fileystem need to implement the policy similar
> to NFS, in which it only updates mtime if it is greater than the current
> one. This means that we must differentiate between an mtime update due
> to a buffered write from an mtime update due to an utime (and friends)
> system call.
>
> I think that would work, but it's a tricky thing and needs to be thought
> trough.
>
> Thanks,
> Miklos
> .
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-07-16 3:32 ` Pavel Emelyanov
@ 2012-07-17 15:17 ` Miklos Szeredi
0 siblings, 0 replies; 43+ messages in thread
From: Miklos Szeredi @ 2012-07-17 15:17 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: fuse-devel, Alexander Viro, linux-fsdevel, James Bottomley,
Kirill Korotaev
Pavel Emelyanov <xemul@parallels.com> writes:
>> The attr_version thing is about making sure that we don't update the
>> kernel's cached attribute with stale data from the userspace filesystem.
>> E.g. if a GETATTR races with a WRITE then by the time the GETATTR
>> finishes write may have already updated i_size despite the fact that
>> GETATTR is returning i_size before the write. In that case we don't
>> store the i_size we believe is stale but we do use it the stat(2) reply,
>> since it came from the userspace filesystem, which is the authoritative
>> source of information.
>>
>> But that means that if we want stat(2) to work correctly, then we must
>> flush the updated i_mtime to userspace before we do the GETATTR.
>
> Why? Isn't it enough just to look at the per-inode flag (you're talking
> below) and take either cached i_mtime value or the user-space verion
> of it?
That would only work if the kernel has the more up-to-date mtime value,
which is not necessarily true.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 43+ messages in thread
[parent not found: <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>]
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
[not found] ` <8762a3pp3m.fsf-d8RdFUjzFsbxNFs70CDYszOMxtEWgIxa@public.gmane.org>
@ 2012-10-01 17:30 ` Maxim V. Patlasov
2012-11-16 9:49 ` Miklos Szeredi
0 siblings, 1 reply; 43+ messages in thread
From: Maxim V. Patlasov @ 2012-10-01 17:30 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Kirill Korotaev, Pavel Emelianov,
fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, James Bottomley,
Alexander Viro, linux-fsdevel
Hi Miklos,
07/04/2012 06:39 PM, Miklos Szeredi ?????:
> Pavel Emelyanov <xemul-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> writes:
>
>> Make fuse think that when writeback is on the inode's i_size is alway
>> up-to-date and not update it with the value received from the
>> userspace. This is done because the page cache code may update i_size
>> without letting the FS know.
> Similar rule applies to i_mtime. Except it's even more tricky, since
> you have to flush the mtime together with the data (the flush should not
> update the modification time). And we have other operations which also
> change the mtime, and those also need to be updated.
>
> We should probably look at what NFS is doing.
In case of NFS, the flush does updates the modification time on server.
And on client, getattr triggers flush:
> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct
> kstat *stat)
> {
> ...
>
> /* Flush out writes to the server in order to update c/mtime. */
> if (S_ISREG(inode->i_mode)) {
> nfs_inode_dio_wait(inode);
> err = filemap_write_and_wait(inode->i_mapping);
> if (err)
> goto out;
> }
In another email of this thread you suggested some approach where
in-kernel fuse flushes i_mtime to userspace:
> So basically what we need is a per-inode flag that says that i_mtime has
> been updated (it is more recent then what userspace has) and we must
> update i_mtime*only* in write and not other operations which still do
> the mtime update in the userspace filesystem. Any operation that
> modifies i_mtime (and hence invalidate the attributes) must clear the
> flag. Any other operation which updates or invalidates the attributes
> must first flush the i_mtime to userspace if the flag is set.
>
> In addition the userspace fileystem need to implement the policy similar
> to NFS, in which it only updates mtime if it is greater than the current
> one. This means that we must differentiate between an mtime update due
> to a buffered write from an mtime update due to an utime (and friends)
> system call.
My question is why do we need all these complications if we could follow
NFS way: trigger flush and wait for its (and fuse write-back) completion
before sending FUSE_GETATTR to userspace?
Another concern is about the idea of sending i_mtime to userspace per
se. You wrote:
> If we are doing buffered writes, then the kernel must update i_mtime on
> write(2) and must flush that to the userspace filesystem at some point
> (with a SETTATTR operation).
Fuse userspace may have its own non-trivial concept of 'modification
time'. It's not obliged to advance its mtime on every write. The only
requirement is to be consistent: if we expose new data handling READs,
mtime must be advanced properly as well. But, for example, the
granularity of changes is up to implementation. From this view, in-kenel
fuse pushing i_mtime with a SETATTR operation would look like a cheating
userspace. What do you think?
Thanks,
Maxim
------------------------------------------------------------------------------
Got visibility?
Most devs has no idea what their production app looks like.
Find out how fast your code is with AppDynamics Lite.
http://ad.doubleclick.net/clk;262219671;13503038;y?
http://info.appdynamics.com/FreeJavaPerformanceDownload.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-10-01 17:30 ` Maxim V. Patlasov
@ 2012-11-16 9:49 ` Miklos Szeredi
2012-11-16 10:32 ` Maxim V. Patlasov
0 siblings, 1 reply; 43+ messages in thread
From: Miklos Szeredi @ 2012-11-16 9:49 UTC (permalink / raw)
To: Maxim V. Patlasov
Cc: Pavel Emelianov, fuse-devel, Alexander Viro, linux-fsdevel,
James Bottomley, Kirill Korotaev
"Maxim V. Patlasov" <mpatlasov@parallels.com> writes:
> We should probably look at what NFS is doing.
>
>
> In case of NFS, the flush does updates the modification time on server. And on
> client, getattr triggers flush:
>
>
> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
> *stat)
> {
> ...
>
> /* Flush out writes to the server in order to update c/mtime. */
> if (S_ISREG(inode->i_mode)) {
> nfs_inode_dio_wait(inode);
> err = filemap_write_and_wait(inode->i_mapping);
> if (err)
> goto out;
> }
>
>
> In another email of this thread you suggested some approach where in-kernel
> fuse flushes i_mtime to userspace:
>
>
> So basically what we need is a per-inode flag that says that i_mtime has
> been updated (it is more recent then what userspace has) and we must
> update i_mtime *only* in write and not other operations which still do
> the mtime update in the userspace filesystem. Any operation that
> modifies i_mtime (and hence invalidate the attributes) must clear the
> flag. Any other operation which updates or invalidates the attributes
> must first flush the i_mtime to userspace if the flag is set.
>
> In addition the userspace fileystem need to implement the policy similar
> to NFS, in which it only updates mtime if it is greater than the current
> one. This means that we must differentiate between an mtime update due
> to a buffered write from an mtime update due to an utime (and friends)
> system call.
>
>
> My question is why do we need all these complications if we could follow NFS
> way: trigger flush and wait for its (and fuse write-back) completion before
> sending FUSE_GETATTR to userspace?
Yes, the NFS way seems like a good approach assuming that getattrs are
not too frequent. But I guess the fact that NFS does this is a pretty
good assurance that will work fine.
>
> Another concern is about the idea of sending i_mtime to userspace per se. You
> wrote:
>
>
> If we are doing buffered writes, then the kernel must update i_mtime on
> write(2) and must flush that to the userspace filesystem at some point
> (with a SETTATTR operation).
>
>
> Fuse userspace may have its own non-trivial concept of 'modification time'.
> It's not obliged to advance its mtime on every write. The only requirement is
> to be consistent: if we expose new data handling READs, mtime must be advanced
> properly as well. But, for example, the granularity of changes is up to
> implementation. From this view, in-kenel fuse pushing i_mtime with a SETATTR
> operation would look like a cheating userspace. What do you think?
I think you are right in that mixing kernel mtime updates with userspace
mtime updates doesn't work. Either the kernel should be wholly
responsible (which works only for "local" filesystems) or the userspace
is fully responsible for mtime updates (which works in all cases but may
be suboptimal).
Thanks,
Miklos
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 6/10] fuse: Trust kernel i_size only
2012-11-16 9:49 ` Miklos Szeredi
@ 2012-11-16 10:32 ` Maxim V. Patlasov
0 siblings, 0 replies; 43+ messages in thread
From: Maxim V. Patlasov @ 2012-11-16 10:32 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Pavel Emelianov, fuse-devel, Alexander Viro, linux-fsdevel,
James Bottomley, Kirill Korotaev
Hi Miklos,
Thanks a lot for reply. See please inline comments below...
11/16/2012 01:49 PM, Miklos Szeredi пишет:
> "Maxim V. Patlasov" <mpatlasov@parallels.com> writes:
>
>> We should probably look at what NFS is doing.
>>
>>
>> In case of NFS, the flush does updates the modification time on server. And on
>> client, getattr triggers flush:
>>
>>
>> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat
>> *stat)
>> {
>> ...
>>
>> /* Flush out writes to the server in order to update c/mtime. */
>> if (S_ISREG(inode->i_mode)) {
>> nfs_inode_dio_wait(inode);
>> err = filemap_write_and_wait(inode->i_mapping);
>> if (err)
>> goto out;
>> }
>>
>>
>> In another email of this thread you suggested some approach where in-kernel
>> fuse flushes i_mtime to userspace:
>>
>>
>> So basically what we need is a per-inode flag that says that i_mtime has
>> been updated (it is more recent then what userspace has) and we must
>> update i_mtime *only* in write and not other operations which still do
>> the mtime update in the userspace filesystem. Any operation that
>> modifies i_mtime (and hence invalidate the attributes) must clear the
>> flag. Any other operation which updates or invalidates the attributes
>> must first flush the i_mtime to userspace if the flag is set.
>>
>> In addition the userspace fileystem need to implement the policy similar
>> to NFS, in which it only updates mtime if it is greater than the current
>> one. This means that we must differentiate between an mtime update due
>> to a buffered write from an mtime update due to an utime (and friends)
>> system call.
>>
>>
>> My question is why do we need all these complications if we could follow NFS
>> way: trigger flush and wait for its (and fuse write-back) completion before
>> sending FUSE_GETATTR to userspace?
>
> Yes, the NFS way seems like a good approach assuming that getattrs are
> not too frequent. But I guess the fact that NFS does this is a pretty
> good assurance that will work fine.
My first intention was to follow NFS way because it's very simple and
straightforward. But then I realized it would impact performance too
badly. You correctly noticed that frequent getattrs will be a problem.
But there will be a problem even without it: a single innocent 'ls' will
wait for pretty long till all dirty memory is flushed.
>> Another concern is about the idea of sending i_mtime to userspace per se. You
>> wrote:
>>
>>
>> If we are doing buffered writes, then the kernel must update i_mtime on
>> write(2) and must flush that to the userspace filesystem at some point
>> (with a SETTATTR operation).
>>
>>
>> Fuse userspace may have its own non-trivial concept of 'modification time'.
>> It's not obliged to advance its mtime on every write. The only requirement is
>> to be consistent: if we expose new data handling READs, mtime must be advanced
>> properly as well. But, for example, the granularity of changes is up to
>> implementation. From this view, in-kenel fuse pushing i_mtime with a SETATTR
>> operation would look like a cheating userspace. What do you think?
> I think you are right in that mixing kernel mtime updates with userspace
> mtime updates doesn't work. Either the kernel should be wholly
> responsible (which works only for "local" filesystems) or the userspace
> is fully responsible for mtime updates (which works in all cases but may
> be suboptimal).
Not having any feedback from you for long while I worked pretty hard to
implement the approach you suggested early (update mtime locally on
buffered writes, flush it to userspace when needed). Now I have an
implementation that works in my tests. I'll send patches soon.
Thanks,
Maxim
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
` (5 preceding siblings ...)
2012-07-03 15:55 ` [PATCH 6/10] fuse: Trust kernel i_size only Pavel Emelyanov
@ 2012-07-03 15:57 ` Pavel Emelyanov
[not found] ` <4FF3166B.5090800-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-07-05 19:26 ` [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Anand Avati
7 siblings, 1 reply; 43+ messages in thread
From: Pavel Emelyanov @ 2012-07-03 15:57 UTC (permalink / raw)
To: fuse-devel, Miklos Szeredi, Alexander Viro, linux-fsdevel
Cc: James Bottomley, Kirill Korotaev
Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is hight ehough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.
Note, that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in the function this means, that it either
writes to the mountpoint it serves (but it can deadlock even without
the writeback) or it is wrting to some _other_ dirty bdi and in the
latter case someone else will free the memory for it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
mm/page-writeback.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 93d8d2f..7cb54db 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1192,7 +1192,8 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+ global_page_state(NR_WRITEBACK_TEMP);
global_dirty_limits(&background_thresh, &dirty_thresh);
--
1.5.5.6
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
2012-07-03 15:53 [PATCH 0/10] fuse: An attempt to implement a write-back cache policy Pavel Emelyanov
` (6 preceding siblings ...)
2012-07-03 15:57 ` [PATCH 10/10] mm: Account for WRITEBACK_TEMP in balance_dirty_pages Pavel Emelyanov
@ 2012-07-05 19:26 ` Anand Avati
7 siblings, 0 replies; 43+ messages in thread
From: Anand Avati @ 2012-07-05 19:26 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: Kirill Korotaev, James Bottomley, fuse-devel, Miklos Szeredi,
Alexander Viro, linux-fsdevel
Doesn't such a change increase the "dirty debt" held in the kernel and therefore
increase the probability of resulting in a deadlock (when the userspace filesystem's
memory request ends up waiting on a dirty page writeout caused by this write-back
feature)? Why not implement the write-back inside the userspace filesystem leaving
the kernel operate in write-through, which makes things overall less unsafe? Do you
have performance numbers comparing write-back in kernel v/s write-back in userspace?
Avati
----- Original Message -----
From: "Pavel Emelyanov" <xemul@parallels.com>
To: fuse-devel@lists.sourceforge.net, "Miklos Szeredi" <miklos@szeredi.hu>, "Alexander Viro" <viro@zeniv.linux.org.uk>, "linux-fsdevel" <linux-fsdevel@vger.kernel.org>
Cc: "Kirill Korotaev" <dev@parallels.com>, "James Bottomley" <jbottomley@parallels.com>
Sent: Tuesday, July 3, 2012 8:53:18 AM
Subject: [fuse-devel] [PATCH 0/10] fuse: An attempt to implement a write-back cache policy
Hi everyone.
One of the problems with the existing FUSE implementation is that it uses the
write-through cache policy which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file the cp pushes every
128k to the userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.
A good solution of this is switching the FUSE page cache into a write-back policy.
With this file data are pushed to the userspace with big chunks (depending on the
dirty memory limits, but this is much more than 128k) which lets the FUSE daemons
handle the size updates in a more efficient manner.
The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?) When the writeback is
turned ON:
* still copy writeback pages to temporary buffer when sending a writeback request
and finish the page writeback immediately
* make kernel maintain the inode's i_size to avoid frequent i_size synchronization
with the user space
* take NR_WRITEBACK_TEMP into account when makeing balance_dirty_pages decision.
This protects us from having too many dirty pages on FUSE
The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with few RAM and doesn't make the system stuck (the dirty pages
balancer does its work OK). Applies on top of v3.5-rc4.
We are currently exploring this with our own distributed storage implementation
which is heavily oriented on storing big blobs of data with extremely rare meta-data
updates (virtual machines' and containers' disk images). With the existing cache
policy a typical usage scenario -- copying a big VM disk into a cloud -- takes way
too much time to proceed, much longer than if it was simply scp-ed over the same
network. The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage concepts
details if required.
Thanks,
Pavel
------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
fuse-devel mailing list
fuse-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fuse-devel
^ permalink raw reply [flat|nested] 43+ messages in thread