* [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode
@ 2023-06-30  9:45 Hao Xu
  2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Hao Xu @ 2023-06-30  9:45 UTC (permalink / raw)
  To: fuse-devel; +Cc: miklos, bernd.schubert, linux-fsdevel, Wanpeng Li, cgxu519

From: Hao Xu <howeyxu@tencent.com>

Patch 1 is a fix for private mmap in FOPEN_DIRECT_IO mode.
  It is included here since the latter two patches depend on it.
Patch 2 is the main dish.
Patch 3 maintains the direct write logic for shared mmap in FOPEN_DIRECT_IO mode.

v2 -> v3
    add the patch 1 fix here, and adjust it following Bernd's comment
    add patch 3, which does the right thing for shared mmap in FOPEN_DIRECT_IO mode

v1 -> v2:
     make the new flag a fuse init flag rather than an open flag, since it
     is uncommon for different files in a filesystem to have different
     shared-mmap strategies.

Hao Xu (3):
  fuse: invalidate page cache pages before direct write
  fuse: add a new fuse init flag to relax restrictions in no cache mode
  fuse: write back dirty pages before direct write in direct_io_relax
    mode

 fs/fuse/file.c            | 26 +++++++++++++++++++++++---
 fs/fuse/fuse_i.h          |  3 +++
 fs/fuse/inode.c           |  5 ++++-
 include/uapi/linux/fuse.h |  1 +
 4 files changed, 31 insertions(+), 4 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/3] fuse: invalidate page cache pages before direct write
  2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
@ 2023-06-30  9:46 ` Hao Xu
  2023-06-30 10:32   ` Bernd Schubert
  2023-07-21  3:34   ` [fuse-devel] " Jiachen Zhang
  2023-06-30  9:46 ` [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: Hao Xu @ 2023-06-30  9:46 UTC (permalink / raw)
  To: fuse-devel; +Cc: miklos, bernd.schubert, linux-fsdevel, Wanpeng Li, cgxu519

From: Hao Xu <howeyxu@tencent.com>

In FOPEN_DIRECT_IO mode, the page cache may still contain pages for a
file since private mmap is allowed. Direct writes should respect that and
invalidate the corresponding pages so that page cache readers don't get
stale data.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/fuse/file.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index bc4115288eec..7d6dd0e56b73 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1465,7 +1465,8 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 	int write = flags & FUSE_DIO_WRITE;
 	int cuse = flags & FUSE_DIO_CUSE;
 	struct file *file = io->iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	struct fuse_file *ff = file->private_data;
 	struct fuse_conn *fc = ff->fm->fc;
 	size_t nmax = write ? fc->max_write : fc->max_read;
@@ -1477,6 +1478,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 	int err = 0;
 	struct fuse_io_args *ia;
 	unsigned int max_pages;
+	bool fopen_direct_write = (ff->open_flags & FOPEN_DIRECT_IO) && write;
 
 	max_pages = iov_iter_npages(iter, fc->max_pages);
 	ia = fuse_io_alloc(io, max_pages);
@@ -1491,6 +1493,14 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			inode_unlock(inode);
 	}
 
+	if (fopen_direct_write) {
+		res = invalidate_inode_pages2_range(mapping, idx_from, idx_to);
+		if (res) {
+			fuse_io_free(ia);
+			return res;
+		}
+	}
+
 	io->should_dirty = !write && user_backed_iter(iter);
 	while (count) {
 		ssize_t nres;
-- 
2.25.1



* [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode
  2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
  2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
@ 2023-06-30  9:46 ` Hao Xu
  2023-06-30 10:35   ` Bernd Schubert
  2023-06-30  9:46 ` [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode Hao Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-06-30  9:46 UTC (permalink / raw)
  To: fuse-devel; +Cc: miklos, bernd.schubert, linux-fsdevel, Wanpeng Li, cgxu519

From: Hao Xu <howeyxu@tencent.com>

FOPEN_DIRECT_IO is usually set by the fuse daemon to indicate a need for
strong coherency, e.g. for network filesystems. Shared mmap is therefore
disabled, since it goes through the page cache and may write to it, which
can cause inconsistency. But FOPEN_DIRECT_IO can also be used not for
coherency but to reduce the memory footprint, e.g. to reduce guest memory
usage with virtiofs. Therefore, add a new fuse init flag,
FUSE_DIRECT_IO_RELAX, to relax restrictions in that mode; currently it
allows shared mmap. Note that users must make sure this doesn't break
coherency in their use case.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/fuse/file.c            | 7 +++++--
 fs/fuse/fuse_i.h          | 3 +++
 fs/fuse/inode.c           | 5 ++++-
 include/uapi/linux/fuse.h | 1 +
 4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7d6dd0e56b73..176f719f8fc8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2488,14 +2488,17 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct fuse_file *ff = file->private_data;
+	struct fuse_conn *fc = ff->fm->fc;
 
 	/* DAX mmap is superior to direct_io mmap */
 	if (FUSE_IS_DAX(file_inode(file)))
 		return fuse_dax_mmap(file, vma);
 
 	if (ff->open_flags & FOPEN_DIRECT_IO) {
-		/* Can't provide the coherency needed for MAP_SHARED */
-		if (vma->vm_flags & VM_MAYSHARE)
+		/* Can't provide the coherency needed for MAP_SHARED
+		 * if FUSE_DIRECT_IO_RELAX isn't set.
+		 */
+		if ((vma->vm_flags & VM_MAYSHARE) && !fc->direct_io_relax)
 			return -ENODEV;
 
 		invalidate_inode_pages2(file->f_mapping);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 9b7fc7d3c7f1..d830c2360aef 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -792,6 +792,9 @@ struct fuse_conn {
 	/* Is tmpfile not implemented by fs? */
 	unsigned int no_tmpfile:1;
 
+	/* relax restrictions in FOPEN_DIRECT_IO mode */
+	unsigned int direct_io_relax:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d66070af145d..049f9ee547d5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1209,6 +1209,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 				fc->init_security = 1;
 			if (flags & FUSE_CREATE_SUPP_GROUP)
 				fc->create_supp_group = 1;
+
+			if (flags & FUSE_DIRECT_IO_RELAX)
+				fc->direct_io_relax = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1254,7 +1257,7 @@ void fuse_send_init(struct fuse_mount *fm)
 		FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
 		FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
 		FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
-		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP;
+		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP | FUSE_DIRECT_IO_RELAX;
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1b9d0dfae72d..2da2acec6bf4 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -406,6 +406,7 @@ struct fuse_file_lock {
 #define FUSE_SECURITY_CTX	(1ULL << 32)
 #define FUSE_HAS_INODE_DAX	(1ULL << 33)
 #define FUSE_CREATE_SUPP_GROUP	(1ULL << 34)
+#define FUSE_DIRECT_IO_RELAX	(1ULL << 35)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.25.1



* [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
  2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
  2023-06-30  9:46 ` [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
@ 2023-06-30  9:46 ` Hao Xu
  2023-06-30 10:40   ` Bernd Schubert
  2023-07-21  6:35   ` [External] [fuse-devel] " Jiachen Zhang
  2023-07-05 10:23 ` [RFC] [PATCH] fuse: DIO writes always use the same code path Bernd Schubert
  2023-07-20  7:32 ` [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
  4 siblings, 2 replies; 28+ messages in thread
From: Hao Xu @ 2023-06-30  9:46 UTC (permalink / raw)
  To: fuse-devel; +Cc: miklos, bernd.schubert, linux-fsdevel, Wanpeng Li, cgxu519

From: Hao Xu <howeyxu@tencent.com>

In direct_io_relax mode, a file can have shared mmap mappings and thus
dirty pages in its page cache. Those dirty pages should be written back
to the backend before a direct write to avoid data loss.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/fuse/file.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 176f719f8fc8..7c9167c62bf6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 	if (!ia)
 		return -ENOMEM;
 
+	if (fopen_direct_write && fc->direct_io_relax) {
+		res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
+		if (res) {
+			fuse_io_free(ia);
+			return res;
+		}
+	}
 	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
 		if (!write)
 			inode_lock(inode);
-- 
2.25.1



* Re: [PATCH 1/3] fuse: invalidate page cache pages before direct write
  2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
@ 2023-06-30 10:32   ` Bernd Schubert
  2023-07-21  3:34   ` [fuse-devel] " Jiachen Zhang
  1 sibling, 0 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-06-30 10:32 UTC (permalink / raw)
  To: Hao Xu, fuse-devel; +Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519

Thanks, looks good to me:

Reviewed-by: Bernd Schubert <bschubert@ddn.com>


* Re: [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode
  2023-06-30  9:46 ` [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
@ 2023-06-30 10:35   ` Bernd Schubert
  0 siblings, 0 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-06-30 10:35 UTC (permalink / raw)
  To: Hao Xu, fuse-devel; +Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519

Thanks, looks good to me:

Reviewed-by: Bernd Schubert <bschubert@ddn.com>


* Re: [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-06-30  9:46 ` [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode Hao Xu
@ 2023-06-30 10:40   ` Bernd Schubert
  2023-07-21  6:35   ` [External] [fuse-devel] " Jiachen Zhang
  1 sibling, 0 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-06-30 10:40 UTC (permalink / raw)
  To: Hao Xu, fuse-devel; +Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519

Thanks, looks good to me:

Reviewed-by: Bernd Schubert <bschubert@ddn.com>



* [RFC] [PATCH] fuse: DIO writes always use the same code path
  2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
                   ` (2 preceding siblings ...)
  2023-06-30  9:46 ` [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode Hao Xu
@ 2023-07-05 10:23 ` Bernd Schubert
  2023-07-06 14:43   ` Christoph Hellwig
  2023-07-17  8:03   ` Hao Xu
  2023-07-20  7:32 ` [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
  4 siblings, 2 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-07-05 10:23 UTC (permalink / raw)
  To: Hao Xu, fuse-devel
  Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519, Dharmendra Singh

From: Bernd Schubert <bschubert@ddn.com>

Since commit 153524053bbb0d27bb2e0be36d1b46862e9ce74c, DIO
writes can be handled in parallel, as long as the file
is not extended. So far this only works when the daemon/server
side sets FOPEN_DIRECT_IO and FOPEN_PARALLEL_DIRECT_WRITES,
but O_DIRECT (iocb->ki_flags & IOCB_DIRECT) takes another
code path that doesn't have the parallel DIO write
optimization.
Given that fuse_direct_write_iter has to handle page writes
and invalidation anyway (for mmap), the DIO handler in
fuse_cache_write_iter() is removed and DIO writes are now
only handled by fuse_direct_write_iter().

Note: Correctness of this patch depends on a non-merged
series from Hao Xu <hao.xu@linux.dev>
( fuse: add a new fuse init flag to relax restrictions in no cache mode)
---
  fs/fuse/file.c |   38 +++++---------------------------------
  1 file changed, 5 insertions(+), 33 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 89d97f6188e0..1490329b536c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1337,11 +1337,9 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
  	struct file *file = iocb->ki_filp;
  	struct address_space *mapping = file->f_mapping;
  	ssize_t written = 0;
-	ssize_t written_buffered = 0;
  	struct inode *inode = mapping->host;
  	ssize_t err;
  	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t endbyte = 0;

  	if (fc->writeback_cache) {
  		/* Update size (EOF optimization) and mode (SUID clearing) */
@@ -1377,37 +1375,10 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
  	if (err)
  		goto out;

-	if (iocb->ki_flags & IOCB_DIRECT) {
-		loff_t pos = iocb->ki_pos;
-		written = generic_file_direct_write(iocb, from);
-		if (written < 0 || !iov_iter_count(from))
-			goto out;
-
-		pos += written;
-
-		written_buffered = fuse_perform_write(iocb, mapping, from, pos);
-		if (written_buffered < 0) {
-			err = written_buffered;
-			goto out;
-		}
-		endbyte = pos + written_buffered - 1;
-
-		err = filemap_write_and_wait_range(file->f_mapping, pos,
-						   endbyte);
-		if (err)
-			goto out;
-
-		invalidate_mapping_pages(file->f_mapping,
-					 pos >> PAGE_SHIFT,
-					 endbyte >> PAGE_SHIFT);
+	written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
+	if (written >= 0)
+		iocb->ki_pos += written;

-		written += written_buffered;
-		iocb->ki_pos = pos + written_buffered;
-	} else {
-		written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
-		if (written >= 0)
-			iocb->ki_pos += written;
-	}
  out:
  	current->backing_dev_info = NULL;
  	inode_unlock(inode);
@@ -1691,7 +1662,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
  	if (FUSE_IS_DAX(inode))
  		return fuse_dax_write_iter(iocb, from);

-	if (!(ff->open_flags & FOPEN_DIRECT_IO))
+	if (!(ff->open_flags & FOPEN_DIRECT_IO) &&
+	    !(iocb->ki_flags & IOCB_DIRECT))
  		return fuse_cache_write_iter(iocb, from);
  	else
  		return fuse_direct_write_iter(iocb, from);



* Re: [RFC] [PATCH] fuse: DIO writes always use the same code path
  2023-07-05 10:23 ` [RFC] [PATCH] fuse: DIO writes always use the same code path Bernd Schubert
@ 2023-07-06 14:43   ` Christoph Hellwig
  2023-07-07 13:36     ` Bernd Schubert
  2023-07-17  8:03   ` Hao Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2023-07-06 14:43 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Hao Xu, fuse-devel, miklos, linux-fsdevel, Wanpeng Li, cgxu519,
	Dharmendra Singh

On Wed, Jul 05, 2023 at 12:23:40PM +0200, Bernd Schubert wrote:
> @@ -1377,37 +1375,10 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	if (err)
>  		goto out;
> 
> -	if (iocb->ki_flags & IOCB_DIRECT) {
> -		loff_t pos = iocb->ki_pos;
> -		written = generic_file_direct_write(iocb, from);

After this generic_file_direct_write becomes unused outside of
mm/filemap.c, please add a patch to the series to mark it static.

> +	written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
> +	if (written >= 0)
> +		iocb->ki_pos += written;

This needs to be updated to the new fuse_perform_write calling
conventions in Linus tree.



* Re: [RFC] [PATCH] fuse: DIO writes always use the same code path
  2023-07-06 14:43   ` Christoph Hellwig
@ 2023-07-07 13:36     ` Bernd Schubert
  0 siblings, 0 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-07-07 13:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hao Xu, fuse-devel, miklos, linux-fsdevel, Wanpeng Li, cgxu519,
	Dharmendra Singh



On 7/6/23 16:43, Christoph Hellwig wrote:
> On Wed, Jul 05, 2023 at 12:23:40PM +0200, Bernd Schubert wrote:
>> @@ -1377,37 +1375,10 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   	if (err)
>>   		goto out;
>>
>> -	if (iocb->ki_flags & IOCB_DIRECT) {
>> -		loff_t pos = iocb->ki_pos;
>> -		written = generic_file_direct_write(iocb, from);
> 
> After this generic_file_direct_write becomes unused outside of
> mm/filemap.c, please add a patch to the series to mark it static.
> 
>> +	written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
>> +	if (written >= 0)
>> +		iocb->ki_pos += written;
> 
> This needs to be updated to the new fuse_perform_write calling
> conventions in Linus tree.
> 

Thanks a lot for your review, I will address it when I'm back from 
vacation in two weeks. I had seen your DIO patch series, but didn't 
notice it was merged already.


* Re: [RFC] [PATCH] fuse: DIO writes always use the same code path
  2023-07-05 10:23 ` [RFC] [PATCH] fuse: DIO writes always use the same code path Bernd Schubert
  2023-07-06 14:43   ` Christoph Hellwig
@ 2023-07-17  8:03   ` Hao Xu
  2023-07-17 21:17     ` Bernd Schubert
  1 sibling, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-07-17  8:03 UTC (permalink / raw)
  To: Bernd Schubert, fuse-devel
  Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519, Dharmendra Singh

Hi Bernd,

On 7/5/23 18:23, Bernd Schubert wrote:
> From: Bernd Schubert <bschubert@ddn.com>
> 
> Since commit 153524053bbb0d27bb2e0be36d1b46862e9ce74c, DIO
> writes can be handled in parallel, as long as the file
> is not extended. So far this only works when the daemon/server
> side sets FOPEN_DIRECT_IO and FOPEN_PARALLEL_DIRECT_WRITES,
> but O_DIRECT (iocb->ki_flags & IOCB_DIRECT) takes another
> code path that doesn't have the parallel DIO write
> optimization.
> Given that fuse_direct_write_iter has to handle page writes
> and invalidation anyway (for mmap), the DIO handler in
> fuse_cache_write_iter() is removed and DIO writes are now
> only handled by fuse_direct_write_iter().
> 
> Note: Correctness of this patch depends on a non-merged
> series from Hao Xu <hao.xu@linux.dev>
> ( fuse: add a new fuse init flag to relax restrictions in no cache mode)
> ---
>   fs/fuse/file.c |   38 +++++---------------------------------
>   1 file changed, 5 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 89d97f6188e0..1490329b536c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1337,11 +1337,9 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>       struct file *file = iocb->ki_filp;
>       struct address_space *mapping = file->f_mapping;
>       ssize_t written = 0;
> -    ssize_t written_buffered = 0;
>       struct inode *inode = mapping->host;
>       ssize_t err;
>       struct fuse_conn *fc = get_fuse_conn(inode);
> -    loff_t endbyte = 0;
> 
>       if (fc->writeback_cache) {
>           /* Update size (EOF optimization) and mode (SUID clearing) */
> @@ -1377,37 +1375,10 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>       if (err)
>           goto out;
> 
> -    if (iocb->ki_flags & IOCB_DIRECT) {
> -        loff_t pos = iocb->ki_pos;
> -        written = generic_file_direct_write(iocb, from);
> -        if (written < 0 || !iov_iter_count(from))
> -            goto out;
> -
> -        pos += written;
> -
> -        written_buffered = fuse_perform_write(iocb, mapping, from, pos);
> -        if (written_buffered < 0) {
> -            err = written_buffered;
> -            goto out;
> -        }
> -        endbyte = pos + written_buffered - 1;
> -
> -        err = filemap_write_and_wait_range(file->f_mapping, pos,
> -                           endbyte);
> -        if (err)
> -            goto out;
> -
> -        invalidate_mapping_pages(file->f_mapping,
> -                     pos >> PAGE_SHIFT,
> -                     endbyte >> PAGE_SHIFT);
> +    written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
> +    if (written >= 0)
> +        iocb->ki_pos += written;
> 
> -        written += written_buffered;
> -        iocb->ki_pos = pos + written_buffered;
> -    } else {
> -        written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
> -        if (written >= 0)
> -            iocb->ki_pos += written;
> -    }
>   out:
>       current->backing_dev_info = NULL;
>       inode_unlock(inode);
> @@ -1691,7 +1662,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>       if (FUSE_IS_DAX(inode))
>           return fuse_dax_write_iter(iocb, from);
> 
> -    if (!(ff->open_flags & FOPEN_DIRECT_IO))
> +    if (!(ff->open_flags & FOPEN_DIRECT_IO) &&
> +        !(iocb->ki_flags & IOCB_DIRECT))
>           return fuse_cache_write_iter(iocb, from);
>       else
>           return fuse_direct_write_iter(iocb, from);
> 

For normal direct IO (IOCB_DIRECT set, FOPEN_DIRECT_IO not set), writes
now go to fuse_direct_write_iter(). But the previous patchset I sent adds
page flush and invalidation only in FOPEN_DIRECT_IO and/or
fc->direct_io_relax mode, so I guess this part (flush and invalidation)
is not applied on the normal direct IO code path.

Regards,
Hao



* Re: [RFC] [PATCH] fuse: DIO writes always use the same code path
  2023-07-17  8:03   ` Hao Xu
@ 2023-07-17 21:17     ` Bernd Schubert
  0 siblings, 0 replies; 28+ messages in thread
From: Bernd Schubert @ 2023-07-17 21:17 UTC (permalink / raw)
  To: Hao Xu, fuse-devel
  Cc: miklos, linux-fsdevel, Wanpeng Li, cgxu519, Dharmendra Singh

On July 17, 2023 10:03:11 AM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>Hi Bernd,
>
>On 7/5/23 18:23, Bernd Schubert wrote:
>> From: Bernd Schubert <bschubert@ddn.com>
>> 
>> Since commit 153524053bbb0d27bb2e0be36d1b46862e9ce74c, DIO
>> writes can be handled in parallel, as long as the file
>> is not extended. So far this only works when the daemon/server
>> side sets FOPEN_DIRECT_IO and FOPEN_PARALLEL_DIRECT_WRITES,
>> but O_DIRECT (iocb->ki_flags & IOCB_DIRECT) takes another
>> code path that doesn't have the parallel DIO write
>> optimization.
>> Given that fuse_direct_write_iter has to handle page writes
>> and invalidation anyway (for mmap), the DIO handler in
>> fuse_cache_write_iter() is removed and DIO writes are now
>> only handled by fuse_direct_write_iter().
>> 
>> Note: Correctness of this patch depends on a non-merged
>> series from Hao Xu <hao.xu@linux.dev>
>> ( fuse: add a new fuse init flag to relax restrictions in no cache mode)
>> ---
>>   fs/fuse/file.c |   38 +++++---------------------------------
>>   1 file changed, 5 insertions(+), 33 deletions(-)
>> 
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 89d97f6188e0..1490329b536c 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1337,11 +1337,9 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>       struct file *file = iocb->ki_filp;
>>       struct address_space *mapping = file->f_mapping;
>>       ssize_t written = 0;
>> -    ssize_t written_buffered = 0;
>>       struct inode *inode = mapping->host;
>>       ssize_t err;
>>       struct fuse_conn *fc = get_fuse_conn(inode);
>> -    loff_t endbyte = 0;
>> 
>>       if (fc->writeback_cache) {
>>           /* Update size (EOF optimization) and mode (SUID clearing) */
>> @@ -1377,37 +1375,10 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>       if (err)
>>           goto out;
>> 
>> -    if (iocb->ki_flags & IOCB_DIRECT) {
>> -        loff_t pos = iocb->ki_pos;
>> -        written = generic_file_direct_write(iocb, from);
>> -        if (written < 0 || !iov_iter_count(from))
>> -            goto out;
>> -
>> -        pos += written;
>> -
>> -        written_buffered = fuse_perform_write(iocb, mapping, from, pos);
>> -        if (written_buffered < 0) {
>> -            err = written_buffered;
>> -            goto out;
>> -        }
>> -        endbyte = pos + written_buffered - 1;
>> -
>> -        err = filemap_write_and_wait_range(file->f_mapping, pos,
>> -                           endbyte);
>> -        if (err)
>> -            goto out;
>> -
>> -        invalidate_mapping_pages(file->f_mapping,
>> -                     pos >> PAGE_SHIFT,
>> -                     endbyte >> PAGE_SHIFT);
>> +    written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
>> +    if (written >= 0)
>> +        iocb->ki_pos += written;
>> 
>> -        written += written_buffered;
>> -        iocb->ki_pos = pos + written_buffered;
>> -    } else {
>> -        written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
>> -        if (written >= 0)
>> -            iocb->ki_pos += written;
>> -    }
>>   out:
>>       current->backing_dev_info = NULL;
>>       inode_unlock(inode);
>> @@ -1691,7 +1662,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>       if (FUSE_IS_DAX(inode))
>>           return fuse_dax_write_iter(iocb, from);
>> 
>> -    if (!(ff->open_flags & FOPEN_DIRECT_IO))
>> +    if (!(ff->open_flags & FOPEN_DIRECT_IO) &&
>> +        !(iocb->ki_flags & IOCB_DIRECT))
>>           return fuse_cache_write_iter(iocb, from);
>>       else
>>           return fuse_direct_write_iter(iocb, from);
>> 
>
>For normal direct IO (IOCB_DIRECT set, FOPEN_DIRECT_IO not set), writes
>now go to fuse_direct_write_iter(). But the previous patchset I sent adds
>page flush and invalidation only in FOPEN_DIRECT_IO and/or
>fc->direct_io_relax mode, so I guess this part (flush and invalidation)
>is not applied on the normal direct IO code path.
>
>Regards,
>Hao
>

Hi Hao,

I'm going to rebase to for-next and create a single patch set that should handle that, but only next week.

Thanks,
Bernd


* Re: [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode
  2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
                   ` (3 preceding siblings ...)
  2023-07-05 10:23 ` [RFC] [PATCH] fuse: DIO writes always use the same code path Bernd Schubert
@ 2023-07-20  7:32 ` Hao Xu
  4 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-07-20  7:32 UTC (permalink / raw)
  To: fuse-devel; +Cc: miklos, bernd.schubert, linux-fsdevel, Wanpeng Li, cgxu519

On 6/30/23 17:45, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> Patch 1 is a fix for private mmap in FOPEN_DIRECT_IO mode.
>    It is included here since the latter two patches depend on it.
> Patch 2 is the main dish.
> Patch 3 maintains the direct write logic for shared mmap in FOPEN_DIRECT_IO mode.
> 
> v2 -> v3
>      add the patch 1 fix here, and adjust it following Bernd's comment
>      add patch 3, which does the right thing for shared mmap in FOPEN_DIRECT_IO mode
> 
> v1 -> v2:
>       make the new flag a fuse init flag rather than an open flag, since it
>       is uncommon for different files in a filesystem to have different
>       shared-mmap strategies.
> 
> Hao Xu (3):
>    fuse: invalidate page cache pages before direct write
>    fuse: add a new fuse init flag to relax restrictions in no cache mode
>    fuse: write back dirty pages before direct write in direct_io_relax
>      mode
> 
>   fs/fuse/file.c            | 26 +++++++++++++++++++++++---
>   fs/fuse/fuse_i.h          |  3 +++
>   fs/fuse/inode.c           |  5 ++++-
>   include/uapi/linux/fuse.h |  1 +
>   4 files changed, 31 insertions(+), 4 deletions(-)
> 

Ping this one.

Hi Miklos,

Could you take a look at this one when you have time? Since Bernd is
going to send his patch separately, this series is ready for review.

Thanks,
Hao


* Re: [fuse-devel] [PATCH 1/3] fuse: invalidate page cache pages before direct write
  2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
  2023-06-30 10:32   ` Bernd Schubert
@ 2023-07-21  3:34   ` Jiachen Zhang
  1 sibling, 0 replies; 28+ messages in thread
From: Jiachen Zhang @ 2023-07-21  3:34 UTC (permalink / raw)
  To: Hao Xu, fuse-devel
  Cc: linux-fsdevel, bernd.schubert, Wanpeng Li, cgxu519, miklos



On 2023/6/30 17:46, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> In FOPEN_DIRECT_IO mode, the page cache may still contain pages for a
> file since private mmap is allowed. Direct writes should respect that and
> invalidate the corresponding pages so that page cache readers don't get
> stale data.
> 
> Signed-off-by: Hao Xu <howeyxu@tencent.com>
> ---
>   fs/fuse/file.c | 12 +++++++++++-
>   1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index bc4115288eec..7d6dd0e56b73 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1465,7 +1465,8 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>   	int write = flags & FUSE_DIO_WRITE;
>   	int cuse = flags & FUSE_DIO_CUSE;
>   	struct file *file = io->iocb->ki_filp;
> -	struct inode *inode = file->f_mapping->host;
> +	struct address_space *mapping = file->f_mapping;
> +	struct inode *inode = mapping->host;
>   	struct fuse_file *ff = file->private_data;
>   	struct fuse_conn *fc = ff->fm->fc;
>   	size_t nmax = write ? fc->max_write : fc->max_read;
> @@ -1477,6 +1478,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>   	int err = 0;
>   	struct fuse_io_args *ia;
>   	unsigned int max_pages;
> +	bool fopen_direct_write = (ff->open_flags & FOPEN_DIRECT_IO) && write;
>   
>   	max_pages = iov_iter_npages(iter, fc->max_pages);
>   	ia = fuse_io_alloc(io, max_pages);
> @@ -1491,6 +1493,14 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>   			inode_unlock(inode);
>   	}
>   
> +	if (fopen_direct_write) {
> +		res = invalidate_inode_pages2_range(mapping, idx_from, idx_to);
> +		if (res) {
> +			fuse_io_free(ia);
> +			return res;
> +		}
> +	}
> +
>   	io->should_dirty = !write && user_backed_iter(iter);
>   	while (count) {
>   		ssize_t nres;


Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>

Thanks,
Jiachen


* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-06-30  9:46 ` [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode Hao Xu
  2023-06-30 10:40   ` Bernd Schubert
@ 2023-07-21  6:35   ` Jiachen Zhang
  2023-07-21 11:27     ` Hao Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Jiachen Zhang @ 2023-07-21  6:35 UTC (permalink / raw)
  To: Hao Xu, fuse-devel
  Cc: linux-fsdevel, bernd.schubert, Wanpeng Li, cgxu519, miklos


On 2023/6/30 17:46, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> In direct_io_relax mode, there can be shared mmaped files and thus dirty
> pages in its page cache. Therefore those dirty pages should be written
> back to backend before direct write to avoid data loss.
> 
> Signed-off-by: Hao Xu <howeyxu@tencent.com>
> ---
>   fs/fuse/file.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 176f719f8fc8..7c9167c62bf6 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>   	if (!ia)
>   		return -ENOMEM;
>   
> +	if (fopen_direct_write && fc->direct_io_relax) {
> +		res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
> +		if (res) {
> +			fuse_io_free(ia);
> +			return res;
> +		}
> +	}
>   	if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
>   		if (!write)
>   			inode_lock(inode);

Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>


Looks good to me.

By the way, the behaviour would be a first FUSE_WRITE flushing the page 
cache, followed by a second FUSE_WRITE doing the direct IO. In the 
future, a further optimization could be to write into the page cache 
first and then flush the dirty pages to the FUSE daemon.


Thanks,
Jiachen


* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-21  6:35   ` [External] [fuse-devel] " Jiachen Zhang
@ 2023-07-21 11:27     ` Hao Xu
  2023-07-21 11:56       ` Bernd Schubert
  0 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-07-21 11:27 UTC (permalink / raw)
  To: Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, bernd.schubert, Wanpeng Li, cgxu519, miklos

On 7/21/23 14:35, Jiachen Zhang wrote:
> 
> On 2023/6/30 17:46, Hao Xu wrote:
>> From: Hao Xu <howeyxu@tencent.com>
>>
>> In direct_io_relax mode, there can be shared mmaped files and thus dirty
>> pages in its page cache. Therefore those dirty pages should be written
>> back to backend before direct write to avoid data loss.
>>
>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>> ---
>>   fs/fuse/file.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 176f719f8fc8..7c9167c62bf6 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, 
>> struct iov_iter *iter,
>>       if (!ia)
>>           return -ENOMEM;
>> +    if (fopen_direct_write && fc->direct_io_relax) {
>> +        res = filemap_write_and_wait_range(mapping, pos, pos + count 
>> - 1);
>> +        if (res) {
>> +            fuse_io_free(ia);
>> +            return res;
>> +        }
>> +    }
>>       if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
>>           if (!write)
>>               inode_lock(inode);
> 
> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
> 
> 
> Looks good to me.
> 
> By the way, the behaviour would be a first FUSE_WRITE flushing the page 
> cache, followed by a second FUSE_WRITE doing the direct IO. In the 
> future, further optimization could be first write into the page cache 
> and then flush the dirty page to the FUSE daemon.
> 

I think this makes sense and I cannot think of any issue with it for
now, so I'll make that change and send the next version. Super thanks,
Jiachen!

Thanks,
Hao

> 
> Thanks,
> Jiachen



* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-21 11:27     ` Hao Xu
@ 2023-07-21 11:56       ` Bernd Schubert
  2023-07-25 10:11         ` Hao Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Bernd Schubert @ 2023-07-21 11:56 UTC (permalink / raw)
  To: Hao Xu, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos

On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>On 7/21/23 14:35, Jiachen Zhang wrote:
>> 
>> On 2023/6/30 17:46, Hao Xu wrote:
>>> From: Hao Xu <howeyxu@tencent.com>
>>> 
>>> In direct_io_relax mode, there can be shared mmaped files and thus dirty
>>> pages in its page cache. Therefore those dirty pages should be written
>>> back to backend before direct write to avoid data loss.
>>> 
>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>> ---
>>>   fs/fuse/file.c | 7 +++++++
>>>   1 file changed, 7 insertions(+)
>>> 
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index 176f719f8fc8..7c9167c62bf6 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>>>       if (!ia)
>>>           return -ENOMEM;
>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
>>> +        if (res) {
>>> +            fuse_io_free(ia);
>>> +            return res;
>>> +        }
>>> +    }
>>>       if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
>>>           if (!write)
>>>               inode_lock(inode);
>> 
>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>> 
>> 
>> Looks good to me.
>> 
>> By the way, the behaviour would be a first FUSE_WRITE flushing the page cache, followed by a second FUSE_WRITE doing the direct IO. In the future, further optimization could be first write into the page cache and then flush the dirty page to the FUSE daemon.
>> 
>
>I think this makes sense, cannot think of any issue in it for now, so
>I'll do that change and send next version, super thanks, Jiachen!
>
>Thanks,
>Hao
>
>> 
>> Thanks,
>> Jiachen
>

On my phone, sorry if mail formatting is not optimal.
Do I understand it right? You want the DIO code path to copy into pages and then flush/invalidate these pages? That would punish DIO for the unlikely case that there are also dirty pages (a discouraged IO pattern).


* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-21 11:56       ` Bernd Schubert
@ 2023-07-25 10:11         ` Hao Xu
  2023-07-25 13:00           ` Bernd Schubert
  0 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-07-25 10:11 UTC (permalink / raw)
  To: Bernd Schubert, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos

On 7/21/23 19:56, Bernd Schubert wrote:
> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>
>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>
>>>> In direct_io_relax mode, there can be shared mmaped files and thus dirty
>>>> pages in its page cache. Therefore those dirty pages should be written
>>>> back to backend before direct write to avoid data loss.
>>>>
>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>> ---
>>>>    fs/fuse/file.c | 7 +++++++
>>>>    1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>> --- a/fs/fuse/file.c
>>>> +++ b/fs/fuse/file.c
>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>>>>        if (!ia)
>>>>            return -ENOMEM;
>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
>>>> +        if (res) {
>>>> +            fuse_io_free(ia);
>>>> +            return res;
>>>> +        }
>>>> +    }
>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, idx_to)) {
>>>>            if (!write)
>>>>                inode_lock(inode);
>>>
>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>
>>>
>>> Looks good to me.
>>>
>>> By the way, the behaviour would be a first FUSE_WRITE flushing the page cache, followed by a second FUSE_WRITE doing the direct IO. In the future, further optimization could be first write into the page cache and then flush the dirty page to the FUSE daemon.
>>>
>>
>> I think this makes sense, cannot think of any issue in it for now, so
>> I'll do that change and send next version, super thanks, Jiachen!
>>
>> Thanks,
>> Hao
>>
>>>
>>> Thanks,
>>> Jiachen
>>
> 
> On my phone, sorry if mail formatting is not optimal.
> Do I understand it right? You want DIO code path copy into pages and then flush/invalidate these pages? That would be punish DIO for for the unlikely case there are also dirty pages (discouraged IO pattern).

Hi Bernd,
I'm afraid I don't get what you said: why is it a punishment, and why
is it a discouraged IO pattern?
When I first saw Jiachen's idea, I thought "that sounds like it
disobeys direct-write semantics", because usually direct write is
"flush dirty page -> invalidate page -> write data through to backend",
not "write data to page -> flush dirty page/(writeback data)".
In the worst case the latter writes data both to the page cache and to
the backend, while the former just writes to the backend and loads it
into the page cache on a later buffered read. But it seems there is no
such "standard way" which says we must implement direct IO that way.

Regards,
Hao



* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-25 10:11         ` Hao Xu
@ 2023-07-25 13:00           ` Bernd Schubert
  2023-07-25 16:57             ` Hao Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Bernd Schubert @ 2023-07-25 13:00 UTC (permalink / raw)
  To: Hao Xu, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos



On 7/25/23 12:11, Hao Xu wrote:
> On 7/21/23 19:56, Bernd Schubert wrote:
>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>
>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>
>>>>> In direct_io_relax mode, there can be shared mmaped files and thus 
>>>>> dirty
>>>>> pages in its page cache. Therefore those dirty pages should be written
>>>>> back to backend before direct write to avoid data loss.
>>>>>
>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>> ---
>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>    1 file changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>> --- a/fs/fuse/file.c
>>>>> +++ b/fs/fuse/file.c
>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv 
>>>>> *io, struct iov_iter *iter,
>>>>>        if (!ia)
>>>>>            return -ENOMEM;
>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>> count - 1);
>>>>> +        if (res) {
>>>>> +            fuse_io_free(ia);
>>>>> +            return res;
>>>>> +        }
>>>>> +    }
>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>> idx_to)) {
>>>>>            if (!write)
>>>>>                inode_lock(inode);
>>>>
>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>
>>>>
>>>> Looks good to me.
>>>>
>>>> By the way, the behaviour would be a first FUSE_WRITE flushing the 
>>>> page cache, followed by a second FUSE_WRITE doing the direct IO. In 
>>>> the future, further optimization could be first write into the page 
>>>> cache and then flush the dirty page to the FUSE daemon.
>>>>
>>>
>>> I think this makes sense, cannot think of any issue in it for now, so
>>> I'll do that change and send next version, super thanks, Jiachen!
>>>
>>> Thanks,
>>> Hao
>>>
>>>>
>>>> Thanks,
>>>> Jiachen
>>>
>>
>> On my phone, sorry if mail formatting is not optimal.
>> Do I understand it right? You want DIO code path copy into pages and 
>> then flush/invalidate these pages? That would be punish DIO for for 
>> the unlikely case there are also dirty pages (discouraged IO pattern).
> 
> Hi Bernd,
> I think I don't get what you said, why it is punishment and why it's 
> discouraged IO pattern?
> On my first eyes seeing Jiachen's idea, I was thinking "that sounds
> disobeying direct write semantics" because usually direct write is
> "flush dirty page -> invalidate page -> write data through to backend"
> not "write data to page -> flush dirty page/(writeback data)"
> The latter in worst case write data both to page cache and backend
> while the former just write to backend and load it to the page cache
> when buffered reading. But seems there is no such "standard way" which
> says we should implement direct IO in that way.

Hi Hao,

sorry for being brief last week, I was on vacation and reading/writing 
some mails on my phone.

With 'punishment' I mean memory copies to the page cache - memory 
copies are expensive and DIO should avoid them.

Right now your patch adds filemap_write_and_wait_range(), but we do not 
know whether it did any work (i.e. whether pages actually had to be 
flushed). So unless you find a way to get that information, the copy to 
the page cache would be unconditional - the overhead of a memory copy 
even when there are no dirty pages.

With 'discouraged' I mean mixing page cache and direct-io. Typically 
one should do only one of the two (page cache or DIO), not a mix of 
them. For example, your patch flushes the page cache, but without a 
lock - races are possible. Copying to the page cache might be a 
solution, but it has the overhead above.

Thanks,
Bernd
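
[A minimal sketch of one way to avoid the unconditional flush, gating
it on whether the range actually has dirty or writeback pages. Names
follow the patch quoted above; whether filemap_range_needs_writeback()
is an appropriate gate here is an assumption of this sketch, not
something settled in the thread:]

```c
/* Sketch only, not the patch: skip the flush when the range has no
 * dirty or writeback pages.  filemap_range_needs_writeback() exists
 * in recent kernels; the other names follow the diff quoted above. */
if (fopen_direct_write && fc->direct_io_relax &&
    filemap_range_needs_writeback(mapping, pos, pos + count - 1)) {
	res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (res) {
		fuse_io_free(ia);
		return res;
	}
}
```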


* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-25 13:00           ` Bernd Schubert
@ 2023-07-25 16:57             ` Hao Xu
  2023-07-25 17:59               ` Bernd Schubert
  2023-07-26 11:07               ` Jiachen Zhang
  0 siblings, 2 replies; 28+ messages in thread
From: Hao Xu @ 2023-07-25 16:57 UTC (permalink / raw)
  To: Bernd Schubert, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos


On 7/25/23 21:00, Bernd Schubert wrote:
>
>
> On 7/25/23 12:11, Hao Xu wrote:
>> On 7/21/23 19:56, Bernd Schubert wrote:
>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>
>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>
>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>> thus dirty
>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>> written
>>>>>> back to backend before direct write to avoid data loss.
>>>>>>
>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>> ---
>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>    1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>> --- a/fs/fuse/file.c
>>>>>> +++ b/fs/fuse/file.c
>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv 
>>>>>> *io, struct iov_iter *iter,
>>>>>>        if (!ia)
>>>>>>            return -ENOMEM;
>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>> count - 1);
>>>>>> +        if (res) {
>>>>>> +            fuse_io_free(ia);
>>>>>> +            return res;
>>>>>> +        }
>>>>>> +    }
>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>> idx_to)) {
>>>>>>            if (!write)
>>>>>>                inode_lock(inode);
>>>>>
>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>
>>>>>
>>>>> Looks good to me.
>>>>>
>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing the 
>>>>> page cache, followed by a second FUSE_WRITE doing the direct IO. 
>>>>> In the future, further optimization could be first write into the 
>>>>> page cache and then flush the dirty page to the FUSE daemon.
>>>>>
>>>>
>>>> I think this makes sense, cannot think of any issue in it for now, so
>>>> I'll do that change and send next version, super thanks, Jiachen!
>>>>
>>>> Thanks,
>>>> Hao
>>>>
>>>>>
>>>>> Thanks,
>>>>> Jiachen
>>>>
>>>
>>> On my phone, sorry if mail formatting is not optimal.
>>> Do I understand it right? You want DIO code path copy into pages and 
>>> then flush/invalidate these pages? That would be punish DIO for for 
>>> the unlikely case there are also dirty pages (discouraged IO pattern).
>>
>> Hi Bernd,
>> I think I don't get what you said, why it is punishment and why it's 
>> discouraged IO pattern?
>> On my first eyes seeing Jiachen's idea, I was thinking "that sounds
>> disobeying direct write semantics" because usually direct write is
>> "flush dirty page -> invalidate page -> write data through to backend"
>> not "write data to page -> flush dirty page/(writeback data)"
>> The latter in worst case write data both to page cache and backend
>> while the former just write to backend and load it to the page cache
>> when buffered reading. But seems there is no such "standard way" which
>> says we should implement direct IO in that way.
>
> Hi Hao,
>
> sorry for being brief last week, I was on vacation and reading/writing 
> some mails on my phone.
>
> With 'punishment' I mean memory copies to the page cache - memory 
> copies are expensive and DIO should avoid it.
>
> Right now your patch adds filemap_write_and_wait_range(), but we do 
> not know if it did work (i.e. if pages had to be flushed). So unless 
> you find a way to get that information, copy to page cache would be 
> unconditionally - overhead of memory copy even if there are no dirty 
> pages.


Ah, it looks like I understood what you meant in my last reply. Yes, 
just as I said there:

[1] flush dirty pages --> invalidate pages --> write data to backend

    This is what the kernel does for direct writes right now. I call 
this policy "write-through", since it doesn't care much about the cache.

[2] write data to page cache --> flush the dirty pages at a suitable time

    This is the "write-back" policy, used by buffered writes. In this 
patch's case we flush the pages synchronously, so it can still be 
called a direct write.

Surely, in the worst case the page is clean, and then [2] has one extra 
memory copy compared to [1]. But as I pointed out, with [2] the page is 
in its latest state the next time a buffered read happens, so no I/O is 
needed, while with [1] the data has to be loaded from the backend into 
the page cache.


>
> With 'discouraged' I mean mix of page cache and direct-io. Typically 
> one should only do either of both (page cache or DIO), but not a mix 
> of them. For example see your patch, it flushes the page cache, but 
> without a lock - races are possible. Copying to the page cache might 
> be a solution, but it has the overhead above.


As for races, we hold the inode lock there; am I missing anything?


>
> Thanks,
> Bernd


I now think it's better to keep the pattern the same as other 
filesystems, which is [1], to avoid possible performance issues in the 
future. Thanks, Bernd.


Hao




* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-25 16:57             ` Hao Xu
@ 2023-07-25 17:59               ` Bernd Schubert
  2023-07-27  9:42                 ` Hao Xu
  2023-07-26 11:07               ` Jiachen Zhang
  1 sibling, 1 reply; 28+ messages in thread
From: Bernd Schubert @ 2023-07-25 17:59 UTC (permalink / raw)
  To: Hao Xu, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos



On 7/25/23 18:57, Hao Xu wrote:
> 
> On 7/25/23 21:00, Bernd Schubert wrote:
>>
>>
>> On 7/25/23 12:11, Hao Xu wrote:
>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>
>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>
>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>> thus dirty
>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>> written
>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>> ---
>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>> --- a/fs/fuse/file.c
>>>>>>> +++ b/fs/fuse/file.c
>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv 
>>>>>>> *io, struct iov_iter *iter,
>>>>>>>        if (!ia)
>>>>>>>            return -ENOMEM;
>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>> count - 1);
>>>>>>> +        if (res) {
>>>>>>> +            fuse_io_free(ia);
>>>>>>> +            return res;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>> idx_to)) {
>>>>>>>            if (!write)
>>>>>>>                inode_lock(inode);
>>>>>>
>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>
>>>>>>
>>>>>> Looks good to me.
>>>>>>
>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing the 
>>>>>> page cache, followed by a second FUSE_WRITE doing the direct IO. 
>>>>>> In the future, further optimization could be first write into the 
>>>>>> page cache and then flush the dirty page to the FUSE daemon.
>>>>>>
>>>>>
>>>>> I think this makes sense, cannot think of any issue in it for now, so
>>>>> I'll do that change and send next version, super thanks, Jiachen!
>>>>>
>>>>> Thanks,
>>>>> Hao
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Jiachen
>>>>>
>>>>
>>>> On my phone, sorry if mail formatting is not optimal.
>>>> Do I understand it right? You want DIO code path copy into pages and 
>>>> then flush/invalidate these pages? That would be punish DIO for for 
>>>> the unlikely case there are also dirty pages (discouraged IO pattern).
>>>
>>> Hi Bernd,
>>> I think I don't get what you said, why it is punishment and why it's 
>>> discouraged IO pattern?
>>> On my first eyes seeing Jiachen's idea, I was thinking "that sounds
>>> disobeying direct write semantics" because usually direct write is
>>> "flush dirty page -> invalidate page -> write data through to backend"
>>> not "write data to page -> flush dirty page/(writeback data)"
>>> The latter in worst case write data both to page cache and backend
>>> while the former just write to backend and load it to the page cache
>>> when buffered reading. But seems there is no such "standard way" which
>>> says we should implement direct IO in that way.
>>
>> Hi Hao,
>>
>> sorry for being brief last week, I was on vacation and reading/writing 
>> some mails on my phone.
>>
>> With 'punishment' I mean memory copies to the page cache - memory 
>> copies are expensive and DIO should avoid it.
>>
>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>> not know if it did work (i.e. if pages had to be flushed). So unless 
>> you find a way to get that information, copy to page cache would be 
>> unconditionally - overhead of memory copy even if there are no dirty 
>> pages.
> 
> 
> Ah, looks I understood what you mean in my last email reply. Yes, just 
> like what I said in last email:
> 
> [1] flush dirty page --> invalidate page --> write data to backend
> 
>     This is what we do for direct write right now in kernel, I call this 
> policy "write-through", since it doesn't care much about the cache.
> 
> [2] write data to page cache --> flush dirty page in suitable time
> 
>     This is  "write-back" policy, used by buffered write. Here in this 
> patch's case, we flush pages synchronously, so it still can be called 
> direct-write.
> 
> Surely, in the worst case, the page is clean, then [2] has one extra 
> memory copy than [1]. But like what I pointed out, for [2], next time 
> buffered
> 
> read happens, the page is in latest state, so no I/O needed, while for 
> [1], it has to load data from backend to page cache.
> 
> 
>>
>> With 'discouraged' I mean mix of page cache and direct-io. Typically 
>> one should only do either of both (page cache or DIO), but not a mix 
>> of them. For example see your patch, it flushes the page cache, but 
>> without a lock - races are possible. Copying to the page cache might 
>> be a solution, but it has the overhead above.
> 
> 
> For race, we held inode lock there, do I miss anything?

We hold the inode lock in the write path, but not in the read path. 
That ensures invalidate_inode_pages2_range() is not racing, but a DIO 
read might race with page-cache writes happening in parallel. I guess 
the results are a bit unexpected then anyway, though they would differ 
if we held the lock.


> 
> 
>>
>> Thanks,
>> Bernd
> 
> 
> I now think it's good to keep the pattern same as other filesystems 
> which is [1] to avoid possible performance issues in the future, thanks 
> Bernd.

Thanks, I think we should keep fuse consistent with what other fs do.


Thanks,
Bernd


* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-25 16:57             ` Hao Xu
  2023-07-25 17:59               ` Bernd Schubert
@ 2023-07-26 11:07               ` Jiachen Zhang
  2023-07-26 13:15                 ` Bernd Schubert
                                   ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Jiachen Zhang @ 2023-07-26 11:07 UTC (permalink / raw)
  To: Hao Xu, Bernd Schubert, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos



On 2023/7/26 00:57, Hao Xu wrote:
> 
> On 7/25/23 21:00, Bernd Schubert wrote:
>>
>>
>> On 7/25/23 12:11, Hao Xu wrote:
>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> wrote:
>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>
>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>
>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>> thus dirty
>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>> written
>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>
>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>> ---
>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>
>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>> --- a/fs/fuse/file.c
>>>>>>> +++ b/fs/fuse/file.c
>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct fuse_io_priv 
>>>>>>> *io, struct iov_iter *iter,
>>>>>>>        if (!ia)
>>>>>>>            return -ENOMEM;
>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {


Hi,

This patchset flushes and invalidates the page cache before doing 
direct-io writes, which avoids the data loss caused by flushing stale 
data to the FUSE daemon. I tested it and it works well.

But there is another side of the same problem we should consider. If a 
file is modified through its page cache (shared mmapped regions, or 
non-FOPEN_DIRECT_IO files), subsequent direct-io reads may bypass the 
new data in the dirty page cache and read stale data from the FUSE 
daemon. I think this is also a problem that should be fixed. It could 
be fixed by unconditionally calling filemap_write_and_wait_range() 
before a direct-io read.
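
A minimal sketch of that fix, in the style of the patch quoted below
(local variable names follow the diff; this is a suggestion, not the
final upstream change):

```c
/* Sketch: flush, but do not invalidate, before a direct read, so the
 * read observes data written through a shared mmap.  Names follow
 * the patch quoted in this thread. */
if (!write && fc->direct_io_relax) {
	res = filemap_write_and_wait_range(mapping, pos, pos + count - 1);
	if (res) {
		fuse_io_free(ia);
		return res;
	}
}
```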


>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>> count - 1);
>>>>>>> +        if (res) {
>>>>>>> +            fuse_io_free(ia);
>>>>>>> +            return res;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>> idx_to)) {
>>>>>>>            if (!write)
>>>>>>>                inode_lock(inode);
>>>>>>
>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>
>>>>>>
>>>>>> Looks good to me.
>>>>>>
>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing the 
>>>>>> page cache, followed by a second FUSE_WRITE doing the direct IO. 
>>>>>> In the future, further optimization could be first write into the 
>>>>>> page cache and then flush the dirty page to the FUSE daemon.
>>>>>>
>>>>>
>>>>> I think this makes sense, cannot think of any issue in it for now, so
>>>>> I'll do that change and send next version, super thanks, Jiachen!
>>>>>
>>>>> Thanks,
>>>>> Hao
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Jiachen
>>>>>
>>>>
>>>> On my phone, sorry if mail formatting is not optimal.
>>>> Do I understand it right? You want DIO code path copy into pages and 
>>>> then flush/invalidate these pages? That would be punish DIO for for 
>>>> the unlikely case there are also dirty pages (discouraged IO pattern).
>>>
>>> Hi Bernd,
>>> I think I don't get what you said, why it is punishment and why it's 
>>> discouraged IO pattern?
>>> On my first eyes seeing Jiachen's idea, I was thinking "that sounds
>>> disobeying direct write semantics" because usually direct write is
>>> "flush dirty page -> invalidate page -> write data through to backend"
>>> not "write data to page -> flush dirty page/(writeback data)"
>>> The latter in worst case write data both to page cache and backend
>>> while the former just write to backend and load it to the page cache
>>> when buffered reading. But seems there is no such "standard way" which
>>> says we should implement direct IO in that way.
>>
>> Hi Hao,
>>
>> sorry for being brief last week, I was on vacation and reading/writing 
>> some mails on my phone.
>>
>> With 'punishment' I mean memory copies to the page cache - memory 
>> copies are expensive and DIO should avoid it.
>>
>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>> not know if it did work (i.e. if pages had to be flushed). So unless 
>> you find a way to get that information, copy to page cache would be 
>> unconditionally - overhead of memory copy even if there are no dirty 
>> pages.
> 
> 
> Ah, looks I understood what you mean in my last email reply. Yes, just 
> like what I said in last email:
> 
> [1] flush dirty page --> invalidate page --> write data to backend
> 
>     This is what we do for direct write right now in kernel, I call this 
> policy "write-through", since it doesn't care much about the cache.
> 
> [2] write data to page cache --> flush dirty page in suitable time
> 
>     This is  "write-back" policy, used by buffered write. Here in this 
> patch's case, we flush pages synchronously, so it still can be called 
> direct-write.
> 
> Surely, in the worst case, the page is clean, then [2] has one extra 
> memory copy than [1]. But like what I pointed out, for [2], next time 
> buffered
> 
> read happens, the page is in latest state, so no I/O needed, while for 
> [1], it has to load data from backend to page cache.
> 

Write-through, write-back and direct-io are also explained in the 
kernel documentation [*], of which write-through and write-back are 
cache modes. According to the document, pattern [2] is similar to the 
FUSE write-back mode, but pattern [1] is different from the FUSE 
write-through mode. The FUSE write-through mode follows 'write data to 
the page cache --> flush the dirty pages synchronously' (let us call it 
pattern [3]), which keeps the clean cache in-core after flushing.

To improve performance while keeping direct-io semantics, my thought 
was that in the future we could fall back to pattern [3] if the target 
page is in-core, and otherwise keep the original direct-io pattern 
without reading whole pages from the FUSE daemon.

[*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
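As a rough userspace illustration of the three patterns (a toy model only: dicts stand in for the page cache and the FUSE daemon's backing store, and all names are made up for illustration, not kernel APIs):

```python
# Toy model: one dict entry per page. cache maps page -> (data, dirty);
# backend stands for the FUSE daemon's storage.

def pattern1_direct_write(cache, backend, page, data):
    """[1] flush dirty page -> invalidate page -> write data to backend."""
    old = cache.get(page)
    if old is not None and old[1]:      # dirty? write it back first
        backend[page] = old[0]
    cache.pop(page, None)               # invalidate
    backend[page] = data                # direct write, no cache copy

def pattern2_cache_then_flush(cache, backend, page, data):
    """[2] write data to the page cache -> flush it synchronously."""
    cache[page] = (data, True)          # the extra memory copy
    backend[page] = data                # synchronous flush
    cache[page] = (data, False)         # page stays in-core, clean

def pattern3_writethrough_if_incore(cache, backend, page, data):
    """[3] copy into the cache only if the page is already in-core."""
    if page in cache:
        cache[page] = (data, False)     # keep a clean in-core copy
    backend[page] = data                # always reach the backend

cache, backend = {0: (b"dirty-old", True)}, {0: b"old"}
pattern1_direct_write(cache, backend, 0, b"new")
print(0 in cache, backend[0])           # False b'new'
```

The trade-off discussed in this sub-thread is visible here: [2] always pays the cache copy but leaves the page in-core for a later buffered read, while [1] never copies but a later buffered read must refetch from the backend.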

Thanks,
Jiachen

> 
>>
>> With 'discouraged' I mean a mix of page cache and direct-io. Typically 
>> one should do only one of the two (page cache or DIO), not a mix of 
>> them. For example, see your patch: it flushes the page cache, but 
>> without a lock - races are possible. Copying to the page cache might 
>> be a solution, but it has the overhead above.
> 
> 
> For the race, we hold the inode lock there; am I missing anything?
> 
> 
>>
>> Thanks,
>> Bernd
> 
> 
> I now think it's good to keep the pattern the same as in other 
> filesystems, which is [1], to avoid possible performance issues in the 
> future. Thanks, Bernd.
> 
> 
> Hao
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-26 11:07               ` Jiachen Zhang
@ 2023-07-26 13:15                 ` Bernd Schubert
  2023-07-27  2:24                   ` Jiachen Zhang
  2023-07-27 10:31                 ` Hao Xu
  2023-07-27 10:48                 ` Hao Xu
  2 siblings, 1 reply; 28+ messages in thread
From: Bernd Schubert @ 2023-07-26 13:15 UTC (permalink / raw)
  To: Jiachen Zhang, Hao Xu, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos



On 7/26/23 13:07, Jiachen Zhang wrote:
> 
> 
> On 2023/7/26 00:57, Hao Xu wrote:
>>
>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>
>>>
>>> On 7/25/23 12:11, Hao Xu wrote:
>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>> wrote:
>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>
>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>
>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>> thus dirty
>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>> written
>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>
>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>> ---
>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>        if (!ia)
>>>>>>>>            return -ENOMEM;
>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
> 
> 
> Hi,
> 
> It seems this patchset flushes and invalidates the page cache before 
> doing the direct-io writes, which avoids data loss caused by flushing 
> stale data to the FUSE daemon. And I tested that it works well.
> 
> But there is also another side of the same problem we should consider. 
> If a file is modified through its page cache (shared mmapped regions, or 
> non-FOPEN_DIRECT_IO files), the following direct-io reads may bypass the 
> new data in the dirty page cache and read stale data from the FUSE 
> daemon. I think this is also a problem that should be fixed. It could be 
> fixed by unconditionally calling filemap_write_and_wait_range() before a 
> direct-io read.
> 
> 
>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>> count - 1);
>>>>>>>> +        if (res) {
>>>>>>>> +            fuse_io_free(ia);
>>>>>>>> +            return res;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>> idx_to)) {
>>>>>>>>            if (!write)
>>>>>>>>                inode_lock(inode);
>>>>>>>
>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>
>>>>>>>
>>>>>>> Looks good to me.
>>>>>>>
>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>> IO. In the future, a further optimization could be to first write 
>>>>>>> into the page cache and then flush the dirty page to the FUSE daemon.
>>>>>>>
>>>>>>
>>>>>> I think this makes sense; I cannot think of any issue with it for now,
>>>>>> so I'll do that change and send the next version. Super thanks, Jiachen!
>>>>>>
>>>>>> Thanks,
>>>>>> Hao
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiachen
>>>>>>
>>>>>
>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>> for the unlikely case that there are also dirty pages (a discouraged 
>>>>> IO pattern).
>>>>
>>>> Hi Bernd,
>>>> I think I don't get what you said; why is it punishment, and why is 
>>>> it a discouraged IO pattern?
>>>> On first seeing Jiachen's idea, I was thinking "that sounds like
>>>> disobeying direct write semantics", because usually direct write is
>>>> "flush dirty page -> invalidate page -> write data through to backend",
>>>> not "write data to page -> flush dirty page/(writeback data)".
>>>> The latter in the worst case writes data both to the page cache and
>>>> the backend, while the former just writes to the backend and loads it
>>>> into the page cache on a buffered read. But it seems there is no such
>>>> "standard way" which says we should implement direct IO in that way.
>>>
>>> Hi Hao,
>>>
>>> sorry for being brief last week, I was on vacation and 
>>> reading/writing some mails on my phone.
>>>
>>> With 'punishment' I mean memory copies to the page cache - memory 
>>> copies are expensive and DIO should avoid them.
>>>
>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>> unless you find a way to get that information, the copy to the page 
>>> cache would be unconditional - the overhead of a memory copy even if 
>>> there are no dirty pages.
>>
>>
>> Ah, it looks like I understood what you meant in my last email reply. 
>> Yes, just like what I said in my last email:
>> 
>> [1] flush dirty page --> invalidate page --> write data to backend
>> 
>>     This is what we do for direct write right now in the kernel; I call 
>> this policy "write-through", since it doesn't care much about the cache.
>> 
>> [2] write data to page cache --> flush dirty page at a suitable time
>> 
>>     This is the "write-back" policy, used by buffered write. Here in 
>> this patch's case, we flush pages synchronously, so it can still be 
>> called direct-write.
>> 
>> Surely, in the worst case the page is clean, and then [2] has one more 
>> memory copy than [1]. But as I pointed out, for [2], the next time a 
>> buffered read happens the page is in the latest state, so no I/O is 
>> needed, while for [1], it has to load the data from the backend into 
>> the page cache.
>>
> 
> Write-through, write-back and direct-io are also explained in the kernel 
> documentation [*], of which write-through and write-back are cache 
> modes. According to the document, pattern [2] is similar to the FUSE 
> write-back mode, but pattern [1] is different from the FUSE 
> write-through mode. The FUSE write-through mode obeys 'write data to 
> page cache --> flush dirty page synchronously' (let us call it pattern 
> [3]), which keeps a clean cache in-core after flushing.
> 
> To improve performance while keeping the direct-io semantics, my 
> thought was that in the future, maybe we can fall back to pattern [3] if 
> the target page is in-core, and otherwise keep the original direct-io 
> pattern without reading whole pages from the FUSE daemon.
> 
> [*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt

Hmm, ok, so mode 3 would basically be to look up the page's folio in the 
buffer range (__filemap_get_folio?), check if it is dirty, and if so 
copy to it?
I see that it makes sense to reduce IOs, but it also makes the code 
more complex. Do you have a use case / application that does mixed DIO / 
page cache IO?


Thanks,
Bernd


* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-26 13:15                 ` Bernd Schubert
@ 2023-07-27  2:24                   ` Jiachen Zhang
  0 siblings, 0 replies; 28+ messages in thread
From: Jiachen Zhang @ 2023-07-27  2:24 UTC (permalink / raw)
  To: Bernd Schubert, Hao Xu, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos



On 2023/7/26 21:15, Bernd Schubert wrote:
> 
> 
> On 7/26/23 13:07, Jiachen Zhang wrote:
>>
>>
>> On 2023/7/26 00:57, Hao Xu wrote:
>>>
>>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 7/25/23 12:11, Hao Xu wrote:
>>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>>> wrote:
>>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>>
>>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>>
>>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>>> thus dirty
>>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>>> written
>>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>>> ---
>>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>>        if (!ia)
>>>>>>>>>            return -ENOMEM;
>>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>
>>
>> Hi,
>>
>> It seems this patchset flushes and invalidates the page cache before 
>> doing the direct-io writes, which avoids data loss caused by flushing 
>> stale data to the FUSE daemon. And I tested that it works well.
>>
>> But there is also another side of the same problem we should consider. 
>> If a file is modified through its page cache (shared mmapped regions, 
>> or non-FOPEN_DIRECT_IO files), the following direct-io reads may 
>> bypass the new data in the dirty page cache and read stale data from 
>> the FUSE daemon. I think this is also a problem that should be fixed. 
>> It could be fixed by unconditionally calling 
>> filemap_write_and_wait_range() before a direct-io read.
>>
>>
>>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>>> count - 1);
>>>>>>>>> +        if (res) {
>>>>>>>>> +            fuse_io_free(ia);
>>>>>>>>> +            return res;
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>>> idx_to)) {
>>>>>>>>>            if (!write)
>>>>>>>>>                inode_lock(inode);
>>>>>>>>
>>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>>
>>>>>>>>
>>>>>>>> Looks good to me.
>>>>>>>>
>>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>>> IO. In the future, a further optimization could be to first 
>>>>>>>> write into the page cache and then flush the dirty page to the 
>>>>>>>> FUSE daemon.
>>>>>>>>
>>>>>>>
>>>>>>> I think this makes sense; I cannot think of any issue with it for 
>>>>>>> now, so I'll do that change and send the next version. Super 
>>>>>>> thanks, Jiachen!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jiachen
>>>>>>>
>>>>>>
>>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>>> for the unlikely case that there are also dirty pages (a 
>>>>>> discouraged IO pattern).
>>>>>
>>>>> Hi Bernd,
>>>>> I think I don't get what you said; why is it punishment, and why is 
>>>>> it a discouraged IO pattern?
>>>>> On first seeing Jiachen's idea, I was thinking "that sounds like
>>>>> disobeying direct write semantics", because usually direct write is
>>>>> "flush dirty page -> invalidate page -> write data through to backend",
>>>>> not "write data to page -> flush dirty page/(writeback data)".
>>>>> The latter in the worst case writes data both to the page cache and
>>>>> the backend, while the former just writes to the backend and loads it
>>>>> into the page cache on a buffered read. But it seems there is no such
>>>>> "standard way" which says we should implement direct IO in that way.
>>>>
>>>> Hi Hao,
>>>>
>>>> sorry for being brief last week, I was on vacation and 
>>>> reading/writing some mails on my phone.
>>>>
>>>> With 'punishment' I mean memory copies to the page cache - memory 
>>>> copies are expensive and DIO should avoid them.
>>>>
>>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>>> unless you find a way to get that information, the copy to the page 
>>>> cache would be unconditional - the overhead of a memory copy even if 
>>>> there are no dirty pages.
>>>
>>>
>>> Ah, it looks like I understood what you meant in my last email reply. 
>>> Yes, just like what I said in my last email:
>>>
>>> [1] flush dirty page --> invalidate page --> write data to backend
>>>
>>>     This is what we do for direct write right now in the kernel; I 
>>> call this policy "write-through", since it doesn't care much about 
>>> the cache.
>>>
>>> [2] write data to page cache --> flush dirty page at a suitable time
>>>
>>>     This is the "write-back" policy, used by buffered write. Here in 
>>> this patch's case, we flush pages synchronously, so it can still be 
>>> called direct-write.
>>>
>>> Surely, in the worst case the page is clean, and then [2] has one more 
>>> memory copy than [1]. But as I pointed out, for [2], the next time a 
>>> buffered read happens the page is in the latest state, so no I/O is 
>>> needed, while for [1], it has to load the data from the backend into 
>>> the page cache.
>>>
>>
>> Write-through, write-back and direct-io are also explained in the 
>> kernel documentation [*], of which write-through and write-back are 
>> cache modes. According to the document, pattern [2] is similar to 
>> the FUSE write-back mode, but pattern [1] is different from the 
>> FUSE write-through mode. The FUSE write-through mode obeys the 'write 
>> data to page cache --> flush dirty page synchronously' (let us call it 
>> pattern [3]), which keeps a clean cache in-core after flushing.
>>
>> To improve performance while keeping the direct-io semantics, my 
>> thought was that in the future, maybe we can fall back to pattern [3] 
>> if the target page is in-core, and otherwise keep the original 
>> direct-io pattern without reading whole pages from the FUSE daemon.
>>
>> [*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
> 
> Hmm, ok, so mode 3 would basically be to look up the page's folio in the 
> buffer range (__filemap_get_folio?), check if it is dirty, and if so, 
> copy to it?
> I see that it makes sense to reduce IOs, but it also makes the code 
> more complex. Do you have a use case / application that does mixed DIO / 
> page cache IO?
> 

Hi Bernd,

Yes, it's just a theoretical idea for reducing FUSE_WRITE requests when 
mixing DIO and buffered IO; it has nothing to do with the correctness of 
this patchset. And I haven't come up with any use-cases.

Thanks,
Jiachen


> 
> Thanks,
> Bernd




* Re: [External] [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-25 17:59               ` Bernd Schubert
@ 2023-07-27  9:42                 ` Hao Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-07-27  9:42 UTC (permalink / raw)
  To: Bernd Schubert, Jiachen Zhang, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos

On 7/26/23 01:59, Bernd Schubert wrote:
> 
> 
> On 7/25/23 18:57, Hao Xu wrote:
>>
>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>
>>>
>>> On 7/25/23 12:11, Hao Xu wrote:
>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>> wrote:
>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>
>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>
>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>> thus dirty
>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>> written
>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>
>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>> ---
>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>        if (!ia)
>>>>>>>>            return -ENOMEM;
>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>> count - 1);
>>>>>>>> +        if (res) {
>>>>>>>> +            fuse_io_free(ia);
>>>>>>>> +            return res;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>> idx_to)) {
>>>>>>>>            if (!write)
>>>>>>>>                inode_lock(inode);
>>>>>>>
>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>
>>>>>>>
>>>>>>> Looks good to me.
>>>>>>>
>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>> IO. In the future, a further optimization could be to first write 
>>>>>>> into the page cache and then flush the dirty page to the FUSE daemon.
>>>>>>>
>>>>>>
>>>>>> I think this makes sense; I cannot think of any issue with it for now,
>>>>>> so I'll do that change and send the next version. Super thanks, Jiachen!
>>>>>>
>>>>>> Thanks,
>>>>>> Hao
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiachen
>>>>>>
>>>>>
>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>> for the unlikely case that there are also dirty pages (a discouraged 
>>>>> IO pattern).
>>>>
>>>> Hi Bernd,
>>>> I think I don't get what you said; why is it punishment, and why is 
>>>> it a discouraged IO pattern?
>>>> On first seeing Jiachen's idea, I was thinking "that sounds like
>>>> disobeying direct write semantics", because usually direct write is
>>>> "flush dirty page -> invalidate page -> write data through to backend",
>>>> not "write data to page -> flush dirty page/(writeback data)".
>>>> The latter in the worst case writes data both to the page cache and
>>>> the backend, while the former just writes to the backend and loads it
>>>> into the page cache on a buffered read. But it seems there is no such
>>>> "standard way" which says we should implement direct IO in that way.
>>>
>>> Hi Hao,
>>>
>>> sorry for being brief last week, I was on vacation and 
>>> reading/writing some mails on my phone.
>>>
>>> With 'punishment' I mean memory copies to the page cache - memory 
>>> copies are expensive and DIO should avoid them.
>>>
>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>> unless you find a way to get that information, the copy to the page 
>>> cache would be unconditional - the overhead of a memory copy even if 
>>> there are no dirty pages.
>>
>>
>> Ah, it looks like I understood what you meant in my last email reply. 
>> Yes, just like what I said in my last email:
>>
>> [1] flush dirty page --> invalidate page --> write data to backend
>>
>>     This is what we do for direct write right now in the kernel; I 
>> call this policy "write-through", since it doesn't care much about the 
>> cache.
>>
>> [2] write data to page cache --> flush dirty page at a suitable time
>>
>>     This is the "write-back" policy, used by buffered write. Here in 
>> this patch's case, we flush pages synchronously, so it can still be 
>> called direct-write.
>>
>> Surely, in the worst case the page is clean, and then [2] has one more 
>> memory copy than [1]. But as I pointed out, for [2], the next time a 
>> buffered read happens the page is in the latest state, so no I/O is 
>> needed, while for [1], it has to load the data from the backend into 
>> the page cache.
>>
>>
>>>
>>> With 'discouraged' I mean a mix of page cache and direct-io. Typically 
>>> one should do only one of the two (page cache or DIO), not a mix of 
>>> them. For example, see your patch: it flushes the page cache, but 
>>> without a lock - races are possible. Copying to the page cache might 
>>> be a solution, but it has the overhead above.
>>
>>
>> For the race, we hold the inode lock there; am I missing anything?
> 
> We hold the inode lock in the write path, but not in the read path. That 
> ensures invalidate_inode_pages2_range() is not racing, but a DIO read 
> might race with page cache writes happening in parallel. I guess the 
> results are then a bit unexpected anyway, although different if we held 
> the lock.
> 

This confused me a bit. Direct read should hold the shared inode lock, I
believe. I checked it for fuse: whether in FOPEN_DIRECT_IO mode or not,
we don't hold the shared inode lock. Is there any concern that causes
fuse to do so?

Regards,
Hao

> 
>>
>>
>>>
>>> Thanks,
>>> Bernd
>>
>>
>> I now think it's good to keep the pattern the same as in other 
>> filesystems, which is [1], to avoid possible performance issues in 
>> the future. Thanks, Bernd.
> 
> Thanks, I think we should keep fuse consistent with what other fs do.
> 
> 
> Thanks,
> Bernd



* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-26 11:07               ` Jiachen Zhang
  2023-07-26 13:15                 ` Bernd Schubert
@ 2023-07-27 10:31                 ` Hao Xu
  2023-07-28  2:57                   ` Jiachen Zhang
  2023-07-27 10:48                 ` Hao Xu
  2 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-07-27 10:31 UTC (permalink / raw)
  To: Jiachen Zhang, Bernd Schubert, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos

On 7/26/23 19:07, Jiachen Zhang wrote:
> 
> 
> On 2023/7/26 00:57, Hao Xu wrote:
>>
>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>
>>>
>>> On 7/25/23 12:11, Hao Xu wrote:
>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>> wrote:
>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>
>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>
>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>> thus dirty
>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>> written
>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>
>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>> ---
>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>        if (!ia)
>>>>>>>>            return -ENOMEM;
>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
> 
> 
> Hi,
> 
> It seems this patchset flushes and invalidates the page cache before 
> doing the direct-io writes, which avoids data loss caused by flushing 
> stale data to the FUSE daemon. And I tested that it works well.
> 
> But there is also another side of the same problem we should consider. 
> If a file is modified through its page cache (shared mmapped regions, or 
> non-FOPEN_DIRECT_IO files), the following direct-io reads may bypass the 
> new data in the dirty page cache and read stale data from the FUSE 
> daemon. I think this is also a problem that should be fixed. It could be 
> fixed by unconditionally calling filemap_write_and_wait_range() before a 
> direct-io read.
> 
> 
>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>> count - 1);
>>>>>>>> +        if (res) {
>>>>>>>> +            fuse_io_free(ia);
>>>>>>>> +            return res;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>> idx_to)) {
>>>>>>>>            if (!write)
>>>>>>>>                inode_lock(inode);
>>>>>>>
>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>
>>>>>>>
>>>>>>> Looks good to me.
>>>>>>>
>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>> IO. In the future, a further optimization could be to first write 
>>>>>>> into the page cache and then flush the dirty page to the FUSE daemon.
>>>>>>>
>>>>>>
>>>>>> I think this makes sense; I cannot think of any issue with it for now,
>>>>>> so I'll do that change and send the next version. Super thanks, Jiachen!
>>>>>>
>>>>>> Thanks,
>>>>>> Hao
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiachen
>>>>>>
>>>>>
>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>> for the unlikely case that there are also dirty pages (a discouraged 
>>>>> IO pattern).
>>>>
>>>> Hi Bernd,
>>>> I think I don't get what you said; why is it punishment, and why is 
>>>> it a discouraged IO pattern?
>>>> On first seeing Jiachen's idea, I was thinking "that sounds like
>>>> disobeying direct write semantics", because usually direct write is
>>>> "flush dirty page -> invalidate page -> write data through to backend",
>>>> not "write data to page -> flush dirty page/(writeback data)".
>>>> The latter in the worst case writes data both to the page cache and
>>>> the backend, while the former just writes to the backend and loads it
>>>> into the page cache on a buffered read. But it seems there is no such
>>>> "standard way" which says we should implement direct IO in that way.
>>>
>>> Hi Hao,
>>>
>>> sorry for being brief last week, I was on vacation and 
>>> reading/writing some mails on my phone.
>>>
>>> With 'punishment' I mean memory copies to the page cache - memory 
>>> copies are expensive and DIO should avoid them.
>>>
>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>> unless you find a way to get that information, the copy to the page 
>>> cache would be unconditional - the overhead of a memory copy even if 
>>> there are no dirty pages.
>>
>>
>> Ah, it looks like I understood what you meant in my last email reply. 
>> Yes, just like what I said in my last email:
>>
>> [1] flush dirty page --> invalidate page --> write data to backend
>>
>>     This is what we do for direct write right now in the kernel; I 
>> call this policy "write-through", since it doesn't care much about the 
>> cache.
>>
>> [2] write data to page cache --> flush dirty page at a suitable time
>>
>>     This is the "write-back" policy, used by buffered write. Here in 
>> this patch's case, we flush pages synchronously, so it can still be 
>> called direct-write.
>>
>> Surely, in the worst case the page is clean, and then [2] has one more 
>> memory copy than [1]. But as I pointed out, for [2], the next time a 
>> buffered read happens the page is in the latest state, so no I/O is 
>> needed, while for [1], it has to load the data from the backend into 
>> the page cache.
>>
> 
> Write-through, write-back and direct-io are also explained in the kernel 
> documentation [*], of which write-through and write-back are cache 
> modes. According to the document, the pattern [2] is similar to the FUSE 

Yep, the write-through and write-back I mentioned in my previous mail are
generic concepts for any cache system, e.g. CPU cache, page cache.


> write-back mode, but the pattern [1] is different from the FUSE 
> write-through mode. The FUSE write-through mode obeys the 'write data to 
> page cache --> flush dirty page synchronously' (let us call it pattern 
> [3]), which keeps the clean cache in-core after flushing.

I read the fuse doc, and I think you are right: in !FOPEN_DIRECT_IO mode,
the IO model for fuse is different from other filesystems. Specifically,
its 'write-back' branch just follows other filesystems, while
'write-through' forces all writes to go to both the page cache and the
back-end, which is actually closer to the concept of write-through.

> 
> To improve performance while keeping the direct-io semantics, my 
> thought was that in the future, maybe we can fall back to pattern [3] if 
> the target page is in-core, and otherwise keep the original direct-io 
> pattern without reading whole pages from the FUSE daemon.

Why would we read whole pages from the backend?
That case happens for buffered writes: if we partially write to a page
that is not in the front-end cache, the whole page has to be read from
the back-end. But here we always do a direct write in FOPEN_DIRECT_IO
mode, so we just invalidate the pages involved.

And just like Bernd said, "if the target page is in-core" is not enough:
if the target page is not dirty, following pattern [3] makes one extra
page cache write.
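The whole-page read that only the buffered path pays can be shown with a toy read-modify-write model (illustrative userspace Python with hypothetical names, not FUSE code; the daemon-side merge is simulated in the backend dict):

```python
PAGE_SIZE = 4096

def buffered_partial_write(cache, backend, offset, data):
    """Write-back path: a partial write to a page not in-core must first
    read the whole page from the backend (read-modify-write).
    Returns the number of kernel-side backend reads performed."""
    pgno, start = divmod(offset, PAGE_SIZE)
    reads = 0
    if pgno not in cache:
        # the extra whole-page read this sub-thread is talking about
        cache[pgno] = bytearray(backend.get(pgno, bytes(PAGE_SIZE)))
        reads = 1
    cache[pgno][start:start + len(data)] = data   # page is now dirty
    return reads

def direct_partial_write(cache, backend, offset, data):
    """Direct-write path: the kernel side performs no read, it just
    invalidates the cached page and sends the byte range to the backend,
    where the daemon merges it into its own storage."""
    pgno, start = divmod(offset, PAGE_SIZE)
    cache.pop(pgno, None)                         # invalidate, no copy
    page = bytearray(backend.get(pgno, bytes(PAGE_SIZE)))
    page[start:start + len(data)] = data          # daemon-side merge
    backend[pgno] = bytes(page)
    return 0                                      # zero kernel-side reads
```

In the model, only the buffered path ever touches the backend before writing, which mirrors why the direct path can simply invalidate the pages involved.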

> 
> [*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
> 
> Thanks,
> Jiachen
> 
>>
>>>
>>> With 'discouraged' I mean mix of page cache and direct-io. Typically 
>>> one should only do either of both (page cache or DIO), but not a mix 
>>> of them. For example see your patch, it flushes the page cache, but 
>>> without a lock - races are possible. Copying to the page cache might 
>>> be a solution, but it has the overhead above.
>>
>>
>> For race, we held inode lock there, do I miss anything?
>>
>>
>>>
>>> Thanks,
>>> Bernd
>>
>>
>> I now think it's good to keep the pattern same as other filesystems 
>> which is [1] to avoid possible performance issues in the future, 
>> thanks Bernd.
>>
>>
>> Hao
>>
>>



* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-26 11:07               ` Jiachen Zhang
  2023-07-26 13:15                 ` Bernd Schubert
  2023-07-27 10:31                 ` Hao Xu
@ 2023-07-27 10:48                 ` Hao Xu
  2 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-07-27 10:48 UTC (permalink / raw)
  To: Jiachen Zhang, Bernd Schubert, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos


On 7/26/23 19:07, Jiachen Zhang wrote:
>
>
> On 2023/7/26 00:57, Hao Xu wrote:
>>
>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>
>>>
>>> On 7/25/23 12:11, Hao Xu wrote:
>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>> wrote:
>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>
>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>
>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>> thus dirty
>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>> written
>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>
>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>> ---
>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>        if (!ia)
>>>>>>>>            return -ENOMEM;
>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>
>
> Hi,
>
> Seems this patchset flushes and invalidates the page cache before 
> doing the direct-io writes, which avoids data loss caused by flushing 
> stale data to the FUSE daemon. And I tested that it works well.
>
> But there is also another side of the same problem we should consider. 
> If a file is modified through its page cache (shared mmapped regions, 
> or non-FOPEN_DIRECT_IO files), the following direct-io reads may 
> bypass the new data in the dirty page cache and read stale data from 
> the FUSE daemon. I think this is also a problem that should be fixed. It 
> could be fixed by unconditionally calling 
> filemap_write_and_wait_range() before direct-io reads.


Yeah, I think this is true; I'll fix it in v2. Thanks, Jiachen.
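
A minimal sketch of that v2 change (hedged: it reuses the names from the hunk quoted in this patch, and the final condition may well differ) would simply stop restricting the flush to writes, so a direct read can no longer see data staler than a shared mmap write:

```c
	/* Sketch only: flush dirty pages for reads as well as writes, so
	 * that a direct read observes data written through a shared mmap. */
	if (fc->direct_io_relax) {
		res = filemap_write_and_wait_range(mapping, pos,
						   pos + count - 1);
		if (res) {
			fuse_io_free(ia);
			return res;
		}
	}
```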


>
>
>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>> count - 1);
>>>>>>>> +        if (res) {
>>>>>>>> +            fuse_io_free(ia);
>>>>>>>> +            return res;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>> idx_to)) {
>>>>>>>>            if (!write)
>>>>>>>>                inode_lock(inode);
>>>>>>>
>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>
>>>>>>>
>>>>>>> Looks good to me.
>>>>>>>
>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>> IO. In the future, a further optimization could be to first write 
>>>>>>> into the page cache and then flush the dirty pages to the FUSE 
>>>>>>> daemon.
>>>>>>>
>>>>>>
>>>>>> I think this makes sense; I cannot think of any issue with it for 
>>>>>> now, so
>>>>>> I'll do that change and send next version, super thanks, Jiachen!
>>>>>>
>>>>>> Thanks,
>>>>>> Hao
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiachen
>>>>>>
>>>>>
>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>> for the unlikely case that there are also dirty pages (a discouraged 
>>>>> IO pattern).
>>>>
>>>> Hi Bernd,
>>>> I don't think I get what you said - why is it a punishment, and why 
>>>> is it a discouraged IO pattern?
>>>> When I first saw Jiachen's idea, I was thinking "that sounds like it
>>>> disobeys direct write semantics", because usually direct write is
>>>> "flush dirty pages -> invalidate pages -> write data through to the backend",
>>>> not "write data to pages -> flush dirty pages/(write back data)".
>>>> In the worst case the latter writes data to both the page cache and
>>>> the backend, while the former just writes to the backend and loads it
>>>> into the page cache on a buffered read. But it seems there is no
>>>> "standard way" which says we should implement direct IO in that way.
>>>
>>> Hi Hao,
>>>
>>> sorry for being brief last week, I was on vacation and 
>>> reading/writing some mails on my phone.
>>>
>>> With 'punishment' I mean memory copies to the page cache - memory 
>>> copies are expensive and DIO should avoid them.
>>>
>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>> unless you find a way to get that information, the copy to the page 
>>> cache would be unconditional - the overhead of a memory copy even 
>>> when there are no dirty pages.
>>
>>
>> Ah, it looks like I understood what you meant in my last email reply. 
>> Yes, just like what I said in the last email:
>>
>> [1] flush dirty pages --> invalidate pages --> write data to the backend
>>
>>     This is what we do for direct write right now in the kernel; I call 
>> this policy "write-through", since it doesn't care much about the cache.
>>
>> [2] write data to the page cache --> flush dirty pages at a suitable time
>>
>>     This is the "write-back" policy, used by buffered write. Here in 
>> this patch's case, we flush pages synchronously, so it can still be 
>> called direct write.
>>
>> Surely, in the worst case the page is clean, and then [2] has one more 
>> memory copy than [1]. But as I pointed out, with [2] the next time a 
>> buffered read happens the page is in the latest state, so no I/O is 
>> needed, while with [1] it has to load data from the backend into the 
>> page cache.
>>
>
> Write-through, write-back and direct-io are also explained in the 
> kernel documentation [*], of which write-through and write-back are 
> cache modes. According to the document, pattern [2] is similar to 
> the FUSE write-back mode, but pattern [1] is different from the 
> FUSE write-through mode. The FUSE write-through mode obeys 'write 
> data to the page cache --> flush dirty pages synchronously' (let us 
> call it pattern [3]), which keeps the clean cache in-core after flushing.
>
> To improve performance while keeping the direct-io semantics, my 
> thought was that in the future maybe we can fall back to pattern [3] 
> if the target page is in-core, and otherwise keep the original direct-io 
> pattern without reading whole pages from the FUSE daemon.
>
> [*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
>
> Thanks,
> Jiachen
>
>>
>>>
>>> With 'discouraged' I mean mixing page cache and direct-io. Typically 
>>> one should only do one of the two (page cache or DIO), but not a mix 
>>> of them. For example, see your patch: it flushes the page cache, but 
>>> without a lock - races are possible. Copying to the page cache might 
>>> be a solution, but it has the overhead above.
>>
>>
>> As for the race, we hold the inode lock there; do I miss anything?
>>
>>
>>>
>>> Thanks,
>>> Bernd
>>
>>
>> I now think it's good to keep the pattern the same as other 
>> filesystems, which is [1], to avoid possible performance issues in the 
>> future, thanks Bernd.
>>
>>
>> Hao
>>
>>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [fuse-devel] [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode
  2023-07-27 10:31                 ` Hao Xu
@ 2023-07-28  2:57                   ` Jiachen Zhang
  0 siblings, 0 replies; 28+ messages in thread
From: Jiachen Zhang @ 2023-07-28  2:57 UTC (permalink / raw)
  To: Hao Xu, Bernd Schubert, fuse-devel
  Cc: linux-fsdevel, Wanpeng Li, cgxu519, miklos

On 2023/7/27 18:31, Hao Xu wrote:
> On 7/26/23 19:07, Jiachen Zhang wrote:
>>
>>
>> On 2023/7/26 00:57, Hao Xu wrote:
>>>
>>> On 7/25/23 21:00, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 7/25/23 12:11, Hao Xu wrote:
>>>>> On 7/21/23 19:56, Bernd Schubert wrote:
>>>>>> On July 21, 2023 1:27:26 PM GMT+02:00, Hao Xu <hao.xu@linux.dev> 
>>>>>> wrote:
>>>>>>> On 7/21/23 14:35, Jiachen Zhang wrote:
>>>>>>>>
>>>>>>>> On 2023/6/30 17:46, Hao Xu wrote:
>>>>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>>>>
>>>>>>>>> In direct_io_relax mode, there can be shared mmaped files and 
>>>>>>>>> thus dirty
>>>>>>>>> pages in its page cache. Therefore those dirty pages should be 
>>>>>>>>> written
>>>>>>>>> back to backend before direct write to avoid data loss.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Hao Xu <howeyxu@tencent.com>
>>>>>>>>> ---
>>>>>>>>>    fs/fuse/file.c | 7 +++++++
>>>>>>>>>    1 file changed, 7 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>>>>>> index 176f719f8fc8..7c9167c62bf6 100644
>>>>>>>>> --- a/fs/fuse/file.c
>>>>>>>>> +++ b/fs/fuse/file.c
>>>>>>>>> @@ -1485,6 +1485,13 @@ ssize_t fuse_direct_io(struct 
>>>>>>>>> fuse_io_priv *io, struct iov_iter *iter,
>>>>>>>>>        if (!ia)
>>>>>>>>>            return -ENOMEM;
>>>>>>>>> +    if (fopen_direct_write && fc->direct_io_relax) {
>>
>>
>> Hi,
>>
>> Seems this patchset flushes and invalidates the page cache before 
>> doing the direct-io writes, which avoids data loss caused by flushing 
>> stale data to the FUSE daemon. And I tested that it works well.
>>
>> But there is also another side of the same problem we should consider. 
>> If a file is modified through its page cache (shared mmapped regions, 
>> or non-FOPEN_DIRECT_IO files), the following direct-io reads may 
>> bypass the new data in the dirty page cache and read stale data from 
>> the FUSE daemon. I think this is also a problem that should be fixed. It 
>> could be fixed by unconditionally calling 
>> filemap_write_and_wait_range() before direct-io reads.
>>
>>
>>>>>>>>> +        res = filemap_write_and_wait_range(mapping, pos, pos + 
>>>>>>>>> count - 1);
>>>>>>>>> +        if (res) {
>>>>>>>>> +            fuse_io_free(ia);
>>>>>>>>> +            return res;
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>>        if (!cuse && fuse_range_is_writeback(inode, idx_from, 
>>>>>>>>> idx_to)) {
>>>>>>>>>            if (!write)
>>>>>>>>>                inode_lock(inode);
>>>>>>>>
>>>>>>>> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
>>>>>>>>
>>>>>>>>
>>>>>>>> Looks good to me.
>>>>>>>>
>>>>>>>> By the way, the behaviour would be a first FUSE_WRITE flushing 
>>>>>>>> the page cache, followed by a second FUSE_WRITE doing the direct 
>>>>>>>> IO. In the future, a further optimization could be to first write 
>>>>>>>> into the page cache and then flush the dirty pages to the FUSE 
>>>>>>>> daemon.
>>>>>>>>
>>>>>>>
>>>>>>> I think this makes sense; I cannot think of any issue with it for 
>>>>>>> now, so
>>>>>>> I'll do that change and send next version, super thanks, Jiachen!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hao
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jiachen
>>>>>>>
>>>>>>
>>>>>> On my phone, sorry if mail formatting is not optimal.
>>>>>> Do I understand it right? You want the DIO code path to copy into 
>>>>>> pages and then flush/invalidate these pages? That would punish DIO 
>>>>>> for the unlikely case that there are also dirty pages (a discouraged 
>>>>>> IO pattern).
>>>>>
>>>>> Hi Bernd,
>>>>> I don't think I get what you said - why is it a punishment, and why 
>>>>> is it a discouraged IO pattern?
>>>>> When I first saw Jiachen's idea, I was thinking "that sounds like it
>>>>> disobeys direct write semantics", because usually direct write is
>>>>> "flush dirty pages -> invalidate pages -> write data through to the backend",
>>>>> not "write data to pages -> flush dirty pages/(write back data)".
>>>>> In the worst case the latter writes data to both the page cache and
>>>>> the backend, while the former just writes to the backend and loads it
>>>>> into the page cache on a buffered read. But it seems there is no
>>>>> "standard way" which says we should implement direct IO in that way.
>>>>
>>>> Hi Hao,
>>>>
>>>> sorry for being brief last week, I was on vacation and 
>>>> reading/writing some mails on my phone.
>>>>
>>>> With 'punishment' I mean memory copies to the page cache - memory 
>>>> copies are expensive and DIO should avoid them.
>>>>
>>>> Right now your patch adds filemap_write_and_wait_range(), but we do 
>>>> not know if it did any work (i.e. if pages had to be flushed). So 
>>>> unless you find a way to get that information, the copy to the page 
>>>> cache would be unconditional - the overhead of a memory copy even 
>>>> when there are no dirty pages.
>>>
>>>
>>> Ah, it looks like I understood what you meant in my last email reply. 
>>> Yes, just like what I said in the last email:
>>>
>>> [1] flush dirty pages --> invalidate pages --> write data to the backend
>>>
>>>     This is what we do for direct write right now in the kernel; I call 
>>> this policy "write-through", since it doesn't care much about the cache.
>>>
>>> [2] write data to the page cache --> flush dirty pages at a suitable time
>>>
>>>     This is the "write-back" policy, used by buffered write. Here in 
>>> this patch's case, we flush pages synchronously, so it can still be 
>>> called direct write.
>>>
>>> Surely, in the worst case the page is clean, and then [2] has one more 
>>> memory copy than [1]. But as I pointed out, with [2] the next time a 
>>> buffered read happens the page is in the latest state, so no I/O is 
>>> needed, while with [1] it has to load data from the backend into the 
>>> page cache.
>>>
>>
>> Write-through, write-back and direct-io are also explained in the 
>> kernel documentation [*], of which write-through and write-back are 
>> cache modes. According to the document, pattern [2] is similar to 
>> the FUSE 
> 
> Yep, in the previous mail the write-through and write-back I mentioned
> are generic concepts for any cache system, e.g. CPU cache, page cache.
> 
> 
>> write-back mode, but pattern [1] is different from the FUSE 
>> write-through mode. The FUSE write-through mode obeys 'write data 
>> to the page cache --> flush dirty pages synchronously' (let us call it 
>> pattern [3]), which keeps the clean cache in-core after flushing.
> 
> I read the fuse doc, and I think you are right: in !FOPEN_DIRECT_IO mode,
> the IO model for fuse is different from other filesystems. Specifically,
> its 'write-back' branch just follows other filesystems, while
> 'write-through' forces all writes to go to both the page cache and the
> back-end, which is actually closer to the concept of write-through.
> 
>>
>> To improve performance while keeping the direct-io semantics, my 
>> thought was that in the future maybe we can fall back to pattern [3] 
>> if the target page is in-core, and otherwise keep the original direct-io 
>> pattern without reading whole pages from the FUSE daemon.
> 
> Why would we read whole pages from the backend?
> That case happens for buffered write: if we partially write to a page
> that is not in the front-end cache, it causes the whole page to be read
> from the back-end. But here we always do direct write
> in FOPEN_DIRECT_IO mode, so we just invalidate the pages involved.
> 
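
For illustration of the quoted point, a hedged sketch of "invalidate 
only the pages involved" (this mirrors what patch 1 of the series does 
around a direct write; the exact bounds are an assumption):

```c
	/* Sketch: before writing [pos, pos + count) through to the backend,
	 * drop only the page-cache pages that the write range overlaps,
	 * instead of paging data in as a buffered write would. */
	invalidate_inode_pages2_range(mapping,
				      pos >> PAGE_SHIFT,
				      (pos + count - 1) >> PAGE_SHIFT);
```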

Hi Hao,

Yes, and I did say that if the page is not in-core, we can still keep 
the direct-io pattern, which avoids the paging-in done by the 
write-through and write-back modes.

> And just like Bernd said, "if the target page is in-core" is not enough:
> if the target page is not dirty, following pattern [3] makes one extra
> page cache write.
> 

Yes, I agree that additionally checking the dirtiness is better for 
performance.
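
For illustration only, such a check might use the kernel's 
filemap_range_needs_writeback() helper (a sketch, not part of this 
patchset; fuse_plain_direct_write() is a hypothetical stand-in for the 
existing direct-io path):

```c
	/* Sketch: pay the page-cache copy of pattern [3] only when the
	 * written range actually has dirty or under-writeback pages;
	 * otherwise keep the plain direct-io pattern. */
	if (!filemap_range_needs_writeback(mapping, pos, pos + count - 1))
		return fuse_plain_direct_write(io, iter, ppos);
```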

Thanks,
Jiachen


>>
>> [*] https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt
>>
>> Thanks,
>> Jiachen
>>
>>>
>>>>
>>>> With 'discouraged' I mean mixing page cache and direct-io. Typically 
>>>> one should only do one of the two (page cache or DIO), but not a mix 
>>>> of them. For example, see your patch: it flushes the page cache, but 
>>>> without a lock - races are possible. Copying to the page cache might 
>>>> be a solution, but it has the overhead above.
>>>
>>>
>>> As for the race, we hold the inode lock there; do I miss anything?
>>>
>>>
>>>>
>>>> Thanks,
>>>> Bernd
>>>
>>>
>>> I now think it's good to keep the pattern the same as other 
>>> filesystems, which is [1], to avoid possible performance issues in the 
>>> future, thanks Bernd.
>>>
>>>
>>> Hao
>>>
>>>
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-07-28  2:58 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-30  9:45 [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
2023-06-30  9:46 ` [PATCH 1/3] fuse: invalidate page cache pages before direct write Hao Xu
2023-06-30 10:32   ` Bernd Schubert
2023-07-21  3:34   ` [fuse-devel] " Jiachen Zhang
2023-06-30  9:46 ` [PATCH 2/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
2023-06-30 10:35   ` Bernd Schubert
2023-06-30  9:46 ` [PATCH 3/3] fuse: write back dirty pages before direct write in direct_io_relax mode Hao Xu
2023-06-30 10:40   ` Bernd Schubert
2023-07-21  6:35   ` [External] [fuse-devel] " Jiachen Zhang
2023-07-21 11:27     ` Hao Xu
2023-07-21 11:56       ` Bernd Schubert
2023-07-25 10:11         ` Hao Xu
2023-07-25 13:00           ` Bernd Schubert
2023-07-25 16:57             ` Hao Xu
2023-07-25 17:59               ` Bernd Schubert
2023-07-27  9:42                 ` Hao Xu
2023-07-26 11:07               ` Jiachen Zhang
2023-07-26 13:15                 ` Bernd Schubert
2023-07-27  2:24                   ` Jiachen Zhang
2023-07-27 10:31                 ` Hao Xu
2023-07-28  2:57                   ` Jiachen Zhang
2023-07-27 10:48                 ` Hao Xu
2023-07-05 10:23 ` [RFC] [PATCH] fuse: DIO writes always use the same code path Bernd Schubert
2023-07-06 14:43   ` Christoph Hellwig
2023-07-07 13:36     ` Bernd Schubert
2023-07-17  8:03   ` Hao Xu
2023-07-17 21:17     ` Bernd Schubert
2023-07-20  7:32 ` [PATCH v3 0/3] fuse: add a new fuse init flag to relax restrictions in no cache mode Hao Xu
