* [PATCH v4 0/1] FUSE: Allow non-extending parallel direct writes
@ 2022-06-05  7:21 Dharmendra Singh
  2022-06-05  7:22 ` [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Dharmendra Singh @ 2022-06-05  7:21 UTC (permalink / raw)
  To: miklos, vgoyal
  Cc: Dharmendra Singh, linux-fsdevel, fuse-devel, linux-kernel, bschubert

It is observed that currently in FUSE, for direct writes, we hold the
inode lock for the full duration of the request. As a result, only one
direct write request can proceed on the same file at a time. This, I
think, is due to various reasons, such as serialization needed by
user-space fuse implementations, file size consistency, and handling of
write failures.

This patch allows parallel direct writes to proceed on the same file by
taking a shared inode lock for non-extending writes and an exclusive
lock for extending writes.

To measure performance, I tested these changes on top of
example/passthrough.c (part of libfuse) by setting the direct-io and
parallel_direct_writes flags on the file.
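
For reference, here is a minimal sketch of the passthrough open handler
used for this setup. It assumes the companion libfuse change that exposes
a parallel_direct_writes bit in struct fuse_file_info; that field name is
an assumption and the final libfuse API may differ:

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <fcntl.h>
#include <errno.h>

static int xmp_open(const char *path, struct fuse_file_info *fi)
{
	int fd = open(path, fi->flags);

	if (fd == -1)
		return -errno;

	fi->fh = fd;
	fi->direct_io = 1;              /* FOPEN_DIRECT_IO */
	fi->parallel_direct_writes = 1; /* FOPEN_PARALLEL_WRITES (assumed) */
	return 0;
}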
 
Note that writes to the underlying file system were disabled in
passthrough, as we wanted to measure the gain in FUSE only. Fio was used
to test the impact of these changes on file-per-job and single shared
file (SSF) workloads. CPU binding was applied to the passthrough process
only.

Job file for SSF:
[global]
directory=/tmp/dest
filename=ssf
size=100g
blocksize=1m
ioengine=sync
group_reporting=1
fallocate=none
runtime=60
stonewall

[write]
rw=randwrite:256
rw_sequencer=sequential
fsync_on_close=1


Job file for file-per-job:
[sequential-write]
rw=write
size=100G
directory=/tmp/dest/
group_reporting
name=sequential-write-direct
bs=1M
runtime=60


Results:

Unpatched =================

File per job


Fri May  6 09:36:52 EDT 2022
numjobs: 1  WRITE: bw=3441MiB/s (3608MB/s), 3441MiB/s-3441MiB/s (3608MB/s-3608MB/s), io=100GiB (107GB), run=29762-29762msec
numjobs: 2  WRITE: bw=8174MiB/s (8571MB/s), 8174MiB/s-8174MiB/s (8571MB/s-8571MB/s), io=200GiB (215GB), run=25054-25054msec
numjobs: 4  WRITE: bw=14.9GiB/s (15.0GB/s), 14.9GiB/s-14.9GiB/s (15.0GB/s-15.0GB/s), io=400GiB (429GB), run=26900-26900msec
numjobs: 8  WRITE: bw=23.4GiB/s (25.2GB/s), 23.4GiB/s-23.4GiB/s (25.2GB/s-25.2GB/s), io=800GiB (859GB), run=34115-34115msec
numjobs: 16  WRITE: bw=24.5GiB/s (26.3GB/s), 24.5GiB/s-24.5GiB/s (26.3GB/s-26.3GB/s), io=1469GiB (1577GB), run=60001-60001msec
numjobs: 32  WRITE: bw=20.5GiB/s (21.0GB/s), 20.5GiB/s-20.5GiB/s (21.0GB/s-21.0GB/s), io=1229GiB (1320GB), run=60003-60003msec


SSF

Fri May  6 09:46:38 EDT 2022
numjobs: 1  WRITE: bw=3624MiB/s (3800MB/s), 3624MiB/s-3624MiB/s (3800MB/s-3800MB/s), io=100GiB (107GB), run=28258-28258msec
numjobs: 2  WRITE: bw=5801MiB/s (6083MB/s), 5801MiB/s-5801MiB/s (6083MB/s-6083MB/s), io=200GiB (215GB), run=35302-35302msec
numjobs: 4  WRITE: bw=4794MiB/s (5027MB/s), 4794MiB/s-4794MiB/s (5027MB/s-5027MB/s), io=281GiB (302GB), run=60001-60001msec
numjobs: 8  WRITE: bw=3946MiB/s (4137MB/s), 3946MiB/s-3946MiB/s (4137MB/s-4137MB/s), io=231GiB (248GB), run=60003-60003msec
numjobs: 16  WRITE: bw=4040MiB/s (4236MB/s), 4040MiB/s-4040MiB/s (4236MB/s-4236MB/s), io=237GiB (254GB), run=60006-60006msec
numjobs: 32  WRITE: bw=2822MiB/s (2959MB/s), 2822MiB/s-2822MiB/s (2959MB/s-2959MB/s), io=165GiB (178GB), run=60013-60013msec


Patched =================

File per job

Fri May  6 10:05:46 EDT 2022
numjobs: 1  WRITE: bw=3193MiB/s (3348MB/s), 3193MiB/s-3193MiB/s (3348MB/s-3348MB/s), io=100GiB (107GB), run=32068-32068msec
numjobs: 2  WRITE: bw=9084MiB/s (9525MB/s), 9084MiB/s-9084MiB/s (9525MB/s-9525MB/s), io=200GiB (215GB), run=22545-22545msec
numjobs: 4  WRITE: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=400GiB (429GB), run=26986-26986msec
numjobs: 8  WRITE: bw=24.5GiB/s (26.3GB/s), 24.5GiB/s-24.5GiB/s (26.3GB/s-26.3GB/s), io=800GiB (859GB), run=32624-32624msec
numjobs: 16  WRITE: bw=24.2GiB/s (25.0GB/s), 24.2GiB/s-24.2GiB/s (25.0GB/s-25.0GB/s), io=1451GiB (1558GB), run=60001-60001msec
numjobs: 32  WRITE: bw=19.3GiB/s (20.8GB/s), 19.3GiB/s-19.3GiB/s (20.8GB/s-20.8GB/s), io=1160GiB (1245GB), run=60002-60002msec


SSF

Fri May  6 09:58:33 EDT 2022
numjobs: 1  WRITE: bw=3137MiB/s (3289MB/s), 3137MiB/s-3137MiB/s (3289MB/s-3289MB/s), io=100GiB (107GB), run=32646-32646msec
numjobs: 2  WRITE: bw=7736MiB/s (8111MB/s), 7736MiB/s-7736MiB/s (8111MB/s-8111MB/s), io=200GiB (215GB), run=26475-26475msec
numjobs: 4  WRITE: bw=14.4GiB/s (15.4GB/s), 14.4GiB/s-14.4GiB/s (15.4GB/s-15.4GB/s), io=400GiB (429GB), run=27869-27869msec
numjobs: 8  WRITE: bw=22.6GiB/s (24.3GB/s), 22.6GiB/s-22.6GiB/s (24.3GB/s-24.3GB/s), io=800GiB (859GB), run=35340-35340msec
numjobs: 16  WRITE: bw=25.6GiB/s (27.5GB/s), 25.6GiB/s-25.6GiB/s (27.5GB/s-27.5GB/s), io=1535GiB (1648GB), run=60001-60001msec
numjobs: 32  WRITE: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1211GiB (1300GB), run=60003-60003msec



SSF gain in percentage:
For 1 fio thread: +0%
For 2 fio threads: +0% 
For 4 fio threads: +42%
For 8 fio threads: +246.8%
For 16 fio threads: +549%
For 32 fio threads: +630.33%


Dharmendra Singh (1):
  Allow non-extending parallel direct writes on the same file.

 fs/fuse/file.c            | 46 ++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/fuse.h |  2 ++
 2 files changed, 45 insertions(+), 3 deletions(-)

---
 
 v4: Handled the case where the file size can be reduced after the check
     but before the shared lock is acquired.

 v3: Addressed all comments.

---


* [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-05  7:21 [PATCH v4 0/1] FUSE: Allow non-extending parallel direct writes Dharmendra Singh
@ 2022-06-05  7:22 ` Dharmendra Singh
  2022-06-07 21:25   ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Dharmendra Singh @ 2022-06-05  7:22 UTC (permalink / raw)
  To: miklos, vgoyal
  Cc: Dharmendra Singh, linux-fsdevel, fuse-devel, linux-kernel,
	bschubert, Dharmendra Singh

From: Dharmendra Singh <dsingh@ddn.com>

Currently, in FUSE, direct writes on the same file are serialized over
the inode lock, i.e. we hold the inode lock for the full duration of the
write request. I could not find a comment in the fuse code which clearly
explains why this exclusive lock is taken for direct writes.

The following might be the reasons for acquiring the exclusive lock
(but are not limited to these):
1) Our guess is that some user-space fuse implementations might be
   relying on this lock for serialization.
2) The lock protects against issues arising from file size assumptions.
3) It rules out issues arising from multiple writes where some
   writes succeeded and some failed.

This patch relaxes the exclusive lock for non-extending direct writes.

With these changes, we allow non-extending parallel direct writes on
the same file with the help of a flag called FOPEN_PARALLEL_WRITES.
If this flag is set on the file (the flag is passed from libfuse to the
fuse kernel as part of file open/create), we take a shared lock instead
of the exclusive lock so that all non-extending writes can run in
parallel.
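
As an illustration (not part of this patch), a daemon speaking the raw
protocol would opt in per open by setting the new bit in the open reply,
roughly like this (fill_open_reply is a hypothetical helper):

#include <linux/fuse.h>
#include <stdint.h>

/* Sketch only: fill the FUSE_OPEN reply so that this file uses direct
 * I/O and allows non-extending parallel direct writes. */
static void fill_open_reply(struct fuse_open_out *out, uint64_t fh)
{
	out->fh = fh;
	out->open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_WRITES;
}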

Ideally we would enable parallel direct writes of all kinds, including
extending writes, but we see issues such as:
1) When one write completes on one server and another fails on a
different server, how should we truncate (if needed) the file if the
underlying file system does not support holes? (For file systems which
support holes, there might be a possibility of enabling parallel writes
for all cases.)

FUSE implementations which rely on this inode lock for serialization
can continue to do so, and this remains the default behaviour, i.e. no
parallel direct writes.

Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/file.c            | 46 ++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/fuse.h |  2 ++
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 829094451774..72524612bd5c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1541,14 +1541,50 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return res;
 }
 
+static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
+					       struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return (iocb->ki_flags & IOCB_APPEND ||
+		iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode));
+}
+
 static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct file *file = iocb->ki_filp;
+	struct fuse_file *ff = file->private_data;
 	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
 	ssize_t res;
+	bool exclusive_lock = !(ff->open_flags & FOPEN_PARALLEL_WRITES ||
+			       fuse_direct_write_extending_i_size(iocb, from));
+
+	/*
+	 * Take exclusive lock if
+	 * - parallel writes are disabled.
+	 * - parallel writes are enabled and i_size is being extended
+	 * Take shared lock if
+	 * - parallel writes are enabled but i_size does not extend.
+	 */
+retry:
+	if (exclusive_lock)
+		inode_lock(inode);
+	else {
+		inode_lock_shared(inode);
+
+		/*
+		 * Its possible that truncate reduced the file size after the check
+		 * but before acquiring shared lock. If its so than drop shared lock and
+		 * acquire exclusive lock.
+		 */
+		if (fuse_direct_write_extending_i_size(iocb, from)) {
+			inode_unlock_shared(inode);
+			exclusive_lock = true;
+			goto retry;
+		}
+	}
 
-	/* Don't allow parallel writes to the same file */
-	inode_lock(inode);
 	res = generic_write_checks(iocb, from);
 	if (res > 0) {
 		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
@@ -1559,7 +1595,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			fuse_write_update_attr(inode, iocb->ki_pos, res);
 		}
 	}
-	inode_unlock(inode);
+	if (exclusive_lock)
+		inode_unlock(inode);
+	else
+		inode_unlock_shared(inode);
 
 	return res;
 }
@@ -2901,6 +2940,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	if (iov_iter_rw(iter) == WRITE) {
 		fuse_write_update_attr(inode, pos, ret);
+		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
 			fuse_do_truncate(file);
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d6ccee961891..ee5379d41906 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -301,6 +301,7 @@ struct fuse_file_lock {
  * FOPEN_CACHE_DIR: allow caching this directory
  * FOPEN_STREAM: the file is stream-like (no file position at all)
  * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_PARALLEL_WRITES: Allow concurrent writes on the same inode
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
@@ -308,6 +309,7 @@ struct fuse_file_lock {
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
 #define FOPEN_NOFLUSH		(1 << 5)
+#define FOPEN_PARALLEL_WRITES	(1 << 6)
 
 /**
  * INIT request/reply flags
-- 
2.17.1



* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-05  7:22 ` [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
@ 2022-06-07 21:25   ` Vivek Goyal
  2022-06-07 21:42     ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2022-06-07 21:25 UTC (permalink / raw)
  To: Dharmendra Singh
  Cc: miklos, linux-fsdevel, fuse-devel, linux-kernel, bschubert,
	Dharmendra Singh

On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
> From: Dharmendra Singh <dsingh@ddn.com>
> 
> In general, as of now, in FUSE, direct writes on the same file are
> serialized over inode lock i.e we hold inode lock for the full duration
> of the write request. I could not found in fuse code a comment which
> clearly explains why this exclusive lock is taken for direct writes.
> 
> Following might be the reasons for acquiring exclusive lock but not
> limited to
> 1) Our guess is some USER space fuse implementations might be relying
>    on this lock for seralization.

Hi Dharmendra,

I will just try to be devil's advocate. So if this is a server side
limitation, then it is possible that the fuse client's cached i_size
is stale. For example, the filesystem is shared between two
clients.

- File size is 4G as seen by client A.
- Client B truncates the file to 2G.
- Two processes in client A, try to do parallel direct writes and will
  be able to proceed and server will get two parallel writes both
  extending file size.

I can see that this can happen with virtiofs with cache=auto policy.

IOW, if this is a fuse server side limitation, then how do you ensure
that the fuse kernel's idea of i_size is not stale?

> 2) This lock protects for the issues arising due to file size
>    assumptions.
> 3) Ruling out any issues arising due to multiple writes where some 
>    writes succeeded and some failed.
> 
> This patch relaxes this exclusive lock for non-extending direct writes.
> 
> With these changes, we allows non-extending parallel direct writes
> on the same file with the help of a flag called FOPEN_PARALLEL_WRITES.
> If this flag is set on the file (flag is passed from libfuse to fuse
> kernel as part of file open/create), we do not take exclusive lock
> instead use shared lock so that all non-extending writes can run in 
> parallel.
> 
> Best practise would be to enable parallel direct writes of all kinds
> including extending writes as well but we see some issues such as
> 1) When one write completes on one server and other fails on another
> server, how we should truncate(if needed) the file if underlying file 
> system does not support holes (For file systems which supports holes,
> there might be a possibility of enabling parallel writes for all cases).
> 
> FUSE implementations which rely on this inode lock for serialisation
> can continue to do so and this is default behaviour i.e no parallel
> direct writes.
> 
> Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/file.c            | 46 ++++++++++++++++++++++++++++++++++++---
>  include/uapi/linux/fuse.h |  2 ++
>  2 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 829094451774..72524612bd5c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1541,14 +1541,50 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	return res;
>  }
>  
> +static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
> +					       struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	return (iocb->ki_flags & IOCB_APPEND ||
> +		iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode));
> +}
> +
>  static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct file *file = iocb->ki_filp;
> +	struct fuse_file *ff = file->private_data;
>  	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
>  	ssize_t res;
> +	bool exclusive_lock = !(ff->open_flags & FOPEN_PARALLEL_WRITES ||
> +			       fuse_direct_write_extending_i_size(iocb, from));
> +
> +	/*
> +	 * Take exclusive lock if
> +	 * - parallel writes are disabled.
> +	 * - parallel writes are enabled and i_size is being extended
> +	 * Take shared lock if
> +	 * - parallel writes are enabled but i_size does not extend.
> +	 */
> +retry:
> +	if (exclusive_lock)
> +		inode_lock(inode);
> +	else {
> +		inode_lock_shared(inode);
> +
> +		/*
> +		 * Its possible that truncate reduced the file size after the check
> +		 * but before acquiring shared lock. If its so than drop shared lock and
> +		 * acquire exclusive lock.
> +		 */
> +		if (fuse_direct_write_extending_i_size(iocb, from)) {
> +			inode_unlock_shared(inode);
> +			exclusive_lock = true;
> +			goto retry;
> +		}
> +	}
>  
> -	/* Don't allow parallel writes to the same file */
> -	inode_lock(inode);
>  	res = generic_write_checks(iocb, from);
>  	if (res > 0) {
>  		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
> @@ -1559,7 +1595,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			fuse_write_update_attr(inode, iocb->ki_pos, res);
>  		}
>  	}
> -	inode_unlock(inode);
> +	if (exclusive_lock)
> +		inode_unlock(inode);
> +	else
> +		inode_unlock_shared(inode);
>  
>  	return res;
>  }
> @@ -2901,6 +2940,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  	if (iov_iter_rw(iter) == WRITE) {
>  		fuse_write_update_attr(inode, pos, ret);
> +		/* For extending writes we already hold exclusive lock */
>  		if (ret < 0 && offset + count > i_size)
>  			fuse_do_truncate(file);
>  	}
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..ee5379d41906 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -301,6 +301,7 @@ struct fuse_file_lock {
>   * FOPEN_CACHE_DIR: allow caching this directory
>   * FOPEN_STREAM: the file is stream-like (no file position at all)
>   * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
> + * FOPEN_PARALLEL_WRITES: Allow concurrent writes on the same inode
>   */
>  #define FOPEN_DIRECT_IO		(1 << 0)
>  #define FOPEN_KEEP_CACHE	(1 << 1)
> @@ -308,6 +309,7 @@ struct fuse_file_lock {
>  #define FOPEN_CACHE_DIR		(1 << 3)
>  #define FOPEN_STREAM		(1 << 4)
>  #define FOPEN_NOFLUSH		(1 << 5)
> +#define FOPEN_PARALLEL_WRITES	(1 << 6)

Given you are relaxing this only for DIRECT writes (and not other kinds of
writes), should we call it, say, "FOPEN_PARALLEL_DIRECT_WRITES" instead?

Thanks
Vivek

>  
>  /**
>   * INIT request/reply flags
> -- 
> 2.17.1
> 



* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-07 21:25   ` Vivek Goyal
@ 2022-06-07 21:42     ` Bernd Schubert
  2022-06-07 22:01       ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2022-06-07 21:42 UTC (permalink / raw)
  To: Vivek Goyal, Dharmendra Singh
  Cc: miklos, linux-fsdevel, fuse-devel, linux-kernel, Dharmendra Singh



On 6/7/22 23:25, Vivek Goyal wrote:
> On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
>> From: Dharmendra Singh <dsingh@ddn.com>
>>
>> In general, as of now, in FUSE, direct writes on the same file are
>> serialized over inode lock i.e we hold inode lock for the full duration
>> of the write request. I could not found in fuse code a comment which
>> clearly explains why this exclusive lock is taken for direct writes.
>>
>> Following might be the reasons for acquiring exclusive lock but not
>> limited to
>> 1) Our guess is some USER space fuse implementations might be relying
>>     on this lock for seralization.
> 
> Hi Dharmendra,
> 
> I will just try to be devil's advocate. So if this is server side
> limitation, then it is possible that fuse client's isize data in
> cache is stale. For example, filesystem is shared between two
> clients.
> 
> - File size is 4G as seen by client A.
> - Client B truncates the file to 2G.
> - Two processes in client A, try to do parallel direct writes and will
>    be able to proceed and server will get two parallel writes both
>    extending file size.
> 
> I can see that this can happen with virtiofs with cache=auto policy.
> 
> IOW, if this is a fuse server side limitation, then how do you ensure
> that fuse kernel's i_size definition is not stale.

Hi Vivek,

I'm sorry, to be sure, can you explain where exactly a client is located
for you? For us these are multiple daemons linked to libfuse - which you
seem to call 'server'. Typically these clients are on different machines.
And servers are for us on the other side of the network - like an NFS
server.

So now while I'm not sure what you mean with 'client', I'm wondering 
about two generic questions

a) I need to double check, but we were under the assumption that the code
in question is a direct-io code path. I assume cache=auto would use the
page cache and should not be affected?

b) How would the current lock help for distributed clients? Or multiple 
fuse daemons (what you seem to call server) per local machine?

For a single vfs mount point served by fuse, truncate should take the 
exclusive lock and parallel writes the shared lock - I don't see a 
problem here either.


Thanks,
Bernd






* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-07 21:42     ` Bernd Schubert
@ 2022-06-07 22:01       ` Vivek Goyal
  2022-06-07 22:42         ` Bernd Schubert
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2022-06-07 22:01 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Dharmendra Singh, miklos, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Tue, Jun 07, 2022 at 11:42:16PM +0200, Bernd Schubert wrote:
> 
> 
> On 6/7/22 23:25, Vivek Goyal wrote:
> > On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
> > > From: Dharmendra Singh <dsingh@ddn.com>
> > > 
> > > In general, as of now, in FUSE, direct writes on the same file are
> > > serialized over inode lock i.e we hold inode lock for the full duration
> > > of the write request. I could not found in fuse code a comment which
> > > clearly explains why this exclusive lock is taken for direct writes.
> > > 
> > > Following might be the reasons for acquiring exclusive lock but not
> > > limited to
> > > 1) Our guess is some USER space fuse implementations might be relying
> > >     on this lock for seralization.
> > 
> > Hi Dharmendra,
> > 
> > I will just try to be devil's advocate. So if this is server side
> > limitation, then it is possible that fuse client's isize data in
> > cache is stale. For example, filesystem is shared between two
> > clients.
> > 
> > - File size is 4G as seen by client A.
> > - Client B truncates the file to 2G.
> > - Two processes in client A, try to do parallel direct writes and will
> >    be able to proceed and server will get two parallel writes both
> >    extending file size.
> > 
> > I can see that this can happen with virtiofs with cache=auto policy.
> > 
> > IOW, if this is a fuse server side limitation, then how do you ensure
> > that fuse kernel's i_size definition is not stale.
> 
> Hi Vivek,
> 
> I'm sorry, to be sure, can you explain where exactly a client is located for
> you? For us these are multiple daemons linked to libufse - which you seem to
> call 'server' Typically these clients are on different machines. And servers
> are for us on the other side of the network - like an NFS server.

Hi Bernd,

Agreed, terminology is a little confusing. I am calling the "fuse kernel"
the client and the fuse daemon (user space) the server. This server in
turn might be a client to another network filesystem, and the real files
might be served by that server over the network.

So for the simple virtiofs case, there can be two fuse daemons (virtiofsd
instances) sharing the same directory (either on a local filesystem or on
a network filesystem).

> 
> So now while I'm not sure what you mean with 'client', I'm wondering about
> two generic questions
> 
> a) I need to double check, but we were under the assumption the code in
> question is a direct-io code path. I assume cache=auto would use the page
> cache and should not be effected?

By default cache=auto uses the page cache, but if an application initiates
direct I/O, it should use the direct I/O path.

> 
> b) How would the current lock help for distributed clients? Or multiple fuse
> daemons (what you seem to call server) per local machine?

I thought that the current lock is trying to protect the fuse kernel side
and assumed the fuse server (daemon linked to libfuse) can handle multiple
parallel writes. At least that's how I thought about things. I might
be wrong, I am not sure.

> 
> For a single vfs mount point served by fuse, truncate should take the
> exclusive lock and parallel writes the shared lock - I don't see a problem
> here either.

Agreed that this does not seem like a problem from the fuse kernel side. I
was just questioning where parallel direct writes become a problem. And the
answer I heard was that it is probably the fuse server (daemon linked with
libfuse) which is expecting the locking. And if that's the case, this
patch is not foolproof. It is possible that the file got truncated from
a different client (from a different fuse daemon linked with libfuse).

So say A is the first fuse daemon and B is another fuse daemon. Both are
clients to some network file system such as NFS.

- Fuse kernel for A, sees file size as 4G.
- fuse daemon B truncates the file to size 2G.
- Fuse kernel for A, has stale cache, and can send two parallel writes
  say at 3G and 3.5G offset.
- Fuse daemon A might not like it (assuming this is a fuse daemon/user
  space side limitation).

I hope I was able to explain my concern. I am not saying that this patch
is not good. All I am saying is that the fuse daemon (user space) cannot
rely on never getting two parallel direct writes that extend beyond
the file size. If the fuse kernel cache is stale, it can happen. Just
trying to set the expectations right.

Thanks
Vivek



* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-07 22:01       ` Vivek Goyal
@ 2022-06-07 22:42         ` Bernd Schubert
  2022-06-09 13:53           ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2022-06-07 22:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dharmendra Singh, miklos, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh



On 6/8/22 00:01, Vivek Goyal wrote:
> On Tue, Jun 07, 2022 at 11:42:16PM +0200, Bernd Schubert wrote:
>>
>>
>> On 6/7/22 23:25, Vivek Goyal wrote:
>>> On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
>>>> From: Dharmendra Singh <dsingh@ddn.com>
>>>>
>>>> In general, as of now, in FUSE, direct writes on the same file are
>>>> serialized over inode lock i.e we hold inode lock for the full duration
>>>> of the write request. I could not found in fuse code a comment which
>>>> clearly explains why this exclusive lock is taken for direct writes.
>>>>
>>>> Following might be the reasons for acquiring exclusive lock but not
>>>> limited to
>>>> 1) Our guess is some USER space fuse implementations might be relying
>>>>      on this lock for seralization.
>>>
>>> Hi Dharmendra,
>>>
>>> I will just try to be devil's advocate. So if this is server side
>>> limitation, then it is possible that fuse client's isize data in
>>> cache is stale. For example, filesystem is shared between two
>>> clients.
>>>
>>> - File size is 4G as seen by client A.
>>> - Client B truncates the file to 2G.
>>> - Two processes in client A, try to do parallel direct writes and will
>>>     be able to proceed and server will get two parallel writes both
>>>     extending file size.
>>>
>>> I can see that this can happen with virtiofs with cache=auto policy.
>>>
>>> IOW, if this is a fuse server side limitation, then how do you ensure
>>> that fuse kernel's i_size definition is not stale.
>>
>> Hi Vivek,
>>
>> I'm sorry, to be sure, can you explain where exactly a client is located for
>> you? For us these are multiple daemons linked to libufse - which you seem to
>> call 'server' Typically these clients are on different machines. And servers
>> are for us on the other side of the network - like an NFS server.
> 
> Hi Bernd,
> 
> Agreed, terminology is little confusing. I am calling "fuse kernel" as
> client and fuse daemon (user space) as server. This server in turn might
> be the client to another network filesystem and real files might be
> served by that server on network.
> 
> So for simple virtiofs case, There can be two fuse daemons (virtiofsd
> instances) sharing same directory (either on local filesystem or on
> a network filesystem).

So the combination of fuse-kernel + fuse-daemon == vfs mount.

> 
>>
>> So now while I'm not sure what you mean with 'client', I'm wondering about
>> two generic questions
>>
>> a) I need to double check, but we were under the assumption the code in
>> question is a direct-io code path. I assume cache=auto would use the page
>> cache and should not be effected?
> 
> By default cache=auto use page cache but if application initiates a
> direct I/O, it should use direct I/O path.

Ok, so we are on the same page regarding direct-io.

> 
>>
>> b) How would the current lock help for distributed clients? Or multiple fuse
>> daemons (what you seem to call server) per local machine?
> 
> I thought that current lock is trying to protect fuse kernel side and
> assumed fuse server (daemon linked to libfuse) can handle multiple
> parallel writes. Atleast that's how I thought about the things. I might
> be wrong. I am not sure.
> 
>>
>> For a single vfs mount point served by fuse, truncate should take the
>> exclusive lock and parallel writes the shared lock - I don't see a problem
>> here either.
> 
> Agreed that this does not seem like a problem from fuse kernel side. I was
> just questioning that where parallel direct writes become a problem. And
> answer I heard was that it probably is fuse server (daemon linked with
> libfuse) which is expecting the locking. And if that's the case, this
> patch is not fool proof. It is possible that file got truncated from
> a different client (from a different fuse daemon linked with libfuse).
> 
> So say A is first fuse daemon and B is another fuse daemon. Both are
> clients to some network file system as NFS.
> 
> - Fuse kernel for A, sees file size as 4G.
> - fuse daemon B truncates the file to size 2G.
> - Fuse kernel for A, has stale cache, and can send two parallel writes
>    say at 3G and 3.5G offset.

I guess you mean inode cache, not data cache, as this is direct-io. But 
now why would we need to worry about any cache here, if this is 
direct-io - the application writes without going into any cache and at 
the same time a truncate happens? The current kernel side lock would not 
help here, but a distributed lock is needed to handle this correctly?

int fd = open(path, O_WRONLY | O_DIRECT);

clientA: pwrite(fd, buf, 100G, 0) -> takes a long time
clientB: ftruncate(fd, 0)

I guess on a local file system that will result in a zero size file. On 
different fuse mounts (without a DLM) or NFS, undefined behavior.


> - Fuser daemon A might not like it.(Assuming this is fuse daemon/user
>    space side limitation).

I think there are two cases for the fuse daemons:

a) does not have a distributed lock - just needs to handle the writes, 
the local kernel lock does not protect against distributed races. I 
guess most of these file systems can enable parallel writes, unless the 
kernel lock is used to handle userspace thread synchronization.

b) has a distributed lock - needs a callback to fuse kernel to inform 
the kernel to invalidate all data.

At DDN we have both of them, a) is in production, the successor b) is 
being worked on. We might come back with more patches for more callbacks 
for the DLM - I'm not sure yet.
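
As a rough sketch only, such an invalidation callback could be built on
libfuse 3's low-level notification API; this assumes a valid fuse_session
pointer and the inode number of the file whose lock was revoked:

#define FUSE_USE_VERSION 34
#include <fuse_lowlevel.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical DLM revoke handler: ask the kernel to drop its cached
 * attributes and data for this inode so that the next access goes back
 * to the daemon.  off=0, len=0 means invalidate the whole file. */
static void on_dlm_revoke(struct fuse_session *se, fuse_ino_t ino)
{
	int err = fuse_lowlevel_notify_inval_inode(se, ino, 0, 0);

	if (err && err != -ENOENT)
		fprintf(stderr, "inval inode %llu: %s\n",
			(unsigned long long)ino, strerror(-err));
}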


> 
> I hope I am able to explain my concern. I am not saying that this patch
> is not good. All I am saying that fuse daemon (user space) can not rely
> on that it will never get two parallel direct writes which can be beyond
> the file size. If fuse kernel cache is stale, it can happen. Just trying
> to set the expectations right.


I don't see an issue yet. Regarding virtiofs, does it have a distributed 
lock manager (DLM)? I guess not?


Thanks,
Bernd


* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-07 22:42         ` Bernd Schubert
@ 2022-06-09 13:53           ` Vivek Goyal
  2022-06-10  7:24             ` Dharmendra Hans
  2022-06-16  9:01             ` Miklos Szeredi
  0 siblings, 2 replies; 11+ messages in thread
From: Vivek Goyal @ 2022-06-09 13:53 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Dharmendra Singh, miklos, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Wed, Jun 08, 2022 at 12:42:20AM +0200, Bernd Schubert wrote:
> 
> 
> On 6/8/22 00:01, Vivek Goyal wrote:
> > On Tue, Jun 07, 2022 at 11:42:16PM +0200, Bernd Schubert wrote:
> > > 
> > > 
> > > On 6/7/22 23:25, Vivek Goyal wrote:
> > > > On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
> > > > > From: Dharmendra Singh <dsingh@ddn.com>
> > > > > 
> > > > > In general, as of now, in FUSE, direct writes on the same file are
> > > > > serialized over inode lock i.e we hold inode lock for the full duration
> > > > > of the write request. I could not found in fuse code a comment which
> > > > > clearly explains why this exclusive lock is taken for direct writes.
> > > > > 
> > > > > Following might be the reasons for acquiring exclusive lock but not
> > > > > limited to
> > > > > 1) Our guess is some USER space fuse implementations might be relying
> > > > >      on this lock for seralization.
> > > > 
> > > > Hi Dharmendra,
> > > > 
> > > > I will just try to be devil's advocate. So if this is server side
> > > > limitation, then it is possible that fuse client's isize data in
> > > > cache is stale. For example, filesystem is shared between two
> > > > clients.
> > > > 
> > > > - File size is 4G as seen by client A.
> > > > - Client B truncates the file to 2G.
> > > > - Two processes in client A, try to do parallel direct writes and will
> > > >     be able to proceed and server will get two parallel writes both
> > > >     extending file size.
> > > > 
> > > > I can see that this can happen with virtiofs with cache=auto policy.
> > > > 
> > > > IOW, if this is a fuse server side limitation, then how do you ensure
> > > > that fuse kernel's i_size definition is not stale.
> > > 
> > > Hi Vivek,
> > > 
> > > I'm sorry, to be sure, can you explain where exactly a client is located for
> > > you? For us these are multiple daemons linked to libufse - which you seem to
> > > call 'server' Typically these clients are on different machines. And servers
> > > are for us on the other side of the network - like an NFS server.
> > 
> > Hi Bernd,
> > 
> > Agreed, terminology is little confusing. I am calling "fuse kernel" as
> > client and fuse daemon (user space) as server. This server in turn might
> > be the client to another network filesystem and real files might be
> > served by that server on network.
> > 
> > So for simple virtiofs case, There can be two fuse daemons (virtiofsd
> > instances) sharing same directory (either on local filesystem or on
> > a network filesystem).
> 
> So the combination of fuse-kernel + fuse-daemon == vfs mount.

This is fine for regular fuse file systems. For virtiofs fuse-kernel is
running in a VM and fuse-daemon is running outside the VM on host.
> 
> > 
> > > 
> > > So now while I'm not sure what you mean with 'client', I'm wondering about
> > > two generic questions
> > > 
> > > a) I need to double check, but we were under the assumption the code in
> > > question is a direct-io code path. I assume cache=auto would use the page
> > > cache and should not be effected?
> > 
> > By default cache=auto use page cache but if application initiates a
> > direct I/O, it should use direct I/O path.
> 
> Ok, so we are on the same page regarding direct-io.
> 
> > 
> > > 
> > > b) How would the current lock help for distributed clients? Or multiple fuse
> > > daemons (what you seem to call server) per local machine?
> > 
> > I thought that current lock is trying to protect fuse kernel side and
> > assumed fuse server (daemon linked to libfuse) can handle multiple
> > parallel writes. Atleast that's how I thought about the things. I might
> > be wrong. I am not sure.
> > 
> > > 
> > > For a single vfs mount point served by fuse, truncate should take the
> > > exclusive lock and parallel writes the shared lock - I don't see a problem
> > > here either.
> > 
> > Agreed that this does not seem like a problem from fuse kernel side. I was
> > just questioning that where parallel direct writes become a problem. And
> > answer I heard was that it probably is fuse server (daemon linked with
> > libfuse) which is expecting the locking. And if that's the case, this
> > patch is not fool proof. It is possible that file got truncated from
> > a different client (from a different fuse daemon linked with libfuse).
> > 
> > So say A is first fuse daemon and B is another fuse daemon. Both are
> > clients to some network file system as NFS.
> > 
> > - Fuse kernel for A, sees file size as 4G.
> > - fuse daemon B truncates the file to size 2G.
> > - Fuse kernel for A, has stale cache, and can send two parallel writes
> >    say at 3G and 3.5G offset.
> 
> I guess you mean inode cache, not data cache, as this is direct-io.

Yes, the inode cache, and the cached ->i_size might be an issue. These
patches use the cached ->i_size to determine whether parallel direct I/O
should be allowed or not.


> But now
> why would we need to worry about any cache here, if this is direct-io - the
> application writes without going into any cache and at the same time a
> truncate happens? The current kernel side lock would not help here, but a
> distrubuted lock is needed to handle this correctly?
> 
> int fd = open(path, O_WRONLY | O_DIRECT);
> 
> clientA: pwrite(fd, buf, 100G, 0) -> takes a long time
> clientB: ftruncate(fd, 0)
> 
> I guess on a local file system that will result in a zero size file. On
> different fuse mounts (without a DLM) or NFS, undefined behavior.
> 
> 
> > - Fuser daemon A might not like it.(Assuming this is fuse daemon/user
> >    space side limitation).
> 
> I think there are two cases for the fuser daemons:
> 
> a) does not have a distributed lock - just needs to handle the writes, the
> local kernel lock does not protect against distributed races.

Exactly. This is the point I am trying to raise. "Local kernel lock does
not protect against distributed races".

So in this case the local kernel has ->i_size cached, which might be an
old value, and checking i_size does not guarantee that the fuse daemon
will not get parallel extending writes.

> I guess most
> of these file systems can enable parallel writes, unless the kernel lock is
> used to handle userspace thread synchronization.

Right. If user space is relying on kernel lock for thread synchronization,
it can not enable parallel writes.

But if it is not relying on this, it should be able to enable parallel
writes. Just keep in mind that the ->i_size check is not sufficient to
guarantee that you will not get two extending parallel writes. If
another client on a different machine truncated the file, it is
possible this client has an old cached ->i_size and it can
get multiple file-extending parallel writes.

So if fuse daemon enables parallel extending writes, it should be
prepared to deal with multiple extending parallel writes.

And if this is a correct assumption, I am wondering why we even try
to do the ->i_size check and avoid parallel extending writes
in the fuse kernel. Maybe there is something I am not aware of, and
that's why I am just raising questions.

> 
> b) has a distributed lock - needs a callback to fuse kernel to inform the
> kernel to invalidate all data.
> 
> At DDN we have both of them, a) is in production, the successor b) is being
> worked on. We might come back with more patches for more callbacks for the
> DLM - I'm not sure yet.
> 
> 
> > 
> > I hope I am able to explain my concern. I am not saying that this patch
> > is not good. All I am saying that fuse daemon (user space) can not rely
> > on that it will never get two parallel direct writes which can be beyond
> > the file size. If fuse kernel cache is stale, it can happen. Just trying
> > to set the expectations right.
> 
> 
> I don't see an issue yet. Regarding virtiofs, does it have a distributed
> lock manager (DLM)? I guess not?

Nope. virtiofs does not have any DLM.

Vivek
> 
> 
> Thanks,
> Bernd
> 



* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-09 13:53           ` Vivek Goyal
@ 2022-06-10  7:24             ` Dharmendra Hans
  2022-06-15 21:12               ` Vivek Goyal
  2022-06-16  9:01             ` Miklos Szeredi
  1 sibling, 1 reply; 11+ messages in thread
From: Dharmendra Hans @ 2022-06-10  7:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Bernd Schubert, Miklos Szeredi, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Thu, Jun 9, 2022 at 7:23 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Jun 08, 2022 at 12:42:20AM +0200, Bernd Schubert wrote:
> >
> >
> > On 6/8/22 00:01, Vivek Goyal wrote:
> > > On Tue, Jun 07, 2022 at 11:42:16PM +0200, Bernd Schubert wrote:
> > > >
> > > >
> > > > On 6/7/22 23:25, Vivek Goyal wrote:
> > > > > On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
> > > > > > From: Dharmendra Singh <dsingh@ddn.com>
> > > > > >
> > > > > > In general, as of now, in FUSE, direct writes on the same file are
> > > > > > serialized over inode lock i.e we hold inode lock for the full duration
> > > > > > of the write request. I could not found in fuse code a comment which
> > > > > > clearly explains why this exclusive lock is taken for direct writes.
> > > > > >
> > > > > > Following might be the reasons for acquiring exclusive lock but not
> > > > > > limited to
> > > > > > 1) Our guess is some USER space fuse implementations might be relying
> > > > > >      on this lock for seralization.
> > > > >
> > > > > Hi Dharmendra,
> > > > >
> > > > > I will just try to be devil's advocate. So if this is server side
> > > > > limitation, then it is possible that fuse client's isize data in
> > > > > cache is stale. For example, filesystem is shared between two
> > > > > clients.
> > > > >
> > > > > - File size is 4G as seen by client A.
> > > > > - Client B truncates the file to 2G.
> > > > > - Two processes in client A, try to do parallel direct writes and will
> > > > >     be able to proceed and server will get two parallel writes both
> > > > >     extending file size.
> > > > >
> > > > > I can see that this can happen with virtiofs with cache=auto policy.
> > > > >
> > > > > IOW, if this is a fuse server side limitation, then how do you ensure
> > > > > that fuse kernel's i_size definition is not stale.
> > > >
> > > > Hi Vivek,
> > > >
> > > > I'm sorry, to be sure, can you explain where exactly a client is located for
> > > > you? For us these are multiple daemons linked to libufse - which you seem to
> > > > call 'server' Typically these clients are on different machines. And servers
> > > > are for us on the other side of the network - like an NFS server.
> > >
> > > Hi Bernd,
> > >
> > > Agreed, terminology is little confusing. I am calling "fuse kernel" as
> > > client and fuse daemon (user space) as server. This server in turn might
> > > be the client to another network filesystem and real files might be
> > > served by that server on network.
> > >
> > > So for simple virtiofs case, There can be two fuse daemons (virtiofsd
> > > instances) sharing same directory (either on local filesystem or on
> > > a network filesystem).
> >
> > So the combination of fuse-kernel + fuse-daemon == vfs mount.
>
> This is fine for regular fuse file systems. For virtiofs fuse-kernel is
> running in a VM and fuse-daemon is running outside the VM on host.
> >
> > >
> > > >
> > > > So now while I'm not sure what you mean with 'client', I'm wondering about
> > > > two generic questions
> > > >
> > > > a) I need to double check, but we were under the assumption the code in
> > > > question is a direct-io code path. I assume cache=auto would use the page
> > > > cache and should not be effected?
> > >
> > > By default cache=auto use page cache but if application initiates a
> > > direct I/O, it should use direct I/O path.
> >
> > Ok, so we are on the same page regarding direct-io.
> >
> > >
> > > >
> > > > b) How would the current lock help for distributed clients? Or multiple fuse
> > > > daemons (what you seem to call server) per local machine?
> > >
> > > I thought that current lock is trying to protect fuse kernel side and
> > > assumed fuse server (daemon linked to libfuse) can handle multiple
> > > parallel writes. Atleast that's how I thought about the things. I might
> > > be wrong. I am not sure.
> > >
> > > >
> > > > For a single vfs mount point served by fuse, truncate should take the
> > > > exclusive lock and parallel writes the shared lock - I don't see a problem
> > > > here either.
> > >
> > > Agreed that this does not seem like a problem from fuse kernel side. I was
> > > just questioning that where parallel direct writes become a problem. And
> > > answer I heard was that it probably is fuse server (daemon linked with
> > > libfuse) which is expecting the locking. And if that's the case, this
> > > patch is not fool proof. It is possible that file got truncated from
> > > a different client (from a different fuse daemon linked with libfuse).
> > >
> > > So say A is first fuse daemon and B is another fuse daemon. Both are
> > > clients to some network file system as NFS.
> > >
> > > - Fuse kernel for A, sees file size as 4G.
> > > - fuse daemon B truncates the file to size 2G.
> > > - Fuse kernel for A, has stale cache, and can send two parallel writes
> > >    say at 3G and 3.5G offset.
> >
> > I guess you mean inode cache, not data cache, as this is direct-io.
>
> Yes inode cache and cached ->i_size might be an issue. These patches
> used cached ->i_size to determine if parallel direct I/O should be
> allowed or not.
>
>
> > But now
> > why would we need to worry about any cache here, if this is direct-io - the
> > application writes without going into any cache and at the same time a
> > truncate happens? The current kernel side lock would not help here, but a
> > distrubuted lock is needed to handle this correctly?
> >
> > int fd = open(path, O_WRONLY | O_DIRECT);
> >
> > clientA: pwrite(fd, buf, 100G, 0) -> takes a long time
> > clientB: ftruncate(fd, 0)
> >
> > I guess on a local file system that will result in a zero size file. On
> > different fuse mounts (without a DLM) or NFS, undefined behavior.
> >
> >
> > > - Fuser daemon A might not like it.(Assuming this is fuse daemon/user
> > >    space side limitation).
> >
> > I think there are two cases for the fuser daemons:
> >
> > a) does not have a distributed lock - just needs to handle the writes, the
> > local kernel lock does not protect against distributed races.
>
> Exactly. This is the point I am trying to raise. "Local kernel lock does
> not protect against distributed races".
>
> So in this case local kernel has ->i_size cached and this might be an
> old value and checking i_size does not guarantee that fuse daemon
> will not get parallel extending writes.
>
> > I guess most
> > of these file systems can enable parallel writes, unless the kernel lock is
> > used to handle userspace thread synchronization.
>
> Right. If user space is relying on kernel lock for thread synchronization,
> it can not enable parallel writes.
>
> But if it is not relying on this, it should be able to enable parallel
> writes. Just keep in mind that ->i_size check is not sufficient to
> guarantee that you will not get "two extnding parallel writes". If
> another client on a different machine truncated the file, it is
> possible this client has old cached ->i_size and it will can
> get multiple file extending parallel writes.
>
> So if fuse daemon enables parallel extending writes, it should be
> prepared to deal with multiple extending parallel writes.
>
> And if this is correct assumption, I am wondering why to even try
> to do ->i_size check and try to avoid parallel extending writes
> in fuse kernel. May be there is something I am not aware of. And
> that's why I am just raising questions.

Let's consider a couple of cases:

1) The fuse daemon is the file server itself (local file system):
   Here we need to make sure of a few things in the fuse kernel:
   a) Appending writes are handled. This requires serialized access to
      the inode in the fuse kernel, as we generate the offset from i_size
      (and i_size is updated only after the write returns).
   b) If we allow concurrent writes, then we can have the following cases:
      - All writes land under i_size, i.e. they are overwrites.
        If any of the writes fails (though it is expected that all
        following writes would then fail on that file), usually on a
        single daemon all following writes on the same file would be
        aborted. Since fuse updates i_size only after a write returns
        successfully, we have no worry in this case; no action such as
        truncate is required from fuse, as we are not using the page
        cache here.
      - All writes are extending writes, i.e. they extend the current
        i_size. Let's assume that, as of now, i_size is 1 MB. Now wr1
        extends i_size from 1 MB to 2 MB, and wr2 extends i_size from
        2 MB to 3 MB. If wr1 succeeds and wr2 fails, wr1 would update
        i_size to 2 MB and wr2 would not update i_size, so we are good;
        nothing is required here.
        In the reverse case, where wr1 fails and wr2 succeeds, wr2 would
        update i_size to 3 MB (wr1 would not update i_size). Here we are
        required to create a hole in the file from offset 1 MB to 2 MB,
        otherwise garbage would be returned to the reader, as it is a
        fresh write and no old data exists yet at that offset.

2) The fuse daemon forwards requests to an actual file server (i.e. the
   fuse daemon is a client here):
   Note that this fuse daemon is forwarding data to actual servers (and
   we can have a single server or multiple servers consuming data), so
   it can send wr1 to srv1, wr2 to srv2 and so on.
   Here we need to make sure of a few things again:
   a) Appending writes: as pointed out in 1), every fuse daemon should
      generate the correct offset (local to itself) at which data is
      written. We need the exclusive lock for this.
   b) Allowing concurrent writes:
      - All writes land under i_size, i.e. they are overwrites.
        Here it can happen that some write went to srv1 and succeeded
        and some went to srv2 and failed (due to a space issue on that
        node or something else like network problems). In this case we
        are not required to do anything, as usual.
      - All writes are extending writes.
        As in 1), let's assume that, as of now, i_size is 1 MB. Now wr1
        extends i_size from 1 MB to 2 MB and goes to srv1, and wr2
        extends i_size from 2 MB to 3 MB and goes to srv2. If wr1
        succeeds and wr2 fails, wr1 would update i_size to 2 MB and wr2
        would not update i_size, so we are good; nothing is required
        here.
        In the reverse case, where wr1 fails and wr2 succeeds, wr2 would
        update i_size to 3 MB (wr1 would not update i_size). Here we are
        again required to create a hole in the file from offset 1 MB to
        2 MB, otherwise garbage would be returned to the reader.

It can happen that holes are not supported by all file server types. In
that case too, I don't think we can allow extending writes.
My understanding is that each fuse daemon is supposed to maintain
consistency related to offset/i_size on its own end when we do not have
a DLM.
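
To illustrate the hole case above: a hypothetical helper that a daemon
with a local backend might use to clear the gap left by a failed
extending write (sketch only; zero_failed_range and backend_fd are made
up, and it assumes the backend supports fallocate, with a plain
zero-write fallback otherwise):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Zero the byte range [off, off + len) left behind when an earlier
 * extending write failed but a later one already pushed i_size past
 * it, so readers do not see stale backend data. */
static int zero_failed_range(int backend_fd, off_t off, off_t len)
{
	char buf[4096];

	if (fallocate(backend_fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
		return 0;

	/* Backend may not support FALLOC_FL_ZERO_RANGE: write zeros. */
	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len > (off_t)sizeof(buf) ? sizeof(buf) : (size_t)len;
		ssize_t n = pwrite(backend_fd, buf, chunk, off);

		if (n < 0)
			return -1;
		off += n;
		len -= n;
	}
	return 0;
}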

> >
> > b) has a distributed lock - needs a callback to fuse kernel to inform the
> > kernel to invalidate all data.
> >
> > At DDN we have both of them, a) is in production, the successor b) is being
> > worked on. We might come back with more patches for more callbacks for the
> > DLM - I'm not sure yet.
> >
> >
> > >
> > > I hope I am able to explain my concern. I am not saying that this patch
> > > is not good. All I am saying that fuse daemon (user space) can not rely
> > > on that it will never get two parallel direct writes which can be beyond
> > > the file size. If fuse kernel cache is stale, it can happen. Just trying
> > > to set the expectations right.
> >
> >
> > I don't see an issue yet. Regarding virtiofs, does it have a distributed
> > lock manager (DLM)? I guess not?
>
> Nope. virtiofs does not have any DLM.
>
> Vivek
> >
> >
> > Thanks,
> > Bernd
> >
>


* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-10  7:24             ` Dharmendra Hans
@ 2022-06-15 21:12               ` Vivek Goyal
  0 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2022-06-15 21:12 UTC (permalink / raw)
  To: Dharmendra Hans
  Cc: Bernd Schubert, Miklos Szeredi, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Fri, Jun 10, 2022 at 12:54:46PM +0530, Dharmendra Hans wrote:
> On Thu, Jun 9, 2022 at 7:23 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Jun 08, 2022 at 12:42:20AM +0200, Bernd Schubert wrote:
> > >
> > >
> > > On 6/8/22 00:01, Vivek Goyal wrote:
> > > > On Tue, Jun 07, 2022 at 11:42:16PM +0200, Bernd Schubert wrote:
> > > > >
> > > > >
> > > > > On 6/7/22 23:25, Vivek Goyal wrote:
> > > > > > On Sun, Jun 05, 2022 at 12:52:00PM +0530, Dharmendra Singh wrote:
> > > > > > > From: Dharmendra Singh <dsingh@ddn.com>
> > > > > > >
> > > > > > > In general, as of now, in FUSE, direct writes on the same file are
> > > > > > > serialized over inode lock i.e we hold inode lock for the full duration
> > > > > > > of the write request. I could not found in fuse code a comment which
> > > > > > > clearly explains why this exclusive lock is taken for direct writes.
> > > > > > >
> > > > > > > Following might be the reasons for acquiring exclusive lock but not
> > > > > > > limited to
> > > > > > > 1) Our guess is some USER space fuse implementations might be relying
> > > > > > >      on this lock for seralization.
> > > > > >
> > > > > > Hi Dharmendra,
> > > > > >
> > > > > > I will just try to be devil's advocate. So if this is server side
> > > > > > limitation, then it is possible that fuse client's isize data in
> > > > > > cache is stale. For example, filesystem is shared between two
> > > > > > clients.
> > > > > >
> > > > > > - File size is 4G as seen by client A.
> > > > > > - Client B truncates the file to 2G.
> > > > > > - Two processes in client A, try to do parallel direct writes and will
> > > > > >     be able to proceed and server will get two parallel writes both
> > > > > >     extending file size.
> > > > > >
> > > > > > I can see that this can happen with virtiofs with cache=auto policy.
> > > > > >
> > > > > > IOW, if this is a fuse server side limitation, then how do you ensure
> > > > > > that fuse kernel's i_size definition is not stale.
> > > > >
> > > > > Hi Vivek,
> > > > >
> > > > > I'm sorry, to be sure, can you explain where exactly a client is located for
> > > > > you? For us these are multiple daemons linked to libufse - which you seem to
> > > > > call 'server' Typically these clients are on different machines. And servers
> > > > > are for us on the other side of the network - like an NFS server.
> > > >
> > > > Hi Bernd,
> > > >
> > > > Agreed, terminology is little confusing. I am calling "fuse kernel" as
> > > > client and fuse daemon (user space) as server. This server in turn might
> > > > be the client to another network filesystem and real files might be
> > > > served by that server on network.
> > > >
> > > > So for simple virtiofs case, There can be two fuse daemons (virtiofsd
> > > > instances) sharing same directory (either on local filesystem or on
> > > > a network filesystem).
> > >
> > > So the combination of fuse-kernel + fuse-daemon == vfs mount.
> >
> > This is fine for regular fuse file systems. For virtiofs fuse-kernel is
> > running in a VM and fuse-daemon is running outside the VM on host.
> > >
> > > >
> > > > >
> > > > > So now while I'm not sure what you mean with 'client', I'm wondering about
> > > > > two generic questions
> > > > >
> > > > > a) I need to double check, but we were under the assumption the code in
> > > > > question is a direct-io code path. I assume cache=auto would use the page
> > > > > cache and should not be effected?
> > > >
> > > > By default cache=auto use page cache but if application initiates a
> > > > direct I/O, it should use direct I/O path.
> > >
> > > Ok, so we are on the same page regarding direct-io.
> > >
> > > >
> > > > >
> > > > > b) How would the current lock help for distributed clients? Or multiple fuse
> > > > > daemons (what you seem to call server) per local machine?
> > > >
> > > > I thought that the current lock is trying to protect the fuse kernel side
> > > > and assumed the fuse server (daemon linked to libfuse) can handle multiple
> > > > parallel writes. At least that's how I thought about it. I might be wrong.
> > > > I am not sure.
> > > >
> > > > >
> > > > > For a single vfs mount point served by fuse, truncate should take the
> > > > > exclusive lock and parallel writes the shared lock - I don't see a problem
> > > > > here either.
> > > >
> > > > Agreed that this does not seem like a problem from the fuse kernel side. I
> > > > was just questioning where parallel direct writes become a problem. And the
> > > > answer I heard was that it is probably the fuse server (daemon linked with
> > > > libfuse) which is expecting the locking. And if that's the case, this
> > > > patch is not foolproof. It is possible that the file got truncated from
> > > > a different client (from a different fuse daemon linked with libfuse).
> > > >
> > > > So say A is the first fuse daemon and B is another fuse daemon. Both are
> > > > clients of some network file system such as NFS.
> > > >
> > > > - Fuse kernel for A sees the file size as 4G.
> > > > - Fuse daemon B truncates the file to size 2G.
> > > > - Fuse kernel for A has a stale cache and can send two parallel writes,
> > > >    say at 3G and 3.5G offsets.
> > >
> > > I guess you mean inode cache, not data cache, as this is direct-io.
> >
> > Yes, the inode cache and the cached ->i_size might be an issue. These
> > patches use the cached ->i_size to determine whether parallel direct I/O
> > should be allowed or not.
> >
> >
> > > But now
> > > why would we need to worry about any cache here, if this is direct-io - the
> > > application writes without going through any cache and at the same time a
> > > truncate happens? The current kernel-side lock would not help here, but a
> > > distributed lock is needed to handle this correctly?
> > >
> > > int fd = open(path, O_WRONLY | O_DIRECT);
> > >
> > > clientA: pwrite(fd, buf, 100G, 0) -> takes a long time
> > > clientB: ftruncate(fd, 0)
> > >
> > > I guess on a local file system that will result in a zero size file. On
> > > different fuse mounts (without a DLM) or NFS, undefined behavior.
> > >
> > >
> > > > - Fuse daemon A might not like it (assuming this is a fuse daemon/user
> > > >    space side limitation).
> > >
> > > I think there are two cases for the fuse daemons:
> > >
> > > a) does not have a distributed lock - just needs to handle the writes, the
> > > local kernel lock does not protect against distributed races.
> >
> > Exactly. This is the point I am trying to raise. "Local kernel lock does
> > not protect against distributed races".
> >
> > So in this case the local kernel has ->i_size cached, this might be an
> > old value, and checking i_size does not guarantee that the fuse daemon
> > will not get parallel extending writes.
> >
> > > I guess most
> > > of these file systems can enable parallel writes, unless the kernel lock is
> > > used to handle userspace thread synchronization.
> >
> > Right. If user space is relying on the kernel lock for thread
> > synchronization, it cannot enable parallel writes.
> >
> > But if it is not relying on this, it should be able to enable parallel
> > writes. Just keep in mind that the ->i_size check is not sufficient to
> > guarantee that you will not get "two extending parallel writes". If
> > another client on a different machine truncated the file, it is
> > possible this client has an old cached ->i_size and it can
> > get multiple file-extending parallel writes.
> >
> > So if the fuse daemon enables parallel extending writes, it should be
> > prepared to deal with multiple extending parallel writes.
> >
> > And if this is a correct assumption, I am wondering why we even try
> > to do the ->i_size check and avoid parallel extending writes
> > in the fuse kernel. Maybe there is something I am not aware of. And
> > that's why I am just raising questions.
> 
> Let's consider a couple of cases:
> 1) Fuse daemon is the file server itself (local file system):
>    Here we need to make sure of a few things in the fuse kernel.
>      a) Appending writes are handled. This requires serialized access
>         to the inode in the fuse kernel, as we generate the offset from
>         i_size (i_size is updated after the write returns).
>      b) If we allow concurrent writes, then we can have the following cases:
>         - All writes come in under i_size, i.e. they are overwrites.
>           If any of the writes fails (though it is expected that all
>           following writes on that file would fail), usually on a single
>           daemon all following writes on the same file would be aborted.
>           Since fuse updates i_size only after the write returns
>           successfully, we have nothing to worry about in this case; no
>           action such as truncate is required from fuse, as we are not
>           using the page cache here.
> 
>         - All writes are extending writes.
>           These writes extend the current i_size. Let's assume, as of
>           now, i_size is 1 MB. Now wr1 extends i_size from 1 MB to 2 MB,
>           and wr2 extends i_size from 2 MB to 3 MB. Let's assume wr1
>           succeeds and wr2 fails; in this case wr1 would update i_size
>           to 2 MB and wr2 would not update i_size, so we are good,
>           nothing is required here.
>           In the reverse case, where wr1 fails and wr2 succeeds, wr2
>           must be updating i_size to 3 MB (wr1 would not update i_size).
>           Here we are required to create a hole in the file from offset
>           1 MB to 2 MB, otherwise garbage would be returned to the
>           reader, as it is a fresh write and no old data exists yet at
>           that offset.

Hi Dharmendra,

I think this idea of the fuse daemon having to ensure that holes in the
file don't return garbage is confusing.

Should the underlying filesystem not take care of this? For example, in
the above scenario assume only wr2 was issued (and not wr1). IOW, i_size
is 1MB. I do lseek(2MB) and write 1MB of data from 2MB to 3MB (wr2). This
succeeds and fuse will update i_size to 3MB. Now we should have a hole
between 1MB and 2MB. Will the underlying filesystem not take care of it
in normal cases and return 0 if we read from the hole?

man lseek says:

       lseek()  allows  the  file  offset to be set beyond the end of the file
       (but this does not change the size of the  file).   If  data  is  later
       written  at  this  point,  subsequent  reads  of the data in the gap (a
       "hole") return null bytes ('\0') until data is  actually  written  into
       the gap.
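
For illustration, a minimal user-space sketch of that behaviour (this is a
hypothetical test program, not from the patch; it assumes a local filesystem
with sparse file support, and the path and sizes are made up):

/* Write 4 KiB at offset 0 and 4 KiB at offset 2 MiB, leaving a gap in
 * between, then read from inside the gap: the "hole" reads back as '\0'. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <assert.h>

int main(void)
{
        char buf[4096];
        int fd = open("/tmp/hole-test", O_RDWR | O_CREAT | O_TRUNC, 0600);

        memset(buf, 'x', sizeof(buf));
        pwrite(fd, buf, sizeof(buf), 0);               /* data at [0, 4K) */
        pwrite(fd, buf, sizeof(buf), 2 * 1024 * 1024); /* data at 2 MiB, gap before it */

        pread(fd, buf, sizeof(buf), 1024 * 1024);      /* read from inside the gap at 1 MiB */
        assert(buf[0] == '\0');                        /* the hole returns null bytes */

        close(fd);
        return 0;
}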

Thanks
Vivek


> 
> 2) Fuse daemon forwards requests to the actual file server (i.e. the fuse
>    daemon is the client here):
>    Please note that this fuse daemon is forwarding data to actual servers
>    (and we can have a single server or multiple servers consuming data),
>    therefore it can send wr1 to srv1, wr2 to srv2, and so on.
>    Here we need to make sure of a few things again.
>    a) Appending writes: as pointed out in 1), every fuse daemon should
>       generate the correct offset (local to itself) at which data is
>       written. We need the exclusive lock for this.
>    b) Allowing concurrent writes:
>         - All writes come in under i_size, i.e. they are overwrites.
>           Here it can happen that some write went to srv1 and succeeded
>           and some went to srv2 and failed (due to a space issue on that
>           node or something else like network problems). In this case we
>           are not required to do anything, as usual.
>         - All writes are extending writes.
>           Let's assume, as in 1), that as of now i_size is 1 MB. Now wr1
>           extends i_size from 1 MB to 2 MB and goes to srv1, and wr2
>           extends i_size from 2 MB to 3 MB and goes to srv2. Let's assume
>           wr1 succeeds and wr2 fails; in this case wr1 would update
>           i_size to 2 MB and wr2 would not update i_size, so we are good,
>           nothing is required here.
>           In the reverse case, where wr1 fails and wr2 succeeds, wr2 must
>           be updating i_size to 3 MB (wr1 would not update i_size). Here
>           we are required to create a hole in the file from offset 1 MB
>           to 2 MB, otherwise garbage would be returned to the reader, as
>           it is a fresh write and no old data exists yet at that offset.
> 
> It can happen that holes are not supported by all file server types.
> In that case as well, I don't think we can allow extending writes.
> My understanding is that each fuse daemon is supposed to maintain
> consistency related to offset/i_size on its own end when we do not
> have a DLM.
> 
> > >
> > > b) has a distributed lock - needs a callback to fuse kernel to inform the
> > > kernel to invalidate all data.
> > >
> > > At DDN we have both of them, a) is in production, the successor b) is being
> > > worked on. We might come back with more patches for more callbacks for the
> > > DLM - I'm not sure yet.
> > >
> > >
> > > >
> > > > I hope I am able to explain my concern. I am not saying that this patch
> > > > is not good. All I am saying is that the fuse daemon (user space) cannot
> > > > rely on never getting two parallel direct writes which can be beyond
> > > > the file size. If the fuse kernel cache is stale, it can happen. Just
> > > > trying to set the expectations right.
> > >
> > >
> > > I don't see an issue yet. Regarding virtiofs, does it have a distributed
> > > lock manager (DLM)? I guess not?
> >
> > Nope. virtiofs does not have any DLM.
> >
> > Vivek
> > >
> > >
> > > Thanks,
> > > Bernd
> > >
> >
> 



* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-09 13:53           ` Vivek Goyal
  2022-06-10  7:24             ` Dharmendra Hans
@ 2022-06-16  9:01             ` Miklos Szeredi
  2022-06-16 13:17               ` Vivek Goyal
  1 sibling, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2022-06-16  9:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Bernd Schubert, Dharmendra Singh, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Thu, 9 Jun 2022 at 15:53, Vivek Goyal <vgoyal@redhat.com> wrote:

> Right. If user space is relying on the kernel lock for thread
> synchronization, it cannot enable parallel writes.
>
> But if it is not relying on this, it should be able to enable parallel
> writes. Just keep in mind that the ->i_size check is not sufficient to
> guarantee that you will not get "two extending parallel writes". If
> another client on a different machine truncated the file, it is
> possible this client has an old cached ->i_size and it can
> get multiple file-extending parallel writes.

There are two cases:

1. the filesystem can be changed only through a single fuse instance

2. the filesystem can be changed externally.

In case 1 the fuse client must ensure that data is updated
consistently (as defined by e.g. POSIX).  This is what I'm mostly
worried about.

Case 2 is much more difficult in the general case, and network
filesystems often have a relaxed consistency model.


> So if the fuse daemon enables parallel extending writes, it should be
> prepared to deal with multiple extending parallel writes.
>
> And if this is a correct assumption, I am wondering why we even try
> to do the ->i_size check and avoid parallel extending writes
> in the fuse kernel. Maybe there is something I am not aware of. And
> that's why I am just raising questions.

We can probably do that, but it needs careful review of where i_size
is changed and where i_size is used so we can never get into an
inconsistent state.
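
To make the "where i_size is used" part concrete, here is a rough sketch of
the kind of check being discussed (hypothetical helper name and placement,
not the actual patch code):

/* Sketch: a direct write may take the shared inode lock only if it is
 * neither appending nor extending the cached i_size; otherwise it keeps the
 * exclusive lock.  The cached i_size can be stale if another client changed
 * the file, which is exactly the case discussed above. */
static bool fuse_dio_write_needs_exclusive_lock(struct kiocb *iocb,
                                                struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);

        return (iocb->ki_flags & IOCB_APPEND) ||
               iocb->ki_pos + iov_iter_count(from) > i_size_read(inode);
}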

Thanks,
Miklos


* Re: [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-16  9:01             ` Miklos Szeredi
@ 2022-06-16 13:17               ` Vivek Goyal
  0 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2022-06-16 13:17 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Dharmendra Singh, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh

On Thu, Jun 16, 2022 at 11:01:59AM +0200, Miklos Szeredi wrote:
> On Thu, 9 Jun 2022 at 15:53, Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > Right. If user space is relying on the kernel lock for thread
> > synchronization, it cannot enable parallel writes.
> >
> > But if it is not relying on this, it should be able to enable parallel
> > writes. Just keep in mind that the ->i_size check is not sufficient to
> > guarantee that you will not get "two extending parallel writes". If
> > another client on a different machine truncated the file, it is
> > possible this client has an old cached ->i_size and it can
> > get multiple file-extending parallel writes.
> 
> There are two cases:
> 
> 1. the filesystem can be changed only through a single fuse instance
> 
> 2. the filesystem can be changed externally.
> 
> In case 1 the fuse client must ensure that data is updated
> consistently (as defined by e.g. POSIX).  This is what I'm mostly
> worried about.
> 
> Case 2 is much more difficult in the general case, and network
> filesystems often have a relaxed consistency model.
> 
> 
> > So if the fuse daemon enables parallel extending writes, it should be
> > prepared to deal with multiple extending parallel writes.
> >
> > And if this is a correct assumption, I am wondering why we even try
> > to do the ->i_size check and avoid parallel extending writes
> > in the fuse kernel. Maybe there is something I am not aware of. And
> > that's why I am just raising questions.
> 
> We can probably do that, but it needs careful review of where i_size
> is changed and where i_size is used so we can never get into an
> inconsistent state.

Ok. Agreed that non-extending parallel writes are the safer option, at
least for case 1) above. For case 2) we can get multiple parallel
extending writes with these patches if another client on another machine
truncates the file.

So I don't have any objections to these patches. I just wanted to
understand them better.

Thanks
Vivek



end of thread, other threads:[~2022-06-16 13:17 UTC | newest]

Thread overview: 11+ messages
2022-06-05  7:21 [PATCH v4 0/1] FUSE: Allow non-extending parallel direct writes Dharmendra Singh
2022-06-05  7:22 ` [PATCH v4 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
2022-06-07 21:25   ` Vivek Goyal
2022-06-07 21:42     ` Bernd Schubert
2022-06-07 22:01       ` Vivek Goyal
2022-06-07 22:42         ` Bernd Schubert
2022-06-09 13:53           ` Vivek Goyal
2022-06-10  7:24             ` Dharmendra Hans
2022-06-15 21:12               ` Vivek Goyal
2022-06-16  9:01             ` Miklos Szeredi
2022-06-16 13:17               ` Vivek Goyal
