* [PATCH v5 0/1] FUSE: Allow non-extending parallel direct writes
@ 2022-06-17  7:10 Dharmendra Singh
  2022-06-17  7:10 ` [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Dharmendra Singh @ 2022-06-17  7:10 UTC (permalink / raw)
  To: miklos, vgoyal
  Cc: Dharmendra Singh, linux-fsdevel, fuse-devel, linux-kernel, bschubert

Currently in FUSE, direct writes hold the inode lock for the full
duration of the request. As a result, only one direct write request
can proceed on the same file at a time. This is, I think, due to
various reasons such as serialization needed by user-space FUSE
implementations, file size races, and partial write failures.

This patch allows parallel writes to proceed on the same file by
taking a shared lock for non-extending writes and an exclusive lock
for extending writes.
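
In short, the write path picks the lock type as follows (a condensed
sketch of the logic in the patch below, not the literal hunk):

  /* Condensed lock selection in fuse_direct_write_iter() */
  bool exclusive = !(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
                   (iocb->ki_flags & IOCB_APPEND) ||
                   fuse_direct_write_extending_i_size(iocb, from);

  if (exclusive) {
          inode_lock(inode);
  } else {
          inode_lock_shared(inode);
          /* i_size may have been reduced (e.g. by truncate) since the
           * unlocked check above, so re-check under the shared lock. */
          if (fuse_direct_write_extending_i_size(iocb, from)) {
                  inode_unlock_shared(inode);
                  inode_lock(inode);
                  exclusive = true;
          }
  }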

To measure performance, I carried out tests of these changes with
example/passthrough.c (part of libfuse), setting the direct-io and
parallel_direct_writes flags on the file.
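
For reference, here is a minimal sketch of what that open handler looks
like (based on passthrough.c; the parallel_direct_writes field name
assumes the companion libfuse-side change and is illustrative only -
kernels without this patch simply ignore the extra bit):

  /* Sketch: enable direct IO and parallel direct writes on open.
   * Needs <fcntl.h>, <errno.h> and the libfuse headers. */
  static int xmp_open(const char *path, struct fuse_file_info *fi)
  {
          int fd = open(path, fi->flags);

          if (fd == -1)
                  return -errno;

          fi->fh = fd;
          fi->direct_io = 1;              /* FOPEN_DIRECT_IO */
          fi->parallel_direct_writes = 1; /* FOPEN_PARALLEL_DIRECT_WRITES,
                                             assumed libfuse field */
          return 0;
  }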
 
Note that we disabled writes to the underlying file system from
passthrough, as we wanted to measure the gain in FUSE only. fio was
used to test the impact of these changes for file-per-job and single
shared file (SSF) workloads. CPU binding was applied to the
passthrough process only.

Job file for SSF:
[global]
directory=/tmp/dest
filename=ssf
size=100g
blocksize=1m
ioengine=sync
group_reporting=1
fallocate=none
runtime=60
stonewall

[write]
rw=randwrite:256
rw_sequencer=sequential
fsync_on_close=1


Job file for file-per-job:
[sequential-write]
rw=write
size=100G
directory=/tmp/dest/
group_reporting
name=sequential-write-direct
bs=1M
runtime=60


Results:

unpatched=================

File  per job


Fri May  6 09:36:52 EDT 2022
numjobs: 1  WRITE: bw=3441MiB/s (3608MB/s), 3441MiB/s-3441MiB/s (3608MB/s-3608MB/s), io=100GiB (107GB), run=29762-29762msec
numjobs: 2  WRITE: bw=8174MiB/s (8571MB/s), 8174MiB/s-8174MiB/s (8571MB/s-8571MB/s), io=200GiB (215GB), run=25054-25054msec
numjobs: 4  WRITE: bw=14.9GiB/s (15.0GB/s), 14.9GiB/s-14.9GiB/s (15.0GB/s-15.0GB/s), io=400GiB (429GB), run=26900-26900msec
numjobs: 8  WRITE: bw=23.4GiB/s (25.2GB/s), 23.4GiB/s-23.4GiB/s (25.2GB/s-25.2GB/s), io=800GiB (859GB), run=34115-34115msec
numjobs: 16  WRITE: bw=24.5GiB/s (26.3GB/s), 24.5GiB/s-24.5GiB/s (26.3GB/s-26.3GB/s), io=1469GiB (1577GB), run=60001-60001msec
numjobs: 32  WRITE: bw=20.5GiB/s (21.0GB/s), 20.5GiB/s-20.5GiB/s (21.0GB/s-21.0GB/s), io=1229GiB (1320GB), run=60003-60003msec


SSF

Fri May  6 09:46:38 EDT 2022
numjobs: 1  WRITE: bw=3624MiB/s (3800MB/s), 3624MiB/s-3624MiB/s (3800MB/s-3800MB/s), io=100GiB (107GB), run=28258-28258msec
numjobs: 2  WRITE: bw=5801MiB/s (6083MB/s), 5801MiB/s-5801MiB/s (6083MB/s-6083MB/s), io=200GiB (215GB), run=35302-35302msec
numjobs: 4  WRITE: bw=4794MiB/s (5027MB/s), 4794MiB/s-4794MiB/s (5027MB/s-5027MB/s), io=281GiB (302GB), run=60001-60001msec
numjobs: 8  WRITE: bw=3946MiB/s (4137MB/s), 3946MiB/s-3946MiB/s (4137MB/s-4137MB/s), io=231GiB (248GB), run=60003-60003msec
numjobs: 16  WRITE: bw=4040MiB/s (4236MB/s), 4040MiB/s-4040MiB/s (4236MB/s-4236MB/s), io=237GiB (254GB), run=60006-60006msec
numjobs: 32  WRITE: bw=2822MiB/s (2959MB/s), 2822MiB/s-2822MiB/s (2959MB/s-2959MB/s), io=165GiB (178GB), run=60013-60013msec


Patched=====

File per job

Fri May  6 10:05:46 EDT 2022
numjobs: 1  WRITE: bw=3193MiB/s (3348MB/s), 3193MiB/s-3193MiB/s (3348MB/s-3348MB/s), io=100GiB (107GB), run=32068-32068msec
numjobs: 2  WRITE: bw=9084MiB/s (9525MB/s), 9084MiB/s-9084MiB/s (9525MB/s-9525MB/s), io=200GiB (215GB), run=22545-22545msec
numjobs: 4  WRITE: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=400GiB (429GB), run=26986-26986msec
numjobs: 8  WRITE: bw=24.5GiB/s (26.3GB/s), 24.5GiB/s-24.5GiB/s (26.3GB/s-26.3GB/s), io=800GiB (859GB), run=32624-32624msec
numjobs: 16  WRITE: bw=24.2GiB/s (25.0GB/s), 24.2GiB/s-24.2GiB/s (25.0GB/s-25.0GB/s), io=1451GiB (1558GB), run=60001-60001msec
numjobs: 32  WRITE: bw=19.3GiB/s (20.8GB/s), 19.3GiB/s-19.3GiB/s (20.8GB/s-20.8GB/s), io=1160GiB (1245GB), run=60002-60002msec


SSF

Fri May  6 09:58:33 EDT 2022
numjobs: 1  WRITE: bw=3137MiB/s (3289MB/s), 3137MiB/s-3137MiB/s (3289MB/s-3289MB/s), io=100GiB (107GB), run=32646-32646msec
numjobs: 2  WRITE: bw=7736MiB/s (8111MB/s), 7736MiB/s-7736MiB/s (8111MB/s-8111MB/s), io=200GiB (215GB), run=26475-26475msec
numjobs: 4  WRITE: bw=14.4GiB/s (15.4GB/s), 14.4GiB/s-14.4GiB/s (15.4GB/s-15.4GB/s), io=400GiB (429GB), run=27869-27869msec
numjobs: 8  WRITE: bw=22.6GiB/s (24.3GB/s), 22.6GiB/s-22.6GiB/s (24.3GB/s-24.3GB/s), io=800GiB (859GB), run=35340-35340msec
numjobs: 16  WRITE: bw=25.6GiB/s (27.5GB/s), 25.6GiB/s-25.6GiB/s (27.5GB/s-27.5GB/s), io=1535GiB (1648GB), run=60001-60001msec
numjobs: 32  WRITE: bw=20.2GiB/s (21.7GB/s), 20.2GiB/s-20.2GiB/s (21.7GB/s-21.7GB/s), io=1211GiB (1300GB), run=60003-60003msec



SSF gain in percentage:
For 1 fio thread: +0%
For 2 fio threads: +0% 
For 4 fio threads: +42%
For 8 fio threads: +246.8%
For 16 fio threads: +549%
For 32 fio threads: +630.33%

Dharmendra Singh (1):
  Allow non-extending parallel direct writes on the same file.

 fs/fuse/file.c            | 43 ++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/fuse.h |  2 ++
 2 files changed, 42 insertions(+), 3 deletions(-)

---

v5:
   - Removed retry label.
   - Renamed feature flag as suggested.
   - Updated commit message to reflect more precisely why an exclusive lock is needed.

v4: Handled the case when file size can get reduced after the check but
     before we acquire the shared lock.

v3: Addressed all comments.

---


* [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17  7:10 [PATCH v5 0/1] FUSE: Allow non-extending parallel direct writes Dharmendra Singh
@ 2022-06-17  7:10 ` Dharmendra Singh
  2022-06-17  7:36   ` Miklos Szeredi
  2022-06-18 19:07   ` Vivek Goyal
  0 siblings, 2 replies; 11+ messages in thread
From: Dharmendra Singh @ 2022-06-17  7:10 UTC (permalink / raw)
  To: miklos, vgoyal
  Cc: Dharmendra Singh, linux-fsdevel, fuse-devel, linux-kernel,
	bschubert, Dharmendra Singh

In general, as of now, FUSE serializes direct writes on the same file
over the inode lock, i.e. we hold the inode lock for the full duration
of the write request. I could not find a comment in the fuse code or
git history which clearly explains why this exclusive lock is taken
for direct writes.  The following might be the reasons for acquiring
an exclusive lock, but they are not limited to:
1) Our guess is some user-space fuse implementations might be relying
   on this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.

This patch relaxes the exclusive lock for direct non-extending writes
only. File size extending writes might not need the lock either,
but we are not entirely sure if there is a risk of introducing any
kind of regression. Furthermore, benchmarking with fio does not
show a difference between patch versions that take, for file size
extension, a) an exclusive lock and b) a shared lock.

A possible example of an issue with i_size extending writes is write
error cases. Some writes might succeed and others might fail for
file system internal reasons - for example ENOSPC. With parallel
file size extending writes it _might_ be difficult to revert the action
of the failing write, especially to restore the right i_size.

With these changes, we allow non-extending parallel direct writes
on the same file with the help of a flag called
FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (the flag
is passed from libfuse to the fuse kernel module as part of file
open/create), we no longer take the exclusive lock, but instead use a
shared lock that allows non-extending writes to run in parallel.
FUSE implementations which rely on this inode lock for serialization
can continue to do so, and serialized direct writes remain the
default.  Implementations that do not do write serialization need to
be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in
their file open/create reply.

During patch review there were concerns that network file systems (or
multiple VFS mounts of the same file system) might have issues with
parallel writes. We believe this is not the case, as this is just a
local lock, which network file systems could not rely on anyway;
i.e. this lock is only for local consistency.

Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/file.c            | 43 ++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/fuse.h |  2 ++
 2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 37eebfb90500..b3a5706f301d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1565,14 +1565,47 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return res;
 }
 
+static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
+					       struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
+}
+
 static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
+	struct file *file = iocb->ki_filp;
+	struct fuse_file *ff = file->private_data;
 	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
 	ssize_t res;
+	bool exclusive_lock =
+		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
+		iocb->ki_flags & IOCB_APPEND ||
+		fuse_direct_write_extending_i_size(iocb, from);
+
+	/*
+	 * Take exclusive lock if
+	 * - Parallel direct writes are disabled - a user space decision
+	 * - Parallel direct writes are enabled and i_size is being extended.
+	 *   This might not be needed at all, but needs further investigation.
+	 */
+	if (exclusive_lock)
+		inode_lock(inode);
+	else {
+		inode_lock_shared(inode);
+
+		/* A race with truncate might have come up as the decision for
+		 * the lock type was done without holding the lock, check again.
+		 */
+		if (fuse_direct_write_extending_i_size(iocb, from)) {
+			inode_unlock_shared(inode);
+			inode_lock(inode);
+			exclusive_lock = true;
+		}
+	}
 
-	/* Don't allow parallel writes to the same file */
-	inode_lock(inode);
 	res = generic_write_checks(iocb, from);
 	if (res > 0) {
 		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
@@ -1583,7 +1616,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			fuse_write_update_attr(inode, iocb->ki_pos, res);
 		}
 	}
-	inode_unlock(inode);
+	if (exclusive_lock)
+		inode_unlock(inode);
+	else
+		inode_unlock_shared(inode);
 
 	return res;
 }
@@ -2925,6 +2961,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	if (iov_iter_rw(iter) == WRITE) {
 		fuse_write_update_attr(inode, pos, ret);
+		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
 			fuse_do_truncate(file);
 	}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index a28dd60078ff..bbb1246cae19 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -301,6 +301,7 @@ struct fuse_file_lock {
  * FOPEN_CACHE_DIR: allow caching this directory
  * FOPEN_STREAM: the file is stream-like (no file position at all)
  * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_PARALLEL_DIRECT_WRITES: Allow concurrent direct writes on the same inode
  */
 #define FOPEN_DIRECT_IO		(1 << 0)
 #define FOPEN_KEEP_CACHE	(1 << 1)
@@ -308,6 +309,7 @@ struct fuse_file_lock {
 #define FOPEN_CACHE_DIR		(1 << 3)
 #define FOPEN_STREAM		(1 << 4)
 #define FOPEN_NOFLUSH		(1 << 5)
+#define FOPEN_PARALLEL_DIRECT_WRITES	(1 << 6)
 
 /**
  * INIT request/reply flags
-- 
2.17.1



* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17  7:10 ` [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
@ 2022-06-17  7:36   ` Miklos Szeredi
  2022-06-17  9:25     ` Bernd Schubert
  2022-06-18 19:07   ` Vivek Goyal
  1 sibling, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2022-06-17  7:36 UTC (permalink / raw)
  To: Dharmendra Singh
  Cc: Vivek Goyal, linux-fsdevel, fuse-devel, linux-kernel,
	Bernd Schubert, Dharmendra Singh

On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:

> This patch relaxes the exclusive lock for direct non-extending writes
> only. File size extending writes might not need the lock either,
> but we are not entirely sure if there is a risk to introduce any
> kind of regression. Furthermore, benchmarking with fio does not
> show a difference between patch versions that take on file size
> extension a) an exclusive lock and b) a shared lock.

I'm okay with this, but ISTR Bernd noted a real-life scenario where
this is not sufficient.  Maybe that should be mentioned in the patch
header?

Thanks,
Miklos


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17  7:36   ` Miklos Szeredi
@ 2022-06-17  9:25     ` Bernd Schubert
  2022-06-17 12:43       ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Bernd Schubert @ 2022-06-17  9:25 UTC (permalink / raw)
  To: Miklos Szeredi, Dharmendra Singh
  Cc: Vivek Goyal, linux-fsdevel, fuse-devel, linux-kernel,
	Bernd Schubert, Dharmendra Singh

Hi Miklos,

On 6/17/22 09:36, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> 
>> This patch relaxes the exclusive lock for direct non-extending writes
>> only. File size extending writes might not need the lock either,
>> but we are not entirely sure if there is a risk to introduce any
>> kind of regression. Furthermore, benchmarking with fio does not
>> show a difference between patch versions that take on file size
>> extension a) an exclusive lock and b) a shared lock.
> 
> I'm okay with this, but ISTR Bernd noted a real-life scenario where
> this is not sufficient.  Maybe that should be mentioned in the patch
> header?


the above comment is actually directly from me.

We didn't check if fio extends the file before the runs, but even if it 
does, my current thinking is that where we previously serialized n 
threads, we now have an alternation of
	- "parallel n-1 threads running" + 1 waiting thread
	- "blocked n-1 threads" + 1 running

I think we will come back to this anyway if we continue to see slow 
IO with MPIIO. Right now we want to get our patches merged first and 
then will create an updated module for RHEL8 (+derivatives) customers. 
Our benchmark machines are also running plain RHEL8 kernels - without 
back-porting the modules first we don't know yet what the actual 
impact on things like io500 will be.

Shall we still extend the commit message or are we good to go?



Thanks,
Bernd


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17  9:25     ` Bernd Schubert
@ 2022-06-17 12:43       ` Miklos Szeredi
  2022-06-17 13:07         ` Bernd Schubert
  2022-09-13  8:44         ` Bernd Schubert
  0 siblings, 2 replies; 11+ messages in thread
From: Miklos Szeredi @ 2022-06-17 12:43 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Bernd Schubert, Dharmendra Singh

On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
> Hi Miklos,
>
> On 6/17/22 09:36, Miklos Szeredi wrote:
> > On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> >
> >> This patch relaxes the exclusive lock for direct non-extending writes
> >> only. File size extending writes might not need the lock either,
> >> but we are not entirely sure if there is a risk to introduce any
> >> kind of regression. Furthermore, benchmarking with fio does not
> >> show a difference between patch versions that take on file size
> >> extension a) an exclusive lock and b) a shared lock.
> >
> > I'm okay with this, but ISTR Bernd noted a real-life scenario where
> > this is not sufficient.  Maybe that should be mentioned in the patch
> > header?
>
>
> the above comment is actually directly from me.
>
> We didn't check if fio extends the file before the runs, but even if it
> would, my current thinking is that before we serialized n-threads, now
> we have an alternation of
>         - "parallel n-1 threads running" + 1 waiting thread
>         - "blocked  n-1 threads" + 1 running
>
> I think if we will come back anyway, if we should continue to see slow
> IO with MPIIO. Right now we want to get our patches merged first and
> then will create an updated module for RHEL8 (+derivatives) customers.
> Our benchmark machines are also running plain RHEL8 kernels - without
> back porting the modules first we don' know yet what we will be the
> actual impact to things like io500.
>
> Shall we still extend the commit message or are we good to go?

Well, it would be nice to see the real workload on the backported
patch.   Not just because it would tell us if this makes sense in the
first place, but also to have additional testing.

Thanks,
Miklos


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17 12:43       ` Miklos Szeredi
@ 2022-06-17 13:07         ` Bernd Schubert
  2022-09-13  8:44         ` Bernd Schubert
  1 sibling, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2022-06-17 13:07 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh



On 6/17/22 14:43, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>> Hi Miklos,
>>
>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>
>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>> only. File size extending writes might not need the lock either,
>>>> but we are not entirely sure if there is a risk to introduce any
>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>> show a difference between patch versions that take on file size
>>>> extension a) an exclusive lock and b) a shared lock.
>>>
>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>> header?
>>
>>
>> the above comment is actually directly from me.
>>
>> We didn't check if fio extends the file before the runs, but even if it
>> would, my current thinking is that before we serialized n-threads, now
>> we have an alternation of
>>          - "parallel n-1 threads running" + 1 waiting thread
>>          - "blocked  n-1 threads" + 1 running
>>
>> I think if we will come back anyway, if we should continue to see slow
>> IO with MPIIO. Right now we want to get our patches merged first and
>> then will create an updated module for RHEL8 (+derivatives) customers.
>> Our benchmark machines are also running plain RHEL8 kernels - without
>> back porting the modules first we don' know yet what we will be the
>> actual impact to things like io500.
>>
>> Shall we still extend the commit message or are we good to go?
> 
> Well, it would be nice to see the real workload on the backported
> patch.   Not just because it would tell us if this makes sense in the
> first place, but also to have additional testing.

I really don't want to backport before it is merged upstream - 
backporting first has several issues (like it never gets merged and we 
need to maintain it forever, management believes the work is done and 
doesn't plan for more time, etc.).

What we can do is install a recent kernel on one of our systems and 
then run single-shared-file IOR over MPIIO against the patched and 
unpatched fuse module. I hope to find the time later today or on 
Monday at the latest.


Thanks,
Bernd






* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17  7:10 ` [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file Dharmendra Singh
  2022-06-17  7:36   ` Miklos Szeredi
@ 2022-06-18 19:07   ` Vivek Goyal
  1 sibling, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2022-06-18 19:07 UTC (permalink / raw)
  To: Dharmendra Singh
  Cc: miklos, linux-fsdevel, fuse-devel, linux-kernel, bschubert,
	Dharmendra Singh

On Fri, Jun 17, 2022 at 12:40:27PM +0530, Dharmendra Singh wrote:
> In general, as of now, in FUSE, direct writes on the same file are
> serialized over inode lock i.e we hold inode lock for the full duration
> of the write request. I could not find in fuse code and git history
> a comment which clearly explains why this exclusive lock is taken
> for direct writes.  Following might be the reasons for acquiring
> an exclusive lock but not be limited to
> 1) Our guess is some USER space fuse implementations might be relying
>    on this lock for seralization.
> 2) The lock protects against file read/write size races.
> 3) Ruling out any issues arising from partial write failures.
> 
> This patch relaxes the exclusive lock for direct non-extending writes
> only. File size extending writes might not need the lock either,
> but we are not entirely sure if there is a risk to introduce any
> kind of regression. Furthermore, benchmarking with fio does not
> show a difference between patch versions that take on file size
> extension a) an exclusive lock and b) a shared lock.
> 
> A possible example of an issue with i_size extending writes are write
> error cases. Some writes might succeed and others might fail for
> file system internal reasons - for example ENOSPACE. With parallel
> file size extending writes it _might_ be difficult to revert the action
> of the failing write, especially to restore the right i_size.
> 
> With these changes, we allow non-extending parallel direct writes
> on the same file with the help of a flag called
> FOPEN_PARALLEL_DIRECT_WRITES. If this flag is set on the file (flag is
> passed from libfuse to fuse kernel as part of file open/create),
> we do not take exclusive lock anymore, but instead use a shared lock
> that allows non-extending writes to run in parallel.
> FUSE implementations which rely on this inode lock for serialisation
> can continue to do so and serialized direct writes are still the
> default.  Implementations that do not do write serialization need to
> be updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in
> their file open/create reply.
> 
> On patch review there were concerns that network file systems (or
> vfs multiple mounts of the same file system) might have issues with
> parallel writes. We believe this is not the case, as this is just a
> local lock, which network file systems could not rely on anyway.
> I.e. this lock is just for local consistency.
> 
> Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/file.c            | 43 ++++++++++++++++++++++++++++++++++++---
>  include/uapi/linux/fuse.h |  2 ++
>  2 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 37eebfb90500..b3a5706f301d 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1565,14 +1565,47 @@ static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	return res;
>  }
>  
> +static bool fuse_direct_write_extending_i_size(struct kiocb *iocb,
> +					       struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	return iocb->ki_pos + iov_iter_count(iter) > i_size_read(inode);
> +}
> +
>  static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct file *file = iocb->ki_filp;
> +	struct fuse_file *ff = file->private_data;
>  	struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
>  	ssize_t res;
> +	bool exclusive_lock =
> +		!(ff->open_flags & FOPEN_PARALLEL_DIRECT_WRITES) ||
> +		iocb->ki_flags & IOCB_APPEND ||
> +		fuse_direct_write_extending_i_size(iocb, from);
> +
> +	/*
> +	 * Take exclusive lock if
> +	 * - Parallel direct writes are disabled - a user space decision
> +	 * - Parallel direct writes are enabled and i_size is being extended.
> +	 *   This might not be needed at all, but needs further investigation.
> +	 */
> +	if (exclusive_lock)
> +		inode_lock(inode);
> +	else {
> +		inode_lock_shared(inode);
> +
> +		/* A race with truncate might have come up as the decision for
> +		 * the lock type was done without holding the lock, check again.
> +		 */
> +		if (fuse_direct_write_extending_i_size(iocb, from)) {
> +			inode_unlock_shared(inode);
> +			inode_lock(inode);
> +			exclusive_lock = true;
> +		}
> +	}
>  
> -	/* Don't allow parallel writes to the same file */
> -	inode_lock(inode);
>  	res = generic_write_checks(iocb, from);
>  	if (res > 0) {
>  		if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
> @@ -1583,7 +1616,10 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			fuse_write_update_attr(inode, iocb->ki_pos, res);
>  		}
>  	}
> -	inode_unlock(inode);
> +	if (exclusive_lock)
> +		inode_unlock(inode);
> +	else
> +		inode_unlock_shared(inode);
>  
>  	return res;
>  }
> @@ -2925,6 +2961,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  	if (iov_iter_rw(iter) == WRITE) {
>  		fuse_write_update_attr(inode, pos, ret);
> +		/* For extending writes we already hold exclusive lock */
>  		if (ret < 0 && offset + count > i_size)
>  			fuse_do_truncate(file);

I was curious about this truncation when ret < 0. I am assuming that this
means that if some write failed, we do a truncation. Looking at the git
history I found the following commit.

commit efb9fa9e911b23c7ea5330215bda778a7c69dba8
Author: Maxim Patlasov <mpatlasov@parallels.com>
Date:   Tue Dec 18 14:05:08 2012 +0400

    fuse: truncate file if async dio failed

    The patch improves error handling in fuse_direct_IO(): if we successfully
    submitted several fuse requests on behalf of synchronous direct write
    extending file and some of them failed, let's try to do our best to clean-up.

    Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion.


What's interesting here is that it looks like we already have code to
submit multiple FUSE file extending AIO + DIO requests and then we
wait for completion. So a single write can be broken into multiple
smaller fuse write requests, IIUC.

I see the following:

fuse_direct_IO()
{
        /*
         * We cannot asynchronously extend the size of a file.
         * In such case the aio will behave exactly like sync io.
         */
        if ((offset + count > i_size) && io->write)
                io->blocking = true;
}

This should force the IO to be blocking/synchronous. And it looks like
if io->async is set, fuse_direct_io() can still submit multiple
file extending requests and we will wait for completion.

wait_for_completion(&wait);

And truncate the file if some I/O failed. This probably means undoing
all the writes we did, even if some of them succeeded.

                if (ret < 0 && offset + count > i_size)
                        fuse_do_truncate(file);
        }


Anyway, the point I am trying to make is that for a single AIO + DIO file
extending write, it looks like the existing code might split it into
multiple AIO + DIO file extending writes and wait for their completion.
And if any one of the split requests fails, we need to truncate the file
and undo all the WRITEs. For that truncation we need an exclusive lock.
And that leads to the conclusion that we can't hold a shared lock
while extending the file; otherwise the current code will do the
truncation with a shared lock held and things will be broken somewhere.

Thanks
Vivek



* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-06-17 12:43       ` Miklos Szeredi
  2022-06-17 13:07         ` Bernd Schubert
@ 2022-09-13  8:44         ` Bernd Schubert
  2022-10-20 14:51           ` Bernd Schubert
  2022-10-21  6:57           ` Miklos Szeredi
  1 sibling, 2 replies; 11+ messages in thread
From: Bernd Schubert @ 2022-09-13  8:44 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh, Horst Birthelmer



On 6/17/22 14:43, Miklos Szeredi wrote:
> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>
>> Hi Miklos,
>>
>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>
>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>> only. File size extending writes might not need the lock either,
>>>> but we are not entirely sure if there is a risk to introduce any
>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>> show a difference between patch versions that take on file size
>>>> extension a) an exclusive lock and b) a shared lock.
>>>
>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>> header?
>>
>>
>> the above comment is actually directly from me.
>>
>> We didn't check if fio extends the file before the runs, but even if it
>> would, my current thinking is that before we serialized n-threads, now
>> we have an alternation of
>>          - "parallel n-1 threads running" + 1 waiting thread
>>          - "blocked  n-1 threads" + 1 running
>>
>> I think if we will come back anyway, if we should continue to see slow
>> IO with MPIIO. Right now we want to get our patches merged first and
>> then will create an updated module for RHEL8 (+derivatives) customers.
>> Our benchmark machines are also running plain RHEL8 kernels - without
>> back porting the modules first we don' know yet what we will be the
>> actual impact to things like io500.
>>
>> Shall we still extend the commit message or are we good to go?
> 
> Well, it would be nice to see the real workload on the backported
> patch.   Not just because it would tell us if this makes sense in the
> first place, but also to have additional testing.


Sorry for the delay, Dharmendra and I got busy with other tasks and 
Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.

Results with https://github.com/dchirikov/mpiio.git

              unpatched     patched      patched
             (extending)  (extending)  (non-extending)
-------------------------------------------------------
                MB/s          MB/s          MB/s
2 threads      2275.00       2497.00       5688.00
4 threads      2438.00       2560.00      10240.00
8 threads      2925.00       3792.00      25600.00
16 threads     3792.00      10240.00      20480.00


(Patched non-extending required a manual operation on the file to extend 
the size; mpiio does not support that natively, as far as I know.)



Results with IOR (HPC quasi standard benchmark)

ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m


               unpatched      patched
              (extending)   (extending)
-------------------------------------------
                  MB/s          MB/s
2 threads       2086.10       2027.76
4 threads       1858.94       2132.73
8 threads       1792.68       4609.05
16 threads      1786.48       8627.96


(IOR does not allow manual file extension, without changing its code.)

We can see that patched non-extending gives the best results, as 
Dharmendra has already posted before, but results are still
much better with the patches in extending mode. My assumption here is
that instead of serializing N writers, there is an alternating
run of
	- 1 thread extending, N-1 waiting
	- N-1 writing, 1 thread waiting
in the patched version.



Thanks,
Bernd


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-09-13  8:44         ` Bernd Schubert
@ 2022-10-20 14:51           ` Bernd Schubert
  2022-10-21  6:57           ` Miklos Szeredi
  1 sibling, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2022-10-20 14:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh, Horst Birthelmer

Miklos,

is there anything that speaks against getting this patch into 
linux-next? Shall I resend against v6.0? I just tested it - there is no 
merge conflict. The updated patch with benchmark results in the commit 
message is here
https://github.com/aakefbs/linux/tree/parallel-dio-write


Thanks,
Bernd


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-09-13  8:44         ` Bernd Schubert
  2022-10-20 14:51           ` Bernd Schubert
@ 2022-10-21  6:57           ` Miklos Szeredi
  2022-10-21  9:16             ` Bernd Schubert
  1 sibling, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2022-10-21  6:57 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh, Horst Birthelmer

On Tue, 13 Sept 2022 at 10:44, Bernd Schubert <bschubert@ddn.com> wrote:
>
>
>
> On 6/17/22 14:43, Miklos Szeredi wrote:
> > On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> >>
> >> Hi Miklos,
> >>
> >> On 6/17/22 09:36, Miklos Szeredi wrote:
> >>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
> >>>
> >>>> This patch relaxes the exclusive lock for direct non-extending writes
> >>>> only. File size extending writes might not need the lock either,
> >>>> but we are not entirely sure if there is a risk to introduce any
> >>>> kind of regression. Furthermore, benchmarking with fio does not
> >>>> show a difference between patch versions that take on file size
> >>>> extension a) an exclusive lock and b) a shared lock.
> >>>
> >>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
> >>> this is not sufficient.  Maybe that should be mentioned in the patch
> >>> header?
> >>
> >>
> >> the above comment is actually directly from me.
> >>
> >> We didn't check if fio extends the file before the runs, but even if it
> >> would, my current thinking is that before we serialized n-threads, now
> >> we have an alternation of
> >>          - "parallel n-1 threads running" + 1 waiting thread
> >>          - "blocked  n-1 threads" + 1 running
> >>
> >> I think if we will come back anyway, if we should continue to see slow
> >> IO with MPIIO. Right now we want to get our patches merged first and
> >> then will create an updated module for RHEL8 (+derivatives) customers.
> >> Our benchmark machines are also running plain RHEL8 kernels - without
> >> back porting the modules first we don' know yet what we will be the
> >> actual impact to things like io500.
> >>
> >> Shall we still extend the commit message or are we good to go?
> >
> > Well, it would be nice to see the real workload on the backported
> > patch.   Not just because it would tell us if this makes sense in the
> > first place, but also to have additional testing.
>
>
> Sorry for the delay, Dharmendra and me got busy with other tasks and
> Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.
>
> Results with https://github.com/dchirikov/mpiio.git
>
>                 unpatched    patched      patched
>                 (extending) (extending)  (non-extending)
> ----------------------------------------------------------
>                  MB/s        MB/s            MB/s
> 2 threads     2275.00      2497.00       5688.00
> 4 threads     2438.00      2560.00      10240.00
> 8 threads     2925.00      3792.00      25600.00
> 16 threads    3792.00     10240.00      20480.00
>
>
> (Patched-nonextending is a manual operation on the file to extend the
> size, mpiio does not support that natively, as far as I know.)
>
>
>
> Results with IOR (HPC quasi standard benchmark)
>
> ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m
>
>
>                 unpatched       patched
>                 (extending)     (extending)
> -------------------------------------------
>                    MB/s           MB/s
> 2 threads       2086.10         2027.76
> 4 threads       1858.94         2132.73
> 8 threads       1792.68         4609.05
> 16 threads      1786.48         8627.96
>
>
> (IOR does not allow manual file extension, without changing its code.)
>
> We can see that patched non-extending gives the best results, as
> Dharmendra has already posted before, but results are still
> much better with the patches in extending mode. My assumption is here
> instead serializing N-writers, there is an alternative
> run of
>         - 1 thread extending, N-1 waiting
>         - N-1 writing, 1 thread waiting
> in the patched version.
>

Okay, thanks for the heads up.

I queued the patch up for v6.2

Thanks,
Miklos


>
>
> Thanks,
> Bernd


* Re: [PATCH v5 1/1] Allow non-extending parallel direct writes on the same file.
  2022-10-21  6:57           ` Miklos Szeredi
@ 2022-10-21  9:16             ` Bernd Schubert
  0 siblings, 0 replies; 11+ messages in thread
From: Bernd Schubert @ 2022-10-21  9:16 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Dharmendra Singh, Vivek Goyal, linux-fsdevel, fuse-devel,
	linux-kernel, Dharmendra Singh, Horst Birthelmer



On 10/21/22 08:57, Miklos Szeredi wrote:
> On Tue, 13 Sept 2022 at 10:44, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>>
>>
>> On 6/17/22 14:43, Miklos Szeredi wrote:
>>> On Fri, 17 Jun 2022 at 11:25, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>>
>>>> Hi Miklos,
>>>>
>>>> On 6/17/22 09:36, Miklos Szeredi wrote:
>>>>> On Fri, 17 Jun 2022 at 09:10, Dharmendra Singh <dharamhans87@gmail.com> wrote:
>>>>>
>>>>>> This patch relaxes the exclusive lock for direct non-extending writes
>>>>>> only. File size extending writes might not need the lock either,
>>>>>> but we are not entirely sure if there is a risk to introduce any
>>>>>> kind of regression. Furthermore, benchmarking with fio does not
>>>>>> show a difference between patch versions that take on file size
>>>>>> extension a) an exclusive lock and b) a shared lock.
>>>>>
>>>>> I'm okay with this, but ISTR Bernd noted a real-life scenario where
>>>>> this is not sufficient.  Maybe that should be mentioned in the patch
>>>>> header?
>>>>
>>>>
>>>> the above comment is actually directly from me.
>>>>
>>>> We didn't check if fio extends the file before the runs, but even if it
>>>> would, my current thinking is that before we serialized n-threads, now
>>>> we have an alternation of
>>>>           - "parallel n-1 threads running" + 1 waiting thread
>>>>           - "blocked  n-1 threads" + 1 running
>>>>
>>>> I think if we will come back anyway, if we should continue to see slow
>>>> IO with MPIIO. Right now we want to get our patches merged first and
>>>> then will create an updated module for RHEL8 (+derivatives) customers.
>>>> Our benchmark machines are also running plain RHEL8 kernels - without
>>>> back porting the modules first we don' know yet what we will be the
>>>> actual impact to things like io500.
>>>>
>>>> Shall we still extend the commit message or are we good to go?
>>>
>>> Well, it would be nice to see the real workload on the backported
>>> patch.   Not just because it would tell us if this makes sense in the
>>> first place, but also to have additional testing.
>>
>>
>> Sorry for the delay, Dharmendra and me got busy with other tasks and
>> Horst (in CC) took over the patches and did the MPIIO benchmarks on 5.19.
>>
>> Results with https://github.com/dchirikov/mpiio.git
>>
>>                  unpatched    patched      patched
>>                  (extending) (extending)  (non-extending)
>> ----------------------------------------------------------
>>                   MB/s        MB/s            MB/s
>> 2 threads     2275.00      2497.00       5688.00
>> 4 threads     2438.00      2560.00      10240.00
>> 8 threads     2925.00      3792.00      25600.00
>> 16 threads    3792.00     10240.00      20480.00
>>
>>
>> (Patched-nonextending is a manual operation on the file to extend the
>> size, mpiio does not support that natively, as far as I know.)
>>
>>
>>
>> Results with IOR (HPC quasi standard benchmark)
>>
>> ior -w -E -k -o /tmp/test/home/hbi/test/test.1 -a mpiio -s 1280 -b 8m -t 8m
>>
>>
>>                  unpatched       patched
>>                  (extending)     (extending)
>> -------------------------------------------
>>                     MB/s           MB/s
>> 2 threads       2086.10         2027.76
>> 4 threads       1858.94         2132.73
>> 8 threads       1792.68         4609.05
>> 16 threads      1786.48         8627.96
>>
>>
>> (IOR does not allow manual file extension, without changing its code.)
>>
>> We can see that patched non-extending gives the best results, as
>> Dharmendra has already posted before, but results are still
>> much better with the patches in extending mode. My assumption is here
>> instead serializing N-writers, there is an alternative
>> run of
>>          - 1 thread extending, N-1 waiting
>>          - N-1 writing, 1 thread waiting
>> in the patched version.
>>
> 
> Okay, thanks for the heads up.
> 
> I queued the patch up for v6.2
> 

Thank you!

