linux-ext4.vger.kernel.org archive mirror
* [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload
@ 2019-12-12  5:55 Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 1/3] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Ritesh Harjani @ 2019-12-12  5:55 UTC (permalink / raw)
  To: jack, tytso, linux-ext4
  Cc: linux-fsdevel, mbobrowski, joseph.qi, Ritesh Harjani

Hello, 

Please find the v5 version with some minor review comments (from Jan) addressed.
Also added the Tested-by & Reviewed-by tags.


Changes from RFCv3 => PATCHv4
1. Dropped patch-2, which introduced the ext4_ilock/iunlock API, based on the
discussion. The lock/unlock decision is now open coded in ext4_dio_write_iter()
& ext4_dio_write_checks().

2. Addressed review comments from Jan on the last 2 patches.

Please note that apart from all the conditions mentioned in patch-3, there is
still a pre-existing race between ext4_page_mkwrite & a parallel DIO read.
This is discussed in detail at [7].
Since that race exists even before this patch series and is not caused by it,
I plan to address it after these patches are merged. This is to ensure proper
testing/review and to not club too many things together in one go.

These patches were tested again with "xfstests -g auto". There were no new
failures except the known one.


Background (copied from previous with some edits)
=================================================

These are the ilock patches, which help improve the current inode lock
scalability problem in the ext4 DIO mixed read/write workload case. The problem
was first reported by Joseph [1]. This should help improve mixed read/write
workload cases for databases which use direct I/O.

These patches are based upon upstream discussion with Jan Kara & Joseph [2].

The problem really is that in case of DIO overwrites, we start with an
exclusive lock and only later downgrade it to a shared lock. This causes a
scalability problem in the mixed DIO read/write workload case:
i.e. if there are any ongoing DIO reads and a DIO write comes in, then
(since the write starts with an exclusive inode lock) it has to wait until
the shared lock is released (which only happens when the DIO reads are
completed). The same is true vice versa as well.
This can be easily observed with perf-tools trace analysis [3].
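
For illustration, below is a simplified pseudo-C sketch of the two locking
flows. This is only a summary of the idea, not the actual kernel code (the
real changes are in the patches that follow), and the overwrite/unaligned_io/
extending_write booleans here are purely illustrative:

	/* Before: a DIO write always starts exclusive, so DIO readers pile up behind it */
	inode_lock(inode);				/* exclusive i_rwsem */
	if (overwrite && ext4_should_dioread_nolock(inode))
		downgrade_write(&inode->i_rwsem);	/* shared, but readers already had to wait */
	/* ... submit DIO ... */

	/* After (this series): start shared, upgrade only when really needed */
	inode_lock_shared(inode);
	if (unaligned_io || extending_write || !overwrite /* etc. */) {
		inode_unlock_shared(inode);
		inode_lock(inode);			/* exclusive i_rwsem */
		/* re-do the checks under the exclusive lock */
	}
	/* ... submit DIO ... */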

This patch series (patch-3) helps fix that situation even without the
dioread_nolock mount option. This is in line with the discussions with Jan [4].
More details about this are mentioned in the commit messages of patches 2 & 3.

These patches are based on top of Ted's ext4 master tree.

Patch description
=================
Patch-1: Fixes ext4_dax_read/write inode locking sequence for IOCB_NOWAIT

Patch-2: Starts with a shared inode lock in case of DIO instead of an exclusive
inode lock. This helps fix the reported scalability problem, but only with the
dioread_nolock mount option.

Patch-3: Removes the dioread_nolock mount option condition for shared locking.
The commit message of this patch describes in detail why we don't need an
exclusive lock even without dioread_nolock (a condensed sketch of the resulting
locking decision is shown below).
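
For reference, a condensed sketch of the resulting shared-vs-exclusive
decision once patches 2 & 3 are applied (the real code is in the diffs
below; this only restates the conditions in one place):

	/* take the exclusive i_rwsem only if any of these hold */
	bool need_excl = ext4_unaligned_io(inode, from, offset) ||  /* partial block zeroing */
			 ext4_extending_io(inode, offset, count) ||  /* i_disksize/orphan update */
			 !IS_NOSEC(inode) ||                         /* file_modified() may change security info */
			 !ext4_overwrite_io(inode, offset, count);   /* may need block allocation */

If none of these conditions hold, the DIO write proceeds under the shared
i_rwsem; otherwise the shared lock is dropped and the checks are redone
under the exclusive lock.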

Git tree
========
https://github.com/riteshharjani/linux/tree/ext4-ilock-PATCHv4

Testing
=======
Completed xfstests -g auto with default mkfs & mount opts.
No new failures except the one that is already known without these patches.


Performance results
===================
Collected some performance numbers for a DIO sync mixed random read/write
workload w.r.t. the number of threads (ext4) to check for scalability.
The performance histograms shown below give the percentage change in
performance with this ilock patchset as compared to the vanilla kernel.
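(i.e. benefit(%) = (score with patchset - score with vanilla) * 100 /
score with vanilla, for each thread count)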


FIO command:
fio -name=DIO-mixed-randrw -filename=./testfile -direct=1 -iodepth=1 -thread \
-rw=randrw -ioengine=psync -bs=$bs -size=10G -numjobs=$thread \
-group_reporting=1 -runtime=120

Used the fioperf tool [5] for collecting these performance scores.

Below are the performance benefit histograms (in %) with this ilock patchset
w.r.t. the vanilla kernel for the mixed randrw workload (4K block size).
Notice that the percentage benefit increases with an increasing number of
threads, so this patchset helps achieve good scalability in the mentioned
workload. It also gives up to ~140% perf improvement in the 24-thread mixed
randrw workload with 4K block size.
The performance difference can be even higher with high-speed storage
devices, since bandwidth without the patch seems to flatten due to the lock
contention problem in case of multiple threads.
[The absolute perf delta can be seen at [6].]


		Performance benefit (%) data randrw (read)-4K
		    (default mount options)
  160 +-+------+-------+--------+--------+-------+--------+-------+------+-+
      |        +       +        +        +       +        +       +        |
  140 +-+ 							   **    +-+
      |                                                    **      **      |
  120 +-+                                         **       **      **    +-+
      |                                           **       **      **      |
  100 +-+                                **       **       **      **    +-+
      |                                  **       **       **      **      |
   80 +-+                                **       **       **      **    +-+
      |                                  **       **       **      **      |
      |                          **      **       **       **      **      |
   60 +-+                        **      **       **       **      **    +-+
      |                          **      **       **       **      **      |
   40 +-+                        **      **       **       **      **    +-+
      |                 **       **      **       **       **      **      |
   20 +-+               **       **      **       **       **      **    +-+
      |                 **       **      **       **       **      **      |
    0 +-+       **      **       **      **       **       **      **    +-+
      |        +       +        +        +       +        +       +        |
  -20 +-+------+-------+--------+--------+-------+--------+-------+------+-+
               1       2        4        8      12       16      24
	       		Threads


		Performance benefit (%) data randrw (write)-4K
		     (default mount options)
  160 +-+------+-------+--------+--------+-------+--------+-------+------+-+
      |        +       +        +        +       +        +       +        |
  140 +-+ 							   **    +-+
      |                                                    **      **      |
  120 +-+                                         **       **      **    +-+
      |                                           **       **      **      |
  100 +-+                                **       **       **      **    +-+
      |                                  **       **       **      **      |
   80 +-+                                **       **       **      **    +-+
      |                                  **       **       **      **      |
      |                          **      **       **       **      **      |
   60 +-+                        **      **       **       **      **    +-+
      |                          **      **       **       **      **      |
   40 +-+                        **      **       **       **      **    +-+
      |                 **       **      **       **       **      **      |
   20 +-+               **       **      **       **       **      **    +-+
      |                 **       **      **       **       **      **      |
    0 +-+       **      **       **      **       **       **      **    +-+
      |        +       +        +        +       +        +       +        |
  -20 +-+------+-------+--------+--------+-------+--------+-------+------+-+
               1       2        4        8      12       16      24
			Threads

Previous version
================
v4: https://marc.info/?l=linux-ext4&m=157552839624723&w=2
v3: https://www.spinics.net/lists/linux-ext4/msg68649.html
v2: https://www.spinics.net/lists/kernel/msg3262531.html
v1: https://patchwork.ozlabs.org/cover/1163286/

References
==========
[1]: https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/
[2]: https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/
[3]: https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/perf.report
[4]: https://patchwork.ozlabs.org/cover/1163286/
[5]: https://github.com/riteshharjani/fioperf
[6]: https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/diff_ilock_v3_default_dio_randrw_4K.txt
[7]: https://www.spinics.net/lists/linux-ext4/msg68659.html

-ritesh


Ritesh Harjani (3):
  ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT
  ext4: Start with shared i_rwsem in case of DIO instead of exclusive
  ext4: Move to shared i_rwsem even without dioread_nolock mount opt

 fs/ext4/file.c | 198 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 145 insertions(+), 53 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCHv5 1/3] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT
  2019-12-12  5:55 [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Ritesh Harjani
@ 2019-12-12  5:55 ` Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 2/3] ext4: Start with shared i_rwsem in case of DIO instead of exclusive Ritesh Harjani
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Ritesh Harjani @ 2019-12-12  5:55 UTC (permalink / raw)
  To: jack, tytso, linux-ext4
  Cc: linux-fsdevel, mbobrowski, joseph.qi, Ritesh Harjani

Apparently our current rwsem code doesn't like doing the trylock, then
lock for real scheme. So change our dax read/write methods to just do the
trylock for the RWF_NOWAIT case.
This seems to fix the AIM7 regression in some scalable filesystems by up to
~25% in some cases, as claimed in commit 942491c9e6d6 ("xfs: fix AIM7
regression").

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6a7293a5cda2..977ac58dc718 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -88,9 +88,10 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
-	if (!inode_trylock_shared(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock_shared(inode))
 			return -EAGAIN;
+	} else {
 		inode_lock_shared(inode);
 	}
 	/*
@@ -487,9 +488,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	bool extend = false;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!inode_trylock(inode))
 			return -EAGAIN;
+	} else {
 		inode_lock(inode);
 	}
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCHv5 2/3] ext4: Start with shared i_rwsem in case of DIO instead of exclusive
  2019-12-12  5:55 [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 1/3] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
@ 2019-12-12  5:55 ` Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 3/3] ext4: Move to shared i_rwsem even without dioread_nolock mount opt Ritesh Harjani
  2019-12-26 15:06 ` [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Theodore Y. Ts'o
  3 siblings, 0 replies; 5+ messages in thread
From: Ritesh Harjani @ 2019-12-12  5:55 UTC (permalink / raw)
  To: jack, tytso, linux-ext4
  Cc: linux-fsdevel, mbobrowski, joseph.qi, Ritesh Harjani

Earlier there was no shared lock in the DIO read path. But commit
16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding a shared lock in the inode DIO read path.

But this created a problem with mixed read/write workloads. It is due to the
fact that in the DIO write path, we first start with an exclusive lock and only
when we determine that it is an overwrite IO do we downgrade the lock. This
causes the problem, since we still have shared locking in DIO reads.

So, this patch tries to fix this issue by starting with a shared lock and then
switching to an exclusive lock only when required, based on
ext4_dio_write_checks().

Other than that, it also simplifies the below cases:

1. Simplified the ext4_unaligned_aio API to ext4_unaligned_io. The previous API
was abused in the sense that it was not really checking for AIO anywhere, and it
also used to check for extending writes. So this API was renamed and simplified
to ext4_unaligned_io(), which actually only checks if the IO is really
unaligned.

Now, in case of unaligned direct IO, iomap_dio_rw needs to zero out the partial
block, and that requires serialization against other direct IOs in the same
block. So we take an exclusive inode lock for any unaligned DIO. In case of AIO
we also need to wait for any outstanding IOs to complete, so that the conversion
from unwritten to written is completed before anyone tries to map the
overlapping block. Hence we take the exclusive inode lock and also call
inode_dio_wait() for the unaligned DIO case. Please note that since we anyway
take an exclusive lock for unaligned IO, inode_dio_wait() becomes a no-op in the
non-AIO DIO case.

2. Added ext4_extending_io(). This checks if the IO is extending the file.

3. Added ext4_dio_write_checks(). In this we start with a shared inode lock and
only switch to an exclusive lock if required. So in the common case of aligned,
non-extending overwrites with dioread_nolock, it writes with a shared lock.
If not, then we restart the operation in ext4_dio_write_checks() after acquiring
the exclusive lock.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c | 191 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 142 insertions(+), 49 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 977ac58dc718..1da49dffa3df 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,19 +166,25 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
  * threads are at work on the same unwritten block, they must be synchronized
  * or one thread will zero the other's data, causing corruption.
  */
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
 {
 	struct super_block *sb = inode->i_sb;
-	int blockmask = sb->s_blocksize - 1;
-
-	if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
-		return 0;
+	unsigned long blockmask = sb->s_blocksize - 1;
 
 	if ((pos | iov_iter_alignment(from)) & blockmask)
-		return 1;
+		return true;
 
-	return 0;
+	return false;
+}
+
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+	if (offset + len > i_size_read(inode) ||
+	    offset + len > EXT4_I(inode)->i_disksize)
+		return true;
+	return false;
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
@@ -204,7 +210,8 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
 }
 
-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+					 struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
@@ -228,11 +235,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
 	}
 
+	return iov_iter_count(from);
+}
+
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret, count;
+
+	count = ext4_generic_write_checks(iocb, from);
+	if (count <= 0)
+		return count;
+
 	ret = file_modified(iocb->ki_filp);
 	if (ret)
 		return ret;
-
-	return iov_iter_count(from);
+	return count;
 }
 
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
@@ -364,62 +381,139 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
 	.end_io = ext4_dio_write_end_io,
 };
 
+/*
+ * The intention here is to start with shared lock acquired then see if any
+ * condition requires an exclusive inode lock. If yes, then we restart the
+ * whole operation by releasing the shared lock and acquiring exclusive lock.
+ *
+ * - For unaligned_io we never take shared lock as it may cause data corruption
+ *   when two unaligned IO tries to modify the same block e.g. while zeroing.
+ *
+ * - For extending writes case we don't take the shared lock, since it requires
+ *   updating inode i_disksize and/or orphan handling with exclusive lock.
+ *
+ * - shared locking will only be true mostly with overwrites in dioread_nolock
+ *   mode. Otherwise we will switch to exclusive i_rwsem lock.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+				     bool *ilock_shared, bool *extend)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	loff_t offset;
+	size_t count;
+	ssize_t ret;
+
+restart:
+	ret = ext4_generic_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	offset = iocb->ki_pos;
+	count = ret;
+	if (ext4_extending_io(inode, offset, count))
+		*extend = true;
+	/*
+	 * Determine whether the IO operation will overwrite allocated
+	 * and initialized blocks. If so, check to see whether it is
+	 * possible to take the dioread_nolock path.
+	 *
+	 * We need exclusive i_rwsem for changing security info
+	 * in file_modified().
+	 */
+	if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
+	     !ext4_should_dioread_nolock(inode) ||
+	     !ext4_overwrite_io(inode, offset, count))) {
+		inode_unlock_shared(inode);
+		*ilock_shared = false;
+		inode_lock(inode);
+		goto restart;
+	}
+
+	ret = file_modified(file);
+	if (ret < 0)
+		goto out;
+
+	return count;
+out:
+	if (*ilock_shared)
+		inode_unlock_shared(inode);
+	else
+		inode_unlock(inode);
+	return ret;
+}
+
 static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	ssize_t ret;
-	size_t count;
-	loff_t offset;
 	handle_t *handle;
 	struct inode *inode = file_inode(iocb->ki_filp);
-	bool extend = false, overwrite = false, unaligned_aio = false;
+	loff_t offset = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
+	bool extend = false, unaligned_io = false;
+	bool ilock_shared = true;
+
+	/*
+	 * We initially start with shared inode lock unless it is
+	 * unaligned IO which needs exclusive lock anyways.
+	 */
+	if (ext4_unaligned_io(inode, from, offset)) {
+		unaligned_io = true;
+		ilock_shared = false;
+	}
+	/*
+	 * Quick check here without any i_rwsem lock to see if it is extending
+	 * IO. A more reliable check is done in ext4_dio_write_checks() with
+	 * proper locking in place.
+	 */
+	if (offset + count > i_size_read(inode))
+		ilock_shared = false;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!inode_trylock(inode))
-			return -EAGAIN;
+		if (ilock_shared) {
+			if (!inode_trylock_shared(inode))
+				return -EAGAIN;
+		} else {
+			if (!inode_trylock(inode))
+				return -EAGAIN;
+		}
 	} else {
-		inode_lock(inode);
+		if (ilock_shared)
+			inode_lock_shared(inode);
+		else
+			inode_lock(inode);
 	}
 
+	/* Fallback to buffered I/O if the inode does not support direct I/O. */
 	if (!ext4_dio_supported(inode)) {
-		inode_unlock(inode);
-		/*
-		 * Fallback to buffered I/O if the inode does not support
-		 * direct I/O.
-		 */
+		if (ilock_shared)
+			inode_unlock_shared(inode);
+		else
+			inode_unlock(inode);
 		return ext4_buffered_write_iter(iocb, from);
 	}
 
-	ret = ext4_write_checks(iocb, from);
-	if (ret <= 0) {
-		inode_unlock(inode);
+	ret = ext4_dio_write_checks(iocb, from, &ilock_shared, &extend);
+	if (ret <= 0)
 		return ret;
-	}
 
-	/*
-	 * Unaligned asynchronous direct I/O must be serialized among each
-	 * other as the zeroing of partial blocks of two competing unaligned
-	 * asynchronous direct I/O writes can result in data corruption.
-	 */
 	offset = iocb->ki_pos;
-	count = iov_iter_count(from);
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
-	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
-		unaligned_aio = true;
-		inode_dio_wait(inode);
-	}
+	count = ret;
 
 	/*
-	 * Determine whether the I/O will overwrite allocated and initialized
-	 * blocks. If so, check to see whether it is possible to take the
-	 * dioread_nolock path.
+	 * Unaligned direct IO must be serialized among each other as zeroing
+	 * of partial blocks of two competing unaligned IOs can result in data
+	 * corruption.
+	 *
+	 * So we make sure we don't allow any unaligned IO in flight.
+	 * For IOs where we need not wait (like unaligned non-AIO DIO),
+	 * below inode_dio_wait() may anyway become a no-op, since we start
+	 * with exclusive lock.
 	 */
-	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
-	    ext4_should_dioread_nolock(inode)) {
-		overwrite = true;
-		downgrade_write(&inode->i_rwsem);
-	}
+	if (unaligned_io)
+		inode_dio_wait(inode);
 
-	if (offset + count > EXT4_I(inode)->i_disksize) {
+	if (extend) {
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
@@ -432,18 +526,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out;
 		}
 
-		extend = true;
 		ext4_journal_stop(handle);
 	}
 
 	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_aio || extend);
+			   is_sync_kiocb(iocb) || unaligned_io || extend);
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
 
 out:
-	if (overwrite)
+	if (ilock_shared)
 		inode_unlock_shared(inode);
 	else
 		inode_unlock(inode);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCHv5 3/3] ext4: Move to shared i_rwsem even without dioread_nolock mount opt
  2019-12-12  5:55 [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 1/3] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
  2019-12-12  5:55 ` [PATCHv5 2/3] ext4: Start with shared i_rwsem in case of DIO instead of exclusive Ritesh Harjani
@ 2019-12-12  5:55 ` Ritesh Harjani
  2019-12-26 15:06 ` [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Theodore Y. Ts'o
  3 siblings, 0 replies; 5+ messages in thread
From: Ritesh Harjani @ 2019-12-12  5:55 UTC (permalink / raw)
  To: jack, tytso, linux-ext4
  Cc: linux-fsdevel, mbobrowski, joseph.qi, Ritesh Harjani

We were using shared locking for DIO overwrites only in case of the
dioread_nolock mount option. This mount option condition is not needed anymore
with the current code, since:

1. No race between buffered writes & DIO overwrites, since buffered writes take
the exclusive lock & DIO overwrites take the shared lock. Also the DIO path
makes sure to flush and wait for any dirty page cache data.

2. No race between buffered reads & DIO overwrites, since no block allocation
is possible with DIO overwrites. So no stale data exposure should happen. The
same is the case between DIO reads & DIO overwrites.

3. Other paths like truncate are also protected, since we wait there for any
DIO in flight to complete.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1da49dffa3df..9c2711bce0f9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -392,8 +392,8 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
  * - For extending writes case we don't take the shared lock, since it requires
  *   updating inode i_disksize and/or orphan handling with exclusive lock.
  *
- * - shared locking will only be true mostly with overwrites in dioread_nolock
- *   mode. Otherwise we will switch to exclusive i_rwsem lock.
+ * - shared locking will only be true mostly with overwrites. Otherwise we will
+ *   switch to exclusive i_rwsem lock.
  */
 static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 				     bool *ilock_shared, bool *extend)
@@ -415,14 +415,11 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 		*extend = true;
 	/*
 	 * Determine whether the IO operation will overwrite allocated
-	 * and initialized blocks. If so, check to see whether it is
-	 * possible to take the dioread_nolock path.
-	 *
+	 * and initialized blocks.
 	 * We need exclusive i_rwsem for changing security info
 	 * in file_modified().
 	 */
 	if (*ilock_shared && (!IS_NOSEC(inode) || *extend ||
-	     !ext4_should_dioread_nolock(inode) ||
 	     !ext4_overwrite_io(inode, offset, count))) {
 		inode_unlock_shared(inode);
 		*ilock_shared = false;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload
  2019-12-12  5:55 [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Ritesh Harjani
                   ` (2 preceding siblings ...)
  2019-12-12  5:55 ` [PATCHv5 3/3] ext4: Move to shared i_rwsem even without dioread_nolock mount opt Ritesh Harjani
@ 2019-12-26 15:06 ` Theodore Y. Ts'o
  3 siblings, 0 replies; 5+ messages in thread
From: Theodore Y. Ts'o @ 2019-12-26 15:06 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: jack, linux-ext4, linux-fsdevel, mbobrowski, joseph.qi

Thanks, I've applied this to the ext4 git tree.

	     	     	     	  - Ted

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-12-26 15:07 UTC | newest]

Thread overview: 5+ messages
2019-12-12  5:55 [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Ritesh Harjani
2019-12-12  5:55 ` [PATCHv5 1/3] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
2019-12-12  5:55 ` [PATCHv5 2/3] ext4: Start with shared i_rwsem in case of DIO instead of exclusive Ritesh Harjani
2019-12-12  5:55 ` [PATCHv5 3/3] ext4: Move to shared i_rwsem even without dioread_nolock mount opt Ritesh Harjani
2019-12-26 15:06 ` [PATCHv5 0/3] Fix inode_lock sequence to scale performance of DIO mixed R/W workload Theodore Y. Ts'o
