linux-ext4.vger.kernel.org archive mirror
* [RFC 0/2] ext4: Improve locking sequence in DIO write path
@ 2019-09-17 10:32 Ritesh Harjani
  2019-09-17 10:32 ` [RFC 1/2] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-17 10:32 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, joseph.qi
  Cc: david, hch, adilger, riteshh, mbobrowski, rgoldwyn

Hello,

This patch series is based on the upstream discussion with Jan
& Joseph at [1], and is based on top of Matthew's v3 ext4 iomap
patch series [2].

Patch-1: Adds the ext4_ilock/iunlock APIs and replaces all
inode_lock/unlock instances in fs/ext4/*

For now I have accounted for the trylock/lock semantics issue
(which was discussed here [3]) in the same patch: since this
whole patch centers around the inode_lock/unlock API, I thought
it would be best to address that issue in the same patch.
However, kindly let me know if you would prefer otherwise.

Patch-2: The commit message of this patch describes in detail
what it does.
In brief - we first try to take the shared lock (instead of the
exclusive lock), unless the IO is unaligned or extending. Then, in
ext4_dio_write_checks(), if we started with the shared lock, we
check whether we can really continue with it. If not, we release
the shared lock, acquire the exclusive lock, and restart
ext4_dio_write_checks().
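
Roughly, the new flow looks like the sketch below (simplified from
patch-2; NOWAIT handling and error paths are omitted, and
can_continue_shared() is just a stand-in for the dioread_nolock /
overwrite checks actually done in ext4_dio_write_checks()):

	iolock = EXT4_IOLOCK_SHARED;
	if (unaligned_io || extend)
		/* these always need the exclusive lock */
		iolock = EXT4_IOLOCK_EXCL;
	ext4_ilock(inode, iolock);

restart:
	if (iolock == EXT4_IOLOCK_SHARED &&
	    !can_continue_shared(inode, offset, count)) {
		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
		iolock = EXT4_IOLOCK_EXCL;
		ext4_ilock(inode, iolock);
		/* conditions may have changed while unlocked */
		goto restart;
	}

	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops,
			   ext4_dio_write_end_io);
	ext4_iunlock(inode, iolock);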


Tested against a few xfstests (with the dioread_nolock mount
option); those ran fine (ext4 & generic).

I tried testing performance numbers on my VM (since I could not
get hold of any real h/w based test device). I could verify that
earlier we were doing a downgrade_write() on the lock, and that
with this patch that path is now avoided for the fio test case
(as reported by Joseph in [4]).
But for the actual results, I am not sure whether VM testing can
really give the reliable perf numbers we want to look at.
Though I do observe some perf improvement, I could not get any
reliable numbers (not even with the same set of with/without
patches with which Joseph posted his numbers [1]).


@Joseph,
Would it be possible for you to give your test case a run with these
patches? That would be really helpful.

The branch for this is hosted at the tree below.

https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC

[1]: https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/
[2]: https://lwn.net/Articles/799184/
[3]: https://lore.kernel.org/linux-fsdevel/20190911103117.E32C34C044@d06av22.portsmouth.uk.ibm.com/
[4]: https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/


Ritesh Harjani (2):
  ext4: Add ext4_ilock & ext4_iunlock API
  ext4: Improve DIO writes locking sequence

 fs/ext4/ext4.h    |  33 ++++++
 fs/ext4/extents.c |  16 +--
 fs/ext4/file.c    | 253 ++++++++++++++++++++++++++++++++--------------
 fs/ext4/inode.c   |   4 +-
 fs/ext4/ioctl.c   |  16 +--
 fs/ext4/super.c   |  12 +--
 fs/ext4/xattr.c   |  16 +--
 7 files changed, 244 insertions(+), 106 deletions(-)

-- 
2.21.0



* [RFC 1/2] ext4: Add ext4_ilock & ext4_iunlock API
  2019-09-17 10:32 [RFC 0/2] ext4: Improve locking sequence in DIO write path Ritesh Harjani
@ 2019-09-17 10:32 ` Ritesh Harjani
  2019-09-17 10:32 ` [RFC 2/2] ext4: Improve DIO writes locking sequence Ritesh Harjani
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-17 10:32 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, joseph.qi
  Cc: david, hch, adilger, riteshh, mbobrowski, rgoldwyn

This adds the ext4_ilock/iunlock family of APIs.
These are preparatory APIs to make shared
locking/unlocking & restarting with exclusive
locking/unlocking easier in the next patch.

Along with the above, this also addresses the AIM7
regression (caused by unconditionally attempting a
trylock before taking the blocking lock) that was
previously fixed only for XFS in
commit 942491c9e6d6 ("xfs: fix AIM7 regression")
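
The problematic pattern, and the fix (a sketch; the diff below
makes exactly this change in ext4's read/write paths):

	/* before: trylock attempted first, even for blocking IO */
	if (!inode_trylock(inode)) {
		if (iocb->ki_flags & IOCB_NOWAIT)
			return -EAGAIN;
		inode_lock(inode);
	}

	/* after: only IOCB_NOWAIT IO attempts the trylock */
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
			return -EAGAIN;
	} else {
		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
	}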

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h    | 33 +++++++++++++++++++++++++
 fs/ext4/extents.c | 16 ++++++------
 fs/ext4/file.c    | 63 +++++++++++++++++++++++++----------------------
 fs/ext4/inode.c   |  4 +--
 fs/ext4/ioctl.c   | 16 ++++++------
 fs/ext4/super.c   | 12 ++++-----
 fs/ext4/xattr.c   | 16 ++++++------
 7 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2ab91815f52d..9ffafbe6bc3f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2945,6 +2945,39 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
 	return changed;
 }
 
+#define EXT4_IOLOCK_EXCL	(1 << 0)
+#define EXT4_IOLOCK_SHARED	(1 << 1)
+
+static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		inode_lock(inode);
+	else
+		inode_lock_shared(inode);
+}
+
+static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		inode_unlock(inode);
+	else
+		inode_unlock_shared(inode);
+}
+
+static inline int ext4_ilock_nowait(struct inode *inode, unsigned int iolock)
+{
+	if (iolock == EXT4_IOLOCK_EXCL)
+		return inode_trylock(inode);
+	else
+		return inode_trylock_shared(inode);
+}
+
+static inline void ext4_ilock_demote(struct inode *inode, unsigned int iolock)
+{
+	BUG_ON(iolock != EXT4_IOLOCK_EXCL);
+	downgrade_write(&inode->i_rwsem);
+}
+
 int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 				      loff_t len);
 
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a869e206bd81..ef37f4d4ee7e 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4680,7 +4680,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	else
 		max_blocks -= lblk;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/*
 	 * Indirect files do not support unwritten extnets
@@ -4790,7 +4790,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 
 	ext4_journal_stop(handle);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -4856,7 +4856,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/*
 	 * We only support preallocation for extent-based files only
@@ -4887,7 +4887,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 						EXT4_I(inode)->i_sync_tid);
 	}
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
 	return ret;
 }
@@ -5387,7 +5387,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/*
 	 * There is no need to overlap collapse range with EOF, in which case
 	 * it is effectively a truncate operation
@@ -5486,7 +5486,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -5537,7 +5537,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/* Currently just for extent based files */
 	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
 		ret = -EOPNOTSUPP;
@@ -5664,7 +5664,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d2ff383a8b9f..ce1cecbae932 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -57,14 +57,15 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	/*
 	 * Get exclusion from truncate and other inode operations.
 	 */
-	if (!inode_trylock_shared(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
 			return -EAGAIN;
-		inode_lock_shared(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 	}
 
 	if (!ext4_dio_checks(inode)) {
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		/*
 		 * Fallback to buffered IO if the operation being
 		 * performed on the inode is not supported by direct
@@ -77,7 +78,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 
 	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
-	inode_unlock_shared(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
@@ -89,22 +90,23 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
-	if (!inode_trylock_shared(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_SHARED))
 			return -EAGAIN;
-		inode_lock_shared(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 	}
 	/*
 	 * Recheck under inode lock - at this point we are sure it cannot
 	 * change anymore
 	 */
 	if (!IS_DAX(inode)) {
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		/* Fallback to buffered IO in case we cannot support DAX */
 		return generic_file_read_iter(iocb, to);
 	}
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
-	inode_unlock_shared(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
@@ -241,7 +243,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
@@ -250,7 +252,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
 	current->backing_dev_info = NULL;
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	if (likely(ret > 0)) {
 		iocb->ki_pos += ret;
 		ret = generic_write_sync(iocb, ret);
@@ -374,15 +376,17 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset = iocb->ki_pos;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	bool extend = false, overwrite = false, unaligned_aio = false;
+	unsigned int iolock = EXT4_IOLOCK_EXCL;
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, iolock))
 			return -EAGAIN;
-		inode_lock(inode);
+	} else {
+		ext4_ilock(inode, iolock);
 	}
 
 	if (!ext4_dio_checks(inode)) {
-		inode_unlock(inode);
+		ext4_iunlock(inode, iolock);
 		/*
 		 * Fallback to buffered IO if the operation on the
 		 * inode is not supported by direct IO.
@@ -392,7 +396,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0) {
-		inode_unlock(inode);
+		ext4_iunlock(inode, iolock);
 		return ret;
 	}
 
@@ -416,7 +420,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
 	    ext4_should_dioread_nolock(inode)) {
 		overwrite = true;
-		downgrade_write(&inode->i_rwsem);
+		ext4_ilock_demote(inode, iolock);
+		iolock = EXT4_IOLOCK_SHARED;
 	}
 
 	if (offset + count > i_size_read(inode) ||
@@ -438,10 +443,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
 		inode_dio_wait(inode);
 
-	if (overwrite)
-		inode_unlock_shared(inode);
-	else
-		inode_unlock(inode);
+	ext4_iunlock(inode, iolock);
 
 	if (ret >= 0 && iov_iter_count(from))
 		return ext4_buffered_write_iter(iocb, from);
@@ -457,10 +459,11 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	loff_t offset;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!ext4_ilock_nowait(inode, EXT4_IOLOCK_EXCL))
 			return -EAGAIN;
-		inode_lock(inode);
+	} else {
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	}
 
 	ret = ext4_write_checks(iocb, from);
@@ -480,7 +483,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (error)
 		ret = error;
 out:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
 	return ret;
@@ -707,14 +710,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
 		return generic_file_llseek_size(file, offset, whence,
 						maxbytes, i_size_read(inode));
 	case SEEK_HOLE:
-		inode_lock_shared(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 		offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		break;
 	case SEEK_DATA:
-		inode_lock_shared(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_SHARED);
 		offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
-		inode_unlock_shared(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
 		break;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a4f0749527c7..2870699ee504 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3914,7 +3914,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 			return ret;
 	}
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 
 	/* No need to punch hole beyond i_size */
 	if (offset >= inode->i_size)
@@ -4021,7 +4021,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 out_dio:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 442f7ef873fc..c6ae48567207 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -787,13 +787,13 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (err)
 			return err;
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		err = ext4_ioctl_check_immutable(inode,
 				from_kprojid(&init_user_ns, ei->i_projid),
 				flags);
 		if (!err)
 			err = ext4_ioctl_setflags(inode, flags);
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
@@ -824,7 +824,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			goto setversion_out;
 		}
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
 		if (IS_ERR(handle)) {
 			err = PTR_ERR(handle);
@@ -839,7 +839,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		ext4_journal_stop(handle);
 
 unlock_out:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 setversion_out:
 		mnt_drop_write_file(filp);
 		return err;
@@ -958,9 +958,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		 * ext4_ext_swap_inode_data before we switch the
 		 * inode format to prevent read.
 		 */
-		inode_lock((inode));
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		err = ext4_ext_migrate(inode);
-		inode_unlock((inode));
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
@@ -1150,7 +1150,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (err)
 			return err;
 
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		ext4_fill_fsxattr(inode, &old_fa);
 		err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
 		if (err)
@@ -1165,7 +1165,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 			goto out;
 		err = ext4_ioctl_setproject(filp, fa.fsx_projid);
 out:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 		mnt_drop_write_file(filp);
 		return err;
 	}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..45519036de83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2681,12 +2681,12 @@ static void ext4_orphan_cleanup(struct super_block *sb,
 					__func__, inode->i_ino, inode->i_size);
 			jbd_debug(2, "truncating inode %lu to %lld bytes\n",
 				  inode->i_ino, inode->i_size);
-			inode_lock(inode);
+			ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 			truncate_inode_pages(inode->i_mapping, inode->i_size);
 			ret = ext4_truncate(inode);
 			if (ret)
 				ext4_std_error(inode->i_sb, ret);
-			inode_unlock(inode);
+			ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 			nr_truncates++;
 		} else {
 			if (test_opt(sb, DEBUG))
@@ -5763,7 +5763,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 		 * files. If this fails, we return success anyway since quotas
 		 * are already enabled and this is not a hard failure.
 		 */
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		handle = ext4_journal_start(inode, EXT4_HT_QUOTA, 1);
 		if (IS_ERR(handle))
 			goto unlock_inode;
@@ -5773,7 +5773,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 		ext4_mark_inode_dirty(handle, inode);
 		ext4_journal_stop(handle);
 	unlock_inode:
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	}
 	return err;
 }
@@ -5865,7 +5865,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
 	if (err || ext4_has_feature_quota(sb))
 		goto out_put;
 
-	inode_lock(inode);
+	ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 	/*
 	 * Update modification times of quota files when userspace can
 	 * start looking at them. If we fail, we return success anyway since
@@ -5880,7 +5880,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 out_unlock:
-	inode_unlock(inode);
+	ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 out_put:
 	lockdep_set_quota_inode(inode, I_DATA_SEM_NORMAL);
 	iput(inode);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..dbe3e2900c24 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -422,9 +422,9 @@ static int ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino,
 		ext4_set_inode_state(inode, EXT4_STATE_LUSTRE_EA_INODE);
 		ext4_xattr_inode_set_ref(inode, 1);
 	} else {
-		inode_lock(inode);
+		ext4_ilock(inode, EXT4_IOLOCK_EXCL);
 		inode->i_flags |= S_NOQUOTA;
-		inode_unlock(inode);
+		ext4_iunlock(inode, EXT4_IOLOCK_EXCL);
 	}
 
 	*ea_inode = inode;
@@ -1025,7 +1025,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 	u32 hash;
 	int ret;
 
-	inode_lock(ea_inode);
+	ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 
 	ret = ext4_reserve_inode_write(handle, ea_inode, &iloc);
 	if (ret)
@@ -1079,7 +1079,7 @@ static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 		ext4_warning_inode(ea_inode,
 				   "ext4_mark_iloc_dirty() failed ret=%d", ret);
 out:
-	inode_unlock(ea_inode);
+	ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 	return ret;
 }
 
@@ -1400,10 +1400,10 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
 		block += 1;
 	}
 
-	inode_lock(ea_inode);
+	ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 	i_size_write(ea_inode, wsize);
 	ext4_update_i_disksize(ea_inode, wsize);
-	inode_unlock(ea_inode);
+	ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 
 	ext4_mark_inode_dirty(handle, ea_inode);
 
@@ -1452,9 +1452,9 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
 		 */
 		dquot_free_inode(ea_inode);
 		dquot_drop(ea_inode);
-		inode_lock(ea_inode);
+		ext4_ilock(ea_inode, EXT4_IOLOCK_EXCL);
 		ea_inode->i_flags |= S_NOQUOTA;
-		inode_unlock(ea_inode);
+		ext4_iunlock(ea_inode, EXT4_IOLOCK_EXCL);
 	}
 
 	return ea_inode;
-- 
2.21.0



* [RFC 2/2] ext4: Improve DIO writes locking sequence
  2019-09-17 10:32 [RFC 0/2] ext4: Improve locking sequence in DIO write path Ritesh Harjani
  2019-09-17 10:32 ` [RFC 1/2] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
@ 2019-09-17 10:32 ` Ritesh Harjani
  2019-09-18  0:58 ` [RFC 0/2] ext4: Improve locking sequence in DIO write path Joseph Qi
  2019-09-18  6:35 ` Joseph Qi
  3 siblings, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-17 10:32 UTC (permalink / raw)
  To: jack, tytso, linux-ext4, joseph.qi
  Cc: david, hch, adilger, riteshh, mbobrowski, rgoldwyn

Earlier there was no shared lock in the DIO read path.
Commit 16c54688592ce ("ext4: Allow parallel DIO reads")
simplified some of the locking mechanism while still allowing
parallel DIO reads, by taking a shared lock in the inode DIO
read path.

But this created a problem for mixed read/write workloads.
It is due to the fact that in the DIO write path we first take
the exclusive lock, and only when we determine that the IO is
an overwrite do we downgrade the lock. This causes the problem,
since with the above patch DIO reads take the shared lock.

So this patch tries to fix the issue by starting with the
shared lock and then switching to the exclusive lock only
when required, based on ext4_dio_write_checks().

Other than that, it also simplifies the below cases:

1. Simplified the ext4_unaligned_aio API to
ext4_unaligned_io.
The previous API was abused, in the sense that it was not
really checking for AIO anywhere, and it also used to
check for extending writes.
So this API was renamed and simplified to ext4_unaligned_io(),
which actually only checks whether the IO is really unaligned.

This is because in all other cases inode_dio_wait()
will anyway become a no-op. So there is no need to
overcomplicate things by checking every condition here.

2. Added the ext4_extending_io API. This checks whether the IO
is extending the file.

Now we only check for
unaligned_io, extend, dioread_nolock & overwrite.
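
I.e., the "continue with the shared lock" case now reduces to
roughly the following condition (a paraphrase of the check in
ext4_dio_write_checks() in the diff below; the real check also
tests IS_NOSEC() for the file_modified() case):

	if (iolock == EXT4_IOLOCK_SHARED &&
	    (unaligned_io || extend ||
	     !ext4_should_dioread_nolock(inode) ||
	     !ext4_overwrite_io(inode, offset, count))) {
		/* switch to the exclusive lock and restart checks */
	}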

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c | 206 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 154 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ce1cecbae932..45af2b7679ad 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -166,14 +166,11 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
  * threads are at work on the same unwritten block, they must be synchronized
  * or one thread will zero the other's data, causing corruption.
  */
-static int
-ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
+static bool
+ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
 {
 	struct super_block *sb = inode->i_sb;
-	int blockmask = sb->s_blocksize - 1;
-
-	if (pos >= ALIGN(i_size_read(inode), sb->s_blocksize))
-		return 0;
+	unsigned long blockmask = sb->s_blocksize - 1;
 
 	if ((pos | iov_iter_alignment(from)) & blockmask)
 		return 1;
@@ -181,6 +178,15 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
 	return 0;
 }
 
+static bool
+ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
+{
+	if (offset + len > i_size_read(inode) ||
+	    offset + len > EXT4_I(inode)->i_disksize)
+		return 1;
+	return 0;
+}
+
 /* Is IO overwriting allocated and initialized blocks? */
 static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 {
@@ -204,7 +210,9 @@ static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);
 }
 
-static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+
+static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
+					 struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
@@ -216,10 +224,6 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (ret <= 0)
 		return ret;
 
-	ret = file_modified(iocb->ki_filp);
-	if (ret)
-		return 0;
-
 	/*
 	 * If we have encountered a bitmap-format file, the size limit
 	 * is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +235,26 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 			return -EFBIG;
 		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
 	}
+
 	return iov_iter_count(from);
 }
 
+static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret;
+	ssize_t count;
+
+	count = ext4_generic_write_checks(iocb, from);
+	if (count <= 0)
+		return count;
+
+	ret = file_modified(iocb->ki_filp);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 					struct iov_iter *from)
 {
@@ -336,6 +357,83 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
 	return 0;
 }
 
+/*
+ * The intention here is to start with the shared lock
+ * acquired (except in the unaligned IO & extending writes
+ * cases), then see if any condition requires an exclusive
+ * inode lock. If yes, we restart the whole operation by
+ * releasing the shared lock and acquiring the exclusive lock.
+ *
+ * - For unaligned IO we never take the shared lock, as two
+ *   unaligned IOs modifying the same block under a shared
+ *   lock may cause data corruption.
+ *
+ * - For extending writes we don't take the shared lock
+ *   either, since updating the inode's i_disksize requires
+ *   the exclusive lock.
+ *
+ * - Shared locking will mostly remain in effect only for
+ *   overwrites in dioread_nolock mode.
+ *   Otherwise we switch to exclusive locking mode.
+ */
+static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
+				 unsigned int *iolock, bool *unaligned_io,
+				 bool *extend)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	loff_t offset = iocb->ki_pos;
+	loff_t final_size;
+	size_t count;
+	ssize_t ret;
+
+restart:
+	if (!ext4_dio_checks(inode)) {
+		ext4_iunlock(inode, *iolock);
+		return ext4_buffered_write_iter(iocb, from);
+	}
+
+	ret = ext4_generic_write_checks(iocb, from);
+	if (ret <= 0) {
+		ext4_iunlock(inode, *iolock);
+		return ret;
+	}
+
+	/* Recalculate since offset & count may change above. */
+	offset = iocb->ki_pos;
+	count = iov_iter_count(from);
+	final_size = offset + count;
+
+	if (ext4_unaligned_io(inode, from, offset))
+		*unaligned_io = true;
+
+	if (ext4_extending_io(inode, offset, count))
+		*extend = true;
+	/*
+	 * Determine whether the IO operation will overwrite allocated
+	 * and initialized blocks. If so, check to see whether it is
+	 * possible to take the dioread_nolock path.
+	 *
+	 * We need exclusive i_rwsem for changing security info
+	 * in file_modified().
+	 */
+	if (*iolock == EXT4_IOLOCK_SHARED &&
+	    (!IS_NOSEC(inode) || *unaligned_io || *extend ||
+	     !ext4_should_dioread_nolock(inode) ||
+	     !ext4_overwrite_io(inode, offset, count))) {
+		ext4_iunlock(inode, *iolock);
+		*iolock = EXT4_IOLOCK_EXCL;
+		ext4_ilock(inode, *iolock);
+		goto restart;
+	}
+
+	ret = file_modified(file);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
 /*
  * For a write that extends the inode size, ext4_dio_write_iter() will
  * wait for the write to complete. Consequently, operations performed
@@ -371,64 +469,68 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
 
 static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
+
 	ssize_t ret;
-	size_t count;
 	loff_t offset = iocb->ki_pos;
+	size_t count = iov_iter_count(from);
 	struct inode *inode = file_inode(iocb->ki_filp);
-	bool extend = false, overwrite = false, unaligned_aio = false;
-	unsigned int iolock = EXT4_IOLOCK_EXCL;
+	bool extend = false, unaligned_io = false;
+	unsigned int iolock = EXT4_IOLOCK_SHARED;
+
+	/*
+	 * We initially start with the shared inode lock,
+	 * unless it is unaligned IO, which needs the
+	 * exclusive lock anyway.
+	 */
+	if (ext4_unaligned_io(inode, from, offset)) {
+		unaligned_io = true;
+		iolock = EXT4_IOLOCK_EXCL;
+	}
+	/*
+	 * Extending writes need the exclusive lock
+	 * to update i_disksize.
+	 */
+	if (ext4_extending_io(inode, offset, count)) {
+		extend = true;
+		iolock = EXT4_IOLOCK_EXCL;
+	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/*
+		 * Unaligned IO may anyway require waiting
+		 * at other places, so bail out.
+		 */
+		if (unaligned_io)
+			return -EAGAIN;
 		if (!ext4_ilock_nowait(inode, iolock))
 			return -EAGAIN;
 	} else {
 		ext4_ilock(inode, iolock);
 	}
 
-	if (!ext4_dio_checks(inode)) {
-		ext4_iunlock(inode, iolock);
-		/*
-		 * Fallback to buffered IO if the operation on the
-		 * inode is not supported by direct IO.
-		 */
-		return ext4_buffered_write_iter(iocb, from);
-	}
-
-	ret = ext4_write_checks(iocb, from);
-	if (ret <= 0) {
-		ext4_iunlock(inode, iolock);
+	ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
+				    &extend);
+	if (ret <= 0)
 		return ret;
-	}
-
 	/*
-	 * Unaligned direct AIO must be serialized among each other as
+	 * Unaligned direct IO must be serialized among each other as
 	 * the zeroing of partial blocks of two competing unaligned
-	 * AIOs can result in data corruption.
+	 * IOs can result in data corruption. This matters here
+	 * because we may start with shared locking, and in the
+	 * dioread_nolock overwrite case we may continue in
+	 * shared locking mode, where two parallel unaligned
+	 * IOs could corrupt data.
+	 *
+	 * So we make sure no unaligned IO is in flight.
+	 * For IOs where we need not wait (like unaligned non-AIO DIO),
+	 * the inode_dio_wait() below may anyway become a no-op,
+	 * since we take the exclusive lock from the start.
 	 */
-	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
-	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
-		unaligned_aio = true;
+	if (unaligned_io)
 		inode_dio_wait(inode);
-	}
-
-	/*
-	 * Determine whether the IO operation will overwrite allocated
-	 * and initialized blocks. If so, check to see whether it is
-	 * possible to take the dioread_nolock path.
-	 */
-	count = iov_iter_count(from);
-	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
-	    ext4_should_dioread_nolock(inode)) {
-		overwrite = true;
-		ext4_ilock_demote(inode, iolock);
-		iolock = EXT4_IOLOCK_SHARED;
-	}
 
-	if (offset + count > i_size_read(inode) ||
-	    offset + count > EXT4_I(inode)->i_disksize) {
+	if (extend)
 		ext4_update_i_disksize(inode, inode->i_size);
-		extend = true;
-	}
 
 	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
 
@@ -440,7 +542,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	 * routines in ext4_dio_write_end_io() are covered by the
 	 * inode_lock().
 	 */
-	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+	if (ret == -EIOCBQUEUED && (unaligned_io || extend))
 		inode_dio_wait(inode);
 
 	ext4_iunlock(inode, iolock);
-- 
2.21.0



* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-17 10:32 [RFC 0/2] ext4: Improve locking sequence in DIO write path Ritesh Harjani
  2019-09-17 10:32 ` [RFC 1/2] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
  2019-09-17 10:32 ` [RFC 2/2] ext4: Improve DIO writes locking sequence Ritesh Harjani
@ 2019-09-18  0:58 ` Joseph Qi
  2019-09-18  6:35 ` Joseph Qi
  3 siblings, 0 replies; 16+ messages in thread
From: Joseph Qi @ 2019-09-18  0:58 UTC (permalink / raw)
  To: Ritesh Harjani, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn



On 19/9/17 18:32, Ritesh Harjani wrote:
> [... cover letter snipped ...]
> 
> @Joseph,
> Would it be possible for you to give your test case a run with these
> patches? That would be really helpful.
> 

Sure, will post the result ASAP.

Thanks,
Joseph


* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-17 10:32 [RFC 0/2] ext4: Improve locking sequence in DIO write path Ritesh Harjani
                   ` (2 preceding siblings ...)
  2019-09-18  0:58 ` [RFC 0/2] ext4: Improve locking sequence in DIO write path Joseph Qi
@ 2019-09-18  6:35 ` Joseph Qi
  2019-09-18 10:03   ` Ritesh Harjani
  2019-09-24 15:10   ` Jan Kara
  3 siblings, 2 replies; 16+ messages in thread
From: Joseph Qi @ 2019-09-18  6:35 UTC (permalink / raw)
  To: Ritesh Harjani, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn

Hi Ritesh,

On 19/9/17 18:32, Ritesh Harjani wrote:
> [... cover letter snipped ...]
> 
> @Joseph,
> Would it be possible for you to give your test case a run with these
> patches? That would be really helpful.
> 
> The branch for this is hosted at the tree below.
> 
> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
> 
I've tested your branch; the result is:
mounted with dioread_nolock, it behaves the same as reverting
parallel dio reads + dioread_nolock;
while mounted without dioread_nolock, there is no improvement,
or it is even worse. Please refer to the test data below.

fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
-direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
-size=20G -numjobs=8 -runtime=600 -group_reporting

w/     = with parallel dio reads
w/o    = reverting parallel dio reads
w/o+   = reverting parallel dio reads + dioread_nolock
ilock  = ext4-ilock-RFC
ilock+ = ext4-ilock-RFC + dioread_nolock

bs=4k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
--------------------------------------------------------------
w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
--------------------------------------------------------------
w/o+  | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
--------------------------------------------------------------
ilock | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
--------------------------------------------------------------
ilock+| 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
--------------------------------------------------------------


bs=16k:
--------------------------------------------------------------
      |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
--------------------------------------------------------------
w/o   | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
--------------------------------------------------------------
w/o+  | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
--------------------------------------------------------------
ilock | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
--------------------------------------------------------------
ilock+| 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
--------------------------------------------------------------

bs=64k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 119396KB/s,1865,1759.38us  | 119159KB/s,1861,2532.26us |
----------------------------------------------------------------
w/o   | 422815KB/s,6606,1146.05us  | 421619KB/s,6587,60.72us   |
----------------------------------------------------------------
w/o+  | 427406KB/s,6678,1141.52us  | 426197KB/s,6659,52.79us   |
----------------------------------------------------------------
ilock | 105800KB/s,1653,1451.68us  | 105721KB/s,1651,3388.64us |
----------------------------------------------------------------
ilock+| 427678KB/s,6682,1142.13us  | 426468KB/s,6663,52.31us   |
----------------------------------------------------------------

bs=512k
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 392973KB/s,767,5046.35us   | 393165KB/s,767,5359.86us  |
----------------------------------------------------------------
w/o   | 590266KB/s,1152,4312.01us  | 590554KB/s,1153,2606.82us |
----------------------------------------------------------------
w/o+  | 618752KB/s,1208,4125.82us  | 619054KB/s,1209,2487.90us |
----------------------------------------------------------------
ilock | 296239KB/s,578,4703.10us   | 296384KB/s,578,9049.32us  |
----------------------------------------------------------------
ilock+| 616636KB/s,1204,4143.38us  | 616937KB/s,1204,2490.08us |
----------------------------------------------------------------

bs=1M
----------------------------------------------------------------
      |            READ            |           WRITE           |
----------------------------------------------------------------
w/    | 487779KB/s,476,8058.55us   | 485592KB/s,474,8630.51us  |
----------------------------------------------------------------
w/o   | 593927KB/s,580,7623.63us   | 591265KB/s,577,6163.42us  |
----------------------------------------------------------------
w/o+  | 615011KB/s,600,7399.93us   | 612255KB/s,597,5936.61us  |
----------------------------------------------------------------
ilock | 394762KB/s,385,7097.55us   | 392993KB/s,383,13626.98us |
----------------------------------------------------------------
ilock+| 626183KB/s,611,7319.16us   | 623377KB/s,608,5773.24us  |
----------------------------------------------------------------

Thanks,
Joseph


* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-18  6:35 ` Joseph Qi
@ 2019-09-18 10:03   ` Ritesh Harjani
  2019-09-18 10:57     ` Joseph Qi
  2019-09-19  2:08     ` Joseph Qi
  2019-09-24 15:10   ` Jan Kara
  1 sibling, 2 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-18 10:03 UTC (permalink / raw)
  To: Joseph Qi, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn

Hello Joseph,

First of all, thanks a lot for collecting such thorough
performance numbers.

On 9/18/19 12:05 PM, Joseph Qi wrote:
> Hi Ritesh,
> 
> On 19/9/17 18:32, Ritesh Harjani wrote:
>> [... cover letter snipped ...]
>>
>> @Joseph,
>> Would it be possible for you to give your test case a run with these
>> patches? That would be really helpful.
>>
>> The branch for this is hosted at the tree below.
>>
>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>
> I've tested your branch; the result is:
> mounted with dioread_nolock, it behaves the same as reverting
> parallel dio reads + dioread_nolock;

Good sign; it means the patch is doing what it is supposed to do.


> while mounted without dioread_nolock, there is no improvement,
> or it is even worse. Please refer to the test data below.
Actually, without dioread_nolock we take the restart path,
i.e. initially we start with the shared lock, but if dioread_nolock
is not enabled (or some other condition, like overwrite, does not
hold), we release the shared lock and re-acquire the exclusive lock.


But as an optimization, I have just now added the below diff
to first check for ext4_should_dioread_nolock as well,
before taking the shared lock.

I think with this we should not see any performance regression
(even without the dioread_nolock mount option), since it will
start directly with the exclusive lock if the dioread_nolock
mount option is not enabled.

I have updated the tree with this diff in the same branch.


ext4_dio_file_write_iter ()
<...>

	if (iolock == EXT4_IOLOCK_SHARED &&
	    !ext4_should_dioread_nolock(inode))
		iolock = EXT4_IOLOCK_EXCL;

<...>


> 
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
> 
> w/     = with parallel dio reads
> w/o    = reverting parallel dio reads
> w/o+   = reverting parallel dio reads + dioread_nolock
> ilock  = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock

I would request you to kindly also add "w/ + dioread_nolock" to your list.


"w/ + dioread_nolock"  v/s  "ilock+" - should show some improvements.
"w/ "  v/s  "ilock" - should not show any regression.

But thanks for the exhaustive performance numbers you collected.


-ritesh



* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-18 10:03   ` Ritesh Harjani
@ 2019-09-18 10:57     ` Joseph Qi
  2019-09-19  2:08     ` Joseph Qi
  1 sibling, 0 replies; 16+ messages in thread
From: Joseph Qi @ 2019-09-18 10:57 UTC (permalink / raw)
  To: Ritesh Harjani, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn



On 19/9/18 18:03, Ritesh Harjani wrote:
> Hello Joseph,
> 
> First of all, thanks a lot for collecting such thorough
> performance numbers.
> 
> [... snipped ...]
> 
> I would request you to kindly also add "w/ + dioread_nolock" to your list.
> 
I've done this test before; it still behaves poorly.
You can refer to the previous RFC thread:
https://www.spinics.net/lists/linux-ext4/msg67066.html

Thanks,
Joseph

> 
> "w/ + dioread_nolock"  v/s  "ilock+" - should show some improvements.
> "w/ "  v/s  "ilock" - should not show any regression.
> 
> But thanks for the exhaustive performance numbers you collected.
> 
> 
> -ritesh
> 
> 
>>
>> bs=4k:
>> --------------------------------------------------------------
>>        |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
>> --------------------------------------------------------------
>> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
>> w/o+  | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
>> --------------------------------------------------------------
>> ilock | 29964KB/s,7491,326.70us      | 29940KB/s,7485,740.62us  |
>> --------------------------------------------------------------
>> ilock+| 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
>> --------------------------------------------------------------
>>
>>
>> bs=16k:
>> --------------------------------------------------------------
>>        |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
>> --------------------------------------------------------------
>> w/o   | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
>> --------------------------------------------------------------
>> w/o+  | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
>> --------------------------------------------------------------
>> ilock | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
>> --------------------------------------------------------------
>> ilock+| 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
>> --------------------------------------------------------------
>>
>> bs=64k
>> ----------------------------------------------------------------
>>        |            READ            |           WRITE           |
>> ----------------------------------------------------------------
>> w/    | 119396KB/s,1865,1759.38us  | 119159KB/s,1861,2532.26us |
>> ----------------------------------------------------------------
>> w/o   | 422815KB/s,6606,1146.05us  | 421619KB/s,6587,60.72us   |
>> ----------------------------------------------------------------
>> w/o+  | 427406KB/s,6678,1141.52us  | 426197KB/s,6659,52.79us   |
>> ----------------------------------------------------------------
>> ilock | 105800KB/s,1653,1451.68us  | 105721KB/s,1651,3388.64us |
>> ----------------------------------------------------------------
>> ilock+| 427678KB/s,6682,1142.13us  | 426468KB/s,6663,52.31us   |
>> ----------------------------------------------------------------
>>
>> bs=512k
>> ----------------------------------------------------------------
>>        |            READ            |           WRITE           |
>> ----------------------------------------------------------------
>> w/    | 392973KB/s,767,5046.35us   | 393165KB/s,767,5359.86us  |
>> ----------------------------------------------------------------
>> w/o   | 590266KB/s,1152,4312.01us  | 590554KB/s,1153,2606.82us |
>> ----------------------------------------------------------------
>> w/o+  | 618752KB/s,1208,4125.82us  | 619054KB/s,1209,2487.90us |
>> ----------------------------------------------------------------
>> ilock | 296239KB/s,578,4703.10us   | 296384KB/s,578,9049.32us  |
>> ----------------------------------------------------------------
>> ilock+| 616636KB/s,1204,4143.38us  | 616937KB/s,1204,2490.08us |
>> ----------------------------------------------------------------
>>
>> bs=1M
>> ----------------------------------------------------------------
>>        |            READ            |           WRITE           |
>> ----------------------------------------------------------------
>> w/    | 487779KB/s,476,8058.55us   | 485592KB/s,474,8630.51us  |
>> ----------------------------------------------------------------
>> w/o   | 593927KB/s,580,7623.63us   | 591265KB/s,577,6163.42us  |
>> ----------------------------------------------------------------
>> w/o+  | 615011KB/s,600,7399.93us   | 612255KB/s,597,5936.61us  |
>> ----------------------------------------------------------------
>> ilock | 394762KB/s,385,7097.55us   | 392993KB/s,383,13626.98us |
>> ----------------------------------------------------------------
>> ilock+| 626183KB/s,611,7319.16us   | 623377KB/s,608,5773.24us  |
>> ----------------------------------------------------------------
>>
>> Thanks,
>> Joseph
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-18 10:03   ` Ritesh Harjani
  2019-09-18 10:57     ` Joseph Qi
@ 2019-09-19  2:08     ` Joseph Qi
  2019-09-19 18:48       ` Ritesh Harjani
  2019-09-23  6:19       ` Ritesh Harjani
  1 sibling, 2 replies; 16+ messages in thread
From: Joseph Qi @ 2019-09-19  2:08 UTC (permalink / raw)
  To: Ritesh Harjani, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn



On 19/9/18 18:03, Ritesh Harjani wrote:
> Hello Joseph,
> 
> First of all thanks a lot for collecting a thorough
> performance numbers.
> 
> On 9/18/19 12:05 PM, Joseph Qi wrote:
>> Hi Ritesh,
>>
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch.
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same as reverting
>> parallel dio reads + dioread_nolock;
> 
> Good sign, means that patch is doing what it is supposed to do.
> 
> 
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer to the test data below.
> Actually without dioread_nolock, we take the restart path.
> i.e. initially we start with SHARED_LOCK, but if dioread_nolock
> is not enabled (or some other condition fails, like the overwrite check),
> we release the shared lock and re-acquire the EXCL lock.
> 
> 
> But as an optimization, I just added the below diff
> to also check ext4_should_dioread_nolock up front,
> before taking the shared lock.
> 
> I think with this we should not see any performance regression
> (even without dioread_nolock mount option).
> Since it will directly start with exclusive lock
> if dioread_nolock mount option is not enabled.
> 
> I have updated the tree with this diff in the same branch.
> 
> 
> ext4_dio_file_write_iter ()
> <...>
> 
> 	if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
> 		iolock = EXT4_IOLOCK_EXCL;
> <...>
> 
With this optimization, when mounting without dioread_nolock, there is
a small improvement compared to the last ilock version, but it is still
worse than the original, especially for big block sizes.

w/        = with parallel dio reads (original)
w/o       = reverting parallel dio reads
w/o+      = reverting parallel dio reads + dioread_nolock
ilock     = ext4-ilock-RFC
ilock+    = ext4-ilock-RFC + dioread_nolock
ilocknew  = ext4-ilock-RFC latest
ilocknew+ = ext4-ilock-RFC latest + dioread_nolock


bs=4k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
------------------------------------------------------------------
w/o       | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
------------------------------------------------------------------
w/o+      | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
------------------------------------------------------------------
ilock     | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
------------------------------------------------------------------
ilocknew  | 30190KB/s,7547,497.47us   | 30159KB/s,7539,561.85us  |
------------------------------------------------------------------
ilock+    | 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
------------------------------------------------------------------
ilocknew+ | 123169KB/s,30792,245.81us | 123087KB/s,30771,12.85us |
------------------------------------------------------------------


bs=16k:
------------------------------------------------------------------
          |            READ           |           WRITE          |
------------------------------------------------------------------
w/        | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
------------------------------------------------------------------
w/o       | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
------------------------------------------------------------------
w/o+      | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
------------------------------------------------------------------
ilock     | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
------------------------------------------------------------------
ilocknew  | 57317KB/s,3582,1023.88us  | 57230KB/s,3576,1209.91us |
------------------------------------------------------------------
ilock+    | 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
------------------------------------------------------------------
ilocknew+ | 221945KB/s,13871,552.61us | 221791KB/s,13861,21.29us |
------------------------------------------------------------------

bs=64k
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 119396KB/s,1865,1759.38us | 119159KB/s,1861,2532.26us |
-------------------------------------------------------------------
w/o       | 422815KB/s,6606,1146.05us | 421619KB/s,6587,60.72us   |
-------------------------------------------------------------------
w/o+      | 427406KB/s,6678,1141.52us | 426197KB/s,6659,52.79us   |
-------------------------------------------------------------------
ilock     | 105800KB/s,1653,1451.68us | 105721KB/s,1651,3388.64us |
-------------------------------------------------------------------
ilocknew  | 107447KB/s,1678,1654.33us | 107322KB/s,1676,3112.96us |
-------------------------------------------------------------------
ilock+    | 427678KB/s,6682,1142.13us | 426468KB/s,6663,52.31us   |
-------------------------------------------------------------------
ilocknew+ | 427054KB/s,6672,1143.43us | 425846KB/s,6653,52.87us   |
-------------------------------------------------------------------

bs=512k
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 392973KB/s,767,5046.35us  | 393165KB/s,767,5359.86us  |
-------------------------------------------------------------------
w/o       | 590266KB/s,1152,4312.01us | 590554KB/s,1153,2606.82us |
-------------------------------------------------------------------
w/o+      | 618752KB/s,1208,4125.82us | 619054KB/s,1209,2487.90us |
-------------------------------------------------------------------
ilock     | 296239KB/s,578,4703.10us  | 296384KB/s,578,9049.32us  |
-------------------------------------------------------------------
ilocknew  | 309740KB/s,604,5485.93us  | 309892KB/s,605,7666.79us  |
-------------------------------------------------------------------
ilock+    | 616636KB/s,1204,4143.38us | 616937KB/s,1204,2490.08us |
-------------------------------------------------------------------
ilocknew+ | 618159KB/s,1207,4129.76us | 618461KB/s,1207,2486.90us |
-------------------------------------------------------------------

bs=1M
-------------------------------------------------------------------
          |            READ           |           WRITE           |
-------------------------------------------------------------------
w/        | 487779KB/s,476,8058.55us  | 485592KB/s,474,8630.51us  |
-------------------------------------------------------------------
w/o       | 593927KB/s,580,7623.63us  | 591265KB/s,577,6163.42us  |
-------------------------------------------------------------------
w/o+      | 615011KB/s,600,7399.93us  | 612255KB/s,597,5936.61us  |
-------------------------------------------------------------------
ilock     | 394762KB/s,385,7097.55us  | 392993KB/s,383,13626.98us |
-------------------------------------------------------------------
ilocknew  | 422052KB/s,412,8338.16us  | 420161KB/s,410,11008.95us |
-------------------------------------------------------------------
ilock+    | 626183KB/s,611,7319.16us  | 623377KB/s,608,5773.24us  |
-------------------------------------------------------------------
ilocknew+ | 626006KB/s,611,7281.09us  | 623200KB/s,608,5817.84us  |
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-19  2:08     ` Joseph Qi
@ 2019-09-19 18:48       ` Ritesh Harjani
  2019-09-23  6:19       ` Ritesh Harjani
  1 sibling, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-19 18:48 UTC (permalink / raw)
  To: Joseph Qi, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn

Hello,

On 9/19/19 7:38 AM, Joseph Qi wrote:
> 
> 
> On 19/9/18 18:03, Ritesh Harjani wrote:
>> Hello Joseph,
>>
>> First of all thanks a lot for collecting a thorough
>> performance numbers.
>>
>> On 9/18/19 12:05 PM, Joseph Qi wrote:
>>> Hi Ritesh,
>>>
>>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>>> Hello,
>>>>
>>>> This patch series is based on the upstream discussion with Jan
>>>> & Joseph @ [1].
>>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>>
>>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>>> inode_lock/unlock instances from fs/ext4/*
>>>>
>>>> For now I already accounted for trylock/lock issue symantics
>>>> (which was discussed here [3]) in the same patch,
>>>> since the this whole patch was around inode_lock/unlock API,
>>>> so I thought it will be best to address that issue in the same patch.
>>>> However, kindly let me know if otherwise.
>>>>
>>>> Patch-2: Commit msg of this patch describes in detail about
>>>> what it is doing.
>>>> In brief - we try to first take the shared lock (instead of exclusive
>>>> lock), unless it is a unaligned_io or extend_io. Then in
>>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>>> if we can really continue with shared lock or not. If not, then
>>>> we release the shared lock then acquire exclusive lock
>>>> and restart ext4_dio_write_checks().
>>>>
>>>>
>>>> Tested against few xfstests (with dioread_nolock mount option),
>>>> those ran fine (ext4 & generic).
>>>>
>>>> I tried testing performance numbers on my VM (since I could not get
>>>> hold of any real h/w based test device). I could test the fact
>>>> that earlier we were trying to do downgrade_write() lock, but with
>>>> this patch, that path is now avoided for fio test case
>>>> (as reported by Joseph in [4]).
>>>> But for the actual results, I am not sure if VM machine testing could
>>>> really give the reliable perf numbers which we want to take a look at.
>>>> Though I do observe some form of perf improvements, but I could not
>>>> get any reliable numbers (not even with the same list of with/without
>>>> patches with which Joseph posted his numbers [1]).
>>>>
>>>>
>>>> @Joseph,
>>>> Would it be possible for you to give your test case a run with this
>>>> patches? That will be really helpful.
>>>>
>>>> Branch for this is hosted at below tree.
>>>>
>>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>>
>>> I've tested your branch, the result is:
>>> mounting with dioread_nolock, it behaves the same as reverting
>>> parallel dio reads + dioread_nolock;
>>
>> Good sign, means that patch is doing what it is supposed to do.
>>
>>
>>> while mounting without dioread_nolock, no improvement, or even worse.
>>> Please refer to the test data below.
>> Actually without dioread_nolock, we take the restart path.
>> i.e. initially we start with SHARED_LOCK, but if dioread_nolock
>> is not enabled (or some other condition fails, like the overwrite check),
>> we release the shared lock and re-acquire the EXCL lock.
>>
>>
>> But as an optimization, I just added the below diff
>> to also check ext4_should_dioread_nolock up front,
>> before taking the shared lock.
>>
>> I think with this we should not see any performance regression
>> (even without dioread_nolock mount option).
>> Since it will directly start with exclusive lock
>> if dioread_nolock mount option is not enabled.
>>
>> I have updated the tree with this diff in the same branch.
>>
>>
>> ext4_dio_file_write_iter ()
>> <...>
>>
>> 	if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
>> 		iolock = EXT4_IOLOCK_EXCL;
>> <...>
>>
> With this optimization, when mounting without dioread_nolock, there is
> a small improvement compared to the last ilock version, but it is still
> worse than the original, especially for big block sizes.

Finally, I got hold of some HW. I am collecting the numbers as we speak.
Will post those tomorrow.

Thanks for your help!!

-ritesh



> 
> w/        = with parallel dio reads (original)
> w/o       = reverting parallel dio reads
> w/o+      = reverting parallel dio reads + dioread_nolock
> ilock     = ext4-ilock-RFC
> ilock+    = ext4-ilock-RFC + dioread_nolock
> ilocknew  = ext4-ilock-RFC latest
> ilocknew+ = ext4-ilock-RFC latest + dioread_nolock
> 
> 
> bs=4k:
> ------------------------------------------------------------------
>            |            READ           |           WRITE          |
> ------------------------------------------------------------------
> w/        | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
> ------------------------------------------------------------------
> w/o       | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
> ------------------------------------------------------------------
> w/o+      | 123450KB/s,30862,245.77us | 123368KB/s,30841,12.14us |
> ------------------------------------------------------------------
> ilock     | 29964KB/s,7491,326.70us   | 29940KB/s,7485,740.62us  |
> ------------------------------------------------------------------
> ilocknew  | 30190KB/s,7547,497.47us   | 30159KB/s,7539,561.85us  |
> ------------------------------------------------------------------
> ilock+    | 123685KB/s,30921,245.52us | 123601KB/s,30900,12.11us |
> ------------------------------------------------------------------
> ilocknew+ | 123169KB/s,30792,245.81us | 123087KB/s,30771,12.85us |
> ------------------------------------------------------------------
> 
> 
> bs=16k:
> ------------------------------------------------------------------
>            |            READ           |           WRITE          |
> ------------------------------------------------------------------
> w/        | 58961KB/s,3685,835.28us   | 58877KB/s,3679,1335.98us |
> ------------------------------------------------------------------
> w/o       | 218409KB/s,13650,554.46us | 218257KB/s,13641,29.22us |
> ------------------------------------------------------------------
> w/o+      | 222477KB/s,13904,552.94us | 222322KB/s,13895,20.28us |
> ------------------------------------------------------------------
> ilock     | 56039KB/s,3502,632.96us   | 55943KB/s,3496,1652.72us |
> ------------------------------------------------------------------
> ilocknew  | 57317KB/s,3582,1023.88us  | 57230KB/s,3576,1209.91us |
> ------------------------------------------------------------------
> ilock+    | 222747KB/s,13921,552.57us | 222592KB/s,13912,20.31us |
> ------------------------------------------------------------------
> ilocknew+ | 221945KB/s,13871,552.61us | 221791KB/s,13861,21.29us |
> ------------------------------------------------------------------
> 
> bs=64k
> -------------------------------------------------------------------
>            |            READ           |           WRITE           |
> -------------------------------------------------------------------
> w/        | 119396KB/s,1865,1759.38us | 119159KB/s,1861,2532.26us |
> -------------------------------------------------------------------
> w/o       | 422815KB/s,6606,1146.05us | 421619KB/s,6587,60.72us   |
> -------------------------------------------------------------------
> w/o+      | 427406KB/s,6678,1141.52us | 426197KB/s,6659,52.79us   |
> -------------------------------------------------------------------
> ilock     | 105800KB/s,1653,1451.68us | 105721KB/s,1651,3388.64us |
> -------------------------------------------------------------------
> ilocknew  | 107447KB/s,1678,1654.33us | 107322KB/s,1676,3112.96us |
> -------------------------------------------------------------------
> ilock+    | 427678KB/s,6682,1142.13us | 426468KB/s,6663,52.31us   |
> -------------------------------------------------------------------
> ilocknew+ | 427054KB/s,6672,1143.43us | 425846KB/s,6653,52.87us   |
> -------------------------------------------------------------------
> 
> bs=512k
> -------------------------------------------------------------------
>            |            READ           |           WRITE           |
> -------------------------------------------------------------------
> w/        | 392973KB/s,767,5046.35us  | 393165KB/s,767,5359.86us  |
> -------------------------------------------------------------------
> w/o       | 590266KB/s,1152,4312.01us | 590554KB/s,1153,2606.82us |
> -------------------------------------------------------------------
> w/o+      | 618752KB/s,1208,4125.82us | 619054KB/s,1209,2487.90us |
> -------------------------------------------------------------------
> ilock     | 296239KB/s,578,4703.10us  | 296384KB/s,578,9049.32us  |
> -------------------------------------------------------------------
> ilocknew  | 309740KB/s,604,5485.93us  | 309892KB/s,605,7666.79us  |
> -------------------------------------------------------------------
> ilock+    | 616636KB/s,1204,4143.38us | 616937KB/s,1204,2490.08us |
> -------------------------------------------------------------------
> ilocknew+ | 618159KB/s,1207,4129.76us | 618461KB/s,1207,2486.90us |
> -------------------------------------------------------------------
> 
> bs=1M
> -------------------------------------------------------------------
>            |            READ           |           WRITE           |
> -------------------------------------------------------------------
> w/        | 487779KB/s,476,8058.55us  | 485592KB/s,474,8630.51us  |
> -------------------------------------------------------------------
> w/o       | 593927KB/s,580,7623.63us  | 591265KB/s,577,6163.42us  |
> -------------------------------------------------------------------
> w/o+      | 615011KB/s,600,7399.93us  | 612255KB/s,597,5936.61us  |
> -------------------------------------------------------------------
> ilock     | 394762KB/s,385,7097.55us  | 392993KB/s,383,13626.98us |
> -------------------------------------------------------------------
> ilocknew  | 422052KB/s,412,8338.16us  | 420161KB/s,410,11008.95us |
> -------------------------------------------------------------------
> ilock+    | 626183KB/s,611,7319.16us  | 623377KB/s,608,5773.24us  |
> -------------------------------------------------------------------
> ilocknew+ | 626006KB/s,611,7281.09us  | 623200KB/s,608,5817.84us  |
> -------------------------------------------------------------------
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-19  2:08     ` Joseph Qi
  2019-09-19 18:48       ` Ritesh Harjani
@ 2019-09-23  6:19       ` Ritesh Harjani
  1 sibling, 0 replies; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-23  6:19 UTC (permalink / raw)
  To: Joseph Qi, jack, tytso, linux-ext4
  Cc: david, hch, adilger, mbobrowski, rgoldwyn

> Finally, I got hold of some HW. I am collecting the numbers as we speak.
> Will post those tomorrow.

I have sent RFC-v2 with a comprehensive set of performance numbers and
charts. It can be found below.

https://marc.info/?l=linux-ext4&m=156921748126221&w=2

Please take a look at it.

-ritesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-18  6:35 ` Joseph Qi
  2019-09-18 10:03   ` Ritesh Harjani
@ 2019-09-24 15:10   ` Jan Kara
  2019-09-24 19:48     ` Ritesh Harjani
  2019-09-25  1:17     ` Joseph Qi
  1 sibling, 2 replies; 16+ messages in thread
From: Jan Kara @ 2019-09-24 15:10 UTC (permalink / raw)
  To: Joseph Qi
  Cc: Ritesh Harjani, jack, tytso, linux-ext4, david, hch, adilger,
	mbobrowski, rgoldwyn

Hi Joseph!

On Wed 18-09-19 14:35:15, Joseph Qi wrote:
> On 19/9/17 18:32, Ritesh Harjani wrote:
> > Hello,
> > 
> > This patch series is based on the upstream discussion with Jan
> > & Joseph @ [1].
> > It is based on top of Matthew's v3 ext4 iomap patch series [2]
> > 
> > Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
> > inode_lock/unlock instances from fs/ext4/*
> > 
> > For now I already accounted for trylock/lock issue symantics
> > (which was discussed here [3]) in the same patch,
> > since the this whole patch was around inode_lock/unlock API,
> > so I thought it will be best to address that issue in the same patch. 
> > However, kindly let me know if otherwise.
> > 
> > Patch-2: Commit msg of this patch describes in detail about
> > what it is doing.
> > In brief - we try to first take the shared lock (instead of exclusive
> > lock), unless it is a unaligned_io or extend_io. Then in
> > ext4_dio_write_checks(), if we start with shared lock, we see
> > if we can really continue with shared lock or not. If not, then
> > we release the shared lock then acquire exclusive lock
> > and restart ext4_dio_write_checks().
> > 
> > 
> > Tested against few xfstests (with dioread_nolock mount option),
> > those ran fine (ext4 & generic).
> > 
> > I tried testing performance numbers on my VM (since I could not get
> > hold of any real h/w based test device). I could test the fact
> > that earlier we were trying to do downgrade_write() lock, but with
> > this patch, that path is now avoided for fio test case
> > (as reported by Joseph in [4]).
> > But for the actual results, I am not sure if VM machine testing could
> > really give the reliable perf numbers which we want to take a look at.
> > Though I do observe some form of perf improvements, but I could not
> > get any reliable numbers (not even with the same list of with/without
> > patches with which Joseph posted his numbers [1]).
> > 
> > 
> > @Joseph,
> > Would it be possible for you to give your test case a run with this
> > patches? That will be really helpful.
> > 
> > Branch for this is hosted at below tree.
> > 
> > https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
> > 
> I've tested your branch, the result is:
> mounting with dioread_nolock, it behaves the same as reverting
> parallel dio reads + dioread_nolock;
> while mounting without dioread_nolock, no improvement, or even worse.
> Please refer to the test data below.
> 
> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
> -size=20G -numjobs=8 -runtime=600 -group_reporting
> 
> w/     = with parallel dio reads
> w/o    = reverting parallel dio reads

This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
right?

> w/o+   = reverting parallel dio reads + dioread_nolock
> ilock  = ext4-ilock-RFC
> ilock+ = ext4-ilock-RFC + dioread_nolock
> 
> bs=4k:
> --------------------------------------------------------------
>       |            READ           |           WRITE          |
> --------------------------------------------------------------
> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
> --------------------------------------------------------------
> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
> --------------------------------------------------------------

I'm really surprised by the numbers here. They would mean that when DIO
read takes i_rwsem exclusive lock instead of shared, it is a win for your
workload... Argh, now checking code in fs/direct-io.c I think I can see the
difference. The trick in do_blockdev_direct_IO() is:

        if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
                inode_unlock(dio->inode);
        if (dio->is_async && retval == 0 && dio->result &&
            (iov_iter_rw(iter) == READ || dio->result == count))
                retval = -EIOCBQUEUED;
        else
                dio_await_completion(dio);

So actually only direct IO read submission is protected by i_rwsem with
DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
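
For comparison, with the iomap-based DIO these patches build on, the
filesystem itself takes i_rwsem and keeps it across both submission and
the synchronous wait. The sketch below is illustrative only, not the
actual ext4 patch code, and it uses the iomap_dio_rw() signature of
that era (the trailing end_io argument has changed in later kernels):

/*
 * Sketch only, not actual ext4 code: under iomap DIO the filesystem
 * holds i_rwsem over submission *and* the synchronous wait, unlike
 * DIO_LOCKING above which drops it right after submission.
 */
static ssize_t ext4_dio_read_sketch(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock_shared(inode);
	/* Returns only after the IO has completed for synchronous iocbs. */
	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
	inode_unlock_shared(inode);
	return ret;
}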

After some thought I think the best solution for this is to just finally
finish the conversion of ext4 so that dioread_nolock is the only DIO path.
With i_rwsem held in shared mode even for "unlocked" DIO, it should be
actually relatively simple and most of the dances with unwritten extents
shouldn't be needed anymore.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-24 15:10   ` Jan Kara
@ 2019-09-24 19:48     ` Ritesh Harjani
  2019-09-25  9:23       ` Jan Kara
  2019-09-25  1:17     ` Joseph Qi
  1 sibling, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-24 19:48 UTC (permalink / raw)
  To: Jan Kara, Joseph Qi
  Cc: tytso, linux-ext4, david, hch, adilger, mbobrowski, rgoldwyn



On 9/24/19 8:40 PM, Jan Kara wrote:
> Hi Joseph!
> 
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch.
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same as reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer to the test data below.
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/     = with parallel dio reads
>> w/o    = reverting parallel dio reads
> 
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

He posted the same numbers when he posted the previous reverts too,
which I thought we had already noticed [1].
From [2] below, I assumed we knew this.

[2] - """
(note
that the patches actually improve performance of read-only DIO workload
when not using dioread_nolock as for that case, exclusive lock is 
replaced with a shared one)
"""


[1]  https://patchwork.ozlabs.org/patch/1153546/
[2] 
https://lore.kernel.org/linux-ext4/20190830153520.GB25069@quack2.suse.cz/

> 
>> w/o+   = reverting parallel dio reads + dioread_nolock
>> ilock  = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>>        |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
>> --------------------------------------------------------------
>> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
> 
> I'm really surprised by the numbers here. They would mean that when DIO

While testing my patches I noticed this again, but then, when I saw [2]
above, I thought we were aware of this.
My bad, I should have brought this point up before going
ahead with implementing our discussed solution.


> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
> 
>          if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>                  inode_unlock(dio->inode);
>          if (dio->is_async && retval == 0 && dio->result &&
>              (iov_iter_rw(iter) == READ || dio->result == count))
>                  retval = -EIOCBQUEUED;
>          else
>                  dio_await_completion(dio);
> 
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> 
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.

Sorry, I still didn't get this completely. Could you please explain a 
bit more?


> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.

Again, maybe it's related to the above comment. Could you please give some
insights?


Or do you mean that we should do it like this:
So as of now in dioread_nolock, we allocate blocks, mark the extent
entries as unwritten, then do the data IO, and then finally do the
conversion of unwritten to written extents.

So instead of that, we first only reserve the disk blocks (without
making any on-disk changes in the extent tree), do the data IO, and
then finally make an entry into the on-disk extent tree. And going
forward, keep only this as the default path.
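
Spelled out as a rough outline (the two stub functions below are purely
illustrative, not actual ext4 code, though EXT4_GET_BLOCKS_IO_CREATE_EXT
and ext4_convert_unwritten_extents() are real names):

/* Rough outline of the two orderings being contrasted. */
static void dioread_nolock_today(void)
{
	/* 1. allocate blocks and record them on disk as UNWRITTEN
	 *    (EXT4_GET_BLOCKS_IO_CREATE_EXT) */
	/* 2. submit the data IO */
	/* 3. on IO completion, convert UNWRITTEN -> written via
	 *    ext4_convert_unwritten_extents() */
}

static void reserve_blocks_first(void)
{
	/* 1. only reserve the blocks in memory; no on-disk extent entry yet */
	/* 2. submit the data IO */
	/* 3. on IO completion, insert the final written extent into the
	 *    on-disk extent tree in one step */
}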

The above is something I have been looking into for enabling
dioread_nolock for powerpc platforms where blocksize < page_size.
This is based upon an upstream discussion between Ted and you :)


But even with the above, in the case of extending writes, we will
still have to zero out those extending blocks, no? That will
require an exclusive inode lock anyway for zeroing
(the same is done in XFS too).

So going with the currently discussed solution of mounting with
dioread_nolock to provide performance scalability in a mixed read/write
workload should also be the right approach, no?
Also, looking at the numbers here [3] & [4], this patch seems
to improve performance with the dioread_nolock mount option.
Please help me understand your thoughts on this.

[3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
[4] - 
https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


-ritesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-24 15:10   ` Jan Kara
  2019-09-24 19:48     ` Ritesh Harjani
@ 2019-09-25  1:17     ` Joseph Qi
  1 sibling, 0 replies; 16+ messages in thread
From: Joseph Qi @ 2019-09-25  1:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ritesh Harjani, tytso, linux-ext4, david, hch, adilger,
	mbobrowski, rgoldwyn



On 19/9/24 23:10, Jan Kara wrote:
> Hi Joseph!
> 
> On Wed 18-09-19 14:35:15, Joseph Qi wrote:
>> On 19/9/17 18:32, Ritesh Harjani wrote:
>>> Hello,
>>>
>>> This patch series is based on the upstream discussion with Jan
>>> & Joseph @ [1].
>>> It is based on top of Matthew's v3 ext4 iomap patch series [2]
>>>
>>> Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
>>> inode_lock/unlock instances from fs/ext4/*
>>>
>>> For now I already accounted for trylock/lock issue symantics
>>> (which was discussed here [3]) in the same patch,
>>> since the this whole patch was around inode_lock/unlock API,
>>> so I thought it will be best to address that issue in the same patch. 
>>> However, kindly let me know if otherwise.
>>>
>>> Patch-2: Commit msg of this patch describes in detail about
>>> what it is doing.
>>> In brief - we try to first take the shared lock (instead of exclusive
>>> lock), unless it is a unaligned_io or extend_io. Then in
>>> ext4_dio_write_checks(), if we start with shared lock, we see
>>> if we can really continue with shared lock or not. If not, then
>>> we release the shared lock then acquire exclusive lock
>>> and restart ext4_dio_write_checks().
>>>
>>>
>>> Tested against few xfstests (with dioread_nolock mount option),
>>> those ran fine (ext4 & generic).
>>>
>>> I tried testing performance numbers on my VM (since I could not get
>>> hold of any real h/w based test device). I could test the fact
>>> that earlier we were trying to do downgrade_write() lock, but with
>>> this patch, that path is now avoided for fio test case
>>> (as reported by Joseph in [4]).
>>> But for the actual results, I am not sure if VM machine testing could
>>> really give the reliable perf numbers which we want to take a look at.
>>> Though I do observe some form of perf improvements, but I could not
>>> get any reliable numbers (not even with the same list of with/without
>>> patches with which Joseph posted his numbers [1]).
>>>
>>>
>>> @Joseph,
>>> Would it be possible for you to give your test case a run with this
>>> patches? That will be really helpful.
>>>
>>> Branch for this is hosted at below tree.
>>>
>>> https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC
>>>
>> I've tested your branch, the result is:
>> mounting with dioread_nolock, it behaves the same as reverting
>> parallel dio reads + dioread_nolock;
>> while mounting without dioread_nolock, no improvement, or even worse.
>> Please refer to the test data below.
>>
>> fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
>> -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
>> -size=20G -numjobs=8 -runtime=600 -group_reporting
>>
>> w/     = with parallel dio reads
>> w/o    = reverting parallel dio reads
> 
> This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
> right?

Yes, actually, it also reverts the related patches:

Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
Revert "ext4: fix off-by-one error when writing back pages before dio read"
Revert "ext4: Allow parallel DIO reads"

> 
>> w/o+   = reverting parallel dio reads + dioread_nolock
>> ilock  = ext4-ilock-RFC
>> ilock+ = ext4-ilock-RFC + dioread_nolock
>>
>> bs=4k:
>> --------------------------------------------------------------
>>       |            READ           |           WRITE          |
>> --------------------------------------------------------------
>> w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
>> --------------------------------------------------------------
>> w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
>> --------------------------------------------------------------
> 
> I'm really surprised by the numbers here. They would mean that when DIO
> read takes i_rwsem exclusive lock instead of shared, it is a win for your
> workload... Argh, now checking code in fs/direct-io.c I think I can see the
> difference. The trick in do_blockdev_direct_IO() is:
> 
>         if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>                 inode_unlock(dio->inode);
>         if (dio->is_async && retval == 0 && dio->result &&
>             (iov_iter_rw(iter) == READ || dio->result == count))
>                 retval = -EIOCBQUEUED;
>         else
>                 dio_await_completion(dio);
> 
> So actually only direct IO read submission is protected by i_rwsem with
> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> 
> After some thought I think the best solution for this is to just finally
> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> actually relatively simple and most of the dances with unwritten extents
> shouldn't be needed anymore.
> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-24 19:48     ` Ritesh Harjani
@ 2019-09-25  9:23       ` Jan Kara
  2019-09-26 12:34         ` Ritesh Harjani
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2019-09-25  9:23 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, Joseph Qi, tytso, linux-ext4, david, hch, adilger,
	mbobrowski, rgoldwyn

On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > difference. The trick in do_blockdev_direct_IO() is:
> > 
> >          if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> >                  inode_unlock(dio->inode);
> >          if (dio->is_async && retval == 0 && dio->result &&
> >              (iov_iter_rw(iter) == READ || dio->result == count))
> >                  retval = -EIOCBQUEUED;
> >          else
> >                  dio_await_completion(dio);
> > 
> > So actually only direct IO read submission is protected by i_rwsem with
> > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> > 
> > After some thought I think the best solution for this is to just finally
> > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> 
> Sorry, I still didn't get this completely. Could you please explain a bit
> more?

Well, currently we have two different locking schemes for DIO - the
"normal" case and the "dioread_nolock" case. And the "normal" case really
only exists because buffered writeback needed to be more careful (so that
nolock DIO cannot expose stale data) and nobody put in the effort to make that
work when blocksize < pagesize. But having two different locking schemes
for DIO is really confusing to users and a maintenance burden so we want to
get rid of the old scheme once the "dioread_nolock" scheme works for all
the configurations.
 
> > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > actually relatively simple and most of the dances with unwritten extents
> > shouldn't be needed anymore.
> 
> Again, maybe it's related to the above comment. Could you please give some
> insights?

Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
"unlocked" anymore. Which actually very much limits the races with buffered
writes and thus page writeback (because we flush page cache before doing
DIO).

> Or do you mean that we should do it like this:
> So as of now in dioread_nolock, we allocate blocks, mark the extent
> entries as unwritten, then do the data IO, and then finally do the
> conversion of unwritten to written extents.

So this allocation of extents as unwritten in dioread_nolock mode is now
mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
submitting any DIO. Because we flush page cache before starting DIO and new
pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
with page writeback and thus cannot expose stale block contents. There is
one exception though - which is why I wrote "mostly" above - pages can
still be dirtied through memory mappings (there's no i_rwsem protection for
mmap writes) and thus races with page writeback exposing stale data are still
theoretically possible. In fact the "normal" DIO mode has had this kind of
race for as long as ext4 has existed.

I guess we could fix this by falling back to page writeback using unwritten
extents when we have dirty pages locked for writeback and see there's any
DIO in flight for the inode. Sadly that means we cannot get rid of
writeback code using unwritten extents but at least it won't be hurting
performance in the common case.
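
Something like the check below, perhaps. This is only a sketch of the
idea, not real code: i_dio_count and inode_dio_begin()/inode_dio_end()
are existing kernel primitives, but the helper name is made up.

/*
 * Sketch of the suggested fallback: while any DIO is in flight on the
 * inode, let writeback keep allocating blocks as unwritten so a racing
 * DIO read cannot observe stale block contents.
 */
static bool ext4_writeback_must_use_unwritten(struct inode *inode)
{
	/* i_dio_count is maintained by inode_dio_begin()/inode_dio_end() */
	return atomic_read(&inode->i_dio_count) > 0;
}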

> So instead of that, we first only reserve the disk blocks (without
> making any on-disk changes in the extent tree), do the data IO, and
> then finally make an entry into the on-disk extent tree. And going
> forward, keep only this as the default path.
> 
> The above is something I have been looking into for enabling
> dioread_nolock for powerpc platforms where blocksize < page_size.
> This is based upon an upstream discussion between Ted and you :)

Yes, this is an even better solution to the dioread_nolock "problem" but it
is also more work than just dropping the old DIO locking mode and fixing
writeback using unwritten extents for blocksize < pagesize.
 
> But even with the above, in the case of extending writes, we will
> still have to zero out those extending blocks, no? That will
> require an exclusive inode lock anyway for zeroing
> (the same is done in XFS too).

Yes, extending writes are a different matter.

> So going with the currently discussed solution of mounting with
> dioread_nolock to provide performance scalability in a mixed read/write
> workload should also be the right approach, no?
> Also, looking at the numbers here [3] & [4], this patch seems
> to improve performance with the dioread_nolock mount option.
> Please help me understand your thoughts on this.

Yes, your patches are definitely going in the right direction. What I'm
discussing is mostly how to make ext4 perform well for a mixed read/write
workload by default, without the user having to use a magic mount option.

> [3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
> [4] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-25  9:23       ` Jan Kara
@ 2019-09-26 12:34         ` Ritesh Harjani
  2019-09-26 13:47           ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Ritesh Harjani @ 2019-09-26 12:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Joseph Qi, tytso, linux-ext4, david, hch, adilger, mbobrowski, rgoldwyn

Hello Jan,

Thanks for answering it all and giving all the info.

On 9/25/19 2:53 PM, Jan Kara wrote:
> On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
>>> read takes i_rwsem exclusive lock instead of shared, it is a win for your
>>> workload... Argh, now checking code in fs/direct-io.c I think I can see the
>>> difference. The trick in do_blockdev_direct_IO() is:
>>>
>>>           if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
>>>                   inode_unlock(dio->inode);
>>>           if (dio->is_async && retval == 0 && dio->result &&
>>>               (iov_iter_rw(iter) == READ || dio->result == count))
>>>                   retval = -EIOCBQUEUED;
>>>           else
>>>                   dio_await_completion(dio);
>>>
>>> So actually only direct IO read submission is protected by i_rwsem with
>>> DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
>>>
>>> After some thought I think the best solution for this is to just finally
>>> finish the conversion of ext4 so that dioread_nolock is the only DIO path.
>>
>> Sorry, I still didn't get this completely. Could you please explain a bit
>> more?
> 
> Well, currently we have two different locking schemes for DIO - the
> "normal" case and the "dioread_nolock" case. And the "normal" case really
> only exists because buffered writeback needed to be more careful (so that
> nolock DIO cannot expose stale data) and nobody put in the effort to make that
> work when blocksize < pagesize. But having two different locking schemes
> for DIO is really confusing to users and a maintenance burden so we want to
> get rid of the old scheme once the "dioread_nolock" scheme works for all
> the configurations.

Agreed.

>   
>>> With i_rwsem held in shared mode even for "unlocked" DIO, it should be
>>> actually relatively simple and most of the dances with unwritten extents
>>> shouldn't be needed anymore.
>>
>> Again, maybe it's related to the above comment. Could you please give some
>> insights?
> 
> Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> "unlocked" anymore. Which actually very much limits the races with buffered
> writes and thus page writeback (because we flush page cache before doing
> DIO).

So looking at the code again based on your inputs from above, we should
be able to remove this condition <snip below> in ext4_dio_write_checks.

What I meant is, in the DIO write path we can start with the shared
lock (which the current patch is doing), and only check the below
conditions <snip below> for it to continue using the shared lock
(i.e. we need not check ext4_should_dioread_nolock(inode) anymore).

That means there should not be any race for the non-extent-based mode
either, right?


Because for overwrites in DIO (for both extents & non-extents), just
reiterating:

1. A race between buffered IO writes and DIO overwrites is protected
against via the SHARED inode lock in DIO overwrites & the EXCL lock in
buffered IO writes. Plus we flush the page cache before doing DIO.

2. A race between buffered IO reads & DIO overwrites is protected
against anyway, since we don't do any block allocation during a DIO
overwrite.

3. Again, a race between DIO reads & DIO overwrites is not a problem,
since no block allocation is done anyway. So no stale data will be
exposed.


<snip of change we should do in ext4_dio_write_checks>

         if (*iolock == EXT4_IOLOCK_SHARED &&
             (!IS_NOSEC(inode) || *unaligned_io || *extend ||
-            !ext4_should_dioread_nolock(inode) ||
              !ext4_overwrite_io(inode, offset, count))) {
                 ext4_iunlock(inode, *iolock);
                 *iolock = EXT4_IOLOCK_EXCL;
                 ext4_ilock(inode, *iolock);
                 goto restart;
         }

> 
>> Or do you mean that we should do it like this:
>> So as of now in dioread_nolock, we allocate blocks, mark the extent
>> entries as unwritten, then do the data IO, and then finally do the
>> conversion of unwritten to written extents.
> 
> So this allocation of extents as unwritten in dioread_nolock mode is now
> mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> submitting any DIO. Because we flush page cache before starting DIO and new
> pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> with page writeback and thus cannot expose stale block contents. There is
> one exception though - which is why I wrote "mostly" above - pages can
> still be dirtied through memory mappings (there's no i_rwsem protection for
> mmap writes) and thus races with page writeback exposing stale data are still
> theoretically possible. In fact the "normal" DIO mode has this kind of race
> all the time ext4 exists.

Yes, now that you have said it, I can see the below race even with the
current code. Any other races?

i.e.
ext4_dio_read                            ext4_page_mkwrite()

    filemap_write_and_wait_range()
                                             ext4_get_blocks()
    submit_bio
    // this could expose the stale data
                                             mark_buffer_dirty()


Ok. I guess we cannot use any exclusive inode lock in ext4_page_mkwrite,
because then we lose the parallelism that it offers right now.
I wonder how other filesystems protect against this race (XFS, say?).


> I guess we could fix this by falling back to page writeback using unwritten
> extents when we have dirty pages locked for writeback and see there's any
> DIO in flight for the inode. Sadly that means we cannot get rid of
> writeback code using unwritten extents but at least it won't be hurting
> performance in the common case.


1. So why check for dirty pages locked for writeback (in
ext4_page_mkwrite)? We anyway lock the page which we modify or
allocate a block for, so a race against buffered reads should not
happen.

2. And even if we check via inode_dio_wait(), we still can't guarantee
that the next DIO won't snoop in while we are in ext4_page_mkwrite?

I am definitely missing something here.


> 
>> So instead of that, we first only reserve the disk blocks (without
>> making any on-disk changes in the extent tree), do the data IO, and
>> then finally make an entry into the on-disk extent tree. And going
>> forward, keep only this as the default path.
>>
>> The above is something I have been looking into for enabling
>> dioread_nolock for powerpc platforms where blocksize < page_size.
>> This is based upon an upstream discussion between Ted and you :)
> 
> Yes, this is an even better solution to the dioread_nolock "problem" but it is
> also more work 

Yes, I agree that we should do this incrementally.


> than just dropping the old DIO locking mode and
> fixing writeback using unwritten extents for blocksize < pagesize.


So, I was looking into this (to support dioread_nolock for
blocksize < pagesize) but I could not find any history in git,
nor any thread which explains the problem.

I got the below understanding while going over the code:

1. This problem may be somehow related to bufferheads in the
extent conversion from unwritten to written in the writeback path.
But I couldn't see what exactly the issue is there.

I see that the completion path happens via
ext4_end_io
    |- ext4_convert_unwritten_extents() // offset & size are properly taken care of
    |- ext4_release_io_end() (which is the same in both DIO & writeback).


2. One thing which I noticed is the FIXME in
mpage_map_and_submit_buffers(), where we add the
whole PAGE_SIZE to io->io_end->size.
I guess we should update the size here more carefully,
based on the buffer_heads.


Could you please give some pointers as to what
the limitation is, or some hint which I can go
and check myself?



>> But even with the above, in the case of extending writes, we will
>> still have to zero out those extending blocks, no? That will
>> require an exclusive inode lock anyway for zeroing
>> (the same is done in XFS too).
> 
> Yes, extending writes are a different matter.
> 
>> So going with the currently discussed solution of mounting with
>> dioread_nolock to provide performance scalability in a mixed read/write
>> workload should also be the right approach, no?
>> Also, looking at the numbers here [3] & [4], this patch seems
>> to improve performance with the dioread_nolock mount option.
>> Please help me understand your thoughts on this.
> 
> Yes, your patches are definitely going in the right direction. What I'm
> discussing is mostly how to make ext4 perform well for a mixed read/write
> workload by default without the user having to use a magic mount option.

Thanks a lot for your support here.

So can we get these patches upstream incrementally?
i.e.
1. As a first step, we can remove this ext4_should_dioread_nolock
condition from the current patchset (as mentioned above too) and
get this patch rebased on top of the iomap patchset. We should be
good to merge this patchset then, right? Since this should be able
to improve the performance even without the dioread_nolock
mount option.


2. Meanwhile I will continue to look into the blocksize < pagesize
requirement for dioread_nolock. We can then follow the plan
you mentioned above.

Your thoughts?


-ritesh


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path
  2019-09-26 12:34         ` Ritesh Harjani
@ 2019-09-26 13:47           ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2019-09-26 13:47 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Jan Kara, Joseph Qi, tytso, linux-ext4, david, hch, adilger,
	mbobrowski, rgoldwyn

Hello,

On Thu 26-09-19 18:04:55, Ritesh Harjani wrote:
> On 9/25/19 2:53 PM, Jan Kara wrote:
> > On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
> > > > read takes i_rwsem exclusive lock instead of shared, it is a win for your
> > > > workload... Argh, now checking code in fs/direct-io.c I think I can see the
> > > > difference. The trick in do_blockdev_direct_IO() is:
> > > > 
> > > >           if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> > > >                   inode_unlock(dio->inode);
> > > >           if (dio->is_async && retval == 0 && dio->result &&
> > > >               (iov_iter_rw(iter) == READ || dio->result == count))
> > > >                   retval = -EIOCBQUEUED;
> > > >           else
> > > >                   dio_await_completion(dio);
> > > > 
> > > > So actually only direct IO read submission is protected by i_rwsem with
> > > > DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.
> > > > 
> > > > After some thought I think the best solution for this is to just finally
> > > > finish the conversion of ext4 so that dioread_nolock is the only DIO path.
> > > 
> > > Sorry, I still didn't get this completely. Could you please explain a bit
> > > more?
> > 
> > Well, currently we have two different locking schemes for DIO - the
> > "normal" case and the "dioread_nolock" case. And the "normal" case really
> > only exists because buffered writeback needed to be more careful (so that
> > nolock DIO cannot expose stale data) and nobody put in the effort to make that
> > work when blocksize < pagesize. But having two different locking schemes
> > for DIO is really confusing to users and a maintenance burden so we want to
> > get rid of the old scheme once the "dioread_nolock" scheme works for all
> > the configurations.
> 
> Agreed.
> 
> > > > With i_rwsem held in shared mode even for "unlocked" DIO, it should be
> > > > actually relatively simple and most of the dances with unwritten extents
> > > > shouldn't be needed anymore.
> > > 
> > > Again, maybe it's related to above comment. Could you please give some
> > > insights?
> > 
> > Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
> > "unlocked" anymore. Which actually very much limits the races with buffered
> > writes and thus page writeback (because we flush page cache before doing
> > DIO).
> 
> So looking at the code again based on your inputs from above, we should
> be able to remove this condition <snip below> in ext4_dio_write_checks.
> 
> What I meant is, in the DIO write path we can start with the shared lock
> (which the current patch is doing), and only check the below conditions
> <snip below> for it to continue using the shared lock
> (i.e. we need not check ext4_should_dioread_nolock(inode) anymore).
> 
> That means there should not be any race for the non-extent-based mode
> either, right?

Yes, that is correct. But please do this as a separate change with a good
explanation of why this is OK.
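
Just so we are on the same page, the check that remains is essentially
what ext4_overwrite_io() does, i.e. roughly this (a sketch, modulo
details like the i_size check):

	struct ext4_map_blocks map;
	unsigned int blkbits = inode->i_blkbits;
	int blklen, err;

	map.m_lblk = offset >> blkbits;
	map.m_len = ((offset + count - 1) >> blkbits) - map.m_lblk + 1;
	blklen = map.m_len;

	/* lookup only - no allocation, no unwritten conversion */
	err = ext4_map_blocks(NULL, inode, &map, 0);

	/* a pure overwrite: the whole range is already mapped & written */
	return err == blklen && (map.m_flags & EXT4_MAP_MAPPED);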

> Because for overwrites in DIO (for both extents & non-extents) -
> (just reiterating)
> 
> 1. A race between buffered IO writes and DIO overwrites is protected
> against via the SHARED inode lock in DIO overwrites & the EXCL lock in
> buffered IO writes. Plus we flush the page cache before doing DIO.
> 
> 2. A race between buffered IO reads & DIO overwrites is anyway protected
> against, since we don't do any block allocations during a DIO overwrite.
> 
> 3. Again, a race between DIO reads & DIO overwrites is not a problem,
> since no block allocation is done anyway. So no stale data will be
> exposed.
> 
> 
> <snip of change we should do in ext4_dio_write_checks>
> 
>         if (*iolock == EXT4_IOLOCK_SHARED &&
>             (!IS_NOSEC(inode) || *unaligned_io || *extend ||
> -            !ext4_should_dioread_nolock(inode) ||
>              !ext4_overwrite_io(inode, offset, count))) {
>                 ext4_iunlock(inode, *iolock);
>                 *iolock = EXT4_IOLOCK_EXCL;
>                 ext4_ilock(inode, *iolock);
>                 goto restart;
>         }
> 
> > 
> > > Or do you mean that we should do it like this-
> > > So as of now in dioread_nolock, we allocate blocks, mark the entry into
> > > extents as unwritten, then do the data IO, and then finally do the
> > > conversion of unwritten to written extents.
> > 
> > So this allocation of extents as unwritten in dioread_nolock mode is now
> > mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
> > submitting any DIO. Because we flush page cache before starting DIO and new
> > pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
> > with page writeback and thus cannot expose stale block contents. There is
> > one exception though - which is why I wrote "mostly" above - pages can
> > still be dirtied through memory mappings (there's no i_rwsem protection for
> > mmap writes) and thus races with page writeback exposing stale data are
> > still theoretically possible. In fact the "normal" DIO mode has had this
> > kind of race for as long as ext4 has existed.
> 
> Yes, now that you mention it, I can see the below race even with the
> current code. Are there any other races too?
> 
> i.e.
> ext4_dio_read			ext4_page_mkwrite()
> 
>     filemap_write_and_wait_range()
> 				     ext4_get_blocks()
> 
>     submit_bio
>     // this could expose the stale data.
> 		
> 				     mark_buffer_dirty()

Yes, this is what I meant. Although note that in most cases
ext4_get_blocks() from ext4_page_mkwrite() will just reserve delalloc
blocks, so you additionally need writeback to happen on that inode to
allocate the delalloc block, and only after that can the call of
ext4_iomap_begin() + submit_bio() from iomap_dio_rw() expose stale data.
So I don't think the race is very easy to trigger.
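
Schematically, all of the below have to line up before stale data can
actually be read (illustration only):

	ext4_page_mkwrite()     writeback              DIO read
	-------------------     ---------              --------
	reserve delalloc block
	                        allocate real block
	                                               ext4_iomap_begin()
	                                               submit_bio() <- stale
	                        write out page data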

> Ok. I guess we cannot take any exclusive inode lock in ext4_page_mkwrite,
> because then we lose the parallelism that it offers right now.
> I wonder how other filesystems protect against this race (XFS, for example?)

You cannot get i_rwsem at all in ext4_page_mkwrite() because mm code
holds mmap_sem when calling into ext4_page_mkwrite(). And mmap_sem ranks
below i_rwsem. And the deadlock is actually real because e.g. direct IO
code holds i_rwsem and then grabs mmap_sem as part of
bio_iov_iter_get_pages().
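
Schematically (an illustration of the ordering only, not exact call
chains):

	direct IO write                      page fault
	---------------                      ----------
	inode_lock(inode)    /* i_rwsem */
	iomap_dio_rw()
	  bio_iov_iter_get_pages()
	                                     down_read(&mm->mmap_sem)
	    down_read(&mm->mmap_sem)         ext4_page_mkwrite()
	      -> blocks                        inode_lock(inode)
	                                         -> blocks: ABBA deadlock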

XFS always uses unwritten extents when writing out pages which prevents
this race.

> > I guess we could fix this by falling back to page writeback using unwritten
> > extents when we have dirty pages locked for writeback and see there's any
> > DIO in flight for the inode. Sadly that means we cannot get rid of
> > writeback code using unwritten extents but at least it won't be hurting
> > performance in the common case.
> 
> 1. So why check for dirty pages locked for writeback (in
> ext4_page_mkwrite)? We anyway lock the page which we modify or
> allocate a block for, so a race against buffered reads should not happen.
> 
> 2. And even if we check with inode_dio_wait(), we still can't guarantee
> that the next DIO may not sneak in while we are in ext4_page_mkwrite?
> 
> I am definitely missing something here.

I was speaking about the code in ext4_writepages(). The block mapping in
ext4_page_mkwrite() can always use unwritten extents - that is just a
fallback path in case delayed allocation is disabled or we are running out
of disk space. The code in ext4_writepages() needs to decide whether to
write back using unwritten extents or whether we can use normal ones. And
in that place I suggest replacing the current "ext4_should_dioread_nolock()"
check with "is there any DIO in flight against the inode?". And to make the
check reliable (without holding i_rwsem), you need to do it only after you
have all pages locked for writeback.
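
Something like the below is what I have in mind ('use_unwritten' is
just a made-up local for illustration, not a tested patch):

	/*
	 * Decide this only once all pages in the range are locked for
	 * writeback, otherwise the check is racy. i_dio_count is
	 * maintained by inode_dio_begin()/inode_dio_end(), so non-zero
	 * means DIO is in flight and we must write back via unwritten
	 * extents.
	 */
	bool use_unwritten = atomic_read(&inode->i_dio_count) != 0;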

> > > So instead of that we first only reserve the disk blocks, (without
> > > making any on-disk changes in extent tree), do the data IO and then
> > > finally make an entry into extent tree on disk. And going
> > > forward only keep this as the default path.
> > > 
> > > The above is something I have been looking into for enabling
> > > dioread_nolock for powerpc platforms where blocksize < page_size.
> > > This is based upon an upstream discussion between Ted and you :)
> > 
> > Yes, this is an even better solution to the dioread_nolock "problem" but
> > it is also more work
> 
> Yes, I agree that we should do this incrementally.
> 
> > than just dropping the old DIO locking mode and
> > fixing writeback using unwritten extents for blocksize < pagesize.
> 
> 
> So, I was looking into this (to support dioread_nolock for
> blocksize < pagesize) but I could not find any history in git,
> nor any thread which explains the problem.

Yeah, part of the problem is that we have only a rough understanding of
what the problem actually is :)

> I got the below understanding while going over the code -
> 
> 1. This problem may be somehow related to buffer heads in the
> extent conversion from unwritten to written in the writeback path.
> But I couldn't see what exactly the issue is there.
> 
> I see that the completion path happens via
> ext4_end_io
>    |- ext4_convert_unwritten_extents() // offset & size are properly
>                                        // taken care of
>    |- ext4_release_io_end() (which is the same in both DIO & writeback).
> 
> 
> 2. One thing which I noticed is the FIXME in
> mpage_map_and_submit_buffers(), where we add the whole
> PAGE_SIZE to io->io_end->size. I guess we should update
> the size here carefully based on the buffer heads.
> 
> Could you please give some pointers as to what
> the limitation is, or some hint which I can go
> and check myself?

You are correct that the problem is in ext4_writepages() (and functions
called from there). Essentially the whole code there needs verification
whether unwritten extent handling works correctly when blocksize <
pagesize. As you noted, there are a couple of FIXMEs there where I was aware
that things would break but there may be other problems I've missed.
I remember things can get really complex there when there are multiple
unwritten extents underlying the page and we need to write them out.
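
E.g. with 1k blocks on a 4k page, you can end up with something like
(schematic):

	4k page:  | unwritten A | written | unwritten B | delalloc |
	             1k block     1k block    1k block     1k block

and each of those pieces needs its own mapping and conversion handling
during writeback of that single page.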

> > > But even with the above, in case of extending writes, we will
> > > still have to zero out those extending blocks, no? Which
> > > will require an exclusive inode lock anyway for zeroing
> > > (the same as is done in XFS too).
> > 
> > Yes, extending writes are a different matter.
> > 
> > > So going with the currently discussed solution of mounting with
> > > dioread_nolock to provide performance scalability in mixed read/write
> > > workloads should also be the right approach, no?
> > > Also looking at the numbers here [3] & [4], this patch also seems
> > > to improve the performance with the dioread_nolock mount option.
> > > Please help me understand your thoughts on this.
> > 
> > Yes, your patches are definitely going in the right direction. What I'm
> > discussing is mostly how to make ext4 perform well for a mixed read/write
> > workload by default without the user having to use a magic mount option.
> 
> Thanks a lot for your support here.
> 
> So can we get these patches upstream incrementally?
> i.e.
> 1. As a first step, we can remove this ext4_should_dioread_nolock
> condition from the current patchset (as mentioned above too) and
> get this patch rebased on top of the iomap patchset. We should be
> good to merge this patchset then, right? Since this should be able
> to improve the performance even without the dioread_nolock
> mount option.

Yes, correct.

> 2. Meanwhile I will continue to look into the blocksize < pagesize
> requirement for dioread_nolock. We can then follow the plan
> you mentioned above.

Good.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-09-26 13:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-17 10:32 [RFC 0/2] ext4: Improve locking sequence in DIO write path Ritesh Harjani
2019-09-17 10:32 ` [RFC 1/2] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
2019-09-17 10:32 ` [RFC 2/2] ext4: Improve DIO writes locking sequence Ritesh Harjani
2019-09-18  0:58 ` [RFC 0/2] ext4: Improve locking sequence in DIO write path Joseph Qi
2019-09-18  6:35 ` Joseph Qi
2019-09-18 10:03   ` Ritesh Harjani
2019-09-18 10:57     ` Joseph Qi
2019-09-19  2:08     ` Joseph Qi
2019-09-19 18:48       ` Ritesh Harjani
2019-09-23  6:19       ` Ritesh Harjani
2019-09-24 15:10   ` Jan Kara
2019-09-24 19:48     ` Ritesh Harjani
2019-09-25  9:23       ` Jan Kara
2019-09-26 12:34         ` Ritesh Harjani
2019-09-26 13:47           ` Jan Kara
2019-09-25  1:17     ` Joseph Qi
