Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
* RFC: hold i_rwsem until aio completes
@ 2020-01-14 16:12 Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 01/12] mm: fix a comment in sys_swapon Christoph Hellwig
                   ` (15 more replies)
  0 siblings, 16 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Hi all,

Asynchronous read/write operations currently use a rather magic locking
scheme, were access to file data is normally protected using a rw_semaphore,
but if we are doing aio where the syscall returns to userspace before the
I/O has completed we also use an atomic_t to track the outstanding aio
ops.  This scheme has lead to lots of subtle bugs in file systems where
didn't wait to the count to reach zero, and due to its adhoc nature also
means we have to serialize direct I/O writes that are smaller than the
file system block size.

All this is solved by releasing i_rwsem only when the I/O has actually
completed, but doings so is against to mantras of Linux locking primites:

 (1) no unlocking by another process than the one that acquired it
 (2) no return to userspace with locks held

It actually happens we have various places that work around this.  A few
callers do non-owner unlocks of rwsems, which are pretty nasty for
PREEMPT_RT as the owner tracking doesn't work.  OTOH the file system
freeze code has both problems and works around them a little better,
although in a somewhat awkward way, in that it releases the lockdep
object when returning to userspace, and reacquires it when done, and
also clears the rwsem owner when returning to userspace, and then sets
the new onwer before unlocking.

This series tries to follow that scheme, also it doesn't fully work.  The
first issue is that the rwsem code has a bug where it doesn't properly
handle clearing the owner.  This series has a patch to fix that, but it
is ugly and might not be correct so some help is needed.  Second I/O
completions often come from interrupt context, which means the re-acquire
is recorded as from irq context, leading to warnings about incorrect
contexts.  I wonder if we could just have a bit in lockdep that says
returning to userspace is ok for this particular lock?  That would also
clean up the fsfreeze situation a lot.

Let me know what you think of all this.  While I converted all the iomap
using file systems only XFS is actually tested.

Diffstat:

 24 files changed, 144 insertions(+), 180 deletions(-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/12] mm: fix a comment in sys_swapon
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-02-10 23:29   ` Andrew Morton
  2020-01-14 16:12 ` [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner Christoph Hellwig
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

claim_swapfile now always takes i_rwsem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index bb3261d45b6a..fe6e4c1add0b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3157,7 +3157,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	mapping = swap_file->f_mapping;
 	inode = mapping->host;
 
-	/* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
+	/* will take i_rwsem; */
 	error = claim_swapfile(p, inode);
 	if (unlikely(error))
 		goto bad_swap;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 01/12] mm: fix a comment in sys_swapon Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 18:17   ` Waiman Long
  2020-01-14 16:12 ` [PATCH 03/12] xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_read Christoph Hellwig
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

The rwsem code overloads the owner field with either a task struct or
negative magic numbers.  Add a quick hack to catch these negative
values early on.  Without this spinning on a writer that replaced the
owner with RWSEM_OWNER_UNKNOWN, rwsem_spin_on_owner can crash while
deferencing the task_struct ->on_cpu field of a -8 value.

XXX: This might be a bit of a hack as the code otherwise doesn't use
the ERR_PTR family macros, better suggestions welcome.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 kernel/locking/rwsem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44e68761f432..6adc719a30a1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -725,6 +725,8 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
 	state = rwsem_owner_state(owner, flags, nonspinnable);
 	if (state != OWNER_WRITER)
 		return state;
+	if (IS_ERR(owner))
+		return state;
 
 	rcu_read_lock();
 	for (;;) {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 03/12] xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_read
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 01/12] mm: fix a comment in sys_swapon Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 04/12] gfs2: move setting current->backing_dev_info Christoph Hellwig
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Direct I/O reads can also be used with RWF_NOWAIT & co.  Fix the inode
locking in xfs_file_dio_aio_read to take IOCB_NOWAIT into account.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c93250108952..b8a4a3f29b36 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -187,7 +187,12 @@ xfs_file_dio_aio_read(
 
 	file_accessed(iocb->ki_filp);
 
-	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
 			is_sync_kiocb(iocb));
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 04/12] gfs2: move setting current->backing_dev_info
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 03/12] xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_read Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 05/12] gfs2: fix O_SYNC write handling Christoph Hellwig
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Only set current->backing_dev_info just around the buffered write calls
to prepare for the next fix.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/file.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 9d58295ccf7a..21d032c4b077 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -867,18 +867,15 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	inode_lock(inode);
 	ret = generic_write_checks(iocb, from);
 	if (ret <= 0)
-		goto out;
-
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = inode_to_bdi(inode);
+		goto out_unlock;
 
 	ret = file_remove_privs(file);
 	if (ret)
-		goto out2;
+		goto out_unlock;
 
 	ret = file_update_time(file);
 	if (ret)
-		goto out2;
+		goto out_unlock;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		struct address_space *mapping = file->f_mapping;
@@ -887,11 +884,13 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 		written = gfs2_file_direct_write(iocb, from);
 		if (written < 0 || !iov_iter_count(from))
-			goto out2;
+			goto out_unlock;
 
+		current->backing_dev_info = inode_to_bdi(inode);
 		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+		current->backing_dev_info = NULL;
 		if (unlikely(ret < 0))
-			goto out2;
+			goto out_unlock;
 		buffered = ret;
 
 		/*
@@ -915,14 +914,14 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			 */
 		}
 	} else {
+		current->backing_dev_info = inode_to_bdi(inode);
 		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+		current->backing_dev_info = NULL;
 		if (likely(ret > 0))
 			iocb->ki_pos += ret;
 	}
 
-out2:
-	current->backing_dev_info = NULL;
-out:
+out_unlock:
 	inode_unlock(inode);
 	if (likely(ret > 0)) {
 		/* Handle various SYNC-type writes */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 05/12] gfs2: fix O_SYNC write handling
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 04/12] gfs2: move setting current->backing_dev_info Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-02-06 15:31   ` [Cluster-devel] " Andreas Gruenbacher
  2020-01-14 16:12 ` [PATCH 06/12] iomap: pass a flags value to iomap_dio_rw Christoph Hellwig
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Don't ignore the return value from generic_write_sync for the direct to
buffered I/O callback case when written is non-zero.  Also don't bother
to call generic_write_sync for the pure direct I/O case, as iomap_dio_rw
already takes care of that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/file.c | 51 +++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 21d032c4b077..86c0e61407b6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -847,7 +847,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct gfs2_inode *ip = GFS2_I(inode);
-	ssize_t written = 0, ret;
+	ssize_t ret = 0;
 
 	ret = gfs2_rsqa_alloc(ip);
 	if (ret)
@@ -882,52 +882,51 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		loff_t pos, endbyte;
 		ssize_t buffered;
 
-		written = gfs2_file_direct_write(iocb, from);
-		if (written < 0 || !iov_iter_count(from))
+		ret = gfs2_file_direct_write(iocb, from);
+		if (ret < 0 || !iov_iter_count(from))
 			goto out_unlock;
 
 		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+		buffered = iomap_file_buffered_write(iocb, from,
+						     &gfs2_iomap_ops);
 		current->backing_dev_info = NULL;
-		if (unlikely(ret < 0))
+		if (unlikely(buffered <= 0)) {
+			if (buffered < 0)
+				ret = buffered;
 			goto out_unlock;
-		buffered = ret;
+		}
 
 		/*
 		 * We need to ensure that the page cache pages are written to
 		 * disk and invalidated to preserve the expected O_DIRECT
-		 * semantics.
+		 * semantics.  If the writeback or invalidate fails only report
+		 * the direct I/O range as we don't know if the buffered pages
+		 * made it to disk.
 		 */
 		pos = iocb->ki_pos;
 		endbyte = pos + buffered - 1;
 		ret = filemap_write_and_wait_range(mapping, pos, endbyte);
-		if (!ret) {
-			iocb->ki_pos += buffered;
-			written += buffered;
-			invalidate_mapping_pages(mapping,
-						 pos >> PAGE_SHIFT,
-						 endbyte >> PAGE_SHIFT);
-		} else {
-			/*
-			 * We don't know how much we wrote, so just return
-			 * the number of bytes which were direct-written
-			 */
-		}
+		if (ret)
+			goto out_unlock;
+
+		invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT,
+					 endbyte >> PAGE_SHIFT);
+		ret += buffered;
 	} else {
 		current->backing_dev_info = inode_to_bdi(inode);
 		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 		current->backing_dev_info = NULL;
-		if (likely(ret > 0))
-			iocb->ki_pos += ret;
+		if (unlikely(ret <= 0))
+			goto out_unlock;
 	}
 
+	iocb->ki_pos += ret;
+	inode_unlock(inode);
+	return generic_write_sync(iocb, ret);
+
 out_unlock:
 	inode_unlock(inode);
-	if (likely(ret > 0)) {
-		/* Handle various SYNC-type writes */
-		ret = generic_write_sync(iocb, ret);
-	}
-	return written ? written : ret;
+	return ret;
 }
 
 static int fallocate_chunk(struct inode *inode, loff_t offset, loff_t len,
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 06/12] iomap: pass a flags value to iomap_dio_rw
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 05/12] gfs2: fix O_SYNC write handling Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 07/12] iomap: allow holding i_rwsem until aio completion Christoph Hellwig
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Replace the wait_for_completion flag in struct iomap_dio with a new
IOMAP_DIO_SYNCHRONOUS flag for dio->flags, and allow passing the
initial flags to iomap_dio_rw.  Also take the check for synchronous
iocbs into iomap_dio_rw instead of duplicating it in all the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/file.c        |  8 +++++---
 fs/gfs2/file.c        |  6 ++----
 fs/iomap/direct-io.c  |  7 ++++---
 fs/xfs/xfs_file.c     | 21 +++++++++------------
 include/linux/iomap.h |  5 +++--
 5 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6a7293a5cda2..08b603d0c638 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,8 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
 	inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
@@ -371,6 +370,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	handle_t *handle;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	bool extend = false, overwrite = false, unaligned_aio = false;
+	unsigned int dio_flags = 0;
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock(inode))
@@ -404,6 +404,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
 	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
 		unaligned_aio = true;
+		dio_flags |= IOMAP_DIO_SYNCHRONOUS;
 		inode_dio_wait(inode);
 	}
 
@@ -432,11 +433,12 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		}
 
 		extend = true;
+		dio_flags |= IOMAP_DIO_SYNCHRONOUS;
 		ext4_journal_stop(handle);
 	}
 
 	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_aio || extend);
+			   dio_flags);
 
 	if (extend)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 86c0e61407b6..2260cb5d31af 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -771,8 +771,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to)
 	if (ret)
 		goto out_uninit;
 
-	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
 
 	gfs2_glock_dq(&gh);
 out_uninit:
@@ -807,8 +806,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
-			   is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
 
 out:
 	gfs2_glock_dq(&gh);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..e706329d71a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -400,7 +400,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion)
+		unsigned int dio_flags)
 {
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -410,14 +410,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	unsigned int flags = IOMAP_DIRECT;
 	struct blk_plug plug;
 	struct iomap_dio *dio;
+	bool wait_for_completion = false;
 
 	lockdep_assert_held(&inode->i_rwsem);
 
 	if (!count)
 		return 0;
 
-	if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
-		return -EIO;
+	if (is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_SYNCHRONOUS))
+		wait_for_completion = true;
 
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8a4a3f29b36..0cc843a4a163 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -193,8 +193,7 @@ xfs_file_dio_aio_read(
 	} else {
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
-			is_sync_kiocb(iocb));
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -493,6 +492,7 @@ xfs_file_dio_aio_write(
 	int			iolock;
 	size_t			count = iov_iter_count(from);
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
+	unsigned int		dio_flags = 0;
 
 	/* DIO must be aligned to device logical sector size */
 	if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
@@ -538,27 +538,24 @@ xfs_file_dio_aio_write(
 	count = iov_iter_count(from);
 
 	/*
-	 * If we are doing unaligned IO, we can't allow any other overlapping IO
-	 * in-flight at the same time or we risk data corruption. Wait for all
-	 * other IO to drain before we submit. If the IO is aligned, demote the
-	 * iolock if we had to take the exclusive lock in
+	 * If we are doing unaligned I/O, we can't allow any other overlapping
+	 * I/O in-flight at the same time or we risk data corruption.  Wait for
+	 * all other I/O to drain before we submit and execute the I/O
+	 * synchronously to prevent subsequent overlapping I/O.  If the I/O is
+	 * aligned, demote the iolock if we had to take the exclusive lock in
 	 * xfs_file_aio_write_checks() for other reasons.
 	 */
 	if (unaligned_io) {
 		inode_dio_wait(inode);
+		dio_flags = IOMAP_DIO_SYNCHRONOUS;
 	} else if (iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
 		iolock = XFS_IOLOCK_SHARED;
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
-	/*
-	 * If unaligned, this is the only IO in-flight. Wait on it before we
-	 * release the iolock to prevent subsequent overlapping IO.
-	 */
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
-			   &xfs_dio_write_ops,
-			   is_sync_kiocb(iocb) || unaligned_io);
+			   &xfs_dio_write_ops, dio_flags);
 out:
 	xfs_iunlock(ip, iolock);
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..3faeb8fd0961 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -244,10 +244,11 @@ int iomap_writepages(struct address_space *mapping,
 		const struct iomap_writeback_ops *ops);
 
 /*
- * Flags for direct I/O ->end_io:
+ * Flags for iomap_dio_complete and ->end_io:
  */
 #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
 #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
+#define IOMAP_DIO_SYNCHRONOUS	(1 << 2)	/* no async completion */
 
 struct iomap_dio_ops {
 	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
@@ -256,7 +257,7 @@ struct iomap_dio_ops {
 
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
-		bool wait_for_completion);
+		unsigned int dio_flags);
 int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 07/12] iomap: allow holding i_rwsem until aio completion
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (5 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 06/12] iomap: pass a flags value to iomap_dio_rw Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 08/12] ext4: hold i_rwsem until AIO completes Christoph Hellwig
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

The direct I/O code currently uses a hand crafted i_dio_count that needs
to be incremented under i_rwsem and then is decremented when I/O
completes.  That scheme means file system code needs to be very careful
to wait for i_dio_count to reach zero under i_rwsem in various places
that are very cumbersome to get rid.  It also means we can't get the
effect of an exclusive i_rwsem for actually asynchronous I/O, forcing
pointless synchronous execution of sub-blocksize writes.

Replace the i_dio_count scheme with holding i_rwsem over the duration
of the whole I/O.  While this introduces a non-owner unlock that isn't
nice to RT workload, the open coded locking primitive using i_dio_count
isn't any better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c  | 44 +++++++++++++++++++++++++++++++++++++------
 include/linux/iomap.h |  2 ++
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e706329d71a0..0113ac33b0a0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -70,7 +70,7 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 	dio->submit.cookie = submit_bio(bio);
 }
 
-static ssize_t iomap_dio_complete(struct iomap_dio *dio)
+static ssize_t iomap_dio_complete(struct iomap_dio *dio, bool unlock)
 {
 	const struct iomap_dio_ops *dops = dio->dops;
 	struct kiocb *iocb = dio->iocb;
@@ -112,6 +112,13 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 			dio_warn_stale_pagecache(iocb->ki_filp);
 	}
 
+	if (unlock) {
+		if (dio->flags & IOMAP_DIO_RWSEM_EXCL)
+			up_write(&inode->i_rwsem);
+		else if (dio->flags & IOMAP_DIO_RWSEM_SHARED)
+			up_read(&inode->i_rwsem);
+	}
+
 	/*
 	 * If this is a DSYNC write, make sure we push it to stable storage now
 	 * that we've written data.
@@ -129,8 +136,22 @@ static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
 	struct kiocb *iocb = dio->iocb;
+	struct inode *inode = file_inode(iocb->ki_filp);
 
-	iocb->ki_complete(iocb, iomap_dio_complete(dio), 0);
+	/*
+	 * XXX: For reads this code is directly called from bio ->end_io, which
+	 * often is hard or softirq context.  In that case lockdep records the
+	 * below as lock acquisitions from irq context and causes warnings.
+	 */
+	if (dio->flags & IOMAP_DIO_RWSEM_EXCL) {
+		rwsem_acquire(&inode->i_rwsem.dep_map, 0, 0, _THIS_IP_);
+		if (IS_ENABLED(CONFIG_RWSEM_SPIN_ON_OWNER))
+			atomic_long_set(&inode->i_rwsem.owner, (long)current);
+	} else if (dio->flags & IOMAP_DIO_RWSEM_SHARED) {
+		rwsem_acquire_read(&inode->i_rwsem.dep_map, 0, 0, _THIS_IP_);
+	}
+
+	iocb->ki_complete(iocb, iomap_dio_complete(dio, true), 0);
 }
 
 /*
@@ -430,7 +451,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->i_size = i_size_read(inode);
 	dio->dops = dops;
 	dio->error = 0;
-	dio->flags = 0;
+	dio->flags = dio_flags;
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
@@ -551,8 +572,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->wait_for_completion = wait_for_completion;
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!wait_for_completion)
-			return -EIOCBQUEUED;
-
+			goto async_completion;
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			if (!READ_ONCE(dio->submit.waiter))
@@ -567,10 +587,22 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		__set_current_state(TASK_RUNNING);
 	}
 
-	return iomap_dio_complete(dio);
+	return iomap_dio_complete(dio, false);
 
 out_free_dio:
 	kfree(dio);
 	return ret;
+
+async_completion:
+	/*
+	 * We are returning to userspace now, but i_rwsem is still held until
+	 * the I/O completion comes back.
+	 */
+	if (dio_flags & (IOMAP_DIO_RWSEM_EXCL | IOMAP_DIO_RWSEM_SHARED))
+		rwsem_release(&inode->i_rwsem.dep_map, _THIS_IP_);
+	if ((dio_flags & IOMAP_DIO_RWSEM_EXCL) &&
+	    IS_ENABLED(CONFIG_RWSEM_SPIN_ON_OWNER))
+		atomic_long_set(&inode->i_rwsem.owner, RWSEM_OWNER_UNKNOWN);
+	return -EIOCBQUEUED;
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 3faeb8fd0961..f259bb979d7f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -249,6 +249,8 @@ int iomap_writepages(struct address_space *mapping,
 #define IOMAP_DIO_UNWRITTEN	(1 << 0)	/* covers unwritten extent(s) */
 #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
 #define IOMAP_DIO_SYNCHRONOUS	(1 << 2)	/* no async completion */
+#define IOMAP_DIO_RWSEM_EXCL	(1 << 3)	/* holds shared i_rwsem */
+#define IOMAP_DIO_RWSEM_SHARED	(1 << 4)	/* holds exclusive i_rwsem */
 
 struct iomap_dio_ops {
 	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 08/12] ext4: hold i_rwsem until AIO completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (6 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 07/12] iomap: allow holding i_rwsem until aio completion Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 21:50   ` Theodore Y. Ts'o
  2020-01-14 16:12 ` [PATCH 09/12] gfs2: " Christoph Hellwig
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Switch ext4 from the magic i_dio_count scheme to just hold i_rwsem
until the actual I/O has completed to reduce the locking complexity
and avoid nasty bugs due to missing inode_dio_wait calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/extents.c     | 12 ------------
 fs/ext4/file.c        | 21 +++++++++++++--------
 fs/ext4/inode.c       | 11 -----------
 fs/ext4/ioctl.c       |  5 -----
 fs/ext4/move_extent.c |  4 ----
 5 files changed, 13 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e8708b77da6..b6aa2d249b30 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4777,9 +4777,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	if (mode & FALLOC_FL_KEEP_SIZE)
 		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
-	/* Wait all existing dio workers, newcomers will block on i_mutex */
-	inode_dio_wait(inode);
-
 	/* Preallocate the range including the unaligned edges */
 	if (partial_begin || partial_end) {
 		ret = ext4_alloc_file_blocks(file,
@@ -4949,9 +4946,6 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 			goto out;
 	}
 
-	/* Wait all existing dio workers, newcomers will block on i_mutex */
-	inode_dio_wait(inode);
-
 	ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size, flags);
 	if (ret)
 		goto out;
@@ -5525,9 +5519,6 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 		goto out_mutex;
 	}
 
-	/* Wait for existing dio to complete */
-	inode_dio_wait(inode);
-
 	/*
 	 * Prevent page faults from reinstantiating pages we have released from
 	 * page cache.
@@ -5678,9 +5669,6 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 		goto out_mutex;
 	}
 
-	/* Wait for existing dio to complete */
-	inode_dio_wait(inode);
-
 	/*
 	 * Prevent page faults from reinstantiating pages we have released from
 	 * page cache.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 08b603d0c638..b3410a3ede27 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,9 +74,10 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return generic_file_read_iter(iocb, to);
 	}
 
-	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
-	inode_unlock_shared(inode);
-
+	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
+			   IOMAP_DIO_RWSEM_SHARED);
+	if (ret != -EIOCBQUEUED)
+		inode_unlock_shared(inode);
 	file_accessed(iocb->ki_filp);
 	return ret;
 }
@@ -405,7 +406,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
 		unaligned_aio = true;
 		dio_flags |= IOMAP_DIO_SYNCHRONOUS;
-		inode_dio_wait(inode);
 	}
 
 	/*
@@ -416,7 +416,10 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
 	    ext4_should_dioread_nolock(inode)) {
 		overwrite = true;
+		dio_flags |= IOMAP_DIO_RWSEM_SHARED;
 		downgrade_write(&inode->i_rwsem);
+	} else {
+		dio_flags |= IOMAP_DIO_RWSEM_EXCL;
 	}
 
 	if (offset + count > EXT4_I(inode)->i_disksize) {
@@ -444,10 +447,12 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = ext4_handle_inode_extension(inode, offset, ret, count);
 
 out:
-	if (overwrite)
-		inode_unlock_shared(inode);
-	else
-		inode_unlock(inode);
+	if (ret != -EIOCBQUEUED) {
+		if (overwrite)
+			inode_unlock_shared(inode);
+		else
+			inode_unlock(inode);
+	}
 
 	if (ret >= 0 && iov_iter_count(from)) {
 		ssize_t err;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 629a25d999f0..e2dac0727ab0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3965,9 +3965,6 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 	}
 
-	/* Wait all existing dio workers, newcomers will block on i_mutex */
-	inode_dio_wait(inode);
-
 	/*
 	 * Prevent page faults from reinstantiating pages we have released from
 	 * page cache.
@@ -5263,11 +5260,6 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				if (error)
 					goto err_out;
 			}
-			/*
-			 * Blocks are going to be removed from the inode. Wait
-			 * for dio in flight.
-			 */
-			inode_dio_wait(inode);
 		}
 
 		down_write(&EXT4_I(inode)->i_mmap_sem);
@@ -5798,9 +5790,6 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (is_journal_aborted(journal))
 		return -EROFS;
 
-	/* Wait for all existing dio workers */
-	inode_dio_wait(inode);
-
 	/*
 	 * Before flushing the journal and switching inode's aops, we have
 	 * to flush all dirty data the inode has. There can be outstanding
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index e8870fff8224..99d21d81074f 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -153,10 +153,6 @@ static long swap_inode_boot_loader(struct super_block *sb,
 	if (err)
 		goto err_out;
 
-	/* Wait for all existing dio workers */
-	inode_dio_wait(inode);
-	inode_dio_wait(inode_bl);
-
 	truncate_inode_pages(&inode->i_data, 0);
 	truncate_inode_pages(&inode_bl->i_data, 0);
 
@@ -364,7 +360,6 @@ static int ext4_ioctl_setflags(struct inode *inode,
 	 */
 	if (S_ISREG(inode->i_mode) && !IS_IMMUTABLE(inode) &&
 	    (flags & EXT4_IMMUTABLE_FL)) {
-		inode_dio_wait(inode);
 		err = filemap_write_and_wait(inode->i_mapping);
 		if (err)
 			goto flags_out;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 30ce3dc69378..20240808569f 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -602,10 +602,6 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
 	/* Protect orig and donor inodes against a truncate */
 	lock_two_nondirectories(orig_inode, donor_inode);
 
-	/* Wait for all existing dio workers */
-	inode_dio_wait(orig_inode);
-	inode_dio_wait(donor_inode);
-
 	/* Protect extent tree against block allocations via delalloc */
 	ext4_double_down_write_data_sem(orig_inode, donor_inode);
 	/* Check the filesystem environment whether move_extent can be done */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 09/12] gfs2: hold i_rwsem until AIO completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (7 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 08/12] ext4: hold i_rwsem until AIO completes Christoph Hellwig
@ 2020-01-14 16:12 ` " Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 10/12] xfs: " Christoph Hellwig
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Switch gfs from the magic i_dio_count scheme to just hold i_rwsem
until the actual I/O has completed to reduce the locking complexity
and avoid nasty bugs due to missing inode_dio_wait calls.

Note that gfs only uses i_rwsem for direct I/O writes, not for
reads so no change to the read behavior.  It might also make sense
to use the same scheme for the gfs2 internal cluster lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c  |  2 --
 fs/gfs2/file.c  |  6 ++++--
 fs/gfs2/glops.c | 10 ++--------
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 08f6fbb3655e..226f4eb680c7 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -2181,8 +2181,6 @@ int gfs2_setattr_size(struct inode *inode, u64 newsize)
 	if (ret)
 		return ret;
 
-	inode_dio_wait(inode);
-
 	ret = gfs2_rsqa_alloc(ip);
 	if (ret)
 		goto out;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2260cb5d31af..82a2f313a3e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -806,7 +806,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (offset + len > i_size_read(&ip->i_inode))
 		goto out;
 
-	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+	ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+			   IOMAP_DIO_RWSEM_EXCL);
 
 out:
 	gfs2_glock_dq(&gh);
@@ -923,7 +924,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return generic_write_sync(iocb, ret);
 
 out_unlock:
-	inode_unlock(inode);
+	if (ret != -EIOCBQUEUED)
+		inode_unlock(inode);
 	return ret;
 }
 
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4ede1f18de85..a705eeb75117 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -243,11 +243,8 @@ static void inode_go_sync(struct gfs2_glock *gl)
 	struct address_space *metamapping = gfs2_glock2aspace(gl);
 	int error;
 
-	if (isreg) {
-		if (test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
-			unmap_shared_mapping_range(ip->i_inode.i_mapping, 0, 0);
-		inode_dio_wait(&ip->i_inode);
-	}
+	if (isreg && test_and_clear_bit(GIF_SW_PAGED, &ip->i_flags))
+		unmap_shared_mapping_range(ip->i_inode.i_mapping, 0, 0);
 	if (!test_and_clear_bit(GLF_DIRTY, &gl->gl_flags))
 		goto out;
 
@@ -440,9 +437,6 @@ static int inode_go_lock(struct gfs2_holder *gh)
 			return error;
 	}
 
-	if (gh->gh_state != LM_ST_DEFERRED)
-		inode_dio_wait(&ip->i_inode);
-
 	if ((ip->i_diskflags & GFS2_DIF_TRUNC_IN_PROG) &&
 	    (gl->gl_state == LM_ST_EXCLUSIVE) &&
 	    (gh->gh_state == LM_ST_EXCLUSIVE)) {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 10/12] xfs: hold i_rwsem until AIO completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (8 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 09/12] gfs2: " Christoph Hellwig
@ 2020-01-14 16:12 ` " Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 11/12] xfs: don't set IOMAP_DIO_SYNCHRONOUS for unaligned I/O Christoph Hellwig
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Switch ext4 from the magic i_dio_count scheme to just hold i_rwsem
until the actual I/O has completed to reduce the locking complexity
and avoid nasty bugs due to missing inode_dio_wait calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/scrub/bmap.c    |  1 -
 fs/xfs/xfs_bmap_util.c |  3 ---
 fs/xfs/xfs_file.c      | 47 +++++++++++++-----------------------------
 fs/xfs/xfs_icache.c    |  3 +--
 fs/xfs/xfs_ioctl.c     |  1 -
 fs/xfs/xfs_iops.c      |  5 -----
 fs/xfs/xfs_reflink.c   |  2 --
 7 files changed, 15 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index fa6ea6407992..d3e4068d3189 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -45,7 +45,6 @@ xchk_setup_inode_bmap(
 	 */
 	if (S_ISREG(VFS_I(sc->ip)->i_mode) &&
 	    sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) {
-		inode_dio_wait(VFS_I(sc->ip));
 		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
 		if (error)
 			goto out;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index e62fb5216341..a454f481107e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -674,9 +674,6 @@ xfs_free_eofblocks(
 		if (error)
 			return error;
 
-		/* wait on dio to ensure i_size has settled */
-		inode_dio_wait(VFS_I(ip));
-
 		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
 				&tp);
 		if (error) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0cc843a4a163..d0ee7d2932e4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -193,9 +193,11 @@ xfs_file_dio_aio_read(
 	} else {
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
+			   IOMAP_DIO_RWSEM_SHARED);
+	if (ret != -EIOCBQUEUED)
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 	return ret;
 }
 
@@ -341,15 +343,6 @@ xfs_file_aio_write_checks(
 				xfs_ilock(ip, *iolock);
 				iov_iter_reexpand(from, count);
 			}
-			/*
-			 * We now have an IO submission barrier in place, but
-			 * AIO can do EOF updates during IO completion and hence
-			 * we now need to wait for all of them to drain. Non-AIO
-			 * DIO will have drained before we are given the
-			 * XFS_IOLOCK_EXCL, and so for most cases this wait is a
-			 * no-op.
-			 */
-			inode_dio_wait(inode);
 			drained_dio = true;
 			goto restart;
 		}
@@ -469,13 +462,7 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
  * needs to do sub-block zeroing and that requires serialisation against other
  * direct IOs to the same block. In this case we need to serialise the
  * submission of the unaligned IOs so that we don't get racing block zeroing in
- * the dio layer.  To avoid the problem with aio, we also need to wait for
- * outstanding IOs to complete so that unwritten extent conversion is completed
- * before we try to map the overlapping block. This is currently implemented by
- * hitting it with a big hammer (i.e. inode_dio_wait()).
- *
- * Returns with locks held indicated by @iolock and errors indicated by
- * negative return values.
+ * the dio layer.
  */
 STATIC ssize_t
 xfs_file_dio_aio_write(
@@ -546,18 +533,21 @@ xfs_file_dio_aio_write(
 	 * xfs_file_aio_write_checks() for other reasons.
 	 */
 	if (unaligned_io) {
-		inode_dio_wait(inode);
-		dio_flags = IOMAP_DIO_SYNCHRONOUS;
-	} else if (iolock == XFS_IOLOCK_EXCL) {
-		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
-		iolock = XFS_IOLOCK_SHARED;
+		dio_flags = IOMAP_DIO_RWSEM_EXCL | IOMAP_DIO_SYNCHRONOUS;
+	} else {
+		if (iolock == XFS_IOLOCK_EXCL) {
+			xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
+			iolock = XFS_IOLOCK_SHARED;
+		}
+		dio_flags = IOMAP_DIO_RWSEM_SHARED;
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
 			   &xfs_dio_write_ops, dio_flags);
 out:
-	xfs_iunlock(ip, iolock);
+	if (ret != -EIOCBQUEUED)
+		xfs_iunlock(ip, iolock);
 
 	/*
 	 * No fallback to buffered IO on errors for XFS, direct IO will either
@@ -819,15 +809,6 @@ xfs_file_fallocate(
 	if (error)
 		goto out_unlock;
 
-	/*
-	 * Must wait for all AIO to complete before we continue as AIO can
-	 * change the file size on completion without holding any locks we
-	 * currently hold. We must do this first because AIO can update both
-	 * the on disk and in memory inode sizes, and the operations that follow
-	 * require the in-memory size to be fully up-to-date.
-	 */
-	inode_dio_wait(inode);
-
 	/*
 	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
 	 * the cached range over the first operation we are about to run.
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..9e6f32fd32f5 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1720,8 +1720,7 @@ xfs_prep_free_cowblocks(
 	 */
 	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
 	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
-	    atomic_read(&VFS_I(ip)->i_dio_count))
+	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK))
 		return false;
 
 	return true;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7b35d62ede9f..331453f2c4be 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -548,7 +548,6 @@ xfs_ioc_space(
 	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
 	if (error)
 		goto out_unlock;
-	inode_dio_wait(inode);
 
 	switch (bf->l_whence) {
 	case 0: /*SEEK_SET*/
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8afe69ca188b..700edeccc6bf 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -893,11 +893,6 @@ xfs_setattr_size(
 	if (error)
 		return error;
 
-	/*
-	 * Wait for all direct I/O to complete.
-	 */
-	inode_dio_wait(inode);
-
 	/*
 	 * File data changes must be complete before we start the transaction to
 	 * modify the inode.  This needs to be done before joining the inode to
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index de451235c4ee..f775e60ca6f7 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1525,8 +1525,6 @@ xfs_reflink_unshare(
 
 	trace_xfs_reflink_unshare(ip, offset, len);
 
-	inode_dio_wait(inode);
-
 	error = iomap_file_unshare(inode, offset, len,
 			&xfs_buffered_write_iomap_ops);
 	if (error)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 11/12] xfs: don't set IOMAP_DIO_SYNCHRONOUS for unaligned I/O
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (9 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 10/12] xfs: " Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 16:12 ` [PATCH 12/12] iomap: remove the inode_dio_begin/end calls Christoph Hellwig
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Now that i_rwsem is held until asynchronous writes complete, there
is no need to force them to execute synchronously, as the i_rwsem
protection is exactly the same as for synchronous writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d0ee7d2932e4..3a734ad4bb10 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -510,9 +510,6 @@ xfs_file_dio_aio_write(
 	}
 
 	if (iocb->ki_flags & IOCB_NOWAIT) {
-		/* unaligned dio always waits, bail */
-		if (unaligned_io)
-			return -EAGAIN;
 		if (!xfs_ilock_nowait(ip, iolock))
 			return -EAGAIN;
 	} else {
@@ -526,14 +523,11 @@ xfs_file_dio_aio_write(
 
 	/*
 	 * If we are doing unaligned I/O, we can't allow any other overlapping
-	 * I/O in-flight at the same time or we risk data corruption.  Wait for
-	 * all other I/O to drain before we submit and execute the I/O
-	 * synchronously to prevent subsequent overlapping I/O.  If the I/O is
-	 * aligned, demote the iolock if we had to take the exclusive lock in
-	 * xfs_file_aio_write_checks() for other reasons.
+	 * If the I/O is aligned, demote the iolock if we had to take the
+	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
 	 */
 	if (unaligned_io) {
-		dio_flags = IOMAP_DIO_RWSEM_EXCL | IOMAP_DIO_SYNCHRONOUS;
+		dio_flags = IOMAP_DIO_RWSEM_EXCL;
 	} else {
 		if (iolock == XFS_IOLOCK_EXCL) {
 			xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 12/12] iomap: remove the inode_dio_begin/end calls
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (10 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 11/12] xfs: don't set IOMAP_DIO_SYNCHRONOUS for unaligned I/O Christoph Hellwig
@ 2020-01-14 16:12 ` Christoph Hellwig
  2020-01-14 18:47 ` RFC: hold i_rwsem until aio completes Matthew Wilcox
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 16:12 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

Now that all iomap users hold i_rwsem over asynchronous I/O
operations these calls can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 0113ac33b0a0..c90ec82e8e08 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -126,7 +126,6 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio, bool unlock)
 	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
 		ret = generic_write_sync(iocb, ret);
 
-	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
 	return ret;
@@ -513,8 +512,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			goto out_free_dio;
 	}
 
-	inode_dio_begin(inode);
-
 	blk_start_plug(&plug);
 	do {
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
-- 
2.24.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner
  2020-01-14 16:12 ` [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner Christoph Hellwig
@ 2020-01-14 18:17   ` Waiman Long
  2020-01-14 18:25     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-14 18:17 UTC (permalink / raw)
  To: Christoph Hellwig, linux-xfs, linux-fsdevel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel
  Cc: linux-kernel, linux-mm

On 1/14/20 11:12 AM, Christoph Hellwig wrote:
> The rwsem code overloads the owner field with either a task struct or
> negative magic numbers.  Add a quick hack to catch these negative
> values early on.  Without this spinning on a writer that replaced the
> owner with RWSEM_OWNER_UNKNOWN, rwsem_spin_on_owner can crash while
> deferencing the task_struct ->on_cpu field of a -8 value.
>
> XXX: This might be a bit of a hack as the code otherwise doesn't use
> the ERR_PTR family macros, better suggestions welcome.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  kernel/locking/rwsem.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 44e68761f432..6adc719a30a1 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -725,6 +725,8 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, unsigned long nonspinnable)
>  	state = rwsem_owner_state(owner, flags, nonspinnable);
>  	if (state != OWNER_WRITER)
>  		return state;
> +	if (IS_ERR(owner))
> +		return state;
>  
>  	rcu_read_lock();
>  	for (;;) {

The owner field is just a pointer to the task structure with the lower 3
bits served as flag bits. Setting owner to RWSEM_OWNER_UNKNOWN (-2) will
stop optimistic spinning. So under what condition did the crash happen?

Anyway, PeterZ is working on revising the percpu-rwsem implementation to
more gracefully handle the frozen case. At the end, there will not be a
need for the RWSEM_OWNER_UNKNOWN magic and it can be removed.

Cheers,
Longman

RWSEM_OWNER_UNKNOWN


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner
  2020-01-14 18:17   ` Waiman Long
@ 2020-01-14 18:25     ` Christoph Hellwig
  2020-01-14 18:33       ` Waiman Long
  2020-01-14 18:55       ` Waiman Long
  0 siblings, 2 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-14 18:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 01:17:45PM -0500, Waiman Long wrote:
> The owner field is just a pointer to the task structure with the lower 3
> bits served as flag bits. Setting owner to RWSEM_OWNER_UNKNOWN (-2) will
> stop optimistic spinning. So under what condition did the crash happen?

When running xfstests with all patches in this series except for this
one, IIRC in generic/114.

> Anyway, PeterZ is working on revising the percpu-rwsem implementation to
> more gracefully handle the frozen case. At the end, there will not be a
> need for the RWSEM_OWNER_UNKNOWN magic and it can be removed.

Well, this series relies on that value.  And I think it fundamentally
is the right thing to do for AIO, and potentially other I/O related
locking where we take a lock to synchronize access to data, then
do I/O and then eventually get an I/O completion from an interrupt.
Even thinking from the PREEMP_RT context we want to boost the
initial thread as long as we can, then do nothing when it is off
to I/O hardware (except maybe providing good diagnostics that the cause
for the latency is I/O), and then boost the thread that is handling
the completion.  Things like the i_dio_count hack can't provide that.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner
  2020-01-14 18:25     ` Christoph Hellwig
@ 2020-01-14 18:33       ` Waiman Long
  2020-01-14 18:55       ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2020-01-14 18:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Andrew Morton, linux-ext4,
	cluster-devel, linux-kernel, linux-mm

On 1/14/20 1:25 PM, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 01:17:45PM -0500, Waiman Long wrote:
>> The owner field is just a pointer to the task structure with the lower 3
>> bits served as flag bits. Setting owner to RWSEM_OWNER_UNKNOWN (-2) will
>> stop optimistic spinning. So under what condition did the crash happen?
> When running xfstests with all patches in this series except for this
> one, IIRC in generic/114.

OK, I think I know where the bug is. I will send a patch to fix that.

Thanks,
Longman



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (11 preceding siblings ...)
  2020-01-14 16:12 ` [PATCH 12/12] iomap: remove the inode_dio_begin/end calls Christoph Hellwig
@ 2020-01-14 18:47 ` Matthew Wilcox
  2020-01-15  6:54   ` Christoph Hellwig
  2020-01-14 19:27 ` Jason Gunthorpe
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2020-01-14 18:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Second I/O
> completions often come from interrupt context, which means the re-acquire
> is recorded as from irq context, leading to warnings about incorrect
> contexts.  I wonder if we could just have a bit in lockdep that says
> returning to userspace is ok for this particular lock?  That would also
> clean up the fsfreeze situation a lot.

It would be helpful if we could also use the same lockdep logic
for PageLocked.  Again, it's a case where returning to userspace with
PageLock held is fine, because we're expecting an interrupt to come in
and drop the lock for us.

Perhaps the right answer is, from lockdep's point of view, to mark the
lock as being released at the point where we submit the I/O.  Then
in the completion path release the lock without telling lockdep we
released it.

That would catch cases where we inadvertently returned to userspace
without submitting the I/O, for example.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner
  2020-01-14 18:25     ` Christoph Hellwig
  2020-01-14 18:33       ` Waiman Long
@ 2020-01-14 18:55       ` Waiman Long
  1 sibling, 0 replies; 39+ messages in thread
From: Waiman Long @ 2020-01-14 18:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Andrew Morton, linux-ext4,
	cluster-devel, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 521 bytes --]

On 1/14/20 1:25 PM, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 01:17:45PM -0500, Waiman Long wrote:
>> The owner field is just a pointer to the task structure with the lower 3
>> bits served as flag bits. Setting owner to RWSEM_OWNER_UNKNOWN (-2) will
>> stop optimistic spinning. So under what condition did the crash happen?
> When running xfstests with all patches in this series except for this
> one, IIRC in generic/114.

Could you try the attached patch to see if it can fix the problem?

Thanks,
Longman


[-- Attachment #2: 0001-locking-rwsem-Fix-kernel-crash-when-spinning-on-RWSE.patch --]
[-- Type: text/x-patch, Size: 1458 bytes --]

From 1fcfa946609b5e919a6b953a64be6853af5cdf05 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman@redhat.com>
Date: Tue, 14 Jan 2020 13:39:02 -0500
Subject: [PATCH] locking/rwsem: Fix kernel crash when spinning on
 RWSEM_OWNER_UNKNOWN

The commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer
optimistically spin on owner") will allow a recently woken up waiting
writer to spin on the owner. Unfortunately, if the owner happens to be
RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it leading to a
kernel crash. This is fixed by passing the proper non-spinnable bits
to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated
as a non-spinnable target.

Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner")

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/locking/rwsem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 44e68761f432..1dd3d53f43c3 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1227,7 +1227,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 		 * without sleeping.
 		 */
 		if ((wstate == WRITER_HANDOFF) &&
-		    (rwsem_spin_on_owner(sem, 0) == OWNER_NULL))
+		    rwsem_spin_on_owner(sem, RWSEM_NONSPINNABLE) == OWNER_NULL)
 			goto trylock_again;
 
 		/* Block until there are no active lockers. */
-- 
2.18.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (12 preceding siblings ...)
  2020-01-14 18:47 ` RFC: hold i_rwsem until aio completes Matthew Wilcox
@ 2020-01-14 19:27 ` Jason Gunthorpe
  2020-01-15  6:56   ` Christoph Hellwig
  2020-01-16 14:00 ` Jan Kara
  2020-01-18  9:28 ` Dave Chinner
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2020-01-14 19:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.

I've seen similar locking patterns quite a lot, enough I've thought
about having a dedicated locking primitive to do it. It really wants
to be a rwsem, but as here the rwsem rules don't allow it.

The common pattern I'm looking at looks something like this:

 'try begin read'() // aka down_read_trylock()

  /* The lockdep release hackery you describe,
     the rwsem remains read locked */
 'exit reader'()

 .. delegate unlock to work queue, timer, irq, etc ..

in the new context:

 're_enter reader'() // Get our lockdep tracking back

 'end reader'() // aka up_read()

vs a typical write side:

 'begin write'() // aka down_write()

 /* There is no reason to unlock it before kfree of the rwsem memory.
    Somehow the user prevents any new down_read_trylock()'s */
 'abandon writer'() // The object will be kfree'd with a locked writer
 kfree()

The typical goal is to provide an object destruction path that can
serialize and fence all readers wherever they may be before proceeding
to some synchronous destruction.

Usually this gets open coded with some atomic/kref/refcount and a
completion or wait queue. Often implemented wrongly, lacking the write
favoring bias in the rwsem, and lacking any lockdep tracking on the
naked completion.

Not to discourage your patch, but to ask if we can make the solution
more broadly applicable?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/12] ext4: hold i_rwsem until AIO completes
  2020-01-14 16:12 ` [PATCH 08/12] ext4: hold i_rwsem until AIO completes Christoph Hellwig
@ 2020-01-14 21:50   ` Theodore Y. Ts'o
  2020-01-15  6:48     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Theodore Y. Ts'o @ 2020-01-14 21:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 05:12:21PM +0100, Christoph Hellwig wrote:
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0e8708b77da6..b6aa2d249b30 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4777,9 +4777,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	if (mode & FALLOC_FL_KEEP_SIZE)
>  		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>  
> -	/* Wait all existing dio workers, newcomers will block on i_mutex */
> -	inode_dio_wait(inode);
> -
>  	/* Preallocate the range including the unaligned edges */
>  	if (partial_begin || partial_end) {
>  		ret = ext4_alloc_file_blocks(file,

I note that you've dropped the inode_dio_wait() in ext4's ZERO_RANGE,
COLLAPSE_RANGE, INSERT_RANGE, etc.  We had added these to avoid
problems when various fallocate operations which modify the inode's
logical->physical block mapping racing with direct I/O (both reads or
writes).

I don't see a replacement protection in this patch series.  How does
are file systems supported to protect against such races?

    	 	 	      	      	      - Ted

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/12] ext4: hold i_rwsem until AIO completes
  2020-01-14 21:50   ` Theodore Y. Ts'o
@ 2020-01-15  6:48     ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-15  6:48 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 04:50:23PM -0500, Theodore Y. Ts'o wrote:
> I note that you've dropped the inode_dio_wait() in ext4's ZERO_RANGE,
> COLLAPSE_RANGE, INSERT_RANGE, etc.  We had added these to avoid
> problems when various fallocate operations which modify the inode's
> logical->physical block mapping racing with direct I/O (both reads or
> writes).
> 
> I don't see a replacement protection in this patch series.  How does
> are file systems supported to protect against such races?

By holding i_rwsem until the direct I/O operations are finished, and
not just until they are sumbitted.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 18:47 ` RFC: hold i_rwsem until aio completes Matthew Wilcox
@ 2020-01-15  6:54   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-15  6:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 10:47:07AM -0800, Matthew Wilcox wrote:
> It would be helpful if we could also use the same lockdep logic
> for PageLocked.  Again, it's a case where returning to userspace with
> PageLock held is fine, because we're expecting an interrupt to come in
> and drop the lock for us.

Yes, this is a very typical pattern for I/O.  Besides the page and
buffer head bit locks it also applies to the semaphore in the xfs_buf
structure and probably various other places that currently used hand
crafted or legacy locking primitives to escape lockdep.

> Perhaps the right answer is, from lockdep's point of view, to mark the
> lock as being released at the point where we submit the I/O.  Then
> in the completion path release the lock without telling lockdep we
> released it.

That is similar to what the fsfreeze code does, but I don't think it
is very optimal, as misses to track any dependencies after I/O
submission, and at least some of the completions paths do take
locks.  But it might be a start.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 19:27 ` Jason Gunthorpe
@ 2020-01-15  6:56   ` Christoph Hellwig
  2020-01-15 13:24     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-15  6:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 03:27:00PM -0400, Jason Gunthorpe wrote:
> I've seen similar locking patterns quite a lot, enough I've thought
> about having a dedicated locking primitive to do it. It really wants
> to be a rwsem, but as here the rwsem rules don't allow it.
> 
> The common pattern I'm looking at looks something like this:
> 
>  'try begin read'() // aka down_read_trylock()
> 
>   /* The lockdep release hackery you describe,
>      the rwsem remains read locked */
>  'exit reader'()
> 
>  .. delegate unlock to work queue, timer, irq, etc ..
> 
> in the new context:
> 
>  're_enter reader'() // Get our lockdep tracking back
> 
>  'end reader'() // aka up_read()
> 
> vs a typical write side:
> 
>  'begin write'() // aka down_write()
> 
>  /* There is no reason to unlock it before kfree of the rwsem memory.
>     Somehow the user prevents any new down_read_trylock()'s */
>  'abandon writer'() // The object will be kfree'd with a locked writer
>  kfree()
> 
> The typical goal is to provide an object destruction path that can
> serialize and fence all readers wherever they may be before proceeding
> to some synchronous destruction.
> 
> Usually this gets open coded with some atomic/kref/refcount and a
> completion or wait queue. Often implemented wrongly, lacking the write
> favoring bias in the rwsem, and lacking any lockdep tracking on the
> naked completion.
> 
> Not to discourage your patch, but to ask if we can make the solution
> more broadly applicable?

Your requirement seems a little different, and in fact in many ways
similar to the percpu_ref primitive.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15  6:56   ` Christoph Hellwig
@ 2020-01-15 13:24     ` Jason Gunthorpe
  2020-01-15 14:33       ` Peter Zijlstra
  2020-01-15 15:36       ` Christoph Hellwig
  0 siblings, 2 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2020-01-15 13:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 07:56:14AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 03:27:00PM -0400, Jason Gunthorpe wrote:
> > I've seen similar locking patterns quite a lot, enough I've thought
> > about having a dedicated locking primitive to do it. It really wants
> > to be a rwsem, but as here the rwsem rules don't allow it.
> > 
> > The common pattern I'm looking at looks something like this:
> > 
> >  'try begin read'() // aka down_read_trylock()
> > 
> >   /* The lockdep release hackery you describe,
> >      the rwsem remains read locked */
> >  'exit reader'()
> > 
> >  .. delegate unlock to work queue, timer, irq, etc ..
> > 
> > in the new context:
> > 
> >  're_enter reader'() // Get our lockdep tracking back
> > 
> >  'end reader'() // aka up_read()
> > 
> > vs a typical write side:
> > 
> >  'begin write'() // aka down_write()
> > 
> >  /* There is no reason to unlock it before kfree of the rwsem memory.
> >     Somehow the user prevents any new down_read_trylock()'s */
> >  'abandon writer'() // The object will be kfree'd with a locked writer
> >  kfree()
> > 
> > The typical goal is to provide an object destruction path that can
> > serialize and fence all readers wherever they may be before proceeding
> > to some synchronous destruction.
> > 
> > Usually this gets open coded with some atomic/kref/refcount and a
> > completion or wait queue. Often implemented wrongly, lacking the write
> > favoring bias in the rwsem, and lacking any lockdep tracking on the
> > naked completion.
> > 
> > Not to discourage your patch, but to ask if we can make the solution
> > more broadly applicable?
> 
> Your requirement seems a little different, and in fact in many ways
> similar to the percpu_ref primitive.

I was interested because you are talking about allowing the read/write side
of a rw sem to be held across a return to user space/etc, which is the
same basic problem.

precpu refcount looks more like a typical refcount with a release that
is called by whatever context does the final put. The point above is
to basically move the release of a refcount into a synchrnous path by
introducing some barrier to wait for the refcount to go to zero. In
the above the barrier is the down_write() as it is really closer to a
rwsem than a refcount.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 13:24     ` Jason Gunthorpe
@ 2020-01-15 14:33       ` Peter Zijlstra
  2020-01-15 14:49         ` Jason Gunthorpe
  2020-01-18 22:40         ` Matthew Wilcox
  2020-01-15 15:36       ` Christoph Hellwig
  1 sibling, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2020-01-15 14:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:

> I was interested because you are talking about allowing the read/write side
> of a rw sem to be held across a return to user space/etc, which is the
> same basic problem.

No it is not; allowing the lock to be held across userspace doesn't
change the owner. This is a crucial difference, PI depends on there
being a distinct owner. That said, allowing the lock to be held across
userspace still breaks PI in that it completely wrecks the ability to
analyze the critical section.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 14:33       ` Peter Zijlstra
@ 2020-01-15 14:49         ` Jason Gunthorpe
  2020-01-15 19:03           ` Waiman Long
  2020-01-18 22:40         ` Matthew Wilcox
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2020-01-15 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> 
> > I was interested because you are talking about allowing the read/write side
> > of a rw sem to be held across a return to user space/etc, which is the
> > same basic problem.
> 
> No it is not; allowing the lock to be held across userspace doesn't
> change the owner. This is a crucial difference, PI depends on there
> being a distinct owner. That said, allowing the lock to be held across
> userspace still breaks PI in that it completely wrecks the ability to
> analyze the critical section.

I'm not sure what you are contrasting?

I was remarking that I see many places open code a rwsem using an
atomic and a completion specifically because they need to do the
things Christoph identified:

> (1) no unlocking by another process than the one that acquired it
> (2) no return to userspace with locks held

As an example flow: obtain the read side lock, schedual a work queue,
return to user space, and unlock the read side from the work queue.

If we can make some general primative that addresses this then maybe
those open coded places can convert as well?

Regards,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 13:24     ` Jason Gunthorpe
  2020-01-15 14:33       ` Peter Zijlstra
@ 2020-01-15 15:36       ` Christoph Hellwig
  2020-01-15 16:26         ` Jason Gunthorpe
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-15 15:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> > Your requirement seems a little different, and in fact in many ways
> > similar to the percpu_ref primitive.
> 
> I was interested because you are talking about allowing the read/write side
> of a rw sem to be held across a return to user space/etc, which is the
> same basic problem.
> 
> precpu refcount looks more like a typical refcount with a release that
> is called by whatever context does the final put. The point above is
> to basically move the release of a refcount into a synchrnous path by
> introducing some barrier to wait for the refcount to go to zero. In
> the above the barrier is the down_write() as it is really closer to a
> rwsem than a refcount.

No, percpu_ref is a little different than the name suggests, as it has
a magic initial reference, and then the other short term reference.  To
actually tear it down now just a normal put of the reference is needed,
but an explicit percpu_ref_kill or percpu_ref_kill_and_confirm.  Various
callers (including all I added) would like that operation to be
synchronous and currently hack that up, so a version of the percpu_ref
that actually waits for the other references to away like we hacked
up various places seems to exactly suit your requirements.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 15:36       ` Christoph Hellwig
@ 2020-01-15 16:26         ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2020-01-15 16:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 04:36:14PM +0100, Christoph Hellwig wrote:

> synchronous and currently hack that up, so a version of the percpu_ref
> that actually waits for the other references to away like we hacked
> up various places seems to exactly suit your requirements.

Ah, yes, sounds like a similar goal, many cases I'm thinking about
also hack up a kref to trigger a completion to make it
synchronous.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 14:49         ` Jason Gunthorpe
@ 2020-01-15 19:03           ` Waiman Long
  2020-01-15 19:07             ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Waiman Long @ 2020-01-15 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Zijlstra
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Thomas Gleixner,
	Ingo Molnar, Will Deacon, Andrew Morton, linux-ext4,
	cluster-devel, linux-kernel, linux-mm

On 1/15/20 9:49 AM, Jason Gunthorpe wrote:
> On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
>> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
>>
>>> I was interested because you are talking about allowing the read/write side
>>> of a rw sem to be held across a return to user space/etc, which is the
>>> same basic problem.
>> No it is not; allowing the lock to be held across userspace doesn't
>> change the owner. This is a crucial difference, PI depends on there
>> being a distinct owner. That said, allowing the lock to be held across
>> userspace still breaks PI in that it completely wrecks the ability to
>> analyze the critical section.
> I'm not sure what you are contrasting?
>
> I was remarking that I see many places open code a rwsem using an
> atomic and a completion specifically because they need to do the
> things Christoph identified:
>
>> (1) no unlocking by another process than the one that acquired it
>> (2) no return to userspace with locks held
> As an example flow: obtain the read side lock, schedual a work queue,
> return to user space, and unlock the read side from the work queue.

We currently have down_read_non_owner() and up_read_non_owner() that
perform the lock and unlock without lockdep tracking. Of course, that is
a hack and their use must be carefully scrutinized to make sure that
there is no deadlock or other potentially locking issues.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 19:03           ` Waiman Long
@ 2020-01-15 19:07             ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-01-15 19:07 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jason Gunthorpe, Peter Zijlstra, Christoph Hellwig, linux-xfs,
	linux-fsdevel, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 02:03:22PM -0500, Waiman Long wrote:
> >> (1) no unlocking by another process than the one that acquired it
> >> (2) no return to userspace with locks held
> > As an example flow: obtain the read side lock, schedual a work queue,
> > return to user space, and unlock the read side from the work queue.
> 
> We currently have down_read_non_owner() and up_read_non_owner() that
> perform the lock and unlock without lockdep tracking. Of course, that is
> a hack and their use must be carefully scrutinized to make sure that
> there is no deadlock or other potentially locking issues.

That doesn't help with returning to userspace while the lock is held.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (13 preceding siblings ...)
  2020-01-14 19:27 ` Jason Gunthorpe
@ 2020-01-16 14:00 ` Jan Kara
  2020-02-03 17:44   ` Christoph Hellwig
  2020-01-18  9:28 ` Dave Chinner
  15 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2020-01-16 14:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

Hello!

On Tue 14-01-20 17:12:13, Christoph Hellwig wrote:
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.
> 
> All this is solved by releasing i_rwsem only when the I/O has actually
> completed, but doings so is against to mantras of Linux locking primites:
> 
>  (1) no unlocking by another process than the one that acquired it
>  (2) no return to userspace with locks held

I'd like to note that using i_dio_count has also one advantage you didn't
mention. For AIO case, if you need to hold i_rwsem in exclusive mode,
holding the i_rwsem just for submission part is a significant performance
advantage (shorter lock hold times allow for higher IO parallelism). I
guess this could be mitigated by downgrading the lock to shared mode
once the IO is submitted. But there will be still some degradation visible
for the cases of mixed exclusive and shared acquisitions because shared
holders will be blocking exclusive ones for longer time.

This may be especially painful for filesystems that don't implement DIO
overwrites with i_rwsem in shared mode...


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
                   ` (14 preceding siblings ...)
  2020-01-16 14:00 ` Jan Kara
@ 2020-01-18  9:28 ` Dave Chinner
  2020-02-03 17:46   ` Christoph Hellwig
  15 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2020-01-18  9:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.
> 
> All this is solved by releasing i_rwsem only when the I/O has actually
> completed, but doings so is against to mantras of Linux locking primites:
> 
>  (1) no unlocking by another process than the one that acquired it
>  (2) no return to userspace with locks held
> 
> It actually happens we have various places that work around this.  A few
> callers do non-owner unlocks of rwsems, which are pretty nasty for
> PREEMPT_RT as the owner tracking doesn't work.  OTOH the file system
> freeze code has both problems and works around them a little better,
> although in a somewhat awkward way, in that it releases the lockdep
> object when returning to userspace, and reacquires it when done, and
> also clears the rwsem owner when returning to userspace, and then sets
> the new onwer before unlocking.
> 
> This series tries to follow that scheme, also it doesn't fully work.  The
> first issue is that the rwsem code has a bug where it doesn't properly
> handle clearing the owner.  This series has a patch to fix that, but it
> is ugly and might not be correct so some help is needed.  Second I/O
> completions often come from interrupt context, which means the re-acquire
> is recorded as from irq context, leading to warnings about incorrect
> contexts.  I wonder if we could just have a bit in lockdep that says
> returning to userspace is ok for this particular lock?  That would also
> clean up the fsfreeze situation a lot.
> 
> Let me know what you think of all this.  While I converted all the iomap
> using file systems only XFS is actually tested.

I think it's pretty gross, actually. It  makes the same mistake made
with locking in the old direct IO code - it encodes specific lock
operations via flags into random locations in the DIO path. This is
a very slippery slope, and IMO it is an layering violation to encode
specific filesystem locking smeantics into a layer that is supposed
to be generic and completely filesystem agnostic. i.e.  this
mechanism breaks if a filesystem moves to a different type of lock
(e.g. range locks), and history teaches us that we'll end up making
a horrible, unmaintainable mess to support different locking
mechanisms and contexts.

I think that we should be moving to a model where the filesystem
provides an unlock method in the iomap operations structure, and if
the method is present in iomap_dio_complete() it gets called for the
filesystem to unlock the inode at the appropriate point. This also
allows the filesystem to provide a different method for read or
write unlock, depending on what type of lock it held at submission.
This gets rid of the need for the iomap code to know what type of
lock the caller holds, too.

This way we always have a consistent point in the AIO/DIO completion
model where the inode gets unlocked, it only gets called for the IO
contexts where the filesystem wants/needs unlock on IO competion
semantics, and it's completely filesystem IO-lock implementation
agnostic. And for filesystems that use the inode i_rwsem, we can
just provide simple helper functions for their read/write unlock
methods.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-15 14:33       ` Peter Zijlstra
  2020-01-15 14:49         ` Jason Gunthorpe
@ 2020-01-18 22:40         ` Matthew Wilcox
  1 sibling, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2020-01-18 22:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jason Gunthorpe, Christoph Hellwig, linux-xfs, linux-fsdevel,
	Waiman Long, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> 
> > I was interested because you are talking about allowing the read/write side
> > of a rw sem to be held across a return to user space/etc, which is the
> > same basic problem.
> 
> No it is not; allowing the lock to be held across userspace doesn't
> change the owner. This is a crucial difference, PI depends on there
> being a distinct owner. That said, allowing the lock to be held across
> userspace still breaks PI in that it completely wrecks the ability to
> analyze the critical section.

Thinking about this from a PI point of view, the problem is not that we
returned to userspace still holding the lock, it's that boosting this
process's priority will not help release the lock faster because this
process no longer owns the lock.

If we had a lock owner handoff API (ie I can donate my lock to another
owner), that would solve this problem.  We'd want to have special owners
to denote "RCU" "bottom halves" or "irq" so we know what we can do about
PI.  I don't think we need a "I have stolen this lock from somebody else"
API, but maybe I'm wrong there.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-16 14:00 ` Jan Kara
@ 2020-02-03 17:44   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-02-03 17:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Thu, Jan 16, 2020 at 03:00:04PM +0100, Jan Kara wrote:
> I'd like to note that using i_dio_count has also one advantage you didn't
> mention. For AIO case, if you need to hold i_rwsem in exclusive mode,
> holding the i_rwsem just for submission part is a significant performance
> advantage (shorter lock hold times allow for higher IO parallelism). I
> guess this could be mitigated by downgrading the lock to shared mode
> once the IO is submitted. But there will be still some degradation visible
> for the cases of mixed exclusive and shared acquisitions because shared
> holders will be blocking exclusive ones for longer time.
> 
> This may be especially painful for filesystems that don't implement DIO
> overwrites with i_rwsem in shared mode...

True.  Fortunately there are patches for ext4 out to move over to that
scheme.  gfs2 will need a little more attention, but that also for other
reasons.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-01-18  9:28 ` Dave Chinner
@ 2020-02-03 17:46   ` Christoph Hellwig
  2020-02-03 23:02     ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2020-02-03 17:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	Andrew Morton, linux-ext4, cluster-devel, linux-kernel, linux-mm

On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote:
> I think it's pretty gross, actually. It  makes the same mistake made
> with locking in the old direct IO code - it encodes specific lock
> operations via flags into random locations in the DIO path. This is
> a very slippery slope, and IMO it is an layering violation to encode
> specific filesystem locking smeantics into a layer that is supposed
> to be generic and completely filesystem agnostic. i.e.  this
> mechanism breaks if a filesystem moves to a different type of lock
> (e.g. range locks), and history teaches us that we'll end up making
> a horrible, unmaintainable mess to support different locking
> mechanisms and contexts.
> 
> I think that we should be moving to a model where the filesystem
> provides an unlock method in the iomap operations structure, and if
> the method is present in iomap_dio_complete() it gets called for the
> filesystem to unlock the inode at the appropriate point. This also
> allows the filesystem to provide a different method for read or
> write unlock, depending on what type of lock it held at submission.
> This gets rid of the need for the iomap code to know what type of
> lock the caller holds, too.

I'd rather avoid yet another method.  But I think with a little
tweaking we can move the unlock into the ->end_io method.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: RFC: hold i_rwsem until aio completes
  2020-02-03 17:46   ` Christoph Hellwig
@ 2020-02-03 23:02     ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2020-02-03 23:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, linux-kernel, linux-mm

On Mon, Feb 03, 2020 at 06:46:41PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote:
> > I think it's pretty gross, actually. It  makes the same mistake made
> > with locking in the old direct IO code - it encodes specific lock
> > operations via flags into random locations in the DIO path. This is
> > a very slippery slope, and IMO it is an layering violation to encode
> > specific filesystem locking smeantics into a layer that is supposed
> > to be generic and completely filesystem agnostic. i.e.  this
> > mechanism breaks if a filesystem moves to a different type of lock
> > (e.g. range locks), and history teaches us that we'll end up making
> > a horrible, unmaintainable mess to support different locking
> > mechanisms and contexts.
> > 
> > I think that we should be moving to a model where the filesystem
> > provides an unlock method in the iomap operations structure, and if
> > the method is present in iomap_dio_complete() it gets called for the
> > filesystem to unlock the inode at the appropriate point. This also
> > allows the filesystem to provide a different method for read or
> > write unlock, depending on what type of lock it held at submission.
> > This gets rid of the need for the iomap code to know what type of
> > lock the caller holds, too.
> 
> I'd rather avoid yet another method.  But I think with a little
> tweaking we can move the unlock into the ->end_io method.

That would work, too :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Cluster-devel] [PATCH 05/12] gfs2: fix O_SYNC write handling
  2020-01-14 16:12 ` [PATCH 05/12] gfs2: fix O_SYNC write handling Christoph Hellwig
@ 2020-02-06 15:31   ` " Andreas Gruenbacher
  0 siblings, 0 replies; 39+ messages in thread
From: Andreas Gruenbacher @ 2020-02-06 15:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, Andrew Morton,
	linux-ext4, cluster-devel, Linux-MM, LKML

Hi Christoph,

thanks for this patch, and sorry for taking so long to react.

On Tue, Jan 14, 2020 at 5:54 PM Christoph Hellwig <hch@lst.de> wrote:
> Don't ignore the return value from generic_write_sync for the direct to
> buffered I/O callback case when written is non-zero.  Also don't bother
> to call generic_write_sync for the pure direct I/O case, as iomap_dio_rw
> already takes care of that.

I like the idea, but the patch as is doesn't quite work: iomap_dio_rw
already bumps iocb->ki_pos, so we end up with the wrong value by
adding the (direct + buffered) write size again.
We'd probably also be better served by replacing
filemap_write_and_wait_range with generic_write_sync + IOCB_DSYNC in
the buffered fallback case. I'll send an update that you'll hopefully
like.

Andreas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/12] mm: fix a comment in sys_swapon
  2020-01-14 16:12 ` [PATCH 01/12] mm: fix a comment in sys_swapon Christoph Hellwig
@ 2020-02-10 23:29   ` Andrew Morton
  2020-02-12  7:37     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Morton @ 2020-02-10 23:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, Waiman Long, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Will Deacon, linux-ext4,
	cluster-devel, linux-kernel, linux-mm, Naohiro Aota

On Tue, 14 Jan 2020 17:12:14 +0100 Christoph Hellwig <hch@lst.de> wrote:

> claim_swapfile now always takes i_rwsem.
> 
> ...
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3157,7 +3157,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  	mapping = swap_file->f_mapping;
>  	inode = mapping->host;
>  
> -	/* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
> +	/* will take i_rwsem; */
>  	error = claim_swapfile(p, inode);
>  	if (unlikely(error))
>  		goto bad_swap;

http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
removes this comment altogether.  Please check that this is OK?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/12] mm: fix a comment in sys_swapon
  2020-02-10 23:29   ` Andrew Morton
@ 2020-02-12  7:37     ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2020-02-12  7:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, Waiman Long,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Will Deacon,
	linux-ext4, cluster-devel, linux-kernel, linux-mm, Naohiro Aota

On Mon, Feb 10, 2020 at 03:29:42PM -0800, Andrew Morton wrote:
> On Tue, 14 Jan 2020 17:12:14 +0100 Christoph Hellwig <hch@lst.de> wrote:
> 
> > claim_swapfile now always takes i_rwsem.
> > 
> > ...
> >
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3157,7 +3157,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> >  	mapping = swap_file->f_mapping;
> >  	inode = mapping->host;
> >  
> > -	/* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
> > +	/* will take i_rwsem; */
> >  	error = claim_swapfile(p, inode);
> >  	if (unlikely(error))
> >  		goto bad_swap;
> 
> http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
> removes this comment altogether.  Please check that this is OK?

Killing it is fine with me.  Just the fact that the comment was wrong
while I did an audit of the area really thew me off.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, back to index

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-14 16:12 RFC: hold i_rwsem until aio completes Christoph Hellwig
2020-01-14 16:12 ` [PATCH 01/12] mm: fix a comment in sys_swapon Christoph Hellwig
2020-02-10 23:29   ` Andrew Morton
2020-02-12  7:37     ` Christoph Hellwig
2020-01-14 16:12 ` [PATCH 02/12] locking/rwsem: Exit early when held by an anonymous owner Christoph Hellwig
2020-01-14 18:17   ` Waiman Long
2020-01-14 18:25     ` Christoph Hellwig
2020-01-14 18:33       ` Waiman Long
2020-01-14 18:55       ` Waiman Long
2020-01-14 16:12 ` [PATCH 03/12] xfs: fix IOCB_NOWAIT handling in xfs_file_dio_aio_read Christoph Hellwig
2020-01-14 16:12 ` [PATCH 04/12] gfs2: move setting current->backing_dev_info Christoph Hellwig
2020-01-14 16:12 ` [PATCH 05/12] gfs2: fix O_SYNC write handling Christoph Hellwig
2020-02-06 15:31   ` [Cluster-devel] " Andreas Gruenbacher
2020-01-14 16:12 ` [PATCH 06/12] iomap: pass a flags value to iomap_dio_rw Christoph Hellwig
2020-01-14 16:12 ` [PATCH 07/12] iomap: allow holding i_rwsem until aio completion Christoph Hellwig
2020-01-14 16:12 ` [PATCH 08/12] ext4: hold i_rwsem until AIO completes Christoph Hellwig
2020-01-14 21:50   ` Theodore Y. Ts'o
2020-01-15  6:48     ` Christoph Hellwig
2020-01-14 16:12 ` [PATCH 09/12] gfs2: " Christoph Hellwig
2020-01-14 16:12 ` [PATCH 10/12] xfs: " Christoph Hellwig
2020-01-14 16:12 ` [PATCH 11/12] xfs: don't set IOMAP_DIO_SYNCHRONOUS for unaligned I/O Christoph Hellwig
2020-01-14 16:12 ` [PATCH 12/12] iomap: remove the inode_dio_begin/end calls Christoph Hellwig
2020-01-14 18:47 ` RFC: hold i_rwsem until aio completes Matthew Wilcox
2020-01-15  6:54   ` Christoph Hellwig
2020-01-14 19:27 ` Jason Gunthorpe
2020-01-15  6:56   ` Christoph Hellwig
2020-01-15 13:24     ` Jason Gunthorpe
2020-01-15 14:33       ` Peter Zijlstra
2020-01-15 14:49         ` Jason Gunthorpe
2020-01-15 19:03           ` Waiman Long
2020-01-15 19:07             ` Christoph Hellwig
2020-01-18 22:40         ` Matthew Wilcox
2020-01-15 15:36       ` Christoph Hellwig
2020-01-15 16:26         ` Jason Gunthorpe
2020-01-16 14:00 ` Jan Kara
2020-02-03 17:44   ` Christoph Hellwig
2020-01-18  9:28 ` Dave Chinner
2020-02-03 17:46   ` Christoph Hellwig
2020-02-03 23:02     ` Dave Chinner

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git